This page provides access to the data and the code used in the experiments described in "Word representations: A simple and general method for semi-supervised learning" by Turian, Ratinov, and Bengio, ACL 2010.

You can download the data here: http://cogcomp.cs.illinois.edu/experiments/ACL2010_NER_Experiments.zip.

Once you unzip the file, you will see the folder NerACL2010_Experiments. Inside it you will find several other folders. One important folder is Data, which contains the Brown clusters, the C&W and HLBL embeddings, and the models learned in the different settings. The other important folder is ConfigACL2010, which contains the configuration files specifying which embeddings to use, etc. The rest of the folders are essentially copies of the same code: I duplicated the system under different names so that I could run several settings at a time.
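For orientation, the unzipped layout looks roughly like this (only the folders mentioned on this page are shown, and Cw50Dim0.3 stands in for the many per-setting copies of the system):

NerACL2010_Experiments/
  Data/            Brown clusters, C&W and HLBL embeddings, and the learned models
  ConfigACL2010/   configuration files specifying which embeddings to use
  Cw50Dim0.3/      one copy of the system (C&W embeddings, 50 dimensions, normalization factor 0.3)
  ...              further copies of the system, one per setting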

It is easy to reproduce the experiments and to analyze the results. For example, to reproduce the result for the C&W embeddings with 50-dimensional embeddings and a normalization factor of 0.3 (on this page, a normalization factor of 0.3 means the scaling E[i,j] = E[i,j] / (0.3 * max(E)); a small sketch of this scaling appears after the commands below), go to the folder Cw50Dim0.3:
$ cd Cw50Dim0.3

Typing:
$ cat results.txt
will display the learning curve and the final performance of the experiments.
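For concreteness, here is a minimal sketch of the scaling rule mentioned above. This is not the released code: the class and method names are made up for illustration, and max(E) is taken to be the largest entry of the embedding matrix.

// Illustrative sketch only -- not part of the released code.
// Applies E[i][j] = E[i][j] / (factor * max(E)), e.g. factor = 0.3
// for the Cw50Dim0.3 setting. Assumes the largest entry of E is positive.
public class EmbeddingScaling {
    public static void scale(double[][] E, double factor) {
        // Find max(E), the largest entry of the embedding matrix.
        double max = Double.NEGATIVE_INFINITY;
        for (double[] row : E)
            for (double v : row)
                max = Math.max(max, v);
        // Divide every entry by (factor * max(E)).
        for (double[] row : E)
            for (int j = 0; j < row.length; j++)
                row[j] /= factor * max;
    }
}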

If you want to reproduce the experiments from scratch, for example for Cw50Dim0.3, do the following:

$ cd Cw50Dim0.3
$ ./cleanCompile
$ nohup nice java -Xmx6g -classpath LBJ2.jar:LBJ2Library.jar:bin:stanford-ner.jar:stanford-ner.src.jar:lucene-core-2.4.1.jar ExperimentsACL2010.TrainExperimentsCoNLLDevTuningGivenConfig ../ConfigACL2010/cwRcv50DimOverall0.3.config > results.txt &

Contrary to what its name suggests, TrainExperimentsCoNLLDevTuningGivenConfig trains the system on the CoNLL03 training data while tuning on the CoNLL03 development data, and then tests it on the CoNLL test data and the out-of-domain data under the evaluation scheme described in the paper. The executable expects one parameter: a config file that specifies which embeddings to use. See the folder ConfigACL2010 for all the possible configurations; the filenames make it clear which resources are used. The executable prints the learning curve and the final performance to standard output, hence the redirection to results.txt.

If you have any problems reproducing the experiments, please email ratinov2@uiuc.edu or arie.ratinov@gmail.com.