
Data

Some of the resources below are copyrighted (marked with a '+') and require a password to access. To use these resources, you must be a member of the LDC or have purchased a license for the relevant resource. If you are a student at the University of Illinois at Urbana-Champaign and need to use these resources, you may be covered by an existing license; check with your professor and then email us to get access.
IT IS ILLEGAL TO SHARE A COPYRIGHTED LDC RESOURCE WITH PEOPLE OR ORGANIZATIONS WHO DO NOT HAVE EITHER AN LDC MEMBERSHIP OR A LICENSE TO USE THE COPYRIGHTED RESOURCE.

For large corpora, the links below take you to a directory structure; to download the data, you can use wget.
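As a rough illustration of the wget approach mentioned above, the sketch below assembles a recursive wget command line in Python. The URL, the user name, and the helper name `build_wget_command` are placeholders invented for this example, not actual paths or accounts on this site; the wget options used (`--recursive`, `--no-parent`, `--no-host-directories`, `--cut-dirs`, `--user`, `--ask-password`) are standard GNU Wget flags.

```python
import shlex
from typing import List, Optional


def build_wget_command(url: str, user: Optional[str] = None, cut_dirs: int = 0) -> List[str]:
    """Build a wget argument list that downloads one directory tree.

    Hypothetical helper for illustration; it only builds the command and
    does not contact any server.
    """
    cmd = [
        "wget",
        "--recursive",            # follow links inside the directory listing
        "--no-parent",            # never climb above the starting directory
        "--no-host-directories",  # do not create a top-level folder named after the host
    ]
    if cut_dirs > 0:
        # Drop this many leading path components from the saved file paths.
        cmd.append(f"--cut-dirs={cut_dirs}")
    if user is not None:
        # For password-protected ('+') resources: wget prompts for the password.
        cmd += [f"--user={user}", "--ask-password"]
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Placeholder URL and user name (assumptions, not real locations on this site).
    example_url = "http://example.org/corpora/some-corpus/"
    print(shlex.join(build_wget_command(example_url, user="alice", cut_dirs=1)))
```

The printed command can be pasted into a shell. Using `--ask-password` rather than `--password` keeps the password out of the shell history.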


English-Russian temporally aligned corpus [TemporalBbc.tgz]

This data consists of weakly temporally aligned reports from the BBC in English and in Russian. [TemporalBbc.tgz]

Relevant publications:

  • Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora

English-Hebrew Transliteration Corpus [hebrewEnglishAlignment.tgz]

An English-Hebrew named-entity (NE) transliteration dataset consisting of 250 NE pairs for training and another 300 NE pairs for testing. The data was collected from matching Wikipedia articles in English and Hebrew. [hebrewEnglishAlignment.tgz]

Relevant publications:

  • Active Sample Selection for Named Entity Transliteration. D. Goldwasser and D. Roth. Proceedings of ACL-08: HLT, Short Papers - 2008
  • Transliteration as Constrained Optimization. D. Goldwasser and D. Roth. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) - 2008
  • Unsupervised Constraint Driven Learning For Transliteration Discovery. M. Chang and D. Goldwasser and D. Roth and Y. Tu. Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) - 2009
  • Learning Better Transliterations. J. Pasternack and D. Roth. The 18th ACM Conference on Information and Knowledge Management (CIKM) - 2009
  • Discriminative Learning over Constrained Latent Representations. M. Chang and D. Goldwasser and D. Roth and V. Srikumar. Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL) - 2010
  • Structured Output Learning with Indirect Supervision. M. Chang and V. Srikumar and D. Goldwasser and D. Roth. Proc. of the International Conference on Machine Learning (ICML) - 2010

Data for Constraint-Driven Learning (CODL)

  •  Citations and advertisements datasets

Wikification Data for the "Local and Global Disambiguation to Wikipedia" ACL 2011 paper

  •  AQUAINT, MSNBC, ACE, Wikipedia articles linked to Wikipedia, as well as system output files.


Corpora and Preprocessed Data for a number of NLP tasks

  • +POS tagging
  • +Gazetteers for NER
  • +Noun Phrase Identification
  • +Prepositional Phrase Attachment
  • +Subject/Verb Identification
  • +Verb/Object Identification
  • +Chunking
  • Context Sensitive Spelling Correction
  • Data for Entity and Relation Recognition Experiments
  • Data for Information Extraction
  • Data for the named-entity coreference task
  • Data for the semantic role-labeling task
  • Data for NE extraction from temporally aligned corpora
  • Opinion Recognition and Multi-Perspective Question Answering (MPQA) Data
  • +Lin and Pantel's Clustering By Committee data
  • +Chinese Learner English Corpus (by kind permission of Professor Yang Huizhong): Gui, Shicun and Huizhong Yang (eds.). 2003. Zhongguo Xuexizhe Yingyu Yuliaohu. (Chinese Learner English Corpus). Shanghai Waiyu Jiaoyu Chubanshe. (In Chinese). Here is a page that talks about the CLEC corpus.

Corpora for Question Answering Task

  • TREC 8 questions
  • TREC 9 questions
  • TREC 10 questions
  • Question Classification
  • Named Entity Recognition (with HTML tags)
  • TREC 11 Answers
  • TREC 11 Answers (long)

Semantic Entailment Corpora

The entailment corpora from the three PASCAL Recognizing Textual Entailment challenges are provided here in a column format that encodes a range of annotations of the original text. The original corpora can be accessed from the NIST Text Analysis Conference RTE track web site. This file explains the column format. The PARC sentence pairs were provided separately by Xerox PARC.

  • PARC sentence pairs: plain text, text+info, column
  • RTE_1 dev
    RTE_1 test
  • RTE_2 dev
    RTE_2 test
  • RTE_3 dev
    RTE_3 test

FRACAS Entailment(-like) Data

For a more extensive set of examples testing the kinds of phenomena modeled in the PARC dataset, take a look at the FRACAS dataset provided by Bill MacCartney of the Stanford University NLP Group.


REMEDIA Story Comprehension Corpus

  • Text files
  • SRL-annotated files

Comma Resolution Data

The corpus and annotation guidelines developed for (V. Srikumar, R. Reichart, M. Sammons, A. Rappoport, and D. Roth, "Extraction of Entailed Semantic Relations Through Syntax-Based Comma Resolution", Proc. of the Annual Meeting of the ACL (2008)) can be downloaded for research use via the link below.

comma resolution data.

If you use this data, please cite the work referenced above.


Image Data

UIUC Image Database for Car Detection.

This data was used in the research described in the paper, "Learning to Detect Objects in Images via a Sparse, Part-Based Representation". The software used in this research can also be downloaded here. If you use this data or the code provided, please cite the above work.


Emotion corpus:

http://lrc.cornell.edu/swedish/dataset/affectdata/index.html


Clean Wikipedia data:

http://nlp.cs.nyu.edu/wikipedia-data/


Shared Corpora Collection

If you are UIUC faculty or working on a UIUC-supported project, you can access a number of copyright-protected corpora that are used by UIUC researchers (access via BlueStem, using your Active Directory password).


Other Corpora

Note that these will soon be merged into the shared corpora directory linked to above.

  • +British National Corpus (BNC)
  • +Penn's Treebank Corpora
  • +English-Arabic Treebank v1.0
  • +Penn Discourse Treebank v2.0
  • +MUC-7 Corpora
  • +ACE-2 Corpora
  • +ACE 2004 Multilingual Training Corpus
  • +ACE 2005 Multilingual Training Corpus
  • +Reuters 2003 corpora for NER
  • +BBN Coref corpus
  • +Joseph Turian's embedding of words in 50-dimensional space
  • +TimeBank 1.2
  • +Google N-Gram Data
  • +OntoNotes v1.0
  • +AQUAINT English News Text
  • +TRECWeb 2002 Data
  • +Three verb class models. These models can stem and lemmatize verbs and abstract them into verb groups. They were used in SRL and other applications.

Other Resources

  •  +Some incomplete entity lists
  • +A polarity lexicon of factive verbs used in the research described in a paper by Nairn, Condoravdi, and Karttunen
  • +Lin and Pantel's DIRT rules
  •  ESL error annotation
  • Taxonomic Relation Data Page

Information about other corpora that may be available to students/faculty working at UIUC
can be found here.



Copyright © 2010 | University of Illinois at Urbana-Champaign | All Rights Reserved | Contact Us