
Data

Some of the resources below are copyrighted (marked with a '+') and require a password to access. To use these resources, you must be a member of the LDC or have purchased a license for the relevant resource. If you are a student at the University of Illinois at Urbana-Champaign and need to use these resources, you may be covered by an existing license; check with your professor and then email us to get access.
IT IS ILLEGAL TO SHARE A COPYRIGHTED LDC RESOURCE WITH PEOPLE OR ORGANIZATIONS WHO DO NOT HAVE EITHER AN LDC MEMBERSHIP OR A LICENSE TO USE THE COPYRIGHTED RESOURCE.

For large corpora, the links below take you to a directory structure; to download the data, you can use wget.


English-Russian temporally aligned corpus [TemporalBbc.tgz]

This data consists of weakly temporally aligned reports from the BBC in English and in Russian. [TemporalBbc.tgz]

Relevant publications:

  • Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora

English-Hebrew Transliteration Corpus [hebrewEnglishAlignment.tgz]

An English-Hebrew named-entity (NE) transliteration dataset consisting of 250 NE pairs for training and another 300 NE pairs for testing. The data was collected from matching Wikipedia articles in English and Hebrew. [hebrewEnglishAlignment.tgz]

Relevant publications:

  • Active Sample Selection for Named Entity Transliteration. D. Goldwasser and D. Roth. Proceedings of ACL-08: HLT, Short Papers - 2008
  • Transliteration as Constrained Optimization. D. Goldwasser and D. Roth. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) - 2008
  • Unsupervised Constraint Driven Learning For Transliteration Discovery. M. Chang and D. Goldwasser and D. Roth and Y. Tu. Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) - 2009
  • Learning Better Transliterations. J. Pasternack and D. Roth. The 18th ACM Conference on Information and Knowledge Management (CIKM) - 2009
  • Discriminative Learning over Constrained Latent Representations. M. Chang and D. Goldwasser and D. Roth and V. Srikumar. Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL) - 2010
  • Structured Output Learning with Indirect Supervision. M. Chang and V. Srikumar and D. Goldwasser and D. Roth. Proc. of the International Conference on Machine Learning (ICML) - 2010

Data for Constrained Conditional Models/Constraint-Driven Learning (CODL)

  •  Citations and advertisements datasets
  •  Complete CODL/CCM data and code

Wikification Data for the "Local and Global Disambiguation to Wikipedia" ACL 2011 paper

  •  AQUAINT, MSNBC, ACE, Wikipedia articles linked to Wikipedia, as well as system output files.
  •  Wikipedia Knowledge Attributes for co-ref ACE 2004 dataset

Web page data for the "Design Challenges and Misconceptions in Named Entity Recognition" CoNLL 2009 paper

Web page data


Textual Entailment: Explanation-Based Annotation of Entailment Labels

Description, data and scripts for the annotation effort/pilot evaluation described in "Ask not what Textual Entailment can do for you..." (ACL 2010):

https://wiki.engr.illinois.edu/display/rtedata/Explanation+Based+Analysis+of+RTE+Data


Corpora and Preprocessed Data for a number of NLP tasks

  • +POS tagging
  • +Gazetteers for NER
  • +Noun Phrase Identification
  • +Prepositional Phrase Attachment
  • +Subject/Verb Identification
  • +Verb/Object Identification
  • +Chunking
  • Context Sensitive Spelling Correction
  • Data for Entity and Relation Recognition Experiments
  • Data for Information Extraction
  • Data for the named-entity coreference task
  • Data for the semantic role-labeling task
  • Data for NE extraction from temporally aligned corpora
  • Opinion Recognition and Multi-Perspective Question Answering (MPQA) Data
  • +Lin and Pantel's Clustering By Committee data
  • +Chinese Learner English Corpus (by kind permission of Professor Yang Huizhong): Gui, Shicun and Huizhong Yang (eds.). 2003. Zhongguo Xuexizhe Yingyu Yuliaohu. (Chinese Learner English Corpus). Shanghai Waiyu Jiaoyu Chubanshe. (In Chinese). Here is a page that talks about the CLEC corpus.
  • Wikification 2011 ACL paper: [Data Only | Full System | Demo]
  • Wikipedia Knowledge Attributes for co-ref ACE 2004 dataset
  • Complete CODL/CCM Data and code

Corpora for Question Answering Task

  • TREC 8 questions
  • TREC 9 questions
  • TREC 10 questions
  • Question Classification
  • Named Entity Recognition (with HTML tags)
  • TREC 11 Answers
  • TREC 11 Answers (long)

Semantic Entailment Corpora

The entailment corpora from the three PASCAL Recognizing Textual Entailment challenges are provided here in a column format that encodes a range of annotations of the original text. The original corpora can be accessed from the NIST Text Analysis Conference RTE track web site. This file explains the column format. The PARC sentence pairs were provided separately by Xerox PARC.

  • PARC sentence pairs: plain text text+info column
  • RTE_1 dev
    RTE_1 test
  • RTE_2 dev
    RTE_2 test
  • RTE_3 dev
    RTE_3 test

FRACAS Entailment(-like) Data

For a more extensive set of examples testing the kinds of phenomena modeled in the PARC dataset, take a look at the FRACAS dataset provided by Bill MacCartney of the Stanford University NLP Group.


REMEDIA Story Comprehension Corpus

  • Text files
  • SRL-annotated files

Comma Resolution Data

The corpus and annotation guidelines developed for (V. Srikumar, R. Reichart, M. Sammons, A. Rappoport, and D. Roth, "Extraction of Entailed Semantic Relations Through Syntax-Based Comma Resolution", Proc. of the Annual Meeting of the ACL (2008)) can be downloaded for research use via the link below.

comma resolution data.

If you use this data, please cite the work referenced above.


Image Data

UIUC Image Database for Car Detection.

This data was used in the research described in the paper, "Learning to Detect Objects in Images via a Sparse, Part-Based Representation". The software used in this research can also be downloaded here. If you use this data or the code provided, please cite the above work.


Emotion corpus:

http://lrc.cornell.edu/swedish/dataset/affectdata/index.html


Clean Wikipedia data:

http://nlp.cs.nyu.edu/wikipedia-data/


Fact-finding Datasets

Consult our COLING'10 "Knowing What to Believe (when you already know something)" paper for more details.

  • Population dataset
    • Extracted from Wikipedia
    • Editors make claims about the population size of a city in a given year
    • Only direct edits to the relevant infobox field are included
    • Labeled data taken from US census
  • Biographies dataset
    • Extracted from Wikipedia
    • Editors make claims about the birth and death dates of people
    • All edits to the page are counted
    • Labeled data synthesized from multiple authoritative websites
  • Format:
    • Each line is one assertion with tab-separated fields:
      •  [claim subject]\t[dataset name]\t[source infobox type]\t[edit location]\t[contributor name]\t[revision ID]\t[claim data]
    • Where:
      • Edit location is "F" (direct edit to the relevant infobox field), "T" (edit somewhere else in the same template as the field of interest), or "R" (edit somewhere else on the page entirely).
        • You can ignore this if you're interested in replicating our experiments; the population dataset already omits non-F edits.
      • Population data
        • [claim subject] == name of the city
        • [claim data] == "[relevant timespan]\t[purported population]".  The timespan is (IIRC) always one year.
      • Biographies dataset
        • [claim subject] == [person] born, [person] died, [person] children, [person] spouse, [person] parents
        • [claim data] == name of the person with the given relationship to the subject, or a date range, as appropriate.  Dates are often provided as a specific day, month or year--the range reflects this.
    • Labeled data
      • The labeled data files are idiosyncratic (sorry) but can be parsed with minimal effort; please see the contents of each file.
      • In the future we may reformat these files to correspond with the assertion data's format.
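
The assertion format above can be read with a few lines of Python. This is only a minimal sketch, assuming the fields appear in exactly the order listed and that [claim data] may itself contain tabs (as in the population data). The names Assertion, parse_assertion and direct_edits are illustrative, and the sample line is made up, not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Assertion:
    subject: str          # [claim subject]
    dataset: str          # [dataset name]
    infobox_type: str     # [source infobox type]
    edit_location: str    # "F", "T" or "R"
    contributor: str      # [contributor name]
    revision_id: str      # [revision ID], kept as a string
    claim_data: List[str] # remaining tab-separated fields


def parse_assertion(line: str) -> Assertion:
    """Split one tab-separated assertion line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    # The first six fields are fixed; everything after them is claim data.
    return Assertion(*fields[:6], claim_data=fields[6:])


def direct_edits(lines: Iterable[str]) -> Iterator[Assertion]:
    """Yield only assertions made by a direct edit to the infobox field ("F")."""
    for line in lines:
        assertion = parse_assertion(line)
        if assertion.edit_location == "F":
            yield assertion


# Made-up sample line for illustration only (all values are hypothetical).
SAMPLE = "Springfield\tpopulation\tInfobox Settlement\tF\tExampleEditor\t12345\t2000\t100000"
```

For a population line, claim_data would hold the timespan followed by the purported population, so parse_assertion(SAMPLE).claim_data is ["2000", "100000"]. The labeled data files use a different layout and are not covered by this sketch.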

 

Wikipedia Templates

Templates are a primary source of structured data in Wikipedia; for example, the infobox for a city giving its size, mayor, date of founding, etc. is a template.  Templates generally have well-defined fields (documented in Wikipedia, e.g. Template:Infobox Settlement) and are an excellent source of a wide variety of relations (e.g. city population in a given year).  This file was created from the  English Wikipedia and contains all the templates added, removed, or changed by users for all articles in the default namespace (this is the main content of the encyclopedia and excludes pages in "support" namespaces like Talk:, File:, etc.) from each article's inception until January 2012.  This allows you to determine which editor asserted what relation, and when, which can be very useful in work such as information credibility (e.g. our Population and Biography datasets, above).

  • Full History of Wikipedia Article Templates (5GB).
    • The file is a 7z archive containing a 200GB XML file.  I strongly suggest decompressing it dynamically as needed, which is both faster than reading the full text from disk and also mitigates potential storage issues.
    • An example containing the first ~10MB of the XML can be found here.  The format should be mostly self-explanatory.
      • Articles consist of multiple revisions.  Each revision has a single author, who may have deleted some of the previous revision's templates and/or added new ones (a template is changed by removing the old version and adding the new version). 
      • Revisions that do not add or remove templates are omitted.  Similarly, articles that never contained any template are omitted altogether.
      • ID values are used to identify which templates were deleted, but a template does not keep the same ID for its whole history: a template which is added, then deleted, and then re-added will be given a new ID (and when it is [again] deleted, the deletion will refer to that new ID).
      • Templates have 0 or more named parameters (search the example for <NamedParameter> for an example) and 0 or more numbered parameters (search <NumberedParameter>).  Generally only the NamedParameters are interesting.
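
One way to decompress the archive dynamically, as suggested above, is to pipe the output of the 7z command-line tool into an incremental XML parser. The following is a rough sketch, not the tooling used to build the file: it assumes the 7z executable is on the PATH and that element tags carry no XML namespace. Only the NamedParameter and NumberedParameter tag names come from the description above; the function names are illustrative.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter


def open_7z_stream(archive_path):
    """Start `7z x -so`, which writes the decompressed data to stdout.

    Returns the process; read the XML bytes from its .stdout attribute.
    """
    return subprocess.Popen(["7z", "x", "-so", archive_path],
                            stdout=subprocess.PIPE)


def count_tags(stream, tags=("NamedParameter", "NumberedParameter")):
    """Count occurrences of the given tags while parsing incrementally."""
    counts = Counter()
    depth = 0
    root = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event is the document root
            depth += 1
        else:
            depth -= 1
            if elem.tag in tags:
                counts[elem.tag] += 1
            if depth == 1:
                # A direct child of the root just closed; discard it so
                # finished subtrees do not accumulate in memory.
                root.clear()
    return counts
```

A possible use, with a hypothetical file name: proc = open_7z_stream("templates.7z"), then count_tags(proc.stdout), then proc.wait(). The same loop can be adapted to inspect revisions or authors once the element names have been checked against the ~10MB example file.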

 

  • +British National Corpus (BNC)
  • +Penn's Treebank Corpora
  • +English-Arabic Treebank v1.0
  • +Penn Discourse Treebank v2.0
  • +MUC-7 Corpora
  • +ACE-2 Corpora
  • +ACE 2004 Multilingual Training Corpus
  • +ACE 2005 Multilingual Training Corpus
  • +Reuters 2003 corpora for NER
  • +BBN Coref corpus
  • +Joseph Turian's embedding of words in 50 dimensional space
  • +TimeBank 1.2
  • +Google N-Gram Data
  • +OntoNotes v1.0
  • +AQUAINT English News Text
  • +TRECWeb 2002 Data
  • +Three verb class models. These models can stem and lemmatize verbs and abstract them into verb groups. They were used in SRL and other applications.
  • +Some incomplete entity lists
  • +A polarity lexicon of factive verbs used in the research described in a paper by Nairn, Condoravdi, and Karttunen
  • +Lin and Pantel's DIRT rules
  • ESL error annotation
  • Taxonomic Relation Data Page
  • NER Web Data
  • PVC_Data

     

Shared Corpora Collection

If you are UIUC faculty or working on a UIUC-supported project, you can access a number of copyright-protected corpora that are used by UIUC researchers (access via BlueStem, using your Active Directory password).


Other Corpora

Note that these will soon be merged into the shared corpora directory linked to above.


Other Resources


Information about other corpora that may be available to students/faculty working at UIUC can be found here.



    Copyright © 2010 | University of Illinois at Urbana-Champaign | All Rights Reserved | Contact Us