
Some of the resources below are copyrighted (marked with a '+') and require a password to access. To use these resources, you must be a member of the LDC or have purchased a license for the relevant resource. If you are a student at the University of Illinois at Urbana-Champaign and need to use these resources, you may be covered by an existing license; check with your professor and then email us to get access.
IT IS ILLEGAL TO SHARE A COPYRIGHTED LDC RESOURCE WITH PEOPLE OR ORGANIZATIONS WHO DO NOT HAVE EITHER AN LDC MEMBERSHIP OR A LICENSE TO USE THE COPYRIGHTED RESOURCE.
For large corpora, the links below take you to a directory structure; to download the data, you can use wget.
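Where wget is not available, a directory listing can also be walked from Python. The following is only a minimal sketch, assuming the server returns a plain HTML index page with relative links for each directory. The function names (`list_entries`, `walk`) and the example URL are hypothetical, and the authentication needed for password-protected resources is not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_entries(index_html, base_url):
    """Split a directory index into (subdirectory URLs, file URLs).

    base_url is assumed to end with '/'.
    """
    parser = LinkCollector()
    parser.feed(index_html)
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links such as "?C=N;O=D".
        if href.startswith("?"):
            continue
        url = urljoin(base_url, href)
        # Skip parent-directory links and anything outside base_url.
        if url == base_url or not url.startswith(base_url):
            continue
        (dirs if url.endswith("/") else files).append(url)
    return dirs, files


def walk(base_url):
    """Yield every file URL below base_url, depth first."""
    with urlopen(base_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    dirs, files = list_entries(html, base_url)
    yield from files
    for directory in dirs:
        yield from walk(directory)
```

Calling `walk("http://example.org/data/corpus/")` (a placeholder address) would yield the file URLs found under that directory, which can then be fetched one at a time.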
This data consists of weakly temporally aligned reports from the BBC in English and in Russian. [TemporalBbc.tgz]
Relevant publications:
An English-Hebrew named entity (NE) transliteration dataset consisting of 250 NE pairs used for training and another 300 NE pairs for testing. The data was collected from matching Wikipedia articles in English and Hebrew. [hebrewEnglishAlignment.tgz]
Relevant publications:
Description, data, and scripts for the annotation effort and pilot evaluation described in "Ask not what Textual Entailment can do for you..." (ACL 2010):
https://wiki.engr.illinois.edu/display/rtedata/Explanation+Based+Analysis+of+RTE+Data
The entailment corpora from the three PASCAL Recognizing Textual Entailment challenges are provided here in a column format that encodes a range of annotations of the original text. The original corpora can be accessed from the NIST Text Analysis Conference RTE track web site. This file explains the column format. The PARC sentence pairs were provided separately by Xerox PARC.
For a more extensive set of examples testing the kinds of phenomena modeled in the PARC dataset, take a look at the FRACAS dataset provided by Bill MacCartney of the Stanford University NLP Group.
The corpus and annotation guidelines developed for (V. Srikumar, R. Reichart, M. Sammons, A. Rappoport, and D. Roth, "Extraction of Entailed Semantic Relations Through Syntax-Based Comma Resolution", Proc. of the Annual Meeting of the ACL (2008)) can be downloaded for research use via the link below.
Comma resolution data. If you use this data, please cite the work referenced above.
UIUC Image Database for Car Detection.
This data was used in the research described in the paper, "Learning to Detect Objects in Images via a Sparse, Part-Based Representation". The software used in this research can also be downloaded here. If you use this data or the code provided, please cite the above work.
http://lrc.cornell.edu/swedish/dataset/affectdata/index.html
http://nlp.cs.nyu.edu/wikipedia-data/
Consult our COLING'10 "Knowing What to Believe (when you already know something)" paper for more details.
Templates are a primary source of structured data in Wikipedia; for example, the infobox for a city giving its size, mayor, date of founding, etc. is a template. Templates generally have well-defined fields (documented in Wikipedia, e.g. Template:Infobox Settlement) and are an excellent source of a wide variety of relations (e.g. city population in a given year). This file was created from the English Wikipedia and contains all the templates added, removed, or changed by users for all articles in the default namespace (this is the main content of the encyclopedia and excludes pages in "support" namespaces like Talk:, File:, etc.) from each article's inception until January 2012. This allows you to determine which editor asserted what relation, and when, which can be very useful in work such as information credibility (e.g. our Population and Biography datasets, above).
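To illustrate what "well-defined fields" looks like in practice, the sketch below splits the wikitext of a single template call into its name and named fields. It assumes the standard `{{Name | field = value}}` wikitext syntax; the function name `parse_template` is hypothetical and the sample infobox values are invented. The layout of the dataset file itself is not described here, so this is not a reader for that file.

```python
def parse_template(wikitext):
    """Return (template_name, {field: value}) for one {{...}} template call."""
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("expected a single {{...}} template call")
    body = text[2:-2]

    # Split on '|' only at nesting depth 0, so nested templates and
    # [[link|label]] constructs stay inside their field value.
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    name = parts[0].strip()
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # positional (unnamed) parameters are ignored in this sketch
            fields[key.strip()] = value.strip()
    return name, fields


# Invented sample data, for illustration only.
sample = """{{Infobox settlement
| name = Exampleville
| leader_name = [[Jane Doe|J. Doe]]
| population_total = 12345
| established_date = {{start date|1850|3|1}}
}}"""
name, fields = parse_template(sample)
```

Each `(field, value)` pair extracted this way corresponds to one asserted relation; pairing it with the editor and timestamp of the revision is what would let you attribute the assertion.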
If you are UIUC faculty or working on a UIUC-supported project, you can access a number of copyright-protected corpora that are used by UIUC researchers (access via BlueStem, using your Active Directory password).
Note that these will soon be merged into the shared corpora directory linked to above.
Information about other corpora that may be available to students and faculty working at UIUC can be found here.