Most of the information available today is in free-form text.
Current technologies (Google, Yahoo) allow us to access text only
via keyword search.
We would like to facilitate content-based access to information. Examples include:
Achieving these tasks requires that we develop programs that can,
at some level, understand natural language. The collection of
demos below shows some of the technologies we are developing in
order to address these and related questions. Some address direct
Information Extraction tasks, and some exhibit fundamental natural
language technologies that we are developing in order to support
better access to information. The demonstrations below build on our research in Machine Learning - the fundamental research area that allows us to write programs that learn from their experience, and thus support natural language capabilities that are closer to those of humans. Feel free to enter your own text to test these demonstrations of our applications.
Hover your mouse pointer over one of the demo names below to get a short description of that demo; click on it to see a fuller description. To run the demo, click on the [Run Demo] link to the right of the demo name.
Commas can define various semantic relations between elements of a sentence. These relations are often implicit -- that is, they are not expressed by means of a verb in the sentence. Comma resolution is the task of identifying these relations and extracting them.
This demo shows our comma resolution system in action. It accepts sentences as user input and decomposes them into smaller ones using cues from the structure of the sentence. In doing so, it makes implicit relations (expressed by the commas) explicit.
Lexical paraphrasing (replacing one word with another) is an inherently context-sensitive problem because a word's meaning depends on context. Most paraphrasing work finds patterns and templates that can replace other patterns or templates in some context, but we are attempting to make decisions for a specific context. We have developed a global classifier that takes a verb v and its context (the sentence that v appears in), along with a candidate verb u, and determines whether u can replace v in the given sentence while maintaining the original meaning. The classifier makes its decision by finding other contexts that both v and u appear in, and seeing how similar these are to the given context of v. We train the classifier without supervision by utilizing a large set of local classifiers, each trained to locate paraphrases of a single word. These local classifiers then generate labeled data for the global classifier.
Spelling errors that result in valid words cannot be caught by a standard dictionary-based spell checker, and account for some 25% of all spelling errors.
Examples include: "please feel this form"; "I'd like a peace of cake", etc. Context-sensitive spelling correction has been shown to be extremely effective in learning to correct these errors, performing with an accuracy level greater than 95%. This demo allows users to input text as if they were using their own editor. The program will then suggest corrections for any errors it finds.
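The idea of choosing between confusable words based on their context can be sketched as follows. This is a minimal illustration, not the demo's actual system: the confusion sets, the context counts, and the function names `score` and `suggest_corrections` are all invented for the example, and a real system would learn its statistics from a large corpus.

```python
from collections import Counter

# Hypothetical confusion sets: valid words that are commonly mistaken for one another.
CONFUSION_SETS = [{"peace", "piece"}, {"feel", "fill"}]

# Toy context statistics (invented for illustration): how often each word
# was seen near each candidate in some imagined training corpus.
CONTEXT_COUNTS = {
    "piece": Counter({"of": 5, "cake": 4, "a": 3}),
    "peace": Counter({"world": 4, "treaty": 3, "of": 1}),
    "fill": Counter({"form": 5, "out": 4, "this": 2}),
    "feel": Counter({"good": 4, "i": 3, "this": 1}),
}


def score(candidate, context):
    """Sum the co-occurrence counts of the context words for one candidate."""
    counts = CONTEXT_COUNTS[candidate]
    return sum(counts[word] for word in context)


def suggest_corrections(sentence, window=2):
    """Return (position, original, suggestion) for each likely misuse."""
    tokens = sentence.lower().split()
    suggestions = []
    for i, token in enumerate(tokens):
        for confusion_set in CONFUSION_SETS:
            if token not in confusion_set:
                continue
            # Context = up to `window` words on each side of the target word.
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            best = max(sorted(confusion_set), key=lambda c: score(c, context))
            if best != token:
                suggestions.append((i, token, best))
    return suggestions


print(suggest_corrections("I'd like a peace of cake"))
print(suggest_corrections("please feel this form"))
```

The sketch only ever considers words that belong to a known confusion set, which is why a dictionary lookup alone cannot solve the problem: every candidate is a valid word, and only the context separates them.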
A given entity - representing a person, a location, or an organization - may be mentioned in text in multiple, ambiguous ways. Understanding natural language and supporting intelligent access to textual information requires identifying whether different entity mentions are actually referencing the same entity. The Coreference Resolution Demo processes unannotated text, detecting mentions of entities and showing which mentions are coreferential.
Dataless Classification is a learning protocol that uses world knowledge to induce classifiers without the need for any labeled data. Like humans, a dataless classifier interprets a string of words as a set of semantic concepts.
This demo shows this idea in action, allowing the user to enter arbitrary text and class labels. Without any training, the text is classified into the labels.
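A rough sketch of the idea: both the text and the label names are mapped into the same space of semantic concepts, and the text is assigned to the label whose concept vector is closest. The tiny `WORD_CONCEPTS` table below is an invented stand-in for real world knowledge, and the function names are hypothetical; this is not the demo's implementation.

```python
import math
from collections import Counter

# Toy stand-in for world knowledge (invented for illustration):
# each word maps to a few weighted semantic concepts.
WORD_CONCEPTS = {
    "goal": {"Football": 1.0, "Sport": 0.8},
    "striker": {"Football": 1.0},
    "match": {"Sport": 0.7},
    "sports": {"Sport": 1.0, "Football": 0.5},
    "election": {"Government": 1.0},
    "senate": {"Government": 1.0, "Law": 0.5},
    "vote": {"Government": 0.8},
    "politics": {"Government": 1.0, "Law": 0.3},
}


def concept_vector(text):
    """Represent a string of words as a weighted bag of concepts."""
    vector = Counter()
    for word in text.lower().split():
        for concept, weight in WORD_CONCEPTS.get(word, {}).items():
            vector[concept] += weight
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, labels):
    """Pick the label whose name is closest to the text in concept space."""
    text_vector = concept_vector(text)
    return max(labels, key=lambda label: cosine(text_vector, concept_vector(label)))


print(classify("the striker scored a goal in the match", ["sports", "politics"]))
print(classify("the senate will vote after the election", ["sports", "politics"]))
```

No labeled documents appear anywhere in the sketch: the only inputs are the text, the label names, and the background mapping from words to concepts.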
LLM computes a similarity measure over pairs of text spans.
Uses a range of metrics to compare two text spans, presenting a visual mapping.
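One simple way to compare two text spans and expose a mapping between them is to align each token of one span with its most similar token in the other and average the alignment scores. The sketch below assumes character-level string similarity (Python's `difflib`) as the word metric, which is a placeholder choice; the demo's own metrics are not specified here, and `word_similarity` and `span_similarity` are hypothetical names.

```python
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Placeholder word metric: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def span_similarity(span1, span2):
    """Align each token of span2 to its best match in span1.

    Returns the average alignment score and the token mapping,
    which could be rendered as a visual mapping between the spans.
    """
    tokens1, tokens2 = span1.split(), span2.split()
    if not tokens1 or not tokens2:
        return 0.0, []
    mapping = []
    for token in tokens2:
        best = max(tokens1, key=lambda candidate: word_similarity(candidate, token))
        mapping.append((token, best, word_similarity(best, token)))
    score = sum(s for _, _, s in mapping) / len(mapping)
    return score, mapping


score, mapping = span_similarity(
    "Joe Smith offers a generous gift", "Smith offered a gift"
)
print(round(score, 2))
for token, match, s in mapping:
    print(f"{token!r} -> {match!r} ({s:.2f})")
```

Swapping `word_similarity` for a knowledge-based metric would change the scores without changing the alignment logic.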
A basic sub-task of many natural language processing problems is the identification of words or phrases of specific types (e.g. locations, people, and organizations) in text, and is commonly called Named Entity Recognition (NER). Most successful approaches to NER require large amounts of text with Named Entities tagged by a human annotator. However, in many (especially less common) languages such resources do not exist. We demonstrate a method to automatically generate such resources from multilingual corpora (such as multilingual news streams).
Understanding natural language and supporting intelligent access to textual information require identifying whether different mentions of a name, within and across documents, represent the same entity. We demonstrate a browsing tool that incorporates some of our newly developed Machine Learning based technologies in this area. It enables users to trace different mentions of the same entity, presented in different textual forms, across documents.
Named entity recognition refers to the task of identifying what phrases in text represent names of People, what represent names of Locations, Organizations, etc. This is a fundamental task in information extraction since it allows some level of abstraction that is required to support the level of interaction people are comfortable with. This is a context sensitive task, as is shown in: Jakob Washington left to Denver to meet with John Denver who works for Washington Mutual.
The Illinois Extended Named Entity Recognizer labels eighteen predefined types of entities in plain text.
In textual inference, it is often necessary to determine when two proper nouns refer to the same entity; for example, "Bill Gates" could refer to the same entity as "Mr. Gates", but not "Mrs. Gates". Our Named Entity Similarity Metric applies sets of rules to two strings to determine whether they are likely to refer to the same underlying entity. It handles different types of entity -- people, locations, and organizations -- using appropriate resources (e.g., acronyms for companies).
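A few rules of this kind can be sketched in a handful of lines. The title lists, the toy gender lexicon, and the functions `parse_person`, `same_person`, and `is_acronym_of` below are assumptions made for illustration; they are not the rules or resources used by the actual metric.

```python
MALE_TITLES = {"mr", "sir"}
FEMALE_TITLES = {"mrs", "ms", "miss"}

# Toy gender lexicon, invented for illustration.
NAME_GENDER = {"bill": "m", "melinda": "f"}


def parse_person(name):
    """Split a person name into (first, last, gender); unknown parts are None."""
    tokens = [t.strip(".").lower() for t in name.split()]
    gender = None
    if tokens and tokens[0] in MALE_TITLES:
        gender, tokens = "m", tokens[1:]
    elif tokens and tokens[0] in FEMALE_TITLES:
        gender, tokens = "f", tokens[1:]
    first = tokens[0] if len(tokens) > 1 else None
    last = tokens[-1] if tokens else None
    if gender is None and first in NAME_GENDER:
        gender = NAME_GENDER[first]
    return first, last, gender


def same_person(a, b):
    """Rule-based check: could two name strings refer to the same person?"""
    first_a, last_a, gender_a = parse_person(a)
    first_b, last_b, gender_b = parse_person(b)
    if last_a is None or last_a != last_b:
        return False  # last names must be present and agree
    if gender_a and gender_b and gender_a != gender_b:
        return False  # known genders must not conflict
    if first_a and first_b and first_a != first_b:
        return False  # known first names must not conflict
    return True


def is_acronym_of(short, full):
    """Organization rule: does `short` spell the capitalized initials of `full`?"""
    initials = "".join(word[0] for word in full.split() if word[0].isupper())
    return short.replace(".", "").upper() == initials


print(same_person("Bill Gates", "Mr. Gates"))   # compatible
print(same_person("Bill Gates", "Mrs. Gates"))  # gender conflict
print(is_acronym_of("IBM", "International Business Machines"))
```

Each rule only rejects a match when both strings supply conflicting evidence, so a missing title or first name never rules a pair out on its own.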
Number Quantization refers to the task of recognizing the values of numbers written in text. This tool recognizes numerical entities whether they are written as words or numerals, and can support comparison of commensurate numerical types (e.g. dates).
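As a small illustration of recognizing a value whether it is written in words or in numerals, the sketch below normalizes both forms to a float so that they can be compared. It covers only simple English cardinals, and the function name `parse_number` is hypothetical; the actual tool handles far more (for example dates and other numerical types).

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def parse_number(text):
    """Return the numeric value of a number written as numerals or words."""
    # Numerals first: "1,250" -> 1250.0
    try:
        return float(text.replace(",", ""))
    except ValueError:
        pass
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {word!r}")
    return float(total + current)


print(parse_number("one thousand two hundred and fifty"))
print(parse_number("1,250") == parse_number("one thousand two hundred and fifty"))
```

Once both surface forms map to the same value, comparison between mentions reduces to ordinary numeric comparison.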
The importance of assigning each word in a sentence the part of speech (POS) that it assumes in that sentence stems from the fact that identifying POS is one of the early stages in the process performed by various natural language related processes such as speech recognition, translation, and information retrieval and extraction. See how it's done!
We demonstrate a novel and robust approach for the problem of identifying taxonomic relations between pairs of concepts. We focus on identifying relations that are essential to supporting textual inference: determining whether two concepts hold an ancestor-child relation or whether they are siblings. Our method makes use of Wikipedia as a main source for background knowledge.
Beyond the syntactic analysis of natural language sentences lies the extraction of their semantic information. Semantic role labeling is one such task: it identifies the verb and argument structure in natural language sentences, and is an important step toward natural language understanding.
Enabling a machine to respond to natural language input demands that the machine be equipped with the capacity to identify syntactic phrases in sentences. It is virtually impossible to manually write a comprehensive set of rules that accurately defines the appropriate solution to every task of this nature. However, the availability of annotated corpora (collections of text) and robust machine learning techniques makes it possible to employ machines to learn this task from training examples.
The Illinois TimeSim software processes document text, extracting and normalizing strings that represent temporal expressions. It generates an interval-based representation of each individual expression, and determines the order of each expression relative to a user-specified reference time.
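To make the interval-based representation concrete, here is a minimal sketch that assumes the temporal expression has already been normalized to a `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` string. The functions `to_interval` and `relation_to` and the three relation names are illustrative assumptions, not the software's actual interface.

```python
from datetime import datetime, timedelta


def to_interval(expression):
    """Map a normalized expression to a half-open interval [start, end)."""
    parts = [int(p) for p in expression.split("-")]
    if len(parts) == 1:  # a whole year
        year = parts[0]
        return datetime(year, 1, 1), datetime(year + 1, 1, 1)
    if len(parts) == 2:  # a whole month
        year, month = parts
        start = datetime(year, month, 1)
        end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
        return start, end
    if len(parts) == 3:  # a single day
        start = datetime(*parts)
        return start, start + timedelta(days=1)
    raise ValueError(f"unsupported expression: {expression!r}")


def relation_to(expression, reference):
    """Order an expression's interval relative to a reference time."""
    start, end = to_interval(expression)
    if end <= reference:
        return "before"
    if start > reference:
        return "after"
    return "contains"


reference = datetime(2010, 6, 15, 12, 0)
for expression in ["2009", "2010-06", "2010-06-16"]:
    print(expression, relation_to(expression, reference))
```

Representing every expression as an interval lets coarse expressions (a year) and fine ones (a day) be compared against the same reference time with the same two comparisons.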
This analysis tool annotates different syntactic and semantic information, including syntactic parse trees, named entities, semantic roles and nominal relations on raw text.
It is not hard for a human to know that the sentence "Joe Smith offers a generous gift to the university." also means "Joe Smith contributes to academia." But it is extremely hard for a machine. Being able to tackle this task will be an important step toward natural language understanding. This demonstration presents a system that aims to tackle this problem.
The Wikifier identifies important entities and concepts in text, disambiguates them and links them to Wikipedia. Wikification is an important step in helping to facilitate Information Access, in knowledge acquisition from text and in helping to inject background knowledge into NLP applications. The main decisions the Wikifier must make are: (1) which expressions to link to Wikipedia, and (2) how to disambiguate the ambiguous expressions and entities. This Wikification demo uses four types of features: (a) String matching and prevalence of entities in Wikipedia. (b) Lexical similarity between the input document and the Wikipedia pages. (c) "Semantic Similarity" between the ESA summary of the input document and Wikipedia pages. (d) How likely a set of Wikipedia pages is to be linked from a single document (we get this statistic by looking at the linkage patterns in Wikipedia).
A word similarity metric using WordNet and other resources.