Doug Downey

Autonomous Web-scale Information Extraction

Search engines are extremely useful tools for answering questions. However, a significant number of questions users might pose -- for example, "which nanotechnology companies are hiring on the West Coast?" -- cannot be addressed using existing search engines, because the answers do not lie on a single page. To answer these kinds of queries, users must extract and synthesize information from multiple documents. Currently, this is a tedious and error-prone manual process.

In this talk, I will describe my research aimed at automating the extraction of this information from the Web. I will present a model of the redundancy inherent in the Web, and show that the model can be used to identify correct extractions autonomously, without the manually labeled examples typically assumed in previous information extraction research. However, the model has limited efficacy for the "long tail" of infrequently mentioned facts; I will demonstrate how unsupervised language models can be leveraged in concert with redundancy to overcome this limitation. Lastly, I will describe recent theoretical and experimental results illustrating that a generalization of the redundancy-based approach is effective for a variety of textual classification tasks, beyond information extraction.

Official inquiries about AIIS should be directed to Alexandre Klementiev (klementi AT uiuc DOT edu)
Last update: 08/30/2007