Abstract - Data Mining || Building knowledge bases from the web

Yahoo.com

 

Abstract:

The web is a vast repository of human knowledge. Extracting structured data from web pages can enable applications like comparison shopping, and lead to improved ranking and rendering of search results. In this talk, I will describe two efforts at Yahoo! Labs to extract records from pages at web scale. The first is a wrapper induction system that handles end-to-end extraction tasks, from clustering web pages, to learning XPath extraction rules, to relearning rules when sites change. The system has been deployed in production within Yahoo! to extract more than 200 million records from ~200 web sites. The second effort exploits content redundancy on the web to automatically extract records without human supervision. Starting with a seed database, we determine values in the pages of each new site that match attribute values in the seed records. We devise a new notion of similarity for matching templatized attribute content, and an Apriori-style algorithm that exploits templatized page structure to prune spurious attribute matches.

Bio: http://research.yahoo.com/Rajeev_Rastogi

Previously, Rajeev was a Bell Labs Fellow and the founding Director of the Bell Labs Research Center in Bangalore, India. Rajeev worked at Bell Labs from 1993 until 2008. During that period, he led a number of research projects that were incorporated into Lucent products and services. These include the Datablitz main-memory database system, the Fellini multimedia storage server, and the NetInventory auto-discovery engine. His research interests include database systems, data mining, and network management. His most recent research has focused on network monitoring and security, network graph compression and analysis, and video content dissemination.

Rajeev is active in the fields of databases, data mining, and networking, and has served on the program committees of several conferences in these areas. He currently serves on the editorial board of the CACM, and has previously been an Associate Editor for IEEE Transactions on Knowledge and Data Engineering. He has published over 125 papers and filed over 70 patents, of which 40 have been issued. Rajeev received his B.Tech degree from IIT Bombay and a PhD degree in Computer Science from the University of Texas, Austin.