
Presenters
Nick Rizzolo, Mark Sammons, James Clarke, and Vivek Srikumar
Many of the technologies we rely on in our everyday lives depend on the ability to automatically handle natural language. Search engines determine the relevance of documents with respect to keywords. Spam detectors filter email messages based on their content. Machine translation systems convert text from one natural language to another. All of these systems use machine learning to exploit the information in large datasets and improve their performance with experience. In this tutorial, we introduce our suite of state-of-the-art NLP tools, focusing on our modeling language for general learning-based programs, the aptly named Learning Based Java (LBJ).

As motivation, consider the following application, which you will be able to implement by the end of the tutorial. Suppose you have a news feed through which you receive articles from the full spectrum of news sections: world news, politics, health, finance, sports, etc. Perhaps you'd like to filter these articles based on the people who appear in them and what those people are famous for. For example, they may be politicians, athletes, or corporate moguls. While a given type of famous person does tend to appear most commonly in a single news section, you'd like to see all news involving those types of people no matter the section. How can your news feed software automatically determine what a given person in the news is famous for?
Part I: Learning Classifiers from Data with LBJ
[ slides ]
A classifier is simply a function that takes some object as input
and produces a discrete output, classifying the object into one of a set of
categories. In a traditional programming language, functions such as these
must be hard coded entirely in the syntax of the language. LBJ, on the other
hand, allows the partial specification of a classifier whose definition can
only be completed via interaction with data. When paired with different data,
the same LBJ code results in a different classifier. To illustrate, we'll look
at the well-known "20 Newsgroups" dataset, language identification, and spam
detection.
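To make the idea of one partial specification yielding different classifiers concrete, here is a minimal sketch in plain Python. It is not LBJ syntax and does not use LBJ's learners; the names `features` and `train`, the multiclass perceptron, and the toy datasets are all assumptions chosen for illustration.

```python
from collections import defaultdict


def features(text):
    """Bag-of-words features: the set of lowercased tokens in the text."""
    return set(text.lower().split())


def train(examples, epochs=10):
    """Train a multiclass perceptron on (text, label) pairs.

    Returns a classifier: a function from text to one of the labels
    seen in the training data. (Illustrative only; not LBJ's algorithm.)
    """
    labels = sorted({label for _, label in examples})
    weights = {label: defaultdict(float) for label in labels}

    def classify(text):
        feats = features(text)
        # Pick the label whose weights give the highest total score.
        return max(labels, key=lambda lab: sum(weights[lab][f] for f in feats))

    for _ in range(epochs):
        for text, gold in examples:
            predicted = classify(text)
            if predicted != gold:
                # Standard perceptron update: promote gold, demote the mistake.
                for f in features(text):
                    weights[gold][f] += 1.0
                    weights[predicted][f] -= 1.0
    return classify


# Hypothetical toy datasets; real experiments would use far more data.
spam_data = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday?", "ham"),
]
language_data = [
    ("the cat is on the table", "english"),
    ("le chat est sur la table", "french"),
    ("where is the station", "english"),
    ("la gare est ici", "french"),
]

# The same training code, paired with different data, gives different classifiers.
spam_filter = train(spam_data)
language_id = train(language_data)

print(spam_filter("win a cash prize"))
print(language_id("le chat"))
```

The hard-coded part is only the feature function and the learning procedure; the decision rule itself is filled in by the data, which is the division of labor this part of the tutorial describes.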
Part II: Illinois NLP Tools
[ slides ]
The Cognitive Computation Group has developed a
suite of state-of-the-art NLP tools, many of which have online
demos so you can try them out even before downloading them. We manage the
application of these tools in experiments and NLP software using a service
called the Curator. Time permitting, we'll begin discussing these tools
during this first lecture.
Part III: Feature Engineering
Any machine learning algorithm must be told which facets (or features)
of the data to incorporate in the learned function, which then weighs the
importance of each. For example, when learning a spam detector, one option is
to use as features the appearance of each possible word in the email's body.
We may also want the learned function to consider the words in the subject
line, or the values of the various other headers, or higher-level properties
of the text. This part of the tutorial is a hands-on exploration of these
ideas, in which you'll try to improve a classifier from Part I by engineering
your own features.
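The feature choices mentioned above can be sketched as a small extraction function. This is a plain Python illustration, not LBJ code; the email representation (a dict with `subject`, `body`, and `headers` keys), the feature naming scheme, and the all-caps-subject property are assumptions made for this example.

```python
def email_features(email):
    """Map an email to a set of named features.

    `email` is assumed to be a dict with 'subject' and 'body' strings
    and an optional 'headers' dict of header name -> value.
    """
    feats = set()

    # One feature per word appearing in the body.
    for word in email["body"].lower().split():
        feats.add("body:" + word)

    # Subject-line words get their own namespace so the learner can
    # weigh them separately from body words.
    for word in email["subject"].lower().split():
        feats.add("subject:" + word)

    # One feature per (header, value) pair.
    for name, value in email.get("headers", {}).items():
        feats.add("header:" + name.lower() + "=" + value.lower())

    # A higher-level property of the text: is the subject all upper case?
    if email["subject"].isupper():
        feats.add("subject-all-caps")

    return feats


# Hypothetical example message.
example = {
    "subject": "FREE PRIZE",
    "body": "Click now",
    "headers": {"X-Mailer": "BulkSender"},
}
print(sorted(email_features(example)))
```

Prefixing each feature with its source keeps "prize" in the subject distinct from "prize" in the body, so the learned function can assign them different weights.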
Part IV: The Fame Classifier
[ slides ]
Now that we have some expertise in learning-based programming, we'll discuss
the implementation of a "fame classifier", capable of classifying newsworthy
people by what they're famous for. We'll employ our
Named Entity Recognizer to detect when a
person is mentioned in the text, as well as our
Part of Speech Tagger to help us engineer
effective features. LBJ has abstracted away the details of learning and
applying these two NLP tools so that we can focus on building the fame
classifier and its associated application.
Part V: Illinois NLP Tools
[ slides ]
The Cognitive Computation Group has developed a
suite of state-of-the-art NLP tools, many of which have online
demos so you can try them out even before downloading them. We manage the
application of these tools in experiments and NLP software using a service
called the Curator.
In this section of the tutorial, we give an overview of these tools, describing ways in which they can be used to develop better NLP applications.
Part VI: More Feature Engineering
[ slides ]
This hands-on afternoon session is another opportunity for you to apply what
you've learned to improve the fame classifier.