
What is LBJ?
Learning Based Java is a modeling language for the rapid development of software systems with one or more learned functions, designed for use with the Java™ programming language. LBJ offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application. With LBJ, the details of feature extraction, learning, model evaluation, and inference are all abstracted away from the programmer, leaving him to reason more directly about his application.
Many software systems are in need of functions that are simple to describe but that no one knows how to implement. Recently, more and more designers of such systems have turned to machine learning to plug these gaps. Given data, a discriminative machine learning algorithm yields a function that classifies instances from some problem domain into one of a set of categories. For example, given an instance from the domain of email messages (i.e., given an email), we may desire a function that classifies that email as either "spam" or "not spam". Given data (in particular, a set of emails for which the correct classification is known), a machine learning algorithm can provide such a function. We call systems that utilize machine learning technology learning based programs.
Modern learning based programs often involve several learning components (or
at least a single learning component applied repeatedly) whose classifications
are dependent on each other. There are many approaches to designing such
programs; here, we focus on the following approach. Given data, the various
learning components are trained entirely independently of each other, each
optimizing its own loss function. Then, when the learned functions are
applied in the wild, the independent predictions made by each function are
reconciled according to user-specified constraints. This approach has been
applied successfully to complicated domains such as Semantic Role Labeling.
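As a rough illustration of this reconcile-at-prediction-time idea, the following plain Java sketch (not LBJ syntax) takes scores from two independently trained classifiers and picks the highest-scoring joint assignment that satisfies a constraint. The labels, scores, and constraint are hypothetical, and exhaustive search is used only for brevity; it is not a description of LBJ's inference algorithms.

```java
import java.util.Map;
import java.util.function.BiPredicate;

/** Sketch: reconcile two independent predictions under a user-specified constraint. */
class ConstrainedInference {
    /**
     * Returns the pair of labels with the highest total score among the
     * pairs allowed by the constraint, or null if no pair is allowed.
     */
    static String[] bestJointAssignment(Map<String, Double> scoresA,
                                        Map<String, Double> scoresB,
                                        BiPredicate<String, String> constraint) {
        String[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> a : scoresA.entrySet()) {
            for (Map.Entry<String, Double> b : scoresB.entrySet()) {
                if (!constraint.test(a.getKey(), b.getKey())) continue;
                double total = a.getValue() + b.getValue();
                if (total > bestScore) {
                    bestScore = total;
                    best = new String[] { a.getKey(), b.getKey() };
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical scores from two classifiers trained independently.
        Map<String, Double> first = Map.of("AGENT", 0.9, "PATIENT", 0.1);
        Map<String, Double> second = Map.of("AGENT", 0.6, "PATIENT", 0.4);
        // Hypothetical constraint: the two predictions may not share a label.
        String[] result = bestJointAssignment(first, second, (x, y) -> !x.equals(y));
        System.out.println(result[0] + ", " + result[1]); // prints AGENT, PATIENT
    }
}
```

Taken separately, both classifiers would predict AGENT; the constraint forces the second prediction to its next-best label.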
Learning Based Java (LBJ) is a modeling language that expedites the
development of learning based programs, designed for use with the
Java™ programming language. The LBJ compiler accepts
the programmer's classifier and constraint specifications as input,
automatically generating efficient Java code and applying learning algorithms
(i.e., performing training) as necessary to implement the classifiers' entire
computation from raw data (e.g., text, images, etc.) to output decision (e.g.,
part of speech tag, type of recognized object, etc.). The details of feature
extraction, learning, model evaluation, and inference (i.e., reconciling the
predictions in terms of the constraints at runtime) are abstracted away from
the programmer.
Under the LBJ programming philosophy, the designer of a learning based program
will first design an object-oriented internal representation (IR) of the
application's raw data using pure Java. For example, if we wish to write
software dealing with emails, then we may wish to define a Java class named
Email. An LBJ source file then allows the programmer to define
classifiers that take Emails as input. A classifier is
merely any method that produces one or more discrete or real valued
classifications when given a single object from the programmer's IR. It might
be hard-coded using explicit Java code (usually for use as a feature
extractor), or learned from data (e.g., labeled example Emails)
using other classifiers as feature extractors.
Feature extraction and learning typically produce several different intermediate representations of the data they process. The LBJ compiler automates these processes, managing all of their intermediate representations automatically. An LBJ source file also acts as a Makefile of sorts. When you make a change to your LBJ source file, LBJ knows which operations need to be repeated. For example, when you change the code in a hard-coded classifier, only those learned classifiers that use it as a feature will be retrained. When you change only a learning algorithm parameter, LBJ skips feature extraction and goes straight to learning.
LBJ is supported by a library of interfaces and classes that implement a
standardized functionality for features and classifiers. The library includes
learning and inference algorithm implementations, general purpose and domain
specific internal representations, and domain specific parsers.
Illinois Part of Speech Tagger: This is an implementation of our SNoW-based POS tagger for use with LBJ.
Illinois Chunker: A classifier that partitions plain text into sequences of semantically related words, indicating a shallow (i.e., non-hierarchical) phrase structure.
Illinois Coreference: A Coreference Resolver, based on LBJ, trained on the ACE 2004 corpus.
Illinois Named Entity Tagger: This is a state-of-the-art NE tagger that tags plain text with named entities (people / organizations / locations / miscellaneous). It uses gazetteers extracted from Wikipedia, a word class model derived from unlabeled text, and expressive non-local features. The best performance is 90.8 F1 on the CoNLL03 shared task data.
Suppose we want to classify newsgroup posts according to the newsgroup to
which each post is best suited. It is plausible that these classifications
could be made as a function of the words that appear in them. For example,
the word "motor" is likely to appear more often in rec.autos or
rec.motorcycles than in alt.atheism. However, we do
not want to come up with these associations on our own one at a time, so we
turn to LBJ.
To use LBJ, we first need to decide on an object-oriented internal
representation. In this case, it makes sense to define a class named
Post that stores the contents of a newsgroup post. We'll also
need to implement a parser that knows how to create Post
objects when given the raw data in a file or files. Then, in the LBJ source
file, we can define a hard-coded classifier that identifies which words appear
in each post and a learning classifier that categorizes each post based on
those words.
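To make the plan above concrete, here is a minimal plain Java sketch of an internal representation and of the computation a bag-of-words feature extractor performs. It is not the Post class or the BagOfWords classifier from the example distribution: the constructor, the accessor names, and the rule that keeps only tokens starting with a letter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical internal representation of one newsgroup post. */
class Post {
    private final String newsgroup;   // the known label, e.g. "rec.autos"
    private final List<String> words; // body split on whitespace

    Post(String newsgroup, String body) {
        this.newsgroup = newsgroup;
        this.words = new ArrayList<>();
        for (String token : body.split("\\s+")) {
            if (!token.isEmpty()) words.add(token);
        }
    }

    String getNewsgroup() { return newsgroup; }
    List<String> getWords() { return words; }
}

class BagOfWordsSketch {
    /** Returns the distinct words of the post that start with a letter. */
    static Set<String> bagOfWords(Post post) {
        Set<String> features = new LinkedHashSet<>();
        for (String word : post.getWords()) {
            if (Character.isLetter(word.charAt(0))) features.add(word);
        }
        return features;
    }

    public static void main(String[] args) {
        Post post = new Post("rec.autos", "the motor and the 4 wheels");
        System.out.println(bagOfWords(post)); // prints [the, motor, and, wheels]
    }
}
```

In LBJ itself, the extractor would be declared as a hard-coded classifier in the LBJ source file rather than as a static Java method.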
The following example source code can be trained on the famous
20 Newsgroups corpus. It involves a single feature extraction classifier named
BagOfWords, a label classifier to provide labels during training
named NewsgroupLabel, and a multi-class classifier that predicts
a newsgroup label named NewsgroupClassifierAP. It also assumes
that the Post class and the parser NewsgroupParser
that instantiates Post objects have already been implemented in
separate Java source files. To see the code in action, download the
source distribution of this example, which includes the data and all the classes
mentioned above, and run ./train.sh (assuming that LBJ is already
on your CLASSPATH). See also the next example on this web page
for an explanation of the parameter tuning syntax.
[LBJ source listing not preserved]
In a learning classifier expression, where you used to write a constant, you can now write either of the following syntaxes specifying a set of constants:
[Syntax listing not preserved]
So, let's say you write the following learning classifier expression:
[LBJ source listing not preserved]
This says we'd like to try learning rates of .1, .2, .3, .4, and .5, thick
separators with thicknesses of 0, .5, 1, 1.5, 2, 2.5, 3, 3.5, and 4, and
either 5, 20, or 50 rounds of training. Since we gave the
cval clause, cross validation will be
executed on every combination of these parameter values. The best performing
parameter values will be reported and used to train over the entire training
set at the end.
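The search described above can be pictured with a short plain Java sketch. It enumerates the same candidate values (5 learning rates, 9 thicknesses, 3 round counts) and keeps the combination with the highest score. The class, the method names, and the scoring function are hypothetical; in LBJ the score would come from cross validation, which this sketch replaces with a stand-in function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch: enumerate every combination of candidate parameter values. */
class ParameterGrid {
    /** Values start, start + step, ... up to and including end. */
    static List<Double> range(double start, double end, double step) {
        int count = (int) Math.round((end - start) / step) + 1;
        List<Double> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            values.add(start + i * step); // index-based to limit rounding drift
        }
        return values;
    }

    /** Cross product of learning rates, thicknesses, and round counts. */
    static List<double[]> combinations(List<Double> rates,
                                       List<Double> thicknesses,
                                       int[] rounds) {
        List<double[]> grid = new ArrayList<>();
        for (double rate : rates)
            for (double thickness : thicknesses)
                for (int r : rounds)
                    grid.add(new double[] { rate, thickness, r });
        return grid;
    }

    /** Returns the combination with the highest score, or null for an empty grid. */
    static double[] best(List<double[]> grid, ToDoubleFunction<double[]> scorer) {
        double[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double[] candidate : grid) {
            double score = scorer.applyAsDouble(candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> grid = combinations(
            range(0.1, 0.5, 0.1), range(0, 4, 0.5), new int[] { 5, 20, 50 });
        System.out.println(grid.size()); // prints 135
        // Stand-in for a cross validation score; a real score comes from data.
        double[] chosen = best(grid, c ->
            -Math.abs(c[0] - 0.3) - Math.abs(c[1] - 2.0) - Math.abs(c[2] - 20));
        System.out.println((int) chosen[2] + " rounds selected");
    }
}
```

The grid grows multiplicatively (5 x 9 x 3 = 135 combinations here), and each combination costs a full cross validation run, so wide ranges can make training considerably slower.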
Here's another similar example:
[LBJ source listing not preserved]
In this example, we used a block of code to specify parameters. Arbitrary Java is allowed in this block, and the parameter set notation can appear anywhere a constant would otherwise be allowed.
Also note the testFrom clause, which is
an alternative to cross validation in this context. When it appears (and the
cval clause does not), it supplies data
from a development set on which each combination of parameter values is
evaluated. Again, the best performing parameter values are reported and used
to train on the training set at the end. When using parameter set syntax,
either cval or
testFrom must also appear.
The code below is the same as our original 20 newsgroups code, with a small
modification. We have now decided to keep a separate bag of words for every
part of speech encountered. So, there will be a bag of words for nouns, a
separate bag of words for adjectives, etc. Of course, in order to determine
the part of speech of each word, we need a method capable of computing it.
Fortunately, the Illinois Part of Speech Tagger has been learned with LBJ and
is available for download. Added or modified lines of code are marked in the
listing.
[LBJ source listing not preserved]
Notice that once the POSTagger classifier has been imported, it can be called
as if it were a method returning a String, just like any
classifier defined in the source file. Also notice the new form of the
"sense" statement. It now has two fields separated by a colon. Semantically,
the value of the expression on the left of the colon serves as the name of the
sensed feature, and the value of the expression on the right serves as its
value.
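The effect of the two-field form can be sketched in plain Java: each word becomes a feature whose name is its part of speech and whose value is the word itself. The Feature class and the extract method below are hypothetical, and the tags are supplied as a precomputed array instead of calling the real POSTagger, whose Java interface is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: one bag of words per part of speech, as name:value features. */
class PosBagOfWords {
    /** A feature whose name is a part-of-speech tag and whose value is a word. */
    static final class Feature {
        final String name;
        final String value;

        Feature(String name, String value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() { return name + ":" + value; }
    }

    /**
     * Pairs each word with its tag; tags[i] is the (hypothetical,
     * precomputed) part of speech of words[i].
     */
    static List<Feature> extract(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("one tag per word is required");
        }
        List<Feature> features = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            features.add(new Feature(tags[i], words[i]));
        }
        return features;
    }

    public static void main(String[] args) {
        String[] words = { "fast", "motor" };
        String[] tags = { "JJ", "NN" };
        System.out.println(extract(words, tags)); // prints [JJ:fast, NN:motor]
    }
}
```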
Unfortunately, these new features don't improve the performance of our
newsgroup classifier. But the ability to import and use previously learned
classifiers will definitely come in handy.
In the previous example, we saw how easy it was to import and use a previously learned classifier for feature extraction. In this example, we'll see how to import our newsgroup classifier into a Java application. Below, we see the Java source code of a simple program that takes file names as input and produces their newsgroup classifications as output. Of course, in order to do this, it needs to invoke the classifier we learned above.
That classifier, NewsgroupClassifierAP, has been translated into
a Java class by the LBJ compiler. However, it still expects to be applied to
Post objects. So, for each filename on the command line, we
create such an object and pass it to our classifier's
discreteValue(Object) method. Every classifier whose return type
is discrete in its LBJ source file has such a method in its Java
translation. The return value is a String containing the
predicted label.
[Java source listing not preserved]
The program above is also included in the source distribution of this 20
newsgroups example. After train.sh
completes successfully, check the README for an example command
line that runs the program.
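The calling pattern described in this example can be sketched as follows. This is not the program from the source distribution: so that the sketch compiles on its own, Post and NewsgroupClassifierAP are replaced by nested stand-in classes, the Post constructor taking a file name is an assumption, and the stub always returns the same label. Only the discreteValue(Object) method returning a String is taken from the description above.

```java
/** Sketch of an application that applies a learned classifier to files. */
class NewsgroupApp {
    /** Stand-in for the example's Post class; the real constructor may differ. */
    static final class Post {
        final String fileName;

        Post(String fileName) { this.fileName = fileName; }
    }

    /** Stand-in for the Java class that the LBJ compiler generates. */
    static final class NewsgroupClassifierAP {
        /** Mirrors discreteValue(Object); this stub returns a fixed label. */
        String discreteValue(Object example) {
            if (!(example instanceof Post)) {
                throw new IllegalArgumentException("expected a Post");
            }
            return "rec.autos";
        }
    }

    /** Builds a Post for the file and formats the predicted label. */
    static String classify(NewsgroupClassifierAP classifier, String fileName) {
        return fileName + ": " + classifier.discreteValue(new Post(fileName));
    }

    public static void main(String[] args) {
        NewsgroupClassifierAP classifier = new NewsgroupClassifierAP();
        for (String fileName : args) {
            System.out.println(classify(classifier, fileName));
        }
    }
}
```

With the real generated class on the CLASSPATH, the two stand-ins would be dropped and the predicted label would depend on the contents of each file.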