What is LBJ?

Learning Based Java is a modeling language for the rapid development of software systems with one or more learned functions, designed for use with the Java™ programming language. LBJ offers a convenient, declarative syntax for defining classifiers and constraints directly in terms of the objects in the programmer's application. With LBJ, the details of feature extraction, learning, model evaluation, and inference are all abstracted away from the programmer, who is left free to reason more directly about the application.
Many software systems are in need of functions that are simple to describe but that no one knows how to implement. Recently, more and more designers of such systems have turned to machine learning to plug these gaps. Given data, a discriminative machine learning algorithm yields a function that classifies instances from some problem domain into one of a set of categories. For example, given an instance from the domain of email messages (i.e., given an email), we may desire a function that classifies that email as either "spam" or "not spam". Given data (in particular, a set of emails for which the correct classification is known), a machine learning algorithm can provide such a function. We call systems that utilize machine learning technology learning based programs.
Modern learning based programs often involve several learning components (or,
at least a single learning component applied repeatedly) whose classifications
are dependent on each other. There are many approaches to designing such
programs; here, we focus on the following approach. Given data, the various
learning components are trained entirely independently of each other, each
optimizing its own loss function. Then, when the learned functions are
applied in the wild, the independent predictions made by each function are
reconciled according to user-specified constraints. This approach has been applied successfully to complicated domains such as the natural language processing tasks tackled by the LBJ-based systems listed below.
Learning Based Java (LBJ) is a modeling language that expedites the
development of learning based programs, designed for use with the
Java™ programming language. The LBJ compiler accepts
the programmer's classifier and constraint specifications as input,
automatically generating efficient Java code and applying learning algorithms
(i.e., performing training) as necessary to implement the classifiers' entire
computation from raw data (e.g., text, images) to output decision (e.g., a part of speech tag, the type of a recognized object). The details of feature
extraction, learning, model evaluation, and inference (i.e., reconciling the
predictions in terms of the constraints at runtime) are abstracted away from the programmer.
Under the LBJ programming philosophy, the designer of a learning based program
will first design an object-oriented internal representation (IR) of the
application's raw data using pure Java. For example, if we wish to write
software dealing with emails, then we may wish to define a Java class named Email that represents a single email message.
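Such a class is written in pure Java, with no LBJ-specific code. A minimal sketch might look like this (the fields and accessors below are illustrative assumptions, not part of any LBJ distribution):

    // A hypothetical internal representation of an email message,
    // written in pure Java.
    public class Email {
      private String subject;
      private String body;
      private boolean spam;  // the label, when it is known

      public Email(String subject, String body, boolean spam) {
        this.subject = subject;
        this.body = body;
        this.spam = spam;
      }

      public String getSubject() { return subject; }
      public String getBody() { return body; }
      public boolean isSpam() { return spam; }
    }

LBJ classifiers are then defined directly in terms of objects like these.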
Feature extraction and learning typically produce several different intermediate representations of the data they process. The LBJ compiler automates these processes, managing all of their intermediate representations automatically. An LBJ source file also acts as a Makefile of sorts. When you make a change to your LBJ source file, LBJ knows which operations need to be repeated. For example, when you change the code in a hard-coded classifier, only those learned classifiers that use it as a feature will be retrained. When you change only a learning algorithm parameter, LBJ skips feature extraction and goes straight to learning.
LBJ is supported by a library of interfaces and classes that implement a
standardized functionality for features and classifiers. The library includes
learning and inference algorithm implementations, general purpose and domain
specific internal representations, and domain specific parsers.
Several systems have been built with LBJ, including:

- Part of Speech Tagger: an implementation of our SNoW-based POS tagger for use with LBJ.
- Chunker: a classifier that partitions plain text into sequences of semantically related words, indicating a shallow (i.e., non-hierarchical) phrase structure.
- Coreference Resolver: a coreference resolver, based on LBJ, trained on the ACE 2004 corpus.
- Named Entity Recognizer: a state of the art NE tagger that tags plain text with named entities (people / organizations / locations / miscellaneous). It uses gazetteers extracted from Wikipedia, word class models derived from unlabeled text, and expressive non-local features. Its best performance is 90.8 F1 on the CoNLL-2003 shared task data.
Suppose we want to classify newsgroup posts according to the newsgroup to
which each post is best suited. It is plausible that these classifications
could be made as a function of the words that appear in them. For example,
the word "motor" is likely to appear more often in
rec.motorcycles than in
alt.atheism. However, we do
not want to come up with these associations on our own one at a time, so we
turn to LBJ.
To use LBJ, we first need to decide on an object oriented internal
representation. In this case, it makes sense to define a class named
Post that stores the contents of a newsgroup post. We'll also
need to implement a parser that knows how to create
Post objects when given the raw data in a file or files. Then, in the LBJ source
file, we can define a hard-coded classifier that identifies which words appear
in each post and a learning classifier that categorizes each post based on those words.
The following example source code can be trained on the famous 20 Newsgroups corpus. It involves a single feature extraction classifier named BagOfWords, a label classifier named NewsgroupLabel that provides labels during training, and a multi-class classifier named NewsgroupClassifierAP that predicts a newsgroup label. It also assumes that the Post class and a parser that creates Post objects have already been implemented in
separate Java source files. To see the code in action, download the
distribution of this example, which includes the data and all the classes
mentioned above, and run
./train.sh (assuming that LBJ is already
on your CLASSPATH). See also the next example on this web page
for an explanation of the parameter tuning syntax.
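Here is a sketch of such an LBJ source file. It follows the structure just described; the parser name (NewsReader), the data file paths, the Post accessor methods, and the learning algorithm settings are illustrative assumptions rather than the exact contents of the distribution:

    // Generates one feature per word appearing in the post's body.
    discrete% BagOfWords(Post post) <- {
      for (int i = 0; i < post.bodySize(); ++i)
        for (int j = 0; j < post.lineSize(i); ++j) {
          String word = post.getBodyWord(i, j);
          if (word.length() > 0 && word.substring(0, 1).matches("[A-Za-z]"))
            sense word;
        }
    }

    // Supplies the true newsgroup label during training.
    discrete NewsgroupLabel(Post post) <- { return post.getNewsgroup(); }

    // A learned, multi-class classifier predicting the newsgroup.
    discrete NewsgroupClassifierAP(Post post) <-
    learn NewsgroupLabel
      using BagOfWords
      from new NewsReader("data/20news.train") 40 rounds
      with SparseNetworkLearner {
        SparseAveragedPerceptron.Parameters p =
          new SparseAveragedPerceptron.Parameters();
        p.learningRate = .1;
        p.thickness = 3;
        baseLTU = new SparseAveragedPerceptron(p);
      }
      testFrom new NewsReader("data/20news.test")
    end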
In a learning classifier expression, where you used to write a constant, you can now write either of the following syntaxes specifying a set of constants:
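In sketch form (the double-brace notation below is our reading of LBJ's parameter set syntax), the two forms are an explicit set of values and a range with a step size:

    {{ .1, .2, .3 }}      // an explicit set of constants
    {{ .1 -> .5 : .1 }}   // a range: from .1 to .5 in steps of .1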
So, let's say you write the following learning classifier expression:
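For instance, it might look like the following sketch, which reuses the names from the 20 newsgroups example above (the learner, parser, and paths remain assumptions):

    discrete NewsgroupClassifierAP(Post post) <-
    learn NewsgroupLabel
      using BagOfWords
      from new NewsReader("data/20news.train") {{5, 20, 50}} rounds
      with SparseNetworkLearner {
        SparseAveragedPerceptron.Parameters p =
          new SparseAveragedPerceptron.Parameters();
        p.learningRate = {{.1 -> .5 : .1}};  // .1, .2, .3, .4, .5
        p.thickness = {{0 -> 4 : .5}};       // 0, .5, 1, ..., 4
        baseLTU = new SparseAveragedPerceptron(p);
      }
      cval 5
    end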
This says we'd like to try learning rates of .1, .2, .3, .4, and .5, thick
separators with thicknesses of 0, .5, 1, 1.5, 2, 2.5, 3, 3.5, and 4, and
either 5, 20, or 50 rounds of training. Since we supplied the cval clause, cross validation will be
executed on every combination of these parameter values. The best performing
parameter values will be reported and used to train over the entire training
set at the end.
Here's another similar example:
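A sketch of what that might look like, with the parameter sets moved inside a Java block in the with clause (names and values are again illustrative):

    discrete NewsgroupClassifierAP(Post post) <-
    learn NewsgroupLabel
      using BagOfWords
      from new NewsReader("data/20news.train") {{20, 40}} rounds
      with SparseNetworkLearner {
        // Arbitrary Java is allowed here, and parameter set notation
        // may appear anywhere a constant otherwise could.
        SparseAveragedPerceptron.Parameters p =
          new SparseAveragedPerceptron.Parameters();
        p.learningRate = {{.05, .1, .2}};
        p.thickness = {{1, 2, 3}};
        baseLTU = new SparseAveragedPerceptron(p);
      }
      testFrom new NewsReader("data/20news.dev")
    end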
In this example, we used a block of code to specify parameters. Arbitrary Java is allowed in this block, and the parameter set notation can appear anywhere a constant would otherwise be allowed.
Also note the
testFrom clause, which is
an alternative to cross validation in this context. When it appears (and the
cval clause does not), it supplies data
from a development set on which each combination of parameter values is
evaluated. Again, the best performing parameter values are reported and used
to train on the training set at the end. When the parameter set syntax is used, either cval or testFrom must also appear.
The code below is the same as our original 20 newsgroups code, with a small modification. We have now decided to keep a separate bag of words for every part of speech encountered: there will be one bag of words for nouns, a separate bag of words for adjectives, and so on. Of course, in order to determine the part of speech of each word, we need a method capable of computing it. Fortunately, the Illinois Part of Speech Tagger has been learned with LBJ and is available for download. In the sketch below, added or modified lines of code are marked with a MODIFIED comment.
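The sketch (the POSTagger import path and its calling convention are assumptions; consult the Illinois Part of Speech Tagger distribution for the real interface):

    // MODIFIED: import the previously learned POS tagging classifier.
    import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;

    discrete% BagOfWords(Post post) <- {
      for (int i = 0; i < post.bodySize(); ++i)
        for (int j = 0; j < post.lineSize(i); ++j) {
          String word = post.getBodyWord(i, j);
          if (word.length() > 0 && word.substring(0, 1).matches("[A-Za-z]"))
            // MODIFIED: the POS tag on the left of the colon names the
            // feature; the word on the right is its value. The result is
            // one bag of words per part of speech.
            sense POSTagger(word) : word;
        }
    }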
Notice that once the POSTagger classifier has been imported, it can be called as if it were a method returning a String, just like any classifier defined in the source file. Also notice the new form of the "sense" statement. It now has two fields separated by a colon. Semantically, the value of the expression on the left of the colon serves as the name of the sensed feature, and the value of the expression on the right serves as its value.
Unfortunately, these new features don't improve the performance of our
newsgroup classifier. But the ability to import and use previously learned
classifiers will definitely come in handy.
In the previous example, we saw how easy it was to import and use the newsgroup classifier for feature extraction. In this example, we'll see how to import it into a Java application. Below, we see the Java source code of a simple program that takes file names as input and produces their newsgroup classifications as output. Of course, in order to do this, it needs to invoke the classifier we learned above.
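A sketch of such a program follows. The Post constructor taking a filename is an assumption; the actual distribution may build Post objects through its parser instead:

    public class ClassifyNews {
      public static void main(String[] args) {
        // The LBJ compiler translated NewsgroupClassifierAP into a
        // plain Java class that we can instantiate directly.
        NewsgroupClassifierAP classifier = new NewsgroupClassifierAP();

        for (String filename : args) {
          // Wrap the raw file in our internal representation ...
          Post post = new Post(filename);  // assumed constructor
          // ... and ask the learned classifier for its prediction.
          System.out.println(filename + ": " + classifier.discreteValue(post));
        }
      }
    }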
Our classifier, NewsgroupClassifierAP, has been translated into
a Java class by the LBJ compiler. However, it still expects to be applied to
Post objects. So, for each filename on the command line, we
create such an object and pass it to our classifier's
discreteValue(Object) method. Every classifier whose return type is discrete in its LBJ source file has such a method in its Java
translation. The return value is a String containing the predicted classification; in this case, the name of a newsgroup.
The program above is also included in the
distribution of this 20 newsgroups example. After ./train.sh completes successfully, check the
README for an example command
line that runs the program.