Learning Based Java


If you wish to cite this work, please use the following.

N. Rizzolo and D. Roth. Learning Based Java for Rapid Development of NLP Systems. LREC (2010).

What is LBJava?

Learning Based Java is a modeling language for the rapid development of software systems with one or more learned functions, designed for use with the Java™ programming language. LBJava offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application. With LBJava, the details of feature extraction, learning, model evaluation, and inference are all abstracted away from the programmer, leaving the programmer free to reason more directly about the application.

Introduction

Many software systems are in need of functions that are simple to describe but that no one knows how to implement. Recently, more and more designers of such systems have turned to machine learning to plug these gaps. Given data, a discriminative machine learning algorithm yields a function that classifies instances from some problem domain into one of a set of categories. For example, given an instance from the domain of email messages (i.e., given an email), we may desire a function that classifies that email as either "spam" or "not spam". Given data (in particular, a set of emails for which the correct classification is known), a machine learning algorithm can provide such a function. We call systems that utilize machine learning technology learning based programs.

Modern learning based programs often involve several learning components (or at least a single learning component applied repeatedly) whose classifications depend on each other. There are many approaches to designing such programs; here, we focus on the following approach. Given data, the various learning components are trained entirely independently of each other, each optimizing its own loss function. Then, when the learned functions are applied in the wild, the independent predictions made by each function are reconciled according to user-specified constraints. This approach has been applied successfully to complicated domains such as Semantic Role Labeling.

LBJava

Learning Based Java (LBJava) is a modeling language that expedites the development of learning based programs, designed for use with the Java™ programming language. The LBJava compiler accepts the programmer's classifier and constraint specifications as input, automatically generating efficient Java code and applying learning algorithms (i.e., performing training) as necessary to implement the classifiers' entire computation from raw data (e.g., text, images, etc.) to output decision (e.g., part of speech tag, type of recognized object, etc.). The details of feature extraction, learning, model evaluation, and inference (i.e., reconciling the predictions in terms of the constraints at runtime) are abstracted away from the programmer.

Under the LBJava programming philosophy, the designer of a learning based program will first design an object-oriented internal representation (IR) of the application's raw data using pure Java. For example, if we wish to write software dealing with emails, then we may wish to define a Java class named Email. An LBJava source file then allows the programmer to define classifiers that take Emails as input. A classifier is merely any method that produces one or more discrete or real valued classifications when given a single object from the programmer's IR. It might be hard-coded using explicit Java code (usually for use as a feature extractor), or learned from data (e.g., labeled example Emails) using other classifiers as feature extractors.
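
For instance, a minimal sketch of this workflow might look like the following. The Email class, its fields, and the two classifiers shown here are purely illustrative and are not part of any LBJava distribution; only the <- syntax and the return-a-String convention are taken from the examples later on this page.

// Pure Java (Email.java): the application's internal representation of raw data.
public class Email
{
  private String subject, body;
  private String label;   // "spam" or "not spam"; known only for training data

  public Email(String subject, String body, String label) {
    this.subject = subject;
    this.body = body;
    this.label = label;
  }

  public String getSubject() { return subject; }
  public String getBody() { return body; }
  public String getLabel() { return label; }
}

// LBJava (in an .lbj source file): classifiers written directly in terms of Email.
/** A hard-coded feature extractor; returns "true" or "false" as a String. */
discrete SubjectMentionsMoney(Email email) <-
  { return "" + email.getSubject().toLowerCase().contains("money"); }

/** The label classifier used to supply answers during training. */
discrete SpamLabel(Email email) <- { return email.getLabel(); }

A learned spam classifier could then be declared with a learn ... end expression over labeled Email objects, exactly as in the newsgroup example below.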

Feature extraction and learning typically produce several different intermediate representations of the data they process. The LBJava compiler automates these processes and manages all of their intermediate representations. An LBJava source file also acts as a Makefile of sorts. When you make a change to your LBJava source file, LBJava knows which operations need to be repeated. For example, when you change the code in a hard-coded classifier, only those learned classifiers that use it as a feature will be retrained. When you change only a learning algorithm parameter, LBJava skips feature extraction and goes straight to learning.

LBJava is supported by a library of interfaces and classes that implement standardized functionality for features and classifiers. The library includes learning and inference algorithm implementations, general purpose and domain specific internal representations, and domain specific parsers.
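
For example, a classifier generated by the LBJava compiler can be driven from plain Java through the library's Classifier interface, with data fed to it by a parser. The sketch below reuses the NewsgroupParser and NewsgroupClassifierAP classes from the 20 Newsgroups example later on this page, and it assumes the library's Parser interface exposes next() and close() and returns null when the data is exhausted; treat it as an illustration rather than a definitive API reference.

import LBJ2.classify.Classifier;
import LBJ2.parse.Parser;

public class LabelEverything
{
  public static void main(String[] args) {
    // Iterate over the objects produced by an LBJ parser and print the
    // discrete prediction the learned classifier makes for each one.
    Parser parser = new NewsgroupParser("data/20news.test");
    Classifier classifier = new NewsgroupClassifierAP();

    for (Object example = parser.next(); example != null; example = parser.next())
      System.out.println(classifier.discreteValue(example));

    parser.close();
  }
}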

Feature Library

LBJava makes it easy to develop and use classifiers as features. In addition to the simple, hard-coded classifiers that come packaged with LBJava (see the online Javadoc), a constantly growing suite of learned classifiers is available. Simply import them into your LBJava or Java source code, and call them just like methods (see below for examples).

Illinois Part of Speech Tagger

   

This is an implementation of our SNoW-based POS tagger for use with LBJava.

Illinois Chunker

   

A classifier that partitions plain text into sequences of semantically related words, indicating a shallow (i.e., non-hierarchical) phrase structure.

Illinois Coreference

   

A Coreference Resolver, based on LBJava, trained on the ACE 2004 corpus.

Illinois Named Entity Tagger

   

This is a state-of-the-art NE tagger that tags plain text with named entities (people / organizations / locations / miscellaneous). It uses gazetteers extracted from Wikipedia, word class models derived from unlabeled text, and expressive non-local features. The best performance is 90.8 F1 on the CoNLL03 shared task data.



Example: 20 Newsgroups

Suppose we want to classify newsgroup posts according to the newsgroup to which each post is best suited. It is plausible that these classifications could be made as a function of the words that appear in them. For example, the word "motor" is likely to appear more often in rec.autos or rec.motorcycles than in alt.atheism. However, we do not want to come up with these associations on our own one at a time, so we turn to LBJava.

To use LBJava, we first need to decide on an object-oriented internal representation. In this case, it makes sense to define a class named Post that stores the contents of a newsgroup post. We'll also need to implement a parser that knows how to create Post objects when given the raw data in a file or files. Then, in the LBJava source file, we can define a hard-coded classifier that identifies which words appear in each post and a learning classifier that categorizes each post based on those words.
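
As a rough sketch, the Post class might expose an interface like the one below. Only the four accessor methods are actually assumed by the LBJava code later on this page; the real Post and NewsgroupParser classes in the example's source distribution may be organized differently.

import LBJ2.nlp.seg.Token;

// A skeletal internal representation of a newsgroup post.  The body is stored
// as lines of tokens so that feature extractors can walk it word by word.
public class Post
{
  private String newsgroup;   // the label; read from the file's location or headers
  private Token[][] body;     // body[line][word]

  public String getNewsgroup() { return newsgroup; }
  public int bodySize() { return body.length; }
  public int lineSize(int line) { return body[line].length; }
  public Token getBodyWord(int line, int word) { return body[line][word]; }
}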

The following example source code can be trained on the famous 20 Newsgroups corpus. It involves a single feature extraction classifier named BagOfWords, a label classifier named NewsgroupLabel that provides labels during training, and a multi-class learned classifier named NewsgroupClassifierAP that predicts a newsgroup label. It also assumes that the Post class and the parser NewsgroupParser that instantiates Post objects have already been implemented in separate Java source files. To see the code in action, download the source distribution of this example, which includes the data and all the classes mentioned above, and run ./train.sh (assuming that LBJava is already on your CLASSPATH). See also the next example on this web page for an explanation of the parameter tuning syntax.

   
import LBJ2.nlp.seg.Token;

/**
  * This feature generating classifier "senses" all the words in the document
  * that begin with an alphabet letter.  The result is a bag-of-words
  * representation of the document.
 **/
discrete% BagOfWords(Post post) <- {
  for (int i = 0; i < post.bodySize(); ++i)
    for (int j = 0; j < post.lineSize(i); ++j) {
      Token word = post.getBodyWord(i, j);
      String form = word.form;
      if (form.length() > 0 && form.substring(0, 1).matches("[A-Za-z]"))
        sense form;
    }
}

/** The label of the document. */
discrete NewsgroupLabel(Post post) <- { return post.getNewsgroup(); }

/**
  * Here, we train an averaged Perceptron for many rounds over the training data.
  * The number of rounds and separator thickness were tuned using cross
  * validation, and their best values now appear in the source code.
 **/
discrete NewsgroupClassifierAP(Post post) <-
learn NewsgroupLabel
  using BagOfWords
  from new NewsgroupParser("data/20news.train.shuffled")
    40 rounds
    //{{ 5, 10, 20, 30, 40 }} rounds
  with SparseNetworkLearner {
    SparseAveragedPerceptron.Parameters p =
      new SparseAveragedPerceptron.Parameters();
    p.learningRate = .1;
    p.thickness = 3;
    //p.thickness = {{ 1 -> 3 : 0.5 }};
    baseLTU = new SparseAveragedPerceptron(p);
  }

  //cval 5 "random"
  progressOutput 20000
  testFrom new NewsgroupParser("data/20news.test")
end


Example: Parameter Tuning

In a learning classifier expression, anywhere you would otherwise write a constant, you can now write either of the following syntaxes to specify a set of constants:

   
{{ 1, 2, 3, 4, 5 }}

{{ 1 -> 5 : 1 }}

Both of the examples above specify the same set, namely {1, 2, 3, 4, 5}; the second form gives a range as start -> end : increment. So, let's say you write the following learning classifier expression:

   
learn SomethingInteresting
  using Features
  with new SparsePerceptron({{ 0.1 -> 0.5 : 0.1 }}, 0, {{ 0 -> 4 : 0.5 }})
  from
    new DataParser("data")
    {{ 5, 20, 50 }} rounds
  cval 5 "random"
end

This says we'd like to try learning rates of .1, .2, .3, .4, and .5; thick separators with thicknesses of 0, .5, 1, 1.5, 2, 2.5, 3, 3.5, and 4; and either 5, 20, or 50 rounds of training. Since we gave the cval clause, five-fold cross validation will be executed on every combination of these parameter values (5 × 9 × 3 = 135 combinations in all). The best performing parameter values will be reported and used to train over the entire training set at the end.

Here's another similar example:

   
learn SomethingInteresting
  using Features
  with SparsePerceptron {
    learningRate = {{ 0.1 -> 0.5 : 0.1 }};
    thickness = {{ 0 -> 4 : 0.5 }};
  }
  from
    new TrainingParser("trainingData")
    {{ 5, 20, 50 }} rounds
  testFrom new DevelopmentParser("developmentData")
end

In this example, we used a block of code to specify parameters. Arbitrary Java is allowed in this block, and the parameter set notation can appear anywhere a constant would otherwise be allowed.

Also note the testFrom clause, which is an alternative to cross validation in this context. When it appears (and the cval clause does not), it supplies data from a development set on which each combination of parameter values is evaluated. Again, the best performing parameter values are reported and used to train on the training set at the end. When using parameter set syntax, either cval or testFrom must also appear.

Example: Importing an External Classifier

The code below is the same as our original 20 newsgroups code, with a small modification. We have now decided to keep a separate bag of words for every part of speech encountered. So, there will be a bag of words for nouns, a separate bag of words for adjectives, etc. Of course, in order to determine the part of speech of each word, we need a method capable of computing it. Fortunately, the Illinois Part of Speech Tagger has been learned with LBJava and is available for download. The added or modified lines of code below are the POSTagger import, the POS tag lookup and the two-field sense statement inside BagOfWordsByPOS, and the using clause of the learning classifier.

   
import LBJ2.nlp.seg.Token;
import edu.illinois.cs.cogcomp.LBJ.pos.POSTagger; 

/**
  * This feature generating classifier is similar to BagOfWords (defined
  * above) except that it "senses" a separate bag-of-words for each possible
  * POS tag.
 **/
discrete% BagOfWordsByPOS(Post post) <- { 
  for (int i = 0; i < post.bodySize(); ++i)
    for (int j = 0; j < post.lineSize(i); ++j) {
      Token word = post.getBodyWord(i, j);
      String form = word.form;
      String tag = POSTagger(word); 
      if (form.length() > 0 && form.substring(0, 1).matches("[A-Za-z]"))
        sense tag : form; 
    }
}

/** The label of the document. */
discrete NewsgroupLabel(Post post) <- { return post.getNewsgroup(); }

/**
  * Here, we train an averaged Perceptron for many rounds over the training data.
  * The number of rounds and separator thickness were tuned using cross
  * validation, and their best values now appear in the source code.
 **/
discrete NewsgroupClassifierAP(Post post) <-
learn NewsgroupLabel
  using BagOfWordsByPOS 
  from new NewsgroupParser("data/20news.train.shuffled")
    40 rounds
    //{{ 5, 10, 20, 30, 40 }} rounds
  with SparseNetworkLearner {
    SparseAveragedPerceptron.Parameters p =
      new SparseAveragedPerceptron.Parameters();
    p.learningRate = .1;
    p.thickness = 3;
    //p.thickness = {{ 1 -> 3 : 0.5 }};
    baseLTU = new SparseAveragedPerceptron(p);
  }

  //cval 5 "random"
  progressOutput 20000
  testFrom new NewsgroupParser("data/20news.test")
end

Notice that once the POSTagger classifier has been imported, it can be called as if it were a method returning a String, just like any classifier defined in the source file. Also notice the new form of the "sense" statement. It now has two fields separated by a colon. Semantically, the value of the expression on the left of the colon serves as the name of the sensed feature, and the value of the expression on the right serves as its value. So, if the tagger returns "NN" for the word "motor", the statement generates a feature named "NN" whose value is "motor".

Unfortunately, these new features don't improve the performance of our newsgroup classifier. But the ability to import and use previously learned classifiers will definitely come in handy.

Example: Using the Newsgroup Classifier

In the previous example, we saw how easy it is to import a previously learned classifier and use it for feature extraction. In this example, we'll see how to import our newsgroup classifier into a Java application. Below, we see the Java source code of a simple program that takes file names as input and produces their newsgroup classifications as output. Of course, in order to do this, it needs to invoke the classifier we learned above.

That classifier, NewsgroupClassifierAP, has been translated into a Java class by the LBJava compiler. However, it still expects to be applied to Post objects. So, for each filename on the command line, we create such an object and pass it to our classifier's discreteValue(Object) method. Every classifier whose return type is discrete in its LBJava source file has such a method in its Java translation. The return value is a String containing the predicted label.

   
import LBJ2.classify.Classifier;

public class NewsgroupPrediction
{
  public static void main(String[] args) {
    if (args.length == 0) {
      System.err.println(
          "usage: java NewsgroupPrediction <file> [<file> ...]");
      System.exit(1);
    }

    Classifier classifier = new NewsgroupClassifierAP();

    for (String file : args) {
      Post post = new Post(file);
      System.out.println(file + ": " + classifier.discreteValue(post));
    }
  }
}

The program above is also included in the source distribution of this 20 newsgroups example. After train.sh completes successfully, check the README for an example command line that runs the program.
