Illinois Chunker



If you wish to cite this work, please use the following.

V. Punyakanok and D. Roth, The Use of Classifiers in Sequential Inference. NIPS (2001), pp. 995-1001.

A chunker (or "shallow parser") is a program that partitions plain text into sequences of semantically related words, called chunks. The type of each chunk is also computed. For example:

   
[NP Jack and Jill ] [VP went ] [ADVP up ] [NP the hill ]
[VP to fetch ] [NP a pail ] [PP of ] [NP water ] .

This task is simpler than "full parsing" (in which a parse tree indicating nested phrase structure is produced), and it was originally intended to be an aid for full parsers.

See the online Javadoc documentation.

Using the Chunker

Testing

With the chunker's class files on the CLASSPATH, you can measure its performance on test data labeled in the same format as the CoNLL 2000 corpus with the following command:

   
java edu.illinois.cs.cogcomp.lbj.chunk.ChunkTester <test data>

where <test data> is the path to the labeled test data. This simple program uses the LBJ2.nlp.seg.BIOTester class, which collects precision, recall, and F1 statistics over the segments (here, chunks) discovered by a "BIO" style classifier such as this chunker (see below for a description of the tags produced).
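To make the segment-level statistics concrete, here is a minimal sketch of exact-match scoring over BIO tags for a single sentence. It is not the BIOTester implementation; the class and method names (SegmentScorer, segments, score) are hypothetical, and it assumes the usual convention that a predicted segment is correct only if its boundaries and its type both match a gold segment.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of the chunker or of LBJava.
class SegmentScorer {
    /** Extracts segments from BIO tags as "start-end-TYPE" keys (end exclusive). */
    static Set<String> segments(String[] tags) {
        Set<String> result = new HashSet<>();
        int start = -1;
        String type = null;  // type of the currently open segment, or null
        // Run one step past the end with a sentinel "O" to close the last segment.
        for (int i = 0; i <= tags.length; i++) {
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = tag.startsWith("I-") && type != null
                    && tag.substring(2).equals(type);
            if (!continues) {
                if (type != null) {
                    result.add(start + "-" + i + "-" + type);
                }
                if (tag.equals("O")) {
                    type = null;
                } else {
                    // A B- tag (or a stray I- tag) opens a new segment here.
                    type = tag.substring(2);
                    start = i;
                }
            }
        }
        return result;
    }

    /** Returns {precision, recall, F1} under exact-match segment scoring. */
    static double[] score(String[] gold, String[] predicted) {
        Set<String> goldSegs = segments(gold);
        Set<String> predSegs = segments(predicted);
        Set<String> correct = new HashSet<>(predSegs);
        correct.retainAll(goldSegs);
        double p = predSegs.isEmpty() ? 0.0 : (double) correct.size() / predSegs.size();
        double r = goldSegs.isEmpty() ? 0.0 : (double) correct.size() / goldSegs.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }

    public static void main(String[] args) {
        String[] gold = {"B-NP", "I-NP", "B-VP", "O"};
        String[] pred = {"B-NP", "I-NP", "B-VP", "B-NP"};
        double[] s = score(gold, pred);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]);
    }
}
```

In this example the prediction contains one spurious noun phrase, so both gold segments are recovered (full recall) while only two of the three predicted segments are correct.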

If your data has chunk labels but not part of speech tags, use the same CoNLL 2000 corpus format with a single dash in place of each POS tag. These tags will then be computed automatically during feature extraction.
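As an illustration of the dash convention, the sketch below rewrites data that has only a word and a chunk label on each line into the three-column CoNLL 2000 layout (word, POS tag, chunk tag, with a blank line between sentences), putting a single dash in the POS column. The two-column input layout and the names DashPosFormatter and addDashPos are assumptions made for this example, not something the chunker provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the chunker or of LBJava.
class DashPosFormatter {
    /**
     * Converts lines of the assumed form "word chunkTag" into
     * "word - chunkTag". Blank lines (sentence boundaries) are preserved.
     */
    static List<String> addDashPos(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                out.add("");  // keep the sentence boundary
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 2) {
                throw new IllegalArgumentException("expected 'word chunkTag': " + line);
            }
            out.add(cols[0] + " - " + cols[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("He B-NP", "reckons B-VP", "", ". O");
        for (String line : addDashPos(input)) {
            System.out.println(line);
        }
    }
}
```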

Evaluating

The Illinois Chunker comes bundled with a program that takes a plain, unannotated text file as input and produces that same text with both chunk and part-of-speech annotations as output. To invoke this program, type:

   
java -Xmx512m edu.illinois.cs.cogcomp.lbj.chunk.ChunksAndPOSTags \
                <plain text file>

For more information about the ChunksAndPOSTags program, see the chunker's online documentation.

Additionally, the LBJava runtime library contains a general-purpose segmenter class that can use any word classifier returning "BIO" style tags, such as this chunker, to produce the corresponding segment annotations (without the POS tags). To invoke this program, type:

   
java LBJ2.nlp.seg.SegmentTagPlain \
         edu.illinois.cs.cogcomp.lbj.chunk.Chunker <plain text file>

For more information about the SegmentTagPlain program, see LBJava's online documentation.

Importing

This implementation uses the LBJava library's Token class to internally represent the words whose chunk tags it computes. If your Java application uses the Token class as well, you can import the chunker and use it like so:

   
// Begin Foo.java

import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
import LBJ2.nlp.seg.Token;

public class Foo
{
  ...
  void myMethod()
  {
    ...
    Chunker tagger = new Chunker();
    ...
    Token word = ...
    ...
    String tag = tagger.discreteValue(word);
    ...
  }
  ...
}

Note that if your word object does not have its partOfSpeech field filled, the LBJava POS tagger (which must be on your CLASSPATH) will be loaded automatically by the chunker to compute the tag for use as a feature.

Used as shown above, the chunker will return one of the following tags for each word:

Tag       Explanation: "The chunker predicts that the word ..."
B-ADJP    begins an adjective phrase.
I-ADJP    is inside an adjective phrase.
B-ADVP    begins an adverbial phrase.
I-ADVP    is inside an adverbial phrase.
B-CONJP   begins a conjunctive phrase.
I-CONJP   is inside a conjunctive phrase.
B-INTJ    begins an interjection.
I-INTJ    is inside an interjection.
B-LST     begins a list marker.
I-LST     is inside a list marker.
B-NP      begins a noun phrase.
I-NP      is inside a noun phrase.
B-PP      begins a prepositional phrase.
I-PP      is inside a prepositional phrase.
B-PRT     begins a particle.
I-PRT     is inside a particle.
B-SBAR    begins a subordinated clause.
I-SBAR    is inside a subordinated clause.
B-UCP     begins an unlike coordinated phrase.
I-UCP     is inside an unlike coordinated phrase.
B-VP      begins a verb phrase.
I-VP      is inside a verb phrase.
O         is outside of any chunk.
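To show how these per-word tags relate to the bracketed chunks in the example at the top of this page, here is a minimal sketch that groups words by their BIO tags. The class BioBracketer is hypothetical and works on plain string arrays rather than LBJava Token objects. It assumes an I- tag continues the open chunk only when the types match, and otherwise starts a new chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the chunker or of LBJava.
class BioBracketer {
    /** Renders words and their BIO tags as bracketed chunks. */
    static String bracket(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> pieces = new ArrayList<>();
        StringBuilder chunk = null;  // chunk being built, or null if none is open
        String chunkType = null;     // type of the open chunk, e.g. "NP"
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            String type = tag.equals("O") ? null : tag.substring(2);
            boolean continues = tag.startsWith("I-") && chunk != null
                    && type.equals(chunkType);
            // Any tag that does not continue the open chunk closes it.
            if (!continues && chunk != null) {
                pieces.add(chunk.append(" ]").toString());
                chunk = null;
                chunkType = null;
            }
            if (type == null) {
                pieces.add(words[i]);               // word outside of any chunk
            } else if (continues) {
                chunk.append(' ').append(words[i]);
            } else {
                chunk = new StringBuilder("[" + type + " " + words[i]);
                chunkType = type;
            }
        }
        if (chunk != null) {
            pieces.add(chunk.append(" ]").toString());
        }
        return String.join(" ", pieces);
    }

    public static void main(String[] args) {
        String[] words = {"Jack", "and", "Jill", "went", "up", "the", "hill", "."};
        String[] tags = {"B-NP", "I-NP", "I-NP", "B-VP", "B-ADVP", "B-NP", "I-NP", "O"};
        // Prints: [NP Jack and Jill ] [VP went ] [ADVP up ] [NP the hill ] .
        System.out.println(bracket(words, tags));
    }
}
```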
