
A chunker (or "shallow parser") is a program that partitions plain text into sequences of semantically related words and labels each sequence with its type. For example:
|
This task is simpler than "full parsing" (in which a parse tree indicating nested phrase structure is produced), and it was originally intended to be an aid for full parsers.
See the online Javadoc documentation.
Assuming the chunker's class files are on the CLASSPATH, its
performance can be tested on test data labeled in the same format as the CoNLL
2000 corpus with the following command:
|
where <test data> is the path to the labeled test data.
This very simple program uses the
LBJ2.nlp.seg.BIOTester
class, which collects precision, recall, and F1 statistics
over the segments (i.e., chunks, in this case) discovered by a "BIO"
style classifier (such as this chunker; see below for a description of the
tags produced).
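To make the idea of segment-level statistics concrete, here is a minimal sketch of how precision, recall, and F1 can be computed over chunks rather than over individual words. It is not the BIOTester implementation: the class `SegmentScorer` and its methods are hypothetical, and the sketch assumes that a predicted chunk counts as correct only when its type and both of its boundaries match a labeled chunk exactly.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of LBJ: scores BIO tag sequences at the chunk level.
class SegmentScorer {
    // Converts BIO tags into "type:start-end" strings (end index is inclusive).
    static Set<String> segments(String[] tags) {
        Set<String> result = new HashSet<>();
        String type = null; // type of the chunk currently open, or null if none
        int start = -1;
        for (int i = 0; i <= tags.length; i++) {
            // A sentinel "O" past the end closes any chunk still open.
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    result.add(type + ":" + start + "-" + (i - 1));
                }
                if (tag.equals("O")) {
                    type = null;
                } else {
                    // Assumption: a "B-X" tag, or an "I-X" tag with no matching
                    // open chunk, starts a new chunk of type X.
                    type = tag.substring(2);
                    start = i;
                }
            }
        }
        return result;
    }

    // Returns {precision, recall, F1} for one pair of tag sequences.
    static double[] score(String[] gold, String[] predicted) {
        Set<String> goldSegs = segments(gold);
        Set<String> predSegs = segments(predicted);
        Set<String> correct = new HashSet<>(predSegs);
        correct.retainAll(goldSegs); // exact matches on type and boundaries
        double p = predSegs.isEmpty() ? 0 : (double) correct.size() / predSegs.size();
        double r = goldSegs.isEmpty() ? 0 : (double) correct.size() / goldSegs.size();
        double f1 = p + r == 0 ? 0 : 2 * p * r / (p + r);
        return new double[] { p, r, f1 };
    }

    public static void main(String[] args) {
        String[] gold = { "B-NP", "I-NP", "B-VP", "B-NP", "O" };
        String[] pred = { "B-NP", "I-NP", "B-VP", "B-PP", "O" };
        double[] s = score(gold, pred);
        System.out.println("P=" + s[0] + " R=" + s[1] + " F1=" + s[2]);
    }
}
```

In this made-up example the prediction finds two of the three labeled chunks and mislabels the third, so precision, recall, and F1 all come out to two thirds, even though four of the five word-level tags are correct.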
If your data has chunk labels but not part-of-speech (POS) tags, use the same
CoNLL 2000 corpus format with a single dash in place of each POS tag. These
tags will then be computed automatically during feature extraction.
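As an illustration of the dash convention, the sketch below turns two-column `word chunk` lines into three-column lines with a dash in the POS position. It assumes the standard CoNLL 2000 layout of one whitespace-separated `word POS chunk` triple per line, with a blank line between sentences. The class `PosPlaceholder` is a hypothetical helper and is not part of the chunker distribution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the chunker distribution.
class PosPlaceholder {
    // Turns "word chunk" lines into "word - chunk" lines. Blank lines
    // (sentence boundaries) are passed through as empty strings.
    static List<String> addPlaceholders(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                out.add("");
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 2) {
                throw new IllegalArgumentException("Expected 'word chunk', got: " + line);
            }
            // The single dash stands in for the missing POS tag.
            out.add(cols[0] + " - " + cols[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("He B-NP", "reckons B-VP", "", "Yes B-INTJ");
        for (String line : addPlaceholders(input)) {
            System.out.println(line);
        }
    }
}
```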
The Illinois Chunker comes bundled with a program that takes a plain, unannotated text file as input and produces that same text with both chunk and part-of-speech annotations as output. To invoke this program, type:
|
For more information about the ChunksAndPOSTags program, see the
chunker's
online documentation.
Additionally, the LBJ runtime library contains a class that implements a general purpose segmenter which can use any word classifier that returns "BIO" style tags, such as this chunker, to produce corresponding segment annotations (but omitting the POS tags). To invoke this program, type:
|
For more information about the SegmentTagPlain program, see LBJ's
online documentation.
This implementation uses the LBJ library's
Token
class to internally represent the words whose chunk tags it computes. If your
Java application uses the Token class as well, you can import the
chunker and use it like so:
|
Note that if your word object does not have its partOfSpeech
field filled, the
LBJ POS
tagger (which must be on your CLASSPATH) will be loaded
automatically by the chunker to compute the tag for use as a feature.
Used as shown above, the chunker will return one of the following tags for each word:
| Tag | Explanation: "The chunker predicts that the word ..." |
|---|---|
| B-ADJP | begins an adjective phrase. |
| I-ADJP | is inside an adjective phrase. |
| B-ADVP | begins an adverbial phrase. |
| I-ADVP | is inside an adverbial phrase. |
| B-CONJP | begins a conjunctive phrase. |
| I-CONJP | is inside a conjunctive phrase. |
| B-INTJ | begins an interjection. |
| I-INTJ | is inside an interjection. |
| B-LST | begins a list marker. |
| I-LST | is inside a list marker. |
| B-NP | begins a noun phrase. |
| I-NP | is inside a noun phrase. |
| B-PP | begins a prepositional phrase. |
| I-PP | is inside a prepositional phrase. |
| B-PRT | begins a particle. |
| I-PRT | is inside a particle. |
| B-SBAR | begins a subordinated clause. |
| I-SBAR | is inside a subordinated clause. |
| B-UCP | begins an unlike coordinated phrase. |
| I-UCP | is inside an unlike coordinated phrase. |
| B-VP | begins a verb phrase. |
| I-VP | is inside a verb phrase. |
| O | is outside of any chunk. |
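
As a rough illustration of how these per-word tags combine into chunks, the sketch below groups a tagged sentence into bracketed segments. The class `BioBrackets` is hypothetical and not part of the chunker or of LBJ, and the words and tags are supplied by hand for illustration rather than produced by the chunker. It assumes that an I- tag with no matching open chunk starts a new chunk.

```java
// Hypothetical helper: renders BIO-tagged words as bracketed chunks.
class BioBrackets {
    static String bracket(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags differ in length");
        }
        StringBuilder sb = new StringBuilder();
        String open = null; // type of the chunk currently open, or null if none
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            boolean continues = open != null && tag.equals("I-" + open);
            if (continues) {
                sb.append(' ');
            } else {
                if (open != null) {
                    sb.append("] "); // close the previous chunk
                    open = null;
                } else if (sb.length() > 0) {
                    sb.append(' ');
                }
                if (!tag.equals("O")) {
                    open = tag.substring(2); // strip the "B-" or "I-" prefix
                    sb.append('[').append(open).append(' ');
                }
            }
            sb.append(words[i]);
        }
        if (open != null) {
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = { "He", "reckons", "the", "current", "account", "deficit", "." };
        String[] tags = { "B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O" };
        // prints: [NP He] [VP reckons] [NP the current account deficit] .
        System.out.println(bracket(words, tags));
    }
}
```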