BasicExample.java

#

You can download the Java file here.

package edu.illinois.cs.cogcomp.edison.examples;

import java.util.Arrays;
import java.util.List;

import edu.illinois.cs.cogcomp.edison.sentences.Sentence;
import edu.illinois.cs.cogcomp.edison.sentences.TextAnnotation;
#

Creating a TextAnnotation

class BasicExample {
#

In Edison, different annotations over text are called Views, each of which is a graph of Constituents and Relations. All the Views of a given text are managed by an object called a TextAnnotation.

One key assumption in Edison is that every view can be defined in terms of the tokens of the text. In other words, the TextAnnotation object fixes a tokenization for the text when the object is created, and all the views are defined in terms of this tokenization.

    public static void main(String[] args) {
#

This example shows some ways to create a TextAnnotation.

	
#

We always need to specify the corpus and text identifiers. In the current version of Edison, these identifiers are not used to perform any book-keeping, but this could be introduced in future versions.

	String corpus = "2001_ODYSSEY";
	String textId = "001";
#

The simplest way to define a TextAnnotation is to pass only the text to the constructor. Note that text1 consists of two sentences. The corresponding ta1 will use the SentenceSplitter defined in the LBJ library to split the text into sentences and then apply the LBJ tokenizer to tokenize each sentence.

	String text1 = "Good afternoon, gentlemen. I am a HAL-9000 computer.";
	TextAnnotation ta1 = new TextAnnotation(corpus, textId + "1", text1);
#

Another way to create a TextAnnotation is to specify the sentences and tokens manually. In this case, the input to the constructor consists of the corpus, text identifier and a List of strings. Each element in the list will be treated as a sentence. Further, this constructor assumes that the sentences are white-space tokenized.

	List<String> tokenizedSentences = Arrays.asList(
		"Good afternoon , gentlemen .",
		"I am a HAL-9000 computer .");
	TextAnnotation ta2 = new TextAnnotation(corpus, textId + "2",
		tokenizedSentences);
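
As an aside, the whitespace-tokenization assumption can be illustrated without Edison at all. The sketch below is not Edison's implementation; the class and method names (WhitespaceTokenizationSketch, tokens, sentenceEnds) are hypothetical and exist only for illustration. It shows one plausible way to flatten pre-tokenized sentences into a single token sequence while remembering where each sentence ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Edison: flattens whitespace-tokenized
// sentences into one token list and records the sentence boundaries.
class WhitespaceTokenizationSketch {

    // Splits every sentence on runs of whitespace and collects the tokens in order.
    static List<String> tokens(List<String> sentences) {
        List<String> result = new ArrayList<>();
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // an empty sentence contributes no tokens
            }
            result.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        return result;
    }

    // Returns, for each sentence, the exclusive index at which its tokens end
    // in the flattened token list.
    static List<Integer> sentenceEnds(List<String> sentences) {
        List<Integer> ends = new ArrayList<>();
        int count = 0;
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty()) {
                count += trimmed.split("\\s+").length;
            }
            ends.add(count);
        }
        return ends;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Good afternoon , gentlemen .",
                "I am a HAL-9000 computer .");
        System.out.println(tokens(sentences));       // 11 tokens in total
        System.out.println(sentenceEnds(sentences)); // prints [5, 11]
    }
}
```

Because punctuation is already separated by spaces in the input, a plain whitespace split is enough here; raw text such as text1 would need a real tokenizer instead.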
#

Print the text. This prints the raw text that was used to create the TextAnnotation object. When the second constructor is used, the printed text is whitespace-tokenized.

	System.out.println(ta1.getText());
	System.out.println(ta2.getText());
#

Print the tokenized text. The tokenization scheme is specified by the constructor, which in the first example defaults to the LBJ tokenizer, and in the second one is specified manually.

	System.out.println(ta1.getTokenizedText());
	System.out.println(ta2.getTokenizedText());
#

Print the de-tokenized text. This uses a normalization scheme to pretty-print the text. Because the same scheme can normalize both tokenized and un-tokenized text, the result can also be used as a key in Maps. Note: The detokenization scheme is still evolving. It handles several punctuation-related oddities, but not all.

	System.out.println(ta1.getDetokenizedText());
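
To make the Map-key idea concrete, here is a minimal standalone sketch. It does not reproduce Edison's detokenization scheme; the normalize method and the DetokenizedKeySketch class are hypothetical, and the rule used (collapse whitespace, then drop the space before a few punctuation marks) is only an assumed stand-in for a real normalizer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Edison's detokenizer: a naive normalization
// that maps tokenized and un-tokenized forms of a text to the same string.
class DetokenizedKeySketch {

    // Collapses runs of whitespace, then removes the space before common punctuation.
    static String normalize(String text) {
        return text.trim()
                .replaceAll("\\s+", " ")
                .replaceAll(" ([,.;:!?])", "$1");
    }

    public static void main(String[] args) {
        Map<String, String> labelsByText = new HashMap<>();

        // Store an entry under the key derived from the tokenized form.
        labelsByText.put(normalize("Good afternoon , gentlemen ."), "greeting");

        // Look it up using the un-tokenized form of the same text.
        System.out.println(labelsByText.get(normalize("Good afternoon, gentlemen."))); // prints greeting
    }
}
```

A rule this simple mishandles quotes, brackets and contractions, which is the kind of punctuation-related oddity the note above refers to.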
#

Print the list of views that this text annotation contains. This will print SENTENCES.

Notes:

  1. The tokens are not stored in a View. The TextAnnotation knows the tokens of the text and each Constituent of every view is defined in terms of the tokens. A constituent can represent zero tokens, spans or even arbitrary (non-contiguous) collections of tokens.

  2. Sentences are stored as a view. In the terminology above, the Constituents will correspond to the sentences. There are no Relations between them. (The ordering between the sentences is not explicitly represented because this can be inferred from the Constituents which refer to the tokens.) So the graph that this View represents is a degenerate graph, with only nodes and no edges.

	System.out.println(ta1.getAvailableViews());
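
The two notes above can be pictured with a small standalone model. This is a sketch under stated assumptions, not Edison's classes: ViewModelSketch, Span, Edge and SimpleView are hypothetical names, and for brevity a constituent here covers only a contiguous token range, whereas Edison's Constituents may also cover zero tokens or non-contiguous collections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Edison's API: a view is a graph whose nodes are
// token spans and whose edges are labelled relations between spans.
class ViewModelSketch {

    // A constituent covering tokens [start, end) of a shared token array.
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }

        // Recovers the surface string from the shared tokens.
        String surface(String[] tokens) {
            return String.join(" ", Arrays.copyOfRange(tokens, start, end));
        }
    }

    // A labelled directed edge between two constituents.
    static final class Edge {
        final String label;
        final Span source;
        final Span target;

        Edge(String label, Span source, Span target) {
            this.label = label;
            this.source = source;
            this.target = target;
        }
    }

    static final class SimpleView {
        final String name;
        final List<Span> constituents = new ArrayList<>();
        final List<Edge> relations = new ArrayList<>();

        SimpleView(String name) {
            this.name = name;
        }
    }

    // Builds a sentence view from exclusive sentence end indices.
    // It adds only nodes; no relations are created.
    static SimpleView sentenceView(int[] sentenceEnds) {
        SimpleView view = new SimpleView("SENTENCES");
        int start = 0;
        for (int end : sentenceEnds) {
            view.constituents.add(new Span("SENTENCE", start, end));
            start = end;
        }
        return view;
    }

    public static void main(String[] args) {
        String[] tokens = {"Good", "afternoon", ",", "gentlemen", ".",
                "I", "am", "a", "HAL-9000", "computer", "."};
        SimpleView sentences = sentenceView(new int[] {5, 11});
        for (Span span : sentences.constituents) {
            System.out.println(span.surface(tokens));
        }
        System.out.println(sentences.relations.size() + " relations");
    }
}
```

Sentence order never has to be stored in this model: sorting the spans by their start index recovers it, which mirrors the remark that ordering can be inferred from the tokens the Constituents refer to.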
#

Print the sentences. The Sentence class has the same methods as a TextAnnotation.

	List<Sentence> sentences = ta1.sentences();

	System.out.println(sentences.size() + " sentences found.");

	for (Sentence sentence : sentences) {
	    System.out.println(sentence);
	}

    }
}