edu.illinois.cs.cogcomp.lbj.coref.ir.docs
Interface Doc

All Known Implementing Classes:
DocACEPhase2, DocAPF, DocBase, DocPlainText, DocXMLBase

public interface Doc

Represents one document from a corpus, including the text, sentences, words, part-of-speech tags, annotations of coreference, relations, entities, mentions, and other relevant information.

The most common way to create a document is to use a DocLoader, such as DocFromTextLoader or DocLoader.getDefaultLoader(java.lang.String). A loader automatically loads mentions when the files contain annotation, and automatically predicts and types mentions when mention detectors and typers are provided. Alternatively, subclasses may be constructed directly.

Author:
Eric Bengtson

Method Summary
 Mention getBestMentionFor(Mention m)
          Gets the canonical mention of the entity containing m.
 CExample getCExampleFor(Mention m1, Mention m2)
          Returns the unique CExample for the given pair of mentions in the given order.
 java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo()
          Gets the coherence info using the value of usePredictedEntities() to determine whether to use predicted entities.
 java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo(boolean usePred)
          Gets a grid indicating the mention type for each combination of entities and sentences.
 ChainSolution<Mention> getCorefChains()
          Gets the partition of mentions into coreferential sets.
 java.lang.String getDocID()
          Gets the ID for this document, as a string.
 java.util.List<Entity> getEntities()
          Gets the entities, in no particular order.
 Entity getEntityFor(Mention m)
          Gets the entity containing m.
 Entity getEntityFor(Mention m, boolean usePred)
          Gets the entity containing m.
 GExample getGExampleFor(Mention m)
          Returns the unique GExample for the given mention.
 double getInCorpusInverseFreq(java.lang.String word)
          Gets the inverse of the number of occurrences of the specified word in the corpus.
 double getInDocInverseFreq(java.lang.String word)
          Gets the inverse of the number of occurrences of the specified word in the document.
 double getInverseTrueHeadFreq(int wordNum)
          Gets the inverse true head frequency of the word at the specified position.
 double getInverseTrueHeadFreq(java.lang.String word)
          Gets the inverse of the number of occurrences of the specified word in the heads of the true mentions in the document.
 java.util.List<Mention> getMentions()
          Gets the mentions of the document, sorted (typically in document order).
 java.util.Set<Mention> getMentionsContainedIn(Mention m)
          Gets the set of mentions whose head is entirely contained within a specified mention's extent, including the specified mention itself.
 java.util.Set<Mention> getMentionsContaining(Mention m)
          Gets the set of mentions whose extents entirely contain a specified mention's extent, including the specified mention itself.
 java.util.List<Mention> getMentionsInSent(int sentNum)
          Gets a list of the mentions in a specified sentence in order.
 Pair<java.util.List<Mention>,java.util.List<Mention>> getMentionsInSentences(int s1, int s2)
          Gets a pair of lists of mentions, one for each of the two specified sentences.
 java.util.Set<Mention> getMentionsWithExtentStartingAt(int startWord)
          Returns the set of mentions whose extents start at the specified word number, or an empty set if none are found.
 java.util.Set<Mention> getMentionsWithHeadStartingAt(int startWord)
          Returns the set of mentions whose heads start at the specified word number, or an empty set if none are found.
 int getNumRelations()
          Gets the number of relations.
 int getNumSentences()
          Returns the number of sentences in the document.
 java.lang.String getPlainText()
          Gets the text that is the basis for counting, including the start/end characters in Chunk objects.
 java.util.List<java.lang.String> getPOS()
          Gets a list of the Part-Of-Speech tags for the words of the document.
 java.lang.String getPOS(int posNum)
          Gets the Part-Of-Speech tag for the word at the posNum position in the document.
 java.util.List<Entity> getPredEntities()
          Gets a list of predicted entities, in no particular order.
 java.util.List<Mention> getPredMentions()
          Gets a sorted list of predicted mentions.
 int getQuoteNestLevel(int wordNum)
          Indicates the number of nested quotes the specified word is in.
 Relation getRelation(int number)
          Gets the specified relation.
 int getSentNum(int wordNum)
          Gets the sentence number for the specified word.
 int getStartCharNum(int wordNum)
          Gets the zero-based position of the first character of a word.
 int getTextFirstWordNum()
          Gets the word number of the first word in the main text of the document (as distinguished from headlines and metadata that may be included in the plain text).
 java.util.List<Entity> getTrueEntities()
          Gets a list of true entities, in no particular order.
 Mention getTrueMentionFor(Mention pred)
          Gets the true mention aligned with the specified mention.
 java.util.List<Mention> getTrueMentions()
          Gets a sorted list of true mentions.
 java.util.Map<java.lang.String,java.lang.Integer> getWholeDocCounts()
          Gets the counts for the words in the document.
 java.lang.String getWord(int wordNum)
          Gets the specified word.
 int getWordNum(int charNum)
          Determines the zero-based word number of the word at charNum; if no word is at charNum, returns the word number of the closest word appearing after charNum, or -1 if no such word exists.
 java.util.List<java.lang.String> getWords()
          Gets a list of the surface forms of the words of the document.
 boolean hasPredEntities()
          Indicates whether predicted entities are available.
 boolean hasPredMentions()
          Indicates whether predicted mentions have been set.
 boolean hasTrueEntities()
          Indicates whether true entities are available.
 boolean hasTrueMentions()
          Indicates whether true mentions have been set.
 boolean isCaseSensitive()
          Indicates whether the document is case sensitive.
 Chunk makeChunk(int startWord, int endWord)
          Creates a chunk spanning the specified words in this document.
 void save()
          Writes the document to a file using serialization.
 void setCorpusCounts(java.util.Map<java.lang.String,java.lang.Integer> counts)
          Sets the corpus counts for the words in the corpus.
 void setPredEntities(ChainSolution<Mention> sol)
          Sets the predicted entities to be those specified by sol.
 void setPredictedMentions(java.util.Collection<Mention> ments)
          Sets the predicted mentions and records a preference for using them.
 void setUsePredictedEntities(boolean usePred)
          Sets the preference for using predicted entities or true entities.
 void setUsePredictedMentions(boolean usePred)
          Sets the preference for using predicted mentions or true mentions.
 java.lang.String toAnnotatedString(boolean showPOS)
          Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and angle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags.
 java.lang.String toAnnotatedString(boolean showPOS, boolean showMTypes, boolean showETypes, boolean showEIDs)
          Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and angle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags, mention types, entity types, and entity IDs.
 java.lang.String toCoherenceTableString()
          Gets the coherence grid represented as a string, laid out in a grid.
 java.lang.String toCoherenceTableString(boolean usePred)
          Gets the coherence grid represented as a string, laid out in a grid.
 java.lang.String toSubstituteString()
          Gets the document as a string where each mention has been replaced by the most specific mention coreferential with it.
 boolean usePredictedEntities()
          Indicates whether requests for entities will return predicted entities or true entities.
 boolean usePredictedMentions()
          Indicates whether requests for mentions will return predicted mentions or true mentions.
 void write(boolean usePredictions)
          Writes this Doc in the appropriate format.
 void write(java.lang.String filename, boolean usePredictions)
          Writes this Doc in the appropriate format.
 

Method Detail

getPlainText

java.lang.String getPlainText()
Gets the text that is the basis for counting, including the start/end characters in Chunk objects.

Returns:
The plain text.

getDocID

java.lang.String getDocID()
Gets the ID for this document, as a string.

Returns:
The document ID.

isCaseSensitive

boolean isCaseSensitive()
Indicates whether the document is case sensitive.

Returns:
Whether the document is case sensitive.

getSentNum

int getSentNum(int wordNum)
Gets the sentence number for the specified word.

Parameters:
wordNum - the zero-based position of the word whose sentence number is desired.
Returns:
The zero-based sentence number.

getNumSentences

int getNumSentences()
Returns the number of sentences in the document.

Returns:
The number of sentences.

setUsePredictedEntities

void setUsePredictedEntities(boolean usePred)
Sets the preference for using predicted entities or true entities.

Parameters:
usePred - if true, prefer to use predicted entities, otherwise, prefer true entities.

usePredictedEntities

boolean usePredictedEntities()
Indicates whether requests for entities will return predicted entities or true entities.

Returns:
Whether predicted or true entities are to be used.

getEntities

java.util.List<Entity> getEntities()
Gets the entities, in no particular order. If usePredictedEntities() and predicted entities are available, return them; otherwise return true entities.

Returns:
An unmodifiable view of the entities.
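
The selection rule described for getEntities() can be pictured with the following minimal sketch. It is not the library's implementation: EntitySelectionSketch and its fields are hypothetical stand-ins, and only the documented behavior is modeled (prefer predicted entities when they are both requested and available, otherwise fall back to true entities, and hand out an unmodifiable view).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a Doc implementation; only the selection rule is modeled. */
final class EntitySelectionSketch<E> {
    private final List<E> trueEntities = new ArrayList<>();
    private final List<E> predEntities = new ArrayList<>();
    private boolean hasPred = false;
    private boolean usePred = false;

    void addTrueEntity(E entity) {
        trueEntities.add(entity);
    }

    /** Mirrors the documented side effect of setPredEntities: the preference flips to predicted. */
    void setPredEntities(List<E> entities) {
        predEntities.clear();
        predEntities.addAll(entities);
        hasPred = true;
        usePred = true;
    }

    void setUsePredictedEntities(boolean usePred) {
        this.usePred = usePred;
    }

    /** Predicted entities if preferred and available; otherwise true entities. */
    List<E> getEntities() {
        List<E> chosen = (usePred && hasPred) ? predEntities : trueEntities;
        return Collections.unmodifiableList(chosen);
    }
}
```

Under this sketch, asking for predicted entities before any have been set still yields the true entities, which matches the fallback described above.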

getPredEntities

java.util.List<Entity> getPredEntities()
Gets a list of predicted entities, in no particular order.

Returns:
An unmodifiable view of the predicted entities or an empty list.

getTrueEntities

java.util.List<Entity> getTrueEntities()
Gets a list of true entities, in no particular order.

Returns:
An unmodifiable view of the true entities or an empty list.

getCorefChains

ChainSolution<Mention> getCorefChains()
Gets the partition of mentions into coreferential sets.

Returns:
A reference to the chain solution representing the predicted partitioning of mentions into entities, or null if none has been set.

getEntityFor

Entity getEntityFor(Mention m)
Gets the entity containing m. Uses entities from getEntities().

Parameters:
m - The mention whose entity is desired.
Returns:
The entity containing m, or null if not found.

getEntityFor

Entity getEntityFor(Mention m,
                    boolean usePred)
Gets the entity containing m. If usePred, returns the predicted entity, else returns the true entity (if the requested type of entity is not available, null will be returned).

Parameters:
m - The mention whose entity is desired.
usePred - Whether to return a predicted entity or a true entity.
Returns:
The entity containing m, or null if the entity of the specified type is not available.

setPredEntities

void setPredEntities(ChainSolution<Mention> sol)
Sets the predicted entities to be those specified by sol. Entity IDs are automatically created, and each mention's setPredictedEntityID() method is called. Also sets usePredictedEntities to true. The entities are backed internally, but the mentions are not duplicated.

Parameters:
sol - The partition of mentions from which to derive entities.

hasPredEntities

boolean hasPredEntities()
Indicates whether predicted entities are available.

Returns:
Whether predicted entities have been set.

hasTrueEntities

boolean hasTrueEntities()
Indicates whether true entities are available.

Returns:
Whether true entities have been set.

getCExampleFor

CExample getCExampleFor(Mention m1,
                        Mention m2)
Returns the unique CExample for the given pair of mentions in the given order. Doc is the head of a collection of related examples; as such, it needs to return the same CExample any time an inference-based classifier is used.

Parameters:
m1 - The first mention.
m2 - The second mention.
Returns:
The unique CExample referring to the ordered pair m1, m2.
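
One way to obtain the "same example for the same ordered pair" guarantee is a cache keyed on the ordered pair. The sketch below is an assumption about how such a guarantee could be met, not a description of how any Doc implementation actually does it. PairExampleCacheSketch is a hypothetical name, and the type parameters M and X stand in for Mention and CExample.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical cache that returns one example object per ordered pair of mentions. */
final class PairExampleCacheSketch<M, X> {
    private final Map<List<M>, X> cache = new HashMap<>();
    private final BiFunction<M, M, X> factory;

    PairExampleCacheSketch(BiFunction<M, M, X> factory) {
        this.factory = factory;
    }

    /** Creates the example on first request and returns the cached instance afterwards. */
    X getExampleFor(M m1, M m2) {
        // A two-element list is an order-sensitive key: (m1, m2) differs from (m2, m1).
        List<M> key = Arrays.asList(m1, m2);
        return cache.computeIfAbsent(key, k -> factory.apply(m1, m2));
    }
}
```

The key relies on the equals and hashCode of the mention type, so this approach only works if those are stable for the lifetime of the cache.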

getGExampleFor

GExample getGExampleFor(Mention m)
Returns the unique GExample for the given mention. Doc is the head of a collection of related examples; as such, it needs to return the same GExample any time an inference-based classifier is used.

Parameters:
m - The mention.
Returns:
The unique GExample referring to m.

setUsePredictedMentions

void setUsePredictedMentions(boolean usePred)
Sets the preference for using predicted mentions or true mentions.

Parameters:
usePred - if true, prefer to use predicted mentions, otherwise, prefer true mentions.

usePredictedMentions

boolean usePredictedMentions()
Indicates whether requests for mentions will return predicted mentions or true mentions.

Returns:
Whether predicted or true mentions are to be used.

getMentions

java.util.List<Mention> getMentions()
Gets the mentions of the document, sorted (typically in document order). Returns predicted mentions or true mentions depending on the result of usePredictedMentions().

Returns:
mentions sorted by their natural ordering (usually document ordering).

getPredMentions

java.util.List<Mention> getPredMentions()
Gets a sorted list of predicted mentions.

Returns:
sorted predicted mentions, or an empty list if none available.

hasPredMentions

boolean hasPredMentions()
Indicates whether predicted mentions have been set.

Returns:
Whether predicted mentions have been set.

hasTrueMentions

boolean hasTrueMentions()
Indicates whether true mentions have been set.

Returns:
Whether true mentions have been set.

getTrueMentions

java.util.List<Mention> getTrueMentions()
Gets a sorted list of true mentions.

Returns:
sorted true mentions, or an empty list if none available.

setPredictedMentions

void setPredictedMentions(java.util.Collection<Mention> ments)
Sets the predicted mentions and records a preference for using them.

Parameters:
ments - The predicted mentions (copied defensively).

getTrueMentionFor

Mention getTrueMentionFor(Mention pred)
Gets the true mention aligned with the specified mention.

Parameters:
pred - A predicted mention.
Returns:
The true mention aligned with pred.

getBestMentionFor

Mention getBestMentionFor(Mention m)
Gets the canonical mention of the entity containing m.

Parameters:
m - A mention.
Returns:
The canonical mention for m.

getMentionsWithHeadStartingAt

java.util.Set<Mention> getMentionsWithHeadStartingAt(int startWord)
Returns the set of mentions whose heads start at the specified word number, or an empty set if none are found. May be backed internally or not: no guarantees are made (yet).

Parameters:
startWord - A word number.
Returns:
The set of mentions whose heads start at startWord.

getMentionsWithExtentStartingAt

java.util.Set<Mention> getMentionsWithExtentStartingAt(int startWord)
Returns the set of mentions whose extents start at the specified word number, or an empty set if none are found. May be backed internally or not: no guarantees are made (yet).

Parameters:
startWord - A word number.
Returns:
The set of mentions whose extents start at startWord.

getMentionsContainedIn

java.util.Set<Mention> getMentionsContainedIn(Mention m)
Gets the set of mentions whose head is entirely contained within a specified mention's extent, including the specified mention itself. Returns predicted or true mentions according to the result of getMentions().

Parameters:
m - The specified mention.
Returns:
The set of mentions contained in m.
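
The containment test compares a candidate's head against the specified mention's extent, so a mention whose extent runs past the end of m can still count as contained. The sketch below illustrates that rule under stated assumptions: Span is a hypothetical stand-in for Mention, and offsets are inclusive word positions.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ContainmentSketch {
    /** Hypothetical mention span with inclusive word offsets for extent and head. */
    static final class Span {
        final int extentStart;
        final int extentEnd;
        final int headStart;
        final int headEnd;

        Span(int extentStart, int extentEnd, int headStart, int headEnd) {
            this.extentStart = extentStart;
            this.extentEnd = extentEnd;
            this.headStart = headStart;
            this.headEnd = headEnd;
        }
    }

    /** True if inner's head lies entirely within outer's extent. */
    static boolean headWithinExtent(Span inner, Span outer) {
        return outer.extentStart <= inner.headStart && inner.headEnd <= outer.extentEnd;
    }

    /** Collects every candidate whose head lies within m's extent. */
    static Set<Span> containedIn(Span m, List<Span> candidates) {
        Set<Span> result = new LinkedHashSet<>();
        for (Span candidate : candidates) {
            if (headWithinExtent(candidate, m)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

As long as a mention's head lies inside its own extent, m passes the test against itself, which is consistent with the result including the specified mention.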

getMentionsContaining

java.util.Set<Mention> getMentionsContaining(Mention m)
Gets the set of mentions whose extents entirely contain a specified mention's extent, including the specified mention itself. Returns predicted or true mentions according to the result of getMentions().

Parameters:
m - The specified mention.
Returns:
The set of mentions containing m.

getMentionsInSent

java.util.List<Mention> getMentionsInSent(int sentNum)
Gets a list of the mentions in a specified sentence in order. Returns true or predicted mentions according to the value of usePredictedMentions().

Parameters:
sentNum - The number of the specified sentence.
Returns:
A List of the mentions in the specified sentence, in the order that they appear in the sentence.

getMentionsInSentences

Pair<java.util.List<Mention>,java.util.List<Mention>> getMentionsInSentences(int s1,
                                                                             int s2)
Gets a pair of lists of mentions, one for each of the two specified sentences. Each list contains all the mentions in its sentence.

Parameters:
s1 - The number of the first sentence.
s2 - The number of the second sentence.
Returns:
A pair of lists of mentions, one for each sentence.

makeChunk

Chunk makeChunk(int startWord,
                int endWord)
Creates a chunk spanning the specified words in this document.

Parameters:
startWord - The position of the first word in desired chunk.
endWord - The position of the last word in the desired chunk.
Returns:
The desired chunk.

getWords

java.util.List<java.lang.String> getWords()
Gets a list of the surface forms of the words of the document.

Returns:
A list of Strings of words, in the order they appear.

getWord

java.lang.String getWord(int wordNum)
Gets the specified word.

Parameters:
wordNum - The position of the specified word (as an index into a List).
Returns:
The wordNumth word as a string.

getPOS

java.util.List<java.lang.String> getPOS()
Gets a list of the Part-Of-Speech tags for the words of the document. The tag set is that output by the LBJ POS tagger.

Returns:
A list of Part-Of-Speech tags corresponding to the words of the document.
See Also:
POSTagger

getPOS

java.lang.String getPOS(int posNum)
Gets the Part-Of-Speech tag for the word at the posNum position in the document.

Parameters:
posNum - The position of the word whose POS tag should be returned.
Returns:
The Part-Of-Speech tag for the desired word position.
See Also:
POSTagger

getWordNum

int getWordNum(int charNum)
Determines the zero-based word number of the word at charNum; if no word is at charNum, returns the word number of the closest word appearing after charNum, or -1 if no such word exists.

Parameters:
charNum - The character number.
Returns:
The word number corresponding to the specified character number.
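
The lookup rule can be illustrated with a short sketch. It assumes, hypothetically, that word boundaries are available as two parallel arrays of inclusive character offsets in ascending order. The interface does not say how implementations store them, and WordNumSketch is an illustrative name only.

```java
final class WordNumSketch {
    /**
     * Returns the index of the word covering charNum, or of the next word after it,
     * or -1 if charNum lies beyond the last word.
     *
     * starts[i] and ends[i] are the inclusive character offsets of word i,
     * sorted in ascending order and non-overlapping (an assumption of this sketch).
     */
    static int wordNumAt(int[] starts, int[] ends, int charNum) {
        for (int i = 0; i < starts.length; i++) {
            // The first word that ends at or after charNum either covers it
            // or is the closest word appearing after it.
            if (charNum <= ends[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

For the text "Hi  there" (two spaces), the words occupy characters 0 to 1 and 4 to 8, so a position in the gap resolves to the second word. A linear scan keeps the sketch short; a binary search over the end offsets would serve the same purpose for long documents.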

getTextFirstWordNum

int getTextFirstWordNum()
Gets the word number of the first word in the main text of the document (as distinguished from headlines and metadata that may be included in the plain text).

Returns:
The word number of the first word in the main text.

getStartCharNum

int getStartCharNum(int wordNum)
Gets the zero-based position of the first character of a word.

Parameters:
wordNum - The zero-based position of the word in the document.
Returns:
The zero-based position of the first character of the word within the plain text, or -1 if wordNum is invalid.

getQuoteNestLevel

int getQuoteNestLevel(int wordNum)
Indicates the number of nested quotes the specified word is in. 0 is the base level of the text.

Parameters:
wordNum - The position of the specified word.
Returns:
The number of nested quotes.
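
To make the notion of nesting depth concrete, the sketch below computes a level for every word in a token sequence. Several details are assumptions of this sketch, not documented behavior: quotes are taken to be tokenized in Penn Treebank style (`` opens, '' closes), the quote tokens themselves are assigned the level of the text surrounding them, and unmatched closing quotes are ignored.

```java
final class QuoteNestSketch {
    /** Returns, for each word, the number of open quotations enclosing it. */
    static int[] nestLevels(String[] words) {
        int[] levels = new int[words.length];
        int depth = 0;
        for (int i = 0; i < words.length; i++) {
            // Leave a quotation before recording the closing token's level.
            if (words[i].equals("''") && depth > 0) {
                depth--;
            }
            levels[i] = depth;
            // Enter a quotation after recording the opening token's level.
            if (words[i].equals("``")) {
                depth++;
            }
        }
        return levels;
    }
}
```

With this convention, a word in unquoted text has level 0, a word inside one quotation has level 1, and a word inside a quotation within a quotation has level 2.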

getInverseTrueHeadFreq

double getInverseTrueHeadFreq(int wordNum)
Gets the inverse true head frequency of the word at the specified position.

Parameters:
wordNum - The position in the document of the specified word.
Returns:
The inverse true head frequency of the specified word, or 1.0 if the word is not in a true head.
See Also:
getInverseTrueHeadFreq(String)

getInverseTrueHeadFreq

double getInverseTrueHeadFreq(java.lang.String word)
Gets the inverse of the number of occurrences of the specified word in the heads of the true mentions in the document.

Parameters:
word - The specified word.
Returns:
The inverse true head frequency of the specified word, or 1.0 if the word is not found in any heads.

getInDocInverseFreq

double getInDocInverseFreq(java.lang.String word)
Gets the inverse of the number of occurrences of the specified word in the document. Not normalized.

Parameters:
word - The specified word.
Returns:
The inverse of the number of times the word occurs in the document, or 1.0 if the word does not occur.
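
The unnormalized inverse frequency is simply one over the raw count, with 1.0 as the fallback for an unseen word. The following sketch shows that arithmetic. InverseFreqSketch is a hypothetical helper, and it ignores case handling, which a real implementation may apply depending on isCaseSensitive().

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InverseFreqSketch {
    /** Counts occurrences of each word exactly as given (no case folding). */
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns 1 / count, or 1.0 if the word was never counted. */
    static double inverseFreq(Map<String, Integer> counts, String word) {
        Integer n = counts.get(word);
        return (n == null || n == 0) ? 1.0 : 1.0 / n;
    }
}
```

Note that a word occurring exactly once and a word that never occurs both map to 1.0, so callers cannot use this value alone to tell the two cases apart.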

getInCorpusInverseFreq

double getInCorpusInverseFreq(java.lang.String word)
Gets the inverse of the number of occurrences of the specified word in the corpus. Not normalized.

Parameters:
word - The specified word.
Returns:
The inverse of the number of times the word occurs in the corpus, or 1.0 if the word does not occur.

setCorpusCounts

void setCorpusCounts(java.util.Map<java.lang.String,java.lang.Integer> counts)
Sets the corpus counts for the words in the corpus. Makes a copy of the map, which may be slow or space consuming.

Parameters:
counts - A map from words to counts of words in the corpus.

getWholeDocCounts

java.util.Map<java.lang.String,java.lang.Integer> getWholeDocCounts()
Gets the counts for the words in the document. Returns a copy, which may be slow or space consuming.

Returns:
A map from words to counts of words in the document.

getRelation

Relation getRelation(int number)
Gets the specified relation. Relations are not yet emphasized.

Parameters:
number - the number of the desired relation.
Returns:
The desired relation.

getNumRelations

int getNumRelations()
Gets the number of relations.

Returns:
The number of relations.

toAnnotatedString

java.lang.String toAnnotatedString(boolean showPOS,
                                   boolean showMTypes,
                                   boolean showETypes,
                                   boolean showEIDs)
Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and angle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags, mention types, entity types, and entity IDs. Predicted Entity IDs will be shown if available.

Parameters:
showPOS - Whether the Part-Of-Speech tags should be shown.
showMTypes - Whether mention types should be shown.
showETypes - Whether entity types should be shown.
showEIDs - Whether entity IDs should be shown.
Returns:
The text of the document, annotated.

toAnnotatedString

java.lang.String toAnnotatedString(boolean showPOS)
Gets the document as a string annotated with mention boundaries, with square brackets for true mentions, asterisks for false alarms, and angle brackets for missed mentions, and optionally annotated with Part-Of-Speech tags.

Parameters:
showPOS - Whether the Part-Of-Speech tags should be shown.
Returns:
The text of the document, annotated.

toSubstituteString

java.lang.String toSubstituteString()
Gets the document as a string where each mention has been replaced by the most specific mention coreferential with it.

Returns:
The doc as a string with each mention represented by its most specific coreferential mention.

getCoherenceInfo

java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo(boolean usePred)
Gets a grid indicating the mention type for each combination of entities and sentences. If a mention is predicted to belong to its true entity, its mention type will be uppercase; but if it is predicted to be in the wrong entity (due to coreference mistake) its mention type will be lowercase and the mention's entity ID will be appended after its mention type.

Parameters:
usePred - Whether predicted entities should be used.
Returns:
A map from entities to a map from sentence numbers to strings, representing the grid described above.
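
The labeling rule for a grid cell can be sketched as follows. This is an illustration under assumptions, not the library's code: entities are represented by plain string IDs instead of Entity objects, the ID appended for a misplaced mention is taken to be its true entity ID, and the mention type strings used in the usage note ("NAM", "PRO") are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class CoherenceGridSketch {
    /**
     * Uppercase type if the mention sits in the row of its true entity;
     * otherwise lowercase type followed by the mention's true entity ID.
     */
    static String cellLabel(String mentionType, String trueEntityId, String rowEntityId) {
        if (trueEntityId.equals(rowEntityId)) {
            return mentionType.toUpperCase(Locale.ROOT);
        }
        return mentionType.toLowerCase(Locale.ROOT) + trueEntityId;
    }

    /** Stores a label in the nested map: entity ID -> (sentence number -> label). */
    static void put(Map<String, Map<Integer, String>> grid,
                    String rowEntityId, int sentNum, String label) {
        grid.computeIfAbsent(rowEntityId, k -> new LinkedHashMap<>()).put(sentNum, label);
    }
}
```

For example, a "PRO" mention whose true entity is E2 but which was placed in the row for E1 would be labeled "proE2" in this sketch, while a correctly placed "NAM" mention would be labeled "NAM".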

getCoherenceInfo

java.util.Map<Entity,java.util.Map<java.lang.Integer,java.lang.String>> getCoherenceInfo()
Gets the coherence info using the value of usePredictedEntities() to determine whether to use predicted entities.

Returns:
Coherence info as described in the one parameter version of this method.

toCoherenceTableString

java.lang.String toCoherenceTableString(boolean usePred)
Gets the coherence grid represented as a string, laid out in a grid.

Parameters:
usePred - Whether predicted entities should be used.
Returns:
A coherence grid as a string.
See Also:
getCoherenceInfo()

toCoherenceTableString

java.lang.String toCoherenceTableString()
Gets the coherence grid represented as a string, laid out in a grid. Predicted entities will be used as determined by the value of usePredictedEntities().

Returns:
A coherence grid as a string.
See Also:
getCoherenceInfo()

save

void save()
          throws java.io.IOException
Writes the document to a file using serialization.

Throws:
java.io.IOException

write

void write(boolean usePredictions)
Writes this Doc in the appropriate format.

Parameters:
usePredictions - Whether predicted mentions and entities should be written.

write

void write(java.lang.String filename,
           boolean usePredictions)
Writes this Doc in the appropriate format.

Parameters:
filename - The name of the target file.
usePredictions - Whether predicted mentions and entities should be written.