public class Lexicon extends Object implements Cloneable, Serializable
Lexicon contains a mapping from Features to integers. The integer key of a feature is returned by the lookup(Feature) method. If the feature is not already in the lexicon, then it will be added to the lexicon, and thus lookup calls can be made without the need to check if an entry already exists. The integer keys are assigned in ascending order starting from 0 as features are added to the lexicon.
The map is implemented as a HashMap by default, and the Lexicon class has similar functionality. This class also maintains a second Vector mapping integers to their associated features for fast reverse lookup using the lookupKey(int) method.
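The two-way mapping described above can be illustrated with a minimal sketch. This is not the Lexicon implementation: the class name MiniLexicon is hypothetical, plain Strings stand in for Feature objects, and the sketch always adds an unseen feature, as the class description states, without modelling the training flag, counting, or pruning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Lexicon; features are plain Strings here.
final class MiniLexicon {
    private final Map<String, Integer> forward = new HashMap<>(); // feature -> key
    private final List<String> inverse = new ArrayList<>();       // key -> feature

    // Returns the key of the feature, assigning the next free key if it is new.
    int lookup(String feature) {
        Integer key = forward.get(feature);
        if (key == null) {
            key = inverse.size();      // keys ascend from 0 in insertion order
            forward.put(feature, key);
            inverse.add(feature);
        }
        return key;
    }

    // Reverse lookup; returns null if no feature has this key.
    String lookupKey(int key) {
        return key >= 0 && key < inverse.size() ? inverse.get(key) : null;
    }

    int size() {
        return inverse.size();
    }

    public static void main(String[] args) {
        MiniLexicon lex = new MiniLexicon();
        System.out.println(lex.lookup("word=the")); // 0
        System.out.println(lex.lookup("word=cat")); // 1
        System.out.println(lex.lookup("word=the")); // 0 (already present)
        System.out.println(lex.lookupKey(1));       // word=cat
        System.out.println(lex.lookupKey(7));       // null
    }
}
```

Keeping the inverse list in insertion order means the key of a feature is simply its position in the list, so the reverse lookup is a constant-time index access.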
Modifier and Type | Class and Description |
---|---|
static class | Lexicon.CountPolicy - Immutable type representing the feature counting policy of a lexicon. |
static class | Lexicon.PruningPolicy - Represents the feature pruning policy of a lexicon. |
Modifier and Type | Field and Description |
---|---|
protected edu.illinois.cs.cogcomp.core.datastructures.vectors.IVector | featureCounts - Counts the number of occurrences of each feature. |
protected Map | lexicon - The map of features to integer keys. |
protected ChildLexicon | lexiconChildren - Stores features that might appear repeatedly as children of other features, but which are not themselves given indexes in the lexicon. |
protected FVector | lexiconInv - The inverted map of integer keys to their features. |
protected edu.illinois.cs.cogcomp.core.datastructures.vectors.IVector2D | perClassFeatureCounts - Counts the number of occurrences of each feature on a class-by-class basis. |
protected int | pruneCutoff - Features at this index in lexiconInv or higher have been pruned. |
Constructor and Description |
---|
Lexicon() - Creates an empty lexicon. |
Lexicon(String e) - Creates an empty lexicon with the given encoding. |
Modifier and Type | Method and Description |
---|---|
void | clear() - Clears the data structures associated with this instance. |
Object | clone() - Returns a deep clone of this lexicon implemented as a HashMap. |
boolean | contains(Feature f) - Returns true if the given feature is already in the lexicon (whether it's past the pruneCutoff or not) and false otherwise. |
void | countFeatures(Lexicon.CountPolicy policy) - Call this method to initialize the lexicon to count feature occurrences on each call to lookup(feature, true) (counting still won't happen on a call to lookup(feature, false)). |
void | discardPrunedFeatures() - Permanently discards any features that have been pruned via prune(Lexicon.PruningPolicy) as well as all feature counts. |
boolean | equals(Object o) - Returns whether the given Lexicon object is equal to this one. |
Feature | getChildFeature(Feature f, int label) - Used to look up the children of conjunctive and referring features during training, this method checks lexiconChildren if the feature isn't present in lexicon and lexiconInv, and then stores the given feature in lexiconChildren if it wasn't present anywhere. |
Lexicon.CountPolicy | getCountPolicy() - Returns the feature counting policy currently employed by this lexicon. |
int | getCutoff() |
Map | getMap() - Simply returns the map stored in lexicon. |
int | hashCode() - Returns a hash code for this lexicon. |
protected void | incrementCount(int index, int label) - Increments the count of the feature with the given index(es). |
boolean | isPruned(int i, int label, Lexicon.PruningPolicy policy) - Determines if the given feature index should be pruned according to the given pruning policy, which must have its thresholds set already in the case that it represents the "Percentage" policy. |
boolean | isPruned(int i, Lexicon.PruningPolicy policy) - Determines if the given feature index should be pruned according to the given pruning policy, which must have its thresholds set already in the case that it represents the "Percentage" policy. |
protected void | lazyMapCreation() - Various other methods in this class call this method to ensure that lexicon is populated before performing operations on it. |
int | lookup(Feature f) - Looks up a feature's index by calling lookup(f, false). |
int | lookup(Feature f, boolean training) - Looks up a feature's index by calling lookup(f, training, -1). |
int | lookup(Feature f, boolean training, int label) - Looks up the given feature in the lexicon, possibly counting it and/or expanding the lexicon to accommodate it. |
int | lookupChild(Feature f) - Used to look up the children of conjunctive and referring features while writing the lexicon, this method checks lexiconChildren if the feature isn't present in lexicon and lexiconInv, and will throw an exception if it still can't be found. |
Feature | lookupKey(int i) - Does a reverse lexicon lookup and returns the Feature associated with the given integer key, and null if no such feature exists. |
static void | main(String[] args) |
void | perClassToGlobalCounts() - Collapses per-class feature counts into global counts. |
void | printCountTable(boolean p) - Produces on STDOUT a table of feature counts including a line indicating the position of pruneCutoff. |
int[] | prune(Lexicon.PruningPolicy policy) - Rearranges the order in which features appear in the lexicon based on the compiled feature counts in featureCounts or perClassFeatureCounts so that pruned features are at the end of the feature space. |
void | read(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in) - Reads the binary representation of a lexicon from the specified stream, overwriting the data in this object. |
void | read(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in, boolean readCounts) - Reads the binary representation of a lexicon from the specified stream, overwriting the data in this object. |
static Lexicon | readLexicon(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in) - Reads a feature lexicon from the specified stream. |
static Lexicon | readLexicon(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in, boolean readCounts) - Reads a feature lexicon from the specified stream, with the option to ignore the feature counts by setting the second argument to false. |
static Lexicon | readLexicon(String filename) - Reads and returns a feature lexicon from the specified file. |
static Lexicon | readLexicon(URL url) - Reads a feature lexicon from the specified location. |
static Lexicon | readLexicon(URL url, boolean readCounts) - Reads a feature lexicon from the specified location, with the option to ignore the feature counts by setting the second argument to false. |
static int | readPrunedSize(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in) - Reads the value of pruneCutoff from the specified stream, discarding everything else. |
void | setEncoding(String e) - Sets the encoding used when adding features to this lexicon. |
int | size() - Returns the number of features currently stored in lexicon. |
String | toString() - Returns a text representation of this lexicon (for debugging). |
void | write(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessOutputStream out) - Writes a binary representation of the lexicon. |
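protected Map lexicon
The map of features to integer keys.
protected FVector lexiconInv
The inverted map of integer keys to their features.
protected edu.illinois.cs.cogcomp.core.datastructures.vectors.IVector featureCounts
Counts the number of occurrences of each feature.
protected edu.illinois.cs.cogcomp.core.datastructures.vectors.IVector2D perClassFeatureCounts
Counts the number of occurrences of each feature on a class-by-class basis.
protected int pruneCutoff
Features at this index in lexiconInv or higher have been pruned. -1 indicates that no pruning has been done.
protected ChildLexicon lexiconChildren
Stores features that might appear repeatedly as children of other features, but which are not themselves given indexes in the lexicon.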
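public Lexicon()
Creates an empty lexicon.
public Lexicon(String e)
Creates an empty lexicon with the given encoding.
Parameters:
e - The encoding to use when adding features to this lexicon.
public static Lexicon readLexicon(String filename)
Reads and returns a feature lexicon from the specified file.
Parameters:
filename - The name of the file from which to read the feature lexicon.
public static Lexicon readLexicon(URL url)
Reads a feature lexicon from the specified location.
Parameters:
url - The location from which to read the feature lexicon.
public static Lexicon readLexicon(URL url, boolean readCounts)
Reads a feature lexicon from the specified location, with the option to ignore the feature counts by setting the second argument to false.
Parameters:
url - The location from which to read the feature lexicon.
readCounts - Whether or not to read the feature counts.
public static Lexicon readLexicon(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in)
Reads a feature lexicon from the specified stream.
Parameters:
in - The stream from which to read the feature lexicon.
public static Lexicon readLexicon(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in, boolean readCounts)
Reads a feature lexicon from the specified stream, with the option to ignore the feature counts by setting the second argument to false.
Parameters:
in - The stream from which to read the feature lexicon.
readCounts - Whether or not to read the feature counts.
public void clear()
Clears the data structures associated with this instance.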
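public void setEncoding(String e)
Sets the encoding used when adding features to this lexicon.
Parameters:
e - The encoding.
public int size()
Returns the number of features currently stored in lexicon.
public int getCutoff()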
public void countFeatures(Lexicon.CountPolicy policy)
Call this method to initialize the lexicon to count feature occurrences on each call to lookup(feature, true) (counting still won't happen on a call to lookup(feature, false)). Alternatively, this method can also cause the lexicon to discard all its feature counts and cease counting features at any time in the future. The former happens when policy is something other than Lexicon.CountPolicy.none, and the latter happens when policy is Lexicon.CountPolicy.none.
Parameters:
policy - The new feature counting policy.
See Also:
lookup(Feature,boolean)
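The interaction between the counting policy and the training flag of lookup can be sketched as follows. This is an illustration under stated assumptions, not the Lexicon source: CountingLexiconSketch and its Policy enum are hypothetical names, Strings stand in for Feature objects, only global (not per-class) counts are modelled, counts are assumed to restart from zero whenever the policy is set, and the current size stands in for getCutoff() on the assumption that nothing has been pruned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Lexicon with String features and global counts only.
final class CountingLexiconSketch {
    // Stand-in for Lexicon.CountPolicy; NONE mirrors Lexicon.CountPolicy.none.
    enum Policy { NONE, GLOBAL }

    private final Map<String, Integer> keys = new HashMap<>();
    private final List<Integer> counts = new ArrayList<>();
    private Policy policy = Policy.NONE;

    // Sets the policy. Assumption: counts restart from zero on every call.
    void countFeatures(Policy newPolicy) {
        policy = newPolicy;
        counts.clear();
    }

    int lookup(String feature, boolean training) {
        Integer known = keys.get(feature);
        if (!training) {
            // Never counts and never expands. For an unseen feature the sketch
            // returns the current size as a stand-in for getCutoff().
            return known != null ? known : keys.size();
        }
        int key;
        if (known != null) {
            key = known;
        } else {
            key = keys.size();
            keys.put(feature, key);
        }
        if (policy != Policy.NONE) {
            while (counts.size() <= key) counts.add(0);
            counts.set(key, counts.get(key) + 1);
        }
        return key;
    }

    int count(String feature) {
        Integer key = keys.get(feature);
        return key != null && key < counts.size() ? counts.get(key) : 0;
    }

    int size() {
        return keys.size();
    }

    public static void main(String[] args) {
        CountingLexiconSketch lex = new CountingLexiconSketch();
        lex.countFeatures(Policy.GLOBAL);
        lex.lookup("a", true);
        lex.lookup("a", true);
        lex.lookup("b", true);
        System.out.println(lex.lookup("a", false)); // 0, and "a" is not counted again
        System.out.println(lex.lookup("c", false)); // 2, and "c" is not added
        System.out.println(lex.count("a"));         // 2
        lex.countFeatures(Policy.NONE);
        System.out.println(lex.count("a"));         // 0, counts were discarded
    }
}
```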
public Lexicon.CountPolicy getCountPolicy()
Returns the feature counting policy currently employed by this lexicon.
public void perClassToGlobalCounts()
Collapses per-class feature counts into global counts.
public boolean contains(Feature f)
Returns true if the given feature is already in the lexicon (whether it's past the pruneCutoff or not) and false otherwise. This does not alter or add anything to the lexicon.
Parameters:
f - The feature to look up.
public int lookup(Feature f)
Looks up a feature's index by calling lookup(f, false). See lookup(Feature,boolean,int) for more details.
Parameters:
f - The feature to look up.
public int lookup(Feature f, boolean training)
Looks up a feature's index by calling lookup(f, training, -1). See lookup(Feature,boolean,int) for more details.
Parameters:
f - The feature to look up.
training - Whether or not the learner is currently training.
public int lookup(Feature f, boolean training, int label)
Looks up the given feature in the lexicon, possibly counting it and/or expanding the lexicon to accommodate it. Counting and expansion happen only when training is true. Otherwise, f is not counted even if already in the lexicon, and a previously unobserved feature will cause this method to return the value of getCutoff() without expanding the lexicon to accommodate the new feature.
Parameters:
f - The feature to look up.
training - Whether or not the learner is currently training.
label - The label of the example containing this feature, or -1 if we aren't doing per class feature counting.
public Feature getChildFeature(Feature f, int label)
Used to look up the children of conjunctive and referring features during training, this method checks lexiconChildren if the feature isn't present in lexicon and lexiconInv, and then stores the given feature in lexiconChildren if it wasn't present anywhere.
Parameters:
f - The feature to look up.
label - The label of the example containing this feature, or -1 if we aren't doing per class feature counting.
Returns:
The feature corresponding to f that is stored in this lexicon.
protected void incrementCount(int index, int label)
Increments the count of the feature with the given index(es).
Parameters:
index - The index of the feature.
label - The label of the example containing this feature, which is ignored if we aren't doing per class feature counting.
public int lookupChild(Feature f)
Used to look up the children of conjunctive and referring features while writing the lexicon, this method checks lexiconChildren if the feature isn't present in lexicon and lexiconInv, and will throw an exception if it still can't be found.
Parameters:
f - The feature to look up.
Returns:
If the feature was found in lexicon, its associated integer index is returned. Otherwise, -i - 1 is returned, where i is the index associated with the feature in lexiconChildren.
Throws:
UnsupportedOperationException - If the feature isn't found anywhere in the lexicon.
public Feature lookupKey(int i)
Does a reverse lexicon lookup and returns the Feature associated with the given integer key, and null if no such feature exists.
Parameters:
i - The integer key to look up. If i is negative, lexiconChildren is queried instead of lexiconInv.
public boolean isPruned(int i, Lexicon.PruningPolicy policy)
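Determines if the given feature index should be pruned according to the given pruning policy, which must have its thresholds set already in the case that it represents the "Percentage" policy. Equivalent to calling isPruned(i, -1, policy).
Parameters:
i - The feature index.
policy - The pruning policy.
Returns:
true iff the feature should be pruned.
See Also:
isPruned(int,int,Lexicon.PruningPolicy)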
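The negative index convention described for lookupChild(Feature) and lookupKey(int) can be sketched in isolation. The class ChildIndexSketch and its helper methods are hypothetical, Strings stand in for Feature objects, and linear search replaces the real map lookups for brevity. The point is only the encoding: position i in the child store is reported as -i - 1, so child keys occupy -1, -2, ... and never collide with the non-negative keys of indexed features.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the -i - 1 encoding for child features.
final class ChildIndexSketch {
    private final List<String> lexiconInv = new ArrayList<>();      // indexed features
    private final List<String> lexiconChildren = new ArrayList<>(); // child-only features

    void addIndexed(String feature) { lexiconInv.add(feature); }
    void addChild(String feature) { lexiconChildren.add(feature); }

    // Position i in the child store <-> negative key -i - 1 (its own inverse).
    static int encodeChild(int i) { return -i - 1; }
    static int decodeChild(int key) { return -key - 1; }

    int lookupChild(String feature) {
        int i = lexiconInv.indexOf(feature);
        if (i >= 0) return i;
        i = lexiconChildren.indexOf(feature);
        if (i >= 0) return encodeChild(i);
        throw new UnsupportedOperationException("Feature not found: " + feature);
    }

    // A negative key selects the child store, a non-negative key the main one.
    String lookupKey(int key) {
        List<String> store = key < 0 ? lexiconChildren : lexiconInv;
        int i = key < 0 ? decodeChild(key) : key;
        return i < store.size() ? store.get(i) : null;
    }

    public static void main(String[] args) {
        ChildIndexSketch lex = new ChildIndexSketch();
        lex.addIndexed("a");
        lex.addIndexed("b");
        lex.addChild("x");
        lex.addChild("y");
        System.out.println(lex.lookupChild("b")); // 1
        System.out.println(lex.lookupChild("x")); // -1
        System.out.println(lex.lookupChild("y")); // -2
        System.out.println(lex.lookupKey(-2));    // y
        System.out.println(lex.lookupKey(0));     // a
    }
}
```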
public boolean isPruned(int i, int label, Lexicon.PruningPolicy policy)
Determines if the given feature index should be pruned according to the given pruning policy, which must have its thresholds set already in the case that it represents the "Percentage" policy. [...] true. When per class feature counts are present and the label is non-negative, only the count corresponding to that label must be greater than or equal to its corresponding threshold.
In other words, passing -1 in the second argument gives the behavior expected when pruning the lexicon as in prune(Lexicon.PruningPolicy). Passing a non-negative label in the second argument gives the behavior expected when pruning the actual examples.
Parameters:
i - The feature index.
label - The label of the example containing this feature, or -1 if we want the lexicon pruning behavior.
policy - The pruning policy.
Returns:
true iff the feature should be pruned.
public int[] prune(Lexicon.PruningPolicy policy)
Rearranges the order in which features appear in the lexicon based on the compiled feature counts in featureCounts or perClassFeatureCounts so that pruned features are at the end of the feature space. This way, learning algorithms can allocate exactly enough space in their weight vectors for the unpruned features.
This method returns an array of integers which is a permutation of the integers from 0 (inclusive) to the number of features in the lexicon (exclusive). It represents a map from the features' original indexes to their new ones after pruning. The getCutoff() method then returns the new index of the first pruned feature (or, equivalently, the number of unpruned features). All features with a new index greater than or equal to this index are considered pruned in the case of global pruning. In the case of per-class pruning, the cutoff represents the first feature whose count fell below the threshold for every class. Thus, in this case, features below the cutoff may still be pruned in any given class; just not all of them.
Parameters:
policy - The type of pruning to perform.
Returns:
A map from the features' original indexes to their new ones, or null if policy indicates no pruning.
public void discardPrunedFeatures()
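Permanently discards any features that have been pruned via prune(Lexicon.PruningPolicy) as well as all feature counts.
public Object clone()
Returns a deep clone of this lexicon implemented as a HashMap.
public boolean equals(Object o)
Returns whether the given Lexicon object is equal to this one.
public int hashCode()
Returns a hash code for this lexicon.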
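The index permutation and cutoff described for prune(Lexicon.PruningPolicy) can be illustrated for the global case. This sketch rests on assumptions that the documentation does not confirm: PruneSketch and Result are hypothetical names, a single absolute count threshold stands in for a Lexicon.PruningPolicy, and features keep their relative order within the kept and pruned groups.

```java
import java.util.Arrays;

// Hypothetical illustration of a global pruning pass over feature counts.
final class PruneSketch {
    // Map from original index to new index, plus the index of the first pruned feature.
    static final class Result {
        final int[] oldToNew;
        final int cutoff;

        Result(int[] oldToNew, int cutoff) {
            this.oldToNew = oldToNew;
            this.cutoff = cutoff;
        }
    }

    // Features with count >= threshold are kept; the rest move to the end.
    static Result prune(int[] counts, int threshold) {
        int kept = 0;
        for (int c : counts) {
            if (c >= threshold) kept++;
        }
        int[] oldToNew = new int[counts.length];
        int nextKept = 0;      // kept features fill 0 .. kept-1
        int nextPruned = kept; // pruned features fill kept .. n-1
        for (int i = 0; i < counts.length; i++) {
            oldToNew[i] = counts[i] >= threshold ? nextKept++ : nextPruned++;
        }
        return new Result(oldToNew, kept);
    }

    public static void main(String[] args) {
        Result r = prune(new int[] {5, 1, 7, 0, 3}, 3);
        System.out.println(Arrays.toString(r.oldToNew)); // [0, 3, 1, 4, 2]
        System.out.println(r.cutoff);                    // 3
    }
}
```

With the pruned features moved to the end, a learner only needs a weight vector of length cutoff, which is the motivation given in the prune description.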
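public void write(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessOutputStream out)
Writes a binary representation of the lexicon.
Parameters:
out - The output stream.
public void read(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in)
Reads the binary representation of a lexicon from the specified stream, overwriting the data in this object.
Parameters:
in - The input stream.
public void read(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in, boolean readCounts)
Reads the binary representation of a lexicon from the specified stream, overwriting the data in this object. The feature counts can be ignored by setting the second argument to false.
Parameters:
in - The input stream.
readCounts - Whether or not to read the feature counts.
protected void lazyMapCreation()
Various other methods in this class call this method to ensure that lexicon is populated before performing operations on it. The only reason it wouldn't be is if it had just been read off disk.
public static int readPrunedSize(edu.illinois.cs.cogcomp.core.datastructures.vectors.ExceptionlessInputStream in)
Reads the value of pruneCutoff from the specified stream, discarding everything else.
Parameters:
in - The input stream.
public String toString()
Returns a text representation of this lexicon (for debugging).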
public void printCountTable(boolean p)
Produces on STDOUT a table of feature counts including a line indicating the position of pruneCutoff. It's probably not a good idea to call this method unless you know your lexicon is small.
Parameters:
p - Whether or not to include package names in the output.
public static void main(String[] args)
Copyright © 2016. All rights reserved.