Learning Packages

Dataless Hierarchical Text Classification  

This project demonstrates how to perform dataless hierarchical text classification.

Learning Based Java  

Learning Based Java (LBJava) is a modeling language that expedites the development of systems with one or more learning components. In an LBJ model, simple learned components are modeled conditionally, and their initial predictions are then combined via constrained optimization, yielding an expressive, globally coherent set of final predictions.

SNoW Learning Architecture  

SNoW is a learning architecture that is tailored for learning in the presence of a very large number of information sources (features). SNoW learns a network of linear functions.
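As a rough illustration of the kind of linear function such a network can contain, here is a minimal sketch of a single Winnow unit over sparse Boolean features (SNoW stands for Sparse Network of Winnows). This is not SNoW's code or interface: the class name `Winnow`, the threshold, the promotion factor `alpha`, and the toy data are all assumptions chosen for the example.

```python
from collections import defaultdict


class Winnow:
    """Toy Winnow unit (hypothetical, not the SNoW implementation)."""

    def __init__(self, threshold=2.0, alpha=2.0):
        self.threshold = threshold   # activation threshold (assumed value)
        self.alpha = alpha           # multiplicative update factor (assumed value)
        # Sparse weights: a feature gets weight 1.0 the first time it is seen.
        self.weights = defaultdict(lambda: 1.0)

    def score(self, active):
        """Sum the weights of the active (present) features."""
        return sum(self.weights[f] for f in active)

    def predict(self, active):
        return self.score(active) >= self.threshold

    def update(self, active, label):
        """Mistake-driven update: promote on a miss, demote on a false alarm."""
        if self.predict(active) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for f in active:
            self.weights[f] *= factor


# Toy target concept: the label is True exactly when feature "x1" is active.
examples = [
    ({"x1", "x2"}, True),
    ({"x2", "x3"}, False),
    ({"x1", "x3"}, True),
    ({"x3"}, False),
]

learner = Winnow()
for _ in range(5):                       # a few passes over the toy data
    for active, label in examples:
        learner.update(active, label)

predictions = [learner.predict(active) for active, _ in examples]
```

Only features that actually occur in an example are ever stored, which is why this style of learner scales to very large feature spaces.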

Feature Extraction Language (FEX)  

FEX is a feature extraction package used to provide input to machine learning algorithms. FEX can be used to generate features from structured text or other relational data.

Edison: NLP Feature Extraction Framework  

Edison is a Java library for representing different NLP annotations (views) over text in the form of graphs over constituents. It provides easy-to-use accessors for different types of views and facilitates feature extraction.

JLIS: a multi-purpose structural learning library  

JLIS (pronounced "jealous") is a multi-purpose structural learning library. The JLIS-multiclass package supports cost-sensitive multiclass classification. The JLIS-reranking package supports "weighted" reranking that uses a performance measure (e.g. F1).

Streaming Data SVM (SBM)  

This software learns a linear classifier when the data cannot fit in memory on a single machine.

NLP Tools

Illinois NLP Curator  

The Curator is a central server that provides annotations for natural language text. It requests annotations from multiple natural language processing servers, caches and stores previous annotations, and refreshes stale annotations.
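The cache-and-refresh behaviour described above can be pictured with a small sketch. It is not the Curator's API: the class `AnnotationCache`, the `max_age_seconds` staleness rule, and the toy tokenizer are hypothetical, and plain Python callables stand in for the remote annotation servers.

```python
import time


class AnnotationCache:
    """Hypothetical cache in front of several annotators (not the Curator API)."""

    def __init__(self, annotators, max_age_seconds=3600.0, clock=time.time):
        self.annotators = annotators      # view name -> callable(text) -> annotation
        self.max_age = max_age_seconds    # assumed staleness rule: age in seconds
        self.clock = clock                # injectable clock, handy for testing
        self._store = {}                  # (view, text) -> (timestamp, annotation)

    def get(self, view, text):
        """Return a cached annotation, recomputing it if missing or stale."""
        key = (view, text)
        now = self.clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] <= self.max_age:
            return cached[1]
        annotation = self.annotators[view](text)   # ask the underlying annotator
        self._store[key] = (now, annotation)
        return annotation


# Toy annotator that records how often it is actually called.
calls = []


def toy_tokenizer(text):
    calls.append(text)
    return text.split()


fake_now = [0.0]
cache = AnnotationCache({"tokens": toy_tokenizer},
                        max_age_seconds=10.0,
                        clock=lambda: fake_now[0])

first = cache.get("tokens", "a b")    # computed by the annotator
second = cache.get("tokens", "a b")   # served from the cache
fake_now[0] = 11.0                    # advance the fake clock past max_age
third = cache.get("tokens", "a b")    # stale entry, so it is recomputed
```

The client asks one object for a view and never needs to know which annotator produced it or whether the result was cached.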

Illinois Named Entity Tagger  

This is a state-of-the-art NE tagger that tags plain text with named entities. The newest version can tag with the "classic" label set (people / organizations / locations / miscellaneous) or a larger (18-label) set defined by the OntoNotes corpus. It uses gazetteers extracted from Wikipedia, word class models derived from unlabeled text, and expressive non-local features. The best performance is 90.8 F1 on the CoNLL03 shared task data.

Illinois Wikifier  

This system identifies "important expressions" in the input text and cross-links them to Wikipedia.

Illinois Quantifier  

A Quantity Detector and Standardizer

Illinois NLP Pipeline  

The Illinois NLP Pipeline is a stand-alone package that integrates tokenization, POS tagging, Chunking, and NER tagging. It provides a programmatic interface via Curator data structures.

Illinois Lemmatizer  

The Illinois Lemmatizer combines WordNet-based lemmatizers with some additional heuristics, and can populate Views of Edison's TextAnnotation and Curator's Record data structures.


IllinoisCloudNLP is a software framework that allows users to run a set of the Cognitive Computation Group's NLP tools on Amazon's cloud computing framework. The software makes it straightforward for experts and non-experts to process large corpora with state-of-the-art NLP tools quickly, on demand, at a reasonable cost, and with minimal local hardware requirements.

Illinois Semantic Role Labeler (SRL)  

The Semantic Role Labeler identifies the verb-argument structure in a sentence. Specifically, it labels the sentence with Propbank-style labels. This tool is a machine-learning based system that uses SNoW and FEX for local classification decisions, and Integer Linear Programming to make global inferences about sets of these local decisions.

Illinois Part of Speech Tagger  

This is an implementation of our SNoW-based POS tagger for use with LBJ.

Illinois Chunker  

A classifier that partitions plain text into sequences of semantically related words, indicating a shallow (i.e., non-hierarchical) phrase structure.

Illinois Coreference Package  

A Coreference Resolver, based on LBJ, trained on the ACE 2004 corpus.

Illinois Temporal Expression Extractor  

The Illinois Temporal Extractor processes documents and extracts temporal expressions, relating them to each other and optionally to a reference date.

Other Packages

Descartes: Dataless Classification  

This software provides an API for performing dataless classification.


Given a research area as a query, this package returns names of experts in this area.


A set of utilities to simplify interactions with Curator (or the stand-alone NLP pipeline).

Wikipedia API  

Extracts plain text, interwiki links, and categories from a compressed Wikipedia XML dump.

Illinois Car Detection software  

This software was used in the research described in the paper

Illinois Lifted First-Order Probabilistic Inference  

This package implements a Lifted First-Order Probabilistic Inference algorithm.

Illinois CoRanker: an algorithm for NE discovery  

An implementation of CoRanker, an algorithm for Named Entity discovery from multilingual comparable corpora.

pySNoW: A Python interface to SNoW  

pySNoW is a minimal Python interface to the SNoW (Sparse Network of Winnows) learning architecture. It is meant to be faithful to the original command line interface and provides access to the train, test, evaluate, interactive, and server modes directly from Python. pySNoW requires SNoW version 3.2.0.

Illinois Unsupervised Rank Aggregation  

An implementation of an unsupervised learning algorithm for rank aggregation with distance-based models.
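Distance-based models of rankings need a distance between two rankings. The description above does not say which distance this package uses, so the sketch below assumes one common choice, the Kendall tau distance (the number of item pairs that two rankings order differently). The function name is hypothetical, and the aggregation algorithm itself is not shown.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count item pairs that the two rankings place in opposite order."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same items")
    discordant = 0
    for x, y in combinations(ranking_a, 2):
        # x precedes y in ranking_a; the pair is discordant if ranking_b disagrees.
        if pos_b[x] > pos_b[y]:
            discordant += 1
    return discordant


same = kendall_tau_distance(["a", "b", "c"], ["a", "b", "c"])
swapped = kendall_tau_distance(["a", "b", "c"], ["b", "a", "c"])
reversed_order = kendall_tau_distance(["a", "b", "c"], ["c", "b", "a"])
```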

Maximum Subsequence Segmentation: Extract article text from HTML pages  

An implementation of Maximum Subsequence Segmentation for extracting article text (or other blocks of content) from HTML documents.
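The core idea can be sketched in a few lines: assign each token of the page a score, then keep the contiguous run of tokens with the largest total score. The sketch below is a simplified illustration rather than this package's code. The regex tokenizer, the function names, and the fixed scores (`word_score`, `tag_score`) are assumptions made for the example; a real system would likely use a more careful way of scoring tokens.

```python
import re

# A token is either an HTML tag or a run of non-space text outside tags.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    return TOKEN_RE.findall(html)


def max_subsequence(scores):
    """Return (start, end) of the contiguous run with the largest sum.

    `end` is exclusive; an empty run (0, 0) is returned if no score is positive.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:                 # a non-positive prefix never helps
            run_sum, run_start = 0.0, i
        run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


def extract_text(html, word_score=1.0, tag_score=-2.0):
    """Keep the words inside the best-scoring token run (scores are assumed values)."""
    tokens = tokenize(html)
    scores = [tag_score if t.startswith("<") else word_score for t in tokens]
    start, end = max_subsequence(scores)
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))


html = (
    "<html><body><div><a>Home</a><a>About</a></div>"
    "<p>The quick brown fox jumps over the lazy dog</p>"
    "<p>and keeps running through the field</p>"
    "<div><a>Contact</a></div></body></html>"
)
article = extract_text(html)
```

Tag-dense regions such as navigation bars score negatively and are cut off, while the two word-dense paragraphs are kept together because the words between them outweigh the few tags that separate them.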

Similarity Packages

LLM (Lexical Level Matcher)  

LLM is an asymmetric similarity measure between spans of text.

Illinois WNSim: WordNet-based Similarity Metric  

WNSim provides a WordNet-based similarity metric that computes a symmetric similarity score between a pair of words or phrases. It is written in C++ but runs as an XML-RPC service, so it can be used by applications written in other languages.

Illinois WNSim (Java) WordNet-based Similarity Metric  

This is a Java version of the C++ WNSim tool, which computes a similarity score based on relative positions of the compared terms within the WordNet hierarchy.

NESim: Named Entity Similarity Metric  

NESim provides a similarity metric that computes a similarity score between a pair of Named Entities (People, Organizations, Locations, and Misc). It is written in Java but runs as an XML-RPC service, so it can be used by applications written in other languages.

Medical NLP Packages

Illinois Medical Coreference  

This provides the source code for the Medical Coreference Project.

Illinois Medical NER  

This software finds medical concepts in text.

Illinois Medical Drug Abuse Detector  

This software finds drug abuse events using medical set expansion.