Maximum Subsequence Segmentation Data
Here you can download the data we used in our paper, Extracting Article Text
from the Web with Maximum Subsequence Segmentation (Pasternack & Roth, 2009).
You can find a live demo of the supervised method from our paper
here.
The data is compressed with 7-Zip (.7z) and consists of two SQLite databases:
one contains the Myriad 40 examples, and the other (ate.db) contains the
remainder of the data (training examples and the Big 5). Note that many of the
sources have more labeled examples than we used. This is especially true of the
training sources, for which many thousands of additional automatically labeled
examples are available; we used only the first 2000 returned from the database.
Each database contains three tables:
- Categories -- you can ignore this table. In the ate.db file, this gives
my content-type categorizations of a number of documents from the evaluation
sources.
- Pages -- the documents themselves, including the HTML, the source (which
news website it came from; important in the case of the ate.db file for
distinguishing the training examples from the Big 5 evaluation set), and the
document ID.
- Extractions -- the labels that describe which part of the document
constitutes an extraction of the text. The document they correspond to is
indicated by the document ID. Recall from the paper that a document may
have multiple extractions. The start and length indicate the character
offset in the document's HTML where the extraction begins, and how many
characters it contains, respectively. You can ignore the Quality column if
you wish, as its information is redundant: 0 indicates an automatic extraction
(for the training data), 1 indicates a hand-labeled extraction (for evaluation
data) and 2 indicates the shortest hand-labeled extraction (for evaluation
data).
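As a rough illustration of how the start and length values map onto a
document's HTML, here is a minimal Python sketch. The table names come from the
list above, but the column names (DocumentID, HTML, Source, Start, Length,
Quality) are assumptions, as is the reading of start as a 0-based offset into
the decoded HTML string; check the real schema before relying on any of it. The
build_toy_db helper is purely hypothetical and only exists so the sketch runs
without the download.

```python
import sqlite3


def extraction_text(html, start, length):
    """Return `length` characters of `html` beginning at offset `start`.

    Assumption: `start` is a 0-based character offset into the HTML string.
    """
    return html[start:start + length]


def iter_extractions(conn, quality=None):
    """Yield (document_id, quality, extracted_text) for each extraction.

    Column names are assumed, not taken from the real schema.
    Pass quality=0, 1 or 2 to keep only that kind of label.
    """
    sql = (
        "SELECT p.DocumentID, e.Quality, p.HTML, e.Start, e.Length "
        "FROM Extractions AS e "
        "JOIN Pages AS p ON p.DocumentID = e.DocumentID"
    )
    params = ()
    if quality is not None:
        sql += " WHERE e.Quality = ?"
        params = (quality,)
    sql += " ORDER BY e.DocumentID, e.Quality"
    for doc_id, q, html, start, length in conn.execute(sql, params):
        yield doc_id, q, extraction_text(html, start, length)


def build_toy_db():
    """Build a tiny in-memory stand-in with the assumed (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Pages (DocumentID INTEGER PRIMARY KEY, Source TEXT, HTML TEXT)"
    )
    conn.execute(
        "CREATE TABLE Extractions "
        "(DocumentID INTEGER, Start INTEGER, Length INTEGER, Quality INTEGER)"
    )
    html = "<html><body><p>Hello world.</p></body></html>"
    start = html.index("Hello")
    conn.execute("INSERT INTO Pages VALUES (1, 'example', ?)", (html,))
    # One document with two extractions: a hand-labeled one (Quality 1)
    # and the shortest hand-labeled one (Quality 2).
    conn.execute(
        "INSERT INTO Extractions VALUES (1, ?, ?, 1)", (start, len("Hello world."))
    )
    conn.execute(
        "INSERT INTO Extractions VALUES (1, ?, ?, 2)", (start, len("Hello"))
    )
    return conn


if __name__ == "__main__":
    for row in iter_extractions(build_toy_db()):
        print(row)
```

To try this against the real data, replace build_toy_db() with
sqlite3.connect("ate.db") and adjust the column names to whatever
"SELECT sql FROM sqlite_master" reports for each table.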
The remainder of the schema should (hopefully) be self-explanatory, but if you
have questions, please email Jeff Pasternack at jpaster2 (at) uiuc.edu.
Here's the file (1.37GB). Be warned that it will
decompress to roughly 27GB.