Maximum Subsequence Segmentation Data

Here you can download the data we used in our paper, Extracting Article Text from the Web with Maximum Subsequence Segmentation (Pasternack & Roth, 2009).  You can find a live demo of the supervised method from our paper here.

The data is compressed with 7zip (.7z) and consists of two SQLite databases: one contains the Myriad 40 examples, and the other contains the remainder of the data (training examples and the Big 5).  Note that many of the sources have more labeled examples than we used.  This is especially true of the training sources, for which many thousands of additional automatically labeled examples are available; we used only the first 2000 returned from the database.

Each database contains three tables:

  1. Categories -- you can ignore this table.  In the ate.db file, it gives my content-type categorizations of a number of documents from the evaluation sources.
  2. Pages -- the documents themselves, including the HTML, the source (which news website it came from; important in the case of the ate.db file for distinguishing the training examples from the Big 5 evaluation set), and the document ID.
  3. Extractions -- the labels that describe which part of the document constitutes an extraction of the text.  The document they correspond to is indicated by the document ID.  Recall from the paper that a document may have multiple extractions.  The start and length indicate the character offset in the document's HTML where the extraction begins, and how many characters it contains, respectively.  You can ignore the Quality column if you wish, as its information is redundant: 0 indicates an automatic extraction (for the training data), 1 indicates a hand-labeled extraction (for evaluation data) and 2 indicates the shortest hand-labeled extraction (for evaluation data).
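As a rough illustration of how the Pages and Extractions tables fit together, here is a minimal Python sketch that recovers the labeled text of one document by slicing its HTML with the start and length values.  The table names come from the description above, but every column name in the query (ID, HTML, DocumentID, Start, Length, Quality) is a guess rather than the actual schema, so run list_tables and inspect the real columns before relying on it.  The sketch also assumes that the HTML is stored as text and that the start offset is 0-based; neither is confirmed here.

```python
import sqlite3

# ASSUMPTION: all column names below are hypothetical placeholders.
# Check the real schema first, e.g. with:
#   SELECT name, sql FROM sqlite_master WHERE type = 'table';
QUERY = """
    SELECT p.HTML, e.Start, e.Length, e.Quality
    FROM Extractions AS e
    JOIN Pages AS p ON p.ID = e.DocumentID
    WHERE p.ID = ?
"""


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def extraction_texts(conn, doc_id):
    """Return (quality, extracted_html) pairs for one document.

    A document may have several extractions, so the result is a list.
    """
    results = []
    for html, start, length, quality in conn.execute(QUERY, (doc_id,)):
        # ASSUMPTION: start is a 0-based character offset into the HTML,
        # and length is a count of characters (not bytes).
        results.append((quality, html[start:start + length]))
    return results
```

To try it on the real data, open a connection with sqlite3.connect("ate.db") after decompressing the archive, call list_tables(conn) to confirm the table names, and adjust the column names in QUERY to match what you find.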

The remainder of the schema should (hopefully) be self-explanatory, but if you have questions, please email Jeff Pasternack at jpaster2 (at) uiuc.edu.

Here's the file (1.37GB).  Be warned that it will decompress to roughly 27GB.