Data for Content Structure Induction

This page provides the data used for training and testing our method for text structure induction. A detailed description of these corpora is given in "Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization" (R. Barzilay and L. Lee, NAACL'04). This page is based upon work supported in part by the National Science Foundation under grants ITR/IM IIS-0081334 and IIS-0329064 and by an Alfred P. Sloan Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Sloan Foundation.

The first four sets contain articles from the North American News Text Corpus (NANTC), available through the LDC. Due to copyright restrictions, we provide only the NANTC file names for these sets. The fifth collection, extracted from the National Transportation Safety Board's database, includes the actual files.