Data for Content Structure Induction
This page provides the data used for training and testing our method for text
structure induction. A detailed description of these corpora is given in "Catching
the Drift: Probabilistic Content Models, with Applications to Generation and
Summarization" (R. Barzilay and L. Lee, NAACL'04).
This page is based upon work supported in part by the National Science
Foundation under grants ITR/IM IIS-0081334 and IIS-0329064 and by an Alfred P.
Sloan Research Fellowship. Any opinions, findings, and conclusions or
recommendations expressed above are those of the authors and do not necessarily
reflect the views of the National Science Foundation or the Sloan Foundation.
The first four sets contain articles from the North American News Text Corpus
(NANTC), available through the LDC. Due to copyright restrictions, we provide
only the NANTC file names for these sets. The fifth collection (extracted from the
National Transportation Safety Board's database) provides access to the actual files.