Author: Erdong Chen
May 15, 2007

This package contains the source code, data, and binaries for the incremental online text structuring system described in the following publication:

Incremental Online Text Structuring with Hierarchical Ranking 
Erdong Chen, Benjamin Snyder, and Regina Barzilay 
To appear in the Joint Meeting of the Conference on Empirical Methods in Natural Language Processing and the Conference on Computational Natural Language Learning (EMNLP-CoNLL), 2007.

Description of Code
The directory contents of this distribution are as follows:

code/PorterStemmer.cpp	   - Porter stemmer (http://www.tartarus.org/~martin/PorterStemmer/c_thread_safe.txt)
code/bay_ordering.cpp	   - Bayesian module to estimate lexical adjacency (see Lapata 2003)
code/dictionary.cpp	   - universal dictionary hash table
code/hpercept.cpp	   - hierarchical perceptron
code/minipar.cpp	   - interface with MINIPAR parser (Lin 1999)
code/rerank.cpp		   - linear perceptron (Collins 2002)
code/sen_order.cpp	   - main program
code/stopword.txt	   - stop words list
code/temporal.cpp	   - interface with temporal expression tagger (Mani and Wilson 2000)
code/sen_order		   - binary file of the system


Building & Running Instructions

This is a C++-based, platform-independent implementation. To compile this package, you need g++ 3.3 or higher installed.

To compile the whole package, run the following command in the directory "./code":
make

You can modify the paths to the data files in sen_order.cpp. By default, the program assumes that the directory "./data" is located in the same parent directory as the directory "./code".


To run the program, run the following command in the directory "./code":
./sen_order workDir/

(All files generated in the directory "workDir" are temporary or debugging files.)


Description of Data
We extracted articles from Wikipedia. Please refer to http://en.wikipedia.org/wiki/Wikipedia:Copyrights for copyright details. The following is quoted from Wikipedia:

------------------------------------------
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.
A copy of the license is included in the section entitled "GNU Free Documentation License". Content on Wikipedia is covered by disclaimers. 
------------------------------------------


We extracted 18748 Wikipedia articles that belong to the category "Living People". In the end, 1503 biography articles and 7949 revisions that involve one-sentence insertions are used in our experiments. 4051 sentence insertions are extracted from the corpus.

We removed some references and links from the Wikipedia biographies. Our corpus contains only articles that have more than one section. When a sentence is inserted between paragraphs, by convention we treat it as part of the previous paragraph.
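
The paragraph convention above can be illustrated with a minimal sketch. It is not taken from the distributed code: the type Paragraph and the function insertBetweenParagraphs are hypothetical names, and the sketch assumes a document is held in memory as a list of paragraphs, each a list of sentences.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory representation: a paragraph is a list of sentences.
typedef std::vector<std::string> Paragraph;

// A sentence inserted in the gap after paragraph `previous` is treated as
// the last sentence of that paragraph; no new paragraph is created.
void insertBetweenParagraphs(std::vector<Paragraph>& paragraphs,
                             std::size_t previous,
                             const std::string& sentence) {
    if (previous >= paragraphs.size()) {
        throw std::out_of_range("no paragraph precedes this gap");
    }
    paragraphs[previous].push_back(sentence);
}
```

With this convention the number of paragraphs in a document does not change when a sentence is inserted between two of them.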

1. data/bio.inf
This file contains three parts.

Part1: biography information
The first line contains the number of biographies in the corpus
For each biography, the format is as follows (each item takes a single line):
  1. biography ID (starting from 0)
  2. name of the biography
  3. latest revisionID of the biography
  4. number of categories that the biography belongs to
  5. the biography's category IDs (each ID takes a line)

Part2: category information
The first line contains the number of categories in the corpus.
Each of the following lines contains a category name.

Part3: document information
The first line contains the number of documents in the corpus
For each document, the format is as follows (each number takes a single line):
  1. revisionID (you can check it online by )
  2. ID of the biography the document belongs to
  3. number of sentences in the document
  4. number of sections
     number of sentences in each section 
  5. number of paragraphs
     number of sentences in each paragraph 

2. data/bio.dat 
This file contains all the sentences of the documents, in the same order as the documents appear in bio.inf.

3. data/bio.pos
This file contains all the sentences of the documents with POS tags, in the same order as the documents appear in bio.inf.
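
Because bio.dat and bio.pos follow the document order of bio.inf, the position of each document's sentences can be derived from the per-document sentence counts. The sketch below computes those positions as running sums. The function name documentOffsets is hypothetical, and the sketch assumes that each sentence occupies exactly one line in these files, which this README does not state explicitly.

```cpp
#include <cstddef>
#include <vector>

// Given the number of sentences of each document (in bio.inf order), return
// the index of the first sentence of each document in the flat sentence
// list, plus a final entry holding the total number of sentences.
std::vector<std::size_t> documentOffsets(
        const std::vector<std::size_t>& sentenceCounts) {
    std::vector<std::size_t> offsets;
    offsets.reserve(sentenceCounts.size() + 1);
    std::size_t total = 0;
    for (std::size_t i = 0; i < sentenceCounts.size(); ++i) {
        offsets.push_back(total);
        total += sentenceCounts[i];
    }
    offsets.push_back(total);
    return offsets;
}
```

Under the one-sentence-per-line assumption, the sentences of document k would be the lines with indices offsets[k] up to, but not including, offsets[k + 1].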

4. data/testcase.dat
This file contains all the insertion cases. 
Each test case has two lines, and the format is as follows: 
ID1 ID2
an inserted sentence with POS tags

ID1: the revisionID of the version before a sentence is inserted 
ID2: the revisionID of the version after a sentence is inserted 
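
The two-line test case format can be read as follows. This is a minimal sketch under stated assumptions rather than the code used by sen_order: InsertionCase and readInsertionCases are hypothetical names, a C++11 compiler is assumed, and lines that do not start with two integers are skipped as blank or malformed.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct InsertionCase {
    long beforeRevisionId = 0;   // ID1: revision before the insertion
    long afterRevisionId = 0;    // ID2: revision after the insertion
    std::string taggedSentence;  // inserted sentence, POS tags left as-is
};

std::vector<InsertionCase> readInsertionCases(std::istream& in) {
    std::vector<InsertionCase> cases;
    std::string idLine;
    while (std::getline(in, idLine)) {
        std::istringstream ids(idLine);
        InsertionCase c;
        // Expect "ID1 ID2"; skip the line if it does not match.
        if (!(ids >> c.beforeRevisionId >> c.afterRevisionId)) continue;
        // The next line holds the inserted sentence with its POS tags.
        if (!std::getline(in, c.taggedSentence)) break;
        cases.push_back(c);
    }
    return cases;
}
```

The sentence line is kept as a single string because the exact POS tag notation is not described here. Both revision IDs can be looked up in Part 3 of bio.inf.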


5. data/minipar.dat & data/temporal.dat 
These files contain subject/object and temporal features produced for my own program using MINIPAR and TIMEX2.
However, they can be used only with my program, because my program changes the order of sentences.

6. Statistics
The following are some statistics of the data (including both training and testing data).
Avg # tokens in sentence = 23.6939
Max # tokens in sentence = 111
Min # tokens in sentence = 9
Avg # sentences in documents = 32.8941
Avg # sections in documents = 3.61812
Avg # paragraphs in documents  = 10.9612
Avg # paragraphs in sections  = 3.02954
Fraction of only-one-sentence-in-a-paragraph insertions = 0.469514
Fraction of paragraph-end insertions = 0.887682
Fraction of document-end insertions = 0.231054

# training insertions = 3240
# test insertions = 811
# development docs for Bayesian estimation = 11428

If you have any questions, contributions, or bug reports, feel free to contact me at edc AT csail.mit.edu.

This code is copyright 2007 by the Massachusetts Institute of Technology. The system is reserved for academic and research use only.