Author: Erdong Chen
May 15, 2007

This package contains the source code, data, and binaries for the
incremental online structuring system described in the following
publication:

  Incremental Online Text Structuring with Hierarchical Ranking
  Erdong Chen, Benjamin Snyder, and Regina Barzilay
  to appear in Joint Meeting of Conference on Empirical Methods in
  Natural Language Processing and Conference on Computational Natural
  Language Learning (EMNLP-CoNLL) 2007.


Description of Code

The directory contents of this distribution are as follows:

  code/PorterStemmer.cpp - Porter stemmer
                           (http://www.tartarus.org/~martin/PorterStemmer/c_thread_safe.txt)
  code/bay_ordering.cpp  - Bayesian module to estimate lexical adjacency
                           (see Lapata 2003)
  code/dictionary.cpp    - universal dictionary hash table
  code/hpercept.cpp      - hierarchical perceptron
  code/minipar.cpp       - interface with the MINIPAR parser (Lin 1999)
  code/rerank.cpp        - linear perceptron (Collins 2002)
  code/sen_order.cpp     - main program
  code/stopword.txt      - stop word list
  code/temporal.cpp      - interface with the temporal expression tagger
                           (Mani and Wilson 2000)
  code/sen_order         - binary file of the system


Building & Running Instructions

This is a C++-based, platform-independent implementation. To compile
this package, you need g++ 3.3 or higher installed.

To compile the whole package, run the following command in the
directory "./code":

  make

You can modify the paths pointing to the data files in sen_order.cpp.
By default, the program assumes that the directory "./data" is placed in
the same directory as "./code".

To run the program, run the following command in the directory "./code":

  ./sen_order workDir/

(All files generated in the directory "workDir" are temporary or
debugging files.)


Description of Data

We extracted articles from Wikipedia. Please refer to
http://en.wikipedia.org/wiki/Wikipedia:Copyrights for copyright details.
The following is quoted from Wikipedia:

------------------------------------------
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover Texts, and with no Back-Cover
Texts. A copy of the license is included in the section entitled "GNU
Free Documentation License". Content on Wikipedia is covered by
disclaimers.
------------------------------------------

We extracted 18748 Wikipedia articles that belong to the category
"Living People". Of these, 1503 biography articles and 7949 revisions
that involve one-sentence insertions are used in our experiments. 4051
sentence insertions are extracted from the corpus. We removed some
references and links from the Wikipedia biographies. Our corpus only
contains articles that have more than one section. When a sentence is
inserted between paragraphs, by convention we treat it as part of the
previous paragraph.

1. data/bio.inf

   This file contains three parts.

   Part 1: biography information
     The first line contains the number of biographies in the corpus.
     For each biography, the format is as follows
     (each field takes a single line):
       1. biography ID (starting from 0)
       2. name of the biography
       3. latest revisionID of the biography
       4. number of categories that the biography belongs to
       5. the biography's category IDs (each ID takes a line)

   Part 2: category information
     The first line contains the number of categories in the corpus.
     Each of the following lines contains a category name.

   Part 3: document information
     The first line contains the number of documents in the corpus.
     For each document, the format is as follows
     (each number takes a single line):
       1. revisionID (you can check it online by )
       2. ID of the biography the document belongs to
       3. number of sentences in the document
       4. number of sections
          number of sentences in each section
       5. number of paragraphs
          number of sentences in each paragraph

2. data/bio.dat

   This file contains all the sentences in the documents, in the same
   order as in bio.inf.

3. data/bio.pos

   This file contains all the sentences in the documents with POS tags,
   in the same order as in bio.inf.

4. data/testcase.dat

   This file contains all the insertion cases. Each test case has two
   lines, and the format is as follows:

     ID1 ID2
     an inserted sentence with POS tags

   ID1: the revisionID of the version before the sentence is inserted
   ID2: the revisionID of the version after the sentence is inserted

5. data/minipar.dat & data/temporal.dat

   These files contain subjects/objects and temporal features produced
   for my own program using MINIPAR and TIMEX2. However, they can be
   used only with my program, because my program changes the order of
   sentences.

6. Statistics

   The following are some statistics of the data (including both
   training and testing data).

   Avg # tokens in sentence = 23.6939
   Max # tokens in sentence = 111
   Min # tokens in sentence = 9
   Avg # sentences in documents = 32.8941
   Avg # sections in documents = 3.61812
   Avg # paragraphs in documents = 10.9612
   Avg # paragraphs in sections = 3.02954
   Fraction of only-one-sentence-in-a-paragraph insertions = 0.469514
   Fraction of paragraph-end insertions = 0.887682
   Fraction of document-end insertions = 0.231054
   # training insertions = 3240
   # test insertions = 811
   # development docs for Bayesian estimation = 11428


If you have any questions, contributions, or bug reports, feel free to
contact me at edc AT csail.mit.edu .

This code is copyright 2007 by the Massachusetts Institute of
Technology. The system is reserved for academic and research use only.
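
Note on reading data/testcase.dat

The two-line record layout of data/testcase.dat described above (a line
with ID1 and ID2, followed by a line with the inserted POS-tagged
sentence) can be read with a short loop. The following is only a minimal
sketch and not code from this package: the struct InsertionCase, the
function readInsertionCases, and the skipping of blank lines are
assumptions made for illustration. The POS-tagged sentence is kept as a
raw string because the tag format is not specified here.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; the names are illustrative, not from the package.
struct InsertionCase {
    long long beforeRevisionId;  // ID1: revision before the sentence was inserted
    long long afterRevisionId;   // ID2: revision after the sentence was inserted
    std::string taggedSentence;  // inserted sentence with POS tags, left unparsed
};

// Reads two-line records: "ID1 ID2" followed by the tagged sentence.
// Stops at end of input or at the first malformed record.
std::vector<InsertionCase> readInsertionCases(std::istream& in) {
    std::vector<InsertionCase> cases;
    std::string idLine;
    std::string sentenceLine;
    while (std::getline(in, idLine)) {
        if (idLine.empty()) {
            continue;  // assumption: tolerate blank lines between records
        }
        std::istringstream ids(idLine);
        InsertionCase c;
        if (!(ids >> c.beforeRevisionId >> c.afterRevisionId)) {
            break;  // ID line did not contain two integers
        }
        if (!std::getline(in, sentenceLine)) {
            break;  // record is missing its sentence line
        }
        c.taggedSentence = sentenceLine;
        cases.push_back(c);
    }
    return cases;
}
```

To try it, open the file with std::ifstream (for example
"../data/testcase.dat" when running from "./code", assuming the default
directory layout) and pass the stream to readInsertionCases. If the file
is complete, the number of records returned should equal the total
number of insertions reported in the statistics.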