Author: Pawan Deshpande
        http://people.csail.mit.edu/pawand
Date:   February 2, 2007

==============================================
CONTENTS

This package contains the source code for the algorithms described in the
following publication:

  Pawan Deshpande, Regina Barzilay, David R. Karger.
  Randomized Decoding for Selection-and-Ordering Problems.
  In Proceedings of NAACL 2007.

NOTE: This package does not include the maximum entropy classifier used to
generate the selection scores. Instead, we have provided sample input and
output files of the classifier.

==============================================
LICENSING

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

==============================================
DIRECTORY CONTENTS

run.bash                  - bash shell script to run the code
*.py                      - Python scripts that perform most of the operations
java/selection.jar        - Java jar file that generates the BoosTexter input
                            file ( source files included in the jar )
extras/stopwords.txt,
extras/more_stopwords.txt - stop word and auxiliary word files
README                    - this documentation
selection_out/*           - sample selection score files
maxent/                   - maximum entropy classifier scripts and binary
                            [ Not included in this release ]

==============================================
DEPENDENCIES

You need to have the following installed to run the code:
-Python2.4
-BoosTexter
-MXPOST Tagger
-Penn Treebank Tokenizer
-SRILM Toolkit

Optional:
-Python2.4-Psyco ( for a speedup )

==============================================
CONFIGURATION

There are two primary files where configuration options are specified:

1. run.bash
   In run.bash you need to specify $BOOSTEXTER - the path to the BoosTexter
   binary - as well as various options for each dataset ( see the sketch
   below ).

2. config.py
   The variables in config.py are labelled, and you should generally only
   need to change the ones that point to external binaries ( SRILM, the
   tagger, etc. ).
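As an illustration, here is a minimal sketch of the run.bash settings
described above. Only the $BOOSTEXTER variable name comes from this README;
the path shown is a placeholder, and the note about per-dataset options
simply restates item 1 - consult run.bash itself for the exact variable
names and values.

  # --- illustrative excerpt, not the actual run.bash ---
  # Path to the BoosTexter binary ( adjust to your installation ).
  BOOSTEXTER=/usr/local/bin/boostexter

  # The various per-dataset options mentioned above are also set in
  # run.bash; their names and values are dataset-specific.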
==============================================
HOW THIS WORKS

Below is the pipeline of steps involved. (P) indicates that a step is
parallelizable. For parallelized steps, the code automatically maintains
lock files to prevent redundant operations.

Below the diagram are descriptions and the command to run each step. For
each command, replace dataset with the name of the dataset ( e.g.
algorithms ) and assign an arbitrary job_id ( e.g. 3 ). The job id should
be the same across all steps of a single experiment.

 -----------------------
 |    1. INITIALIZE    |
 -----------------------
           ||
           \/
 -----------------------
 |      2. TRAIN       |
 -----------------------
           ||
           \/
 -----------------------
 |       3. TAG        |
 -----------------------
           ||
           \/
 -----------------------
 |    4. BUILD (P)     |
 -----------------------
           ||
           \/
 -----------------------
 |    5. DECODE (P)    |
 -----------------------

1. INITIALIZE: Creates the directories needed for the remainder of the
   pipeline.

   $ bash run.bash dataset initialize job_id

2. TRAIN: Computes selection scores for each stem in each document:
   - Generates feature vectors in Java
   - Trains with BoosTexter
   - Creates binary feature discretizations from the BoosTexter training
   - Creates maximum entropy feature vectors
   - Trains the maximum entropy classifier [ Not included ]
   - Computes selection scores [ Not included ]

   $ bash run.bash dataset train job_id

3. TAG: Tags the document text and the titles of the training documents.

   $ bash run.bash dataset taglm job_id

4. BUILD: Builds the graph pickles.

   $ bash run.bash dataset build job_id

5. DECODE: Decodes each graph using beam search or color-coding. This
   package does not include ILP decoding.

   For beam search decoding, run:
   $ bash run.bash dataset beam job_id

   For color-coding decoding, run:
   $ bash run.bash dataset rand job_id

==============================================
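EXAMPLE RUN

The commands below chain the steps above for the example dataset name
( algorithms ) and job id ( 3 ) used in HOW THIS WORKS. This is only a
sketch: it assumes run.bash and config.py have been configured as described
under CONFIGURATION, and that the lock files mentioned above let the (P)
steps be started concurrently without repeating work.

$ bash run.bash algorithms initialize 3
$ bash run.bash algorithms train 3
$ bash run.bash algorithms taglm 3

# BUILD and DECODE are marked (P); the same command may be started on
# several machines, relying on the lock files to avoid duplicated work.
$ bash run.bash algorithms build 3

# Decode with beam search, or with color-coding ( rand ):
$ bash run.bash algorithms beam 3
$ bash run.bash algorithms rand 3

==============================================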