Author: Pawan Deshpande
        http://people.csail.mit.edu/pawand
Date:   February 2, 2007

==============================================
CONTENTS

This package contains the source code for the algorithms described in the
following publication:

  Pawan Deshpande, Regina Barzilay, David R. Karger.
  Randomized Decoding for Selection-and-Ordering Problems.
  In Proceedings of NAACL 2007.

NOTE: This package does not include the maximum entropy classifier used to
generate the selection scores. Instead, we have provided sample input and
output files of the classifier.

==============================================
LICENSING

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

==============================================
DIRECTORY CONTENTS

run.bash                  - bash shell script to run the code
*.py                      - Python scripts that perform most of the operations
java/selection.jar        - Java jar file that generates the BoosTexter input
                            file ( source files included in the jar )
extras/stopwords.txt,
extras/more_stopwords.txt - stop word and auxiliary word files
README                    - this documentation
selection_out/*           - sample selection score files
maxent/                   - maximum entropy classifier scripts and binary
                            [ Not included in this release ]

==============================================
DEPENDENCIES

You need to have the following installed to run the code:
-Python2.4
-BoosTexter
-MXPOST Tagger
-Penn Treebank Tokenizer
-SRILM Toolkit

Optional:
-Python2.4-Psyco ( for a speedup )

==============================================
CONFIGURATION

There are two primary files where configuration options are specified:

1. run.bash
   In run.bash you need to specify $BOOSTEXTER - the path to the BoosTexter
   binary - as well as various options for each dataset ( see the sketch
   below ).

2. config.py
   The variables in config.py are labelled, and you should generally only
   need to change the ones that point to external binaries ( SRILM, the
   tagger, etc. ).
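As an illustration, here is a minimal sketch of the run.bash settings
described above. Only the $BOOSTEXTER variable name comes from this README;
the path shown is a placeholder, and the note about per-dataset options
simply restates item 1 - consult run.bash itself for the exact variable
names and values.

  # --- illustrative excerpt, not the actual run.bash ---
  # Path to the BoosTexter binary ( adjust to your installation ).
  BOOSTEXTER=/usr/local/bin/boostexter

  # The various per-dataset options mentioned above are also set in
  # run.bash; their names and values are dataset-specific.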
==============================================
HOW THIS WORKS

Below is the pipeline of steps involved. (P) indicates that a step is
parallelizable. For parallelized steps, the code automatically maintains
lock files to prevent redundant operations.

Below the diagram are descriptions and the command to run each step. For
each command, replace dataset with the name of the dataset ( e.g.
algorithms ) and assign an arbitrary job_id ( e.g. 3 ). The job id should
be the same across all steps of a single experiment.

 -----------------------
 |    1. INITIALIZE    |
 -----------------------
           ||
           \/
 -----------------------
 |      2. TRAIN       |
 -----------------------
           ||
           \/
 -----------------------
 |       3. TAG        |
 -----------------------
           ||
           \/
 -----------------------
 |    4. BUILD (P)     |
 -----------------------
           ||
           \/
 -----------------------
 |    5. DECODE (P)    |
 -----------------------

1. INITIALIZE: Creates the directories needed for the remainder of the
   pipeline.

   $ bash run.bash dataset initialize job_id

2. TRAIN: Computes selection scores for each stem in each document:
   - Generates feature vectors in Java
   - Trains with BoosTexter
   - Creates binary feature discretizations from the BoosTexter training
   - Creates maximum entropy feature vectors
   - Trains the maximum entropy classifier [ Not included ]
   - Computes selection scores [ Not included ]

   $ bash run.bash dataset train job_id

3. TAG: Tags the document text and the titles of the training documents.

   $ bash run.bash dataset taglm job_id

4. BUILD: Builds the graph pickles.

   $ bash run.bash dataset build job_id

5. DECODE: Decodes each graph using beam search or color-coding. This
   package does not include ILP decoding.

   For beam search decoding, run:
   $ bash run.bash dataset beam job_id

   For color-coding decoding, run:
   $ bash run.bash dataset rand job_id

==============================================
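EXAMPLE RUN

The commands below chain the steps above for the example dataset name
( algorithms ) and job id ( 3 ) used in HOW THIS WORKS. This is only a
sketch: it assumes run.bash and config.py have been configured as described
under CONFIGURATION, and that the lock files mentioned above let the (P)
steps be started concurrently without repeating work.

$ bash run.bash algorithms initialize 3
$ bash run.bash algorithms train 3
$ bash run.bash algorithms taglm 3

# BUILD and DECODE are marked (P); the same command may be started on
# several machines, relying on the lock files to avoid duplicated work.
$ bash run.bash algorithms build 3

# Decode with beam search, or with color-coding ( rand ):
$ bash run.bash algorithms beam 3
$ bash run.bash algorithms rand 3

==============================================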