12.11.06 The latest version of the code with the anisotropic diffusion smoothing described in the thesis is available upon request.

Download the code.

The javadocs can be accessed here.

This package contains the source code and binaries for the Minimum Cut text 
segmentation system, described in the following publication:

Igor Malioutov, Regina Barzilay. Minimum Cut Model for Spoken Lecture Segmentation. 
In Proceedings of COLING-ACL 2006, pp. 9-16.

 
Copyright (C) 2006 Igor Malioutov
 
    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.
 
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
 
    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

============================================================

The directory contents of this distribution are as follows:

./bin       - binaries and scripts for running the system

    MinCutSeg.jar    - the main package
    jdom.jar, log4j-1.2.8.jar, mtj.jar, options.jar - library dependencies
    run-win.bat      - Windows run script
    run-unix.script  - Unix run script

./config    - sample configuration files

    physics.config   - sample configuration file for the segmenter
    log.config       - optional logger configuration file
    STOPWORD.list    - list of stop words used

./data      - the AI and Physics lecture corpora used in the paper

./source    - source code and the Jakarta Ant build file

./docs      - javadoc documentation pages and library licenses



This is a Java-based, platform-independent implementation. To run this package,
you need the Java Runtime Environment (JRE) 5.0 or higher installed.

In general, to run the MinCut segmenter, set the classpath to include
the library dependencies (jdom.jar, log4j-1.2.8.jar, mtj.jar, options.jar) and the main 
package (MinCutSeg.jar). 
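As an illustration of the classpath setup, the sketch below assembles the
classpath from the jar names listed above and prints a command line for the
segmentation mode. The class name MinCutLauncher, the relative "bin" and
"config" paths, and the argument values (such as -n 5) are assumptions made
for this example; they are not part of the distribution. The sketch uses
String.join, so it needs a newer Java version than the JRE 5.0 minimum.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not shipped with the package.
class MinCutLauncher {
    // Jar names as listed in the directory contents of this distribution.
    static final List<String> JARS = Arrays.asList(
            "MinCutSeg.jar", "jdom.jar", "log4j-1.2.8.jar", "mtj.jar", "options.jar");

    // Joins the jar paths with the given separator (";" on Windows, ":" on Unix).
    static String classpath(String binDir, String separator) {
        List<String> paths = new ArrayList<>();
        for (String jar : JARS) {
            paths.add(binDir + "/" + jar);
        }
        return String.join(separator, paths);
    }

    // Builds the command: java -cp <classpath> <mainClass> <args...>
    static List<String> command(String binDir, String separator,
                                String mainClass, String... args) {
        List<String> cmd = new ArrayList<>(
                Arrays.asList("java", "-cp", classpath(binDir, separator), mainClass));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed paths and values, for illustration only.
        List<String> cmd = command("bin", File.pathSeparator,
                "edu.mit.nlp.segmenter.MinCutSeg",
                "-config", "config/physics.config", "-n", "5",
                "-in", "input.file", "-out", "output.file");
        System.out.println(String.join(" ", cmd));
    }
}
```

The printed line can be pasted into a shell, or the list can be handed to
java.lang.ProcessBuilder to start the segmenter from another Java program.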

On Windows, you can run the system by invoking the run-win.bat script contained in the bin
directory. Modify the second line of the script in a text editor to run in a desired mode or to
take in specific input arguments. On Unix, call run-unix.script from the command line:

./run-unix.script

============================================================
The following running modes are supported:

---Evaluation

java edu.mit.nlp.segmenter.SegmentationScore -config config.file -ref ref.text -hyp hyp.text

Given segmented reference and hypothesis texts, compute the Pk and WindowDiff measures.
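For readers unfamiliar with the two measures, here is a minimal sketch of how
Pk and WindowDiff are commonly defined. It is not the code in
SegmentationScore, whose details (window size, rounding, input handling) may
differ. The class name SegmentationMetrics and the representation of a
segmentation as one segment index per sentence are assumptions made for this
example. Both arrays must have the same length and contain at least two
sentences.

```java
// Hypothetical illustration, not the package's SegmentationScore class.
class SegmentationMetrics {
    // Converts segment lengths, e.g. {4, 4}, into per-sentence segment
    // indices: 0,0,0,0,1,1,1,1.
    static int[] labels(int... segmentLengths) {
        int n = 0;
        for (int len : segmentLengths) n += len;
        int[] out = new int[n];
        int pos = 0;
        for (int seg = 0; seg < segmentLengths.length; seg++) {
            for (int j = 0; j < segmentLengths[seg]; j++) out[pos++] = seg;
        }
        return out;
    }

    // Window size k: half the average reference segment length, at least 1.
    // Rounding to the nearest integer is an assumption of this sketch.
    static int windowSize(int[] ref) {
        int segments = ref[ref.length - 1] + 1;
        return Math.max(1, (int) Math.round(ref.length / (2.0 * segments)));
    }

    // Pk: fraction of windows (i, i+k) where reference and hypothesis
    // disagree on whether the two sentences lie in the same segment.
    static double pk(int[] ref, int[] hyp) {
        int k = windowSize(ref);
        int windows = ref.length - k;
        int errors = 0;
        for (int i = 0; i < windows; i++) {
            boolean refSame = ref[i] == ref[i + k];
            boolean hypSame = hyp[i] == hyp[i + k];
            if (refSame != hypSame) errors++;
        }
        return (double) errors / windows;
    }

    // WindowDiff: fraction of windows where the number of boundaries
    // between sentences i and i+k differs between reference and hypothesis.
    static double windowDiff(int[] ref, int[] hyp) {
        int k = windowSize(ref);
        int windows = ref.length - k;
        int errors = 0;
        for (int i = 0; i < windows; i++) {
            int refBoundaries = ref[i + k] - ref[i];
            int hypBoundaries = hyp[i + k] - hyp[i];
            if (refBoundaries != hypBoundaries) errors++;
        }
        return (double) errors / windows;
    }

    public static void main(String[] args) {
        int[] ref = labels(4, 4);   // reference: two segments of four sentences
        int[] hyp = labels(2, 6);   // hypothesis: boundary placed two sentences early
        System.out.println("Pk = " + pk(ref, hyp));
        System.out.println("WindowDiff = " + windowDiff(ref, hyp));
    }
}
```

Both measures are error rates, so lower values indicate a hypothesis closer to
the reference, and identical segmentations score 0.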

---Optimization

java edu.mit.nlp.segmenter.MinCutSeg  -config config.file -optimize  file1.dev file2.dev ... 

Optimize the parameters on a set of segmented development texts.

---Segmentation

java edu.mit.nlp.segmenter.MinCutSeg  -config config.file -n INT -in input.file -out output.file

Run the MinCut Segmenter to partition a sentence-separated text file into the desired number
of segments. You can also add an optional "-evaluate" flag if the input file contains reference
boundaries, and you would like to evaluate the system on the fly. The configuration file
can contain custom parameter settings or you can use the default parameter settings by
setting the use-custom-parameter-settings flag in the configuration file to false.

The boundaries in the segmented texts are delimited by "==========" markers, following
the convention adopted in Freddy Choi's segmentation corpus. 

Choi, F. Y. 2000. Advances in domain independent linear text segmentation.  
In Proceedings of NAACL 2000. pp 26-33.
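To make the file convention concrete, the sketch below groups the lines of a
segmented text into segments, splitting at the "==========" marker lines. The
class name SegmentedTextReader is an assumption made for this example and is
not part of the package. Whether a file also starts or ends with a marker line
is not stated here, so the sketch tolerates both by dropping empty segments
and blank lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the package's own reader.
class SegmentedTextReader {
    static final String BOUNDARY = "==========";

    // Groups sentence lines into segments; a marker line closes the
    // current segment. Blank lines and empty segments are skipped.
    static List<List<String>> segments(List<String> lines) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equals(BOUNDARY)) {
                if (!current.isEmpty()) {
                    result.add(current);
                    current = new ArrayList<>();
                }
            } else if (!trimmed.isEmpty()) {
                current.add(trimmed);
            }
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }

    public static void main(String[] args) {
        // Made-up sentences, one per line, with marker lines between segments.
        List<String> lines = Arrays.asList(
                "==========",
                "First sentence of the first segment.",
                "Second sentence of the first segment.",
                "==========",
                "Only sentence of the second segment.",
                "==========");
        System.out.println(segments(lines).size() + " segments");
    }
}
```

To read a real file, pass the result of
java.nio.file.Files.readAllLines(path) to segments.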

============================================================
Building Instructions

The build system uses the Jakarta Ant framework (http://jakarta.apache.org/ant/).
To build, install Ant, or use the version of Ant bundled with the Eclipse or IntelliJ IDEA
Java development environments.

To build on Unix, set your current working directory to source, where the file
"build.xml" is located, and enter the command "ant", optionally followed by
a target name (compile, package, clean, or javadoc).

This will generate a file called "MinCutSeg.jar" in the "./package" directory 
together with the javadoc documentation files. 
============================================================
October 11, 2006
IGORM AT mit DOT edu
http://csail.mit.edu/~igorm
============================================================

If you have any questions, contributions, or bug reports, feel free to contact me at igorm AT mit DOT edu.

This code is copyright 2006 by the Massachusetts Institute of Technology. The system is reserved for academic and research use only.