This package contains the source code and binaries for the Minimum Cut text
segmentation system, described in the following publication:
Igor Malioutov, Regina Barzilay. Minimum Cut Model for Spoken Lecture Segmentation.
In Proceedings of COLING-ACL 2006, pp. 9-16.
Copyright (C) 2006 Igor Malioutov
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
============================================================
The directory contents of this distribution are as follows:
./bin - binaries and scripts for running the system
    MinCutSeg.jar - the main package
    jdom.jar, log4j-1.2.8.jar, mtj.jar, options.jar - library dependencies
    run-win.bat - Windows run script
    run-unix.script - Unix run script
./config - sample configuration files
    physics.config - sample configuration file for the segmenter
    log.config - optional logger configuration file
    STOPWORD.list - list of stop words used
./data - the AI and Physics lecture corpora used in the paper
./source - source code and the Jakarta Ant build file
./docs - javadoc documentation pages and library licenses
This is a Java-based, platform-independent implementation. To run this package,
you need the Java Runtime Environment (JRE) 5.0 or higher installed.
In general, to run the MinCut segmenter, set the classpath to include
the library dependencies (jdom.jar, log4j-1.2.8.jar, mtj.jar, options.jar) and the main
package (MinCutSeg.jar).
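As a small illustration of the classpath described above, the following sketch joins the five jar names listed under ./bin into a single classpath string. The class and method names (ClasspathSketch, buildClasspath) are hypothetical and not part of the distribution, and the sketch assumes the jars sit directly in the directory you pass in.

```java
import java.io.File;

// Hypothetical helper, not part of the MinCutSeg distribution.
final class ClasspathSketch {
    // Jar names as listed in the ./bin directory of this distribution.
    static final String[] JARS = {
        "MinCutSeg.jar", "jdom.jar", "log4j-1.2.8.jar", "mtj.jar", "options.jar"
    };

    /** Joins the jars under binDir into one classpath string using the given separator. */
    static String buildClasspath(String binDir, String separator) {
        StringBuilder cp = new StringBuilder();
        for (int i = 0; i < JARS.length; i++) {
            if (i > 0) {
                cp.append(separator);
            }
            // Forward slashes are accepted in Java classpath entries on Windows as well.
            cp.append(binDir).append('/').append(JARS[i]);
        }
        return cp.toString();
    }

    public static void main(String[] args) {
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        System.out.println(buildClasspath("bin", File.pathSeparator));
    }
}
```

The printed string can be passed to the java launcher with its -cp option, followed by one of the main classes named in the running modes section.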
On Windows, you can run the system by invoking the run-win.bat script contained in the bin
directory. Modify the second line of the script in a text editor to run in a desired mode or to
take in specific input arguments. On Unix, call run-unix.script from the command line in the
bin directory:
./run-unix.script
============================================================
The following running modes are supported:
---Evaluation
java edu.mit.nlp.segmenter.SegmentationScore -config config.file -ref ref.text -hyp hyp.text
Given segmented reference and hypothesis texts, compute the Pk and WindowDiff measures.
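To give a feel for what these two measures compute, here is a minimal sketch based on their standard definitions (Pk from Beeferman et al., 1999; WindowDiff from Pevzner and Hearst, 2002). It is not the SegmentationScore implementation from this package. The class name, the per-sentence segment-index representation, and the window size (half the average reference segment length, a common convention) are all assumptions made for illustration.

```java
// Hypothetical illustration, not the package's SegmentationScore class.
final class SegmentationMetricsSketch {

    /** Expands segment lengths (in sentences) into a segment index per sentence. */
    static int[] toSegmentIds(int[] segmentLengths) {
        int total = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            total += segmentLengths[i];
        }
        int[] ids = new int[total];
        int pos = 0;
        for (int s = 0; s < segmentLengths.length; s++) {
            for (int j = 0; j < segmentLengths[s]; j++) {
                ids[pos++] = s;
            }
        }
        return ids;
    }

    /** Assumed convention: window = half the average reference segment length, at least 1. */
    static int defaultWindow(int[] refIds) {
        int numSegments = refIds[refIds.length - 1] + 1;
        return Math.max(1, (int) Math.round(refIds.length / (2.0 * numSegments)));
    }

    /** Pk: fraction of windows where ref and hyp disagree on "same segment?" for the window ends. */
    static double pk(int[] refIds, int[] hypIds, int k) {
        check(refIds, hypIds, k);
        int windows = refIds.length - k;
        int errors = 0;
        for (int i = 0; i < windows; i++) {
            boolean sameRef = refIds[i] == refIds[i + k];
            boolean sameHyp = hypIds[i] == hypIds[i + k];
            if (sameRef != sameHyp) {
                errors++;
            }
        }
        return (double) errors / windows;
    }

    /** WindowDiff: fraction of windows where the number of boundaries differs. */
    static double windowDiff(int[] refIds, int[] hypIds, int k) {
        check(refIds, hypIds, k);
        int windows = refIds.length - k;
        int errors = 0;
        for (int i = 0; i < windows; i++) {
            // Segment indices increase by one per boundary, so the difference counts boundaries.
            int refBoundaries = refIds[i + k] - refIds[i];
            int hypBoundaries = hypIds[i + k] - hypIds[i];
            if (refBoundaries != hypBoundaries) {
                errors++;
            }
        }
        return (double) errors / windows;
    }

    private static void check(int[] refIds, int[] hypIds, int k) {
        if (refIds.length != hypIds.length) {
            throw new IllegalArgumentException("reference and hypothesis differ in length");
        }
        if (k < 1 || k >= refIds.length) {
            throw new IllegalArgumentException("window size out of range: " + k);
        }
    }

    public static void main(String[] args) {
        int[] ref = toSegmentIds(new int[] {3, 3});             // two segments of three sentences
        int[] hyp = toSegmentIds(new int[] {1, 1, 1, 1, 1, 1}); // a boundary after every sentence
        int k = defaultWindow(ref);
        System.out.println("Pk = " + pk(ref, hyp, k));
        System.out.println("WindowDiff = " + windowDiff(ref, hyp, k));
    }
}
```

Both measures are error rates, so lower is better and 0 means the hypothesis matches the reference in every window. In the over-segmented example in main, WindowDiff comes out higher than Pk because it also counts windows whose ends are separated in both texts but by a different number of boundaries.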
---Optimization
java edu.mit.nlp.segmenter.MinCutSeg -config config.file -optimize file1.dev file2.dev ...
Optimize the parameters on a set of segmented development texts.
---Segmentation
java edu.mit.nlp.segmenter.MinCutSeg -config config.file -n INT -in input.file -out output.file
Run the MinCut Segmenter to partition a sentence-separated text file into the desired number
of segments. You can also add an optional "-evaluate" flag if the input file contains reference
boundaries, and you would like to evaluate the system on the fly. The configuration file
can contain custom parameter settings or you can use the default parameter settings by
setting the use-custom-parameter-settings flag in the configuration file to false.
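If you want to launch the segmentation mode from another Java program, the command line above can be assembled as a list of arguments. The sketch below is only an illustration: the class and method names are hypothetical, the classpath is supplied by the caller, and placing "-evaluate" at the end of the command is an assumption, since the position of that flag is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the MinCutSeg distribution.
final class SegmentCommandSketch {

    /** Builds the segmentation-mode command using the options documented above. */
    static List<String> segmentCommand(String classpath, String configFile, int numSegments,
                                       String inputFile, String outputFile, boolean evaluate) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add("edu.mit.nlp.segmenter.MinCutSeg");
        cmd.add("-config");
        cmd.add(configFile);
        cmd.add("-n");
        cmd.add(Integer.toString(numSegments));
        cmd.add("-in");
        cmd.add(inputFile);
        cmd.add("-out");
        cmd.add(outputFile);
        if (evaluate) {
            // Assumption: the optional flag is accepted after the other arguments.
            cmd.add("-evaluate");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // File names here are placeholders taken from the usage line above.
        System.out.println(segmentCommand("MinCutSeg.jar", "config.file", 5,
                "input.file", "output.file", true));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder to start the segmenter as a separate process.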
The boundaries in the segmented texts are delimited by "==========" markers, following
the convention adopted in Freddy Choi's segmentation corpus.
Choi, F. Y. 2000. Advances in domain independent linear text segmentation.
In Proceedings of NAACL 2000, pp. 26-33.
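The boundary convention described above can be read with a few lines of code. The following is a minimal sketch, not the parser used by this package. It assumes one sentence per line and that each "==========" marker sits on a line of its own, and it tolerates a marker at the very start or end of the file by skipping empty segments. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the "==========" boundary convention.
final class BoundaryFormatSketch {
    static final String MARKER = "==========";

    /** Groups sentence lines into segments, splitting at marker lines. */
    static List<List<String>> parseSegments(List<String> lines) {
        List<List<String>> segments = new ArrayList<List<String>>();
        List<String> current = new ArrayList<String>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equals(MARKER)) {
                // A marker closes the current segment; leading or doubled markers add nothing.
                if (!current.isEmpty()) {
                    segments.add(current);
                    current = new ArrayList<String>();
                }
            } else if (trimmed.length() > 0) {
                current.add(trimmed);
            }
        }
        if (!current.isEmpty()) {
            segments.add(current);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                MARKER, "First sentence.", "Second sentence.", MARKER, "Third sentence.", MARKER);
        List<List<String>> segments = parseSegments(lines);
        for (int i = 0; i < segments.size(); i++) {
            System.out.println("Segment " + i + ": " + segments.get(i).size() + " sentence(s)");
        }
    }
}
```

The segment sizes produced this way are enough to derive reference boundary positions for evaluation.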
============================================================
Building Instructions
The build system uses the Jakarta Ant framework (http://jakarta.apache.org/ant/).
To build, install Ant, or use the version of Ant bundled with the Eclipse or IntelliJ IDEA
Java development environments.
To build on Unix, set your current working directory to "source", where the file
"build.xml" is located, and enter the command "ant", optionally followed by a
target name (compile, package, clean, or javadoc).
This will generate a file called "MinCutSeg.jar" in the "./package" directory
together with the javadoc documentation files.
============================================================
October 11, 2006
IGORM AT mit DOT edu
http://csail.mit.edu/~igorm
============================================================
If you have any questions, contributions, or bug reports, feel free to contact me at igorm AT mit DOT edu.
This code is copyright 2006 by the Massachusetts Institute of Technology. The system is reserved for academic and research use only.