6.881 Homework #1
Due: 10/14/2004
The goal of the homework is to design and evaluate a method for sentence
segmentation of speech transcripts. Since raw speech transcripts do not contain
sentence boundaries, a tool for sentence separation is important for many
applications that operate over speech transcripts, such as information retrieval
and summarization.
For training, development and testing, you will be provided with 6.001 lecture
transcripts manually annotated with sentence boundaries. The transcripts are
also annotated with pause information that your model may use. Note that the
transcripts contain neither capitalization nor punctuation, so your model should
not rely on this information.
What to do?
-
Read related work: You will find abundant literature on the topic of
sentence segmentation, and it is worth looking at some of the existing
techniques before designing your own. The Manning & Schütze text gives a short
summary of the sentence segmentation of written language, and provides some
pointers. In addition, you may want to consider literature on sentence
segmentation of spoken language. (Note that you do not have access
to prosodic features typically used in spoken language segmentation.)
-
Establish upper and lower bounds: To compute the upper bound, manually
segment the following file into 20 sentences and compare your
segmentation with the "gold standard". To
establish the lower bound, randomly segment the file into 20 sentences
and compare against the "gold standard". Report precision, recall and
F-measure for both bounds.
-
Design your method: Traditionally, sentence segmentation is cast as a
binary classification task, where each potential boundary is classified either
as the end of a sentence or not. If you decide to follow the traditional
path, you will need to decide about the set of relevant features, and then
apply one of the existing classifiers (see links below). You can also consider a
model that takes into account the global properties of sentence segmentation
(e.g., sentence length distribution).
Tune all the parameters on the development set.
-
Analyze the performance: Compute the learning curve of your method, and
report the performance using various feature subsets. Consider other
experiments that shed light on the merits and weaknesses of your method.
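As a rough illustration of the evaluation described above, the following Python sketch computes precision, recall and F-measure for a proposed segmentation and generates a random segmentation for the lower-bound experiment. It is only one possible setup, not the required one. It assumes that a segmentation is represented as a set of 0-based word indices after which a sentence ends, and that the trivial boundary at the end of the file is excluded (so 20 sentences give 19 boundaries). The function names boundary_prf and random_segmentation are hypothetical and do not come from the provided data or tools.

```python
import random


def boundary_prf(predicted, gold):
    """Return (precision, recall, F-measure) for two sets of boundary positions.

    Assumption: each boundary is the 0-based index of the word after which
    a sentence ends; the end-of-file boundary is not included in either set.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # boundaries proposed AND in the gold standard
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def random_segmentation(num_words, num_sentences, seed=None):
    """Split num_words words into num_sentences sentences uniformly at random.

    Chooses num_sentences - 1 distinct internal boundaries; the last word
    (index num_words - 1) is never chosen because it always ends the file.
    """
    rng = random.Random(seed)
    return set(rng.sample(range(num_words - 1), num_sentences - 1))


if __name__ == "__main__":
    gold = {2, 5, 9}        # toy gold-standard boundaries
    predicted = {2, 6, 9}   # toy system output: two of three boundaries correct
    print(boundary_prf(predicted, gold))

    # Lower-bound experiment on a toy file of 100 words and 20 sentences.
    baseline = random_segmentation(100, 20, seed=0)
    print(len(baseline), "random boundaries")
```

Because the random lower bound varies from run to run, it may be worth averaging its scores over several seeds before reporting them.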
What to submit?
You have to submit a writeup that clearly explains your model, presents the
results and analyzes its performance. You have to submit your code, and the
output of your model on the test set. The README file should clearly specify
how to run your program.
Data
The data is in hw1-data.tgz. It contains
three directories with data for training, development and testing. Here you will find the data annotated with pauses.
Relevant Links
Links to classifiers: