6.881 Homework #1

Due: 10/14/2004

The goal of this homework is to design and evaluate a method for sentence segmentation of speech transcripts. Since raw speech transcripts do not contain sentence boundaries, a tool for sentence separation is important for many applications that operate over speech transcripts, such as information retrieval and summarization.

For training, development and testing, you will be provided with 6.001 lecture transcripts manually annotated with sentence boundaries. The transcripts are also annotated with pause information that your model may use. Note that the transcripts do not contain capitalization or punctuation, so your model should not rely on either.
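As a point of reference for how pause information might be used, here is a minimal baseline sketch that places a sentence boundary after any word followed by a sufficiently long pause. It assumes each token has already been read into a (word, pause in seconds) pair; the actual annotation format in the data may differ, and the function name, the example words and the 0.5 second threshold are all hypothetical. A real model would likely combine pauses with lexical features.

```python
from typing import List, Tuple


def segment_by_pause(
    tokens: List[Tuple[str, float]], threshold: float = 0.5
) -> List[List[str]]:
    """Split a token stream into sentences at long pauses.

    tokens: (word, pause_after_in_seconds) pairs. This layout is an
            assumption, not the format of the provided transcripts.
    threshold: hypothetical cutoff; tune it on the development set.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for word, pause_after in tokens:
        current.append(word)
        # A pause at least as long as the threshold closes the sentence.
        if pause_after >= threshold:
            sentences.append(current)
            current = []
    # Keep any trailing words that were not closed by a long pause.
    if current:
        sentences.append(current)
    return sentences


# Made-up example input, lowercased and unpunctuated like the transcripts.
example = [
    ("today", 0.0), ("we", 0.0), ("discuss", 0.1), ("procedures", 0.8),
    ("a", 0.0), ("procedure", 0.1), ("is", 0.0), ("a", 0.0), ("recipe", 0.9),
]
for sentence in segment_by_pause(example):
    print(" ".join(sentence))
```

Choosing the threshold on the development set rather than the test set keeps the reported test results honest.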

What to do?

What to submit?

Submit a writeup that clearly explains your model, presents its results, and analyzes its performance. Also submit your code and the output of your model on the test set. The README file should clearly specify how to run your program.
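The assignment does not prescribe a particular metric. One common way to report segmentation performance is precision, recall and F1 over predicted boundary positions, sketched below. The sketch assumes boundaries are represented as sets of word indices after which a sentence ends; that representation and the function name are illustrative choices, not part of the provided data or tools.

```python
from typing import Dict, Set


def boundary_scores(gold: Set[int], predicted: Set[int]) -> Dict[str, float]:
    """Precision, recall and F1 over sentence-boundary positions.

    gold, predicted: sets of word indices after which a boundary occurs
    (an assumed representation, chosen for illustration).
    """
    correct = len(gold & predicted)
    # Guard against empty sets to avoid division by zero.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up boundaries: two of the four predictions match the gold annotation.
scores = boundary_scores(gold={5, 10, 17}, predicted={5, 12, 17, 20})
print(scores)
```

Reporting precision and recall separately, in addition to F1, shows whether a model tends to over-segment or under-segment.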

Data

The data is in hw1-data.tgz. It contains three directories with data for training, development and testing. Here you will find the data annotated with pauses.

Relevant Links

Links to classifiers:
Links to part-of-speech taggers: