6.881 Homework #2
Due: 11/2/2004
In this homework, you will explore corpus-based approaches to lexical
semantics. More concretely, you will implement and analyze a method for
clustering words based on their distributional properties. By evaluating the
resultant clustering on two disambiguation tasks, you will explore the merits
of different representations and study the properties of the learning method.
To train your method, you will use the lecture transcript corpus from the
first homework and a 6.001 textbook source file.
What to do?
- Implement: Your program has to cluster a given list of words into n groups
based on their distributional patterns. You will first construct a
word-by-word matrix that captures co-occurrence patterns of the given words.
Then, you will cluster them, using the EM algorithm. Your program should take
as input a list of words to cluster and a number of clusters.
- Evaluate: The first evaluation task is pseudoword disambiguation. For each
of the 50 words in this file, randomly replace half of its occurrences in the
corpus with its reverse (e.g., ``procedure'' will be transformed into
``erudecorp''). Now, apply your clustering algorithm to the list of 100
words, which contains the original words and their reverses. If you
generate 50 clusters, how many of them will contain correct pairs (i.e., a
word and its reverse)?
The second evaluation task concerns part-of-speech disambiguation. This
file contains nouns, verbs, and adjectives. Apply your program to
these words to create three clusters. Evaluate the quality of the clusters.
- Analyze the performance: Your analysis will focus on the impact of context
representation and the features of training data on the quality of generated
clusters. To analyze the contribution of contextual representation, consider
different ways of constructing a word-by-word matrix (e.g., vary the
dimensions of the matrix) and experiment with different definitions of
context. To analyze the impact of training data, train your system on spoken
and written parts of the corpus separately, and also on their combination.
Do you observe any difference?
Do you reach similar conclusions when you analyze the performance of your
clustering on the two evaluation tasks?
Can you find any regularity in the mistakes of your method?
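
The pseudoword setup and the word-by-word matrix described above can be
sketched as follows. This is a minimal illustration, not a reference solution.
The function names (make_pseudowords, cooccurrence_matrix,
count_correct_pairs), the whitespace tokenization, the symmetric token window
used as "context", and the reading of "correct pair" as "a cluster that holds
a word together with its reverse" are all assumptions. The EM clustering step
itself is deliberately left out.

```python
import random
from collections import Counter


def make_pseudowords(tokens, targets, seed=0):
    """Replace a random half of each target's occurrences with its reverse.

    Assumption: `tokens` is an already tokenized corpus (a list of strings).
    """
    rng = random.Random(seed)
    out = list(tokens)
    for word in targets:
        positions = [i for i, tok in enumerate(tokens) if tok == word]
        # Integer division: with an odd count, one extra original is kept.
        for i in rng.sample(positions, len(positions) // 2):
            out[i] = word[::-1]
    return out


def cooccurrence_matrix(tokens, words, window=2):
    """Build one sparse row of context counts per word in `words`.

    Assumption: context = the tokens within `window` positions on either
    side. Changing `window` (or restricting the columns) is one way to vary
    the context definition and the matrix dimensions.
    """
    wanted = set(words)
    rows = {w: Counter() for w in words}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    rows[tok][tokens[j]] += 1
    return rows


def count_correct_pairs(assignment, targets):
    """Count clusters that contain at least one word together with its reverse.

    `assignment` maps each clustered word to a cluster id (hypothetical
    output format of the clustering step). Palindromes are skipped because
    they cannot form a pair.
    """
    good_clusters = set()
    for word in targets:
        rev = word[::-1]
        if rev != word and word in assignment and assignment.get(rev) == assignment[word]:
            good_clusters.add(assignment[word])
    return len(good_clusters)


if __name__ == "__main__":
    corpus = ("the procedure calls the procedure and "
              "the procedure returns a value to the procedure").split()
    mixed = make_pseudowords(corpus, ["procedure"], seed=1)
    rows = cooccurrence_matrix(mixed, ["procedure", "erudecorp"], window=1)
    print(mixed)
    print(rows)
```

The rows returned by cooccurrence_matrix would be the input to the clustering
step, and count_correct_pairs would then be applied to whatever word-to-cluster
mapping that step produces.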
What to submit?
You have to submit a writeup that clearly explains the parameters of your
models, presents the results, and analyzes their performance. You also have to
submit your code and the output of your model. The README file should clearly
specify how to run your program.