Due dates (all are 5 PM EST):

Wednesday, October 21 - homework 3 released
Monday, November 2 - homework 3 due
Early November - short (one-week) homework 4
Thursday, November 5 - project proposal due - note this is a revised date!
Tuesday, November 24 - project checkpoint
Thursday, December 3 - second quiz
Sunday, December 6 - project final report due
Tuesday, December 8 - project poster session #1
Thursday, December 10 - project poster session #2

Here is what we expect from the projects:

Overview

Projects should propose and tackle a problem broadly related to NLP, whether a core NLP task or an interesting application. We expect that you will implement statistical techniques as part of the project (from the literature or novel), apply them to real-world data sets, and perform comprehensive evaluation of the techniques. Project writeups should also discuss related work/approaches in the research literature. You may work in groups of up to 3 people; we expect that larger groups will tackle more substantial projects.

Proposal (1 or 2 pages)

The proposal should clearly outline each of the following components:
- A description of the algorithms and approaches you plan to implement
- An overall plan for experiments, i.e., what experimental conditions you plan to evaluate on
- Real-world data set(s) that will be used for evaluation
- A discussion of related work or approaches, with references to the research literature

Checkpoint (up to 4 pages for one-person groups, 6 for two, 8 for three, not including references)

The checkpoint should be a substantial draft of the final writeup. We expect that the dataset should be ready and preliminary experiments conducted by this point.

Final writeup (up to 4 pages for one-person groups, 6 for two, 8 for three, not including references)

The final report should be similar in quality to a research conference publication, and should provide a comprehensive overview of the project, covering all the points discussed in the Overview.
Poster session

The last two class sessions will be devoted to presenting the projects. Each group will prepare a poster, and all of you will have a chance to learn about each other's work. The quality of the poster and poster presentation will be part of the overall project grade.

SAMPLE PROJECT IDEAS

PHRASE EXTRACTION

We currently have a massive set (100,000+) of consumer reviews of restaurants scraped from citysearch.com (collected by the Spoken Language Systems group here at MIT). A major step towards making this information more useful is to produce short keyphrase summaries of multiple sets of reviews. We currently have a small set of phrases that were manually extracted from a subset of the reviews. Can you build a more robust keyphrase extraction system, preferably one that aggregates related keyphrases together? What features may be applicable here?

CROSS-LINGUAL NAMED ENTITY RECOGNITION

A named entity is a short string of words that identifies a person, organization, location, etc. Roughly speaking, named entities correspond to proper nouns in English. Automatically recognizing named entities in English is relatively easy because proper nouns are capitalized. In many languages such orthographic cues are not readily available, making named entity recognition more difficult. However, it is frequently possible to obtain alignments between parallel texts in different languages. Is it possible to leverage this alignment and transfer capitalization information from a language that has it to improve named entity recognition in a language that does not?

PREDICTING LANGUAGE HISTORY

Many documents, such as United Nations proceedings, have been translated into hundreds of different languages. These languages historically form a genealogical tree in which closely related languages tend to have lower-height common ancestors. Would it be possible to use statistical properties of the parallel documents to reconstruct a language tree? What features would be most useful for doing so?
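
As one possible starting point for the language-history idea, the sketch below builds a tree from very simple statistics. It is only an illustration under stated assumptions, not a recommended method: it assumes character-bigram frequencies as the feature, cosine distance as the dissimilarity, and average-linkage agglomerative clustering as the tree builder. It also ignores the sentence alignment of the parallel documents entirely, which a real project would want to exploit. The function names (`bigram_profile`, `cosine_distance`, `build_tree`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def bigram_profile(text):
    """Relative frequencies of character bigrams in a lowercased text.

    Assumes the text has at least two characters.
    """
    text = text.lower()
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(p[g] * q.get(g, 0.0) for g in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def build_tree(samples):
    """Average-linkage agglomerative clustering over language samples.

    `samples` maps a language name to a text sample. Returns the tree
    as nested 2-tuples of language names.
    """
    profiles = {name: bigram_profile(text) for name, text in samples.items()}
    # Each cluster is (subtree, list of member language names).
    clusters = [(name, [name]) for name in sorted(samples)]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(x, y) for x in a[1] for y in b[1]]
        total = sum(cosine_distance(profiles[x], profiles[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]
```

To try it, pass `build_tree` a dictionary mapping each language name to its version of the same document. Languages that are merged early end up under a low common ancestor, which matches the intuition described above. Whether bigram statistics recover the true genealogy is exactly the kind of question the project would need to evaluate.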