Introduction
Overall Theme: Extracting information from informal, written communication. We are interested in the end-to-end problem, from low-level tasks such as identifying entities to high-level tasks such as extracting relationships and preferences.
Abstract
Extraction from informal, written communication raises issues that have not been fully addressed in the natural language community. Informal, written communication has little or no structured markup. Also, the grammar and syntax used by authors are loose, fundamental grammar rules are occasionally broken, and authors sometimes use creative spacing and punctuation to get their point across. To extract information in this domain, we need techniques that make minimal assumptions about language usage and are robust to noise. We focus on statistical machine learning methods due to their flexibility and ability to adapt to structure inherent in the data.
We focus on the task of extracting information from restaurant discussion boards. This grounds our work and provides specific motivation. We develop new learning techniques and extend existing techniques that address our specific task, but are also sufficiently general to apply to a wide range of information extraction problems. We address the low-level natural language task of named entity extraction, and more complex tasks of co-reference resolution and tracking context. We also address the high-level problem of collaborative prediction; we bootstrap our natural language techniques to extract preference information and use a new collaborative filtering algorithm to predict restaurant preferences. We run experiments on restaurant bulletin board data and other data to exhibit our techniques.
Named Entity Extraction
Motivation: The identification and classification of named entities is an essential sub-task of information extraction in the majority of application domains. However, most work on named entity extraction has focused on formal and/or structured domains. In particular, when written communication has proper capitalization and punctuation, such features can be used to identify named entities with high accuracy. But, in informal communication, these features tend to be noisy; hence, extraction performance suffers.
Idea: State-of-the-art named entity extraction treats the problem as sequential classification and makes use of local syntax, capitalization and punctuation features. This feature set is not so effective for informal communication, since punctuation, capitalization and part-of-speech information tends to be noisy. However, words that are part of named entities tend to be rare and bursty, and word statistics are robust to noise. We propose to add a feature that uses full-corpus word statistics to quantify how rare and bursty each word is. This additional feature will allow us to improve named entity extraction performance on informal, written communication without necessitating a paradigm shift.
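A minimal sketch of the proposed feature (plain Python; the EM fit, constants, and the final score are illustrative, not necessarily the exact formulation in the SIGIR paper): fit a two-binomial mixture to a word's per-document counts, then combine a bursty-vs-background rate ratio with IDF.

```python
import math

def fit_two_binomial(counts, n, iters=200):
    """EM for a mixture of two binomials, Bin(n, p1) and Bin(n, p2),
    fit to the per-document occurrence counts of one word.  Assumes a
    fixed document length n, for simplicity."""
    p1, p2, w = 0.01, 0.2, 0.5          # background rate, bursty rate, mixing weight
    for _ in range(iters):
        # E-step: responsibility of the bursty component for each document
        r = []
        for k in counts:
            a = w * p2 ** k * (1 - p2) ** (n - k)
            b = (1 - w) * p1 ** k * (1 - p1) ** (n - k)
            r.append(a / (a + b))
        # M-step: re-estimate the mixing weight and the two rates
        tot = sum(r)
        w = tot / len(r)
        p2 = sum(ri * k for ri, k in zip(r, counts)) / (n * tot + 1e-12)
        p1 = sum((1 - ri) * k for ri, k in zip(r, counts)) / (n * (len(r) - tot) + 1e-12)
        if p1 > p2:
            p1, p2 = p2, p1             # keep p2 as the bursty rate
    return p1, p2, w

# toy corpus: the word is absent from most documents but bursts in a few
counts = [0, 0, 0, 0, 0, 0, 5, 6, 0, 4]   # occurrences per document
p1, p2, w = fit_two_binomial(counts, n=20)
burstiness = math.log((p2 + 1e-3) / (p1 + 1e-3))  # bursty vs. background rate
df = sum(1 for k in counts if k > 0)
idf = math.log(len(counts) / df)                  # rarity (IDF)
informativeness = burstiness * idf
```

Restaurant names should score high on both factors: they occur in few documents (high IDF) and, when they do occur, they repeat (high bursty rate relative to background).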
What is completed:
- Developed a burstiness score---a two-binomial mixture model
- Developed an informativeness score---the product of the burstiness score and a rarity score (IDF)
- Experiments show that the informativeness score is a useful feature in identifying restaurant names in the restaurant discussion board data.
- SIGIR paper
What needs to be done:
- Experiment with transfer---jointly identify restaurant names and locations; the joint problem may yield better restaurant identification performance.
- Do experiments using the Empirical Bayes version of the multinomial, i.e. the Dirichlet-multinomial distribution. Note that this distribution allows only a single rate at which evidence is incorporated into the posterior. As with the 2-binomial mixture, we should try a 2-mixture of EB multinomials, which allows for two rates at which empirical evidence is absorbed into the posterior.
- Study to motivate the need for new techniques/features. Create two versions of a Chowhound data set: (1) no modification, (2) correct capitalization/punctuation errors. Hope is that techniques perform poorly on #1 and very well on #2. This indicates that existing techniques are largely reliant on cap/punct features and new features are needed. [1/4/06]
- 7/11/06 - Tommi: chapter looks good
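The proposed 2-mixture of EB multinomials can be sketched directly from the Dirichlet-multinomial (Polya) likelihood. A sketch: the shared base measure and the two concentrations s_fast/s_slow are my illustrative parameterization of the "two rates" at which evidence is absorbed.

```python
import math

def dm_loglik(counts, alpha):
    """Log-likelihood of a count vector under a Dirichlet-multinomial
    (the Empirical Bayes / Polya form of the multinomial)."""
    A, N = sum(alpha), sum(counts)
    ll = math.lgamma(A) - math.lgamma(A + N)
    for a, k in zip(alpha, counts):
        ll += math.lgamma(a + k) - math.lgamma(a)
    return ll

def mixture_loglik(counts, base, s_fast=1.0, s_slow=100.0, w=0.5):
    """Two-component mixture of Dirichlet-multinomials sharing a base
    measure but with different concentrations: a small concentration
    (s_fast) absorbs evidence quickly, a large one (s_slow) slowly."""
    ll1 = math.log(w) + dm_loglik(counts, [s_fast * b for b in base])
    ll2 = math.log(1 - w) + dm_loglik(counts, [s_slow * b for b in base])
    m = max(ll1, ll2)   # log-sum-exp over the two components
    return m + math.log(math.exp(ll1 - m) + math.exp(ll2 - m))
```

With a small concentration, peaked (bursty) count vectors get high likelihood; with a large concentration, counts near the base measure are favored. The mixture can accommodate both behaviors.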
Co-reference Resolution
Motivation: An issue in any natural language domain is, once named entities are identified, how do we resolve them? In our restaurant discussion board domain, we would like to group together different mentions of the same restaurant. By resolving named entity mentions, we will be able to organize information on each entity. This is of primary importance in our restaurant discussion board domain, and many informal domains. Information about an entity is often scattered throughout a corpus. Location and chef information, along with reviews of a restaurant may be available on a restaurant discussion board, but not in a centralized, easy-to-find location. Each piece of information may be in a separate discussion, thus necessitating that we resolve the entity mentions to join the information together.
Idea: Much work (e.g. Aone & Bennett 1995, Soon, et al. 2001, and Ng & Cardie 2002) has pigeon-holed co-reference resolution as a classification problem, where each entity is a class. But it is an awkward fit: the number of classes is variable, heuristics must be used to determine positive and negative examples for training, and inference is a greedy procedure that is ignorant of the full effects of its decisions. Other early work (e.g. Cardie & Wagstaff 1999) formulated the problem appropriately as a clustering problem, but used a heuristic distance metric, which is unlikely to generalize well across different domains and types of writing. McCallum & Wellner 2003 use a clustering framework and learn a distance metric according to the data; they achieve state-of-the-art performance on proper noun phrases, but perform less well on pronoun and common noun phrases. This is because they treat each cluster as a fully connected sub-graph---each mention must have high similarity to all other mentions in the same cluster. This is appropriate for proper nouns (since similarity is mostly based on the mention string itself), but inappropriate for common nouns and pronouns (since a chain structure is more common and similarity is closely tied to distance and physical location). We propose an extension of the McCallum & Wellner work which also uses a clustering framework and learns a distance metric, but separates mentions into two types: (1) proper nouns, and (2) common nouns and pronouns. A proper noun must be similar to all proper nouns in its cluster, whereas a common noun or pronoun need only have a similarity chain connecting it to its cluster. In keeping with our overall theme, this extension is a statistical learning model that can adapt to the data and is robust to noise.
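The two membership criteria can be phrased as a consistency check on a candidate cluster. A toy sketch: sim, is_proper, and the threshold are illustrative stand-ins for the learned distance metric.

```python
def consistent(cluster, sim, is_proper, thresh=0.5):
    """Check the two membership criteria on one candidate cluster.
    Proper nouns must be similar to *every* other proper noun in the
    cluster; common nouns/pronouns only need a chain of similar
    mentions connecting them to the rest of the cluster."""
    propers = [m for m in cluster if is_proper[m]]
    # (1) proper nouns form a fully connected sub-graph
    for i in propers:
        for j in propers:
            if i != j and sim[i][j] < thresh:
                return False
    # (2) every mention is reachable via above-threshold links (a chain)
    if not cluster:
        return True
    seen, frontier = {cluster[0]}, [cluster[0]]
    while frontier:
        m = frontier.pop()
        for m2 in cluster:
            if m2 not in seen and sim[m][m2] >= thresh:
                seen.add(m2)
                frontier.append(m2)
    return len(seen) == len(cluster)
```

For example, three mentions with similarities 0.8 (adjacent) and 0.1 (endpoints) form a valid chain of pronouns, but an invalid cluster of proper nouns---exactly the distinction motivating the two-type model.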
What is completed:
- Idea for model based on extension of \cite{McCallum03}. Use McCallum/Wellner for proper nouns, a tree-based model for pronouns. Max-antecedent model.
- Idea for how to do optimization: use soft-max and calculate expectation.
What needs to be done:
- Implement max-antecedent model.
- Experiments on restaurant data.
- 7/11/06 - Tommi: chapter is lacking, feels incomplete w/o experiments
Thoughts: In Hal Daume's talk, he talked about simultaneously identifying and resolving entities. It would be nice to do something like this and would certainly be possible. Is it realistic? Also, he mentioned that the MUC data is weird; he preferred to use the ACE data. I think one thing he didn't like about MUC is the fact that a named entity mentioned exactly once is not tagged as a named entity.
Tracking Topic and Context
Motivation: Unlike formal text, in informal written communication, there are few formatting cues that help the reader determine topic and/or context. Also, in e-mails and bulletin board posts, authors tend to rely heavily on context and obscure references. Thus, even more so than in other domains, it is necessary to track context in order to correctly associate information with its corresponding entity.
In newspapers, each article is focused on a single topic or story; paragraphs separate minor changes in topic; initial paragraphs give a summary and later paragraphs fill in less important details. E-mail and bulletin board posts are much less structured. A thread may be tightly related to a topic or it may wander, the last message being completely unrelated to the first. Also, it is not uncommon in informal text to refer to an entity without an explicit mention. For example, a thread pertaining to a specific restaurant on a discussion board may only explicitly mention the restaurant once or twice, utilizing context to make clear that there is no switching of topic.
Idea: We approach the problem as one of topic tracking, with the hope that the developed techniques may also be applicable to the problem of context tracking. We assume that there is some fundamental unit of text which may contain only a single topic. For a newspaper, the fundamental unit might be an entire article, whereas for informal text, a much smaller unit, such as a sentence or noun phrase, may be necessary. We create a hierarchy of units based on the structure that does exist in the corpus. For example, for bulletin boards, we can identify posts, replies, and threads; we might also be able to identify some within-message structure, such as paragraphs.
We specify a distribution over text using the hierarchical structure. For each sub-tree (parent-children) of the hierarchy, there is one potential function that specifies the conditional distribution of the children given the parent. Furthermore, in cases where there is clear sequential structure (such as the sentences of a paragraph, or a sequence of replies), we include an additional potential function that encourages adjacent units to be similar. To track topics, we formalize this as a graphical model and add an additional layer of "topic" nodes, selecting the topic structure with the highest marginal likelihood.
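A minimal numpy sketch of one such conditional potential, using the trace-norm coupled multinomial from "What is completed" below (lam, the soft-max parameterization, and all names are illustrative):

```python
import numpy as np

def trace_norm(theta):
    """Trace norm (sum of singular values) of the matrix whose rows are
    the children's natural-parameter vectors."""
    return np.linalg.svd(theta, compute_uv=False).sum()

def subtree_neg_loglik(counts, theta, lam=1.0):
    """Unnormalized negative conditional LL for one sub-tree: the
    multinomial log-likelihood of each child's word counts under the
    soft-max of its natural-parameter row, plus lam times the trace
    norm, which ties the children's parameters to a shared low-rank
    subspace."""
    nll = 0.0
    for k, th in zip(counts, theta):
        logp = th - np.log(np.exp(th).sum())   # log soft-max
        nll -= (k * logp).sum()
    return nll + lam * trace_norm(theta)
```

Minimizing over theta trades off fitting each child's counts against keeping the rows of theta close to a low-dimensional (shared-topic) subspace.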
Thoughts: would be helpful to allow for triggers that indicate a change of topic (words/phrases that signal a change)
What is completed:
- Developed idea for trace-norm based hierarchical model. Conditional log-likelihood for each sub-tree is log-multinomial-likelihood (or other text model LL) plus constant times trace norm of matrix of (natural) parameter vectors, one vector per row of matrix. Fully developed for fixed-hierarchy case (no need to normalize).
- Started trying to figure out how to normalize the distribution (necessary for model selection, since fewer sub-trees always yield a lower unnormalized negative LL). Calculated the normalization constant for exp(\|x\|_2) where x is an n-dimensional vector. This is the same as a 1-by-n matrix trace norm.
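A sketch of the calculation referred to above, assuming the intended density decays as exp(-\lambda\|x\|_2) (exp(\|x\|_2) itself is not integrable), using spherical coordinates:

\[
Z \;=\; \int_{\mathbb{R}^n} e^{-\lambda \|x\|_2}\, dx
  \;=\; \frac{2\pi^{n/2}}{\Gamma(n/2)} \int_0^\infty r^{\,n-1} e^{-\lambda r}\, dr
  \;=\; \frac{2\pi^{n/2}}{\Gamma(n/2)} \cdot \frac{\Gamma(n)}{\lambda^{n}},
\]

where 2\pi^{n/2}/\Gamma(n/2) is the surface area of the unit sphere in \mathbb{R}^n. For n=1 this reduces to Z = 2/\lambda, the usual Laplace normalizer.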
What needs to be done:
- Calculate the trace norm normalization constant.
- Work out details of how to calculate marginal likelihood for model selection.
- Devise some experiments. One idea: take a small corpus of restaurant board data, segmented by noun phrase. Tag each NP as to whether it is related to the last-mentioned restaurant (OLD), related to a different restaurant (NEW), or not related to any restaurant (NONE).
- 7/11/06 - Tommi: cluster Chowhound post sentences using mixture of trace norm models; compare vs. mixture of multinomials; don't worry about trying to make sure that model will be a success---okay to report why it didn't work.
Collaborative Filtering
Motivation: Though we have mostly focused on low-level tools, here we develop an algorithm for another important aspect of information extraction---adding value to extracted primitive information. One purpose of restaurant discussion boards (and many other boards) is to elicit preference information. Many messages request a restaurant recommendation; others describe a person's recent restaurant experience. But extracted preference information can be difficult to use because different people have different tastes. This information must be personalized. Such is the task of collaborative filtering.
Idea: For collaborative filtering, we are given a set of users, a set of items (such as movies, books or restaurants), and a partially observed matrix of discrete preference values (e.g. 1 to 5 stars). The goal is to fill in unobserved entries of the matrix. At an intuitive level, we bootstrap the preferences of other users and items to make a prediction for the current user/item. We treat collaborative prediction as multiple ordinal regression problems where the feature vectors are unknown. In ordinal regression, each item has a feature vector and falls into one of an ordered set of classes. Using a linear classifier, the goal is to find a set of weights and thresholds such that the thresholded weight-feature dot-products correspond to the correct ordinal classes. We simultaneously learn feature vectors, weight vectors and thresholds (one set per user) using a trace norm regularization penalty and an ordinal regression loss function that bounds the mean absolute error.
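A numpy sketch of this objective (names are illustrative): one hinge term per threshold ("all-threshold" loss), plus the standard Frobenius surrogate for the trace norm of X = U V^T. With the hard hinge the per-rating loss upper-bounds the absolute error; the smoothed hinge below keeps it differentiable for gradient-based optimization.

```python
import numpy as np

def smooth_hinge(z):
    """Smoothed hinge loss: 0 for z >= 1, quadratic near the margin,
    linear for z <= 0 (differentiable everywhere)."""
    return np.where(z >= 1, 0.0, np.where(z <= 0, 0.5 - z, 0.5 * (1 - z) ** 2))

def all_threshold_loss(score, rating, thresholds):
    """All-threshold ordinal regression loss: thresholds below the true
    rating push the score up, thresholds at/above it push it down."""
    loss = 0.0
    for r, theta in enumerate(thresholds, start=1):
        s = 1.0 if r >= rating else -1.0
        loss += smooth_hinge(s * (theta - score))
    return loss

def objective(U, V, thetas, ratings, lam=0.1):
    """MMMF-style objective: ordinal loss over the observed entries plus
    lam/2 * (||U||_F^2 + ||V||_F^2), the usual trace-norm surrogate."""
    obj = 0.5 * lam * ((U ** 2).sum() + (V ** 2).sum())
    for i, j, y in ratings:              # observed (user, item, rating) triples
        obj += all_threshold_loss(U[i] @ V[j], y, thetas[i])
    return obj
```

Minimizing over U, V, and the per-user thresholds simultaneously learns the unknown feature vectors (V), the per-user weights (U), and the rating cutoffs.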
What is completed:
- Worked with Nati to help develop Max-Margin Matrix Factorization (MMMF).
- Developed direct optimization objective for MMMF which allows for gradient-based optimization and scales to large data sets.
- Experiments on semi-large collaborative filtering data (100K MovieLens) showing performance that easily tops K-medians and WLRA.
- Experiments on large data sets (1M MovieLens, EachMovie); results are better than all algorithms in Ben Marlin's thesis.
What needs to be done:
- Research the preference extraction literature.
- Write up a description of how to use MMMF with real-valued ratings (which is the output you would take from a preference extraction system).
- prove that immediate-threshold requires constraints for optimization; all-threshold does not [1/4/06]
- [Desirable] Write sentiment classifier for bboard posts, using context information to determine which restaurant the sentiment relates to. [1/4/06]
- [Desirable] Use MMMF to predict new preferences; also write interface so that other people can enter their likes/dislikes and be given a list of restaurants they might like. [1/4/06]
Jason Rennie
Last modified: Wed Nov 9 13:07:20 EST 2005