Grammars are a powerful technique for modeling and extracting the structure of documents. A major challenge, however, is computational complexity. The computational cost of grammatical parsing depends on both the complexity of the input and the ambiguity of the grammar. For programming languages, where the terminals appear in a linear sequence and the grammar is unambiguous, parsing is O(N). For natural languages, which are linear yet have an ambiguous grammar, parsing is O(N^3). For documents, where the terminals are arranged in two dimensions and the grammar is ambiguous, parsing time can be exponential in the number of terminals. In this paper we introduce (and unify) several types of geometrical data structures that can be used to significantly reduce parsing time. Each data structure embodies a different geometrical constraint on the set of possible valid parses. These data structures are very general: they can be used with any type of grammatical model, and for a wide variety of document understanding tasks, to limit the set of hypotheses examined and tested. Assuming a clean design for the parsing software, the same parsing framework can be tested with various geometric constraints to determine the most effective combination.
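To make the O(N^3) case concrete, the following is a minimal CYK recognizer for a toy grammar in Chomsky normal form; the grammar, lexicon, and sentence are hypothetical illustrations, not the document grammars discussed in the paper. The three nested loops, over span length, start position, and split point, are the source of the cubic cost.

```python
from itertools import product

# Toy grammar in Chomsky normal form: maps a pair of child
# nonterminals to the set of parents that can derive them.
grammar = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexicon = {"the": {"Det"}, "dog": {"N"}, "saw": {"V"}, "cat": {"N"}}

def cyk_recognize(words):
    n = len(words)
    # chart[(i, j)] = nonterminals deriving the span words[i:j]
    chart = {(i, i + 1): set(lexicon.get(w, set()))
             for i, w in enumerate(words)}
    for length in range(2, n + 1):           # O(N) span lengths
        for i in range(n - length + 1):      # O(N) start positions
            j = i + length
            cell = set()
            for k in range(i + 1, j):        # O(N) split points
                for pair in product(chart[(i, k)], chart[(k, j)]):
                    cell |= grammar.get(pair, set())
            chart[(i, j)] = cell
    return "S" in chart[(0, n)]

print(cyk_recognize("the dog saw the cat".split()))  # True
```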
Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available ``for free'' in large quantities. Unlabeled data has shown promise in improving performance on a number of tasks, e.g., word sense disambiguation, information extraction, and natural language parsing.
In this thesis, we focus on two segmentation tasks, named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces.
Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using the Perceptron algorithm. We also compare Markov and semi-Markov models on the two segmentation tasks. Our results show that features derived from unlabeled data substantially improve performance, both by reducing the amount of labeled data needed to reach a given performance level and by reducing the error achieved with a fixed amount of labeled data. We also find that semi-Markov models can sometimes improve performance over Markov models.
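The training loop itself is simple; the following is a minimal sketch of the structured Perceptron for a global linear model, under stated assumptions: `decode(x, w)` and `features(x, y)` are hypothetical stand-ins for, respectively, the highest-scoring labeling under weights w (Viterbi decoding for a Markov model, its segment-level analogue for a semi-Markov model) and the sparse feature map, which is where the cluster and mutual-information features derived from raw text would be plugged in.

```python
from collections import defaultdict

def perceptron_train(data, decode, features, epochs=5):
    """data: list of (x, y_gold) pairs; returns a sparse weight vector."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(x, w)      # best labeling under current weights
            if y_hat != y_gold:
                # Additive update: promote features of the gold labeling,
                # demote features of the wrong prediction.
                for f, v in features(x, y_gold).items():
                    w[f] += v
                for f, v in features(x, y_hat).items():
                    w[f] -= v
    return w
```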
Acyclicity is an important property of hypergraphs with applications in many areas, such as graphical models and relational databases. Our contributions in this paper are two-fold: First, we present two new characterizations of a hyperforest (equivalently, an acyclic hypergraph or triangulated graph), one through a hierarchical decomposition of the hyperforest and one through the absence of hypercycles, a concept defined in this paper. Second, we present the first efficient dynamic data structure for maintaining acyclicity in a hypergraph. The data structure uses Tarjan's Union-Find data structure (which can itself be used to maintain acyclicity in graphs) as a building block, achieving an amortized expected query time with an inverse Ackermann dependence on the number of vertices. To demonstrate the practicality of this data structure, we conduct experiments using it to construct high-weight hyperforests.
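For reference, here is the building block in question: a minimal Union-Find (disjoint-set) implementation with union by rank and path compression (the path-halving variant), whose operations run in amortized inverse-Ackermann time. On ordinary graphs it maintains acyclicity directly: an edge (u, v) closes a cycle exactly when find(u) == find(v).

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False  # endpoints already connected: edge closes a cycle
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx  # union by rank: attach shorter tree to taller
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True
```

Adding edges one at a time and rejecting any edge whose endpoints are already connected maintains a forest; this is the primitive on which the hypergraph data structure builds.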
Markov trees generalize naturally to bounded tree-width Markov networks, on which exact computations can still be done efficiently. However, learning the maximum likelihood Markov network with tree-width greater than 1 is NP-hard, so we discuss several algorithms for approximating the optimal Markov network. We present a set of methods for training a density estimator, where each method is specified by three arguments: tree-width, model scoring metric (maximum likelihood or minimum description length), and model representation (one joint distribution or several class-conditional distributions). We give empirical results for these methods on density estimation and classification tasks and explore the implications of each argument.
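The tree-width-1 case that these methods generalize is the classical Chow-Liu algorithm: the maximum likelihood Markov tree is the maximum spanning tree under pairwise empirical mutual information. A minimal sketch follows; the function names and the assumption of a discrete (n_samples, n_vars) data array are illustrative choices, not the implementation used in this work.

```python
from itertools import combinations
import numpy as np

def mutual_information(xs, ys):
    """Empirical mutual information between two discrete columns."""
    mi = 0.0
    for x in np.unique(xs):
        for y in np.unique(ys):
            pxy = np.mean((xs == x) & (ys == y))
            px, py = np.mean(xs == x), np.mean(ys == y)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_edges(data):
    """Edges of the maximum likelihood Markov tree (Chow-Liu)."""
    n_vars = data.shape[1]
    weight = {(i, j): mutual_information(data[:, i], data[:, j])
              for i, j in combinations(range(n_vars), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_vars:  # Prim's maximum spanning tree
        e = max((e for e in weight if (e[0] in in_tree) != (e[1] in in_tree)),
                key=lambda e: weight[e])
        edges.append(e)
        in_tree |= set(e)
    return edges
```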
Current approximation algorithms for maximum weight hypertrees find heavy windmill farms, and are based on the fact that a constant fraction (for constant width k) of the weight of a k-hypertree can be captured by a k-windmill farm. However, the exact worst-case ratio is not known; it is only bounded to lie between 1/(k+1)! and 1/(k+1). We investigate this worst-case ratio by searching for weighted hypertrees that minimize the fraction of their weight that a windmill farm can capture. To do so, we use a novel approach in which a linear program is used to find ``bad'' inputs to a dynamic program.
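The interplay between the two can be sketched as a constraint-generation loop: the dynamic program, given candidate weights, returns the structure capturing the most weight, and the linear program then reweights the input to make every structure found so far as light as possible. The toy below substitutes deliberately simple stand-ins for the real setting (weighted items instead of hypertrees, and the heaviest pair of items instead of the windmill-farm dynamic program); all names are hypothetical, and scipy is assumed to be available.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def best_capture(w):
    # Stand-in "dynamic program": the heaviest pair of items.
    return max(combinations(range(len(w)), 2),
               key=lambda s: w[s[0]] + w[s[1]])

def minimize_capture_ratio(n, tol=1e-9, max_iters=200):
    """Weights on n items minimizing the best pair's captured fraction."""
    # LP variables: item weights w_0..w_{n-1}, plus the ratio r.
    c = np.zeros(n + 1)
    c[-1] = 1.0                                  # objective: minimize r
    A_eq = [np.append(np.ones(n), 0.0)]          # total weight fixed at 1
    structures = []
    for _ in range(max_iters):
        A_ub = []
        for s in structures:
            row = np.zeros(n + 1)
            row[list(s)] = 1.0
            row[-1] = -1.0                       # captured weight <= r
            A_ub.append(row)
        res = linprog(c, A_ub=A_ub or None, b_ub=[0.0] * len(A_ub) or None,
                      A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (n + 1))
        w, r = res.x[:n], res.x[-1]
        s = best_capture(w)                      # DP probes the LP's input
        if w[s[0]] + w[s[1]] <= r + tol:
            return w, r                          # LP's bound is tight: done
        structures.append(s)                     # LP must now beat this one
    return w, r

print(minimize_capture_ratio(4)[1])  # uniform weights; ratio 2/n = 0.5
```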