Probabilistic Grammars

Many kinds of real-world data, such as natural language sentences and XML documents, exhibit a rich internal structure. Parsing based on grammars allows us to find tree structures that represent the internal organization of these data. Consider the following grammar (its syntax is explained below).

Example

S  -> NP, VP.
VP -> V.
NP -> Det, Noun.
Det -> the.
Noun -> dog.
V -> barks.

Using this grammar, the sentence "The dog barks" can be parsed, yielding the following tree.
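In bracketed notation, this tree reads

(S (NP (Det the) (Noun dog)) (VP (V barks)))

with S at the root, an NP subtree covering "the dog", and a VP subtree covering "barks".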

These trees are useful for a wide variety of tasks, including semantic interpretation, information retrieval, and machine translation. Unfortunately, most data admit a large number of possible structures and are typically noisy. Naive Bayes, hidden Markov models, and Bayesian networks cannot, at least not elegantly, represent this internal tree structure. Probabilistic grammars provide tools to address the resulting ambiguities. Here, we will focus on probabilistic context-free grammars.

Probabilistic Context-Free Grammars (PCFGs)

A probabilistic context-free grammar (or PCFG) is a context-free grammar that associates a probability with each of its productions. It generates the same set of parses for a text as the corresponding context-free grammar does, and assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the product of the probabilities of the productions used to generate it.
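As a small illustration, suppose we attach probabilities to the example grammar, writing each probability in front of its rule (a notation made precise in the next section). The numbers are purely illustrative, and the second VP production is a hypothetical addition that makes the example non-trivial:

1.0 : S  -> NP, VP.
0.7 : VP -> V.
0.3 : VP -> V, NP.
1.0 : NP -> Det, Noun.
1.0 : Det -> the.
1.0 : Noun -> dog.
1.0 : V -> barks.

The parse of "The dog barks" uses the productions S -> NP, VP; NP -> Det, Noun; Det -> the; Noun -> dog; VP -> V; and V -> barks, so its probability is 1.0 * 1.0 * 1.0 * 1.0 * 0.7 * 1.0 = 0.7.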

Syntax and Semantics

More formally, let T be a terminal alphabet and N a nonterminal alphabet. A probabilistic context-free grammar (PCFG) G consists of a distinguished start symbol S in N plus a finite set of productions of the form

p : X -> alpha,

where X is in N, alpha is in (N u T)*, that is, alpha is a finite string of nonterminals and terminals, and p is a probability value. For each nonterminal X, the probability values of all productions with left-hand side X sum to 1.0.
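The illustrative grammar above satisfies this constraint: the two VP productions have probabilities 0.7 + 0.3 = 1.0, and every other nonterminal has a single production with probability 1.0.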

A PCFG defines a stochastic process with sentential forms as states, and leftmost rewriting steps as transitions. We denote a single rewriting operation of the grammar by a single arrow ->. If as a result of one or more rewriting operations we are able to rewrite an element beta of (N u T)* as a sequence gamma of nonterminals and terminals, then we write beta ->* gamma. The probability of this rewriting is the product of the probability values associated with the productions used in the derivation.
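For the running example, the leftmost derivation of "The dog barks" passes through the following sentential forms:

S -> NP VP
  -> Det Noun VP
  -> the Noun VP
  -> the dog VP
  -> the dog V
  -> the dog barks

so S ->* the dog barks, and with the illustrative probabilities given earlier the probability of this derivation is again 1.0 * 1.0 * 1.0 * 1.0 * 0.7 * 1.0 = 0.7.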

Inference Problems

As for HMMs, there are three main problems associated with PCFGs:

  1. Given the parameters of the model, compute the probability of a particular sentence, that is, the sum of the probabilities of all its parses. This problem is solved by the inside algorithm (a small sketch follows this list).
  2. Given the parameters of the model, find the most likely parse tree of a given sentence. This problem is solved by a generalized Viterbi algorithm.
  3. Given a sentence or a set of sentences, find the most likely set of production probabilities. In other words, train the parameters of the PCFG given a dataset of sentences. This problem is solved by the inside-outside algorithm.
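
To make the first problem concrete, here is a minimal sketch of the inside algorithm in Python, written CKY-style for a grammar in Chomsky normal form. The rules and probabilities below are a hypothetical CNF variant of the running example (the unary rule VP -> V is collapsed into a lexical rule for "barks"); they are illustrative and not taken from the text above.

# Minimal sketch of the inside algorithm for a PCFG in Chomsky normal form.
# The grammar below is a hypothetical CNF variant of the running example.
from collections import defaultdict

binary = {                      # X -> Y Z rules with probability p
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "Noun")): 1.0,
}
lexical = {                     # X -> word rules with probability p
    ("Det", "the"): 1.0,
    ("Noun", "dog"): 1.0,
    ("VP", "barks"): 1.0,       # VP -> V -> barks, collapsed for CNF
}

def inside(words, start="S"):
    # chart[(i, j, X)] = probability that X derives words[i:j]
    n = len(words)
    chart = defaultdict(float)
    for i, w in enumerate(words):            # spans of width 1
        for (x, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1, x)] += p
    for width in range(2, n + 1):            # longer spans, bottom up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):        # split point
                for (x, (y, z)), p in binary.items():
                    chart[(i, j, x)] += p * chart[(i, k, y)] * chart[(k, j, z)]
    return chart[(0, n, start)]              # sum over all parses of the sentence

print(inside(["the", "dog", "barks"]))       # 1.0 for this toy grammar

Replacing the sum over split points and rules by a maximum (and recording backpointers) turns the same chart computation into the generalized Viterbi algorithm of the second problem, and the inside probabilities computed here are also the main ingredient of the inside-outside algorithm for the third problem.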