Conditional Random Fields (CRFs)

Whereas Bayesian networks are directed graphical models, random fields (aka Markov networks) are undirected graphical models defining the joint distribution of some set of random variables. As an example consider the following smoker random field inspired by Richardson and Domingos:


that computes the probability of a person having lung cancer based on whether that person or her friends smoke. In a CRF, the distribution of some random variable Y in the graph (the output) is conditioned on the remaining random variables X in the graph (the input).

Linear-Chain CRFs

In recent years, CRFs have been most often used for labeling and segmenting structured data, such as sequences, trees and lattices. Here, we will focus on so-called linear-chain CRFs. Let G be an undirected graphical model over sets of random variables X and Y, where X = &lt; xi &gt; and Y = &lt; yi &gt; are sequences of symbols, so that Y is a labeling of the observed input sequence X. Then, linear-chain CRFs define the conditional probability of a state sequence given the observed sequence as

P(Y|X) = (1/Z(X)) * exp( sum_t [ pi(t,yt,X) + pi(t-1,yt-1,yt,X) ] ),

where pi(t,yt,X) and pi(t-1,yt-1,yt,X) are potential functions and Z(X) is a normalization factor over all possible state sequences Y. A potential function is a real-valued function that captures the degree to which the assignment yt to the output variable fits the transition from yt-1, given the input X. Due to the global normalization by Z(X), each potential has an influence on the overall probability.
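The definition above can be made concrete by brute-force enumeration over all label sequences for a tiny example. The two potential functions and the label set below are arbitrary toy stand-ins, not taken from the text; the point is only that exponentiating the summed potentials and dividing by Z(X) yields a proper distribution over labelings:

```python
import itertools
import math

# Toy label set (an assumption for illustration).
LABELS = ["A", "B"]

def node_potential(t, y_t, X):
    # pi(t, yt, X): toy state potential; rewards label "A" on capitalized words.
    return 1.0 if X[t][0].isupper() == (y_t == "A") else 0.0

def edge_potential(t_prev, y_prev, y_t, X):
    # pi(t-1, yt-1, yt, X): toy transition potential; rewards repeated labels.
    return 0.5 if y_prev == y_t else 0.0

def score(Y, X):
    # sum_t [ pi(t, yt, X) + pi(t-1, yt-1, yt, X) ]
    s = node_potential(0, Y[0], X)
    for t in range(1, len(X)):
        s += node_potential(t, Y[t], X) + edge_potential(t - 1, Y[t - 1], Y[t], X)
    return s

def probability(Y, X):
    # Z(X) sums exp(score) over every possible state sequence Y.
    Z = sum(math.exp(score(list(cand), X))
            for cand in itertools.product(LABELS, repeat=len(X)))
    return math.exp(score(Y, X)) / Z

X = ["The", "dog", "Ran"]
probs = [probability(list(c), X) for c in itertools.product(LABELS, repeat=3)]
# Global normalization guarantees the probabilities sum to one.
assert abs(sum(probs) - 1.0) < 1e-9
```

Note that this enumeration is exponential in the sequence length; in practice Z(X) is computed with the forward algorithm, which the first-order Markov structure of linear-chain CRFs makes tractable.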

Potential Functions

To apply CRFs to a labeling problem, one must choose a representation for the potentials. Typically, it is assumed that the potentials factorize according to sets of features {gk} and {fk}, which are given and fixed, so that

 pi(t,yt,X) = Sum vk * gk(yt,X) and pi(t-1,yt-1,yt,X) = Sum wk * fk(t-1,yt-1,yt,X). 

In other words, the potentials are linear combinations of the input features. Consequently, the model parameters are a set of real-valued weights, one weight for each feature. In linear-chain CRFs, a first-order Markov assumption is made on the hidden variables. Note that feature functions can be arbitrary; for example, a binary test that has value 1 if and only if yt is subsumed by some logical formula.