Recursive Segmentation and Recognition Templates for 2D Parsing
Long Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan Yuille
Introduction
Language and image understanding are two major tasks in 
artificial intelligence. Natural language researchers have formalized language 
understanding in terms of parsing an input signal into a hierarchical representation. They 
have made great progress in both representation and inference (i.e. parsing). 
Firstly, they have developed probabilistic grammars (e.g. stochastic context 
free grammar (SCFG) \cite{jelinek91cl} and beyond \cite{collins99}) which are 
capable of representing complex syntactic and semantic language phenomena. For 
example, speech contains elementary constituents, such as nouns and verbs, that 
can be recursively composed into a hierarchy of constituents (e.g. noun phrases 
or verb phrases) of increasing complexity. Secondly, they have exploited the 
one-dimensional structure of language to obtain efficient polynomial-time 
parsing algorithms (e.g. the inside-outside algorithm \cite{lari90csl}). 
By contrast, the nature of images makes it much harder to design efficient image 
parsers which are capable of simultaneously performing segmentation (parsing an 
image into regions) and recognition (labeling the regions). Firstly, it is 
unclear what hierarchical representations should be used to model images and 
there are no direct analogies to the syntactic categories and phrase structures 
that occur in speech. Secondly, the inference problem is formidable due to the 
well-known complexity and ambiguity of segmentation and recognition. Unlike most 
languages (Chinese is an exception), whose constituents are well-separated 
words, the boundaries between different image regions are usually highly 
unclear. Exploring all possible image partitions leads to a combinatorial 
explosion because of the two-dimensional nature of images (which makes it 
impossible to order the partitions so as to enable dynamic programming). Overall it 
has been hard to adapt methods from natural language parsing and apply them to 
vision despite the high-level conceptual similarities (except for restricted 
problems such as text \cite{shilman05iccv}). 
Attempts at image parsing must make trade-offs between the complexity of the 
models and the complexity of the computation (for inference and learning). 
Broadly speaking, recent attempts can be divided into two different styles. The 
first style emphasizes the modeling problem and develops stochastic grammars 
\cite{tu02pami, tu03iccv} capable of representing a rich class of visual 
relationships and conceptual knowledge about objects, scenes, and images. This 
style of research pays less attention to the complexity of computation. Learning 
is usually performed, if at all, only for individual components of the models. 
Parsing is performed by MCMC sampling and is only efficient provided effective 
proposal probabilities can be designed \cite{tu03iccv}. The second style builds 
on the success of conditional random fields (CRFs) \cite{lafferty01icml} and 
emphasizes efficient computation. This yields simpler (discriminative) models 
which are less capable of representing complex image structures and long-range 
interactions. Efficient inference (e.g. belief propagation and graph cuts) and 
learning (e.g. AdaBoost, MLE) are available for basic CRFs and make these 
methods attractive. But these inference algorithms become less effective, and 
can fail, if we attempt to make the CRF models more powerful. For example, 
TextonBoost \cite{shotton06eccv} requires the parameters of the CRF to be tuned 
manually. Overall, it seems hard to extend CRF-style methods to include 
long-range relationships and contextual knowledge without significantly altering 
the models and the algorithms.
In this paper, we introduce Hierarchical Image Models (HIMs) for image parsing. 
HIMs balance the trade-off between model and inference complexity by 
introducing a hierarchy of hidden states. In particular, we introduce \emph{recursive 
segmentation and recognition templates} which represent complex image knowledge 
and serve as elementary constituents analogous to those used in speech. As in 
speech, we can recursively compose these constituents at lower levels to form 
more complex constituents at higher levels. Each node of the hierarchy 
corresponds to an image region (whose size depends on the level in the 
hierarchy). The state of each node represents both the partitioning of the 
corresponding region into segments and the labeling of these segments (i.e. in 
terms of objects).  Segmentations at the top levels of the hierarchy give 
coarse descriptions of the image which are refined by the segmentations at the 
lower levels. Learning and inference (parsing) are made efficient by exploiting 
the hierarchical structure (and the absence of loops). In short, this novel 
architecture offers two advantages: (I) Representation -- the hierarchical model 
using segmentation templates is able to capture long-range dependencies and 
to exploit different levels of contextual information; (II) Computation -- the 
hierarchical tree structure enables rapid inference (polynomial time) and 
learning by variants of dynamic programming (with pruning) and the use of 
machine learning (e.g. structured perceptrons \cite{collins02emnlp}).
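To make the computational claim concrete, the following is a toy sketch (hypothetical, not the paper's implementation): exact bottom-up dynamic programming over a two-level quadtree, where each node's state is a (segmentation template, object label) pair, and the score combines an illustrative data term with an illustrative parent-child consistency term. All names and scoring functions here are assumptions for illustration only.

```python
# Toy max-sum dynamic programming on a quadtree hierarchy.
# States, templates, labels, and scores are illustrative placeholders.
import itertools

TEMPLATES = [0, 1, 2]        # toy dictionary of segmentation templates
LABELS = ["cow", "grass"]    # toy object classes
STATES = list(itertools.product(TEMPLATES, LABELS))

def data_score(node, state):
    # toy appearance term: prefer template 0 labeled "grass" everywhere
    template, label = state
    return 1.0 if (template, label) == (0, "grass") else 0.0

def pairwise_score(parent_state, child_state):
    # toy consistency term rewarding parent/child label agreement
    return 0.5 if parent_state[1] == child_state[1] else 0.0

def parse(node):
    """Return {state: best score of the subtree rooted at node}."""
    children = node.get("children", [])
    child_tables = [parse(c) for c in children]
    table = {}
    for s in STATES:
        total = data_score(node, s)
        for tab in child_tables:
            # maximize over the child's states given the parent state s
            total += max(tab[cs] + pairwise_score(s, cs) for cs in STATES)
        table[s] = total
    return table

# two-level quadtree: a root node with four leaf children
root = {"children": [{} for _ in range(4)]}
table = parse(root)
best_state = max(table, key=table.get)
```

Because the hierarchy is a tree with no loops, the recursion visits each node once, so inference costs O(nodes x states^2 x branching), i.e. polynomial time; the pruning and structured-perceptron learning mentioned above are omitted from this sketch.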
To illustrate the HIM, we implement it for parsing images and evaluate it on 
the public MSRC image dataset \cite{shotton06eccv}. Our results show that the 
HIM outperforms other state-of-the-art approaches. We discuss ways in which 
HIMs can be extended naturally to model more complex image phenomena.
 

Figure 1: The left panel shows the structure of the Hierarchical Image Model. The grey circles are the nodes of the hierarchy. Every node, except the top node, has a single parent node. Every node, except the leaves, is connected to four child nodes. The middle panel shows a dictionary of 30 segmentation templates. The color of the sub-parts of each template indicates the object class. Different sub-parts may share the same label; for example, three sub-parts may have only two distinct labels. The right panel shows that the ground-truth pixel labels (upper right panel) can be well approximated by composing a set of labeled segmentation templates (bottom right panel).

Figure 2: This figure illustrates how the segmentation templates and object labels (S-R pairs) represent image regions in a coarse-to-fine way. The left panel shows the input image, followed by the global, mid-level, and local S-R pairs. The global S-R pair gives a coarse description of the object identity (horse), its background (grass), and its position in the image (central). The mid-level S-R pair corresponds to the region bounded by the black box in the input image; it represents (roughly) the shape of the horse's leg. The four S-R pairs at the lower level combine to represent the same leg more accurately.

Figure 4: This figure is best viewed in color. The colors indicate the labels of the 21 object classes as in [8]. The columns (except the fourth, "accuracy", column) show the input images, the ground truth, and the labels obtained by the HIM and by the boosting classifier, respectively. The "accuracy" column shows the global accuracy obtained by the HIM (left) and the boosting classifier (right). In these 7 examples, the HIM improves on boosting by 1%, -1% (an outlier!), 1%, 10%, 18%, 20% and 32% in terms of global accuracy.