Unsupervised Structure Learning: Hierarchical Composition, Suspicious Coincidence and Competitive Exclusion

Long Zhu, Chenxi Lin, Haoda Huang, Yuanhao Chen, Alan Yuille

Introduction

The goal of this paper is to learn a hierarchical compositional model (HCM) for deformable objects. The learning is unsupervised in the sense that we are given a training dataset of images containing the object in cluttered backgrounds, but we do not know the position or boundary of the object. Unsupervised learning is desirable since it avoids the need for time-consuming hand labeling and prevents implicit biases. We apply the model to tasks such as object segmentation and parsing (matching of object parts).

Learning a hierarchical compositional model is very challenging since it requires us to learn the structure of the model (e.g., the relationships between the variables and the existence of hidden variables) in addition to the parameters of the model. The difficulties are visually illustrated in figure~\ref{fig:taskunsupervised}: (i) the objects are deformable, so there is ambiguity in the structure; (ii) there is cluttered background noise; (iii) parts of the object may be missing; (iv) the input consists of oriented edge features, which are a simple and highly ambiguous representation.

We now discuss some critical computational issues for unsupervised learning which motivate our hierarchical approach and contrast it with other unsupervised approaches for learning object models (e.g., \cite{Fergus03}, \cite{LZhu07}). An abstraction of the structure learning problem is that we have a dataset of images, each of which contains approximately $M$ object features and $N$ total features. In this paper, we consider the case of $M=100$ and $N=5000$. Learning an object model requires solving a complicated correspondence problem to determine which features belong to the object, which belong to the background, and what the spatial relationships between the object features are. There are several strategies for addressing this correspondence problem. The first, naive, strategy is brute-force enumeration, which involves testing all $N^M$ possibilities. This is only practical if $M$ and $N$ are small and the appearances of the features are sufficiently distinctive that many possible correspondences can be rejected. The second strategy is to learn the model in a greedy (non-hierarchical) way by sequentially growing subparts one by one \cite{Fergus03,LZhu07}. This is practical in cases where brute force fails, but it still makes two assumptions: 1) a sparseness assumption, which requires $M$ and $N$ to be fairly small (e.g., $M=6$ and $N=100$ in \cite{Fergus03}), and 2) a small-ambiguity assumption, which requires the appearances of the features to be somewhat distinctive. Neither of these two strategies is applicable if the features are edgelets, because $M$ and $N$ are both large and all edgelets have the same appearance, which leads to high ambiguity. (One option is to use more powerful features, but we argue that this merely postpones the complexity problem.) The purpose of this paper is to develop a more general unsupervised learning method that does not make strong assumptions of sparseness and low ambiguity. In other words, we aim to provide a unified learning scheme for both generic features and object structures. Hence, we are driven to a third strategy, which we call Hierarchical Composition, that creates a model by combining elementary structures to build a hierarchy.
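As a rough illustration of these scaling arguments, the following Python sketch compares the three strategies; only the figures $M=100$ and $N=5000$ come from the text above, and the printed quantities are back-of-the-envelope estimates rather than measurements.

```python
import math

# Figures quoted in the text: roughly M object features among N total features per image.
M, N = 100, 5000

# Strategy 1: brute-force enumeration tests on the order of N^M correspondences.
print(f"brute force: ~10^{M * math.log10(N):.0f} correspondences")

# Strategy 2: greedy growth adds one feature at a time (up to N candidates per step),
# but it relies on distinctive appearances to prune those candidates; with identical
# edgelets every one of the N features remains a plausible match at each step.
print(f"greedy growth: {M} sequential steps over up to {N} candidates each")

# Strategy 3 (this paper): compose pairs of substructures level by level, giving a
# hierarchy of roughly log2(M) levels, with pruning applied at every level.
print(f"hierarchical composition: ~{math.ceil(math.log2(M))} levels of composition")
```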

Our strategy is based on a novel form of bottom-up hierarchical clustering. We assume that the object can be expressed in terms of hierarchical compositions of elementary structures (see the right panel of figure 1). We learn the structure by repeatedly combining proposals for substructures to make proposals for more complex structures. This process stops when we cannot generate any more proposals. A huge number of proposals is examined and stored at every level to avoid ambiguity, since small substructures are not necessarily distinctive enough to separate object from background. However, combining proposals in this way risks a combinatorial explosion, since the number of combinations grows exponentially as we move to higher levels. We avoid this by the use of two principles: (i) suspicious coincidence, and (ii) competitive exclusion. The suspicious coincidence principle eliminates proposals which occur infrequently in the image dataset (in fewer than 90 percent of the images). The competitive exclusion principle is adapted from ecology, where it states that two species cannot stably occupy the same environmental niche. In this paper, competitive exclusion eliminates proposals which seek to explain overlapping parts of the image.
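The following Python sketch illustrates one level of this bottom-up procedure under simplifying assumptions: the `Proposal` class, its score, the `compose` test, and the overlap threshold are hypothetical stand-ins for the actual model, and only the 90 percent frequency threshold is taken from the text.

```python
from itertools import combinations

class Proposal:
    """Hypothetical substructure proposal: for each training image it records the
    set of pixels its best instance explains, together with a goodness score."""
    def __init__(self, children, pixels_per_image, score):
        self.children = children                    # sub-proposals it is composed from
        self.pixels_per_image = pixels_per_image    # {image_id: set of pixels}
        self.score = score

def compose(a, b):
    """Hypothetical composition: merge two proposals in the images where both occur."""
    common = set(a.pixels_per_image) & set(b.pixels_per_image)
    pixels = {i: a.pixels_per_image[i] | b.pixels_per_image[i] for i in common}
    return Proposal([a, b], pixels, a.score + b.score) if pixels else None

def one_level(proposals, num_images, freq_thresh=0.9, overlap_thresh=0.5):
    # 1. Combine pairs of current proposals into larger candidate structures.
    candidates = [c for a, b in combinations(proposals, 2)
                  if (c := compose(a, b)) is not None]

    # 2. Suspicious coincidence: keep only structures that occur in most images
    #    (the text uses a 90 percent threshold).
    candidates = [c for c in candidates
                  if len(c.pixels_per_image) >= freq_thresh * num_images]

    # 3. Competitive exclusion: if two surviving proposals explain heavily
    #    overlapping parts of the image, keep only the higher-scoring one.
    candidates.sort(key=lambda c: c.score, reverse=True)
    kept = []
    for c in candidates:
        overlaps = any(
            len(c.pixels_per_image[i] & k.pixels_per_image[i])
            > overlap_thresh * len(c.pixels_per_image[i])
            for k in kept for i in c.pixels_per_image if i in k.pixels_per_image)
        if not overlaps:
            kept.append(c)
    return kept
```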

The bottom-up clustering is followed by a top-down stage which refines the hierarchy and fills in its gaps. These gaps can occur for two reasons. Firstly, the competitive exclusion principle sometimes eliminates subparts of the hierarchy because of small overlaps. Secondly, gaps may occur at low levels of the hierarchy, for example at the neck of a horse, where considerable local shape variation means that no suspicious coincidences are found. The top-down process removes these gaps automatically by relaxing the two principles. For example, at higher levels the system discovers more regular relationships between the head and torso of the horse, which provide context enabling the algorithm to fill in the gaps.
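A correspondingly hedged sketch of the top-down gap-filling pass (reusing the hypothetical `Proposal` class above; the relaxed threshold and the per-image gap regions predicted by the parent structure are assumptions, not details taken from the paper):

```python
def fill_gap(gap_region_per_image, local_proposals, num_images, relaxed_freq=0.5):
    """Hypothetical top-down gap filling: gap_region_per_image maps image_id to the
    pixels where the learned parent (e.g. head plus torso of the horse) predicts the
    missing part (e.g. the neck); local_proposals are Proposal objects found there."""
    accepted = []
    for prop in local_proposals:
        # Count the images in which this proposal falls inside the predicted gap.
        support = sum(1 for i, pix in prop.pixels_per_image.items()
                      if pix & gap_region_per_image.get(i, set()))
        # Relaxed suspicious coincidence: with top-down context the part need not
        # appear as a stable coincidence in 90 percent of the images.
        if support >= relaxed_freq * num_images:
            accepted.append(prop)
    # Competitive exclusion is also relaxed, so subparts that were previously
    # eliminated because of small overlaps can be reinstated.
    return accepted
```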

In summary, the hierarchical bottom-up and top-down procedure allows us to grow the structure exponentially (the height of the hierarchy is $\log(M)$) and to end up with a nearly complete structure (the resulting representation is dense and the size $M$ can be large). We tested our approach by using it to learn a hierarchical compositional model for parsing and segmenting horses on the Weizmann dataset \cite{Borenstein02}. We show that the resulting model is comparable with, or better than, alternative methods. The versatility of our approach is demonstrated by learning models for other objects (e.g., faces, pianos, butterflies, monitors, etc.).