Download: ps.gz (130 pages, 1.2M) or pdf (0.9M)
The purpose of this thesis is to propose and to study a class of
models that admits tractable inference and learning algorithms yet is
rich enough for practical applications: the class of
mixture-of-trees models. Mixtures of trees inherit the excellent
computational properties of tree distributions (themselves a subset
of graphical models) but combine several trees in order to augment
their modeling power, thereby going beyond the standard graphical
model framework.
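In symbols, a mixture of trees is a convex combination of tree distributions; the notation below is a generic sketch, not necessarily the thesis's own:

```latex
Q(x) \;=\; \sum_{k=1}^{m} \lambda_k \, T_k(x),
\qquad \lambda_k \ge 0, \quad \sum_{k=1}^{m} \lambda_k = 1,
```

where each component $T_k$ is a distribution that factors over its own spanning tree of the variables. Because the component trees may have different structures, the mixture can represent dependencies that no single tree (and no single sparse graph) captures.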
The thesis demonstrates the performance of mixtures of trees in
density estimation and classification tasks. At the same time it
deepens the understanding of the properties of the tree distribution
as a multivariate density model. Among other results, it shows that the tree
classifier implements an implicit variable selection mechanism.
Learning mixtures of trees from data is a central subject of this
thesis. The learning algorithm introduced here is based on the EM and
Minimum Weight Spanning Tree algorithms and is quadratic in the
dimension of the domain.
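The tree-structure step inside such a learning loop is the classical Chow-Liu construction: estimate pairwise mutual informations from the data, then take a maximum-weight spanning tree over them. The sketch below illustrates that step in isolation; function names and the plain-list data layout are illustrative assumptions, not the thesis's own code.

```python
# Illustrative Chow-Liu step: mutual-information edge weights plus
# Kruskal's maximum-weight spanning tree. Not the thesis's implementation.
import math
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i; X_j) for discrete columns i, j."""
    n = len(data)
    pij, pi, pj = {}, {}, {}
    for row in data:
        a, b = row[i], row[j]
        pij[(a, b)] = pij.get((a, b), 0) + 1
        pi[a] = pi.get(a, 0) + 1
        pj[b] = pj.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in pij.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with counts over n samples
        mi += (c / n) * math.log(c * n / (pi[a] * pj[b]))
    return mi

def chow_liu_tree(data, n_vars):
    """Kruskal's algorithm over all pairs, weighted by mutual information."""
    edges = sorted(
        ((mutual_information(data, i, j), i, j)
         for i, j in combinations(range(n_vars), 2)),
        reverse=True)
    parent = list(range(n_vars))          # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Computing all pairwise mutual informations is what makes the structure step quadratic in the number of variables; in the mixture setting, this step runs once per tree component inside each EM iteration, with the data weighted by the posterior component responsibilities.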
This algorithm can serve as a tool for discovering hidden variables in
a special but important class of models where, conditioned on the
hidden variable, the dependencies between the observed variables
become sparse.
Finally, it is shown that in the case of sparse discrete data, the
original learning algorithm can be transformed into an algorithm that
is jointly subquadratic and that in simulations achieves speedup
factors of up to a thousand.
Marina Meila-Predoviciu
Abstract
One of the challenges of density estimation as it is used in machine
learning is that usually the data are multivariate and often the
dimensionality is large. Operating with joint distributions over
multidimensional domains raises specific problems that are not
encountered in the univariate case. Graphical models are
representations of joint densities that are specifically tailored to
address these problems. They take advantage of the (conditional)
independencies between subsets of variables in the domain,
representing them by means of a graph. When the graph is sparse, graphical
models provide an excellent support for human intuition and allow for
efficient inference algorithms. However, learning the underlying
dependence graph from data is generally NP-hard.
Learning with Mixtures of Trees
Marina Meila