misha@imb.imb.ac.ru
Institute for Protein Research
Pushchino, Russia 142292
Motivation: Compositionally homogeneous segments of genomic DNA often
correspond to meaningful biological units. Simple sliding window analysis is
usually insufficient for compositional segmentation of natural sequences.
Hidden Markov Models (HMM) with a small number of states are a natural
language for description of compositional properties of chromosome-size DNA
sequences.
Results: The algorithms were applied to yeast Saccharomyces
cerevisiae chromosomes (YC) I, III, IV, VI and IX. The optimal number of
HMM states is found to be four. The optimal 4-state HMMs for all three
chromosomes are very similar, as well as the
reconstructed segmentations. In most cases the models
with k+1 states are obtained by ``splitting'' one of the states in
the model with k states, and the corresponding increase of
the level of detail in segmentation. The high AT states usually
correspond to intergenic regions.
We also explore the model's likelihood landscape and analyze the dynamics of
the optimization process, thus addressing the problem of reliability of the
obtained optima and efficiency of the algorithms.
Availability:
The system is available on request from the authors.
Contact e-mail: pesha at csail dot mit dot edu
Keywords: Hidden Markov Models, Genome, Yeast, GC-content.
Developed initially for speech recognition [Rabiner'89], hidden Markov models (HMM) are now widely used in computational molecular biology. The usual applications of HMMs include multiple alignment and functional classification of proteins [Krogh'94], prediction of protein folding [Francesco], recognition of genes in bacterial and human genomes [Burge,Kulp,Henderson], analysis and prediction of DNA functional sites [Crowley], and identification of nucleosomal DNA periodical patterns [Baldi'96].
However, the first application of HMMs to the analysis of genetic data, done by Churchill (1989, 1992), was to analyze the compositional heterogeneity of natural DNA sequences. Although it is the simplest level of DNA statistics, it still is not well understood. Base composition of natural DNA sequences is clearly non-uniform, and HMMs are a convenient language for its description. Despite the importance of the pioneering work by Churchill, a number of questions has remained unresolved. First, computational problems precluded analysis of long sequences, and allowed only the use of models with a small number of states (two or three). Second, no analysis of consistency of the generated segmentation or statistical confidence of a model choice was undertaken, and the features of the model parameter space were not analyzed. Finally, new data have became available recently, especialy complete genomes and chromosomes that provide an interesting subject for the analysis.
The present work aims to fill some of these gaps and addresses general properties of HMMs in the domain of DNA sequences. We systematically investigate the computational aspects such as the structure of the parameter space, the model likelihood landscapes and robustness of the obtained data segmentation. The algorithms are applied both to natural and simulated DNA sequences.
In addition to the standard nucleotide sequences, it is sometimes of interest to consider binary representations. In this paper we address the strong-weak hydrogen bonding alphabet, and thus decompose the DNA sequences into segment of homogeneous GC-content. This segmentation is invariant relative to the DNA strand and is meaningful both physically and biologically [Churchill'89,Bird'86,Aissani'91,Goodall'89,Bernardi'89].
Yeast Saccharomyces cerevisiae chromosomes I (YCI, 230 Kbp, 60.7% AT), III (YCIII, 315 Kbp, 61.4% AT), the fragment of IV (YCIV, 300 Kbp, 61.6% AT), VI (YCVI, 270 Kbp, 61.3% AT) and VI (YCVI, 270 Kbp, 61.3% AT) and IX (YCIX, 440 Kbp, 61.1% AT) [Goffeau] were downloaded from Stanford's DNA Sequence and Technology Center (http://genome-www.stanford.edu/Saccharomyces/chromosomes.html).
To get an idea of a variational pattern, see for example the lower subgraph of Figure 3, which presents variations in GC-contents for YCI within sliding windows of 1150 bp.
We are not giving a detailed description of HMMs and related algorithms in this paper (see [Rabiner'89] for extensive coverage of this subject).
The HMM (sometimes called Hidden Markov Chains) is a well-known technique of stochastic modeling used so far mostly for speech and handwriting recognition. Within this model, the observations are considered to be a stochastic function of a Markov process whose states are ``hidden'', i.e. not directly observable. An ensemble of observation parameters O, state transition parameters T, and prior distribution over initial state constitutes a Hidden Markov Model. There are standard ways of estimating the model's parameters from data, as well as evaluating the probability of test data being generated by the model [Rabiner'89]. The model produced can be used to impose segmentation of the data.
We are facing an instance of a multi-dimensional optimization problem that does not have a solution in the general case. There are two parts to this optimization process: (i) assuming some N, find the best HMM among HMMs with N states; (ii) among best models with different number of states, find the best one. More important yet, we have to define what does the best mean for our problem instance.
We use the Baum-Welch (BW hereafter) algorithm which, given a sequence
S and some initial model
with a given number of states N, adjusts
the parameters (the transition matrix
T and the observation matrix O) as
to maximize likelihood Pr(S | M). The BW algorithm is a version of EM
(Expectation - Maximization) proved to converge to a local optimum.
Our solution is simply to run the BW algorithm starting from models with
random initial
parameters and keep the best local optimum until we agree to
stop having found ``satisfactory'' good local optimum.
An improvement to this would be to draw initial models from
some distribution, which is suggested by the problem nature and therefore
is more likely to fall close to an optimal model. These questions
are addressed in details in THE NEXT TWO sections.
Consider a state transition matrix for an HMM. An assumption that there are
long segments with homogeneous compostion corresponds to the diagonal values
of the transition matrix being close to 1. For example for an HMM with two
states:
.
If
values are small, the system remains in the state for a
long time (expected
or
correspondingly)
before it makes a transition out of the state. Accordingly, to create a
random model with transition matrix ``biased'' towards segmented structure
having uniform distribution over segments length, one should for each raw
i of transition matrix: (i) Choose a length of segment
between some small
value (say, ten) and some fraction of the sequence length (say, half),
according to the exponential distribution (so that the segments of length
are equally likely for all p) and set
.
(ii) Using a uniform distribution spread the probability mass
over off-diagonal elements of the row.
![]() |
Theoretically there is no way to determine whether we have found the optimal model with the given number of states for the given sequence. The general idea is to accumulate the statistics over local optima, in order to obtain some estimate of how ``good'' is the best local optimum found. Below we introduce the procedure of ``likelihood landscape exploration'' which is an initial stage of sequence analysis: (i) generate a large number of random models and collect statistics of those models' likelihood: minimal and maximal values, mean and standard deviation. (ii) run the local optimization procedure on a smaller number of random initial models, keeping track of how long it takes for each run to converge, thereby getting an estimate of how many different ``valleys'' are there and how ``wide'' each of them is. (iii) collect statistics over the points of convergence. It is often the case that we converge to the same point starting from many different initial models (e.g. Figure 1).
Another useful benchmark for the likelihood of a model with k states is the likelihood of the best model with k-1 states. It is clear that the optimal k states model is always better than the optimal k-1 states model.
Generally, when we are trying to describe some data, the more degrees of freedom we allow to the model, the better fit is obtained and therefore the higher likelihood for an optimal set of parameters is reached. To compare models with the same number of parameters we can simply use likelihoods, otherwise some form of penalty for extra degrees of freedom is neccessary.
We use the Bayesian information criterion (BIC) [#!Schwarz78!#].
Given several candidates, the model with the maximum value of BIC
,
is taken to be the best model. Here
is the
log-likelihood, k is the number of degrees of freedom in the
model and L is the sequence length. Thus, the BIC value is a penalized form of
the log-likelihood with a penalty that increases linearly with the number of
parameters in the model.
Along with the theoretical criterion we also use an empirical one. Having found the best models of different orders, we compare them by observing to what extent an extra state is being used for description of the data and segmentation (see e.g. Figure 3). We found that for the genomic sequences this method gives results similar to BIC.
We have explored the robustness of solutions and effectiveness of HMM algorithms using synthetic data to do a reverse computation - keep the HMM and the sequence of hidden states used to generate a sequence of observations, and see if we are able to recover them starting from a distorted initial model.
We conducted two distinct series of model restoration experiments: (i) the initial model had the correct state transition matrix T (taken from the actual model used to generate the data), but a randomly distorted observation distribution matrix O; (ii) on the contrary, the observation matrix O was taken from the correct model, whereas the state transition matrix T was altered.
It turned out that the ``segment biased'' models are very robust towards distortion of state transition probabilities in the initial model. The BW algorithm properly converges back to the original model in most cases, whereas distortion of the observation probabilities strongly influences the behavior of the algorithm. Thus estimating an unknown HMM, we should accurately sample the observation parameter space by trying models with initial setting on a regular grid or generated from the uniform distribution. On the other hand, the distribution of the initial settings for state transitions could be skewed towards more likely parameters and explored with less care and computational costs.
The Viterbi algorithm seeks the most likely hidden state sequence which not necessarily coincides at every position with a real sequence recorded while generating sample data. To analyze the robusness of the Viterbi algorithm, we have applied it to a number of observation sequences generated by a given model with a known hidden state sequence. The disagreement between the initial and reconstructed sequences of hidden states was less than .1%. Better then average performance is justified by the segment nature of the data. Thus having found a correct model we are certain to arrive to the correct segmentation.
Finaly, we note that BIC correctly estimates the number of hidden states in synthetic data (Figure 2).
![]() |
Figure 3 presents the segmentation of YCI by the best models with two through five states. The patterns of segmentation by the two and three state models are rather different. The change is smaller as we go up to four state HMM. As we further increase the number of states there is no significant change in the segmentation. The fifth state with GC-content .043 is only visited twice very briefly along the whole sequence of 230 Kbp. As a matter of fact this state covers only .06% of the sequence length, whereas the next rarest states with GC-contents .29 and .52 are used for approximately 3% of the sequence, the state with .44 GC for 24%, and state .37 GC for 69%.
Experiments with other yeast chromosomes produced similar outcomes and for brevity we do not present these segmentation results here.
Since the three state models of the yeast chromosomes are similar, it is interesting to analyze segmentation of a chromosome by the optimal model obtained for a different chromosome. The agreement between different segmentations is illustrated by table 2. Indeed, the diagonal elements, corresponding to regions consistently described by different models, dominate in all cells.
Finally, it should be noted that positions of transitions between states produced by different models (either trained on different chromosomes or with different number of states) often exactly coincide (compare different panels on Figure 3).
Dividing genomes into compositionally homogeneous regions is impotant for a number of reasons. First, these segments are compositionally meaningful. Indeed, we have observed that AT-rich states usually occur in intergenic regions. At that, the AT-richest segments in three-state models usually cover the intergenic regions together with a piece of protein-coding genes, whereas the AT-richest segments in four-state modes lie completely within the intergenic regions. This agrees with the observation that the GC-content of protein-coding regions is higher than that of non-coding region in mammals [Aissani'91] and plants [Goodall'89], but contradicts the observation that GC-rich segments often occur in regulatory regions of vertebrate genes, [Bird'86]. We have also observed that subtelomeric regions of all yeast chromosomes form separate segments. It is noteworthy that some homogeneous segments are rather long (e.g. the region 152228-189740 on Figure 3), whereas in other cases the segmentation pattern is rather complex (e.g. the region around pos. 73469). Further analysis is required to assess the biological meaning of this observation.
Second, it is well known that performance of many algorithms for functional mapping of genomic DNA crucially depends on homogeneity of the sequences and it deteriorates when the sequences are not homogeneous. The compositional segmentation of genomes allows for use of different sets of parameters fine-tuned for each composition and thus to improve performance [Burge'97].
Finally, the HMM segmentation says something about the general statistical properties of natural DNA. The isochore theory by Bernardi says that the nuclear genomes of eukaryotes are composed of large segments (tens or hundreds of kilobases) which have fairly homogeneous GC content within themselves and fall into one of a small number of distinct classes with different characteristic proportions of GC [Bernardi'89].
An alternative model is that changes in GC contents are better described by smooth variations in parameters of a stochastic distribution, which could be modeled as a random walk with reflecting bounds [Fickett'92]. This ``random walk'' model also can be described by specific type of HMM with a small number of free parameters, thus making it possible to perform direct comparison with the isochore model (work in progress).
We conclude that HMM are a useful and convenient tool for statistical analysis of the genomic data. They provide detailed, yet robust descriptions that can be used for further structural analysis or direct biological interpretation.
Leonid Peshkin 2003-10-06