Feature-Based Phylogeny

The paper "TRILOGY: Discovery of sequence-structure patterns across diverse proteins" describes the Trilogy program for the automated discovery of sequence-structure patterns in proteins. The Trilogy analysis finds clusters of conserved residues that correspond to (small) matching structures.

The high-scoring patterns represent known motifs of structural and functional significance (helix capping patterns; an NAD/FAD binding pattern), as well as potentially novel motifs.
Significant motifs such as binding sites should correspond closely with protein function and be highly conserved. Because Trilogy clusters are such a succinct representation, an order of magnitude more protein comparisons could be made for a given computational budget than with multiple sequence alignments.

The Trilogy site catalogs the 7768 high-scoring patterns in a set of representative domains taken from the SCOP protein structure classification database. With the SCOP proteins averaging over 20 clusters apiece, we can view this dataset as a feature space and define a metric on it.
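To make the feature-space idea concrete, here is a minimal Python sketch. It is not the phylogy script: the protein names and pattern ids are made-up toy data, and the Euclidean metric is an assumption, since the choice of metric is not stated here. Each protein becomes a sparse count vector over pattern ids, and a distance is computed for every pair.

```python
import math
from collections import Counter


def feature_vector(pattern_ids):
    """Sparse count vector: pattern id -> number of occurrences in one protein."""
    return Counter(pattern_ids)


def euclidean_distance(u, v):
    """Euclidean distance between two sparse count vectors (assumed metric)."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


# Hypothetical toy data: pattern ids (out of 7768) seen in three made-up proteins.
proteins = {
    "protA": [3, 3, 17, 42],
    "protB": [3, 17, 17, 99],
    "protC": [5000, 7001],
}

vectors = {name: feature_vector(ids) for name, ids in proteins.items()}
names = sorted(vectors)

# Full symmetric distance matrix, rows and columns ordered as in `names`.
matrix = [[euclidean_distance(vectors[a], vectors[b]) for b in names]
          for a in names]

for name, row in zip(names, matrix):
    print(name, " ".join(f"{d:.3f}" for d in row))
```

Storing only the nonzero counts keeps each vector small, since a protein with about 20 clusters touches only a tiny fraction of the 7768 dimensions.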

The phylogy script takes Trilogy cluster data for 1400 SCOP proteins and, treating the occurrences of each cluster type in a SCOP protein as a 7768-dimensional vector, computes the distance between each pair of SCOP proteins and writes this matrix to a file. The three matrix files corresponding to the three datasets provided on the Trilogy website are:

Using the Phylip program, a phylogeny was computed from each distance matrix:

So far so good. But I haven't found a program able to display these 1400-leaf trees.
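For anyone reproducing the distance-matrix step, the sketch below writes a small matrix in the square format that PHYLIP's distance-matrix programs read: a first line giving the number of taxa, then one row per taxon whose first 10 characters are the name. The function name `write_phylip_square`, the toy names, and the toy distances are hypothetical, and this is not necessarily how the matrix files here were produced.

```python
import io


def write_phylip_square(names, matrix, out):
    """Write a square distance matrix in PHYLIP's distance-file layout."""
    out.write(f"{len(names):5d}\n")
    for name, row in zip(names, matrix):
        # PHYLIP treats the first 10 characters of each row as the taxon name.
        label = name[:10].ljust(10)
        out.write(label + " " + " ".join(f"{d:.4f}" for d in row) + "\n")


# Hypothetical toy input: three taxa and a symmetric matrix with a zero diagonal.
names = ["protA", "protB", "protC"]
matrix = [
    [0.0, 2.0, 2.8284],
    [2.0, 0.0, 2.8284],
    [2.8284, 2.8284, 0.0],
]

buf = io.StringIO()
write_phylip_square(names, matrix, buf)
text = buf.getvalue()
print(text, end="")
```

Note that names longer than 10 characters are truncated by this sketch, so identifiers would need to remain unique within their first 10 characters.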

Copyright © 2004 Aubrey Jaffer

I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.  My actions and comments do not reflect in any way on MIT.
agj @ alum.mit.edu
http://people.csail.mit.edu/jaffer/trilogy