Accurate Gene Tree Reconstruction Using TreeFix and TreeFix-DTL: A Tutorial
TreeFix and TreeFix-DTL are programs for reconstructing highly accurate gene trees. TreeFix is designed for reconstructing eukaryotic gene trees (where horizontal gene transfer is assumed to be negligible) and TreeFix-DTL for prokaryotic gene trees. Both programs take as input a multiple sequence alignment for the gene family, a maximum likelihood gene tree (which can be constructed, for example, using RAxML or PhyML), and a known rooted species tree topology for that gene family. The idea is to use the species tree topology to guide the reconstruction of the gene tree and to balance sequence and species tree information through a statistical hypothesis testing framework. TreeFix assumes that discordance between the gene tree and species tree topologies is due to gene duplication and gene loss, while TreeFix-DTL assumes that the discordance is due to gene duplication, horizontal gene transfer, and gene loss. At the time of writing, these programs are the best-performing programs for gene tree reconstruction, outperforming even the most sophisticated species-tree-aware Bayesian methods. An additional advantage of TreeFix and TreeFix-DTL is that they do not require species divergence times or any other parameters such as rates of gene duplication or gene loss. Moreover, they scale to gene trees with hundreds of leaves.
The goal of this tutorial is to instruct participants on how to reconstruct highly accurate gene trees using TreeFix and TreeFix-DTL.
By attending this tutorial, participants will be able to: (1) appreciate the importance of reconstructing gene trees accurately, (2) understand why reconstructing gene trees accurately can be a challenging problem, (3) understand the ideas and principles underlying TreeFix and TreeFix-DTL, and (4) confidently use both programs on their own datasets.
TreeFix and TreeFix-DTL can be easily installed on Windows, Mac OS, or Linux. The basic requirements are as follows:
Given these requirements, installation is easiest on Linux. To install on Windows, users must install and use the Cygwin environment (http://www.cygwin.com/). Installation on a Mac is straightforward once the basic requirements are met, though it can be a slight hassle to install SWIG on Mac OS.
In addition, the speed and accuracy of TreeFix and TreeFix-DTL may be slightly improved if the following optional Python packages are installed:
- NumPy (1.5.1 or greater): http://www.numpy.org
If NumPy is not found, TreeFix and TreeFix-DTL use Python's built-in 'random' module.
- SciPy (0.7.1 or greater): http://www.scipy.org
If SciPy is not found, TreeFix and TreeFix-DTL use internal libraries to approximate the normal distribution (so p-values may be slightly off).
You can download TreeFix from http://compbio.mit.edu/treefix and TreeFix-DTL from http://compbio.mit.edu/treefix-dtl. Both packages contain a file called INSTALL.txt with detailed instructions on how to install the software. Participants may choose to carefully read this file and proceed with the installation on their own. Alternatively, participants can follow the simple step-by-step installation instructions given below. Tutorial participants should install at least one of TreeFix or TreeFix-DTL on their computers, and are encouraged to install both. If installing both software packages, we recommend that TreeFix be installed first.
Detailed step-by-step installation instructions now follow:
Step-by-step installation instructions for TreeFix:
- Create a new directory called TreeFix in your home directory.
- Download TreeFix from http://compbio.mit.edu/treefix and copy it to the newly created directory.
- Extract TreeFix from the tarball and enter the extracted folder.
tar -xvzf treefix-1.1.7.tar.gz
cd treefix-1.1.7
- Run the installation scripts.
python setup.py build
python setup.py install
If both the above steps were successfully executed, then TreeFix is now installed and ready to be used and you may proceed to the installation instructions for TreeFix-DTL below.
- If users do not have system permissions to install in the default location then the install step above will fail. If this happens then the --prefix flag can be used to specify the directory where TreeFix should be installed. Thus, if the build step above succeeded but the install step failed, then please execute the following command:
python setup.py install --prefix=~/TreeFix/sw
Finally, if you used the --prefix option above, then to ensure that the operating system can find the newly installed scripts and executables, set the PATH and PYTHONPATH variables to the installation directory as follows:
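The two settings could look like the sketch below. It assumes the --prefix=~/TreeFix/sw install shown above and the usual Python 2.6 layout under that prefix (scripts in bin/, modules in lib/python2.6/site-packages). Both subdirectory names are assumptions and may differ on your system, so check the directories that the install step actually created.

```shell
# Let the shell find the installed scripts (assumed location: <prefix>/bin).
export PATH="$PATH:$HOME/TreeFix/sw/bin"
# Let Python find the installed modules (assumed location; adjust "python2.6" to your Python version).
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HOME/TreeFix/sw/lib/python2.6/site-packages"
```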
We recommend adding the two lines above to your .bashrc, .bash_profile, or a similar file. Otherwise, you will need to execute the two lines above each time you start a new command line session to use TreeFix. Also note that "python2.6" in the PYTHONPATH may change depending on the Python version installed.
Step-by-step installation instructions for TreeFix-DTL:
- Create a new directory called TreeFix-DTL in your home directory.
- Download TreeFix-DTL from http://compbio.mit.edu/treefix-dtl and copy it to the newly created directory.
- Extract TreeFix-DTL from the tarball and enter the extracted folder.
tar -xvzf treefixDTL-1.0.1.tar.gz
cd treefixDTL-1.0.1
- Run the installation scripts.
python setup.py build
python setup.py install
NOTE: If you installed TreeFix using the --prefix option, then TreeFix-DTL must be installed to the same directory where TreeFix was installed. This can be done as follows:
python setup.py build
python setup.py install --prefix=~/TreeFix/sw
- If you did not install TreeFix and are unable to install TreeFix-DTL in its default location, please follow the instructions given in step 5 of the installation instructions for TreeFix (taking care to replace "TreeFix" with "TreeFix-DTL" in the commands).
Please email Mukul Bansal if you are unable to successfully install TreeFix or TreeFix-DTL.
Datasets for testing
You may check whether TreeFix and TreeFix-DTL installed correctly by invoking the TreeFix and TreeFix-DTL executables with the -h option as follows:
treefix -h
treefixDTL -h
The commands above will prompt TreeFix and TreeFix-DTL to display their respective help messages with details on how to use the programs.
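If a help message does not appear, a quick way to tell whether the problem is the PATH is to ask the shell where it finds each executable. The sketch below uses only the standard `command -v` builtin; the helper name `check_installed` is made up for this illustration and is not part of either package.

```shell
# Report whether each program name resolves to an executable on the PATH.
check_installed() {
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "$prog: found at $(command -v "$prog")"
        else
            echo "$prog: not found on PATH"
        fi
    done
}

check_installed treefix treefixDTL
```

A "not found" result after an install with --prefix usually means the PATH setting described in the installation steps has not been applied in the current session.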
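Also, TreeFix and TreeFix-DTL each include a small test dataset that you can use to learn how to use these programs. If you followed the step-by-step installation instructions above, these are available in the following locations:
~/TreeFix/treefix-1.1.7/examples/
~/TreeFix-DTL/treefixDTL-1.0.1/examples/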
Details on how to use TreeFix and TreeFix-DTL are available in the file called test.sh in those directories. Next, we provide step-by-step instructions on how to execute TreeFix and TreeFix-DTL on the test datasets.
Analyzing the test dataset using TreeFix:
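To analyze the test dataset using TreeFix, change to the examples directory (~/TreeFix/treefix-1.1.7/examples/), since the paths in the command are relative to it, and execute the following command: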
treefix -s config/fungi.stree -S config/fungi.smap -A .nt.align -o .nt.raxml.tree -n .nt.raxml.treefix.tree -V 1 -l sim-fungi/0/0.nt.raxml.treefix.log sim-fungi/0/0.nt.raxml.tree
TreeFix should require less than a minute to execute on the dataset above. The reconstructed gene tree will be available in the folder ~/TreeFix/treefix-1.1.7/examples/sim-fungi/0/ as the file "0.nt.raxml.treefix.tree".
Analyzing the test dataset using TreeFix-DTL:
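To analyze the test dataset using TreeFix-DTL, change to the examples directory (~/TreeFix-DTL/treefixDTL-1.0.1/examples/), since the paths in the command are relative to it, and execute the following command (but also see the additional instructions below if you are using Cygwin):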
treefixDTL -s config/S1.stree -S config/S.smap -A .pep.align -o .pep.raxml.boot.tree -n .pep.raxml.treefixDTL.tree -V 1 -e "-m PROTGAMMAJTT" -l sim/G1/G1.pep.raxml.treefixDTL.log sim/G1/G1.pep.raxml.boot.tree
TreeFix-DTL should require about three hours to execute on the dataset above. The reconstructed gene tree will be available in the folder ~/TreeFix-DTL/treefixDTL-1.0.1/examples/sim/G1/ as the file "G1.pep.raxml.treefixDTL.tree".
If executing TreeFix-DTL on Cygwin, an additional temporary working directory must be created and TreeFix-DTL must be informed of the location of this working directory using an additional command line parameter. Thus, if using Cygwin, first create the working directory (./tmp in the command below) and then execute TreeFix-DTL on the test dataset as follows:
treefixDTL -s config/S1.stree -S config/S.smap -A .pep.align -o .pep.raxml.boot.tree -n .pep.raxml.treefixDTL.tree -V 1 -e "-m PROTGAMMAJTT" -l sim/G1/G1.pep.raxml.treefixDTL.log -E "--tmp ./tmp" sim/G1/G1.pep.raxml.boot.tree
Further details on the command line options used in the commands above are given below.
Explanation of Command Line Options
A complete list of available command line options for TreeFix and for TreeFix-DTL can be obtained by using the -h option (as shown above), and we recommend that users read through these help messages to familiarize themselves with the kinds of options available. Here, we describe in detail the most important and fundamental command line options. Each of the options described below is applicable to both TreeFix and TreeFix-DTL.
-s <species tree>, --stree=<species tree>
specifies the location of the species tree file (in newick format)
-S <species map>, --smap=<species map>
specifies the location of the file mapping gene names to species names
-A <alignment file extension>, --alignext=<alignment file extension>
alignment file extension (default: ".align")
-o <old tree file extension>, --oldext=<old tree file extension>
old tree file extension (default: ".tree")
-n <new tree file extension>, --newext=<new tree file extension>
file extension for the file where the reconstructed gene tree will be written (default: ".treefix.tree")
-l <log file>, --log=<log file>
log filename. Use '-' to display on stdout.
-V <verbosity level>, --verbose=<verbosity level>
verbosity level of the log file (0=quiet, 1=low, 2=medium, 3=high)
The default value is 0 (i.e., no log file will be created), but we recommend setting this value to 1.
-e <extra arguments to module>, --extra=<extra arguments to module>
extra arguments to pass to the program that computes likelihoods (the default implementations of TreeFix and TreeFix-DTL use RAxML)
The primary use of this option will be to pass along the RAxML likelihood model to be used by TreeFix or TreeFix-DTL. Further details appear below.
--niter=<number of iterations>
number of search iterations to be performed (default: 100 for TreeFix and 1000 for TreeFix-DTL)
Further details on the proper use of this option appear below.
As illustrated above in the TreeFix and TreeFix-DTL command blocks for executing the test datasets, each command block begins with the name of the program (treefix or treefixDTL), followed by required and optional command line options, and ends with the specification of the file containing the maximum likelihood gene tree (constructed previously using, for example, RAxML or PhyML).
File naming conventions:
While the species tree file (specified using the -s option) and the name mapping file (specified using the -S option) may have arbitrary names, both TreeFix and TreeFix-DTL expect the alignment file and the maximum likelihood gene tree file to be in the same directory and to share a common prefix. This prefix is indirectly specified using the -o option. For instance, if the name of the maximum likelihood tree file is G1.raxml.best.tree and if the command block says "-o .raxml.best.tree", then the prefix is inferred to be G1. Additionally, suppose the -A and -n options are invoked as follows: "-A .align -n .treefix.tree", then the programs will assume that the alignment file is called G1.align and that the final reconstructed gene tree should be written to the file G1.treefix.tree.
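The naming rule above can be mimicked in a few lines of plain shell. This is only an illustration of the convention using the example file names from the paragraph above, not TreeFix's own code; the variable names are chosen for this sketch.

```shell
# Derive the file names the programs would expect, given the input tree and extensions.
tree="G1.raxml.best.tree"   # maximum likelihood gene tree (last argument on the command line)
oldext=".raxml.best.tree"   # value given to -o
alignext=".align"           # value given to -A
newext=".treefix.tree"      # value given to -n

prefix="${tree%"$oldext"}"  # strip the old-tree extension from the end of the name
alignment="$prefix$alignext"
output="$prefix$newext"

echo "prefix:    $prefix"
echo "alignment: $alignment"
echo "output:    $output"
```

Running a check like this before a long analysis is a cheap way to confirm that the alignment file the program will look for actually exists under the expected name.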
Increasing the number of search iterations:
As explained above, the command line option --niter can be used to specify the number of search iterations performed by TreeFix and TreeFix-DTL. The higher the number of iterations, the more accurate the final reconstructed gene tree. For TreeFix-DTL the default number of iterations is set to 1000. This should work well for a range of gene tree sizes and nicely balances accuracy and running time. For TreeFix, however, the default number of iterations is only 100, which is appropriate for small gene trees (say with no more than a couple dozen leaves), but may be too small to effectively reconstruct larger gene trees. Thus, when reconstructing larger gene trees using TreeFix, we recommend increasing the number of iterations to 1000. In general, we strongly recommend that the --niter option be used only to increase the number of search iterations compared to the default value; reducing the number of iterations will negatively impact the accuracy of these programs.
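For TreeFix, one way to apply this recommendation in a script is to count the leaves of the input gene tree and pick the iteration count accordingly. The sketch below is an assumption-laden illustration: the helper names are made up, the 24-leaf cutoff is just one reading of "a couple dozen", and the leaf count relies on the fact that a Newick tree with n leaves contains n - 1 commas (which holds only if no label contains a comma).

```shell
# Count leaves in a Newick string: a tree with n leaves has n - 1 commas.
count_leaves() {
    echo $(( $(printf '%s' "$1" | tr -cd ',' | wc -c) + 1 ))
}

# Choose an iteration count for TreeFix; the 24-leaf cutoff is an illustrative assumption.
choose_niter() {
    if [ "$(count_leaves "$1")" -gt 24 ]; then
        echo 1000
    else
        echo 100
    fi
}

newick="((a,b),(c,d));"
echo "leaves: $(count_leaves "$newick"), --niter=$(choose_niter "$newick")"
```

The chosen value would then be passed to treefix as --niter=<value>; in practice the Newick string would be read from the maximum likelihood gene tree file.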
The -e option:
By default, the likelihood module used by TreeFix and TreeFix-DTL assumes a GTRGAMMA model of sequence evolution. To change this, add the following to the treefix or treefixDTL command: -e '-m <model>'
Note that the specified model must be supported by RAxML. The TreeFix-DTL command block for executing the test datasets, shown above, illustrates the use of this option.
Changing the parameters of the reconciliation model (if necessary):
Both TreeFix and TreeFix-DTL allow users to change the parameters used for performing the reconciliation step. In general, however, we recommend that users make use of the default parameters since these have been tested to work well for a variety of scenarios. If needed, these parameters can be changed as follows.
TreeFix: By default, the reconciliation cost module used by TreeFix assumes equal costs (D=1, L=1) for inferred (duplication-loss) events. To change this, add the following to the treefix command:
-E '-D <dup cost> -L <loss cost>'
TreeFix-DTL: By default, the reconciliation cost module used by TreeFix-DTL uses costs D=2, T=3, and L=1 for the reconciliation. To change this, add the following to the treefixDTL command:
-E '-D <dup cost> -T <trans cost> -L <loss cost>'
Note that the costs must be non-negative. Also be sure to keep the quotes around the cost settings, so that they are passed to -E as a single argument.
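The effect of the quotes can be seen without running either program. The sketch below defines a small helper, `show_args` (a name made up for this illustration), that prints the arguments it receives; the cost values 2 and 1 are arbitrary examples.

```shell
# Print the number of arguments received, then each one on its own line.
show_args() {
    echo "$# argument(s):"
    for arg in "$@"; do
        echo "  [$arg]"
    done
}

# Quoted: the cost settings arrive as a single argument after -E.
show_args -E '-D 2 -L 1'

# Unquoted: the shell splits them into separate arguments.
show_args -E -D 2 -L 1
```

In the quoted form the helper reports two arguments, so the whole cost string follows -E as one value; in the unquoted form it reports five, and the program would see -D and -L as options of its own rather than as settings for the reconciliation module.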
Using TreeFix and TreeFix-DTL in Practice
Reconstructing highly accurate gene trees using TreeFix or TreeFix-DTL in practice entails the following simple steps:
- Obtain (or compute) a multiple sequence alignment for the gene family of interest.
- Obtain (or compute) a rooted species tree for the species in the gene family.
- Construct a maximum likelihood gene tree for the gene family (using your favorite maximum likelihood phylogeny program, e.g., RAxML or PhyML).
- Arbitrarily root the maximum likelihood gene tree. (TreeFix and TreeFix-DTL require as input a rooted maximum likelihood gene tree. The actual position of the root is unimportant.)
- Create a file that maps the gene tree leaf labels to species tree leaf labels. Examples of the format of such a file are available as part of the test datasets discussed above.
- Depending on whether the gene family is eukaryotic or prokaryotic, execute either TreeFix or TreeFix-DTL on the rooted maximum likelihood gene tree, using the appropriate command line options as described above.
- Upon termination, TreeFix and TreeFix-DTL will write the topology of the reconstructed gene tree to the specified file. Note that this reconstructed gene tree will not have any branch lengths specified; if needed, branch lengths can be easily computed for the reconstructed gene tree using software such as RAxML.
The following paper describes the computational and statistical framework used by TreeFix and TreeFix-DTL, and demonstrates the performance and accuracy of TreeFix.
The following paper describes the Duplication-Transfer-Loss reconciliation model used by TreeFix-DTL.
The paper describing TreeFix-DTL in detail and evaluating its performance is currently under review.
- Reliable and Accurate Gene Tree Reconstruction for Deciphering Microbial Evolution
Mukul S. Bansal, Yi-Chieh Wu, Eric J. Alm, and Manolis Kellis.
Last updated on January 21, 2013