NHML is a package inluding programs for
maximum-likelihood phylogenetic
inferences from DNA sequence data using a
non-homogeneous Markov model
of DNA sequence evolution, as published
in:
Galtier N. & Gouy M. 1998. Mol. Biol.
Evol. 15:871-879.
The above reference includes general
information about the model, the
optimization algorithm, and the accuracy
of estimates.
Examples of applications can be found in:
Galtier N. & Mouchiroud D.
1998. Genetics 150:1577-1584
Galtier N., Tourasse N.
& Gouy M. 1999. Science 283:220-221.
The current version is NHML 2 . It allows
for unequal substitution
rates among sites. Discretized Gamma distribution
among sites is
used, following Yang 1994 JME 39:306.
The NHML
package includes 3 programs:
- eval_nh computes the likelihood of a
given tree.
- star_nh builds a tree according to the
star-decomposition method (Saitou 1988
JME 27:261) adapted for rooted
trees.
- shake_nh builds a tree according to the
tree-pruning-regrafting method
(Felsenstein 1993, PHYLIP Package) adapted
for rooted trees.
-------
EVAL_NH -------
eval_nh
can be seen as the user-tree version of star_nh or shake_nh.
--
INPUT --
Three files are required: a sequence file, a tree file and
an option file.
The sequence file should be in MASE+ (Faulkner and
Jurka 1988 TIBS 13:321,
Galtier et al. 1996 CABIOS 12:543) format; a
description is given below.
Pre-defined sets of sites can be accounted
for. This file may include more
than one data set; a multiple-data set
MASE+ file is the concatenation of
several single-data set MASE+ files
(lines beginning with ";;" indicates a
new data set).
The
tree file should be in Newsweek (PHYLIP, CLUSTAL, etc...) format. Tree(s)
must
be rooted. Branch lengths are not required. It may include more than
one
tree. Comments (within brackets) are allowed. Taxon names must be
identical
to names in the sequence file.
The option file
includes a few algorithmic options (see below).
Three example input
files can be found in directory exple: exple.mase,
exple.tree,
exple.opt.
Input file names can be supplied as command line
arguments:
eval_nh
exple.mase exple.tree exple.opt
Alternatively, eval_nh can be called
without arguments (and input file
names are asked for). In the former case
(arguments), pre-defined sets
of sites are not taken into account: all
sites are used.
If more than one data sets and/or more than one tree
are found within input
files, data may be processed two ways, depending on
the ALLCOUPLES option.
If ALLCOUPLES is set to 0, equally ranked data sets
and trees are processed
together, i.e. tree 1 is evaluated with data set
1, tree 2 with data set 2,
etc... If ALLCOUPLES is set to 1, all possible
(data set, tree) evaluations
are processed, i.e. (data set 1, tree1),
(data set 1, tree 2), ...,
(data set 1, tree p), ... (data set n, tree 1),
... (data set n, tree p).
-- OUTPUT --
- files.
2 tree files are
created: treefile.eqgc and treefile.ndgc. Both include
the evaluated
tree(s), with estimated branch lengths and pseudo-bootstrap
values.
Pseudo-bootstrap values in file treefile.eqgc are estimated
equilibrium
G+C contents (theta parameters in Galtier and Gouy 1998).
Pseudo-bootstrap
values in file treefile.ndgc are estimates of G+C contents
at each node
(excepting root node). Output trees can be viewed by using
program NJPLOT
(M. Gouy), distributed at
ftp://biom3.univ-lyon1.fr/pub/mol_phylogeny/njplot.
- screen.
The amount of screen
display can be chosen through the PRINT1 and PRINT2
options:
if PRINT1 and PRINT2 are set to 0, the
only screen output is the
likelihood of evaluated trees.
if PRINT1=1 and PRINT2=0, additional
output is displayed, including the
estimates of the
transition/transversion ratio (kappa in Galtier and Gouy
1998), the
ancestral G+C content (omega in Galtier and Gouy 1998) and the shape
of
the Gamma distribution.
if
PRINT1 and PRINT2 are set to 1, steps of the iterative optimization
algorithm
are displayed.
-- OPTIONS --
The main running
options are about which kind of parameter has to be
optimized (i.e.
evaluated). Each option should be on a new line, in the
form
OPTION_NAME=OPTION_VALUE. Option values may be integers, real numbers,
or
strings, depending on the option. Comments (within /* */) are allowed.
OPTIMIZE_LENGTH,
OPTIMIZE_GC, OPTIMIZE_TITV, OPTIMIZE_ROOT, OPTIMIZE_ANC
and OPTIMIZE_GAMMA
indicate wether branch lengths (lambda parameters),
equilibrium G+C
contents (theta parameters), transition/transversion ratio
(kappa
parameter), root location (phi parameter) ancestral G+C content
(omega
parameter) and shape of he Gamma distribution should be optimized,
respectively
(1 -> optimized, 0 -> not optimized).
INIT_LENGTH, INIT_GC,
INIT_TITV, INIT_ROOT, INIT_ANC and INIT_GAMMA indicate
the starting values
of the optimization algorithm for branch lengths,
equilibrium G+C
contents, transition/transversion ratio, root location,
ancestral G+C
content and shape of the Gamma distribution, respectively.
Starting
values have little influence for optimized parameters (but avoid
aberrant
starting values if possible). However, if a given (kind of)
parameter
should not be optimized (OPTIMIZE_???=0), then the starting
value is
considered as the actual one, and has much influence on the
likelihood
computation.
INIT_LENGTH:
starting branch lengths may be either kept from the tree file
(KEEP) or
computed from the data (REDO). In the former case, input tree must
include
branch lengths. In the latter case, wether input tree has branch
lengths
or not does not matter. To compute starting branch lengths from the
data,
pairwise distances are first computed according to Galtier and Gouy's
1995
method (PNAS 92:11317), then the distance matrix is fit to the evaluated
tree
topology according to the least-square criterion.
INIT_GC: equilibrium G+C contents and
ancestral G+C content can be either
constant (CONST), variable (VAR), or
balanced (BALANCED). In the constant case,
the mean G+C content over the
whole data set is used. In the variable case,
equilibrium G+C contents at
terminal branches are set to the observed G+C
content in corresponding
sequences, and equilibrium G+C contents at internal
branches are set to
the mean of equilibrium G+C contents of two underlying
connected
branches. Balanced means that the equilibrium G+C-content is
assumed to
be 50% in all branches.
INIT_TITV:
starting transition/transversion value is set to the supplied
value if
positive, or automatically computed according to Galtier and Gouy
1995 if
-1. is supplied.
INIT_ROOT:
starting root location is set to the supplied value if positive.
If -1. is
supplied and the input tree has branch lengths, the input root
location is
kept. If -1. is supplied and the input tree has no branch
lengths,
starting root location is set to 0.5.
INIT_GAMMA: starting transition/transversion value is set to the
supplied
value if positive. A negative value means infinity, i.e. constant
rates
among sites.
GAMMA_NBCLASS: number of classes of the discretized Gamma
distribution.
The running time will be proportional to this number.
Please
note that constant rates among sites can be assumed by setting
INIT_GAMMA=-1
and OPTIMIZE_GAMMA=0, that Tamura's (1992 MBE 9:678) model
can be used by
setting INIT_GC=CONST and OPTIMZE_GC=0, that Kimura's
(1980 JME 16:111)
model can be used by setting INIT_GC=BALANCED and
OPTIMIZE_GC=0, and that
Jukes & Cantor (1969) model can be used by
setting INIT_GC=BALANCED,
OPTIMIZE_GC=0, INIT_TITV=1 and OPTIMIZE_TITV=0.
PRECISION: number of digits. If PRECISION is
set to 3, the iterative
algorithm will stop when the difference berween
consecutive log-likelihood
values is lower than 10^-3 (or 1.0e-3).
PRINT1, PRINT2, OUTPUT_COV: output options. See
above.
ALLCOUPLES: input
option. See above.
-------
STAR_NH -------
star_nh
implements the star-decomposition algorithm (Saitou 1988). A star
(unresolved)
topology is used as the starting tree. Any pairwise grouping
is tried and
the likelihoods of resulting partly resolved trees are
computed. The most
likely grouping is kept, and the algorithm iterates
with one taxon less. This
algorithm mimicks the distance-based Neighbor-
Joining method (Saitou and
Nei 1987 MBE 4:406).
The star_nh program is adapted to the rooted case:
the location of the root
should be specified, and star-decomposition is
performed in both "sides"
of the tree.
Most features
of program eval_nh are relevant for star_nh. Below are listed
features
specific to star_nh.
-- INPUT --
Three files are
required: a sequence file, a tree file and an option file.
The
sequence file should be in MASE+ format and should include a single data
set.
A tree file is required to specify the location of the root: two
groups of
taxa must be defined, as in file exple.utree. A single tree is
required.
The ALLCOUPLES option is useless since a single data set
can be processed.
------- SHAKE_NH
-------
shake_nh implements the tree-pruning-regrafting
algorithm (Felsenstein 1993).
A starting (resolved) tree is modified by
pruning any subtree and
moving it to another place. Resulting trees are
evaluated. If a rearrangement
results in an increase of the likelihood,
the new tree is kept as a starting
tree, and the algorithm iterates until
no rearrangement increases the
likelihood.
Again, the shake_nh
program is adapted to the rooted case. Again, most features
of programs
eval_nh and star_nh are relevant.
-- INPUT --
Three
files are required: a sequence file, a tree file and an option file.
The
sequence file should be in MASE+ format and should include a single data
set.
A tree file is required to specify the starting tree topology. A
single,
resolved tree is required.
The ALLCOUPLES option is
useless since a single data set can be processed.
-- OPTIONS
--
Two additional options are available:
SH_G: number of
internal branches crossed during rearrangements (G option in
phylip. See
phylip manual). -1 means maximum G value.
SH_MAXLCROSSED: Maximum
length of crossed branches during rearrangements.
Rearrangements that
involve moving a subtree accross a branch of length higher
than the supplied
value are not considered. -1. means infinite.