The
Partition web
site
1Laboratoire Génome,
Populations, Interactions,
CNRS UMR 5000,
Harpenden,
Hertfordshire AL5 2JQ,
The underlying population genetic
model is appropriate for out-crossing diploid organisms. It assumes
that the individuals in the sample belong to a number of separate
source populations. These source populations are assumed to be at
Hardy-Weinberg and linkage equilibrium, but the allelic composition of
these source populations, and even the number of source populations
represented in the sample, are treated as parameters whose values are
unknown.
The parameter of interest is the partition
of the set of sampled individuals, induced by the assignment
of individuals to source populations. This parameter will be referred
to as the sample partition.
The
software
package Partition consists of two separate programs. The first
program
(Partition.exe) is a Markov chain sampler which runs for a large number
of
iterations. This program generates a large sample of observations from
the
posterior distribution (of the sample partition). This output contains a great deal
of
information, which can be summarised in a number of different ways. The
second program (PartitionView.exe) can be used to process the output
from
the Markov chain sampler, in order to provide such visual summaries.
One
such visual representation is a tree of all the individuals in the
sample
(or more generally, a forest of separate trees). The heights of each
node
in the tree is equal to the posterior probability that all the
individuals
in the cluster defined by that node come form the same population.
Another
useful summary is the posterior distribution of the number of source
populations
represented in the sample. It also calculates the Bayes factor in
support
of a single source population against the alternative of more than one
source
population..
A full explanation of the Bayesian
inference method can be found in the accompanying article
(LINK).
A
Bayesian approach to clustering
Sampling
from the posterior distribution using Partition.exe
Processing the output using PartitionView.exe
2 <-----
number of populations
aat1
<-----name of
first locus
2 120 130 <-----
number of alleles, followed by a list of alleles coded with 3
digits
aat2 <-----
name of second locus
3 100 132 146
adh
2 100 123
est1
4 100 107 110 115
site1_1999 <-----
name of first population (only the first 15 characters will be retained
)
3 <-----
population size
i001______ 120120
132132 100100 107110 For each locus (in the
same order as in the header)
i002120130 132132
100100 107110 the genotype is given in
6 digits
(3 digits per allele)
i003120120 100132
100100 110110
Population "j" <-----
name of second population
5
j001______ 120120
132132 100100 107107
j002120130 132132
100123 107107
j003120120 100132
100100 107107
j004120120 100100
100123 107107
j005120120 100132
100123 107110
------------
ten characters to identify each individual
PartitionView.exe
Partition.exe is a program for sampling from the posterior
distribution
of the partition of a sample of individuals, based on a Bayesian
calculation.
This program uses the Metropolis-Hastings algorithm to simulate a
Markov
chain. When it reaches equilibrium, this Markov chain samples the
posterior
distribution of the sample partition.
PartitionView.exe is a program for processing (and also monitoring in
real
time) the output from Partition.exe.
test microsats.gtx
First we must specify a prior
distribution, representing our initial (prior) beliefs about all the
parameters of the model. Then we use Bayes' Theorem to transform our
prior distribution into our posterior distribution. This posterior
distribution represents our beliefs after looking at the data.
So, the input for the Bayesian
inference procedure consists of the data, and a prior probability
distribution for the parameters of the model. The output is also a
probability distribution for the parameters of the model.
This joint posterior distribution of
all the parameters can easily be collapsed to obtain the marginal
posterior distributions of individual parameters. In this way, we can
make inferences about individual parameters of interest, even for
models with very large numbers of parameters. Furthermore, each
marginal posterior distribution provides a meaningful and coherent
representation of our uncertainty about an individual parameter.
Sometimes, the parameter of interest is more complex: for example, the
assignment of individuals to source populations. Here too, our
uncertainty can be usefully quantified using the Bayesian framework.
Our solution to this problem is to apply an
agglomerative
hierarchical clustering algorithm, where at each step, all possible
fusions
of the existing clusters are proposed, and we accept the fusion which
generates
the cluster of individuals having the highest posterior co-assignment
probability.
We refer to this as the exact linkage algorithm.
The output from the exact linkage algorithm is a
tree
of all the individuals in the sample, where the height of each node is
equal
to the posterior probability that all the individuals in the se defined
by
that node belong to the same source population (this is the posterior
co-assignment
probability of the set). Clusters of individuals which are well
supported
by the posterior distribution, can be easily identified from this tree.
using Partition.exe
You will be prompted for a file name when
you click on “RUN” button.
Default value: result.txt
Range: 10 - 100000
Default value: 10000
The period is the number of iterations (or
cycles) of the Markov chain, between successive observations of the
state of the Markov
chain.
Range: 1 - 1000
Default value: 10
This value must be greater than one, and
must not exceed the number of individuals in the sample.
Range: 2 - 100
Default value: 4
This parameter specifies the prior
distribution on the allelic diversity within the source population.
Lower values of theta mean lower allelic diversity, and higher values
mean higher allelic diversity.
This value must be greater than zero.
Range: 0.001 – 100.0
Default value : 1
This is a parameter of the prior
probability distribution on the number of source populations k
.
When prior parameter u is = 1 the
prior on k is uniformand when it’s
less than 1, the prior on k decays geometrically, by
this factor.
Range: 0.01 – 1.0
Default value: 1
toto.gtx ß data file
name
200
ß total
number of
individuals
4
ß maximum
number of possible source populations
10000
ß number of
observations
10
ß period
1.000000
ß prior
parameter u
1.000000
ß prior
parameter theta
1
ß the number
of the observation
-4347.217879 ß the log
likelihood of the partition
2
ß the number
of source populations, k, represented in the sample
106
ß the number
of individuals assigned to first source population, followed by the
list of individuals
assigned to this population.
8
13
18
30
56
68
72
92
93
100
44
82
12
26
11
31
52
7
57
67
55
9
99
62
1
14
22
25
28
33
37
42
58
61
64
65
86
94
96
152
2
29
36
46
66
71
73
75
79
90
98
78 32
74
17
80
76
38
20
24
16
15
77
19
54
45
88
84
3
89
85
60
43
21
48
63
81
6
97
49
35
195
39
69
87
47 5
23
162
4
149
83
59
34
50
177
40
53
139
91
41
27
119 95
10
166
94
ß the number
of individuals assigned to second source population
106 124 127 143
170 176
183 189
191 198
146 116 134 135
169 128
105
133 196 114 136
141 153
122 144
140 168
110 155 121 172
132 117
164
179
200
123
186
165
103
112
185
150
108
137
151
142
175
182
163
130
147 109 148 190
131 199
101 178
157 188
111 197 145 115
126 160
129
118 192 187 107
125 193
158 104
194 180
102 113 159 171
51
138
167
70 173 181 161
184 120
174 154
156
-4354.480912
2
98
8
13
18
30
56
68
72
92
93
44
82
12
26
11
57
67
55
9
14
25
28
33
42
58
65
86
96
29
36
66
71
73
75
79
90 98
78
74
17
76
38
20
24
15
77
19
54
45
88
84
3
89 85
60
43
48
81
6
49
35
39
69
47
23
4
83
59
34
53 91
27
10
7
63
100
52
41
99
2
80
46
37
51
61
22
177 32
152 50
5
62
70
31
141
64
120
109
21
102
106 124 143 170
176 183
189 191
198 146
134 135 169 105
133 196
114
136 153 122 140
168 110
121 172
132 117
164 179 200 123
186 165
103
112 185 150 108
151 175
182 163
130 148
190 131 101 178
157 111
145
126 160 129 118
192 107
125 193
158 104
180 102 113 171
138 181
184
174 154 156 149
195 144
94
119
197
167
159
40
173
127
142
115
187
147 199 97 188
162 161
155 137
95
128
139
87
16
194
1
116
166
...
...
9
ß the number
of the observation
-4583.715074 ß the log
likelihood of the partition
1
ß the number
of source populations, k, represented in the sample
200
ß the number
of individuals assigned to the single source population (= the total
number of individuals).
0
ß When all
individuals are assigned to a single population (k = 1),
the list of individuals is omitted
-4583.715074
1
200
0
-4583.715074
1
200
0
In order to obtain
useful
output from the partition output file, we have to process it using the
companion
program PartitionView.exe.
It is possible (and sometimes useful) to start running the program
PartitionView.exe,
while the program Partition.exe is running. In this case, the program
PartitionView.exe
will read the partition output file, as it is being written, and
will
display the path of the Markov chain (see below) as it evolves in real
time.
Warning! Avoid running the make tree option of PartitionView.exe
until
the end of the analysis, as it can be slow.
The interface of PartitionView.exe.
consists of two pages, marked:
Make tree
When the Markov chain has reached
equilibrium, both these plots should show no persistent trends. So,
these plots can be used to guide the choice of burn in.
The button save trajectories creates a trajectories outfile (default
file
name "trajectories"). Each line of this file contains the number of
source
populations k and the log likelihood, for an observed state of the
Markov
chain. The ith line corresponds to the ith observation from the
Markov
chain.
Default value: posterior_k.txt
Range: 0 - one less than number of
observations
Default value: 1000
The program PartitionView.exe,
writes
the posterior distribution of k, and the Bayes
factor to the posterior k outfile. The format of
this file is as follows:
k Posterior prob of k
1 0.234838
2
0.427725
3
0.265390
4
0.058368
5
0.013680
6
0.000000
7
0.000000
Note that the posterior distribution
of the number of source populations (and the Bayes factor B1
) can prove to be unreliable if the data set is too small. (For
discussion of
this, see Dawson and Belkhir 2001.)
In many situations, the first
question of interest is: do we have a random sample from a single
panmictic population, or do we have something more complicated? One
possible measure of the evidence in support of a single panmictic
population, is the Bayes factorB1 for k =
1 against the alternative of k >
1.
In general, a Bayes factor is
interpreted in much the same way as a likelihood ratio. The advantage
of the Bayes factor is that it is less sensitive to the choice of
prior, than the posterior distribution of k. Arguably,
this particular Bayes factor, B1, is only relevant when
the mode of the posterior distribution of k is at k
= 1.
Supplementary
information about clustering and the tree
Plot
pairwise
posterior co-assignment probabilities
Compute
posterior co-assignment probability of any set
Conditioning on k
The resulting tree is exported to a tree outfile, of
standard
format (the "Newick 8:45" format adopted by PHYLIP
). The tree outfile can be viewed using standard software,
including
TreeView
and NJplot
.
Trees can be viewed in colour using the Baobab software.
If you use TreeView, you must choose the phylogram
option to view the tree correctly.
These trees are best viewed in the usual tree orientation (rooted in
the
ground). In this orientation, nodes with higher probability levels are
located
further up the tree.
The node heights of the
n-1
internal nodes are also exported to the node heights outfile.
Note that, in the course of this algorithm, the posterior co-assignment
probabilities
of some additional sets of individuals (which do not correspond
to
nodes of the tree) also have to be computed. The pairwise co-assignment
probabilities
(one for each pair of individuals) are always among the co-assignment
probabilities
which have to be computed in order to construct the tree. These
pairwise
co-assignment probabilities are exported to the pairwise probs outfile
where
they are arranged in a lower-triangular array, where the diagonal is
filled
in with 1s.
Default value: tree outfile name = tree
Choose a file name for the pairwise probs outfile.
Default value: pairwise probs outfile name = pairwise_probs
Choose a file name for the node heights outfile.
Default value: node heights outfile name = node_heights
Choose a file name for the node clusters outfile.
Default value: node clusters outfile name = node_clusters
and the tree
As with any method of hierarchical clustering, we
proceed
by constructing a binary tree. Every individual in the sample is
associated
with a terminal node. Every internal node in a binary tree has two
descendant
nodes. If there are individuals in the sample, and
hence
n terminal nodes, then there will be n-1 internal nodes when the
tree
is completed. The n-1 internal nodes are labelled t = 1, 2, ...,
n-1,
where is the generation at which the node was
created.
Every internal node defines a set of terminal nodes (the set of all the
terminal
nodes that can be reached by descending the tree, starting from that
node),
and hence a set, or "cluster" of individuals.
In the resulting tree, the height of node t , is
equal
to the posterior co-assignment probability of the set of individuals
defined
by node t . That is, the posterior probability that all the individuals
in
the set defined by this node belong to the same population. All
terminal
nodes are defined to have height equal to 1.
For computational reasons, Dawson and Belkhir (2001)
implemented
a more general algorithm (the maximin algorithm) which, in general,
differs
form the exact linkage algorithm in that the co-assignment
probabilities
of the larger sets of individuals are replaced by certain upper bounds
(which
serve as approximations for the co-assignment probabilities). The exact
linkage
algorithm is a special case of the maximin algorithm of Dawson and
Belkhir
(2001). (In the notation of that paper, it is the case where the
dimension
is equal to the sample size .) The exact linkage algorithm
includes
a maximisation step, but no minimisation. So, the name maximin
algorithm
is no longer appropriate. (The exact linkage algorithm and the more
familiar
complete linkage algorithm are both special cases of the maximin
algorithm.
However, unlike the complete linkage algorithm, the exact linkage
algorithm
makes use of the exact co-assignment probabilities for sets of any
size,
- hence its name.)
As recognised in Dawson and Belkhir (2001), the tree
generated
by exact linkage algorithm captures more of the information from the
posterior
distribution than any other version of the maximin algorithm. However,
in
our earlier work we implemented the maximin algorithm, with
dimension
d = 2, 3 or 4, for
computational reasons. We now recognise that it is feasible
to implement this preferred version of the algorithm. So, it is the
exact
linkage algorithm which is recommended.
The maximin algorithm (with dimension d = 2,
3 or 4) was implemented in the earlier companion program, called
Analyse.exe,
which PartitionView.exe replaces. "Analyse.exe" was an unfortunate
choice of
name (as it has already been used in population genetics, and no doubt
elsewhere).
It might be better to give it the posthumous name PartitionView0.exe.
This is a plot of the height p(t) of the node created at generation t ,
against
the generation t (where t = 1, 2, ..., n-1). The height p(t) of
the
node t is the posterior co-assignment probability of the set of
individuals
defined by node t . This plot shows the progress of the clustering
algorithm.
(For details, see Dawson and Belkhir 2001).
Interpretation of output
It is important to recognise that the node height
p(t)
has no frequentist interpretation. It is a posterior probability, which
depends
both on the observed data and the choice of prior. So it is a
subjective
probability. This raises an important question: How low should we allow
p(t)t
to fall, before we stop accepting further agglomeration? More
experience
with the method will be needed to resolve this. Provisionally, we
recommend
using the graph of p(t) against t as a guide, paying
particular
attention to any values of t where the probability level falls
more
sharply than usual.
To use this option, change the number in the
individual
box. (This number corresponding to the position of that individual in
the
data file.) This will automatically up-date the plot. It would be very
time
consuming to do this for each individual. However, it should be enough
to
view this plot for a few different individuals. (Preferably for
individuals
that are far apart, - so that every pair among these selected
individuals
has a low co-assignment probability.) This can help us to visualise how
clearly
defined the different clusters are.
There are situations where we may want to compute
the
posterior co-assignmement probability of a set of individuals which
does
not correspond to any node of our tree. For example, we may want to
evaluate
the evidence that individuals taken from a particular sampling site all
belong
to the same population.)
To answer such questions, first we use the arrow
buttons to
choose a set of individuals, by moving them form the box on the left to
the
box on the right (or back again, if necessary). (As usual, the numbers
correspond
to the position of each individual in the data file.) Then the
posterior co-assigmnent probability of the chosen set is updated. (The
posterior probability that all the individuals
in this set come from the same population.)
A more detailed picture can be obtained by looking
at
the plot of the values of the conditional posterior co-assignement
probabilities.
(For example, conditional upon k lying in the interval from k = 2 to k
=
4.) To specify such a condition, go back to condition on range of k.
(The default values for condition on
range of k are:
from k = 1 to k
=
maximum number of possible source populations.
This introduces NO
conditioning. We recommend that the default values are used, unless the
user has a strong reason for conditioning.)
The program PartitionView.exe produces the following output files.
posterior k outfile (default file name "posterior_k.txt"). This file
contains:
the posterior distribution of the number of source populations k
represented
in the sample; and the Bayes factor B1 for k = 1 against the
alternative
of k > 1.
tree outfile (default file name "tree.txt"). This file contains a
representation
of the tree in a standard format (the "Newick 8:45" format adopted by
PHYLIP).
pairwise probs outfile (default file name "pairwise_probs.txt"). This
file contains
the pairwise co-assignment probabilities (one for each pair of
individuals),
arranged in a lower-triangular array, where the diagonal is filled in
with
1s.
node heights outfile (default file name "node_heights.txt"). Each line
of this
file contains the generation t at which a node was created, and the
height
p(t) of this node, There is one line for each internal node of the tree.
node clusters outfile (default file name "node_clusters.txt"). This
file is similar
to node heights outfile, but contains additional information. Each line
of
this file contains the generation t at which a node was created, the
height
p(t) of this node, and the cluster of individuals defined by this node.
There
is one line for each internal node of the tree.
The button save trajectories creates a trajectories outfile (default
file
name "trajectories"). Each line of this file contains the number of
source
populations k and the log likelihood, for an observed state of the
Markov
chain. The ith line corresponds to the ith observation from the
Markov
chain.
2001. Genetical Research 78,
59-77.
Warnings for version 1.2:
Some bugs remained in Analyse.exe in Partition 1.2. These bugs involved memory allocation.
As a consequence, problems were sometimes encountered when analysing large data sets.
These bugs have been eliminated from the latest release, - Partition 2.0.
Analyse.exe still has a high memory requirement which grows with the number of individuals in the sample,
and with the range of the posterior distribution of the number of source populations.
We have tested Partition 2.0 on samples of 1000 individuals.
Warnings for version 1.0:
In the first version of Partition (1.0) there was a bug !
If the user choose Kmax > 10 Partition 1.0 failed to sample
from the correct posterior distribution. !
So if you chose Kmax = 11 or 12 then you need to
reanalyse your data with the new version (1.1) !
In this new version (1.1) the user can choose any value of
kmax from 2 to 24.
In this version we have also eliminated a
bug in Analyse.exe which occured when the data
filename
contained any blank spaces.
N.B. Missing data are not allowed in this version, so they
must be discarded from your data file !!!
Download the version 2.0 from this link or this one
and unzip it to a new folder.
Partition is primarily designed for the Windows operating system (W9x ,
Win me and Windows 2000).
However, the Wine package makes it
possible to run native Windows applications alongside native
Linux applications on a Linux (or Solaris) system.