The Partition web site

Khalid Belkhir1 and  Kevin J. Dawson2 

1Laboratoire Génome, Populations, Interactions, Adaptation

CNRS UMR 5000,  Université de Montpellier II, 34095 Montpellier Cedex 5, France.

2Rothamsted Research,

Harpenden, Hertfordshire AL5 2JQ, United Kingdom.

 

Partition is a model-based statistical software package for identifying population sub-division and assigning individuals to populations, on the basis of their genotypes at co-dominant marker loci. The underlying population genetic model is appropriate for out-crossing diploid organisms.

 

Warnings and bug alerts

Introduction to Partition

User notes

Related publications

Download

 

 

 

Introduction to Partition

             Partition is a model-based statistical software package for identifying population sub-division (barriers to gene flow) and assigning individuals to populations, on the basis of their genotypes at co-dominant marker loci. Suitable genetic markers include microsatellites and allozymes. Unlinked, or loosely linked, marker loci should be used, so that the assumption of linkage equilibrium is a reasonable one.

             The underlying population genetic model is appropriate for out-crossing diploid organisms. It assumes that the individuals in the sample belong to a number of separate source populations. These source populations are assumed to be at Hardy-Weinberg and linkage equilibrium, but the allelic composition of these source populations, and even the number of source populations represented in the sample, are treated as parameters whose values are unknown.

             The parameter of interest is the partition of the set of sampled individuals, induced by the assignment of individuals to source populations. This parameter will be referred to as the sample partition.

    The software package Partition consists of two separate programs. The first program  (Partition.exe) is a Markov chain sampler which runs for a large number of iterations. This program generates a large sample of observations from the posterior distribution (of the sample partition).   This output contains a great deal of information, which can be summarised in a number of different ways. The second program (PartitionView.exe) can be used to process the output from the Markov chain sampler, in order to provide such visual summaries. One such visual representation is a tree of all the individuals in the sample (or more generally, a forest of separate trees). The heights of each node in the tree is equal to the posterior probability that all the individuals in the cluster defined by that node come form the same population. Another useful summary is the posterior distribution of the number of source populations represented in the sample. It also calculates the Bayes factor in support of a single source population against the alternative of more than one source population..

             A full explanation of the Bayesian inference method can be found in the accompanying article (LINK).

 

User Notes

Contents


Data file format

Getting started

Introduction to Bayesian inference

A Bayesian approach to clustering

Sampling from the posterior distribution using Partition.exe

Processing the output using PartitionView.exe


 

Data file format

Data Input file format is the same as the population genetics program GENETIX (http://www.univ-montp2.fr/~genetix/genetix/genetix.htm). Essentially , the genotype data is arranged as a matrix in which the data for each individual are stored in a single row, where each column is the genotype data for a single locus. The above mentioned software can also be used to import files to and from several other common population genetics software packages (FSTAT, GENEPOP …).

4 <----- number of loci

2 <----- number of populations

aat1 <-----name of first locus

2 120 130 <----- number of alleles, followed by a list of alleles coded with 3 digits 

aat2 <----- name of second locus 

3 100 132 146 

adh

2 100 123 

est1

4 100 107 110 115

site1_1999 <----- name of first population (only the first 15 characters will be retained ) 

3 <----- population size

i001______ 120120 132132 100100 107110 For each locus (in the same order as in the header)

i002120130 132132 100100 107110 the genotype is given in 6 digits (3 digits per allele)

i003120120 100132 100100 110110

Population "j" <----- name of second population

5

j001______ 120120 132132 100100 107107

j002120130 132132 100123 107107

j003120120 100132 100100 107107

j004120120 100100 100123 107107

j005120120 100132 100123 107110

------------

­

ten characters to identify each individual

 

Getting started

Unzip the file Partition.zip to find the following three files:


Partition.exe

PartitionView.exe

test microsats.gtx


Partition.exe is a program for sampling from the posterior distribution of the partition of a sample of individuals, based on a Bayesian calculation. This program uses the Metropolis-Hastings algorithm to simulate a Markov chain. When it reaches equilibrium, this Markov chain samples the posterior distribution of the sample partition.

PartitionView.exe is a program for processing (and also monitoring in real time) the output from Partition.exe.


test microsats.gtx
 is an example data file in GENETIX format.

 

Introduction to Bayesian inference

             In Bayesian inference our "beliefs" or "uncertainty" about the unknown state of nature (the parameters of a model) are represented by a probability distribution over the parameter space. This probability distribution determines what bets we should be willing to make about the parameter values.

             First we must specify a prior distribution, representing our initial (prior) beliefs about all the parameters of the model. Then we use Bayes' Theorem to transform our prior distribution into our posterior distribution. This posterior distribution represents our beliefs after looking at the data.

             So, the input for the Bayesian inference procedure consists of the data, and a prior probability distribution for the parameters of the model. The output is also a probability distribution for the parameters of the model.

             This joint posterior distribution of all the parameters can easily be collapsed to obtain the marginal posterior distributions of individual parameters. In this way, we can make inferences about individual parameters of interest, even for models with very large numbers of parameters. Furthermore, each marginal posterior distribution provides a meaningful and coherent representation of our uncertainty about an individual parameter. Sometimes, the parameter of interest is more complex: for example, the assignment of individuals to source populations. Here too, our uncertainty can be usefully quantified using the Bayesian framework.

A Bayesian approach to clustering

    The (marginal) posterior distribution of the sample partition contains all the information that we need in order to assign the individuals in the sample to source populations. However, the number of ways of partitioning a set of  distinct elements into   non-empty disjoint (and unlabelled) sub-sets is equal to the Stirling number of the second kind. These numbers grow very rapidly with. Consequently, it is only in cases where the number of elements   is small, that a probability distribution over the space of partitions of the set can be visualised in its entirety. This need not be a problem if there are only a small number of partitions which have an appreciable share of the posterior probability, - as these partitions can easily be identified and listed along with their individual posterior probabilities. However, if there are many individuals that are difficult to assign, then there will be many plausible partitions, each one having an individually low posterior probability. So, what we need is a method for identifying "clusters" of individuals, whose assignment together is well supported by the posterior distribution.
    Our solution to this problem is to apply an agglomerative hierarchical clustering algorithm, where at each step, all possible fusions of the existing clusters are proposed, and we accept the fusion which generates the cluster of individuals having the highest posterior co-assignment probability. We refer to this as the exact linkage algorithm.
    The output from the exact linkage algorithm is a tree of all the individuals in the sample, where the height of each node is equal to the posterior probability that all the individuals in the se defined by that node belong to the same source population (this is the posterior co-assignment probability of the set). Clusters of individuals which are well supported by the posterior distribution, can be easily identified from this tree.

Sampling from the posterior distribution
using Partition.exe


How to begin

The following interface will appear on the screen.


Input

The data file has the GENETIX file format. In this file format, different "populations" can be identified. These "populations" will have been chosen by the user. They may reflect in some way the sampling strategy. For example, different "populations" may correspond to different sampling sites, or samples collected on different occasions. The program Partition.exe will take no account whatsoever of these user-defined "populations".

You will be prompted for a file name when you click on “RUN” button.

Choose parameter settings

Choose a file name for the partition output file.

Default value: result.txt

Choose parameter settings for the Markov chain :

Choose a value of the number of observations of the Markov chain to be written to the partition output file.

Range: 10 - 100000

Default value: 10000

Choose a value of the period between observations.

The period is the number of iterations (or cycles) of the Markov chain, between successive observations of the state of the Markov chain.

Range: 1 - 1000

Default value: 10

Choose parameter settings for the prior distribution :

This value must be greater than one, and must not exceed the number of individuals in the sample. 

Range: 2 - 100

Default value: 4


This parameter specifies the prior distribution on the allelic diversity within the source population. Lower values of theta mean lower allelic diversity, and higher values mean higher allelic diversity.

This value must be greater than zero. 

Range: 0.001 – 100.0

Default value : 1


This is a parameter of the prior probability distribution on the number of source populations k

When prior parameter u is = 1 the prior on k is uniformand when it’s less than 1, the prior on k decays geometrically, by this factor.

Range: 0.01 – 1.0

Default value: 1

Output from Partition.exe

             The output form the program Partition.exe, is simply the partition output file. (The default file name is “result.txt”.) This file contains a list of the parameter settings followed by the sequence of observations of the Markov chain.

Example file: 

toto.gtx ß data file name

200                          ß total number of individuals

4                              ß maximum number of possible source populations

10000                      ß number of observations

10                            ß period

1.000000                 ß prior parameter u

1.000000                 ß prior parameter theta

1                              ß the number of the observation

-4347.217879          ß the log likelihood of the partition

                             ß the number of source populations, k, represented in the sample

106                          ß the number of individuals assigned to first source population, followed by the list of individuals assigned to this population.

8              13            18           30            56           68            72           92            93           100          44           82            12           26            11           31                52 

             57            67           55                        99            62           1              14           22            25           28            33           37            42           58                 61 

64            65            86           94            96           152                      29            36           46            66           71            73           75            79           90                98 

78            32            74           17            80           76            38           20            24           16            15           77            19           54            45           88                 84 

             89            85           60            43           21            48           63            81                        97           49            35           195          39           69                 87 

47                         23           162                      149          83           59            34           50            177         40            53           139          91           41                 27 

119          95            10           166         

94                            ß the number of individuals assigned to second source population

106          124          127         143          170         176          183         189          191         198          146         116          134         135          169         128                 105 

133          196          114         136          141         153          122         144          140         168          110         155          121         172          132         117                 164 

179          200          123         186          165         103          112         185          150         108          137         151          142         175          182         163                130 

147          109          148         190          131         199          101         178          157         188          111         197          145         115          126         160                 129 

118          192          187         107          125         193          158         104          194         180          102         113          159         171          51           138                 167 

70            173          181         161          184         120          174         154          156        

2

-4354.480912 

98 

             13            18           30            56           68            72           92            93           44            82           12            26           11            57           67                 55 

             14            25           28            33           42            58           65            86           96            29           36            66           71            73           75                 79 

90            98            78           74            17           76            38           20            24           15            77           19            54           45            88           84                

89            85            60           43            48           81                        49            35           39            69           47            23                        83           59                 34 

53            91            27           10                        63            100         52            41           99                        80            46           37            51           61                 22 

177          32            152         50                        62            70           31            141         64            120         109          21          

102 

106          124          143         170          176         183          189         191          198         146          134         135          169         105          133         196                 114

136          153          122         140          168         110          121         172          132         117          164         179          200         123          186         165                 103 

112          185          150         108          151         175          182         163          130         148          190         131          101         178          157         111                 145 

126          160          129         118          192         107          125         193          158         104          180         102          113         171          138         181                 184 

174          154          156         149          195         144          94           119          197         167          159         40            173         127          142         115                 187 

147          199          97           188          162         161          155         137          95           128          139         87            16           194                      116                 166        

...

...

9                              ß the number of the observation

-4583.715074          ß the log likelihood of the partition

                             ß the number of source populations, k, represented in the sample

200                          ß the number of individuals assigned to the single source population (= the total number of individuals).

0                              ß When all individuals are assigned to a single population (k = 1), the list of individuals is omitted

10                           

-4583.715074 

200 

            

11

-4583.715074 

200 

            

             

In order to obtain useful output from the partition output file, we have to process it using the companion program PartitionView.exe.

It is possible (and sometimes useful) to start running the program PartitionView.exe, while the program Partition.exe is running. In this case, the program PartitionView.exe will read the  partition output file, as it is being written, and will display the path of the Markov chain (see below) as it evolves in real time.

Warning! Avoid running the make tree option of PartitionView.exe until the end of the analysis, as it can be slow.
 

Challenge! Feel free to develop your own program to process the output of Partition.exe.


Another challenge! Feel free to develop your own Markov chain sampler to generate output files that can be processed by PartitionView.exe.

Processing the output using PartitionView.exe

How to begin

The partition output file, generated by Partition.exe, is the input file for PartitionView.exe.

The interface of PartitionView.exe. consists of two pages, marked:

Infer k

Make tree

Click on the run button. A file dialogue box will appear prompting for the input file: partition output file.

Page: “Infer k”

Plot path of Markov chain

Example of display

This is a joint plot of the (1) the number of source populationsk in the partition, and (2) the log likelihood of the partition, against the number of the observation of the Markov chain.

             When the Markov chain has reached equilibrium, both these plots should show no persistent trends. So, these plots can be used to guide the choice of burn in.

             We only want to make use observations of the Markov chain, made after it has reached equilibrium. So, we usually discard a number of observations of the beginning of the Markov chain. The parameter burn in is the number of observations of the Markov chain which are discard. The default value (burn in = 1000) is usually sufficient.

The button save trajectories creates a trajectories outfile (default file name "trajectories"). Each line of this file contains the number of source populations k and the log likelihood, for an observed state of the Markov chain. The ith line corresponds to the ith observation  from the Markov chain.

Choose parameter settings

Choose a file name for the posterior k outfile.

Default value: posterior_k.txt

Choose a value of the burn in.

Range: 0 - one less than number of observations

Default value: 1000

             Every time a new value is chosen for burn in, the program PartitionView.exe automatically updates the posterior distribution of k , and the Bayes factor. So, you can easily experiment with different values for burn-in.

             The program PartitionView.exe, writes the posterior distribution of k, and the Bayes factor to the posterior k outfile. The format of this file is as follows:

Bayes factor :  1.841478


      
Posterior prob of  k
1          0.234838

2          0.427725

3          0.265390

4          0.058368

5          0.013680

6          0.000000

7          0.000000

Plot posterior distribution of k

             The posterior distribution of the number of source populations, k, is plotted as a histogram. This histogram is also stored in table form in a file (the posterior k outfile).

Interpretation of output : The posterior probability of each value of k , is a direct measure of the evidence in support of that value of k . The most probable value of k , may be used as a point estimate of k .

             Note that the posterior distribution of the number of source populations (and the Bayes factor B1 ) can prove to be unreliable if the data set is too small. (For discussion of this, see Dawson and Belkhir 2001.)

Calculate Bayes factor for  k = 1 against the alternative of  k > 1

             PartitionView.exe computes the Bayes factor in support of a single source population against the alternative of more than one source population.

             In many situations, the first question of interest is: do we have a random sample from a single panmictic population, or do we have something more complicated? One possible measure of the evidence in support of a single panmictic population, is the Bayes factorB1 for k = 1 against the alternative of k > 1.

This information is displayed on the screen, and at the same time it is written to the posterior k outfile.

Picture of interface .

Example file( See above )

Interpretation of output : The Bayes factor B1 is always greater than, or equal, to zero. If the Bayes factor is greater than 1, this means that the evidence favours the hypothesis that k = 1. If the Bayes factor is less than 1, this means that the evidence favours the alternative hypothesis that k > 1.

             In general, a Bayes factor is interpreted in much the same way as a likelihood ratio. The advantage of the Bayes factor is that it is less sensitive to the choice of prior, than the posterior distribution of k. Arguably, this particular Bayes factor, B1, is only relevant when the mode of the posterior distribution of k is at k = 1.

Page: Make Tree

Make tree

Parameter settings

Supplementary information about clustering and the tree

Plot node heights

Plot pairwise posterior co-assignment probabilities

Compute posterior co-assignment probability of any set
Conditioning on k

Make tree

    Here, an agglomerative hierarchical clustering algorithm (the exact linkage algorithm) is used to construct a rooted binary tree, where the height of each node is equal to the posterior co-assignment probability of the set of individuals defined by the node.

The resulting tree is  exported to a tree outfile, of standard format (the "Newick 8:45" format adopted by PHYLIP ). The tree outfile can be viewed using standard software, including TreeView and NJplot .


Trees can be viewed in colour using the Baobab software.

If you use TreeView, you must choose the phylogram option to view the tree correctly.


These trees are best viewed in the usual tree orientation (rooted in the ground). In this orientation, nodes with higher probability levels are located further up the tree.

The node heights of the n-1 internal nodes are also exported to the node heights outfile.

Note that, in the course of this algorithm, the posterior co-assignment probabilities of some additional  sets of individuals (which do not correspond to nodes of the tree) also have to be computed. The pairwise co-assignment probabilities (one for each pair of individuals) are always among the co-assignment probabilities which have to be computed in order to construct the tree. These pairwise co-assignment probabilities are exported to the pairwise probs outfile where they are arranged in a lower-triangular array, where the diagonal is filled in with 1s.


Picture of interface.

Parameter settings

Choose a file name for the tree outfile.
Default value: tree outfile name = tree

Choose a file name for the pairwise probs outfile.
Default value: pairwise probs outfile name = pairwise_probs

Choose a file name for the node heights outfile.
Default value: node heights outfile name = node_heights

Choose a file name for the node clusters outfile.
Default value: node clusters outfile name = node_clusters

Supplementary information about clustering
and the tree

    The exact linkage algorithm is an agglomerative hierarchical clustering algorithm. It is a special case of the maximin algorithm described in Dawson and Belkhir (2001). We now describe the exact linkage algorithm.
    As with any method of hierarchical clustering, we proceed by constructing a binary tree. Every individual in the sample is associated with a terminal node. Every internal node in a binary tree has two descendant nodes.  If there are   individuals in the sample, and hence n terminal nodes, then there will be n-1  internal nodes when the tree is completed. The n-1  internal nodes are labelled t = 1, 2, ..., n-1, where    is  the generation at which the node was created. Every internal node defines a set of terminal nodes (the set of all the terminal nodes that can be reached by descending the tree, starting from that node), and hence a set, or "cluster" of individuals.
    In the resulting tree, the height of node t , is equal to the posterior co-assignment probability of the set of individuals defined by node t . That is, the posterior probability that all the individuals in the set defined by this node belong to the same population. All terminal nodes are defined to have height equal to 1.
    For computational reasons, Dawson and Belkhir (2001) implemented a more general algorithm (the maximin algorithm) which, in general, differs form the exact linkage algorithm in that the co-assignment probabilities of the larger sets of individuals are replaced by certain upper bounds (which serve as approximations for the co-assignment probabilities). The exact linkage algorithm is a special case of the maximin algorithm of Dawson and Belkhir (2001). (In the notation of that paper, it is the case where the dimension   is equal to the sample size  .) The exact linkage algorithm includes a maximisation step, but no minimisation. So, the name maximin algorithm is no longer appropriate. (The exact linkage algorithm and the more familiar complete linkage algorithm are both special cases of the maximin algorithm. However, unlike the complete linkage algorithm, the exact linkage algorithm makes use of the exact co-assignment probabilities for sets of any size, - hence its name.)
    As recognised in Dawson and Belkhir (2001), the tree generated by exact linkage algorithm captures more of the information from the posterior distribution than any other version of the maximin algorithm. However, in our earlier work we implemented the maximin algorithm, with dimension   d = 2, 3 or 4, for computational reasons. We now recognise that it is feasible to implement this preferred version of the algorithm. So, it is the exact linkage algorithm which is recommended.

    The maximin algorithm (with dimension  d = 2, 3 or 4) was implemented in the earlier companion program, called Analyse.exe, which PartitionView.exe replaces. "Analyse.exe" was an unfortunate choice of name (as it has already been used in population genetics, and no doubt elsewhere). It might be better to give it the posthumous name PartitionView0.exe.

 

Plot node heights

Example of display


This is a plot of the height p(t) of the node created at generation t , against the generation t  (where t = 1, 2, ..., n-1). The height p(t) of the node t  is the posterior co-assignment probability of the set of individuals defined by node t . This plot shows the progress of the clustering algorithm. (For details, see Dawson and Belkhir 2001).

Interpretation of output
    It is important to recognise that the node height p(t) has no frequentist interpretation. It is a posterior probability, which depends both on the observed data and the choice of prior. So it is a subjective probability. This raises an important question: How low should we allow p(t)t to fall, before we stop accepting further agglomeration? More experience with the method will be needed to resolve this. Provisionally, we recommend using the graph of p(t)  against  t  as a guide, paying particular attention to any values of t  where the probability level falls more sharply than usual.
 

Plot pairwise posterior co-assignment probabilities

    It is often instructive to choose an individual, say I, and generate a plot of the posterior probability that individual I belongs tot he same population as individual J (the co-assignment probability of I and J) for all individuals J (different from I ). (Recall that the pairwise prob outfile , contains all these pairwise co-assignment probabilities.)
    To use this option, change the number in the individual box. (This number corresponding to the position of that individual in the data file.) This will automatically up-date the plot. It would be very time consuming to do this for each individual. However, it should be enough to view this plot for a few different individuals. (Preferably for individuals that are far apart, - so that every pair among these selected individuals has a low co-assignment probability.) This can help us to visualise how clearly defined the different clusters are.


Compute posterior co-assignment probability of any set of individuals


    There are situations where we may want to compute the posterior co-assignmement probability of a set of individuals which does not correspond to any node of our tree. For example, we may want to evaluate the evidence that individuals taken from a particular sampling site all belong to the same population.)
    To answer such questions, first we use the arrow buttons to choose a set of individuals, by moving them form the box on the left to the box on the right (or back again, if necessary). (As usual, the numbers correspond to the position of each individual in the data file.) Then the  posterior co-assigmnent probability of the chosen set is updated. (The posterior probability that all the individuals in this set come from the same population.)

    A more detailed picture can be obtained by looking at the plot of the values of the conditional posterior co-assignement probabilities. (For example, conditional upon k lying in the interval from k = 2 to k = 4.) To specify such a condition, go back to condition on range of k.

Conditioning on k

             It is also possible to analyse a posterior distribution (of the sample partition, and k) which is conditioned upon values of k lying in a particular range. For example, from k = 2 to k = 4. This means that the co-assignment probabilities are calculated using only those observations of the Markov chain, where the number of source populations lies in this range.

(The default values for condition on range of k are:

from k = 1 to k = maximum number of possible source populations.

This introduces NO conditioning. We recommend that the default values are used, unless the user has a strong reason for conditioning.)

Output from PartitionView.exe


The program PartitionView.exe produces the following output files.

posterior k outfile (default file name "posterior_k.txt"). This file contains: the posterior distribution of the number of source populations k represented in the sample; and the Bayes factor  B1 for k = 1 against the alternative of k > 1.

tree outfile (default file name "tree.txt"). This file contains a representation of the tree in a standard format (the "Newick 8:45" format adopted by PHYLIP).

pairwise probs outfile (default file name "pairwise_probs.txt"). This file contains the pairwise co-assignment probabilities (one for each pair of individuals), arranged in a lower-triangular array, where the diagonal is filled in with 1s.

node heights outfile (default file name "node_heights.txt"). Each line of this file contains the generation t at which a node was created, and the height p(t) of this node, There is one line for each internal node of the tree.

node clusters outfile (default file name "node_clusters.txt"). This file is similar to node heights outfile, but contains additional information. Each line of this file contains the generation t at which a node was created, the height p(t) of this node, and the cluster of individuals defined by this node. There is one line for each internal node of the tree.

The button save trajectories creates a trajectories outfile (default file name "trajectories"). Each line of this file contains the number of source populations k and the log likelihood, for an observed state of the Markov chain. The ith line corresponds to the ith observation  from the Markov chain.

Related Publications

A Bayesian approach to the identification of panmictic populations and the assignment of individuals.

2001. Genetical Research 78, 59-77.


 
 
 
 
 
 

 

 

 

Warnings and bug alerts :

 

Warnings for version 1.2:

 

Some bugs remained in Analyse.exe in Partition 1.2. These bugs involved memory allocation.

As a consequence, problems were sometimes encountered when analysing large data sets.

These bugs have been eliminated from the latest release, - Partition 2.0.  

 

Analyse.exe still has a high memory requirement which grows with the number of individuals in the sample,

and with the range of the posterior distribution of the number of source populations.

 

We have tested Partition 2.0 on samples of 1000 individuals.

 

 

 

Warnings for version 1.0:


In the first version of Partition (1.0) there was a bug !
If the user choose Kmax > 10 Partition 1.0 failed to sample from the correct posterior distribution. !
So if you chose Kmax = 11 or 12 then you need to reanalyse your data with the new version (1.1) !

In this new version (1.1) the user can choose any value of kmax from 2 to 24.

In this version we have also eliminated  a bug  in Analyse.exe  which occured when  the data filename
contained any blank spaces.

 

N.B. Missing data are not allowed in this version, so they must be discarded from your data file !!!

 


Download :

 

Download the version 2.0  from this link or this one and unzip it to a new folder.
Partition is primarily designed for the Windows operating system (W9x , Win me and Windows 2000).
However, the
Wine package makes it possible to run native Windows applications alongside native
Linux applications on a Linux (or Solaris) system.