PartitionML:
Khalid Belkhir and FrançoisBonhomme
Laboratoire Génome, Populations, Interactions, CNRS UMR 5000, Université de Montpellier II, Place Eugène Bataillon, 34095 Montpellier Cedex 5, France.
PartitionML is a program that searches for the best possible partition of
a sample into independent panmictic clusters and simultaneously assign individuals
to them using a maximum likelihood (ML) criterion.
Download
: The program is available at this
link
. .
Contact
: Belkhir@univ-montp2.fr
DNA profiling data (microsatellites, RAPDs, SNPs) at several loci of a number
of individuals of a species allows one either to deduce some information
about potential source populations or to assign individual of unknown origin
to predefined populations.
Under panmixy and linkage equilibrium, genotype frequencies in the sample
should follow Hardy-Weinberg equilibrium, where the frequency of a genotype
is the product of the frequencies of the alleles it includes. The existence
in a sample of individuals from subpopulations with distinct allele frequencies
will result in a departure from this equilibrium.
PartitionML addresses the issues of detecting such hidden population structure
and assigning individuals to their population of origin.
Paetkau et al. (1995) and Rannala and Mountain (1997) have developed assignment
methods based on the probability of drawing a multilocus genotype from a
range of candidate populations. Their approach is to compute, for each individual
and each candidate population, an assignment index reflecting the probability
that this particular individual belongs to a particular source population.
These genotypic likelihoods calculated for each candidate population are then
compared and the individual is assigned to the population for which the likelihood
is the highest.
However this approach depends on the assumption that the samples in the
data set are representative of the source populations. This assumption can
be violated if the sampling of individuals is made independently of any
a priori knowledge of the underlying number of independent groups where
reproduction has taken place. Another limitation is the fact that individuals
are assigned successively and independently from each other so the method
do not updates our knowledge on the populations when the result of the assignment
of an individual changes the populations composition. Ideally, we would like
to identify populations, and assign individuals to populations simultaneously.
The present approach is aimed at circumventing both those difficulties.
The ML calculation we used is similar to the one presented in the genetic
mixture analysis model used by Smouse et al. (1990) in conjunction with an
EM algorithm to test the contribution of predefined stocks to a catch of Pacific
salmon. Alternatively and without any a priori knowledge about the
source populations, PartitionML implements a “Simulated Annealing” method
(Kirkpatrick et al. 1983) that allows to test for a varying number of underlying
source populations contributing to a sample and to assign individuals to them.
Consider the case where we are testing the hypothesis of a mixture of 2
gene pools in our sample. For each possible partition of the N sampled individuals
into 2 clusters of size N1 and N2 (N1 + N2 = N), the individual i has
a priori a probability N1/N of coming from cluster 1, and N2/N from
cluster 2. Thus, the probability of drawing this particular multi-locus
genotype from the mixture is Li = (N1/N * Li1) + (N2/N
* Li2 ), where Li1 (respectively Li2) is
the probability of drawing this multi-locus genotype from the source population
corresponding to the first (respectively second) cluster. The likelihood
of the partition is then given by the product of the N individual multilocus
likelihoods L i (i = 1, 2, ..., N).
The probability of a particular monolocus genotype being drawn from a population
is calculated as fp = xlP² for homozygotes,
and fp = 2 xlP xl’P for heterozygotes
(where xlP is the frequency of the lth allele at
this locus in the target population p, usually the unknown xl
p’s being estimated by their observed value in the sample p). These
values are calculated for each locus and are then multiplied to give the likelihood
of the individual’s multilocus genotype (Lip ) for the
population p.
To seek for the best possible partition of a sample, PartitionML implements
a Metropolis-Hastings algorithm (
Gilks et al. 1996)
to simulate a Boltzman probability distribution of a system whose possible
configurations are the different partitions of the sample in k clusters,
and whose objective function E is proportional to the likelihood which we
are seeking to maximize. The transition between the configurations is obtained
by a random transfer of individuals between the k clusters.
The number of independent panmictic source populations (or clusters) assumed
for a particular run of the program is set a priori by the user.
Depending on the nature of the data analyzed, there may be a unique “natural”
value to use. Otherwise a number of different values of k will have to be
considered.
To compensate for the fact that likelihood maximization intrinsically favors
partitions with more clusters one can use Likelihood Ratio Tests to compare
runs with different number of source populations.
Thus under certain regularity conditions, minus twice the log of the likelihood
ratio has approximately (in large samples) a Chi-square distribution with
degrees of freedom equal to the difference in the number of parameters between
two models (see
Crawley (1993))
. A partition P2 with more source populations and likelihood L2
, is only accepted in preference to a partition P1 with fewer parameters
(and likelihood L1) if -2 * ln(L2/L1) is
greater than the 5% threshold of a Chi-square with degrees of freedom df
= df 2–df1 (where dfi = ki *
å
l
(Jl-1)with k
i = number of source populations in the model i and Jl
= number of alleles at the locus l).
F
ollowing Smouse et al. (1990)
u
nder the null hypothesis of k source populations the optimal number of source
populations to be retained can be obtained by incrementing k until the null
hypothesis is not rejected by the "k+1 vs. k" likelihood Ratio Test.
Other approaches of the problem implying more complex Bayesian statistics
have been implemented by Pritchard et al. (2000) and Dawson and Belkhir (in
press). The present ML algorithm is a fast and complementary alternative
to those methods.
Data Input file format is the same as the population genetics program GENETIX
(http://www.univ-montp2.fr/~genetix/genetix.htm
). Essentially
,
the genotype data is arranged as a matrix in which the data for each individual
are stored in a single row, where each column is the genotype data for a
single locus. The above-mentioned software can also be used to import files
to and from several other common population genetics software packages.
The program produces a visual output for monitoring the progress of the
algorithm toward the best partition, i.e. a graph where the Log Likelihood
of the partition is plotted at each iteration. In the output file, the maximum
likelihood partition is represented by listing the individuals belonging
to each cluster. The program also produces a data file in input format, where
individuals have been reassigned to populations corresponding to the maximum
likelihood partition.
The program was written in DELPHI 4.0 and runs on Windows platforms. It
has been tested on various real and simulated data set, and runs reasonably
fast on a modern PC.
Critical discussions of K. Dawson, N. Galtier and N. Raufaste are gratefully
acknowledged.
Dawson,K.J. and Belkhir,K. (2001). A Bayesian approach to the identification
of panmictic populations and the assignment of individuals, Genetical Research
(in press).
Gilks,W.R., Richardson,S. and Spiegelhalter,D. J. (Eds.) (1996). Markov
Chain Monte Carlo in Practice. London, Chapman and Hall.
Kirkpatrick,S., Gelatt,C.D. and Vecchi,M.P. (1983),
Optimization by simulated annealing.
Science, 220, 671–680.
Paetkau,D., Calvert,W., Stirling,I. and Strobeck, C. (1995) Microsatellite
analysis of population structure in Canadian polar bears. Mol. Ecol
., 4, 347-354.
Pritchard,J.K., Stephens,M. and Donnelly,P. (2000). Inference of population
structure using multilocus genotype data. Genetics
155, 945-959.
Crawley,M.J. (1993). GLIM for Ecologists. Blackwell Scientific Publications,
Oxford, U.K.
Rannala,B. and Mountain,J.L. (1997) Detecting immigration by using multilocus
genotypes. Proc. Natl. Acad. Sci. USA, 94, 9197-9201.
Smouse,P.E., Waples,R.S. and Tworek,J.A. (1990). A genetic mixture analysis
for use with incomplete source population data. Canadian Journal of Fisheries
and Aquatic Science 47, 620-634.