PartitionML:

a maximum likelihood estimation of the best partition of a sample into panmictic units.

Khalid Belkhir and FrançoisBonhomme

Laboratoire Génome, Populations, Interactions, CNRS UMR 5000, Université de Montpellier II, Place Eugène Bataillon, 34095 Montpellier Cedex 5, France.


Related publication :  
Castric V., Belkhir K., Bernatchez L., & Bonhomme F., 2002 - Heterozygote Deficiencies in Small Lacustrine Populations of Brook Charr Salvelinus Fontinalis Mitchill (Pisces, Salmonidae): A Test of Alternative Hypotheses. Heredity, 89, (1), 27-35.


Summary :

PartitionML is a program that searches for the best possible partition of a sample into independent panmictic clusters and simultaneously assign individuals to them using a maximum likelihood (ML) criterion.

 

Download : The program is available at this link.

 

Contact : Belkhir@univ-montp2.fr



Within-species comparative genomics is an important field of modern biology. Analysis of the molecular variation at the population level is essential in numerous applications including forensics, identity and paternity investigations, epidemiology, biodiversity and resource management. 
 

DNA profiling data (microsatellites, RAPDs, SNPs) at several loci of a number of individuals of a species allows one either to deduce some information about potential source populations or to assign individual of unknown origin to predefined populations.
 

Under panmixy and linkage equilibrium, genotype frequencies in the sample should follow Hardy-Weinberg equilibrium, where the frequency of a genotype is the product of the frequencies of the alleles it includes. The existence in a sample of individuals from subpopulations with distinct allele frequencies will result in a departure from this equilibrium.
 

PartitionML addresses the issues of detecting such hidden population structure and assigning individuals to their population of origin.
 

Paetkau et al. (1995) and Rannala and Mountain (1997) have developed assignment methods based on the probability of drawing a multilocus genotype from a range of candidate populations. Their approach is to compute, for each individual and each candidate population, an assignment index reflecting the probability that this particular individual belongs to a particular source population. These genotypic likelihoods calculated for each candidate population are then compared and the individual is assigned to the population for which the likelihood is the highest.
 

However this approach depends on the assumption that the samples in the data set are representative of the source populations. This assumption can be violated if the sampling of individuals is made independently of any a priori knowledge of the underlying number of independent groups where reproduction has taken place. Another limitation is the fact that individuals are assigned successively and independently from each other so the method do not updates our knowledge on the populations when the result of the assignment of an individual changes the populations composition. Ideally, we would like to identify populations, and assign individuals to populations simultaneously.
 

The present approach is aimed at circumventing both those difficulties.
 

The ML calculation we used is similar to the one presented in the genetic mixture analysis model used by Smouse et al. (1990) in conjunction with an EM algorithm to test the contribution of predefined stocks to a catch of Pacific salmon. Alternatively and without any a priori knowledge about the source populations, PartitionML implements a “Simulated Annealing” method (Kirkpatrick et al. 1983) that allows to test for a varying number of underlying source populations contributing to a sample and to assign individuals to them.
 

Consider the case where we are testing the hypothesis of a mixture of 2 gene pools in our sample. For each possible partition of the N sampled individuals into 2 clusters of size N1 and N2 (N1 + N2 = N), the individual i has a priori a probability N1/N of coming from cluster 1, and N2/N from cluster 2. Thus, the probability of drawing this particular multi-locus genotype from the mixture is Li = (N1/N * Li1) + (N2/N * Li2 ), where Li1 (respectively Li2) is the probability of drawing this multi-locus genotype from the source population corresponding to the first (respectively second) cluster. The likelihood of the partition is then given by the product of the N individual multilocus likelihoods L i (i = 1, 2, ..., N). 
 

The probability of a particular monolocus genotype being drawn from a population is calculated as fp = xlP² for homozygotes, and fp = 2 xlP xl’P for heterozygotes (where xlP is the frequency of the lth allele at this locus in the target population p, usually the unknown xl p’s being estimated by their observed value in the sample p). These values are calculated for each locus and are then multiplied to give the likelihood of the individual’s multilocus genotype (Lip ) for the population p.
 

To seek for the best possible partition of a sample, PartitionML implements a Metropolis-Hastings algorithm ( Gilks et al. 1996) to simulate a Boltzman probability distribution of a system whose possible configurations are the different partitions of the sample in k clusters, and whose objective function E is proportional to the likelihood which we are seeking to maximize. The transition between the configurations is obtained by a random transfer of individuals between the k clusters.
 

The number of independent panmictic source populations (or clusters) assumed for a particular run of the program is set a priori by the user. Depending on the nature of the data analyzed, there may be a unique “natural” value to use. Otherwise a number of different values of k will have to be considered. To compensate for the fact that likelihood maximization intrinsically favors partitions with more clusters one can use Likelihood Ratio Tests to compare runs with different number of source populations.
 

Thus under certain regularity conditions, minus twice the log of the likelihood ratio has approximately (in large samples) a Chi-square distribution with degrees of freedom equal to the difference in the number of parameters between two models (see  Crawley (1993)) . A partition P2 with more source populations and likelihood L2 , is only accepted in preference to a partition P1 with fewer parameters (and likelihood L1) if -2 * ln(L2/L1) is greater than the 5% threshold of a Chi-square with degrees of freedom df = df 2–df1 (where dfi = ki å l (Jl-1)with k i = number of source populations in the model i and Jl = number of alleles at the locus l).
 

F ollowing Smouse et al. (1990)  u nder the null hypothesis of k source populations the optimal number of source populations to be retained can be obtained by incrementing k until the null hypothesis is not rejected by the "k+1 vs. k" likelihood Ratio Test.
 

Other approaches of the problem implying more complex Bayesian statistics have been implemented by Pritchard et al. (2000) and Dawson and Belkhir (in press). The present ML algorithm is a fast and complementary alternative to those methods.
 

Data Input file format is the same as the population genetics program GENETIX (http://www.univ-montp2.fr/~genetix/genetix.htm ). Essentially , the genotype data is arranged as a matrix in which the data for each individual are stored in a single row, where each column is the genotype data for a single locus. The above-mentioned software can also be used to import files to and from several other common population genetics software packages.
 

The program produces a visual output for monitoring the progress of the algorithm toward the best partition, i.e. a graph where the Log Likelihood of the partition is plotted at each iteration. In the output file, the maximum likelihood partition is represented by listing the individuals belonging to each cluster. The program also produces a data file in input format, where individuals have been reassigned to populations corresponding to the maximum likelihood partition. 
 

The program was written in DELPHI 4.0 and runs on Windows platforms. It has been tested on various real and simulated data set, and runs reasonably fast on a modern PC. 
 



Acknowledgements

Critical discussions of K. Dawson, N. Galtier and N. Raufaste are gratefully acknowledged. 



References :

Dawson,K.J. and Belkhir,K. (2001). A Bayesian approach to the identification of panmictic populations and the assignment of individuals, Genetical Research (in press).

Gilks,W.R., Richardson,S. and Spiegelhalter,D. J. (Eds.) (1996). Markov Chain Monte Carlo in Practice. London, Chapman and Hall.

Kirkpatrick,S., Gelatt,C.D. and Vecchi,M.P. (1983), Optimization by simulated annealing. Science, 220, 671–680.

Paetkau,D., Calvert,W., Stirling,I. and Strobeck, C. (1995) Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol ., 4, 347-354. 

Pritchard,J.K., Stephens,M. and Donnelly,P. (2000). Inference of population structure using multilocus genotype data. Genetics 155, 945-959.

Crawley,M.J. (1993). GLIM for Ecologists. Blackwell Scientific Publications, Oxford, U.K.

Rannala,B. and Mountain,J.L. (1997) Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA, 94, 9197-9201.

Smouse,P.E., Waples,R.S. and Tworek,J.A. (1990). A genetic mixture analysis for use with incomplete source population data. Canadian Journal of Fisheries and Aquatic Science 47, 620-634.