Genetics, Vol. 164, 1189-1204, July 2003, Copyright © 2003

Maximum-Likelihood and Markov Chain Monte Carlo Approaches to Estimate Inbreeding and Effective Size From Allele Frequency Changes

Guillaume Lavala, Magali SanCristobalb, and Claude Chevaletb
a Computational and Molecular Population Genetics Laboratory, Zoology Institute, University of Bern, 3012 Bern, Switzerland
b Laboratoire de Génétique Cellulaire, Institut National de la Recherche Agronomique, 31326 Castanet-Tolosan, France

Corresponding author: Magali SanCristobal, INRA, Chemin de Borde-Rouge--Auzeville, BP 27, 31326 Castanet-Tolosan Cedex, France., msc{at}toulouse.inra.fr (E-mail)

Communicating editor: S. W. SCHAEFFER


*  ABSTRACT
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

Maximum-likelihood and Bayesian (MCMC algorithm) estimates of the increase of the Wright-Malécot inbreeding coefficient, Ft, between two temporally spaced samples, were developed from the Dirichlet approximation of allelic frequency distribution (model MD) and from the admixture of the Dirichlet approximation and the probabilities of fixation and loss of alleles (model MDL). Their accuracy was tested using computer simulations in which Ft = 10% or less. The maximum-likelihood method based on the model MDL was found to be the best estimate of Ft provided that initial frequencies are known exactly. When founder frequencies are estimated from a limited set of founder animals, only the estimates based on the model MD can be used for the moment. In this case no method was found to be the best in all situations investigated. The likelihood and Bayesian approaches give better results than the classical F-statistics when markers exhibiting a low polymorphism (such as the SNP markers) are used. Concerning the estimations of the effective population size all the new estimates presented here were found to be better than the F-statistics classically used.


MONITORING genetic diversity in animal populations has become an important concern for institutions involved in the preservation of wild life and of endangered species, as well as in animal breeding. In the latter case, a few breeds are selected with high selection pressure, while other breeds are no longer extensively used and are faced with risks of extinction. In both cases, severe reductions in the genetic effective size are observed, leading to the loss of genetic diversity. Measuring this loss can be performed by means of historical surveys of allelic frequencies at a number of polymorphic loci. Such time-spaced sampling protocols have been applied in a wide range of populations, from natural and domestic taxa (WAPLES 1990 Down; KANTANEN et al. 1999 Down). Furthermore, the development of systematic molecular analysis, through hybridization onto DNA chips or high-throughput mass spectrometry analyzers of single-nucleotide polymorphism, with expected lower costs, will allow population geneticists to perform large-scale DNA analysis and make this type of study probably very common in the future.

Consider a neutral marker exhibiting k alleles and an isolated population of effective size N, which is described by its allele frequencies p0 = (p0,1, ... , p0,i, ... , p0,k) and pt = (pt,1, ... , pt,i, ... , pt,k) (for i = 1, ... , k), in a first generation, named the founder generation, and t generations later, respectively. In the tth generation the expectation and variance of the allelic frequencies distribution are given by

(1)


(2)

when genetic drift is assumed during t discrete and nonoverlapping generations. The quantity 1 - (1 - 1/2N)t, which can be viewed as the growth from the founder generation of the Wright-Malécot inbreeding coefficient (WRIGHT 1931 Down; MALECOT 1948 Down), will be denoted Ft and used as a measure of time.

A theoretical framework based on F-statistics proposed by KRIMBAS and TSAKAS 1971 Down was developed by NEI and TAJIMA 1981 Down and POLLAK 1983 Down to derive the variance effective size N of populations from the approximated relation = t/2t in which t, the number of generations between two samplings, is known. The estimations of Ft and of the variance effective size performed in a natural population give the increase in the inbreeding coefficient and the size of a Wright-Fisher population that would experience a comparable increase in variance of gene frequency over time.

In fact, estimates of allele frequencies are obtained by sampling a number of individuals in the population, which suggests using the theory of coalescence. Following it (Fig 1, right, broken arrows), the probability to get a sample mt = {mt,1, ... , mt,i, ... mt,J0,·} at time t, given the true initial frequencies p0 = {pt,1, ... , pt,i, ... pt,J0,·}, is obtained by conditioning on the true number J0,· = {sum}i J0,i of original genes, or lines of descent, from which the final sample originates. The probability distribution of J0,·, given the size mt = {sum}imt,i of the final sample, can be derived exactly (TAVARE 1984 Down, Sect. 7.3). Under that condition, the probability of the observed sample mt is obtained by summing over the possible drawings J0,i of J0,· genes from the original population with the combinatorial distribution between the partitions J0,i and mt that is independent of time (WILSON and BALDING 1998 Down; BEAUMONT 1999 Down).



View larger version (16K):
In this window
In a new window
Download PPT slide
 
Figure 1. Graph showing different ways to get the probability of a sample drawn from the tth generation. The coalescence approach is presented in the right-hand side with broken arrows, and the alternative approach, developed in the present article, is in the left-hand side with solid arrows.

In the following, we consider the alternative approach (Fig 1, left, solid arrows), introducing the direct transition between initial allele frequencies p0 and allele frequencies pt at time t, followed by the final (multinomial) sampling giving the observed sample mt. Asymptotic distribution of allele frequencies is known under special cases involving mutations between a finite number of alleles and leading to Dirichlet distributions (WRIGHT 1951 Down), or under the infinite-alleles model, leading to distributions such as Ewens' sampling formula (EWENS 1972 Down) and related results (TAVARE 1984 Down). In the present context we neglect effects of mutations since we refer to medium or small populations analyzed over a short interval of time during which mutations are not expected to play an important role in the change of allelic frequencies. Hence, we refer to the simple drift process as the only source of frequency fluctuations.

The exact transition due to drift has no simple analytical expression. We therefore investigated approximations of joint allele probability using the Dirichlet distribution (BALDING and NICHOLS 1995 Down; HOLSINGER 1999 Down). The parameters of the Dirichlet are adjusted to fit with the known moments of the drift process. As a by-product, we use the same model as described in KITADA et al. 2000 Down to directly estimate the increase of the inbreeding coefficient and derive from it the variance effective size. Since a model based on the Dirichlet approximation at time t implicitly set pt,i != 0 for every allele existing in the founder sample, it is clear that this first model cannot take into account the loss of alleles due to the drift process. We introduced a second model, which is an admixture of allele fixation and loss probabilities (CHEVALET 2000 Down) and the Dirichlet distribution, to consider that one or more founder alleles can disappear (pt,i = 0 for some i).

Distributions of the sample mt are obtained conditionally on initial exact frequencies p0. In this article we therefore consider the following two cases. First, we propose maximum-likelihood estimates based on two models, Dirichlet only and Dirichlet and allele loss probabilities admixture, when founder frequencies are known. This theoretical situation allows us to discuss the relevance of the Dirichlet approximation and the benefit given by the introduction of the allele loss probabilities.

Second, the founder frequencies are estimated from a small sample of founder animals. In such a case the second model taking allele loss into account is hardly tractable. Developing unbiased estimates requires further statistical developments that are not presented in this article. Here, we consider only the first model (Dirichlet only) in which the sampling process within the founder generation is dealt with either from a heuristic point of view, checking for possible corrections of maximum likelihood estimates, or using a Monte Carlo Markov chain (MCMC) algorithm in a Bayesian context.

Comparisons with two F-statistic methods, one introduced by NEI and TAJIMA 1981 Down and promoted by BARKER et al. 1998 Down and another that is derived from Reynolds' genetic distance (REYNOLDS et al. 1983 Down), were performed using computer simulations focused on short-term evolution. The largest simulated values of Ft were chosen to be slightly >0.1, i.e., 10 generations of drift for an effective population size of 50.

Since the number of generations between the founder and the tth generation is known, we test the accuracy of the estimations of N from the maximum-likelihood estimations of Ft using the classical formula = t/2t. We also introduce a corrected estimate of N.

Furthermore, to test these new methods (maximum-likelihood estimates and MCMC algorithm) to a real data set, a French snail population (a species of importance for French consumers) was analyzed.


*  STATISTICAL BACKGROUND
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

Multinomial-Dirichlet model:
In the following sections, this model is called multinomial-Dirichlet (MD). We assumed that the allele-frequency distributions in the current generation can be approximated by a Dirichlet distribution (BALDING and NICHOLS 1995 Down; HOLSINGER 1999 Down; KITADA et al. 2000 Down), with parameters {alpha}t = ({alpha}t,1, ... , {alpha}t,i, ... , {alpha}t,k),

where {sum}i {alpha}t,i = At. Parameters {alpha}t can be related to Ft by adjusting the first two moments under the genetic drift model (Equation 1 andEquation 2) and under the statistical Dirichlet model. This leads to

(3)

It may be shown that the drift and the Dirichlet distributions are only approximately equal, since third moments are different in an amount proportional to F2t, indicating that the approximation is valid only for small Ft values.

Taking account of gene sampling in the tth generation is performed as follows: for one locus, the gene sampling gives partitions mt = (mt,1, ... , mt,i, ... , mt,k). Let us denote the total number of sampled alleles, i.e., twice the number of individuals, as {sum}i mt,i = mt. Assuming that the sample stage is described by a multinomial drawing, the whole model is a compound multinomial-Dirichlet model:

Accordingly, an approximate likelihood MD of mt given p0 and At (after integrating pt out) can be written as

(4)

Allele loss taken into account:
In the following sections, this model is called multinomial-Dirichlet and allele loss (MDL). The Dirichlet distribution is no longer valid when fixation or loss of alleles occurs and is expected to bias the likelihood since any null allele frequency is considered as the result of final sampling only. Taking into account a possible loss of alleles during the drift process implies that f(pt|p0, Ft) is written as a mixture of discrete and continuous terms, introducing the probabilities that some alleles can be lost. Let S be a state in which some of the alleles are lost; then

(5)

Pr(S|p0, Ft) is the probability of getting state S at reduced time Ft from the initial conditions p0 and is derived from the probabilities that alleles are fixed,

(6)

which are given by solutions of the classical partial differential equation of KIMURA 1955 Down. Then Tr(pt|S, p0, Ft) stands for the distribution of transient frequencies pt,i, excluding null frequencies, conditional on state S, on "time" Ft, and on initial conditions p0,i. Appendix A shows the method for the calculation of the probabilities of S states and to derive expectations qS,i of nonnull frequencies under the various S conditions. Following the same heuristics as before, we approximated the transient distributions by Dirichlet distributions, adjusting their parameters to the genetic drift model.

In practice, the following approximations were used. Fixation probabilities were approximated following CHEVALET 2000 Down, and the {alpha}S,i parameters of the Dirichlet characterizing a transient state S were set to

keeping a single unconditional At parameter, and taking

(7)

where p0,(S-lost) is the total of initial frequencies of lost alleles in state S, both approximations being justified by the short-term scope of the present scenario.

Writing f(pt|p0, Ft) as a mixture (Equation 5), the likelihood of a sample becomes a mixture involving various S terms:

Since the first term Pr(mt|pt, S) is zero for any S state in which one allele is observed in the sample while it is of null frequency in state S, there are only one or two terms in the previous likelihood. If all the alleles are observed in the sample (state S1), then the likelihood MDL is equivalent to MD inEquation 4, weighted by Pr(S1|p0, Ft). Otherwise, at least one allele is not observed in the sample. In that case, two S states must be considered, the one in which all frequencies pt,i are positive but sampling did not allow some alleles to be observed (state S1) and the state that indicates that unobserved alleles have been lost (state S2). For example, for three alleles (see Appendix A), if mt,1 > 0, mt,2 > 0, and mt,3 = 0, the whole likelihood MDL is given by

(8)

with

where 111 stands for state S1 in which the three alleles have been kept, and 110 for state S2 in which the third allele has been lost during the drift process.


*  ESTIMATION PROCEDURE
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

The likelihoods MD and MDL, inEquation 4 andEquation 8, depend on the true founder frequencies. Thus we consider the following two cases:

Known founder frequencies:
This situation allows us to check by simulation the validity of the Dirichlet approximation made in this study. However, this situation can be found in some selected breeds in which all founder animals are known and can be genotyped.

Assuming statistical independence between loci ({ell}), the maximization of the multilocus log-likelihood

in which {ell} = MD in model MD (Equation 4) and {ell} = MDL in model MDL (Equation 8), was performed using a quasi-Newton algorithm. In model MD and in model MDL, the maximum-likelihood estimators are named MD(ML) and MDL(ML), respectively.

Estimated founder frequencies:
For one locus, gene sampling within the founder generation gives partitions m0 = (m0,1, ... , m0,i, ... , m0,k). Let the total number of alleles sampled be {sum}i m0,i = m0,·. The estimations of Ft can be based on likelihoods (Equation 4 andEquation 8), where p0 are replaced by consistent estimators x0 (x0,i = m0,i/m0,·). Since maximum-likelihood estimations do not explicitly account for the sampling process in the founder generation, this was performed as follows.

Model MD (Dirichlet): MD(ML) shows a positive bias of 1/m0,· (data not shown), corresponding to REYNOLDS et al. 1983 Down distance between the sample and the founder generation in a Multinomial sampling scheme. Hence, as in KRIMBAS and TSAKAS 1971 Down, a correction MD(corrML) = MD(ML) - 1/m0,· was proposed.

Alternatively, a full Bayesian approach can be implemented, taking advantage of prior knowledge on parameters At and p0 through Gamma and Dirichlet distributions, respectively, as

(9)


(10)

with {alpha}0 = ({alpha}0,1, ... , {alpha}0,k). Computer simulations showed that maximum a posteriori estimates of Ft were biased, with biases depending on sample size (data not shown), and could not be simply corrected as was the case for MD(ML). Therefore, Bayes estimates, noted MD(MC) in the following, were derived from the mean of the marginal a posteriori distribution of parameters of interest, f(Ft|m0, mt), obtained by means of a MCMC approach using a Metropolis-Hasting algorithm (METROPOLIS et al. 1953 Down; HASTING 1970 Down; and Appendix B).

Model MDL (Dirichlet and allele loss): Combining the model accounting for gene loss and the sampling process in the founder generation (p0 are replaced by x0) did not give satisfactory results. No simple rule was found to obtain unbiased estimates through maximization of likelihood or of a posteriori probability. Extending the MCMC algorithm was postponed for future studies.

Bias correction for F-statistics:
As in WILLIAMSON and SLATKIN 1999 Down, we used the F-statistic NT (the index t is omitted) of NEI and TAJIMA 1981 Down corrected for reduced sample size:

(11)

We also used Reynolds' F-statistic R (REYNOLDS et al. 1983 Down), given by

(12)

in which, for generations 0 and t, {sum}i x2i is replaced by ({sum}i x2i - (1/m·))/(1 - (1/m·)) (NEI 1978 Down).

Multilocus estimation and variance prediction for F-statistics:
Denoting {ell} a single-locus estimate (Equation 11 orEquation 12), multilocus estimates were written as

(13)

where n0,{ell} are the observed numbers of founder alleles at locus {ell} (POLLAK 1983 Down). Weighting by n0,{ell} - 1 allows heterogeneity of information between markers to be taken into account in minimizing the variance of the multilocus estimation assuming statistical independence between loci.

The predictions of the standard error (SE) of NEI and TAJIMA's (1981) distance given in POLLAK 1983 Down, BARKER et al. 1998 Down, and FOULLEY and HILL 1999 Down and of the standard error of REYNOLDS et al. 1983 Down distance given in LAVAL et al. 2002 Down lead to an approximated multilocus standard error of (13) equal to

(14)


*  SIMULATION PROCEDURE
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

Twenty genetically independent loci were always considered in the simulations. A simulated population of size N = 500, with allele frequencies p00,i (for i = 1, ... , k), was initially considered, and a pure genetic drift process was simulated 1000 times through five generations. This process generates 1000 quasi-independent populations used as starting points for simulation runs.

To provide inbreeding coefficient values in the interval [0.002, 0.008], each one of these 1000 populations, described by its founder frequencies p0, was submitted to a further pure genetic drift process, with constant diploid effective sizes set to 500 individuals during 8 nonoverlapping generations. Samples of mt = 50 alleles were drawn (sampling of 25 diploids) in a multinomial way every two generations. To provide inbreeding coefficient values in the interval [0.01, 0.1], the same process was applied to a population of 100 individuals evolving during 22 generations.

Two kinds of sampling at the stage of the founder population were considered: (i) exhaustive sampling, where the founder sample size is made up of the complete founder population, so that the true founder allele frequencies p0,i are known, and (ii) finite sampling, where the founder sample size m0,· was set to 50 alleles (sampling of 25 diploids).

In Uniform founder frequencies, three sets of 1000 simulations, in which allele frequencies of the initial population were set to p00,i = 1/k, were performed with k = 2, k = 4, and k = 8 alleles, respectively.

In Biochemical and microsatellite founder frequencies two sets of 1000 simulations were used, in which allele frequencies p00,i in the initial populations were set to biochemical and microsatellite marker frequencies published in KANTANEN et al. 1999 Down and in LAVAL et al. 2000 Down, respectively.


*  RESULTS
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

Performances of the estimates were compared using the bias, Bt (or the relative bias, Bt/Ft with Ft the true value of inbreeding coefficient) and the standard error, SEt, computed over S simulations (unless it was explicitly indicated, S was always set to 1000). The global accuracy of estimations was established on the basis of the square root of MSE a criterion combining bias and variance, which is equal to the standard error when the estimation is unbiased.

Approximate confidence intervals of relative biases and standard errors were computed using formulas that are valid for normal distributions of estimates. They are indicated in the figures when relative biases are significantly different from zero (zero outside the 95% confidence interval, P < 0.05) and when differences between standard errors were significant (no overlapping of the 95% confidence intervals, P < 0.05).

Uniform founder frequencies
Known founder frequencies ( Fig 2A, Fig 3A, Fig 4A, and Fig 5A): Square root of the mean squared error— Fig 2A: With two founder alleles per locus (top curves), NT gives the least accurate estimations ( are the highest) whatever the level of drift. With four founder alleles (middle curves) and eight founder alleles (bottom curves) per locus, the two likelihood methods MD(ML) (Dirichlet without gene loss) and MDL(ML) (Dirichlet with gene loss) diverge from each other for ~F > 0.07 in the first case and F > 0.04 in the second case. MD(ML) shows a large loss of accuracy with eight alleles whereas MDL(ML) always gives the best estimations for any Ft value (with results significantly better for F > 0.08) and a small to medium number of alleles.



View larger version (32K):
In this window
In a new window
Download PPT slide
 
Figure 2. Square root of mean square errors of inbreeding estimations: two, four, and eight founder alleles per locus are shown. The estimations of Ft <= 0.01 (respectively Ft >= 0.01) were computed over 20 loci and 1000 replications performed with populations of N = 500, t varying from 2 to 8 (respectively N = 100, t varying from 2 to 22). Every sample size was equal to 25 individuals. In A, the true founder frequencies are known. In B, the founder sample size was set to 25 individuals. In both A and B, dotted lines show the results obtained with uniform distributions involving four founder alleles. Solid lines show the results obtained with uniform distributions involving two (top curves) and eight (bottom curves) founder alleles.



View larger version (21K):
In this window
In a new window
Download PPT slide
 
Figure 3. Relative biases of inbreeding estimations: eight founder alleles per locus. For the replication conditions refer to the Fig 2 legend. Confidence intervals are shown when the bias is significantly different from 0. In this case all methods except R show biases significantly different from 0 for almost every value of Ft.



View larger version (22K):
In this window
In a new window
Download PPT slide
 
Figure 4. Standard errors of inbreeding estimations: two founder alleles per locus. For the replication conditions refer to the Fig 2 legend. The straight line is the expected value of standard deviation fromEquation 14. Confidence intervals of SE are shown when differences between methods are significant. In A the differences between methods are significant for Ft values >0.08 [except between R and MD(ML)]. In B MD(corrML) exhibits standard errors significantly lower than the standard errors of the other methods for every value of Ft >= 0.01.



View larger version (24K):
In this window
In a new window
Download PPT slide
 
Figure 5. Standard errors of inbreeding estimations: eight founder alleles per locus. For the replication conditions refer to the Fig 2 legend. The straight line is the expected value of standard deviation fromEquation 14. Confidence intervals of SE are shown when differences between methods are significant. In A R and MD(ML) exhibit the same results. The differences between R and MDL(ML) are significant for Ft values >0.03. The differences between NT and MDL(ML) are significant for Ft > 0.07. In B NT exhibits standard errors significantly lower than the standard errors of the other methods for Ft > 0.05.

Relative bias— Fig 3A: As expected from its definition, R shows no significant bias as is illustrated for eight alleles (Fig 3A) and for fewer alleles (data not shown). In contrast to this result, other measures show biases that are dependent on polymorphism and on the amount of drift. NT shows significant positive bias with two alleles (data not shown); with eight alleles NT, and also MDL(ML), exhibit small but significant negative biases for F > 0.07 (confidence intervals in Fig 3A). However, biases remain small. Therefore, except for MD(ML) computed with eight alleles per locus, the global accuracy of every method is determined mainly by the standard errors of the estimations ( {cong} SE) as long as the number of loci is not too large (<50).

Standard error— Fig 4A and Fig 5A: The prediction given byEquation 14 provides a conservative basis for choosing the number and the polymorphism of markers needed to achieve a given level of precision as defined by the variance. Indeed it is apparent that the different methods considered here are characterized by a variance equal to or less than that predicted by the formula. In every situation encountered (four alleles, data not shown, and eight alleles, Fig 5A) MDL(ML) proves to be the best with respect to the criterion of minimal variance, with the difference with the theoretical reference (Equation 14) always significant for F > 0.07. For eight alleles (Fig 5A) NT performs nearly as well but is slightly less accurate [differences between NT and MDL(ML) are significant].

Estimated founder frequencies ( Fig 2B, Fig 3B, Fig 4B, and Fig 5B): As mentioned above, taking account of the sampling process within the founder generation needs additional statistical treatments when the Dirichlet approximation and loss of gene probabilities are combined. Replacing p0,i by x0,i provides biased estimates, with biases inversely proportional to the founder sample size. The results given by MDL(ML) are not shown.

Square root of the mean square error— Fig 2B: For two alleles (top curves) a gain in accuracy was found with the corrected likelihood estimate MD(corrML). Beyond 7% drift, R is slightly better than NT. For four alleles, results are nearly identical for all the compared methods. With eight alleles NT turns out to be better than the others.

Relative bias— Fig 3B: R is always unbiased (biases were never statistically different from zero under the simulation conditions used in this article). Both NT and MD(corrML) measures show a bias that depends on the allelic distribution, changing in sign with the number of alleles and the true value of Ft. In the range of Ft values considered, and discarding very low values, the relative bias remains smaller than ~10%. However, the global accuracy of methods is still due mainly to the standard error of estimation [ {cong} SE, except for MD(corrML) computed with eight alleles per locus and Ft > 0.07].

Standard error— Fig 4B and Fig 5B: The best method is MD(corrML) when the number of alleles is reduced (two alleles, Fig 4B). With eight alleles per locus (Fig 5B) NT is the best method with a reduction in standard error of up to 25% for an Ft value ~0.10.

The results presented in this section highlight the main drawback of the Dirichlet approximation. Biases depending on the founder polymorphism of markers and on the Ft values appear even for a small amount of drift (Ft <= 0.1). This problem can be avoided when allele loss probabilities are taken into account. These results suggest that this new model should be used to develop further estimates, which will be more accurate when highly polymorphic markers are used. However MD(corrML) performs well with markers exhibiting two alleles. This method should, then, be recommended when markers such as single-nucleotide polymorphisms (SNPs) are used.

This method, as well as the MCMC algorithm, was tested with two data sets generated with allele frequencies observed in pig and cattle breeds. Furthermore, the accuracy of the estimations of N obtained with the corrected maximum-likelihood method (estimations of N were derived from the estimations of Ft) was determined with these data sets.

Biochemical and microsatellite founder frequencies
In this section the allele frequency distributions include rare founder alleles and different numbers of alleles between markers. The in Fig 6 and Fig 7 were computed with 20 biochemical (KANTANEN et al. 1999 Down) and 20 microsatellite (LAVAL et al. 2000 Down) markers exhibiting mean numbers of founder alleles equal to 2.5 (markers with two and three alleles) and 4 (allele numbers ranging from 2 to 6), respectively.



View larger version (23K):
In this window
In a new window
Download PPT slide
 
Figure 6. Square root of mean square errors: biochemical markers. The results were performed (1000 replications, N = 500 and N = 100) with distributions of founder frequencies belonging to 20 biochemical markers, exhibiting an average number of founder alleles of 2.5, observed within cattle breeds from KANTANEN et al. 1999 Down.



View larger version (23K):
In this window
In a new window
Download PPT slide
 
Figure 7. Square root of mean square errors: microsatellite markers. The results were performed (1000 replications, N = 500 and N = 100) with distributions of founder frequencies belonging to 20 microsatellite markers, exhibiting an average number of founder alleles of four, observed within pig breeds from LAVAL et al. 2000 Down.

Estimation of Ft: To show the relevance of the model combining Dirichlet and allele loss probabilities, we give the results obtained when the founder frequencies are known (Fig 6A and Fig 7A). MDL(ML) is still the best inbreeding estimator in every case (being unbiased and with the smallest standard error).

When the founder frequencies of biochemical markers are estimated (Fig 6B), the likelihood approach MD(corrML) and NEI and TAJIMA's (1981) statistics provide the best inbreeding estimations: MD(corrML) is nearly the best for biochemical markers. As for the replications performed in the previous section, the effect of bias on the global accuracy ( {cong} SE) is negligible. For microsatellite markers (Fig 7B), the method giving the smallest standard error still provides the most accurate F estimations, in this case NT.

The MCMC algorithm requires larger computation times than the likelihood methods. The results are shown for two different values of Ft, Ft = 0.01 and Ft = 0.1 (Table 1), and the number of simulations was limited to S = 100. In practice with a real data set the parameters of candidate generating densities were empirically chosen so that the Metropolis-Hasting algorithm accepts 25% of drawn values (ROBERT 1996 Down). Since this cannot be performed for all the 100 simulations, we determined optimal parameters with a simulated data set randomly chosen, and we kept them for all the 100 simulations. We discarded simulations in which convergence is of doubtful validity (the percentage of accepted values <5). The numbers of simulations used to compute accuracy criteria are equal to 85 (biochemical markers) and 91 (microsatellites) for Ft = 0.01 and equal to 100 (for both biochemical and microsatellite markers) for Ft = 0.1, respectively.


 
View this table:
In this window
In a new window

 
Table 1. Estimation of Ft = 0.01 and Ft = 0.1 computed with the MCMC algorithm from biochemical and microsatellite markers

With biochemical and microsatellite markers, MD(MC) was found to be significantly more accurate than NT and MD(corrML) for Ft values of 0.01 (the first two rows in Table 1). The is almost halved in every case. For Ft = 0.1 the MCMC algorithm does not give the same large gain in accuracy. There are no significant differences among the three methods. The MCMC algorithm allows us to compute the posterior marginal distribution of the parameter of interest, f(Ft|m0, mt) (Fig 8 and Fig 9; Ft = 0.01 and Ft = 0.1, respectively).



View larger version (69K):
In this window
In a new window
Download PPT slide
 
Figure 8. Histogram and kernel density estimate of MCMC drawings in the posterior distribution of the inbreeding coefficient, for Ft = 0.01. We kept the same parameters as in Table 1. A and B were computed with one simulation involving 20 biochemical markers and one simulation involving 20 microsatellite markers, respectively. For biochemical markers the mean and the standard error are equal to 0.0062 and 0.0046, respectively, and the percentage of accepted values of the At parameter by the Metropolis-Hasting algorithm is equal to 19%. For microsatellite markers the mean and the standard error are equal to 0.0065 and 0.0046, respectively, and the percentage of accepted values of the At parameter is equal to 17%.



View larger version (62K):
In this window
In a new window
Download PPT slide
 
Figure 9. Histogram and kernel density estimate of MCMC drawings in the posterior distribution of the inbreeding coefficient, for Ft = 0.1. We kept the same parameters as in Table 1. A and B were computed with one simulation involving 20 biochemical markers and one simulation involving 20 microsatellite markers, respectively. For biochemical markers the mean and the standard error are equal to 0.093 and 0.03, respectively, and the percentage of accepted values of the At parameter by the Metropolis-Hasting algorithm is equal to 23%. For microsatellite markers the mean and the standard error are equal to 0.093 and 0.023, respectively, and the percentage of accepted values of the At parameter is equal to 27%.

Estimation of N: These data sets were also used to obtain estimations of the effective population size since t is known (and small) and biases of F estimations are small. For Ft between 0.01 and 0.1 (a simulated population of N = 100) estimations of N can be simply obtained from

(15)

However, even if t is unbiased, is biased as

(16)

Equation 14 suggests the following estimate ' of N:

(17)

The performances of these estimates are given in Table 2 and Table 3. Their names are given with the same subscript as used for the notation.


 
View this table:
In this window
In a new window

 
Table 2. Estimation of N = 100 from biochemical markers


 
View this table:
In this window
In a new window

 
Table 3. Estimation of N = 100 from microsatellite markers

The results obtained with R and MD(MC) are not shown since they are less accurate than those obtained with NT and MD(corrML), respectively.

In every case, the ' estimations are less biased with a smaller standard error than that of the noncorrected estimators . The corrected-likelihood method is more accurate than the F-statistics, for every t and for both biases and standard errors. It is more suitable to combine ' with the corrected maximum-likelihood approach: in the last row of Table 3 (20 microsatellites), 'MD(corrML) gives an unbiased estimation and leads to a decrease of 30% of the standard error of NT.

It should be mentioned here that this corrected estimate must be used when the experimental conditions lead to a small coefficient of variation of the estimation of Ft, sinceEquation 16 is accurate only when Var(t)/E(t)2 is negligible. The highest relative standard error of Ft in Table 2 and Table 3 is <0.5. In practiceEquation 14 can be used to estimate the relative standard error to decide whether ' is relevant.


*  A REAL DATA SET
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

We used a data set provided by J. F. ARNAULD (unpublished results) to illustrate the behavior of the corrected maximum-likelihood method and the MCMC algorithm with a real data set. The population of Helix aspersa (Gastropoda: Helicidae) belongs to an intensive agricultural zone located in Brittany (northwestern France), in the polders of the Bay of Mont-Saint-Michel. The snail population was sampled in 1998 and 2000 with 15 and 30 individuals, respectively. The estimations of Ft and N were computed with four microsatellite markers, exhibiting 5, 6, 8, and 12 alleles in the founder generation, respectively.

The estimations of Ft obtained with NT, R, and MD(corrML) are equal to 0.011, 0.016, and 0.032, respectively. Their coefficients of variation, computed fromEquation 14, are equal to 1.715, 1.264, and 1 respectively. With such coefficients of variation ' cannot be used. We estimated the effective size fromEquation 15 assuming one generation per year (MADEC et al. 2000 Down). The estimations of N obtained with NEI and TAJIMA's (1981), REYNOLDS et al. 1983 Down, and the corrected maximum-likelihood methods are equal to 88, 62, and 31 individuals, respectively.

The MCMC estimations of Ft and of the coefficient of variation (computed with the standard error obtained from the marginal posterior distribution of Ft, Fig 10) are equal to 0.007 and 0.857, respectively.



View larger version (80K):
In this window
In a new window
Download PPT slide
 
Figure 10. Histogram and kernel density estimate of MCMC drawings in the posterior distribution of the inbreeding coefficient, for a French snail population. The parameters of the prior distribution are equal to those presented in Table 1. The parameters of the candidate generating densities (bu = 0.4, ag = 8, and bg = 20) and the number of replicates (500,000 with a burn-in period of 150,000 replicates) were empirically chosen to give an optimal convergence of the chains of the At and p0,i; 25% (respectively 10%) of the sampled values of the At parameter (respectively the founder frequencies p0,i) are accepted by the Metropolis algorithm (ROBERT 1996 Down). Because of the small number of markers (four markers) the convergence of chains needs large numbers of replicates to be completed (the number of replicates used here is significantly higher than those used with the simulated data sets; Table 1). The mean and the standard error are equal to 0.007 and 0.006, respectively.


*  DISCUSSION
*TOP
*ABSTRACT
*STATISTICAL BACKGROUND
*ESTIMATION PROCEDURE
*SIMULATION PROCEDURE
*RESULTS
*A REAL DATA SET
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

Since many different methods and situations were investigated in this study, a summary table (Table 4) is given to highlight the main results obtained when the founder frequencies are estimated with a limited founder sample size (here 25 individuals), the situation most widely found in an experimental scheme.


 
View this table:
In this window
In a new window

 
Table 4. Advantage of each method as a function of the polymorphism of markers used and with moderate founder sample sizes

Scope of the present work:
The scope of this article was limited to a short-term investigation. The comparisons of methods are based on an expected increase in inbreeding of 10% or less, which corresponds to 10 generations for a population of 50 individuals. This seemed a realistic timescale since the validity of the assumption of no mutation may be questionable for longer time intervals. The number of markers used was 20, which is a common value found in practical (MACHUGH et al. 1997 Down; MOAZAMI-GOUDARZI et al. 1997 Down; LAVAL et al. 2000 Down) and theoretical studies (BERTHIER et al. 2002 Down). This allows the comparisons of methods made in the present study to be useful for experiments currently performed.

Increasing the number of markers to get a better estimate of drift, which reduces the standard error around the expectation but not around the true value, requires the use of unbiased measures. Bias becomes a significant consideration when the number of alleles is large, and analytical corrections for NT could be derived to avoid this bias. In practice, however, this may be not crucial since with the number of markers commonly used and the small values of Ft observed biases remain small in comparison with the standard errors. To illustrate this point when Ft = 0.08, markers exhibit eight alleles, and founder allele frequencies are exactly known, >200 markers are needed for the unbiased R statistic to outperform NEI and TAJIMA's (1981) method. Hence working with measures with a low bias may be an advantage when the level of inbreeding is low.

It must be stressed that this conclusion is quite different when we consider the distance between two different breeds. Indeed, the values of distances are often >0.1. Methods such as NT and similar statistics as well as likelihood estimates show nonnegligible biases when allele numbers and Ft values are large and therefore cannot be recommended on the basis of this work. Unbiased methods such as REYNOLDS et al. 1983 Down distance would be preferred in such cases (LAVAL et al. 2002 Down).

For dominant markers such as randomly amplified polymorphic DNA or amplified fragment length polymorphism, allelic frequencies can be estimated with the square root rule from the frequencies of absence of bands and can be used to estimate Ft, keeping in mind that a deviation from the Hardy-Weinberg proportions leads to biased maximum-likelihood estimations of allelic frequencies. Some software (Arlequin, SCHNEIDER et al. 2000 Down) provide better estimations of allelic frequencies for dominant markers by using an expectation-maximization algorithm (EXCOFFIER and SLATKIN 1995 Down).

Performances of various estimates of Ft were sensitive to the total number of alleles over loci (Equation 14). Considering biallelic markers, SNPs seem to be full of potential (VIGNAL et al. 2002 Down) since their low numbers of alleles can be counterbalanced by the high number of SNPs found in the whole genome. As a consequence the assumption of independence between loci does not hold with a data set involving a high number of loci.

The theoretical prediction (Equation 14), although computed under the assumption of the statistical independence of loci (linkage disequilibrium remains null), seems to be a conservative estimate of the variance of the optimized F-statistics (Equation 13) and of the other measures considered. With a real data set in which some markers are in linkage disequilibrium, the expected standard error cannot be computed easily. Some preliminary simulations have been undertaken with 100 individuals, 22 generations, and 20 highly polymorphic loci (20 founder alleles per locus), with a recombination rate r between them varying from 0.5 to 0.001. Results show that the variance of the FR estimate is not affected by linkage between loci as long as r > 0.1. The standard deviation is increased by 17% if r = 0.05 and by 155% if r = 0.001 (9 and 104%, respectively, with 4 founder alleles per locus). More work is needed to quantify the influence of nonindependence of loci on the variances of the estimates.

Analytical approximation:
We derived an approximate likelihood by using the Dirichlet approximation of conditional allele frequency distribution. Comparisons between likelihood methods computed when true founder frequencies are known and the F-statistics (unbiased like R and with a small variance like NT) indicate that the MD approximation model is relevant when the polymorphism of markers is low.

When factors enhance the probability that alleles may be lost (high polymorphism leading to the occurrence of rare founder alleles and intermediate and high levels of drift), this simple approximation is no longer relevant. Several levels of approximations were used, to circumvent the combinatorial problems raised by the exact coalescent approach when the number of alleles increases. Likelihoods and posterior probabilities were derived by means of a Dirichlet model and probabilities of fixation and loss of alleles made use of a simple approximation. The latter was previously checked (CHEVALET 2000 Down) and should not induce much error in the short-term evolution considered here. The Dirichlet approximation may be more sensitive. The simpler version (model MD) makes use of a single distribution that does not allow for null frequencies of alleles. Indeed, the behavior of the corresponding estimate (MD(ML)) observed in Fig 2A shows the deleterious effects of increasing drift or number of alleles, with both effects increasing the probabilities that some alleles may be lost by drift. Combining probabilities of losses and a mixture of distributions in model MDL allowed the best estimates to be obtained when using initial known frequencies (Fig 2A). The advantage over other methods seems to be uniform over all considered cases, and the most pronounced improvement concerns the variance of F estimates, although the gain seems to be less for a large number of alleles.

The latter observation suggests that the combined approximation becomes less accurate for more than eight alleles per loci. It may be suggested that the choice of the Dirichlet distributions used for transient allele frequencies (Appendix A) is the most sensitive step since—considering the first two moments of distributions—it can be proved that it is not possible to equate the true mixture (Equation 5) to the mixture of Dirichlet distributions that yieldsEquation 8. Other parameter adjustments, such as using exact conditional expectations (Appendix A) rather than the simplified expression ofEquation 7, and adjusting the dispersion parameters (A) for each transient state S, might give a better account of drift for mid- or long-term processes. At the same time this should avoid the combinatorial problem driven by a large number of alleles in the exact coalescence.

Initial sampling:
Precision in the estimation of founder allele frequencies is a key to accurately estimate the amount of drift. For example, the best estimate obtained with exact biochemical founder frequencies [ Fig 6A, MDL(ML) estimate] shows the same mean square error as the best estimate possible with microsatellite markers sampled in the founder generation (Fig 7B, NT estimate).

As mentioned above in experimental schemes applied to domestic breeds all founder animals are known and can be sampled. In this special case, we have shown that it is possible to derive a valuable approximation of the drift process, which allows improvements of Ft estimation to be obtained with both biochemical and microsatellites markers. The gain is significant for intermediate values of Ft (in the range of 5–10%) and if the mean number of alleles is not too large. From a practical point of view concerning natural populations the methods based on this MDL model might be used when the founder sample is large (up to 100 individuals). The bias of the maximum-likelihood estimates, which is inversely proportional to the founder sample size (data not shown), tends to be small in front of the standard error.

For small founder sample sizes no method was found to be consistently better than the others over all the situations tested. Introducing the allele loss probabilities in the Dirichlet model is hardly tractable in this case and we kept this for future work. However, the methods based on the Dirichlet model without allele loss greatly improve the estimation of the amount of drift in several interesting situations.

The corrected maximum-likelihood methods (MD(corrML)) and the MCMC algorithm should be preferred when markers of low polymorphism are used, a situation that may be of importance with the advent of the SNP markers. Using estimations of founder frequencies can lead to biased Ft estimations with the maximum-likelihood method based on the MD model but we have shown that this problem can be solved simply with a heuristic correction. This bias correction gives more accurate estimations than the F-statistics give without changing the standard error of the maximum-likelihood estimate. In contrast, the MCMC algorithm greatly reduces the standard error of the maximum-likelihood estimate. Using this algorithm, which performs a numerical integration of the nuisance parameters (here p0), is the most relevant when a large part of this standard error is due to the sampling in the founder generation, as was the case for small Ft. We can deduce fromEquation 14 that the sampling process is largely responsible for the decrease of the accuracy of estimations when Ft is small: the relative standard error SEt/Ft becomes large when Ft tends to 0 [the part depending on 1/(m0,·) is inversely proportional to Ft].

The analysis of the French snail population shows that the MCMC algorithm can be applied to a real data set. The Markov chain well converges and the estimation of Ft is of the order of magnitude of the estimations given by the other approaches, suggesting that the results given by the MCMC algorithm are consistent. The MCMC algorithm presented here is based