- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Berthier, P.
- Articles by Luikart, G.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Berthier, P.
- Articles by Luikart, G.
Likelihood-Based Estimation of the Effective Population Size Using Temporal Changes in Allele Frequencies: A Genealogical Approach
Pierre Berthiera,b, Mark A. Beaumontc, Jean-Marie Cornuetd, and Gordon Luikartaa Laboratoire de Biologie des Populations d'Altitude, UMR CNRS 5553, Université Joseph Fourier, F38041 BP53 Cedex 9 Grenoble, France,
b Computational and Molecular Population Genetics Laboratory, Zoologisches Institut, Universität Bern, 3012 Bern, Switzerland,
c School of Animal and Microbial Sciences, University of Reading, Reading RG6 6AJ, United Kingdom
d Laboratoire de Modélisation et de Biologie Évolutive, INRA-URLB, 34090 Montpellier, France
Corresponding author: Pierre Berthier, Zoologisches Institut, Universität Bern, Baltzerstrasse 6, 3012 Bern, Switzerland., pierre.berthier{at}zoo.unibe.ch (E-mail)
Communicating editor: G. A. CHURCHILL
| ABSTRACT |
|---|
A new genetic estimator of the effective population size (Ne) is introduced. This likelihood-based (LB) estimator uses two temporally spaced genetic samples of individuals from a population. We compared its performance to that of the classical F-statistic-based Ne estimator (
) by using data from simulated populations with known Ne and real populations. The new likelihood-based estimator (
) showed narrower credible intervals and greater accuracy than (
) when genetic drift was strong, but performed only slightly better when genetic drift was relatively weak. When drift was strong (e.g., Ne = 20 for five generations), as few as
10 loci (heterozygosity of 0.6; samples of 30 individuals) are sufficient to consistently achieve credible intervals with an upper limit <50 using the LB method. In contrast,
20 loci are required for the same precision when using the classical F-statistic approach. The
estimator is much improved over the classical method when there are many rare alleles. It will be especially useful in conservation biology because it less often overestimates Ne than does
and thus is less likely to erroneously suggest that a population is large and has a low extinction risk.
THE effective size of a population can be defined as the size of an ideal population (Wright-Fisher model) in which genetic drift occurs at the same rate as in the studied population (![]()
Unfortunately, Ne is notoriously difficult to estimate by demographic methods in the field (![]()
![]()
![]()
![]()
![]()
![]()
![]()
The most widely used genetic method consists of measuring the variance of allele frequencies between generations to estimate the "variance effective size," NeV (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
LB estimators should provide better precision compared to moment-based estimators, because they use more of the information from the data (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
The objectives of this article are twofold: (i) to present a new likelihood-based estimator of Ne, on the basis of coalescent theory, which allows the use of multiallelic markers and can be extended to incorporate Bayesian prior information about the Ne value (e.g., knowledge that Ne < 500) and (ii) to evaluate the accuracy and precision of this method in comparison to the classical estimator based on F-statistics (![]()
![]()
![]()
![]()
The biggest problem with existing methods is their large confidence intervals (![]()
![]()
2 approximation method (used for
) is known to be slightly too conservative (i.e., confidence intervals are too wide; ![]()
![]()
![]()
In MATERIALS AND METHODS, we describe (i) the new likelihood-based estimator for Ne (noted
), (ii) the classical F-statistic-based estimator (noted
), (iii) the model used to simulate data sets, (iv) the real data sets, and (v) the methods we used to assess the relative performance of both estimators.
| MATERIALS AND METHODS |
|---|
Likelihood-based estimator:
The method used to estimate likelihoods for Ne given the data is based on the genealogical approach described in ![]()
![]()
![]()
![]()
![]()
![]()
(T, Ne, x1, x2, ... , xm) can be obtained by multiplying the probabilities across samples and across loci. Metropolis-Hastings simulation is then used to integrate out the xi as described below.
|
Probability of the first sample: It is assumed that the nT chromosomes are sampled with replacement from a population having unknown gene frequency x. Hence aT follows the multinomial distribution with probability p(aT|nT, k, x).
Probability of the second sample:
The sample at t = 0 is assumed to have a genealogy, described by the standard coalescent model, with n0 lineages at t = 0, nc coalescent events between t = 0 and t = T, and nf = n0 - nc lineages at t = T. Since mutations are assumed not to occur, nf
k0. Let af be the vector of counts of the nf distinct lineages that remain at t = T, distributed among the k allelic classes. Those lineages are sampled with replacement from the population with probability p(af|nf, x). It is possible to calculate the probability of obtaining nc coalescent events, p(nc|T, Ne, n0) (![]()
![]()
![]()
Given the configuration in the founder lineages, af, it is possible to calculate the probability of obtaining the configuration in the sample, p(a0|af, n0, nc), by nc successive iterations of choosing a lineage uniformly at random and duplicating it (![]()
![]() |
(1) |
Although it is feasible to evaluate this sum for small samples, it is impractical for typical cases. We use here an alternative, efficient method of evaluation described by ![]()
![]()
![]()
![]()
![]()
We are interested in estimating p(a0|T, Ne, n0, k, x). Rewriting (1) in a more general way,
![]() |
(2) |
The summation is over all genealogical histories G, each of which can be represented as a sequence of configurations G = (g0, g1, ... , gnc), where, looking back in time from t = 0, g0 is the configuration prior to the first coalescent event and gnc is the configuration prior to the sampling event at T (i.e., gnc = af). The states (gnc, ... , g1, g0) form a Markov chain where the number of lineages increases by 1 at each step. Let b denote a vector of length k with 1 at position b and 0 elsewhere, b being a particular allelic class. In the following, g(b)r denotes the number of genes in the allelic class b within the rth state. The rth state in the sequence can be generated from

where
. Thus p(G|af, nc) is the product of these probabilities over nc coalescent events, and p(a0|G) = 1 if g0 = a0, 0 otherwise.
Equation 2 has the general form
, which can be estimated by the classical Monte Carlo method as
, where the yi are drawn from p(y) (see, e.g., ![]()
![]()
yp(x|y)p(y) is estimated as
, where p*(y) is chosen to be as close as possible to p(y|x); the variance due to importance sampling would be zero if it could be made equal to p(y|x)this follows because
for any draw from p(y|x). In general it is not possible to draw exactly from p(y|x) but this can be approximated quite closely. Following the method of ![]()
![]() |
(3) |
where j subscripts are the allelic classes represented in gr. The ratio p(G|af, nc)/p*(G) is then given by the product of
![]() |
(4) |
over all coalescent events in G. Thus to estimate (2) we can use the following algorithm, which calculates 1/s x S, with
, and
. This algorithm involves two nested iterations: a first iteration, which builds the summation S over sampled genealogies, and another iteration, nested in the previous one, which calculates each Pi by sampling the number of coalescent events in the genealogy Gi. Let nr be the number of lineages, m the number of allelic categories in the current configuration, and
the total time elapsed since the start of the building of the current genealogy. The genealogies are built starting at
= 0, with the data configuration a0, so that for all i,
. Then coalescent events are simulated using a waiting time, t, until the total time elapsed becomes >T/Ne. In the following algorithm, t is a random variable, sampled from an exponential distribution with scale (nr2), which changes each time we refer to it:
- set S to 0;
- do s times:
- set Pi to 1; set nr to n0; set
to t; - while (
T/Ne and nr
m) do: - with probability given by Equation 3, choose an allelic category in the current configuration and decrease the count by 1. Decrease nr by 1;
- multiply Pi by the importance weight (Equation 4);
- set
to
+ t; - if (nr = m and
T/Ne) then set Pi = 0 (probability of obtaining the data is zero with this number of coalescences) else - let afi be the current configuration (i.e., allele counts among founder lineages), containing nfi genes;
- add to S the product of Pi by p(afi|x, nfi), the multinomial probability of choosing the current configuration from x;
- evaluate S/s.
Thus the method estimates p(a0|T, Ne, n0, k, x) with some error depending on the number of iterations, and the number we actually used is discussed below.
Posterior distribution:
Assuming independence the probabilities can be multiplied across samples and loci as described above to give a likelihood
(Ne, T, x1, ... , xm). We integrate out the xi using Metropolis-Hastings sampling, following the approach described in ![]()
In Metropolis-Hastings sampling we propose candidate values x' from some distribution conditional on the current value x, p(x'|x) and calculate the ratio
![]() |
(5) |
which gives the likelihood of a candidate value, weighted by the probability of choosing it from the current value, relative to the current value, weighted by the probability of choosing it from the candidate value. If X > 1 we accept x'; otherwise we accept it with probability X, retaining x otherwise. The likelihoods include a weighting by priors. Provided certain conditions hold (see ![]()
(x)/
(x)dx. In the case of jointly distributed variables marginal distributions can be estimated by looking at only one variable (e.g., T/Ne here). Since T is known, using this approach we can estimate the posterior distribution p(Ne|a0, aT, T), which will be proportional to the likelihood for Ne if a uniform prior for Ne is assumed. We need to specify the upper limit, NeMAX, because convergence of the MCMC simulation is not otherwise guaranteed. Therefore our approach is more accurately described as Bayesian, with an informative rectangular prior on Ne. Here we assume rectangular priors on Ne between zero and some upper limit (NeMAXusually 500) and uniform Dirichlet,
(1, ... 1), priors on x. However, since the tests of power and precision given in this article are inherently non-Bayesian (since the answer is known), we use the term "likelihood-based" to describe the general approach and restrict the use of the term Bayesian when comparing the effect on inference of different values of NeMAX. When real data are analyzed, it is clearly sensible to take a fully Bayesian approach, with a prior that reflects background information (which is then unlikely to be rectangular).
Candidate values for Ne are proposed, using a lognormal distribution centered around the current value. Candidate values for x for each locus are proposed by randomly partioning the alleles into two groups and using a beta-distribution, as described in ![]()
![]()
(Ne, T, x1, ... , xm) is estimated as described above. The performance of Metropolis-Hastings sampling when the likelihood is estimated with error has not been intensively studied. As in ![]()
![]()
(Ne, T, x1, ... , xm), had been suggested by pilot simulations in ![]()
Convergence of the Metropolis-Hastings procedure:
For the data sets described here, the Metropolis-Hastings procedure was run for single chains of 20,000 iterations, sampling every 5 iterations. The validity of this number was assessed for four representative data sets, by running five independent chains from different starting points for each set, and using the Gelman-Rubin criterion to assess convergence, implemented in CODA (![]()
![]()
|
Point estimates and confidence intervals: From the output of the MCMC simulations the following summary statistics are estimated: the mode and 0.05 and 0.95 quantiles (giving a 90% credible interval). The 0.05 and 0.95 quantiles are obtained from the ranked output in the standard way. The mode is obtained by kernel density estimation. We used a logistic function as a kernel with a bandwidth (standard deviation) given by

where n is the sample size (the number of updates), q2 is the value of the 0.625 quantile, and q1 is the value of the 0.375 quantile. The constant, k, was set to be 1 for all the simulations. This formula was obtained by trial and error in pilot simulations and was set to tend toward under- rather than oversmoothing. A logistic kernel was used because it was easy to define truncated kernels at 0 and NeMAX, which reduced the degree of underestimation of the density at the boundaries. For appropriate bandwidths (slightly larger values of k) the method gives results very similar to those obtained using the program Locfit (![]()
![]()
Many data sets may have little information on the upper limit of Ne (i.e., the likelihood function tends to a nonzero constant for large Ne), in which case the 90% credible interval depends strongly on the prior used. We use a rectangular prior in the study here, but for real data a more smoothly varying prior may reflect background information more closely. This estimator was implemented in a computer program (TM3) available from http://www.rubic.rdg.ac.uk/~mab/software.html. A program for converting GENEPOP files to input format for the TM3 program is also available. The mode estimated from the MCMC output is referred to as the likelihood-based estimator
.
F-statistic-based estimator:
We estimated NeFk for each population using the equation
![]() |
(6) |
(![]()
![]()
We estimated F as did ![]()
![]() |
(7) |
(![]()
![]()
Other estimators of F give generally similar results to those obtained with Fk (![]()
![]()
![]()
2 approximation, known to be unbiased, but too conservative (![]()
![]()
Model for simulating data sets:
The simulated populations were generated using an individual-based model with Mendelian inheritance. Populations were initiated by randomly sampling alleles from independent loci (defined by an array of allelic frequencies; see Table 2). Simulations were based on a Wright-Fisher model, with two modifications: (i) separate sexes (with an equal sex ratio) and (ii) strict allogamy (no selfing). These modifications should lead to an Ne slightly larger than the actual number of breeders (Nb, individuals), as shown in ![]()
1.91 (instead of 2) for Nb = 10. In this case, Ne
10.2. This difference between Ne and Nb is negligible and is even smaller when Nb > 10. Individuals were sampled under sampling scheme II of ![]()
![]()
![]()
|
We studied only cases with relatively small Ne because it is known that estimators of Ne give reasonably small confidence intervals when Ne is large and when using realistic sample sizes (e.g., 1020 loci and 3060 individuals; ![]()
![]()
![]()
![]()
Real data:
To test further the Ne estimators, we applied them to real data, which potentially follow a model very different from the ones used both by the estimators and by our simulation model. For example, real data can show migration (killifishes), selection and linkage among loci (Drosophila; see DISCUSSION), overlapping generations (otter), and nonstable population size (mosquito fish). Our purpose here is not to use these data sets to quantify the problems introduced by those violations of the model underlying the Ne estimators, but rather to illustrate that the violation of assumptions can lead to changes in the estimators' performance. We used microsatellite data from four animal populations:
- Spanish killifish Aphanius iberus: A captive population was founded in 1994 using 83 individuals from a wild population and a subsequent incorporation of 68 new individuals occurred 1 year later from the wild into the captive population (S. SCHÖNHUTH, unpublished data). The first sample was taken among the 83 founders (originating from the wild population) and the second sample was taken from the captive population after the incorporation of the 68 new individuals.
- Two Drosophila melanogaster populations (1spm24 and 100a), which originated from the wild (at the Tyrrells Winery near Sydney, Australia), were genotyped (first sample); submitted to a bottleneck for a duration of 1 and 57 generations for, respectively, 1spm24 and 100a (see Table 3); and then genotyped a second time 10 generations later, during a rapid expansion period to a population size of 500 and 750 individuals, respectively. The Ne value given in Table 3 is the harmonic mean of the Ne for the generations between samples. The Ne for each generation is estimated from the pedigrees (
ENGLAND 1998 ).
View this table:
In this window
In a new window
Table 3. Effective population size estimates and confidence intervals for empirical data sets - Eurasian otter Lutra lutra (
DALLAS et al. 1999 ), for which samples from 1983 to 1988 were pooled to represent one sample, as were samples from 1991 to 1997.
- Two western mosquito fish (Gambusia affinis) populations (A5 and B4), established experimentally by a bottleneck of one pair and eight pairs, respectively, from a wild population "source" (
SPENCER et al. 2000 ). For each of these two populations, the first temporal sample was taken in the source population. Experimental populations were then allowed to expand after the founder event and the second temporal sample was taken two to three generations later. Female founders were obtained from a general laboratory stock and may have not been virgin at the release time, so that the Ne at establishment could be underestimated by the actual number of founders.
Comparison of estimators:
The first goal of this article is to describe a new estimator for Ne. The second goal is to provide an evaluation of its performance in comparison with a widely used estimator, so that researchers can use it with some confidence, on the basis of realistic examples. We take a frequentist approach and use simulations to compare the two methods. Although this is inconsistent with the Bayesian paradigm behind the new estimator, it is the most practicable way to compare the two approaches. There are two issues to consider here. Having an upper limit of Ne = 500 in the likelihood-based method will in itself increase the accuracy of the method. However, this effect is generally small, and cases where it has an effect are clear in the RESULTS AND DISCUSSION. In addition, the credible intervals from the likelihood estimator are a priori not expected to have the same coverage properties as the confidence intervals from the F-statistic-based estimatorthe former is a fixed interval, and the "truth" is a random variable, while the latter is a random interval, and the truth is fixed. However, since the assessment is frequency based (where the truth is fixed) it seems reasonable to assess the coverage properties of the two estimators, while accepting that they mean different things. It is also worth noting that a general result in classical theory is that the credible intervals and confidence intervals will converge to have the same coverage properties, providing that the posterior distribution is asymptotically normal (see, e.g., ![]()
In assessing the performance of each estimator we attempted to answer the following questions: (i) Are the point estimates biased?, (ii) how accurate are the two estimators?, (iii) how large can we expect the confidence intervals to be?, and (iv) how often do the confidence intervals contain the true value? We answer these questions by using simulated data sets and compare the performance of the estimators using summary statistics computed from the set of replicates obtained with each tested scenario. The parameters investigated include the true Ne value, the sample size (S), the number of loci (L), the allele frequencies in the population from which is taken the first sample (AF; Table 2), and the number of generations between samples (T). Each scenario (i.e., parameter set) was evaluated using 200 populations simulated independently (i.e., replicates).
The bias of the point estimates was investigated by reporting the median of the 200 point estimates from the simulations. The median was used rather than the mean because the distribution of point estimates can be strongly skewed (for example, in the case of the F-statistic-based estimator, there are occasionally point estimates of infinity). We also report the standard error of the mean. The accuracy of the point estimators was investigated by reporting the square root of the mean of the squared differences between the estimation values and the true value (the square root of the mean square error,
).
The properties of the estimated confidence intervals (or credible intervals in the case of the likelihood-based estimator) were assessed by reporting two types of summary statistic. First, to summarize the overall spread of the estimated intervals, we report the 5th percentile of the lower limits of the intervals (that is, 5% of the 200 confidence or credible intervals had lower bounds that were lower than this) and the corresponding 95th percentile of the upper limits. The interval defined by these two percentiles is a summary of the overall spread of confidence or credible intervals in the 200 simulations. Second, to summarize the coverage properties of the estimated intervals, we report the proportion of times the true Ne was below the lower limit and the proportion of times the true Ne was above the upper limit.
| RESULTS AND DISCUSSION |
|---|
Accuracy of estimates:
The likelihood-based (LB) method appeared generally more accurate than the F-statistic-based method (Table 4). For example, in 13 out of 16 sets of simulations
was lower in the former than the latter (P
0.02). The performance of the estimators seems rather variable and may partially depend on the base frequencies used. For example, two of the cases where
is higher in the LB method come from the only two sets of simulations that use frequency set B. There appear to be no other clear-cut effects on
in Table 4.
|
On the basis of the medians of the estimates from each set of 200 simulations, patterns in the bias of the estimates are discernible in Table 4. In general,
appears to systematically overestimate Ne, whereas
appears to slightly underestimate it. The overestimation by
is due mainly to the loss of alleles in early generations of the time period between the two samples (see ![]()
![]()
![]()
![]()
overestimates Ne by up to 25%, whereas it generally overestimates by <10% in the other two sets of simulations. Interestingly, the degree of overestimation appears also to be affected by Ne or T, because the amount of drift is otherwise the same in these two sets of simulations. The Drosophila data in Table 3 support the second observation on the effect of rare alleles on
. Here
substantially overestimates Ne, whereas
does not (
was 191.7 and
was 83.3 when true Ne was 100.8).
The underestimation by
is probably due to the violation of the assumptions of the coalescent in the simulated data sets and appears to be strongest when Ne is small. For example, looking at the simulations where
is at reasonable levels (i.e., ignoring the three extreme cases), the underestimate is
25% for Ne = 10, 10% for Ne = 20, and <5% for Ne = 50. Although the
is too variable for any clear patterns to be discerned, it is reasonable to assume that the reduction in bias will tend to increase the relative accuracy of
over
.
Precision of estimates:
Because the accuracy is generally good (see Table 4 and examples above), it is usually less problematic than the poor precision (see, for example, the Ne estimates for the otter data in Table 3). Thus, it is very informative to assess the relative performance of the methods used to compute confidence or credible intervals (CIs): the
2 approximation (![]()
The credible intervals for the likelihood-based estimator of Ne(
) gave better precision than the F-statistic-based estimator (
) in the cases with high drift, whereas the difference in performance appeared variable (depending on Ne and on the number of loci) in the cases with relatively low drift. In the high drift cases, for example, when true Ne = 20 (and T = 5, L = 10, AF = A), the summaries of CIs were 9.753.9 for NeLB and 11.374.7 for
(Table 4). With real microsatellite data from a Drosophila population of known Ne
, the confidence interval for
(7.917.6) was also much smaller than that for
(21.5780.75; Table 3). We note that the Npedigreee was computed assuming no selection and no association among loci. However, selection is possible (e.g., inbreeding depression, see ![]()
When drift is weak, it is known that the lack of drift signal in allele frequency data leads to bad performance of the Ne estimators studied here (e.g., ![]()
gave more reliable confidence intervals than
, but both estimators are similar when at least
20 loci are used (see Table 4). When considering the scenario with Ne = 10, T = 1, and 5 loci (first line in Table 4), it is noteworthy that NeLB performed better when doubling the sample size (second line) than when doubling the number of loci used (third line). On the other hand,
showed similar benefits from these two possible strategies for increasing the sampling effort. It is also important that the
CIs performed slightly worse as Ne became smaller (from 50 to 20 to 10). This probably results from violation of the assumption in coalescent methods that Ne is large.
Because our simulations used a fixed heterozygosity (H) level for all loci, we investigated if a variance of H among loci changes precision of Ne estimates. We used a set of five loci having a mean
, described in Table 2, under the name F. The CIs obtained from populations simulated using variable interlocus H were only slightly different than those from populations having all loci with the same H (0.6; Table 4). Thus the results in Table 4 can be used as very rough guidelines for the number of loci and sample size needed to achieve reasonable precision.
Table 5 shows that the NeLB CIs are more often too low than too high (they are biased low), whereas NeFk CIs are more often too high than too low (they are biased high). The same pattern appears with real data (see Table 3); for example, with Drosophila populations, when the actual Ne was 18.8, the LB method credible interval was 7.917.6, whereas the F-statistic-based method CI was 21.680.7. This is interesting for conservation biology applications because overestimation is likely to delay the detection of a small Ne.
|
Guidelines:
To improve precision, one has four possibilities: (i) Use more loci or loci with more alleles, as this provides more independent observations of drift; (ii) sample more individuals, as it gives better allele frequency estimates; (iii) increase the time between samples, as this increases the number of drift events observed (T); and (iv) use prior knowledge on Ne in the case of the likelihood-based method. The first three should increase the signal-to-noise ratio in the data (![]()
Typical users wishing to increase the precision of Ne estimators would like to know whether increasing the number of individuals sampled or the number of loci is more likely to help. Indeed, new promising molecular markers (e.g., SNPs) may allow use of dozens of loci in the near future and could double or triple the number of loci typically used today (520). However, the cost and time needed for developing them might counterbalance the cost of sampling more individuals while keeping classical genetics markers. This is why performance analyses like ours are needed to help make such decisions about sampling investment. It is difficult to find dozens of unlinked loci in most real populations, and a linkage disequilibrium between loci is possible in small populations. Thus, more study is needed to quantify the potential problems caused by statistical association between loci on the Ne estimators we studied.
For improving precision, the total number of independent alleles (
iAi - 1, where Ai is the number of alleles at the locus i) is more important than the number of loci. For example, using five diallelic loci (thus 5 independent alleles) leads to infinite CIs with both NeFk and NeLB, whereas using five loci with 5 alleles each (20 independent alleles) provided far smaller CIs (Table 4, lines 6 and 10, for example). In this study, adding more loci or more individuals in the sample dramatically increased the precision on NeFk and NeLB estimates (Fig 2), but doubling the number of loci did not lead to obviously better results than doubling the number of individuals (for loci having 5 alleles), except for NeLB in the cases of low drift. Our results provide rough ideas about this issue when using realistic data sets.
|
It is interesting that increasing the number of alleles in the samples by having more loci with rare alleles is now less problematic with
than with
. For example,
is less biased and more precise than
when using loci with some rare alleles (e.g., scheme A in Table 2). This is important because most alleles in natural populations are at low frequency except in severely bottlenecked populations (![]()
Bayesian methods give more direct information than classical statistics about the estimated parameter Ne, as one gets a posterior distribution for Ne (Fig 3). One advantage of a posterior distribution is that it allows the visualization of the information brought by the data. Thus, a flat posterior means that the data contain no information in connection with the model used: See, for example, Fig 3C. The symmetry of the curve allows us to roughly estimate the signal-to-noise ratio (large if symmetrical, for example, Fig 3A; small if skewed or flat, for example, Fig 3B and Fig C).
|
The Bayesian aspect of using
(i.e., setting a narrow prior probability distribution for Ne:
instead of 5000) provided smaller CIs when drift was relatively weak, but no obvious difference when drift was strong. For example, when Ne was 10 (and T = 1, with 10 loci), setting NeMAX to 5000 gave a summary of CIs of 3.63732.2 for
percentiles, whereas it gave 3.6130.6 with
. This is a far better improvement than in the case of high drift (Ne = 20, T = 5, with 10 loci), for which case we obtained 6.3107.8, when NeMAX was 5000 and 6.3105.6 when NeMAX was 500. The better improvement observed under low drift situations was expected because likelihood curves obtained in these cases are skewed toward high Ne, and using the NeMAX prior information is likely to remove a greater amount of the area under the curve (i.e., in the right tail of the curve) than in the high drift cases. Further research is needed to study the influence of using different prior distributions for Ne, e.g., smaller NeMAX values.
In the context of management the Bayesian methodology presented here may be further refined to incorporate the methods of decision theory. For example, a loss function can be defined, which enables the posterior risk of management decisions based on Ne to be quantified. Nevertheless, for conservation biology purposes, the
method should already be useful as it gives narrower credible limits, with less chance to be biased high than low. In conservation biology, it is critical to not overestimate Ne because overestimation could lead managers to not detect a small Ne (and an associated excessive loss of genetic variation) and to consequently underestimate the population extinction risk.
| ACKNOWLEDGMENTS |
|---|
We sincerely thank P. England for the Drosophila data, S. Schönhuth for the killifish data, J. Dallas for the otter data, and C. Spencer for the mosquito fish data. Many thanks to E. Anderson, M. Schwartz, L. Excoffier, N. Valière, S. Manel, and M. Gaudeul for their helpful comments on the manuscript. P.B. was funded by a DEA grant from French Research Ministry. M.A.B. is supported by a University of Reading RETF fellowship. The Bureau des Ressources Génétiques (INRA) provided funding for G.L. and J.-M.C. G.L. was also supported by a grant from the European Union (BIO4 CT96 1189).
Manuscript received October 15, 2000; Accepted for publication October 19, 2001.
| LITERATURE CITED |
|---|
ANDERSON, E. C., E. G. WILLIAMSON, and E. A. THOMPSON, 2000 Monte Carlo evaluation of the likelihood for Ne from temporally spaced samples. Genetics 156:2109-2118
BEAUMONT, M. A., and M. W. BRUFORD, 1999 Microsatellites in conservation genetics, pp. 165182 in Microsatellites: Evolution and Applications, edited by D. B. GOLDSTEIN and C. SCHLÖTTERER. Oxford University Press, Oxford.
BEST, N. G., M. K. COWLES and S. K. VINES, 1995 CODA: convergence diagnostics and output analysis software for Gibbs sampler output, version 0.3. Technical report, Cambridge, UK, 1995. (http://www.stat.ufl.edu/system/man/BUGS/cdaman03/).
BURCZYK, J., 1996 Variance effective population size based on multilocus gamete frequencies in coniferous populations: an example of a scots pine clonal seed orchard. Heredity 77:74-82.
CABALLERO, A., 1994 Developments in the prediction of effective population size. Heredity 73:657-679.
CHIKHI, L., M. W. BRUFORD, and M. A. BEAUMONT, 2001 Estimation of admixture proportions: a likelihood-based approach using Markov chain Monte Carlo. Genetics 158:1347-1362
CIOFI, C., M. A. BEAUMONT, I. R. SWINGLAND, and M. W. BRUFORD, 1999 Genetic divergence and units for conservation in the Komodo Dragon Varanus komodoensis.. Proc. R. Soc. Lond. Ser. B 266:2269-2274.
DALLAS, J. F., P. J. BACON, and D. N. CARSS et al., 1999 Genetic diversity in the Eurasian otter, Lutra lutra, in Scotland: evidence from microsatellite polymorphism. Biol. J. Linn. Soc. 68:73-86.
DONNELLY, P. and S. TAVARÉ, 1995 Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29:401-421[Medline].
EDWARDS, A. W. F., 1972 Likelihood. Cambridge University Press, Cambridge, UK/London/New York.
ENGLAND, P. R., 1998 The conservation genetics of population bottlenecks. Ph.D. Thesis, Macquarie University, Sydney, Australia.
FELSENSTEIN, J., M. K. KUHNER, J. YAMATO, and P. BEERLI, 1999 Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. IMS Lect. Notes Monogr. Ser. 33:163-185.
FIUMERA, A. C., L. WU, P. G. PARKER, and P. A. FUERST, 1999 Effective population size in the captive breeding program of the Lake Victoria cichlid Paralabidichromis chilotes.. Zoo Biol. 18:215-222.
FRANKHAM, R., 1995 Effective population size/adult population size ratios in wildlife: a review. Genet. Res. 66:95-106.
FUNK, W. C., D. A. TALLMON, and F. W. ALLENDORF, 1999 Small effective population size in the long-toed salamander. Mol. Ecol. 8:1633-1640[Medline].
GELMAN, A., 1996 Inference and monitoring convergence, pp. 131144 in Markov Chain Monte Carlo in Practice, edited by W. R. GILKS, S. RICHARDSON and D. J. SPIEGELHALTER. Chapman & Hall, London/New York.
GELMAN, A., J. B. CARLIN, H. S. STERN and D. B. RUBIN, 1995 Bayesian Data Analysis. Chapman & Hall, London/New York.
GRIFFITHS, R. C. and S. TAVARÉ, 1994 Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46:131-159.
HEDGECOCK, D., V. CHOW, and R. S. WAPLES, 1992 Effective population numbers of shellfish broodstocks estimated from temporal variance in allelic frequencies. Aquaculture 108:215-232.
HILL, W. G., 1981 Estimation of effective population size from data on linkage disequilibrium. Genet. Res. 38:209-216.
HUSBAND, B. C. and S. C. H. BARRETT, 1992 Effective population size and genetic drift in tristylous Eichhornia paniculata (Pontederiacee). Evolution 46:1875-1890.
IHAKA, R. and R. GENTLEMAN, 1996 R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5:299-314.
JORDE, P. E. and N. RYMAN, 1996 Demographic genetics of brown trout (Salmo trutta) and estimation of effective population size from temporal change in allele frequencies. Genetics 143:1369-1381[Abstract].
KANTANEN, J., I. OLSAKER, S. ADALSTEINSSON, and K. SANDBERG, 1999 Temporal changes in genetic variation of North European cattle breeds. Anim. Genet. 30:16-27[Medline].
KRIMBAS, C. B. and S. TSAKAS, 1971 The genetics of Dacus oleae V: changes of esterase polymorphism in a natural population following insecticide controlselection or drift? Evolution 25:454-460.
LAIKRE, L., P. E. JORDE, and N. RYMAN, 1998 Temporal change in mitochondrial DNA haplotype frequencies and female effective size in a brown trout (Salmo trutta) population. Evolution 52:910-915.
LOADER, C. R., 1996 Local likelihood density estimation. Ann. Stat. 24:1602-1618.
LUIKART, G. and J.-M. CORNUET, 1999 Estimating the effective number of breeders from heterozygote excess in progeny. Genetics 151:1211-1216
LUIKART, G., F. W. ALLENDORF, J.-M. CORNUET, and W. B. SHERWIN, 1998 Distortion of allele frequency distributions provides a test for recent population bottlenecks. J. Hered. 89:238-247
LUIKART, G., J.-M. CORNUET, and F. W. ALLENDORF, 1999 Temporal changes in allele frequencies provide estimates of population bottleneck size. Conserv. Biol. 13:523-530.
MARJORAM, P., and P. DONNELLY, 1997 Human demography and the time since mitochondrial Eve, pp. 107131 in Progress in Population Genetics and Human Evolution, edited by P. DONNELLY and S. TAVARÉ. Springer-Verlag, New York.
MILLER, L. M. and A. R. KAPUSCINSKI, 1997 Historical analysis of genetic variation reveals low effective population size in a northern pike (Esox lucius) population. Genetics 147:1249-1258[Abstract].
MOHLE, M., 2000 Ancestral processes in population geneticsthe coalescent. J. Theor. Biol. 204:629-638[Medline].
NEI, M. and F. TAJIMA, 1981 Genetic drift and estimation of effective population size. Genetics 98:625-640
NIELSEN, R., J. L. MOUNTAIN, J. P. HUELSENBECK, and M. SLATKIN, 1998 Maximum-likelihood estimation of population divergence times and population phylogeny in models without mutation. Evolution 52:669-676.
O'RYAN, C., E. H. HARLEY, M. W. BRUFORD, M. BEAUMONT, and R. K. WAYNE et al., 1998 Microsatellite analysis of genetic diversity in fragmented South African buffalo populations. Anim. Conserv. 1:85-94.
PLANES, S. and G. LECAILLON, 1998 Consequences of the founder effect in the genetic structure of introduced island coral reef fish populations. Biol. J. Linn. Soc. 63:537-552.
POLLAK, E., 1983 A new method for estimating the effective population size from allele frequency changes. Genetics 104









