- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Donnelly, P.
- Articles by Joyce, P.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Donnelly, P.
- Articles by Joyce, P.
Likelihoods and Simulation Methods for a Class of Nonneutral Population Genetics Models
Peter Donnellya, Magnus Nordborg1,b, and Paul Joyceca Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom,
b Department of Genetics, Lund University, Lund 223 62, Sweden
c Department of Mathematics and Division of Statistics, University of Idaho, Moscow, Idaho 83844-1104
Corresponding author: Peter Donnelly, Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom., donnelly{at}stats.ox.ac.uk (E-mail)
Communicating editor: S. TAVARÉ
| ABSTRACT |
|---|
Methods for simulating samples and sample statistics, under mutation-selection-drift equilibrium for a class of nonneutral population genetics models, and for evaluating the likelihood surface, in selection and mutation parameters, are developed and applied for observed data. The methods apply to large populations in settings in which selection is weak, in the sense that selection intensities, like mutation rates, are of the order of the inverse of the population size. General diploid selection is allowed, but the approach is currently restricted to models, such as the infinite alleles model and certain K-models, in which the type of a mutant allele does not depend on the type of its progenitor allele. The simulation methods have considerable advantages over available alternatives. No other methods currently seem practicable for approximating likelihood surfaces.
THE neutral theory provides a natural null hypothesis in many evolutionary contexts, and the behavior of neutral models is now rather well understood. On the other hand, any assessment of the forces generating and maintaining genetic diversity also requires an understanding of the behavior of a range of nonneutral models. More formally, at least in the context of data at a single locus, the development of efficient tests of neutrality against particular selective alternatives, and efficient inference for mutation and selection parameters, requires methods for evaluating or approximating likelihood surfaces under the selective models of interest.
In this article, we develop and apply methods for the simulation of samples from nonneutral models and for approximating the distribution of summary statistics of sample diversity. We also give a method for approximating the likelihood function, in the evolutionary parameters, for observed data. This is applied to two data sets, one from Drosophila where purifying selection appears to be operating, and one from Borrelia burgdorferi, the cause of Lyme disease, in which there appears to be balancing selection. While our approach allows general diploid selection, it is currently applicable only to models for mutation in which the type of mutant does not depend on the type of parent, such as the infinite-alleles model.
The methods described here are computationally intensive. On the other hand, our simulation algorithms appear to have advantages over available alternatives (all of which are also computationally intensive). While the methods require substantial computational effort, they remain practicable.
Throughout, we use "frequency" to refer to relative frequencies.
| THEORETICAL BACKGROUND |
|---|
We now describe the class of evolutionary models to which our results apply. For concreteness we first set the discussion in the context of the Wright-Fisher model for the evolution of a panmictic population of constant size N diploid individuals. Write E for the collection of possible alleles in the population. We assume that the probability that any particular allele in a given generation will be a mutant copy of its progenitor allele is u, regardless of the type of the progenitor allele. We further assume throughout that when a mutation occurs the (random) type of the mutant gene does not depend on the type of the progenitor allele. This assumption applies, for example, to an infinite-alleles model and to some K-allele models. It does not hold for stepwise mutation models nor for most models that aim to model DNA sequences directly, including the infinite-sites model. If a mutation occurs, we write
for the distribution of the type of the mutant allele.
Write wN(Ai, Aj) (= wN(Aj, Ai)) for the fitness of an individual with genotype (Ai, Aj). In forming the next generation, 2N pairs of genes are first selected from the current generation. These choices are independent, and the chance that a particular pair (Ai, Aj) is selected at each choice is proportional to wN(Ai, Aj). Next, one of the genes within each pair is chosen uniformly. With probability 1 - u a nonmutant copy of that gene will be included in the next generation, while with probability u a mutation will occur, in which case the type of the mutant will be chosen according to the distribution
. This selection and possibly mutation from within a chosen pair is independent between pairs.
Our results actually apply to the usual diffusion approximation of this model, which arises for a large, panmictic population of constant size N, in which the forces of genetic drift, mutation, and selection are comparable in size. As is usual, we scale the mutation and selection parameters and write
= 4Nu and
(Ai, Aj) = 2N (wN(Ai, Aj) - 1). Note that exactly the same diffusion approximation applies to a wide range of population models and hence our results relate more generally than to the Wright-Fisher model.
Our concern is with distributions that arise in the diffusion approximation of populations at (stochastic) equilibrium under the forces of drift, mutation, and selection. Specifically we relate properties of these distributions in the general model to those that arise under neutrality, which can be thought of as the special case in which
(Ai, Aj) = 0 for all possible genotypes (Ai, Aj).
The genetic composition of a population can be described by listing the alleles present in the population together with their frequencies. We use X to denote a generic population and think of this as a list {(A1, x1), (A2, x2), ... } in which the alleles present are A1, A2, ... with respective frequencies x1, x2, ... (so that xi
0, i = 1, 2, ... , and x1 + x2 + ... = 1). One can think of the diffusion approximation as describing the evolution of a hypothetical infinite population whose genetic composition may be described in exactly this way. In the same way we can describe the genetic composition of a sample Xn from the population as a list of the alleles present in the sample together with their (relative) frequencies in the sample.
We denote the mean fitness of a population X by
* (X),
![]() |
(1) |
(![]()
![]() |
(2) |
Here C is chosen so that
N(g(X)) = 1, i.e.,
|
(3) |
where exp denotes the exponential function, and the expectation,
N, is taken over the distribution of the population composition, X, for the neutral model with the same mutation mechanism in which all genotypes have the same fitness. The normalizing constant, C, will in general depend on the mutation parameters
and
and the fitness function
(·, ·). (Note that the original fitness function still appears in the exponential factor on the right of Equation 2.)
It turns out that sampling distributions for models with selection may be written in terms of properties of the analogous neutral model, and our purpose here is to explore various statistical consequences of this fact. For the configuration Xn of a sample of size n, write LS(Xn) for the probability of obtaining such a sample at equilibrium in the diffusion model described above. Further, we write LN(Xn) for the probability of obtaining such a sample in the diffusion model with the same mutation structure but in which all genotypes have equal fitness. ![]()
| (4) |
The second factor on the right of (4) represents the expected value of the function g(X), where the distribution of X is the conditional distribution, under neutrality, of the composition of the population given that a sample from it has distribution Xn.
The importance of the relation (4) is that it relates sampling distributions under selection to quantities that arise in neutral models. For the class of models we are considering, the probabilities LN(Xn) are known explicitly. Under neutrality, the conditional distribution of the composition of the population given the sample is also known for the class of models considered here. As a consequence, the second factor in (4) can also be evaluated, perhaps by simulation.
| SPECIFIC MODELS |
|---|
We now consider some concrete examples of models to which our analysis applies.
K-allele models:
Consider a model with K alleles, so that E = {A1, A2, ... , AK}. Assume that the scaled mutation rate for each allele is
and that a mutant allele will be Ai with probability
i, i = 1, ... , K, independently for each mutant. As above, we write
(Ai, Aj) for the scaled fitness of the genotype (Ai, Aj).
In this setting, the sampling distribution under neutrality is
![]() |
(5) |
in which we have written ni = nxi for the number of copies of allele Ai in the sample and the notation z(n) for the rising factorial z(n) = z(z + 1) · · · (z + n - 1). Further, under neutrality, the conditional distribution of the population given the sample Xn is a Dirichlet distribution with parameters (
1 + n1, 
2 + n2, ... , 
K + nK), which has probability density
![]() |
(6) |
for
xi = 1. Equation 5 and Equation 6 follow, for example, from WRIGHT's (1949) characterization of the stationary distribution for population allele frequencies as Dirichlet and elementary properties of multinomial samples from the Dirichlet distribution (e.g., ![]()
A special case of this model was used by ![]()
(A, A) = 2
,
(A, C) =
(A, G) =
(A, T) =
, and
(x, y) = 0 for all other genotypes.
Infinite alleles:
For infinite-alleles models with scaled mutation rate
, the neutral sampling formula LN(Xn) is given by the Ewens sampling formula (![]()
![]() |
(7) |
in which aj is the number of alleles with j representatives in the sample.
Under neutrality, the allele frequencies, (X1, X2, ... ), in the population follow the so-called GEM distribution with parameter
,

in which the Vi are independent and identically distributed random variables with probability density

(see, for example, ![]()
To apply (4) we need the conditional distribution of the population given the configuration, Xn, in the sample. Suppose there are k alleles in the sample, which we label 1, 2, ... , k. These alleles must belong to the population, and we write X1, X2, ... , Xk for their population frequencies. There will also be (an infinite number of) alleles in the population that do not appear in the sample. Write W for the total frequency in the population of these alleles. Conditional on the sample frequencies, the distribution of (X1, X2, ... , Xk, W) is Dirichlet with parameters (n1, n2, ... , nk,
), where ni is the number of copies of allele i in the sample. It remains to describe the way in which the total frequency W of "new" alleles, those that occur in the population but not the sample, is broken up among these various alleles. In fact, the proportion of new alleles of the various types has a GEM distribution, independent of X1, X2, ... , Xk, W. Thus, labeling these new alleles k + 1, k + 2, ... , the population frequencies can be written
![]() |
(8) |
where Xk+i = WZi, i = 1, 2, ... , and (Z1, Z2, ...) has a GEM distribution with parameter
, independent of X1, X2, ... , Xk, W. For further background see ![]()
![]()
Various selective schemes may be of interest in the infinite-alleles context. We describe two explicitly and consider them in further detail in subsequent sections.
The first is heterozygote advantage, in which all heterozygotes have equal fitness and all homozygotes have equal fitness that is lower than that of the heterozygotes. In our notation

![]()
A second set of selective schemes arises when the collection of possible alleles is broken up into classes. The simplest example arises when there are two classes, which we call neutral and deleterious. Write
and
for the neutral and deleterious classes, respectively. The fitnesses are then

There are obvious generalizations to schemes with more than two selective classes. See, for example, ![]()
![]()
| APPLICATIONS |
|---|
The theoretical results presented above lead naturally to several computational applications. In this section, three such applications are described in general terms: they are illustrated in two specific genetic models in the following two sections. The software used is available from the authors.
Simulating samples:
In view of the relation (4), samples from the nonneutral model can be obtained by rejection sampling. The idea is to generate neutral samples, Xn, and to "accept" or "reject" each one according to the value of
N (g(X)|Xn). The collection of accepted samples will then be an independent sequence of samples from the appropriate model with selection. For background on rejection methods of simulation (also called acceptance sampling), see, for example, ![]()
It is convenient to introduce a notation for the fitness of the fittest genotype. Define
max to be the maximum, across all possible genotypes, of the scaled selection coefficients:

For example, in both the infinite alleles models described above,
max = 0. Samples from the models we are considering may be generated as follows.
Algorithm 1: The rejection method for simulating nonneutral samples:
- 1. Simulate a sample, Xn, from the neutral model.
- 2. Simulate a population composition, X, conditional on Xn.
- 3. Simulate U, an independent uniform random variate on [0, 1].
- If U
exp(
*(X) -
max), report Xn as a sample from the nonneutral model. Otherwise return to step 1.
The efficiency of this algorithm depends mostly on how often samples are accepted. It is clear that, for some models with very strong selection, the rejection scheme may be too slow to be practical.
Several methods for simulating neutral samples (i.e., for carrying out step 1) exist. For the neutral models we are considering, the following algorithm, similar to the one used by ![]()
Algorithm 2: The urn scheme for generating neutral samples:
- 1. Start with an urn containing a single black ball of mass
. - 2. At each stage, independently of the past, draw a ball from the urn with probability proportional to its mass.
- a. If the black ball is chosen, return it to the urn together with a ball of unit mass whose type is chosen (independently of everything else) according to the distribution
on E. - b. If a ball other than the black ball is chosen, return it to the urn together with an additional ball of unit mass whose type is identical to that of the chosen ball.
- 3. Stop when the urn contains n balls other than the black ball.
The collection of types of balls, other than the black ball, in the urn has the distribution, LN(Xn), of the configuration of a sample of size n from the appropriate neutral model (![]()
![]()
In the case of the K-allele models described above, the balls added to the urn can simply be labeled from the integers 1, 2, ... , K. One convenient method for dealing with infinite-alleles models is to label each successive allele introduced by choosing an independent uniform random variable from [0, 1]. Another possibility is to use the integers 1, 2, 3, ... , successively, as the need arises.
Step 2 of the rejection algorithm requires simulation from the conditional distribution of the population, X, given the sample, Xn. For K-allele models, the required conditional distribution is the Dirichlet distribution (6). For the infinite-alleles model, it is given by Equation 8.
There are also several ways to simulate X given Xn, i.e., to carry out step 2 of the rejection algorithm. As we described in the previous section, the analytic form (under neutrality) of the population distribution conditional on the sample is known explicitly for the models we are considering. We thus generate X directly from this distribution.
Samples Xn, which arise from these simulations, contain information about the allele frequencies and the labels of the alleles. Thus, for example, in a two-class model, it will be known which alleles in the sample belong to which class. This division of alleles into classes will not be known for many actual data sets. Simulation of data of this sort may be achieved simply by ignoring the allele labels in the samples simulated as above.
The scheme above applies to any of the nonneutral models within the class we are considering. Special features of particular such models may lead to other simulation schemes for those models. For example, ![]()
![]()
Estimating the distribution of sample statistics:
A common reason for simulating samples is to estimate the distribution of some sample statistic. Two related approaches, each exploiting (4), are available for this problem in the context of the nonneutral models we are considering. The first is simply to use the rejection method just described to simulate a large number, m say, of samples from the appropriate model, to evaluate the statistic of interest for each of the simulated samples, and to approximate its distribution by the histogram of simulated values.
The second approach is based on importance sampling. (For background, see ![]()
Algorithm 3: Importance-sampling scheme for estimating the distribution of a sample statistic:
- 1. Simulate a collection of samples from the neutral model, X(i)n, i = 1, 2, ... , M.
- 2. For the ith such sample, X(i)n, simulate, under neutrality, a value X(i) for the population composition given that the sample configuration is X(i)n. Evaluate
i = exp(
*(X(i))). - 3. Create a histogram that assigns probability
to the ith observed value of the sample statistic f(X(i)n).
The resulting histogram estimates the distribution of f(Xn), where Xn is a sample from the nonneutral model. The importance sampling estimate of E(f(Xn)k), the kth moment (under the nonneutral model) of the statistic, is just
![]() |
(9) |
In particular, the mean value of the statistic is estimated from (9) with k = 1.
The first two steps in the importance sampling method just described are carried out exactly as for the rejection scheme above. Note that, since the Xn and X are drawn from the relevant neutral distribution, they can be reused for all selection schemes of interest (e.g., for different values of the parameters describing the strength of selection). This improves the efficiency of the method considerably.
The idea of the importance sampling scheme is that each simulated neutral sample is used in estimating the distribution of the statistic, but the neutral samples are given different weightsthose that are more likely under the selection scheme of interest are given relatively higher weights than those that are unlikely. In contrast, the rejection scheme makes a simple (randomized) "yes/no" decision about whether to include or exclude each neutral sample. As discussed in the Appendix, the importance sampling scheme is more efficient.
Likelihood analysis:
It is also possible to carry out full-likelihood analyses on the basis of (4). The likelihood for a sample Xn involves the following three quantities:
- LN(Xn), the likelihood of the sample under neutrality.
N(exp(
*(X))|Xn). - The normalizing constant C =
N(exp(
*(X))), where the expectation is taken over the distribution of the unconditional population composition.
Multiplying the first and second quantities, and dividing by the third, gives the likelihood under the nonneutral model.
Since, as we have seen, the sampling distribution is known for the neutral models that we are considering, the first quantity can be calculated numerically. For example, for the K-allele model it is given by (5) and for the infinite-alleles model it is the Ewens sampling Equation 7.
The second quantity is estimated as described for the second step in the importance sampling method.
The third quantity can be estimated by generating a large number of (unconditional) population compositions, X, calculating exp(
*(X)) for each of them, and taking the average. We note that, as before, it is possible to use the same set of simulated X for several different selection schemes.
Although in principle straightforward, it turns out that this last step can sometimes be computationally extremely difficult. The reason for this is simple enough: if selection is strong, then populations that have evolved in the absence of selection will with high probability have extremely low mean fitness and with low probability have high fitness (consider the probability of observing a functioning mitochondrion if mutation were the only factor governing molecular evolution!). The distribution of exp(
*(X)) is thus extremely skewed, and its expectation is difficult to estimate. Note that the distribution of exp(
*(X)) conditional on the observed sample is usually much better behaved, because the selective scheme will typically be chosen to fit the sample. We return to these issues in the examples below.
| PURIFYING SELECTION |
|---|
As described above, deleterious mutation can be modeled in the infinite-alleles framework by letting the collection of all possible alleles (which we label as points in [0, 1]) be divided into the class of wild-type alleles,
= [0, a), and the class of deleterious alleles,
= [a, 1]. Thus the scaled deleterious mutation rate is (1 - a)
, and the rate of back mutation a
, so a should be close to zero for this model to be at all reasonable.
We use the common fitness parameterization 1, 1 - sh, and 1 - s for the homozygous wild type, heterozygote, and homozygote deleterious, respectively. The appropriately scaled fitness coefficients are thus
12 = 2Nsh and
22 = 2Ns. Note that if h is small (recessive mutations) and the mutation rate small relative to the strength of selection, the mean equilibrium frequency of deleterious alleles will also be small and close to the deterministic approximation
/2
12 (![]()
Likelihood for simulated samples:
We first demonstrate the calculation of a likelihood surface on samples simulated with known parameters using Algorithm 1. Fig 1 and Fig 2 show two such examples. Furthermore, it seems that there is enough information in samples under this model to obtain reasonable estimates of both
and
22.
|
|
These examples also illustrate some of the predicted problems with the methods. First, both samples required on the order of 104 rejections, which is not a problem when generating only a few samples, but can make simulating large numbers of samples prohibitively time consuming. Second, the surfaces have "ridges" along the
22-axis and look jagged along the
-axis. The ridges are caused by the fact that we (as described in the previous section) reuse the population compositions generated for a given
for all selection values. The values along the
22-axis are thus completely correlated. This would not be a problem if the values along the
-axis were well estimated, but they are clearly not. It turns out that the problem here lies in the estimation of the constant C, which is very difficult under this model. Fortunately, there is also an easy alternative for this model, because exp(
*(X)) depends only on the frequency of deleterious alleles, not the entire composition. Under neutrality, it can be shown, for example, by considering a two-allele model with mutation rate
(1 - a) and
a, respectively, between the two alleles, that the frequency of deleterious alleles in the population under the present mutation scheme is ß-distributed with parameters
(1 - a) and
a, and it is thus possible to calculate C through numerical integration. Fig 9 in the following section illustrates how much this can improve the estimation.
|
|
|
|
|
|
|
Xdh in Drosophila pseudoobscura:
We now turn to a real data set, namely that of ![]()

In a sample of size n = 89, there were thus kn = 15 different alleles, 8 of which were singletons and 1 of which appeared at a frequency close to 60%. Under the neutral infinite-alleles model,
, the maximum-likelihood estimate of
is
5. The expected sample homozygosity,
(Fn), for
= 5 is 0.17. The observed Fn is much higher, 0.37, and neutrality is rejected by the Ewens-Watterson test (![]()
Assuming
= 5, a = 0.01, and h = 0.2, we used Algorithm 3 to approximate the distributions of kn, Fn, and the sample frequency of deleterious alleles, qn. The expectations of these quantities as a function of the strength of selection are shown in Fig 3. As the strength of selection increases, the expected frequency of deleterious alleles in the sample decreases toward the value predicted by Haldane. At the same time, the expected number of alleles in the sample decreases, and homozygosity increases, as observed in the Gundlach-Bundschu sample.
As predicted by classical population genetics theory (![]()
Given this, we should not be surprised to discover that the distribution of Fn does not change in a simple manner as the strength of selection increases either (Fig 6). This is considerably more informative than simply noting that the observed Fn lies significantly above what would be expected under neutrality. For one thing, it is clear that selection must be above a certain threshold for the Ewens-Watterson test to have any power.
The distribution of kn (Fig 7 and Fig 8) is well approximated for smaller selective values but less so for larger ones.
It should be emphasized that, although the importance sampling method will give an estimate of the distribution for cases where simulating samples with the rejection method is infeasible, this estimate may be very poor. It is thus wise to investigate the robustness of the estimate in some way. For instance, in the present case, it was found that the estimated sample frequency of deleterious alleles (Fig 3 and Fig 5) was far too high given the known population distribution for stronger selection coefficients. The reasons for this are the same as those that made the normalizing constant C difficult to estimate (see above). In essence, importance sampling works poorly when the distribution that one is sampling from is very different from the target distribution: since we are sampling from the neutral distribution precisely, this occurs when selection is strong.
Fig 9, finally, shows the likelihood of the Gundlach-Bundschu sample if all but the most common allele are deleterious, as a function of
and
22, again assuming h = 0.2 and a = 0.01 (it is of course possible to vary as many of these parameters as desired). The surface shows clear support for
22 > 0. Different values of a give qualitatively similar results (in that neutrality is rejected), but the actual location of the peak differs.
| BALANCING SELECTION |
|---|
Lyme disease:
The model we study in this section is motivated by data collected by ![]()
under the infinite-alleles model is 0.5, for which the expected homozygosity is 0.7. The observed value is 0.29, and neutrality can again be rejected using the Ewens-Watterson test. Qiu et al. hypothesized that some form of frequency-dependent balancing selection could be acting on the B. burgdorferi types.
We used the following simple model of frequency-dependent selection. Assume that the fitness of an allele i with frequency xi (note that Borrelia is haploid) is 1 - sxi. Then the mean fitness of the population is 1 - s
ixi2, and, scaling as before,
*(X) = -
ixi2. This is the same
*(X) as for the heterozygote advantage model described above. It is straightforward to check that this haploid infinite-alleles model with frequency-dependent selection is equivalent to a diploid infinite-alleles model with heterozygote advantage, to which our approach applies.
Homozygosity and number of alleles:
Qiu et al. thus rejected neutrality because the observed homozygosity was far too low given the number of alleles. Fig 10 shows the distribution of Fn for
= 0.5 under neutrality and for increasing strengths of selection. The observed Fn = 0.27 is clearly more likely under selection.
|
Fig 11 shows the distribution of the number of alleles in the sample for the same model. Note that the observed value of kn = 4 becomes less likely for stronger selection. This is not surprising given that frequency-dependent selection will tend to produce samples with more alleles and that the used value of
is the maximum-likelihood estimate under neutrality. For a given number of alleles in a sample, the stronger we believe selection to be, the smaller we should believe
to be.
|
Likelihood analysis:
It turns out that the inverse relationship between selection and mutation is "stronger" than one might intuitively have expected: likelihood surfaces for samples from this model will always have a maximum for the smallest mutation rate and strongest selection allowed.
That this should happen for infinite-alleles mutation should perhaps not be entirely unexpected, because this model allows infinitely many selectively different alleles, effectively reducing the population homozygosity to zero. However, we found that the same problem occurs whenever potential alleles not present in the sample are allowed.
If we assume that the four observed types are the only possible ones, likelihood analysis becomes possible (Fig 12). The intuitive interpretation of the resulting surface is that the data tell us that the combination of mutation and selection must be strong enough to ensure that all four alleles are present in the sample at reasonably equal frequencies, but not so strong as to make the observed unevenness very unlikely. Thus the value of 
is rather well determined, whereas the absolute value of either is not. Following ![]()
![]()
.
|
| DISCUSSION |
|---|
While the neutral theory is a convenient "null hypothesis" in many studies of genetic diversity, methods for simulating and analyzing nonneutral models are an essential prerequisite to understanding observed patterns of variability as well as the extent to which the neutral theory is generally applicable. In this article we described computational algorithms for simulating samples from nonneutral models, for approximating the distribution of sample statistics, and for approximating the likelihood of observed data as a function of the unknown parameters. The methods apply to nonneutral models in which the distribution of the type of a mutant allele does not depend on the type of its parental allele. Our approach was illustrated on two data sets.
The key to the approach is the Equation 4 that relates the probability distribution of samples under selection to those in the equivalent neutral model. Loosely speaking, Equation 4 states that the probability of a sample under selection can be evaluated by evaluating its probability under neutrality and then applying a multiplicative correction on the basis of the expected fitness of the population from which the sample arises. The utility of (4) stems from the fact that all the quantities on the right-hand side are evaluated for neutral models, which are extremely well understood. Within the class of models we are considering, the methods we described are valid regardless of the strength of selection. However, they work best when selection is weak.
Wright's formula (![]()
. (There is a simplification in this limit because the conditional expectation is trivial, and the second factor just becomes g(X).) In this sense, 4 extends Wright's formula to the other models we considered, including infinite-alleles models.
In light of this observation, one possible way to simulate a sample from such a model would be first to simulate the genetic composition of the whole population at equilibrium and then to take a sample from the simulated population (![]()
Although computationally intensive, application of the rejection-sampling algorithm leads to a collection of independent samples from the specified model. As in classical statistics, the precision with which this collection can be used to estimate quantities of interest will depend on the number of samples used. We show in the Appendix that the importance sampling method is more efficient than the rejection method. On the other hand, it is much harder to assess the precision of the resulting estimates.
In the Appendix we define formally the concept of effective sample size, ESSI and ESSR, as a measure of the information content of samples generated by the importance and rejection algorithms, respectively. Informally, if m neutral samples are simulated for use in Algorithm 3, the effective sample size ESSI corresponds to the number of independent samples under selection, which would have the same information content as the m samples used in the importance sampling algorithm. ESSR is the expected number of independent samples generated by the rejection method.
Care is needed, on several fronts, in the implementation of the methods in this article. Table 1 gives estimates of the ratio of ESSI and ESSR to the number of simulated neutral samples for various levels of selection in the deleterious alleles model. Several points are noteworthy. Considering the final column, we note that the importance sampling method is more efficient (as proved in the Appendix) and in general requires about an order of magnitude fewer samples than the rejection method to gain the same information. Second, especially for more intense selection, the effective sample sizes are small. To estimate the distribution of sample statistics, one might want the equivalent of 103 or perhaps 104 independent samples under selection. Using Algorithm 3, importance sampling would require the simulation of order 106, 108, and 109 neutral samples, for
22 = 10, 25, and 40, respectively, to have as much information as of order 103 independent samples from the selected model. In this sense, very substantial computational effort is needed to implement these methods. Finally, considering the values for finite m shows that if ESSI is estimated from the simulation output (which is usually the only option), then it can be very seriously overestimated for small m, leading to inappropriate confidence in the amount of information in the resulting simulations. The reasons for this are described in the Appendix, but it is crucial to be aware of these potential pitfalls when implementing the method (for a fuller discussion, see ![]()
|
In view of the substantial computational burden in applying our approach, it is natural to ask whether there are easier alternatives.
Consider first the problem of simulating samples, or summary statistics, from the selected model. An alternative would be to simulate the evolution of the whole population forward in time for a sufficiently long period and then simply to take a sample from it. One difficulty with this approach is that it is difficult to be sure how long the simulation should be run to ensure that the population allele frequencies are at stationarity under mutation-selection-drift equilibrium. In practice it is also not clear how to choose the size of the population to be simulated. Clearly the computational burden increases substantially with the population size: more work is required in simulating each generation, and more generations will be required to approximate stationarity in a larger population. Implementation of the forward simulation thus requires some leaps of faith (Has stationarity been achieved, and is the population size large enough?), which cannot be verified without more simulation (for more generations and/or for larger population sizes). In contrast, the Equation 4 on which our method is based is exact for stationarity. Unless selection coefficients are large and interest focuses on a small population, our algorithms for simulation seem preferable to forward simulation of the population.
A second alternative would be to use the ancestral selection graph developed for genic selection by ![]()
![]()
![]()
Traditional approaches to inference in genetics models, based on one-dimensional summary statistics, often ignore much of the information in data. For single-locus data, where information is limited anyway by evolutionary variability, this can be unhelpful. Within the context of a particular genetics model, efficient inference (either estimation of parameters, hypothesis testing, or model comparison) requires evaluation of the likelihood function, and there has been considerable recent interest in computationally intensive methods for approximating likelihoods under various evolutionary models (![]()
- It immediately allows for Bayesian analyses, which have proved helpful in several contexts within population genetics.
- For differing computational methods for approximating the likelihood, comparison of estimated likelihood surfaces can be a valuable tool in understanding the pitfalls of different approaches.
- Comparisons of likelihoods based on the full data with those for simpler methods based on summary statistics gives a useful sense of the amount of information lost by the simpler methods.
From the perspective of classical statistics, the availability of the likelihood surface immediately allows the use of maximum-likelihood estimators for parameters. These would be expected to have good properties, at least for large data sets, but the usual large-sample statistical theory, which, for example, allows the construction of confidence intervals for parameters from likelihood surfaces or the testing via generalized-likelihood-ratio statistics of nested models, does not obviously apply for population genetics data. An alternative approach to interval estimation would be through the use of a parameteric bootstrap (![]()
We applied our method to two data sets, in which purifying, respectively balancing, selection might plausibly be thought to be operating. In practice, at least in a classical statistical framework, it would be appropriate to settle on a selective model before rather than after seeing the data. Our principal aim here was illustrative. For a formal analysis one could use part or all of the data to suggest a particular model and then fit the model on additional data. There may also be some value informally in asking about what levels of particular sorts of selection might be consistent with observed data. For purifying selection there is also the question of which allele is the wild type. This could be regarded as part of the model choice and, for example, be based on data that are not subsequently used to fit the model. Alternatively, if primary interest is in whether purifying selection is acting at all, rather than on which alleles it is acting, it would be appropriate to evaluate the likelihood by summing over all possible choices of the wild-type allele. This could be done at substantial computational cost directly, but it would be easier to apply the results of ![]()
While our approach to approximating likelihoods is computationally intensive, it seems the only one currently practicable. Except for very small sample sizes, the naive approach of simulating samples for a grid of parameter values and for each set of parameter values approximating the likelihood by counting the proportion of times the data arise is unworkable. (For example, for the Gundlach-Bundschu data, likelihood values are of order 10-17, so that
1019 simulated samples would be required to approximate these well for each parameter value, with each such simulated sample itself requiring major computational effort, as discussed above.)
All of the methods described here apply to panmictic populations of constant size. No theoretical results analogous to (4) are available for other demographic models. It is well known that certain forms of population demography can produce patterns in data that are similar to those produced by particular forms of selection (![]()
![]()
![]()
The methods discussed in this article are computationally intensive. Our general view is that computational power is increasing and that given the cost and effort required to collect data, there should not be concern over analyses that take, say, hours or days, on modern workstations. One positive feature of the methods developed here for likelihood surfaces is that the computational burden does not depend much on the sample size. (In fact, it decreases slightly as the sample size increases.)
Both the rejection and importance sampling algorithms for simulation avoid the need to evaluate the constant C defined at (3). (It conveniently cancels from numerator and denominator of both the rejection probability and the importance weights.) In contrast, use of (4) to derive likelihoods for parameters such as the mutation rate and the strength of selection requires explicit evaluation of C. As we noted, the obvious approach of simulating neutral populations and averaging the values of exp(
(X)) requires care because the distribution of exp(
(X)) is extremely skewed to the right. As a consequence, very large numbers of simulations from the neutral model are needed to gain any precision in evaluation of C. In some settings, it can be evaluated by other means. We used population genetics theory and numerical integration for one of our examples and note a vast improvement (Fig 9).
A major weakness of the approach developed here is the restriction imposed on the mutation mechanism, namely that the type of a mutant does not depend on the type of its parent. Our methods do not apply, for example, to effectively all models for mutation at the level of DNA sequences nor to sensible models for microsatellite mutation processes. There are some types of data (including those analyzed above) for which they are applicable or at least are reasonable approximations. It would be nice to extend our approach to more realistic models for mutation. At present, however, the required theoretical results are not available (![]()
| FOOTNOTES |
|---|
1 Present address: Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-1340. ![]()
| ACKNOWLEDGMENTS |
|---|
We thank A. Kong and M. Stephens for valuable discussions. P.D. was supported in part by UK Engineering and Physical Science Research Council Advanced Fellowship B/AF/1299 and grant GR/M14197, UK Biotechnology and Biological Sciences Research Council grant 43/MMI09788, and National Science Foundation grant DMS 9505129. M.N. was supported in part by Naturvetenskapliga forskningsrådet grant B-AA/BU 12026 and by the Erik-Philip Sörensen Foundation. P.J. was supported in part by NSF grants DMS-96-26764 and DMS-00-72198.
Manuscript received March 22, 1999; Accepted for publication July 12, 2001.
| APPENDIX |
|---|
Rejection vs. importance sampling for estimating the distribution of sample statistics:
The rejection scheme has the advantage that it generates an independent sample, of known size m, of values of the sample statistic under the nonneutral model. Just as in standard simulation problems, the accuracy of the estimated distribution can be assessed in terms of the sample size m. On the other hand, as our examples later illustrate, each of the m nonneutral samples may actually involve the simulation, and rejection, of a large number (of order 104 in our examples) of neutral samples.
In using all the simulated neutral samples, the importance sampling scheme is in some sense more efficient than the rejection method. On the other hand, its accuracy is much more difficult to assess. ![]()
![]() |
(A1) |
where CV2 =
is the squared coefficient of variation of the importance weights. The sample mean and sample variance of the importance weights
i, i = 1, 2, ... , m, can be used to estimate E(
i) and Var(
i), respectively, and hence to estimate CV2. An estimate of the effective sample size EESI can then be calculated from (10), with the interpretation that the accuracy of the estimated distribution will be roughly the same as that obtained from ESSI independent samples from the nonneutral model. In settings such as those considered later in the article, the effective sample size will often be very much smaller than the number, m, of samples taken. (This is analogous to the fact that the rejection method will typically require many rejections for each accepted sample.)
A serious danger with the importance sampling scheme occurs in settings when the distribution of t




















