Abstract
In this article we explore statistical properties of the maximumlikelihood estimates (MLEs) of the selection and mutation parameters in a Poisson random field population genetics model of directional selection at DNA sites. We derive the asymptotic variances and covariance of the MLEs and explore the power of the likelihood ratio tests (LRT) of neutrality for varying levels of mutation and selection as well as the robustness of the LRT to deviations from the assumption of free recombination among sites. We also discuss the coverage of confidence intervals on the basis of two standardlikelihood methods. We find that the LRT has high power to detect deviations from neutrality and that the maximumlikelihood estimation performs very well when the ancestral states of all mutations in the sample are known. When the ancestral states are not known, the test has high power to detect deviations from neutrality for negative selection but not for positive selection. We also find that the LRT is not robust to deviations from the assumption of independence among sites.
THE Poisson random field (PRF) models of Sawyer and Hartl (1992) and Hartl et al. (1994) afford a general likelihood framework for estimating mutation and selection parameters in various population genetic settings. This is currently the only method available for estimating selection directly from DNA sequence data in population genetics. The PRF model has proven quite useful in studying selection on “silent” site polymorphism in Drosophila (Akashi 1995; Akashi and Schaeffer 1997) and Escherichia coli (Hartlet al. 1994) as well as amino acid variation in E. coli and Salmonella enterica (Hartlet al. 2000). It has also been used to study the power of various tests of neutrality on the basis of polymorphism and divergence data (Akashi 1999).
Since the PRF model assumes independence among sites, it is a limiting case for population genetic models that incorporate mutation, selection, and recombination. For typical DNA sequence data, in which many sites segregate simultaneously, the assumption of independence among sites is equivalent to the assumption that there is free recombination between segregating nucleotides. Studying statistical inference in the PRF framework can therefore illuminate efforts to build models that relax the assumption of independence among sites. The reason for this is that any model that hopes to estimate mutation, selection, and recombination parameters can do as well as or worse than the PRF if the data conform to the PRF model.
It would be of great interest to know, therefore, how well maximumlikelihood methods for point estimation and hypothesis testing in the PRF framework perform under varying levels of selection and mutation. One property of interest is the probability of rejection neutrality if the locus is truly under selection, which addresses the power of the likelihoodratio tests (LRTs) of neutrality. Another statistical issue of interest concerns the robustness of these models to deviations from the assumption of independently evolving sites, since this assumption may be unrealistic for certain types of data.
This article treats in detail the PRF model for sitefrequency data proposed by Hartl et al. (1994) for both folded and unfolded site configurations. In particular, we (1) derive the asymptotic variances and covariance of the maximumlikelihood estimates (MLEs) of the selection and mutation parameters, (2) explore the power of the likelihoodratio test of no selection under varying levels of mutation and selection, (3) explore the coverage of two types of confidence intervals for the selection parameter, and (4) explore the robustness of the model to deviations from the assumption of independence among sites.
THEORY
Poisson random field model: First let us consider a simple example with which to illustrate the type of data being modeled. The data in Table 1 represent a sample of six individuals sequenced at five variable DNA sites. Each column is an aligned DNA site and each row is an individual sampled from the population. At each site where the individual carries the derived nucleotide, the resulting entry in the matrix is 1; otherwise the entry is 0. For each site, then, the sum of the column gives the number of individuals in the sample that carry the derived form of the mutation.
Define the unfolded sitefrequency spectrum as the random vector X = (X_{1}, X_{2}, … , X_{n}_{−1}) of sample configurations X_{i}, where X_{i} represent the number of sites that have n − i ancestral and i derived nucleotides for 1 ≤ i ≤ n − 1 among the n aligned nucleotide sequences. For the example in Table 1 the unfolded sitefrequency spectrum is (3, 1, 0, 1, 0) since there are three sites where one individual carries the derived mutation, one site where two individuals carry the derived mutation, no sites where three individuals or five individuals carry the derived mutations, and one site where four individuals carry the derived mutation.
So far we have assumed that for each site we know which state is derived and which is ancestral. If this information is unknown, we would model the folded sitefrequency spectrum, X′, defined as the number of sites where 1 individual or (n − 1) individuals carry a mutation, 2 individuals or (n − 2) individuals carry a mutation, etc., so that
We wish to model the sitefrequency spectrum for DNA sites in an infinitesites (i.e., irreversible mutation) independent (i.e., freely recombining) model with (weak) directional haploid selection. We assume that all mutations that enter the population have the same selective effect and that there are no epistatic interactions among mutations (i.e., independent fitness effects of all mutations). We also assume that there is no population structure to the data and the population has achieved statistical stationarity.
Fisher (1930) and Wright (1938, 1969) derived a formula for what Wright called the “transient distribution” for the frequency of a single newly arisen mutation under selection. Letting f(q, γ) be the density of the frequency of a single mutation, q, in the range dq, their formula is
Sawyer and Hartl (1992) showed that for recurrent mutation at a locus with independently evolving sites, the expected density of a random field composed of the individual frequencies of the derived mutations is (θ/2)f(q, γ), where θ/2 is the perlocus mutation rate scaled by the effective population size. They also showed that this random field is a PRF, which means that the number of sites whose derived mutation frequency q satisfies 0 < a < q < b < 1 has a Poisson distribution for given a, b and also that the number of sites for nonoverlapping intervals (a, b) is independent (Kingman 1993).
For a particular site, given the frequency q of the mutation in the population, the probability of sampling i sequences of one type and n − i of another is binomially distributed with parameters n and q. If mutations enter the population as a Poisson process with rate θ/2, the X_{i}'s are independent Poissondistributed random variables with mean, θF(i, γ), where
The PRF model outline above leads directly to a likelihoodratio test of neutrality, which compares the null hypothesis γ = 0 with the alternative hypothesis that γ ≠ 0. Since the X_{i}'s are independent, the likelihood function is the product of the individual p(X_{i} = x_{i}θ, γ). Letting L_{f}(θ, γ  x) be the likelihood function for the folded model and L_{u} be the likelihood function for the unfolded model,
Appealing to large sample results (Kendall 1987), 2 ln(Λ) is
Maximumlikelihood estimation and asymptotic results: Finding the maximumlikelihood estimates of θ and γ for both the folded and unfolded site frequency models is relatively straightforward. Let l be the loglikelihood function for the unfolded model
The maximumlikelihood estimates of γ and θ (i.e.,
An efficient approach to solving for the MLEs is to maximize the profile likelihood that has the same global maximum but requires only maximization in one dimension. The logprofilelikelihood function, l*(γ), is
It follows from (2) that
We can now maximize the logprofilelikelihood function l*(γ) numerically, using a standard optimization technique such as NewtonRaphson iteration (Lange 1999). To implement NewtonRaphson we need to find the first and second derivatives of l*(γ) with respect to γ. To do this we first substitute (13) into (8) and take the necessary derivatives. The results are
To find the equivalent results for the folded model, substitute n by [n/2] and F, F′, and F″ by G, G′, and G″, respectively, in the equations above, where
To find the asymptotic covariance matrix of the maximumlikelihood estimates, we compute the Fisher information matrix I(θ, γ), which is minus the expected value of the matrix of second derivations of the loglikelihood function. Since the components of I are linear in x_{i} and S, this is equivalent to replacing x_{i} by E(X_{i}) = θF(i, γ) and S by θT_{0}(γ). The asymptotic covariance matrix of
The components of the Fisher information matrix I (θ, γ) are easily shown to be
The asymptotic covariance of the MLEs is the inverse matrix of I (
Note that the negative inverse of the second derivative of the logprofilelikelihood function (17), with x_{i} replaced by E(X_{i}) = θF(i, γ) and θ by S/T_{0}, is formally the same as (32). this is consistent with the fact that both expressions give the asymptotic variance of
From general likelihood theory, we expect (
The first confidence set consists of γ in the interval
SIMULATIONS
The two properties of the LRT that we are interested in exploring are the size and power of the test.
The size, confidence level, or probability of type I error of a test is defined as the probability of rejecting a true null hypothesis and is denoted α. The power of a test is the probability of rejecting a false null hypothesis and is denoted 1 −β, where β is the probability of not rejecting a false null hypothesis (i.e., the probability of a type II error).
In particular, we explore whether the LRT maintains the proper size even if the assumption of independence among segregating sites is not met. If the test is robust to the assumption of free recombination and the null hypothesis is true, then in repeated sampling we should reject the null hypothesis (100 · α)% of the time. If the test is not robust to the assumption of free recombination, the realized size of the test will vary with the rate of recombination. We might also expect the realized size of the test to vary with the mutation rate if the test is not robust to recombination, since a larger mutation rate will produce a test with a higher probability of rejecting the null hypothesis. It would also be of interest to know how the power of the LRT changes as the strength of selection changes for data that conform to the independence assumption.
To explore these two issues, as well as the coverage of the two different types of confidence intervals for
All numerical integration was done using Romberg integration, maximization by NewtonRaphson iteration with bisection, and random number generation using standard numerical algorithms (Presset al. 1988). Software for calculating the PRF likelihoodratio test is available from the authors upon request.
RESULTS AND DISCUSSION
Size and power of the LRT: In Table 2 we summarize the results for the analysis of the size of the PRF test for both folded and unfolded site configurations under two different mutation rates and 10 levels of recombination. From this analysis we see that the Poisson random field LRT is extremely sensitive to the assumption of independence among sites. If the test were not sensitive to this assumption, we would expect the entries in the table to fluctuate ~5% but we see rejection levels that reach upward of 50% for tight linkage. While the strength of the effect declines quickly as the rate of recombination increases, the tests do not attain level α until the rate of recombination is more than an order of magnitude greater than the mutation rate.
The reason for this becomes clear when one examines the effect of recombination rate on the distribution of the sitefrequency spectrum, X. In Figure 1 we present representative replicates of X from the coalescent simulations (γ = 0) under varying levels of recombination for θ = 10. The expected value of the number of sites at a given frequency i/n is the same for each of the six graphs presented [E(X_{i}) = θ/i]. The variance of X_{i}, though, increases as the recombination rate tends to zero. The reason, therefore, that we reject neutrality more than expected in Table 2 with increasing linkage is that we underestimated the variance for the site frequencies. Likewise, if neutral sites are linked to selected sites there will also be an increase in the variance in the distribution of the X. These results suggest that the analysis of single genes using this method is invalid if the rate of recombination is not extremely large, since one rejects the null hypothesis (which of course includes the assumption of free recombination) if the data are neutral but not independent.
Performance of maximum likelihood for point estimation in the PRF framework: While the method may not be applicable for linked regions, the analysis of genomewide singlenucleotide polymorphism may have the quasiindependence required for the test to maintain level α. It would be desirable, therefore, to know how well the method performs in moderate sample sizes when the data conform to the PRF model. For the remainder of this article, all of the analysis is performed on data generated under the PRF model.
In Figure 2 we summarize the results for the analysis of the power of the PRF test for both folded and unfolded site configurations under the same mutation rates above for γ ε [−10, 10]. We see from that that the LRT has very good power to detect deviations from neutrality even in moderate sample sizes when the ancestral states of variable sites are known (i.e., for unfolded site configurations). A surprising result is that for folded site configurations the test has strong power to detect negative selection but very weak power to detect positive selection. The reason for this is related to the issue of point estimation in general in the PRF framework.
From generallikelihood theory, we expect the maximumlikelihood estimate for a parameter under certain regularity conditions to be consistent, unbiased, and efficient as the sample size used to estimate the parameter tends to infinity. By “consistent” we mean that the parameter estimate will converge in probability to the true value of the parameter (i.e., the variance in the parameter estimate will tend toward 0), by “unbiased” we mean that the expected value of the parameter estimate is centered on the true value of the parameter, and by “efficient” we mean that the estimator has the lowest possible variance of all possible estimators. In practice, though, maximumlikelihood estimation can often behave very poorly if the sample size is not large enough or if the regularity conditions are not met.
For the levels of mutation examined in our generated data, we find that the distributions for
While maximumlikelihood estimation works reasonably well for point estimation using unfolded data, the method performs very poorly for folded data in the region of positive selection as alluded to above in the section on power. As we see in Figure 3, for the folded data the method performs well in terms of estimating γ for the region of negative selection, performs reasonably well for a small region of positive selection, and then asymptotes to neutrality for arbitrarily large γ.
For a better understanding of this phenomenon, let us focus on one of the replicate data sets we generated under θ = 10 and γ = 10, x = (11, 6, 8, 1, 2, 6, 2, 3, 10). For these data,
The reason for this phenomenon is that when the site configurations are folded, the limiting distribution as γ → ∞ is indistinguishable from neutral evolution with a mutation rate that is twice as large as the true mutation rate. This can be shown by noting that the loglikelihood function for a given data set will reach finite asymptotic value for γ → ∞ since
It is important to note that this is a general problem for methods that attempt to detect selection using folded site configurations [e.g., Tajima's (1989) Dstatistic]. This also explains the odd result reported in Akashi (1999) that Tajima's Dstatistic has little to no power to detect positive selection. Since this is a general problem for estimating selection from population genetics data, the inclusion of a parameter for recombination would make the problems of inference even worse, since in this case we know that the data conform to the assumptions of the model.
One interesting point related to this issue is the effect of the estimation of γ on the estimation of θ. We see from Figure 6 that estimation of θ is excellent for the unfolded model for regions of positive selection. It may seem paradoxical that while the variance in the estimates of γ increases with increasing level of selection, the variance in the estimates of θ decreases. The reason for this is that once selection becomes large the predictions of the PRF model for the sitefrequency spectrum become indistinguishable (this is exactly the reason for the asymptote at ∞ for positive selection), but only a small range of θ is consistent with the observed number of segregating sites. In the case of negative selection, the results are reversed since as selection increases in the negative selection the values for F(i, γ) become very small, leading to large variance in the estimate of both selection and mutation.
For the folded model, in Figure 6 we begin to see the effect of the maximum near 0 for
Confidence sets for γ: The last issue we explore is the construction of confidence intervals for the selection parameter in the PRF framework. As we outlined above our two methods for constructing confidence intervals are (1) the region that contains (1 −α) · 100% of the area in the normal approximation to likelihood function and (2) the region in the profile likelihood space that is
We see from Figure 7 that confidence intervals based on the normal approximation to the likelihood tend to undercover (i.e., they tend to be too small), particularly for the folded model and for strong positive selection. The confidence intervals based on the profile likelihood, however, have excellent coverage regardless of the folding of the data or the strength of selection.
The reason for this is seen in Figure 8, where we plot the logprofilelikelihood function and its normal approximation for a data set with the same
CONCLUSIONS
The Poisson random field model for sitefrequency data holds great potential for untangling the effects of mutation and selection on standing genetic variation. Our analysis suggests, however, that when the data do not conform to the assumptions of very loose linkage, the estimates from this model may be misleading. Likewise, if the data set being analyzed is small, one should be cautious about the asymptotic assumptions implicit in maximum likelihood as a method of point estimation. This is a general problem for methods that attempt to estimate selection using sitefrequency data, and methods that attempt to incorporate recombination into point estimation will only make some of these problems worse. The solution may be to focus on genetic variation that is independent (such as the growing number of single nucleotide polymorphisms for many organisms) rather than focusing on methods that detect selection at single loci. We also find that folding the sitefrequency spectrum obscures positive selection and suggest that every effort should be made to incorporate information on the ancestry of mutations being studies, if one wishes to estimate selection.
We also point out that simply adding information on the number of fixed differences in the region between the study population and a close species will not aid in point estimation for selection. Such a method would add exactly one data point and one parameter (time since divergence), which can be scaled to fit that point exactly. A method that uses two classes of sites (e.g., silent and replacement) and fixed differences circumvents this problem and will add useful information for the estimation of selection. We have developed a method that uses divergence data and the sitefrequency spectrum to estimate selection. Likewise, information on variation in the sitefrequency spectrum among different genomic regions also holds great potential for refining point estimation for single regions of interest. We have also developed a method that uses Markov chain Monte Carlo to perform Gibbs and Metropolis sampling for a hierarchical model of this form. We will expand upon these two methods in subsequent publications.
Acknowledgments
We thank Richard Lewontin, Rasmus Nielsen, and Steve Palumbi for many helpful comments. This work was supported by a Howard Hughes Medical Institute Graduate Fellowship to C.D.B. and National Science Foundation grant DMS9707045 to S.A.S.
Footnotes

Communicating editor: N. Takahata
 Received April 24, 2001.
 Accepted August 20, 2001.
 Copyright © 2001 by the Genetics Society of America