## Abstract

A new generation of high-throughput sequencing strategies will soon lead to the acquisition of high-coverage genomic profiles of hundreds to thousands of individuals within species, generating unprecedented levels of information on the frequencies of nucleotides segregating at individual sites. However, because these new technologies are error prone and yield uneven coverage of alleles in diploid individuals, they also introduce the need for novel methods for analyzing the raw read data. A maximum-likelihood method for the estimation of allele frequencies is developed, eliminating both the need to arbitrarily discard individuals with low coverage and the requirement for an extrinsic measure of the sequence error rate. The resultant estimates are nearly unbiased with asymptotically minimal sampling variance, thereby defining the limits to our ability to estimate population-genetic parameters and providing a logical basis for the optimal design of population-genomic surveys.

Very soon, the field of population genomics will be overwhelmed with enormous data sets, including those generated by the “1000-genome” projects being planned for human, Drosophila, and Arabidopsis. With surveys of this magnitude, it should be possible to go well beyond summary measures of nucleotide diversity for large genomic regions to refined estimates of allele-frequency distributions at specific nucleotide sites. Such information will provide a deeper understanding of the mechanisms of evolution operating at the genomic level, as the properties of site-frequency distributions are functions of the relative power of mutation, drift, selection, and historical demographic events (which transiently modify the power of drift) (Kimura 1983; Ewens 2004; Keightley and Eyre-Walker 2007; McVean 2007). Accurate estimates of allele frequencies are also central to association studies for QTL mapping (Lynch and Walsh 1998), to the development of databases for forensic analysis (Weir 1998), and to the ascertainment of pairwise individual relationships (Weir *et al.* 2006).

If the promise of high-throughput population-genomic data is to be fulfilled, a number of analytical challenges will need to be met. High-throughput sequencing strategies lead to the unbalanced sampling of parental alleles (in nonselfing diploid species) and also introduce errors that can result in uncertainty about individual genotypic states (Johnson and Slatkin 2007; Hellmann *et al.* 2008; Lynch 2008; Jiang *et al.* 2009). Unless they are statistically accounted for, both types of problems will lead to elevated estimates of low-frequency alleles, especially singletons, potentially leading to exaggerated conclusions about purifying selection. Such aberrations are of concern because, for economic reasons, many plans for surveying large numbers of genomes envision the pursuit of relatively low (∼2×) depth-of-coverage sequence for each individual, a sampling design that will magnify the variance of parental-allele sampling, while also rendering the error-correction problem particularly difficult (Lynch 2008).

Simple *ad hoc* strategies for dealing with light-coverage data sets include the restriction of site-specific analyses to individuals that by chance have adequate coverage for a high probability of biallelic sampling (*e.g.*, Jiang *et al.* 2009). However, such an approach can result in the discarding of large amounts of data, increasing the lower bound on the frequency of alleles that can be determined (as a result of the reduction in sample size). In addition, although methods exist for flagging and discarding low-quality reads (Ewing and Green 1998; Ewing *et al.* 1998; Huse *et al.* 2007), such treatment is unable to eliminate other sources of error. With low-coverage data, singleton reads cannot be discarded without inducing significant bias, as such treatment will also lead to the censoring of true heterozygotes with unbalanced parental-allele sampling.

Thus, there is a clear need for novel methods for efficiently estimating nucleotide frequencies at individual sites, not only to maximize the information that can be extracted from population-level samples, but also to guide in the design of such surveys. In the latter context, questions remain as to the relative merits of increasing the number of individuals sampled *vs.* the depth of sequence coverage per individual, *e.g.*, a 1000-genome project involving 2× coverage per individual, as opposed to a 500-genome project with 4× coverage, etc. Because many genetic analyses rely on low-frequency markers (SNPs), there is also a need to know the limits to our ability to estimate the frequencies of rare alleles, particularly in the face of error rates of comparable or even higher levels.

As a first step toward addressing these issues, a maximum-likelihood (ML) method is developed for the estimation of allele frequencies at individual nucleotide sites. This method is shown to behave optimally in that the estimates are nearly unbiased with sampling variances asymptotically approaching the expectation under pure genotypic sampling at a high depth of sequence coverage. Moreover, the proposed approach eliminates the need for an extrinsic measure of the read-error rate, using the data themselves to remove this nuisance parameter, thereby reducing potential biases resulting from site-specific variation in error-generating processes.

## THEORY AND ANALYSIS

Throughout, we assume that there are no more than two actual nucleotides segregating per site, a situation that experience has shown to be almost universally true at the population level [average nucleotide heterozygosity in most diploid species is generally well below 0.1, even at neutral sites (Lynch 2007)]. It is also assumed that prior to analysis the investigator has properly assembled the sequence reads to ensure that paralogous regions (including mobile elements and copy-number variants) that might aggregate to the same sites have been removed from the analysis. This is, of course, a consideration for all methods of sequence analysis, but with high-coverage projects involving small sequence reads, it is more manageable because unusually high depths of coverage can flag problematical regions.

For the classical situation in which the genotypes of *N* individuals have been determined without error, assuming a population in Hardy–Weinberg equilibrium, the likelihood that the frequency of the leading allele (denoted as 1) is equal to *p* is simply

*L*(*p*) = *K* (*p*^{2})^{*N*_{11}} [2*p*(1 − *p*)]^{*N*_{12}} [(1 − *p*)^{2}]^{*N*_{22}},  (1)

where *N*_{11}, *N*_{12}, and *N*_{22} are the numbers of times the three genotypes appear in the sample, and *K* is a constant that accounts for the multiplicity of orderings of individuals within the sample (which has no bearing on the likelihood analysis). The ML estimate of *p*, which maximizes *L*(*p*), has the well-known analytical solution *p̂* = (2*N*_{11} + *N*_{12})/(2*N*) (which also applies in the face of Hardy–Weinberg deviations) (Weir 1996).
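The closed-form solution can be sketched in a few lines of Python (a minimal illustration of my own, not code from the article; the function names are assumptions):

```python
from math import log

def ml_allele_freq(n11, n12, n22):
    """Closed-form ML estimate of the leading-allele frequency from
    error-free genotype counts: p-hat = (2*N11 + N12) / (2*N)."""
    n = n11 + n12 + n22
    return (2 * n11 + n12) / (2 * n)

def log_likelihood(p, n11, n12, n22):
    """Log of Equation 1, dropping the constant K."""
    return (n11 * log(p * p)
            + n12 * log(2 * p * (1 - p))
            + n22 * log((1 - p) ** 2))

# 50 major homozygotes, 40 heterozygotes, 10 minor homozygotes:
# p-hat = (2*50 + 40) / (2*100) = 0.7
```

Any grid or iterative search over *p* should recover the same value, since the log likelihood is unimodal in *p*.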

Unfortunately, uncertainty in the identity of individual genotypes, resulting from incomplete sampling and erroneous reads, substantially complicates the situation in a random-sequencing project, as there are no longer just three categories of individuals. Rather, each site in each individual will be characterized by a specific read-frequency array (the number of reads for A, C, G, and T). This necessitates that the full likelihood be broken up into two components, the first associated with the sampling of reads within individuals and the second associated with the sampling of individual genotypes (as above). Assuming no more than two alleles per site, the data can then be further condensed as the analysis proceeds.

For example, letting 1 and 2 denote the putative major and minor nucleotides at a site in the population (with respective frequencies above and below 0.5) and 3 denote the error bin containing incorrect reads to the remaining two nucleotides, for any candidate major/minor nucleotide combination, each individual surveyed at the site can be represented by an array (*n*_{1}, *n*_{2}, *n*_{3}), where the three entries denote the numbers of reads at this site that coincide with the candidate major nucleotide, the minor nucleotide, and an erroneous read. Given the observed reads at the site for *N* individuals, our goal is to identify the major/minor nucleotides at the site and to estimate their frequencies. There are 12 possible arrangements of major and minor nucleotides, A/C, A/G, … , G/T, and the likelihoods of each of these alternatives must be considered sequentially.

We again start with the assumption of a population in Hardy–Weinberg equilibrium and further assume a homogeneous error distribution, such that each nucleotide is equally likely to be recorded erroneously as any other nucleotide with probability ε/3 (the total error rate per site being ε). The composite-likelihood function for a particular read configuration is the sum of three terms, each the product of the probability of a particular genotype and the probability of the read configuration conditional on being that genotype,

*P*(*n*_{1}, *n*_{2}, *n*_{3}) = *p*^{2}ϕ_{e}(*n*_{2}, *n*_{3}; *n*, ε/3, 2ε/3) + (1 − *p*)^{2}ϕ_{e}(*n*_{1}, *n*_{3}; *n*, ε/3, 2ε/3) + 2*p*(1 − *p*)ϕ_{e}(*n*_{3}; *n*, 2ε/3)*p*(*n*_{1}; *n*_{1} + *n*_{2}, 0.5).  (2)

The first two terms in this expression denote the contributions to the likelihood assuming the genotype is homozygous for the major or minor alleles, with the functions ϕ_{e}(*x*, *y*; *n*, ε/3, 2ε/3) denoting joint probabilities of *x* errors to the alternative allele and *y* errors to the remaining two nonexistent alleles, assumed here to follow a trinomial distribution with *n* = *n*_{1} + *n*_{2} + *n*_{3}. The third term accounts for heterozygotes, with ϕ_{e}(*n*_{3}; *n*, 2ε/3) being the probability of *n*_{3} errors to non-1/2 nucleotides, with the total error rate being reduced by one-third to discount internal (unobservable) errors, and *p*(*n*_{1}; *n*_{1} + *n*_{2}, 0.5) being the probability of sampling *n*_{1} copies of the major allele among the remaining (*n* − *n*_{3}) reads consistent with the designated genotype (Lynch 2008).
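The genotype mixture described above can be rendered directly in Python (an illustrative sketch of my own; the function names are assumptions, not the article's):

```python
from math import comb

def trinomial(x, y, n, px, py):
    """Probability of x errors of one type and y of another among n reads."""
    z = n - x - y
    return (comb(n, x) * comb(n - x, y)
            * px ** x * py ** y * (1 - px - py) ** z)

def binom(k, n, q):
    """Binomial point probability of k successes in n trials."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

def read_config_prob(n1, n2, n3, p, eps):
    """Equation 2: probability of read array (n1, n2, n3) at one site,
    given major-allele frequency p and per-read error rate eps."""
    n = n1 + n2 + n3
    hom_major = p ** 2 * trinomial(n2, n3, n, eps / 3, 2 * eps / 3)
    hom_minor = (1 - p) ** 2 * trinomial(n1, n3, n, eps / 3, 2 * eps / 3)
    het = (2 * p * (1 - p)
           * binom(n3, n, 2 * eps / 3)    # errors to the non-1/2 bases
           * binom(n1, n1 + n2, 0.5))     # parental-allele sampling
    return hom_major + hom_minor + het
```

Summed over all read arrays with *n*_{1} + *n*_{2} + *n*_{3} = *n*, these probabilities total 1, which provides a useful consistency check on any implementation.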

It is informative to start with the simple situation in which each individual is sequenced to the same depth of coverage (*n*), in which case the log likelihood for the entire data set, conditional on a particular candidate major/minor nucleotide pair, major-allele frequency (*p*), and error rate (ε), is

ln *L*(*p*, ε) = Σ *N*(*n*_{1}, *n*_{2}, *n*_{3}) ln *P*(*n*_{1}, *n*_{2}, *n*_{3}),  (3)

where the unit of analysis is *N*(*n*_{1}, *n*_{2}, *n*_{3}), the number of individuals sampled with read configuration (*n*_{1}, *n*_{2}, *n*_{3}), and the summation is over all observed configurations. For sites with *n*× coverage, there are 1 + [*n*(*n* + 3)/2] possible read-array configurations for each major/minor nucleotide combination satisfying *n* = *n*_{1} + *n*_{2} + *n*_{3}. To obtain the ML solution, this expression needs to be evaluated over all possible combinations of major and minor nucleotides to determine the joint combination of major/minor allele identities, major-allele frequency (*p*), and error rate (ε) that maximizes *L*.
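The quoted configuration count is the stars-and-bars identity C(*n* + 2, 2) = 1 + *n*(*n* + 3)/2, which is easy to verify by enumeration (a trivial check of my own):

```python
def n_configs(n):
    """Number of read arrays (n1, n2, n3) with n1 + n2 + n3 = n,
    per the closed form 1 + n(n + 3)/2 = C(n + 2, 2)."""
    return 1 + n * (n + 3) // 2

def enumerate_configs(n):
    """Explicit enumeration of all such arrays, for comparison."""
    return [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]

# n = 4 gives 15 configurations, from (4, 0, 0) down to (0, 0, 4)
```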

To evaluate the behavior of Equations 2 and 3, stochastic simulations were performed by drawing random samples of *N* individuals from populations in Hardy–Weinberg equilibrium, with a constant sequence coverage per site *n*. Errors were randomly assigned to each read with probability ε, such that each nucleotide had a probability of being erroneously recorded as any other with probability ε/3. A relatively high error rate of ε = 0.01 was assumed to evaluate the situation under the worst possible conditions. The ML estimates for each simulation were obtained by thorough grid searches of the full parameter space for *p* and ε for each possible combination of major and minor alleles, for 250 to 500 replications.
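The simulate-then-grid-search procedure can be sketched compactly as follows (a stand-in of my own for the author's code in File S1 and File S2, with assumed function names, a coarse grid, and a fixed major/minor labeling rather than a scan over all 12 nucleotide pairs):

```python
import random
from collections import Counter
from math import comb, log

def config_prob(n1, n2, n3, p, eps):
    """Equation 2: mixture over the three genotypes at one site."""
    n = n1 + n2 + n3
    def tri(x, y):  # x errors to the other allele, y to the error bin
        return (comb(n, x) * comb(n - x, y)
                * (eps / 3) ** x * (2 * eps / 3) ** y * (1 - eps) ** (n - x - y))
    het = (comb(n, n3) * (2 * eps / 3) ** n3 * (1 - 2 * eps / 3) ** (n - n3)
           * comb(n1 + n2, n1) * 0.5 ** (n1 + n2))
    return p * p * tri(n2, n3) + (1 - p) ** 2 * tri(n1, n3) + 2 * p * (1 - p) * het

def simulate_reads(p, eps, n_ind, cov, rng):
    """Draw n_ind Hardy-Weinberg genotypes and cov reads per individual."""
    configs = []
    for _ in range(n_ind):
        alleles = [0 if rng.random() < p else 1 for _ in range(2)]
        counts = [0, 0, 0]  # reads matching major, minor, error bin
        for _ in range(cov):
            base = alleles[rng.randrange(2)]
            if rng.random() < eps:  # error: any of the other three bases
                base = rng.choice([b for b in range(4) if b != base])
            counts[base if base < 2 else 2] += 1
        configs.append(tuple(counts))
    return Counter(configs)

def ml_grid(counts, p_steps=100, eps_steps=40):
    """Grid search maximizing the Equation 3 log likelihood over (p, eps)."""
    best, best_ll = None, -float("inf")
    for i in range(1, p_steps):
        p = i / p_steps
        for j in range(eps_steps):
            eps = j / 1000 + 1e-6
            ll = sum(m * log(config_prob(*c, p, eps)) for c, m in counts.items())
            if ll > best_ll:
                best, best_ll = (p, eps), ll
    return best
```

With, say, *N* = 200 individuals at 4× coverage and ε = 0.01, the recovered *p̂* should fall close to the simulated frequency, consistent with the near-unbiased behavior reported below.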

The results of these analyses indicate that, provided the coverage is >1×, the proposed method yields estimates of *p* that are essentially unbiased, with the variance among replicate samples rapidly approaching the expectation based on individual sampling alone, *p*(1 − *p*)/(2*N*), once the coverage exceeds 4× (Figure 1). Small downward biases in the frequency estimates for rare alleles can arise when the sample size is low enough to cause rare-allele sampling to be extremely sporadic and the coverage is also low enough to cause sporadic sampling of errors. The problem is most severe with sequences with 1× coverage, as there is then no information within individuals on heterozygosity. However, even with 2× coverage, this bias is minor enough to be of little concern in most applications, as when it is likely to be of quantitative importance, it is also overwhelmed by the sampling variance. Moreover, as is demonstrated below, under the usual situation of variable coverage, even 1× sites are less problematic, as they supplement the information from sites with higher coverage.

The remaining issue is the extension of the above approach to sites with variable coverage among individuals. In this case, the absolute probabilities of the various configuration arrays are now functions of the probabilities of the various coverage levels, but the latter influence only the arbitrary constant in the log likelihood, so the approaches outlined above can be used without further modification, other than the summation over all configurations at all coverage levels in Equation 3. Evaluation of this procedure by computer simulation, assuming a Poisson distribution of read numbers across individuals and including 1×-covered sites in the analysis, yielded no substantive differences from the patterns illustrated in Figure 1. The estimator is essentially unbiased over all estimable allele frequencies (minor-allele frequencies >1/2*N*), and the sampling variance of the estimate asymptotically approaches *p*(1 − *p*)/(2*N*) at high coverages.

We may now inquire as to the optimal sampling strategy for accurate allele-frequency estimates under a fixed sequencing budget, which is assumed here to be proportional to *T* = *Nn*, the expected number of bases sequenced per site in the sample (with *N* individuals, each sequenced to average depth of coverage *n*). This effort function assumes that the bulk of the cost of data acquisition involves sequencing rather than the preparation/procurement of individual samples. Low coverage will magnify the problem of sampling variance within individuals, while also resulting in a fraction of individuals that are entirely lacking in observations at individual sites (∼*e*^{−*n*} under a Poisson distribution of coverages). On the other hand, under the constraint of constant *Nn*, increased sequencing depth per individual will result in a reduction in the number of individuals that can be assayed, thereby reducing the ability to detect rare alleles.

Assuming error rates in the range of ε = 0.001–0.01, if allele-frequency estimation is the goal, it appears that there is often no advantage to sequencing above 1× coverage, except for alleles with frequencies on the order of ε, where the sampling variance of *p̂* is minimized at 1–3× coverage (Figure 2). Indeed, under the assumption of constant *Nn* as small as 500, for most allele frequencies, the sampling variance of *p̂* actually continues to decline at coverages well below 1×, provided ε < 0.01, and in no case is there a gain in power above 2× coverage.

## DISCUSSION

In contrast to previous studies of nucleotide variation, in which enough attention could be devoted to individual sequences that genotypes were unambiguous, the new generation of high-throughput sequencing strategies sacrifices quality for enormous quantity. Genotypes are then incompletely penetrant, in that one or both alleles need not be revealed in a subset of individuals, and those alleles that are sampled are sometimes recorded erroneously. The method proposed herein provides a logical and efficient means for estimating allele frequencies in the face of these problems. Although the applications of the model in this article to simulated data involved brute-force searches of the full range of potential parameter space (the code for which is available as supporting information, File S1 and File S2), iterative methods will likely be desirable for large-scale studies involving millions of sites, and some derivations essential to such approaches are provided in the appendix.

Two primary advantages of the proposed method are (1) its use of the full data set, which eliminates the need for arbitrary decisions on cutoffs for adequate coverage, and (2) its ability to internally separate the incidence of sequence errors from true variation, which eliminates the necessity of relying on external measures of the read-error rate that do not incorporate all sources of error. As the error rates assumed in a number of the simulations in this study are presumably near the upper end of what will occur in ultra-high-throughput methods now under development [although likely still too low in the case of ancient DNA samples (Briggs *et al.* 2007; Gilbert *et al.* 2008)], the fact that the estimates are nearly unbiased with asymptotic sampling variances defined by the number of individuals alone suggests that the ML method provides an optimal solution to the ascertainment of frequencies of both common and rare alleles, just as it does in the conventional situation where genotypes are assumed to be known without error (Weir 1996). Moreover, the fact that the proposed approach provides joint estimates of *p* and ε on a site-by-site basis offers a potentially strong advantage in situations where the error rate might be heterogeneous, *e.g.*, dependent on the total sequence context.

With this methodology in hand, it should be possible to ascertain the form of the site-frequency spectrum for broad classes of genomic positions down to a very fine level. Indeed, provided the depth of coverage per site is sufficiently high, the proposed method has the potential to yield accurate allele-frequency estimates even when *p* is smaller than the error rate, as is often the case with disease-associated alleles. Thus, with large enough sample sizes of individuals, there are no significant barriers to estimating allele frequencies on the order of the error rate or below.

The ML allele-frequency estimator extends previous methods for estimating nucleotide heterozygosity (π) from high-throughput data from one or a small number of individuals (Johnson and Slatkin 2007; Hellmann *et al.* 2008; Lynch 2008; Jiang *et al.* 2009). For a given genomic region sequenced from multiple individuals without error, π is generally estimated as 2*p̂*(1 − *p̂*)(2*N*)/(2*N* − 1) averaged over all sites, the terms containing 2*N* correcting for the bias associated with individual sampling, which induces variance in the estimate of *p* equal to *p*(1 − *p*)/(2*N*) (Nei 1978). It can be shown that when sequence errors are present, the sampling variance of *p̂* is elevated to ∼[1 + (4ε/3)^{2}]*p*(1 − *p*)/(2*N*) + ε(1 − 4*p*)^{2}/(3*Nn*), but with small ε and moderate coverage, heterozygosities per nucleotide site will still be adequately estimated as 2*p̂*(1 − *p̂*)(2*N*)/(2*N* − 1).

A number of variants of the model assumed herein can be readily implemented. For example, it is straightforward to allow for non-Hardy–Weinberg conditions. For two alleles 1 and 2, this simply requires the incorporation of two unknown genotype frequencies (*P*_{11} and *P*_{12}), with the third being constrained to be (1 − *P*_{11} − *P*_{12}) and hence the estimation of just one additional unknown (Weir 1996). A simple likelihood-ratio test, contrasting the fit from the full model with that under Hardy–Weinberg assumptions, can then be used to evaluate whether sites are in Hardy–Weinberg equilibrium. By a similar extension, it is possible to test for interpopulation divergence by evaluating whether the likelihood of the full data set for multiple population samples is significantly improved by allowing for population-specific estimates of allele frequencies.
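In the error-free genotype setting of Equation 1, such a likelihood-ratio test reduces to a few lines (an illustrative sketch of my own; with sequencing error one would instead refit Equation 3 with free genotype frequencies):

```python
from math import log

def lrt_hwe(n11, n12, n22):
    """Likelihood-ratio statistic for Hardy-Weinberg equilibrium from
    error-free genotype counts (cf. Equation 1); 1 degree of freedom."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)
    # log likelihood under HWE at the ML allele frequency
    ll0 = (n11 * log(p * p) + n12 * log(2 * p * (1 - p))
           + n22 * log((1 - p) ** 2))
    # full model: genotype frequencies estimated directly from the counts
    ll1 = sum(k * log(k / n) for k in (n11, n12, n22) if k > 0)
    return 2 * (ll1 - ll0)

# Compare the statistic with the 5% chi-square critical value (3.84, 1 d.f.)
```

Counts that exactly match Hardy–Weinberg proportions give a statistic of zero, while a pronounced heterozygote deficit drives it far past the critical value.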

The assumption of a homogeneous site-specific error process can also be relaxed by allowing for unique error rates for any major/minor allele combination. Although there are 12 possible types of single-base substitution errors, for any particular major/minor nucleotide combination (under the two-allele model), a maximum of 4 are relevant to the specific likelihood expression: the reciprocal errors between the major and minor alleles and the movement of each of the two true nucleotides into the error bin. Again, the necessity of these types of embellishments can be evaluated by likelihood-ratio tests.

The overall analysis indicates that unless there is a need for specific information on each individual, from the standpoint of procuring accurate population estimates of allelic frequencies (and other associated parameters), there is little advantage in pursuing high-coverage data per individual. Indeed, given a fixed sequencing budget, an overemphasis on depth of coverage will result in a reduction in the accuracy of population-level parameters, as the optimal design for estimating most allele frequencies employs an average coverage of <1×, even though this magnifies the fraction of individuals with no data. Of course, if the ultimate goal is the procurement of accurate genotypic profiles for individuals (as, for example, in association-mapping studies), there is no substitute for relatively high coverages.

Finally, although the focus of this article has been on the development of an optimally efficient method for extracting allele-frequency estimates from high-throughput sequencing projects, the utility of any specific application will ultimately depend on the quality of the sequence assembly prior to analysis. The complex genomes of multicellular organisms are typically laden with repetitive DNAs, duplicate genes, and mobile elements, which will misassemble to a degree declining with the evolutionary distances between paralogous sequences. Prescreening of the data for regions of exceptionally high depth of coverage, combined with the masking of duplicated DNA (when a reference genome is available), will aid in the elimination of such problematical sequences. An additional potential source of error may be a bias in assembling reads to reference genomes, an issue that is likely to become of diminishing importance with increasing read lengths.

## APPENDIX

Although site-specific allele frequencies and error rates can be determined by evaluating the likelihood function (Equation 3) over the full range of parameter space and searching for the global maximum, numerical methods such as Newton–Raphson iteration (Edwards 1972) may provide a more time-economical solution when large numbers of sites are to be assayed. For any candidate pair of major/minor nucleotides, the ML solution satisfies the two equations obtained by setting the partial derivatives of the likelihood function equal to zero,

∂ ln *L*/∂*p* = Σ (*N*_{123}/*P*_{123}) ∂*P*_{123}/∂*p* = 0

∂ ln *L*/∂ε = Σ (*N*_{123}/*P*_{123}) ∂*P*_{123}/∂ε = 0,

where *N*(*n*_{1}, *n*_{2}, *n*_{3}) and *P*(*n*_{1}, *n*_{2}, *n*_{3}) are abbreviated as *N*_{123} and *P*_{123}, respectively, with *P*_{123}, the likelihood of the data, being defined as Equation 2. For the model presented in the text (two alleles, with homogeneous error rates across all nucleotides),

∂*P*_{123}/∂*p* = 2*p*ϕ_{e1} − 2(1 − *p*)ϕ_{e2} + 2(1 − 2*p*)ϕ_{e3}*p*_{*n*1}

∂*P*_{123}/∂ε = *p*^{2}ϕ_{e1}[(*n*_{2} + *n*_{3})/ε − *n*_{1}/(1 − ε)] + (1 − *p*)^{2}ϕ_{e2}[(*n*_{1} + *n*_{3})/ε − *n*_{2}/(1 − ε)] + 2*p*(1 − *p*)*p*_{*n*1}ϕ_{e3}[*n*_{3}/ε − 2(*n* − *n*_{3})/(3 − 2ε)],

where the terms ϕ_{e1}, ϕ_{e2}, ϕ_{e3}, and *p*_{*n*1}, which are functions of *n*_{1}, *n*_{2}, and *n*_{3}, are notationally abbreviated from their use in Equation 2 by dropping the terms in parentheses.

Estimation of the parameters by iterative methods (as well as approximations of their standard errors from the curvature of the likelihood surface) will generally require the second derivatives of the log likelihood, which follow from a further differentiation of the expressions above.
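The partial derivatives that enter the stationarity conditions can be checked numerically: the sketch below differentiates Equation 2 with respect to *p* and ε analytically and compares each result with a central finite difference (an illustration of my own; the function names are assumptions, not the article's):

```python
from math import comb

def _tri(x, y, n, eps):
    """Trinomial error probability: x errors to the alternative allele,
    y to the two-base error bin, among n reads."""
    return (comb(n, x) * comb(n - x, y)
            * (eps / 3) ** x * (2 * eps / 3) ** y * (1 - eps) ** (n - x - y))

def _het(n1, n2, n3, eps):
    """Heterozygote term: phi_e(n3; n, 2*eps/3) * p(n1; n1 + n2, 0.5)."""
    n = n1 + n2 + n3
    return (comb(n, n3) * (2 * eps / 3) ** n3 * (1 - 2 * eps / 3) ** (n - n3)
            * comb(n1 + n2, n1) * 0.5 ** (n1 + n2))

def config_prob(n1, n2, n3, p, eps):
    """Equation 2: P(n1, n2, n3 | p, eps)."""
    n = n1 + n2 + n3
    return (p * p * _tri(n2, n3, n, eps)
            + (1 - p) ** 2 * _tri(n1, n3, n, eps)
            + 2 * p * (1 - p) * _het(n1, n2, n3, eps))

def dP_dp(n1, n2, n3, p, eps):
    """Analytical dP/dp, from differentiating the genotype weights."""
    n = n1 + n2 + n3
    return (2 * p * _tri(n2, n3, n, eps)
            - 2 * (1 - p) * _tri(n1, n3, n, eps)
            + 2 * (1 - 2 * p) * _het(n1, n2, n3, eps))

def dP_deps(n1, n2, n3, p, eps):
    """Analytical dP/deps, via the log-derivative factor of each genotype term."""
    n = n1 + n2 + n3
    return (p * p * _tri(n2, n3, n, eps) * ((n2 + n3) / eps - n1 / (1 - eps))
            + (1 - p) ** 2 * _tri(n1, n3, n, eps) * ((n1 + n3) / eps - n2 / (1 - eps))
            + 2 * p * (1 - p) * _het(n1, n2, n3, eps)
              * (n3 / eps - 2 * (n - n3) / (3 - 2 * eps)))
```

Because *P*_{123} is quadratic in *p*, the central difference in *p* is exact up to roundoff, making it a particularly sharp check on the first derivative.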

## Acknowledgments

Two anonymous reviewers provided helpful comments on this manuscript. This work was funded by National Science Foundation grant EF-0827411, National Institutes of Health grant GM36827, and MetaCyte funding from the Lilly Foundation.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.100479/DC1.

Communicating editor: R. W. Doerge

- Received January 6, 2009.
- Accepted March 6, 2009.

- Copyright © 2009 by the Genetics Society of America