## Abstract

We live in an age in which our ability to collect large amounts of genome-wide genetic variation data offers the promise of providing the key to the understanding and treatment of genetic diseases. Over the next few years this effort will be spearheaded by so-called *next-generation* sequencing technologies, which provide vast amounts of short-read sequence data at relatively low cost. This technology is often used to detect unknown variation in regions that have been linked with a given disease or phenotype. However, error rates are significant, leading to some nontrivial issues when it comes to interpreting the data. In this article, we present a method with which to address questions of widespread interest: calling variants and estimating the population mutation rate. We show performance of the method using simulation studies before applying our approach to an analysis of data from the 1000 Genomes project.

WE live in an era in which technology drives scientific discovery. This is very clearly demonstrated in molecular genetics, in which the development of technologies such as gene expression arrays and so-called SNP chips, to name just two examples, has led to a vast increase in the amount of genetic data at our disposal. Arguably, at this time our ability to interpret such data lags somewhat behind the technology. One of the most recent and exciting technologies to be placed at our disposal is next-generation sequence (NGS) platforms, in which enormous quantities of sequence data can be collected at reasonably low cost. In this article we focus on two applications of this technology: estimation of mutation rate and polymorphism detection.

Our focus is partly motivated by a common use for NGS data in genome-wide association studies (GWAS). While GWAS have now identified a large number of loci at which polymorphism is associated with disease phenotypes, the overall amount of variance explained by these polymorphisms is low (*e.g.*, Manolio *et al.* 2009; Frazer *et al.* 2009). One explanation for the remaining variance, so-called *dark matter*, is the possible existence of *rare variants*, polymorphic loci at which the minor allele has low frequency. With that in mind, in addition to the wish to answer general population genetics questions, such as estimation of mutation rate, the investigator will often collect NGS data to identify polymorphic loci at which to collect data in a follow-up study using a custom SNP array, say. This article presents a computationally tractable method for estimating mutation rate and calling genotypes (and thereby detecting polymorphism).

NGS technology comes in a number of forms. Chief among these are the Illumina (San Diego, CA) Genome Analyzer, the 454 Genome Sequencers (Roche Applied Science, Basel, Switzerland), the SOLiD platform (Applied Biosystems, Foster City, CA), the Polonator (Dover Systems, Salem, NH), and the HeliScope Single Molecule Sequencer technology (Helicos, Cambridge, MA). While these methods differ in the details (see Shendure and Ji 2008 for a review), the data that are obtained have similar broad features. Essentially one obtains relatively short regions of sequenced DNA, so-called *reads*, either singly or in pairs drawn from opposite ends of a larger fragment of DNA. Typically these fragments are sequenced and then aligned using one of a number of possible algorithms [*e.g.*, Mapping and Assembly with Quality (MAQ) (H. Li *et al.* 2008) or Short Oligonucleotide Alignment Program (SOAP) (R. Li *et al.* 2008)], but several key features are constant across platforms. First, the reads are short (35–400 bp, depending upon platform); second, for that reason, alignment to a reference sequence is challenging; third, genotyping error rates vary along the read, typically increasing as we move along the read (modulo some variation from that trend that may exist at the beginning of the reads). In this article we do not focus on the issue of alignment—our method is designed to be applied to reads postalignment.

While these technologies are new, a number of approaches already exist for the analysis of the resulting data. In the context of estimating mutation rate, the first method was that of Hellmann *et al.* (2008). While their method was developed for shotgun-sequencing data, in which error rates are lower, it can nonetheless be applied to NGS data, albeit at some loss of performance. A similar approach was taken by Jiang *et al.* (2008), where robustness to issues such as genotyping errors or biased amplification was examined more explicitly. Furthermore, a wide variety of methods drawn from related topics also exist. Examples of this range from the extremely elegant and simple estimator due to Watterson (1975), to the more computationally intense methods of Griffiths and Tavaré (1994) and Kuhner *et al.* (1995). However, none of these methods were developed for NGS data, and, for example, they fail to allow for the possibility of genotyping error. Methods for estimating mutation rate in the presence of genotyping errors do exist, for example, approaches based upon considering nonsingleton variants (Knudsen and Miyamoto 2009), but these do not exploit the particular properties of NGS data.

In this article we use the *coalescent* as a model for genotype data for a sample of individuals drawn from a population. The coalescent was first formalized by Kingman (1982 a,b,c) and has become the most widely used model for population genetics data. For accessible introductions see Wakeley (2008) or Hein *et al.* (2005). Several algorithms now exist for detection of polymorphic sites for NGS data. Li and Leal (2009) developed a Bayesian method for computing individual genotype likelihood values from NGS data. There are also approaches that combine the resequenced data of the samples for better SNP calling. For example, Bansal *et al.* (2010) used a method containing a population error correction term to avoid systematic sequencing errors. Such methods were used in the 1000 Genomes Project Consortium (2010). After giving methodological details of our approach we demonstrate performance via a series of simulation studies before applying it to data from the 1000 Genomes Project and comparing to results from two popular algorithms: samtools and GATK.

## Methods

### Overview of the E–M algorithm

We assume we have read data that have been aligned to a reference sequence and, in the simplest form of our algorithm, that we have known (or estimated) position-specific error rates. Our goal is to compute individual genotype likelihoods for a sample of size *n*, *i.e.*, the probability of the genotype for each site for each member of the sample. For computational tractability our method ignores information due to linkage disequilibrium (see *Discussion*).

Our method is an E–M algorithm that explores possible values for Θ and the genotype data while producing the maximum-likelihood estimate of the population mutation rate 4*N*_{e}μ = Θ, where *N*_{e} is the effective population size and μ is the mutation rate per base per individual. We initialize using an arbitrarily chosen value of Θ = 0.001. In outline the algorithm proceeds as follows:

Expectation step: Conditional on the current Θ-value and the observed read data, we calculate the expected number of polymorphic sites,

*E*(*S*).Maximization step: Update Θ on the basis of

*E*(*S*).

In analyses in which we also wish to infer error rates, those error rates are updated in step 2, along with Θ.

We iterate the above steps 100 times and then, using the final estimated Θ, the posterior probability of being polymorphic is computed for each site. For each site that is predicted to be polymorphic, the genotype and output, for each individual in the sample, are inferred. While our method is computationally intense, for the examples given in this article, in which we simulate 20 diploid samples over a 100-kb region, the run time is short: 1 min for estimation of Θ or 30 min with coestimation of Θ and error rates. Run time is approximately linear with the length of region being analyzed.

We now give more details on each of the above steps.

### Calculation of the likelihood

We begin by deriving the likelihood of Θ conditional on the read data. Let ℛ = {*R*_{1}, *R*_{2}, …, *R _{n}*} be the set of read data across all sites (

*i.e.*, all base positions) over all samples and let

*G*denote the unobserved genotypes across all sites. Using Bayes’ theorem, the probability of Θ given the read data for a sample is

By ignoring linkage disequilibrium we can compute a composite likelihood (cf. Hudson 2001) by multiplying the likelihoods at each site:*i* indexes sites, *g* denotes the sample genotypes at a given site, and Prob(*R _{i}*|

*g*) is set to 1 if

*R*does not overlap

_{i}*g*.

Calculation of Equation 3 requires derivation of the terms Prob(*R _{i}* |

*g*) and Prob(

*g*| Θ). We begin by deriving Prob(

*R*|

_{i}*g*).

Let **r** be the set of read data at a given position in the genome for a given individual, *r _{k}* refer to the

*k*th element of

**r**, and

*e*(

*r*) be the error rate of the base called at this position for

_{k}*r*. This can be estimated using quality scores, or from external estimates of read accuracy. While it is generally the case that error rate can be reasonably assumed to vary across positions within a read, but to be constant across reads at each position, we do not make this restriction here. Let

_{k}*a*

_{0}be the base at this position in the reference sequence and

*a*

_{1}denote a possible variant type. We now compute the individual genotype likelihoods for each possible genotype. The possible genotypes can be categorized as homologous consensus/consensus, heterozygous consensus/variant, and homologous variant/variant. To improve computational efficiency we assume at most a single mutation at each position in the ancestry of the sample, as is the case with the so-called

*infinite sites*mode, for example. Thus, we write

Next we derive Prob(*g* | Θ). The prior probabilities for each genotype are calculated from the expected allele frequency spectrum under the coalescent model with constant population size or expanding population size (as appropriate). The joint prior probabilities of each genotype are then approximated from the expected allele frequency spectrum (Griffiths and Tavaré 1998).*g*(*m*) represents a genotype with *m* mutant alleles across the sample.

### Estimation of Θ

The value of Θ maximizing the likelihood is derived using an E–M algorithm. The derivative of the likelihood function (3) is

### Unknown error rates

In its simplest form, our method requires the error rates of the resequencing process as input. While these will generally be available (via quality scores for example) we note that, if not, our method can coestimate either a single overall error rate *E** or a vector *E* of position-specific error rates, while estimating Θ and calling polymorphism. In this case we model the error rates as growing exponentially along the read, which reflects the generally observed behavior. For example, the likelihood that (3) is rewritten as*E* = {*E*_{1}, *E*_{2}, ..., *E _{L}*} (say), can be estimated within the E–M algorithm

*g*at site

*i*in the resequencing position

*k*and

*n*

_{i}_{,}

*is the total number of reads at site*

_{k}*i*. λ(

*g*,

*i*) is the posterior probability of genotype

*g*at site

*i*:

To lessen the computational burden, we restrict this calculation to sites that are currently called nonpolymorphic:

### Output of polymorphic sites

At the conclusion of our algorithm, we calculate the posterior probability of polymorphism at each site by conditioning on the final estimated Θ-value and computing the posterior probability of polymorphism for each site as

Since all sites will have a nonzero probability of being polymorphic, but for most sites that probability will likely be small, we adopt the following heuristic. We use a predetermined threshold *T*. If the posterior probability of polymorphism at a site exceeds *T*, that site is reported as a candidate polymorphic site. In the examples in this article we use *T* = 0.9. For sites that are predicted to be polymorphic, the genotype with the highest posterior probability is then output for each individual.

## Results

### Simulation of data

For all analyses in the article that are based upon simulated haplotypes, we simulated over a 100-kb region using Hudson’s coalescent simulator ms (Hudson 2002) under a constant population-size model. We assume sequencing occurs on an individual level. Twenty diploid genomes were formed by random pairing of haplotypes. We then generated paired-end NGS data, with reads occurring as a Poisson process, each end consisting of a read of 35 bases and the distance between ends being Normally distributed with mean 500 and standard deviation of 20. Unless otherwise specified the NGS data were generated with an overall average error rate of 0.005, but with error rates increasing as the read is traversed. Specifically, the error rate at read position *i* was assumed to be

### Estimation of Θ from simulated data

From the simulated data set we estimate Θ for a range of coverages per diploid and/or a range of sample sizes *n*. For each combination of parameters (mutation rate, Θ, recombination rate, ρ, coverage, *X*, and diploid sample size, *n*) we simulated 100 data sets. We assumed that the position-specific error rates were known. Table 1 shows results for a variety of values of ρ and *X*.

In the top half of Table 1 Θ is fixed at 0.001 and ρ varies, while in the bottom half Θ varies while ρ is fixed at 10^{−4}. We show the average of the estimated Θ-values, *R*_{t}, defined as the root mean squared error (RMSE) of the ratio of our estimate of Θ compared to the true Θ value (we normalized by dividing by Θ to ease interpretation). For benchmark purposes, we also compare our estimate of Θ, based upon the resequencing data, to that obtained by applying Watterson’s estimator Watterson (1975) to the true (and, in fact, unobserved) haplotype data. For this purpose we report (*R*_{w}), the root mean squared error of the ratio of Watterson’s estimate of Θ compared to the true Θ. *R*_{w} represents a good estimate of the limit of what is attainable due to the stochastic noise within the coalescent process itself (since Watterson’s estimate is known to be a good estimator of Θ, and we apply it here to the unobserved true genotypes).

The estimates of Θ are close to the true value of Θ for all combinations of parameters, even at 1× coverage. As expected, RMSE values are bigger when coverage is low. We also see that Θ is slightly overestimated for lower recombination rates, which is a reflection of the increased variance of our, or any other estimator of Θ as LD in the data increases.

In the lower part of Table 1 we show results for a range of Θ-values with ρ fixed. Once again the estimator appears to perform very well in all cases.

Next, in Table 2, we show results as a function of sample size for scenarios in which Θ = 0.001 and ρ = 10^{−4}. Performance is robust to change of sample size *n*, with RMSEs decreasing as *n* increases, as is expected.

### Detection of polymorphic sites

In Figure 1 we show the false-positive rate, FP, and false-negative rate, FN, as a function of coverage. For a range of Θ− and ρ-values we show the percentage of sites called as segregating that are not in fact polymorphic, FP (*y*-axis), against coverage *X* (*x*-axis). We also show plots for the FN rate. We see that both the FP and FN rates are reduced as coverages increases, as would be expected. Both FP and FN are robust to changes in recombination rate, but both are sensitive to changes in mutation rate. The power to detect polymorphic sites is affected by minor allele frequency (MAF). In Figure 2, we see that our method results in good detection rates for common variants (MAF > 0.05) given reasonable coverage (say 4×), but that, unsurprisingly, many rarer variants are missed, especially with lower coverages.

Given that our method is based on the constant population size model, it is important to explore robustness to departures from that assumption. We focus on situations in which, as is common, population size is growing. Two data sets were simulated under an exponentially growing population size, where the population size at time *t* units back into the past, *N*(*t*), is modeled as *N*(*t*) = *N*_{0}*e*^{–}* ^{rt}*, where

*N*

_{0}is the current population size and

*r*is the exponential growth rate. Table 3 shows the false-negative rates of polymorphic site calling for three different growth rates. Since exponential growth leads to there being more rare variants, variation on average become harder to detect.

### Effects of resequencing error rate

We also explored the effect on performance of changes to the overall base-calling error rate and to (read) position-specific base-calling error rates. Assuming exponentially distributed position-specific error rates, per Equation 7, we first explore changes to the overall rate, while setting β = 0.1 (both are assumed known). Table 4 shows that performance is robust to this change. However, the false-positive and -negative rates do grow higher as the overall error rate increases, as might be expected (Figure 3).

Second, we assumed the overall error rate is fixed, but changed β, reflecting varying degrees of degradation of calling accuracy along the reads. Once again our method is robust to this change (Table 5) but again the FP and FN rates are affected (Figure 3).

It is also of interest to see how performance is affected if incorrect error rate estimates are used during the analysis. This introduces biases to the estimated Θ-values. When the actual error rate is lower than the assumed error rate, Θ is underestimated, and when the actual error rate is higher than assumed, Θ is typically overestimated. These biases decreased as the level of coverage increases (Table 6).

It is also possible to jointly estimate (i) average error rate, *E*_{t}, and associated exponential parameter β, and (ii) Θ. Table 7 shows results for this case. For all coverage levels, the estimated values are close to the true values, and accuracy of polymorphic site detection is much the same as for the analysis that assumes the true error rate (data not shown).

### Comparison with other polymorphic site-detection methods

We next use simulated data to compare performance of our method, when focused on polymorphism detection, to that of other detection tools (samtools version 0.1.12) (Li and Leal 2009), and GATK Unified Genotyper (v. 1.0) (Mckenna *et al.* 2010). To do this we created four 100-kb data sets, each containing 40 diploid samples, using four different mutation parameters (Θ = 0.00001, 0.0001. 0.0002, and 0.0005). This resulted in data with 52 (data set 1), 531 (data set 2), 915 (data set 3), and 2482 polymorphic sites (data set 4), respectively. We then simulated 100 replicate sets of read data at three different coverages (1×, 4×, 8×). The simulated reads were then mapped with BWA (Burrows-Wheeler Alignment Tool) (Li and Durbin 2009). We then analyzed each of these data sets using our method, samtools, and GATK using default parameter values for the latter two methods. Figure 4 shows a summary of results over the 100 replicates. Results for our method are a function of the threshold, *T*; for calling a polymorphism, this results in a curve of results, as a function of *T*, compared to the single datapoint for the other two approaches. For all data sets and coverages, our method shows good polymorphic site detection rate.

### Analysis of 1000 Genomes Project data

We also applied our method to a 10-Mb region (16 Mb ∼26 Mb) on chromosome 21 from the 1000 Genomes Project data. We use only the Illumina data; for the purpose of simplicity, data from other platforms are ignored. Aligned NGS data from 57 African samples (Yoruba), 51 European samples [CEPH (Centre d’Etude du Polymorphisme Humain)], and 58 Asian samples [CHB (Chinese from Beijing) + JPN (Japanese)] were used. The analyzed data are from the Low Coverage Pilot project, in which the whole genome was sequenced at low coverage (the average coverage in the analyzed region is 3.4×). To improve computational efficiency and allow for local variation of population mutation rate, we divided the 10-Mb region into 100-kb subregions (with no overlap), and analyzed each separately. For each subregion, Θ-values were estimated and polymorphic sites and MAFs were predicted. We analyzed data from each population separately. For quality control, when applying our genotyping algorithm, reads with more than two mismatches to consensus sequence or lower mapping quality score (< 30) were not included in the analysis. The sequencing error rate per single read per site was inferred from the quality score, as in Cock *et al.* (2010). To improve comparability between results from our method, and those from 1000 Genomes, we use a probability threshold of *T* = 0.9, as is commonly used in similar analyses of 1000 Genomes Data (cf. Li *et al.* 2011). Results are shown in Figure 5.

As expected, the estimated Θ from the African data are bigger than from other populations. The average values over the region for the African, European, and Asian populations are 0.00112, 0.000742, and 0.000650, respectively. The local variation of estimated Θ-values between populations is highly correlated. The non-African populations show highest correlation (*R*^{2} = 0.727). The correlation between African and non-African populations is also high (African-European *R*^{2} = 0.560 and African-Asian *R*^{2} = 0.489) but some regions show differences. (17.9–18.1 Mb, 21.6–21.7 Mb, 24.1–24.4 Mb, and 25.3–25.5 Mb). These lower estimates for the non-African populations may be evidence of local selection, but might also reflect systematic differences in data production across centers, or simply noise.

Figure 6 shows a summary of our polymorphic site calls. First we show the number of polymorphic sites detected. Second, we calculate and report the expectation of the minor allele frequency, using the posterior genotype probabilities at all sites that are called polymorphic. More polymorphic sites are inferred from the African population than from the Asian or European populations. The mean MAF at polymorphic sites from the African, Asian, and European populations are about 0.145, 0.189, and 0.169, respectively. This is bigger than the expected MAF derived from even a constant population size model (≈ 0.053). This reflects an ascertainment bias due to the diffculting of detecting rare variants. The average MAF of population-private polymorphic sites is 0.048 and is much lower than polymorphic sites common to all three populations (0.22), again as would be expected (assuming that private polymorphisms are likely to have arisen more recently).

For the Yoruban population we compared our polymorphic site calls to those in dbSNP (version 129) and to calls from the 1000 Genomes Project (Figure 7). A comparison between estimated MAFs of novel and known variants shows that the polymorphic sites reported in dbSNP are relatively common variants. About 82% of variable sites from the 1000 genome study are inferred as polymorphic by our method. Our method infers 2410 more known sites than the 1000 Genomes Project. As a quality check of polymorphic site calling, we calculated the transition/transversion ratios (ts/tv) using the predicted genotypes at sites that were detected as polymorphic. The overall ts/tv ratio using calls from our algorithm is 2.28, but the ts/tv ratio of novel sites is significantly lower (1.83), suggesting a greater number of false positives at novel sites.

Our method can simultaneously estimate resequencing error rates. We therefore estimated error rates for the 1000 Genome Project data. The error rate of each read position is estimated from data for the Yoruban population, using a constant-sized population model. Results are shown in Figure 8. Our estimates are highly correlated with the error rate reported in 1000 Geomes; however, our estimated error rates show a systematic tendency to be lower than those derived during the 1000 Genomes analysis. This may reflect differences in the preprocessing of the read data. For example, in our analysis all reads with two or more mismatches were excluded.

## Discussion

In this article we have presented a method for both (i) estimation of overall mutation rate Θ, and (ii) calling polymorphic sites, within a region for which NGS data have been collected. The posterior probability of a site being polymorphic is computed for each site, along with an estimate of Θ. If position-specific error rates are not known, or the accuracy of estimates of these are in doubt, our method can also estimate those error rates within its likelihood framework. However, in general, estimation of error rates is unlikely to be needed. Our method appears to work well, in part at least because it exploits the coalescent model to leverage power when genotyping jointly across multiple individuals.

There are at least three sources of information that can be used when calling genotypes at a given locus, *L*, for a given individual, *I*, (say):

The read data for that individual at that position;

the data for other individuals at that position; and

the data at other positions.

Each of these three sources could be used to inform either genotype calls or determination of whether the locus is polymorphic. When interpreting the first source of information, the key question to address is “Do the data result from genuine heterozygosity within that individual, or are they a result of base-calling errors from the sequencing platform?” A variety of methods exists for dealing with this issue, such as those compared in this article. Intuitively speaking, the likelihood of interpreting a set of read data at *L* for *I* as evidence for heterozygosity should increase if polymorphism at that locus is also observed in other individuals. (Once we have seen a variant allele at *L* we are more likely to see it at that location in subsequent individuals.) This information is exploited by our algorithm. Patterns of linkage disequilibrium (LD) are also informative regarding genotyping (and hence calling of polymorphism). Loosely speaking, we should be more likely to call a polymorphism at *L* if the data support a pattern of polymorphism at *L* that would be in LD with other nearby loci. Detailed modeling of patterns of LD is a complex issue, generally implying significant computational complexity.

It is also worth noting that LD information will not greatly help to detect a polymorphism that has not already been observed in an existing sample. For these reasons we omit it from the present method, but our results show that our method performs well despite making this simplification (a choice that ensures computational tractability). Nonetheless, it has been successfully exploited using a number of different underlying probabilistic models, in algorithms such as fastPHASE and Li and Stephens’ PAC method) (Li and Stephens 2003; Scheet and Stephens 2006). However, applications to SNP data are of much lower dimension than to NGS data, and it is not clear how tractable those methods might remain when applied to NGS data, where we are looking at many millions of bases. Neither will it be straightforward to allow for the specific error structure of read data, in which, broadly speaking, error rates increase as we move along the read. It is reasonable to think that algorithms that eventually also include this third source of information will result in a further improvement in performance over this method, but the computational challenges will be significant. Our method exploits the first two sources of information and already requires significant computational resources, even though it does not use LD information for genotyping. However, we have shown that it results in good ability to accurately infer mutation rates and to call genotypes.

In this article, our method was applied to simulated data in which we assumed that reads had been aligned without error. This is not, of course, the case with real data (*e.g.*, our analysis of the 1000 Genomes data) or for the analysis in Figure 4, in which read alignment algorithms were applied to simulated read data. Our method does work well in these more realistic scenarios, but it is important to note that the read alignment process can be expected to induce biases in estimation of mutation rate and/or polymorphism calling, for example, in regions in which alignment is difficult (*e.g.*, near indels). Since most alignment approaches discard reads that have a nontrivial number of mismatches, a downward bias in the estimation of Θ can be expected to occur (since some of these mismatches presumably represent true polymorphism.)

Software for implementation of our method is available upon request from the authors. We note that, as well as outputting most likely genotype calls for loci that are determined to be polymorphic, the algorithm can be used to output posterior genotype probabilities at those loci. This would allow for explicit incorporation of uncertainty in genotype calls to be incorporated in subsequent analyses, such as genome-wide association studies.

## Acknowledgments

The authors gratefully acknowledge funding from National Institutes of Health award 3R01 MH084678 and the helpful comments of two reviewers.

- Received May 20, 2011.
- Accepted July 14, 2011.

- Copyright © 2011 by the Genetics Society of America