## Abstract

Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased speed and reduced sequencing costs. However, the use of these sequencing technologies is often challenged by errors and biases associated with the bioinformatical methods used for analyzing the data. In particular, the use of naïve methods to identify polymorphic sites and infer genotypes can inflate downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a method for taking genotype call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the new method herein proposed to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy for a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that employing this new method is useful for investigating the genetic relationships of populations sampled at low coverage.

DETERMINING the level of genetic variation within and between species or populations is necessary to study the effects of mutation, natural selection, and genetic drift. In the past few years, faster and cheaper high-throughput DNA sequencing technologies have provided us with an unprecedented amount of large-scale genetic data. These next-generation sequencing (NGS) technologies are now commonly used in population genetic studies and provide us with the perfect opportunity to investigate the evolutionary forces affecting genetic variation.

Currently, available NGS technologies differ in their protocol design (reviewed in Metzker 2010) but all produce data with similar general features. Briefly, the sequencing output consists of relatively short stretches (*e.g.*, currently about 50–100 bp for Illumina machines) of sequenced DNA, commonly called “reads.” These small segments of DNA are then aligned to a reference genome or assembled into scaffolds in *de novo* assembly when a reference genome is not available.

These technologies have greatly improved sequencing efforts in both model and nonmodel organisms, but they have also introduced new challenges because many of the data sets produced using these methods are sequenced at low coverage (a position in the genome is covered by only few sequencing reads), and raw sequencing error rates are often higher than observed using Sanger sequencing. In such circumstances, it is often difficult to distinguish between a variable site and a sequencing error, making the identification of variable sites in the sample (a procedure known as “SNP calling”) nontrivial and prone to error. Also, determining the genotype for each individual (“genotype calling”) can be unreliable due to uncertainty about whether both the parental chromosomes were sampled. Therefore, sequencing errors and uncertainty in the genotype calls may lead to a biased allele frequency distribution (Johnson and Slatkin 2008; Hellmann *et al.* 2008).

Accurate estimation of the site frequency spectrum (SFS), however, is important for population genetic inferences of demography, natural selection, and population structure. Indeed, many summary statistics for evolutionary inferences are functions of the sample allele frequencies (Nielsen 2005). Higher sequencing coverage lowers the uncertainty, but with a fixed budget, researchers need to choose between sequencing fewer samples at higher coverage or sequencing more samples at low to medium coverage. The latter was the preferred option for many recent large-scale sequencing population genetic studies (1000 Genomes Project Consortium 2010, 2012; Auton *et al.* 2012; Huang *et al.* 2012).

Naïve methods for estimating allele frequencies, which are primarily based on direct counting of sequencing reads, provide inaccurate estimates of local nucleotide diversity (Nielsen *et al.* 2011). Consequently, there have been numerous efforts to use statistical models to analyze NGS data to provide more accurate estimates of allele frequencies. To this end, maximum-likelihood (ML) methods and Bayesian methods have been developed for estimating the allele frequency at any given site (Lynch 2009; Keightley and Halligan 2011; Kim *et al.* 2011) or the entire distribution of allele frequencies jointly across multiple sites (Li 2011; Keightley and Halligan 2011; Nielsen *et al.* 2012). Bayesian methods incorporate base quality scores and statistical uncertainty to obtain posterior probabilities associated with each genotype. Recent studies incorporate this probabilistic approach to estimate population genetic parameters from NGS data (Yi *et al.* 2010; Gompert and Buerkle 2011; Kang and Marjoram 2011; Li 2011; Gompert *et al.* 2012).

Thanks to these approaches, genome-wide scans of positive selection have been possible in samples sequenced at moderate coverage. For example, in Yi *et al.* (2010), 50 Tibetan individuals were sequenced to identify the regions of the genome involved in the adaptation to high altitude. Species of rice (Xu *et al.* 2011), chicken (Rubin *et al.* 2010), and silkworm (Xia *et al.* 2009) have also been sequenced at low coverage to identify functional differences between domesticated and wild populations.

In genome-wide scans for selection, it is often informative to summarize genetic variation using population differentiation statistics, such as *F*_{ST} (Wright 1951), to identify particular regions of the genome that are highly differentiated relative to the rest of the genome. *F*_{ST} can also be informative about the divergence time between two populations. Another powerful tool for the analysis of genetic data are principal component analysis (PCA). This data reduction method is a convenient way to visualize the data, derive corrections for population stratification in association studies, and investigate specific features of population history and differentiation. Both PCA and *F*_{ST} have been used extensively for the past 30 years and continue to be valuable tools in summarizing genetic variation.

However, as we show, traditional methods for computing *F*_{ST} and performing PCA result in biases when applied to genotype calls from low or moderate coverage NGS data. Therefore, we propose a new method to estimate *F*_{ST} from NGS data that accounts for uncertainty in the genotype calls. Furthermore, we also show that population structure can be investigated with PCA under the proposed probabilistic framework that accounts for sequencing errors. These new methods outperform previous approaches, especially in the case of low-coverage sequencing data as determined from simulated sequences. Finally, we demonstrate the power of the proposed methods by applying it to a previously published data set of wild and domesticated samples of *Bombyx mori* (Xia *et al.* 2009).

The methods developed in this study contribute to the current toolkit for population genetic analyses of next-generation sequencing data and can be applied to both model and nonmodel organisms.

## Materials and Methods

### Measuring genetic differentiation between populations

*F*_{ST} is a measure of population genetic differentiation that quantifies the proportion of variance in allele frequencies among populations relative to the total variance (the sum of the variance within individuals, within populations, and between populations). Several estimators of *F*_{ST} have been proposed through the years (reviewed in Weir and Hill 2002; Holsinger and Weir 2009).

There is considerable debate about definitions of *F*_{ST}. Some researchers consider *F*_{ST} to be a model parameter (*e.g.*, Balding and Nichols 1995; Nicholson *et al.* 2002; Holsinger *et al.* 2002), while others consider it to be a statistic (*e.g.*, Reynolds *et al.* 1983; Weir and Cockerham 1984; Hudson *et al.* 1992). Even when considering *F*_{ST} as a parameter, there is considerable discussion about what model it is a parameter of and how it should be estimated (Marchini and Cardon 2002; Balding 2003). The objective of this article is not to compare these approaches, which differ both in what they estimate and in how the estimation procedure works. We remain agnostic with regard to the debate on interpretation and definition of *F*_{ST}, although we use the word “estimator” throughout. Instead, we show how some of the most commonly applied estimators of *F*_{ST} can be modified in the presence of low- and medium-coverage data to more accurately reflect what the original *F*_{ST} estimators were intended to capture; *i.e.*, the objective will be to derive estimators applicable to NGS data that produce results similar to those that would have been obtained from the original estimator based on full genotype data without any errors. As a note, other estimators, not considered here, could potentially be modified in a similar fashion.

#### Method-of-moments estimation:

We start by considering the most simple method-of-moments estimators of *F*_{ST}. They do not rely on any assumptions about the shape of the sampling distribution, beyond the moments used to estimate the parameters, and they are easy to implement through simple algebraic expressions. For these reasons, method-of-moments estimators are popular and often used.

Our first aim is to extend the method-of-moments *F*_{ST} estimator proposed by Reynolds *et al.* (1983), as this is one of the most popular and well-motivated estimators of *F*_{ST}, to take into account genotyping uncertainty. Assuming a biallelic SNP, with nonreference allele at estimated frequencies of , , and for population *i*, *j*, and pooled, the genetic variance between and within populations at site *s* is, respectively, (1)and (2)where *n _{i}* and

*n*are the number of sampled individuals per population, , and . Table 1 describes nomenclature used throughout this manuscript.

_{j}The estimate of *F*_{ST} for a single site is then (3)while for a *locus* of *m* sites it is

#### Maximum-likelihood estimation:

ML methods for estimating *F*_{ST} require the specification of a sampling probability distribution. Once this distribution is defined, one can maximize a likelihood function to obtain ML estimators for the parameters of the distribution. ML estimators of *F*_{ST} have been very popular, particularly for detecting signatures of adaptive natural selection among populations (*e.g.*, Beaumont and Balding 2004; Riebler *et al.* 2008; Foll and Gaggiotti 2008).

Assuming a biallelic site *s* with beta-distributed allele frequencies, the probability of the sample allele frequencies at population *i* can be expressed as a beta-binomial distribution with parameters 2*n _{i}* (sample size),

*F*

_{ST}, and

*p*

_{anc,}

*, the ancestral population allele frequency. This parameterization assumes divergence from a common ancestral population and that the subsequent divergence is well modeled by the beta-distribution. The marginal sampling distribution in population*

_{s}*i*is then given by (Balding and Nichols 1995; Balding 2003) (5)where

*k*is the count of the nonreference (or derived) allele,

*B*is the Beta-function, (6)and

The full-likelihood function is the product of this sampling distribution for all populations, as the populations are independent conditional on *p*_{anc,}* _{s}*. For two populations

*i*and

*j*, we have (8)where the subscripts on

*n*and indicate population identity. We numerically maximize Equation 8 using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (Fletcher 1987; Press

*et al.*2007).

### Quantifying population genetic differentiation by calling genotypes

A naïve strategy for estimating sample allele frequencies and *F*_{ST} is to first call genotypes at each site, and then simply count the occurrence of nonreference or derived alleles among all individuals.

We first assessed the accuracy of several genotype-calling strategies (Supporting Information, File S1). These methods include approaches based on direct counts of read bases, on genotype likelihoods, and on genotype posterior probabilities. One promising approach is to use Bayesian methods to assign individual genotypes by computing genotype posterior probabilities *P*(*G*|*X*) from genotype likelihoods and a specific prior *P*(*G*) on genotype *G*. Bayes’ theorem is used to calculate *P*(*G*|*X*), the posterior probability of genotype *G* given the observed data *X* (1000 Genomes Project Consortium 2010). The prior can be defined using extraneous data, such as the reference sequence, sequences in a database, an estimate of the allele frequency, and/or inbreeding coefficients, etc. (*e.g.*, 1000 Genomes Project Consortium 2010; Li 2011; Nielsen *et al.* 2012).

We calculate genotype posterior probabilities at site *s* for individual *w*, *P*(*G*_{(}_{w}_{,}_{s}_{)}|*X*_{(}_{w}_{,}_{s}_{)}) as(9)where *P*(*X*_{(}_{w}_{,}_{s}_{)}|*G*_{(}_{w}_{,}_{s}_{)}) are the genotype likelihoods and *P*(*G*_{(}_{w}_{,}_{s}_{)}) is the prior probability of genotype *G* at site *s* under Hardy–Weinberg Equilibrium (HWE). The prior is calculated from estimates of the per-site population allele frequencies using the method described in Kim *et al.* (2011). To call genotypes, the genotype with the highest posterior probability was chosen for each individual.

Results show that calling genotypes from genotype posterior probabilities provides the most stable and accurate genotype and SNP-calling accuracy at almost all tested experimental scenarios (Table S1, Table S2, and Table S3). We adopted this strategy to call genotypes throughout the rest of the study. Specifically, we counted nonreference alleles from these called genotypes to infer allele frequencies and computed a method-of-moments estimator of *F*_{ST}, which we labeled (Equations 10 and 11). We adopted this genotype calling strategy to compute a ML estimator of *F*_{ST}, (Equations 5 and 8).

An alternative strategy for computing *F*_{ST} is to avoid genotype calling altogether so that inference is based directly on the posterior probabilities (*e.g.*, Yi *et al.* 2010; Nielsen *et al.* 2012). We describe such methods in the following sections.

### Quantifying population genetic differentiation without calling genotypes

Here we propose using a Bayesian probabilistic framework to estimate *F*_{ST} from posterior probabilities of sample allele frequencies of each population at each site without calling specific genotypes. In our applications, we compute a maximum-likelihood estimate of the site frequency spectrum from genotype likelihoods, as previously proposed by Nielsen *et al.* (2012). Using this ML estimate of the SFS as a prior in an empirical Bayes approach, we estimate the posterior probability for all possible allele frequencies at each site (Nielsen *et al.* 2012).

#### Method-of-moments estimation:

Let be the posterior probability that a site in population *i* has derived sample allele frequency , in a sample of *n _{i}* diploid individuals, given the read data

*Y*

_{(}

_{i}_{,}

_{s}_{)}. This probability can be calculated from the genotype probabilities using the algorithm in Nielsen

*et al.*(2012). Allele labeling with respect to the derived allele is arbitrary and any other labeling of alleles could have been chosen if identification of the ancestral and derived state is not possible.

From these quantities, we compute the posterior expectation of the genetic variance between and within populations (see Equations 1 and 2) at site *s* as (10)and (11)where and are genetic variances from Reynolds *et al.* (1983) formula, with *k*- and *z*-derived alleles in populations *i* and *j*, respectively, and *Y _{s}* is the sequencing data at site

*s*. The total expected variance,

*E*[

*c*|

_{s}*Y*], at each site, is then

_{s}*E*[

*c*|

_{s}*Y*] =

_{s}*E*[

*a*|

_{s}*Y*] +

_{s}*E*[

*b*|

_{s}*Y*].

_{s}The estimate of *F*_{ST} for a single site is given by the ratio of *E*[*a _{s}*|

*Y*] to

_{s}*E*[

*c*|

_{s}*Y*] (Equation 3). However, since the two variance components are not independent and this calculation involves the expectation of a ratio, we approximate it using the delta method (Rice 2008; Rice and Papadopoulos 2009) to obtain the following estimator of

_{s}*F*

_{ST}at site

*s*, (12)where 〈

*c*〉 is the

_{u}*u*th central moment of

*c*and 〈

_{s}*a*,

*c*〉 is the mixed central moment, which can be calculated as (13)and (14)where is the total genetic variance from Reynolds

_{u}*et al.*(1983) formula, with

*k*- and

*z*-derived alleles in populations

*i*and

*j*, respectively. For computational purposes, we use only the first central and mixed central moments.

can be calculated using maximum likelihood similarly to the method used for calculating for a single population (Nielsen *et al.* 2012). However, this calculation may not be desirable due to the high variance associated with the estimation of so many parameters.

An alternative approach is to compute an estimate of the two-dimensional site frequency spectrum (2D-SFS), , as (15)where and are the marginal likelihoods of observing *k* and *z* nonreference alleles at population *i* and *j*, respectively, at site *s*, as presented in Nielsen *et al.* (2012).

is then used as a prior to compute the posterior probability of quantities of interest. For instance, the expectation of the genetic variance between populations (see Equation 10) can be computed as (16)Finally, a method-of-moments estimator of *F*_{ST} over *m* sites is given by Equation 4. When analyzing multiple sites, we do not add the correction factor to the ratio of *E*[*a*|*X*] to *E*[*c*|*X*] at each site because, for a large number of sites, the error introduced by taking the ratio of two nonindependent expectations will be minimal. We also tested the performance of other methods to estimate *F*_{ST} from sequencing data derived from the expectations of sample allele frequencies (File S1).

These methods can be extended to nonpairwise definitions of *F*_{ST} (Weir 1996). These formulations require the estimation of a joint SFS among all populations, which can be estimated in a similar fashion as in Equation 15.

#### Maximum-likelihood estimation:

We also extend the procedure for ML estimation of *F*_{ST} and *p*_{anc} under the Beta-binomial distribution (Balding and Nichols 1995; Balding 2003) (Equation 8) to the case of unknown genotypes. These estimates, which we call *F*_{ST.ML}, are obtained by maximizing the likelihood function(17)where *Y*_{(}_{i}_{,}_{s}_{)} and *Y*_{(}_{j}_{,}_{s}_{)} are the observed read data at site *s* for population *i* and *j*, respectively, and and are again the marginal likelihoods of the sample allele frequency for population *i* and *j*, computed as in Nielsen *et al.* (2012).

### Principal Components Analysis

A similar approach to the one used for correcting estimates of *F*_{ST} can be used in PCA. The now-standard method for calculation PCA in population genetics is based on Patterson *et al.* (2006). For *n* individuals and *m* sites a normalized covariance matrix *C* is calculated as (18)where is the derived allele frequency at site *s* (the labeling is again arbitrary) and *G*_{(}_{w}_{,}_{s}_{)} is the number of derived alleles for individual *w* at site *s* (*G* ∈ {0, 1, 2} in the diploid case). The denominator is inserted to account for genetic drift and normalizes the standardized allele frequencies to have the same variance (Patterson *et al.* 2006). However, other normalizations can be chosen. An eigenvector decomposition of *C* is then computed.

We propose computing an estimate of *C*_{(}_{w}_{,}_{y}_{)} by integrating over the posterior genotype probabilities at site *s* for individual *w*, *P*(*G*_{(}_{w}_{,}_{s}_{)}|*X*_{(}_{w}_{,}_{s}_{)}), and *y*, *P*(*G*_{(}_{y}_{,}_{s}_{)}|*X*_{(}_{y}_{,}_{s}_{)}), which can both be calculated as in Equation 9. The prior is calculated using the sample allele frequencies at site *s* as in Kim *et al.* (2011). Therefore, , , , where *G*_{(}_{w}_{,}_{s}_{)} is the number of derived alleles for individual *i* at site *s*. Missing genotype data are then implicitly incorporated in a Bayesian manner using the prior from the sample allele frequencies.

Additionally, the *C* matrix is weighted by the probability of each site being variable. This is motivated by the fact that, at low to medium sequencing coverage, sites that have a small probability of being variable in the sample can have a small but nonnegligible contribution to the matrix *C*. As they are several orders of magnitude more invariable than variable sites, this can have a profound effect on the analyses, even when weighting with genotype probabilities. Instead of using an arbitrary discrete SNP calling, or minor allele frequency, cut-off, we propose weighting sites according to their probability of being variable.

We, therefore, estimate the matrix *C* as (for *w* ≠ *y*) (19)where the probability of site *s* being variable, *P*_{var,}* _{s}*, is computed as (20)We emphasize that this approach does not provide a form of Bayesian PCA analysis. Rather, it is a modification of the Patterson

*et al.*(2006) approach for PCA analysis in the context of population genetics, modified to incorporate uncertainty in genotype calls by using an appropriate weighting of different genotypes using their respective posterior probabilities.

Note that we estimate the joint posterior of the genotype probabilities for the two individuals using the product of their marginal genotype probabilities, *i.e.*, we estimate *P*(*G*_{(}_{w}_{,}_{s}_{)}, *G*_{(}_{y}_{,}_{s}_{)}|*X*_{(}_{w}_{,}_{s}_{)}, *X*_{(}_{y}_{,}_{s}_{)}) by *P*(*G*_{(}_{w}_{,}_{s}_{)}|*X*_{(}_{w}_{,}_{s}_{)})*P*(*G*_{(}_{y}_{,}_{s}_{)}|*X*_{(}_{y}_{,}_{s}_{)}). *P*(*G*_{(}_{w}_{,}_{s}_{)}|*X*_{(}_{w}_{,}_{s}_{)}) and *P*(*G*_{(}_{y}_{,}_{s}_{)}|*X*_{(}_{y}_{,}_{s}_{)}) are not independent as they are correlated through the underlying estimate of genotype frequencies affecting the prior. However, as these analyses are carried out conditional on an estimated allele frequency, the approximation is accurate, although it ignores the sampling variance in the estimate of the allele frequency. Conditional on the allele frequency, *P*(*G*_{(}_{w}_{,}_{s}_{)}|*X*_{(}_{w}_{,}_{s}_{)}) and *P*(*G*_{(}_{y}_{,}_{s}_{)}|*X*_{(}_{y}_{,}_{s}_{)}) are independent.

We also note that (21)for unrelated individuals under HWE assuming known allele frequencies and a HWE-derived prior for the genotype probabilities. This shows that the covariance function for unrelated individuals is in fact expected to be zero using this estimator, a necessary and desirable property for the method to perform well. Proof of Equation 21 is provided in the *Appendix*. As we argue, the resulting PCA is greatly improved over naïve methods using genotype calling under all the explored scenarios.

This approach could be extended to different strategies to perform PCA from a matrix of genotype posterior probabilities, for instance, ML methods that account for noise contributions of each variable (Wentzell *et al.* 1997) or Bayesian methods that use external information about the data (Nounou *et al.* 2002).

### Simulating sequencing data for multiple populations

We performed simulations to compare the performance of these methods to estimate population genetic differentiation, as well as to quantify the genotyping and SNP calling accuracy, under a broad range of experimental conditions. As in previous studies (Kim *et al.* 2010, 2011), we simulated sequencing data rather than raw sequencing reads for computational efficiency. We treated sites as independent of each other and simulated genotypes for each individual assuming HWE and a specific population allele frequency. Specifically, we repeated the following procedure for each site.

First, for each site, we drew an ancestral allele frequency *p*_{anc} from a distribution in [5 × 10^{−3}, 1 − (5 × 10^{−3})] with density proportional to 1/*x*. This distribution is the expected allele frequency distribution under a standard neutral infinite sites model, truncated at the boundaries corresponding to a population size of 200 individuals (see, *e.g.*, Ewens 2004). We then simulated allele frequencies for two populations using the Balding–Nichols model (Balding and Nichols 1995) with mean equal to *p*_{anc}, as in previous studies (Pritchard and Donnelly 2001; Price *et al.* 2006). We simulated two independent samples, conditionally on *F*_{ST} and *p*_{anc}, from this distribution to obtain allele frequencies for two populations (see Equation 5). From these population allele frequencies, we assigned genotypes according to HWE for each individual.

To simulate data from three populations, we first drew population allele frequencies from the Balding–Nichols model for two populations as described above. We then assigned the first allele frequency to population 1 and used the second allele frequency as the ancestral allele frequency for populations 2 and 3. We then drew two population allele frequencies from the Balding–Nichols model for a different value of *F*_{ST} and assigned these allele frequencies to populations 2 and 3.

To simulate NGS data, the number of reads at each locus for each individual was simulated from a Poisson distribution as in Kim *et al.* (2010, 2011). Additionally, errors were randomly introduced uniformly among nucleotides at a rate of 0.0075. This value is comparable to error rates found in previous studies (1000 Genomes Project Consortium 2010; Li *et al.* 2010; Yi *et al.* 2010). The probability of a site being polymorphic, *P*_{var}, was varied from 0.02 to 1.

We computed genotype likelihoods from simulated sequencing reads. Genotype likelihoods depend on both base calls and quality scores and are proportional to the probability, *P*(*X*|*G*), of the observed read data, *X*, at a site for each individual given a certain genotype *G*. In the simplest possible case, for read *z* at site *s*, we calculated the genotype likelihood of a particular base *v*, *L*_{(}_{z}_{,}_{v}_{,}_{s}_{)} with *v* ∈ {A, C, G, T} as *L*_{(}_{z}_{,}_{v}_{,}_{s}_{)} = (1 − *e*) if *v* is the observed base at read *z*, and *L*_{(}_{z}_{,}_{v}_{,}_{s}_{)} = *e*/3 otherwise. Here *e* is the sequencing error used in the simulation setting. There are many other methods for estimating *e*, including methods for estimating it directly from the data (*e.g.*, Kim *et al.* 2011). Genotype likelihoods at site *s* for individual *w* are then calculated by taking the product of the likelihoods over all *r* reads: (22)Using this procedure, we computed genotype likelihoods for each individual at each site for all 10 possible genotypes. We then computed posterior probabilities of genotypes and sample allele frequencies, as previously described (see Equation 9).

When calling genotypes, we assigned genotypes with a posterior probability <0.90 as missing data. We removed sites where more than half of the individuals had missing genotypes. With this procedure, we filtered ∼25% of the total sites at 2× sequencing coverage. We computed *F*_{ST} only on nonmissing genotypes, while for PCA we imputed missing data with genotypes with the highest posterior probability.

To assess the accuracy of the per-site estimates of *F*_{ST}, we simulated two data sets of 10k and 1k sites for each experimental scenario to evaluate method-of-moments and ML estimates, respectively, with *F*_{ST} varying from 0.01 to 0.4, and with *P*_{var} = 1. We verified convergence of optimization algorithms for ML estimators of *F*_{ST} and discarded sites where this condition was not met. We also simulated 1M sites by concatenating 100 sets of 10k simulated sites with *F*_{ST} values drawn from a Normal distribution *N*(0.2, 0.2) truncated at 0.02 and 0.90, and *P*_{var} = 0.10 to assess the accuracy of multiple-sites estimates of *F*_{ST}. We simulated 20 individuals per population at low (2×), medium (6×), and high (20×) sequencing coverage.

To evaluate the performance of different methods for estimating *F*_{ST}, we calculated two measures of deviation from the true *F*_{ST} over *m* sites: the root-mean-square deviation (RMSD), (23)and mean bias (24)where and is the estimated *F*_{ST} at site *s* from the case of known genotypes and sequencing data, respectively.

To evaluate the accuracy of the PCA method, we simulated 10k sites for each scenario with values of *F*_{ST} ranging from 0.02 to 0.4 and with *P*_{var} = 0.02, 0.1, or 1. We simulated three populations with 20 individuals each at 2×, 6×, and 20× sequencing coverage. We performed 10 distinct simulations for each experimental condition to assure robustness of our results. We assessed the accuracy of inferred PCA plots using Procrustes analysis (Wang *et al.* 2010). Briefly, we measured the deviation of PC1 and PC2 computed from the case of known genotypes and the case of unknown genotypes using sum-of-squares (SS), where SS values closer to 0 indicate better fits.

### Applications to real data

We analyzed a data set of wild and domesticated species of silkworm, *B. mori* (Xia *et al.* 2009). The data consisted of 40 samples representing 29 domesticated lineages and 11 wild lineages. Domesticated lineages are phenotypically and geographically separated into subgroups while all wild lineages are from China. Samples were sequenced at an approximate mean per-site coverage of 3×. We analyzed chromosome 2 using the original genotype likelihoods by removing sites where we had no information for at least one individual. Details on the calculation of genotype likelihoods can be found in the original article (Xia *et al.* 2009). Approximately, 200,000 sites were analyzed in total.

We computed posterior probabilities of sample allele frequencies and genotypes using ANGSD software (available at http://www.popgen.dk/angsd). We then performed PCA and estimated *F*_{ST} using the new proposed methods implemented in a set of C/C++ programs (available at https://github.com/mfumagalli/ngstools). All statistical analyses were performed in the R environment (http://www.r-project.org).

## Results

### Quantifying population genetic differentiation from sequencing data

We performed extensive simulations to evaluate the accuracy of estimating *F*_{ST} using different methods and under different conditions. We first evaluated the accuracy of method-of-moments estimates of per-site *F*_{ST} based on called genotypes. Specifically, we assign genotypes for each individual based on the the highest genotype posterior probability (see *Materials and Methods*). This approach is representative of strategies currently used for genotype calling, and it provides better genotype and SNP calling accuracies than other genotype calling strategies examined here (Table S1, Table S2, and Table S3).

We then obtain a method-of-moments estimator of *F*_{ST} from NGS data without calling genotypes by using posterior probabilities of sample allele frequencies, which allows us to compute expected genetic variance components between and within populations (see *Materials and Methods*). Here, we employ Equation 15 to estimate the 2D-SFS and use it as a prior as in Equation 16. We call this estimator .

Results show that this new method performs substantially better than the method based on genotype calling under the experimental conditions explored in this study, especially at low sequencing coverage (Figure 1). tends to underestimate the true value of *F*_{ST} at 2× coverage, but this bias is reduced at 6× coverage (Figure 1). We observe accuracy in our estimates that are comparable to that of methods based on genotype calling for high coverage sequencing data. We obtain similar results when using the true 2D-SFS as a prior (Figure S1). We also observe that at 2× coverage, is more accurate for estimating *F*_{ST} than estimators based on computing the expected allele frequency for each population (see File S1 and Table S4), which overestimates *F*_{ST} (Figure S2).

Next, we compared the accuracy of a ML estimator of *F*_{ST} from called genotypes under the Balding–Nichols model, *F*_{ST.ML.GC}, to the proposed estimator based on the full likelihood under the same model while taking genotype calling uncertainty into account, *F*_{ST.ML} (see *Materials and Methods*). The results show that *F*_{ST.ML} outperforms the method based on calling genotypes at 2× and 6× coverage (Figure 2). For higher sequencing coverage, both methods perform very similarly. We also observe that ML estimates of the ancestral population allele frequency are highly correlated with the true values (Figure S3).

We also test the accuracy of estimating multiple-sites *F*_{ST} on 10k sites from a larger set of 1M simulated positions where only 10% of the sites are variable in the population (see *Materials and Methods*). For this particular analysis we chose the method-of-moments estimator because of its natural extension to multiple-sites estimation (Equation 4). At 2× sequencing coverage we underestimate the true *F*_{ST} (Figure S4). This bias diminishes at 6× and disappears at 20×. When we use the true 2D-SFS as a prior at 2× sequencing coverage, we underestimate the true *F*_{ST} when this value is above the whole-region average (approximately equal to 0.25), while we overestimate the true *F*_{ST} when this value is below the whole-region average (Figure S4). This bias is derived from using the 2D-SFS estimated from the entire region as a prior. At 6× and 20× sequencing coverage we observed unbiased estimates using the true 2D-SFS as a prior (Figure S4).

### Principal components analysis

In traditional PCA, genotypes are called at each site for each individual. We explore an alternative approach based on the genotype posterior probabilities for each individual at each site (see *Materials and Methods*).

At low sequencing coverage, the new method, which does not rely on SNP or genotype calling, produces PCA plot results that are essentially identical to those that use known genotypes (Figure 3). By contrast, direct genotype calling at low sequencing coverage generally leads to a loss in the ability to cluster individuals according to populations, which is a problem that may persist even after removing outlier individuals (Figure 3).

We replicated these findings under many different experimental conditions and for multiple independent simulations and assessed the accuracy of PCA plots using SS values from PC1 and PC2 computed from known genotypes (see *Materials and Methods*). The new method provides better accuracy than the method based on genotype calling for all tested scenarios, even at medium sequencing coverage (Figure S5). Generally, we obtain lower SS values without normalization of the standardized allele frequencies (see Equation 18), and the new method still outperforms an approach based on called genotypes at low sequencing coverage (Figure S6). We next simulated only variable sites data at high sequencing coverage to produce an ideal scenario for genotype calling. As expected, procedures based on calling genotypes lead to accurate PCA results under these conditions (Figure S7).

Notably, weighting each site by its probability of being variable gives higher accuracy than simply weighing all sites equally, especially when there are only a few variable sites in the sample (Figure S8). This proposed method also performs better than an approach based on computing expected genotypes from genotype posterior probabilities (Skotte *et al.* 2012; Gompert *et al.* 2012) for low coverage data (Figure S9). We also simulated one population with no genetic structure but where half of the individuals were sequenced at low coverage (2×) while the rest were sequenced at high coverage (20×). We still observe an improvement in the accuracy of the inferred PCA plots (Figure S10).

### Analysis of real data

To illustrate the performance of the herein proposed methods, we applied them to a data set of low-coverage sequencing data. Specifically, we investigate the population structure of wild and domesticated silkworm samples (Xia *et al.* 2009). Despite using only a single chromosome of the entire silkworm data set, we were able to detect fine-scale population genetic structure. Indeed, the first component of the PCA plot generated using the new method, which takes statistical uncertainty in genotype calling into account, shows a clear separation between wild and domesticated lineages (Figure 4A). Moreover, the second component divides the different lineages of domesticated silkworms into their subgroups (Figure 4A). The first two principal components explain 6.8 and 5.2% of the total genetic variation, respectively. Of note is that we achieve a better separation among the subgroups than in the original study using whole-genome sequence data, where several subgroups appear to be intermixed (Xia *et al.* 2009).

We then applied naïve strategies of performing PCA based on called genotypes using the maximum genotype likelihood or genotype posterior probability at each site for each individual. Results show several outlier individuals, which may be the effect of systematically misassigned heterozygous sites (Figure S11). However, when including only sites with estimated allele frequency greater or equal to two, and using genotype calling based on genotype posterior probabilities, we see a more accurate representation of the genetic structure. (Figure S11). A similar result using this allele frequency-filtered data set is obtained using the new proposed method that does not rely on genotype calling (Figure S11). Nonetheless, the new method applied to all of the data provides larger fractions of explained variance than the method based on genotype calling (Figure S12).

Finally, we estimated *F*_{ST} between wild and domesticated samples for 20-kb nonoverlapping genomic windows. We used the folded 2D-SFS due to uncertainty in assigning the ancestral and derived state of alleles. The distribution of the estimated *F*_{ST} values in 20-kb windows has mode around 0.4 (Figure 4B), which is larger than what was found in the original study (Xia *et al.* 2009).

## Discussion

NGS technologies are now an essential tool for population genetic studies. However, genotyping uncertainty associated with low sequencing coverage and high sequencing error can drastically bias downstream analyses (Nielsen *et al.* 2011). A recent study assessed the power to detect selective events and infer demographic scenarios as a function of sequencing coverage and error (Crawford and Lazzaro 2012). The results of the study show that weak selective events are hardly detectable, and inferences of population size changes are systematically biased for low-coverage data (<10×) (Crawford and Lazzaro 2012). Interestingly, the authors determined that population genetic differentiation was underestimated, even at medium to high sequencing coverage, suggesting that multipopulation analyses are even more sensitive to inaccuracy of NGS data (Crawford and Lazzaro 2012).

In this study, we take full advantage of a recently proposed Bayesian approach to taking sequencing data uncertainty into account (Li 2011; Nielsen *et al.* 2012). This method involves computing posterior probabilities for each genotype and all possible sample allele frequencies from genotype likelihoods. Estimation of classic population genetic parameters within this new probabilistic framework has previously been suggested (Yi *et al.* 2010; Li 2011; Nielsen *et al.* 2012) and in some cases implemented (Yi *et al.* 2010; Gompert and Buerkle 2011; Kang and Marjoram 2011). For instance, Gompert and Buerkle (2011) proposed a hierarchical Bayes model for genomic population structure. Their method accounts for uncertainty in sampling sequencing reads and measured population differentiation in terms of haplotype distances. Also, Skotte *et al.* (2012) and Gompert *et al.* (2012) used genotype expectations rather than called genotypes for the analysis of population structure. Here, we developed new methods for quantifying population genetic differentiation in terms of *F*_{ST} without relying on SNP or genotype calling. We simulated NGS data to assess the accuracy of these new estimators under a wide range of experimental scenarios.

Herein proposed methods for computing method-of-moments *F*_{ST} estimators, based on computing the posterior expected genetic variance components (see *Materials and Methods*), offer a solution to the lack of accuracy for low coverage data and outperform other examined estimators under all tested conditions (Figure 1). While the improvement offered by the new method is greatest and most noticeable at low coverage, even at medium sequencing coverage, it results in less biased estimates of *F*_{ST} (Figure 1). Similarly, ML estimation of *F*_{ST} that accounts for uncertainty in genotype calls outperforms a method based on genotype calling at low and medium coverage (Figure 2). These findings suggest that the framework presented in this study can be easily extended to other *F*_{ST} estimators. Overall, these results highlight the importance of taking statistical uncertainty into account when computing population genetic differentiation from NGS data. The great improvement in accuracy for low coverage data can be explained by the fact that we do not call SNPs or genotypes. We can thus avoid introducing errors during these processes, which can be particularly problematic for downstream analyses.

Errors introduced by calling SNPs and genotypes for low coverage and quality data can be even more evident when investigating population structure with PCA. Simple genotype calling provides very little ability to accurately identify structure using PCA for low coverage data. However, the new method based on genotype posterior probabilities provides PCA plots that are almost identical to cases in which true genotypes are known (Figure 3). Accuracy in identifying population structure can be recovered when calling genotypes by removing outlier individuals, low-quality sites, and low-frequency variants, but at the price of losing potential important information. Skoglund and Jakobsson (2011) investigated population structure by randomly sampled one read from each individual at each position. In this way they could compare modern, high-quality data with the low-pass ancient data. A disadvantage of this method is the loss of information associated with using only a single read from each individual, especially in the presence of sequencing errors.

We applied methods proposed in this study to a data set comprising 40 silkworm samples sequenced at low coverage (Xia *et al.* 2009). We used only a single chromosome of the original data set and we did not apply any criteria for SNP calling. Despite this, we were able to obtain a fine-scale map of population genetic structure, clearly separating wild and domesticated lineages of silkworm samples (Figure 4A). The first principal component separates domesticated and wild varieties, while the second component accurately divides the domesticated lineage into subgroups. Genotype calling from genotype posterior probabilities can provide an overall similar representation of the genetic structure when using a conservative initial filtering of data.

Genotype calling using stringent data filtering and a conservative approach for SNP calling and rare variants removal may be sufficient to give an overall picture of the genetic population structure, for example, a reasonably representative PCA. Other analyses, such as estimation of *F*_{ST}, that rely on accurate estimates of allele frequencies may be more difficult to rescue by conservative filtering because a fixed cut-off for SNP calling cannot provide unbiased estimates of allele frequencies (*e.g.*, Johnson and Slatkin 2008). Furthermore, the accuracy of genotype calling can be improved for human data by using imputation or haplotype-based genotype-calling methods (*e.g.*, Zhi *et al.* 2012), although such approaches are not as easily applicable to most other species. The poor performance of PCA after calling genotypes may largely be a result of inaccuracies in SNP calling rather than a consequence of erroneous genotype calls at variable sites. However, when simulating sequences with a larger proportion of polymorphic sites the new method still outperforms traditional methods, even in the case of an uneven sequencing coverage among individuals (Figure S10). While more sophisticated approaches have been developed to perform accurate SNP calling (*e.g.*, Kim *et al.* 2011), calling polymorphic sites using all individuals may result in ascertainment biases, which can influence estimates of population structure and divergence (Albrechtsen *et al.* 2010). Additionally, a stringent SNP-calling strategy implies that a large amount of data is discarded from the analyses, potentially leading to loss of important features of the data. For example, low-frequency variants, which are more likely to be removed in a conservative SNP calling strategy, can effectively distinguish closely related populations. Moreover, highly differentiated SNPs among populations, which may be related to genetic adaptation, might be lost in some analyses.

Like any other method for SNP-calling and allele-frequency estimation, the approach herein discussed is sensitive to the underlying base-calling algorithm and to the accuracy of quality scores. By improving accuracy and quality scores, current and future base callers can both reduce sequencing costs and increase accuracy of all downstream analyses of genetic variation. Furthermore, data filtering is a complex procedure when sequencing quality is low (*e.g.*, Minoche *et al.* 2011). Many other protocols, other than the ones used in this article, can be adopted to minimize the genotypes assignment bias.

We implemented the new proposed methods for estimating *F*_{ST} and perform PCA from NGS data in a fast, portable, and memory-efficient set of C/C++ programs and distributed on a public repository for shared development. These programs are directly integrated with ANGSD (http://www.popgen.dk/angsd), a software for the analysis of NGS data and easily integrable with other common software such as SAMtools (Li *et al.* 2009) or GATK (Mckenna *et al.* 2010). The computational cost associated with the new methods is slightly higher than that of standard approaches (Table S5 and Table S6). However, the increased computational burden is mostly associated with the computation of sample allele frequency posterior probabilities, which can be used for additional analyses. Notably, the computational cost should not be prohibitive for any existing data sets.

As NGS technologies become more ubiquitous and affordable, the frequency of large-scale population genetic and quantitative studies will certainly increase. The methods presented in this article provide tools for investigating genetic variation for multiple populations at large scales directly from high-throughput sequencing data.

## Acknowledgments

We are grateful to Michael DeGiorgio and Gaston Sanchez at the University of California, Berkeley, CA, for helpful discussions. We thank Shiping Liu and Zengli Yan at Beijing Genomics Institute, Shenzhen, China, for testing previous versions of the programs, and three anonymous reviewers for insightful comments on the manuscript. M.F. is supported by EMBO Long-Term Post-doctoral Fellowship (ALTF 229-2011). T.L. is supported by National Institutes of Health (NIH) Genomics Training Grant (Grant T32HG000047-13). E.H.S. is supported by National Science Foundation grant DBI-0906065 and NIH grant 3R01HG03229-08S2. R.N. is supported by NIH grant 3R01HG03229-07.

## Appendix

Let *G _{i}* and

*X*be random variables representing the genotype and read data, respectively, from individual

_{i}*i*. Likewise, let

*g*be a realization of

_{i}*G*, and let

_{i}*x*be a realization of

_{i}*X*,

_{i}*i*= 1, 2. We then wish to prove that

where the sums, here and in the following, are over all supported values of the variable under the summation sign. We use a simplified notation so that expectation operators implicitly are taken with respect to the random variable(s) inside the argument of the expectation operator, except when otherwise indicated by the use of subscripts. Also, we use the short-hand notation *p*(*g _{i}*) for the probability of the random variable

*G*taken on the value

_{i}*g*.

_{i}First note that, for unrelated individuals, *G*_{1} and *G*_{2} are independent and that *X*_{1} and *X*_{2} are independent assuming a fixed known allele frequency and assuming random mating. Next also note that (A2)and (A3)Then

The interchange of summations in the first step is justified because all sums are finite. The second equality is true because of the independence assumption. The third equality is verified by substitution of the expressions in (A2) and (A3). The fourth equality follows from the independence assumption and the definition of expectation.

## Footnotes

*Communicating editor: M. A. Beaumont*

- Received June 25, 2013.
- Accepted August 18, 2013.

- Copyright © 2013 by the Genetics Society of America