Comparing allele frequencies among populations that differ in environment has long been a tool for detecting loci involved in local adaptation. However, such analyses are complicated by an imperfect knowledge of population allele frequencies and neutral correlations of allele frequencies among populations due to shared population history and gene flow. Here we develop a set of methods to robustly test for unusual allele frequency patterns and correlations between environmental variables and allele frequencies while accounting for these complications based on a Bayesian model previously implemented in the software Bayenv. Using this model, we calculate a set of “standardized allele frequencies” that allows investigators to apply tests of their choice to multiple populations while accounting for sampling and covariance due to population history. We illustrate this first by showing that these standardized frequencies can be used to detect nonparametric correlations with environmental variables; these correlations are also less prone to spurious results due to outlier populations. We then demonstrate how these standardized allele frequencies can be used to construct a test to detect SNPs that deviate strongly from neutral population structure. This test is conceptually related to FST and is shown to be more powerful, as we account for population history. We also extend the model to next-generation sequencing of population pools—a cost-efficient way to estimate population allele frequencies, but one that introduces an additional level of sampling noise. The utility of these methods is demonstrated in simulations and by reanalyzing human SNP data from the Human Genome Diversity Panel populations and pooled next-generation sequencing data from Atlantic herring. An implementation of our method is available from http://gcbias.org.
THE phenotypes of individuals within a species often vary clinally along environmental gradients (Huxley 1939). Such phenotypic clines have long been central to adaptive arguments in evolutionary biology (Mayr 1942), with diverse examples including latitudinal clines in skin pigmentation in humans (Jablonski 2004), body size and temperature tolerance in Drosophila (Hoffmann and Weeks 2007), and flowering time in plants (Stinchcombe et al. 2004). Unsurprisingly, comparisons of allele frequencies between populations that differ in environment were among the earliest population genetic tests for selection (Cavalli-Sforza 1966; Lewontin and Krakauer 1973; Endler 1986) and have continued to be central to population genetics to this day (e.g., Coop et al. 2009; Akey et al. 2010).
The falling cost of sequencing and genotyping means that such comparisons can now be made on a genome-wide scale, allowing us to start to understand the genetic basis of local adaptation across a broad range of organisms. However, such studies need to acknowledge the sampling issues inherent in population genetic studies of natural populations. In assessing correlations between allele frequencies and environmental variables, or in looking for loci with unusually high levels of differentiation, two broad technical issues need to be addressed. First, sample allele frequencies are noisy estimates of the population allele frequency, and this issue is exacerbated when sample sizes differ across populations. Second, population allele frequencies among multiple populations are not statistically independent, as populations vary in their relationship to one another due to varying amounts of shared genetic drift and migration over time. Failure to account for differences in sample size and the shared history of populations can lead to a high rate of false positives and negatives when searching for the signal of local adaptation due to the unaccounted sources of variance and nonindependence among populations (Robertson 1975; Excoffier et al. 2009; Bonhomme et al. 2010). Therefore, accounting for these potential biases should provide additional precision in the identification of loci responsible for adaptation.
To accommodate these sources of noise, the Bayesian method Bayenv was developed (Hancock et al. 2008; Coop et al. 2010); Bayenv attempts to account for these two factors while testing for a correlation between allele frequencies and an environmental variable. To control for a general relationship between populations, an arbitrary covariance matrix of allele frequencies is estimated from a set of control markers. This model of covariance is then used as a null model against an alternative model that allows for a linear relationship between the (transformed) allele frequencies at a particular locus and an environmental variable of interest. Inference under these models is performed using Markov chain Monte Carlo (MCMC) to integrate over the posterior of the parameters. Bayenv has been successfully applied to identify loci putatively involved in local adaptation to environmental variables across a range of different species (e.g., Hancock et al. 2008, 2010, 2011b,c; Eckert et al. 2010; Fumagalli et al. 2011; Jones et al. 2011; Cheng et al. 2012; Fang et al. 2012; Keller et al. 2012; Limborg et al. 2012; Pyhäjärvi et al. 2012).
A range of other model-based methods have been developed in parallel to detect environmental correlations. One of the earliest of these was the method of Joost et al. (2007), which accounted for differences in sample size but did not account for population structure. Building on this, Poncet et al. (2010) developed a fast method based on generalized estimation equations, to allow population structure for correlated sample frequencies within a set of population clusters (but no relatedness between these clusters). A recent simulation study (De Mita et al. 2013) has shown that environmental correlation methods that explicitly account for population structure (e.g., Coop et al. 2010; Poncet et al. 2010) outperform those that do not account for structure, with Bayenv having somewhat higher power and greater robustness to population structure, likely due to it allowing arbitrary relatedness between populations (at the expense of computational speed).
Recently, Guillot (2012) developed an inference method very similar to Bayenv, offering large gains in computational efficiency. These gains are at the expense of constraining the covariance matrix in an explicit isolation-by-distance parametric form. In addition, Frichot et al. (2013) presented a latent factor mixed model that estimates the effect of population history and environmental correlations simultaneously. The Frichot et al. (2013) method resulted in a slightly higher power than that of Bayenv to detect environmental correlations in simulations, perhaps in part as a result of the simultaneous inference of fixed and random effects reducing the effect of selected loci inflating the covariance matrix.
These methods (Coop et al. 2010; Poncet et al. 2010; Guillot 2012; Frichot et al. 2013) all seem to have similar performance in terms of power, suggesting that the detection of linear correlations in the presence of population structure has reached a reasonable level of development. However, further work is needed to extend the utility and robustness of these methods to enhance their application to population genomic data.
One concern about applications of these methods is that linear models are not robust to outliers, which can lead to spurious correlations. For example, if a single population has both an extreme allele frequency and an extreme environmental variable, while all other populations show no correlation, then the linear model may be misled (see Hancock et al. 2011b; Pyhäjärvi et al. 2012, for examples). This sensitivity can be overcome by using rank-based nonparametric statistics, such as Spearman’s ρ, which may also offer increased power to robustly detect linear and nonlinear relationships. The difficulty is that such tests do not acknowledge the differences in sample size or the covariance in allele frequencies across populations. To address this some authors (Fumagalli et al. 2011; Hancock et al. 2011a) have used a nonparametric partial Mantel test, which can partially account for this covariance and makes fewer model assumptions. However, the partial Mantel approach is known to perform very poorly when both genotypes and environmental variables are spatially autocorrelated (see Guillot and Rousset 2013, for discussion). As such, we currently lack a framework to perform robust, nonparametric inference of environmental correlations in population genomic data.
To overcome these difficulties we use the framework laid out in Bayenv to provide the user with a set of “standardized” allele frequencies at each SNP. In these standardized allele frequencies the effect of unequal sampling variance and covariance among populations has been approximately removed. This affords users a general framework to utilize statistics of their choosing to investigate environmental correlations or other sources of allele frequency variation. As an application of this we show how these standardized allele frequencies can be used to develop a powerful test that robustly infers environmental correlations in the face of outlier populations. As a further example of how these “standardized allele frequencies” can be used, we construct a global FST-like statistic, which accounts for shared population history and sampling noise, to identify loci with unusually high allele frequency variance among populations (a test that is closely related to that of Bonhomme et al. 2010). We demonstrate the utility of these approaches through simulation and reanalyzing SNP genotyping data from the Centre d'Etude du Polymorphisme Hmain (CEPH) Human Genome Diversity Panel (HGDP) (Conrad et al. 2006; Li et al. 2008).
We also extend Bayenv to deal with some of the statistical challenges posed by next-generation sequencing. Recently, pooled next-generation sequencing (NGS) of multiple individuals from a population has gained in popularity (e.g., Turner et al. 2010, 2011; He et al. 2011; Kolaczkowski et al. 2011; Boitard et al. 2012; Fabian et al. 2012; Kofler et al. 2012; Lamichhaney et al. 2012; Orozco-terWengel et al. 2012), as it offers a cost-efficient alternative to sequencing of single individuals. However, estimating allele frequencies from read counts sequenced from a pool implies a second level of sampling variance (Futschik and Schlötterer 2010; Zhu et al. 2012), which needs to be considered in population genetic analyses. We extend the model of Bayenv to account for the sampling of reads in pooled NGS experiments. We show that this improves the power to detect environmental correlations in pooled resequencing data. We also show the utility of our approach by the reanalysis of a pooled next-generation sequence data set from Atlantic herring (Clupea harengus) populations along a salinity gradient (Lamichhaney et al. 2012).
General model of Bayenv
First, we briefly explain the underlying model of Bayenv for the sake of completeness. Further details about the model and the inference method can be found in Coop et al. (2010). Consider a biallelic locus l with a population allele frequency pjl in a population j, where nj alleles have been sampled from this population in total. Assuming that each population is reasonably large and approximately at Hardy–Weinberg equilibrium, the observed count of allele 1, kjl, in this population is the result of binomial sampling from this population frequency:(1)We follow the model of Nicholson et al. (2002) by assuming that a simple transform of the population allele frequency pjl in subpopulation j at locus l represents a normally distributed deviate around an “ancestral” frequency ɛl. Specifically we assume that (2)i.e., that the masses <1 and >1 are placed as point masses at 0 and 1, representing the loss or fixation of the allele in population j, respectively. We then assume that that the marginal distribution of θjl is normally distributed, around an ancestral mean frequency εl with variance proportional to εl(1 − εl) (inspired by the model of Nicholson et al. 2002). We denote the vector of transformed population allele frequencies at a locus by θl, where θl = (θ1l, …, θJl) when J is the number of populations. As we do not expect that the populations are independent from each other, we assume that θl follows a multivariate normal distribution(3)where Ω is the variance–covariance matrix of allele frequencies among populations. We can write the joint probability of our counts at a locus and the θl as (4)We place priors on Ω (inverse Wishart) and the εl at each SNP (symmetric Beta). Assuming that our SNPs are independent, we write the joint probability of all of our loci and parameters as(5)Our posterior is this joint probability normalized by the integral over Ω and the εl and θl at all of the loci.
We then use MCMC to sample posterior draws of the covariance matrix (Ω) from a set of unlinked, putatively neutral control SNPs. Our observations showed that the MCMC converges quickly to a small set of covariance matrices for each data set given a sufficient number of independent SNPs (Coop et al. 2010). Given this tight distribution, we use a single draw of Ω, denoted by , after a sufficient burn-in. The entries of the matrix Ω are closely related to the matrix of pairwise FST (Weir and Hill 2002; Samanta et al. 2009), and so this model provides a flexible model of population history; for example, Pickrell and Pritchard (2012) use a similar model to infer a tree-like graph of population history and Guillot (2012) uses a related model as a model of isolation by distance.
Next, we formulate an alternative model where an environmental variable Y, standardized to have mean 0 and variance 1, has a linear effect β on the allele frequencies: (6)To express the support for the alternative model at a locus l, Coop et al. (2010) calculated a Bayes factor (BF) by taking the ratio of probability of the alternative and the null model given the data and , integrating out the uncertainty in θl, εl, and β (under a uniform prior on β between −0.2 and 0.2).
Tests based on standardized allele frequencies
The linear relationship between the transformed allele frequencies (Equation 6) may not be the best fit in all situations, as other monotonic relationships could be viewed as biologically realistic in some cases. Additionally, there may be situations in which a linear model is not robust to outliers and so will spuriously identify loci as strong correlations. Therefore, we provide a general framework to allow investigators to apply statistics of their choice, such as rank-based nonparametric statistics, to detect environmental correlations, while taking advantage of the Bayenv framework. These statistics could in theory be applied to the raw sample frequencies; in practice, however, that can lead to high false-positive and false-negative rates as sample allele frequencies are naturally noisy because of the process of sampling and nonindependent due to the covariance among populations.
The multivariate normal framework employed by Bayenv offers a natural way to attempt to standardize θl to be variates with mean zero, variance one, and no covariance. These allele frequencies allow standard statistics that rely on these assumptions to be applied more directly. We denote the Cholesky decomposition of the covariance matrix C (Ω = CCT, where C is an upper-triangular matrix), which can be thought of as being equivalent to the square root of the matrix and so analogous to the standard deviation of θl. To standardize the θl for effects of unequal sampling variance and covariance among populations we write(7)If θl ∼ MVN(εl, εl(1 − εl)Ω), then Xl ∼ MNV(0, ) where is the identity matrix (i.e., i,j = 1 if i = j and i,j = 0 otherwise). Note that this transform is not unique, as a range of other factorizations of the covariance matrix are possible.
If we wish to test the correlation of our transformed allele frequencies with an environmental variable, we also need to similarly transform our environmental variable, to ensure that our frequencies and environmental variable are in the same frame of reference. Specifically, if our environmental variable is Y (standardized to be mean 0, variance 1), then our transformed environmental variable is(8)The fact that we need to do this follows from the fact that we are interested in predicting θl by Y, and so what we are doing is conceptually equivalent to multiplying the left- and right-hand sides of the equation relating the two by C−1 to remove the effect of the covariance in allele frequencies. This rotation by C−1 means that X and Y′ are linear superpositions of θ and Y, respectively, and so the population labels no longer apply to particular entries in these vectors. The transform will tend to exaggerate the environmental variable differences between very closely related populations. Furthermore, if part of the variation in the environmental variable precisely matches the major axis of variation in the genetic data, then applying the transform may remove much of this variation. Both of these effects seem desirable properties, as we are interested in identifying correlations discordant with the patterns expected from drift. However, users should visually inspect Y and Y′ to understand how the transform has altered the environmental variable (see Supporting Information, Figure S1, Figure S2, Figure S3, and Figure S4 for examples).
We do not get to observe θl, so we obtain a representative sample of M draws from the posterior . Given these draws, there is an enormous variety of ways that we could choose to summarize the support for the correlation with our environmental variable Y′. Here we choose to write(9)i.e., ρl is the mean of the function ρ() over our posterior draws of Xl.
In this article, we calculated Pearson’s and Spearman’s correlation coefficients [as our ρ()] as alternative tests to the Bayes factors. To obtain an appropriate sample from the posterior in a computationally efficient manner, these statistics were calculated between Xl and Y′ every 500 MCMC generations and then averaged over the complete MCMC run. Our draws of Xl will therefore be weakly autocorrelated, but as ρl is a mean, this does not affect its expectation.
While this standardization, for a known , would work perfectly if our θl were really multivariate normal, in reality this is only an approximation, as even under the null model deviations due to drift are only approximately normal over short timescales. Thus, while we model drift at a locus as being multivariate normal (i.e., θl has a prior given by Equation 3), if the true model is more complex, the joint probability of this along with our count data (and our uncertainty in Ω) may force θl to not be MVN(0, ). While, under these circumstances, Xl will conform to those assumptions better than θl, we still choose to use the empirical distribution of ρl across SNPs rather than rely on asymptotic results.
A robust statistic to detect extreme population differentiation:
This transformation to provide a set of Xl can also be used to propose an FST-like statistic. Specifically we can write the variance of Xl at a locus as(10)Note that while the multiplication by C−1 is not a unique transform, would be the same no matter which matrix factorization was used. This statistic is closely related to FST, which can be expressed as(11)where Var(plj) is the variance of our allele frequency across populations (see Nicholson et al. 2002; Balding 2003, for discussion). Thus XTX is directly analogous to FST, but it accounts for the variance–covariance structure of the populations in question. If θl is truly multivariate normal, then is distributed . This suggests that is a natural test statistic to identify loci that deviate away strongly from the multivariate normal distribution, e.g., due to selection. Furthermore, this form naturally accounts for hierarchical population structure or other models of population structure that can confound FST-style outlier analyses (Excoffier et al. 2009). As we do not observe Xl and rather obtained a posterior distribution of Xl, we take the average across the sample of Xl from our MCMC (we term this XTX).
Our XTX statistic is very closely related to the work of Bonhomme et al. (2010), who independently developed a similar test statistic (extended Lewontin and Krakauer test, FLK) for the case of a known population tree (see also Nei and Maruyama 1975; Robertson 1975, for earlier discussion of the effect of a population tree on the Lewontin and Krakauer 1973 test). Specifically, Bonhomme et al. (2010) use an equation analogous to Equation 10, but where Ω is a covariance matrix specified by a neighbor-joining population tree estimated from the data [using Reynolds’ genetic distance between population pairs (Reynolds et al. 1983)]. Thus we expect similar performance of XTX to that of FLK in situations where the covariance matrix is well approximated by a tree and XTX to outperform FLK under models such as isolation by distance and more complex models where population history is poorly represented by a tree.
Sequencing of pooled samples
If genotyping is conducted as sequencing of population pools, an additional step of sampling is included. At a site l the total coverage in population j is mjl, and we observe rjl reads of allele 1. Assuming that each individual contributed the same number of chromosomes to the pool, the sequenced reads are the result of binomial sampling(12)where is the unknown sample allele frequency in the pooled sample. Summing over this unknown frequency (13)(again assuming Hardy–Weinberg equilibrium) gives us the probability of our sampled reads given the population frequency. This replaces the binomial probability (Equation 1) in the joint probability given by Equation 4. In Coop et al. (2010), the Bayes factors were approximated by an importance sampling technique while performing MCMC under the null model; i.e., β = 0. This allowed the rapid calculation of the Bayes factor for many environmental variables with little extra computational cost. However, Bayes factors calculated by this technique are noisy, and so here we also implement an MCMC to estimate the posterior on β. We place a uniform prior on β (between −0.1 and 0.1) and update β along with εl and θl. For our update on β we use a random-walk sampler with a small normal deviate (σ = 0.01) and accept this move with the ratio of the joint posterior of our proposed parameters to that of our current parameters. As a simple summary of the posterior support for β ≠ 0, we look at the skew of the posterior away from zero. Specifically we estimate the proportion (f) of the marginal posterior on β that is above 0 and then take Z = |0.5 − f| as a test statistic, with values of Z close to 0.5 showing strong support for β ≠ 0.
The extended model was implemented in Bayenv2.0. Simulations were conducted to evaluate the power of these extensions. To use both a realistic covariance among populations and realistic environmental values, we based these simulations on SNP data from the HGDP populations (Conrad et al. 2006; Li et al. 2008) and the environmental variables measured at these sampling locations (also used in Hancock et al. 2008; Coop et al. 2010). These simulations were implemented in R (R Development Core Team 2011). We employed a single Bayenv2.0 estimate of the covariance matrix from the original SNP data (sampled after 100,000 MCMC iterations) to simulate population allele frequencies. For each SNP, an ancestral frequency εl was drawn from a beta distribution (with parameters α = 0.5, β = 3) if not otherwise stated. Then population allele frequencies were drawn from the multivariate normal , with values >1 or <0 replaced with 1 and 0, respectively. Allele counts were then drawn binomially from these frequencies, with a sample size of 20 chromosomes per population (if not otherwise stated).
To construct a null distribution we calculated our test statistics for these simulated SNPs and an environmental variable Y during 100,000 MCMC iterations. For a second set of SNPs, an environmental effect was simulated by drawing their population allele frequencies from a multivariate normal , using a range of β’s. Again our test statistics were calculated over 100,000 MCMC iterations. Power estimates were based on the proportion of these SNPs that were detected at a certain significance level α (5% here), i.e., the fraction of our simulations (with a β) in the upper α tail of the null distribution. Note that by taking this empirical approach we are setting our false-positive rate to 5%, and as such we do not need to investigate our false-positive rate.
Testing differentiation statistics:
To compare the power of differentiation statistics, we simulated data sets based on the HGDP populations, using the minimum winter temperature and latitude as environmental variables (latitude is obviously not an environmental variable, but is taken as a proxy for many environmental variables that are correlated with latitude). To highlight the effects of different sample sizes among populations, we simulated populations using the original sample sizes of the HGDP data (Li et al. 2008). Sets of neutral and selected SNPs were simulated as described above, and the differentiation statistics FST [using the pegas R library (Paradis 2010)], XTX, and the related FLK [using an R script provided by the authors (Bonhomme et al. 2010)] were calculated. The ancestral frequency was used as an outgroup population for the calculation of FLK. The ancestral allele frequency ε was set to 0.5 for all SNPs simulated for these tests as we discovered noisy power estimates for FLK when the ancestral frequency was drawn from a beta distribution (not shown). All power estimates again used empirical cutoffs based on the neutral simulations.
Simulation of pooled samples:
For the simulation of pooled NGS data, we assume that the depth of coverage of a pool follows a negative binomial distribution, which allows for the overdispersion of read depths compared to the Poisson. Coverages for each population and SNP were independently drawn from a negative binomial distribution NB(r, p), for which we set r = 5 and set p to obtain the respective coverage mean [i.e., NB(r, p) has a mean of pr/(1 − p) and a variance of pr/(1 − p)2]. This represents a case where the variance–mean ratio increases for higher average coverages. This pattern is generally consistent with observations from pooled next-generation data in Arabidopsis thaliana (T. Günther, C. Lampei, I. Barilar, and K. J. Schmid, unpublished results). An environmental effect of |β| = 0.06 was simulated when all 52 HGDP populations were used and |β| = 0.15 was used for simulations of smaller population subsets. The sign of β was assigned to be positive or negative at random.
Bayenv2.0 was employed to reanalyze a genome-wide data set of 640,698 SNPs from 52 HGDP-CEPH populations (Li et al. 2008; Hancock et al. 2010), using both Bayes factors and our nonparametric test statistic (ρl). We restricted our analysis to minimum winter temperature, as most winter climate variables show outliers. All environmental variables were normalized to a mean of zero and a standard deviation of one. The covariance matrix for this reanalysis was estimated from a random subset of 5000 SNPs after 100,000 MCMC iterations. Bayes factors and correlation coefficients for each SNP were estimated using 100,000 MCMC iterations. In addition to these test statistics, we sampled Xl every 500 MCMC generations and averaged XTX over these values for each SNP. SNP positions and gene annotations were obtained from Ensembl (Flicek et al. 2012) and Entrez Gene (Maglott et al. 2011).
To explore the use of Bayenv on pooled data we reanalyzed a pooled sequence data set from eight Atlantic herring (C. harengus) populations that were sampled along a salinity gradient (Lamichhaney et al. 2012). Lamichhaney et al. (2012) sequenced eight population pools of 50 individuals and aligned their reads back to a de novo assembly of the C. harengus exome, with each pool being sequenced to on average 30-fold depth. They identified 440,817 SNPs in these data and concentrated their analyses on SNPs with at least 40 reads in each population to reduce sampling noise. We reanalyzed the data of Lamichhaney et al. (2012), using allele count data provided by the authors. We first estimated the covariance matrix, using a random subset of 1000 SNPs, chosen to have ≥40-fold coverage in all populations. To explore the performance of our pooled sequencing extension of Bayenv we calculated the statistic Z for the correlation of allele frequencies and salinity across the eight populations. The environmental correlation test was calculated for all SNPs with at least 40 reads in all populations (36,794 SNPs) and for 100,000 SNPs chosen at random irrespective of sequencing depth. All statistics were calculated during 100,000 MCMC iterations.
Using tests based on standardized allele frequencies
We explored the performance of tests based on our standardized transformed population frequencies (Xl). Before we calculated test statistics on our standardized allele frequencies, we examined whether the multivariate standardization (as in Equation 7) had removed the covariance among populations from our standardized Xl. We first calculated the sample covariance matrix, using the sample frequencies for all 2333 autosomal SNPs of Conrad et al. (2006) shown in Figure 2A. Specifically, denoting the vector of sample frequencies by kl/nl we calculated . As expected, there is substantial structure in this sample covariance matrix between regions, which corresponds to known population structure in humans (Coop et al. 2010). Then we calculated the sample covariance matrix of the Xl across these SNPs, using Bayenv2.0; specifically, we took a single draw of Xl (after a burn-in) for each of these 2333 SNPs and calculated . The resulting sample covariance matrix (shown in Figure 2B) is close to the identity matrix in form, demonstrating that the majority of the covariance between populations has been removed. This suggests that our Xl are appropriately standardized for the application of correlation tests, averaging across our uncertainty in Xl at each locus.
To further test the normality of Xl, we checked whether the XTX statistic follows a χ2-distribution with 52 d.f. (i.e., the number of populations). A QQ plot of the XTX and the expected -distribution shows that the mean of the distribution approximately matches that of the , whereas the variances do not (Figure S5). Therefore, while XTX provides a potentially suitable summary statistic for identifying empirical outliers, we cannot assume a distributional form to those outliers under a null neutral model.
Detecting extreme population differentiation:
To test the utility of XTX as a statistic to detect overly differentiated loci, we simulated adaptation of the HGDP populations to minimum winter temperature and latitude and explored different ways to detect such environmental effects with an empirical cutoff score determined by simulations of a neutral model. In addition to Bayenv’s Bayes factors, we calculated the differentiation statistics FST, XTX, and FLK (Bonhomme et al. 2010). The environmental correlation Bayes factors clearly outperformed the differentiation statistics for both environments (Figure 3), which is consistent with the intuition that knowing the responsible environmental factor should improve the detection power (De Mita et al. 2013). Our simulations show that FLK and XTX clearly outperform the traditional FST-based tests, suggesting that accounting for the relationship among populations increases the power of differentiation statistics. Notably, FLK and XTX have similar power with slightly higher estimates for XTX. The similar level of power of the two tests is perhaps not unsurprising, as the history of the worldwide HGDP populations can be well represented by a tree-like relationship [with a small number of migration events (Pickrell and Pritchard 2012)].
Correlations using standardized allele frequencies:
To explore the power of standard correlation tests applied to our standardized Xl, in comparison to the Bayes factors, we again conducted power simulations based on the HGDP data. We calculated both Spearman’s ρ and Pearson’s r between Y′ and our transformed allele frequencies averaged across the posterior on these transformed frequencies. Statistics based on our Bayesian model clearly outperform correlation tests calculated for point estimates from sample allele frequencies (Figure 4, A and B). This improvement in power is due to the fact that the methods based on the sample frequencies fail to incorporate the sampling noise and the relationship among populations. All three tests based on Bayenv performed effectively identically with marginal advantages of the Bayes factors for minimum winter temperature (Figure 4B) and a slightly lower power of Spearman’s ρ, which is not surprising, as all simulated effects are linear. We expect that the relative performance of the rank-based test, i.e., Spearman’s ρ, may be reduced as the number of populations is decreased. We also tested the power of the Xl tests incorrectly, using Y in place of Y′, which led to power curves intermediate between the two sets (data not shown). Overall these results show that correlation tests based on Xl perform well.
The alternative model of Bayenv (Equation 6) implies a linear relationship between the transformed allele frequencies and the environmental variable. However, the fitting and significance of this linear model may be misled by populations that are statistical outliers. For instance, linear models might mistakenly identify a SNP as a strong candidate, when a single outlier population features both an extreme environment and an extreme allele frequency. We note that the extreme allele frequency may be due to a component of drift not well modeled by our MVN framework or due to a selection pressure (or response) poorly correlated with our environmental variable of interest. While loci of the latter form are of interest as genomic outliers, we believe researchers interested in particular environmental variables would consider such loci spurious and would prefer a set of candidates where many populations support a consistent pattern of correlation with an environmental variable.
To test such a case, we used winter minimum temperature as a climate variable since one population, the Yakuts from northeast Russia, is characterized by a very low minimum temperature (Figure 4C). We simulated allele frequencies for the HGDP populations as described above but to create outliers, we set the allele frequencies of the Yakuts to 0. Bayes factors and Pearson’s correlation coefficient r, which both assume a linear relationship, showed an excess of false positives (Figure 4D), while a nonparametric Spearman’s rank correlation coefficient (ρ) was much less sensitive to these outliers, with a false-positive rate very close to the expected value of 5% (Figure 4D).
Robust candidates in the HGDP data
We next explored the use of our standardized Xl for identifying robust putative candidates for adaptive evolution in a genome-wide data set of 640,698 SNPs from a global sample of 52 human populations [HGDP-CEPH (Li et al. 2008; Hancock et al. 2010)].
Environmental correlation statistics in the HGDP data:
As described above, populations with outliers in terms of allele frequencies and/or environments can potentially lead to spurious correlations. For example, the use of minimum winter temperature as an environmental variable could generate false-positive correlations in analyses of the HGDP data because of the extremely low temperature for the Yakut population. To explore this, we used minimum winter temperature and reanalyzed all SNPs of the HGDP data, calculating both Bayes factors and ρl(Xl, Y′) using Spearman’s rank correlation coefficient. Our Bayes factors and |ρl(Xl, Y′)| are correlated across SNPs (Spearman’s ρ = 0.72) and show an overlap of 29 SNPs in their top 100 most extreme SNPs, 142 SNPs in the top 500, and 2.8% in the top 5% of signals. These overlaps are substantial but suggest that our two tests are detecting somewhat different signals, which likely reflects in part the influence of outlier populations.
The 100 strongest signals of the Bayes factor analysis and |ρl(Xl, Y′)| are shown in Table S1 and Table S2. Among our top hits for both statistics are multiple SNPs that fall in the gene MKL1 (megakaryoblastic leukemia 1), which is a myocardin-related transcription factor that is involved in smooth muscle cell differentiation, mammary gland function, and cytoskeletal signaling (Parmacek 2007; Maglott et al. 2011) and has been associated with various disease phenotypes (Ma et al. 2001; Hinohara et al. 2009; Scharenberg et al. 2010).
To exemplify the effect of an outlier in Figure S6, we compare two SNPs that fall in our top 20 Bayes factors, but that have very different Spearman’s ρ. While in general, as seen above, there is good agreement between the Bayes factors and |ρl(Xl, Y′)|, we suggest that the Bayes factors, or other linear model test statistics, should be used in conjunction with robust test statistics such as those described here to avoid spurious signals due to outliers. As these both can be calculated from the same MCMC run, this should be reasonably computationally efficient.
Population differentiation statistics in the HGDP data:
We also explored our test statistic XTX, which is designed to highlight loci that deviate strongly from the expected pattern of population structure, by calculating it for each of the 640,698 HGDP SNPs. These XTX statistics have been uploaded as a genome browser track to http://hgdp.uchicago.edu/. The empirical distribution is shown in Figure 5. The empirical distribution clearly differs from the expected -distribution, with a higher mean and a lower variance than expected. This again highlights that we do not have a good theoretical expectation for the distribution and so must use the empirical ranks to judge how interesting a signal is.
To examine the relationship between XTX and global FST we took per SNP values of global FST previously calculated among the population groups as colored in Figure 2 (values from Coop et al. 2009; Pickrell et al. 2009). The Spearman’s ρ between XTX and FST was 0.48. Looking at the extremes of both distributions, XTX and FST show an overlap of 6 SNPs in their top 100 most extreme SNPs, 37 SNPs in the top 500, and 1.4% in the top 5% of signals.
In Table S3 we present the top 100 XTX SNPs in the genome, along with their nearest genes and global FST values. To briefly explore where known signals fall in our empirical distribution in Figure 5, we also plot as arrows the maximum XTX for SNPs that fall within 50 kbp up- and downstream of 10 well-known pigmentation genes (list taken from Pickrell et al. 2009). As these arrows represent maximums across a number of SNPs around the gene, they will necessarily be more extreme than an average draw from this distribution. However, the extreme signals at a number of these genes demonstrate that the method is detecting loci with extreme allele frequency patterns. The SNP with the most extreme value of XTX in the genome falls close to SLC24A5 (Nakayama et al. 2002; Lamason et al. 2005), while a SNP close to SLC45A2 is the second-largest signal in the genome. More generally, 5 of these 10 pigmentation genes fall in the top 1% and 9 genes fall in the top 5% of the empirical distribution. A SNP close to the gene EDAR, one of the highest pairwise FST’s between East Asia and Western Eurasia HGDP populations, is also in the top 10 SNPs (Sabeti et al. 2007).
The weak overlap in the tails of the genome-wide XTX and FST means that they are finding different sets of candidate SNPs, presumably due to the reweighting of allele frequencies in XTX. For example, our 12th highest SNP for XTX falls close to MCHR1, with our 21st highest gene being a nonsynonymous variant (rs133072) in the same gene. MCHR1 (melanin-concentrating hormone receptor 1) is known to play a role in the intake of food, body weight, and energy balance in mice (Marsh et al. 2002), and the effect of the nonsynonymous variant on human obesity has been debated (Wermter et al. 2005; Rutanen et al. 2007; Kring et al. 2008; Speliotes et al. 2010). Both of these SNPs are nearly fixed differences between East Asia and the American HGDP populations (Figure S7 and Figure S8). This strong difference between regions that share a recent history and, thus, covariance among allele frequencies (Figure 2) makes these SNPs an interesting pattern for XTX. However, neither of these two SNPs has an extremely impressive global FST (falling in only the 5% tail), presumably because East Asia and the American HGDP populations are only two of seven groups in the global FST calculation and the other five groups do not show an interesting pattern.
Analysis of pooled data
Pooled sequencing of multiple individuals has increased in popularity, as it is considerably cheaper than barcoding all individuals and sequencing them separately (but see Cutler and Jensen 2010, for a discussion of drawbacks). The use of allele frequencies estimated from the resulting read counts seems to be a reasonable application of our method. However, it raises the question of how Bayenv behaves for different coverages as increasing sequencing coverage is not the same as increased numbers of sampled individuals.
We simulated data that resemble the HGDP populations and then pooled 10 diploid individuals (i.e., 20 chromosomes) from each population and used the populations’ respective latitudes as our environmental variable. First, we experimented with incorrectly using read counts in place of the chromosome counts (i.e., assuming rjl and mjl were kjl and njl, respectively) and found that this resulted in an excess of extreme Bayes factors for high coverages under the null (data not shown). We found this inflation to be most pronounced when read depths are greater than the actual sample size, and this is likely due to false certainty about the population frequencies by failing to account for the second level of sampling.
We then ran power simulations of Bayenv matched to the HGDP data, with Z as a test statistic, using the true sample frequencies (black squares in Figure 6) and the read counts (incorrectly) as the input data for the previous version of Bayenv (Bayenv1.0, black circles in Figure 6). Bayenv2.0, which accounts for both stages of binomial sampling in pooled data (as described above), was also applied to the same read counts (white circles in Figure 6). The true sample frequencies naturally resulted in the best power as there is no additional sampling noise (Figure 6A). For higher mean coverages the power of Bayenv1.0 using the read counts as sample allele frequencies was almost as good as the power using true sample allele frequencies (Figure 6A). As most applications may consist of a smaller number of populations, we additionally sampled two subsets (Figure 6, B and C). On all of these, as expected, Bayenv using the true sample frequencies outperformed Bayenv1.0 using the read counts.
In part, the poor power in pooled studies is unavoidable due to the additional sampling noise. However, the loss of power is likely boosted by failing to properly account for this second stage of sampling, which leads to poor performance due to variation in depth across populations and SNPs. Including the sampling of reads into the model clearly had a positive effect on power in our population subsets, while incorrectly using the read counts as input did not reach similar powers, even with high coverages (Figure 6, B and C). However, the power of Bayenv2.0 was still considerably low for mean coverages <20×, suggesting that such low read depths do not provide enough certainty for reliable frequency estimation. Somewhat surprisingly, we did not observe any advantages of the extended model in detection power if all 52 HGDP populations are simulated (Figure 6A), perhaps because differences in coverage over populations are averaged over so many populations.
We note that both our original implementation and Bayenv2.0, which acknowledges the pooled nature of the data, will outperform tests of association that use the sample frequency computed from the read count data. This follows from the fact that both of these tests more fully acknowledge sampling noise. This is an important issue for next-generation data, as most technologies currently lead to high variation in coverage along the genome, which will be a substantial source of additional noise. Failure to acknowledge this noise across loci will greatly reduce the power of tests of environmental correlations.
Analysis of empirical data from Atlantic herring:
Finally, we applied Bayenv2.0 to the data of Lamichhaney et al. (2012), which consist of pooled sequence for eight Atlantic herring populations sampled along a salinity gradient, and identified outliers for differentiation using FST and P-values of a χ2-test on the read counts. For their FST analysis, Lamichhaney et al. (2012) selected a subset of their data set consisting of the 36,794 SNPs (of 440,817 SNPs in total) with at least 40 reads in each population; this sampling cutoff was implemented to reduce the impact of sampling noise on FST. While this is a sensible approach, it does discard data. Approaches, like those in Bayenv2.0, that acknowledge sampling should make better use of these data, as they can identify candidate loci in regions with lower coverage while avoiding false positives due to sampling noise. We calculated Z for the salinity gradient on the 36,794 SNPs (with coverage >40×) and for a random subset of 100,000 SNPs chosen regardless of their coverage. In addition, we calculated FST and P-values of a χ2-test on the read counts for these data [to compare to the original study (Lamichhaney et al. 2012)]. To demonstrate how these three tests treat sites with different coverages, we show the mean coverage of sites in the upper 10–0.5% tail in Figure 7A. SNPs in the tail of FST show a strong bias toward loci with lower coverages, which in turn implies that many of these loci are likely false positives due to sampling error being mistaken for extreme population differentiation. On the other hand, SNPs in the tails of the χ2-test’s P-value are enriched for high-coverage positions, reflecting the fact that this significance of the χ2-test will be greater for sites with higher coverage, even if those sites do not have particularly high levels of differentiation. Using our Z statistic, which accounts for the sampling error while looking for differentiation along an environmental gradient, we find that our outliers seem much less biased toward either extremely high or extremely low coverage. This observation, along with our higher power to detect environmental correlations than differentiation-based approaches (Figure 3 and De Mita et al. 2013), suggests that our Z statistic is potentially detecting more true signals of loci that are strongly differentiated along the salinity gradient while the number of false positives is reduced.
Among the contigs in which the top 200 SNPs for our Z statistic were found, only 8 were not named as candidates by Lamichhaney et al. (2012), suggesting good agreement with their extreme outliers. More generally, we find >70% of our top 1% Z SNPs among the top 5% of FST and χ2 (Figure 7B). This enrichment increases with coverage, reassuringly suggesting that the approaches are in stronger agreement about outliers in regions with high coverage. The overlap is not perfect; the tails of the differentiation statistics should also include SNPs involved in adaptation to other environmental variables and outliers caused by selection in single populations. Due to the current poor annotation of herring data available in public databases, we refrain from further speculation about the biological relevance of our top hits.
In this article we have presented a method to more robustly identify loci where allele frequencies correlate with environmental variables. We have also described a method to detect loci that are outliers with respect to genome-wide population structure, while accounting for the differential relatedness across populations.
Many available tests for selection are designed to detect rapid complete sweeps from new mutations; however, such events are likely just a small percentage of adaptive genetic change (Coop et al. 2009; Pritchard et al. 2010; Cao et al. 2011; Hernandez et al. 2011). Analyzing allele frequencies across multiple populations offers the opportunity to detect selection acting on standing variation and polygenic phenotypes. The falling cost of genotyping means that typing individuals from many populations is now in reach, a development that will allow us to connect environmental variables to more subtle adaptive genetic variation. However, we stress that loci detected by the approaches discussed above are obviously at best just candidates for being involved in adaptation to a particular climate variable, or set of climate variables, and so additional evidence is needed to build the adaptive case at any locus.
Our use of the covariance matrix of population allele frequencies when looking for environmental correlations is conceptually similar to linear mixed-model (LMM) approaches that account for kinship structure in genome-wide association studies (GWAS) (e.g., Yu et al. 2006; Kang et al. 2008, 2010; Zhou and Stephens 2012, who use an observed relatedness matrix as the covariance matrix of the random effect). One important difference is that we seek to predict allele frequencies at a locus using the environmental variable, whereas these LMM methods are predicting a phenotype as a function of genotypes at a locus. In our approach, the equivalent of the random-effect matrix is a proxy for a neutral model of allele frequency variation, while in the application to GWAS the kinship matrix accounts for the confounding due to heritable variation in the phenotype elsewhere in the genome. Our model could be used to detect loci that strongly covaried with population mean phenotypes [e.g., phenotypes measured at the breed level in dogs (Boyko et al. 2010)]. However, the method used this way would have a high rate of false positives if there are large environmental effects on the phenotype that coincide with the principal axes of the covariance matrix. Similarly, LMM approaches could be used to identify loci that covaried with environmental gradients, but they may be underpowered, because their random-effects model does not attempt to reflect a model of genetic drift.
Standardized allele frequencies
We introduced a set of tests based on using our model of the covariance of allele frequencies to produce a set of standardized allele frequencies (Xl). The calculation of standardized allele frequencies allows us to calculate a variety of statistics while taking advantage of the other features of Bayenv2.0’s approach to account for covariance among populations and sampling noise. The removal of covariance is often a standard step in multivariate analysis; here we remove this covariance structure in a way that acknowledges the approximate form of genetic drift and the bounded nature of allele frequencies. By integrating our statistic across the posterior for Xl, we are averaging across our uncertainty in allele frequencies, which should further increase our power. Since the Xl are estimated using a single sample of Ω from the posterior, they are dependent on the quality of this . With a large number of SNPs used to estimate the covariance matrix, MCMC runs converge to a small set of Ω’s with little variance, so in practice this is not a major concern (for smaller numbers of loci, investigators could average the covariance matrix across the MCMC output).
As an example of the usefulness of the Xl, we explored their application in identifying robust correlations with environmental variables. While the use of Spearman’s ρ on these transformed allele frequencies results in a small loss of power, it is much less sensitive to outliers and able to detect any monotonic relationship. Therefore, a combined approach that takes a set of SNPs in the intersection of the tail of Bayes factors and in the tail of Spearman’s ρ on our transformed allele frequencies should provide best results.
Our transformed allele frequencies could also be used to detect and distinguish between the effects of multiple environmental variables shaping variation at a locus. This could be accomplished by including the multiple transformed environmental variables (Y′) into a linear model to predict the Xl at a locus or by applying appropriate transformed ecological niche models (ENM) to the Xl to understand the predictors of allele frequencies at a locus (see Fournier-Level et al. 2011; Banta et al. 2012 for applications of ENMs to allele frequencies). However, there is limited information about the effects of even a single environmental variable from contemporary allele frequencies if neutral allele frequencies are autocorrelated on the same scale as environmental variation (as is the case in humans). Therefore, we caution that in many situations there will be very limited power to learn about the effect of multiple environmental variables.
Using our Xl statistics, we also introduced a method to identify loci that are outliers from the general pattern of population structure (our XTX statistic). This statistic is closely related to FST, which can be expressed as Var(plj)/(εl(1 − εl)), where Var(plj) is the variance of our allele frequency across populations (see Nicholson et al. 2002; Balding 2003; Bonhomme et al. 2010, for discussion). Our statistic, which is the variance of Xl, can be written as Equation 10, and so XTX can be seen as closely related to FST calculated on the standardized allele frequencies. Importantly, by removing the covariance, we reweight populations so that a small change shared across many closely related populations is downweighted. This reweighting therefore strongly increases our power to detect unusual allele frequencies compared to that of global FST. The fact that we remove the covariance between closely related populations also means that, unlike in FST-based methods, we do not have to arbitrarily clump populations to identify globally differentiated SNPs. While in this article we use the 52 HGDP population labels, in principle Bayenv2.0 could be run, treating each individual as a population, allowing XTX to be calculated without regard to any population label. However, this would be computationally time-consuming with thousands of individuals. In such cases perhaps the sample frequencies and the Cholesky decomposition of the sample covariance matrix could instead be used to mitigate the computational burden. This could also be done for environmental correlations for large numbers of samples.
Ideally our XTX statistic would have a parametric distribution under a general null model in which only drift and migration shaped our frequencies. That might allow us to make statements about what fraction of allele frequency change was due to selection. Indeed, as noted above, if our population frequencies were truly multivariate normal, our XTX statistic would be χ2-distributed if our sample sizes were sufficiently large. This assumption would be approximately met if our levels of drift were sufficiently small, such that the transition density of allele frequencies was well approximated by a normal (see Price et al. 2009; Bhatia et al. 2011 for recent empirical applications along these lines). However, when levels of drift are higher, our normal approximation will break down, as demonstrated by the poor fit of the χ2 to the transformed HGDP frequencies. The distribution of our statistic could be obtained by simulation if the population history were known. In practice, we are skeptical that our knowledge of population genetic history will be sufficiently accurate to make this feasible, but simulations may be useful in guiding the setting of approximate significance levels.
Pooled next-generation sequencing
Recent empirical validations have shown that pooled resequencing of populations is a powerful and cost-efficient way to estimate allele frequencies (Zhu et al. 2012; but see Cutler and Jensen 2010). The downside of the saving of costs in library preparation and sequencing is the potential for increased sampling noise in the allele frequency estimates (Futschik and Schlötterer 2010; Zhu et al. 2012) and the loss of haplotype information [although some haplotypic information can be recovered (Long et al. 2011)]. We account for the sampling of sequencing reads as an additional level of binomial sampling in the model of Bayenv2.0. Our power simulations show that accommodating the extra level of sampling in pooled designs can help to improve the power. However, they also highlight the large, unavoidable loss in power due to increased sampling noise when the depth of coverage is low. The only way that this can be circumvented is through increasing sequencing coverage to provide sufficient certainty in the estimated allele frequencies and, thus, sufficient power to detect environmental correlations. Although low fold sequencing of many populations may help to increase power in some situations, it is likely that for some species (notably humans) sampling, and not sequencing, will be the limiting resource in the future.
Our model of pooled resequencing in Bayenv2.0 implies uniform sampling of reads from each individual. Therefore, we do not account for the possibility of an unequal number of chromosomes per individual due to measurement errors, different DNA content per individual, or differences caused during DNA extraction, all of which might cause additional noise in the allele frequency estimation (Cutler and Jensen 2010; Futschik and Schlötterer 2010). This additional noise, if it is constant across loci, should be absorbed into the covariance matrix in Bayenv2.0, which will result in a reduction in power. However, including a sufficient number of individuals in each pool should mitigate this effect (Zhu et al. 2012). Furthermore, our model assumes perfectly called bases, as we do not consider quality scores or sequencing errors. Researchers dealing with NGS data should exercise caution with these issues. However, examining multiple-population pools simultaneously provides some straightforward approaches to minimize error rates in SNP calling, such as calling only SNPs supported by a minimum number of reads in at least one population (Futschik and Schlötterer 2010). Such strategies are already good practice in studies of pooled samples and should be used in combination with the Bayenv model. For the application to individual-based NGS data, further possible extensions of our model include acknowledgment of sequencing errors and a probabilistic approach to genotype calling (see Nielsen et al. 2011, for a discussion on SNP calling from NGS data).
The population genomic comparison of closely related populations that differ strongly in environmental variables has already yielded many great candidate loci [see, for example, altitude adaptation in Tibetans (Beall et al. 2010; Simonson et al. 2010; Yi et al. 2010)]. The methods developed here and elsewhere are part of realizing the power of these population comparisons. Such empirical studies also highlight the current deficiencies of such methods, as some of the best signals in these studies are not shared across populations with broadly similar environments and instead indicate that adaptation has occurred through independent mutations in the same gene or pathway. For example, high-altitude adaptation seems to have a different genetic basis in highland Ethiopian and Andean populations (Bigham et al. 2010; Scheinfeldt et al. 2012). Methods based on environmental correlations will fail to detect such cases, unless the data are split into the appropriate geographic subsets (e.g., Hancock et al. 2011c) on an appropriate geographic scale (Ralph and Coop 2010). While shared standing variation will surely be part of the adaptive response across geographically separated instances of similar environments, ideally we would have methods that could cluster signals at the level of the gene or pathway to allow putative cases of parallel adaptation to be identified (Daub et al. 2013). The development of such techniques poses an important challenge for future method development.
We thank Jeremy Berg, Gideon Bradburd, Yaniv Brandvain, Fabian Freund, Chuck Langley, Joe Pickrell, Jonathan Pritchard, Peter Ralph, Jeffrey Ross-Ibarra, Karl Schmid, and Alisa Sedghifar for helpful discussions and comments on earlier versions of the manuscript. We acknowledge the various comments we received from readers of the preprint of the manuscript available on the arXiv. The XTX estimates for the HGDP data were kindly made available through the HGDP selection browser (hgdp.uchicago.edu/) by Joe Pickrell. We also thank Alvaro Martinez Barrio and Leif Andersson for sharing the herring data. T.G. was supported by the German Federal Ministry for Education and Research (Synbreed, 0315528D) and by a Volkswagen Foundation scholarship (I/84225), affording him a visit to the University of California, Davis. G.C. was supported by a Sloan Foundation fellowship.
Communicating editor: M. A. Beaumont
- Received April 23, 2013.
- Accepted June 17, 2013.
- Copyright © 2013 by the Genetics Society of America