Natural populations of forest trees exhibit striking phenotypic adaptations to diverse environmental gradients, thereby making them appealing subjects for the study of genes underlying ecologically relevant phenotypes. Here, we use a genome-wide data set of single nucleotide polymorphisms genotyped across 3059 functional genes to study patterns of population structure and identify loci associated with aridity across the natural range of loblolly pine (Pinus taeda L.). Overall patterns of population structure, as inferred using principal components and Bayesian cluster analyses, were consistent with three genetic clusters likely resulting from expansions out of Pleistocene refugia located in Mexico and Florida. A novel application of association analysis, which removes the confounding effects of shared ancestry on correlations between genetic and environmental variation, identified five loci correlated with aridity. These loci were primarily involved with abiotic stress response to temperature and drought. A unique set of 24 loci was identified as FST outliers on the basis of the genetic clusters identified previously and after accounting for expansions out of Pleistocene refugia. These loci were involved with a diversity of physiological processes. Identification of nonoverlapping sets of loci highlights the fundamental differences implicit in the use of either method and suggests a pluralistic, yet complementary, approach to the identification of genes underlying ecologically relevant phenotypes.
ENVIRONMENTAL heterogeneity at multiple spatial scales influences the distribution of genetic variation across plant populations. Correlations between genetic variation and environmental gradients have been identified in a variety of plant species (Antonovics and Bradshaw 1970; Antonovics 1971; Hamrick and Allard 1972; Westfall and Conkle 1992; Linhart and Grant 1996; Mitton 1997; Savolainen et al. 2007), and such associations are often interpreted as evidence of natural selection (Allard et al. 1972; Dudley 1996a,b; Gram and Sork 2001; Vasemägi and Primmer 2005; Parisod and Christin 2008). Renewed interest in identifying correlations between environmental and genetic variation has emerged as high-throughput sequencing and genotyping platforms are applied to functional genetic variation within natural populations of non-model species (cf. Joost et al. 2007; Namroud et al. 2008).
Forest trees illustrate clear phenotypic adaptations to environmental gradients at multiple spatial scales (Morgenstern 1996; Savolainen et al. 2007 and references therein). An extensive history of provenance, common garden, and genecological studies has established the highly polygenic basis of these adaptive traits (Langlet 1971; Namkoong 1979). One of the major abiotic stressors for conifers is water availability. Traits related to water-deficit stress (WDS) have a genetic basis for many conifers and variation at these traits is adaptive (Zhang and Marshall 1994; Aitken et al. 1995; Johnsen et al. 1999; Olivas-Garcia et al. 2000; Brendel et al. 2002; González-Martínez et al. 2007; Baltunis et al. 2008). Physiological responses to drought involve many different cellular and molecular pathways (Newton et al. 1991; Ingram and Bartels 1996), and functional and gene expression studies in Arabidopsis have implicated several gene networks in WDS responses, as well as interactions between drought and cold-hardiness traits (Shinozaki and Yamaguchi-Shinozaki 2000, 2007; Bray 2004). Population genetic studies, moreover, have identified functionally diverse genes with skewed site-frequency spectra (González-Martínez et al. 2006), extreme allele-frequency differences across populations (Eveno et al. 2008), altered gene expression (Dubos and Plomion 2003; Watkinson et al. 2003; Yang and Loopstra 2005), or significant associations with WDS (González-Martínez et al. 2008) for multiple pine species distributed across strong precipitation gradients.
This study investigates the potential association between genetic variation at individual loci and aridity gradients for loblolly pine (Pinus taeda L.). The predominant abiotic gradient across the range of loblolly pine is water availability. Water use efficiency is a WDS-related trait and in loblolly pine is heritable, but displays significant non-additive variance as well as strong environmental influences (Warren et al. 2001) and genotype-by-environment interactions (Baltunis et al. 2008). The canonical approach to searching for environmental associations would be to correlate measures of genetic diversity (e.g., allele frequencies, heterozygosity) to environmental variation related to drought stress (Linhart and Grant 1996; Mitton 1997; Mitton et al. 1998; Vasemägi and Primmer 2005). One limitation to this approach is that environment is likely to be confounded with geography and, by proxy, overall genetic structure. Correction for population structure is common in association studies that aim at identifying markers contributing to phenotypes (cf. Yu et al. 2006). An obvious solution to the confounding of genetic structure with environmental variation, therefore, is to apply existing association approaches to environmental data, treating the environment methodologically as a phenotype.
Here, we use two genome-wide marker data sets to address the following questions: (1) What are the patterns of population structure across the range of loblolly pine? (2) What is the degree of confounding between environmental variation and population structure due to geography? (3) Which loci are associated with water availability across the range of loblolly pine? (4) Are loci with strong genotypic correlations with aridity also those with extreme allele-frequency differences among populations? In answering these questions, we highlight the need for further integration of environmental and genetic data in genome scans for loci subject to natural selection, as well as the promise of combining complementary approaches for the identification of functionally important genetic variation within natural populations (cf. Vasemägi and Primmer 2005).
MATERIALS AND METHODS
Loblolly pine (Pinus taeda L.) is distributed throughout the southeastern United States, ranging from Texas to Delaware. Its 370,000-km2 range is divided primarily by the Mississippi River Valley, with 60% of the distribution range located east of the Mississippi River (Al-Rabab'ah and Williams 2002). Isozyme and nuclear simple sequence repeat (SSR) loci illustrate moderate genetic differentiation between populations located to the east and west of the Mississippi River Valley, as well putative population contraction in the westernmost populations (Wells et al. 1991; Schmidtling et al. 1999; Al-Rabab'ah and Williams 2002, 2004; Xu et al. 2008). A review of phylogeographical patterns in unglaciated eastern North America identified six major patterns, of which loblolly pine conforms to the Mississippi River discontinuity (Soltis et al. 2006). The structure of this discontinuity is consistent with a dual Pleistocene refugial model, which has also been used to explain differential growth abilities, disease resistance, and concentrations of secondary metabolites among families located across this discontinuity (Wells and Wakeley 1966; Squillace and Wells 1981; Schmidtling 2003).
Needle tissue was collected from 907 largely unrelated trees sampled across the natural range of loblolly pine (Figure 1). Seven hundred of these trees are first-generation selections (i.e., trees grown from wild-collected seed) with known source localities. These samples are geo-referenced by county (ncounties = 188). The average number of sampled trees per county was 4 ± 6 (range: 1–67). The remaining 207 are second-generation selections (i.e., trees resulting from crosses between the first-generation selections). These trees also comprise two experimental populations currently being used for association mapping: Weyerhaeuser (cf. González-Martínez et al. 2007) and North Carolina State University (cf. P. Cumbie, A. Eckert, J. Wegrzyn, R. Whetten, D. Neale and B. Goldfarb, unpublished results). Total genomic DNA was isolated from each sample at the U.S. Department of Agriculture National Forest Genetics Laboratory (Placerville, CA) using DNeasy plant kits (Qiagen, Valencia, CA) following the manufacturer's protocol.
Marker discovery and genotyping:
We utilized two sets of molecular markers. The first set comprises 23 unlinked nuclear SSR markers selected from the PtTX marker set for medium to high polymorphism rate and full coverage of the linkage map (Auckland et al. 2002; González-Martínez et al. 2006, 2007). The second set comprises ∼23,000 single nucleotide polymorphism (SNP) markers, of which we chose 7216 for genotyping, that were identified through the resequencing of 7535 uniquely expressed sequence tag (EST) contigs in 18 loblolly pine haploid megagametophytes. These SNPs cover the entire linkage map for loblolly pine (TG accession: TG091; http://dendrome.ucdavis.edu/cmap/), with 117 ± 18 SNPs mapped on average per linkage group for a total of 1635 mapped SNPs with an average distance of 1.2 ± 1.1 cM between SNPs. Selection of SNPs for genotyping was based largely on quality scores derived from the original sequence data and not on functional or site annotations. This ensured thorough coverage of the available sequence resource for loblolly pine (cf. http://dendrome.ucdavis.edu/adept2/). Genotyping of SNPs utilizing the Infinium platform (Illumina, San Diego) was carried out at the University of California Davis Genome Center. Arrays were imaged on a Bead Array reader (Illumina), and genotype calling was performed using BeadStudio v. 220.127.116.11 (Illumina). Information regarding the discovery and annotation, as well as PCR, genotyping, and DNA sequencing protocols for both marker types is available in the supporting information, File S1. The complete data are available in File S2 and File S3.
For each marker locus we calculated observed (HO) and expected (HE) heterozygosity as well as Wright's inbreeding coefficient (FIS = 1 − HO/HE). Loci with extreme values of FIS (|FIS| > 0.25) were removed prior to analysis. We used Fisher exact tests (Guo and Thompson 1992) with Bonferroni corrections to test for Hardy–Weinberg equilibrium (HWE) for each SNP and SSR marker, respectively. All analyses were preformed using the R environment (R Development Core Team 2007).
Climate data were gathered from the WORLDCLIM 2.5-min geographical information system (GIS) layer using Diva-GIS version 5.4 (Hijimans et al. 2005; available at http://www.diva-gis.org/). Monthly minimum and maximum temperatures, monthly precipitation, and 19 bioclimatic variables were obtained from this layer (Table S1). The temperature and precipitation data were used to estimate potential evapotranspiration (PET) with the method of Thornthwaite (1948). An aridity index (AI) was defined as the ratio of precipitation to PET (File S1), with this ratio being defined quarterly. Annual quarters were defined starting with January 1 through March 31 as quarter one and are labeled as AI1 through AI4. We focus on aridity because it encapsulates water availability as a function of temperature and precipitation. Thus, the remainder of the article focuses solely on aridity.
Patterns of population structure:
Population genetic structure was analyzed by means of principal component analysis (PCA) on genotypes from individual trees. Briefly, PCA was performed on the normalized n × m matrix, M, of genotypes, where n is the number of trees and m is the number of loci. Similar analyses were conducted for SSR markers using a method that accounts for the dependence among alleles at a locus (van Heerwaarden et al. 2010). The eigenvalues corresponding to the principal components (PCs) were inspected to determine the number of major independent axes of genetic differentiation in the data. Following outlier removal and reanalysis of M, the significance of PCs was determined by comparing the value of each standardized eigenvalue to a Tracy–Widom distribution (Patterson et al. 2006). Outliers were defined as trees with PC scores >6 SDs away from the mean for any of the first 10 PCs. Trees were assigned to discrete genetic clusters on the basis of K − 1 significant PCs, where K is the number of clusters being considered (Paschou et al. 2007; van Heerwaarden et al. 2010). Specifically, Ward's hierarchical clustering algorithm was applied to the matrix of Euclidean distances, calculated from the significant PCs, and the resulting dendrogram was used to assign individuals to each cluster using the CUTREE function in R.
For comparison, we also used the program STRUCTURE version 2.2 to infer the number of genetic clusters and membership coefficients within those clusters using the 23 nuclear SSR markers (Pritchard et al. 2000; Falush et al. 2003). Analysis using the SNP data was avoided because of the lack of convergence among runs likely related to insufficient run times for the Markov chain Monte Carlo (MCMC) sampler (data not shown). We varied the number of genetic clusters (K) from 1 to 12, and for each value ran 50 independent MCMC simulations. Each run was carried out for 1.2 × 107 steps, with the first 2.0 × 106 steps being discarded as burn-in. We assumed further that allele frequencies were correlated among populations and that our data contained admixed trees. The optimal value of K was determined using the ΔK method (Evanno et al. 2005) and by inspection of the relationship between the log probability of the data and K. Average admixture coefficients in all cases were estimated for each value of K using the LargeKGreedy algorithm with 1000 random input orders as implemented in the program CLUMPP version 1.1 (Jakobsson and Rosenberg 2007). These values were visualized using bar plots constructed with DISTRUCT version 1.1 (Rosenberg 2004).
Associations between population structure, geography, and environment were studied by fitting a general linear model to each measure of structure (assignment probabilities, PCs) with per-county aridity indices and latitude and longitude as explanatory variables. The relation between the different environmental variables and genetic assignment was visualized by a biplot of the PCA on the environmental variables with plotted points colored according to their corresponding genetic cluster.
Environmental associations and outlier analysis:
Loci associated with environment were identified using a standard association mapping approach, substituting aridity for phenotype. Analysis was done separately for each variable and each SNP. Each tree was assigned a “phenotype” consisting of its corresponding county-level aridity index. Following Price et al. (2006), we used PCA analysis to correct for spurious associations due to confounding of ancestry and aridity. Briefly, PCA is performed as described above, but excluding the target SNP. Two vectors of ancestry-corrected residuals are obtained by multiple linear regression on environmental and genotypic values, using the k significant genetic PCs as independent variables. Association between genotype and aridity is described by the squared correlation r2 between the two vectors. For each SNP, scored in N individuals, the test statistic is calculated as (N − k − 1)r2, which is approximately χ2 distributed with one degree of freedom (Price et al. 2006). SNP loci showing the strongest association with different aridity indices were identified by Q–Q analysis of P-values. The magnitude of environmental differences among SNPs was evaluated using a general linear model with environment as a dependent variable and corrected genotypic values as explanatory variables. Differences of environment among SNP genotypes were evaluated using a general linear model with environment as a dependent variable and corrected genotypic values as explanatory variables. Multiple testing was accounted for using the false discovery rate method of Storey and Tibshirani (2003) with a significance threshold of Q = 0.05, although this method, or any multiple test correction method assuming multiple independent tests of the same null hypothesis, is conservative due to the correlations among environmental variables.
For comparison, we searched for patterns of adaptive differentiation among populations using fdist2 (Beaumont and Nichols 1996; Beaumont and Balding 2004); the populations were those identified using PCA or STRUCTURE. Outlier analyses were conducted only for the SNP data, as opposed to the SSR data, because these represent functional variation likely to be targets of natural selection. We used results from clustering of SSRs and SNPs to define populations because the range of loblolly pine is continuous, the number of discrete populations is relatively unknown, and studies employing FST outlier analysis in forest trees justify a priori population definitions with some form of structure analysis (Eveno et al. 2008; Namroud et al. 2008). We simulated 1.0 × 106 loci under two null models using the ms software (Hudson 2002): an island model and a Pleistocene refugial model (File S1, Figure S1). The latter model was used to account for the effects of historical demography on the null expectation of FST across loci (cf. Excoffier et al. 2009). An iterative approach was used to adjust the median FST (Weir and Cockerham 1984) and HE until they were within 5% of observed values. A modified form of the msstats software developed with the libsequence C++ library (Thornton 2003; available from: http://www.rilab.org/code/random_code.html) was used to estimate FST and HE, while the cplot and pvj programs, which are distributed as part of fdist2, were used to estimate 99% quantiles of the null distribution and P-values, respectively. Extreme values of FST conditional on expected heterozygosity were defined as those that lay above the 99% quantile of the null distribution.
To our knowledge, this is the first successful application of the Illumina Infinium genotyping platform to a non-model plant species. Further details concerning conversion rates and quality scores are available in File S1. We selected one SNP that was typed successfully per EST locus for further analysis (n = 3082 SNPs). Forty-five of these SNPs deviated significantly from HWE proportions, with 23 also having FIS values >0.25. These 23 were removed prior to further analysis. An ascertainment bias was also apparent for the frequency distribution of the minor allele across all 3059 SNP loci (File S1, Figure S2). Call rates across the 23 SSR loci averaged 95%, with 18 of the 23 loci deviating significantly from HWE proportions. All 23 SSR loci had values of FIS <0.25 when only first-generation selections with known source localities were considered.
Integration of these data sets resulted in 622 trees sampled from 167 of the 188 county locations typed for 3059 SNPs and 23 SSRs. Summaries of these data are located in Table 1, with further descriptions of genotyping results presented in File S1 and Figure S3. These data were used to infer patterns of population structure and to search for multivariate and locus-specific environmental associations.
Patterns of environmental heterogeneity:
Strong correlations exist among environmental measures, as well as between those measures and geographical location (Figure S4). In general, sites located along the Gulf Coast Plain have the highest annual precipitation (1650 mm), whereas sites located in the northeast and southwest have the lowest (980 mm). The highest average annual temperatures are observed in the southeast (20.8°), while sites located in the northeast have the lowest (12.8°). These trends are apparent for measures of aridity both spatially and temporally (Figure 1). In general, all sites have water surpluses in the winter, with those located along the Gulf Coast Plain having the largest. Water deficits are apparent in the Northeast (spring) and Southwest (summer) as the year progresses. By fall, all sites have water surpluses, with those located in the extreme southeast being the driest.
Patterns of population structure:
Patterns of population structure for loblolly pine are accounted for primarily by the Mississippi River discontinuity. This is apparent for results from PCA and STRUCTURE using both SSRs and SNPs. PCA analysis on the SNP data revealed the presence of seven significant PCs defining eight genetic clusters. Visual inspection of the eigenvalues, however, shows the presence of two major PCs explaining 2.4% of the total variation (56% of the significant variation), which indicates the presence of three clearly differentiated clusters (Figures 2 and 3). The remaining five significant clusters lack a strong geographical basis. These three clusters are largely divided along the Mississippi River Valley, with a further division of the eastern cluster into Gulf and Atlantic Coast clusters. Similar results were observed for the SSR markers using PCA, with two of the three significant PCs clustered across the Mississippi River discontinuity (Figures 2 and 3).
The optimal value of K was 2 as determined by the ΔK statistic using STRUCTURE with the SSRs. Inspection of bar plots for the admixture coefficient when K = 2, averaged across the 50 replicated MCMC runs, indicated that each cluster is geographically based, with one cluster corresponding to trees that are located primarily west of the Mississippi River and the other to those located to the east (Figure S5). The use of ΔK to choose an optimal value of K = 2, however, is difficult, because ΔK in this case compares the lack of structure (K = 1) to some structure (K = 2 and 3). We chose to use values of 3 and 5 for further analysis, because K = 3 had the second largest value of ΔK, and K = 5 is where the median value of the log probability of the data leveled off (Figure 2). Geographical trends in cluster assignments for these values of K reflected the west–east division, as well as further divisions of the eastern cluster along a southwest to northeast axis (Figure 3).
The method of clustering did not dramatically affect the assignment of trees to clusters or the geographical basis of those clusters (Figure 3). The use of PCA and STRUCTURE on the SSR data yielded similar patterns for K = 3, with values of FST calculated for each assignment scheme being significantly correlated (r2 = 0.50, P < 2.2 × 10−16). Some discrepancy between the methods was apparent. For example, PCA placed the trees located in Livingston County, Louisiana, into the eastern population, while STRUCTURE associated them with the western population, which has been noted previously (Schmidtling 2003).
Multilocus population structure was strongly confounded with aridity. Geography was correlated with patterns of population structure and aridity for all clustering methods and choices of K (Table 2, Table S2). The first two PCs from PCA using the SNP data were correlated with geography and aridity. The four aridity indices, longitude, and latitude together explained 61% of the variance in PC1 and 39% of the variance in PC2, thus illustrating that the genetic clusters identified previously differ with respect to aridity (Figure S6). All aridity indices also showed significant correlations with either longitude or latitude (Figure S4). This was reflected by the fact that aridity and geography by themselves explained much of the variance in PC1 (r2 = 0.46 for all four aridity indices, P < 2.2 × 10−16; r2 = 0.51 for latitude and longitude, P < 2.2 × 10−16) and PC2 (r2 = 0.25, P < 2.2 × 10−16 for all four aridity indices; r2 = 0.16 for latitude and longitude, P < 2.2 × 10−16). Three of the four aridity indices (AI1, AI2, and AI4) were also significantly correlated (minimum r2 = 0.53, P < 2.2 × 10−16). A visual representation of the correlations between these variables and genetic structure is given in Figure 2. Similar yet weaker correlations were apparent for PCs and STRUCTURE assignment probabilities using the SSR data (Table 2).
Environmental associations and outlier analysis:
Association analysis resulted in the identification of five loci with significant correlations to aridity (Table 3, Figure S7). The strongest associations were between four loci and aridity during the second quarter (AI2), with subsets of these loci also associating with aridity during the first (AI1) and fourth quarters (AI4). Only a single locus was found to associate significantly with aridity during the third quarter (AI3). The four significant loci associated with aridity during quarters 1, 2, and 4 together explained 9.2% of the variance in AI2 (Figure 4), 5.6% of the variance in AI1, and 4.7% of the variance in AI4 (15.2%, 9.1%, and 8.1% of the ancestry-corrected environment). For comparison, the average SNP explained 0.1% ± 0.2% of the variance in AI2 (0.2% ± 0.3% for ancestry-corrected environment). The single locus associated with AI3 explained 2.6% of the variation (3.0% of variation in ancestry-corrected environment). All five significant SNPs were located in loci with high sequence similarity to coding sequences in Arabidopsis that primarily affect abiotic and pathogenic stress responses (see discussion; Table 3). Two of these five SNPs are mapped to linkage group 3. Three of the five SNPs are located in synonymous positions, while the remaining two are located in an intron and a 3′ UTR.
Clusters defined using the SSR markers resulted in few outlier SNPs, patterns indicative of residual within cluster substructure and multilocus values of FST <1% (Figure 5). Most of these outliers can be accounted for by including a dual Pleistocene refugia model as the null model. This model fit the frequency distribution of the minor allele as well as, if not better than, that of drift alone or an island model (File S1, Figure S3). Three loci had significantly elevated FST estimates, however, when K = 5.
A different pattern emerges when the cluster memberships are based on SNP loci. In this case, 24 and 15 loci are significant outliers after accounting for demography when K = 3 or K = 5, respectively. In both cases, the multilocus values of FST (0.022 for K = 3, 0.016 for K = 5) are more similar to those reported previously (Schmidtling 2003). Only 7 loci are shared between lists of outlier loci for each value of K; however, the outlier loci unique to a particular value of K were always located in the upper tail of the distribution for FST of the alternative value of K. We focus on the results when K = 3; results for K = 5 are shown in Table S3, Figure S8 and Figure S9. All outlier loci have values of FST that are ∼6- to 12-fold larger than the average FST across all SNPs (Table 4).
Nineteen of the 24 loci identified as outliers had significant tBLASTx hits to annotated loci in Arabidopsis (Table 4). The remaining 5 loci had little sequence similarity to known protein sequence in plants, although high nucleotide sequence similarity was detected in Sitka spruce [Picea sitchensis (Bong.) Carr.] EST libraries for 4 of those 5 using blastn. Putative functions of gene products in Arabidopsis range from growth regulators to pathogenic stress response. Approximately half of the SNPs located in outlier loci are nonsynonymous point mutations, with the remainder located largely in synonymous positions. Eleven of the 24 outliers are mapped and span five different linkage groups. Six of those 11 are located in close proximity (<2 cM) on linkage group 8.
There was no overlap between the loci associated with aridity gradients and those identified as outliers. Loci associated with aridity gradients had values of FST within the range of the mean value across loci (FST = 0–0.035), while FST outliers showed no meaningful correlation with aridity gradients (maximum r2 = 0.008). Correlations between FST and χ2 statistics for aridity during each quarter, moreover, were nonsignificant and the percentage variance explained was <0.5% in all cases. The genomic location of both sets of loci was different, with two of the five loci associated with aridity located ∼4 cM apart on linkage group 3 and the FST outliers spread across linkage groups 1, 3, 6, 7, and 8 (Figure 6).
Population structure in loblolly pine:
This study presents the first genome-wide analysis for population structure of functional genetic variation among natural populations of loblolly pine. Our results are broadly consistent with previous conclusions regarding patterns of population structure for loblolly pine and conifers in general (Ledig 1998; Scmidtling et al. 1999; Al-Rabab'ah and Williams 2002, 2004; Schmidtling 2003; González-Martínez et al. 2006, 2007; Xu et al. 2008): genetic structure is weak, primarily accounted for by the Mississippi River discontinuity, and is consistent with a dual Pleistocene refugial model. Our results also lay the foundation for performing population structure corrections during association mapping.
We detected further substructure across the range of loblolly pine that was apparent only in the SNP data set. While this result could possibly be a function of the ascertainment bias for the SNP data, such biases are not expected to affect the relative placement of trees in PC space (McVean 2009). However, the same three genetic clusters accounted for most of the geographical patterns observed for levels of population structure across values of K ranging from 3 to 8 (Figure S10), thus justifying our focus on K = 3. This is also reflected in the correlations of genetic structure with aridity (Table 2) because two to three clusters accounted for most of the significant correlations. Geographical trends in cluster assignments show clearly that trees in Florida, a putative refugium, are strongly clustered with those along the Atlantic rather than Gulf Coastal Plain and that trees located to the west of the Mississippi River are genetically distinct. This pattern supports expansion from dual refugia, one located in southern Florida and one in southern Texas or Mexico (Schmidtling 2003).
Novel to the analyses presented here is the placement of the Gulf Coast trees in a unique genetic cluster (Figure 3C). This result is consistent with a scenario of expansion with differentiation or admixture. These trees were intermediate to the eastern and western clusters along the first PC, which is consistent with admixture. Testing of these hypotheses, however, is complicated by introgression between loblolly pine and closely related sympatric pines (Chen et al. 2004), shared ancestral polymorphism (Syring et al. 2007), and sample size differences among the populations considered (McVean 2009).
Environmental associations and FST outliers:
Environmental association analysis identified five genes associated with aridity gradients. The primary functions of gene products encoded by these loci were abiotic and biotic stress responses. All five loci have putative orthologs in Arabidopsis that are responsive to abscisic (ABA) or jasmonic (JA) acid, two plant hormones with well-documented correlations to abiotic stress responses (Glazebrook et al. 2003; Wong et al. 2006; Wasternack 2007; Fabro et al. 2008; Mizuno and Yamashino 2008). In addition, gene expression for four of the five loci in Arabidopsis or close relatives was repressed at low temperatures (Kreps et al. 2002; Vergnolle et al. 2005; Wong et al. 2006), with one locus significantly upregulated in the presence of chitin (Libault et al. 2007) and another upregulated during infection by biotrophic fungi (Fabro et al. 2008). The protein products encoded by these genes also likely affect general osmotic stress responses because drought cold-tolerance responses are often correlated (Beck et al. 2007). These five genes did not have extreme values of FST for any of the six clustering assignments, with values similar to the mean across all loci in each case.
FST outliers encoded proteins involved in a wide range of plant processes including response to viral infection (0–13230; Nicaise et al. 2007), cuticular wax biosynthesis (2–1205; Costaglioli et al. 2005), floral and gametophytic development (2–3591; Tan and Irish 2006), and structural components of photosystem complexes (CL1799Contig1; Jansson 1994). Additional loci have been previously associated with growth (0–14415), leaf nitrogen content (2–1087), and pitch canker resistance (1–3324) for loblolly pine (P. Cumbie, A. Eckert, J. Wegrzyn, R. Whetten, D. Neale and B. Goldfarb, unpublished results; T. Quesada, V. Gopal, W. P. Cumbie, A. J. Eckert, J. L. Wegrzyn, D. Neale, B. Goldfarb, D. Huber, G. Casella and J. Davis, unpublished results). None of these 24 loci, nor the additional 8 FST outlier loci for K = 5, showed strong correlations with aridity.
Many of the outlier loci showed striking geographical trends along the major axis of differentiation (southwest to northeast), with some approaching fixation of different alleles at opposite ends of this axis (Figure S11). One of these was locus CL3949Contig1, which encodes a peptidyl-prolyl cis-trans isomerase affecting protein folding and signal transduction in Arabidopsis (Chou and Gasser 1997). A SNP in this locus was an FST outlier when K = 5 (Table S3) and was in the upper tail of the distribution of FST for K = 3. A putative ortholog of this locus in coastal Douglas fir [Pseudotsuga menziesii (Mirb.) Franco var. menziesii] was also identified as an FST outlier during a scan across candidate genes for associations with cold-hardiness phenotypes (Eckert et al. 2009). Trees in the northeast part of the range for loblolly pine were almost fixed for the A allele at this SNP, whereas the C allele was at high frequency in the southwest. Genome-wide data sets thus open the door to comparative association analysis across natural populations of conifers, where replication is across evolutionary lineages sampled across similar environmental gradients (cf. Turner et al. 2008).
Comparison between environmental association and FST outlier approaches:
A multitude of correlations between environmental variables and genetic variation has been noted across diverse taxa (Mitton 1997). To our knowledge, this is the first study to use the analytical machinery of association analysis to discover environmental associations between functional genetic markers and environmental gradients. This approach identified an independent set of genes relative to the FST outlier approach, which highlights the fundamental differences between these methods. The association approach used here assesses the effect of natural selection along specific environmental gradients, while FST outlier methods aim to identify loci influenced by natural selection driving allele-frequency differences among populations (i.e., ancestral groups) and are thus agnostic about the environmental gradients driving the extreme values of FST. We highlight the advantages and appropriate uses of each method by comparing and contrasting the interpretations attributed to significant results for these two approaches (see also Latta 1998; Barton 1999; Le Corre and Kremer 2003).
Environmental gradients are defined a priori in the association approach, and correlation with them is tested after corrections for population structure. Significant associations with an environmental gradient, therefore, likely represent those polymorphisms underlying functional responses to that gradient. In contrast, post hoc interpretations of environmental differences are attributed to the cause of FST outliers. Geography, however, can create genetic structure that is correlated to environment solely through neutral processes such as barriers to gene flow, distance effects, and historical population size and range changes (Manel et al. 2003; Storfer et al. 2007). As seen here, aridity does differ significantly among the identified genetic clusters, and a tempting conclusion to draw about FST outliers is one related to WDS. This implies that caution should be used when interpreting the biological significance of outlier loci, especially if overall correlations of structure to geography and environmental heterogeneity have not been assessed.
The association approach does not require identification of discrete genetic clusters. As illustrated here, the process of defining populations affects the identification of FST outliers. Although geographical patterns of cluster membership did not change dramatically between methods, they did become clearer in the full SNP data set. Differing definitions of populations may, therefore, lead to spurious results even if the island model is a good descriptor of the underlying genetic structure, unless the “islands” are known without error. A better approach in this case would be to define populations on the basis of environment, while accounting for within-population substructure, rather than neutral marker-based estimates. This is often done for species with subdivided ranges by organizing population samples along environmental or geographical gradients (cf. Eveno et al. 2008).
The identification of genetic clusters may be difficult or inappropriate for species such as loblolly pine that are distributed continuously across large geographical expanses and for which paleobotanical (cf. references in Schmidtling 2003) and population genetic evidence suggests historical fluctuations in population size (Waples and Gaggiotti 2006; Guillot 2009). Invocation of the island model in these cases to derive the null distribution of test statistics will result in undesirable statistical behaviors. Populations of most North American conifers are likely far from demographic equilibrium, and acknowledging this may lead to better inferences of selection (Westfall and Millar 2004). As illustrated here, incorporation of demographic models into FST approaches dramatically reduces the number of outliers detected (Figure 5). Even under equilibrium, however, misspecification of the hierarchical nature of population structure may increase the false-positive rate by nearly an order of magnitude (Excoffier et al. 2009). Results from analyses of FST outlier analyses that do not consider these scenarios (cf. Namroud et al. 2008) should be viewed with some skepticism.
In contrast with association methods, the FST outlier approach identifies loci independently of environment. Genome-wide scans for outliers thus enable identification of loci underlying phenotypic responses to diverse environmental gradients correlated to ancestry (but see McKay and Latta 2002; Le Corre and Kremer 2003). In contrast, the association approach has low power when there is complete confounding between environment gradients and axes of ancestry. Thus, lists of outlier loci are useful because they may allow a reverse engineering of the relevant phenotypes and possibly the environmental gradients on which the phenotypes are selected through comparative genomics, functional experimentation, and GIS analysis. This would in principle help clarify the relationship between genotype, adaptive phenotypes, and fitness (Luikart et al. 2003; Storz 2005; Ross-Ibarra et al. 2007).
We are not the first to note differences between results obtained using association vs. FST outlier approaches. In an analysis of SNPs associated with human diseases, there was no significant difference in FST across associated vs. randomly chosen genes (Lohmueller et al. 2009). This suggests that environment, as opposed to ancestry, plays a significant role in disease risk for many causative polymorphisms in humans. Analogously, natural selection may not be driving large-scale adaptive differences among lineages of loblolly pine, but may instead be selecting genotypes along environmental gradients regardless of ancestry. Selection may thus be operating across different spatial and evolutionary scales in forest trees, and our ability to detect it will depend upon the scale of sampling and the method employed.
Alternatively, quantitative genetic theory predicts that clinal variation for phenotypic traits is derived from small frequency differences at many loci (Barton 1999), with the among-population component of linkage disequilibrium across loci being largely responsible for changes in phenotypic means (Le Corre and Kremer 2003). In this case, loci underlying a quantitative trait are expected to illustrate only small allele-frequency differences among populations, but manifest large phenotypic effects (Latta 1998; McKay and Latta 2002; Le Corre and Kremer 2003). An observation consistent with this expectation was made previously in European aspen (Populus tremula L.) for two SNPs associated with bud phenology (Ingvarsson et al. 2008). This is consistent with the low FST for SNPs associated with aridity and suggests that scans for FST outliers may miss many loci underlying adaptive phenotypes.
While our approach to identifying ecologically relevant genetic variants has numerous advantages, it is not without its limitations. Within-county sampling may be more effective for association analyses than the county-level approach taken here because fine-scale environmental variation may be of considerable biological relevance (Mitton et al. 1998; Parisod and Christin 2008). Alternatively, allele-frequency approaches utilizing local populations have also been shown to be powerful in identifying loci affected by diversifying selection (Hancock et al. 2008). Our SNP data, moreover, represent only a fraction of the coding portion of the loblolly pine genome, and further SNP discovery and analysis would enable truly genomic approaches to ecologically relevant genetic variation. Our FST outlier approach is limited by the assumption that our demographic model reflects the true phylogeographical history of loblolly pine. While model misspecification can lead to erroneous results, we note that our model is likely more biologically plausible than the standard model used in FST outlier analyses and in fact explains the observed data as well as, if not better than, drift or the island model alone. Finally, future approaches using sets of targeted candidate genes, as opposed to genome scans, may be more fruitful in identifying loci correlated with specific environmental gradients.
We identified general patterns of population structure, environmental correlates to those patterns, and 29 SNP loci that were associated with aridity (n = 5) or had extreme values of FST (n = 24) for loblolly pine. The 5 SNP loci associated with aridity were located in genes encoding proteins primarily involved with biotic and abiotic stress responses, while the 24 outlier loci were located in genes encoding proteins involved with a diverse set of physiological functions. Increased awareness of the strengths, weaknesses, and complementarity of the methods employed during scans for ecologically important genetic variation is needed as truly population genomic data sets emerge for non-model species.
We thank Sedley Josserand, Dennis Deemer, Craig Echt, Valerie Hipkins, Vanessa K. Rashbrook, Charles M. Nicolet, John D. Liechty, Benjamin N. Figueroa, and Gabriel G. Rosa for laboratory and bioinformatics support. We also thank Andrew Bower for advice on GIS-derived climate data, W. Patrick Cumbie and Barry Goldfarb for providing geographic information for sampled trees, and Aslam Mohamed for creating the linkage map. The manuscript was also much improved by critical and insightful comments made by two anonymous reviewers. This work was supported by a National Science Foundation (grant no. DBI-0501763).
Supporting information is available at http://www.genetics.org/cgi/content/full/genetics.110.115543/DC1.
Communicating editor: J. Wakeley
- Received February 12, 2010.
- Accepted April 20, 2010.
- Copyright © 2010 by the Genetics Society of America