Inferring Continuous and Discrete Population Genetic Structure Across Space

Gideon S. Bradburd, Graham M. Coop and Peter L. Ralph
Genetics September 1, 2018 vol. 210 no. 1 33-52; https://doi.org/10.1534/genetics.118.301333
Gideon S. Bradburd
Ecology, Evolutionary Biology, and Behavior Graduate Group, Department of Integrative Biology, Michigan State University, East Lansing, Michigan 48824
  • For correspondence: bradburd@msu.edu
Graham M. Coop
Center for Population Biology, Department of Evolution and Ecology, University of California, Davis, California 95616
Peter L. Ralph
Institute of Ecology and Evolution, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97403

Abstract

A classic problem in population genetics is the characterization of discrete population structure in the presence of continuous patterns of genetic differentiation. Especially when sampling is discontinuous, the use of clustering or assignment methods may incorrectly ascribe differentiation due to continuous processes (e.g., geographic isolation by distance) to discrete processes, such as geographic, ecological, or reproductive barriers between populations. This reflects a shortcoming of current methods for inferring and visualizing population structure when applied to genetic data deriving from geographically distributed populations. Here, we present a statistical framework for the simultaneous inference of continuous and discrete patterns of population structure. The method estimates ancestry proportions for each sample from a set of two-dimensional population layers, and, within each layer, estimates a rate at which relatedness decays with distance. This thereby explicitly addresses the “clines versus clusters” problem in modeling population genetic variation, and remedies some of the overfitting to which nonspatial models are prone. The method produces useful descriptions of structure in genetic relatedness in situations where separated, geographically distributed populations interact, as after a range expansion or secondary contact. We demonstrate the utility of this approach using simulations and by applying it to empirical datasets of poplars and black bears in North America.

  • population genetics
  • isolation by distance
  • population structure
  • model-based clustering

A fundamental quandary in the description of biological diversity is the fact that diversity shows both discrete and continuous patterns. For example, reasonable people can disagree about whether two populations are separate species because the process of speciation is usually gradual, and so there is no set point in the continuous divergence of populations when they unambiguously become distinct species. The issue of identifying meaningful biological subunits extends below the species level, as patterns of phenotypic and genetic diversity within and among populations are shaped by continuous migration and drift, as well as by more discrete events, such as rapid expansions, bottlenecks, rare long-distance migration, and separation by geographic barriers. Both discrete and continuous components are required to accurately describe most species’ patterns of genetic relatedness.

From a practical standpoint, we often need to identify somewhat separable populations from which individuals are sampled (Wright 1949), even while acknowledging continuous processes. Delineating populations is useful for systematics and for informing conservation priorities (Moritz 1994; Waples 1998; Moritz et al. 2002). Furthermore, we often need to identify subsets of individuals resulting from reasonably coherent evolutionary histories for downstream analyses to learn about population history and adaptation. Conversely, the substantial information available from continuous, geographic differentiation (e.g., adaptation along a climatic gradient) can be confounded by discrete historical processes (e.g., admixture), requiring methods that can disentangle the two.

There have been many methods proposed to characterize population genetic structure, including generating population phylogenies (Cavalli-Sforza and Piazza 1975; Pickrell and Pritchard 2012), dimensionality-reduction approaches such as principal components analysis (Menozzi et al. 1978; Price et al. 2006; Novembre and Stephens 2008; Meirmans 2009), and model-based clustering approaches (e.g., Pritchard et al. 2000; Corander et al. 2003; Falush et al. 2003; Guillot et al. 2005; Huelsenbeck and Andolfatto 2007; Alexander et al. 2009; Hubisz et al. 2009; Lawson et al. 2012; Raj et al. 2014; Caye et al. 2018). Each of these methods performs best in particular situations, but many can give misleading results when applied to data that show a continuous pattern of differentiation, such as that produced by geographic isolation by distance (Wright 1943; Novembre and Stephens 2008; Frantz et al. 2009). Here, we will focus on model-based clustering, the most widely used class of approaches for population delineation. (We note that the problem of identifying population clusters is distinct from, though of course related to, the problem of detecting barriers to gene flow between populations; e.g., Barton 2008; Bradburd et al. 2013; Petkova et al. 2016; Ringbauer et al. 2018.) Existing model-based clustering methods model each individual’s genotypes as random draws from a set of underlying, unobserved population clusters, each with a characteristic set of allele frequencies, which are estimated. These underlying frequencies are identical for all individuals assigned to a cluster, regardless of their spatial location. Spatial information has been incorporated into some of these methods, by, for example, placing spatial priors on cluster membership (Guillot et al. 2005; Caye et al. 2018), but this does not address the underlying issue that these methods assume that allele frequencies are constant in a cluster across the species’ range.

Isolation by distance refers to a pattern of increasing genetic differentiation with geographic separation, which occurs when geographically restricted dispersal allows genetic drift to build up differentiation between distant locations (Wright 1943). Theoretical work, mostly derived from “stepping-stone” models (Kimura and Weiss 1964; Sawyer 1976; Shiga 1988), gives us some analytical predictions for isolation by distance (Malécot 1969; Slatkin 1985; Epperson 2003), and some theory has been derived for continuous space (Nagylaki 1978; Nagylaki and Barcilon 1988; Barton et al. 2002), but substantial work remains to be done (Barton et al. 2013). Given the generality of the circumstances that generate a pattern of isolation by distance, it is unsurprising that isolation by distance is very widespread in nature (Meirmans 2012; Sexton et al. 2014).

The ubiquity of isolation by distance presents a challenge for models of discrete population structure, as it is frequently difficult to determine whether observed patterns of genetic variation are continuously distributed across a landscape, or instead are partitioned in discrete clusters. This problem can be compounded if sampling is done unevenly or discretely across a population or species’ range, and has given rise to a debate in the population genetic literature about how best to describe sets of individuals using continuous clines and discrete clusters (e.g., Serre and Pääbo 2004; Rosenberg et al. 2005).

Most existing model-based clustering methods are based on a discrete set of clusters, and so tend to partition continuous variation into spurious clusters with spatially autocorrelated cluster membership (Frantz et al. 2009; Meirmans 2012). In analyses of empirical datasets, which often show strong isolation by distance, model-based clustering approaches will therefore tend to overestimate the number of discrete clusters present.

To address this, we set out to develop a model-based clustering method that, when possible, uses isolation by distance to explain observed genetic variation. With an explicit spatial component, discrete population structure need only be invoked when genetic differentiation in the data deviates significantly from that expected given geographic separation. In this paper, we model genetic variation in genotyped individuals as partitioned within or admixed across a specified number of discrete layers, within each of which relatedness decays as a parametric function of the distance between samples. We also implement a cross-validation approach for comparing and selecting models across different numbers of layers, and we demonstrate the utility of our approach using both simulated and empirical data. The implementation of this method, conStruct (for “continuous structure”), is documented and available for general use as an R package at https://github.com/gbradburd/conStruct.

Materials and Methods

Data

The statistical framework of our approach is conceptually similar to that in Wasser et al. (2004), Bradburd et al. (2013), and Bradburd et al. (2016), although we use a somewhat different summary statistic than in this previous work. The model works with allele frequencies at L unlinked, biallelic single nucleotide polymorphisms (SNPs) genotyped across N samples. Each “sample” may be a single individual, a collection of individuals from a location, or frequencies estimated from pooled sequencing. From these we compute the allelic covariance between samples i and j, denoted $\hat{\Omega}_{i,j}$, as the expected covariance of distinct individual alleles chosen from each of the two samples at a random locus. More precisely, suppose that we pick a random biallelic locus uniformly from the genome, pick a random “reference” allelic state from the two alleles seen at that locus, and, in each sample, draw one random allele, recording $X_i = 1$ if the allele drawn in sample i matches the random reference, and $X_i = 0$ otherwise. Then,

$$\hat{\Omega}_{i,j} = \mathrm{Cov}(X_i, X_j). \tag{1}$$

Because we randomly choose the reference allele, each $X_i$ behaves marginally as a fair coin: in particular, $\mathbb{E}[X_i] = 1/2$, so $\mathrm{Var}(X_i) = 1/4$ for every i, and all information enters through correlations.

Although we describe this as a covariance between individually drawn alleles, $\hat{\Omega}_{i,j}$ is in fact also the covariance between the allele frequencies of a randomly chosen allele in samples i and j, as long as $i \neq j$. The choice of allele does not affect subsequent calculations, and so may be arbitrary, and $\hat{\Omega}_{i,j}$ can be calculated as (derived in Allelic covariance and inference):

$$\hat{\Omega}_{i,j} = \frac{1}{L} \sum_{\ell=1}^{L} \left(\hat{f}_{i,\ell} - \tfrac{1}{2}\right)\left(\hat{f}_{j,\ell} - \tfrac{1}{2}\right). \tag{2}$$

Here $\hat{f}_{i,\ell}$ is the allele frequency in the $i$th sample at locus $\ell$. This definition of covariance differs from the usual “genetic covariance” (McVean 2009) in that (a) we do not subtract locus means (to make the statistic insensitive to sample configuration), and (b) we randomly choose a reference allele at each locus (to retain insensitivity to choice of reference allele). As noted in Petkova et al. (2016), for $i \neq j$ this can also be calculated as $\hat{\Omega}_{i,j} = 1/4 - \pi_{i,j}/2$, where $\pi_{i,j}$ is the genetic distance calculated from those L sites, i.e., the proportion of sites at which random samples from i and j differ.
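As a minimal sketch of the calculation just described, the snippet below computes the off-diagonal allelic covariance for two distinct samples from their per-locus allele frequencies, assuming the statistic is the mean over loci of (f_i - 1/2)(f_j - 1/2). It also computes the pairwise genetic distance so that the relationship "covariance = 1/4 - distance/2" can be checked numerically. The function names and the toy frequencies are illustrative and are not part of the conStruct package.

```python
def allelic_covariance(freqs_i, freqs_j):
    """Mean over loci of (f_i - 1/2) * (f_j - 1/2) for two DISTINCT samples.

    freqs_i, freqs_j: per-locus frequencies of an arbitrarily chosen allele.
    """
    n_loci = len(freqs_i)
    total = sum((fi - 0.5) * (fj - 0.5) for fi, fj in zip(freqs_i, freqs_j))
    return total / n_loci


def genetic_distance(freqs_i, freqs_j):
    """Proportion of loci at which one random allele from each sample differs."""
    n_loci = len(freqs_i)
    total = sum(fi * (1 - fj) + (1 - fi) * fj for fi, fj in zip(freqs_i, freqs_j))
    return total / n_loci


# Toy data (made up): four loci, two samples.
sample_a = [0.0, 0.5, 1.0, 0.25]
sample_b = [0.0, 1.0, 0.5, 0.75]

cov_ab = allelic_covariance(sample_a, sample_b)
dist_ab = genetic_distance(sample_a, sample_b)
print(cov_ab, 0.25 - dist_ab / 2)  # the two values should agree
```

Flipping which allele is counted at any locus (replacing f with 1 - f in both samples) leaves the result unchanged, which is why the choice of allele can be arbitrary.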

Continuous and discrete differentiation

Clustering approaches to describing genetic variation are useful because population history can often be meaningfully described on a coarse scale by interactions between discrete “populations” whose relationships are delimited by patterns of glaciation, large-scale migration, mountain ranges, and the like. Here we add a spatial component within each such discrete historical component, which we refer to as a set of “layers” that overlay the modern map. We imagine each layer as a geographically distributed population that extends over the entire sampled range of the populations. As depicted in Figure 1, each sample is composed of a mixture of contributions from each of these layers, with the relative contributions of each layer described by a set of “admixture proportions” (the $w_i^{(k)}$). These layers thus take the place of “clusters” in clustering methods, but we do not adopt this term, as “spatial cluster” suggests a clustering in space, while our layers may contribute to genetic variation across the entire geographic range.

Figure 1

Schematic of our method for an example value of K. Spatial autocorrelation of allele frequencies within each layer is depicted by color gradients, and $G^{(k)}$ denotes the covariance shared by samples with ancestry entirely in the kth layer. Sampled populations on the landscape are inferred to be admixed between these layers; the ith sample draws proportion $w_i^{(k)}$ of its ancestry from layer k. For convenience, each layer is depicted as a small square, but in fact, each layer exists everywhere in the sampled area, so the small dashed circles on each layer show where the location of the highlighted admixed sample intersects each layer.

Within each of these layers, allele frequencies have positive covariance at geographically close locations, but this covariance is allowed to decay as geographic distance increases. This pattern of spatial decay reflects how migration between nearby spatial regions homogenizes allele frequency changes that arise locally due to drift, but less effectively homogenizes geographically distant regions, resulting in a continuous pattern of isolation by distance within each layer. There is a fixed amount of covariance between layers, irrespective of spatial location. Within each layer, allele frequencies are expected to change gradually with distance, but observed frequencies can change abruptly at many loci if the proportions of ancestry that individuals derive from each layer (the admixture proportions) do so as well.

To allow flexibility in the form of the decay of allelic covariance with geographic distance within each layer, we define the covariance within layer k between samples i and j to be:

$$G^{(k)}_{i,j} = \alpha_0^{(k)} \exp\left(-\left(\alpha_D^{(k)} D_{i,j}\right)^{\alpha_2^{(k)}}\right) + \phi^{(k)}, \tag{3}$$

where the superscript $(k)$ denotes parameters specific to the kth layer. The quantity $D_{i,j}$ is the observed geographic distance between samples i and j, and the $\alpha^{(k)}$ parameters control the shape of the decay of covariance with distance in the layer. Our choice of a powered-exponential decay, as parameterized by the αs, is a flexible and standard choice in spatial statistics (Diggle et al. 1998), and is not chosen to match a particular population genetics model. The $\phi^{(k)}$ is a parameter that describes the background covariance within the layer. If two samples draw 100% of their ancestry from layer k, then their within-layer covariance under the model is $G^{(k)}_{i,j}$; if they are furthermore geographically very close ($D_{i,j} \approx 0$), this covariance is $\alpha_0^{(k)} + \phi^{(k)}$. If the geographic distance between them is very large, their covariance will be equal to the background level $\phi^{(k)}$ within the layer. The “shared drift” parameter $\phi^{(k)}$ is analogous to the branch length connecting the kth population to the population ancestral to all modeled layers (see, for example, Patterson et al. 2012; Peter 2016), although they cannot be directly compared because we are modeling the allelic, rather than genetic, covariance. In “Model rationale: drift, admixture, and space” we lay out a simple model of allele frequencies underlying this covariance model.
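The behavior of a powered-exponential decay at the two distance extremes can be illustrated with a short sketch. It assumes the within-layer covariance has the form alpha0 * exp(-(alphaD * D)^alpha2) + phi; the parameter names and the numeric values are illustrative assumptions and are not estimates from any dataset.

```python
import math


def layer_covariance(distance, alpha0, alphaD, alpha2, phi):
    """Powered-exponential decay of covariance with distance, plus a background term.

    Assumed form: alpha0 * exp(-(alphaD * distance) ** alpha2) + phi,
    with distance >= 0 and alpha2 > 0.
    """
    return alpha0 * math.exp(-((alphaD * distance) ** alpha2)) + phi


# Illustrative (made-up) parameter values for a single layer.
params = dict(alpha0=0.10, alphaD=0.5, alpha2=1.5, phi=0.02)

for d in [0.0, 1.0, 2.0, 5.0, 50.0]:
    print(d, layer_covariance(d, **params))
```

At distance zero the value is alpha0 + phi, and at very large distances it settles at the background level phi; alphaD sets the spatial scale of the decay and alpha2 its shape.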

We then allow samples to draw their ancestry from more than one layer. The admixture proportion of the ith sample in the kth layer, denoted $w_i^{(k)}$, gives the genome-wide proportion of alleles from sample i that derive from layer k (and so $\sum_{k=1}^{K} w_i^{(k)} = 1$). A visual representation of the method is shown in Figure 1.

We can then describe the covariance between samples i and j across all K layers, $\Omega_{i,j}$, by summing their within-layer spatial covariances ($G^{(k)}_{i,j}$ in layer k), weighted by the relevant admixture proportions:

$$\Omega_{i,j} = \sum_{k=1}^{K} w_i^{(k)} w_j^{(k)} G^{(k)}_{i,j} + \gamma + \delta_{i,j}\,\eta_i. \tag{4}$$

In this equation, $w_i^{(k)} w_j^{(k)}$ is the proportion of alleles that both sample i and sample j have inherited from layer k.

In addition to the admixture-weighted sum of the within-layer spatial covariances, this function contains two terms, γ and $\delta_{i,j}\,\eta_i$. The first, γ, describes the global allelic covariance between all samples, and arises because all samples share an ancestral mean allele frequency at each locus, which generates a base-line covariance. In the final term, $\delta_{i,j}$ is an indicator variable that takes a value of 1 when i equals j and 0 otherwise, and $\eta_i$ adds variance specific to sample i. This term on the diagonal of the parametric covariance matrix captures processes shaping variance within the sampled deme, such as inbreeding and the sampling process.
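The way these pieces combine can be sketched as follows. The snippet assembles a parametric covariance matrix as an admixture-weighted sum of within-layer covariances, plus a global term gamma and a sample-specific diagonal term eta. It assumes the within-layer covariance is alpha0 * exp(-(alphaD * D)^alpha2) + phi; all names and numbers are illustrative assumptions, and this is not the conStruct implementation.

```python
import math


def parametric_covariance(dist, w, layers, gamma, eta):
    """Build the N x N model covariance matrix.

    dist   : N x N geographic distances (list of lists)
    w      : N x K admixture proportions, each row summing to 1
    layers : K tuples (alpha0, alphaD, alpha2, phi), one per layer
    gamma  : global covariance shared by all pairs of samples
    eta    : length-N sample-specific variances added on the diagonal
    """
    n = len(dist)
    omega = [[gamma] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k, (alpha0, alphaD, alpha2, phi) in enumerate(layers):
                g_k = alpha0 * math.exp(-((alphaD * dist[i][j]) ** alpha2)) + phi
                omega[i][j] += w[i][k] * w[j][k] * g_k
            if i == j:
                omega[i][j] += eta[i]
    return omega


# Two samples, two layers; sample 0 is entirely in layer 0, sample 1 in layer 1.
dist = [[0.0, 1.0], [1.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
layers = [(0.10, 1.0, 1.0, 0.02), (0.20, 1.0, 1.0, 0.03)]
omega = parametric_covariance(dist, w, layers, gamma=0.05, eta=[0.01, 0.02])
print(omega)
```

Because the two toy samples share no layer, their covariance reduces to gamma alone. Setting alpha0 to zero in every layer would remove all dependence on distance, leaving only the layer-specific background terms.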

Likelihood and inference

If the allele frequency deviations at each locus were independent between loci and multivariate normally distributed across populations, their allelic covariance $\hat{\Omega}$ would be Wishart distributed with degrees of freedom equal to L, the number of loci genotyped. We use this as a convenient approximation to the true distribution described above, and so define the likelihood of the allelic covariance to be

$$P(\hat{\Omega} \mid \Omega, L) = \mathcal{W}(L\hat{\Omega} \mid \Omega, L), \tag{5}$$

where $\mathcal{W}$ is the Wishart likelihood function. Statistical nonindependence between loci (linkage disequilibrium, LD) will decrease the effective number of degrees of freedom. One possible solution, which we have not yet found necessary to implement, would be to estimate an effective number of loci by introducing a parameter to modify the given degrees of freedom and thereby informally model linkage between loci (e.g., Petkova et al. 2016).
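For readers who want to see what evaluating such a likelihood involves, here is a small standard-library sketch of a Wishart log-density. It assumes the scaled sample covariance (number of loci times the sample covariance) is Wishart distributed with the parametric covariance as its scale matrix and the number of loci as its degrees of freedom. The helper names are hypothetical, and the package itself relies on Stan rather than on code like this.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_det(a):
    """Log-determinant via the Cholesky factor."""
    low = cholesky(a)
    return 2.0 * sum(math.log(low[i][i]) for i in range(len(a)))


def trace_inv_times(scale, s):
    """tr(scale^{-1} s), solving one column of s at a time."""
    low = cholesky(scale)
    n = len(scale)
    total = 0.0
    for c in range(n):
        y = [0.0] * n  # forward substitution: low y = s[:, c]
        for i in range(n):
            y[i] = (s[i][c] - sum(low[i][m] * y[m] for m in range(i))) / low[i][i]
        x = [0.0] * n  # back substitution: low^T x = y
        for i in reversed(range(n)):
            x[i] = (y[i] - sum(low[m][i] * x[m] for m in range(i + 1, n))) / low[i][i]
        total += x[c]
    return total


def wishart_loglik(omega_hat, omega, n_loci):
    """Log-density of n_loci * omega_hat under Wishart(scale=omega, df=n_loci)."""
    p = len(omega)
    s = [[n_loci * v for v in row] for row in omega_hat]
    log_mv_gamma = p * (p - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(n_loci / 2 - j / 2) for j in range(p)
    )
    return (
        (n_loci - p - 1) / 2 * log_det(s)
        - trace_inv_times(omega, s) / 2
        - n_loci * p / 2 * math.log(2)
        - n_loci / 2 * log_det(omega)
        - log_mv_gamma
    )


observed = [[0.25, 0.05], [0.05, 0.25]]
no_covariance = [[0.25, 0.0], [0.0, 0.25]]
print(wishart_loglik(observed, observed, 100))
print(wishart_loglik(observed, no_covariance, 100))
```

For a fixed observed matrix, the log-density is highest when the parametric covariance equals the observed covariance, so the second value printed is lower than the first.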

We estimate the values of the parameters of the model using a Bayesian approach. Acknowledging the dependence of the parametric covariance matrix Ω on its constituent parameters $(w, \alpha, \phi, \gamma, \eta)$ and on the (observed) geographic distances D with the notation $\Omega(w, \alpha, \phi, \gamma, \eta, D)$, we denote the posterior probability density of the parameters as:

$$P(w, \alpha, \phi, \gamma, \eta \mid \hat{\Omega}, D, L) \propto P\left(\hat{\Omega} \mid \Omega(w, \alpha, \phi, \gamma, \eta, D), L\right) P(w)\, P(\alpha)\, P(\phi)\, P(\gamma)\, P(\eta), \tag{6}$$

where $P(w)$, $P(\alpha)$, $P(\phi)$, $P(\gamma)$, and $P(\eta)$ are prior distributions. All parameters are given (half-)Gaussian priors except for $\alpha_2$, which is uniform on $(0, 2)$, and w, for which we use an independent Dirichlet of dimension K for each sample (see Table A1 for specifics). Parameters are independent between layers. We use Hamiltonian Monte Carlo as implemented in STAN (Hoffman and Gelman 2014; Carpenter 2015; Stan Development Team 2015, 2016) to estimate the posterior distribution on the parameters. Our R package, conStruct (for “continuous structure”), functions as a wrapper around this inference machinery.

Relationship of this model to nonspatial structure models

A nice feature of our approach is that the model described in Eq. 4 contains a nonspatial assignment model as a special case (see Models, Parameters, and Priors for a more in-depth discussion). By setting $\alpha_0^{(k)}$ to zero for all k, we obtain a nonspatial model in which each cluster has its own allele frequency at each SNP, and individuals draw a proportion of their ancestry from each cluster. This model is very similar to that of STRUCTURE (Pritchard et al. 2000) and related models (e.g., Alexander et al. 2009); the main difference is that our likelihood assumes that allele frequencies are normally distributed around their expectations, while the standard assignment methods assume that the error is binomially distributed (Engelhardt and Stephens 2010). (We make this approximation for the substantial advantages in computational speed.) The second difference is that, in the original STRUCTURE model, allele frequencies at each locus are independently drawn for each cluster (Pritchard et al. 2000), while in conStruct’s nonspatial model, it is more natural to envision each cluster’s allele frequency as being drifted away from a single, global allele frequency. This makes our model more closely related to the “F-model” prior for allele frequencies of Falush et al. (2003). These differences in the underlying model could in principle result in different behavior, but below we show that the nonspatial model indeed produces similar results to ADMIXTURE, and use this fact to compare the fit of the different models (spatial vs. nonspatial, across different values of K) by comparing their performance in a common framework.

Choice of layer number and cross-validation

There are a number of reasons why there is no true (or right) number of layers for real datasets, discussed further in the Discussion. However, it is still important to assess whether additional layers (larger K) meaningfully model patterns in the data or merely explain spurious variation introduced by noise—in other words, whether additional model complexity provides significant explanatory power. Toward that end, we have implemented a method for statistically comparing conStruct results across different values of K and between the spatial and nonspatial models.

Several approaches have been used as model choice criteria for the number of discrete clusters in population genetic data, including: comparisons of the likelihood of the data across different values of K, with various criteria on how to choose a single value (e.g., Evanno et al. 2005), or with information theoretic penalizations such as Akaike information criterion (AIC) or Bayesian information criterion (BIC; e.g., Alexander et al. 2009); comparisons of the marginal likelihood, generated either via various approximations (e.g., Pritchard et al. 2000) or via thermodynamic integration (Verity and Nichols 2016); or inference using a Dirichlet process prior (Huelsenbeck and Andolfatto 2007). See Verity and Nichols (2016) for a discussion of these approaches and comparison between several methods.

We use cross-validation [similar in spirit to Alexander and Lange (2011)] to attack this problem. To do this, we use a “training” partition of the data (in practice, a random 90% subset of the loci) to estimate the posterior distribution of the parameters, and then calculate the log-likelihood of the remaining “testing” loci, averaged over the posterior. Prediction accuracy of a particular value of K is then measured using this log-likelihood, averaged over a number of independent data partitions. The best model is judged to be the simplest one with significantly better predictive accuracy than others (see Cross validation procedure for details). In general, larger values of K allow the model more flexibility, and thus increase the likelihood of the training partition, but this improvement in the likelihood will plateau (or even peak), as above a certain K the model only fits noise specific to the training data rather than generalizable patterns. At any value of K, support for the spatial model over the nonspatial model means that isolation by distance is likely a feature of the data.
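The partitioning step of this procedure can be sketched in a few lines. The snippet splits locus indices at random into a 90% training set and a 10% testing set and averages a score over independent replicates. The `fit_and_score` callback is a hypothetical stand-in for "fit the model on training loci, then return the log-likelihood of the testing loci averaged over the posterior"; it is not a function of the conStruct package.

```python
import random


def split_loci(n_loci, train_fraction=0.9, seed=None):
    """Randomly partition locus indices into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_loci))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_loci))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def mean_predictive_score(n_loci, n_reps, fit_and_score, train_fraction=0.9, seed=0):
    """Average a user-supplied score over independent random partitions.

    fit_and_score(train_idx, test_idx) is a hypothetical callback that fits a
    model to the training loci and returns its predictive score on the test loci.
    """
    scores = []
    for rep in range(n_reps):
        train_idx, test_idx = split_loci(n_loci, train_fraction, seed + rep)
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


train_idx, test_idx = split_loci(1000, seed=1)
print(len(train_idx), len(test_idx))
```

Running the same partitions for each candidate K, and for both the spatial and nonspatial models, gives scores that can be compared on a common footing.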

Cross-validation provides a valuable summary of how much explanatory power is added by spatial structure within each layer, and each additional layer. However, we remind users that “statistical significance does not imply real-world significance,” and so small but statistically significant differences between models should not be relied on too strongly.

Another way to describe the practical significance of additional layers is to calculate each layer’s relative contribution to total covariance, and to choose a value of K where all layers have a contribution above some cutoff (e.g., 0.1%). The Dirichlet prior on admixture proportions is quite harsh against intermediate admixture values (see Table A1), encouraging the model to “not use” unnecessary layers if they are present in the model, so that they will have a low contribution to overall covariance.

To calculate layer contributions, we use the following alternative description of our covariance model: the genomes of any pair of individuals agree with some background probability at a locus, but this probability of agreement is increased on any segment of genome that both have inherited from the same layer (the amount it increases depends on how far apart they are geographically and on the decay of isolation by distance). We use this characterization to quantify the relative contributions of each layer by computing the average contribution to increased probability of agreement as described in Calculating layer contributions. This layer contribution is similar to the “ancestry contribution” proposed by Raj et al. (2014). However, each of our layers can induce a different amount of covariance between samples embedded in them, so we take that into account when calculating each layer’s contribution to the whole.

Data availability

The method conStruct is implemented as an R package, and is available for installation at https://github.com/gbradburd/conStruct. Scripts for generating and analyzing all simulated and empirical datasets, as well as the datasets themselves, are also available at the same site, and additionally have been archived at Data Dryad (doi: 10.5061/dryad.5qj7h09). Supplemental material available at Figshare: https://doi.org/10.25386/genetics.6840629.

Results

Simulations

To test the method, we first generated data using the coalescent simulator ms (Hudson 2002). In each simulation, we split a single ancestral population into K subpopulations at a fixed time in the past (measured in units of coalescent time), and at a later time, each of these discrete populations instantaneously colonized a separate square lattice of demes. Migration on each lattice was to nearest neighbors (eight neighbors, including diagonals). Finally, at a still more recent time, we collapsed those K discrete layers into a single grid of demes, choosing various amounts of admixture from these different layers (see Figure A1), with randomly distributed but spatially autocorrelated admixture proportions. See Simulation details for more details, including parameter values used. We simulated datasets using $K = 1$, 2, and 3 layers; in each simulation we sampled 10,000 unlinked loci from each of 20 haploid individuals from every deme. We then ran both spatial and nonspatial conStruct analyses on each simulated dataset with K between 1 and 7, and compared predictive performance of the models using cross-validation with 10 replicates. For comparison, we also analyzed each simulated dataset using ADMIXTURE (Alexander et al. 2009) with K between 1 and 7, and compared models using ADMIXTURE’s cross-validation procedure with 50 folds.

With these simulations, spatial conStruct does not create spurious discrete groupings when there are none: Figure 2, Supplemental Material, Figure S1, Figure S2, and Figure S3 show that subsequent layers beyond the number used for simulation are unused. When data are analyzed with more layers than were used to simulate them, the additional layers contribute very little to any population. Even when the spatial model is run with a value of K larger than the true one, the inferred admixture proportions are nearly identical to those estimated under the true value of K for each simulation. Moreover, the method infers the true admixture proportions with high accuracy, tight precision, and good coverage (Figure S4 and Figure S5).

Figure 2

Results for data simulated using $K = 1$, showing maps of admixture proportions estimated using the nonspatial conStruct model for $K = 2$ through 4 [(a)–(c); top row] and the spatial conStruct model for $K = 2$ through 4 [(d)–(f); bottom row]. As there is only a single layer in the simulation, no populations should be admixed, which is accurately depicted by the spatial model (bottom row), while the nonspatial model creates spurious clusters (top row).

In contrast, the nonspatial model describes geographic variation using gradients of admixture between increasingly many discrete clusters to better approximate the continuous, spatial patterns of relatedness (Figure 2, Figure S6, Figure S7, and Figure S8). The ADMIXTURE results are qualitatively similar, as shown in Figure S9, Figure S10, and Figure S11. Each nonspatial cluster is genetically more similar within itself than it is to other clusters, but we know that these boundaries are arbitrary, because the data were simulated without them.

The spatial model’s better fit is reflected by increased predictive accuracy: as shown in Figure 3, across all models and choices of K, the spatial model is correctly preferred over the nonspatial model. As desired, predictive accuracy of the spatial model increases until the true value of K, and then plateaus or declines (Figure 3, Figure S12, Figure S13, and Figure S14). Predictive accuracy of the nonspatial model increases as subsequent clusters are added up to $K = 7$ (the largest number tested), although gains are greatest as layers below the true number are added. The same holds true for the ADMIXTURE cross-validation results, in which models that have the largest number of clusters are preferred over all other models, as shown in Figure 3 (vermilion diamonds), and, in more detail, in Figure S15.

Figure 3

Cross-validation results for data simulated under Embedded Image Embedded Image and Embedded Image comparing the spatial and nonspatial conStruct models (in blue and green, respectively) run with Embedded Image through 7, with 10 cross-validation replicates. The inset plots zoom in on cross-validation results outlined in the dotted boxes. The spatial model shows better model fit at every value of K. The vermilion diamond indicates the value of K selected on the basis of lowest cross-validation error among ADMIXTURE models. In all simulations, the preferred ADMIXTURE model was that with the largest number of clusters.

The unimportance of spurious layers can be seen in plots of layer contributions (Figure 4, Figure S16, and Figure S17). In the spatial analyses, once we pass the true K, subsequent layers add little in terms of (co)variance explained; in contrast, additional clusters in the nonspatial analyses continue to contribute substantially.

Figure 4

Results for data simulated using Embedded Image showing layer/cluster contributions (i.e., how much each layer/cluster contributes to total covariance), from conStruct runs using Embedded Image through 7 for the spatial model (left), and the nonspatial model (right). In each run of the spatial model, a single layer explained nearly all the covariance (additional bars are present but not visible).

Empirical applications

To further demonstrate the utility of this method, we also applied conStruct to empirical population genomic data from two systems: a contact zone between two poplar species in northwestern North America, and a large North American sample of black bears.

Poplars

Study system and questions:

Trees in the genus Populus (poplars, aspens, and cottonwoods) are distributed throughout the Northern Hemisphere; species in the genus regularly co-occur and, where they do, they frequently hybridize (Eckenwalder 1984; Cronk 2005).

Populus trichocarpa, the black cottonwood, and Populus balsamifera, the balsam poplar, have a broad zone of overlap in the Pacific Northwest, where they are hypothesized to hybridize (Geraldes et al. 2014; Suarez-Gonzalez et al. 2016). Both species are sampled over a large geographic region, and show spatial patterns of genetic and phenotypic variation (Slavov et al. 2012; McKown et al. 2014), making the system well-suited for application of our method. We organize the results of our analyses around the following questions:

  1. To what degree has hybridization blurred the boundaries between trichocarpa and balsamifera? (As an extreme case, does genetic differentiation support these as separate species, as opposed to a single cline of ancestry?)

  2. Does the only significant boundary of population structure fall along the species boundary (if any), or is there substructuring within species?

  3. Does the strength of isolation by distance differ between inferred layers? This may indicate, e.g., different speeds of postglacial expansion or primary modes of dispersal.

Data and analyses:

We use data from Geraldes et al. (2014), consisting of 434 individuals sampled from 35 drainages genotyped at just over 33,000 loci (map of the sampling shown in Figure S18). The number of individuals per drainage ranged between 1 and 50, with most sampling concentrated on trichocarpa drainages. The data were generated using an Infinium 34K array designed for trichocarpa (Geraldes et al. 2013), and showed a strong pattern of bias in allelic dropout (the majority of missing data were from drainages with only Populus balsamifera individuals). To ameliorate some of the problems that arise when there is a strong bias in which data are missing, we dropped loci for which any data were missing, resulting in just over 20,200 loci retained for analysis. We then analyzed these data, grouped by drainage, using both the spatial and nonspatial conStruct models with Embedded Image through 7, and compared these models using cross-validation with 10 replicates. The results of all these analyses are shown in Figure 5 and Figure 6, as well as Figure S19, Figure S20, Figure S21, Figure S22, and Figure S23 in the “Supplemental Materials”. For comparison, we also ran ADMIXTURE (Alexander et al. 2009) with Embedded Image through 7, using 50-fold cross-validation to compare model performance (Figure S24 and Figure S25).

Figure 5

Maps of admixture proportions estimated for the Populus dataset using the spatial conStruct model for Embedded Image through 4 (a–c), as well as the corresponding layer-specific covariance curves estimated under each model (d–f).

Figure 6

Cross-validation results for Populus dataset comparing the spatial and nonspatial conStruct models run with Embedded Image through 7 with 10 cross-validation replicates. The first panel in each row shows all results; the second panel zooms in on the results from analyses run with Embedded Image through 7.

Results from conStruct:

All models with Embedded Image assigned the majority of each of the two species to distinct layers, with some populations drawing ancestry from multiple layers. Based on cross-validation results, we view the Embedded Image spatial model as a sufficient description of the data, with additional structure of uncertain significance. This provides strong support for discrete population structure between the two species, with some admixture, rather than a single, continuous cline of ancestry. At all values of Embedded Image discrete population structure was mostly partitioned along species lines; at values of K above 2, further discrete substructure was inferred within the P. trichocarpa samples, with no substructure within balsamifera. There was also strong support for isolation by distance in the dataset, but most of this signal seems to derive from the P. trichocarpa samples: as seen in Figure 5 d–f and Figure S21, there is almost no isolation by distance within the balsamifera layer (Embedded Image). Both points are in agreement with previous work (Keller et al. 2010), which found low diversity within the region’s balsamifera, probably as the result of a recent postglacial expansion.

A consistent split between layers within trichocarpa fell along the “no-cottonwood belt,” a region along the central coast of British Columbia in which black cottonwood is absent (the break between yellow and red, for Embedded Image). The no-cottonwood belt is hypothesized to divide the species’ distribution into northern and southern groups, which, in a provenance test, were experimentally shown to display differences in ecologically relevant phenotypes (e.g., pathogen resistance, Xie et al. 2009, 2012). At higher values of K, drainages at the southern tip of trichocarpa sampling begin to split out into their own layers, perhaps due to introgression from the southern neighbors Populus angustifolia or fremontii (Zhou and Holliday 2012; Geraldes et al. 2014).

Comparison to ADMIXTURE:

Both nonspatial conStruct and ADMIXTURE displayed the successive partitioning of space and the clines of admixture seen in the simulation results. The details of each were somewhat different (Figure S20 vs. Figure S24), and also differed across the replicate analyses. These differences between runs and methods may be due to noise in the different inference algorithms employed, multi-modality in the likelihood surfaces, or to model details (e.g., the priors used in nonspatial conStruct, or the fact that ADMIXTURE is modeling each allele’s frequency in each cluster, rather than the covariance across all alleles). However, overall, the behavior of both methods was quite similar: each recovered the trichocarpa/balsamifera split with the first two clusters modeled, then, with higher values of K, used subsequent clusters to subdivide the trichocarpa samples into geographically restricted foci of cluster membership. Both nonspatial conStruct and ADMIXTURE strongly favored the most cluster-rich model (Figure 6 and Figure S25). In contrast, the spatial conStruct model clearly did not favor the model with the highest value of K, and appears to describe patterns of isolation by distance across the trichocarpa range quite well.

Black bears

Study system and questions:

The American black bear, Ursus americanus, is endemic to North America and has a broad distribution across the continent. During the last glacial maximum, black bears were confined to isolated glacial refugia, from which they subsequently expanded to occupy their current range (Byun et al. 1997; Wooding and Ward 1997; Stone and Cook 2000; Puckett et al. 2015), likely leading to both continuous and discrete patterns of genetic structure. We organize our results around the following questions:

  1. How many distinct populations are reflected in modern patterns of genetic variation?

  2. How strong is isolation by distance within each inferred group?

Distinct populations likely represent different glacial refugia, and differing strengths of isolation by distance might indicate different levels of habitat connectivity, dispersal behavior, or different postglacial histories.

Data and analyses:

We use data from Puckett et al. (2015), consisting of 95 individuals sampled across the United States and on the West coast of Canada, genotyped at just under 22,000 biallelic loci. The distribution of missing data across these individuals was uneven, with a few individuals representing most of the missing data, so we removed individuals with >4% missing data, resulting in a final dataset of 78 individuals. We then analyzed these data, treating individuals as the unit of analysis, using both the spatial and nonspatial conStruct models with a K of between 1 and 7, and compared these models using cross-validation with 10 replicates. We also ran ADMIXTURE (Alexander et al. 2009) on the same dataset, using Embedded Image through 7, and comparing models using ADMIXTURE’s cross-validation procedure with 50 data fold subsets. The results of these analyses are shown in Figure 7, Figure 8, and Figure 9, as well as in Figure S26, Figure S27, Figure S28, Figure S29, Figure S30, and Figure S31 in the “Supplemental Materials”.

Figure 7

Maps of admixture proportions estimated for the black bear dataset using the spatial conStruct model (a), the nonspatial conStruct model (b), and ADMIXTURE (c) for Embedded Image. Pies show mean admixture results across individuals within their diameter, and the admixture results for all individuals included within each group are shown in the plot above.

Figure 8

Cross-validation results for the black bear dataset, comparing spatial and nonspatial conStruct models, as well as ADMIXTURE, all run with Embedded Image through 7, with 10 cross-validation replicates for the conStruct analyses and 50 data-fold subsets for the ADMIXTURE analyses. The first panel in each row shows results from spatial and nonspatial conStruct models; the second panel zooms in on the results from the spatial analyses run with Embedded Image through 7, and the third panel shows the results for ADMIXTURE. Note that the ADMIXTURE plot shows cross-validation error (rather than predictive accuracy), and that the y-axis has therefore been flipped for ease of comparison to the conStruct results.

Figure 9

Layer/cluster contributions (i.e., how much total covariance is contributed by each layer/cluster), for all layers estimated in runs using Embedded Image through 7 for the spatial model (left), and for all clusters using the nonspatial conStruct model (right). For each value of K along the x-axis, there are an equal number of contributions plotted. Colors are consistent with Figure 7.

Results from conStruct:

The results partition the sampled bears into two main groups (shown in Figure 7a): one (red) to the east of the Rocky Mountains, which also occurs in Alaska, the other primarily west of the Rockies (blue). The disjunct range of the red layer likely reflects the fact that Canada was not sampled, and so the red layer may extend through the intervening (unsampled) northern Great Plains and Canadian Shield, with the blue layer presumably then stretching up into British Columbia.

The spatial models have strong statistical support up until around Embedded Image or 6 (Figure 8), but additional spatial layers beyond Embedded Image contribute little to total covariance (Figure 9). The locations of admixed individuals are consistent with a scenario of postglacial expansion from two refugia, one in the American Southwest and one in the American Southeast, meeting near the Northwest coast of North America and the Cascade Range. However, lack of any samples from Canada and Mexico, and lack of denser sampling across northern North America, make more detailed interpretations untrustworthy. The spatial covariance functions estimated in layers beyond the first two take very large values over small spatial lags, but decay sharply after that. This feature, combined with the overall amounts and spatial patterns of ancestry in those layers, suggests that these layers are describing processes that shape genetic variation at local scales, such as inbreeding, which affects covariance between individuals within each location, but has limited impact on covariance between locations.

Comparison to ADMIXTURE:

Results from the nonspatial model and from the ADMIXTURE analyses clearly exhibit the tendency of nonspatial clustering algorithms to describe continuous spatial patterns of divergence using gradients of admixture between clusters. For example, in Figure 7b, the third cluster (in gold) exhibits a clear East-West gradient that overlays the discrete structure between the Southwest cluster and the Southeast. The results from ADMIXTURE are not identical to those obtained using the nonspatial conStruct model, but they do show the same tendency: e.g., at Embedded Image — the preferred model from the cross-validation analysis shown in Figure S31 — ADMIXTURE splits the westernmost Alaskan samples out of the cluster with the eastern samples, and at Embedded Image it subdivides the eastern cluster into two geographically partitioned groups (Figure S30). Interestingly, for the nonspatial model implemented in ADMIXTURE, the preferred model has a smaller K (Embedded Image) than that of the spatial models with best cross-validation performance in conStruct (Embedded Image or 6). This discrepancy likely stems from the different features introduced in layers beyond Embedded Image in the two models: conStruct uses small contributions of new layers to model very local drift, while ADMIXTURE moves to geographically finer subdivisions.

Even at Embedded Image ADMIXTURE invokes clusters to describe what seems to be a continuous spatial pattern of genetic variation, which conStruct describes using only two spatial layers. The third cluster in the ADMIXTURE analysis at Embedded Image (shown in gold in Figure S30b), shows strong spatial autocorrelation in admixture proportions, as would be expected if it is describing continuous spatial differentiation. The allelic covariances plotted against distance (see Figure S32) provide more information on ADMIXTURE’s lack of fit: covariance between Eastern bears falls off gradually rather than abruptly with distance, indicating a residual pattern best explained by isolation by distance within layers. In addition, the covariance between bears assigned to ADMIXTURE’s gold and red layers (the furthest Northwestern and Eastern bears, respectively) appears to be a natural extension of the decay of covariance with distance, falling to only slightly lower values than covariances between other widely separated pairs of Eastern sampling locations.

Across all values of K for which we ran conStruct, we see strong support for the spatial model over the nonspatial model (Figure 8). This pattern may resolve a discrepancy between our results and previous analyses that split Alaskan and British Columbian bears out into their own cluster with an inferred Beringian glacial refugium (Byun et al. 1997; Stone and Cook 2000; Puckett et al. 2015). Our model, which explicitly incorporates a spatial decay of relatedness, allows somewhat genetically differentiated individuals that are sampled far from one another to belong to the same layer, instead of splitting these individuals out into successive clusters (e.g., Figure S26d vs. Figure S27d).

Discussion

In this paper, we have presented a statistical framework, conStruct, for simultaneously modeling continuous and discrete patterns of population structure. By employing the sensible default assumption that relatedness ought to decay with geographic distance, even within a population, we avoid erroneously ascribing population differentiation to discrete population clusters. To aid comparison between models, we present a cross-validation approach as well as a way to describe the contribution of each spatial layer to the model (but caution against overly strict interpretation of either).

The method performs well on simulated data: we accurately infer the admixture proportions used to simulate the data and accurately pick the simulating model as the best model using our cross-validation procedure. Two empirical applications of conStruct to samples of North American poplars and black bears yield reasonable results, and demonstrate that, by acknowledging isolation by distance, real datasets can be better described using fewer layers.

The proposed method combines the utility of model-based clustering algorithms with a model of isolation by distance. We anticipate that conStruct will be useful for identifying populations and determining ancestral origins of sampled individuals, especially when the populations exhibit geographic patterns of relatedness.

Comparison to nonspatial model-based clustering

Above, we showed that (a) the nonspatial conStruct model recapitulates results of other, commonly used nonspatial clustering methods, and (b) conStruct can concisely capture spatial structure, which is common within populations. Given this, when should methods without spatial capability be used? One advantage these have over conStruct is speed when the number of samples is large. Although conStruct’s computation time is independent of the number of loci included in the dataset (after the initial calculation of the allelic covariance), it currently scales poorly with number of samples. The computationally limiting step is the inversion of the covariance matrix, which scales more than quadratically with the number of samples, whereas computation time for, e.g., STRUCTURE, scales linearly with number of samples.

For a relatively small number of samples, conStruct can be much faster than existing nonspatial Bayesian clustering methods. On a desktop machine, using a single 4.2 GHz Intel Core i7 processor, an analysis of the black bear dataset (78 samples, 21,000 loci) running conStruct’s spatial model with four layers for 5000 Markov chain Monte Carlo (MCMC) iterations (which was more than sufficient for convergence) took 2.8 hr. For almost any size dataset, the maximum likelihood algorithm implemented in ADMIXTURE is quite a bit faster than conStruct: running ADMIXTURE on the bear dataset over all values of K from 1 to 7, including 50-fold cross-validation for each value of K, took only 6.6 min. It should also be noted that there may be situations when the binomial-based model underlying ADMIXTURE performs better than our Gaussian-based model, e.g., when clusters differ at only a few strongly differentiated loci, although we have not investigated this possibility.

Choosing the “best” number of layers

Although we recognize the utility of choosing a single, “best” value of K, and using only that analysis to communicate results, we emphasize that the choice of best K is always relative to the data in hand and the questions to be answered. From a statistical perspective, unless the data were generated under the model itself, the support for larger values of K is likely to increase with increasing amounts of data. In the limit of infinite data, the best value of K may be the number of samples included in the dataset (Patterson et al. 2006).

From a biological perspective, it is important to stress that patterns of relatedness between individuals and populations are shaped by complex spatial and hierarchical processes. All individuals within a species are related to one another in some way, and summarizing those patterns of relatedness with a single value of K may be reductive or misleading. We therefore encourage users to perform analyses across different values of K and observe which layers split out at what levels (this is conceptually similar to taking successively shallower cross-sections of the population phylogeny), and also to take the results of the proposed cross-validation procedure with a large grain of salt. Calculating layer contributions may also be a useful heuristic, as it can reveal layers with statistical support but small biological import.

Although we believe our model adds spatial realism to the groups used by clustering methods, it is important to note that the layers detected by our method do not necessarily correspond to distinct, ancestral populations; nor does a nonzero admixture proportion indicate that admixture (i.e., gene flow) must have occurred. Both groupings and admixture proportions should be viewed as hypotheses that should be subject to further testing (for an in-depth discussion of these points, see Falush et al. 2016).

Implications for management and conservation

Because isolation by distance is common, a likely result of applying conStruct to existing data is that populations previously identified as distinct using nonspatial clustering methods may be grouped into the same layer. This “lumping” might better reflect the demographic history of these populations, but may not contradict the genetic distinctness implied by the nonspatial clustering. This genetic distinctness—rather than shared history—may be more relevant for management decisions and conservation policy, which are often predicated on the identification of discrete “management units” identified using genetic data (Moritz 1994; Waples 1998; Moritz et al. 2002).

It is therefore important to stress that individuals sampled from the same conStruct layer may be quite genetically diverged from one another, perhaps especially at loci underlying adaptive traits, and that a conStruct layer may still contain multiple distinct management units worthy of independent protections. For instance, although both the Alaskan and Eastern black bears draw most of their ancestry from the same conStruct layer, they are separated by a great distance, and may therefore differ substantially from each other (although less than from the Western bears, as measured by average covariance). Alternatively, the inclusion of multiple management units into a single conStruct layer may occur if these populations are currently (or were recently) exchanging migrants, and thus might emphasize the importance of maintaining habitat corridors, or of implementing an integrated conservation plan across a geographic region.

Allelic or genetic covariance?

The choice of allelic covariance, rather than genetic covariance, was motivated by the fact that it is less affected by sample configuration: the genetic covariance is calculated after subtracting the mean of the entire sample, which is more strongly affected by densely sampled locations. Genetic covariance is also often computed after first dividing each frequency by $\sqrt{p(1-p)}$, where p is the global allele frequency, with the aim of equalizing variances across loci. Our definition does not do this, and so is less affected by low-frequency alleles. Both of these changes led to better performance on test data. However, note that allelic covariance is more affected by singleton sites than the standard genetic covariance, so it may be advisable to filter these prior to analysis if they are likely to contain a large percentage of errors (Linck and Battey 2017).
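The contrast between the two statistics can be made concrete with a minimal sketch. It assumes that the allelic covariance between two different samples is the average over loci of (f_i − 1/2)(f_j − 1/2), which is what the random allele-flipping argument in the appendix implies; the function names, the use of an unweighted across-sample mean as the global frequency p, and the toy interface are all illustrative assumptions, not the conStruct implementation.

```python
import math

def allelic_covariance(freqs):
    """Allelic covariance from an (n_samples x n_loci) nested list of allele
    frequencies. Assumed form: mean over loci of (f_i - 1/2) * (f_j - 1/2).
    Note: diagonal entries computed this way from frequencies are not the
    single-allele variance discussed in the appendix."""
    n_loci = len(freqs[0])
    centered = [[f - 0.5 for f in row] for row in freqs]
    return [[sum(a * b for a, b in zip(ri, rj)) / n_loci for rj in centered]
            for ri in centered]

def genetic_covariance(freqs):
    """Standard genetic covariance: subtract the across-sample mean frequency
    p at each locus, then divide by sqrt(p * (1 - p)). Taking p as the
    unweighted mean across samples is an assumption of this sketch."""
    n_samples = len(freqs)
    n_loci = len(freqs[0])
    standardized = [[] for _ in range(n_samples)]
    used = 0
    for locus in range(n_loci):
        p = sum(row[locus] for row in freqs) / n_samples
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic loci cannot be standardized
        scale = math.sqrt(p * (1.0 - p))
        for i in range(n_samples):
            standardized[i].append((freqs[i][locus] - p) / scale)
        used += 1
    return [[sum(a * b for a, b in zip(ri, rj)) / used for rj in standardized]
            for ri in standardized]
```

Under these assumptions, adding a duplicate of a third sample leaves the allelic covariance between the first two samples unchanged, whereas their genetic covariance shifts because the sample mean p moves toward the more densely sampled location.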

Caveats and considerations

There are a few important caveats to consider in the interpretation of conStruct results. First, we have modeled allelic covariance within a layer as a spatial process. Although there is flexibility built into the model about the shape of that covariance, inference may be misleading if the sampling geography departs radically from the way the sampled organisms disperse (or have dispersed) on their landscape. For example, if we were to run a conStruct analysis using geographic distances between sampled individuals of greenish warblers (Irwin et al. 2001) or Ensatina salamanders (Wake and Schneider 1998)—two canonical examples of ring species—we might get misleading results. This is because distance between locations on either side of the species’ distributions (across the Tibetan plateau and the Central Valley, respectively) is not representative of the path traversed in the coalescence of a pair of alleles sampled at those locations.

A second caveat is that, in some instances, membership in the same layer may not mean that samples are particularly related. If covariance within a layer decays sharply with distance, and the layer-specific relatedness parameter Embedded Image is low, individuals separated by a large spatial distance may be in the same layer but have very low pairwise relatedness. It is possible that this is happening in Figure S19. At Embedded Image the southernmost populations of P. trichocarpa are in the gold layer, whose other neighbors are to the north, with an intervening group of populations in the red layer, and at Embedded Image those southernmost samples split out and become their own layer. Furthermore, note that in this case Embedded Image and Embedded Image are confounded, so differences in φ between layers should not be overinterpreted. Again, we encourage users to run analyses across multiple values of K and to examine the spatial covariance functions within layers when interpreting results.

Extensions and future directions

There are several ways in which the model described in this paper might be extended or improved. For example, we currently assume that all layers within a model are equally unrelated (a star population phylogeny, although the branches can have different lengths thanks to the Embedded Image parameter), similar to the F-model of Falush et al. (2003). However, we could extend the existing model by implementing a relatedness structure between the layers by, for example, estimating a population phylogeny between them (e.g., Pickrell and Pritchard 2012).

In addition, here we have assumed that samples have known geographic coordinates, and that they draw ancestry from layers only at those sampled locations. A natural extension would be to attempt to “geo-locate” the ancestry of samples without geographic coordinates (Wasser et al. 2004). We could also imagine letting samples draw ancestry from other geographic coordinates, as we have done in a previous approach (Bradburd et al. 2016) to model long-distance dispersal. We could even allow entire layers to bud off of a particular location on another layer. This would enable more explicit modeling of range expansion or domestication, in which a set of individuals are thought to have ancestry that originated from a particular geographic location embedded in a larger pattern of isolation by distance.

A final direction would be to model relatedness within a layer as a spatiotemporal process, in which covariance decays both with distance in space and in time. As the number of genotyped historical or ancient samples increases, it is becoming possible to ask whether there is genetic continuity at a point in space across time, or whether populations are being replaced (Lazaridis et al. 2014; Haak et al. 2015; Slatkin and Racimo 2016; Nielsen et al. 2017; Schraiber 2017; Joseph and Pe’er 2018). However, we expect allele frequencies to change through time in a population, even without replacement, simply due to drift. Therefore, a natural way to test for population replacement is to estimate the rates at which relatedness within a layer decays with time in the same way we do in the current model with space, in which case a change in discrete population structure across space is comparable to population replacement across time.

Acknowledgments

We thank Marjorie Weber, Yaniv Brandvain, William Wetzel, Mariah Meek, Doc Edge, Evan McCartney-Melstad, Matthew Stephens, Nick Barton, and the anonymous reviewers for invaluable comments on the method and manuscript, as well as Quentin Cronk, who provided input on the Populus analyses, and Emily Puckett, who provided input on the black bear analyses. We also thank the attendees at the 2017 Society for the Study of Evolution (SSE) Meeting in Portland, OR, whose votes determined the name of the method. This work was supported in part by the National Science Foundation under award number NSF #1262645 Division of Biological Infrastructure (DBI) to P.L.R. and G.M.C., the National Institute of General Medical Sciences of the National Institutes of Health under award number NIH R01-GM108779 to G.M.C., and the National Science Foundation under award numbers NSF #1148897 and #1402725 to G.S.B.

Appendices

Model Rationale: Drift, Admixture, and Space

Here we sketch a simple model of allele frequencies and their covariances, to justify the form given in the main text.

Drift

We first provide a simple model of allele frequencies within a layer. Imagine a sample i that draws all of its ancestry from layer k. The allele frequency in sample i at locus Embedded Image, denoted Embedded Image, can be written as the sum

Embedded Image (7)

The first term is the ancestral allele frequency Embedded Image shared by all samples; the second is the deviation from that ancestral frequency due to drift in the ancestral population of the kth layer, which is shared by all samples within the layer. The third term is the deviation of the ith sample away from the kth layer mean due to the spatial process of drift and migration within the layer. The final term is the deviation specific to the ith sample, which captures drift not shared by all samples at the population level (i.e., subpopulation-specific drift due to, e.g., inbreeding). We will assume that these four deviations are all uncorrelated with each other.

If we have two samples i and j drawn from layer k, their covariance across loci will be

Embedded Image (8)

where the quantity Embedded Image is an indicator variable that equals 1 when i is equal to j and 0 otherwise, as in Eq. 4.
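In symbols, the four-term decomposition and the covariance it implies can be sketched as follows. The notation is assumed purely for illustration and need not match that of Eq. 7 and Eq. 8; the only ingredients are the four terms described in the text and the assumption that they are mutually uncorrelated (and that sample-specific deviations are uncorrelated between different samples).

```latex
% Assumed notation (for illustration only):
%   f_{i,\ell}             allele frequency of sample i at locus \ell
%   \epsilon_\ell          ancestral allele frequency shared by all samples
%   \Delta^{(k)}_{\ell}    drift in the ancestral population of layer k
%   \Delta^{(k)}_{i,\ell}  spatial deviation of sample i from the layer-k mean
%   \Delta_{i,\ell}        deviation specific to sample i
%   \delta_{i,j}           indicator equal to 1 if i = j and 0 otherwise
\begin{align*}
f_{i,\ell} &= \epsilon_\ell + \Delta^{(k)}_{\ell} + \Delta^{(k)}_{i,\ell} + \Delta_{i,\ell}, \\
\mathrm{Cov}(f_i, f_j) &= \mathrm{Var}(\epsilon) + \mathrm{Var}\bigl(\Delta^{(k)}\bigr)
  + \mathrm{Cov}\bigl(\Delta^{(k)}_i, \Delta^{(k)}_j\bigr) + \delta_{i,j}\,\mathrm{Var}(\Delta_i).
\end{align*}
```

All cross terms vanish because the four components are uncorrelated, so only the shared ancestral variance, the layer-level drift, the spatial term, and (on the diagonal) the sample-specific term remain.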

Admixture

The model above describes the simple case in which samples draw 100% of their ancestry from only a single layer each. To accommodate admixture between layers, we model sampled genomes as drawn from allele frequencies that are weighted averages of the local frequencies in each layer from which they draw ancestry. The weights, Embedded Image, describe the “admixture proportion” of sample i in layer k. These can be interpreted as the proportion of the genome in the ith sample that came from the kth layer (or the probability that an allele at a locus is drawn from layer k), so that Embedded Image for each i. The allele frequency in the ith sample at the ℓth locus can therefore be written as

Embedded Image (9)

and so the covariance between i and j across loci is

Embedded Image (10)

Space

Under our nonspatial model, we assume that Embedded Image, so that the only additional covariance between i and j (above that induced by a shared ancestral frequency at each locus) is due to the drift in the ancestral population of their layer (the variance of which is Embedded Image). Under our spatial model we assume that some of the covariance in allele frequencies between i and j decays as a function of the geographic distance between the pair, Embedded Image, so that

Embedded Image (11)

We note that this form is chosen for flexibility and convenience, and not because it matches any explicit population genetic model of isolation by distance.
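To illustrate how drift, admixture, and space combine, the following minimal sketch assembles a covariance matrix for admixed samples. It assumes a powered-exponential decay of within-layer covariance with distance plus a layer-level drift term; that functional form, the parameter names (`alpha0`, `alphaD`, `alpha2`, `phi`, `gamma`, `nuggets`), and the function names are assumptions made for illustration and are not taken from the equations above. Setting `alpha0` to zero gives the nonspatial case, in which only layer-level drift adds covariance.

```python
import math

def layer_covariance(dist, alpha0, alphaD, alpha2, phi):
    """Within-layer covariance at geographic distance `dist`.
    Assumed form: a spatial term that decays with distance, plus the
    layer-level drift variance `phi` shared by everything in the layer."""
    return alpha0 * math.exp(-((alphaD * dist) ** alpha2)) + phi

def parametric_covariance(coords, admix, layers, gamma, nuggets):
    """Covariance between samples that draw ancestry from several layers.

    coords  : list of (x, y) sample locations
    admix   : admix[i][k] is the admixture proportion of sample i in layer k
    layers  : one dict of layer_covariance parameters per layer
    gamma   : covariance from the ancestral frequency shared by all samples
    nuggets : sample-specific variance, added on the diagonal only
    """
    n = len(coords)
    cov = [[gamma] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.dist(coords[i], coords[j])
            for k, params in enumerate(layers):
                # Layers are treated as independent, so only ancestry that
                # both samples draw from the same layer adds covariance.
                cov[i][j] += admix[i][k] * admix[j][k] * layer_covariance(d, **params)
        cov[i][i] += nuggets[i]
    return cov
```

In this sketch, two samples drawing all of their ancestry from different layers covary only through `gamma`, while two samples in the same layer covary less the farther apart they are.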

Allelic Covariance and Inference

Here we go into further detail about both the allelic covariance we model and the modeling framework we use.

Allelic covariance

To see why Eq. 1 and Eq. 2 for the allelic covariance are equivalent, pick a random locus and let A and B be randomly drawn alleles at that locus from populations i and j, respectively. Suppose these are each coded as “0” or “1” (where “0” denotes a reference allele), but we randomly “flip” this coding, so that we let Embedded Image and Embedded Image with probability 1/2, but otherwise we let Embedded Image and Embedded Image. These are Embedded Image and Embedded Image in Eq. 1, so that Embedded Image. The random allele flipping makes the value of Embedded Image independent of the choice of reference allele. By conditioning on the flip, and using the fact that Embedded Image, Eq. 2 comes from the observation that

Embedded Image (12)

Thanks to averaging over choice of alleles, the within-population allelic variance in sample i, Embedded Image, is the variance of a series of Bernoulli(1/2) draws across loci, and therefore equals 1/4 for every sample i. Averaging over choice of reference allele therefore removes some information about factors acting within populations that might otherwise leave signatures in the genetic covariance, such as population size, extent of inbreeding, and history of bottlenecks. However, as our model is focused on modeling covariances between samples as the outcome of some spatial process, we count this a minor loss.
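The effect of the random flip can be checked numerically. The sketch below is a minimal simulation, not the authors' code: it draws one allele per sample at each locus, applies a shared flip with probability 1/2, and compares the empirical covariance with the value implied by the construction, namely one half of the average probability that the two alleles agree, minus 1/4. Function names and the example frequencies are hypothetical.

```python
import random


def flipped_allele_pairs(freqs_i, freqs_j, rng):
    """Draw one allele from each sample at every locus, then recode both
    alleles with a shared random flip (0 <-> 1) with probability 1/2."""
    pairs = []
    for p_i, p_j in zip(freqs_i, freqs_j):
        a = 1 if rng.random() < p_i else 0
        b = 1 if rng.random() < p_j else 0
        if rng.random() < 0.5:
            a, b = 1 - a, 1 - b
        pairs.append((a, b))
    return pairs


def covariance(xs, ys):
    """Plain (biased) sample covariance across loci."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def expected_covariance(freqs_i, freqs_j):
    """After flipping, each coded allele has mean 1/2 and the product is 1
    only when the alleles agree and the coding lands on 1, so the
    covariance is 0.5 * mean(P(agree)) - 0.25."""
    agree = [p * q + (1 - p) * (1 - q) for p, q in zip(freqs_i, freqs_j)]
    return 0.5 * sum(agree) / len(agree) - 0.25
```

With many loci, `covariance(xs, xs)` for a single sample should sit close to 1/4 regardless of the underlying allele frequencies, which is the loss of within-population information described above.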

Likelihood

If allele frequency deviations are well approximated by a Gaussian, their sample allelic covariance is a sufficient statistic, so that calculating the likelihood of their sample allelic covariance is the same as calculating the probability of the frequency data up to a constant. We can therefore model the covariance of the sample allele frequencies, Embedded Image, as a draw from a Wishart distribution with degrees of freedom equal to the number of loci L across which the sample allelic covariance is calculated:

Embedded Image (13)

where W is the Wishart likelihood function.

A benefit of directly modeling the sample allelic covariance is that, after the initial calculation of the sample covariance matrix, the computation time of the likelihood is not a function of the number of loci, so inference can be done using whole genome data.
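To make the cost argument concrete, the following sketch evaluates a Wishart log density for small matrices using only the Python standard library. It is an illustration under assumptions, not the authors' implementation: the choice to evaluate the density at L times the sample covariance with the parametric covariance as the scale matrix is an assumed convention, and all function names are hypothetical. Note that the number of loci enters only as a scalar, so the cost depends on the number of samples and not on L.

```python
import math


def cholesky(m):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def log_det(low):
    """Log determinant of the matrix whose Cholesky factor is low."""
    return 2.0 * sum(math.log(low[i][i]) for i in range(len(low)))


def solve_spd(low, b):
    """Solve (low low^T) x = b by forward then backward substitution."""
    n = len(low)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def wishart_logpdf(x, scale, df):
    """Log density of a Wishart(df, scale) distribution evaluated at x."""
    p = len(x)
    chol_x = cholesky(x)
    chol_v = cholesky(scale)
    # trace(scale^{-1} x), built one column of x at a time
    trace = 0.0
    for c in range(p):
        column = [x[r][c] for r in range(p)]
        trace += solve_spd(chol_v, column)[c]
    log_mv_gamma = (p * (p - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(df / 2.0 + (1 - j) / 2.0) for j in range(1, p + 1)
    )
    return (
        0.5 * (df - p - 1) * log_det(chol_x)
        - 0.5 * trace
        - 0.5 * df * p * math.log(2.0)
        - 0.5 * df * log_det(chol_v)
        - log_mv_gamma
    )


def covariance_loglik(sample_cov, parametric_cov, n_loci):
    """Assumed convention: n_loci * sample_cov ~ Wishart(n_loci, parametric_cov)."""
    scaled = [[n_loci * v for v in row] for row in sample_cov]
    return wishart_logpdf(scaled, parametric_cov, n_loci)
```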

Models, Parameters, and Priors

Here we discuss the different models implemented in this paper and give the priors we place on model parameters.

Spatial vs. nonspatial

In this paper, we discuss two types of models, spatial and nonspatial, each of which can be implemented with different numbers of layers/clusters. The spatial model is parameterized as in Eq. 10, and the nonspatial model is a special case of the spatial model with all α parameters set to 0. The nonspatial model therefore has 3K fewer parameters than the spatial model, because there are three α parameters that describe the continuous differentiation effect of distance in each layer.

Single layer

Each of these models can be run with a single layer (K = 1), in which case the layer-specific covariance parameter φ^(k) and the global covariance parameter γ become redundant. The single-layer model is therefore a special case of the multi-layer model, in which we set φ to zero. For the spatial model, the single-layer parametric covariance is

Embedded Image (14)

and for the nonspatial model, it is

Embedded Image (15)

Priors

We use a Bayesian approach to parameter inference. A table of all parameters, their descriptions, and their priors is given in Table A1.

Table A1. List of parameters used in the conStruct model, along with their descriptions and priors

Cross-Validation Procedure

We employ a Monte Carlo cross-validation approach for model comparison (Picard and Cook 1984). This procedure generates a mean predictive accuracy for each model and each value of K, as well as a confidence interval around that mean, which can then be used for model comparison or selection. Briefly, the procedure is as follows:

  1. For each of X replicates:

    1. partition the allele frequency data into a 90% “training” partition (Embedded Image) and a 10% “testing” partition (Embedded Image).

    2. run our inference procedure using the training partition to estimate model parameters Embedded Image for Embedded Image models:

      • i. m: the spatial and the nonspatial model.

      • ii. k: the number of layers/clusters 1 through K.

    3. calculate the mean log likelihood of the testing data partition over the posterior distribution of training-estimated parameters for each model (Embedded Image henceforth Embedded Image).

    4. generate standardized mean log likelihoods, Embedded Image across all models run on this data partition:

      • i. identify the highest mean log likelihood, Embedded Image across all Embedded Image models.

      • ii. subtract Embedded Image from Embedded Image for each model, such that the standardized log likelihood, Embedded Image of the best model is 0, and <0 for all inferior models.

  2. For each model (i.e., each combination of m and k) calculate the mean (Embedded Image) standardized log likelihood of the testing data partition across X replicates, as well as its SE (Embedded Image) and 95% confidence interval (Embedded Image).
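The standardization and summary steps above can be sketched as follows. This is a minimal illustration that assumes the mean test-partition log likelihoods have already been computed for every model in every replicate. The dictionary layout, the function names, and the use of 1.96 standard errors for the 95% interval are assumptions made for the example.

```python
import math


def standardize(loglik_by_model):
    """Subtract the best model's mean log likelihood within one replicate,
    so that the best model scores 0 and every other model scores below 0."""
    best = max(loglik_by_model.values())
    return {model: value - best for model, value in loglik_by_model.items()}


def summarize(replicates):
    """Summarize standardized log likelihoods across replicates.

    replicates : list of dicts, one per data partition, mapping a
                 (model, k) key to the mean test log likelihood.
                 At least two replicates are required.
    Returns a dict mapping each key to (mean, se, (ci_low, ci_high)).
    """
    standardized = [standardize(rep) for rep in replicates]
    n = len(standardized)
    summary = {}
    for key in standardized[0]:
        values = [rep[key] for rep in standardized]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        se = math.sqrt(variance / n)
        # Normal-approximation interval (an assumption of this sketch).
        summary[key] = (mean, se, (mean - 1.96 * se, mean + 1.96 * se))
    return summary
```

Because the subtraction happens within each replicate before averaging, a partition that is hard to fit for every model does not inflate the spread of the summary.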

In other words, the “predictive accuracy” values shown in conStruct cross-validation results are in units of improvement in log likelihood of a given model relative to the best model for that partitioning of the data, averaged over data partitions. The standardization is necessary because different data partitions can be systematically more or less difficult to fit, resulting in greater differences in mean training data log likelihood between data partitions than between models fit to the same partition.

If the genomic coordinates of the loci are known, the training/testing partitioning should be designed to accommodate LD. Loci in strong LD are not inherited independently. If loci from a single linkage block are included in both the training and testing partitions, the independence of the test is compromised, because parameters estimated from the training partition might describe process heterogeneity or noise in a region of the genome that also contributes loci to the testing partition. The best practice for cross-validation is therefore to ensure that no loci in the testing dataset are in strong LD with, or physically close on the genome to, loci in the training dataset.
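One simple way to respect linkage when partitioning is to assign whole genomic blocks, rather than individual loci, to the training or testing set. The sketch below is a hypothetical helper and is not part of conStruct. The fixed block size, the `(chromosome, position)` input format, and the function name are assumptions for illustration.

```python
import random


def partition_by_block(positions, block_size, test_fraction=0.1, seed=0):
    """Split locus indices into training and testing sets by genomic block.

    positions     : list of (chromosome, base_pair_position), one per locus
    block_size    : block width in base pairs; loci in the same block
                    always land in the same partition
    test_fraction : approximate fraction of blocks held out for testing
    Returns (train_indices, test_indices).
    """
    blocks = {}
    for index, (chrom, pos) in enumerate(positions):
        blocks.setdefault((chrom, pos // block_size), []).append(index)
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    n_test = max(1, round(test_fraction * len(block_ids)))
    test = sorted(i for b in block_ids[:n_test] for i in blocks[b])
    train = sorted(i for b in block_ids[n_test:] for i in blocks[b])
    return train, test
```

This holds out a fraction of blocks, so the fraction of loci held out is only approximately `test_fraction` when blocks differ in size. Loci on either side of a block boundary can still be physically close, so a stricter scheme might also drop training loci within some distance of a test block.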

Calculating Layer Contributions

Let A and B be randomly chosen alleles from samples i and j, respectively, at a randomly chosen locus. Then, if we let Embedded Image and Embedded Image, since U and V take the values Embedded Image, so as in Eq. 12,

Embedded Image

To translate, Embedded Image is the probability that the alleles from our two focal samples agree with each other, while Embedded Image is the probability that they disagree. This implies that Embedded Image, where Embedded Image is the probability that two randomly chosen alleles differ, which is the genetic divergence.

Now, here is a generative model that gives us the form of the covariance we have postulated. To decide whether or not A and B will agree, first each sample randomly chooses a layer: call these layers I and J. The probability that A chooses layer k is w_i^(k), the ith sample’s admixture proportion in the kth layer. The same holds true for B. If they do not choose the same layer, the probability that they agree is Embedded Image. If they do choose the same layer, then they agree with a probability Embedded Image that depends on their distance apart. By the above, the probability of agreement is Embedded Image, and so we can define

Embedded Image

One way to summarize the contribution of each layer is to partition the probability of agreement into contributions due to agreement “in” each layer. So, the contribution from layer k to agreement between i and j is

Embedded Image

which is the probability, given that they agree, that they agree thanks to layer k. Because our signal comes from variation in covariance, we omit the Embedded Image terms (i.e., we condition on agreement not due to “background” levels of agreement in the interpretations above). Stated in this way, this quantity is the relative contribution of the kth layer to the (model-based) kinship coefficient between i and j.

This suggests defining the overall contribution of layer k to agreement, Embedded Image, to be the average of that quantity over i and j:

Embedded Image (16)

which is that layer’s contribution to agreement between samples summed over the upper triangle (excluding the diagonal) of the covariance matrix. We define the contribution of the kth layer, Embedded Image, as the relative contribution of the kth layer to total agreement:

Embedded Image (17)

This is the quantity that is plotted in Figure 4 and Figure 9.
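A minimal sketch of this summary follows. It assumes that a layer's per-pair contribution is the product of the two samples' admixture proportions in that layer and the layer's covariance for that pair, with the background term omitted; that reading, the input layout, and the function name are assumptions of the example rather than a transcription of Eqs. 16 and 17.

```python
def layer_contributions(admix, layer_covs):
    """Relative contribution of each layer to total agreement.

    admix      : admix[i][k] is the admixture proportion of sample i in layer k
    layer_covs : layer_covs[k][i][j] is the layer-k covariance between
                 samples i and j, excluding the shared background term
    Returns a list of K values that sum to 1.
    """
    n_samples = len(admix)
    totals = []
    for k, cov in enumerate(layer_covs):
        total = 0.0
        # Sum over the upper triangle, excluding the diagonal.
        for i in range(n_samples):
            for j in range(i + 1, n_samples):
                total += admix[i][k] * admix[j][k] * cov[i][j]
        totals.append(total)
    grand_total = sum(totals)
    return [t / grand_total for t in totals]
```

For example, with three samples whose admixture proportions are `[1, 0]`, `[0.5, 0.5]`, and `[0, 1]`, and a second layer whose covariance is three times that of the first, the second layer accounts for three quarters of the total.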

Simulation Details

We wished to simulate data under a model that had some biological realism, but at the same time had unambiguous true admixture proportions (so as to test the behavior of the method). This second requirement precluded scenarios of, e.g., recent secondary contact between populations expanding out of different refugia, which would have more biological realism, but no unambiguous ancestry proportions for admixed populations. Here, we describe in more detail the procedure we use to simulate our test dataset, using a cartoon schematic with Embedded Image as an example (Figure A1).

Figure A1. Schematic of how we simulate datasets with continuous and discrete differentiation, using Embedded Image as an example. Going forward in time, the K populations split from a common ancestor at time Embedded Image, then expand to each colonize a lattice of demes with nearest-neighbor symmetric migration at time Embedded Image, then finally at time Embedded Image collapse into a single lattice consisting of demes with ancestry entirely in one or the other of the populations, or admixed between them.

Using the program ms (Hudson 2002), we generated discrete population structure by simulating K distinct populations, each of which split from a common ancestor Embedded Image units of coalescent time in the past, without subsequent migration between them. Then, to generate continuous differentiation within each population, at time Embedded Image in the past, each of these discrete populations instantaneously colonizes an independent lattice of demes, for which we use a stepping stone model with symmetric migration to nearest neighbors (eight neighbors, including diagonals).

Finally, at time Embedded Image in the past we generate a single dataset by collapsing those K discrete lattices into a single grid of demes that are admixed to various degrees from these different layers. We wish to simulate realistic patterns of admixture (and thereby set a more difficult test for the method) by generating spatially autocorrelated admixture proportions in each diverged population. To do so, we first place K equidistant points on the circle centered on our lattice. These points serve as “foci” of ancestry in each of the K layers. We then calculate the distance from each deme in the sampled lattice to each of these K foci, and draw admixture proportions for each deme from a Dirichlet distribution for which the concentration parameter for deme i in layer k is inversely proportional to the distance between deme i and focus k. This creates a pattern in which the admixture proportions in a given layer decrease with the distance from that layer’s focus, as might be expected if a spatial process were mediating admixture between diverged populations.
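The admixture-drawing step can be sketched as below. This is an illustration under assumptions, not the simulation script used for the paper: the circle radius, the proportionality constant `scale`, and the function names are hypothetical. Dirichlet draws are obtained by normalizing independent gamma variates, which is a standard construction.

```python
import math
import random


def ancestry_foci(n_layers, radius, center=(0.0, 0.0)):
    """Place n_layers equally spaced points on a circle around center."""
    return [
        (
            center[0] + radius * math.cos(2.0 * math.pi * f / n_layers),
            center[1] + radius * math.sin(2.0 * math.pi * f / n_layers),
        )
        for f in range(n_layers)
    ]


def draw_admixture(demes, foci, scale=1.0, rng=None):
    """Draw admixture proportions for each deme from a Dirichlet distribution
    whose concentration for layer k is scale / distance(deme, focus k)."""
    rng = rng or random.Random()
    proportions = []
    for x, y in demes:
        # Guard against a zero distance when a deme sits exactly on a focus.
        conc = [scale / max(math.hypot(x - fx, y - fy), 1e-9) for fx, fy in foci]
        draws = [rng.gammavariate(a, 1.0) for a in conc]
        total = sum(draws)
        if total == 0.0:
            # All gamma draws underflowed; put all ancestry in the nearest layer.
            nearest = conc.index(max(conc))
            proportions.append([1.0 if k == nearest else 0.0 for k in range(len(conc))])
        else:
            proportions.append([d / total for d in draws])
    return proportions
```

A larger `scale` makes the draws track the expected proportions more tightly, while a smaller value produces demes with ancestry concentrated in a single layer.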

The parameters used to simulate the data were as follows: a diploid population size of 1000, a migration rate between neighboring demes of 0.4, a deep split time between layers of 500 (corresponding to Embedded Image in Figure A1), an expansion event across layers of 250 (corresponding to Embedded Image in Figure A1), and an admixture event between layers in the immediate past (Embedded Image). The times and rates reported above have already been scaled by Embedded Image (as per ms syntax), and therefore give the values fed directly to ms. We used the -s option to sample a single segregating site per coalescent history, and simulated Embedded Image independent histories — corresponding to the same number of independent loci — in each dataset, with 10 diploid genotypes generated per deme at each locus.

Footnotes

  • Supplemental material available at Figshare: https://doi.org/10.25386/genetics.6840629.

  • Communicating editor: N. Barton

  • Received April 16, 2018.
  • Accepted July 16, 2018.
  • Copyright © 2018 Bradburd et al.

Available freely online through the author-supported open access option.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Literature Cited

  1. ↵
    1. Alexander D. H.,
    2. Lange K.
    , 2011 Enhancements to the admixture algorithm for individual ancestry estimation. BMC Bioinformatics 12: 246. doi:10.1186/1471-2105-12-246
    OpenUrlCrossRefPubMedWeb of Science
  2. ↵
    1. Alexander D. H.,
    2. Novembre J.,
    3. Lange K.
    , 2009 Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19: 1655–1664. doi:10.1101/gr.094052.109
    OpenUrlAbstract/FREE Full Text
  3. ↵
    1. Barton N. H.
    , 2008 The effect of a barrier to gene flow on patterns of geographic variation. Genet. Res. 90: 139–149. doi:10.1017/S0016672307009081
    OpenUrlCrossRefPubMed
  4. ↵
    1. Barton N. H.,
    2. Depaulis F.,
    3. Etheridge A. M.
    , 2002 Neutral evolution in spatially continuous populations. Theor. Popul. Biol. 61: 31–48. doi:10.1006/tpbi.2001.1557
    OpenUrlCrossRefPubMedWeb of Science
  5. ↵
    1. Barton N. H.,
    2. Etheridge A. M.,
    3. Véber A.
    , 2013 Modelling evolution in a spatial continuum. J. Stat. Mech. 2013: P01002. doi:10.1088/1742-5468/2013/01/P01002
    OpenUrlCrossRef
  6. ↵
    Bradburd, G. S., P. L. Ralph, and G. M. Coop, 2013 Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 67: 3258–3273. doi:10.1111/evo.12193
    OpenUrlCrossRefPubMed
  7. ↵
    Bradburd, G. S., P. L. Ralph, and G. M. Coop, 2016 A spatial framework for understanding population structure and admixture. PLoS Genet. 12: e1005703. doi:10.1371/journal.pgen.1005703
    OpenUrlCrossRefPubMed
  8. ↵
    1. Byun S. A.,
    2. Koop B. F.,
    3. Reimchen T. E.
    , 1997 North American black bear mtDNA phylogeography: implications for morphology and the haida gwaii glacial refugium controversy. Evolution 51: 1647–1653.
    OpenUrlCrossRefWeb of Science
    1. Carpenter B.,
    2. Gelman A.,
    3. Hoffman M. D.,
    4. Lee D.,
    5. Goodrich B.,
    6. et al.
    2017 Stan: a probabilistic programming language. J. Stat. Softw. 76: doi:10.18637/jss.v076.i01
    OpenUrlCrossRef
  9. ↵
    1. Cavalli-Sforza L. L.,
    2. Piazza A.
    , 1975 Analysis of evolution: evolutionary rates, independence and treeness. Theor. Popul. Biol. 8: 127–165. doi:10.1016/0040-5809(75)90029-5
    OpenUrlCrossRefPubMedWeb of Science
  10. ↵
    1. Caye K.,
    2. Jay F.,
    3. Michel O.,
    4. François O.
    , 2018 Fast inference of individual admixture coefficients using geographic data. Ann. Appl. Stat. 12: 586–608. doi:10.1214/17-AOAS1106
    OpenUrlCrossRef
  11. ↵
    1. Corander J.,
    2. Waldmann P.,
    3. Sillanpää M. J.
    , 2003 Bayesian analysis of genetic differentiation between populations. Genetics 163: 367–374.
    OpenUrlAbstract/FREE Full Text
  12. ↵
    1. Cronk Q. C. B.
    , 2005 Plant eco-devo: the potential of poplar as a model organism. New Phytol. 166: 39–48. doi:10.1111/j.1469-8137.2005.01369.x
    OpenUrlCrossRefPubMedWeb of Science
  13. ↵
    1. Diggle P. J.,
    2. Tawn J. A.,
    3. Moyeed R. A.
    , 1998 Model-based geostatistics. J. Roy. Stat. Soc. C. Appl. Stat. 47: 299–350.
    OpenUrl
  14. ↵
    1. Eckenwalder J. E.
    , 1984 Natural intersectional hybridization between North American species of Populus (salicaceae) in sections Aigeiros and Tacamahaca. ii. Taxonomy. Can. J. Bot. 62: 325–335. doi:10.1139/b84-051
    OpenUrlCrossRef
  15. ↵
    1. Engelhardt B. E.,
    2. Stephens M.
    , 2010 Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genet. 6: e1001117. doi:10.1371/journal.pgen.1001117
    OpenUrlCrossRefPubMed
  16. ↵
    1. Epperson B.
    , 2003 Geographical Genetics. Monographs in Population Biology. Princeton University Press, Princeton, NJ.
  17. ↵
    1. Evanno G.,
    2. Regnaut S.,
    3. Goudet J.
    , 2005 Detecting the number of clusters of individuals using the software structure: a simulation study. Mol. Ecol. 14: 2611–2620. doi:10.1111/j.1365-294X.2005.02553.x
    OpenUrlCrossRefPubMedWeb of Science
  18. ↵
    1. Falush D.,
    2. Stephens M.,
    3. Pritchard J. K.
    , 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
    OpenUrlAbstract/FREE Full Text
  19. ↵
    1. Falush D.,
    2. van Dorp L.,
    3. Lawson D.
    , 2016 A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. bioRxiv 066431; doi:10.1101/066431.
    OpenUrlAbstract/FREE Full Text
  20. ↵
    1. Frantz A. C.,
    2. Cellina S.,
    3. Krier A.,
    4. Schley L.,
    5. Burke T.
    , 2009 Using spatial Bayesian methods to determine the genetic structure of a continuously distributed population: clusters or isolation by distance? J. Appl. Ecol. 46: 493–505. doi:10.1111/j.1365-2664.2008.01606.x
    OpenUrlCrossRef
  21. ↵
    1. Geraldes A.,
    2. DiFazio S. P.,
    3. Slavov G. T.,
    4. Ranjan P.,
    5. Muchero W.,
    6. et al.
    2013 A 34k SNP genotyping array for Populus trichocarpa: design, application to the study of natural populations and transferability to other populus species. Mol. Ecol. Resour. 13: 306–323. doi:10.1111/1755-0998.12056
    OpenUrlCrossRefPubMed
  22. ↵
    1. Geraldes A.,
    2. Farzaneh N.,
    3. Grassa C. J.,
    4. McKown A. D.,
    5. Guy R. D.,
    6. et al.
    2014 Landscape genomics of Populus trichocarpa: the role of hybridization, limited gene flow, and natural selection in shaping patterns of population structure. Evolution 68: 3260–3280. doi:10.1111/evo.12497
    OpenUrlCrossRef
  23. ↵
    1. Guillot G.,
    2. Mortier F.,
    3. Estoup A.
    , 2005 Geneland: a computer package for landscape genetics. Mol. Ecol. Notes 5: 712–715. doi:10.1111/j.1471-8286.2005.01031.x
    OpenUrlCrossRefWeb of Science
  24. ↵
    1. Haak W.,
    2. Lazaridis I.,
    3. Patterson N.,
    4. Rohland N.,
    5. Mallick S.,
    6. et al.
    2015 Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522: 207–211. doi:10.1038/nature14317
    OpenUrlCrossRefPubMed
  25. ↵
    1. Hoffman M. D.,
    2. Gelman A.
    , 2014 The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. arXiv:1111.4246v1
  26. ↵
    1. Hubisz M. J.,
    2. Falush D.,
    3. Stephens M.,
    4. Pritchard J. K.
    , 2009 Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour. 9: 1322–1332. doi:10.1111/j.1755-0998.2009.02591.x
    OpenUrlCrossRefPubMed
  27. ↵
    1. Hudson R. R.
    , 2002 Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338. doi:10.1093/bioinformatics/18.2.337
    OpenUrlCrossRefPubMedWeb of Science
  28. ↵
    1. Huelsenbeck J. P.,
    2. Andolfatto P.
    , 2007 Inference of population structure under a Dirichlet process model. Genetics 175: 1787–1802. doi:10.1534/genetics.106.061317
    OpenUrlAbstract/FREE Full Text
  29. ↵
    1. Irwin D. E.,
    2. Bensch S.,
    3. Price T. D.
    , 2001 Speciation in a ring. Nature 409: 333–337. doi:10.1038/35053059
    OpenUrlCrossRefPubMedWeb of Science
  30. ↵
    1. Joseph T. A.,
    2. Pe’er I.
    , 2018 Inference of population structure from ancient DNA. bioRxiv 261131; doi: 10.1101/261131.
    OpenUrlAbstract/FREE Full Text
  31. ↵
    1. Keller S. R.,
    2. Olson M. S.,
    3. Silim S.,
    4. Schroeder W.,
    5. Tiffin P.
    , 2010 Genomic diversity, population structure, and migration following rapid range expansion in the balsam poplar, Populus balsamifera. Mol. Ecol. 19: 1212–1226. doi:10.1111/j.1365-294X.2010.04546.x
    OpenUrlCrossRefPubMed
  32. ↵
    1. Kimura M.,
    2. Weiss G. H.
    , 1964 The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics 49: 561–576.
    OpenUrlFREE Full Text
  33. ↵
    1. Lawson D. J.,
    2. Hellenthal G.,
    3. Myers S.,
    4. Falush D.
    , 2012 Inference of population structure using dense haplotype data. PLoS Genet. 8: e1002453. doi:10.1371/journal.pgen.1002453
    OpenUrlCrossRefPubMed
  34. ↵
    1. Lazaridis I.,
    2. Patterson N.,
    3. Mittnik A.,
    4. Renaud G.,
    5. Mallick S.,
    6. et al.
    2014 Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513: 409–413. doi:10.1038/nature13673
    OpenUrlCrossRefPubMedWeb of Science
  35. ↵
    1. Linck E. B.,
    2. Battey C. J.
    , 2017 Minor allele frequency thresholds strongly affect population structure inference with genomic datasets. bioRxiv. 188623; doi:10.1101/188623
    OpenUrlAbstract/FREE Full Text
  36. ↵
    1. Malécot G.
    , 1969 The Mathematics of Heredity. W. H. Freeman, San Francisco.
  37. ↵
    1. McKown A. D.,
    2. Guy R. D.,
    3. Klapste J.,
    4. Geraldes A.,
    5. Friedmann M.,
    6. et al.
    2014 Geographical and environmental gradients shape phenotypic trait variation and genetic structure in Populus trichocarpa. New Phytol. 201: 1263–1276. doi:10.1111/nph.12601
    OpenUrlCrossRefPubMedWeb of Science
  38. ↵
    1. McVean G.
    , 2009 A genealogical interpretation of principal components analysis. PLoS Genet. 5: e1000686. doi:10.1371/journal.pgen.1000686
    OpenUrlCrossRefPubMed
  39. ↵
    1. Meirmans P.
    , 2009 Genodive version 2.0 b14. Computer software distributed by the author. Accessed : May 12th, 2018. Available at: http://www. bentleydrummer.nl/software/software/GenoDive.html.
  40. ↵
    1. Meirmans P. G.
    , 2012 The trouble with isolation by distance. Mol. Ecol. 21: 2839–2846. doi:10.1111/j.1365-294X.2012.05578.x
    OpenUrlCrossRefPubMedWeb of Science
  41. ↵
    1. Menozzi P.,
    2. Piazza A.,
    3. Cavalli-Sforza L.
    , 1978 Synthetic maps of human gene frequencies in Europeans. Science 201: 786–792. doi:10.1126/science.356262
    OpenUrlAbstract/FREE Full Text
  42. ↵
    1. Moritz C.
    , 1994 Defining “evolutionarily significant units” for conservation. Trends Ecol. Evol. 9: 373–375. doi:10.1016/0169-5347(94)90057-4
    OpenUrlCrossRefPubMedWeb of Science
  43. ↵
    1. Moritz C.,
    2. Funk V.,
    3. Sakai A. K.
    , 2002 Strategies to protect biological diversity and the evolutionary processes that sustain it. Syst. Biol. 51: 238–254. doi:10.1080/10635150252899752
    OpenUrlCrossRefPubMedWeb of Science
  44. ↵
    1. Nagylaki T.
    , 1978 A diffusion model for geographically structured populations. J. Math. Biol. 6: 375–382. doi:10.1007/BF02463002
    OpenUrlCrossRef
  45. ↵
    1. Nagylaki T.,
    2. Barcilon V.
    , 1988 The influence of spatial inhomogeneities of neutral models of geographical variation. II. The semi-infinite linear habitat. Theor. Popul. Biol. 33: 311–343. doi:10.1016/0040-5809(88)90018-4
    OpenUrlCrossRef
  46. ↵
    1. Nielsen R.,
    2. Akey J. M.,
    3. Jakobsson M.,
    4. Pritchard J. K.,
    5. Tishkoff S.,
    6. et al.
    , 2017 Tracing the peopling of the world through genomics. Nature 541: 302–310. doi:10.1038/nature21347
    OpenUrlCrossRefPubMed
  47. ↵
    1. Novembre J.,
    2. Stephens M.
    , 2008 Interpreting principal component analyses of spatial population genetic variation. Nat. Genet. 40: 646–649. doi:10.1038/ng.139
    OpenUrlCrossRefPubMedWeb of Science
  48. ↵
    1. Patterson N.,
    2. Price A. L.,
    3. Reich D.
    , 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190. doi:10.1371/journal.pgen.0020190
    OpenUrlCrossRefPubMed
  49. ↵
    1. Patterson N.,
    2. Moorjani P.,
    3. Luo Y.,
    4. Mallick S.,
    5. Rohland N.,
    6. et al.
    , 2012 Ancient admixture in human history. Genetics 192: 1065–1093. doi:10.1534/genetics.112.145037
    OpenUrlAbstract/FREE Full Text
  50. ↵
    1. Peter B. M.
    , 2016 Admixture, population structure and f-statistics. Genetics 202: 1485–1501. doi:10.1534/genetics.115.183913
    OpenUrlAbstract/FREE Full Text
  51. ↵
    1. Petkova D.,
    2. Novembre J.,
    3. Stephens M.
    , 2016 Visualizing spatial population structure with estimated effective migration surfaces. Nat. Genet. 48: 94–100. doi:10.1038/ng.3464
    OpenUrlCrossRefPubMed
  52. ↵
    1. Picard R. R.,
    2. Cook R. D.
    , 1984 Cross-validation of regression models. J. Am. Stat. Assoc. 79: 575–583. doi:10.1080/01621459.1984.10478083
    OpenUrlCrossRefPubMedWeb of Science
  53. ↵
    1. Pickrell J. K.,
    2. Pritchard J. K.
    , 2012 Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8: e1002967. doi:10.1371/journal.pgen.1002967
    OpenUrlCrossRefPubMed
  54. ↵
    1. Price A. L.,
    2. Patterson N. J.,
    3. Plenge R. M.,
    4. Weinblatt M. E.,
    5. Shadick N. A.,
    6. et al.
    , 2006 Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38: 904–909. doi:10.1038/ng1847
    OpenUrlCrossRefPubMedWeb of Science
  55. ↵
    1. Pritchard J. K.,
    2. Stephens M.,
    3. Donnelly P.
    , 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
    OpenUrlAbstract/FREE Full Text
  56. ↵
    1. Puckett E. E.,
    2. Etter P. D.,
    3. Johnson E. A.,
    4. Eggert L. S.
    , 2015 Phylogeographic analyses of American black bears (Ursus americanus) suggest four glacial refugia and complex patterns of postglacial admixture. Mol. Biol. Evol. 32: 2338–2350. doi:10.1093/molbev/msv114
    OpenUrlCrossRefPubMed
  57. ↵
    1. Raj A.,
    2. Stephens M.,
    3. Pritchard J. K.
    , 2014 fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197: 573–589. doi:10.1534/genetics.114.164350
    OpenUrlAbstract/FREE Full Text
  58. ↵
    1. Ringbauer H.,
    2. Kolesnikov A.,
    3. Field D. L.,
    4. Barton N. H.
    , 2018 Estimating barriers to gene flow from distorted isolation-by-distance patterns. Genetics 208: 1231–1245. doi:10.1534/genetics.117.300638
    OpenUrlAbstract/FREE Full Text
  59. ↵
    1. Rosenberg N. A.,
    2. Mahajan S.,
    3. Ramachandran S.,
    4. Zhao C.,
    5. Pritchard J. K.,
    6. et al.
    , 2005 Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet. 1: e70.
    OpenUrlCrossRefPubMed
  60. ↵
    1. Sawyer S.
    , 1976 Results for the stepping stone model for migration in population genetics. Ann. Probab. 4: 699–728. doi:10.1214/aop/1176995980
    OpenUrlCrossRef
  61. ↵
    1. Schraiber J.
    , 2017 Assessing the relationship of ancient and modern populations. bioRxiv. 113779; doi:10.1101/113779
    OpenUrlAbstract/FREE Full Text
  62. ↵
    1. Serre D.,
    2. Pääbo S.
    , 2004 Evidence for gradients of human genetic diversity within and among continents. Genome Res. 14: 1679–1685. doi:10.1101/gr.2529604
    OpenUrlAbstract/FREE Full Text
  63. ↵
    1. Sexton J. P.,
    2. Hangartner S. B.,
    3. Hoffmann A. A.
    , 2014 Genetic isolation by environment or distance: which pattern of gene flow is most common? Evolution 68: 1–15. doi:10.1111/evo.12258
    OpenUrlCrossRefPubMedWeb of Science
  64. ↵
    Shiga, T., 1988 Stepping stone models in population genetics and population dynamics, pp. 345–355 in Stochastic Processes in Physics and Engineering (Bielefeld, 1986), Vol. 42 of Math. Appl. Reidel, Dordrecht, The Netherlands. doi:10.1007/978-94-009-2893-0_18.
    OpenUrl
  65. ↵
    1. Slatkin M.
    , 1985 Gene flow in natural populations. Annu. Rev. Ecol. Syst. 16: 393–430. doi:10.1146/annurev.es.16.110185.002141
    OpenUrlCrossRefWeb of Science
  66. ↵
    1. Slatkin M.,
    2. Racimo F.
    , 2016 Ancient DNA and human history. Proc. Natl. Acad. Sci. USA 113: 6380–6387. doi:10.1073/pnas.1524306113
    OpenUrlAbstract/FREE Full Text
  67. ↵
    1. Slavov G. T.,
    2. DiFazio S. P.,
    3. Martin J.,
    4. Schackwitz W.,
    5. Muchero W.,
    6. et al.
    2012 Genome resequencing reveals multiscale geographic structure and extensive linkage disequilibrium in the forest tree Populus trichocarpa. New Phytol. 196: 713–725. doi:10.1111/j.1469-8137.2012.04258.x
    OpenUrlCrossRefPubMedWeb of Science
  68. ↵
    1. Stan Development Team
    , 2015 Stan: A C++ library for probability and sampling, version 2.10.0. http://mc-stan.org
  69. ↵
    1. Stan Development Team
    , 2016 RStan: the R interface to Stan, version 2.10.1. http://mc-stan.org
  70. ↵
    1. Stone K. D.,
    2. Cook J. A.
    , 2000 Phylogeography of black bears (Ursus americanus) of the pacific northwest. Can. J. Zool. 78: 1218–1223. doi:10.1139/z00-042
    OpenUrlCrossRefWeb of Science
  71. ↵
    1. Suarez-Gonzalez A.,
    2. Hefer C. A.,
    3. Christe C.,
    4. Corea O.,
    5. Lexer C.,
    6. et al.
    2016 Genomic and functional approaches reveal a case of adaptive introgression from Populus balsamifera (balsam poplar) in P. trichocarpa (black cottonwood). Mol. Ecol. 25: 2427–2442. doi:10.1111/mec.13539
    OpenUrlCrossRef
  72. ↵
    1. Verity R.,
    2. Nichols R. A.
    , 2016 Estimating the number of subpopulations (K) in structured populations. Genetics 203: 1827–1839. doi:10.1534/genetics.115.180992
    OpenUrlAbstract/FREE Full Text
  73. ↵
    1. Wake D. B.,
    2. Schneider C. J.
    , 1998 Taxonomy of the plethodontid salamander genus ensatina. Herpetologica 54: 279–298.
    OpenUrlWeb of Science
  74. ↵
    1. Waples R.
    , 1998 Separating the wheat from the chaff: patterns of genetic differentiation in high gene flow species. J. Hered. 89: 438–450. doi:10.1093/jhered/89.5.438
    OpenUrlCrossRefWeb of Science
  75. ↵
    1. Wasser S. K.,
    2. Shedlock A. M.,
    3. Comstock K.,
    4. Ostrander E.,
    5. Mutayoba B.,
    6. et al.
    2004 Assigning African elephant DNA to geographic region of origin: applications to the ivory trade. Proc. Natl. Acad. Sci. USA 101: 14847–14852. doi:10.1073/pnas.0403170101
    OpenUrlAbstract/FREE Full Text
  76. ↵
    1. Wooding S.,
    2. Ward R.
    , 1997 Phylogeography and pleistocene evolution in the North American black bear. Mol. Biol. Evol. 14: 1096–1105. doi:10.1093/oxfordjournals.molbev.a025719
    OpenUrlCrossRefPubMedWeb of Science
  77. ↵
    1. Wright S.
    , 1943 Isolation by distance. Genetics 28: 114–138.
    OpenUrlFREE Full Text
  78. ↵
    1. Wright S.
    , 1949 The genetical structure of populations. Ann. Eugen. 15: 323–354. doi:10.1111/j.1469-1809.1949.tb02451.x
    OpenUrlCrossRefPubMed
  79. Xie C.-Y., Ying C. C., Yanchuk A. D., Holowachuk D. L., 2009 Ecotypic mode of regional differentiation caused by restricted gene migration: a case in black cottonwood (Populus trichocarpa) along the Pacific Northwest coast. Can. J. For. Res. 39: 519–525. doi:10.1139/X08-190
  80. Xie C. Y., Carlson M. R., Ying C. C., 2012 Ecotypic mode of regional differentiation of black cottonwood (Populus trichocarpa) due to restricted gene migration: further evidence from a field test on the northern coast of British Columbia. Can. J. For. Res. 42: 400–405. doi:10.1139/x11-187
  81. Zhou L., Holliday J. A., 2012 Targeted enrichment of the black cottonwood (Populus trichocarpa) gene space using sequence capture. BMC Genomics 13: 703. doi:10.1186/1471-2164-13-703

PUBLICATION INFORMATION

Volume 210 Issue 1, September 2018


ARTICLE CLASSIFICATION

INVESTIGATIONS
Statistical Genetics and Genomics

Inferring Continuous and Discrete Population Genetic Structure Across Space

Gideon S. Bradburd, Graham M. Coop and Peter L. Ralph
Genetics September 1, 2018 vol. 210 no. 1 33-52; https://doi.org/10.1534/genetics.118.301333
Gideon S. Bradburd
Ecology, Evolutionary Biology, and Behavior Graduate Group, Department of Integrative Biology, Michigan State University, East Lansing, Michigan 48824
  • For correspondence: bradburd@msu.edu
Graham M. Coop
Center for Population Biology, Department of Evolution and Ecology, University of California, Davis, California 95616
Peter L. Ralph
Institute of Ecology and Evolution, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97403

Copyright © 2018 by the Genetics Society of America
