Abstract
Admixture between long-separated populations is a defining feature of the genomes of many species. The mosaic block structure of admixed genomes can provide information about past contact events, including the time and extent of admixture. Here, we describe an improved wavelet-based technique that better characterizes ancestry block structure from observed genomic patterns. principal components analysis is first applied to genomic data to identify the primary population structure, followed by wavelet decomposition to develop a new characterization of local ancestry information along the chromosomes. For testing purposes, this method is applied to human genome-wide genotype data from Indonesia, as well as virtual genetic data generated using genome-scale sequential coalescent simulations under a wide range of admixture scenarios. Time of admixture is inferred using an approximate Bayesian computation framework, providing robust estimates of both admixture times and their associated levels of uncertainty. Crucially, we demonstrate that this revised wavelet approach, which we have released as the R package adwave, provides improved statistical power over existing wavelet-based techniques and can be used to address a broad range of admixture questions.
ADMIXTURE occurs when previously separated populations interact and merge. This process has been instrumental in human history, with most global groups showing at least some signals of population merger (Hellenthal et al. 2014). The admixture process produces “mosaic” genomes with alternating blocks of DNA from each ancestral population. Over time, recombination decreases the length of these ancestry blocks, and therefore the distribution of block sizes is informative about the time of admixture. However, the extent to which these patterns can provide additional information about historic admixture processes is still a young area of exploration.
A range of methods have been developed to partition the genome of an admixed individual into ancestry blocks based on raw genomic data (Falush et al. 2003; Price et al. 2009). Some methods assign ancestry directly. For instance, HAPMIX uses a hidden Markov model to estimate the break points of ancestry blocks, while other approaches define ancestry blocks using simple empirical criteria, such as strings of shared vs. nonshared polymorphisms (Pool and Nielsen 2009) or the differential presence of population-specific variants (Brown and Pasaniuc 2014). Another set of methods is more indirect. ROLLOFF (Moorjani et al. 2011), LAMP (Baran et al. 2012), and ALDER (Loh et al. 2013) all search for rapid changes in linkage disequilibrium to define the borders of ancestry blocks, while other approaches assign ancestry for predefined genomic windows using conditional random fields (Maples et al. 2013) or principal component analysis (PCA) (Gravel 2012).
These methods vary in their effectiveness. Simple empirical criteria perform surprisingly well for admixture between species (as for the mouse admixture zone studied by Pool and Nielsen 2009). Similarly, most of these methods tend to be highly accurate for recent admixture between well-separated human groups (such as African Americans or American Latinos). Indeed, in these settings, subtleties such as multiple waves of admixture have even be detected (Gravel 2012). However, reconstructing complex demographic features for much older admixture events (i.e., thousands rather than hundreds of years in the past) remains extremely challenging (Moorjani et al. 2011). While methods have in principle been proposed to detect multiple waves of ancient admixture, in many realistic settings they are still restricted to single admixture events (Loh et al. 2013), although some evidence for multiple ancient admixture events has been presented for several Indian populations (Moorjani et al. 2011).
Other indirect methods look increasingly promising in this “old admixture” space. Approaches based on principal components analysis and wavelets have been employed with some success. PCA is a nonparametric data-reduction technique, which has been used widely to identify patterns of population structure in genetic data (Patterson et al. 2006; Novembre and Stephens 2008; McVean 2009; Bryc et al. 2010; Ma and Amos 2012). Dispersion of admixed individuals along the first principal component connecting ancestral populations can be used as a diagnostic for two-way admixture (Patterson et al. 2006; Mcvean 2009). For instance, PCAdmix employs PCA to assign ancestry to localized windows along the genome for each individual (Brisbin et al. 2012). Pugach et al. (2011) also use PCA, but do not directly assign ancestry to genomic regions, instead applying a wavelet transform to obtain an indirect measure of the average admixture block length. While this approach has been shown to be powerful for dating old admixture events, there remains considerable scope for (i) the development of more sophisticated wavelet constructions, (ii) examining the resulting wavelet decompositions in greater detail (particularly to identify aspects of non-time-related information in the transformed data), and (iii) to provide a more user-friendly software solution for wavelet analysis.
Wavelet techniques themselves are an active and evolving area, with much potential for novel application in population genetics, as highlighted in the review article by Liò (2003). Wavelets can be thought of as localized, oscillatory functions and are particularly useful for representing data that has local features such as sharp changes and discontinuities. In the context of genome-wide single nucleotide polymorphism (SNP) data, wavelets can be used to represent the mosaic pattern of ancestry blocks. A wavelet decomposition of the data provides information on the size of the ancestry blocks and, importantly, how they are distributed along the chromosomes. Summary measures of the wavelet decomposition allow aspects of the admixture process to be reconstructed, such as the time of admixture and admixture proportions.
Here, we present a substantially revised wavelet-based approach to describe population admixture that builds on the work of Pugach et al. (2011). This new method has significantly fewer model assumptions and allows us to identify more complex demographic processes, such as multiple admixture events. As with previous methods, PCA is first employed to describe the population structure. The maximal overlap discrete wavelet transform (MODWT) is then applied directly to the SNP-level data, without the need to compute averages over localized genomic windows as implemented in related procedures (Pugach et al. 2011; Brisbin et al. 2012). Instead, windowing is performed naturally and objectively as part of the wavelet decomposition procedure. We show that this new method provides robust estimates of admixture time (including improved control of uncertainty estimates), as well as recognizing other aspects of admixture processes that previous wavelet-based methods have not been able to identify with any accuracy.
Methods
General framework
Initially, we consider a simple admixture scenario where two ancestral populations PA and PB merged T generations ago to form the admixed population PC. The ancestral populations contribute to the admixed population with probabilities p and . The sizes of the populations, the admixture time, and the admixture proportions are free to vary.
To quantify patterns of genomic block size variation, a three-step analysis procedure was used: (i) PCA was applied to the genomic data to describe population structure; (ii) the wavelet variance was computed to provide a scale-by-scale decomposition of the variance for each population; and (iii) the portion of this measure that is informative for admixture processes was extracted relative to background levels observed in the ancestral populations.
Data simulation
Genome-wide SNP data were simulated using the sequential coalescent simulator MaCS (Chen et al. 2009). Because our primary interest is in the admixture history of Island Southeast Asia (see Real genomic data section below), we chose parameter settings that produce genomic data that broadly fit observed patterns of genetic diversity in this study region (Cox et al. 2008). The demographic model, parameters, and information sources are described in more detail in the Supporting Information (Figure S1). We emphasize, however, that the method we describe is general and can be applied to most admixed genomic systems.
Data setup
Given an admixed population PC derived from two ancestral populations PA and PB, the number of individuals in the analysis (i.e., present day samples) is n = nA = nB + nC. For each individual i, we observe a collection of T SNPs along a chromosome. Thus the raw data matrix X is a matrix with T genotype counts in columns and n individuals in rows. The SNPs s are ordered by their physical positions along the chromosome, with the cells of the data matrix
taking the value 0 if heterozygous, and arbitrarily −1 or 1 for the alternative homozygous states. Prior to principal components analysis, the data matrix is centered such that the column mean with respect to the ancestral reference populations is zero, giving

Principal components analysis
PCA is performed using only individuals from the ancestral populations. Rather than performing PCA on all samples combined, this approach has the advantage that other features of the admixed sample (such as admixture from additional ancestral populations) will not influence the projection (McVean 2009). The first eigenvector reflects the primary population structure. Projection of individuals onto this axis of variation is given by
(1)The proportion of ancestry inherited from population PA can be estimated for each individual (or population) using the distance from the centroids of the ancestral populations; that is,
, where
and
are the centroids of the ancestral populations along the first principal axis (Bryc et al. 2010). Note that variation between individuals within a population is represented by the smaller eigenvalues and corresponding eigenvectors (Ma and Amos 2010).
This representation of admixed individuals in PCA space, as shown in Figure 1A, provides a genome-wide estimate of average ancestry, but does not indicate how admixture tracts are distributed along the chromosomes. To obtain localized estimates, the projection is performed at the SNP level rather than summing over the length of the genome as in Equation 1. The raw SNP-level admixture signals are given by (2)where
for
. The additional terms in Equation 2 ensure that the signals are normalized such that the mean of the ancestral populations are arbitrarily
and
. This normalization step makes the measure robust to uneven sample sizes, which can affect the structure of the PCA (Novembre and Stephens 2008; Mcvean 2009). Stability of the signals is maintained by specifying a tolerance ϵ for separating the ancestral populations at a given SNP. This ensures that SNPs with poor discrimination are treated as uninformative in the next step of the analysis.
Simulated example with 13,000 SNPs, 15 diploid individuals in ancestral populations (PA, PB), and 20 diploid individuals in the admixed population (PC). Populations are shown in green (PA), blue (PB), and red (PC). (A) PCA is used to describe the primary population structure; (B) raw wavelet variance for each population illustrates high frequency noise; (C) informative variation in the admixed population after standard correction for noise estimated from the ancestral populations. Note that this example uses the default threshold μ = 1.
Wavelet transform
The resulting SNP-level admixture signals indicate how ancestry varies along the genome, but they invariably exhibit a high noise-to-information ratio. To interpret the signal, its frequency content can be described using the wavelet variance (Percival 1995). The wavelet variance for scales
provides a scale-by-scale decomposition of the variance of the signal. The first scale (
) captures the highest frequency patterns, representing very local information. Increasing the scale index provides successively coarser, or lower frequency information, equivalent to “zooming out” on the signal until the level of the entire chromosome is reached. A plot of
vs. j indicates which scales are important contributors to the process variance and indirectly provides information about the distribution of admixture tracts. For example, recent admixture produces a peak in the wavelet variance at a large wavelet scale, reflecting long admixture tracts, while more ancient admixture events produce peaks at lower wavelet scales, reflecting shorter admixture tracts.
The wavelet variance for an individual i is given by (3)where
are the wavelet coefficients for the signal
constructed using the wavelet system
. To appreciate the methodology, it is sufficient to understand that the wavelet variance reflects the frequency content of the signal, but more detailed background material is provided in the Supporting Information (Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1). Our implementation employs Daubechies’ least asymmetric wavelet number 8 in the waveslim (Whitcher 2013) package of the statistical software R (R Development Core Team 2014). We emphasize, however, that the methods proposed here are robust to other choices of analyzing wavelet (see Supporting Information, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1).
Population averages are computed as (and similarly for populations PA and PB). An example of the average wavelet variance for each population is shown in Figure 1B. The wavelet variance is highest at fine scales, but as the ancestral populations also show this pattern, it should be considered background noise. It is intuitive that the very finest wavelet scales are uninformative because small numbers of SNPs should be insufficient to differentiate between populations. The raw wavelet variance is therefore considered as a combination of informative variation and background noise
(4)To extract the informative variance
, we subtract the proportion that can be attributed to noise. This is estimated from the variation observed in the ancestral populations;
, where μ is a multiplicative factor that allows the degree of thresholding to be controlled. Under almost all conditions, a default value of
may be assumed, and this threshold should be raised only if the admixture signals exhibit high levels of noise (see Supporting Information, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1 for details). Population characteristics that influence noise levels in the admixture signals are explored in the next section. The final measure of the informative variance is given by
(5)which describes the frequency content that is unique to the admixed population (in contrast to the ancestral populations).
Real genomic data
To illustrate that our method performs well in real-world situations, it was applied to a SNP genotyping chip data set of 394 individuals from 16 communities spread across the Indonesian archipelago (Table 1). Equivalent SNP data from Southern Han Chinese and Papua New Guinea Highlanders were used as proxies for the ancestral populations. Permission to conduct research in Indonesia was granted by the Indonesian Institute of Sciences. Blood samples or buccal swabs were collected from consenting, closely unrelated, and seemingly healthy individuals by J. Stephen Lansing (University of Arizona) and Herawati Sudoyo (Eijkman Institute for Molecular Biology, Indonesia), with the assistance of Indonesian Public Health clinic staff. All sample collection followed protocols for the protection of human subjects established by both the Eijkman Institute and the University of Arizona institutional review boards. Participant interviews confirmed local residence for at least two generations into the past. Samples were genotyped with the Affymetrix Axiom chip, yielding 548,994 SNPs across the autosomes. (Sex-linked markers were excluded from the analysis.) The SNP data were cleaned using standard protocols in PLINK v. 1.07 (Purcell et al. 2007; Purcell 2009) and the wavelet transform performed as described above.
The approximate Bayesian computation analysis employed 1000 data sets with sample sizes and SNP numbers set to those of the real data. These data sets were simulated by drawing from a uniform prior of admixture times between 10 and 300 generations. The admixture proportion for the Bena population (used as our primary test case) was set to 0.6, as estimated previously from the real data. The ABS metric was calculated for each simulation, and the multiple chromosome structure of the data were mimicked by sampling each individual repeatedly with different data densities.
Results
As proof of concept, we first applied our wavelet method to simulated data. A range of admixture scenarios was explored by varying parameters of the demographic model, particularly the time of admixture, admixture proportion, and single vs. multiple admixture events. Fifty simulations were performed for each scenario with modest (but therefore realistic) ancestral sample sizes of nA = nB = 15 and an admixed sample size nC = 20.
Admixture time
Because the ability of wavelet methods to calculate the time of admixture is well known from earlier work (Pugach et al. 2011), we explored this feature first. Simulations were performed for admixture times ranging from 10 to 320 generations (i.e., from the recent past to ∼10,000 years ago, using a generation interval of 30 years; Fenner 2005). Admixture at 10 generations shows the highest informative wavelet variance at scale 13, reflecting relatively few, long admixture blocks (Figure 2). As the time of admixture occurs further back in the past, the peak in wavelet variance shifts toward successively lower wavelet scales, reflecting ever-smaller admixture blocks driven by cumulative recombination along the chromosome. The average frequency content can be characterized by the average block size metric ABS, termed the “wavelet center” by Pugach et al. (2011), which as shown later, can be used to date the admixture event
Informative wavelet variance for each time of admixture (10–320 generations using default thresholding ). Shaded bars represent the average over 50 simulations at each admixture time; black bars represent the range across individual simulations. The average block size metric for each scenario is indicated by a dotted blue line.

Admixture proportion
Admixture proportions were varied between 0.5 (equal ancestry from PA and PB) and 0.025 (ancestry predominately from PA). For this analysis, the time of admixture was fixed at 160 generations. As the proportion of admixture decreases, the raw wavelet variance exhibits increasing levels of noise relative to informative variation. This is shown by the reduced magnitude of the informative wavelet variance (Figure 3) and emphasizes that, as expected, it is increasingly difficult to extract informative variation at low admixture proportions (small p) even where the signal is technically present. In this example, informative estimates were obtained for admixture proportions as low as 2.5%, although in general, the range of p for which this method is applicable will also depend on other characteristics of the data, such as the SNP density and sample size, as considered in the next section.
Relationship between proportion of admixture and informative wavelet variance. For this example only, a nondefault value for the threshold was used to account for increased noise in the admixture signals due to low proportions of admixture, as described in the text. The magnitude of the wavelet variance decreases with the admixture proportion, shown as colored bars from black (P = 0.50) to yellow (P = 0.025).
Sensitivity analyses
The sensitivity of the method to a wide range of data characteristics was considered by repeating the results of the admixture time example with a large number of simulated data sets. Results are summarized in Table 2 and Figure 4.
Sensitivity to a range of realistic data limitations. Comparison to reference data (condition 1) simulated with T = 13,000 SNPs, populations sizes ,
, and ancestral population divergence at
generations. The gray area shows the range of ABS metrics observed under the standard reference condition. (A) Potential sources of error (conditions 2–5); (B) varying SNP densities (conditions 6–8). Note that the decline in absolute values of the ABS metrics in B is expected; these are easily accounted for in an inference setting because the SNP density is always a known variable. Condition descriptions and numeric values are presented in Table 2.
Condition 1 shows the original results, exactly as described above. New simulations were then performed to mimic realistic linkage disequilibrium (LD) (condition 2). To do so as accurately as possible, we applied the real recombination rates observed along the first 100 Mb of chromosome 1, as recombination rates for chromosome 1 are near the average of all chromosome-level recombination rates (Figure S2). The effect of lower sample size (condition 3) was investigated by reducing the number of individuals sampled from each population by 5, thus yielding sample sizes that would be smaller than almost any published population genetics study (nA, nB = 10, nC = 15). The effect of more recent divergence between the ancestral populations (condition 4) was investigated by decreasing from 2000 to 1200 generations ago (50,000–30,000 years ago). The effect of using a misrepresentative modern population as a proxy for an ancestral population (condition 5) was investigated by studying ancestral populations with mixed (rather than “pure”) ancestry. Rather than using samples from the true ancestral population PA, an admixed ancestral population
was employed instead (
). A wide range of parameters was applied for sensitivity testing, but for clarity, only results for single parameter values are shown on Figure 4. These examples are representative of all the tests that were run.
Variation in summary measures between simulations was compared by computing the relative standard deviation (RSD) at each admixture time. For all of the error conditions above, the computed ABS metrics are consistent with the reference case (condition 1), but with slightly larger relative standard deviations. Only for one case (condition 4; reduced divergence between the ancestral populations and admixture at 320 generations) are the ABS metrics biased, with the mean falling outside the range of values observed for the reference example. We emphasize that this is expected: admixture should be more difficult to detect when it occurs between two ancestral populations that diverged only recently. Stability of the ABS metrics in this particular scenario could be improved by applying a higher level of thresholding. However, the default value of was retained here to provide consistency across scenarios, to demonstrate the deterioration in resolution, and to illustrate that the thresholding parameter can be ignored for all but the most extreme admixture cases.
In all of these examples, including the standard reference case, the localized admixture signals provide a noisy indication of how ancestry varies along the chromosome. Indeed, the inherent stochasticity of the block structure is the primary reason why other sources of variance, such as the cases discussed above, have relatively little additional effect on the overall results. This noise is addressed using wavelets to capture the distribution of block sizes, coupled with a correction based on the ancestral populations to distinguish informative signals from background variation. The cases considered above all slightly increase noise levels relative to informative variation, which, as demonstrated by the admixture proportion example in Figure 3, reduces the magnitude of the extracted informative wavelet variance. As noise increases, it naturally becomes more difficult to extract informative variation. However, this increase in noise levels is minimal for all but the most extreme confounds, thus allowing the technique to be applied robustly to a very wide range of scenarios.
The effect of SNP density (which is always a known variable) is demonstrated by down sampling the data (conditions 6–8). The original density of 4306 SNPs (condition 6) was chosen to correspond to the size of our real chromosome 22 data set. Reducing the SNP density of this data set means that the resulting wavelet decomposition is given over 11 wavelet scales rather than the earlier 13, and so as expected, the computed mean ABS metrics are correspondingly much smaller. However, this has no effect on the inference, as the data size is always known and simulations are simply run to match the size of the observed data. Further reductions in SNP density to 3250 SNPs (condition 7) and 1625 SNPs (condition 8) are also shown. Note that although the absolute values of the ABS metrics are shifted, the trend with admixture time remains consistent.
Method comparison
The original StepPCO method (Pugach et al. 2011) has already been tested extensively against other admixture detection methods, particularly HAPMIX (Price et al. 2009). We therefore focus here on comparing our improved wavelet method against the StepPCO procedure. Figure 5 shows that the summary measure (wavelet center) used in StepPCO is comparable to the adwave ABS metrics, as both exhibit a strong trend with time of admixture. However, the dispersion is consistently smaller for the adwave ABS metrics. For example, the wavelet centers (StepPCO) computed for and
show substantial overlap, while the ABS metrics (adwave) for the same populations show only minimal overlap. This illustrates that adwave offers increased power to differentiate between older admixture scenarios, with substantially reduced uncertainty in dating.
Comparing StepPCO and adwave showing the relationship between wavelet transform summaries and time of admixture. (A) Adwave using ; (B) StepPCO using
,
, threshold = 0.1, and maxlevel = 6. Numbers indicate the relative standard deviation (RSD, %) for each admixture time. Note the difference in discrimination power between the two methods for older admixture events (95% confidence intervals as dashed blue and green horizontal lines).
We also emphasize that adwave requires far fewer user specifications with regard to runtime options. The only variable for adwave is the thresholding parameter, and as shown above, the default value of should be used for almost all admixture scenarios. In contrast, the StepPCO results required a signal length parameter (
), a window size parameter (
), and two thresholding parameters (threshold = 0.1, maxlevel = 6) (all notations from Pugach et al. 2011). A detailed demonstration of this method comparison, with explanation of the settings chosen for StepPCO, is provided in Figure S3.
Admixture in Indonesian populations
Populations across Indonesia show genomic admixture between Asian and Melanesian ancestral sources (Cox et al. 2010), which has been dated using other methods to an admixture event ∼4000 years ago (∼130 generations) (Xu et al. 2012). We calculated wavelet summary measures for 16 communities across the Indonesian archipelago using 548,994 autosomal SNPs screened in 394 individuals (Table 1). Equivalent data from Southern Han Chinese and Papua New Guinea Highlanders was used as proxies for ancestral populations, as described in Cox et al. (2010).
The PCA for all individuals, where only the ancestral populations were used to define the axes, is shown in Figure 6. Admixed individuals dispersed along the first principal component illustrate the primary genomic signal, a strong gradient in Asian-Melanesian ancestry that has previously been observed across the region (Cox et al. 2010). The informative wavelet variance was computed separately for each chromosome and individual and subsequently combined to provide a single measure for each population (Figure S4). To combine information across chromosomes, which vary considerably in size, the raw admixture signals were windowed: all signals were reduced to the size of the smallest chromosome (importantly without discarding any data) by computing averages over a window of SNPs (details of the windowing procedure are provided in Supporting Information, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1). The SNP density and window size for each chromosome are shown in Table S1. This windowing procedure is used only to standardize chromosomes to the same length and utilizes very short windows of SNPs (unlike the approach of Pugach et al. 2011).
PCA of autosomal SNP data from Indonesian populations, with Southern Han Chinese (blue circles) and Papua New Guinea Highlanders (green circles) employed as proxy ancestral populations. Numbers give calculated admixture proportions.
The average block size metrics calculated for each population are shown in Table 1. The first six Indonesian populations (Nias, Mentawai, Java, Sumatra, Bali, and Sulawesi) exhibit predominantly Asian ancestry, with high-frequency noise in the signals causing some bias in the computed ABS metrics (Figure S4). The remaining Indonesian populations exhibit less extreme Asian ancestry proportions (42–67%), with the resulting ABS metrics appearing broadly similar between populations.
Under the assumption of a single admixture time (relaxed in later sections), the average block size metric can be used to date the time of admixture using approximate Bayesian computation (ABC). A general introduction to ABC can be found in Csilléry et al. (2010) and Sunnåker et al. (2013), while ABC in the context of parameter estimation for population admixture has been considered by Sousa et al. (2009) and Robinson et al. (2014).
The ABC inference procedure allows us to capture uncertainty in admixture time estimates more robustly than earlier wavelet dating approaches (Pugach et al. 2011; Xu et al. 2012). To illustrate this process, dating was performed on the Bena population of Flores in eastern Indonesia, resulting in an estimated median admixture time of 147 generations (95% credible region: 122–178 generations), or 4410 years before present (95% CR: 3660–5340 years BP). This almost exactly matches earlier point estimates of the admixture time (Xu et al. 2012) and is consistent with our current understanding of Island Southeast Asian prehistory (Bellwood 2007).
The relationship between time of admixture and the ABS metric across all simulations is illustrated in Figure 7A. ABC was implemented using the R package abc (Csilléry et al. 2012), and the posterior distribution of admixture time was computed using a local linear regression (Beaumont et al. 2002) with a tolerance rate of 0.2. Cross validation was used to evaluate the accuracy of this estimate: the prediction error was low (0.038) and insensitive to the exact tolerance value. For future research focusing on parameter inference, this procedure could be modified to use a larger number of simulated data sets and a lower tolerance rate. However, this simple example clearly illustrates that the adwave method has good statistical power to date admixture using a relatively small number of simulations.
Dating time of admixture for Bena (Flores, eastern Indonesia) using approximate Bayesian computation. (A) Relationship between admixture time and average block size metric for all simulations; (B) weighted posterior distribution of admixture time. Median estimated time of admixture, indicated by the blue line, is 147 generations (95% credible region: 122–178 generations).
Multiple admixture events
Another aim of this work is to show that our improved wavelet approach can be used to study other features of the admixture process beyond the well-explored question of admixture time. In the examples covered thus far, it has been assumed that admixture occurred as a single event. However, additional waves of admixture will result in the introduction of new ancestry tracts, replacing a proportion of older, shorter ancestry blocks with newer, longer ones. Pugach et al. (2011) briefly considered the effects of continuous admixture within a wavelet setting, showing that this leads to underestimated admixture times in their original methodological framework. In contrast, we instead consider scenarios with two distinct admixture events. We show that this process creates distinctive patterns in the observed informative variation, which can be used to reconstruct more complex demographic processes (as opposed to being treated solely as a potential source of bias).
In the following dual-admixture scenarios, the first admixture event always occurs at 160 generations. To investigate the effect of separation between admixture events, the second admixture event varies between 10 and 80 generations. In the extreme case of admixture at 160 and 10 generations ago, the localized admixture signals contain two dominant frequencies. Single admixture events at 160 and 10 generations lead to peaks in the informative wavelet variance at wavelet scales of 9 and 13, respectively. When two admixture events occur, the informative wavelet variance is instead spread between these scales (Figure 8A). As the admixture events occur closer together, this spread in the observed informative wavelet variance decreases (Figures 8, B–D).
Dual admixture events at 160 and 10–80 generations. Gray bars represent the average over 50 simulations for each scenario; black bars represent the range for individual simulations. Blue bars show the average informative wavelet variance for a single admixture event at 160 generations, providing a reference point for comparison.
For one admixture event, a single dominant peak is observed in the informative wavelet variance, and the ABS metric therefore provides a convenient summary measure. For multiple admixture events, the ABS metric describes the average admixture time, but provides no information about the duration over which admixture occurred. In contrast, the informative wavelet variance should provide additional information about the peak dispersion. To explore the potential for identifying more complex admixture scenarios, a simple classification rule was implemented. An admixed population PC is assigned to one of two groups , which are characterized by the summary measures
. This scheme is described with abstract choice of summary measure, but below, we consider how different summary measures (taking
to be either the ABS metric or wavelet informative variance) affect the success of classification.
The classification rule is implemented as follows:
The “true” summary measures
,
are computed for each group using values obtained from the first 25 simulations.
For each of the remaining 25 trial data sets (s = 1,…,25), estimated summary statistics
are calculated. The divergence measures are defined as
(7)
for
.
If
, classify to
; otherwise classify to
.
The classification rates are shown in Table 3 for scenario 1 (a single admixture event at 60 generations; mean ABS 10.47, range 10.30–10.64) and scenario 2 (two admixture events at 160 and 10 generations; mean ABS 10.57, range 10.35–10.90). With a sample size of just 10 individuals for the admixed population, perfect classification is achieved using the informative wavelet variance, while the ABS metric correctly classifies only 60% of cases. For real multiple admixture situations, this classification framework could be extended to a more complex inferential setting (such as ABC), but this simple example demonstrates the potential for reconstructing complex admixture scenarios from the full wavelet variance profile.
Discussion
Wavelet techniques provide information on the ancestry block structure of admixed genomes and hence can be used to reconstruct the processes involved in past admixture events. Ancestry blocks are strictly unobservable and can be inferred only from the data. Wavelets provide indirect information on the block structure, thus providing an alternative over methods that assign ancestry directly (Sankararaman et al. 2008; Price et al. 2009). A growing body of methods now assign ancestry indirectly using various unrelated approaches (Moorjani et al. 2011; Baran et al. 2012; Gravel 2012; Loh et al. 2013; Maples et al. 2013; Brown and Pasaniuc 2014), but here we extend the use of wavelet techniques as introduced by Pugach et al. (2011). Importantly, our implementation differs markedly from the original StepPCO program, with the main differences at each stage of the analysis highlighted below:
Localized admixture signal formation: StepPCO (Pugach et al. 2011) uses large windows of SNPs to produce an averaged admixture signal in localized windows along the genome. Our work demonstrates that wavelet methods are equally applicable to the raw unwindowed signals, with the windowing procedure performed intrinsically as part of the wavelet analysis, and therefore not requiring arbitrary a priori decisions on window size.
Wavelet analysis: The wavelet methods we describe are based on the MODWT, which offers more flexibility in its application since there is no restriction on the length of the signals. Conversely, StepPCO employs the discrete wavelet transform (DWT), which has the strict requirement that signals be of length 2n. Data must therefore be windowed, or discarded, to meet the restrictive length requirements of the DWT framework. Another advantage of the MODWT is that the resulting wavelet coefficients are translation equivariant, meaning that circularly shifting the data results in the same shifting of the coefficients. Said differently, changing the starting point—for instance, to avoid a poor quality SNP—does not affect the resulting wavelet coefficients, whereas this is not true under the DWT framework. This property is particularly important if the results are to be used for specific localized genomic regions (as discussed briefly below) and thus provides a solid statistical foundation for future work.
Extraction of relevant information: The portion of the resulting wavelet decomposition that is informative about the admixture process is extracted in a simple procedure with reference to the ancestral populations, offering greater simplicity and objectivity than the multistage thresholding procedure described by Pugach et al. (2011).
Software: The adwave software, which implements the method described in this article, is an official package in the R project (http://cran.r-project.org/web/packages/adwave/index.html). This allows extremely easy installation and use, as well as providing a series of simple worked examples as a learning exercise. The adwave package is also faster than the existing StepPCO code and offers more flexibility in the choice of analyzing wavelet (unlike StepPCO, which employs only the simplest “square-shaped” Haar wavelet).
The work presented here also makes several other advances. The average block size metric has previously been shown to capture the time of admixture. Here, we have implemented a more formal dating procedure using ABC under the assumption of a single admixture event. In reality, populations may have experienced multiple admixture events leading to complex patterns of genetic variation. We have shown that the wavelet variance contains additional information to identify these more complex admixture scenarios. This highlights the potential of wavelet-based techniques to be coupled with formal statistical inference procedures to robustly distinguish between the range of scenarios that could have resulted in any observed genetic pattern.
Method performance for the StepPCO procedure has already been tested against other admixture detection methods, most extensively with HAPMIX (Price et al. 2009), with favorable results. This is especially true for older admixture events (Pugach et al. 2011). While an in-depth comparison with other local ancestry detection methods would be of great interest (Moorjani et al. 2011; Baran et al. 2012; Gravel 2012; Loh et al. 2013; Maples et al. 2013; Brown and Pasaniuc 2014), such an analysis is beyond the scope of this manuscript. We have therefore focused instead on showing how adwave markedly improves on the original wavelet method implemented in StepPCO. As shown above, adwave offers improved statistical power to differentiate between admixture scenarios, offers much reduced uncertainty in model parameter estimates, and importantly, is far easier to use than StepPCO, especially by requiring far fewer user-specified runtime parameters.
In the future, considering the full wavelet periodogram, rather than the genome-wide summary measures used in both adwave and StepPCO, may yield promising results wherever the distribution of ancestry tracts along the genome is substantially nonstationary. Bryc et al. (2010) use their formulation of localized admixture signals to address whether regions of the genome show predominant ancestry from a given population. Wavelets are well suited to distinguishing local features in data and could be helpful in this regard, identifying features that may not be easily detected by considering the localized admixture signals in their raw form.
Other prospective areas for further work include the extension of these methods to the more general case of multipopulation admixture. Ma and Amos (2012) describe the use of PCA as a diagnostic in this setting, and PCA has been used to assign multipopulation ancestry in the software PCAdmix (Brisbin et al. 2012). The wavelet methods described here could be extended in a similar way by considering pairwise combinations of any number of ancestral populations.
In contrast, key restrictions that determine our ability to reconstruct admixture events include the degree of differentiation between the ancestral populations and the representativeness of samples used as surrogate ancestral groups. As the ancestral populations become more similar or the surrogate populations become more different from the true ancestral populations, the localized admixture signals become increasingly noisy. Although this ultimately leads to a loss of identifiability in extreme cases, the method is remarkably robust to moderate deviations from these assumptions. As shown above for low admixture proportions, through judicial choice of the thresholding parameter even extremely noisy data can still provide meaningful estimates (the only situation in which we encourage deviation from the default setting).
Sample size (both in terms of SNPs and individuals) is also important and affects the PCA step of the procedure. The purpose of the PCA step is to summarize the overall variability among individuals, which includes both between-population and within-population variability. In reconstructing population ancestry, we aim to describe between-population variation, while ignoring within-population variation. This is achieved by selecting the first principal component, as long as the sample sizes are sufficiently large. Within-population fluctuations of individual coordinates on the PCA scatterplot can be caused by subtle population substructure. Assuming that no such substructure is present, these fluctuations decrease as the total sample size increases, and an asymptotically stable pattern of the eigenvector plot results (Ma and Amos 2010). When the number of individuals is large, variation between individuals from the same population is small compared to that of the different populations, so that the first eigenvector describes the primary population structure of the data. However, as the sample size decreases, individual variation carries more weight, which may be addressed in more than the first principal component. Note that the use of methods other than PCA may be helpful in this regard. For example, Jombart et al. (2010) introduced discriminant analysis of principal components to achieve separation of individuals into predefined groups. In practice, as long as the sizes of the ancestral population samples is sufficiently large, discriminant analysis provides the same result as PCA (unpublished data). The two methods may, however, perform differently for small sample sizes.
How far back in time admixture processes can be reliably identified is strongly influenced by the number of genotyped SNPs. The relationship between the number of admixture blocks, time of admixture and wavelet scale is summarized in Table 4. The shaded column indicates the findings described in the Results, using simulated data sets of 13,000 SNPs (chosen for a region ∼100 Mb in length, comparable to the SNP content of our 100 Mb chromosome 15 data set). For admixture at 10 generations, the informative wavelet variance is highest at scale 13, reflecting a small number of large admixture blocks. As the time of admixture increases, the peak shifts toward lower scales, reflecting a larger number of smaller admixture blocks. This pattern is illustrated for admixture up to 320 generations (∼10,000 years), but importantly, it is possible to reconstruct even older admixture events. The highest frequency (relating to the smallest admixture blocks) that can be detected, as determined purely by the data density, is termed the Nyquist frequency (Chatfield 2003). However, resolution power is likely to deteriorate well before this point and will be strongly influenced by the degree of differentiation between the ancestral populations. The more closely related the ancestral populations, the less well they can be discriminated using only a small number of SNPs. Increasing the SNP density allows detection of higher frequency information, relating to shorter (more ancient) admixture tracts. To illustrate this, the mapping to wavelet scale is illustrated for a hypothetical twofold and fourfold increase in the number of genotyped SNPs (26,000 and 104,000 SNPs, respectively). As genetic data sets improve (particularly through whole-genome sequencing), wavelet methods will therefore gain substantial resolution. It seems entirely feasible that wavelet approaches will have sufficient statistical power to reconstruct admixture events far deeper in time than those currently studied. Advances in wavelet methods therefore offer exciting potential for future research, particularly for ancient and complex human admixture processes.
Software
Software for the analyses described here has been released in the form of an R package, adwave, which is freely available from the R project’s central package repository: http://cran.r-project.org/web/packages/adwave/index.html
Acknowledgments
We gratefully acknowledge assistance with sample collection by Agustini Leonita and Alida Harahap (Eijkman Institute for Molecular Biology, Jakarta, Indonesia) and J. Stephen Lansing (University of Arizona), as well as data processing by Olga Savina (University of Arizona). We also thank Matthew Nunes (University of Lancaster, United Kingdom) and Martin Hazelton (Massey University, New Zealand) for their valuable comments.This research was supported by the Royal Society of New Zealand through a Rutherford Fellowship (RDF-10-MAU-001) and Marsden Grant (11-MAU-007) to M.P.C., and by funding from the Allan Wilson Center for Molecular Biology and Evolution.
Footnotes
Communicating editor: K. M. Roeder
Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.176842/-/DC1.
- Received October 29, 2014.
- Accepted April 3, 2015.
- Copyright © 2015 by the Genetics Society of America
Available freely online through the author-supported open access option.