Abstract
During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly’s and
) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly’s
offers high power, but is outperformed by a novel statistic that we test, which we call
We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics—
Kelly’s
and
—are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While
replicates most candidates when recombination map data are not available, the
and
statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available.
GENOME-WIDE selection scans now form part of the standard repertoire of techniques through which to probe the evolutionary past of a population. These attempt to identify genomic regions showing evidence of nonneutral evolution by iteratively calculating a test statistic at different locations for a sample of genetic data. As the availability of genetic data and computational power has increased, so too have the number of scans performed and the range of statistics used. In the case of human genetics, thousands of putative selection signals have been suggested (Akey 2009). For the vast majority of these, it is unclear what phenotypic association may have allowed selection to operate, and a relatively small number of signals are replicated between studies.
Given the large number of hypotheses tested in genome scans for selection, many signals are likely to be false positives (Kelley et al. 2006; Akey 2009; Nei et al. 2010). This concern is compounded by systematic biases related to ascertainment (Thornton and Jensen 2007) or data quality (Mallick et al. 2009). Distinguishing true signatures of selection is challenging, requiring a multifaceted approach (Barrett and Hoekstra 2011), but improving statistics used to infer selection is an important step. This work is an attempt to test statistics calculated from patterns of linkage disequilibrium (LD) and enhance their ability to detect selective sweeps.
Distortions in the LD pattern surrounding a selected variant are one of several population genetic effects of natural selection related to genetic hitchhiking (Smith and Haigh 1974). In the case of positive selection, a hard selective sweep is expected to initially increase local LD. As the selected variant reaches high frequency, LD between SNPs located on opposite sides of the variant drops, as described by Kim and Nielsen (2004) and illustrated in Figure 1. Statistics designed to detect both the first [Kelly’s (Kelly 1997)] and second [ω (Kim and Nielsen 2004)] patterns have been suggested, and the theoretical dynamics of LD given positive selection have been explored (Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008). It has been argued that one of these statistics, ω, is relatively robust to nonequilibrium demographic history of the target population, which can vastly reduce the performance of other statistics (Jensen et al. 2007; Crisci et al. 2013).
The average pattern of pairwise linkage disequilibrium ( lower triangle of each matrix) and SNP diversity (represented by average number of LD measurements at a given pairwise distance, upper triangle) created by a selective sweep, based on 2000 simulations. The human demographic model of Gravel et al. (2011) was used in the simulations (see Appendix: Extended Methods for details), with 40 chromosomes sampled from the European population. When selection is simulated, the sweep begins (forward in time) at time
generations before the present, using an initial selected allele frequency of 0.0005 and an additive selection model with the homozygous state corresponding to
The frequency of the selected allele in the present is, on average, ∼0.7 and
when selection started 400 and 1600 generations ago, respectively. In the rightmost plot, the along (α) and over (β) regions used to calculate test statistics are indicated. As described by Kim and Nielsen (2004), this plot displays an increase in LD in the along region, but a decrease in the over region, compared to the central plot, which shows the increased LD in both regions associated with an intermediate-frequency partial sweep.
In humans (Daly et al. 2001) and other species (e.g., Petes 2001; Mezard 2006; Paigen and Petkov 2010) the local recombination rate is known to be highly variable. Genetic maps, which provide estimates of the recombination rate along a genome, are created to describe these fluctuations, either by using patterns of linkage disequilibrium in a population genetic sample (e.g., McVean et al. 2004) or by inferring recombination events in pedigrees (e.g., Kong et al. 2010). As an indication of the extent of recombination rate variation, human genetic data suggest that 60% of recombination events happen in 6% of the genome (Frazer et al. 2007). The portion of the genome with a high recombination rate manifests as highly local extreme peaks in the recombination rate, known as recombination hotspots. The implications of recombination rate variation on methods to detect selective sweeps based on pairwise LD have not yet been thoroughly explored.
In this article, we use simulations to assess the power of Kelly’s and ω to detect selective sweeps and compare them to a barrage of alternative LD-based selection statistics, some of which attempt to control for recombination rate variation. We focus on a hard sweep model of selection, but also consider selection acting on standing genetic variation, which can lead to soft sweeps (Hermisson and Pennings 2005). We then test several of the best-performing statistics on well-studied HapMap phase II data (Frazer et al. 2007), assessing their ability to replicate selection candidates identified using site frequency spectrum distortions and/or high population differentiation.
Selection Statistics Based on LD
Several methods that use pairwise LD patterns alone [as opposed to, for example, haplotype structure (Sabeti et al. 2002)] have been proposed—Kelly’s (Kelly 1997), Rozas’
and
(Rozas et al. 2001), and the ω statistic (Kim and Nielsen 2004). A separate approach [Ped/Pop (O’Reilly et al. 2008)] compares local recombination rate estimated using LD distortions in population genetic data (McVean et al. 2004) with estimates based on pedigree data to detect unexpected LD patterns and hence detect selection. We focus on two of these methods—Kelly’s
and ω—which detect very different qualities of the LD distortions expected given a selective sweep.
Kelly’s is simply the average pairwise LD between all SNPs over a fixed region of the genome (Kelly 1997),
(1)where S corresponds to a list of polymorphic sites in the genomic region numbered
i and j are indicators referring to loci in the list S, and
is a standardized measure of LD corresponding to the squared correlation of allelic identity between loci i and j (Hill and Robertson 1968). The normalization ensures that
when all loci in S are in maximal LD. Visually, an example calculation of this statistic would be the average
among all SNPs contained in the window x in Figure 2. Given the dynamics of LD driven by a hard selective sweep, this statistic is expected to be most effective when a selected variant has reached a moderate to high frequency, but is not nearing fixation. The approach also has relatively high power to detect soft sweeps in which a locus experiences recurrent beneficial mutations (Pennings and Hermisson 2006).
Schematic illustration of a test statistic calculation over a region of length X. The dark gray line indicates a section of the chromosome and the rectangles intermittently spaced along it are SNPs. To obtain the test statistic value, the selection statistic is calculated for each SNP in the orange region, and either the maximum or the minimum of these values is retrieved as specified. The selection statistic itself is calculated over a window of length x centered on the target lth SNP. For example, to calculate the selection statistic for the first SNP in the region (colored light blue), SNPs within distance are divided into sets L (red) and R (green). LD measurements contributing to this selection statistic value are indicated as circles above the chromosome and are colored dark blue if in the along set (used to calculate
) or pink if in the over set (used to calculate
).
The ω statistic tries to identify a characteristic LD pattern that emerges toward the end of a hard selective sweep, represented by an increase in LD between SNPs downstream or upstream of a selected variant (along the chromosome), but a reduction in LD between those on either side of it (over the selected locus). Following the notation and definition of Kim and Nielsen (2004), ω is calculated by taking the list of S polymorphic sites in a window and dividing it into two contiguous groups, L and R, which are located on either side of the locus. L and R contain l and
SNPs, respectively. Given these definitions, ω is defined as
(2)where the sum is taken over independent pairs
with
The position of l is iteratively moved along the chromosome to obtain
the maximum value of ω. Referring again to Figure 2, an example calculation of this statistic would be the average value of
measurements indicated by blue circles divided by the average value of
measurements indicated by pink circles. Large values of
may indicate nonneutral evolution. The pattern detected by ω is apparent in the rightmost plot of Figure 1.
The OmegaPlus program (Alachiotis et al. 2012b) implements a genome-scan variant of ω. Instead of locating the divisor on SNPs, a regularly sampled grid is defined over the region to be scanned. At each point on the grid, the value of is determined by varying the window size and hence the SNPs in sets L and R. In the original ω formula, the window size was limited by the length of sequence data available. As this would imply whole-chromosome windows in a genome-scan situation, the OmegaPlus implementation of ω differs in defining constraints on window size for each ω calculation, using the flags “-minwin” and “-maxwin” (supplemental material in Alachiotis et al. 2012b; first applied in Pavlidis et al. 2010). Normalization occurs as above and depends on the number of LD evaluations made in each window. The implementation of the statistic is highly optimized (Alachiotis et al. 2012a), such that large amounts of genetic data can be rapidly scanned, even on consumer-grade computers.
has shown promise in identifying selective sweeps in simulation studies (Jensen et al. 2007; Crisci et al. 2013).
In our work, we are interested in assessing and developing LD-based summary statistics for use in genome scans. We therefore use the OmegaPlus program to calculate values rather than using the standard ω-statistic construction. Henceforth we use the notation
to denote an individual run of OmegaPlus with window size flags -minwin and -maxwin set as the specified number of kilobases.
Challenges faced by Kelly’s
and the ω statistic
Both Kelly’s and ω have been used to infer selection in population genetic data (e.g., Catalán et al. 2012; Lee et al. 2013; Renzette et al. 2015). However, both the construction of statistics and properties of molecular evolution may complicate the interpretation of results. We discuss two potential factors affecting these LD-based statistics—variable recombination rate, which affects both Kelly’s
and ω, and the variable window size of ω.
Variable recombination rate
As previously noted, the recombination rate shows high levels of genome-wide variation for a number of species. This clearly has significant implications for statistics searching for unusual patterns of LD to infer nonneutral evolution (O’Reilly et al. 2008). Recombination hotspots cause local LD to plummet, which mimics the pattern of reduced LD expected at the end of the selective sweep and can lead to large values of the ω statistic. Conversely, certain regions of the genome have low recombination rates (Petes 2001; McVean et al. 2004), and these coldspots, with correspondingly high LD, will raise Kelly’s If the null distribution of ω or Kelly’s
is based on simulations that do not accurately represent recombination rate variation, recombination hot- and coldspots may lead to false positives. When the null distribution is based on empirical genome-wide data, the signal of variable recombination rate will be captured, but we nevertheless expect a reduction in statistical power compared to the constant-recombination rate case due to outliers associated with recombination hotspots and coldspots.
Variable statistic window size
Many selection scans calculate statistics based on a sliding window of fixed size. The OmegaPlus algorithm takes a different approach, calculating the statistic at static positions l on a regular grid while changing the size of the windows used to define R and L on each side of the target locus to find This method has several implications. First, we note that the spatial extent of LD distortions in the genome caused by a selective sweep will depend on recombination rate, selection strength, and the age of the sweep (Smith and Haigh 1974). Given this, it is possible that a single scan with OmegaPlus using variable window size can identify a broader range of selection scenarios than a fixed-window approach. Second, the expected value of ω under neutrality depends on the sizes of R and L, which makes interpreting the statistic less intuitive. A highly simplified illustrative example is provided in Supplemental Material, File S1. Third, the changing size of the window means that the number of SNPs used to obtain LD values when calculating ω may vary considerably (although the “-minsnps” flag allows the specification of a minimum number of SNPs in a window and hence some control of this). The variance in the ω statistic is likely to be larger under neutrality for windows containing fewer SNPs, with Equation 2 evaluating to very large values when the denominator is randomly small. A similar effect may also be important in determining the distribution of other test statistics, particularly given the impact of selection (Smith and Haigh 1974) and neutral processes such as variable mutation rate (Hodgkinson and Eyre-Walker 2011) on diversity.
The neutral distribution of statistics designed to capture local patterns of LD depends, then, on features such as the window size used and neutral variation in SNP diversity. Furthermore, variable recombination rate substantially affects observed LD values. The implication is that some control for these phenomena may improve the ability of these statistics to detect positive selection. In this work, we focus on developing and testing LD statistics calculated over a fixed-size genomic window that control for variable recombination rate.
Methods
Our approach to improving LD-based statistics is a pragmatic one. In essence, we use simulations to explore a large number of possible LD-based statistics and assess their potential to detect selective sweeps by comparing their power to Kelly’s and
The details of this process are described in Appendix: Extended Methods.
Designing selection statistics
When designing statistics, we begin by identifying a region of length X bp to be assessed for evidence of selection in a genetic sample. A selection statistic, such as is calculated centered on each polymorphic site in the region in turn; the test statistic for the region can be either the maximum or the minimum of these values.
The selection statistic is itself calculated over a window of fixed length x, centered on the target locus and containing polymorphic sites forming a list S. Unlike in the calculation of the target locus is fixed at the lth site in the list S, with the window stretching
bp upstream and downstream of it. We then follow the approach taken by the ω statistic in dividing SNPs within the statistic window into two sets—those that are to the left of the target locus, L, and those to the right, R; see Figure 2. The ω statistic, Equation 2, averages LD measurements between SNPs in the same set (the along region) and divides this by the average LD between SNPs in different sets (the over region). A measure of the average value of LD in the along region is
(3)and in the over region is
(4)see Figure 1. Assuming the number of SNPs in L and R are similar, Kelly’s
will be approximately the average of
and
while ω, calculated when the divisor lth locus is centrally located among the list of S polymorphic sites, is
The selection statistics we investigate take a similar approach in calculating a measure of average LD in the over and along regions, with the possibility of some simple operation (such as addition and division, as above) then performed on these. Unlike ω and Kelly’s we include the option of subtracting or otherwise controlling for the expected value of the selection statistic, based on expected LD between SNPs under neutrality. In total, we test 39 different statistics, 29 of which control for expected LD; see Appendix: Extended Methods.
Assessing the power of statistics through coalescent simulation
We assess the performance of different statistics using coalescent simulations generated with the programs MSMS (Ewing and Hermisson 2010) and Cosi2 (Shlyakhter et al. 2014). We first approximate the distribution of each statistic under neutrality by simulating 1000 samples for each demographic and recombination model. We consider the identification of a selection candidate within 100 kb of the true focus of selection to be practically useful and therefore calculate test statistics using a region size The 1000 values thus generated describe the null distributions for a test statistic. To approximate the distribution of the test statistics given a positive selective sweep, we simulate at least 300 samples with positive selection acting on a single SNP located in the middle of the chromosome, using various selection scenarios (e.g., different selection strengths and conditioning on different initial and final selected allele frequencies). The distribution of each test statistic under neutrality and given positive selection is then used to calculate power and receiver operating characteristic (ROC) curves (Metz 1978).
We use two demographic scenarios, one with constant population size of and the other following an out-of-Africa (OOA) human demographic model suggested by Gravel et al. (2011), with samples taken from the European population. Details can be found in Appendix: Extended Methods; also see Figure A1. The recombination rate is either constant (both models) or variable (constant population size model only). For the variable recombination rate model, the rate is sampled from the HapMap human chromosome 2 (b36) genetic map, estimated from HapMap phase II populations (Frazer et al. 2007), excluding regions close to the centromere.
To control for expected LD, we generate an “LD profile” describing the expected properties of LD at a given genetic map distance, such as the average or standard deviation of We approximate the LD profile by simulating
bp of neutral genetic data according to the appropriate demographic scenario and sample size and then binning LD measurements according to their genetic map distance. For all simulations involving a variable recombination rate, we consider three scenarios. We assume that no genetic map is available, in which case physical distance serves as a proxy for genetic map distance (PhysMap); or that the true genetic map is available (TrueMap); or that a lower-resolution genetic map is available (LowResMap).
The algorithm that MSMS uses to simulate selection involves conditioning a coalescent model on a stochastically generated allele frequency trajectory. The allele frequency trajectory is created using the selection coefficient s, the demographic model, and several of four possible parameters describing features of the selection scenario—the time at which the selection phase begins (pastward), generations; the time at which selection stops,
and the frequency of the selected variant at
and
and
(see Figure A1). We focus our attention on a hard sweep model and compare test statistic performance based on three selection scenario categories—low frequency, where the final frequency of the selected allele
high frequency, with
or
and selOOA, denoting the out-of-Africa demographic scenarios. We also briefly explore the performance of test statistics when selection acts as a variant at higher initial frequency,
Performance is assessed using the power of test statistics (e.g., Table 1, Table 2, and Figure 4) or a measurement based on the partial area under the ROC curve [pAUC (McClish (1989)] between a false positive rate (FPR) of 0 and 0.05 (Figure 3) that assesses the best possible test statistic performance suggested by our simulations. Using the pAUC gives greater emphasis on performance when the which is relevant for genome-wide selection scans. Again, further details on selection scenarios and the assessment of statistic performance are provided in Appendix: Extended Methods.








Performance of different categories of selection statistic tested according to our pAUC metric, under (A) a constant recombination rate, (B) a variable recombination rate and the true genetic map, and (C) a variable recombination rate using the physical map as an approximation of the genetic map. The statistic categories are indicated in the key, corresponding, in order, to those that control for expected LD, those that do not, Kelly’s statistic, the
statistic with various window sizes, and methods based on SNP diversity. The relationship between our pAUC metric and power is shown in D.
Replication of previously suggested selection candidates
We compare the performance of our selection statistics with Kelly’s and
Tests using several statistics, particularly the
statistic, Equation 3, and Kelly’s
when these control for expected LD, have high power given a variable recombination rate. To help further assess the utility of different statistics, we present selection scan results, using HapMap phase II [NCBI b36 (Frazer et al. 2007)] data for human chromosome 2 (populations of European ancestry, CEU, and East Asian ancestry, CHB+JPT) and chromosome 15 (CEU). Recombination rate is controlled for using either the HapMap genetic map [derived from LD patterns (Frazer et al. 2007)] or the deCODE genetic map [derived from inferred recombination events in a large Icelandic cohort (Kong et al. (2010)]. We assessed the ability of the various statistics to replicate selection signals previously identified based on the site frequency spectrum and/or population differentiation (Carlson et al. 2005; Nielsen et al. 2005; Oleksyk et al. 2008; Pickrell et al. 2009; Chen et al. 2010; Ronen et al. 2013; Colonna et al. 2014; Pagani et al. 2016), genetic features that should be relatively independent of local LD patterns and recombination rate variation under neutrality (although see, e.g., figure 3 of Ferrer-Admetlla et al. 2014 and Thornton 2005).
Data availability
The authors state that all information or data necessary for confirming the conclusions presented in the article are either publicly available (HapMap data) or represented fully within the article and Supplemental Material files.
Results
Our simulations identify certain test statistics (maximum Kelly’s or
over the
regions we test) as particularly powerful. They also suggest that controlling for expected LD increases the power of these statistics when recombination rate is variable, but marginally reduces power when recombination rate is constant. We also found that controlling for expected LD increased the number of previously suggested selection candidates that these statistics replicated in HapMap phase II SNP data. Although the interpretation of signal replication is not trivial (signals may be false or true positives), the overall impression is that controlling for expected LD increases the performance of certain LD-based selection statistics.
Controlling for expected LD increases simulated power of statistics when recombination rate is variable
In total, we tested 29 methods that incorporated some form of control for expected linkage and 10 methods that did not. Comparing these as groups of similar statistics—for example, all those that divide average LD in the along region by that in the over region (like ω) or all those that add or average LD in these two regions (like Kelly’s )—the average improvement, given a constant population size demographic model, variable recombination rate, and a hard sweep model, over methods that did not consider genetic map data was 79% (in absolute terms, 0.22) by our pAUC metric (see Figure 3). Focusing on the case of
window size
and
this reflects an average increase in power at 1% FPR of 120% (0.07) for the low-frequency scenarios and 90% (0.08) for the high-frequency scenarios, over all (both high and low performance) statistics.
When controlling for variable recombination rate, we found that both the full genetic map (TrueMap) and the lower-resolution genetic map (LowResMap) yielded similar results, with a performance reduction according to the pAUC metric of just 10% and 4% for the lower-resolution map given low-frequency and high-frequency scenarios, respectively. However, trying to control for expected LD based on physical distance yielded poor results, and those statistics that did not take variable recombination rate into account showed equal or better performance (see Figure 3C).
The power to detect selection at different stages of a hard selective sweep (or, in the case of the out-of-Africa demographic model, selection acting on an initially very rare variant, ) for some representative statistics is shown in Table 1 and Table 2. We note the generally high performance of
compared to Kelly’s
and especially
and the slower decay over time of reduced SNP diversity as a signal of a selective sweep compared to LD distortions (the
statistic, describing the number of LD measurements in the over region). To avoid high variance in LD-based statistics, we did not calculate test statistics for simulations with very few SNPs; our reported power values for diversity-based statistics are hence slight underestimates (usually by <0.01 and always <0.07 in Table 1, Table 2, and Table 3).
The statistic also demonstrated relatively high power to detect a soft sweep acting on standing genetic variation (see Table 3 and File S1, Tables S2 and S3).
Identifying the “best” statistic from our set was not simple; full results included in File S1 show that many approaches to controlling for expected LD had similar performance. Statistics based on the average LD in the along region, like tended to be particularly successful, and we therefore focus on these. In Table 1 and Table 2 we controlled for the expected statistic value given genetic map distance by dividing the observed
by that expected given the SNP distribution under neutrality,
While this is relatively intuitive, there are other possible approaches to controlling for expected LD, several of which are shown in Figure 4.
Power (FPR = 0.01, ) of the selection statistic
to detect a hard selective sweep compared with that of similar methods focusing on LD in the along region but controlling for expected LD; see Appendix: Extended Methods. A constant population size demographic model (
) was used with variable recombination rate and (A) weaker,
or (B) stronger,
selection. Unless otherwise indicated
Overall, the statistic is simple to calculate and generally performed as well as or better than other LD-based statistics explored given the variable recombination rate.
Controlling for expected LD increases selection candidate signal replication in HapMap data
To further assess and other selection statistics, we performed selection scans, using publicly available HapMap genotype data (Frazer et al. 2007). We chose to focus on chromosomes 2 (European descent and Asian populations, CEU and CHB+JPT) and 15 (CEU) as these chromosomes/populations include several well-characterized (e.g., Sabeti et al. 2007; Mathieson et al. 2015) selection signals—MCM6/LCT, SLC24A5, HERC2, and EDAR.
An example selection scan for chromosome 15 (CEU), using OmegaPlus (), Kelly’s
and
as well as LD-controlled variants of Kelly’s
and
is shown in Figure 5 [results for chromosome 2 (CEU and CHB+JPT) are included in File S1]. Selection candidate signals suggested by other studies that did not use statistics based on LD or haplotype patterns (see Appendix: Extended Methods) are indicated. Table 4 shows the average value of each test statistic in the 200-kb genomic region containing these selection signals, as well as the average rank of selection candidate regions among all 200-kb regions and an indication of how unusual the observed findings are. Results for scans using several other LD-based statistics that we investigated are included in File S1. Scans on the CEU population sample using the deCODE (Kong et al. 2010) rather than the HapMap genetic map to control for expected LD yielded similar results and are therefore also included in File S1. While the main purpose of these scans was to assess the various statistics, we did find several novel peaks. As we are not aware of selection statistics based on pairwise LD being applied to these data (although see O’Reilly et al. 2008), we tabulate the strongest signals in File S1.
Selection scans (standardized to aid comparison) on HapMap phase II data (chromosome 15, CEU population), using a range of LD-based statistics and a diversity measure. From top to bottom, the scans represent Kelly’s Kelly’s
controlled for expected LD,
controlled for expected LD, the OmegaPlus program calculating
and the diversity measure
Excluding OmegaPlus, all statistics were calculated with statistic window size
The HapMap combined genetic map was used to estimate expected LD. Thick dashed lines indicate the genomic targets of two relatively well-established signatures of selection, while the light gray lines indicate signals found in a range of studies based on population differentiation and the site frequency spectrum; see Appendix: Extended Methods and File S2.
Interpreting the selection scans shown in Figure 5 and in File S1, Figures S2 and S3, is not simple. Our metrics show no signal at many selection candidates, and in some cases controlling for expected LD actually leads to a reduction in signal strength. This in itself is not concerning, in that some suggested selection targets may be false positives. For example, of the 72 selection candidate regions we identified in various studies that searched for selection in European samples on chromosome 2, only 5 were replicated between studies. Nevertheless, it is clear that controlling for expected LD does not always improve the signal even for well-accepted candidates. While the signal at SLC24A5 (CEU) increases (see Figure 5), that of MCM6/LCT (CEU) appears to fall (see File S1, Figure S2), for both Kelly’s and
which, unsurprisingly, are highly correlated.
Interestingly, the statistic actually displayed an unusually low value for some selected regions (such as LCT in CEU) rather than the high value that is generally expected, using various window sizes (File S1 and File S2). This is presumably because the selective sweep is recent and incomplete (Itan et al. 2009; Mathieson et al. 2015), and the reduction in LD in the over region that drives high
values appears only in later stages of the sweep (Stephan et al. 2006). A similar pattern was sometimes observed in our simulations and was found to be strongly influenced by choice of window-size parameters.
Greater detail on the performance of the various statistics is given in Table 4. Over all selection candidate regions, Kelly’s replicated 9 of 96 signals as 5% outliers, the same as
but
(12) and
(15 of 101). The estimate for
is slightly pessimistic, as our pipeline removed regions with very few LD measurements, including two selection candidate regions [both in chromosome (Chr) 2, one Asian and the other European] that would otherwise have yielded positive results. Controlling for expected LD considerably improved the performance of both Kelly’s
and
with both replicating 18 of 96 signals. In general, all statistics replicated considerably more signals than expected by chance, as strongly suggested by the resampling results (Table 4).
The purpose of this study has been to assess certain LD-based selection statistics rather than to search for selection signatures as such. We therefore tabulate novel hits in File S1 only. Often, several outlier regions occurred in succession, such that it is difficult to identify specific genes, variants, or features that might be driving a signal. Different approaches using pairwise LD (Clemente et al. 2014), other population genetic patterns [e.g., DIND (Barreiro et al. 2009; Pagani et al. 2016) and DDAF (1000 Genomes Project Consortium et al. 2012; Colonna et al. 2014)], or especially biological information on the impact of variants can simplify this task. Certain novel 200-kb regions contained a single gene or a few genes, such as signals overlapping MKRN3 and ARHGAP11B (Europeans) and ABCA12 (Asians). ARHGAP11B is a human-lineage duplication that has recently been found to influence neocortex size (Florio et al. 2015), and its low variation supports a recent origin and possible selection. Mutations in MKRN3 can affect the onset of puberty (Abreu et al. 2013), and variants in this region have been associated with age at menarche (Perry et al. 2014). Finally, ABCA12 has been previously identified as a possible selection candidate based on population differentiation, possibly related to adaptation of the skin in response to ultraviolet light exposure (Colonna et al. 2014). We did not formally class this as a replication as our 200-kb region does not include the previously highlighted variant.
Dividing the 65 selection candidates identified in European chromosome 2 into 200-kb regions that did (39) and did not (26) contain protein-coding genes and calculating selection statistics did not support enrichment of signal replication in either set (File S1, Table S8). Although the naive expectation might suggest that relatively more true positive selection events, and hence signal replication, will have occurred in regions containing protein-coding genes, recent work has highlighted selection on regulation as especially important in recent human evolution (Enard et al. 2014).
Discussion
Our results suggest that incorporating information from the genetic map into calculations of selection statistics based on pairwise LD can substantially improve their performance, both in simulations and in replicating previously identified selection candidates. We have also suggested a modification of Kelly’s
that has higher power to detect the LD distortions caused by positive selection in simulations and that both these statistics outperform
in simulations.
We can begin to explain this by considering the distribution of different test statistics under neutrality and after a hard selective sweep (Figure 6). The maximum value of over the
region is often considerably larger after selection. The minimum value of
is only slightly reduced. These patterns—an increase in LD in the along region and a decrease in the over region—reflect prior simulations and theoretical expectations (Kim and Nielsen 2004; Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008) and are expected to be more pronounced toward the end of the sweep (leading to higher power at this stage, Table 1 and Table 2). Interestingly, the signal of reduced LD in the over region is not strong, and the maximum
test statistic demonstrates weaker power than
and a more peaked and skewed neutral distribution.
The distribution of test statistics in constant population size () neutral and hard selective sweep simulations (
) calculated using an
statistic window (unless otherwise indicated) and
region. Maximum values of
and
and the minimum value of
in the region were used as test statistics. To aid comparison, statistic distributions were standardized by subtracting the mean value under neutrality and dividing by the standard deviation under neutrality. Summary metrics of the neutral distribution, skewness and excess kurtosis, are indicated above the plots, as is the power to detect selection given the hard sweep model (FPR = 0.01).
Comparison between the distributions of maximum and
is especially enlightening. The moderately high power of
demonstrates that the essential signal that the ω statistic tries to detect is indeed a reasonable signature of late-stage hard selective sweeps. However, the skew and kurtosis of the neutral distribution of
are far greater. This appears to be the primary cause of the low power of the ω statistic in our simulations. We suggest that the adaptive window may be extremely successful at finding exceptionally high values of
under neutrality, leading to a very fat-tailed distribution and difficulty in distinguishing outlier values due to selection.
Although the statistic implemented in OmegaPlus performed relatively poorly in simulations, it was effective in replicating selection candidate signals. Indeed, of the LD-based statistics that we tested that did not incorporate information from the genetic map, this statistic reproduced signals at most previously suggested selection candidates at the 5% significance level. The apparent contradiction between power suggested by simulations and the ability to reproduce selection candidate signals in real data deserves consideration, as simulations are often used to justify the use of specific selection statistics. There are several possible confounding factors, and we begin by discussing our simulation modeling before turning to the implications of selection candidate replication.
The use of coalescent simulations to represent complex demographic scenarios (e.g., Schaffner et al. 2005) and selection (Wakeley 2010) is well established. We used two coalescent programs, MSMS (Ewing and Hermisson 2010), which has been widely used and cited, and Cosi2 (Shlyakhter et al. 2014), a development of the well-known Cosi program (Schaffner et al. 2005). In both cases, selection is incorporated by dividing the population according to allelic state at the selected site (Hudson and Kaplan 1988) and conditioning the coalescent process on an independently generated allele frequency trajectory describing the size of the two subpopulations over time. The bifurcating tree generated is expected to be least accurate when the sample size is a relatively large proportion of the population size and selection very strong (Fu 2006; Kim and Wiehe 2009). Importantly, we did not observe any anomalous power results associated with our largest sample size () and selection strength (
), and simulations were also found to yield qualitatively expected patterns of LD (Kim and Nielsen 2004; Stephan et al. 2006) and reduced diversity (see Figure 1). We therefore assume that genealogies under both neutrality and positive selection are approximated with reasonable accuracy by the coalescent simulation programs we employed.
The question to ask of our simulation results, then, is whether the correct selection, demographic, and recombination scenarios were modeled. In the case of recombination, the variable rate was sampled from a genetic map (Frazer et al. 2007) estimated using the HapMap data we investigate, which in turn appears to be broadly consistent with other information on recombination rate variation (e.g., Kong et al. 2010; Wang et al. 2012). The demographic model we apply was estimated based on the joint site frequency spectrum, using low-coverage data from the 1000 Genomes project (Gravel et al. 2011). This method involves fitting the site frequency spectrum calculated using a diffusion approximation of a Wright–Fisher population that incorporates drift, selection, and migration to observed data (Gutenkunst et al. 2009). Kingman’s coalescent can accurately approximate genealogies generated by a Wright–Fisher model (Kingman 1982; Fu 2006). As such, even though any demographic model is a vastly idealized version of a population’s history, the parameterization will capture qualities of human genetic data that in turn should be recapitulated in simulated genealogies. The model of human demography used is broadly consistent with understanding the out-of-Africa dispersals and is similar, in terms of divergence times and population size estimates, to a model estimated based on haplotype sharing (Harris and Nielsen 2013).
Our choice of selection scenarios is more constrained, with a focus on hard sweeps and soft sweeps from standing genetic variation. The evolution of real populations, and consequently the selection candidate list we use, will reflect a much wider range of selection phenomena. For example, soft sweeps, in which the ultimate fixation of an allele is caused by selection acting on more than one copy of that allele (Hermisson and Pennings 2005; also figure 1 of Messer and Petrov 2013), may also occur due to multiple advantageous mutations occurring at a single locus, a scenario we do not explore. Various factors contribute to the relative frequency of different types of selective sweep (Messer and Petrov 2013); both explicitly hard and soft sweeps have been inferred from genetic data (e.g., Peter et al. 2012; Garud et al. 2015; Schlamp et al. 2016).
More broadly, the frequency at which selective sweeps occurred in recent human evolution remains a subject of debate (Hernandez et al. 2011; Enard et al. 2014). Some signals may be driven by other selective processes that are known to have affected human genetic variation, in particular purifying (e.g., Bustamante et al. 2005) and balancing (e.g., Andrés et al. 2009) selection. The population genetic signatures that these create can be similar to those generated by selective sweeps—a simple example being reduction in diversity usually expected given both hard selective sweeps and purifying selection. The details of signal replication depend on whether these processes generate outlier values for multiple selection statistics. Regarding LD-based statistics, Kelly’s may be high in cases of balancing selection between linked loci (Kelly 1997) and soft sweeps involving recurrent mutation (Pennings and Hermisson 2006), while selection on loci with epistatic interactions can also affect LD patterns (Phillips 2008).
Inevitably, the patterns of LD generated by our hard and soft sweep simulations will only closely approximate patterns observed at a subset of candidate selection signals generated by selective processes. The possible range of demographic and selection scenarios that might be simulated is large such that comprehensive exploration is impractical. Meanwhile, the relative role of different forms of selection in different species and populations is not fully understood. Consequently, characterizing the performance of selection statistics is an iterative process. An indication of some conditions under which LD-based statistics appear effective, and of the utility of genetic maps in controlling for recombination rate variation, provides a useful baseline for further research.
In our work, we have assumed that our selection candidate set was enriched for positively selected regions and hence considered replication of signals by the LD-based statistics a useful indicator of their utility. Reproduction of a signal, however, has unclear implications. The critical property is the extent to which two statistics correlate positively under neutrality and under positive selection. To try to avoid strong correlations under neutrality, we did not use selection scans based on measures of allelic associations such as pairwise linkage disequilibrium or haplotype homozygosity statistics when generating our selection candidate signal list. We did include selection scans based on the site frequency spectrum; the pattern of high-frequency derived alleles following a selective sweep is expected to correlate with high LD due to common genealogical structure (Kim and Nielsen 2004). Summary statistics describing the site frequency spectrum can also be affected by recombination rate variation (Thornton 2005), such that the expected relationship between observed LD and such statistics appears complex.
Of broader relevance to the practice of detecting selection signals, only the reproduction of signals by statistics that show little correlation under neutrality can be considered independent evidence for selection at a locus; if this is not the case, replication can incorrectly give the impression of a robust selection candidate. Developing a detailed understanding of the correlation between different statistics under neutrality and selection, and the impact of complex demography, variable recombination, and variable mutation rate on this, is an important future direction in the interpretation of selection signatures.
A slightly different problem is caused by characteristics of molecular evolution that increase the variance of many different selection statistics, such as regionally low mutation or recombination rates. For example, infrequent mutation leads to low genetic diversity, sometimes used as an indicator of positive or purifying selection. However, it also reduces the number of available pairwise LD measurements in a region and leads to a coarser estimate of haplotype blocks and the site frequency spectrum. Increased variance in statistics at certain points in the genome will lead to more selection statistic outliers (both positive and negative) in these regions and hence a greater probability of overlapping outliers even when the neutral correlation between statistics is low.
These general points are highly relevant to the signal overlap between the LD-based statistics we tested, for example, which correlate strongly (Figure 5), but are not especially informative concerning the unexpectedly high number of selection candidates replicated by OmegaPlus. The three signals replicated only by the statistic were detected in three different articles using three different methods (Table 5), such that they cannot simply be attributed to neutral correlation between OmegaPlus and another selection statistic. While the
statistic has greatest performance at the end of a selective sweep (e.g., Table 1), coinciding with greatest population differentiation and larger distortions to the site frequency spectrum, its power at this stage was nevertheless lower than that of other LD-based statistics in simulations. Ultimately, further work will be needed to precisely clarify the relationship between signal replication and power assessed through simulations.

When controlling for expected LD given genetic map distance, we used two different genetic maps. We focus on results using the combined HapMap genetic map (Frazer et al. 2007) in the main text, which is based on LD patterns in Europeans, Asians, and Africans. As such, there is a danger of underestimating the recombination rate in regions with high LD due to natural selection and of incorrectly inferring hotspots when LD is low over a high-frequency selected site. Where recombination rate is underestimated, expected LD is correspondingly overestimated, and controlling for recombination rate is thus likely to degrade the signal of selection based on unusually high LD. Despite this, we found that results were not substantially different when using the deCODE genetic map (Kong et al. 2010) (see File S1 and File S2). This is interesting, in that differences between the genetic map estimated from LD patterns and that based on observed recombination events have been suggested as indicative of selection (O’Reilly et al. 2008). We speculate that the method of combining recombination rate estimates from multiple populations used in the HapMap genetic map substantially mitigates this effect. This is because LD-based methods tend to identify signatures of recent selection, which often postdate population divergence and will affect different genomic regions in the different populations.
We finally note that our method of controlling for expected LD is simple and that more complex alternatives may improve the power of these statistics further. For example, widely separated alleles that are always found on the same haplotype show an value of 1, but such distant cosegregating alleles are far less likely under neutrality if the alleles are both derived and at high frequency [the signal exploited by methods derived from EHH (Sabeti et al. 2002)]. Incorporating information about the derived allele frequency of each allele in a pairwise LD measurement in addition to genetic map distance may give a better indication of how unusual the observed LD pattern is and hence increase statistical power. The question, ultimately, would be whether this closer approximation of haplotype-based methods has advantages over the range of well-developed haplotype methods currently used.
Conclusion
Our work has confirmed that the power of selection statistics based on LD can often be improved by controlling for variable recombination rate. Doing so is likely to reduce the number of false positive selection candidates and give a clearer indication of the relative strength of selection signals. Of the methods we tested, the statistic and
showed highest power in simulations. In the absence of information on the genetic map, OmegaPlus was most successful at replicating selection candidates identified by other selection scan studies, despite often demonstrating low power in both soft and hard selective sweep simulations. Simulations are often used to test the performance of selection statistics, and this pattern creates an intriguing contradiction. Focusing on this problem, we conclude pragmatically, without a greater understanding of the correlation between signals, under neutrality especially, it is difficult to interpret precisely what signal replication implies. Thus, in our study the OmegaPlus statistic was effective at finding signals that have already been identified, but we are unable to suggest the precise evolutionary meaning of these signals or whether this reflects shared true or false positive results. Based on both simulation and replication, incorporating information on expected LD using a genetic map can substantially improve the performance of selection statistics.
Acknowledgments
We acknowledge useful discussions with Florian Clemente, Gereon Kaiping, and Mircea Iliescu at various stages of this work and helpful comments from Andy Collins. We also thank two anonymous reviewers for their comments. This work was submitted by an Engineering and Physical Science Research Council Doctoral Training Centre grant (EP/G03690X/1; G.S.J.) and a European Starting Investigator grant (FP7-26123) (to T.K.).
Appendix: Extended Methods
Designing Simple Selection Statistics
As indicated in the main text, we defined a series of possible selection statistics based on some average measurement of LD () in the along region and/or the over region. For example, two of the simplest statistics we explored were
(Equation 3) and
(Equation 4). However, it was often useful to use an alternative to the observed
value to control for expected LD. A simple example would be the statistic
corresponding to the average
in the along region minus the average expected
in the along region,
(A1a)where
is estimated based on the generated LD profile, described in detail below, and the genetic map distance between loci i and j. As a rule, we use superscripts to
and
to indicate cases where we are calculating a measure of LD in the along or the over region, respectively, in a manner analogous to
and
but based on a quantity other than the observed
between loci. We used four other approaches to controlling for expected LD within the along and over calculations,
(A1b)
(A1c)
(A1d)
(A1e)where, in Equation A1e,
denotes the value, at
of the cumulative distribution function of a Beta distribution with parameters a and b [
with
the incomplete beta function and
the complete beta function], fitted by maximum likelihood to
measurements in the appropriate genetic map distance bin, generated when creating the LD profile. This final approach is essentially attempting to estimate a P-value for each observed
measurement and averages these (which is far more conservative than using Fisher’s method to combine P-values and likely more representative given the strong correlation between LD at nearby pairs of loci).
Analogous quantities are defined when calculating variants of while Kelly’s
is the average expected
between all pairs of loci in a window. The selection statistics we assessed involved simple operations on estimates of LD in the along and over regions and consisted of the following:
LD and deviation from expected LD in the along region:
LD and deviation from expected LD in the over region:
Kelly’s
and deviation from expected Kelly’s
The
statistic and similar constant window-size alternatives that can control for their expected value:
Methods similar to Kelly’s
with more diverse approaches to controlling for expected LD:
Alternatives to
that instead use the difference between LD in the over and along regions:
The product of average LD in over and along:
The number of SNPs in the along and over region, where
indicates the cardinality of set R:
Many of these statistics rely on the creation of an LD profile, which we now describe.
The LD Profile
The LD profile consists of descriptive statistics of LD measurements between loci separated by a given genetic map distance. Creating the LD profile using simulated data involved generating 1000 samples of n chromosomes, of length 3 Mb, under neutrality and according to the relevant model of recombination and demography. As in our power simulations, loci with were removed, followed by the removal of random loci until the average spacing between polymorphic sites was 2500 bp.
values were calculated for all pairs of loci up to a distance of 2 cM, with the genetic map distance between loci based on the true genetic map, the low-resolution map, or physical distance. These LD measurements were assigned to 20,000 bins according to the genetic map distance between the two loci, such that each bin represents 0.00001 cM. Finally, the average LD,
for each bin and the standard deviation,
were calculated, and a maximum-likelihood fitting of the Beta distribution was performed using the Scipy module (Jones et al. 2001) (scipy.stats.beta.fit), giving values of a and b for each bin.
A different LD profile was generated for each combination of sample size, demographic model, recombination model, and assumed known genetic map. LD profiles were constructed for the HapMap data individually for each chromosome and population, using the two different genetic maps (Frazer et al. 2007; Kong et al. 2010).
Genetic Maps
In our simulations with variable recombination rate, we considered three possible scenarios concerning the genetic map. Two of these simply involved using the physical map as a proxy (PhysMap) or providing the real section of the HapMap genetic map according to which the data were simulated (TrueMap). The third one used a lower-resolution version of the true genetic map (LowResMap). This was generated by downsampling the HapMap map by a factor of 15, reducing the average distance between reported map positions from ∼817 bp to 12,260 bp. Note that the genetic map is still accurate in the sense that it was generated using all loci, but that hotspots will be considerably smoothed out.
Simulations
Simulations were performed using MSMS (Ewing and Hermisson 2010) and Cosi2 (Shlyakhter et al. 2014). Each simulation replicate involved simulating a sample size of chromosomes of length 1.5 Mb, which may or may not have been subject to selection at a site located in the middle of the chromosome. The mutation rate was
in the constant-size demographic model and followed Gravel et al. (2011),
in the out-of-Africa demographic model. When recombination rate was constant this was set to
the variable recombination rate was retrieved as described in Methods. The generation time was taken to be 25 years, as in Gravel et al. (2011). To approximate SNP panel data and avoid LD measurements involving singletons, loci with
in the sample were removed before randomly removing loci until the average spacing between SNPs was 2500 bp.
An region was then defined from positions 650 to 850 kb. We calculated the value of all selection statistics other than
using in-house scripts at each SNP, using three statistic window sizes (
). Selection statistics were not calculated if
or
were under 4 or if
was calculated on a grid using OmegaPlus with a resolution of 2500 bp, using window-size flags -minwin
-maxwin
and -minsnps
When reporting power, we focus on results for
in the main text, which tended to perform relatively well. Test statistics were retrieved as either the maximum or the minimum value of each selection statistic in the 200-kb region, unless SNP diversity was too low to obtain any selection statistic calculations, in which case the replicate was not used in power calculation. Note that the removal of very SNP-sparse replicates will make our power estimate for diversity-based statistics conservative—most windows removed would have been true positives. We estimate this distortion to generally be of the order of 0.01 in the tables presented in the main text (Table 1, Table 2, and Table 3), rising to a maximum of 0.07 for the OOA scenario with selection starting at
generations (Table 2).
Two demographic scenarios were used, one with constant population size () and one following an out-of-Africa model (Gravel et al. 2011) with samples taken from the European population. To obtain selective sweep trajectories under both demographic models in MSMS, we used two types of selection scenarios. In the first one, selection of strength s begins (pastward in time) at
generations with an allele frequency of
The time at which the selection phase of the model ends,
generations, corresponds to the time at which the de novo selected allele first appears and is determined stochastically when MSMS generates the selected allele frequency trajectory on which later coalescent simulations are conditioned. This approach is used when the population size is constant. The second method involves specifying s and times
and
as well as the frequency of the selected allele when the selection phase ends,
This allows for selection on standing variation,
This time,
is determined by the generated selected allele frequency trajectory. We use this approach when applying the OOA demographic model and when exploring selection on standing variation.
The selection scenarios investigated used an additive selection model and a selective advantage of for the homozygote. For the constant population size demographic model, we conditioned our hard sweep selection simulations on final allele frequency,
with
or
For the out-of-Africa model, we conditioned selection simulations on the starting allele frequency
and with
and
and further removed simulations in which the selected allele became extinct such that
The ROC curves from these scenarios were used to calculate the selOOA pAUC performance measure (Figure 3), with otherwise similar supplementary runs using
performed to assess the decay of the LD signal (Table 2).
In the OOA model, even the low initial frequency chosen, led to some soft sweeps (in which not all lineages with the selected site have coalesced at the start of selection), especially when n and s were large and
small. Using the “-oOC” flag in MSMS and further simulations, we estimated that ∼8% of selective sweeps used to calculate OOA pAUC metrics shown in File S1, Figure S1 and Figure 3A were soft sweeps; <1.5% of those used to obtain the power calculations shown in Table 3 were soft.
We also explored, in less detail, primarily soft sweeps acting on standing genetic variation. Here, sample size was window length
(in the case of the ω statistic, we calculated
), and selection strength
Initial allele frequency was
with the selection phase ending, pastward, at times
.
The program MSMS was used for all simulations in which the recombination rate did not vary, while Cosi2 was used for simulations with a variable recombination rate. The allele frequency trajectories used to simulate selection in Cosi2 were generated using MSMS. Example MSMS scripts [checked using PopPlanner (Ewing et al. 2015)] are shown in File S1, Table S4, while Figure A1 gives a schematic of the OOA demographic model, indicating the populations that may be under selection (depending on ).
ROC curves were calculated for each statistic by comparing 1000 neutral replicates with at least 300 replicates involving selection. As the statistics we used employed different signatures to detect selection, four ROC curves were calculated, based on top or bottom outliers of the maximum or minimum statistic value in the 200-kb region indicating selection. These were used to determine the pAUC between an FPR of 0 and 0.05. When assessing the performance of statistics, we did not want to make assumptions about the direction of deviation indicating selection or the statistic window size used in the selection scan. We therefore chose the maximum pAUC for each statistic at a sample size of (usually three window sizes and the 4 pAUC each, so the maximum of 12 pAUC values) as its measure of performance under a given selection, recombination rate, and demographic scenario. To summarize the performance of statistics under different selection models, we separated the selection scenarios into three groups, low frequency, high frequency, and selOOA as detailed in Methods. We averaged the maximum pAUC across scenarios in these groups to give an overall indication of average statistic performance. Note that we implicitly give equal weight to each selection coefficient; often power was extremely high when
and low when
In the case of selOOA, we did not include those scenarios for which the average frequency of the selected allele at sampling was low,
corresponding to
with
and
with
When calculating pAUC given selection on standing variation, we excluded only the
scenario.
Defining Selection Candidates and Assessing Signal Replication
To compile a list of previously suggested selection candidates, we searched for a range of studies performing selection scans based on the site frequency spectrum (SFS) or population differentiation. We identified eight appropriate selection scans, summarized in Table A1.
We converted the suggested signal locations into single 200-kb regions by identifying which 200-kb region overlapped the central point of the signal or the SNP reported. Very few regions were identified in multiple studies (the 72 signals identified for European Chr2 corresponded to 66 unique regions, with corresponding numbers 17/21 and 18/21 in European Chr 15 and Asian Chr 2, respectively), although there were obvious clusters of signals. A complete table of signals is included in File S1.
We classed 200-kb selection candidate regions that were in the top (bottom for the statistic) 5% of 200-kb regions for a given statistic to be successful in replications. To give an indication of how unexpected the statistic values of selection candidate regions were, we resampled the same number of randomly located 200-kb regions 10,000 times and compared the observed statistic value, rank of candidate windows, and number of outliers to this set. We noted that signals were often clustered and that peaks in several of the LD-based statistics often included several consecutive 200-kb regions. We therefore approximated this clustering in our resampling regime, copying the distribution of consecutive regions seen in the candidate signal data. For example, in the European chromosome 15 candidate data there were nine solitary regions, two runs of two consecutive regions, and a single run of four consecutive regions, and when resampling we followed this pattern.
When assessing the ability of different statistics to replicate the 200-kb selection candidate regions, selection statistics were calculated for each SNP in the region, using an statistic window for all selection scans apart from ω, for which
was used. These statistic window sizes were chosen for two reasons. First, large regions of the chromosomes in the HapMap data were SNP sparse and could not yield statistic scores for small windows. Second, a 400-kb window appeared to capture selection signals around LCT and SLC24A5 more effectively (results not shown). However, the
statistic window size differs from that used to calculate power results shown in Table 1, Table 2, and Figure 4. Simulations using an
window size were conducted and are incorporated into the pAUC calculations. The different window sizes for ω were not observed to strongly affect replication results; see Table S12 in File S1.
To assess whether 200-kb regions representing signals containing protein-coding genes were preferentially replicated, we focused on regions suggested based on European chromosome 2 data. These were split into two groups, those that contained protein-coding genes and those that did not, based on the hg18 RefSeq Genes track refGene table accessed through the UCSC table browser (genome.ucsc.edu/cgi-bin/hgTables), before being assessed for signal overlap with our LD-based methods as usual.
Schematic diagram of the out-of-Africa model of Gravel et al. (2011), with a simulated frequency trace of a positively selected allele shown at the top. The horizontal axis represents time, running pastward from right (the present) to left. For population size and migration parameters, see table 2 in Gravel et al. (2011). Three demographic time events are indicated: the time of an ancient African bottleneck, the time of the out-of-Africa bottleneck,
involving Eurasian population splitting from the African population, with modern Yoruba (YRI) descended from the latter; and the time of the split of the Eurasian population into Europeans (CEU) and East Asians (CHB+JPT),
Two selection time events are indicated: the time at which selection begins, pastward in time,
and the time at which it ends,
Depending on the values of
and
selection acts in different populations; the populations that would be subject to selection if
(148,000 years) are colored orange. Note that the exact implementation of the model when
(51,000 years) is modified slightly; see File S1, Table S4. In the frequency trace,
and
in this simulation, the frequency of the allele at
is
at
it is
and at present, after 300 generations of random drift,
Footnotes
Communicating editor: R. Nielsen
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.185900/-/DC1.
- Received December 10, 2015.
- Accepted April 5, 2016.
- Copyright © 2016 by the Genetics Society of America