Skip to main content
  • Facebook
  • Twitter
  • YouTube
  • LinkedIn
  • Google Plus
  • Other GSA Resources
    • Genetics Society of America
    • G3: Genes | Genomes | Genetics
    • Genes to Genomes: The GSA Blog
    • GSA Conferences
    • GeneticsCareers.org
  • Log in
Genetics

Main menu

  • HOME
  • ISSUES
    • Current Issue
    • Early Online
    • Archive
  • ABOUT
    • About the journal
    • Why publish with us?
    • Editorial board
    • Early Career Reviewers
    • Contact us
  • SERIES
    • Centennial
    • Genetics of Immunity
    • Genetics of Sex
    • Genomic Prediction
    • Multiparental Populations
    • FlyBook
    • WormBook
    • YeastBook
  • ARTICLE TYPES
    • About Article Types
    • Commentaries
    • Editorials
    • GSA Honors and Awards
    • Methods, Technology & Resources
    • Perspectives
    • Primers
    • Reviews
    • Toolbox Reviews
  • PUBLISH & REVIEW
    • Scope & publication policies
    • Submission & review process
    • Article types
    • Prepare your manuscript
    • Submit your manuscript
    • After acceptance
    • Guidelines for reviewers
  • SUBSCRIBE
    • Why subscribe?
    • For institutions
    • For individuals
    • Email alerts
    • RSS feeds
  • Other GSA Resources
    • Genetics Society of America
    • G3: Genes | Genomes | Genetics
    • Genes to Genomes: The GSA Blog
    • GSA Conferences
    • GeneticsCareers.org

User menu

Search

  • Advanced search
Genetics

Advanced Search

  • HOME
  • ISSUES
    • Current Issue
    • Early Online
    • Archive
  • ABOUT
    • About the journal
    • Why publish with us?
    • Editorial board
    • Early Career Reviewers
    • Contact us
  • SERIES
    • Centennial
    • Genetics of Immunity
    • Genetics of Sex
    • Genomic Prediction
    • Multiparental Populations
    • FlyBook
    • WormBook
    • YeastBook
  • ARTICLE TYPES
    • About Article Types
    • Commentaries
    • Editorials
    • GSA Honors and Awards
    • Methods, Technology & Resources
    • Perspectives
    • Primers
    • Reviews
    • Toolbox Reviews
  • PUBLISH & REVIEW
    • Scope & publication policies
    • Submission & review process
    • Article types
    • Prepare your manuscript
    • Submit your manuscript
    • After acceptance
    • Guidelines for reviewers
  • SUBSCRIBE
    • Why subscribe?
    • For institutions
    • For individuals
    • Email alerts
    • RSS feeds
Previous ArticleNext Article

Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps

View ORCID ProfileGuy S. Jacobs, View ORCID ProfileTimothy J. Sluckin and View ORCID ProfileToomas Kivisild
Genetics August 1, 2016 vol. 203 no. 4 1807-1825; https://doi.org/10.1534/genetics.115.185900
Guy S. Jacobs
Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United KingdomComplexity Institute, Nanyang Technological University, Singapore 637723
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Guy S. Jacobs
  • For correspondence: guysjacobs@ntu.edu.sg
Timothy J. Sluckin
Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Timothy J. Sluckin
Toomas Kivisild
Department of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Toomas Kivisild
  • Article
  • Figures & Data
  • Info & Metrics
Loading

Abstract

During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly’s Embedded Image and Embedded Image) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly’s Embedded Image offers high power, but is outperformed by a novel statistic that we test, which we call Embedded Image We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics—Embedded Image Kelly’s Embedded Image and Embedded Image—are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While Embedded Image replicates most candidates when recombination map data are not available, the Embedded Image and Embedded Image statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available.

  • linkage disequilibrium
  • positive selection
  • recombination rate
  • genetic map

GENOME-WIDE selection scans now form part of the standard repertoire of techniques through which to probe the evolutionary past of a population. These attempt to identify genomic regions showing evidence of nonneutral evolution by iteratively calculating a test statistic at different locations for a sample of genetic data. As the availability of genetic data and computational power has increased, so too have the number of scans performed and the range of statistics used. In the case of human genetics, thousands of putative selection signals have been suggested (Akey 2009). For the vast majority of these, it is unclear what phenotypic association may have allowed selection to operate, and a relatively small number of signals are replicated between studies.

Given the large number of hypotheses tested in genome scans for selection, many signals are likely to be false positives (Kelley et al. 2006; Akey 2009; Nei et al. 2010). This concern is compounded by systematic biases related to ascertainment (Thornton and Jensen 2007) or data quality (Mallick et al. 2009). Distinguishing true signatures of selection is challenging, requiring a multifaceted approach (Barrett and Hoekstra 2011), but improving statistics used to infer selection is an important step. This work is an attempt to test statistics calculated from patterns of linkage disequilibrium (LD) and enhance their ability to detect selective sweeps.

Distortions in the LD pattern surrounding a selected variant are one of several population genetic effects of natural selection related to genetic hitchhiking (Smith and Haigh 1974). In the case of positive selection, a hard selective sweep is expected to initially increase local LD. As the selected variant reaches high frequency, LD between SNPs located on opposite sides of the variant drops, as described by Kim and Nielsen (2004) and illustrated in Figure 1. Statistics designed to detect both the first [Kelly’s Embedded Image (Kelly 1997)] and second [ω (Kim and Nielsen 2004)] patterns have been suggested, and the theoretical dynamics of LD given positive selection have been explored (Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008). It has been argued that one of these statistics, ω, is relatively robust to nonequilibrium demographic history of the target population, which can vastly reduce the performance of other statistics (Jensen et al. 2007; Crisci et al. 2013).

Figure 1
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 1

The average pattern of pairwise linkage disequilibrium (Embedded Image lower triangle of each matrix) and SNP diversity (represented by average number of LD measurements at a given pairwise distance, upper triangle) created by a selective sweep, based on 2000 simulations. The human demographic model of Gravel et al. (2011) was used in the simulations (see Appendix: Extended Methods for details), with 40 chromosomes sampled from the European population. When selection is simulated, the sweep begins (forward in time) at time Embedded Image generations before the present, using an initial selected allele frequency of 0.0005 and an additive selection model with the homozygous state corresponding to Embedded Image The frequency of the selected allele in the present is, on average, ∼0.7 and Embedded Image when selection started 400 and 1600 generations ago, respectively. In the rightmost plot, the along (α) and over (β) regions used to calculate test statistics are indicated. As described by Kim and Nielsen (2004), this plot displays an increase in LD in the along region, but a decrease in the over region, compared to the central plot, which shows the increased LD in both regions associated with an intermediate-frequency partial sweep.

In humans (Daly et al. 2001) and other species (e.g., Petes 2001; Mezard 2006; Paigen and Petkov 2010) the local recombination rate is known to be highly variable. Genetic maps, which provide estimates of the recombination rate along a genome, are created to describe these fluctuations, either by using patterns of linkage disequilibrium in a population genetic sample (e.g., McVean et al. 2004) or by inferring recombination events in pedigrees (e.g., Kong et al. 2010). As an indication of the extent of recombination rate variation, human genetic data suggest that 60% of recombination events happen in 6% of the genome (Frazer et al. 2007). The portion of the genome with a high recombination rate manifests as highly local extreme peaks in the recombination rate, known as recombination hotspots. The implications of recombination rate variation on methods to detect selective sweeps based on pairwise LD have not yet been thoroughly explored.

In this article, we use simulations to assess the power of Kelly’s Embedded Image and ω to detect selective sweeps and compare them to a barrage of alternative LD-based selection statistics, some of which attempt to control for recombination rate variation. We focus on a hard sweep model of selection, but also consider selection acting on standing genetic variation, which can lead to soft sweeps (Hermisson and Pennings 2005). We then test several of the best-performing statistics on well-studied HapMap phase II data (Frazer et al. 2007), assessing their ability to replicate selection candidates identified using site frequency spectrum distortions and/or high population differentiation.

Selection Statistics Based on LD

Several methods that use pairwise LD patterns alone [as opposed to, for example, haplotype structure (Sabeti et al. 2002)] have been proposed—Kelly’s Embedded Image (Kelly 1997), Rozas’ Embedded Image and Embedded Image (Rozas et al. 2001), and the ω statistic (Kim and Nielsen 2004). A separate approach [Ped/Pop (O’Reilly et al. 2008)] compares local recombination rate estimated using LD distortions in population genetic data (McVean et al. 2004) with estimates based on pedigree data to detect unexpected LD patterns and hence detect selection. We focus on two of these methods—Kelly’s Embedded Image and ω—which detect very different qualities of the LD distortions expected given a selective sweep.

Kelly’s Embedded Image is simply the average pairwise LD between all SNPs over a fixed region of the genome (Kelly 1997),Embedded Image(1)where S corresponds to a list of polymorphic sites in the genomic region numbered Embedded Image i and j are indicators referring to loci in the list S, and Embedded Image is a standardized measure of LD corresponding to the squared correlation of allelic identity between loci i and j (Hill and Robertson 1968). The normalization ensures that Embedded Image when all loci in S are in maximal LD. Visually, an example calculation of this statistic would be the average Embedded Image among all SNPs contained in the window x in Figure 2. Given the dynamics of LD driven by a hard selective sweep, this statistic is expected to be most effective when a selected variant has reached a moderate to high frequency, but is not nearing fixation. The approach also has relatively high power to detect soft sweeps in which a locus experiences recurrent beneficial mutations (Pennings and Hermisson 2006).

Figure 2
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 2

Schematic illustration of a test statistic calculation over a region of length X. The dark gray line indicates a section of the chromosome and the rectangles intermittently spaced along it are SNPs. To obtain the test statistic value, the selection statistic is calculated for each SNP in the orange region, and either the maximum or the minimum of these values is retrieved as specified. The selection statistic itself is calculated over a window of length x centered on the target lth SNP. For example, to calculate the selection statistic for the first SNP in the region (colored light blue), SNPs within distance Embedded Image are divided into sets L (red) and R (green). LD measurements contributing to this selection statistic value are indicated as circles above the chromosome and are colored dark blue if in the along set (used to calculate Embedded Image) or pink if in the over set (used to calculate Embedded Image).

The ω statistic tries to identify a characteristic LD pattern that emerges toward the end of a hard selective sweep, represented by an increase in LD between SNPs downstream or upstream of a selected variant (along the chromosome), but a reduction in LD between those on either side of it (over the selected locus). Following the notation and definition of Kim and Nielsen (2004), ω is calculated by taking the list of S polymorphic sites in a window and dividing it into two contiguous groups, L and R, which are located on either side of the Embedded Image locus. L and R contain l and Embedded Image SNPs, respectively. Given these definitions, ω is defined asEmbedded Image(2)where the sum is taken over independent pairs Embedded Image with Embedded Image The position of l is iteratively moved along the chromosome to obtain Embedded Image the maximum value of ω. Referring again to Figure 2, an example calculation of this statistic would be the average value of Embedded Image measurements indicated by blue circles divided by the average value of Embedded Image measurements indicated by pink circles. Large values of Embedded Image may indicate nonneutral evolution. The pattern detected by ω is apparent in the rightmost plot of Figure 1.

The OmegaPlus program (Alachiotis et al. 2012b) implements a genome-scan variant of ω. Instead of locating the divisor on SNPs, a regularly sampled grid is defined over the region to be scanned. At each point on the grid, the value of Embedded Image is determined by varying the window size and hence the SNPs in sets L and R. In the original ω formula, the window size was limited by the length of sequence data available. As this would imply whole-chromosome windows in a genome-scan situation, the OmegaPlus implementation of ω differs in defining constraints on window size for each ω calculation, using the flags “-minwin” and “-maxwin” (supplemental material in Alachiotis et al. 2012b; first applied in Pavlidis et al. 2010). Normalization occurs as above and depends on the number of LD evaluations made in each window. The implementation of the statistic is highly optimized (Alachiotis et al. 2012a), such that large amounts of genetic data can be rapidly scanned, even on consumer-grade computers. Embedded Image has shown promise in identifying selective sweeps in simulation studies (Jensen et al. 2007; Crisci et al. 2013).

In our work, we are interested in assessing and developing LD-based summary statistics for use in genome scans. We therefore use the OmegaPlus program to calculate Embedded Image values rather than using the standard ω-statistic construction. Henceforth we use the notation Embedded Image to denote an individual run of OmegaPlus with window size flags -minwin and -maxwin set as the specified number of kilobases.

Challenges faced by Kelly’s Embedded Image and the ω statistic

Both Kelly’s Embedded Image and ω have been used to infer selection in population genetic data (e.g., Catalán et al. 2012; Lee et al. 2013; Renzette et al. 2015). However, both the construction of statistics and properties of molecular evolution may complicate the interpretation of results. We discuss two potential factors affecting these LD-based statistics—variable recombination rate, which affects both Kelly’s Embedded Image and ω, and the variable window size of ω.

Variable recombination rate

As previously noted, the recombination rate shows high levels of genome-wide variation for a number of species. This clearly has significant implications for statistics searching for unusual patterns of LD to infer nonneutral evolution (O’Reilly et al. 2008). Recombination hotspots cause local LD to plummet, which mimics the pattern of reduced LD expected at the end of the selective sweep and can lead to large values of the ω statistic. Conversely, certain regions of the genome have low recombination rates (Petes 2001; McVean et al. 2004), and these coldspots, with correspondingly high LD, will raise Kelly’s Embedded Image If the null distribution of ω or Kelly’s Embedded Image is based on simulations that do not accurately represent recombination rate variation, recombination hot- and coldspots may lead to false positives. When the null distribution is based on empirical genome-wide data, the signal of variable recombination rate will be captured, but we nevertheless expect a reduction in statistical power compared to the constant-recombination rate case due to outliers associated with recombination hotspots and coldspots.

Variable statistic window size

Many selection scans calculate statistics based on a sliding window of fixed size. The OmegaPlus algorithm takes a different approach, calculating the statistic at static positions l on a regular grid while changing the size of the windows used to define R and L on each side of the target locus to find Embedded Image This method has several implications. First, we note that the spatial extent of LD distortions in the genome caused by a selective sweep will depend on recombination rate, selection strength, and the age of the sweep (Smith and Haigh 1974). Given this, it is possible that a single scan with OmegaPlus using variable window size can identify a broader range of selection scenarios than a fixed-window approach. Second, the expected value of ω under neutrality depends on the sizes of R and L, which makes interpreting the statistic less intuitive. A highly simplified illustrative example is provided in Supplemental Material, File S1. Third, the changing size of the window means that the number of SNPs used to obtain LD values when calculating ω may vary considerably (although the “-minsnps” flag allows the specification of a minimum number of SNPs in a window and hence some control of this). The variance in the ω statistic is likely to be larger under neutrality for windows containing fewer SNPs, with Equation 2 evaluating to very large values when the denominator is randomly small. A similar effect may also be important in determining the distribution of other test statistics, particularly given the impact of selection (Smith and Haigh 1974) and neutral processes such as variable mutation rate (Hodgkinson and Eyre-Walker 2011) on diversity.

The neutral distribution of statistics designed to capture local patterns of LD depends, then, on features such as the window size used and neutral variation in SNP diversity. Furthermore, variable recombination rate substantially affects observed LD values. The implication is that some control for these phenomena may improve the ability of these statistics to detect positive selection. In this work, we focus on developing and testing LD statistics calculated over a fixed-size genomic window that control for variable recombination rate.

Methods

Our approach to improving LD-based statistics is a pragmatic one. In essence, we use simulations to explore a large number of possible LD-based statistics and assess their potential to detect selective sweeps by comparing their power to Kelly’s Embedded Image and Embedded Image The details of this process are described in Appendix: Extended Methods.

Designing selection statistics

When designing statistics, we begin by identifying a region of length X bp to be assessed for evidence of selection in a genetic sample. A selection statistic, such as Embedded Image is calculated centered on each polymorphic site in the region in turn; the test statistic for the region can be either the maximum or the minimum of these values.

The selection statistic is itself calculated over a window of fixed length x, centered on the target locus and containing polymorphic sites forming a list S. Unlike in the calculation of Embedded Image the target locus is fixed at the lth site in the list S, with the window stretching Embedded Image bp upstream and downstream of it. We then follow the approach taken by the ω statistic in dividing SNPs within the statistic window into two sets—those that are to the left of the target locus, L, and those to the right, R; see Figure 2. The ω statistic, Equation 2, averages LD measurements between SNPs in the same set (the along region) and divides this by the average LD between SNPs in different sets (the over region). A measure of the average value of LD in the along region isEmbedded Image(3)and in the over region isEmbedded Image(4)see Figure 1. Assuming the number of SNPs in L and R are similar, Kelly’s Embedded Image will be approximately the average of Embedded Image and Embedded Image while ω, calculated when the divisor lth locus is centrally located among the list of S polymorphic sites, is Embedded Image

The selection statistics we investigate take a similar approach in calculating a measure of average LD in the over and along regions, with the possibility of some simple operation (such as addition and division, as above) then performed on these. Unlike ω and Kelly’s Embedded Image we include the option of subtracting or otherwise controlling for the expected value of the selection statistic, based on expected LD between SNPs under neutrality. In total, we test 39 different statistics, 29 of which control for expected LD; see Appendix: Extended Methods.

Assessing the power of statistics through coalescent simulation

We assess the performance of different statistics using coalescent simulations generated with the programs MSMS (Ewing and Hermisson 2010) and Cosi2 (Shlyakhter et al. 2014). We first approximate the distribution of each statistic under neutrality by simulating 1000 samples for each demographic and recombination model. We consider the identification of a selection candidate within 100 kb of the true focus of selection to be practically useful and therefore calculate test statistics using a region size Embedded Image The 1000 values thus generated describe the null distributions for a test statistic. To approximate the distribution of the test statistics given a positive selective sweep, we simulate at least 300 samples with positive selection acting on a single SNP located in the middle of the chromosome, using various selection scenarios (e.g., different selection strengths and conditioning on different initial and final selected allele frequencies). The distribution of each test statistic under neutrality and given positive selection is then used to calculate power and receiver operating characteristic (ROC) curves (Metz 1978).

We use two demographic scenarios, one with constant population size of Embedded Image and the other following an out-of-Africa (OOA) human demographic model suggested by Gravel et al. (2011), with samples taken from the European population. Details can be found in Appendix: Extended Methods; also see Figure A1. The recombination rate is either constant (both models) or variable (constant population size model only). For the variable recombination rate model, the rate is sampled from the HapMap human chromosome 2 (b36) genetic map, estimated from HapMap phase II populations (Frazer et al. 2007), excluding regions close to the centromere.

To control for expected LD, we generate an “LD profile” describing the expected properties of LD at a given genetic map distance, such as the average or standard deviation of Embedded Image We approximate the LD profile by simulating Embedded Image bp of neutral genetic data according to the appropriate demographic scenario and sample size and then binning LD measurements according to their genetic map distance. For all simulations involving a variable recombination rate, we consider three scenarios. We assume that no genetic map is available, in which case physical distance serves as a proxy for genetic map distance (PhysMap); or that the true genetic map is available (TrueMap); or that a lower-resolution genetic map is available (LowResMap).

The algorithm that MSMS uses to simulate selection involves conditioning a coalescent model on a stochastically generated allele frequency trajectory. The allele frequency trajectory is created using the selection coefficient s, the demographic model, and several of four possible parameters describing features of the selection scenario—the time at which the selection phase begins (pastward), Embedded Image generations; the time at which selection stops, Embedded Image and the frequency of the selected variant at Embedded Image and Embedded Image Embedded Image and Embedded Image (see Figure A1). We focus our attention on a hard sweep model and compare test statistic performance based on three selection scenario categories—low frequency, where the final frequency of the selected allele Embedded Image high frequency, with Embedded Image or Embedded Image and selOOA, denoting the out-of-Africa demographic scenarios. We also briefly explore the performance of test statistics when selection acts as a variant at higher initial frequency, Embedded Image

Performance is assessed using the power of test statistics (e.g., Table 1, Table 2, and Figure 4) or a measurement based on the partial area under the ROC curve [pAUC (McClish (1989)] between a false positive rate (FPR) of 0 and 0.05 (Figure 3) that assesses the best possible test statistic performance suggested by our simulations. Using the pAUC gives greater emphasis on performance when the Embedded Image which is relevant for genome-wide selection scans. Again, further details on selection scenarios and the assessment of statistic performance are provided in Appendix: Extended Methods.

View this table:
  • View inline
  • View popup
Table 1 The power (FPR = 0.05, with Embedded Image) of several test statistics given a constant population size (Embedded Image) demographic model and a hard sweep (selection Embedded Image) to various allele frequencies Embedded Image
View this table:
  • View inline
  • View popup
Table 2 The power (FPR = 0.05, with Embedded Image) of several test statistics (see Table 1 legend) given an out-of-Africa demographic model and selection Embedded Image beginning at the time indicated, Embedded Image in generations, with initial allele frequency Embedded Image
Figure 3
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 3

Performance of different categories of selection statistic tested according to our pAUC metric, under (A) a constant recombination rate, (B) a variable recombination rate and the true genetic map, and (C) a variable recombination rate using the physical map as an approximation of the genetic map. The statistic categories are indicated in the key, corresponding, in order, to those that control for expected LD, those that do not, Kelly’s Embedded Image statistic, the Embedded Image statistic with various window sizes, and methods based on SNP diversity. The relationship between our pAUC metric and power is shown in D.

Replication of previously suggested selection candidates

We compare the performance of our selection statistics with Kelly’s Embedded Image and Embedded Image Tests using several statistics, particularly the Embedded Image statistic, Equation 3, and Kelly’s Embedded Image when these control for expected LD, have high power given a variable recombination rate. To help further assess the utility of different statistics, we present selection scan results, using HapMap phase II [NCBI b36 (Frazer et al. 2007)] data for human chromosome 2 (populations of European ancestry, CEU, and East Asian ancestry, CHB+JPT) and chromosome 15 (CEU). Recombination rate is controlled for using either the HapMap genetic map [derived from LD patterns (Frazer et al. 2007)] or the deCODE genetic map [derived from inferred recombination events in a large Icelandic cohort (Kong et al. (2010)]. We assessed the ability of the various statistics to replicate selection signals previously identified based on the site frequency spectrum and/or population differentiation (Carlson et al. 2005; Nielsen et al. 2005; Oleksyk et al. 2008; Pickrell et al. 2009; Chen et al. 2010; Ronen et al. 2013; Colonna et al. 2014; Pagani et al. 2016), genetic features that should be relatively independent of local LD patterns and recombination rate variation under neutrality (although see, e.g., figure 3 of Ferrer-Admetlla et al. 2014 and Thornton 2005).

Data availability

The authors state that all information or data necessary for confirming the conclusions presented in the article are either publicly available (HapMap data) or represented fully within the article and Supplemental Material files.

Results

Our simulations identify certain test statistics (maximum Kelly’s Embedded Image or Embedded Image over the Embedded Image regions we test) as particularly powerful. They also suggest that controlling for expected LD increases the power of these statistics when recombination rate is variable, but marginally reduces power when recombination rate is constant. We also found that controlling for expected LD increased the number of previously suggested selection candidates that these statistics replicated in HapMap phase II SNP data. Although the interpretation of signal replication is not trivial (signals may be false or true positives), the overall impression is that controlling for expected LD increases the performance of certain LD-based selection statistics.

Controlling for expected LD increases simulated power of statistics when recombination rate is variable

In total, we tested 29 methods that incorporated some form of control for expected linkage and 10 methods that did not. Comparing these as groups of similar statistics—for example, all those that divide average LD in the along region by that in the over region (like ω) or all those that add or average LD in these two regions (like Kelly’s Embedded Image)—the average improvement, given a constant population size demographic model, variable recombination rate, and a hard sweep model, over methods that did not consider genetic map data was 79% (in absolute terms, 0.22) by our pAUC metric (see Figure 3). Focusing on the case of Embedded Image window size Embedded Image and Embedded Image this reflects an average increase in power at 1% FPR of 120% (0.07) for the low-frequency scenarios and 90% (0.08) for the high-frequency scenarios, over all (both high and low performance) statistics.

When controlling for variable recombination rate, we found that both the full genetic map (TrueMap) and the lower-resolution genetic map (LowResMap) yielded similar results, with a performance reduction according to the pAUC metric of just 10% and 4% for the lower-resolution map given low-frequency and high-frequency scenarios, respectively. However, trying to control for expected LD based on physical distance yielded poor results, and those statistics that did not take variable recombination rate into account showed equal or better performance (see Figure 3C).

The power to detect selection at different stages of a hard selective sweep (or, in the case of the out-of-Africa demographic model, selection acting on an initially very rare variant, Embedded Image) for some representative statistics is shown in Table 1 and Table 2. We note the generally high performance of Embedded Image compared to Kelly’s Embedded Image and especially Embedded Image and the slower decay over time of reduced SNP diversity as a signal of a selective sweep compared to LD distortions (the Embedded Image statistic, describing the number of LD measurements in the over region). To avoid high variance in LD-based statistics, we did not calculate test statistics for simulations with very few SNPs; our reported power values for diversity-based statistics are hence slight underestimates (usually by <0.01 and always <0.07 in Table 1, Table 2, and Table 3).

The statistic Embedded Image also demonstrated relatively high power to detect a soft sweep acting on standing genetic variation (see Table 3 and File S1, Tables S2 and S3).

View this table:
  • View inline
  • View popup
Table 3 The power (FPR = 0.05, with Embedded Image) of several test statistics (see Table 1 legend) given a constant population size (Embedded Image) demographic model and selection (Embedded Image see File S1,Table S3 for Embedded Image) acting on standing genetic variation

Identifying the “best” statistic from our set was not simple; full results included in File S1 show that many approaches to controlling for expected LD had similar performance. Statistics based on the average LD in the along region, like Embedded Image tended to be particularly successful, and we therefore focus on these. In Table 1 and Table 2 we controlled for the expected statistic value given genetic map distance by dividing the observed Embedded Image by that expected given the SNP distribution under neutrality, Embedded Image While this is relatively intuitive, there are other possible approaches to controlling for expected LD, several of which are shown in Figure 4.

Figure 4
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 4

Power (FPR = 0.01, Embedded Image) of the selection statistic Embedded Image to detect a hard selective sweep compared with that of similar methods focusing on LD in the along region but controlling for expected LD; see Appendix: Extended Methods. A constant population size demographic model (Embedded Image) was used with variable recombination rate and (A) weaker, Embedded Image or (B) stronger, Embedded Image selection. Unless otherwise indicated Embedded Image

Overall, the Embedded Image statistic is simple to calculate and generally performed as well as or better than other LD-based statistics explored given the variable recombination rate.

Controlling for expected LD increases selection candidate signal replication in HapMap data

To further assess Embedded Image and other selection statistics, we performed selection scans, using publicly available HapMap genotype data (Frazer et al. 2007). We chose to focus on chromosomes 2 (European descent and Asian populations, CEU and CHB+JPT) and 15 (CEU) as these chromosomes/populations include several well-characterized (e.g., Sabeti et al. 2007; Mathieson et al. 2015) selection signals—MCM6/LCT, SLC24A5, HERC2, and EDAR.

An example selection scan for chromosome 15 (CEU), using OmegaPlus (Embedded Image), Kelly’s Embedded Image Embedded Image and Embedded Image as well as LD-controlled variants of Kelly’s Embedded Image and Embedded Image is shown in Figure 5 [results for chromosome 2 (CEU and CHB+JPT) are included in File S1]. Selection candidate signals suggested by other studies that did not use statistics based on LD or haplotype patterns (see Appendix: Extended Methods) are indicated. Table 4 shows the average value of each test statistic in the 200-kb genomic region containing these selection signals, as well as the average rank of selection candidate regions among all 200-kb regions and an indication of how unusual the observed findings are. Results for scans using several other LD-based statistics that we investigated are included in File S1. Scans on the CEU population sample using the deCODE (Kong et al. 2010) rather than the HapMap genetic map to control for expected LD yielded similar results and are therefore also included in File S1. While the main purpose of these scans was to assess the various statistics, we did find several novel peaks. As we are not aware of selection statistics based on pairwise LD being applied to these data (although see O’Reilly et al. 2008), we tabulate the strongest signals in File S1.

Figure 5
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 5

Selection scans (standardized to aid comparison) on HapMap phase II data (chromosome 15, CEU population), using a range of LD-based statistics and a diversity measure. From top to bottom, the scans represent Kelly’s Embedded Image Kelly’s Embedded Image controlled for expected LD, Embedded Image Embedded Image controlled for expected LD, the OmegaPlus program calculating Embedded Image and the diversity measure Embedded Image Excluding OmegaPlus, all statistics were calculated with statistic window size Embedded Image The HapMap combined genetic map was used to estimate expected LD. Thick dashed lines indicate the genomic targets of two relatively well-established signatures of selection, while the light gray lines indicate signals found in a range of studies based on population differentiation and the site frequency spectrum; see Appendix: Extended Methods and File S2.

View this table:
  • View inline
  • View popup
Table 4 The performance of test statistics in replicating a signal in candidate selected regions

Interpreting the selection scans shown in Figure 5 and in File S1, Figures S2 and S3, is not simple. Our metrics show no signal at many selection candidates, and in some cases controlling for expected LD actually leads to a reduction in signal strength. This in itself is not concerning, in that some suggested selection targets may be false positives. For example, of the 72 selection candidate regions we identified in various studies that searched for selection in European samples on chromosome 2, only 5 were replicated between studies. Nevertheless, it is clear that controlling for expected LD does not always improve the signal even for well-accepted candidates. While the signal at SLC24A5 (CEU) increases (see Figure 5), that of MCM6/LCT (CEU) appears to fall (see File S1, Figure S2), for both Kelly’s Embedded Image and Embedded Image which, unsurprisingly, are highly correlated.

Interestingly, the Embedded Image statistic actually displayed an unusually low value for some selected regions (such as LCT in CEU) rather than the high value that is generally expected, using various window sizes (File S1 and File S2). This is presumably because the selective sweep is recent and incomplete (Itan et al. 2009; Mathieson et al. 2015), and the reduction in LD in the over region that drives high Embedded Image values appears only in later stages of the sweep (Stephan et al. 2006). A similar pattern was sometimes observed in our simulations and was found to be strongly influenced by choice of window-size parameters.

Greater detail on the performance of the various statistics is given in Table 4. Over all selection candidate regions, Kelly’s Embedded Image replicated 9 of 96 signals as 5% outliers, the same as Embedded Image but Embedded Image (12) and Embedded Image (15 of 101). The estimate for Embedded Image is slightly pessimistic, as our pipeline removed regions with very few LD measurements, including two selection candidate regions [both in chromosome (Chr) 2, one Asian and the other European] that would otherwise have yielded positive results. Controlling for expected LD considerably improved the performance of both Kelly’s Embedded Image and Embedded Image with both replicating 18 of 96 signals. In general, all statistics replicated considerably more signals than expected by chance, as strongly suggested by the resampling results (Table 4).

The purpose of this study has been to assess certain LD-based selection statistics rather than to search for selection signatures as such. We therefore tabulate novel hits in File S1 only. Often, several outlier regions occurred in succession, such that it is difficult to identify specific genes, variants, or features that might be driving a signal. Different approaches using pairwise LD (Clemente et al. 2014), other population genetic patterns [e.g., DIND (Barreiro et al. 2009; Pagani et al. 2016) and DDAF (1000 Genomes Project Consortium et al. 2012; Colonna et al. 2014)], or especially biological information on the impact of variants can simplify this task. Certain novel 200-kb regions contained a single gene or a few genes, such as signals overlapping MKRN3 and ARHGAP11B (Europeans) and ABCA12 (Asians). ARHGAP11B is a human-lineage duplication that has recently been found to influence neocortex size (Florio et al. 2015), and its low variation supports a recent origin and possible selection. Mutations in MKRN3 can affect the onset of puberty (Abreu et al. 2013), and variants in this region have been associated with age at menarche (Perry et al. 2014). Finally, ABCA12 has been previously identified as a possible selection candidate based on population differentiation, possibly related to adaptation of the skin in response to ultraviolet light exposure (Colonna et al. 2014). We did not formally class this as a replication as our 200-kb region does not include the previously highlighted variant.

Dividing the 65 selection candidates identified in European chromosome 2 into 200-kb regions that did (39) and did not (26) contain protein-coding genes and calculating selection statistics did not support enrichment of signal replication in either set (File S1, Table S8). Although the naive expectation might suggest that relatively more true positive selection events, and hence signal replication, will have occurred in regions containing protein-coding genes, recent work has highlighted selection on regulation as especially important in recent human evolution (Enard et al. 2014).

Discussion

Our results suggest that incorporating information from the genetic map into calculations of selection statistics based on pairwise LD can substantially improve their performance, both in simulations and in replicating previously identified selection candidates. We have also suggested a modification of Kelly’s Embedded Image Embedded Image that has higher power to detect the LD distortions caused by positive selection in simulations and that both these statistics outperform Embedded Image in simulations.

We can begin to explain this by considering the distribution of different test statistics under neutrality and after a hard selective sweep (Figure 6). The maximum value of Embedded Image over the Embedded Image region is often considerably larger after selection. The minimum value of Embedded Image is only slightly reduced. These patterns—an increase in LD in the along region and a decrease in the over region—reflect prior simulations and theoretical expectations (Kim and Nielsen 2004; Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008) and are expected to be more pronounced toward the end of the sweep (leading to higher power at this stage, Table 1 and Table 2). Interestingly, the signal of reduced LD in the over region is not strong, and the maximum Embedded Image test statistic demonstrates weaker power than Embedded Image and a more peaked and skewed neutral distribution.

Figure 6
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 6

The distribution of test statistics in constant population size (Embedded Image) neutral and hard selective sweep simulations (Embedded Image Embedded Image Embedded Image) calculated using an Embedded Image statistic window (unless otherwise indicated) and Embedded Image region. Maximum values of Embedded Image Embedded Image and Embedded Image and the minimum value of Embedded Image in the region were used as test statistics. To aid comparison, statistic distributions were standardized by subtracting the mean value under neutrality and dividing by the standard deviation under neutrality. Summary metrics of the neutral distribution, skewness and excess kurtosis, are indicated above the plots, as is the power to detect selection given the hard sweep model (FPR = 0.01).

Comparison between the distributions of maximum Embedded Image and Embedded Image is especially enlightening. The moderately high power of Embedded Image demonstrates that the essential signal that the ω statistic tries to detect is indeed a reasonable signature of late-stage hard selective sweeps. However, the skew and kurtosis of the neutral distribution of Embedded Image are far greater. This appears to be the primary cause of the low power of the ω statistic in our simulations. We suggest that the adaptive window may be extremely successful at finding exceptionally high values of Embedded Image under neutrality, leading to a very fat-tailed distribution and difficulty in distinguishing outlier values due to selection.

Although the Embedded Image statistic implemented in OmegaPlus performed relatively poorly in simulations, it was effective in replicating selection candidate signals. Indeed, of the LD-based statistics that we tested that did not incorporate information from the genetic map, this statistic reproduced signals at most previously suggested selection candidates at the 5% significance level. The apparent contradiction between power suggested by simulations and the ability to reproduce selection candidate signals in real data deserves consideration, as simulations are often used to justify the use of specific selection statistics. There are several possible confounding factors, and we begin by discussing our simulation modeling before turning to the implications of selection candidate replication.

The use of coalescent simulations to represent complex demographic scenarios (e.g., Schaffner et al. 2005) and selection (Wakeley 2010) is well established. We used two coalescent programs, MSMS (Ewing and Hermisson 2010), which has been widely used and cited, and Cosi2 (Shlyakhter et al. 2014), a development of the well-known Cosi program (Schaffner et al. 2005). In both cases, selection is incorporated by dividing the population according to allelic state at the selected site (Hudson and Kaplan 1988) and conditioning the coalescent process on an independently generated allele frequency trajectory describing the size of the two subpopulations over time. The bifurcating tree generated is expected to be least accurate when the sample size is a relatively large proportion of the population size and selection very strong (Fu 2006; Kim and Wiehe 2009). Importantly, we did not observe any anomalous power results associated with our largest sample size (Embedded Image) and selection strength (Embedded Image), and simulations were also found to yield qualitatively expected patterns of LD (Kim and Nielsen 2004; Stephan et al. 2006) and reduced diversity (see Figure 1). We therefore assume that genealogies under both neutrality and positive selection are approximated with reasonable accuracy by the coalescent simulation programs we employed.

The question to ask of our simulation results, then, is whether the correct selection, demographic, and recombination scenarios were modeled. In the case of recombination, the variable rate was sampled from a genetic map (Frazer et al. 2007) estimated using the HapMap data we investigate, which in turn appears to be broadly consistent with other information on recombination rate variation (e.g., Kong et al. 2010; Wang et al. 2012). The demographic model we apply was estimated based on the joint site frequency spectrum, using low-coverage data from the 1000 Genomes project (Gravel et al. 2011). This method involves fitting the site frequency spectrum calculated using a diffusion approximation of a Wright–Fisher population that incorporates drift, selection, and migration to observed data (Gutenkunst et al. 2009). Kingman’s coalescent can accurately approximate genealogies generated by a Wright–Fisher model (Kingman 1982; Fu 2006). As such, even though any demographic model is a vastly idealized version of a population’s history, the parameterization will capture qualities of human genetic data that in turn should be recapitulated in simulated genealogies. The model of human demography used is broadly consistent with understanding the out-of-Africa dispersals and is similar, in terms of divergence times and population size estimates, to a model estimated based on haplotype sharing (Harris and Nielsen 2013).

Our choice of selection scenarios is more constrained, with a focus on hard sweeps and soft sweeps from standing genetic variation. The evolution of real populations, and consequently the selection candidate list we use, will reflect a much wider range of selection phenomena. For example, soft sweeps, in which the ultimate fixation of an allele is caused by selection acting on more than one copy of that allele (Hermisson and Pennings 2005; also figure 1 of Messer and Petrov 2013), may also occur due to multiple advantageous mutations occurring at a single locus, a scenario we do not explore. Various factors contribute to the relative frequency of different types of selective sweep (Messer and Petrov 2013); both explicitly hard and soft sweeps have been inferred from genetic data (e.g., Peter et al. 2012; Garud et al. 2015; Schlamp et al. 2016).

More broadly, the frequency at which selective sweeps occurred in recent human evolution remains a subject of debate (Hernandez et al. 2011; Enard et al. 2014). Some signals may be driven by other selective processes that are known to have affected human genetic variation, in particular purifying (e.g., Bustamante et al. 2005) and balancing (e.g., Andrés et al. 2009) selection. The population genetic signatures that these create can be similar to those generated by selective sweeps—a simple example being reduction in diversity usually expected given both hard selective sweeps and purifying selection. The details of signal replication depend on whether these processes generate outlier values for multiple selection statistics. Regarding LD-based statistics, Kelly’s Embedded Image may be high in cases of balancing selection between linked loci (Kelly 1997) and soft sweeps involving recurrent mutation (Pennings and Hermisson 2006), while selection on loci with epistatic interactions can also affect LD patterns (Phillips 2008).

Inevitably, the patterns of LD generated by our hard and soft sweep simulations will only closely approximate patterns observed at a subset of candidate selection signals generated by selective processes. The possible range of demographic and selection scenarios that might be simulated is large such that comprehensive exploration is impractical. Meanwhile, the relative role of different forms of selection in different species and populations is not fully understood. Consequently, characterizing the performance of selection statistics is an iterative process. An indication of some conditions under which LD-based statistics appear effective, and of the utility of genetic maps in controlling for recombination rate variation, provides a useful baseline for further research.

In our work, we have assumed that our selection candidate set was enriched for positively selected regions and hence considered replication of signals by the LD-based statistics a useful indicator of their utility. Reproduction of a signal, however, has unclear implications. The critical property is the extent to which two statistics correlate positively under neutrality and under positive selection. To try to avoid strong correlations under neutrality, we did not use selection scans based on measures of allelic associations such as pairwise linkage disequilibrium or haplotype homozygosity statistics when generating our selection candidate signal list. We did include selection scans based on the site frequency spectrum; the pattern of high-frequency derived alleles following a selective sweep is expected to correlate with high LD due to common genealogical structure (Kim and Nielsen 2004). Summary statistics describing the site frequency spectrum can also be affected by recombination rate variation (Thornton 2005), such that the expected relationship between observed LD and such statistics appears complex.

Of broader relevance to the practice of detecting selection signals, only the reproduction of signals by statistics that show little correlation under neutrality can be considered independent evidence for selection at a locus; if this is not the case, replication can incorrectly give the impression of a robust selection candidate. Developing a detailed understanding of the correlation between different statistics under neutrality and selection, and the impact of complex demography, variable recombination, and variable mutation rate on this, is an important future direction in the interpretation of selection signatures.

A slightly different problem is caused by characteristics of molecular evolution that increase the variance of many different selection statistics, such as regionally low mutation or recombination rates. For example, infrequent mutation leads to low genetic diversity, sometimes used as an indicator of positive or purifying selection. However, it also reduces the number of available pairwise LD measurements in a region and leads to a coarser estimate of haplotype blocks and the site frequency spectrum. Increased variance in statistics at certain points in the genome will lead to more selection statistic outliers (both positive and negative) in these regions and hence a greater probability of overlapping outliers even when the neutral correlation between statistics is low.

These general points are highly relevant to the signal overlap between the LD-based statistics we tested, for example, which correlate strongly (Figure 5), but are not especially informative concerning the unexpectedly high number of selection candidates replicated by OmegaPlus. The three signals replicated only by the Embedded Image statistic were detected in three different articles using three different methods (Table 5), such that they cannot simply be attributed to neutral correlation between OmegaPlus and another selection statistic. While the Embedded Image statistic has greatest performance at the end of a selective sweep (e.g., Table 1), coinciding with greatest population differentiation and larger distortions to the site frequency spectrum, its power at this stage was nevertheless lower than that of other LD-based statistics in simulations. Ultimately, further work will be needed to precisely clarify the relationship between signal replication and power assessed through simulations.

View this table:
  • View inline
  • View popup
Table 5 Replication of selection candidate regions at the more stringent Embedded Image level

When controlling for expected LD given genetic map distance, we used two different genetic maps. We focus on results using the combined HapMap genetic map (Frazer et al. 2007) in the main text, which is based on LD patterns in Europeans, Asians, and Africans. As such, there is a danger of underestimating the recombination rate in regions with high LD due to natural selection and of incorrectly inferring hotspots when LD is low over a high-frequency selected site. Where recombination rate is underestimated, expected LD is correspondingly overestimated, and controlling for recombination rate is thus likely to degrade the signal of selection based on unusually high LD. Despite this, we found that results were not substantially different when using the deCODE genetic map (Kong et al. 2010) (see File S1 and File S2). This is interesting, in that differences between the genetic map estimated from LD patterns and that based on observed recombination events have been suggested as indicative of selection (O’Reilly et al. 2008). We speculate that the method of combining recombination rate estimates from multiple populations used in the HapMap genetic map substantially mitigates this effect. This is because LD-based methods tend to identify signatures of recent selection, which often postdate population divergence and will affect different genomic regions in the different populations.

We finally note that our method of controlling for expected LD is simple and that more complex alternatives may improve the power of these statistics further. For example, widely separated alleles that are always found on the same haplotype show an Embedded Image value of 1, but such distant cosegregating alleles are far less likely under neutrality if the alleles are both derived and at high frequency [the signal exploited by methods derived from EHH (Sabeti et al. 2002)]. Incorporating information about the derived allele frequency of each allele in a pairwise LD measurement in addition to genetic map distance may give a better indication of how unusual the observed LD pattern is and hence increase statistical power. The question, ultimately, would be whether this closer approximation of haplotype-based methods has advantages over the range of well-developed haplotype methods currently used.

Conclusion

Our work has confirmed that the power of selection statistics based on LD can often be improved by controlling for variable recombination rate. Doing so is likely to reduce the number of false positive selection candidates and give a clearer indication of the relative strength of selection signals. Of the methods we tested, the Embedded Image statistic and Embedded Image showed highest power in simulations. In the absence of information on the genetic map, OmegaPlus was most successful at replicating selection candidates identified by other selection scan studies, despite often demonstrating low power in both soft and hard selective sweep simulations. Simulations are often used to test the performance of selection statistics, and this pattern creates an intriguing contradiction. Focusing on this problem, we conclude pragmatically, without a greater understanding of the correlation between signals, under neutrality especially, it is difficult to interpret precisely what signal replication implies. Thus, in our study the OmegaPlus statistic was effective at finding signals that have already been identified, but we are unable to suggest the precise evolutionary meaning of these signals or whether this reflects shared true or false positive results. Based on both simulation and replication, incorporating information on expected LD using a genetic map can substantially improve the performance of selection statistics.

Acknowledgments

We acknowledge useful discussions with Florian Clemente, Gereon Kaiping, and Mircea Iliescu at various stages of this work and helpful comments from Andy Collins. We also thank two anonymous reviewers for their comments. This work was submitted by an Engineering and Physical Science Research Council Doctoral Training Centre grant (EP/G03690X/1; G.S.J.) and a European Starting Investigator grant (FP7-26123) (to T.K.).

Appendix: Extended Methods

Designing Simple Selection Statistics

As indicated in the main text, we defined a series of possible selection statistics based on some average measurement of LD (Embedded Image) in the along region and/or the over region. For example, two of the simplest statistics we explored were Embedded Image (Equation 3) and Embedded Image (Equation 4). However, it was often useful to use an alternative to the observed Embedded Image value to control for expected LD. A simple example would be the statistic Embedded Image corresponding to the average Embedded Image in the along region minus the average expected Embedded Image in the along region,Embedded Image(A1a)where Embedded Image is estimated based on the generated LD profile, described in detail below, and the genetic map distance between loci i and j. As a rule, we use superscripts to Embedded Image and Embedded Image to indicate cases where we are calculating a measure of LD in the along or the over region, respectively, in a manner analogous to Embedded Image and Embedded Image but based on a quantity other than the observed Embedded Image between loci. We used four other approaches to controlling for expected LD within the along and over calculations,Embedded Image(A1b)Embedded Image(A1c)Embedded Image(A1d)Embedded Image(A1e)where, in Equation A1e, Embedded Image denotes the value, at Embedded Image of the cumulative distribution function of a Beta distribution with parameters a and b [Embedded Image with Embedded Image the incomplete beta function and Embedded Image the complete beta function], fitted by maximum likelihood to Embedded Image measurements in the appropriate genetic map distance bin, generated when creating the LD profile. This final approach is essentially attempting to estimate a P-value for each observed Embedded Image measurement and averages these (which is far more conservative than using Fisher’s method to combine P-values and likely more representative given the strong correlation between LD at nearby pairs of loci).

Analogous quantities are defined when calculating variants of Embedded Image while Kelly’s Embedded Image is the average expected Embedded Image between all pairs of loci in a window. The selection statistics we assessed involved simple operations on estimates of LD in the along and over regions and consisted of the following:

  1. LD and deviation from expected LD in the along region:Embedded Image

  2. LD and deviation from expected LD in the over region:Embedded Image

  3. Kelly’s Embedded Image and deviation from expected Kelly’s Embedded ImageEmbedded Image

  4. The Embedded Image statistic and similar constant window-size alternatives that can control for their expected value:Embedded Image

  5. Methods similar to Kelly’s Embedded Image with more diverse approaches to controlling for expected LD:Embedded Image

  6. Alternatives to Embedded Image that instead use the difference between LD in the over and along regions:Embedded Image

  7. The product of average LD in over and along:Embedded Image

  8. The number of SNPs in the along and over region, where Embedded Image indicates the cardinality of set R:Embedded Image

Many of these statistics rely on the creation of an LD profile, which we now describe.

The LD Profile

The LD profile consists of descriptive statistics of LD measurements between loci separated by a given genetic map distance. Creating the LD profile using simulated data involved generating 1000 samples of n chromosomes, of length 3 Mb, under neutrality and according to the relevant model of recombination and demography. As in our power simulations, loci with Embedded Image were removed, followed by the removal of random loci until the average spacing between polymorphic sites was 2500 bp. Embedded Image values were calculated for all pairs of loci up to a distance of 2 cM, with the genetic map distance between loci based on the true genetic map, the low-resolution map, or physical distance. These LD measurements were assigned to 20,000 bins according to the genetic map distance between the two loci, such that each bin represents 0.00001 cM. Finally, the average LD, Embedded Image for each bin and the standard deviation, Embedded Image were calculated, and a maximum-likelihood fitting of the Beta distribution was performed using the Scipy module (Jones et al. 2001) (scipy.stats.beta.fit), giving values of a and b for each bin.

A different LD profile was generated for each combination of sample size, demographic model, recombination model, and assumed known genetic map. LD profiles were constructed for the HapMap data individually for each chromosome and population, using the two different genetic maps (Frazer et al. 2007; Kong et al. 2010).

Genetic Maps

In our simulations with variable recombination rate, we considered three possible scenarios concerning the genetic map. Two of these simply involved using the physical map as a proxy (PhysMap) or providing the real section of the HapMap genetic map according to which the data were simulated (TrueMap). The third one used a lower-resolution version of the true genetic map (LowResMap). This was generated by downsampling the HapMap map by a factor of 15, reducing the average distance between reported map positions from ∼817 bp to 12,260 bp. Note that the genetic map is still accurate in the sense that it was generated using all loci, but that hotspots will be considerably smoothed out.

Simulations

Simulations were performed using MSMS (Ewing and Hermisson 2010) and Cosi2 (Shlyakhter et al. 2014). Each simulation replicate involved simulating a sample size of Embedded Image chromosomes of length 1.5 Mb, which may or may not have been subject to selection at a site located in the middle of the chromosome. The mutation rate was Embedded Image in the constant-size demographic model and followed Gravel et al. (2011), Embedded Image in the out-of-Africa demographic model. When recombination rate was constant this was set to Embedded Image the variable recombination rate was retrieved as described in Methods. The generation time was taken to be 25 years, as in Gravel et al. (2011). To approximate SNP panel data and avoid LD measurements involving singletons, loci with Embedded Image in the sample were removed before randomly removing loci until the average spacing between SNPs was 2500 bp.

An Embedded Image region was then defined from positions 650 to 850 kb. We calculated the value of all selection statistics other than Embedded Image using in-house scripts at each SNP, using three statistic window sizes (Embedded Image). Selection statistics were not calculated if Embedded Image or Embedded Image were under 4 or if Embedded Image Embedded Image was calculated on a grid using OmegaPlus with a resolution of 2500 bp, using window-size flags -minwin Embedded Image -maxwin Embedded Image and -minsnps Embedded Image When reporting power, we focus on results for Embedded Image in the main text, which tended to perform relatively well. Test statistics were retrieved as either the maximum or the minimum value of each selection statistic in the 200-kb region, unless SNP diversity was too low to obtain any selection statistic calculations, in which case the replicate was not used in power calculation. Note that the removal of very SNP-sparse replicates will make our power estimate for diversity-based statistics conservative—most windows removed would have been true positives. We estimate this distortion to generally be of the order of 0.01 in the tables presented in the main text (Table 1, Table 2, and Table 3), rising to a maximum of 0.07 for the OOA scenario with selection starting at Embedded Image generations (Table 2).

Two demographic scenarios were used, one with constant population size (Embedded Image) and one following an out-of-Africa model (Gravel et al. 2011) with samples taken from the European population. To obtain selective sweep trajectories under both demographic models in MSMS, we used two types of selection scenarios. In the first one, selection of strength s begins (pastward in time) at Embedded Image generations with an allele frequency of Embedded Image The time at which the selection phase of the model ends, Embedded Image generations, corresponds to the time at which the de novo selected allele first appears and is determined stochastically when MSMS generates the selected allele frequency trajectory on which later coalescent simulations are conditioned. This approach is used when the population size is constant. The second method involves specifying s and times Embedded Image and Embedded Image as well as the frequency of the selected allele when the selection phase ends, Embedded Image This allows for selection on standing variation, Embedded Image This time, Embedded Image is determined by the generated selected allele frequency trajectory. We use this approach when applying the OOA demographic model and when exploring selection on standing variation.

The selection scenarios investigated used an additive selection model and a selective advantage of Embedded Image for the homozygote. For the constant population size demographic model, we conditioned our hard sweep selection simulations on final allele frequency, Embedded Image with Embedded Image or Embedded Image For the out-of-Africa model, we conditioned selection simulations on the starting allele frequency Embedded Image and with Embedded Image and Embedded Image and further removed simulations in which the selected allele became extinct such that Embedded Image The ROC curves from these scenarios were used to calculate the selOOA pAUC performance measure (Figure 3), with otherwise similar supplementary runs using Embedded Image performed to assess the decay of the LD signal (Table 2).

In the OOA model, even the low initial frequency chosen, Embedded Image led to some soft sweeps (in which not all lineages with the selected site have coalesced at the start of selection), especially when n and s were large and Embedded Image small. Using the “-oOC” flag in MSMS and further simulations, we estimated that ∼8% of selective sweeps used to calculate OOA pAUC metrics shown in File S1, Figure S1 and Figure 3A were soft sweeps; <1.5% of those used to obtain the power calculations shown in Table 3 were soft.

We also explored, in less detail, primarily soft sweeps acting on standing genetic variation. Here, sample size was Embedded Image window length Embedded Image (in the case of the ω statistic, we calculated Embedded Image), and selection strength Embedded Image Initial allele frequency was Embedded Image with the selection phase ending, pastward, at times Embedded Image.

The program MSMS was used for all simulations in which the recombination rate did not vary, while Cosi2 was used for simulations with a variable recombination rate. The allele frequency trajectories used to simulate selection in Cosi2 were generated using MSMS. Example MSMS scripts [checked using PopPlanner (Ewing et al. 2015)] are shown in File S1, Table S4, while Figure A1 gives a schematic of the OOA demographic model, indicating the populations that may be under selection (depending on Embedded Image).

ROC curves were calculated for each statistic by comparing 1000 neutral replicates with at least 300 replicates involving selection. As the statistics we used employed different signatures to detect selection, four ROC curves were calculated, based on top or bottom outliers of the maximum or minimum statistic value in the 200-kb region indicating selection. These were used to determine the pAUC between an FPR of 0 and 0.05. When assessing the performance of statistics, we did not want to make assumptions about the direction of deviation indicating selection or the statistic window size used in the selection scan. We therefore chose the maximum pAUC for each statistic at a sample size of Embedded Image (usually three window sizes and the 4 pAUC each, so the maximum of 12 pAUC values) as its measure of performance under a given selection, recombination rate, and demographic scenario. To summarize the performance of statistics under different selection models, we separated the selection scenarios into three groups, low frequency, high frequency, and selOOA as detailed in Methods. We averaged the maximum pAUC across scenarios in these groups to give an overall indication of average statistic performance. Note that we implicitly give equal weight to each selection coefficient; often power was extremely high when Embedded Image and low when Embedded Image In the case of selOOA, we did not include those scenarios for which the average frequency of the selected allele at sampling was low, Embedded Image corresponding to Embedded Image with Embedded Image and Embedded Image with Embedded Image When calculating pAUC given selection on standing variation, we excluded only the Embedded Image Embedded Image Embedded Image scenario.

Defining Selection Candidates and Assessing Signal Replication

To compile a list of previously suggested selection candidates, we searched for a range of studies performing selection scans based on the site frequency spectrum (SFS) or population differentiation. We identified eight appropriate selection scans, summarized in Table A1.

We converted the suggested signal locations into single 200-kb regions by identifying which 200-kb region overlapped the central point of the signal or the SNP reported. Very few regions were identified in multiple studies (the 72 signals identified for European Chr2 corresponded to 66 unique regions, with corresponding numbers 17/21 and 18/21 in European Chr 15 and Asian Chr 2, respectively), although there were obvious clusters of signals. A complete table of signals is included in File S1.

We classed 200-kb selection candidate regions that were in the top (bottom for the Embedded Image statistic) 5% of 200-kb regions for a given statistic to be successful in replications. To give an indication of how unexpected the statistic values of selection candidate regions were, we resampled the same number of randomly located 200-kb regions 10,000 times and compared the observed statistic value, rank of candidate windows, and number of outliers to this set. We noted that signals were often clustered and that peaks in several of the LD-based statistics often included several consecutive 200-kb regions. We therefore approximated this clustering in our resampling regime, copying the distribution of consecutive regions seen in the candidate signal data. For example, in the European chromosome 15 candidate data there were nine solitary regions, two runs of two consecutive regions, and a single run of four consecutive regions, and when resampling we followed this pattern.

When assessing the ability of different statistics to replicate the 200-kb selection candidate regions, selection statistics were calculated for each SNP in the region, using an Embedded Image statistic window for all selection scans apart from ω, for which Embedded Image was used. These statistic window sizes were chosen for two reasons. First, large regions of the chromosomes in the HapMap data were SNP sparse and could not yield statistic scores for small windows. Second, a 400-kb window appeared to capture selection signals around LCT and SLC24A5 more effectively (results not shown). However, the Embedded Image statistic window size differs from that used to calculate power results shown in Table 1, Table 2, and Figure 4. Simulations using an Embedded Image window size were conducted and are incorporated into the pAUC calculations. The different window sizes for ω were not observed to strongly affect replication results; see Table S12 in File S1.

To assess whether 200-kb regions representing signals containing protein-coding genes were preferentially replicated, we focused on regions suggested based on European chromosome 2 data. These were split into two groups, those that contained protein-coding genes and those that did not, based on the hg18 RefSeq Genes track refGene table accessed through the UCSC table browser (genome.ucsc.edu/cgi-bin/hgTables), before being assessed for signal overlap with our LD-based methods as usual.

Figure A1
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure A1

Schematic diagram of the out-of-Africa model of Gravel et al. (2011), with a simulated frequency trace of a positively selected allele shown at the top. The horizontal axis represents time, running pastward from right (the present) to left. For population size and migration parameters, see table 2 in Gravel et al. (2011). Three demographic time events are indicated: the time of an ancient African bottleneck, Embedded Image the time of the out-of-Africa bottleneck, Embedded Image involving Eurasian population splitting from the African population, with modern Yoruba (YRI) descended from the latter; and the time of the split of the Eurasian population into Europeans (CEU) and East Asians (CHB+JPT), Embedded Image Two selection time events are indicated: the time at which selection begins, pastward in time, Embedded Image and the time at which it ends, Embedded Image Depending on the values of Embedded Image and Embedded Image selection acts in different populations; the populations that would be subject to selection if Embedded Image (148,000 years) are colored orange. Note that the exact implementation of the model when Embedded Image (51,000 years) is modified slightly; see File S1, Table S4. In the frequency trace, Embedded Image Embedded Image and Embedded Image in this simulation, the frequency of the allele at Embedded Image is Embedded Image at Embedded Image it is Embedded Image and at present, after 300 generations of random drift, Embedded Image

View this table:
  • View inline
  • View popup
Table A1 Selection scan studies used to define the selection candidate signal set

Footnotes

  • Communicating editor: R. Nielsen

  • Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.185900/-/DC1.

  • Received December 10, 2015.
  • Accepted April 5, 2016.
  • Copyright © 2016 by the Genetics Society of America

Literature Cited

  1. ↵
    1. Abreu A. P.,
    2. Dauber A.,
    3. Macedo D. B.,
    4. Noel S. D.,
    5. Brito V. N.,
    6. et al.
    , 2013 Central precocious puberty caused by mutations in the imprinted gene MKRN3. N. Engl. J. Med. 368: 2467–2475.
    OpenUrlCrossRefPubMedWeb of Science
  2. ↵
    1. Akey J. M.
    , 2009 Constructing genomic maps of positive selection in humans: Where do we go from here? Genome Res. 19: 711–722.
    OpenUrlAbstract/FREE Full Text
  3. ↵
    Alachiotis, N., P. Pavlidis, and A. Stamatakis, 2012a Exploiting multi-grain parallelism for efficient selective sweep detection, pp. 56–68 in Algorithms and Architectures for Parallel Processing. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
  4. ↵
    1. Alachiotis N.,
    2. Stamatakis A.,
    3. Pavlidis P.
    , 2012b OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets. Bioinformatics 28: 2274–2275.
    OpenUrlAbstract/FREE Full Text
  5. ↵
    1. Andrés A. M.,
    2. Hubisz M. J.,
    3. Indap A.,
    4. Torgerson D. G.,
    5. Degenhardt J. D.,
    6. et al.
    , 2009 Targets of balancing selection in the human genome. Mol. Biol. Evol. 26: 2755–2764.
    OpenUrlAbstract/FREE Full Text
  6. ↵
    1. Barreiro L. B.,
    2. Ben-Ali M.,
    3. Quach H.,
    4. Laval G.,
    5. Patin E.,
    6. et al.
    , 2009 Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLoS Genet. 5: e1000562.
    OpenUrlCrossRefPubMed
  7. ↵
    1. Barrett R. D.,
    2. Hoekstra H. E.
    , 2011 Molecular spandrels: tests of adaptation at the genetic level. Nat. Rev. Genet. 12: 767–780.
    OpenUrlCrossRefPubMed
  8. ↵
    1. Bustamante C. D.,
    2. Fledel-Alon A.,
    3. Williamson S.,
    4. Nielsen R.,
    5. Hubisz M. T.,
    6. et al.
    , 2005 Natural selection on protein-coding genes in the human genome. Nature 437: 1153–1157.
    OpenUrlCrossRefPubMedWeb of Science
  9. ↵
    1. Carlson C. S.,
    2. Thomas D. J.,
    3. Eberle M. A.,
    4. Swanson J. E.,
    5. Livingston R. J.,
    6. et al.
    , 2005 Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res. 15: 1553–1565.
    OpenUrlAbstract/FREE Full Text
  10. ↵
    1. Catalán A.,
    2. Hutter S.,
    3. Parsch J.
    , 2012 Population and sex differences in Drosophila melanogaster brain gene expression. BMC Genomics 13: 654.
    OpenUrlCrossRefPubMed
  11. ↵
    1. Chen H.,
    2. Patterson N.,
    3. Reich D.
    , 2010 Population differentiation as a test for selective sweeps. Genome Res. 20: 393–402.
    OpenUrlAbstract/FREE Full Text
  12. ↵
    1. Clemente F. J.,
    2. Cardona A.,
    3. Inchley C. E.,
    4. Peter B. M.,
    5. Jacobs G.,
    6. et al.
    , 2014 A selective sweep on a deleterious mutation in CPT1A in arctic populations. Am. J. Hum. Genet. 95: 584–589.
    OpenUrlCrossRefPubMed
  13. ↵
    1. Colonna V.,
    2. Ayub Q.,
    3. Chen Y.,
    4. Pagani L.,
    5. Luisi P.,
    6. et al.
    , 2014 Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences. Genome Biol. 15: R88.
    OpenUrlCrossRefPubMed
  14. ↵
    1. Crisci J. L.,
    2. Poh Y.-P.,
    3. Mahajan S.,
    4. Jensen J. D.
    , 2013 The impact of equilibrium assumptions on tests of selection. Front. Genet. 4: 235.
    OpenUrlPubMed
  15. ↵
    1. Daly M. J.,
    2. Rioux J. D.,
    3. Schaffner S. F.,
    4. Hudson T. J.,
    5. Lander E. S.
    , 2001 High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229–232.
    OpenUrlCrossRefPubMedWeb of Science
  16. ↵
    1. Enard D.,
    2. Messer P. W.,
    3. Petrov D. A.
    , 2014 Genome-wide signals of positive selection in human evolution. Genome Res. 24: 885–895.
    OpenUrlAbstract/FREE Full Text
  17. ↵
    1. Ewing G.,
    2. Hermisson J.
    , 2010 MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26: 2064–2065.
    OpenUrlAbstract/FREE Full Text
  18. ↵
    1. Ewing G. B.,
    2. Reiff P. A.,
    3. Jensen J. D.
    , 2015 PopPlanner: visually constructing demographic models for simulation. Front. Genet. 6: 150.
    OpenUrlPubMed
  19. ↵
    1. Ferrer-Admetlla A.,
    2. Liang M.,
    3. Korneliussen T.,
    4. Nielsen R.
    , 2014 On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol. 31: 1275–1291.
    OpenUrlAbstract/FREE Full Text
  20. ↵
    1. Florio M.,
    2. Albert M.,
    3. Taverna E.,
    4. Namba T.,
    5. Brandl H.,
    6. et al.
    , 2015 Human-specific gene ARHGAP11B promotes basal progenitor amplification and neocortex expansion. Science 347: 1465–1470.
    OpenUrlAbstract/FREE Full Text
  21. ↵
    1. Frazer K. A.,
    2. Ballinger D. G.,
    3. Cox D. R.,
    4. Hinds D. A.,
    5. Stuve L. L.,
    6. et al.
    , 2007 A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
    OpenUrlCrossRefPubMedWeb of Science
  22. ↵
    1. Fu Y.-X.
    , 2006 Exact coalescent for the Wright–Fisher model. Theor. Popul. Biol. 69: 385–394.
    OpenUrlCrossRefPubMedWeb of Science
  23. ↵
    1. Garud N. R.,
    2. Messer P. W.,
    3. Buzbas E. O.,
    4. Petrov D. A.
    , 2015 Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 11: e1005004.
    OpenUrlCrossRefPubMed
  24. ↵
    1. Gravel S.,
    2. Henn B. M.,
    3. Gutenkunst R. N.,
    4. Indap A. R.,
    5. Marth G. T.,
    6. et al.
    , 2011 Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108: 11983–11988.
    OpenUrlAbstract/FREE Full Text
  25. ↵
    1. Gutenkunst R. N.,
    2. Hernandez R. D.,
    3. Williamson S. H.,
    4. Bustamante C. D.
    , 2009 Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5: e1000695.
    OpenUrlCrossRefPubMed
  26. ↵
    1. Harris K.,
    2. Nielsen R.
    , 2013 Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9: e1003521.
    OpenUrlCrossRefPubMed
  27. ↵
    1. Hermisson J.,
    2. Pennings P. S.
    , 2005 Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169: 2335–2352.
    OpenUrlAbstract/FREE Full Text
  28. ↵
    Hernandez, R. D., J. L. Kelley, E. Elyashiv, S. C. Melton, A. Auton et al., 2011 Classic selective sweeps were rare in recent human evolution. Science 331: 920–924.
  29. ↵
    1. Hill W.,
    2. Robertson A.
    , 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.
    OpenUrlCrossRefPubMed
  30. ↵
    1. Hodgkinson A.,
    2. Eyre-Walker A.
    , 2011 Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12: 756–766.
    OpenUrlCrossRefPubMed
  31. ↵
    1. Hudson R. R.,
    2. Kaplan N. L.
    , 1988 The coalescent process in models with selection and recombination. Genetics 120: 831–840.
    OpenUrlAbstract/FREE Full Text
  32. ↵
    1. Itan Y.,
    2. Powell A.,
    3. Beaumont M. A.,
    4. Burger J.,
    5. Thomas M. G.,
    6. et al.
    , 2009 The origins of lactase persistence in Europe. PLoS Comput. Biol. 5: e1000491.
    OpenUrlCrossRefPubMed
  33. ↵
    1. Jensen J. D.,
    2. Thornton K. R.,
    3. Bustamante C. D.,
    4. Aquadro C. F.
    , 2007 On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations. Genetics 176: 2371–2379.
    OpenUrlAbstract/FREE Full Text
  34. ↵
    Jones, E., T. Oliphant, and P. Peterson, 2001 SciPy: open source scientific tools for Python. Available at: http://www.scipy.org. Accessed August 19, 2015.
  35. ↵
    1. Kelley J. L.,
    2. Madeoy J.,
    3. Calhoun J. C.,
    4. Swanson W.,
    5. Akey J. M.
    , 2006 Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res. 16: 980–989.
    OpenUrlAbstract/FREE Full Text
  36. ↵
    1. Kelly J. K.
    , 1997 A test of neutrality based on interlocus associations. Genetics 146: 1197–1206.
    OpenUrlAbstract/FREE Full Text
  37. ↵
    1. Kim Y.,
    2. Nielsen R.
    , 2004 Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 1513–1524.
    OpenUrlAbstract/FREE Full Text
  38. ↵
    1. Kim Y.,
    2. Wiehe T.
    , 2009 Simulation of DNA sequence evolution under models of recent directional selection. Brief. Bioinform. 10: 84–96.
    OpenUrlAbstract/FREE Full Text
  39. ↵
    1. Kingman J. F.
    , 1982 On the genealogy of large populations. J. Appl. Probab. 19: 27–43.
    OpenUrlCrossRef
  40. ↵
    1. Kong A.,
    2. Thorleifsson G.,
    3. Gudbjartsson D. F.,
    4. Masson G.,
    5. Sigurdsson A.,
    6. et al.
    , 2010 Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467: 1099–1103.
    OpenUrlCrossRefPubMedWeb of Science
  41. ↵
    1. Lee T.,
    2. Cho S.,
    3. Seo K. S.,
    4. Chang J.,
    5. Kim H.,
    6. et al.
    , 2013 Genetic variants and signatures of selective sweep of Hanwoo population (Korean native cattle). BMB Rep. 46: 346–351.
    OpenUrlCrossRefPubMed
  42. ↵
    1. Mallick S.,
    2. Gnerre S.,
    3. Muller P.,
    4. Reich D.
    , 2009 The difficulty of avoiding false positives in genome scans for natural selection. Genome Res. 19: 922–933.
    OpenUrlAbstract/FREE Full Text
  43. ↵
    1. Mathieson I.,
    2. Lazaridis I.,
    3. Rohland N.,
    4. Mallick S.,
    5. Patterson N.,
    6. et al.
    , 2015 Genome-wide patterns of selection in 230 ancient Eurasians. Nature 528: 499–503.
    OpenUrlCrossRefPubMed
  44. ↵
    1. McClish D. K.
    , 1989 Analyzing a portion of the ROC curve. Med. Decis. Making 9: 190–195.
    OpenUrlAbstract/FREE Full Text
  45. ↵
    1. McVean G.
    , 2007 The structure of linkage disequilibrium around a selective sweep. Genetics 175: 1395–1406.
    OpenUrlAbstract/FREE Full Text
  46. ↵
    1. McVean G. A.,
    2. Myers S. R.,
    3. Hunt S.,
    4. Deloukas P.,
    5. Bentley D. R.,
    6. et al.
    , 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
    OpenUrlAbstract/FREE Full Text
  47. ↵
    1. Messer P. W.,
    2. Petrov D. A.
    , 2013 Population genomics of rapid adaptation by soft selective sweeps. Trends Ecol. Evol. 28: 659–669.
    OpenUrlCrossRefPubMedWeb of Science
  48. ↵
    1. Metz C. E.
    , 1978 Basic principles of ROC analysis. Semin. Nucl. Med. 8: 283–298.
    OpenUrlCrossRefPubMedWeb of Science
  49. ↵
    1. Mezard C.
    , 2006 Meiotic recombination hotspots in plants. Biochem. Soc. Trans. 34: 531–534.
    OpenUrlAbstract/FREE Full Text
  50. ↵
    1. Nei M.,
    2. Suzuki Y.,
    3. Nozawa M.
    , 2010 The neutral theory of molecular evolution in the genomic era. Annu. Rev. Genomics Hum. Genet. 11: 265–289.
    OpenUrlCrossRefPubMedWeb of Science
  51. ↵
    1. Nielsen R.,
    2. Williamson S.,
    3. Kim Y.,
    4. Hubisz M. J.,
    5. Clark A. G.,
    6. et al.
    , 2005 Genomic scans for selective sweeps using SNP data. Genome Res. 15: 1566–1575.
    OpenUrlAbstract/FREE Full Text
  52. ↵
    1. Oleksyk T. K.,
    2. Zhao K.,
    3. Francisco M.,
    4. Gilbert D. A.,
    5. O’Brien S. J.,
    6. et al.
    , 2008 Identifying selected regions from heterozygosity and divergence using a light-coverage genomic dataset from two human populations. PLoS One 3: e1712.
    OpenUrlCrossRefPubMed
  53. 1000 Genomes Project Consortium, G. R. Abecasis, A. Auton, L. D. Brooks, M. A. DePristo et al., 2012 An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
    OpenUrlCrossRefPubMedWeb of Science
  54. O’Reilly, P. F., E. Birney, and D. J. Balding, 2008 Confounding between recombination and selection, and the ped/pop method for detecting selection. Genome Res. 18: 1304–1313.
  55. ↵
    1. Pagani L.,
    2. Lawson D. J.,
    3. Jagoda E.,
    4. Mörseburg A.,
    5. Eriksson A.,
    6. et al.
    , 2016 Genomic analyses inform on migration events during the peopling of Eurasia. Nature (in press).
  56. ↵
    1. Paigen K.,
    2. Petkov P.
    , 2010 Mammalian recombination hot spots: properties, control and evolution. Nat. Rev. Genet. 11: 221–233.
    OpenUrlCrossRefPubMedWeb of Science
  57. ↵
    1. Pavlidis P.,
    2. Jensen J. D.,
    3. Stephan W.
    , 2010 Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations. Genetics 185: 907–922.
    OpenUrlAbstract/FREE Full Text
  58. ↵
    1. Pennings P. S.,
    2. Hermisson J.
    , 2006 Soft sweeps III: the signature of positive selection from recurrent mutation. PLoS Genet. 2: e186.
    OpenUrlCrossRefPubMed
  59. ↵
    1. Perry J. R.,
    2. Day F.,
    3. Elks C. E.,
    4. Sulem P.,
    5. Thompson D. J.,
    6. et al.
    , 2014 Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature 514: 92–97.
    OpenUrlCrossRefPubMedWeb of Science
  60. ↵
    1. Peter B. M.,
    2. Huerta-Sanchez E.,
    3. Nielsen R.
    , 2012 Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet. 8: e1003011.
    OpenUrlCrossRefPubMed
  61. ↵
    1. Petes T. D.
    , 2001 Meiotic recombination hot spots and cold spots. Nat. Rev. Genet. 2: 360–369.
    OpenUrlCrossRefPubMedWeb of Science
  62. ↵
    1. Pfaffelhuber P.,
    2. Lehnert A.,
    3. Stephan W.
    , 2008 Linkage disequilibrium under genetic hitchhiking in finite populations. Genetics 179: 527–537.
    OpenUrlAbstract/FREE Full Text
  63. ↵
    1. Phillips P. C.
    , 2008 Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9: 855–867.
    OpenUrlCrossRefPubMedWeb of Science
  64. ↵
    1. Pickrell J. K.,
    2. Coop G.,
    3. Novembre J.,
    4. Kudaravalli S.,
    5. Li J. Z.,
    6. et al.
    , 2009 Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19: 826–837.
    OpenUrlAbstract/FREE Full Text
    1. Renzette N.,
    2. Kowalik T. F.,
    3. Jensen J. D.
    , 2016 On the relative roles of background selection and genetic hitchhiking in shaping human cytomegalovirus genetic diversity. Mol. Ecol. 25: 403–413.
  65. ↵
    1. Ronen R.,
    2. Udpa N.,
    3. Halperin E.,
    4. Bafna V.
    , 2013 Learning natural selection from the site frequency spectrum. Genetics 195: 181–193.
    OpenUrlAbstract/FREE Full Text
  66. ↵
    1. Rozas J.,
    2. Gullaud M.,
    3. Blandin G.,
    4. Aguadé M.
    , 2001 DNA variation at the rp49 gene region of Drosophila simulans: evolutionary inferences from an unusual haplotype structure. Genetics 158: 1147–1155.
    OpenUrlAbstract/FREE Full Text
  67. ↵
    1. Sabeti P. C.,
    2. Reich D. E.,
    3. Higgins J. M.,
    4. Levine H. Z.,
    5. Richter D. J.,
    6. et al.
    , 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
    OpenUrlCrossRefPubMedWeb of Science
  68. ↵
    1. Sabeti P. C.,
    2. Varilly P.,
    3. Fry B.,
    4. Lohmueller J.,
    5. Hostetter E.,
    6. et al.
    , 2007 Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918.
    OpenUrlCrossRefPubMedWeb of Science
  69. ↵
    1. Schaffner S. F.,
    2. Foo C.,
    3. Gabriel S.,
    4. Reich D.,
    5. Daly M. J.,
    6. et al.
    , 2005 Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15: 1576–1583.
    OpenUrlAbstract/FREE Full Text
  70. ↵
    1. Schlamp F.,
    2. Made J.,
    3. Stambler R.,
    4. Chesebrough L.,
    5. Boyko A. R.,
    6. et al.
    , 2016 Evaluating the performance of selection scans to detect selective sweeps in domestic dogs. Mol. Ecol. 25: 342–356.
    OpenUrlCrossRef
  71. ↵
    1. Shlyakhter I.,
    2. Sabeti P. C.,
    3. Schaffner S. F.
    , 2014 Cosi2: an efficient simulator of exact and approximate coalescent with selection. Bioinformatics 30: 3427–3429.
    OpenUrlAbstract/FREE Full Text
  72. ↵
    1. Smith J. M.,
    2. Haigh J.
    , 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23–35.
    OpenUrlCrossRefPubMedWeb of Science
  73. ↵
    1. Stephan W.,
    2. Song Y. S.,
    3. Langley C. H.
    , 2006 The hitchhiking effect on linkage disequilibrium between linked neutral loci. Genetics 172: 2647–2663.
    OpenUrlAbstract/FREE Full Text
  74. ↵
    1. Thornton K.
    , 2005 Recombination and the properties of Tajima’s D in the context of approximate-likelihood calculation. Genetics 171: 2143–2148.
    OpenUrlAbstract/FREE Full Text
  75. ↵
    1. Thornton K. R.,
    2. Jensen J. D.
    , 2007 Controlling the false-positive rate in multilocus genome scans for selection. Genetics 175: 737–750.
    OpenUrlAbstract/FREE Full Text
  76. ↵
    Wakeley, J., 2010 Natural selection and coalescent theory, pp. 119–149 in Evolution Since Darwin: The First 150 Years, edited by M. A. Bell, D. J. Futuyma, W. F. Eanes, and J. S. Levinton. Sinauer Associates, Sunderland, MA.
  77. ↵
    1. Wang J.,
    2. Fan H. C.,
    3. Behr B.,
    4. Quake S. R.
    , 2012 Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150: 402–412.
    OpenUrlCrossRefPubMedWeb of Science
View Abstract
Previous ArticleNext Article
Back to top

PUBLICATION INFORMATION

Volume 203 Issue 4, August 2016

Genetics: 203 (4)

ARTICLE CLASSIFICATION

INVESTIGATIONS
Population and evolutionary genetics
View this article with LENS
Email

Thank you for sharing this Genetics article.

NOTE: We request your email address only to inform the recipient that it was you who recommended this article, and that it is not junk mail. We do not retain these email addresses.

Enter multiple addresses on separate lines or separate them with commas.
Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps
(Your Name) has forwarded a page to you from Genetics
(Your Name) thought you would be interested in this article in Genetics.
Print
Alerts
Enter your email below to set up alert notifications for new article, or to manage your existing alerts.
SIGN UP OR SIGN IN WITH YOUR EMAIL
View PDF
Share

Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps

View ORCID ProfileGuy S. Jacobs, View ORCID ProfileTimothy J. Sluckin and View ORCID ProfileToomas Kivisild
Genetics August 1, 2016 vol. 203 no. 4 1807-1825; https://doi.org/10.1534/genetics.115.185900
Guy S. Jacobs
Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United KingdomComplexity Institute, Nanyang Technological University, Singapore 637723
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Guy S. Jacobs
  • For correspondence: guysjacobs@ntu.edu.sg
Timothy J. Sluckin
Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Timothy J. Sluckin
Toomas Kivisild
Department of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Toomas Kivisild
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
Citation

Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps

View ORCID ProfileGuy S. Jacobs, View ORCID ProfileTimothy J. Sluckin and View ORCID ProfileToomas Kivisild
Genetics August 1, 2016 vol. 203 no. 4 1807-1825; https://doi.org/10.1534/genetics.115.185900
Guy S. Jacobs
Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United KingdomComplexity Institute, Nanyang Technological University, Singapore 637723
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Guy S. Jacobs
  • For correspondence: guysjacobs@ntu.edu.sg
Timothy J. Sluckin
Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Timothy J. Sluckin
Toomas Kivisild
Department of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Toomas Kivisild

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero

Related Articles

Cited By

More in this TOC Section

Investigations

  • Cell Specificity of Human Regulatory Annotations and Their Genetic Effects on Gene Expression
  • Dynein Light Chain DLC-1 Facilitates the Function of the Germline Cell Fate Regulator GLD-1 in Caenorhabditis elegans
  • The Caenorhabditis elegans SMOC-1 Protein Acts Cell Nonautonomously To Promote Bone Morphogenetic Protein Signaling
Show more Investigations

Population and Evolutionary Genetics

  • Cell Specificity of Human Regulatory Annotations and Their Genetic Effects on Gene Expression
  • Dynein Light Chain DLC-1 Facilitates the Function of the Germline Cell Fate Regulator GLD-1 in Caenorhabditis elegans
  • The Caenorhabditis elegans SMOC-1 Protein Acts Cell Nonautonomously To Promote Bone Morphogenetic Protein Signaling
Show more Population and Evolutionary Genetics
  • Top
  • Article
    • Abstract
    • Methods
    • Results
    • Discussion
    • Acknowledgments
    • Appendix: Extended Methods
    • Footnotes
    • Literature Cited
  • Figures & Data
  • Info & Metrics

GSA

The Genetics Society of America (GSA), founded in 1931, is the professional membership organization for scientific researchers and educators in the field of genetics. Our members work to advance knowledge in the basic mechanisms of inheritance, from the molecular to the population level.

Online ISSN: 1943-2631

  • For Authors
  • For Reviewers
  • For Subscribers
  • Submit a Manuscript
  • Editorial Board
  • Press Releases

SPPA Logo

GET CONNECTED

RSS  Subscribe with RSS.

email  Subscribe via email. Sign up to receive alert notifications of new articles.

  • Facebook
  • Twitter
  • YouTube
  • LinkedIn
  • Google Plus

Copyright © 2019 by the Genetics Society of America

  • About GENETICS
  • Terms of use
  • Advertising
  • Permissions
  • Contact us
  • International access