We have evaluated the extent to which SNPs identified by genomewide surveys as showing unusually high levels of population differentiation in humans have experienced recent positive selection, starting from a set of 32 nonsynonymous SNPs in 27 genes highlighted by the HapMap1 project. These SNPs were genotyped again in the HapMap samples and in the Human Genome Diversity Project–Centre d'Etude du Polymorphisme Humain (HGDP–CEPH) panel of 52 populations representing worldwide diversity; extended haplotype homozygosity was investigated around all of them, and full resequence data were examined for 9 genes (5 from public sources and 4 from new data sets). For 7 of the genes, genotyping errors were responsible for an artifactual signal of high population differentiation and for 2, the population differentiation did not exceed our significance threshold. For the 18 genes with confirmed high population differentiation, 3 showed evidence of positive selection as measured by unusually extended haplotypes within a population, and 7 more did in between-population analyses. The 9 genes with resequence data included 7 with high population differentiation, and 5 showed evidence of positive selection on the haplotype carrying the nonsynonymous SNP from skewed allele frequency spectra; in addition, 2 showed evidence of positive selection on unrelated haplotypes. Thus, in humans, high population differentiation is (apart from technical artifacts) an effective way of enriching for recently selected genes, but is not an infallible pointer to recent positive selection supported by other lines of evidence.
IN the last 50,000–100,000 years (KY), humans have expanded from being a rare species confined to parts of Africa and the Levant to their current numbers of >6 billion with a worldwide distribution (Jobling et al. 2004). Paleontological and archaeological evidence suggests that key aspects of modern human behavior developed ∼100–50 KYA in Africa (Henshilwood et al. 2002) and behaviorally modern humans then expanded out of Africa ∼60–40 KYA (Mellars 2006). The physical and biological environments encountered outside Africa would have been very different from those inside and included climatic deterioration reaching a glacial maximum ∼20 KYA and subsequent amelioration that permitted the development of agricultural and pastoral lifestyles in multiple independent centers after ∼10 KYA. Neolithic lifestyles would have led to further changes including higher population densities, close contact with animals, and novel foods, in turn leading to new diseases (Jobling et al. 2004). It is likely that genetic adaptations accompanied many of these events.
Adaptation, or positive natural selection, leaves an imprint on the pattern of genetic variation found in a population near the site of selection. This pattern can be identified by comparing the DNA variants in multiple individuals from the same and different populations and searching for signals such as unusually extended haplotypes (extended haplotype homozygosity, EHH) (Voight et al. 2006; Sabeti et al. 2007; Tang et al. 2007), high levels of population differentiation (International Hapmap Consortium 2005; Barreiro et al. 2008; Myles et al. 2008), or skewed allele frequency spectra (Carlson et al. 2005). These signals become detectable at different times after the start of selection and are all transient, being gradually eroded by both molecular processes such as mutation, recombination, or further selection and population processes such as migration or demographic fluctuations, with the survival order extended haplotypes < population differentiation < allele frequency spectra (Sabeti et al. 2006). The absolute timescales of survival are not well understood, but extended haplotype tests typically detect selection within the last 10 KY (Sabeti et al. 2006) while unusual allele frequency spectra may detect much older selection. For example, it has been suggested that the signal associated with the FOXP2 gene (Enard et al. 2002) may predate the modern human–Neanderthal split ∼300–400 KYA (Krause et al. 2007), although such an interpretation has been questioned (Coop et al. 2008). However, despite significant uncertainties and limitations, population-genetic analyses are well placed to provide insights into many of the important events within the timescale of recent human evolution.
In principle, it should be possible to survey the genome for sites of selection and then interpret this catalog in the light of archaeological, climatic, and other records. Progress toward such a goal has, however, been limited: many factors can confound the detection of selection and only genotype data from previously ascertained SNPs, rather than full resequence data, have thus far been available throughout the whole genome. In practice, the strategy used has therefore been to search the genome for signals that can be detected in available genotype data, such as extended haplotypes or population differentiation, and evaluate the significance of the regions identified by comparing them with empirical distributions of the same statistic, models that incorporate information about the demography, or biological expectations (McVean and Spencer 2006). However, it remains unclear how effective this strategy is: What false positive and false negative rates are associated with its applications? Further evaluation is desirable.
The International HapMap Project has carried out the highest-resolution study so far of genetic variation in a set of human populations. In an article published in 2005, genotypes of >1 million SNPs were reported from 270 individuals with ancestry from Africa (Yoruba in Ibadan, Nigeria: YRI), Europe (Utah residents with ancestry from northern and western Europe: CEU), China (Han Chinese in Beijing, China: CHB), and Japan (Japanese in Tokyo, Japan: JPT) (International HapMap Consortium 2005). This article highlighted 32 SNPs from 27 genes that showed particular evolutionary interest because of a combination of two factors: they were nonsynonymous, that is, they changed an amino acid within a protein-coding gene and thus were likely to alter biological function, and they also exhibited a high level of population differentiation equal to or exceeding that of rs2814778, a SNP that is associated with strong biological evidence for population-specific selection. This SNP underlies the FY*0 (Duffy blood group negative) phenotype; FY*0 homozygotes do not express the Duffy blood group antigen on red blood cells and are consequently highly resistant to infection by the malarial parasite, Plasmodium vivax. The *0 allele is nearly fixed in Africa and rare outside, and it is widely believed that this is due to selection for resistance to vivax malaria.
However, a number of studies have emphasized that large differences in allele frequency between populations can arise without positive selection: for example, a highly differentiated SNP in the Neuregulin I gene was not accompanied by unusual patterns in adjacent SNPs (Gardner et al. 2007), and large frequency differences can be quite common in empirical data sets, particularly in comparisons between Africa or America and the rest of the world, where population bottlenecks and “allele surfing” may have occurred during the exit from and entrance to these continents, respectively (Hofer et al. 2009). We wished to measure the extent to which the high population differentiation observed at the 27 HapMap genes might have resulted from positive selection and the extent to which it reflected other origins such as demographic factors, chance, or errors. We therefore retyped the same SNPs in the HapMap samples and in a large additional set of human populations and applied alternative tests for selection, either based on long-range haplotypes or based on full resequence data. For the latter, sequence data for 5 of the genes were available from public sources, and four new data sets were generated for this project. We found that, while genotyping errors led to some artifactual high differentiation signals, population differentiation was a useful but by no means infallible guide to recent selection detected by other methods.
MATERIALS AND METHODS
DNA samples and genotyping:
HapMap samples (International HapMap Consortium 2005) and extended HapMap samples were purchased from the Coriell Institute for Medical Research (Camden, NJ), the Brahui (BRU, Pakistan) samples were from the collection of S. Q. Mehdi, the Human Genome Diversity Project–Centre d'Etude du Polymorphisme Humain (HGDP–CEPH) collection (Cann et al. 2002) was kindly provided by Howard Cann (CEPH, Paris, France), and one chimpanzee (Pan troglodytes) sample was purchased from the European Collection of Cell Cultures (Salisbury, Wiltshire, UK). The HGDP–CEPH samples were whole-genome amplified before use (GenomiPhi HY DNA amplification kit; Amersham Biosciences, Piscataway, NJ).
The 32 SNPs were assembled for SNaPshot genotyping into eight small multiplexes (multiplexes 1–8) containing 4 SNPs each. Primers to amplify a different-sized fragment for each SNP within a multiplex were designed using Primer3 (http://frodo.wi.mit.edu/) and extension primers again differing in length within a multiplex were picked from the sequence immediately up- or downstream of the SNP (supporting information, Table S1). Primer interactions within the multiplex were evaluated and minimized using the AutoDimer program (http://www.cstl.nist.gov/div831/strbase/AutoDimerHomepage/AutoDimerProgramHomepage.htm). PCRs contained 10–50 ng DNA, 1× Invitrogen (Carlsbad, CA) Platinum Taq buffer, 4 mm MgCl2, 200 μm of each dNTP, 0.04 μm of each primer, and 1 unit of Invitrogen Platinum Taq DNA polymerase in a 10-μl reaction volume. A touchdown PCR program was used: denaturation at 94° for 15 min and then 20 cycles of 94° for 30 sec, annealing at 70° for 30 sec, and extension at 72° for 45 sec, decreasing the annealing temperature by 1° per cycle. This was followed by 15 cycles of denaturation at 94° for 30 sec, annealing at 50° for 30 sec, and extension at 72° for 45 sec and a final extension at 72° for 7 min. The PCR products were purified by treatment with Exonuclease I (USB Corporation, Cleveland; 1.5 units) and Shrimp Alkaline Phosphatase (USB Corporation, 2.0 units) at 37° for 1 hr followed by 85° for 15 min. The extension reaction contained 1× ABI Prism SNaPshot Multiplex ready reaction mix (Applied Biosystems, Foster City, CA), 0.5 μm of each primer, and 1 μl of each PCR product and was carried out as recommended (Applied Biosystems). The extension PCR products were purified using 1 unit Shrimp Alkaline Phosphatase and then run on an ABI 3100 Genetic Analyzer. SNP calling was carried out using GeneMapper software v. 3.0 (Applied Biosystems). Since peak heights in heterozygotes are not always equal, we needed to choose a threshold to distinguish between heterozygotes and homozygotes. For this, we calculated (allele 1 peak height – allele 2 peak height)/(allele 1 peak height + allele 2 peak height) using an in-house Perl script to give a value between +1 and −1 for each sample. We then inspected the distribution of values from each SNP for all samples and set an appropriate threshold.
For quality control, we included 13 pairs of blind duplicates and observed seven discrepancies in 294 comparisons (1.2% error), mostly reflecting ambiguities in calling heterozygotes in samples that lay near the threshold (Table S2). Hardy–Weinberg equilibrium was also calculated, and no significant departures were observed within individual populations. Comparisons with HapMap data were not used for genotyping quality control (although they were used for sequencing quality control) and are discussed in the results section.
Two overlapping ∼6-kb fragments covering ∼12 kb spanning each of the high-differentiation SNPs in the F2, HERC1, ZNF646, and GNB1L genes (primers in Table S3) were amplified by long PCR with Platinum High Fidelity Taq polymerase (Invitrogen). The PCR contained 1× Platinum High Fidelity Taq polymerase reaction buffer, 2 mm MgSO4, 200 μm of each dNTP, 1 unit Taq polymerase, and 10 μm of each primer in a 25-μl volume. The reaction was carried out by touchdown PCR starting with denaturation at 94° for 15 min and then 94° for 30 sec, annealing at 71° for 30 sec, and extension at 68° 10 min for 20 cycles, with the annealing temperature decreasing by 0.5° per cycle. This was followed by 94° for 30 sec, 63° for 30 sec, and extension at 68° for 10 min for another 15 cycles and a final extension at 68° for 10 min.
The long PCR products were then used as templates for a second round of nested PCR to produce fragments for sequencing. Sequenced fragments were designed to be ∼500 bp ± 15% long and overlap by 240 bp ± 30%, so that >95% of the regions were covered ≥4× on both strands (Table S3). The nested PCR contained a 200-fold dilution of the long PCR products, 1× Invitrogen Platinum Taq PCR reaction buffer, 1.6 mm MgCl2, 200 μm of each dNTP, 0.5 unit Taq polymerase, and 10 μm of each primer in a 15-μl reaction volume. Reactions were set up by a Beckman Coulter Biomek FX robot. The PCRs were carried out for 30 cycles of 94° for 30 sec, 60° for 30 sec, and extension at 68° for 45 sec, followed by a final extension at 68° for 3 min. Standard capillary sequencing reactions were performed by the Sanger large-scale sequencing service.
Potential SNPs were flagged by Mutation Surveyor v. 2.0 software (SoftGenetics) and then all were checked manually. SNP calling quality was assessed by including four duplicate samples in the resequencing experiment and comparing the SNP calls with HapMap calls. No discrepancies were found between our duplicates; comparisons with HapMap data showed 0 of 816 (HERC1), 6 of 635 (ZNF646), 17 of 1492 (F2), and 34 of 1088 (GNB1L) discrepancies. Reexamination of the discrepant positions suggested that the ratios of our miscalls:HapMap miscalls were 2:4, 4:13, and 18:16. The higher error rate in our data for GNB1L reflected lower sequence quality from this gene. Significant proportions of the HapMap errors might be accounted for by two immediately adjacent SNPs that could affect genotyping but not resequencing (F2) and a misassigned individual contributing 11 of the 16 miscalls (GNB1L). Overall, after distinguishing between likely sequencing and HapMap errors, the mean reliability of our variant position calls was estimated at 99.4%. Data are reported in Table S4, Table S5, Table S6, and Table S7.
The frequency of each SNP allele in each sample was determined by counting. FST was used to measure population differentiation and was calculated using the Hierfstat R package (Goudet 2005). To evaluate the significance of the observed FST values, we compared them with the empirical distribution of genomewide SNPs in the HapMap and HGDP (J. Z. Li et al. 2008) panels. Genomewide SNPs were divided into 20 mean frequency classes (0–5%, 5–10%, etc.) and the 95th and 99th percentiles were calculated for each class. The mean frequency in the test SNP in HapMap samples was compared to the 95th and 99th percentiles of its frequency class. Linkage disequilibrium was visualized using Haploview (Barrett et al. 2005). Haplotypes were reconstructed by PHASE 2.1 (Stephens and Donnelly 2003) from both our own resequence data and data obtained from the literature (Hamblin et al. 2002) or downloaded from the SeattleSNPs (http://pga.mbt.washington.edu/) and National Institute of Environmental Health Sciences (NIEHS) SNPs (http://egp.gs.washington.edu/) websites. Median-joining networks of inferred haplotypes from regions of high linkage disequilibrium surrounding the SNP of interest (“Network region size” in Table 3) were constructed using Network 4.50 (Bandelt et al. 1999).
Extended haplotypes from the three populations YRI, CHB + JPT, and CEU were examined in three ways. EHH (Sabeti et al. 2002) was analyzed in HapMap2 200-kb phased haplotypes (http://www.hapmap.org/) surrounding each SNP, using the program Sweep (http://www.broad.mit.edu/mpg/sweep/) after transforming build 36 genomic coordinates to the build 35 coordinates required by Sweep. Control regions to evaluate the 95th and 99th percentiles were the ENCODE neutral regions (Birney et al. 2007). For integrated Haplotype Score (iHS) (Voight et al. 2006) and cross population (XP)-EHH (Sabeti et al. 2007), genotypes for all unrelated HapMap samples were downloaded from the HapMap project website; haplotypes were inferred using BEAGLE (Browning and Browning 2009); and ancestral states from the chimpanzee, orangutan, and macaque sequence at the corresponding position were provided by dbSNP. For cases where the ancestral state could not be inferred the major allele in YRI was assumed to be ancestral. iHS and XP-EHH were calculated using custom C scripts provided by J. Pickrell (Pickrell et al. 2009). For iHS, SNPs with a minor allele frequency <0.05 were discarded. Both scores were standardized as described by Sabeti et al. (2007) to give a mean score of 0 and standard deviation of 1. For iHS this standardization was performed separately for scores within 20 equally spaced bins of derived allele frequency before combining the scores together from all bins, following the protocol of Voight et al. (2006).
Nei's diversity (π) and the summary statistics Tajima's D (Tajima 1989), Fu and Li's D and F (Fu and Li 1993), Fu's Fs (Fu 1997), and Fay and Wu's H (Fay and Wu 2000) were calculated using an in-house program written in C. The significance of each diversity value was assessed using a maximum-likelihood HKA test, where each gene was compared with the random ENCODE region ENr321 using a model with free mutation and no selection vs. free mutation and selection (Wright and Charlesworth 2004). The significance of the other test results was evaluated by comparison with values from coalescent simulations (Hudson 2002) carried out in two ways. First, we controlled for demographic effects using the best-fit demographic model for each population, using the cosi-package (Schaffner et al. 2005). Second, we also controlled for the ascertainment of the SNPs via high population differentiation by retaining only the subset of simulations from the best-fit demographic model that each yielded at least one SNP with an FST value as high as those in the relevant HapMap SNPs. Our data consisted of regions with high LD and so we set recombination to zero in these simulations. For non-HapMap samples, we chose the demographic model from the closest HapMap population; for the two genes sequenced in African-Americans, this meant excluding admixture. Third, we calculated empirical P-values from the SeattleSNPs data for Tajima's D and Fu's Fs.
To determine how many of the 32 high-differentiation nonsynonymous SNPs from 27 genes were likely to have resulted from positive selection and how many could be accounted for in other ways, we applied a series of filters to the SNPs, regenotyping them in the HapMap samples and the HGDP–CEPH panel, examining their extended haplotype patterns, and finally testing allele frequency spectra from full resequence data for a subset of the genes. A summary of these stages and their outcomes is shown in Figure 1.
As a first step, we retyped the SNPs in the DNA samples used by the HapMap project. Three SNPs failed in our assay, but 2 of these were from ALMS1, for which 4 additional SNPs were successfully typed, so most subsequent conclusions are based on 29 SNPs from 26 genes. In contrast to the HapMap findings, 4 of the SNPs were monomorphic in our assay, and 3 showed polymorphic, but significantly different, patterns. Several independent lines of evidence support our genotyping results (Table 1) and our conclusions were passed on to the HapMap Consortium with the result that most of the data were corrected in the HapMap2 release. Thus, after genotyping in the HapMap samples, 7 SNPs from 7 genes were found to have experienced genotyping artifacts, in five cases sufficient to neutralize evidence for positive selection, and 22–24 SNPs from 19–21 genes remained candidates for selection.
We next examined all 29 SNPs yielding high-quality genotypes in the HGDP–CEPH panel. This panel includes populations with similar geographic origins to the HapMap samples: Yoruba, French, Han Chinese, and Japanese, respectively, as well as populations from additional geographic regions. For all of the SNPs, the allele frequencies were similar between the (corrected) HapMap samples and their geographic equivalents. r2 values were as follows: YRI–Yoruba, 0.71; CHB–Han Chinese, 0.98; JPT–Japanese, 0.97; and CEU–French, 0.98. Worldwide frequencies for 8 genes illustrating a range of patterns are shown in Figure 2 and those for the remainder in Figure S1, Figure S2, and Figure S3. Population differentiation was evaluated using the statistic FST, and results are presented for the HapMap populations alone and the HGDP–CEPH populations considered by country (32 populations as in Figure 2; HGDP-32) or grouped according to the five genetic clusters identified by genomewide analyses (Rosenberg et al. 2002) (HGDP-5) (Table 2). These values were highly correlated (r2 = 0.75 HapMap vs. HGDP-32; r2 = 0.74 HapMap vs. HGDP-5). The values for 6 of the 7 SNPs identified above as representing artifacts were not unusually high. The seventh of these SNPs (rs8110904, CEACAM1) still showed high FST in both HapMap and HGDP data. The values for the remaining 22 SNPs were all relatively high: 17 lay outside the 99th percentile in all analyses and another 2 in the HapMap but not in the HGDP panel, a difference that might sometimes be expected because the SNPs were ascertained in the HapMap samples. Of the remaining 3 SNPs, 1 (rs5896 in F2) showed significant differentiation in the HGDP alone, and 2 (rs1702003 in THEA and rs2073770 in GNB1L) in none of the analyses. Thus analysis of a larger and more diverse set of populations confirms that 21/29 SNPs and 18/26 genes show unusually high levels of population differentiation (Figure 1).
We then investigated the long-range haplotype structure surrounding the derived allele of each of the 29 SNPs. The EDA2R SNP rs1385699 could not be evaluated by our approach because it lay on the X chromosome, while for 2 of the other SNPs (rs7162473 in HERC1 and rs364637 in FUT6), map positions were missing from the 2009 HapMap data set. Of the remaining 26 SNPs, 2 (rs3827760 from EDAR and rs1229984 from ADH1B) showed a significant iHS signal in the CHB + JPT sample and 12 (including the same EDAR and ADH1B SNPs and SNPs from the FY, Q8NGY8_human, ALMS1, SLC30A9, SLC24A5, ZNF646, and ABCC11 genes) showed a significant XP-EHH signal at the 5% level in one or more populations or comparisons (Figure 1, Table S8). Overall, 5.9% of the SNPs tested for iHS showed a signal within the top 5%, not significantly different from the proportion expected by chance and probably reflecting the low power of iHS to detect selection around alleles at high or low frequency, combined with the ascertainment for SNPs with extreme frequencies in individual populations. In contrast, there was striking evidence for an overrepresentation of XP-EHH signals in these SNPs: 11.6% of the SNPs are in the 2% most extreme results (1% of each tail) and 30.4% are in the top 10% of results. The signals detected by simple population differentiation are often still reflected in the surrounding haplotype structures.
We finally examined full resequence data from a subset of the genes. Five genes in the set (four with high population differentiation) had already been resequenced by others: FY by Hamblin and colleagues (Hamblin et al. 2002); EDAR, ADH1B, and CEACAM1 by the SeattleSNPs project; and ERCC6 by the NIEHS SNPs project, and we resequenced an additional three high-differentiation genes, F2, HERC1, and ZNF646, and a fourth, GNB1L, because of the unusual geographic distribution of its derived allele. Nucleotide diversity and five statistics that summarize different aspects of the allele frequency spectrum were calculated; their significance was evaluated by comparison with (1) a multilocus HKA test to investigate whether the number of SNPs was lower than expected (Wright and Charlesworth 2004), (2) the best-fit demographic model for each population to test all statistics (Schaffner et al. 2005), (3) this model incorporating a modification for the ascertainment of the SNPs, and, for some, (4) the empirical data generated by the SeattleSNPs project (Table 3). The relationships between the inferred haplotypes were visualized using median-joining networks (Figure 3). Recent positive selection in a particular population is expected to lead to a number of characteristics in that population: low diversity, negative values of the summary statistics, and high-frequency haplotypes showing as large circles or clusters of circles in the median-joining networks.
Results for the nine genes were as follows:
(1) FY provides a paradigm of positive selection in humans with the FY*O allele as described earlier, but also carries a nonsynonymous SNP at high frequency in East Asia (Figure S1) defining the FY*A allele. Although an HKA test provided support for reduced diversity of the FY*O allele in Africa (Hamblin et al. 2002), summary statistics did not provide evidence for positive selection in the Chinese sample or indeed in the Hausan (African) sample where the *O allele predominates. In the combined worldwide sample, all statistics except Tajima's D showed evidence for a departure from neutral evolution (Table 3), and a complex pattern of haplotypes containing several local clusters was observed in the network (Figure S4). XP-EHH showed a strong signal indicative of positive selection in the YRI (Figure 1; Table S8) and so appears to be detecting selection on the *O allele rather than the *A allele.
(2) EDAR shows a very high frequency of the derived allele in East Asia and the Americas and a low frequency elsewhere (Figure 2A); the haplotype carrying this allele has correspondingly high frequency (45/46 chromosomes, 98%) and low diversity (44/45 C-allele chromosomes share a single haplotype and the 45th differs by a single SNP; Figure 3A). This pattern leads to a diversity value of 0.74 × 10−4 in the Asian-American sample, an order of magnitude lower than the average for chromosome 2 (Sachidanandam et al. 2001) and the lowest value in Table 3. The HKA test did not show significantly reduced diversity for this or any other gene (results included in Table 3), perhaps because its power to detect incomplete sweeps is low. Summary statistics are almost all significantly skewed, also showing many of the lowest P-values in Table 3. The same haplotype was present at moderate frequency in the Hispanic-American sample, where Fay and Wu's H, but not the other statistics, was significant. Somewhat surprisingly, the European-American sample showed significantly negative values for Tajima's D, Fu and Li's D, Fay and Wu's H, and Fu's Fs (Table 3), indicating possible positive selection. Such selection could not be acting on rs3827760 because the derived allele of this SNP is present only in 2/48 chromosomes. The network pattern (Figure 3A) suggests that it could be acting on the adjacent cluster of haplotypes where a central haplotype making up 28/48 chromosomes is surrounded by four one-step neighbors together contributing 15/48 chromosomes and two two-step neighbors consisting of 1 chromosome each and thus together forming 94% of the sample. This cluster lies in a region of the network that carries no other nonsynonymous SNPs, so there is no obvious second target of selection. An iHS signal is seen within the combined CHB + JPT sample where the frequency is 87%, and XP-EHH signals are seen in the comparisons involving this sample as previously (Sabeti et al. 2007), but also in the CEU–YRI comparison, supporting the hypothesis of additional independent positive selection in Europeans.
(3) The derived A allele of rs1229984 in ADH1B is present at relatively high frequency in Asia, particularly East Asia (Figure 2B), and, in the resequenced samples, was found only in the Asian-Americans where it lay in three neighboring haplotypes forming a small cluster in the network (Figure 3B). The frequency of derived alleles in this population was higher than expected, as reflected in a significantly negative Fay and Wu's H value (Table 3), but diversity was not unusually low, and several other statistics such as Tajima's D were actually positive, so evidence for selection was unconvincing, consistent with the conclusions of an independent study of this SNP (H. Li et al. 2008), which suggested that, if selection were acting, it was more likely to be on a nearby regulatory region SNP than on rs1229984. iHS did show a moderately significant value in the combined CHB + JPT sample, as did XP-EHH in the comparison of this sample with the CEU (Figure 1, Table S8), but both of these signals are also compatible with selection on a nearby SNP.
(4) The ERCC6 SNP rs4253047 initially identified by the HapMap study was represented by the ancestral allele in almost all of our samples (Figure 2C) and thus its inclusion was the result of a genotyping artifact; but since sequence data were available from the NIEHS SNPs study, we performed the same analyses as a negative control. Reassuringly, no test suggested evidence for positive selection.
(5) F2 showed the highest frequency of the derived allele in East Asia (Figure 2D), but this was associated with both higher diversity in the CHB than in the other populations examined and positive, albeit nonsignificant, values for most of the summary statistics (Table 3). The network showed a distinct branch marked by the nonsynonymous SNP and largely specific to the CHB and carrying 26/46 (57%) of CHB chromosomes (Figure 3D); the presence of two distant haplotype clusters in the CHB accounts for the positive summary statistics. Interestingly, Tajima's D and Fu and Li's D were significantly negative in the YRI (Table 3) and the YRI haplotypes were clustered in the network (Figure 3D), findings potentially pointing to an earlier episode of positive selection at the same locus that was unlinked to the nonsynonymous SNP that led to the ascertainment of the gene.
(6) HERC1 showed the highest frequency of the derived allele in East Asia, which reached >90% in several samples (Figure 2E). It therefore showed significantly negative values of Tajima's D and Fay and Wu's H in the CHB but not in other populations (Table 3), standard strong indicators of positive selection approaching fixation, and is illustrated by a large cluster in the network dominated by the CHB haplotypes (Figure 3E). Complications with the mapping of rs7162473 prevented the direct application of the extended haplotype tests (see materials and methods), but examination of all the HapMap2 SNPs in a 100-kb window surrounding this SNP revealed that 20/41 (49%) fell within the top 5% of iHS signals in the CHB + JPT sample, while XP-EHH showed an even stronger signal. For the CHB + JPT–YRI comparison, 70/75 (93%) fell within the top 1% of SNPs for selection in CHB + JPT and 100% of SNPs fell within the top 5%, whereas for the CEU–CHB + JPT comparison the values were 23/75 (31%) and 100%, respectively (this time in the negative tail, i.e., again pointing to selection in East Asia). These conclusions are consistent with those of Sabeti et al. (2007) and are indicated by the distinct symbols in Figure 1.
(7) The ZNF646 nonsynonymous SNP rs749670 derived allele is common in most populations outside sub-Saharan Africa, reaching ∼50% in Europe/West Asia and fixation in one East Asian sample (Figure 2F). Consequently, it shows low diversity and significantly negative values of Tajima's D and Fay and Wu's H in the CHB (Table 3), but higher diversity and positive values of several summary statistics in the CEU and BRU, reaching significance for Tajima's D in the BRU. The network (Figure 3F) shows a large cluster carrying most (45/46) of the CHB haplotypes but also 18/44 (41%) of the BRU haplotypes, explaining the negative and positive statistics, respectively. XP-EHH values are significant at the 5% level and indicate selection in the CHB + JPT sample (Figure 1, Table S8). Evidence for positive selection in East Asia is thus convincing.
(8) CEACAM1 rs8110904 showed a high frequency of the derived allele in all populations outside sub-Saharan Africa, reaching fixation in many of the samples (Figure 2G). Resequence data were available only for African-American and European-American samples, but showed low diversity and significantly negative Tajima's D and Fay and Wu's H in the European-Americans (Table 3). The corresponding network (Figure 3) shows a cluster specific for the derived allele carrying most of the European-American haplotypes, consistent with positive selection on the derived allele outside Africa.
(9) GNB1L stood out in this study as the only gene of the 27 that had the highest frequency of the derived allele in Africa, reaching fixation in one sample (Figure 2H), although its population differentiation was not unusually high (Table 2). It showed its lowest diversity and uniformly negative summary statistics in the two African samples, which are significant for Tajima's D, Fu and Li's D, and Fu's Fs in the Luhya in Webuye, Kenya (LWK) (Table 3). The network pattern is relatively complex, but there is one predominant derived haplotype, well represented along with its neighbors in the African samples (Figure 3H). Despite the unexceptional FST values, these unusual findings are consistent with positive selection in Africa.
Thus the resequencing studies provide powerful insights into the evolutionary history of this set of genes. The ERCC6 gene, with no significant population differentiation, showed no evidence for positive selection. Of the other eight genes, five showed evidence for positive selection acting on the derived allele from multiple significant summary statistic values. Thus using this stringent criterion, ∼60% of the nonsynonymous SNPs showing exceptionally high population differentiation are likely to result from positive selection. Two of the examples where clear evidence for positive selection on the derived SNP was not obtained, ADH1B and F2, both show recent expansions of haplotypes carrying the derived allele in network analyses, but these haplotypes are present at too low a frequency to lead to significant summary statistic values. The third example, FY, might have experienced complex rounds of selection that are not readily detected by the methods used (Hamblin et al. 2002). And two of the genes, EDAR and F2, show evidence of positive selection unrelated to the highly differentiated SNP, suggesting that they may have been the targets of multiple episodes of selection, although perhaps less complex than those of FY.
Two general conclusions emerge from this work. First, technical artifacts have contributed significantly to the apparent discoveries of highly differentiated SNPs. Second, if these artifacts are excluded, the extremely highly differentiated SNPs identified by the HapMap1 empirical survey do appear to have arisen predominantly as a result of population-specific positive selection, rather than genetic drift, a conclusion reached despite the limited power of all available methods to detect selection. We consider each of these conclusions in more detail and some of the biological implications of our findings.
The 32 SNPs that provided the starting point for this study resulted from a stringent ascertainment process by the HapMap1 project: more than 1 million SNPs were genotyped, and then the most highly differentiated nonsynonymous ones were picked out (International HapMap Consortium 2005). In retrospect, it is clear that this strategy also enriched for a rare “bookkeeping” error: allele switching. If a SNP that actually has little or no variation is labeled as having the wrong allele in one population, it will appear to show a very high level of population differentiation. This error accounts for most (five of seven) of the major genotyping discrepancies and associated low population differentiation in the HGDP–CEPH panel. Conversely all of the low-variability nonsynonymous SNPs in the HapMap set subject to this error would have been among the 32, suggesting that it affected only a very small proportion of HapMap genotypes, a low level of error that is probably inescapable in a project of this size, although one that would have been readily detected by simple replication of the small number of results highlighted in the final publication. We searched the 2008 HapMap data for remaining patterns that might indicate allele switching (one population sample fixed for the opposite allele to all the others, allowing one individual to depart from this pattern) and found 2 of 10,928 nonsynonymous SNPs compared with 6 of 832,520 noncoding SNPs that met our criteria. These numbers indicate a nominally significant enrichment of possible allele switching errors in nonsynonymous SNPs (P = 0.004 by Fisher's exact test), but this significance would be lost if one fewer nonsynonymous SNP with this pattern had been discovered, so we do not attach importance to the finding.
After excluding 8 genes because of low population differentiation and 1 because of our genotyping failure, we were left with a set of 18 in which to investigate whether or not high differentiation was due to selection. For this purpose, we had a set of somewhat independent indicators of selection: long-range haplotype structure within or between populations, summary statistics based on resequence data, and networks. The last did not provide a formal test for selection but nevertheless offered considerable insight into the history of the region. REHH and iHS analyses picked out only a few of the genes as showing evidence of unusually long haplotypes when analyzed within a population, a result that reflects the limited power of the method when applied to SNPs ascertained by high differentiation. Such SNPs are usually present at very high frequency in the population of most interest, so the absence of a signal is not evidence for a lack of selection. In contrast, XP-EHH analyses showed a strong enrichment of signals: 10/18 genes with confirmed high population differentiation show signals if the indirect HERC1 result is included, consistent with the higher power of this test relative to iHS for selected variants at higher allele frequencies (Sabeti et al. 2007).
In considering the summary statistic results, we need to take into account the fact that multiple (five) tests were used, but also the partial correlations between the tests, which are sensitive to different but related aspects of the data. We did this by simulating neutral loci in each population, with or without biased ascertainment as described in materials and methods. There was an ∼6% (range 5.8–6.6%) chance of obtaining two or more significant values (compared with <3% or obtaining three or more) and we therefore set a threshold of requiring a significant P-value in at least two tests. With this criterion, we obtained evidence for a departure from neutrality in the population with the high-frequency derived allele, in the direction expected from positive selection, in >60% of the genes examined. We take this as robust evidence for positive selection and also for the power and utility of the resequencing approach. However, some features of the tests applied deserve further discussion. They are best suited to detect positive selection when a single selected haplotype and its derivatives have risen to very high frequency, but not fixation. For example, the lowest and most significant values of Tajima's D were observed for EDAR and ZNF646 (Table 3), where the population samples of interest contained just one and two haplotypes lacking the selected SNP, respectively. These patterns are visualized in the networks as a large cluster containing the selected SNP with one or two small distant haplotypes (Figure 3). When a similar network pattern, in HERC1, contained four haplotypes in the distant cluster (<9%), the value of Tajima's D was not so low, although still significant. But when the number of haplotypes outside the largest cluster in the population of interest was as large as 23% or 54% (ADH1B, F2), nonsignificant statistics were observed. Two conclusions emerge from this discussion: with the current small sample sizes, excessive weight should not be placed on the exact summary statistic values and significance levels, and the networks can provide useful indications, although not firm evidence, that selection may be favoring haplotypes that do not show up in the tests used.
For some of the genes analyzed, there is evidence for biological consequences of the amino acid difference investigated here (Figure 1): EDAR rs3827760 influences NF-κB activity (Bryk et al. 2008) and hair thickness (Fujimoto et al. 2008a,b), including in a mouse model (Mou et al. 2008), and thus might have been sexually selected; the alcohol dehydrogenases metabolize alcohol and rs1229984 in ADH1B is associated with protection against alcoholism (H. Li et al. 2008); SLC24A5 is one of the major loci contributing to light skin color in Europeans (Lamason et al. 2005); ABCC11 rs17822931 determines wet/dry earwax type (Yoshiura et al. 2006); and rs1385699 in EDA2R has been associated with male-pattern baldness (Prodi et al. 2008). For other genes, more general information is available about possible biological functions of the gene, although not the consequences of the specific amino acid change studied: for example, ALMS1 is implicated in carbohydrate metabolism (Scheinfeldt et al. 2009), RNF135 in growth regulation (Douglas et al. 2007), and CEACAM1 in a wide range of functions including infection (Kuespert et al. 2006). These links, particularly with the specific SNP, make the case for selection more compelling. It is notable that among this small set of selection events are three that involve changes in visible appearance, suggesting that mate choice may have been a powerful selective force.
It was striking that two of the eight genes examined in detail showed evidence of positive selection on haplotypes different from those leading to the original ascertainment, and a third gene, FY, might also be placed in this category. Departures from neutral evolution are most readily detected by intraspecific tests when a single round of selection acts on a new mutation or rare SNP and have lower power to detect selection in more complex circumstances, emphasizing the remarkable nature of this observation. Positive selection is rare, and it appears that selection is focused preferentially on a small number of target genes, many of which may experience multiple independent selective events.
Finally, what conclusions can be drawn about the prospects for cataloging the sites of positive selection in our genome and linking them to a broader understanding of human evolution? The strategy of picking out highly differentiated functional SNPs yields (after excluding artifacts) a large enrichment for selected genes, with >60% of candidates being true positives. Thus follow-up of all hits identified using this approach by full resequencing and functional studies would be worthwhile. However, it is clear that this strategy identifies only a very small proportion of the regions that have experienced positive selection—its false negative rate is likely to be high. Furthermore, empirical approaches are inevitably biased toward detecting certain forms of selection (Teshima et al. 2006), although searching for outliers in population differentiation is one of the few effective ways of detecting selection on standing, rather than new, variants (Innan and Kim 2008). There is an urgent need for more effective ways of detecting selection, and resequence data from large numbers of individuals as planned by the 1000 Genomes Project (http://www.1000genomes.org/page.php) will provide an excellent data set to test these.
We thank the original donors of the HapMap, Brahui, and HGDP–CEPH samples; Howard Cann for supplying the HGDP–CEPH DNAs; Joe Pickrell for C scripts and helpful discussions on extended haplotype tests; and the Sanger Large-Scale Sequencing Team for generating the sequence data. This work was supported by The Wellcome Trust.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.107722/DC1.
↵1 These authors contributed equally to this work.
↵2 Present address: MRC Epidemiology Unit, Institute of Metabolic Science, Addenbrooke's Hospital, Cambridge CB2 0QQ, United Kingdom.
Communicating editor: A. Di Rienzo
- Received July 24, 2009.
- Accepted August 31, 2009.
- Copyright © 2009 by the Genetics Society of America