Scan of Human Genome Reveals No New Loci Under Ancient Balancing Selection
K. L. Bubb, D. Bovee, D. Buckley, E. Haugen, M. Kibukawa, M. Paddock, A. Palmieri, S. Subramanian, Y. Zhou, R. Kaul, P. Green, M. V. Olson


There has been much speculation as to what role balancing selection has played in evolution. In an attempt to identify regions, such as HLA, at which polymorphism has been maintained in the human population for millions of years, we scanned the human genome for regions of high SNP density. We found 16 regions that, outside of HLA and ABO, are the most highly polymorphic regions yet described; however, evidence for balancing selection at these sites is notably lacking—indeed, whole-genome simulations indicate that our findings are expected under neutrality. We propose that (i) because it is rarely stable, long-term balancing selection is an evolutionary oddity, and (ii) when a balanced polymorphism is ancient in origin, the requirements for detection by means of SNP data alone will rarely be met.

GENOMIC regions with high densities of single-nucleotide polymorphisms (SNPs) may arise for one of three reasons: (1) a high mutation rate, (2) diversifying selection, or (3) chance. The latter category simply reflects neutral variation across the genome in the time elapsed since the most recent common ancestor of a genomic segment (i.e., “deepest” coalescence time). The level of diversity at any locus is often described in terms of the average pairwise divergence, or the probability that two randomly chosen haplotypes differ at a particular base in that locus (θπ). Population-genetic theory predicts that, on average for neutrally evolving regions, the most dissimilar haplotypes, capturing the full depth of the evolutionary tree, will be about twice as divergent as a random pair of haplotypes (∼2θπ) (Chimpanzee Sequencing and Analysis Consortium 2005). The variance of each measurement (average pairwise divergence and maximal pairwise divergence) decreases as the size of the locus examined increases. Although there have been no systematic studies of regions of unusually high SNP density in the human genome, a few such regions have been discovered during the analysis of variation in important genes.

The most dramatic examples are in the HLA locus. In some regions of HLA, such as the DQB1–DQA1–DRB1 gene cluster, the sequence divergence between the most dissimilar haplotypes across tens of thousands of base pairs is nearly two orders of magnitude higher than the genome-average θπ (Stewart et al. 2004; Raymond et al. 2005a). The elevated nucleotide diversity at HLA has been attributed to diversifying selection acting on functionally critical sites, coupled to extensive “hitchhiking” of neutral SNPs (Hughes and Yeager 1998; Slatkin 2000). Similar explanations have been offered for other known regions with high SNP densities, although none is nearly as dramatic as HLA. For example, across the ∼25 kbp of ABO sequence, the most dissimilar haplotypes are roughly three times as divergent as expected under neutrality (6.25θπ compared to the neutral expectation of 2θπ) [SeattleSNPs, National Heart, Lung, and Blood Institute (NHLBI) Program for Genomic Applications, Seattle ( (October 2005)]. At an “ancient” 900-kbp inversion on chromosome 17, spanning many genes, the maximum pairwise difference is only ∼4.2θπ (Stefansson et al. 2005). This is an unusual case, because as recombination is absent between inverted haplotypes, the depth of the coalescent is the same across the entire region. Across the genome, the average pairwise diversity over tens of kilobase pairs rarely exceeds 4θπ [National Institute of Environmental Health Sciences (NIEHS) Environmental Genome Project, University of Washington, Seattle ( (October 2005)]. As a larger fraction of the genome is analyzed in multiple haplotypes, the number of regions that, by chance alone, have ancient coalescence times may overshadow the number of regions that are deep due to selection. This preponderance of false positives may be true for many tests of selection based on characteristics of the coalescent tree.

In this study, we undertook a systematic search for segments of the human genome that have high SNP densities. In contrast to previous studies, our search was unbiased by functional considerations. We simply mined the same public databases of shotgun-sequencing reads that led to discovery of most of the SNPs presently in the SNP database. By paying careful attention—through computational filters and additional experimental sampling—to the many sources of false positives, we identified 16 loci that, next to HLA and ABO, now represent some of the highest SNP densities yet described in the human genome across regions of many thousands of base pairs. These regions do not seem to reflect the action of ancient balancing selection, but rather illustrate the significant “noise” level of the variation in neutral coalescent depth. Our results suggest a paucity of regions outside of HLA that are under long-term balancing selection, as detectable by polymorphism level. This result is in agreement with theoretical predictions that single-site balanced polymorphisms would not be detectable due to recombinational decay of the linked neutral region. Our analysis also highlights the various challenges associated with genomewide scans for genomic segments under balancing selection.


Computational filtering:

The SNP Consortium (Sachidanandam et al. 2001) reads (release 10) were downloaded from and For each read, we required one fasta file containing the base calls and one quality file containing the base-by-base, log-transformed probability that each call is incorrect (Ewing and Green 1998). In some cases, paired fasta and quality files were downloaded, and in some cases, the chromatogram was downloaded and fasta and quality files were generated in-house using PHRED (Ewing and Green 1998; Ewing et al. 1998). The reads were then trimmed for quality such that each read consisted only of a region with >95% Q20 bp. Reads with <100 bp fitting this criterion were removed from the pipeline, leaving a total of 4,295,373 reads (see Figure 1 for filtering summary).

These trimmed reads were aligned to the human reference sequence (build30) using BLAST (Altschul et al. 1990). We retained those reads that matched the reference for at least 300 bp (3,334,762 reads) and, in addition, disagreed with the reference at 1% or more of the aligned base pairs (402,580 reads). Reads that contained any bases in repeat elements, as annotated on the University of California at Santa Cruz website at, were removed from further analysis, leaving 97,909 reads.

We then used cross_match (which is based on a word-nucleated banded Smith–Waterman algorithm available at to align each read to its local genomic region. In the ∼60,000 alignments that had a “minscore” of at least 100, we then looked for putative SNPs, defined as discrepancies between the read and the reference sequence in which both bases had a PHRED quality of ≥30 (corresponding to an error probability of <0.001). (Note that much of the reference sequence has the artificial quality 99, which means that a human “finisher” manually annotated it as being correct.) To discount the effects of CpG mutational hotspots (Bird 1980), discrepancies in which one sequence had a CpG and the other a TpG or CpA were counted as one-tenth of a putative SNP, reflecting the ∼10-fold higher mutation rate at CpGs (Hwang and Green 2004). We identified 6395 alignments that had ≥1% putative SNPs. Primer3 (Rozen and Skaletsky 2000) was used on those 6395 regions to pick primers for PCR amplification of the region; in 4255 of the cases, primers were successfully picked. To avoid sequencing the same region multiple times, in cases where multiple reads implicated the same region as highly polymorphic, we PCR amplified only one arbitrarily chosen amplicon/10 kbp, for a total of 991 regions (see Figure 2a; see supplemental Table 1 at for list of primer sequences and expected amplicon lengths).

Experimental filtering (PCR and fosmid sequencing):

PCR amplification was performed on a set of 10 DNA samples from self-identified African Americans, obtained from the Coriell repository (NA17031, NA17032, NA17033, NA17034, NA17035, NA17036, NA17037, NA17038, NA17039, NA17040). PCR products amplified from these samples at the 991 test loci were sequenced (see Figure 2b). Polyphred (Nickerson et al. 1997) identified 208 regions that had more than three SNPs of rank 3 or less, and the supporting trace data at these loci were manually examined. Loci in the HLA region, as defined by Stewart et al. (2004), were dropped from analysis at this point. We determined that 80 of the loci actually contained ≥1% SNPs with genotypes in approximate Hardy–Weinberg proportions (thereby indicating that they were not artifacts of co-amplified paralogous sequences), with a minimum of two occurrences of the minor allele for each SNP.

To test whether the high SNP density extended over several kilobase pairs, for each locus with ≥1% SNPs in this sample population (again counting possible CpG mutations as one-tenth SNP), we PCR amplified two flanking regions—one 3 kbp upstream and one 3 kbp downstream from the initial read—from four individuals from our diversity panel (NA03715, NA11373, NA11589, NA14660), three from the original African-American sample (NA17031, NA17038, NA17039) and one from a chimpanzee (NA03646), as shown in Figure 2c. Loci at which at least one of the flanking PCR sites had ≥0.7% SNPs, with a minimum of two occurrences of the minor allele per SNP, were considered further (80 loci). On the basis of visual inspection, we selected one SNP from each originally amplified locus as a “tag SNP” for purposes of defining a pair of diverged haplotypes; the tag SNP was required to be in apparent linkage disequilibrium with other SNPs in both the originally amplified locus and the flanking regions, and in general it was the SNP with the highest minor allele frequency (see asterisks in Figure 2). We sequenced additional individuals (NA01814, NA10471, NA10540, NA10543, NA10975, NA10976, NA11321, NA11521, NA13838, NA14663) to identify three samples from each of the two haplotypes defined as described above. For each candidate locus, we isolated the six haplotypes from fosmid libraries constructed from the identified individuals, using PCR assays targeted 10 kbp upstream and downstream from the site of the original read to identify fosmids sharing at least 20 kbp of overlapping sequence (Raymond et al. 2005b). Isolated fosmids were then sequenced by standard shotgun-sequencing and finishing methods to an estimated error rate of <10−5 (the sequences have been deposited in GenBank with accession nos. DQ384420DQ384510).

Allele-frequency information for each tag SNP by human subpopulations was obtained from the following samples from the Coriell depository: African Americans (NA17101–NA17132), Caucasians (NA17201–NA17232), and Chinese Americans (NA17733–NA17747, NA17749, NA17752–NA17759, NA17761–NA17762, NA17764–NA17769).

Calculating within-human divergence and human–chimpanzee divergence:

For each region, we plotted a profile of the percentage of nucleotide difference between each pair of haplotypes in 5-kbp windows with a 100-bp slide (e.g., Figure 3). Human–chimpanzee divergence levels were calculated by comparing the human fosmid sequences to the orthologous chimpanzee draft sequence (UCSC build 1 version 1). Because the chimpanzee sequence is not “finished,” we used a POLYBAYES-like algorithm to estimate the probability that each discrepancy between a pair of sequences was a true discrepancy rather than a sequencing error (Marth et al. 1999) and calculated the expected number of true discrepancies for each region.

Computer simulation and analysis:

We used two neutral evolution models to simulate the evolutionary relationships among a sample of chromosomes. The first was a constant-population-size, uniform-recombination-rate model implemented in the ms program (Hudson 2002), referred to below as the “simple model,” which we used to simulate 30 haplotypes, arbitrarily defining the first haplotype as the reference haplotype and the second as the read haplotype and taking the remaining 28 haplotypes as the experimental sample. We estimated the genome-average θ (= 4Nμ, where N is the effective population size and μ is the per-generation mutation rate) to use in these simulations from the number of segregating sites in all African-origin individuals in the 19 genes with the most sequence available (abcb1, app, blm, ctnna1, cyp19a1, ece1, gab1, map2k4, mapk9, mmp16, nos2a, pdlim1, poln, pten, rad18, rev1l, tjp1, tp53bp1, xrcc4) from the NIEHS SNPs project [NIEHS Environmental Genome Project, University of Washington, Seattle ( (October 2005)]. Our value (θS = 0.081%) is probably a slight underestimate, because the sequence includes exonic regions. Our value for the genome-average ρ (= 4Nr, where r is the recombination rate) of 5 × 10−4 is consistent with current genome-average estimates (Pritchard and Przeworski 2001).

The second model was a calibrated parameter-rich model that generated simulated data. We simulated 38 haplotypes (22 African-American, 6 European, 5 Asian, and 5 African haplotypes) using the cosi_package, maintaining all parameter values as described in the “best-fit” model (Schaffner et al. 2005). We defined the European haplotype 1 as the reference and chose the read haplotype at random from European haplotype 4-6, African haplotype 3-5, or Asian haplotype 3-5. The remaining 28 simulated haplotypes correspond to our experimental sample.

In analyzing the simulated data, we attempted to mimic our real-data analysis as closely as possible, described as method c in the results. Because some issues (particularly the presence of base-calling errors in the reads and the need to reduce computational burden) were relevant to the real-data analysis but not the simulations, some steps were done in a different order in the simulations. In particular, reads with at least 1% SNP differences relative to the reference could be identified in a single (late) step in the simulations, but required several steps in the real-data analysis, including an early step to identify reads with ≥1% discrepancies (including basecalling errors in addition to SNP differences) relative to the reference, as well as later steps involving experimental confirmation of the SNPs. As a result of this reordering of intermediate steps, some intermediate numbers from the simulations in Figure 1 are not directly comparable to the real-data results, but the bottom-line values (number of regions expected to be identified by our approach after applying all computational and experimental filters) should be.

In the simulations, the filtering process can be divided into two phases, consisting of the steps that do or do not enrich for highly polymorphic regions. First, we determined the number of reads that would have been expected to survive our data-quality filters. We estimated this from the real data results: 24.3% of the reads had no repetitive elements; 62% of the reads had cross_match scores ≥100; we were able to design useful primers for 66.5% of the putatively highly polymorphic regions; 80% of our PCR amplifications produced usable sequence data; 50% of products were ≥300 bp (we subsequently required three or more SNPs/300 bp). Multiplying all of these factors by the initial 3,334,762 trimmed reads with at least 300 bp aligned to the reference sequence, leaves 133,642 reads expected to pass the above filters (see Figure 1).

Second, we estimated how many highly polymorphic regions are expected among the surviving reads, which were regarded as unbiased with respect to polymorphism content, on the basis of our coalescent simulations (see Figure 2). Because of run-time constraints for the parameter-rich coalescence simulator, we divided the genome into 1-Mb “chromosomes” and randomly placed 1/3000 of the 133,642 reads on each chromosome. For each simulated read, we counted the number of nucleotide differences between the designated read and the reference haplotypes in a read-size window of variable length, drawn from our empirical distribution of trimmed read-lengths (as in Figure 2a). We allowed only one highly polymorphic read (arbitrarily chosen) per 10-kbp window, discarding all others in that window. For those simulated reads with ≥1% nucleotide differences, we counted the number of SNPs within 20 African-American haplotypes, in the case of the parameter-rich simulation, or an arbitrary subset of 20 different haplotypes in the simple simulation (as in Figure 2b). We required that the minor allele for each of these SNPs be represented at least twice among the 20 haplotypes. For those regions that still had ≥1% SNP density by these criteria, we examined the number of SNPs in windows the same size as that used at the site of the original read, 3 kbp upstream and downstream from that site in 6 of the 20 test haplotypes and in the 8 remaining unexamined haplotypes, as shown in Figure 2c (in the case of the parameter-rich simulation, these were European haplotypes 2-3, African-American haplotypes 27-28, Asian haplotypes 1-2, and African haplotypes 1-2). If either the upstream or the downstream window had ≥0.7% SNPs, again requiring that the minor alleles for each SNP appear at least twice, we scanned a 20-kbp window, centered on the original read position, for the 5-kbp window with the highest number of nucleotide differences between the most dissimilar haplotypes within this last set of haplotypes (“MAXDIV”). The numbers reported in Figure 1 are for the parameter-rich simulation, but the numbers for the simple simulation are similar (see Figure 7).


Our initial data set consisted of a pairwise comparison of ∼4 million whole-genome shotgun reads from the SNP Consortium (Sachidanandam et al. 2001) of the human reference sequence. The main difficulty when looking for highly divergent regions with data of this type is the high frequency of false positives—indeed, the vast majority of regions that initially appear to be highly divergent are false positives. Therefore, we developed an extensive filtering scheme (Figure 1) to increase the ratio of true to false positives. The types of false positives that we encountered can be broken down into two classes: (i) misalignment of paralogous sequences and (ii) miscalled bases with atypically high-quality values (i.e., sequencing errors that are difficult to recognize).

Figure 1.—

Analysis pipelines used to identify regions of the genome with high polymorphism in real and simulated data. Numbers shown for the simulated pipeline are those generated using a parameter-rich coalescent model (Schaffner et al. 2005). Black arrows indicate steps that enrich for highly polymorphic regions, indicated by black boxes following these steps. Gray arrows are accompanied by the percentage of reads that passes through that particular filter. That value is used in the simulated pipeline. As is described more fully in materials and methods, the order of the filters differed between the real-data and simulated pipelines.

We computationally filtered the reads in ways that eliminated most of these and then tested 991 of the most promising cases experimentally on a panel of 10 African Americans to validate the SNPs in this region (see materials and methods for details). Because the variance in SNP density due to the Poisson distribution alone is high within read-length-sized windows, which average 400 bp, we required that the highly variable segments extend over a minimum of 3 kbp in an attempt to avoid regions where a fortuitously high number of SNPs is not reflective of divergence time. We implemented this requirement by designing PCR assays 3 kbp upstream and downstream from the original read and resequencing these amplicons from a new panel of humans. This new panel comprised three of the original African Americans and four additional individuals from a diversity panel to bias our positives toward those with high-minor-allele frequencies in multiple populations. Figure 2 illustrates the three steps of the filtering strategy in which most of the enrichment for highly polymorphic regions occurred. Finally, we sequenced at least 20 kbp of the most dissimilar haplotypes at the 16 non-HLA loci that passed all our filters. Typical results are shown in Figure 3, and characteristics of the 16 loci are given in Table 1. Although maximum pairwise divergence is often well above the genome-average pairwise divergence (indicated by the lowest dotted line in Figure 3; 0.081% for humans of recent African ancestry) and in places approaches that of the human–chimpanzee divergence, corroborative signs of balancing selection were absent from these regions. In no case was a human haplotype more closely related to a sampled chimpanzee haplotype than another of the sampled human haplotypes (i.e., there was no indication of trans-species polymorphism). Moreover, the interspecies divergences do not indicate unusual levels of mutation at any of these loci. Although traditional tests for balancing selection such as Tajima's D (Tajima 1989) and HKA (Hudson et al. 1987) were often strongly positive in these regions, these statistics are highly correlated with polymorphism levels; hence, they are not independent tests of neutrality. Annotated genes were generally sparse, with most regions being tens or hundreds of kilobase pairs away from the nearest exon. For nine of the most divergent loci, we tested allele frequencies in three subpopulations and found that the FST's (Weir and Hill 2002) for these loci were not unusual as compared to the observed distribution for random sites (Akey et al. 2002). The number of putative highly conserved bases in these regions (Siepel et al. 2005) was also not significantly different from randomly chosen regions (Table 1).

Figure 2.—

Key filtering techniques used to find extended regions of high polymorphism: (a) original alignment of the SNP Consortium read (“TSC read”) to the human reference genome; (b) the SNP-confirmation step in which the region was amplified from the genomic DNA of 10 self-identified African Americans and resequenced (20 haplotypes); and (c) the SNP discovery step 3 kbp upstream and downstream from the original read, based on a panel including three of the previous African Americans (haplotypes indicated with shaded lines) and four additional individuals from a diversity panel (haplotypes indicated with dashed shaded lines). Note that SNPs were not typed for the four additional individuals at the site of the original read alignment. For each SNP, the major and minor alleles are indicated as solid and open circles, respectively. Asterisks indicate potential “tag” SNPs used in the subsequent fosmid isolation step.

Figure 3.—

Pairwise divergences in 5-kbp sliding windows (offset = 100 bp) over a 30-kbp genomic span for three loci. Blue lines indicate human–human comparisons; red lines indicate human–chimpanzee comparisons. At the top, middle, and bottom, dotted lines represent pairwise divergences of 1, 0.3, and 0.081%. The latter value is the genomewide average divergence between two randomly sampled sequences. Straight edges indicate interpolation of the human–chimpanzee comparisons across regions in which chimpanzee sequence is lacking.

View this table:

Characteristics of the 16 highly divergent loci identified in our screen with ABO and HLA for comparison

Given the lack of annotated gene features that might comprise targets for balancing selection, we sought to determine whether the observed levels of divergence were within the expectations of a neutral model. We chose as our key statistic the number of nucleotide differences in a 5-kbp window between the most dissimilar haplotypes (MAXDIV) to capture the amplitude of the polymorphism level over a significant genomic extent in the observed regions. Due to the complexity of the models that we wanted to consider, the analysis we that present is based on coalescent simulations rather than on explicit theory. We simulated the evolutionary relationship among a sample of haplotypes, using both a constant-population-size, uniform-recombination-rate model (referred to below as the “simple model”; Hudson 2002) and a calibrated parameter-rich model (Schaffner et al. 2005), and then looked for the frequency of occurrence of loci with characteristics similar to those that we observed (see materials and methods).

We analyzed 10 genome equivalents of simulated data using each of three methods, illustrated in Figure 4. In the first method (a), we made a histogram of MAXDIVs for each nonoverlapping 5-kbp window in the genome. This distribution has a mean of 2θπ × 5000, which, for our estimated θπ of 0.081%, is ∼8. In the second method (b), we scanned each nonoverlapping 20-kbp region for the 5-kbp window with the largest MAXDIV, as was done at our observed loci. Our last method (c) takes into account the acquisition bias in the choice of regions to examine (see materials and methods and Figure 4 legend for details). We obtained results using both the simple and the parameter-rich simulators. The expected MAXDIV distributions for each of these analysis methods are illustrated in Figure 5, with the MAXDIV values for our loci indicated with red asterisks.

Figure 4.—

Description of simulation and three alternate methods of analysis. (Top) The evolutionary relationship among 30 haplotypes of a population for a segment of genomic sequence. For those 30 haplotypes, there are two changes in their evolutionary relationship in this segment, due to ancestral recombination events. The sites of these ancestral recombinations are represented by edges between adjacent color blocks, which contain slightly differing phylogenies. (Bottom) Three alternate methods of analyzing the simulated genomes. (a) The number of nucleotide differences between the most dissimilar haplotypes (MAXDIV) within each nonoverlapping 5-kbp window is reported. (b) For each nonoverlapping 20-kbp window, the MAXDIV of the most divergent 5-kbp window is reported. (c) For each 20-kbp window that satisfied the computational filtering requirements summarized in Figure 1, the MAXDIV of the most divergent 5-kbp window is reported (see materials and methods for details). To simulate the filtering steps, three test regions (see small vertical boxes) were established in the center of each 20-kbp window, corresponding to the positions of the original read and sites 3 kbp upstream and downstream from this position.

Figure 5.—

Comparison of observed loci with simulated MAXDIV distributions. Curves labeled methods a–c were generated by analyzing data simulated under the simple coalescent model, using the analysis methods illustrated in Figure 4. The curve labeled “parameter-rich” was generated by analyzing data simulated under the parameter-rich coalescent model using method c. See materials and methods for details of both simulation models and analysis methods. For each analysis method, a histogram was produced and then normalized such that the bar areas sum to one. Red asterisks indicate the MAXDIV of the 16 loci for which we obtained extended sequence. The smoothness of the curve for method a reflects a higher number of windows analyzed in the 10 genomes with this method.

Overall, our data appear to be consistent with expectations based on the calibrated, parameter-rich coalescent model (see Figure 5). However, it is reassuring to note that the data also agree reasonably well with expectations based on a simple constant-population, uniform-recombination coalescent model; hence, the agreement between neutral theory and experiment is not highly sensitive to the details of the neutral model.

To examine the effect that varying recombination rates would have on the expectations under the simple coalescent model, we repeated the simple model simulations using recombination rates 1/2, 1/4, 1/8, 1/16, and 1/32 that are in the initial simulation and analyzed the simulated data according to method c. The two main effects of decreasing the recombination rate were (i) lengthening the upper tail of the MAXDIV distribution (Figure 6) and (ii) increasing the expected number of regions of high polymorphism (Figure 7). The fact that the shapes of the distributions converge by ∼ρ/8 suggests that the results of our simulation—the expected numbers of highly polymorphic 5-kbp windows in a genome—are robust even to major uncertainties in the recombination parameter. (Note also that in the parameter-rich simulation, 12% of recombinations happened outside of hotspots, corresponding to a background rate of ρ/8.)

Figure 6.—

The effect of varying recombination rates (rho) on simulated MAXDIV distributions. All distributions were generated using method c. The curves for different multiples of rho (where rho = 5 × 10−4) used data simulated under the simple model. The curve labeled “parameter-rich” was generated by analyzing data simulated under the parameter-rich coalescent model. Red asterisks indicate the MAXDIV of the 16 loci for which we obtained extended sequence.

Figure 7.—

Numbers of 20-kbp windows found per simulated genome for simple coalescent models with varying recombination rates and the parameter-rich coalescent model. The line of shaded asterisks indicates the number of 20-kbp windows identified in our real-data screen. The line of shaded circles indicates the number of 20-kbp windows found per simulated genome for the parameter-rich model. Analyses used method c, illustrated in Figure 4.

Two factors that we did not attempt to accommodate in our simulations, although there is ample evidence that both have had significant effects on the human-genome evolution, are varying mutation rates across the genome and the possibly larger ancestral population size than that estimated from the genomewide average θπ. Either of these factors would increase the variance of the distributions shown in Figure 5, thus making our findings even less unexpected under a null hypothesis of neutrality. For example, while the average divergence between the human and chimpanzee genome is 1.23%, the standard deviation for 100-kbp windows is in excess of 0.25% and rises rapidly with decreasing window size; this effect is thought to be due mainly to variation in mutation rates across the genome (Chimpanzee Sequencing and Analysis Consortium 2005). We attempted to control for mutation rate by examining local human–chimpanzee divergences, which did tend to be slightly higher than the genome average, although only one exceeded 2% (e.g., Table 1).

Ancestral population sizes on a multi-million-year time scale that exceed the usual estimates of ∼10,000 would also increase the variance of the distributions shown in Figure 5, primarily by extending the upper tail of the distribution. It has been inferred that the size of the population ancestral to both humans and chimps was perhaps five times greater than estimates for the recent effective population size for humans (Chen and Li 2001; Wall 2003). While it is not known when the effective population size of the human lineage shrank, it must have been long ago enough that it does not affect the coalescence time of the vast majority of sites in the genome (Marth et al. 2004; Schaffner et al. 2005; Voight et al. 2005). If a locus had remained polymorphic since the time of the larger ancestral population size, the total coalescence time for that locus would reflect that larger population size and might be reflected in the few very deep coalescences occurring fortuitously in the genome.

The HLA locus serves as an important positive control for our genomewide screen. Regions identified by the computational filter were enriched for the HLA region, as expected. Since the HLA locus, as commonly defined, comprises <0.1% of the genome and only a fraction of the locus is highly polymorphic, we expect <1 of the 991 computationally identified regions to land within this locus by chance. In contrast, we observed16. Without prior knowledge of HLA annotation, the 16 HLA “hits” implicated 11 genes (RFP, HLA-F, DRB3, DRB1, DQA1, DQB1, DQA2, DQB2, DOA, DPA1, and DPB1) as candidates for balancing selection due to elevated polymorphism (Figure 8). Most of these genes are in extremely polymorphic HLA subregions (Stewart et al. 2004).

Figure 8.—

Sites within the HLA locus that our computational filter indicated as putative highly polymorphic. The profile of the pairwise divergence for 10-kbp sliding windows with 5-kbp offsets of two HLA haplotypes, 6-COX and PGF (Stewart et al. 2004), is plotted, with select HLA genes and our hits indicated by black vertical lines above the scale bar.


Although the regions we identified are among the most polymorphic regions yet reported in the human genome outside of the HLA and ABO loci, we did not find any evidence that balancing selection had maintained this variation. Our results are consistent with theoretical predictions (Wiuf et al. 2004) and recent experimental results (Asthana et al. 2005), both of which suggest that only under very special circumstances can balancing selection persist over millions of years, while also leaving a signature of high local polymorphism on the human genome.

Our choice of test statistic (maximum divergence over a 5-kbp interval) limits our ability to detect certain types of balanced haplotypes, such as those that have been eroded by recombination to a few nucleotides. If we were looking instead for evidence of ancient admixture with Neanderthal—longer regions of moderately high polymorphism levels—our test statistic would need to reflect these expected properties. In all scans of this type, there is a trade-off between discarding true positives by filtering too aggressively and being overwhelmed with false positives when the filtering is not aggressive enough.

The larger question posed by our data is whether or not the use of DNA polymorphism data to detect targets of long-term balancing selection in the human genome is futile regardless of the details of the methodology employed. Despite the abundance of theoretical and experimental literature on balancing selection, there are only three well-documented instances of trans-species balancing selection among eukaryotes: (i) the MHC (Hughes and Yeager 1998), (ii) the gene responsible for middle-wavelength and long-wavelength color vision in New World monkeys (Surridge and Mundy 2002), and (iii) the self-incompatibility (SI) loci in plants, which have arisen independently multiple times (Castric and Vekemans 2004). Beyond these examples, one could argue that the sex chromosomes, which have arisen multiple times (Fraser and Heitman 2005), are the oldest and most prevalent example of balancing selection. However, the first three examples are more relevant for our study.

There is strong evidence that adaptive immunity, a process in which the MHC is definitively involved, evolved multiple times (Litman et al. 2005). Hence, as with self-incompatibility in plants, the MHC may be another case of convergent evolution of balancing selection. What do these genetic loci have in common? We argue that, for balancing selection to be evolutionarily stable, it must be frequency dependent (as opposed to heterotic), and, for it to be detectable, there must be at least two physically linked loci that are each under balancing selection, thereby enabling neutral mutations to accumulate over a substantial region due to selection against most recombination events in the interval (Slatkin 2000). These conditions seem to have been met rarely during human evolution—in our screen, the HLA region was unique in the density of highly polymorphic “hits.”

While any type of selection that favors maintenance of more than one allele is, by definition, balancing selection, there are multiple mechanisms through which a balance of alleles can be maintained. The most widely recognized mechanism is heterozygote advantage, as in the textbook example of sickle-cell anemia. Although the sickle-cell allele raises the overall fitness of the population, a significant fraction of individuals have decreased survival and reproductive rates as a consequence of this one allele—a phenomenon that has been described as genetic or segregational load. There are two indications that such systems may not be stable. First, a new allele under balancing selection may rise in frequency more quickly than a new allele under positive selection—even one which, in equilibrium state, confers a greater fitness benefit on the population. This is because when a new allele is at a low frequency, the fitness advantage of the heterozygote is most important, while the lower fitness of homozygotes is not yet very relevant. For example, despite the fact that multiple hemoglobinopathy-related alleles (including the one responsible for sickle-cell anemia) have arisen independently in response to selective pressure by malaria, an allele exists (HbC) that is protective against malaria in the homozygous state and more weakly in the heterozygous state as well. Neither state is associated with hemoglobinopathy. Given enough time under continued selective pressure, it is expected that this allele would sweep through the at-risk region and increase the total population fitness (Modiano et al. 2001). Second, in general, one can imagine some combination of gene duplication and regulatory modifications that would allow all individuals to have the benefits of both alleles of a gene under balancing selection (Spofford 1969), as is illustrated by the evolution of separate middle-wavelength and long-wavelength color-vision genes in Old World monkeys and Great Apes.

In contrast, frequency-dependent selection does not require a steady-state fitness differential and, therefore, confers less load on a population (Kojima 1971). Consequently, this type of balancing selection is probably more stable than instances that depend on heterozygote advantage.

If a balancing-selection system is sustainable, but depends on a single SNP, the genomic region in which two haplotypes are preserved will erode, due to recombination, ultimately making it impossible to recreate an accurate coalescent tree with SNP data (Wiuf et al. 2004). However, if there are two physically linked sites that, in certain combinations, produce balanced haplotypes, the neutral sites between them will reflect the divergence time of the balanced haplotypes. The sites must be nearby, however, to avoid recombinational erosion (Kelly and Wade 2000; Barton and Navarro 2002). Although there is evidence for some degree of genomic clustering of coexpressed genes in mammals at the megabase-pair scale, which may largely reflect local chromatin characteristics (Hurst et al. 2004), we suspect that sites at which recombination is strongly suppressed by selection against recombinant haplotypes are rare.

In each of our three best examples, there is epistasis between physically linked sites. In the form of SI in which compatibility is determined by parental genotype (“sporophytic SI”), the male and female determinants are tightly linked, both physically and phylogenetically (Sato et al. 2002; Hiscock and McInnis 2003). In the form of SI in which compatibility is determined by gametic genotype alone (“gametophytic SI”), only the female determinant has been identified. For that locus, there are two hypervariable regions within a single gene that are thought to interact functionally (Franklin-Tong and Franklin 2003). The color vision alleles in New World monkeys also appear to display balancing selection at multiple intragenic sites, as there are multiple mutational differences in different exons that distinguish functional alleles (Shyue et al. 1998). At the MHC locus, there are many associations between “ancestral haplotypes” covering multiple genes and disease susceptibilities (Price et al. 1999). It appears that, as in the evolution of the sex chromosomes (Fraser and Heitman 2005), the MHC locus has acquired functionally related sets of genes whose gene products interact (Kelley et al. 2005).

While there are frequent claims for balancing selection at other loci in the literature, the plausibility of most of these cases depends on scenarios for heterozygote advantage. Thus far, the best case for balancing selection in the human genome solely on the basis of greater-than-expected coalescence time is at the locus controlling ABO blood type, specifically between the A and B alleles. ABO is an interesting example because, although it has been known to be polymorphic for >100 years due to its relevance in blood transfusion, its primary evolutionary function remains elusive. The lack of a strongly deleterious genotype satisfies our first proposed criterion that there should be little genetic load. The initial suggestion of long-term balancing selection came from the fact that the AB antigen–antibody phenotype is present in many primates, including some New World monkeys (Blancher et al. 2000). Furthermore, it has been shown biochemically that only two nucleotides, separated by 6 bp, differentiate the A allele from the B allele (Yamamoto and Hakomori 1990) and that these two nucleotides demonstrate apparent trans-species polymorphism within humans, chimpanzees, and gorillas (Martinko et al. 1993). In contrast, the O allele appears to have arisen multiple times in humans but is rare in nonhuman primates. When intronic sequence of humans, gorillas, and chimpanzees is compared, there is no evidence for trans-species polymorphism of linked neutral sites, so it has been argued that the two functional polymorphisms reflect convergent evolution (O'HUigin et al. 1997). However, if the balanced haplotype is just 8 bp long, it would behave as a single site and have only modest effects on flanking polymorphism levels (Wiuf et al. 2004); the six exonic nucleotides between the functional polymorphisms certainly cannot hold enough neutral mutation to provide an accurate estimate of divergence time. Indeed, while polymorphism levels are high in the ABO region—with a MAXDIV of 49, which approaches human–chimpanzee divergence levels—there is no evidence for trans-species polymorphism outside the 8-bp haplotype [SeattleSNPs, NHLBI Program for Genomic Applications, SeattleSNPs, Seattle ( (October 2005)]. Thus, while we cannot conclude that ABO is another example of trans-species balancing selection, the possibility exists that it is an “invisible” example that cannot be detected by polymorphism studies.

We hypothesize that balancing selection most frequently arises in transient situations when the environment changes rapidly. Balancing-selection systems may largely be evolutionary “band-aids” that survive only until a more stable strategy arises, based on gene duplication and divergence, or until the rise of a more evolutionarily successful allele. This view is reminiscent of arguments supporting the less-is-more hypothesis (Olson 1999); indeed, many suspected examples of recent balancing selection involve maintenance of nonfunctional or subfunctional alleles in the population (e.g., Δccr5, ΔF508, HbS).

This brief analysis suggests that long-term balancing selection may simply be rare in humans and other organisms with similar biology and evolutionary histories. Certainly, this conclusion is compatible with the results of our search for targets of long-term balancing selection in the human genome. Nonetheless, the question still arises as to whether or not we failed to identify such targets simply because we had too little data to analyze. Would we have fared better, for example, if the entire genome were sequenced across 20 human haplotypes? While we cannot exclude that possibility, we suspect that identification of genes under long-term balancing selection will remain a gene-by-gene process, based largely on functional evidence, and not greatly accelerated by genomic analysis because (i) the phenomenon itself is rare and (ii) compatible balancing selection between physically linked loci—a requirement for generating a detectable genomic fingerprint—is also rare. Nonetheless, the fact that balancing selection systems have arisen independently multiple times and involve core functions of multicellular, sexually reproducing organisms (e.g., combating pathogens and avoiding selfing) suggests that, while rare, balancing selection has had major effects on the evolution of metazoan organisms.


We thank Joseph Felsenstein, Mary Kuhner, and Arnold Kas for many helpful conversations and suggestions. This work was supported by the Howard Hughes Medical Institute (P.G.) and the Center of Excellence in Genome Science grant P50 HG02351 from the National Institutes of Health National Human Genome Research Institute (M.V.O.).


  • Communicating editor: M. Nordborg

  • Received January 12, 2006.
  • Accepted May 31, 2006.


View Abstract