Glucose-6-phosphate dehydrogenase (G6PD) deficiency is the most common enzymopathy in humans. Deficiency alleles for this X-linked disorder are geographically correlated with historical patterns of malaria, and the most common deficiency allele in Africa (G6PD A-) has been shown to confer some resistance to malaria in both hemizygous males and heterozygous females. We studied DNA sequence variation in 5.1 kb of G6pd from 47 individuals representing a worldwide sample to examine the impact of selection on patterns of human nucleotide diversity and to infer the evolutionary history of the G6PD A- allele. We also sequenced 3.7 kb of a neighboring locus, L1cam, from the same set of individuals to study the effect of selection on patterns of linkage disequilibrium. Despite strong clinical evidence for malarial selection maintaining G6PD deficiency alleles in human populations, the overall level of nucleotide heterozygosity at G6pd is typical of other genes on the X chromosome. However, the signature of selection is evident in the absence of genetic variation among A- alleles from different parts of Africa and in the unusually high levels of linkage disequilibrium over a considerable distance of the X chromosome. In spite of a long-term association between Plasmodium falciparum and the ancestors of modern humans, patterns of nucleotide variability and linkage disequilibrium suggest that the A- allele arose in Africa only within the last 10,000 years and spread due to selection.
With the completion of the first drafts of the human genome (International Human Genome Sequencing Consortium 2001; Venter et al. 2001), considerable attention is now focused on understanding the levels and patterns of nucleotide variation among individuals. An accurate description of this variation is important for understanding processes of molecular evolution, for identifying disease genes, and for making inferences about the origin and history of Homo sapiens. A number of studies have described patterns of nucleotide variability in relatively large samples of individuals (Harding et al. 1997; Clark et al. 1998; Deinard and Kidd 1998; Harris and Hey 1999, 2001; Jaruzelska et al. 1999; Kaessmann et al. 1999; Rana et al. 1999; Rieder et al. 1999; Fullerton et al. 2000; Gilad et al. 2000; Hamblin and Di Rienzo 2000; Nachman and Crowell 2000; Zhao et al. 2000; Alonso and Armour 2001; Yu et al. 2001), and a large public effort recently identified and mapped >1 million single-nucleotide polymorphisms (SNPs; International SNP Map Working Group 2001). These studies have generally focused on regions of the genome in which positive natural selection is believed to be a negligible force and, as such, provide a baseline for average patterns of genomic variability. However, selection may have been an important force in shaping human genetic variation. Selection can have a powerful effect on patterns of linkage disequilibrium (LD), levels of heterozygosity, and frequencies of alleles segregating in a population, and these effects may extend to linked sites at considerable distances from the targets of selection (Hudson 1990, 1996). One way to study the impact of selection in shaping nucleotide variability is to look at regions of the genome in which the strength and form of selection are known and in which the connections from genotype to phenotype to environment are well understood.
The X-linked gene coding for glucose-6-phosphate dehydrogenase (G6PD) is subject to malarial selection in some human populations. The normal G6PD enzyme catalyzes a critical step in the pentose monophosphate shunt of glycolysis, and individuals with dysfunctional G6PD may suffer clinical manifestations that include hemolytic anemia and neonatal jaundice (Beutler 1994). Some human populations exhibit G6PD deficiency alleles at frequencies that range up to 65% (Livingstone 1985; Oppenheim et al. 1993). In general, there is a geographic correlation between the frequency of G6PD deficiency alleles and the historical prevalence of malaria globally (Allison 1960; Motulsky 1961; Oppenheim et al. 1993). Moreover, in vitro studies (Roth et al. 1983; Roth and Schulman 1988) and epidemiological evidence (Ruwende et al. 1995) indicate that G6PD deficiency confers some resistance to Plasmodium falciparum, the primary human malaria parasite.
The most common G6PD deficiency allele in sub-Saharan Africa is G6PD A-, and it typically reaches frequencies near 20% in populations living in malarial areas (Livingstone 1985). The A- allele differs from the normal allele (G6PD B) by nonsynonymous changes at coding nucleotide positions 202 and 376. A minor deficiency allele, G6PD A+, differs from the B allele only at site 376 (Figure 1). The enzymatic activities of the A+ and A- alleles are 85 and 12% of normal levels, respectively (Hirono and Beutler 1988; Beutler et al. 1989; Vulliamy et al. 1991). The mild deficiency phenotype characteristic of G6PD A+ does not cause significant clinical manifestations and does not appear to confer resistance to malaria (Ruwende et al. 1995). However, the deficiency phenotype characteristic of G6PD A- confers an ∼50% reduction in risk of severe malaria in both heterozygote females and hemizygote males. Homozygous females probably have a similar level of protection from malaria, although this genotype is quite rare (Ruwende et al. 1995). In the presence of falciparum malaria, the G6PD A- allele is therefore beneficial, while in the absence of malaria this allele is deleterious. Thus G6pd provides a rare example of a gene in humans where the selective agent and approximate form and strength of selection are known (Ruwende et al. 1995; Tishkoff et al. 2001).
As part of an ongoing project to characterize patterns of nucleotide variability at multiple loci throughout the genome for a common worldwide sample of human DNAs and to investigate the impact of selection on G6pd, we sequenced 5.1 kb of G6pd in a sample of 47 humans (Table 1). We also sequenced 3.7 kb at L1cam in these same individuals. L1cam is situated 556 kb from G6pd; thus, polymorphisms at L1cam provide an opportunity to investigate the impact of selection on neighboring regions. Our results suggest that the effects of selection on G6pd are more subtle than those predicted under a model of long-term diversifying selection.
MATERIALS AND METHODS
Samples: DNA sequences were determined in a sample of 41 human males, including 10 from Africa, 10 from the Americas, 10 from Europe, and 11 from Asia and Melanesia (Table 1). This sample was chosen as part of a long-term project in our labs to survey nucleotide variability at a number of loci throughout the genome using a common set of individuals (e.g., Nachman et al. 1998; Nachman and Crowell 2000; M. W. Nachman and M. F. Hammer, unpublished data). However, since G6PD A- alleles are primarily found in Africa and since the effects of selection at G6pd are likely to be found primarily in Africa, we augmented our worldwide sample with 4 additional African individuals that were known [by restriction fragment length polymorphism (RFLP) analysis] to carry G6PD A- alleles and 2 individuals known to carry G6PD A+ alleles. This allowed us to investigate patterns of variability within G6PD A- alleles and to study LD between G6PD A- alleles and other alleles. Homologous sequences from a male chimpanzee (Pan troglodytes) and a male orangutan (Pongo pygmaeus) were also determined for divergence estimates. By studying X-linked loci in males we were able to PCR amplify single alleles and to directly recover haplotypes over long genomic distances to study patterns of linkage disequilibrium.
PCR amplification and sequencing: Maps of the human X chromosome and the loci sampled in this study, G6pd and L1cam, are presented in Figure 1. L1cam was chosen because of its proximity to G6pd (556 kb); all polymorphisms detected at L1cam are silent or noncoding, and there is no a priori reason to assume that L1cam itself is a target of selection. Approximately 82 other genes are found within 1 Mb on either side of G6pd and none of these genes are known to be recent targets of positive selection. PCR fragments were amplified for G6pd (5.2 kb) and L1cam (4.2 kb) using a long-template PCR system (Roche Biochemicals). For G6pd, the primers Gf (5′ GTT TAT GTC TTC TGG GTC AGG GAT GG 3′) and Gr (5′ AGT GTT GCT GGA AGT CAT CTT GGG T 3′) are positioned with the 5′ end of the primer at sites 206322 and 201052, respectively, in GenBank accession no. L44140. For L1cam, the primers Lf (5′ TCC TCT CCA GAG TAG CCG ATA GTG ACC 3′) and Lr (5′ AAG TTT CTA CTG GCC TGA CCC TCT CG 3′) are positioned with the 5′ end of the primer at sites 19587 and 24251, respectively, in GenBank accession no. U52112 (Figure 1). Internal primers (available upon request) were used to generate overlapping sequence runs on an ABI 377 automated sequencer. A contiguous sequence that included coding and noncoding regions (5109 and 3691 bp for G6pd and L1cam, respectively) was assembled for each individual and aligned using the computer program Sequencher (Gene Codes, Ann Arbor, MI). Sequences have been submitted to GenBank under accession nos. AY158094-AY158142 and AY167680-AY167728 for G6pd and L1cam, respectively.
Data analysis: Nucleotide diversity, π (Nei and Li 1979), and the proportion of segregating sites, Θ (Watterson 1975), were calculated using the program PROSEQ (Filatov et al. 2000) for the worldwide sample, for African individuals, and for non-African individuals. Only the 41 individuals of the nonaugmented worldwide sample were included in analyses of nucleotide diversity, and insertion-deletion polymorphisms were excluded. Under neutral equilibrium conditions both π and Θ estimate the neutral parameter 3Neμ for X-linked loci, where Ne is the effective population size and μ is the neutral mutation rate. Tajima’s D (Tajima 1989), Fu and Li’s D (Fu and Li 1993), and Fay and Wu’s H (Fay and Wu 2000; http://crimp.lbl.gov/htest.html) were calculated to test for deviations from a neutral equilibrium frequency distribution for both loci. Ratios of polymorphism to divergence for G6pd and L1cam were compared with the expectations under a neutral model using the Hudson-Kreitman-Aguadé (HKA) test (Hudson et al. 1987). Polymorphism data for these tests were derived from the 41 sequences determined in this study for G6pd and L1cam, as well as from Dmd (intron 44) from the same set of individuals (Nachman and Crowell 2000) and from the Pdha1 data of Harris and Hey (1999). Dmd and Pdha1 were chosen for comparison because they both reside in regions of the X chromosome with moderate to high rates of recombination and thus are expected to be relatively free of the effects of selection at linked sites. Divergence data were derived for each of these loci by comparing the homologous sequences from a chimpanzee to a single randomly chosen human allele. LD between pairs of polymorphic sites was measured using the statistics D′ (Lewontin 1964) and $r^2$ (Hill and Robertson 1968). The age of the G6PD A- allele was estimated from the decay of linkage disequilibrium and from coalescent simulations using the computer program GENETREE (Harding et al. 1997; Bahlo and Griffiths 2000).
The SWST haplotype test of Andolfatto et al. (1999) was implemented using the data from G6pd and L1cam separately. This test compares the observed number of haplotypes with those expected under a neutral model with a specified rate of recombination.
RESULTS
Nucleotide diversity: Patterns of nucleotide variability at G6pd and L1cam are presented in Tables 1 and 2. In the worldwide sample of 41 chromosomes (nonaugmented sample) we observed 18 single-nucleotide polymorphisms and three insertion/deletion (indel) polymorphisms at G6pd. Fifteen of these polymorphisms were in introns; of the remaining 6 polymorphisms, 2 were nonsynonymous changes (coding sites 202 and 376) and 4 were synonymous changes. Levels of nucleotide variability were roughly four times higher in Africa than in non-African populations (Table 2), consistent with other studies that demonstrate higher diversity in Africa (e.g., Harris and Hey 1999, 2001; Nachman and Crowell 2000). Many of the polymorphisms found in Africa distinguish G6PD A- alleles from all other alleles in the sample. At L1cam we observed 7 polymorphisms in the nonaugmented sample. Levels of nucleotide variability were relatively low for L1cam overall; however, nucleotide variability was higher in Africa than in non-African populations.
In the worldwide sample of 41 chromosomes, two A- alleles were in the African subset (n = 10), consistent with previously documented frequencies of ∼20% for G6PD A- in sub-Saharan Africa. Overall, worldwide levels of nucleotide variability at G6pd and L1cam were close to or slightly below average values for other regions of the genome. For example, among primarily noncoding sites at 12 X-linked genes in humans, the average level of nucleotide diversity (π) is 0.06% and the average proportion of segregating sites (Watterson’s Θ) is 0.07% (Nachman 2001). For both G6pd and L1cam, nucleotide diversity at intron sites is slightly below average (G6pd π = 0.04%, L1cam π = 0.02%), while Watterson’s Θ is close to average (G6pd Θ = 0.08%, L1cam Θ = 0.07%). Since the A- allele represents only 5% of the worldwide sample, it is not expected to contribute substantially to levels of nucleotide variability. Within Africa, however, G6PD A- is present at high frequency (20%), yet overall levels of nucleotide variability (π = 0.08%, Table 2) are still average. For example, the average level of nucleotide variability for 8 X-linked genes in Africa is 0.084% (Payseur and Nachman 2002).
Tests of neutrality: Tajima’s D is the normalized difference between π and Θ and takes on positive values when there is an excess of intermediate-frequency polymorphisms and takes on negative values when there is an excess of low-frequency polymorphisms (Tajima 1989). Positive Tajima’s D values are generally consistent with long-term balancing selection or a population contraction, while negative values are expected following a selective sweep or a population expansion. For G6pd, Tajima’s D is negative (but not significant) in the worldwide sample and for all subsets of the data (Table 2). Similar results are obtained with Fu and Li’s D (Fu and Li 1993), which also measures the frequency distribution of polymorphisms and is sensitive to the number of singletons in the sample (Table 2). For L1cam both statistics are also negative and are significantly negative in the worldwide sample (Table 2). Fay and Wu’s H statistic (Fay and Wu 2000) utilizes the frequency distribution of polymorphisms to test for an excess of high-frequency-derived variants compared to equilibrium neutral expectations. For both G6pd and L1cam, Fay and Wu’s H test shows no significant deviation from the neutral expectation in the worldwide sample, the African sample, or the non-African sample.
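The relationship among the number of segregating sites, the mean number of pairwise differences, Watterson’s Θ, and Tajima’s D can be illustrated with a minimal Python sketch. It is not the PROSEQ implementation used in this study; the function name `diversity_stats` and the toy haplotypes are hypothetical, and the constants follow the standard definitions in Tajima (1989). Values are per sequence; dividing by the alignment length would give per-site estimates.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return (S, k, theta_w, D) for aligned, equal-length haplotype strings.

    S: number of segregating sites; k: mean number of pairwise differences;
    theta_w: Watterson's estimator (per sequence); D: Tajima's D, or None
    when it is undefined (no segregating sites or zero variance term).
    """
    n = len(haplotypes)
    # A site is segregating if its alignment column holds more than one state.
    S = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    # Average the number of differences over all pairs of sequences.
    pairs = list(combinations(haplotypes, 2))
    k = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = S / a1

    # Constants for the variance of (k - S/a1) from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    if S == 0 or variance <= 0:
        return S, k, theta_w, None
    return S, k, theta_w, (k - theta_w) / sqrt(variance)


# Hypothetical toy alignments (not data from this study).
intermediate = ["AA", "AT", "TA", "TT"]   # variants at intermediate frequency
singleton = ["AA", "AA", "AA", "AT"]      # a single low-frequency variant

print(diversity_stats(intermediate))
print(diversity_stats(singleton))
```

In this toy example the alignment with intermediate-frequency variants yields a positive D and the alignment with a single singleton yields a negative D, matching the qualitative interpretation given above.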
We performed an HKA test (Hudson et al. 1987) using pairwise comparisons of polymorphism and divergence for G6pd and L1cam and two other X-linked genes, Pdha1 and Dmd. In comparisons using worldwide samples or African samples alone, we failed to reject the null model (HKA $\chi^2$ < 3.0, P > 0.1 for all tests). Thus, neither the frequency spectrum nor the level of heterozygosity at G6pd fits the expected pattern of nucleotide variability under a simple model of long-standing diversifying or balancing selection.
To test whether the haplotype structure of the data deviates from neutral expectations we implemented the SWST program as described in Andolfatto et al. (1999), assuming recombination rates of 0, 1, and 2 cM/Mb. Tests were performed separately for G6pd and for L1cam. None of these tests showed significant deviations from neutral expectations using the worldwide sample or the African sample alone.
Linkage disequilibrium: To better examine patterns of linkage disequilibrium we augmented our random sample of 10 African X chromosomes with 4 chromosomes carrying A- alleles and 2 chromosomes carrying A+ alleles. Thus the augmented African sample in the study includes 6 chromosomes carrying G6pd A- alleles from South Africa, Central Africa, and West Africa (samples YCC 9, YCC 32, G11, M115, M241, and S823 in Table 1). Unusually high levels of linkage disequilibrium were observed within G6pd, within L1cam, and between G6pd and L1cam. D′ is a measure of linkage disequilibrium that is standardized to equal 0 when there is random association among polymorphisms (i.e., no disequilibrium) and to equal 1 when there is complete association among polymorphisms (i.e., complete disequilibrium). In all comparisons between A- alleles and other alleles, D′ = 1 for all sites in Table 1. A single most parsimonious haplotype network was inferred for all sites at G6pd (Figure 2), indicating a lack of evidence for recombination in this sample despite the fact that Xq28 is a genomic region with moderate to high rates of recombination (Payseur and Nachman 2000). Surprisingly, the three polymorphic sites at L1cam at intermediate frequency (positions 776, 885, and 2115) can also be mapped on this same network with no homoplasy. The observed high level of linkage disequilibrium is primarily a consequence of mutations falling on the branch separating the A- deficiency allele from the normal B alleles (Figure 2). A Fisher’s exact test revealed significant LD (P = 0.0082) between site 202 of G6pd and three out of four informative polymorphisms at L1cam (alignment positions 776, 885, and 2115 at L1cam; Table 3).
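As an illustration of how D′, $r^2$, and a Fisher’s exact test relate to two-locus haplotype counts, the following Python sketch computes all three from a 2 × 2 table. The function names are hypothetical, and the counts in the example are an assumption chosen to match the verbal description of the augmented African sample (16 chromosomes, 6 carrying A-, 4 of which carry the L1cam motif); they are not taken from Table 3.

```python
from math import comb


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D_prime, r2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at locus 2
    D = n_AB / n - p_A * p_B         # raw disequilibrium coefficient
    # Largest |D| attainable given the allele frequencies (Lewontin 1964).
    if D >= 0:
        D_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        D_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    D_prime = abs(D) / D_max
    r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D_prime, r2


def fisher_upper_tail(n_AB, n_Ab, n_aB, n_ab):
    """One-sided Fisher's exact P: probability of n_AB or more AB haplotypes
    under the hypergeometric distribution with the observed margins."""
    row1 = n_AB + n_Ab
    col1 = n_AB + n_aB
    n = n_AB + n_Ab + n_aB + n_ab
    total = comb(n, col1)
    upper = min(row1, col1)
    return sum(
        comb(row1, x) * comb(n - row1, col1 - x) for x in range(n_AB, upper + 1)
    ) / total


# Assumed counts: A- with motif, A- without motif, other with motif, other without.
counts = (4, 2, 0, 10)
d_prime, r2 = ld_stats(*counts)
p_value = fisher_upper_tail(*counts)
print(d_prime, r2, p_value)
```

Because only three of the four gametic types are present in these assumed counts, D′ equals 1 while $r^2$ remains below 1, which is the property of $r^2$ that makes it the more informative statistic when LD is used to date an allele.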
Age of the G6PD A- allele: We estimated the age of the A- allele in two ways. First, we used a standard model for the decay of linkage disequilibrium as a function of time (t) and recombination (c), where linkage disequilibrium at time t ($r^2_t$) compared with time 0 ($r^2_0$) is given by $r^2_t / r^2_0 = (1 - c)^t$ (Hedrick 1998). For this calculation, we use $r^2$ as a measure of linkage disequilibrium between L1cam and G6pd because, unlike D′, $r^2$ is sensitive to allele frequencies when only three out of four gametic types are present in a sample (Hedrick 1998). Assuming that linkage disequilibrium between site 202 of G6pd A- and positions 776, 885, and 2115 of L1cam alleles was complete at the time of origin of the A- allele (i.e., $r^2_0 = 1$), we can estimate the time in generations, $t = \ln(r^2_t)/\ln(1 - c)$, for the age of the allele, given the observed recombinational distance between L1cam and G6pd and the observed linkage disequilibrium in the data ($r^2 = 0.52$). Since it is possible that $r^2_0$ was <1.0 between these sites when the G6PD A- allele arose, our estimates provide an upper bound for the age of the G6PD A- allele. This region of Xq28 is subject to moderate levels of recombination in general (1-3 cM/Mb; Payseur and Nachman 2000; Kong et al. 2002), and recombination rates near G6pd and L1cam specifically have been estimated as low as 0.14 cM/Mb and as high as 2 cM/Mb (Small et al. 1997). Using these recombination rates we estimated the maximum age of the A- allele to be between 58 and 840 generations (Figure 3). With a generation time of 25 years, this implies that the G6pd A- allele arose 1461-20,994 years ago.
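The arithmetic behind this estimate can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes that the recombination fraction c can be approximated as the map distance in centimorgans divided by 100, which is reasonable only for small distances such as the 0.556 Mb separating the two loci.

```python
from math import log


def allele_age_generations(r2_observed, cM_per_Mb, distance_Mb, r2_initial=1.0):
    """Solve r2_t / r2_0 = (1 - c)**t for t, the age in generations."""
    # Assumption: for short distances, 1 cM corresponds to c = 0.01.
    c = cM_per_Mb * distance_Mb / 100.0
    return log(r2_observed / r2_initial) / log(1.0 - c)


DISTANCE_MB = 0.556       # distance between G6pd and L1cam (556 kb)
R2_OBSERVED = 0.52        # observed LD between site 202 and the L1cam sites
GENERATION_YEARS = 25

for rate in (2.0, 0.14):  # high and low recombination estimates, cM/Mb
    t = allele_age_generations(R2_OBSERVED, rate, DISTANCE_MB)
    print(f"{rate} cM/Mb: {t:.0f} generations, {t * GENERATION_YEARS:.0f} years")
```

Passing an `r2_initial` below 1 shortens the estimated age, which is why the values obtained with complete initial disequilibrium are treated as upper bounds.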
A second estimate for the age of the A- allele was obtained from simulations using a coalescent model conditioned on the sample size and observed levels of nucleotide variability (GENETREE: Harding et al. 1997; Bahlo and Griffiths 2000). This model assumes neutral equilibrium conditions and thus may provide an overestimate of the true age of the A- allele (since the present frequency of A- has probably been determined in large measure by selection). These simulations suggest that the A- allele arose 10,575 years ago (SD ± 8887 years).
Both of these estimates are in good agreement with an independent estimate for the age of the G6PD A- allele (3840-11,760 years) that was reported by Tishkoff et al. (2001) on the basis of intra-allelic levels of linked microsatellite variability.
DISCUSSION
Models of selection and nucleotide variability at G6pd: We investigated levels and patterns of nucleotide variability at G6pd, a locus known to be under malarial selection in some human populations, and found that nucleotide diversity was similar to average values for other X-linked genes. Moreover, several commonly employed statistical tests based on DNA sequence variation failed to reject a simple neutral model of molecular evolution. In several respects, however, the data from G6pd are quite striking: levels of linkage disequilibrium are high and extend over a long genomic distance, much of the nucleotide variation is partitioned between functionally distinct alleles, and no nucleotide variation is observed within deficiency alleles. Below we discuss general models of selection for G6pd and how our observations might fit these models.
Although four different species of Plasmodium typically infect humans, P. falciparum is the most virulent species and is responsible for most malaria-related deaths, especially in Africa (Schmidt and Roberts 1996). Malaria is endemic throughout most of sub-Saharan Africa where >1 million people die each year due to complications from infection (Trigg and Konrachine 1998). From a population genetics perspective, such a virulent parasite serves as a strong selective agent for genetic resistance. In fact, it has long been known that African populations exhibit genetic resistance factors to malaria at relatively high frequencies compared with non-African populations (e.g., Miller 1994). Moreover, many of the mutations that confer resistance are deleterious outside of the malaria environment. G6PD A-, for example, has an enzymatic activity that is only about one-tenth of normal and results in significant clinical manifestations such as hemolytic anemia and neonatal jaundice (Hirono and Beutler 1988; Beutler et al. 1989; Vulliamy et al. 1991). However, this allele also confers an ∼50% reduction in risk of severe malaria in both heterozygote females and hemizygote males (Ruwende et al. 1995).
Although G6PD is often assumed to be subject to balancing selection (sensu heterosis; e.g., Tishkoff et al. 2001), the precise nature of selection on G6PD deficiency alleles is not fully understood. In the absence of malaria, deficiency alleles are at a selective disadvantage and are expected to be eliminated (Table 4, fitness array 1). In the presence of malaria, female heterozygotes and male deficiency hemizygotes appear to have a selective advantage over wild-type individuals, but the fitness of female deficiency homozygotes relative to other genotypes is not clear (Ruwende et al. 1995). If female heterozygotes have a higher fitness than either homozygote, then selection may maintain both A- and wild-type alleles in populations under malarial selection (i.e., heterosis; Table 4, fitness array 2). However, the conditions for maintenance of a stable X-linked polymorphism are rather restrictive; either selection must be of similar magnitude but opposite in direction for the two sexes (which seems unlikely for G6PD deficiency) or there must be heterosis in females without a large fitness difference between the two male genotypes (Hedrick 1998). Alternatively, if female deficiency homozygotes have the same fitness as male hemizygotes and female heterozygotes, then selection should drive the eventual fixation of the G6PD A- allele in populations subject to continuous malarial selection (Table 4, fitness array 3). In such a situation, the A- allele is expected to rise to high frequencies and to reach fixation in a very short period of time (e.g., <10,000 years; Ruwende et al. 1995). The exact time required for fixation depends on assumptions about population size, initial frequency of the A- allele, relative fitness of the different genotypes, and the average generation time. However, for a wide range of parameter values, allele frequencies are expected to quickly rise to very high levels.
The observation that most African populations have A- allele frequencies <20% (Livingstone 1985) is inconsistent with a simple model of directional selection in which selection has been strong and long acting.
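The qualitative contrast between directional selection and female heterosis at an X-linked locus can be sketched with the standard deterministic recursion for allele frequencies in eggs and X-bearing sperm. This is a minimal illustration only: the function names are hypothetical, and the fitness values below are arbitrary assumptions rather than the entries of Table 4 or estimates from epidemiological data.

```python
def x_linked_step(p_f, p_m, w_female, w_male):
    """Advance one generation.

    p_f, p_m: frequency of the deficiency allele in eggs and X-bearing sperm.
    w_female: fitnesses of (deficiency homozygote, heterozygote, normal homozygote).
    w_male:   fitnesses of (deficiency hemizygote, normal hemizygote).
    """
    w_dd, w_dn, w_nn = w_female
    w_d, w_n = w_male
    # Daughters receive one X from each parent.
    f_dd = p_f * p_m
    f_dn = p_f * (1 - p_m) + (1 - p_f) * p_m
    f_nn = (1 - p_f) * (1 - p_m)
    mean_f = f_dd * w_dd + f_dn * w_dn + f_nn * w_nn
    p_f_next = (f_dd * w_dd + 0.5 * f_dn * w_dn) / mean_f
    # Sons receive their single X from the mother.
    mean_m = p_f * w_d + (1 - p_f) * w_n
    p_m_next = p_f * w_d / mean_m
    return p_f_next, p_m_next


def simulate(p0, w_female, w_male, generations):
    """Return the population-wide allele frequency after each generation."""
    p_f = p_m = p0
    trajectory = []
    for _ in range(generations):
        p_f, p_m = x_linked_step(p_f, p_m, w_female, w_male)
        # Females carry two-thirds of the X chromosomes in the population.
        trajectory.append((2 * p_f + p_m) / 3)
    return trajectory


# Hypothetical fitness values, chosen only for illustration.
directional = simulate(0.001, (1.05, 1.05, 1.0), (1.05, 1.0), 400)
heterosis = simulate(0.01, (0.8, 1.1, 1.0), (1.0, 1.0), 2000)
print(directional[-1], heterosis[-1])
```

Under these assumed values, an allele that is advantageous in all carriers passes through intermediate frequencies well within 400 generations (10,000 years at 25 years per generation), whereas heterosis in females with no fitness difference between male genotypes holds the allele near an intermediate equilibrium.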
Thus the best explanation for current G6PD A- allele frequencies seems to be either heterosis (fitness array 2) or some form of spatially and/or temporally varying selection due to malaria, in which case allele frequencies may be determined primarily by changing selection pressures (i.e., a combination over time or space of fitness array 1 and fitness array 2 and/or 3 in Table 4). On a large geographic scale (e.g., among continents), spatially varying selection is clearly important in determining allele frequencies; the extent to which this applies to small geographic scales is less clear, although the frequency of the A- allele differs significantly among different populations in sub-Saharan Africa (Cavalli-Sforza et al. 1996; Tishkoff et al. 2001). While we cannot distinguish between heterosis and spatially/temporally varying selection, our data do allow us to address the timescale over which selection has acted.
A simple model of long-term balancing selection or long-term spatially or temporally varying selection is expected to leave a distinct signature in patterns of DNA sequence variation (Figure 4). When a new advantageous mutation first appears (Figure 4a), it will rise in frequency, creating LD with other mutations on the haplotype on which it arose (Figure 4b). This transient phase will result in lowered levels of heterozygosity. Over time, linkage disequilibrium will decay through recombination around the target of selection, and heterozygosity will increase near the target of selection (Figure 4c). This simple model of long-term selection predicts elevated levels of nucleotide variability in a restricted window around the target of selection (Hudson et al. 1987) and a skew in the frequency distribution of polymorphisms with an excess of intermediate-frequency variants within this restricted window (Tajima 1989). Both of these patterns are seen in several other well-studied systems. For example, at Mhc loci in a variety of organisms [the human leukocyte antigen (HLA) loci in humans], levels of heterozygosity are significantly higher than those in neighboring regions (Takahata et al. 1992). At Adh in Drosophila melanogaster, heterozygosity is elevated around the fast/slow allozyme polymorphism, resulting in a significant HKA test (Hudson et al. 1987).
In contrast, patterns of nucleotide variability at G6pd do not support either of these predictions with respect to G6pd A-, and several observations suggest that patterns at G6pd fit the model expected in an early stage of selection (Figure 4b). First, overall levels of nucleotide diversity are close to average values for other X-linked loci. This is true for the worldwide sample and, more importantly for evaluating models of selection, it is also true for the African sample alone. An HKA test applied to our data fails to reject a neutral model. Second, there is no evidence for an excess of intermediate-frequency polymorphisms. In fact, both Tajima’s D and Fu and Li’s D are slightly (but not significantly) negative for the African sample (Table 2). Third, we find extensive linkage disequilibrium within and around G6pd, and this disequilibrium is due almost exclusively to nucleotide differences that distinguish the A- allele from other alleles. We observed no recombination events within G6pd. This stands in contrast to many other human nucleotide polymorphism data sets, including intron 44 of Dmd, surveyed in this same set of individuals (Nachman and Crowell 2000), in which numerous recombination events were observed over distances of several hundred bases. In addition to significant LD within G6pd, we found significant LD between G6pd and L1cam (Table 3; D′ = 1 in all comparisons), loci that are separated by ∼550 kb. This amount of LD is much higher than typical values for the human genome. For example, Reich et al. (2001) recently studied the decay of D′ for 19 different genomic regions and found that in a European population the half-length of D′ (the distance at which the average D′ drops below 0.5) is typically 60 kb, while in an African population the half-length of D′ is <5 kb. Other studies have also revealed lower levels of linkage disequilibrium in African populations compared with non-African populations (Tishkoff et al. 1996, 2001).
Interestingly, we observe much higher levels of LD than those previously reported for this region of Xq28 by Taillon-Miller et al. (2000) in populations of European descent. Finally, there is no intra-allelic variation within G6pd A-, consistent with the notion that G6PD A- is relatively young.
Taken together, these observations argue against a model of long-term selection on the G6pd A- allele, but do not allow us to distinguish between recent balancing selection (sensu heterosis), on the one hand, and recent diversity-enhancing (i.e., spatially and/or temporally varying) selection, on the other hand. Better fitness estimates of all genotypes (in particular, female deficiency homozygotes), as well as detailed sampling of G6PD A- frequencies across Africa, might help us to distinguish between these hypotheses.
In contrast to the intra-allelic patterns of nucleotide variability for G6pd A-, the minor deficiency allele G6pd A+ shows a high level of intra-allelic diversity and greater linkage equilibrium. Although our study includes only two A+ chromosomes that represent a single haplotype, at least two additional haplotypes have been identified on the basis of RFLP analyses (Figure 2; Vulliamy et al. 1991). Moreover, microsatellites located up to 19 kb away from G6pd exhibit greater linkage equilibrium and higher diversity on A+ alleles than on A- alleles (Tishkoff et al. 2001). These observations, taken together with a coalescent-based estimate for the age of the mutation at coding position 376 from our study (131,250-174,375 years on the basis of GENETREE analysis), suggest that G6PD A+ may be relatively old. G6PD A+ has an enzymatic activity that is 80% of normal and does not appear to cause a significant clinical condition (Takizawa et al. 1987). Furthermore, G6PD A+ does not seem to currently confer resistance to severe falciparum malaria, as does G6PD A- (Ruwende et al. 1995). However, the age of G6PD A+ coupled with the reduced level of enzymatic activity raises the possibility that this allele has been under selection at some time in the past.
Is it possible that demographic processes are primarily responsible for the high levels of LD seen in Figure 2? Linguistic and archaeological evidence suggests that a Bantu expansion took place in Africa ∼4000 years ago (Excoffier et al. 1987). This range expansion occurred in sub-Saharan Africa primarily from west to east and southward, a distribution that is similar to the current distribution of African populations with elevated G6PD A- allele frequencies. If admixture from this range expansion were responsible for generating the observed LD in our data, we would also expect to see G6PD B alleles with significant LD. This is not observed. Instead, most of the LD in our data is found between sites on functionally different alleles, arguing against any simple demographic explanation. Likewise, no LD is observed between Bantu and non-Bantu individuals from this set of 41 individuals sampled for other loci (e.g., Nachman and Crowell 2000).
One intriguing observation in our data set is the relatively high level of divergence found at L1cam between individuals bearing the G6PD A- allele and all other individuals. Four of the six (66.7%) G6PD A- alleles share a common motif of three polymorphisms in complete linkage disequilibrium (C, T, and T at positions 776, 885, and 2115, respectively; Table 1) while the rest of the segregating sites at L1cam include only four singletons and one doubleton. This pattern along with the significant LD between G6pd and L1cam (Table 3) suggests that the A- mutation arose on a relatively diverged haplotype, possibly as a consequence of population subdivision. Analysis of G6pd and L1cam as well as additional neighboring loci in a larger geographic sample from Africa may shed light on this unusual pattern.
In general, the observations reported here demonstrate that even when selection is relatively strong, its signature on patterns of DNA sequence variation may be subtle, especially if selection is recent. While several of the conventional statistical tests for selection fail to reject the null hypothesis, the footprint of selection is seen in the long-range patterns of LD and in the absence of variation among A- alleles from different parts of Africa. Similar patterns of nucleotide variability at G6pd have also recently been reported by Verrelli et al. (2002). The patterns of DNA sequence variation observed at G6pd are markedly different from those seen at another well-studied target of balancing selection, HLA, where ancient alleles result in substantially elevated levels of polymorphism (Grimsley et al. 1998; Horton et al. 1998; Gaudieri et al. 1999). The spatial and temporal scales over which selection pressures have shaped human genomic diversity are still largely unknown, but environmental changes associated with the transition from the Paleolithic to the Neolithic may have imposed substantial new selection pressures on humans, suggesting that patterns of nucleotide variability similar to those documented here for G6pd may be found at other loci.
Age of G6PD A- and the evolution of malarial resistance: These results have important implications for the evolution of resistance to malaria in humans. Several observations reported here, including average levels of nucleotide variability at G6pd, negative values of Tajima’s D, high levels of linkage disequilibrium between G6pd and L1cam, and complete absence of variation among G6pd A- alleles from different parts of Africa, suggest that the A- allele is young (Tables 1 and 2). A recent study based on microsatellite haplotype diversity (Tishkoff et al. 2001) also suggests that the G6pd A- allele arose recently, within the past 4000-12,000 years. A recent phylogeny of primate malaria parasites indicates that P. falciparum is closely related to P. reichenowi, a chimpanzee parasite. Moreover, cytochrome b sequence divergence between P. falciparum and P. reichenowi suggests a divergence time of 4-5 million years ago (Escalante et al. 1998), in good agreement with the estimated time of the human-chimpanzee divergence. The discrepancy between this date and the recent origin of the G6pd A- allele raises the possibility that P. falciparum has been a human parasite for most of the evolutionary history of H. sapiens, but that the parasite’s current level of virulence has evolved only recently (Rich et al. 1998). The estimated age of the A- allele agrees well with the spread of agriculture throughout sub-Saharan Africa (Waters et al. 1991; Cavalli-Sforza et al. 1996) and suggests that changes in human lifeway may have contributed to increased transmission and/or increased virulence of P. falciparum, perhaps through an increase in the density and mobility of Anopheles mosquitoes that serve as vectors in transmission of malaria.
We thank J. D. Jensen and S. Peterson for technical assistance. Human DNA samples M115 and M241 were kindly donated by L. Luzzatto and K. Nafa. R. O. Ryder provided chimpanzee and orangutan samples. R. M. Harding, L. Luzzatto, E. Beutler, B. A. Payseur, E. T. Wood, C. C. Campbell, and A. J. Redd provided helpful discussion. We also thank R. G. Harrison and two anonymous reviewers who provided helpful comments about the manuscript. This work was supported by a National Science Foundation (NSF) grant to M.W.N. and M.F.H. and an NSF predoctoral fellowship to M.A.S.
Communicating editor: R. G. Harrison
- Received January 14, 2002.
- Accepted September 18, 2002.
- Copyright © 2002 by the Genetics Society of America