Abstract
The Duchenne muscular dystrophy (Dmd) locus lies in a region of the X chromosome that experiences a high rate of recombination and is thus expected to be relatively unaffected by the effects of selection on nearby genes. To provide a picture of nucleotide variability at a high-recombination locus in humans, we sequenced 5.4 kb from two introns of Dmd in a worldwide sample of 41 alleles from Africa, Asia, Europe, and the Americas. These same regions were also sequenced in one common chimpanzee and one orangutan. Dramatically different patterns of genetic variation were observed at these two introns, which are separated by >500 kb of DNA. Nucleotide diversity at intron 44 (π = 0.141%) was more than four times higher than nucleotide diversity at intron 7 (π = 0.034%) despite similar levels of divergence for these two regions. Intron 7 exhibited significant linkage disequilibrium extending over 10 kb and also showed a significant excess of rare polymorphisms. In contrast, intron 44 exhibited little linkage disequilibrium and no skew in the frequency distribution of segregating sites. Intron 7 was much more variable in Africa than in other continents, while intron 44 displayed similar levels of variability in different geographic regions. Comparison of intraspecific polymorphism to interspecific divergence using the HKA test revealed a significant reduction in variability at intron 7 relative to intron 44, and this effect was most pronounced in the non-African samples. These results are best explained by positive directional selection acting at or near intron 7 and demonstrate that even genes in regions of high recombination may be influenced by selection at linked sites.
Identifying the forces shaping genetic variation in natural populations remains a key problem in population genetics. Surprisingly, our understanding of the amount and structure of genetic variation at the nucleotide level in humans is still in its early stages. Mutation, migration, drift, recombination, selection at individual loci, the effects of selection at linked sites, and demographic history undoubtedly all play a role in shaping patterns of human genetic variation, although the relative importance of these different factors is not yet clear. Significant progress on this problem has been made with recent studies of nucleotide variation at β-globin (Harding et al. 1997), dystrophin (Zietkiewicz et al. 1997, 1998), lipoprotein lipase (Clark et al. 1998), introns of seven X-linked genes (Nachman et al. 1998), pyruvate dehydrogenase E1 α subunit (Harris and Hey 1999), angiotensin converting enzyme (Rieder et al. 1999), a noncoding region at Xq13.3 (Kaessmann et al. 1999), the X-chromosome-specific zinc-finger protein (Jaruzelska et al. 1999), and the melanocortin 1 receptor (Rana et al. 1999; Harding et al. 2000). Collectively, these studies have shown that the average level of nucleotide diversity in humans is quite low, largely confirming a result first obtained by Li and Sadler (1991). However, there is also substantial heterogeneity in levels and patterns of genetic variation among loci, and a central challenge now is to explain these differences.
Theoretical studies show that the interaction of selection and recombination can have a dramatic effect on levels of nucleotide variability, either through the fixation of advantageous mutations (i.e., genetic hitchhiking; Maynard Smith and Haigh 1974) or the removal of deleterious mutations (i.e., background selection; Charlesworth et al. 1993). Both processes are expected to reduce levels of neutral genetic variation in genomic regions with low rates of recombination. Estimates of recombination rate for different genomic regions can be obtained by comparing the genetic and physical locations of markers. In humans, there is evidence for both local and large-scale variation in the recombinational landscape. For example, several studies have revealed recombinational hotspots, suggesting that recombination rates may vary substantially over a scale of several kilobases (e.g., Oudet et al. 1992; Harding et al. 1997). A common large-scale pattern is the suppression of recombination near centromeres of metacentric chromosomes (e.g., Nagaraja et al. 1997). Variation at both of these scales is likely to be important in determining the effects of selection at linked sites. There is good evidence for a positive correlation between regional recombination rate and levels of nucleotide heterozygosity in Drosophila melanogaster (Begun and Aquadro 1992; Aquadro et al. 1994; Moriyama and Powell 1996) and weaker evidence for a positive association between recombination rate and levels of nucleotide variability in a number of other organisms (Nachman 1997; Dvorák et al. 1998; Kraft et al. 1998; Stephan and Langley 1998), including humans (Nachman et al. 1998; Przeworski et al. 2000).
Motivated by theoretical expectations concerning the effects of selection on linked neutral variation and the empirical evidence suggesting that such effects may be common, we were interested in documenting patterns of nucleotide variability at a gene that experiences a very high rate of recombination in humans. In principle, high-recombination genes are least likely to be affected by selection at linked sites and are thus more likely to reflect neutral, equilibrium conditions.
Dystrophin is the protein product of the Duchenne muscular dystrophy (Dmd) locus. Duchenne muscular dystrophy is a common inherited disease with an incidence worldwide of 1 in 3500 births, many of which arise from new mutations. The Dmd locus is ~2.4 Mb long and consists of 79 exons that encode a 14-kb transcript. This mRNA codes for a 3685-amino-acid protein of 427 kD that shows similarity to several cytoskeletal proteins. Dmd is X-linked and lies in a genomic region experiencing high rates of recombination. Fine-scale mapping of this region reveals overall recombination frequencies of 12 cM across 2 Mb of DNA (Abbs et al. 1990; Oudet et al. 1992). This overall rate, 6 cM/Mb, is about six times the average value of ~1 cM/Mb across the human genome. Oudet et al. (1992) documented considerable heterogeneity in recombination frequencies in different regions of the Dmd gene and found that some regions experience recombination rates >10 cM/Mb (Figure 1). Previous studies have surveyed worldwide variation in intron 44 of Dmd using single-strand conformation polymorphisms (SSCPs) to detect mutations (Zietkiewicz et al. 1997, 1998) or through direct DNA sequencing (Nachman et al. 1998). Zietkiewicz et al. (1997, 1998) screened 7622 bp in a worldwide sample of 250 chromosomes but may not have uncovered all of the underlying variation since their study was based on mutation detection using SSCP. Nachman et al. (1998) surveyed 1537 bp in 10 individuals by sequencing all sites.
Here, we further investigate patterns of genetic variation at two introns (7 and 44) of Dmd in a global sample of 41 alleles and find strikingly different patterns of genetic variation in each region. Both of these introns experience recombination rates well above the genomic average and are expected to be relatively free of the effects of selection at linked sites. Nonetheless, the contrasting patterns of variation at these two introns suggest that recent directional selection has acted at or near intron 7 of Dmd.
Figure 1. Map of Dmd. (A) Recombination rate estimates for different portions of the Dmd locus (data from Oudet et al. 1992). (B) Physical map of Dmd. (C) Position of exons. (D) Amplified regions of introns 7 and 44 in this study. Arrows denote PCR primer positions.
MATERIALS AND METHODS
Samples: Forty-one men were sampled, including 10 from Africa, 10 from Europe, 11 from Asia (including one from Melanesia), and 10 from the Americas. Human genomic DNAs were provided by Dr. M. F. Hammer from the Y chromosome consortium (YCC) DNA repository. A single male common chimpanzee (Pan troglodytes) and a single male orangutan (Pongo pygmaeus) were also surveyed from DNAs provided by Dr. O. A. Ryder. By sequencing X chromosomes in males, we were able to amplify by PCR and sequence a single allele per individual and thus avoid problems associated with sequencing and scoring heterozygous sites. We were also able to recover haplotypes directly and thereby look at patterns of linkage disequilibrium among all sites in the sample.
PCR amplification and sequencing of Dmd: A map of the Dmd locus is shown in Figure 1. Additional detailed information about the structure of this locus can be found at http://www.dmd.nl. Intron 7 and intron 44 are separated by >500 kb of DNA. Both introns lie in genomic regions experiencing high rates of recombination (>4 cM/Mb), although the intervening introns experience considerably lower rates of recombination (<1 cM/Mb). DNA was PCR amplified (Saiki et al. 1988) in 25-μl volumes with 40 cycles of 94° for 1 min, 55° for 1 min, and 72° for 2 min. Amplification primers were designed from published sequence for intron 7 (GenBank accession no. U60822) and intron 44 (GenBank accession no. M86524) and are listed in Table 1. Products were cycle-sequenced and run on an ABI 377 automated sequencer or sequenced manually as in Nachman et al. (1998). A total of 2389 bp from intron 7 and 3000 bp from intron 44 were sequenced. The 3000 bp in intron 44 are contiguous but the 2389 bp in intron 7 consist of two fragments (1388 bp and 1001 bp) separated by 8820 bp of largely repetitive DNA, which we found difficult to sequence (Figure 1). The 3000-bp portion of intron 44 includes and extends the smaller region (1537 bp) sequenced in 10 individuals in Nachman et al. (1998). Sequences have been submitted to GenBank under accession nos. AF279921–AF280049.
Data analysis: Sequences were aligned by eye, and the numbers and frequencies of all polymorphisms were counted. Two measures of nucleotide variability, π (Nei and Li 1979) and θ (Watterson 1975), were calculated. Nucleotide diversity, π, is based on the average number of nucleotide differences between two sequences randomly drawn from a sample, and θ is based on the proportion of segregating sites in a sample. Under neutral, equilibrium conditions, both π and θ estimate the parameter 3Neμ for X-linked loci, where Ne is the effective population size and μ is the neutral mutation rate. Departures from a neutral equilibrium frequency distribution of polymorphisms were evaluated using two approaches (Tajima 1989; Fu and Li 1993). Linkage disequilibrium (D′) was calculated for a set of independent pairwise comparisons between nonunique polymorphic sites (Lewontin 1964, 1995), and the significance of D′ was assessed using Fisher's exact tests. Ratios of polymorphism within humans to divergence between human and chimpanzee or human and orangutan were compared with expectations under a neutral model using the Hudson, Kreitman and Aguadé (HKA) test (Hudson et al. 1987). Polymorphism was based on variation segregating among the 41 human alleles and divergence was based on a single randomly chosen human allele and the single chimpanzee or orangutan allele.
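The two variability measures described above can be illustrated with a minimal Python sketch. The function names and the four-sequence toy alignment are hypothetical and are not data from this study; the sketch assumes equal-length, gap-free aligned sequences and is not the software used for the published analysis.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns that contain more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def nucleotide_diversity(seqs):
    """pi: average number of pairwise differences, per site (Nei and Li 1979)."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / len(pairs) / length

def watterson_theta(seqs):
    """theta: S / (a_n * L), with a_n = sum_{i=1}^{n-1} 1/i (Watterson 1975)."""
    n, length = len(seqs), len(seqs[0])
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / (a_n * length)

# Hypothetical toy alignment (NOT data from this study).
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC", "ACCTACGAAC"]
print(nucleotide_diversity(toy), watterson_theta(toy))
```

Because π weights each polymorphism by its frequency while θ counts each segregating site once, rare variants raise θ more than π, which is the basis of the frequency-distribution tests cited above.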
RESULTS
Polymorphic sites for introns 7 and 44 are shown in Tables 2 and 3, respectively. Numbers of segregating sites, nucleotide diversity, measures of the frequency distribution, and levels of divergence are summarized in Table 4 for both introns. Nine segregating sites were observed in intron 7, and 19 segregating sites were observed in intron 44. Intron 7 had three insertion-deletion polymorphisms; two consisted of a single nucleotide and one consisted of 5 bp. Intron 44 contained a complicated compound microsatellite consisting of several different dinucleotide repeats (Table 3). Nucleotide diversity at intron 44 (π = 0.141%) was more than four times greater than nucleotide diversity at intron 7 (π = 0.034%). Watterson's θ, which is based on the number of segregating sites, was less than twice as large in intron 44 (θ = 0.148%) as in intron 7 (θ = 0.088%). The relative similarity in θ despite the difference in π between the two introns is due in large part to the difference in the number of singletons in each intron. Seven out of 9 (78%) polymorphic sites in intron 7 are singletons, while 6 out of 19 (32%) polymorphic sites in intron 44 are singletons. The frequency distribution of polymorphisms is consistent with neutral expectations for intron 44, but there is an excess of rare polymorphisms in intron 7, reflected in the significantly negative values of Tajima's D and Fu and Li's D statistics (Table 4).
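The contrast between π and θ can be made concrete with a short Python sketch that recomputes Watterson's θ and Tajima's D from the summary values given in the text (S, n = 41, sequence length, and π). The function names are illustrative, and because the inputs are rounded published values, the resulting D should be treated as approximate and may differ slightly from the entries in Table 4.

```python
import math

def watterson_theta_per_site(S, n, L):
    """Watterson's theta per site: S / (a1 * L), a1 = sum_{i=1}^{n-1} 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return S / (a1 * L)

def tajimas_d(S, n, pi_per_site, L):
    """Tajima's (1989) D from summary statistics.

    S: segregating sites; n: sample size; pi_per_site: nucleotide
    diversity per site; L: number of sites surveyed.
    """
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    k = pi_per_site * L  # mean pairwise differences per sequence
    return (k - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Summary values quoted in the text (pi given there as a percentage).
theta_intron7 = watterson_theta_per_site(S=9, n=41, L=2389)
theta_intron44 = watterson_theta_per_site(S=19, n=41, L=3000)
d_intron7 = tajimas_d(S=9, n=41, pi_per_site=0.00034, L=2389)
d_intron44 = tajimas_d(S=19, n=41, pi_per_site=0.00141, L=3000)
print(theta_intron7, theta_intron44, d_intron7, d_intron44)
```

With these inputs the θ values agree with those reported above, and D is strongly negative for intron 7 but close to zero for intron 44, in line with the excess of rare polymorphisms at intron 7.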
Table 1. Amplification primers used in this study
Table 2. Individuals sampled and polymorphic sites at Dmd intron 7
Divergence was significantly higher at intron 7 than at intron 44 in comparisons between human and chimpanzee (t = 2.30, P < 0.05). In comparisons between human and orangutan, divergence was only slightly and not significantly higher at intron 7 than at intron 44 (Table 4).
Table 3. Polymorphic sites at Dmd intron 44
We investigated patterns of linkage disequilibrium by comparing pairs of sites in order along the chromosome; this provides a set of independent comparisons for tests of significance (Lewontin 1995). Sites containing singletons were excluded from this analysis. In intron 7, comparisons were made among four sites (section one, 268, 551; section two, 481, 540). Significant linkage disequilibrium was observed in each of the three sequential comparisons involving these sites (Fisher's exact test, P < 0.001 for each, after Bonferroni correction for multiple tests). Sites 268 and 551 in section one of intron 7 are ~10 kb away from sites 481 and 540 in section two of intron 7. The high level of linkage disequilibrium in intron 7 results in two major haplotypes (represented by YCC individuals 32 and 8; Table 2). In intron 44, 13 sequential pairwise comparisons were made among 14 sites (1, 191, 681, 814, 834, 1160, 1224, 1353, 1830, 1858, 2374, 2423, 2532, and 2691). Significant linkage disequilibrium was observed in three of these comparisons (Fisher's exact test, P < 0.0001 for sites 834–1160 and 1160–1224; P < 0.05 for sites 2374–2423, after Bonferroni correction).
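As a sketch of how one such pairwise comparison can be evaluated, the following Python code computes Lewontin's D′ and a two-sided Fisher's exact test from the four haplotype counts at a pair of biallelic sites. The counts used here are hypothetical (they are not taken from Tables 2 or 3), the function names are illustrative, and the two-sided rule of summing all tables no more probable than the observed one is one common convention, assumed here rather than confirmed by the text.

```python
from math import comb

def d_prime(n_AB, n_Ab, n_aB, n_ab):
    """Lewontin's D' for two biallelic sites, from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    D = n_AB / n - p_A * p_B
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return D / d_max  # both sites must be polymorphic, so d_max > 0

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: 41 haplotypes with only two gametic types present.
print(d_prime(30, 0, 0, 11), fisher_exact_two_sided(30, 0, 0, 11))
```

A Bonferroni correction, as applied in the text, would multiply each P value by the number of sequential comparisons made within the intron.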
Table 4. Nucleotide polymorphism and divergence at Dmd
None of the 3 comparisons between pairs of sites in intron 7 contained all four gametic types, while 6 of the 13 comparisons between pairs of sites in intron 44 contained all four gametic types. Thus, more recombination is observed among the sequences in intron 44 than in intron 7, consistent with the mapping data in Figure 1. We also calculated the neutral recombination parameter, γ, from the polymorphism data at intron 44 using the method of Hey and Wakeley (1997). This provided an estimate of the per-site population recombination rate, 2Ncf = 5.58 × 10⁻³, which corresponds to a rate of 27.9 cM/Mb (where cf is the female recombination rate, assuming N = 10⁴; e.g., Hammer 1995). This value is substantially larger than the estimate of 15 cM/Mb obtained from mapping data (Figure 1), although the variance associated with both of these estimates is large. If population size is closer to 20,000–30,000 (e.g., Nachman et al. 1998; Harris and Hey 1999), then the inferred recombination rate from the sequence data (9.3–14.0 cM/Mb) is in better agreement with the estimate from mapping data. γ could not be calculated for either portion of intron 7 because there are no incongruent pairs of sites in these data; the maximum-likelihood estimate of γ in this case is zero (Hey and Wakeley 1997).
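The two calculations in this paragraph can be sketched in Python: a four-gamete check for a pair of biallelic sites, and the conversion of the population recombination estimate 2Ncf into cM/Mb for an assumed population size. The function names and the toy site columns are hypothetical; the conversion uses the fact that 1 cM/Mb corresponds to a recombination fraction of 10⁻⁸ per base pair per generation.

```python
def has_all_four_gametes(site_a, site_b):
    """True if all four haplotype combinations occur at two biallelic sites.

    site_a and site_b are equal-length sequences giving the allele
    carried by each sampled chromosome at the two sites.
    """
    return len(set(zip(site_a, site_b))) == 4

def population_rate_to_cM_per_Mb(two_N_c, N):
    """Convert a per-site estimate of 2Nc to cM/Mb for an assumed N."""
    c_per_site = two_N_c / (2.0 * N)  # recombination fraction per bp
    return c_per_site * 1e8           # 1 cM/Mb = 1e-8 per bp

# Toy site columns (hypothetical, four chromosomes each).
print(has_all_four_gametes("AATT", "CGCG"))  # all four combinations present
print(has_all_four_gametes("AATT", "CCGG"))  # only two combinations present

# Estimate from the text, under three assumed population sizes.
for N in (10_000, 20_000, 30_000):
    print(N, population_rate_to_cM_per_Mb(5.58e-3, N))
```

The inverse dependence on N is why the larger population sizes bring the sequence-based estimate closer to the estimate from mapping data.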
The geographic distribution of nucleotide variation at each intron is shown in Table 5. For intron 7, nucleotide diversity is substantially lower in the non-African samples (π ranges from 0 to 0.025%) than in the African sample (π = 0.08%). The two major haplotypes at intron 7 are both present in Africa, but only one is present out of Africa. For intron 44, nucleotide diversity in the non-African samples (π ranges from 0.111 to 0.144%) is more than half the value observed in the African sample (π = 0.173%). Surprisingly, for both introns, the Asian sample is the least variable and is even slightly, though not significantly, less variable than the sample from the Americas. Average FST calculated across all populations was six times higher for intron 7 (FST = 0.176) than for intron 44 (FST = 0.028). This overall difference in FST is attributable to the differences between the two introns in the partitioning of genetic variation between African and non-African populations, as can be seen from the distribution of variation in Tables 2 and 3. Average FST calculated across all non-African populations was zero for intron 7 and was very small for intron 44 (FST = 0.013).
HKA comparisons involving polymorphism and Homo-Pan divergence between intron 7 and intron 44 are shown in Table 6. When all the data are considered, there is only a marginally significant rejection of the null model (P = 0.08). However, when the non-African populations are considered collectively, there is a significant reduction in the ratio of polymorphism to divergence at intron 7 relative to intron 44 (P < 0.05). A significant reduction is also seen in Europe and in Asia, but not in Africa or the Americas. HKA tests involving Homo-Pongo comparisons yield similar results: a significant or marginally significant rejection of the null model is obtained in comparisons involving Asia (P < 0.05), Europe (P = 0.06), or all non-African populations (P = 0.08), but not in comparisons involving the total sample, Africa, or the Americas (P > 0.10). We also performed HKA tests comparing Dmd intron 7 and Dmd intron 44 to another X-linked gene, Pdha1 (Harris and Hey 1999). The Pdha1 data consist of 4200 bp surveyed in a worldwide sample of 35 chromosomes. In comparisons using the entire sample, the ratio of polymorphism to divergence is lower at Dmd intron 7 than at Pdha1 (HKA χ² = 3.41, P = 0.06), but is nearly identical at Dmd intron 44 and at Pdha1 (HKA χ² = 0.01, P > 0.5).
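The P values attached to the HKA χ² statistics can be checked with a few lines of Python. This sketch assumes 1 degree of freedom (two loci with polymorphism data from a single species), an assumption that is consistent with the P values reported above but is not stated explicitly in the text. For 1 d.f., the upper-tail probability of χ² equals erfc(√(x/2)), so only the standard library is needed.

```python
import math

def chi2_upper_tail_1df(x):
    """Upper-tail probability of a chi-square variate with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

# HKA statistics quoted in the text for comparisons with Pdha1.
p_intron7_vs_pdha1 = chi2_upper_tail_1df(3.41)
p_intron44_vs_pdha1 = chi2_upper_tail_1df(0.01)
print(p_intron7_vs_pdha1, p_intron44_vs_pdha1)
```

This only converts a reported statistic into a P value; computing the HKA statistic itself requires the polymorphism and divergence counts from Table 4 and the estimation procedure of Hudson et al. (1987).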
Table 5. Amount and distribution of polymorphisms at Dmd introns 7 and 44 by geographic region
Table 6. HKA tests comparing Dmd intron 7 vs. intron 44, Homo vs. Pan
DISCUSSION
We investigated the amount and structure of DNA sequence variation at two introns of Dmd in a worldwide sample of 41 humans and found that these two introns have strikingly different patterns of genetic variation. In general, intron 44 had a high level of nucleotide diversity, little linkage disequilibrium, no skew in the frequency distribution of polymorphisms, and revealed similar patterns of variation in and out of Africa. Patterns of variation at intron 44 are entirely consistent with a neutral model of molecular evolution. In contrast, intron 7 had a low level of nucleotide diversity, displayed significant linkage disequilibrium extending over 10 kb, a significant excess of rare polymorphisms, and very different patterns of variation in and out of Africa. Jointly, the patterns of variation observed at these two introns are inconsistent with a standard, neutral equilibrium model. The statistical evidence against this model derives from the significantly negative values of Tajima's (1989) D and Fu and Li's (1993) D for intron 7 (Table 4) and from the significant HKA tests showing reduced variability at intron 7 in non-African populations (Table 6). These patterns are difficult to reconcile with non-equilibrium population-level effects, such as migration or changes in population size, since such effects are expected to affect all loci in a roughly proportional fashion. On the other hand, all of our observations are consistent with positive directional selection acting recently at or near intron 7. Positive directional selection can reduce levels of linked neutral variability, increase levels of linkage disequilibrium, and produce a skew in the frequency distribution toward an excess of rare sites (Maynard Smith and Haigh 1974; Kaplan et al. 1989; Tajima 1989; Braverman et al. 1995). Moreover, if selection does not act equally in all geographic regions, it may also lead to increased levels of population differentiation (e.g., Stephan 1994; Stephan et al. 1998).
The exact nature of selection is difficult to determine from the observed distribution of variation. There are two major haplotypes at intron 7 (represented by YCC individuals 32 and 8) and these haplotypes are three (YCC 32) and one (YCC 8) mutational steps derived from the ancestral human haplotype, inferred from parsimony using the chimpanzee and orangutan sequences as outgroups. Both of the major haplotypes are present in Africa but only one is present out of Africa. All other haplotypes in our sample are one mutational or recombinational step derived from one of these two major haplotypes. One straightforward explanation for the differing patterns of variation at intron 44 and at intron 7 is a partial selective sweep of the more common haplotype (YCC 32) at intron 7, especially in non-African populations. The fact that variation is reduced primarily in non-African populations suggests that a selective sweep may have occurred concomitant with or following the movement of anatomically modern humans out of Africa. It should be noted that despite the presence of two major alleles, there is no evidence for an excess of variation or for polymorphisms at intermediate frequency as might be expected under prolonged balancing selection (e.g., Hudson et al. 1987; Kreitman and Hudson 1991).
The likelihood that selection has acted at or near intron 7 raises the question of which site or sites are the direct targets of selection. The genomic distance in base pairs (d) over which selection is likely to exert a strong effect on levels of linked neutral variability is a function of the strength of selection, s, and the recombination rate per nucleotide, c, and is approximated by d = (0.01) s/c (Kaplan et al. 1989, p. 896). For example, Wang et al. (1999) observed a reduction in neutral variation over a region of only a few kilobases in the vicinity of the 5′ promoter region of the teosinte-branched 1 locus in maize, a gene that has been a target of strong artificial selection during the domestication of maize. The recombination rate over the entire Dmd locus is ~6 cM/Mb, but it may be closer to 4 cM/Mb in intron 7 (Oudet et al. 1992; Figure 1), corresponding to c = 4 × 10⁻⁸ per nucleotide. Mutations with selection coefficients <10⁻⁴ are unlikely to have been affected by deterministic processes in ancestral human populations, given most estimates of effective population size (e.g., Hammer 1995; Zietkiewicz et al. 1998). If we consider a range of selection coefficients, 0.001 < s < 0.1, then linked neutral variability is expected to be reduced over a genomic distance ranging from 250 bp to 25 kb. Nucleotide variability at intron 7 is low in both of the segments we sequenced, and these are separated by ~9 kb. This implies that selection coefficients may be >10⁻², assuming a simple model of genetic hitchhiking (Kaplan et al. 1989). Nonetheless, the large size of the window of reduced variation points to the difficulty of identifying the specific site or sites that have been under selection. This stands in contrast to the narrow window of reduced variability in maize at the tb1 locus (Wang et al. 1999) or the narrow window of elevated variability at Adh in Drosophila (Kreitman and Hudson 1991).
Because average recombination rates in humans (10⁻⁸ per site) are roughly two- to threefold lower than in Drosophila (2–3 × 10⁻⁸ per site; Nachman and Churchill 1996), the size of windows affected by selection on linked sites in humans may, on average, be larger than in flies (assuming equivalent selection coefficients).
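The hitchhiking approximation d = (0.01) s/c can be evaluated directly. The Python sketch below, with illustrative function names, computes the window size for a given selection coefficient and, by rearranging the same formula, the smallest selection coefficient needed to affect a window of a given size. It assumes c = 4 × 10⁻⁸ per nucleotide for intron 7, the value stated in the text, and is only as good as that simple model.

```python
def hitchhiking_distance_bp(s, c):
    """Approximate distance (bp) over which a sweep reduces linked variation."""
    return 0.01 * s / c

def min_selection_coefficient(d_bp, c):
    """Smallest s whose hitchhiking window spans d_bp (rearranged formula)."""
    return d_bp * c / 0.01

C_INTRON7 = 4e-8  # per-nucleotide recombination rate assumed for intron 7

for s in (0.001, 0.01, 0.1):
    print(s, hitchhiking_distance_bp(s, C_INTRON7))

# Selection coefficient needed to span the ~9 kb between the two segments.
print(min_selection_coefficient(9000, C_INTRON7))
```

Because d scales linearly with s and inversely with c, a tenfold stronger selection coefficient or a tenfold lower recombination rate widens the affected window by the same factor.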
The effects of selection are expected to be easiest to detect in genomic regions experiencing low rates of recombination because these regions will contain more potential targets of selection for a given genetic distance. Indeed, our study was motivated by an interest in depicting patterns of variation at a high-recombination gene to capture the distribution of variation that may be closest to neutral, equilibrium values. However, the observed patterns of variation strongly suggest that selection has acted in this region, and these observations raise the possibility that the signature of selection at the molecular level may be common in the human genome. Moreover, the differences in patterns of variation seen at intron 7 and intron 44 highlight that a single functional gene may contain segments with dramatically different evolutionary histories.
Overall, the level of variation we observed in intron 44 is in general agreement with previous surveys of nucleotide variability in this intron (Nachman et al. 1998; Zietkiewicz et al. 1998). In a sample of 10 alleles surveyed over 1537 bp, Nachman et al. (1998) reported nucleotide diversity of 0.187%, a value that is not significantly different from the value reported here. Zietkiewicz et al. (1997, 1998) surveyed 7622 bp in a worldwide sample of 250 alleles and reported nucleotide diversity of 0.101%. The slightly lower value obtained by Zietkiewicz et al. (1998) may reflect that their survey was based on polymorphism detection using SSCP.
The level of nucleotide variability observed at each intron can be used to estimate the effective population size under the neutral expectation for X-linked genes, π = 3Neμ, assuming a sex ratio of 1. Using the human-chimpanzee divergence values in Table 4, the estimated mutation rates are μ = 3.26 × 10⁻⁸ for intron 7 and μ = 1.8 × 10⁻⁸ for intron 44, assuming a divergence time of 5 mya and a generation time of 20 years. The estimated population sizes are Ne ≈ 3500 for intron 7 and Ne ≈ 26,000 for intron 44. The corresponding coalescence times are ~210,000 years for intron 7 and 1,560,000 years for intron 44. Despite the large variance associated with each of these estimates, these differences underscore the fact that different regions of the genome, and even of the same gene, may provide quite different estimates of parameters that are important for understanding human evolution. Genomic regions that have been influenced by selection at linked sites may provide substantial underestimates of the long-term effective population size for humans. The larger value of Ne obtained from intron 44 is likely to better reflect equilibrium conditions and suggests that a long-term effective population size for humans may be on the order of 30,000 rather than 10,000 (e.g., Hammer 1995).
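The arithmetic behind these estimates can be reproduced with a short Python sketch. The function names are illustrative. The coalescence time is taken here as 3Ne generations for an X-linked locus, which is an assumption inferred from the agreement with the values quoted above rather than a formula stated in the text; the inputs are the rounded π and μ values given in the paper.

```python
def effective_size_x_linked(pi, mu):
    """Ne from the X-linked neutral expectation pi = 3 * Ne * mu."""
    return pi / (3.0 * mu)

def coalescence_time_years(ne, generation_years=20):
    """Expected coalescence time, assumed to be 3 * Ne generations (X-linked)."""
    return 3.0 * ne * generation_years

# pi (as a proportion) and mu per site per generation, from the text.
ne_intron7 = effective_size_x_linked(pi=0.00034, mu=3.26e-8)
ne_intron44 = effective_size_x_linked(pi=0.00141, mu=1.8e-8)

print(ne_intron7, coalescence_time_years(ne_intron7))
print(ne_intron44, coalescence_time_years(ne_intron44))
```

Small differences from the published figures arise because the paper rounds Ne before computing the coalescence times.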
The geographic patterns reported here are in general agreement with other studies of nucleotide variability in humans in revealing more variation in Africa than in other continental regions (e.g., Harding et al. 1997; Zietkiewicz et al. 1998; Harris and Hey 1999; Kaessmann et al. 1999). This observation is often interpreted as evidence that modern humans throughout the world derived recently from an ancestral African population (e.g., Kaessmann et al. 1999), although it has also been pointed out that some of the African diversity may derive from human migration back to Africa (e.g., Harding et al. 1997; Hammer et al. 1998). One surprising observation in our data is the greater variability in the Americas than in Asia. The Americas are typically thought to have been colonized from Asia, and samples from the Americas typically reveal lower levels of genetic variability than samples from Asia (e.g., Karafet et al. 1999). At both intron 7 and intron 44, we observe higher levels of variability in the Americas than in Asia, though in neither case is this difference significant. Surveys of nucleotide variability at additional unlinked loci from multiple populations will be essential for disentangling the effects of population-level processes from selection in shaping variation in different geographic regions.
Acknowledgments
We thank Mike Hammer for discussions, Isaac Jones for help with sequencing, and Wolfgang Stephan and two anonymous reviewers for helpful comments on the manuscript. This work was supported by the National Science Foundation.
Footnotes
- Communicating editor: W. Stephan
- Received September 13, 1999.
- Accepted April 17, 2000.
- Copyright © 2000 by the Genetics Society of America