Studies of human DNA sequence polymorphism reveal a range of diversity patterns throughout the genome. This variation among loci may be due to natural selection, demographic influences, and/or different sampling strategies. Here we build on a continuing study of noncoding regions on the X chromosome in a panel of 41 globally sampled humans representing African and non-African populations by examining patterns of DNA sequence variation at four loci (APXL, AMELX, TNFSF5, and RRM2P4) and comparing these patterns with those previously reported at six loci in the same panel of 41 individuals. We also include comparisons with patterns of noncoding variation seen at five additional X-linked loci that were sequenced in similar global panels. We find that, while almost all loci show a reduction in non-African diversity, the magnitude of the reduction varies substantially across loci. The large observed variance in non-African levels of diversity results in the rejection of a neutral model of molecular evolution with a multi-locus HKA test under both a constant size and a bottleneck model. In non-Africans, some loci harbor an excess of rare mutations over neutral equilibrium predictions, while other loci show no such deviation in the distribution of mutation frequencies. We also observe a positive relationship between recombination rate and frequency spectra in our non-African, but not in our African, sample. These results indicate that a simple out-of-Africa bottleneck model is not sufficient to explain the observed patterns of sequence variation and that diversity-reducing selection acting at a subset of loci and/or a more complex neutral model must be invoked.
PATTERNS of variation at multiple loci can be used to infer the history of human migration patterns, subdivision, and changes in population size. These patterns can also shed light on the relative importance of different population genetic processes (e.g., mutation, genetic drift, selection, and recombination) and thus provide clues to the mechanisms of evolutionary change at the molecular level. A current challenge is to distinguish the signature of natural selection from those of neutral demographic processes associated with changes in population size, distribution, and structure. Often selective and demographic processes produce identical patterns of sequence variation at a given locus. For example, an excess of rare mutations over neutral, equilibrium expectations could be a signature of either recent directional selection at the locus under investigation or recent population growth. One approach to distinguishing between selective and neutral demographic effects on genome variability is to sample multiple independent loci: natural selection is expected to affect variation in small regions (i.e., at sites linked to those under selection), while demographic processes tend to affect all loci in the genome similarly.
Considerable work over the past decade has documented DNA sequence variation in humans. Early studies focused primarily on mitochondrial DNA (Vigilant et al. 1991) and the Y chromosome (Hammer 1995; Whitfield et al. 1995), while more recent single-locus studies have focused on the X chromosome (Nachman et al. 1998; Harris and Hey 1999; Kaessmann et al. 1999; Nachman and Crowell 2000; Gilad et al. 2002; Saunders et al. 2002; Yu et al. 2002b) and on the autosomes (reviewed in Przeworski et al. 2000; Excoffier 2002). Two major features to emerge from this body of work are (1) substantial heterogeneity among genes in overall patterns of variation, including differences in the level of nucleotide diversity, the amount of linkage disequilibrium, and the distribution of allele frequencies, and (2) clear differences in levels and patterns of variation among populations. For example, there is mounting evidence that African populations have more genetic variation (Vigilant et al. 1991; Tishkoff et al. 1996; Przeworski et al. 2000; Hammer et al. 2001), harbor more rare alleles (Wall and Przeworski 2000), and have lower levels of linkage disequilibrium (Reich et al. 2001) than non-African populations.
These genetic patterns have led to contrasting inferences of human demographic history. Results from loci with an excess of rare polymorphisms (i.e., with large negative Tajima's D values; Tajima 1989) have been used to support models in which humans expanded dramatically from small initial size (Harpending and Rogers 2000; Shen et al. 2000; Wooding and Rogers 2000; Alonso and Armour 2001; Rogers 2001). On the other hand, many nuclear loci have positive Tajima's D values so they do not provide evidence of population growth (Harding et al. 1997; Hey 1997; Zietkiewicz et al. 1997; Przeworski et al. 2000). Using data from the 12 nuclear loci then available, Wall and Przeworski (2000) tested several simple models of population growth and found that the different patterns among loci were not compatible with any of their models. This led to the suggestion that various forms of selection have influenced a subset of loci (Wall and Przeworski 2000; Excoffier 2002). It is also becoming clear that more complex models of human demography must be considered, such as those incorporating geographic structure and changes in population size (Pluzhnikov et al. 2002; Ptak and Przeworski 2002).
One of the major challenges for interpreting the contrasting patterns observed among human loci comes from the sampling strategies used by different investigators. Studies of nuclear sequence variation vary greatly with regard to the scheme for sampling populations (from population-based to global “grid” sampling), the type of genomic regions studied (from coding to noncoding), and the molecular methods of variation detection employed (Przeworski et al. 2000; Ptak and Przeworski 2002). This diverse array of strategies has made it difficult to compare results across studies. Recently, several surveys have sampled DNA sequences from multiple loci in a common set of individuals (Frisse et al. 2001; Harris and Hey 2001; Stephens et al. 2001; Yu et al. 2002a,b; Carlson et al. 2003). These studies have typically focused on multiple autosomal loci from either noncoding regions exclusively (Frisse et al. 2001) or regions encompassing exons (Stephens et al. 2001). Studies of the X chromosome have typically focused on coding regions and/or only a few loci (Harris and Hey 1999, 2001; Stephens et al. 2001; Kitano et al. 2003). These studies also vary in the way humans are sampled, ranging from panels of individuals from the United States (Stephens et al. 2001), to panels containing many globally dispersed samples (Harris and Hey 1999, 2001), to panels with multiple individuals from a limited number of human populations (Frisse et al. 2001).
Here we build on a continuing study of noncoding regions on the X chromosome (Figure 1) in a panel of 41 globally sampled humans representing African and non-African populations by examining patterns of variation at four loci (APXL, AMELX, TNFSF5, and RRM2P4) and comparing these patterns with those previously reported at DMD (i.e., introns 7 and 44; Nachman and Crowell 2000), G6PD and L1CAM (Saunders et al. 2002), and MSN and ALAS2 (Nachman et al. 2004). We also include comparisons with patterns of noncoding variation seen at five additional X-linked loci that were sequenced in similar global panels: PDHA1 (Harris and Hey 1999), Xq13.3 (Kaessmann et al. 1999), FIX (Harris and Hey 2001), MAO-A (Gilad et al. 2002), and Xq21.3 (Yu et al. 2002b). Because we studied X-linked loci only, we were able to avoid some of the complications that arise when comparing loci with different modes of inheritance and effective population sizes, such as those associated with the Y chromosome, autosomes, or the mitochondrial genome (Fay and Wu 1999; Przeworski et al. 2000; Hellmann et al. 2003). Our results indicate that, despite a common sampling strategy, there is still substantial heterogeneity in patterns of variation among loci on the human X chromosome. This degree of heterogeneity does not appear to be compatible with a simple demographic model and may reflect the effects of recent diversity-reducing selection acting on a subset of loci.
SUBJECTS AND METHODS
Human genomic DNAs were isolated from lymphoblastoid cell lines that were established by the Y Chromosome Consortium (2002) at the New York Blood Center from blood donated by volunteers who gave informed consent. All sampling protocols followed procedures approved by the New York Blood Center and University of Arizona Human Subjects Committees. A total of 41 men were sampled, including 10 Africans (2 Tsumkwe San from Namibia, 1 West Bantu Herero, 1 East Bantu Pedi, 1 East Bantu Sotho, 2 Biaka Pygmies from the Central African Republic, and 3 Mbuti Pygmies from Zaire), 11 Asians (3 Han Chinese, 2 Siberian Yakuts, 1 Cambodian, 3 Japanese, 1 Pakistani, and 1 Nasioi from Melanesia), 10 Europeans/Middle Easterners (2 Ashkenazi Jews, 1 British, 1 Adygean from Krasnodar, 3 Germans, 2 Western Russians, and 1 Turk), and 10 Native Americans (1 Navajo, 1 Tohono O'Odham, 1 Poarch Creek, 2 Karitianans and 2 Surui from Brazil, 1 Mayan, and 2 Amerindians of unknown tribal affiliation). This sample was chosen as part of a long-term project in our labs to survey nucleotide variability at a number of loci throughout the genome using a common set of individuals (Nachman et al. 1998, 2004; Nachman and Crowell 2000; Saunders et al. 2002). A single male common chimpanzee (Pan troglodytes) was surveyed using DNA provided by O. Ryder. By sequencing X chromosomes in males we avoided problems associated with sequencing and scoring heterozygous sites and recovered haplotypes directly among all sites in the sample.
Choice of loci:
We chose to sequence APXL (apical protein-like Xenopus laevis), AMELX (amelogenin, X-linked), RRM2P4 (ribonucleotide reductase M2 polypeptide pseudogene 4), and TNFSF5 (tumor necrosis factor ligand superfamily, member 5) because they map to telomeric regions with moderate to high rates of recombination. These loci complement our existing database of six other genes (sequenced in the same global panel) mapping to X chromosome regions with a range of recombination rates (Figure 1): DMD intron 7, DMD intron 44 (Nachman and Crowell 2000), G6PD and L1CAM (Saunders et al. 2002), and MSN and ALAS2 (Nachman et al. 2004). Approximately 5 kb of intron from each gene was sequenced. With the exception of G6PD in Africa (Saunders et al. 2002), none of these loci was a priori believed to be influenced by selective forces. In addition, five published X chromosome DNA sequence data sets were used for comparisons with the 10 loci examined in our global panel. These loci included PDHA1 (Harris and Hey 1999), FIX (Harris and Hey 2001), Xq13.3 (Kaessmann et al. 1999; referred to here as P2Y10), Xq21.1 (Yu et al. 2002b; referred to here as DACH2), and MAO-A (Gilad et al. 2002). We did not include ZFX (Jaruzelska et al. 1999) or DYS44 (Zietkiewicz et al. 1997) because the polymorphism data were ascertained mainly by single-strand conformation polymorphism rather than by DNA sequencing.
PCR amplification and sequencing:
DNA was PCR amplified in 25-μl volumes with 40 cycles. Conditions for each of the fragments described below varied slightly and are available from the authors upon request. Amplification primers were designed from published sequences for APXL (AC002365), AMELX (AY040206), RRM2P4 (NG_000871; HSJ169P22), and TNFSF5 (NT_011786) and are available upon request. Internal primers (also available upon request) were used to generate overlapping sequence runs on an ABI3730 automated sequencer. Contiguous sequence that included coding and noncoding regions (4885, 5331, 5240, and 2385 bp for APXL, AMELX, TNFSF5, and RRM2P4, respectively) was assembled for each individual and aligned using the computer program Sequencher (GeneCodes). Sequences have been submitted to GenBank under accession nos. AY694820, AY694987.
Nucleotide diversity, π (Nei and Li 1979), Watterson's (1975) estimator of θ, and FST (Hudson et al. 1992) were calculated using the program DnaSP 3.99 (Rozas and Rozas 1999), excluding insertion-deletion polymorphisms. Under neutral equilibrium conditions both π and θ estimate the neutral parameter 3Neμ for X-linked loci, where Ne is the effective population size and μ is the neutral mutation rate. To test for deviations from a neutral equilibrium frequency distribution, Tajima's D (Tajima 1989), Fu and Li's D with an outgroup (Fu and Li 1993), and Fay and Wu's H (Fay and Wu 2000) were also calculated using DnaSP 3.99 (P values were determined by 1000 replicates of Monte Carlo simulation of the coalescent process under a neutral panmictic model with no recombination). Ratios of polymorphism to divergence were compared with the expectations under a neutral model using a multilocus Hudson-Kreitman-Aguadé (HKA) test (Hudson et al. 1987) with the software “HKA” (J. Hey; http://lifesci.rutgers.edu/heylab/). This program does not account for intragenic recombination and therefore the resulting P values are slightly inflated, i.e., the test is conservative (Frisse et al. 2001). Divergence data were derived for each of these loci by estimating the net divergence (DA; Nei 1987) between homologous sequences from a chimpanzee and all 41 human sequences. The times to most recent common ancestors (TMRCAs) among sampled sequences were estimated by dividing Watterson's estimator of 3Neμ (Watterson 1975) by the locus-specific rates of neutral mutation, estimated from the interspecific divergence. We assumed a human-chimpanzee divergence of 6 million years and a 20-year human generation time. Estimated female recombination rates were taken from the University of California, Santa Cruz, web site (http://www.genome.ucsc.edu) using the July 2003 freeze of the Human Genome Project Working Draft.
The recombination rates represent average rates for a window of 1 Mb around each locus estimated through a comparison of the sequence of the human X chromosome with the deCODE Genetics map (Kong et al. 2002), which is based on 5136 microsatellite markers in 146 families with a total of 1257 meioses.
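As an illustration of the summary statistics described above, the following minimal Python sketch computes the number of segregating sites, π, Watterson's θ, and Tajima's D from aligned haplotypes and converts θ and divergence into a TMRCA. It is not the software used in this study: the function names and the toy alignment are hypothetical, and the TMRCA conversion assumes that the per-generation mutation rate equals the human-chimpanzee divergence divided by twice the number of generations since the split, which is only one reading of the procedure described above.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(haplotypes):
    """Number of alignment columns carrying more than one allele (S)."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def mean_pairwise_differences(haplotypes):
    """Average number of differences between all pairs of sequences (k)."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def nucleotide_diversity(haplotypes):
    """Per-site nucleotide diversity (pi)."""
    return mean_pairwise_differences(haplotypes) / len(haplotypes[0])


def watterson_theta(haplotypes):
    """Per-site Watterson estimator: S / a1 / L."""
    n = len(haplotypes)
    a1 = sum(1 / i for i in range(1, n))
    return segregating_sites(haplotypes) / a1 / len(haplotypes[0])


def tajimas_d(haplotypes):
    """Tajima's D (Tajima 1989); None when S = 0 or n < 4 (undefined)."""
    n = len(haplotypes)
    s = segregating_sites(haplotypes)
    if n < 4 or s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    k = mean_pairwise_differences(haplotypes)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def tmrca_years(theta_per_site, divergence_per_site,
                split_years=6e6, generation_years=20):
    """Assumed conversion: mu = divergence / (2 * generations since split),
    TMRCA = (theta / mu) generations, reported in years."""
    generations_since_split = split_years / generation_years
    mu = divergence_per_site / (2 * generations_since_split)
    return theta_per_site / mu * generation_years


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT", "ATTT"]  # invented alignment, n = 4
    print(segregating_sites(toy), nucleotide_diversity(toy),
          watterson_theta(toy), tajimas_d(toy))
```

Because males carry a single X chromosome, each sampled sequence is a haplotype, so the pairwise comparisons need no phasing step.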
Patterns of variation at four telomeric loci:
Polymorphic sites within the APXL, AMELX, and TNFSF5 introns, as well as in the RRM2P4 pseudogene, are shown in Table 1. Numbers of segregating sites, nucleotide diversity, measures of the frequency distribution, levels of divergence, TMRCA, and FST values are summarized in Table 2. The number of segregating sites ranges from 13 to 19 for the four loci. The average nucleotide diversity for the three gene regions (consisting mainly of introns) is slightly lower than that in the pseudogene (Table 2). The average level of Homo-Pan divergence is also lower at two of the three genes compared with the pseudogene (Table 2). A four-locus HKA test does not reject the null model (P = 0.64). Tajima's D (TD) and Fu and Li's D (FLD) values are negative for all three gene regions; however, none is statistically significant. RRM2P4 is the only locus with a positive FLD value (+0.55) indicating a low proportion of singletons. Interestingly, TNFSF5 has an excess of high-frequency-derived polymorphisms as reflected in the statistically significant negative H value (Table 2).
Global patterns of variation at 10 X-linked loci sampled in the same individuals:
Global nucleotide diversity levels exhibit a large range of values, from a high of 0.00143 at DMD44 to a low of 0.00035 at ALAS2 (Table 2). However, a single 10-locus HKA test does not reject the null model of equal rates of molecular evolution (P = 0.53; data not shown). Interestingly, all 10 loci surveyed in our global panel have negative TD values. The mean global TD and FLD values for these 10 loci are −0.759 and −1.135, respectively. These negative averages conceal the great extent to which frequency spectra vary among loci. Two loci have TD and FLD values close to zero (DMD44 and RRM2P4), while three loci (ALAS2, DMD7, and L1CAM) have statistically significant negative TD values. There is an excess over neutral expectations of singleton polymorphisms at 2 of these 10 loci, DMD7 and L1CAM, as indicated by the significantly negative FLD values (Table 2).
Comparisons with patterns of variation at other X-linked loci:
Summary statistics for the 10 loci sampled in the same panel of 41 individuals are compared with those from five additional X-linked loci surveyed in global samples in Table 2. Average global levels of variation at these five loci (average θ = 0.00076 and π = 0.00068) are very similar to the 10 others in Table 2 (Mann-Whitney test, P = 0.668 and 0.951, respectively), as are summaries of the frequency spectra (average TD = −0.552, FLD = −1.401; Mann-Whitney test, P = 0.582 and 0.951, respectively). Heterogeneity among loci is also apparent: levels of variation at FIX, MAO-A, and P2Y10 are low, while those at DACH2 and PDHA1 are average and high, respectively. P2Y10, FIX, and DACH2 show an excess of rare and/or singleton polymorphisms (Table 2). In contrast, PDHA1 and MAO-A have positive TD values. As expected given the variation in levels of polymorphism, estimates of the TMRCA also vary considerably among all 15 loci in Table 2. For example, the TMRCA for FIX and MAO-A is <500 KY, while 6 loci have TMRCAs >1 MY.
Kitano et al. (2003) recently surveyed sequence variation at 10 X-linked genes that contain mutations known to cause mental retardation. Global patterns of intron variability within these genes are similar to those reported in Table 2, although the average global level of diversity at their 10 loci (θ = 0.00051 and π = 0.00035) is ∼40% lower than that for the 15 loci in Table 2 (θ = 0.00081 and π = 0.00061). Their sequences also had an ∼30% reduction in human-chimpanzee divergence (0.699%) relative to the 15 X-linked loci in Table 2 (1.01%) and a lower mean TMRCA (474 vs. 1004 KY, respectively), possibly reflecting higher levels of selective constraint on loci involved in human cognitive function (Kitano et al. 2003).
Nucleotide diversity and recombination rate:
Global nucleotide diversity (θ) and recombination rate for the 10 loci sampled in the N = 41 panel are positively correlated (Pearson linear correlation, two-tailed t-test, R2 = 0.648, P = 0.005). When all 15 loci in Table 2 are considered, there is still a positive but weaker correlation (R2 = 0.315, P = 0.029; Figure 2A). This increased scatter in the larger sample may reflect variation in levels of diversity that is attributable to different sampling strategies. Similarly, in Drosophila melanogaster, the correlation between diversity and recombination rate observed in heterogeneous samples (Begun and Aquadro 1992) is stronger when studied from multiple loci in a single sample (Aquadro et al. 1994). When we consider the relationship of human-chimpanzee divergence levels to human recombination rates, there is no statistically significant relationship for either the set of 10 or the set of 15 loci in Table 2 (R2 = 0.073, P = 0.449; R2 = 0.082, P = 0.303, respectively), which may reflect the small number of loci investigated (Figure 2B).
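The significance test behind these correlations can be sketched as follows. With n loci, a Pearson correlation is converted to t = r·sqrt((n − 2)/(1 − r²)) on n − 2 degrees of freedom. The Python sketch below is illustrative only: the function names are hypothetical, and the t distribution is integrated numerically with Simpson's rule so that nothing beyond the standard library is assumed. Given the reported R2 values and numbers of loci, it should return P values close to those reported above.

```python
from math import gamma, pi, sqrt


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    norm = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=20000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|), by symmetry
    return 1 - central_mass


def correlation_p_value(r_squared, n_loci):
    """Two-tailed P value for a Pearson correlation given R^2 and sample size."""
    df = n_loci - 2
    t = sqrt(r_squared * df / (1 - r_squared))
    return two_tailed_p(t, df)


if __name__ == "__main__":
    print(correlation_p_value(0.648, 10))  # 10 loci from the panel of 41
    print(correlation_p_value(0.315, 15))  # all 15 loci in Table 2
```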
African and non-African levels of diversity:
Table 3 summarizes the numbers of segregating sites, nucleotide diversity, measures of the frequency distribution, and TMRCAs within African and non-African samples. Consistent with many other studies (e.g., Yu et al. 2002a), most of the 15 X-linked loci exhibit a pattern of reduced nucleotide diversity in non-Africans relative to Africans. Two loci, RRM2P4 and MSN, show the opposite pattern (Table 3). Mean levels of nucleotide diversity are almost twice as high in Africans (θ = 0.00082 ± 0.00040, π = 0.00076 ± 0.00049) as in non-Africans (θ = 0.00045 ± 0.00036, π = 0.00040 ± 0.00042). These differences are statistically significant (Mann-Whitney test, P = 0.002 and 0.004, respectively).
To test for relationships between polymorphism and divergence, we performed a single 15-locus HKA test on the entire global sample, as well as on Africans and non-Africans separately (Table 4). We found that the null model was rejected in non-Africans only (P = 0.007). When we repeated the HKA tests using only the 10 loci sequenced in the panel of 41 individuals, similar results were obtained (P = 0.030; data not shown). We also note that the correlation between recombination rate and nucleotide diversity was stronger in non-Africans (R2 = 0.521, P = 0.019) than in Africans (R2 = 0.308, P = 0.096; data not shown).
African and non-African frequency spectra:
When we estimate TD and FLD values separately in African and non-African samples, both estimators are less negative than those in the global panel (see above): mean African TD and FLD = −0.401 and −0.552, respectively, and mean non-African TD and FLD = −0.512 and −0.846, respectively. TD and FLD values for all loci are consistent with neutral equilibrium expectations in Africans, while four loci showed an excess over neutral expectations of rare polymorphisms and/or singletons in non-Africans: ALAS2, MSN, DMD7, and L1CAM (Table 3). There is also an excess of high-frequency-derived polymorphisms at MSN and DMD7 in non-Africans as reflected in statistically significant negative Fay and Wu's H (FWH) values (Table 3). African TD values ranged from −1.72 to 0.80, while those of non-Africans ranged from −2.06 to 0.72. The mean TD for non-Africans (−0.512 ± 0.906) was slightly more negative than that of Africans (−0.401 ± 0.736).
Under a model of population growth TD and FLD are negatively correlated with sample size (Ptak and Przeworski 2002; Hammer et al. 2003). To determine whether the more negative mean TD value in our non-African sample compared with our African sample was the result of a larger mean sample size (i.e., 32.3 vs. 12.3), we reanalyzed our non-African data by resampling each locus 100 times and making the sample size equal to the number of Africans sequenced at the locus. In other words, for the n = 41 data set, 10 non-Africans were resampled 100 times. Resampled data sets with no variation (i.e., S = 0) were thrown out and an additional resampling was performed. The FIX locus was not resampled because equal numbers of Africans and non-Africans were surveyed initially (Harris and Hey 2001). The mean TD value for the resampled non-African data set was still more negative (TD = −0.639) despite having the same size as the African sample. This suggests that the more negative TD value in non-Africans compared with Africans is unrelated to sample size differences.
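The resampling procedure described above can be sketched in Python as follows. This is a hedged illustration rather than the code used in the study: the function names and the `max_draws` safeguard are hypothetical, subsamples are assumed to be drawn without replacement, and Tajima's D is recomputed with the standard formulas of Tajima (1989).

```python
import random
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D (Tajima 1989); None when S = 0 or n < 4 (undefined)."""
    n = len(haplotypes)
    s = sum(1 for col in zip(*haplotypes) if len(set(col)) > 1)
    if n < 4 or s == 0:
        return None
    pairs = list(combinations(haplotypes, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def resampled_mean_d(haplotypes, target_size, replicates=100,
                     rng=None, max_draws=100000):
    """Mean Tajima's D over subsamples of size target_size.

    Subsamples without variation (S = 0) are discarded and redrawn, so
    exactly `replicates` values contribute to the mean.
    """
    rng = rng or random.Random()
    values = []
    draws = 0
    while len(values) < replicates:
        draws += 1
        if draws > max_draws:  # hypothetical guard against endless redraws
            raise ValueError("too few polymorphic subsamples")
        d = tajimas_d(rng.sample(haplotypes, target_size))
        if d is not None:
            values.append(d)
    return sum(values) / len(values)
```

For the loci sequenced in the panel of 41 individuals, `haplotypes` would hold the non-African sequences and `target_size` would be 10, the number of Africans in that panel.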
To explore the relationship between frequency spectra and recombination rate we plotted African or non-African FLD vs. recombination rate for each locus in Table 3. For Africans there is no relationship either for the set of 10 loci sampled in the same set of 41 individuals or for all 15 loci (R2 = 0.014, P = 0.747 and R2 = 0.014, P = 0.675, respectively; Figure 3A). In contrast, non-African FLD values exhibit a statistically significant positive correlation for both the 10 loci and 15 loci data sets (R2 = 0.520, P = 0.019 and R2 = 0.454, P = 0.006, respectively; Figure 3B). Similar trends were observed with TD; however, the non-African correlation was not statistically significant (P = 0.138). Interestingly, there is also a positive relationship between non-African FLD and nucleotide variability (θ) for the set of 10 loci sampled in the same individuals (R2 = 0.678, P = 0.003), while no such relationship was observed for the African FLD values (R2 = 0.001, P = 0.936; data not shown). This relationship is only marginally statistically significant for the full set of 15 loci in non-Africans (R2 = 0.235, P = 0.067).
We obtained DNA sequence data from four loci in regions of high recombination on the X chromosome and compared patterns of variation at these loci with those from six additional loci sequenced in the same panel of 41 global samples. Despite the fact that all loci were X-linked and sampled in the same set of individuals, we found substantial heterogeneity in levels and patterns of variation among these 10 loci. We also compared patterns at these 10 loci with those from five additional X-linked genes that were sequenced in similar global panels (Harris and Hey 1999, 2001; Kaessmann et al. 1999; Gilad et al. 2002; Yu et al. 2002b). Nucleotide diversity varies by more than an order of magnitude among loci (Table 2): ALAS2, L1CAM, and FIX exhibit some of the lowest levels of nucleotide diversity seen in the human genome, while PDHA1, DMD44, and RRM2P4 are higher than the autosomal average (Li and Sadler 1991; Przeworski et al. 2000). Likewise, the distribution of mutation frequencies differs considerably among loci, with several harboring an excess (over neutral predictions) of low-frequency polymorphisms (e.g., ALAS2, DMD7, L1CAM, FIX), and others with an abundance of high-frequency (e.g., TNFSF5) or intermediate-frequency (e.g., PDHA1, DMD44, RRM2P4)-derived polymorphisms.
Patterns of heterogeneity seen among loci are different between African and non-African samples. There was a wide range of FST values, with two loci (PDHA1 and MSN) exhibiting some of the highest known levels of differentiation among populations and others with extremely low levels of differentiation (e.g., DACH2, FIX, and DMD44; Romualdi et al. 2002). As previously documented for autosomal, Y-linked, and mitochondrial loci (Vigilant et al. 1991; Przeworski et al. 2000; Shen et al. 2000; Hammer et al. 2003), X-linked loci are more variable in sub-Saharan African populations than in non-African populations. However, the extent of the reduction in non-African diversity at X-linked loci appears to be greater than that observed on the autosomes. For example, the average non-African reduction in θ for the 15 X chromosome loci in Table 3 is 45%, while the average non-African reduction in θ on the autosomes is 30% (Halushka et al. 1999; Frisse et al. 2001; Stephens et al. 2001). We also note that there is substantial variability among X-linked loci in the degree of reduction in non-African variation, with some loci having <10% of African diversity (e.g., PDHA1, ALAS2, and L1CAM). Similar disparities in African and non-African autosomal levels of diversity have not been reported (Przeworski et al. 2000; Alonso and Armour 2001; Frisse et al. 2001). Mounting evidence suggests that, for many loci, African populations contain more rare alleles than non-African populations (Wall and Przeworski 2000). We found that our African sample has TD values similar (i.e., slightly negative on average) to those reported in the literature for African populations (Przeworski et al. 2000). However, our non-African sample exhibits an unusual pattern, whereby the mean TD value is slightly more negative than the mean TD value in our African sample (−0.512 and −0.401, respectively). 
The more negative average TD value for non-Africans held even after subsampling 10 non-Africans to control for differences in sample sizes between Africans and non-Africans. This is driven, in part, by sharply negative TD values at MSN and DMD7 and fewer loci with positive TD outside Africa (Table 3).
In summary, there is substantial heterogeneity in patterns of variation among loci on the X chromosome, even when sampled in the same set of individuals. Previous observations of heterogeneity among loci have been interpreted as evidence for selection. For example, Wall and Przeworski (2000) tested whether patterns of variation observed at a number of nuclear loci (including some of those examined here) were compatible with a variety of demographic models. They found that the low TD values at some loci (including DMD7 and P2Y10) and the high TD values at other loci (including PDHA1 and DMD44) together were not consistent with a model of constant size or with a model of constant size followed by exponential growth. Even after incorporating more complex demographic components (such as a bottleneck or geographic structure), none of their models could account for the patterns of variation seen at all loci. To explain these contrasting patterns, they suggested that selection influenced variation at several of the loci studied.
Here we consider our observations in light of two alternative demographic models incorporating selection put forward by Wall and Przeworski (2000). The first is a model with long-term population growth (Harpending et al. 1998) that is expected to lead to an excess of rare variants (i.e., negative TD) at all loci that are not subject to selection. Under this model, the negative TD seen at some loci reflects population growth while the positive TD values, or those close to zero, observed at other loci reflect the action of diversity-enhancing selection (Harpending and Rogers 2000; Wall and Przeworski 2000; Rogers 2001; Excoffier 2002). The second model has constant population size (i.e., the onset of human population growth is too recent to leave a signature in the nuclear genome). Under this model, TD values near zero reflect constant population size while significantly negative TD values at other loci reflect the recent effects of directional selection.
Results from previous analyses of X-linked loci have been interpreted to support both models. Nachman and Crowell (2000) sampled variation at two DMD introns and showed that DMD7 has much lower levels of nucleotide diversity, many more rare polymorphisms, higher levels of linkage disequilibrium, and different African vs. non-African patterns, compared with DMD44. They suggested that patterns of variation at DMD44 are consistent with a neutral equilibrium model of molecular evolution and that those at DMD7 were shaped by recent directional selection (especially out of Africa). Harris and Hey (2001) compared patterns of variation at PDHA1 and FIX and posited that the much lower global nucleotide diversity and skew in the frequency distribution at FIX was the result of a history of positive directional selection, or background selection, acting at or near FIX. In an earlier report, Harris and Hey (1999) demonstrated that PDHA1 had an unusual pattern of sequence variation and suggested that this locus experienced some form of diversity-reducing selection outside of Africa. Similarly, Nachman et al. (2004) present evidence that MSN and ALAS2 have patterns of variation that reflect a history of diversity-reducing selection, with stronger effects outside of Africa.
In contrast, other authors reached very different conclusions on the basis of analyses of some of these very same loci, as well as others on the X chromosome (Harpending and Rogers 2000; Excoffier 2002). Rogers (2001) took the opposite view of Nachman and Crowell (2000) by suggesting that patterns of variation at DMD7 reflect demography (i.e., expansion of population size) while those at DMD44 reflect a long history of balancing selection. However, it is difficult to see how a long history of balancing selection could create the patterns of variation seen at DMD44 because there is little linkage disequilibrium among sites at DMD44. Wooding and Rogers (2000) argued that even though Tajima's D test did not reject the null hypothesis of constant population size at the P2Y10 locus (Kaessmann et al. 1999), the significantly negative Fu's Fs value at this locus does support a model of a Pleistocene population expansion. Yu et al. (2002b) interpreted a significant Fu and Li's D test to indicate a population expansion signature at DACH2, despite a failure of Tajima's D and Fu's Fs tests to reject neutrality. They suggested that ancient population subdivision must be taken into account to interpret these tests properly. We note that when the African and non-African P2Y10 and DACH2 data sets are considered separately, an excess of rare and/or singleton alleles is found only in the African samples (Table 3). Therefore, these loci do not support the simplest model of population expansion.
Non-African levels of polymorphism reject a neutral, constant-size model by the conservative HKA test (Table 4), indicating that the variance in polymorphism among the 15 X-linked loci is too large outside of Africa. Is this large variance in polymorphism across loci primarily the product of either natural selection acting upon a subset of loci or a history of nonequilibrium demography out of Africa (e.g., a population bottleneck)? Although Hudson et al. (1987) initially dismissed the possibility that a bottleneck systematically influences the HKA test, we have conducted coalescent simulations of intermediate-strength bottlenecks that result in an increase in the variance of polymorphism across unlinked loci. These simulations were not intended to estimate bottleneck parameters from the data, but instead were used to examine the effects of a simple population bottleneck on the outcome of the HKA test. Similar to the bottleneck model of Fay and Wu (1999), the model we examined assumes constant population size (N = 10^4) until 3000 generations ago, when a 40-fold bottleneck is imposed upon the population for 500 generations, after which the population reverts to its original size of 10^4 individuals (i.e., the bottleneck reduces the effective population size to 250 for 10,000 years). These bottleneck parameters were chosen to (1) produce the observed reduction in non-African diversity and (2) maximize the variance in diversity among loci (data not shown). Maximizing variance among loci produces a simulated null distribution of the HKA statistic that is likely to be conservative when assessing the impact of a bottleneck on the test. We also incorporated estimates of the population recombination rate at each locus (data not shown) in the 15-locus simulations, which were replicated 1000 times.
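The demographic scenario just described can be written down as a piecewise-constant population size history. The following is a minimal sketch and not the simulation code used in this study. It reads the population size as 10^4 (the value implied by a 40-fold reduction to 250), assumes a generation time of 20 years (implied by 500 generations spanning 10,000 years), and uses the standard diploid pairwise coalescence rate of 1/(2N) per generation, ignoring X-linked scaling and recombination. All function and variable names are illustrative.

```python
import random

N0 = 10_000   # population size outside the bottleneck (read as 10^4)
FOLD = 40     # strength of the bottleneck

# Epochs going backward in time: (start in generations ago, population size).
EPOCHS = [
    (0, N0),              # present back to the end of the bottleneck
    (2_500, N0 // FOLD),  # bottleneck, from 2500 to 3000 generations ago
    (3_000, N0),          # before the bottleneck
]


def pairwise_coalescence_time(rng, epochs=EPOCHS):
    """Draw the coalescence time (in generations) of two lineages."""
    for i, (start, size) in enumerate(epochs):
        end = epochs[i + 1][0] if i + 1 < len(epochs) else float("inf")
        # Exponential waiting time with rate 1/(2N). Memorylessness lets us
        # restart the clock at each epoch boundary.
        wait = rng.expovariate(1.0 / (2 * size))
        if start + wait < end:
            return start + wait
    raise RuntimeError("unreachable: the last epoch is unbounded")


def mean_coalescence_time(replicates, seed=1, epochs=EPOCHS):
    """Average pairwise coalescence time over independent replicates."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(rng, epochs) for _ in range(replicates))
    return total / replicates
```

Because expected pairwise diversity is proportional to mean coalescence time, comparing `mean_coalescence_time(5000)` with `mean_coalescence_time(5000, epochs=[(0, N0)])` gives a rough sense of how much diversity a bottleneck of this kind removes relative to a constant-size history.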
We find that our observed non-African HKA test statistic is still significantly too high (P = 0.038) when compared with the null distribution generated by the conservative bottleneck model (Figure 4). This result is compelling evidence that a simple population bottleneck out of Africa is insufficient to account for the increased variance in polymorphism across loci, although more complex demographic models might account for these observations.
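A P value of this kind is typically obtained as the fraction of simulated test statistics that are at least as large as the observed one. The sketch below illustrates that calculation under the assumption of a one-sided comparison; the exact convention used in this study is not stated, and the null values in the example are synthetic placeholders rather than simulation output.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical P value: share of null statistics >= observed."""
    if not null_values:
        raise ValueError("null_values must not be empty")
    exceed = sum(1 for value in null_values if value >= observed)
    return exceed / len(null_values)


# Synthetic placeholder null distribution with 1000 replicates.
placeholder_null = [float(i) for i in range(1000)]
# 38 of the 1000 placeholder values (962 through 999) are >= 962.
p = empirical_p_value(962.0, placeholder_null)
```

With 1000 replicates, the smallest nonzero P value that can be resolved this way is 0.001, and P = 0.038 corresponds to 38 replicates at or above the observed statistic.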
We also observed a positive relationship between recombination rate and nucleotide diversity (Figure 2). This relationship may be caused by either positive or negative selection at linked sites (Maynard Smith and Haigh 1974; Charlesworth et al. 1993), by variation in underlying mutation rate, or by some combination of these factors. A simple test of the idea that variation in underlying mutation rate is responsible for the correlation between nucleotide diversity and recombination rate is to compare recombination rate with interspecific divergence. Several different studies have documented a significant positive correlation between recombination rate in humans and interspecific divergence (Lercher and Hurst 2002; Waterston et al. 2002; Hardison et al. 2003; Hellmann et al. 2003) and, thus, it seems likely that variation in mutation rate accounts for some of the variation in nucleotide diversity. In this study we observed a significant positive correlation between nucleotide diversity and recombination rate but not between interspecific divergence and recombination rate (Figure 2), although both showed positive trends. The stronger association between nucleotide diversity and recombination rate here, compared with other studies, is noteworthy in two respects. First, many studies sample single nucleotide polymorphisms (SNPs) in a heterogeneous pool of individuals with small sample sizes instead of a common sample for all loci. For example, the average sample size for most genomic regions in the SNP consortium data that are analyzed in Lercher and Hurst (2002) and Waterston et al. (2002) is two (Altshuler et al. 2000). Here the effects of sampling can also be seen: the correlation between nucleotide diversity and recombination rate is stronger among the 10 loci sampled in the same set of individuals than among all 15 loci (Figure 2). A similar effect of sampling has been observed in D. melanogaster (Aquadro et al. 1994). 
Second, we have studied X-linked loci, while all previous studies have focused mainly or exclusively on autosomal loci. One interesting possibility is that selection at linked sites may be more important on the X chromosome than on the autosomes. While the effects of background selection are expected to be weaker on the X chromosome (Charlesworth et al. 1993), hitchhiking effects are expected to be stronger (discussed in Begun and Whitley 2000) as a consequence of either higher fixation rates (Charlesworth et al. 1987) or shorter sojourn times (Avery 1984).
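The comparison described above, between recombination rate and either nucleotide diversity or interspecific divergence, can be sketched as a rank correlation across loci. This is an illustration only: the choice of Spearman's coefficient is an assumption (the correlation statistic is not specified here), and the per-locus values in the example are made-up placeholders, not data from this study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder per-locus values (hypothetical, for illustration only).
recombination_rate = [0.2, 0.9, 1.5, 2.3, 3.1]
nucleotide_diversity = [0.03, 0.05, 0.04, 0.09, 0.12]
rho = spearman(recombination_rate, nucleotide_diversity)
```

Running the same function with divergence in place of diversity mirrors the test of the mutation-rate explanation: if mutation rate variation drove the pattern, divergence should correlate with recombination rate as well.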
We also found a positive relationship between recombination rate and frequency spectra in our non-African sample, but not in our African sample (Figure 3). Such a relationship is not expected under a neutral equilibrium model (Przeworski et al. 2001). One explanation for this observation is that diversity-reducing selective forces (i.e., hitchhiking or background selection) have led to an excess over neutral expectations of singletons at loci in regions of lower recombination (Charlesworth et al. 1993; Braverman et al. 1995). However, Andolfatto and Przeworski (2001) demonstrated that a similar positive correlation between the summary of the frequency spectrum of polymorphic mutations (both TD and FLD) and the recombination rate in D. melanogaster, while expected under simple hitchhiking models, was unlikely under a model of background selection. Neither is there an expectation that diversity-enhancing selection would lead to a positive correlation between FLD and recombination rate. Moreover, the expected signature of long-term balancing selection—a peak of polymorphism surrounding a selected site—has not been observed at loci with high levels of variation and positive TD or FLD values (Wall and Przeworski 2000). As mentioned above, the correlation between recombination rate and nucleotide diversity was also stronger in non-Africans compared with Africans. The combined data suggest that positive directional selection (i.e., hitchhiking) may be a more important factor influencing X chromosome variation outside Africa.
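Taking TD to denote Tajima's D, the statistic contrasts two estimates of diversity: the mean number of pairwise differences and the number of segregating sites scaled by the sample size. An excess of rare variants lowers the first relative to the second and makes D negative. The sketch below implements the standard formula from Tajima (1989); the function name and the input values are illustrative and are not taken from this study.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D for n sequences, S segregating sites, and mean pairwise
    differences (all counted per locus, not per site)."""
    if n < 4:
        # For n <= 3 the two estimators are always equal and D is undefined.
        raise ValueError("need at least 4 sequences")
    if seg_sites < 1:
        raise ValueError("need at least 1 segregating site")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    # Difference between the two diversity estimates, scaled by its
    # estimated standard deviation.
    difference = mean_pairwise_diff - seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return difference / math.sqrt(variance)


# Illustrative inputs: 10 sequences, 16 segregating sites, and a low mean
# number of pairwise differences, which yields a negative D.
d_example = tajimas_d(10, 16, 3.888889)
```

A negative value, as in this example, corresponds to the excess of rare mutations discussed above, while values near zero are what a neutral equilibrium model predicts on average.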
Finally, our data set included nine intronic regions within functional genes and one pseudogene; a pseudogene may be less perturbed by selection than the introns of genes. We chose the RRM2P4 pseudogene in particular because it maps to a region of high recombination and low gene density on Xq27.3 and thus should provide good estimates of neutral parameters. We found that levels and patterns of variation at this pseudogene are similar to those at other loci exhibiting high levels of variation (e.g., DMD44, DACH2, AMELX, APXL). This region has the third highest level of diversity, exhibits no skew in the frequency spectrum, and harbors similar levels of variation in African and non-African samples. This observation supports the hypothesis that similar patterns of variation at other high variation loci reflect neutral demographic processes.
If we accept that these five X-linked regions, as well as P2Y10 and TNFSF5, are relatively free from the influences of natural selection, then what can we discern about human demography from patterns of variation at these loci? There is only a minor reduction in non-African diversity (i.e., ∼20%), a slightly negative TD in Africa, and a TD close to zero out of Africa. These data do not provide evidence for long-term population growth outside Africa. Rather, they are consistent with a larger effective population size for Africans and the possibility that non-Africans experienced a phase of population size reduction during which rare variants were lost more quickly than common variants (Zietkiewicz et al. 1997; Przeworski et al. 2000). While it is possible that both diversity-reducing selection and population expansion have left signatures on X-linked loci, it is difficult to explain the heterogeneous patterns observed here by a simple model of population expansion without selection. We note that after removing the five loci showing low variation from our analyses (ALAS2, MSN, DMD7, L1CAM, and FIX) there is still considerable heterogeneity among loci, especially in the non-African samples. More realistic models of human demography might include more complex patterns of subdivision and population size changes (Pluzhnikov et al. 2002), changing migration rates over time (Wakeley 1999), and/or low levels of admixture with archaic Homo. Finally, the unexpected finding of several X-linked loci with a putative signature of selection (Przeworski 2002) is consistent with the possibility that the colonization of novel environments by modern humans as they migrated out of Africa in the last ∼50,000 years may have coincided with a burst of adaptive evolution (Payseur et al. 2002; Kayser et al. 2003; Mishmar et al. 2003).
Publication of this article was made possible by grants BCS-9906362 (to M.W.N. and M.F.H.) from the National Science Foundation (NSF) and GM-53566 from the National Institute of General Medical Sciences (to M.F.H.). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NSF or the National Institutes of Health.
Communicating editor: L. Excoffier
- Received December 5, 2003.
- Accepted April 13, 2004.
- Genetics Society of America