The worldwide pattern of single nucleotide polymorphism (SNP) variation is of great interest to human geneticists, population geneticists, and evolutionists, but remains incompletely understood. We studied the pattern in noncoding regions, because they are less affected by natural selection than are coding regions. Thus, it can reflect better the history of human evolution and can serve as a baseline for understanding the maintenance of SNPs in human populations. We sequenced 50 noncoding DNA segments each ∼500 bp long in 10 Africans, 10 Europeans, and 10 Asians. An analysis of the data suggests that the sampling scheme is adequate for our purpose. The average nucleotide diversity (π) for the 50 segments is only 0.061% ± 0.010% among Asians and 0.064% ± 0.011% among Europeans but almost twice as high (0.115% ± 0.016%) among Africans. The African diversity estimate is even higher than that between Africans and Eurasians (0.096% ± 0.012%). From available data for noncoding autosomal regions (total length = 47,038 bp) and X-linked regions (47,421 bp), we estimated the π-values for autosomal regions to be 0.105, 0.070, 0.069, and 0.097% for Africans, Asians, Europeans, and between Africans and Eurasians, and the corresponding values for X-linked regions to be 0.088, 0.042, 0.053, and 0.082%. Thus, Africans differ from one another slightly more than from Eurasians, and the genetic diversity in Eurasians is largely a subset of that in Africans, supporting the out of Africa model of human evolution. Clearly, one must specify the geographic origins of the individuals sampled when studying π or SNP density.
THERE has been much interest in single nucleotide polymorphisms (SNPs) in human populations because such data are useful for studying human evolution and the mechanism of maintenance of genetic variability in human populations and for identifying genes associated with complex disease (Harding et al. 1997; Nickerson et al. 1998; Zietkiewicz et al. 1998; Cargill et al. 1999; Halushka et al. 1999; Harris and Hey 1999; Jaruzelska et al. 1999; Kaessmann et al. 1999; Rieder et al. 1999; Nachman and Crowell 2000). Despite this interest, the worldwide pattern of SNP variation remains incompletely understood. This is especially true for noncoding regions because such regions are medically less interesting. However, data from noncoding regions are less affected by natural selection than data from coding regions and so can reflect more accurately the history of human evolution. Moreover, the pattern of SNP variation in noncoding regions can serve as a baseline for understanding the maintenance of SNPs in human populations. For these reasons, we have obtained SNP data from 50 noncoding regions in Africans, Europeans, and Asians and have estimated the levels of nucleotide diversity within and between populations. To strengthen our conclusions, we have also used data from GenBank and the literature.
MATERIALS AND METHODS
DNA samples: The 10 Africans used were 1 Biaka Pygmy, 1 Mbuti Pygmy, 1 Ghanaian, 1 Kikuyu, 1 !Kung, 1 Luo, 2 Nigerians (Yoruba and Rivers), 1 South African Bantu speaker, and 1 Zulu (also a South African Bantu speaker); the 10 Europeans were 1 Finnish, 1 French, 1 German, 1 Hungarian, 1 Italian, 1 Portuguese, 1 Russian, 1 Spanish, 1 Swedish, and 1 Ukrainian; and the 10 Asians were 1 Cambodian, 2 Chinese (North and South), 1 Han Taiwanese, 2 Indians (Punjab and Bengal), 1 Japanese, 1 Mongolian, 1 Vietnamese, and 1 Yakut. As every segment studied is autosomal, the number of sequences studied for each segment is 60 (20 for each continent studied).
Selection of DNA segments: Fifty noncoding, nonrepetitive genomic segments (each ∼1 kb), which covered almost all autosomes, were selected randomly with reference to the Ge[...]; see Chen and Li (2001) for details. All of them were chosen to avoid coding regions and close linkage to any coding region. In each segment and its nearby regions there was no registered gene in GenBank and no potential coding region was detected by either GenScan or GRAIL-EXP.
PCR amplification and DNA sequencing: Touchdown PCR (Don et al. 1991) was used and the reactions were carried out following the conditions described previously (Zhao et al. 2000). The PCR products were purified with the Wizard PCR Preps DNA purification resin kit (Promega, Madison, WI). Sequencing reactions were performed according to the protocol of ABI Prism BigDye Terminator sequencing kits (Perkin-Elmer, Norwalk, CT) modified by quarter reaction. The extension products were purified with Sephadex G-50 (DNA grade; Pharmacia, Piscataway, NJ) and run on an ABI 377XL DNA sequencer using 4.25% gels (Sooner Scientific). About 500 bp of each segment was sequenced.
ABI DNA Sequence Analysis 3.0 was used for lane tracking and base calling. The data were then proofread manually and heterozygous sites were detected as double peaks. The forward and reverse sequences of each individual were assembled automatically using SeqMan in DNASTAR. The assembled files were carefully checked by eye. Fluorescent traces for each variant site were rechecked in all individuals. All singletons, that is, variants that appear only once in the total sample, were verified by PCR reamplification and resequencing of the PCR products in both directions.
Data analysis: The sequences were aligned with SeqMan in the DNASTAR package or with the DAMBE package (Xia 2000). Nucleotide diversity values were calculated using DNASP (Rozas and Rozas 1999), DAMBE, and our own programs.
RESULTS AND DISCUSSION
Distribution of SNPs: A total of 146 SNPs were found in the total sample; 53 of them were observed only once (i.e., singletons) and 22 only twice (doubletons). The number of variant sites found in the African sample was 118, of which 68 (36 singletons, 15 doubletons, and 17 others) were not found in the Eurasian sequences (i.e., they were unique). In contrast, in the Eurasian sample only 78 variant sites were found and only 28 of them (17 singletons, 4 doubletons, and 7 others) were unique, though the combined sample size was twice the African sample size. Thus, beyond the 50 variants already observed in the African sample, the combined Eurasian sample contains only an additional 17 singletons and 11 nonsingleton variants. The high frequencies of singletons in the African and Eurasian samples are similar to those observed in other studies (Kaessmann et al. 1999; Zhao et al. 2000; Yu et al. 2001). Note that in a neutral Wright-Fisher population with mutation parameter θ, the expected number of mutations of size i in a random sample of n sequences is θ/i (Fu 1995). So the number of singletons should be twice the number of doubletons and thrice the number of tripletons. In our total sample we found 53 singletons, 22 doubletons, and 7 tripletons. Therefore, there is an excess of singletons, which suggests a population expansion in the recent past.
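The θ/i expectation can be illustrated with a short calculation. The sketch below is not the authors' analysis: it assumes, purely for illustration, that θ is fitted from the three reported size classes alone (so that the expected counts sum to the observed 82), and it takes the θ/i expectation at face value without distinguishing folded from unfolded spectra. The function name `expected_counts` is hypothetical.

```python
def expected_counts(observed):
    """Expected counts for size classes 1..k under the neutral theta/i rule.

    Illustrative assumption: theta is chosen so that the expected counts
    sum to the observed total over these k classes only.
    """
    k = len(observed)
    weights = [1.0 / i for i in range(1, k + 1)]   # 1, 1/2, 1/3, ...
    theta_hat = sum(observed) / sum(weights)       # fitted theta (illustrative)
    return [theta_hat * w for w in weights]


# Singletons, doubletons, and tripletons reported for the total sample.
observed = [53, 22, 7]
expected = expected_counts(observed)

for size, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"size {size}: observed {obs}, expected {exp:.1f}")
```

Under this simple fit the observed singleton count exceeds its expectation and the tripleton count falls below it, which is the direction of skew described in the text.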
Nucleotide diversity: Nucleotide diversity (π) is defined as the average number of nucleotide differences per site between two randomly chosen sequences in a population. The π-value fluctuates greatly among the segments studied (Table 1). The range of π is from 0 (5 segments) to 0.27% in the total sample, from 0 (5 segments) to 0.58% in the African sample, from 0 (19 segments) to 0.27% in the Asian sample, and from 0 (18 segments) to 0.29% in the European sample. Such large fluctuations are not surprising because the nucleotide diversity in a short DNA region is subject to strong stochastic effects. In addition, variation in π may also arise from different mutation rates among different segments, although we found no correlation between π and the divergence between human and ape sequences. The average π-values are only 0.061% ± 0.010% among Asians and 0.064% ± 0.011% among Europeans, but almost twice as high among Africans (0.115% ± 0.016%); π is 0.088% ± 0.011% for the total sample. The average π-value within Africans is actually somewhat higher than that between Africans and Eurasians (0.096% ± 0.012%). In other words, Africans differ on average more among themselves than from Eurasians.
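As a concrete illustration of the definition of π, the following minimal sketch computes the average pairwise difference per site for a set of aligned sequences. It assumes equal-length sequences with no gaps or missing data, and the toy sequences are invented for the example; it is not the code used in this study, which relied on DNASP, DAMBE, and in-house programs.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average number of pairwise differences per site (pi).

    Assumes aligned, equal-length sequences without gaps or missing data.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    pairs = list(combinations(sequences, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Toy alignment (hypothetical data): four sequences of 10 sites.
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]
pi_toy = nucleotide_diversity(toy)
print(f"pi = {pi_toy:.4f} ({100 * pi_toy:.2f}%)")
```

In the toy alignment the six sequence pairs differ at 6 sites in total over 10 sites each, so π = 6/60 = 0.1, far above the values near 0.1% reported here for human noncoding DNA.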
Adequacy of sampling scheme and sample size: We now consider the adequacy of our sampling scheme. The fact that the individuals used were chosen to cover various geographic areas and ethnic backgrounds in each of the three continents studied may tend to overestimate π, whereas the inclusion of the two sequences within each individual may tend to underestimate π. The two tendencies should be reflected in between-individual π-values (πb) and in within-individual π-values (πw), respectively, and the average πb and πw can be taken as an upper and a lower bound of the true π-value. To simplify the analysis, we concatenate the segments in an individual in a random manner into two continuous sequences. For the African sequences the distribution of πb-values, which ranges from 0.059 to 0.187%, is only somewhat wider than that of the 10 πw-values, which ranges from 0.059 to 0.152%. Therefore, the average πb (0.115%) is only slightly higher than the average πw (0.108%), implying that the geographic locations of the individuals sampled have little effect on the average π-value. For the Asian sample there are two very low πw-values (0.023% for the North Chinese and the Bengal individuals) and the average πw-value (0.051%) is substantially lower than the average πb (0.063%). The average πw without the two outliers becomes 0.058%, which is similar to the average πb (0.063%). So, our sampling scheme in Asia may cause at most only a minor overestimate of the true average π-value. For the European sample, the average πw-value is only 0.049% and, after excluding the two lowest values (0.027% for the Ukrainian and 0.031% for the Russian), it becomes 0.054%, which is still not close to the between-individual average of 0.066%. This comparison suggests that our sampling scheme may have inflated somewhat the average π-value for the Europeans. However, as the final estimate of 0.063% should have been compensated to some extent by the πw-values, which tend to be low, it should not differ much from the true value.
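The distinction between πw and πb can be sketched as follows. This is one plausible reading of the comparison, not the authors' program: πw is taken as the per-site difference between the two sequences of one individual, and πb as the per-site difference between sequences drawn from two different individuals. The helper names and the toy data are hypothetical. The function `pooled_pi` uses the fact that among the 2n sequences of n individuals there are n within-individual pairs and 2n(n − 1) between-individual pairs.

```python
from itertools import combinations


def per_site_diff(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def within_and_between(individuals):
    """Return (mean pi_w, mean pi_b) for a list of (seq1, seq2) tuples."""
    pi_w = [per_site_diff(s1, s2) for s1, s2 in individuals]
    pi_b = []
    for ind_a, ind_b in combinations(individuals, 2):
        # Four cross comparisons for each pair of individuals.
        pi_b.extend(per_site_diff(a, b) for a in ind_a for b in ind_b)
    return sum(pi_w) / len(pi_w), sum(pi_b) / len(pi_b)


def pooled_pi(mean_w, mean_b, n):
    """Overall pi for 2n sequences as a pair-count weighted average."""
    within_pairs = n
    between_pairs = 2 * n * (n - 1)
    return (within_pairs * mean_w + between_pairs * mean_b) / (
        within_pairs + between_pairs
    )


# Toy data (hypothetical): three individuals, two sequences each.
individuals = [
    ("AAAAAAAAAA", "AAAAAAAAAT"),
    ("AAAAAAAAAA", "AAAAAAAAAA"),
    ("AAAAAAAAAT", "AAAAAAAAAT"),
]
mean_w, mean_b = within_and_between(individuals)
print(f"mean pi_w = {mean_w:.4f}, mean pi_b = {mean_b:.4f}")
print(f"pooled pi = {pooled_pi(mean_w, mean_b, len(individuals)):.4f}")
```

With 10 individuals the weights are 10 within-individual pairs against 180 between-individual pairs, so under this reading the pooled estimate lies much closer to the average πb than to the average πw.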
Next let us consider whether the sample size of 10 individuals from each continent studied is sufficiently large for our purpose. To answer this question, let us consider subsamples of the original samples; we consider concatenated sequences. Figure 1 shows the distribution of the average π-values when each subsample in a continent contains only 6 individuals from the original sample of 10; there are 210 such possible subsamples. It is seen that the three distributions are rather concentrated. For example, in each distribution none of the average π-values deviate significantly from the mean π-value of the total sample and the proportions of average π-values in subsamples that deviate more than 10% from the mean are 18.1, 7.6, and 20.4% for the African, Asian, and European samples, respectively. This analysis suggests that even a sample of 6 independent individuals would usually give an estimate of π in a continent reasonably close to the true value.
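The subsampling procedure can be sketched as an exhaustive enumeration of all subsets of 6 individuals out of 10. The code below is an illustrative reconstruction under stated assumptions: the sequences are invented toy data, each individual contributes two concatenated sequences, and the reference value is taken to be π of the full sample. The helper names (`pi`, `subsample_pis`, `fraction_deviating`, `mutate`) are hypothetical.

```python
from itertools import combinations
from math import comb


def pi(sequences):
    """Average pairwise differences per site for aligned sequences."""
    pairs = list(combinations(sequences, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * len(sequences[0]))


def subsample_pis(individuals, k):
    """pi for every subsample of k individuals (both sequences of each)."""
    values = []
    for subset in combinations(individuals, k):
        seqs = [s for ind in subset for s in ind]
        values.append(pi(seqs))
    return values


def fraction_deviating(values, reference, tolerance=0.10):
    """Share of values differing from the reference by more than tolerance."""
    count = sum(abs(v - reference) > tolerance * reference for v in values)
    return count / len(values)


def mutate(base, positions):
    """Return a copy of base with 'T' placed at the given positions."""
    chars = list(base)
    for p in positions:
        chars[p] = "T"
    return "".join(chars)


# Toy data (hypothetical): 10 individuals, two 30-site sequences each.
base = "A" * 30
individuals = [
    (mutate(base, [k]), mutate(base, [k % 3, 10 + k % 5])) for k in range(10)
]

reference = pi([s for ind in individuals for s in ind])
values = subsample_pis(individuals, 6)
frac = fraction_deviating(values, reference)

print(f"number of subsamples: {len(values)} (C(10, 6) = {comb(10, 6)})")
print(f"fraction deviating by more than 10%: {frac:.3f}")
```

Because the number of subsamples is small, full enumeration is feasible and no random resampling is needed.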
Worldwide pattern of SNP variation and implications for human evolution: A more general pattern of worldwide nucleotide diversity is shown in Table 2, which includes the present data and data from the literature and our unpublished studies. For the autosomal regions included (47,038 bp) the π-values are 0.105% for Africans, 0.070% for Asians, 0.069% for Europeans, and 0.097% between Africans and Eurasians. For X-linked regions (47,421 bp) the corresponding values are 0.088, 0.042, 0.053, and 0.082%. As in many previous studies (Vigilant et al. 1991; Hammer et al. 1997; Nickerson et al. 1998; Halushka et al. 1999; Kaessmann et al. 1999; Ingman et al. 2000; Jorde et al. 2000; Underhill et al. 2000; Zhao et al. 2000; Yu et al. 2001), the nucleotide diversity is considerably higher in Africans than in non-Africans, probably because of a larger effective population size or longer evolutionary history (Stoneking et al. 1997; Ingman et al. 2000). Asians and Europeans have very similar π-values and are considerably closer to each other than to Africans.
The genome-wide nucleotide diversity has been estimated to be 0.075% for the National Institutes of Health (NIH) diversity panel (International SNP Working Group 2001; see also Venter et al. 2001). The NIH panel consists of 90 individuals drawn from European Americans, African Americans, Hispanic Americans, Native Americans, and Asian Americans, while our estimate of 0.089% is from a worldwide sample (Table 2) and is slightly higher. The SNP density is 1 SNP/1331 bp for the former data and 1 SNP/1123 bp for our data. For X-linked sequences the SNP density is 1 SNP/1408 bp (Table 2). It is clear from Table 2 that the geographic origins of the individuals sampled should be specified when considering nucleotide diversity or SNP density. For example, the SNP density for autosomal regions is 1 SNP/952 bp for Africans but only 1 SNP/1430 bp for Asians and Europeans.
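The SNP densities quoted here appear consistent with taking the reciprocal of π, that is, the expected spacing between differences when two chromosomes are compared. The sketch below makes that assumption explicit; it is a reading of the reported figures rather than a documented method, and the function name `bp_per_snp` is hypothetical. Small discrepancies from the published values are expected because the inputs are rounded π-values.

```python
def bp_per_snp(pi_percent):
    """Expected base pairs per SNP between two sequences, given pi in percent."""
    if pi_percent <= 0:
        raise ValueError("pi must be positive")
    return 100.0 / pi_percent


# pi-values (in percent) quoted in the text.
estimates = {
    "NIH diversity panel": 0.075,
    "worldwide sample": 0.089,
    "Africans, autosomal": 0.105,
    "Asians, autosomal": 0.070,
}

for label, pi_percent in estimates.items():
    print(f"{label}: about 1 SNP per {bp_per_snp(pi_percent):.0f} bp")
```

For example, π = 0.105% corresponds to about one difference every 952 bp, matching the density reported for Africans in autosomal regions.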
Interestingly, both autosomal and X-linked sequence data show higher DNA variation within Africans than between Africans and Eurasians (Table 2), contrary to the general observation of lower within-population than between-population differences in population genetics. This finding implies that Africans differ on average more among themselves than from Eurasians. Thus, with the exception of many minor unique variants, the nucleotide diversity in Eurasians is essentially a subset of that in Africans, as suggested by the observation that both Y-linked and autosomal haplotypes found outside of Africa were often a subset of the collection of haplotypes found in Africa (Armour et al. 1996; Tishkoff et al. 1996, 2000; Hammer et al. 1997; Underhill et al. 2000). Our finding is more in agreement with the out of Africa model of human evolution than with the multiregional model because it is consistent with the view that modern humans originated in Africa and that a smaller subset of this population later migrated to other parts of the world (see Stoneking et al. 1997 and references therein). During and after the migration some variants would have been lost and, as the separation time is still short, non-Africans have not yet acquired many high-frequency variants, though they might have derived some variants from indigenous archaic populations in Asia and Europe. For these reasons, the genetic differences between non-Africans and Africans are on average smaller than the genetic differences within Africans.
This work was supported by National Institutes of Health grants GM55759, HD38287, GM30998, and GM59290.
Communicating editor: Y.-X. Fu
- Received August 30, 2001.
- Accepted February 19, 2002.
- Copyright © 2002 by the Genetics Society of America