Genes responsible for human-specific phenotypes may have been under altered selective pressures in human evolution and thus exhibit changes in substitution rate and pattern at the protein sequence level. Using comparative analysis of human, chimpanzee, and mouse protein sequences, we identified two genes (PRM2 and FOXP2) with significantly enhanced evolutionary rates in the hominid lineage. PRM2 is a histone-like protein essential to spermatogenesis and was previously reported to be a likely target of sexual selection in humans and chimpanzees. FOXP2 is a transcription factor involved in speech and language development. Human FOXP2 experienced a >60-fold increase in substitution rate and incorporated two fixed amino acid changes in a broadly defined transcription suppression domain. A survey of a diverse group of placental mammals reveals the uniqueness of the human FOXP2 sequence and a population genetic analysis indicates possible adaptive selection behind the accelerated evolution. Taken together, our results suggest an important role that FOXP2 may have played in the origin of human speech and demonstrate a strategy for identifying candidate genes underlying the emergence of human-specific features.
In spite of the relatively young age of our species, we have many distinct morphological, physiological, and behavioral features that are not found in apes, most notably, bipedalism, a large brain, susceptibility to AIDS, speech, and higher-order cognitive function (Boyd and Silk 2000; McConkey et al. 2000; Varki 2000; Gagneux and Varki 2001). Understanding how and why these and other features unique to humans evolved is a key to disclosing the mystery of human origins and is of substantial medical importance (Gibbons 1998; McConkey et al. 2000; Varki 2000). Fortunately, most of the genetic bases of these features lie somewhere in the ∼3 billion nucleotides of our genome, a huge, albeit limited, pool in which to look for answers. Gagneux and Varki (2001) recently reviewed genetic differences between humans and great apes. Although many genetic changes that have occurred in the human lineage have been found, including chromosomal fusion, gene duplication, gene deletion/inactivation, nucleotide substitution, and change in gene expression, very few, if any, of these changes have been linked to specific phenotypes important to the origin and well-being of our species (Gibbons 1998; Gagneux and Varki 2001). With the availability of the human draft genome sequence, accumulation of ape DNA sequences, and rapid advances in molecular technology, calls have been made for systematic searches for genes that make us human (Gibbons 1998; McConkey et al. 2000).
We tackle this problem by comparing the rate of protein sequence evolution in the human lineage (since the human-chimpanzee split) with that in nonhuman mammals. This comparison is useful because phenotype-affecting genetic modifications can be subject to positive Darwinian selection, under which the rate of amino acid substitution can be greatly enhanced (Nei and Kumar 2000). A change in substitution rate may also result when the function of a protein shifts so that the selective pressure is either enhanced or relaxed (Nei and Kumar 2000). In the following, we report identification of two genes with significant rate enhancements in the hominid lineage and discuss their relevance to the origins of human-specific features.
MATERIALS AND METHODS
Database search: In our design of the rate comparison, orthologous protein sequences from humans (Homo sapiens), chimpanzees (Pan troglodytes), and, as an outgroup, mice (Mus musculus) are used (Figure 1A). Use of mice rather than primates for the outgroup makes the estimate of the substitution rate less subject to sampling errors because a long-term average is obtained. Also many more genes have been sequenced and functionally characterized for the mouse than for any other nonhuman mammal. It has been suggested that the average amino acid substitution rate is higher in rodents than in primates (Gu and Li 1992; but see Easteal et al. 1995). This will likely make our detection of accelerated human protein evolution more conservative. Here we focus on orthologous genes because a change in substitution rate after gene duplication (Lynch and Conery 2000) would complicate our analysis. Ideally, no gene duplication should be allowed in any branches of the tree of human, chimpanzee, and mouse (Figure 1A). However, duplications occurring in branches 5 and 2 have virtually no effects on our results, as we are largely concerned with branches 1, 3, and 4. Duplications in branch 4, or the rodent-specific duplications, have only small effects because a basal substitution rate in mammals can still be estimated relatively accurately. All annotated gene sequences in GenBank were screened to find cases satisfying the above criteria. Specifically, all annotated chimpanzee gene sequences were retrieved from GenBank. The translated protein sequences were BLASTed against the GenBank database to find the closest human and mouse sequences.
Various sources of information and analyses, including previous evolutionary analyses of the genes (Chen and Li 2001), functional data, UniGene search, human/mouse homology maps, and phylogenetic analysis, were used to determine that the sequences are orthologous and that no gene duplications have occurred in branches 1 and 3 of the tree in Figure 1A. Nevertheless, it is possible that some cases may still have undetected duplications in branch 1 or 3 or may include paralogous genes, due to incomplete genome sequences of human and mouse and limited genetic information on the chimpanzee. This did not have serious effects on our results because we were interested mainly in the very few cases showing significant rate changes; additional experiments and analyses could be conducted after initial identification of candidate genes.
Obtaining new chimpanzee sequences: In addition to the sequences retrieved from GenBank, we sequenced the coding regions of five chimpanzee genes for which the orthologous human and mouse sequences were available in GenBank. The five genes are BRCA2, CATSPER, FOXP2, RNASE4, and RNH. PCR primers were designed following the known human sequences and the chimpanzee genes were amplified by PCR and sequenced in both directions using an automated DNA sequencer.
Rate analysis: The obtained protein sequences were aligned using Clustal X (Thompson et al. 1997) and gaps were removed before rate analysis. Aligned proteins with lengths (before removal of gaps) of <100 amino acids were discarded. For each protein, the numbers of amino acid substitutions in branches 1, 2, and 3 + 4 are denoted by h, c, and m, respectively (Figure 1A). These numbers were derived from branch length estimates of the tree of orthologous human, chimpanzee, and mouse proteins. The branch lengths were estimated using the neighbor-joining method (Saitou and Nei 1987). Several distance measures were used, including the protein p-distance, Poisson distance, and gamma distance with the shape parameter of 2.0 (equivalent to Dayhoff distance; Nei and Kumar 2000). The results were found to be similar and p-distance results are presented as this distance is associated with a relatively low variance. Primates and rodents diverged ∼90 million years ago (MYA; Kumar and Hedges 1998; Archibald et al. 2001; Nei et al. 2001) and humans separated from chimpanzees ∼5.5 MYA (Chen and Li 2001; Stauffer et al. 2002). An acceleration index for the human lineage (branch 1) in comparison to the mammalian lineage before the human-chimpanzee split (branch 3 + 4) is defined by λ = (h/5.5)/[m/(2 × 90 - 5.5)] = 31.7h/m. In other words, if a protein evolves with a constant rate (i.e., λ = 1), the number of amino acid substitutions in branch 3 + 4 (m) is expected to be 31.7 times greater than that in branch 1 (h). Given h and m, the tail probability in a binomial distribution of B(h + m, 0.03056) is computed for testing the statistical significance of rate enhancement in the human lineage. Here, 0.03056 = 5.5/180 is the time span of branch 1 relative to that of branches 1 + 3 + 4. Similarly, an acceleration index for the chimpanzee lineage is defined by κ = (c/5.5)/[m/(2 × 90 - 5.5)] = 31.7c/m.
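The acceleration index and the binomial tail probability defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and h and m are assumed to be whole-number substitution counts, whereas the study derived them from branch length estimates.

```python
from math import comb

T_HUMAN = 5.5                        # MY since the human-chimpanzee split (branch 1)
T_SPLIT = 90.0                       # MY since the primate-rodent split
T_OTHER = 2 * T_SPLIT - T_HUMAN      # 174.5 MY covered by branches 3 + 4


def acceleration_index(h, m):
    """Return lambda = (h / 5.5) / (m / 174.5) for substitution counts h and m."""
    if m == 0:
        # The index is undefined (or unbounded) when branch 3 + 4 has no changes.
        return float("inf") if h > 0 else float("nan")
    return (h / T_HUMAN) / (m / T_OTHER)


def binomial_tail(h, m):
    """Return P(X >= h) for X ~ Binomial(h + m, 5.5/180)."""
    n = h + m
    p = T_HUMAN / (2 * T_SPLIT)      # 0.03056, the share of time in branch 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(h, n + 1))


if __name__ == "__main__":
    # Illustrative counts: two changes in branch 1 and one in branch 3 + 4.
    print(acceleration_index(2, 1), binomial_tail(2, 1))
```

With h = 2 and m = 1, the index is about 63.5 (63.4 if the rounded factor 31.7 is used) and the tail probability is about 0.0027.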
Computer simulation: To determine the frequency of type-I error (false-positive results) in the binomial test described above, we conducted a computer simulation. In the simulation, a constant substitution rate is used for branches 1, 3, and 4. Let this rate be r substitutions per amino acid site per million years (MY). Substitution rate variation among sites does not affect the simulation result, as r can also be regarded as the average substitution rate over the entire sequence. Given the length of a protein (n amino acids), the number of substitutions in branch 1 is a Poisson random variable with mean = 5.5nr and that for the branches 3 + 4 is a Poisson variable with mean = 174.5nr. These two random numbers were generated by computer and the binomial test was performed to see if the null hypothesis of rate constancy could be rejected. Such simulations were repeated 5000 times for each given parameter of nr. Chen and Li (2001) estimated that the average substitution rate between humans and chimpanzees is r = 0.013/(11 MY) = 0.00118 substitutions per amino acid site per million years. The average length for the 120 proteins examined in this study is ∼n = 350 amino acids. Thus, average nr is ∼350 × 0.00118 = 0.413 substitutions per sequence per million years for orthologous proteins of humans and chimpanzees. In fact, the average nr for the 120 genes, which was 0.323, may also be computed from the appendix. Considering that nr varies from 0 to 1.41 for the 120 genes, our simulation was conducted under a wide range of nr, from 0.04 to 4.
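The simulation described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the original program: the significance threshold `alpha`, the random seed, and all function names are hypothetical, and Poisson variates are drawn by counting exponential arrivals because the Python standard library has no Poisson sampler.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting exponential arrivals in unit time."""
    if mean <= 0:
        return 0
    count, t = 0, rng.expovariate(mean)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(mean)
    return count


def binomial_tail(h, total, p):
    """Return P(X >= h) for X ~ Binomial(total, p), summed in log space."""
    if h <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    tail = 0.0
    for k in range(h, total + 1):
        log_pmf = (math.lgamma(total + 1) - math.lgamma(k + 1)
                   - math.lgamma(total - k + 1)
                   + k * log_p + (total - k) * log_q)
        tail += math.exp(log_pmf)
    return min(tail, 1.0)


def false_positive_rate(nr, alpha=0.05, replicates=5000, seed=0):
    """Fraction of constant-rate replicates in which the binomial test rejects."""
    rng = random.Random(seed)
    p = 5.5 / 180.0                      # share of time in branch 1
    rejections = 0
    for _ in range(replicates):
        h = poisson(5.5 * nr, rng)       # substitutions in branch 1
        m = poisson(174.5 * nr, rng)     # substitutions in branches 3 + 4
        if binomial_tail(h, h + m, p) < alpha:
            rejections += 1
    return rejections / replicates


if __name__ == "__main__":
    for nr in (0.04, 0.4, 4.0):
        print(nr, false_positive_rate(nr, replicates=1000))
```

Because the counts are discrete, the realized rejection rate under the null hypothesis should not exceed the nominal threshold by more than sampling noise, and it tends to fall well below the threshold when nr is small.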
FOXP2 DNA sequencing and analysis: All 17 exons of the FOXP2 gene from the chimpanzee, pygmy chimpanzee, gorilla, and orangutan were PCR amplified and sequenced in both directions. The orthologous human (accession no. AF337817) and mouse (accession nos. AY079003 and NT_023632) sequences were obtained from GenBank. The orthology of the FOXP2 sequences was confirmed by phylogenetic analysis and observation of expected levels of synonymous nucleotide distances. Parsimony (Fitch 1971) and distance-based Bayesian (Zhang and Nei 1997) methods were used to infer numbers of synonymous and nonsynonymous nucleotide substitutions (Nei and Kumar 2000) in the FOXP2 gene tree of the above six species.
To determine the variability of the amino acid positions in which humans experienced substitutions, part of exon 7 of FOXP2 was PCR amplified and sequenced in both directions from an additional 24 mammals and the chicken (see Figure 3). The same region was also sequenced in 32 human individuals to determine the polymorphism at the aforementioned amino acid positions.
For population genetic analysis, 8679 nucleotides in intron 6 and 1305 nucleotides in intron 7 of the FOXP2 gene were sequenced in both directions in 10 human individuals. All singletons were confirmed by a second PCR reaction and sequencing. Nucleotide diversity (π) and Watterson’s θ were computed as described in Tajima (1989). Tajima’s (1989) and Fu and Li’s (1993) tests were conducted using 50,000 coalescent simulations. To test the neutral evolution hypothesis for the polymorphic data of FOXP2, we compiled available data on worldwide polymorphisms in other noncoding regions of the human genome that are at least 3000 nucleotides long and are not known to be under selection. Six data sets were found and the Hudson-Kreitman-Aguadé (HKA) test (Hudson et al. 1987) was used to compare FOXP2 with these neutral sequences. DnaSP (Rozas and Rozas 1999) was used for all population genetic analyses.
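For readers unfamiliar with the summary statistics named above, the following sketch computes nucleotide diversity, Watterson's θ, and Tajima's D from a small alignment using the standard formulas of Tajima (1989). It is not DnaSP and does not reproduce the coalescent-based P values; the function name and the toy sequences are hypothetical, and the input is assumed to be gap-free sequences of equal length.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return (pi per site, Watterson's theta per site, Tajima's D)."""
    n, length = len(seqs), len(seqs[0])
    # Number of segregating (polymorphic) sites.
    s = sum(1 for i in range(length) if len({q[i] for q in seqs}) > 1)
    # Mean number of pairwise differences between sequences.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta = s / a1
    if s == 0:
        return 0.0, 0.0, float("nan")    # D is undefined without polymorphism
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    d = (k - theta) / sqrt(e1 * s + e2 * s * (s - 1))
    return k / length, theta / length, d


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATA", "ATAA"]   # three singleton variants
    print(diversity_stats(toy))
```

An excess of rare variants, as in the toy alignment where every variant is a singleton, pulls π below θ and makes D negative.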
RESULTS AND DISCUSSION
Identification of proteins with accelerated evolution in the hominid lineage: Following the criteria set in the above section, we identified 115 genes from GenBank and obtained 5 additional genes from our laboratory that were suitable for the rate analysis. Figure 1B shows the distribution of the acceleration index λ for the 120 genes. Results from each of the 120 genes are given in the appendix. The mean λ is 1.13 ± 0.54 and the median is 0.39. The distribution is skewed because no amino acid substitutions are found in the human lineage in about one-third (39/120 = 0.325) of the genes examined. A majority of the genes have λ < 3.2. Only two genes have λ significantly >1 (P < 0.003 and P < 0.001, respectively; binomial test). Since 120 tests were conducted, it was necessary to evaluate whether there were false-positive cases. For this, we conducted a computer simulation. As described in the above section, our simulations were designed to examine the type-I error of the binomial test. The results suggest that the expected number of false-positive cases is ⪡1 for our sample of 120 genes (Table 1). Thus, our positive detection is unlikely to be a statistical artifact.
The two positive cases, PRM2 and FOXP2, are listed in Table 2. PRM2 (protamine 2) is a DNA-binding protein that replaces histones in spermatogenesis. It has been shown to evolve rapidly in humans and chimpanzees and was suggested to be a likely target of sexual selection (Wyckoff et al. 2000). Thus, it is not unexpected that PRM2 is identified in our analysis. However, the fact that both human and chimpanzee lineages experienced accelerated evolution (λ and κ are both significantly >1) suggests that the type of selection on PRM2 is probably not unique to humans. In contrast, FOXP2 has the highest λ (63.4) of all genes examined, while κ is 0 (Table 2), suggesting hominid-specific acceleration. We thus focus our analysis on FOXP2 in the remainder of the article.
Enhanced substitution rate of human FOXP2: FOXP2 belongs to the winged helix/forkhead class of transcription factors (Lai et al. 2001; Shu et al. 2001). It is expressed in multiple fetal and adult tissues with a high expression in certain regions of the fetal brain (Lai et al. 2001; Shu et al. 2001). Mutations in the gene cause a severe speech and language disorder in affected individuals despite their adequate intelligence and opportunity for language acquisition, suggesting that FOXP2 is specifically involved in speech development (Lai et al. 2001). FOXP2 is a conserved protein, with only three amino acid differences (and a 1-amino-acid insertion/deletion) between human and mouse in its entire length of 715 amino acids (Figure 2). We sequenced the coding regions of the FOXP2 gene from the chimpanzee, pygmy chimpanzee, gorilla, and orangutan and determined that two of the three aforementioned substitutions occurred in the hominid lineage and no substitutions occurred in chimpanzees (Figure 2). As indicated in Table 2, the acceleration in the evolution of human FOXP2 is statistically significant. This significance is also obtained (P = 0.001-0.006) when we consider ranges of divergence times for the human-chimpanzee split at 4.0-7.0 MYA (Chen and Li 2001; Brunet et al. 2002; Stauffer et al. 2002) and the primate-rodent split at 80-110 MYA (Kumar and Hedges 1998; Archibald et al. 2001; Nei et al. 2001).
The two amino acid substitutions in the human lineage are a Thr-to-Asn change at position 303 and an Asn-to-Ser change at position 325, both in exon 7. These substitutions are located in a broadly defined transcription repression domain (Shu et al. 2001; Figure 2), so it is possible that they affect the binding of FOXP2 with regulatory sequences of its target genes. If these substitutions are important to speech development, they should be fixed in normal humans and not be found in nonhuman organisms. Indeed, these substitutions are shared by all 32 normal humans surveyed (9 African Americans, 10 Caucasians, 9 Asians, and 4 Amerindians), but by none of the 29 nonhuman species examined. These species include a bird and 28 placental mammals from 12 representative orders (Figure 3). Interestingly, the Asn-to-Ser substitution also occurred independently in carnivores, suggesting that this substitution alone is not sufficient for the origin of speech and language.
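The sensitivity of the FOXP2 result to the assumed divergence times can be explored with a short sketch. It assumes two human-lineage substitutions and one substitution in the remaining branches (the three human-mouse differences described above, taken as whole-number counts), and it evaluates the binomial tail probability only at the end points of the two ranges; the function names are hypothetical.

```python
from math import comb


def tail_probability(h, m, t_human, t_split):
    """P(X >= h) for X ~ Binomial(h + m, t_human / (2 * t_split))."""
    n = h + m
    p = t_human / (2 * t_split)      # share of total time in the human branch
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(h, n + 1))


def p_value_range(h, m, human_times, split_times):
    """Smallest and largest tail probability over all combinations of dates."""
    values = [tail_probability(h, m, th, ts)
              for th in human_times for ts in split_times]
    return min(values), max(values)


if __name__ == "__main__":
    low, high = p_value_range(2, 1, (4.0, 7.0), (80.0, 110.0))
    print(f"{low:.4f} {high:.4f}")
```

The tail probability grows with the share of time assigned to the human branch, so the extremes fall at the corners of the two ranges: roughly 0.001 for a 4.0-MYA split with a 110-MYA primate-rodent divergence and roughly 0.006 for 7.0 and 80 MYA.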
Driving forces behind the accelerated evolution of human FOXP2: It would be interesting to identify the driving force behind the two amino acid substitutions and the accelerated evolution of human FOXP2. There are three possibilities: enhanced mutation rate, relaxed purifying selection, and positive selection. Because synonymous nucleotide changes are usually immune to selection, the rate of synonymous substitutions can be used to measure the mutation rate (Nei and Kumar 2000). Using parsimony, we determined the number of synonymous substitutions in each branch of the FOXP2 gene tree of five hominoids and mouse (Figure 4). It can be seen that the number of synonymous substitutions in the human lineage (two) is smaller than that in the two chimpanzee lineages (three and four, respectively). The number of synonymous substitutions per MY is also smaller in the human lineage (2/5.5 MY = 0.36) than in the lineage before the human-chimpanzee separation ([2.5 + 4.5 + 127.5]/[90 MY × 2 - 5.5 MY] = 0.77 for the branches linking node A and mouse; see Figure 4). Thus, there is no indication of enhanced mutation rate at FOXP2 in the human lineage. This conclusion is strengthened as the true number of synonymous substitutions is likely to be higher than the parsimony estimate for the long branch leading to the mouse, but not for the short branches within hominoids. Use of Bayesian estimates of ancestral sequences confirmed this result. Furthermore, the ratio of nonsynonymous substitutions to synonymous substitutions in the human lineage (2/2 = 1; see Figure 4) is significantly greater than the ratio in the branches linking node A and mouse (1/[2.5 + 4.5 + 127.5] = 0.007; P < 0.002, Fisher’s exact test; Zhang et al. 1997), suggesting that the rate difference is due to a difference in selection. It is unlikely, however, that the functional constraint and purifying selection on FOXP2 have been relaxed in humans, as mutations show severe deleterious effects (Lai et al. 2001).
Consistent with the existence of strong purifying selection, no amino acid polymorphisms in FOXP2 were found in a survey of 48 humans (Newbury et al. 2002). Thus, positive selection remains as the most likely cause of the accelerated evolution of human FOXP2.
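The comparison of nonsynonymous and synonymous counts between the human lineage and the branches linking node A and mouse can be illustrated with a one-sided Fisher's exact test. The sketch below makes two assumptions that the text does not state: the fractional synonymous count of 134.5 is rounded down to 134, and the test is one-sided. The function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2 x 2 table [[a, b], [c, d]] with fixed margins."""
    row1 = a + b                     # all substitutions in the first lineage
    col1 = a + c                     # all nonsynonymous substitutions
    total = a + b + c + d
    upper = min(row1, col1)
    favorable = sum(comb(col1, x) * comb(total - col1, row1 - x)
                    for x in range(a, upper + 1))
    return favorable / comb(total, row1)


if __name__ == "__main__":
    # Human lineage: 2 nonsynonymous, 2 synonymous.
    # Node A to mouse: 1 nonsynonymous, about 134 synonymous (134.5 rounded).
    print(fisher_one_sided(2, 2, 1, 134))
```

Under these assumptions the probability is about 0.0019; rounding the synonymous count up to 135 instead gives about 0.0018, so the choice of rounding does not change the conclusion.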
We noted, however, that the rate ratio of nonsynonymous to synonymous substitutions per site is not >1 in the human FOXP2 lineage. This is likely due to the fact that FOXP2 is an overall conserved protein and many sites are under purifying selection. Under such circumstances, population genetic data may provide useful information on the evolutionary force. We therefore sequenced 9984 nucleotides in introns 6 and 7 of the FOXP2 gene from 10 humans (3 African Americans, 3 Caucasians, 3 Asians, and 1 Amerindian) and one chimpanzee (Table 3). Introns 6 and 7 are adjacent to exon 7, where the two amino acid substitutions occurred in humans (Figure 2). By tight linkage to exon 7, these intron sequences may preserve information on the fixation process of the amino acid changes. For comparison, we also compiled available data on worldwide polymorphisms in other noncoding regions of the human genome that are at least 3000 nucleotides long and are not known to be under selection. We found that the level of polymorphism is lower in FOXP2 introns than in any other neutral noncoding regions examined (Table 4). An HKA neutrality test comparing the intra- and interspecific sequence variations between loci (Hudson et al. 1987) yielded a very significant result when FOXP2 introns were compared with all other regions combined (P < 0.00001; Table 4). When these regions were compared individually with FOXP2, all indicated a lower-than-expected polymorphism in FOXP2 and four out of six cases showed statistical significance (Table 4). Mutation-rate variation among loci would not result in significant HKA test results (Hudson et al. 1987). Population demographic changes cannot explain them either, because they would have affected all loci in a similar way (Hudson et al. 1987). Rather, these comparisons suggest background selection and/or selective sweeps.
Here background selection refers to purifying selection on deleterious mutations in tightly linked exons and selective sweep refers to quick fixation of advantageous mutations in these exons. These events, if recent enough, can lead to a reduced present-day polymorphism in introns 6 and 7 (Maynard Smith and Haigh 1974; Charlesworth et al. 1993). Consistent with the HKA test results, Tajima’s D (-1.36, P = 0.076) and Fu and Li’s F* (-1.81, P = 0.064) are both negative for the FOXP2 intron data, although they are only marginally significant. Note that these tests are conservative as a recombination rate of zero was assumed in the coalescent simulation.
If the nonneutral pattern of introns 6 and 7 is due to background selection, the selection intensity must be high, because weak background selection is known to be ineffective in reducing the polymorphic level. This suggests that the adjacent exons must be under strong functional constraints with no relaxed purifying selection, which would imply that positive selection is the only possible explanation for the accelerated protein evolution. If a relatively recent selective sweep caused the low polymorphism, at least one of the two amino acid changes in exon 7 must be advantageous because no other amino acid substitutions occurred in the evolution of human FOXP2 and no other functional genes are located within 100 kb of FOXP2 exon 7. Taken together, unless positive selection is invoked, one cannot explain the accelerated evolution of FOXP2 protein and low polymorphism of introns simultaneously. The finding that FOXP2 is critical to speech and language development (Lai et al. 2001) does not by itself demonstrate the role of this gene in the origin of human speech, because the function of FOXP2 could have remained unchanged during human evolution while other speech-related genes changed. However, the revelation of significant acceleration and positive selection in human FOXP2 suggests functional and fitness relevance of the two amino acid substitutions and provides support for the role of this gene in the evolution of speech and language. Interestingly, the notion of selection is consistent with the belief that the origin of language is an adaptation (Pinker and Bloom 1990; Boyd and Silk 2000). In the future, it would be interesting to examine the exact functional effects of the two amino acid substitutions of human FOXP2 by in vitro assays of protein function as well as characterization of human phenotypes of reverse mutations.
If the lower-than-expected nucleotide diversity in FOXP2 introns suggested by HKA tests and D and F* statistics is indeed a result of a relatively recent selective sweep, the sweep probably occurred no earlier than 0.5 N generations ago, because the signal of a sweep is unlikely to last longer than that (Simonsen et al. 1995). Here N is the effective population size of humans and is generally thought to have been ∼10,000 (Takahata 1993). Thus, the sweep would have occurred no earlier than 5000 generations, or ∼100,000 years, ago. This estimate is within the wide window of 40,000 years to 4 MYA during which human languages are believed to have emerged (Boyd and Silk 2000). A paleo-population genetic study (Lambert et al. 2002) may more accurately define the timing and process of the two amino acid substitutions in humans.
Perspective: In this study we focused on identification of proteins with accelerated evolution in the hominid lineage. Other strategies that may also be used in the search for genetic bases of uniquely human features include identifying human genes that are under positive selection, human-specific gene duplications, deletions or deactivations, and changes in gene expression (Gagneux and Varki 2001; Enard et al. 2002). Unlike these methods, our approach is useful when the phenotype-affecting genetic changes are simple amino acid substitutions. Our computer simulation showed that unless the substitution rate per sequence (nr) is high, our rate-constancy test is quite conservative. While this property somewhat reduces the power of our approach, it also makes our claims more secure. In other words, the positively identified cases have a high chance of being biologically meaningful. At present, only a small number of chimpanzee genes have been sequenced, and only 120 genes, or ∼0.35% of the genome, have been analyzed here. As the chimpanzee genome sequencing project (Fujiyama et al. 2002) proceeds, many more genes affecting uniquely human features may be found by this and other methods.
We thank S. Hinshaw, A. Rooney, P. Tucker, S. Yokoyama, Y. Zhang, and the University of Michigan Museum of Zoology Mammal Division for providing DNA and animal tissue samples. We thank M. Nei, P. Tucker, and two anonymous reviewers for their comments on earlier versions of the manuscript. This work was supported by a startup fund of the University of Michigan to J.Z.
Communicating editor: S. Yokoyama
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AY143178-AY143181 and AF539547-AF539550.
- Received August 15, 2002.
- Accepted September 6, 2002.
- Copyright © 2002 by the Genetics Society of America