Few loci have been measured for DNA polymorphism and divergence in several species. Here we report such data from the protein-coding region of xanthine dehydrogenase (Xdh) in 22 species of Drosophila. Many of our samples were from closely related species, allowing us to confidently assign substitutions to individual lineages. Surprisingly, Xdh appears to be fixing more A/T mutations than G/C mutations in most lineages, leading to evolution of higher A/T content in the recent past. We found no compelling evidence for selection on protein variation, though some aspects of the data support the notion that a significant fraction of amino acid polymorphisms are slightly deleterious. Finally, we found no convincing evidence that levels of silent heterozygosity are associated with rates of protein evolution.
THE nucleotide substitution process may be affected by many biological and statistical phenomena. The complexity of the process is probably a major reason why our understanding of it is only rudimentary. Discussion of the possible role of effective population size (NE) on evolutionary rates has been ongoing for many decades (Wright 1931, 1932; Fisher 1958; Kimura 1983; Ohta 1992; Gillespie 1999, 2000, 2001), yet has resulted in little clarity. For example, the neutral model of molecular evolution posits that there is no effect of population size on protein evolution, whereas simple models of adaptive protein evolution predict a positive correlation between substitution rate and population size (e.g., Kimura 1983). Models invoking a significant contribution of slightly deleterious alleles to evolution may predict a negative correlation between population size and rates of protein evolution (e.g., Ohta 1992). Finally, recent theoretical analyses suggest that, contrary to previous results, substitution rates for sites under selection need not be strongly dependent on the population size (Cherry 1998; Gillespie 1999, 2000, 2001). In spite of this long history of theoretical analysis, there have been few attempts to empirically estimate effective sizes through studies of DNA polymorphism and then to investigate whether patterns of nucleotide substitution are related to these estimated population sizes.
Molecular population genetic data from Drosophila melanogaster and D. simulans have revealed several interesting differences between these lineages, many of which have been interpreted in terms of differences in NE (Choudhary and Singh 1987; Aquadroet al. 1988; reviewed in Moriyama and Powell 1996). For example, simulans is more polymorphic at silent sites than melanogaster (Aquadroet al. 1988; Moriyama and Powell 1996). Under the neutral model, this is consistent with the idea that NE is larger in simulans than in melanogaster. D. melanogaster harbors proportionally more amino acid polymorphism (relative to silent polymorphism) than does simulans and has fixed proportionally more amino acid mutants (relative to silent fixations) than has simulans (Akashi 1995, 1996; Moriyama and Powell 1996; Takano 1998; but see Begun 1996; Andolfatto 2001). The greater proportional protein polymorphism and divergence in melanogaster has been interpreted as resulting from reduced efficacy of mutation-selection balance against slightly deleterious amino acid variants in this species (Akashi 1995, 1996). Compared to simulans, melanogaster has also fixed a larger proportion of presumably slightly deleterious, unpreferred silent mutations (Akashi 1995, 1996). This difference, too, has been attributed to weaker selection against deleterious alleles in the melanogaster lineage (Akashi 1995, 1996). A smaller NE in melanogaster than in simulans would be a plausible explanation for weaker selection against deleterious mutations in the former.
However, the fact that differences in NE may be consistent with patterns of nucleotide variation in melanogaster and simulans provides only weak supporting evidence for the population-size hypothesis because the species may differ in many ways besides population size. This leaves room for an agnostic position on the cause of the genomic differences between these two species and motivates the study presented here—an analysis of silent and replacement variation at a single locus in population samples from several Drosophila species. We specifically selected groups of closely related species for most of our analyses. There were two primary motivations for this sampling strategy. First, low sequence divergence between species allows us to infer individual substitutions occurring on single evolutionary lineages. This reduces our dependence on uncertain models of the substitution process and allows us to test hypotheses by counting polymorphic and fixed mutations instead of estimating parameters of molecular evolution over longer time periods. Second, our sampling strategy provides an opportunity to investigate the connection between polymorphism and divergence for a homologous region of DNA in many species. We selected xanthine dehydrogenase (Xdh, corresponding to the ry locus of D. melanogaster) for this study. One factor motivating this choice was the allozyme (e.g., Figure 9.3 of Kimura 1983) and DNA sequence data (Rileyet al. 1992), suggesting that it is highly polymorphic at the protein level in several Drosophila species. Therefore, we expected to observe sufficient numbers of amino acid polymorphisms and fixations to provide powerful tests of null hypotheses. Furthermore, Xdh shows moderate levels of codon bias in some previously sequenced Drosophila species (Rodriguez-Trelleset al. 1999), suggesting that we could expect to observe large numbers of silent mutations as well.
MATERIALS AND METHODS
Samples: Flies/DNA used for isolating Xdh alleles, and the number of alleles isolated from each species, are given in Table 1.
PCR, cloning, and sequencing: Several PCR primers or degenerate PCR primers were designed from regions of the Xdh protein that were highly conserved among D. melanogaster, D. pseudoobscura, and Bombyx mori. This region spanned residues 142-150 or 206-215 of the D. melanogaster Xdh protein for forward primers or residues 737-749 of the melanogaster protein for reverse primers. Several combinations of primer pairs were used to amplify Xdh fragments from different species. For species in which PCR was only marginally effective, a single allele was isolated from a gel-purified PCR product, cloned, and sequenced. These data were subsequently used to design species-specific primers. DNA isolated from single wild-caught flies was used in PCR for species provided by W. Etges. For lines established by us or provided by the Species Center, DNA was isolated from single flies taken from isofemale lines. PCR was carried out on DNA from a multifly prep provided by J. Powell (Gleason and Powell 1997) for some of the willistoni group data. In all cases, PCR was carried out using a high-fidelity polymerase (Boehringer Mannheim, Indianapolis). The resulting product was cloned in the TOPO Zero Blunt vector (Invitrogen, San Diego). A single clone was sequenced from each PCR reaction. Sequencing was carried out by using primers that annealed to vector sequence followed by primer walking based on primers designed from sequence data from individual species or groups of closely related species. The total number of nucleotides sequenced per individual varied slightly as a consequence of technical difficulties sequencing some templates or from variation in locations of PCR primers. The vast majority of our alleles start at the residue corresponding to amino acid 216 of the melanogaster protein and end at the residue corresponding to 723 of the melanogaster protein. Thus, ∼1500 bases were sequenced for each allele. This encompasses ∼38% of the 1335-amino-acid-long melanogaster Xdh protein. The error rate of the polymerase is given as ∼3 × 10-7 (Boehringer-Mannheim product literature), so we expect very few errors in the ∼1.8 × 105 bp (120 alleles × ∼1500 bp/allele) of reported sequence. DNaSP (v. 3.51, Rozas and Rozas 1999) was used for most analysis of sequence polymorphism. Analyses of intraspecific variation were restricted to sites that were sequenced in all alleles. Sequences can be found under GenBank accession nos. AF543072-AF543189.
Polarizing mutations: In most cases, phylogenetic relationships based on previously published data (summarized in Powell 1997) were used to determine ancestral states for polymorphic and fixed mutations. The affinis/athabasca/azteca clade is unresolved in previous analyses. We found strong support for an affinis/athabasca sister relationship, which we assumed for our analyses. Lack of appropriate outgroup data precluded inferences of fixations on some lineages. Specific phylogenetic relationships used for our analyses are in the appendix.
Heterozygosity: Table 2 shows estimates of silent and replacement heterozygosity at the Xdh locus for each of 22 species of Drosophila. Divergence among closely related species for silent and replacement sites is shown in Table 3. Very few sites are polymorphic in each of two sister taxa (with the exception of pseudoobscura and persimilis), suggesting that polymorphism data from each species can be considered as independent. Silent θ varies from a high of 0.064 to a low of 0.008; replacement θ varies from a high of 0.0080 to a low of 0.0014. Average heterozygosity is not significantly heterogeneous across species groups for silent sites (Kruskal-Wallis test, P = 0.09, species for which n < 4 are omitted) or replacement sites (Kruskal-Wallis test, P = 0.60, species for which n < 4 are omitted). Replacement and silent heterozygosity across species should be positively correlated across species under a neutral model of evolution with a homogeneous mutation rate. Our estimates of silent and replacement θ for 17 species were positively correlated (Spearman’s ρ= 0.56) and significantly different from zero (P = 0.02; Figure 1). The ratio of replacement to silent θ was not significantly heterogeneous across the repleta, obscura, virilis, and willistoni species groups (Tables 2 and 4; Kruskal-Wallis test, P = 0.35).
Frequency spectrum: Table 5 shows summaries of the frequency spectrum of polymorphism as estimated by Tajima’s D (Tajima 1989) for silent and replacement sites for the 17 species in which at least four alleles were sampled. Negative and positive D values indicate skews toward rare or intermediate frequency alleles, respectively. Xdh data show a strong trend toward negative Tajima’s D for both silent and replacement polymorphisms, though none of the individual tests is significant after adjusting the critical value for multiple tests. Unfortunately, there are multiple explanations for this result. First, small sample sizes tend to produce negative D values even under the neutral, equilibrium model (Tajima 1989). Second, sites under purifying selection may have negative D values. Third, neutral sites in expanding populations may be skewed toward rare variants. Fourth, multiple forms of selection may produce skews toward rare alleles at linked neutral sites (Gillespie 2000). Finally, we cannot rule out the possibility that Taq polymerase errors during PCR contribute to the trend toward rare alleles, though Taq errors are expected to have only a small effect on the data set. Tajima’s D values for replacement polymorphisms are significantly more negative than the corresponding values for silent polymorphisms (sign test, P = 0.01).
Substitutions: Substitutions between pairs of closely related species can be assigned to individual lineages under parsimony given an outgroup and a single mutation in the history of the site under consideration. Inference of the ancestral state for a pair of species should be reliable at the low level of sequence divergence observed for many of our species pairs (Table 3). Numbers of silent and replacement substitutions occurring on individual lineages are given in Table 6. A test of the contingency table of 16 species and silent vs. replacement fixations is only marginally significantly heterogeneous (G-test, P = 0.043). The ratio of replacement to silent fixations is not significantly heterogeneous across the repleta, obscura, virilis, and willistoni species groups (G-test, P = 0.83). Similarly, the ratio of replacement to silent fixations is not significantly different from the ratio of replacement to silent polymorphisms for any species group.
Correlation between heterozygosity and protein divergence: If silent mutations are neutral, silent site heterozygosity can be used as an estimator of NE. Under this premise we could investigate whether effective population size is correlated with the proportion of replacement to silent fixations along a lineage. We proceed under the assumption that silent heterozygosity is highly positively correlated with NE, though we acknowledge that selection may result in a complex relationship between silent heterozygosity and NE (e.g., Akashi 1999; Gillespie 1999, 2000; McVean and Charlesworth 1999). Figure 2 shows a scatterplot of silent θ vs. the ratio of replacement to silent fixations; each point represents the heterozygosity and the ratio of silent to replacement fixations for a single species. We find no evidence for a correlation between these two variables (Spearman’s ρ= -0.12, P = 0.66). However, the fact that several lineages have fixed a small number of mutations (Table 6) would inflate the sampling variance of the ratio of replacement to silent fixations, thereby reducing our power to detect any underlying relationship between this ratio and other variables. We consider eight independent pairwise comparisons of silent and replacement divergence to reduce the severity of this problem (Table 3). For each species pair we compared average silent heterozygosity to the ratio of replacement to silent divergence. There is no significant correlation (Spearman’s ρ= 0.17, P = 0.66). However, data from the willistoni/equinoxialis pair appear to be quite different from the other groups in that the level of polymorphism is unusually high. Data from the seven remaining species pairs reveal a strong, nearly significant positive correlation (Spearman’s ρ= 0.75, P = 0.066) between average silent heterozygosity and protein divergence (Figure 3).
Evolution of codon bias and base composition: Preferred and unpreferred mutants are hypothesized to be slightly beneficial alleles and slightly deleterious alleles, respectively. Silent polymorphisms and fixations were categorized as preferred or unpreferred and polarized using parsimony (Akashi 1995; Table 7). This approach assumes that putative fitness categories derived from melanogaster data can be extended to distantly related species. This assumption seems reasonable because general patterns of codon usage are well conserved among melanogaster, virilis, and pseudoobscura (Akashi and Schaeffer 1997; McVean and Vieira 1999; Kreitman and Antezana 2000), three species that diverged early in the history of the genus. Nevertheless, we compared codon usage [effective number of codons (ENC); Wright 1990] in a high-bias (mercatorum, ENC = 34.71) vs. low-bias gene (willistoni, ENC = 54.52) to determine whether patterns of codon usage in our Xdh sequences are consistent with patterns observed more generally in Drosophila genes. Of the 21 codon families where preferred and unpreferred codons were identified in melanogaster (Akashi 1995), the preferred codon was used more often in the high-bias than in the low-bias Xdh sequence for 19 families. Furthermore, ENC is highly correlated with GC content at third positions of codons at Xdh (Figure 4). Both observations suggest that patterns of codon usage at Xdh across species are similar to genomic patterns of codon usage in melanogaster.
Under an equilibrium model of codon bias evolution, one expects equal numbers of unpreferred and preferred fixations. Our fixation data, pooled across lineages (Table 7), clearly deviate from this expectation (binomial probability, P < 0.001). This result is not attributable to a small number of lineages that deviate strongly from the equilibrium prediction, but rather results from a general trend toward unpreferred fixations in most lineages—14 of 18 lineages have fixed more unpreferred than preferred mutations.
Similarly, there is a strong trend in the direction of excess unpreferred polymorphisms. Seventeen of 18 species have greater numbers of unpreferred than preferred polymorphisms. Under the neutral equilibrium model, the proportion of unpreferred to preferred polymorphisms should be equal to the proportion of unpreferred to preferred fixations. Comparison of unpreferred and preferred polymorphisms and fixations in simulans and melanogaster showed that simulans has a significant excess of unpreferred polymorphisms, while the polymorphic and fixed mutants in melanogaster were compatible with the strictly neutral model (Akashi 1996, 1999). Though Xdh fixations show a highly significant deviation from the equilibrium expectation (Table 7), joint consideration of the polymorphic and fixed mutations does not reject the neutral model for silent sites at Xdh. This is similar to the pattern seen in data from several melanogaster genes (Akashi 1995).
A caveat regarding these conclusions comes from separate analysis of data from the repleta group, which are consistent with the aforementioned results primarily because of the very large number of unpreferred fixations in the eremophila lineage (Table 7). If these data are omitted, the remaining repleta group data appear to differ from data from other species groups. Polymorphic and fixed, preferred and unpreferred mutations are significantly heterogeneous (P < 0.001). Furthermore, there is no excess of unpreferred fixations (25 preferred vs. 37 unpreferred, binomial probability, P = 0.08). Of the 8 non-eremophila repleta lineages, only 4 (Table 6) have fixed more unpreferred than preferred mutations [i.e., all 4 lineages (of 18 total lineages) that have not fixed more unpreferred mutations are from the repleta group].
All preferred melanogaster codons end in G or C (Akashi 1995). Thus, our fixation data should reveal an accumulation of A/T. Mutations were categorized as G or C to A or T and A or T to G or C. Numbers of fixed mutations in the two categories (pooled across lineages) for silent and replacement sites are given in Table 8. At equilibrium for base composition, one expects equal numbers of fixations in the two categories. There is a significant excess of A/T fixations (binomial probability, P < 0.001) at silent sites. Replacement sites show the same pattern, though the replacement fixations show only a marginally significant deviation from the equilibrium expectation (perhaps because of the smaller number of observations compared to silent mutations).
A smaller NE in melanogaster than in simulans was proposed as a possible explanation for the much higher proportion of unpreferred fixations in the former (Akashi 1996). If this were a general explanation, we might expect to observe a negative correlation between ENC and silent heterozygosity; species with histories of larger NE should have higher levels of codon bias (and thus smaller ENC). Figure 5 shows a scatterplot of ENC (vs. silent heterozygosity for our Xdh samples; we see no evidence of a negative correlation between these two variables (Spearman’s ρ= 0.29, P = 0.25). In fact, the two willistoni group species have very high levels of silent heterozygosity, yet show low levels of codon bias at Xdh.
Silent substitutions at Xdh are biased toward unpreferred mutations. The two best-studied Drosophila species, melanogaster and simulans, have each fixed more unpreferred than preferred mutations (Akashi 1995; Takano 1998; Begun 2001; McVean and Vieira 2001). Although additional data are required to make strong inferences about global properties of silent substitutions in flies, the Xdh data presented here, data from several genes in melanogaster and simulans, and data from the saltans group of Drosophila (Rodriguez-Trelleset al. 1999) suggest that accumulation of unpreferred fixations may be common in Drosophila genes. Since unpreferred mutations end in A or T, the accumulation of unpreferred mutations is equivalent to a change in base composition toward A or T. Data suggesting that replacement (Rodriguez-Trelles et al. 1999, 2000) or noncoding (Takano-Shimizu 2001) fixations may also be skewed toward A or T suggest that evolution of base composition may not be restricted to silent sites.
We can entertain at least three kinds of explanations for evolution of A/T content. First, lineages may evolve in response to a change of A/T mutation bias in an ancestor. Second, lineages may evolve in response to selection favoring A/T (this requires independent selection favoring A/T in different lineages). Finally, A/T accumulation may reflect fixation of slightly deleterious (unpreferred) mutations by genetic drift. This final hypothesis may seem unlikely for our Xdh data because it invokes reduction of fitness in several Drosophila lineages that are biologically and historically distinct. Nevertheless, global factors (e.g., temperature) reducing Drosophila population sizes at some point during the last several million years could have promoted fixation of very slightly deleterious alleles, as suggested by Akashi (1996, 1999) for the melanogaster lineage. Data of the type described here from several genes and from noncoding DNA (Rodriguez-Trelles et al. 1999, 2000; Takano-Shimizu 2001) may help distinguish between some alternatives. For example, sequences from virilis (Bergman and Kreitman 2001) and the saltans group (Rodriguez-Trelles et al. 1999, 2000) suggest that these lineages may be evolving toward higher A/T content (relative to melanogaster), even in non-protein-coding regions. Such results weaken the hypothesis that accumulation of A/T at silent sites in exons can be explained simply by weaker selection against borderline deleterious mutations.
Although there are broad generalizations to be gleaned from the data, it also seems clear that there may be strong lineage effects on patterns of base composition evolution. For example, eremophila appears to have experienced an atypically large excess of unpreferred fixations relative to the other repleta lineages. This historical inference is consistent with the fact that eremophila Xdh is the least biased of the repleta group samples. Comparison of the repleta group to other species groups also supports the importance of lineage effects. Excluding eremophila, the repleta group shows no significant accumulation of unpreferred fixations. This is in contrast to pooled data from the other species groups, which show a highly significant excess of unpreferred fixations. Patterns of codon bias are consistent with this inference in that codon bias of repleta species (mean ENC = 39.8) is greater than that of other groups (mean obscura ENC = 45.3; mean virilis ENC = 44.2; mean willistoni ENC = 55.5). Major lineage effects can also be inferred simply by noting that A/T content at third positions among species for this region of Xdh ranges from 16.5 to 54.9%. Given that there was a single common ancestor with a particular A/T content, the wide range of current A/T contents is indicative of a heterogeneous substitution process (e.g., Rodriguez-Trelles et al. 1999, 2000; Takano-Shimizu 2001). It is difficult to speculate on the causes of these lineage effects given the limited data. For example, the small amount of data for most lineages means that we are still unable to assess the potential significance of lineage × locus interactions (Takano-Shimizu 2001). Data from other loci and other types of nucleotide sites will be important for clarifying Drosophila substitution patterns.
Previously collected data on polymorphism and divergence at silent sites suggested the presence of excess unpreferred (i.e., putatively borderline deleterious) polymorphisms in simulans, but not in melanogaster (Akashi 1995, 1996). Similar to the pattern previously observed for melanogaster (Akashi 1996), the contingency table of pooled, polymorphic, and fixed silent mutations at Xdh (Table 7) provides no support for heterogeneity of polymorphisms and fixations. However, consideration of the different species groups or overall levels of codon bias of Xdh genes does provide some evidence for nonneutral dynamics of silent mutations. Table 9 shows the number of polymorphic and fixed, unpreferred and preferred mutations in high-bias species (spenceri, arizonae, mulleri, mojavensis, aldrichi, leonis, hydei) vs. low-bias species (americana, azteca, athabasca, affinis, willistoni, equinoxialis, eremophila). Allele configurations for the high-bias species are significantly heterogeneous (G-test, P = 0.001) in the direction of excess unpreferred polymorphisms, as was previously observed in analyses of simulans sequences (Akashi 1995). Similarly, polymorphic and fixed mutants for the seven low-bias genes are significantly heterogeneous (P = 0.026). However, if data from eremophila are omitted, the polymorphic and fixed silent mutations in the remaining six low-bias species are no longer significantly heterogeneous. The numbers of fixed preferred vs. unpreferred mutations deviate from equilibrium expectation for low-bias species (binomial probability, P < 0.001; omitting eremophila, P = 0.005). However, fixations for the high-bias species do not deviate from the expectation (binomial probability, P = 0.26). Thus, although there does appear to be a general accumulation of unpreferred fixations, the effect seems to be greater in lower-bias lineages. Unfortunately, inferences about high-bias species are compromised by the fact that they all belong to the repleta group. Therefore, we cannot determine whether patterns associated with these genes result from lineage effects or effects related to overall levels of codon bias.
Under a neutral model with homogeneous mutation rates, the fraction of new silent and replacement mutants that are neutral is expected to be the same across populations of different effective sizes. As predicted under this model, we observed a positive correlation between silent and replacement heterozygosity. However, other aspects of the data suggest the possibility of different dynamics of protein vs. silent variants. First, compared to silent polymorphisms, replacement polymorphisms consistently show greater skew toward rare alleles. Second, there is a marginally significant negative correlation between silent heterozygosity and Tajima’s D for replacement polymorphisms (Figure 6, Spearman’s ρ=-0.51, P = 0.04). In other words, populations that harbor more silent variation tend to have amino polymorphisms that are more highly skewed toward rare alleles. This is the pattern one might expect if silent polymorphism were more highly correlated with NE than was replacement polymorphism and if selection against slightly deleterious amino acid mutations were more effective in larger populations (Akashi 1999). Nevertheless, given the weakness of the data and the possibility that demographic phenomena or several forms of selection (Gillespie 1999) may produce negative population Tajima’s D values, we should be reluctant to favor a particular explanation for the frequency spectrum data.
The analyses presented here are, to our knowledge, the first attempt to investigate whether nucleotide substitution rates differ along lineages leading to more polymorphic vs. less polymorphic species (but see Skibinski and Ward 1982 for a similar analysis of allozyme data). We found no convincing correlation between silent heterozygosity and rates of protein evolution, though a subset of the data revealed a nonsignificant trend for more polymorphic species pairs having higher rates of protein evolution (Figure 3). There are several possible interpretations for the absence of a strong relationship between heterozygosity and protein evolution. First, the neutral model predicts no association between population size and rate of evolution. Second, the correlation between NE in the recent vs. more distant past may be weak. For example, closely related species such as simulans and melanogaster (e.g., Moriyama and Powell 1996) or simulans and sechellia (Hey and Kliman 1993; Klimanet al. 2000) have different levels of nucleotide variation, though they recently evolved from common ancestral populations of a certain size. Third, as noted earlier, recent theoretical results suggest there may be no major effect of population size on substitution rates, even if substitutions are influenced by natural selection (Cherry 1998; Gillespie 1999, 2000, 2001). Fourth, the relatively small numbers of observed amino acid fixations observed in many lineages and/or the large variances on estimates of θ could limit our power to detect a correlation. Finally, Xdh amino acid variants could be under strong directional selection only in some lineages and/or at some times. It seems that the attempt to rule out certain models of molecular evolution using associations of heterozygosity and replacement substitution rates will be difficult.
We thank the individuals listed in Table 1 for providing flies or DNA. J. Gillespie, C. Langley, and two anonymous reviewers provided useful comments. This work was supported by the National Institutes of Health, the National Science Foundation, and a Sloan Young Investigator Award in Molecular Evolution.
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AF543072-AF543189.
Communicating editor: W. Stephan
- Received January 21, 2002.
- Accepted September 19, 2002.
- Copyright © 2002 by the Genetics Society of America