We present the results of a survey of DNA sequence variability at X-linked and autosomal loci in Drosophila americana and of patterns of DNA sequence evolution among D. americana and four other related species in the virilis group of Drosophila. D. americana shows a typical level of silent polymorphism for a Drosophila species, but has an unusually low ratio of nonsynonymous to silent variation. Both D. virilis and D. americana also show a low ratio of nonsynonymous to synonymous substitutions along their respective lineages since the split from their common ancestor. The proportion of amino acid substitutions between D. americana and its relatives that are caused by positive selection, as estimated by extensions of the McDonald–Kreitman test, appears to be unusually high. We cannot, however, exclude the possibility that this reflects a recent increase in the intensity of selection on nonsynonymous mutations in D. americana and D. virilis. We also find that base composition at neutral sites appears to be in overall equilibrium among these species, but there is evidence for departure from equilibrium for codon usage in some lineages.
DROSOPHILA americana is a member of the virilis group of Drosophila, native to the east-central zone of the United States and southern Canada. Members of the virilis group have been an object of study by evolutionary geneticists for >70 years (Patterson and Stone 1952; Throckmorton 1982). This work has shed light on questions such as the genetic basis of hybrid sterility and inviability (Orr and Coyne 1989), the genetics of species differences in morphological traits (Spicer 1991; Wittkopp et al. 2003) and behavior (Hoikkala et al. 2000), the extent of DNA sequence differences between closely related species (Hilton and Hey 1997), and rates of chromosomal evolution (Vieira et al. 1997).
Much recent work on D. americana has been motivated by the wish to advance understanding of the evolution of the centric fusion between Muller's element B (the homolog of the fourth chromosome of D. virilis, a close relative of D. americana), and the X chromosome. This creates a neo-X chromosome, and the X/4 fusion exhibits a north–south clinal pattern of variation, with southern populations essentially lacking the fusion and northern populations being fixed for it (Patterson and Stone 1952; Throckmorton 1982; Caletka and McAllister 2004). This provides an important model system for investigating the evolution of chromosomal rearrangements in general and neo-sex chromosomes in particular (McAllister and Charlesworth 1999; McAllister 2002, 2003; Vieira et al. 2006).
D. americana is also useful for other types of evolutionary genetic studies, since it is one of the few Drosophila species to have a well-defined ecology, breeding exclusively in wet riparian areas with willow species Salix interior and S. nigra in North America, not associated with humans (Throckmorton 1982; B. F. McAllister, personal communication). This means that it has not been affected as much by human disturbance of its habitat as many of the other species used in evolutionary genetic studies. It arrived in North America >3 MYA (Caletka and McAllister 2004) and appears to have had a relatively stable demographic history, so that investigators can avoid some of the problems of disentangling demography and selection that have been encountered in species such as D. melanogaster and D. simulans (Haddrill et al. 2005). In accordance with this expectation, our previous study of selection on synonymous and noncoding variants in this species suggested that base composition and codon usage are approximately in equilibrium (Maside et al. 2004).
In this article, we extend our analyses of DNA sequence variation in D. americana to nonsynonymous variants at a set of 18 genes. We also combine these with single sequences for each of these genes, and 4 additional genes, from four other species in the virilis group. This provides further information on the phylogeny of these species and enables us to conduct tests for selection on nonsynonymous variants using both codon-based models of sequence evolution (reviewed in Yang and Bielawski 2000) and the McDonald–Kreitman test and its recent extensions (Eyre-Walker 2006; Welch 2006). We find an unusually low ratio of nonsynonymous to synonymous variation within D. americana and evidence for an increased intensity of purifying selection against nonsynonymous mutations in both the D. virilis and D. americana lineages, following their split from a common ancestor. There is also evidence for a significantly higher ratio of nonsynonymous to synonymous mutations for interspecies comparisons than for variation within D. americana. This reflects either a high fraction of amino acid substitutions driven by selection or an increased intensity of selection on nonsynonymous but not synonymous variants within the D. americana lineage.
MATERIALS AND METHODS
We analyzed the following strains: D. virilis (A11), D. ezoana (15010-0971.2), D. littoralis (Kemi96), D. montana (Kemi96), kindly provided by Jorge Vieira (Instituto de Biologia Molecular e Celular, University of Porto, Portugal), and 53 D. americana isofemale lines from three different sample populations: G96, collected near Gary, Indiana (1996), OR01 (Toledo, OH, 2001), and FP99 (southeast Arkansas, 1999) from the collection of Bryant McAllister (Department of Biological Sciences, University of Iowa). The proportions of X/4 fusion chromosomes in these populations are 98, 86, and 14%, respectively (McAllister 2003). Flies were reared on banana medium at 21°.
Details of the loci analyzed and primers used are given in supplemental Table S1 at http://www.genetics.org/supplemental/. PCR amplification and DNA sequencing methods are described elsewhere (Maside et al. 2004).
Polymorphism data were obtained from 5–10 random isofemale lines from the G96 sample population. For two loci, elav and Pros28.1, larger samples were composed of a combination of isofemale lines from three natural populations: G96 (N = 11 and 13, for elav and Pros28.1, respectively), OR01 (N = 19 and 20), and FP99 (N = 20 and 19). G96.11 was randomly selected to be used as the source line for analyses that required the use of a single D. americana sequence.
Nucleotide sequences were edited with Bioedit (v. 5.0.9) and initially aligned with ClustalX, v. 1.81 (Thompson et al. 1997). Alignments of exon sequences were corrected by hand by combining nucleotide and predicted amino acid data, and intron sequences were aligned with McAlign2 (Keightley and Johnson 2004). Some exon sequences included short repetitive regions (usually <300 bp), consisting of tandem repeats of one or two-codon motifs. The historical reconstruction of the changes in the following repeats was ambiguous and they were eliminated from the alignments: dos, poly(QA) between alignment positions 736–813; elav, poly(Q), 72–338; su(Hw), poly(D), 79–123; su(s), poly(G), 364–408; and poly(S), 1042–1137. All sites with gaps in any of the sequences were also eliminated from the alignments before the analysis. The final contents and lengths of the alignments are shown in supplemental Table S1.
DNA sequences of anon66-Db, Cdc37, Cp36, csw, Ddx1, Dos, elav, Fused, kni, msl-3, pros28.1, rh4, sina, su(Hw), su(s), tll, and Yp1 from D. virilis, D. americana, and D. ezoana have previously been published and deposited in GenBank (Maside et al. 2004). The sequences newly obtained for this analysis (i.e., the sequences for these loci from D. littoralis and D. montana as well as those of lama, Gpi1, hb, and tube from all species) have been deposited in GenBank under accession nos. EF635062–EF635113.
The relative positions of su(s), per, elav, Yp1, Cp36, and Fused in the D. americana X chromosome are given by Vieira et al. (2006, and references therein). We inferred those of csw, Gpi1, and Pros28.1 by the methods of Vieira et al. (2006); i.e., we used the University of California Santa Cruz Genome Browser to identify the D. virilis genome scaffold (July 2004 assembly) that included the genes' sequences, found their position in the D. virilis polytene chromosome map, and determined their correspondence to the D. americana X. Our partial sequence of Pros28.1 matched D. virilis genome coordinates scaffold_48: 270292–271110. This means that it maps to 13C–D in the D. virilis polytene map (Gubenko and Evgen'ev 1984), very close to the distal In(X)a breakpoint (Figures 1 and 2 in Vieira et al. 2006). Without in situ data, we cannot reliably determine whether Pros28.1 remained opposite elav in conserved block 3 [this chromosome segment was involved in inversions In(X)b and In(X)c; see Figure 2 in Vieira et al. 2006] or whether In(X)a relocated it near the centromere along with Fused at the proximal end of block 4. We can, however, confirm that this locus has been repositioned in the D. americana lineage and was classified as potentially affected by the inversions, along with elav, Pros28.1, Yp1, Cp36, and Fused.
Phylogenetic relationships among the five species were analyzed using distance (neighbor-joining, minimum-evolution, and UPGMA) and parsimony methods (maximum parsimony) using Mega 3.1 (Kumar et al. 2004). The substitution model was selected by applying the second-order Akaike information criterion (AICc) (Posada and Buckley 2004), choosing among a set of alternative models: JC69, K80, F81, F84, HKY85, T92, TN93, REV, and UNREST as implemented in PAML (Yang 1997). These models were tested assuming a single substitution rate across sites and allowing for variation of the substitution rates across sites, following a discrete gamma distribution with eight categories. As a starting point, we used an independent phylogenetic tree obtained from mitochondrial 12S and 16S rRNA genes (Spicer and Bell 2002). Equilibrium nucleotide frequencies and transition/transversion rate ratios were assumed to differ among the three codon positions. Model selection was performed on a gene-by-gene basis and on the concatenated data set. In the first approach, the averages of AICc values under the different models for each gene were weighted by the length of their respective sequences, to obtain an approximation to the AICc value of each model for the data set. In both cases, the model with highest support from the AICc weights was TN93 (Tamura and Nei 1993) with variable substitution rates (TN93 + Γ). Other models with lower AICc support, as well as other methods that take into account variable substitution rates at synonymous and nonsynonymous sites, such as the Kumar method (Nei and Kumar 2000, Chap. 4), produced the same phylogenetic tree. Since intron data were obtained only for a subset of genes, and the sizes of intron sequences varied widely across genes (supplemental Table S1), model selection and tree search were performed using coding sequences alone.
Numbers of fixed and polymorphic variants in D. americana were corrected for polymorphisms misclassified as fixations using a method described elsewhere (Maside et al. 2004). Numbers of synonymous and nonsynonymous substitutions per site were estimated using a codon-based maximum-likelihood approach (Goldman–Yang, GY) (Goldman and Yang 1994), implemented in the CODEML program of the PAML package (Yang 1997). Two different codon substitution maximum-likelihood (ML) models were used: the simplest (null) model assumed that the dN/dS ratio is the same for all branches (one-ratio model), and the alternative model allowed dN/dS ratios to vary freely between branches (free-ratio model). Equilibrium codon frequencies in the models were estimated from the average nucleotide frequencies at the three codon positions (F3 × 4).
Tests of neutrality:
Fu and Li's D and Tajima's D tests were conducted by hand and by using the DnaSP program v4 (Rozas et al. 2003) and ProSeq v2.9 (Filatov 2002). Statistical significance of the tests on each locus and on the pooled data set was assessed by coalescent simulations using DnaSP and HKA (available in J. Hey's lab web page: http://lifesci.rutgers.edu/∼heylab/HeylabSoftware.htm#HKA). To give equal weight to all loci on the pooled estimates, only five sequences were used to calculate the D statistics at each locus. For loci with sample size N > 5, we drew 100 random samples of five sequences and used the average value of the statistics as an approximation to the expectation. This will make our tests slightly conservative, as these estimates have lower sampling variances than assumed in standard calculations. Estimates of the recombination rates at each locus were obtained using the -estimator (Wall 2000).
To determine if the free-ratio model fits the data significantly better than the one-ratio model, we used the log-likelihood-ratio test (LRT), assuming that the distribution of the test statistic, 2ΔL (twice the difference between the log-likelihoods of each model) can be approximated by the χ2-distribution with the number of degrees of freedom equal to the difference in the numbers of parameters of the nested models. As only one sequence of each locus in each species was needed, all D. americana sequences used were collected from a single strain selected at random (G96-11). For this analysis, we used one sequence from each locus in the five species, except for su(s) for which we were not able to obtain a sequence in D. montana. Estimates of the divergence values between any two species were calculated by summing the dN and dS values at all branches connecting them.
Since dS estimates obtained with the GY method are influenced by codon usage bias, they may not be appropriate for between-locus comparisons (Bierne and Eyre-Walker 2003). Thus, synonymous and nonsynonymous divergences for this purpose were also estimated using the method of Nei and Gojobori (1986, Equations 1–3), implemented in DnaSP, which uses a conservative criterion for counting synonymous and nonsynonymous sites (Rozas et al. 2003); when alternative evolutionary paths are possible, it chooses the paths that involve fewer steps and fewer nonsynonymous mutations (see DnaSP Help file for details). Tests for heterogeneous selection pressure along protein sequences were performed with CODEML, using models M0, M1a, M2a, M7, and M8 (Yang et al. 2000; Wong et al. 2004). CODEML also implements the empirical Bayes method for estimating the category to which each site is most likely to belong (Nielsen and Yang 1998).
The proportion of nonsynonymous nucleotide substitutions fixed by positive selection (α) was estimated using the methods of Fay et al. (2001) and Smith and Eyre-Walker (2002), which are an extension of the McDonald–Kreitman test (McDonald and Kreitman 1991). Bierne and Eyre-Walker (2004) have developed a maximum-likelihood estimator of α. In our use of this, α was assumed to be constant across loci. The results are consistent across methods, and only those using the former are shown in detail (for a comparison of the methods see Eyre-Walker 2006; Welch 2006).
Reconstruction of ancestral sequences:
Ancestral nucleotides at internal nodes were inferred using the maximum-likelihood approach developed by Yang et al. (1995), implemented in the BASEML program included in PAML. Given the sequence data and the phylogenetic tree, this method uses maximum-likelihood estimates for branch lengths of the tree and parameters from a previously selected substitution model to calculate the posterior probabilities of alternative nucleotide assignments to each interior node at a site (marginal reconstruction) and to select the one with the highest probability. The substitution model used was HKY85 + Γ. This is a modification of the HKY85 model (Hasegawa et al. 1985), which allows substitution rates to vary across sites following a discrete gamma distribution with eight categories.
Probabilities of ancestral codons were calculated as the product of the probabilities of ancestral nucleotides at each codon position (Akashi et al. 2006). When ancestral and derived codons differed by a single nucleotide, the probability of the ancestral codon was taken as the change's count. With differences at two positions, there are two alternative two-step paths between the ancestral and the derived codons, and estimating the numbers of each type of substitutions may not be straightforward (Nei and Gojobori 1986). We used a simple method by which we gave the same weight to both evolutionary paths and assumed equal probabilities of all types of changes. For six codons, alternative paths implied different numbers of synonymous and nonsynonymous changes. For a similar data set, Akashi et al. (2006) showed that, given sufficiently low numbers of substitutions per site, this method gives similar results to those obtained when a correction for synonymous/nonsynonymous ratios is used to weight alternative paths. Codons that had changes at all three positions were excluded from the data set. The lack of a su(s) sequence from D. montana would imply the use of a different phylogenetic tree by the model, and so this locus was excluded from this analysis.
Codon preferences for all species were assumed to follow the D. virilis preferences table (Maside et al. 2004). The probabilities of ancestral nucleotides in introns were estimated directly, using the same substitution model (HKY85 + Γ).
Phylogeny of the species:
Figure 1 shows a neighbor-joining tree, representing the inferred phylogenetic relations between the species. This topology was obtained from the concatenated data set. It is strongly supported by the substitution model tested (100% bootstrap support for all branches) and is consistent across tree-inference methods (see materials and methods). Similar phylogenetic reconstructions were obtained by Spicer and Bell (2002) from sequences of mitochondrial 12S and 16S rRNA genes and by Orsini et al. (2004) using microsatellite data.
Some genes produced alternative trees when analyzed separately, shifting the relations between D. ezoana, D. littoralis, and D. montana, although always with very weak bootstrap support. Only the data from su(Hw) suggested a clustering of D. montana with the ancestor of D. virilis and D. americana, with >85% bootstrap support from different tree inference methods and nucleotide substitution models. Figure 1 shows relatively short branch lengths for these species, with dS along the D. montana branch being at most three times the synonymous diversity within D. americana, i.e., ∼12Ne generations [using the equilibrium formula for expected neutral diversity, π = 4Neu, where Ne is the effective population size, u is the mutation rate, and π is the pairwise nucleotide site diversity (Nei 1987)]. With this level of divergence relative to diversity, ∼20% of the fixations on each branch are expected to be contributed by ancestral polymorphisms rather than fixations of new mutations (Charlesworth et al. 2005); because of linkage, these will tend to occur in clusters (Wiuf et al. 2004), so it is likely that this discrepancy for su(Hw) simply reflects the random fixation of neutral polymorphisms present in the common ancestor, which obscures the signal of the species tree.
Genetic diversity in D. americana:
Mean silent-site diversity as measured by π and θw (Watterson 1975) is 1.84% (Table 1), a fairly typical value for Drosophila species. The average pairwise nucleotide site diversity at nonsynonymous sites is ∼45 times lower than at silent sites (mean π = 0.04%, weighting individual values by their estimated net variances, as in Bartolomé et al. 2005). This ratio is similar to that previously reported from analyses of smaller sets of loci in this species (Vieira et al. 2001; McAllister 2003) and is substantially lower than that observed in other Drosophila species, such as the closely related D. virilis [mean π = 0.01 and 0.10, respectively (Vieira and Charlesworth 1999)], or the more distant D. miranda [0.08 and 0.42 (Bartolomé et al. 2005; Bartolomé and Charlesworth 2006)], D. pseudoobscura [0.2 and 2.3 (Loewe et al. 2006)], and African populations of D. melanogaster [0.18 and 2.87, respectively (Andolfatto 2005)].
The mean silent π is 1.20% for the X chromosome vs. 2.28% for the autosomes (P < 0.0001 on a Mann–Whitney U-test). A similarly low ratio of variance-weighted mean nonsynonymous to silent diversities is observed for both X-linked and autosomal loci when analyzed separately (0.033 ± 0.015 and 0.019 ± 0.0083, respectively; ratio plus or minus standard error of the ratio). Given that the mean silent diversity is typical for a Drosophila species (see previous references), the low nonsynonymous diversity in D. americana probably reflects strong selection against nucleotide substitutions that change amino acid sequences.
The observed higher silent variation at autosomal loci is mainly due to the significant differences observed at synonymous sites, for which the X/A ratio is 0.47 (variance-weighted mean π = 1.14% vs. 2.59% at X-linked and autosomal genes respectively, P < 0.003 on a Mann–Whitney U-test). This difference remains significant even after adjusting the X-linked values by dividing by 0.75 to account for the fact that the effective population size for X chromosomes is three-quarters that of autosomes in the absence of sexual selection (P < 0.01). In contrast, the X/A ratio for the intron diversity is 0.82 (variance-weighted mean π = 1.37 vs. 1.66, P <0.80), close to the 0.75 expected in the absence of sexual selection. These results are consistent with the action of selection at synonymous sites (Maside et al. 2004), which is expected to reduce X-linked relative to autosomal diversity, provided that the fitness effects of synonymous substitutions are weakly deleterious and either partially recessive or female specific (McVean and Charlesworth 1999), whereas intron variants are close to what is expected under neutrality.
Tests of neutrality:
To further investigate the signature of selection, we conducted Tajima's D (Tajima 1989) and Fu and Li's D (Fu and Li 1993) tests on the polymorphism data at each locus and on the pooled data set (see materials and methods). Synonymous, intron, and nonsynonymous variants were analyzed separately (supplemental Table S2 at http://www.genetics.org/supplemental/). At synonymous sites, Tajima's and Fu and Li's D tests are negative for 13 of the 18 loci. This is significant on a Wilcoxon signed-rank test (TS = 38; P < 0.05) (Sokal and Rohlf 1995, pp. 440–444). The significance of the observed values of the statistics was determined by means of coalescent simulations that allow for the observed levels of intragenic recombination in each locus (see materials and methods). According to Fu and Li's D test, the deviations from neutral equilibrium expectations are significant for Cdc37, elav, and su(s) and border on significance for anon-66Db (P = 0.05). For the pooled data, Fu and Li's D = −0.40 (P = 0.10, calculated by assuming that all loci are unlinked but with no recombination within them).
Skews at intron sites tend to go in the same direction but are generally less pronounced and are not significant for any of the loci (supplemental Table S2). When run on the pooled data, Tajima's D = 0.00 and Fu and Li's D = −0.08. At nonsynonymous sites, there is a slight overall surplus of young/low-frequency variants across loci, as expected under the action of purifying selection, but the low levels of polymorphism limit the power of the tests, which are nonsignificant.
Sequence divergence among species:
Maximum-likelihood estimates of genetic divergence at each locus were obtained for the seven branches of the phylogenetic tree (Figure 1). This allowed us to compare the patterns of evolution among genes and branches. To increase the sample, we included sequence data from four additional genes: hb, tube, lama, and Gpi1, for which we have one sequence per species (supplemental Table S1).
Maximum-likelihood estimates of synonymous and nonsynonymous divergence on all branches of the phylogenetic tree were obtained using two models (see materials and methods). The first assumed a constant dN/dS ratio for the entire tree (one-ratio model), and the second allowed dN/dS ratios to vary among branches (the free-ratio model, Yang 1998; Yang and Nielsen 1998). This was done for all genes separately and for a sequence composed of a concatenation of all coding sequences from each species. Log-likelihood values under the free-ratio model were always larger than those under the one-ratio model (supplemental Table S3 at http://www.genetics.org/supplemental/). The difference was found to be significant for 3 of 22 genes (Cdc37, kni, and per; not corrected for multiple tests), which indicates that the free-ratio model fits the data better. Consistent with this, the one-ratio model is rejected if the test is run on the concatenated data set (2ΔL = 25.1, P = 0.0003 on a ; supplemental Table S3).
dN/dS ratios (Figure 1) are largest in the lineages leading to the common ancestor of D. ezoana and D. littoralis (EzoLit) and of D. montana (0.086) and smallest for those of D. americana (0.049) and D. virilis (0.034). The large dN/dS for EzoLit can be attributed to a very small dS (0.015), which is less than one-third of the across-lineage average (0.053; supplemental Table S4 at http://www.genetics.org/supplemental/). However, caution is needed in interpreting this result, as this is a relatively short branch (Figure 1), and dN and dS values may be subject to substantial stochastic variation due to the low number of mutations in it. On the other hand, the D. virilis and D. americana branches are long, so the estimates are relatively reliable. Their smaller dN/dS ratio is associated with a smaller than average nonsynonymous divergence (0.22%), their dS being close to average.
The average dN and dS values are 0.003 and 0.053, respectively, and the average dN/dS is 0.079, so that these loci are under strong purifying selection (supplemental Table S4). elav and sina are the most conserved, with no observed amino acid substitutions. On the other hand, dN/dS values for tube and per are well above the locus average in most lineages. In the case of tube, this result is due to both a higher dN and a lower dS than the average. However, dN for per is 5.5 times higher than the mean (1.81 vs. 0.33%), which is not matched by a proportionally high rate of silent divergence, although this is 1.6 times greater than the mean (dS = 9.4%).
We also constructed various LRTs to check for the presence of significant differences in levels of selective constraint among sites (Yang et al. 2000). The null model (M0) assumes a unique dN/dS for all sites (calculated from the data); M1a assumes that a proportion of sites are highly conserved (0 < dN/dS < 1) and the rest are largely unconstrained, with dN/dS = 1; M2a includes an additional class of positively selected sites (dN/dS > 1). LRTs of the nested models using the concatenated data set show that M1a and M2a fit the data significantly better than M0 (2ΔL = 151.2, P <0.0001). However, M2a does not make any significant improvement over M1a (2ΔL = 0.23, P = 0.95), so there is no conclusive evidence for sites evolving under positive selection. Under M1a, 96% of the sites evolve under selective constraints (dN/dS = 0.03) and the other 4% evolve unrestrictedly (dN/dS = 1). Other tests with more elaborate models such as M7 and M8 (Yang et al. 2000) also failed to detect signatures of positive selection at individual sites (such that dN/dS > 1; data not shown). However, these negative results should be regarded with caution, as the power of statistical methods for detecting adapting evolution in coding sequences is greatly reduced with the number of taxa from which data are available and the fraction of positively selected sites (Wong et al. 2004), as is the case for our data set, with sequences from five species and 96% of codons under purifying selection.
To characterize the nature of selection further, we used the McDonald–Kreitman test (McDonald and Kreitman 1991), which compares the ratio of polymorphism to divergence (rpd) among synonymous and nonsynonymous sites interspersed along a sequence; under the neutral model, both ratios should be the same. To avoid the possible confounding effects of a substantial contribution of shared ancestral polymorphisms on the divergence between closely related species such as D. americana and D. virilis (Charlesworth et al. 2005), we used D. ezoana as the primary outgroup. As shown in Table 2, rpd is larger among synonymous than among nonsynonymous variants (0.33 vs. 0.12 for the pooled data). We evaluated this difference by means of the Mantel–Haenszel test (Sokal and Rohlf 1995, Chap. 17), which indicated that there is a 2.3 times larger proportion of polymorphic synonymous substitutions as compared with nonsynonymous changes (ωMH = 2.3; XMH = 7.8, P = 0.005; where ωMH is the Mantel–Haenszel (MH) estimator of the odds ratios and XMH is the MH statistic, which is distributed as χ2 with 1 d.f.). These differences in odds ratios are homogeneous across genes (weighted sum of squares, XMH = 6.6, P = 0.97). Similar results are obtained if D. littoralis or D. montana are used as outgroups for divergence estimates (Table 2).
The same trend as described above is observed if autosomal and X-linked genes are analyzed separately. The rpd values at synonymous sites are 2.5 and 2.2 times larger than those at nonsynonymous sites, respectively. The difference is, however, significant only among the former genes (XMH = 5.2, P = 0.02), due to lack of power in the reduced X-linked sample. These joint-odds ratios barely change when noncoding sites are used as a proxy for neutrality (2.4 and 1.9, for X-linked and autosomal loci respectively, although none are significant).
The rate of adaptive evolution:
If we assume that all mutations are either deleterious or neutral, then the ratio of nonsynonymous to synonymous divergence (dN/dS) would be expected to be equal to, or less than, the ratio of nonsynonymous to synonymous polymorphisms (PN/PS). However, if a fraction of nonsynonymous mutations were adaptive and thus driven to fixation by positive selection, then the ratio of the divergences should be greater, given that adaptive nonsynonymous mutations make little contribution to polymorphism. The proportion of nonsynonymous substitutions fixed by positive selection (α) can thus be estimated (Charlesworth 1994), and several methods for doing this have been developed (Eyre-Walker 2006; Welch 2006).
Using the method of Bierne and Eyre-Walker (2004), we obtained a maximum-likelihood estimate of α between D. americana and D. ezoana of 0.57 (95% C.I.: 0.25–0.77) (Table 3). Estimates of the same magnitude were obtained using D. littoralis and D. montana as outgroups (0.59 and 0.65, respectively). These results are almost 50% higher than those usually obtained for similar sets of data for other Drosophila species (Eyre-Walker 2006). In contrast, if D. virilis is used as the outgroup, or if fixed changes are counted in the D. americana lineage only, the resulting estimates of α are substantially lower (although the differences are not significant).
Changes in the patterns of codon usage across lineages:
A total of 6938 codons of 21 genes from five lineages were analyzed. Ancestral and derived codons were identified and located on the phylogenetic tree using a maximum-likelihood procedure (see materials and methods). This method allowed us to identify each mutational step between ancestral and derived codons and assign it a probability, which was taken as its count. The total number of mutational changes was 1725.9, of which 11.4 corresponded to codons that differed in all three positions and were removed from the analysis. Synonymous and nonsynonymous changes accounted for 81 and 19% of the remainder.
Table 4 summarizes the estimated numbers of synonymous and nonsynonymous changes in each branch of the phylogeny (Figure 1). Synonymous changes were further classified as from preferred to unpreferred (PU) and vice versa, according to codon preferences established in D. virilis (Maside et al. 2004). Assuming equilibrium for codon usage, an equal number of changes in either direction is expected. Departures from equilibrium were quantified as dUP−PU = (UP − PU)/(UP + PU), which has a range of −1 to 1 (Akashi et al. 2006). The significance of any departures was assessed by means of the Wilcoxon signed-rank test (WSR), which compares UP with PU mutation counts per locus on each lineage. dUP−PU = −0.23 on the pooled data, which suggests that there is a significant excess of PU substitutions (P < 0.005). However, this pattern was not common to all genes (supplemental Table S5 at http://www.genetics.org/supplemental/), as dUP−PU values vary significantly across genes [GH = 60.9, P = 5.1 × 10−6 in a G-test of heterogeneity; e.g. su(Hw) and tll produced the most highly negative dUP−PU values, as opposed to Pros28.1 and Gpi1]. Furthermore, dUP−PU and codon usage bias (measured with Fop, Ikemura 1985; supplemental Table S1) are negatively correlated (r2 = 0.38, P = 0.05 in a two-tailed Kendall's τ-test).
On average, apparent departures from equilibrium for autosomal loci were larger than on those on the X (dUP−PU = −0.27, P < 0.005 and dUP−PU = −0.15, not significant, respectively; supplemental Table S6 at http://www.genetics.org/supplemental/). The difference can be attributed to lineages Vir, Ame, and VirAme, for which significant deviations were detected among autosomal loci only (departures on the X were not significant for any lineage). Departures from equilibrium also varied across lineages (Table 4). Vir, Ame, and Mon accumulated a higher fraction of PU changes (the deviation was significant for the three), whereas a larger fraction of mutations on Lit, VirAme, and EzoLit were in the opposite direction. The Mantel–Haenszel test of homogeneity of odd ratios was used to assess the significance of the observed differences between UP and PU counts across genes between pairs of lineages (Table 5). Remarkably, the large accumulation of unpreferred codons in both Vir and Ame is in contrast to what was observed in their common ancestor (VirAme).
Variation in the patterns of base composition across lineages:
To investigate mutational patterns across lineages for putatively neutral sites, we compared the estimated numbers of A/T to C/T changes (AT–GC) with those in the opposite direction (GC–AT) from our intron sequence data. Given that most preferred codons end in G/C and unpreferred ones in A/T, and the evidence for selection on codon usage (Maside et al. 2004), data on synonymous sites are not suitable for this purpose, as it would be difficult to disentangle the effects of selection on codon usage from the underlying mutational processes.
Departures from base composition equilibrium were quantified by dWS−SW, where W and S denote weak (i.e., A and T) and strong (i.e., G and C) nucleotides, respectively, following the ambiguity code for nucleotides (Akashi et al. 2006). No significant departures from base composition equilibrium were detected in any lineage or for the pooled data set (Table 6). This result is in good agreement with previous evidence for equilibrium of base composition in D. americana (Maside et al. 2004). However, there is significant variation across loci (GH = 44.4, P = 2.1 × 10−5 in a G-test of heterogeneity; supplemental Table S7 at http://www.genetics.org/supplemental/). Tll, Ddx1, Cp36, and Yp1 show the greatest proportion of mutations to AT, and Cdc37 and dos are in the opposite direction. No significant differences were detected between autosomal and X-linked loci (dWS–SW = −0.02 and −0.06, respectively, using data pooled across lineages and loci). We did not detect a significant correlation between dWS–SW and dUP–PU (r2 = 0.22, P = 0.09) or between dWS–SW and Fop (r2 = 0.13, P = 0.13).
Genetic diversity and protein sequence evolution in D. americana:
Our data provide the first genomewide survey of variability in D. americana. In common with previous studies of a more limited number of genes (Hilton and Hey 1996; Vieira et al. 2001; McAllister 2003), we find a fairly typical level of silent- nucleotide-site diversity for a Drosophila species, with a mean of ∼2%. As reported in the results section, mean silent-site diversity is significantly lower for the X chromosome than for the autosomes, suggesting an absence of strong effects of sexual selection on genetic diversity in this species, in contrast to what has been described for African populations of D. melanogaster and D. simulans (Andolfatto 2001) and for D. miranda and D. pseudoobscura (Yi et al. 2003; Bartolomé et al. 2005).
At autosomal loci, the variance-weighted mean diversity for synonymous sites (in percent) is larger than that for introns (2.59 ± 0.251 vs. 1.66 ± 0.289; mean ± SE; Mann–Whitney U0.05[10,7] = 53; P = 0.05). This is, however, consistent with the theoretical expectation of an increase in synonymous diversity over neutral values at sites subject to weak selection on codon usage (Nes < 1), if there is mutational bias toward unpreferred codons (see Figure 2 of McVean and Charlesworth 1999). There is evidence that these conditions are met in the case of D. americana (Maside et al. 2004). Alternatively, there might be stronger selection on intron sites than synonymous sites, as reported for long introns in D. melanogaster (Haddrill et al. 2005). But similar levels of average sequence divergence (± SE) at synonymous and intron sites (14.3 ± 1.44% and 12.8 ± 1.66%, respectively, between D. americana and D. ezoana, using the method of estimation implemented in DnaSP, see materials and methods), together with the lack of evidence for selection at intron sites from the neutrality tests, suggest that this is not the case in our sample of introns, most of which can be considered as short (mean length = 83.6 bp; supplemental Table S1) by D. melanogaster standards.
In contrast with this high level of silent-site diversity, nonsynonymous diversity is very low, and the ratio of nonsynonymous to silent diversity is exceptionally low for a Drosophila species, with a value of 2–3% rather than the more usual value of ∼10% (see results section for details). In addition, the results on sequence evolution displayed in Figure 1 suggest that dN/dS is also unusually low along the branches of the phylogeny leading to D. virilis and D. americana, but not in their common ancestor; if anything, this effect is stronger for D. virilis. At first sight, this suggests that there has been an increase in Nes in both species, possibly because of an expansion in their population sizes. However, the limited available data on DNA sequence polymorphism in D. virilis provisionally suggest that silent-site variability is somewhat lower than in D. americana, and the ratio of nonsynonymous to silent diversity is close to the more typical value of 10% (see references in the results section).
This is difficult to reconcile with the hypothesis of a population expansion in both species as a cause of the low ratio of nonsynonymous to synonymous diversity. In addition, we have previously estimated Nes for mutations from preferred to unpreferred codons in D. americana; it is comparable to that in D. miranda (Bartolomé et al. 2005), a species with substantially lower silent diversity. This would imply that any increase in Nes has affected selection on nonsynonymous mutations, but not mutations affecting codon usage, which is very unlikely. An alternative hypothesis is that D. americana underwent a recent population bottleneck, causing it to lose low-frequency variants, especially nonsynonymous variants, and has recovered recently. This is, however, inconsistent with its high variability at silent sites and the absence of any signals of change in population size in patterns of noncoding variability (see results section).
It is, therefore, hard to be sure about what has caused the low nonsynonymous diversity and rate of substitution in D. americana. The possibility of a recent increase in Ne in this lineage raises concerns about the results of the McDonald–Kreitman tests and the estimates of α, the fraction of amino acid substitutions caused by positive selection (see Tables 3 and 4), which mostly used D. ezoana as an outgroup. These α-values are higher than those reported using similar methods for D. melanogaster and D. simulans (Bierne and Eyre-Walker 2004; Eyre-Walker 2006). As discussed by Eyre-Walker (2002), an artifactually high estimate of α can be caused by population expansion in the species from which polymorphism data are collected, if the nonsynonymous polymorphism/divergence ratio (rpd) is compared with that for a neutral standard. But if the comparison is with rpd for synonymous sites subject to selection on codon usage, the bias will generally be in the opposite direction (underestimation of α), provided that selection on nonsynonymous sites is substantially more intense than that on synonymous sites (Eyre-Walker 2002). The evidence for weak selection on codon usage provided by Maside et al. (2004) suggests that these conditions hold for D. americana.
These considerations imply that the estimate of a high rate of adaptive protein sequence evolution is probably not simply an artifact of population expansion. We cannot, however, unequivocally rule out the hypothesis of an increased intensity of selection against deleterious nonsynonymous, but not synonymous, variants in the D. virilis and D. americana lineages. Further polymorphism data on D. virilis are needed to see whether it, too, has reduced nonsynonymous variability compared with silent variability; the current data on this species (Vieira and Charlesworth 1999) are inadequate for a convincing test. If this is not the case, then the hypothesis of increased selection on nonsynonymous sites would be hard to sustain, given that the D. virilis lineage also shows reduced dN (Figure 1).
Possible effects of X chromosome rearrangements on diversity:
We need to consider the possibility that the relatively low silent diversity at D. americana X-linked loci may to some extent reflect the complex recent evolutionary history of its X chromosome since the split from the D. virilis lineage, which includes at least three chromosomal inversions (Figure 2 in Vieira et al. 2006), and a fusion with chromosome 4 (Patterson and Stone 1952; Throckmorton 1982; Caletka and McAllister 2004). The most recent inversion, In(X)c, is polymorphic and is strongly associated with the X/4 fusion chromosomes (94.6 and 7% of fusion and nonfusion chromosomes, respectively; Hsu 1952). The frequency of this arrangement displays a latitudinal cline, being more abundant in northern populations (Vieira et al. 2001; McAllister 2002).
Recent analyses have provided evidence for genetic differentiation between X/4 fusion and nonfusion chromosomes and indicate that X/4 fusion chromosomes have relatively reduced silent variability due to suppression of recombination, in the region between the centromere and the basal In(X)c inversion breakpoint (Vieira et al. 2001, 2003, 2006; McAllister 2002). The polymorphism data for seven of the nine X-linked genes were obtained from G96, a population sample nearly fixed for the X/4 fusion chromosomes, and the larger samples for the other two genes (elav and Pros28.1) included lines from three natural populations with different proportions of this arrangement (G96, OR01, and FP99, with 98, 86, and 14%, respectively). Thus, the relatively low silent diversity observed could be explained by the inclusion of most X-linked loci in this study in the low-variation region.
To check for this possibility, we identified the chromosomal locations of all X-linked genes for which we obtained polymorphism data (see materials and methods) and compared the level of silent variation at the loci putatively affected by the chromosome reorganizations in the D. americana lineage with that for loci in the presumably unaffected distal positions. The silent diversity of the potentially affected genes is not significantly different from that of the distal ones (1.6 ± 0.16 vs. 1.3 ± 0.50, respectively; mean ± SE), and the same applies to the ratios of silent-diversity values to the corresponding observed silent divergence between D. americana and D. virilis, to account for different intensities of purifying selection at silent sites (0.12 ± 0.016 vs. 0.15 ± 0.003, for distal and proximal loci, respectively). There is, thus, no evidence for a large-scale reduction of silent diversity on X/4 fusion chromosomes as a result of the recent chromosomal evolution (Figure 2).
We also checked for genetic differentiation between populations at elav and Pros28.1, the two loci with samples from different populations (see above). In agreement with a previous study (Vieira et al. 2003), we found no evidence for structure at elav for pairwise comparisons between populations with high (G96 and OR01) and low (FP99) prevalence of the X/4 fusion; KST = −0.003 and −0.006, for G69 vs. FP99 and OR01 vs. FP99 comparisons, respectively; none are significant in a permutation test with 1000 replicates (Hudson et al. 1992). The same result was obtained if G96 and OR01 are pooled and compared with FP99 (KST = −0.005, NS). On the other hand, at Pros28.1 there is significant differentiation between populations with high and low prevalence of the X/4 fusion (KST = 0.09 and 0.11, P < 0.01; for G69 vs. FP99 and OR01 vs. FP99 pairwise comparisons, respectively), but not between G96 and OR1 (KST = −0.027, NS). Given that genetic differentiation between fusion and nonfusion chromosomal forms seems to be restricted to the base of the X chromosome (Vieira et al. 2006), this result could be taken as indirect evidence that Pros28.1 is close to Fused at the basal end of block 4 (Figure 2).
Changes in codon usage and noncoding sequences:
The analysis of patterns of base substitutions at noncoding sites among the different species (Table 6) suggests that base composition at putatively neutral sites is roughly in equilibrium in this group, with AT to GC changes being almost the same in frequency as the reverse. However, the analysis of synonymous changes suggested a deficit of changes from unpreferred codons (U) to preferred codons (P) in the D. montana, D. virilis, and D. americana lineages, at least for autosomal loci (Tables 5 and 6). At statistical equilibrium, we would expect equal numbers of changes in each direction, even with selection operating (Akashi 1996).
In the latter two cases, this is probably an artifact of the inclusion of polymorphisms among the variants classed as fixed substitutions, since selection against U mutations causes a deficit of UP changes among polymorphisms but not fixations, if codon usage is at equilibrium (Akashi 1996, 1999). Maside et al. (2004) showed that, when a correction for polymorphisms misclassified as fixations was used, there were approximately equal numbers of PU and UP changes in the D. americana lineage, but the use of a single sequence for each gene led to a deficit of UP changes. The negative correlation between dUP−PU and codon usage bias (Fop) that we detected (see results) is consistent with this interpretation, as stronger selection on codon usage leading to higher Fop can produce a detectable effect on polymorphism statistics for synonymous sites (Cutter and Charlesworth 2006).
This explanation is unlikely to apply with such force to the relatively long D. montana branch (Figure 1), so that the large deficit of UP changes here may reflect a genuine decline in Nes for synonymous mutations that change codon usage in this lineage. The apparent excess of UP changes in the ancestor of D. virilis and D. americana is hard to interpret, but perhaps indicates an increase in Ne in this lineage that is too small to affect nonsynonymous substitutions. Otherwise, there is no strong evidence for departure from equilibrium with respect to codon usage.
Further insights into the evolution of codon usage are provided by examining the relation between codon usage bias in a gene and its rates of synonymous and nonsynonymous substitution. It has been suggested that codon usage differences may bias the estimates of dS produced by the GY method (Bierne and Eyre-Walker 2003). Therefore, for the purpose of among-locus comparisons, synonymous and nonsynonymous divergences (Ks and Ka, respectively) were estimated using the NG method implemented in DnaSP (see materials and methods). We detected a nonsignificant negative association (Figure 3A) between Ka and Fop. Differences between X-linked and autosomal loci are attributable to the presence of one outlier (per, which has an exceptionally high Ka value) and disappear if this locus is removed from the data. The overall negative correlation without per is then significant (r2 = 0.24, P = 0.025). On the other hand, Ks is not correlated with Fop (Figure 3B). Various evolutionary scenarios, such as interference between selection on nonsynonymous and synonymous sites, selection on codon usage for accuracy of translation, and translational constraints affecting nonsynoymous changes can cause a negative correlation between Ka and Fop (Akashi 1994; Betancourt and Presgraves 2002; Kim 2004; Marais et al. 2004).
All D. americana lines were kindly provided by B. F. McAllister. D. virilis, D. ezoana, D. littoralis, and D. montana stocks were provided by J. Vieira. We are grateful to J. Vieira for help in locating genes on D. virilis and D. americana chromosomes, to D. Filatov for kindly agreeing to writing new resampling subroutines for Proseq upon our request, to X. Bello for Python scripts for managing data files, and to F. Rodríguez-Trelles for help with the phylogenetic analysis. We also thank C. Bartolomé, B. F. McAllister, J. Vieira, and J. J. Welch for helpful discussions and critical reading that helped improve the manuscript and H. Cowan for media preparation. This work was financed by a grant from the Biotechnology and Biological Sciences Research Council, United Kingdom to B.C. B.C. is supported by The Royal Society and X.M. by a “Ramon y Cajal” contract from the Ministerio de Educación y Ciencia (Spain).
- Received January 23, 2007.
- Accepted May 11, 2007.
- Copyright © 2007 by the Genetics Society of America