| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
,1,2
* Center for Population Biology, University of California, Davis, California 95616 and
Committee on Evolutionary Biology, University of Chicago, Chicago, Illinois 60637
2 Corresponding author: Department of Ecology and Evolution, 1101 E. 57th St., Chicago, IL 60637. E-mail: arguello{at}uchicago.edu
| ABSTRACT |
|---|
|
|
|---|
130 chemoreceptor genes from five closely related species of Drosophila that share a common ancestor within the past 12 million years. We demonstrate that the overall evolution of the Or and Gr families is nonneutral. We also show that selection regimes differ both between the two families as wholes and within each family among groups of genes with varying functions, patterns of expression, and phylogenetic histories. Finally, we find that the independent evolution of host specialization in Drosophila sechellia and D. erecta is associated with a fivefold acceleration of gene loss and increased rates of amino acid evolution at receptors that remain intact. Gene loss appears to primarily affect Grs that respond to bitter compounds while elevated Ka/Ks is most pronounced in the subset of Ors that are expressed in larvae. Our results provide strong evidence that the observed phenomena result from the invasion of a novel ecological niche and present a unique synthesis of molecular evolutionary analyses with ecological data. DROSOPHILA has emerged as one of the most valuable models for understanding chemoreception. Its value stems from a relatively simple anatomical structure, the vast genetic tools available, and the recent identification of what are believed to be the complete olfactory receptor (Or) and gustatory receptor (Gr) repertoires (CLYNE et al. 1999, 2000; GAO and CHESS 1999; VOSSHALL et al. 1999; ROBERTSON et al. 2003; HALLEM et al. 2004). In Drosophila melanogaster, the Or and Gr families comprise 60 genes each, encoding 62 and 68 proteins, respectively (ROBERTSON et al. 2003). These genes are peripheral components of the chemosensory system. They are predicted to encode 7 transmembrane proteins that bind environmental chemicals and trigger nerve signals to higher processing centers in the brain. It has been demonstrated that in most cases, only one Or gene is expressed in any given olfactory receptor neuron and that this Or determines not only the odors to which the neuron is sensitive, but also the neuron's response dynamics (HALLEM et al. 2006). Less is known about Grs, but it is clear that multiple Gr genes are often expressed in a single gustatory receptor neuron (AMREIN and THORNE 2005). Although the two families are common to diverse insects, they share little sequence similarity with each other and do not appear to be homologous with functionally similar OR and GR genes found in vertebrates (HALLEM et al. 2006).
Comparisons of the Or and Gr families from distantly related insects (Drosophila, mosquito, and honeybee) have uncovered dramatic changes in gene family size and content (HILL et al. 2002; ROBERTSON et al. 2003; ROBERTSON and WANNER 2006) and fueled the widely held suspicion that these genes evolve rapidly. It is not clear, however, how chemoreceptors evolve over short time scales, as comparisons of closely related species have been limited by a lack of whole-genome sequences. This void was recently filled by the release of 11 new Drosophila genomes, 4 of which belong to species in the melanogaster subgroup that, together with D. melanogaster itself, share a common ancestor within the past 12 million years (Figure 1). Taking advantage of all 11 new genomes, two recent studies documented overall stasis in the size of the Or family across the genus as a whole (GUO and KIM 2007; NOZAWA and NEI 2007). But there has been no investigation of changes in the size of the Gr family and no comprehensive investigation of sequence evolution among closely related species for either family (but see TUNSTALL et al. 2007 for a study of 11 genes).
|
A second feature of the chemoreceptor superfamily that renders a study of its evolutionary behavior interesting is the fact that the functions and expression patterns of its constituent genes are rapidly being characterized. The past five years have witnessed the publication of nearly 100 articles on Drosophila Or and Gr genes. This information can guide biologically meaningful analyses of variation in evolutionary behavior within the superfamily. That variation, in turn, may provide insight into as yet undescribed functions, guiding further molecular genetic studies.
Last, because the sequenced Drosophila species within the melanogaster subgroup are ecologically diverse, an analysis of lineage-specific evolution may provide insight into the role of chemoreceptors in ecological adaptation. For example, while D. melanogaster, D. simulans, and D. yakuba are generalist flies exploiting a broad array of rotting fruit, D. sechellia and D. erecta have independently specialized on novel host fruit, Morinda citrifolia and Pandanus candelabrum, respectively (Figure 1; TSACAS and BACHLI 1981; RIO et al. 1983; LOUIS and DAVID 1986). The acquisition of these novel hosts was likely associated with dramatic changes in the microhabitats to which the two species are regularly exposed, including not only food plants, but also resting and mating sites, microclimates, and natural enemies. It therefore makes sense that Or and Gr genes, which represent the interface between the fly and its chemical environment, would experience novel evolutionary forces and display unusual patterns of evolution along these lineages. Indeed, a previous study demonstrated accelerated gene loss and amino acid substitution in D. sechellia Or and Gr genes compared to D. simulans (MCBRIDE 2007). With three additional species (one specialist and two generalists), forming a tree with six additional lineages (one specialist and five generalists; Figure 1), we can examine the generality of these phenomena and characterize them in greater detail.
Here we present a detailed molecular evolutionary analysis of the chemoreceptor superfamily among five closely related species of Drosophila from the melanogaster subgroup. We focus on (1) fundamental questions regarding the molecular evolution of the superfamily as a whole and contrasts between its two constituent Or and Gr families, (2) variation in the evolutionary behavior of discrete functional/expression/phylogenetic groups within each family, and (3) lineage-specific evolution associated with host specialization.
| MATERIALS AND METHODS |
|---|
|
|
|---|
A second set of D. simulans Or/Gr annotations was created using the six syntenic genome assemblies produced by the Drosophila Population Genomics Project (http://www.dpgp.org; BEGUN et al. 2007). Each of these assemblies constitutes low coverage shotgun sequence data from a single inbred strain assembled via alignment to the D. melanogaster genome. By extracting regions syntenic to D. melanogaster Or/Gr genes from each assembly, we were able to gather a sample of six D. simulans alleles for most loci (used for analyses of polymorphism within D. simulans). We also constructed a single representative "syntenic" allele for each D. simulans Or/Gr gene. This single allele was usually a full-length coding sequence chosen randomly from one of the six assemblies. When no single assembly contained a full-length coding sequence, however, we constructed the representative allele by piecing together segments from different syntenic assemblies. Since the syntenic assemblies seemed to contain fewer mistakes than the CAF1 D. simulans assembly (e.g., many putative nonsense mutations resulting from low-quality reads in the CAF1 assembly were masked by a stringent phred filter in the syntenic assemblies), these representative syntenic alleles were substituted for those derived from the CAF1 assembly for protein tree inference and analyses of divergence whenever possible. In particular, this substitution was made for almost all genes with orthologs in D. melanogaster, but not for genes absent in D. melanogaster (since the latter are not covered by the syntenic assemblies; see Final Allele column in supplemental Table 1B at http://www.genetics.org/supplemental/ for the source assembly of each simulans allele used in our analyses). We did not resequence putative nonsense mutations from the CAF1 D. simulans assembly unless they were supported by, or not covered by, the syntenic assemblies.
Protein tree inference:
We used Bayesian methods to infer the phylogenetic relationships among all annotated Or and Gr genes from all five D. melanogaster subgroup species plus D. ananassae. Pseudogenes were repaired (frame corrected) and included, as long as
20% of the full coding sequence was present. Treating Ors and Grs separately, we aligned translated coding sequences using ClustalW (THOMPSON et al. 1994) under two alignment parameter settings: a relaxed setting (–gapopen = 9 –gapext = 0.18) and the default setting. We then inferred a single protein tree for each family using the MrBayes MPI software package (HUELSENBECK and RONQUIST 2001; RONQUIST and HUELSENBECK 2003; ALTEKAR et al. 2004); mcmc nchains = 8 ngen = 1,000,000 samplefreq = 100 Temp = 0.03). Chain conversion was assessed as recommended in the user manual: (1) the potential scale reduction factors were all very close to 1, (2) the average standard deviation of split frequencies were all <0.05, (3) plots of the iteration vs. the log probability of the data (using the sump command) showed no trends, and (4) the likelihoods for separate runs on each data set were very close. Using TreeJuxtaposer (MUNZNER et al. 2003), the trees resulting from the two alternative alignment settings were compared and found to have only minor differences, excluding the following two cases. First, the relaxed Or tree has the Or67d and Or83c orthologs grouped with the Or56a, Or43a, Or49b, Or30a clade with a posterior of 0.9; the default tree has the Or67d and Or83c orthologs more closely related to the Or65c, Or65b, Or65a, Or47b clade. Second, the default Gr tree provides a posterior of 0.95 for a node that places the Gr22a orthologs as an outgroup to Gr22f, Gr22c, Gr22b, Gr22d, Gr22e, and thus changes some of the relationships within this clade; the relaxed tree provided a posterior <0.75. Trees based on the relaxed alignment settings are reported on here. In Figures 2 and 3, we collapsed all nodes with <75% posterior support and pruned all D. ananassae genes and all but one representative branch per D. melanogaster subgroup ortholog set. No nodes were collapsed in, nor genes/orthologs pruned from, the trees in supplemental Figures 1 and 2 at http://www.genetics.org/supplemental/. Tree manipulation was performed using TreeDyn (CHEVENET et al. 2006).
|
|
Divergence analyses 1—Ka/Ks:
We inferred the ratio of replacement to silent substitution (Ka/Ks) for each chemoreceptor gene present in the D. melanogaster subgroup by maximum likelihood as implemented in PAML (YANG 1997). Our inference for each gene was based on a manually curated ClustalW alignment of the D. melanogaster, D. simulans, D. sechellia, D. yakuba, and D. erecta orthologs (pseudogenes excluded) plus the nearest outgroup sequence (usually the D. ananassae ortholog, but sometimes a closely related paralog). Using a branch model, we assigned one Ka/Ks ratio to the outgroup branch, and a second independent ratio to all other branches (model = 2, NSsites = 0). The outgroup ratio was then discarded leaving a single Ka/Ks ratio characteristic of the divergence of the given gene within the D. melanogaster subgroup. For this analysis and for all analyses described below, we assumed the topology illustrated in Figure 1, placing D. yakuba and D. erecta as sister species. We compared the median and mean Ka/Ks ratios of interesting subsets of chemoreceptor genes using nonparametric two-sample Wilcoxon rank-sum tests or parametric t-tests/ANOVAs. Although log-transformed data were substituted for raw data for parametric analyses of variation within the Or family, the Gr data did not require such a transformation. Our analyses excluded genes that have been lost (or duplicated) along any lineage, except when explicitly comparing "lost" genes to those that remain intact in all species.
Divergence analyses 2—rate heterogeneity:
To investigate whether replacement and silent divergence within the Or and Gr families is clocklike, we followed a maximum-likelihood procedure provided by LANGLEY and FITCH (1974). The aim was to investigate whether the observed data deviate from what would be expected given a constant Poisson clock. Briefly (LANGLEY and FITCH 1974, p. 162), the likelihood of the observed number of substitutions in the mth protein along the ith branch (
m,i) is
![]() |
m is the proportionate substitution rate of the mth protein and tk and ti represent time points at the beginning and the end of a branch. Assuming independence across proteins and along branches, the likelihood of the entire data set is
![]() |
Our "observations" were inferred via maximum likelihood using PAML's codeml package (YANG 1997) by parsing rst files from the runs described under Divergence analyses 1—Ka/Ks and are provided in supplemental Table 2 at http://www.genetics.org/supplemental/. With one exception (Or19a), we considered genes that have only a single intact ortholog in all five subgroup species.
m was computed for a particular protein by taking the sum of all substitutions among its five D. melanogaster subgroup orthologs and dividing it by the sum of all substitutions in all proteins. Because the denominator in the likelihood cannot equal zero, ortholog sets in which one or more branches had zero observed substitutions were excluded.
To estimate the maximum likelihood, MCMC sampling with uniform proposal distributions was used to distribute mutations along branches. Three different chains were run, each with very different starting values. The computationally intensive portion of the routine was written in C and data were outputted to R, where statistical analyses and convergence diagnostics were carried out using the CODA package (PLUMMER et al. 2006). Convergence was considered successful if all three chains showed good mixing, converged to the same likelihood, and if the Geweke's convergence diagnostic (GEWEKE 1992), which compares the mean values within windows at the start and end of the chain following the burnin, supported stationarity. A likelihood-ratio test was used to test the constant-rate assumption.
To compare our results for Or/Gr genes to a random gene sample of similar size, we repeated the above procedure on a group of 50 protein-coding genes randomly chosen from the D. melanogaster group guide-tree alignments provided by the DROSOPHILA 12 GENOMES CONSORTIUM (2007).
Divergence analyses 3—index of dispersion:
To complement the rate-heterogeneity tests described above, we carried out a second test of the Poisson molecular clock at the same loci by estimating the index of dispersion [R(t), variance-to-mean ratio] for silent and replacement substitutions. We subsampled the orthologs for each gene in two different ways: (1) excluding D. sechellia [(Dere, Dyak), Dsim, Dmel] and (2) excluding D. simulans [(Dere, Dyak), Dsec, Dmel]. The rationale for this is that speciation between D. simulans and D. sechellia has occurred very recently (
2 MY; HEY and KLIMAN 1993; KLIMAN et al. 2000; S. KUMAR, unpublished data), and the stochasticity of coalescent events occurring in species trees with short branches can inflate estimates of R(t) (HUDSON 1983). Analyzing D. simulans and D. sechellia separately should avoid this bias. R(t) was calculated following the procedure of GILLESPIE (1994); lineage weights are provided in supplemental Table 3 at http://www.genetics.org/supplemental/.
Under strict neutrality, R(t) is expected to equal one (GILLESPIE 1989, 1994). To evaluate the significance of departures from 1, we generated simulated data sets using a procedure similar to those previously described (GILLESPIE 1989; ZENG et al. 1998). First, ancestral sequences for each ortholog group were inferred using maximum likelihood as implemented in PAML's codeml (rateancestral = 1). These ancestral sequences were "evolved" according to the four-species topologies, with silent and replacement mutations along the sequence being Poisson distributed with means equal to those estimated from the actual data, but after the lineage weights had been applied. The Ors and Grs had their own transition and transversion probabilities that were estimated from their respective full data sets. The procedure was repeated 5000 times for each ortholog group, and R(t) was estimated for each resulting data set as described above.
Polymorphism analyses in D.simulans:
To investigate patterns of polymorphism at Or/Gr loci, we took advantage of the six syntenic D. simulans genome assemblies. We extracted the coding sequence of each Or/Gr gene from all six assemblies (see Annotations) and aligned them to the inferred sequence of the most recent common ancestor (MRCA) of D. simulans and D. melanogaster (ancestral sequence for each gene inferred via maximum likelihood during the PAML runs described in Divergence analyses 1—Ka/Ks). These alignments included many gaps because the coverage from any given assembly was relatively low; the average number of alleles available per site was 3.5. Genes with putative nonsense mutations in any of the six assemblies were not considered. We then wrote automated procedures in Python (http://www.python.org) that used the alignments to infer the number of silent and replacement substitutions/polymorphisms that had occurred at each locus along the D. simulans lineage. Inferences were parsimony based and minimized first the number of total changes and second the number of replacement changes required to explain the variation observed at any given codon site. To reduce the likelihood of including ancestral polymorphisms in our analysis, we ignored all polymorphisms for which one allele was shared with D. melanogaster and the other allele was shared with D. yakuba. Using the resulting substitution and polymorphism counts, we tested for recent positive/purifying selection by (1) conducting a McDonald–Kreitman (MK) test (MCDONALD and KREITMAN 1991) on each individual gene and (2) examining the distribution of the neutrality index (NI, ratio of silent to replacement divergence divided by the ratio of silent to replacement polymorphism) (RAND and KANN 1996) for the Or and Gr families as wholes. These tests excluded genes with fewer than six polymorphisms, six fixations, six silent variants, or six replacement variants (i.e., with any row or column sum less than six). We also compared Ors to Grs by tallying the number of silent and replacement polymorphisms/substitutions across all genes within each family and then asking whether the resulting pooled MK tables were significantly different using a three-way contingency test.
To ask whether the pattern of polymorphism and divergence observed at Or and Gr loci was significantly different from that characterizing the rest of the genome, we repeated the above analyses on a set of 3222 genes scattered throughout the genome using alignments provided by BEGUN et al. (2007). Since the D. erecta, D. sechellia, and D. ananassae alleles were not included in these alignments, we inferred the sequence of the MRCA of D. melanogaster and D. simulans via parsimony using the D. simulans, D. melanogaster, and D. yakuba alleles only (rather than extracting these ancestral sequences from PAML runs on six species alignments). The Or/Gr data were reanalyzed in the same way for comparison.
Gene loss in specialists:
We examined potential variation in the rate of chemoreceptor gene loss among lineages using a maximum-likelihood framework implemented in a new extension of the program Brownie (O'MEARA et al. 2006; see APPENDIX). In this analysis, the status of each Or/Gr gene present in the MRCA of the D. melanogaster subgroup was traced across an ultrametric phylogeny including the five focal species (Figure 1). The likelihood of inferred gene loss events (considered irreversible) was then estimated under five alternative models. The simplest model assigned a single rate of loss to the entire subgroup. The remaining four models assigned an independent rate of loss to each of two groups of lineages defined a priori. We assessed the relative fit of alternative models using corrected Akaike information criteria (AICc) (HURVICH and TSAI 1989) and Akaike weights. Ors and Grs were examined separately.
We investigated the possibility that the Gr genes lost along specialist lineages were a nonrandom subsample of Grs in two different ways. First, we asked whether the lost genes were phylogenetically clustered. Using our unrooted Bayesian Gr tree (Figure 7), we computed the total length of the smallest subtree containing all lost genes; the smallest subtree was defined as that which included the fewest total genes. We then compared this length to a null distribution derived by repeating this procedure for 10,000 random samples of lost Gr genes (of the same size as the real set). The smallest subtree for each data set was identified by comparing a list of the lost genes to a bipartition table using automated Python scripts and the total lengths of these subtrees were computed in PAUP* (SWOFFORD 2002). Second, we asked whether the same genes were lost independently along multiple lineages more often than expected by chance. We simulated losses along the D. melanogaster subgroup topology shown in Figure 4 by assigning loss events to randomly selected genes while holding the number of losses and the branches on which these losses occur constant. We then compared the number of "overlaps" (genes lost independently along two or more lineages) from the real data set to the distribution derived from the simulations. We considered three types of overlaps—those among specialist lineages, those among generalist lineages, and those between specialist and generalist lineages.
|
|
To examine potential heterogeneity in the observed lineage effects across groups of genes with different functions or expression profiles, we conducted two nested ANOVAs on log transformed Ka/Ks data from the ecological model. The first ANOVA applied to Or genes only and tested for the main effects of host ecology (generalist lineages vs. specialist lineages), life stage (genes expressed in adults only vs. adults plus larvae vs. larvae only), the interaction between host ecology and life stage, and the nested effect of individual Or genes within life stages. The interaction effect was of primary interest—addressing the possibility that Ka/Ks along specialist lineages may be elevated for genes expressed during certain life stages but not for genes expressed during other life stages. The second model was similar to the first except that it applied to Gr genes only, and the effect of tuning modality (whether individual genes respond to bitter compounds, sweet compounds, or unknown compounds) was substituted for the effect of life stage. For visual comparison in Figure 9, the large set of Gr genes that respond to unknown compounds was split into "conserved" and "unconserved" subsets on the basis of inferred whole subgroup Ka/Ks ratios. The intact orthologs of genes that have been lost were included in the Gr model to maintain balance among categories; otherwise, there would be only three bitter receptors.
|
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
|
Using our annotated coding sequences, we calculated the effective number of codons (ENC) (WRIGHT 1990) used in each gene from each melanogaster subgroup species using the codonw server (http://bioweb.pasteur.fr/seqanal/interfaces/codonw.html). Overall, there is little evidence of codon bias, with both Ors and Grs having a mean ENC of
52 (Gr range = 36.38–61, Or range = 32.66–61; ENC for each gene is an average across all intact orthologs). There was also little variation in ENC among orthologs of the same gene from different species (supplemental Table 5 at http://www.genetics.org/supplemental/).
Bayesian trees provide confident inferences of branching order deep within both families:
Using Bayesian methods, we inferred unrooted protein trees for both the Or and Gr families. Figures 2 and 3 show abridged versions of these trees; supplemental Figures 1 and 2 at http://www.genetics.org/supplemental/ show full versions. Though computationally intensive, Bayesian methods are more powerful than the neighbor-joining method used by previous studies (ROBERTSON et al. 2003; GUO and KIM 2007; NOZAWA and NEI 2007) and provide more confident inferences of branching order deep within the family. Almost all nodes in both trees are strongly supported (Figures 2 and 3). Comparison of our Or tree to that of GUO and KIM (2007) did not reveal any major discrepancies.
Evolution of the Or and Gr families as wholes
To characterize the molecular evolution of the Or and Gr families within the melanogaster subgroup, we adopted three complementary approaches. First, we examined changes in copy number. Second, using comparative sequence data from all five species, we examined rates of orthologous sequence divergence at silent and replacement sites (Ka/Ks, overall rate heterogeneity, and the index of dispersion). Third, using population genetic data from just one species, we contrasted divergence to polymorphism at silent and replacement sites (neutrality index).
Gene gain and loss—overall contraction of gene family size:
The close relationship among the five melanogaster subgroup species made assignments of orthology unambiguous and allowed us to infer the timing of gene duplication and loss events confidently via parsimony (using the phylogeny shown in Figure 1 and assuming loss events to be irreversible). Losses are defined as orthologs that were deleted or exhibited a nonsense mutation in all alleles examined for a particular species. Although they do not include orthologs that we observed to be polymorphic for nonsense mutations (one in erecta, two in melanogaster, and possibly a few in simulans), it is possible that functional alleles are segregating at some lost loci in natural populations. Figures 2 and 3 show the identity of all duplicated/lost genes, and Figure 4 shows the distribution of gain/loss events across all eight lineages. Note that our inferences for the Or family differ substantially from those of GUO and KIM (2007) who report twice as many Or duplications and distribute Or losses differently across the melanogaster subgroup lineages (compare Figure 4 from this study to their Figure 1). These discrepancies are most severe in the subclade including the species with the lowest quality assemblies (simulans and sechellia) and likely result from the previously described differences in annotations (e.g., their acceptance of several spurious nonsense mutations and putative duplicates on identical, short, and/or unassembled contigs).
Starting from a basal set of 64 Or genes present in the MRCA of the subgroup, we observed a total of 4 Or duplication events and 12 Or loss events. Starting from a basal set of 74 Gr genes, we observed 0 Gr duplications and 35 Gr losses. Since two of the Or duplication events and several of both the Or and Gr loss events affected the same genes along different lineages, the numbers of genes that experienced at least 1 duplication or loss are slightly lower. The most striking trend arising from these inferences is the overall contraction of the two families. Using a maximum-likelihood framework implemented in the program Brownie (see APPENDIX), we estimated that the overall rates of loss for Ors and Grs are 0.46 and 1.02 losses per gene per 100 MY, respectively. While a contingency test comparing the proportion of genes lost along at least one lineage to the proportion duplicated showed that the loss rate was only marginally higher than the duplication rate for Ors (Fisher's exact P = 0.05), losses dramatically exceeded duplications for Grs (Fisher's exact P < 10–8). The overall rate of Gr loss, however, masks substantial lineage-specific variation, which we discuss in the Specialists are losing Gr genes approximately five times more rapidly than generalists section.
Divergence 1—Ka/Ks:
Interested in overall rates of divergence, we first examined the ratio of replacement to silent substitution (Ka/Ks) at each locus. We examined a single ratio per gene reflecting substitutions that have occurred across the entire tree in Figure 1. Every Or and Gr was characterized by Ka/Ks less than one (range = 0.01–0.57, median = 0.15; Figure 5; raw data in supplemental Table 6 at http://www.genetics.org/supplemental/), indicating that they are functional and experience purifying selection. Our inferences of Ka/Ks for Ors within the melanogaster subgroup are similar to, but significantly lower than, the values inferred across all 12 Drosophila genomes by GUO and KIM (2007) (paired Wilcoxon P = 0.001, median difference = 0.024). The higher values obtained across all 12 species may reflect saturation at silent sites along long branches.
|
Conservatively limiting our analysis to the 104 ortholog groups that have not experienced changes in copy number (with the exception of Or19a), we found that both the Or and Gr families are accumulating substitutions in a definitively nonclocklike fashion. Overall tests of heterogeneity in rates of silent and replacement substitution were significant for both families, as were the subtests of heterogeneity within and along branches (all P-values < 10–17; supplemental Table 7 at http://www.genetics.org/supplemental/). A similarly sized data set of randomly chosen protein-coding genes from the same species, however, also led to unequivocal rejection of a neutral Poisson model (supplemental Table 7). This suggests that rate heterogeneity is not unique to the chemoreceptor family and may result from lineage effects such as generation time or from pervasive positive selection. Alternatively, it is also possible that the substitution process in Drosophila is poorly modeled using a Poisson distribution (GILLESPIE 1994). If true, it would not be surprising that discrepancies summed over many genes would give significant results.
Divergence 3—index of dispersion:
As a complementary approach to the rate-heterogeneity tests, we calculated the index of dispersion [R(t)] for the same subset of proteins, following the procedure outlined by GILLESPIE (1989, 1994; see MATERIALS AND METHODS). Under a strict neutral model, substitutions are Poisson distributed and the variance to mean ratio, R(t), is expected to be 1. Estimates significantly greater than one (overdispersed) have been used to argue against neutral evolution (GILLESPIE 1989, 1994; CUTLER 2000; KERN et al. 2004). These measurements are informative because they pertain to particular proteins, whereas the above tests of rate heterogeneity are descriptive of the families as wholes. Moreover, they provide a test of the Poisson clock that takes uniformly acting lineage effects (e.g., different generation times or population sizes) into account.
Both Or and Gr genes tend to be overdispersed, but to a small degree (Figure 5; raw data in supplemental Table 6 and supplemental Figures 3–6 at http://www.genetics.org/supplemental/). The median R(t) taken over both topologies (see MATERIALS AND METHODS) is close to 1.5 for both Or and Gr replacement substitutions and 1.9 for Or and Gr silent substitutions: median Or
= 1.43, median Gr
= 1.55, median Or
= 1.82, median Gr
= 1.92. All four of these values are significantly greater than one (Wilcoxon P < 10–6 for each). By simulating empirical null distributions that were based on the scaled trees, we also identified individual genes that were significantly overdispersed. These included 17 Ors and 14 Grs overdispersed for replacement substitutions and 7 Ors and 11 Grs overdispersed for silent substitutions (supplemental Figure 7 at http://www.genetics.org/supplemental/). A contingency test indicated that the proportion of all genes overdispersed for Ka was significantly higher than the proportion overdispersed for Ks (
2 P = 0.03), particularly within the Or family. Although we did observe several genes that were significantly underdispersed (n = 13), the total number of these cases is close to what we would expect by chance given the number of tests (e.g.,
5%).
A considerable amount of work has focused on the interpretation of overdispersed proteins (GOLDMAN 1994; NIELSEN 1997; ZENG et al. 1998; CUTLER 2000; KERN et al. 2004; WILKE 2004). There are several interpretations that are unlikely to apply to our data set. As mentioned previously, factors that vary between lineages in a uniform way across loci are removed by the lineage weighting scheme used prior to R(t) calculation (GILLESPIE 1989) and should not contribute to the observed patterns. Also, while saturation at silent sites has been shown to bias estimates of
, the mean Ks between the most distant species that we consider is only
0.35 for both Ors and Grs. Finally, variation in the strength of selection for major codons (as invoked by ZENG et al. 1998) is unlikely to explain our results; we already reported that Ors and Grs have relatively little codon bias (ENC
52), with negligible variation in bias across species (supplemental Table 5 at http://www.genetics.org/supplemental/). In addition, proteins possessing significant
values span the range of ENC values rather than possessing the lowest scores, and there is no correlation between
and the absolute value of the deviation of a particular protein's ENC from the family average ENC (supplemental Figures 8 and 9 at http://www.genetics.org/supplemental/).
The most straightforward interpretation of overdispersion within the Or and Gr families posits that elevated R(t) results from episodic bursts of substitution associated with lineage-specific changes in the strength of positive and/or purifying selection (GILLESPIE 1989; KERN et al. 2004). This explanation would account for our observation of more frequent overdispersion of Ka than of Ks since selection is likely to be stronger at replacement sites than at silent sites. It would also be compatible with independent evidence of positive selection acting on chemoreceptors in general (see Polymorphism within simulans, below, and TUNSTALL et al. 2007) and of lineage-specific changes in selection regimes associated with host shifts (see Evolution of Ors and Grs along specific lineages—a role for the chemoreceptor superfamily in host specialization, below). Interestingly, it also provides the best explanation for overdispersion at Or19a. This gene has duplicated along the melanogaster lineage and was therefore excluded from the summary of R(t) results presented above. Nevertheless, the fact that a disproportionate number of replacement substitutions have occurred in this gene along the lineage in which it duplicated (supplemental Table 2 at http://www.genetics.org/supplemental/) and the fact that this burst of substitution appears to have driven the gene's
well above one, suggests that changes in the strength of selection have the potential to produce overdispersion on the scale observed in our data set. It also supports previous concerns regarding the use of duplicate genes in R(t) analyses (OHTA 1991).
Polymorphism within simulans:
Patterns of divergence between species can help reveal nonneutral evolution, but analyses of polymorphism within species typically provide more powerful tests of the specific forces underlying nonneutral evolution. We therefore used available genomic polymorphism data from D. simulans to further characterize the evolutionary forces acting on the Or and Gr families. Our analyses included 54 Or and 61 Gr loci that were covered by the simulans syntenic assemblies and did not show any evidence of polymorphic nonsense mutations. The average number of alleles covered per site at each locus ranged from 0.4 to 5.5 with a mean of 3.5. Although this sample size is too low for analysis of the frequency spectrum, it is sometimes adequate for MK tests of selection (MCDONALD and KREITMAN 1991). MK tests look for a significant difference between the ratio of replacement polymorphism to fixation and the ratio of silent polymorphism to fixation via a contingency test. If all sampled variants and observed fixations are neutral, the two ratios should be the same and the neutrality index—former ratio divided by latter ratio—should equal one (RAND and KANN 1996). Positive selection on replacement sites theoretically lowers this index below one (since it adds to the number of replacement fixations without affecting the number of replacement polymorphisms), while weak purifying selection on replacement sites raises it above one (since it prevents a subset of low-frequency replacement polymorphisms from contributing to the observed replacement fixations).
We inferred the number of silent and replacement polymorphisms and fixations arising along the simulans lineage since this species' MRCA with melanogaster and conducted a polarized MK contingency test at each Or/Gr locus that exhibited a minimum number of variants (see MATERIALS AND METHODS). These tests, summarized in Table 2, provided strong evidence for positive selection on both families. First, 6 of 27 Or and 11 of 36 Gr genes were individually significant with NI < 1 at the P < 0.05 level (a larger fraction than expected by chance). Second, the median NI of each family was significantly <1 (Or median NI = 0.61, Wilcoxon P = 0.04; Gr median NI = 0.42, Wilcoxon P < 0.0002). Third, pooled MK tests tallying the total number of silent and replacement polymorphisms and fixations across all loci within each family were significant, with NI < 1 (Or pooled NI = 0.54, Gr pooled NI = 0.38). Interestingly, however, there was no more evidence of positive selection on the Or and Gr families than on 3222 genes scattered across the simulans genome as a whole (Table 2); we found no difference in the proportion of significant tests (contingency P = 0.3), median NI (Wilcoxon P = 0.4), nor pooled NI (three-way contingency test comparing the two pooled MK tables, P = 0.8). Thus either a good portion of the entire simulans genome experiences positive selection (BEGUN et al. 2007) or some other process such as weak purifying selection on silent sites is driving the pattern. The latter alternative seems unlikely, at least at chemoreceptor loci, which do not show codon bias.
|
Grs also differed from Ors in their lower neutrality indices along the simulans lineage. The pooled NI of the Gr family (0.38) was significantly lower than the pooled NI of the Or family (0.54) by a three-way contingency analysis (P = 0.01). Summing across loci with varying levels of constraint can complicate the interpretation of pooled MK tests. The difference between the families proved consistent, however, even when broken down into five contrasts between smaller pools of Or and Gr genes with similar rates of substitution and levels of polymorphism (Gr NI was lower than the Or NI for each of the five smaller pools; supplemental Table 8 at http://www.genetics.org/supplemental/). This difference is also reflected in a trend for the Gr family to have a larger proportion of individually significant genes and a lower median NI than the Or family (Figure 5). Since low neutrality indices are associated with positive selection, the observed difference suggests that Grs experience stronger positive selection than Ors. The alternative explanation of weaker purifying selection, however, still cannot be completely ruled out. It is possible, for example, that Or genes contain a class of sites under weak purifying selection (contributing to observed replacement polymorphisms but not fixations) that are completely neutral in Gr genes (contributing to both polymorphisms and fixations). There was no difference in R(t) between the Or and Gr families at either silent or replacement sites (two-sample Kolmogorov–Smirnov tests P = 0.8 for Ka, P = 0.3 for Ks; Figure 5).
Why might Grs experience a different selection regime than Ors? Despite their similarity, the insect olfactory and gustatory codes may be quite different. The olfactory code is combinatorial with most odorants stimulating multiple olfactory receptor neurons (each expressing a single Or gene) and being represented by the unique assemblage of such neurons transmitting impulses to the antennal lobe. The gustatory system, on the other hand, appears to adhere to a labeled-line model wherein tastants stimulate one or more Gr genes within a single class of gustatory neurons (e.g., sweet, bitter, etc.) that are in turn hard wired to specific behaviors (e.g., attraction, repulsion; MARELLA et al. 2006). It is possible that this difference in organization makes the olfactory system less flexible than the gustatory system to evolutionary changes in single receptor proteins at the periphery. Moreover, although many coexpressed Grs surely serve independent and unique functions (e.g., Gr5a and Gr64a; CHYB et al. 2003; JIAO et al. 2007), some may be partially redundant, releasing each other from evolutionary constraint. Finally, it is possible that of all the compounds that a fly must be able to recognize, the soluble ones are more variable between environments/mates/hosts than the volatile ones, resulting in more frequent selection on Grs for novel binding affinities or sensitivities. Note that the only Gr known to be involved in mate recognition (Gr68a) has neither a particularly low NI (0.38) nor a particularly high Ka/Ks (0.15) (supplemental Table 6 at http://www.genetics.org/supplemental/; but see TUNSTALL et al. 2007 for evidence of positive selection on a particular codon site).
Evolution of specific groups of genes within the Or and Gr families
Functional geneticists are rapidly accumulating information on the binding specificities and patterns of expression of individual receptor genes. This information, combined with some of our own inferences, allowed us to assign genes to a priori categories on the basis of function, expression, subfamily, or propensity for loss, and subsequently look for meaningful variation in evolutionary behavior among these categories within the Or and Gr families. The following results describe variation we detected in Ka/Ks only. We found no significant variation in R(t), and the simulans polymorphism data are not sufficient for comparisons of NI among small subgroups of genes.
Or expression groups—genes expressed in large basiconic sensilla are highly conserved:
Or proteins and the dendrites of the olfactory receptor neurons (ORNs) in which they are embedded, are housed in porous sensory hairs found on the antennae and maxillary palpi of adult Drosophila. These sensory hairs, called sensilla, come in five distinct morphological classes: large basiconic, small basiconic, trichoid and ceoloconic (on the antenna only), and thin basiconic (on the antenna and maxillary palp; SHANBHAG et al. 2000). We used published data to categorize Or genes according to the class of sensillum in which they are expressed (COUTO et al. 2005) and found significant variation among these groups. In particular, Ors expressed in large basiconic sensilla on the adult antenna had significantly lower Ka/Ks than Ors expressed in most other classes (one-way ANOVA P = 0.03; Figure 6A, middle; supplemental Table 9 at http://www.genetics.org/supplemental/; coeloconic sensilla were excluded from the analysis since only one Or gene is known to be expressed therein). This particularly applies to the ab1 and ab2 sensilla, which house a total of six ORNs expressing four of the five most conserved adult Or genes (Or59b, Or85a, Or42b, and Or92a) and, incidentally, two of the most conserved Gr genes (Gr21a and Gr63a). Or59b and Or85a have been physiologically characterized but do not appear unusual in their binding affinities (HALLEM and CARLSON 2006). Interestingly, Or42b has been implicated in the ability of flies to detect an odor given off by stressed conspecifics (G. SUH, personal communication), and we have no information on Or92a. We found no variation in Ka/Ks among groups of Or genes that respond to aliphatic vs. aromatic compounds (ANOVA P = 0.40), that are expressed during different life stages (ANOVA P = 0.47), or that belong to different subfamilies within the Or tree (ANOVA P = 0.2, Figure 6A, right; supplemental Table 9 at http://www.genetics.org/supplemental/; subfamilies marked on the unrooted phylogram in supplemental Figure 10 at http://www.genetics.org/supplemental/).
|
We decided to further categorize Gr genes according to their phylogenetic history by dividing our unrooted Bayesian tree into eight clades/subfamilies of closely related paralogs (labeled A–H in Figure 7). These groupings revealed another level of significant variation in Ka/Ks (one-way ANOVA P < 0.0001, Figure 6B, right). In accordance with the results of the previous analysis, the subfamily comprising the sweet and CO2 receptor sister clades (A in Figure 7) had one of the lowest mean Ka/Ks ratios, and the subfamily including five of the six putative bitter receptors (E in Figure 7) had the highest mean Ka/Ks (Figure 6B). The phylogenetic groupings also revealed remarkable heterogeneity within the large group of Grs with unknown functions. Most strikingly, the alternative splice forms of Gr28b and their four closest relatives (subfamily H in Figure 7) are at least as conserved as the sweet/CO2 receptors (mean Ka/Ks = 0.06). Also, subfamily B, interesting by virtue of its position as sister to the sweet/CO2 responders, had a relatively low mean ratio (Ka/Ks of Gr10a = 0.05, mean of the rest = 0.21). The fact that the honeybee genome appears to contain genes in both of these subfamilies (B and H), as well as genes in the sweet/CO2 clade (A), suggests that they play fundamental roles common to many different types of insects (ROBERTSON and WANNER 2006).
Interestingly, the Gr subfamilies themselves can be grouped into pairs or trios with similar Ka/Ks. For example, as one moves counterclockwise around the tree in Figure 7 from subfamily A to H, the mean Ka/Ks of each consecutive clade grows larger, peaking with subfamily E, and then waning through F and G to the highly conserved subfamily H (resulting in an arc in the right section of Figure 6B). This nonrandom pattern boosts our confidence in the deep bifurcations within the Bayesian Gr family tree and supports the idea that subfamilies comprise functionally related genes.
Chemoreceptors that have been lost at least once have high Ka/Ks:
Our final analysis of heterogeneity within the Or and Gr families contrasts genes that have been lost in at least one species (and were therefore excluded from the previous analyses) with genes that remain intact in all five species. We found that the intact orthologs of the former have significantly higher median Ka/Ks than the latter within both families (Wilcoxon P = 0.04 and 0.0002 for Ors and Grs, respectively; Figure 6; supplemental Table 9 at http://www.genetics.org/supplemental/). Interestingly, however, this effect becomes weak when Gr genes are blocked by subfamily (two-way ANOVA including subfamily and loss as factors, P for effect of loss = 0.03), suggesting that high Ka/Ks and high rates of loss tend to characterize specific subfamilies rather than individual genes within diverse subfamilies. For example, subfamily E, which contains almost all of the putative bitter receptors, has the highest mean Ka/Ks (Figure 6B) and has sustained the highest number of losses (Figure 7); the lost genes within this group do not have higher Ka/Ks than their intact relatives (one-way ANOVA P = 0.7). The observed relationship between loss and Ka/Ks may thus reflect their shared association with particular functions (e.g., detection of bitter compounds; see Specialists are losing a nonrandom set of Grs, below).
Evolution of Ors and Grs along specific lineages—a role for the chemoreceptor superfamily in host specialization
Our investigation of patterns of divergence and polymorphism showed that the Or and Gr families are not evolving neutrally and suggested that they experience positive selection along at least the simulans lineage. These analyses, however, say nothing about the biological phenomena underlying these patterns. One possible cause of nonneutral evolution is ecological adaptation. The five sequenced species vary greatly in their biogeography and ecology, and their ancestors must have therefore undergone several evolutionary transitions from one ecological niche to another. MCBRIDE (2007) previously described a clear difference in the evolutionary behavior of Or and Gr genes along the lineages of D. sechellia and D. simulans and hypothesized that the evolutionary transition from a host generalist to a host specialist occurring along the sechellia lineage contributed to the difference. We now have data from three additional species and six additional lineages (Figure 1). Since a second, independent transition to host specialization occurred along one of these additional lineages (that leading to erecta), we can now ask whether the as