As classical phase II detoxification enzymes, glutathione S-transferases (GSTs) have been implicated in insecticide resistance and may have evolved in response to toxins in the niche-defining feeding substrates of Drosophila species. We have annotated the GST genes of the 12 Drosophila species with recently sequenced genomes and analyzed their molecular evolution. Gene copy number variation is attributable mainly to unequal crossing-over events in the large δ and ε clusters. Within these gene clusters there are also GST genes with slowly diverging orthologs. This implies that they have their own unique functions or have spatial/temporal expression patterns that impose significant selective constraints. Searches for positively selected sites within the GSTs identified G171K in GSTD1, a protein that has previously been shown to be capable of metabolizing the insecticide DDT. We find that the same radical substitution (G171K) in the substrate-binding domain has occurred at least three times in the Drosophila radiation. Homology-modeling places site 171 distant from the active site but adjacent to an alternative DDT-binding site. We propose that the parallel evolution observed at this site is an adaptive response to an environmental toxin and that sequencing of historical alleles suggests that this toxin was not a synthetic insecticide.

THE sequencing of the genomes of 12 Drosophila species provides a powerful new evolutionary context by which to understand insect biology. It bridges the gap between previous comparative studies of insect genomes that focus on gene composition and recent studies in Drosophila that address the extent to which adaptation has shaped molecular evolution of genes. The latter studies have used tests of neutrality that partition within- and between-species variation into categories such as synonymous and nonsynonymous sites and have been used to estimate that ∼50% of the amino acid substitutions in Drosophila melanogaster and its sibling species have been adaptive (Smith and Eyre-Walker 2002; Andolfatto 2005; Welch 2006). While these studies quantify the amount of adaptation, comparative analyses of the 12 Drosophila genomes have the power to identify specific loci that are the targets of positive selection by their patterns of sequence divergence (Yang and Nielsen 2002).

Here we examine the molecular evolution of glutathione S-transferase (GST) gene family, which fulfill a number of functional roles, including detoxification. In the context of adaptation in insects, detoxification is of interest for two reasons. First, the ecological niche of Drosophila species is largely defined by the larval feeding substrate, and therefore the genes enabling Drosophila to identify, detoxify, and utilize these substrates as nutritional resources are obvious candidates for genes that have been the targets of natural selection. Indeed, in at least 2 of the 12 species with sequenced genomes, detoxification is believed to be important in the adaptation to niche-defining larval substrates: D. sechellia to toxic levels of octanoic acid in Morinda fruit (Legal et al. 1994) and D. mojavensis to necrotic tissues of various cactus species (Ruiz and Heed 1988). Second, any adaptive evolution involving the detoxification of natural compounds over the past 50 million years of evolution may inform studies of insecticide resistances that have evolved over the past 60 years. The GST multigene family is one of several repeatedly implicated in insecticide resistance and we focus exclusively on it here to integrate molecular evolutionary analyses with the excellent structural models available for insect GSTs.

GST genes have been the subject of analysis with each new insect genome sequenced largely because they are candidates of insecticide resistance genes (Ranson et al. 2001; Tu and Akgul 2005; Claudianos et al. 2006). GSTs act by conjugating the thiol group from glutathione (GSH; γ-glutamyl-cysteinyl-glycine) to compounds that possess an electrophilic center. In doing this, they can eliminate toxins from a cell by rendering them more water soluble or by targeting them to specific GSH multidrug transporters. GSTs are reported to bio-transform organochlorine (Clark and Shamaan 1984; Tang and Tu 1994) and organophosphorous insecticides (Lewis and Sawicki 1971; Oppenoorth et al. 1979; Huang et al. 1998), and they confer resistance to pyrethroid insecticides by reducing the oxidative damage that the insecticides cause to lipids (Vontas et al. 2001). Resistant strains of various species have been shown to have either increased expression (Grant and Hammock 1992; Syvanen et al. 1994) or increased GST activity (Fournier et al. 1992). In Anopheles gambiae, an ϵ GST has colocalized with resistance on a genetic map (Ranson et al. 2000).

Comparative analysis of the D. melanogaster and An. gambiae genomes revealed 36 and 28 cytosolic GSTs, respectively, and these have been classified into six classes (δ, ϵ, σ, θ, ω, and ζ; Ranson et al. 2001). Of particular interest to researchers are the δ and ϵ classes because they are insect specific, exist in some of the largest gene clusters in insect genomes, and, to date, are the only GSTs classes that have been implicated in insecticide resistance (Tang and Tu 1994; Ranson et al. 2001; Ding et al. 2003). The molecular complexity of dipteran GSTs may extend beyond the size of the gene clusters, as there is a suggestion that diversity may be generated within individuals by alternative splicing and within populations by gene fusions (Zhou and Syvanen 1997) and possibly by gene amplifications (Vontas et al. 2002). A striking feature of the phylogenetic analyses (Ranson et al. 2002) is that, while there is a conservation of GST classes between D. melanogaster and An. gambiae, genes within a class are often more closely related to genes within the same species. Thus there are few cases where true orthologous relationships are observed. Instead, Ranson et al. (2002) describes the situation as “orthologous sets of paralogous genes.” Thus multiple gene duplication and/or intergene recombination events have happened since the divergence of the D. melanogaster and An. gambiae lineages.

Here we analyze the GST genes of the 12 Drosophila species with sequenced genomes to obtain a clear understanding of the molecular evolutionary events affecting these genes. We parse all the cytosolic GST genes present in the Drosophila genomes into those that show the hallmarks of adaptive evolution from those that show patterns of divergence more consistent with purifying selection. We reason that those that evolve adaptively are more likely to have roles in detoxifying environmental compounds and are more likely to be preadapted to insecticide detoxification. Finally, we test a hypothesis that a candidate adaptive substitution is associated with insecticide resistance by sequencing alleles that predate the use of insecticides and interpret the substitution in a structural context by docking a known substrate on a homology-based protein structure model.


D. melanogaster GSTs as reference set:

Glutathione S-transferases encoded in the D. melanogaster genome are divided into two nonhomologous families: the microsomal GSTs for which there are three genes (Toba and Aigaki 2000) and the canonical GSTs that are cytosolic (36 genes). We limit our study here to the cytosolic GSTs so that we could combine molecular evolutionary analyses with protein structure analyses using known structures of cytosolic GSTs. Ranson et al. (2001, 2002) previously identified 36 D. melanogaster cytosolic GSTs and classified them into six classes (δ, ϵ, σ, θ, ω, and ζ). Strictly speaking, only a subset of the 36 has been biochemically tested for GST activity (specifically, GSTD1, GSTD2, GSTD3, GSTD7, GSTD9, GSTD10, GSTE1, GSTS1, CG6673, CG6776, CG6662, and CG6781; Sawicki et al. 2003; Kim et al. 2006). However, homology with other GSTs suggests that the others may have this activity. As is common in the literature, we will refer to these as GSTs even though a more accurate description would be proteins homologous to known insect GSTs (Ranson et al. 2001). To ensure that we had a complete set of GST sequences from D. melanogaster, we performed a motif profile-based search. Members of known D. melanogaster GST δ, ϵ, σ, θ, ω, and ζ were used as training sequences for MEME (Bailey and Elkan 1994) to build a position-specific probability matrix. Not surprisingly, the highest scoring matrix is part of the SNAIL/TRAIL motif that is thought to function as the glutathione-binding motif (Koonin et al. 1994). This highest scoring matrix was then used as input for MAST (Bailey and Gribskov 1998) to search the FlyBase release 4.3 D. melanogaster translated sequences for matching sequences. An arbitrary e-value of 0.1 was assigned as the cutoff score and all sequences that passed this criterion were gathered. The training and searching processes were repeated using newly gathered sequences as inputs until no new sequences were found. A total of 38 cytosolic GST genes were found and we refer to these GSTs as our reference set of GSTs. This reference set differs from the set reported in Ranson et al. (2001, 2002) and in Claudianos et al. (2006) in the following ways:

  1. We included CG4623, whereas Ranson et al. (2001) excluded it, because it was too long and too diverged from other GSTs. The reason for its inclusion was because multiple alignment of this gene with other D. melanogaster GSTs showed that it has the conserved serine in the putative SNAIL/TRAIL motif, which was suggestive of GST catalytic activity. It is also classified as a member of the GST multigene family by the INTERPRO database.

  2. We kept the two δ cluster pseudogenes discarded by Ranson et al. (2001) so that we could identify potentially functional orthologs in other Drosophila species.

  3. Our gene list counts CG6673 only once, while acknowledging that two proteins appear to be produced from this locus by splicing of alternate exons.

Manual annotation:

Each of 38 D. melanogaster reference GSTs was used as input for tblastn searches against the other 11 Drosophila species assemblies (Drosophila 12 Genomes Consortium 2007). To ensure that we identified all GST genes, we used a low stringency e-value of 0.01 as a cutoff. This e-value was low enough to detect orthologs, paralogs within the same class, and some paralogs in different classes. However, it was not high enough for a δ GST to identify a σ GST and vice versa. The coordinates of each tblastn hit were parsed into the corresponding Drosophila species scaffolds using our customized perl scripts, which utilized some Bioperl modules (Stajich et al. 2002). The tblastn hits were merged if their coordinates overlapped and were used as inputs for blastx against translated sequences from D. melanogaster (FlyBase release 4.3). If the top blastx hit was not a GST of our reference set, then the sequence was discarded. The remaining sequences were manually annotated using Artemis v6 (Rutherford et al. 2000).

We identified several potential pseudogenes by their disablement features such as premature stop codons, frameshifts, and gene truncations (relative to the canonical GST gene). There were also sequences with low-quality sequence trace files and/or gaps in the genome sequence assembly and these we termed “partial.” A second round of annotation was performed after examining the information on gene orders, multiple alignments, and neighbor-joining distance trees. All annotated genes were checked for consistency of gene length, intron/exon structure, suspicious insertion/deletion, tree branch length, and, if necessary, the actual sequence trace files. A total of 418 full-length GST genes and 24 pseudogenes from the 12 Drosophila species were found (our annotations are available at∼charlesrobin/ROBI_MAN/ALL_GST).

Phylogenetic tree:

Deduced amino acids were aligned using ClustalW, SEQBOOT, PROTDIST, NEIGHBOR, and CONSENSE from the PHYLIP v3.66 package were used to compute a bootstrapped neighbor-joining tree containing 407 GST sequences (Felsenstein 2005). The set of 11 orthologous CG10065 proteins were not included in this tree because the GST domain is only part of the polypeptide encoded by this gene. An independent 100-replicate bootstrapped tree was done for this set. Nodes with <75% bootstrap support were collapsed and the remaining clades formed the basis of assignments of orthologous relationships and of the assignment to gene sets for PAML analysis and for the dating of gene duplication events relative to the species phylogeny.

PAML: models for heterogeneous selective constraint on codons:

Twenty-two sets of GST orthologs were analyzed with CODEML of PAML v3.14 (supplemental Table 1 at http:www.genetics.org/supplemental/) (Yang and Nielsen 2002). CODEML uses a maximum-likelihood method introduced by Goldman and Yang (1994), which accounts for multiple hits and differentially weights evolutionary changes between different codons. The amino acid sequences within the sets were aligned using ClustalW and then the nucleotide coordinates were mapped to the corresponding amino acid alignment using the program MRTRANS (Pearson 1990). Following this, all alignments were manually inspected prior to CODEML. The tree topology supplied for CODEML followed the species tree in Figure 5. For all CODEML analyses, we used the F3x4 codon model of Yang and Nielsen (2000), which calculates codon frequencies from the nucleotide frequencies at the three codon positions. The key parameter that CODEML estimates is the ratio of rates of nonsynonymous to synonymous substitution (ω) and these can be estimated for each codon in a multiple alignment. If ω > 1, then positive (or, to be specific, diversifying) selection has favored amino acid substitutions. If ω < 1, then negative (or purifying selection) has prevented amino acid substitution, and if ω = 1, then the sequence is evolving as if it is neutral. To detect site-specific positive selection, we ran model M0, M1a, M2a, M3 (K = 3), M7, and M8 for each of the 22 GST data sets. Model 1a assumes that codons fall into two types, one where 0 < ω0 < 1 and the other where ω1 = 1. This is consistent with Kimura's neutral theory of evolution where some sites can evolve under selective constraint and others are free to change as mutations arise in them. Model 1a is thus referred to as a neutral model. Model 2a is an extension of model 1a that allows a third proportion of sites with ω2 > 1 to be estimated. This is referred to as a positive selection model. Model 3 uses a general discrete distribution with three site classes having proportions p0, p1, and p2 with the corresponding ω0, ω1, and ω2, respectively. Model 7 assumes a β distribution for 10 site classes of ω between 0 and 1, whereas model 8 adds a class of sites that has ω >1 when compared to model 7. To avoid being trapped at local optima, three different initial ω-values of 0.5, 1.1, and 2.0 were used in the estimation of the log likelihood for model 7 and model 8. The highest value among the three runs was used in significance calculations. The significance calculations used the likelihood ratio test (LRT) of Yang and Nielsen (2002), in which the test statistic was two times the difference of the log likelihood of a positive selection model and the log likelihood of the neutral model. The specific comparisons were models M2a vs. M1a and M8 vs. M7. The range of the total tree length, S, of model M0 outputs for all GST sets studied should allow relatively high power and accuracy in detection of positive selection (1.8 < S < 8.1; Anisimova et al. 2002). We used a Bonferroni correction for multiple tests and an empirical Bayes calculation to assess confidence in the positively selected sites (posterior Bayesian probability >95%).

Polymorphism analysis:

Sequences of GstD1 from the y; cn bw sp strain of D. melanogaster and two D. simulans strains (New Caledonia, wcl29f11.g1; w501, uya58f03.b1) were obtained from FlyBase and the NCBI trace archives, respectively. We also sequenced one GstD1 allele from each of 36 D. melanogaster lines (see Drosophila strains below) using direct sequencing and/or sequencing from plasmids. The cloning strategy was used if any of the direct sequencing of PCR products displayed heterozygosity, in which case we randomly chose one plasmid to sequence. The cloning system used was pGEM T-Easy (Promega, Madison, WI). Sequencing of cloned PCR products of GstD1 from six Australian lines from D. simulans was also done. PCR reactions were performed in 50 μl containing 10 mm Tris–HCl (pH 8.3), 1.5 mm MgCl2, 50 mm KCl, 200 μm of each dNTP, 1 μm of each primer, and 5 units of Taq polymerase (Promega). The primers dmel5′GstD1f: 5′-AAGAAGTTGGAGATTTGTTCACTC-3′ and dmel3′GstD1r: 5′-CTTCTTGAACTCCAGGCATC-3′ were used to amplify D. melanogaster genomic DNA, and the primers dsGstD1f: 5′-CTCCTTCAGTCATATGGCTGACTTCTACTAC-3′ and dsGstD1r: 5′-AAAACGTGAATTCCAGGCTTATTC-3′ were used to amplify D. simulans genomic DNA. The PCR temperature cycling conditions were 1 cycle of 95° for 2 min, 35 cycles of 95° for 30 sec, 55° for 30 sec, and 72° for 30 sec. The program DNAsp, version 4, was used to calculate θ, Tajima's D, and the McDonald and Kreitman test (Rozas and Rozas 1999).

Drosophila strains:

GstD1 was sequenced from the following D. melanogaster lines: 11 Australian lines (5 were provided by A. Hoffmann), 6 were from Drosophila Genomic Resource Center (DGRC) (103414, 103415, 103416, 103417, 103418, 103407), 8 United States lines [7 inbred lines from 2003 in Raleigh, NC, were provided by C. H. Langley, 1 from Bloomington (stock 5)], 5 were African lines (Malawian lines from C. H. Langley), 2 Papua New Guinean lines from Ehime (E-10043, E10044), 4 Japanese lines [3 from Ehime (E-10005, E-10010, E-10012) and 1 from DGRC (103408)], 1 Kazakhstan line from DGRC (107659), 1 Swedish line from DGRC (107660), 1 Ukrainian line from Bloomington (4266), 1 Columbian line from Bloomington (3843), and 1 Spanish line from Bloomington (3844). The lines from Kazakhstan, Sweden, Ukraine, and 1 of the United States lines were collected before 1940 and hence represent pre-DDT lines. GstD1 was also sequenced from 6 D.simulans lines collected from the east coast of Australia and were provided by A. Hoffmann.

Homology modeling of GSTD1 and molecular docking:

Individual domains of D. melanogaster GSTD1 and GSTD2 were modeled on the crystal structure of a GST from Lucilia cuprina (Wilce et al. 1994) using COMPOSER as contained in Sybyl7.2 (Tripos). The amino acid sequence identity of D. melanogaster GSTD1 and GSTD2 when compared to the L. cuprina θ GST is 83 and 68%, respectively. The individual models of the N- and C-terminal domains were then assembled into a domain pair and the GST dimer was constructed in Swiss-PDB viewer followed by energy minimization. Verify3D was used to evaluate the quality of the models (Eisenberg et al. 1997).

AutoDock 3.0.5 was used to explore the binding of DDT to the GSTD1 model (Morris et al. 1996). The docking region was centered at the binding site of glutathione. The Lamarckian genetic algorithm was used to produce 100 conformations, which were clustered on a root-mean-squared deviation of 1.8 Å2 and the docked conformations were inspected with VMD1.8.2. The docking simulations were carried out in the presence of glutathione as well as without the glutathione. Another docking simulation was carried out centering at the positively selected site Lys171.


Patterns of GST gene gain and loss:

Our search of the genomic sequences of 12 Drosophila species identified 429 genes and 24 pseudogenes in the GST multigene family. Included in the “active” genes are 11 genes for which only partial sequence is available (because of sequence gaps), and in the “pseudogene” category there are 8 genes with only single inactivating mutations and these may be only null alleles. All six classes of GSTs (δ, ϵ, σ, θ, ω, and ζ) are represented throughout the radiation of the 12 Drosophila species (Figure 1). The total number of GSTs within a species range from 29 to 45 with most of the variation attributable to the two large δ and ϵ gene clusters (Figures 1 and 2). Apart from the variation in these large clusters, which are described in more detail below, there are three other copy number variations observed:

  • i.The Drosophila subgenus species (D. virilis, D. mojavensis, and D. grimshawi) have three rather than four ω GSTs,

  • ii.D.grimshawi has two co-orthologs of CG9362 (a ζ GST). The level of divergence between these genes (∼40% at silent sites) indicates that they have not arisen from an assembly artifact and, since they are adjacent, they appear to have arisen by unequal crossover.

  • iii.The copy number of genes with high sequence similarity to D. melanogaster CG30000 and CG30005 differed across the phylogeny as shall be detailed further below.

To determine orthologous/paralogous relationships, we constructed a neighbor-joining tree based on the amino acid sequences (supplemental Figure 1 at http://www.genetics.org/supplemental/). To be conservative with our assignments, we considered microsynteny conservation and relative position and orientation of the genes within clusters and considered only nodes with >75% bootstrap support. This left us with several types of phylogenetic relationships. There were clades that clearly identified sets of orthologs, and some of these contained a single gene from all 12 species (e.g., GstE10). The genes in these clades were named according to the D. melanogaster gene. Other “orthologous clades” did not have a representative from all species (e.g., GstD4) and, if these were without a D. melanogaster gene, they were given a new name as shown in Figure 2 (e.g., GstE11). There were 20 genes, including 7 in the D. willistoni δ GST cluster, that were not grouped into any clade (these have an “x” in their names, e.g., DwilDx1). There were also clades that contain paralogs and these genes were named to match the orthologous D. melanogaster gene and given an added letter if required (e.g., GstD12 and GSTD12a are paralogs).

Figure 1.—

The total number of GST genes and pseudogenes in the 12 Drosophila species. For each species, the number of GSTs genes (left) and pseudogenes (right) are shown with the different classes represented by different shading (see key). The tally of pseudogenes includes only those identified with our tblastn filtering criteria and does not account for gene losses inferred from the phylogenetic reconstruction in Figure 3.

Figure 2.—

The δ and ϵ GST clusters of the 12 Drosophila species are shown in A and B, respectively. Full-length GSTs are shown as arrowed boxes. The direction of the arrows represents the direction of transcription. The boxes are colored to indicate orthologs. Paralogous clades supported by bootstrap score ≥75 that lack orthologs are also colored (e.g., GstE15, GstE15a, GstE15b). The white boxes represent genes that do not occur in clades with a bootstrap score ≥75. The “ψ” and circled P signs below each box indicate GST pseudogenes and partial sequences, respectively. The numbers above the vertical bars represent the number of bases excluded to keep the rest of the figure to scale. The “| |” symbol represents sequence gaps in contigs.

Figure 3 illustrates a reconstruction of gene duplication and loss over the Drosophila radiation. This was constructed using a “tree reconciliation approach” in which the nodes with >75% on the gene tree were placed on the best-supported species tree (Pollard et al. 2006) following the principle of parsimony (Nam and Nei 2005). The analysis identifies 22 of the 38 GSTs in our original reference set as having orthologs in both the Drosophila and Sophophora subgenera, and these therefore existed in the ancestor of the 12 species studied. It also identifies 24 gene duplication events. Of those 24, 19 have occurred in either the δ or ϵ gene clusters, 18 (of the 24) fall on the terminal branches, and another 2 occur on the long branch shared by the sibling species D. persimilis and D. pseudoobscura. By considering only the nodes with >75% bootstrap support, many genes would not be accounted for and so Figure 3 also indicates the youngest point at which the duplication may have occurred. Many, if not most, of these genes may have occurred in the ancestral Drosophila; it is just that the phylogenetic signal is not strong enough to assert this, or there has been gene loss. A corollary is that some true orthologs may not be identified and may be given different names (e.g., GSTD2 and GSTD11 could be orthologs). In such cases, the lack of bootstrap support may be a result of homoplasies introduced by high levels of amino acid divergence or may reflect homoplasy caused by intergene recombination.

Figure 3.—

GST gene gain and loss is represented on the species tree. Gene duplications that have 75% or greater bootstrap support are shown in boldface type above the branches of the species tree. Inferred gene losses are denoted with “ψ” or, if in a terminal branch of the species tree, are listed next to the species name. To account for all genes, duplications not supported by >75% bootstrap support are indicated in italics at the youngest possible branch where those duplications could have occurred. The number of δ and ϵ genes that are not found in clades with >75% supports are listed under “<75% bootstrap.” “Dψ” refers to the genes that have duplicated and become pseudogenes within single lineage. The prefix “GST” has been removed from GST δ and ϵ genes for clarity.

The reconciliations of the gene tree with the species trees indicate that there have been at least four duplications of CG30000/CG30005 within the Drosophila lineages studied here. The most ancient of these led to CG30000 and CG30005 and this most likely happened after the Drosophila subgenus separated from the other lineages (although a loss in the Drosophila subgenus is possible). The gene trees suggest a much more recent duplication of CG30000/30005 in the D. willistoni lineage and a duplication resulting in a third gene in D. ananassae. We also detected two CG30000-like sequences in the obscura group species, which are separated by ∼7 kb and oriented in opposing directions. While repeats and gaps in the sequence coverage make the genome assemblies suspect, these sequences do not have long open reading frames, suggesting that they are pseudogenes. One of the fragments is missing an intron that is observed in functional paralogs, indicating that it may have arisen by retro-transposition.

Rates of GST ortholog evolution:

An inspection of amino acid alignments and phylogenetic trees reveals that GSTs evolved at different rates across the Drosophila radiation. This is unsurprising as GSTs from different classes have different functions, expression levels, and sites of expression and are therefore expected to exhibit different levels of selective constraint. To quantify the variation in selective constraint, maximum-likelihood estimates of ω, the dN/dS ratio, were used as implemented in PAML (Yang and Nielsen 2002). This standardizes the amount of amino acid changes observed relative to the synonymous site divergence and was performed on the 22 orthologous gene sets that contained sequences from both subgenera. LRTs indicated that a model allowing three different ω per orthologous gene set (model 3 with K = 3) is significantly better than model 0, which allowed only a single ω per orthologous gene set (P < 0.05). Figure 4 represents the three most likely ω-values estimated for each orthologous set, and the relative proportions of sites with those values. This enables the GSTs to be partitioned by evolution rates with some relatively uncharacterized genes such as CG1681, CG1702, and CG16936 among the slowest evolving, and others such as CG4623, CG4688, and CG11784 among the quickly evolving.

Figure 4.—

Outputs of model 3 of PAML are represented graphically, allowing comparison of selective constraint among GST genes. Model 3 uses a maximum-likelihood method to assign three ω-values to each gene set where ω0, ω1, and ω2 have the proportions p0, p1, and p2, respectively (we represent the proportions as percentages along the x-axis). While each gene set has only three ω-values, we have divided the ω-values observed across the whole 22 gene sets into four different ranges: highly constrained sites (0 ≤ ω ≤ 0.1), moderately constrained sites (0.1 ≤ ω ≤ 0.5), relaxed sites (0.5 ≤ ω ≤ 1.0), and positively selected sites (ω > 1). Genes with many sites evolving slowly have a high proportion of sites in the 0 ≤ ω ≤ 0.1 range and are mostly solid in the figure. These values were calculated for genes inferred to be present in the ancestral Drosophila (see Figure 3) and were calculated with all available orthologs. The CG10065 here refers only to the GST domain of this chimeric gene.

Evidence for adaptive evolution:

One of the hallmarks of adaptive evolution is a greater rate of substitution at nonsynonymous sites than at synonymous sites. With sets of closely related species it is possible to identify specific codons that exhibit an increased rate of nonsynonymous substitution. Figure 4 shows that GSTD1 and CG6776 have some sites with ω > 1; however, to formally test whether adaptive models fit the patterns of sequence divergence better than neutral models, we used two standard PAML comparisons (model M2a vs. M1a and M8 vs. M7; Yang and Nielsen 2002; Lynn et al. 2005; Balakirev et al. 2006). Of 22 sets of GSTs subjected to site-specific CODEML analyses, only GstD1 was found significant after Bonferroni correction for multiple tests (P-value = 0.0014, which is <0.05/22; Table 1). The amino acid state of the positively selected site in GSTD1 across the 12 species is shown in Figure 5. This shows that a glycine → lysine substitution at site 171 must have occurred at least three times in the Drosophila radiation. This requires at least two sites in the codon to change (GGN → AAR) and is a biochemically radical change.

Figure 5.—

The state of the amino acid residue 171 of GSTD1 in the 12 Drosophila species. This site was estimated to be positively selected using empirical Bayes with probability >95% under model M8. The species tree is adapted from FlyBase.

View this table:

Maximum-likelihood parameter estimates for GstD1 and their significance

Could DDT be the selective agent driving the G171K substitution in GSTD1?

Tang and Tu (1994) have previously shown that D. melanogaster GSTD1 possesses DDTase activity and so it has been implicated in DDT metabolism. This activity and the strong pattern of parallel substitution that we observe in the Drosophila radiation brings to mind multiple instances where the strong selective pressure imposed by insecticides has resulted in parallel molecular evolution. These include examples of parallel amino acid changes at target molecules of insecticides (e.g., Kdr and Rdl; Thompson et al. 1993; Miyazaki et al. 1996) and in detoxification proteins (e.g., α-esterase-7; Claudianos et al. 1999) as well as regulatory mutations (e.g., mosquito Gste2 and Drosophila Cyp6g1; Schlenke and Begun 2004; Lumjuan et al. 2005).

To test whether the G171K replacement is associated with recent adaptive evolution to DDT, we surveyed polymorphism in 36 D. melanogaster and 8 D. simulans alleles. From the sequence data all D. melanogaster have lysine at residue 171 of GSTD1 and all D. simulans lines have a glycine. This shows that K171 in D. melanogaster is likely to be fixed in worldwide populations. Among these alleles, 4 were obtained from lines collected before the use of DDT and so it is unlikely that the G171K change occurred in response to DDT selection. Furthermore, the four pre-DDT alleles have the same amino acid sequence as the GstD1 allele found to have DDTase activity, suggesting that the DDTase activity predates DDT.

In theory, the recent selective fixation of a beneficial mutation is expected to produce patterns of reduced nucleotide diversity (Smith and Haigh 1974) and an excess of rare alleles as subsequent mutations occur (Braverman et al. 1995). As more time elapses after a selective sweep, a surplus of high-frequency-derived alleles occurs at sites near the target of selection (Fay and Wu 2000). The nucleotide diversity (θ) for our global collection of 36 D. melanogaster alleles was 0.0035 whereas it was 0.0072 for the 8 D. simulans alleles (Table 2). This difference in θ (= 4Neμ) could, at least partially, be attributed to differences in effective population size of the two species (Kliman et al. 2000). Tajima's (1989) D test does not detect an enrichment of rare alleles among the D. melanogaster GstD1 alleles (P > 0.10). Furthermore, the Mcdonald and Kreitman (1991) G-test, which compares the ratio of replacement to silent-site polymorphisms within species with the ratio of replacement to silent-site divergence between species shows no significant deviation from neutral expectations (G-test value with Williams' correction = 2.34, P-value = 0.13). These analyses support the interpretation made from the four “historical” D. melanogaster lines—that positive selection for the G171K site has not happened over the past 60 years in the D. melanogaster lineage.

View this table:

Polymorphism statistics for GstD1 in D. melanogaster and D. simulans populations

Homology modeling and molecular docking of DDT to GSTD1:

To surmise the potential structural and biochemical role of the G171K replacement in GSTD1, homology models of GSTD1 and GSTD2 were constructed using the crystal structure of an orthologous protein from the sheep blowfly, L. cuprina (Figure 6A; Wilce et al. 1994). This indicates that the G171K replacement occurs at a site remote from the typical GST active site. In silico docking studies of a known GSTD1 substrate, the insecticide DDT, onto this homology model suggest that this typical active site is used in GSTD1. Visual inspection shows that DDT docks best in the presence of glutathione with the trichloro group of DDT orientated toward the sulfur atom on glutathione. This conformation agrees with the GSH-dependent dehydrochlorinase activity proposed by Matsumura (1985). DDT docks less well into the active site of GSTD2, which does not have DDTase activity in vitro (Tang and Tu 1994).

Figure 6.—

Molecular docking of DDT and glutathione to the homology modeled D. melanogaster GSTD1 monomer showing the best-docked conformations when site of docking centered on (A) the GSH-binding site and (B) the positively selected residue K171.

Despite the G171K site being distant from the active site, we observed a crevice in the homology model adjacent to site 171 in which we were able to dock DDT. This raises the possibility that the crevice may act as a secondary binding site that may have an allosteric effect on the active site. Interestingly, this crevice is absent in the L. cuprina structural model because of a phenylalanine rather than an isoleucine at position 174, suggesting that a substitution here, in an ancestor of Drosophila, may be a prerequisite for the G-K substitutions.


Gene duplication and loss:

Recent comparative analyses of insect genomes have revealed stark differences in the number of GST genes. There are 38 and 28 cytosolic GST genes identified in the D. melanogaster and An. gambiae genomes, respectively, while only 8 have been found in the Apis mellifera genome (Claudianos et al. 2006). Hahn et al. (2005) cautioned against making adaptive interpretations of such differences because variation may simply be attributed to stochastic changes in gene number over long periods of evolutionary time. On the other hand, genes that lack a function are rapidly lost from Drosophila genomes (Petrov et al. 1996) and so it may be presumed that genes that have been maintained over millions of years would have some function. Such functions may make a GST distinct from all others encoded in the genome or may contribute to a threshold of general GST activity. Duplicate GST genes may have arisen as adaptations to new environments or could have nonadaptive origins and have been maintained through, for example, complementary degenerative mutations accumulating in duplicate copies (Force et al. 1999).

The molecular evolution of GSTs over the 50 million years represented in the divergence of the 12 sequenced Drosophila genomes allows us to make some inferences about the degree of redundancy in GST genes by their flux in gene numbers over evolutionary time and by the relative levels of selective constraint exhibited among orthologous comparisons. The species with the most GSTs, D. ananassae and D. willistoni, also seem to have a greater number of genes from other multigene families (R. T. Good and C. Robin, unpublished results) and also have larger genomes (http://rana.lbl.gov/drosophila/wiki/index.php/Assembly_Summaries). So rather than a GST-specific phenomenon, the relative abundance of GSTs may indicate lineage effects such as greater population sizes allowing duplications with smaller selective coefficients to become fixed and then be maintained. The species with the fewest GSTs are the three species of the Drosophila subgenus and this correlates with the ϵ and δ clusters being dispersed over multiple contigs. Rather than being an artifact of missing data, the smaller number of genes in these species may reflect a reduced rate of gene duplication in smaller gene clusters.

The presence of pseudogenes that are orthologs to functional genes is interesting because they can be considered “natural knockouts” and suggests that the orthologous genes are not vital. Four of the 10 δ GST genes in the D. melanogaster genome have orthologs with premature stop codons or frameshifts or are truncated in one or more members of the D. melanogaster subgroup (Dmel_GstD3, Dere_GstD4, Dmel_GstD5, Dsec_GstD6). A fifth δ GST gene, GstD7 from the P2 strain of D. melanogaster, was also originally annotated as a pseudogene because of a two-nucleotide insertion disrupting the open reading frame (Toung et al. 1993; accession no. M97702); however, this indel is not present in the y; cn bw sp strain of the D. melanogaster genome sequence (release 5 FlyBase). Three other δ genes (GstD2, GstD10, and GstD9) are pseudogenes in more distant species (D. ananassae, D.virilis, and D. mojavensis, respectively; Figure 2). Similarly, two D. melanogaster ϵ GST genes (GstE1 and GstE6) have orthologs with inactivating mutations and D. melanogaster appears to have lost GstE11 and GstE3a. Although we refer to these as pseudogenes, many have only one detectable inactivating mutation and these may be polymorphic, so that they may be better described as null alleles. Petrov et al. (1996) estimated that it would take 11 million years to loose half a pseudogene through deletions. If we consider only the melanogaster group, which is thought to have diverged within the last 10 million years, there are 12 putative “pseudogenes,” but 7 of these have only one inactivating mutation and may actually be null alleles. This age spectrum of nonfunctional copies is consistent with the idea that GSTs may exhibit a high load of null alleles because their detoxification functions are required only sporadically.

Difference in selective constraint:

At the other extreme are the genes that have weathered the tests of evolutionary time. This includes genes within the large δ and ϵ gene clusters, which would have been represented by at least four δ and three ϵ GST genes in the ancestral Drosophila. The conserved genes in these large clusters occur at the edge of the gene clusters in extant species, complementing the observation that the pseudogenes tend to be located in the centers of the clusters. Four of the ancestral GSTs (GstD1, GstD9, GstE9, and GstE4) are among the slowest evolving GSTs, suggesting that these either have their own unique functions or have spatial/temporal expression patterns that impose significantly greater selective constraint on them than on the other genes in these clusters (Figure 4). Therefore, the patterns of evolution of these specific genes indicate that they are not redundant in natural ecological settings.

Among the GST genes exhibiting the greatest selective constraint is CG9363, which is one of two ζ GST genes in the Drosophila genome. The other ζ GST, CG9362, is evolving at a much faster rate. We therefore suggest that CG9363 is more likely to perform the maleylacetoacetate isomerase activity, highly conserved across eukaryotes, that functions in the catabolism of phenylalanine and tyrosine (Fernandez-Canon and Penalva 1998). CG9362 is likely to have a derived function, requiring less selective constraint. The duplication of these genes would have arisen before the divergence of Drosophila species but, on the basis of comparisons to the An. gambiae genome, probably after the split between higher and lower diptera.

Each of the ω GSTs is evolving relatively quickly perhaps because ω GSTs have several features that make them structurally distinct from other GSTs, including the fact that they bind glutathione through a cysteine residue at position 32 rather than through a tyrosine or a serine at about position nine (Board et al. 2000). The slowest evolving ω, CG6781, has recently been shown to be the structural gene of the sepia locus and so it has a role in the biosynthesis of red eye pigments (Kim et al. 2006). This adds 6-pyruvoyl tetrahydropterin to a diverse set of proposed substrates for ω GSTs, including proteins damaged by S-thiol adducts and dehydroascorbic acid (Ishikawa et al. 1998; Board et al. 2000). While the in vivo substrates for three of the four D. melanogaster ω GSTs are unknown, it is noteworthy that the A. mellifera and An. gambiae genomes have only a single ω GST, and phylogenetically this is most similar to CG6662, which has the lowest specific activities of all ω GSTs for the tested substrates (Kim et al. 2006).

Adaptive evolution:

One of the motivations of this research was to identify GSTs that are evolving adaptively as these may help us to identify the genes that are involved in detoxification. Our screen using site-specific models of PAML recovered only a single site in a single gene (GstD1). This test can give false positives (Suzuki and Nei 2004), particularly if the silent sites are saturated. This appears not to be the case in GstD1 because the Nei and Gojobori divergence at silent sites (Ds) is greatest in the comparisons spanning the longest evolutionary time (Sophophora vs. Drosophila subgenus) and the maximum Ds value is 0.8 (i.e., less than one substitution per site). Furthermore, the prior literature on GSTD1 makes it an excellent candidate for a gene undergoing positive selection. Three aspects of GSTD1 have caught our attention. First, GSTD1 has been shown to have DDTase activity (Tang and Tu 1994). In addition, the lab strain PSU-R, which has been selected from Oregon-R with increasing DDT concentrations over 2 years, showed approximately twofold higher expression of GSTD1 by Western blot analysis when compared to Oregon-R (Tang and Tu 1994). Second, GstD1 has been used as an example of a gene with high codon bias and our analysis of codon bias suggests that this is the most biased GST among all the Drosophila species (Bielawski and Yang 2005). And third a recent microarray study on D. mojavensis, comparing the transcriptional response to different cactus-derived food substrates, highlighted an ortholog of GstD1 as a noteworthy responder (Matzkin et al. 2006).

Here we found that the same radical amino acid substitution has occurred at the same site three times in the Drosophila radiation. GstD1 is one of the slowest evolving GST genes in the Drosophila genome and a lysine is not seen in the homologous site in any other Drosophila GST δ paralogs. All this suggests that this substitution is not neutral and has a functional effect. Homology modeling indicates that this site is on the surface, remote from the active site. However, the model also shows a crevice adjacent to site 171 that may be a secondary substrate-binding site and either may have an allosteric effect on active site catalysis or may act as a site of sequestration as is seen in other GSTs (Kostaropoulos et al. 2001). Another possibility, given that this site is positioned away from the dimerization site, is that it could affect protein/protein interactions. Yeast two-hybrid analyses suggest that GSTD1 interacts with the products of the Sala and CG10200 genes, neither of which are well characterized (Giot et al. 2003).

Parallel evolution is frequently seen in instances of insecticide resistance. In two genera of mosquitoes, Anopheles and Aedes, an ϵ GST is associated with DDT resistance (Ranson et al. 2004; Lumjuan et al. 2005). The enzymes have been shown to have DDTase activity in vitro, and the genes are constitutively upregulated in resistant strains. In An. gambiae, resistance maps to the ϵ GST gene cluster. Another example of parallel evolution is the detoxification enzyme, α-esterase 7, of flies where an amino acid substitution provides organophosphate resistance in several species, including L. cuprina and Musca domestica (Claudianos et al. 1999; Oakeshott et al. 2005). The substitution of an aspartic acid in place of a glycine in the catalytically important oxyanion hole converts the enzyme from an esterase to a phosphatase (Newcomb et al. 1997).

There are two levels of evidence to suggest that the G171K mutation in GSTD1 has not been a response to insecticide usage. First, the amount and distribution of nucleotide variation in current populations of D. melanogaster are inconsistent with a recent selective sweep at this locus. Second, stocks collected before the use of the first insecticide, DDT, have the derived lysine state at 171. We can be confident that these historical stocks are not contaminated because they contain rare susceptible alleles at the Cyp6g1 locus, a verified DDT resistance locus (Daborn et al. 2002). Nevertheless, the G171K substitution has arisen independently three times in GSTD1, an enzyme that is an excellent candidate for a detoxification gene. The four species with a lysine at position 171 are all found in the western United States and this raises the possibility that the parallel G171K substitutions may have been responses to a common selective agent.

One of the species with a lysine at position 171 of GSTD1 is D. mojavensis. Matzkin et al. (2006) have found that this gene has 1.4 times higher transcript abundance on flies reared on food containing organ pipe cactus relative to flies reared on food containing agria. These authors also cited an unpublished nucleotide sequence survey suggesting that GstD1 has been evolving adaptively in D. mojavensis. It will be interesting to know whether G171K exists as a polymorphism in D. mojavensis or as a site of divergence with its sibling species.

Although we can conclude from the above data that the G171K substitution has not evolved in response to DDT selection, the testable hypothesis that the G171K substitution may affect the kinetics of its DDTase activity (or other activities) still stands. In this context, we note that Ivarsson et al. (2003) used a PAML approach to identify a serine/threonine substitution that was potentially adaptive in mammalian μ class GSTs. Site-directed mutagenesis of the human ortholog at this site changed the specific activity of the enzyme toward a known substrate, trans-stilbene oxide, 1000-fold. The ease by which GSTs are expressed in heterologous systems, purified, and crystallized makes them excellent models for studying the evolutionary significance of amino acid substitutions to protein structure and function relationships and vice versa.


We thank Robert Trygve Good for helpful discussions and help in aspects of the bioinformatics and Lisa Bardsley for cross-checking our pseudogene analyses. We also thank the Drosophila community, in particular, the Assembly, Alignment, Annotation coordinating committee and the Washington University, Broad, Agencourt, and J. Craig Venter Sequencing Centres, for providing access to the genome sequences of the 12 species of Drosophila. This work is supported by the Australian Research Council through a Discovery Project (DP0557497) and the Centre for Environmental Stress and Adaptation Research. M.W.P. is an Association pour la Recherche Contre le Cancer Federation Fellow and National Health and Medical Research Council Honorary Fellow.


  • Communicating editor: D. Begun

  • Received May 10, 2007.
  • Accepted June 19, 2007.


View Abstract