Abstract
Approximately 4 kb of the Cecropin cluster region have been sequenced in nine lines of Drosophila melanogaster and one line each of the sibling species D. simulans, D. mauritiana, and D. sechellia. This region includes three functional genes (CecA1, CecA2, and CecB), which are involved in the insect immune response, and two pseudogenes (CecΨ1 and CecΨ2). The level of silent polymorphism in the three Cec genes is rather high (0.028), and there is no excess of nonsynonymous polymorphism. There is no evidence of gene conversion in the history of these genes. The interspecific comparison has revealed that in the three species of the simulans cluster the CecA2 gene is partially deleted and has therefore lost its function and become a pseudogene; in each of the species, subsequent deletions have accumulated. Divergence estimates indicate that the CecΨ1 and CecΨ2 pseudogenes are highly diverged, both between themselves and relative to the other three Cec genes. However, both CecΨ1 and CecΨ2 have conserved transcriptional signals and splice sites, and they present an open reading frame; also, correctly spliced transcripts have been detected for both CecΨ1 and CecΨ2. The data suggest that these genes are either active genes with some null alleles or young pseudogenes.
Multigene families are formed by genes that originated by gene duplication and have retained a certain degree of similarity. The different members are often arranged in a compact cluster although they might be more or less dispersed in the genome, mostly due to chromosomal rearrangements subsequent to the gene duplications. Members of a family can be functional or nonfunctional (pseudogenes). Functional members can be very similar as the copies might have retained the same function and be redundant. However, one of the copies may have acquired a new function and undergone a certain degree of differentiation, which would be best explained by the action of Darwinian selection (Ohta 1994). On the other hand, pseudogenes can accumulate substitutions due to the lack of functional constraints. Concerted evolution of the different copies of a gene, which is facilitated by their compact clustering, can restrict the functional differentiation as well as the loss of function of the copies (Walsh 1987). Conversely, members of a family where concerted evolution is weak or absent have a higher probability of becoming pseudogenes (Walsh 1995).
The Cecropin multigene family of Drosophila melanogaster is a family with both functional genes and pseudogenes. The functional genes of this family code for cecropins, which are antibacterial peptides involved in the insect humoral immune response (Kylsten et al. 1990; Tryselius et al. 1992). In Drosophila this response is mediated by at least another eight different kinds of peptides: defensin, attacin, diptericin, drosocin, metchnikowin, drosomycin, andropin, and lysozyme (Engström 1997; Hetru et al. 1997; Meister et al. 1997). Together with the cellular response, the humoral response constitutes the immune system in insects (Hultmark 1993).
In D. melanogaster the Cecropin region was cloned and sequenced by Kylsten et al. (1990) and by Tryselius et al. (1992). In an ~7-kb region (Figure 1) these authors detected four functional Cecropin genes (CecA1, CecA2, CecB and CecC) and two pseudogenes (CecΨ1 and CecΨ2). All functional genes are expressed upon bacterial infection, mainly in the fat body, although at different times during development: CecA1 and CecA2 are essentially expressed in larvae and adults while CecB and CecC are mainly expressed during the pupal stage (Hultmark 1993).
The immune system of invertebrates is more general than that of vertebrates. However, the immunoresponsiveness of the genes involved is mediated by similar cis-regulatory elements. In vertebrates the κB and GATA motifs are generally present (Bauerle and Henkel 1994; Simon 1995), while the GAAANN motif has only been found in some cases (Williams 1991). In invertebrates these motifs (κB-like, GAAANN and GATA) as well as the R1 motif are present in the proximal promoter region of different insect immune genes. However, they differ in number, order and orientation (Engström 1997). In particular, the two CecA genes of Drosophila present κB-like, GATA and R1 motifs; the CecB gene presents both the κB-like and R1 motifs, and the CecC gene presents the κB-like, GATA, and GAAANN motifs.
The humoral immune response of insects is a rather general response; cecropins act against both Gram-positive and Gram-negative bacteria. The mature cecropin peptides aggregate on the membranes of the infecting bacteria once a threshold concentration has been reached; this aggregation causes the disruption of the membrane and the death of the bacteria (Shai 1997). This general immune response in insects stands in contrast to the highly specific response in vertebrates. Also, the survey in D. melanogaster of nucleotide variation in some immune system genes, including the Cec genes, has not revealed a high level of nonsynonymous variation as compared to synonymous variation (Clark and Wang 1997). This finding would seem at odds with the excess of nonsynonymous variation detected in some vertebrate immune system genes (Hughes and Nei 1988). However, vertebrates present both an innate and an adaptive immune system (Meister et al. 1997), while invertebrates only present the innate immune system. The genes studied in Drosophila (Clark and Wang 1997) belong to this innate system, while those studied in vertebrates are part of their adaptive immune system, which could account for the observed discrepancy.
The presence of both functional genes and pseudogenes in the Cecropin multigene family offers the opportunity to contrast their evolution. We have analyzed the nucleotide sequence variation of an ~4-kb region that includes three functional Cec genes (CecA1, CecA2, and CecB) and two pseudogenes (CecΨ1 and CecΨ2) of the Cecropin cluster in nine lines of D. melanogaster and one line each of D. simulans, D. mauritiana, and D. sechellia.
MATERIALS AND METHODS
Fly stocks: Nine third chromosome isogenic lines of D. melanogaster originated from a sample collected in Montemayor (Córdoba, Spain) in 1990 were used in the present study. The procedure to obtain the isochromosomal lines was described in Cirera and Aguadé (1997). One line each of D. simulans (from Barcelona), D. mauritiana (from the Umeå Fly Stock Center), and D. sechellia were also included in this study.
DNA extraction, PCR amplification, and sequencing: Genomic DNA from the lines from Montemayor and from the D. sechellia line was CsCl purified (Bingham et al. 1981). DNA from D. simulans and D. mauritiana was extracted using protocol 48 in Ashburner (1989). On the basis of the sequence of Kylsten et al. (1990), three pairs of synthetic oligonucleotides were designed to PCR amplify three overlapping fragments covering the 4-kb region studied. Whenever necessary, new synthetic oligonucleotides were designed to amplify the homologous region in the sibling species.
Synthetic oligonucleotides spaced on average 250 bp apart were used to sequence the whole region. The PCR-amplified DNA was either made single stranded by λ-exonuclease treatment (Higuchi and Ochman 1989) and manually sequenced, or cycle sequenced (Perkin Elmer cycle sequencing kit; Norwalk, CT) and separated on an ABI 377 automated DNA sequencer. The sequences reported in this article have been deposited in the EMBL sequence database library under accession numbers Y16852–Y16863.
Expression studies: Total RNA was extracted from 80 mg of adult flies according to the guanidine-CsCl isopycnic method (Farrell 1993). Total cDNA was obtained from total RNA using reverse transcriptase and random hexanucleotides or oligo(dT)s. Specific pairs of synthetic oligonucleotides were designed and subsequently used to PCR amplify the corresponding cDNAs. PCR products were reamplified using internal primers. These products were subsequently cycle sequenced according to manufacturer's instructions (Perkin Elmer cycle sequencing kit) and separated on an ABI 377 automated DNA sequencer.
DNA sequence analysis: Sequences were assembled using the ESEE (Cabot and Beckenbach 1989) program; the approximately 4-kb-long sequences were multiply aligned using the CLUSTAL W program (Thompson et al. 1994). The alignments were slightly modified manually, and these optimized alignments were used for further analyses. The MacClade version 3.05 program (Maddison and Maddison 1992) was used for editing the sequences. The DOTPLOT program of the GCG package (Devereux et al. 1984) was used to establish regions of similarity within the studied fragment. The BEST-FIT and PILEUP programs of this package were later used to align the regions of similarity.
Most intra- and interspecific analyses were performed with the DnaSP version 2.52 program (Rozas and Rozas 1997); this program was also used to estimate divergence between genes and to perform different tests of neutrality. In comparisons between species or between genes involving more than one line in at least one of the groups compared, nucleotide divergence (K) was estimated as the average of the estimates of divergence for all combinations of lines between groups. For coding regions nucleotide polymorphism and divergence were estimated separately for synonymous and nonsynonymous sites (Nei and Gojobori 1986). Phylogenetic analyses were performed using the neighbor-joining algorithm (Saitou and Nei 1987) implemented in the MEGA version 1.01 (Kumar et al. 1994) program.
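The averaging of divergence over all between-group combinations of lines can be illustrated with a minimal Python sketch. It is not the DnaSP implementation: the function names are hypothetical, gapped sites are simply skipped, and the Jukes and Cantor correction is assumed here for illustration only.

```python
from itertools import product
from math import log


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected divergence; only defined for p < 0.75."""
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def mean_divergence(group_a, group_b):
    """Average corrected divergence over every between-group pair of lines."""
    ks = [jukes_cantor(p_distance(a, b)) for a, b in product(group_a, group_b)]
    return sum(ks) / len(ks)
```

With nine D. melanogaster lines and one line of a sibling species, such a function would average nine pairwise estimates.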
RESULTS
Identification of the duplicated regions: When Kylsten et al. (1990) first cloned and sequenced the Cecropin cluster in D. melanogaster, they identified five regions of similarity in the 4-kb region included in the present work. Three of these regions corresponded to three functional genes (CecA1, CecA2 and CecB), and the other two were considered pseudogenes (see below). As these regions probably arose by duplication, we first tried to establish the limits of the different repeats. In addition to the dotplot analysis, nucleotides putatively involved in functional signals (κB-like, GAAANN, GATA and R1 motifs; TATA box, capping site, initiation codon, splice sites, stop codon and polyadenylation signal) were identified in all five repeats. The two most extreme signals in each repeat were used to establish further upstream and downstream similarity. In the case of the Cec genes, the region of similarity among these genes extended in both directions from the most extreme signals (R1 motif and polyadenylation signal); in the case of CecΨ1 and CecΨ2, the region of similarity between them only extended downstream from the most extreme 3′ putative signal (stop codon). The regions of similarity in the Cecropin cluster (including the CecC gene) are shown in Figure 1.
Schematic representation of the Cecropin gene cluster in D. melanogaster (A) and of a generalized Cec gene duplicated region (B). The region studied includes three genes and two pseudogenes. In A shaded boxes represent the different regions of similarity; in each gene exons are represented by connected black boxes, and arrows indicate the direction of the reading frame. In B the different transcriptional signals are indicated by hatched boxes, and exons are shaded.
CecA2 has lost its function in the sibling species D. simulans, D. mauritiana and D. sechellia: Sequence comparison of the 4-kb region studied allowed us to identify a number of deletions in the CecA2 region of the sibling species D. simulans, D. mauritiana and D. sechellia. Table 1 and Figure 2 give a summary of the length and nucleotide changes detected in these species relative to the corresponding sequence of D. melanogaster. In all three species more than one event could have caused the loss of function of the CecA2 gene. In fact, in D. simulans we detected a large deletion that spanned most of the gene and included the initiation codon, and two changes in the stop codon. In D. mauritiana we detected a large deletion that spanned both the TATA box and the capping site, a deletion that spanned the polyadenylation signal, and two changes in the stop codon. Finally, D. sechellia presented one deletion affecting the initiation codon and another much larger deletion that spanned most of the gene and included the stop codon. Also there was one substitution in each of the κB and GATA promoter elements that was common to all three species. Given that practically all the coding region was deleted in D. simulans and in D. sechellia, we concluded that the CecA2 gene had been lost in both species. On the other hand, despite the multiple inactivating changes in D. mauritiana, the four additional deletions detected in its coding region were multiples of three nucleotides; there was in fact a much shorter open reading frame (ORF) that extended three codons past the ancestral stop codon (Figure 2). Specific oligonucleotides for this ORF were designed and used in an RT-PCR experiment using D. mauritiana total RNA and CecA1 as positive control; as expected from the absence of TATA box in this species, no transcript could be detected.
None of the deletions detected in the CecA2 region had common breakpoints in all three species; this would point to an independent loss of function of that gene in the three lineages after their split. However, the loss of function could be due to the changes in the stop codon that render it inactive. These changes were common to both D. simulans and D. mauritiana, and could have also been common to D. sechellia prior to the occurrence of the large deletion spanning the stop codon. Alternatively, the loss of function could also be due to the changes in the promoter elements. In either of these two cases, the CecA2 gene could have lost its function after the split of the melanogaster and simulans lineages but prior to the split of this latter lineage. Whatever the inactivation event was and whenever it occurred, additional deletions have accumulated in the gene in each of the three lineages. A large insertion (35 bp) was detected only in D. sechellia, in exon 1; in D. simulans a 1-bp insertion was detected in a short run of thymines just upstream of the TATA box.
Changes in the CecA2 region of the three species included in the simulans cluster
Distribution of length differences in the CecA2 region between D. melanogaster and each of D. simulans, D. mauritiana, and D. sechellia. For D. melanogaster the region is represented schematically as in Figure 1B. For the other species blanks indicate deletions relative to D. melanogaster, and triangles indicate insertions. The length (in base pairs) of insertions/deletions in the exons is indicated by the corresponding number.
CecΨ1 and CecΨ2 are transcribed: In the original characterization of the Cecropin region close to the Andropin (Anp) gene (Kylsten et al. 1990), sequence analysis revealed the presence of two pseudogenes (CecΨ1 and CecΨ2) in addition to the three cecropin coding genes (CecA1, CecA2, and CecB). The two regions that were considered pseudogenes showed some sequence similarity to the Cec functional genes; also, and despite the claimed absence of consensus splice sites, there was some trace of the two-exon structure of the functional genes. The authors considered these regions to be pseudogenes due to the absence of consensus splice sites and the absence of cDNA clones (they detected nine clones for the CecA1, seven for the CecA2, and one for the CecB transcripts), and also because, according to them, the Canton S strain sequenced showed deletions and multiple stop codons in the putative coding regions. Both CecΨ1 and CecΨ2, however, presented conserved TATA boxes and capping sites, which the authors considered to deserve further attention. Our sequencing study of the nine lines of D. melanogaster, and of one allele each of D. simulans, D. mauritiana and D. sechellia revealed that in addition to the TATA box and capping site, both CecΨ1 and CecΨ2 presented the conserved splice sites and the GATA promoter element, as well as either the κB-like or GAAANN motifs and a partial R1 element (Figure 3). Both CecΨ1 and CecΨ2 presented an ORF, and in the proteins resulting from the conceptual translation of these genes, a signal peptide could be identified with the PLOT.A/SIG program (Luttke and Markiewicz 1990), which is based on the von Heijne method (von Heijne 1986). As previously indicated (Kylsten et al. 1990), the sequences of the conceptual proteins presented approximately 50% amino acid similarity to that of the functional Cec genes (Figure 4).
Given that CecΨ1 and CecΨ2 showed characteristics that pointed to their functionality, we tried to ascertain whether they were transcribed and, if so, whether the conceptual splice sites were used in the processing of the primary transcript. Specific oligonucleotides were designed in exons 1 and 2 of each of CecΨ1 and CecΨ2. Their specificity was tested in PCR reactions using genomic DNA as substrate, and subsequent sequencing of the products obtained. PCR reactions from total cDNA of D. melanogaster were performed using these specific primers for CecΨ1 and CecΨ2. For CecΨ1 a weak band could be seen on an agarose gel; internal primers, also located in exons 1 and 2, were used for its reamplification and sequencing. For CecΨ2, two PCR rounds were necessary to visualize the products (more than one) on an agarose gel; the product of the expected size was reamplified and sequenced using internal primers also located in exons 1 and 2. The result of the two sequencing reactions showed that genes CecΨ1 and CecΨ2 are transcribed in adult flies and, in both cases, the mature transcript results from the correct splicing of the predicted introns.
Our sequencing study also revealed, however, some characteristics pointing to these genes being pseudogenes. In D. melanogaster both CecΨ1 and CecΨ2 presented a loss-of-function change: in one line CecΨ1 presented a 5-bp insertion in exon 2, and in another two (the two identical) lines CecΨ2 presented a 1-bp deletion in exon 1; these length changes cause a change in the reading frame. Also our reanalysis of the Canton S sequence only revealed a deletion that spanned the initiation codon in the CecΨ1 gene, but no other loss-of-function changes (e.g., additional stop codons, lack of splice sites) were found in either CecΨ1 or CecΨ2. Moreover, based on sequence analysis both CecΨ1 and CecΨ2 would be nonfunctional in D. mauritiana: (i) for CecΨ1, the initiation codon had changed to ATT, and there was also a deletion spanning the stop codon; (ii) for CecΨ2 there was a 5-bp deletion at the beginning of exon 1, which generated a nearby stop codon, and an additional stop codon in exon 2.
Sequence comparison of the transcriptional signals in the Cec and Cec-related genes of D. melanogaster. Dots indicate the same nucleotide as the first sequence. For signals present in the opposite strand the sequence has been underlined. Sites polymorphic in our sample are indicated in lowercase and the most common nucleotide is shown.
Levels and pattern of polymorphism and divergence at the Cecropin gene cluster: Table 2 and Figure 5 give a summary of the distribution of polymorphism at the ~4-kb region studied. A total of 262 nucleotide and 39 length polymorphisms were detected in this 4031-bp region. Nucleotide variation was estimated both as the average number of pairwise differences per nucleotide or nucleotide diversity (π = 0.029), and as the Watterson estimate (Watterson 1975) or expected heterozygosity per nucleotide (θ = 0.026). Despite the high level of polymorphism, two lines (M55 and M66) were identical. As shown in Table 2, the level of variation in the complete region, which includes different coding regions, was lower than the estimated variation in intergenic regions. Length polymorphisms were essentially located in noncoding regions; insertions/deletions varied between 1 and 33 bp and were rather evenly distributed along the region, although slightly more frequent in the two extreme regions of the cluster.
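The two estimators of nucleotide variation mentioned above can be sketched in Python for a small alignment. This is an illustration under simplifying assumptions, not the DnaSP procedure: the function names are hypothetical, any column containing a gap is excluded, and θ is computed from the number of segregating sites rather than the number of mutations.

```python
from itertools import combinations


def gap_free_columns(seqs):
    """Alignment columns in which no sequence has a gap."""
    return [col for col in zip(*seqs) if "-" not in col]


def nucleotide_diversity(seqs):
    """pi: average number of pairwise differences per site."""
    cols = gap_free_columns(seqs)
    n = len(seqs)
    n_pairs = n * (n - 1) / 2
    diffs = sum(sum(x != y for x, y in combinations(col, 2)) for col in cols)
    return diffs / n_pairs / len(cols)


def watterson_theta(seqs):
    """theta per site from the number of segregating sites (Watterson 1975)."""
    cols = gap_free_columns(seqs)
    n = len(seqs)
    s = sum(len(set(col)) > 1 for col in cols)
    a_n = sum(1.0 / i for i in range(1, n))  # harmonic number with n - 1 terms
    return s / a_n / len(cols)
```

For a sample of nine lines, a_n is the sum 1 + 1/2 + ... + 1/8, so each segregating site contributes less to θ than it would in a smaller sample.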
A minimum of 36 recombination events in the history of the Spanish sample were detected by the four-gamete algorithm (Hudson and Kaplan 1985), which is based on the presence of all four gametic types in the sample for any pair of polymorphisms. These events were distributed rather evenly along the region studied. The recombination parameter C = 4Nc, where N is the effective number of individuals in a population and c is the recombination rate between adjacent nucleotides, was estimated by the method of Hudson (1987) that assumes mutation-drift equilibrium; the estimated 4Nc was 86.9 for the whole region, which resulted in a per nucleotide estimate of 0.022.
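A minimal Python sketch of the four-gamete criterion is given below. Haplotypes are assumed to be equal-length strings, only biallelic sites are used, and the lower bound is obtained by counting the largest set of non-overlapping intervals between incompatible site pairs, which is how the Hudson and Kaplan (1985) bound is commonly computed. The function names are hypothetical.

```python
from itertools import combinations


def four_gametes(site_a, site_b):
    """True if two biallelic sites show all four two-site haplotypes."""
    return len(set(zip(site_a, site_b))) == 4


def min_recombination_events(haplotypes):
    """Lower bound on the number of recombination events in the sample history."""
    columns = list(zip(*haplotypes))
    biallelic = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    # Every incompatible pair of sites implies at least one event between them.
    intervals = [(i, j) for i, j in combinations(biallelic, 2)
                 if four_gametes(columns[i], columns[j])]
    # Greedy selection by right endpoint yields the maximum number of
    # non-overlapping intervals; intervals may share an endpoint.
    intervals.sort(key=lambda iv: iv[1])
    events, last_end = 0, -1
    for start, end in intervals:
        if start >= last_end:
            events += 1
            last_end = end
    return events
```

Because the bound only counts events that leave a detectable four-gamete signature, it is expected to underestimate the true number of recombination events.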
Tajima's test of neutrality (Tajima 1989) was applied to the whole region studied. This test contrasts whether the difference between the two estimates of nucleotide variation (π and θ) is equal to zero, as would be expected if the population were in mutation-drift equilibrium. The power of the test increases with higher numbers of polymorphisms and larger sample sizes; in the present case the number of polymorphisms was high but the sample size was not. The distribution of Tajima's D was obtained in samples generated by Monte Carlo simulation of the coalescent model using the Hudson code (Hudson 1983). When the no recombination model was considered, the two-tailed test did not reveal any deviation from neutrality (P = 0.42). As the variance of the number of pairwise differences is reduced with increasing recombination, the variance (and standard error) of Tajima's D is also lower. When the estimated recombination parameter, 86.9, or half its value was considered, the estimated D values were significant (P = 0.02) or marginally significant (P = 0.06), respectively. As the estimate of 4Nc has a high variance and also its estimation is based on the assumption of mutation-drift equilibrium, these results have to be taken with caution.
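For reference, the D statistic itself can be computed from the sample size, the number of segregating sites, and the average number of pairwise differences using the constants given by Tajima (1989). The Python sketch below covers only the statistic; the coalescent simulations used to obtain its distribution with and without recombination are not reproduced, and the function name is hypothetical.

```python
from math import sqrt


def tajimas_d(n, s, k):
    """Tajima's D.

    n: number of sequences; s: number of segregating sites;
    k: average number of pairwise differences (whole region, not per site).
    """
    if n < 4 or s < 1:
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: difference between the two estimates of theta.
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

D is positive when the pairwise-difference estimate exceeds the estimate based on segregating sites, and negative in the opposite case.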
Comparison of the amino acid sequences of the Cec and Cec-related genes; the sequence of the more distantly related gene Andropin is also given. Dots indicate the same amino acid as the first sequence. A dash indicates a gap. For D. melanogaster and species of the simulans cluster the consensus sequence is given; only one sequence from Ceratitis capitata and one from Hyalophora cecropia are presented. Open and shadowed boxes indicate residues conserved in all known cecropins (Hetru et al. 1997), and in the products of the CecΨ1 and CecΨ2 genes, respectively. Arrows indicate the limits of the mature cecropins.
Polymorphism in the different regions of the Cecropin cluster
We observed some clusters of significant linkage disequilibria (estimated as D; Lewontin and Kojima 1960) between tightly linked sites. For example, six polymorphisms in the small intron of CecA1 were in complete linkage disequilibrium and formed two haplotypes at intermediate frequencies (0.44/0.56); this same cluster had been detected by Clark and Wang (1997) in a sample of seven lines from Maryland, but not in their sample of five lines from Zimbabwe. The sign test on D (Lewontin 1995) was used to detect linkage disequilibrium across the whole region. In our sample there were 257 biallelic sites (including singletons), which allowed 256 independent comparisons. The null hypothesis of no true disequilibrium could be rejected using the G test with Williams' correction (G = 58.1, 10 d.f., P < 0.001). However, no overall excess of either positive or negative associations between variants could be detected (G = 0.813, 1 d.f., P = 0.2).
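The disequilibrium coefficient D for a pair of biallelic sites can be sketched in Python as follows. The sign of D depends on which allele is taken as the reference at each site; in this sketch the rarer allele is used, which is an assumption made for illustration and not necessarily the convention of the sign test applied here. The function names are hypothetical.

```python
from collections import Counter


def minor_allele(site):
    """Rarer of the two alleles at a biallelic site (ties broken alphabetically)."""
    counts = Counter(site)
    if len(counts) != 2:
        raise ValueError("site must be biallelic")
    return min(counts, key=lambda allele: (counts[allele], allele))


def linkage_d(site_a, site_b):
    """D = p_AB - p_A * p_B, with the minor allele as reference at each site."""
    n = len(site_a)
    ref_a, ref_b = minor_allele(site_a), minor_allele(site_b)
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    return p_ab - p_a * p_b
```

For two sites in complete association in a sample of nine lines split 4/5 between two haplotypes, this gives D = 4/9 − (4/9)² = 20/81, about 0.25.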
Divergence was estimated for the different regions (results not shown) as well as for the complete fragment (Table 3). These figures cannot be directly compared to previously published ones as most divergence estimates between these species are based only on coding regions. However, present estimates would be on the upper part of the range of silent divergence estimates between D. melanogaster and D. simulans (see, for example, Hudson et al. 1994).
Polymorphism and divergence at the Cec genes: Table 4 shows the distribution of polymorphism at the different functional regions of the three Cec genes (CecA1, CecA2, and CecB). Estimates of synonymous variation in the coding region were among the highest ever reported in this species, especially for CecA2 and CecB; however, this assertion should be made with caution due to the rather small number of nucleotide sites sampled. The estimates of silent variation for the different duplicated regions were rather similar, and these values were lower than the estimated variation for the regions between or flanking duplications (Table 2). In each of the three genes, nonsynonymous polymorphisms were located in the region of exon 1 that codes for the signal peptide. Estimated nonsynonymous polymorphism was approximately one order of magnitude lower than estimated silent polymorphism, indicating the action of purifying selection against replacement changes.
Table 5 shows a summary of interspecific divergence for the CecA1 and CecB regions. As expected from the levels of polymorphism, silent divergence was higher than nonsynonymous divergence. Under mutation-drift equilibrium there should be a direct relationship between the levels of polymorphism and divergence. Two tests of neutrality based on this prediction of the neutral hypothesis were performed: the Hudson, Kreitman, and Aguadé, or HKA, test (Hudson et al. 1987), and the McDonald and Kreitman, or MK, test (McDonald and Kreitman 1991). In the case of the HKA test, silent polymorphism and divergence in each of the CecA1 and CecB regions were compared to polymorphism and divergence in the region between the Anp gene and the CecA1 region. In the case of the MK test, for each gene, fixed nonsynonymous and synonymous changes were compared to polymorphic nonsynonymous and synonymous changes. None of the tests revealed a significant deviation from the neutral hypothesis.
Polymorphism at the Cec cluster region of D. melanogaster. The different functional regions are indicated schematically above the polymorphic sites; black and white boxes indicate exons and introns, respectively. Polymorphisms are numbered according to the reference sequence in GenBank (accession no. X16972); nucleotides inserted relative to that sequence are indicated with consecutive letters after the nucleotide site preceding the insertion. Dots indicate the same nucleotide as the first sequence. A dash indicates a gap. The length of an insertion/deletion is indicated by its first and last nucleotides; nucleotide polymorphisms within a polymorphic length change are indicated in square brackets. Nonsynonymous polymorphisms are indicated in boldface; n, nonsynonymous.
Nucleotide divergence in the different regions of the Cec cluster
Figure 6 shows the phylogenetic tree for the three duplicated regions reconstructed by the neighbor-joining method (Saitou and Nei 1987) from the nine sequences of D. melanogaster and the sequences of the sibling species (except for the CecA2 region). Each duplication formed a cluster separated by a deep branch from the rest, which is an indication of the independent evolution of each copy since the duplication occurred; gene conversion has not, therefore, played a major role at least in the recent history of these three genes. A similar tree was obtained when only silent changes were considered (not shown). However, in the tree obtained using nonsynonymous changes (not shown) the CecA1 and CecA2 genes grouped together, reflecting the complete absence of fixed nonsynonymous changes between these genes. The rates of synonymous and nonsynonymous substitutions could have varied since the CecA1-CecA2 duplication. To check this possibility, we applied a modification of the MK test (McDonald and Kreitman 1991) where we considered (i) nonsynonymous and synonymous changes fixed between the two genes (0 and 7, respectively); (ii) nonsynonymous and synonymous changes in these genes between and within species (5 and 10, respectively). No significant difference in the ratio of nonsynonymous and synonymous changes was detected (χ² = 3.04, P > 0.05); the low power of this test should be mentioned, however, given the low number of nonsynonymous changes.
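The 2 × 2 comparison underlying this modified MK test can be sketched in Python with an uncorrected Pearson χ² statistic. Whether this is exactly the statistic computed here is an assumption: for the counts given above (0 and 7 vs. 5 and 10) the sketch yields about 3.02, close to but not identical with the reported value, and below the 5% critical value of 3.84 for 1 d.f. The function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Rows: fixed between CecA1 and CecA2; changes between and within species.
# Columns: nonsynonymous, synonymous (counts taken from the text).
chi2 = chi_square_2x2(0, 7, 5, 10)
not_significant = chi2 < 3.841  # 5% critical value, 1 d.f.
```

With an expected count well below five in one cell, an exact test would usually be preferred for such a table, which is consistent with the caution about low power.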
Polymorphism and divergence of Cec-related genes: Analysis of intra- and interspecific variation can also shed some light on the status of the Cec-related genes; for these analyses, neither the length changes in the D. melanogaster coding region that would cause changes in the reading frame nor the D. mauritiana sequence were considered. Table 6 gives a summary of nucleotide polymorphism in the CecΨ1 and CecΨ2 regions. For the coding region (exons 1 and 2) nonsynonymous polymorphism was lower than synonymous polymorphism; when silent polymorphism in each duplicated region was considered, again nonsynonymous variation was lower than silent variation, which would indicate some functional constraint on amino acid replacement changes. The level of total silent variation was either of the same order (0.025 for CecΨ2) or higher (0.043 for CecΨ1) than in the functional genes. Only silent polymorphism at CecΨ1 (0.043) was, however, higher than that detected in the intergenic regions (0.034).
Polymorphism at the different genes of the Cec cluster
Divergence at the Cec genes
Nucleotide divergence between species was estimated for the different functional regions of the CecΨ1 and CecΨ2 duplications (Table 7). No deviation from the neutral prediction of a direct proportionality between silent polymorphism and silent divergence was detected when the CecΨ1 (or the CecΨ2) gene region was compared by the HKA test (Hudson et al. 1987) to the region between Anp and CecA1. Nor did application of the MK test (McDonald and Kreitman 1991) to each of the CecΨ1 and CecΨ2 regions detect any deviation from neutral expectations.
Relationship between Cec genes and Cec-related genes: The CecΨ1 and CecΨ2 genes presented a rather low sequence similarity with the Cec genes (slightly higher than 50%). Nevertheless, in addition to the already described TATA box and capping site these genes (like functional Cec genes) presented the GATA, κB-like or GAAANN motifs and a more or less partial R1 motif (Figure 3). These characteristics together with the structure of these genes—two exons separated by a small intron, and the presence of a signal peptide in the conceptually translated protein—and their location in the Cecropin cluster, point to their origin by duplication from an ancestral Cec gene.
Table 8 shows the divergence estimates between the different genes of the Cec cluster (CecA1, CecA2, CecB, CecΨ1, and CecΨ2) under this assumption. When all five genes were considered, only the coding regions could be reliably aligned, and the number of nucleotides compared was consequently rather low. In comparisons between CecA1, CecA2, and CecB genes and between CecΨ1 and CecΨ2 genes, silent divergence could be estimated for the complete duplication. For any particular pair of genes, divergence was estimated as the average of different average divergence estimates; for example, in the case of the CecA1 and CecB genes this would correspond to the following comparisons: between CecA1 D. melanogaster lines and CecB D. melanogaster lines, between CecA1 D. melanogaster lines and CecB simulans species cluster, between CecA1 simulans species cluster and CecB D. melanogaster lines, and between CecA1 simulans species cluster and CecB simulans species cluster. Divergence estimates for synonymous sites were highest for comparisons between Cec and Cec-related genes, indicating that they form two separate groups. However, for nonsynonymous sites divergence estimates between Cec and Cec-related genes were of the same order, although slightly lower than between CecΨ1 and CecΨ2. This just reflects the lack of amino acid sequence conservation between the cecropins and the putative proteins coded by these genes. In fact, in all the residues that are conserved in cecropin mature peptides (Hetru et al. 1997), these putative proteins present a different amino acid than cecropins (Figure 4). The putative CecΨ1 and CecΨ2 proteins presented four fixed amino acid differences (all in the mature peptide) relative to the cecropins. These substitutions most probably occurred after the original duplication from an ancestral Cec gene but prior to the separation of these genes by gene duplication.
Neighbor-joining tree (Saitou and Nei 1987) of the three Cec gene regions studied, using estimated total divergence per nucleotide (K) corrected by the method of Jukes and Cantor (Jukes and Cantor 1969), and based on comparison of 446 sites. Percentage bootstrap values (based on 500 replicates) for the main branching nodes are shown on the tree. The CecA1, CecA2, and CecB genes are indicated by A1, A2 and B, respectively. D. simulans, D. mauritiana, and D. sechellia are abbreviated as D. sim, D. mau, and D. sech, respectively. The D. melanogaster lines are indicated by the name of the line.
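The Jukes and Cantor correction mentioned in this legend can be illustrated with a minimal Python sketch. It converts an observed proportion of differing sites, p, into a corrected divergence K = -(3/4) ln(1 - 4p/3). The function name and the number of differences used in the example are illustrative assumptions and are not taken from this study; only the 446 compared sites come from the legend.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected number of substitutions per site.

    p is the observed proportion of sites that differ between two
    aligned sequences; the correction is undefined for p >= 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in the interval [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# Hypothetical input: 45 differences over the 446 compared sites.
# The count of 45 is a placeholder, not a value from the study.
observed_p = 45 / 446
corrected_k = jukes_cantor_distance(observed_p)
```

The corrected value is always at least as large as the observed proportion, because the correction accounts for multiple substitutions at the same site.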
DISCUSSION
Pattern of variation at the Cec cluster: This cluster is located in a region with a rather high level of recombination despite its location on band 99E, only one section apart from the telomere of the right arm of the third chromosome. According to Kliman and Hey (1993a) the rate of recombination for this region (0.0036) would not be much lower than the highest value reported for the third chromosome (0.0045 for the Bj1 region on band 64B). Although the null hypothesis of no true disequilibrium was rejected by the sign test on D (Lewontin 1995), the significant associations between polymorphic sites were interspersed in the whole region. Also the minimum number of recombination events (on average nine per kilobase) and the recombination parameter (4Nc = 86.9) inferred from the history of the sample were rather high.
Polymorphism at the Cec-related genes
Divergence at the Cec-related genes
The distribution of linkage disequilibria in the Cec cluster stands in contrast to the detection in the same lines of two clusters of linkage disequilibria at the Acp70A gene region (Cirera and Aguadé 1997), where recombination is also high. If the observed associations were due only to a nonequilibrium situation of this European population (as proposed by Begun and Aquadro 1993, 1995), similar patterns would be expected in regions with similar rates of recombination; however, selection could be contributing to the observed discrepancy in the distribution of significant associations.
Divergence between genes in the Cec cluster
Variation at the Cec genes: In some genes of the major histocompatibility complex an excess of nonsynonymous variation has been detected, especially in the antigen-recognition parts of the molecules (Hughes and Nei 1988). Clark and Wang (1997) studied variation in six genes involved in the immune response of D. melanogaster: Andropin and Diptericin in addition to CecA1, CecA2, CecB, and CecC. In their samples of D. melanogaster from North America and from East Africa (Zimbabwe) they did not find any excess of nonsynonymous polymorphism, and we found similar results in our sample from Spain. As mentioned in the Introduction, the genes studied in vertebrates are part of their adaptive immune system, while those studied in Drosophila are part of the insect innate immune system. The observed difference in the level of nonsynonymous variation between the two kinds of genes points to selection favoring diversity only in the genes mediating the highly specific response of vertebrates.
Both in our sample from Spain and in the sample from Maryland (Clark and Wang 1997), the few nonsynonymous polymorphisms detected were in exon 1 and caused amino acid replacement polymorphisms in the signal peptide. Only the sample from Zimbabwe (Clark and Wang 1997) showed some nonsynonymous polymorphisms that would affect the amino acid sequence of the mature antibacterial peptide. In contrast to the results previously reported (Clark and Wang 1997), our estimates of divergence for the CecB region were not especially low. In our study divergence was estimated between D. melanogaster and each of the three sibling species D. simulans, D. mauritiana, and D. sechellia, and estimates of divergence were rather similar in the three comparisons (Table 5).
Evolution of the Cec genes: These genes are rather old, as indicated by silent divergence estimates between duplications (Table 8), as compared to silent divergence estimates between D. melanogaster and species of the simulans cluster (KA1 = 0.12, KB = 0.09); divergence estimates were based on the alignment of the three genes. Considering the time since the split of the melanogaster and simulans lineages (2.5 mya, Powell and DeSalle 1995) and assuming a constant rate of silent substitution, the two duplication events (between CecA1 and CecA2, and between the CecB and CecA genes) could be roughly dated as having occurred around 9 and 19 mya, respectively. Given the time back to the CecA1-CecA2 duplication, the absence of fixed nonsynonymous differences between these genes would seem in contrast with the presence of some nonsynonymous changes in each of these genes (either segregating in D. melanogaster or fixed between species). One could think of a gene conversion event prior to the split of the species as the possible cause of this lack of fixed replacement substitutions. If gene conversion had caused the lack of nonsynonymous differentiation, one would also expect a decrease of synonymous variation in the exons. Silent divergence between the two duplicated regions was, therefore, estimated along the region and only the intron showed a lower than average silent divergence.
We have detected that the CecA2 gene has lost its function in the three sibling species D. simulans, D. mauritiana, and D. sechellia. We cannot ascertain which event caused the loss of function in each of these species, which precludes our establishing its time of occurrence. However, upper and lower limits for this inactivation can be established: (i) if the inactivating event was the loss of the stop codon or the changes in the promoter elements, it most probably occurred after the split of the melanogaster and simulans lineages (2.5 mya, Powell and DeSalle 1995) but prior to the split of the three sibling species lineages; (ii) if the inactivating event was a deletion, CecA2 would have most probably lost its function independently in these three lineages after their split, which has not been dated accurately but which would have occurred at most 1 mya (Hey and Kliman 1993; Kliman and Hey 1993b). Consequently, CecA2 has become a pseudogene rather recently in these species; some additional deletions have however accumulated in the three species, which have dramatically reduced the length of that region (Table 1, Figure 2). The two copies of the CecA gene would seem to have a redundant antibacterial function given the identity of the mature peptide. One could view the loss of the CecA2 gene in the species of the simulans cluster as a consequence of this redundancy and therefore of the lack of functional constraint after the inactivating event. This argument would, however, stand in contrast with the age and protein sequence conservation of the two copies.
It has been recently argued (Petrov et al. 1996) that the scarcity of pseudogenes in Drosophila might be a result of deletions being predominantly fixed in these regions, once they have become neutral because of inactivation. Pritchard and Schaeffer (1997) have, however, found that although LcpΨ had on average lost length in D. melanogaster and D. simulans, relative to its functional counterpart, the number of insertions was higher than that of deletions. In the CecA2 region deletions outnumbered insertions, but, unlike the case of LcpΨ, most length changes were longer than 2 bp (Figure 2). The CecA2 pseudogene in the simulans species cluster is a short-lived pseudogene that can, however, still be recognized. As pointed out by Pritchard and Schaeffer (1997) and as will be discussed later, this may not be the general situation in Drosophila.
Evolution of the Cec-related genes: Silent divergence estimates (based on the alignment between the CecΨ1 and CecΨ2 gene regions) between duplications (Table 8) and between species (KΨ1 = 0.15, KΨ2 = 0.14) indicate that CecΨ1 and CecΨ2 are also rather old. Under the assumption of a constant rate of evolution and considering that the D. melanogaster and D. simulans lineages diverged 2.5 mya, the duplication that originated them could be roughly dated around 11 mya.
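The dating argument used here and for the Cec genes above is a simple proportionality under a constant rate of silent substitution. The following Python sketch is one way to write it, not necessarily the exact calculation performed in the study. The function name, the averaging of the two between-species estimates, and the between-duplication divergence are assumptions; the last is a placeholder chosen only so that the result falls near the roughly 11 mya quoted in the text, and it is not the Table 8 estimate.

```python
def date_duplication(k_between_copies, k_between_species, split_time_mya=2.5):
    """Date a duplication under a constant rate of silent substitution.

    With a constant rate, elapsed time is proportional to silent
    divergence, so the species split (2.5 mya by default) calibrates
    the age of the duplication.
    """
    if k_between_species <= 0:
        raise ValueError("between-species divergence must be positive")
    return split_time_mya * k_between_copies / k_between_species

# Between-species silent divergence quoted in the text for the two
# Cec-related genes; averaging them is an assumption of this sketch.
k_species = (0.15 + 0.14) / 2

# Placeholder between-duplication divergence (NOT taken from Table 8).
k_copies_placeholder = 0.64

age_mya = date_duplication(k_copies_placeholder, k_species)
```

The same function applies to the CecA1-CecA2 and CecB-CecA duplications once the corresponding silent divergence estimates are supplied.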
We found correctly spliced transcripts for the two initially described pseudogenes of the Cecropin cluster (CecΨ1 and CecΨ2), but at low concentration. Although transcription does not necessarily mean function (the transcribed product might not be correctly spliced, translation might not be correctly initiated or terminated due to changes in the initiation or stop codons, translation might stop prematurely due to the presence of stop codons, etc.), in the present case this would not seem to be the situation except for D. mauritiana and one allele of each of the CecΨ1 and the CecΨ2 genes in D. melanogaster. However, if the CecΨ1 and CecΨ2 genes were functional, their function should not be essential, at least under certain environmental conditions, as the strains with the null allele of each gene were homozygous and perfectly viable under laboratory conditions.
In Drosophila, and also in mammals, there are few reports of transcribed pseudogenes that have also been studied at the intra- and interspecific levels. In two such cases, the pattern of evolution of the formerly considered pseudogenes deviated from the expected pattern in regions with no functional constraint. In fact, the Adh-related processed pseudogene of D. yakuba and D. teissieri (Jeffs and Ashburner 1991), and the Adh-related pseudogene in species of the repleta group (as reviewed in Sullivan et al. 1994) turned out to be chimeric functional genes: jingwei in D. yakuba and D. teissieri (Long and Langley 1993), and finnegan in species of the repleta group (Begun 1997). Comparison of intra- and interspecific variation in these genes revealed that they had accumulated adaptive nonsynonymous substitutions in their Adh-related part.
Two possible scenarios for the evolution of the CecΨ1 and CecΨ2 genes. See text for description. Two connected boxes represent two exons of the same gene: black for pseudogenes, white with or without pattern for functional genes, and patterned black for either functional genes with null alleles or pseudogenes. (A) Scenario 1, (B) scenario 2.
Unlike in a functional gene, all length changes in the former coding region of a pseudogene would be neutral. Therefore, the presence of such loss-of-function alleles in a duplicated gene is generally considered an indication of that copy of the gene having lost its function and become a pseudogene (Balakirev and Ayala 1996); these loss-of-function alleles could be segregating for a long time as their time to fixation, which is dependent on the effective population size, can be rather high. In fact, the absence of such length variants in the Adh-related pseudogene in species of the repleta group was considered an indication that the duplicated gene was functional (Begun 1997). However, in the case of single copy genes (like some allozyme loci) loss-of-function alleles are considered null alleles that, despite their more or less deleterious character, are maintained in natural populations by mutation-selection balance.
Even if at present CecΨ1 and CecΨ2 were pseudogenes, two different scenarios could be viewed for their evolution (Figure 7). In the first scenario, the first copy of the ancestral Cec gene would have maintained its function and accumulated some (neutral or adaptive) nonsynonymous substitutions; after the duplication of this slightly differentiated copy, the two genes could have further differentiated both between themselves and relative to the Cec genes and acquired new functions. In this case, the loss-of-function changes would have occurred independently in each of these new genes and could have occurred rather recently. In the second scenario, the first copy of the ancestral Cec gene would have lost its function and become a pseudogene (more than 11 my old), which would have soon suffered a duplication. The two pseudogenes would be therefore old pseudogenes, and they would have accumulated further differences both at previously synonymous and nonsynonymous sites due to the complete loss of functional constraint.
If CecΨ1 and CecΨ2 were or had been functional rather recently, as depicted under the first scenario, the low amino acid similarity of the putative mature proteins both between themselves and relative to the cecropins would point to each of these genes having acquired a differentiated function; under this assumption these genes could be considered single copy genes. The loss-of-function alleles of both CecΨ1 and CecΨ2 could be considered in this sense null alleles, as each of the length changes causing the loss of the original reading frame would have occurred in an external branch of the D. melanogaster gene genealogy. However, the frequency of these alleles of CecΨ1 and CecΨ2 in the sampled population (out of nine lines, one and two lines, the latter two identical, presented those changes in CecΨ1 and CecΨ2, respectively) would seem rather high to be maintained by mutation-selection balance. In a previous survey of 20 autosomal enzyme loci (Voelker et al. 1980), the weighted mean frequency of null alleles was 0.0025, although only 13 loci presented null alleles and the highest frequency was 0.012. In 12 of these loci, null homozygotes were viable and fertile. As discussed above, the CecΨ1 and CecΨ2 genes would not be essential in either of the species studied, at least under laboratory conditions, although they could have been essential in the past or could be essential even at present under certain environmental conditions. Only a survey of a larger sample will allow a more reliable estimate of the frequency of null alleles and, therefore, an assessment of whether it is consistent with a simple mutation-selection balance.
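To give a rough sense of why one or two null alleles among nine lines would seem high relative to the allozyme survey, the following Python sketch computes binomial tail probabilities. It is an illustration only and not an analysis performed in the study. It assumes that the nine lines are independent draws from the population, which may not hold for the two identical CecΨ2 lines, and it uses the frequencies 0.0025 and 0.012 quoted from the survey as hypothetical population frequencies. The function name is an assumption.

```python
from math import comb

def prob_at_least(k, n, q):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k, n + 1))

n_lines = 9              # lines sampled in D. melanogaster
mean_null_freq = 0.0025  # weighted mean frequency quoted from the survey
max_null_freq = 0.012    # highest frequency quoted from the survey

# Chance of seeing at least one null allele (as for CecΨ1) if the
# population frequency equaled the survey mean.
p_one_at_mean = prob_at_least(1, n_lines, mean_null_freq)

# Chance of seeing at least two null alleles (as for CecΨ2) even if the
# population frequency equaled the highest value in the survey.
p_two_at_max = prob_at_least(2, n_lines, max_null_freq)
```

Under these assumptions both probabilities are small, in line with the statement that the observed frequencies would seem rather high; with only nine lines, however, the estimate is imprecise, which is why a larger sample is needed.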
Alternatively, these genes could be old pseudogenes (~11 my) and never have had a differentiated function. In this second scenario the higher rate of substitution in previously nonsynonymous sites would be due simply to the loss of functional constraint. There are, however, some observations that would seem at odds with this interpretation: (i) the maintenance in each case, despite the fixation of length changes, of an open reading frame that is transcribed and correctly spliced; (ii) the conservation in each case of the transcription and splicing signals, despite their short length; (iii) the rather high GC content of the putative exons of these genes, which is comparable to that of the Cec genes and stands in contrast to the low GC content of the putative introns and intergenic regions (Figure 8). Also, if the CecΨ1 and CecΨ2 genes were old pseudogenes, their evolution would have been different from that of the CecA2 pseudogene in the species of the simulans cluster. In fact, this latter pseudogene would be a rather young pseudogene and would have accumulated deletions in species with rather different effective population sizes, such as D. simulans relative to D. mauritiana and D. sechellia (Kliman and Hey 1993b), while the CecΨ1 and CecΨ2 genes would be rather old and would have accumulated mainly nucleotide changes.
Distribution of the GC content along the Cec region studied. As shown schematically below the graph, the region was divided according to both function and similarity (see Figure 1A), and the graph shows the GC content of each subdivision. Squares and circles indicate coding and noncoding regions, respectively.
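The quantity plotted in this figure can be computed per subdivision as the fraction of G and C among the unambiguous bases. The Python sketch below shows one way to do this; the function name and the two short sequences are toy placeholders, not sequences from the Cec region.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)

# Toy placeholder sequences (hypothetical, for illustration only).
segments = {
    "coding (toy)": "ATGGCCGCGCTGAAG",
    "noncoding (toy)": "GTAAGTATTTTAATAG",
}

gc_by_segment = {name: gc_content(seq) for name, seq in segments.items()}
```

Gaps and ambiguous symbols are ignored in this sketch, so aligned sequences can be passed without first removing alignment characters.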
Acknowledgments
We thank A. Moragas for fly collection, J. Rozas for sharing isochromosomal lines and the unpublished version 2.52 of the DnaSP program, and E. Juan, A. Amador and J. M. Martín-Campos for advice and/or sharing materials in the RNA work. We also thank the Umeå Stock Center for the D. mauritiana line, Serveis Científico-Tècnics from Universitat de Barcelona for automated sequencing facilities, and J. Rozas and C. Segarra for critical comments and discussion. This work was supported by grants PB94-923 from Dirección General de Investigación Científica y Técnica, Ministerio de Educación y Ciencia, Spain, and 1995SGR-577 from Comissió Interdepartamental de Recerca i Tecnologia, Generalitat de Catalunya, to M.A.
Footnotes
- Communicating editor: A. G. Clark
- Received March 2, 1998.
- Accepted May 14, 1998.
- Copyright © 1998 by the Genetics Society of America