Genetics, Vol. 159, 589-598, October 2001, Copyright © 2001

The Evolutionary Analysis of "Orphans" From the Drosophila Genome Identifies Rapidly Diverging and Incorrectly Annotated Genes

Karl J. Schmid and Charles F. Aquadro
Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853

Corresponding author: Karl J. Schmid, Department of Genetics and Evolution, Max Planck Institute for Chemical Ecology, Carl Zeiss Promenade 10, D-07745 Jena, Germany. E-mail: schmid@ice.mpg.de

Communicating editor: M. AGUADÉ


*  ABSTRACT

In genome projects of eukaryotic model organisms, a large number of novel genes of unknown function and evolutionary history ("orphans") are being identified. Since many orphans have no known homologs in distant species, it is unclear whether they are restricted to certain taxa or evolve rapidly, either because of a lack of constraints or positive Darwinian selection. Here we use three criteria for the selection of putatively rapidly evolving genes from a single sequence of Drosophila melanogaster. Thirteen candidate genes were chosen from the Adh region on the second chromosome and 1 from the tip of the X chromosome. We succeeded in obtaining sequence from 6 of these in the closely related species D. simulans and D. yakuba. Only 1 of the 6 genes showed a large number of amino acid replacements and in-frame insertions/deletions. A population survey of this gene suggests that its rapid evolution is due to the fixation of many neutral or nearly neutral mutations. Two other genes showed "normal" levels of divergence between species. Four genes had insertions/deletions that destroy the putative reading frame within exons, suggesting that these exons have been incorrectly annotated. The evolutionary analysis of orphan genes in closely related species is useful for the identification of both rapidly evolving and incorrectly annotated genes.


GENOME projects aim at correctly identifying all genes encoded by a genome (e.g., BORK and KOONIN 1998; BRENNER 1999; ADAMS et al. 2000) and understanding their genetic, biochemical, and cellular functions (HIETER and BOGUSKI 1997; BORK et al. 1998). Achieving these goals is a considerable challenge because all genomes studied so far harbor many proteins with little or no similarity to proteins of known function. A comparison of publications describing partial or complete genome sequences from eukaryotic model organisms over the past 5 years reveals that about one-third of all predicted protein-coding genes fall into this class, despite the exponential growth of sequence databases. Such genes have been called "orphans" and their function needs to be determined by genetic or biochemical approaches (OLIVER 1996).

There are two major explanations for the large number of orphans. Both need to take into account that most model organisms whose genomes are currently being sequenced are separated by large evolutionary distances. First, many orphans might consist of genes whose phylogenetic distribution is restricted to certain evolutionary lineages; e.g., they are specific to plants or vertebrates. Second, orphan genes may diverge rapidly between closely related species because the proteins they encode are unconstrained in their sequence evolution or subjected to directed Darwinian selection, whereas their structure and function might be conserved even between distantly related organisms. Such a hypothesis is supported by estimates of only a few thousand naturally occurring protein superfamilies (CHOTHIA 1992; BRENNER et al. 1997). Orphans might therefore consist of highly divergent, rapidly evolving members of this limited set of superfamilies.

Evolutionary comparisons of closely related genomes will help to differentiate between the two hypotheses. For example, by a hybridization and sequencing approach it was estimated that about one-third of all expressed Drosophila genes diverge rapidly within the genus Drosophila (SCHMID and TAUTZ 1997). These data support the rapid evolution hypothesis for a large number of orphan genes. Surveys of nucleotide polymorphism of some of these rapidly diverging orphan genes in populations of D. melanogaster and D. simulans revealed that a lack of constraints may be responsible for their evolution because the majority of the numerous amino acid substitutions are neutral or nearly neutral (SCHMID et al. 1999).

There appears to be a relationship between the function and evolutionary conservation of genes. For example, the genetic and sequence analysis of 3 Mb of the Adh region of D. melanogaster revealed strong functional differences between conserved and nonconserved genes (ASHBURNER et al. 1999). Sequence analysis predicted 220 protein-coding genes, of which only 79 had a detectable phenotype (lethality, sterility, or morphological deformations). The degree of sequence conservation differed markedly between the two classes of genes: About two-thirds of the 79 genes with a phenotype had a homolog in distantly related species (yeast, vertebrates, C. elegans, and prokaryotes) in contrast to only 14% of the genes without a phenotype. Clearly, the latter class is less conserved and both its function and evolution remain largely obscure. Additionally, genes with a mutant phenotype are more highly expressed as evaluated by comparisons to >80,000 expressed sequence tags (ESTs) from Drosophila (ASHBURNER et al. 1999).

The goals of this study were to test whether rapidly evolving candidate genes can be reliably identified in single genomic sequences, to verify by comparative sequencing that candidate genes evolve rapidly, and to distinguish between low constraint and positive Darwinian selection as causes for the sequence divergence of the proteins encoded by such genes. By combining data on the long-term evolutionary conservation in distant species and matches to Drosophila ESTs with sequence features like codon usage, we found 13 rapidly evolving candidate genes from the Adh region on the second chromosome (ASHBURNER et al. 1999) and the tip of the X chromosome (BENOS et al. 2000). Homologs of 6 genes were sequenced from the closely related species D. simulans and D. yakuba. We discovered 1 very rapidly evolving and several incorrectly annotated genes.


*  MATERIALS AND METHODS

Analysis of annotated Drosophila sequences:
Sequences from the European Drosophila Genome Project (EDGP) were downloaded from the EDGP FTP site (ftp.ebi.ac.uk) and coding sequences were extracted using the annotation in the GenBank format. The 3-Mb sequence of the Adh region and a gff-formatted file containing the annotation information were downloaded from the Berkeley Drosophila Genome Project (BDGP) website (http://www.fruitfly.org) and the coding sequences were extracted. The coding sequences were searched against the collection of 86,000 Drosophila ESTs and the nonredundant GenBank database at the National Center for Biotechnology Information using BLAST with standard settings (ALTSCHUL et al. 1997). The effective number of codons (ENC) and the GC content at silent sites were calculated according to WRIGHT (1990). Sequence extractions, database searches, and analyses were performed with Perl scripts written by K. J. Schmid.
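The two codon usage statistics named above can be illustrated with a short Python sketch. It is not the Perl code used in this study. It assumes the standard genetic code, uses the homozygosity formula F = (n Σp² - 1)/(n - 1) per amino acid and ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 in the form commonly attributed to WRIGHT (1990), and makes a simplifying choice for GC3 (third positions of all codons that have synonymous alternatives). The function names `gc3` and `enc` are introduced here for illustration only.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families (stop codons excluded).
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def sense_codons(cds):
    """Yield the in-frame sense codons of a coding sequence."""
    cds = cds.upper().replace("U", "T")
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        # Stop codons and codons with ambiguous bases are skipped.
        if CODON_TABLE.get(codon, "*") != "*":
            yield codon


def gc3(cds):
    """G+C fraction at third positions of codons with synonymous alternatives."""
    third = [c[2] for c in sense_codons(cds) if len(FAMILIES[CODON_TABLE[c]]) > 1]
    return sum(b in "GC" for b in third) / len(third) if third else float("nan")


def enc(cds):
    """Effective number of codons: 20 = one codon per amino acid, 61 = no bias."""
    counts = Counter(sense_codons(cds))
    homozygosity = defaultdict(list)  # degeneracy class -> list of F values
    for family in FAMILIES.values():
        k = len(family)
        n = sum(counts[c] for c in family)
        if k == 1 or n < 2:  # Met, Trp, or amino acid too rare to estimate F
            continue
        f = (n * sum((counts[c] / n) ** 2 for c in family) - 1) / (n - 1)
        if f > 0:
            homozygosity[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in homozygosity.items()}
    # If isoleucine is missing, approximate F3 by the mean of F2 and F4.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    if any(k not in mean_f for k in (2, 3, 4, 6)):
        return float("nan")
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # values above 61 are truncated to 61
```

Short coding sequences give noisy ENC estimates, so a value for a single small exon should be read with caution.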

PCR and sequencing:
Primers were designed with the Primer3 program (ROZEN and SKALETSKY 1998). Polymerase chain reactions were carried out using standard conditions (e.g., SCHMID et al. 1999). The primer sequences can be found in the supplementary information on our website (address below). PCR products were sequenced on an ABI377 automated sequencer using BIG-DYE Terminator chemistry and both the PCR and internal primers. Base-calling, sequence assembly, and sequence alignment were performed with Phred (EWING et al. 1998), Phrap (P. GREEN, unpublished data), and Consed (GORDON et al. 1998).

Lines:
The lines from D. melanogaster and D. simulans used for the population survey were collected in Harare, Zimbabwe and established as inbred isofemale lines. The DNA from these lines was prepared by standard protocols and purified with CsCl centrifugation. D. yakuba and D. erecta lines were obtained from the Drosophila Species Stock Center at Bowling Green. DNA was isolated from ~50 flies each using phenol/chloroform extraction and ethanol precipitation.

Sequence analysis:
The values of dn and ds were estimated with the maximum-likelihood method of YANG and NIELSEN (1998) using the F3×4 model (YANG 1999). Homologous sequences from D. melanogaster and D. simulans were downloaded from GenBank, and the coding sequences were extracted and semiautomatically aligned, using ClustalW (THOMPSON et al. 1994) and Perl scripts. DnaSP3.0 (ROZAS and ROZAS 1999) was used to calculate two estimates of nucleotide diversity, the average number of pairwise differences, π (NEI 1987), and an estimate of the mutation parameter 4Neµ, θ (WATTERSON 1975), and to perform tests of neutrality. The following tests were used: Tajima's D (TAJIMA 1989), Fu and Li's D with an outgroup (FU and LI 1993), the HKA test (HUDSON et al. 1987), and the MK test (MCDONALD and KREITMAN 1991). Further details about the tests can be found in the references.
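For readers unfamiliar with the diversity statistics, the following Python sketch shows how π, Watterson's θ, and Tajima's D can be computed from a small alignment. It is an illustration under stated assumptions and not a substitute for DnaSP: sequences are assumed to be aligned and of equal length, alignment columns containing gaps or ambiguous bases are dropped, and the function name `diversity` and its dictionary output are hypothetical.

```python
from itertools import combinations
from math import sqrt


def diversity(seqs):
    """Per-site pi, Watterson's theta, and Tajima's D for aligned sequences."""
    n = len(seqs)
    # Keep only alignment columns made of unambiguous bases.
    cols = [c for c in zip(*(s.upper() for s in seqs)) if all(b in "ACGT" for b in c)]
    sites = len(cols)
    seg = sum(len(set(c)) > 1 for c in cols)  # number of segregating sites, S

    # k: average number of pairwise differences over all sequence pairs.
    pairs = list(combinations(range(n), 2))
    k = sum(sum(c[i] != c[j] for c in cols) for i, j in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    result = {"sites": sites, "S": seg, "pi": k / sites, "theta": seg / a1 / sites}

    if seg == 0:
        result["tajima_d"] = float("nan")  # D is undefined without variation
        return result

    # Variance constants of Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    result["tajima_d"] = (k - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))
    return result
```

Called on a list of aligned strings, for example `diversity(["ACGT", "ACGA", "ACGT", "ACGT"])`, the function returns both diversity estimates per site, so they can be compared directly with values such as those reported in Table 2.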

Supplementary information:
Sequences were submitted to GenBank under accession nos. AF264913, AF264914, AF264915, AF264916, AF264917, AF264918, AF264919, AF264920, AF264921, AF264922, AF264923, AF264924, AF264925, AF264926, AF264927, AF264928, AF264929, AF264930, AF264931, AF264932, AF264933, AF264934, AF264935, AF264936, AF264937, AF264938, AF264939, AF264940, AF264941, AF264942, AF264943, AF264944, AF264945, AF264946, AF264947. Further information is available from our website at http://www.mbg.cornell.edu/aquadro/sequences.html.


*  RESULTS

Identification of candidate genes:
We analyzed genes from the annotated genome sequences located on the tip of the X chromosome from the EDGP (BENOS et al. 2000) and the annotated 3-Mb Adh region on the second chromosome (ASHBURNER et al. 1999). The three criteria for the selection of putative rapidly evolving genes were (i) no or little (<25%) sequence identity to genes from distant organisms, (ii) no or few matches to ESTs, and (iii) a low codon bias. The use of codon bias as an indicator of rapid amino acid sequence evolution was based on the following rationale: An analysis of codon usage patterns in Drosophila genes revealed that amino acids encoded by unpreferred codons tend to be less conserved (AKASHI 1996). This is probably due to a lack of selection for translational accuracy on functionally less important amino acid residues. Thus, proteins encoded by a large number of unpreferred codons should have many unconstrained amino acids and evolve rapidly. This hypothesis is confirmed by the codon usage patterns in rapidly evolving Drosophila genes (SCHMID et al. 1999). Additionally, a comparison of proteins from several species known to evolve under strong Darwinian selection also revealed that many of them show little codon bias (K. J. SCHMID, unpublished observation). It is important to note that codon usage is influenced by several factors (e.g., expression level and length of coding sequence) and may not be strongly correlated with rapid evolution of the amino acid sequence. Finally, nucleotide composition and patterns of codon usage are important criteria for gene prediction algorithms like GENEFINDER and GENSCAN (GREEN 1995; BURGE and KARLIN 1997). Genes with unusual patterns of codon usage should have, on average, lower scores in the prediction and might be incorrectly annotated genes.

To identify genes with a low codon usage bias, the ENC (WRIGHT 1990) was plotted against the GC content at synonymous codon positions (GC3) for the 220 genes of the Adh region and 236 genes from the tip of the X chromosome (Figure 1). We also compared the codon usage of predicted genes with their GENEFINDER and GENSCAN scores as obtained from ASHBURNER et al. (1999), but they did not reveal a simple relationship (results not shown). Genes with very low and very high ENC values tend to have lower GENEFINDER or GENSCAN scores. Preferred codons in D. melanogaster end in C or G (AKASHI 1995), and genes under selection for optimal codon usage should have a GC3 > 0.5. We selected genes with ENC > 55 and/or GC3 < 0.5 and no or weak similarity to genes from distant species (Table 1). Our sample also included several biased and conserved genes as controls. For example, to evaluate the relationship between codon usage and amino acid evolution, two members of a gene cluster encoding hypothetical metalloproteases were compared. BACR44L22.4 has an ENC value of 60.6 and is the most divergent member in pairwise comparisons of the six paralogs (ASHBURNER et al. 1999). The codon usage of BACR44L22.3 is more biased (ENC = 51.3) and it is the most conserved paralog in the cluster. Two additional "controls" were the highly biased genes DS01068.5 (ENC = 35.6) and DS00810.3 (ENC = 27.0), which are not conserved outside insects.
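The selection rule described above can be written down as a simple filter. The Python sketch below is only one possible reading of the criteria: the thresholds ENC > 55, GC3 < 0.5, and <25% identity come from the text, whereas the cutoff for "few" EST matches (`max_est_hits`) is not specified in the text and is a hypothetical parameter. The gene records in the usage example are invented placeholders, not entries of Table 1.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    enc: float            # effective number of codons
    gc3: float            # G+C fraction at synonymous third positions
    best_identity: float  # identity of best hit in distant species, 0.0 if none
    est_hits: int         # number of matching ESTs


def is_candidate(gene, enc_min=55.0, gc3_max=0.5, max_identity=0.25, max_est_hits=1):
    """Apply the three criteria: low codon bias, weak conservation, few ESTs.

    max_est_hits is an assumed cutoff; the text only says "no or few matches".
    """
    low_codon_bias = gene.enc > enc_min or gene.gc3 < gc3_max  # "and/or"
    weakly_conserved = gene.best_identity < max_identity
    rarely_expressed = gene.est_hits <= max_est_hits
    return low_codon_bias and weakly_conserved and rarely_expressed


# Invented example records, for illustration only.
genes = [
    Gene("example_1", enc=59.0, gc3=0.45, best_identity=0.0, est_hits=0),
    Gene("example_2", enc=35.0, gc3=0.80, best_identity=0.0, est_hits=0),
    Gene("example_3", enc=58.0, gc3=0.48, best_identity=0.60, est_hits=0),
]
candidates = [g.name for g in genes if is_candidate(g)]
```

Because the criteria are combined with a logical AND, relaxing any single threshold enlarges the candidate set, which is one way to explore how sensitive the selection is to each criterion.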



 
Figure 1. Relationship between GC content at synonymous sites and the effective number of codons (ENC; WRIGHT 1990) for all genes of the Adh region (A) and the tip of the X chromosome of D. melanogaster (B). Solid circles represent candidates for rapidly evolving genes. The x symbol shows the three "control" genes (see RESULTS). The line gives the expected relationship of GC content at synonymous sites and ENC of random sequences (WRIGHT 1990). ENC values of 61 indicate indiscriminate use of synonymous codons.


 

 
Table 1. List of candidate genes chosen for sequencing in D. simulans and D. yakuba

Sequence comparisons:
To obtain homologs from D. simulans and D. yakuba, primers were designed from the D. melanogaster sequence using GC-rich regions in or around exons. Among 16 primer pairs tested, 10 resulted in a PCR product in D. simulans and 6 in D. yakuba (Table 1). We expected that only a subset of primers would work in the other species, because the average divergence at silent sites is ~11% between D. melanogaster and D. simulans (BAUER and AQUADRO 1997; POWELL and MORIYAMA 1997) and 23% between D. melanogaster and D. yakuba (SCHMID and TAUTZ 1997). Out of 10 PCR products obtained from D. simulans, 7 could be sequenced successfully, and 5 could be sequenced from D. yakuba. Only one of the three high codon bias control genes could be amplified and sequenced successfully (BACR44L22.3) in both D. simulans and D. yakuba.

An alignment of the sequences revealed many nucleotide substitutions and insertions/deletions (indels). Among the six genes with low codon bias, the putative coding region of four genes showed out-of-frame indels in a comparison between D. melanogaster and D. simulans, D. yakuba, or D. erecta. These genes include DS01514.3, DS03192.3, and exon 1 of DS07721.6, which are probably incorrectly annotated exons. In two genes, we observed several in-frame indels (DS06283.4 and exon 3 of DS07721.6). In the comparison between D. melanogaster and D. erecta homologs of EG0007.10, an out-of-frame indel in the 3' region of the coding sequence leads to a longer protein in D. erecta. It is unclear whether this gene encodes a functional protein. Estimates of dn and ds are given in Table 1. Only DS07721.6 can be considered to be a rapidly evolving protein (dn = 0.0494) between D. melanogaster and D. simulans, whereas all other genes are more conserved and exhibit dn values that are not significantly different from the control gene (BACR44L22.3).
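The distinction between in-frame and out-of-frame indels rests on whether the length of each alignment gap is a multiple of three. A minimal Python sketch of this check is given below. It assumes a pairwise alignment of two coding sequences with gaps written as "-", and the function name `classify_indels` is introduced here for illustration. It treats every gap run independently, so two nearby out-of-frame indels that together restore the reading frame would still be reported as out-of-frame.

```python
import re


def classify_indels(aligned_ref, aligned_other):
    """List gap runs in a pairwise alignment and classify them by reading frame.

    Returns tuples (alignment_start, length, gapped_sequence, frame), where
    gapped_sequence is "ref" or "other" and frame is "in-frame" if the gap
    length is a multiple of three, otherwise "out-of-frame".
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")
    events = []
    for label, seq in (("ref", aligned_ref), ("other", aligned_other)):
        for match in re.finditer(r"-+", seq):
            length = match.end() - match.start()
            frame = "in-frame" if length % 3 == 0 else "out-of-frame"
            events.append((match.start(), length, label, frame))
    return sorted(events)


# Toy alignment: one 3-bp gap and one 1-bp gap in the second sequence.
events = classify_indels("ATGAAACCCGGG", "ATG---CCCG-G")
```

A predicted exon in which such a scan finds out-of-frame events in a close relative is a candidate for re-annotation, which is the reasoning applied to the genes listed above.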

DS07721.6 is predicted to encode a large protein of 1585 amino acids of unknown function (ASHBURNER et al. 1999). Secondary structure analysis of the protein sequence suggests that it is a transmembrane protein (data not shown). Because of its length, we focused our sequencing on the extracellular domain (Figure 2). The nonsynonymous divergence of DS07721.6 is similar to anon1G5, the most rapidly evolving gene from an earlier screen for rapidly evolving genes (SCHMID and TAUTZ 1997), which also exhibits in-frame indels in comparisons between D. melanogaster, D. simulans, and D. yakuba. Although the first exon of DS07721.6 contains two out-of-frame indels, we consider it to be a functional gene because the numerous indels in exon 3 are in frame and there is a weak but significant sequence similarity to REJ-domain-containing proteins from other animal phyla (data not shown). The rapid evolution of parts of DS07721.6 raises the question of whether this is due to a high rate of neutral evolution or positive Darwinian selection. A sliding window analysis of the dn and ds values along the coding sequence of exons 3–6 shows that the nonsynonymous sequence divergence is relatively constant whereas the synonymous sequence divergence is highly variable between different regions of the coding sequence (Figure 2B). Interestingly, in the fragment encoding the region C-terminal of the REJ module, dn drops to zero and ds increases up to 0.4, which is much higher than the expected neutral sequence divergence. This fragment may consist of a mutational hotspot combined with strong constraints on nonsynonymous substitutions.



 
Figure 2. (A) Schematic structure of gene DS07721.6. Shaded boxes designate exons, solid arrowheads show in-frame indels (multiples of three), and open arrowheads show out-of-frame indels (not multiples of three). Numbers above the arrowheads are the lengths of insertions in base pairs. Bars show the regions sequenced in D. simulans or D. yakuba. (B) Schematic structure of the predicted protein sequence of DS07721.6. The locations of the moderately conserved REJ module and the single transmembrane helix (TM) are shown as shaded boxes. The graph shows a sliding window analysis of the dn and ds values in the D. melanogaster-D. simulans comparison using a window size of 90 codons and a step size of one. The sliding window analysis of dn and ds was performed with the program wina (ENDO et al. 1996). The bar shows the region surveyed in populations of D. melanogaster and D. simulans.
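The windowing scheme in (B), 90 codons wide and moved one codon at a time, can be sketched in Python as follows. This is not the wina program. It assumes that per-codon counts of differences and of sites of one class (synonymous or nonsynonymous) have already been obtained by some counting method, and it applies a Jukes-Cantor correction to the windowed proportion, which is a simplification compared with the maximum-likelihood estimates used elsewhere in this study. The names `sliding_window` and `jukes_cantor` are hypothetical.

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if p >= 0.75:
        return float("inf")  # correction is undefined at saturation
    return -0.75 * log(1 - 4 * p / 3)


def sliding_window(diffs, sites, window=90, step=1):
    """Windowed divergence along a coding sequence.

    diffs[i] and sites[i] are the number of differences and the number of
    sites of one class in codon i. Returns (start_codon, distance) pairs.
    """
    if len(diffs) != len(sites):
        raise ValueError("diffs and sites must have the same length")
    out = []
    for start in range(0, len(diffs) - window + 1, step):
        d = sum(diffs[start:start + window])
        s = sum(sites[start:start + window])
        if s == 0:
            out.append((start, float("nan")))  # no sites of this class in window
        else:
            out.append((start, jukes_cantor(d / s)))
    return out
```

Running the function once with synonymous counts and once with nonsynonymous counts gives two curves that can be plotted against codon position, in the manner of the graph described in the caption.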

DNA polymorphism in DS07721.6:
Because of the high rate of amino acid evolution and silent divergence, we obtained sequences of exons 3 and 4 from 10 lines each of African populations of D. melanogaster and D. simulans. A comparison of intraspecific polymorphism and interspecific divergence can be used to discriminate between neutral evolution and positive Darwinian selection as the causes for the rapid evolution of these genes. We chose the African lines because they represent ancestral populations of both species and are probably close to a mutation-selection-drift equilibrium (BEGUN and AQUADRO 1993).

Nucleotide diversity is low in the 858 bases in D. melanogaster (π = 0.0018; Table 2). Only 6 polymorphisms were discovered; 3 of them are synonymous and 3 nonsynonymous. In D. simulans, 27 polymorphisms are segregating in the sample (π = 0.0124), of which 10 are synonymous and 17 nonsynonymous. The level of DNA diversity (π) is about seven times higher than in D. melanogaster, which is within the range observed for other genes that were surveyed in both species (MORIYAMA and POWELL 1996). The large number of replacement polymorphisms is consistent with the rapid evolution of this region of DS07721.6. Most other surveyed genes have much smaller numbers of nonsynonymous polymorphisms (MORIYAMA and POWELL 1996). Several tests of neutrality were applied to the data and none of them rejected the null hypothesis of neutral evolution (Table 2 and Table 3). The HKA test was not significant in comparisons of DS07721.6 to various neutrally evolving reference loci (anon1A3, anon1E9, and anon1G5; SCHMID et al. 1999), although it was marginally significant in D. melanogaster when the Adh-5' region of KREITMAN and HUDSON (1991) was used for comparison (χ² = 3.824, P = 0.0505).


 

 
Table 2. Nucleotide diversity in exon 4 of DS07721.6 in D. melanogaster and D. simulans


 

 
Table 3. McDonald-Kreitman test for DS07721.6

We also looked at lineage-specific substitutions to analyze the effect of different species-level effective population sizes on the evolution and polymorphism of this region (see SCHMID et al. 1999 for a more detailed discussion). Using D. yakuba as an outgroup, 37 out of 41 fixed differences could be assigned to either the D. melanogaster or D. simulans lineages. Thirteen nonsynonymous and 6 synonymous substitutions occurred in the D. melanogaster lineage and 14 nonsynonymous and 4 synonymous substitutions in the D. simulans lineage. The numbers for the two types of substitutions differ little between the two lineages. This suggests that the nonsynonymous substitutions are either completely neutral or have been fixed by relatively strong positive selection that occurred in both lineages.
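The logic of the McDonald-Kreitman test, a comparison of the ratio of nonsynonymous to synonymous changes among fixed differences with the same ratio among polymorphisms, can be illustrated with a 2 x 2 contingency table and Fisher's exact test. The Python sketch below uses only the standard library. The counts in the usage example are assembled from the numbers given in the text (polymorphisms pooled over both species, and the 37 lineage-assigned fixed differences); they are an illustrative assumption and are not necessarily the counts used in Table 3.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def mk_test(fixed_n, fixed_s, poly_n, poly_s):
    """Return (neutrality index, two-sided P) for McDonald-Kreitman counts."""
    neutrality_index = (poly_n / poly_s) / (fixed_n / fixed_s)
    return neutrality_index, fisher_exact_two_sided(fixed_n, fixed_s, poly_n, poly_s)


# Counts taken from the prose (an assumption, see the lead-in):
# fixed: 13 + 14 nonsynonymous, 6 + 4 synonymous; polymorphic: 3 + 17 and 3 + 10.
ni, p_value = mk_test(fixed_n=27, fixed_s=10, poly_n=20, poly_s=13)
```

An excess of nonsynonymous fixed differences relative to polymorphisms (neutrality index below 1 with a small P value) would point to positive selection, whereas a nonsignificant result, as reported here for DS07721.6, is compatible with neutral evolution.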


*  DISCUSSION

Identifying rapidly evolving genes:
Our motivation for this study was to test whether orphans in the Drosophila genome are rapidly evolving genes and, if so, whether they evolve neutrally because of relaxed constraints or positive selection. Rapidly evolving genes have recently attracted considerable interest, because they might play a role in the adaptive evolution of phenotypic traits (e.g., MURPHY 1993; SWANSON and VACQUIER 1995; PAMILO and O'NEILL 1997; CIVETTA and SINGH 1998; MICHELMORE and MEYERS 1998; DUDA and PALUMBI 1999; YOKOYAMA et al. 1999; WYCKOFF et al. 2000). An understanding of the evolution and function of such "adaptive trait loci" may be highly relevant to the study of species differences (TAUTZ and SCHMID 1998). Thus, after rapidly evolving genes have been identified, it is of interest to test whether they diverge neutrally or are responding to positive Darwinian selection.

Since no extensive genomic sequence from a closely related species of D. melanogaster is available, we identified candidate genes by a synopsis of data on sequence features, function, and evolutionary conservation in distantly related organisms. The genes of the Adh region and the tip of the X chromosome region are good candidates for testing such an approach because they are among the best-characterized regions of the Drosophila genome and much information on sequence conservation, expression, and genetic function is available. Our criteria for selecting putative rapidly evolving genes were codon usage, a low level of expression, and no or weak similarity to distant organisms. Among four surveyed genes without codon bias that retained an intact reading frame in D. simulans or D. yakuba, only one (DS07721.6) was rapidly evolving at the amino acid level, suggesting that a lack of codon usage alone may not be a good indicator for the discovery of rapidly evolving genes. This notion is further supported by a comparison of ENC with dn for genes (including those of this study) where partial or complete coding sequences were available from D. melanogaster and D. simulans (n = 85). Although we find a highly significant negative correlation between codon usage bias and nonsynonymous divergence (Figure 3A), it is not very strong. This suggests that, although there is evidence for selection on translational accuracy, additional factors such as gene length (COMERON et al. 1999), expression level (SHIELDS et al. 1988; POWELL and MORIYAMA 1997; DURET and MOUCHIROUD 1999), mutation bias (KLIMAN and HEY 1994), and local rates of recombination (KLIMAN and HEY 1993) also influence codon usage patterns in the genome of Drosophila. These additional factors may blur the relationship between codon usage and nonsynonymous divergence.
Furthermore, under selection for translational accuracy, a positive relationship between ENC and ds is also expected, as has been found in several studies (e.g., SHARP and LI 1989). In a more recent study, however, such a relationship was not obtained, and simulations suggested that such a relationship reflects different assumptions in the estimation of nucleotide divergence (DUNN et al. 2001). Using the same maximum-likelihood method for estimating nucleotide divergence as DUNN et al. (2001) and with a larger number of genes, we also do not find a significant correlation between ENC and synonymous divergence in comparisons between D. melanogaster and D. simulans (Figure 3B). One explanation for the absence of such a correlation may be variable mutational pressures in different evolutionary lineages, which can lead to a negative correlation between ENC and ds (BIELAWSKI et al. 2000). In addition, our data do not show a positive correlation between dn and ds (R² = 0.02, P = 0.25), which is in contrast to earlier studies (AKASHI 1994; COMERON and KREITMAN 1998; DUNN et al. 2001). However, since we are mainly interested in the relationship between ENC and dn, the lack of such a relationship has no consequences for our study. In conclusion, it can be stated that, although there is a positive correlation between dn and ENC, the use of codon bias alone is not sufficient for a reliable identification of rapidly evolving genes.



 
Figure 3. Correlation between the effective number of codons, ENC (calculated from the D. melanogaster sequences), and the number of nonsynonymous substitutions, dn (A), or synonymous substitutions, ds (B), calculated from alignments of homologous sequences from D. melanogaster and D. simulans (n = 85) that were retrieved from GenBank or sequenced in this study.

Therefore, additional information about gene function needs to be taken into account for generating better predictions of rapidly evolving genes from single genome sequences. Such information can be the type and strength of mutant phenotypes (ASHBURNER et al. 1999), the tissue where genes are expressed (HURST and SMITH 1999), or the type of protein that is encoded by a gene (e.g., subcellular location). For example, DS06238.4 is probably identical to the pupal gene, which has a lethal phenotype. Under the assumption that functionally important genes should be more conserved (WILSON et al. 1977), rapid sequence divergence is not expected in this gene. In fact, its sequence is highly conserved in distant insects but not in other phyla, suggesting that its occurrence is restricted to insects, where it may have acquired an essential function. The lack of codon bias in this gene could be related to its repetitive amino acid sequence.

Causes of rapid evolution:
Only one of the candidate genes we examined was apparently both functional and rapidly evolving. Neither the sequence comparisons between Drosophila species nor the population variation analysis of the rapidly evolving gene DS07721.6 revealed evidence for positive selection being important for its evolution. The levels of nonsynonymous divergence and replacement polymorphisms are very similar to other rapidly evolving orphan genes (SCHMID et al. 1999). These results together suggest that the primary sequence of numerous (correctly annotated) orphan genes may evolve relatively unconstrained at the amino acid level. Whereas the criteria we used are expected to be compatible with the identification of genes evolving under relaxed selective constraints, low levels of expression (indicated by the absence of EST matches) and low codon bias may not necessarily be a characteristic of genes evolving under positive Darwinian selection. However, one can expect that many genes evolving under positive selection have specialized functions with a restricted expression (TAUTZ and SCHMID 1998) and therefore may not be represented in current EST collections. This notion is supported by a recent EST sequencing study of genes expressed in the testis of D. melanogaster, which found that about one-half of 1560 cDNA sets fail to align with existing Drosophila ESTs (ANDREWS et al. 2000). This suggests that many tissue-specific genes have not yet been discovered, although they may be expressed at a high level within a tissue. As EST collections grow in size, information about the number of tissues in which genes are expressed can be used to identify rapidly evolving genes. There is little theoretical support for the notion that genes evolving under positive selection can be expected to have low codon bias, but one can assume that selection for translational accuracy may not be very strong in such genes. This hypothesis is consistent with the observation of several genes encoding male accessory gland proteins that evolve under positive Darwinian selection and are characterized by low codon bias (BEGUN et al. 2000).

It should be noted that most tests employed for detecting positive selection are not very powerful in detecting weak or episodic selection (for a more detailed discussion, see SCHMID et al. 1999). More powerful tests need to be developed for detecting these types of adaptive molecular evolution. Generation of data for such genes from additional species may allow codon-specific models to be used such as those developed by Z. Yang and R. Nielsen (e.g., YANG et al. 2000). Such a study of DS07721.6 may reveal that positive selection at a subset of the amino acids, coupled with selective constraint at others, may account for its rapid evolution shown here.

Improving the annotation:
A surprising result of our survey is the large proportion of incorrectly annotated genes. In four out of six candidate genes, the putative open reading frame contained out-of-frame indels in either D. simulans, D. yakuba, or D. erecta. Two of these sequences may not be protein-coding genes at all. Furthermore, there are at least two additional paralogs of gene DS07721.6 in the Drosophila genome that were not recognized and annotated by the gene prediction algorithms used for the annotation (data not shown). These observations confirm the conclusions of the Drosophila Genome Annotation Assessment Project (GASP; REESE et al. 2000) that, even in the relatively compact Drosophila genome, purely computer-based gene annotations (ab initio predictions) both over- and underpredict genes. Many predictions contain errors (e.g., the incorrect identification of the 5' end of open reading frames), particularly for genes with a lack of sequence conservation or with unusual patterns of codon usage like the candidate genes of this study (GUIGO et al. 2000). Gene predictions need additional experimental verification such as full-length cDNA sequencing, sequencing of ESTs from tissue-specific libraries (ANDREWS et al. 2000), or, as described in this study, sequencing of homologous genes from closely related species. It should be noted that our small sample of genes does not allow an estimation of how many predicted genes contain annotation errors. However, we expect that a substantial proportion of nonconserved genes may be overpredicted and that many genes not recognized by prediction algorithms may consist of rapidly evolving genes.

Comparative sequencing of related species:
The fact that only one of six candidate genes evolves rapidly suggests that the identification of such genes in single genomic sequences is difficult, in particular because of the requirement of a correctly annotated sequence. In addition, the PCR approach used in this pilot study is not practical for analyzing a large number of candidate genes because about one-half of the primer pairs designed using the D. melanogaster sequence did not work in D. simulans or D. yakuba. However, because numerous rapidly evolving genes can be expected in the genome of D. melanogaster and other model organisms (SCHMID and TAUTZ 1997 Down), alternative approaches might be taken to identify such genes on a large scale. Possible approaches include the sequencing of the complete genome (at low coverage) or of ESTs from cDNA libraries of closely related "satellite" species. Suitable species for comparisons to D. melanogaster are D. simulans or D. yakuba. Values of ds range from 0.05 to 0.18 between D. melanogaster and D. simulans (BAUER and AQUADRO 1997 Down; POWELL and MORIYAMA 1997 Down) and from 0.11 to 0.35 between D. melanogaster and D. yakuba (SCHMID and TAUTZ 1997 Down). dn and ds values from such comparisons give good estimates of sequence divergence and facilitate the genome-wide identification of rapidly evolving genes like DS07721.6 that are candidates for positively selected genes. In addition, comparisons between D. melanogaster and D. simulans or D. yakuba are sufficiently divergent to detect incorrectly annotated exons because of the large number of point and indel mutations that are being fixed by chance in noncoding sequences. 
Such approaches would not only lead to the identification of rapidly evolving genes with potential roles in the phenotypic divergence of species and enhance our understanding of genome-wide patterns of protein evolution but also assist in the correct annotation of "difficult" genes for which currently available gene prediction methods are not reliable.
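
A genome-wide screen of this kind could be sketched as follows. The sketch assumes that dn and ds have already been estimated for each gene by a separate program, and it simply ranks genes by the ratio dn/ds. The gene names, the numerical values, and the `min_ds` cutoff are invented for illustration and are not results of this study.

```python
def rank_by_dn_ds(estimates, min_ds=0.01):
    """Rank genes by dn/ds, highest ratio first.

    estimates: dict mapping gene name -> (dn, ds).
    min_ds:    genes with ds below this value are skipped, because the
               ratio is undefined or unstable when ds is near zero
               (the default is an arbitrary illustrative choice).
    """
    ranked = []
    for gene, (dn, ds) in estimates.items():
        if ds < min_ds:
            continue
        ranked.append((gene, dn / ds))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical estimates for three genes (values invented).
example_estimates = {
    "geneA": (0.02, 0.10),
    "geneB": (0.09, 0.12),
    "geneC": (0.01, 0.00),  # skipped: ds is below min_ds
}
example_ranking = rank_by_dn_ds(example_estimates)
```

Genes at the top of such a ranking would only be candidates. As the population survey of DS07721.6 shows, a high rate of amino acid replacement can also result from the fixation of neutral or nearly neutral mutations, so a high ratio alone does not demonstrate positive selection.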


*  ACKNOWLEDGMENTS

We are grateful to B. Haubold, W. Swanson, T. Wiehe, M. Aguadé, and two anonymous reviewers for helpful comments on the manuscript. This work was funded by a postdoctoral fellowship of the German Academic Exchange Service (DAAD) to K.J.S. and a National Institutes of Health (NIH) grant to C.F.A.

Manuscript received November 14, 2000; Accepted for publication July 6, 2001.


*  LITERATURE CITED

ADAMS, M., S. CELNIKER, R. HOLT, C. EVANS, and J. GOCAYNE et al., 2000 The genome sequence of Drosophila melanogaster. Science 287:2185-2195.

AKASHI, H., 1994 Synonymous codon usage in Drosophila melanogaster: Natural selection and translational accuracy. Genetics 136:927-935.

AKASHI, H., 1995 Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076.

AKASHI, H., 1996 Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino substitution, and larger proteins in D. melanogaster. Genetics 144:1297-1307.

ALTSCHUL, S., T. MADDEN, A. SCHÄFFER, J. ZHANG, and Z. ZHANG et al., 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

ANDREWS, J., G. BOUFFARD, C. CHEADLE, J. LÜ, and K. BECKER et al., 2000 Gene discovery using computational and microarray analysis of transcription in the Drosophila melanogaster testis. Genome Res. 10:2030-2043.

ASHBURNER, M., S. MISRA, J. ROOTE, S. LEWIS, and R. BLAZEJ et al., 1999 An exploration of the sequence of a 2.9-Mb region of the genome of Drosophila melanogaster—The Adh region. Genetics 153:179-219.

BAUER, V. and C. AQUADRO, 1997 Rates of DNA sequence evolution are not sex-biased in Drosophila melanogaster and Drosophila simulans. Mol. Biol. Evol. 14:1252-1257.

BEGUN, D. and C. AQUADRO, 1993 African and North American populations of Drosophila melanogaster are very different at the DNA level. Nature 365:548-550.

BEGUN, D., P. WHITLEY, B. TODD, H. WALDRIP-DAIL, and A. CLARK, 2000 Molecular population genetics of male accessory gland proteins in Drosophila. Genetics 156:1879-1888.

BENOS, P., M. GATT, M. ASHBURNER, L. MURPHY, and D. HARRIS et al., 2000 From sequence to chromosome: the tip of the X chromosome of D. melanogaster. Science 287:2220-2222.

BIELAWSKI, J., K. DUNN, and Z. YANG, 2000 Rates of nucleotide substitution and mammalian nuclear gene evolution: approximate and maximum-likelihood methods lead to different conclusions. Genetics 156:1299-1308.

BORK, P. and E. KOONIN, 1998 Predicting functions from protein sequences: where are the bottlenecks? Nat. Genet. 18:313-318.

BORK, P., T. DANDEKAR, Y. DIAZ-LAZCOZ, F. EISENHABER, and M. HUYNEN et al., 1998 Predicting function: from genes to genomes and back. J. Mol. Biol. 283:707-725.

BRENNER, S., 1999 Errors in genome annotation. Trends Genet. 15:132-133.

BRENNER, S., C. CHOTHIA, and T. HUBBARD, 1997 Population statistics of protein structures: lessons from structural classifications. Curr. Opin. Struct. Biol. 7:369-376.

BURGE, C. and S. KARLIN, 1997 Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94.

CHOTHIA, C., 1992 One thousand families for the molecular biologist. Nature 357:543-544.

CIVETTA, A. and R. S. SINGH, 1998 Sex-related genes, directional sexual selection, and speciation. Mol. Biol. Evol. 15:901-909.

COMERON, J. and M. KREITMAN, 1998 The correlation between synonymous and nonsynonymous substitution in Drosophila: mutation, selection, or relaxed constraints? Genetics 150:767-775.

COMERON, J., M. KREITMAN, and M. AGUADÉ, 1999 Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics 151:239-249.

DUDA, T. F. and S. R. PALUMBI, 1999 Molecular genetics of evolutionary diversification: duplication and rapid evolution of toxin genes of the venomous gastropod Conus. Proc. Natl. Acad. Sci. USA 96:6820-6823.

DUNN, K., J. BIELAWSKI, and Z. YANG, 2001 Substitution rates in Drosophila nuclear genes: implications for translational selection. Genetics 157:295-305.

DURET, L. and D. MOUCHIROUD, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila and Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482-4487.

ENDO, T., K. IKEO, and T. GOJOBORI, 1996 Large-scale search for genes on which positive selection may operate. Mol. Biol. Evol. 13:685-690.

EWING, B., L. HILLIER, M. WENDL, and P. GREEN, 1998 Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.

FU, Y.-X. and W.-H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133:693-709.

GORDON, D., C. ABAJIAN, and P. GREEN, 1998 Consed: a graphical tool for sequence finishing. Genome Res. 8:195-202.

GREEN, P., 1995 GENEFINDER Documentation (http://genetics.mgh.harvard.edu/doc/genefinder.doc.html).

GUIGÓ, R., P. AGARWAL, J. ABRIL, M. BURSET, and J. FICKETT, 2000 An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 10:1631-1642.

HIETER, P. and M. BOGUSKI, 1997 Functional genomics: it's all how you read it. Science 278:601-602.

HUDSON, R. R., M. KREITMAN, and M. AGUADÉ, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116:153-159.

HURST, L. and N. SMITH, 1999 Do essential genes evolve slowly? Curr. Biol. 9:747-750.

KLIMAN, R. and J. HEY, 1993 Reduced natural selection associated with low recombination in Drosophila melanogaster. Mol. Biol. Evol. 10:1239-1258.

KLIMAN, R. and J. HEY, 1994 The effects of mutation and natural selection on codon bias in the genes of Drosophila. Genetics 137:1049-1056.

KREITMAN, M. and R. HUDSON, 1991 Inferring the evolutionary histories of the Adh and Adh-dup loci in Drosophila melanogaster from patterns of polymorphism and divergence. Genetics 127:565-582.

MCDONALD, J. and M. KREITMAN, 1991 Adaptive evolution at the Adh locus in Drosophila. Nature 351:652-654.

MICHELMORE, R. W. and B. C. MEYERS, 1998 Clusters of resistance genes in plants evolve by divergent selection and a birth-and-death process. Genome Res. 8:1113-1130.

MORIYAMA, E. N. and J. R. POWELL, 1996 Intraspecific nuclear DNA variation in Drosophila. Mol. Biol. Evol. 13:261-277.

MURPHY, P. M., 1993 Molecular mimicry and the generation of host defense protein diversity. Cell 42:823-826.

NEI, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York.

OLIVER, S., 1996 From DNA sequence to biological function. Nature 379:597-600.

PAMILO, P. and R. J. W. O'NEILL, 1997 Evolution of the Sry genes. Mol. Biol. Evol. 14:49-55.

POWELL, J. and E. MORIYAMA, 1997 Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94:7784-7790.

REESE, M., G. HARTZELL, N. HARRIS, U. OHLER, and J. ABRIL et al., 2000 Genome annotation assessment in Drosophila melanogaster. Genome Res. 10:483-501.

ROZAS, J. and R. ROZAS, 1999 DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15:174-175.

ROZEN, S. and H. SKALETSKY, 1998 Primer3 (code available at http://www-genome.wi.mit.edu).

SCHMID, K. J. and D. TAUTZ, 1997 A screen for fast evolving genes from Drosophila. Proc. Natl. Acad. Sci. USA 94:9746-9750.

SCHMID, K. J., L. NIGRO, C. F. AQUADRO, and D. TAUTZ, 1999 Large number of replacement polymorphisms in rapidly evolving genes of Drosophila: implications for genome-wide surveys of DNA polymorphism. Genetics 153:1717-1729.

SHARP, P. and W.-H. LI, 1989 On the rate of DNA sequence evolution in Drosophila. J. Mol. Evol. 28:398-402.

SHIELDS, D., P. SHARP, D. HIGGINS, and F. WRIGHT, 1988 "Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5:704-716.

SWANSON, W. J. and V. D. VACQUIER, 1995 Extraordinary divergence and positive Darwinian selection in a fusagenic protein coating the acrosomal process of abalone spermatozoa. Proc. Natl. Acad. Sci. USA 92:4957-4961.

TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.

TAUTZ, D. and K. J. SCHMID, 1998 From genes to individuals—developmental genes and the generation of the phenotype. Proc. R. Soc. London Ser. B 353:231-240.

THOMPSON, J. D., D. G. HIGGINS, and T. J. GIBSON, 1994 CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.

WILSON, A., S. CARLSON, and T. WHITE, 1977 Biochemical evolution. Annu. Rev. Biochem. 46:573-639.

WRIGHT, F., 1990 The ‘effective number of codons’ used in a gene. Gene 87:23-29.

WYCKOFF, G., W. WANG, and C. WU, 2000 Rapid evolution of male reproductive genes in the descent of man. Nature 403:304-309.

YANG, Z., 1999 Phylogenetic Analysis Using Maximum Likelihood (PAML), Version 2. University College, London.

YANG, Z. and R. NIELSEN, 1998 Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409-418.

YANG, Z., R. NIELSEN, N. GOLDMAN, and A. PEDERSEN, 2000 Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.

YOKOYAMA, S., H. ZHANG, F. B. RADLWIMMER, and N. S. BLOW, 1999 Adaptive evolution of color vision of the Comoran coelacanth (Latimeria chalumnae). Proc. Natl. Acad. Sci. USA 96:6279-6284.



