In many biological systems, especially bacteria and unicellular eukaryotes, rates of synonymous and nonsynonymous nucleotide divergence are negatively correlated with the level of gene expression, a phenomenon that has been attributed to natural selection. Surprisingly, this relationship has not been examined in many important groups, including the unicellular model organism Chlamydomonas reinhardtii. Prior to this study, comparative data on protein-coding sequences from C. reinhardtii and its close noninterfertile relative C. incerta were very limited. We compiled and analyzed protein-coding sequences for 67 nuclear genes from these taxa; the sequences were mostly obtained from the C. reinhardtii EST database and our C. incerta EST data. Compositional and synonymous codon usage biases varied among genes within each species but were highly correlated between the orthologous genes of the two species. Relative rates of synonymous and nonsynonymous substitution across genes varied widely and showed a strong negative correlation with the level of gene expression estimated by the codon adaptation index. Our comparative analysis of substitution rates in introns of lowly and highly expressed genes suggests that natural selection has a larger contribution than mutation to the observed correlation between evolutionary rates and gene expression level in Chlamydomonas.
In diverse biological lineages the degree of nonrandom usage of synonymous codons and the rate of evolutionary change of a gene are often related to its level of expression. For example, in bacteria, yeast, Caenorhabditis elegans, Drosophila, and Arabidopsis, codon usage bias of genes appears to be positively correlated with the level of gene expression (Gouy and Gautier 1982; Sharp et al. 1986; Stenico et al. 1994; Chiapello et al. 1998; Duret and Mouchiroud 1999) and it is generally accepted that this is the result of elevated selection for a preferred set of codons that enhance translational efficiency in highly expressed genes (for reviews see Akashi and Eyre-Walker 1998; Akashi 2001). Moreover, in bacteria (Sharp and Li 1987a; Sharp 1991; Smith and Eyre-Walker 2001) and yeast (Pal et al. 2001; Hirsh et al. 2005), rates of synonymous substitutions are negatively correlated with the level of gene expression and it is generally thought that this is the result of purifying selection for efficient translation of highly expressed genes, although variation in mutation rate has been invoked to explain this correlation in some cases (Berg and Martelius 1995; Eyre-Walker and Bulmer 1995; but see Ochman 2003). In some multicellular lineages, however, including mammals and Arabidopsis, a correlation between synonymous rate and gene expression level could not be established (Duret and Mouchiroud 2000; Wright et al. 2002, 2004) and while there is evidence supporting such a correlation in Drosophila (e.g., Powell and Moriyama 1997), this correlation has been questioned (Dunn et al. 2001; but see Bierne and Eyre-Walker 2003 and Marais et al. 2004).
Rates of nonsynonymous substitution, which typically show greater variation among genes than synonymous rates, also appear correlated with level of gene expression in diverse lineages including various bacteria, yeast, land plants, Drosophila, and mammals (Duret and Mouchiroud 2000; Pal et al. 2001; Wright et al. 2002, 2004; Marais et al. 2004; Rispe et al. 2004; Rocha and Danchin 2004; Subramanian and Kumar 2004; Lemos et al. 2005). The basis for the connection between gene expression and protein evolution is the subject of ongoing debate. It has been suggested that nonsynonymous sites in highly expressed genes are under enhanced selective constraint to optimize the speed and accuracy of protein synthesis (Akashi 2001, 2003). Alternatively, genes expressed broadly across tissues, which are also highly expressed, may be under enhanced functional constraint because their products need to function in diverse biochemical/biophysical environments (Hastings 1996) or because nonsynonymous changes in these genes affect a greater number of tissues and therefore have a greater impact on fitness (Duret and Mouchiroud 2000).
The green algae represent a major biological group of eukaryotes for which there has been no large-scale examination of evolutionary rates in nuclear protein-coding genes. Chlamydomonas reinhardtii represents an obvious candidate for such a study, as this green algal species has an extensive and annotated expressed sequence tag (EST) database and a genome project near completion. The nuclear genome of C. reinhardtii has an overall GC content of nearly 65% (Grossman et al. 2003) and the nucleus-encoded genes of this taxon have a preponderance of codons that end in G or C, although the codon usage bias varies considerably among genes (LeDizet and Piperno 1995). Correspondence analysis of relative synonymous codon usage (RSCU) values has established that C. reinhardtii genes are positioned along the first axis according to the level of gene expression as approximated by EST abundance (Naya et al. 2001). The relationship between codon usage and level of gene expression in C. reinhardtii is consistent with the observation that highly expressed genes exhibit lower values of the effective number of codons (Nc). Furthermore, genes with the highest number of EST matches use a preferred set of codons, which are rich in C at fourfold degenerate sites relative to those with only one EST match (Naya et al. 2001).
C. incerta SAG 7.73 is the closest known noninterfertile relative of C. reinhardtii (Schlösser 1976; Coleman and Mai 1997; Ferris et al. 1997; Liss et al. 1997; Pröschold et al. 2001), with the possible exception of Chlamydomonas sp. (CCAP 11/132) (Pröschold et al. 2005). Studies on the level of sequence divergence between C. incerta and C. reinhardtii are limited to three spliceosomal introns (Liss et al. 1997), two intergenic spacers (Coleman and Mai 1997), and a few protein-coding genes (Ferris et al. 1997). Clearly, an expanded analysis of orthologous gene sequences from these taxa is needed to gain a better view of the extent and causes of the variation in evolutionary divergence among genes in Chlamydomonas.
In this study we investigate the relationship between the rates of nucleotide divergence and the level of gene expression in C. reinhardtii and C. incerta. We compiled a greatly expanded data set of 67 protein-coding sequences by utilizing our C. incerta EST data prepared for the Protist EST Program (PEP) (http://megasun.bch.umontreal.ca/pepdb/pep.html), the C. reinhardtii EST database (http://www.chlamy.org), and GenBank (http://www4.ncbi.nlm.nih.gov). The specific objectives were (i) to compare synonymous codon usage in C. incerta with that of C. reinhardtii; (ii) to examine the diversity of synonymous and nonsynonymous substitution rates among genes between C. reinhardtii and C. incerta; (iii) to assess the relationship, if any, between the evolutionary rates, both synonymous and nonsynonymous, and the level of gene expression as estimated by the codon adaptation index (CAI), which is a measure of the degree to which the synonymous codon usage of a gene matches that of highly expressed genes (Sharp and Li 1987b); and (iv) to compare the rate of nucleotide substitution in introns of genes with a low CAI to those with a high CAI.
MATERIALS AND METHODS
C. incerta strain and growth conditions:
C. incerta was obtained from the Sammlung von Algenkulturen, Göttingen (SAG), Germany, where it is listed as SAG 7.73 and placed under the name C. reinhardtii on the basis of morphological criteria and susceptibility to autolysin from the C. reinhardtii group (SAG, personal communication). Strain CC-1870 previously maintained at the Chlamydomonas Genetic Center, Duke University, now replaced with strain CC-3871, and strain UTEX 2607 maintained at the University of Texas at Austin, were all derived from SAG 7.73 and should be equivalent (Ferris et al. 1997). Cells were grown under continuous “cool white” fluorescent lighting at 24° in (i) Tris–acetate–phosphate (TAP) medium (Harris 1989) with rotary shaking and 25 μmol photons·m⁻²·s⁻¹ photosynthetically active radiation (PAR) and (ii) high-salt (HS) medium (Harris 1989) bubbled with 1% CO₂ in air and 130 μmol photons·m⁻²·s⁻¹ PAR. Cells were harvested by centrifugation at 2000 × g at 4° when the cultures reached the late exponential phase of growth (OD₇₅₀ = 0.3 and 0.7 for the TAP- and HS-grown cultures, respectively). An equivalent mass of cells from each of the two culture conditions was lysed in TRIzol (Invitrogen, Carlsbad, CA) and the lysates were combined for RNA isolation. A pellet of HS-grown cells was resuspended in Tris–EDTA buffer (20 mM Tris, 100 mM EDTA, 100 mM NaCl, pH 8) for the isolation of DNA (Maniatis et al. 1982) used in PCR analysis.
Generation of the C. incerta (SAG 7.73) genomic sequences:
PCR amplification of several C. incerta genes was performed (i) to confirm that C. incerta strain SAG 7.73 is the same as the strain CC-1870 used by Ferris et al. (1997) and (ii) to obtain intron sequences to be used in substitution rate analyses. With regard to the first objective, the sequence of a 456-nucleotide segment of C. incerta (SAG 7.73) mid confirmed that this fragment is identical to the corresponding region of the C. incerta (CC-1870/CC-3871) mid deposited in GenBank (accession no. AF002710). The PCR reaction was performed using the following primers: 5′-TAG CCA GGT TCC GGT TCA A-3′ and 5′-CCA TCT GTC GAC GCC AAG T-3′. To study the intron substitution level, partial genomic sequences of cblP, GapA, and Tpx of SAG 7.73 were obtained using perfect match primer sequences based on the corresponding cDNA sequences obtained in this study. A 2124-nucleotide sequence of C. incerta sfa, not available as a cDNA, was amplified using a set of primer sequences based on the C. reinhardtii and C. eugametos orthologs: 5′-TGG AGC AGG AGA AGC AG-3′ and 5′-CGC CTT CGT GTA GTC GTT G-3′. Finally, the genomic sequence for pdk was obtained from GenBank for both C. reinhardtii and C. incerta (supplemental Table S1 at http://www.genetics.org/supplemental/).
C. incerta ESTs:
C. incerta (SAG 7.73) cDNA library construction and EST sequencing were part of the Protist EST program (http://megasun.bch.umontreal.ca/pepdb/pep.html). Non-normalized and normalized C. incerta cDNA libraries were constructed by DNA Technologies (Gaithersburg, MD). Inserts were unidirectionally cloned between the EcoRV and NotI sites of the pcDNA3.1 vector (Invitrogen). Sequencing was carried out at the National Research Council, Institute for Marine Biosciences, Halifax, Canada, and >95% of the ESTs were sequenced from the 5′-end; ∼85% of the sequenced ESTs were from the regular library. A total of 5124 quality- and vector-trimmed ESTs with an average length of 395 nucleotides were clustered into 1388 unique sequences (i.e., clusters and singletons) and 589 of these unique sequences gave BLASTX (Altschul et al. 1990) hits (expectation value cutoff of 10⁻⁵) against the GenBank nonredundant database. The automatic annotation procedure AutoFACT (Koski et al. 2005) was used to define the names of the C. incerta gene products.
Sequence retrieval and alignments:
The data set analyzed comprises 67 genes (Table 1 and supplemental Table S1 at http://www.genetics.org/supplemental/). Sixty-one C. incerta gene sequences were retrieved from the PEP database. The corresponding C. reinhardtii homologs were identified by running a BLASTN search (Altschul et al. 1990) of C. incerta unique sequences against the C. reinhardtii EST database from ChlamyDB (http://www.chlamy.org). In our C. incerta (SAG 7.73) cDNA library data, we found one EST sequence corresponding to yptC1. This sequence is different from the one reported by Ferris et al. (1997) as being the yptC1 of C. incerta (CC-1870/CC-3871). Our sequence, confirmed by sequencing the PCR product of the genomic yptC1, shows 22 synonymous substitution differences relative to the C. reinhardtii yptC1 sequence reported in GenBank (accession no. U13168), while the sequence reported by Ferris et al. (1997) differs from the C. reinhardtii GenBank entry at only one synonymous site. In all analyses we used the C. incerta yptC1 sequence obtained by us.
The selection of sequences for analysis was not done randomly. Sequence length and quality were considered, and as much as possible we chose genes encoding products that perform diverse cellular functions and are targeted to different cellular compartments. We designated genes as encoding mitochondrial- or plastid-targeted products on the basis of the best BLAST hits for proteins that are known to be targeted to the respective organelles and annotated as such in the GenBank, C. reinhardtii EST, and C. incerta PEP databases. The remaining seven gene sequences were obtained from GenBank or, for C. incerta, by sequencing PCR products. The mean length of the 67 homologous pairs of protein-coding sequences from C. reinhardtii and C. incerta, as aligned codon by codon using Clustal X (Thompson et al. 1997), was 220 codons (maximum, 1194; minimum, 63). The full coding sequence was analyzed in 41 (61%) of the homologous gene pairs examined. The C. reinhardtii and C. incerta intron sequences were aligned using Multalin (Corpet 1988) and then manually adjusted.
The number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) in the protein-coding regions were estimated using the maximum-likelihood (ML) method (Goldman and Yang 1994) implemented in the CODEML program of version 3.14 of the PAML package (Yang 1997), assuming transition/transversion bias and codon usage bias (F3x4). The number of substitutions per site in the introns (dI) was calculated with the Hasegawa–Kishino–Yano (HKY85) model (Hasegawa et al. 1985) implemented in the BASEML program, also part of the PAML package, assuming transition/transversion bias and nonuniform base composition.
Codon usage bias:
CODONS (Lloyd and Sharp 1992), MEGA 2.1 (Kumar et al. 2001), and DAMBE (Xia and Xie 2001) software packages were used to compute the Nc (Wright 1990), base composition, RSCU values, and the CAI (Sharp and Li 1987b). In calculating the CAI, we used the specific set of optimal codons previously defined by Naya et al. (2001), which was based on the codon usage frequencies of C. reinhardtii highly expressed genes.
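For readers unfamiliar with these indices, the following is a minimal Python sketch of how RSCU and CAI can be computed from the definitions of Sharp and Li (1987b). It is not the code of CODONS, MEGA, or DAMBE. The function names, the toy reference set, and the pseudocount of 0.5 for codons absent from the reference set are assumptions of this sketch; the analyses reported here relied on the optimal codons defined by Naya et al. (2001).

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families; Met, Trp, and stop codons carry no usage information.
FAMILIES = {}
for codon, aa in CODE.items():
    if aa not in "MW*":
        FAMILIES.setdefault(aa, []).append(codon)


def count_codons(seq):
    """Count in-frame codons of a coding sequence; unrecognized triplets are skipped."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return Counter(c for c in codons if c in CODE)


def rscu(counts):
    """RSCU: observed count of a codon divided by the mean count in its family."""
    values = {}
    for fam in FAMILIES.values():
        total = sum(counts.get(c, 0) for c in fam)
        for c in fam:
            values[c] = counts.get(c, 0) * len(fam) / total if total else 0.0
    return values


def relative_adaptiveness(reference_counts, pseudocount=0.5):
    """w: usage of a codon relative to the most used synonym in the reference set."""
    w = {}
    for fam in FAMILIES.values():
        # Assumption of this sketch: unused codons get a pseudocount to avoid log(0).
        adjusted = {c: reference_counts.get(c, 0) or pseudocount for c in fam}
        top = max(adjusted.values())
        for c in fam:
            w[c] = adjusted[c] / top
    return w


def cai(seq, w):
    """CAI: geometric mean of w over all informative codons of the gene."""
    counts = count_codons(seq)
    length = sum(n for c, n in counts.items() if c in w)
    if length == 0:
        return float("nan")
    log_sum = sum(n * math.log(w[c]) for c, n in counts.items() if c in w)
    return math.exp(log_sum / length)


# Hypothetical reference set of "highly expressed" codons (illustration only).
reference = count_codons("GCC" * 10 + "GCT" * 5)
weights = relative_adaptiveness(reference)
print(round(cai("GCCGCT", weights), 3))
```

In this toy example the gene uses the preferred alanine codon once and a codon with w = 0.5 once, so its CAI is the geometric mean of 1 and 0.5. A real analysis would build the weights from the codon counts of a curated set of highly expressed genes.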
The paired t-tests and the calculation of Pearson correlation coefficients were performed using MINITAB, release 14.12.0. The likelihood-ratio test was used as described in the PAML manual.
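As an illustration of the statistics named above, the sketch below computes a Pearson correlation coefficient, the t statistic used to test it against zero, and a paired t statistic, using only the Python standard library. It is not the MINITAB procedure; the function names are hypothetical, and P values would still have to be read from Student's t distribution with the stated degrees of freedom.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two samples of equal length (n >= 3)")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def paired_t(x, y):
    """Paired t statistic for matched samples, with n - 1 degrees of freedom."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


# Hypothetical per-gene values for two species (illustration only).
species_a = [0.62, 0.71, 0.80, 0.55, 0.90]
species_b = [0.60, 0.73, 0.78, 0.57, 0.88]
r = pearson_r(species_a, species_b)
print(round(r, 3), round(t_from_r(r, len(species_a)), 2))
print(round(paired_t(species_a, species_b), 3))
```

For example, a coefficient of −0.37 computed over 67 gene pairs (assuming every pair enters the test) corresponds to t of about −3.2 with 65 degrees of freedom.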
Nucleotide sequence accession numbers:
Sixty-one C. incerta unique sequences (i.e., clusters or singletons) have been deposited in GenBank under accession nos. DQ122864–DQ122923 and DQ222936. The full C. incerta EST data set (5124 entries) is available at http://amoebidia.bcm.umontreal.ca/public/pepdb/agrm.php/. Partial genomic sequences of C. incerta cblP, GapA, sfa, Tpx, and yptC1 have been deposited in GenBank under accession nos. DQ122924–DQ122927 and DQ222937.
RESULTS
Comparison of codon usage in C. reinhardtii and C. incerta:
As shown in Table 1, measures of compositional (GC%, GC3%) and especially codon biases (Nc and CAI) vary considerably among genes in C. incerta, and in pairwise comparisons with C. reinhardtii orthologs, the values of these parameters correlate strongly (r > 0.9). Moreover, the averages of these parameters are almost identical between the two species although the small difference in the GC% averages is statistically significant and the same is true for the GC3% averages (Table 1). On the basis of these data, it is not surprising that RSCU values of concatenated C. reinhardtii and C. incerta gene sequences are highly correlated (r = 0.99) (supplemental Table S2 at http://www.genetics.org/supplemental/).
Nc in C. incerta correlates strongly with the level of gene expression as indicated by CAI (r = −0.89, P < 0.001) and with C3 composition (r = −0.76, P < 0.001), but does not correlate significantly with G3 composition (r = 0.07, P = 0.55) (supplemental Figure S1 at http://www.genetics.org/supplemental/). A significant correlation between Nc and the level of gene expression estimated by the number of EST matches of each sequence was previously reported in C. reinhardtii (Naya et al. 2001). We also report a strong positive correlation between CAI and the C content at fourfold degenerate sites in both C. reinhardtii (r = 0.82, P < 0.001) and C. incerta (r = 0.85, P < 0.001) (supplemental Figure S2 at http://www.genetics.org/supplemental/). This finding supports the view that the frequency of C3 increases with the level of gene expression in both taxa and is consistent with the observation of Naya et al. (2001) that highly expressed genes in C. reinhardtii are C3 rich.
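The effective number of codons used in these comparisons can be sketched from the definition of Wright (1990): a codon homozygosity F is computed for each amino acid, averaged within each degeneracy class, and combined as Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The Python code below is an illustrative implementation, not the code of the packages used here. The function names are hypothetical, and the handling of rare or absent amino acids (families with fewer than two codons or with F = 0 are skipped) is a simplifying assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families of size 2, 3, 4, or 6 (Met, Trp, and stops excluded).
FAMILIES = {}
for codon, aa in CODE.items():
    if aa not in "MW*":
        FAMILIES.setdefault(aa, []).append(codon)


def homozygosity(counts, family):
    """Wright's F = (n * sum(p^2) - 1) / (n - 1) for one amino acid."""
    n = sum(counts[c] for c in family)
    if n < 2:
        return None  # too few codons to estimate F
    sum_sq = sum((counts[c] / n) ** 2 for c in family)
    return (n * sum_sq - 1.0) / (n - 1.0)


def effective_number_of_codons(seq):
    """Nc of a coding sequence; 20 = one codon per amino acid, 61 = no bias."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    mean_f = {}
    for k in (2, 3, 4, 6):
        fs = [homozygosity(counts, fam) for fam in FAMILIES.values() if len(fam) == k]
        fs = [f for f in fs if f]  # assumption: skip unestimable (None) or zero F
        mean_f[k] = sum(fs) / len(fs) if fs else None
    # Isoleucine is the only threefold family; if it is unusable, average F2 and F4.
    if mean_f[3] is None and mean_f[2] and mean_f[4]:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2.0
    if any(v is None for v in mean_f.values()):
        return float("nan")
    nc = 2.0 + 9.0 / mean_f[2] + 1.0 / mean_f[3] + 5.0 / mean_f[4] + 3.0 / mean_f[6]
    return min(nc, 61.0)


# Extreme bias: each amino acid encoded by a single codon, three copies each.
biased = "".join(fam[0] * 3 for fam in FAMILIES.values())
print(effective_number_of_codons(biased))
```

Because highly biased genes approach 20 and unbiased genes approach 61, a negative correlation between Nc and CAI is the expected signature when codon bias increases with expression level.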
Rate of nucleotide substitution among genes:
The divergence at synonymous sites, as estimated for individual gene pairs using the ML method (Table 1), varies >65-fold among genes; these dS values range from 0.025 ± 0.008 for cblP to 1.68 ± 0.47 for the pherophorin I gene. Nevertheless, 85% of the genes analyzed exhibit dS values between 0.1 and 0.75 (Figure 1A).
Estimates of the nonsynonymous substitution divergence between C. reinhardtii and C. incerta orthologs in our data set range from no divergence for five gene sequences (petF, eIF4A, fla14, Atp9, and yptC1) to values of 0.11 and 0.12 for mid and the gene for pherophorin I, respectively (Table 1). About 90% of the genes show dN values below 0.05 (Figure 1B); however, variation in dN exceeds that of dS by 50% (Table 1).
Finally, we determined dN/dS ratios for the individual genes in our data set (Table 1) and also tested for a correlation between the synonymous and nonsynonymous rates among genes. The dN/dS values vary widely across genes but are consistently <1; the average over all genes is 0.056. At 0.217, the mid gene has the highest dN/dS ratio. Estimates of synonymous and nonsynonymous rates are positively correlated (r = 0.62, P < 0.001).
Correlation between substitution divergence and codon adaptation among genes:
We wanted to determine if the among-gene variation in synonymous and nonsynonymous divergence between C. reinhardtii and C. incerta is correlated with gene expression as indicated by CAI. The analysis reveals a significant negative correlation between dS and CAI (r = −0.37, P < 0.01) (Figure 2A) and the same relationship is also observed between dN and CAI (r = −0.46, P < 0.001) (Figure 2B). mid has a particularly high dN value and pherophorin I is conspicuously high for both dS and dN. Removing these two outliers from the analyses increased the strength of the correlation between dS and CAI (r = −0.50, P < 0.001) and between dN and CAI (r = −0.57, P < 0.001). We found similar correlation coefficients (data not shown) between both dS and dN and the number of ESTs recovered from the C. incerta library (EST abundance data are given in supplemental Table S1 at http://www.genetics.org/supplemental). However, we chose to use CAI rather than EST abundance in our reported analyses because not all genes in our data set were represented in either the C. incerta or the C. reinhardtii library and normalization steps were employed in the preparation of these libraries. In other systems, moreover, CAI has been shown to be as good as mRNA abundance in predicting protein abundance (Jansen et al. 2003) and the best codon-bias-derived surrogate for gene expression level (Goetz and Fuglsang 2005).
Comparison of intron and exon substitution rate estimates in genes with low- and high-CAI values:
We analyzed concatenated intron and exon sequences from three genes (sfa, mid, and Pdk) with CAI values between 0.36 and 0.64 and three genes (cblP, GapA, and Tpx) with CAI values between 0.81 and 0.90 (the six genes are listed in Table 2; intron data are given in supplemental Table S3 at http://www.genetics.org/supplemental/). Likelihood-ratio tests indicated homogeneous rates among sites within introns from the lowly expressed genes (low CAI) and also from the highly expressed genes (high CAI) (supplemental Table S4 at http://www.genetics.org/supplemental/). Next, we found that the difference in substitution rates between the two groups of introns is statistically significant, as the hypothesis of equal substitution rates between the two partitions is rejected by the likelihood-ratio test (2Δℓ = 20.32, d.f. = 1, P < 0.001). The substitution rate in introns of the low-CAI genes is ∼1.3 times higher than that in introns of the high-CAI genes (Table 2). In contrast, when the mean synonymous divergence of the corresponding exons is examined, we find that the synonymous substitution rate in the low-CAI genes is ∼5 times higher than the rate in high-CAI genes (Table 2).
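The significance of the likelihood-ratio statistic quoted above can be checked with a short calculation. The sketch below assumes the usual chi-square approximation for nested models; for 1 d.f. the upper-tail probability has the closed form erfc(√(x/2)), so only the Python standard library is needed. The function names and the two log-likelihood values are hypothetical, chosen to reproduce 2Δℓ = 20.32; they are not the values returned by BASEML.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood-ratio statistic 2 * (lnL_alternative - lnL_null)."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(stat):
    """Upper-tail chi-square probability for 1 d.f.: P = erfc(sqrt(stat / 2))."""
    if stat < 0:
        raise ValueError("the statistic must be non-negative")
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods (illustration only): one rate for both intron
# partitions (null) vs. a separate rate for each partition (alternative).
stat = lrt_statistic(lnl_null=-1000.00, lnl_alt=-989.84)
p_value = chi2_sf_df1(stat)
print(round(stat, 2), p_value < 0.001)
```

A statistic of 20.32 lies far beyond the 1-d.f. critical value of 3.84 for P = 0.05, in agreement with the reported P < 0.001.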
DISCUSSION
Analysis of 67 pairs of orthologous genes revealed similar codon usage and base composition between C. reinhardtii and C. incerta, suggesting the absence of differences in selective or mutational forces acting on codon usage in the two taxa since their divergence.
There is considerable variation in the synonymous and nonsynonymous substitution rates among the C. reinhardtii/C. incerta genes examined even when the extreme values are not considered. The synonymous substitution divergence across the genes studied here is at least as large as the among-gene synonymous rate variation reported in other systems, e.g., bacteria (Sharp and Li 1987a), vertebrates (Bernardi et al. 1993; Wolfe and Sharp 1993; Moriyama and Powell 1997), and land plants (Alvarez-Valin et al. 1999; Kusumi et al. 2002; Tiffin and Hahn 2002; Senchina et al. 2003), although when comparing these values one must consider differences in methods of divergence estimation, sample size, and gene sets, all of which can affect the level of variation in the synonymous divergence among genes.
Evidence for a negative correlation between the estimated synonymous divergence among genes and gene expression level, as supported here for C. reinhardtii/C. incerta, has also been reported for bacteria and yeast (Sharp and Li 1987a; Sharp 1991; Pal et al. 2001; Smith and Eyre-Walker 2001; Rispe et al. 2004; Hirsh et al. 2005) but contrasts with reports on mammals and Arabidopsis where synonymous substitution rate and level of gene expression seem uncorrelated (Duret and Mouchiroud 2000; Wright et al. 2002, 2004). At least two causes for the negative correlation between synonymous substitution rate and expression level of the Chlamydomonas genes can be invoked: (i) synonymous substitution rate is reduced by translational selection, which increases with gene expression; and (ii) highly expressed genes acquire fewer mutations because of transcription-mediated repair processes (e.g., Berg and Martelius 1995; Eyre-Walker and Bulmer 1995; see also Sullivan 1995 for a review). One expects that transcription-coupled mutation-repair processes would affect introns and flanking exons equally. Therefore, if such processes represent the major cause of the dS depression in the exons of high-CAI genes (e.g., cblP, GapA, and Tpx) relative to the exons of low-CAI genes (e.g., sfa, mid, and Pdk), a similar drop in divergence in introns of the former gene set compared to the latter gene set would be expected. Alternatively, if translational selection is the major evolutionary force responsible for the lower dS in the exons of high-CAI genes compared to the exons of low-CAI genes, one might expect equal levels of divergence between the introns from the two sets of genes. Our results fall between these two extremes. There is significantly lower divergence in the introns of the high-CAI genes relative to the low-CAI genes examined. This difference, however, is about four times smaller than the drop in the synonymous substitution divergence in the exons of the two gene categories.
These results, therefore, support translational selection over mutation as the more important evolutionary force underlying the described negative correlation between dS and CAI in the Chlamydomonas taxa. The lower intron evolutionary rate in the high-CAI genes compared to the low-CAI ones could result from transcription-mediated repair processes, although these results are also consistent with enhanced selective constraints on the sequence of introns in highly expressed genes. Nevertheless, although there are reports showing that spliceosomal introns are subject to constraints in sequence evolution, especially the first introns and the sites flanking intron/exon boundaries (e.g., Chamary and Hurst 2004), there is no report indicating that these constraints differ between highly and lowly expressed genes.
The among-gene heterogeneity in the nonsynonymous substitution rates between C. reinhardtii and C. incerta is considerable and shows more variation than the synonymous substitution rates, which is in agreement with the expected variations in functional constraints on the amino acid sequence in different proteins and with reports on other biological groups (Li and Graur 1991; Bernardi et al. 1993; Wolfe and Sharp 1993; Tiffin and Hahn 2002). The negative correlation between the rate of nonsynonymous substitutions and the level of gene expression found in a number of systems (Duret and Mouchiroud 2000; Pal et al. 2001; Wright et al. 2002, 2004; Marais et al. 2004; Rispe et al. 2004; Rocha and Danchin 2004; Subramanian and Kumar 2004; Lemos et al. 2005) has also been observed here in C. reinhardtii/C. incerta. Different models have been proposed to explain this connection. In the translational selection model, highly expressed genes are proposed to be under purifying selection against nonsynonymous changes that may be neutral with respect to protein function but suboptimal with respect to translational efficiency (Akashi 2001, 2003). Other models propose that genes that are broadly expressed across tissues in multicellular eukaryotes, which are also the most highly expressed, are under enhanced functional constraints because their products must function in a greater number of cellular environments (Hastings 1996) or because nonsynonymous changes in these genes affect a greater number of tissues and therefore have a greater impact on fitness (Duret and Mouchiroud 2000). In Chlamydomonas, the relationship between the rate of nonsynonymous substitution and the level of gene expression might be better explained by the translational selection model because there is evidence for translational selection in C. reinhardtii (Naya et al. 2001) and Chlamydomonas taxa are unicellular organisms. 
Consistent with this idea we found (i) a strong negative correlation between Nc and the level of gene expression in C. incerta, (ii) a significant negative correlation between the rate of synonymous substitution in C. reinhardtii/C. incerta and the level of gene expression, and (iii) a significant correlation of both Nc (data not shown) and synonymous substitution rate with the nonsynonymous substitution rate in C. reinhardtii/C. incerta. Nevertheless, some effects on protein functional constraints related to breadth of gene expression might be expected in C. reinhardtii/C. incerta. These taxa undoubtedly exist under diverse environmental conditions and have both asexual and sexual life-cycle phases so that genes may vary in their breadth of expression under different physiological or developmental stages. If the breadth and level of gene expression are correlated in Chlamydomonas, it may prove difficult to separate their effects on protein evolution.
The highest nonsynonymous substitution estimates in our data set come from mid and the pherophorin I gene, which are approximately six times greater than the value averaged over the whole set of genes. mid encodes a minus-dominance protein important in gamete sex-determination in Chlamydomonas (Ferris and Goodenough 1997). Among both unicellular and multicellular eukaryotes there is evidence that sex-related genes evolve significantly more rapidly than genes not directly related to sex functions (e.g., Singh and Kulathinal 2000; Torgerson and Singh 2003; Zhang et al. 2004). In C. reinhardtii and C. incerta, mid was reported previously to evolve rapidly in terms of nonsynonymous substitutions. However, two regions of the predicted protein product were observed to be conspicuously more conserved in amino acid sequence between these species as compared to two other regions of the protein (Ferris et al. 1997). We searched the InterPro Scan database (http://www.ebi.ac.uk/InterProScan) and found that the C-terminal region previously described as conserved between C. reinhardtii and C. incerta indeed contains a domain conserved across the plant lineage (the RWP-RK domain; Pfam accession no. PF02042). According to our estimates, the average nonsynonymous substitution rate for sites in the nonconserved regions of this gene (dN = 0.28 ± 0.06) is ∼2.5 times higher than the rate averaged over all sites in the gene (dN = 0.113 ± 0.02) and ∼8 times higher than those averaged over the sites in the conserved regions of the gene (dN = 0.035 ± 0.013). The high evolutionary rates at the nonsynonymous sites in the nonconserved regions are consistent with a relaxation of functional constraint or, as proposed by Ferris et al. (1997), positive selection, while the conserved domains are probably under purifying selection. 
Although the different domains experience different rates at nonsynonymous sites, they have rather similar synonymous substitution rates, base composition, and codon bias.
The pherophorin I gene belongs to a multigene family described so far in Volvox carteri (Godl et al. 1995; Hallmann 2003) and C. reinhardtii (Nedelcu 2005). In V. carteri, which is a colonial close relative of C. reinhardtii/C. incerta, pherophorins are major constituents of the extracellular matrix and are structurally related to the sex-inducing pheromone (reviewed by Hallmann 2003). Divergence at both synonymous and nonsynonymous sites in the pherophorin I gene is the highest in our data set. The points representing the pherophorin gene in both the dS vs. CAI and dN vs. CAI plots lie conspicuously above the regression lines, suggesting that expression level contributes little to the high evolutionary rates of this gene. Relaxed purifying selection or positive selection on this gene cannot be invoked to explain the high dN values unless a substantial overestimation of dS is assumed, as the dN/dS ratio is not exceptionally high. Alternatively, these pherophorin I sequences may have existed as ancient alleles or paralogs prior to the C. incerta and C. reinhardtii speciation event and therefore had more time to diverge. In this connection, the pherophorin I genes used in this study are the most closely related pherophorin-like genes currently in the databases of the two taxa on the basis of both phylogenetic affiliation (data not shown) and sequence similarity (the BLAST E-value was at least 100 orders of magnitude smaller for this pair of genes than for other hits).
C. reinhardtii is unquestionably the premier unicellular model organism among green plants (Harris 2001; Gutman and Niyogi 2004). Yet the exciting potential of this system for molecular evolutionary analysis was hindered by the scarcity of appropriate comparative sequence data. The placement of our newly generated cDNA sequence data for C. incerta in a comparative framework with the existing gene sequences from C. reinhardtii has opened the opportunity to address fundamental questions about the relative roles of translational selection and transcription-coupled repair processes in the nuclear compartment of this model group of green algae. Although this study represents only the first step in addressing these and related questions, the results should prove valuable in guiding future comparative and experimental work. For example, it is important that we test our findings by employing other measures of gene expression such as identifying genes richest in codons having the most abundant cognate tRNAs and by measuring transcript abundance using microarray analysis. Finally, with a larger sample and improved annotation of orthologous genes from these taxa it should be possible to investigate attributes such as gene age, gene length, and codon usage bias along the translational gradient that have been shown in other systems to have fine-scale effects in shaping the rates and patterns of nucleotide sequence evolution.
We thank Aurora Nedelcu, University of New Brunswick at Fredericton, for helpful discussions on pherophorins and for providing alignments of C. reinhardtii pherophorin-like sequences. This work was supported by Natural Sciences and Engineering Research Council of Canada grants to R.W.L. and J.P.B. and is part of the Protist EST Program (PEP) funded by Genome Canada and the Atlantic Canada Opportunities Agency (Atlantic Innovation Fund). C.E.P. received scholarships from Dalhousie University and the Patrick F. Lett Fund and T.B. was the recipient of a PEP postdoctoral fellowship.
Communicating editor: D. Charlesworth
- Received June 30, 2005.
- Accepted November 22, 2005.
- Copyright © 2006 by the Genetics Society of America