Many of Chlamydomonas reinhardtii expressed sequence tags (ESTs) in GenBank dbEST and community EST assemblies were either over- or undertrimmed in terms of their cDNA termini, which are defined as the diagnostic sequence elements that delineate 3′/5′ ends of mRNA transcripts. Overtrimming represents a loss of directional, positional, and structural information of transcript ends whereas undertrimming causes unclean spurious sequences retained in ESTs that exert deleterious impacts on downstream EST-based applications. We examined 309,278 raw EST sequencing trace files of C. reinhardtii and found that only 57% had cDNA termini that matched the expected structures specified in their cDNA library constructions while satisfying our minimum length requirement for their final clean sequences. Using GMAP, 156,963 individual ESTs were mapped to the genome successfully, with their in silico-verified cDNA termini anchored to the genome. Our data analysis suggested strong macro- and microheterogeneity of 3′/5′ end positions of individual transcripts derived from the same genes in C. reinhardtii. This work annotating differential ends of individual transcripts in the draft genome presents the research community with a new stream of data that will facilitate accurate determination of gene structures, genome annotation, and exploration of the transcriptome and mRNA metabolism in C. reinhardtii.
CHLAMYDOMONAS reinhardtii is a single-celled, eukaryotic organism that shares plant- (e.g., photosynthetic chloroplast) and animal-like (e.g., cilia or flagella) characteristics. Because of its unique evolutionary position, divergent from land plants over a billion years ago, the genome and gene catalogs of this model organism have received much attention since the publication of its draft genome (Merchant et al. 2007). Using ab initio and homology-based gene prediction, a gene catalog of 15,143 protein-coding genes was created by the Department of Energy Joint Genome Institute (JGI). Among them, only 56% of the gene models were supported with expressed sequence tag (EST) evidence (Merchant et al. 2007).
As “tags” to identify genes, ESTs are primarily single-pass complementary DNA (cDNA) sequences derived from transcribed mRNAs (transcripts). Prior to all downstream applications (e.g., GenBank submission, EST clustering, gene discovery, and genome annotation), raw EST sequence reads are typically trimmed of vector fragments, insert-flanking restriction endonuclease recognition sites (restriction enzyme sites), adapter (linker) sequences, and/or poly(A)/(T) tails in current cleaning steps (e.g., Liang et al. 2006; Nagaraj et al. 2006). Unfortunately, many of such trimmed sequences represent potentially informative content with respect to cDNA molecule structure and, therefore, biological processing and structure of the original mRNAs. As genomics studies deepen, loss of these trimmed sequences actually presents an obstacle for validating error-prone ESTs and mining ESTs for new knowledge. To address this issue, we recently introduced a new concept to EST data analyses: “EST terminus”, a set of diagnostic sequence elements or features (e.g., adapter and restriction enzyme sites) detected in raw EST trace or chromatogram files that delineate cDNA insert termini (ends) and therefore most likely mRNA ends (Liang et al. 2007a,b). In particular, we developed a bioinformatics tool: WebTraceMiner, a public web service that processes raw EST trace files, identifies in silico-authenticated cDNA termini, and determines final clean sequences on the basis of both identified terminal structures and base-calling quality values (Liang et al. 2007a). Using WebTraceMiner, we reprocessed 172,229 Pinus taeda EST trace files and created the ConiferEST database, the first public resource that presents both the complexity and the abnormality of cDNA terminal structures to the community (Liang et al. 2007b). Our work suggests that examination of cDNA termini in raw EST trace files could not only extract previously overlooked information (i.e., the directional, positional, and structural aspects of cDNA termini) embedded in the enormous existing EST data sets, but also help overcome difficulties in data quality control and validation of error-prone ESTs and benefit many downstream EST applications. Without inspecting cDNA terminal structures, current EST data-cleaning steps appear to be inadequate because they leave behind spurious sequence elements in many ESTs, and they fail to detect terminal structure abnormalities that arise as cloning artifacts or from other unknown sources (Liang et al. 2007b).
So far, the majority of public C. reinhardtii ESTs have been obtained by sequencing 3′ or 5′ ends of cDNA clones produced through reverse transcription of polyadenylated mRNAs [i.e., mRNAs with poly(A) tails]. Many of these ESTs have not been submitted into GenBank dbEST yet, but have been used to create community EST assemblies and to annotate the draft genome (Grossman et al. 2003; Jain et al. 2007; Merchant et al. 2007). We have collected a total of 309,278 raw EST trace files and some of their corresponding EST sequences available from NCBI dbEST, the Chlamy Center (http://www.chlamy.org), Kazusa DNA Research Institute, Japan (KDRI), and JGI. With the draft genome reference of C. reinhardtii, we are able to further our research on EST termini and explore the relationships between genomic DNA sequences and ESTs with in silico-authenticated cDNA termini. In this research, our primary goal is to examine all raw trace files for the previously overlooked cDNA termini and consolidate their detection using genomic sequences for confirmation. On the basis of identified cDNA terminal structures, we can then detect incorrectly trimmed (i.e., either under- or overtrimmed) EST counterparts in public domain resources (e.g., NCBI dbEST and community EST assemblies). More importantly, we aim to map individual ESTs and anchor their cDNA termini to the draft genome. Clearly, annotation of the draft genome with differential 3′/5′ ends of individual transcripts derived from the same genes will create a new data resource that facilitates accurate delineation of transcripts, determination of gene structures, and exploration of the transcriptome and mRNA metabolism in C. reinhardtii.
MATERIALS AND METHODS
Of the 309,278 raw trace files, 45,312 were created by JGI and downloaded from NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/), 51,135 were provided by KDRI (Asamizu et al. 1999, 2000, 2004; http://est.kazusa.or.jp/en/plant/chlamy/EST/index.html), and 212,831 were from the Chlamy Center (Shrager et al. 2003; Jain et al. 2007). The cDNA library for JGI trace files was a normalized one using the highly polymorphic S1D2 strain (E. Lindquist, personal communication), whereas KDRI adopted the C9 (mt−) strain and the Chlamy Center used the strains 21gr, 137c, and S1D2 (Shrager et al. 2003; Jain et al. 2007). The genomic DNA for the draft genome was prepared from the strain CC-503 cw92 mt+, a mutant isolated from the strain 137c (Merchant et al. 2007). Clearly, the EST data we obtained represent high genotype heterogeneity.
All cDNA libraries adopted the same or a similar construction protocol using the same cloning sites EcoRI and XhoI. The major difference is that the Chlamy Center and KDRI adopted pBluescript II SK (−) as the vector whereas JGI used pBluescript SK (+). As shown in Figure 1, we improved our definitions of cDNA termini by adding outward extensible vector fragment parts adjacent to the insert-flanking restriction enzyme sites (i.e., EcoRI and XhoI). The 5′ terminus of the cDNA in the sense strand (5TSS), which delineates the 5′ terminus of the relevant mRNA, consists of a vector fragment with the minimum length of 9 nt (5′- … GGGCTGCAG-3′), an EcoRI site (5′-GAATTC-3′, 6 nt), and an adapter (5′-GGCACGAGG-3′, 9 nt). The maximum numbers of allowed errors (i.e., insertions, deletions, and/or mismatches) were 2 nt for the adapter, 1 nt for the enzyme site, and 4 nt for the combination of the minimum vector fragment, enzyme site, and adapter (24 nt in total). For every 5-nt vector extension, one additional error was allowed for the combined part. The 3′ terminus of the cDNA in the sense strand (3TSS), which denotes the 3′ terminus of the relevant mRNA, consists of a poly(A) tail, a XhoI site (5′-CTCGAG-3′, 6 nt), and a vector fragment with the minimum length of 9 nt (5′-GGGGGGCCC … -3′). Here, we kept the poly(A) tail as the integral part of our terminus definition, because it is a post-transcriptional (not genomically encoded) product. The maximum number of allowed errors for the poly(A) tail was 2 nt for a minimum 10 adenines, which means we could have a poly(A) tail that has 8 continuous adenines. One additional error was allowed for every 5-nt adenine extension. Moreover, a minimum of 80% identity was guaranteed for any subfragments of the poly(A) tail within the first 10 nucleotides adjacent to the cDNA insert end (i.e., 3′-UTR in a mRNA). Only 1 error was permitted for the XhoI site, and the maximum number of errors allowed for the combined XhoI site and the minimum vector part was 2 of 15 nt. We adopted the similar strategies in detecting the other two termini: the 5′ terminus of the cDNA in the nonsense strand (5TNS) and the 3′ terminus of the cDNA in the nonsense strand (3TNS), which delineate the 3′ and 5′ termini of a mRNA, respectively, and whose sequences are read in the 5′ → 3′ direction in the nonsense strand (see Figure 1). The individual components for each terminus were required to keep their sequential order and orientation constraints (e.g., an adapter 5′-CCTCGTGCC-3′ first, an EcoRI 5′-GAATTC-3′ site in the middle, and a vector fragment 5′-CTGCAGCCC … -3′ at the end in 3TNS; Figure 1) and formed a canonical or expected structure for a given terminus, known as the terminal structure. The aforementioned numbers of errors have been empirically derived after systematic comparison of cDNA terminus identification results using different parameter settings to minimize occurrences of false positives. Nevertheless, what is critical for the terminus identification is that we adopted a systematic approach that focuses on the relationship (i.e., orientation constraint, distance constraint, sequential order constraint, and so on) among individual terminal components (putative sequence features like adapters or enzyme recognition sites) to identify each cDNA terminus as a whole entity. In theory, a typical 5′-end EST sequence should contain a 5TSS and perhaps a 3TSS if the cDNA insert is short. Similarly, a typical 3′-end EST sequence should theoretically harbor a 5TNS and possibly a 3TNS. In some cases, the expected structures are not detectable in ESTs; for example, where sequencing primers are too close to ends of cDNA inserts, EST ends are in a low-quality region or ESTs possess abnormal or complex terminal structures. That is why we also examined complex, partial, and/or abnormal terminal structures by permutations of the expected terminus components in addition to the four conventional termini (see Figure 1).
The core component of WebTraceMiner (Liang et al. 2007a) was modified to incorporate previously mentioned definitions of cDNA termini. We adopted the new version of Phred (040406.c) (Ewing et al. 1998; B. Ewing, personal communication) as the base caller to process all trace files and obtain raw sequence reads as well as corresponding Phred quality values. A moving-window strategy (Liang et al. 2006) with the threshold Phred quality value of 10 (i.e., 90% of base-call accuracy, Ewing et al. 1998) was used to determine a high-quality region for each raw sequence read. After base calling, quality trimming, and vector screening, WebTraceMiner identified the cDNA-terminal structure for each sequence read and then determined the final clean EST sequence that matched the canonical cDNA terminus structure model. The final clean sequence was intended to represent the partial or complete cDNA insert within a high-quality region, with at least one terminus annotated, and without vectors, adapters, and poly(A)/(T) tails.
We downloaded all C. reinhardtii EST sequences from NCBI dbEST (167,641 sequence reads as in the dbEST release of 09-25-2007; ftp://ftp.ncbi.nih.gov/repository/dbEST) and named them dbEST sequences. After filtering out some dbEST sequences whose trace files were not available to us, we then formed our GenBank EST data set of 147,365 trace files. We obtained the final EST sequences from the Chlamy Center and JGI (http://genome.jgi-psf.org/Chlre3/Chlre3.download.ftp.html) and named them community sequences, because they had been used for creation of EST assemblies (Jain et al. 2007) and genome annotation (Merchant et al. 2007) for the C. reinhardtii research community. Filtering out some of the community sequences whose trace files were not available to us, we also formed our community EST data set, which contained 217,634 trace files. These two data sets overlapped for only 138,021 trace files, because many community sequences have not been submitted to GenBank yet. We adopted Blast2Seq (Tatusova and Madden 1999) to conduct sequence comparisons for two counterpart sequences between our raw sequences and community sequences and between our raw sequences and dbEST sequences.
There are several public EST assemblies existing for C. reinhardtii, including assembly of contiguous ESTs based on genome (ACEGs) (Jain et al. 2007), The Institute for Genomic Research (TIGR) (Dana Farber Cancer Institute, DFCI) gene indexes (http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=c_reinhardtii,Release5.0), KDRI EST indexes (http://est.kazusa.or.jp/en/plant/chlamy/EST/), and JGI EST clusters (http://genome.jgi-psf.org/Chlre3/Chlre3.download.ftp.html). To evaluate the potential impact of incorrectly trimmed ESTs on downstream applications, we scanned all EST contig (consensus) sequences for complete, partial, and/or abnormal termini.
For EST–genome mapping, we adopted GMAP (Wu and Watanabe 2005), a stand-alone program for aligning cDNA sequences to a genome and generating gene structures. For ESTs with a final sequence of >74 nt in length, we used their raw sequences to map to the draft genome (JGI Assembly v.3.1, unmasked assembly, http://genome.jgi-psf.org/Chlre3/Chlre3.download.ftp.html). Because our main objective in this research was to annotate the draft genome with 3′/5′ ends of individual mRNA transcripts, we sought high accuracy in EST–genome mapping. Considering the high genotype heterogeneity and possible sequencing errors in individual ESTs, the criteria for filtering a valid EST–genome mapping result were (1) the minimum mapped length of a raw EST sequence must be 70 nt, (2) the minimum matched identity of the raw EST sequence must be 80%, and (3) the minimum matched coverage of the final clean sequence of a raw EST must be 80%. We also mapped all previously mentioned EST contigs to the draft genome. The criteria for a valid mapping result were similar: the mapped length of a contig sequence must be at least 70 nt, with a minimum matched identity of 80%.
Among 309,278 sequences, 43% are designated as 3′ ESTs, with “.x” or “.g” in their sequence names, whereas 57% are 5′ ESTs named with “.y,” “.b,” or “_r” suffix extensions. We identified a total of 198,132 ESTs (64% of all ESTs) having in silico-verified termini: 187,404 (61% of all ESTs) matched the expected terminal structures listed in Figure 1 and 10,728 (3% of all ESTs) possessed complex and/or abnormal terminal structures, including “double-termini adapters” (315 ESTs) previously detected in pine ESTs (Liang et al. 2007b). On the basis of final clean sequence length (i.e., ≥75 nt or not), the type and number of the detected terminus, and whether or not the terminus is inside the high-quality region, we categorized the 187,404 ESTs into six major groups of sequence types listed in Table 1 and obtained 174,860 ESTs (57% of all ESTs) that had a final clean sequence at least 75 nt in length. It is clear that the 5TSS, which delineates the mRNA 5′ end, was the dominant terminus for 5′ ESTs whereas the 5TNS [i.e., the terminus with a poly(T) tail] that denotes the mRNA 3′ end was the major terminus for 3′ ESTs. Having 5TSS–3TSS and 3TNS–5TNS terminal pairs, respectively, 2% of 5′ ESTs and 4% of 3′ ESTs represented potential full-length cDNA sequences with a complete set of expected cloning features. As one form of reversal anomaly, 764 5′ ESTs displayed 3′-like terminal structures (3TNS and/or 5TNS) while 1061 3′ ESTs showed 5′-like terminal structures (5TSS and/or 3TSS) (see Table 1). For ESTs from the Chlamy Center, this reversal anomaly is due at least in part to errors in indexing the reads at the time of sequencing, whereby 3′ reads were mislabeled as .y and 5′ reads were mislabeled as .x (O. Vallon, personal communication).
The community EST data set contained 217,634 trace files while the GenBank EST data set included 147,365 trace files. After BLAST2Seq comparison, 73% (i.e., 158,128) of community EST data set and 86% (i.e., 126,743) of GenBank EST data set trace files had raw sequences that completely cover their community or dbEST counterpart sequences in their BLAST2Seq alignments. Using only those sequences that were completely covered, we could accurately determine whether or not a community or dbEST sequence is undertrimmed or overtrimmed with the reference of the in silico-verified terminus positions. In the community EST data set, 75% of the aforementioned 158,128 ESTs met our final clean sequence length constraint (i.e., ≥75 nt), and, among them, 72,221 ESTs contained a 5TSS, 42,404 a 5TNS, 6045 a 3TNS, and 2686 a 3TSS. Similarly in the GenBank EST data set, 70% of the above-mentioned 126,743 ESTs with complete coverage met the length requirement, and, among them, 54,972 ESTs contained a 5TSS, 31,271 a 5TNS, 4407 a 3TNS, and 2119 a 3TSS. Focusing on the two dominant termini (i.e., 5TSS and 5TNS), we determined the distances between the actually trimmed positions in either community or dbEST sequences and the theoretical terminal positions identified in corresponding raw sequences. It is clear that a large proportion of both community and dbEST sequences were incorrectly trimmed. As shown in Figure 2, only 22% of all the compared community sequences (72,221) with the 5TSS identified had been correctly trimmed in their 5TSSs, because there was no difference (0 distance) between the EST start positions delimited by identified cDNA termini and actual EST start positions present in the community sequences. Meanwhile, 8% were overtrimmed in their 5TSSs whereas 69% were undertrimmed. In particular, of all undertrimmed sequences, 66% had 8 of 9 nt of the adapter sequence untrimmed. In contrast, 75% of all the compared community sequences (42,404) with their 5TNS identified had been correctly trimmed in their 5TNSs [i.e., poly(T) tails have been completely trimmed of the final sequences] whereas 17% were overtrimmed and 8% undertrimmed. Our data suggest that more dbEST sequences appear to be incorrectly trimmed than community sequences. Also in Figure 2, only 9% of all the compared dbEST sequences (54,972) with their 5TSS identified had been correctly trimmed in their 5TSSs whereas 6% were overtrimmed and 84% were undertrimmed. Meanwhile, 5% of all compared dbEST sequences with their 5TNS identified had been correctly trimmed in their 5TNSs whereas 2% were overtrimmed and 93% were undertrimmed. It is worth mentioning that 92% (28,666 of 31,271) of the compared dbEST sequences with the 5TNS detected in their counterpart raw sequences had kept six thymines of the identified poly(T) tails in their 5TNSs. This was mainly because the Chlamydomonas Genome Project deliberately trimmed down poly(T) stretches to 7 nt before GenBank submission (O. Vallon, personal communication).
As shown in Table 2, TIGR (DFCI) C. reinhardtii gene index had 25% (i.e., 7971) of tentative consensuses (TCs) identified with cDNA termini and their remnants. Among them, 5907 were confirmed by their genome-mapping results, in which the unclean terminus parts could not be mapped to the genome. JGI released two EST cluster sets: one with 40,294 clusters and the other with 306 clusters. Our terminal detection revealed that 22% of JGI clusters had complete/partial termini detected, among which 73% had genome-mapping confirmation. KDRI provided 3585 EST clusters for their 51,135 ESTs, and we identified 220 (6% of 3585) contigs containing unclean terminus sequences, among which 77% were confirmed by their genome-mapping results. The main set of ACEGs proved to have the least amount of contigs that contained unclean termini or their remnants: 313 (i.e., 2% of 15,857), among which 48% had genome-mapping evidence. In the bonus set of ACEGs, however, we detected a great number of contigs (i.e., 3031) having their poly(A) tails retained. This was because the main set was postprocessed to remove as much of the poly(A) and termini sequences as possible, while the less reliable bonus set was provided as a raw assembly result (Jain et al. 2007).
Of 174,860 ESTs with a final clean sequence >74 nt, 160,812 had their raw sequences successfully mapped to the draft genome. After filtration, we obtained 156,963 ESTs that met the criteria for having valid EST–genome matches. Of all these ESTs with valid matches, 58% (90,913) contained a 5TSS, 2% (3233) a 3TSS, 40% (62,624) a 5TNS, and 5% (7158) a 3TNS. For every EST with valid genomic matches, we calculated the match distance between the actual matched start/end position of the raw EST sequence in the genome and the theoretical start/end position delimited by identified cDNA termini. A distance of zero indicated a perfect mapping of cDNA start/end positions to the genome, excluding post-transcriptional poly(A)/(T) tails. A negative distance meant that some extra bases from the identified termini themselves [i.e., nucleotides from the adapters or the poly(A)/(T) tail], together with their cDNA insert, had been mapped to the genome. A positive distance represented that some nucleotides of cDNA insert, immediately adjacent to the identified terminus [i.e., the adapter in 5TSS or poly(A) in 3TSS], had not been mapped to the genome. As shown in Figure 3, all four termini displayed a very similar pattern, in which EST sequence numbers dropped dramatically from 0 toward other positive distances; meanwhile, a large proportion of EST sequences possessed negative distances. The majority of ESTs (i.e., 94% of the 5TSS, 87% of the 3TSS, 94% of the 5TNS, and 87% of the 3TNS) had a match distance between −10 and 10 bases. In particular, many ESTs (i.e., 39% of the 5TSS, 22% of the 3TSS, 24% of the 5TNS, and 32% of the 3TNS) displayed a perfect mapping (0 distance). Also in Figure 3, many ESTs (i.e., 6% of the 5TSS, 11% of the 3TSS, 18% of the 5TNS, and 12% of the 3TNS) possessed positive distances and had 1–10 nt that were delimited by verified termini but not mapped to the genome. Further data analysis revealed that 96% of these ESTs had these nucleotides in high-base-call-quality regions. Therefore, they were not sequencing errors, but nontemplated (not genomically encoded) nucleotides.
Of 156,963 ESTs with valid genome-mapping results, 69% were mapped to the loci of only one gene of the JGI gene catalog (Chlre3.1), 3% to the loci of more than one gene, and 28% to loci without any annotated gene. Excluding ESTs mapped to more than one gene, 48% (i.e., 6954 of 14,548) of JGI gene models possessed ESTs with authenticated cDNA termini. In particular, 45% of the gene models had ESTs with 5TSS or 3TNS that delimits the transcript 5′ ends, whereas 26% contained ESTs with 3TSS or 5TNS that delineates the transcript 3′ ends. For each gene, we determined an EST sequence with a cDNA terminus that was in closest proximity to the 5′ or the 3′ end of the gene and used its mapped position in the genome as the reference point (0 nt). For the ESTs with terminus nucleotide(s) mapped to the genome, the reference point was adjusted to remove the terminus nucleotide(s). For the ESTs with nontemplated nucleotide(s), the reference point was actually delimited by the nontemplated nucleotide(s), rather than by the identified terminus. Using the reference point, we then determined intragenic variations in EST-annotated transcript end positions by calculating differences in mapped positions (i.e., relative transcript 3′/5′-end positions in nt) for all other ESTs mapped to this gene. To illustrate this, we used two gene examples in Figure 4, A and B. The gene estExt_fgenesh2_pg.C_120076 (transcript ID: 188837) had 642 ESTs with a verified 5TSS/3TNS that delimits the transcript 5′ end; while the gene estExt_gwp_1H.C_230040 (transcript ID: 130316) had 78 ESTs with a verified 3TSS/5TNS that delineates the transcript 3′ end. Obviously, many intragenic transcripts seemed to have a 3′/5′-end position difference of one to a few bases from each other, defined as microheterogeneity. On the other hand, different transcript end positions from the same genes appear to be clustered into obviously different groups, i.e., macroheterogeneity. For example, the gene estExt_gwp_1H.C_230040 also had two major position groups at 4 nt (39 ESTs) and 158 nt (18 ESTs). Moreover, the frequency distribution of EST counts in different relative 3′/5′-end positions for all gene transcripts indicated strong intragenic variations (see Figure 4, C and D). It is interesting to note that 95% of 77,972 ESTs had their relative transcript 5′-end positions no more than 3406 (nt) apart while 95% of 33,026 ESTs had their relative transcript 3′-end positions no more than 345 (nt) away. We have developed a web-based EST-to-genome browser that anchors individual ESTs to the genome, annotating differential transcript ends by emphasizing relevant cDNA termini. In Figure 5, individual C. reinhardtii ESTs with in silico-authenticated cDNA termini were mapped to the draft genome to generate a better gene structure model, in particular, uncovering substantial evidence of alternative polyadenylation in the transcripts. For instance, the maximum difference in poly(A) sites among individual transcript 3′ ends, indicated by differential 5TNSs, is >100 nucleotides. At the highest resolution, users can see the nucleotide alignment of individual ESTs with the genomic sequences, which provides visualization of insertion, deletion, and mismatches, along with cDNA termini emphasizing the transcript 3′/5′ ends.
In any EST project, the cleaning and trimming of spurious sequences from raw sequence reads is a critical prerequisite for accurate downstream results. There are many tools available for EST processing to remove or mask contaminants, repetitive elements, and low-complexity regions in sequence reads (see the review by Nagaraj et al. 2006). Unfortunately, these tools are limited in their ability to detect and cleanly trim spurious sequences with great confidence (Liang et al. 2006, 2007a). It is a common practice to conjoin adapter/linker sequences with vector sequences in a FASTA-format file and then trim or mask vector and adapter sequences using cross_match (http://www.phrap.org) or other programs (e.g., Adzhubei et al. 2006; Liang et al. 2006, 2007a). Such is the case with pregap4 in the popular, open-source Staden Package for sequence analysis (Staden 1996; staden.sourceforge.net). For the majority of ESTs, such a strategy appears to be effective and work properly. However, when the presented vector sequences are too short to be detected or are partially undetectable because of low quality, or when adapter sequences are in the middle of cDNA inserts and/or present as concatemers or other abnormal structures, these approaches typically miss the errors. Compared to other tools, MAGIC-SPP (Liang et al. 2006) did a better job at adapter cleaning through one dedicated Perl module that detected and trimmed adapter sequences at EST sequence ends. Unfortunately, it still had problems with cDNA inserts having abnormal, concatenated adapters (Liang et al. 2007a). Comparing three popular programs, SeqClean (compbio.dfci.harvard.edu/tgi/software/), cross_match (http://www.phrap.org), and LUCY (Chou and Holmes 2001) for EST vector trimming, Chen et al. (2007) recently reported that a significant number of errors remained after preliminary trimming using these programs, but that by relinearizing the cloning vector, i.e., rearrangement of cutting positions of a circular vector into a linear format in a FASTA file, the aforementioned errors were almost completely corrected. Using randomly sampled data from dbEST, they estimated that 1.63% (i.e., 575 of 35,363) of ESTs have vector contamination, a proportion that is higher than the previous estimation of 0.28–0.36% (Lamperti et al. 1992; Seluja et al. 1999). While these efforts are invaluable for cleaning up error-prone ESTs, they did not explore the essence of potential data abnormality in ESTs from a cDNA constructional perspective.
Adopting a different approach by examining cDNA terminal structures, we detected untrimmed remnants of cDNA termini (e.g., vector fragments and adapter sequences) and terminal structure abnormalities in numerous P. taeda ESTs deposited in GenBank dbEST (Liang et al. 2007b). Similarly in C. reinhardtii, we identified a large proportion of ESTs undertrimmed in their cDNA termini that delineate transcript 3′ or 5′ ends (from 8 to 93% dependent on different data sets and terminus types; Figure 2). Before GenBank dbEST submission, poly(A)/(T) tails are trimmed to retain a few adenines/thymines (usually 6–10) by some submitters for the purpose of directional indication of ESTs whereas they are completely trimmed off by others (Liang et al. 2006; Nagaraj et al. 2006). Therefore, it is understandable that we detected high proportions of untrimmed poly(T) tails within identified 5TNSs that denote transcript 3′ ends: dbEST sequences (93%) and community sequences (8%). In our opinion, however, what causes potential problems is the inconsistency in poly(A)/(T) trimming. For example, 75% of community sequences with a 5TNS identified show complete removal of poly(T) whereas 17% are overtrimmed and 8% undertrimmed. We believe such inconsistency will cause problems in downstream data integration where EST directional information is utilized. On the other hand, our data clearly revealed that current EST cleaning steps leave behind unclean remnants of cDNA termini that delimit transcript 5′ ends. Jain et al. (2007) adopted LUCY for contaminant trimming of raw C. reinhardtii ESTs before the creation of EST assemblies, ACEGs. Unfortunately, of their resultant community sequences, we still detected a very high percentage (i.e., 69%) of ESTs that were undertrimmed in their 5TSSs, although this percentage was much lower than what we found in our dbEST sequences (i.e., 84%). It is worth mentioning that in comparison with other public C. reinhardtii EST clusters in Table 2, ACEGs appeared to be much cleaner on the basis of the occurrence of 5TSS and its remnants confirmed by genome-mapping results. Without a doubt, their unique EST-clustering protocol, which took advantage of the draft genome sequences, had been of significant help in correcting EST errors and resulted in better EST assemblies. However, a key consideration here is the number of species, like P. taeda, that in the near future will not have draft/complete genome sequences available for EST validation and clean up and what alternatives exist for quality control for the numerous error-prone ESTs obtained from these species.
In our opinion, trimming EST sequences without keeping track of cDNA-terminal information not only diminishes our opportunities to uncover many signs having biological meaning, but also increases the difficulty for quality control and validation of error-prone ESTs that are widely used in many downstream EST applications. By examining cDNA terminal structures, we reported for the first time “double-termini adapters” in P. taeda (Liang et al. 2007b), which is defined as a palindrome linker, made from two mutually exclusive terminus components (e.g., the adapter 5′-CCTCGTGCC-3′ from 3TNS, the EcoRI site 5′-GAATTC-3′ from either 3TNS or 5TSS, and the adapter 5′-GGCACGAGG-3′ from 5TSS, where 5TSS and 3TNS should be, in theory, mutually exclusive as per Figure 1) that bring separate 3′ and 5′ directional sequence fragments together into a single, chimeric EST sequence. Interestingly, we detected a reduced occurrence of double-termini adapters in C. reinhardtii in comparison with that in P. taeda (315 vs. 6904). Like in P. taeda ESTs, we also detected many ESTs with complex and abnormal terminal structures (e.g., ESTs with a 5TSS–5TNS terminal pair) in addition to the double-termini adapters. Although further efforts are definitely required to characterize these abnormalities and explore their potential biological meanings, it is clear to us that double-termini adapters might be one of many underappreciated wet-lab artifacts embedded in the large EST data sets. Without inspecting cDNA terminal structures, current bioinformatics protocols and tools are unlikely to appreciate and remove these data abnormalities, which create significant problems for downstream processes like EST clustering (Liang et al. 2007b). Our terminal-scanning results in public C. reinhardtii EST clusters also proved this (see Table 2).
Due to the availability of the C. reinhardtii draft genome, we were able to use genomic sequences to confirm the in silico verification of identified cDNA termini. We modified the most accurate and sensitive algorithm for local sequence alignment, Smith–Waterman (Smith and Waterman 1981; Li et al. 2004), to scan the whole-genome sequence for all the hits that match our minimum terminus sequences (i.e., 24 nt for termini 5TSS and 3TNS and 25 nt for termini 3TSS and 5TNS, allowing for a maximum of four errors) and detected only two hits in the genome, both of which are not perfect matches, but with four base errors. Examining our EST–genome-mapping results, it is interesting to note that there are a large number of sequences (i.e., 6–18%, positive distances in Figure 3) in which the actually mapped start/end positions are several bases off of the detected cDNA insert start/end positions. Nontemplated (nongenomically encoded) nucleotide addition before 3′-polyadenlyation sites proved to be common in Arabidopsis mRNAs (Jin and Bian 2004). Recently, nontemplated poly(A) tails have been reported to be present in the 5′ end of mRNAs in maize plants (Gowda et al. 2006). Because the majority of our nontemplated nucleotides were within the high-quality regions, our data suggest that nontemplated nucleotide additions frequently exist at both transcript 3′/5′ ends in C. reinhardtii. Further work is definitely needed to explore this interesting finding. Meanwhile, a large number of ESTs had their genome-mapping hits starting inside the terminus region (i.e., negative distances in Figure 3). Although many different software packages are available for EST–genome mapping, GMAP provided higher-quality alignment and better gene structures more often than BLAT (Kent 2002) and other programs (i.e., MGAlign, SIM4, Spidey, and dds/gap2), even in the presence of substantial polymorphisms and sequencing errors (Wu and Watanabe 2005). However, we still noted that GMAP appears to provide more mismatches in the border regions in some ESTs when we presented their raw sequences for genome mapping. Obviously, this could be one of the potential reasons for the high proportion of sequences in which terminus nucleotides [e.g., from adapters or poly(A)/(T) tails, see Figure 3] have been mapped into the genomes. Moreover, the terminus nucleotide(s) from the adapters within 5TSS/3TNS that have been mapped to the genome appear to be due to random matches of the GC-rich adapter beyond the cDNA insert to the GC-rich genome (64%) (Merchant et al. 2007). Although random matching and software limitation could also explain why some terminus adenines/thymines have been mapped to the genome to some extent, the number of ESTs with the distance of −1 in Figure 3 [i.e., the genomic nucleotide that corresponds to the first adenine/thymine within an identified poly(A)/(T) tail in ESTs] was too high to justify this explanation. The single-nucleotide profile analysis revealed that this genomic nucleotide was predominately adenine (Shen et al. 2008), which was also evident in other species (Zhao et al. 1999; Graber et al. 1999; Loke et al. 2005).
EST terminus information has largely been ignored in public-domain databases because the majority of current sequence processing tools focus on cleaning/trimming spurious sequences in ESTs, rather than mining the trimmed part for potential useful information (Liang et al. 2007a). In theory, the directionally sequenced ESTs should contain information about cDNA termini and therefore most likely the 3′/5′ ends of mRNAs from which they were derived. While new technologies, such as LongSAGE (Wei et al. 2004) and 5′-RATE (454 Sequencing) (Gowda et al. 2006), are being developed for explicit exploration of transcript ends, in our opinion, it would be also advantageous to make use of existing large EST data sets to extract previously overlooked information, i.e., positional, directional, and structural aspects of cDNA insert termini. We are the first to map individual ESTs and anchor their cDNA termini to their genome for the exploration of differential 3′/5′ ends of transcripts derived from the same genes. Our preliminary data analysis clearly suggested strong macroheterogeneity and microheterogeneity of intragenic variations in the transcript 3′/5′-end positions, as delimited by cDNA termini (Figure 4). It is interesting to note that the majority of transcript 3′ ends differed by 0–345 nt; in contrast, the majority of transcript 5′ ends differ by 0–3406 nt. Although it is possible that mispriming of the oligo(dT) primer from a stretch of adenosyl residues internal to the mRNA happens during cDNA library construction, we believe that most of our transcript 3′ ends, delimited by the post-transcriptional poly(A) tails in 3TSS [or poly(T) tails in 5TNS] and confirmed by genome-mapping results, represent the real polyadenylation sites of mRNAs. On the other hand, our transcript 5′ ends are delineated by the artificial cloning adapter. It is likely that many of the cDNAs will not be full length due to 5′ truncations of mRNAs and/or premature termination of the reverse-transcriptase reaction. Therefore, we have reasons to believe that many of our transcript 5′ ends do not faithfully represent their real 5′ ends of mRNAs. Compared with the transcript 3′ ends, the greater intragenic heterogeneity in transcript 5′ ends was more likely to be biased by wet-lab artifacts. It is also worth mentioning that our data analysis on transcript end heterogeneity was based on the current JGI gene catalog, which might contain many incomplete and incorrect gene annotations. Clearly, more work is necessary to explore intragenic heterogeneity of transcript ends and their relationships with genotypes and environmental and developmental cues. Nevertheless, our unique approach in examining cDNA terminal structures and mapping individual ESTs to the genome presents the research community with a new stream of data that will facilitate our ongoing efforts in accurate delineation of transcription ends, determination of gene structures, and genome annotation. As shown in Figure 5, ESTs with in silico-verified termini clearly indicated inaccurate predictions and incomplete gene structures from the JGI gene catalog. By examining 5TNS (or 3TSS) position differences, we can explore possible 3′-end alternative polyadenylation, which has proved to be an important post-transcriptional gene expression and regulation procedure (Hall-Pogar et al. 2005). In addition, our EST terminus data provide a new perspective to study 5′ alternative splicing, alternative first exons, and accurate 3′/5′-UTR determinations among others.
The significance of our work lies in the creation of a novel bioinformatics protocol for the processing (or reprocessing) of the ever-growing collection of raw EST trace files to create cleaner EST sequences for downstream applications and to extract previously overlooked information that can further our understanding of many biological processes. Moreover, the new approach we adopted to map individual ESTs with authenticated cDNA termini to the genome should definitely improve genome annotation and gene structure prediction and provide significant opportunities to explore transcript-related biological phenomena. Our C. reinhartdii EST terminus data are publicly available through the website http://www.conifergdb.org/chlamyest. Efforts are underway to analyze complex and abnormal termini, as well as the ESTs without any cDNA terminus identified, and to tune the analytic parameters for more accurate detection of various termini and their components. We will continue our efforts to clean up error-prone ESTs in many species and to explore other applications that utilize the new stream of data.
We thank Jeff Shrager from the Carnegie Institute of Washington and Erika Asamizu from KDRI for providing us the raw EST trace files as well as cDNA library construction information. We thank Erika Lindquist from JGI who makes their final EST sequences available for public download. We greatly appreciate Olivier Vallon from Institut de Biologie Physico-Chimique, France for sending us their ACEG data and providing critical suggestions and comments to improve our system as well as the manuscript. Special thanks also go to John Karro, Hank Stevens, Linda Hartmann, and Ryan Cook from Miami University and Jeff Dean from the University of Georgia for helpful discussions, feedback about our system, and/or help with the manuscript. This work was supported in part by the new faculty start-up grant from Miami University (to C.L.), by grants from the Center for Academic Computational Research of Miami University (to C.L. and Q.Q.L.), and by a grant award from the Ohio Plant Biotechnology Consortium (no.401384 to Q.Q.L. and C.L.).
Communicating editor: O. Vallon
- Received December 10, 2007.
- Accepted March 6, 2008.
- Copyright © 2008 by the Genetics Society of America