The yeast genetics community has embraced genomic biology, and there is a general understanding that obtaining a full encyclopedia of functions of the ∼6000 genes is a worthwhile goal. The yeast literature comprises over 40,000 research papers, and the number of yeast researchers exceeds the number of genes. There are mutated and tagged alleles for virtually every gene, and hundreds of high-throughput data sets and computational analyses have been described. Why, then, are there >1000 genes still listed as uncharacterized on the Saccharomyces Genome Database, 10 years after sequencing the genome of this powerful model organism? Examination of the currently uncharacterized gene set suggests that while some are small or newly discovered, the vast majority were evident from the initial genome sequence. Most are present in multiple genomics data sets, which may provide clues to function. In addition, roughly half contain recognizable protein domains, and many of these suggest specific metabolic activities. Notably, the uncharacterized gene set is highly enriched for genes whose only homologs are in other fungi. Achieving a full catalog of yeast gene functions may require a greater focus on the life of yeast outside the laboratory.
SEVERAL years ago we projected that, judging from the steady and predictable increase in database entries, all yeast genes would have “known” functions by mid-2007 (Hughes et al. 2004). This was an optimistic estimate, since what was known at the time about many of the “known” genes was unsatisfying (or even conflicting) in terms of actual understanding of physiological purpose or molecular role. Nonetheless, it seemed reasonable, and still does, that yeast will be the first organism for which the functions of all the genes are characterized to a degree that would satisfy most molecular biologists.
Characterizing the functions of all of the yeast genes is not a senseless academic pursuit. Yeast is a part of everyday life, being critical to baking and brewing. Yeasts can also be pathogenic. Much of what is known about how all eukaryotic cells work has come from studying yeast. And the possibility that something would be known about virtually all yeast genes only a decade or so after the initial sequencing [at which time only ∼1000 genes appeared in the literature, and roughly half could be ascribed some function on the basis of sequence features (Goffeau et al. 1996)] could be taken as a harbinger of what might eventually be anticipated from similar efforts in more complicated organisms with larger genomes and more genes, such as humans. Examination of the path taken to such an achievement, including any hurdles or missteps, should provide a level of guidance. Although many organisms now have a genome sequence and a set of useful genetic tools, yeast has been the major proving ground for large-scale application of new technologies, and in many respects remains the advance guard of functional genomics, proteomics, and systems biology, as outlined in other recent reviews (Bader et al. 2003; Mustacchi et al. 2006; Suter et al. 2006). Knowing what all the parts do is important if you want to know how a machine works.
It is now mid-2007. Are the functions of all 6000 yeast genes “known”? Given a very liberal definition of known function, this goal has nearly been reached. The commercial database used in our initial time line (Hughes et al. 2004) is no longer freely available. We therefore examined information present in the Saccharomyces Genome Database (SGD) (Nash et al. 2007), the public database that curates the yeast genes and also compiles Gene Ontology (GO) annotations, publications, interactions, and a host of additional information. As of March 20, 2007, there are only 38 genes with no information available whatsoever, and only 566 lack any annotations in any of the three major branches of Gene Ontology (biological process, molecular function, or cellular component), which is a somewhat fluid categorical data structure that incorporates information with a variety of evidence levels (Ashburner et al. 2000). This is, to be sure, a phenomenal achievement, having occurred mostly in a single decade.
Unfortunately, much of this annotation is derived from large-scale studies, and depth is lacking for many genes. In the last 3 years, the overall proportion of genes that SGD classifies as “uncharacterized ORF” has not changed a great deal, decreasing by only 285 genes (Figure 1). To be certain, gene characterization is not a binary attribute, and not everyone will agree on the definition or the individual assignments. Nonetheless, the list of “uncharacterized ORF” classifications confirms that the vast majority of SGD decisions are in agreement with our own sense of gene characterization. Although expectation differs depending on the type of protein and its proposed role, in our view gene characterization would typically be expected to include multiple lines of specific and consistent evidence, often encompassing phenotype, cell biology, and biochemistry. By manual analysis, we found only 53 (<5%) genes classified as uncharacterized by SGD that we considered characterized by conventional standards. For many of the uncharacterized ORFs, there is little or no experimental evidence to support any specific physiological function or even functionality—although there are many instances such as that of YDR185C: “Mitochondrial protein of unknown function; has similarity to Ups1p, which is involved in regulation of alternative topogenesis of the dynamin-related GTPase Mgm1p,” or YKL098W, which encodes a 357 amino acid “Putative protein of unknown function,” and interacts genetically with both CDC8 and SKP1, which are involved in DNA synthesis and mitosis, respectively.
As of March 20, 2007, there are still 1253 uncharacterized yeast genes listed on SGD—21% of all genes. Why so many? What might their functions be? How will we go about characterizing them? Previous analyses have outlined the relative strengths of specific approaches in functional genomics and proteomics, usually using cross-validation (i.e., sensitivity and specificity at reproducing the genes whose functions are already known in blinded tests) (von Mering et al. 2002; Wong et al. 2005; Myers et al. 2006). All methods work in at least some instances and, in selected cases, have been extremely successful. RNA and ribosome biogenesis categories, for example, which include hundreds of genes, often dominate functional predictions, since these genes and their encoded proteins share regulatory patterns, exist in relatively abundant protein complexes, and often contain conserved sequence features. Consequently, investigators have capitalized on large-scale data sets to study new RNA and ribosome biogenesis genes and their products (e.g., Grandi et al. 2002; Peng et al. 2003).
It is less clear how useful predictions from large-scale data are in all classes of gene function, however. Our previous analysis (Hughes et al. 2004) included predicted GO annotations for 122 proteins that were uncharacterized at the time, but whose GO assignments were supported by three or more large-scale data sets. While it is encouraging that 82 of these have since been characterized, only 23 were assigned to one of the precise predicted annotation categories, and 12 of these were in RNA processing or ribosome biogenesis. This is certainly higher than random, but cautions that predictions might best be used as a rough guide to areas for further exploration.
Thus, rather than evaluating the relative merits of the methods, here we instead consider the uncharacterized genes themselves, with an eye to why they are not yet known, how their functions might be understood, and what lessons might hold for similar efforts in other genomes.
ARE ALL OF THE UNCHARACTERIZED GENES REAL?
A trivial explanation for difficulty in associating uncharacterized genes with specific functions would be that they are not real genes. The yeast gene catalog changes occasionally as gene boundaries are redefined and small expressed ORFs are discovered (Fisk et al. 2006). Expression, or even localization (discussed below), is often taken as evidence that a gene is real, but neither is on its own conclusive proof of function. Even mutant phenotypes are not foolproof, as they can be caused by disruption of a regulatory sequence. Anecdotally, microarray analysis of yeast ORF deletion strains occasionally reveals misregulation of an adjacent gene on the chromosome (T. R. Hughes, unpublished observation), which could, in principle, have a phenotypic consequence that is unrelated to the function of the deleted gene. These caveats partially underlie the general notion that gene characterization should be based on more than one independent line of evidence.
Perhaps the best evidence that most of the uncharacterized yeast genes are bona fide genes comes from sequence attributes. The vast majority are conserved in syntenic positions in related yeasts, indicating selective pressure for retention (Cliften et al. 2003; Kellis et al. 2003) (Figure 2). Nearly half (538) also contain a domain structure cataloged on Pfam (Finn et al. 2006) (Figure 2). In addition, although uncharacterized genes are, on the whole, shorter than yeast genes in general (median 281 aa vs. 379 aa for all yeast proteins), nearly two-thirds (822/1253) are >200 amino acids long (Figure 2). ORFs of such length would occur at most a few dozen times in a random-sequence version of the yeast genome. Therefore, most, although probably not all, of the uncharacterized yeast genes are likely to be real genes.
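The random-sequence estimate can be checked with a back-of-the-envelope calculation. The sketch below is an illustration only, not the calculation behind the statement above. It assumes a genome of roughly 12 Mb and equal base frequencies (so that 3 of 64 codons are stops and 1 of 64 is ATG), and it counts every ATG on either strand as a candidate start. The function name and parameters are hypothetical.

```python
def expected_random_orfs(genome_bp, min_codons, p_stop=3 / 64, p_start=1 / 64):
    """Expected number of ORFs of at least `min_codons` codons in random sequence.

    Assumes independent, equally frequent bases (an approximation).
    """
    # Candidate start codons on both strands.
    starts = 2 * genome_bp * p_start
    # Probability that the next `min_codons` codons contain no stop codon.
    p_open = (1 - p_stop) ** min_codons
    return starts * p_open


# Assumed genome size of about 12 Mb; 200-codon threshold from the text.
print(f"{expected_random_orfs(12_000_000, 200):.1f}")
```

Under these assumptions the expectation is on the order of a couple of dozen ORFs, consistent with "at most a few dozen." The AT-rich composition of the real genome makes stop codons more frequent than in this uniform model, which would lower the figure further.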
ARE THE UNCHARACTERIZED GENES TOO NEW TO HAVE BEEN STUDIED?
Another explanation for lack of functional knowledge on many yeast genes is that they have only recently been discovered. Systematic yeast gene names (e.g., YAL001W) added after the initial sequence assembly (Goffeau et al. 1996) typically carry a suffix (e.g., YAL001W-A) (Fisk et al. 2006). Of the 285 genes overall that carry such a suffix [not including “dubious ORFs,” which are no longer believed to be genes (Fisk et al. 2006)], 204 are uncharacterized. Not being in the database clearly puts a gene at a disadvantage for characterization: 188 of the 204 uncharacterized genes carrying a suffix were not in the initial deletions consortium collection (Giaever et al. 2002), which has formed the basis of hundreds of genetic screens over the last 5 years; similarly, 156 are not in the TAP/GFP collections (Ghaemmaghami et al. 2003; Huh et al. 2003). However, on the whole, most of the uncharacterized genes have been in databases for over a decade. The majority (982) are present in the initial deletions consortium collection and many others have been added subsequently (Kastenmayer et al. 2006) (Figure 2). In any case, lack of ready genetic reagents does not prevent proteins from appearing in pulldowns or emerging from traditional genetic screens. Therefore, although some of the uncharacterized genes are newer than other yeast genes, they are a fairly small minority (Figure 2).
DO UNCHARACTERIZED GENES HAVE ANY DISTINGUISHING CHARACTERISTICS IN LARGE-SCALE ANALYSES?
Uncharacterized genes tend to be underrepresented in virtually all types of large-scale data (Hughes et al. 2004), and this may partially explain why they are not studied as frequently; however, we found no criteria that completely separate currently uncharacterized from characterized genes. One of the greatest differences in distribution of properties is requirement for viability: <4% of uncharacterized genes (40) are listed as essential, in comparison to ∼19% of genes on the whole (Fisk et al. 2006). In the GFP localization database (Huh et al. 2003), there are no striking differences in distribution among subcellular categories, and while 72% of “verified” ORFs are localized (3491/4505), in comparison to 57% of uncharacterized ORFs (629/1094), this is at least partially due to the use of localization information in gene characterization since 2003. The difference in the distribution of expression levels between characterized and uncharacterized is significant, although very far from a bimodal distribution (Figure 3). Uncharacterized genes are also represented in interaction databases, although again less frequently than characterized genes. In the current BioGrid catalog (Stark et al. 2006), 683 uncharacterized genes have interactions by two hybrid, 243 have interactions by biochemical activity (e.g., phosphorylation), 457 have protein–protein interactions (including pulldowns but not two hybrid), and 197 participate in genetic interactions. Although the meaning of all of these interactions remains to be seen, it is certainly possible to draw hypotheses. Altogether, the number of uncharacterized genes cannot be generally attributed to their absence from large-scale data sets.
ARE THE UNCHARACTERIZED GENES NEEDED ONLY UNDER CERTAIN CONDITIONS?
Accepting that the current set of uncharacterized genes are in large part real genes, that most of them are present in current reagent sets (including the deletions collection), and that many are expressed under standard laboratory conditions and participate somehow in the life of the cell, we are faced with the possibility that there is something about their biology that makes them difficult to characterize. That they tend to lack strong phenotypes indicates that many of them are not needed under standard laboratory conditions. Two general explanations are (1) that the uncharacterized genes are redundant with other genes or with each other or (2) that they are important only under conditions that are not usually assayed in the laboratory. These explanations are related, are not mutually exclusive, and one or both may apply to only a subset of genes.
Are these explanations supported by data? Redundancy carries several expectations. First, since one mechanism for achieving redundancy is gene duplication, we might expect the uncharacterized ORFs to contain many duplicated genes. In fact, 161 uncharacterized proteins have sequences at least 50% identical to the sequences of another uncharacterized protein (median blast E-value 10⁻⁵¹), and these 161 proteins cluster into 54 groups (Figure 4). These clusters all contain between two and five proteins, with the exception of 26 Y′ helicase-like elements and 15 seripauperin (PAU) family genes, which may be cell-wall mannoproteins (Abramova et al. 2001). Y′ and PAU genes are found near telomeres, and in fact most of the duplicated genes overlap or are adjacent to telomeres, tRNAs, or transposon-like sequences (Figure 4), loci that are prone to duplication, recombination, and rearrangement. It is conceivable that at least some of these ORFs are parasitic, and/or react to unusual circumstances that affect the loci they occupy (e.g., telomeres). Their role could benefit from a large number of family members, for example, to achieve high copy number, rapid diversification, and/or redundancy, making them complicated to study by genetics and harder to identify in genetic screens. The uncharacterized gene set also contains several groups of proteins with similar domain structures, but more distant primary sequences; these could also be genetically redundant.
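Grouping proteins that share at least 50% identity into clusters can be sketched as a connected-components problem. The example below is one plausible approach (single-linkage grouping with a union-find structure), not necessarily the procedure used to produce Figure 4. The identifiers and identity values are invented placeholders, and parsing of real alignment output is not shown.

```python
from collections import defaultdict


def cluster_by_identity(pairs, min_identity=50.0):
    """Single-linkage clusters from (protein_a, protein_b, percent_identity) tuples.

    Proteins with no partner at or above the threshold are left out.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, identity in pairs:
        if identity >= min_identity:
            parent[find(a)] = find(b)  # merge the two groups

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical toy input; these are not real ORF names or measurements.
toy_pairs = [
    ("orfA", "orfB", 92.0),
    ("orfB", "orfC", 61.5),
    ("orfD", "orfE", 55.0),
    ("orfE", "orfF", 30.0),  # below threshold, so orfF is not grouped
]
clusters = cluster_by_identity(toy_pairs)
print(clusters)
```

Single linkage merges a family whenever any two members pass the threshold, which suits large, diverged families but can chain together distant relatives. A stricter rule would yield more, smaller groups.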
Second, redundancy might result in a substantial number of genetic interactions with and among uncharacterized genes. Genetic redundancy with >100 genes has been measured systematically across all strains in the deletion collection (Tong et al. 2004). In the BioGrid database (Stark et al. 2006), only 197 (or 18%) of uncharacterized genes have at least one genetic interaction, in comparison to 3275 (or 72%) of verified ORFs. This is far from conclusive evidence, as only a small proportion of the full genetic interaction network has been mapped (even under a single growth condition) and relatively few uncharacterized genes have been crossed to all other deletions. In general, it is more difficult to understand the mechanistic basis of genetic interactions than it is for physical interactions, and there is still some debate regarding expectation and scoring of double-mutant crosses. Nonetheless, this result does raise the possibility that at least some of the uncharacterized genes exist solely as genetic buffers, but also suggests that high genetic redundancy does not distinguish them from characterized genes.
If the uncharacterized genes are important only under specific conditions (particular situations that are not normally tested in the laboratory), then we would expect mutation of these genes to have little or no phenotype in general, although they may be constitutively expressed. Results obtained to date do roughly match this expectation. Another expectation of genes required to survive specific situations in nature is that their sequence features may be indicative of activities related to interaction with the environment. Many of them do contain sequence features indicating that they are biosynthetic enzymes (147) or transporters/permeases (40), consistent with a role in dealing with biochemistry in the wild. In line with this notion, the uncharacterized gene set is strongly enriched for genes that have a homolog in one or more other fungi, but not in any other organisms (Nishida 2006): 177 of the 405 such genes are uncharacterized. The remaining 228 characterized genes with exclusively fungal homologs are enriched for yeast-specific functions, including genes involved in cell-wall organization and biogenesis, mating, and cell–cell adhesion (Robinson et al. 2002).
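The strength of the overlap between fungal-specific and uncharacterized genes can be gauged with a hypergeometric test. The sketch below is an illustration rather than the analysis behind the statement above. It takes the counts quoted in the text (1253 uncharacterized genes, 405 genes with exclusively fungal homologs, 177 in both sets) and assumes a total of 6000 genes, since the text gives only an approximate figure. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k_obs, population, successes, draws):
    """Exact P(X >= k_obs) for `draws` items sampled without replacement from
    `population` items, of which `successes` carry the label of interest."""
    k_max = min(successes, draws)
    numerator = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_obs, k_max + 1)
    )
    return float(Fraction(numerator, comb(population, draws)))


TOTAL_GENES = 6000      # assumption: the text says only "~6000"
UNCHARACTERIZED = 1253  # from the text
FUNGAL_ONLY = 405       # from the text
OVERLAP = 177           # from the text

expected = FUNGAL_ONLY * UNCHARACTERIZED / TOTAL_GENES
p_value = hypergeom_upper_tail(OVERLAP, TOTAL_GENES, UNCHARACTERIZED, FUNGAL_ONLY)
print(f"expected overlap ~{expected:.0f}, observed {OVERLAP}, p = {p_value:.1e}")
```

With these inputs the observed overlap is roughly twice the number expected by chance, and the tail probability is vanishingly small. The exact value depends on the assumed gene total.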
HOW WILL WE COMPLETE THE ENCYCLOPEDIA OF THE CELL?
To summarize the evidence, a variety of factors are likely to contribute to the relatively large number of as-yet-uncharacterized yeast genes, including genetic redundancy, lack of strong growth phenotype, and the possibility that not all of them are real genes. In addition, many of the remaining genes may be involved in environmental and metabolic responses or growth modes not normally queried in the laboratory. Thus, despite the success of high-throughput methods in many instances, characterization of these remaining refractory genes may require a different tack.
Certainly there is no shortage of researchers and enthusiasm. On the basis of first-author names on articles cataloged at SGD, there were at least 9447 yeast researchers active between 2003 and 2007. Yeast molecular biologists typically jump at the chance to describe a new gene function, especially if it involves a topic of current interest; for example, RTT109, although previously known to influence Ty transposition (Scholes et al. 2001), was simultaneously described as encoding a histone H3-K56 acetyltransferase by at least five different groups in late 2006 and early 2007 (Schneider et al. 2006; Collins et al. 2007; Driscoll et al. 2007; Han et al. 2007; Tsubota et al. 2007).
Indeed, it does appear that present approaches to gene characterization result in a gradual increase in the proportion of genes with known functions (Figure 1). What is of concern is that if the trend in Figure 1 continues at its present course, characterization of all yeast genes (that would be likely to pass peer review) will be achieved in the neighborhood of the year 2020 (and likely later, assuming an asymptotic approach), which is much longer than many yeast enthusiasts would have hoped for. How might the conquering of uncharacterized yeast genes be accelerated? We can propose several concrete experimental possibilities, which might also be accompanied by some modification of how we think about the enterprise of gene characterization.
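The projection can be reproduced approximately by simple linear extrapolation. The sketch below does not use the data underlying Figure 1; instead it assumes that the figures quoted earlier in the text (1253 uncharacterized genes as of March 20, 2007, down by 285 over the preceding 3 years) describe a constant rate. The function name is hypothetical.

```python
def projected_completion_year(remaining, cleared, interval_years, as_of_year):
    """Year at which `remaining` reaches zero if `cleared` genes continue to be
    characterized every `interval_years` (a constant-rate assumption)."""
    rate_per_year = cleared / interval_years
    return as_of_year + remaining / rate_per_year


# March 20, 2007 is taken as roughly 2007.2.
year = projected_completion_year(
    remaining=1253, cleared=285, interval_years=3, as_of_year=2007.2
)
print(f"{year:.1f}")
```

With these inputs the projection lands in 2020, in line with the estimate above. Any slowdown as the remaining genes become harder to characterize would push the date later.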
First, a great deal might be accomplished simply by having seasoned yeast researchers systematically peruse the list of uncharacterized genes and their attributes—including what is known or predicted about them from sequence features or large-scale surveys—and encouraging those engaged in large-scale efforts to focus on the uncharacterized genes. Figure 5 gives an overview of information regarding the function or functionality of the 1253 currently uncharacterized genes, and the supplemental data at http://hugheslab.med.utoronto.ca/geneticsReview07/ contains a spreadsheet summarizing the information in Figures 2 and 5. Engaging individual researchers is often a major hurdle in proceeding from categorical predictions (which are what typically emerge from large-scale and computational analyses) to biological insight. Admittedly, individual labs may have already attempted to characterize these genes without much success (or alternatively, the articles are all in preparation). Assuming not, many of these genes may be approachable by the conventional one-gene-at-a-time hypothesis-driven approach or by devising assays that probe a specific pathway or activity. For example, the uncharacterized genes include 10 protein kinases; both genetic and biochemical array-based methods have been developed for the identification of kinase targets (Ptacek et al. 2005), thus providing clues for physiological roles. The uncharacterized gene list also includes several potential RNA-binding proteins [YLR271W (G-patch domain), YPL184C/MRN1 (RRM domain), YJL010C/NOP9 (PUF domain), YGR250C (RRM domain), and YFR032C (RRM domain)], whose ligands might be explored using microarrays (Gerber et al. 2004).
Given the large number of apparent biosynthetic enzymes and mitochondrial proteins, a global metabolite profiling system would be extremely helpful. Such a system has not yet entered into the mainstream, although the technology exists (Griffin 2006). Nonetheless, biochemical genomics (Martzen et al. 1999), in which genes are characterized by the activities of their purified products, and the related protein chip approach that deposits the same proteins on a microscope slide for parallel analysis (Zhu et al. 2001) appear to be partly filling the same niche. A surprising number of tRNA-modifying activities have been identified using biochemical genomics, including that encoded by TRM13, which is still listed on SGD as uncharacterized but was published this month as a tRNA 2′-O-methyltransferase (Wilkinson et al. 2007). Structural genomics is a form of “reverse” biochemical genomics: SDO1, which was changed from “uncharacterized” to “verified” by SGD in March 2007, was found by structural genomics to have an OB fold (a common oligonucleotide-binding motif), suggesting interaction with RNA, consistent with its coregulation with other RNA biogenesis factors (Savchenko et al. 2005). Data published this month show that SDO1 specifically controls maturation and translational activation of ribosomes (Menne et al. 2007). Chemical genetic profiling (Giaever et al. 2002; Parsons et al. 2006), which simultaneously probes the sensitivity of all of the deletion strains to a small molecule—often a metabolic inhibitor or an environmental chemical stress—also qualifies as a form of biochemical genomics and has the advantage of inherently revealing phenotypes.
While these examples highlight fundamental aspects of cell biology that are amenable to laboratory study, more exploration of the life of yeast in the wild would probably also be productive, given that nearly half of the fungal-specific genes remain uncharacterized. Examination of the lifestyle and genetics of yeast in its ecological niche is still in its infancy. In nature, Saccharomyces cerevisiae appears to be one of many opportunistic microorganisms found in rotting fruit (Mortimer and Polsinelli 1999; Fleet 2003). The cells presumably are dormant much of the time. Fruit flies prey on yeast, but they and other insects are also critical to the yeast life cycle, as they are believed to be the primary vector for transport of yeast (Mortimer and Polsinelli 1999). Thus, the ability to stick to insects through dormancy, without being eaten, might be a requisite for propagation of wild yeast. Chemical defenses and competitive nutrient utilization strategies are also undoubtedly essential for wild yeast; even human-made fermentations contain a wide variety of interacting microbes, including yeasts, other fungi, and bacteria (Fleet 2003). Indeed, in addition to the biosynthetic enzymes and transporters listed above, the uncharacterized genes include nine zinc-cluster transcription factors [most characterized members of this class of proteins regulate metabolic pathways or responses to environmental stress (MacPherson et al. 2006)] and a smattering of other proteins that appear likely to be involved in environmental adaptation. YNL234W, for example, is a direct target of Rgt1 (a glucose-responsive transcription factor) and encodes a functional globin (Sartori et al. 1999; Kaniak et al. 2004); however, its physiological purpose remains elusive.
Other uncharacterized proteins may be related to the cell surface or cell shape; YOR111W, for instance, consists almost entirely of a Maf domain, named for a protein that influences septum formation in bacteria by an unknown mechanism (Butler et al. 1993). Some environmental behaviors, such as filamentous growth (Kron 1997), might be better modeled in strains in addition to S288C; for instance, the filamenting strain Sigma 1278B is amenable to genetic study in the laboratory (e.g., Lorenz and Heitman 1998).
Finally, it may be time to reconsider the argument that gene functions need not manifest as cut-and-dried laboratory single-mutant phenotypes to be selected over evolutionary time. Marginal but reproducible fitness contributions have previously been noted for many single mutants in nonessential yeast genes (Thatcher et al. 1998), and more recent efforts to compile synthetic genetic interactions have indicated that the full yeast genetic network may encompass as many as 200,000 pairwise interactions (Davierwala et al. 2005). The effects of loss of tRNA modifications appear to be synthetic in at least some cases (Alexandrov et al. 2006), and these modifying enzymes seem to be among the more refractory genes with respect to rapid characterization. Similar genetic behavior might be expected of other currently uncharacterized genes.
At the time of the original sequencing effort, only an estimated 1000 of 6000 apparent genes had been previously described in a research article (Goffeau et al. 1996); a decade later, the running total is 4687, counting only genes that appear in focused articles (those that mention 10 or fewer yeast genes) (Fisk et al. 2006). By any standard, this is a phenomenal accomplishment, giving us confidence that the full gene complement can be characterized and optimism that similar accomplishments will be made during our lifetimes in other organisms, perhaps even mammals. Current technology appears to be sufficient to characterize most gene products at a cellular level, provided they are active and nonredundant under the conditions examined. It is possible to generate hypotheses regarding functions and associate them with confidence levels, but someone still has to follow them up. There is clearly a need for human inference and domain knowledge in the creation of new approaches for specific problems, including the characterization of individual genes and their roles in nature.
We are grateful to Mike Cherry, Dianna Fisk, Owen Ryan, Charlie Boone, Brenda Andrews, Guri Giaever, Allan Spradling, Mark Johnston, and Linda Bisson for helpful discussions and feedback on the manuscript.
Communicating editor: A. Spradling
Copyright © 2007 by the Genetics Society of America