Abstract
Thousands of genes are expressed at such very low levels (≤1 copy per cell) that global gene expression analysis of rarer transcripts remains problematic. Ambiguity in identification of rarer transcripts creates considerable uncertainty in fundamental questions such as the total number of genes expressed in an organism and the biological significance of rarer transcripts. Knowing the distribution of the true number of genes expressed at each level and the corresponding gene expression level probability function (GELPF) could help resolve these uncertainties. We found that all observed largescale gene expression data sets in yeast, mouse, and human cells follow a Paretolike distribution model skewed by many lowabundance transcripts. A novel stochastic model of the gene expression process predicts the universality of the GELPF both across different cell types within a multicellular organism and across different organisms. This model allows us to predict the frequency distribution of all gene expression levels within a single cell and to estimate the number of expressed genes in a single cell and in a population of cells. A random “basal” transcription mechanism for proteincoding genes in all or almost all eukaryotic cell types is predicted. This fundamental mechanism might enhance the expression of rarely expressed genes and, thus, provide a basic level of phenotypic diversity, adaptability, and random monoallelic expression in cell populations.
GENE expression within a cell is a complex process involving chromatin remodeling, transcription, and export of RNA from the nucleus to the cytoplasm where mRNA molecules are translated into proteins. The physiological activity and cell differentiation of a mammalian cell is controlled by 10,000 or more proteincoding genes associated with ~300,000–500,000 mRNA transcripts (Bishopet al. 1974). The complete gene expression profile for a given set of cells is the list of all expressed genes, together with each gene's expression level defined as the average number of cytoplasmic mRNA transcripts per cell. Currently, gene expression profiling methods (e.g., serial analysis of gene expression (SAGE; Velculescu et al. 1995, 1999), cDNA, or oligonucleotide microarrays (Holstegeet al. 1998; Jelinsky and Samson 1999) measure gene transcripts from large numbers of cells (i.e., not a single cell) and cannot reliably detect the thousands of genes that are expressed at very low copy numbers (less than two per cell). Many of these lowerlevel transcripts may be essential for determining normal and pathological cell phenotypes (Chenet al. 2000; Ohlssonet al. 2001). However, a rationale for an extreme number of rare transcripts has remained unresolved.
Determination of biologically significant expressed genes in eukaryotic cells is a challenging biological problem (Bishopet al. 1974; Velculescuet al. 1999). An important current issue for gene identification is determining the true statistical distributions of the number of genes expressed at all possible expression levels, both in a single cell and in a population of cells. Identification of such distributions can provide a theoretical basis for accurately counting the number of expressed genes and the total number of genes in a given cell type and for better understanding the mechanism(s) governing the expression of thousands of genes at very low levels. The similar problem of estimating the distribution of species in a population or different alleles in a population has been intensively discussed (see for references Huang and Weir 2001).
The statistics of expressed genes can be partially specified by the proportions of expressed genes that have one, two, etc. transcripts present in an associated mRNA sample (i.e., a normalized histogram of gene expression levels). Analysis of such empirical histograms using largescale gene expression databases leads to models of the underlying gene expression level probability functions (GELPF) in a cell and in a population of cells. Interestingly, similar gene expression “patterns” in different cells were observed >25 years ago by RNADNA hybridization (Bishopet al. 1974). These and more recent gene expression data sets have demonstrated very broad ranges of gene transcript levels (i.e., from 0.1 to 20,000 transcripts per human cell; Bishopet al. 1974; Velculescuet al. 1999). However, a suitable theoretical model of the distribution of gene expression levels in a cell has not been previously identified due to undersampling, unreliable detection of many low abundance transcripts, experimental errors, and ambiguities in the identification of many transcripts.
A large body of experimental and theoretical literature on molecular mechanisms of gene expression control makes it increasingly evident that stochastic processes in transcription and translation machinery (as well as within signaling pathways and cross talk between different pathways) need to be considered to fully understand basic processes of gene expression. In particular, several experimental systems indicate that initiation of gene transcription is a discrete process in which many individual proteincoding genes existing in an off state can be stochastically switched to an on state resulting in the production of mRNAs in sporadic pulses (Ko 1992; Rosset al. 1994; Newlandset al. 1998; McAdams and Arkin 1999; Hume 2000; Sutherlandet al. 2000; Ohlssonet al. 2001; Sanoet al. 2001).
In this study we present evidence that the functional form of the GELPF is invariant among eukaryotic cell types. Stochastic and probabilistic mechanisms of the initiation of the gene expression process can help explain the observed universality of the GELPF across different cell types in a multicellular organism and across different organisms. We describe a new distribution function and derive from it a probabilistic model of the growth of a population (e.g., the number of all transcripts) with many distinct classes (e.g., distinct expressed genes) in a complex system (e.g., observed SAGE transcriptome for a population of homogeneous cells) as sampling increases. This model allows us to predict the frequency distribution of gene expression levels for all genes and the total number of genes expressed in a representative cell averaged over time. The model exhibits predictive power even when the sequencing database is incomplete and contains ambiguity in sequence to gene assignments.
RESULTS
Distribution of the gene expression levels: We have analyzed diverse largescale gene expression databases for different human tissues and cell lines (http://www.ncbi.nlm.nih.gov/UniLib; http://www.ncbi.nlm.nih.gov/UNIGENE; http://www.ncbi.nlm.nih.gov/CGAP/ncicgap; http://www.ncbi.nlm.nih.gov/SAGE), mouse tissues (http://www.ncbi.nlm.nih.gov/UniLib), and yeast cells (ftp://genomeftp.stanford.edu.pub/yeast/tables/SAGE_Data; http://www.sagenet.org; http://www.hsph.harvard/geneexpression) to identify the GELPF for eukaryotic cells. These data sets have been created by three different technologies: sequencing of clones in complementary DNA (cDNA) and SAGE libraries and oligonucleotide microarray hybridization methods. These techniques involve making cDNA sequences of the less stable mRNA molecules and then using specific shortnucleotide sequence tags that match different mRNAs to quantify their relative abundance in the cell sample. For each data set, we define its library as the list of sequenced tags that match mRNAs associated with genes, together with the number of occurrences of each specific tag. Let M denote the size of the library, i.e., the total number of tags in it, and let n(m, M) denote the number of distinct tags that have the expression level m (occurring m times) in the given library of size M. The observed value ñ(m, M) only approximates the number of expressed genes with expression level m in the cell sample due to experimental errors, nonunique taggene matching, and incorrect annotation of genes (see below). Let J denote the observed expression level for the most abundant tag in the library; J increases with the library size M. Then
We found that the empirical GELPF histograms, constructed for analyzed yeast SAGE libraries, mouse and human SAGE or cDNA libraries (Velculescu et al. 1995, 1999; Lalet al. 1999; Strausberget al. 2000), as well as Affymetrix microarray samples for yeast cells (Jelinsky and Samson 1999; Jelinskyet al. 2000), exhibited similar monotonically skewed shapes with a greater abundance of rarer transcripts and more gaps among the higheroccurrence expression levels (Figures 1 and 2).
Several classes of skewed probability functions [Poisson, exponential, logarithmic series, simple power law, Paretolike, and mixture of logseries and exponential (Johnsonet al. 1993)] were fit (see methods) to empirical gene expression level histograms for >50 human, mouse, and yeast SAGE libraries; 30 human cDNA libraries in Cancer Genome Anatomy Project (CGAP) databases (http://www.ncbinlm.nih.gov/CGAP; http://www.ncbinlm.nih.gov/SAGE); and 30 microarrays of normal and treated yeast cells (http://www.hsph.harvard/geneexpression; Holstegeet al. 1998).
The best fit by our criteria was obtained using the discrete Paretolike probability function,
Figure 2 shows the frequency of the numbers of distinct open reading frames (ORFs)/genes vs. hybridization signal intensity values for Affymetrix microarray hybridization data obtained for normal yeast cell transcriptome (http://www.hsph.harvard/geneexpression; Jelinsky and Samson 1999). After subtraction of background noise, the total (digital) hybridization signal intensity, s, was normalized to the typical number of mRNA molecules per yeast cell (Jelinsky and Samson 1999). The window plot in Figure 2 is the empirical frequency distribution of the number of genes expressed at s copies per cell. It has a skewed form with a long rightside tail and with a leftside tail down to threshold levels of reliable detection of at least ~0.5 copies per cell. The cooccurrence of the lowexpressed genes/ORFs in three or more of the six analyzed microarrays (http://www.hsph.harvard/geneexpression) allowed us to suggest that at least 45% of the known 6200 genes/ORFs are present but at <1 copy per cell. Table 1 clearly shows that the skewed GELPFs from such filtered Affymetrix data for yeast cells at different phases of cell life are all very similar and are all fitted by the GDP model down to 0.5 transcripts per cell.
Regardless of either the method used to generate the gene expression profile (SAGE, cDNA library sequencing, or oligonucleotide microarray hybridization) or the species studied, we have observed that the GDP functional form fits the observed GELPFs. The parameters of the fitted GDP, however, show a significant dependence on sample size, specific eukaryote species, and methods used to generate the library (Table 1).
Effect of library size on gene expression level distribution: Similarly sized libraries made using the same method from many different human tissues and cell lines have similar numbers of distinct gene tags and are characterized by empirical GELPFs with nearly equivalent parameters in their bestfit GDP models (see Table 1). As the size of a library increases, the shape of the empirical GELPF changes systematically: (1) p_{1}, the fraction of distinct tags represented by only one copy, becomes smaller; (2) J increases in proportion to M; (3) the parameter b becomes larger; and (4) the parameter k increases and then slowly decreases (Figure 3; Table 1). Despite significant variation in human tissue types studied, the number of distinct tags, N, appears to be essentially invariant for the similarsize SAGE libraries (Figure 3b). Although the yeast genome is less complex, yeast SAGE libraries behave similarly (Table 1; Figure 4a). We also found that for yeast, mouse, and human SAGE and cDNA libraries, all values of the scaling parameter a (a = J/M, which represents the frequency of occurrence of the most common transcript within the library or cell population) fall within narrow ranges (Table 1). These observations suggest that all studied cell types have a common skewed underlying probability function form.
Importantly, in socalled “scalefree” (or selfsimilar, i.e., any part of the system is statistically similar to the whole) biological and physical systems, described by a simple power law (b = 0; j = ∞; k, z are the positive constants), the parameter b = 0 and the parameter k is assumed to be independent of the size of the system (Stanleyet al. 1999; Jeonget al. 2000; Gomezet al. 2001). We did not observe such properties in the GELPFs; they display a nonlinear, rather than linear trend in loglog coordinates and a samplesize dependence. For example, Table 1 shows that the parameter b in the GDP model is significantly different from 0 for most data sets, and b becomes larger as the library size increases (see, for example, libraries 2892.1 and 2892.2 or libraries 154 and 154.1 in Table 1 and Figure 3a). In the case of SAGE libraries for different human cell types, the Pearson correlation coefficient between the library sizes and the values of parameter b equals 0.9. We observed that the values of parameter b approach 0 as M increases at relatively small sample sizes (for example, ~10,000 SAGE tags, Table 1).
Although the Paretolike models appear to fit empirical GELPFs down to the least transcript abundance observed (~0.2–0.5 copies per cell in yeast microarray experiments, Figure 2), theoretically these models demonstrate an unlimited increase in the number of species (i.e., different expressed genes) as the sample size approaches infinity. This contradicts the fact that there is a finite number of different mRNAs (different expressed gene products). Thus these models must be considered at best empirical approximations of an underlying probability. We have developed a construction model (see methods and appendix) for the underlying probability distribution. When this distribution is finitely sampled, the results fit by Paretolike GELPFs. The model explicitly exhibits the observed sample size dependence but retains a finite limit to the number of different classes as the sample size increases. Importantly, this model assumes that each expressed gene has a positive probability of being observed in any given sample and also that the expression level for this gene is statistically independent of the expression levels for other genes. The expression of those small groups of genes that are regulated by common sets of transcription factors would be expected to show correlations within any given cell. However, the average correlation between expression events for a given gene and all other (thousands of) expressed genes would likely be statistically insignificant. Furthermore, since expression profiles are obtained from cells at one instant, specific transcription events driven by the same transcription factors would have weaker correlations due to temporal and spatial fluctuations (e.g., chromatin dynamics) within a given cell and certainly when averaged over a large population of cells. This assumption of essentially statistical independence among all transcription events in a population of cells is consistent with experimental observations (Chellyet al. 1989; Ko 1992; Rosset al. 1994; Newlandset al. 1998; Fieringet al. 2000; Sanoet al. 2001). Even for synchronized yeast cells arrested in G1, S, and G2/Mphases of cell life, the empirical GELPFs in microarray experiments were very similar to each other and to the GELPF observed for yeast cells in logphase of cell growth (see Table 1). These results suggest that the shape of the empirical GELPFs is relatively robust to different correlations between expressed genes at least in normal yeast cells at different phases of cell life. In addition, our analyses of the seven microarray (Jelinsky and Samson 1999; Jelinskyet al. 2000) data sets for normal yeast cell samples and pooled three SAGE libraries (Velculescuet al. 1997) of normal yeast cell samples show that at least 2000 yeast genes/ORFs are expressed, but at <0.5 copies per cell on average (see, for example, Figures 2 and 4b). At the level of the individual cell, these rare transcription events can be treated as stochastic events. A mathematical description of our model of the GELPF is presented in methods and the appendix.
Analysis of the empirical GELPF using SAGE databases: Using the LG model (see Equations 3 and 4 and methods for a definition of the LG model) to fit empirical population growth curves like those presented in Figure 3b, we can predict the frequencies of the gene expression levels p_{1}, p_{2}, … , for a given cDNA or SAGE library size (Figure 3c); p_{i} is the probability that a random gene has i transcripts. Application of the LG model to human SAGE databases results in extremely large estimates (138,000 distinct tags expressed in brain and 127,000 expressed in a “typical” human tissue) compared to the total number of genes in the genome (30,000–40,000 genes; International Human Genome Sequencing Consortium 2001; Venteret al. 2001; Kuznetsov 2002). This demonstrates the wellknown discrepancy between the numbers of different expressed sequences in SAGE or Unigene libraries and the number of human genes. This large discrepancy can be attributed to a variety of sources including sequencing errors, multiple restriction sites on the same transcripts leading to multiple tags per gene, and alternative splicing (Lalet al. 1999; Velculescuet al. 1999; Chenet al. 2000; Stollberget al. 2000; Caronet al. 2001).
Analysis of errors in SAGE cell libraries and prediction of the number of expressed genes in a population of cells: Without removing experimental errors in SAGE libraries one cannot obtain an accurate estimate of the number of expressed genes, N_{t}, and the GELPF. Using our probabilistic model (Equations 2, 3, 4 and 5, methods), we developed a computational methodology to estimate the true GELPF for a SAGE library, even when the SAGE library is incomplete (contains only a fraction of all expressed genes for the sample cell type). Our methodology is as follows.
We selected only tags whose location on the yeast chromosome map coincided with proteincoding gene or ORF regions (called here “true tags”).
We constructed population growth curves for the numbers of different genes/ORFs found in the tag location database.
We fitted the growth curve for the numbers of distinct genes/ORFs by Equations 3 and 4.
We calculated N_{t}, using Equation 5, and, finally, calculated the true underlining GELPF, using Equation 2.
To validate this approach, we analyzed 11,329 yeast cell distinct SAGE tags representing 59,494 SAGE tags of the three yeast cell SAGE libraries (Velculescuet al. 1997, Table 1). Since almost all yeast proteincoding genes/ORFs and their location on chromosomes are known, we can obtain the true distinct tags and their expression levels in a yeast SAGE library by eliminating erroneous tags that fail to match known 3′ NLaIII genes/ORFs regions and adjacent 3′ end regions presented in the chromosome tag location database (http:genomehttp://www.stanford.edu/Saccharomyces). This database was generated by Velculescu et al. (1997) and currently contains 8480 distinct tags, matching 4735 of ~6200 known yeast genes/ORFs. We found that 25% (2849 distinct tags) of the 11,329 analyzed distinct tags failed to match the yeast genome and these tags were associated with sequencing errors, 23.5% (2661 distinct tags) failed to match ORFs and adjacent 3′ end regions, and 51.4% (5819 distinct tags) were classified as the true distinct tags. Also, we found that 1689 distinct tag sequences of the 2661 distinct failed tags matched antisense sequences within 1504 genes/ORFs.
Figure 4a shows empirical GELPF for distinct SAGE tags in the G2/M phasearrested yeast cell library contains 19,527 tags with 5303 distinct tags. By filtering these tags by matching in the tag location database, we discovered 3239 erroneous tags (16.6% of the 19,527 tags in the library) corresponded to 2103 distinct tags. Most of the 2103 distinct erroneous sequences found in the set of 3239 erroneous tags occur only one or two times. The remaining 16,288 tags corresponded to 3200 distinct tags with matched 2936 genes/ORFs in the tag location database.
By sampling randomly from a pooled library containing all the observed true tags from the three SAGE libraries, we constructed population growth curves for both the number of distinct tags chosen and the corresponding number of different genes/ORFs found in the tag location database (Figure 4b). Sample sizedependent LG model (Equations 3 and 4) fits detected numbers of both distinct true tags and different genes/ORFs. In the case of distinct true tags (○, Figure 4b), our estimator (Equation 5) once again predicts a very large value of 25,103 ± 2000 distinct true tags compared to the total number of known yeast genes. For genes/ORFs (●, Figure 4b), a more reasonable estimate of the total number of expressed genes, N_{t} = 7025 ± 200, was obtained. After minor corrections (see Accuracy of an estimate of the number of expressed genes for yeast cells in methods), this estimate is consistent with Cantor and Smith (1999) and Johnson (2000) estimates. Thus, we can suggest that all or almost all yeast genes are expressed in a growing normal yeast cell population, i.e., N_{t} ≈ G, where G is the total number of genes in the entire yeast genome.
Using the estimated parameters c = 0.579 and d = 6580 in the LG function (Equations 3 and 4, methods) and an estimate M_{cell} = 15,000 of the number of mRNAs per yeast cell (Velculescuet al. 1997), Equation 5 (methods) predicts 3009 genes/ORFs/cell. This estimate is consistent with our estimate for a single yeast cell in the G2/M phasearrested state (2936 genes/ORFs by SAGE data) and with our estimates for yeast cells by Affymetrix microarray data (Table 1).
The GELPF in a single yeast cell: The GELPF for a single yeast cell was estimated for corrected data (Figure 4a), using both the BD and GDP models (see methods); the results are presented in Figure 4b. To validate our mathematical models used to analyze SAGE data, we also determined the GELPF on the basis of Affymetrix microarray data sets (Jelinsky and Samson 1999; Jelinskyet al. 2000). Figure 4c shows a histogram constructed for the microarray hybridization experiment (Jelinskyet al. 2000) for normal yeast cells. We converted the hybridization intensity signal values to gene expression values with 0.5 transcripts per cell chosen as a reasonable lowlimit cutoff point (see also Figure 2). In this case, 3000 more highly expressed genes/ORFs representing ~16,000 transcripts per cell were found. Figure 4c shows that the GELPF for the Affymetrix microarray data follows the GDP model (k = 0.86 ± 0.001, b = 0.37 ± 0.003) and is consistent with the GELPF for corrected SAGE data. Similar skewed frequency distributions were also observed (see examples in Table 1) in 30 other microarray experiments using normal and stressed yeast cells (Holstegeet al. 1998; Jelinsky and Samson 1999; Jelinskyet al. 2000).
METHODS
Binomial differential distribution and an estimator of the total number of expressed genes: Let M denote the total number of transcripts in a given “errorfree” library and let N denote the number of distinct gene tags (or tag/signals converting to genes) for that library. Let p_{m} denote the probability that a randomly chosen gene is represented by m associated transcripts in the library for m = 1, 2, … . On the basis of a multinomial distribution model for sampled transcripts, when M is large enough, we obtain the discrete probability function p_{m}, in terms of M and N, as
Estimation of the GELPF for a single cell: First, we use the BD model (Equation 3) with fitted parameters c and d in p_{1}(M) to compute the probability values p_{1}, p_{2}, … , p_{6} for 3009 yeast genes/ORFs corresponding to the library size 15,000 transcripts. Because the GDP model is a good approximation of BD distribution at fixed M (see Figure 3, Figure 4c, and methods), it is acceptable to use the GDP model to estimate p_{m} for larger m. (This use of the GDP model was necessary because there are no readily available numerical algorithms that do not accurately compute values of highorder derivatives.) We fit the GDP model (Equation 1) to the six points predicted by the BD probability distribution at constraints M ≈ 15,000, J ≈ 0.028*M and extrapolate the fitted GDP model to estimate values of p_{m} for m > 6 (solid step line in Figure 4c). To check the selfconsistency of our predictions, we estimated the total number of transcripts, M, from the fitted GDP model and noted that the result was 15,000.
Accuracy of the estimated number of expressed genes for yeast cells: Our estimate, N_{t} = 7024 genes/ORFs, is ~4–10% higher than current estimates of the total number of distinct ORFs in the yeast genome (6200–6760 genes/ORFs; Cantor and Smith 1999; Johnson 2000). This relatively small difference could be due to the existence of erroneous and redundant tags that nevertheless match genes/ORFs and their adjacent genomic regions. Our analysis does not take into account nonannotated ORFs and overlapping ORFs that match the same tag. Additionally, ~1–3% of transcripts would be expected to lack an NlaIII site and would therefore be missing in the database.
In the case of yeast, ~5% of the genes show alternative splicing. Furthermore, splice variants might have the same primary tag, alternative tags, or become SAGE silent, depending on the restriction sites remaining. We summed all SAGE tags that matched ORFs to obtain the GELPF, so the only effect would be to miss the small number of splice variants lacking a NLaIII restriction site. We do not know the frequency distribution of alternative splicing transcripts in yeast and human cells. With respect to our model of GELPF, we might assume that splice variants for yeast cells will have a skewed form of the probability distribution and not have a significant effect on our estimate of the true GELPF.
Simulation of theoretical histograms: Given the number of expressed genes, N, and the bestfit parameters k and b of the GDP distribution, we sample the values of m at random on the basis of the function f(m) (Equation 1) N times (once for each gene). Then we count the occurrence numbers of generated values m in the intervals (0–1], (1, 2], … and construct the simulated gene expression level frequency histogram for a given value N. Note the corresponding value M is randomly determined by our sampling (see Figure 4c).
Goodnessoffit analysis methods, numerical calculations, and software: Parameters in models were estimated MLAB mathematical modeling software (Civilized Software, Silver Spring, MD, www.civilized.com). For goodnessoffit analysis, we used the modified Akaike information criterion [or model selection criterion (MSC)],
We also used the cumulative fraction function R (see Figure 1), as well as several regular goodnessoffit criteria (sum of squares for deviations, the Wilcoxon twosample rankorder test).
By our goodnessoffit criteria, the GDP model is superior to simple power law, as well as many other skew probability functions (Poisson, logseries, and exponential) and mixed logarithmic series + exponential distribution. For example, for library sizes >40,000 SAGE tags, the values of the Ψ ranged between 3 and 6 (satisfactory or poor); however, Ψ values ranged in (11–7) for the GDP model. Similar superiority of the GDP model (measured by the R and Ψ criteria) was observed after goodnessoffit analysis of the distribution models to microarray data.
Symbolic differentiation and subsampling were performed using MLAB. Monte Carlo experiments were performed using MLAB and programs written in Fortran90. Datamining tools of the Cancer Research Anatomy Project including X profiling and SAGE/map (Lalet al. 1999; http://www.ncbi.nlm.nih.gov/SAGE) were also used.
DISCUSSION
Even with their large differences in genome organization yeast, mouse, and human cells all demonstrate similar skewed longtail Paretolike geneexpression level distributions. The observed distributions have the following characteristics in common: There are few redundant and many rare transcripts. The universality of the empirical GELPF form for different eukaryotic cells suggests a common underlying probabilistic mechanism associated with the gene expression process conserved in eukaryote evolution. Similar distributions have been observed for the connectivity numbers of metabolic networks (Jeonget al. 2000), for the rates of protein synthesis of prokaryotic organisms (Ramsden and Vohradsky 1998), in different DNArelated phenomena (see Li 1999; Stanleyet al. 1999; Gomezet al. 2001 for references), and in many models of the selforganized systems (http://linkage.rockfeller.edu/wli/zipf). All such systems exhibit a strong stochastic component.
Both oligonucleotide microarray hybridization and construction of SAGE libraries allow largescale characterization of gene expression profiles including lowabundance transcripts. However, in both technologies, the determination of expression levels at one or fewer transcripts per cell is limited by issues of limited sensitivity and erroneous measurements. These limitations become more severe with increasing size of the transcriptome. We developed a comprehensive statistical approach to analyzing the empirical distribution of expressed genes for large transcriptomes obtained by the SAGE method by first removing sequence errors using chromosome location maps for SAGE tags and then applying statistical modeling and the BD model (Equations 2, 3, 4 and 5) to filtered data (Figure 4). Our resulting SAGE data were similar to the GELPF data that we obtained for normalized oligonucletide microarray hybridization data sets for yeast cells. Our methodology of construction of the correct underlying probability distribution could be used to analyze large SAGE transcriptomes for different mammalian cells and cell types, including human transcriptomes, and for evaluation of the new modifications of the SAGE method using, for example, 21mer tags.
Our statistical modeling approach provides a justifiable way to compare the GELPFs using samples (cDNA or SAGE libraries) with different sizes. This approach also can be used to permit the use of exact statistical tests for different transcriptomes (i.e., obtained for normal and cancerous cell tissues). Our novel numerical estimator of the number of species (i.e., expressed genes; Equation 5) can be used to estimate the number of expressed genes in a single cell and in a population of the cells by SAGE or cDNA data sets, even if data are incomplete and exhibit severe experimental errors.
On the basis of our analysis of the GELPF in the normalized yeast microarray databases (Jelinsky and Samson 1999; Jelinskyet al. 2000) 1330 ± 45 genes/ORFs are represented on average by a single mRNA molecule per cell, at least 2000 genes/ORFs are expressed at 0.1–0.5 molecules per cell on average, and ~47% (2917) of all yeast genes are expressed at less than one transcript per cell. Our estimate for yeast cell SAGE data is consistent with these estimates. Approximately 1200 genes/ORFs are represented on average by a single mRNA molecule per yeast cell (Figure 4b). The population growth curve for genes/ORFs (Figure 4b) predicts that 3800 additional genes/ORFs (~55% of all yeast genes) are expressed at less than one transcript per cell. Even in proliferating mammalian cells, the majority of genes are thought to be transcribed over a short period of time (4–7 min per transcript) and infrequently, less than once per hour (Jacksonet al. 2000). On the basis of a similar methodological approach for estimating the GELPFs for a SAGE transcriptome, which we used in this article for yeast SAGE data sets, the analysis of a large (~600,000 SAGE tags) human transcriptome data set (Velculescuet al. 1999) indicated that ~70% of all proteincoding human genes are expressed with less than one transcript per cell on average (Kuznetsov 2002). Such low numbers of transcripts in a cell population may be due to the action of a random transcription process in individual cells (McAdams and Arkin 1999; Fieringet al. 2000; Hume 2000; Sutherlandet al. 2000).
Initiation of transcription has been observed to occur sporadically and randomly both in time and location on chromosomes in a variety of cell systems (Ko 1992; Rosset al. 1994; Newlandset al. 1998; Sanoet al. 2001). In this study, we present additional data and arguments supporting our hypothesis that at the level of the individual cell the transcription events for a given gene at an instant appear to be statistically independent of expression levels for thousands of other genes. The existence of such a random transcription process would imply that all or almost all proteincoding genes in a genome should have a small but positive probability to be transcribed in any given cell during any fixed time interval. This suggestion is consistent with the observation that small transcript copy numbers occur even for various tissuespecific genes in human cells of different type, such as fibroblasts, lymphocytes, etc. (Chellyet al. 1989). Although not all cells of a population would have a copy of a specific transcript at a given moment, we would expect to see all these genes expressed, at least at a low level, in a sufficiently large cell population at any point in time. That is, ergodicity holds. This point is supported by the yeast expression data in the microarray database (Jelinskyet al. 2000): We observed that only 250 ORFs (~150 of them are “questionable” or “hypothetical” ORFs) of ~6200 genes/ORFs were not expressed in any of six presented microarray samples from normal growing yeast cells.
In mammalian cells, a significant fraction of genes are silenced (transcripts are not observed); the silent state of a gene can be inherited, but later reactivated involving the stochastic, allornone mechanism at the level of a single cell (Sutherlandet al. 2000). Lowprobability transcription events for many genes in a cell could be regulated by its own specific transcripts. Sporadic initiation of the transcription process for rarely expressed genes could also be under dynamical control of some nonproteincoding genes associated with stressresponse control (Eddy 2001). Many RNAs transcribed from these genes represent antisense RNA transcripts that overlap proteincoding genes on the other genomic strand. We might suspect that in response to environmental changes, various stress conditions, and local fluctuation of molecular composition, the initiation of transcription events for many rarely expressed genes in each cell could be under dynamical control of these noncoding genes. Such autoregulation might tend to keep the lowexpressed gene “onehalf on,” thus sporadically providing the mechanism of low expression for many genes in a cell population.
Physically, random “basal” transcription of genes might reflect nonlinear responses of the independent “gene transcription complexes” to internal or external fluctuations including thermal molecular motion. Noise in nonlinear dynamical systems can play a constructive role: It can, for example, improve a system's sensitivity to weak signals (Wiesenfeld and Jaramillo 1998). If it is so, noise in the geneexpression machinery could enhance weak transcription signals. This amplification mechanism for gene expression in eukaryotic cell types might provide a basic level of phenotypic diversity within a cell population and thus could facilitate adaptation of a population of cells. The stochastic variability in the case of rarely transcribed genes could also lead to changes in the genotype in which lineage commitment results from a selective rather than an instructive mechanism in cells (Ko 1992; Hume 2000; Sutherlandet al. 2000; Ohlssonet al. 2001; Sanoet al. 2001). Random initiation of gene transcription in a given cell would allow the “essential” rarely transcribed gene to provide the random switch between active and inactive states during the formation of daughter cells and thus to provide genotype diversity in the cell population. Because the distribution of the lifetimes of “switchon” and “switchoff” states for genes in a single cell has a long right tail (McAdams and Arkin 1999), one of the two alleles of the same locus for a given lowexpressed gene might be present in the same state for a long period of time. In this case, a natural clonal selection process in a population of the cells could select the clone(s) with a monoallelic gene expression, i.e., a phenomenon in which only a single copy, or allele, of a given gene of diploid organisms appears (Ohlssonet al. 2001). Further statistical analysis and modeling of the largescale gene expression data could help us to understand how the stochastic variability of gene expression in a single cell might lead to changes of the genotype repertoire in developing tissues and in an entire organism.
In this study, we have analyzed only “snapshot” gene expression profiles, which are based on an “averaging” of the many thousands of cells in various states and which by themselves do not provide us direct information about regulatory relationships between expressed genes. Taking multiple times the samples of transcripts from homogenous cell populations or from single cells of this population, we could study the correlation between expression levels of genes. In this case, it would be interesting to construct a distribution model that takes into account the correlations between the levels of different transcripts. For example, the multivariate multinomial distribution model can be constructed (see Johnsonet al. 1997). Such a distribution model would be suitable, in particular, for analysis of generegulating network data representing gene expression profiles for individual cells sampled with replicates from an “isogenic” cell population at several periods of observation. Identification of such a multivariate multinomial distribution model may help to estimate the biologically significant correlations between genes and evaluate the stochastic component in dynamics of gene expression clusters at low expression levels.
Acknowledgments
The authors thank V. Velculescu for providing the supplementary information on the yeast cell SAGE database. We thank J. Berzofsky, I. Belyakov, K. Chumakov, R. Nossal, R. Strausberg, A. Strunnikov, T. Tatusov, and two reviewers for critical comments on the manuscript.
APPENDIX: A STOCHASTIC MODEL OF GENE EXPRESSION
Let us assume that N_{t} genes 1, 2, … , N_{t} are expressed independently with M_{t} associated transcripts in total in the cells of a large cell population with respective probabilities q_{1}, q_{2}, … q_{Nt} where Pr(a random transcript corresponds to gene i) = q_{i}. Let the random variable s_{i} denote the number of transcripts for gene i in a random library of size M. Note
Using the formulas for f_{i} (m, M), n(m, M), and N(M) we can derive the recursion formula (Kuznetsov 2001)
Footnotes

Communicating editor: G. A. Churchill
 Received January 5, 2002.
 Accepted March 31, 2002.
 Copyright © 2002 by the Genetics Society of America