| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Corresponding author: V. A. Kuznetsov, National Institute of Child Health and Human Development, NIH, Bldg. 13, Rm. 3W16, Bethesda, MD 20892-5772., vk28u{at}nih.gov (E-mail)
Communicating editor: G. A. CHURCHILL
| ABSTRACT |
|---|
Thousands of genes are expressed at such very low levels (
1 copy per cell) that global gene expression analysis of rarer transcripts remains problematic. Ambiguity in identification of rarer transcripts creates considerable uncertainty in fundamental questions such as the total number of genes expressed in an organism and the biological significance of rarer transcripts. Knowing the distribution of the true number of genes expressed at each level and the corresponding gene expression level probability function (GELPF) could help resolve these uncertainties. We found that all observed large-scale gene expression data sets in yeast, mouse, and human cells follow a Pareto-like distribution model skewed by many low-abundance transcripts. A novel stochastic model of the gene expression process predicts the universality of the GELPF both across different cell types within a multicellular organism and across different organisms. This model allows us to predict the frequency distribution of all gene expression levels within a single cell and to estimate the number of expressed genes in a single cell and in a population of cells. A random "basal" transcription mechanism for protein-coding genes in all or almost all eukaryotic cell types is predicted. This fundamental mechanism might enhance the expression of rarely expressed genes and, thus, provide a basic level of phenotypic diversity, adaptability, and random monoallelic expression in cell populations.
GENE expression within a cell is a complex process involving chromatin remodeling, transcription, and export of RNA from the nucleus to the cytoplasm where mRNA molecules are translated into proteins. The physiological activity and cell differentiation of a mammalian cell is controlled by 10,000 or more protein-coding genes associated with
300,000500,000 mRNA transcripts (![]()
![]()
![]()
![]()
![]()
![]()
![]()
Determination of biologically significant expressed genes in eukaryotic cells is a challenging biological problem (![]()
![]()
![]()
The statistics of expressed genes can be partially specified by the proportions of expressed genes that have one, two, etc. transcripts present in an associated mRNA sample (i.e., a normalized histogram of gene expression levels). Analysis of such empirical histograms using large-scale gene expression databases leads to models of the underlying gene expression level probability functions (GELPF) in a cell and in a population of cells. Interestingly, similar gene expression "patterns" in different cells were observed >25 years ago by RNA-DNA hybridization (![]()
![]()
![]()
A large body of experimental and theoretical literature on molecular mechanisms of gene expression control makes it increasingly evident that stochastic processes in transcription and translation machinery (as well as within signaling pathways and cross talk between different pathways) need to be considered to fully understand basic processes of gene expression. In particular, several experimental systems indicate that initiation of gene transcription is a discrete process in which many individual protein-coding genes existing in an off state can be stochastically switched to an on state resulting in the production of mRNAs in sporadic pulses (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In this study we present evidence that the functional form of the GELPF is invariant among eukaryotic cell types. Stochastic and probabilistic mechanisms of the initiation of the gene expression process can help explain the observed universality of the GELPF across different cell types in a multicellular organism and across different organisms. We describe a new distribution function and derive from it a probabilistic model of the growth of a population (e.g., the number of all transcripts) with many distinct classes (e.g., distinct expressed genes) in a complex system (e.g., observed SAGE transcriptome for a population of homogeneous cells) as sampling increases. This model allows us to predict the frequency distribution of gene expression levels for all genes and the total number of genes expressed in a representative cell averaged over time. The model exhibits predictive power even when the sequencing database is incomplete and contains ambiguity in sequence to gene assignments.
| RESULTS |
|---|
Distribution of the gene expression levels:
We have analyzed diverse large-scale gene expression databases for different human tissues and cell lines (http://www.ncbi.nlm.nih.gov/UniLib; http://www.ncbi.nlm.nih.gov/UNIGENE; http://www.ncbi.nlm.nih.gov/CGAP/ncicgap; http://www.ncbi.nlm.nih.gov/SAGE), mouse tissues (http://www.ncbi.nlm.nih.gov/UniLib), and yeast cells (ftp://genome-ftp.stanford.edu.pub/yeast/tables/SAGE_Data; http://www.sagenet.org; http://www.hsph.harvard/geneexpression) to identify the GELPF for eukaryotic cells. These data sets have been created by three different technologies: sequencing of clones in complementary DNA (cDNA) and SAGE libraries and oligonucleotide microarray hybridization methods. These techniques involve making cDNA sequences of the less stable mRNA molecules and then using specific short-nucleotide sequence tags that match different mRNAs to quantify their relative abundance in the cell sample. For each data set, we define its library as the list of sequenced tags that match mRNAs associated with genes, together with the number of occurrences of each specific tag. Let M denote the size of the library, i.e., the total number of tags in it, and let n(m, M) denote the number of distinct tags that have the expression level m (occurring m times) in the given library of size M. The observed value ñ(m, M) only approximates the number of expressed genes with expression level m in the cell sample due to experimental errors, nonunique tag-gene matching, and incorrect annotation of genes (see below). Let J denote the observed expression level for the most abundant tag in the library; J increases with the library size M. Then
is the number of distinct tags in the library. The points (m, g(m)) for m = 1, ... J, where
, form the histogram corresponding to the empirical relative frequency distribution of expressed genes. This is a size-frequency form of the empirical GELPF and it represents an estimate of the GELPF in the cell sample.
We found that the empirical GELPF histograms, constructed for analyzed yeast SAGE libraries, mouse and human SAGE or cDNA libraries (![]()
![]()
![]()
![]()
![]()
![]()
|
|
Several classes of skewed probability functions [Poisson, exponential, logarithmic series, simple power law, Pareto-like, and mixture of log-series and exponential (![]()
![]()
The best fit by our criteria was obtained using the discrete Pareto-like probability function,
![]() |
(1) |
where the f(m) is the probability that a randomly chosen distinct tag (representing a gene) occurs m times in the library. The function f involves two unknown parameters, k and b, where k > 0 and b > -1; the normalization factor z is the generalized Riemann zeta-function value,
. We call Equation 1 the generalized discrete Pareto (GDP) model. Note that J, the maximum observed expression level, is a sample-size-dependent quantity J = J(M). The parameter k reflects the skewness of the probability function; the parameter b characterizes the deviation of the GDP distribution from a simple power law (with b = 0; see, for example, dotted line in Fig 1A). The inset plot in Fig 1A demonstrates that the fitted GDP model predicts the empirical cumulative fraction function Re(m) (for the definition of Re see the Fig 1 legend); this demonstrates that our model fits well over the entire range of experimental values.
Fig 2 shows the frequency of the numbers of distinct open reading frames (ORFs)/genes vs. hybridization signal intensity values for Affymetrix microarray hybridization data obtained for normal yeast cell transcriptome (http://www.hsph.harvard/geneexpression; ![]()
![]()
0.5 copies per cell. The cooccurrence of the low-expressed genes/ORFs in three or more of the six analyzed microarrays (http://www.hsph.harvard/geneexpression) allowed us to suggest that at least 45% of the known 6200 genes/ORFs are present but at <1 copy per cell. Table 1 clearly shows that the skewed GELPFs from such filtered Affymetrix data for yeast cells at different phases of cell life are all very similar and are all fitted by the GDP model down to 0.5 transcripts per cell.
|
Regardless of either the method used to generate the gene expression profile (SAGE, cDNA library sequencing, or oligonucleotide microarray hybridization) or the species studied, we have observed that the GDP functional form fits the observed GELPFs. The parameters of the fitted GDP, however, show a significant dependence on sample size, specific eukaryote species, and methods used to generate the library (Table 1).
Effect of library size on gene expression level distribution:
Similarly sized libraries made using the same method from many different human tissues and cell lines have similar numbers of distinct gene tags and are characterized by empirical GELPFs with nearly equivalent parameters in their best-fit GDP models (see Table 1). As the size of a library increases, the shape of the empirical GELPF changes systematically: (1) p1, the fraction of distinct tags represented by only one copy, becomes smaller; (2) J increases in proportion to M; (3) the parameter b becomes larger; and (4) the parameter k increases and then slowly decreases (Fig 3; Table 1). Despite significant variation in human tissue types studied, the number of distinct tags, N, appears to be essentially invariant for the similar-size SAGE libraries (Fig 3B). Although the yeast genome is less complex, yeast SAGE libraries behave similarly (Table 1; Fig 4A). We also found that for yeast, mouse, and human SAGE and cDNA libraries, all values of the scaling parameter a (a = J/M, which represents the frequency of occurrence of the most common transcript within the library or cell population) fall within narrow ranges (Table 1). These observations suggest that all studied cell types have a common skewed underlying probability function form.
|
|
Importantly, in so-called "scale-free" (or self-similar, i.e., any part of the system is statistically similar to the whole) biological and physical systems, described by a simple power law (b = 0; j =
; k, z are the positive constants), the parameter b = 0 and the parameter k is assumed to be independent of the size of the system (![]()
![]()
![]()
10,000 SAGE tags, Table 1).
Although the Pareto-like models appear to fit empirical GELPFs down to the least transcript abundance observed (
0.20.5 copies per cell in yeast microarray experiments, Fig 2), theoretically these models demonstrate an unlimited increase in the number of species (i.e., different expressed genes) as the sample size approaches infinity. This contradicts the fact that there is a finite number of different mRNAs (different expressed gene products). Thus these models must be considered at best empirical approximations of an underlying probability. We have developed a construction model (see METHODS and Appendix) for the underlying probability distribution. When this distribution is finitely sampled, the results fit by Pareto-like GELPFs. The model explicitly exhibits the observed sample size dependence but retains a finite limit to the number of different classes as the sample size increases. Importantly, this model assumes that each expressed gene has a positive probability of being observed in any given sample and also that the expression level for this gene is statistically independent of the expression levels for other genes. The expression of those small groups of genes that are regulated by common sets of transcription factors would be expected to show correlations within any given cell. However, the average correlation between expression events for a given gene and all other (thousands of) expressed genes would likely be statistically insignificant. Furthermore, since expression profiles are obtained from cells at one instant, specific transcription events driven by the same transcription factors would have weaker correlations due to temporal and spatial fluctuations (e.g., chromatin dynamics) within a given cell and certainly when averaged over a large population of cells. This assumption of essentially statistical independence among all transcription events in a population of cells is consistent with experimental observations (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Analysis of the empirical GELPF using SAGE databases:
Using the LG model (see Equation 3 and Equation 4 and METHODS for a definition of the LG model) to fit empirical population growth curves like those presented in Fig 3B, we can predict the frequencies of the gene expression levels p1, p2, ... , for a given cDNA or SAGE library size (Fig 3C); pi is the probability that a random gene has i transcripts. Application of the LG model to human SAGE databases results in extremely large estimates (138,000 distinct tags expressed in brain and 127,000 expressed in a "typical" human tissue) compared to the total number of genes in the genome (30,00040,000 genes; INTERNATIONAL HUMAN GENOME SEQUENCING CONSORTIUM 2001; ![]()
![]()
![]()
![]()
![]()
![]()
![]()
Analysis of errors in SAGE cell libraries and prediction of the number of expressed genes in a population of cells:
Without removing experimental errors in SAGE libraries one cannot obtain an accurate estimate of the number of expressed genes, Nt, and the GELPF. Using our probabilistic model (Equation 2Equation 3 HREF="#FD4">Equation 4Equation 5, METHODS), we developed a computational methodology to estimate the true GELPF for a SAGE library, even when the SAGE library is incomplete (contains only a fraction of all expressed genes for the sample cell type). Our methodology is as follows.
To validate this approach, we analyzed 11,329 yeast cell distinct SAGE tags representing 59,494 SAGE tags of the three yeast cell SAGE libraries (![]()
![]()
6200 known yeast genes/ORFs. We found that 25% (2849 distinct tags) of the 11,329 analyzed distinct tags failed to match the yeast genome and these tags were associated with sequencing errors, 23.5% (2661 distinct tags) failed to match ORFs and adjacent 3' end regions, and 51.4% (5819 distinct tags) were classified as the true distinct tags. Also, we found that 1689 distinct tag sequences of the 2661 distinct failed tags matched anti-sense sequences within 1504 genes/ORFs.
Fig 4A shows empirical GELPF for distinct SAGE tags in the G2/M phase-arrested yeast cell library contains 19,527 tags with 5303 distinct tags. By filtering these tags by matching in the tag location database, we discovered 3239 erroneous tags (16.6% of the 19,527 tags in the library) corresponded to 2103 distinct tags. Most of the 2103 distinct erroneous sequences found in the set of 3239 erroneous tags occur only one or two times. The remaining 16,288 tags corresponded to 3200 distinct tags with matched 2936 genes/ORFs in the tag location database.
By sampling randomly from a pooled library containing all the observed true tags from the three SAGE libraries, we constructed population growth curves for both the number of distinct tags chosen and the corresponding number of different genes/ORFs found in the tag location database (Fig 4B). Sample size-dependent LG model (Equation 3 and Equation 4) fits detected numbers of both distinct true tags and different genes/ORFs. In the case of distinct true tags (
, Fig 4B), our estimator (Equation 5) once again predicts a very large value of 25,103 ± 2000 distinct true tags compared to the total number of known yeast genes. For genes/ORFs (, Fig 4B), a more reasonable estimate of the total number of expressed genes, Nt = 7025 ± 200, was obtained. After minor corrections (see Accuracy of an estimate of the number of expressed genes for yeast cells in METHODS), this estimate is consistent with ![]()
![]()
G, where G is the total number of genes in the entire yeast genome.
Using the estimated parameters c = 0.579 and d = 6580 in the LG function (Equation 3 and Equation 4, METHODS) and an estimate Mcell = 15,000 of the number of mRNAs per yeast cell (![]()
The GELPF in a single yeast cell:
The GELPF for a single yeast cell was estimated for corrected data (Fig 4A), using both the BD and GDP models (see METHODS); the results are presented in Fig 4B. To validate our mathematical models used to analyze SAGE data, we also determined the GELPF on the basis of Affymetrix microarray data sets (![]()
![]()
![]()
16,000 transcripts per cell were found. Fig 4C shows that the GELPF for the Affymetrix microarray data follows the GDP model
and is consistent with the GELPF for corrected SAGE data. Similar skewed frequency distributions were also observed (see examples in Table 1) in 30 other microarray experiments using normal and stressed yeast cells (![]()
![]()
![]()
| METHODS |
|---|
Binomial differential distribution and an estimator of the total number of expressed genes:
Let M denote the total number of transcripts in a given "error-free" library and let N denote the number of distinct gene tags (or tag/signals converting to genes) for that library. Let pm denote the probability that a randomly chosen gene is represented by m associated transcripts in the library for m = 1, 2 , ... On the basis of a multinomial distribution model for sampled transcripts, when M is large enough, we obtain the discrete probability function pm, in terms of M and N, as
![]() |
(2) |
(see the Appendix), where m = 1, 2, ... Note that N is treated as a function of M, so pm is a function of M. We call this function the binomial differential (BD) function. Taking m = 1 in Equation 2, we obtain the differential equation
![]() |
(3) |
with N(1) = 1. Equation 3 defines the "logarithmic growth" (LG) function N(M). p1 is a decreasing function of M (see Fig 3C). We use the empirical approximation (![]()
![]() |
(4) |
where c and d are positive constants. Using an explicit specification of p1 allows us to fit the BD and LG models to empirical histograms. With p1 defined by Equation 4, Equation 3 has an exact solution for N(M) in the limit as M
:
![]() |
(5) |
Nt is an estimator of the number of expressed genes in a large population of cells. Using Equation 2 with fitted values of the parameters d and c provides a mean of computing p1, p2, ... at a given library size M.
Estimation of the GELPF for a single cell:
First, we use the BD model (Equation 3) with fitted parameters c and d in p1(M) to compute the probability values p1, p2, ... , p6 for 3009 yeast genes/ORFs corresponding to the library size 15,000 transcripts. Because the GDP model is a good approximation of BD distribution at fixed M (see Fig 3, Fig 4C, and METHODS), it is acceptable to use the GDP model to estimate pm for larger m. (This use of the GDP model was necessary because there are no readily available numerical algorithms that do not accurately compute values of high-order derivatives.) We fit the GDP model (Equation 1) to the six points predicted by the BD probability distribution at constraints M
15,000, J
0.028*M and extrapolate the fitted GDP model to estimate values of pm for m > 6 (solid step line in Fig 4C). To check the self-consistency of our predictions, we estimated the total number of transcripts, M, from the fitted GDP model and noted that the result was 15,000.
Accuracy of the estimated number of expressed genes for yeast cells:
Our estimate, Nt = 7024 genes/ORFs, is
410% higher than current estimates of the total number of distinct ORFs in the yeast genome (62006760 genes/ORFs; ![]()
![]()
13% of transcripts would be expected to lack an NlaIII site and would therefore be missing in the database.
In the case of yeast,
5% of the genes show alternative splicing. Furthermore, splice variants might have the same primary tag, alternative tags, or become SAGE silent, depending on the restriction sites remaining. We summed all SAGE tags that matched ORFs to obtain the GELPF, so the only effect would be to miss the small number of splice variants lacking a NLaIII restriction site. We do not know the frequency distribution of alternative splicing transcripts in yeast and human cells. With respect to our model of GELPF, we might assume that splice variants for yeast cells will have a skewed form of the probability distribution and not have a significant effect on our estimate of the true GELPF.
Simulation of theoretical histograms:
Given the number of expressed genes, N, and the best-fit parameters k and b of the GDP distribution, we sample the values of m at random on the basis of the function f(m) (Equation 1) N times (once for each gene). Then we count the occurrence numbers of generated values m in the intervals (01], (1, 2], ... and construct the simulated gene expression level frequency histogram for a given value N. Note the corresponding value M is randomly determined by our sampling (see Fig 4C).
Goodness-of-fit analysis methods, numerical calculations, and software:
Parameters in models were estimated MLAB mathematical modeling software (Civilized Software, Silver Spring, MD, www.civilized.com). For goodness-of-fit analysis, we used the modified Akaike information criterion [or model selection criterion (MSC)],

where m = 1, 2, ... , J. In our case m is the expression level value and J is the maximum observed gene expression level in the library; g is the empirical relative frequency distribution, f is the theoretical probability distribution function with v unknown parameters, and E(g) is the mean value of observed data. Note, the
is independent of the scaling of data points.
ranges between excellent (118), very good (86), satisfactory (64), and poor (41).
We also used the cumulative fraction function R (see Fig 1), as well as several regular goodness-of-fit criteria (sum of squares for deviations, the Wilcoxon two-sample rank-order test).
By our goodness-of-fit criteria, the GDP model is superior to simple power law, as well as many other skew probability functions (Poisson, log-series, and exponential) and mixed logarithmic series + exponential distribution. For example, for library sizes >40,000 SAGE tags, the values of the
ranged between 3 and 6 (satisfactory or poor); however,
values ranged in (117) for the GDP model. Similar superiority of the GDP model (measured by the R and
criteria) was observed after goodness-of-fit analysis of the distribution models to microarray data.
Symbolic differentiation and subsampling were performed using MLAB. Monte Carlo experiments were performed using MLAB and programs written in Fortran-90. Data-mining tools of the Cancer Research Anatomy Project including X profiling and SAGE/map (![]()
| DISCUSSION |
|---|
Even with their large differences in genome organization yeast, mouse, and human cells all demonstrate similar skewed long-tail Pareto-like gene-expression level distributions. The observed distributions have the following characteristics in common: There are few redundant and many rare transcripts. The universality of the empirical GELPF form for different eukaryotic cells suggests a common underlying probabilistic mechanism associated with the gene expression process conserved in eukaryote evolution. Similar distributions have been observed for the connectivity numbers of metabolic networks (![]()
![]()
![]()
![]()
![]()
Both oligonucleotide microarray hybridization and construction of SAGE libraries allow large-scale characterization of gene expression profiles including low-abundance transcripts. However, in both technologies, the determination of expression levels at one or fewer transcripts per cell is limited by issues of limited sensitivity and erroneous measurements. These limitations become more severe with increasing size of the transcriptome. We developed a comprehensive statistical approach to analyzing the empirical distribution of expressed genes for large transcriptomes obtained by the SAGE method by first removing sequence errors using chromosome location maps for SAGE tags and then applying statistical modeling and the BD model (Equation 2Equation 3Equation 4 HREF="#FD5">Equation 5) to filtered data (Fig 4). Our resulting SAGE data were similar to the GELPF data that we obtained for normalized oligonucletide microarray hybridization data sets for yeast cells. Our methodology of construction of the correct underlying probability distribution could be used to analyze large SAGE transcriptomes for different mammalian cells and cell types, including human transcriptomes, and for evaluation of the new modifications of the SAGE method using, for example, 21-mer tags.
Our statistical modeling approach provides a justifiable way to compare the GELPFs using samples (cDNA or SAGE libraries) with different sizes. This approach also can be used to permit the use of exact statistical tests for different transcriptomes (i.e., obtained for normal and cancerous cell tissues). Our novel numerical estimator of the number of species (i.e., expressed genes; Equation 5) can be used to estimate the number of expressed genes in a single cell and in a population of the cells by SAGE or cDNA data sets, even if data are incomplete and exhibit severe experimental errors.
On the basis of our analysis of the GELPF in the normalized yeast microarray databases (![]()
![]()
47% (2917) of all yeast genes are expressed at less than one transcript per cell. Our estimate for yeast cell SAGE data is consistent with these estimates. Approximately 1200 genes/ORFs are represented on average by a single mRNA molecule per yeast cell (Fig 4B). The population growth curve for genes/ORFs (Fig 4B) predicts that 3800 additional genes/ORFs (
55% of all yeast genes) are expressed at less than one transcript per cell. Even in proliferating mammalian cells, the majority of genes are thought to be transcribed over a short period of time (47 min per transcript) and infrequently, less than once per hour (![]()
600,000 SAGE tags) human transcriptome data set (![]()
70% of all protein-coding human genes are expressed with less than one transcript per cell on average (![]()
![]()
![]()
![]()
![]()
Initiation of transcription has been observed to occur sporadically and randomly both in time and location on chromosomes in a variety of cell systems (![]()
![]()
![]()
![]()
![]()
![]()
150 of them are "questionable" or "hypothetical" ORFs) of
6200 genes/ORFs were not expressed in any of six presented microarray samples from normal growing yeast cells.
In mammalian cells, a significant fraction of genes are silenced (transcripts are not observed); the silent state of a gene can be inherited, but later reactivated involving the stochastic, all-or-none mechanism at the level of a single cell (![]()
![]()
Physically, random "basal" transcription of genes might reflect nonlinear responses of the independent "gene transcription complexes" to internal or external fluctuations including thermal molecular motion. Noise in nonlinear dynamical systems can play a constructive role: It can, for example, improve a system's sensitivity to weak signals (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In this study, we have analyzed only "snapshot" gene expression profiles, which are based on an "averaging" of the many thousands of cells in various states and which by themselves do not provide us direct information about regulatory relationships between expressed genes. Taking multiple times the samples of transcripts from homogenous cell populations or from single cells of this population, we could study the correlation between expression levels of genes. In this case, it would be interesting to construct a distribution model that takes into account the correlations between the levels of different transcripts. For example, the multivariate multinomial distribution model can be constructed (see ![]()
| ACKNOWLEDGMENTS |
|---|
The authors thank V. Velculescu for providing the supplementary information on the yeast cell SAGE database. We thank J. Berzofsky, I. Belyakov, K. Chumakov, R. Nossal, R. Strausberg, A. Strunnikov, T. Tatusov, and two reviewers for critical comments on the manuscript.
Manuscript received January 5, 2002; Accepted for publication March 31, 2002.
| APPENDIX |
|---|
A STOCHASTIC MODEL OF GENE EXPRESSION
Let us assume that Nt genes 1, 2, ... , Nt are expressed independently with Mt associated transcripts in total in the cells of a large cell population with respective probabilities q1, q2, ... qNt where Pr(a random transcript corresponds to gene i) = qi. Let the random variable si denote the number of transcripts for gene i in a random library of size M. Note
. When M << Mt, sampling with replacement is an acceptable model of library construction; this is described by a multinomial distribution. The joint probability of observing s1 = y1 mRNA transcripts of gene 1, ... ,
mRNA transcripts of gene Nt in a given library with size M is defined by the probability function,
, where
![]() |
(A1) |
The function f has the unknown parameters q1, q2, ... qNt and Nt, together with the constraints
and
. The marginal probability function fi (m, M) := Pr(si = m) is the probability that the unique tag for gene i occurs exactly m times in a library of size M:

We can estimate the expected number of distinct genes, n(m, M), which have m transcripts in our library of size M. Let
ij = 1 when i = j and 0 otherwise. Now,

Let the random variable
; GM is the number of distinct genes represented in the library of size M. We can estimate the expected number of genes N(M) in a given library as E[G] := N(M), where
![]() |
(A2) |
where n(0, M) denotes the expected number of distinct genes that escaped detection in the given library; n(0, M) is given as
.
Using the formulas for fi (m, M), n(m, M), and N(M) we can derive the recursion formula (![]()
![]() |
(A3) |
where m
{1, ... , M}. Also, n(m, M) = 0, if m > M.
is the backward difference operator
. These results allow us to compute n(m, M) for any given values of m and M. Recall we let
. Then, for large M we may approximate pm with its continuous analog and obtain the probability function called the binomial differential model (Equation 2, METHODS). This function has a skewed form and is approximated for large enough M and m by the power function pm
1/m2 (Lotka-Zipf law).
| LITERATURE CITED |
|---|
BISHOP, J. O., J. G. MORTON, M. ROSBASH, and M. RICHARDSON, 1974 Three classes in Hela cell messenger RNA. Nature 250:199-204.[Medline]
CANTOR, C. R., and C. L. SMITH, 1999 Genomics. John Wiley & Sons, New York.
CARON, H., B. VAN SCHAIK, M. VAN DER MEE, F. BAAS, and G. RIGGINS et al., 2001 The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science 291:1289-1292.
CHELLY, J., J.-P. CONCORDET, J.-C. KAPLAN, and A. KAHN, 1989 Illegitimate transcription: transcription of any gene in cell type. Proc. Natl. Acad. Sci. USA 86:2617-2621.
CHEN, J.-J., J. D. ROWLEY, and S. M. WANG, 2000 Generation of longer cDNA fragments from serial analysis of gene expression tags for gene identification. Proc. Natl. Acad. Sci. USA 97:349-353.
EDDY, S. R., 2001 Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. 2:919-928.[Medline]
FIERING, S., E. WHITELAW, and D. I. K. MARTIN, 2000 To be or not to be active: the stochastic nature of enhancer action. BioEssays 22:381-387.[Medline]
GOMEZ, S. M., S.-H. LO, and A. RZHETSKY, 2001 Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics 159:1291-1298.
HOLSTEGE, F. C. P., E. G. JENNINGS, J. J. WYRICK, T. I. LEE, and C. J. HENGARTNER et al., 1998 Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95:717-728.[Medline]
HUANG, S.-P. and B. S. WEIR, 2001 Estimating the total number of alleles using a sample coverage method. Genetics 159:1365-1373.
HUME, D. A., 2000 Probability in transcriptional regulation and implications for leukocyte differentiation and inducible gene expression. Blood 96:2323-2328.
Initial sequencing and analysis of the human genome. (2001) Nature 409:860-921.[Medline]
IYER, V. and K. STRUHL, 1996 Absolute mRNA levels and transcriptional initiation rates in Saccharomyces cerevisiae.. Proc. Natl. Acad. Sci. USA 93:5208-5212.
JACKSON, D. A., A. POMBO, and F. IBORRA, 2000 The balance sheet for transcription: an analysis of nuclear RNA metabolism in mammalian cells. FASEB J. 14:242-254.
JELINSKY, S. A. and L. D. SAMSON, 1999 Global response of Saccharomyces cerevisiae to alkylating agent. Proc. Natl. Acad. Sci. USA 96:1486-1491.
JELINSKY, S. A., P. ESTEP, G. M. CHURCH, and L. D. SAMSON, 2000 Regulatory networks revealed by transcriptional profiling of damaged Saccharomyces cerevisiae cells: Rpn4 links base excision repair with proteasomes. Mol. Cell. Biol. 20:8157-8167.
JEONG, H., B. TOMBOR, R. ALBERT, Z. N. OTTVAI, and A.-L. BARABASI, 2000 The large-scale organization of metabolic networks. Nature 407:651-654.[Medline]
JOHNSON, M., 2000 The yeast genome: on the road to the gold age. Curr. Opin. Genet. Dev. 10:617-623.[Medline]
JOHNSON, N. L., S. KOTZ and A. W. KEMP, 1993 Univariate Discrete Distributions. John Wiley & Sons, New York.
JOHNSON, N. L., S. KOTZ and N. BALAKRISHNAN, 1997 Discrete Multivariate Distributions. John Wiley & Sons, New York.
KO, M. S. H., 1992 Induction mechanism of a single gene molecule: stochastic or deterministic. BioEssays 14:341-346.[Medline]
KUZNETSOV, V. A., 2001 Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. EURASIP J. Appl. Signal Proc. 4:285-296.
KUZNETSOV, V. A., 2002 Statistics of the numbers of transcripts and protein sequences encoded in the genome, pp. 125171 in Computational and Statistical Approaches to Genomics, edited by W. ZHANG and I. SHMULEVICH. Kluwer, Boston.
LAL, A., A. E. LASH, S. F. ALTSCHUL, V. VELCULESCU, and L. ZHANG et al., 1999 A public database for gene expression in human cancers. Cancer Res. 59:5403-5407.
LI, W., 1999 Statistical properties of open reading frames in complete genome sequences. Comput. Chem. 23:283-301.[Medline]
MCADAMS, H. H. and A. ARKIN, 1999 It's a noisy business! Genetic regulation at the nanomolar scale. Trends Genet. 15:65-69.[Medline]
NEWLANDS, S., L. K. LEVITT, C. S. ROBINSON, A. B. KARPF, and V. R. HODGSON et al., 1998 Transcription occurs in pulses in muscle fibers. Genes Dev. 12:2748-2758.
OHLSSON, R., A. PALDI, and J. A. MARSHALL GRAVES, 2001 Did genomic imprinting and X chromosome inactivation arise from stochastic expression? Trends Genet. 17:136-141.[Medline]
RAMSDEN, J. J. and J. VOHRADSKY, 1998 Zipf-like behavior in prokaryotic protein expression. Phys. Rev. E 58:7777-7780.
ROSS, I. L., C. M. BROWNE, and D. A. HUME, 1994 Transcription of individual genes in eukaryotic cells occurs randomly and infrequently. Immunol. Cell. Biol. 72:177-185.[Medline]
SANO, Y., T. SHIMADA, H. NAKASHIMA, R. H. NICHOLSON, and J. F. ELIASON et al., 2001 Random monoallelic expression of three genes clustered within 60 kb of mouse t complex genomic DNA. Genome Res. 11:1833-1841.
STANLEY, H. E., S. V. BULDYREV, A. L. GOLDBERGER, S. HAVLIN, and C. K. PENG et al., 1999 Scaling features of noncoding DNA. Physica A 273:1-18.[Medline]
STOLLBERG, J., J. URSCHITZ, Z. URBAN, and C. D. BOYD, 2000 A quantitative evaluation of SAGE. Genome Res. 10:1241-1248.
STRAUSBERG, R., K. H. BUETOW, M. R. EMMERT-BUCK, and R. D. KLAUSNER, 2000 The Cancer Genome Anatomy Project: building an annotated gene index. Trends Genet. 16:103-106.[Medline]
SUTHERLAND, H. G., M. KEARNS, H. D. MORGAN, A. P. HEADLEY, and C. MORRIS et al., 2000 Reactivation of heritably silenced gene expression in mice. Mamm. Genome 11:347-355.[Medline]
VELCULESCU, V. E., L. ZHANG, B. VOGELSTEIN, and K. W. KINZLER, 1995 Serial analysis of gene expression. Science 270:484-487.
VELCULESCU, V. E., L. ZHANG, W. ZHOU, J. VOGELSTEIN, and M. A. BASRAI et al., 1997 Characterization of the yeast transcriptome. Cell 88:243-251.[Medline]
VELCULESCU, V. E., S. L. MADDEN, L. ZHANG, A. E. LASH, and J. YU et al., 1999 Analysis of human transcriptomes. Nat. Genet. 23:387-388.[Medline]
VENTER, J. C., M. D. ADAMS, E. W. MYERS, P. W. LI, and R. MURAL et al., 2001 The sequence of the human genome. Science 291:1304-1351.