Originally published as Genetics Published Articles Ahead of Print on December 30, 2005.

Genetics, Vol. 172, 1809-1820, March 2006, Copyright © 2006
doi:10.1534/genetics.105.044099

Conditional Coalescent Trees With Two Mutation Rates and Their Application to Genomic Instability

TIMC–TIMB Department, Faculty of Medicine, Institut de l'Ingénierie de l'Information de Santé, 38706 La Tronche, France

1 Corresponding author: TIMC–TIMB Department, Faculty of Medicine, Institut de l'Ingénierie de l'Information de Santé, 38706 La Tronche, France.
E-mail: olivier.francois{at}imag.fr

Manuscript received April 6, 2005. Accepted for publication December 13, 2005.

ABSTRACT

Humans have invested several genes in DNA repair and fidelity replication. To account for the disparity between the rarity of mutations in normal cells and the large number of mutations present in cancer, an hypothesis is that cancer cells must exhibit a mutator phenotype (genomic instability) during tumor progression, with the initiation of abnormal mutation rates caused by the loss of mismatch repair. In this study we introduce a stochastic model of mutation in tumor cells with the aim of estimating the amount of genomic instability due to the alteration of DNA repair genes. Our approach took into account the difficulties generated by sampling within tumoral clones and the fact that these clones must be difficult to isolate. We provide corrections to two classical statistics to obtain unbiased estimators of the raised mutation rate, and we show that large statistical errors may be associated with such estimators. The power of these new statistics to reject genomic instability is assessed and proved to increase with the intensity of mutation rates. In addition, we show that genomic instability cannot be detected unless the raised mutation rates exceed the normal rates by a factor of at least 1000.


DNA replication in normal human cells is an extremely accurate process. During the cell division cycle, copy errors occur with probabilities <10–9–10–10 per nucleotide. In contrast, the malignant cells that constitute cancer tissues are markedly heterogeneous and exhibit alterations in the nucleotide sequence of DNA (e.g., BIELAS and LOEB 2005). To account for the disparity between the rarity of mutations in normal cells and the large numbers of mutations present in cancer, LOEB et al. (1974) hypothesized that during tumor progression, cancer cells must exhibit a mutator phenotype (see the review by LOEB et al. 2003). It is still a matter of debate to know exactly which event initiates tumorigenesis. But one hypothesis for the initiation of abnormal mutation rates in tumors is the loss of mismatch repair (MMR).

For instance, this phenomenon may follow from the inactivation of the genes hMSH2 and hMLH1 involved in hereditary nonpolyposis colorectal cancers (HNPCC) (FISHEL et al. 1993; LEACH et al. 1993; LINDBLOM et al. 1993). In normal conditions, the MMR repair system involves a complex interaction among the protein products of hMSH2 and hMLH1 genes. The result is to eliminate ~99.9% of the errors in DNA replication, reducing errors to a rate of ~1/1012 bp in genes that regulate the apoptosis or the cell cycle duration. HNPCC is inherited in an autosomal dominant fashion. One copy of the mutant allele is defective and is inherited in the germline. The loss of MMR may start when the second mutation occurs somatically as a consequence of the two-hits theory (MOOLGAVKAR and KNUDSON 1981).

Widespread genomic instability seems associated with MMR-defective genes. For example, microsatellite instability is associated with HNPCC (IONOV et al. 1993; PELTOMAKI et al. 1993; THIBODEAU et al. 1993). Detection of DNA instability is therefore a crucial step in view of noninvasive diagnosis of such forms of cancer. Because numerous mutations are required for the full development of cancer, inactivation of caretaker genes can greatly accelerate its development (KINZLER and VOGELSTEIN 2002). For an account of the etiology and genetic epidemiology of cancer with a statistical perspective a major review is by THOMAS (2004).

This study introduces a two-rates model of DNA mutation based on the infinitely many sites model (WATTERSON 1975). We consider a sample of n sequences taken from a pretumoral tissue and assume that loss of DNA repair has occurred once (and only once) during the history of the n sequences tracking back to their most recent common ancestor. We denote the mutational event by the formal symbol {Delta}. The event {Delta} is assumed to occur at a very low rate {delta}.

The loss of MMR (occurrence of {Delta}) may lead to a 10- to 1000-fold increase in the normal mutation rate µ0 (BHATTACHARYYA et al. 1994; SHIBATA et al. 1994). However, only the sequences that descend from {Delta} are concerned with such an increase in the mutation rate. Because heterogeneity prevails in cancer tissues and sampling from the tumor is difficult, we assume that an unknown random number of sequences among the sample descend from the mutation {Delta}.

The goal of this study is to provide statistical estimators for the raised mutation rate µ1 under the assumption that the normal rate µ0 is known, but the number of descendants of {Delta} is unknown. Two classical statistics are studied (see HARTL and CLARK 1997 for a review in a population genetics context). The first one is the nucleotide polymorphism computed as the average number of segregating sites in the DNA. The second one is the nucleotide diversity computed as the number of pairwise nucleotide differences. Our main contribution is the calculation of corrections to the classical statistics that are needed because the increase in the mutation rate concerns only a random subgenealogy of the sample.

In our study, the clonal evolution of mitotic cell divisions is assumed to be selectively neutral. More precisely, the evolution of the tissue is approximated by a continuous branching process where one cell dies at random after each division (MORAN 1962). At least in the early stages of progression toward tumor cells, this model may be consistent with instability theory. Nevertheless the assumption of selective neutrality is still a source of controversy. Opponents of Loeb's theory support the hypothesis that the number of pretumoral cancer cells increases rapidly with time (see TOMLINSON et al. 1996, 2002; TOMLINSON and BODMER 1999; SIEBER et al. 2003). An alternative to the approach developed here would therefore include the rapid growth of tumor clones and the selective advantage of pretumoral cells. It is clear that such assumptions complicate the model and its analysis significantly. Although we believe that the recent contributions by KRONE and NEUHAUSER (1997), STEPHENS and DONNELLY (2003), and COOP and GRIFFITHS (2004) may allow some progresses in this respect, the selection perspective will not be presented here.

Under the neutral assumption, we model the genealogies of DNA sequences using conditional coalescent trees (WIUF and DONNELLY 1999; GRIFFITHS and TAVARÉ 2003; TAVARÉ 2004). This formalism has been developed for the primary purpose of estimating the age of an allele (GRIFFITHS and TAVARÉ 1998; STEPHENS 2000). So far, evolutionary models have been introduced for dating the loss of MMR (TSAO et al. 2000; CALABRESE et al. 2004). TSAO et al. (2000) observed microsatellite alleles in noncoding regions, assuming neutrality as well. However, the need for further mathematical studies has been emphasized in a recent review to better understand the influence of existing hypotheses in the evolution of cancer (MICHOR et al. 2004).

In the next section, we define our notation and give an account of the existing results in the theory of conditional trees. In addition, we extend many results of the theory to encompass other times or ages useful in the context of genomic instability and describe an efficient way for simulating conditional trees. In NUCLEOTIDE POLYMORPHISM and NUCLEOTIDE DIVERSITY, we introduce unbiased estimators of the raised mutation rate µ1 based on the number of segregating sites and the number of pairwise differences within the sample. The statistical errors and the power of tests based on these estimators are then compared using Monte Carlo methods.


CONDITIONAL COALESCENT TREES

Model and notations:

We consider a sample of n copies of a gene at a particular DNA locus taken from a pretumoral tissue and assume that the loss of MMR (event {Delta}) occurred once in the sample history. However, the date and place at which this event occurred in the sample genealogy are unknown. Mathematically, we consider taking the limit as the rate of occurrence {delta} tends to zero conditional on {Delta} having occurred. In further statements the symbol = therefore often replaces the limit symbol as {delta} goes to zero.

The sample is divided into two random complementary subsamples Formula and Formula. The cardinality of Formula is a random variable denoted by B. Given the number B = b of sequences in Formula, the number of sequences in Formula is then c = n b. As usual, in studies of conditional coalescent trees, the analysis requires two levels of conditioning. At the first level, the sample has the property that all sequences in Formula are descendants of the particular mutation {Delta} while none of those in Formula are. This property is called the topological event and is denoted by E. At the second level, we assume that the mutation {Delta} arose only once in the history of the sample. We denote this event by M. Conditioning on E impacts the random topology of the tree, while conditioning on M affects branch lengths. In the terminology of TAVARÉ (2004), conditioning on E {cap} M amounts to considering a unique event polymorphism in the tree. The probability distribution of B is called the frequency spectrum and it can be described as

Formula
where Formula denotes the (n – 1)th harmonic number (see GRIFFITHS and TAVARÉ 1998, 2003; STEPHENS 2000).

Under the neutral hypothesis, we assume that lineages coalesce at random, and time is rescaled so that the unit of time corresponds to N generations with N the total cell population size (KINGMAN 1982). In this setting, the normal mutation rate is usually rescaled so that {theta}0/2 = 2Nµ0 and the raised mutation rate is {theta}1/2 = 2Nµ1. Conditioning on B = b leads to a model of genealogies that we refer to as the conditional coalescent tree (WIUF and DONNELLY 1999; see Figure 1). All subsequent results are established conditional on the event E {cap} M, but with the exception of the APPENDIX we omit this condition to alleviate notations in long formulas.


Figure 1
View larger version (15K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Conditional coalescent tree with n = 8 leaves. The mutation {Delta} has B = 4 descendants.

 

Background results:

To state results about conditional coalescence times, some additional results are needed. As much as possible, we use notations similar to those of WIUF and DONNELLY (1999) and TAVARÉ (2004). For Formula, we define Jr to be the total number of ancestors at the time the subsample Formula first has r ancestors. This definition implies that Jr ranges between (r + 1) and (nb + r). In addition, we denote by J0 the number of ancestors in the sample at the time the Formula lineages first coalesce with the rest of the sample. This means that we have

Formula
Similarly, we consider Kr to be the total number of ancestors at the time the subsample Formula first has r ancestors. We have

Formula
where the subset Formula is replaced by Formula in the previous definition, and the Kr's are complementary to the Jr's in the set of labels [1, n]. Note that conditional on J0 = j, we have Kr = r for all r < j and j + 1 ≤ Kj. To finish, we denote by J{Delta} the total number of ancestors in the sample at the time the mutation {Delta} occurs. This implies that J{Delta} takes its values between 2 and nb + 1. A picture of a tree with a summary of notation is displayed in Figure 2.


Figure 2
View larger version (10K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Coalescence levels in Formula and Formula with their notations Jr and Kr. Here we have n = 8; B = 4; J3 = 7, J2 = 5, J1 = 4, J0 = 2; and K3 = 6, K2 = 3, K1 = 1.

 
The conditional joint distributions of the Jr's given the events E or E {cap} M are described in TAVARÉ (2004, Chap. 8, pp. 106–109), which we refer to when necessary. For example, we easily deduce that

Formula 1(1)
for all Formula 1 (see WIUF and DONNELLY 1999). This result is useful in the NUCLEOTIDE DIVERSITY section. Similar properties are stated without proofs when they are direct consequences of Tavaré's notations.

Another useful result concerns the number of ancestors in the sample at the time when the mutation {Delta} occurs. Recall that we have

Formula 2(2)
for all Formula 2.

The age of the mutation {Delta} has been studied by GRIFFITHS and TAVARÉ (1998), WIUF and DONNELLY (1999), and STEPHENS (2000). Conditional on B = b, the expected age is given by

Formula 3(3)
GRIFFITHS (2003) gave a nicer formula:

Formula 3

The distribution of intercoalescence times:

In the standard coalescent, the durations X{ell} that separate coalescence events backward in time are independent random variables and have exponential distribution of rate {lambda}{ell} = {ell}({ell} – 1)/2, where {ell} is the number of ancestors just before the event. Here we recall how the conditioning on B = b and the existence of a unique event polymorphism E {cap} M further modify the shape of the genealogy by lengthening the intercoalescence times.

The next result can be deduced from GRIFFITHS and TAVARÉ (1998) or STEPHENS (2000). Assume that the mutation {Delta} has B = b descendants. The joint probability distribution of Formula 3 conditional on the event E {cap} M has density

Formula 4(4)
where f{ell}(x{ell}) is the probability density function of the exponential distribution of rate {lambda}{ell}.

As a consequence of Equation 4 we have the following result:

COROLLARY 1. Assume that the mutation {Delta} has B = b descendants. Let Formula 4. Then we have

Formula 5(5)

As a consequence of Equation 4, note that conditional on the event E {cap} M the X{ell}'s are no longer independent random variables. However, Equation 4 has the nice interpretation that once we know that the number of ancestors is equal to k at the time {Delta} occurs, then Xk has gamma G(2, {lambda}k) distribution, the other X{ell} have exponential G(1, {lambda}{ell}) distribution, and the variables are mutually independent. This remark is useful for simulating conditional trees given that B = b. Our algorithm is as follows:

  1. Draw J{Delta} = k according to the distribution (pk{Delta}) for Formula 5.
  2. Draw J0 from the conditional distribution

    Formula 5

  3. Draw an ordered sequence Formula 5 uniformly from the set of ordered integral sequences Formula 5.
  4. Fill the holes left in [1, n] by the Jr's with the Kr's.
  5. Sample Xk from the gamma G(2, {lambda}k) distribution; otherwise, sample X{ell} from the exponential distribution G(1, {lambda}{ell}), for {ell} != k.

Testing for the absence of {Delta}:

Here we present a brief study of the power of a rather "abstract" test to reject the null hypothesis H0 of absence of the mutation {Delta} against the alternative hypothesis H1 of its existence. The test is abstract because it assumes the knowledge of the sample genealogy, and the data set consists of all the intercoalescence times (Xk). Under the null hypothesis we assume that the property E holds for a specific subsample of b sequences. In the alternative hypothesis we assume that the mutation {Delta} has B = b descendants as well. The test statistic consists of the ratio of likelihoods that is believed to behave optimally for reasonably large sample sizes. It can be described as

Formula 5
Under H0, we see that this ratio has the same distribution as a sum of independent exponential random variables of rates {nu}k = 1/pk{Delta},

Formula 6(6)
whereas under H1 it is distributed as Y plus a sum of independent exponential random variables of rates Formula 6,

Formula 7(7)
The criterion for rejection is r greater than the 0.95th percentile from neutral data sets (see Equation 6). The power of the test was studied numerically from 10,000 replicates of Y and Z. We found that the power did not exceed a value close to 0.2 for n = 10, 20, 50, 100, and b {approx} n. For smaller b's, the lack in power was even more striking. For example, the power dropped to {approx}0.1 for b/n {approx} 0.5.

Because we assume the ideal knowledge of tree topologies and branch lengths, the interest in these power calculations is more theoretical than directed toward applications. However, these results put some limitations on testing for the occurrence of the mutation {Delta}. They are evidence that the occurrence of {Delta} alone conveys too little information for being detected by any kind of statistical testing even if the full genealogy were observed. This could be explained that the shapes of such trees do not undergo significant changes under the occurrence of {Delta}.


NUCLEOTIDE POLYMORPHISM

Corrected estimator:

We now take account of the mutations that are superimposed on the conditional coalescent trees. Mutations on the tree branches are distributed according to Poisson processes of rates {theta}0/2 or {theta}1/2, depending on where {Delta} takes place. Assuming the infinitely many sites model of the DNA molecule, we introduce an unbiased estimator of {theta}1 based on the number of segregating sites S. This variable equals the number of mutations that occurred during the sample history back to the most recent common ancestor of the sample. In the classical coalescent, S has Poisson distribution of parameter Ln{theta}/2, where {theta} is the mutation rate, and Ln is the length of the genealogy. The nucleotide polymorphism or Watterson's estimator is defined as Formula 7 (WATTERSON 1975). It is an unbiased estimator of {theta} with the property that

Formula 7

In analogy with the classical approach, we denote by Formula 7 the length of the genealogy of the full sample and by Formula 7 the length of the subgenealogy of Formula 7. Borrowing the notation from WIUF and DONNELLY (1999), we also denote by {eta}n the time separating the root of the subgenealogy from the mutation {Delta}. In the two-rates model, the number of segregating sites can be split into two independent terms

Formula 7
where S1 has Poisson distribution of rate (Formula 7){theta}1/2 and S0 has Poisson distribution of rate (Formula 7){theta}0/2. Taking expectations, we obtain the expected number of segregating sites as

Formula 7
where

Formula 7
and

Formula 7
Accordingly, an unbiased estimator of {theta}1 can be defined as follows:

Formula 7
Table 1 provides numerical values for An and Bn with sample sizes in the range 5–50. Exact formulas are derived afterward. First, the expectation Formula 7 results from Corollary 1 as follows:

Formula 7
Given that the mutation {Delta} has b descendants (B = b), the conditional expectations involved in the computation of An and Bn can be obtained from the results of WIUF and DONNELLY (1999) and GRIFFITHS and TAVARÉ (2003). On the one hand, GRIFFITHS and TAVARÉ (2003) proved that

Formula 7
where

Formula 7
for Formula 7 and Formula 7. On the other hand, WIUF and DONNELLY (1999) showed that

Formula 7
The values of An and Bn can then be computed by summing over all b's.


View this table:
In this window
In a new window

 
TABLE 1

Correction coefficients for Formula 7

 

Statistical errors and power of tests:

In the first half of this section, we evaluate the standard deviation (SD) of the estimator Formula 7. The exact computation of Var[Formula 7] appears intricate enough so that we resort to Monte Carlo methods. In the second half, we evaluate the power of the statistic Formula 7 to reject the hypothesis that the mutation rate increases simultaneously with the occurrence of the mutation {Delta}. Simulations were performed using the R statistical programming language (R DEVELOPMENT CORE TEAM 2004).

Statistical errors:

For evaluating statistical errors, the following experimental design was used. The parameter {theta}0 was set equal to the value {theta}0 = 1. Roughly, this corresponded to a normal mutation rate per mitotic division of µ0 {approx} 10–10, while the total number of cells N in the tissue approximated 2.5 billion. We considered three different values for the raised mutation rate {theta}1 = 10, 102, 103, and the sample sizes were taken in the range n = 10–50. Simulations were performed using the method described in the previous section. Table 2 gives the bias and the standard deviation computed over 10,000 replicates. These results confirm that Formula 7 was indeed unbiased. Nevertheless, the standard deviations were rather high. This could be explained as the empirical distributions exhibited strong positive skew. In addition, most of the error was contributed by a term that seemed proportional to Formula 7. For n = 20, we adjusted a regression model of the form Formula 7 to the variance, and an almost perfect fit was obtained as Formula 7 (R2 = 0.999, P < 10–12). For n = 40, we obtained Formula 7 (R2 = 0.997, P < 10–12). Apparently, SDs did not exhibit a fast decrease as sample sizes increased. This might be due to a strong correlation of data within the subsample Formula 7 and to the fact that the most recent ancestor of this subsample is expected to be recent. Note that the shape of the correcting constant Bn suggested a logarithmic rate of decrease of errors toward zero.


View this table:
In this window
In a new window

 
TABLE 2

Statistical errors for Formula 7

 

Power:

A fundamental assumption through this work is that the mutation {Delta} has occurred once in the history of the sample. Assuming a normal mutation rate {theta}0, we report results regarding the power of the test based on Formula 7 to reject the null hypothesis of absence of {Delta} against the alternative of its existence together with an increase in mutation rate {theta}1 > {theta}0. Results for {theta}0 = 1 and {theta}1 = 10–103 are given in Table 3. Power values ranged from {approx}0.06 to {approx}0.90. Reasonable powers were obtained for {theta}1 > 103{theta}0. No significant improvements were observed when the sample sizes varied from n = 10 to n = 50.


View this table:
In this window
In a new window

 
TABLE 3

Powers for Formula 7

 
In a second step we reverted the role of the null and alternative hypotheses and used a test based on Formula 7. The results are reported in Table 4. In this table, powers range from {approx}0.43 to {approx}0.90. For {theta}1 < 10, the test exhibited performances similar to those presented in the previous section where the simultaneous rise in mutation rate was ignored. Significant gains in power were obtained for {theta}1 = 103{theta}0. Increasing the sample sizes did not provide additional benefit. Table 4 indicates that the event {Delta} was more easily detected when associated with large mutation rates and small sample sizes. However, the power to detect {Delta} remains small for {theta}1 < 1000{theta}0.


View this table:
In this window
In a new window

 
TABLE 4

Powers for Formula 7

 


NUCLEOTIDE DIVERSITY

Corrected estimator:

Here we introduce an unbiased estimator of {theta}1 based on the nucleotide diversity {Pi}. In the infinitely many sites model the nucleotide diversity is defined as the mean number of pairwise differences between nucleotides. Let {Pi}(i, j) be the number of sites at which the sequence i differs from the sequence j, for 1 ≤ i ≤ n and 1 ≤ j ≤ n. The nucleotide diversity is the average value of {Pi}(i, j). It can be computed as follows:

Formula 7
In the unconditional coalescent, we have E[{Pi}(1, 2)] = {theta}E[X2], and {Pi} is an unbiased estimator of {theta}. The variance of {Pi} is equal to Var[{Pi}] = (n + 1){theta}/3(n – 1) + 2(n2 + n + 3){theta}2/9n(n – 1) (TAJIMA 1983).

Now consider the occurrence of {Delta} and the two rates of mutation {theta}0 and {theta}1. Again, we assume that the mutation {Delta} has B = b descendants. Consider two arbitrary sequences labeled 1 and 2. In the classical coalescent, E[X2] is the expected coalescence time of sequences 1 and 2. In analogy with this, the computation of E[{Pi}(1, 2)] requires distinguishing three cases. In the first case, both sequences 1 and 2 belong to Formula 7, and we have

Formula 7
where Formula 7 is the expected coalescence time within Formula 7. This case occurs with probability (b/n)2. In the second case, one sequence is in Formula 7 while the other belongs to Formula 7. This event occurs with probability 2b(nb)/n2, and we have

Formula 7
where Formula 7 is the expected coalescence time of sequence 1 and sequence 2, and {tau}{Delta} is the age of {Delta} given in Equation 3. The third case occurs with probability (1 – b/n)2. It corresponds to the situation where both sequence 1 and sequence 2 are in Formula 7. Then we have

Formula 7
where Formula 7 is the corresponding expected coalescence time. Taking expectation with respect to B, we deduce that

Formula 7
where the constants Cn and Dn can be computed from the above defined coalescence times. Therefore, an unbiased estimator {Pi}1 of {theta}1 is of the form

Formula 7
Table 5 gives numerical values for Cn and Dn for n in the range 10–50. The next section explains the way the exact computations of all coalescence times can be achieved.


View this table:
In this window
In a new window

 
TABLE 5

Correction coefficients for {Pi}1

 

Coalescence times:

Here we provide explicit ways of computing the coalescence times Formula 7, and Formula 7. As a consequence, we are able to give formal expressions for the correcting constants Cn and Dn. Because the formal expressions are somewhat ugly, the following results should be considered more as recipes for computing expressions than as immediate closed mathematical formulas. The strategy for establishing these exact formulas is rather simple and replicable with slight variations in the three cases.

Case 1—coalescence within Formula 7:

Let Formula 7 denote the time at which the sample first has j ancestors. A basic argument shows that if a node has J ancestors, then its expected age is E[TJ+1]. Therefore, the coalescence time of two individuals in a subsample of size b for which the total number of ancestors at each node are Formula 7 is given by

Formula 7
which is made explicit in the APPENDIX.

Case 2—coalescence between Formula 7 and Formula 7:

The expression of Formula 7 has a simple interpretation in terms of the age of {Delta}. The expression given in the APPENDIX can be reduced, using a symbolic computing language such as Maple. Because the gamma distribution G(2, {lambda}k) is the sum of two independent exponentials, we find that the coalescence time Formula 7 (age of {Delta}) plus the coalescence time of two ancestors among the k present at the occurrence of {Delta}. According to Equation 4, the second coalescence time has exponential G(1, 1) distribution. Hence, we have

Formula 7

Case 3—coalescence within Formula 7:

The average coalescence time for two individuals within Formula 7 can be obtained from conditioning on J0 = j and from the observation that we have Kr = r for r < j given that J0 = j. This leads to a complicated formula that uses a series of probabilistic results stated in Lemmas 3 and 4 (see the APPENDIX).

Statistical errors and power of tests:

Here we report numerical estimates of the standard deviations of {Pi}1, and we study the power of this statistic to reject the hypothesis that the mutation rate increased simultaneously with the occurrence of the mutation {Delta}. The same experimental design was used as for the statistic Formula 7 defined in the previous section. The results are closely parallel to those obtained for Formula 7 (see Tables 6–8GoGo ). The estimator appears to be unbiased. The standard deviations are of the same order as those computed for Formula 7 although they seem slightly higher. Using {Pi}1 instead of Formula 7 to reject the existence of {Delta} leads to a 12 or 13% loss in power when {theta}1 = 100 or {theta}1 = 103. Reverting the two hypotheses and using {Pi} yield the same conclusions as for Formula 7.


View this table:
In this window
In a new window

 
TABLE 6

Statistical errors for {Pi}1

 

View this table:
In this window
In a new window

 
TABLE 7

Powers for {Pi}1

 

View this table:
In this window
In a new window

 
TABLE 8

Powers for {Pi}

 


DISCUSSION
Genetic information must be tightly regulated, and its faithful replication and repair is the greatest imperative. To this end humans have invested >130 genes in DNA repair, and this number is even greater if genes dedicated to fidelity of replication are included (ANDERSON 2001; WOOD 2001). In this article we introduced a stochastic model of mutation in tumor cells with the aim of estimating the amount of genomic instability in cancer tissues due to the alteration of DNA repair genes. Our approach took into account the difficulties generated by sampling within tumoral clones and the fact that these clones must be difficult to isolate (ANDERSON et al. 2001). We provided unbiased estimators of the normalized raised mutation rates. These quantities can be interpreted as the mean numbers of new mutations present in daughter cells after each mitotic generation (this corresponds to an evaluation of {theta}1/2 = 2µ1N). The power of these statistics to reject genomic instability was assessed and proved to increase with the intensity of mutation. However, we showed that large statistical errors may be associated with such estimates. Conditional on the presence of loss of MMR within a sample of cells, no significant benefit would be expected from large sample sizes. In addition, we proved that genomic instability can hardly be detected unless the raised mutation rates exceed the normal rates by a factor >103. These results suggest monitoring several loci to increase power and reliability of tests and give theoretical support to foundations of current clinical guidelines (BOLAND et al. 1998).

Computations were conducted under the assumptions of selective neutrality. Tumors of clonal origin have long life spans with evolutionary history that may last over 10 or 20 years and exhibits multistep progression. At least in the early stages of tumor progression selective neutrality is still compatible with Loeb's theory of carcinogenesis. Evidence is lacking that the initiating events are neither highly advantageous nor highly deleterious. A competing assumption explains that a cell must exhibit a selective advantage to be converted into a pretumoral cell. Then by a selective clonal expansion the cell becomes malignant (CAIRNS 1975; NOWELL 1976; TOMLINSON et al. 1996). The material presented in this article may serve as a basis for testing such kinds of assumption. A classical way of doing so is by computing Tajima's D-statistic (TAJIMA 1989). In our framework this statistic can be defined as the difference Formula 7. To apply the test, P-values can be obtained from Monte Carlo replicates, using the new simulation procedure described in CONDITIONAL COALESCENT TREES.

Genomic instability particularly affects DNA repeat sequences. It has even been calculated to affect hundred of thousands of such sequences in each tumor cell but very few of these events are within coding sequences (PERUCHO 1996). It is widely argued that stepwise mutation models might be more appropriate for such DNA sequences than the infinitely many sites model used in this work. However, genomic instability is not restricted to repeat sequences and even not limited to the nucleus. Mitochondrial DNA may also be mutated in a process that involves clonal expansion (POLYAK et al. 1998). Infinitely many sites models may thus be acceptable in several situations.

ANDERSON et al. (2001) reported several difficulties with measuring the amount of instability in cancer cell genomes. The ideal measurement would be how many genomic events occur per cell generation because this number would allow evaluation of the rate of tumor progression. Regardless of the fact that it is as yet difficult to approach in clinical application, a rigorous way of calculating unbiased estimates of the amount of genomic instability in pretumoral tissues would nevertheless require the correction coefficients described in this article.


APPENDIX
Proof of Corollary 1. Let n ≥ 2. Assuming that {Delta} has b descendants (1 ≤ b ≤ n – 1) and using Equation 4 we obtain the marginal distribution of each intercoalescence time. For Formula 7 we have

Formula 7
if {ell} ≤ nb + 1; otherwise, it is

Formula 7
where f{ell} is the density of the exponential G(1, {lambda}{ell}) distribution. Taking expectations it becomes

Formula 7

LEMMA 1. Let n ≥ 2. We have

Formula 7
Proof. Let Formula 7. From Corollary 1 we have

Formula 7
Then

Formula 7

LEMMA 2. Let n ≥ 2 and assume that {Delta} has b descendants. Let Formula 7 and k isin [2, nb + 1]. For j isin [k + r – 1, nb + r], we have

Formula 7
Proof. Let k isin [2, nb + 1] and r isin [1, b – 1]. For all j isin [k + r – 1, nb + r] it is known that for Formula 7 we have

Formula 7
(TAVARÉ 2004). Note that the above formula is independent of Formula 7. We have

Formula 7

LEMMA 3. Let n ≥ 2 and assume that {Delta} has b descendants. Let J0 be defined as in CONDITIONAL COALESCENT TREES. For j = 1, ... , nb, we have

Formula 7
Proof. Due to a straightforward combinatorial argument, for j = 1, ... , nb we have

Formula 7
Then intregrating over J{Delta}'s implies that

Formula 7

LEMMA 4. Let n ≥ 2, assume that {Delta} has b descendants, and denote c = nb. Let Formula 7 and Kr be defined as in CONDITIONAL COALESCENT TREES. For k isin [r + 1, r + b], we have

Formula 7
Proof. Note that the vector Formula 7 is obtained from a permutation of the labels Formula 7, where Jr's and Kr's are defined as in CONDITIONAL COALESCENT TREES. Conditional on J0 = j, the vector Formula 7 is also a permutation of the labels Formula 7. Then Equation 1 implies that for Formula 7, we have

Formula 7
Note that the above formula is independent of Formula 7. We have

Formula 7

An explicit formula for Formula 7 defined in NUCLEOTIDE DIVERSITY is given by

Formula 7
In this expression, we used Corollary 1,

Formula 7
and the result stated in Lemma 2 (APPENDIX).

The average coalescence time for two sequences, one within Formula 7 and one within Formula 7, is straightforward from the conditioning on J{Delta}. We obtain that

Formula 7
where

Formula 7
Because j ≤ k in the above summation, we obtain from Corollary 1 that

Formula 7

Regarding {tau}C, we have

Formula 7
Now we use the fact

Formula 7
For Formula 7 and r < j, we have

Formula 7
with

Formula 7
Otherwise, we have r ≥ j and

Formula 7
where

Formula 7
For all Formula 7, the conditional probabilities P(J{Delta} = {ell} | J0 = j) can be obtained from Bayes' formula.


ACKNOWLEDGEMENTS
The authors thank Robert C. Griffiths for useful discussions about the model and an anonymous referee for correcting some bibliographical mistakes. This work is partly supported by a grant from the Algorithmes et Populations Biologiques project, which is supported by the Institut d'Informatique et de Mathématiques Appliquées de Grenoble.


LITERATURE CITED

ANDERSON, G. R., 2001 Genomic instability in cancer. Curr. Sci. 81: 501–507.

ANDERSON, G. R., D. L. STOLER and B. M. BRENNER, 2001 Cancer: the evolved consequence of a destabilized genome. BioEssays 23: 1037–1046.[CrossRef][Medline]

BHATTACHARYYA, N. P., A. SKANDALIS, A. GANESH, J. GRODEN and M. MEUTH, 1994 Mutator phenotypes in human colorectal carcinoma cell lines. Proc. Natl. Acad. Sci. USA 91: 6319–6323.[Abstract/Free Full Text]

BIELAS, J. H., and L. A. LOEB, 2005 Mutator phenotype in cancer: timing and perspectives. Environ. Mol. Mutagen. 45: 143–149.[CrossRef][Medline]

BOLAND, C. R., S. N. THIBODEAU, S. R. HAMILTON, D. SIDRANSKY, J. R. ESHLEMAN et al., 1998 A National Cancer Institute workshop on microsatellite instability for cancer detection and familial predisposition: development of international criteria for the determination of microsatellite instability in colorectal cancer. Cancer Res. 58: 5248–5257.[Abstract/Free Full Text]

CAIRNS, J., 1975 Mutation selection and the natural history of cancer. Nature 255: 197–200.[CrossRef][Medline]

CALABRESE, P., J. P. TSAO, Y. YATABE, R. SALOVAARA, J. P. MECKLIN et al., 2004 Colorectal pretumor progression before and after DNA mismatch repair. Am. J. Pathol. 164: 1447–1453.[Abstract/Free Full Text]

COOP, G., and R. C. GRIFFITHS, 2004 Ancestral inference on gene trees under selection. Theor. Popul. Biol. 66: 219–232.[CrossRef][Medline]

FISHEL, R., M. K. LESCOE, M. R. RAO, N. G. COPELAND, N. A. JENKINS et al., 1993 The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colorectal cancer. Cell 75: 1027–1038 (erratum: Cell 77: 167).

GRIFFITHS, R. C., 2003 The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor. Popul. Biol. 64: 241–251.[CrossRef][Medline]

GRIFFITHS, R. C., and S. TAVARÉ, 1998 The age of a mutation in a general coalescent tree. Stoch. Models 14: 273–295.

GRIFFITHS, R. C., and S. TAVARÉ, 2003 The genealogy of a neutral mutation, pp. 393–412 in Highly Structured Stochastic Systems, edited by P. GREEN, N. HJORT and S. RICHARDSON. Oxford University Press, Oxford.

HARTL, D., and A. CLARK, 1997 Principles of Population Genetics, Ed. 3. Sinauer Associates, Sunderland, MA.

IONOV, Y., M. A. PEINADO, S. MALKHOSYAN, D. SHIBATA and M. PERUCHO, 1993 Ubiquitous somatic mutations in simple repeated sequences reveal a new mechanism for colonic carcinogenesis. Nature 363: 558–561.[CrossRef][Medline]

KINGMAN, J. F. C., 1982 The coalescent. Stoch. Proc. Appl. 13: 235–248.[CrossRef]

KINZLER, K. W., and B. VOGELSTEIN, 2002 Familial cancer syndromes: the role of caretakers and gatekeepers, pp. 209–210 in The Genetic Basis of Human Cancer, Ed. 2, edited by B. VOGELSTEIN and K. W. KINZLER. McGraw-Hill, New York.

KRONE, S. M., and C. NEUHAUSER, 1997 Ancestral process with selection. Theor. Popul. Biol. 51: 210–237.[CrossRef][Medline]

LEACH, F. S., N. C. NICOLAIDES, N. PAPDOPULOS, B. LIU, J. JEN et al., 1993 Mutations of a mutS homolog in hereditary nonpolyposis colorectal cancer. Cell 75: 1215–1225.[CrossRef][Medline]

LINDBLOM, A., P. TANNERGARD, B. WERELIUS and M. NORDENSKJOLD, 1993 Genetic mapping of a second locus predisposing to hereditary non-polyposis colon cancer. Nat. Genet. 5: 279–282.[CrossRef][Medline]

LOEB, L. A., B. N. SPRINGGATE and N. BATTULA, 1974 Errors in DNA replication as a basis of malignant changes. Cancer Res. 34: 2311–2321.[Abstract/Free Full Text]

LOEB, L. A., K. R. LOEB and J. P. ANDERSON, 2003 Multiple mutations and cancer. Proc. Natl. Acad. Sci. USA 100: 776–781.[Abstract/Free Full Text]

MICHOR, F., Y. IWASA and M. A. NOWAK, 2004 Dynamics of cancer progression. Nat. Rev. Cancer 4: 197–205.[CrossRef][Medline]

MOOLGAVKAR, S. H., and A. G. KNUDSON, JR., 1981 Mutation and cancer: a model for human carcinogenesis. J. Natl. Cancer Inst. 66: 1037–1052.[Medline]

MORAN, P. A. P., 1962 The Statistical Process of Evolutionary Theory. Clarendon Press, Oxford.

NOWELL, P. C., 1976 The clonal evolution of tumor cell populations. Science 194: 23–28.[Abstract/Free Full Text]

PELTOMAKI, P., L. AALTONEN, P. SISTONEN, L. PYLKKANEN, J. P. MECKLIN et al., 1993 Genetic mapping of a locus predisposing to human colorectal cancer. Science 260: 751–752.[Free Full Text]

PERUCHO, M., 1996 Cancer of the microsatellite mutator phenotype. J. Biol. Chem. 377: 675–684.

POLYAK, K., Y. LI, H. ZHU, C. LENGAUER, J. K. WILLSON et al., 1998 Somatic mutations of the mitochondrial genome in human colorectal tumours. Nat. Genet. 20: 291–293.[CrossRef][Medline]

R DEVELOPMENT CORE TEAM, 2004 R: A Language and Development for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. (http://www.R-project.org).

SIEBER, O. M., K. HEINIMAN and I. P. M. TOMLINSON, 2003 Genomic instability—the engine of tumorigenesis? Nat. Rev. Cancer 3: 701–708.[CrossRef][Medline]

SHIBATA, D., M. A. PEINADO, Y. IONOV, S. MALKHOSYAN and M. PERUCHO, 1994 Genomic instability in repeated sequences is an early somatic event in colorectal tumorigenesis that persists after transformation. Nat. Genet. 6: 273–281.[CrossRef][Medline]

STEPHENS, M., 2000 Times on tree, and the age of an allele. Theor. Popul. Biol. 57: 109–119.[CrossRef][Medline]

STEPHENS, M., and P. DONNELLY, 2003 Ancestral inference in population genetics with selection. Aust. N. Z. J. Stat. 45: 395–430.[CrossRef]

TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.[Abstract/Free Full Text]

TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.[Abstract/Free Full Text]

TAVARÉ, S., 2004 Ancestral inference in population genetics, pp. 1–188 in Lectures on Probability Theory and Statistics. Ecole d'Eté de Probabilité de St Flour XXXI–2001, edited by J. PICARD. Springer Verlag, New York.

THIBODEAU, S. N., G. BREN and D. SCHAID, 1993 Microsatellite instability in cancer of the proximal colon. Science 260: 816–819.[Abstract/Free Full Text]

THOMAS, D. C., 2004 Statistical Methods in Genetic Epidemiology. Oxford University Press, New York.

TOMLINSON, I. P. M., and W. BODMER, 1999 Selection, the mutation rate and cancer: ensuring that the tail does not wag the dog. Nat. Med. 5: 11–12.[CrossRef][Medline]

TOMLINSON, I. P. M., M. NOVELLI and W. BODMER, 1996 The mutation rate and cancer. Proc. Natl. Acad. Sci. USA 93: 14800–14803.[Abstract/Free Full Text]

TOMLINSON, I. P. M., P. SASIENI and W. BODMER, 2002 How many mutations in a cancer? Am. J. Pathol. 160: 755–758.[Free Full Text]

TSAO, J. L., Y. YATABE, R. SALOVAARA, H. J. JÄRVINEN, J. P. MECKLIN et al., 2000 Genetic reconstruction of individual colorectal tumor histories. Proc. Natl. Acad. Sci. USA 97: 1236–1241.[Abstract/Free Full Text]

WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 2: 256–276.

WIUF, C., and P. DONNELLY, 1999 Conditional genealogies and the age of a neutral mutant. Theor. Popul. Biol. 56: 183–201.[CrossRef][Medline]

WOOD, A., 2001 Racial differences in the response to drugs—pointers to genetic differences. N. Engl. J. Med. 344: 1393–1395.[Free Full Text]

Communicating editor: M. NORDBORG