## Abstract

Humans have invested several genes in DNA repair and fidelity replication. To account for the disparity between the rarity of mutations in normal cells and the large number of mutations present in cancer, an hypothesis is that cancer cells must exhibit a mutator phenotype (genomic instability) during tumor progression, with the initiation of abnormal mutation rates caused by the loss of mismatch repair. In this study we introduce a stochastic model of mutation in tumor cells with the aim of estimating the amount of genomic instability due to the alteration of DNA repair genes. Our approach took into account the difficulties generated by sampling within tumoral clones and the fact that these clones must be difficult to isolate. We provide corrections to two classical statistics to obtain unbiased estimators of the raised mutation rate, and we show that large statistical errors may be associated with such estimators. The power of these new statistics to reject genomic instability is assessed and proved to increase with the intensity of mutation rates. In addition, we show that genomic instability cannot be detected unless the raised mutation rates exceed the normal rates by a factor of at least 1000.

DNA replication in normal human cells is an extremely accurate process. During the cell division cycle, copy errors occur with probabilities <10^{−9}–10^{−10} per nucleotide. In contrast, the malignant cells that constitute cancer tissues are markedly heterogeneous and exhibit alterations in the nucleotide sequence of DNA (*e.g.*, Bielas and Loeb 2005). To account for the disparity between the rarity of mutations in normal cells and the large numbers of mutations present in cancer, Loeb*et al.* (1974) hypothesized that during tumor progression, cancer cells must exhibit a *mutator phenotype* (see the review by Loeb*et al.* 2003). It is still a matter of debate to know exactly which event initiates tumorigenesis. But one hypothesis for the initiation of abnormal mutation rates in tumors is the loss of mismatch repair (MMR).

For instance, this phenomenon may follow from the inactivation of the genes hMSH2 and hMLH1 involved in hereditary nonpolyposis colorectal cancers (HNPCC) (Fishel*et al.* 1993; Leach*et al.* 1993; Lindblom*et al.* 1993). In normal conditions, the MMR repair system involves a complex interaction among the protein products of hMSH2 and hMLH1 genes. The result is to eliminate ∼99.9% of the errors in DNA replication, reducing errors to a rate of ∼1/10^{12} bp in genes that regulate the apoptosis or the cell cycle duration. HNPCC is inherited in an autosomal dominant fashion. One copy of the mutant allele is defective and is inherited in the germline. The loss of MMR may start when the second mutation occurs somatically as a consequence of the two-hits theory (Moolgavkar and Knudson 1981).

Widespread genomic instability seems associated with MMR-defective genes. For example, microsatellite instability is associated with HNPCC (Ionov*et al.* 1993; Peltomaki*et al.* 1993; Thibodeau*et al.* 1993). Detection of DNA instability is therefore a crucial step in view of noninvasive diagnosis of such forms of cancer. Because numerous mutations are required for the full development of cancer, inactivation of *caretaker* genes can greatly accelerate its development (Kinzler and Vogelstein 2002). For an account of the etiology and genetic epidemiology of cancer with a statistical perspective a major review is by Thomas (2004).

This study introduces a two-rates model of DNA mutation based on the infinitely many sites model (Watterson 1975). We consider a sample of *n* sequences taken from a pretumoral tissue and assume that loss of DNA repair has occurred once (and only once) during the history of the *n* sequences tracking back to their most recent common ancestor. We denote the mutational event by the formal symbol Δ. The event Δ is assumed to occur at a very low rate δ.

The loss of MMR (occurrence of Δ) may lead to a 10- to 1000-fold increase in the normal mutation rate μ_{0} (Bhattacharyya*et al.* 1994; Shibata*et al.* 1994). However, only the sequences that descend from Δ are concerned with such an increase in the mutation rate. Because heterogeneity prevails in cancer tissues and sampling from the tumor is difficult, we assume that an unknown random number of sequences among the sample descend from the mutation Δ.

The goal of this study is to provide statistical estimators for the raised mutation rate μ_{1} under the assumption that the normal rate μ_{0} is known, but the number of descendants of Δ is unknown. Two classical statistics are studied (see Hartl and Clark 1997 for a review in a population genetics context). The first one is the *nucleotide polymorphism* computed as the average number of segregating sites in the DNA. The second one is the *nucleotide diversity* computed as the number of pairwise nucleotide differences. Our main contribution is the calculation of corrections to the classical statistics that are needed because the increase in the mutation rate concerns only a random subgenealogy of the sample.

In our study, the clonal evolution of mitotic cell divisions is assumed to be selectively neutral. More precisely, the evolution of the tissue is approximated by a continuous branching process where one cell dies at random after each division (Moran 1962). At least in the early stages of progression toward tumor cells, this model may be consistent with instability theory. Nevertheless the assumption of selective neutrality is still a source of controversy. Opponents of Loeb's theory support the hypothesis that the number of pretumoral cancer cells increases rapidly with time (see Tomlinson*et al.* 1996, 2002; Tomlinson and Bodmer 1999; Sieber*et al.* 2003). An alternative to the approach developed here would therefore include the rapid growth of tumor clones and the selective advantage of pretumoral cells. It is clear that such assumptions complicate the model and its analysis significantly. Although we believe that the recent contributions by Krone and Neuhauser (1997), Stephens and Donnelly (2003), and Coop and Griffiths (2004) may allow some progresses in this respect, the selection perspective will not be presented here.

Under the neutral assumption, we model the genealogies of DNA sequences using conditional coalescent trees (Wiuf and Donnelly 1999; Griffiths and Tavaré 2003; Tavaré 2004). This formalism has been developed for the primary purpose of estimating the age of an allele (Griffiths and Tavaré 1998; Stephens 2000). So far, evolutionary models have been introduced for dating the loss of MMR (Tsao*et al.* 2000; Calabrese*et al.* 2004). Tsao*et al.* (2000) observed microsatellite alleles in noncoding regions, assuming neutrality as well. However, the need for further mathematical studies has been emphasized in a recent review to better understand the influence of existing hypotheses in the evolution of cancer (Michor*et al.* 2004).

In the next section, we define our notation and give an account of the existing results in the theory of conditional trees. In addition, we extend many results of the theory to encompass other times or ages useful in the context of genomic instability and describe an efficient way for simulating conditional trees. In nucleotide polymorphism and nucleotide diversity, we introduce unbiased estimators of the raised mutation rate μ_{1} based on the number of segregating sites and the number of pairwise differences within the sample. The statistical errors and the power of tests based on these estimators are then compared using Monte Carlo methods.

## CONDITIONAL COALESCENT TREES

#### Model and notations:

We consider a sample of *n* copies of a gene at a particular DNA locus taken from a pretumoral tissue and assume that the loss of MMR (event Δ) occurred once in the sample history. However, the date and place at which this event occurred in the sample genealogy are unknown. Mathematically, we consider taking the limit as the rate of occurrence δ tends to zero conditional on Δ having occurred. In further statements the symbol = therefore often replaces the limit symbol as δ goes to zero.

The sample is divided into two random complementary subsamples and . The cardinality of is a random variable denoted by *B*. Given the number *B* = *b* of sequences in , the number of sequences in is then *c* = *n* − *b*. As usual, in studies of conditional coalescent trees, the analysis requires two levels of conditioning. At the first level, the sample has the property that all sequences in are descendants of the particular mutation Δ while none of those in are. This property is called the *topological event* and is denoted by *E*. At the second level, we assume that the mutation Δ arose only once in the history of the sample. We denote this event by *M*. Conditioning on *E* impacts the random topology of the tree, while conditioning on *M* affects branch lengths. In the terminology of Tavaré (2004), conditioning on *E* ∩ *M* amounts to considering a unique event polymorphism in the tree. The probability distribution of *B* is called the *frequency spectrum* and it can be described aswhere denotes the (*n* − 1)th harmonic number (see Griffiths and Tavaré 1998, 2003; Stephens 2000).

Under the neutral hypothesis, we assume that lineages coalesce at random, and time is rescaled so that the unit of time corresponds to *N* generations with *N* the total cell population size (Kingman 1982). In this setting, the normal mutation rate is usually rescaled so that θ_{0}/2 = 2*N*μ_{0} and the raised mutation rate is θ_{1}/2 = 2*N*μ_{1}. Conditioning on *B* = *b* leads to a model of genealogies that we refer to as the *conditional coalescent tree* (Wiuf and Donnelly 1999; see Figure 1). All subsequent results are established conditional on the event *E* ∩ *M*, but with the exception of the appendix we omit this condition to alleviate notations in long formulas.

#### Background results:

To state results about conditional coalescence times, some additional results are needed. As much as possible, we use notations similar to those of Wiuf and Donnelly (1999) and Tavaré (2004). For , we define *J _{r}* to be the total number of ancestors at the time the subsample first has

*r*ancestors. This definition implies that

*J*ranges between (

_{r}*r*+ 1) and (

*n*−

*b*+

*r*). In addition, we denote by

*J*

_{0}the number of ancestors in the sample at the time the lineages first coalesce with the rest of the sample. This means that we haveSimilarly, we consider

*K*to be the total number of ancestors at the time the subsample first has

_{r}*r*ancestors. We havewhere the subset is replaced by in the previous definition, and the

*K*'s are complementary to the

_{r}*J*'s in the set of labels [1,

_{r}*n*]. Note that conditional on

*J*

_{0}=

*j*, we have

*K*=

_{r}*r*for all

*r*<

*j*and

*j*+ 1 ≤

*K*. To finish, we denote by

_{j}*J*

_{Δ}the total number of ancestors in the sample at the time the mutation Δ occurs. This implies that

*J*

_{Δ}takes its values between 2 and

*n*−

*b*+ 1. A picture of a tree with a summary of notation is displayed in Figure 2.

The conditional joint distributions of the *J _{r}*'s given the events

*E*or

*E*∩

*M*are described in Tavaré (2004, Chap. 8, pp. 106–109), which we refer to when necessary. For example, we easily deduce that(1)for all (see Wiuf and Donnelly 1999). This result is useful in the nucleotide diversity section. Similar properties are stated without proofs when they are direct consequences of Tavaré's notations.

Another useful result concerns the number of ancestors in the sample at the time when the mutation Δ occurs. Recall that we have(2)for all .

The age of the mutation Δ has been studied by Griffiths and Tavaré (1998), Wiuf and Donnelly (1999), and Stephens (2000). Conditional on *B* = *b*, the expected age is given by(3)Griffiths (2003) gave a nicer formula:

#### The distribution of intercoalescence times:

In the standard coalescent, the durations *X*_{ℓ} that separate coalescence events backward in time are independent random variables and have exponential distribution of rate λ_{ℓ} = ℓ(ℓ − 1)/2, where ℓ is the number of ancestors just before the event. Here we recall how the conditioning on *B* = *b* and the existence of a unique event polymorphism *E* ∩ *M* further modify the shape of the genealogy by lengthening the intercoalescence times.

The next result can be deduced from Griffiths and Tavaré (1998) or Stephens (2000). Assume that the mutation Δ has *B* = *b* descendants. The joint probability distribution of conditional on the event *E* ∩ *M* has density(4)where *f*_{ℓ}(*x*_{ℓ}) is the probability density function of the exponential distribution of rate λ_{ℓ}.

As a consequence of Equation 4 we have the following result:

Corollary 1. *Assume that the mutation* Δ *has B* = *b descendants. Let* . *Then we have*(5)

As a consequence of Equation 4, note that conditional on the event *E* ∩ *M* the *X*_{ℓ}'s are no longer independent random variables. However, Equation 4 has the nice interpretation that once we know that the number of ancestors is equal to *k* at the time Δ occurs, then *X _{k}* has gamma

*G*(2, λ

_{k}) distribution, the other

*X*

_{ℓ}have exponential

*G*(1, λ

_{ℓ}) distribution, and the variables are mutually independent. This remark is useful for simulating conditional trees given that

*B*=

*b*. Our algorithm is as follows:

Draw

*J*_{Δ}=*k*according to the distribution (*p*_{k}^{Δ}) for .Draw

*J*_{0}from the conditional distributionDraw an ordered sequence uniformly from the set of ordered integral sequences .

Fill the holes left in [1,

*n*] by the*J*'s with the_{r}*K*'s._{r}Sample

*X*from the gamma_{k}*G*(2, λ_{k}) distribution; otherwise, sample*X*_{ℓ}from the exponential distribution*G*(1, λ_{ℓ}), for ℓ ≠*k*.

#### Testing for the absence of Δ:

Here we present a brief study of the power of a rather “abstract” test to reject the null hypothesis H_{0} of absence of the mutation Δ against the alternative hypothesis H_{1} of its existence. The test is abstract because it assumes the knowledge of the sample genealogy, and the data set consists of all the intercoalescence times (*X _{k}*). Under the null hypothesis we assume that the property

*E*holds for a specific subsample of

*b*sequences. In the alternative hypothesis we assume that the mutation Δ has

*B*=

*b*descendants as well. The test statistic consists of the ratio of likelihoods that is believed to behave optimally for reasonably large sample sizes. It can be described asUnder H

_{0}, we see that this ratio has the same distribution as a sum of independent exponential random variables of rates ν

_{k}= 1/

*p*

_{k}^{Δ},(6)whereas under H

_{1}it is distributed as

*Y*plus a sum of independent exponential random variables of rates ,(7)The criterion for rejection is

*r*greater than the 0.95th percentile from neutral data sets (see Equation 6). The power of the test was studied numerically from 10,000 replicates of

*Y*and

*Z*. We found that the power did not exceed a value close to 0.2 for

*n*= 10, 20, 50, 100, and

*b*≈

*n*. For smaller

*b*'s, the lack in power was even more striking. For example, the power dropped to ≈0.1 for

*b*/

*n*≈ 0.5.

Because we assume the ideal knowledge of tree topologies and branch lengths, the interest in these power calculations is more theoretical than directed toward applications. However, these results put some limitations on testing for the occurrence of the mutation Δ. They are evidence that the occurrence of Δ alone conveys too little information for being detected by any kind of statistical testing even if the full genealogy were observed. This could be explained that the shapes of such trees do not undergo significant changes under the occurrence of Δ.

## NUCLEOTIDE POLYMORPHISM

#### Corrected estimator:

We now take account of the mutations that are superimposed on the conditional coalescent trees. Mutations on the tree branches are distributed according to Poisson processes of rates θ_{0}/2 or θ_{1}/2, depending on where Δ takes place. Assuming the infinitely many sites model of the DNA molecule, we introduce an unbiased estimator of θ_{1} based on the number of segregating sites *S*. This variable equals the number of mutations that occurred during the sample history back to the most recent common ancestor of the sample. In the classical coalescent, *S* has Poisson distribution of parameter *L _{n}*θ/2, where θ is the mutation rate, and

*L*is the length of the genealogy. The nucleotide polymorphism or Watterson's estimator is defined as (Watterson 1975). It is an unbiased estimator of θ with the property that

_{n}In analogy with the classical approach, we denote by the length of the genealogy of the full sample and by the length of the subgenealogy of . Borrowing the notation from Wiuf and Donnelly (1999), we also denote by η_{n} the time separating the root of the subgenealogy from the mutation Δ. In the two-rates model, the number of segregating sites can be split into two independent termswhere *S*^{1} has Poisson distribution of rate ()θ_{1}/2 and *S*^{0} has Poisson distribution of rate ()θ_{0}/2. Taking expectations, we obtain the expected number of segregating sites aswhereandAccordingly, an unbiased estimator of θ_{1} can be defined as follows:Table 1 provides numerical values for *A _{n}* and

*B*with sample sizes in the range 5–50. Exact formulas are derived afterward. First, the expectation results from Corollary 1 as follows:Given that the mutation Δ has

_{n}*b*descendants (

*B*=

*b*), the conditional expectations involved in the computation of

*A*and

_{n}*B*can be obtained from the results of Wiuf and Donnelly (1999) and Griffiths and Tavaré (2003). On the one hand, Griffiths and Tavaré (2003) proved thatwherefor and . On the other hand, Wiuf and Donnelly (1999) showed thatThe values of

_{n}*A*and

_{n}*B*can then be computed by summing over all

_{n}*b*'s.

#### Statistical errors and power of tests:

In the first half of this section, we evaluate the standard deviation (SD) of the estimator . The exact computation of Var[] appears intricate enough so that we resort to Monte Carlo methods. In the second half, we evaluate the power of the statistic to reject the hypothesis that the mutation rate increases simultaneously with the occurrence of the mutation Δ. Simulations were performed using the R statistical programming language (R Development Core Team 2004).

##### Statistical errors:

For evaluating statistical errors, the following experimental design was used. The parameter θ_{0} was set equal to the value θ_{0} = 1. Roughly, this corresponded to a normal mutation rate per mitotic division of μ_{0} ≈ 10^{−10}, while the total number of cells *N* in the tissue approximated 2.5 billion. We considered three different values for the raised mutation rate θ_{1} = 10, 10^{2}, 10^{3}, and the sample sizes were taken in the range *n* = 10–50. Simulations were performed using the method described in the previous section. Table 2 gives the bias and the standard deviation computed over 10,000 replicates. These results confirm that was indeed unbiased. Nevertheless, the standard deviations were rather high. This could be explained as the empirical distributions exhibited strong positive skew. In addition, most of the error was contributed by a term that seemed proportional to . For *n* = 20, we adjusted a regression model of the form to the variance, and an almost perfect fit was obtained as (*R*^{2} = 0.999, *P* < 10^{−12}). For *n* = 40, we obtained (*R*^{2} = 0.997, *P* < 10^{−12}). Apparently, SDs did not exhibit a fast decrease as sample sizes increased. This might be due to a strong correlation of data within the subsample and to the fact that the most recent ancestor of this subsample is expected to be recent. Note that the shape of the correcting constant *B _{n}* suggested a logarithmic rate of decrease of errors toward zero.

##### Power:

A fundamental assumption through this work is that the mutation Δ has occurred once in the history of the sample. Assuming a normal mutation rate θ_{0}, we report results regarding the power of the test based on to reject the null hypothesis of absence of Δ against the alternative of its existence together with an increase in mutation rate θ_{1} > θ_{0}. Results for θ_{0} = 1 and θ_{1} = 10–10^{3} are given in Table 3. Power values ranged from ≈0.06 to ≈0.90. Reasonable powers were obtained for θ_{1} > 10^{3}θ_{0}. No significant improvements were observed when the sample sizes varied from *n* = 10 to *n* = 50.

In a second step we reverted the role of the null and alternative hypotheses and used a test based on . The results are reported in Table 4. In this table, powers range from ≈0.43 to ≈0.90. For θ_{1} < 10, the test exhibited performances similar to those presented in the previous section where the simultaneous rise in mutation rate was ignored. Significant gains in power were obtained for θ_{1} = 10^{3}θ_{0}. Increasing the sample sizes did not provide additional benefit. Table 4 indicates that the event Δ was more easily detected when associated with large mutation rates and small sample sizes. However, the power to detect Δ remains small for θ_{1} < 1000θ_{0}.

## NUCLEOTIDE DIVERSITY

#### Corrected estimator:

Here we introduce an unbiased estimator of θ_{1} based on the nucleotide diversity Π. In the infinitely many sites model the nucleotide diversity is defined as the mean number of pairwise differences between nucleotides. Let Π(*i*, *j*) be the number of sites at which the sequence *i* differs from the sequence *j*, for 1 ≤ *i* ≤ *n* and 1 ≤ *j* ≤ *n*. The nucleotide diversity is the average value of Π(*i*, *j*). It can be computed as follows:In the unconditional coalescent, we have **E**[Π(1, 2)] = θ**E**[*X*_{2}], and Π is an unbiased estimator of θ. The variance of Π is equal to Var[Π] = (*n* + 1)θ/3(*n* − 1) + 2(*n*^{2} + *n* + 3)θ^{2}/9*n*(*n* − 1) (Tajima 1983).

Now consider the occurrence of Δ and the two rates of mutation θ_{0} and θ_{1}. Again, we assume that the mutation Δ has *B* = *b* descendants. Consider two arbitrary sequences labeled 1 and 2. In the classical coalescent, **E**[*X*_{2}] is the expected coalescence time of sequences 1 and 2. In analogy with this, the computation of **E**[Π(1, 2)] requires distinguishing three cases. In the first case, both sequences 1 and 2 belong to , and we havewhere is the expected coalescence time within . This case occurs with probability (*b*/*n*)^{2}. In the second case, one sequence is in while the other belongs to . This event occurs with probability 2*b*(*n* − *b*)/*n*^{2}, and we havewhere is the expected coalescence time of sequence 1 and sequence 2, and τ_{Δ} is the age of Δ given in Equation 3. The third case occurs with probability (1 − *b*/*n*)^{2}. It corresponds to the situation where both sequence 1 and sequence 2 are in . Then we havewhere is the corresponding expected coalescence time. Taking expectation with respect to *B*, we deduce thatwhere the constants *C _{n}* and

*D*can be computed from the above defined coalescence times. Therefore, an unbiased estimator Π

_{n}_{1}of θ

_{1}is of the formTable 5 gives numerical values for

*C*and

_{n}*D*for

_{n}*n*in the range 10–50. The next section explains the way the exact computations of all coalescence times can be achieved.

#### Coalescence times:

Here we provide explicit ways of computing the coalescence times , and . As a consequence, we are able to give formal expressions for the correcting constants *C _{n}* and

*D*. Because the formal expressions are somewhat ugly, the following results should be considered more as recipes for computing expressions than as immediate closed mathematical formulas. The strategy for establishing these exact formulas is rather simple and replicable with slight variations in the three cases.

_{n}##### Case 1—coalescence within ℬ:

Let denote the time at which the sample first has *j* ancestors. A basic argument shows that if a node has *J* ancestors, then its expected age is **E**[*T _{J}*

_{+1}]. Therefore, the coalescence time of two individuals in a subsample of size

*b*for which the total number of ancestors at each node are is given bywhich is made explicit in the appendix.

##### Case 2—coalescence between ℬ and 𝒞:

The expression of has a simple interpretation in terms of the age of Δ. The expression given in the appendix can be reduced, using a symbolic computing language such as Maple. Because the gamma distribution *G*(2, λ_{k}) is the sum of two independent exponentials, we find that the coalescence time (age of Δ) plus the coalescence time of two ancestors among the *k* present at the occurrence of Δ. According to Equation 4, the second coalescence time has exponential *G*(1, 1) distribution. Hence, we have

##### Case 3—coalescence within 𝒞:

The average coalescence time for two individuals within can be obtained from conditioning on *J*_{0} = *j* and from the observation that we have *K _{r}* =

*r*for

*r*<

*j*given that

*J*

_{0}=

*j*. This leads to a complicated formula that uses a series of probabilistic results stated in Lemmas 3 and 4 (see the appendix).

#### Statistical errors and power of tests:

Here we report numerical estimates of the standard deviations of Π_{1}, and we study the power of this statistic to reject the hypothesis that the mutation rate increased simultaneously with the occurrence of the mutation Δ. The same experimental design was used as for the statistic defined in the previous section. The results are closely parallel to those obtained for (see Tables 6–8⇓⇓ ). The estimator appears to be unbiased. The standard deviations are of the same order as those computed for although they seem slightly higher. Using Π_{1} instead of to reject the existence of Δ leads to a 12 or 13% loss in power when θ_{1} = 100 or θ_{1} = 10^{3}. Reverting the two hypotheses and using Π yield the same conclusions as for .

## DISCUSSION

Genetic information must be tightly regulated, and its faithful replication and repair is the greatest imperative. To this end humans have invested >130 genes in DNA repair, and this number is even greater if genes dedicated to fidelity of replication are included (Anderson 2001; Wood 2001). In this article we introduced a stochastic model of mutation in tumor cells with the aim of estimating the amount of genomic instability in cancer tissues due to the alteration of DNA repair genes. Our approach took into account the difficulties generated by sampling within tumoral clones and the fact that these clones must be difficult to isolate (Anderson*et al.* 2001). We provided unbiased estimators of the normalized raised mutation rates. These quantities can be interpreted as the mean numbers of new mutations present in daughter cells after each mitotic generation (this corresponds to an evaluation of θ_{1}/2 = 2μ_{1}*N*). The power of these statistics to reject genomic instability was assessed and proved to increase with the intensity of mutation. However, we showed that large statistical errors may be associated with such estimates. Conditional on the presence of loss of MMR within a sample of cells, no significant benefit would be expected from large sample sizes. In addition, we proved that genomic instability can hardly be detected unless the raised mutation rates exceed the normal rates by a factor >10^{3}. These results suggest monitoring several loci to increase power and reliability of tests and give theoretical support to foundations of current clinical guidelines (Boland*et al.* 1998).

Computations were conducted under the assumptions of selective neutrality. Tumors of clonal origin have long life spans with evolutionary history that may last over 10 or 20 years and exhibits multistep progression. At least in the early stages of tumor progression selective neutrality is still compatible with Loeb's theory of carcinogenesis. Evidence is lacking that the initiating events are neither highly advantageous nor highly deleterious. A competing assumption explains that a cell must exhibit a selective advantage to be converted into a pretumoral cell. Then by a selective clonal expansion the cell becomes malignant (Cairns 1975; Nowell 1976; Tomlinson*et al.* 1996). The material presented in this article may serve as a basis for testing such kinds of assumption. A classical way of doing so is by computing Tajima's *D*-statistic (Tajima 1989). In our framework this statistic can be defined as the difference . To apply the test, *P*-values can be obtained from Monte Carlo replicates, using the new simulation procedure described in conditional coalescent trees.

Genomic instability particularly affects DNA repeat sequences. It has even been calculated to affect hundred of thousands of such sequences in each tumor cell but very few of these events are within coding sequences (Perucho 1996). It is widely argued that stepwise mutation models might be more appropriate for such DNA sequences than the infinitely many sites model used in this work. However, genomic instability is not restricted to repeat sequences and even not limited to the nucleus. Mitochondrial DNA may also be mutated in a process that involves clonal expansion (Polyak*et al.* 1998). Infinitely many sites models may thus be acceptable in several situations.

Anderson*et al.* (2001) reported several difficulties with measuring the amount of instability in cancer cell genomes. The ideal measurement would be how many genomic events occur per cell generation because this number would allow evaluation of the rate of tumor progression. Regardless of the fact that it is as yet difficult to approach in clinical application, a rigorous way of calculating unbiased estimates of the amount of genomic instability in pretumoral tissues would nevertheless require the correction coefficients described in this article.

## APPENDIX

*Proof of Corollary* 1. Let *n* ≥ 2. Assuming that Δ has *b* descendants (1 ≤ *b* ≤ *n* − 1) and using Equation 4 we obtain the marginal distribution of each intercoalescence time. For we haveif ℓ ≤ *n* − *b* + 1; otherwise, it iswhere *f*_{ℓ} is the density of the exponential *G*(1, λ_{ℓ}) distribution. Taking expectations it becomes

Lemma 1. *Let n* ≥ 2. *We have**Proof.* Let . From Corollary 1 we haveThen

Lemma 2. *Let n* ≥ 2 *and assume that* Δ *has b descendants. Let* and k ∈ [2, *n* − *b* + 1]. *For j* ∈ [*k* + *r* − 1, *n* − *b* + *r*], *we have**Proof.* Let *k* ∈ [2, *n* − *b* + 1] and *r* ∈ [1, *b* − 1]. For all *j* ∈ [*k* + *r* − 1, *n* − *b* + *r*] it is known that for we have(Tavaré 2004). Note that the above formula is independent of . We have

Lemma 3. *Let n* ≥ 2 *and assume that* Δ *has b descendants. Let J*_{0} *be defined as in conditional coalescent trees. For j* = 1, … , *n* − *b*, *we have**Proof.* Due to a straightforward combinatorial argument, for *j* = 1, … , *n* − *b* we haveThen intregrating over *J*_{Δ}'s implies that

Lemma 4. *Let n* ≥ 2, *assume that* Δ *has b descendants*, *and denote c* = *n* − *b. Let* *and K _{r} be defined as in conditional coalescent trees. For k* ∈ [

*r*+ 1,

*r*+

*b*],

*we have*

*Proof.*Note that the vector is obtained from a permutation of the labels , where

*J*'s and

_{r}*K*'s are defined as in conditional coalescent trees. Conditional on

_{r}*J*

_{0}=

*j*, the vector is also a permutation of the labels . Then Equation 1 implies that for , we haveNote that the above formula is independent of . We have

An explicit formula for defined in nucleotide diversity is given byIn this expression, we used Corollary 1,and the result stated in Lemma 2 (appendix).

The average coalescence time for two sequences, one within and one within , is straightforward from the conditioning on *J*_{Δ}. We obtain thatwhereBecause *j* ≤ *k* in the above summation, we obtain from Corollary 1 that

Regarding τ_{C}, we haveNow we use the factFor and *r* < *j*, we havewithOtherwise, we have *r* ≥ *j* andwhereFor all , the conditional probabilities *P*(*J*_{Δ} = ℓ | *J*_{0} = *j*) can be obtained from Bayes' formula.

## Acknowledgments

The authors thank Robert C. Griffiths for useful discussions about the model and an anonymous referee for correcting some bibliographical mistakes. This work is partly supported by a grant from the Algorithmes et Populations Biologiques project, which is supported by the Institut d'Informatique et de Mathématiques Appliquées de Grenoble.

## Footnotes

Communicating editor: M. Nordborg

- Received April 6, 2005.
- Accepted December 13, 2005.

- Copyright © 2006 by the Genetics Society of America