Abstract
The probability that an advantageous mutant rises to fixation in a viral quasispecies is investigated in the framework of multitype branching processes. Whether fixation is possible depends on the overall growth rate of the quasispecies that will form if invasion is successful rather than on the individual fitness of the invading mutant. The exact fixation probability can be calculated only if the fitnesses of all potential members of the invading quasispecies are known. Quasispecies fixation has two important characteristics: First, a sequence with negative selection coefficient has a positive fixation probability as long as it has the potential to grow into a quasispecies with an overall growth rate that exceeds that of the established quasispecies. Second, the fixation probabilities of sequences with identical fitnesses can nevertheless vary over many orders of magnitudes. Two approximations for the probability of fixation are introduced. Both approximations require only partial knowledge about the potential members of the invading quasispecies. The performance of these two approximations is compared to the exact fixation probability on a network of RNA sequences with identical secondary structure.
ONE of the most remarkable aspects of the dynamics of RNA viruses is the high rate at which mutant variants are produced. At mutation rates close to one substitution per genome per generation (Drake 1993; Drake and Holland 1999), a virus population forms a highly diverse cloud of mutants (Domingo et al. 1976, 1978; Hollandet al. 1982; Steinhaueret al. 1989; Biebricher and Luce 1993; Burch and Chao 2000), a socalled quasispecies (Eigen and Schuster 1979; Nowak 1992; Domingo and Holland 1997; Domingoet al. 2001). At the same time, the sequence space is so large that even for population sizes up to 10^{12}, there is a constant stream of new mutants that have never existed before. Most of these mutants have impaired fitness, but occasionally, a new mutant will fare better than all currently existing virions, for example, by presenting an epitope that the immune system fails to recognize. With a certain probability, this mutant will rise to fixation, where fixation is understood in the sense that the mutant becomes the ancestor of a new quasispecies, which completely replaces the currently existing one.
The problem of the fixation of an advantageous mutant is an old one, with a long history of investigations in classical population genetics, reaching back to Haldane and Fisher (Fisher 1922, 1930; Haldane 1927; Kimura 1957, 1964, 1970; Ewens 1967; Kimura and King 1979; Barton 1995; Bürger and Ewens 1995; Otto and Barton 1997; Pollak 2000). However, these investigations differ from the quasispecies case in one important aspect: the mutation rates considered. In classical population genetics, the usual assumption is that mutations are rare events, such that an invading mutant will not mutate again while it is moving toward either fixation or extinction. In the quasispecies setting, on the other hand, most of the immediate offspring of a mutant will have further mutations, and their offspring will as well, and so on. As a consequence, the fitness of a prospective invading quasispecies is not given by the fitness of the initial mutant, but rather by the average fitness of the offspring mutant cloud that will form eventually. One of the more surprising results of these dynamics is that a mutant with the ability to replace the currently existing quasispecies may actually have a reduced replication rate, if at the same time its robustness against further mutations is increased (Schuster and Swetina 1988; Wilke 2001b; Wilkeet al. 2001; Krakauer and Plotkin 2002).
Quasispecies theory in its original formulation by Eigen and Schuster (1979) is based on deterministic differential equations and as such cannot deal with the fluctuations that are responsible for fixation or extinction of individual mutants. Within the more general mathematical framework of multitype branching processes, it is possible to describe both the deterministic aspects of large populations and the fluctuations inherent in the dynamics of small and very small populations (Demetriuset al. 1985; Hofbauer and Sigmund 1988; Hermissonet al. 2002). An expression for the probability of fixation follows naturally from branching process theory. We discuss how this expression relates to the predictions of the deterministic quasispecies equations, as well as to the results of classical population genetics. The remainder of the article is organized as follows.
First, we derive a general expression for the probability of fixation in an arbitrary fitness landscape. Then, we discuss the special case of fixation on a neutral network, that is, the case in which all sequences of the invading quasispecies have the same fitness, and derive two approximations for the fixation probability that can be evaluated without the knowledge of the fullfitness landscape. To give a concrete example, we apply both the exact expression and the approximations to a known network of >50,000 RNA sequences. For this neutral network, we also discuss how the fixation probability changes if multiple sequences invade at the same time.
THEORETICAL FRAMEWORK
For a population evolving under high mutational pressure, we have to understand fixation in the sense that a mutant is fixed once it has become a common ancestor of the whole population. The more traditional definition of fixation, which is to regard a mutation as fixed if all sequences in the population carry it, is not applicable: The mutational pressure constantly creates new deleterious mutants, which may not carry a particular mutation although their ancestors did so. If we understand fixation as the process by which a mutant becomes a common ancestor of the whole population, then the probability that a mutant is fixed is given by the probability that the cascade of further mutated offspring of the invading mutant does not come to a halt. We can calculate this probability from the theory of multitype branching processes.
The general setting to which our theory applies is as follows. Consider a viral quasispecies in mutationselection balance, with an average fitness 〈w〉. If generations are discrete and nonoverlapping, and the population size N is constant, then the probability that a virion i produces k offspring in one generation is given by WrightFisher sampling,
Assume that a rare mutation leads to the emergence of a virion with the potential to form a new quasispecies and to replace the already established one in the process. This new quasispecies (in the following also called the invading quasispecies) may consist of sequences of type 1, 2,..., n, with replication rates w_{i}. Let the probability that a sequence j produces an erroneous copy i be given by Q_{ij}. As long as the total abundance of the invading quasispecies is small compared to the established quasispecies, we can assume that 〈w〉 is not affected by the presence of the invading quasispecies. Then, the probability that a single sequence of type i generates (k_{1},..., k_{n}) offspring of types 1,..., n can be expressed as
Let x_{i} be the probability that the offspring cascade spawned by a sequence i goes extinct after a finite number of generations. From the theory of multitype branching processes (Harris 1963), we know that the vector of extinction probabilities x = (x_{1},..., x_{n}) satisfies x = f(x), where f(z) = (f_{1}(z),..., f_{n}(z)) is the probabilitygenerating function of the distribution of offspring probabilities P(k_{1},..., k_{n}i). The probabilitygenerating function is defined as
To compare Equation 6 to the result of Haldane (1927), we take the logarithm on both sides of Equation 6 and expand to second order:
For simplicity, we have considered only discrete, nonoverlapping generations. Generalization to continuous time is straightforward (see, e.g., Harris 1963; Hermissonet al. 2002). In the continuoustime case, the vector of fixation probabilities π is again determined by an equation of the form 1 – π = f(1 – π). However, the generating function f(z) is in general not given by Equation 5. Its functional form depends on the details of the continuoustime process that is being modeled. For example, if reproduction occurs through binary fission, f(z) will be quadratic in the variables z_{1},..., z_{n}.
FIXATION ON A NEUTRAL NETWORK
Exact expressions and estimates: So far, we have made no assumptions about the structure and fitness distribution of the invading quasispecies. This has led to a general equation for the vector of fixation probabilities π, but not much further analysis is possible without a concrete model for the fitness landscape of the invading quasispecies (we do not have to make any further assumptions about the established quasispecies, since it enters the equations only through its average fitness 〈w〉). The concrete fitness landscape we study is that of a neutral network (Huynenet al. 1996; BornbergBauer 1997) of related sequences with identical replication rate σ. All sequences that are not part of the neutral network are assumed to have a vanishing replication rate. Mutations occur as random substitutions of single bases, and we allow for at most one substitution per replication event, similar to the approach of van Nimwegen et al. (1999). The probability of a substitution is given by μ. The restriction to at most a single substitution is a technicality that simplifies the analysis. Generalization to more elaborate mutation schemes is possible along the lines of Wilke (2001a).
We denote the sequence length by L and the number of different bases by κ (κ= 4 for RNA/DNA). For the matrix M, we have to take into account only the sequences belonging to the neutral network. It is useful to introduce the connection graph G = (G_{ij}). The elements G_{ij} are 1 if and only if two sequences i and j are exactly one mutation apart. In all other cases, G_{ij} = 0. We can express M in terms of G as
The spectral radius of M is given in terms of the spectral radius of the connection graph ρ_{G} as
In an experimental setting, we cannot expect to have knowledge of the complete connection graph G. Therefore, it is important to have approximations for the fixation probability π_{i}. We consider two alternative methods. Both are based on replacing the matrix M in Equation 6 by a suitable diagonal matrix. This replacement leads to a decoupling of the equations for different π_{i}.
The quantity that is easiest to obtain experimentally is the growth rate of the invading quasispecies relative to the established quasispecies, when initially both are present in large and equal amounts. From the definition of M, we see that this relative growth rate corresponds to the spectral radius ρ_{M} of M. If we assume that every mutant present in the invading quasispecies has an expectation of ρ_{M} offspring per generation, then we can replace M in Equation 6 with a matrix that has entries ρ_{M} on the diagonal, while all offdiagonal elements are zero. Then, Equation 6 simplifies to 1 – π_{i} = e^{–ρ}_{M}^{π}_{i} for all i. Clearly, this approximation will overestimate the π_{i} for some mutants (mostly those that produce on average <ρ_{M} offspring) and underestimate it for others (mostly those that produce on average >ρ_{M} offspring). In the following, we refer to this estimate as the deterministic growth estimate, because it is based on the assumption that the invading quasispecies grows according to the deterministic equations from the outset.
The alternative method of estimating π_{i} is as follows. It is reasonable to assume that the first couple of replication cycles mostly determine fixation or extinction for an invading sequence. During these initial generations, the subpopulation descending from the invading sequence cannot explore the full neutral network if the network is large. Therefore, the major contribution to the fixation probability comes from the connection matrix of the local genetic neighborhood of the invading sequence, and sequences farther away on the neutral network are relatively unimportant. The idea behind the second approximation is therefore to calculate the fixation probability on the basis of a small area of genotype space surrounding the invading sequence. In the simplest case, we consider only the invading sequence and its immediate mutational neighbors. Assume sequence i has ν_{i} neutral neighbors, i.e., Σ_{j}G_{ij} =ν_{i}. Then the total expected number of offspring of sequence i is Σ_{j}M_{ij} = s + 1 +βν_{i}. Under the assumption that all offspring of i have the same expected number of further offspring, the probability of fixation satisfies the equation 1 – π = e^{–(s+1+βνi)πi}. We call the solution to this equation the neutrality estimate. As in the case of the deterministic growth estimate, it will overestimate the true fixation probability for some sequences and underestimate it for others.
Fixation on an RNA neutral network: We compared the two estimates to the exact fixation probabilities on a neutral network of RNA sequences. The network of 51,028 sequences of length L = 18 was found through exhaustive enumeration by van Nimwegen et al. (1999). The spectral radius of the network's connection graph is ρ_{G} = 15.7. To calculate fixation probabilities on this neutral network, we have to make an assumption about the average fitness 〈w〉 of the established quasispecies. We assume 〈w〉 = 1 –μ[1 –ρ_{G}/(3L)], in which case the relative growth rate of the invading quasispecies (at macroscopic concentration) with respect to the established quasispecies follows from Equation 10 as ρ_{M} =σ, independent of the mutation rate.
Figure 1 displays the exact fixation probabilities (obtained numerically from Equation 6) and the two estimates as functions of the mutation rate. We have shown the average fixation probability π= Σπ_{i}/n, the minimum probability π_{min} = min_{i}{π_{i}}, and the maximum probability π_{max} = max_{i}{π_{i}}. Since we chose 〈w〉 such that ρ_{M} is independent of μ, the deterministic growth estimate is independent of μ. We observe that the deterministic growth estimate lies consistently above the average π, but below the maximum π_{max}. The neutrality estimate underestimates the smallest fixation probabilities and overestimates the largest ones. Its average lies slightly below π for small mutation rates and above π for large mutation rates. A more detailed plot of the fixation probabilities at a fixed mutation rate of μ= 0.5 is given in Figure 2. There, we display the fixation probability vs. the neutrality (number of neutral neighbors) of the invading sequence. The spread in the fixation probabilities is remarkable. For sequences with a given neutrality, the fixation probabilities vary over up to seven orders of magnitude. This demonstrates the important influence of not only the nearest neighbors but also the wider genetic neighborhood on the fate of a single sequence in quasispecies evolution. The neutrality estimate substantially underestimates the fixation probabilities of those sequences that have only few immediate neutral neighbors, but are otherwise located in a region of the genotype space where the density of neutral sequences is high. In principle, we could improve the neutrality estimate by taking into account all neutral sequences up to some distance d, but in practice this method becomes quickly as unwieldy as calculating the exact fixation probabilities.
Multiple invading sequences: The above considerations address only the case of a single invading sequence. The generalization to more than one invading sequence is straightforward. Assume that a set S of N sequences, with S = {i_{1},..., i_{N}}, invades an established quasispecies. The probability that this invasion is successful is given by 1 – Π_{i}_{∈}_{S}(1 –π_{i}), where π_{i} are the fixation probabilities of the individual sequences. The probability of successful invasion of N sequences can be used as an indicator for the population size at which the deterministic quasispecies equations capture the relevant dynamics of a finite population. The fluctuations distinguishing the stochastic process of a finite population from the deterministic description can be neglected if the invasion probability is close to one. In Figure 3, the fixation probability on the same neutral network of RNA sequences that we have used before is displayed against the size of the invading population. The individual data points are averaged over 1000 independent trials, where for each trial the N starting sequences were chosen at random. As before, 〈w〉 is chosen such that σ is the average number of offspring of the invading quasispecies in the deterministic limit.
Figure 3 shows that the population need not cover the relevant sequence space to behave as predicted by the deterministic equations. On a neutral network of >50,000 sequences, a population of ∼1000 behaves deterministically at an advantage in growth rate of only 1%. It is important to note that this advantage has been calculated under the assumption of an infinite population and that sufficiently small populations will grow substantially slower (van Nimwegenet al. 1999). Apparently, here a population that covers only 2% of the neutral network is not sufficiently small to experience this reduction in growth rate.
DISCUSSION
The exact expression for the probability of fixation in the quasispecies context is easy to evaluate numerically if the fitnesses of all relevant sequences are known. However, these data are normally not available for experimental systems, and approximations must be used. What is most easily available experimentally is the relative rate of growth of the two quasispecies at macroscopic concentrations, which is the basis of the deterministic growth estimate. Since this estimate gives only a single number, independently of the sequence actually seeding the invading quasispecies, it does not reflect local variations in the density of viable sequences around the invading sequence. The neutrality estimate does not suffer from this shortcoming. However, it requires the knowledge of the fitnesses of the immediate neighbors of the invading sequence. Although experimentally tedious, these fitnesses can be measured in principle. For example, Elena and Lenski (1997) generated 225 mutant strains of the bacterium Escherichia coli (each mutant differed from the wild type by one, two, or three mutations) and measured the relative fitnesses of the mutant strains to the wild type. The mutant neighborhood of an RNA virus can conceivably be measured in a similar manner.
The predictive power of both the deterministic growth estimate and the neutrality estimate depends strongly on the distribution of neutral sequences in sequence space. For example, both estimates become exact for the case of a uniform neutral lattice, in which all sequences have exactly the same neutrality. Furthermore, we expect the neutrality estimate to perform particularly well in networks in which a sequence's neutrality is strongly correlated to the neutralities of its immediate and more distant neutral neighbors. The deterministic growth estimate, on the other hand, will yield the best results if the neutral network does not decompose into areas that are substantially more densely or less densely connected than other areas. However, to what extent these conditions are met in natural systems is questionable. As we have seen in this article, the connection graph of a comparatively simple neutral network—consisting of RNA sequences that are only 18 bp long—is already so heterogeneous that both estimates fail to give an accurate prediction of the fixation probability for a substantial fraction of sequences on that network. It is reasonable to assume that the distribution of highfitness sequences in sequence space for an RNA virus that consists of several thousand bases is at least as heterogeneous as the one in our toy RNA network, probably more so. In this work, we have considered only the fate of a single invading quasispecies. However, while an invading quasispecies is moving toward fixation or extinction, another mutant, one that belongs to a quasispecies of even higher mean fitness, may appear. The fixation probability of the first invader will then be modulated by the dynamics of the second one and vice versa, an effect commonly referred to as “clonal interference” (Gerrish and Lenski 1998). Clonal interference has been reported in experiments with vesicular stomatitis virus (Miralles et al. 1999, 2000) and with the bacterium E. coli (de Visseret al. 1999). Currently, an accurate mathematical description of clonal interference for the quasispecies case is not available.
The approach we have followed in this work cannot be directly generalized to include clonal interference, because the assumption of a constant background average fitness 〈w〉 is not justified in the context of two (or more) competing branching processes. A second problem that we have to solve in a theory of quasispecies clonal interference is the identification of advantageous mutants. Throughout this article, we have used the definition that an advantageous mutant is one that can grow into a quasispecies with higher average fitness than that of the currently established quasispecies. To use this definition in the context of clonal interference, we need to have a priori knowledge about how to best subdivide the sequence space into independent quasispecies. Only with this knowledge can we decide whether a particular new mutant is part of the parent quasispecies or rather the founding member of a new quasispecies. A possible way to study clonal interference in future work will be to consider a particular fitness landscape—for example, a set of intertwined neutral networks at different fitness levels—for which the a priori separation into distinct quasispecies is possible. For such a landscape, numerical studies of clonal interference will be straightforward, and an analytic description should be possible as well. For landscapes that are a priori unknown, even the numerical investigation of clonal interference will remain difficult until a workable method for the identification of advantageous mutants has been found.
Recently, Jenkins et al. (2001) and Holmes and Moya (2002) expressed doubts regarding the relevancy of the quasispecies model for virology (but see Domingo 2002). They argued that there is no unequivocal experimental evidence for the quasispecies nature of RNA viruses and that the deterministic quasispecies equations are potentially not applicable to viral evolution on theoretical grounds, due to the immense size of the sequence space. The results of this article show that the second concern is not entirely justified. A single sequence has a positive probability to rise to fixation if and only if the average fitness of the quasispecies that will form eventually exceeds the average fitness of the currently established quasispecies. The individual fitness of the invading sequence has some influence on the exact value of that probability, but does not affect whether fixation is possible at all. Moreover, when the population size reaches several hundred, with probability of almost one the population will, for reasonable choices of the parameters, behave as predicted by the deterministic equations. A similar result has been obtained by van Nimwegen et al. (1999) for flow reactor simulations, where, on the same neutral network of RNA sequences that we have studied here, quasispecies effects started to become important when the product of population size and mutation rate Nμ exceeded the value 10 (see Figure 3 of van Nimwegenet al. 1999).
Wilke (2001b) studied the probability of fixation for RNA sequences in a simulated flow reactor. The measured fixation probability was compared to an expression equivalent to the deterministic growth estimate of this work (since continuous time simulations were used to generate the data, the exact expressions differ from those given here). The analytic expression correctly predicted the parameter regions for which fixation was possible. In particular, the mutation rate at which a slower replicator with better mutational support could successfully invade a quasispecies consisting of sequences with higher individual fitnesses was determined accurately. However, the exact fixation probabilities seemed to be slightly overestimated. (Within the statistical accuracy of the data, a definite decision on this issue could not be made. While the data were in agreement with the model according to a χ^{2} test, they were not in agreement according to a nonparametric test based on how often the data points fell above or below the predicted value.)
The probability of fixation of advantageous mutants is obviously of tremendous importance for disease dynamics and vaccines. For example, live vaccinces of attenuated poliovirus can contain small amounts of virulent poliovirus variants (Chumakovet al. 1991), the reason being that attenuated and virulent virus variants are often separated by only one or a few mutations. In experiments, small amounts of highly virulent virus remain typically suppressed by the less virulent virus, but once a threshold concentration of the highly virulent virus variant is reached, infection occurs (de la Torre and Holland 1990; Chumakovet al. 1991; Tenget al. 1996). The apparent existence of such a threshold may well be a result of insufficient resolution of the experiments. Whether the highly virulent strain will grow is determined by stochastic fluctuations, and, as we have seen in Figure 3, the probability of fixation decays quickly with shrinking initial concentration of the virulent strain. If such a strain in a vaccine has a 1% chance to cause infection, then ≫100 replicates of the appropriate assay are necessary to observe at least one infection with certainty. Probabilities of this magnitude or lower can easily be missed at low numbers of replicates, so that the virulent strain appears to be safely suppressed.
Acknowledgments
I thank C. Adami for extensive discussions and encouragement and E. C. Holmes for commenting on an earlier version of this manuscript. Moreover, I thank M. Huynen for providing the neutral network data from van Nimwegen et al. (1999). Financial support by the National Science Foundation under contract DEB9981397 is gratefully acknowledged.
APPENDIX
We consider a model with discrete, nonoverlapping generations and a constant population size N. Under the assumption that the reproductive success of a sequence i is proportional to its fitness w_{i}, the probability that a randomly chosen sequence in the next generation is offspring of sequence i is given by ξ= w_{i}/(〈w〉N), where 〈w〉 is the average fitness in the population. Since there are N sequences in the population, the probability that k of them are offspring of sequence i is binomial,
We can extend the above argument to sequences of two types, r and s. The probability that sequence i leaves k_{r} offspring sequences of type r and k_{s} offspring sequences of type s is the probability that k_{r} offspring are of type r,
Footnotes

Communicating editor: M. W. Feldman

Note added in proof: After acceptance of this manuscript, I became aware of a recent paper by T. Johnson and N. H. Barton (2002, The effect of deleterious alleles on adaptation in asexual populations. Genetics 162: 395–411). Using the theory of multitype branching processes, Johnson and Barton studied in this article the probability of fixation of a mutant that suffers additional deleterious mutations while going to fixation.
 Received June 3, 2002.
 Accepted November 1, 2002.
 Copyright © 2003 by the Genetics Society of America