# Thermodynamics of Neutral Protein Evolution

^{*}Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125,^{†}Keck Graduate Institute of Applied Life Sciences and School of Mathematical Sciences, Claremont Graduate University, Claremont, California 91711 and^{‡}Section of Integrative Biology and Center for Computational Biology and Bioinformatics, University of Texas, Austin, Texas 78712

- 1
*Corresponding author:*Division of Chemistry and Chemical Engineering, California Institute of Technology, Mail Code 210-41, 1200 E. California Blvd., Pasadena, CA 91125. E-mail: jesse.bloom{at}gmail.com

## Abstract

Naturally evolving proteins gradually accumulate mutations while continuing to fold to stable structures. This process of neutral evolution is an important mode of genetic change and forms the basis for the molecular clock. We present a mathematical theory that predicts the number of accumulated mutations, the index of dispersion, and the distribution of stabilities in an evolving protein population from knowledge of the stability effects (ΔΔ*G* values) for single mutations. Our theory quantitatively describes how neutral evolution leads to marginally stable proteins and provides formulas for calculating how fluctuations in stability can overdisperse the molecular clock. It also shows that the structural influences on the rate of sequence evolution observed in earlier simulations can be calculated using just the single-mutation ΔΔ*G* values. We consider both the case when the product of the population size and mutation rate is small and the case when this product is large, and show that in the latter case the proteins evolve excess mutational robustness that is manifested by extra stability and an increase in the rate of sequence evolution. All our theoretical predictions are confirmed by simulations with lattice proteins. Our work provides a mathematical foundation for understanding how protein biophysics shapes the process of evolution.

PROTEINS evolve largely through the slow accumulation of amino acid substitutions. Over evolutionary time, this process of sequence divergence creates homologous proteins that differ at the majority of their residues, yet still fold to similar structures that often perform conserved biochemical functions (Lesk and Chothia 1980). The maintenance of structure and function during sequence divergence suggests that much of protein evolution is neutral in the sense that observed sequence changes frequently do not alter a protein's ability to fold and adequately perform the biochemical function necessary to enable its host organism to survive. This comparative evidence for neutrality in protein evolution has been corroborated by experimental studies showing that the mutations separating diverged sequences often have no effect other than modest and additive changes to stability (Serrano *et al.* 1993) and that a large fraction of random mutations do not detectably alter a protein's structure or function (Shortle and Lin 1985; Pakula *et al.* 1986; Loeb *et al.* 1989; Guo *et al.* 2004; Bloom *et al.* 2005, 2006a). In this respect, it seems that protein evolution should be well described by Kimura's neutral theory of evolution, which holds that most genetic change is due to the stochastic fixation of neutral mutations (Kimura 1983). One of the key predictions of the neutral theory is that assuming a constant mutation rate, the number of mutations separating two proteins should be proportional to the time since their divergence (Kimura 1983). Indeed, the observation by Zuckerkandl and Pauling (1965) that proteins are “molecular clocks” that accumulate mutations at a roughly constant rate has long been taken as one of the strongest pieces of evidence supporting the neutral theory (Ohta and Kimura 1971).

However, mutations that are neutral with respect to a protein's capacity to perform its biological function often affect protein thermodynamics. The biological functions of most proteins depend on their ability to fold to thermodynamically stable native structures (Anfinsen 1973). Yet natural proteins are typically only marginally stable, with free energies of folding (Δ*G*_{f}) between −5 and −15 kcal/mol (Fersht 1999). Most random mutations to proteins are destabilizing (Pakula *et al.* 1986; Matthews 1993; Godoy-Ruiz *et al.* 2004; Kumar *et al.* 2006), and their effects on stability (measured as ΔΔ*G*, the Δ*G*_{f} of the mutant protein minus the Δ*G*_{f} of the wild-type protein) are frequently of the same magnitude as a protein's net stability. The impact of a mutation on a protein's function can therefore depend on the protein's stability: a moderately destabilizing mutation that is easily tolerated by a stable parent protein may completely disrupt the folding of a less stable parent. This effect of protein stability on mutational tolerance has been verified by experiments demonstrating that more stable protein variants are markedly more robust to random mutations (Bloom *et al.* 2005, 2006a).

The fact that mutations that are neutral with respect to direct selection for protein function can affect a protein's tolerance to subsequent mutations is not consistent with the simplest formulation of the neutral theory of evolution, which tends to assume that the fraction of mutations that is neutral remains constant in time. Kimura (1987) himself recognized the possibility that the neutrality might change, and Takahata (1987) mathematically treated the consequences of a “fluctuating neutral space.” In particular, Takahata showed that fluctuating neutrality could explain the observed overdispersion in the molecular clock (Cutler 2000b) (the tendency for the variance in the number of fixed mutations to exceed the expectation for the Poisson process predicted by the neutral theory) long considered troublesome for the neutral theory. However, further progress on this topic was stymied by the lack of a specific model for how or why protein neutrality might fluctuate.

More recently, researchers have preferred to describe neutral evolution using the concept of “neutral networks,” which are networks in the space of possible protein sequences in which each functional protein is linked to all other functional proteins that differ by only a single mutation (Smith 1970; Huynen *et al.* 1996; Govindarajan and Goldstein 1997; Bornberg-Bauer and Chan 1999; van Nimwegen *et al.* 1999; Tiana *et al.* 2000; Bastolla *et al.* 2002). A neutrally evolving protein population is then envisioned as moving on the neutral network, and the neutrality of the population may fluctuate if the nodes on the network differ in their connectivities. A general theoretical treatment of evolution on neutral networks by van Nimwegen *et al.* (1999) has shown that if the product of the population size and mutation rate is small then members of the population are equally likely to occupy any node, while if this product is large then the population will preferentially occupy highly connected nodes (see also Bornberg-Bauer and Chan 1999; Taverna and Goldstein 2002b; Xia and Levitt 2004). Simulations with simplified lattice models of proteins have attempted to provide insight into the specific features of protein neutral networks. These simulations have shown that lattice protein neutral networks are centered around highly connected nodes occupied by stable proteins (Bornberg-Bauer and Chan 1999; Broglia *et al.* 1999; Xia and Levitt 2004; Wingreen *et al.* 2004), a finding consistent with the experimental observation (Bloom *et al.* 2005, 2006a) that stable proteins are more mutationally robust. Lattice protein studies also suggest that protein structures differ in their “designabilities” (defined as the total number of sequences that fold into a structure) and that sequences that fold into more designable structures will neutrally evolve at a faster rate due to the increased size and connectivity of their neutral networks (Li *et al.* 1996; Govindarajan and Goldstein 1997; Chan and Bornberg-Bauer 2002; England and Shakhnovich 2003; Wingreen *et al.* 2004). Finally, simulations have demonstrated that fluctuations in neutrality as a protein population moves along its neutral network can lead to an overdispersion of the molecular clock (Bastolla *et al.* 2002), as originally suggested by Takahata (1987). However, an extension of these lattice protein simulations of evolution on neutral networks into a quantitative theory has been difficult because protein neutral networks are far too large to be computed for all but the simplest lattice models.

Here we present a mathematical treatment of neutral protein evolution that describes the evolutionary dynamics in terms of the ΔΔ*G* values for single mutations, which are experimentally measurable. Our treatment is based on the experimentally verified (Bloom *et al.* 2005, 2006a) connection between protein stability and mutational robustness, as well as a few biophysically supported assumptions about ΔΔ*G* values for random mutations. By linking a protein's tolerance to mutations with stability, we are able to quantitatively describe neutral evolution without a full description of the neutral network. We can then compute the average number of accumulated mutations, the average fraction of neutral mutations, the index of dispersion, and the distribution of stabilities in a neutrally evolving population solely from knowledge of the ΔΔ*G* values for single mutations. In addition, we follow the formalism of van Nimwegen *et al.* (1999) to calculate all four of these properties in the limit when the product of the population size and mutation rate is much less than one and in the limit when this product is much greater than one. In demonstrating that these properties are different in these two limits, we show that the rate of fixation of neutral mutations can vary with population size in violation of one of the standard predictions of Kimura's neutral theory (Kimura 1987). Our work presents a unified view of neutral protein evolution that is grounded in measurable thermodynamic quantities.

## MATERIALS AND METHODS

#### Lattice protein simulations:

We performed simulations with lattice proteins of *L* = 20 monomers of 20 types corresponding to the natural amino acids. The proteins could occupy any of the 41,889,578 possible compact or noncompact conformations on a two-dimensional lattice. The energy of a conformation is the sum of the nonbonded nearest-neighbor interactions, , where is one if residues *i* and *j* are nearest neighbors in conformation and zero otherwise, and is the interaction energy between residue types and , given by Table 5 of Miyazawa and Jernigan (1985). We computed the stability of a conformation as , where is the partition sum, made tractable by noting that there are only 910,972 unique contact sets. All simulations were performed at a reduced temperature of *T* = 1.0.

We used adaptive walks to find sequences that folded into each of the three arbitrarily chosen conformations shown in Figure 2 with Δ*G*_{f} ≤ 0, and then neutrally evolved these sequences for 10^{4} generations with a population size of *N* = 100. Our evolutionary algorithm was as follows: at each generation we randomly chose a protein that folded to the parental structure with Δ*G*_{f} ≤ 0 from the population and mutated each residue to some other randomly chosen residue with probability 5 × 10^{−4} and continued doing this until we had filled the new population with proteins. At the end of this equilibration evolution, we chose the most abundant sequence in the population as the starting point for further analysis and for the computation of the distribution of ΔΔ*G* values for all 380 point mutations (sequences shown in Figure 2). In principle, computing the distribution of ΔΔ*G* values over all sequences in the population rather than just the most abundant one should give a more accurate representation of the true form of this distribution, and indeed we found that doing this slightly increased the accuracy of the predictions shown in Figure 2. However, the resulting improvement in accuracy was small, since the approximate constancy of the ΔΔ*G* distribution during neutral evolution (discussed below) means that the distribution computed over a single sequence is representative of that computed over all sequences in the population. Therefore, we chose to compute the ΔΔ*G* distribution over just the most abundant sequence since this choice more closely tracks what would be experimentally feasible with real proteins. (It is experimentally tractable to compute ΔΔ*G* values for a single protein, but would be unmanageable to do so for all proteins in a natural population.)

To collect data for the case when the product *N*μ of the population size *N* and the per protein per generation mutation rate μ is , we first equilibrated 1000 replicates by evolving each of them with a population size of *N* = 10 and for 5000 generations starting with a clonal population of the initial sequence described above. The remainder of the evolutionary algorithm was as described above: the mutation rate stayed at 5 × 10^{−4} per residue per generation (corresponding to a per protein per generation mutation rate of μ = 10^{−2}), and at each generation all proteins that folded to the target native structure with Δ*G*_{f} ≤ 0 reproduced with equal probability. We then evolved each of these equilibrated populations for a further 5000 generations to collect data. We combined the data for all the folded proteins in the final populations of all the replicates to calculate the average number of mutations after *T* generations, the corresponding index of dispersion *R _{T}*, and the distribution of stabilities shown in Figure 2. If we instead simply randomly chose a single folded protein from the final population of each replicate, we obtained results that were identical within the precision shown in Figure 2. We emphasize that and

*R*were computed by keeping track of the actual number of mutations that had occurred during the evolutionary history of each protein, not simply by counting the number of amino acid differences between the ancestral and the final sequences (the two quantities may differ if a single site undergoes multiple mutations, as discussed in more detail in later sections).

_{T}To generate the data for , we used the same procedure but with *N* = 10^{5} and performed only 10 replicates. We again computed the statistics shown in Figure 2 by combining the data for all of the folded proteins in the final populations of all 10 replicates. Similar results were obtained if we instead computed and *R _{T}* over all of the folded proteins in the final population of a single replicate (average values of were identical while the

*R*values of 1.03, 0.95, and 0.94 were extremely similar to those shown from top to bottom in Figure 2). This outcome is expected since the probability distributions for evolve deterministically.

_{T}#### Lattice protein predictions:

The numerical predictions for the lattice proteins given in Figure 2 were computed by constructing the matrix **W** described in the first section of results with a bin size of *b* = 0.005 and truncating the matrix by assuming that no proteins would have stabilities < −5.0. For the case when , was calculated using Equation 6 and *R _{T}* was calculated using Equation 11. For , was calculated using Equation 18 and

*R*was calculated using Equation 19.

_{T}## RESULTS

#### Assumptions and mathematical background:

Here we describe the physical view of protein evolution that motivates our work. We begin with the basic observations that evolution selects for protein function and that most proteins must stably fold to function (Anfinsen 1973), meaning that protein stability is under evolutionary pressure only insofar as it must be sufficient to allow a protein to fold and function. In taking this view, we ignore those proteins (estimated at 10% of prokaryotic and 30% of eukaryotic proteins) that are intrinsically disordered (Uversky *et al.* 2005), as well as those rare proteins that are only kinetically stable (Jaswal *et al.* 2002). Natural selection for function requires a protein to fold with some minimal stability , since proteins that lack this minimal stability will be unable to reliably adopt their native structure and perform their biochemical task. A protein's extra stability beyond this minimal threshold is quantified as , meaning that all functional proteins must have (more negative values of Δ*G*_{f} indicate increased stability). We further assume that as long as , natural selection for protein function is indifferent to the exact amount of extra stability a protein possesses. This assumption is at odds with the persistent speculation that high stability inherently impairs protein function and so is selected against by evolution (DePristo *et al.* 2005; Somero 1995). But the circular argument most commonly advanced to support this speculation—that the observed marginal stability of natural proteins indicates that higher stability is detrimental to protein function—has now been contradicted both by experiments that have dramatically increased protein stability without sacrificing function (Serrano *et al.* 1993; Giver *et al.* 1998; van den Burg *et al.* 1998; Zhao and Arnold 1999) and by demonstrations that marginal stability is a simple consequence of the fact that most mutations are destabilizing (Arnold *et al.* 2001; Taverna and Goldstein 2002a; this work). There is a possibility, however, that certain regulatory proteins must be marginally stable to facilitate rapid degradation (Huntzicker *et al.* 2006). To summarize, current biochemical evidence supports our assumption that (with certain well-defined exceptions) the only requirement imposed on protein stability by natural selection for protein function is that stability must meet or surpass some minimal threshold (a protein must have ).

A mutation to a protein changes its stability by an amount ΔΔ*G*, and experimental measurements of ΔΔ*G* values have shown that most mutations are destabilizing (have ΔΔ*G* > 0) (Pakula *et al.* 1986; Matthews 1993; Godoy-Ruiz *et al.* 2004; Kumar *et al.* 2006). A mutation is neutral with respect to selection for stability if since the mutant protein still satisfies the minimal stability threshold; otherwise the mutant does not stably fold and is culled by natural selection. Of course, mutations can also have specific effects on protein function (such as altering an enzyme's activity), but experiments have shown that such mutations are rare compared to the large number of mutations that affect stability (Shortle and Lin 1985; Pakula *et al.* 1986; Loeb *et al.* 1989; Bloom *et al.* 2006a). Mutations can also have effects unrelated to the functioning of the individual protein molecule: they can affect its propensity to aggregate (Chiti *et al.* 2000), alter its codon usage (Akashi 2003), change its mRNA stability (Chamary and Hurst 2005), affect the efficiency or accuracy of translation (Akashi 2003; Rocha and Danchin 2004), or change the fraction of mistranslated proteins that fold (Drummond *et al.* 2005). These higher-level effects are probably most apparent in the evolution of highly expressed proteins (Pal *et al.* 2001; Drummond *et al.* 2005). However, here we ignore such effects and assume that the evolutionary impact of a mutation is determined mostly by its effect on protein stability (an assumption in agreement with a recent bioinformatics analysis by Sanchez *et al.* 2006). The view we present therefore describes the impact of a mutation solely by its ΔΔ*G* value and the of the wild-type protein and is summarized graphically in Figure 1. We have previously used a similar view to successfully describe experimental protein mutagenesis results (Bloom *et al.* 2005, 2006a).

To use the view of Figure 1 to construct a useful description of neutral protein evolution, we make one major assumption: that the overall distribution of ΔΔ*G* values for random mutations stays roughly constant as the protein sequence evolves. Actually, this assumption is stronger than is strictly needed for the mathematical theory presented below—the theory can be developed simply by assuming that all proteins with the same Δ*G*_{f} have the same distribution of ΔΔ*G* values (in this case the matrix elements *W _{ij}* introduced below depend on

*j*in addition to the difference

*i*−

*j*). However, we make the stronger assumption that the ΔΔ

*G*distribution remains constant during sequence evolution, since we believe that this assumption is consistent with existing evidence. We emphasize that this assumption does not imply that we are arguing that the ΔΔ

*G*distribution is identical for every possible protein sequence. Clearly, any given structure has a most stable sequence (with all ΔΔ

*G*values positive), a least stable sequence (with all ΔΔ

*G*values negative), and a vast range of sequences in between. However, most of these sequences fall within a stability range that is never populated by evolution, since simulations (Taverna and Goldstein 2002a) and experiments (Davidson

*et al.*1995; Keefe and Szostak 2001) clearly show that the vast majority of protein sequences do not stably fold into any structure (meaning the least stable folded protein is still far more stable than the typical random sequence). Among the subset of sequences that do stably fold, the simple statistical reality that marginally stable sequences are far more abundant than highly stable sequences causes evolution to further confine itself mostly to sequences with stabilities far less than that of the most stable sequence (Arnold

*et al.*2001; Taverna and Goldstein 2002a; this work). This fact is amply demonstrated by engineering experiments that have greatly increased the stability of natural proteins without sacrificing any of their functional properties (Serrano

*et al.*1993; Giver

*et al.*1998; van den Burg

*et al.*1998; Zhao and Arnold 1999). Therefore, although the distribution of ΔΔ

*G*values certainly varies widely among all sequences, it is still reasonable to assume that it is relatively constant among those sequences visited by natural evolution. This assumption of a constant ΔΔ

*G*distribution among evolved sequences is explicitly supported by simulations (Broglia

*et al.*1999; Bloom

*et al.*2005, 2006a; Wilke

*et al.*2005; this work) and is consistent with the observation that the number of neighbors on a protein's neutral network is approximately determined by its stability (Bornberg-Bauer and Chan 1999; Xia and Levitt 2004). Furthermore, protein mutagenesis experiments indicate that the ΔΔ

*G*values for random mutations are usually additive (Wells 1990; Serrano

*et al.*1993), meaning that any given mutation to a protein of length

*L*will alter only ≈1/

*L*of the other ΔΔ

*G*values, leaving the ΔΔ

*G*distribution mostly unchanged. Finally, the assumption of a constant ΔΔ

*G*distribution has been shown to explain the experimentally observed exponential decline in the fraction of functional proteins with increasing numbers of mutations (Bloom

*et al.*2005). However, we acknowledge that at present the assumption of a roughly constant ΔΔ

*G*distribution among neutrally evolving proteins can be verified only for lattice proteins—for real proteins the most we can say is that it is consistent with existing experimental evidence.

We begin our mathematical treatment by conceptually dividing the continuous variable of protein stability into small discrete bins of width *b*. This discretization of stability allows us to treat mutations as moving a protein from one bin to another—the bins can be made arbitrarily small to eliminate any numerical effects of the binning. The stability of each folded protein in the evolving population (the folded proteins are all those with ) can be described by specifying its stability bin. Specifically, a protein is in bin *i* if it has between and −*ib*, where *i* = 1, 2, … . Let *W _{ij}* be the probability that a random mutation has a ΔΔ

*G*value such that it moves a protein's stability from bin

*j*to bin

*i*, where

*i*and

*j*both are in the range 1, 2, … . Then

*W*is easily computed as the fraction of ΔΔ

_{ij}*G*values between and . Since

*W*describes only transitions between folded proteins, and since we have assumed that a protein's mutational tolerance is determined by its stability, then the fraction of folded mutants (neutrality) of a protein in bin

_{ij}*j*is . Clearly, more stable proteins will have larger values of ν

_{j}.

In the following two sections, we use the matrix **W** with elements *W _{ij}* to calculate the distribution of stabilities in an evolving protein population of constant size

*N*, the mean number of mutations after

*T*generations, the corresponding index of dispersion , and the average fraction of mutations that do not destabilize the proteins past the minimal stability threshold. We assume that

**W**is computed from the distribution of ΔΔ

*G*values for all random single-amino-acid mutations, although in principle it could be for any type of mutation. We also assume that the per-protein-per-generation mutation rate μ is small, so that at each generation a protein undergoes at most one mutation. Our calculations at first follow and then extend the theoretical treatment by van Nimwegen

*et al.*(1999) of evolution on a neutral network. In particular, we follow their lead in separately treating the two limiting cases where the product

*N*μ of the population size and mutation rate is ≪1 and ≫1. We emphasize that all of the equations derived in the following two sections depend only on the mutation rate μ, the number of generations

*T*, and the matrix

**W**, which can be computed from the single-mutant ΔΔ

*G*values. The population size

*N*determines the applicable limiting case, but otherwise drops out of all final results.

#### Limit when *N*μ≪1:

When , the evolving population is usually clonal, since each mutation either is lost or goes to fixation before the next mutation occurs. If a mutation destabilizes a protein in the population beyond the stability cutoff, then it is immediately culled by natural selection. If a mutation does not destabilize a protein beyond the stability cutoff, it will be lost to genetic drift with probability and go to fixation with probability 1/*N* (Kimura 1983). Since mutations occur rarely (), the loss or fixation of the mutant will occur before the next mutant appears in the population. The entire population therefore moves as one entity along its neutral network. The population can thus be described by the column vector , with element giving the probability that the population is in stability bin *i* at time *t*.

If the population is initially in stability bin *j*, at each generation there is a probability *N*μ*W _{ij}* that a protein experiences a mutation that changes its stability to bin

*i*, and if such a mutation occurs, then there is a probability of 1/

*N*that it is eventually fixed in the population. Therefore, at each generation there is a probability μ

*W*that the population experiences a mutation that eventually causes it to move from stability bin

_{ij}*j*to bin

*i*. If we define the matrix

**V**so that the diagonal elements are given by

*V*= ν

_{ii}_{i}and all other elements are zero, then

**p**evolves according to(1)where

**I**is the identity matrix. Note that this equation treats lethal mutations (those that destabilize a protein beyond the cutoff) as immediately being lost to natural selection and so leaving the population in its original stability bin (hence the population accumulates a mutation with probability μ

**V**rather than probability μ). Equation 1 describes a Markov process with the nonnegative, irreducible, and acyclic transition matrix

**A**=

**I**− μ

**V**+ μ

**W**, and so

**p**approaches the unique stationary distribution

**p**, satisfying(2)

_{o}This equation gives the expected distribution of protein stabilities solely in terms of the single-mutant ΔΔ*G* values.

We now calculate the average number of mutations that accumulate in an equilibrated population after *T* generations and the corresponding index of dispersion *R _{T}*

_{,o}. We emphasize that represents the average number of accumulated mutations during the course of the evolutionary process. When the number of accumulated mutations

*m*is small compared to the length of the protein sequence

*L*(), then

*m*is just equal to the number of residues differing from those in the parent protein sequence (the Hamming distance). However, when

*m*becomes substantial relative to

*L*,

*m*becomes larger than the Hamming distance since some sites will undergo multiple mutations (Jukes and Cantor 1969). In this case it is necessary to use a substitution model to infer

*m*from the observed Hamming distance. In the treatment that follows, we calculate the expected value of

*m*; application of these formulas to actual protein sequences requires use of one of the well-established statistical techniques for inferring

*m*from the Hamming distance (Jukes and Cantor 1969; Goldman and Yang 1994). We begin the calculation of by defining to be the column vector with element

*i*giving the probability that at time

*t*the population has accumulated

*m*mutations and is in stability bin

*i*. The time evolution of is given by(3)

The *k*th moment of the number of mutations at time *t* is(4)where is the unit row vector. We can write a recursive equation for in the long-time limit (steady state) by multiplying both sides of Equation 3 by *m*, summing over *m*, and left multiplying by **e** to obtain(5)where we have used the property **eA** = **e**, noted that in the long-time limit and , and defined the average neutrality as = **eWp _{o}** =

**eVp**. Summing the recursion yields the steady-state value for the number of accumulated mutations,(6)

_{o}To calculate the index of dispersion , we need to find the second moment . In a fashion analogous to the construction of Equation 5, we can write a recursive expression for the long-time limit of as(7)where we have used the property (implicit in Equation 5) that in the long-time limit, . Summing the recursion yields the following value for the long-time limit,(8)where we have made the substitution **eWQWp _{o}** = and noted that since

**A**is an irreducible, aperiodic, stochastic matrix (Ewens and Grant 2005). This yields a value for the index of dispersion in the long-time limit of(9)

The above equation is consistent with the generic equation for the index of dispersion given by Cutler (2000a,b), where we now give concrete expressions for the variables ρ and in Cutler's formula in terms of measurable quantitites, namely ρ = μ and .

We can further simplify Equation 9 by performing spectral decompositions of **A** and **Q**. Let λ_{1}, … , λ_{K} be the eigenvalues of **V** − **W**, and let **r**_{1}, … , **r**_{K} and **l**_{1}, … , **l**_{K} be the corresponding right and left eigenvectors, normalized so that **l**_{i}**r**_{j} = 1 if *i* = *j* and 0 otherwise. These eigenvectors are also eigenvectors of the irreducible, acyclic, stochastic **A**, and the corresponding eigenvalues are 1 − μλ_{1}, … , 1 − μλ_{K}, with Perron–Frobenius theorems guaranteeing that one eigenvalue (chosen here to be 1 − μλ_{1}) = 1 and all other eigenvalues have absolute values <1. Then **r**_{1} and **l**_{1} are right and left eigenvectors of **Q** with eigenvalue 1 (*i.e*., **r**_{1} = **p _{o}** and

**l**

_{1}=

**e**), and all other eigenvalues of

**Q**are 0. The spectral decompositions are therefore

**Q**=

**r**

_{1}

**l**

_{1}and . Inserting these spectral decompositions into Equation 9, we find for the index of dispersion a value of(10)since (Ewens and Grant 2005). In the limit of large

*T*and small μ, the value of

*R*

_{T}_{,o}given by the above equation approaches the value(11)where the μ term drops out because μ is small and the term drops out because

*T*is large and . This equation shows that

*R*

_{T}_{,o}approaches a constant value independent of

*T*and μ. Although we could not prove that the value of

*R*

_{T}_{,o}given by Equation 11 is necessarily >1 (since some of the eigenvalues λ

_{i}could be complex), in all of our simulations we observed

*R*

_{T}_{,o}> 1, suggesting that when , fluctuations in protein stability tend to overdisperse the molecular clock.

#### Limit when *N*μ≫1:

When , the population is spread across many nodes of the neutral network rather than converged on a single sequence (van Nimwegen *et al.* 1999). In this limit, we treat the evolutionary dynamics of the population deterministically (*i.e.*, we assume an infinite population size) and describe the distribution of stabilities in the population by the column vector , with element giving the fraction of proteins in the population at time *t* that have stabilities in bin *i*. At generation *t*, the fraction of mutated proteins that continue to fold is . These folded proteins reproduce, and, to maintain a constant population size, this reproduction must balance the removal of proteins by death, meaning that each folded sequence must produce an average of offspring. The population therefore evolves according to(12)

After the population has evolved for a sufficient period of time, **x** approaches an equilibrium distribution of **x**_{∞}. The corresponding equilibrium neutrality is = **eWx**_{∞}, and the equilibrium reproduction rate is , so(13)

This equation can be rewritten to show that **x**_{∞} is the principal eigenvector of **W**,(14)

We note that approximates the asymptotic neutrality for the decline in the fraction of folded proteins upon random mutagenesis (Bloom *et al.* 2005; Wilke *et al.* 2005).

We now determine the average number of accumulated mutations and the corresponding index of dispersion *R _{T}*

_{,∞}by treating the forward evolutionary process. As described in the text immediately prior to Equation 3, our calculations describe the actual number of mutations accumulated during the evolutionary process, which may differ from the number of sequence differences relative to the ancestor if a single site undergoes multiple mutations. When , it is not

*a priori*obvious that the average number of mutations present in the population is equivalent to the number of fixed substitutions along the line of descent. Therefore, in the appendix, we show that identical results are obtained by tracing a randomly chosen protein backward in time along its ancestor distribution, proving that the treatment we give below is mathematically equivalent to treating the time-reversed process. We define as the column vector with element

*i*giving the fraction of the population at time

*t*that has accumulated

*m*mutations and is in stability bin

*i*. Once the population has reached the equilibrium distribution of stabilities, the time evolution of is(15)

The recursion can be solved to obtain(16)as can be verified by direct substitution. Since we are assuming that the population has equilibrated at time 0 and no mutations have accumulated at that time, is **x**_{∞} for *m* = 0 and 0 otherwise. Furthermore, **x**_{∞} satisfies Equation 14, so multiplying Equation 16 by **e** yields(17)where gives the fraction of the population that has accumulated *m* mutations after *t* generations. The average number of accumulated mutations after *T* generations is the mean of this binomial distribution,(18)

Using the well-known result for the variance of the binomial distribution, we find that the index of dispersion is(19)

It is important to reiterate that the above equation was derived under the assumption that there is at most one mutation per sequence per generation. For realistic distributions of mutations (*i.e.*, Poisson), this means that . In this regime, *R _{T}*

_{,∞}is close to 1.

#### Lattice protein simulations:

We tested our theory's predictions on the evolutionary dynamics of lattice proteins. Lattice proteins are simple protein models that are useful tools for studying protein folding and evolution (Chan and Bornberg-Bauer 2002). Our lattice proteins were chains of 20 amino acids that folded on a two-dimensional lattice. The energy of a lattice protein conformation was equal to the sum of the pairwise interactions between nonbonded amino acids (Miyazawa and Jernigan 1985). Each lattice protein has 41,889,578 possible conformations, and by summing over all of these conformations we could exactly determine the partition sum and calculate Δ*G*_{f}. We set a minimal stability threshold for the lattice proteins of , meaning that we considered all proteins that folded to the target structure with Δ*G*_{f} ≤ 0 to be folded and functional, while all proteins with Δ*G*_{f} > 0 were considered to be nonfunctional. We note that this stability threshold is equivalent to requiring a lattice protein to spend at least half of its time in the target native structure at equilibrium. We began by generating lattice proteins that stably folded to each of the three different structures shown in Figure 2. For each of these three proteins, we determined the distribution of ΔΔ*G* values for all 380 single mutations (these distributions are shown in Figure 2). These distributions were used to construct the matrix **W** and to predict the equilibrium distribution of stabilities, the average number of mutations, and the indexes of dispersion for both the and the cases, using the equations presented in the preceding sections.

To test the accuracy of these predictions, we then simulated evolving populations of the lattice proteins with a standard evolutionary algorithm using Wright–Fisher sampling. Briefly, the populations were held at a constant size of either *N* = 10 or *N* = 10^{5}. At each generation, a new population was created by choosing parents with equal probability from all folded proteins in the previous generation's population and copying these parents into the new population with a mutation rate of 5 × 10^{−4} mutations per residue per generation. Since the proteins have a length of 20 amino acids, this mutation rate corresponds to a per-protein-per-generation mutation rate of μ = 10^{−2}. Therefore, the product *N*μ is either 0.1 or 10^{3}, corresponding to or , respectively. We emphasize that the lattice protein evolutionary algorithm is the same for both population sizes. When *N* = 10 the population naturally follows dynamics approximating those presented for , while when *N* = 10^{5} it naturally follows dynamics approximating those presented for (as evidenced by the excellent agreement of the predictions with the simulations). For *N* = 10, we performed 1000 replicates for each different structure. For *N* = 10^{5}, computational constraints limited us to 10 replicates for each structure (however, the evolutionary dynamics are nearly deterministic in this case, so all replicates yielded similar results). We note that during the simulations we recorded the number of mutations that actually accumulated rather than simply computing the number of differences (Hamming distance) from the original sequence.

Figure 2 shows the theoretical predictions and simulation results for each of the three structures. The theoretical predictions are in good agreement with the simulation results. Figure 2 clearly shows that when , the proteins tend to be more stable than when . This extra stability is a biophysical manifestation of the neutrally evolved mutational robustness predicted by van Nimwegen *et al.* (1999). This increase in stability leads to a substantial increase in the number of accumulated mutations. In accordance with the theoretical predictions, when the index of dispersion is elevated above one by fluctuations in protein stability. Another clear result from the simulations is that proteins of different structure show markedly different distributions of stabilities and rates of sequence evolution due to the differences in their ΔΔ*G* distributions. Overall, the simulations offer strong support for the validity of the theoretical predictions in the preceding sections.

## DISCUSSION

We have presented a theory that offers quantitative predictions about the distribution of stabilities, the average number of fixed mutations, and the index of dispersion for an evolving protein population in terms of the ΔΔ*G* values for individual mutations. We have demonstrated that these predictions are accurate for simple lattice proteins and have used existing biophysical evidence to argue that the basic theoretical assumptions should also be accurate for real proteins. Here, we give qualitative interpretations of the mathematical results and discuss their implications for our understanding of protein evolution.

One major result is to show that the effects of protein structure on the rate of sequence evolution can be quantitatively cast in terms of the ΔΔ*G* values for single mutations. Numerous lattice protein simulations have shown that protein structure can dramatically affect the rate of sequence evolution, since structures that are more “designable” (encoded by more sequences) can evolve their sequences more rapidly (as can be seen in Figure 2) (Li *et al.* 1996; Govindarajan and Goldstein 1997; Bornberg-Bauer and Chan 1999; Tiana *et al.* 2000; Chan and Bornberg-Bauer 2002; Wingreen *et al.* 2004; Xia and Levitt 2004). Unfortunately, these simulations typically measure structural designability by enumerating a large number of lattice protein sequences, meaning that their findings cannot be extended to real proteins for which such extensive enumeration is impossible. However, recent theoretical work by England and Shakhnovich (2003) has made progress in connecting designability to observable structural properties, and a bioinformatics analysis based on this theoretical measure of designability indicates that structure indeed influences the evolutionary rate of real proteins (Bloom *et al.* 2006b). Our work provides a way to quantitatively relate the structural influences on protein evolution to experimentally measureable ΔΔ*G* values, opening the door to further connecting structural designability and sequence evolution to laboratory stability measurements. Although thousands of ΔΔ*G* values have been measured experimentally (Kumar *et al.* 2006), at present there are no large sets of measurements for truly random mutations to a single protein. When such sets of measurements become available, it should be possible to use them in conjunction with the theory that we have presented to predict the neutralities of real proteins with different structures.

A second important result is to show that protein evolutionary dynamics can depend on the product of population size and mutation rate, *N*μ. When , the evolving protein population is polymorphic in stability and subject to frequent mutations, so the more stable (and thus more mutationally tolerant) proteins produce more folded offspring. In contrast, when , the population is usually monomorphic in stability and so all members of the population are equally likely to produce folded offspring. The general tendency for populations to neutrally evolve mutational robustness when has previously been treated mathematically by van Nimwegen *et al.* (1999), and a variety of lattice protein simulations have noted the tendency of evolving protein populations to preferentially occupy highly connected neutral network nodes (Bornberg-Bauer and Chan 1999; Taverna and Goldstein 2002b; Xia and Levitt 2004). Our work shows that for proteins, in the limiting cases when or , this process can be rigorously described by considering only protein stability, rather than requiring a full analysis of the neutral network (provided, as we have argued is likely to be the case, that the assumption of a roughly constant ΔΔ*G* distribution holds for real proteins as well as it holds for our lattice proteins). In addition, we prove that the number of accumulated mutations depends on whether *N*μ ≪ 1 or ≫1. This finding is at odds with the standard prediction (Kimura 1987) of Kimura's neutral theory that the rate of evolution is independent of population size. The reason for this discrepancy is that the standard neutral theory fails to account for the possibility that increasing the population size so that can systematically increase the fraction of mutations that are neutral.

A third important contribution of our theory is to use the distribution of ΔΔ*G* values for single mutations to predict the distribution of protein stabilities in an evolving population. Several researchers have pointed out that evolved proteins will be marginally stable simply because most mutations are destabilizing (Arnold *et al.* 2001; Taverna and Goldstein 2002a); we have described this process quantitatively. In addition, we have shown how the neutral evolution of mutational robustness when will shift the proteins toward higher stabilities (as shown in Figure 2), although this increase in stability is limited by the counterbalancing pressure of predominantly destabilizing mutations. The formulas we provide can in principle be combined with experimentally measured ΔΔ*G* values to predict the expected range of stabilities for evolved proteins.

Our work also weds Takahata's concept that fluctuating neutral spaces might overdisperse the molecular clock (Takahata 1987; Cutler 2000b; Bastolla *et al.* 2002) to a concrete description of how protein neutrality fluctuates during evolution. When , fluctuations in protein stability can cause an overdispersion in the number of accumulated substitutions that can be calculated from the single-mutant ΔΔ*G* distribution. Furthermore, given our assumption of a roughly constant ΔΔ*G* distribution, we show that the index of dispersion will approach a constant value that is independent of time or mutation rate, but will depend on whether or . Previous simulations have indicated that overdispersion indeed depends on the population size (Bastolla *et al.* 2002; Wilke 2004)—we have explained this dependence by showing that stability-induced overdispersion does not occur when since the population's distribution of stabilities equilibrates as it spreads across many sequences. Mathematically, the difference in the cases and is that, assuming the ΔΔ*G* distribution remains relatively constant, when the population size is sufficiently large, the distribution of protein stabilities no longer fluctuates in a manner that influences the probability of a substitution (Equation 3 contains μ**V** in the first term on the right side, while Equation 15 does not).

In summary, we have presented a mathematical theory of how thermodynamics shape neutral protein evolution. A major strength of our theory is that it makes quantitative predictions using single-mutant ΔΔ*G* values, which can be experimentally measured. Our work also suggests how neutral and adaptive protein evolution may be coupled through protein thermodynamics. Protein stability represents an important hidden dimension in the evolution of new protein function, since extra stability that is itself neutral can allow a protein to tolerate mutations that confer new or improved functions (Bloom *et al.* 2006a). Our theory describes the dynamics of protein stability during neutral evolution—adaptive protein evolution is superimposed on these stability dynamics, with proteins most likely to acquire beneficial mutations when they are most stable.

## APPENDIX

Here we calculate the properties of the evolving population when by analyzing the time-reversed process to compute the mean and variation in the number of mutations in a single randomly chosen protein over time. We show that the results so obtained are identical to those found in the main text, where we analyzed the forward-time process to compute the mean and variation in the number of mutations across the population of evolving proteins.

When , the population is now never converged to a single sequence, so it is not *a priori* obvious that the average number of mutations present in the population is equivalent to the expected number of fixed substitutions along the line of descent. In fact, in the limit of very large population sizes there may not even be a common line of descent in relevant time frames, since many new mutations will occur before any given mutation goes to fixation. In the main text we calculated the average number of mutations , a sequence in the population that has accumulated over the last *T* generations by treating the forward evolution of the population. Here we trace a randomly chosen protein in the population back in time and show that the average number of substitutions that it has accumulated over the last *T* generations is equal to . We also show that indexes of dispersion of and have the same value of *R _{T}*

_{,∞}.

To calculate , we first define a vector **a** giving the ancestor distribution (Hermisson *et al.* 2002): element *i* of gives the probability that a randomly chosen sequence from the population at time *T* had a predecessor with stability in bin *i* at time *T* − *t*. The transition probabilities of when the population is in equilibrium are the discrete time analogue of those computed by Hermisson *et al.* (2002). From Equation 15, it follows that the fraction of sequences in bin *i* at time *t* + 1 that had as their ancestor in the previous generation a sequence in bin *j* is . To obtain the probability that a sequence in bin *i* at time *t* + 1 had an ancestor in bin *j*, we have to divide this fraction by the total number of sequences in bin *i* at time *t* + 1. When the population is at equilibrium, α_{t} = α and , where *x _{i}* is the element from

**x**

_{∞}. Hence, the probability that a sequence in bin

*i*had an ancestor in bin

*j*is , where we have defined(A1)

The time evolution of **a** is therefore(A2)where the matrix **H** is defined by Equation A1. Equation A2 can be solved to show that the equilibrium value of **a** is **a**_{∞}, satisfying(A3)

If we define as the vector with element *i* giving the probability that a randomly chosen sequence at time *T* had a predecessor at time *T* − *t* in stability bin *i* and with *s* substitutions relative to the sequence at time *T*, then the time evolution for an equilibrated population is(A4)

We can solve Equations A4 and A3 in a manner analogous to the forward process to obtain(A5)

Again defining as the probability of having accumulated *s* substitutions as one moves back *t* generations from time *T*, we obtain the binomial distribution(A6)

Comparison of Equation 17 and Equation A6 shows that they are identical. Therefore, all moments computed from the two distributions must be equal. In particular, this proves that = and that the corresponding indexes of dispersion have the same value of *R _{T}*

_{,∞}defined by Equation 19. This shows that when , we expect equivalent results regardless of whether we average over the number of mutations in all sequences present in the population or randomly choose a single sequence and trace back along its ancestor distribution.

## Acknowledgments

We thank Frances H. Arnold for helpful comments and discussion. J.D.B. is supported by a Howard Hughes Medical Institute predoctoral fellowship. C.O.W. is supported by the National Institutes of Health grant AI 065960. A.R. is supported by the National Science Foundation grants CCF 0523643 and FIBR 0527023.

## Footnotes

Communicating editor: N. S. Wingreen

- Received June 8, 2006.
- Accepted October 23, 2006.

- Copyright © 2007 by the Genetics Society of America