Gene duplication is arguably the most significant source of new functional genetic material. A better understanding of the processes that lead to the stable incorporation of gene duplications into the genome is important both because it relates to interspecific differences in genome composition and because it can shed light on why some classes of gene are more prone to duplication than others. Typically, models of gene duplication consider the periods before duplication, during the spread and fixation of a new duplicate, and following duplication as distinct phases without a common underlying selective environment. I consider a scenario where a gene that is initially expressed in multiple contexts can undergo mutations that alter its expression profile or its functional coding sequence. The selective regime that acts on the functional output of the allele copies carried by an individual is constant. If there is a potential selective benefit to having different coding sequences expressed in each context, then, regardless of the constraints on functional variation at the single-locus gene, the waiting time until a gene duplication is incorporated goes down as population size increases.
GENE duplication has long been viewed as a mechanism that promotes diversification of functional genes in the genome (Ohno 1970; Taylor and Raes 2004; Conant and Wolfe 2008). Simply stated, this view holds that once a gene locus has been duplicated, the pair of loci can go their own way without decreasing fitness. Whether the newly formed and subsequently diverged loci are then maintained over longer evolutionary periods obviously depends on the fitness costs of deleting one of the loci. While models of duplication universally agree on this point (Innan and Kondrashov 2010), they differ in their view of how the duplication itself originally spreads and how the loci then diverge. Here I develop models of the gene duplication process that consider the functional effects of mutations in coding or regulatory regions with consistent selection acting on variation before, during, and following duplication. By doing this I am able to compare the total waiting time for a gene to go from a single copy to a pair of stably maintained duplicates. I find that the waiting time until duplication depends on whether the net effect of selection on coding variants at the single-copy gene is stabilizing, the magnitude of the potential fitness gain from having diverged duplicate genes, and the population size. Regardless of the specific assumptions on the fitness effects of coding and regulatory mutations, I find that there typically are multiple routes to duplication that are driven by selection and therefore speed up as population size increases.
Previous models apply selection inconsistently
Three main modes for the adaptive maintenance of duplications have been proposed: neofunctionalization, subfunctionalization, and the divergence of multifunctional genes. The theoretical framework proposed for each of these modes suffers from a deep logical flaw: the effect of selection is applied when it is convenient to support the conclusion of the proposed mechanism.
Under the neofunctionalization model, the duplication becomes fixed by drift and one of the duplicate loci takes on a completely new function while the other maintains the old one (Force et al. 1999). This view assumes that somehow evolution is frozen before the random fixation of a duplication. A mutation that appeared in one of the alleles at the single-copy ancestor is just as likely to provide some incremental ability to perform a “new function” as a mutation in one of the alleles of a duplicate pair of loci. More generally, the allelic state (or states) of a single-copy gene is subject to the same selection pressures that operate at a pair of duplicate loci. If a change in the environmental circumstances of the species is posited, it would be very unlikely that this shift would happen at precisely the same time as the fixation of a duplication by drift. The set of mutationally accessible alleles determines the opportunity for neofunctionalization; it is not the fixation of a duplication that creates opportunities for beneficial mutations. Models of neofunctionalization artificially restrict the question of when mutation and selection are able to discover new functions to the postduplication phase (but see Walsh 2003).
Under the divergence of multifunctional genes model, a single locus is fixed for an allele that has multiple functions and duplication provides an opportunity for each locus to optimize a single function (Hughes 2005). The main proponents of this mode of duplication (Hughes 2005; Des Marais and Rausher 2008) have not relied on formal mathematical models (Innan and Kondrashov 2010), making it more difficult to draw conclusions about the generality and tempo of this process. This framework involves the assumption that the single-copy locus is fixed for an allele coding for a multifunctional protein that does not perform either of its functions optimally. If a duplication becomes fixed, then the two loci may diverge so that each locus is fixed for an allele that codes for a protein optimized on just one function. This framework again assumes that during the preduplication phase, evolution is frozen and no genetic variation is possible. In effect, it argues that duplications diverge because of multifunctional proteins but kicks back the question of how multifunctional genes arise. The problem with this view is that the evolution of multifunctionality and the divergence of multifunctional duplicate genes both depend on the relationship between the multiallelic genotype and fitness (e.g., Proulx and Phillips 2006). In particular, the fitness effects of allelic variation at a single-copy gene determine whether multifunctionality will evolve, and a subset of the conditions that promote multifunctionality also promotes the divergence of genes following duplication. The critical finding of Proulx and Phillips (2006) was that most parameter values that promote divergence of duplicate loci actually promote allelic divergence and the evolution of heterozygote advantage at the single-locus gene (i.e., before a duplication becomes fixed in the population).
Under the subfunctionalization model [and the duplication-degeneration-domplementation (DDC) model in particular], duplications are fixed by drift followed by the stochastic fixation of loss of subfunction mutations in each of the duplicate loci (Force et al. 1999, 2005; Lynch and Force 2000; Lynch et al. 2001; Walsh 2003). The starting point for these models is that existing loci have multiple functions, that duplication itself does not alter fitness, and that these functions can be partitioned into distinct subfunctions without any loss (or gain) of fitness. This assumption is often operationalized by considering genes with multiple cis-regulatory binding sites and assuming that the number of alleles expressed in a specific regulatory context has no effect on function (and therefore fitness). The particularity of this assumption is rarely discussed. Under these assumptions, alleles that have mutationally lost a subfunction can drift to fixation at either of the duplicate loci. The time for this to occur is quite sensitive to the assumptions, in particular that subfunctions are completely separable. If mutations that destroy one subfunction are even slightly disadvantageous as homozygotes, say because they also cause a change in the speed or variability of transcription initiation in the other context, then the waiting time until such mutations fix increases rapidly with population size. This mutation rate asymmetry can increase the waiting times well above those previously noted.
Further, the subfunctionalization framework assumes that multifunctionality arose in the distant past and that the forces selecting for multifunctionality have little to do with the postduplication process. Preduplication, however, selection must be acting on both the regulatory and coding regions of alleles. Whatever the physiological, developmental, or genetic factors are that determine preduplication evolution will determine how postduplication mutations in coding and regulatory alleles affect fitness. Even though the DDC model posits that duplications become stable because of a series of neutral substitutions, the postduplication evolution of the two loci will still be affected by both coding and further regulatory mutations. The subfunctionalization framework ignores the evolutionary process that makes subfunctionalization possible, but the preduplication process sets the conditions that allow (or do not allow) subfunctionalization to proceed. The subfunctionalization model can be taken seriously only if it is shown that the evolutionary processes that precede duplication tend to produce alleles whose mutational neighborhood fits the assumptions of the DDC model. The point is that evolution preceding duplication will determine whether mutations that delete transcription factor (TF) binding sites behave neutrally or do not.
Developing a consistent model for the entire duplication process
The overarching theme of this article is that evolution both before and after duplications arise is governed by the same biochemical, physiological, and developmental effects of changes in genotype. The same mechanisms that cause fitness to depend on the two alleles an individual carries at a single locus also act on individuals that carry four alleles. If changes in dosage affect fitness, then so too must changes in expression at a single locus. If a mutation altering one allele in a set of four affects fitness, so too will a mutation altering one allele in a set of two. Taking the simplifying view that the process of duplication can be broken up into independent phases is not just a benign approximation because it generates predictions that are qualitatively different from those made by a consistent model.
Our previous work considered changes in allele function related by a trade-off between performance in two distinct contexts. We showed that the conditions that favor the evolution of multifunctional genes, a necessary precursor to the multifunctional gene model, can lead to divergence of loci following duplication. Our results also showed that the same conditions that allow multifunctional genes to diverge postduplication are likely to cause divergence of alleles preduplication (termed allelic divergence) followed by the selectively driven spread of a duplication (Proulx and Phillips 2006). These results also demonstrate that allelic divergence and duplication can happen on much shorter timescales relative to the timing of subfunctionalization for all but the smallest population sizes. In this article, I extend this analysis to a scenario where both cis-regulatory and coding mutations occur.
Previous works by Force, Lynch, and co-workers (Force et al. 1999, 2005; Lynch and Force 2000; Lynch et al. 2001) have typically considered mutations that knock out regulatory regions and lead to loss of expression in one or more situations. These studies have followed the duplication and subfunctionalization process and allowed for future change in coding regions that could specialize the new duplications that have had their expression patterns subdivided. One argument for this rationale is that mutations removing binding sites for TFs are expected to be more common than mutations that cause conditionally advantageous change in coding sequence. However, subfunctionalization models work because duplications are able to fix, essentially by drift, in smaller populations. This process can require enormous amounts of time because it will be limited either by the waiting time to the appearance of a duplication that is destined to become fixed (1/duplication rate) or by the time that it takes for a duplication to spread when it is destined to fix (∼4Ne generations in diploids). If either of these waiting times is large, then the total time waiting for a duplication to fix via drift will be large.
What happens to populations during such long waiting times? Even if the rate at which adaptive mutations arise is relatively low, their spread through populations will be much quicker than fixation via drift. In this article I explore the waiting times for a set of alternative pathways that eventually lead to a population fixed for duplicate genes that have diverged both in regulatory and in coding sequence.
In this model there are two contexts and the focal gene can have promoter sites that induce expression in each context. The coding region of the gene may also experience mutation that can improve performance in one context while reducing performance in the other context. The structure of the fitness landscape determines whether mutations that alter the coding region can spread in the ancestral population, leading to the evolution of heterozygote advantage followed by the spread and fixation of gene duplicates. When the effect of altering the coding region alone is a net reduction in fitness, this direct pathway to duplication is selectively disfavored. However, an alternative pathway involving first the loss of expression in one context and then a mutation in the coding region is possible through a form of stochastic tunneling. Stochastic tunneling occurs when a segregating mutation gives rise to a beneficial secondary mutation that then fixes (Iwasa et al. 2004; Weissman et al. 2009; Proulx 2011). In addition, duplication events that result in alleles missing some fraction of the cis-regulatory region are circum-neutral and can therefore drift to high frequencies [genotypes are considered circum-neutral when all differences in their population genetic dynamics can be attributed to their genetic context rather than to direct effects on reproductive output (Proulx and Adler 2010)]. Coding mutations are then directly advantageous and rapidly spread in the population. I then compare the waiting times for the different possible pathways toward duplication to determine how the fitness landscape, mutational parameters, and population size determine the rate at which duplications are incorporated into genomes.
Many eukaryotic genes are regulated to be expressed in multiple contexts. In this article I consider two kinds of mutations, regulatory and coding. Mutations that alter the cis-regulatory sequence can cause the allele to be expressed only in one context, often called subfunctionalization (Force et al. 1999). I refer to these as subfunctionalizing mutations and use the subscripts si to denote an allele that is expressed only in context i.
Mutations in the protein-coding sequence that improve the function of the protein in context i are indicated by ci (note that I explain below why such mutations are expected to degrade the function of the protein in the other context). We can also refer to alleles with two mutations in a similar way, such as sicj. An allele that is expressed only in context i and has a coding mutation that improves function in context i is indicated with the shorthand sci (in other words, sc1 is the same as s1c1). I refer to such alleles as subspecialized. Haplotypes with a duplicate copy of the gene are indicated with a separator|. For example, a haplotype that has one copy of the ancestral allele and one copy of a subspecialized allele would be A|sc1. Alleles that are expressed only in the context in which they are deleterious are expected to be rapidly lost from the population and I do not track their frequencies in the analytical analysis (but they are included in the stochastic simulations). Figure 1 shows a schematic diagram of the mutational network, limited to alleles that are not unconditionally deleterious.
I consider scenarios where several coding mutations and regulatory mutations are accessible from the ancestral allele. Of course, the ancestral allele is just the most recently fixed allele in the population. Levins’ notion of a fitness set is particularly useful for describing the series of substitutions that can lead to our ancestral allele. On the basis of arguments developed in supporting information, File S1, I assume that mutations from the A allele exhibit antagonistic pleiotropy, because mutations that increase fitness in one context decrease fitness in the other context (see File S1, section 1).
The parameter space can be divided into two regions on the basis of the fitness effects of mutations that affect the coding sequence:
Allelic divergence: Specialized alleles can invade when rare and reach a deterministic equilibrium frequency. Even though there is still antagonistic pleiotropy, mutations near A are at a net advantage when heterozygous. Coding mutations near A are then maintained in the population and can create direct selection favoring duplications. This scenario additionally leads to selection to alter the regulatory region to create subspecialized alleles.
Net stabilizing coding selection: There is antagonistic pleiotropy that causes mutants that increase fitness in one context to be at a net disadvantage both as heterozygotes and as homozygotes. In this scenario, mutations that silence expression in one context act as recessive lethal (or recessive sick) mutations and can be stochastically maintained at appreciable frequencies. Secondary mutations produce alleles that are expressed only in the context to which their coding region is adapted. These alleles are actively maintained by selection and open the door to complementary mutations that specialize in the other contexts. Duplications are then directly advantageous and can spread due to selection.
In this section, I define a mechanistic model of fitness that allows dominance and epistasis to emerge without adding a large number of parameters. I follow the assumption that only the relative amount of expressed protein determines fitness (Proulx and Phillips 2006).
Context-specific fitness is assumed to be a function of the number and type of proteins expressed in each context. If only the ancestral protein is expressed, then context-specific fitness is assigned to be 1. Instead of assigning pairwise and three-way dominance, I assume that the ancestral protein provides an impulse to keep tissue-specific fitness at 1 that is scaled by a coefficient h (similar to dominance). Each specialized allele that is expressed in a given context provides either a positive or a negative impulse on fitness. This results in a model that describes interactions among nine allele states using only five parameters. The parameters describe the context-specific fitness of each protein-coding state (two coding states in two contexts giving four parameters) and the degree of dominance of the ancestral coding state. For simplicity I assume that fitness is 0 if no protein is expressed in either context and that there is no epistasis between contexts.
Using this framework, context-specific fitness is given by(1)where i = 0 represents the ancestral coding sequence, wi,κ represents the fitness component for protein i in context κ, Ei,κ represents the number of expressed alleles that code for protein i in context κ, and h relates to the dominance of the ancestral protein state. If h = 1, then the ancestral sequence is fully dominant, but if h = , then the ancestral coding sequence is codominant. This formulation is fairly flexible and can smoothly move between the conditions assumed in the standard DDC model to conditions where selection acts on coding changes. Because there is no epistasis, total fitness is simply Φ1Φ2. I write total fitness, W, as a function of the set of alleles that an individual carries. For example, W(A,c1) represents the fitness of an individual with one ancestral allele and one coding mutant allele.
Calculating approximate waiting times
I assume that ancestral populations are fixed for the A allele that is expressed in both contexts. The evolutionary process allows for mutations to both the coding and the regulatory regions, as well as knockout mutations that irrevocably silence the allele. For simplicity, I assume that each allele has the same knockout mutation rate.
Throughout this article I write the total number of haploid genomes as N and assume that Ne ≈ N. When Nμ ≪ 1 [the weak mutation assumption of Gillespie’s strong selection–weak mutation model (Gillespie 1991)], then the population is well described by the nonstochastic population genetic equilibria most of the time but occasionally transitions between states following the successful introduction of a new mutation. That is to say, without frequency-dependent selection we expect most populations to be monomorphic and with frequency dependence we expect the population to be near the frequency-dependent equilibrium. The population can change state if a mutation arises, is not lost when rare, and is deterministically maintained in the population. However, stochastic fluctuations in allele frequency are considered during the invasion of a new haplotype. This modeling framework is related to Gillespie’s strong selection–weak mutation formalism (Gillespie 1991) but makes allowances for situations with weak or frequency-dependent selection. Much inspiration was drawn from Hammerstein’s (1996) streetcar approach.
The steps that go into calculating the waiting time for each evolutionary transition are presented in more detail in File S1. Under the assumption that Nμ < the waiting time for a mutation that is favored when rare is simplywhere s is the difference between the relative fitness of the mutant and 1. I ignore the time required to approach population genetic equilibrium for alleles under selection because it is usually orders of magnitude smaller than the waiting time for the appearance of a successful mutation.
I simulated the full evolutionary process to observe evolutionary trajectories and to compare the waiting times until duplications become resident. The simulation was performed using Mathematica (code available, see File S2). I assumed constant population size where regulation occurred by exact culling of juveniles so that the number of adults is constant. The order of events was mating → selection → recombination → mutation → culling. The simulation was streamlined by tracking counts of haplotypes in the gamete stage and by calculating the total probability that each adult in the next generation would have a particular genotype. The distribution of haplotypes that contribute to the next generation is a composite of selection, mutation, and recombination and is expected to be multinomial distributed (Proulx 2000). By first calculating the multinomial coefficients the number of random variables drawn could be kept low so that simulations of large populations could still be performed in reasonable amounts of time.
Evolutionary Trajectories of Duplication
I analyze four different scenarios on the basis of fitness landscapes and the types of duplicating mutations considered. For each, I calculate the expected waiting time until a duplication is stably maintained and compare the results to stochastic simulations.
No coding selection
This scenario reflects the classic DDC assumption that there is no genetic variation for context-specific adaptation of the coding region. The double-recessive model commonly assumed in models of subfunctionalization is assumed.
By considering transitions between populations that are effectively monomorphic the waiting time for the DDC process to reach completion can be calculated (Lynch and Force 2000; Lynch et al. 2001; Walsh 2003; Force et al. 2005). To go from an ancestral state with a single locus expressed in two contexts to a population fixed for a pair of duplicate genes, each expressed in a single context, three population states must be visited. First a duplication must spread to fixation. Then, a mutation knocking out expression in one context must spread to fixation at one of the duplicate loci. If the duplication is lost before this second step, then the process must start over again. Once one gene copy has lost expression in one context, the locus that is expressed in both contexts can no longer be lost by drift. However, the gene copy that is expressed only in a single context may still be lost by drift, returning the population to be fixed for the A allele. Finally, a mutation knocking out expression in the alternative context must spread to fixation at the other gene copy. At this point the pair of gene copies is under strong selection to maintain function and the duplication is expected to be preserved. Because knockout mutations and drift can remove a duplicate gene just as easily as they can result in the fixation of a new duplication, most instances of this process will require many false starts to reach completion.
The transitions between the four possible states of the population can be described by a Markov transition matrix (see Force et al. 2005 for a similar approach). The population states are indexed on the basis of the haplotype fixed in the population: A (the ancestral allele present in a single copy), A|A (a haplotype carrying duplicate copies of the ancestral allele), s1|A (a haplotype with one copy of the ancestral allele and one copy of a subfunctionalized allele), and s1|s2 (a haplotype with complementary subfunctionalized alleles). For convenience I label the first subfunctional mutant to arise as s1, regardless of which context expression is lost in. Because each transition is a neutral substitution, the per generation probability that a new mutant destined for fixation arises is simply the rate of each type of mutation. The transition A|A → s1|A can happen by loss of one regulatory element at either locus (with probability 2μs), while the transition s1|A → s1|s2 requires the loss of a specific regulatory element at a specific gene copy (with probability μs/2):(2)The number of haplotypes in the population is defined as N and assumed for simplicity to be approximately equal to the effective number of haplotypes in the population. Using the fact that each neutral fixation takes an average of 2N generations, first-step analysis can be used to calculate the average waiting time until the DDC process is complete (Taylor and Karlin 1984) (see File S1, section 2.1 for the details of the calculation of waiting time). Assuming that μ = μd = μs and γ = μk/μ, then(3)where the 2N term represents the time spent during drift of mutations destined to fix and the second term represents the time waiting for mutations. For instance, if γ = 1, then the DDC process requires 15 neutral fixation events. The number of fixation events increases quadratically as γ increases. The waiting time under the pure DDC process is plotted for some sample mutation rates in Figure 2. Differences in the rate of silencing mutations can have just as large an effect on waiting time as differences in population size.
I simulated this process simply by setting the coding mutation rate to 0 in the full model (μc = 0). Figure 2 shows the predicted and observed mean waiting times until a stable duplication (i.e., s1|s2 or s2|s1) is maintained. The variance in waiting times is large, on the order of the square of the waiting time. When the mutation rates are low, the assumptions of the approximation are met and the fit is quite good.
However, the approximation breaks down as the mutation rate becomes large (Nμ ≫ 1) and overestimates the waiting times. To make these calculations, I have ignored the possibility that multiple mutations occur before the population becomes effectively fixed for a substitution involving only a single mutation. For instance, while the A|A haplotype is segregating one of the A copies could become subfunctionalized and then drift to fixation. This is a form of stochastic tunneling, but in this case it involves two mutations that are neutral. Weissman et al. (2009) developed techniques to determine when the stochastic tunneling regime can be applied and when deterministic models are better descriptors. In the DDC case each potential substitution is neutral, which can violate the assumptions of the tunneling models when Nμ ≫ 1. Unfortunately, neither the deterministic approximation nor the stochastic tunneling approximation applies in this regime and accurate estimates of the waiting times are not available. However, for biologically reasonable parameters the prediction of this model holds.
Proulx and Phillips (2006) showed that selection acting on function in two contexts can lead to the maintenance of alternate diverged alleles at a single-copy gene. This then creates selection for the spread of gene duplicates. While this process can be described by deterministic dynamics, there is still a stochastic component that will play a role in finite populations simply because of variance in the waiting times for mutations to appear and because adaptive mutations can be lost through drift when rare. Claessen et al. (2007) showed that evolutionary branching can have significant time lags before alternative genotypes are maintained. The total waiting time can be calculated as the average of the path-dependent waiting times weighted by the probability that each path is taken. However, the probability of taking a path is generally correlated with the waiting time, so that pathways involving shorter waiting times are much more likely to be taken. For each of the three main pathways for duplication under divergent coding selection, the waiting time decreases with increases in population size, mutation rate, and selection coefficient.
The waiting times for the pathways shown in Figure 3 are calculated in detail in File S1, section 2.2. For the three pathways they are(4)(5)(6)where px refers to the population genetic equilibrium frequency of haplotype x at the previous population state and sx refers to the selection coefficient for the rare mutant of type x. Note that the selection coefficients are also context dependent and may incorporate multiple genetic backgrounds that the focal haplotype may be found in.
The total waiting time until a stable duplication is maintained depends on how likely it is that each pathway will be taken. The difference between pathway P1 and P3 is the time at which the gene duplicates. If a duplication happens to occur before the subspecialized alleles arise, then we expect the process to proceed down P1 and otherwise move toward branch point B2. The route at branch point B2 depends on the fitness parameters. Pathway P2 is likely to occur only if the fitness of the heterozygote carrying alternate subspecialized alleles is high. Generally, P1 and P3 have similar waiting times because they depend on the same events but in different orders. Figure 3 shows the expected waiting time for a sample set of parameters. Because this process is largely driven by selection, the waiting time goes down as population size and the selection coefficients increase.
Simulations were used to check the accuracy of the waiting time calculations for small population sizes. Figure 3 shows the simulated waiting times when the value of μs was set to be much lower than μc. This decreases the likelihood that subfunctionalized mutants would appear first. Higher levels of μs result in waiting times that are shorter than predicted because they use paths that are considered in the next section. The calculations for pathways P1, P2, and P3 are upper bounds for the waiting time.
Net stabilizing coding selection
While the DDC process is characterized by neutral fixations and allelic divergence involves a series of events driven by selection, the process when there is net stabilizing coding selection combines elements of stochastic population genetics and selection-driven change. Starting from the ancestral population state where the A allele is fixed, mutations that alter either the coding region or the regulatory region are not favored when rare. Because coding changes are actively selected against (even when heterozygous), we expect them to remain at a low fluctuating frequency that depends on the mutation rate. Thus, eventual fixation of gene duplicates is unlikely to proceed through an intermediate stage of coding allele divergence.
Losses of context-specific expression, in contrast, behave as recessive lethal mutations. Such mutations are characterized by stochastic population genetic dynamics where their mean frequency increases with the square root of the mutation rate and with population size (Nei 1968; Crow and Kimura 1970; Robertson and Narain 1971). In effect, such mutations behave neutrally when rare but interfere with themselves when they become more common. This interference is stochastically exacerbated in small populations.
Because the square root of the mutation rate is much larger than the mutation rate itself, these recessive lethal mutants occur in large enough numbers to offer a significant opportunity for secondary mutations to arise and fix [i.e., through stochastic tunneling (Iwasa et al. 2004; Weissman et al. 2009; Proulx 2011)]. Here, this means that secondary mutations that alter the coding region arise from stochastically segregating loss of expression alleles and create subspecialized alleles. Subspecialized alleles are always beneficial when rare (i.e., as heterozygotes) but are assumed to be lethal as homozygotes (Figure 4).
Once subspecialized alleles arise, they are maintained at frequency-dependent equilibria (see Figure 5 for a sample simulation showing the sequence of substitutions). Once the subspecialized alleles are maintained, duplications of the subspecialized alleles are directly favored. They do not spread to fixation but reach a population genetic equilibrium. Recombination between subspecialized duplicate haplotypes and either the ancestral allele or the other subspecialized allele creates a haplotype that deterministically spreads to fixation. Each successive step in the sequence takes a smaller amount of time because the frequency of the haplotype that participates in the next step continues to increase, creating greater and greater opportunity for further adaptive mutations to arise and spread.
I consider three pathways to duplication under net stabilizing coding selection (Figure 4). In the first case, a subspecialized allele arises, duplicates, and recombines to create a stably maintained duplication. In the second and third cases, both subspecialized alleles become resident before either one is duplicated. The details of the calculation of the waiting times are presented in File S1, section 2.3. The total waiting time when all three pathways are considered is(7)The total waiting time goes down as both population size and the selection coefficients increase and agrees well with simulations (Figure 4).
Duplication with loss of regulatory regions
The molecular mechanisms responsible for gene duplication can result in a duplicate locus that does not include the full regulatory sequence and in some cases does not even include all exons. This process has been termed “partial duplication” and has been shown to be common in Caenorhabditis elegans (Katju and Lynch 2003). This means that a single mutational event sometimes creates a gene copy with altered expression. This is particularly interesting for the net stabilizing coding selection scenario because it opens up another pathway to the stable maintenance of a gene duplication.
The first step of this pathway involves the production of haplotypes carrying one ancestral allele and one subfunctionalized allele (i.e., A|s1, see Figure 6). This haplotype has the same direct fitness as the ancestral allele haplotype but does not behave neutrally because of its position in the mutational network [i.e., it is circum-neutral (Proulx and Adler 2010)]. Thus, a lineage founded by an A|s1 mutant can produce a significant probability of producing a secondary mutant before going extinct. This is known as stochastic tunneling, and the general expression for the probability of stochastic tunneling in a Wright–Fisher model was derived in Proulx (2011). The probability that a lineage of A|s1 mutants gives rise to an A|sc1 mutant that then is not lost iswhere μds is probability that allele A mutates into allele A|s1 or A|s2 and is the invasion selection coefficient for A|sc1 haplotypes in a population of all A alleles. Note that only half of the possible coding mutations result in subspecialized alleles. As a point of comparison, will be shorter than the waiting time for allele A|sc1 to drift to fixation (1/(μds)) so long as . This does not pose a particularly stringent condition, even though we already require Nμ < 1 for each type of mutation we consider.
Once a subspecialized duplication has been established in the population, mutations are favored that cause the ancestral allele to lose expression in the context that the subspecialized allele is expressed. Such mutations decrease the amount of interference that the subspecialized allele faces but do not reduce function in the other context. These can be followed by specialization of the coding sequence, giving the total waiting time of(8)Figure 6 shows how the expected waiting times decrease with increasing population size and selection coefficient and the agreement with simulations.
The goal of this article is to understand how alternative pathways toward gene duplication relate to each other and determine the total rate at which stable gene duplications are incorporated into the genome. My framework relies on a consistent view of how changes in gene expression and coding sequence determine the phenotypic output and organismal fitness. This is not to say that the fitness effects of mutational substitutions are expected to remain constant, only that such changes are not viewed as occurring only after a gene duplication has already become fixed.
The models considered here can be categorized on the basis of the type of selection that acts on changes in the coding sequence in the absence of related regulatory changes. I have shown that regardless of whether selection on the coding sequence is net stabilizing or leads to allelic divergence, increased population size and increased selection for context-specific alleles speed up the incorporation of duplications into the genome. This is because there are always routes toward gene duplication that are, at least in part, driven by selection. Many pathways can lead from an ancestral genotype to a maintained duplicate, and some of these pathways involve selection and therefore accelerate as both population size and the selection coefficients increase. Even if many possible pathways are unlikely to occur, because they either are selected against or require many fortuitous events, the presence of even a single adaptive pathway to duplication has a large impact on reducing the total waiting time.
Selection for multifunctional proteins can lead to allelic divergence followed by duplication, and most conditions that promote the origin of multifunctional proteins also create an adaptive pathway to gene duplication (Proulx and Phillips 2006). In the context of the current study, allelic divergence can lead to incorporation of gene duplications in relatively short periods of time (see Figure 3). Even for fairly weak selection, very low adaptive coding mutation rates, and moderately large population size (105) the adaptive duplication pathway is much faster than the DDC pathway. These pathways need not operate exclusively, however. If a duplication does drift to fixation or high frequency, subsequent coding mutations will be under positive selection and lead to the stable maintenance of the gene duplication.
The pattern is similar for net stabilizing coding selection but the selection coefficient and population size must be larger to achieve the same waiting time (see Figure 4). The main difference between the allelic divergence and net stabilizing coding selection regimes is that in the stabilizing regime the first adaptive step involves a form of evolutionary tunneling (Iwasa et al. 2004; Weissman et al. 2009; Proulx 2011). This step depends on a term involving the product of the mean frequency of subfunctionalized alleles under mutation–selection balance and the coding mutation rate. When population size is small, the DDC pathway is expected to dominate, but as population size increases the adaptive duplication pathways dominate. The overall pattern is expected to follow the minimum of these waiting times, so that the overall pattern is for waiting time to be flat for small populations but then drop off as population size becomes larger.
The rate of duplicate retention is dependent on the silencing rate for the DDC pathway, but not for adaptive pathways. If the silencing mutation rate (μk) is much larger than the other mutation rates, then the DDC waiting time can increase by orders of magnitude. This is a likely scenario when many possible coding and regulatory mutations knock out or completely disable gene function and this rate is expected to depend on both gene length and the structure of the gene in terms of intron number and UTR length (Lynch 2007). Under the DDC model, variation in gene structure can lead to substantial variation in the rate of duplicate retention that is equal to or larger than the variance in duplicate retention due to changes in population size alone. This effect can greatly increase the waiting time for stable maintenance of duplications compared with previous calculations of the waiting times for the DDC process.
Tandem gene duplications involve the replication of a chromosomal segment. A duplication that copies only part of the coding sequence is likely to produce a nonfunctional gene that will have a very low probability of ever mutating into a functional gene copy. On the other hand, a duplication or retrotransposition that copies only part of the regulatory region may create a functional gene that is expressed only in certain contexts. This can open up another adaptive path toward duplication where a coding mutation hits a segregating duplicate haplotype carrying A|s1. This occurs at a rate that involves the product of the duplication rate and the square root of the coding mutation rate. This tends to be faster than the pathway that first involves the acquisition of the subspecialized double mutant for two reasons. First, subfunctional alleles act as recessive lethals at single-copy genes. It has long been known that their frequency can be large compared with dominant deleterious alleles and that this effect depends on population size. In particular, in large populations their mean frequency approaches the square root of the subfunctionalization mutation rate. This is quite similar to the tunneling pathway involving duplicate haplotypes carrying a subfunctionalized allele, where the rate of tunneling is related to the square root of the coding mutation rate. However, in smaller populations the frequency of recessive lethals is significantly lower, so that even when coding and regulatory mutation rates are equal, the pathway starting with a duplication producing the A|s1 haplotype is faster.
Second, in both pathways a type of double mutant arises and the rate of the next step depends on the equilibrium frequency of the double mutant. This again falls out in favor of the pathway starting from A|s1 because the A|sc1 haplotype can spread to fixation, whereas the sc1 haplotype is under negative frequency-dependent selection and tends to maintain a low equilibrium frequency. Putting these together, pathways starting with the A|s1 haplotype can greatly accelerate the rate of duplicate incorporation even when the rate of duplications that also involve subfunctionalization is low (Figure 7).
Overall, the picture painted by this study is that adaptive processes are likely to be a component of most successful duplication events. When knockout mutations are included in models of the DDC process, I find that the waiting time until duplicate retention increases by orders of magnitude, calling into question the conclusion that typical multicellular eukaryote lineages experience population sizes amenable to the DDC process. Only in the exact scenario posited by the DDC model, where the potential for specialization of the gene toward specific tissues is absent or associated with very small selection coefficients, do we predict that adaptive routes toward duplication are unavailable. Because adaptive routes to duplication are present even under net stabilizing selection on coding regions, we expect duplication rates to increase with population size and selection strength. This creates an apparent paradox in that lineages with small effective population size have higher rates of gene duplication and lineages with enormous population size have lower rates of gene duplication.
This apparent paradox can be immediately resolved by noting that all known transitions to multicellularity produce a correlation between Ne, mating system, and internal tissue complexity. The dynamics of lethal alleles are critically related to the mating system. In species that have high rates of selfing, recessive lethal alleles are selected against even when rare because one-quarter of the offspring of individuals carrying a lethal allele will be homozygous for the lethal allele. In haploid asexual species, there is not even the possibility of the spread of lethal alleles, so subfunctional mutants (i.e., s1 and s2) are immediately selected against. Organisms that have multiple tissues that exhibit polyphenism or experience multiple distinct environments during a single life span will also have more opportunity for multifunctional proteins to evolve simply because there are more contexts that genes can become specialized for. This suggests that in addition to shifts in population size between basal eukaryotes and multicellular eukaryotes, changes in mating system and organismal complexity may have increased the rate of duplicate retention.
F. R. Adler is specially thanked for pointing out the ansatz for the Wright–Fisher tunneling problem and A. Yanchukov is specially thanked for a careful reading of a draft of this article. The comments of two anonymous reviewers contributed to both conceptual clarity and presentation of this work. This work was supported by National Science Foundation grant EF-0742582 (to S.R.P.).
Appendix: Probability of Loss of an In-Phase Duplication
When a duplicate haplotype first arises via a tandem duplication, it may consist of two copies of the same allele, termed an in-phase duplicate haplotype. The new duplicate haplotype can recombine with alleles at the original locus to create out-of-phase haplotypes. The relative fitness of an individual carrying the in-phase haplotype may be greater than or less than 1, while the relative fitness of an individual carrying the out-of-phase haplotype is >1 (fitness is relative to the mean fitness of the population in which the duplication arises). Define ω1 as the relative fitness of an individual carrying three copies of the same specialized allele [i.e., (c1, c1|c1)] and ω2 as the relative fitness of an individual carrying the out-of-phase haplotype [either (c1, c2|c1) or (c2, c2|c1), which are assumed by symmetry to be equal].
In many studies of the dynamics of gene duplicate evolution, the eigenvalue for the spread of the duplicate is derived and used as a measure of selection or fixation (Otto and Yong 2002; Proulx and Phillips 2006; Connallon and Clark 2011). Consider a population of haplotypes carrying the c1 and c2 alleles in which a duplication occurs creating a c1|c1 haplotype. The spread of the duplicate haplotype involves both c1|c1 (in-phase) haplotypes and c2|c1 (out-of-phase) haplotypes. This two-state transition matrix is
where p is the equilibrium frequency of the c1 allele in the absence of the duplicate haplotype. The dominant eigenvalue can be found by standard techniques and is(A2)The effective selection coefficient for the rare duplicate haplotype is . In the case where ω1 = ω2, this system reduces to a standard selection problem with the well-known result that the probability of nonloss of the invading haplotype is .
A contrasting approach is to directly calculate the probability of nonloss of the duplicate haplotype undergoing selection and recombination. This can be done using first-step analysis for the multitype branching process (Ross 1988). Assume that diploid adults produce a Poisson-distributed number of offspring and that the number that undergoes recombination is binomially distributed. Let D1 be the probability that a haplotype lineage starting with 1 copy of the in-phase duplication eventually goes extinct (that is, no haplotypes carrying either the in-phase or the out-of-phase duplication are left in the population). Likewise, let D2 be the probability that a haplotype lineage starting with 1 copy of the out-or-phase duplication eventually goes extinct. Because this is a branching process, the probability of eventual extinction of a set of duplicate haplotypes is simply the probability that the lineage produced by each individual goes extinct (see Proulx 2011 for a rigorous limit for Wright–Fisher populations). This givesThis can be simplified to give(A3)(A4)The classic result for fixation probability can be recovered if ω = ω1 = ω2 (which also implies that D = D1 = D2), giving an implicit formula for D asThis transcendental equation cannot be further simplified, but can be approximated when ω is slightly larger than 1 (Proulx 2011) to give D ≈ 2(ω − 1).
The joint solution to Equations A3 and A4 can be found numerically for specific values of ω1, ω2, r, and p. For the remainder of this discussion I assume that p = . Figure A1 shows the probabilities of nonloss of the two duplicate haplotypes and the probability of loss estimated from the eigenvalue.
In all cases, the probability of nonloss is <2(λ − 1), where λ is the eigenvalue. When r ≈ 0, the probability of loss is determined by the behavior of the in-phase duplicate. At r = the difference from the eigenvalue expectation is due to the probability that the initial in-phase duplication goes extinct immediately before any recombination is possible. Otherwise the eigenvector is immediately reached and the eigenvalue approximation for the probability of nonloss applies. Therefore, the probabilities of nonloss must interpolate between twice the mean fitness of rare in-phase haplotypes and 2(λ − 1).
The behavior of the system can be understood by considering three qualitatively different scenarios. In the first case, the mean fitness of the in-phase duplication is >1 [because (ω1 + ω2)/2 > 1]. Because we will be considering only cases where ω2 > 1 the eigenvalue is also always >1. In this case, when r = 0, then the probability of nonloss is simply 2((ω1 + ω2)/2 − 1). As r increases, the probability of nonloss monotonically increases toward 2(λ − 1) (Figure A1A). Interestingly, the eigenvalue approach is most deceiving when r is small. This case also applies when the in-phase duplication has mean relative fitness of 1.
In the second case the mean fitness of the in-phase duplication is <1 but the eigenvalue for r = is >1. Near r = 0 the probability of loss is close to 1, in contradiction to the eigenvalue result, which argues that the spread of the duplicate is fastest when r is small. However, the probability of nonloss rapidly increases in r and does reach a maximum value for intermediate r (Figure A1B). The probability of nonloss is always <2(λ − 1).
In the third case, both the mean fitness of the in-phase duplication is <1 and the eigenvalue at r = is <1. In this case, for large enough r, the probability that a rare duplicate haplotype is lost approaches 1. The probability of loss for a single copy of the out-of-phase duplicate and 2(λ − 1) are virtually identical. The probability of loss of the in-phase haplotype is 0 for small r, increases to a maximum value for intermediate r, and then decreases until it becomes 0 when the eigenvalue reaches 1.
Supporting information is available online at http://www.genetics.org/content/suppl/2011/12/05/genetics.111.135590.DC1.
Communicating editor: L. M. Wahl
- Received October 7, 2011.
- Accepted November 29, 2011.
- Copyright © 2012 by the Genetics Society of America