Abstract

Evolution of multigene families by gene duplication and subsequent diversification is analyzed assuming a haploid model without interchromosomal crossing over. Chromosomes with more different genes are assumed to have higher fitness. Advantageous and deleterious mutations and duplication/deletion also affect the evolution, as in previous studies. In addition, negative selection on the total number of genes (copy number selection) is incorporated in the model. First, a Markov chain approximation is used to obtain formulas for the average numbers of different alleles, genes without pseudogene mutations, and pseudogenes assuming that mutation rates and duplication/deletion rates are all very small. Computer simulation shows that the approximation works well if the products of population size with mutation and duplication/deletion rates are all small compared to 1. However, as they become large, the approximation underestimates gene numbers, especially the number of pseudogenes. Based on the approximation, the following was found: (1) Gene redundancy measured by the average number of redundant genes decreases as advantageous selection becomes stronger. (2) The number of different genes can be approximately described by a linear pure-birth process and thus has a coefficient of variation around 1. (3) The birth rate is an increasing function of population size without copy number selection, but not necessarily so otherwise. (4) Copy number selection drastically decreases the number of pseudogenes. Available data of mutation rates and duplication/deletion rates suggest much faster increases of gene numbers than those observed in the evolution of currently existing multigene families. Various explanations for this discrepancy are discussed based on our approximate analysis.

GENE duplication with subsequent diversification is considered to have played very important roles in the organismal evolution (Ohno  1970, 1988b, 1991). If a gene is duplicated, the selective constraint becomes less for the extra copy, and it can evolve to have a (slightly) different function, while the original function of the gene is kept in the other copy. Thus, gene duplication with subsequent diversification is one of the simplest ways to acquire a new function and is thought to have been employed many times during evolution. For example, Iwabe et al. (1996) suggested that gene duplications contributed to the compartmentalization of a cell and formation of multicellular organisms by creating genes expressed tissue specifically. Also, homeobox genes (Carroll 1995) and MCM1, AGAMOUS, DEFICIENS, and SRF (MADS)-box genes (Thiessen et al. 1996) were apparently created by ancient gene duplications. Gene duplications are not restricted to ancient times. Changes of gene number are observed in closely related species in many taxa, and even examples of polymorphisms with regard to copy number within species are known (e.g., Lyckegaard and Clark 1989; Takano et al. 1989; Lange et al. 1990; Neitz and Neitz 1995). Thus, gene duplications are still ongoing evolutionary processes. How such gene duplications and formation of multigene families occur at the population level is an interesting evolutionary problem.

Ohta (1987a,b, 1988a,b) theoretically studied evolution by gene duplication in haploid and diploid models using Monte Carlo simulation. She showed among others that positive selection is necessary to acquire genes with new functions and that the variance of the gene number is generally large. Also the ratio (R) of the number of genes with different functions to the number of pseudogenes was proposed to measure effects of positive selection, and an approximate formula to compute it was derived (Ohta 1987b),
R=u+v+v,
(1)
where v+ and v are rates of mutation to a different gene with a new function and of mutation to a pseudogene, respectively, and u+ is the fixation probability of an advantageous mutation introduced into a population. Later, Clark (1994) and Walsh (1995) studied different aspects of two gene systems assuming slightly different models. However, there are questions still to be answered. First, because Ohta's results (Ohta  1987a,b, 1988a,b) were mostly from simulation except for Equation 1 dependency of gene multiplication on parameters like population size, intensity of selection, mutation rate, and duplication/deletion rate was not so clear. As we do not have good estimates for most of these parameters, deriving analytical expressions describing gene multiplication processes is desirable to quantify speed of gene duplication and gene redundancy. Second, negative selection on the total number of genes per chromosome (copy number selection) was not considered in the previous works. A tendency for DNA loss is observed in Drosophila (Petrov et al. 1996), and gene silencing is observed when multiple copies are introduced into plants (Hollick et al. 1997). Thus, there is a possibility of the action of copy number selection. Third, the rate of gene duplication from a single gene and from multiple genes are considered to be different. However, this difference was ignored to simplify the model in Ohta's simulation.

In the present article, as an extension of Ohta's work, a haploid model of gene duplication is analyzed paying attention to the change of gene number. Copy number selection is explicitly incorporated into the model, and the duplication rate from a single gene was assumed to be 0. First, the model is described. Then, a Markov chain approximation of the model is derived, and its behavior is analyzed assuming low mutation and duplication/deletion rates. Results of extensive computer simulation to check the accuracy of the approximate analysis will be described. Approximate formulas for the rate of increase of the number of different genes and gene redundancy were obtained as a function of population size, mutation rates, duplication/deletion rates, and selection coefficients. Copy number selection was shown to be very effective in reducing the number of pseudogenes. Relationships of various haploid models also will be discussed.

MODEL

Definitions of symbols used are summarized in Table 1. A random-mating haploid population consisting of N chromosomes is assumed. Generations are discrete, and all chromosomes have two identical genes at generation zero. Each chromosome undergoes mutation, gene duplication/deletion, and random sampling with selection, in this order, in one generation.

We call a gene that never has had pseudogene mutation in its descent a live gene. Each live gene mutates to a new different allele or to a pseudogene with rates ua or up, respectively. Here, following the usage of Ohta (1988b), we use the word alleles to denote gene types, although alleles may not be at the same locus. Each new allele is assumed to be different from any other alleles preexisting in the population (Kimura and Crow 1964). Because fitness of a chromosome is a monotone increasing function of the number of different alleles, as described later, ua is the advantageous mutation rate. Once a gene becomes a pseudogene, it stays a pseudogene.

Next, if a chromosome has more than one gene, duplication

TABLE 1

Definitions of symbols frequently used in the text

SymbolMeaning
kNumber of different alleles in a chromosome
miNumber of copies with the ith allelic state
n1Number of live genes in a chromosome
npNumber of pseudogenes in a chromosome
ntNumber of total genes in a chromosome
NPopulation size
uaMutation rate to a different allele
upMutation rate to a pseudogene
γDuplication/deletion rate
saCoefficient for advantageous selection
sdSelection coefficient for copy number selection
f+Fixation probability of a chromosome with one more different allele
g+Fixation probability of a chromosome with one more gene
gFixation probability of a chromosome with one less gene
αRate of increase of the number of different alleles
p1(t)Probability of the population fixed by chromosomes with k = 1
p~1Probability of the population fixed by chromosomes with nt = 1
RRatio of the number of different alleles to the number of pseudogenes
SymbolMeaning
kNumber of different alleles in a chromosome
miNumber of copies with the ith allelic state
n1Number of live genes in a chromosome
npNumber of pseudogenes in a chromosome
ntNumber of total genes in a chromosome
NPopulation size
uaMutation rate to a different allele
upMutation rate to a pseudogene
γDuplication/deletion rate
saCoefficient for advantageous selection
sdSelection coefficient for copy number selection
f+Fixation probability of a chromosome with one more different allele
g+Fixation probability of a chromosome with one more gene
gFixation probability of a chromosome with one less gene
αRate of increase of the number of different alleles
p1(t)Probability of the population fixed by chromosomes with k = 1
p~1Probability of the population fixed by chromosomes with nt = 1
RRatio of the number of different alleles to the number of pseudogenes
TABLE 1

Definitions of symbols frequently used in the text

SymbolMeaning
kNumber of different alleles in a chromosome
miNumber of copies with the ith allelic state
n1Number of live genes in a chromosome
npNumber of pseudogenes in a chromosome
ntNumber of total genes in a chromosome
NPopulation size
uaMutation rate to a different allele
upMutation rate to a pseudogene
γDuplication/deletion rate
saCoefficient for advantageous selection
sdSelection coefficient for copy number selection
f+Fixation probability of a chromosome with one more different allele
g+Fixation probability of a chromosome with one more gene
gFixation probability of a chromosome with one less gene
αRate of increase of the number of different alleles
p1(t)Probability of the population fixed by chromosomes with k = 1
p~1Probability of the population fixed by chromosomes with nt = 1
RRatio of the number of different alleles to the number of pseudogenes
SymbolMeaning
kNumber of different alleles in a chromosome
miNumber of copies with the ith allelic state
n1Number of live genes in a chromosome
npNumber of pseudogenes in a chromosome
ntNumber of total genes in a chromosome
NPopulation size
uaMutation rate to a different allele
upMutation rate to a pseudogene
γDuplication/deletion rate
saCoefficient for advantageous selection
sdSelection coefficient for copy number selection
f+Fixation probability of a chromosome with one more different allele
g+Fixation probability of a chromosome with one more gene
gFixation probability of a chromosome with one less gene
αRate of increase of the number of different alleles
p1(t)Probability of the population fixed by chromosomes with k = 1
p~1Probability of the population fixed by chromosomes with nt = 1
RRatio of the number of different alleles to the number of pseudogenes

or deletion occurs with a rate γ/2 per gene copy, respectively. If a chromosome has just one gene, no change occurs. Here, we are considering sister-chromatid unequal crossing over as a genetic mechanism for copy number change. If the mechanism for copy number change is duplicative transposition, copy number change can occur even in chromosomes with a single copy.

Finally, N chromosomes of the next generation are sampled from the N chromosomes of the present generation with probabilities proportional to fitness of each chromosome. Let k and nt be the numbers of different alleles and all genes in a chromosome, respectively. The fitness of the chromosome is zero if k = 0. Otherwise, we consider the following three fitness functions, w(k,nt).

  1. Model I (Ohta 1987b),
    w(k,nt)={exp(sdnt)(kk¯)exp[sa(k¯k)sdnt](k<k¯)},
    (2)
    where k¯ is the average number of different alleles per chromosome in the population, and sa and sd(sa, sd > 0) are positive and negative selection coefficients, respectively.
  2. Model II (Walsh 1995),
    w(k,nt)=exp[saksdnt].
    (3)
  3. Model III,
    w(k,nt)={exp[sa(kk¯)sdnt](k>k¯)exp(sdnt)(kk¯)}.
    (4)
    Models I and II are modifications of those by Ohta (1987b) and Walsh (1995), respectively, with addition of deleterious effects determined by the total number of genes (copy number selection). As will be noted in the next section, models I, II, and III correspond to the recessive, semidominant, and dominant advantageous mutation models, respectively, in regard to the increase of the number of different alleles. We mainly show results for Model I. Differences among the models will be discussed where relevant.

Random sampling with selection completes one generation. This cycle is repeated many times. At generation zero, the population starts with all chromosomes having two live genes (k = 1, nl = 2, np = 0), and all genes are identical as mentioned before.

Homologous equal crossing over does not occur in the present haploid model. Thus, this is a single locus infinite-allele model. For large populations, the diffusion approximation can be applied, and, in this approximation, relevant parameters are Nua, Nup, Nγ, Nsa, and Nsd measuring time in units of N (Crow and Kimura 1970).

Because the evolution of the model seems very complex, we first analyze a Markov chain whose behavior approximates that of our haploid model under a certain condition. The validity of the approximation will be examined by computer simulation later.

MARKOV CHAIN APPROXIMATION

We assume that Nua, Nup, and Nγ are all very small compared to 1. Also, we assume that positive selection is fairly strong, so that the number of different alleles never decreases. Under these conditions, the population is expected to be mostly monomorphic and thus represented by one chromosome type. A mutant chromosome type that is to replace the currently dominant chromosome type in the population is expected to be fixed quickly. For characterization of a chromosome, let k be the number of different alleles in the chromosome as in the previous section. The chromosome is characterized by k numbers, m1,⋯,mk, each representing the number of live genes with the same allelic state and the number of pseudogenes, np. Thus, in this approximation, the population state is characterized by a vector (m1,⋯,mk,np). The number of live genes, nl, is expressed as n1=Σi=1kmi, and the total number, nt, is expressed as nt = nl + np. Occasionally, transitions among states occur. Because we assume Nua, Nup, Nγ ≪ 1, one of the following types of changes occurs in a chromosome, and the chromosome is fixed in the population with the probabilities specified below:

  1. A live gene in the ith class with mi > 1 has an advantageous mutation. The rate of this event is uamiNf+, where f+ is the fixation probability of a mutant chromosome with one more different allele than other chromosomes in the population. Depending on the models of selection, f+ has a different form and is approximately expressed as,
    f+={(4sa)(πN)(Model I, Ohta1987b)2sa(1exp(2Nsa))(Model II, Walsh1995)F(1N)F(1)(Model III)}
    (5)
    where F(p)=0pexp[Nsax(2x)]dx (see Crow and Kimura 1970). Note that the fixation probabilities in models I, II, and III correspond to those of recessive, semidominant, and dominant advantageous mutations, respectively. The derivation for Model III can be done in the same way as that in Model I by Ohta (1987b).
  2. A live gene in the ith class with mi > 1 becomes a pseudogene. Because such change is selectively neutral with mi > 1, the rate for the ith class genes to become a pseudogene is miup.

  3. The number of live genes of the ith class increases by gene duplication. The rate for this event is miγNg+/2, where g+ is the fixation probability of a deleterious mutant with a selection coefficient −sd and is expressed as (Kimura 1962),
    g+=exp(2sd)1exp(2Nsd)1.
    (6)
  4. The number of live genes in the ith allelic class decreases by deletion. The rate for this event in miγNg/2, where g is expressed as,
    g=1exp(2sd)1exp(2Nsd).
    (7)
  5. The number of pseudogenes increases by gene duplication. The rate of increase is npγNg+/2.

  6. The number of pseudogenes decreases by deletion. The rate of decrease is npγNg/2.

The above transition probabilities determine a Markov chain. Because we assumed strong selection (SS) and weak mutation, duplication/deletion (WMD), we call the result of this approximation the SSWMD limit. This type of approximation has been widely used to analyze complex genetic systems (Gillespie 1983; Walsh 1985; Li 1987). In this approximation, further reduction of the number of relevant parameters occurs. If we measure time in units of γ−1, ua/γ and up/γ, in addition to Nsa and Nsd, determine the behavior of the process. The process is still a multidimensional Markov chain and is difficult to analyze directly. So we take two approaches. One is computer simulation of the Markov chain and another is further approximations for studying some

Figure 1.

The probability, p1(t), that the number of different alleles is 1 at t. All combinations of Nsa = 5, 10, 40, and sd = 0, 0.01 sa are used for selection coefficients. Other parameters are ua = 0.1γ and up = γ.

aspects of the process. For each parameter set, 10,000 replications were made in the simulation.

Because k is an increasing function of time with strong positive selection, k is always > 1 once k becomes more than 1. We call the phases with k = 1 and k > 1 the first and second phases, respectively. In the first phase, the total number of genes, nt, can become 1, and then no change of the gene number occurs thereafter. For this reason, the number of pseudogenes, np, affects the evolution of k and nt. On the other hand, in the second phase, nl is always larger than 1, and change of the number of genes never stops. The evolution of k and nl does not depend on np. Therefore, these two phases are analyzed separately.

In the first phase, the population starts with k = 1, nt = 2, and we are interested in when the population enters the second phase (k > 1). Let p1(t) be the probability that k is 1 at time t. Because evolution in this phase is influenced by np, p1(t) is difficult to compute. We examined p1(t) by simulation of the Markov chain. Figure 1 shows the time-dependent behavior of p1(t). Solid and broken lines represent results for cases without (sd = 0) and with (sd = 0.01 sa) copy number selection, respectively, and time is measured in units of 1/γ generations. The mutation rates in this example are 10ua = up = γ. The probability p1(t) decreases quickly in the first few 1/γ generations, and then the change becomes small. In the later generations, nt is 1 in most populations with k = 1, and no changes occur in these populations. That is why p1(t) does not change in the later generations. Positive selection decreases the probability, but very weak copy number selection can increase the probability. The effect of copy number selection is stronger for the cases with stronger positive selection.

Ultimately, the population moves to either the state with nt = 1 or that with k > 1. The ultimate probability, p1(∞), was estimated by simulation. Because p1(t) is a monotone decreasing function of t and bounded from the lower side by the probability p~1(t) of nt being 1, simulations are continued until p1(t) − p~1(t) < 0.01 so that we can estimate p1(∞) with a maximum error of 0.01. The results are shown in Table 2. The ratio ua/γ and effects of selection are important in determining p1(∞). For small ua/γ, most populations become ultimately fixed by chromosomes with a single copy. The ratio up/ua has little effect probably because the duplication/deletion force is strong. As ua/γ becomes larger, p1(∞) decreases and the effects of pseudogene mutation become larger. The effect of copy number selection is always significant even though sd/sa = 0.01 in the examples.

In the second phase, we are interested in the evolution of k, nl, and np. We first investigate the ratio Q = nl/k, the average number of genes with the same allelic state and a measure of strict gene redundancy. Because the duplication/deletion process does not stop with nt > 1, the evolution of the number of genes with the ith allelic state, mi (we call this the ith allelic class) can be decoupled from np or other mj's in this approximation. Here, we first concentrate on the evolution of one allelic class and denote the number of live genes in the class by m.

Consider the following process: At the start, m is one. Transition from m = 1 to m = 2 occurs at the rate a = γNg+/2 by gene duplication. For m > 1, the process moves to m − 1, m + 1, or 1. The transition to 1 is added because we want to know the average Q over all allelic classes, and for this purpose, the process moves to a new allelic class with probability one-half if a different allele is created from this allelic class. The transition to m − 1 occurs by deletion, advantageous mutation, or pseudogene mutation, and its rate is denoted by mb. Because we stay in the same allelic class in one-half of the cases when an advantageous mutation occurs, b = up + N(uaf+/2 + γg/2). The transition to m + 1 occurs by duplication and its rate is ma. Finally, the transition rate to 1 is mc with c = Nuaf+/2. Let qm(t) be the probability of the state being m at time t. Then, qm(t)'s satisfy the following system of differential equations:
dq1dt=aq1+2bq2+cΣi=1iqi
(8)
 
dqmdt=m(a+b+c)qm+(m1)aqm1+(m+1)bqm+1(m2).
(9)
With the generating function h(z,t)=Σm=1qmzm, the system of differential equations is expressed by
h(z,t)t=b(1a+b+cbz+ba)hz+(bq1+cMq)zbq1,
(10)
where Mq=Σm=1mqm is the mean number. Although the time-dependent solution of this equation is difficult
TABLE 2

The ultimate probability, p1(∞), of the population being fixed by chromosomes with one copy

Nsa = 5Nsa = 10Nsa = 40
uaup/uasd = 0sd = sa/100sd = 0sd = sa/100sd = 0sd = sa/100
0.0250.730.770.690.760.590.79
100.730.780.700.760.600.80
0.150.540.580.490.550.380.54
100.580.620.530.600.420.61
0.550.400.440.360.400.250.37
100.490.530.430.500.330.50
Nsa = 5Nsa = 10Nsa = 40
uaup/uasd = 0sd = sa/100sd = 0sd = sa/100sd = 0sd = sa/100
0.0250.730.770.690.760.590.79
100.730.780.700.760.600.80
0.150.540.580.490.550.380.54
100.580.620.530.600.420.61
0.550.400.440.360.400.250.37
100.490.530.430.500.330.50
TABLE 2

The ultimate probability, p1(∞), of the population being fixed by chromosomes with one copy

Nsa = 5Nsa = 10Nsa = 40
uaup/uasd = 0sd = sa/100sd = 0sd = sa/100sd = 0sd = sa/100
0.0250.730.770.690.760.590.79
100.730.780.700.760.600.80
0.150.540.580.490.550.380.54
100.580.620.530.600.420.61
0.550.400.440.360.400.250.37
100.490.530.430.500.330.50
Nsa = 5Nsa = 10Nsa = 40
uaup/uasd = 0sd = sa/100sd = 0sd = sa/100sd = 0sd = sa/100
0.0250.730.770.690.760.590.79
100.730.780.700.760.600.80
0.150.540.580.490.550.380.54
100.580.620.530.600.420.61
0.550.400.440.360.400.250.37
100.490.530.430.500.330.50
to obtain because of q1 and Mq in the right-hand side, we can compute the equilibrium solution h(z) by noting ∞ and qm ≤ 1 for all n. The resulting solution is
h(z)=q1rln(1rz),
(11)
where
q1=rln(1r)
(12)
 
r=2aa+b+c+(a+b+c)24ab.
(13)
The equilibrium distribution of m is easily obtained by expanding the generating function. The mean number Mq is also computed from h(z) and represented as
Mq=bracln(1r).
(14)
It is intuitively clear that increasing b or c decreases and increasing a increases Mq, but quantitatively how it changes is fairly complex, as expressed by (14).

Because Mq is the mean number of live genes in one allelic class, its value might be close to the ratio Q = nl/k, although Mq computed above is the equilibrium value. We can check how well Mq can approximate Q by simulation. Because Q is 1 in populations with nt = 1, we computed Q conditioned on nt > 1 from the simulation at t = 10/γ (Table 3). As shown in Table 3, Q estimated from the simulation and Mq computed from (14) agree well. Thus, it seems that the distribution of m is close to equilibrium at t = 10/γ. As Nsa or ua becomes larger, the redundancy measured by Mq becomes less because both b and c increase. Increasing up increases b and thus decreases Mq. Introduction of the copy number selection also decreases Mq. Only when ua/γ is very small, is high genetic redundancy observed, but otherwise the genetic redundancy is very small.

Next, we consider the number (k) of different alleles. Because a new allele is created from a redundant copy, its production rate is proportional to the number of redundant genes and uaNf+. Thus, in the second phase, k(t) is expected to be approximated by a pure-birth process with a rate of (Mq − q1)uaNf+. Define k*(t) = E[k(t),k(∞) > 1] as the expectation of k over the event that k eventually enters the second phase. The relationship between k*(t) and E[k(t)] is represented by
E[k(t)]=p1()+k(t).
(15)
As noted above, because k(t) is a pure-birth process in the second phase, k*(t) takes the form
k(t)exp[αt],
(16)
where α = (Mq − q1)uaNf+. Thus, we can write the expectation of k as
E[k(t)]=p1()+C×(1p1())exp[αt].
(17)
To check the accuracy of this expression based on the pure-birth process approximation, the values computed from it and E[k(t)] obtained from the Markov chain simulation were compared (Figure 2). To obtain C, the simulation data were fitted by the curve represented by (17). The simulation data are represented by open symbols, and the fitted curve is shown by solid lines in Figure 2. The agreement seems very good except for the very intial phase. Thus, in the second phase, k(t) evolves approximately as a pure-birth process defined as above. Note that the fitted curves are >1 at t = 0. This is probably because we started with a redundant copy (m1 = 2). In this curve fit, we regard that a certain amount of new alleles are created at t = 0.
Finally, we consider the number of pseudogenes, np. The average rate of production of new pseudogenes per allelic class is again proportional to the number of redundant genes and represented by β1 = (Mq − q1)up. Once a new pseudogene is created, its number can change by duplication/deletion. The rate of decrease is β2 = (g − g+)Nγ/2. Thus, the expected number of pseudogenes may take the following form,
E[np(t)]~0tβ1k(s)exp[β2(ts)]ds.
(18)
However, recall that the estimated E[k(t)] by (17) is not 1 at t = 0. Because pseudogenes are produced from redundant genes as new alleles but with a different rate
TABLE 3

Observed and expected redundancy, Q = nl/k

Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0252.0381.9531.9121.8371.6391.616
101.7721.7131.6721.6501.5311.511
0.151.2821.2781.2451.2451.1791.176
101.1851.1821.1631.1671.1341.132
0.010.0251.8621.7921.6341.6031.2271.226
101.6261.6071.5001.4841.2061.200
0.151.2571.2541.2061.2051.0971.094
101.1721.1681.1451.1431.0751.074
Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0252.0381.9531.9121.8371.6391.616
101.7721.7131.6721.6501.5311.511
0.151.2821.2781.2451.2451.1791.176
101.1851.1821.1631.1671.1341.132
0.010.0251.8621.7921.6341.6031.2271.226
101.6261.6071.5001.4841.2061.200
0.151.2571.2541.2061.2051.0971.094
101.1721.1681.1451.1431.0751.074
a

conditioned on nt > 1 at t = 10/7 are computed from the Markov chain simulation.

b

Mq's computed from Equation 14 are shown.

TABLE 3

Observed and expected redundancy, Q = nl/k

Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0252.0381.9531.9121.8371.6391.616
101.7721.7131.6721.6501.5311.511
0.151.2821.2781.2451.2451.1791.176
101.1851.1821.1631.1671.1341.132
0.010.0251.8621.7921.6341.6031.2271.226
101.6261.6071.5001.4841.2061.200
0.151.2571.2541.2061.2051.0971.094
101.1721.1681.1451.1431.0751.074
Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0252.0381.9531.9121.8371.6391.616
101.7721.7131.6721.6501.5311.511
0.151.2821.2781.2451.2451.1791.176
101.1851.1821.1631.1671.1341.132
0.010.0251.8621.7921.6341.6031.2271.226
101.6261.6071.5001.4841.2061.200
0.151.2571.2541.2061.2051.0971.094
101.1721.1681.1451.1431.0751.074
a

conditioned on nt > 1 at t = 10/7 are computed from the Markov chain simulation.

b

Mq's computed from Equation 14 are shown.

β1 per allelic class, we need to add the following term due to pseudogenes created at t = 0:
β1α(k(0)(1p1()))exp[β2t].
(19)
The term in parentheses is the number of new alleles created at t = 0 in the curve fit mentioned above, and the ratio of the number of pseudogenes to that of new alleles when they are created is β1/α (Ohta 1987b). Combining (18) and (19), we obtain
E[np(t)]=β1[1α+β2+β2α(α+β2)exp[(α+β2)t]]k(t)β1(1p1())αexp[β2t].
(20)
Although the computation of this formula requires p1(∞), if we want an expectation conditioned on the
Figure 2.

Comparison of the number of different alleles obtained by Markov chain simulation and that expected under a linear pure-birth process. Simulation results are represented by ○, Nsa = 5; □, Nsa = 10; and ♦, Nsa = 40. The curve fits are obtained assuming Equation 17. See text for more detail. Other parameters are ua = 0.1γ and up = γ.

event of k > 1, it can be computed from the conditional expectation of k(t). Thus, this formula can be used to examine the relationship between conditional expectations of k and np. We again checked the accuracy of this equation with the values obtained by the Markov chain simulation, and the results are shown in Table 4. The agreement is generally very good. Without copy number selection (sd = 0), β2 = 0, and if we note k*(t) − (1 − p1(∞)) = E[k(t)] − 1 [see (15)], Equation 20 is simplified to
E[np(t)]=β1α(E[k(t)]1),
(21)
which is equivalent to (1) obtained by Ohta (1987b). With the copy number selection, the following expression is obtained for β2t ≫ 1,
E[np]β1α+β2k.
(22)
Thus, if γ that increases β2 linearly is large, E[np]/k* becomes small. This can be seen from Table 4, where even with weak copy number selection (sd = sa/100), the number of pseudogenes is drastically reduced.

SIMULATION

To examine the validity and limitation of the SSWMD limit obtained by the Markov chain approximation and also to investigate parameter ranges outside that assumed in the approximation, we carried out computer simulation closely following the model. We call this type of simulation the Wright-Fisher simulation. In the simulation, N chromosomes are prepared. In each chromosome, k, m1, …, mk, np and its fitness are stored. Numbers of mutations and duplication/deletion events are determined by generating Poisson random numbers with means computed by multiplying the number of genes and respective rates. Genes undergoing mutations or duplication/deletion are chosen randomly. After these,

TABLE 4

Observed and expected number of pseudogenes

Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0251.1191.1021.2371.2281.4691.490
101.6501.6671.8401.8222.3122.308
0.153.8663.8854.6654.6786.7756.779
104.0934.1274.6794.7346.7976.869
0.010.0250.7070.6680.4510.4490.0640.064
101.0110.9800.7470.7120.1010.104
0.152.6792.6362.1762.2020.5000.519
102.7862.7532.2282.2060.5830.578
Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0251.1191.1021.2371.2281.4691.490
101.6501.6671.8401.8222.3122.308
0.153.8663.8854.6654.6786.7756.779
104.0934.1274.6794.7346.7976.869
0.010.0250.7070.6680.4510.4490.0640.064
101.0110.9800.7470.7120.1010.104
0.152.6792.6362.1762.2020.5000.519
102.7862.7532.2282.2060.5830.578

Values at t = 10/7 are shown.

a

E[np] computed from the Markov chain simulation.

b

Values computed from Equation 20 are shown.

TABLE 4

Observed and expected number of pseudogenes

Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0251.1191.1021.2371.2281.4691.490
101.6501.6671.8401.8222.3122.308
0.153.8663.8854.6654.6786.7756.779
104.0934.1274.6794.7346.7976.869
0.010.0250.7070.6680.4510.4490.0640.064
101.0110.9800.7470.7120.1010.104
0.152.6792.6362.1762.2020.5000.519
102.7862.7532.2282.2060.5830.578
Nsa = 5Nsa = 10Nsa = 40
Sd/sauaup/uaObs.aExp.bObs.aExp.bObs.aExp.b
0.00.0251.1191.1021.2371.2281.4691.490
101.6501.6671.8401.8222.3122.308
0.153.8663.8854.6654.6786.7756.779
104.0934.1274.6794.7346.7976.869
0.010.0250.7070.6680.4510.4490.0640.064
101.0110.9800.7470.7120.1010.104
0.152.6792.6362.1762.2020.5000.519
102.7862.7532.2282.2060.5830.578

Values at t = 10/7 are shown.

a

E[np] computed from the Markov chain simulation.

b

Values computed from Equation 20 are shown.

the fitness of each chromosome is determined according to (2), (3), or (4) and is divided by the maximum fitness of the current population. The next generation is constructed in the following way: First, choose a chromosome randomly. Next, draw a uniform random number in [0,1) and, if the fitness of the chosen chromosome is less than the random number, add it to the population of the next generation. This is repeated until the chromosome number becomes N. For each parameter set, 1600 replications are made. We present the simulation results in comparison with the SSWMD limit.

First, we examine time courses of the average k, nl, np and the proportion p~1(t) of populations with nt = 1 when Nγ, Nup, and Nua are small and agreement between the Markov chain simulation and Wright-Fisher simulation is expected. Some examples are shown in Figure 3. The values computed by the Markov chain simulation are represented by solid lines and those by the Wright-Fisher simulation are represented by dotted lines. Parameters used are γ = up = 0.001, ua = 0.0001, sa = 0.1, sd = 0, and N = 100 (Figure 3A) and N = 400 (Figure 3B). With N = 100, the Markov chain approximation is very close to the Wright-Fisher simulation. For this parameter set, Nγ = Nup = 0.01 and Nua = 0.001. However, with N = 400, the Markov chain approximation underestimates expected values of k, np, nl by 10 to 20%, especially in the later generations, although they seem to increase exponenetially. The largest discrepancy is found in np. As the number of genes per chromosome increases, the population becomes polymorphic because product parameters such as nlNγ become large. This may be the reason for the discrepancies. Estimates of p~1(t) from the two simulations agree well.

To examine the numbers of genes in a wider parameter range, more simulations were carried out, and some of the results are shown in Table 5 (without copy number selection) and Table 6 (with copy number selection). To clarify the effect of increasing γ, ua, and up, cases with γ = 0.0001 and 0.001 are shown along with the SSWMD limit keeping the ua/γ and up/γ constant. Means and variances of gene numbers at t = 10/γ are shown. With ua/γ = 0.02 and N = 100, values for γ = 0.001 and 0.0001 are quite close to the SSWMD limit. As population size increases (N = 400) or ua/γ approaches 0.1, the values with a large γ (0.001) deviate from the SSWMD limit. Especially, the number of pseudogenes becomes large when γ is large, decreasing the ratio of the number of different alleles to those of pseudogenes. With larger N or ua, populations become polymorphic with regard to gene numbers, thus breaking the assumption of the Markov chain approximation. The increase of pseudogenes is probably due to competition among advantageous alleles and resulting reduction of the probability of fixation of new different alleles. The trend becomes more pronounced when both N and ua are larger (ua/γ = 0.1, N = 400). In this parameter set, the numbers of different alleles and live genes are also larger than the SSWMD limit. This is especially true when there is copy number selection. In the Markov chain approximation, fixations of new alleles are assumed to occur after fixations of redundant copies. However, in polymorphic populations, new, different alleles may appear before fixation of redundant copies. This may increase the fixation rate of new genes, especially when there is copy number selection and fixations of redundant copies are difficult to make occur.

The number of different alleles increases as population size increases without copy number selection. However, this is not necessarily true with copy number selection. With ua/γ = 0.02, the average numbers of different alleles are almost the same for N = 100 and 400. The variance of k is generally very large and in the order of the square of the mean when the mean is large. As ua increases, keeping ua/up constant, E[k] always increases.

Thus far, we investigated only cases with fairly strong selection (Nsa = 10, 40), where one of the assumptions of the SSWMD limit is satisfied. We also investigated when selection is weak, and the result is shown in Table 7.

Figure 3.

Comparison of the expected numbers of genes obtained by the Markov chain simulation and Wright-Fisher simulation. Two population sizes are used. A, N = 100; B, N = 400. The average numbers of different alleles (k), live genes (nl), pseudogenes (np), and the proportion of populations with nt = 1 (p1) are plotted against generation. Values obtained by the Markov chain simulation (M) are represented by closed symbols with solid lines and those by the Wright-Fisher simulation (W) are represented by open symbols with broken lines. Other parameters are γ = up = 0.0001, ua = 0.00001, sa = 0.1, sd = 0.

The result of reducing γ to 0.0001 was indistinguishable from that of γ = 0.001 and is not shown. The average number of different alleles is very close to 1, although pseudogenes are created. Thus, for smaller populations, new alleles are difficult to evolve. Copy number selection has very little effect on E[k] and slight effects on E[np]. One notable difference from the strong selection case is that E[k] decreases as ua increases, keeping ua/up constant. The effect of increasing up overrides that of increasing ua if Nsa is of the order of 1. Also note that the variance of np is very large. Some populations are considered to have a very large number of pseudogenes, whereas nt = 1 and np = 0 in other populations.

DISCUSSION

In the present article, evolution of multigene families by gene duplication was analyzed assuming a haploid population and no interchromosomal recombination. In such a population, redundant copies are created first, and then one of them evolves to become a new gene by adaptive mutation. This system differs from those in which selection directly operates on the number of gene copies (Ohta 1983; Stephan 1986) in that acquiring adaptive genes requires the presence of redundant genes (Ohta 1987a). To analyze this system, we introduced the Markov chain approximation for small mutation and duplication/deletion rates under which it is assumed that each population is fixed by one chromosome type and that evolution occurs by occasional jumps among chromosome types (the SSWMD limit). By this approximation, the extent of gene redundancy was calculated, and the evolution of the average number of different alleles was characterized. Effects of polymorphism expected under higher mutation and duplication/deletion rates were analyzed using computer simulation. The most significant departure from the SSWMD limit with the same ratios of the rates is the increase of the number of pseudogenes. Also, polymorphism elevates the number of different alleles compared to that computed by the SSWMD limit, especially when there is negative selection on the copy number. With these caveats in mind, we consider some of the questions concerning the evolution of multigene families based on the SSWMD limit.

First, note that in the SSWMD limit, if we measure time in units of γ−1, the behavior of the process is characterized by the scaled parameters ua=uaγ and up=upγ, in addition to Nsa and Nsd. Thus, if we keep these parameters constant, the birth rate α is proportional to γ. Let α* = α/γ and consider a simple case of no copy number selection (sd = 0). In this case, α* is expressed as
α=(1(2up+2d+1)r)log(1r),
(23)
where
d=Nf+ua,r=11+up+d+(d+up)2+d.
Here, d is proportional to both ua and Nf+. Two limiting cases are instructive. If d ≪ 1, Equation 23 reduces to
αd2up(1+2up)log(1+12up).
(24)
This shows that α* increases linearly as d increases and that the rate is a decreasing function of up. Thus, if up is small, α* quickly increases as d increases either by an increase of N or ua. However, the increase is saturated, and for d ≫ 1,
αd2(d+up).
(25)
TABLE 5

Numbers of different alleles (k), live genes (n1), and pseudogenes (np)

kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.6111.6992.3018.0341.79917.7370.6738
0.00011.6812.0572.3678.3831.82718.4590.6681
M1.6502.0012.3068.1551.84018.3120.6748
4000.0012.7808.0204.50932.2883.12134.5730.5825
0.00012.7369.1013.89626.0512.27420.1820.6025
M2.6478.6083.73523.9552.31221.7450.5960
0.11000.0012.4914.8332.9398.0825.75373.9990.4894
0.00012.7297.9323.09511.5014.90662.1770.5181
M2.6897.8013.04311.3394.67954.5120.5159
4000.0016.23624.8837.74141.11812.974195.6750.3294
0.00016.62445.7137.59562.8248.187105.1840.3894
M5.90243.5396.63958.0456.79786.0130.4155
kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.6111.6992.3018.0341.79917.7370.6738
0.00011.6812.0572.3678.3831.82718.4590.6681
M1.6502.0012.3068.1551.84018.3120.6748
4000.0012.7808.0204.50932.2883.12134.5730.5825
0.00012.7369.1013.89626.0512.27420.1820.6025
M2.6478.6083.73523.9552.31221.7450.5960
0.11000.0012.4914.8332.9398.0825.75373.9990.4894
0.00012.7297.9323.09511.5014.90662.1770.5181
M2.6897.8013.04311.3394.67954.5120.5159
4000.0016.23624.8837.74141.11812.974195.6750.3294
0.00016.62445.7137.59562.8248.187105.1840.3894
M5.90243.5396.63958.0456.79786.0130.4155

Values at t = 10/7 are shown. Other parameters are sa = 0.1, sd = 0, and up/ua = 10. Number of replications for each parameter set is 1600.

a

M denotes the values obtained by the Markov chain simulation.

TABLE 5

Numbers of different alleles (k), live genes (n1), and pseudogenes (np)

kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.6111.6992.3018.0341.79917.7370.6738
0.00011.6812.0572.3678.3831.82718.4590.6681
M1.6502.0012.3068.1551.84018.3120.6748
4000.0012.7808.0204.50932.2883.12134.5730.5825
0.00012.7369.1013.89626.0512.27420.1820.6025
M2.6478.6083.73523.9552.31221.7450.5960
0.11000.0012.4914.8332.9398.0825.75373.9990.4894
0.00012.7297.9323.09511.5014.90662.1770.5181
M2.6897.8013.04311.3394.67954.5120.5159
4000.0016.23624.8837.74141.11812.974195.6750.3294
0.00016.62445.7137.59562.8248.187105.1840.3894
M5.90243.5396.63958.0456.79786.0130.4155
kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.6111.6992.3018.0341.79917.7370.6738
0.00011.6812.0572.3678.3831.82718.4590.6681
M1.6502.0012.3068.1551.84018.3120.6748
4000.0012.7808.0204.50932.2883.12134.5730.5825
0.00012.7369.1013.89626.0512.27420.1820.6025
M2.6478.6083.73523.9552.31221.7450.5960
0.11000.0012.4914.8332.9398.0825.75373.9990.4894
0.00012.7297.9323.09511.5014.90662.1770.5181
M2.6897.8013.04311.3394.67954.5120.5159
4000.0016.23624.8837.74141.11812.974195.6750.3294
0.00016.62445.7137.59562.8248.187105.1840.3894
M5.90243.5396.63958.0456.79786.0130.4155

Values at t = 10/7 are shown. Other parameters are sa = 0.1, sd = 0, and up/ua = 10. Number of replications for each parameter set is 1600.

a

M denotes the values obtained by the Markov chain simulation.

The increase rate with d = ∞ is γ/2 per generation as expected, but the approach to it is affected by the pseudogene mutation rate. Between these two extreme cases, α* was numerically shown to be an increasing function of d in the range 0.01 ≤ up/γ ≤ 10 (data not shown). Thus, for example, the rate of increase of gene number is larger in larger populations with sd = 0. However, this is not always true when there is copy number selection (sd ≠ 0), as shown in Figure 4. In these cases, as N becomes larger, α* first increases and then decreases. This is because the effect of copy number selection also becomes larger in larger populations. However, as noted in the previous section, the SSWMD limit underestimates the number of different alleles with copy number selection as the change rates (ua, up, and uγ) become larger. Thus, the decrease of the rate as N increases is not as pronounced, as Figure 4 shows in more realistic situations. Indeed, when ua = 10−5, up = 10−4, γ = 10−4, sa = 0.1, and sd = 0.002, corresponding to set 1 in Figure 4, the average numbers of different alleles at the 10/γth generation were 2.048 ± 0.0103 for N = 200 and 1.896 ± 0.081 for N = 800. In summary, the number of different alleles increases as population size becomes large without copy number selection, but this is not necessarily so with copy number selection.

Next, we examine the speed of gene duplication. We measure it by the doubling time of E[k]. It is expressed by log2/[(Mq − q1)uaNf+] and depends on ua, up, γ, Nsa, and Nsd. We do not have reliable estimates for most of the parameters, but some rough calculation under

TABLE 6

Numbers of different alleles (k), live genes (nl), and pseudogenes (np) in copy number selection

kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.4191.1471.7924.1610.7744.6890.7625
0.00011.4070.9781.7403.3320.7344.6000.7612
M1.4521.1801.8053.8340.7474.5290.7464
4000.0011.6932.3062.0815.7060.2940.8270.7556
0.00011.4841.2971.6482.3880.1160.2330.7856
M1.4361.1651.5672.0550.1010.2320.8021
0.11000.0012.2784.3002.6317.0333.01125.7290.5694
0.00012.2834.8082.5416.7642.17216.8090.5894
M2.2665.1112.5087.2512.22816.9770.5914
4000.0013.98213.5214.67520.6952.71615.9390.4869
0.00013.11411.4803.37614.4380.8272.7290.5869
M2.6437.6182.7959.1150.5831.7650.6091
kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.4191.1471.7924.1610.7744.6890.7625
0.00011.4070.9781.7403.3320.7344.6000.7612
M1.4521.1801.8053.8340.7474.5290.7464
4000.0011.6932.3062.0815.7060.2940.8270.7556
0.00011.4841.2971.6482.3880.1160.2330.7856
M1.4361.1651.5672.0550.1010.2320.8021
0.11000.0012.2784.3002.6317.0333.01125.7290.5694
0.00012.2834.8082.5416.7642.17216.8090.5894
M2.2665.1112.5087.2512.22816.9770.5914
4000.0013.98213.5214.67520.6952.71615.9390.4869
0.00013.11411.4803.37614.4380.8272.7290.5869
M2.6437.6182.7959.1150.5831.7650.6091

Values at t = 10/γ are shown. Other parameters are sa = 0.1, sd = 0.01, and up/ua = 10. Number or replications for each parameter set is 1600.

a

M denotes the values obtained by the Markov chain simulation.

TABLE 6

Numbers of different alleles (k), live genes (nl), and pseudogenes (np) in copy number selection

kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.4191.1471.7924.1610.7744.6890.7625
0.00011.4070.9781.7403.3320.7344.6000.7612
M1.4521.1801.8053.8340.7474.5290.7464
4000.0011.6932.3062.0815.7060.2940.8270.7556
0.00011.4841.2971.6482.3880.1160.2330.7856
M1.4361.1651.5672.0550.1010.2320.8021
0.11000.0012.2784.3002.6317.0333.01125.7290.5694
0.00012.2834.8082.5416.7642.17216.8090.5894
M2.2665.1112.5087.2512.22816.9770.5914
4000.0013.98213.5214.67520.6952.71615.9390.4869
0.00013.11411.4803.37614.4380.8272.7290.5869
M2.6437.6182.7959.1150.5831.7650.6091
kn1np
uaNγaMeanVar.MeanVar.MeanVar.p~1
0.021000.0011.4191.1471.7924.1610.7744.6890.7625
0.00011.4070.9781.7403.3320.7344.6000.7612
M1.4521.1801.8053.8340.7474.5290.7464
4000.0011.6932.3062.0815.7060.2940.8270.7556
0.00011.4841.2971.6482.3880.1160.2330.7856
M1.4361.1651.5672.0550.1010.2320.8021
0.11000.0012.2784.3002.6317.0333.01125.7290.5694
0.00012.2834.8082.5416.7642.17216.8090.5894
M2.2665.1112.5087.2512.22816.9770.5914
4000.0013.98213.5214.67520.6952.71615.9390.4869
0.00013.11411.4803.37614.4380.8272.7290.5869
M2.6437.6182.7959.1150.5831.7650.6091

Values at t = 10/γ are shown. Other parameters are sa = 0.1, sd = 0.01, and up/ua = 10. Number or replications for each parameter set is 1600.

a

M denotes the values obtained by the Markov chain simulation.

TABLE 7

Numbers of different alleles (k), live genes (nl), and pseudogenes (np) with weak selection

knlnp
sd/sauaMeanVar.MeanVar.MeanVar.p~1
0.00.021.1300.2611.4772.5261.41113.7110.7719
0.11.0680.0871.1580.3132.96733.5610.6656
0.010.021.1450.3011.4392.1051.28914.1550.7863
0.11.0560.0761.1260.2442.24523.9010.7087
knlnp
sd/sauaMeanVar.MeanVar.MeanVar.p~1
0.00.021.1300.2611.4772.5261.41113.7110.7719
0.11.0680.0871.1580.3132.96733.5610.6656
0.010.021.1450.3011.4392.1051.28914.1550.7863
0.11.0560.0761.1260.2442.24523.9010.7087

Values at t = 10/γ are shown. Other parameters are N = 25, sa = 0.1 and up/ua = 10. Number of replications for each parameter set is 1600.

TABLE 7

Numbers of different alleles (k), live genes (nl), and pseudogenes (np) with weak selection

knlnp
sd/sauaMeanVar.MeanVar.MeanVar.p~1
0.00.021.1300.2611.4772.5261.41113.7110.7719
0.11.0680.0871.1580.3132.96733.5610.6656
0.010.021.1450.3011.4392.1051.28914.1550.7863
0.11.0560.0761.1260.2442.24523.9010.7087
knlnp
sd/sauaMeanVar.MeanVar.MeanVar.p~1
0.00.021.1300.2611.4772.5261.41113.7110.7719
0.11.0680.0871.1580.3132.96733.5610.6656
0.010.021.1450.3011.4392.1051.28914.1550.7863
0.11.0560.0761.1260.2442.24523.9010.7087

Values at t = 10/γ are shown. Other parameters are N = 25, sa = 0.1 and up/ua = 10. Number of replications for each parameter set is 1600.

this model is possible. For up, we may use the null mutation rate measured in Drosophila by Mukai and Cockerham (1977). Averaging over four loci encoding various enzymes, they obtained 1.03 × 10−5 for the null mutation rate per loci. Estimates of unequal crossing-over rates vary widely. For example, in Drosophila, the duplication rates from a single gene at the ma and ry loci are estimated to be 2.7 × 10−6 and 1.7 × 10−4, respectively (Shapira and Finnerty 1986). In maize, the rate per copy is estimated to be in the range 10−4 − 10−3 (Sudpak et al. 1993; Eggelston et al. 1995). Here, we tentatively assume γ = 10−4, taking an intermediate value. Currently, we do not have any estimates for the advantageous mutation rate ua or selection parameters sa, sd, but to get a rough idea, let us assume ua = 10−6, employing the bandmorph mutation rate 1.03 × 10−6, estimated by Mukai and Cockerham (1977) in Drosophila. With Nsa = 40 and no copy number selection, the doubling time is 8 × 104 generations. Because d* ≪ 1, the doubling time is shown to be roughly proportional to (uaNf+γ)−1 from (24), and reducing ua to one-tenth makes the time about 10 times greater. At any rate, gene

Figure 4.

Examples of the rate, α, of increase of the number of different alleles with copy number selection as a function of population size. α/γ is computed from the SSWMD limit as a function of population size. Two examples are shown. Set 1, ua = 0.1γ, up = γ; Set 2, ua = 0.02γ, up = 0.2γ. Other parameters are sa = 0.1, sd = 0.002.

duplications can occur very quickly under the present model from the view point of the evolutionary time scale. However, according to Fryxell (1996), the doubling time is of the order of 10 to 100 million years in a particular lineage of a gene family. Clegg et al. (1997) surveyed three multigene families in plants and estimated new gene recruitment rates to be 2.9 × 10−8 for Adh, 2.8 × 10−7 for Chs, and 2.7 × 10−7 for rbcS. Because explosive gene multiplications are not observed in the evolution of multigene families, Fryxell (1996) proposed that simultaneous coevolution of two or more gene families might be necessary for gene multiplication in respective families, which will diminish duplication rates significantly. However, there are other explanations. For example, Nsa might not be constant. If Nsa becomes close to 1 or less, the number of different alleles stays near 1 as shown by Ohta (1987a,b) and by the present study. Another explanation is copy number selection. If we introduce weak copy number selection (sd = 0.01sa), the doubling times in the above examples become 2 × 106 (ua = 10−7) and 2 × 105 (ua = 10−6) generations, making the times much longer. Thus, copy number selection might have contributed to retarding the duplication rates. Copy number selection might also explain the general observation of larger multiplicity of genes in plants and mammals compared to those in, for example, Drosophila and lower eukaryotes. The numbers of rDNA genes in plants are very large compared to those in Drosophila and seem to contain many pseudogenes (Buckler et al. 1997). The number of pseudogenes is expected to be very small under strong copy number selection (see Tables 4 and 6).

We usually observe large variation in gene numbers in one multigene family among related species. For example, in Drosophila amylase genes, the numbers of genes are two (Payant et al. 1988; Shibata and Yamazaki 1995), one to three (Brown et al. 1990), and at least four (Da  Lage et al. 1992) in D. melanogaster, D. pseudoobscura, and D. ananassae, respectively. Such variation can be explained by difference in parameters such as sa and N. However, even with the same parameter set, Clark (1994) showed that multiple equilibrium points concerning gene configuration are possible in two gene systems. Moreover, the variance of gene number is large, an emphasized by Ohta (1987b). As shown here, this is because the increase of gene number behaves like a pure linear-birth process. The coefficient of variation of such a process is asymptotically expressed as 1n0, where n0 is the initial number (see p. 159 of Cox and Miller 1965). In our case, a further increase of the variance occurs because the gene number remains 1 in a significant proportion of lineages. Thus, we should expect a very large variance of gene numbers in a multigene family.

In the present article, as a fitness function, Ohta's model (Model I; Ohta 1987b) was mostly used. In the Markov chain approximation, the difference in fitness function contributes only through difference in the fixation probability of a chromosome with one more different allele than other chromosomes in the population. As noted, models I, II, and III correspond to recessive, semidominant, and dominant advantageous mutation models in this regard. Thus, the increase in the number of different alleles is more difficult in this order because a mutant chromosome appears singly at first. However, if we consider a decrease of the number of different alleles by chance, models I, II, and III correspond to dominant, semidominant, and recessive deleterious mutant models. Therefore, although it is most difficult to increase the number of different alleles in Model I, it is also most difficult to decrease it. Dependency on parameters also differes among the models. The fixation rate d = uaNf+/γ of different alleles is proportional to Nsa, in Model I and Nsa in Model II (Walsh's model; Walsh 1995). Thus, with the same Nsa > 1, the number of different alleles becomes larger under Model II. For d ≪ 1, α* is proportional to d as shown in (24). Indeed, for Nua = 0.04, Nup = Nγ = 0.4, Nsa = 40, and sd = 0, the Wright-Fisher-type simulation shows that the average number of different alleles at t = 25N was 150 in Model II, whereas it was 6 in Model I. Thus, the selection in Model I is the least efficient in increasing gene number.

We assumed that no duplication/deletion occurs if the total number, nt, of genes becomes 1 because the duplication rate is much smaller if there is only a single copy. We divided the process into the first and second phases, depending on whether k is 1 or more than 1. In the second phase, the evolution is the same as that in Ohta's model (Ohta 1987b). The analysis of the first phase shows that even if Nsa is fairly large, nearly half or more of populations starting from chromosomes with two copies ends up in single gene chromosomes unless ua/γ is not small (Table 2). Thus, we expect to find lineages with only a single gene even if gene number is large in other lineages.

From our approximate analysis, the gene duplication processes in a haploid model without interchromosomal crossing over were characterized, and dependencies of the speed of gene duplication, genetic redundancy, and the number of pseudogenes on various parameters are clarified for strong selection cases. With the accumulation of data on gene families, various discussions on the mechanisms of the evolution of multigene families have been made (e.g., Fryxell 1996; Clegg et al. 1997). Our results will provide a quantitative framework for such discussions by incorporating gene duplication and subsequent divergence into the model. However, higher organisms are mostly diploid, and interchromosomal crossing over does occur. The diploidy and interchromosomal crossing over not included here are known to accelerate the increase of gene number (Ohta 1988a). Incorporating them into the model and analyzing the model in more detail would be necessary for a better understanding of the evolution of multigene families.

Acknowledgement

We thank T. Ohta and M. Iizuka, A. Clark, and an anonymous reviewer for their helpful comments on earlier drafts of the manuscript. This research was partially supported by grants-in-aid to H.T. from the Ministry of Education, Science and Culture of Japan.

Footnotes

Communicating editor: A. G. Clark

LITERATURE CITED

Brown
C J
,
Aquadro
C F
,
Anderson
W W
,
1990
 
DNA sequence evolution of the amylase multigene family in Drosophila pseudoobscura
.
Genetics
 
126
:
131
138
.

Buckler
E S
IV,
Ippolito
A
,
Holtsford
T P
,
1997
 
The evolution ribosomal DNA: divergent paralogues and phylogenetic implications
.
Genetics
 
145
:
821
832
.

Carroll
S B
,
1995
 
Homeotic genes and the evolution of arthropods and chordates
.
Nature
 
376
:
479
485
.

Clark
A G
,
1994
 
Invasion and maintenance of a gene duplication
.
Proc. Natl. Acad. Sci. USA
 
91
:
2950
02954
.

Clegg
M T
,
Cummings
M P
,
Durmbin
M L
,
1997
 
The evolution of plant nuclear genes
.
Proc. Natl. Acad. Sci. USA
 
94
:
7791
7798
.

Cox
D R
,
Miller
H D
,
1965
 
The Theory of Stochastic Processes
.
Chapman and Hall
,
London
.

Crow
J F
,
Kimura
M
,
1970
 
An Introduction to Population Genetic Theory
.
Harper & Row
,
New York
.

Da Lage
J L
,
Lemeunier
M L
,
Cariou
M L
,
David
J R
,
1992
 
Multiple amylase genes in Drosophila ananassae and related species
.
Genet. Res.
 
59
:
85
92
.

Eggleston
W B
,
Alleman
M
,
Kermicle
J L
,
1995
 
Molecular organization and germinal instability of R-stippled maize
.
Genetics
 
141
:
347
360
.

Fryxell
K J
,
1996
 
The coevolution of gene family trees
.
Trends Genet.
 
12
:
364
369
.

Gillespie
J H
,
1983
 
Some properties of finite populations experiencing strong selection and weak mutation
.
Am. Nat.
 
121
:
691
708
.

Hollick
J B
,
Dorweiler
J E
,
Chandler
V L
,
1997
 
Paramutation and related allelic interactions
.
Trends Genet.
 
13
:
302
308
.

Iwabe
N
,
Kuma
K
,
Miyata
T
,
1996
 
Evolution of gene families and relationship with organismal evolution: rapid divergence of tissue-specific genes in the early evolution of chordates
.
Mol. Biol. Evol.
 
13
:
483
493
.

Kimura
K
,
1962
 
On the probability of fixation of mutant genes in a population
.
Genetics
 
47
:
713
719
.

Kimura
M
,
Crow
J F
,
1964
 
The number of alleles that can be maintained in a finite population
.
Genetics
 
49
:
725
738
.

Lange
B W
,
Langley
C H
,
Stephan
W
,
1990
 
Molecular evolution of Drosophila metallothionein genes
.
Genetics
 
126
:
921
932
.

Li
W H
,
1987
 
Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons
.
J. Mol. Evol.
 
24
:
337
345
.

Lyckegaard
E M S
,
Clark
A G
,
1989
 
Ribosomal DNA and stellate gene copy number variation on the Y chromosome of Drosophila melanogaster
.
Proc. Natl. Acad. Sci. USA
 
86
:
1944
1984
.

Mukai
T
,
Cockerham
C C
,
1977
 
Spontaneous mutation rates at enzyme loci in Drosophila melanogaster
.
Proc. Natl. Acad. Sci. USA
 
74
:
2514
2517
.

Neitz
M
,
Neitz
J
,
1995
 
Numbers and ratios of visual pigment genes for normal red-green color vision
.
Science
 
267
:
1013
1016
.

Ohno
S
,
1970
 
Evolution by Gene Duplication
.
Springer-Verlag
,
New York
.

Ohta
T
,
1983
 
Theoretical study on the accumulation of selfish DNA
.
Genet. Res.
 
41
:
1
15
.

Ohta
T
,
1987a
 
A model of evolution for accumulating genetic information
.
J. Theor. Biol.
 
124
:
199
211
.

Ohta
T
,
1987b
 
Simulating evolution by gene duplication
.
Genetics
 
115
:
207
213
.

Ohta
T
,
1988a
 
Further simulation studies on evolution by gene duplication
.
Evolution
 
42
:
375
386
.

Ohta
T
,
1988b
 
Multigene and supergene families
.
Oxf. Surv. Evol. Biol.
 
5
:
41
65
.

Ohta
T
,
1991
 
Multigene families and the evolution of complexity
.
J. Mol. Evol.
 
33
:
34
41
.

Payant
V
,
Abukashawa
S
,
Sasseville
M
,
Benkel
B F
,
Hickey
D A
 et al. ,
1988
 
Evolutionary conservation of the chromosomal configuration and regulation of amylase genes among eight species of the Drosophila melanogaster species subgroup
.
Mol. Biol. Evol.
 
5
:
560
567
.

Petrov
D A
,
Lozovskaya
E R
,
Hartl
D L
,
1996
 
High intrinsic rate of DNA loss in Drosophila
.
Nature
 
384
:
346
349
.

Shapira
S K
,
Finnerty
V G
,
1986
 
The use of genetic complementation in the study of eukaryotic macromolecular evolution: rate of spontaneous gene duplication at two loci of Drosophila melanogaster
.
J. Mol. Evol.
 
23
:
159
167
.

Shibata
H
,
Yamazaki
T
,
1995
 
Molecular evolution of the duplicated Amy locus in Drosophila melanogaster species subgroup: concerted evolution only in coding region and excess of non-synonymous substitutions in speciation
.
Genetics
 
141
:
223
236
.

Stephan
W
,
1986
 
Recombination and the evolution of satellite DNA
.
Genet. Res.
 
47
:
167
174
.

Sudupak
M A
,
Bennetzen
J L
,
Hullbert
S H
,
1993
 
Unequal exchange and meiotic instability of disease-resistance genes in the Rp 1 region of maize
.
Genetics
 
133
:
119
125
.

Takano
T S
,
Kusakabe
S
,
Koga
A
,
Mukai
T
,
1989
 
Polymorphism for the number of tandemly multiplicated glycerol-3-phosphate dehydrogenase genes in Drosophila melanogaster
.
Proc. Natl. Acad. Sci. USA
 
86
:
5000
5004
.

Theissen
G
,
Kim
J T
,
Saedler
H
,
1996
 
Classification and phylogeny of the MADS-box multigene family suggest defined roles of MADS-box gene subfamilies in the morphological evolution of eukaryotes
.
J. Mol. Evol.
 
43
:
484
516
.

Walsh
J B
,
1985
 
Interaction of selection and biased gene conversion in a multigene family
.
Proc. Natl. Acad. Sci. USA
 
82
:
153
157
.

Walsh
J B
,
1995
 
How often do duplicated genes evolve new functions?
 
Genetics
 
139
:
421
428
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)