Abstract
Nonindependent evolution of duplicated genes is called concerted evolution. In this article, we study the evolutionary process of duplicated regions that involves concerted evolution. The model incorporates mutation and gene conversion: the former increases d, the divergence between two duplicated regions, while the latter decreases d. It is demonstrated that the process consists of three phases. Phase I is the time until d reaches its equilibrium value, d0. In phase II d fluctuates around d0, and d increases again in phase III. Our simulation results demonstrate that the length of concerted evolution (i.e., phase II) is highly variable, while the lengths of the other two phases are relatively constant. It is also demonstrated that the length of phase II approximately follows an exponential distribution with mean τ, which is a function of many parameters including gene conversion rate and the length of gene conversion tract. On the basis of these findings, we obtain the probability distribution of the level of divergence between a pair of duplicated regions as a function of time, mutation rate, and τ. Finally, we discuss potential problems in genomic data analysis of duplicated genes when it is based on the molecular clock but concerted evolution is common.
TO understand the evolutionary importance of gene duplication, it is critical to know when duplication events occurred. This is not very difficult as long as two duplicated genes have accumulated mutations independently, because we can estimate the time to the duplication event from the level of nucleotide divergence between two duplicates. The idea that nucleotide divergence has a linear correlation with time is known as the “molecular clock.” However, the molecular clock hypothesis does not always hold for duplicated genes because of the phenomenon called “concerted evolution” (reviewed in Ohta 1980, 1983; Arnheim 1983; Li 1997). Under concerted evolution, the level of divergence between two duplicated genes is maintained very low, so that the observed divergence is usually much lower than the expectation when the molecular clock is assumed.
Gene conversion has been considered as the most important mechanism for this homogenization in duplicated genes (i.e., a small multigene family with copy number of 2), although unequal crossing over could also be important for large- or middle-size multigene families. Clear evidence for gene conversion is seen when DNA polymorphism data are available for both of the duplicated genes, because gene conversion creates “shared polymorphic sites” (Innan 2003a), at which both of the two corresponding sites in the duplicated genes are polymorphic (Figure 1A). Although such data sets are few, examples include several pairs of duplicated genes in Drosophila (Inomataet al. 1995; King 1998; Lazzaro and Clark 2001; Bettencourt and Feder 2002), human (Innan 2003b), plants (Satoet al. 2002; Charlesworthet al. 2003), and Plasmodium falciparum (Nielsenet al. 2003).
The action of gene conversion can also be suggested from phylogenetic studies. For example, suppose that duplicated genes I and II exist in species A and B. This means the gene duplication event predates the speciation (Figure 1B). Without gene conversion, it is expected that we observe a tree that is consistent with the real tree. However, if these genes have undergone concerted evolution, the observed tree might be inconsistent with the real tree. That is, the two duplicated genes in each species are more closely related (Figure 1B, right tree). On the basis of this idea, recently, Rozen et al. (2003) reported that there are abundant gene conversions between several pairs of duplicated regions on the Y chromosome of human and chimpanzee.
Thus, there are many lines of evidence for gene conversion between duplicated genes. However, the effect of gene conversion on the divergence between duplicated genes has not been well understood theoretically. The purpose of this article is to investigate the behavior of the divergence between duplicated genes after gene duplication. We modify gene conversion models of Walsh (1987). In our model, gene conversion occurs such that a certain length of DNA fragment is transferred from one gene to the other, although gene conversion tracts cover the entire gene in Walsh's models. In our general model, we investigate the effect of the mutation and gene conversion rates and the length of gene conversion tract on the divergence between duplicated genes by simulations, and we attempt to formulate the probability distribution of the divergence as a function of time. Then, we discuss potential problems in genomic data analysis of duplicated genes on the basis of the molecular clock when concerted evolution is common.
Summary of parameters
—(A) Shared polymorphic sites created by gene conversion. (B) Real tree (history) of gene duplication and speciation (left) and observed tree under concerted evolution (right).
MODEL AND SIMULATION
The evolutionary process of a pair of duplicated regions is considered. We study the behavior of d, the number of nucleotide differences between duplicated regions after their birth (i.e., duplication). The process involves mutation and gene conversion; the former increases the level of divergence while the latter decreases it. Parameters used in this section are summarized in Table 1.
Suppose a duplication event creates two identical sequences in a genome at time T = 0. In this article, we consider the evolutionary process of a pair of subregions that are within the duplicated regions as illustrated in Figure 2. Each considered region is represented by a large box and assigned to an interval of (0, 1). After the duplication event, the regions start accumulating mutations and gene conversion works to homogenize variation. Mutation occurs independently in the two regions at rate μ per region per generation, so that the number of mutations in each region follows a Poisson distribution with mean μ. For each mutation, its position is determined as a random variable between 0 and 1 (i.e., infinite-site model).
Gene conversion transfers a DNA fragment from one to the other. Figure 2 shows two examples of gene conversion events. Gene conversion I transfers a DNA segment between positions 0.2 and 0.5 from the original region to the duplicated region, so that the corresponding part of the duplicated region is replaced by the original sequence. Note that shaded boxes represent the sequence of the original region before gene conversion and open boxes show the duplicated region. Gene conversion could involve regions outside of the interval (0, 1). Gene conversion II in Figure 2 consists of a region from position 0.85 to 1 and a fragment outside of position 1.
—Illustration of gene conversion events. The simulated regions are represented by large boxes: the shaded box represents the original region and the open box represents the duplicated region before gene conversion. Two gene conversion tracts are shown by small boxes.
Gene conversion is simulated assuming the length of the conversion tract follows a geometric distribution in a finite length of DNA region. This assumption is from Wiuf and Hein (2000), who studied homologous gene conversion (in this study, we consider interlocus gene conversion that occurs between two duplicated regions). Although the mechanism of interlocus gene conversion is not well understood, it might be reasonable to assume that it may be similar to that of homologous gene conversion. Following Wiuf and Hein (2000), simulating gene conversion events involves two parameters, g and q. g represents the rate that a gene conversion event initiates and q is the geometric distribution parameter to determine the length of the gene conversion tract. The geometric distribution for the length of the converted tract in base pairs, z, is given by q(1 – q)z–1. Suppose that each of the simulated regions is L bp long. If we assume L is very large and q is very small (i.e., infinite-sites model), the distribution of the length of a tract (see Figure 2) is given by an exponential distribution with parameter Q = qL, and the average length of a conversion tract is 1/Q (the unit of length is L bp). In this model, the rate that a particular site of one region is transferred to the other is given by
In this study, we consider another two probabilities. One is the probability that determines whether the two regions make pairing in meiosis, which is considered to be required for gene conversion to occur. This probability, S1(x), should be a function of x, the divergence between the two regions. For example, S1(x) is defined as
The other is the probability that a gene conversion event is successfully completed. Let S2(y) be this probability, which is given by a function of y, the divergence in a gene conversion tract. One example is that gene conversion occurs only when y is smaller than a certain threshold, s2. That is,
In addition to nucleotide mutations, Walsh (1987) also considered mutations that block gene conversion. Such mutations are called “terminator mutations.” Large indels and transposable elements might belong to terminator mutations. We assume such mutations occur at rate m per region per generation. The position of mutation is given as a random variable between (0, 1). Gene conversion is assumed to be completely suppressed if a conversion tract includes the position of the terminator mutation.
Note that the numbers of mutations and gene conversion events per generation follow Poisson distributions. When the expected numbers of these events are very small, we can obtain essentially the same simulation results by an approximate method, in which the Poisson processes are simulated every k generation with the parameters multiplied by k.
RESULTS
The behavior of the divergence between duplicated regions, d, is investigated by computer simulation. Throughout this study μ= 10–6/region is assumed: the simulated region corresponds to a 1-kb region if we assume that a standard mutation rate = 10–9/site/generation. Also, k = 105 is assumed. We assume 1/Q = 0.1, s1 = 100 in (2), and s2 = 100z′ in (3), where z′ is the gene conversion tract length in the interval (0, 1). The results are in Figure 3, which shows four independent realizations of the trace of d for c = {0, 1, 3, 5, 10} × 10–8 from top to bottom. Without gene conversion (top), d increases linearly with increasing time (i.e., molecular clock). When c = 10–8, a little delay is observed in the increasing function, and the delay gets bigger as c increases. When c = 5 × 10–8, d fluctuates around an equilibrium value for a quite long time, and then d starts increasing linearly. The bottom indicates that c = 10–7 is so high that the fluctuation time of d is extremely long (but the fluctuation does not continue forever).
—Traces of d over time under concerted evolution. We assume a gene conversion rate = {0, 1, 3, 5, 10} × 10–8 from top to bottom. Four independent traces are shown for each gene conversion rate.
Thus, the evolutionary process of duplicated regions involves a long fluctuation time of d unless the gene conversion rate is small. Figure 4 illustrates a typical behavior of d, which consists of three phases: phase I is the time until d reaches its equilibrium value, d0, which depends mainly on the mutation rate and gene conversion rate (Innan 2002, 2003a). In phase II, d keeps fluctuating around d0. Then, the process enters phase III, in which the two regions start accumulating mutations independently. Note that there are not clear borders between the three phases. The definition of the three phases is intuitively understood such that the divergence approximately follows a molecular clock if phase II is removed. The length of phase II approximately represents the length of concerted evolution.
To study the length of concerted evolution, we consider Td1, the waiting time for the first hit of d = d1, d1 ≫ d0. Td1 is divided into three parts, t1, t2, and t3, according to the phases (see Figure 4). We expect that Td1 is directly related to the length of concerted evolution, t2, because t2 ≈ Td1 – d1/(2μ). The effect of gene conversion rate on Td1 is investigated by simulation. We assume d1 = 200. Table 2 summarizes the results of simulations for c = {1, 2, 3,..., 10} × 10–8 when 1/Q = ∞, s1 = 100, and S2 is given by (3) with s2 = 100. It is demonstrated that as c increases, the average of Td1 increases exponentially. The variance of Td1 is huge, indicating that Td1 is highly variable. Similar results are obtained for the case of 1/Q = 0.1 (also see Figure 5A).
Averages and variances of Td1=200
—Illustration of the behavior of d.
We also investigate the relationship between c and Td1 under various gene conversion models (Figure 5). First we assume 1/Q = 0.1, s1 = 100 in (2) and S2 is given by (3) with s2 = 100z′. The result is presented by solid stars in Figure 5A. Td1 is larger than that for 1/Q = ∞ (solid squares), indicating that the gene conversion tract length has a significant effect on the length of concerted evolution (see below). Next we consider Td1 for 1/Q = ∞ and 1/Q = 0.1 in a model where s1 = 100 in (2) and S2 is given by (4) with s3 = 100z′. It is shown that the increase of Td1 against c is slower in comparison with the previous model. This is expected because the effective gene conversion rate given by (4) is smaller than that given by (3) if c is the same and s2 = s3.
In Figure 5B, Td1 in a model with terminator mutation is shown. We assume m = 1.25 × 10–11, s1 = 100 in (2) and S2 is given by (3) with s2 = 100z′. When 1/Q = ∞, as c increases, Td1 saturates around 1/(2m) = 4 × 1010 because phase II is terminated with probability 2m. The situation is complicated when 1/Q = 0.1. Td1 saturates somehow around 1/(2m) when c ≈ 8–9 × 10–8, but again starts increasing for c ≥ 10 × 10–8. This is because after a terminator mutation gene conversion is suppressed in a short region around the mutation when 1/Q is small. That is, as c increases, in most regions phase II continues for a quite long time even after a terminator mutation.
Figure 6A shows the effect of the length of gene conversion tract on Td1 when c = 3 × 10–8, m = 0, and s1 = 100. As 1/Q decreases, the average of Td1 increases dramatically. This observation could be understood as follows. As shown in Figure 6B, 1/Q has a positive correlation with the variance of d in phase II. The variance of d is a very important factor to determine the time of phase II because phase II is terminated when d happens to exceed a threshold value of d, dt. dt is defined as the minimum value of d that is too big for gene conversion to occur (see Figure 4). That is, once d hits dt, there is no chance that the system returns to phase II. It is obvious that the larger the variance of d, the more chance that d hits dt, creating the negative correlation between Td1 and 1/Q. Note that 1/Q has no effect on the expectation of d in phase II (Innan 2002, 2003a).
—The relationship between c and Td when 1/Q = ∞ (squares) and 0.1 (stars). (A) Solid squares and 1stars represent results of simulations in which S1 and S2 are given by (2) and (3) with s1 = s2 = 100. Shaded squares and stars represent the case where S1 and S2 are given by (2) and (4) with s1 = s3 = 100. (B) Results with terminator mutations are shown (shaded squares and stars). m = 1.25 × 10–11 is given. Solid squares and stars indicate the same as those in A.
Thus, the time of phase II, t2, might be considered as a waiting time for an event when d first hits dt. Let τ be the expectation of this waiting time. It is expected that t2 approximately follows an exponential distribution with mean τ when τ is large, and the variance of Td1 is approximately given by
Our simulation results have indicated that the relationship between d and time is complicated because many parameters (c, 1/Q, s1, s2, and m) affect t2. However, the probability distribution function (pdf) of d might involve only three parameters, μ, τ, and T, because τ summarizes all parameters that affect t2. Here, we attempt to obtain the pdf of d as a function of μ, τ, and T, assuming the pdf of t2 is given by
—(A) The effect of the length of the conversion tract, 1/Q, on Td1. (B) The effect of the length of the conversion tract on the variance of d in phase II.
We can also obtain the pdf of Td1. That is,
DISCUSSION
The evolutionary process of a pair of duplicated regions is studied by simulations. The model incorporates mutation and gene conversion: the former increases d, the divergence between two duplicated regions, while the latter decreases d. It is demonstrated that the process consists of three phases. Phase I is the time until d reaches its equilibrium value, d0. In phase II d fluctuates around d0, and d increases again in phase III. These three phases are defined such that d has a positive linear correlation with time if phase II is deleted. Phase II approximately corresponds to the time under concerted evolution. The lengths of the three phases, t1, t2, and t3, could be almost independent of each other. t1 and t3 are relatively constant, while t2 is highly variable. Our simulations demonstrated that t2 approximately follows an exponential distribution because t2 is a waiting time for a random event that initiates phase III. The rate that such events occur determines τ, the expectation of t2, which depends on the mutation rate (μ and m), gene conversion rate (c together with S1 and S2), and the average length of gene conversion tract (1/Q). It seems extremely difficult to obtain an equation for τ as a function of these parameters, but we were able to obtain the pdf of d given τ, μ, and T.
In this study, we considered mutations as only a mechanism to terminate phase II. However, strong selection could also be a factor to stop concerted evolution. Innan (2003b) demonstrated that the gene conversion rate is effectively reduced around the target site of selection, because selection works against gene conversion to keep the variation between two regions. Evidence for such selection is seen in exon 7 in the human RH genes (Innan 2003b).
Recent genomic sequence data provide great opportunities to study the evolution of gene duplication (e.g., Lynch and Conery 2000; Friedman and Hughes 2001; Baileyet al. 2002; Guet al. 2002; McLysaghtet al. 2002). Some of these studies estimate the date of gene duplications on the basis of the molecular clock, ignoring concerted evolution. This ignorance is partly because we do not know how common concerted evolution via gene conversion is at the genome level, even though there are many reports of gene conversion for specific genes or regions (see Introduction). Here, we consider potential problems in genomic data analysis of duplicated genes on the basis of the molecular clock, if concerted evolution is common. As an example, we use the data of Lynch and Conery (2000), who estimated the birth (duplication) and death (deletion or pseudogenization) rates on the basis of the molecular clock. Their estimation is based on the idea that the frequency distribution (spectrum) of d is approximately given by an exponential distribution when the birth and death rates (a and b, respectively) are constant over time. The solid bars in Figure 7B show this relationship when b = 10–8. It is obvious that the spectrum is expected to be flat when b = 0 (solid bars in Figure 7A). The number of observed duplicated genes with very low d in a genome reflects the duplication rate, a.
Figure 7A shows the expected frequency distribution of d when b = 0 and c = 3 × 10–8 (open bars), indicating that gene conversion creates ∼10 times more duplicated genes with low divergence (d < 50) than the case without gene conversion (solid bars) does because d spends a long time around d0. The two distributions are identical for d > 100 because gene conversion does not occur (i.e., dt = 100). It is indicated that gene conversion alone can also create an exponential-like distribution of d as well as the constant death process. The open bars in Figure 7B show that the frequency distribution under the joint effect of the death process and gene conversion is also similar to an exponential distribution. The peak of genes with low divergence is approximately three times higher than that of the case of the death process only, and the distribution decreases very quickly as d increases.
Thus, if concerted evolution of duplicated genes via gene conversion is common, the effect of gene conversion on the frequency distribution of d cannot be ignored. In such a case, it is indicated that the result of duplicated gene analysis based on the molecular clock should be biased. For example, suppose a and b are estimated by fitting an exponential distribution. An estimate of a depends on the number of duplicated genes with very low d: the more genes with low d, the higher the rate of gene duplication. If the number of duplicated genes of low divergence is increased by gene conversion, a should be overestimated. This excess of genes of low divergence might also contribute to an overestimation of b because an estimate of b depends on how quickly the distribution decreases as d increases.
Figure 7C shows the observed frequency distributions of the level of divergence measured by Ks, the expected number of synonymous substitutions per site in the Drosophila melanogaster genome. The data are from Lynch and Conery (2000). We chose this species as an example because DNA polymorphism data for several pairs of duplicated genes exhibit clear evidence for frequent gene conversion (Innan 2003a). Duplicated genes are classified into two groups, pairs on the same chromosome and those on different chromosomes, because gene conversion is expected to occur more frequently between pairs of genes on the same chromosome. We find a clear difference in the distributions: an excess of pairs of low divergence is observed only in the former class. If this difference is due to the difference in the gene conversion rate, it might be suggested that the high peak of low-divergence genes on the same chromosome might be a signature of frequent gene conversion rather than a rapid birth-and-death process. Lynch and Conery (2000) have also argued this possibility from the positive correlation between Ks and physical distance between genes.
Our results suggest that analyzing duplicated genes on the basis of the molecular clock might be misleading if concerted evolution is common. Therefore, before analyzing data, it is very important to test the molecular clock hypothesis, which might require genomic information from other related species. Unfortunately, such data are few at this moment, but will be available in time.
—Expected and observed frequency distributions of d. Expected distributions are obtained by simulations. 1/Q = ∞ and s1 = 100 are assumed and S2 is given by (3) with s2 = 100z′. a is assumed to be constant, but not specified. The distributions are shown in adjusted frequencies relative to the flat distribution of the case b = c = 0. (A) Solid bars represent the distribution of d for b = c = 0, and open bars represent that for b = 0 and c = 3 × 10–8. (B) Solid bars are for b = 10–8 and c = 0, and open bars for b = 10–8 and c = 3 × 10–8. (C) Observed distributions of Ks for the D. melanogaster genome. Data are from Lynch and Conery (2000).
Acknowledgments
J. S. Conery and M. Lynch kindly provided us information on their data in Lynch and Conery (2000). H.I. is supported by a grant from the University of Texas and K.M.T. is supported by a grant to R. Chakraborty.
Footnotes
-
Communicating editor: J. B. Walsh
- Received September 16, 2003.
- Accepted December 2, 2003.
- Copyright © 2004 by the Genetics Society of America