Abstract
A simple two-locus gene conversion model is considered to investigate the amounts of DNA variation and linkage disequilibrium in small multigene families. The exact solutions for the expectations and variances of the amounts of variation within and between two loci are obtained. It is shown that gene conversion increases the amount of variation within each locus and decreases the amount of variation between two loci. The expectation and variance of the amount of linkage disequilibrium are also obtained. Gene conversion generates positive linkage disequilibrium and the degree of linkage disequilibrium decreases as the recombination rate is increased. Using the theoretical results, a method for estimating the mutation, gene conversion, and recombination parameters is developed and applied to the data of the Amy multigene family in Drosophila melanogaster. The gene conversion rate is estimated to be ∼60–165 times higher than the mutation rate for synonymous sites.
Gene conversion and unequal crossing over have been considered as mechanisms that homogenize DNA sequence variation in multigene families. By computer simulations, Smith (1974, 1976) showed that repeated unequal crossing over results in fixation of a single copy in the whole multigene family (see also Black and Gibson 1974; Ohta 1976, 1978). Gene conversion was also shown to be important in reducing the amount of variation in multigene families (Edelman and Gally 1970; Birky and Skavaril 1976; Ohta 1977, 1982, 1984; Nagylaki and Petes 1982; Nagylaki 1984a,b). However, the rates of gene conversion and unequal crossing over in natural populations are not well understood.
Multigene families whose copy number is two are called small multigene families (Ohta 1981, 1983). The purpose of this article is to estimate the gene conversion rate in small multigene families from DNA polymorphism data. The other important mechanism, unequal crossing over, is ignored because gene conversion is more significant than unequal crossing over in small multigene families (Baltimore 1981; Dover and Coen 1981; Ohta 1983). A simple neutral model with mutation, random genetic drift, intrachromosomal gene conversion, and recombination was constructed on the basis of the following observed pattern of DNA variation in the Amy multigene family of the Drosophila melanogaster subgroup. On the second chromosome of D. melanogaster, there are two reversely duplicated Amy genes, called the proximal and distal genes (Bahn 1967). The chromosome configuration of the Amy region is conserved within the D. melanogaster subgroup (Payant et al. 1988; Shibata and Yamazaki 1995), suggesting that the two Amy genes have been maintained for a long time. Inomata et al. (1995) investigated DNA polymorphisms in the two Amy genes for nine strains of D. melanogaster. In the alignment of 18 sequences (9 from the proximal gene and 9 from the distal gene) there are 47 segregating sites, none of which has more than two segregating nucleotides. In the model, therefore, it is assumed that the copy number is constant at two, and only two allelic states are considered. The model is a special case of Ohta's (1982) model. Walsh (1988) proposed a similar model.
Under this simple model, the exact solutions for the equilibrium expectations and variances of the amounts of variation within and between two loci were obtained by a diffusion method. The amount of linkage disequilibrium was also investigated analytically. Using the theoretical results, a method for estimating the mutation, gene conversion, and recombination parameters was developed. The method was applied to estimate these three parameters in the Amy multigene family of D. melanogaster. It was shown that the gene conversion rate is ∼60–165 times higher than the mutation rate for synonymous sites.
THEORY
Consider two linked loci, I and II, in a random mating population with N diploids. We consider two neutral alleles, A and a, so that there are four haplotypes, A-A, A-a, a-A, and a-a (the first letter represents the allele at locus I and the second one represents the allele at locus II). It is assumed that the mutation rate between two alleles is μ per locus per generation. The recombination rate between two loci is assumed to be r per generation. Intrachromosomal gene conversion occurs at the rate c per locus per generation; e.g., A-a changes into A-A with probability c and into a-a with the same probability. Interchromosomal gene conversion is not considered. Let the frequencies of A-A, A-a, a-A, and a-a be x1, x2, x3, and x4 (x1 + x2 + x3 + x4 = 1), respectively. Given x1, x2, x3, and x4, their expectations in the next generation are given by
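As a rough illustration of the model just described, the expected haplotype frequencies after one generation can be sketched to first order in μ, c, and r. This is a sketch under stated assumptions, not necessarily the recursion used in the derivation: products of the rates are ignored, linkage disequilibrium is taken as D = x1x4 − x2x3, and the function name `expected_next` is hypothetical.

```python
def expected_next(x, mu, c, r):
    """Expected frequencies of (A-A, A-a, a-A, a-a) after one generation.

    First-order sketch: terms involving products of mu, c, and r are ignored.
    """
    x1, x2, x3, x4 = x
    d = x1 * x4 - x2 * x3   # linkage disequilibrium (assumed definition)
    het = x2 + x3           # haplotypes with different alleles at the two loci
    hom = x1 + x4           # haplotypes with the same allele at both loci

    # Mutation: a haplotype leaves its class at rate 2*mu (either locus mutates)
    # and is entered from each one-step neighbour at rate mu.
    # Conversion: A-a and a-A each become A-A with probability c and a-a with
    # probability c, so each loses 2*c of its frequency.
    # Recombination: shifts r*D from the A-A and a-a classes to A-a and a-A.
    y1 = x1 + mu * (het - 2 * x1) + c * het - r * d
    y2 = x2 + mu * (hom - 2 * x2) - 2 * c * x2 + r * d
    y3 = x3 + mu * (hom - 2 * x3) - 2 * c * x3 + r * d
    y4 = x4 + mu * (het - 2 * x4) + c * het - r * d
    return (y1, y2, y3, y4)
```

The four returned values always sum to one, because every term added to one class is removed from another.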
Under this model, we calculate the expectations of moments of allele frequencies using a diffusion method, which was introduced to population genetics by Kimura (1964). In equilibrium, it is known that a function, g(x1, x2, x3), satisfies the equation
First, letting g = p and q in (2) and (6), we have
Next, letting g = p², q², pq, and D, we obtain the following four equations:
Therefore, the expectations of the amounts of variation (heterozygosity) within loci I and II are given by
In a similar way, the variances of hwI, hwII, and hb are written as
Numerical examples for E(hw), E(hb), and E(D) are shown in Figure 1. Figure 1A shows the results for E(hw) given θ = 0.01. Gene conversion increases the amount of variation within a locus. Note that E(hw) = 0.0098 without gene conversion. When the gene conversion rate is relatively small (C = 0.1), E(hw) is ∼1.75-fold larger than that without gene conversion, while there is almost no effect of gene conversion on E(hw) when C = 100. Recombination also increases E(hw) but the effect is relatively small. Figure 1B shows the results for E(hb). Gene conversion decreases the amount of variation between two loci. The amount of variation between two loci is much bigger than that within each locus unless C is very large. When C = 100, E(hw) and E(hb) are almost the same. In Figure 1C, it is shown that gene conversion generates positive linkage disequilibrium. When there is no gene conversion, E(D) = 0. D is positively correlated with C, and D decreases as R increases. These results are consistent with other studies (e.g., Ohta 1982).
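The baseline value E(hw) = 0.0098 quoted above can be checked with a short calculation. The sketch below assumes that θ = 4Nμ and that, without gene conversion, each locus behaves as a single two-allele locus with symmetric mutation, for which the standard diffusion result is E(hw) = θ/(1 + 2θ). The function name is hypothetical.

```python
def expected_hw_no_conversion(theta):
    """E(hw) at one two-allele locus with symmetric mutation and no conversion.

    Assumes theta = 4*N*mu, where mu is the mutation rate between the two alleles.
    """
    return theta / (1.0 + 2.0 * theta)


print(round(expected_hw_no_conversion(0.01), 4))  # 0.0098
```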
DATA ANALYSIS AND ESTIMATION OF PARAMETERS
Since the expectations and variances of the amounts of variation within and between two loci and linkage disequilibrium are given by functions of θ, C, and R, it may be possible to estimate these parameters from DNA polymorphism data. An estimation method is explained using the data of the Amy region in D. melanogaster as an example (see Figure 1 in Inomata et al. 1995). The data consist of the coding sequences of the proximal and distal Amy genes for nine strains. The length of the coding sequence is 1482 bp. In the alignment of 18 sequences (9 sequences from the proximal gene and 9 from the distal gene), 47 sites are polymorphic, of which 37 are synonymous.
First, we estimate the amounts of variation and linkage disequilibrium for a particular site. Consider the 567th site of the Amy genes where two nucleotides, T and C, are segregating, so that there are four possible haplotypes, T-T, T-C, C-T, and C-C (the first letter represents the nucleotide in the proximal gene and the second one represents that in the distal gene). Denote the number of these haplotypes by n1, n2, n3, and n4. Estimates of heterozygosity within the proximal and distal genes are given by
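To make the per-site calculation concrete, the sketch below computes simple sample estimates from the four haplotype counts. These are assumptions, not the estimators of the original equations: within-gene heterozygosity uses the usual n/(n − 1) sampling correction, while the between-gene heterozygosity and δ are plain plug-in values. The function name `site_statistics` is hypothetical.

```python
def site_statistics(n1, n2, n3, n4):
    """Return (hw_p, hw_d, hb, delta) for one site.

    n1..n4 are the counts of the four haplotypes, e.g. T-T, T-C, C-T, C-C,
    with the proximal gene first and the distal gene second.
    """
    n = n1 + n2 + n3 + n4
    if n < 2:
        raise ValueError("at least two sampled chromosomes are required")

    p = (n1 + n2) / n   # frequency of the first nucleotide in the proximal gene
    q = (n1 + n3) / n   # frequency of the first nucleotide in the distal gene

    correction = n / (n - 1)                # assumed small-sample correction
    hw_p = 2 * p * (1 - p) * correction     # heterozygosity within the proximal gene
    hw_d = 2 * q * (1 - q) * correction     # heterozygosity within the distal gene
    hb = p * (1 - q) + q * (1 - p)          # plug-in heterozygosity between genes
    delta = (n1 * n4 - n2 * n3) / n ** 2    # plug-in linkage disequilibrium
    return hw_p, hw_d, hb, delta
```

For a sample in which only the two concordant haplotypes occur, such as counts (5, 0, 0, 4), δ is positive, which is the direction of disequilibrium that gene conversion is expected to generate.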
From (23), (24), and (25), we can calculate hw, hb, and δ for all sites of the genes and obtain their averages. The averages of hwp and hwd correspond to πwp and πwd, the average numbers of pairwise differences within the proximal and distal genes per site. πb is the average number of pairwise differences between two genes, which is the average of hb. Let d be the average of δ. Only the data for the synonymous sites of Inomata et al. (1995) are used for the calculation because their sampling was not random: it was based on information about allozyme variation (see Inomata et al. 1995). From all synonymous sites, we have πwp = 0.0315 and πwd = 0.0302. Then, the average number of pairwise differences within a gene, πw, is 0.0309. In a similar way, we have the average number of pairwise differences between two genes, πb = 0.0452. The average of linkage disequilibrium between two genes, d, becomes 0.000452. If θ, C, and R are constant for all the sites, the expectations of πw, πb, and d are given by
Figure 1. E(hw), E(hb), and E(D) given θ = 0.01. (A) Results for E(hw). (B) Results for E(hb). (C) Results for E(D).
Since E(πw), E(πb), and E(d) are given by functions of θ, C, and R, it may be possible to estimate these parameters from πw, πb, and d, although the equations for E(πw), E(πb), and E(d) are too complicated to solve for θ, C, and R. One way for the estimation is to find a set of θ, C, and R that minimizes x:
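The minimization described above can be sketched as a plain grid search. Everything specific here is an assumption for illustration: the objective is taken to be a sum of squared relative deviations between observed and expected values (which requires nonzero observed values), and `expected` is a hypothetical user-supplied function standing in for the formulas for E(πw), E(πb), and E(d).

```python
from itertools import product


def fit_parameters(observed, expected, thetas, cs, rs):
    """Grid search for the (theta, C, R) that minimizes the objective x.

    observed : tuple (pi_w, pi_b, d) computed from the data
    expected : callable (theta, C, R) -> (E_pi_w, E_pi_b, E_d); hypothetical
               stand-in for the theoretical expectations
    thetas, cs, rs : iterables of candidate parameter values
    """
    best_params, best_x = None, float("inf")
    for theta, c, r in product(thetas, cs, rs):
        predicted = expected(theta, c, r)
        # Assumed objective: squared relative deviation, summed over statistics.
        x = sum(((o - e) / o) ** 2 for o, e in zip(observed, predicted))
        if x < best_x:
            best_params, best_x = (theta, c, r), x
    return best_params, best_x
```

A grid search is used only because it is easy to verify; a numerical optimizer could replace the loop once the expectation formulas are implemented.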
Table 1. Estimates of θ, C, and R of the Amy multigene family in D. melanogaster
DISCUSSION
A simple two-locus gene conversion model was considered to investigate the amounts of DNA variation and linkage disequilibrium in small multigene families. The exact solutions for the expectations and variances of the amounts of variation within and between two loci were obtained. It was shown that gene conversion increases the amount of variation within each locus and that the degree of increase is large when the gene conversion rate is relatively small. On the other hand, gene conversion decreases the amount of variation between two loci and there is almost no difference between πw and πb when the gene conversion rate is very large. The effect of recombination on the amounts of variation within and between two loci is relatively small. The expectation and variance of the amount of linkage disequilibrium were also obtained. Gene conversion generates positive linkage disequilibrium and the degree of linkage disequilibrium decreases as the recombination rate increases.
The model considered here is a special case of Ohta's (1982) general model, and the theoretical results obtained in this article are consistent with her results. Ohta (1982) considered an intrachromosomal gene conversion model of multigene families with K alleles, where the number of loci is assumed to be constant (m). Her model with m = K = 2 corresponds to the model of this study. Ohta (1982) investigated the three identity coefficients, f, c1, and c2, at equilibrium. f is the probability that two alleles sampled from the same locus are identical, c1 is the probability that two alleles from different loci on the same chromosome are identical, and c2 is the probability that two alleles from different loci from different chromosomes are identical. These identity coefficients are written in terms of the amounts of variation considered here. That is, f = 1 – E(hw) and c1 ≈ c2 ≈ 1 – E(hb). Ohta (1982) obtained the approximate expectations of three allelic identity coefficients using transient equations with the assumption that mutation, gene conversion, and recombination rates are small. In this study, the exact solutions for m = K = 2 were obtained without this assumption by a diffusion method. This method is useful to obtain the variances of hw, hb, and D. The transient equations for the second orders of the identity coefficients are too complicated to solve (Ohta 1985; Basten and Weir 1990).
Using the theoretical results, a method for estimating the mutation, gene conversion, and recombination parameters was developed. The method was applied to the data of the Amy multigene family of D. melanogaster (Inomata et al. 1995). The estimate of θ for synonymous sites is 0.0172, which is close to the average for this species (0.0135; Moriyama and Powell 1996). The gene conversion rate is estimated to be ∼60-fold larger than the estimate of the mutation rate for synonymous sites. The amount of variation within a locus is much larger than θ because of a high rate of gene conversion.
Similar results are obtained from recent data of the same region (Table 1). Araki et al. (2001) reported sequence variations of the Amy region in random samples from Japan and Kenya. For the Kenyan sample, θ for synonymous sites and for the total coding region are estimated to be 0.0302 and 0.0089, respectively. C for the total coding region is estimated to be 3.88, which is similar to that for synonymous sites (4.08). The similarity of the two estimates of C is consistent with the mechanism of gene conversion, because a single conversion event usually involves a certain length of DNA fragment. For the Japanese sample, since negative d is observed, the estimation was conducted assuming free recombination (R = ∞). The results are very similar to those of the Kenyan sample. The estimate of θ for synonymous sites is about fourfold larger than that for the total coding region, while the two estimates of C are similar. Estimates of θ and C for the Kenyan sample are larger than those for the Japanese sample, probably because of the difference in population size.
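The ratio of the gene conversion rate to the mutation rate follows directly from the estimates quoted above for the Kenyan sample. The sketch assumes that both parameters are scaled by the same population-size factor (θ = 4Nμ and C = 4Nc), so that C/θ = c/μ; the function name is hypothetical.

```python
def conversion_to_mutation_ratio(c_scaled, theta):
    """Return c/mu, assuming C = 4*N*c and theta = 4*N*mu so that N cancels."""
    return c_scaled / theta


# Kenyan sample, synonymous sites: C = 4.08, theta = 0.0302
print(round(conversion_to_mutation_ratio(4.08, 0.0302)))  # 135
```

A ratio of about 135 lies within the ∼60–165 range reported for synonymous sites.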
Figure 2. Linkage disequilibrium in the Amy genes in D. melanogaster.
To estimate θ, C, and R, these parameters are assumed to be constant across the region. The obtained estimates might therefore be averages over all the sites considered. Since the Amy genes of D. melanogaster are reversely duplicated, R could vary substantially across the region. Assuming the recombination rate per site is constant (ρ per kb), R for the first position is ∼4.5ρ and for the last position is ∼7.5ρ because the length of the region between the two Amy genes is ∼4.5 kb. The effect of heterogeneity in the recombination rate on δ was investigated (Figure 2) because the effect of R on δ is relatively large. Almost no correlation was detected, suggesting that the effect of the heterogeneity of R on the estimates may not be large.
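The figures of ∼4.5ρ and ∼7.5ρ follow from the inverted arrangement of the two genes. The sketch below assumes that position 0 of each coding sequence lies next to the ∼4.5-kb spacer, so a site at position i is separated from its counterpart in the other gene by the spacer plus twice i. The constant and function names are hypothetical.

```python
SPACER_KB = 4.5         # approximate length of the region between the two genes
GENE_LENGTH_BP = 1482   # length of the coding sequence


def distance_between_counterparts_kb(position_bp, spacer_kb=SPACER_KB):
    """Distance in kb between a site and its counterpart in the other gene.

    Multiply the result by rho (recombination rate per kb) to obtain R
    for that site. Assumes position 0 is adjacent to the spacer.
    """
    if not 0 <= position_bp <= GENE_LENGTH_BP:
        raise ValueError("position lies outside the coding sequence")
    return spacer_kb + 2.0 * position_bp / 1000.0


print(distance_between_counterparts_kb(0))               # 4.5
print(round(distance_between_counterparts_kb(1482), 1))  # 7.5
```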
The method considered here ignores the effect of selection, and estimates might be biased if selection is working. Purifying selection decreases the amounts of variation within and between two loci. The effect of selection is large and complicated when some kind of balancing selection acts to maintain two different alleles in a population. The amount of variation between two loci increases dramatically as selection intensity increases. The amount of variation within each locus is increased when selection is relatively weak, while almost no variation is observed when selection is very strong (H. Innan, unpublished results).
Acknowledgments
The author thanks M. Nordborg for comments. This study was supported in part by a fellowship from the Japan Society for the Promotion of Science.
APPENDIX
In equilibrium, letting g = p³, p²q, and pD in (2) and (6), we have three equations for E(p³), E(p²q), and E(pD),
Footnotes
- Communicating editor: F. Tajima
- Received May 15, 2001.
- Accepted March 4, 2002.
- Copyright © 2002 by the Genetics Society of America