The Coalescent and Infinite-Site Model of a Small Multigene Family
Hideki Innan
Department of Biological Science, University of Southern California, Los Angeles, California 90089-1340 and Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, Texas 77030
Corresponding author: Hideki Innan, School of Public Health, University of Texas Health Science Center, 1200 Hermann Pressler, Houston, TX 77030. E-mail: hinnan{at}sph.uth.tmc.edu
| ABSTRACT |
|---|
The infinite-site model of a small multigene family with two duplicated genes is studied. The expectations of the amounts of nucleotide variation within and between two genes and linkage disequilibrium are obtained, and a coalescent-based method for simulating patterns of polymorphism in a small multigene family is developed. The pattern of DNA variation is much more complicated than that in a single-copy gene, which can be simulated by the standard coalescent. Using the coalescent simulation of duplicated genes, the applicability of statistical tests of neutrality to multigene families is considered.
RECENT genomic data show that a substantial proportion of genes in the eukaryotic genome have been created by gene duplication, forming multigene families.
The pattern of polymorphism in a multigene family is much more complicated than that in a single-copy gene, because duplicated genes are unlikely to evolve independently owing to recurrent exchanges of genetic material between genes (i.e., concerted evolution of multigene families).
There are not many theories for analyzing this complicated pattern of DNA polymorphism in a multigene family.
| INFINITE-SITE MODEL FOR A SMALL MULTIGENE FAMILY |
|---|
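In this section, my previous theoretical result based on a two-locus gene conversion model (INNAN 2002) is applied to the infinite-site model of a small multigene family. Consider two loci, I and II, each with two alleles, A and a, so that there are four types of chromosomes, denoted by the alleles at loci I and II in that order.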
Let the frequencies of A-A, A-a, a-A, and a-a be x1, x2, x3, and x4 (x1 + x2 + x3 + x4 = 1), respectively. The amount of variation within a locus, hw, is defined as heterozygosity within a particular locus [i.e., hw = 2(x1 + x2)(x3 + x4) at locus I, hw = 2(x1 + x3)(x2 + x4) at locus II]. The expectation of hw at equilibrium is given by a function of three parameters (θ = 4Nµ, C = 4Nc, and R = 4Nr),
[Equation 1 not preserved in source]
The amount of variation between two loci, hb, is defined as the probability that two independent alleles sampled from different loci are different [i.e., hb = (x1 + x2)(x2 + x4) + (x1 + x3)(x3 + x4)]. The expectation of hb is given by
[Equation 2 not preserved in source]
The expectation of linkage disequilibrium between two loci (D = x1x4 - x2x3) is given by
[Equation 3 not preserved in source]
Here, we consider hw, hb, and D in a small multigene family with two duplicated genes, I and II, each of which consists of L nucleotides. Assume that n chromosomes are randomly sampled from a population and both genes are sequenced for each chromosome. The amount of nucleotide variation within a gene is usually measured by the average number of pairwise differences, πw. Denote the numbers of nucleotide differences between the ith and jth chromosomes in the first and second genes by d11(i, j) and d22(i, j), respectively. Then, πw for genes I and II are given by
$$\pi_{w1} = \frac{2}{n(n-1)} \sum_{i<j} d_{11}(i, j) \quad \text{and} \quad \pi_{w2} = \frac{2}{n(n-1)} \sum_{i<j} d_{22}(i, j), \tag{4}$$
respectively. πw1 = 2.6 and πw2 = 2.4 in the example of Table 1. Let d12(i, j) be the number of nucleotide differences between gene I of the ith chromosome and gene II of the jth chromosome. The average of d12(i, j) represents the amount of variation between two genes. That is,
$$\pi_b = \frac{1}{n(n-1)} \sum_{i \neq j} d_{12}(i, j). \tag{5}$$
Note that πb is defined to correspond to hb derived by INNAN (2002), so that πb does not involve d12(i, j) when i = j (i.e., πb does not consider the nucleotide differences between two genes on the same chromosome). This is because hb is defined as the probability that two independent alleles sampled from different loci are different. In the data of Table 1, πb = 3.4. Define Dsum as the sum of linkage disequilibria at all L sites. Let Dm be the linkage disequilibrium at the mth site, which is calculated as Dm = (nAAnaa - nAanaA)/[n(n - 1)], where nxy represents the number of chromosomes with nucleotides x and y at genes I and II, respectively. Then, Dsum is given by
$$D_{sum} = \sum_{m=1}^{L} D_m. \tag{6}$$
In the data of Table 1, since D1 = 0.05, D2 = -0.05, and D4 = 0.1, the sum is Dsum = 0.1. Note that only shared polymorphic sites contribute to linkage disequilibrium (D = 0 for the other types of polymorphic sites).
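The definitions of πw, πb, and Dsum above can be sketched in a few lines of Python. This is a minimal illustration, not the author's program: the function names and the toy sample are hypothetical (they are not the data of Table 1), and alleles are assumed to be coded as 0 (ancestral) and 1 (derived).

```python
from itertools import combinations, permutations


def n_diff(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def summary_stats(gene1, gene2):
    """Return (pi_w1, pi_w2, pi_b, d_sum) for n sampled chromosomes.

    gene1[i] and gene2[i] are the 0/1 sequences of genes I and II
    on the ith chromosome (hypothetical input layout).
    """
    n = len(gene1)
    n_sites = len(gene1[0])

    # Within-gene variation: average over unordered pairs i < j.
    pairs = list(combinations(range(n), 2))
    pi_w1 = sum(n_diff(gene1[i], gene1[j]) for i, j in pairs) / len(pairs)
    pi_w2 = sum(n_diff(gene2[i], gene2[j]) for i, j in pairs) / len(pairs)

    # Between-gene variation: average over ordered pairs with i != j,
    # so the two genes on the same chromosome are never compared.
    ordered = list(permutations(range(n), 2))
    pi_b = sum(n_diff(gene1[i], gene2[j]) for i, j in ordered) / len(ordered)

    # Sum of per-site linkage disequilibria D_m.
    d_sum = 0.0
    for m in range(n_sites):
        counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
        for i in range(n):
            counts[(gene1[i][m], gene2[i][m])] += 1
        d_sum += (counts[(1, 1)] * counts[(0, 0)]
                  - counts[(1, 0)] * counts[(0, 1)]) / (n * (n - 1))
    return pi_w1, pi_w2, pi_b, d_sum


# Toy sample: site 0 is polymorphic in both genes, site 1 only in gene I,
# and site 2 is a fixed difference between the genes.
toy_gene1 = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0]]
toy_gene2 = [[1, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]]
print(summary_stats(toy_gene1, toy_gene2))
```

In the toy sample only site 0 contributes to Dsum, in line with the remark that linkage disequilibrium is zero at sites that are not polymorphic in both genes.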
Equations 1-3 are applied to this two-gene model with L nucleotides. Since the duplicated genes can be regarded as L two-locus models, the expectations of the three amounts of variation are given by
$$E(\pi_w) = L\,E(h_w), \quad E(\pi_b) = L\,E(h_b), \quad \text{and} \quad E(D_{sum}) = L\,E(D). \tag{7}$$
When gene conversion occurs between a pair of DNA sequences, it transfers a tract of some length, so the L nucleotide sites in the duplicated genes are not independent. However, these equations for the expectations hold without the assumption of independence among the L sites. That is, the distribution of gene conversion tract lengths does not affect the expectations if the gene conversion rate per site (C) is given. On the other hand, the variances of πw, πb, and Dsum are affected by the distribution of gene conversion tract lengths.
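Under the infinite-site model, the mutation rate per site is assumed to be so small that there are no multiple mutations at a single site (KIMURA 1969). The expectations under this model are obtained from (7) by letting the per-site mutation parameter approach zero while its product with L, the mutation parameter for the whole gene (hereafter θ), is held constant. That is,
$$E(\pi_w) = \frac{2\theta(2 + 2C + R)}{2 + 4C + R}, \tag{8}$$
$$E(\pi_b) = \theta\left[\frac{1}{C} + \frac{2(2C + R)}{2 + 4C + R}\right], \tag{9}$$
and
$$E(D_{sum}) = \frac{2\theta C}{2 + 4C + R}. \tag{10}$$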
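As a numerical check of the infinite-site expectations, the short Python sketch below evaluates E(πw), E(πb), and E(Dsum) for given θ, C, and R. The closed forms coded here are an assumption: they are a reconstruction chosen because they reproduce the worked example in this section (θ = 1.3, C = 1.1, and R = 22.2 correspond to πw = 2.4, πb = 3.4, and Dsum = 0.1), and they should be compared with Equations 8-10 before being relied on. The function name is hypothetical.

```python
def expected_stats(theta, C, R):
    """Expected (pi_w, pi_b, D_sum) under the assumed closed forms.

    theta: mutation parameter for the whole gene
    C, R:  gene conversion and recombination parameters (C must be > 0)
    """
    denom = 2 + 4 * C + R
    e_pi_w = 2 * theta * (2 + 2 * C + R) / denom
    e_pi_b = theta * (1 / C + 2 * (2 * C + R) / denom)
    e_d_sum = 2 * theta * C / denom
    return e_pi_w, e_pi_b, e_d_sum


# Values from the worked example in the text.
print(expected_stats(1.3, 1.1, 22.2))
```

Under these forms the difference E(πb) - E(πw) equals θ(2 + R)/[C(2 + 4C + R)], which is positive and vanishes as C becomes large, and E(Dsum) is never negative.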
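From (8)-(10), θ, C, and R can be estimated by πw, πb, and Dsum:
$$\hat{\theta} = \frac{\pi_w}{2} + D_{sum}, \tag{11}$$
$$\hat{C} = \frac{\pi_w - 2D_{sum}}{2(\pi_b - \pi_w)}, \tag{12}$$
and
$$\hat{R} = \frac{2\hat{\theta}\hat{C}}{D_{sum}} - 4\hat{C} - 2. \tag{13}$$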
With the example data of Table 1, θ, C, and R are estimated to be 1.3, 1.1, and 22.2, given πw = 2.4, πb = 3.4, and Dsum = 0.1.
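The estimation step can be illustrated with a small Python function. This is a hedged sketch rather than the author's code: the estimator forms (θ = πw/2 + Dsum, C = (πw - 2Dsum)/[2(πb - πw)], R = 2θC/Dsum - 4C - 2) are assumed here because they reproduce the estimates quoted above for the example data, and the function name and error handling are hypothetical.

```python
def estimate_parameters(pi_w, pi_b, d_sum):
    """Moment estimates (theta, C, R) from pi_w, pi_b, and D_sum.

    The assumed estimators are only meaningful when pi_w < pi_b and
    D_sum > 0, so other inputs are rejected.
    """
    if pi_b <= pi_w:
        raise ValueError("estimator of C requires pi_w < pi_b")
    if d_sum <= 0:
        raise ValueError("estimator of R requires D_sum > 0")
    theta = pi_w / 2 + d_sum
    C = (pi_w - 2 * d_sum) / (2 * (pi_b - pi_w))
    R = 2 * theta * C / d_sum - 4 * C - 2
    return theta, C, R


# Example values quoted in the text for Table 1.
theta_hat, c_hat, r_hat = estimate_parameters(2.4, 3.4, 0.1)
print(round(theta_hat, 1), round(c_hat, 1), round(r_hat, 1))
```

Rejecting inputs with πw ≥ πb or Dsum ≤ 0 is a design choice for this sketch; an analysis pipeline might instead report such data sets as outside the range where the moment estimators apply.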
Equations 11-13 are also applied to data of three small multigene families in Drosophila melanogaster. As shown in Table 2, these equations work when πw < πb and Dsum > 0. Equation 12 does not work well when πw > πb because (8) and (9) indicate E(πw) ≤ E(πb) [E(πw) = E(πb) when C = ∞]. Equation 13 also does not work well when Dsum < 0 because the theory predicts E(Dsum) ≥ 0.
| COALESCENT SIMULATION OF A SMALL MULTIGENE FAMILY |
|---|
To simulate patterns of polymorphism in a small multigene family with two duplicated genes, a standard coalescent model with recombination (HUDSON 1983) is extended to incorporate gene conversion between the two genes.
While the ancestral recombination graph is generated, gene conversions are placed randomly (Fig 1A). Gene conversion occurs with probability c per site per generation whether lineages are ancestral to the sampled chromosomes (Fig 1A, solid lines) or not (Fig 1A, dashed lines). For each gene conversion event, the position and direction are determined. For convenience, the gene is represented by the interval (0, 1), so that the position of a gene conversion tract is given by a subinterval between 0 and 1. For example, the gene conversion between T0 and T1 in Fig 1A occurs between positions 0.08 and 0.27. Since the direction of this gene conversion is from II to I, the gene conversion changes allelic state {1, 0} to {0, 0} and {0, 1} to {1, 1} (see Fig 1, B-D). Note that the allelic state for a pair of lineages is represented by two numbers in braces. The presence and absence of a mutation are represented by 1 and 0, respectively. The first number is for gene I and the second one is for gene II. A gene conversion in the other direction changes {1, 0} to {1, 1} and {0, 1} to {0, 0}. Gene conversions do not change the allelic states {0, 0} or {1, 1}. The length of a gene conversion tract might follow a certain distribution.
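The effect of a gene conversion event on allelic states can be written as a small Python sketch. The function names and the dictionary layout are hypothetical; the state changes follow the rules described above, and the tract (0.08, 0.27) and the mutation positions are taken from the example of Fig 1A.

```python
def convert_state(state, donor):
    """Apply gene conversion to one allelic state (gene I, gene II).

    donor is "I" (direction I to II) or "II" (direction II to I);
    the recipient gene takes the donor's allele.
    """
    gene1, gene2 = state
    if donor == "I":
        return (gene1, gene1)
    if donor == "II":
        return (gene2, gene2)
    raise ValueError("donor must be 'I' or 'II'")


def convert_tract(sites, start, end, donor):
    """Convert every site whose position lies in the tract [start, end].

    sites maps a position in (0, 1) to its allelic state.
    """
    return {
        pos: convert_state(state, donor) if start <= pos <= end else state
        for pos, state in sites.items()
    }


# Conversion from II to I over (0.08, 0.27): only the site at 0.12 changes.
before = {0.12: (0, 1), 0.74: (0, 1)}
after = convert_tract(before, 0.08, 0.27, "II")
print(after)  # {0.12: (1, 1), 0.74: (0, 1)}
```

States {0, 0} and {1, 1} are left unchanged by either direction, because both genes already carry the same allele.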
This two-gene coalescent simulation should be continued until the MRCA of the two genes (i.e., the MRCA of all the 2n lineages) is reached. The MRCA of the two genes requires coalescence between the two genes, which occurs by gene conversion because gene conversion transfers the DNA segment from one gene to the other. Fig 1E shows the tree for the interval (0.02-0.08), which is used to explain the definition of the MRCA of the two genes. On the tree, a gene conversion event occurs between T3 and T4 and transfers the DNA segment between 0.02 and 0.08 of gene I to gene II. This event can be considered as a coalescent event between the two genes. That is, going backward in time, the right lineage merges into the left one. Treating gene conversion in this way, we can find the MRCA of the two genes when the 2n lineages coalesce into one lineage. On the tree in Fig 1E, it occurs with the gene conversion event between T6 and T7. The coalescent simulation can be stopped when all segments in the interval (0, 1) reach the MRCAs of the two genes.
Given an ancestral recombination graph with gene conversion, mutations are randomly distributed on lineages following the Poisson process (Fig 1A). Mutations occur at any position on the graph with equal probability density (µ per site per generation) whether lineages are ancestral to the sampled chromosomes or not. For each mutation, the position in the gene is also determined. The positions are random numbers between 0 and 1. In Fig 1A, there are four mutations: at position 0.12 of gene II, at position 0.37 of gene I, at position 0.74 of gene II, and at position 0.98 of gene I. The allelic state of the lineage on which mutation occurs is given by 1. For example, when the mutation at position 0.12 occurs in gene II between T3 and T4, the allelic state of the site for the two genes is given by {0, 1} (Fig 1B).
The histories of the mutations in Fig 1A are traced forward in time in Fig 1, B-D, where allelic states are shown along the ancestral recombination graph (Fig 1A). Let us follow the mutation at position 0.12. Since the mutation occurs in gene II on the right pair of lineages between T3 and T4, the allelic states of the two pairs of lineages are given by {{0, 0}, {0, 1}} (the order of allelic states follows Fig 1A). At T3 the right pair of lineages is duplicated (coalescent event), and the states for the three pairs of lineages are given by {{0, 0}, {0, 1}, {0, 1}}. Another duplication of the left pair of lineages at T2 results in {{0, 0}, {0, 0}, {0, 1}, {0, 1}}, and a recombination event with the two middle pairs of lineages at T1 makes {{0, 0}, {0, 1}, {0, 1}}. Between T0 and T1, a gene conversion event on the right pair of lineages results in {{0, 0}, {0, 1}, {1, 1}} at the bottom of the graph. Therefore, the mutation at 0.12 appears as a shared polymorphic site. In a similar way, the mutations at 0.37 and 0.74 are traced and appear as fixed and specific polymorphisms, respectively (Fig 1, C and D). Note that the mutation at 0.98 is not observed because it is lost by the recombination event at T8.
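Once the allelic states at the bottom of the graph are known, each site can be assigned to one of the three types of polymorphism. The Python sketch below assumes the usual definitions (shared: polymorphic in both genes; specific: polymorphic in only one gene; fixed: monomorphic within each gene but different between them); the function name is hypothetical.

```python
def classify_site(states):
    """Classify one site from the sampled allelic states.

    states is a list of (gene I allele, gene II allele) pairs,
    one pair per sampled chromosome.
    """
    alleles_gene1 = {state[0] for state in states}
    alleles_gene2 = {state[1] for state in states}
    polymorphic1 = len(alleles_gene1) > 1
    polymorphic2 = len(alleles_gene2) > 1
    if polymorphic1 and polymorphic2:
        return "shared"
    if polymorphic1 or polymorphic2:
        return "specific"
    if alleles_gene1 != alleles_gene2:
        return "fixed"
    return "monomorphic"


# Final states of the mutation at position 0.12 in the example above.
print(classify_site([(0, 0), (0, 1), (1, 1)]))  # shared
```

Applying the same function to every mutated position in a simulated sample gives the counts from which frequency spectra of the three types can be tabulated.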
Following this process, patterns of DNA polymorphism are simulated and frequency spectra of three types of polymorphisms are investigated. For each parameter set, the expected frequency spectrum is obtained from 10,000 replications. The length of gene conversion tract is assumed to be so small that no gene conversion segment includes more than one mutation. This assumption does not affect the expected spectrum as long as the gene conversion rate per site is constant, as mentioned in the previous section. The averages of πw, πb, and Dsum in the simulations are in excellent agreement with the theoretical expectations obtained by (8)-(10).
Fig 2A shows the spectra of derived alleles (nucleotides) for a low gene conversion rate (C = 0.2). It is shown that a large proportion of polymorphic sites are fixed sites. Specific polymorphic sites are more frequent than shared polymorphic sites, and the shapes of spectra of these two types of polymorphic sites are U shapes that are skewed toward the left (rare classes). The effect of recombination on the spectrum is relatively small. When C = 1 (Fig 2B), shared polymorphic sites are more frequent than specific ones, and fixed ones are very rare. The spectra of specific and shared sites are both L shapes, and the former is more skewed than the latter. When gene conversion rate is high (C = 5), almost no fixed polymorphic sites are observed, and most polymorphic sites are shared sites (Fig 2C). Fig 3A shows the observed spectra in the distal and proximal Amy genes of D. melanogaster. They are similar to the expected spectrum obtained from a simulation with 10,000 replications given the estimated values of θ = 12.92, C = 4.55, and R = 22.99 (see Table 2).
| APPLICABILITY OF TESTS OF NEUTRALITY |
|---|
As demonstrated in this article, the pattern of polymorphism in a multigene family is much more complicated than that in a single-copy gene. Therefore, statistical tests of neutrality based on the standard coalescent theory for a single-copy gene may not be appropriate for genes in multigene families. TAJIMA's (1989) D and FU and LI's (1993) D* tests are among these. Consider the distal and proximal Amy genes in D. melanogaster as examples. If the two genes are treated as two independent single-copy genes, the test statistics can be calculated for each gene. Tajima's D and Fu and Li's D* are -0.13 and -0.38 in the distal gene and 0.10 and 0.09 in the proximal gene, respectively. However, the distributions of the test statistics for multigenes are different from those for single-copy genes. In Fig 3B, the distribution of Tajima's D in a single-copy gene is compared with that for a gene in a small multigene family with θ = 12.92, C = 4.55, and R = 22.99. The variance of the latter is much smaller than that of the former, indicating that a significant Tajima's D is very unlikely to be observed in a small multigene family if the confidence interval is determined by the distribution in a single-copy gene. A similar result is obtained for Fu and Li's D* (Fig 3C). The results are consistent with the observed Tajima's D and Fu and Li's D* values, which are quite close to zero. Hudson, Kreitman, and Aguadé's test (HUDSON et al. 1987) is also based on the standard theory for a single-copy gene and may likewise be inappropriate for genes in multigene families.
On the other hand, there is no problem in applying model-independent tests of neutrality, such as McDonald and Kreitman's test (MCDONALD and KREITMAN 1991; see Table 3).
| DISCUSSION |
|---|
The pattern of nucleotide polymorphism in a multigene family is much more complicated than that in a single-copy gene because of exchanges of genetic material between members of a family. In this article, the amounts and pattern of nucleotide polymorphism are studied under the infinite-site model. The expectations of three amounts of DNA variation (πw, πb, and Dsum) are obtained analytically, and a coalescent method for simulating patterns of nucleotide polymorphism is developed. From the simulation the frequency spectra of three types of polymorphic sites are investigated.
The simulations demonstrate that statistical tests that are based on the standard theory for a single-copy gene may not be appropriate to use for genes in multigene families (e.g., Tajima's D; Fu and Li's D*; and Hudson, Kreitman, and Aguadé's tests). New statistical tests should be developed for multigene families with the coalescent simulation described in this article. On the other hand, model-independent tests (e.g., McDonald and Kreitman's test) can be used without any problem (see Table 3).
The coalescent simulation developed in this article can be easily extended to a model of a multigene family with more than two genes as long as the number of genes is constant. Patterns of polymorphism in such multigene families could be more complicated because the gene conversion rates among members may vary. An example is seen in the hsp70 multigene family (BETTENCOURT and FEDER 2002).
Interchromosomal gene conversion, which is ignored for mathematical convenience, can be easily incorporated in the simulation, because an interchromosomal gene conversion event can be considered as intragenic gene conversion and recombination events that occur at the same time. That is, going backward in time, immediately after placing an intragenic gene conversion event, a new pair of lineages is introduced in the ancestral recombination graph. It is not clearly understood how often interchromosomal gene conversion occurs in comparison with intrachromosomal gene conversion.
| ACKNOWLEDGMENTS |
|---|
The author thanks H. Araki, J. Hey, M. Nordborg, and N. Rosenberg for comments and discussions, and the two anonymous reviewers for helpful suggestions. The C program used in this study is available from the author on request.
Manuscript received August 11, 2002; Accepted for publication November 6, 2002.
| LITERATURE CITED |
|---|
ARAKI, H., N. INOMATA, and T. YAMAZAKI, 2001 Molecular evolution of duplicated amylase gene regions in Drosophila melanogaster: evidence of positive selection in the coding regions and selective constraints in the cis-regulatory regions. Genetics 157:667-677.
ARNHEIM, N., 1983 Concerted evolution of multigene families, pp. 38-61 in Evolution of Genes and Proteins, edited by M. NEI and R. K. KOEHN. Sinauer, Sunderland, MA.
BAHLO, M., 1998 Segregating sites in a gene conversion model with mutation. Theor. Popul. Biol. 54:243-256.
BAILEY, J. A., Z. GU, R. A. CLARK, K. REINERT, and R. V. SAMONTE et al., 2002 Recent segmental duplications in the human genome. Science 297:1003-1007.
BETTENCOURT, B. R. and M. E. FEDER, 2002 Rapid concerted evolution via gene conversion at the Drosophila hsp70 genes. J. Mol. Evol. 54:569-586.
FU, Y.-X. and W.-H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133:693-709.
GRIFFITHS, R. C. and G. A. WATTERSON, 1990 The number of alleles in multigene families. Theor. Popul. Biol. 37:110-123.
HEY, J., 1991 A multi-dimensional coalescent process applied to multi-allelic selection models and migration models. Theor. Popul. Biol. 39:30-48.
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183-201.
HUDSON, R. R., M. KREITMAN, and M. AGUADÉ, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116:153-159.
INNAN, H., 2002 A method for estimating the mutation, gene conversion and recombination parameters in small multigene families. Genetics 161:865-872.
INOMATA, N., H. SHIBATA, E. OKUYAMA, and T. YAMAZAKI, 1995 Evolutionary relationships and sequence variation of α-amylase variants encoded by duplicated genes in the Amy locus of Drosophila melanogaster. Genetics 141:237-244.
KIMURA, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61:893-903.
KING, L. M., 1998 The role of gene conversion in determining sequence variation and divergence in the Est-5 gene family in Drosophila pseudoobscura. Genetics 148:305-315.
LAZZARO, B. P. and A. G. CLARK, 2001 Evidence for recent paralogous gene conversion and exceptional allelic divergence in the Attacin genes of Drosophila melanogaster. Genetics 159:659-671.
LYNCH, M. and J. S. CONERY, 2000 The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155.
MCDONALD, J. H. and M. KREITMAN, 1991 Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652-654.
NAGYLAKI, T., 1984a Evolution of multigene families under interchromosomal gene conversion. Proc. Natl. Acad. Sci. USA 81:3796-3800.
NAGYLAKI, T., 1984b The evolution of multigene families under intrachromosomal gene conversion. Genetics 106:529-548.
NORDBORG, M., 2001 Coalescent theory, pp. 179-212 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. J. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, UK.
OHNO, S., 1970 Evolution by Gene Duplication. Springer-Verlag, New York.
OHTA, T., 1981 Genetic variation in small multigene families. Genet. Res. 37:133-149.
OHTA, T., 1982 Allelic and nonallelic homology of a supergene family. Proc. Natl. Acad. Sci. USA 79:3251-3254.
OHTA, T., 1983 On the evolution of multigene families. Theor. Popul. Biol. 23:216-240.
TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis. Genetics 123:585-595.
WIUF, C. and J. HEIN, 2000 The coalescent with gene conversion. Genetics 155:451-462.