Genetics, Vol. 177, 987-1000, October 2007, Copyright © 2007
doi:10.1534/genetics.107.074948
Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697
1 Address for correspondence: Department of Ecology and Evolutionary Biology, 321 Steinhaus Hall, University of California, Irvine, CA 92697.
E-mail: krthornt{at}uci.edu
| ABSTRACT |
|---|
In parallel with the analysis of genomewide data, the systematic identification of recent duplication events in Drosophila species has revealed several lineage-specific genes, which have been studied in an effort to understand the importance of natural selection in the early stages of the evolution of "new" genes (e.g., LONG and LANGLEY 1993; WANG et al. 2000, 2002, 2004; BETRAN et al. 2002; BETRAN and LONG 2003; JONES et al. 2005; LOPPIN et al. 2005; ARGUELLO et al. 2006; LEVINE et al. 2006; FAN and LONG 2007). Examples of recent gene duplications have also been described in humans, mice, and plant species (reviewed in LONG et al. 2003). In general, these studies consist of three parts: first, the identification of the recent duplicate; second, an investigation of patterns of polymorphism and/or divergence; and third, some assay of function, often at the level of gene expression, to show that the new gene is functional.
The examples cited above all describe new genes that are fixed in population samples (the recent duplicate is found in all individuals sampled). There is currently much interest in identifying polymorphic duplications (so-called "copy-number variants," or CNVs), particularly in the human genome (BAILEY et al. 2002, 2004; CHEUNG et al. 2003; IAFRATE et al. 2004; LI et al. 2004; SEBAT et al. 2004; SHARP et al. 2005, 2006; CONRAD et al. 2006; LOCKE et al. 2006; PERRY et al. 2006; REDON et al. 2006; GRAUBERT et al. 2007), as it is believed that CNVs may be a significant contributor to the genetic basis of disease. While CNVs have been implicated in several diseases (SHARP et al. 2006; SEBAT et al. 2007; reviewed in KONDRASHOV and KONDRASHOV 2006), they are also of significant evolutionary interest, as they will likely provide valuable insight into the earliest stages of the evolution of new genes.
Little is currently available in terms of a framework for analyzing polymorphism data from recent duplicates and CNVs. With regard to the analysis of single-nucleotide polymorphism data, the coalescent process (HUDSON 1983; TAJIMA 1983) has been well studied for single-copy genes. For small gene families of size two, INNAN (2003a) has described the neutral coalescent process for the case where the duplication event is ancient (i.e., the duplication fixed
generations ago), allowing for gene conversion between duplicates, which is commonly observed in polymorphism data from gene duplicates (INNAN 2003b; THORNTON and LONG 2005; LINDSAY et al. 2006; RAEDT et al. 2006). In his model, the common ancestor of the two genes is reached via a gene conversion event. Here, I describe the coalescent process for the case of a recent duplication event, accounting for the fixation process of the duplication and tracing the history of both the ancestral gene, and of the recent duplicate, to the most recent common ancestor of both genes. I consider a neutral model where, at some point in the past, a randomly chosen allele of the ancestral locus was duplicated, and the duplication fixed in the population by genetic drift. Thus, the common ancestor of the gene family can be reached either via a gene conversion event or by proceeding back in time past the origination of the new gene, to the common ancestor of both genes.
When a duplicate gene has fixed recently in the population, diversity in the new gene is expected to be significantly reduced, and an excess of rare alleles is also expected. These expectations complicate the inference of positive selection on new genes using many standard population-genetic tests. Coalescent simulations are used to investigate both the effects of gene conversion between duplicates, which results in complex patterns of polymorphism, and the applicability of standard "tests of neutrality" when applied to young gene families. I find that commonly used tests of the site-frequency spectrum are not appropriate in this case, while the MCDONALD and KREITMAN (1991) test appears to be quite conservative when gene conversion is occurring between duplicates. The simulation is easily extended to the case of copy-number variants, and I describe patterns of polymorphism in neutral CNVs using simulations.
| THEORY |
|---|
, the average number of mutations between two chromosomes in a Moran model, for the case where a substitution occurs immediately before sampling. I extend Tajima's results to obtain the expectation of TAJIMA's (1989) D statistic, which is a summary of the site-frequency spectrum of mutations (a histogram of mutation frequencies). For a large, equilibrium population undergoing no selection, the expectation of D is 0. An excess of rare alleles results in D < 0, and D > 0 implies an excess of intermediate-frequency variants. TAJIMA (1990) considered the gene genealogy for a Moran population of 2N chromosomes in which a neutral substitution has recently occurred. The Moran model is a simple model of overlapping generations where drift occurs in discrete time steps (EWENS 2004, p. 104). At each step, one individual is chosen to reproduce, and another is chosen to die, and it is possible that the same individual is chosen both to reproduce and to die. At some point in the process, all 2N chromosomes may be the descendants of a single ancestor, who necessarily reproduced (Figure 1). At any time step, 2N – 1 of the descendants of this ancestor may share a most recent common ancestor with each other in the more recent past than they do with the 2Nth chromosome. If any of the 2N – 1 chromosomes is chosen to reproduce, and the 2Nth is chosen to die, then a substitution occurs in the next step of the process, and all chromosomes are the descendants of a single individual in the next time step (Figure 1).
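The reproduce-and-die step of the Moran model can be illustrated with a short simulation. The following Python sketch is only an illustration and is not taken from the article's software; the function name `moran_steps_to_single_ancestor` and the bookkeeping of ancestor labels are assumptions made for this example. It runs Moran steps until every chromosome descends from a single chromosome of the starting population.

```python
import random

def moran_steps_to_single_ancestor(two_n, rng):
    """Run Moran steps until all chromosomes descend from one initial chromosome.

    Returns (number_of_steps, label_of_that_initial_chromosome).
    """
    # ancestor[i] holds the label of the initial chromosome that slot i descends from
    ancestor = list(range(two_n))
    steps = 0
    while len(set(ancestor)) > 1:
        parent = rng.randrange(two_n)  # individual chosen to reproduce
        dead = rng.randrange(two_n)    # individual chosen to die (may equal parent)
        ancestor[dead] = ancestor[parent]
        steps += 1
    return steps, ancestor[0]

rng = random.Random(1)
steps, founder = moran_steps_to_single_ancestor(20, rng)
```

Because each step replaces at most one chromosome, at least 2N – 1 steps are needed before a single ancestral label can remain.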
|
instead of the standard
in units of 2N generations (TAJIMA 1990). Using these considerations, Tajima showed that, for a genealogy completely linked to a fixation at time τ = 0 in the past (τ is in units of 4N generations),
![]() | (1) |
![]() |
where θ = 4Nµ.
|
= 0, is
![]() | (2) |
An example genealogy for this case is shown in Figure 2B. The ancestral process of the 2N chromosomes sampled at time t = 0 is described by the standard coalescent model, with coalescent events occurring at rate
until time
in the past. At time
, the expected number of lineages remaining in the sample is
![]() |
![]() |
The total time on the tree during the time period from 0 to tk is
and if tk <
, there are an additional k(
– tk) units of total time to account for during the time period from 0 to
(Figure 2B). Starting at time
in the past, the genealogy of the k remaining lineages is described by TAJIMA's (1990) process, as the substitution event occurred at
. Therefore, the rate of coalescence from k to 1 lineages is given by
and the expectation of total time during this phase is
Due to the Markov structure of the process, we can sum the expectations of the total times from 2N to k lineages, and from k to 1 lineage, which is the total time on the tree for fixations a time
![]() | (3) |
Under the infinitely many-sites mutation model, the expected number of mutations given a recent neutral substitution is
![]() |
We can use Equations 1 and 2 to calculate the expectation of TAJIMA's (1989) D statistic, conditional on τ = 0. First, the expectation of WATTERSON's (1975) estimator of θ is
![]() | (4) |
And the expectation of D when τ = 0 is
![]() | (5) |
The denominator of Equation 5 is an approximation of the variance of the numerator and is calculated using the standard equations from TAJIMA (1989). The expected sign of D is given by the expectation of the numerator of Equation 5 and will be negative if
![]() |
![]() |
![]() |
The θ term cancels, and the inequality is true for 2N > 15. We therefore expect D to be negative in large populations when a neutral substitution has recently occurred, and we therefore expect D to be negative for recent gene duplicates. For example, when 2N = 50 and θ = 10, E[D | τ = 0] = –0.538. Recently, MCVEAN and SPENCER (2006) used simulations to come to similar conclusions about FU and LI's (1993) D statistic.
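As a concrete reference for the statistic discussed above, the following Python sketch computes D from a sample using the standard constants of TAJIMA (1989). It is an illustration, not the article's code; the function name `tajimas_d` and the input format (a list of derived-allele counts, one per site) are assumptions made for this example.

```python
from math import sqrt

def tajimas_d(n, derived_counts):
    """Tajima's (1989) D from per-site derived-allele counts in a sample of n.

    Requires n >= 4; for smaller samples the variance term is zero.
    """
    seg = [c for c in derived_counts if 0 < c < n]  # keep segregating sites only
    s = len(seg)
    if s == 0:
        return None  # D is undefined without segregating sites
    # mean number of pairwise differences
    pi = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in seg)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # numerator: pi minus Watterson's estimator; denominator: estimated s.d.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

d_rare = tajimas_d(10, [1] * 5)  # five singletons: excess of rare alleles
d_mid = tajimas_d(10, [5] * 5)   # five sites at intermediate frequency
```

In this toy input, the all-singleton sample gives a negative D and the intermediate-frequency sample gives a positive D, matching the interpretation of the sign of D given in the text.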
It is important to note that TAJIMA (1990) obtained Equation 1 by considering the branching patterns of genealogies under a Moran model, for which the coalescent process is exact for the entire population. Further, he considered the rate of coalescence at time t in the past to be a function of only the number of distinct lineages at time t and did not account for the frequency trajectory of the substituting allele. The alternative approach is to account for the frequency trajectory of the substituting allele, in which case the rate of coalescence is given by
where x(t) is the frequency of the allele at time t in the past. The discrepancy between these two approaches will be largest for small sample sizes. For example, when n = 5 and
= 0 and following Tajima's arguments, the mean time to the first coalescence is
When accounting for the frequency of the allele, the expected time to the first coalescence is
[because x(0) = 1], resulting in a difference of
In the SIMULATION section below, I describe a simulation-based approach using the structured coalescent that accounts for the allele frequency trajectory. Using both coalescent and forward simulations, we see that the above formulas are good approximations for large sample sizes (say n ≥ 50).
| SIMULATION |
|---|
in the past, and the allele frequency trajectory during fixation is a random variable. Mutations occur according to the infinitely many-sites model. Figure 3 shows an example genealogy for a gene family of size two.
|
in the past, the duplicate locus fixed in the population. The duration of the fixation event is tf. Prior to
τ, events in the history of the sample include coalescence, crossing over between loci, and ectopic gene conversion between loci.
In units of 4N generations, the rate of coalescence is given by
![]() | (6) |
![]() | (7) |
Here, ρ = 4Nr is the scaled genetic distance between A and B. Ectopic gene conversion occurs at rate
![]() | (8) |
Structured coalescent:
At time
, the simulation enters a structured coalescent (e.g., HUDSON and KAPLAN 1988; KAPLAN et al. 1988; BRAVERMAN et al. 1995) phase to model the fixation of the new duplicate. At time t of the fixation process, the duplicate is at frequency x(t) in the population. Therefore, the fraction x(t) of the population bears the duplicate, and 1 – x(t) does not. During the structured phase, there are three distinct types of A chromosomes to keep track of (Figure 3). First, there are A chromosomes still linked to ancestors of B that have descendants in the sample. Second, there are A chromosomes not currently linked to ancestral B lineages, but whose ancestry is in the fraction x(t) of the population containing the duplicate (i.e., they are linked to B lineages nonancestral to the sample). Finally, there are A chromosomes whose ancestry at time t in the past is not linked to the duplicate locus. We label the first two types of A chromosomes as A+ and the third kind as A–. Examples of these types are shown in Figure 3.
I now list the rates at which events occur during the structured phase. In Equations 9–17, all rates are in units of 4N generations. Let the sample size of A– chromosomes be n1. The rate of coalescence between A– chromosomes is
![]() | (9) |
![]() | (10) |
There are four types of crossover events to consider. First, there is crossover in an AB pair, and the ancestor of the A region has an A– label:
![]() | (11) |
![]() | (12) |
![]() | (13) |
![]() | (14) |
The rate of gene conversion from A to B is
![]() | (15) |
![]() | (16) |
The rate of gene conversion from B to an A+ chromosome is
![]() | (17) |
The simulation continues in the structured phase until x(t) first reaches a value ≤ 1/2N. At this point, all remaining chromosomes belong to the same deme, and the standard coalescent algorithm applies until the grand MRCA of the sample is reached (Figure 3). Once the structured phase is exited, one of the remaining chromosomes is the MRCA of the duplicate locus, and the origination of the duplicate is therefore a random sample of a single allele from the ancestral locus (Figure 3).
Copy number variants:
So far, we have considered only the simulation of genealogies for duplication events that are fixed in the population. The method is easily extended to duplications observed to be segregating (CNVs). To model polymorphic duplicates, one must account for the unknown population frequency of the duplication. There are two reasonable options for simulation. First, if the duplicate gene is observed in k of n chromosomes, k/n is the maximum-likelihood estimate of the population frequency of the duplicate. The second approach would be to place a prior distribution on the population frequency. A natural choice for the prior is a beta (a, b) distribution, giving the posterior distribution on the population frequency of the duplicate as beta (a + k, b + n – k) (GELMAN et al. 2003, p. 40). I use the latter approach in this article, generating a new allele frequency from the posterior distribution for each simulated replicate. The prior distribution is the uniform distribution (beta (1, 1)). For the CNV model, the simulation enters the structured phase at
τ = 0.
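The posterior draw of the population frequency described above can be sketched as follows. This is an illustration rather than the article's implementation; the function name `sample_cnv_frequency` is hypothetical, and the defaults a = b = 1 correspond to the uniform prior stated in the text.

```python
import random

def sample_cnv_frequency(k, n, a=1.0, b=1.0, rng=random):
    """Draw a population frequency from the beta(a + k, b + n - k) posterior.

    k is the number of sampled chromosomes carrying the duplicate, n the sample size.
    """
    return rng.betavariate(a + k, b + n - k)

rng = random.Random(42)
# One new frequency would be drawn per simulated replicate; here we draw many
# to look at the posterior mean for a duplicate seen on 25 of 50 chromosomes.
draws = [sample_cnv_frequency(25, 50, rng=rng) for _ in range(20000)]
posterior_mean = sum(draws) / len(draws)
```

With a uniform prior, the posterior mean is (k + 1)/(n + 2), which differs slightly from the maximum-likelihood estimate k/n when the duplicate is rare or common in the sample.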
The frequency trajectory of a neutral mutation:
The fixation of the young duplication is modeled as a neutral process by simulating the trajectory of a neutral allele backward in time, from frequency x(τ) to 0, conditional on absorption at 0 (e.g., GRIFFITHS 2003). For the case where a gene duplication is fixed, at time τ when the simulation enters the structured phase, x(τ) = 1. For a CNV, x(τ) is beta-distributed as described above. These trajectories are generated by simulating a process of small jumps in allele frequency x per time interval Δt (COOP and GRIFFITHS 2004; PRZEWORSKI et al. 2005; TESHIMA and PRZEWORSKI 2006; TESHIMA et al. 2006). Conditional on absorption at 0, jumps in x are given by
![]() |
![]() |
in the limit Δt → 0. In this article, Δt = 1/50N, where N = 10⁴.
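A backward trajectory of this kind can be sketched with a simple jump process. The Python code below is a minimal illustration under stated assumptions and is not the article's code: time is measured in units of 2N generations, the drift term –x·Δt is the standard diffusion result for a neutral allele conditioned on loss, and each step moves by plus or minus one standard deviation, sqrt(x(1 – x)Δt), with equal probability. The function name `neutral_trajectory_backward` and the parameter values are hypothetical, and a much coarser Δt than the article's is used to keep the example fast.

```python
import random
from math import sqrt

def neutral_trajectory_backward(x0, two_n, dt, rng):
    """Frequencies of a neutral allele, going back in time, until it is nearly lost.

    Assumes time in units of 2N generations and conditioning on absorption at 0.
    Stops when the frequency first falls to 1/(2N) or below.
    """
    x = x0
    path = [x]
    while x > 1.0 / two_n:
        drift = -x * dt                  # pull toward 0 from conditioning on loss
        jump = sqrt(x * (1.0 - x) * dt)  # one s.d. of genetic drift over dt
        x += drift + (jump if rng.random() < 0.5 else -jump)
        x = min(1.0, max(0.0, x))        # guard against numerical overshoot
        path.append(x)
    return path

rng = random.Random(11)
# x0 = 1.0 corresponds to a duplication that has just fixed
path = neutral_trajectory_backward(1.0, two_n=200, dt=1e-3, rng=rng)
```

Starting from x0 = 1 the jump term is zero, so the first move is purely deterministic; for a CNV, x0 would instead be a draw from the posterior distribution of the population frequency.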
Model of ectopic gene conversion:
The model of conversion between duplicate loci is similar to WIUF and HEIN's (2000) model of conversion between alleles at a single-copy locus. The difference is that I assume that the entire duplicated region has been sampled and that the flanking regions are too divergent to be affected by gene conversion. Therefore, only events that both begin and end within the region are considered. For a fragment of L nucleotides, a conversion event begins at position i within the region and includes positions i through position i + l – 1 (i ≥ 1, i + l – 1 ≤ L).
The mean tract length is T, and tract lengths, l, are sampled from the truncated geometric distribution P(l = k | k ≤ L – i + 1) using the inverse c.d.f. method, where p = 1/T, and U is a uniformly distributed deviate from the interval (0, 1].
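One way to carry out inverse-c.d.f. sampling from a truncated geometric distribution is sketched below. The closed form used in the code is derived here from the c.d.f. of a geometric distribution truncated at M = L – i + 1 and is an assumption of this sketch, not a transcription of the article's formula; the function names `tract_length` and `sample_tract` are hypothetical.

```python
import math
import random

def tract_length(u, mean_tract, max_len):
    """Invert the truncated geometric c.d.f. at u in (0, 1].

    P(l <= k) = (1 - (1 - p)**k) / (1 - (1 - p)**max_len), with p = 1 / mean_tract.
    """
    p = 1.0 / mean_tract
    if p >= 1.0:
        return 1  # degenerate case: every tract has length 1
    q_m = (1.0 - p) ** max_len
    raw = math.log(1.0 - u * (1.0 - q_m)) / math.log(1.0 - p)
    # clamp to [1, max_len] to absorb floating-point error at the ends
    return min(max_len, max(1, math.ceil(raw)))

def sample_tract(start, region_len, mean_tract, rng):
    """Sample a tract length for an event starting at 1-based position `start`."""
    u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
    return tract_length(u, mean_tract, region_len - start + 1)

rng = random.Random(5)
lengths = [sample_tract(start=40, region_len=50, mean_tract=10.0, rng=rng)
           for _ in range(1000)]
```

Drawing U from (0, 1] rather than [0, 1) guarantees that the smallest possible tract has length 1 and the largest ends exactly at position L.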
This model of gene conversion differs from that of INNAN (2003a), who considered the case of intrachromosomal conversion (conversion between nonallelic positions on the same chromosome) affecting only one mutation per event. Here, I have relaxed that assumption, with events occurring between random chromosomes in the population and involving random amounts of DNA. Simulation results will, however, be qualitatively similar, in that increasing conversion rates will lead to fewer fixed differences, and more shared polymorphisms, between the two duplicates.
Implementation details:
Genealogies are generated using a modification of HUDSON's (2002) algorithm for bookkeeping of genealogies with recombination (both gene conversion and crossing over). The simulation is written in C++, using available libraries (THORNTON 2003). Source code for the coalescent simulation is available from the author's web site (http://www.molpopgen.org).
Forward simulations:
Forward simulations of a Wright–Fisher population were conducted using multinomial sampling to generate the gamete frequencies in the next generation. Mutations occur according to the infinitely many-sites model. A diploid population of 2N = 10,000 chromosomes, with θ = 10 and no recombination or selection, was evolved for 10N generations to reach statistical equilibrium. After reaching equilibrium, the simulation continued until a single substitution occurred, at which point independent samples of sizes 5, 25, and 50 were taken from the population and recorded. The purpose of the forward simulation in this study is to check some of the results obtained from coalescent simulations with an independent method (forward in time, rather than backward).
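The multinomial sampling step of a Wright–Fisher forward simulation can be sketched as follows. This is a minimal illustration, not the article's simulation: it omits mutation and tracks only counts of gamete types, and the function name `wright_fisher_generation` and the starting counts are made up for the example.

```python
import random

def wright_fisher_generation(counts, rng):
    """Multinomial sampling of the next generation's gamete counts.

    counts[i] is the number of chromosomes carrying gamete type i; the
    population size 2N = sum(counts) is held constant.
    """
    two_n = sum(counts)
    # each of the 2N offspring picks a parental type in proportion to its frequency
    draws = rng.choices(range(len(counts)), weights=counts, k=two_n)
    new_counts = [0] * len(counts)
    for g in draws:
        new_counts[g] += 1
    return new_counts

rng = random.Random(7)
counts = [60, 30, 10, 0]  # hypothetical starting counts for four gamete types
for _ in range(50):
    counts = wright_fisher_generation(counts, rng)
```

A type with zero count can never be drawn again, so loss is absorbing, and a substitution corresponds to a single type reaching count 2N.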
| RESULTS |
|---|
= 0. The expectations of
π, S, and D were estimated from 10⁵ coalescent and forward simulations, and the two simulation methods are in excellent agreement (Table 1). Also shown in Table 1 are the expectations predicted by Equations 1, 2, and 5, respectively. For large sample sizes, the simulations and the formulas are in good agreement. For smaller sample sizes, the discrepancies are rather large, because the formulas do not account for the allele frequency trajectory of the substitution event during fixation. The simulation results show that the expectation of Tajima's D statistic is negative when a fixation has occurred recently and that the expected level of diversity in the samples is also reduced.
|
= 0. As the rate of ectopic gene conversion increases, fewer fixed differences are observed between genes, and more shared polymorphisms are found in the data. As the mean length of conversion events increases, this effect becomes more pronounced (Figure 5), although there does not appear to be much of a difference between a mean tract length of
the sampled region compared to
the region. There is also a slight effect of interlocus crossing over on the expected SFS, as crossover events cause the two loci to have different histories (Figure 3). The results in Figure 4 are qualitatively similar to those of INNAN (2003a).
|
|
, the mean number of pairwise differences in the sample, and D, a summary of the site-frequency spectrum. The two important qualitative results are that a reduction in diversity and a skew in the SFS of polymorphisms are expected in recent gene duplicates (Figure 6) across a range of parameters. Further, when there is neither crossing over nor conversion between loci, the ancestral gene will show the same pattern of polymorphism as the duplicate locus, since they both have the same genealogy (Figure 6A). As the fixation time of the duplicate gene becomes more ancient, the expectations of both
π and D are more similar to what is expected under the standard neutral model, under which fixation events occur at random times.
|
becomes larger than the standard neutral expectation when the conversion rate is high (Figure 6D). The effect of conversion on
depends on the rate of crossing over—when the two loci are tightly linked, variation will be reduced on average when the fixation event is recent (Figure 6C), but when crossover rates are high, E[
] >
, even when
= 0 (Figure 6D). For ancient duplications (
), high rates of gene conversion result in E[
]
2
(INNAN 2003a, data not shown).
Patterns of polymorphism in copy number variants:
The observed number of occurrences of a copy-number variant affects whether or not gene conversion events are detectable as shared polymorphisms in the sample (Figure 7). When a polymorphic duplicate is rare in the sample, the duplicate allele is likely to be relatively young, and there will have been little time for gene conversion events to have occurred. When the conversion rate increases, such that 4Nc
, shared polymorphisms will tend to be observed only as singletons unless the sample frequency of the duplicate is relatively high (compare Figure 7A to 7B).
|
90%), the expectation of D will be negative, which is expected as the mutation is quite close to fixation in the population, and should thus show a pattern of polymorphism qualitatively similar to that of a fixed gene duplication (Figure 6).
|
and Tajima's D are summarized for the case of no crossing over and no gene conversion. When the duplicate gene is observed to be rare in the sample, D is expected to be slightly negative in both genes. When the duplicate is observed in 25 of the 50 chromosomes, D is expected to be positive in the ancestral gene and negative in the new gene. Finally, when the duplicate is at high frequency (45 of 50 in the sample), D is expected to be quite negative in both genes. The effect of sample size of the duplicate locus on D at the ancestral locus can be understood by considering that the observed sample count of the duplicate constrains the possible genealogies for the ancestral locus. For example, when there is no crossing over between loci, and the duplicate gene is present on 25 of 50 chromosomes, the 25 chromosomes bearing the new gene must reach their common ancestor before they are allowed to coalesce with the ancestors of chromosomes that do not carry the duplicate. Thus, the genealogy of the ancestral gene always contains a deep split, and a positive D is expected. Likewise, for a duplicate observed at high frequency in the sample, the genealogy of the ancestral gene will contain a deep split between relatively few lineages and many lineages, resulting in a negative D due to an excess of both rare and high-frequency derived alleles. Crossing over between loci eliminates these effects, because the genealogy of the ancestral locus can move between the duplicate-containing and duplicate-absent classes of chromosomes (compare Figures 8A and 8C). Figure 9 plots the mean of Fay and Wu's H as a function of the number of occurrences of the CNV in the sample. When there is no crossing over between loci, the expectation of H is negative in the ancestral gene when the frequency of the CNV in the sample is high, because the genealogy of the ancestral gene consists of a deep split of few lineages from the rest of the sample (see above). 
Thus, for evaluating hypotheses concerning the evolution of very young gene families, the standard coalescent is not an appropriate null model. It is important to consider the rate at which high-frequency derived CNVs will be observed in the genome, though. The results above consider the pattern of polymorphism given a CNV observed at a certain frequency. In a large equilibrium population with CNVs arising at rate
in the genome, the expected number of CNVs at a frequency 1
i < n is
/i, and therefore CNVs at frequencies such as 45 of 50 will be relatively rare.
|
To assess the applicability of standard tests of the standard neutral model (SNM) to recent gene duplicates, I simulated data over a range of parameters (10³ samples were generated for all combinations of θ = 10, n ∈ {10, 50}, ρ ∈ {0, 10, 100}, 4Nc ∈ {0, 1, 10}, and τ ∈ {0, 0.1, 0.2}). One-tailed P-values (lower tail) for Tajima's D and FAY and WU's (2000) H were obtained from 10⁴ replicates simulated under the SNM with θ = 10 and no recombination of any sort. The parameter combinations that resulted in rejection rates of at least 10% are shown in Table 2. The general pattern is that, in large sample sizes (n = 50), a significantly negative Tajima's D value will be inferred up to 15% of the time, and the effect of the fixation on patterns of polymorphism may persist at least as long as 0.8N generations. Rejection rates of ≥10% were seen only for Fay and Wu's H statistic when the gene conversion rate between duplicates was high (4Nc = 10). The effect is understandable by making an analogy to the selective sweep process: some lineages have ancestors more ancient than the origin of the gene duplication, due to the effect of gene conversion. When n = 10, rejection rates for all parameter combinations for both statistics were <10%. Thus, although an excess of rare alleles is expected for small sample sizes when a neutral substitution has occurred (Table 1), the effect will be difficult to detect in small sample sizes.
|
for a single locus was 10, split such that
θ = 8 and 2 at replacement and silent sites, respectively. For each replicate, the P-value of the MK test was obtained using Fisher's exact test. For all cases, the rejection rate for the test was <0.05, implying that the MK test is conservative when applied to data from recent duplicates (data not shown). For the highest conversion rate studied (4Nc = 10), the rejection rate was observed to be as low as 0.001. The reason for this effect is that high rates of gene conversion result in few fixed differences (Figure 4), which tends to result in high P-values for the MK test.
| DISCUSSION |
|---|
The results described here show that young duplicate genes are expected to show a reduction in diversity and an excess of rare alleles. This is an important point with respect to inferring if positive selection has acted on recent duplications, which is a critical issue in the debate over the relative roles of subfunctionalization (FORCE et al. 1999) vs. neofunctionalization in the preservation of duplicate genes (reviewed in LONG et al. 2003). For example, a recent study of three recent duplicates in Arabidopsis thaliana observed reduced variability in two of the three genes, as well as in some of the ancestral genes (MOORE and PURUGGANAN 2003). While Moore and Purugganan interpreted this observation as evidence for recent selective sweeps, implying positive selection on new functions, the reduction in diversity in the recent duplicates may simply be a consequence of the genes having fixed recently. Similarly, ignoring concerns about the appropriate demographic model for the species, reduced diversity in the ancestral genes may be a consequence of linkage between duplicates, given that the effective rate of crossing over and gene conversion in A. thaliana is expected to be quite low due to selfing (NORDBORG 2000).
THORNTON and LONG (2005) sequenced 12 X-linked duplicates with low divergence between duplicates at synonymous sites, and high nonsynonymous to synonymous ratios (dN/dS > 1) between duplicates, in a population sample of Drosophila melanogaster from Zimbabwe, Africa. The mean Tajima's D at third positions of codons in their data is –0.662, compared to an average of –0.186 observed in the predominantly single-copy, coding genes described in ANDOLFATTO (2005), also sampled from Zimbabwe. It is possible that at least part of this difference in average D is due to some of the duplicates having fixed recently. Further, overall diversity is low at many of the loci, compared to the average for the species, which is also expected if the genes are young. However, the distribution of P-values for the MK test between genes shows an excess of low values (THORNTON and LONG 2005), and the neutrality index (RAND and KANN 1996) is <1 for most comparisons, suggesting positive selection on amino acid fixations. Given that summaries of the data such as D and levels of diversity are confounded not only by the age of the duplication and the rate of gene conversion, but also by demographic history and the possibility that levels of selective constraint differ between single-copy and duplicate loci, it is possible that approaches based on the MK test will be the most fruitful in studying the role of selection in young genes.
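For readers unfamiliar with the two summaries mentioned above, the sketch below shows how an MK table can be tested with a two-sided Fisher's exact test and how the neutrality index of RAND and KANN (1996) is computed. It is an illustration only: the function names are hypothetical, the two-sided rule (summing all tables no more probable than the observed one) is one common convention, and the counts in the example are made up rather than taken from any data set discussed here.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # hypergeometric probability of x in the top-left cell, margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # sum over all tables at least as extreme (no more probable) than observed
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def neutrality_index(pn, ps, dn, ds):
    """NI = (Pn / Ps) / (Dn / Ds); values < 1 indicate excess replacement fixations."""
    return (pn / ps) / (dn / ds)

# Hypothetical counts: replacement/silent polymorphisms and fixed differences
pn, ps, dn, ds = 2, 8, 8, 2
p_value = fisher_exact_two_sided(dn, ds, pn, ps)
ni = neutrality_index(pn, ps, dn, ds)
```

When gene conversion leaves few fixed differences, the Dn and Ds cells are small and the exact test has little power, which is consistent with the conservative behavior of the MK test reported in the RESULTS.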
In Drosophila species, several copy-number variants have been described in natural populations (TAKANO et al. 1989; LANGE et al. 1990; LOOTENS et al. 1993), although levels of variability at the nucleotide level remain unstudied. In humans, the emphasis so far has been on the description of genomewide patterns of CNVs (see Introduction), although SNP data from copy-number variants will likely be available soon. Although the major motivation to study CNVs in humans has been the potential that they are involved in the genetic basis of diseases, there is also the potential to learn about the evolutionary forces shaping young genes that are still segregating in natural populations. The simulations performed in this study suggest that rare CNV mutants will be low in diversity, which may make it difficult to infer the role of selection on such polymorphisms in the genome. However, such data will be very informative about the number of polymorphic pseudogenes and functional genes in the human and other genomes. Further, studying the genomewide site-frequency spectrum of polymorphic pseudogenes and functional genes will be informative about the role of selection on duplicates during processes of fixation in, or loss from, the genome.
The coalescent model presented here is highly simplified. Some of these simplifications, such as the absence of intragenic crossing over or gene conversion, are easily relaxed. Others, such as the use of a simple model of gene conversion, are more difficult to relax, and more complex models may be better studied by forward simulation. For example, TESHIMA and INNAN (2004) considered a model where gene conversion events are allowed to occur until divergence between duplicates reached some threshold value. Such models violate the assumption of the coalescent process that the genealogy can be studied independently of the mutation process, and hence Teshima and Innan used a forward simulation approach. An additional biological complication arises from the observation that large duplications suppress local rates of crossing over when heterozygous (ROBERTS and BRODERICK 1982), suggesting that CNVs may contribute to heterogeneity in local recombination rates and variation in the decay of linkage disequilibrium across regions of the genome.
In this study, I assumed that the fixation of the gene duplicate occurred by drift. It is straightforward to incorporate simple models of directional selection into the simulation, by replacing the neutral frequency trajectory with one for a positively selected mutation (COOP and GRIFFITHS 2004). The most obvious effect of a fixation by positive selection is a more pronounced skew in the site-frequency spectrum when selection is very strong. A second effect is fewer shared polymorphisms between gene duplicates, as the rate of coalescence during the sweep becomes much faster than the rate of conversion.
| ACKNOWLEDGEMENTS |
|---|
| LITERATURE CITED |
|---|
ANDOLFATTO, P., 2005 Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149–1152.
ARGUELLO, J. R., Y. CHEN, S. YANG, W. WANG and M. LONG, 2006 Origination of an X-linked testes chimeric gene by illegitimate recombination in Drosophila. PLoS Genet. 2: e77.
BAILEY, J. A., Z. GU, R. A. CLARK, K. REINERT, R. V. SAMONTE et al., 2002 Recent segmental duplications in the human genome. Science 297: 1003–1007.
BAILEY, J. A., D. M. CHURCH, M. VENTURA, M. ROCCHI and E. E. EICHLER, 2004 Analysis of segmental duplications and genome assembly in the mouse. Genome Res. 14: 789–801.
BETRAN, E., and M. LONG, 2003 Dntf-2r, a young Drosophila retroposed gene with specific male expression under positive Darwinian selection. Genetics 164: 977–988.
BETRAN, E., K. THORNTON and M. LONG, 2002 Retroposed new genes out of the X in Drosophila. Genome Res. 12: 1854–1859.
BRAVERMAN, J. M., R. R. HUDSON, N. L. KAPLAN, C. H. LANGLEY and W. STEPHAN, 1995 The hitchhiking effect on the site frequency-spectrum of DNA polymorphisms. Genetics 140: 783–796.
CHEUNG, J., X. ESTIVILL, R. KHAJA, J. R. MACDONALD, K. LAU et al., 2003 Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 4: R25.
CONRAD, D. F., T. D. ANDREWS, N. P. CARTER, M. E. HURLES and J. K. PRITCHARD, 2006 A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 38: 75–81.
COOP, G., and R. C. GRIFFITHS, 2004 Ancestral inference on gene trees under selection. Theor. Popul. Biol. 66: 219–232.
EWENS, W., 2004 Mathematical Population Genetics I. Theoretical Introduction, Ed. 2. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
FAN, C., and M. LONG, 2007 A new retroposed gene in Drosophila heterochromatin detected by microarray-based genomic hybridization. J. Mol. Evol. 64: 272–283.
FAY, J., and C.-I. WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155: 1405–1413.
FORCE, A., M. LYNCH, F. B. PICKETT, A. AMORES, Y. L. YAN et al., 1999 Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–1545.
FU, Y. X., and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–709.
GAO, L. Z., and H. INNAN, 2004 Very low gene duplication rate in the yeast genome. Science 306: 1367–1370.
GELMAN, A., J. B. CARLIN, H. S. STERN and D. B. RUBIN, 2003 Bayesian Data Analysis, Ed. 2. Chapman & Hall/CRC, London/New York.
GRAUBERT, T. A., P. CEHAN, D. EDWIN, R. R. SELZER, T. A. RICHMOND et al., 2007 A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet. 3: e3.
GRIFFITHS, R. C., 2003 The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor. Popul. Biol. 64: 241–251.
GU, Z., D. NICOLAE, H. LU and W. LI, 2002a Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18: 609–613.
GU, Z. L., A. CAVALCANTI, F. C. CHEN, P. BOUMAN and W. H. LI, 2002b Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol. Biol. Evol. 19: 256–262.
GU, Z. L., L. M. STEINMETZ, X. GU, C. SCHARFE, R. W. DAVIS et al., 2003 Role of duplicate genes in genetic robustness against null mutations. Nature 421: 63–66.
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183–201.
HUDSON, R. R., 1990 Gene genealogies and the coalescent process, pp. 1–42 in Oxford Surveys in Evolutionary Biology, Vol. 7, edited by D. FUTUYAMA and J. ANTONOVICS. Oxford University Press, Oxford.
HUDSON, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
HUDSON, R. R., and N. L. KAPLAN, 1988 The coalescent process in models with selection and recombination. Genetics 120: 831–840.
HUDSON, R. R., M. KREITMAN and M. AGUADE, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153–159.
IAFRATE, A. J., L. FEUK, M. N. RIVERA, M. L. LISTEWNIK, P. K. DONAHOE et al., 2004 Detection of large-scale variation in the human genome. Nat. Genet. 36: 949–951.
INNAN, H., 2003a The coalescent and infinite-site model of a small multigene family. Genetics 163: 803–810.
INNAN, H., 2003b A two-locus gene conversion model with selection and its application to the human RHCE and RHD genes. Proc. Natl. Acad. Sci. USA 100: 8793–8798.
JONES, C. D., A. W. CUSTER and D. J. BEGUN, 2005 Origin and evolution of a chimeric fusion gene in Drosophila subobscura, D. madeirensis and D. guanche. Genetics 170: 207–219.
KAPLAN, N. L., T. DARDEN and R. R. HUDSON, 1988 The coalescent process in models with selection. Genetics 120: 819–829.
KONDRASHOV, F., I. ROGOZIN, Y. WOLF and E. KOONIN, 2002 Selection in the evolution of gene duplications. Genome Biol. 3: 0008.1–0008.9.
KONDRASHOV, F. A., and A. S. KONDRASHOV, 2006 Role of selection in fixation of gene duplications. J. Theor. Biol. 239: 141–151.
LANGE, B. W., C. H. LANGLEY and W. STEPHAN, 1990 Molecular evolution of Drosophila metallothionein genes. Genetics 126: 921–932.
LEVINE, M., C. D. JONES, A. D. KERN, H. A. LINDFORS and D. J. BEGUN, 2006 Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc. Natl. Acad. Sci. USA 103: 9935–9939.
LI, J., T. JIANG, J.-H. MAO, A. BALMAIN, L. PETERSON et al., 2004 Genomic segmental polymorphisms in inbred mouse strains. Nat. Genet. 36: 952–954.
LINDSAY, S. J., M. KHAJAVI, J. R. LUPSKI and M. E. HURLES, 2006 A chromosomal rearrangement hotspot