- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Data Supplement
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Padhukasahasram, B.
- Articles by Nordborg, M.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Padhukasahasram, B.
- Articles by Nordborg, M.
Genetics, Vol. 178, 2417-2427, April 2008, Copyright © 2008
doi:10.1534/genetics.107.085332
Exploring Population Genetic Models With Recombination Using Efficient Forward-Time Simulations
Badri Padhukasahasram*,1,
Paul Marjoram
,
Jeffrey D. Wall
,
Carlos D. Bustamante* and
Magnus Nordborg
* Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14850,
Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California 90089,
Institute of Human Genetics, University of California, San Francisco, California 94143 and
Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089
1 Corresponding author: Biological Statistics and Computational Biology, Room 169, Biotechnology Building, Cornell University, Ithaca, NY 14850.
E-mail: bp85{at}cornell.edu
>ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
We present an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population under the Wright–Fisher model. We used simulations based on this algorithm to verify the accuracy of the ancestral recombination graph approximation by comparing it to the exact Wright–Fisher scenario. We find that the recombination graph is generally a very good approximation for models with complete outcrossing, whereas, for models with self-fertilization, the approximation becomes slightly inexact for some combinations of selfing and recombination parameters.
COALESCENT theory provides a continuous-time approximation for the history of small samples in large populations and coalescent simulation is a widely used tool in population genetics. Under this framework, the genealogy of a sample of DNA sequences is modeled backward in time and neutral mutations are superposed on this genealogy to generate sequence polymorphism data (KINGMAN 1982; HUDSON 1983; ROSENBERG and NORDBORG 2002). Forward simulations, in contrast, model the evolution of all the sequences in a population exactly, forward in time and generation by generation. Because coalescent simulations consider only those chromosomes that carry material ancestral to the sample, and, by making a continuous-time approximation skip uninteresting generations whose events do not affect the sample, they are computationally much more efficient than forward simulation programs. However, despite their inefficiency, forward simulations are necessary if we wish to simulate data sets under complex and realistic biological scenarios (e.g., natural selection at multiple linked loci) that are difficult to model accurately using the coalescent. Given the dramatic growth in the power of computing, forward-time simulations are currently feasible for large genomic regions (e.g., megabase scale) and many simulation packages have been developed recently (e.g., BALLOUX 2001; HEY 2004; HOGGART et al. 2005; PENG and KIMMEL 2005; DUDEK et al. 2006; GUILLAUME and ROUGEMONT 2006; SANFORD et al. 2007) and have also found important applications (e.g., BALLOUX and GOUDET 2002; PINEDA-KRCH and REDFIELD 2005; PENG and KIMMEL 2007). Here, we present an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population undergoing mutations, recombination, and natural selection at multiple linked loci. In contrast to existing forward-time simulators that consider the population genealogy generation by generation, our forward algorithm uses the genealogical information for multiple generations at a time, and on the basis of this information, simulates only those chromosomes in the next generation that can potentially contribute to the future population. We show that such a forward–backward scheme combined with other optimizations can lead to substantial improvements in run-time efficiency. We use our simulation program to evaluate coalescent models with recombination by comparing them to the exact Wright–Fisher model.
ABSTRACT
>SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
- Let gen(0) represent the current generation, gen(1) represent the generation being simulated, and gen(2), gen(3), gen(4), ... , etc., represent subsequent generations. Before creating the individuals of gen(1), we generate the future genealogical information of the population for k generations [i.e., information required for creating gen(2) to gen(k + 1)]. This involves simulating the ancestry of the chromosomes in the next k generations and determining whether or not they will undergo recombination in any particular generation. Using this information, we can see that two key events are possible:
- All the descendants of a chromosome belonging to gen(1) may be lost by gen(k + 1) without any of their homologs having undergone recombination.
- A chromosome or its homolog may recombine in gen(2), but both of them can lose all their descendants by gen(k + 1) without any of their homologs having undergone any subsequent recombination [i.e., from gen(3) to gen(k + 1)]. Chromosomes that belong to categories a or b, cannot potentially leave any trace in the future population that exists at gen(k + 1). Therefore, it is not necessary to explicitly simulate such chromosomes in gen(1).
- All the descendants of a chromosome belonging to gen(1) may be lost by gen(k + 1) without any of their homologs having undergone recombination.
- Chromosomes of gen(1) are created by randomly sampling chromosomes from gen(0) and determining whether or not they undergo recombination. When a chromosome of gen(0) gets chosen the first time and does not recombine, we simply exchange the pointers to the arrays between gen(1) and gen(0) to create a new chromosome of gen(1). If it gets picked again or if it undergoes recombination, we create a new chromosome by copying parts of the relevant arrays into the arrays of gen(1) using the memcpy() function. The only exception to this occurs when a recombination is the last event involving a particular individual. In this case, we first exchange the pointers to the arrays for one of the homologs (provided it has not been picked already) and explicitly copy only part of the other array using memcpy(). After creating all the chromosomes of gen(1), we generate new mutation locations for each chromosome and insert them into the sorted arrays using a binary search and the memove() function.
- Using the future genealogical information, create only those chromosomes in gen(1) that can potentially contribute to the population that exists at gen(k + 1). Assume that the other arrays are empty. [Note that the main idea in step 2, when creating a new generation, was to reuse the arrays from the previous generation as much as possible and avoid copying. If we eliminate some of the nonancestral chromosomes from any generation (see APPENDIX A), the fraction of chromosomes for which explicit array copying is necessary decreases substantially.]
- Update the future genealogy by one more generation. Repeat step 3. Remove nonpolymorphic locations from the population at regular intervals.
- Simulate the whole population during the last k + 2 generations of the simulation (i.e., if the simulation is run for l generations, we explicitly simulate all the chromosomes from generations l – k – 1 to l) as well as during the last k + 2 generations up to the generation during which nonpolymorphic locations get removed from the population (i.e., if fixed mutations get removed every n generations, then we explicitly simulate all the chromosomes from generations n – k – 1 to n, 2n – k – 1 to 2n, 3n – k – 1 to 3n, ... , etc.). [Note that this last step is essential because at the end of the simulation as well as during the generation at which nonpolymorphic locations get removed, we require all the chromosomes present in the population (for example, to determine which chromosomal locations are nonpolymorphic) and not just the ones that can contribute to the future population.]
The parameter k has to be chosen optimally for this algorithm to work most effectively. There is a trade-off between the computational effort spent to look forward for k generations and the effort saved by the elimination of nonancestral chromosomes from the next generation. In terms of run-time complexity, creating a new generation mainly involves array copy operations that take linear [i.e., O(N)] time in terms of the number of mutations [binary search takes O(ln(N)) time while exchanging the pointers takes constant time] accumulated in the array. In contrast, looking forward for a few generations takes only constant time and is independent of the number of mutations carried.
The expected proportion of individuals in categories a or b first increases as the depth of the look-ahead (i.e., k) increases but eventually becomes nearly constant. So, increasing k beyond a certain range will not be desirable. The expected proportion decreases as the per-generation per-sequence recombination rate increases, and therefore this strategy becomes less effective for high values of r. Table 1 shows the expectation of the fraction of chromosomes in category a as a function of k and r for a population with 500 diploid individuals and evolving under the standard neutral model (also see APPENDIX A). In all the simulations presented here, we use a fixed value of k = 8 and remove nonpolymorphic locations after every N generations, where N denotes the size of the population (APPENDIX B shows some run-time comparisons for different values of the look-ahead parameter).
|
For models with natural selection, we simulate the evolution of selected and neutral sites separately. We first generate the future genealogical information by simulating the ancestry of all the chromosomes in the population considering only the selected sites. If the number of sites under selection remains small, this information can be generated quickly. Then, using this information, we simulate the evolution of the remaining (neutral) sites according to the algorithm described earlier. Note that as the proportion of sites under selection increases, the look-ahead strategy becomes relatively less effective.
Random number generation:
Random numbers are generated using the Mersenne Twister algorithm (MATSUMOTO and NISHIMURA 1998). The external files mtrand.cpp and mtrand.h are used along with our program to enable random number generation. mtrand.cpp is a fast and high-quality random number generator whose period length is a large prime number that is one less than a power of 2.
Comparison with other forward simulation programs:
We first compare the approximate running time of our simulation program (FORWSIM) with two other currently available forward-time simulation programs for the standard neutral model. These comparisons demonstrate that our look-ahead strategy combined with other standard optimizations can result in large gains in run-time efficiency (Table 2a, APPENDIX C shows some run-time comparisons for models with natural selection at multiple sites). The comparisons also confirmed that, for all the programs tested, the means and distributions of some simple summary statistics are in agreement with coalescent simulations (Table 2b, Figure 1).
|
|
ABSTRACT
SIMULATION ALGORITHM
>IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
- It assumes that the lineages we follow backward in time recombine only with nonancestral lineages. This follows because we are tracing the ancestry of small samples in large populations and therefore the number of lineages ancestral to the sample remains small compared to the total population size.
- It also assumes that in a large population all the recombination events are independent of one another. We use forward simulations of the exact Wright–Fisher model with and without self-fertilization and compare the expected decay of pairwise linkage disequilibrium (LD) to values generated with equivalent coalescent simulations and verify the accuracy of these approximations.
When the recombination rate is high, the number of ancestral lineages in the recombination graph can become very large and so it is not obvious whether the first approximation will be accurate in finite populations. When there is partial selfing, going backward in time, a pair of lineages resulting from a single recombination event can spend a significant amount of time together within the same ancestors before they find different parents or coalesce. Thus, it can be shown that there is a significant probability that such lineages may recombine again (i.e., overlapping recombinations) before they find different ancestors (see APPENDIX D, Figure 2a). This clearly violates the assumption that all recombination events happen independently of one another. For models without selfing, the probability of such overlapping recombination events is expected to be much smaller as long as the population size is reasonably large (see APPENDIX D, Figure 2b).
|
Figure 3 shows the expected decay of the absolute value of pairwise D' (the normalized measure of LD that takes values between –1 and 1) for forward-time simulations with selfing, forward-time simulations without selfing, and coalescent simulations with equivalent parameters. When simulating using the coalescent, we assume that recombination happens only with nonancestral chromosomes, which ignores the chance of recombination events between ancestral lineages. When r is high, the expected value of D' is slightly higher in forward simulations without selfing than in comparable coalescent simulations presumably because the number of ancestral lineages is large and recombinations with ancestral lineages are not rare. Because recombination events with ancestral lineages will be associated with some coalescence, we reach the most recent common ancestor slightly sooner in the exact case than in the ARG. Nevertheless, for models with only outcrossing, we see from results in Figures 1, 3, and from Table 2, that the expectations and distributions under the coalescent with recombination are close to the expectations under the exact scenario even for higher values of r. We also compared the frequencies of some triplet based LD patterns (PADHUKASAHASRAM et al. 2004, 2006) at different distances and reached similar conclusions (results not shown here). For models with selfing, the ARG remains a close approximation to reality as long as either r or s remains small (Figures 3 and 4, Table 3). When r and s are both very high, the ARG approximation breaks down due to overlapping recombination events and expected value of D' is significantly higher in forward simulations compared to the equivalent coalescent model.
|
|
|
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
>SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
We checked the accuracy of the ancestral recombination graph approximation by comparing the expected decay of pairwise linkage disequilibrium in forward and coalescent simulations. Our results indicate that the standard coalescent with recombination will be a close approximation to the exact scenario for completely outcrossing populations with 2N = 1000 chromosomes or more, even for higher values of r. The ARG is also a good approximation for models with selfing as long as either the selfing rate (s) or recombination rate (r) remains small. When s and r are both very high, the scaled ARG for partial self-fertilization becomes slightly inexact due to substantial probability of overlapping recombination events. Therefore, for such parameter ranges, it is best to simulate data sets using exact Wright–Fisher simulations (or alternately modify existing coalescent simulation programs to allow for overlapping recombination events).
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
>APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
Assuming n copies of a chromosome in the current generation, the probability that exactly m copies get chosen in the next generation is s(n, m) = 2NCm(n/2N)m(1 – n/2N)2N–m.
The chance that a chromosome does not recombine in any given generation is approximately e–r.
Assuming n copies of a chromosome in the current generation, the chance that none of the copies of the chromosome that get picked in the next generation, recombine, can be approximated as
![]() | (A1) |
For k = 1 and r > 0, a chromosome can be lost if it does not get picked in gen(2). Therefore, the chance that a chromosome is lost without its homolog having undergone recombination is
![]() |
![]() | (A2) |
In general, when N is large, a is small and r > 0, the chance P(a) that all the descendants of a chromosome from gen(1) are lost at gen(a + 2) but not before that, without any of their homologs having undergone recombination is nearly
, where m1, m2, ... , ma can all vary from 1 to 2N – 1. Therefore, for k > 0, the total probability T(k) that all the copies of a chromosome are lost by gen(k + 1), without any of their homologs having undergone any recombination is nearly
![]() | (A3) |
|
|
For r = 0, T(k) is exactly equal to
![]() | (A4) |
, where m1, m2, ..., ma all vary from 1 to 2N – 1.
|
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
>APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
|
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
>APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
|
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
>APPENDIX D
ACKNOWLEDGEMENTS
LITERATURE CITED
The probability that it finds different ancestors in the first generation is (1 – s)(1 – 1/N) (i.e., not created by selfing and choose different ancestors).
The probability that it finds different ancestors in the second generation is (s/2 + (1 – s)/(2N))(1 – s)(1 – 1/N) [i.e., probability of selfing but choosing different chromosomes in the same ancestor in the first generation (i.e., s/2) or not created by selfing but picking different chromosomes within the same parent in the first generation (i.e., (1 – s)/2N) and then not created by selfing and choosing different ancestors in the second generation].
The probability that it finds different ancestors in the second generation without further recombination is (s/2 + (1 – s)/(2N))(1 – s)(1 – 1/N)e–2r.
In general, the probability that a pair finds different ancestors in the nth generation but not before that is: (s/2 + (1 – s)/(2N))n–1(1 – s)(1 – 1/N) and the probability that this happens without subsequent recombinations is (s/2 + (1 – s)/(2N))n–1(1 – s)(1 – 1/N)e–(2n–2)r.
Thus, the total probability that a recombination event will eventually split lineages into two separate ancestors is (summing up to n = infinity) 2(1 – s)(1 – 1/N)/(2 – (s + (1 – s)/(N))). Now, the total probability that this happens without any subsequent recombination events is (summing appropriate terms up to n = infinity) 2(1 – s)(1 – 1/N)/(2 – (s + (1 – s)/(N))e–2r). Therefore, the probability of overlapping recombination events, given that after a recombination event the pair of lineages will find different ancestors is (s + (1 – s)/(N))(1 – e–2r)/(2 – (s + (1 – s)/(N))e–2r). Figure 2a shows this probability as a function of s and r.
Similarly, for models without self-fertilization, going backward in time, the probability that a pair of lineages finds different ancestors in the first generation is (1 – 1/N). The probability of finding different ancestors in the second generation is (1 – 1/N)/2N (i.e., pick different chromosomes in the same parent in the first generation (i.e., 1/2N) and pick different parents in the second generation, etc.). The probability of finding different ancestors in the second generation without subsequent recombination is (1 – 1/N)e–2r/2N.
In general, the probability that a pair of lineages finds different ancestors in the nth generation is (1 – 1/N)/(2N)n–1and the probability that this happens without subsequent recombinations is
![]() |
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
>ACKNOWLEDGEMENTS
LITERATURE CITED
ABSTRACT
SIMULATION ALGORITHM
IS THE ANCESTRAL RECOMBINATION...
SUMMARY
APPENDIX A: CALCULATIONS
APPENDIX B
APPENDIX C
APPENDIX D
ACKNOWLEDGEMENTS
>LITERATURE CITED
BALLOUX, F., 2001 EASYPOP (Version 1.7): a computer program for population genetics simulation. J. Hered. 92: 301–302.
BALLOUX, F., and J. GOUDET, 2002 Statistical properties of population differentiation estimators under stepwise mutation in a finite island model. Mol. Ecol. 11: 771–783.[CrossRef][Medline]
DUDEK, S. M., A. A. MOTSINGER, D. R. VELEZ, S. M. WILLIAMS and M. D. RITCHIE, 2006 Data simulation software for whole-genome association and other studies in human genetics. Pac. Sym. Biocomput. 11: 499–510
GRIFFITHS, R. C., and P. MARJORAM, 1996 Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3: 479–502.[Medline]
GUILLAUME, F., and J. ROUGEMONT, 2006 Nemo: an evolutionary and population genetics programming framework. Bioinformatics 22: 2556–2557.
HEY, J., 2004 FPG: A computer program for forward population genetic simulation. http://lifesci.rutgers.edu/
heylab/HeylabSoftware.htm#FPG.
HOGGART, C., T. G. CLARK, R. LAMPARIELLO, M. DE IORIO, J. WHITTAKER et al., 2005 FREGENE: software for simulating large genomic regions. Technical Report. Department of Epidemiology and Public Health, Imperial College, London.
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183–201.[CrossRef][Medline]
HUDSON, R. R., 2002 Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
KINGMAN, J. F. C., 1982 The coalescent. Stochast. Proc. Appl. 13: 235–248.[CrossRef]
MATSUMOTO, M., and T. NISHIMURA, 1998 Mersenne Twister: a 623 dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul. 8: 3–30.[CrossRef]
NORDBORG, M., 2000 Linkage disequilibrium, gene trees, and selfing: an ancestral recombination graph with partial self-fertilization. Genetics 154: 923–929.
PADHUKASAHASRAM, B., P. MARJORAM and M. NORDBORG, 2004 Estimating the rate of gene-conversion on human chromosome 21. Am. J. Hum. Genet. 75: 386–397.[CrossRef][Medline]
PADHUKASAHASRAM, B., J. D. WALL, P. MARJORAM and M. NORDBORG, 2006 Estimating recombination rates from single-nucleotide polymorphisms using summary statistics. Genetics 174: 1517–1528.
PENG, B., and M. KIMMEL, 2005 simuPOP: a forward-time population genetics simulation environment. Bioinformatics 21: 3686–3687.
PENG, B., and M. KIMMEL, 2007 Simulations provide support for the common disease–common variant hypothesis. Genetics 175: 763–776.
PINEDA-KRCH, M., and R. J. REDFIELD, 2005 Persistence and loss of meiotic recombination hotspots. Genetics 169: 2319–2333.
ROSENBERG, N. A., and M. NORDBORG, 2002 Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3: 380–390.[CrossRef][Medline]
SANFORD, J., J. BAUMGARDNER, W. BREWER, P. GIBSON and W. REMINE, 2007 Mendel's accountant: a biologically realistic forward-time population genetics program. Scalable Computing: Practice and Experience. 8: 147–165
Communicating editor: M. K. UYENOYAMA
This article has been cited by other articles:
![]() |
Y. Kim and T. Wiehe Simulation of DNA sequence evolution under models of recent directional selection Brief Bioinform, January 1, 2009; 10(1): 84 - 96. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. K. Chen, P. Marjoram, and J. D. Wall Fast and flexible simulation of DNA sequence data Genome Res., January 1, 2009; 19(1): 136 - 142. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. D. Hernandez A flexible forward simulator for populations subject to selection and demography Bioinformatics, December 1, 2008; 24(23): 2786 - 2787. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Data Supplement
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Padhukasahasram, B.
- Articles by Nordborg, M.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Padhukasahasram, B.
- Articles by Nordborg, M.


= 200.0 and
= 200.0 and forward simulations with r = 0.100, u = 0.100, and 2N = 1000, using FORWSIM, FREGENE, and FPG. (b) The distribution of the number of distinct haplotypes H for coalescent simulations with 
















