# Exploring Population Genetic Models With Recombination Using Efficient Forward-Time Simulations

- Badri Padhukasahasram
^{*},^{1}, - Paul Marjoram
^{†}, - Jeffrey D. Wall
^{‡}, - Carlos D. Bustamante
^{*}and - Magnus Nordborg
^{§}

^{*}Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14850,^{†}Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California 90089,^{‡}Institute of Human Genetics, University of California, San Francisco, California 94143 and^{§}Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089

- 1
*Corresponding author:*Biological Statistics and Computational Biology, Room 169, Biotechnology Building, Cornell University, Ithaca, NY 14850. E-mail: bp85{at}cornell.edu

## Abstract

We present an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population under the Wright–Fisher model. We used simulations based on this algorithm to verify the accuracy of the ancestral recombination graph approximation by comparing it to the exact Wright–Fisher scenario. We find that the recombination graph is generally a very good approximation for models with complete outcrossing, whereas, for models with self-fertilization, the approximation becomes slightly inexact for some combinations of selfing and recombination parameters.

COALESCENT theory provides a continuous-time approximation for the history of small samples in large populations and coalescent simulation is a widely used tool in population genetics. Under this framework, the genealogy of a sample of DNA sequences is modeled backward in time and neutral mutations are superposed on this genealogy to generate sequence polymorphism data (Kingman 1982; Hudson 1983; Rosenberg and Nordborg 2002). Forward simulations, in contrast, model the evolution of all the sequences in a population exactly, forward in time and generation by generation. Because coalescent simulations consider only those chromosomes that carry material ancestral to the sample, and, by making a continuous-time approximation skip uninteresting generations whose events do not affect the sample, they are computationally much more efficient than forward simulation programs. However, despite their inefficiency, forward simulations are necessary if we wish to simulate data sets under complex and realistic biological scenarios (*e.g.*, natural selection at multiple linked loci) that are difficult to model accurately using the coalescent. Given the dramatic growth in the power of computing, forward-time simulations are currently feasible for large genomic regions (*e.g.*, megabase scale) and many simulation packages have been developed recently (*e.g.*, Balloux 2001; Hey 2004; Hoggart *et al.* 2005; Peng and Kimmel 2005; Dudek *et al.* 2006; Guillaume and Rougemont 2006; Sanford *et al.* 2007) and have also found important applications (*e.g.*, Balloux and Goudet 2002; Pineda-Krch and Redfield 2005; Peng and Kimmel 2007). Here, we present an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population undergoing mutations, recombination, and natural selection at multiple linked loci. In contrast to existing forward-time simulators that consider the population genealogy generation by generation, our forward algorithm uses the genealogical information for multiple generations at a time, and on the basis of this information, simulates only those chromosomes in the next generation that can potentially contribute to the future population. We show that such a forward–backward scheme combined with other optimizations can lead to substantial improvements in run-time efficiency. We use our simulation program to evaluate coalescent models with recombination by comparing them to the exact Wright–Fisher model.

## SIMULATION ALGORITHM

Our algorithm is implemented in the C++ programming language and we simulate data sets under the Wright–Fisher model assumptions. Individuals in a population are assumed to be diploid, the population size is assumed constant (this assumption can readily be relaxed), and generations are always nonoverlapping. Chromosomes within the population are represented by sorted arrays of integers that correspond to the locations of their mutations in base pairs. In this representation, a location is considered polymorphic if it occurs in some but not all of the chromosomes. Over time, the chromosome arrays undergo changes due to recombination (*i.e.*, are partially replaced by parts of other arrays) and mutation (*i.e.*, new integer locations get inserted). They also increase or decrease in the number of copies due to genetic drift. At any given time, we keep track of chromosomes belonging only to the current and previous generations and keep reusing these arrays. We use a pseudo infinite-sites model for mutations (*i.e.*, where the number of sites is finite but new mutations can appear only at nonpolymorphic locations) and a finite-sites model for recombination and remove locations that are no longer polymorphic, at regular intervals. The total number of new mutations added to a chromosome in any particular generation is modeled as a Poisson random variable with mean equal to the per-generation per-sequence mutation rate *u*. Meiotic recombination is modeled as a single crossing-over event and the probability that a recombination event occurs in any particular generation is equal to 1 *− e ^{−r}*, where

*r*denotes the per-generation per-sequence rate of recombination.Our forward algorithm proceeds by simulating the population in the next generation as a function of (i) the population in the previous generation and (ii) the simulated genealogy for the next

*k*generations. We use the simulated genealogy to tell us which of the chromosomes in the next generation can contribute material to the future population that exists after

*k*generations from now. Only those chromosomes are explicitly simulated in the next generation since all the other chromosomes are destined to disappear. We outline all the steps in our simulation program below:

Let gen(0) represent the current generation, gen(1) represent the generation being simulated, and gen(2), gen(3), gen(4), … , etc., represent subsequent generations. Before creating the individuals of gen(1), we generate the future genealogical information of the population for

*k*generations [*i.e.*, information required for creating gen(2) to gen(*k +*1)]. This involves simulating the ancestry of the chromosomes in the next*k*generations and determining whether or not they will undergo recombination in any particular generation. Using this information, we can see that two key events are possible:All the descendants of a chromosome belonging to gen(1) may be lost by gen(

*k +*1) without any of their homologs having undergone recombination.A chromosome or its homolog may recombine in gen(2), but both of them can lose all their descendants by gen(

*k +*1) without any of their homologs having undergone any subsequent recombination [*i.e.*, from gen(3) to gen(*k +*1)]. Chromosomes that belong to categories*a*or*b*, cannot potentially leave any trace in the future population that exists at gen(*k +*1). Therefore, it is not necessary to explicitly simulate such chromosomes in gen(1).

Chromosomes of gen(1) are created by randomly sampling chromosomes from gen(0) and determining whether or not they undergo recombination. When a chromosome of gen(0) gets chosen the first time and does not recombine, we simply exchange the pointers to the arrays between gen(1) and gen(0) to create a new chromosome of gen(1). If it gets picked again or if it undergoes recombination, we create a new chromosome by copying parts of the relevant arrays into the arrays of gen(1) using the memcpy() function. The only exception to this occurs when a recombination is the last event involving a particular individual. In this case, we first exchange the pointers to the arrays for one of the homologs (provided it has not been picked already) and explicitly copy only part of the other array using memcpy(). After creating all the chromosomes of gen(1), we generate new mutation locations for each chromosome and insert them into the sorted arrays using a binary search and the memove() function.

Using the future genealogical information, create only those chromosomes in gen(1) that can potentially contribute to the population that exists at gen(

*k +*1). Assume that the other arrays are empty. [Note that the main idea in step 2, when creating a new generation, was to reuse the arrays from the previous generation as much as possible and avoid copying. If we eliminate some of the nonancestral chromosomes from any generation (see appendix a), the fraction of chromosomes for which explicit array copying is necessary decreases substantially.]Update the future genealogy by one more generation. Repeat step 3. Remove nonpolymorphic locations from the population at regular intervals.

Simulate the whole population during the last

*k*+ 2 generations of the simulation (*i.e*., if the simulation is run for*l*generations, we explicitly simulate all the chromosomes from generations*l − k −*1 to*l*) as well as during the last*k*+ 2 generations up to the generation during which nonpolymorphic locations get removed from the population (*i.e.*, if fixed mutations get removed every*n*generations, then we explicitly simulate all the chromosomes from generations*n − k −*1 to*n*, 2*n − k −*1 to 2*n*, 3*n − k −*1 to 3*n*,*…*, etc.). [Note that this last step is essential because at the end of the simulation as well as during the generation at which nonpolymorphic locations get removed, we require all the chromosomes present in the population (for example, to determine which chromosomal locations are nonpolymorphic) and not just the ones that can contribute to the future population.]

The parameter *k* has to be chosen optimally for this algorithm to work most effectively. There is a trade-off between the computational effort spent to look forward for *k* generations and the effort saved by the elimination of nonancestral chromosomes from the next generation. In terms of run-time complexity, creating a new generation mainly involves array copy operations that take linear [*i.e.*, *O*(*N*)] time in terms of the number of mutations [binary search takes *O*(*ln*(*N*)) time while exchanging the pointers takes constant time] accumulated in the array. In contrast, looking forward for a few generations takes only constant time and is independent of the number of mutations carried.

The expected proportion of individuals in categories *a* or *b* first increases as the depth of the look-ahead (*i.e.*, *k*) increases but eventually becomes nearly constant. So, increasing *k* beyond a certain range will not be desirable. The expected proportion decreases as the per-generation per-sequence recombination rate increases, and therefore this strategy becomes less effective for high values of *r*. Table 1 shows the expectation of the fraction of chromosomes in category *a* as a function of *k* and *r* for a population with 500 diploid individuals and evolving under the standard neutral model (also see appendix a). In all the simulations presented here, we use a fixed value of *k* = 8 and remove nonpolymorphic locations after every *N* generations, where *N* denotes the size of the population (appendix b shows some run-time comparisons for different values of the look-ahead parameter).

For models with natural selection, we simulate the evolution of selected and neutral sites separately. We first generate the future genealogical information by simulating the ancestry of all the chromosomes in the population considering only the selected sites. If the number of sites under selection remains small, this information can be generated quickly. Then, using this information, we simulate the evolution of the remaining (neutral) sites according to the algorithm described earlier. Note that as the proportion of sites under selection increases, the look-ahead strategy becomes relatively less effective.

#### Random number generation:

Random numbers are generated using the Mersenne Twister algorithm (Matsumoto and Nishimura 1998). The external files mtrand.cpp and mtrand.h are used along with our program to enable random number generation. mtrand.cpp is a fast and high-quality random number generator whose period length is a large prime number that is one less than a power of 2.

#### Comparison with other forward simulation programs:

We first compare the approximate running time of our simulation program (FORWSIM) with two other currently available forward-time simulation programs for the standard neutral model. These comparisons demonstrate that our look-ahead strategy combined with other standard optimizations can result in large gains in run-time efficiency (Table 2a, appendix c shows some run-time comparisons for models with natural selection at multiple sites). The comparisons also confirmed that, for all the programs tested, the means and distributions of some simple summary statistics are in agreement with coalescent simulations (Table 2b, Figure 1).

## IS THE ANCESTRAL RECOMBINATION GRAPH A GOOD APPROXIMATION TO THE EXACT SCENARIO?

Under the coalescent framework, the genealogy of a sample of sequences with recombination can be approximated by a graph called the ancestral recombination graph (ARG) (*e.g.*, see Hudson 1983; Griffiths and Marjoram 1996). If *s* denotes the probability of self-fertilization and *F = s/*(2 *− s*), the genealogy of a sample for partial selfing (*i.e.*, 0 < *s* < 1) can be approximated by an outcrossing version of the ARG with a rate of coalescence that is 1 *+ F* times faster and a rate of recombination that is 1 *− s* times slower (see Nordborg 2000). The recombination graph makes two main assumptions:

It assumes that the lineages we follow backward in time recombine only with nonancestral lineages. This follows because we are tracing the ancestry of small samples in large populations and therefore the number of lineages ancestral to the sample remains small compared to the total population size.

It also assumes that in a large population all the recombination events are independent of one another. We use forward simulations of the exact Wright–Fisher model with and without self-fertilization and compare the expected decay of pairwise linkage disequilibrium (LD) to values generated with equivalent coalescent simulations and verify the accuracy of these approximations.

When the recombination rate is high, the number of ancestral lineages in the recombination graph can become very large and so it is not obvious whether the first approximation will be accurate in finite populations. When there is partial selfing, going backward in time, a pair of lineages resulting from a single recombination event can spend a significant amount of time together within the same ancestors before they find different parents or coalesce. Thus, it can be shown that there is a significant probability that such lineages may recombine again (*i.e*., overlapping recombinations) before they find different ancestors (see appendix d, Figure 2a). This clearly violates the assumption that all recombination events happen independently of one another. For models without selfing, the probability of such overlapping recombination events is expected to be much smaller as long as the population size is reasonably large (see appendix d, Figure 2b).

Figure 3 shows the expected decay of the absolute value of pairwise *D′* (the normalized measure of LD that takes values between −1 and 1) for forward-time simulations with selfing, forward-time simulations without selfing, and coalescent simulations with equivalent parameters. When simulating using the coalescent, we assume that recombination happens only with nonancestral chromosomes, which ignores the chance of recombination events between ancestral lineages. When *r* is high, the expected value of *D′* is slightly higher in forward simulations without selfing than in comparable coalescent simulations presumably because the number of ancestral lineages is large and recombinations with ancestral lineages are not rare. Because recombination events with ancestral lineages will be associated with some coalescence, we reach the most recent common ancestor slightly sooner in the exact case than in the ARG. Nevertheless, for models with only outcrossing, we see from results in Figures 1, 3, and from Table 2, that the expectations and distributions under the coalescent with recombination are close to the expectations under the exact scenario even for higher values of *r*. We also compared the frequencies of some triplet based LD patterns (Padhukasahasram *et al.* 2004, 2006) at different distances and reached similar conclusions (results not shown here). For models with selfing, the ARG remains a close approximation to reality as long as either *r* or *s* remains small (Figures 3 and 4, Table 3). When *r* and *s* are both very high, the ARG approximation breaks down due to overlapping recombination events and expected value of *D′* is significantly higher in forward simulations compared to the equivalent coalescent model.

## SUMMARY

We have presented an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population under the Wright–Fisher model of evolution. Comparisons with other currently available forward-in-time simulators show that our C++ program is able to simulate data sets quickly and all the tested programs appear to function correctly. Further refinements to our algorithm are possible to improve its efficiency. For example, instead of using a constant depth of look-ahead, we may change the depth during the run. Note that toward the later stages of a simulation, when the amount of polymorphism in the population becomes high, a deeper look-ahead might prove to be more advantageous. Also, it may be possible to determine other categories of chromosomes (apart from those in classes *a* or *b*) that cannot potentially leave any trace in the future population that exists after *k* generations. Alternately, instead of using the look-ahead strategy described before, we may explicitly construct chromosomes for a small number of generations in terms of the chromosomes of gen(1) by generating the recombination breakpoints of the future (this may be useful when *r* is very high). Doing this will allow us to eliminate all the chromosomes that are nonancestral to the population that exists at gen(*k +* 1) but will require greater computational effort than the former look-ahead strategy. Finally, we anticipate that a parallel implementation of this algorithm that can simultaneously utilize a large number of computer processors (which can all access the same memory), can make forward-time simulations practical for very large populations.

We checked the accuracy of the ancestral recombination graph approximation by comparing the expected decay of pairwise linkage disequilibrium in forward and coalescent simulations. Our results indicate that the standard coalescent with recombination will be a close approximation to the exact scenario for completely outcrossing populations with 2*N* = 1000 chromosomes or more, even for higher values of *r*. The ARG is also a good approximation for models with selfing as long as either the selfing rate (*s*) or recombination rate (*r*) remains small. When *s* and *r* are both very high, the scaled ARG for partial self-fertilization becomes slightly inexact due to substantial probability of overlapping recombination events. Therefore, for such parameter ranges, it is best to simulate data sets using exact Wright–Fisher simulations (or alternately modify existing coalescent simulation programs to allow for overlapping recombination events).

## APPENDIX A: CALCULATIONS

Let gen(0) represent the current generation, gen(1) represent the generation being simulated and gen(2), gen(3), gen(4), …, etc., represent subsequent generations. Let 2*N* denote the total number of chromosomes in the population and *k* denote the number of generations of look-ahead. Assuming random mating as follows: *v*(*m*), the probability that *m* chromosomes do not get chosen for the next generation is (1 − *m*/2*N*)^{2N}: *q*(*m*), the probability that a chromosome is chosen exactly *m* times is ^{2N}*C _{m}*(1/2

*N*)

*(1 − 1/2*

^{m}*N*)

^{2N−m}.

Assuming *n* copies of a chromosome in the current generation, the probability that exactly *m* copies get chosen in the next generation is *s*(*n*, *m*) = ^{2N}*C _{m}*(

*n*/2

*N*)

*(1*

^{m}*− n*/2

*N*)

^{2N−m}.

The chance that a chromosome does not recombine in any given generation is approximately *e ^{−r}*.

Assuming *n* copies of a chromosome in the current generation, the chance that none of the copies of the chromosome that get picked in the next generation, recombine, can be approximated as(A1)where *l* denotes the maximum number of copies that can be picked in the next generation.

For *k* = 1 and *r* > 0, a chromosome can be lost if it does not get picked in gen(2). Therefore, the chance that a chromosome is lost without its homolog having undergone recombination isFor *k* = 2 and *r* > 0, a chromosome can be lost if it does not get picked in gen(2) or gets picked 1 to 2*N −* 1 times in gen(2) but none of those copies get picked in gen(3). Therefore, the probability that all the copies of a chromosome are lost without any of their homologs having undergone any recombination is nearly(A2)Note that if a chromosome gets picked *m* times in the next generation, then its homolog can get picked at most 2*N − m* times and therefore we have to choose *l* appropriately in the terms in Equation A2. We assume that if there are *x* copies of a chromosome in any given generation, then there are also *x* homologs, when calculating the probability that none of those homologs will recombine. There is a small chance that some of the copies of a chromosome will be homologs of one another in the next generation. Therefore, the probability given by Equation A2 is not exact.

In general, when *N* is large, *a* is small and *r* > 0, the chance *P*(*a*) that all the descendants of a chromosome from gen(1) are lost at gen(*a +* 2) but not before that, without any of their homologs having undergone recombination is nearly , where *m*_{1}, *m*_{2}, *…* , *m _{a}* can all vary from 1 to 2

*N −*1. Therefore, for

*k*> 0, the total probability

*T*(

*k*) that all the copies of a chromosome are lost by gen(

*k +*1), without any of their homologs having undergone any recombination is nearly(A3)where

*a*varies from 0 to

*k −*1. Table A1 shows the approximate probability given by (A3) as a function of

*k*and

*r*. Figures A1 and A2 show different views of this likelihood surface.

For *r* = 0, *T*(*k*) is exactly equal to(A4)where *a* varies from 0 to *k−* 1 and *U*(*a*) = , where *m*_{1}, *m*_{2}, *…*, *m _{a}* all vary from 1 to 2

*N −*1.

## APPENDIX D

Let *N* be the total number of individuals in the population and *s* be the selfing probability and *r* be the per-generation per-sequence recombination rate. For a given chromosome, the chance of no recombination events in any given generation can be approximated by *e ^{−r}*. We first calculate the probability that, going backwards in time, a pair of lineages resulting from a recombination event will eventually find two different ancestors.

The probability that it finds different ancestors in the first generation is (1 *− s*)(1 − 1*/N*) (*i.e*., not created by selfing and choose different ancestors).

The probability that it finds different ancestors in the second generation is (*s/*2 + (1 *− s*)/(2*N*))(1 *− s*)(1 − 1*/N*) [*i.e*., probability of selfing but choosing different chromosomes in the same ancestor in the first generation (*i.e*., *s/*2) or not created by selfing but picking different chromosomes within the same parent in the first generation (*i.e*., (1 *− s*)/2*N*) and then not created by selfing and choosing different ancestors in the second generation].

The probability that it finds different ancestors in the second generation without further recombination is (*s/*2 + (1 *− s*)/(2*N*))(1 *− s*)(1 − 1/*N*)*e ^{−}*

^{2r}.

In general, the probability that a pair finds different ancestors in the nth generation but not before that is: (*s/*2 + (1 *− s*)/(2*N*))^{n−}^{1}(1 *− s*)(1 − 1*/N*) and the probability that this happens without subsequent recombinations is (*s/*2 + (1 *− s*)/(2*N*))^{n−}^{1}(1 *− s*)(1 − 1*/N*)*e ^{−}*

^{(2n−2)r}.

Thus, the total probability that a recombination event will eventually split lineages into two separate ancestors is (summing up to *n* = infinity) 2(1 *− s*)(1 − 1*/N*)/(2 − (*s +* (1 *− s*)*/*(*N*))). Now, the total probability that this happens without any subsequent recombination events is (summing appropriate terms up to *n* = infinity) 2(1 *− s*)(1 − 1*/N*)/(2 − (*s +* (1 *− s*)*/*(*N*))*e ^{−}*

^{2r}). Therefore, the probability of overlapping recombination events, given that after a recombination event the pair of lineages will find different ancestors is (

*s +*(1

*− s*)

*/*(

*N*))(1

*− e*

^{−}^{2r})/(2 − (

*s +*(1

*− s*)

*/*(

*N*))

*e*

^{−}^{2r}). Figure 2a shows this probability as a function of

*s*and

*r*.

Similarly, for models without self-fertilization, going backward in time, the probability that a pair of lineages finds different ancestors in the first generation is (1 − 1*/N*). The probability of finding different ancestors in the second generation is (1 − 1*/N*)/2*N* (*i.e*., pick different chromosomes in the same parent in the first generation (*i.e*., 1*/*2*N*) and pick different parents in the second generation, etc.). The probability of finding different ancestors in the second generation without subsequent recombination is (1 − 1*/N*)*e ^{−}*

^{2r}

*/*2

*N*.

In general, the probability that a pair of lineages finds different ancestors in the nth generation is (1 − 1*/N*)*/*(2*N*)^{n−}^{1}and the probability that this happens without subsequent recombinations isThe total probability that after a recombination event, a pair of lineages will eventually find two separate ancestors is (summing appropriate terms up to infinity) 2*N −* 2*/*(2*N −* 1) and the total probability that this happens without any subsequent recombination is 2*N −* 2*/*(2*N − e ^{−}*

^{2r}). Therefore, the probability of overlapping recombination events, given that a pair of lineages will eventually find two different individuals is: (1

*− e*

^{−}^{2r})/(2

*N − e*

^{−}^{2r}). Figure 2b shows this probability as a function of

*N*and

*r*.

## Acknowledgments

We thank Andrew G. Clark and members of the Bustamante lab for providing comments on this project. This work was supported by National Science Foundation grant DBI-0606461 to Susan McCouch and Carlos D. Bustamante as well as by National Institutes of Health (NIH), Center for Excellence in Genomic Sciences grants HG-002790 and GM-069890 to Paul Marjoram and Magnus Nordborg. This work was also supported in part by NIH grant R01-HG004049-02 to Jeffrey D. Wall, Paul Marjoram, and Magnus Nordborg.

## Footnotes

Communicating editor: M. K. Uyenoyama

- Received December 2, 2007.
- Accepted January 26, 2008.

- Copyright © 2008 by the Genetics Society of America