# The Spatial Mixing of Genomes in Secondary Contact Zones

^{*}Department of Evolution and Ecology and Center for Population Biology, University of California, Davis, California 95616^{†}Department of Plant Biology, University of Minnesota, St. Paul, Minnesota 55108^{‡}Department of Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089

- 1Corresponding author: Department of Evolution and Ecology, One Shields Ave., Davis, CA 95616. E-mail: sedghifar{at}ucdavis.edu

## Abstract

Recent genomic studies have highlighted the important role of admixture in shaping genome-wide patterns of diversity. Past admixture leaves a population genomic signature of linkage disequilibrium (LD), reflecting the mixing of parental chromosomes by segregation and recombination. These patterns of LD can be used to infer the timing of admixture, but the results of inference can depend strongly on the assumed demographic model. Here, we introduce a theoretical framework for modeling patterns of LD in a geographic contact zone where two differentiated populations have come into contact and are mixing by diffusive local migration. Assuming that this secondary contact is recent enough that genetic drift can be ignored, we derive expressions for the expected LD and admixture tract lengths across geographic space as a function of the age of the contact zone and the dispersal distance of individuals. We develop an approach to infer age of contact zones, using population genomic data from multiple spatially sampled populations by fitting our model to the decay of LD with recombination distance. To demonstrate an application of our model, we use our approach to explore the fit of a geographic contact zone model to three human genomic data sets from populations in Indonesia, Central Asia, and India and compare our results to inference under different demographic models. We obtain substantially different results from those of the commonly used model of panmictic admixture, highlighting the sensitivity of admixture timing results to the choice of demographic model.

- contact zones
- admixture
- linkage disequilibrium

POPULATIONS frequently undergo periods of relative isolation that are followed by secondary contact. During isolation, the evolutionary processes of genetic drift, mutation, and selection act to differentiate populations at many markers throughout the genome. When these populations come back into contact, the restoration of gene flow generates admixed populations, which start as an assemblage of differentiated parental genomes that are broken up every generation by segregation and recombination between chromosomes.

Under this process, linked alleles of the same ancestry will tend to be co-inherited until separated by recombination. Because the parental populations are differentiated with respect to each other, this co-inheritance leads to a nonrandom association of alleles, referred to as linkage disequilibrium (LD). This admixture-induced LD (or admixture LD) is the resulting covariance between loci and initially extends over a much larger genomic scale than LD does in either parental population and is a signature of relatively recent admixture (Cavalli-Sforza and Bodmer 1971; Chakraborty and Weiss 1988). One can also think of this signature as the persistence of parental haplotypes in admixed populations that, rather than being measured directly, is measured as the extent of co-occurrence along a chromosome of alleles that are diagnostic of parental origin. Recombination acts every generation to gradually break apart long tracts of ancestry into smaller tracts, and so the association between nearby alleles lasts many generations. The physical scale over which admixture LD breaks down is determined by the timescale over which parental populations have been interbreeding; the conservation of many ancestral haplotypes over large physical distances would imply very recent admixture, whereas a longer history of admixture produces many smaller parental tracts. We assume that population differentiation within the parental populations is weak relative to that between them and so consider only admixture LD (and not mixture LD that is covariance between loci induced by substructure) in the focal populations.

Data from many (potentially weakly) differentiated markers allow for the identification and quantification of admixture in individuals (*e.g.*, Pritchard *et al.* 2000) and the inference of the ancestral origin of a given chromosomal region (*e.g.*, Falush *et al.* 2003; Price *et al.* 2009; Hellenthal *et al.* 2014). The continued mixing of differentiated genotypes, as described above, produces predictable population genomic patterns that change through time, and these signals can be used to not only detect past admixture in an extant population, but also learn about the timing and history of these admixture events (*e.g.*, Harris and Nielsen 2013; Loh *et al.* 2013; Hellenthal *et al.* 2014). Such inferences have been used to reconstruct historical population movements, highlighting the importance of admixture in shaping patterns of diversity in human populations (Reich *et al.* 2009; Patterson *et al.* 2012; Loh *et al.* 2013; Moorjani *et al.* 2013; Hellenthal *et al.* 2014). These studies have utilized powerful methods that first identify stretches of chromosome inherited from a particular parental population [admixture tracts (Gravel 2012; Hellenthal *et al.* 2014)] or measure the covariance, over spatial scales, of variants that are diagnostic of parental populations [admixture LD (Patterson *et al.* 2012; Loh *et al.* 2013)] and then infer the genetic scale over which this measured coancestry decays. Commonly this is done by assuming a model of admixture in which one isolated population is formed by a single admixture event in time, with subsequent random mating. Under this simple model, the distribution of admixture tract lengths and the decay of admixture LD with respect to genetic distance are approximately exponential, with the rate parameter corresponding to the time in generations since admixture. However, violations of the assumptions of the single-pulse model can result in substantial departure between expected and observed rates of decay of coancestry with respect to time.

Models incorporating multiple admixture times, or sustained migration (Pool and Nielsen 2009; Gravel 2012; Hellenthal *et al.* 2014; Liang and Nielsen 2014b), have been built to address more complex admixture scenarios in single populations. However, these do not incorporate the fact that admixture often occurs in a geographic context—beginning at a given point in time, then spreading across space. Most current models treat each admixed population as an independent event, not accounting for this spatial context, even when admixture in spatially distributed populations is potentially attributable to a single historical event.

In this article we build an alternative model of diffusion of ancestry across geography in time. Specifically, we consider a scenario in which two populations spread back into contact, generating a gradient of admixture across space with the greatest variance in ancestry at the point of initial contact. We refer to this mixture across space, where migration is sustained through both time and space, as a contact zone. This geographic mixing leads to departures from a simple model of exponential decay of admixture LD as there is exchange of migrants between neighboring populations with different admixture proportions. We describe the expected covariance in ancestry (ancestry LD) in contact zones, accounting for migration in continuous space. By assuming a large population that is not affected by genetic drift, and therefore ignoring coalescence, we are able to derive an analytic expression for LD in contact zones. This model provides a framework to simultaneously examine admixture patterns over a set of geographically distributed populations and a potential geographic null model for studying historical movements of populations. Inference under this model provides a means to estimate both the time at which populations spread back into contact and some measures of dispersal. We analyze several potential human contact zones under our model and show that simpler “point” models of admixture can infer unreasonably recent admixture dates. In addition to human admixture, LD has also been used to characterize hybrid zones (*e.g.*, Szymura and Barton 1986; Mallet *et al.* 1990; Wang *et al.* 2011), and so the model presented here also has applications in the study of such secondary contact between incipient species. This could potentially complement earlier investigations of coancestry along hybrid zones in the presence of selection (Barton 1979, 1983; Barton and Bengtsson 1986).

## Materials and Methods

### Outline of neutral model

We consider two differentiated populations along a transect in space, formerly separated by a barrier that completely prevented migration (at position ) that was removed *τ* generations ago (Figure 1). We imagine the barrier as a physical obstruction to migration; however, in practice the two previously isolated populations could come into contact through a variety of means. We use a continuous-space limit of randomly mating (Wright–Fisher) populations on a line, made formal in, *e.g.*, Nagylaki (1975) and Shiga (1980), which can be described informally as follows.

Since time *τ*, individuals have moved without restriction, in such a way that the distribution of displacements between an ancestor and a descendant separated by *t* generations is Gaussian with mean zero and variance The Gaussian assumption is appropriate since, over many generations, the sum of many steps under a finite-variance dispersal kernel will converge to a Gaussian distribution. This forms a gradient of admixed populations across space, whose degree of admixture depends on the time that has passed and the distance to the point of initial contact. Over time, genotypes of different ancestries diffuse across the entire range, and recombination breaks down tracts of continuous ancestry. We aim to describe this diffusion of ancestry throughout time and space.

To determine the typical degree of admixture at a location, we follow the lineage of a sampled individual back through time, tracing the spatial location of the ancestor of today’s sample back to the initiation of secondary contact. The ancestral type of today’s sample is determined by the geographic position of its ancestor *τ* generations ago: we say that a sampled individual whose lineage falls to the left of the barrier (*i.e.*, whose ancestor *τ* generations ago lived at a location ) is of ancestry *A* and is otherwise of ancestry *B*. This represents the alleles belonging to ancestral population *A* or *B* before the initiation of secondary contact. We treat time and space as continuous variables, and the time-reversible properties of Brownian motion allow us to model the movement of lineages as a continuous Brownian process. The framework we present here is explicitly defined in a one-dimensional geographic transect, but also applies, unchanged, in two dimensions due to the absence of drift.

### Behavior of a single locus

We start by describing the marginal profile of admixture proportions. Suppose that we sample a randomly chosen individual today from position ℓ relative to the center of the contact zone and define to be that individual’s ancestry at a randomly chosen locus (*i.e.*, is equal to *B* if their ancestor at the time of secondary contact *τ* generations ago lived to the right of the barrier). Since we assume that the displacement between parents and offspring is Gaussian with variance we can describe the movement of the lineage as a Brownian motion, and so the probability that is of ancestry *B* is equal to the probability that a Brownian motion that begins at ℓ is to the right of zero after *τ* generations; *i.e.*, (1)Here is the indicator function:In other words, the probability that an individual sampled at geographic position ℓ inherits at a given locus from ancestral population *B* is the probability that where is Gaussian with mean ℓ and variance This is also the expected frequency of ancestry *B* at position ℓ, *τ* generations after contact, and therefore provides an expectation of the cline in ancestry proportion (Figure 1). Although it is convenient to imagine the motion of a lineage as a Brownian motion in continuous time, this expression also holds for discrete generations since the distribution of parent–offspring dispersal is Gaussian with variance and then the total displacement across *τ* generations is also Gaussian, with variance .

Under this model, we expect the zone of significant admixture, where admixture LD is observable, to extend over distance in either direction from the center of the zone. Therefore, to fit our model using the inference framework we describe below, we will need samples on this spatial scale. We note that while this may be a good model for small *τ*, the prediction under this model that admixture proportions homogenize as *τ* becomes very large can be unrealistic as barriers to intermixing may persist over time and therefore may not be an appropriate model for very old contact zones.

### Ancestry LD between linked loci

In our model, all chromosomes begin as unbroken tracts of ancestry prior to initial contact. As time progresses, recombination between haplotypes of different ancestry breaks down these associations. To model this effect, we consider two linked loci separated by a recombination fraction *r*, on a single chromosome sampled at geographic position ℓ, *τ* generations after secondary contact (see Figure 1 and legend). The ancestries of a sampled individual at these two loci are denoted and respectively. If there is no recombination between these two loci, then both lineages trace the same path back in time, and The recombination fraction between the loci is the per generation probability of observing a recombinant haplotype as the product of meiosis. For close pairs of markers it may suffice to use the genetic distance *d* in morgans that separates markers, but for more distant markers we can use the probability of an *observed* recombination event, which is the probability of an odd number of recombination events between focal loci, accounting for interference when possible. Assuming no interference (*i.e.*, a Poisson model), the relationship between *d* and *r* is given by Haldane’s mapping function, (Haldane 1919).

We measure ancestry LD as the covariance in ancestry between the alleles at the two loci, (2)Since and have the same distribution, the second term is simply which by Equation 1 is

The first term of Equation 2 is the probability that both and are of ancestry *B*, which we can compute by considering the joint distribution of the movement of the two lineages over the last *τ* generations. At the time of sampling, and until the first recombination event between the two loci, the two lineages follow an identical path back through time. We assume that after the first recombination event the two lineages never coalesce back onto the same chromosome and therefore pursue independent Brownian paths for the remainder of the *τ* generations since secondary contact (Figure 1). This assumption ignores drift since secondary contact and therefore does not account for the possibility of the two lineages coming back onto the same background to once again assume identical paths. This is a reasonable assumption for large populations and recent contact zones where the probability of coalescing back onto the same background is small, but neglects some additional covariance in smaller populations.

This assumption of no drift will be good if is much smaller than the number of individuals falling in a circle (or interval) of radius *σ*, proportional to Wright’s neighborhood size (Wright 1946). This is because in one dimension, assuming Gaussian dispersal, the number of generations that two randomly moving lineages that start in the same place spend within distance *σ* of each other across *τ* generations is of order the chance that they coalesce each time they are nearby is proportional to and so the chance of coalescence is negligible if Since coalescence is less likely in two dimensions than in one, this gives a bound in the two-dimensional case as well; for more discussion, see Nagylaki (1978) and Barton *et al.* (2002).

To find an expression for covariance in ancestry, observe that the random number of generations *T* since the most recent recombination event between the two loci is geometrically distributed; continuing with the continuous time model, we can take *T* to be exponentially distributed with rate parameter *r*. Given that the most recent recombination along this lineage occurred *T* generations ago, with the spatial displacements of the two lineages from which and derive at *τ* generations in the past are distributed as a bivariate Gaussian with covariance and variance the probability density of which we denote

The probability that both lineages are to the right of zero *τ* generations ago, and hence are both of ancestry *B*, is therefore given by (3)The first term of Equation 3 corresponds to the probability that there has been no recombination for the last *τ* generations, multiplied by the probability that the path of our single ancestral lineage is on the right side of the barrier when the barrier was removed. The second term integrates the probability that two lineages that recombined *t* generations ago are both to the right of the barrier at time *τ*, *i.e.*, the bivariate Gaussian density integrated over the quadrant and over all possible times of first recombination. Rescaling *t* so that Equations 2 and 3 come together to give

To obtain this expression, we integrate by parts, make use of the identity in Equation A3, and rescale (0, *τ*) onto (0, 1) (see *Appendix A* for more detail). We denote this covariance as a function which expresses the expected covariance in ancestries of two loci in a randomly sampled individual from a given geographic location (ℓ) as a function of recombination fraction (*r*) between the loci, time since admixture (*τ*), and rate of dispersal (*σ*). In *Appendix B* we also develop analogous results for arbitrary migration schemes in discretized space, for both continuous and discrete time.

These functional forms give us a way to relate observed patterns of LD in admixed populations to the parameters of the demographic model generating admixture. We later use this to develop an inference method to estimate these parameters in a contact zone. Before doing this, we explore strategies to obtain the full distribution of admixture block lengths in a contact zone model.

### Admixture block lengths

An extension to the above approach for describing admixture LD between two loci is to consider how ancestry along the chromosome is partitioned into unbroken genomic tracts of ancestry drawn from one parental population. This is a natural way to think about coancestry in admixed populations (Fisher 1954; Barton 1983; Ungerer *et al.* 1998; Gravel 2012), and the genome-wide distribution of ancestry tract length will contain information about admixture and is a richer source of information than pairwise LD alone.

We again examine a chromosome drawn at random at geographic position this time considering the probability that between physical positions *P* and *Q*, separated by genetic distance *d*, the chromosome inherits entirely from ancestry *B*. As above, we assume that once linkage is broken by recombination, the lineages from which the products of recombination are descended move independently of each other. This again assumes that *τ* is small relative to the timescale of coalescence (genetic drift). Further, it ignores the correlation structure imposed by the pedigree (Wakeley *et al.* 2012; Liang and Nielsen 2014a), the impact of which we return to in the *Discussion*.

We note that our measure of recombination rate *d* will differ from the earlier definition of recombination fraction (*r*) as we will be tracking all recombination events between *P* and *Q*. We now assume that recombination events occur as a Poisson process with rate *d*, which reflects genetic distance on the genetic map between our two endpoint loci, and assume no interference.

If there have been *K* recombination events that occurred along the tract of the chromosome over the last *τ* generations, then this region has genetic ancestors from *τ* generations ago. Denote the spatial locations at time *τ* of these ancestors As our assumption of infinite population size neglects coalescence, these ancestors are assumed to be distinct. The segment contains only ancestry from population *B* if all (*i.e.*, all ancestors lived at locations to the right of 0 at time *τ*; see Figure 2 for an example of *K* = 3). We denote the probability of this segment containing only ancestry from population *B* as (5)where, as before, is *B* if This expected value averages over both the number and timing of recombination events and the locations of the ancestral lineages at time *τ* ago. We now outline one approach to obtain an expression for by conditioning on the number of recombination events, and give a complementary approach in *Appendix D*.

Since we assume no coalescence, the branching order of the ancestral lineages via recombination specifies a labeled tree structure with tips, given *K* recombinations, meaning that a modern individual at location ℓ has distinct ancestors from *τ* generations ago. Since, looking backward in time, each lineage moves as an independent Brownian motion once it has split from the others, **X** has a -dimensional multivariate Gaussian distribution with mean and variance–covariance matrix **Σ**. The entries are determined by the amount of time that the lineages leading to tips *i* and *j* spend in linkage. If we let be the time, in generations, from the present to the recombination that separates tip *i* from tip *j*, then and the diagonal entries are

Conditioning on recombinations and **Σ**, the probability that all tips are of ancestry *B* is given by the integral of the -dimensional Gaussian density over the space for which all (6)The above integrand is the density for the multivariate Gaussian whose covariance matrix is determined by the timing and ordering along the chromosome of recombination events. As an example, Figure 2 presents the two different unlabeled topologies that can be obtained for The topology of Figure 2A would produce a multivariate Gaussian with covariance matrixand the topology of Figure 2B would be represented by the covariance matrix

Obtaining the unconditioned value of from the conditioned version in Equation 6 requires averaging over possible trees; to do this we sum over possible tree topologies and for each tree topology integrate over possible split times (*i.e.*, and in the case of the three-recombination trees shown in Figure 2).

For a given unlabeled tree topology we therefore need to integrate Equation 6 over the possible split times of the tree. First note that we can reduce to the case of trees with height 1 by scaling time. We define so that by Gaussian scaling,

We then obtain the following term:

(7)Here, is the probability of the branching times given the topology. *V* accounts for the fact that the **Σ** we integrate over represents a topology and so the possible entries of **Σ** are constrained by Thus, the set of possible times, over which we integrate depends on the tree topology, and correspondingly, each topology has a probability given *k* recombinations. [See *Appendix C* for a further description of and .]

Finally, we sum across *k* and, for each *k*, all -tipped unlabeled topologies. Recalling that the probability of the number of recombination events is Poisson distributed, (8)where is the probability of the unlabeled topology given that there are tips [we describe the calculation of in *Appendix C*]. We note that Equation 8 is a Wild sum expansion for (Etheridge 2000). We outline an approach using differential equations to obtain an equivalent expression in *Appendix* *D*.

In practice, we can approximate this sum by conditioning on or fewer recombination events in *τ* generations:

In the *Results* section below, we briefly explore the convergence of this sum to distributions obtained by simulation. Summing over the large number of topologies for large is computationally expensive, but terms in the sum can be reused over some parameter values. Given reliable measurements of block-length distributions from genomic data, the above estimate of *U* provides a means by which timing and migration in contact zones can be inferred (see Gravel 2012 for a recent application of such an approach). However, we do not implement this strategy here, concentrating instead on fitting models to admixture LD.

### Simulations

We developed two classes of simulations to (1) evaluate the accuracy of our analytic results and (2) explore the consequences of realistic violations of our model that likely occur under the specified biological process. Specifically, we are concerned with the assumption that movement of alleles is independent following recombination that follows from the assumption of infinite population size as well as the assumption of continuous time, rather than discrete generations.

For the first class of simulations, *simulations under the model*, we consider chromosomes moving in continuous space and time, with recombination modeled as a Poisson process through continuous time and independent movement of all products of recombination. Chromosomes were simulated by generating a vector of recombination times under a Poisson process and then uniformly placing the recombination events on a chromosome. Geographic positions through time were assigned to chromosomal segments at each recombination event such that adjacent segments followed the same path until the time at which a recombination event occurred between them, after which they followed independent paths. The positions were assigned by drawing the geographic displacement that occurred between recombination events from a Gaussian distribution with mean zero and variance equal to multiplied by the amount of elapsed time. This is an explicit simulation of the model described above. We simulated 10,000 chromosomes under the model.

The second class of simulations, *simulations under the process*, are forward-time, discrete-generation simulations of a grid of discrete populations in continuous time. In these simulations we record the complete recombination history of each chromosome. As such simulations allow genetic drift, enforce a pedigree structure onto local ancestry, and occur in discrete time and space, these simulations under the process present a biologically realistic challenge to many of our major modeling assumptions. We consider 200,000 diploids (400,000 chromosomes) evenly spread across 20 demes. Demes are connected through nearest-neighbor migration with a per-generation, per-individual probability *m* of migration (this migration rate is reduced to on demes at the edges of one-dimensional space). We sample the number of recombination events each generation from a Poisson distribution with mean of one, corresponding to a 1 Morgan chromosome, and recombination events are uniformly placed along a chromosome (*i.e.*, no recombinational interference). Every generation, migration, random mating, and recombination take place, and we record for each piece of chromosome the population from which it was inherited (*i.e.*, its ancestry). After *τ* generations we sample chromosomes and assign ancestry along each individual’s chromosome based on whether ancestors originated in populations 1–10 (ancestry *B*) or in populations 11–20 (ancestry *A*).

### Inference of parameters in human admixture data

We now use our theory to infer parameters in a demographic model, using real data. To do this, we can use either ancestry LD (Equation 4) or ancestral block length distributions (derivable from Equation 5). While the distribution of continuous-ancestry tracts necessarily contains more information than LD alone, there are limits to the precision of the measurement of tract length over short recombination distances (which would reflect old events). This, combined with the relative ease of obtaining LD measurements from genomic data, motivates the use of LD in our analysis of human admixture contact zones. A variety of methods, including ALDER (Loh *et al.* 2013) and Globetrotter (Hellenthal *et al.* 2014), estimate some measure of admixture LD that is an estimate of ancestry LD. We use the weighted LD curves generated by ALDER, which computes the statistic (10)where the sum is over a set of pairs of autosomal loci, each of which is *r* apart (in practice, this method uses *d* in place of *r*, and for analysis using ALDER output, we do the same). After Loh *et al.* (2013), is a pair of loci, and are sample allele frequencies in the parental populations *A* and *B*, and is the sample covariance between alleles at the two loci within the target population. If *r* is large enough that background LD in the ancestral populations can be ignored, and the allele frequencies in the parental populations are known, then where is the expected covariance in ancestry between pairs of loci a recombination fraction *r* apart, *α* is the ancestry proportion of population *A* in the admixed population, and the constant measures differentiation of allele frequencies between the two parental populations. Often, the true parental populations no longer exist, or are not sampled, and instead proxy parental populations are designated. In these cases, is a measure of the shared differentiation between the true parental populations and the proxy populations (Loh *et al.* 2013).

#### Admixture at a single time point:

Under a basic model of admixture, decay in ancestry LD can be described by the parameters *F*, *t*, and *G* in the exponential model (11)corresponding to a single pulse of admixture *t* generations ago. This reflects the exponential decay of admixture LD with time due to recombination between two loci separated by recombination fraction *r* and is the model used by ALDER (Loh *et al.* 2013) and Globetrotter (Hellenthal *et al.* 2014) to estimate admixture timing in a single population.

The term *G* represents LD between unlinked markers due to substructure in the sampled individuals with respect to their ancestry proportions. Under the model of Loh *et al.* (2013), the value corresponds to where *α* is the admixture proportion and therefore is a compound parameter reflecting both admixture proportion and differentiation between parental populations (Loh *et al.* 2013). This is the expected variance in allele frequency at a single locus, which is a function of the differences in allele frequency between the parental populations, the proportion of ancestry from each parental population, and the covariance that arises from nonrandom mating with respect to ancestry in the admixed population.

#### Fitting to a geographic contact zone:

Under our model, we take a set of admixed samples drawn from *n* populations, who fall at positions along a linear geographic transect. The geographic location of the center of the zone along this transect is *C*, such that sample 1 is a distance from the zone. We specify a pair of proxy parental populations *A* and *B* to represent the end points of the contact zone. Using ALDER, we generate the statistic for the *j*th population sample for each genetic distance bin (*i*), giving us a set, **a**, of weighted-LD decay curves (as defined in Equation 10). We use the minimum inter-SNP distance determined by ALDER based on LD in the parental populations.

To assess the uncertainty in **a**, we estimate the variance in ALDER’s statistics, using the jackknife (which is an output of ALDER). For each of the iterations, one chromosome is removed before recalculating **a** for the remaining 21 chromosomes. We use this to calculate the variance We then conduct a least-squares fit of the ALDER output to our prediction given by Equation 4 for values of *τ*, *σ*, ℱ [which accounts for differentiation between the parental populations in the same way that does], and *C*. We fit all *n* populations simultaneously, minimizing

Our choice of would be the negative log-likelihood of our parameters if our were Gaussian [a reasonable approximation given the large number of pairs of markers contributing to each value of ] and independent. We refer to as the log-likelihood, and because we are mainly interested in *τ* and *σ*, we generate profile surfaces of ℒ across combinations of *τ* and *σ*. Specifically, we set a value for *C* based on a fit of Equation 1 to ancestry proportion and generate a likelihood surface over a grid of and for each combination of *τ* and *σ* we define the profile log-likelihood as the maximum log-likelihood across all of our corresponding ℱ grid points. The grid of ℱ values that we fit over is informed by the strength of differentiation between the parental populations.

We note that, although Equation 11 includes an affine term to account for LD that could be generated by an unspecified model of population substructure, our model does not. This is because a source of long-range LD is incorporated into our model via gene flow from neighboring populations with different admixture proportions.

### Data availability

Using Equation 12, we fit our model to genomic data from populations that potentially represent admixture contact zones. These data were obtained from previously published studies, as detailed in Table S1, and are available publicly or upon request from their respective authors.

Custom scripts for simulations and fitting of data are available as Supporting Information, File S2 and File S3, and at https://github.com/asedghifar/NeutralZones.

## Results

### Simulation results and comparison to exponential model

Figure 3 shows the decay in LD at various points in time and space and shows the exact correspondence between the analytic expression of Equation 4 and the output of simulations under the model. As expected, the rate of decay increases with age, and LD is greatest at the center of the zone. While LD decays from a higher point in populations closer to the center of the zone, the rate of decay is greater in populations farther from the center of the zone. Dispersal is measured in the same units as distance from the zone center, and so the impact of dispersal on curves can be measured simply by rescaling the distance parameter.

To evaluate the consequences of fitting a single-pulse model to data generated by our spatial model of continuous admixture, we fitted the exponential decay of Equation 11 to a set of simulated populations from a 50-generation-old contact zone under the model. The comparison, shown in Supporting Information, Figure S1, of best-fit parameters indicates that the simple exponential tends to underestimate the age of the admixed population, presumably because of the continuous introduction of migrants bearing long ancestral haplotypes. In other words, the poor fit of the single-pulse model to these LD decay curves, especially close to the center of the contact zone, is due to the heterogeneous mixture of recombination times. Consistent with this interpretation, the effect diminishes in populations far from the center of the zone, as the difference in ancestry composition between neighboring populations decreases as the distance to the center increases.

To demonstrate our inference method as described above, we fitted our model (Equation 4) to the curves generated under the process. Because we simulated single chromosomes, we could not use the jackknife estimator of variance and therefore modified Equation 12 by removing the denominator. We removed populations with no detectable admixture from the fit, limiting our analysis to populations close to the center of the contact zone. The profile likelihoods of these surfaces are shown in Figure 4. The maximum-likelihood estimates of *τ* and *σ* are and for zones simulated under and respectively, all with We further explored the performance of our inference framework under different values of *σ* and *τ* (Figure S2). The method generally performs well. The accuracy of inference decreases with smaller values of *σ* and *τ*, presumably due to the increased noise from the small number of migrants and recombination events and perhaps also due to increased discrepancies between the discrete time and space simulations compared to the continuous time model.

Compared to the true values we use to simulate under the process our inference method tends to slightly underestimate the age of the contact zone. We expect that this is in part due to the discrete nature of the simulation. These estimates are closer to the true simulated ages than those obtained using the same method to fit an exponential (Equation 11), which estimated for estimated for and estimated for (compare to the results of our method on the same data above).

Finally, we compared our estimates of continuous length distribution to simulated tract lengths under the model. Figure 5 shows the distribution of tract lengths in simulated populations along contact zones of two different ages as well as the convergence of the sum in Equation 9 as (the maximum number of recombinations) is increased. The approximation using works best for young contact zones and for the distribution for short tract lengths. For older zones and longer tracts, summing over is computationally intensive, and the numerical approximations using the differential equation approach (*Appendix D*) may be more tractable.

### Application to human data sets

We applied our model to three independent sets of populations that potentially represent admixture in a spatial context: populations along the Indonesian archipelago and New Guinea, populations in Central Asia, and populations in India (Table S1). Genetic distances between SNPs were inferred using sex-averaged recombination rates from deCODE (Kong *et al.* 2010). Each of these data sets has been previously analyzed for admixture times, using the single-pulse admixture model (Xu *et al.* 2012; Moorjani *et al.* 2013; Hellenthal *et al.* 2014), and our aim was to compare results and goodness-of-fit of our geographic admixture model to the single-pulse (“exponential”) model.

#### Indonesian archipelago:

Populations along the Indonesian archipelago and New Guinea show a longitudinal cline of admixture between East Asian and Papuan autosomal ancestry (HUGO Pan-Asian SNP Consortium 2009; Xu *et al.* 2012; Lipson *et al.* 2014). The decrease in proportion of Asian ancestry with longitude has been interpreted as evidence of the Austronesian expansion from the west through Indonesia. Xu *et al.* (2012) fitted simple admixture models independently to each of the populations to infer admixture times of 120–200 generations, with populations with higher levels of Papuan ancestry having more recent admixture times. A more recent analysis using ALDER estimated single admixture dates for populations in the region in the range of 30–60 generations, suggesting that this in part is the result of subsequent waves of gene flow from populations with varying levels of Asian ancestry (Lipson *et al.* 2014).

We obtained the genotypes for seven population samples in Indonesia (shown in Figure 6 and Table S1) from the HUGO Pan-Asian SNP Consortium (2009) and a Papuan population from the Human Genome Diversity Project (HGDP) (Li *et al.* 2008). The combined data set yielded 17,057 shared SNPs. We first ran STRUCTURE (Pritchard *et al.* 2000) with on these nine samples. The admixture proportions obtained from STRUCTURE confirm the east-to-west cline (shown in Figure 6). We then used least squares to fit Equation 1 to these admixture proportions, which estimated the cline center at and Based on ancestry proportions, we chose the Mentawai population and the Papua New Guinean population as proxy source populations to generate ALDER curves. Simultaneously fitting our model to the six admixed populations, we generated the profile-log-likelihood surface shown in Figure 6. The parameters that minimized Equation 12 were an approximate contact time of ∼200 generations ago [or 5800 years, given a generation time of 29 years (Fenner 2005)], per generation ( per generation), and Our estimate of *σ* seems reasonable; using differences in estimates of admixture dates in each of these populations, Xu *et al.* (2012) propose a mean dispersal of 22.5 km per generation in these populations. The fit to LD decay curves under these estimates is shown in Figure 6 and Figure S3.

We also explored the fit to LD decay curves of the single-pulse model, fitting Equation 11 by least squares (weighted by jackknife variance as in Equation 12). Unsurprisingly, the fit of our model is not as good as that of a model in which all admixed populations are considered as having a single admixture time but allowed different values of *F* ( compared to ) since independently fitting the *y* intercept to each population allows for many more parameters while these intercepts in our model are constrained by geographic distances between the populations. The fits to each population are presented in Table S2 and are in good accordance with those found by Lipson *et al.* (2014), using similar methods. With this approach, the mean timing among the admixed populations is 60.8 generations (we ignore the Javanese population that has little admixture and an estimated admixture time of 665 generations as this is far older than all the other populations).

Additionally, we considered fitting all populations simultaneously for a single time under the exponential model (Equation 11), allowing each population to choose its own *F* parameter to account for differences in admixture proportions. Under this model we obtain an estimated age of generations (). Given that the truth is likely more complex than both the exponential and contact zone models, this better fit is not surprising given that we are allowing each population to fit its own intercept.

Linguistic evidence suggests that the Austronesian expansion through Indonesia dates to years ago (Gray *et al.* 2009). As noted by Lipson *et al.* (2014), these single-pulse dates (Table S2) are too recent to reflect this, consistent with our earlier observation that admixture times may be underestimated by a simple exponential model if admixture has been ongoing. Our estimate of timing based on fitting a geographic contact zone (5800 years ago) is much older than dates estimated by single-pulse models, but is also considerably older than the Austronesian expansion. Considering that it is constrained by having to fit all populations simultaneously, our model provides a good fit to these genomic data. One possible explanation for our overestimate of admixture time is the assumption of a continuous rate of diffusion after initial contact. Despite this, our model may be a more realistic depiction of ongoing gene flow than a single-pulse model.

#### India:

Population structure in India is complex and multilayered. While the precise history of human movement in this region is unclear, work by Moorjani *et al.* (2013) and Reich *et al.* (2009) suggests that many modern Indian populations are descendants of an admixture event between differentiated ancestral North Indian (ANI) and ancestral South Indian (ASI) populations, with a cline in the extent of ANI ancestry from north to south across the subcontinent, shown in Figure 7. While it is difficult to identify modern proxies of the parental populations, the ANI population appears to be most closely related to Western Eurasian populations (such as Georgia) and the Onge population of the Andaman Islands seems to draw much of its ancestry from the ASI population. Moorjani *et al.* (2013) broadly grouped their samples into Indo-European or Dravidian samples and, under this classification, found that the decay in ancestry LD in their samples was consistent with two historical admixture events, the first ∼108 generations ago, giving rise to the Dravidian populations, and a second wave of admixture from the north taking place 36 generations later that contributed to the ancestry of Indo-European populations.

We obtained the genomic data used in Moorjani *et al.* (2013) (from Li *et al.* 2008, Reich *et al.* 2009, Metspalu *et al.* 2011, and Moorjani *et al.* 2013), yielding shared SNPs, and used only the populations represented in table 1 of Moorjani *et al.* (2013) (see Table S1). We then fit our model to LD curves generated by ALDER, as described above. Fitting a model in which all Indo-European and Dravidian populations are the outcome of a single admixture contact zone yielded an age of ∼220 generations since contact (Figure 7, Figure S4). (Additional details of the fitting procedure for the Indian populations can be found in File S1.)

Several aspects of the data indicate potential misestimation of dates. Some populations, presumably the oldest, have very little admixture LD, which may prevent an accurate fit to the decay. Second, it is possible that the sampling of populations in space does not span a broad enough distance to obtain an accurate fit. Substructure within populations, due to practices such as endogamy, may also influence ancestry LD within a population and cause a deviation from expectations under a null model of locally random mating. We take these challenges, and the uncertainty in our results, as an indication that the complicated demographic histories of these populations are poorly described by a simplistic model of the sort we consider here. These challenges also likely apply to other analyses of these data, and caution is warranted in judging the age of this zone.

#### Central Asia:

Populations in central Eurasia show varying levels of East Asian ancestry. In a global analysis, Hellenthal *et al.* (2014) identified a signal of admixture, using Mongolians and Iranians as proxy source samples, in Turkish, Uzbek, Hazara, and Uygur samples. The proportion of Mongolian ancestry decreases with longitudinal distance from Mongolia, with the Turkish populations harboring the lowest proportion of Mongolian ancestry. The estimated admixture dates for these populations of 20–30 generations in the past found by Hellenthal *et al.* (2014) are consistent with the timing of the westward military movement of Mongolians during the 13th century.

We took the genomic data for the four admixed populations and the two proxy source populations from the data set of Hellenthal *et al.* (2014) (∼500,000 SNPs). A STRUCTURE analysis of these populations, with is consistent with a gradient in Mongolian ancestry across Central Asia (Figure 8). We used ALDER to generate weighted covariance curves, using the Mongolian and Iranian samples as the two proxy source populations. For the four admixed populations, the best fit under our simple contact zone model is ∼49 generations or 1421 years ago (29 years per generation), with per generation (see Figure 8 for the profile-likelihood surface, computed over 20 values of ℱ between 0.001 and 0.01). This admixture date predates the Mongolian invasion of Central Asia that took place ∼800 years ago. However, it is known that human movement in Central Asia was complex and preceded the Mongolian invasions by centuries, and it is possible that our estimated date is capturing a signal of these earlier migrations. This is supported by recent analyses of Central Asian populations by Yunusbayev *et al.* (2015). Our estimated parameters under the exponential model can be found in Table S3.

ALDER identified extensive long-range LD in the Hazara population, possibly due to population substructure within this sample with respect to Mongolian ancestry. Because this could potentially influence our inference, we refitted the LD curves to the set of admixed populations, excluding the Hazara. This produced an estimate of 37 generations (see Figure S6 and Figure S7 for LD curves in Central Asian populations).

One consideration in our applications is our assumption that the populations spread back into contact and then simply passively diffused into each other. This is obviously likely a poor description of the movement of Mongolian genotypes across Asia during the 13th century invasions, which could result in a discrepancy between expected and predicted decay in ancestry LD. We therefore proposed an alternate model that allows for an initial fast pulse of Mongolian migration into central Asia, followed by diffusion through local geographic dispersal (*i.e.*, our Brownian motion). Explicitly, we construct a model that defines two new parameters: a point in space to the east of which some proportion, *ψ*, of the population is replaced by Mongolian genotypes *τ* generations ago (see *Appendix E* for mathematical details). In specifying this model, we are trying to capture a scenario in which, at least initially, unadmixed Mongolian genotypes rapidly spread westward. However, we acknowledge that this is at best a very crude approximation to the true history. Note that while is analogous to the contact zone center estimated in the original model, which was estimated using admixture proportion, when fitting the weighted-LD curves to the modified model, we are fitting two additional parameters, and *ψ*.

While this alternate model provides a better fit to admixture proportions (Figure 8 shows the fit with and ), given the few populations, this good fit may reflect overparameterization of the model. Furthermore, a search for the best fit to the LD decay curves returned parameters that were effectively identical to the initial basic model proposed ( cline center ), indicating that this is not a likely alternative model (profile-likelihood curves for each fitted parameter are shown in Figure S8). Given the early estimated admixture date, it is possible that admixture across Central Asia is not a product of a single event as our models, and those of others (Hellenthal *et al.* 2014), assume, but rather a result of complex human migrations throughout time. Despite the limitations imposed on inference of parameters by the small number of populations, broad patterns of ancestry LD across space are nevertheless somewhat consistent with our proposed model of ancestry LD decay across space along an admixture gradient.

## Discussion

The generation and subsequent decay of admixture LD as an outcome of interbreeding between differentiated populations leaves a population genetic signature that is a valuable tool for understanding the nature and timing of admixture. Existing methods for modeling decay in admixture LD consider the expected rate of decay in one population at a time and often assume a simple one-time “pulse” of admixture without subsequent gene flow from neighboring admixed populations. Here, we have described a neutral model under which individuals diffuse across space after secondary contact. Based on this model, we derive an analytic expression for the expected decay in ancestry LD as a function of time since contact and a population’s position in space. We consider this an alternate model to one in which admixed populations are independently formed by a single-pulse event with potential subsequent gene flow from parental populations. This model is suitable for recent secondary contact, when the genomic signatures of admixture are detectable and the timescale of admixture is smaller than that of drift.

In contrast to previous analyses of spatial admixture that treated populations as independent admixture events (*e.g.*, Xu *et al.* 2012), we consider data from all sampled populations simultaneously to build a model that incorporates all available information and accounts for the movement of individuals between populations. Compared to the expression for ancestry LD derived here, a simple exponential model tends to underestimate the time since admixture, as it does not account for the introduction of long ancestral haplotypes from neighboring populations.

### Additional sources of covariance

In developing tractable approximations to spatial admixture contact zones we have ignored genetic drift and the genealogical structure imposed by the pedigree.

Genetic drift is not problematic if population densities, and dispersal rates, are high enough that coalescence between geographically close lineages is unlikely over the time since coalescence (as is likely the case in our human applications). Otherwise, a theoretical approach incorporating coalescence will be needed (see Barton *et al.* 2013 for recent progress). However, in that case, background LD and admixture LD will be on comparable genomic scales, making the job of separating the two much more challenging.

The other form of correlation structure that we have ignored is that imposed by the genealogy (Wakeley *et al.* 2012; Liang and Nielsen 2014a). When multiple crossovers during meiosis segment the stretch of chromosome we are considering, odd-numbered recombinant segments come from one parent, and even number segments come from the other parent; the result is that nonadjacent segments are found in the same parent and are hence nonindependent. This additional covariance from the pedigree structure does not affect our pairwise model of ancestry LD if *r* is strictly defined as a recombination fraction, as an odd number of recombinations between our pair of loci means that the two alleles are present in different parents in the preceding generation and thereafter follow independent trajectories back in time. Our block length calculations ignore this form of covariance, as we assume that fragments follow independent spatial paths backward in time after recombination events. This assumption will be problematic only for long regions (where more than one recombination can happen per generation) and for short time intervals (*i.e.*, small *τ*). However, in such cases, ignoring genetic interference may present a greater source of error than ignoring this additional source of covariance.

### Application of the model to human admixture data

We used our model to estimate ages of admixture events and dispersal distances, using genomic data from admixed human populations. To do this we fitted our expressions for ancestry LD to the output of weighted LD from ALDER, but similar information about ancestry LD can be obtained from other methods such as Chromopainter (Lawson *et al.* 2012).

Our spatial model provided a good fit to admixed populations along the Indonesian archipelago, consistent with a relatively straightforward history of admixture across space. Our estimated time of initial contact is somewhat consistent with the work of Xu *et al.* (2012) and is older than that reported by Lipson *et al.* (2014). Our deeper admixture time estimate likely reflects the fact that inference under single-population admixture models will produce estimates of timing of initial admixture that is more recent than estimates under our contact zone model. Our estimate of years ago is older than estimates obtained from linguistic analysis (Gray *et al.* 2009). This could be in part due to the simplifying assumptions of our model, which requires dispersal to be constant in time and space. One could imagine, for example, that if there were pulses of human movement followed by a slowing down of dispersal, this would affect our estimate. Finally, we note that the analysis on the Indonesian population was carried out on a relatively small number of SNPs (∼17,000). While increasing the number of SNPs would likely improve the analysis, we believe that this demonstrates the utility of this approach even on smaller data sets. It should be noted that the density of SNPs required will depend on the scale of LD present in the populations.

Our spatial model provided a poor fit to the Indian and Central Asian populations. This is likely due, in part, to deviations from a simple model of instantaneous removal of a barrier to contact and continuous diffusion thereafter. A better fit to the data is possible using separate “single-pulse” models for each population; this is unsurprising, given the number of additional parameters such a model uses.

In India, a complex population structure, a caste system, and potentially two waves of contact may have all contributed to difficulties in finding parameters that fit under our model. In particular, the need to separately estimate the intercept meant that there was relatively little information in the decay curves about the timing and mode of admixture. This is especially problematic for older admixture (particularly in the Dravidians), as there is relatively little admixture LD over larger scales and consequently much of our information relies on LD over short genetic distances ( cM). Given this paucity of information, it is likely that many, and quite different, admixture models would fit these data nearly equally well. As such, our fit and estimate of timing, and indeed the estimates under alternate models, should be interpreted with caution.

The limited number of populations in Central Asia with signals of Iranian and Mongolian admixture places a limit on the confidence for the fit to the data under any dispersal model. Furthermore, it is known that human movement in the region spans many centuries and is unlikely to be simple. While earlier attempts to date admixture in these populations estimate admixture times of generations, corresponding to the Mongolian invasions (Hellenthal *et al.* 2014), our estimated time is much older, at generations. It is unlikely that our demographic model is a good approximation to historical human movement in the area, and this is likely to have affected our inference. However, it is possible that our estimate of earlier admixture is in part reflecting older human movements in the region, and this is in part supported by the findings of Yunusbayev *et al.* (2015).

While our model may not provide a great fit to these human contact zones, our results highlight the sensitivity of the age of admixture estimates to the model used. The continuous and potentially complicated mixing of individuals in a contact zone, especially close to the center of the zone, means that the decay of admixture curves can be much slower than the age of the zone would suggest. Therefore, in many cases it may be difficult to know exactly when admixture began.

### Extensions of the simple neutral model and other applications

Our relatively simple expressions describing ancestry LD depend on assuming Brownian movement and on ignoring genetic drift and pedigree structure. The examples of human admixture zones provided above indicate, however, that alternative models may be needed to describe patterns of LD, given different demographic scenarios. Because of the simplicity of our model, modifications can be made with relative ease to describe different geographic scenarios. For example, we were able to apply a model in which the movement of Mongolian genotypes began as a pulse of migrants, followed by diffusion. In a similar vein, one could modify movement to contain a Brownian drift parameter to account for directional migration, although this would require some consideration of how the dispersal kernel of an admixed individual is determined. Discrete deme models could also be used (as we develop in *Appendix B*) to model complex histories of populations in geographic and temporal heterogeneity. However, in practice there is not enough information in admixture LD decay curves to infer detailed population histories with many parameters.

We have demonstrated that inference of admixture parameters can be greatly influenced by the choice of demographic model. This highlights the need for more admixture models to be developed to test with population genomic data and for careful consideration of which model is appropriate for a given biological scenario. The model presented here makes some progress toward addressing the movement of admixed individuals and presents a potential framework for future development of dispersal models. As a final point, we note that all (to our knowledge) admixture models to date, including ours, assume that populations undergo differentiation in relative isolation prior to secondary contact. Under this assumption, there is a strong appeal to fit pulse models (such as a wave of secondary contact) to human admixture data, with the goal of estimating the timing of a pulse and relating it to particular historical events. It seems that perhaps a more appropriate null model in these scenarios would be one in which gene flow has been ongoing between populations, but at a rate slow enough to allow some differentiation to occur. Testing for patterns of LD under this isolation-by-distance (or isolation–migration) model would be a first step toward understanding the demographic history of spatially distributed populations, and the development of such a null model seems an important step in creating future tools for population genomic inference.

As mentioned above, LD has been used to characterize hybrid zones (*e.g.*, Szymura and Barton 1986; Mallet *et al.* 1990; Wang *et al.* 2011), and we see our framework as a potential null model for spatial models of secondary contact, whereby incipient species come back into contact. Although tension zones can maintain distinct species, reproductive isolation is often weak enough to allow diverged populations to exchange alleles. In such scenarios, patterns of diversity that depart from expected ancestry LD could be used to detect potential targets of selection relevant to speciation or local adaptation. It should be noted, however, that good estimates of decay in ancestry LD require reliable genetic maps, as overestimates of genetic distance may give the appearance of a slower rate of decay by inflating LD; this may be a limiting factor in many systems.

The LD induced by the admixing of two differentiated populations provides a powerful population genetic signal that has been measured with genome-wide data to inform the timing of historical admixture events. Building on models that use the patterns of LD to infer admixture dates under scenarios with discretized migration events, we have developed a novel framework that accounts for continuous movements of haplotypes through time and space. We believe that this can serve as a reasonable basic model for understanding patterns of diversity in contact zones. Furthermore, we see potential for this model to be further developed and tailored to fit a range of demographic scenarios, including those that incorporate selection.

## Acknowledgments

We thank P. Moorjani, G. Hellenthal, and M. Metspalu for access to data and Simon Aeschbacher, Alison Etheridge, Jeremy Berg, Gideon Bradburd, and Ivan Juric for helpful conversations and comments. We thank Nick Barton, Nick Patterson, and an anonymous reviewer for comments on an earlier draft of this article. This work was supported by the National Science Foundation (NSF) Graduate Research Fellowship under Grant No. (1148897), by the NSF under grant 1262645 to P. Ralph and G. Coop, and by the National Institute of General Medical Sciences of the National Institutes of Health (NIH) under award nos. NIH RO1GM83098 and RO1GM107374 awarded to G. Coop.

## Appendix A: Covariance in Ancestry

By integration by parts, Equation 3 becomes

where is the bivariate Gaussian density for jointly distributed with correlation *t*. The second term of (A1) is (A2)For the third term of (A1), we can utilize the useful identity that for a bivariate Gaussian with variances 1 and correlation *t* (Pearson 1901), (A3)and so the last term of (A1) becomes (A4)Combining Equation 2 and (A1), (A2), and (A4) therefore leaves us with

## Appendix B: Island Model

In a discretized time and space model, with *n* islands of equal, constant size and per-generation migration rates defined by the matrix *M*, the motion of a single lineage is described by a discrete-time Markov chain, and so the expected frequency of loci inherited from ancestry *B* in population *X* is

where *τ* is the number of generations since admixture began, *S* is the set of demes that are defined as being ancestry *B* at the time of contact, and is the element of the matrix power of *M*. The covariance is derived by summing over possible recombination times and the location of the allele at the time of recombination (*N* is the set of all locations):

Note that *r* is the probability of any odd number of recombinations occurring, *i.e.*, the probability that a Poisson random variable with mean *d* is odd.

## Appendix C: Unlabeled Rooted Topologies and Their Probabilities

Given that we condition on the number of recombination events within a time interval, the events are uniformly distributed both through time and along a chromosome of unit length. The most recent recombination back in time is therefore the minimum time for *k* uniformly distributed events. This first recombination event (in time) splits the chromosome into two products, which become two subtrees, with *j* and recombinations, respectively. The two subtrees are independent and have similar properties to those of the whole chromosome, so the process of bifurcation can be iterated. The first recombination splits the remaining ones uniformly, such that the probability of generating one subtree of size *j* is where when and otherwise.

Thus, for the set of unlabeled topologies with leaves, a topology is generated by joining with at the root (*m* and *n* are arbitrary). Because each subtree is independent, the probability, of topology conditioning on *k* recombination events, can be obtained by the product of the probabilities of each subtree and the probability of generating two subtrees of sizes *j* and

where is the topology made by joining topologies and at the root. Here *s* indicates the symmetry of such that if the tree is symmetric (*i.e.*, the two subtrees and are the same) and otherwise, so that gives the probability that the first recombination event in produces two subtrees of the required sizes.

When integrating over the set of all trees conditioning on a topology, we need to multiply by the probability density of the recombination times given the topology. We start by considering the probability of the first recombination event at the root of a -tipped tree, which is the first order statistic of *k* independent and uniformly distributed events over the interval The probability density of this first recombination given *k* total recombinations is therefore

Similarly, the timing of the *j*th node is the first-order statistic within the subset of recombination events that generate the subtree of which it is the root. The *j*th node, from which nodes are descended, which itself is directly descended from the parent node thus has the probability density

The independence of each subbranch allows us to compute the probability density of all recombination events as

Note that Equation C2 is the Beta distribution scaled to lie on the interval

## Appendix D: Obtaining Block Length Distributions by a Branching Brownian Motion

Here we describe an alternative approach to finding an expression for the probability that an entire region of length *d* is of ancestry *B*, of Equation 5, without conditioning on the number of recombination events. The process of recombination and dispersal described above is analogous to a branching Brownian motion (BBM), where recombination is represented by a splitting event. In standard BBM, each lineage has the same rate of splitting, but here the total length of the chromosome is constant, and so we have conservation of the total rate of splitting *d*. The rate of splitting on a lineage decreases with each recombination event, as both products of recombination are shorter (and therefore have a smaller probability of recombination).

Below, we derive an integro-differential equation satisified by *U*, similar to the classic analysis of branching Brownian motion by McKean (1975). Starting in the present, we follow a single lineage backward in continuous time. The movement of this lineage is Brownian with variance We model recombination events between the two loci as a Poisson process with rate *d*. At the first recombination event, we generate a uniform random variable, to represent the genomic position of the recombination event. We then split the sequence into left and right fragments— and respectively. Following this, the two linages move independently backward in time with respective recombination (splitting) rates of and This process is iterated over the time period *τ*.

We consider moving back a very short time interval from the present and take the expectation over the random events that could have occurred in that time interval. (In other words, we are writing down the infinitesimal generator of this Markov process.)

With probability there is no recombination during the interval and conditioning on this, we have only to take the expectation over the small random change in spatial location during this time,

where is the expectation over all changes in position *X*.

A recombination event occurs in the interval with probability Conditioning on recombination occurring at time at position producing two recombinants of length and (D2)where is the probability that all subsequent recombinants along the chromosomal fragment of length are of ancestry type *B*.

Combining (D1) and (D2) and taking obtains (D3)with boundary conditions for and for This differential equation is solved by defined in Equation 5, and is the probability that at time *τ* in the past, the leftmost branch of this branching process initiated at position is at a position This differential equation is related to that presented by Baird *et al.* (2003) to describe the survival of genomic blocks within a panmictic population (but the latter does not have a spatial diffusion term). The equation is similar to the Fisher-KPP equation, with differences arising from the nonconstant splitting rate. The first term of Equation D3 reflects the spatial diffusion of lineages, and the second term reflects the splitting of blocks of length *d* by recombination into two shorter blocks (of size and ) that each must be of type *B*.

## Appendix E: Invasion Pulse

Because it is unlikely that Mongolian movement during the century was Brownian, we construct an alternative model for Mongolian movement, in which individuals of Mongolian ancestry (ancestry *B*) initially invade and displace some proportion *ψ* of the resident population over some geographic space We assume that most of the distance between and the source Mongolian population has been invaded in this way. This invasion occurs instantaneously at time *τ*, such that the frequency of ancestry *B* at time *τ* is

Following the model presented in *Materials and* *Methods*, we assume that a lineage that is found at position at time *τ* has probability *ψ* of having ancestry *B*. The probability that a single lineage sampled at location *ℓ* at is given by

For a lineage that has recombined, for both lineages to have ancestry *B*, both lineages have to be found in at time *τ*, at which point there is a chance of both having ancestry *B*. Using the approach taken in *Appendix A*, we can obtain the following expression for ancestry LD at position *ℓ*:

## Footnotes

*Communicating editor: N. H. Barton*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179838/-/DC1.

- Received June 25, 2015.
- Accepted July 18, 2015.

- Copyright © 2015 by the Genetics Society of America