## Abstract

We study how a block of genome with a large number of weakly selected loci introgresses under directional selection into a genetically homogeneous population. We derive exact expressions for the expected rate of growth of any fragment of the introduced block during the initial phase of introgression, and show that the growth rate of a single-locus variant is largely insensitive to its own additive effect, but depends instead on the combined effect of all loci within a characteristic linkage scale. The expected growth rate of a fragment is highly correlated with its long-term introgression probability in populations of moderate size, and can hence identify variants that are likely to introgress across replicate populations. We clarify how the introgression probability of an individual variant is determined by the interplay between hitchhiking with relatively large fragments during the early phase of introgression and selection on fine-scale variation within these, which at longer times results in differential introgression probabilities for beneficial and deleterious loci within successful fragments. By simulating individuals, we also investigate how introgression probabilities at individual loci depend on the variance of fitness effects, the net fitness of the introduced block, and the size of the recipient population, and how this shapes the net advance under selection. Our work suggests that even highly replicable substitutions may be associated with a range of selective effects, which makes it challenging to fine map the causal loci that underlie polygenic adaptation.

The extent to which phenotypic and genetic changes are replicated during adaptation across closely related populations has generated much interest (Conte *et al.* 2012; Storz 2016), and is part of the broader question of the predictability of evolutionary change (Lässig *et al.* 2017). Parallel evolution can be investigated at multiple scales, and may refer to the involvement of the same nucleotides, the same genes, or even the same pathways during adaptive responses in different populations (Manceau *et al.* 2010). An important challenge is to interpret highly replicable genetic loci; do such loci necessarily make large contributions to the selected trait? Conversely, for different architectures of the selected trait, how often are the same genetic loci implicated across replicate populations?

Several factors can influence the replicability of allele frequency changes or adaptive substitutions at individual loci (Stern 2013). More shared variation between replicate populations makes it more likely that the same variants respond to selection across different replicates, but the response is then associated with a weaker reduction in neutral diversity [*i.e.*, “soft sweeps”; see Hermisson and Pennings (2005)]. By contrast, adaptive substitutions that arise *de novo* are rarely shared across replicates, but can be more easily identified as being under selection because of the associated hard sweep patterns.

Factors influencing genetic parallelism at the level of individual variants remain poorly understood for selection on highly polygenic traits, despite several studies with natural and laboratory replicates (Burke *et al.* 2010; Chan *et al.* 2012; Yeaman *et al.* 2016). Polygenic traits are often characterized by high genotypic redundancy, with multiple genotypes corresponding to approximately the same phenotype. For traits under stabilizing selection, a shift in the selection optimum can thus result in a highly heterogeneous response at the genotypic level, with selection amplifying initial random fluctuations in allele frequencies, to produce quite different outcomes across replicates, even if these initially share the same variation.

In general, adaptation in complex traits involves a large number of partially linked and weakly selected variants, and is thus characterized by pervasive hitchhiking, where effectively neutral or deleterious variants are swept to high frequencies along with clusters of positively selected variants (Barton 1995). While the effects of hitchhiking on neutral diversity have been studied extensively in the theoretical literature, most of these studies assume that selected variants are either all deleterious or all beneficial in a given population (*e.g.*, Barton and Bengtsson 1986). Moreover, most empirical studies also separately consider the effect of positive *or* background selection on patterns of diversity, divergence, or introgression [though see Elyashiv *et al.* (2016) for an investigation of the joint effects of the two]. However, a typical genome is likely to harbor many linked variants with a range of (positive and negative) selective effects. The efficacy of selection in discriminating between multiple, tightly linked beneficial and deleterious loci is an essential determinant of the total phenotypic response to selection in a finite population (Robertson 1970, 1977). It is also key to assessing whether an allele frequency change (even one that occurs across replicates) is itself adaptive and not a consequence of hitchhiking.

The extent of hitchhiking (and the associated patterns of neutral diversity) is strongly influenced by the density of selected polymorphisms on the genome. For example, when the rate of deleterious mutations is higher than the typical selective effect per mutation, polymorphic variants are sufficiently common that neutral diversity no longer depends on the selective effects of individual variants (Good *et al.* 2014). This results in a qualitatively different shape of the neutral site frequency spectrum, which cannot be predicted by standard models of background selection (*e.g.*, Nordborg *et al.* 1996), but is better described by an *infinitesimal* framework, which is parametrized by the variance in population fitness (or the variance per unit map length), rather than the variance of fitness effects at individual loci (Neher *et al.* 2013).

In this paper, we focus on a similar scenario of a complex trait determined by a large number of weakly selected variants uniformly spread across the genome [see also Sachdeva and Barton (2018)]. The main goal is to explore how the interplay between linkage, polygenic selection, and genetic drift shapes the long-term introgression probability of different fragments of such a genome when it is introduced into a genetically homogeneous recipient population. For simplicity, we consider the introgression of a medium-sized block of map length rather than the whole genome. This allows us to ignore certain complications, such as multiple crossovers within the selected region, in analytical calculations. It is also representative of a scenario where a genome that is repeatedly back-crossed into a recipient population breaks up into several medium-sized segments that evolve more or less independently, while they are rare in the recipient population. Most importantly, as shown below, the introgression probability of any variant depends primarily on the effect of variants within a characteristic linkage scale, and not on loosely linked regions, as long as the introduced genome has *no net* selective effect in the recipient population. This again suggests that studying the introgression of medium-sized blocks can provide useful insight into the more general case.

The introduced block is assumed to carry a large number *L* of loci with a range of selective effects, uniformly spaced over map length These loci contribute to an additive trait that is under directional selection in the recipient population. The trait value associated with any portion of the block is then just the sum of effects of all the loci it contains. The effect sizes of loci on the introduced block are drawn from a distribution with mean μ and variance the effect sizes of variants fixed in the recipient population can be set to zero without loss of generality. For simplicity, we assume that μ and do not vary across the introduced block. However, most of our analytical results hold even when this condition is not met, for instance, if there is a statistically significant clustering of large-effect variants in the donor genome.

If the donor population is in linkage equilibrium, such that allelic states of different loci along the introduced block are statistically uncorrelated, then different segments of this block (each with loci) have random (and typically unequal) contributions, with mean and variance Note that the variance of these contributions, or more generally the variance per unit map length, given by is the same irrespective of whether the block contains a few, large-effect loci (small *L* and large ) or many, small-effect loci (large *L* and small ). Thus, varying *L* while keeping and constant tunes the extent to which individual loci interfere with each other within fragments with a given map length and total selective effect. If the number of loci is very large and the distribution of allelic effects is correspondingly narrow—*i.e.*, in the limit and with and held fixed—this model approaches the *infinitesimal model with linkage* [see Robertson (1977)]. This model is similar to the well-known infinitesimal model (Bulmer 1980; Barton *et al.* 2017), but also accounts for linkage by considering loci on a linear map, rather than unlinked loci.
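The scaling argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the names `segment_moments`, `n_loci`, `frac`, `mu`, and `sigma2` are hypothetical and are not the paper's notation, and the only assumption is that effects at different loci are independent draws (linkage equilibrium) at evenly spaced loci.

```python
def segment_moments(n_loci, frac, mu, sigma2):
    """Mean and variance of the summed effect of a segment.

    Assumes independent effects (linkage equilibrium) at evenly spaced loci;
    `frac` is the fraction of the block's map length covered by the segment.
    """
    k = n_loci * frac            # number of loci in the segment
    return k * mu, k * sigma2    # moments of a sum of k independent effects


# Few large-effect loci vs. many small-effect loci, holding the block-wide
# totals (n_loci * mu and n_loci * sigma2) fixed. Values are hypothetical.
total_mean, total_var = 0.0, 0.04
few = segment_moments(100, 0.1, total_mean / 100, total_var / 100)
many = segment_moments(10_000, 0.1, total_mean / 10_000, total_var / 10_000)
```

Both calls give the same segment variance, so under these assumptions changing the number of loci at fixed totals changes only how finely that variance is divided among loci.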

In this study, we explore the effects that shape the introgression dynamics of different segments of a *particular* introduced block, in contrast to our previous work [see Sachdeva and Barton (2018)], which analyzed statistical averages associated with an *ensemble* of such blocks (all characterized by the same net trait value and the same genic variance per unit map length ). We focus here on the following two questions. First, what is the expected *initial* growth rate of a fragment or locus embedded within the introduced block, under directional selection? Is the growth rate of an individual variant influenced more by its own effect or the effects of linked loci within a characteristic map distance? Second, to what extent do expected initial growth rates predict the ultimate fixation probability of different fragments in a finite population?

The distinction between initial and long-term introgression is related to the qualitatively different role played by recombination during these two phases. Recombination primarily breaks the introduced genome into smaller fragments during the initial phase, but as introgressing fragments become more abundant in the population, recombination can also bring together various successful fragments, countering Hill–Robertson interference. Thus, over longer time scales, recombination can uncover fine-scale variation within successful fragments (provided these have not fixed), allowing selection to discriminate between tightly linked variants. In a very large population, this would allow selection to fix all positively selected variants and eliminate all deleterious variants within introgressing fragments that have survived the initial phase. In a smaller population, only part of this fine-scale discrimination can be achieved, which constrains the net response to selection. An important focus of our work is to explore how the ultimate fate of variants in a finite population is determined by the interplay between the early dynamics characterized by selection on relatively large genomic fragments and long-term dynamics governed by selection on fine-scale variation within these. This clarifies the conditions under which individual variants establish with high probability (and hence are likely to be replicated across populations that receive the same genome). Identifying the distribution of effects associated with highly replicable loci, even within this simple model, can shed light on the limitations of fine mapping the causal loci that underlie polygenic adaptation.

The expected growth rates of genomic segments during the initial phase of introgression are relatively easy to calculate, since introgressed fragments are rare, and hence unlikely to encounter each other during mating. Thus, any genome carries at most one fragment of the introduced block, which leads to explicit analytical expressions for the initial growth rate of fragments and for the probability of survival of at least some part of the introduced block. We then study long-term introgression by simulating individuals, and illustrate the correlation between initial growth rates and long-term fixation probability by means of a representative example. We then describe how introgression probabilities at individual loci depend on the variance of fitness effects, the total trait value associated with the introduced block, the size of the recipient population, and the extent of initial stochasticity (which is tuned in this model by introducing single *vs.* multiple identical copies of the block). The extent to which selection and recombination can pick out and amplify individual favorable variants is a basic determinant of the net response to selection, when selected variants are tightly linked. We address the question of selection limits, and their dependence on population size and linkage, at the end of the paper.

## Methods

Consider a situation where *identical* copies of a block of genome are introduced into a very large population of diploids at *t* = 0. Any diploid individual is assumed to carry at most one copy of the introduced block. The block has *L* loci, which are assumed to be evenly spaced on the genetic map, with recombination rate *c* between adjacent loci. The extension to unequal rates of recombination between loci is straightforward. The additive effect of the locus is denoted by The fitness of the block is multiplicative across loci and is given by

### Initial introgression into a large population

To derive analytical results for the initial dynamics of block fragments, we assume that the introduced block, with map length is short enough that multiple crossovers can be neglected. When the introduced block and its descendant fragments are sufficiently rare in the population (as expected during the initial phase), then the probability of recombination between fragments is negligible and any individual inherits at most one introgressed fragment. Therefore, we need only consider single fragments spanning loci with selective effect which break up by recombination with rate For weak selection and recombination the expected numbers of different fragments change approximately continuously through time, according to a set of linear equations:(1)The first term is the rate at which a fragment spreads intact (without being split), and is positive when the fragment is amplified by selection (at a rate proportional to its fitness ) faster than it is split by recombination (rate proportional to map length ). The second and third terms are due to generation of the fragment by recombination from larger blocks (in which it is embedded).

Equation 1 can be solved explicitly by taking the Laplace Transform, solving for the expected number of single recombinants and and then solving for expected numbers of smaller blocks. The solution is a sum of exponentially decaying (or increasing) components:(2)where is obtained by replacing *c* by in *A* and by replacing by in *B*. Also, , , and are all equal to 1 for m>k.

The sum is over all blocks (with ) that contain the fragment including itself. For large *t*, the exponential term with the largest growth rate will dominate the dynamics of the block. Thus, at long times, any block grows at the same rate as the fastest growing parent block from which it can be generated. Note that if the intrinsic growth rate of the block is larger than the growth rate of any parent block containing it, then the long-term growth rate of the block is just its own intrinsic growth rate. An important consequence of Equation 2 is that deleterious fragments may also spread through the population, if generated at a constant rate by a beneficial, exponentially growing parent block. In fact, in this case, the deleterious subblock will grow at the same rate as the beneficial parent block, as is evident from Equation 2. Thus, each such exponentially increasing block generates various descendant subblocks of varying fitness, giving rise to a family of blocks, akin to a quasi-species (Eigen *et al.* 1988).
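The claim that a deleterious subblock eventually grows at the rate of its beneficial parent can be illustrated numerically. The sketch below is a minimal illustration, not the authors' code. It assumes one reading of the verbal description of Equation 1: a fragment spanning loci *i* to *j* has intrinsic rate equal to the sum of its effects minus *c* times its number of internal junctions, and it is generated at rate *c* from each larger block that shares one of its end points. The effect sizes, the value of `c`, and all function names are hypothetical.

```python
import math


def fragments(n_loci):
    """All contiguous fragments (i, j), 0-indexed and inclusive."""
    return [(i, j) for i in range(n_loci) for j in range(i, n_loci)]


def intrinsic_rate(effects, c, i, j):
    """Assumed form: summed effect minus rate of being split by recombination."""
    return sum(effects[i:j + 1]) - c * (j - i)


def integrate(effects, c, t_end, dt=0.1):
    """Forward-Euler integration of the assumed linear system, starting from
    one copy of the full block. Returns the expected numbers of each fragment
    before and after the final step."""
    n_loci = len(effects)
    frags = fragments(n_loci)
    n = {f: 0.0 for f in frags}
    n[(0, n_loci - 1)] = 1.0
    prev = dict(n)
    for _ in range(int(round(t_end / dt))):
        prev = n
        new = {}
        for (i, j) in frags:
            # Larger blocks ending at j (left part recombines away) ...
            gain = sum(prev[(k, j)] for k in range(i))
            # ... and larger blocks starting at i (right part recombines away).
            gain += sum(prev[(i, m)] for m in range(j + 1, n_loci))
            new[(i, j)] = prev[(i, j)] + dt * (
                intrinsic_rate(effects, c, i, j) * prev[(i, j)] + c * gain)
        n = new
    return prev, n


# Hypothetical three-locus block with a deleterious middle locus.
effects, c, dt = [0.04, -0.02, 0.06], 0.005, 0.1
prev, last = integrate(effects, c, t_end=600.0, dt=dt)
rate_deleterious = math.log(last[(1, 1)] / prev[(1, 1)]) / dt
rate_block = intrinsic_rate(effects, c, 0, 2)
```

In this example the middle locus has a negative intrinsic rate, yet its expected number ends up growing at essentially the rate of the full block, which is the fastest-growing block that contains it.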

Figure 1A depicts the long-term growth rates of all possible 55 fragments of a block with 10 loci, while Figure 1B shows the expected dynamics of these fragments. Note the spread of different families of subblocks at different rates, where each such family (represented by lines of one color in Figure 1B) consists of subblocks that are all contained within a particular, fast-growing parent block. After an initial transient phase, all subblocks within a family increase at the rate of growth of this parent block. In the following, we approximate the growth rate of any fragment by the rate associated with the fastest-growing term in Equation 2, *i.e.*, by the intrinsic growth rate of the fastest-growing parent block that contains this fragment. We refer to this as the expected growth rate of the fragment, since it describes the dynamics of the expected number of copies of the fragment in the population, averaging over all possible stochastic histories, including those in which the fragment is lost.
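The approximation described above (the expected growth rate of a fragment is the intrinsic rate of the fastest-growing block containing it) is straightforward to compute. The following sketch assumes, as an illustration rather than as the paper's exact expression, that the intrinsic rate of a fragment is its summed effect minus *c* times its number of internal junctions; the effect sizes and function names are hypothetical.

```python
def intrinsic_rate(effects, c, i, j):
    """Assumed form: summed effect of loci i..j minus c per internal junction."""
    return sum(effects[i:j + 1]) - c * (j - i)


def expected_growth_rates(effects, c):
    """Map each fragment (i, j) to the largest intrinsic rate among all
    blocks (k, m) with k <= i and m >= j, i.e., all blocks containing it."""
    n_loci = len(effects)
    rates = {}
    for i in range(n_loci):
        for j in range(i, n_loci):
            rates[(i, j)] = max(
                intrinsic_rate(effects, c, k, m)
                for k in range(i + 1)
                for m in range(j, n_loci))
    return rates


# Hypothetical effects for a block with 10 loci.
effects10 = [0.02, -0.01, 0.03, -0.02, 0.01, 0.04, -0.03, 0.02, -0.01, 0.01]
rates10 = expected_growth_rates(effects10, c=0.01)
n_fragments = len(rates10)  # L * (L + 1) / 2 contiguous fragments
```

For 10 loci the dictionary has 55 entries, matching the number of fragments shown in Figure 1A. By construction, the expected growth rate of a fragment is never smaller than its own intrinsic rate.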

The key assumption underlying Equation 1 is that descendants of the introduced fragment form a negligible fraction of the population, and thus never encounter each other during mating. Then, the spread of introgressed genetic material through the population can be formulated as a *branching process* (BP) that treats the (forward in time) lineages of different descendants as being independent of each other, but having a common dependence on (Sachdeva and Barton 2018). The BP framework is very powerful and can, in principle, yield the full distribution of introgressing fragments as a function of time. Here, we write down the equation for the probability that at least *some* part of a block spanning loci survives at long times, given that it is initially introduced as a single copy. To work with dimensionless quantities, we scale both selection and survival probability relative to the map length: and The long-term scaled survival probability of a block satisfies:(3)This equation is of the form where the first term *f* is due to the survival of smaller subblocks and the coefficient *a* of the second term is the intrinsic growth rate of the full block. This equation is analogous to equation 3 in Barton (1995), but has an additional driving term *f*. Solving the quadratic, we have:(4)where This is easily evaluated, because for larger blocks depends only on that for smaller blocks. Note that blocks with a negative net rate of increase can survive (in part), if they contain smaller blocks with positive growth rates, such that If the block is introduced in copies, then the survival probability of each such block is roughly independent of other blocks, as long as is much smaller than the size of the recipient population. Then, the overall survival probability is just
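The independence argument for multiple copies can be written out explicitly. The notation here is assumed for illustration and may differ from the paper's: let $P$ be the probability that at least some part of a single introduced copy survives, and let $n_0$ be the number of copies introduced. If the copies survive or are lost independently, the probability that at least one leaves surviving descendants is

$$
P_{\text{any}} = 1 - (1 - P)^{n_0}.
$$

As a purely hypothetical numerical example, $P = 0.05$ and $n_0 = 40$ give $P_{\text{any}} = 1 - 0.95^{40} \approx 0.87$, so even a block that is usually lost as a single copy is likely to leave descendants when introduced in a few tens of copies.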

### Long-term introgression

Equation 1 is a valid description of the dynamics of genomic fragments only while any individual carries at most one such fragment. As shown in Sachdeva and Barton (2018), this is true over a time scale that scales weakly with the size *N* of the recipient population. Beyond this initial time scale, mating between individuals carrying introgressed material becomes more frequent and genomes carrying multiple fragments of the introduced block emerge. This phase is thus characterized by competition between genomes bearing different mosaics of the fragments that have established and proliferated in the initial phase, and is no longer described by Equation 1.

To study long-term introgression, we simulate populations with *N* diploid individuals. The simulations are initialized by assuming that haploid genomes in the population carry identical introduced blocks, while all other genomes carry identical “native” blocks. For simplicity, we assume that each of the introduced blocks is present in a different individual. Each block has *L* loci, with rate of recombination *c* between adjacent loci. The allelic effect of each native locus is equal to zero, while effects of the introduced loci are drawn from a distribution with variance and mean μ. The trait value *S* associated with any individual is then just the sum of effects of all introduced variants in its diploid genome; its fitness is

In each generation, parents are drawn by randomly sampling individuals in proportion to their fitness. Each parent produces a gamete with recombination between parental haplotypes: the number of crossover points is drawn from a Poisson distribution with mean the locations of the crossover points are chosen by uniformly sampling one of the junctions between loci without replacement. The gametes are then paired to form *N* individuals of the next generation.
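The life cycle described above can be sketched in a few functions. This is a minimal illustration and not the authors' Fortran 95 code. Two ingredients are assumptions: fitness is taken to be the exponential of the trait value *S*, and the mean number of crossovers is taken to be *c* times the number of junctions. The sketch also allows the two gametes of an offspring to come from the same parent. All function names and parameter values are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Poisson draw by Knuth's multiplication method (fine for small means)."""
    limit, k, p = math.exp(-mean), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def make_gamete(rng, hap_a, hap_b, c):
    """Recombinant gamete from two parental haplotypes (lists of 0/1 alleles)."""
    n_junctions = len(hap_a) - 1
    # Assumed mean number of crossovers: c per junction, capped at the junction count.
    n_cross = min(poisson(rng, c * n_junctions), n_junctions)
    # Junctions are sampled uniformly without replacement; a cut at position
    # `pos` lies between loci pos - 1 and pos.
    cuts = set(rng.sample(range(1, len(hap_a)), n_cross))
    parents = (hap_a, hap_b)
    current = rng.randrange(2)  # haplotype on which the gamete starts
    gamete = []
    for pos in range(len(hap_a)):
        if pos in cuts:
            current = 1 - current
        gamete.append(parents[current][pos])
    return gamete


def next_generation(rng, population, effects, c):
    """One generation: fitness-proportional parent sampling, then recombination."""
    # Assumed fitness: exp(S), with S the summed effect of introduced alleles.
    weights = [
        math.exp(sum(e * (a + b) for e, a, b in zip(effects, hap_a, hap_b)))
        for hap_a, hap_b in population
    ]
    offspring = []
    for _ in range(len(population)):
        mother, father = rng.choices(population, weights=weights, k=2)
        offspring.append(
            (make_gamete(rng, *mother, c), make_gamete(rng, *father, c)))
    return offspring


# Hypothetical example: 10 diploids, 2 of which carry one introduced block.
rng = random.Random(1)
effects = [0.02, -0.01, 0.03, 0.01]
introduced, native = [1] * 4, [0] * 4
population = [(introduced, native)] * 2 + [(native, native)] * 8
population = next_generation(rng, population, effects, c=0.1)
```

Haplotypes are stored as lists with 1 marking an introduced allele and 0 a native one, so the trait value of an individual is the effect-weighted count of introduced alleles across its two haplotypes.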

We compute introgression probabilities by performing 200–400 replicate simulations, each initialized by introducing identical copies of the same block at *t* = 0. Thus, replicate populations differ only in the stochastic history of reproduction and recombination. The introgression probability of a particular fragment at time *t* is calculated as the fraction of replicates in which the fragment is present *either by itself or as part of a larger block*. Since any fragment must either fix or be completely eliminated at long times, the introgression probability of a fragment in the large-*t* limit is just its fixation probability. A fragment with fixation probability *P* would be found in both of two replicate populations with probability *P*²; thus, any fragment with high introgression probability is also a replicable fragment. Note that the introgression probability is not the same as the survival probability, which was calculated in Equation 4. The latter refers to the probability that the block introduced as a single copy at *t* = 0 is not fully lost from a very large population, but has at least one surviving descendant subblock.
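The bookkeeping behind these definitions can be sketched as follows. This is an illustrative simplification: each replicate is summarized by the set of locus indices at which the introduced allele is present (for example, fixed at long times), and the data and function names are hypothetical rather than taken from the simulations.

```python
def introgression_probability(replicates, start, end):
    """Fraction of replicates in which every locus in start..end (inclusive)
    carries the introduced allele, whether alone or within a larger block."""
    hits = sum(
        all(pos in present for pos in range(start, end + 1))
        for present in replicates)
    return hits / len(replicates)


def replicability(p, n_populations=2):
    """Probability that a fragment with introgression probability p is found
    in all of n_populations independent replicate populations."""
    return p ** n_populations


# Hypothetical outcomes of four replicate populations.
replicates = [{2, 3, 4}, {3, 4}, set(), {1, 2, 3, 4, 5}]
p_short = introgression_probability(replicates, 3, 4)   # loci 3..4
p_long = introgression_probability(replicates, 2, 4)    # loci 2..4
shared = replicability(p_short)
```

Because a fragment counts as present whenever a larger block containing it is present, a shorter fragment can never have a lower introgression probability than a longer fragment that contains it.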

### Data availability

The Fortran 95 code used to generate the simulated data is available at: https://git.ist.ac.at/himani.sachdeva/source_codes_replicability_introgression_patterns/snippets.

## Results

To gain insight into how the ultimate fixation probability of different parts of the introduced genome is shaped by the early dynamics (as encapsulated by the expected initial growth rates) *vs.* long-term dynamics (characterized by selection on multiple, linked successful fragments), we first analyze one representative example in detail. This illustrates key features of the introgression process, which hold more generally.

### Introgression probabilities of different fragments of the genome: an example

Consider an introduced block with loci, uniformly spaced over map length The contributions of different loci are drawn from a normal distribution with mean 0 and variance using an iterative scheme that ensures that the contributions sum to [see also Sachdeva and Barton (2018)]. Figure 1C depicts all fragments of this block that have a positive expected growth rate. Note that these fragments are quite small; longer fragments of a nearly neutral introduced block are split by recombination faster than they are amplified by selection, and hence have negative expected growth rates.

In this example, we simulate a diploid population of size with initial frequency of the introduced haplotype, such that there are 40 introduced blocks in the population at *t* = 0 and, hence, 40 approximately independent realizations of the initial introgression process within any one population. The dynamics of early introgression in such a population would thus be close to the prediction of Equation 2. By contrast, with a single introduced block all or some of the introduced genome, including beneficial fragments, may be lost from the population while present in low numbers, which causes the dynamics of individual populations to differ markedly from the expected dynamics. The parameter thus governs the extent to which the initial dynamics of any individual population is deterministic.

Figure 2, A and B show the introgression probability of different fragments of the introduced genome *vs.* their expected growth rates (as calculated above) for two different time instants: (Figure 2A) and (Figure 2B). Each point represents a particular fragment; the color of the point encodes the length of the fragment or, alternatively, the number of equally spaced selected loci that it contains (see accompanying color scale). The introgression probability of any fragment is the fraction of replicates in which the fragment is present, considering only those replicates in which at least some part of the introduced block survives at long times. We only show introgressing fragments with positive expected growth rates, as fragments with negative growth rate fail to introgress in the long run within a large population.

The introgression probabilities of different fragments at are highly correlated with their expected growth rates (correlation coefficient ), while the correlation is somewhat weaker but still nonzero at The weaker correlation at longer times is due to substantial differences in the long-term introgression probabilities of different fragments that have the *same* expected growth rate (note the much higher scatter along the *y*-axis in Figure 2B relative to Figure 2A). Note that we use the correlation coefficient purely as a descriptive statistic: thus, an *r* value of 0.8 implies that 64% of the total variance of introgression probabilities (*r*² = 0.64) can be explained by a linear regression with the expected growth rates. However, this does not provide a complete description, since the introgression probabilities are not normally distributed about the linear prediction.

Figure 2B suggests that the introgression of shorter fragments is more replicable than that of longer fragments within any family of fragments with the same expected growth rate; note the higher introgression probabilities associated with black as compared to blue points within each vertical column of points. To investigate this in more detail, we zoom into a small window of the genome that contains a large number of replicable fragments. Figure 2, C and D show snapshots of this genomic window for 50 replicate populations at and , respectively. Each horizontal band corresponds to one replicate population and the colors along each such band encode the frequencies (in that population) of introgressed variants at different map positions. The red triangles depict selective effects of individual variants.

At long times (Figure 2D corresponding to ), multiple, disjointed fragments of the introduced genome are fixed in any population, even within this limited genomic region. We focus on the genomic window between the two vertical lines: this segment has the second highest expected growth rate among all the segments of the introduced genome. Note that many of the replicate populations lack a single, deleterious introgressed variant in the middle of this segment. These populations most probably witnessed one (or a few) recombination events that brought together two fragments of this segment onto a single genome *without the deleterious variant* in the middle. This new combination of fragments would then have out-competed the original segment, as it is associated with a larger selective advantage than the original segment, while having the same (low) probability of being split by recombination. In a subsequent section, we analyze this process in more detail for a toy example with three loci.

To avoid comparing fragments of different sizes, many of which are overlapping or even fully contained within other fragments, we compare introgression probabilities of different *single-locus variants* embedded within the introduced genome. Figure 3, A and B show introgression probabilities at single loci (which constitute a subset of the fragments shown in Figure 2, A and B) *vs.* their expected growth rates, for and As before, the expected growth rate of a locus is the growth rate of the fastest-growing subblock that contains it. The introgression probability at individual loci shows high correlation with the expected growth rate, at both (with correlation coefficient ) and at (with ). However, at longer times, there is a substantial scatter among loci with the same expected growth rate, with deleterious loci (purple dots) having lower introgression probabilities than beneficial loci (orange dots) within any particular cluster of such loci.

To explore how long-term introgression probabilities at individual loci depend on their own selective effects, we consider three separate windows of the introduced genome (shown in green, red, and blue in the inset of Figure 3C). The main plot in Figure 3C shows the long-term introgression probabilities of single-locus variants contained within each of these three segments *vs.* their selective effects, with green, red, and blue points depicting loci contained in the corresponding segments shown in the inset. Figure 3C suggests that within loci with the same expected growth rate, long-term introgression probabilities are sensitive to the selective effect of the locus. In particular, the introgression probabilities of deleterious loci decline with their selective disadvantage (note especially the red and blue points).

In the early phases of introgression, tightly linked loci contained within the same exponentially increasing subblock have the same fate (as evident in the limited scatter of introgression probabilities among loci with the same expected growth rate in Figure 3A). At early times, selection does not “see” individual loci, since recombination has not had sufficient time to separate tightly linked variants. However, at longer times, recombination generates smaller and smaller fragments of successful subblocks, and also reconstructs various combinations of these fragments, some of which would lack deleterious variants present in the original parent block. Such combinations would supplant the original successful block, thus also eliminating some of the deleterious loci contained within this block. Note that this kind of fine-grained separation of tightly linked beneficial and deleterious variants within a block over long time scales can only occur if the block has not already fixed in the population. For instance, the subblock corresponding to the green segment in Figure 3C fixes very rapidly, which does not allow enough time for embedded deleterious variants to recombine away and get eliminated.

Does the substantial decline in introgression probability with the selective effect of individual loci within limited genomic windows lead to a correspondingly strong dependence on individual effects across larger map distances? Figure 3D shows introgression probabilities (at ) of single-locus variants across the entire block *vs.* their selective effects. Even in the long run, single-locus introgression probabilities are only weakly correlated with selective effect across the whole block (correlation coefficient in Figure 3D), in contrast to the much stronger correlation with expected growth rates ( in Figure 3B). This is consistent with our explanation that the ultimate fixation probability is shaped by linked selection (or selection on clusters of loci) in the early phases of introgression, which then constrains the extent to which selection can distinguish between individual variants within these clusters in the later phases of introgression. Note further that the most replicable variants are associated with a range of selective effects (see also Figure 4) and can even be deleterious, which makes it misleading, at least within this model, to infer that high replicability implies adaptive significance.

### Dependence of introgression probabilities on the distribution of fitness effects

This raises the question: are long-term introgression probabilities of single-locus variants more strongly correlated with their own selective effects when these effects are larger? In other words, as the distribution of fitness effects (DFE) becomes wider (larger ), is there an approach to an effective single-locus regime where selection on the focal locus, rather than linked selection, determines introgression outcomes? In examining this question, we must distinguish between the variance of fitness effects and the genic variance per unit map length which depends on both and the density of selected loci on the genome. The genic variance influences the extent of linked selection; a fragment of length *y* (emerging from an introduced genome with ) has a typical (positive or negative) contribution that scales as Thus, successful fragments have higher expected growth rates (on average) and correspondingly shorter fixation times when is larger. On the other hand, the per locus variance determines the extent to which selection can discriminate between loci of different effects at fine recombination scales at long times. Thus, large and large are expected to influence the correlation between introgression probabilities and selective effect differently. While larger implies higher efficacy of selection in separating tightly linked beneficial and deleterious loci, higher results in higher growth rates of successful blocks, which causes tightly linked loci to fix together before they can be taken apart by recombination and seen individually by selection.

To clarify the roles of and in determining long-term introgression probabilities, we compare two blocks with the same density of selected variants (*i.e.*, with the same number *L* of selected loci and the same recombination rate *c* per locus), but with the variance of fitness effects of individual loci on one block being four times the corresponding variance for the other block. It then follows that the genic variance per unit map length for the first block is also four times the corresponding variance for the second block. We consider all variants with long-term introgression probabilities > 0.5 within each block and plot the distribution of their selective effects (Figure 4). Strikingly, the distribution has a very similar shape in the two cases (black *vs.* blue curves), but is shifted toward slightly *weaker* selective effects (relative to ) when the block has a wider DFE and a correspondingly high genic variance (blue), than when the block has a more narrow DFE and a lower value of (black). The mean selective effect of such strongly replicable loci is in the first case (blue curve) and in the second case (black). Moreover, the fraction of strongly replicable loci that are *deleterious* is slightly higher (0.156 *vs.* 0.117) for the introduced block with larger and . This suggests that for a given (high) density of selected variants on the genome, a wider DFE results in more extensive hitchhiking of deleterious loci during introgression and, thus, reduced power to pinpoint adaptive loci from patterns of replicability.

We also compare two blocks characterized by the same but with different values of and *L*. Since governs initial introgression, both blocks are expected to show very similar short-term patterns of introgression. However, a population that receives the block with wider DFE (and lower density of selected variants) should undergo more efficient elimination of deleterious loci in the long run. This is indeed observed in simulations; the distribution of selective effects of strongly replicable variants is shifted toward higher effects (mean selective effect ) for the block with wider DFE (red curve), as compared to the block that has a narrower DFE but the same (black curve). Further, the fraction of strongly replicable loci that are deleterious is negligible for the red curve, as compared to 0.117 for the black curve. This is consistent with the general expectation that direct selection on loci should affect their introgression outcomes more strongly than linked selection when the contribution of a genomic region is determined by fewer, larger effect loci.

Note that, in the present example, the recipient population is sufficiently large or recombination between neighboring selected variants sufficiently frequent that individual variants can be isolated. More generally, selection and recombination are expected to weed out regions of some typical length from within successful blocks, where would depend on population size *N* and the growth rate of the successful block. The selective effects associated with replicable loci would then depend on the DFE of segments of length , rather than the DFE of individual variants. Understanding how depends on population size *N* and is nontrivial, and we do not attempt to address this question here.

### Dependence of introgression probabilities on population size

The size of the recipient population is important in determining the probability of stochastic loss and hence the extent to which the spread of an advantageous allele is predictable. The fixation probability of an unlinked allele with selective effect *s*, introduced into a population of size *N* at a frequency , was derived by Kimura (1957). This fixation probability approaches a limit that is independent of *N* for when a single copy of the allele is introduced the asymptotic fixation probability is for Thus, most beneficial alleles are lost in a single population. On the other hand, when a beneficial allele is introduced at a fixed frequency its asymptotic fixation probability is 1 for large *N*.
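The two limits described above can be checked numerically with a minimal sketch of the classical diffusion approximation for the fixation probability of an additive allele. The function name `kimura_fixation` and the parameter `n_copies` are hypothetical, and the factor of 2 in the exponent is an assumption (a haploid Wright-Fisher population with `n_copies` gene copies); other ploidy or effective-size conventions rescale the exponent.

```python
import math

def kimura_fixation(s, n_copies, p0):
    """Diffusion approximation for the fixation probability of an allele.

    s        : selective advantage of the allele (additive, assumed small)
    n_copies : number of gene copies in the population (ASSUMED haploid
               Wright-Fisher scaling; other conventions change the factor 2)
    p0       : initial frequency of the allele
    """
    if p0 <= 0.0:
        return 0.0
    if p0 >= 1.0:
        return 1.0
    x = 2.0 * n_copies * s
    if abs(x) < 1e-12:
        # Neutral limit: fixation probability equals the initial frequency.
        return p0
    if -x > 700.0:
        # Strongly deleterious allele: rearranged to avoid overflow in exp.
        a = -x
        return math.exp(-a * (1.0 - p0)) - math.exp(-a)
    # (1 - exp(-x * p0)) / (1 - exp(-x)), written with expm1 for accuracy.
    return math.expm1(-x * p0) / math.expm1(-x)

# Single copy in a large population: close to 2s, nearly independent of size.
single_copy = kimura_fixation(0.01, 10_000, 1.0 / 10_000)

# Fixed initial frequency in a large population: close to 1.
fixed_frequency = kimura_fixation(0.01, 10_000, 0.1)
```

Under these assumptions, a single beneficial copy is usually lost, whereas the same allele introduced at a fixed frequency is almost certain to fix once the population is large.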

To what extent are these single-locus predictions relevant to a scenario where the introduced genome consists of multiple, linked, beneficial and deleterious loci? We investigate the dependence of long-run introgression probabilities on *N* by comparing introgression of the same genome into populations of different sizes for two scenarios: one where a single copy of the block is introduced into the population ( see Figure 5A) and the second where multiple copies of the same block are introduced at a fixed initial frequency ( independent of *N*, Figure 5B). Figure 5, A and B show the introgression probabilities (at ) of single-locus variants *vs.* their expected growth rates in the two cases, for various population sizes. Solid lines show the predictions of a modified version of Kimura’s formula, where the introgression probability of the introduced variant at locus *i* is assumed to be:(5)where is the expected growth rate at locus *i*, rather than the selective effect of the locus.

Interestingly, Equation 5 can explain 40–70% of the variance of introgression probabilities (see also caption of Figure 5), suggesting that clusters of loci identified as having the same expected initial growth rate are, in some sense, natural linkage units into which the genome can be decomposed. To a first approximation, each such unit can be considered as unlinked to others. However, note that there is considerable scatter of the introgression probabilities (about the prediction of Equation 5), especially in large populations. As we argue above, the differential introgression of tightly linked loci within a genomic fragment depends on its time to fixation relative to the time scale at which fine-scale recombination within the fragment becomes effective, which in turn depends on population size *N*.

We now examine this dependence on *N* in detail for an introduced block with loci. Assume that the selective effects at the three loci satisfy and such that the expected growth rate of the full block is higher than the growth rates of any of its single- or double-locus fragments. However, if recombination were to bring together introgressing variants at the first and third locus without the deleterious () variant in the middle, then this double recombinant would be fitter than the introduced parent block and can supplant the latter.

An approximate expression for the fixation probability of this “101” recombinant (carrying introgressed variants at the first and third loci) can be obtained using the semideterministic approach of Hartfield and Otto (2011) (details in Appendix). The 101 recombinant is mostly generated by recombination between the full parent block, and a descendant block carrying an introgressed variant at just the first or the third locus. The frequency of such recombination events (that lead to the emergence of the 101 recombinant in generation *t*) can be computed by assuming deterministic dynamics for genotypic frequencies. We then use a BP framework to calculate the probability that a 101 recombinant, emerging at time *t*, fixes in the population in the long run. Note that the time at which the recombinant emerges determines its fitness relative to the recipient population, since the mean fitness of the recipient population itself changes as the introgressed genome spreads through it.

The probability that none of the recombinants (generated in generation *t*) fix is then: which is approximately . Further, if double recombinants emerge and establish rarely, then the net probability that no double recombinant fixes in a population of size *N* is just This implies that the double recombinant fixes with probability , which increases with the size of the recipient population and approaches 1 for large *N*, for a fixed The fixation probability of the 101 recombinant in individual-based simulations matches the semideterministic prediction quite well (see Appendix).
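The argument in this paragraph can be sketched numerically. The snippet below is an illustration under stated assumptions, not the calculation used for the figures: `emergence_rates[t]` is a hypothetical name for the per-mating probability that a 101 recombinant is produced in generation *t*, `fixation_probs[t]` is a hypothetical name for the probability that such a recombinant ultimately fixes, and recombinants are assumed to arise and establish independently.

```python
import math

def prob_some_recombinant_fixes(pop_size, emergence_rates, fixation_probs):
    """Rare-event approximation: 1 - exp(-N * sum_t R_t * P_t)."""
    expected_successes = pop_size * sum(
        r * p for r, p in zip(emergence_rates, fixation_probs)
    )
    return -math.expm1(-expected_successes)

def prob_some_recombinant_fixes_product(pop_size, emergence_rates, fixation_probs):
    """Product form: 1 - prod_t (1 - P_t) ** (N * R_t), assuming independence."""
    log_none_fix = sum(
        pop_size * r * math.log1p(-p)
        for r, p in zip(emergence_rates, fixation_probs)
    )
    return -math.expm1(log_none_fix)
```

With the per-generation rates held fixed, the exponent grows in proportion to `pop_size`, so the probability that at least one recombinant fixes increases with population size and approaches 1.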

Thus, large populations typically have a larger number of recombinants that combine various fit fragments of a successful subblock (before it fixes). This makes it increasingly probable that at least one of these recombinants establishes and supplants the parent subblock in the population, thus eliminating deleterious variants embedded in the parent subblock.

### Role of net selective effect of introduced block

So far, we have considered the case where the introduced block is neutral with respect to the recipient population. However, when the introduced block is deleterious as a whole, any of its constituent fragments (including those with a high, positive expected growth rate in the long run) experience an initial selective disadvantage. Many of the copies of the introduced block are lost from the population before the beneficial fragment is separated from the deleterious background and amplified by selection, which suggests that the actual introgression probability in this case would be lower than that predicted by Equation 5.

Figure 6 compares introgression probabilities obtained from simulations with the predictions of Equation 5 for a neutral (Figure 6A), beneficial (Figure 6B), and deleterious (Figure 6C) block. As expected, Equation 5 broadly captures the variation of single-locus introgression probabilities along the neutral block, but systematically overestimates (or underestimates) these when the block is deleterious (or beneficial). Note that introgression probabilities of single-locus variants are highly correlated with the expected growth rate, even for blocks with (where the correlation coefficient is for the beneficial block shown in Figure 6B and somewhat lower at for the deleterious block in Figure 6C). However, Equation 5 no longer accurately describes the relationship between the two quantities.

More generally, since the rate of increase of a subblock itself changes over time, before approaching a constant (asymptotic) value (see Figure 1B), a more accurate predictive approach needs to explicitly consider the establishment of any fragment under a time-varying effective selection coefficient (*e.g.*, see Uecker and Hermisson 2011) instead of a constant (as used in Equation 5). An interesting question is whether the effect of total fitness of the introduced block can be partially captured by a constant factor (corresponding to a genome-wide barrier to gene flow) that attenuates (or inflates) the introgression probabilities of all fragments of the block by the same amount. Explicit expressions for barrier strength have been derived by Bengtsson (1985) and Barton and Bengtsson (1986) for the simpler case of a neutral locus embedded in a genome with all deleterious loci.

### Dependence of long-term selection response on population size and linkage

Directional selection on standing genetic variation results in a net phenotypic advance, which depends on population size *N*. When selection acts on a very large number of polymorphic *unlinked* loci (as in the standard infinitesimal model), additive genetic variance declines at a rate per generation due to drift. This constrains the net advance, which scales *linearly* with *N* (Robertson 1960). The situation is more complex for selection on linked loci (Robertson 1970, 1977). Tightly linked variants would fix together rapidly in a small population, but be broken apart and seen by selection in a larger population. Thus, in this case, increasing population size tunes drift as well as the extent of hitchhiking.

We investigate the scaling behavior of net advance for different degrees of linkage by simulating a particular set of loci, with net trait value spread across blocks of different map lengths Multiple (*i.e.*, ) copies of the block are introduced into a population of size *N* at We then measure the net advance under selection (defined as the average trait value in the limit ) in a population and average over replicate populations. To assess the effect of linkage on net advance, we do simulations where the same loci are uniformly spaced across blocks with different the allelic effects of the loci are scaled with such that the genic variance per unit map length and the trait value relative to the total genic variance is the same for blocks of different map lengths. Figure 7 shows the average net advance *vs.* population size *N* for blocks of different lengths. The net advance is normalized by the probability of survival of at least some part of the introduced block.

Figure 7 shows that the net advance scales sublinearly with *N* for the shortest blocks (tightly linked loci), but approaches a linear dependence with increasing block length. Based on the above arguments, efficient fine-scale separation of beneficial and deleterious loci should also occur within short blocks in a sufficiently large population, resulting in asymptotically linear scaling of net advance with *N*. However, this is difficult to confirm in simulations. More importantly, this limit might not be realized in typical scenarios involving populations of a few thousand (as in Figure 7), when introduced variants are tightly linked. Thus, sublinear scaling of net advance with the size of the recipient population may be commonly observed.
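One way to quantify "sublinear" versus "linear" scaling is to estimate the exponent of a power-law relationship between net advance and population size from a least-squares fit on log-log axes. The following is a minimal sketch of such a fit; the function name `scaling_exponent` is hypothetical, and the power-law form is an assumption made for illustration rather than a result of the simulations.

```python
import math

def scaling_exponent(pop_sizes, net_advances):
    """Least-squares slope of log(net advance) against log(population size).

    A slope near 1 indicates linear scaling with N; a slope below 1
    indicates sublinear scaling. All inputs must be positive.
    """
    xs = [math.log(n) for n in pop_sizes]
    ys = [math.log(a) for a in net_advances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

Applied to averages over replicate populations for each block length, an exponent estimated in this way would allow the curves for different degrees of linkage to be compared on a single scale.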

## Discussion

### Selection on linkage blocks

The nature of genetic change (substitutions *vs.* minor shifts in allele frequency) at individual loci during polygenic adaptation and the resultant genomic signatures of such adaptation have attracted much interest recently (Pritchard *et al.* 2010). A key challenge is to disentangle the effect of selection on the allele frequency dynamics of the selected locus from the effect of linked selection due to nearby loci. Our analysis shows that these two kinds of selection act over different time scales and together shape the fixation probability of different variants on a single genome introduced into a genetically homogeneous population. We identify sets of loci that share the same short-term fate, by decomposing the genome into contiguous blocks that grow faster (under directional selection) than any parent block containing them. Each such block acts as a linkage unit: the initial introgression of loci within a block is governed by the rate of spread of the block (rather than the selective effects of individual loci) and is largely independent of other such blocks.

While the expected growth rates of linkage blocks provide a useful first estimate for the variation of introgression probability along the genome, a more refined estimate needs to account for several other factors, such as the fine-scale variation within successful fragments, the effects of linkage between multiple successful fragments, and the effect of the background genome, which could attenuate or inflate the effective number of blocks introduced into the population.

The idea that a recombining genome under selection can be viewed as a collection of unlinked, effectively asexual segments or linkage blocks is quite powerful, and allows for the calculation of neutral diversity using theoretical results for asexual populations (Neher *et al.* 2013; Good *et al.* 2014; Weissman and Hallatschek 2014). In our formulation, the intrinsic growth rate of a fragment of map length *y* and selective effect is proportional to which is of the order of if the full introduced block is neutral This is the maximum for a characteristic map length or, alternatively, for a characteristic number of loci given by where is the variance of selective effects of individual variants and *c* the rate of recombination between adjacent selected variants. Thus, on average, blocks of map length should grow faster than any parent block containing them. This heuristic derivation for the length of linkage blocks is consistent with the expectations of Neher *et al.* (2013) (see Equation 5 in that study), modulo a logarithmic correction arising from the dependence of fitness variance on population size. However, unlike earlier work, our analysis does not just yield an *average* linkage scale, but explicitly decomposes a specific genome into linkage blocks of different sizes by accounting for local variations of selective effect along the genome.
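The heuristic for the characteristic linkage scale can be illustrated with a short numerical sketch. It assumes that a fragment of map length *y* has a typical selective effect of one SD, the square root of `genic_var * y`, where `genic_var` is a hypothetical name for the genic variance per unit map length, and that recombination breaks up the fragment at rate *y*. The numerical prefactors below follow from these assumptions and are not quoted from the text.

```python
import math

def typical_growth_rate(map_length, genic_var):
    """Typical growth rate of a fragment: selective effect minus breakup rate."""
    return math.sqrt(genic_var * map_length) - map_length

def characteristic_map_length(genic_var):
    """Map length maximizing sqrt(v * y) - y, obtained by setting d/dy to zero."""
    return genic_var / 4.0

def characteristic_number_of_loci(effect_var, recomb_rate):
    """Same maximum expressed in loci, assuming v = effect_var / recomb_rate."""
    return effect_var / (4.0 * recomb_rate ** 2)

# Grid check of the analytical maximum for an illustrative value of v.
v = 0.04
grid = [v * k / 1000.0 for k in range(1, 1001)]
best_y = max(grid, key=lambda y: typical_growth_rate(y, v))
```

In this sketch, fragments much longer than the characteristic length are broken up faster than they are amplified, whereas much shorter fragments carry too little variance to grow quickly.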

### Selection on individual loci and net advance under selection

While the expected growth rate of linkage blocks shows a high correlation with the long-term fixation probability of individual variants within these (Figure 3B) and can capture the coarse-grained variation of introgression probabilities along the introduced block reasonably well (see Figure 6A), there can be considerable fine-scale variation of introgression probabilities of individual variants within each such linkage unit. In fact, this fine-scale variation often reflects differences in selective effect of variants *within* a successful block (Figure 3C). As we illustrate by means of a toy example, the efficacy of selection at fine scales depends on the rate at which a fit block fixes in the population, relative to the rate at which recombination can generate and bring together fragments of the block into even fitter combinations that lack some of the deleterious variants present in the original block. In larger populations, the latter rate is larger, resulting in finer-scale sifting of deleterious variants with increasing population size at long time scales.

The extent to which selection can eliminate individual deleterious variants embedded within a favorable background determines the net response under selection. Robertson (1970) derived approximate expressions for the net advance due to selection acting on standing variation contributed by linked loci of equal effects (spread over map length *l*) and found that the net advance in a population of size *N* is significantly reduced relative to the advance under free recombination, while *l* is less than the selection intensity. Importantly, for tight linkage, the net advance scales sublinearly with *N* for small *N*, but approaches a linear dependence for larger *N*. Our results are broadly consistent; the net advance increases with increasing map length. Further, it scales approximately linearly with *N* for the longest block (weakest linkage), but shows a much weaker (sublinear) dependence on *N* for tighter linkage between loci (Figure 7).

### Selective effects associated with highly replicable loci

Our analysis shows that highly replicable substitutions may be associated with a wide range of selective effects, and can even be deleterious if the density of selected variants on the genome is high (see Figure 3D or Figure 4). The fraction of deleterious substitutions during adaptation is lower when the trait value is determined by fewer, widely-spaced loci of stronger effect. In this case, a typical linkage block contains fewer (∝ ) loci; thus, the selective effect of a locus makes a larger proportional contribution to, and hence shows a stronger correlation with, its expected growth rate. Then, deleterious loci are less likely to be present in an initially successful fragment and to hitchhike during the initial phase of introgression. Moreover, even those deleterious variants that attain high frequencies have a greater chance of being eliminated in the long run (once fine-scale recombination starts isolating individual loci), due to stronger selection against individual variants. However, note that a wider DFE actually results in a higher fraction of deleterious substitutions (due to stronger hitchhiking), unless the density of selected variants is correspondingly low (Figure 4).

Interestingly, differences in replicability among *tightly linked* variants are informative about the effect sizes of these variants; missing or polymorphic variants (that decline in frequency) within genomic regions with high overall introgression probability are typically associated with deleterious effects. By contrast, differences in introgression probabilities of loosely linked loci (that belong to different linkage blocks) reflect how the strength of linked selection varies along the genome and cannot be explained by differences in selective effects of individual loci (contrast Figure 3, C and D). In general, our study underlines the need for a cautious approach toward interpreting patterns of introgression or adaptation in replicates, and suggests that it is more difficult to ascribe adaptive significance to individual substitutions than to clusters of such substitutions. While genomic regions with multiple replicated substitutions (which correspond to successful blocks in our model) are typically important for adaptation, individual substitutions within these may be nearly neutral (even if they occur across replicates, see Figure 3D).

### Interpreting genomic islands of divergence

Genomic regions where the native haplotype persists (*i.e.*, for which the introduced haplotypes fail to introgress) would appear as islands of divergence between the recipient and source population (Strasburg *et al.* 2012). In our model, such regions are associated with introduced fragments that have a *negative* expected growth rate in the recipient population. Note that such fragments need not have deleterious effects; negative growth rates can also be associated with moderately beneficial fragments that are nevertheless broken up faster by recombination than they are amplified by selection. Such fragments may be completely lost from the population, especially when introduced at a low frequency into the recipient population. Thus, in this model at least, the hybridization history (steady gene flow *vs.* sporadic migration events) can be an important factor determining whether extended regions of differentiation are necessarily correlated with barriers to gene flow.

### Replicability during adaptation to a phenotypic optimum

Our model considers an additive trait under directional selection, and thus ignores epistasis and genetic redundancy, which can strongly influence replicability (Yeaman 2015). A common scenario is one where the additive trait is under stabilizing selection toward a different optimum in the recipient population. If introduced haplotypes are far from the fitness optimum, then they effectively experience directional selection, causing adaptive fragments within these to be amplified, much as in the present model. In the long run, closer to the optimum, the redundancy of different genomic fragments that make similar contributions to the phenotype may play an important role. Understanding how the initial spread of adaptive fragments constrains long-term introgression patterns along the genome in this case is an interesting direction for future work.

### Genetically heterogeneous donor and recipient populations

Our model assumes that the recipient population is genetically homogeneous and that the initial hybridization event involves introduction of multiple copies of the *same* genomic block into the recipient population. Further, all replicate populations receive *identical* genomes from the donor population. Thus, these assumptions effectively ignore variation within the source and recipient population, and just focus on fixed differences between the two. Most theoretical work on introgression and barriers to gene flow makes similar assumptions (Barton and Bengtsson 1986; Uecker *et al.* 2015). However, genetic heterogeneity within the donor and recipient populations is likely to qualitatively impact patterns of introgression and the extent to which these are convergent across replicate populations, and is thus a natural extension to consider.

The present analysis provides some intuition in a scenario where a few nonidentical haplotypes are introduced into a genetically homogeneous recipient population: genomic fragments that have high expected growth rates and are present in multiple copies among the introduced haplotypes are expected to introgress with high probability. However, genetic variation within the donor and recipient populations can have other important consequences. Small recipient populations may exhibit significant inbreeding depression due to segregation of deleterious recessive alleles. Alleles from a diverged population then have a heterotic advantage, even if they are deleterious within the recipient population. In fact, heterosis has been suggested as a possible reason for the persistence of Neanderthal-derived variants in modern human populations (Harris and Nielsen 2016).

### Efficacy of selection

We have considered the contribution that a single introduced genome makes to a homogeneous population. A natural extension is to consider the efficacy of selection within a heterogeneous population. Initially, there is a set of genomes, each carrying alleles at very many loci that have a distribution of effects, both positive and negative. We can focus on one of these genomes and ask what contribution it makes to the final selection response; the expected response of the whole population is the sum of the expected contributions of each initial genome. The key differences from the analysis presented here are that a genome finds itself competing against other genomes that are themselves improving under selection, and that it finds itself associated with other genomes that have random effects on the trait. A first step might be to simply treat the rest of the population as having a mean and variance determined by classical quantitative genetics, and ask whether an analysis focused on a single genome in this variable background can give a good approximation.

We have contrasted two regimes: one where the fate of an allele depends on its own selective effect and another where selection acts on fragments of genome whose effects on fitness depend on very many loci. This contrast is closely related to the more practical question of when we can detect individual causal alleles, by genetic mapping or an association study, or instead can only attribute genetic variance to regions of the genome. An important question for the future is to find ways to determine when it is appropriate to deal with discrete loci rather than take a statistical approach; or better, how to combine these two representations of genetic variation to understand real genetic architectures.

## Acknowledgments

We thank Christelle Fraisse for useful comments on the manuscript.

## Appendix: Introgression of a Block with Three Loci: Semideterministic Approximations

Consider a genotype “111” introduced at frequency in a population of *N* individuals, all having genotype “000.” For simplicity, we consider haploid individuals, but the analysis can be generalized to the diploid case. In the following, “1” will always denote the introduced allele and “0” the native allele. The native alleles at each of the three loci are associated with zero selective effect. The introduced alleles at the three loci have selective effects , , and ; the net selective effect of the introduced genotype is The introduced genotype is of the type *i.e.*, with , and We now derive an approximate expression for the fixation probability of the double recombinant 101 (*i.e.*, ), which is the fittest possible recombinant, using the semideterministic approach employed by Hartfield and Otto (2011).

We consider frequencies of genotypes generated by a single recombination event. These single recombinants are: 100, 110, 001, and 011. The genotypes 110 and 011 will typically be quite rare, as they have lower fitness than the genotypes 001 and 100, as well as the introduced genotype 111. As a first approximation, we assume that their frequency is zero. Then, we need only track three genotypes: 001, 100, and 111 (along with the native genotype 000), before the genotype 101 emerges. Under deterministic dynamics and assuming , such that all second-order terms in *s* and *c* can be neglected, the frequencies of these genotypes evolve as:(6)These can be solved numerically to obtain various genotypic frequencies as a function of time.

The fraction of mating events that generate the recombinant 101 in generation *t* is . Since the frequencies and are typically quite small, we can approximate the probability of emergence of the recombinant 101 by Note that is the probability that a mating event in generation *t* results in a 101 recombinant; it is not the frequency of 101 genotypes in generation *t*.

We can now calculate the fixation probability of a 101 recombinant generated in generation *t* by approximating its spread as a time-inhomogeneous BP, where the relative selective advantage of this recombinant changes over time due to the changing mean population fitness (Hartfield and Otto 2011). Then, follows [see Barton (1995)]:(7)Equation 7 can be solved numerically to obtain Then, the probability that none of the recombinants (that emerge in generation *t*) fix is just: which can be approximated by ∼ . Further, the net probability that no double recombinant fixes in a population of size *N* is ∼ . The key assumption is that instances of double recombinants emerging and establishing are rare, so that all such instances can be treated as independent.
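The idea of a time-inhomogeneous BP can be illustrated with a discrete-generation sketch. It is not a transcription of Equation 7: it assumes Poisson-distributed offspring numbers with mean 1 + *s*(*t*), for which the probability of ultimate survival of a lineage started in generation *t* satisfies an exact backward recursion. The function name and the assumption that the relative advantage stays at its last listed value in all later generations are hypothetical choices made for illustration.

```python
import math

def establishment_probabilities(advantages, tail_iterations=5000):
    """Survival probability of a single copy arising in each generation.

    advantages[t] is the relative selective advantage s(t) of the recombinant
    in generation t (ASSUMED Poisson offspring with mean 1 + s(t), s(t) > -1).
    After the last listed generation, the advantage is ASSUMED to remain
    constant at advantages[-1].
    """
    # Long-run survival probability under the final, constant advantage,
    # found by fixed-point iteration of P = 1 - exp(-(1 + s) * P).
    s_final = advantages[-1]
    p = 1.0
    for _ in range(tail_iterations):
        p = -math.expm1(-(1.0 + s_final) * p)

    # Backward recursion: a copy in generation t survives if at least one of
    # its Poisson(1 + s(t)) offspring founds a surviving lineage.
    probs = [0.0] * len(advantages)
    for t in range(len(advantages) - 1, -1, -1):
        p = -math.expm1(-(1.0 + advantages[t]) * p)
        probs[t] = p
    return probs
```

For a small constant advantage, the recursion settles close to the familiar value of roughly twice the advantage. When the advantage declines over time, as it does when the mean fitness of the recipient population rises, recombinants that emerge early have a higher chance of establishing than those that emerge late.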

Predictions of the semideterministic analysis are in very good agreement with results of individual-based simulations of populations of different sizes (Figure A1). Note that the analysis above assumes deterministic dynamics of the introduced block and other genotypes produced by single crossover events. A more careful analysis can be done, accounting for initial stochasticity, when the introduced block and its descendants are rare and may be lost by chance.

## Footnotes

*Communicating editor: R. Nielsen*

- Received July 28, 2018.
- Accepted September 28, 2018.

- Copyright © 2018 by the Genetics Society of America