Abstract
A stochastic model for the genealogy of a sample of recombining sequences containing one or more sites subject to selection in a subdivided population is described. Selection is incorporated by dividing the population into allelic classes and then conditioning on the past sizes of these classes. The past allele frequencies at the selected sites are thus treated as parameters rather than as random variables. The purpose of the model is not to investigate the dynamics of selection, but to investigate effects of linkage to the selected sites on the genealogy of the surrounding chromosomal region. This approach is useful for modeling strong selection, when it is natural to parameterize the past allele frequencies at the selected sites. Several models of strong balancing selection are used as examples, and the effects on the pattern of neutral polymorphism in the chromosomal region are discussed. We focus in particular on the statistical power to detect balancing selection when it is present.
COALESCENT theory is based on the realization that selective neutrality allows the separation of descent from state. This makes it possible to model samples (or populations) as random genealogies with superimposed neutral mutations (see Nordborg 2001). Neutrality is thus fundamental to this approach. Nonetheless, selection has been successfully incorporated in two very different ways.
One way is to construct a genealogical process that “leaves room” for selection by creating genealogies that contain “virtual” branches representing possible lines of descent. After mutations have been superimposed and the state transmitted through each branch is known, the genealogy is “pruned” by preferentially removing selectively inferior branches so that only “actual” lines of descent remain. The process known as “ancestral selection graph” (Krone and Neuhauser 1997; Neuhauser and Krone 1997) accomplishes this (see also Donnelly and Kurtz 1999; Neuhauser 1999; Slade 2000a,b, 2001; Fearnhead 2001). An important feature of this approach, which can be seen as a natural extension of the standard neutral coalescent, is that all selection coefficients must be scaled using the standard coalescent/diffusion scaling. This means that it is not possible to model arbitrarily strong selection this way: From a mathematical point of view, infinitely strong selection would require infinitely many virtual branches; from a practical point of view, simulation of strong selection becomes extremely inefficient because of the large number of virtual branches.
The second way, first described in the context of the coalescent by Kaplan et al. (1988), utilizes the fact that a polymorphic population can be thought of as subdivided into allelic classes, within which no selection occurs. Genealogies can then be modeled using existing models of geographic subdivision, with mutation (and, in a sense, recombination) taking the place of migration. It is necessary to know the past (relative) sizes of the “subpopulations,” i.e., of the allelic classes. Thus the approach may be seen as modeling genealogies conditional on the past frequencies of the selectively different alleles (Nordborg 1999, 2001). However, since it is in general not known how to obtain the unconditional process when the past allele frequencies are random variables, the approach has been used only for strong selection, when it may be reasonable to model the selective dynamics deterministically. Examples include balancing selection (see Hudson and Kaplan 1988; Takahata 1990; Hey 1991; Kaplanet al. 1991; Nordborg 1997, 1999; Takahata and Satta 1998; Kelly and Wade 2000; Schierupet al. 2001; Barton and Navarro 2002; Navarro and Barton 2002), positive selection (or “selective sweeps,” see Kaplanet al. 1989; Hudsonet al. 1994; Bravermanet al. 1995; Simonsenet al. 1995; Kim and Stephan 2002; Przeworski 2002), and purifying selection (or “background selection,” see Hudson and Kaplan 1995; Nordborg 1997; Campbell 1999).
The two approaches are in a sense complementary: Whereas the ancestral selection graph works only for weak selection, the conditional approach works only for strong selection. How to connect the two is not clear.
In this article we use the second, conditional approach, to show how the original formulations of Kaplan et al. (1988) and Hudson and Kaplan (1988) may be extended to model selection at multiple sites, with the possibility of different selection coefficients in different subpopulations (local adaptation). This has been done before (Takahata 1990; Kaplanet al. 1991; Nordborg 1997, 2001; Kelly and Wade 2000; Schierupet al. 2000; Barton and Navarro 2002); however, most treatments have considered the genealogy of a nonrecombining site linked to the selected loci. In contrast, we consider the genealogy of the entire region. This makes it possible to ask statistical questions about the pattern of polymorphism (see also Innan and Tajima 1999; Schierupet al. 2001; Kim and Stephan 2002; Przeworski 2002). In particular we are interested in the conditions under which we would expect to be able to detect balancing selection.
A BASIC MODEL
Consider a diploid, hermaphroditic population consisting of P patches, each of which harbors a constant, large number of adult individuals, Nk, k = 1, 2,..., P. Let N = ΣNk be the total population size, and define ck = Nk/N. The population reproduces in discrete, non-overlapping generations according to a generalized Wright-Fisher model in the following manner. Each individual produces an (effectively) infinite number of gametes. Male gametes (e.g., pollen) flows between the patches; let fkl be the probability that a male gamete produced in patch k ends up in patch l. After migration, gametes unite randomly to form zygotes. The number of immature individuals in each patch is thus still effectively infinite, but only a finite number (Nk in the kth patch) reach adulthood. The probability that a given individual survives is determined by its genotype and the genotypic frequencies in the patch. Generalizations of this model are discussed below.
Forward dynamics at the selected loci: Let H be the number of different haplotypes with respect to the selected locus or loci (which are assumed to be linked). Label the haplotypes 1, 2,..., H, and the G = H(H + 1)/2 different genotypes by pairs of indices {i, j}, i ≤ j = 1, 2,..., H, according to their haplotypic composition. Let Nij,k(t) be the number of adults with genotype {i, j} in patch k in generation t. Note that whereas Nij,k(t) is a random variable, Nk is not. The frequency of genotype {i, j} in patch k is
Genealogy of the surrounding segment: Consider a chromosomal segment that contains the selected locus or loci. Take a particular copy of this segment, sampled from the adult individuals in generation t + 1. With respect to geographic location, it belongs to one of the P patches, and with respect to the selected locus or loci, it belongs to one of the H haplotypes.
Trace the genealogy of this segment one generation back in time. Each nucleotide in the segment in the current generation is a copy of the homologous nucleotide in some parental segment in the previous generation. In the absence of recombination, all nucleotides must have the same parental segment; otherwise, they may have different ones. However, rather than modeling nucleotides, it is convenient to think of the segment abstractly as a continuous unit interval where each point may have a different genealogy. This makes mathematical and computational sense and does not entail any loss of biological generality (see, e.g., Nordborg 2001).
—Two examples of the effects of recombination on the single-generation genealogy of a segment. The colors denote ancestry (only). On the left, a segment with a single-locus A1 haplotype was produced through recombination in an A1/A2 heterozygous individual. As a result, going back in time, one piece of the segment takes on the A2 haplotype, whereas the other (containing the locus defining the haplotype) remains A1. On the right, a segment with a two-locus A1B2 haplotype was produced through recombination in an A1B1/A2B2 doubly heterozygous individual, causing both pieces to change haplotype (going back in time).
When tracing the genealogy of the segment back through the life cycle, the first thing to note is that selection cannot affect its state with respect to either patch or haplotype. The chosen segment must have been present in one of the gametes of the same type in the same patch before zygotes were formed and selection took place. However, migration can change the state of the segment with respect to patch, because the gamete may have been an immigrant, if it was male. The probability that a randomly chosen gamete of type i in patch k was male is
Having traced the segment through migration to the premigration gamete pool, the next step is to trace it through gamete production to the previous adult generation. Gamete production, i.e., meiosis, can change the state of the segment with respect to haplotype both through mutation at one of the selected loci and through recombination inside the segment. If the gamete was a mutant, then the segment changes state to take on the haplotype of the parental segment. If the gamete was a recombinant, things are more complicated, because the segment then has two parental segments. At each breakpoint, the parentage of the offspring segment switches from one parental segment to the other. Going backward in time, the segment splits into two pieces (or sets of pieces if there was more than one breakpoint), one of which takes on the haplotype of the first parental segment while the other takes on the haplotype of the second parental segment (see Figure 1). Both parental segments consequently have to be followed if the genealogy of the original segment is to be traced farther back in time. As is demonstrated later, the backward transition probabilities through meiosis can readily be derived, but the expressions depend on the details of the model (e.g., on the number of loci). In general, the probability that a gamete of type i in patch k resulted from a particular type of meiotic event in a particular genotype is a function of the length-G vector [xij,k(t)]i≤j and the recombination and mutation parameters.
Once the genotype and patch of the parental individual has been determined, all eligible individuals are equally likely to have been the parent. Segments can thus be seen as “picking” their parent randomly.
To trace the single-generation genealogy of n copies of the segment, note that, since infinitely many gametes are produced, and gametes unite randomly within patches, the fate of each segment is independent of the fates of all the other segments. In other words, each segment “picks” its parental segment (or segments, in case of recombination) independently of the other segments. Whenever two or more segments pick parental segments belonging to the same patch, haplotype, and genotype, two or more of them may pick the same one. Segments that pick the same parental segment are said to coalesce; if their genealogy is to be traced farther back in time, only the single segment needs to be followed.
Segments can of course pick only the same parental segment if they first pick the same parental individual. Since all individuals of the same genotype are equally likely to be picked, the probability that n segments all pick different parents, given that they all pick parents with genotype {i, j} in patch k is
COALESCENT APPROXIMATION
The purpose of the preceding section was to demonstrate that, conditional on the P length-G vectors [Nij,k (t)]i≤j, k = 1, 2,..., P, for t = 0, -1,..., i.e., conditional on the genotype frequencies in all past generations, it is possible to model the genealogy of n segments sampled in generation t = 0 as a discrete-time Markov process running backward in time. However, the interesting process is the unconditional one, in which the genotype frequencies are governed by another discrete-time Markov process, running forward in time (as described above). Typically, we would like to know how the genealogy is affected by the parameters of that forward process (e.g., the selection coefficients and population sizes), not how it is affected by a particular realization of it.
Clearly, the unconditional process could be studied through discrete-time simulation: One would simply simulate the genotype frequencies forward in time and then simulate the genealogy backward in time, conditional on those frequencies.
An alternative approach, which is taken here, is to use a coalescent/diffusion approximation and assume that the genotype frequencies can be treated as having evolved deterministically on the continuous timescale (Kaplanet al. 1988). It follows from the standard diffusion arguments of population genetics (see, e.g., Neuhauser 2001) that this may be justified if selection is sufficiently strong relative to the inverse of the population size (or, in the present case, patch sizes). It should be stated clearly that this approach is not mathematically rigorous, but it is likely that it can be made rigorous for some scenarios and that it will be a reasonable approximation for many others.
We focus on strong balancing selection in the following because the approach is easiest to explain and justify for such models (with balancing selection, the model is very close to that of Barton and Navarro 2002). Balancing selection is here simply meant as any form of selection that tends to maintain all genotypes at constant, nonzero frequencies, which we denote xˆij,k. In a finite population, the actual frequencies in any given generation will of course differ from these values. Similarly, in the diffusion approximation, the actual frequencies at a given point in time will differ from their expectations. The differences will tend to be smaller the stronger the selection. Note that nothing in the basic model described above limits how strong selection may be. Indeed, it is possible to assume infinitely strong selection (i.e., the selection coefficients need not be scaled in the diffusion approximation), in which case
Scaling: Under the assumption that the genotype frequencies can be treated deterministically, it is possible to use standard arguments to find a continuous-time coalescent approximation for the discrete-time genealogical process described above. Note that the probability (2) can be rewritten
The per-generation probabilities of migration, mutation, and recombination are also assumed to be O(1/N), and the corresponding scaled parameters are introduced. Thus, it is assumed that the migration probability fkl can be written
To proceed farther we must consider specific genetic models.
Single-locus example: Consider a segment that contains a locus (or site) that is maintained polymorphic for two alleles by strong balancing selection in a subdivided environment. There are H = 2 “haplotypes,” A1 and A2, and three genotypes, A1A1, A1A2, and A2A2. Let aij be the probability that allele Ai mutates to Aj during meiosis, and let r be the probability of a recombination event. It is assumed that these can be written
Consider the backward transitions for a single segment with haplotype i sampled from an individual in patch k, as before. Because the probabilities of migration, mutation, and recombination are all assumed to be small, it is clear that the segment is most likely to be a copy of a segment with the same haplotype in the same patch in the previous generation. Indeed, it is easy to show that the probability is 1 - O(1/N). Furthermore, given that its state with respect to haplotype and patch did not change, the probability that it was produced by an {i, j} individual is ŷj,k + O(1/N). It can also be shown that the segment is an immigrant of the same type from patch l with probability
As discussed above, it is simple to extend to a sample of n segments, because the single-generation transitions are mutually independent. Only the coalescence probabilities need to be determined. From Equation 4 a single coalescence event has probability O(1/N) or less, which means that the probability of more than two segments coalescing in a single generation has probability O(1/N 2) or less. It also means that the probability that a segment involved in a coalescence is a migrant, mutant, or recombinant is O(1/N2). The only coalescence event that has probability O(1/N) is thus between two segments that do not change state with respect to haplotype or patch. For simplicity, consider the probability that n = two segments with haplotype i in patch k coalesce. To do so, they must have been produced by individuals of the same genotype, by the same individual of that genotype, and by the same segment within that individual. The probability of this is
Given these transition probabilities, the limiting coalescent process can be derived using standard arguments. Segments belong to states with respect to haplotype and patch as before. Measure time in units of 2N generations, and let N go to infinity. Then, independently of all other segments, each segment with haplotype i in patch k migrates to patch l at rate
Two-locus example: Consider a model with two loci each with two alleles, A1/A2 and B1/B2. There are H = 4 haplotypes: A1B1, A1B2, A2B1, and A2B2, which are numbered 1,..., 4 in the order listed, and 10 genotypes. Define the mutation probability at the B-locus, bij, analogously to aij. The recombination probability, r, is also defined as before, but let dr be the length of the part that lies between the two loci (which are loci in the strict sense of the word—i.e., there is no recombination within them; they can be thought of as single-nucleotide polymorphisms, for example).
The single-generation backward transition probabilities can be found in the same manner as in the single-locus model, but are in some cases more complicated. The probability that a segment of type i in patch k is an immigrant of the same type from patch l is
The extension to n segments and the conversion to continuous time can be done precisely as for the single-locus model. The complete transition rates are given in Table 1.
DETECTING SELECTION
In this section we use simulations to investigate the distribution of the pattern of polymorphism in regions that contain sites subject to balancing polymorphism. We are in particular interested in the power to detect selection under various models and assumptions about parameter values. The simulation software, which is written in C++ with a Mathematica front end, is available on request.
Symmetric single-locus model: Consider first a “classical” balancing selection model, in which selection maintains two different alleles at high frequencies in a random-mating population. The precise values of the equilibrium frequencies do not matter greatly (we assume that selection is completely symmetric so that the sizes of the two allelic classes, A1 and A2, are even), but the assumption that there is no subdivision does, as we see later. We also assume mutation between the two allelic classes is rare and symmetric (i.e., A1 mutates to A2 at the same rate as A2 mutates to A1). This assumption is discussed further below as well.
Figure 2 shows a typical realization of this model. The time to the most recent common ancestor (MRCA) of the sample is extremely high at the selected site and decreases with distance. The reason for this is clear: The selected site itself cannot coalesce unless a mutation from one allelic class to the other occurs. Since mutations are rare, this means that, almost always, all members of a particular class will coalesce to the MRCA of that class long before a mutation occurs. Sites that are more distantly linked to the selected site can move between allelic classes by recombination and coalesce much faster.
The genealogical pattern shown in Figure 2 may result in a characteristic pattern of polymorphism, which may be detected in the data. Figure 3 shows the distribution of the amount of variation within and between allelic classes and of Tajima’s D statistic along the chromosome in six realizations. It is evident that the presence of balancing selection usually causes a “peak” due to divergence between the two allelic classes. This well-known phenomenon is also reflected in Tajima’s D, which is expected to be greater than zero in the presence of balancing selection or (many forms of) population subdivision (Tajima 1989). However, note the considerable randomness: The peaks are not always centered on the selected site; in Figure 3b, variation is inflated within one of the allelic classes as well as between them; and in Figure 3e, there is no peak at all. The pattern of polymorphism depends heavily on the history of mutations between the allelic classes, as well as on the history of recombination in the region.
We considered the power to detect selection using Tajima’s D statistic. Table 2 gives the probability of observing a significantly positive value of this statistic under various assumptions. Several conclusions are clear. First, the power depends sensitively on the width of the window used to calculate D. A very small window will not have enough segregating sites to achieve statistical significance, whereas a large window will “drown” the peak in neutral noise. The optimal window width will depend on the ratio between θ and ρ, i.e., the number of neutral mutations per recombination. If θ/ρ> 1, balancing selection should usually be detected, but if θ/ρ⊡ 1, it usually will not be.
To make the numbers in Table 2 applicable to sequence data, it is necessary to make assumptions about θ and ρ per base. Assume that θ= 0.01 per site, as is reasonable for Drosophila melanogaster (Przeworskiet al. 2001). Then the regions simulated in Figure 3 and Table 2 correspond to 1 kb, and the optimal window size is usually 100 bp. Balancing selection affects very small regions. This realization calls into question the infinite-sites assumption for mutation, because, given the very long coalescence times expected under balancing selection, it is no longer reasonable to assume that each site is hit only once. Depending on the level of selective constraint, at most 1000 selectively neutral sites are in the region, and more likely 1000/3. When finite sites are taken into account, balancing selection becomes harder to detect, because some of the “excess” variability simply results in repeat mutations (Table 2).
Backward transition rates in the two-locus model
—An example of the genealogical pattern around a selected site (positioned in the center of the region, at 0.5). The sample size is n = 8, with 1-4 belonging to the A1 allelic class and 5-8 belonging to the A2 allelic class. The recombination rate, ρ, for the whole region is 2. The mutation rate at the selected site, α, is 0.01. Note that the trees for different regions are drawn on very different scales.
—Sliding-window analysis of the distribution of the average number of pairwise differences within and between the two different allelic classes (πw and πb, respectively) and of Tajima’s D in six realizations of the symmetric balancing selection model described in the text. In the top, diamonds and stars connected by broken lines show the distributions of πw for the two allelic classes, and squares connected by solid lines show the distribution of πb. The bottom shows the distribution of Tajima’s D. Simulations were carried out with n = 24 (12 in A1 and 12 in A2) and the infinite-sites recombination/mutation model with θ=ρ= 10. A sliding-window of size 0.1 was moved with increments of 0.025.
Loss-of-function mutations: Many cases of balancing selection, especially those that are due to a trade-off between resistance and cost of resistance to some parasite or pathogen, are likely to involve loss-of-function mutations (Olson 1999; Stahlet al. 1999; Johansonet al. 2000; Tianet al. 2002). This type of scenario fits well into the modeling framework presented here, but leads to very different predictions. Specifically, since mutation between the two allelic classes is essentially unidirectional (for example, if mutation at any of 100 different sites can lead to loss of function, then the total rate of loss-of-function mutations would be 1, assuming that the mutation rate per base pair is 0.01, while the rate of back mutation would still be 0.01), there is usually no ancient polymorphism, and no peak of polymorphism will develop. As is illustrated in Table 2, balancing selection is not likely to be detected in these cases. We return to this topic in the discussion.
The probability (%) of detecting balancing selection
Two loci: Figure 4 shows a typical realization of a classical symmetric two-locus model with all four haplotypes present at equal frequencies. The two selected sites are positioned at 0.4 and 0.6, respectively. Note that the topologies of the trees behave in the intuitively obvious way as we walk along the chromosome: Close to each selected site, the sample must coalesce in a manner determined by the allelic classes at that site. In a sample that contains the “complementary” haplotypes (A1B1 and A2B2 or A1B2 and A1B2), sites located between the selected sites can coalesce only if there are at least two recombination events.
—An example of the genealogical pattern around two selected sites (located at positions 0.4 and 0.6). The sample size is n = 8, with 1-2 belonging to the A1B1 haplotypic class, 3-4 belonging to the A1B2 haplotypic class, 5-6 belonging to the A2B1 haplotypic class, and 7-8 belonging to the A2B2 haplotypic class. As in Figure 2, we have ρ= 2 and α= 0.01 (for each selected site).
Figure 5 illustrates the effect of the distance between the selected sites on the distribution of πw, πb, and Tajima’s D along the chromosome. If the distance between the sites is sufficiently great, there may be two distinct peaks (Figure 5, a and b); otherwise, only a single peak may be visible (Figure 5c). Power studies analogous to those in Table 2 indicate that the probability of rejecting neutrality using Tajima’s D depends sensitively on the positioning, numbers, and sizes of the windows used (not shown). Obviously, the best strategy is to use a window size that captures each peak, but since the number of selected sites is not known a priori, this may be difficult to implement in practice. However, regions containing multiple sites subject to balancing selection are considerably less likely to be missed (see also Navarro and Barton 2002).
—Sliding-window analysis of the distribution of πw, πb, and of Tajima’s D in three different realizations of the symmetric two-locus model. In the top, diamonds and stars connected by broken lines represent the distributions of πw for the A-locus, and squares with solid lines represent the distributions of πb. The middle shows the same for the B-locus. The bottom shows the distribution of Tajima’s D. Simulations were carried out with n = 24 (6 A1B1, 6 A1B2, 6 A2B1, and 6 A2B2) and θ=ρ= 10. The selected sites were located at (a) 0.2 and 0.8, (b) 0.4 and 0.6, and (c) 0.45 and 0.55. Sliding-window parameters were the same as in Figure 3.
Subdivision: The behavior of balancing selection models with population subdivision is very complicated, but the example shown in Figure 6 illustrates the main points. In addition to the structure imposed by the allelic and haplotypic classes, there is now population structure as well. Which structure turns out to be predominant will depend on the relative magnitudes of the parameters. In the example shown in Figure 6, the pattern of coalescence differs between the two selected sites. At the B-locus (located at 0.6), the first stage of coalescence is within allelic class within patches, followed by coalescence within allelic class between the two patches. Finally coalescence occurs between two allelic classes. On the other hand, at the A-locus (located at 0.4), there is a cluster of the four sequences in patch 2, and the longest branch is between sequence 1 (A1 in patch 1) and the others. The genealogical pattern is highly variable between realizations.
—An example of the genealogical pattern around two selected sites in a subdivided population. The parameters are as in Figure 4, except that the population is divided into two patches of equal size, connected with symmetric migration at rate ϕ = 0.1. The structure of the sample is shown.
Local adaptation: An important motivation for the model described above is to consider local adaptation. A balance between migration and selection seems much more likely to maintain polymorphism than does “pure” balancing selection. Because local adaptation leads to a deficit of heterozygotes at the selected locus or loci, the effect on linked neutral variation may be much greater. Local adaptation should thus be much easier to detect using polymorphism data.
Figure 7 shows an example of a single-locus model of local adaptation. The frequency of A1 is assumed to be 0.9 in patch 1 and 0.1 in patch 2 (and conversely for A2). The effects of local adaptation are even more dramatic when multiple sites are involved. Figure 8 shows an example of a two-locus model where A1B1 is favored in patch 1 and A2B2 is favored in patch 2. Note that the effects of selection extend throughout the simulated region, completely obscuring the peak around the B-locus. Multiple linked sites involved in local adaptation can, in principle, “lock up” entire chromosomal regions in complementary haplotypes.
DISCUSSION
We have shown how the “structured coalescent” described by Nordborg (1997) may be combined with the “ancestral recombination graph” of Griffiths and Marjoram (1997) to yield a genealogical model for sequences that contain multiple sites subject to strong selection in a subdivided population. We refer to the combined model as the “structured ancestral recombination graph” (SARG). Albeit complex, the SARG is highly suitable for simulation, just like the standard coalescent.
Robustness of the SARG: We described the SARG as a limiting approximation to a specific model of pollen flow in an outcrossing hermaphroditic plant species, but the SARG is much more general. As is the case for the standard coalescent, phenomena such as selfing, separate sexes, or sex linkage, etc., can readily be incorporated (although formally proving convergence is likely to be both difficult and tedious; see Nordborg and Krone 2002).
The hardest problem from a mathematical point of view is also the most interesting from a biological point of view; namely, when is it reasonable to treat selection as population structure? For strong balancing selection, it is plausible to argue that the approximation is a good one (Kaplanet al. 1988). In particular, when local adaptation is involved, it is easy to imagine very strong selection. However, to treat allele frequencies as constant in a model of local adaptation, it is also necessary to assume that migration is strong so that a deterministic migration-selection balance results. In such a model, there would be no signs of subdivision when looking at neutral markers unless these were sufficiently closely linked to the adaptively important sites. This may be the case in strong clines and even in some hybrid zones (although the approximation will certainly break down if the number of selected loci becomes too large; see Barton and Navarro 2002).
—An example of the genealogical pattern around a site involved in local adaptation (located at position 0.5). The selected site is located at position 0.5. The two equal-sized patches are connected with symmetric migration at rate ϕ = 0.1; the rest of the parameters are as in Figure 2. The sample size is n = 8, with 1-3 belonging to A1 and 4 belonging to A2 in patch 1, and 5 belonging to A1 and 6-8 belonging to A2 in patch 2.
—An example of the genealogical pattern around a pair of sites involved in local adaptation (located at 0.4 and 0.6). The parameters are as in previous figures, except that the frequencies of A1B1, A1B2, A2B1, and A2B2 are 0.7, 0.1, 0.1, and 0.1 in patch 1 and 0.1, 0.1, 0.1, and 0.7 in patch 2, respectively. The sample size is n = 8, with one sequence from each haplotype in each patch.
Detecting balancing selection: Our motivation for deriving the SARG was that we wanted to simulate sequence data from regions containing sites subject to balancing selection. It has long been known that balancing selection may create a peak of increased polymorphism centered around the selected site (Hudson and Kaplan 1988). Numerous articles have been published about the expected levels of polymorphism surrounding such a site (e.g., Nordborget al. 1996; Kelly and Wade 2000; Schierupet al. 2000; Barton and Navarro 2002). These kinds of results are of limited value for data analysis, because the pattern of polymorphism surrounding any particular balanced polymorphism will reflect the random history of this region and will usually be very far from expectations.
Using simulations, we have focused on the question of whether we should expect balancing selection to be detectable or not. Our original motivation for asking this question is that many years of population genetics research in Drosophila have failed to uncover strong evidence for balancing selection. Is this because balancing selection is indeed rare, perhaps limited to highly unusual cases, like the MHC and plant self-incompatibility loci (Kreitman and Akashi 1995; Hudson 1996), or is it simply very difficult to detect because recombination and gene conversion effectively destroys the evidence (Andolfatto and Nordborg 1998)? The latter view is supported by several recent observations of balancing selection in Arabidopsis thaliana (Stahlet al. 1999; Tianet al. 2002), where recombination is expected to be less effective because of selfing (Nordborg 1997).
Our simulations also tend to support the view that balancing selection might be difficult to detect in out-crossing organisms. For θ/ρ= 1, as may be typical for Drosophila, taking finite sites into account, power to detect balancing selection seems to be ∼50% (Table 2). Note that this is power in the evolutionary sense, not in the usual “sampling” sense: It is not the case that a different sample would detect balancing selection; rather it is the case that a large fraction of existing balancing polymorphisms are expected to be undetectable (because of the particular evolutionary history of the polymorphism). Power will of course also depend on sampling and on the statistical method used, but perhaps less than we would think. It should be noted that our simulations do not take gene conversion into account: This could decrease power further (Andolfatto and Nordborg 1998).
Finally, it seems clear that we should not normally expect to be able to see a typical signal of balancing selection when selection maintains a polymorphism between loss-of-function and functional alleles. The reason is simply that loss-of-function alleles are created far too rapidly to be ancient. When loss-of-function alleles nonetheless appear to be ancient, as is the case for some disease-resistance loci in A. thaliana (Stahlet al. 1999; Tianet al. 2002), this may be an indication that not all loss-of-function alleles are created equal: It is notable that these cases involve complete deletions, and it may be that such deletions are preferable to point mutations that may lead to incorrectly folded proteins, for example. This could reduce the rate of loss-of-function mutations sufficiently for ancient polymorphism to be maintained.
Acknowledgments
We thank A. Navarro and an anonymous reviewer for comments on the manuscript. M.N. thanks Peter Donnelly, Bob Griffiths, Dick Hudson, Steve Krone, Tom Kurtz, Paul Marjoram, Claudia Neuhauser, Gesine Reinert, Simon Tavaré, and Carsten Wiuf for many, many conversations about selection.
Footnotes
-
Communicating editor: W. Stephan
- Received August 7, 2002.
- Accepted December 17, 2002.
- Copyright © 2003 by the Genetics Society of America