Constrained Disequilibrium Values and Hitchhiking in a Three-Locus System
- * Section of Evolution and Ecology, University of California, Davis, California 95616
- † School of Public Health, University of California, Berkeley, California 94720
- ‡ Department of Integrative Biology, University of California, Berkeley, California 94720
- Corresponding author: Mark N. Grote, Section of Evolution and Ecology, University of California, Davis, CA 95616. E-mail: mngrote{at}ucdavis.edu
Abstract
Positive selection on a new mutant allele can increase the frequencies of closely linked alleles (through hitchhiking), as well as create linkage disequilibrium between them. Because this disequilibrium is induced by the selected allele, one may be able to identify loci under selection by measuring the influence of a candidate locus on pairwise disequilibrium values at nearby loci. The constrained disequilibrium values (CDV) method approaches this problem by examining differences in pairwise disequilibrium values, which have been normalized for two- and three-locus systems, respectively. We have investigated in detail the reliability of inferences based on CDV, using simulation and analytical methods. Our main results are (i) in some circumstances, CDV may not distinguish well between a selected locus and a neighboring neutral locus, but (ii) CDV seldom indicates “selection” in neutral haplotypes with moderate to large 4Nc. We conclude that, although the CDV method does not appear to precisely locate selected alleles, it can be used to screen for regions in which hitchhiking is a plausible hypothesis. We present a microsatellite data set from human chromosome 6, in which constrained disequilibrium values suggest the action of selection in a region containing the human leukocyte antigen (HLA)-A and myelin oligodendrocyte glycoprotein (MOG) loci. The connection between hitchhiking and disequilibrium has received relatively little attention, so our investigation presents opportunities to address more general issues.
In the genetic hitchhiking model, positive selection on a new mutant allele increases the frequencies of other alleles physically linked to the mutant, skewing the frequency distributions at the linked loci. Theoretical and empirical studies of hitchhiking generally focus on the reduction in variation at linked neutral loci that can result if the recombination rate is low and the selected mutant is quickly fixed in the population (Maynard-Smith and Haigh 1974; Ohta and Kimura 1975; Aguadé et al. 1989; Kaplan et al. 1989; Stephan and Langley 1989; and many more recent studies). Relatively few studies have focused on linkage disequilibrium (gametic phase disequilibrium) in haplotypes subject to hitchhiking (Thomson 1977; Robinson et al. 1991a; Begun and Aquadro 1994, 1995). Our concerns in the present study are the nature of linkage disequilibrium created by hitchhiking, and the extent to which certain patterns of disequilibrium can be used to make inferences about hitchhiking. In a wider sense, our aim is to investigate a particular method that uses linkage disequilibrium to physically locate genes of interest.
Relatively insignificant linkage disequilibrium is always created by the appearance of a new mutant, because initially the mutant is found only on an “ancestral” haplotype of closely linked alleles. Thomson (1977) showed that hitchhiking can noticeably increase this existing disequilibrium, if selection in favor of the new mutant is strong enough relative to the recombination rate with linked loci. Hitchhiking can also create significant disequilibrium between nonselected alleles if they are closely linked to the selected allele (Thomson 1977); here, disequilibrium between neutral alleles is induced by mutual association with the selected allele. These associations are expected to decline in strength as recombination breaks up haplotypes bearing the selected mutant.
Robinson et al. (1991a,b) introduced the constrained disequilibrium values (CDV) method as a means of identifying loci that may have been subject to recent hitchhiking. Inference with the CDV method depends on comparisons between pairwise disequilibrium measures, which have been normalized according to different constraints imposed by two- and three-locus systems. A familiar two-locus linkage disequilibrium measure is
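As a minimal sketch, the standard two-locus coefficient D_ab = f_ab - p_a p_b and Lewontin's normalized form D′ (D divided by its largest attainable magnitude given the allele frequencies) can be computed as follows. The function names and the example frequencies are illustrative assumptions, not taken from the paper.

```python
def pairwise_d(f_ab, p_a, p_b):
    """Two-locus disequilibrium: haplotype frequency minus product of allele frequencies."""
    return f_ab - p_a * p_b


def d_prime(f_ab, p_a, p_b):
    """Lewontin's normalized D': D divided by its maximum magnitude for these allele frequencies."""
    d = pairwise_d(f_ab, p_a, p_b)
    if d >= 0:
        # Largest positive D allowed by the two-locus frequency constraints
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        # Largest magnitude of negative D allowed by the constraints
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max > 0 else 0.0


# Illustrative frequencies only (not data from the paper)
print(pairwise_d(0.04, 0.05, 0.10))
print(d_prime(0.04, 0.05, 0.10))
```

The sign of D′ follows the sign of D here; the CDV comparison uses magnitudes.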
Our purpose is to present some recent results that bear upon the use and interpretation of CDV. First, we summarize further simulations of the deterministic model, describing some circumstances under which CDV does, or does not, lead to reliable inferences about the position of the selected locus. In connection with this, we analyze the normalized disequilibrium measures under a selection model with simplifying assumptions and show that inferences with CDV are especially sensitive to allele frequencies at neutral loci closely linked to the selected locus. Second, we apply the CDV method to data sets generated under a stochastic model of neutral haplotypes, using a simulation program of Hudson (1983, 1985), to assess the performance of CDV under a finite-population “null” model. In our discussion, we reexamine the types of inferences that can be made with CDV and address some conceptual and practical issues. Finally, we apply the CDV method to marker haplotypes from human chromosome 6, to illustrate one use of CDV.
METHODS
Measures of disequilibrium and hitchhiking: Our attention centers on two normalized measures of pairwise linkage disequilibrium, D′ and D″, and in particular on the difference in their magnitudes, ∂ = |D′| - |D″|.
Robinson et al. (1991b) derived a new normalized pairwise measure, D″ab, which incorporates further constraints imposed on Dab by a third locus. For a three-locus diallelic haplotype, where the C locus plays the role of the “constraining” locus,
Like
Assuming Dab > 0 for the moment, ∂ = |D′| - |D″| is greater than zero when the pairwise measure Dab is more extreme relative to its two-locus maximum than to its positive range in the three-locus system; in this case the pairwise association between a and b appears to be relatively weaker when all of the pairwise associations of the three-locus system are taken into account. Loosely speaking, when ∂ > 0, the association between a and b is said to be partly accounted for by their mutual association with c. Assuming further that c is a selected mutant, this property of ∂ is the primary reason for treating ∂ > 0 as the “footprint” of a hitchhiking event, in which the neutral a and b alleles have hitchhiked with c (Robinson et al. 1991a).
Although the normalized measure D′ may change during a hitchhiking event (Thomson 1977), D′ alone does not distinguish between loci that may be under positive selection and linked neutral loci that are only hitchhiking with the selected locus. Similarly, there is a single measure of third-order disequilibrium, Dabc, which can also be normalized appropriately (Geiringer 1944; Thomson and Baur 1984), but Dabc also makes no distinction between selected and hitchhiking loci. The main claim of Robinson et al. (1991a) is that ∂ values, when interpreted appropriately, can make this distinction.
For a given three-locus haplotype, each locus may play the role of the constraining locus, and there are three ∂ values: ∂ab(c), ∂a(b)c, and ∂(a)bc. Using deterministic simulations, Robinson et al. (1991a) found that ∂ was often large and positive when the “constraining” allele was increasing in frequency due to positive selection, but the linked alleles were selectively neutral. When a nonselected allele played the “constraining” role, Robinson et al. (1991a) found that ∂ tended to be zero or negative. Based on their observations, Robinson et al. (1991a) proposed the following criteria for inferring selection based on ∂ values:
1. If one of the three ∂ values is positive and the remaining two are zero or negative, the constraining allele that gives the positive ∂ is the one that may have experienced recent selection.
2. If more than one of the ∂ values is positive, but one is much larger than the rest (for this study, more than double the next largest), the constraining allele that gives the large ∂ is the one that may have experienced recent selection.
3. If all three ∂ values are ≤0, or two are positive but close in value, no conclusion about selection can be drawn.
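The three criteria can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' program: the function name, the dictionary input keyed by constraining locus, and the numerical tolerance `tol` used to treat tiny values as zero are all hypothetical; only the "more than double" threshold comes from the text.

```python
def cdv_signal(deltas, ratio=2.0, tol=1e-9):
    """Apply the CDV criteria to a dict mapping constraining locus -> delta value.

    Returns the locus flagged as possibly selected, or None if no conclusion
    can be drawn. `tol` is an assumed numerical tolerance for "zero".
    """
    positive = {locus: v for locus, v in deltas.items() if v > tol}
    if len(positive) == 1:
        # Criterion 1: exactly one positive delta, the others zero or negative
        return next(iter(positive))
    if len(positive) >= 2:
        ranked = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)
        (top_locus, top_value), (_, next_value) = ranked[0], ranked[1]
        if top_value > ratio * next_value:
            # Criterion 2: the largest delta is more than double the next largest
            return top_locus
    # Criterion 3: no positive delta, or positive deltas too close in value
    return None


# Illustrative values only (not results from the paper)
print(cdv_signal({"a": -0.10, "b": 0.00, "c": 0.30}))
print(cdv_signal({"a": 0.20, "b": 0.25, "c": -0.10}))
```

The sketch returns no signal when several ∂ values are positive and none dominates, which is the conservative reading of criterion 3.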
Robinson et al. (1991a) paid considerable attention to the magnitudes of ∂ values under various scenarios, but we first focus simply on which loci the CDV method identifies as candidates for selection, in a large series of deterministic simulations.
A deterministic hitchhiking model: The deterministic simulations are based on a three-locus, diallelic model that evolves via a standard system of algebraic recursions (Feldman et al. 1974; Thomson 1977; Hartl and Clark 1989). For purposes of the CDV method, we are interested in a single new mutant allele and the closely linked alleles of the ancestral haplotype on which the mutant first appeared. The alleles of interest, in their order on the chromosome, are a, b, and c, one of which will be the new mutant and the others linked alleles. By convention, A, B, and C may be taken to represent all other alleles at their respective loci.
The recursion equations describing changes in the haplotype frequencies can be specified by selection and mutation parameters described immediately below, the recombination rates r1 and r2 between the A and B loci and the B and C loci, respectively (where r1 + r2 - 2r1r2 gives the recombination rate between A and C for the “no-interference” model), and a set of initial haplotype frequencies. The latter are determined by specifying initial allele frequencies pa(0), pb(0), pc(0), and a single initial disequilibrium value [e.g., D′ab(0) when c is the new mutant]. In addition, we assume that the haplotype bearing the new mutant has not experienced mutation or recombination before the simulation begins at generation zero [for example, if c is the new mutant, this implies fabc(0) = pc(0)]. The frequency dynamics of a strongly selected allele, once it has left the zero-frequency boundary, are commonly modeled as a deterministic process (e.g., Kaplan et al. 1989). For convenience, we have assumed that the time spent close to the boundary is small relative to recombination and mutation rates near the selected locus.
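The no-interference expression for the A-C recombination rate follows from requiring a crossover in exactly one of the two intervals. A minimal sketch, with a hypothetical function name, is:

```python
def recomb_ac(r1, r2):
    """Recombination rate between the outer loci A and C with no interference.

    A and C recombine when exactly one of the two intervals (A-B, B-C)
    experiences a crossover: r1*(1 - r2) + (1 - r1)*r2 = r1 + r2 - 2*r1*r2.
    """
    return r1 * (1 - r2) + (1 - r1) * r2


# Rates used in the Figure 1 run: r1 = r2 = 0.001
print(recomb_ac(0.001, 0.001))
```

For rates this small the double-crossover correction is negligible, so the A-C rate is very nearly r1 + r2.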
Fitnesses at the selected locus (using genotypes at the C locus for illustration) are given by wcc = 1 - sc, wCc = 1, and wCC = 1 - sC. We have adopted a general framework for hitchhiking studies, as our selection model encompasses both directional selection leading to fixation of the new mutant (e.g., sc ≤ 0 and 0 < sC ≤ 1) and balancing selection (0 < {sC,sc} < 1). Mutation is unidirectional at rates μa = μb = μc = 10⁻⁵ per generation from alleles a, b, and c to A, B, and C, respectively, so that the alleles of interest are transient. We use terms like “equilibrium frequency” loosely, referring to the relatively fast adjustment of allele frequencies that results from the appearance of a new selected mutant. For completeness, we have included the recursion equations in the appendix.
Scope of the deterministic simulations: The parameter space for the deterministic model is large and multi-dimensional, so we limit our investigation to a relatively narrow subset of parameter values under which measurable linkage disequilibrium is likely to be present. Using simple frequency arguments, one can conclude that most new mutants arise on relatively common haplotypes; but more unusual events, in which mutants appear on rare haplotypes, are actually of greater interest in hitchhiking studies. Thomson (1977) showed that hitchhiking will only noticeably perturb allele frequencies and disequilibria when at least one of the neutral alleles initially linked to the mutant is rare. The pairwise disequilibrium value Dab is only large when alleles a and b are at intermediate frequencies and strongly associated in the “coupling” (ab) phase. An initially rare ab haplotype, on which a strongly selected mutant happens to occur, is in a primary position to pass through this range of intermediate frequencies in strong coupling.
In the following simulations, we have (somewhat arbitrarily) set the initial frequency of at least one of the neutral alleles at p(0) = 0.05, to ensure that the ancestral haplotype is sufficiently rare. Table 1 shows parameter values that are typical of the simulations. Here, c is the selected mutant and pa(0) and pb(0) are treated in a symmetric fashion, each assuming the value p(0) = 0.05 while the other takes values between 0.05 and 0.9 in successive runs. Some values of the initial pairwise disequilibrium D′ab(0) rule out certain combinations of pa(0), pb(0), and pc(0) in Table 1, but the same treatments are always applied to the a and b alleles.
Values of the remaining parameters were guided by a few basic rules. Hitchhiking is thought to be a weak force unless selective values are roughly an order of magnitude greater than recombination rates (Maynard-Smith and Haigh 1974; Thomson 1977; Kaplan et al. 1989), so we have chosen selection and recombination parameters accordingly. Generally, for each setting of pa(0), pb(0), pc(0), D′ab(0), r1, and r2 that was investigated, we examined a basic series of runs formed by 7 × 10 = 70 pairs of selection coefficients (for example, sC and sc as shown in Table 1). Combinations of parameter values that involved interactions beyond those of primary interest [for example, D′ab(0) = -0.25 and r1 ≠ r2 in Table 1] were left unexamined to keep the number of runs reasonable. More detailed tables are in Grote (1996) and are available upon request.
Within these guidelines, our first objectives are to significantly enlarge upon the number of deterministic cases examined in Robinson et al. (1991a), and to investigate some cases where the relationship between the ∂ values is inconsistent with correct inference of the selected locus.
CDV in a stochastic neutral model: Our second aim is to study the performance of CDV in a neutral, finite-population model, to determine whether or not genetic drift and sampling effects can produce patterns of linkage disequilibrium conforming to criteria 1 or 2 above. Robinson et al. (1991a) used an ad hoc method to study ∂ values under genetic drift.
We have modified a computer program of Hudson (1983, 1985) to study ∂ values in the neutral model. The program simulates random samples of three-locus haplotypes, generated under the neutral “infinite alleles” model with recombination at equilibrium. The program requires the following input parameters: n, the number of haplotypes per sample; 4Nc, the scaled recombination rate between the A and C loci (the B locus is assumed to be halfway between A and C); θa, θb, and θc (with, e.g., θa = 4N μa, where N is the effective population size and μa is the mutation rate to new A-locus alleles). We used the value θ = 0.2 at each locus, corresponding to the approximate numerical solution of
Figure 1. Deterministic simulation for selection at c, with D′ab(0) = 0.0, r1 = r2 = 0.001, pa(0) = 0.05, pb(0) = 0.1, pc(0) = 0.001, sC = 0.025, sc = 0.075.
RESULTS
Deterministic simulations: Figures 1, 2, and 3 show sample runs of a deterministic model in which c is the selected mutant and the A and B loci are neutral. In Figures 1, 2, and 3, recombination rates, initial allele frequencies at the A and C loci, and mutation and selection parameters are the same; only the initial frequency of the b allele varies between the figures.
In the allele frequency plots, pc(t) approaches the equilibrium value sC/(sc + sC) = 0.25, then slowly declines due to mutation (not evident in these plots). Frequencies of the a and b alleles both increase due to hitchhiking with the selected mutant c. The frequencies ultimately attained by a and b depend on their initial frequencies, Dab(0), the strength of selection on c, and the recombination rates between these loci (Maynard-Smith and Haigh 1974; Thomson 1977). The initial value of Dab is zero in each of the runs, and initial values of Dac and Dbc are small positive numbers reflecting the associations of a and b with the new mutant c. All three values of D increase with the hitchhiking effect, then slowly decrease. Because there is only a single selected locus in these runs, the equilibrium value of all disequilibria D and each ∂ is zero. In the deterministic model without selection on c, Dab would remain at zero, whereas Dac and Dbc would decline from their initial values to zero without a transient increase. In these runs, it is only after the disequilibrium measures D have attained relatively large values that deviations from ∂ = 0 are observed.
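The equilibrium value sC/(sc + sC) = 0.25 can be checked with a single-locus sketch of the selection step. This deliberately ignores mutation and the linked A and B loci, so it is not the three-locus recursion of the appendix; the function names are hypothetical, and the parameter values are those given for Figure 1.

```python
def next_pc(p, s_c, s_C):
    """One generation of viability selection at the C locus (no mutation, no linkage).

    Fitnesses follow the text: w_cc = 1 - s_c, w_Cc = 1, w_CC = 1 - s_C.
    """
    q = 1.0 - p
    w_cc, w_Cc, w_CC = 1.0 - s_c, 1.0, 1.0 - s_C
    w_bar = p * p * w_cc + 2 * p * q * w_Cc + q * q * w_CC  # mean fitness
    return p * (p * w_cc + q * w_Cc) / w_bar


def iterate_pc(p0, s_c, s_C, generations):
    """Iterate the single-locus recursion from initial frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_pc(p, s_c, s_C)
    return p


# Figure 1 parameters: pc(0) = 0.001, sc = 0.075, sC = 0.025
p_final = iterate_pc(0.001, s_c=0.075, s_C=0.025, generations=2000)
print(p_final, 0.025 / (0.075 + 0.025))
```

With mutation from c to C included, the frequency would settle slightly below this value and then drift slowly downward, as described above.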
Figure 2. Deterministic simulation for selection at c, the same as Figure 1 except pb(0) = 0.5.
In Figure 1, ∂ values between roughly generations 100 and 300 satisfy criteria 1 or 2 to correctly indicate selection at the C locus. Later in the run ∂ values conform to criterion 3, where no conclusions about selection would be made. Figure 2 conforms entirely to criterion 3, having no signal for selection during the run. In Figure 3, both the b and c alleles meet criteria for selection at different times in the run, although only c is under selection. In particular, applying the CDV criteria at any time between generations 320 and 550, we could conclude that the neutral b allele is in fact under positive selection (Figure 3 is similar to pattern II′ in Figure 3 of Robinson et al. 1991a). Inferences about the selected allele based on disequilibrium values at a single time-point could indeed be misleading in Figure 3, where knowledge of the whole history of the selective event might seem necessary for correct inference.
Figure 3. Deterministic simulation for selection at c, the same as Figure 1 except pb(0) = 0.05.
The performance of the CDV criteria in a large series of deterministic runs is summarized in Tables 1 and 2. Following the discussion above, we have classified each run by determining which alleles, if any, the CDV criteria would indicate as “selected.” The run of Figure 1 shows a correct signal at the C locus for 100 ≤ t ≤ 300 but gives no signal for selection otherwise, and is counted under the column “signal at c alone” in Table 1. Sampling such a run at an arbitrary time, we might draw no conclusions, but would not incorrectly identify a neutral allele as selected. The run of Figure 2 gives no signal for selection at all and is counted under “no signal” in Table 1. The run of Figure 3 gives, for 320 ≤ t ≤ 550, a misleading signal for selection at the neutral B locus and is counted under “signal at b” in Table 1 (there is a similar column for “signal at a”). Because there are no runs in this series with signals at both neutral loci, each run falls into only one of these categories. We have chosen a conservative classification that emphasizes times during which CDV leads to incorrect inferences. In the text below, we describe broad trends and give some breakdowns of the runs that would not be evident by examining Tables 1 and 2 alone. We use percentages in the tables and text as convenient summaries, but do not view these as probabilities.
Table 1. c is the new mutant under selection
Table 2. b is the new mutant under selection: pc(0) = 0.05, r2 = 0.001
In 26.1% (676/2590) of the runs in Table 1, the only signal identifying an allele under selection correctly points to c as the selected mutant. The CDV criteria identify the selected locus most reliably when the b allele is initially of moderate frequency: 50.5% (283/560) of the runs in Table 1 with pb(0) = 0.3, 0.4, or 0.5 and pa(0) = 0.05 resulted in the c allele being correctly identified, 35.5% (199/560) led to a possible misidentification of the b allele, and the remaining 14.0% (78/560) gave no signal for selection. CDV also performs well when the initial disequilibrium between the neutral alleles is negative: 46.7% (294/630) of the runs with
In these simulations, when sc ≤ 0.0 and sC > 0.0, the new mutant c will be transiently fixed in the population (often called a “selective sweep”). In the selective-sweep runs, 24.8% (257/1036) gave a correct signal from the c allele, 50.5% (523/1036) led to a possible misidentification of the b allele, and 4.2% (43/1036) to a possible misidentification of the a allele. In the next section, we will examine why CDV may not perform especially well in a selective sweep.
As one might imagine, there is a trend in the reliability of inferences associated with the ratio sc/sC: for fixed values of the remaining parameters, with both sc, sC > 0, runs with larger values of sc/sC tend to have no signal, those with smaller values of sc/sC tend to have incorrect signals from the b allele, and those with intermediate values of sc/sC allow the CDV method to perform best. The critical values of the ratio sc/sC depend in a complex way on the remaining parameters and appear to be different in each series of runs.
Table 2 is similar in structure to Table 1, except now b is the new selected mutant and the A and C loci are neutral. Here, by symmetry, there is no need to switch the roles of the neutral loci, and we use pc(0) = 0.05, r2 = 0.001 throughout. When b is the new mutant, a relatively small number of runs have a potentially misleading signal at a neutral locus, and nearly all are cases in which the neutral allele frequencies pa(0) and pc(0) differ widely [i.e., pa(0) = 0.75 or 0.9 and pc(0) = 0.05].
The role of allele frequencies at a closely linked neutral locus: The most problematic observation in the simulations above was a strong tendency for the CDV method to indicate selection at the B locus when c was the new selected mutant. Using some mathematics and general aspects of the hitchhiking model, it is possible to show that a rare neutral allele on the ancestral haplotype can easily be mistaken for the selected allele, when using the CDV method. The analysis requires some simplifying assumptions, but gives some generality to the results of the deterministic simulations, showing that our observations do not depend strongly on particular choices of parameter values.
An overdominance model: We examine the behavior of ∂a(b)c, the ∂ value that indicates selection at the B locus, during the rapid increase of a new, strongly overdominant c allele. To avoid dealing with the time component explicitly, we focus on ∂a(b)c at t = 0, t “small” (a few generations) and t “moderate” (on the order of 100 to a few hundred generations). We assume that r1 and r2 are small enough so that recombination in the ancestral haplotype abc can be practically ignored when t is near zero, and further assume that pb(0) is small enough so that b and c are in strong coupling for small-to-moderate t. Low recombination and strong coupling of b and c imply that fab(t), fac(t), and fbc(t) are all approximately equal to pc(t) for small-to-moderate t. We finally assume Dab(0) = 0, but due to hitchhiking, all of Dab, Dac, and Dbc are positive after a few generations of selection.
To characterize ∂a(b)c, we must study the relationship between
At t = 0, fac = pc, and because the new mutant c is found only with a, Dac = max Dac. Further, the inequality
For small t, with the disequilibria of m2 still small relative to the third-order products, the reasoning is very similar. Because the new mutant c is still found almost exclusively on the ancestral haplotype, Dac ≈ max Dac to a good approximation, so it also must be true that max*Dac ≈ max Dac. We then have ∂a(b)c ≈ 0 for small t.
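The statement that Dac equals its two-locus maximum while c is found only with a can be checked numerically. The sketch below assumes fac = pc (every copy of c sits on an a-bearing haplotype, which also requires pc ≤ pa) and uses the standard two-locus bound for positive D; the function name is hypothetical.

```python
def d_ac_and_max(p_a, p_c):
    """Return (D_ac, max D_ac) when the mutant c occurs only on a-bearing haplotypes.

    Assumes f_ac = p_c, which requires p_c <= p_a.
    """
    f_ac = p_c                       # all c copies are on haplotypes carrying a
    d_ac = f_ac - p_a * p_c          # equals p_c * (1 - p_a)
    d_max = min(p_a * (1 - p_c), (1 - p_a) * p_c)  # two-locus maximum for D > 0
    return d_ac, d_max


# Initial frequencies from the Figure 1 run: pa(0) = 0.05, pc(0) = 0.001
d_ac, d_max = d_ac_and_max(0.05, 0.001)
print(d_ac, d_max)
```

Because pc ≤ pa, the smaller term in the bound is always (1 - pa)pc, which is exactly Dac under this assumption.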
The situation changes when the disequilibria of m2 are large relative to the third-order products, so that m2 < 0 and min*Dac > 0; here, we must use the second case in the definition of
Using very similar arguments, it is possible to show that ∂ab(c) ≥ 0 during the same time interval, so that the same general mechanisms give the “correct” signal at the C locus. The contrasting result ∂(a)bc ≈ 0 can be obtained using the same detailed arguments, or more easily can be obtained by noting that Dbc remains very close to max Dbc during the time interval of interest. Taken together, these arguments suggest that under our assumptions the CDV criteria could indicate selection at either the B or C loci, but not at the A locus.
A selective sweep model: A second basic model may be handled without doing any further analysis. For the selective sweep case, we assume sc ≤ 0 and sC > 0, so that the selected mutant c will be fixed, but the remaining assumptions are the same. The transient dynamics of allele and haplotype frequencies are the same as in the overdominance model, with perhaps minor differences in time scale; the main difference is in the endpoint of the selection process. Maynard-Smith and Haigh (1974) showed that an allele at a polymorphic locus closely linked to a new favored mutant may readily fix with the selected mutant. In our model, if the c allele fixes in a small number of generations, the time during which Dac and Dbc are positive will be very short, since these equal zero once the C locus has become monomorphic. If there has been very little recombination with A- or B-bearing haplotypes by the time c fixes, Dab will also depart only transiently from zero, because a and b will then nearly fix with c. There appears to be only a small time frame in which we could observe any disequilibrium, hence any change in ∂ values, in the sweep model. The basic reasoning of the previous section again suggests that during this time, a signal for apparent selection is possible from either the selected locus or a nearby neutral locus carrying a rare allele.
Stochastic simulations: We have calculated ∂ values in simulated random samples from a stochastic, neutral diallelic model, to informally investigate the “type I” error in the CDV method. The three-locus neutral model is perhaps the simplest null-model that would be considered for data of the type used for CDV. Hudson (1985) investigated the sampling distribution of the pairwise disequilibrium D using a similar approach. Pairwise D have been treated analytically under the neutral model by Hill (1975), Golding (1984), and Hill and Weir (1988).
Distributions of the three ∂ values in samples of size n = 100 are shown in Figure 4 for 4Nc = 10, 25, and 100. The histograms of Figure 4 show only the univariate (marginal) distributions of ∂ values and contain no information about the associations within samples of the three ∂ values. At each value of 4Nc, the ∂ = 0 class is by far the most common for each of ∂(a)bc, ∂a(b)c, and ∂ab(c), with ∂ > 0 relatively uncommon. Negative values of ∂ are more common than positive values when ∂ departs from zero. Relative to ∂(a)bc and ∂ab(c), ∂a(b)c is more often different from zero.
Figure 4. Distributions of ∂ values under a stochastic neutral model for three values of 4Nc (where c is the recombination rate per generation between the A and C loci and N is the effective population size). Each individual data set consists of n = 100 three-locus haplotypes, having exactly two alleles segregating at each locus, with per-locus heterozygosities of at least 0.095. The distributions are based on 1000 independent data sets, generated using a modified program of Hudson (1983, 1985). Bars above ∂ = 0 are not drawn to scale with the remaining bars; instead the frequencies at ∂ = 0 are indicated in the figure.
The frequencies of apparent “hitchhiking” events, obtained by applying the CDV criteria to the samples in Figure 4, are shown in Table 3. For 4Nc = 10, each locus satisfies the criteria for selection in a small percentage of cases: here one can expect to find a signal for selection at some locus perhaps 8 to 9% of the time, using the CDV criteria in a neutral sample. With 4Nc ≥ 25, however, any apparent signal for “selection” based on the CDV criteria would be unusual. In concordance with the deterministic simulations, although here there is no selection, we obtain false signals for selection at the B locus more often than at A or C (as expected, the A and C loci give similar results). We take this as further evidence of a “position” effect that favors the middle locus.
Table 3. Frequency of signal for apparent selection: neutral samples (n = 100)
Marker haplotypes from human chromosome 6: To illustrate one use of the CDV method, we have calculated ∂ values in a series of three-locus microsatellite haplotypes in the 6p21.3-22.1 region of human chromosome 6 (see Figure 5). We do not presume any of these marker loci are selected, but suppose instead that perhaps one or more markers could be closely linked to a selected gene.
We used a “sliding window” approach, examining in turn each of the five groups of three adjacent markers among the seven markers shown in Figure 5. Human leukocyte antigen (HLA)-F3′ and myelin oligodendrocyte glycoprotein (MOG)c are dinucleotide repeats closely linked to the HLA F locus and the MOG locus, respectively. HLA-A, a major histocompatibility complex class I locus, is located between the D6S265 and HLA-F3′ markers shown in Figure 5 (Lauer et al. 1997; Mosser et al. 1997). This region contains other loci of biological and evolutionary interest and has been the focus of recent intensive efforts to map the hereditary hemochromatosis locus, now known to be ∼2.2 Mb telomeric to D6S464 (Feder et al. 1996; Lauer et al. 1997; Mosser et al. 1997). The data we used are from a sample of 70 randomly ascertained ethnic Germans and were generously provided by L. Calandro and G. F. Sensabaugh (see Sensabaugh et al. 1996). We used an expectation-maximization (EM) algorithm to estimate haplotype frequencies from multilocus genotypes (Baur and Danilovs 1980), working separately with each group of three adjacent markers. The EM algorithm provides haplotype frequency estimates for all possible combinations of alleles at the three loci, some of which have very low estimates and are unlikely to actually be in the sample. For further calculations, we retained only those three-locus haplotypes in which the constituent two-locus estimates were at least 0.05. Seventeen three-locus haplotypes in all met this minimum frequency threshold: 3 haplotypes of the D6S265/HLA-F3′/MOGc loci, 5 haplotypes of the HLA-F3′/MOGc/D6S258 loci, 3 haplotypes of the MOGc/D6S258/D6S306 loci, 4 haplotypes of the D6S258/D6S306/D6S105 loci, and 2 haplotypes of the D6S306/D6S105/D6S464 loci. All are combinations of the few most common alleles at each locus. We calculated ∂ values in each of these 17 haplotypes, converting to dialleles by combining the alleles not under consideration into a single class.
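The windowing and the conversion to dialleles can be sketched as follows. The marker order is read off the five groups listed above; the function names, the "other" label, and the example allele labels are hypothetical, and the EM estimation step is not reproduced here.

```python
# Marker order inferred from the five three-marker groups named in the text
MARKERS = ["D6S265", "HLA-F3′", "MOGc", "D6S258", "D6S306", "D6S105", "D6S464"]


def sliding_windows(markers, size=3):
    """Return every group of `size` adjacent markers, in map order."""
    return [tuple(markers[i:i + size]) for i in range(len(markers) - size + 1)]


def collapse_to_diallelic(alleles, focal):
    """Recode a list of alleles as the focal allele versus a pooled 'other' class."""
    return [a if a == focal else "other" for a in alleles]


windows = sliding_windows(MARKERS)
for w in windows:
    print("/".join(w))

# Illustrative allele labels only (not data from the sample)
print(collapse_to_diallelic(["12", "14", "12", "15"], focal="12"))
```

Seven markers yield 7 - 3 + 1 = 5 windows, matching the five groups examined.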
All 3 haplotypes of the D6S265/HLA-F3′/MOGc loci had disequilibrium patterns conforming to criteria 1 or 2 (Table 4), but none of the remaining 14 haplotypes met these criteria.
DISCUSSION
Inferences with CDV: In the deterministic runs, where the new selected mutant appeared at a terminal locus (the C locus), the CDV method did not distinguish well between the selected locus and a neutral neighbor, especially when a relatively rare allele of the neutral locus was initially linked with the selected mutant. In this case a signal for apparent selection could either be detected from the selected locus or the neutral locus. This situation is unfortunate and somewhat paradoxical, because we have argued that selected mutants that form on rare haplotypes create the most significant linkage disequilibrium in a hitchhiking scenario. To some extent, the CDV method is sensitive to each of the parameters of the model, but we discovered in particular a sensitivity to allele frequencies at the middle locus (the B locus). We showed, using an analytical approach under the assumption of strong selection and tight linkage, that a rare neutral allele at the B locus may easily be mistaken by the CDV criteria for the selected mutant c.
Lewontin (1988) showed that the normalized pairwise measure D′ab and other related measures are not in any general sense independent of the underlying allele frequencies pa and pb, although they are routinely treated as such. The CDV method uses both D′ and D″ at each pair of the three-locus system, where D″ incorporates further one- and two-locus frequency constraints. In light of Lewontin’s (1988) results, it is not unexpected that CDV shows a sensitive dependence on allele and haplotype frequencies, as well as on other parameters of the model.
In our deterministic simulations, when the middle locus (the B locus) had the new selected mutant, the CDV method gave correct inferences in a large majority of runs. It is difficult to put this attractive result into practice in the inference setting, because a signal for apparent selection at the B locus could indeed reflect selection at the locus or could be a false signal of the type that was commonly observed when c was the selected allele. One remedy might be to confine inferences to terminal loci, perhaps obtaining additional markers that could place any locus of interest at the “A” or “C” positions of our model. This assumes we could be virtually certain about inferences at terminal loci, an assertion that is contradicted by the fraction of deterministic runs of Tables 1 and 2 in which a signal appears at an unselected terminal locus. It further seems possible that a generalization of our analytical approach, which relaxes assumptions about position, could show that inferences about terminal loci may not be reliable in the presence of rare neutral alleles. We think at present that the CDV method may not allow for high-precision inferences about the location of selected mutants; on this point we depart from Robinson et al. (1991a).
Figure 5. Physical map of seven dinucleotide repeat markers in the 6p21.3-22.1 region of human chromosome 6. Approximate intermarker distances are based on the YAC contig and STS maps of Mosser et al. (1997). Allele frequency distributions are for the sample of 70 ethnic Germans provided by L. Calandro and G. F. Sensabaugh (Sensabaugh et al. 1996). The x and y axes of the histograms are labeled according to repeat number and frequency in the sample, respectively.
Table 4. Ethnic German sample haplotypes with signal for selection
The stochastic simulations showed that patterns of linkage disequilibrium conforming to criteria 1 or 2 are uncommon for 4Nc ≥ 10 and highly unusual for 4Nc as large as 100. Here, we think there is potential inference value in the CDV method, because a simple neutral model can apparently be ruled out if either criterion 1 or 2 is met in a moderate-sized sample, with 4Nc on the order of 100. At this point, other nonselective alternative hypotheses (such as the neutral model with population structure or migration) cannot immediately be ruled out; this requires work beyond our current scope.
Although we do not think that CDV can very accurately distinguish the particular locus that has the selected allele, we do think that CDV can be used to screen for fairly localized regions that may have a recent history of hitchhiking (in general agreement with Robinson et al. 1991a). The basic requirements appear to be that the terminal loci span at least a distance of 4Nc = 10 (with the third locus roughly intermediate), that there is a standard minimum level of heterozygosity H ≥ 0.095 at each locus, and that there is moderately strong, but not complete, linkage disequilibrium in the region.
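The two quantitative requirements can be checked with a short Python sketch. For a diallelic locus, H = 2p(1 − p), so H ≥ 0.095 corresponds to a minor allele frequency of at least 0.05. The function names and the small numerical tolerance are assumptions made for illustration, and the qualitative requirement on linkage disequilibrium is deliberately not encoded because the text gives no threshold for it.

```python
def heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs)


def passes_basic_screen(locus_freqs, span_4nc, min_h=0.095, min_span=10.0):
    """Check the heterozygosity and 4Nc-span requirements for one window.

    locus_freqs: one list of allele frequencies per locus.
    span_4nc: scaled recombination distance between the terminal loci.
    """
    eps = 1e-12  # tolerance so that H exactly at the boundary still passes
    enough_variation = all(
        heterozygosity(f) >= min_h - eps for f in locus_freqs
    )
    return span_4nc >= min_span and enough_variation
```

A window whose rarest diallelic class falls below 0.05 at any locus, or whose terminal loci are closer than 4Nc = 10, would be excluded by this check.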
Selected mutations and linked markers at equilibrium: We now describe a simple model of recurrent selected mutations and address some implications for CDV and similar methods. The simplest model assumes that selected alleles arise at random points in the genome. If such events are rare, the influence of new selected alleles on linked loci is transient: eventually the new mutant reaches equilibrium, and recombination, mutation, and genetic drift again dominate the dynamics of linked loci. Under this simple model, neutral alleles linked to a new overdominant mutant will increase in frequency and may reach high levels of disequilibrium, but do not generally fix (because the overdominance mode tends to preserve extant variation). Two such loci will return to neutral frequency and phase equilibria, respectively, at rates 1 − 1/(2N), the rate of loss of heterozygosity at either locus (with N the effective population size; see, e.g., Crow and Kimura 1970), and 1 − c − 1/(2N), the rate of decay of linkage disequilibrium between the loci (for the random union of gametes model, where c is the recombination rate; Hill 1974). In the selective sweep case, if a neutral allele fixes with the new mutant, the time until polymorphism could be reestablished at the neutral locus is on the order of 1/μ, where μ is the neutral mutation rate (Crow and Kimura 1970). If either overdominant or favored mutants reoccur in a particular region over relatively short time scales, and the recovery of linkage equilibrium or polymorphism is inadequate, reperturbation by successive hitchhiking events may not be detectable. Even if selected mutants appear only rarely, the availability of adequate polymorphism at closely linked sites, on which linkage disequilibrium could be recorded, may be in question; for if θ = 4Nμ is small, a majority of linked neutral sites will be monomorphic.
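To give a sense of the time scales involved, the sketch below treats the two rates as per-generation multiplicative factors, 1 − 1/(2N) for heterozygosity and 1 − c − 1/(2N) for linkage disequilibrium, and converts each to a half-life in generations. This reading of the rates, the function names, and the example value of c are assumptions made for illustration only.

```python
import math


def heterozygosity_factor(n):
    """Per-generation factor by which heterozygosity is retained under drift."""
    return 1.0 - 1.0 / (2.0 * n)


def ld_factor(n, c):
    """Per-generation factor by which linkage disequilibrium is retained."""
    return 1.0 - c - 1.0 / (2.0 * n)


def half_life(per_generation_factor):
    """Generations until a quantity decaying geometrically is halved."""
    return math.log(0.5) / math.log(per_generation_factor)


# Hypothetical values: N = 2000 and recombination fraction c = 0.001.
het_half = half_life(heterozygosity_factor(2000))
ld_half = half_life(ld_factor(2000, 0.001))
```

With these assumed values, disequilibrium decays several times faster than heterozygosity is lost, because recombination adds to the effect of drift.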
We have further claimed that disequilibrium created by hitchhiking is primarily connected to rare events in which selected mutants appear on low-frequency haplotypes. In particular, these impediments suggest that in chromosomal regions thought to be subject to recurrent selective sweeps (Aguadé et al. 1989; Begun and Aquadro 1994, 1995), the linkage disequilibrium that is indeed observed is primarily the result of mutational or other events that occurred since the most recent sweep.
For tightly linked loci, patterns of linkage disequilibrium conforming to criteria 1 or 2 persist approximately as long as the time required for the new mutant to reach equilibrium (Thomson 1977, and our deterministic simulations). If we assume strong selection and large N, and confine our attention only to mutants that invade the population, the expected time until the new mutant fixes in the selective sweep model is approximately
Human chromosome 6 haplotypes: In Table 4, we showed three haplotypes of the D6S265/HLA-F3′/MOGc loci that met the CDV criteria for hitchhiking. HLA-F3′ and MOGc are physically close, so we must make a rough assessment of 4Nc between these loci if we wish to compare the data with the neutral simulations of Figure 4 and Table 3. Although there are apparently no family data that give precise estimates of the recombination fraction between HLA-F3′ and MOGc, the physical distance between these loci is known to be approximately 100-150 kb, based on YAC contig and STS maps (Mosser et al. 1997; Human Genome Data Base 1997). For estimation purposes, we will assume the distance is 100 kb and use N = 2000, perhaps a conservatively low value for a modern European population. If we use the crude conversion 1 Mb ≈ 1.16 cM [obtained by observing that the genome size is equivalently 3200 Mb or 3702 cM in human females (The Human Transcript Map 1996)], we conclude that 4Nc between HLA-F3′ and MOGc is ∼9. Thus, the D6S265/HLA-F3′/MOGc haplotype appears to span a distance over which criteria 1 or 2 are not commonly met in the simple neutral model. The setting here is not directly analogous to the null-model calculations of Figure 4 and Table 3 for two main reasons: (i) different three-locus marker haplotypes may share an allele at one or more loci, introducing dependencies not present in the simulated neutral haplotypes; (ii) it is well known that the “infinite alleles” mutation model used for the neutral simulations does not apply to microsatellite loci (see, e.g., Valdes et al. 1993). However, the role of the mutation model, especially given the time scale for mutational events relative to the duration of hitchhiking events, should be minor. Finally, if the scaling of 4Nc is approximately correct, the chances under the neutral model that even one of the D6S265/HLA-F3′/MOGc haplotypes would meet the CDV criteria appear to be small.
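The arithmetic behind the estimate 4Nc ∼ 9 can be reproduced with a few lines of Python. The sketch assumes that, over such a short interval, the recombination fraction c is simply the map distance in centimorgans divided by 100. The function name `four_n_c` is hypothetical.

```python
GENOME_MB = 3200.0   # physical genome size quoted in the text
GENOME_CM = 3702.0   # female genetic map length quoted in the text


def four_n_c(distance_kb, n_effective):
    """Scaled recombination distance 4Nc for a physical distance in kb."""
    cm_per_mb = GENOME_CM / GENOME_MB          # about 1.16 cM per Mb
    map_distance_cm = (distance_kb / 1000.0) * cm_per_mb
    c = map_distance_cm / 100.0                # small-distance approximation
    return 4.0 * n_effective * c


low = four_n_c(100, 2000)    # 100 kb, the distance assumed in the text
high = four_n_c(150, 2000)   # 150 kb, the upper end of the quoted range
```

At the assumed 100 kb the result is about 9.3, matching the value of ∼9 used above. At 150 kb, with the same N, it would be about 13.9.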
We conclude that hitchhiking with one or more selected alleles, closely linked to the D6S265/HLA-F3′/MOGc loci, is a plausible explanation for the patterns of linkage disequilibrium observed in these haplotypes. Three apparently distinct haplotypes meet criteria 1 or 2, suggesting that hitchhiking with overdominant alleles is the more likely scenario: the data would seem to require otherwise that several favored alleles in the region are simultaneously being selected for, or that an ancestral haplotype bearing a favored allele has experienced several mutation events. We have also argued that the loss of variation under the selective sweep model poses a serious problem for observing disequilibrium, making it unlikely that disequilibrium created specifically by selectively favored alleles would ever be observed. While we have scaled back previous efforts to infer the precise location at which selection has acted, our results are consistent with other work on selection in this region of the human genome (Klitz and Thomson 1987; Satta et al. 1994; Parham and Ohta 1996). Our main intention in this example is to demonstrate that evidence for historical selection processes may indeed be found in the patterns of linkage disequilibrium we have focused on, in our investigation of the CDV method.
APPENDIX: THREE-LOCUS DETERMINISTIC RECURSIONS
In the three-locus diallelic model, the eight haplotypes (gametes) are ABC, ABc, AbC, Abc, aBC, aBc, abC, abc, and their respective frequencies in a given generation are x1, ..., x8,
Acknowledgments
We thank C. H. Langley, who read an earlier draft of the manuscript and made suggestions that led to substantial revisions. The human chromosome 6 haplotypes were collected by L. Calandro and G. F. Sensabaugh, who generously allowed us to use them here. We thank D. Cutler and A. D. Long for discussion and suggestions. An anonymous reviewer made suggestions that improved the presentation. This work was supported by National Institutes of Health grants HD-12731, GM-56688, and 5 T32 GM-07127.
Footnotes
- Communicating editor: G. B. Golding
- Received February 6, 1998.
- Accepted August 7, 1998.
- Copyright © 1998 by the Genetics Society of America