Abstract
Methods of estimating twolocus sample probabilities under a neutral model are extended in several ways. Estimation of sample probabilities is described when the ancestral or derived status of each allele is specified. In addition, probabilities for twolocus diploid samples are provided. A method for using these twolocus probabilities to test whether an observed level of linkage disequilibrium is unusually large or small is described. In addition, properties of a maximumlikelihood estimator of the recombination parameter based on independent linked pairs of sites are obtained. A compositelikelihood estimator, for more than two linked sites, is also examined and found to work as well, or better, than other available ad hoc estimators. Linkage disequilibrium in the Xq28 and Xq25 region of humans is analyzed in a sample of Europeans (CEPH). The estimated recombination parameter is about five times smaller than one would expect under an equilibrium neutral model.
LINKAGE disequilibrium is widely recognized as an important aspect of variation in natural populations (Lewontin 1964, 1974; Langleyet al. 1974; Langley 1977). Despite this recognition, there appears to be no consensus about how to analyze linkage disequilibrium or even to summarize the levels of observed linkage disequilibrium when two or more polymorphic sites are observed in a sample of chromosomes. One approach has been to calculate D^{2} or r^{2} for all pairs of polymorphic sites and plot these values as a function of the distance between each pair of sites. (For examples, see Langley 1977; Chakravartiet al. 1984; Langleyet al. 2000; TaillonMilleret al. 2000.) Since the moments of these summary statistics are known at least approximately under standard neutral models (Ohta and Kimura 1969, 1971; Kimura and Ohta 1971; Hill 1975), this has been useful. However, much information is lost in these summary statistics. An alternative analysis consists of reporting the P value of an exact test of independence for all pairs of sites (e.g., Macphersonet al. 1990; Langleyet al. 2000; Vieira and Charlesworth 2000; however, see Lewontin 1995 for an alternative to this approach). Unfortunately, this approach gives little sense of whether observed levels of linkage disequilibrium are higher or lower than expected for pairs of tightly linked sites.
Recently, methods for estimating likelihoods under simple neutral models have been introduced (Griffiths and Marjoram 1996; Kuhneret al. 2000; Nielsen 2000). In principle these methods should allow the most powerful analyses to be carried out on samples with multiple linked polymorphic sites. However, at present, these Monte Carlo methods are extremely computationally intensive, and it has been difficult to assess when a valid estimate of the likelihood is obtained and even more difficult to assess the properties of any statistical inference based on these methods.
In summary, quantifying and interpreting observed patterns of linkage disequilibrium remain a challenge. To address this challenge, we propose in this article that one consider polymorphic sites in pairs and utilize likelihood methods appropriate for analyzing a pair of polymorphic sites. That is, we suggest that it may be of use to interpret observed twosite sample configurations in light of the twosite sampling distribution under a simple neutral model, without summarizing the data in a statistic such as D^{2} or r^{2}. When more than two linked polymorphisms appear in a data set, this approach will entail some loss of information, but a great deal is gained in tractability relative to the full multisitelikelihood approach. In this article, we describe some methods of calculating (or estimating) twosite sampling distributions and some applications of these distributions for the analysis of samples from natural populations.
Although methods of calculating or estimating twolocus sample probabilities have been previously described (Golding 1984; Hudson 1985; Ethier and Griffiths 1990), very little use has been made of these distributions, in part because of the computational effort that has been necessary to obtain these probabilities. However, even inexpensive desktop computers are now sufficiently fast to calculate these probabilities, at least for small sample sizes. Furthermore, the required sampling distributions can be made available over the Internet.
In addition to describing existing methodology, we extend the methods in several ways. These include considering samples in which the ancestral/derived states of alleles are taken into account. Also, twolocus diploid sample probabilities are calculated. In addition, we describe how these distributions can be used to assess observed levels of linkage disequilibrium between sites and how they may be used to estimate the recombination rate parameter of a neutral model.
THE MODEL AND NOTATION
We consider a selectively neutral twolocus random union of gametes model with discrete generations and WrightFisher sampling to produce succeeding generations (Karlin and McGregor 1968; Ewens 1979; Griffiths 1981). The population size, assumed constant, is denoted N. We assume that each locus has the same neutral mutation rate, although relaxing this assumption is trivial. An infiniteallele model of mutation is assumed, although we focus primarily on the case in which mutation rates are small, in which case the model becomes essentially the same as the infinitesites mutation model. The neutral mutation rate at each locus is denoted u, and the recombination rate between the two loci is denoted r. For large populations the sampling properties are functions of the composite parameters, 4Nu (≢ θ) and 4Nr (≢ ρ) (Ohta and Kimura 1969, 1971; Hill 1975).
We focus our attention on samples with exactly two alleles at each locus. The two alleles at the first locus are designated A_{0} and A_{1}; the two alleles at the other locus are designated B_{0} and B_{1}. (At this point the labeling is arbitrary, although later, when ancestral and mutant alleles are specified, the labeling will be meaningful.) A sample of n gametes is randomly drawn from a population at stationarity under the neutral model. The unordered sample configuration is denoted by n = (n_{00}, n_{01}, n_{10}, n_{11}), where n_{ij} is the number of sampled gametes that carry allele A_{i} at the A locus and allele B_{j} at the B locus. Hence, n_{00} + n_{01} + n_{10} + n_{11} = n. The frequencies of the A_{1} allele and the B_{1} allele in the sample are p_{1} = (n_{10} + n_{11})/n and q_{1} = (n_{01} + n_{11})/n, respectively, and the frequency in the sample of the A_{1}B_{1} gamete is p_{11} = n_{11}/n. In this notation, D = p_{11} − p_{1}q_{1} and r^{2} = D ^{2}/(p_{1}(1 − p_{1})q_{1}(1 − q_{1})) are two commonly employed sample measures of linkage disequilibrium.
The probability of a particular sample configuration, n = (i, j, k, l), is denoted q_{u}((i, j, k, l); θ, ρ) or when no ambiguity results as q_{u}(n; θ, ρ). This sample probability corresponds to the probability given by Ethier and Griffiths (1990, Equation 2.14) and the quantity ΦM of Golding (1984). We note that q_{u}((i, j, k, l); θ, ρ) = q_{u}((i, k, j, l); θ, ρ), since we assume that the mutation rate is the same at the two loci. It is q_{u}(n; θ, ρ) and closely related probabilities that are the main foci of this article. Because we are interested in polymorphism at single nucleotide sites, the case of very small θ is of primary interest, and most results will be for the limiting case as θ → 0.
OBTAINING SAMPLE PROBABILITIES
Recursion equations method: Numerical values of q_{u}(n; θ, ρ) can be obtained for small samples by solving a recursion, originally due to Golding (1984) and further analyzed by Ethier and Griffiths (1990). The reader should consult these articles for details. For sample sizes >40, the linear systems of equations that need to be solved become quite large. For example, with a sample of size 40, the last set of equations that must be solved has >20,000 equations. However, the system is sparse, having only nine or fewer nonzero coefficients in each equation. A program to numerically solve Golding's recursion was written by the author and is available at http://home.uchicago.edu/~rhudson1. The program utilizes a conjugate gradient method and indexed storages of the sparse matrices as described by Press et al. (1992) to solve the linear systems.
Randomgenealogies Monte Carlo method: An alternative to solving Golding's recursion is to estimate the twolocus sample probabilities by the method of Hudson (1985). This method is practical for samples of sizes up to 100 and perhaps somewhat larger. Briefly, the estimate is obtained by generating a large number of independent twolocus genealogies (under the neutral model with the appropriate value of ρ) using standard coalescent machinery (Hudson 1983). For each genealogy, one calculates the probability of the sample configuration of interest. The average of these probabilities is an estimate of q_{u}(n; θ, ρ). Because it is of use later, we describe the method in more detail. Before proceeding with this description we note that Monte Carlo Markov chain methods, such as that of Nielsen (2000), are likely to be very much faster than the method described below. Nielsen's method estimates essentially the same probability that we consider here but can be used on the much more difficult problem of more than two linked sites. However, for estimating the probabilities of all possible configurations for a pair of sites and a given sample size, the method of Hudson (1985) may be competitive with the Monte Carlo Markov chain methods. (For small sample sizes, the point may be moot, since the sample probabilities have already been calculated and tabulated, as described in results and applications. For larger sample sizes, the issue remains important.) We now describe the method of Hudson (1985).
A twolocus genealogy is produced by generating a random sequence of events, proceeding backward in time from the present, as described by Hudson (1983). The events are coalescent events, in which two lineages merge into a single common ancestor, and recombination events, in which a single ancestral chromosome splits into two parental chromosomes. We denote the ith event by E_{i}. The complete ordered sequence of events is denoted ϵ and is referred to as the Esequence. Associated with each event is a specification of which lineage or lineages are involved. The Esequence completely determines the topology of the A locus and B locus gene trees. A complete specification of the twolocus genealogy requires that one also specify the time intervals between the events. We note, however, that under the constant population size model, the Esequence can be generated without regard to the time intervals between events. The time interval preceding E_{i} is denoted T_{i}. The ordered sequence of these time intervals is referred to as the Tsequence. Under the constant population size model and conditional on the Esequence, the time intervals are independent exponentially distributed random variables. The mean of T_{i} depends on the configuration of the ancestral lineages during the interval, which, in turn, depends on the Esequence. The calculation of the mean of T_{i} conditional on the Esequence is also described in Hudson (1983).
The twolocus genealogy can be summarized as two tiplabeled gene trees, one for the A locus and one for the B locus. We arbitrarily number the branches of the A locus tree from 1 to 2n − 2 and designate the length of the ith branch by a_{i}, measuring time in units of 4N generations. Note that for any particular branch, say the ith, a_{i} is the sum of one or more consecutive elements of the Tsequence. Similarly, the branches of the B locus tree are numbered, and their lengths are denoted by b_{j}. As with the A locus tree, the lengths of the B locus tree are sums of one or more consecutive elements of the Tsequence. It is assumed that the number of mutations on branch i of the A locus tree, conditional on its length, is Poisson distributed with mean (θ/2)a_{i}. The sum of the a_{i} is denoted τ_{A} and the sum of the lengths of the B locus branches by τ_{B}. For a given twolocus genealogy, it is a simple matter to check for each pair of branches, one from the A locus gene tree and one from the B locus gene tree, whether a mutation on the A locus branch and a mutation on the B locus branch would lead to the specified sample configuration, n. This property of the twolocus genealogy depends only on ϵ and not on the Tsequence. Let I(ϵ, n, j, k) denote an indicator variable that is one if branch j on the A locus tree and branch k on the B locus tree are such a pair of branches and zero otherwise. If I(ϵ, n, j, k) equals one, then the sample configuration, n, would arise if one or more mutations occurred on branch j of the A locus tree, and one or more mutations occurred on branch k of the B locus tree, and no mutations occurred elsewhere on the tree. Thus given ϵ and the Tsequence, the probability of the configuration n being produced by mutations on branch j of the A locus tree and branch k of the B locus tree is
In the case of the constant population size model, the method of Hudson (1985) can be made more efficient by replacing the randomly generated values of a_{j}b_{k} by their expectation conditional on the Esequence. That is, we estimate h_{u}(n, ρ) by
Ancestral and derived alleles: In the previous paragraphs, we did not specify which alleles were ancestral and which were the mutant (or derived). It is now common to obtain sequence from a closely related species and infer which alleles are ancestral. The probabilities of sample configurations with specified ancestral/derived states are no more difficult to calculate than the unspecified configurations. A sample in which the ancestral/derived status of each allele is specified is referred to as an “adspecified” sample, and otherwise the sample is “ad unspecified.” The algorithm we just described can be modified to estimate the probabilities of adspecified samples by simply changing the indicator function used. Golding's recursion can also be modified to calculate adspecified sample probabilities. We use the convention for adspecified samples that A_{0} and B_{0} denote the ancestral alleles and A_{1} and B_{1} denote the mutant alleles. For adspecified samples, the quantities corresponding to q_{u}(n; θ, ρ) and h_{u}(n; ρ) are denoted q(n; θ, ρ) and h(n; ρ).
The adunspecified probabilities can be obtained from the adspecified probabilities by summing four (or fewer) distinct adspecified sample probabilities, which result in the same unspecified sample configuration. That is, we can obtain the adunspecified probabilities with
The method of Hudson (1985) can be extended to any neutral model in which the twolocus genealogy can be efficiently generated. In particular, simple island models of geographic structure and models with changing population size can be easily accommodated. A program to estimate h(n; ρ) using (4) under these models is available at http://home.uchicago.edu/~rhudson1.
Sequenced samples with two polymorphic sites: In the previous sections, the samples considered are assayed only at two sites. The intervening and flanking nucleotide sites may or may not be polymorphic. This is exactly the situation considered by Nielsen (2000). In contrast, we consider now the case where a set of contiguous sites are sequenced in each individual and all sites are therefore examined, and all polymorphisms in the sequenced segment are detected. Thus, full haplotype information is obtained for all sites in the segment. This is the situation considered by Griffiths and Marjoram (1996) and Kuhner et al. (2000). Nielsen (2000), Griffiths and Marjoram (1996), and Kuhner et al. (2000) all analyze the very difficult problem of estimating the probability of samples with arbitrary numbers of linked sites. In contrast, we now limit ourselves to the special case where only two sites are found to be polymorphic in the sample (and the rest are monomorphic), because, in this case, the randomgenealogies method of Hudson (1985) can be easily extended to calculate these sample probabilities for a sequenced segment. This can be done as follows.
If the segment sequenced is L nucleotides long, we use an Llocus model, where each locus corresponds to a nucleotide site. We number the nucleotide positions from 1 (the leftmost site) to L(rightmost site.) Instead of a twolocus gene genealogy we must consider an Llocus gene genealogy. Each site has associated with it a gene genealogy or gene tree. The sum of the lengths of the branches of the gene tree of the ith site is denoted τ_{i}. We denote
In results and applications we compare h(n; ρ) and h_{seq}(n, x, y; ρ, θ_{t}) to see how much the knowledge that there are no other polymorphisms between or near the focal pair of sites affects an inference about ρ. The estimates of h_{seq}(n, x, y; ρ, θ_{t}) obtained as described here may be useful for checking other algorithms for estimating sequenced sample configuration probabilities.
Diploid samples: Up to this point, we have considered samples consisting of haplotypes. It would be useful to have sample probabilities analogous to q(n; θ, ρ) for the case of diploid samples. We show here how the probabilities of diploid samples can be expressed in terms of the haploid sample probabilities.
For diploids, with two alleles segregating at each locus, there are 10 distinct diploid genotypes. Often the phase of double heterozygotes is not determined directly, in which case there are only 9 distinguishable diploid genotypes. However, we begin by considering the case where the haplotypes constituting double heterozygotes are determined experimentally. In this case, the data can be represented by a 10vector, n_{d} = (n_{0}, n_{1}, … , n_{9}). We use n_{0} to denote the number of couplingphase double heterozygotes (A_{0}B_{0}/A_{1}B_{1}) and n_{1} to denote the number of repulsionphase double heterozygotes (A_{0}B_{1}/A_{1}B_{0}). The numbers of each of the other diploid genotypes are designated by n_{i}, i = 2, … , 9. From the vector, n_{d}, we can count the number of each of the four possible chromosomes. That is, the vector n_{d} maps unambiguously to the underlying haploid data configuration, which we denote by n(n_{d}). Under random mating, the probability of n_{d} is q(n(n_{d}); θ, ρ) times the probability that 2n haploids of configuration n(n_{d}) when randomly paired produce n_{d}. In symbols,
Now consider the case where the phase of double heterozygotes is not determined by the experimenter. In this case, we cannot observe n_{0} or n_{1}, but we do observe their sum. We denote the sum by n_{dh}. We denote the diploid data set in this case by n_{d}_{−}_{9} = (n_{dh}, n_{2}, … , n_{9}). Given n_{dh}, the actual number of coupling phase double heterozygotes, n_{0}, could be any value from zero to n_{dh}. Each of these possible values corresponds to a different n_{d} configuration. We denote these possible n_{d} configurations by n_{d}(i, n_{d}_{−}_{9}), where i = 0, … , n_{dh}. That is, if n_{d}_{−}_{9} = (n_{dh}, n_{2}, … , n_{9}), then n_{d}(i, n_{d}_{−}_{9}) = (i, n_{dh} − i, n_{2}, … , n_{9}). Then the probability of a diploid configuration n_{d}_{−}_{9} is obtained by summing up the probability of each of these mutually exclusive possible n_{d} configurations, as:
Conditional probabilities: Most applications of these twolocus sampling distributions will focus only on pairs of sites in which both sites are polymorphic in the sample. This means that rather than q(n; θ, ρ), it will be useful to consider the probability of specific sample configurations conditional on two alleles segregating in the sample at each locus. That is, it is useful to consider the conditional probability
It may also be of interest to condition on other events. For example, one may wish to limit consideration to polymorphisms in which the rarer allele has frequency >0.05, or one may wish to condition on precisely the marginal allele frequencies observed. These are easily calculated by changing the summation in the denominator of the righthand side of (14). Various conditional probabilities are utilized in the following sections.
RESULTS AND APPLICATIONS
Example sampling distributions: I have used the Hudson (1985) Monte Carlo algorithm (and Equation 4 or 5) to estimate h(n; ρ) for all possible twolocus sample configurations (with exactly two alleles at each locus) for samples sizes of 20, 30, 40, 50, and 100 and a range of ρ values between 0 and 100. These are available at http://home.uchicago.edu/~rhudson1. The program used to estimate these quantities is also available at this site. The program actually generates multilocus gene trees and simultaneously estimates the sample probabilities for a range of recombination rates and all possible sample configurations simultaneously. The results for sample size 40 have been compared for several configurations to the results of solving the recursions of Golding (1984). No significant discrepancies were found. Thus two very different approaches using entirely independent computer code produced essentially the same values for the probabilities of a large number of sample configurations over a range of ρ values. In addition, the results for large recombination rates converge on the free recombination configuration probabilities that are easy to calculate with Ewens sampling distribution (Ewens 1972) and the assumption of independence of the two sites. These two results give considerable reassurance that the Monte Carlo program functions correctly. The program can also be used to estimate twolocus sample probabilities under the island model of spatial structure and under a model with recent exponential growth in population size.
Figure 1 shows some conditional sampling distributions for a sample of size 90. This figure shows the asymmetric Ushaped distribution of linkage disequilibrium, which is typical for low recombination rates, the broad distribution of linkage disequilibrium for values of ρ in the range of 5–20, and the unimodal and nearly normal distribution of linkage disequilibrium for large ρ. An application of these distributions is described in the next section.
The conditional probabilities q_{c}(n; ρ) when plotted as functions of ρ are conditionallikelihood curves. Estimates of three such curves are shown in Figure 2. Application of these likelihood curves for estimating ρ are described in a subsequent section, Estimating ρ.
Before proceeding to some applications of these sample probabilities we consider briefly the effects of knowing which alleles are ancestral and the effects of having full sequence data vs. assessing the variation at only two specified sites. In Figure 3, we show plots of h(n; ρ) for a set of four adspecified samples and h_{u}(n; ρ) for the corresponding adunspecified sample. In the plot, we see that different adspecified samples, each corresponding to the same adunspecified sample, can have very different likelihood curves. In this case, two of the configurations have monotonically decreasing likelihood curves, and the other two configurations have a maximum for intermediate values of ρ. However, two of the configurations, which have similar curves, have much higher probabilities than the other two configurations. Thus, although knowledge of which alleles are ancestral may in specific instances have an important impact on inferences about ρ, on average, knowledge of which alleles are ancestral may not provide very much more information. This suggestion is supported by the results concerning asymptotic variances of estimates of ρ using many pairs of sites, which is described later.
The effects of having full sequence data vs. assessing the variation at only two polymorphic sites are illustrated in Figure 4, in which we show plots of h(n; ρ) and h_{seq}(n, x, y; ρ, θ_{t}) as functions of ρ for n = (15, 15, 9, 1) and θ_{t} = 0.6. In this case, the curves are very similar in shape. One should be cautious about generalizing from this particular result, but it appears that, for the case where only two sites are polymorphic, likelihoods based on full sequence data may be quite similar to the likelihood based on assays of only the pair of sites that are polymorphic. For other values of θ_{t} this may not hold.
Assessing observed levels of linkage disequilibrium: With the conditional probabilities described above, it is possible to assess whether or not the level of linkage disequilibrium observed between a particular pair of sites is compatible with our neutral model and a specified value of ρ. For example, suppose that we have a sample of 90 gametes in which two sites, 1000 bp apart, are assayed and found to be polymorphic, with sample configuration, n = (53, 7, 17, 13). (This is the sample configuration corresponding to n_{11} = 13 in Figure 1.) Note that the marginal allele frequencies of the derived alleles are 30(= 17 + 13) at one locus and 20(= 7 + 13) at the other locus. We ask the question of whether this observed configuration is compatible with the hypothesis that 4Nr_{bp} (= ρ_{bp}) equals, say, 0.001, where r_{bp} is the recombination rate per base pair per generation. With these assumptions, the relevant recombination parameter for our pair of sites is ρ = ρ_{bp} 1000 = 1.0. To assess whether the linkage disequilibrium observed is unusually high or low, we examine the distribution of n conditional on the observed marginal allele frequencies with ρ = 1.0. Conditional on the marginal allele frequencies, there are only 21 possible sample configurations, which can be specified by the value of n_{11}. The conditional probabilities of these configurations for ρ = 1.0 are shown in Figure 1 and were obtained using Equation 14, with the summation in the denominator of the righthand side being over the 21 possible sample configurations.
We define a statistical test by summing the conditional probabilities of all configurations with probabilities less than or equal to the probability of the observed configuration. If this sum is <0.05 we reject our hypothesis. In our example, the configurations with n_{11} = 8, … , 13 have probabilities less than or equal to the observed configuration. The sum of the probabilities of these configurations is approximately 0.028, so our hypothesis is rejected. That is, we conclude that this sample configuration is quite unusual for ρ = 1.0 and note that it would be much more likely if ρ = 10. We refer to this test as the “exact test conditional on the marginals” (ETCM) and emphasize that it requires that one know or specify ρ.
Suppose now that our pair of polymorphic sites, with n = (53, 7, 17, 13), had been 100,000 bp apart, in which case, ρ = 100. If we examine the conditional probabilities for this value of ρ (shown on the right in Figure 1), we find that the sum of the probabilities of configurations with less or equal probabilities is ~0.02, and so the hypothesis would again be rejected. (In this case the configurations with lower probabilities are n_{11} = 0, and n_{11} = 14, … , 20.) There is too much linkage disequilibrium in this case. This illustrates that the ETCM can reject the null hypothesis due to either too much or too little linkage disequilibrium.
This test could be carried out for all possible pairs of polymorphic sites in a contiguous region to explore the possibility that some sites exhibit unusually high or low levels of linkage disequilibrium. Such sites may be indicative of hotspots of recombination or mutation or epistatic selection. This is a complementary approach to the usual analysis of linkage disequilibrium in which Fisher's exact test of independence is applied to all pairs. It should be noted that, when more than two linked sites are considered, there is a statistical dependence between the ETCMs on each pair, and any interpretation of the results should bear this in mind.
Estimating ρ: Using independent linked pairs: The conditional probabilities q_{c}(n; ρ) when plotted as functions of ρ are likelihood curves. Estimates of three such curves are shown in Figure 2. (The estimates are obtained using (14) and (4) but for adspecified samples.) Most sample configurations lead to monotonically increasing or decreasing likelihood curves, but samples with high but not complete linkage disequilibrium lead to likelihood functions with a maximum at a finite positive value of ρ. Thus the maximumlikelihood estimate of ρ for a single pair of sites is often zero or infinity. When the estimate is finite and greater than zero, the confidence interval is clearly large (as indicated by the broad likelihood function). This was noted before by Hudson (1985) and by Hill and Weir (1994). However, if one had k pairs of sites, where each pair is independent of the other pairs and where each pair of sites has the same ρ, then one might be able to obtain a very accurate estimate of ρ. In this case the overall likelihood, for small θ, is approximately
To characterize the statistical properties of the maximumlikelihood estimate
In Figure 5, we show, in addition to our estimate of E_{ρ0}(log(q_{c}(n; z))) for ρ_{0} = 5.0, a quadratic function obtained by a leastsquares fit to several points near z = 5.0. Clearly the expected likelihood function is very close to quadratic for a substantial range of z in the neighborhood of 5.0. This suggests that the asymptotic properties may apply for moderate values of k. By estimating the second derivative of the expected likelihood functions, we have estimated asymptotic variances (times k) for a set of values of ρ_{0} and plotted the results as a function of ρ_{0} in Figure 6. The plot in Figure 6 shows that pairs separated by ρ in the range of 2–15 are best for estimating ρ. As ρ decreases below 2.0, the asymptotic variance grows rapidly. For larger values of ρ the asymptotic variance grows more slowly.
To give some feeling for the number of pairs needed to get a reasonable estimate of ρ we consider a numerical example. Suppose that we have data for k = 20 independent pairs of polymorphic sites, where the rare allele has in every case allele frequency of at least 0.1 and where the recombination rate, ρ_{0}, between the sites of each pair is the same and is in the range from 2 to 10. In this case, we see from Figure 6 that k
In practice, different pairs of polymorphic sites will be different distances apart and will have different recombination rates. In the case where the physical distance between each pair of sites was known and the recombination rate per base pair was the same for each pair of polymorphic sites, then the likelihood for k polymorphic pairs is approximately
Using the above method, we can estimate the asymptotic variance of ρ in adunspecified samples and in diploid samples in which the phase of double heterozygotes is not determined. For the case of adunspecified samples of 50 gametes in which ρ = 5.0, the asymptotic variance is estimated to be 2.2/k, which is essentially the same as what we found for adspecified samples. Thus, there appears to be little if any gain in knowing which alleles are ancestral when the data consist of independent linked pairs of sites. A similar asymptotic analysis could be used to determine the optimum sample size when one can trade off sample size for number of pairs. We do not pursue that analysis here.
Finally we note that, when investigating pairs of polymorphic sites, the incorporation of gene conversion is straightforward. It is necessary only to establish an effective recombination rate as a function of distance that incorporates gene conversion, such as Andolfatto and Nordborg (1998) or Langley et al. (2000). The scaledlikelihood functions do not need to be recalculated. Frisse et al. (2001) recently estimated gene conversion rates in humans using this method.
Using many linked sites: In the previous sections, we considered one or more linked pairs of sites, where each pair is independent of the others. We now consider the case where more than two linked polymorphic sites are assayed in a single sample. In this situation, full likelihood considering all polymorphisms simultaneously is the proper approach. Griffiths and Marjoram (1996), Kuhner et al. (2000), and Nielsen (2000) have provided Monte Carlo methods for estimating these likelihoods. However, at present the methods of Griffiths and Marjoram (1996) and Kuhner et al. (2000) are difficult to employ due to the very large computational requirements and the difficulty in determining when adequate convergence has been obtained. The approach of Nielsen (2000) is less computationally demanding and goes some way toward solving this problem. However, the properties of fulllikelihood estimators have not been explored, and the computation requirements for this exploration are daunting. As an alternative we consider a composite (or pseudo) likelihood obtained by using (19), where the summation on the righthand side is over all pairs of sites. This is an approach suggested by Hudson (1993). Because of the statistical dependence between different pairs of sites, this expression is not the true likelihood. We can nevertheless maximize this function to obtain an estimate of ρ_{bp}, which is denoted
To assess the quality of this compositelikelihood estimator,
The figures show that the estimator, C_{wak}, performs poorly compared to the other estimators shown. The estimator C_{wak} is an improved version of the estimator of Hudson (1987), which would perform slightly more poorly than C_{wak}.
The estimator γ has a considerably lower 90th percentile than the other estimators. This is desirable as long as the 90th percentile is larger than the true value. Unfortunately, for large ρ and θ = ρ/4, the 90th percentile of γ falls below the true value. In addition, it has a median considerably below the true value and a 10th percentile well below the 10th percentile of
The medians of the estimators
Finally, since
An application to human polymorphism data: We close by estimating ρ from a survey of human variation on the X chromosome (TaillonMilleret al. 2000). In this study, 39 SNPs were surveyed in three population samples. In the following, only the sample of 92 CEPH males is considered. The parameter ρ_{bp} was estimated by maximizing the composite likelihood for (i) all 39 >SNPs, (ii) the 14 SNPs in Xq25, and (iii) the 10 SNPs in or near Xq28. For loci on the X chromosome, using the h(n; ρ) functions described for autosomal loci will result in an estimate of 2Nr, where r is the pergeneration recombination rate in females and N is the total effective population size. In the following, the estimates returned by the computer programs were multiplied by 2 so that the values reported here for ρ are, in fact, estimates of 4Nr, where r is the female recombination rate. The estimates were 9 × 10^{−5}, 8.8 × 10^{−5}, and 9 × 10^{−5} for all 39 sites, for the 14 sites of Xq25, and the Xq28 sites, respectively. These results suggest that there is no overall difference in recombination rates in the Xq25 region compared to the Xq28 region. The effective population size of humans has been estimated from levels of DNA polymorphism to be ~10^{4} and the recombination rate per base pair, though quite variable, is for this region on average ~10^{−8}. Thus we might expect that ρ_{bp} ≈ 4 × 10^{−4} or about five times larger than we estimate from the X chromosome data.
The linkage disequilibrium at each of the 741 (= 39 × 38/2) possible pairs of sites was evaluated by the ETCM, described in Assessing observed levels of linkage disequilibrium, assuming ρ_{bp} = 9 × 10^{−5}. A total of only 7 pairs, or ~0.9% of the pairs, showed unusual twolocus configurations (with P < 0.025) using the ETCM. This is somewhat fewer than one would expect by chance when carrying out this many tests. Thus, overall, our analysis does not support the presence of a low recombination rate region in Xq25, as suggested by TaillonMiller et al. However, it should be noted that 6 of the 7 significant pairs involve sites from the Xq25 region, and the seventh is immediately adjacent to the Xq25 region. Of the 6 significant pairs in the Xq25 region, 5 show unusually large linkage disequilibrium, but the 6th shows unusually low linkage disequilibrium. The latter pair of sites is separated by 30 kb and D′ for the pair is 0.22. The five pairs showing significantly large linkage disequilibrium from the Xq25 region were among the pairs identified by TaillonMiller et al. as showing significant linkage disequilibrium by a Fisher's exact test. In Figure 9, the values of r^{2} are plotted for all pairs of sites within the Xq25 region (14 sites and hence 91 pairs) and for all pairs within the Xq28 region (10 sites or 45 pairs). TaillonMiller et al. also displayed this plot. In the figure those sites that showed significantly large or small linkage disequilibrium given ρ_{bp} = 9 × 10^{−5}, using the ETCM test, are shown with x's. Also shown in Figure 9 is the expected value of r^{2} conditional on the frequencies of the minor alleles being >0.32. These were calculated with the tabulated h(n; ρ) values for a sample of size 92. The fit to the data appears to be fairly good, but there does appear to be some tendency for Xq28 pairs to fall below the line for larger distances and to be above the line for shorter distances. In contrast, the Xq25 sites appear to scatter on both sides of the curve.
CONCLUSIONS
Twolocus sample probabilities offer a useful tool for analyzing linkage disequilibrium levels from population surveys. Carrying out tests of significance for pairs of sites and estimating ρ from many sites is computationally quick, once the twolocus sample probabilities are in hand. A compositelikelihood approach for estimating ρ with more than two linked sites appears to work as well as the method of Wall (2000) and better than other ad hoc methods. Of course, none of these ad hoc methods should be used when fulllikelihood methods are computationally feasible.
Acknowledgments
The author is grateful to Jeff Wall and Tony Long for useful discussion. Also, I am indebted to Jeff Wall for providing most of the numbers going into Figures 7 and 8 and to P. TaillonMiller for providing the X chromosome data set. This research was supported in part by National Institutes of Health grants HG02098, HG01847, and HG02107.
Footnotes

Communicating editor: Y.X. Fu
 Received March 1, 2001.
 Accepted August 31, 2001.
 Copyright © 2001 by the Genetics Society of America