- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Hudson, R. R.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Hudson, R. R.
Two-Locus Sampling Distributions and Their Application
Richard R. Hudsonaa Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637
Corresponding author: Richard R. Hudson, 1101 E. 57th St., University of Chicago, Chicago, IL 60637., rr-hudson{at}uchicago.edu (E-mail)
Communicating editor: Y.-X. FU
| ABSTRACT |
|---|
Methods of estimating two-locus sample probabilities under a neutral model are extended in several ways. Estimation of sample probabilities is described when the ancestral or derived status of each allele is specified. In addition, probabilities for two-locus diploid samples are provided. A method for using these two-locus probabilities to test whether an observed level of linkage disequilibrium is unusually large or small is described. In addition, properties of a maximum-likelihood estimator of the recombination parameter based on independent linked pairs of sites are obtained. A composite-likelihood estimator, for more than two linked sites, is also examined and found to work as well, or better, than other available ad hoc estimators. Linkage disequilibrium in the Xq28 and Xq25 region of humans is analyzed in a sample of Europeans (CEPH). The estimated recombination parameter is about five times smaller than one would expect under an equilibrium neutral model.
LINKAGE disequilibrium is widely recognized as an important aspect of variation in natural populations (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Recently, methods for estimating likelihoods under simple neutral models have been introduced (![]()
![]()
![]()
In summary, quantifying and interpreting observed patterns of linkage disequilibrium remain a challenge. To address this challenge, we propose in this article that one consider polymorphic sites in pairs and utilize likelihood methods appropriate for analyzing a pair of polymorphic sites. That is, we suggest that it may be of use to interpret observed two-site sample configurations in light of the two-site sampling distribution under a simple neutral model, without summarizing the data in a statistic such as D2 or r2. When more than two linked polymorphisms appear in a data set, this approach will entail some loss of information, but a great deal is gained in tractability relative to the full multisite-likelihood approach. In this article, we describe some methods of calculating (or estimating) two-site sampling distributions and some applications of these distributions for the analysis of samples from natural populations.
Although methods of calculating or estimating two-locus sample probabilities have been previously described (![]()
![]()
![]()
In addition to describing existing methodology, we extend the methods in several ways. These include considering samples in which the ancestral/derived states of alleles are taken into account. Also, two-locus diploid sample probabilities are calculated. In addition, we describe how these distributions can be used to assess observed levels of linkage disequilibrium between sites and how they may be used to estimate the recombination rate parameter of a neutral model.
| THE MODEL AND NOTATION |
|---|
We consider a selectively neutral two-locus random union of gametes model with discrete generations and Wright-Fisher sampling to produce succeeding generations (![]()
![]()
![]()
) and 4Nr (
) (![]()
![]()
![]()
We focus our attention on samples with exactly two alleles at each locus. The two alleles at the first locus are designated A0 and A1; the two alleles at the other locus are designated B0 and B1. (At this point the labeling is arbitrary, although later, when ancestral and mutant alleles are specified, the labeling will be meaningful.) A sample of n gametes is randomly drawn from a population at stationarity under the neutral model. The unordered sample configuration is denoted by n = (n00, n01, n10, n11), where nij is the number of sampled gametes that carry allele Ai at the A locus and allele Bj at the B locus. Hence, n00 + n01 + n10 + n11 = n. The frequencies of the A1 allele and the B1 allele in the sample are p1 =
and q1 =
, respectively, and the frequency in the sample of the A1B1 gamete is p11 =
. In this notation, D = p11 - p1q1 and r2 =
are two commonly employed sample measures of linkage disequilibrium.
The probability of a particular sample configuration, n = (i, j, k, l), is denoted qu((i, j, k, l);
,
) or when no ambiguity results as qu(n;
,
). This sample probability corresponds to the probability given by ![]()
M of ![]()
,
) = qu((i, k, j, l);
,
), since we assume that the mutation rate is the same at the two loci. It is qu(n;
,
) and closely related probabilities that are the main foci of this article. Because we are interested in polymorphism at single nucleotide sites, the case of very small
is of primary interest, and most results will be for the limiting case as
0.
| OBTAINING SAMPLE PROBABILITIES |
|---|
Recursion equations method:
Numerical values of qu(n;
,
) can be obtained for small samples by solving a recursion, originally due to ![]()
![]()
![]()
Random-genealogies Monte Carlo method:
An alternative to solving Golding's recursion is to estimate the two-locus sample probabilities by the method of ![]()
) using standard coalescent machinery (![]()
,
). Because it is of use later, we describe the method in more detail. Before proceeding with this description we note that Monte Carlo Markov chain methods, such as that of ![]()
![]()
![]()
A two-locus genealogy is produced by generating a random sequence of events, proceeding backward in time from the present, as described by ![]()
and is referred to as the E-sequence. Associated with each event is a specification of which lineage or lineages are involved. The E-sequence completely determines the topology of the A locus and B locus gene trees. A complete specification of the two-locus genealogy requires that one also specify the time intervals between the events. We note, however, that under the constant population size model, the E-sequence can be generated without regard to the time intervals between events. The time interval preceding Ei is denoted Ti. The ordered sequence of these time intervals is referred to as the T-sequence. Under the constant population size model and conditional on the E-sequence, the time intervals are independent exponentially distributed random variables. The mean of Ti depends on the configuration of the ancestral lineages during the interval, which, in turn, depends on the E-sequence. The calculation of the mean of Ti conditional on the E-sequence is also described in ![]()
The two-locus genealogy can be summarized as two tip-labeled gene trees, one for the A locus and one for the B locus. We arbitrarily number the branches of the A locus tree from 1 to 2n - 2 and designate the length of the ith branch by ai, measuring time in units of 4N generations. Note that for any particular branch, say the ith, ai is the sum of one or more consecutive elements of the T-sequence. Similarly, the branches of the B locus tree are numbered, and their lengths are denoted by bj. As with the A locus tree, the lengths of the B locus tree are sums of one or more consecutive elements of the T-sequence. It is assumed that the number of mutations on branch i of the A locus tree, conditional on its length, is Poisson distributed with mean (
/2)ai. The sum of the ai is denoted
A and the sum of the lengths of the B locus branches by
B. For a given two-locus genealogy, it is a simple matter to check for each pair of branches, one from the A locus gene tree and one from the B locus gene tree, whether a mutation on the A locus branch and a mutation on the B locus branch would lead to the specified sample configuration, n. This property of the two-locus genealogy depends only on
and not on the T-sequence. Let I(
, n, j, k) denote an indicator variable that is one if branch j on the A locus tree and branch k on the B locus tree are such a pair of branches and zero otherwise. If I(
, n, j, k) equals one, then the sample configuration, n, would arise if one or more mutations occurred on branch j of the A locus tree, and one or more mutations occurred on branch k of the B locus tree, and no mutations occurred elsewhere on the tree. Thus given
and the T-sequence, the probability of the configuration n being produced by mutations on branch j of the A locus tree and branch k of the B locus tree is
![]() |
(1) |
Thus to obtain the sample probability, qu(n;
,
), we sum over all branches j and k and take the expectation over the joint distribution of
and the T-sequence,
![]() |
(2) |
where E() designates expectation over random genealogies, j indexes the branches of the A locus tree, and k indexes the branches of the B locus tree. The approximation is for small
and is obtained by Taylor expanding the exponentials and dropping higher-order terms in
. Since we are interested in small
, we consider the following function:
![]() |
(3) |
This function is perhaps best described as the "scaled, small-
, likelihood function" and is referred to as the "scaled likelihood." The value of hu(n,
) at a particular value of
can be estimated by generating a large number, m, of two locus genealogies using the specified value of
and calculating the sum
![]() |
(4) |
where
i is the E-sequence of the ith randomly generated two-locus genealogy, and aj(i) and bk(i) are the branch lengths on the same two-locus genealogy. In effect, the method simply estimates the expected product of the lengths of pairs of branches that, if mutations occurred on them, would produce the specified sample configuration. To obtain an estimate of qu(n;
,
) we use 
(n,
). This is the method of ![]()
In the case of the constant population size model, the method of ![]()
) by
|
(5) |
This is feasible because aj and bk are sums of one or more consecutive elements of the T-sequence, which are exponentially distributed. If aj and bk share no elements of the T-sequence in common, then the expectation of the product is the product of the expectations. If they have elements in common, the expectation of the product is the product of the expectations plus the sum of the expectations of the elements that are common to both. For example, if aj is equal to the sum of T2 + T3 and bk is T3 + T4, then under the constant population size model, the expectation of ajbk is

where
i = E(Ti|
). This follows from properties of the exponential distribution. Thus if hu(n;
) is estimated with (5) rather than (4), the T-sequence does not need to be generated and a lower variance estimate of hu(n;
) is obtained.
Ancestral and derived alleles:
In the previous paragraphs, we did not specify which alleles were ancestral and which were the mutant (or derived). It is now common to obtain sequence from a closely related species and infer which alleles are ancestral. The probabilities of sample configurations with specified ancestral/derived states are no more difficult to calculate than the unspecified configurations. A sample in which the ancestral/derived status of each allele is specified is referred to as an "a-d-specified" sample, and otherwise the sample is "a-d unspecified." The algorithm we just described can be modified to estimate the probabilities of a-d-specified samples by simply changing the indicator function used. Golding's recursion can also be modified to calculate a-d-specified sample probabilities. We use the convention for a-d-specified samples that A0 and B0 denote the ancestral alleles and A1 and B1 denote the mutant alleles. For a-d-specified samples, the quantities corresponding to qu(n;
,
) and hu(n;
) are denoted q(n;
,
) and h(n;
).
The a-d-unspecified probabilities can be obtained from the a-d-specified probabilities by summing four (or fewer) distinct a-d-specified sample probabilities, which result in the same unspecified sample configuration. That is, we can obtain the a-d-unspecified probabilities with
![]() |
(6) |
where

In RESULTS AND APPLICATIONS, we compare scaled-likelihood curves for one a-d-unspecified sample and the corresponding a-d-specified configurations. In that section we also address the issue of whether knowledge of which alleles are ancestral can improve estimates of
.
The method of ![]()
) using (4) under these models is available at http://home.uchicago.edu/~rhudson1.
Sequenced samples with two polymorphic sites:
In the previous sections, the samples considered are assayed only at two sites. The intervening and flanking nucleotide sites may or may not be polymorphic. This is exactly the situation considered by ![]()
![]()
![]()
![]()
![]()
![]()
![]()
If the segment sequenced is L nucleotides long, we use an L-locus model, where each locus corresponds to a nucleotide site. We number the nucleotide positions from 1 (the leftmost site) to L(rightmost site.) Instead of a two-locus gene genealogy we must consider an L-locus gene genealogy. Each site has associated with it a gene genealogy or gene tree. The sum of the lengths of the branches of the gene tree of the ith site is denoted
i. We denote
Li=1
by
seq. We designate the positions of the two polymorphic sites by x and y. The branches of the tree of site x and of the tree site y are numbered arbitrarily from 1 to 2n - 2, and the length of branch j of the tree of site x is labeled
x,j, and similarly
y,k denotes the length of the kth branch of the tree of site y. The two-locus sample configuration at sites x and y is denoted by n, as before. ![]()
t. In this notation, if
t/L is small, the expected number of polymorphic sites is 
tE(
seq), and the probability of no polymorphic sites in the sample is
E(e-
t
seq). As before, we define
= 4Nr, but in this case r is the recombination rate per generation between the leftmost and rightmost sites of the sequenced segment. The probability of the fully sequenced a-d-specified sample with two polymorphic sites is denoted qseq(n, x, y;
t,
) and is given by
![]() |
(7) |
where the approximation is obtained by expanding selected exponential terms and dropping terms of order (
t/L)3 and higher and where Ix,y is an indicator function, as before, but in this case it depends on the gene genealogies of site x and of site y. Ix,y is one if the jth branch of the tree of site x and the kth branch of the tree of site y are such that mutations on them lead to the given a-d-specified sample configuration n. This expression for the probability of a sequenced sample is essentially the same as the expression for the two-locus configuration (2), except for the last term, e-
t
seq. The analogue of h(n;
) for sequence data, which we denote hseq(n, x, y;
,
t), is
![]() |
(8) |
where
t is assumed constant (and does not depend on L). This can be estimated by
![]() |
(9) |
where
i is the E-sequence of the ith randomly generated L locus genealogy, and xj(i) and yk(i) are the branch lengths on the trees of site x and site y, respectively. And
seq(i) is
seq for this same L locus genealogy.
In RESULTS AND APPLICATIONS we compare h(n;
) and hseq(n, x, y;
,
t) to see how much the knowledge that there are no other polymorphisms between or near the focal pair of sites affects an inference about
. The estimates of hseq(n, x, y;
,
t) obtained as described here may be useful for checking other algorithms for estimating sequenced sample configuration probabilities.
Diploid samples:
Up to this point, we have considered samples consisting of haplotypes. It would be useful to have sample probabilities analogous to q(n;
,
) for the case of diploid samples. We show here how the probabilities of diploid samples can be expressed in terms of the haploid sample probabilities.
For diploids, with two alleles segregating at each locus, there are 10 distinct diploid genotypes. Often the phase of double heterozygotes is not determined directly, in which case there are only 9 distinguishable diploid genotypes. However, we begin by considering the case where the haplotypes constituting double heterozygotes are determined experimentally. In this case, the data can be represented by a 10-vector, nd = (n0, n1, ... , n9). We use n0 to denote the number of coupling-phase double heterozygotes (A0B0/A1B1) and n1 to denote the number of repulsion-phase double heterozygotes (A0B1/A1B0). The numbers of each of the other diploid genotypes are designated by ni, i = 2, ... , 9. From the vector, nd, we can count the number of each of the four possible chromosomes. That is, the vector nd maps unambiguously to the underlying haploid data configuration, which we denote by n(nd). Under random mating, the probability of nd is q(n(nd);
,
) times the probability that 2n haploids of configuration n(nd) when randomly paired produce nd. In symbols,
![]() |
(10) |
where b(nd, n(nd)) is the probability under random pairing of getting nd from a haploid configuration n(nd). By counting up the possible pairings, one finds that
![]() |
(11) |
where nhet is the number of diploid individuals that are heterozygous at one or two loci and where ni is the ith element of nd and nij, i, j = 0, 1 are the elements of n.
Now consider the case where the phase of double heterozygotes is not determined by the experimenter. In this case, we cannot observe n0 or n1, but we do observe their sum. We denote the sum by ndh. We denote the diploid data set in this case by nd-9 = (ndh, n2, ... , n9). Given ndh, the actual number of coupling phase double heterozygotes, n0, could be any value from zero to ndh. Each of these possible values corresponds to a different nd configuration. We denote these possible nd configurations by nd(i, nd-9), where i = 0, ... , ndh. That is, if nd-9 = (ndh, n2, ... , n9), then nd(i, nd-9) = (i, ndh - i, n2, ... , n9). Then the probability of a diploid configuration nd-9 is obtained by summing up the probability of each of these mutually exclusive possible nd configurations, as:
![]() |
(12) |
Thus, with the haploid sample probabilities in hand, it is a simple matter to calculate diploid sample probabilities, using (10), (11), and (12).
Conditional probabilities:
Most applications of these two-locus sampling distributions will focus only on pairs of sites in which both sites are polymorphic in the sample. This means that rather than q(n;
,
), it will be useful to consider the probability of specific sample configurations conditional on two alleles segregating in the sample at each locus. That is, it is useful to consider the conditional probability
![]() |
(13) |
where the summation is over all configurations, m, with two alleles at each locus. In the limit as
tends to zero this becomes
![]() |
(14) |
which we can estimate without specifying
. This conditional probability for small
is denoted qc(n;
).
It may also be of interest to condition on other events. For example, one may wish to limit consideration to polymorphisms in which the rarer allele has frequency >0.05, or one may wish to condition on precisely the marginal allele frequencies observed. These are easily calculated by changing the summation in the denominator of the right-hand side of (14). Various conditional probabilities are utilized in the following sections.
| RESULTS AND APPLICATIONS |
|---|
Example sampling distributions:
I have used the ![]()
) for all possible two-locus sample configurations (with exactly two alleles at each locus) for samples sizes of 20, 30, 40, 50, and 100 and a range of
values between 0 and 100. These are available at http://home.uchicago.edu/~rhudson1. The program used to estimate these quantities is also available at this site. The program actually generates multilocus gene trees and simultaneously estimates the sample probabilities for a range of recombination rates and all possible sample configurations simultaneously. The results for sample size 40 have been compared for several configurations to the results of solving the recursions of ![]()
values. In addition, the results for large recombination rates converge on the free recombination configuration probabilities that are easy to calculate with Ewens sampling distribution (![]()
Fig 1 shows some conditional sampling distributions for a sample of size 90. This figure shows the asymmetric U-shaped distribution of linkage disequilibrium, which is typical for low recombination rates, the broad distribution of linkage disequilibrium for values of
in the range of 520, and the unimodal and nearly normal distribution of linkage disequilibrium for large
. An application of these distributions is described in the next section.
|
The conditional probabilities qc(n;
) when plotted as functions of
are conditional-likelihood curves. Estimates of three such curves are shown in Fig 2. Application of these likelihood curves for estimating
are described in a subsequent section, Estimating
.
|
Before proceeding to some applications of these sample probabilities we consider briefly the effects of knowing which alleles are ancestral and the effects of having full sequence data vs. assessing the variation at only two specified sites. In Fig 3, we show plots of h(n;
) for a set of four a-d-specified samples and hu(n;
) for the corresponding a-d-unspecified sample. In the plot, we see that different a-d-specified samples, each corresponding to the same a-d-unspecified sample, can have very different likelihood curves. In this case, two of the configurations have monotonically decreasing likelihood curves, and the other two configurations have a maximum for intermediate values of
. However, two of the configurations, which have similar curves, have much higher probabilities than the other two configurations. Thus, although knowledge of which alleles are ancestral may in specific instances have an important impact on inferences about
, on average, knowledge of which alleles are ancestral may not provide very much more information. This suggestion is supported by the results concerning asymptotic variances of estimates of
using many pairs of sites, which is described later.
|
The effects of having full sequence data vs. assessing the variation at only two polymorphic sites are illustrated in Fig 4, in which we show plots of h(n;
) and hseq(n, x, y;
,
t) as functions of
for n = (15, 15, 9, 1) and
t = 0.6. In this case, the curves are very similar in shape. One should be cautious about generalizing from this particular result, but it appears that, for the case where only two sites are polymorphic, likelihoods based on full sequence data may be quite similar to the likelihood based on assays of only the pair of sites that are polymorphic. For other values of
t this may not hold.
|
Assessing observed levels of linkage disequilibrium:
With the conditional probabilities described above, it is possible to assess whether or not the level of linkage disequilibrium observed between a particular pair of sites is compatible with our neutral model and a specified value of
. For example, suppose that we have a sample of 90 gametes in which two sites, 1000 bp apart, are assayed and found to be polymorphic, with sample configuration, n = (53, 7, 17, 13). (This is the sample configuration corresponding to n11 = 13 in Fig 1.) Note that the marginal allele frequencies of the derived alleles are 30(= 17 + 13) at one locus and 20(= 7 + 13) at the other locus. We ask the question of whether this observed configuration is compatible with the hypothesis that 4Nrbp (=
bp) equals, say, 0.001, where rbp is the recombination rate per base pair per generation. With these assumptions, the relevant recombination parameter for our pair of sites is
=
bp 1000 = 1.0. To assess whether the linkage disequilibrium observed is unusually high or low, we examine the distribution of n conditional on the observed marginal allele frequencies with
= 1.0. Conditional on the marginal allele frequencies, there are only 21 possible sample configurations, which can be specified by the value of n11. The conditional probabilities of these configurations for
= 1.0 are shown in Fig 1 and were obtained using Equation 14, with the summation in the denominator of the right-hand side being over the 21 possible sample configurations.
We define a statistical test by summing the conditional probabilities of all configurations with probabilities less than or equal to the probability of the observed configuration. If this sum is <0.05 we reject our hypothesis. In our example, the configurations with n11 = 8, ... , 13 have probabilities less than or equal to the observed configuration. The sum of the probabilities of these configurations is approximately 0.028, so our hypothesis is rejected. That is, we conclude that this sample configuration is quite unusual for
= 1.0 and note that it would be much more likely if
= 10. We refer to this test as the "exact test conditional on the marginals" (ETCM) and emphasize that it requires that one know or specify
.
Suppose now that our pair of polymorphic sites, with n = (53, 7, 17, 13), had been 100,000 bp apart, in which case,
= 100. If we examine the conditional probabilities for this value of
(shown on the right in Fig 1), we find that the sum of the probabilities of configurations with less or equal probabilities is
0.02, and so the hypothesis would again be rejected. (In this case the configurations with lower probabilities are n11 = 0, and n11 = 14, ... , 20.) There is too much linkage disequilibrium in this case. This illustrates that the ETCM can reject the null hypothesis due to either too much or too little linkage disequilibrium.
This test could be carried out for all possible pairs of polymorphic sites in a contiguous region to explore the possibility that some sites exhibit unusually high or low levels of linkage disequilibrium. Such sites may be indicative of hotspots of recombination or mutation or epistatic selection. This is a complementary approach to the usual analysis of linkage disequilibrium in which Fisher's exact test of independence is applied to all pairs. It should be noted that, when more than two linked sites are considered, there is a statistical dependence between the ETCMs on each pair, and any interpretation of the results should bear this in mind.
Estimating
:
Using independent linked pairs:
The conditional probabilities qc(n;
) when plotted as functions of
are likelihood curves. Estimates of three such curves are shown in Fig 2. (The estimates are obtained using (14) and (4) but for a-d-specified samples.) Most sample configurations lead to monotonically increasing or decreasing likelihood curves, but samples with high but not complete linkage disequilibrium lead to likelihood functions with a maximum at a finite positive value of
. Thus the maximum-likelihood estimate of
for a single pair of sites is often zero or infinity. When the estimate is finite and greater than zero, the confidence interval is clearly large (as indicated by the broad likelihood function). This was noted before by ![]()
![]()
, then one might be able to obtain a very accurate estimate of
. In this case the overall likelihood, for small
, is approximately
![]() |
(15) |
in which ni is the two-locus configuration for the ith pair of sites. The maximum-likelihood estimate of
obtained with (15) is denoted
.
To characterize the statistical properties of the maximum-likelihood estimate
, it is useful to consider the expectation of log(qc(n; z)), over the distribution of n conditional on polymorphism at both sites. That is, we consider
![]() |
(16) |
where E
0 indicates expectation given that the true value of
is
0. This function can be viewed as a function of both z and
0 and can be estimated from our tabulated values of h(n;
). An estimate of this function is plotted as a function of log z for
0 = 5.0 and a sample of size 50 in Fig 5. (The conditioning for Fig 5 is that both sites are polymorphic with the rarer allele having frequency
0.1.) The second derivative of this function with respect to z, evaluated at the
0, is inversely proportional to the asymptotic variance of the maximum-likelihood estimate of
. More precisely, if k pairs of sites are utilized, we expect the variance of the maximum-likelihood estimator to be approximately
![]() |
(17) |
for k sufficiently large. Here, Var
0,k denotes the variance of the estimator based on k pairs and with
=
0. With the tabulated estimates of h(n;
), one can estimate the second derivative in (17) and hence the asymptotic variance of
. However, it may be of most interest to investigate the coefficient of variation of the estimate of
, so we consider instead
|
In Fig 5, we show, in addition to our estimate of E
0(log(qc(n; z))) for
0 = 5.0, a quadratic function obtained by a least-squares fit to several points near z = 5.0. Clearly the expected likelihood function is very close to quadratic for a substantial range of z in the neighborhood of 5.0. This suggests that the asymptotic properties may apply for moderate values of k. By estimating the second derivative of the expected likelihood functions, we have estimated asymptotic variances (times k) for a set of values of
0 and plotted the results as a function of
0 in Fig 6. The plot in Fig 6 shows that pairs separated by
in the range of 215 are best for estimating
. As
decreases below 2.0, the asymptotic variance grows rapidly. For larger values of
the asymptotic variance grows more slowly.
|
To give some feeling for the number of pairs needed to get a reasonable estimate of
we consider a numerical example. Suppose that we have data for k = 20 independent pairs of polymorphic sites, where the rare allele has in every case allele frequency of at least 0.1 and where the recombination rate,
0, between the sites of each pair is the same and is in the range from 2 to 10. In this case, we see from Fig 6 that k Var
0,k(
) is
2.5, and hence the asymptotic standard deviation of log(
), when estimated from 20 independent pairs, is
0.35 (=
). If log(
) is approximately normally distributed, then with probability
0.95, log(
) will lie in the interval log(
0) ± 2(0.35), and hence
will be within a factor of 2 of the true value. To check this result, I generated 32,000 two-locus samples on the computer, using the conditional sampling probabilities qc(n;
), conditioning on the appropriate marginal allele frequencies. These random two-locus samples were formed into 1600 groups of 20, and
was calculated for each group. From these outcomes, the estimated variance of log(
) was 0.124, which is very close to the prediction from the asymptotic analysis (2.5/20 = 0.125). The probability of being within a factor of 2 is estimated to be 0.96, also in very good agreement with the asymptotic analysis prediction of 0.95.
In practice, different pairs of polymorphic sites will be different distances apart and will have different recombination rates. In the case where the physical distance between each pair of sites was known and the recombination rate per base pair was the same for each pair of polymorphic sites, then the likelihood for k polymorphic pairs is approximately
![]() |
(19) |
where
bp is 4Nrbp, rbp is the recombination rate per base pair, and di is the distance in base pairs between the ith pair of sites. This likelihood could be used to estimate
bp. The results in the previous paragraph suggest that the lowest variance estimator of
bp will be obtained if sites are separated by a distance such that
bp times the distance is
5. If rbp varied from one pair of sites to the next, but the value of rbp were known for each pair of sites, say from comparisons of physical and genetic maps, a similar likelihood could be used to estimate N, the effective population size.
Using the above method, we can estimate the asymptotic variance of
in a-d-unspecified samples and in diploid samples in which the phase of double heterozygotes is not determined. For the case of a-d-unspecified samples of 50 gametes in which
= 5.0, the asymptotic variance is estimated to be 2.2/k, which is essentially the same as what we found for a-d-specified samples. Thus, there appears to be little if any gain in knowing which alleles are ancestral when the data consist of independent linked pairs of sites. A similar asymptotic analysis could be used to determine the optimum sample size when one can trade off sample size for number of pairs. We do not pursue that analysis here.
Finally we note that, when investigating pairs of polymorphic sites, the incorporation of gene conversion is straightforward. It is necessary only to establish an effective recombination rate as a function of distance that incorporates gene conversion, such as ![]()
![]()
![]()
Using many linked sites:
In the previous sections, we considered one or more linked pairs of sites, where each pair is independent of the others. We now consider the case where more than two linked polymorphic sites are assayed in a single sample. In this situation, full likelihood considering all polymorphisms simultaneously is the proper approach. ![]()
![]()
![]()
![]()
![]()
![]()
![]()
bp, which is denoted
CL, where the subscript "CL" indicates an estimate based on composite likelihood. Once the two-locus sampling-scaled likelihoods (h(n;
)) are tabulated, calculating these composite likelihoods is very fast, and hence the statistical properties of this estimator can be explored.
To assess the quality of this composite-likelihood estimator,
CL was calculated for a large number of samples generated by coalescent methods. The samples were generated by the method of ![]()
corresponding to the recombination rate between the ends of the segment observed and
being the mutation parameter associated with the entire segment. Fig 7 and Fig 8 show estimates of the medians and the 10th and 90th percentiles of the distribution of
CL for a range of
values. The same quantities were estimated for three other estimators that have been described in the literature. These estimators are
(![]()
![]()
![]()
(![]()
![]()
|
|
The figures show that the estimator, Cwak, performs poorly compared to the other estimators shown. The estimator Cwak is an improved version of the estimator of ![]()
The estimator
has a considerably lower 90th percentile than the other estimators. This is desirable as long as the 90th percentile is larger than the true value. Unfortunately, for large
and
=
/4, the 90th percentile of
falls below the true value. In addition, it has a median considerably below the true value and a 10th percentile well below the 10th percentile of
CL and CHRM. Thus, for these parameter values
has a strong tendency to underestimate
. For other parameter values ![]()
has little bias or a bias in the opposite direction.
The medians of the estimators
CL and CHRM are close to the true
, for
>
4.0 for the case of
=
and for
> 10 when
=
/4. The 10th percentiles of
CL and CHRM are considerably closer to the true value than the 10th percentiles of the other estimators. Their 90th percentiles are much closer to the true value than the 90th percentile of Cwak but as mentioned above are not as small as that of
. In terms of percentiles, the estimators
CL and CHRM appear to be quite similar. Overall, it appears that the estimators
p and CHRM are substantially superior to the other two estimators (at least for the parameter values investigated). The estimator
CL has considerable flexibility that may make it of broader use. Since it does not rely on surveying or sequencing of all sites, it can be applied to data collected on previously identified single nucleotide polymorphisms (SNPs) or in surveys of regions with intervening gaps. Also, as indicated earlier, incorporating gene conversion is straightforward and does not require reestimating any of the h(n;
) values.
Finally, since
CL is so quick to calculate (once the scaled likelihoods are in hand), one can afford to carry out simulations to characterize the properties of an estimate. For example, if the sampling procedure employed to collect the data is well specified and simple, then computer-generated samples can be used to study the distribution of the logarithm of the ratio of the composite likelihood of the data at
CL to the likelihood at the true value of
for a range of
values and in this way obtain confidence intervals.
An application to human polymorphism data:
We close by estimating
from a survey of human variation on the X chromosome (![]()
bp was estimated by maximizing the composite likelihood for (i) all 39 SNPs, (ii) the 14 SNPs in Xq25, and (iii) the 10 SNPs in or near Xq28. For loci on the X chromosome, using the h(n;
) functions described for autosomal loci will result in an estimate of 2Nr, where r is the per-generation recombination rate in females and N is the total effective population size. In the following, the estimates returned by the computer programs were multiplied by 2 so that the values reported here for
are, in fact, estimates of 4Nr, where r is the female recombination rate. The estimates were 9 x 10-5, 8.8 x 10-5, and 9 x 10-5 for all 39 sites, for the 14 sites of Xq25, and the Xq28 sites, respectively. These results suggest that there is no overall difference in recombination rates in the Xq25 region compared to the Xq28 region. The effective population size of humans has been estimated from levels of DNA polymorphism to be
104 and the recombination rate per base pair, though quite variable, is for this region on average
10-8. Thus we might expect that
bp
4 x 10-4 or about five times larger than we estimate from the X chromosome data.
The linkage disequilibrium at each of the 741 (= 39 x 38/2) possible pairs of sites was evaluated by the ETCM, described in Assessing observed levels of linkage disequilibrium, assuming
bp = 9 x 10-5. A total of only 7 pairs, or
0.9% of the pairs, showed unusual two-locus configurations (with P < 0.025) using the ETCM. This is somewhat fewer than one would expect by chance when carrying out this many tests. Thus, overall, our analysis does not support the presence of a low recombination rate region in Xq25, as suggested by Taillon-Miller et al. However, it should be noted that 6 of the 7 significant pairs involve sites from the Xq25 region, and the seventh is immediately adjacent to the Xq25 region. Of the 6 significant pairs in the Xq25 region, 5 show unusually large linkage disequilibrium, but the 6th shows unusually low linkage disequilibrium. The latter pair of sites is separated by 30 kb and D' for the pair is 0.22. The five pairs showing significantly large linkage disequilibrium from the Xq25 region were among the pairs identified by Taillon-Miller et al. as showing significant linkage disequilibrium by a Fisher's exact test. In Fig 9, the values of r2 are plotted for all pairs of sites within the Xq25 region (14 sites and hence 91 pairs) and for all pairs within the Xq28 region (10 sites or 45 pairs). Taillon-Miller et al. also displayed this plot. In the figure those sites that showed significantly large or small linkage disequilibrium given
bp = 9 x 10-5, using the ETCM test, are shown with x's. Also shown in Fig 9 is the expected value of r2 conditional on the frequencies of the minor alleles being >0.32. These were calculated with the tabulated h(n;
) values for a sample of size 92. The fit to the data appears to be fairly good, but there does appear to be some tendency for Xq28 pairs to fall below the line for larger distances and to be above the line for shorter distances. In contrast, the Xq25 sites appear to scatter on both sides of the curve.
|
| CONCLUSIONS |
|---|
Two-locus sample probabilities offer a useful tool for analyzing linkage disequilibrium levels from population surveys. Carrying out tests of significance for pairs of sites and estimating
from many sites is computationally quick, once the two-locus sample probabilities are in hand. A composite-likelihood approach for estimating
with more than two linked
















) n11 = 5. (
) n11 = 1. (
) n11 = 0. The conditioning in this case is that there are two alleles at each locus in the sample. The three sample configurations are n = (20, 10, 20, 0), n = (21, 9, 19, 1), and n = (25, 5, 15, 5), which are labeled n11 = 0, n11 = 1, and n11 = 5, respectively. The marginal allele frequencies are the same for each configuration. The values of qc(n; 
) h ((16, 20, 14, 0);
) h((6, 30, 14, 0);
) h ((0, 30, 14, 6);
) h((0, 20, 14, 16); 








