Genetics, Vol. 167, 2003-2013, August 2004, Copyright © 2004
doi:10.1534/genetics.103.023044

Simultaneous Detection of Linkage Disequilibrium and Genetic Differentiation of Subdivided Populations

* Faculty of Marine Science, Tokyo University of Marine Science and Technology, Minato, Tokyo 108-8477, Japan
{dagger} Graduate School of Agricultural and Life Sciences, University of Tokyo, Bunkyo, Tokyo 113-8657, Japan

1 Corresponding author: Tokyo University of Marine Science and Technology, 4-5-7 Konan, Minato, Tokyo 108-8477, Japan.
E-mail: kitada{at}s.kaiyodai.ac.jp

Manuscript received March 20, 2003. Accepted for publication December 26, 2003.

ABSTRACT

We propose a new method for simultaneously detecting linkage disequilibrium and genetic structure in subdivided populations. Taking subpopulation structure into account with a hierarchical model, we estimate the magnitude of genetic differentiation and linkage disequilibrium in a metapopulation on the basis of geographical samples, rather than decompose a population into a finite number of random-mating subpopulations. We assume that Hardy-Weinberg equilibrium is satisfied in each locality, but do not assume independence between marker loci. Linkage states remain unknown. Genetic differentiation and linkage disequilibrium are expressed as hyperparameters describing the prior distribution of genotypes or haplotypes. We estimate related parameters by maximizing marginal-likelihood functions and detect linkage equilibrium or disequilibrium by the Akaike information criterion. Our empirical Bayesian model analyzes genotype and haplotype frequencies regardless of haploid or diploid data, so it can be applied to most commonly used genetic markers. The performance of our procedure is examined via numerical simulations in comparison with classical procedures. Finally, we analyze isozyme data of ayu, a severely exploited fish species, and single-nucleotide polymorphisms in human ALDH2.


WITH the many discoveries of fine-scale markers that represent highly polymorphic loci, linkage disequilibrium between these markers and disease genes or other trait genes has regained importance (JORDE 1995). Along with the rapid progress of genetic techniques, assessing linkage disequilibrium has become a current concern. The assessment involves two aspects: detecting the presence of disequilibrium and estimating its magnitude once disequilibrium has been confirmed (WEIR 1979). Continuous efforts have been made on such assessments of linkage disequilibrium between alleles at two or more loci (HILL 1974a,b; BROWN 1975; WEIR and COCKERHAM 1978; SLATKIN and EXCOFFIER 1996; LUO 1998; LUO and SUHAI 1999; LUO et al. 2000; AYRES and BALDING 2001; LUO and WU 2001). Methods for linkage disequilibrium-based mapping of target genes have also been developed (HILL and WEIR 1994; KAPLAN et al. 1995; XIONG and GUO 1997; MEUWISSEN and GODDARD 2000; WU and ZENG 2001). These methods assume homogeneous natural populations. However, if a population is structured, this leads to biased results (known as spurious association) that can reject a null association between a phenotype and molecular markers (e.g., LANDER and SCHORK 1994).

To overcome this problem, SPIELMAN et al. (1993) developed a method called the transmission/disequilibrium test (TDT), which uses nuclear family data to detect real associations in structured populations. To use this type of reliable information, similar family-based methods for testing linkage disequilibrium have been developed (e.g., EXCOFFIER and SLATKIN 1998; LAZZERONI and LANGE 1998; SPIELMAN and EWENS 1998). However, family-based methods cannot be used for association studies of undomesticated species, such as wildlife and forest trees, for which no nuclear family records are available (WU and ZENG 2001). Hence, population-based methods that detect linkage disequilibrium even in the presence of population structure are required. PRITCHARD et al. (2000a)(b) proposed a population-based method that can detect associations between marker alleles and phenotypes in structured populations. The essential idea of the method is to decompose a sample drawn from a mixed population into several unstructured subpopulations and test the association in the homogeneous subpopulations. The methods have been applied to association analyses in humans (PARRA et al. 1998; ROSENBERG et al. 2002) and crop plants, with modified test statistics being used to deal with quantitative traits (THORNSBERRY et al. 2001).

To study the population structure, a sample is often divided by geographical regions. However, in many cases, there are no obvious regional units by which subpopulations can be defined. Rather, natural populations have a complex hierarchical structure, forming a metapopulation (PANNELL and CHARLESWORTH 2000). In models, a metapopulation has a continuous structure and consists of an infinite number of subpopulations, or demes. Such populations are well described by hierarchical models that specify the distribution of genetic structure among demes. Population subdivision (NEI and LI 1973; PETERSON et al. 1999) and other evolutionary forces such as genetic drift (HILL and ROBERTSON 1968), natural selection (LEWONTIN 1964), and mutation (OHTA 1982a,b) affect linkage disequilibrium. Therefore, when assessing linkage disequilibrium in natural populations, it is very important to also estimate the magnitude of genetic differentiation.

In this article, we propose a new strategy to detect linkage disequilibrium between marker loci and simultaneously estimate genetic differentiation in metapopulations, extending the empirical Bayesian method of KITADA et al. (2000). Instead of decomposing a population into a finite number of subpopulations, we describe a distribution for allele frequencies within subpopulations and test linkage disequilibrium using an information criterion based on hierarchical models that detect either linkage equilibrium or disequilibrium. Using samples from randomly sampled localities from a metapopulation, we estimate hyperparameters associated with linkage disequilibrium and genetic differentiation, on the basis of marginal-likelihood functions derived from prior distributions of allele or haplotype frequencies and likelihood functions. We assume that individuals mate randomly and that Hardy-Weinberg (H-W) equilibrium holds in each locality or deme. The method can be applied to frequency data of common genetic markers, including isozymes, mtDNA, microsatellites, and single-nucleotide polymorphisms (SNPs). We do not assume independence between marker loci, and linkage states remain unknown. We examine the performance of our procedure via numerical simulations in comparison with classical procedures. Finally, we analyze isozyme data of ayu, a severely exploited fish species, and single-nucleotide polymorphisms in human ALDH2.


MODELS AND METHODS

Distribution of allele frequencies and genetic differentiation:

We consider a metapopulation that consists of localities or demes. Hardy-Weinberg equilibrium holds in each deme. The distribution of the allele frequencies, p = (p1, ... , pI)', at each deme is described as a Dirichlet distribution (JOHNSON and KOTZ 1969),

(1)
where I is the number of alleles and and {alpha} = ({alpha}1, ... , {alpha}I)' are regarded as hyperparameters for respective alleles specifying prior distribution. We use this distribution as a prior for allele frequencies of a subdivided population. According to WEIR (1996), for populations that have reached equilibrium under the joint effects of drift and mutation or migration, WRIGHT (1945)(1951) found that allele frequencies for loci with two alleles have a beta distribution; for multiallele loci, the distribution is Dirichlet. We assumed that hyperparameters are common over subpopulations and the sum of the hyperparameters is also common for all loci. So, if random sampling of demes is performed from a metapopulation, estimated hyperparameters describe the magnitude of genetic differentiation among subpopulations that have reached equilibrium. The mean linkage disequilibrium coefficient over subpopulations is also written as a function of the hyperparameters, as shown later. Therefore, our assumption on hyperparameters has genetic meaning.

Now, K demes are randomly sampled from the metapopulation, and nk individuals are randomly sampled from each deme (k = 1, ... , K). Given the allele frequencies at each deme, the sample counts of alleles, nk = (nk1, ... , nkI)', follow a multinomial distribution. Taking account of uncertainty of the allele frequencies, the likelihood of the sample counts is expressed as a marginal-likelihood function, which is a Dirichlet-multinomial distribution (e.g., LANGE 1995; RANNALA and HARTIGAN 1996; WEIR 1996; KITADA et al. 2000; BALDING 2003):

(2)

The total likelihood is the product of the likelihoods of K demes. Here, .

With a sample size of n, the variance-covariance matrix of a Dirichlet-multinomial distribution is (n + {theta})/(1 + {theta}) times larger than that of the multinomial distribution (JOHNSON and KOTZ 1969). This phenomenon, in which the variance exceeds the nominal variance, is called overdispersion. In our hierarchical models, overdispersion corresponds to the variation of allele frequencies over subpopulations. With samples of size n1, . . , nk, our overdispersion, {sigma}2, becomes

(3)

Here, is the mean sample count over K samples given by . For a panmictic population, {sigma}2 is 1; it takes values >1 according to the magnitude of genetic differentiation among subpopulations. On the other hand, {theta} converges to infinity for a panmictic population.

From Weir's equation (WEIR 1996, p. 48, Equation 2.15), the variance of the allele frequency between subpopulations is expressed as

The term FST(2n – 1) + 1 corresponds to the dispersion parameter {sigma}2. From this, KITADA et al. (2000) obtained the relation between {sigma}2 and FST (WRIGHT 1951):

(4)

If the organism is haploid, 2 should be . By substituting Equation 3 into Equation 4 and assuming , we obtain the relation

(5)
which is also given in BALDING (2003). From Equation 5, we have

(6)

This coincides with Equation 3 of RANNALA and HARTIGAN (1996), which was proposed by WRIGHT (1969) and gives the rate of gene flow. Thus, the sum of hyperparameters {theta} is consistent with the rate of gene flow and has a relation with FST.

Linkage equilibrium with population structure:

When genetic data are available from multiple loci, the likelihood is obtained as a product of the likelihoods of these loci when they are under linkage equilibrium. We consider diploid organisms. Let the frequency allele i in a subpopulation at locus j be pi (j = 1, ... , J; i = 1, ... , Ij), where J is the number of loci. The haplotype frequency of the subpopulation under linkage equilibrium can be written by the product of allele frequencies over the loci as

where the combination of J alleles i(1) ... i(J) should be ij ... ij, but we use ellipses for simplicity. Let the sample count for haplotypes be ni...i and (individuals). The likelihood for sample haplotypes under H-W equilibrium is a multinomial distribution,

where 2ni is the number of genes of allele i at locus j. The constant term is the combination of the observed haplotypes. However, we do not observe haplotypes but genotypes. When composite genotypes Gi...i are observed, the likelihood is written as

(7)
where and nhetero is the sum of the heterozygous loci over n individuals. Under H-W equilibrium, probabilities for heterozygous genotypes are obtained by duplicating the product of allele frequencies. The term 2nhetero refers to such duplication.

We assume a priori that allele frequencies of subpopulations have a Dirichlet (for cases Ij > 2) or beta (for cases Ij = 2) distribution. The marginal-likelihood function for a subpopulation is obtained from Equations 1, 2, and 7 as

(8)
where 2ni, the number of genes for allele i at locus j, is calculated from composite genotypes. The total likelihood is the product of K likelihood functions for respecsamples. This equation can also be used for haploid organisms by using ni and C instead of 2ni and C'2nhetero.

Linkage disequilibrium with population structure:

The likelihood for sample haplotypes of a subpopulation is also a multinomial distribution:

(9)

If the haplotypes are observed, the marginal likelihood is given by

(10)
where I is the number of haplotypes and, for simplicity, we use a suffix i for haplotypes instead of i(1) ... i(J). However, when analyzing diploid data, we do not observe diplotypes (and then no haplotypes), but only composite genotypes.

When investigating linkage disequilibrium, the linkage phase between two alleles may be of most interest. Hence, here we focus on models for two loci with two alleles and derive the general marginal-likelihood function in the APPENDIX.

Let allele frequencies for two loci be pA0, pA1 and pB0, pB1, and let the haplotype frequencies be h00, h01, h10, h11 in a subpopulation. In this case, nine composite genotypes can be observed with specific diplotypes. Let the number of individuals for the genotypes be n1, ... , n9 (Table 1). For example, the diplotype for the composite genotype A0A0B1B1 can be specified by A0B1/A0B1. Under the H-W equilibrium assumption, the probability of having the genotype is h201 and that for A0B0/A0B1 is 2h00h01. Thus, the probabilities of having genotypes can be written by diplotype probabilities, which are combinations of haplotype probabilities. However, for double heterozygotes, two diplotypes are possible for the genotypes. In this case, A0A1B0B1 is the double heterozygote with observed number n5. For this genotype, the possible diplotypes are A0B0/A1B1 and A0B1/A1B0. The probability for the genotype is then given as 2h00h11 + 2h01h10 (Table 1).


View this table:
In this window
In a new window

 
TABLE 1

Notation for number of observed composite genotypes

 
The likelihood of composite genotypes for the case of two loci with two alleles is given in HUDSON (2001)(p. 316, Equation 11.6). Expanding the term for the double heterozygote, we obtain the likelihood function as

(11)
where is a constant term for the combination of the multinomial likelihood. The marginal-likelihood function for a subpopulation is then obtained as

(12)
where {alpha}11 = {theta}{alpha}00{alpha}01 {alpha}10.

Parameter estimation and model selection:

We estimate parameters by maximizing the negative log marginal-likelihood functions for linkage equilibrium and disequilibrium. The constant terms C and C' can be excluded from the estimation procedure. We reparameterize hyperparameters with FST by using the relation given by Equation 6 and estimate FST and hyperparameters numerically as free parameters by using simplex minimization. The rate of gene flow {theta} and the dispersion parameter {sigma}2 are then estimated by Equations 4 and 6. Our empirical Bayesian procedure thus offers maximum-likelihood estimators of genetic differentiation. We estimate the 95% confidence interval for FST from the log-likelihood profile, and then estimate 95% confidence intervals for {theta} and {sigma}2 by substituting the lower and upper confidence limits of FST into Equations 4 and 6, respectively.

We also estimate the linkage correlation coefficient (HILL and ROBERTSON 1968) as

(13)
where E and E are mean allele frequencies over subpopulations, which are estimated from sample allele frequencies. Here, DTOTAL is the linkage disequilibrium coefficient over subpopulations, EEE, and is estimated as

(14)

However, this estimator is biased. Let E[D] be the mean of D at each deme, which can be written as

from which we have

(15)

As and , the covariance is expressed as

The variance and covariance of a Dirichlet distribution are Var[pi] = {alpha}i({theta}i{alpha}i)/({theta}2({theta} + 1)) and Cov[pi, pj] = {alpha}i{alpha}j/({theta}2({theta} + 1)) (JOHNSON and KOTZ 1969). Hence, we have the covariance as a function of hyperparameters:

(16)

From Equations 14, 15, and 16, we have the unbiased estimator of the mean linkage disequilibrium coefficient correcting spurious association as

(17)
while correction of r is not trivial because of the denominator of Equation 13.

We use the Akaike Information Criterion (AIC; AKAIKE 1973) as a criterion for model selections,

(18)
where is the maximum marginal log-likelihood of the model and u is the number of free parameters estimated. We can compare various models by this criterion; the model with the lowest AIC value is selected as the most parsimonious model. Equation 18 indicates that, when there are several models with similar values of the maximum likelihood, we should select the model with the smallest number of parameters, again following the principle of parsimony. By using AIC, we can detect linkage equilibrium or disequilibrium.


TESTING PERFORMANCE
To evaluate the performance of our method, we conducted two sets of simulations. The first simulation examined if our procedure estimates the linkage disequilibrium and genetic differentiation reliably. The second simulation compared our estimate of linkage disequilibrium with two other commonly used estimates: an estimate from a pooled sample and an average of the estimated linkage disequilibrium over subsamples.

Simulated data 1:

We assumed that the mean population allele frequencies of the two loci over subpopulations were and hence haplotype frequencies were E[h00] = E[h11] = 0.25(1 + r) and E[h01] = E[h10] = 0.25(1 – r). We set the sample size to 50 individuals for each sampling point and generated haplotype frequencies for 20 geographical samples under various FST (0, 0.05, 0.10, 0.15, 0.20) and r (0, 0.2, 0.4, 0.6, 0.8). We then calculated sample composite genotype counts on the basis of haplotype frequencies assuming H-W equilibrium.

For the state of linkage equilibrium (r = 0), given the FST values, sample allele frequencies of two loci were generated independently from the beta distribution ß({theta}/2, {theta}/2), where {theta} = 1/FST – 1. Haplotype frequencies were then calculated as EE for E[h00]. Other haplotypes were calculated in a similar way. For the case of FST = 0, sample allele frequencies of the two loci were generated independently from the binomial distribution Bi(50, 0.5). For linkage disequilibrium, given the FST values, haplotype frequencies were generated from the Dirichlet distribution D({alpha}00, {alpha}01, {alpha}10, {alpha}11), where the hyperparameters were given as {alpha}00 = {alpha}11 = 0.25{theta}(1 + r) and {alpha}01 = {alpha}10 = 0.25{theta}(1 – r). For cases of FST = 0, sample haplotype frequencies were generated from the multinomial distribution Mul(50, E[h00], E[h01], E[h10], E[h11]).

We estimated FST, r, and hyperparameters on the basis of Equations 8, 12, and 13 for 1000 replicates. For all cases with r > 0, the model of linkage disequilibrium had smaller values of AIC and was selected. On the other hand, for all cases of r = 0, the model of linkage equilibrium was selected. Estimates obtained from the best-fit model agreed well with the real values of FST and r, showing that our method works appropriately over a wide range of genetic differentiation and linkage disequilibrium (Table 2).


View this table:
In this window
In a new window

 
TABLE 2

Mean linkage correlation coefficients r and FST of the best-fit models estimated from 1000 simulations under various levels of linkage disequilibrium and population structure

 

Simulated data 2:

In an ordinary way, researchers would use the sample mean of the linkage disequilibrium coefficient in each subpopulation (say, DMEAN) as an estimator of the mean linkage disequilibrium coefficient. Although a larger number of sampling points should improve precision of estimates of genetic differentiation, limited sample size in each locality may result in a poor estimate of DMEAN. Another procedure is to estimate the mean linkage disequilibrium coefficient from pooled genotype data over subpopulations (say, DPOOL). This procedure neglects the population structure.

The mean population allele frequencies of the two loci over subpopulations were assumed as and hence haplotype frequencies were E[h00] = 0.64 + 0.16r, E[h01] = E[h10] = 0.16(1 – r), and E[h11] = 0.04 + 0.16r. Given FST values, haplotype frequencies were generated from the Dirichlet distribution, where the hyperparameters were given as {alpha}00 = {theta}(0.64 + 0.16r), {alpha}01 = {alpha}10 = {theta}0.16(1 – r), and {alpha}11 = {theta}(0.04 + 0.16r). We set total sample size to 1000 individuals and generated haplotype frequencies for different numbers of localities (K = 10, 20, 40, 100) under various r (0, 0.2, 0.4, 0.6, 0.8) with fixed FST = 0.2. We then calculated sample composite genotype counts on the basis of haplotype frequencies assuming H-W equilibrium.

We estimated FST, E[D], and hyperparameters on the basis of Equations 8, 12, and 17 for 1000 replicates. DMEAN and DPOOL were estimated by the method of HILL (1974b) on the basis of sample composite genotype counts at each locality and the pooled counts. A larger number of sampling points improved the precision of estimates of FST (Table 3). Our hierarchical model estimated the real values of E[D] correctly over the whole range of linkage disequilibrium and numbers of sampling points. Precision of the estimates of E[D] was also improved for a larger number of sampling points (Table 4). Estimates of DMEAN were biased for weaker linkage disequilibrium. The biases were decreased for larger r; however, the biases were larger than those from the hierarchical model. Estimates of DPOOL were largely biased with low precision over the whole range of linkage disequilibrium and numbers of sampling points. The results showed that our method works more efficiently than classical procedures.


View this table:
In this window
In a new window

 
TABLE 3

Mean FST estimated from 1000 simulations under various levels of linkage disequilibrium for different numbers of sampling points

 

View this table:
In this window
In a new window

 
TABLE 4

Performance of the hierarchical model in comparison with classical procedures

 


APPLICATION TO REAL DATA

Genotype data of the ayu:

We analyzed isozyme genotype data for seven samples of the ayu (Plecoglossus altivelis) from Japan (K. YOSHIZAWA, unpublished data). Two samples of wild stocks were taken from Biwako Lake (land-locked type) and Nagara River (amphidoromous type), and five samples were taken from captive brood stocks in hatcheries. The lifespan of ayu is 1 year. These brood stocks have been bred in captivity for between 7 and 26 generations in each hatchery to produce juveniles for release to enhance severely exploited stocks. The numbers of generations of captive breeding are shown in parentheses with the names of the sampling locations in the Table 5 legend. Two loci were analyzed. Four alleles were found at the Gpi locus and three alleles at the Mpi locus. However, two alleles of Gpi and one of Mpi were very minor, and so we grouped them with major ones to obtain nine composite genotypes (Table 5).


View this table:
In this window
In a new window

 
TABLE 5

Allozyme composite genotype frequencies for seven samples of the ayu from Japan

 
We estimated parameters on the basis of composite genotype data by using Equations 8, 12, and 13 (Table 6). A smaller AIC value was obtained for the model of linkage equilibrium; thus, linkage equilibrium between Gpi and Mpi loci was detected. Estimates of FST and {theta} showed very large genetic differentiation and small gene flow. Generally, genetic differentiation of fish species is small because of large gene flows caused by migration or lack of barriers (e.g., MCPHERSON et al. 2001; GOLD and TURNER 2002). However, the result obtained here is natural because the sample comprised three different groups: land-locked, amphidoromous, and captive brood stocks in successive breeding for many generations.


View this table:
In this window
In a new window

 
TABLE 6

Estimated linkage correlation coefficient between Gpi and Mpi loci and genetic differentiation among seven samples of ayu

 

Human SNPs:

We analyzed SNPs reported by PETERSON et al. (1999) in the human aldehyde dehydrogenase 2 gene ALDH2.

We focused on sites 1 and 2 from the six sites reported by Peterson et al. The haplotype frequencies were estimated from the haplotype configurations given in their Table 2 as h11 = H1 + H4 + H6 + H8 + H9, h12 = H3, h21 = H2, and h22 = 0. We estimated parameters using Equation 8 from the allele frequencies at the two sites within ALDH2 genotyped in 756 people from 17 populations across five continents (Table 7). We also estimated parameters for estimated haplotypes on the basis of Equation 10 (Tables 7 and 8). The AIC value for the model of linkage disequilibrium was smaller than that for linkage equilibrium. Our estimate of r was –0.60, whereas the original authors' estimates varied from –0.96 to 0.01 for 17 subpopulations, with an average of –0.45 (PETERSON et al. 1999, Table 4). Strong linkage disequilibrium was found, which is consistent with the analyses by Peterson et al. Estimates of FST and {theta} also showed that genetic differentiation of human ALDH2 was large with a small gene flow.


View this table:
In this window
In a new window

 
TABLE 7

Haplotype frequencies at two sites (sites 1 and 2) in the human ALDH2 gene among 17 human populations, calculated from Table 2 of PETERSON et al. (1999)

 

View this table:
In this window
In a new window

 
TABLE 8

Estimated worldwide linkage correlation coefficient between sites 1 and 2 in the human ALDH2 gene and genetic differentiation among 17 human populations

 


DISCUSSION
We have proposed a new method, based on genotype frequencies of geographical samples, to detect linkage disequilibrium between markers and to simultaneously estimate genetic differentiation by taking population subdivision into account. The method detected linkage disequilibrium and estimated linkage disequilibrium coefficient and FST correctly.

With a hierarchical model, we estimate the genetic structure in a metapopulation, rather than decompose a population into a finite number of randomly mating subpopulations. The sum of hyperparameters {theta} coincided with the rate of gene flow, and FST was written by {theta}. We also showed by Equation 17 that the magnitude of linkage disequilibrium depends on that of genetic differentiation. Our model assumes metapopulations, which includes Wright's island model as a special case (WRIGHT 1940); hence the relationship between {theta} and FST can be applied for these population models. The metapopulation concept fits with ecology of wildlife that have limited movement ability and may mate randomly in their territory. Even for marine fish, which can disperse widely because of the lack of barriers in oceans, the stock concept is popular in fishery resource management (WAPLES 1998). In breeding seasons, fish species generally gather to spawning grounds where they mate randomly. As a result of such breeding patterns, the assumption of Hardy-Weinberg equilibrium may be valid in each locality. Furthermore, in humans, whose population structures may be more complex than those of wildlife, it has been reported that geographic clusters often correspond closely to predefined regional or population groups or collections of geographically similar populations (ROSENBERG et al. 2002). In actual populations, various levels of subdivision may exist (EXCOFFIER 2001), and the number of demes should be very large with continuous subdivision. In ecological studies, sampling points may increase year by year, which increases the accuracy of our method.

Estimates of hyperparameters were based only on frequencies counted in the samples, and alleles that did not appear were assigned frequency 0. Here, we consider if this treatment was appropriate. Consider a simple case of three haplotypes with sample counts n1, n2, and n3, where n3 = 0. The marginal-likelihood function for a subpopulation is a Dirichlet-multinomial distribution:

(19)

Using the relation of {Gamma}(n + 1) = n{Gamma}(n), the first term of this equation can be written as

(20)

Hence, to maximize Equation 19, the term in the parentheses in Equation 20 should be minimized. As {alpha}3 ≥ 0, the maximum-likelihood estimate of {alpha}3 is then 0. This can be generalized for all cases.

For diploid data, we have two procedures for inferring linkage disequilibrium. The first procedure (method 1) estimates hyperparameters using Equation 12 or the last equation in the APPENDIX, on the basis of observed composite genotypes. Another way (method 2) is to use Equation 10, on the basis of estimated haplotypes. Several methods for estimating haplotype frequencies have been developed (EXCOFFIER and SLATKIN 1995; LONG et al. 1995; SLATKIN and EXCOFFIER 1996). In the analysis of human SNPs, we used estimated haplotypes as if they were observed data. However, the effect of using estimated haplotypes on parameter estimates is unknown. The accuracy of the inference might depend on the precision of haplotype estimates. Method 1 should be better for accurate inference, but deriving explicit marginal-likelihood functions becomes much harder for larger numbers of loci, although the general form of the marginal-likelihood function is derived in the APPENDIX. Method 2 is much easier and more practical. Precise estimation of haplotype frequencies is very important for studies of linkage disequilibrium.


APPENDIX:

GENERAL LINKAGE DISEQUILIBRIUM MARGINAL-LIKELIHOOD FUNCTION FOR DIPLOID ORGANISMS

Let observed composite genotypes be Gi(i = 1, ... , m) and the numbers of individuals for the genotypes be nGi. The likelihood function for a sample is written as

where PGi is the probability for the genotype Gi, which can be written by the probability of diplotypes. For the case of two loci with two alleles, the diplotype probabilities are given in Table 1. Even for multilocus data, a diplotype can be specified for a set of completely homozygous genotypes, but the larger number of heterozygous genotypes results in a larger number of diplotype combinations. Consequently, the number of summations of diplotype probabilities increases for such genotypes. Thus, the probability of a genotype is expressed by the sum of diplotype probabilities.

Let a set of diplotypes and haplotypes consistent with a genotype G be

where lG is the number of diplotypes consistent with the genotype G. The likelihood function for composite genotypes is then written by using diplotype probabilities as

By multinomial expansion, we have another form of the likelihood as

where ij is the number of individuals having diplotype j, which constitutes genotype i, and .

Under H-W equilibrium, the probability of the jth diplotype is expressed as a product of the probabilities of haplotypes,

where {delta} is a delta function that is 1 for homozygous and 0 for heterozygous genotypes. Therefore, the likelihood of a sample from a subpopulation under H-W equilibrium is

In the above equation some haplotypes will be common, so we rename these haplotypes as

where {xi} is the number of common haplotypes. The likelihood function is then written as

where js (s = 1, ... , {xi}) is the number of individuals having haplotype s and ,

where {delta}h'sh11 is again a delta function that is 1 if h's coincides with h11 and 0 otherwise. Thus, the likelihood for observed composite genotypes can be written as the sum of the multinomial distribution for haplotype frequencies. From above L1, the marginal-likelihood function is obtained as

The total marginal-likelihood function is obtained multiplying K likelihood functions.


ACKNOWLEDGEMENTS
We thank Kazutomo Yoshizawa for allowing us to use his unpublished data and Laurent Excoffier and an anonymous reviewer for their constructive comments on earlier versions of the manuscript. This work was supported by the Japan Society for the Promotion of Science and Japan Science and Technology Agency.


LITERATURE CITED

AKAIKE, H., 1973 Information theory and an extension of the maximum likelihood principle, pp. 267–281 in The Second International Symposium on Information Theory, edited by B. N. PETROV and F. CSáKI. Akadémiai Kiadó, Budapest.

AYRES, K. L., and D. J. BALDING, 2001 Measuring gametic disequilibrium from multilocus data. Genetics 157: 413–423.[Abstract/Free Full Text]

BALDING, D. J., 2003 Likelihood-based inference for genetic correlation coefficient. Theor. Popul. Biol. 63: 221–230.[CrossRef][Medline]

BROWN, A. D. H., 1975 Sample sizes required to detect linkage disequilibrium between two or three loci. Theor. Popul. Biol. 8: 184–201.[CrossRef][Medline]

EXCOFFIER, L., 2001 Analysis of population subdivision, pp. 271–307 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. Wiley, Chichester, UK.

EXCOFFIER, L., and M. SLATKIN, 1995 Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921–927.[Abstract]

EXCOFFIER, L., and M. SLATKIN, 1998 Incorporating genotypes of relatives into a test of linkage disequilibrium. Am. J. Hum. Genet. 62: 171–180.[CrossRef][Medline]

GOLD, J. R., and T. F. TURNER, 2002 Population structure of red drum (Sciaenops ocellatus) in the northern Gulf of Mexico, as inferred from variation in nuclear-encoded microsatellites. Marine Biol. 140: 249–265.[CrossRef]

HILL, W. G., 1974a Tests for association of gene frequencies at several loci in random diploid populations. Biometrics 31: 881–888.

HILL, W. G., 1974b Estimation of linkage disequilibrium in randomly mating populations. Heredity 33: 229–239.[Medline]

HILL, W. G., and A. ROBERTSON, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.[CrossRef]

HILL, W. G., and B. S. WEIR, 1994 Maximum likelihood estimation of gene location by linkage disequilibrium. Am. J. Hum. Genet. 54: 705–714.[Medline]

HUDSON, R. R., 2001 Linkage disequilibrium and recombination, pp. 309–324 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. Wiley, Chichester, UK.

JOHNSON, N. L., and S. KOTZ, 1969 Discrete Distributions. Wiley, New York.

JORDE, L. B., 1995 Linkage disequilibrium as a gene mapping tool. Am. J. Hum. Genet. 56: 11–14.[Medline]

KAPLAN, N., W. G. HILL and B. S. WEIR, 1995 Likelihood methods for locating disease genes in nonequilibrium populations. Am. J. Hum. Genet. 56: 18–32.[Medline]

KITADA, S., T. HAYASHI and H. KISHINO, 2000 Empirical Bayes procedure for estimating genetic distance between populations and effective population size. Genetics 156: 2063–2079.[Abstract/Free Full Text]

LANDER, E. S., and N. J. SCHORK, 1994 Genetic dissection of complex traits. Science 265: 2037–2048.[Abstract/Free Full Text]

LANGE, K., 1995 Application of the Dirichlet distribution to forensic match probabilities. Genetica 96: 107–117.[CrossRef][Medline]

LAZZERONI, L. C., and K. LANGE, 1998 A conditional inference framework for extending the transmission/disequilibrium test. Hum. Hered. 48: 67–81.[CrossRef][Medline]

LEWONTIN, R. C., 1964 The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 4–67.

LONG, J. C., R. C. WILLIAMS and M. URBANEK, 1995 An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56: 799–810.[Medline]

LUO, Z. W., 1998 Detecting linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations. Heredity 80: 198–208.

LUO, Z. W., and S. SUHAI, 1999 Estimating linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations. Genetics 151: 359–371.[Abstract/Free Full Text]

LUO, Z. W., and C.-I WU, 2001 Modeling linkage disequilibrium between a polymorphic marker locus and a locus affecting complex dichotomous traits in natural populations. Genetics 158: 1785–1800.[Abstract/Free Full Text]

LUO, Z. W., S. H. TAO and Z-B. ZENG, 2000 Inferring linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations. Genetics 156: 457–467.[Abstract/Free Full Text]

MCPHERSON, A. A., R. L. STEPHENSON, P. T. O'REILLY, M. W. JONES and C. T. TAGGART, 2001 Genetic diversity of coastal Northwest Atlantic herring population implications for management. J. Fish Biol. 59: 356–370.[CrossRef]

MEUWISSEN, T. H. E., and M. E. GODDARD, 2000 Fine mapping of quantitative trait loci using linkage disequilibria with closely linked marker loci. Genetics 155: 421–430.[Abstract/Free Full Text]

NEI, M., and W.-H. Li, 1973 Linkage disequilibrium in subdivided populations. Genetics 75: 213–219.[Abstract/Free Full Text]

OHTA, T., 1982a Linkage disequilibrium due to random genetic drift in finite subdivided populations. Proc. Natl. Acad. Sci. USA 79: 1940–1944.[Abstract/Free Full Text]

OHTA, T., 1982b Linkage disequilibrium with the island model. Genetics 75: 139–155.

PANNELL, J. R., and B. CHARLESWORTH, 2000 Effects of metapopulation processes on measures of genetic diversity. Philos. Trans. R. Soc. Lond. B 355: 1851–1864.[CrossRef][Medline]

PARRA, E. J., A. MARCINI, J. AKEY, J. MAARTINSON, M. A. BATZER et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851.[CrossRef][Medline]

PETERSON, R. J., D. GOLDMAN and J. C. LONG, 1999 Effects of worldwide population subdivision on ALDH2 linkage disequilibrium. Genet. Res. 9: 844–852.

PRITCHARD, J. K., M. STEPHENS, N. A. ROSENBERG and P. DONNELLY, 2000a Association mapping in structured populations. Am. J. Hum. Genet. 67: 170–181.[CrossRef][Medline]

PRITCHARD, J. K., M. STEPHENS and P. DONNELLY, 2000b Inference of population structure using multilocus genotype data. Genetics 156: 945–959.

RANNALA, B., and J. A. HARTIGAN, 1996 Estimating gene flow in island populations. Genet. Res. 67: 147–158.[Medline]

ROSENBERG, N. A., J. K. PRITCHARD, J. L. WEBER, H. M. CANN, K. K. KIDD et al., 2002 Genetic structure of human populations. Science 298: 2381–2384.[Abstract/Free Full Text]

SLATKIN, M., and L. EXCOFFIER, 1996 Testing for linkage disequilibrium in genotype data using the EM algorithm. Heredity 76: 377–383.

SPIELMAN, R. S., and W. J. EWENS, 1998 A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am. J. Hum. Genet. 62: 450–458.[CrossRef][Medline]

SPIELMAN, R. S., R E. MCGINNIS and W. J. EWENS, 1993 Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52: 506–513.[Medline]

THORNSBERRY, J. M., M. M. GOLDMAN, J. DOEBLEY, S. KRESOVICH, D. NIELSEN et al., 2001 Dwarf8 polymorphisms associate with variation in flowering time. Nat. Genet. 28: 286–289.[CrossRef][Medline]

WAPLES, R. S., 1998 Separating the wheat from the chaff. Pattern of genetic differentiation in high gene flow species. J. Hered. 89: 438–450.[Abstract/Free Full Text]

WEIR, B. S., 1979 Inferences about linkage disequilibrium. Biometrics 35: 235–254.[CrossRef][Medline]

WEIR, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.

WEIR, B. S., and C. C. COCKERHAM, 1978 Testing hypotheses about linkage disequilibrium with multiple alleles. Genetics 88: 633–642.[Abstract/Free Full Text]

WRIGHT, S., 1940 Breeding structure of populations in relation to speciation. Am. Nat. 74: 232–248.[CrossRef]

WRIGHT, S., 1945 The differential equation of the distribution of gene frequencies. Proc. Natl. Acad. Sci. USA 31: 383–389.

WRIGHT, S., 1951 The general structure of populations. Ann. Eugen. 15: 323–354.

WRIGHT, S., 1969 Evolution and Genetics of Populations: The Theory of Gene Frequencies, Vol. 2. University of Chicago Press, Chicago.

WU, R., and Z-B. ZENG, 2001 Joint linkage and linkage disequilibrium mapping in natural populations. Genetics 157: 899–909.[Abstract/Free Full Text]

XIONG, M. M., and S. W. GUO, 1997 Fine-linkage genetic mapping based on linkage disequilibrium theory and applications. Am. J. Hum. Genet. 60: 1513–1531.[Medline]




This article has been cited by other articles:


Home page
GeneticsHome page
S. Kitada, T. Kitakado, and H. Kishino
Empirical Bayes Inference of Pairwise FST and Its Distribution in the Genome
Genetics, October 1, 2007; 177(2): 861 - 873.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
T. Kitakado, S. Kitada, Y. Obata, and H. Kishino
Simultaneous Estimation of Mixing Rates and Genetic Drift Under Successive Sampling of Genetic Markers With Application to the Mud Crab (Scylla paramamosain) in Japan.
Genetics, August 1, 2006; 173(4): 2063 - 2072.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
T. Kitakado, S. Kitada, H. Kishino, and H. J. Skaug
An Integrated-Likelihood Method for Estimating Genetic Differentiation Between Populations
Genetics, August 1, 2006; 173(4): 2073 - 2082.
[Abstract] [Full Text] [PDF]