## Abstract

We propose a new method for simultaneously detecting linkage disequilibrium and genetic structure in subdivided populations. Taking subpopulation structure into account with a hierarchical model, we estimate the magnitude of genetic differentiation and linkage disequilibrium in a metapopulation on the basis of geographical samples, rather than decompose a population into a finite number of random-mating subpopulations. We assume that Hardy-Weinberg equilibrium is satisfied in each locality, but do not assume independence between marker loci. Linkage states remain unknown. Genetic differentiation and linkage disequilibrium are expressed as hyperparameters describing the prior distribution of genotypes or haplotypes. We estimate related parameters by maximizing marginal-likelihood functions and detect linkage equilibrium or disequilibrium by the Akaike information criterion. Our empirical Bayesian model analyzes genotype and haplotype frequencies regardless of haploid or diploid data, so it can be applied to most commonly used genetic markers. The performance of our procedure is examined via numerical simulations in comparison with classical procedures. Finally, we analyze isozyme data of ayu, a severely exploited fish species, and single-nucleotide polymorphisms in human *ALDH2*.

WITH the many discoveries of fine-scale markers that represent highly polymorphic loci, linkage disequilibrium between these markers and disease genes or other trait genes has regained importance (Jorde 1995). Along with the rapid progress of genetic techniques, assessing linkage disequilibrium has become a current concern. The assessment involves two aspects: detecting the presence of disequilibrium and estimating its magnitude once disequilibrium has been confirmed (Weir 1979). Continuous efforts have been made on such assessments of linkage disequilibrium between alleles at two or more loci (Hill 1974a,b; Brown 1975; Weir and Cockerham 1978; Slatkin and Excoffier 1996; Luo 1998; Luo and Suhai 1999; Luo* et al.* 2000; Ayres and Balding 2001; Luo and Wu 2001). Methods for linkage disequilibrium-based mapping of target genes have also been developed (Hill and Weir 1994; Kaplan* et al.* 1995; Xiong and Guo 1997; Meuwissen and Goddard 2000; Wu and Zeng 2001). These methods assume homogeneous natural populations. However, if a population is structured, this leads to biased results (known as spurious association) that can reject a null association between a phenotype and molecular markers (*e.g.*, Lander and Schork 1994).

To overcome this problem, Spielman* et al.* (1993) developed a method called the transmission/disequilibrium test (TDT), which uses nuclear family data to detect real associations in structured populations. To use this type of reliable information, similar family-based methods for testing linkage disequilibrium have been developed (*e.g.*, Excoffier and Slatkin 1998; Lazzeroni and Lange 1998; Spielman and Ewens 1998). However, family-based methods cannot be used for association studies of undomesticated species, such as wildlife and forest trees, for which no nuclear family records are available (Wu and Zeng 2001). Hence, population-based methods that detect linkage disequilibrium even in the presence of population structure are required. Pritchard* et al.* (2000a)(b) proposed a population-based method that can detect associations between marker alleles and phenotypes in structured populations. The essential idea of the method is to decompose a sample drawn from a mixed population into several unstructured subpopulations and test the association in the homogeneous subpopulations. The methods have been applied to association analyses in humans (Parra* et al.* 1998; Rosenberg* et al.* 2002) and crop plants, with modified test statistics being used to deal with quantitative traits (Thornsberry* et al.* 2001).

To study the population structure, a sample is often divided by geographical regions. However, in many cases, there are no obvious regional units by which subpopulations can be defined. Rather, natural populations have a complex hierarchical structure, forming a metapopulation (Pannell and Charlesworth 2000). In models, a metapopulation has a continuous structure and consists of an infinite number of subpopulations, or demes. Such populations are well described by hierarchical models that specify the distribution of genetic structure among demes. Population subdivision (Nei and Li 1973; Peterson* et al.* 1999) and other evolutionary forces such as genetic drift (Hill and Robertson 1968), natural selection (Lewontin 1964), and mutation (Ohta 1982a,b) affect linkage disequilibrium. Therefore, when assessing linkage disequilibrium in natural populations, it is very important to also estimate the magnitude of genetic differentiation.

In this article, we propose a new strategy to detect linkage disequilibrium between marker loci and simultaneously estimate genetic differentiation in metapopulations, extending the empirical Bayesian method of Kitada* et al.* (2000). Instead of decomposing a population into a finite number of subpopulations, we describe a distribution for allele frequencies within subpopulations and test linkage disequilibrium using an information criterion based on hierarchical models that detect either linkage equilibrium or disequilibrium. Using samples from randomly sampled localities from a metapopulation, we estimate hyperparameters associated with linkage disequilibrium and genetic differentiation, on the basis of marginal-likelihood functions derived from prior distributions of allele or haplotype frequencies and likelihood functions. We assume that individuals mate randomly and that Hardy-Weinberg (H-W) equilibrium holds in each locality or deme. The method can be applied to frequency data of common genetic markers, including isozymes, mtDNA, microsatellites, and single-nucleotide polymorphisms (SNPs). We do not assume independence between marker loci, and linkage states remain unknown. We examine the performance of our procedure via numerical simulations in comparison with classical procedures. Finally, we analyze isozyme data of ayu, a severely exploited fish species, and single-nucleotide polymorphisms in human *ALDH2*.

## MODELS AND METHODS

### Distribution of allele frequencies and genetic differentiation:

We consider a metapopulation that consists of localities or demes. Hardy-Weinberg equilibrium holds in each deme. The distribution of the allele frequencies, * p* = (

*p*

_{1}, … ,

*p*)′, at each deme is described as a Dirichlet distribution (Johnson and Kotz 1969), 1where

_{I}*I*is the number of alleles and and

**α**= (α

_{1}, … , α

*)′ are regarded as hyperparameters for respective alleles specifying prior distribution. We use this distribution as a prior for allele frequencies of a subdivided population. According to Weir (1996), for populations that have reached equilibrium under the joint effects of drift and mutation or migration, Wright (1945)(1951) found that allele frequencies for loci with two alleles have a beta distribution; for multiallele loci, the distribution is Dirichlet. We assumed that hyperparameters are common over subpopulations and the sum of the hyperparameters is also common for all loci. So, if random sampling of demes is performed from a metapopulation, estimated hyperparameters describe the magnitude of genetic differentiation among subpopulations that have reached equilibrium. The mean linkage disequilibrium coefficient over subpopulations is also written as a function of the hyperparameters, as shown later. Therefore, our assumption on hyperparameters has genetic meaning.*

_{I}Now, *K* demes are randomly sampled from the metapopulation, and *n _{k}* individuals are randomly sampled from each deme (

*k*= 1, … ,

*K*). Given the allele frequencies at each deme, the sample counts of alleles,

*= (*

**n**_{k}*n*

_{k}_{1}, … ,

*n*)′, follow a multinomial distribution. Taking account of uncertainty of the allele frequencies, the likelihood of the sample counts is expressed as a marginal-likelihood function, which is a Dirichlet-multinomial distribution (

_{kI}*e.g.*, Lange 1995; Rannala and Hartigan 1996; Weir 1996; Kitada

*et al.*2000; Balding 2003): 2

The total likelihood is the product of the likelihoods of *K* demes. Here, .

With a sample size of *n*, the variance-covariance matrix of a Dirichlet-multinomial distribution is (*n* + θ)/(1 + θ) times larger than that of the multinomial distribution (Johnson and Kotz 1969). This phenomenon, in which the variance exceeds the nominal variance, is called overdispersion. In our hierarchical models, overdispersion corresponds to the variation of allele frequencies over subpopulations. With samples of size *n*_{1}, . . , *n _{k}*, our overdispersion, σ

^{2}, becomes 3

Here, *n̅* is the mean sample count over *K* samples given by . For a panmictic population, σ^{2} is 1; it takes values >1 according to the magnitude of genetic differentiation among subpopulations. On the other hand, θ converges to infinity for a panmictic population.

From Weir's equation (Weir 1996, p. 48, Equation 2.15), the variance of the allele frequency between subpopulations is expressed as

The term *F*_{ST}(2*n* − 1) + 1 corresponds to the dispersion parameter σ^{2}. From this, Kitada* et al.* (2000) obtained the relation between σ^{2} and *F*_{ST} (Wright 1951): 4

If the organism is haploid, 2*n̅* should be *n̅*. By substituting Equation 3 into Equation 4 and assuming , we obtain the relation 5which is also given in Balding (2003). From Equation 5, we have 6

This coincides with Equation 3 of Rannala and Hartigan (1996), which was proposed by Wright (1969) and gives the rate of gene flow. Thus, the sum of hyperparameters θ is consistent with the rate of gene flow and has a relation with *F*_{ST}.

### Linkage equilibrium with population structure:

When genetic data are available from multiple loci, the likelihood is obtained as a product of the likelihoods of these loci when they are under linkage equilibrium. We consider diploid organisms. Let the frequency allele *i* in a subpopulation at locus *j* be *p*^{}_{i} (*j* = 1, … , *J*; *i* = 1, … , *I _{j}*), where

*J*is the number of loci. The haplotype frequency of the subpopulation under linkage equilibrium can be written by the product of allele frequencies over the loci as where the combination of

*J*alleles

*i*

^{(1)}…

*i*

^{(}

^{J}^{)}should be

*i*

^{}

_{j}…

*i*

^{}

_{j}, but we use ellipses for simplicity. Let the sample count for haplotypes be

*n*

_{i}

^{}

_{…i}

^{}and (individuals). The likelihood for sample haplotypes under H-W equilibrium is a multinomial distribution, where 2

*n*

^{}

_{i}is the number of genes of allele

*i*at locus

*j*. The constant term is the combination of the observed haplotypes. However, we do not observe haplotypes but genotypes. When composite genotypes

*G*

_{i…i}are observed, the likelihood is written as 7where and

*n*

_{hetero}is the sum of the heterozygous loci over

*n*individuals. Under H-W equilibrium, probabilities for heterozygous genotypes are obtained by duplicating the product of allele frequencies. The term 2

^{nhetero}refers to such duplication.

We assume *a priori* that allele frequencies of subpopulations have a Dirichlet (for cases *I _{j}* > 2) or beta (for cases

*I*= 2) distribution. The marginal-likelihood function for a subpopulation is obtained from Equations 1, 2, and 7 as 8where 2

_{j}*n*

^{}

_{i}, the number of genes for allele

*i*at locus

*j*, is calculated from composite genotypes. The total likelihood is the product of

*K*likelihood functions for respecsamples. This equation can also be used for haploid organisms by using

*n*and

_{i}*C*instead of 2

*n*and

_{i}*C*′2

^{nhetero}.

### Linkage disequilibrium with population structure:

The likelihood for sample haplotypes of a subpopulation is also a multinomial distribution: 9

If the haplotypes are observed, the marginal likelihood is given by 10where *I* is the number of haplotypes and, for simplicity, we use a suffix *i* for haplotypes instead of *i*^{(1)} … *i*^{(}^{J}^{)}. However, when analyzing diploid data, we do not observe diplotypes (and then no haplotypes), but only composite genotypes.

When investigating linkage disequilibrium, the linkage phase between two alleles may be of most interest. Hence, here we focus on models for two loci with two alleles and derive the general marginal-likelihood function in the appendix.

Let allele frequencies for two loci be *p*_{A0}, *p*_{A1} and *p*_{B0}, *p*_{B1}, and let the haplotype frequencies be *h*_{00}, *h*_{01}, *h*_{10}, *h*_{11} in a subpopulation. In this case, nine composite genotypes can be observed with specific diplotypes. Let the number of individuals for the genotypes be *n*_{1}, … , *n*_{9} (Table 1). For example, the diplotype for the composite genotype *A*_{0}*A*_{0}*B*_{1}*B*_{1} can be specified by *A*_{0}*B*_{1}/*A*_{0}*B*_{1}. Under the H-W equilibrium assumption, the probability of having the genotype is *h*^{2}_{01} and that for *A*_{0}*B*_{0}/*A*_{0}*B*_{1} is 2*h*_{00}*h*_{01}. Thus, the probabilities of having genotypes can be written by diplotype probabilities, which are combinations of haplotype probabilities. However, for double heterozygotes, two diplotypes are possible for the genotypes. In this case, *A*_{0}*A*_{1}*B*_{0}*B*_{1} is the double heterozygote with observed number *n*_{5}. For this genotype, the possible diplotypes are *A*_{0}*B*_{0}/*A*_{1}*B*_{1} and *A*_{0}*B*_{1}/*A*_{1}*B*_{0}. The probability for the genotype is then given as 2*h*_{00}*h*_{11} + 2*h*_{01}*h*_{10} (Table 1).

The likelihood of composite genotypes for the case of two loci with two alleles is given in Hudson (2001)(p. 316, Equation 11.6). Expanding the term for the double heterozygote, we obtain the likelihood function as 11where is a constant term for the combination of the multinomial likelihood. The marginal-likelihood function for a subpopulation is then obtained as 12where α_{11} = θ − α_{00} − α_{01} − α_{10}.

### Parameter estimation and model selection:

We estimate parameters by maximizing the negative log marginal-likelihood functions for linkage equilibrium and disequilibrium. The constant terms *C* and *C*′ can be excluded from the estimation procedure. We reparameterize hyperparameters with *F*_{ST} by using the relation given by Equation 6 and estimate *F*_{ST} and hyperparameters numerically as free parameters by using simplex minimization. The rate of gene flow θ and the dispersion parameter σ^{2} are then estimated by Equations 4 and 6. Our empirical Bayesian procedure thus offers maximum-likelihood estimators of genetic differentiation. We estimate the 95% confidence interval for *F*_{ST} from the log-likelihood profile, and then estimate 95% confidence intervals for θ and σ^{2} by substituting the lower and upper confidence limits of *F*_{ST} into Equations 4 and 6, respectively.

We also estimate the linkage correlation coefficient (Hill and Robertson 1968) as 13where *E* and *E* are mean allele frequencies over subpopulations, which are estimated from sample allele frequencies. Here, *D*_{TOTAL} is the linkage disequilibrium coefficient over subpopulations, *E* − *E**E*, and is estimated as 14

However, this estimator is biased. Let *E*[*D*] be the mean of *D* at each deme, which can be written as from which we have 15

As and , the covariance is expressed as

The variance and covariance of a Dirichlet distribution are Var[*p _{i}*] = α

*(θ*

_{i}*− α*

_{i}*)/(θ*

_{i}^{2}(θ + 1)) and Cov[

*p*,

_{i}*p*] = α

_{j}*α*

_{i}*/(θ*

_{j}^{2}(θ + 1)) (Johnson and Kotz 1969). Hence, we have the covariance as a function of hyperparameters: 16

From Equations 14, 15, and 16, we have the unbiased estimator of the mean linkage disequilibrium coefficient correcting spurious association as 17while correction of *r* is not trivial because of the denominator of Equation 13.

We use the Akaike Information Criterion (AIC; Akaike 1973) as a criterion for model selections, 18where is the maximum marginal log-likelihood of the model and *u* is the number of free parameters estimated. We can compare various models by this criterion; the model with the lowest AIC value is selected as the most parsimonious model. Equation 18 indicates that, when there are several models with similar values of the maximum likelihood, we should select the model with the smallest number of parameters, again following the principle of parsimony. By using AIC, we can detect linkage equilibrium or disequilibrium.

## TESTING PERFORMANCE

To evaluate the performance of our method, we conducted two sets of simulations. The first simulation examined if our procedure estimates the linkage disequilibrium and genetic differentiation reliably. The second simulation compared our estimate of linkage disequilibrium with two other commonly used estimates: an estimate from a pooled sample and an average of the estimated linkage disequilibrium over subsamples.

### Simulated data 1:

We assumed that the mean population allele frequencies of the two loci over subpopulations were and hence haplotype frequencies were *E*[*h*_{00}] = *E*[*h*_{11}] = 0.25(1 + *r*) and *E*[*h*_{01}] = *E*[*h*_{10}] = 0.25(1 − *r*). We set the sample size to 50 individuals for each sampling point and generated haplotype frequencies for 20 geographical samples under various *F*_{ST} (0, 0.05, 0.10, 0.15, 0.20) and *r* (0, 0.2, 0.4, 0.6, 0.8). We then calculated sample composite genotype counts on the basis of haplotype frequencies assuming H-W equilibrium.

For the state of linkage equilibrium (*r* = 0), given the *F*_{ST} values, sample allele frequencies of two loci were generated independently from the beta distribution β(θ/2, θ/2), where θ = 1/*F*_{ST} − 1. Haplotype frequencies were then calculated as *E**E* for *E*[*h*_{00}]. Other haplotypes were calculated in a similar way. For the case of *F*_{ST} = 0, sample allele frequencies of the two loci were generated independently from the binomial distribution Bi(50, 0.5). For linkage disequilibrium, given the *F*_{ST} values, haplotype frequencies were generated from the Dirichlet distribution *D*(α_{00}, α_{01}, α_{10}, α_{11}), where the hyperparameters were given as α_{00} = α_{11} = 0.25θ(1 + *r*) and α_{01} = α_{10} = 0.25θ(1 − *r*). For cases of *F*_{ST} = 0, sample haplotype frequencies were generated from the multinomial distribution Mul(50, *E*[*h*_{00}], *E*[*h*_{01}], *E*[*h*_{10}], *E*[*h*_{11}]).

We estimated *F*_{ST}, *r*, and hyperparameters on the basis of Equations 8, 12, and 13 for 1000 replicates. For all cases with *r* > 0, the model of linkage disequilibrium had smaller values of AIC and was selected. On the other hand, for all cases of *r* = 0, the model of linkage equilibrium was selected. Estimates obtained from the best-fit model agreed well with the real values of *F*_{ST} and *r*, showing that our method works appropriately over a wide range of genetic differentiation and linkage disequilibrium (Table 2).

### Simulated data 2:

In an ordinary way, researchers would use the sample mean of the linkage disequilibrium coefficient in each subpopulation (say, *D*_{MEAN}) as an estimator of the mean linkage disequilibrium coefficient. Although a larger number of sampling points should improve precision of estimates of genetic differentiation, limited sample size in each locality may result in a poor estimate of *D*_{MEAN}. Another procedure is to estimate the mean linkage disequilibrium coefficient from pooled genotype data over subpopulations (say, *D*_{POOL}). This procedure neglects the population structure.

The mean population allele frequencies of the two loci over subpopulations were assumed as and hence haplotype frequencies were *E*[*h*_{00}] = 0.64 + 0.16*r*, *E*[*h*_{01}] = *E*[*h*_{10}] = 0.16(1 − *r*), and *E*[*h*_{11}] = 0.04 + 0.16*r*. Given *F*_{ST} values, haplotype frequencies were generated from the Dirichlet distribution, where the hyperparameters were given as α_{00} = θ(0.64 + 0.16*r*), α_{01} = α_{10} = θ0.16(1 − *r*), and α_{11} = θ(0.04 + 0.16*r*). We set total sample size to 1000 individuals and generated haplotype frequencies for different numbers of localities (*K* = 10, 20, 40, 100) under various *r* (0, 0.2, 0.4, 0.6, 0.8) with fixed *F*_{ST} = 0.2. We then calculated sample composite genotype counts on the basis of haplotype frequencies assuming H-W equilibrium.

We estimated *F*_{ST}, *E*[*D*], and hyperparameters on the basis of Equations 8, 12, and 17 for 1000 replicates. *D*_{MEAN} and *D*_{POOL} were estimated by the method of Hill (1974b) on the basis of sample composite genotype counts at each locality and the pooled counts. A larger number of sampling points improved the precision of estimates of *F*_{ST} (Table 3). Our hierarchical model estimated the real values of *E*[*D*] correctly over the whole range of linkage disequilibrium and numbers of sampling points. Precision of the estimates of *E*[*D*] was also improved for a larger number of sampling points (Table 4). Estimates of *D*_{MEAN} were biased for weaker linkage disequilibrium. The biases were decreased for larger *r*; however, the biases were larger than those from the hierarchical model. Estimates of *D*_{POOL} were largely biased with low precision over the whole range of linkage disequilibrium and numbers of sampling points. The results showed that our method works more efficiently than classical procedures.

## APPLICATION TO REAL DATA

### Genotype data of the ayu:

We analyzed isozyme genotype data for seven samples of the ayu (*Plecoglossus altivelis*) from Japan (K. Yoshizawa, unpublished data). Two samples of wild stocks were taken from Biwako Lake (land-locked type) and Nagara River (amphidoromous type), and five samples were taken from captive brood stocks in hatcheries. The lifespan of ayu is 1 year. These brood stocks have been bred in captivity for between 7 and 26 generations in each hatchery to produce juveniles for release to enhance severely exploited stocks. The numbers of generations of captive breeding are shown in parentheses with the names of the sampling locations in the Table 5 legend. Two loci were analyzed. Four alleles were found at the *Gpi* locus and three alleles at the *Mpi* locus. However, two alleles of *Gpi* and one of *Mpi* were very minor, and so we grouped them with major ones to obtain nine composite genotypes (Table 5).

We estimated parameters on the basis of composite genotype data by using Equations 8, 12, and 13 (Table 6). A smaller AIC value was obtained for the model of linkage equilibrium; thus, linkage equilibrium between *Gpi* and *Mpi* loci was detected. Estimates of *F*_{ST} and θ showed very large genetic differentiation and small gene flow. Generally, genetic differentiation of fish species is small because of large gene flows caused by migration or lack of barriers (*e.g.*, McPherson *et al.* 2001; Gold and Turner 2002). However, the result obtained here is natural because the sample comprised three different groups: land-locked, amphidoromous, and captive brood stocks in successive breeding for many generations.

### Human SNPs:

We analyzed SNPs reported by Peterson* et al.* (1999) in the human aldehyde dehydrogenase 2 gene *ALDH2*.

We focused on sites 1 and 2 from the six sites reported by Peterson *et al.* The haplotype frequencies were estimated from the haplotype configurations given in their Table 2 as *h*_{11} = *H*_{1} + *H*_{4} + *H*_{6} + *H*_{8} + *H*_{9}, *h*_{12} = *H*_{3}, *h*_{21} = *H*_{2}, and *h*_{22} = 0. We estimated parameters using Equation 8 from the allele frequencies at the two sites within *ALDH2* genotyped in 756 people from 17 populations across five continents (Table 7). We also estimated parameters for estimated haplotypes on the basis of Equation 10 (Tables 7 and 8). The AIC value for the model of linkage disequilibrium was smaller than that for linkage equilibrium. Our estimate of *r* was −0.60, whereas the original authors' estimates varied from −0.96 to 0.01 for 17 subpopulations, with an average of −0.45 (Peterson* et al.* 1999, Table 4). Strong linkage disequilibrium was found, which is consistent with the analyses by Peterson *et al*. Estimates of *F*_{ST} and θ also showed that genetic differentiation of human *ALDH2* was large with a small gene flow.

## DISCUSSION

We have proposed a new method, based on genotype frequencies of geographical samples, to detect linkage disequilibrium between markers and to simultaneously estimate genetic differentiation by taking population subdivision into account. The method detected linkage disequilibrium and estimated linkage disequilibrium coefficient and *F*_{ST} correctly.

With a hierarchical model, we estimate the genetic structure in a metapopulation, rather than decompose a population into a finite number of randomly mating subpopulations. The sum of hyperparameters θ coincided with the rate of gene flow, and *F*_{ST} was written by θ. We also showed by Equation 17 that the magnitude of linkage disequilibrium depends on that of genetic differentiation. Our model assumes metapopulations, which includes Wright's island model as a special case (Wright 1940); hence the relationship between θ and *F*_{ST} can be applied for these population models. The metapopulation concept fits with ecology of wildlife that have limited movement ability and may mate randomly in their territory. Even for marine fish, which can disperse widely because of the lack of barriers in oceans, the stock concept is popular in fishery resource management (Waples 1998). In breeding seasons, fish species generally gather to spawning grounds where they mate randomly. As a result of such breeding patterns, the assumption of Hardy-Weinberg equilibrium may be valid in each locality. Furthermore, in humans, whose population structures may be more complex than those of wildlife, it has been reported that geographic clusters often correspond closely to predefined regional or population groups or collections of geographically similar populations (Rosenberg* et al.* 2002). In actual populations, various levels of subdivision may exist (Excoffier 2001), and the number of demes should be very large with continuous subdivision. In ecological studies, sampling points may increase year by year, which increases the accuracy of our method.

Estimates of hyperparameters were based only on frequencies counted in the samples, and alleles that did not appear were assigned frequency 0. Here, we consider if this treatment was appropriate. Consider a simple case of three haplotypes with sample counts *n*_{1}, *n*_{2}, and *n*_{3}, where *n*_{3} = 0. The marginal-likelihood function for a subpopulation is a Dirichlet-multinomial distribution: 19

Using the relation of Γ(*n* + 1) = *n*Γ(*n*), the first term of this equation can be written as 20

Hence, to maximize Equation 19, the term in the parentheses in Equation 20 should be minimized. As α_{3} ≥ 0, the maximum-likelihood estimate of α_{3} is then 0. This can be generalized for all cases.

For diploid data, we have two procedures for inferring linkage disequilibrium. The first procedure (method 1) estimates hyperparameters using Equation 12 or the last equation in the appendix, on the basis of observed composite genotypes. Another way (method 2) is to use Equation 10, on the basis of estimated haplotypes. Several methods for estimating haplotype frequencies have been developed (Excoffier and Slatkin 1995; Long* et al.* 1995; Slatkin and Excoffier 1996). In the analysis of human SNPs, we used estimated haplotypes as if they were observed data. However, the effect of using estimated haplotypes on parameter estimates is unknown. The accuracy of the inference might depend on the precision of haplotype estimates. Method 1 should be better for accurate inference, but deriving explicit marginal-likelihood functions becomes much harder for larger numbers of loci, although the general form of the marginal-likelihood function is derived in the appendix. Method 2 is much easier and more practical. Precise estimation of haplotype frequencies is very important for studies of linkage disequilibrium.

## APPENDIX:

### GENERAL LINKAGE DISEQUILIBRIUM MARGINAL-LIKELIHOOD FUNCTION FOR DIPLOID ORGANISMS

Let observed composite genotypes be *G _{i}*(

*i*= 1, … ,

*m*) and the numbers of individuals for the genotypes be

*n*

_{Gi}. The likelihood function for a sample is written as where

*P*

_{Gi}is the probability for the genotype

*G*, which can be written by the probability of diplotypes. For the case of two loci with two alleles, the diplotype probabilities are given in Table 1. Even for multilocus data, a diplotype can be specified for a set of completely homozygous genotypes, but the larger number of heterozygous genotypes results in a larger number of diplotype combinations. Consequently, the number of summations of diplotype probabilities increases for such genotypes. Thus, the probability of a genotype is expressed by the sum of diplotype probabilities.

_{i}Let a set of diplotypes and haplotypes consistent with a genotype *G* be where *l _{G}* is the number of diplotypes consistent with the genotype

*G*. The likelihood function for composite genotypes is then written by using diplotype probabilities as

By multinomial expansion, we have another form of the likelihood as where *i*^{}_{j} is the number of individuals having diplotype *j*, which constitutes genotype *i*, and .

Under H-W equilibrium, the probability of the *j*th diplotype is expressed as a product of the probabilities of haplotypes, where δ is a delta function that is 1 for homozygous and 0 for heterozygous genotypes. Therefore, the likelihood of a sample from a subpopulation under H-W equilibrium is

In the above equation some haplotypes will be common, so we rename these haplotypes as where ξ is the number of common haplotypes. The likelihood function is then written as where *j _{s}* (

*s*= 1, … , ξ) is the number of individuals having haplotype

*s*and , where δ

_{h′sh11}is again a delta function that is 1 if

*h*

^{′}

_{s}coincides with

*h*

^{}

_{11}and 0 otherwise. Thus, the likelihood for observed composite genotypes can be written as the sum of the multinomial distribution for haplotype frequencies. From above

*L*

_{1}, the marginal-likelihood function is obtained as

The total marginal-likelihood function is obtained multiplying *K* likelihood functions.

## Acknowledgments

We thank Kazutomo Yoshizawa for allowing us to use his unpublished data and Laurent Excoffier and an anonymous reviewer for their constructive comments on earlier versions of the manuscript. This work was supported by the Japan Society for the Promotion of Science and Japan Science and Technology Agency.

## Footnotes

Communicating editor: L. Excoffier

- Received March 20, 2003.
- Accepted December 26, 2003.

- Genetics Society of America