Abstract
Linkage disequilibrium is an important topic in evolutionary and population genetics. An issue yet to be settled is the theory required to extend the linkage disequilibrium analysis to complex traits. In this study, we present theoretical analysis and methods for detecting or estimating linkage disequilibrium (LD) between a polymorphic marker locus and any one of the loci affecting a complex dichotomous trait on the basis of samples randomly or selectively collected from natural populations. Statistical properties of these methods were investigated and their powers were compared analytically or by use of Monte Carlo simulations. The results show that the disequilibrium may be detected with a power of 80% by using phenotypic records and marker genotype when both the trait and marker variants are common (30%) and the LD is relatively high (40–100% of the theoretical maximum). The maximumlikelihood approach provides accurate estimates of the model parameters as well as detection of linkage disequilibrium. The likelihood method is preferred for its higher power and reliability in parameter estimation. The approaches developed in this article are also compared to those for analyzing a continuously distributed quantitative trait. It is shown that a larger sample size is required for the dichotomous trait model to obtain the same level of power in detecting linkage disequilibrium as the continuous trait analysis. Potential use of these estimates in mapping the trait locus is also discussed.
LINKAGE disequilibrium (LD) between genes at different loci is a topic of historical importance in evolutionary and population genetics theory. Analysis of LD was shown to be a very powerful approach in distinguishing between alternative evolutionary models (Lewontin 1974). There was a recent resurgence in interest in linkagedisequilibrium analysis because of the abundance of genetic polymorphisms at the DNA level. These data made it possible to use the disequilibrium measure to map disease genes (Hastbackaet al. 1992; Kruglyak 1999) or to optimize breeding schemes for markerassisted selection (Lande and Thompson 1990; Luoet al. 1997). From the statistical point of view, inferences about linkage disequilibrium involve two aspects: detecting its presence and estimating its magnitude once the disequilibrium has been confirmed (Weir 1979). Many studies focused on inferring linkage disequilibrium between alleles at two or more loci under the circumstance where gametic or genotypic data are available at the involved loci (Hill 1974, 1975; Brown 1975; Weir and Cockerham 1978; Spielmanet al. 1993; Kaplanet al. 1995).
Many characters of economic or evolutionary importance are complex traits for which a onetoone relationship between genotype and phenotype does not exist (Darvasi 1998). In many instances, a complex phenotype can be assessed continuously or discontinuously. The key difficulty encountered in modeling linkage disequilibria involved with complex traits is mainly caused by the incomplete information on the genotype of these traits. A theoretical analysis was carried out in Luo (1998) to explore the statistical power for detecting the disequilibrium between a polymorphic marker locus and any one of the loci contributing to quantitative genetic variation in natural populations. Applying the model to association studies between a marker and a trait, Nielsen and Weir (1999) showed that there is a simple relationship between the marker being examined and the trait loci. More recently, Luo and Suhai (1999) developed a likelihood approach to estimate the level of linkage disequilibrium between a polymorphic marker locus and any one of the loci affecting a continuous quantitative trait in natural populations. These methods are appropriate for analyzing the marker genotype and trait phenotype data that are directly observable from experiments. Luo et al. (2000) discussed how these analyses are relevant to mapping the major genetic effect of continuous quantitative variation.
Most intensively studied characters in evolutionary biology are complex traits that show discontinuous phenotypic variation. These include, for example, male fertility (usually scored as fertile or sterile) in interspecific hybrids (Wu and Palopoli 1994), pheromonal types between different species of Drosophila (Coyneet al. 1994), and the presence or absence of wing spots in butterflies (Brakefield 1996). To extend our previous study to analyze complex traits of this sort, this article develops the theory and method to detect and estimate linkage disequilibrium between a polymorphic marker locus and any one of the loci affecting a complex binary trait in natural populations. Analyses with dichotomous traits may be theoretically more challenging than with continuous traits because the former requires modeling the link between the observable phenotype and the corresponding latent variable. We study various statistical properties of the theoretical model and compare it with the analysis involved with continuous traits for their statistical powers. We also discuss the implications of the methods in locating genes underlying complex binary traits by use of samples from natural populations.
MODEL AND THEORETICAL ANALYSES
Model and notations: We consider cosegregation of genes at two autosomal loci: One affects a dichotomous trait whereas the other is a codominant marker locus that is devoid of effect on the trait. The two alleles are denoted by M and m at the marker locus and by A and a at the trait locus. The association between genes at the two loci is quantified by D, the coefficient of the disequilibrium defined as D = f_{MA} − pq, where f_{MA} is the frequency of the MA haplotype and p and q denote frequencies of alleles M and A accordingly. With the assumption of random mating, the joint distribution of genotypes at the marker locus and QTL can be expressed as a function D, p, and q and is summarized in Table 1. It should be noted that the joint distribution implies random mating of the population. The model has been discussed elsewhere (Luoet al. 2000).
The phenotype of the trait is assumed to be distributed as a binary variable. If a random sample is collected from the population as described above, individuals sampled may be grouped according to their marker genotypes. The phenotype of the jth individual within the ith marker group is modeled by
The conditional probability of y_{ij} = 1 given the individual's genotype at the trait locus, say G_{ij} = k, is obtained by
Statistical analyses: In this section, three approaches are presented for testing the presence of linkage disequilibrium using the random sample described above.
Independence test: The data can be sorted into a 2 × 3 contingency table as illustrated in Table 2. In Table 2, π_{ij} (or n_{ij}) is the frequency (or number of counts) of the individuals with the ith phenotype and jth marker genotype. The marginal frequencies are denoted by π_{i+} = Σjπ_{ij} and π_{+j} = Σiπ_{ij}. A statistical test for the null hypothesis H_{0}, π_{ij} = π_{i+} π_{+j} with i = 1, 2 and j = 1, 2, 3, is performed by using Pearson's chisquare test statistic
Statistical power of the independence test can thus be formulated as
Regression analysis: If the trait phenotype and the marker genotype data are used to fit a simple regression model,
The test statistic, i.e., t = β/σ_{β}, is expected to follow a central tdistribution under the null hypothesis D = 0. Power of the test at a significant level of α can thus be calculated as the probability
Maximumlikelihood analysis: The loglikelihood of the observed phenotype of the trait and the marker genotype data given the model parameters can be written as
Note that h_{ik} is the joint frequency of the ith marker genotype and the kth genotype at the trait locus, and
The MLEs of the unknown parameters in Equation 12 can be calculated by use of the expectationmaximization (EM) algorithm (Dempsteret al. 1977). Implementation of the EM algorithm in the present context involves iteration of the following two steps.
Estep: Calculate the posterior probability w_{ijk} using the parameter estimates Ω^{(t)} at the tth iteration for both the penetrance and liability models:
The other parameters are calculated in different ways depending on which model is considered: For the penetrance model, the updated estimates of the three penetrance parameters can be directly obtained as
Let
The MLEs of μ, a, and d can be obtained from solving the following equations:
The algorithm is iterated until the sequence of the estimate of Ω converges, and the converged values are the maximumlikelihood estimates of the parameters. Use of the MLEs allows calculation of the likelihood
SIMULATION STUDY AND NUMERICAL ANALYSES
Simulation model: The strategy is described elsewhere (Luo 1998) for simulating the linkage disequilibrium between a polymorphic marker locus and a trait locus in random mating natural populations. For any given set of parameters, n, p, q, D, μ, a, and d, random samples of the individual genotypes at the marker and trait loci were generated. The liability of an individual was determined by its genotype value at the trait locus plus a normally distributed random variate, and the trait phenotype for the individual depended if its liability value was greater than the simulated threshold, θ. In the simulation study, we considered 15 sets of parameters and these were listed in Table 3 in which genetic effects of the disease gene were represented in term of h^{2}, heritability of the liability of the disease trait, and dr, dominance ratio at the disease locus. Each of the simulated parameters was repeated 100 or 200 times depending on the purposes of the analysis.
Behavior of the test statistics under the null hypothesis: To apply the methods developed above in practice, a question remains to determine appropriate critical threshold values for significance of the test statistics. These require knowledge about the distributions of the test statistics under the null hypothesis. Under the null
hypothesis, the test statistic of the contingency table analysis is expected to be a
Table 4 illustrates the expected and simulated values of the means, variances, and 95th percentiles of the four test statistics that were calculated from 200 repeated simulation trials under the condition D = 0. The population numbers that specify the simulation parameter sets except for the value of D are in accordance with those given in Table 3. Table 4 shows that the observed values of the distribution parameters for the contingency and regression test statistics were in good agreement with their corresponding theoretical expectations. The observed frequencies of these tests being significant are in the range of the given significant level. These suggest validity in approximating the distributions of the two test
statistics with the expected ones. However, the simulated means, variances, and 95th percentiles of the likelihoodratio test statistics under the penetrance model and the liability model may differ markedly from those expected for the χ^{2} distribution with 1 d.f. Moreover, the frequencies of the tests detected as significant are substantially lower than the given significance level. These indicate that approximating distributions of the likelihoodratio test statistics given in Equation 23 with χ^{2} is inappropriate. The deviation of distributions of the likelihoodratio test statistics from
Contingency and regression analyses: The expected value of the χ^{2}test statistic in the contingency analysis under each of the 15 sets of the simulated parameters was theoretically predicted using the formula d.f. + λ and calculated from the simulation study. In the regression analysis, the expectation of the regression coefficient was calculated according to Equation 8 and the corresponding simulated observation was obtained as the mean of the regression coefficients over 100 simulations. The expected standard deviation of the regression coefficient was evaluated using Equations 10 (
Table 5 summarizes results of the analyses. The table shows a good agreement between the expected values of the statistics investigated here and their corresponding simulated values under both the contingency and regression analyses. The standard deviations of the regression coefficients predicted using Equation 10 are almost identical to that calculated using Equation 11 for all but population 7 considered here, suggesting that the normality assumption of the data is not important in calculating the standard deviations. When the marker gene and allele at the trait locus were in linkage equilibrium (population 7), the covariance between the marker genotype and the phenotype of the trait was expected to be zero, and Equation 11 then failed to provide prediction of the standard deviation. As long as the power is concerned, it can be seen from Table 4 that the level of linkage disequilibrium plays a major role in both the analyses. In addition, the power is expected to be higher when allelic frequencies at the marker and trait loci are low (population 11) or high (population 12) than that when the frequencies take intermediate values (population 1). The power may be reduced slightly with increase in the threshold. The powers of both the tests decrease as the dominance ratio at the trait locus increases (populations 1, 8, and 9). Comparison of the power between the two tests suggests that the regression model is generally a more efficient method to detect the linkage disequilibrium than the contingency analysis.
Maximumlikelihood analyses: The likelihood equation (13) was used to calculate the maximumlikelihood estimates of the parameters defining the penetrance model and the liability model. Table 6 illustrates the means and the standard errors of the MLEs of the parameters of the penetrance model. It shows that the parameters are well estimated by their corresponding maximumlikelihood estimates. The standard errors of the estimates of the marker allele frequencies were much smaller than those of the estimates of the gene frequencies and the coefficients of the linkage disequilibrium, revealing the fact that the marker genotype data provide full information for estimating the marker gene frequencies, while the estimates of the trait gene frequencies and the disequilibrium coefficients were based on incomplete information from the data of the marker genotype and the trait phenotype. The penetrances of the three genotypes at the trait locus were adequately predicted by the maximumlikelihood approach. The penetrances of the heterozygote genotype were estimated with smaller standard errors than those of the other two homozygote genotypes. Tabulated in Table 6 were also the empirical powers, on the basis of 200 simulations, for detecting the linkage disequilibrium model. The critical value used to determine significance of the likelihoodratio test was based on 200 permutations at the null hypothesis and may change for each simulation. Table 6 shows that among all parameters under question, the level of linkage disequilibrium between the marker and trait loci is the most important factor that determines the efficiency of the likelihoodratio test. There is a trend toward decrease in the power as the dominance effect at the trait locus increases (comparisons among populations 1, 8, and 9 and between populations 12 and 13 as well as between populations 14 and 15). Comparison among populations 1, 10, and 11 shows that the test tends to be more efficient when both the marker and trait genes are at lower or higher frequencies given the other parameters. This may reflect that the contrast of difference in value of the penetrance between the three genotypes is enhanced when the genes are at low or high frequencies.
Tabulated in Table 7 are the means and their standard errors of the MLEs of the parameters used in the liability model as well as the empirical powers of the likelihoodratio test under the model. It can be seen from the table that the parameters are well predicted by their corresponding maximumlikelihood estimates. The standard errors of the parameter estimates under this model are comparable with those under the penetrance model, suggesting consistency of the two models in the parameter estimates. The powers of the linkage disequilibrium test whose significance threshold was also based on 200 permutations at the null hypothesis are approximately the same as those observed in the penetrance model analysis. Effects of the allele frequencies at the marker and trait loci and the dominance level of the gene at the trait locus on the power show the same trend as that discussed in the above penetrance model. These reveal that both the models are almost equally efficient in the parameter estimation and the disequilibrium test.
Comparison of the powers observed in the likelihood analyses to those illustrated in Table 4 shows that the likelihoodratio test is probably more powerful in detecting the disequilibrium than the test based on the contingency or regression analyses.
Comparison of the power between this model and the normal data model: We previously investigated the statistical test for the linkage disequilibrium between a polymorphic marker locus and a locus underlying a quantitative trait whose phenotype is normally distributed (Luoet al. 2000). It was shown that the unbalanced nested ANOVA and the regression analyses provided an efficient test for the disequilibrium using samples from a natural population. To compare these approaches analyzing the continuous quantitative traits to those developed in the present study, the sample sizes required for detecting the linkage disequilibrium with power of 80% at the significance level of 0.001 were evaluated for the methods and were summarized in Table 8. It shows that the methods modeling the continuous trait phenotype are usually more powerful than the methods analyzing the binary phenotype, with the exception that the regression analysis under the reliability model is only slightly better off when the marker allele is at the intermediate frequency but the frequency of the trait gene is low and the gene shows complete dominance (population 12). The table reveals that the regression models require smaller samples for the given power than the alternative methods regardless of whether the trait phenotype is distributed continuously or dichotomously.
DISCUSSION
Wholegenome association studies were recently proposed as a powerful approach for investigating many fundamental questions in evolutionary biology (Clarket al. 1998; Paabo 1999) and for detecting the many subtle genetic effects that underlie susceptibility to common diseases (Lander and Schork 1994; Lander 1996; Risch and Merikangas 1996).
Appropriate modeling of linkage disequilibria between polymorphic markers and genes affecting complex genetic variation in natural populations is an essential step and has proved to be an effective approach in improving resolution of gene mapping. Linkage disequilibrium analysis of gene mapping may provide a mapping resolution <1 cM so that molecular screening at the DNA sequence level for the candidate gene can be performed. This is an improvement over the traditional pedigreebased or crossing populationbased linkage analysis where the candidate gene can hardly be narrowed down to such a resolution (de la Chapelle and Wright 1998; Kearsey and Farquhar 1998; Guo and Lange 2000). The theoretical analysis of this article offers insight into the degree of association between the marker and trait and serves as an important first step in the direction of analyzing complex dichotomous traits using populationlevel LD at linked markers.
Efficiency and statistical properties of three methods proposed here for detecting and estimating linkage disequilibrium are investigated analytically or by simulation. It is shown that linkage disequilibrium under a defined spectrum of inheritance models may be detected with 80% power using a sample size of a few hundred to 2000. The contingency analysis and the regression analysis developed in this study provide the disequilibrium test, and the maximumlikelihood approach presented here can be used to estimate the model parameters, i.e., gene frequencies at both the marker and trait loci, the coefficient of linkage disequilibrium between the two loci, and the genetic effects of the genes. These estimates may be useful in interpreting the demographic history of the natural populations under question (Thompson and Neel 1996; Fay and Wu 1999) and, in turn, to extend the principle of linkage disequilibriumbased mapping to complex traits.
The major difficulty in linkage disequilibriumbased mapping is to quantify the relationship between recombination fraction and linkage disequilibrium measure. Since recombinant events are not observed, the recombination fraction between the marker and trait locus must be estimated on the basis of a population genetics model. Several methods were suggested to address this problem. One of these attempted to search for the reparameterization by which the disequilibrium measure can be directly related to the recombination fraction. For instance, Devlin and Risch (1995) found that, in the present notations, the measure of the disequilibrium
Development of a dense map of biallelic singlenucleotide polymorphisms (SNPs) was suggested to be an effective strategy for genomewide search for linkage disequilibrium between the polymorphisms and candidate genes. Under that setting, a proportion of SNP markers are considered within the candidate genes; thus the recombination fraction between the marker and trait loci may be assumed to be zero (Martinet al. 2000). However, efficiency of the genome scan scheme with SNPs is still controversial (Kruglyak 1999; Ott 2000).
In this study, we considered inference of linkage disequilibrium using a random sample from natural populations. In practice, selected samples may be favored in relevant genetics studies. In principle, this study can be extended to analyze the selected samples. We note that the joint genotypic distribution at the marker and trait loci within the selected sample (say, for example, ξ_{ij}) is related to that in the natural population (h_{ij} given in Table 1) from which the sample is collected, as
The methods presented here differ from others in several aspects. Allison (1997) extended the transmission disequilibrium test (TDT) theory to detect association between the marker locus and a locus contributing to quantitative genetic variation using family data. As a member of the TDT family, Allison's method can detect linkage between the marker locus and the trait locus only if linkage disequilibrium is present. An important assumption made in the TDT model is that the marker locus is the trait locus itself and is not just in linkage disequilibrium with the trait locus, indicating that the genotype information at the trait locus is assumed. This does not apply to the methods proposed in this study where information on the trait locus is missing. Rabinowitz (1997) generalized the TDT test for analyzing quantitative traits, but no effort was made to estimate genetic parameters of the traits. However, it must be noted that a distinct feature of the TDT methods is their robustness to population stratification. The question remains how this analysis may be affected by stratification. This may be tackled readily if the dynamics and properties of linkage disequilibrium in admixed populations are taken into account as we recently showed in Tao et al. (2000).
In a recent study, Nielsen and Weir (1999) proposed the model for association studies between a marker and a trait and showed that there is a simple relationship between the marker being examined and the trait loci. Their model is a very useful framework for specifying the structure of linkage disequilibrium between the loci and for investigating the genetic context of many markerbased statistic tests. Our main purpose is, however, to directly estimate the linkage disequilibrium and the genetic parameters of the dichotomous trait locus. Sharing the same theoretical model of gene segregation at both marker and trait loci, our previous studies (Luo 1998; Luoet al. 2000) focused on inferring linkage disequilibrium between a polymorphic marker locus and a locus affecting continuous complex genetic traits. In this study, however, the trait locus was assumed to affect a complex dichotomous phenotype. We found that the disequilibrium test with the dichotomous traits is generally less efficient than the test where the traits display a continuous phenotypic distribution. Furthermore, a statistical method was suggested in Xu and Atchley (1996) for mapping quantitative trait loci underlying complex binary diseases using planned crossing experiments. The genetic structure of the experiment population under the quantitative trait loci (QTL) mapping consideration allows more information about the QTL genotype to be extracted from the flanking markers because the parental linkage phases of genes at the marker and QTL loci are assumed known, and thus it provides a direct prediction of cosegregation of marker genes and genes at the QTL. With natural populations, none of the information is available.
Our model considers a biallelic model at both marker and trait loci. For two reasons, multiple alleles at the marker loci may need to be taken into account. First, multiple alleles may be common in natural populations for molecular DNA markers like microsatellites. Second, analysis of haplotypes over several marker loci may be effectively reduced to a multiallelic model as demonstrated in Service et al. (1999) and Clayton and Jones (1999). In principle, our model may be integrated into the multiallelic analysis by taking into account the probability that each of the marker alleles is in linkage disequilibrium with the trait allele A. If, for example, the ith marker allele is in linkage disequilibrium with the allele A, the rest of marker alleles are sorted into one class. With this strategy, a likelihoodbased modeling of the multiallelic marker such as that proposed in Terwilliger (1995) may thus be tractable. It is much more challenging to consider multiple alleles at the trait locus because the individual allelic effect cannot be identified with certainty from the trait phenotype. The problem was usually addressed by the way that individuals with a specific phenotype were recognized as carrying one class of alleles and the rest without the phenotype were assigned as carrying other class of alleles (Lazzeron 1998; Serviceet al. 1999). In this study, the putative allele A may be considered as a set of alleles that increase the presence of the trait phenotype (y = 1) while the allele a is a set of alleles that decrease the presence of the phenotype.
Moreover, we recognize there are several multiloci linkage disequilibrium mapping methods. Some of them are based on comparison of pairwise linkage disequilibria between the single trait locus and a set of marker polymorphisms and use the peak value of disequilibrium measure over several marker loci as evidence for location of the hypothesized QTL (e.g., Terwilliger 1995; Rannala and Slatkin 1998; Slatkin 1999). In addition, some theoretical effort has been made to combine the pairwise disequilibrium between the putative trait locus and each of a set of marker loci by use of the composite likelihood principle (Devlinet al. 1996; Collins and Morton 1998). Our study provides statistical inference of linkage disequilibrium between a single marker locus and any one of the loci underlying complex genetic variation; hence the multipleloci analysis could be built upon the twolocus model. However, more work will be needed to extend the method to appropriately taking multiple marker information into account and analyzing multiple trait loci. Such an extension would certainly need to take into account the complex structure of the disequilibria among multiple loci (Weir 1979) or at least a major part of it. We hope this study and analysis will provide a basis for a multilocus analysis and serve as an important building block for our understanding of the genetic architecture underlying complex genetic variation.
Acknowledgments
We are indebted to Mary Sara McPeek for her comments on the manuscript. We thank Richard Hudson for the discussion on the topic of linkage disequilibrium mapping. The comments and criticisms made by an anonymous reviewer and Dr. Marjorie A. Asmussen were very helpful in improving presentation and clarifying several ambiguities in an earlier version of this article. Z.W.L. is supported by China's Basic Research Program “973,” the National Science Foundation of China, the QiuShi Foundation, and the Changjiang Scholarship. C.I.W. is supported by U.S. National Institutes of Health and National Science Foundation grants.
APPENDIX A: CALCULATION OF THE NONCENTRALITY PARAMETER OF THE χ^{2}TEST STATISTIC
In Table 2, let T = 1, 2, 3 correspond to the three marker genotypes MM, Mm, and mm, respectively, and Y = 1, 2 accord to the affected or normal phenotype, and let G be an indicator of genotype at the trait locus. The probability
APPENDIX B: CALCULATION OF THE REGRESSION PARAMETERS AND THE SAMPLE VARIANCE OF THE REGRESSION COEFFICIENT
Let Y and T be trait phenotype and number of marker allele M carried by sampled individual. Y takes a value of 0 or 1, and T could be 0, 1, or 2. Under the assumption of a random mating population, the basic statistics of T can be readily derived as
APPENDIX C: CALCULATION OF MLE OF THE THRESHOLD θ
Let ξ_{0} be the proportion of unaffected individuals sampled. It can be used to approximate its population probability (Thompson 1972), and, in the present context,
APPENDIX D: SOLVING FOR THE MLES OF THE MODEL PARAMETERS
Let ξ be either μ, a, or d, then the first derivative of the expected complete data loglikelihood, which is defined by Equation 14, with respect to ξ can be
The second derivatives are more messy but a general form was found as
Footnotes

Communicating editor: M. A. Asmussen
 Received August 8, 2000.
 Accepted April 30, 2001.
 Copyright © 2001 by the Genetics Society of America