Originally published as Genetics Published Articles Ahead of Print on September 19, 2005.
Genetics, Vol. 172, 647-661, January 2006, Copyright © 2006
doi:10.1534/genetics.105.045781
Improvement of Mapping Accuracy by Unifying Linkage and Association Analysis
Xiang-Yang Lou*,
Jennie Z. Ma
,
Mark C. K. Yang
,
Jun Zhu
,
Peng-Yuan Liu**,
Hong-Wen Deng**,
Robert C. Elston
and
Ming D. Li*,1
* Department of Psychiatric Medicine, University of Virginia, Charlottesville, Virginia 22911,
Department of Psychiatry, University of Texas Health Science Center, San Antonio, Texas 78229,
Department of Statistics, University of Florida, Gainesville, Florida 32611,
Department of Agronomy, Zhejiang University, Hangzhou, Zhejiang 310029, People's Republic of China, ** Osteoporosis Research Center, Creighton University Medical Center, Omaha, Nebraska 68131 and 
Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
1 Corresponding author: 1670 Discovery Dr., Ste. 110, Charlottesville, VA 22911.
E-mail: ml2km{at}virginia.edu
Manuscript received May 17, 2005.
Accepted for publication September 14, 2005.
 |
ABSTRACT
|
|---|
It is well known that pedigree/family data record information on the coexistence in founder haplotypes of alleles at nearby loci and the cotransmission from parent to offspring that reveal different, but complementary, profiles of the genetic architecture. Either conventional linkage analysis that assumes linkage equilibrium or family-based association tests (FBATs) capture only partial information, leading to inefficiency. For example, FBATs will fail to detect even very tight linkage in the case where no allelic association exists, while a violation of the assumption of linkage equilibrium will result in biased estimation and reduced efficiency in linkage mapping. In this article, by using a data augmentation technique and the EM algorithm, we propose a likelihood-based approach that embeds both linkage and association analyses into a unified framework for general pedigree data. Relative to either linkage or association analysis, the proposed approach is expected to have greater estimation accuracy and power. Monte Carlo simulations support our theoretical expectations and demonstrate that our new methodology: (1) is more powerful than either FBATs or classic linkage analysis; (2) can unbiasedly estimate genetic parameters regardless of whether association exists, thus remedying the bias and less precision of traditional linkage analysis in the presence of association; and (3) is capable of identifying tight linkage alone. The new approach also holds the theoretical advantage that it can extract statistical information to the maximum extent and thereby improve mapping accuracy and power because it integrates multilocus population-based association study and pedigree-based linkage analysis into a coherent framework. Furthermore, our method is numerically stable and computationally efficient, as compared to existing parametric methods that use the simplex algorithm or Newton-type methods to maximize high-order multidimensional likelihood functions, and also offers the computation of Fisher's information matrix. Finally, we apply our methodology to a genetic study on bone mineral density (BMD) for the vitamin D receptor (VDR) gene and find that VDR is significantly linked to BMD at the one-third region of the wrist.
TWO approaches are commonly used in pedigree- or family-based gene mapping, i.e., linkage analysis (e.g., ELSTON and STEWART 1971; HASEMAN and ELSTON 1972; OTT 1974; LANDER and GREEN 1987; RISCH 1990; WARD 1993; AMOS 1994; KRUGLYAK and LANDER 1995; O'CONNELL and WEEKS 1995; KRUGLYAK et al. 1996; GUDBJARTSSON et al. 2000; ABECASIS et al. 2002) and family-based association tests (FBATs) (e.g., FALK and RUBINSTEIN 1987; SPIELMAN et al. 1993; LAZZERONI and LANGE 1998; LAIRD et al. 2000; RABINOWITZ and LAIRD 2000). Linkage analysis focuses on gene cosegregation that can be characterized by inheritance vectors or gene concordance between related individuals (identical-by-descent, IBD, or identical-in-state, IIS) at each locus, while association tests (which, when due to linkage, are tests of gametic association, also called linkage disequilibrium, LD) directly utilize allele status and linkage phase that record historic events. Pedigree data contain both these components of information that give rise to complementary profiles of the genetic architecture. Either linkage or association analysis alone, however, can capitalize only on the genetic information from one of these components and fails to grasp the whole picture, thereby leading to a loss in mapping accuracy and statistical power.
To illustrate the limitations of applying either a linkage or association approach alone, let us consider the affected sib pair design used in RISCH (1990) and RISCH and MERIKANGAS (1996). First, traditional linkage analysis will give a biased result in the presence of population association. To simplify our exposition, assume there are a diallelic disease locus
with alleles Q and q and a codominant marker locus
with alleles A and a. Alleles Q and A have the same frequency and are in perfect association, and let pQ = pA = pAQ = p. Table 1 lists the assumed probabilities (under no association), the true probabilities, and RISCH's (1990) LOD scores of all six possible sib configurations in the case where marker
is unlinked to a recessively inherited disease gene
. Using RISCH's (1990) EM iterative Equation 4, we can obtain the maximum-likelihood estimates (MLEs) of the posterior probabilities that the affected sib pairs share i marker alleles IBD (i = 0, 1, 2). To illustrate the result, we take p to be a specific value, say p = 0.5, then we have
,
, and
, respectively, and the expected LOD (ELOD) = 0.384. These values deviate substantially from the true IBD sharing scores of 0.25, 0.5, and 0.25, respectively, and exhibit a spuriously excessive allele sharing. This suggests that a false-positive result can occur in allele-sharing analysis. We further demonstrate that, generally, the assumed likelihood is a monotonically decreasing function of the recombination fraction
for
[0, 0.5] (see the APPENDIX). This means that, if the true recombination fraction
0
0, we may still obtain an estimate of zero.
Second, neglecting to take account of information on association may cause loss of statistical power. As pointed out by RISCH and MERIKANGAS (1996), the allele-sharing method is much less powerful than the transmission/disequilibrium test (TDT) method in the cases they considered, i.e., when there is no recombination and the alleles at the two loci are perfectly associated. This arises because the linkage statistic, the mean allele sharing, fails to consider the allele-specific IBD sharing. Actually, allele A (increasing disease risk) contributes more allele sharing to the statistic, whereas allele a contributes less, so that the overall mean allele sharing is diluted. Our simulations of model-based linkage-only analysis support this theoretical argument, i.e., the plausible bias and the reduced power (see SIMULATION STUDIES).
Because they fail to incorporate information on linkage, FBATs are inherently conservative, and so they cannot detect linkage even when two or more siblings are available, unless there is also population association. The conclusion by RISCH and MERIKANGAS (1996) was drawn from the ideal circumstance where the marker is the disease gene itself. In such a situation, FBATs reach their maximum potential power. In practice, however, it may not be true that a marker happens to have the same variant frequencies as, and be perfectly associated with, the disease gene of interest, even for fine mapping, as there are always many polymorphic SNPs within a gene whereas only a few may be responsible for the change of its function. Both theoretical and empirical studies (e.g., KRUGLYAK 1999; HINDS et al. 2005) have shown that the founder LD within a small region has usually been largely disrupted by various population forces, such as recombination, gene conversion, and/or mutation accumulated over time, so that high-LD regions with little genetic shuffling, termed haplotype blocks, span only a very short distance, implying that strong LD is not inevitable with tightly linked loci. HapMap studies also indicate that the frequencies of variants change from one SNP to another largely within a block (INTERNATIONAL HAPMAP CONSORTIUM 2003). In practical application, FBATs can therefore lose their theoretical power even with closely linked loci, owing to the violation of such an ideal assumption. Furthermore, association may extend over a great distance, even to nonsyntenic loci because of factors other than linkage, such as population subdivision and admixture, population bottlenecks, mutation, gene conversion, meiotic drive, sampling or ascertainment bias, nonrandom mating, and coancestry. Caution is also required in that a positive result from an FBAT does not necessarily imply the presence of tight linkage; i.e., an FBAT alone cannot distinguish strong association and loose linkage from weak association and tight linkage (ELSTON 1998; WHITTAKER et al. 2000).
Therefore, it is of great interest to remedy the above limitations. A judicious way is to take both these pieces of information into consideration in gene mapping. Such an idea was conceived in earlier literature (e.g., MACLEAN et al. 1984) and adopted in some computer software such as LINKAGE (LATHROP and LALOUEL 1984; LATHROP et al. 1985). Unfortunately, the bonus from joint mapping was not recognized, so this remarkable idea has been buried for several years (XIONG and JIN 2000). Recently, ZHAO et al. (1998) proposed a semiparametric method for a combined linkage and linkage disequilibrium analysis. XIONG and JIN (2000) advocated a likelihood-based parametric method for joint analysis with nuclear family data. CANTOR et al. (2005) further extended XIONG and JIN's (2000) method for general pedigrees. LI et al. (2005) suggested an approach that identifies associated and potentially causal SNPs through joint modeling of linkage and association. Parallel to parametric ones, variance components (e.g., ALLISON et al. 1999; FULKER et al. 1999; ABECASIS et al. 2000) and nonparametric (HUANG and JIANG 1999; WICKS 2000; WICKS and WILSON 2000; LAZZERONI 2002) methods have also been developed. However, those methods work mostly for specific data structures and types such as affected sib pairs, nuclear families, and categorical traits and/or can provide a solution only for specific problems such as single-point analysis. The bonus of combined mapping has also not been thoroughly explored. By invoking a data augmentation technique and the EM algorithm, we have evolved a general likelihood-based statistical framework for integrating linkage and association analyses (LOU et al. 2005). In the present article, we further extend this model-based approach for general pedigrees. This approach allows us to simultaneously perform segregation, linkage, and association analyses, i.e., to estimate penetrance functions, genetic distances, and association parameters, as well as to carry out the corresponding hypothesis tests within a unified framework. More appealingly, it adds several unique strengths to existing parametric methods (e.g., XIONG and JIN 2000; CANTOR et al. 2005; LI et al. 2005). First, this framework is conceptually straightforward, flexible, easy to generalize, and also comprehensive, so that it covers a wide range of cases with multiple loci and/or multiple alleles. Multilocus mapping and epistatic QTL mapping can be implemented as well under the same concept. Second, our new approach is computationally efficient and powerful. We formulated the closed-form solutions for MLEs implemented with EM iteration and thus avoid the computational difficulty of high-order multidimensional searches, leading to less computational time per iteration and quick convergence. Third, due to the advantage of the EM algorithm over the simplex algorithm and Newton-type methods in the context of a mapping study, as pointed out by some authors (e.g., LANDER and GREEN 1987), our new approach is numerically stable, as compared with existing methods. In our experience, a wide range of initial values appears to give good convergence. Finally, we offer the computation of Fisher's information matrix and hence can provide the estimation precision of MLEs. Although this article emphasizes a demonstration of the improvement in mapping accuracy using a two-locus model, i.e., one marker and one trait gene, we use an interval mapping model to describe our new approach in the MODEL AND METHOD section for readers to have a clearer picture about it. After presenting the theory, we use simulation studies to compare the power of an FBAT, of the pure linkage method, and of our new approach and the estimation precision of the latter two. An application to the genetic study of bone mineral density (BMD) is used to demonstrate this new methodology. Finally, we discuss some relevant issues to provide further insights into this approach.
 |
MODEL AND METHOD
|
|---|
Here we use a three-diallelic-locus model to illustrate the approach. Suppose there are three loci, one trait gene or QTL,
, bracketed by a pair of flanking markers,
and
, respectively. Let A, a, Q, q, B, and b be the alleles at the three loci, respectively. All the alleles together form eight haplotypes, AQB, AQb, AqB, Aqb, aQB, aQb, aqB, and aqb. These haplotypes unite to generate a total of 36 diplotypes, AQB/AQB, AQB/AQb, ... , and aqb/aqb, where the "/" denotes the separation of the maternally and paternally derived gametes. The 36 diplotypes are collapsed into 27 zygote genotypes, each with an identical allelic combination at all the loci, and further, into 9 marker genotypes and 3 QTL genotypes. Owing to the fact that genotypes are conflated data that ignore the linkage phases of diplotypes, some of the genotypes consist of >1 diplotype. For example, all 4 diplotypes AQB/aqb, AQb/aqB, AqB/aQb, and Aqb/aQB exhibit the same genotype, AaQqBb. To express the relationship between diplotypes and genotypes, we denote by
,
, and
the manyone mapping operators taking the genotypes at all loci, the marker loci and the QTL, of a diplotype in parentheses, respectively. Thus,
, and
.
We use pAQB, pAQb, ... , paqb and PAQB/AQB, PAQB/AQb, ... , Paqb/aqb to denote the frequencies of the haplotypes AQB, AQb, ... , aqb and diplotypes AQB/AQB, AQB/AQb, ... , aqb/aqb, respectively, in the population studied. If the population is at HardyWeinberg equilibrium, we have
The frequencies of the haplotypes can be decomposed into different components determined by the allele frequencies at each locus and LD coefficients of different orders; e.g.,
where pA, pB, and pQ are the frequencies of alleles A, B, and Q, respectively and DAQ, DBQ, DAB, and DAQB are the LD coefficients, respectively. Reversely, the frequencies of alleles and LD coefficients can also be represented by the frequencies of haplotypes; e.g., pA = pAQB + pAQb + pAqB + pAqb, ... , DAB = pAB pApB, ... , and DAQB = pAQB pADBQ pBDAQ pQDAB pApBpQ. For a more general expression with an arbitrary number of alleles and/or loci, see one of our recent communications (LOU et al. 2003) for details.
Crossing over between a pair of contiguous loci may take place during meiosis. Either recombination (R) or nonrecombination (N) between each of the pairs of adjacent loci (i.e.,
and
,
and
) will give rise to four recombination configurations described by NN, NR, RN, or RR. The frequency of a new haplotype is a function of the recombination fraction(s) associated with its recombination configuration(s). For simplicity, we here ignore crossover interference during gametogenesis. Let
AQ and
BQ be the recombination fractions between loci
and
and between
and
, respectively. The frequencies of these four configurations can be expressed in terms of
AQ and
BQ, i.e., (1
AQ)(1
BQ), (1
AQ)
BQ,
AQ(1
BQ), or
AQ
BQ corresponding to NN, NR, RN, or RR, respectively. Furthermore, the conditional probability of a zygote randomly formed by the haplotypes generated from a pair of parents is a product of the frequencies of paternally and maternally original haplotypes.
For any complex trait, either continuous or discrete, there is no oneone correspondence between genotype and phenotype. The conditional probability of observing a phenotype given a specified genotype, termed the penetrance function, is thus used to characterize the relationship between genotype and phenotype. Because the phenotype is genetically determined by the genotypes at locus
, the penetrance function, given diplotype
, can be expressed as
for a continuous phenotype in which it is typically assumed that the distribution within each subpopulation defined by genotype is normal, where
is the genotypic mean of QTL genotype
and
2 is the residual variance. For a categorical trait the penetrance
is defined as the probability that individuals with genotype
manifest phenotype y. We may specify different penetrance functions to mothers, fathers, and children on the basis of the inheritance pattern of the trait under investigation. To make this presentation terser, here we assume the same penetrance for the parental and offspring generations. However, it is not difficult to recast the methodology to be applicable to the case with different penetrance functions. Mendelian trait(s) and marker(s) can be viewed as specific examples with full penetrance. Then the methodology developed hereinafter is also applicable to their analysis.
In a gene-mapping study aimed at estimating parameters of penetrance, association, and position (usually measured by the recombination fractions), a major challenge is that latent data exist, also referred to as missing data, that cannot be directly observed, such as disease genotype, diplotype, and recombination configuration. We hypothesize the observed data, i.e., marker genotypes and phenotypes, together with the latent data, i.e., diplotypes and recombination configurations, as complete data, also termed augmented data. Correspondingly, the observed data alone are called incomplete data. The observed data can be viewed as mixtures of complete data and then we can use a mixture model to tackle the issue of parameter estimation.
The complete data likelihood:
Denote marker, diplotype/haplotype, recombination configuration, and phenotype data by M,
/
,
, and y, respectively. Observed marker and phenotypic data are in boldface type while the missing data for parent and child diplotypes and child recombination configurations are in script type. We first use nuclear family data, in which there is no phenotypic covariance between parents and children, to demonstrate parameter estimation within a unified framework of interval mapping and LD mapping, and then extend the method to general pedigree data.
With N unrelated nuclear families randomly drawn from a general population, the overall likelihood is the product of individual family likelihoods, denoted L1, L2, ... , LN. Let us present an example to demonstrate how to build the likelihood function. In the example, family i consists of a mother with diplotype AQB/AQB (
) and phenotype yim, a father with AQB/Aqb (
) and yif, and two children with diplotypes and recombination configurations AQB/AQB and NN/RN (
) and AQB/AQb and NN/NR (
), respectively, and phenotypes
and
, respectively. The likelihood can be expressed by a three-level hierarchical model,
where
is the vector of unknown parameters containing three subsets of population genetic parameters (haplotype frequencies,
P), penetrance parameters (e.g., genotypic values and the residual variance,
Q), and position parameters (recombination fractions,
R), related to the parental diplotype distribution, the phenotype density functions, and
, respectively.
represents the conditional probability of child j of family i having diplotype
and recombination configuration
given parental diplotypes
and
. The overall likelihood can be represented as
 | (1) |
where the y's are the phenotypic vectors;
's are the diplotype vectors;
is the recombination configuration for the children; Ni is the number of children within family i; nAQB, nAQb, ... , naqb are the numbers of haplotypes AQB, AQb, ... , aqb appearing in parental diplotypes, respectively;
and
are the numbers of recombinants and nonrecombinants between loci
and
existing in the recombination configurations, respectively; and
and
are those between
and
, respectively.
In many cases, information is partial because of experimental errors, financial limitations, or other practical constraints, as often occurs in studies of late-onset diseases such as Alzheimer's disease where parents are unavailable. Since missing phenotypic observations can be treated by simply setting the corresponding f(y|D)'s equal to 1 wherever they occur in the above likelihood, Equation 1 automatically covers the likelihoods of family data with missing phenotypes like TDT-type data. For data with missing diplotypes such as sibship data, instead of Equation 1 we can use a form of mixture model summing over all plausible diplotypes and/or recombination configurations compatible with the available data to represent such likelihoods and so address the statistical analysis within the EM framework described in The incomplete data likelihood section.
Equation 1 can be generalized to the case of N pedigrees,
 | (1') |
where the likelihood of pedigree i,
assuming that the rightmost is ordered as ELSTON and STEWART's (1971) recursive form in which
represents the probability of either child j given the parental diplotypes or founder j within pedigree i;
and
are the parental diplotypes of nonfounder j within pedigree i, respectively; yF and yN are the founder and nonfounder phenotypic vectors;
and
are the founder and nonfounder diplotype vectors;
is the recombination configuration for nonfounders, respectively; NiF and NiN are the numbers of founder(s) and nonfounder(s) within pedigree i; nAQB, nAQb, ... , naqb are the numbers of haplotypes AQB, AQb, ... , aqb appearing in founder diplotypes, respectively; and
,
, and
are the numbers of recombinants and nonrecombinants between loci
and
and between
and
across all N pedigrees, respectively.
The maximum-likelihood estimator can be derived through differentiating the log-likelihood with respect to
and then setting each derivative equal to 0 and solving the set of simultaneous equations. Define the identity indicators
The MLEs for the likelihood (1) are
 | (2) |
for quantitative traits, and
for categorical traits, where G
{QQ, Qq, qq}; and indicators I(y = yim), I(y = yif), and I(y = yijo) are 1 when y = yim, y = yif, and y = yijo, respectively, and 0 otherwise. The MLEs of the recombination parameters for likelihood (1') are the same as those for likelihood (1), and the other MLEs have similar forms,
 | (2') |
for quantitative traits, and
for category traits.
Unlike the traditional approach, for flexibility we make here no assumption such as that the recombination fraction between the two markers can be known a priori. If the recombination fraction between the two markers (
AB) is available, however, the corresponding terms with respect to one of the recombination fractions,
AQ and
BQ, will disappear from the above estimation procedure since any one of the two is a function of the other one and of
AB. A grid search procedure can also be used for estimating QTL position on the basis of the preceding methodology.
The incomplete data likelihood:
In practice, only marker genotype and phenotype data are observed, whereas the data on diplotypes, recombination events, and QTL genotypes are hidden. The observed data are mixtures of component complete data, and the statistical analysis becomes a typical mixture issue. Let us go back to the above example again and assume that only marker genotypes AABB (Mim), AABb (Mif), AABB (
), and AABb (
) and phenotypes
, and
are available for the mother, father, and two children of family i, respectively. Now
is a mixture of diplotypes AQB/AQB, AQB/AqB, and AqB/AqB; and Mif is composed of diplotypes AQB/AQb, AQB/Aqb, AQb/AqB, and AqB/Aqb; and both
and
also consist of unidentified diplotype(s) together with recombination configuration(s) nested within the paired parental diplotypes. The likelihood can be formulated as
where
denotes summation over all recombination configuration(s) by taking "*" as either recombination or nonrecombination that is compatible with parent and child diplotypes; Li(AQB/AQB, AQB/AQb), Li(AQB/AQB, AQB/Aqb), ... , are probabilities of the mother and father of family i with diplotypes AQB/AQB and AQB/AQb, AQB/AQB and AQB/Aqb, ... , respectively; and
denotes summation over all pairs of
compatible with the observed marker phenotypes in family i. The partial derivative of the log-likelihood of family i is
where
and
are the posterior probabilities that the mother and father of family i have diplotypes
and
and that child j from family i has diplotype
and reduced recombination
produced by the mother and father diplotypes
and
, respectively; e.g.,
and
The grand likelihood of the incomplete data, including the phenotype (y) and marker information (M), can be represented as
 | (3) |
where Mm, Mf, and Mo are the marker genotypes of the mothers, fathers, and children, respectively.
Differentiating the log-likelihood of Equation 3 leads to
 | (4) |
where nAQB*, nAQb*, ... , naqb* are the expected numbers of haplotypes AQB, AQb, ... , and aqb, respectively;
, and
are the expected numbers of recombinants and nonrecombinants between
and
and between
and
, respectively; and sums are taken over all diplotypes and recombination configurations consistent with the marker genotypes.
Similarly, the pedigree-based likelihood is
 | (3') |
where MF and MN are the marker genotypes of founder(s) and nonfounder(s), respectively, the last line is placed in a recursive order, and
denotes summation over all diplotype(s) and/or recombination configuration(s) compatible with the observed data. The partial derivative is
 | (4') |
We can adopt the peeling algorithm (ELSTON and STEWART 1971) to calculate the likelihood and the posterior probabilities. Under the assumption of linkage equilibrium, our approach reduces to an EM version of ELSTON and STEWART's (1971) algorithm. For pedigree(s) with loop(s), we can use LANGE and ELSTON's (1975) method to break the loop(s).
We implement the EM algorithm (DEMPSTER et al. 1977) to estimate the parameters of the likelihood function, i.e., haplotype frequencies
P, QTL genotypic effects and residual variance or penetrances
Q, and recombination fractions
R. In the E-step, we update the posterior probabilities and expected numbers conditional on the initial values or the estimates of the current iteration. In the M-step, substituting expected numbers nAQB*, nAQb*, ... , naqb*,
, and
and posterior probabilities
, and
for
,
, and
in Equations 2 for likelihood (3), respectively, G
{QQ, Qq, qq}, we compute the next cycle of MLEs of the unknown parameters. Likewise, we perform a similar M-step in (2') for the pedigree-based likelihood (3'). These two steps are repeated until convergence is attained. Allele frequencies and linkage disequilibria, QTL additive and dominance effects, and relative locations on the chromosome can be calculated from the haplotype frequencies, QTL genotypic effects, and recombination fractions, respectively.
The asymptotic variancecovariance matrix of the MLEs:
LOUIS' (1982) procedure or the supplemented EM (SEM) (MENG and RUBIN 1991) that embeds the computation of the observed information within the EM iteration can be adopted to obtain the asymptotic variancecovariance matrix for MLEs of haplotype frequencies, genotypic effects, residual variance, and recombination fraction(s). In our computer program we use the improved equations of LOUIS' (1982) method, LOU et al.'s (2005) (C3) and (C5), to compute the observed information matrix of the parameters (i.e., the haplotype frequencies, penetrance or genotypic effects plus residual variance, and recombination fractions). The information on other parameters can be calculated with that for these basic parameters. The variancecovariance matrix for genetic effects and allele frequencies can be calculated easily since they are linear functions of the haplotype frequencies or genotypic effects. The approximate variances of the linkage disequilibria can be found by the delta method, based on their Taylor series expansions. If the parameter vector
is a function of the basic parameter
, i.e.,
= f(
), then the approximate variancecovariance of
is given by
 | (5) |
For example,
where
Hypothesis testing:
The following hypotheses are tested sequentially: (1) the existence of a trait gene and (2) various submodel hypotheses. The existence of a trait gene with significant effects can be tested by calculating a log-likelihood ratio (LR) test statistic under the null (H0: there is no trait-causing gene) and alternative hypotheses (H1: there is a trait-causing gene) as
for quantitative traits and
for categorical traits. The LR under the null hypothesis is asymptotically
2-distributed with corresponding degrees of freedom for a fixed set of frequencies and relative position of the putative gene. However, because these are nuisance parameters under H0, the regularity conditions required for the
2-distribution of the LR statistic are violated. Parametric or nonparametric bootstrap (e.g., the permutation procedure proposed by CHURCHILL and DOERGE, 1994) can be adopted to determine a critical threshold for declaring the presence of a gene at a given significance level.
After rejecting the hypothesis of no gene, the tests for particular subsets of hypotheses regarding gene action mode, gene position, and/or LD coefficient(s) can be conducted in tandem with the corresponding LR statistics that are approximately distributed as
2-statistics with degrees of freedom equal to the relevant numbers of parameters being tested.
Benefiting from making full use of both complementary components of information on correlated transmission within pedigrees and correlated occurrence at the population level, the proposed approach is expected to have greater analytical accuracy and testing power. To validate our theoretical expectation, we conducted a series of simulations under a variety of disease models and degrees of LD to compare the performance of three methods: an FBAT, pure linkage (PL) analysis, and the combined linkage and association analysis (LLD).
 |
SIMULATION STUDIES
|
|---|
A model with two diallelic loci, one marker and one disease gene each with a minor allele frequency of 0.4, was considered in our simulation studies. LLD was run by a computer program written in the C++ language, while PL analysis was performed by the EM version of ELSTON and STEWART's (1971) algorithm. Average MLE, mean square error (MSE), and the power of both LLD and PL were computed on the basis of 200 simulations for each case. Power calculation of the FBAT was implemented with the PBAT software package on the basis of simulation using the default choice (LANGE and LAIRD 2002; LANGE et al. 2002). Unless otherwise stated, all powers were evaluated at the 0.05 significance level for a null hypothesis of no linkage. The complete details of the scenarios used in the simulations are given in the relevant text and tables of this section.
To confirm that PL analysis may result in a biased estimation in the presence of association while our new approach can remedy this limitation, we first conducted a set of simulations for a comparison between LLD and PL. Although such a case may represent an extreme one, for full exposition we concentrate here on a completely penetrant codominant disease model and, theoretically, the general conclusions from this will also be valid for complex models. TDT-type (including parent and child marker genotypes and child phenotypes) and sib-type (including sibling marker genotypes and phenotypes) data were simulated on a sample consisting of 300 nuclear families with two children and 200 families with three children at two LD levels,
= 0.1 (normalized LD,
' = 0.417) and
= 0.2 (
' = 0.833), and two linkage levels,
= 0.05 and
= 0.2, respectively. Only the results on the MLE and MSE of the recombination fraction are shown in Table 2, since the MLEs of the other parameters, such as allele frequencies and LD coefficient (for LLD), have an excellent accuracy and the statistical power is very high. Table 2 shows that PL yields a large bias (
) in both TDT-type and sib-type designs. For example, the bias and the root MSE of the estimated recombination fraction are 0.065 and 0.069 for TDT-type design and 0.149 and 0.150 for sib-type design, respectively, when true parameters are
= 0.2 and
= 0.2. This implies that the result from linkage-only analysis is less reliable when association is present. As expected, however, LLD has highly precise estimation. All the absolute values of the bias from LLD are <5% of the parameter values, a conventional criterion for unbiased estimation, and all the MSEs are much less than their counterparts from PL. The bias and the root MSE are 0.001 and 0.017 for the TDT-type design and 0.001 and 0.023 for the sib-type design, respectively, when
= 0.2 and
= 0.2.
View this table:
[in this window]
[in a new window]
|
TABLE 2 Average MLEs (and root MSEs) of the recombination fraction ( ) from pure linkage analysis (PL) and combined linkage and association analysis (LLD) at two levels of LD for TDT-type and sib-type designs, respectively
|
|
To demonstrate that LLD can give an unbiased estimate of the recombination fraction and further test an arbitrary null hypothesis, say H0:
= 0.1, in such a way that it has an advantage over FBATs in being capable of identifying tight linkage, we carried out simulations on the basis of a classic TDT-type design consisting of 500 nuclear families with a single child per family under a fully penetrant codominant model. As before, we considered two LD levels,
= 0.1 (
' = 0.417) and
= 0.2 (
' = 0.833), and two tight linkage levels,
= 0 and
= 0.05, respectively. Powers were calculated for the hypotheses H0:
= 0.5 and H0:
0.1, respectively, in LLD analysis. The MLE and MSE of the recombination fraction and the corresponding powers are presented in Table 3. As shown in Table 3, LLD gives an accurate estimate and high power for both null hypotheses at
' = 0.833; e.g., the bias and the root MSE are 0.007 and 0.014, the powers for both H0:
= 0.5 and H0:
0.1 are 1.0, in the case of
= 0, and the bias and the root MSE are 0.004 and 0.024, and the powers are 0.995 and 0.645, in the case of
= 0.05, respectively. LLD has reasonable estimation accuracy and test power at
' = 0.417. These results suggest that LLD can offer the possibility of distinguishing strong association and loose linkage from weak association and tight linkage, even in the case of only one child per family.
Next we consider a more common case where the disease gene affects a quantitative phenotype. TDT-type data were generated on a sample that consists of 300 families with two children each and 200 families with three children each under an additive model (no dominance effect, i.e., µQQ = µ + a, µQq = µ, and µqq = µ a, where µ and a are the mean and additive effect, respectively). We assumed that a marker locus is completely linked to the disease susceptibility locus but with varying degrees of LD (from 0 to 0.1) and heritability (from 0.1 to 0.4). The results of the power comparison of the three methods are summarized in Figures 1 and 2, while only the estimated parameters from LLD and PL are shown in Table 4, because the nonparametric FBAT approach cannot perform parameter estimation. Figure 1 shows power plotted against LD, where a, b, c, and d are for heritabilities 0.1, 0.2, 0.3 and 0.4, respectively. Figure 2 shows power plotted against heritability, where a, b, c, d, and e are for no LD (
= 0),
= 0.025 (
' = 0.104),
= 0.05 (
' = 0.208),
= 0.075 (
' = 0.313), and
= 0.1 (
' = 0.418), respectively.
View this table:
[in this window]
[in a new window]
|
TABLE 4 Average MLEs (and root MSEs) from the approach of either pure linkage (PL) or combined linkage and association analysis (LLD)
|
|
Clearly, the power profiles shown in Figures 1 and 2 support our expectation. As the degree of LD increases, so does the power of the FBAT and LLD, whereas that of PL is almost unchanged or increases little (Figure 1, ad). The power also increases with heritability for most cases, but when there is no LD, the FBAT has no power regardless of the value of the heritability (Figure 2a). Generally speaking, it appears that LLD is the most powerful, followed by PL and then the FBAT when LD is absent or weak (
' <
0.2; Figure 2, a and b) or by the FBAT and then PL when LD is strong (Figure 2, ce). Other than the cases of no LD, where PL has power close to that of LLD, LLD is much more powerful than PL. Also, LLD always performs better than the FBAT, even under situations with strong LD (
'
0.313), where the power of the FBAT approaches that of LLD. This is not surprising, because more than one sibling is available in our simulations and hence, in theory, information on allele sharing between siblings should contribute to detecting linkage except for the case of no linkage, where there is no practical importance as there is no interest in testing for linkage with a type I error. The power comparison indicates that the union of two complementary components of information allows LLD to be more powerful.
Unlike FBATs, our new approach can also achieve parameter estimation for gene effects, allele frequencies, LD coefficient, and recombination fractions, so that it can provide more knowledge regarding disease etiology. Table 4 lists some typical results on the comparison between LLD and PL, but the results are not shown when LD is absent or weak, as LLD has an estimated result similar to that of PL. In the latter situation, both LLD and PL gave unbiased estimates, although LLD appeared to have slightly larger MSE, but the difference was very small. Such results are highly consistent with our expectation because the assumption of linkage equilibrium is indeed satisfied for linkage-only analysis while one needs to estimate one more unknown parameter for LLD. But for cases with slightly stronger LDs, such as
'
0.208, LLD gained much improvement in estimation accuracy, which is reflected by bias and MSE, over linkage-only analysis (see Table 4). The bias and MSE of the estimated parameters from LLD are almost uniformly less than their counterparts from PL. This is in good agreement with theoretical expectations. In general, the estimation accuracy increases with LD, and LLD improves more than PL. The magnitude of improvement differs with the various parameter values. The estimates of recombination fraction and genetic effects are greatly affected by LD level, while those of population mean and variance are less affected. In some cases, ignoring LD may result in a large bias and MSE in PL. The comparison of parameter estimation strongly indicates that it is necessary to capitalize on the information from population association to get a better and more reliable estimation.
 |
APPLICATION
|
|---|
To demonstrate its use, we applied our new algorithm to a BMD genetic study conducted at Creighton University. A total of 1873 subjects from 405 Caucasian pedigrees containing 740 parents/grandparents and 434 sibships were included in the study. The pedigrees varied in size from 3 to 12 and the mean size was 4.86 while the sibships ranged from 1 to 10 and averaged 2.61. Three SNPs within the vitamin D (1,25-dihydroxyvitamin D3) receptor (VDR) gene, ss12568610, ss12568583, and ss12568608, were chosen to test association with BMD. A detailed description of the clinical subjects and SNP-related information, such as primers/probes and genotyping conditions, has been reported in a separate study (LIU et al. 2005). Several BMD-related traits were measured in the study (LIU et al. 2005) and we used the BMD at the one-third region of the wrist as an example here. The phenotypic values of the BMD range from 0.349 to 0.997. Our segregation analysis suggested that there is a major gene underlying this trait (data not shown). The coefficients of skewness and kurtosis of the residual effects are 0.123 and 3.175, respectively, which can be regarded as having an approximately normal distribution.
The results analyzed by FBAT and our LLD approach are presented in Tables 5 and 6 for P-values, MLEs, and standard errors (SEs), respectively. The three P-values in Table 5 are for null hypotheses
= 0.5,
0.2, and
0.1, respectively, in LLD analysis, while the P-values are for
= 0.5 in FBAT. After correction for multiple testing, the LR statistic still remains highly significant (minimum P = 0.001 for H0:
= 0.5), whereas the FBAT statistic shows only marginal significance (P = 0.040) for ss12568583. Furthermore, the results of parameter estimation show that all the three SNPs are very tightly linked to the putative disease gene, i.e., have near zero estimated recombination fractions and small SEs, but different frequencies from those of this gene (see Table 6). All estimates from the three SNPs are very consistent, which indicates that, very likely, a gene responsible for BMD is located within or near the VDR gene but the genotyped SNPs do not seem to be the causal variant. The MLEs of
' are 0.045, 0.177, and 0.021 for SNPs ss12568610, ss12568583, and ss12568608, respectively, suggesting that the associations between the gene and the SNPs are weak. This may be the reason why this gene can elude most FBAT gene-hunting strategies such as QTDT and FBAT. Our approach also gave estimates of the penetrance parameters. As shown in Table 6, the gene has a large genetic effect and displays an incompletely dominant mode of inheritance. In summary, our results indicate that the VDR gene is significantly linked to that for BMD, especially for SNP ss12568583.
View this table:
[in this window]
[in a new window]
|
TABLE 6 MLEs (and standard errors) for allele frequency, LD, recombination fraction, additive and dominance effects, mean, and variance on using different VDR SNPs
|
|
 |
DISCUSSION
|
|---|
This study was motivated by the fact that traditional mapping methods, e.g., FBATs and the allele-sharing method, utilize only one component of genetic information, either on linkage or on association, often leading to inefficiency and inaccuracy, although they have desirable properties in some specific cases, e.g., if the assumption of no LD is approximately satisfied in linkage analysis or if the marker tested is exactly the trait gene itself in FBATs. Owing to its ignoring association, the weakness of linkage analysis motivated RISCH and MERIKANGAS (1996) to conclude that the allele-sharing method may be hardly up to the task of identifying genes underlying complex traits. On the other hand, however, the TDT may be not as ideal as RISCH and MERIKANGAS (1996) claimed because such a perfect case (i.e., perfectly associated, with no recombination and the same allele frequencies) rarely occurs in real data sets, even in a fine-mapping context. Mostly, there are less extreme cases with diverse degrees of LD between both tightly linked loci and loosely linked loci, arising from mutation, recombination erosion, or population admixture. Therefore, it is necessary to develop new approaches that can improve the FBAT's power for successful gene hunting. Heuristically, exploiting information on allele sharing contained in each sibship can achieve this aim and also circumvent the weakness that association is required for linkage to be detected. Such attempts have been pursued by a number of researchers (e.g., ZHAO et al. 1998; XIONG and JIN 2000; CANTOR et al. 2005; LI et al. 2005). In our previous study (LOU et al. 2005), we reported the development of a statistical framework with two hierarchies. In this article, we address a further issue, i.e., developing mapping models with an arbitrary number of hierarchies to handle complex pedigrees. Thus, the proposed approach not only is capable of accommodating multiple loci and/or multiple alleles so that it is easy to tackle interval mapping, multiple interval mapping, and epistasis models, but also allows for diverse types of traits and pedigree structures. Haplotypes in founders contribute information for an association study while informative and partially informative meioses do so for linkage analysis. We unify segregation, linkage, and association analyses into a comprehensive mapping strategy and thus can capture the two complementary aspects of the genetic architecture. The proposed approach has the properties of both linkage analysis and association analysis. From the viewpoint of linkage, it is a LOD score method that is adaptive to the amount of LD. It can make use of LD, if it is indeed present, while it reduces to the standard LOD method when LD is weak or absent. From another viewpoint, it is an association study that incorporates haplotyping analysis in pedigrees and genotyping by a progeny test at a disease locus. For singleton data, it reduces to a parametric association study (e.g., LOU et al. 2003; SHIBATA et al. 2004). Although the EM algorithm rather than the quasi-Newton method is used to maximize the likelihood function, our approach is a direct generalization of that of CANTOR et al. (2005) to multiple loci. The model of LI et al. (2005) is also a specific application of the new method, which assumes that the candidate SNP is completely linked to the disease locus and that flanking markers are in linkage equilibrium with one another, the SNP, and the disease locus. Both the corresponding LR statistics, testing for linkage equilibrium and complete LD, can be constructed by using the new approach.
Another contribution of this article is that it shows, through systematic simulation studies and an application, the important conclusion that a mapping bonus can be obtained by combined linkage and association analysis without any increase in experimental expense. This is highly consistent with theoretical expectation. First, the improvement in mapping resolution arises from the marriage of linkage and association. In a gene-hunting context, latent data exist such as disease genotype, inheritance vector, and linkage phase. Parameter estimation and statistical inference rely on accurate genetic reconstruction of such ambiguous data, i.e., statistical imputation. Violation of the assumption of linkage equilibrium leads to inaccurate imputation in pure linkage analysis so that it may give a biased result, as demonstrated in this article. On the other hand, the assumption of linkage equilibrium also affects imputation precision, owing to its resulting in a likelihood that retains maximum uncertainty about which component of the mixture distribution generates the data, and hence is least informative for the recombination parameter
. Theoretically, integrating both complementary components increases imputation accuracy, leading to improvement in mapping accuracy, precision, and power over traditional linkage analysis. Intuitively, linkage induces more gene concordance between related individuals having similar phenotypes, while the opposite holds true for those with disparate phenotypes. Incorporating this information, which FBATs fail to do, gives our LLD approach a higher power than that of FBATs. This type of phenomenon has also been widely observed with comparisons of the TDT, the conventional affected sib pairs, and combined methods (HUANG and JIANG 1999; WICKS and WILSON 2000; LAZZERONI 2002). Both our simulation and real data studies support our theoretical expectation.
Second, the improvement may also come from two other potential sources, although they are not explored in this article. The LLD approach integrates population-based association analysis and pedigree-based linkage analysis into a coherent framework so that it can handle diverse types of data, including full sibs, half sibs, cousins, nuclear families, extended nuclear families, complex pedigrees, and singletons, as well as their mixtures. Unlike those of HUANG and JIANG (1999), WICKS and WILSON (2000), and LAZZERONI (2002), which require only affected pairs with at least one heterozygous parent, our approach allows for analyzing any type of data structure, including singletons and pedigrees without any informative meioses, which do not contribute information to linkage parameter(s) but do inform association parameter(s). Without dropping any type of mapping data, we make use of data to the maximum extent, leading to the possibility of improving mapping performance. Furthermore, our flexible framework is easily applied to a multipoint analysis. It has been well documented that multipoint analyses can extract more statistical information than pairwise ones and thus may substantially increase the power and reduce spurious results (LATHROP et al. 1984, 1985). Conceivably, unifying multilocus linkage and association mapping will further improve mapping resolution.
Our current version of the program is capable of handling five to six loci on a PC computer if only limited amounts of data are missing. Although it allows for more loci and alleles on a workstation or a PC cluster with more memory and storage, computation can be very time-consuming for a large number of loci and alleles because the required memory and time exponentially increase with the number of loci. To avoid a formidable computational burden, the simulation-based versions of the EM algorithm such as stochastic EM and Markov chain Monte Carlo (MCMC) EM (THOMPSON 1994; CELEUX et al. 1996), based on estimating conditional posterior probabilities in the E-step rather than computing them exactly, can be used. The approximate methods based a composite likelihood (e.g., RANNALA and SLATKIN 2000) also seem to be the feasible ways to tackle this problem. The relevant study is under way.
Moreover, unlike the model-free approaches such as allele sharing and FBATs, which can tell us only whether linkage or association exists but fail to provide any estimates of what values they have, the proposed LLD approach can simultaneously provide parameter estimation of genetic distance, allelic association, and genotypephenotype relationship and also perform various types of hypothesis testing. For example, we can perform a comparison of the analyses, including or not the LD information to assess the validity of the LD model assumptions. Thus, this approach exposes more genetic mechanisms than FBATs to genetic etiology and hence increases the predictability of gene mapping.
Finally, as pointed out by PÉREZ-ENCISO (2003), given the diversity of genetic architectures and population histories, it is unlikely that a single statistical approach will be valid for all cases. The approach described here is subject to the same limitations faced by all model-based methods, i.e., the requirement of a correct, or close to correct, model for the trait under study. If the model for predicting disease status from phenotypes is not sufficiently well known, this approach cannot perform well. Therefore, this model-based LLD approach should serve as a supplement to model-free methods in tracking the gene(s) underlying complex diseases, once model-free methods have suggested how many loci are involved and their approximate locations in the genome (ELSTON 1998).
 |
APPENDIX
|
|---|
For affected sib pair data with a recessively inherited disease as described in the text (p = 0.5), under the assumption of linkage equilibrium we have the log-likelihood
 | (A1) |
where nAA,AA, nAA,Aa, nAA,aa, nAa,Aa, nAa,aa, and naa,aa are the numbers of affected sib pairs with marker configurations {AA, AA}, {AA, Aa}, {AA, aa}, {Aa, Aa}, {Aa, aa}, and {aa, aa}, with true probabilities
, and
, respectively; and
and
0 (a specific value) are the recombination fractions between the marker and the disease gene. The partial derivative with respect to
, the score, is

| (A2) |
Because
whatever value
takes, the assumed likelihood is a monotonically decreasing function of
in the interval [0, 0.5], and hence
is the MLE, even if there is no linkage; i.e.,
0 = 0.5.
 |
ACKNOWLEDGEMENTS
|
|---|
This work is funded in part by a grant from the National Institute on Drug Abuse to M. D. Li (DA-12844), by grants from the National Center for Research Resources (RR03655) and the National Institute of General Medical Sciences (GM28356) to R. C. Elston, and by a grant from the National Science Foundation of China (30000097) to X.-Y. Lou.
 |
LITERATURE CITED
|
|---|
ABECASIS, G. R., L. R. CARDON and W. O. COOKSON, 2000 A general test of associa