## Abstract

There are generally three steps to isolating a disease linkage-susceptibility gene: genome-wide scan, fine mapping, and, last, positional cloning. The last step is time consuming and involves intensive laboratory work. In some cases, fine mapping cannot proceed further on a set of markers because they are tightly linked. For years, genetic statisticians have been trying different ways to narrow the fine-mapping results to provide some guidance for the next step of laboratory work. Although these methods are practical and efficient, most of them are based on IBD data, which usually can be inferred only from the genotype data with some uncertainty. Such methods thus have no greater power than those using genotype data directly. Also, IBD-based methods apply only to relative pair data. Here, using genotype data, we have developed a statistical hypothesis-testing method to pinpoint a SNP, or SNPs, suspected of responsibility for a disease trait linkage among a set of SNPs tightly linked in a region. Our method uses genotype data from affected individuals or from case-control studies, which are widely available in the laboratory. The testing statistic can be constructed using any genotype-based disease-marker disequilibrium measure and is asymptotically distributed as a chi-square mixture. This method can be used for singleton data, relative pair data, or general pedigree data. We have applied the method to simulated data as well as a real data set; it gives satisfactory results.

RECENTLY, genome-wide scans have been widely used in the study of complex genetic diseases such as cardiovascular disease, obesity, diabetes, and schizophrenia, owing to advances in biological science that allow hundreds of markers to be genotyped quickly at reduced cost. Subsequent fine-mapping studies have also been frequently reported, narrowing the linkage region for a disease trait to about one or a few centimorgans. However, very few of these studies reach the final step of positional cloning to isolate the gene responsible for the linkage to a complex disease. Part of the reason is that the process involves handling genomic DNA spanning millions of base pairs in the linkage region, sequencing large numbers of overlapping genomic DNA fragments, and genotyping tens or hundreds of markers in the region, which takes intensive work in an ordinary laboratory. In some cases, fine mapping cannot proceed further on a set of markers because they are tightly linked. For years, genetic statisticians have been trying to develop parametric and/or nonparametric methods to pinpoint the linkage to one or very few markers suspected of being truly responsible for the linkage of a disease trait and to exclude those merely in linkage disequilibrium (LD) with the susceptibility markers.

Difficulty in the identification of specific disease-predisposing alleles may result from multiple genetic factors (Tait and Harrison 1991; Thomson 1991). Greenberg (1993) and Hodge (1993) considered the analysis of "necessary" *vs.* "susceptibility" loci, in which the associated marker allele itself increases disease susceptibility but is neither necessary nor sufficient for disease expression. The conditioning method is one of the typical statistical tools for studying such problems. Fulker *et al.* (1999) developed a conditioning method using the variance-components model. This method tests both linkage and association at the same time, so it indicates whether a locus is a candidate locus for the trait or is just in LD with the candidate locus. This idea was further expanded by Cardon and Abecasis (2000), who proposed a combined linkage and association method using the variance-components model. Valdes and Thomson (1997) and Siegmund *et al.* (2001) used the conditioning method to narrow down the association region. Lazzeroni and Lange (1998) proposed such a framework in the transmission/disequilibrium test. Furthermore, Soria *et al.* (2000) considered a conditioning argument to pinpoint the linkage of the G20210A mutation in the prothrombin gene to the disease gene. On the other hand, Blanger *et al.* (2000) studied a Bayesian variance-components method, and Horikawa *et al.* (2000) used a modified association-study method, which identified a single-nucleotide polymorphism (SNP), SNP43, showing significant association with the evidence for linkage with type 2 diabetes.

Recently, Sun *et al.* (2002) proposed a statistical method for this problem. They used a conditioning hypothesis-testing procedure to pinpoint, among a set of tightly trait-linked genes, a single susceptible marker or a few such markers, using identity-by-descent (IBD) data from affected sibships. This method is based on a genome-wide scan result that identified a region showing strong linkage with a putative trait. Often the markers in such a region are tightly linked among themselves. The goal of the method is to identify which of those markers are truly responsible for the linkage and which are merely tightly linked to such putative markers. The method is practical in application and yielded good results in their simulation studies.

However, most of the existing methods for this problem use IBD data on paired family members. Usually IBD data are not fully available in practice and can be inferred only from genotype data with uncertainty and often inconsistently from different methods used. Inference based on them has no greater power than that based on genotype data, unless the IBD data are a sufficient statistic for the parameters underlying the model. Also, IBD-based methods apply only to relative pair data.

Here we present a method for this problem by formulating a set of conditional hypothesis tests, in this respect similar to Sun *et al.* (2002), but we use genotype data instead, and our testing statistic is different in nature from theirs. Using any genotype-based trait-marker disequilibrium measure, the testing statistics are constructed by successively conditioning on each of the tightly linked SNP sites. Our method is nonparametric: it requires neither model specification nor phase information in the data. It applies to family data of arbitrary structure, including singleton data, in which each individual comes from a different, independent family. Under the null hypothesis that a site is the sole susceptible site, each of these statistics asymptotically follows a chi-square mixture distribution. The corresponding *P* values are easily obtained via simulation.

## THE METHOD

### The data:

Let *A* be the unknown disease allele, whose position in the human genome we want to infer. Assume that there are *J* identified SNP markers *M*_{j} (*j* = 1, … , *J*), with alleles *M*_{jk} (*k* = 1, 2), which are brought to our attention because of their tight linkage to the disease allele. A natural question is whether all of them are susceptible genes for the linkage or whether some of them show disease linkage only because of their strong linkage with the true susceptible gene(s). Our goal here is to identify the true susceptible SNP(s), if any, among them.

For ease of explanation we first describe our method for singleton data and then extend it to general pedigree data in a later section. We now describe a general procedure for the conditional inference of this problem; the construction of the specific testing statistic is detailed later. Let *G* = (*G*_{1}, … , *G*_{J}) be a general notation for the composite SNP genotype at all the SNP loci, where *G*_{j} = (*g*_{j1}, *g*_{j2}) is its allelic notation; let *G*_{nj} = (*g*_{nj1}, *g*_{nj2}) be the observed genotype of the *n*th individual at the *j*th SNP locus (*n* = 1, … , *N*; *j* = 1, … , *J*); and let *G*_{n} = (*G*_{n1}, … , *G*_{nJ}) be the vector notation of the observed composite genotype of individual *n*. The data to be used are *G*_{1}, … , *G*_{N}, the observed composite genotypes of *N* individuals at *J* SNP loci each.

Here we assume, as is common practice, that at each SNP locus there are two different alleles in the population; we code them as 1 and 2, although the same value at different loci may have different allelic meaning. At each locus, we code the genotype as *G*_{nj} = 0 when *g*_{nj1} ≠ *g*_{nj2}, *G*_{nj} = *I* when (*g*_{nj1}, *g*_{nj2}) = (1, 1), and *G*_{nj} = *II* when (*g*_{nj1}, *g*_{nj2}) = (2, 2). Note that we have two representations of a SNP genotype, one allelic and one numerical. Which one is used, even within the same expression, depends on convenience.

### The disequilibrium measure and the conditioning principle:

The proposed conditional testing procedure uses testing statistics, which are constructed via the conditional version of any trait-marker LD measure using genotype data. We first state the conditional testing principle and then give the specific forms of the testing statistic for some particular data designs.

Now we describe the trait-marker LD measure. Let *p*_{A} be the population frequency of the disease allele *A*, *q*_{jk} that of allele *k* of marker *j*, and *P*_{A,jk} that of the haplotype (*A*, *M*_{jk}). Let *D*_{A,jk} = *P*_{A,jk} − *p*_{A}*q*_{jk} be the LD measure between the disease allele *A* and allele *k* of marker *j*. Since the position of *A* is unknown, *p*_{A}, *P*_{A,jk}, and thus *D*_{A,jk} cannot be directly estimated from the observed data; instead various quantities are constructed to infer it.

When *D*_{A,jk} is positive, the marker allele *M*_{jk} is more likely to be associated with the disease-susceptible allele *A* than would be expected by chance. The disequilibrium measures *D*_{A,jk} are among the main tools for finding the association between a marker locus (or loci) and the disease locus. There are numerous ways to construct inference statistics from the *D*_{A,jk}'s, some using relative-pair IBD data at markers and some using marker genotype data (Bengtsson and Thomson 1981; Lehesjoki *et al*. 1993; Feder *et al*. 1996; Nielsen *et al*. 1998). Here we develop the conditional version of the genotype-based method.

Let *q*_{jk|i} be the population frequency of allele *k* of the *j*th SNP conditional on the *i*th SNP genotype (*k* = 1, 2). Let *P*_{A,jk|i} be the population frequency of the disease-SNP haplotype (*A*, *M*_{jk}) at the *j*th marker locus conditional on the *i*th SNP genotype, and let *P*_{jr|i} be that of the homozygote SNP genotype *r* at the *j*th SNP locus conditional on the *i*th SNP genotype (*r* = *I*, *II*). We choose the conditional LD measure at the *j*th locus, given the *i*th SNP genotype, as

*D*_{j|i} = *P*_{A,j1|i} − *p*_{A}*q*_{j1|i}. (1)

Note that *P*_{A,j2|i} − *p*_{A}*q*_{j2|i} = −(*P*_{A,j1|i} − *p*_{A}*q*_{j1|i}), so only one of the marker alleles is needed to define this disequilibrium. Our motivation for using the conditional LD measure is that if marker *i* is the sole susceptible site of linkage to the disease allele, then the genotype data from this site constitute a sufficient statistic for this measure; in other words, it will explain all the disequilibria in the region. Thus, conditioning on the site of interest, the disequilibrium parameters *D*_{j|i} vanish from the conditional distribution of the data, for all *j* ≠ *i*.

In the following we explain what the conditioning actually means in practice. Suppose we have genotype data for 502 individuals at two SNP loci; each locus has two alleles, and each allele takes one of two forms, coded as 1 and 2. The genotype at each locus is thus represented as (1, 1), (1, 2) = (2, 1), or (2, 2). The supposed observed genotype frequencies for the two loci are given in Table 1.

By conditioning on the genotype at the first locus being (1, 1), we mean the subgroup of 156 individuals whose genotype at the first locus is (1, 1). Within this subgroup, the genotype at the second locus is denoted as locus 2/(1, 1), and similarly for conditioning on the first-locus genotype being (1, 2) or (2, 2). Thus, conditioning on the first-locus genotypes (1, 1), (1, 2), and (2, 2) separately, the data are divided into three nonoverlapping subsets, and we obtain the genotype frequencies of the second locus as shown in Table 2. Likewise, conditioning on the second-locus genotypes separately, we get the genotype frequencies of the first locus as shown in Table 3.

In conditional testing, test statistics are constructed from the data in Tables 2 and 3. For example, to test the hypothesis that locus 1 is the only susceptible site, conditioning on it yields three subtables. If the hypothesis is true, the LD vanishes in each subtable, and the test statistics constructed from them should be nonsignificant.
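In code, this conditioning is nothing more than stratifying the sample by the genotype at the conditioning locus and recomputing frequencies within each stratum. Below is a minimal sketch with a small made-up data set (the counts are illustrative, not those of Tables 1-3):

```python
from collections import Counter

def condition_on_locus(data, locus, genotype):
    """Return the subgroup of individuals whose genotype at `locus` equals `genotype`.
    Each individual is a tuple of per-locus genotypes, e.g. ((1, 1), (1, 2))."""
    return [ind for ind in data if ind[locus] == genotype]

def genotype_freq(data, locus):
    """Empirical genotype frequencies at `locus`, with (1, 2) and (2, 1) pooled."""
    counts = Counter(tuple(sorted(ind[locus])) for ind in data)
    n = len(data)
    return {g: c / n for g, c in counts.items()}

# Hypothetical two-locus genotype data for six individuals (illustration only).
data = [((1, 1), (1, 1)), ((1, 1), (1, 2)), ((1, 2), (2, 2)),
        ((1, 1), (1, 1)), ((2, 2), (2, 2)), ((1, 2), (1, 2))]

sub = condition_on_locus(data, 0, (1, 1))   # the "locus 2/(1, 1)" subgroup
freqs = genotype_freq(sub, 1)               # conditional genotype frequencies at locus 2
```

Running the same split for each genotype of the conditioning locus reproduces the three nonoverlapping subtables described above.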

### The hypotheses and testing statistics:

We are interested in testing the null hypothesis *H*_{i}: among the set of markers, SNP marker *i* is the sole cause of the linkage to the disease locus. Here we assume background effects on the linkage are negligible; see the discussion for more details on this. Under *H*_{i}, *D*_{j|i} = 0 for all *j* ≠ *i*. For each fixed *i*, testing statistics *S*_{j|i} are constructed, usually as functions of the empirical version *D̂*_{j|i} (*j* ≠ *i*), such that they tend to be small under *H*_{i} and large otherwise. *H*_{i} can be decomposed as *H*_{i} = *H*_{i1} ⊕ *H*_{i2}, where *H*_{ik} is the hypothesis that genotype *k* at site *i* is the sole susceptibility SNP responsible for the trait LD in the region. Note that when *H*_{ik} is rejected, we can conclude only that the SNP does not contribute to the LD in the region, that the single or multiple causal polymorphisms may not be among those typed, or that there is more than one source of such contribution. By testing the sequence of {*H*_{ik}}, we can find a confidence set, which may consist of a single SNP or several SNPs, or may be empty. This set may be more accurately inferred by testing the hypothesis of multiple SNPs, as in a later subsection. Our method can also be used to detect a more detailed local relationship by testing the more detailed hypothesis *H*_{j|i}, that LD at site *j* is completely caused by site *i*, or even the finer hypothesis *H*_{j|ik}, that LD at site *j* is completely caused by genotype *k* of site *i*.

These last hypotheses are inferred using the statistics *S*_{j|ik}, which are the versions of the *S*_{j|i}'s corresponding to the *H*_{j|ik}'s; for recessive disease, the conditional statistic notation *S*_{j|ik} has the corresponding meaning. The *S*_{j|ik}'s are constructed as quadratic forms, and the random column vector *X*_{ik} is jointly asymptotically normal under *H*_{ik}. Let Σ_{ik} be the asymptotic variance matrix of *X*_{ik} and λ = (λ_{1}, … , λ_{J−1}) be its eigenvalues. Usually Σ_{ik}, and thus λ, can be estimated by their empirical versions. The particular forms of the *S*_{j|ik}'s are given later for different data designs.

### Asymptotic distribution of the testing statistic:

Let us consider *H*_{ik}. To obtain the asymptotic distribution of its testing statistic *S*_{+|ik}(λ) or *S*_{+|ik} under *H*_{ik}, we first give a general result for the distribution of a quadratic form of normal random variables. The proof is given in the appendix.

#### Proposition:

Let *X* = (*X*_{1}, … , *X*_{d})′ be a nondegenerate normal random vector, *X* ∼ *N*(**0**, Σ) (*i.e.*, |Σ| ≠ 0), with λ = (λ_{1}, … , λ_{d}) the eigenvalues of Σ; let *A* be a *d*-dimensional positive definite symmetric matrix with eigenvalues γ = (γ_{1}, … , γ_{d}); and let the λ_{i}'s and the γ_{i}'s keep the same order in the diagonalization. We have

i. *X*′*AX* is distributed as γ_{1}λ_{1}*Y*^{2}_{1} + … + γ_{d}λ_{d}*Y*^{2}_{d}, where the *Y*^{2}_{j}'s are independent and identically distributed (IID) χ^{2}_{1} random variables.

ii. Let Γ = diag(γ_{1}, … , γ_{d}) and Λ = diag(λ_{1}, … , λ_{d}); then *X*′*AX* = (*A*^{1/2}*X*)′(*A*^{1/2}*X*), and its distribution is determined by the diagonal elements of ΓΛ. In particular, when *A* = *I*_{d}, *X*′*X* is distributed as λ_{1}*Y*^{2}_{1} + … + λ_{d}*Y*^{2}_{d}.

#### Remark:

1. The case of Σ or *A* being degenerate is not of much interest and can be avoided easily in the construction of the testing statistic.

2. The result requires that γ and λ be of the same order; this can be arranged by using the same orthogonal matrix (or matrices) in the diagonalization of Σ and *A*. More conveniently, since only the products γ_{j}λ_{j} are actually used, note that they are just the eigenvalues of *A*Σ (or Σ*A*).

3. Using i or ii is a matter of choice. Form i is simpler for constructing the χ^{2} statistic but not for computing the quantiles or *P* values, although the order of the γ_{j}λ_{j}'s does not matter. Form ii involves computing *A*^{1/2} in constructing the χ^{2} statistic, and the order of the γ_{j}λ_{j}'s must match that of the *X*_{j}'s, which in practice is not trivial; however, it is simpler for computing the quantiles or *P* values using the existing χ^{2} tables.

4. Given γ and λ, the density of the mixture χ^{2}_{d} can be derived by the multiple convolution formula, and thus its αth quantile and/or the *P* value of the observed statistic can be obtained. More conveniently, for a given level α, the αth quantile and/or the *P* value of the observed statistic can be consistently estimated by their empirical versions.

To sample from the mixture χ^{2}_{d}, we sample *Y*^{2}_{1}, … , *Y*^{2}_{d} from χ^{2}_{1} independently; then γ_{1}λ_{1}*Y*^{2}_{1} + … + γ_{d}λ_{d}*Y*^{2}_{d} is a sample from χ^{2}_{d}.
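This sampling recipe is straightforward to implement. The sketch below draws from the mixture by Monte Carlo and estimates a *P* value empirically; the coefficients *c*_{j} = γ_{j}λ_{j} are supplied by the user, and the example values are illustrative:

```python
import random

def sample_mixture(coeffs, rng, n_draws=1):
    """Draw from the chi-square mixture sum_j c_j * Y_j^2, with Y_j ~ N(0, 1) IID,
    where c_j = gamma_j * lambda_j. Returns a list of n_draws samples."""
    return [sum(c * rng.gauss(0.0, 1.0) ** 2 for c in coeffs) for _ in range(n_draws)]

def empirical_p_value(observed, coeffs, n_sim=100_000, seed=0):
    """Monte Carlo P value: the proportion of mixture draws exceeding `observed`."""
    rng = random.Random(seed)
    draws = sample_mixture(coeffs, rng, n_sim)
    return sum(d >= observed for d in draws) / n_sim

# Illustration with a hypothetical eigenvalue; with a single unit coefficient the
# mixture reduces to an ordinary chi-square with 1 d.f., so P(X >= 3.84) is near 0.05.
p = empirical_p_value(3.84, coeffs=[1.0], n_sim=200_000)
```

With several unequal coefficients, the same routine gives the empirical quantiles of the general mixture, which is exactly what the level-α rejection rule needs.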

The χ^{2} linear combination is the general form of the quadratic form of normals. When the *X*_{j}'s are independent, λ_{j} = Var(*X*_{j}); when the *X*_{j}'s are IID, the λ_{j}'s are all equal. There are some other similar results about the quadratic form of normals (Graybill and Marsaglia 1957; Good 1969; Khatri 1980, 1982; Anderson and Styan 1982). Our result is independent of these and not of the same formulation and conditions.

Let the eigenvalues (in their original order) of Σ_{ik} be λ = (λ_{1}, … , λ_{J−1}); by ii of the *Proposition*, we have (see appendix):

#### Corollary:

Under *H*_{ik}, asymptotically *S*_{+|ik} is distributed as λ_{1}*Y*^{2}_{1} + … + λ_{J−1}*Y*^{2}_{J−1}, where the *Y*^{2}_{j}'s are IID χ^{2}_{1} random variables.

Thus, for given 0 < α < 1, the asymptotic level-α test of *H*_{ik} is given by the rejection rule: reject if the *P* value of the observed *S*_{+|ik} is smaller than α or, equivalently, if *S*_{+|ik} > *Q*_{J−1}(λ, α), the αth quantile of the χ^{2}_{J−1} mixture distribution.

Note that our method requires only the genotype information and allele counts at each locus. It does not require phase information in diploids; thus it is practical in applications.

In the following we give the specific forms of the *S*_{+|ik}'s [and the *S*_{+|ik}(λ)'s] under some commonly used settings; those of the *S*_{+|i}(λ)'s are the same and are omitted.

### Multiple susceptible loci:

Our method can be extended to the case of multiple susceptible loci without conceptual difficulty, but with more involved computations. Consider the hypothesis *H*_{i1k1,…,irkr} (1 ≤ *r* < *J*) that the composite genotypes (*k*_{1}, … , *k*_{r}) at loci (*i*_{1}, … , *i*_{r}) are the true susceptible ones. The corresponding testing statistics *S*_{j|i1k1,…,irkr} are constructed as before. The only difference is the inference set: the conditional data set now consists of those individuals whose alleles at loci (*i*_{1}, … , *i*_{r}) are (*k*_{1}, … , *k*_{r}). The resulting statistic is asymptotically χ^{2}_{J−r}, and λ = (λ_{1}, … , λ_{J−r}) is the vector of eigenvalues of the asymptotic variance matrix Σ_{i1k1,…,irkr}, which is estimated in the same way as in the single-susceptible-locus case, but using the current inference data set.

For fixed *r*, there are *J*!/[(*J* − *r*)!*r*!] such tests across the different choices of loci combinations, and 2^{r} such tests for each choice of loci combination. So the total number of tests is 2^{r}*J*!/[(*J* − *r*)!*r*!].
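This count can be checked with a few lines; the function and example values below are purely illustrative:

```python
from math import comb

def total_tests(J, r):
    """Number of conditional tests for r susceptible loci among J SNPs:
    C(J, r) choices of loci, times 2**r composite genotypes per choice."""
    return (2 ** r) * comb(J, r)

# e.g. J = 5 SNPs, r = 2 candidate loci: 10 loci pairs, 4 genotype combinations each
n = total_tests(5, 2)
```

The multiplicity grows quickly with *r*, which is one reason the multiple-loci tests are computationally more demanding than the single-locus ones.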

Note that the above construction of the testing statistic is general; its inference behavior depends on the particular statistic used. The general form of the testing statistic is asymptotically a chi-square mixture, which is central under *H*_{ik} and noncentral otherwise. The functional form of the parameters of interest entering the noncentrality parameter of the chi-square mixture explains the behavior of the test in terms of asymptotic power. We give more detail on this for the specific tests used in the following sections.

## AFFECTED INDIVIDUAL DATA

Now we explain how to construct the *S*_{+|ik}'s for this type of data. In the case *J* = 1, assume the two SNP alleles are *M* and *M̄*, and let *A* be the disease allele. Let *p*_{A}, *q*_{M}, and *P*_{AM} be the population frequencies of the alleles *A* and *M* and of the haplotype *AM*, respectively, and let *D*_{AM} = *P*_{AM} − *p*_{A}*q*_{M} be the LD. For clarity we first assume the disease is *recessive* and *P*(Affected|*AA*) = 1. Under these assumptions, Feder *et al*. (1996) and, more specifically, Nielsen *et al*. (1998) derived a relationship in which ψ is the probability that an individual will exhibit the disease due to causes other than this locus and φ is the prevalence of the disease in the population. This equality enables us to detect the marker-disease association by testing Hardy-Weinberg disequilibrium at the marker locus without using IBD information. In fact, the connection between the marker allele frequencies and the marker-disease LD is preserved if we use only the numerator in the above equality, which simplifies the computation; this gives the quantity in (2). We derive a conditional version of (2) to serve our purpose.

Since all individuals in this study are affected, we drop the index "Affected" to simplify the notation. We want to test the hypothesis *H*_{ik}: SNP genotype *k* at locus *i* is the sole cause of the LD in the region. Let *P*_{jr|ik} be the population frequency of genotype *r* (*r* = *I*, *II*) at locus *j* given that one's genotype at locus *i* is *k*; let *q*_{jr|ik} be that of allele *r* (*r* = 1, 2) at locus *j* given that one's genotype at locus *i* is *k*; let ψ_{j} be the probability that an individual will exhibit the disease due to causes other than locus *j*; and let *D*_{j|ik} be the disequilibrium corresponding to the conditional LD measure. The same derivation as for (2) then leads to (3). Under *H*_{ik}, all association at SNP *j* is completely explained by genotype *k* of locus *i*; thus *D*_{j|ik} = 0 and hence *T*_{j|ik} = 0 (*j* ≠ *i*).

We comment that our method works for a general disease model; in that case *T*_{j|ik} is still a function of *D*_{j|ik}, but the expression is more involved (see Nielsen *et al*. 1998, pp. 1533–1534), and under *H*_{ik} we still have *T*_{j|ik} = 0 (*j* ≠ *i*); hence the test remains valid. In this case, the power and error-rate computations are also more involved. The same comment applies to the case-control section.

Now we construct testing statistics for *H*_{ik} (*i* = 1, … , *J*). The consistent estimates *P̂*_{jr|ik} of *P*_{jr|ik} and *q̂*_{jr|ik} of *q*_{jr|ik} are constructed as sample proportions, where *N*_{ik} is the total number of individuals whose *i*th SNP genotype is *k*, and we rearrange them as the first, second, … , and *N*_{ik}th individuals. Here *I*_{n,jr|ik} (= 0, 1) is the indicator that the *n*th individual in this set has genotype *r* at the *j*th locus given that he (she) has genotype *k* at locus *i*, and *J*_{n,j1|ik} (= 0, 1, 2) is, for the *n*th individual, the number of times allele 1 occurs at locus *j*, given that one's genotype at locus *i* is *k*. The estimate *T̂*_{j|ik} of *T*_{j|ik} follows by plugging these estimates into (3).

Let *T̂*_{ik} = (*T̂*_{j|ik} : *j* ≠ *i*) be the (*J* − 1)-dimensional column vector. Under *H*_{ik}, *T̂*_{ik} is asymptotically *N*(**0**, Σ_{ik}) for some matrix Σ_{ik} to be identified later. Let λ = (λ_{1}, … , λ_{J−1}) be the eigenvalues of Σ_{ik}. By the *Corollary*, under *H*_{ik} the statistic is asymptotically the χ^{2} mixture above, and Σ_{ik} is estimated by (4) (appendix).

Here ⊕ means matrix direct summation, which results in a (*J* − 1) × 2(*J* − 1)-dimensional matrix. From Σ̂_{ik}, we obtain the estimated eigenvalues λ̂.

Similarly, for *H*_{i}, let *N*_{i} = *N*_{i1} + *N*_{i2}, α_{r} = *N*_{ir}/*N*_{i}, and *T̂*_{i} = (*T̂*_{j|i} : *j* ≠ *i*). Let Σ_{i} be the asymptotic variance matrix of *T̂*_{i} and λ = (λ_{1}, … , λ_{2(J−1)}) be its eigenvalues. Note that *T̂*_{i} = *T̂*_{i1} + *T̂*_{i2}, and *T̂*_{i1} and *T̂*_{i2} are independent, so under *H*_{i}, *T̂*_{i} is asymptotically *N*(**0**, Σ_{i}); its estimate is obtained from the Σ̂_{ir}'s, which are constructed as before.

With the corresponding statistic *S*_{+|i}, under *H*_{i}, *S*_{+|i} ∼ χ^{2}_{J−1}.

We remark that in the above, the asymptotic variance matrices Σ_{ik} are estimated in the same way as for IID data. In general, familial data are not IID, and the variance matrices must be handled differently. Usually, in the positively dependent case, the asymptotic variance matrix will be larger in the sense of generalized variance (the determinant of the variance matrix) and consequently will tend to have larger eigenvalues than in the IID case, such as the singleton-data case. In the case of a homogeneous familial structure, more accurate estimates can be obtained. We study these methods for general pedigree data in the extension section later.

In some of the existing methods for this problem, *e.g.*, Sun *et al.* (2002), the conditional IBD sharing statistics are computed at each site given the genotype at that site. Such a statistic can test whether each site is the sole susceptible site, but it cannot uncover the more detailed relationships between sites when the null hypothesis of a single susceptible site is rejected; our test statistic can reveal such relationships. If *H*_{ik} is accepted, it is reasonable to say that the connection between site *j* and the disease locus is due to genotype *k* of site *i*.

By the asymptotic normality of *T̂*_{ik} and (3), when *H*_{ik} is false, *S*_{+|ik} is asymptotically a noncentral χ^{2}_{J−1}(λ, μ) with noncentrality parameter μ. It is clear that *H*_{ik} is true if and only if *D*_{j|ik} = 0 (*j* ≠ *i*). In terms of μ, the null hypothesis is rephrased as *H*_{ik}: μ = 0. For a given level α (= *P*(reject *H*_{ik}|*H*_{ik} is true)) and given parameters λ, the ψ_{j}'s, φ, and the *D*_{j|ik}'s, the asymptotic power of the test can be computed. Here *Q*_{J−1}(λ, α) is the αth quantile of the noncentral χ^{2}_{J−1} distribution, which can be simulated by the sampling method given after the *Remark* of the *Proposition*, but with *Y*_{1}, … , *Y*_{J−1} independent and *Y*_{j} drawn from *N*(μ_{j}, 1).

For this particular test statistic, since the power is an increasing function of μ, *H*_{ik} will be more accurately rejected when the ψ_{j}(1 − ψ_{j})'s and the conditional *D*_{j|ik}'s are large and φ_{j} is small, or when the disease is relatively rare. Likewise, *H*_{ik} will be more often correctly accepted when the ψ_{j}(1 − ψ_{j})'s and the conditional *D*_{j|ik}'s are small (*i.e.*, the LD is mainly explained by allele *k* of locus *i*) and the disease is relatively common.

Likewise, the error rate, the probability of false acceptance, is obtained in the same way.

## CASE-CONTROL DATA

Let *q*_{M|A} and *q*_{M|U} denote the marker *M* population frequencies for the affected (case) and unaffected (control) individuals. Bengtsson and Thomson (1981) and Lehesjoki *et al*. (1993) gave an LD measure based on these frequencies, and we use its conditional version here. Let *N*_{A} and *N*_{U} be the numbers of affected and unaffected individuals, where *N*_{A,ik} is the total number of affected individuals whose *i*th SNP genotype is *k*, *N*_{U,ik} is the total number of unaffected individuals whose *i*th SNP genotype is *k*, and *J*^{A}_{n,jr|ik} and *J*^{U}_{n,jr|ik} are defined as the *J*_{n,jr|ik} before, but for affected and unaffected individuals, respectively. Let *N*_{ik} = *N*_{A,ik} + *N*_{U,ik}, and assume *N*_{A,ik}/*N*_{ik} → α_{A,ik} and *N*_{U,ik}/*N*_{ik} → α_{U,ik} = 1 − α_{A,ik}. The statistic for testing *H*_{ik} is constructed as before; under *H*_{ik}, it is asymptotically a χ^{2} mixture, where λ is the vector of eigenvalues of the matrix Σ_{ik}, and Σ_{ik} is estimated by (5) (appendix), where, for singleton data, the affected and the unaffected are independent.
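As a sketch of the case-control measure restricted to a conditional subgroup, the following computes the case-control difference in allele frequencies. The exact (possibly normalized) form of the Bengtsson-Thomson measure is not reproduced here, so treat this as an assumption-laden illustration:

```python
def allele1_freq(genotypes):
    """Frequency of allele 1 among a list of (a1, a2) genotypes coded with 1s and 2s."""
    return sum(g.count(1) for g in genotypes) / (2 * len(genotypes))

def conditional_cc_measure(cases_j, controls_j):
    """Case-control disequilibrium at locus j within a conditional subgroup,
    sketched as the allele-frequency difference q_{j1|A,ik} - q_{j1|U,ik};
    ASSUMPTION: the unnormalized difference stands in for the paper's measure."""
    return allele1_freq(cases_j) - allele1_freq(controls_j)

# Hypothetical conditional subgroup (individuals with genotype k at locus i):
cases = [(1, 1), (1, 1), (1, 2), (1, 2)]     # allele-1 frequency 0.75
controls = [(1, 2), (2, 2), (1, 2), (2, 2)]  # allele-1 frequency 0.25
delta = conditional_cc_measure(cases, controls)
```

Under *H*_{ik} the conditional difference should be near zero at every locus *j* ≠ *i*, mirroring the vanishing of the conditional LD parameters.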

_{ik}Similarly, to test *H _{i}*, let Then under

*H*, asymptotically

_{i}*S*

_{+|i}∼ χ

^{2}

_{J−1}, and λ is the vector of eigenvalues of ∑

*, which is estimated by 6(appendix), where, for singleton data, , where ,*

_{i}*N*=

_{i}*N*

_{i}_{1}+

*N*

_{i}_{2},

*N*

_{A,}

*=*

_{i}*N*

_{A,}

_{i}_{1}+

*N*

_{A,}

_{i}_{2},

*N*

_{U,}

*=*

_{i}*N*

_{U,}

_{i}_{1}+

*N*

_{U,}

_{i}_{2}, α

_{A,}

*=*

_{i}*N*

_{A,}

*/*

_{i}*N*, and α

_{i}_{U,}

*=*

_{i}*N*

_{U,}

*/*

_{i}*N*= 1 − α

_{i}_{A,}

*. Similarly,*

_{i}Other LD measures can also be used, for example, the trend test statistic (Armitage 1955; Devlin and Roeder 1999).

As in the affected-individual case, when *H*_{ik} is not true, *S*_{+|ik} is asymptotically noncentral χ^{2}_{J−1}. Given α, λ, the ψ_{j}'s, φ, *p*_{A}, the *q*_{j2}'s, and the *D*_{j1|ik}'s, the power and error rate can be computed by simulation as before, but with *Y*_{1}, … , *Y*_{J−1} independent and *Y*_{j} drawn from *N*(μ_{j}, 1).

Here, the power and the probability of correct acceptance of *H*_{ik} depend on ψ, φ, *p*_{A}, the *q*_{j2}'s, and the *D*_{j1|ik}'s. The power is maximal when the conditional *D*_{j1|ik}'s are maximal, and the test is more likely to accept *H*_{ik} when the *D*_{j1|ik}'s are small. Their relationships with the other parameters can be analyzed similarly.

## EXTENSION TO GENERAL PEDIGREE DATA

As mentioned earlier, the only difference in our methods between general pedigree data and singleton data is the estimation of the corresponding asymptotic variance matrices. A simple method for this purpose can be found in the work of G. E. Bonney, V. Apprey, and A. Yuan (unpublished data), without any assumption on the data and with no extra parameters introduced for the dependence. We illustrate this with the affected familial data; the treatment of case-control family data is similar. For such data, the estimators of the genotype/allele frequencies in the previous sections are not IID averages; we rewrite them in IID form so that their asymptotic variance matrices can be computed easily. First we assume the data have the same familial structure. Suppose there are *M* families with *S* individuals each (*N* = *MS*). We redefine *P̂*_{jr|ik} as , where is the total number of families in which at least one individual has SNP type *k* at locus *i*, and *I*_{ik}(*s*, *m*) is the indicator that in the *m*th family there are *s* individuals with SNP type *k* at locus *i*. *I*_{jr|ik}(*s*, *m*) is the indicator that there are *s* individuals in family *m* with SNP type *r* at the *j*th locus, given that the family is in the group with SNP type *k* at the *i*th locus. Let . Then for fixed (*jr*, *ik*), {*I*_{jr|ik}(*m*) : *m* = 1, … , *M*} is an IID sequence, and for different (*jr*, *ik*) and (*j*′*r*′, *i*′*k*′), {*I*_{jr|ik}(*m*) : *m* = 1, … , *M*} and {*I*_{j′r′|i′k′}(*m*) : *m* = 1, … , *M*} are independent. Similarly, *q̂*_{jr|ik} is redefined as , where *J*_{jr|ik}(*s*, *m*) is the indicator that there are *s* copies of SNP allele *r* at the *j*th locus in family *m*, among individuals whose SNP type is *k* at the *i*th locus. Let . Then for fixed (*jr*, *ik*), {*J*_{jr|ik}(*m*) : *m* = 1, … , *M*} is an IID sequence, and for different (*jr*, *ik*) and (*j*′*r*′, *i*′*k*′), {*J*_{jr|ik}(*m*) : *m* = 1, … , *M*} and {*J*_{j′r′|i′k′}(*m*) : *m* = 1, … , *M*} are independent.
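The point of the rewriting above is that families, not individuals, become the IID units. As an illustration of this device (a minimal sketch with hypothetical per-family summary vectors, not the authors' exact estimators, whose displayed formulas are omitted here), one can average family-level summaries and estimate the asymptotic covariance empirically:

```python
import numpy as np

def family_level_estimates(per_family_stats):
    """Treat per-family summary vectors (e.g., within-family counts of a
    SNP type) as the IID units: estimate frequency-type quantities by the
    family-level mean, and estimate the asymptotic covariance matrix of
    sqrt(M) * (estimate - truth) by the empirical covariance across families."""
    stats = np.asarray(per_family_stats, dtype=float)
    est = stats.mean(axis=0)            # one average per summary component
    cov = np.cov(stats, rowvar=False)   # empirical covariance across families
    return est, cov
```

Because the rows are independent across families, this empirical covariance is valid without modeling the within-family dependence, which is the motivation for the rewriting.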

Let *T̂*_{j|ik} and *T̂*_{ik} be as before but with *P̂*_{jI|ik}, *P̂*_{jII|ik}, and *q̂*_{j1|ik} replaced by the above versions. Let . Now it is clear that the Ω̂ in (4) can be replaced by the consistent estimator for this case, , where , ∑̂_{ik} = *D̂*Ω̂*D̂*′, and *D̂* is the same as in (4).

More generally, suppose that there are *L* different familial structures in the data set, with *M*_{l} families of the *l*th structure, each family having *S*_{l} individuals (*l* = 1, … , *L*). Let , where is the total number of families with structure *l* in which at least one individual has SNP type *k* at locus *i*, and *I*^{(l)}_{ik} and *I*^{(l)}_{jr|ik} are the counterparts of *I*_{ik}(*s*, *m*) and *I*_{jr|ik}(*s*, *m*), respectively, for familial structure *l*. Let . Then for fixed (*l*, *jr*, *ik*), is an IID sequence, and for different (*l*, *jr*, *ik*) and (*l*′, *j*′*r*′, *i*′*k*′), and are independent. Let , and define the estimate of *P*_{jr|ik} as . Similarly, let , where *J*^{(l)}_{jr|ik} is the counterpart of *J*_{jr|ik}(*s*, *m*) for familial structure *l*. Let , and define

Now, for this general pedigree data, let *T̂*_{j|ik} and *T̂*_{ik} be as before but with *P̂*_{j1|ik}, *P̂*_{j2|ik}, and *q̂*_{j1|ik} replaced by the above versions. Let , and assume ; then a consistent estimate of ∑_{ik} is given by (7) (appendix), where and . For the test of *H*_{i}, or for the case of case-control data, the testing statistics and the corresponding asymptotic variance matrices can be obtained in a similar way; we omit the details here.

## SIMULATION STUDY

Here we use simulated data to illustrate our method. To exhibit its applicability, we use singleton data, which is outside the scope of the IBD-based methods. We simulate the data *G*_{1}, … , *G*_{N}, where *G*_{n} = (*G*_{n1}, … , *G*_{nJ}) (*n* = 1, … , *N*) and *G*_{nj} = (*g*_{nj1}, *g*_{nj2}) denotes the two alleles at SNP site *j* for the *n*th individual. The *g*_{njk}'s are coded as 1 and 2 for the two possible alleles. We assume phase is known, to simplify the simulation process, so that for each *n* the two haplotypes (*g*_{n11}, … , *g*_{nJ1}) and (*g*_{n12}, … , *g*_{nJ2}) are independent. In this example we take *J* = 6, so all the vectors *G*_{n} = (*G*_{n1}, … , *G*_{n6}) are random samples from the population genotype *S* = (*S*_{1}, … , *S*_{6}), where *S*_{j} = (*s*_{j1}, *s*_{j2}) is the genotype at the *j*th site. We assume genotype (1, 1) at the third SNP site is responsible for all the LD with the disease allele *A*; the other first alleles, *s*_{j1} (*j* ≠ 3), in this region are tightly linked to *s*_{31}.

Now the haplotypes *S*^{(1)} = (*s*_{11}, … , *s*_{61}) and *S*^{(2)} = (*s*_{12}, … , *s*_{62}) are independent, and the *s*_{j2}'s are independent among themselves. Denote and as the two haplotypes of the *n*th individual. To sample such data, for each *n* we need only sample *G*^{(1)}_{n} from *S*^{(1)} and *G*^{(2)}_{n} from *S*^{(2)} independently. Let *q*_{A} = 0.8 be the frequency of the disease allele *A* = 1 among the affected individuals, *q*^{(1)} = (*q*_{11}, … , *q*_{61}) be the frequencies of *S*^{(1)} = (1, … , 1), and *q*^{(2)} = (*q*_{12}, … , *q*_{62}) be those of *S*^{(2)} = (1, … , 1). Sampling from *S*^{(2)} is trivial: just sample *g*_{nj2} independently from *B*(*q*_{j2}), the Bernoulli distribution with probability *q*_{j2} of getting 1 and probability 1 − *q*_{j2} of getting 0. To sample *G*^{(1)}_{n}, we need to sample from a joint Bernoulli distribution with probability *q*^{(1)}. Such a joint distribution can be specified in the form (Cox 1972; Fitzmaurice and Laird 1993), where Ψ and Ω are parameters, exp{−*A*(Ψ, Ω)} is the normalizing constant, and *W* consists of all the cross-product terms of *S*^{(1)}, including all the second- and higher-order terms. This distribution can be sampled using the Gibbs sampler (Geman and Geman 1984), but the specification of the joint Bernoulli distribution involves some subjectivity and the sampling scheme is not simple. Instead, we use a normal discretization method, with high correlation representing linkage. Let ∑ be the corresponding correlation matrix of the (*J* + 1)-dimensional normal distribution for (*A*, *S*^{(1)}). Note that this matrix corresponds to a strong connection between *A* and *s*_{31}, but not between *A* and (*s*_{11}, *s*_{21}, *s*_{41}, *s*_{51}, *s*_{61}); it also corresponds to a strong connection between *s*_{31} and (*s*_{11}, *s*_{21}, *s*_{41}, *s*_{51}, *s*_{61}). Thus all the loci show apparent linkage with the disease allele *A*.

To sample the composite genotypes from the above distribution, let *X* = (*x*_{A}, *x*_{1}, … , *x*_{6}) be a sample from the normal distribution *N*(**0**, ∑); if *x*_{j} < Φ^{−1}(*q*_{j1}), we assign *g*_{nj1} = 1, and otherwise *g*_{nj1} = 0 (*j* = 1, … , 6), where Φ^{−1}(*q*) is the *q*th quantile of the standard normal distribution. Since *q*_{31} is the proportion, in the affected population, of allele 1 at locus 3, which is linked to the disease allele, the two alleles at locus 3 are in Hardy-Weinberg disequilibrium. The disease is recessive. We make the corresponding conditional probability *P*(*s*_{32} = 1|*s*_{31} = 1) high, say 0.8, among the affected individuals. In the simulation, we used frequencies *q*_{j1} = *q* = 0.1, 0.2, … , 0.9 (*j* ≠ 3) for allele 1 at each locus, to see how this choice affects the results.

In the same way we simulated control data, in which the two haplotypes are sampled in the same manner as *G*^{(2)}_{n} above. Together with the affected data above, this gives case-control data; the analysis is displayed in Table 6.

Specifically, the sampling scheme has the following three steps, for each *n* = 1, … , *N* (*N* = 1000):

1. Draw a sample *X* = (*x*_{A}, *x*_{1}, … , *x*_{J}) from the normal distribution *N*(**0**, ∑); if *x*_{j} < Φ^{−1}(*q*_{j1}), assign *g*_{nj1} = 1, and otherwise *g*_{nj1} = 0 (*j* = 1, … , 6). This gives the sample *G*^{(1)} = (*g*_{n11}, … , *g*_{nJ1}).
2. If *g*_{n31} = 1, set *q*_{32} = *P*(*s*_{32} = 1|*s*_{31} = 1) = 0.8; otherwise set *q*_{32} = 0.1. For each *j* = 1, … , *J*, draw *X* from *U*(0, 1), the uniform distribution on [0, 1]; if *X* < *q*_{j2}, assign *g*_{nj2} = 1, and otherwise *g*_{nj2} = 0. This gives a sample *G*^{(2)} = (*g*_{n12}, … , *g*_{nJ2}).
3. *G*_{n} = (*G*^{(1)}, *G*^{(2)}) is a sample from *S*.
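The three steps can be sketched as runnable code. The correlation matrix below is a hypothetical stand-in (the paper's exact ∑ display is not reproduced here), chosen only to give a strong *A*–*s*_{31} connection; `q1` is one of the paper's allele-1 frequencies (*q* = 0.7):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_affected(N=1000, J=6, q1=0.7, q2=0.1, rho=0.8):
    """Sketch of the three-step scheme: threshold a correlated multivariate
    normal at Phi^{-1}(q_{j1}) for the first haplotype, then draw the second
    haplotype with P(s_32 = 1 | s_31 = 1) = 0.8 at locus 3."""
    # Hypothetical correlation matrix: column 0 is the disease allele A,
    # columns 1..J are loci; strong A--s_31 link (locus 3 = column 3).
    Sigma = np.full((J + 1, J + 1), 0.5)
    np.fill_diagonal(Sigma, 1.0)
    Sigma[0, 3] = Sigma[3, 0] = 0.9
    X = rng.multivariate_normal(np.zeros(J + 1), Sigma, size=N)
    # Step 1: first haplotype, g_{nj1} = 1 iff x_j < Phi^{-1}(q_{j1}).
    G1 = (X[:, 1:] < norm.ppf(q1)).astype(int)
    # Step 2: second haplotype; locus 3 (array column 2) depends on g_{n31}.
    q2_mat = np.full((N, J), q2)
    q2_mat[:, 2] = np.where(G1[:, 2] == 1, rho, q2)
    G2 = (rng.uniform(size=(N, J)) < q2_mat).astype(int)
    # Step 3: the genotype is the pair of haplotypes.
    return G1, G2
```

Each first-haplotype allele is marginally Bernoulli(*q*_{j1}) by construction, while the shared normal scores induce the cross-locus dependence.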

When the two alleles at each locus are in Hardy-Weinberg disequilibrium, we use a two-dimensional normal with mean (0, 0) and variance matrix Ω = (1, *r*; *r*, 1), with *r* = 0.2, to model their dependence. For each *n*, we first get the sample *G*^{(1)} = (*g*_{n11}, … , *g*_{nJ1}) from (*x*_{1}, … , *x*_{J}) as before; then, for each *j* = 1, … , *J* separately, we sample *y*_{j} from the conditional distribution *N*(*rx*_{j}, 1 − *r*^{2}). If *y*_{j} < Φ^{−1}(*q*_{j2}), we assign *g*_{nj2} = 1, and otherwise 0.
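This dependent-allele step can be sketched as follows (the frequencies `q1` and `q2` are illustrative values, not fixed by the text); since *y*_{j} = *rx*_{j} + (1 − *r*^{2})^{1/2}*z*_{j} is again standard normal marginally, the second allele keeps its intended marginal frequency while acquiring correlation *r* with the first:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def sample_dependent_alleles(X, q1=0.7, q2=0.3, r=0.2):
    """Non-HWE sampling: threshold the normal scores X (one column per
    locus) for the first allele, then draw y_j ~ N(r * x_j, 1 - r^2) so
    that (x_j, y_j) has correlation r, and threshold y_j for the second."""
    g1 = (X < norm.ppf(q1)).astype(int)
    Y = rng.normal(loc=r * X, scale=np.sqrt(1.0 - r ** 2))
    g2 = (Y < norm.ppf(q2)).astype(int)
    return g1, g2
```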

To simulate the case-control data, we choose *q* = 0.6 for the case and *q* = 0.25 for the control.

## RESULTS

### Simulated data:

We constructed the test statistics *S*_{+|ik} (*i* = 1, … , *J*; *k* = 1, 2) and computed the corresponding eigenvalues λ = (λ_{1}, … , λ_{J−1}), using the method described in the *Remark* after the *Proposition* to compute the χ^{2} *P* value under the null hypotheses. Since in the simulation the sole linkage with the disease allele comes from *s*_{31}, we expect *H*_{31} to be accepted and the other *H*_{jk}'s to be rejected. Table 4 summarizes the observed values of the *S*_{+|j1}'s for the *H*_{jk}'s, for different choices of *q*, with the corresponding *P* values in parentheses. We simulated and computed data for *q* = 0.1, 0.2, … , 0.9; we display only part of the results to save space.

For each testing statistic *S*_{+|jk}, there is a set of nonnegative eigenvalues λ = (λ_{1}, … , λ_{J−1}). Their magnitudes play an important role in determining the asymptotic *P* value of the observed *S*_{+|jk}. For a given observed value of *S*_{+|jk} and a fixed number of loci *J*, a larger eigenvalue total |λ| (defined as λ_{1} + … + λ_{J−1}) roughly results in a larger *P* value, and vice versa. However, for two sets of eigenvalues λ_{1} = (λ_{11}, … , λ_{1,J−1}) and λ_{2} = (λ_{21}, … , λ_{2,J−1}), even if |λ_{1}| = |λ_{2}|, the corresponding distributions χ^{2}(λ_{1}) and χ^{2}(λ_{2}) need not be equal; they are equal if and only if λ^{(1)} = λ^{(2)}, where λ^{(k)} = (λ_{k,(1)}, … , λ_{k,(J−1)}) is the ordered version of λ_{k} (*k* = 1, 2).
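Since the mixture distribution χ^{2}(λ) has no simple closed form, its tail probabilities can be obtained by simulation, as noted in the discussion. A minimal Monte Carlo sketch:

```python
import numpy as np

def chi2_mixture_pvalue(stat, lams, n_sim=200_000, seed=0):
    """Monte Carlo P value for a statistic that is asymptotically
    distributed as sum_j lam_j * chi2_1 with independent chi2_1 terms."""
    rng = np.random.default_rng(seed)
    lams = np.asarray(lams, dtype=float)
    # Each row is one draw of (Y_1^2, ..., Y_{J-1}^2); weight by lam and sum.
    draws = rng.chisquare(df=1, size=(n_sim, lams.size)) @ lams
    return float((draws >= stat).mean())
```

With all λ_{j} = 1 this reduces to an ordinary χ^{2}_{J−1} tail probability, which gives a quick sanity check.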

Below we display the eigenvalues λ_{j} = (λ_{j1}, … , λ_{j5}) for the *S*_{+|j1}'s, for the case *q* = 0.7.

and

We find that in most cases the *P* values of *S*_{+|31} suggest acceptance of *H*_{31} with high confidence, and those of the *S*_{+|j1} (*j* ≠ 3) suggest rejection of *H*_{j1}, except for the case *q* = 0.9, in which the *P* values of *S*_{+|51} and *S*_{+|61} are also significant, along with that of *S*_{+|31}. We regard this last case as exceptional: the overly high proportion of allele 1 at each locus blurs the identifiability of the problem (consider the extreme case *q* ≅ 1, in which the corresponding locus contributes nearly no information). Thus, in all these cases, the true hypothesis *H*_{31} is accepted with high confidence, and the other, false ones, *H*_{j1}, are rejected; *i.e.*, the true disease-linkage-related allele 1 at locus 3 is correctly identified among all six loci, all of which are in LD with the disease locus.

To investigate the influence of deviation from Hardy-Weinberg equilibrium (HWE) on our method, we simulated data for this case, using an allelic correlation *r* ≠ 0 at each locus to induce the deviation from HWE. The disease allele population frequency is fixed at *q* = 0.7, and the results are displayed in Table 5.

In the non-HWE case, the true picture becomes more difficult to recover as the deviation from HWE increases. In general, significant departures from HWE are not expected, but if they are observed (for example, when genotyping error is present), caution should be taken in applying this method. In particular, in situations in which nonrandom mating is a known confounder because of inbreeding or population structure, care should be exercised.

For the case-control data, we used *q* = 0.6 for the case and *q* = 0.25 for the control; HWE is assumed, and again locus 3 is the only connection to the disease allele. The results are shown in Table 6. It is seen that again, for the case-control data SNP locus 3 is correctly identified, and all the other loci are rejected as sources of cause for LD in the region.

The following is a tabulation of the power of the test for the above simulated data, using the above λ and some combinations of α, ψ_{j} = ψ (*j* ≠ *i*), φ, and *D*_{j|31} = *d* (*j* ≠ 3). To get a sense of the power behavior of our methods, we choose *J* = 6 and λ = λ_{1} as shown before. The noncentrality parameter μ involves 2*J* − 1 parameters. It is impractical to investigate and tabulate the influence of each of the 2*J* − 1 parameters on the power; instead, we investigate the influence of μ itself on the power, under the given genetic model. Each given value of μ, corresponding to a 2(*J* − 1)-dimensional parameter subspace, is given by the formula for μ. Table 7 displays the power for both the affected-individual data and the case-control data, for some choices of the level α and the parameter μ. We note that, for the above specification of the parameter μ, the power of the test for the affected-individual data and that for the case-control data are the same.

Since the μ in the power of the test for the affected-individual data and that for the case-control data have different expressions, more detailed power computations can be obtained by specifying all the parameters involved.

### Application to real data:

#### Non-insulin-dependent diabetes mellitus-1 data:

We first apply our method to the non-insulin-dependent diabetes mellitus-1 (NIDDM1) data used in Sun *et al.* (2002) and list our results along with theirs in Table 8. We see that, for these data, the two methods yield quite different, although not contradictory, results. With the method of Sun *et al.*, loci 2 and 12 are most likely responsible for the LD, while by our method, loci 2, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 20, and 22 all likely contribute to the LD in the region. One possible explanation for the difference is that the calpain-10 region has patterns of LD that are not understood, violating one of the assumptions of the methods. Since the truth in the data is unknown, we do not comment on the performance of the two methods on these data. It is not uncommon in the hypothesis-testing context, even for methods based on the same type of data, that different methods give different, even contradictory, results. In principle, methods using genotype data have no less power in inference than those using IBD data. It is too early to comment on the pros and cons of the two types of methods; a formal assessment may involve long-term and large-scale studies. At least our method provides the user more options and a flexible tool for this problem. Moreover, more methods give us more strength in the inference: if the methods give consistent results, this strengthens our confidence in the decision; if they differ or are contradictory, the problem may need further investigation. We may perform the hypothesis tests on the current confidence set and continue this way to obtain a final confidence set of SNPs, all of which are accepted as possible sources of LD in the region. We do not pursue this in detail here because of space limitations.

#### Diabetes data:

Next, in a diabetes study, 280 individuals with type 2 diabetes were genotyped at a large number of SNP sites. First we find those SNPs with strong linkage to the trait and then use our method to identify the susceptible one(s). We use the measure of Nielsen *et al.* (1998) to detect the marker-disease association, which is given by , where *P̂*_{ij|Affected} and *q̂*_{i|Affected} are the estimated frequencies of marker genotype *A*_{i}*A*_{j} and allele *A*_{i} from the observed affected individuals and *m* is the total number of alleles. They showed that this marker Hardy-Weinberg disequilibrium measure is proportional to the square of the disease-marker LD measure. Under the null hypothesis that there is no disease-marker LD, χ^{2}_{HW} is approximately distributed as a χ^{2} variable with *m*(*m* − 1)/2 degrees of freedom.
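For the biallelic (*m* = 2) case, a marker Hardy-Weinberg disequilibrium statistic can be illustrated with the classical goodness-of-fit version below. This is a standard stand-in for illustration, not necessarily the exact χ^{2}_{HW} display of Nielsen *et al.* (1998), which is omitted above; both have *m*(*m* − 1)/2 = 1 degree of freedom when *m* = 2.

```python
import numpy as np

def hwd_chisq_biallelic(n_aa, n_ab, n_bb):
    """Classical HWE goodness-of-fit chi-square for one biallelic marker
    (illustrative stand-in for a marker HWD measure): compare observed
    genotype counts in the affected sample with counts expected under
    HWE at the estimated allele frequency. Returns the statistic (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                       # allele-a frequency
    expected = n * np.array([p ** 2, 2 * p * (1 - p), (1 - p) ** 2])
    observed = np.array([n_aa, n_ab, n_bb], dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())
```

A large value signals Hardy-Weinberg disequilibrium among the affected, which under the cited result is evidence of disease-marker LD.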

After computing the value of χ^{2}_{HW} at each marker and the corresponding *P* values, we found that 13 of the markers show significant evidence of disease-marker disequilibrium. To apply our method, we choose a set of six SNPs, which we code as sites 1–6 for simplicity. The χ^{2}_{HW} values are displayed in Table 9, along with their *P* values in parentheses.

We see from this table that all six loci are very tightly linked to the trait. Now we use our method to identify which one of the six SNPs is the sole true cause of linkage, if any. The computed values of the conditional testing statistics and their *P* values are in Table 10.

From this table we see that all the *P* values, except that of *S*_{+|31}, are significant at the 1% level. This shows that site 3, or SNP 4249771, is most likely the sole cause of the disease linkage among the six SNP sites.

## DISCUSSION

We developed a method, using the conditional LD approach, to identify the true linkage-susceptibility SNP, if any, in a region tightly linked to a qualitative trait, using genotype data. Simulation studies show that this method can accurately identify the true susceptibility site in a region of tightly linked loci. Application to real data likewise identifies one locus, among a set of tightly linked loci, as the leading cause of linkage to the trait, the remaining loci being merely in tight linkage with the susceptibility locus. We illustrated the method using singleton data; it can also be applied to general pedigree data sets, in which the pedigrees are required to have homogeneous familial structure.

Our method requires only the genotype information and allele counts at each locus. It does not require phase information in diploids, the determination of which remains a difficult task for contemporary sequencing and genotyping methods (Lin *et al.* 2002). Thus the method is practical for use in applications.

By forming a hypothesis that one of these sites is the sole cause and the others are subordinate, we constructed testing statistics by conditioning successively on each of the sites. They can be constructed using any marker-disease LD measure based on genotype data. For illustration, our testing statistic is based on a conditional version of part of the quantity in Feder *et al.* (1996) and Nielsen *et al.* (1998), in which the relationship between marker genotype and marker-disease LD is established. Under the true hypothesis, the testing statistic follows a mixture χ^{2} distribution, with which the *P* values of these statistics can be obtained easily via simulation.

In practice the exact relevant variation is likely to go untyped, and there are then two possibilities for the set of SNPs under study. One is that some SNPs in the set are the susceptibility SNPs for the disease linkage, although they may not be directly disease related; our method is designed to identify SNPs that are in tight linkage with the relevant untyped variation. When more than one SNP is identified (selected), they are not necessarily in high LD with each other, since different sources may contribute to their linkage. The other possibility is that, although showing strong disease linkage, none of them is the cause: all of them are carried along by some untyped SNP(s) or background factors. In this case our method is expected to reject all the SNPs in the set, and a more refined scan around the region spanned by this set is suggested.

Our method is based on a set of well-chosen markers, selected as a result of optimizing the corresponding model. It is therefore reasonable to assume the background LD to be random and negligible, and the asymptotic approximation is relatively robust to this level of noise as long as the sample size is fairly large. When some patterned background is nonnegligible, one should build this effect into the model to improve accuracy; we do not pursue this line here.

Simulation indicates that our method is relatively sensitive to large deviations from HWE. In general, significant departures from HWE are not expected in practice, but if they are observed, caution should be taken in applying this method. In particular, in situations in which nonrandom mating is a known confounder because of inbreeding or population structure, care should be exercised. How to make our method robust against deviations from HWE will be a topic of our future research.

## APPENDIX

**Proof of the proposition:**

i. Let . Since both ∑ and *A* are positive definite, there is an orthogonal matrix *P* (*PP*′ = *P*′*P* = *I*_{d}) such that . Let *Y* = Λ^{−1/2}*PX* (or *X* = *P*′Λ^{1/2}*Y*); then *Y* is a normal random vector with *E*(*Y*) = **0** and , *i.e.*, *Y* ∼ *N*(**0**, *I*_{d}), so its squared components *Y*^{2}_{1}, … , *Y*^{2}_{d} are IID χ^{2}_{1} random variables. Now

ii. Keeping the notation of i, we have *A*^{1/2} = Γ^{1/2}*P*. Let *Y* = Λ^{−1/2}*PX* (or *X* = *P*′Λ^{1/2}*Y*); then , *i.e.*, the *Y*_{j}'s are independent standard normal random variables. Now
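The whitening step in part i can be checked numerically: for an arbitrary positive definite ∑ (constructed here purely for illustration), the transform *Y* = Λ^{−1/2}*PX* yields an identity covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Numeric check of the whitening step: Sigma = P' Lambda P with P
# orthogonal, so Y = Lambda^{-1/2} P X has covariance I_d.
d = 4
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)        # an arbitrary positive definite Sigma
lam, V = np.linalg.eigh(Sigma)         # Sigma = V diag(lam) V'
P = V.T                                # orthogonal: P P' = I_d
X = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
Y = X @ P.T @ np.diag(lam ** -0.5)     # Y_n = Lambda^{-1/2} P X_n, row-wise
cov_Y = np.cov(Y, rowvar=False)        # should be close to the identity
```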

### Derivation of (2):

Since *X*_{ik} = (*X*_{j|ik} : *j* ≠ *i*) ∼ *N*(**0**, ∑_{ik}) asymptotically, in the limit the *X*_{j}'s are standard normal random variables, with Cov(*X*) = ∑_{ik}. Since, for fixed *i*, the *X*_{j|ik}'s are not functions of each other, neither are their distributional limits, the *X*_{j}'s; *i.e.*, *X* is a nondegenerate normal vector, and the conclusion follows from ii of the *Proposition* with *A* = *I*_{J−1}.

### Derivation of (4):

To get the asymptotic variance matrix ∑_{ik}, and hence λ, first consider the asymptotic distribution of ; then those of and of *T̂*_{ik}, and thus that of *S*_{+|ik}, are obtained. Note that (*P̂*_{jI|ik} + *P̂*_{jII|ik}, *q̂*_{j1|ik}) can be written as an average of *N*_{ik} IID random variables, so its asymptotic normality is asserted by the central limit theorem. Let ; then under *H*_{ik}, *g*(*P*_{jI|ik} + *P*_{jII|ik}, *q*_{j1|ik}) = 0, and . Now, using the delta method (Serfling 1980), under *H*_{ik}, is asymptotically *N*, where ∑_{j|ik} = Cov(*I*_{n,jI|ik} + *I*_{n,jII|ik}, *J*_{n,j1|ik}).

Similarly, under *H*_{ik}, *T*_{ik} is asymptotically *N*(0, *D*∑_{ik}*D*′), and ∑_{ik} is given by , where Ω = Cov(*I*_{n|ik}), *I*_{n|ik} is the 2(*J* − 1)-dimensional column vector , and , where ⊕ denotes the matrix direct sum, which results in a (*J* − 1) × 2(*J* − 1)-dimensional matrix. *D* is estimated by its empirical version *D̂*, in which *q*_{j1|ik} is replaced by *q̂*_{j1|ik}, and Ω is estimated by and

_{ik}### Derivation of (5):

Let ∑_{A,ik} and ∑_{U,ik} be the asymptotic variance matrices of and under their corresponding null hypotheses. Assume *N*_{A,ik}/*N*_{ik} → α_{A,ik} and *N*_{U,ik}/*N*_{ik} → α_{U,ik} = 1 − α_{A,ik}. Since *q̂*_{jr|A,ik} and *q̂*_{jr|U,ik} are independent, we have asymptotically, under their corresponding null hypotheses, , where . Let *g*(*x*, *y*) = (*x* − *y*)/(1 − *y*); then Δ*g*(*x*, *y*) := (∂*g*/∂*x*, ∂*g*/∂*y*) = (1/(1 − *y*), (*x* − 1)/(1 − *y*)^{2}). Under *H*_{ik}, *R*(*jr*|*ik*) = 0; thus, by the delta method, , where . Similarly, , for some ∑_{ik}, where . Let . Let Ω_{A}, Ω_{U}, and Ω be the asymptotic variance matrices of *J*^{A}_{n|ik}, *J*^{U}_{n|ik}, and *J*_{n|ik}. Let . In the same way as before,
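The gradient of *g* used in this delta-method step can be verified numerically with central finite differences (the evaluation points below are arbitrary interior values):

```python
import numpy as np

def g(x, y):
    # The transform from the derivation of (5): R = (x - y) / (1 - y).
    return (x - y) / (1 - y)

def grad_g(x, y):
    # Closed-form gradient stated in the text:
    # (dg/dx, dg/dy) = (1/(1-y), (x-1)/(1-y)^2).
    return np.array([1.0 / (1.0 - y), (x - 1.0) / (1.0 - y) ** 2])

def fd_grad(x, y, h=1e-6):
    # Central finite differences for comparison (O(h^2) error).
    return np.array([
        (g(x + h, y) - g(x - h, y)) / (2 * h),
        (g(x, y + h) - g(x, y - h)) / (2 * h),
    ])
```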

### Derivation of (6):

Let *R̂*_{i} = (*R̂*_{i1}, *R̂*_{i2}). Under *H*_{i}, asymptotically *R̂*_{i} ∼ *N* for some matrix ∑_{i}. Let . Let *N*_{i} = *N*_{i1} + *N*_{i2}, *N*_{A,i} = *N*_{A,i1} + *N*_{A,i2}, *N*_{U,i} = *N*_{U,i1} + *N*_{U,i2}, α_{A,i} = *N*_{A,i}/*N*_{i}, and α_{U,i} = *N*_{U,i}/*N*_{i} = 1 − α_{A,i}. Let Ω_{A}, Ω_{U}, and Ω be the asymptotic variance matrices of *J*^{A}_{n|i}, *J*^{U}_{n|i}, and *J*_{n|i}. Let . Then, similarly to the derivation of (4), we have

### Derivation of (7):

We need only derive, under *H*_{ik}, the asymptotic distribution of *T̂*_{ik}. We first derive that of for each *j*. Again, we first get the asymptotic distribution of (A1)

The summands above are independent of each other; recall . Since is asymptotically *N*, with , by Slutsky's theorem, (A1) is asymptotically *N*(**0**, Ω_{j}) with

Let *g*(*x*, *y*) be the same as in the derivation of (4), and . Under *H*_{ik}, , and . So *T̂*_{j|ik} is asymptotically normal with zero mean vector and variance matrix . Now the final conclusion follows in the same way as in the derivation of (4).

## Acknowledgments

We thank the two anonymous reviewers, whose comments and suggestions improved the quality of the article, and Nancy Cox for providing us the NIDDM1 data. This work was supported by U.S. Public Service grant AG 16996 from the National Institutes of Health. The software used in this article is written in SAS and can be provided upon request to A.Y. or G.C. at gchen@genomecenter.howard.edu.

## Footnotes

Communicating editor: M. W. Feldman

- Received August 25, 2003.
- Accepted March 22, 2004.

- Genetics Society of America