Genetics, Vol. 167, 1445-1459, July 2004, Copyright © 2004
doi:10.1534/genetics.103.021600

Identifying the Susceptibility Gene(s) in a Set of Trait-Linked Genes Using Genotype Data

* Statistical Genetics and Bioinformatics Unit, National Human Genome Center, Howard University, Washington, DC 20059
{dagger} Genetic Epidemiology Unit, National Human Genome Center, Howard University, Washington, DC 20059

1 Corresponding author: Statistical Genetics and Bioinformatics Unit, National Human Genome Center, 2216 Sixth St., NW, Suite 205, Howard University, Washington, DC 20059.
E-mail: ayuan{at}howard.edu

Manuscript received August 25, 2003. Accepted for publication March 22, 2004.

ABSTRACT

There are generally three steps to isolate a disease linkage-susceptibility gene: genome-wide scan, fine mapping, and, last, positional cloning. The last step is time consuming and involves intensive laboratory work. In some cases, fine mapping cannot proceed further on a set of markers because they are tightly linked. For years, genetic statisticians have been trying different ways to narrow the fine-mapping results to provide some guidance for the next step of laboratory work. Although these methods are practical and efficient, most of them are based on IBD data, which usually can be inferred only from the genotype data with some uncertainty. The corresponding methods thus have no greater power than one using genotype data directly. Also, IBD-based methods apply only to relative pair data. Here, using genotype data, we have developed a statistical hypothesis-testing method to pinpoint a SNP, or SNPs, suspected of responsibility for a disease trait linkage among a set of SNPs tightly linked in a region. Our method uses genotype data of affected individuals or case-control studies, which are widely available in the laboratory. The testing statistic can be constructed using any genotype-based disease-marker disequilibrium measure and is asymptotically distributed as a chi-square mixture. This method can be used for singleton data, relative pair data, or general pedigree data. We have applied the method to simulated data as well as a real data set; it gives satisfactory results.


RECENTLY, genome-wide scans have been widely used in the study of complex genetic diseases such as cardiovascular diseases, obesity, diabetes, schizophrenia, etc., due to the advance in biological science that hundreds of markers could be genotyped quickly with reduced cost. Subsequent fine-mapping studies have been also frequently reported, which narrow the linkage region to a disease trait to about one or a few centimorgans. However, very few of the studies reach the final step of positional cloning to isolate the gene responsible for the linkage to a complex disease. Part of the reason is that the process involves genomic DNA spanning millions of base pairs at the linkage region, sequencing large amounts of the overlapped genomic DNA fragments, and genotyping tens or hundreds of markers in the region, which take intensive work in an ordinary laboratory. In some cases, fine mapping cannot proceed farther on a set of markers because they are tightly linked. For years, genetic statisticians have been trying to develop parametric and/or nonparametric methods to pinpoint the linkage to one or very few markers suspected to be truly responsible for the linkage of a disease trait and to exclude those only in linkage disequilibrium (LD) to the susceptibility markers.

Difficulty in the identification of specific disease-predisposing alleles may result due to multiple genetic factors (TAIT and HARRISON 1991; THOMSON 1991). GREENBERG (1993) and HODGE (1993) considered the analysis of "necessary" vs. "susceptibility" loci in which the associated marker allele itself increases disease susceptibility but is neither necessary nor sufficient for disease expression. The conditioning method is one of the typical statistical tools for studying such problems. FULKER et al. (1999) developed a conditioning method using the variance component model. This method tests both linkage and association at the same time, so that it provides the result whether a locus is the candidate locus to the trait or is just in LD with the candidate locus. This idea was further expanded by CARDON and ABECASIS (2000), in which a combined linkage and association method using the variance components model is proposed. VALDES and THOMSON (1997) and SIEGMUND et al. (2001) used the conditioning method to narrow down the association region. LAZZERONI and LANGE (1998) proposed such a framework in the transmission/disequilibrium test. Furthermore, SORIA et al. (2000) considered a conditioning argument to pinpoint the linkage of the G20210A mutation in the prothrombin gene to the disease gene. On the other hand, BLANGER et al. (2000) studied a Bayesian variance components method, and HORIKAWA et al. (2000) used a modified association study method, which identified a single-nucleotide polymorphism (SNP), SNP43, that showed significant association with the evidence for linkage with type 2 diabetes.

Recently, SUN et al. (2002) proposed a statistical method for this problem. They used a conditioning hypothesis-testing procedure to pinpoint, among a set of tightly trait-linked genes, a single or a few susceptible markers, using identity-by-descent (IBD) data from affected sibships. This method is based on the genome-wide scan result, which identified a region showing strong linkage with a putative trait. Often markers in such a region are tightly linked among themselves. The goal of the method is to identify which of those markers are truly responsible for the linkage and which are merely tightly linked to such putative markers. This method is practical in application and yielded good results in their simulation studies.

However, most of the existing methods for this problem use IBD data on paired family members. Usually IBD data are not fully available in practice and can be inferred only from genotype data with uncertainty and often inconsistently from different methods used. Inference based on them has no greater power than that based on genotype data, unless the IBD data are a sufficient statistic for the parameters underlying the model. Also, IBD-based methods apply only to relative pair data.

Here we present a method for this problem by formulating a set of conditional hypothesis testing, in this respect similar to that in SUN et al. (2002), but we use genotype data instead and the testing statistic is different in nature from theirs. Using any genotype-based trait-marker disequilibrium measure, the testing statistics are constructed by successively conditioning on each of the tightly linked SNP sites. Our method is nonparametric: it does not require model specification or phase information in the data. It applies to family data of arbitrary structure, including singleton data, in which each individual comes from a different and independent family. Under the null hypothesis of being the sole susceptible site, each of these statistics follows asymptotically a chi-square mixture distribution. The corresponding P values are easily obtained via simulation.


THE METHOD

The data:

Let A be the unknown disease allele, for which we want to infer its position in the human genome. Assume that there are J identified SNP markers, Mj (j = 1, ... , J), with alleles Mjk (k = 1, 2), which are brought to our attention because of their tight linkage to the disease allele. A natural question is whether all of them are susceptible genes for the linkage or some of them show disease linkage just because of their strong linkage with the true susceptible gene(s). Our goal here is to identify the true susceptible SNP(s), if any, among them.

For ease of explanation we first describe our method for singleton data and then extend it to general pedigree data in a later section. We now describe a general procedure for the conditional inference of this problem; the construction of the specific testing statistic is detailed later. Let G = (G1, ... , GJ) be a general notation for the composite SNP genotype at all the SNP loci, where Gj = (gj1, gj2) is its allelic notation; Gnj = (gnj1, gnj2) be the observed genotype of the nth individual at the jth SNP locus (n = 1, ... , N; j = 1, ... , J); and Gn = (Gn1, ... , GnJ) be the vector notation of the observed composite genotype of individual n. The data to be used are G1, ... , GN, the observed composite genotypes of N individuals at J SNP loci each.

Here we assumed the common practice that at each SNP locus there are two different alleles in the population; we code them as 1 and 2, although the same value from alleles at different loci may have different allelic meaning. At each locus, we code the genotype as Gnj = 0 when gnj1 != gnj2, Gnj = I when (gnj1, gnj2) = (1, 1), and Gnj = II when (gnj1, gnj2) = (2, 2). Note that we have two representations of a SNP genotype, one allelic and one numerical. Which one(s) will be used, even in the same expression, depends on convenience.

The disequilibrium measure and the conditioning principle:

The proposed conditional testing procedure uses testing statistics, which are constructed via the conditional version of any trait-marker LD measure using genotype data. We first state the conditional testing principle and then give the specific forms of the testing statistic for some particular data designs.

Now we describe the trait-marker LD measure. Let pA be the population frequency of the disease allele A, qjk be those of allele k of marker j, and PA,jk be those of the haplotype (A, Mjk). Let DA,jk = PA,jkpAqjk be the LD measure between the disease allele A and allele k of marker j. Since the position of A is unknown, pA, PA,jk, and thus DA,jk cannot be directly estimated from the observed data; instead various quantities are constructed to infer it.

When DA,jk is positive, the marker allele Mjk is more likely to be associated with the disease-susceptible allele A than would be expected by chance. The disequilibrium measures DA,jk are among the main tools for finding the association between a marker locus (loci) and the disease locus. There are numerous ways to construct inference statistic from the DA,jk's, some using relative pair IBD data at markers, and some using marker genotype data (BENGTSSON and THOMSON 1981; LEHESJOKI et al. 1993; FEDER et al. 1996; NIELSEN et al. 1998). Here we develop the conditional version of the genotype-based method.

Let qjk|i be the population frequency of allele k of the jth SNP conditional on the ith SNP genotype (k = 1, 2). Let PA,jk|i be the population frequency of the disease-SNP haplotype (A, Mjk) at the jth marker locus conditional on the ith SNP genotype and Pjr|i be that of the homozygote SNP genotype r at the jth SNP locus conditional on the ith SNP genotype (r = I, II). We choose the conditional LD measure at the jth locus, given the ith SNP genotype, as

(1)
Note that PAj2|i pAqj2|i = –(PAj1|ipAqj1|i), so only one of the marker alleles is needed to define this disequilibrium. Our motivation to use the conditional LD measure is that if marker i is the sole susceptible site of linkage to the disease allele, then the genotype data from this site constitute a sufficient statistic for this measure, or, in other words, it will explain all the disequilibria in the region. Thus, conditioning on the site of interest, the disequilibria parameters Dj|i vanish from the conditional distribution of the data, for all j != i.

In the following we explain what the conditioning actually means in practice. Suppose we have genotype data for 502 individuals on two SNP loci, each locus has two alleles, and each allele takes one of the two forms that we coded as 1 and 2. The genotype at each locus is thus represented as (1, 1), (1, 2) = (2, 1) and (2, 2). The supposed observed genotype frequencies for the two loci are given in Table 1.


View this table:
In this window
In a new window

 
TABLE 1

Genotype frequencies at two loci

 
By conditioning on the genotype at the first locus being (1, 1), we mean the subgroup of 156 individuals whose genotype at the first locus is (1, 1). Within this subgroup, the genotype at the second locus is denoted as locus 2/(1, 1) and similarly for conditioning on the first locus genotype being (1, 2) and (2, 2). Thus conditioning on the first locus genotype (1, 1), (1, 2), and (2, 2) separately, the data are divided into three nonoverlap subdata sets, and we obtain the genotype frequency of the second locus as shown in Table 2. Likewise, conditioning on the second locus genotypes separately, we get the genotype frequency of the first locus as shown in Table 3.


View this table:
In this window
In a new window

 
TABLE 2

Genotype frequencies at locus 2 conditioning on locus 1

 

View this table:
In this window
In a new window

 
TABLE 3

Genotype frequencies at locus 1 conditioning on locus 2

 
In conditional testing, test statistics are constructed with the data in Tables 2 and 3. For example, to test the hypothesis that locus 1 is the only susceptible site, then conditioning on it we obtain three subtables. If the hypothesis is true, the LD vanish on each of the subtables, and the test statistics constructed from them should manifest nonsignificance.

The hypotheses and testing statistics:

We are interested in testing the null hypothesis Hi: among the set of markers, SNP marker i is the sole cause of the linkage to the disease locus. Here we assume background effects on the linkage are negligible; see the DISCUSSION for more details on this. Under Hi, Dj|i = 0 for all j != i. For each fixed i, testing statistics Sj|i are constructed, usually a function of the empirical version j|i (j != i), such that they tend to be small under Hi and large otherwise. Hi can be decomposed as Hi = Hi1 {oplus} Hi2, where Hik is the hypothesis: genotype k at site i is the sole susceptibility SNP reasonable for the trait LD in the region. Note that when Hik is rejected, we can conclude only that the SNP does not contribute to the LD in the region, that the single or multiple causal polymorphism may not be among those that are typed, or that there is more than one source of such contribution. By testing the sequence of {Hik}, we can find a confidence set, which may consist of a single SNP or several SNPs, or it may be empty. This set may be more accurately inferred by testing the hypothesis of multiple SNPs as in a later subsection. Our method can be used to detect a more detailed local relationship by testing the more detailed hypothesis Hj|i, LD at site j is completely caused by site i, or even the finer hypothesis Hj|ik, LD at site j is completely caused by genotype k of site i.

These last hypotheses are inferred using the statistics Sj|ik, which are the corresponding versions of the Sj|i's for the Hj|ik's. For recessive disease, the conditional statistic notation Sj|ik means . The Sj|ik's are constructed of the form , and the random column vector is jointly asymptotic normal under Hik. Let {sum}ik be the asymptotic variance matrix of Xik and {lambda} = ({lambda}1, ... , {lambda}J–1) be its eigenvalues. Usually {sum}ik and thus {lambda} can be estimated by their empirical version. The particular forms of the Sj|ik's are given later for different data designs.

Asymptotic distribution of the testing statistic:

Let us consider Hik. Its testing statistic is given by

To get the asymptotic distribution of S+|ik({lambda}) or S+|ik under Hik, we first give a general result for the distribution of quadratic form of normal random variables. The proof is given in the APPENDIX.

Proposition:

Let X = (X1, ... , Xd)' be a nondegenerate normal random vector: X ~ N(0, {sum}) (i.e., |{sum}| != 0), with eigenvalues {lambda} = ({lambda}1, ... , {lambda}d); A is a d-dimensional positive definite symmetric matrix with eigenvalues {gamma} = ({gamma}1, ... , {gamma}d); the {lambda}i's and the {gamma}i's keep the same order in the diagonalization. We have

  1. where the Y2j's are independent and identically distributed (IID) {chi}21 random variables.

  2. Let {Gamma} = diag({gamma}1, ... , {gamma}d) and {Lambda} = ({lambda}1, ... , {lambda}d); then

    Especially, when A = Id, we have


Remark:

  1. The case {sum} or A being degenerate is not of much interest and can be avoided easily in the construction of the testing statistic.
  2. It requires that {gamma} and {lambda} be of the same order; this can be done using the same orthogonal matrix (matrices) in the diagonalization of {sum} and A. More conveniently, since it actually used only the {gamma}j{lambda}j's, they are just the eigenvalues of A{sum} (or {sum}A).
  3. Using i or ii is a matter of choice. i is simpler in forming the {chi}2 statistic but not in computing the quantiles or P values, while the order of the {gamma}j{lambda}j's does not matter. ii involves computing A1/2 in forming the {chi}2 statistic, and the order of {gamma}j{lambda}j and that of Xj must match. In practice, this is not trivial; however, it is simpler in computing the quantiles or P values using the existing {chi}2 tables.
  4. Given {gamma} and {lambda}, the density of {chi}2d can be derived by the multiple convolution formula, and thus its {alpha}th quantile and/or the P value of the observed statistic can be obtained. But, more conveniently, for a given level {alpha}, the {alpha}th quantile and/or the P value of the observed statistic can be consistently estimated by their empirical versions.

To sample from {chi}2d, we sample Y21, ... , Y2J–1 from {chi}21 independently; then {gamma}1{lambda}1Y21 + ... + {gamma}d{lambda}dY2d is a sample from {chi}2d.

The {chi}2 linear combination is the general form of the quadratic form of normals. When the Xj's are independent, {lambda}j = Var(Xj); when the Xj's are IID and . There are some other similar results about the quadratic form of normals (GRAYBILL and MARSAGLIA 1957; GOOD 1969; KHATRI 1980, 1982; ANDERSON and STYAN 1982). Our result is independent and not of the same formulations and conditions as the others.

Let the eigenvalues (in their original order) of {sum}ik be {lambda} = ({lambda}1, ... , {lambda}J–1); by ii of the Proposition, we have (see APPENDIX):

Corollary:

Under Hik, asymptotically

and

where the Y2j's are IID {chi}21 random variables.

Thus for given 0 < {alpha} < 1, the asymptotic level {alpha} test of Hik is given by the rejection rule: the P value of the observed S+|ik is smaller than {alpha}, or S+|ik > QJ–1({lambda}, {alpha}), the {alpha}th quantile of the {chi}2J–1 distribution.

Note that our method requires only the genotype information and allele counts at each locus. It does not require phase information in diploids; thus it is practical in applications.

In the following we give the specific forms of the S+|ik's [S+|ik({lambda})'s] under some commonly used settings; those of the S+|ik({lambda})'s are the same and are omitted.

Multiple susceptible loci:

Our method can be extended to the case of multiple susceptible loci without conceptual difficulty, but with more involved computations. Consider the hypothesis Hi1k1,...,irkr (1 ≤ r < J) that the composite genotypes (k1, ... , kr) at loci (i1, ... , ir) are the true susceptible ones. The corresponding testing statistics Sj|i1k1,...,irkr are constructed similarly as before. The only difference is now the inference set, the conditional data set, consisting of those individuals whose alleles at loci (i1, ... , ir) are (k1, ... , kr), and

which is asymptotically {chi}2Jr, and {lambda} = ({lambda}1, ... , {lambda}Jr) is the eigenvalue of the asymptotic variance matrix {sum}i1k1,...,irkr, which is estimated the same way as the single susceptible locus case, but uses the current inference data set.

For fixed r, there are J!(Jr)!/r! of such tests across different choices of loci combinations, and 2r of such tests for each choice of loci combination. So the total number of tests will be 2rJ!(Jr)!/r!.

Note that the above construction of the testing statistic is general; its inference behavior depends on the particular statistic used. The general form of the testing statistic is asymptotically a chi-square mixture, which is centralized under Hik and noncentralized otherwise. The functional form of the parameters of interest entering the noncentrality parameter in the chi-square mixture will explain the behavior of the test in terms of asymptotic power. We give more detail on this for specific tests used in the following sections.


AFFECTED INDIVIDUAL DATA
Now we explain how to construct the S+|ik's in this type of data. In the case J = 1, assume the two SNP alleles are M and , and let A be the disease allele. Let pA, qM, and PAM be the population frequency of the alleles A and M and haplotype AM, respectively, and let DAM = PAM pAqM be the LD. For clarity we first assume the disease is recessive and P(Affected|AA) = 1. Under these assumptions, FEDER et al. (1996) and more specifically NIELSEN et al. (1998) discovered the relationship

where {psi} is the probability that an individual will exhibit the disease due to causes other than this locus, and {phi} is the prevalence of the disease in the population. This equality enables us to detect the marker-disease association by testing Hardy-Weinberg disequilibrium at the marker locus without using IBD information. In fact the connection between the marker allele frequencies and the marker-disease LD is kept if we use only the numerator in the above equality, and this will simplify the computation. That is,

(2)
We derive a conditional version of (2) to serve our purpose.

Since all individuals are affected in this study, we drop off the index "Affected" to simplify the notations. We want to test the hypothesis Hik: SNP type k at locus i is the sole cause of the LD in the region. Let Pjr|ik be the population frequency of genotype r (r = I, II) of locus j given one's genotype being k at locus i, qjr|ik be that of allele r (r = 1, 2) at locus j given one's genotype being k at locus i, {psi}j be the probability that an individual will exhibit the disease due to causes other than locus j, and Dj|ik be the disequilibrium corresponding to the conditional LD measure. Now the same derivation of (2) leads to

(3)
Under Hik, all association of SNP j is completely explained by genotype k of locus i; thus Dj|ik = 0 and hence Tj|ik = 0 (j != i).

We comment that our method works for a general disease model; in this case Tj|ik is still a function of Dj|ik but the expression is more involved (see NIELSEN et al. 1998, pp. 1533–1534), and under Hik we still have Tj|ik = 0 (j != i); hence the test is still valid. In this case, the power and error rate computation will be more involved. The same comment applies to the case-control section also.

Now we construct testing statistics for Hik (i = 1, ... , J). The consistent estimates jr|ik of Pjr|ik and jr|ik of qjr|ik are given by

where is the total number of individuals with the ith SNP genotype being k, and we rearrange them as the first, second, ... , and the Nikth individual. In,jr|ik,(= 0, 1) is the indicator that the nth individual among this set has genotype type r on the jth locus given he (she) has genotype k at locus i, and

where Jn,j1|ik(= 0, 1, 2) is, for the nth individual, the number of times allele 1 occurs at locus j, given one's genotype being k at locus i. The estimate of Tj|ik is

Let ik = (j|ik : j != i) be the (J – 1) dimensional column vector. Under Hik, ik is asymptotically N(0, {sum}ik) for some matrix {sum}ik to be identified later. Let {lambda} = ({lambda}1, ... , {lambda}J–1) be all the eigenvalues of {sum}ik, and . By the Corollary, under Hik asymptotically

and {sum}ik is estimated by

(4)
(APPENDIX), where

and

Here {oplus} means matrix direct summation, which results in a (J 1) x 2(J – 1) dimensional matrix. From ik, we obtain the estimated eigenvalues .

Similarly, for Hi, let Ni = Ni1 + Ni2, {alpha}r = Nir/Ni,

and i = (j|i : j != i). Let {sum}i be the asymptotic matrix of i and {lambda} = ({lambda}1, ... , {lambda}2(J–1)) be all the eigenvalues of {sum}i. Note that i = i1 + i2, and i1 and i2 are independent, so under Hi, i is asymptotically N(0, {sum}i), with . Its estimate is obtained as , and ir is constructed as before.

Let

Under Hi, S+|i ~ {chi}2J–1.

We remark that in the above the asymptotic variance matrices {sum}jk are estimated the same way as for the IID data. In general the familial data are not IID, and the above variance matrices are dealt with differently. Usually, in the positive dependent case, the asymptotic variance matrix will be larger, in the sense of generalized variance—the determinant of the variance matrix—and consequently will tend to have larger eigenvalues than the IID case, such as the singleton data case. In the case of homogeneous familial structure, more accurate estimates can be obtained. We study the above methods for general pedigree data in the extension section later.

In some of the existing methods for this problem, e.g., SUN et al. (2002), the conditional IBD sharing statistics are computed at each site given the genotype at that site. In this way the statistic can test whether each of the sites is the sole susceptible site, but will not be able to find the more detailed relationship between sites when the null hypothesis of only one susceptible site is rejected, while our test statistic can be used to reveal more detailed relationship. If Hik is accepted, it is reasonable to say that the connection between site j and the disease locus is due to genotype k of site i.

By the asymptotic normality of ik and (3), when Hik is false, S+|ik will be asymptotically a noncentral {chi}2J–1 ({lambda}, µ), with noncentrality parameter

It is clear that Hik is true if and only if Dj|ik = 0 (j != i). In terms of µ, the null hypothesis is rephrased as Hik: µ = 0. For a given level {alpha} (= P(reject Hik|Hik is true)), the parameters {lambda}, {psi}js, {phi}, and the Dj|ik's, the asymptotic power of the test is

Here QJ–1({lambda}, {alpha}) is the {alpha}th quantile of the noncentral {chi}2J–1 distribution, which can be simulated by the sampling method after the Remark of the Proposition, but with Y1, ... , YJ–1 independent, and Yj from Nj, 1) with .

For this particular test statistic, since the power is an increasing function of µ, Hik will be more accurately rejected when the {psi}j(1 – {psi})j's and the conditional Dj|ik's are large, and {phi}j is small or the disease is relatively rare. Likewise, Hik will be more correctly accepted when the {psi}j(1 – {psi}j)'s and the conditional Dj|ik's are small (i.e., mainly explained by allele k of locus i), and the disease is relatively common.

Likewise, the error rate, the probability of false acceptance, is


CASE-CONTROL DATA
Let qM|A and qM|U denote marker M population frequencies for the affected (case) and unaffected (control) individuals. BENGTSSON and THOMSON (1981) and LEHESJOKI et al. (1993) gave the following LD measure:

The conditional version of the above is

Let NA and NU be the number of affected and unaffected individuals, and

where

is the total number of "affected" individuals with the ith SNP genotype being k,

where JAn,jr|ik and JUn,jr|ik are the same as the Jn,jr|ik before, but here for affected and unaffected individuals. is the total number of unaffected individuals whose ith SNP genotype is k. Let Nik = NA,ik + NU,ik. Assume NA,ik/Nik -> {alpha}A,ik and NU,ik/Nik -> {alpha}U,ik = 1 – {alpha}A,ik. To test Hik, let

Under Hik, asymptotically

where {lambda} is the vector of eigenvalues of the matrix {sum}ik. Let

Then {sum}ik is estimated by

(5)
(APPENDIX), where, for singleton data, the affected and the unaffected are independent, so ,

and

Similarly, to test Hi, let

Then under Hi, asymptotically S+|i ~ {chi}2J–1, and {lambda} is the vector of eigenvalues of {sum}i, which is estimated by

(6)
(APPENDIX), where, for singleton data, ,

where , Ni = Ni1 + Ni2, NA,i = NA,i1 + NA,i2, NU,i = NU,i1 + NU,i2, {alpha}A,i = NA,i/Ni, and {alpha}U,i = NU,i/Ni = 1 – {alpha}A,i. Similarly,

Other LD measures can also be used, for example, the trend test statistic (ARMITAGE 1955; DEVLIN and ROEDER 1999).

As in the affected individual case, when Hik is not true, S+|ik is asymptotically noncentral {chi}2J–1, where

Given {alpha}, {lambda}, {psi}j, {phi}, pA, the qj2's, and the Dj1|ik's, the power and error rate can be computed by simulation as before, but with Y1, ... , YJ–1 independent, with Yj from Nj, 1), where

Here, the power and probability of correct acceptance of Hik depend on {psi}, {phi}, pA, the qj2, and the Dj1|ik's. The power is maximum when the conditional Dj1|ik's are maximum, and the test is more likely to accept Hik when the Dj1|ik's are small. Their relationships with the other parameters can be analyzed similarly.


EXTENSION TO GENERAL PEDIGREE DATA
As mentioned earlier, the only difference in our methods between general pedigree data and the singleton data is the estimations of the corresponding asymptotic variance matrices. A simple method for this purpose can be found in the work of G. E. BONNEY, V. APPREY and A. YUAN (unpublished data), without any assumption on the data and no extra parameters introduced for the dependence. We illustrate this with the affected familial data, which for the case-control family data is similar. For such data, the estimations for the genotype/allele frequencies in the previous sections are not IID averages; we rewrite them as IID versions, so that their asymptotic variance matrices can be computed easily. First we assume the data have the same familial structure. Suppose there are M families with S individuals each (N = MS). We redefine jr|ik as

where is the total number of families in which at least one individual with SNP type k at locus i, Iik(s, m), is the indicator that in the mth family, there are s individuals with SNP type k at locus i. Ijr|ik (s, m) is the indicator that there are s individuals in family m with SNP type r on the jth locus, given the family is in the group with SNP type k on the ith locus. Let . Then for fixed (jr, ik), {Ijr|ik(m) : m = 1, ... , M} is an IID sequence, and for different (jr, ik) and (j'r', i'k'), {Ijr|ik(m) : m = 1, ... , M} and {Ij'r'|i'k'(m) : m = 1, ... , M} are independent. Similarly, jr|ik is redefined as

where Jjr|ik(s, m) is the count that there are s SNP allele r in family m on the jth locus, and their SNP type is k on the ith locus. Let . Then for fixed (jr, ik), {Jjr|ik(m) : m = 1, ... , M} is an IID sequence, and for different (jr, ik) and (j'r', i'k'), {Jjr|ik(m) : m = 1, ... , M} and {Jj'r'|i'k'(m) : m = 1, ... , M} are independent.

Let j|ik and ik be as before but with jI|ik, jII|ik, and j1|ik replaced by the above versions. Let . Now it is clear that the in (4) can be replaced by the consistent estimator for this case as

where

and ik = ', and is the same as in (4).

More generally, suppose that there are L different familial structures in the data set, with size Ml each, and the lth structure has Sl individuals per family (l = 1, ... , L). Let

where is the total number of families with the structure l in which at least one individual with SNP type k at locus i, Iik and Ijr|ik, is the counterpart of Iik(s, m) and jr|ik(s, m), respectively, for familial structure l. Let . Then for fixed (l, jr, ik), is an IID sequence, and for different (l, jr, ik) and (l', j'r', i'k'), and are independent. Let define the estimate of Pjr|ik as

Similarly, let

where Jjr|ik is the counterpart of Jjr|ik(s, m) for familial structure l. Let , and define

Now for this general pedigree data, let j|ik and ik be as before but with j1|ik, j2|ik, and j1|ik replaced by the above versions. Let , and we assume , then a consistent estimate of {sum}ik is given by

(7)
(APPENDIX), where

and

For the test of Hi, or the case of case-control data, testing statistics and the corresponding asymptotic variance matrices can be obtained in a similar way; we omit the details here.


SIMULATION STUDY
Here we use simulated data to illustrate our method. To exhibit the applicability of our method, we use singleton data, which is out of the scope of the IBD-based methods. We simulate the data G1, ... , GN, where Gn = (Gn1, ... , GnJ) (n = 1, ... , N) and Gnj = (gnj1, gnj2), the two alleles at SNP site j for the nth individual. The gnjk's are coded as 1, 2 for its possible two alleles. We assume phase is known to simplify the simulation process, so that for each n, the two haplotypes (gn11, ... , gnJ1) and (gn12, ... , gnJ2) are independent. In this example, we take J = 6, so all the vectors Gn = (Gn1, ... , Gn6) are random samples from the population genotype S = (S1, ... , S6), and Sj = (sj1, sj2) is the genotype at the jth site. We assume genotype (1, 1) at the third SNP site is responsible for all the LD with the disease allele A; the other first alleles, sj1(j != 3), in this region are tightly linked to s31.

Now the haplotypes S(1) = (s11, ... , s61) and S(2) = (s12, ... , s62) are independent and the sj2's are independent within themselves. Denote and as the two haplotypes of the nth individual. To sample such data, for each n we need only to sample Gn from S(1) and Gn from S(2) independently. Let qA = 0.8 be the frequency of the disease allele allele A = 1 among the affected individuals, q(1) = (q11, ... , q61) be the frequencies of S(1) = (1, ... , 1), and q(2) = (q12, ... , q62) be that of S(2) = (1, ... , 1). To sample from S(2) is trivial; i.e., just sample gnj2 independently from B(qj2), the Bernoulli distribution with probability qj2 of getting 1 and probability 1 – qj2 of getting 0. To sample Gn, we need to sample from a joint Bernoulli distribution with probability q(1). Such a joint distribution can be specified in the form

(COX 1972; FITZMAURICE and LAIRD 1993), where {Psi} and {Omega} are parameters and exp {–A({Psi}, {Omega})} is the normalizing constant and W is all the cross-product terms of S(1), including all the second- and higher-order terms. This distribution can be sampled using the Gibbs sampler (GEMAN and GEMAN 1984). But the specification of the joint Bernoulli distribution has some subjectivity and the sampling scheme is not simple. Instead, we use a normal discretization method to sample it. We use high correlation for linkage. Let {sum} be the corresponding correlation matrix of the J + 1-dimensional normal distribution for (A, S(1)),

Note that this matrix corresponds to a strong connection between A and s31, but not between A and (s11, s21, s31, s41, s51, s61); it also corresponds to a strong connection between s31 and (s11, s21, s31, s41, s51, s61). Thus all the loci have apparent linkage with the disease allele A.

To sample the composite genotypes from the above distribution, let X = (xA, x1, ... , x6) be a sample from the normal distribution N(0, {sum}); if xj < {Phi}–1(qj1), we assign gnj1 = 1; otherwise gnj1 = 0, (j = 1, ... , 6), where {Phi}–1(q) is the qth quantile of the standard normal distribution. Since q31 is the proportion of allele 1, at locus 3, which is linked to the disease allele, in the affected population, the two alleles at locus 3 are in Hardy-Weinberg disequilibrium. The disease is recessive. We make the corresponding conditional probability P(s32 = 1|s31 = 1) high, say 0.8, among the affected individuals. In the simulation, we used a high frequency of qj1 = q = 0.1, 0.2, ... , 0.9, (j != 3) for allele 1 at each locus, to see how this affects the results.

By the same way we simulated control data, in which the two haplotyes are sampled the same way as Gn above. Together with the previous affected data we have case-control data, and the analysis is displayed in Table 6.


View this table:
In this window
In a new window

 
TABLE 6

Case-control data: values of observed S+|j1 (P value)

 
Specifically, the sampling scheme has the following three steps:

For each n = 1, ... , N, (N = 1000):

  1. Draw a sample X = (xA, x1, ... , xJ) from the normal distribution N(0, {sum}); if xj < {Phi}–1(qj1), we assign gnj1 = 1; otherwise gnj1 = 03 (j = 1, ... , 6). Then we get the sample G(1) = (gn11, ... , gnJ1).
  2. If gn31 = 1, set q32 = P(s32 = 1|s31 = 1) = 0.8, else q32 = 0.1. For each j = 1, ... , J, draw X from U(0, 1), the uniform distribution on [0, 1]; if X < qj2 assign gnj2 = 1; otherwise assign gnj2 = 0. Then we get a sample G(2) = (gn12, ... , gnJ2).
  3. Gn = (G(1), G(2)) is a sample from S.

When the two alleles at each locus are in Hardy-Weinberg disequilibrium, we use a two-dimensional normal with mean (0, 0) and variance matrix {Omega} = (1, r; r, 1) with r = 0.2 to model their dependence. For each n, we first get the sample G(1) = (gn11, ... , gnJ1) from (x1, ... , xJ) as before, then for each j = 1, ... , J separately, sample yj from the conditional distribution N(rxj, 1 – r2). If yj < {Phi}–1(qj2), assign gnj2 = 1, otherwise 0.

To simulate the case-control data, we choose q = 0.6 for the case and q = 0.25 for the control.


RESULTS

Simulated data:

We constructed the test statistics S+|ik, (i = 1, ... , J; k = 1, 2) and computed the corresponding eigenvalues {lambda} = ({lambda}1, ... , {lambda}J–1), using the method described in the Remark after the Proposition to compute the {chi}2 P value under the null hypotheses. Since in the simulation the sole linkage with the disease allele comes from s31, we expect H31 will be accepted, and the other Hjk's will be rejected. Table 4 is a summary of the observed values of the S+|j1's for the Hjk's, for different choices of q, with corresponding P values in parentheses. We simulated and computed data for q = 0.1, 0.2, ... , 0.9; we display only part of them to save space.


View this table:
In this window
In a new window

 
TABLE 4

Affected individual data: values of observed S+|j1 (P value) for different q

 
For each testing statistic S+|jk, there is a set of nonnegative eigenvalues {lambda} = ({lambda}1, ... , {lambda}J–1). Their magnitude plays an important role in determining the asymptotic P value of the observed S+|jk. For a given observed value of S+|jk and fixed number of loci J, a roughly larger eigenvalue total |{lambda}| (defined as {lambda}1 + ... + {lambda}J–1) results in a larger P value, and vice versa. Although for two sets of eigenvalues {lambda}1 = ({lambda}11, ... , {lambda}1,J–1) and {lambda}2 = ({lambda}21, ... , {lambda}2,J–1), even if |{lambda}1| = |{lambda}2|, the corresponding distributions {chi}2({lambda}1) and {chi}2({lambda}2) may not be equal, and they are equal if and only if {lambda}(1) = {lambda}(2), where {lambda}(k) = ({lambda}k,(1), ... , {lambda}k,(J–1)) is the ordered version of {lambda}k (k = 1, 2).

We display in the following the eigenvalues {lambda}j = ({lambda}j1, ... , {lambda}j5) for the S+|j1's, for the case q = 0.7.


and

We find that in most cases the P values of S+|31 suggest acceptance of H31 with high confidence, and those for S+|j1 (j != 3), suggest rejection of Hj1, except for the case q = 0.9, in which the P values of S+|51 and S+|61 are also significant, along with that of S+|31. We regard this last case as exceptional, in which the over-high proportion of allele 1 at each locus blurred the identifiability of the problem (think of the extreme case of q {cong} 1; the corresponding locus contributes nearly no information for the problem). Thus, in all these cases, the true hypothesis H31 is accepted with high confidence, and the other false ones, Hj1, are rejected; i.e., the true disease-linkage-related allele 1 at locus 3 is correctly identified among all six loci that are all in LD with the disease locus.

To investigate the influence of the deviation from Hardy-Weinberg on our method, we simulated the data for this case, in which we use the allelic correlation r != 0 at each locus for the deviation from Hardy-Weinberg equilibrium (HWE). The disease allele population frequency is fixed at q = 0.7 and the results are displayed in Table 5.


View this table:
In this window
In a new window

 
TABLE 5

Affected individual data: values of observed S+|j1 (P value) for different r

 
In the non-HWE case, it seems that the true picture becomes more difficult to recover as the deviation from HWE increases. In general, significant departures from HWE are not expected, but if observed, caution should be taken in applying this method (if genotyping error is present, for example). In particular, in situations in which nonrandom mating is a known confounder because of inbreeding or population structure, care should be exercised.

For the case-control data, we used q = 0.6 for the case and q = 0.25 for the control; HWE is assumed, and again locus 3 is the only connection to the disease allele. The results are shown in Table 6. It is seen that again, for the case-control data SNP locus 3 is correctly identified, and all the other loci are rejected as sources of cause for LD in the region.

The following is a tabulation of power of the test for the above simulated data, using the above {lambda} and some combinations of {alpha}, {psi}j = {psi}(j != i), {phi}, and Dj|31 = d(j != 3). To get a sense of the power behavior of our methods, we choose J = 6, {lambda} = {lambda}1 as shown before. The noncentrality parameter µ involves 2J – 1 parameters in it. It is impractical to investigate and tabulate the influence of each of the 2J – 1 parameters to the power. Instead, we investigate the influence of µ to the power, with the given genetic model. Each given value of µ, corresponding to a 2(J – 1)-dimensional parameter subspace, is given by the formula for µ. Table 7 shows the display of power for both the affected individual data and the case-control data, for some choices of the level {alpha} and the parameter µ. We comment that for the above specification of the parameter µ, the power of the tests for both the affected individual data and that for the case-control data are the same.


View this table:
In this window
In a new window

 
TABLE 7

Power for some given parameter values

 
Since the µ in the power of the test for affected individual data and that for the case-control data have different expressions, more detailed power computation can be obtained by the specification in terms of all the parameters involved.

Application to real data:

Non-insulin-dependent diabetes mellitus-1 data:

We first apply our method to the non-insulin-dependent diabetes mellitus-1 (NIDDM1) data used in SUN et al. (2002) and list our results along with theirs in Table 8. We see that, for these data, the two methods yield quite different, although not contradictory, results. With the method of Sun et al., loci 2 and 12 are most likely responsible for the LD, while by our method, loci 2, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 20, and 22 all likely contribute to the LD in the region. One possibility for the difference of the two methods might be that the calpain-10 region has some patterns of LD that are not understood—violating one of the assumptions of the methods. Since the truth in the data is unknown, we do not comment on the performances of the two methods on these data. It is not uncommon in the hypothesis test context, even for methods based on the same type of data, that different methods may have different results, even contradictory ones. In principle, methods using genotype data have no less power in inference than those using IBD data. Here it is too early to comment on the pros and cons for the two types of methods. A formal assessment may involve long-term and large-scale studies. At least our method provides the user more options and a flexible tool for this problem. Also, more methods will give us more strength in the inference. If the methods give consistent results, this will strengthen our confidence in decision; if they do not or are contradictory, the problem may need further investigation. We may perform the hypothesis tests on the current confidence set and continue this way to get a final confidence set of SNPs, in which all of them are accepted as possible sources of LD in the region. We do not pursue this in detail here because of space limitation.


View this table:
In this window
In a new window

 
TABLE 8

Results from the NIDDM1 data

 

Diabetes data:

Next, in a diabetes study, 280 individuals with type 2 diabetes were genotyped at a large number of SNP sites. First we find those SNPs with strong linkage to the trait and then use our method to identify the susceptible one(s). We use the measure of NIELSEN et al. (1998) to detect the marker-disease association, which is given by

where ij|Affected and i|Affected are the estimated frequencies of marker genotype AiAj and allele Ai from the observed affected individuals and m is the total number of alleles. They showed that this marker Hardy-Weinberg disequilibrium measure is proportional to the square of the disease-marker LD measure. Under the null hypothesis that there is no disease-marker LD, {chi}2HW is approximately distributed as a {chi}2 variable with degrees of freedom m(m 1)/2.

After computing the value of {chi}2HW at each marker and their corresponding P values, we found that 13 of the markers significantly indicate strong evidence of disease-marker disequilibrium. To apply our method, we choose a set of six SNPs, and we code them as sites 1–6 for simplicity. The {chi}2HW values are displayed in Table 9, along with their P values in parentheses.


View this table:
In this window
In a new window

 
TABLE 9

Values of observed {chi}2HW

 
We see from this table that all six loci are very tightly linked to the trait. Now we use our method to identify which one of the six SNPs is the sole true cause of linkage, if any. The computed values of the conditional testing statistics and their P values are in