- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Yuan, A.
- Articles by Bonney, G. E.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Yuan, A.
- Articles by Bonney, G. E.
Genetics, Vol. 167, 1445-1459, July 2004, Copyright © 2004
doi:10.1534/genetics.103.021600
Identifying the Susceptibility Gene(s) in a Set of Trait-Linked Genes Using Genotype Data
Ao Yuan*,1,
Guanjie Chen
,
Yuanxiu Chen
,
Charles Rotimi
and
George E. Bonney*
* Statistical Genetics and Bioinformatics Unit, National Human Genome Center, Howard University, Washington, DC 20059
Genetic Epidemiology Unit, National Human Genome Center, Howard University, Washington, DC 20059
1 Corresponding author: Statistical Genetics and Bioinformatics Unit, National Human Genome Center, 2216 Sixth St., NW, Suite 205, Howard University, Washington, DC 20059.
E-mail: ayuan{at}howard.edu
>ABSTRACT
THE METHOD
AFFECTED INDIVIDUAL DATA
CASE-CONTROL DATA
EXTENSION TO GENERAL PEDIGREE...
SIMULATION STUDY
RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
There are generally three steps to isolate a disease linkage-susceptibility gene: genome-wide scan, fine mapping, and, last, positional cloning. The last step is time consuming and involves intensive laboratory work. In some cases, fine mapping cannot proceed further on a set of markers because they are tightly linked. For years, genetic statisticians have been trying different ways to narrow the fine-mapping results to provide some guidance for the next step of laboratory work. Although these methods are practical and efficient, most of them are based on IBD data, which usually can be inferred only from the genotype data with some uncertainty. The corresponding methods thus have no greater power than one using genotype data directly. Also, IBD-based methods apply only to relative pair data. Here, using genotype data, we have developed a statistical hypothesis-testing method to pinpoint a SNP, or SNPs, suspected of responsibility for a disease trait linkage among a set of SNPs tightly linked in a region. Our method uses genotype data of affected individuals or case-control studies, which are widely available in the laboratory. The testing statistic can be constructed using any genotype-based disease-marker disequilibrium measure and is asymptotically distributed as a chi-square mixture. This method can be used for singleton data, relative pair data, or general pedigree data. We have applied the method to simulated data as well as a real data set; it gives satisfactory results.
RECENTLY, genome-wide scans have been widely used in the study of complex genetic diseases such as cardiovascular diseases, obesity, diabetes, schizophrenia, etc., due to the advance in biological science that hundreds of markers could be genotyped quickly with reduced cost. Subsequent fine-mapping studies have been also frequently reported, which narrow the linkage region to a disease trait to about one or a few centimorgans. However, very few of the studies reach the final step of positional cloning to isolate the gene responsible for the linkage to a complex disease. Part of the reason is that the process involves genomic DNA spanning millions of base pairs at the linkage region, sequencing large amounts of the overlapped genomic DNA fragments, and genotyping tens or hundreds of markers in the region, which take intensive work in an ordinary laboratory. In some cases, fine mapping cannot proceed farther on a set of markers because they are tightly linked. For years, genetic statisticians have been trying to develop parametric and/or nonparametric methods to pinpoint the linkage to one or very few markers suspected to be truly responsible for the linkage of a disease trait and to exclude those only in linkage disequilibrium (LD) to the susceptibility markers.
Difficulty in the identification of specific disease-predisposing alleles may result due to multiple genetic factors (TAIT and HARRISON 1991; THOMSON 1991). GREENBERG (1993) and HODGE (1993) considered the analysis of "necessary" vs. "susceptibility" loci in which the associated marker allele itself increases disease susceptibility but is neither necessary nor sufficient for disease expression. The conditioning method is one of the typical statistical tools for studying such problems. FULKER et al. (1999) developed a conditioning method using the variance component model. This method tests both linkage and association at the same time, so that it provides the result whether a locus is the candidate locus to the trait or is just in LD with the candidate locus. This idea was further expanded by CARDON and ABECASIS (2000), in which a combined linkage and association method using the variance components model is proposed. VALDES and THOMSON (1997) and SIEGMUND et al. (2001) used the conditioning method to narrow down the association region. LAZZERONI and LANGE (1998) proposed such a framework in the transmission/disequilibrium test. Furthermore, SORIA et al. (2000) considered a conditioning argument to pinpoint the linkage of the G20210A mutation in the prothrombin gene to the disease gene. On the other hand, BLANGER et al. (2000) studied a Bayesian variance components method, and HORIKAWA et al. (2000) used a modified association study method, which identified a single-nucleotide polymorphism (SNP), SNP43, that showed significant association with the evidence for linkage with type 2 diabetes.
Recently, SUN et al. (2002) proposed a statistical method for this problem. They used a conditioning hypothesis-testing procedure to pinpoint, among a set of tightly trait-linked genes, a single or a few susceptible markers, using identity-by-descent (IBD) data from affected sibships. This method is based on the genome-wide scan result, which identified a region showing strong linkage with a putative trait. Often markers in such a region are tightly linked among themselves. The goal of the method is to identify which of those markers are truly responsible for the linkage and which are merely tightly linked to such putative markers. This method is practical in application and yielded good results in their simulation studies.
However, most of the existing methods for this problem use IBD data on paired family members. Usually IBD data are not fully available in practice and can be inferred only from genotype data with uncertainty and often inconsistently from different methods used. Inference based on them has no greater power than that based on genotype data, unless the IBD data are a sufficient statistic for the parameters underlying the model. Also, IBD-based methods apply only to relative pair data.
Here we present a method for this problem by formulating a set of conditional hypothesis testing, in this respect similar to that in SUN et al. (2002), but we use genotype data instead and the testing statistic is different in nature from theirs. Using any genotype-based trait-marker disequilibrium measure, the testing statistics are constructed by successively conditioning on each of the tightly linked SNP sites. Our method is nonparametric: it does not require model specification or phase information in the data. It applies to family data of arbitrary structure, including singleton data, in which each individual comes from a different and independent family. Under the null hypothesis of being the sole susceptible site, each of these statistics follows asymptotically a chi-square mixture distribution. The corresponding P values are easily obtained via simulation.
ABSTRACT
>THE METHOD
AFFECTED INDIVIDUAL DATA
CASE-CONTROL DATA
EXTENSION TO GENERAL PEDIGREE...
SIMULATION STUDY
RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
The data:
Let A be the unknown disease allele, for which we want to infer its position in the human genome. Assume that there are J identified SNP markers, Mj (j = 1, ... , J), with alleles Mjk (k = 1, 2), which are brought to our attention because of their tight linkage to the disease allele. A natural question is whether all of them are susceptible genes for the linkage or some of them show disease linkage just because of their strong linkage with the true susceptible gene(s). Our goal here is to identify the true susceptible SNP(s), if any, among them.For ease of explanation we first describe our method for singleton data and then extend it to general pedigree data in a later section. We now describe a general procedure for the conditional inference of this problem; the construction of the specific testing statistic is detailed later. Let G = (G1, ... , GJ) be a general notation for the composite SNP genotype at all the SNP loci, where Gj = (gj1, gj2) is its allelic notation; Gnj = (gnj1, gnj2) be the observed genotype of the nth individual at the jth SNP locus (n = 1, ... , N; j = 1, ... , J); and Gn = (Gn1, ... , GnJ) be the vector notation of the observed composite genotype of individual n. The data to be used are G1, ... , GN, the observed composite genotypes of N individuals at J SNP loci each.
Here we assumed the common practice that at each SNP locus there are two different alleles in the population; we code them as 1 and 2, although the same value from alleles at different loci may have different allelic meaning. At each locus, we code the genotype as Gnj = 0 when gnj1
gnj2, Gnj = I when (gnj1, gnj2) = (1, 1), and Gnj = II when (gnj1, gnj2) = (2, 2). Note that we have two representations of a SNP genotype, one allelic and one numerical. Which one(s) will be used, even in the same expression, depends on convenience.
The disequilibrium measure and the conditioning principle:
The proposed conditional testing procedure uses testing statistics, which are constructed via the conditional version of any trait-marker LD measure using genotype data. We first state the conditional testing principle and then give the specific forms of the testing statistic for some particular data designs.Now we describe the trait-marker LD measure. Let pA be the population frequency of the disease allele A, qjk be those of allele k of marker j, and PA,jk be those of the haplotype (A, Mjk). Let DA,jk = PA,jk pAqjk be the LD measure between the disease allele A and allele k of marker j. Since the position of A is unknown, pA, PA,jk, and thus DA,jk cannot be directly estimated from the observed data; instead various quantities are constructed to infer it.
When DA,jk is positive, the marker allele Mjk is more likely to be associated with the disease-susceptible allele A than would be expected by chance. The disequilibrium measures DA,jk are among the main tools for finding the association between a marker locus (loci) and the disease locus. There are numerous ways to construct inference statistic from the DA,jk's, some using relative pair IBD data at markers, and some using marker genotype data (BENGTSSON and THOMSON 1981; LEHESJOKI et al. 1993; FEDER et al. 1996; NIELSEN et al. 1998). Here we develop the conditional version of the genotype-based method.
Let qjk|i be the population frequency of allele k of the jth SNP conditional on the ith SNP genotype (k = 1, 2). Let PA,jk|i be the population frequency of the disease-SNP haplotype (A, Mjk) at the jth marker locus conditional on the ith SNP genotype and Pjr|i be that of the homozygote SNP genotype r at the jth SNP locus conditional on the ith SNP genotype (r = I, II). We choose the conditional LD measure at the jth locus, given the ith SNP genotype, as
![]() | (1) |
i. In the following we explain what the conditioning actually means in practice. Suppose we have genotype data for 502 individuals on two SNP loci, each locus has two alleles, and each allele takes one of the two forms that we coded as 1 and 2. The genotype at each locus is thus represented as (1, 1), (1, 2) = (2, 1) and (2, 2). The supposed observed genotype frequencies for the two loci are given in Table 1.
|
By conditioning on the genotype at the first locus being (1, 1), we mean the subgroup of 156 individuals whose genotype at the first locus is (1, 1). Within this subgroup, the genotype at the second locus is denoted as locus 2/(1, 1) and similarly for conditioning on the first locus genotype being (1, 2) and (2, 2). Thus conditioning on the first locus genotype (1, 1), (1, 2), and (2, 2) separately, the data are divided into three nonoverlap subdata sets, and we obtain the genotype frequency of the second locus as shown in Table 2. Likewise, conditioning on the second locus genotypes separately, we get the genotype frequency of the first locus as shown in Table 3.
|
|
In conditional testing, test statistics are constructed with the data in Tables 2 and 3. For example, to test the hypothesis that locus 1 is the only susceptible site, then conditioning on it we obtain three subtables. If the hypothesis is true, the LD vanish on each of the subtables, and the test statistics constructed from them should manifest nonsignificance.
The hypotheses and testing statistics:
We are interested in testing the null hypothesis Hi: among the set of markers, SNP marker i is the sole cause of the linkage to the disease locus. Here we assume background effects on the linkage are negligible; see the DISCUSSION for more details on this. Under Hi, Dj|i = 0 for all j
i. For each fixed i, testing statistics Sj|i are constructed, usually a function of the empirical version
j|i (j
i), such that they tend to be small under Hi and large otherwise. Hi can be decomposed as Hi = Hi1
Hi2, where Hik is the hypothesis: genotype k at site i is the sole susceptibility SNP reasonable for the trait LD in the region. Note that when Hik is rejected, we can conclude only that the SNP does not contribute to the LD in the region, that the single or multiple causal polymorphism may not be among those that are typed, or that there is more than one source of such contribution. By testing the sequence of {Hik}, we can find a confidence set, which may consist of a single SNP or several SNPs, or it may be empty. This set may be more accurately inferred by testing the hypothesis of multiple SNPs as in a later subsection. Our method can be used to detect a more detailed local relationship by testing the more detailed hypothesis Hj|i, LD at site j is completely caused by site i, or even the finer hypothesis Hj|ik, LD at site j is completely caused by genotype k of site i.
These last hypotheses are inferred using the statistics Sj|ik, which are the corresponding versions of the Sj|i's for the Hj|ik's. For recessive disease, the conditional statistic notation Sj|ik means
. The Sj|ik's are constructed of the form
, and the random column vector
is jointly asymptotic normal under Hik. Let
ik be the asymptotic variance matrix of Xik and
= (
1, ... ,
J1) be its eigenvalues. Usually
ik and thus
can be estimated by their empirical version. The particular forms of the Sj|ik's are given later for different data designs.
Asymptotic distribution of the testing statistic:
Let us consider Hik. Its testing statistic is given by
![]() |
) or S+|ik under Hik, we first give a general result for the distribution of quadratic form of normal random variables. The proof is given in the APPENDIX.
Proposition:
Let X = (X1, ... , Xd)' be a nondegenerate normal random vector: X
N(0,
) (i.e., |
|
0), with eigenvalues
= (
1, ... ,
d); A is a d-dimensional positive definite symmetric matrix with eigenvalues
= (
1, ... ,
d); the
i's and the
i's keep the same order in the diagonalization. We have-
where the Y2j's are independent and identically distributed (IID)
21 random variables.
- Let
= diag(
1, ... ,
d) and
= (
1, ... ,
d); then
Especially, when A = Id, we have

Remark:
- The case
or A being degenerate is not of much interest and can be avoided easily in the construction of the testing statistic.
- It requires that
and
be of the same order; this can be done using the same orthogonal matrix (matrices) in the diagonalization of
and A. More conveniently, since it actually used only the
j
j's, they are just the eigenvalues of A
(or
A).
- Using i or ii is a matter of choice. i is simpler in forming the
2 statistic but not in computing the quantiles or P values, while the order of the
j
j's does not matter. ii involves computing A1/2 in forming the
2 statistic, and the order of
j
j and that of Xj must match. In practice, this is not trivial; however, it is simpler in computing the quantiles or P values using the existing
2 tables.
- Given
and
, the density of
2d
can be derived by the multiple convolution formula, and thus its
th quantile and/or the P value of the observed statistic can be obtained. But, more conveniently, for a given level
, the
th quantile and/or the P value of the observed statistic can be consistently estimated by their empirical versions.
To sample from
2d
, we sample Y21, ... , Y2J1 from
21 independently; then
1
1Y21 + ... +
d
dY2d is a sample from
2d
.
The
2 linear combination is the general form of the quadratic form of normals. When the Xj's are independent,
j = Var(Xj); when the Xj's are IID and
. There are some other similar results about the quadratic form of normals (GRAYBILL and MARSAGLIA 1957; GOOD 1969; KHATRI 1980, 1982; ANDERSON and STYAN 1982). Our result is independent and not of the same formulations and conditions as the others.
Let the eigenvalues (in their original order) of
ik be
= (
1, ... ,
J1); by ii of the Proposition, we have (see APPENDIX):
Corollary:
Under Hik, asymptotically
![]() |
![]() |
21 random variables.
Thus for given 0 <
< 1, the asymptotic level
test of Hik is given by the rejection rule: the P value of the observed S+|ik is smaller than
, or S+|ik > QJ1(
,
), the
th quantile of the
2J1
distribution.
Note that our method requires only the genotype information and allele counts at each locus. It does not require phase information in diploids; thus it is practical in applications.
In the following we give the specific forms of the S+|ik's [S+|ik(
)'s] under some commonly used settings; those of the S+|ik(
)'s are the same and are omitted.
Multiple susceptible loci:
Our method can be extended to the case of multiple susceptible loci without conceptual difficulty, but with more involved computations. Consider the hypothesis Hi1k1,...,irkr (1
r < J) that the composite genotypes (k1, ... , kr) at loci (i1, ... , ir) are the true susceptible ones. The corresponding testing statistics Sj|i1k1,...,irkr are constructed similarly as before. The only difference is now the inference set, the conditional data set, consisting of those individuals whose alleles at loci (i1, ... , ir) are (k1, ... , kr), and
![]() |
2Jr
, and
= (
1, ... ,
Jr) is the eigenvalue of the asymptotic variance matrix
i1k1,...,irkr, which is estimated the same way as the single susceptible locus case, but uses the current inference data set. For fixed r, there are J!(J r)!/r! of such tests across different choices of loci combinations, and 2r of such tests for each choice of loci combination. So the total number of tests will be 2rJ!(J r)!/r!.
Note that the above construction of the testing statistic is general; its inference behavior depends on the particular statistic used. The general form of the testing statistic is asymptotically a chi-square mixture, which is centralized under Hik and noncentralized otherwise. The functional form of the parameters of interest entering the noncentrality parameter in the chi-square mixture will explain the behavior of the test in terms of asymptotic power. We give more detail on this for specific tests used in the following sections.
ABSTRACT
THE METHOD
>AFFECTED INDIVIDUAL DATA
CASE-CONTROL DATA
EXTENSION TO GENERAL PEDIGREE...
SIMULATION STUDY
RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
, and let A be the disease allele. Let pA, qM, and PAM be the population frequency of the alleles A and M and haplotype AM, respectively, and let DAM = PAM pAqM be the LD. For clarity we first assume the disease is recessive and P(Affected|AA) = 1. Under these assumptions, FEDER et al. (1996) and more specifically NIELSEN et al. (1998) discovered the relationship
![]() |
is the probability that an individual will exhibit the disease due to causes other than this locus, and
is the prevalence of the disease in the population. This equality enables us to detect the marker-disease association by testing Hardy-Weinberg disequilibrium at the marker locus without using IBD information. In fact the connection between the marker allele frequencies and the marker-disease LD is kept if we use only the numerator in the above equality, and this will simplify the computation. That is,
![]() | (2) |
Since all individuals are affected in this study, we drop off the index "Affected" to simplify the notations. We want to test the hypothesis Hik: SNP type k at locus i is the sole cause of the LD in the region. Let Pjr|ik be the population frequency of genotype r (r = I, II) of locus j given one's genotype being k at locus i, qjr|ik be that of allele r (r = 1, 2) at locus j given one's genotype being k at locus i,
j be the probability that an individual will exhibit the disease due to causes other than locus j, and Dj|ik be the disequilibrium corresponding to the conditional LD measure. Now the same derivation of (2) leads to
![]() | (3) |
i).
We comment that our method works for a general disease model; in this case Tj|ik is still a function of Dj|ik but the expression is more involved (see NIELSEN et al. 1998, pp. 15331534), and under Hik we still have Tj|ik = 0 (j
i); hence the test is still valid. In this case, the power and error rate computation will be more involved. The same comment applies to the case-control section also.
Now we construct testing statistics for Hik (i = 1, ... , J). The consistent estimates
jr|ik of Pjr|ik and
jr|ik of qjr|ik are given by
![]() |
is the total number of individuals with the ith SNP genotype being k, and we rearrange them as the first, second, ... , and the Nikth individual. In,jr|ik,(= 0, 1) is the indicator that the nth individual among this set has genotype type r on the jth locus given he (she) has genotype k at locus i, and
![]() |
![]() |
Let
ik = (
j|ik : j
i) be the (J 1) dimensional column vector. Under Hik, 
ik is asymptotically N(0,
ik) for some matrix
ik to be identified later. Let
= (
1, ... ,
J1) be all the eigenvalues of
ik, and
. By the Corollary, under Hik asymptotically
![]() |
ik is estimated by
![]() | (4) |
![]() |
![]() |
Here
means matrix direct summation, which results in a (J 1) x 2(J 1) dimensional matrix. From
ik, we obtain the estimated eigenvalues
.
Similarly, for Hi, let Ni = Ni1 + Ni2,
r = Nir/Ni,
![]() |
i = (
j|i : j
i). Let
i be the asymptotic matrix of
i and
= (
1, ... ,
2(J1)) be all the eigenvalues of
i. Note that
i =
i1 +
i2, and
i1 and
i2 are independent, so under Hi, 
i is asymptotically N(0,
i), with
. Its estimate is obtained as
, and
ir is constructed as before.
Let
![]() |
2J1
.
We remark that in the above the asymptotic variance matrices
jk are estimated the same way as for the IID data. In general the familial data are not IID, and the above variance matrices are dealt with differently. Usually, in the positive dependent case, the asymptotic variance matrix will be larger, in the sense of generalized variancethe determinant of the variance matrixand consequently will tend to have larger eigenvalues than the IID case, such as the singleton data case. In the case of homogeneous familial structure, more accurate estimates can be obtained. We study the above methods for general pedigree data in the extension section later.
In some of the existing methods for this problem, e.g., SUN et al. (2002), the conditional IBD sharing statistics are computed at each site given the genotype at that site. In this way the statistic can test whether each of the sites is the sole susceptible site, but will not be able to find the more detailed relationship between sites when the null hypothesis of only one susceptible site is rejected, while our test statistic can be used to reveal more detailed relationship. If Hik is accepted, it is reasonable to say that the connection between site j and the disease locus is due to genotype k of site i.
By the asymptotic normality of 
ik and (3), when Hik is false, S+|ik will be asymptotically a noncentral
2J1 (
, µ), with noncentrality parameter
![]() |
i). In terms of µ, the null hypothesis is rephrased as Hik: µ = 0. For a given level
(= P(reject Hik|Hik is true)), the parameters
,
js,
, and the Dj|ik's, the asymptotic power of the test is
![]() |
,
) is the
th quantile of the noncentral
2J1
distribution, which can be simulated by the sampling method after the Remark of the Proposition, but with Y1, ... , YJ1 independent, and Yj from N(µj, 1) with
.
For this particular test statistic, since the power is an increasing function of µ, Hik will be more accurately rejected when the
j(1
)j's and the conditional Dj|ik's are large, and
j is small or the disease is relatively rare. Likewise, Hik will be more correctly accepted when the
j(1
j)'s and the conditional Dj|ik's are small (i.e., mainly explained by allele k of locus i), and the disease is relatively common.
Likewise, the error rate, the probability of false acceptance, is
![]() |
ABSTRACT
THE METHOD
AFFECTED INDIVIDUAL DATA
>CASE-CONTROL DATA
EXTENSION TO GENERAL PEDIGREE...
SIMULATION STUDY
RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
![]() |
![]() |
![]() |
![]() |
is the total number of "affected" individuals with the ith SNP genotype being k,
![]() |
is the total number of unaffected individuals whose ith SNP genotype is k. Let Nik = NA,ik + NU,ik. Assume NA,ik/Nik
A,ik and NU,ik/Nik
U,ik = 1
A,ik. To test Hik, let
![]() |
![]() |
is the vector of eigenvalues of the matrix
ik. Let
![]() |
ik is estimated by
![]() | (5) |
,
![]() |
![]() |
Similarly, to test Hi, let
![]() |
2J1
, and
is the vector of eigenvalues of
i, which is estimated by
![]() | (6) |
,
![]() |
, Ni = Ni1 + Ni2, NA,i = NA,i1 + NA,i2, NU,i = NU,i1 + NU,i2,
A,i = NA,i/Ni, and
U,i = NU,i/Ni = 1
A,i. Similarly,
![]() |
Other LD measures can also be used, for example, the trend test statistic (ARMITAGE 1955; DEVLIN and ROEDER 1999).
As in the affected individual case, when Hik is not true, S+|ik is asymptotically noncentral
2J1
, where
![]() |
,
,
j,
, pA, the qj2's, and the Dj1|ik's, the power and error rate can be computed by simulation as before, but with Y1, ... , YJ1 independent, with Yj from N(µj, 1), where
![]() |
Here, the power and probability of correct acceptance of Hik depend on
,
, pA, the qj2, and the Dj1|ik's. The power is maximum when the conditional Dj1|ik's are maximum, and the test is more likely to accept Hik when the Dj1|ik's are small. Their relationships with the other parameters can be analyzed similarly.
ABSTRACT
THE METHOD
AFFECTED INDIVIDUAL DATA
CASE-CONTROL DATA
>EXTENSION TO GENERAL PEDIGREE...
SIMULATION STUDY
RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
jr|ik as
![]() |
is the total number of families in which at least one individual with SNP type k at locus i, Iik(s, m), is the indicator that in the mth family, there are s individuals with SNP type k at locus i. Ijr|ik (s, m) is the indicator that there are s individuals in family m with SNP type r on the jth locus, given the family is in the group with SNP type k on the ith locus. Let
. Then for fixed (jr, ik), {Ijr|ik(m) : m = 1, ... , M} is an IID sequence, and for different (jr, ik) and (j'r', i'k'), {Ijr|ik(m) : m = 1, ... , M} and {Ij'r'|i'k'(m) : m = 1, ... , M} are independent. Similarly,
jr|ik is redefined as
![]() |
. Then for fixed (jr, ik), {Jjr|ik(m) : m = 1, ... , M} is an IID sequence, and for different (jr, ik) and (j'r', i'k'), {Jjr|ik(m) : m = 1, ... , M} and {Jj'r'|i'k'(m) : m = 1, ... , M} are independent.
Let
j|ik and
ik be as before but with
jI|ik,
jII|ik, and
j1|ik replaced by the above versions. Let
. Now it is clear that the
in (4) can be replaced by the consistent estimator for this case as
![]() |
![]() |
ik = 

', and
is the same as in (4).
More generally, suppose that there are L different familial structures in the data set, with size Ml each, and the lth structure has Sl individuals per family (l = 1, ... , L). Let
![]() |
is the total number of families with the structure l in which at least one individual with SNP type k at locus i, I
ik
and I
jr|ik
, is the counterpart of Iik(s, m) and jr|ik(s, m), respectively, for familial structure l. Let
. Then for fixed (l, jr, ik),
is an IID sequence, and for different (l, jr, ik) and (l', j'r', i'k'),
and
are independent. Let
define the estimate of Pjr|ik as
![]() |
Similarly, let
![]() |
jr|ik
is the counterpart of Jjr|ik(s, m) for familial structure l. Let
, and define
![]() |
Now for this general pedigree data, let
j|ik and
ik be as before but with
j1|ik,
j2|ik, and
j1|ik replaced by the above versions. Let
, and we assume
, then a consistent estimate of
ik is given by
![]() | (7) |
![]() |
![]() |
ABSTRACT
THE METHOD
AFFECTED INDIVIDUAL DATA
CASE-CONTROL DATA
EXTENSION TO GENERAL PEDIGREE...
>SIMULATION STUDY
RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
3), in this region are tightly linked to s31.
Now the haplotypes S(1) = (s11, ... , s61) and S(2) = (s12, ... , s62) are independent and the sj2's are independent within themselves. Denote
and
as the two haplotypes of the nth individual. To sample such data, for each n we need only to sample G
n from S(1) and G
n from S(2) independently. Let qA = 0.8 be the frequency of the disease allele allele A = 1 among the affected individuals, q(1) = (q11, ... , q61) be the frequencies of S(1) = (1, ... , 1), and q(2) = (q12, ... , q62) be that of S(2) = (1, ... , 1). To sample from S(2) is trivial; i.e., just sample gnj2 independently from B(qj2), the Bernoulli distribution with probability qj2 of getting 1 and probability 1 qj2 of getting 0. To sample G
n, we need to sample from a joint Bernoulli distribution with probability q(1). Such a joint distribution can be specified in the form
![]() |
and
are parameters and exp {A(
,
)} is the normalizing constant and W is all the cross-product terms of S(1), including all the second- and higher-order terms. This distribution can be sampled using the Gibbs sampler (GEMAN and GEMAN 1984). But the specification of the joint Bernoulli distribution has some subjectivity and the sampling scheme is not simple. Instead, we use a normal discretization method to sample it. We use high correlation for linkage. Let
be the corresponding correlation matrix of the J + 1-dimensional normal distribution for (A, S(1)),
![]() |
To sample the composite genotypes from the above distribution, let X = (xA, x1, ... , x6) be a sample from the normal distribution N(0,
); if xj <
1(qj1), we assign gnj1 = 1; otherwise gnj1 = 0, (j = 1, ... , 6), where
1(q) is the qth quantile of the standard normal distribution. Since q31 is the proportion of allele 1, at locus 3, which is linked to the disease allele, in the affected population, the two alleles at locus 3 are in Hardy-Weinberg disequilibrium. The disease is recessive. We make the corresponding conditional probability P(s32 = 1|s31 = 1) high, say 0.8, among the affected individuals. In the simulation, we used a high frequency of qj1 = q = 0.1, 0.2, ... , 0.9, (j
3) for allele 1 at each locus, to see how this affects the results.
By the same way we simulated control data, in which the two haplotyes are sampled the same way as G
n above. Together with the previous affected data we have case-control data, and the analysis is displayed in Table 6.
|
Specifically, the sampling scheme has the following three steps:
For each n = 1, ... , N, (N = 1000):
- Draw a sample X = (xA, x1, ... , xJ) from the normal distribution N(0,
); if xj <
1(qj1), we assign gnj1 = 1; otherwise gnj1 = 03 (j = 1, ... , 6). Then we get the sample G(1) = (gn11, ... , gnJ1).
- If gn31 = 1, set q32 = P(s32 = 1|s31 = 1) = 0.8, else q32 = 0.1. For each j = 1, ... , J, draw X from U(0, 1), the uniform distribution on [0, 1]; if X < qj2 assign gnj2 = 1; otherwise assign gnj2 = 0. Then we get a sample G(2) = (gn12, ... , gnJ2).
- Gn = (G(1), G(2)) is a sample from S.
When the two alleles at each locus are in Hardy-Weinberg disequilibrium, we use a two-dimensional normal with mean (0, 0) and variance matrix
= (1, r; r, 1) with r = 0.2 to model their dependence. For each n, we first get the sample G(1) = (gn11, ... , gnJ1) from (x1, ... , xJ) as before, then for each j = 1, ... , J separately, sample yj from the conditional distribution N(rxj, 1 r2). If yj <
1(qj2), assign gnj2 = 1, otherwise 0.
To simulate the case-control data, we choose q = 0.6 for the case and q = 0.25 for the control.
ABSTRACT
THE METHOD
AFFECTED INDIVIDUAL DATA
CASE-CONTROL DATA
EXTENSION TO GENERAL PEDIGREE...
SIMULATION STUDY
>RESULTS
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
Simulated data:
We constructed the test statistics S+|ik, (i = 1, ... , J; k = 1, 2) and computed the corresponding eigenvalues
= (
1, ... ,
J1), using the method described in the Remark after the Proposition to compute the
2 P value under the null hypotheses. Since in the simulation the sole linkage with the disease allele comes from s31, we expect H31 will be accepted, and the other Hjk's will be rejected. Table 4 is a summary of the observed values of the S+|j1's for the Hjk's, for different choices of q, with corresponding P values in parentheses. We simulated and computed data for q = 0.1, 0.2, ... , 0.9; we display only part of them to save space.
|
For each testing statistic S+|jk, there is a set of nonnegative eigenvalues
= (
1, ... ,
J1). Their magnitude plays an important role in determining the asymptotic P value of the observed S+|jk. For a given observed value of S+|jk and fixed number of loci J, a roughly larger eigenvalue total |
| (defined as
1 + ... +
J1) results in a larger P value, and vice versa. Although for two sets of eigenvalues
1 = (
11, ... ,
1,J1) and
2 = (
21, ... ,
2,J1), even if |
1| = |
2|, the corresponding distributions
2(
1) and
2(
2) may not be equal, and they are equal if and only if
(1) =
(2), where
(k) = (
k,(1), ... ,
k,(J1)) is the ordered version of
k (k = 1, 2).
We display in the following the eigenvalues
j = (
j1, ... ,
j5) for the S+|j1's, for the case q = 0.7.
![]() |
![]() |
We find that in most cases the P values of S+|31 suggest acceptance of H31 with high confidence, and those for S+|j1 (j
3), suggest rejection of Hj1, except for the case q = 0.9, in which the P values of S+|51 and S+|61 are also significant, along with that of S+|31. We regard this last case as exceptional, in which the over-high proportion of allele 1 at each locus blurred the identifiability of the problem (think of the extreme case of q
1; the corresponding locus contributes nearly no information for the problem). Thus, in all these cases, the true hypothesis H31 is accepted with high confidence, and the other false ones, Hj1, are rejected; i.e., the true disease-linkage-related allele 1 at locus 3 is correctly identified among all six loci that are all in LD with the disease locus.
To investigate the influence of the deviation from Hardy-Weinberg on our method, we simulated the data for this case, in which we use the allelic correlation r
0 at each locus for the deviation from Hardy-Weinberg equilibrium (HWE). The disease allele population frequency is fixed at q = 0.7 and the results are displayed in Table 5.
|
In the non-HWE case, it seems that the true picture becomes more difficult to recover as the deviation from HWE increases. In general, significant departures from HWE are not expected, but if observed, caution should be taken in applying this method (if genotyping error is present, for example). In particular, in situations in which nonrandom mating is a known confounder because of inbreeding or population structure, care should be exercised.
For the case-control data, we used q = 0.6 for the case and q = 0.25 for the control; HWE is assumed, and again locus 3 is the only connection to the disease allele. The results are shown in Table 6. It is seen that again, for the case-control data SNP locus 3 is correctly identified, and all the other loci are rejected as sources of cause for LD in the region.
The following is a tabulation of power of the test for the above simulated data, using the above
and some combinations of
,
j =
(j
i),
, and Dj|31 = d(j
3). To get a sense of the power behavior of our methods, we choose J = 6,
=
1 as shown before. The noncentrality parameter µ involves 2J 1 parameters in it. It is impractical to investigate and tabulate the influence of each of the 2J 1 parameters to the power. Instead, we investigate the influence of µ to the power, with the given genetic model. Each given value of µ, corresponding to a 2(J 1)-dimensional parameter subspace, is given by the formula for µ. Table 7 shows the display of power for both the affected individual data and the case-control data, for some choices of the level
and the parameter µ. We comment that for the above specification of the parameter µ, the power of the tests for both the affected individual data and that for the case-control data are the same.
|
Since the µ in the power of the test for affected individual data and that for the case-control data have different expressions, more detailed power computation can be obtained by the specification in terms of all the parameters involved.
Application to real data:
Non-insulin-dependent diabetes mellitus-1 data:
We first apply our method to the non-insulin-dependent diabetes mellitus-1 (NIDDM1) data used in SUN et al. (2002) and list our results along with theirs in Table 8. We see that, for these data, the two methods yield quite different, although not contradictory, results. With the method of Sun et al., loci 2 and 12 are most likely responsible for the LD, while by our method, loci 2, 4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18, 19, 20, and 22 all likely contribute to the LD in the region. One possibility for the difference of the two methods might be that the calpain-10 region has some patterns of LD that are not understoodviolating one of the assumptions of the methods. Since the truth in the data is unknown, we do not comment on the performances of the two methods on these data. It is not uncommon in the hypothesis test context, even for methods based on the same type of data, that different methods may have different results, even contradictory ones. In principle, methods using genotype data have no less power in inference than those using IBD data. Here it is too early to comment on the pros and cons for the two types of methods. A formal assessment may involve long-term and large-scale studies. At least our method provides the user more options and a flexible tool for this problem. Also, more methods will give us more strength in the inference. If the methods give consistent results, this will strengthen our confidence in decision; if they do not or are contradictory, the problem may need further investigation. We may perform the hypothesis tests on the current confidence set and continue this way to get a final confidence set of SNPs, in which all of them are accepted as possible sources of LD in the region. We do not pursue this in detail here because of space limitation.
|
Diabetes data:
Next, in a diabetes study, 280 individuals with type 2 diabetes were genotyped at a large number of SNP sites. First we find those SNPs with strong linkage to the trait and then use our method to identify the susceptible one(s). We use the measure of NIELSEN et al. (1998) to detect the marker-disease association, which is given by
![]() |
ij|Affected and
i|Affected are the estimated frequencies of marker genotype AiAj and allele Ai from the observed affected individuals and m is the total number of alleles. They showed that this marker Hardy-Weinberg disequilibrium measure is proportional to the square of the disease-marker LD measure. Under the null hypothesis that there is no disease-marker LD,
2HW is approximately distributed as a
2 variable with degrees of freedom m(m 1)/2.
After computing the value of
2HW at each marker and their corresponding P values, we found that 13 of the markers significantly indicate strong evidence of disease-marker disequilibrium. To apply our method, we choose a set of six SNPs, and we code them as sites 16 for simplicity. The
2HW values are displayed in Table 9, along with their P values in parentheses.
|
We see from this table that all six loci are very tightly linked to the trait. Now we use our method to identify which one of the six SNPs is the sole true cause of linkage, if any. The computed values of the conditional testing statistics and their P values are in



















































