## Abstract

In genetic association studies a conventional test statistic is proportional to the correlation coefficient between the trait and the variant, with the result that it lacks power to detect association for low-frequency variants. Considering the link between the conventional association test statistics and the linkage disequilibrium measure *r*^{2}, we propose a test statistic analogous to the standardized linkage disequilibrium *D′* to increase the power of detecting association for low-frequency variants. By both simulation and real data analysis we show that the proposed *D′* test is more powerful than the conventional methods for detecting association for low-frequency variants in a genome-wide setting. The optimal coding strategy for the *D′* test and its asymptotic properties are also investigated. In summary, we advocate using the *D′* test in a dominant model as a complementary approach to enhancing the power of detecting association for low-frequency variants with moderate to large effect sizes in case-control genome-wide association studies.

IN testing for association between a trait and a genetic variant, conventional methods fit to data generalized linear models, *e.g.*, logistic regression models for a binary trait and linear regression models for a continuous trait (Balding 2006). Under the assumption of an additive genetic model, the power of the test is proportional to the locus-specific narrow sense heritability—the proportion of the trait variance explained by the additive genetic effect of the variant (Chapman *et al.* 2003)—which is essentially the square of the linear correlation coefficient *r*^{2} between the trait and the variant. In this article we call tests based on the correlation coefficient *r*^{2}-type tests. The power of an *r*^{2}-type test can be shown to be proportional to , where *p* denotes the allele frequency (Chapman *et al.* 2003; Spencer *et al.* 2009). The power of an *r*^{2}-type test decreases as the minor allele frequency (MAF) of the variant decreases, which is one of the reasons genome-wide association studies (GWAS) mostly identify associations with common variants.

Association of alleles at different loci on the same chromosome is referred to as gametic phase disequilibrium, or linkage disequilibrium (LD). Various measures of LD are proposed in the literature (Hedrick 1987; Devlin and Risch 1995; Mueller 2004), among which two are commonly used, the correlation coefficient *r*^{2} (Hill and Robertson 1968) and the standardized gametic disequilibrium coefficient *D′* (Lewontin 1964). Numerous studies compare the two measures from the perspectives of inferring population evolutionary history and mapping genes (for a review, see *e.g.*, Slatkin 2008). A consensus concerning the two measures is that *D′* is more robust to allele frequency. Although neither measure is independent of the allele frequency (Edwards 1963; Lewontin 1988), the range of *D′* is always between and , regardless of allele frequency, whereas the limits of depend on the allele frequencies at both loci. *D′* more accurately reflects the recombination patterns (Hedrick 1988), in particular in the case of low allele frequencies, which can facilitate pinpointing disease-causing mutations (Devlin and Risch 1995; Guo 1997). Note that in the case of small sample sizes and low allele frequencies, the absolute magnitude of *D′* tends to be overestimated (Zapata *et al.* 2001; Teare *et al.* 2002; Pe’er *et al.* 2006). Inspired by these characteristics of *D′*, we propose an alternative approach to gene mapping using statistics based on *D′* instead of the conventional *r*^{2}-type tests. We hypothesized that a *D′*- type test would be more powerful than *r*^{2}-type tests in detecting association for low-frequency variants.

## Method

Consider two diallelic loci A and B, with alleles *A _{1}* and

*A*at locus A and alleles

_{2}*B*and

_{1}*B*at locus B,

_{2}*i.e.*, the factors A and B in Table 1 stand for loci A and B, respectively. Denote by and , where , allele (relative) frequencies, and by , where , haplotype (relative) frequencies. The gametic disequilibrium coefficient is defined as (1)(Lewontin and Kojima 1960). The measure (2)is the square of Pearson’s correlation coefficient that measures the linear dependence between the two loci (Hill and Robertson 1968). The measure, (3)where when and when is a standardized

*D*obtained on dividing it by its theoretical maximum given the allele frequencies (Lewontin 1964).

Let factor B in Table 1 stand for the affection status of a disease, with *B*_{1} denoting cases and *B*_{2} denoting controls, and consider association between the disease and the alleles, assumed to be independent, of a diallelic locus A. The widely used Pearson’s test statistic is defined as , where and . It can be shown that , where is the estimate of the squared Pearson correlation coefficient as defined in . Therefore, using this method to test association between the disease and a marker is analogous to measuring LD between two loci by *r*^{2}. The range of *r*^{2} for a table depends on the frequencies of both factors and it reaches 1 only when the two factors are perfectly correlated. We speculated that this is one reason that conventional *r*^{2}-type methods have low power to detect association for low-frequency variants, in particular in a case–control study where cases are oversampled. In contrast, has the same range for all frequencies and its estimate tends to be inflated in the case of low allele frequencies. We hypothesized that a test based on will be more powerful than the *r*^{2}-type methods for detecting association with low-frequency variants in a case–control GWAS.

A Wald-type test statistic is defined as

which asymptotically follows a distribution with 1 degree of freedom. For simplicity we refer to this test as the test. can be estimated as by plugging estimates into formula . It is a ratio of two random variables and both its mean and variance can be obtained by the delta method using the first-order Taylor’s series expansions

and

(Casella and Berger 2001). Under the null hypothesis of no association between the disease and marker, ; therefore, . In particular, in a balanced case–control study, is a constant regardless of the sign of , and thus . has been obtained (Zapata *et al.* 1997; Zapata *et al.* 2001) as

where

(Hill 1974); and when ; and when ; and *γ* = , , , and when = , , , and , respectively. Note that at some points the parameters lie on the boundary of the parameter space (Mangin *et al.* 2008). When the asymptotic variance of cannot be obtained by the delta method because is not continuous at this point; however, from a practical point of view has a point mass with zero probability. For the sake of completeness, we define at with a probability of zero. When one cell equals 0, or when both diagonal/antidiagonal cells = 0, and , and the test statistic is undefined. One way to circumvent this is to add 0.5 to each cell. Considering the known upward bias of in the case of small sample sizes and low allele frequencies, we take a more conservative approach by adding 0.5 to the cell in the table with the smallest number while keeping the marginal totals fixed; when there are two such cells belonging to the same row or column, no penalty will be taken.

First we provide a numerical example illustrating differences between the D′-type measures and the *r*^{2}-type measures for association of low-frequency exposures in a case–control study under both null and alternative hypotheses. Consider a balanced design with 200 cases and 200 controls and a risk factor with an exposure rate of 0.05, *i.e.*, and in Table 1. We calculated and , and their corresponding test statistics and , with fixed and decreasing from 10 to 0 (Table 2).

Since both the validity and power of a test depend on the coding schemes it employs in genetic studies (Sasieni 1997; Freidlin *et al.* 2002), we investigated what coding scheme to use when applying the *D′* test in the case of low-frequency variants. Consider a balanced case–control study with a total of *N* subjects and a diallelic locus with minor allele *A* and major allele *a*. Denote by and , where indicates the copy number of allele *A*, the numbers of individuals with genotypes *aa*, *Aa*, and *AA* in cases and controls, respectively (Table 3). For simplicity, assume , *i.e.*, no individual with the *AA* genotype is observed, since the *D′* test is specifically designed for low-frequency variants. As a result, we need only compare two coding schemes—one takes each allele as a unit, which we call the “allelic model”; the other takes each individual as a unit, which we call the “dominant model.”

We compared the power of the *D′* test with that of Pearson’s test and the likelihood-ratio test (LRT) to detect association of variants with various MAFs in a GWAS setting. To make a fair comparison, we simulated variants of different MAFs with equal chance of being detected by a standard method at a nominal level and compared the empirical power of the different methods in a genome-wide screen context. Specifically, for a given MAF in controls, the MAF in cases was computed, using the arcsine transformation, such that the power to detect the MAF difference between the cases and controls was 0.8 at a nominal 0.05 significance level (Cohen 1988). Three MAF categories—defined by the intervals , , and —were considered. In each category 30 single-nucleotide polymorphisms (SNPs) associated with the disease were simulated by randomly choosing values of . To mimic a GWAS according to the MAF spectrum of the Affymetrix Genome-wide Human SNP Array 6.0 in Caucasians, we simulated 1 million null SNPs, among which 15, 20 and 65% were with MAFs randomly chosen from , , and , respectively. Again we considered a balanced design with an equal number (200, 500, 1000, or 2000) of cases and controls. In each setting, 100 replicates were generated. For each replicate, we recorded the statistics and performed the power analysis in two ways. One was to count the number of true positive SNPs among the top SNPs. The other way, adjusting for multiple testing, was to fit to the statistics an empirical null distribution, estimate the local false discovery rate (FDR) (Efron 2004; Efron 2007a,b), and then count the number of true positive SNPs meeting a certain FDR criterion.

As a proof-of-principle example, we applied the *D′* test and the LRT in a GWAS of plasma levels of triglycerides with nonsynonymous SNPs in the African Americans (AAs) of the Dallas Heart Study (DHS) (Victor *et al.* 2004). There were 1722 individuals with complete phenotypes—triglycerides, body mass index (BMI), age, sex—after deleting those taking cholesterol-lowering medicine. We first fitted to data the linear regression modelwhere the ancestry of each individual was inferred using ancestry-informative markers (Xing *et al.* 2011), to obtain the residuals ε. We then sampled the lowermost and uppermost quintiles of the residuals (*n* = 345) as cases and controls, respectively. There was a total of 8968 nonsynonymous SNPs assayed across the autosomes. We first filtered out singletons and those out of Hardy–Weinberg equilibrium (HWE; *P*-value < ) in the whole AA population, and then filtered out monomorphic ones in the case-control sample, resulting in 8268 SNPs for further analysis. We performed genome-wide scans using both the *D′* test and the LRT. Fisher’s exact test and Pearson’s test were omitted because the LRT is more powerful in detecting association for low-frequency variants (Xing *et al.* 2012, 2013), which was confirmed in the simulation study of this article.

Finally, we explored by simulation the asymptotic properties of the *D′* test, in particular whether the type I error rate is well calibrated for the *D′* test under the null hypothesis at different sample sizes. We considered a balanced design with equal numbers (200, 500, 1000, or 2000) of cases and controls. Variants with low frequencies were of particular interest and we simulated SNPs with MAFs equal to 0.005, 0.01, 0.05, and 0.1. In each setting 100,000 replicates, in each of which all the SNPs were polymorphic, were generated. Each data set could be summarized into a table and we tested the allele frequency difference between cases and controls. The test statistic was recorded to estimate its empirical distribution. We also performed Fisher’s exact test, Pearson’s test, and the LRT, for comparison. Except in the case of Fisher’s exact test, we calculated *P*-values assuming an asymptotic distribution of . The empirical type I error rates at a nominal α significance level were calculated as the proportion of the 100,000 replicates for which the *P*-value was ≤ α.

## Results

Compared with the *r*^{2}-type test statistic, the *D′*-type test statistic under the alternative hypothesis is more distinct from that under the null hypothesis (Table 2). Under the null hypothesis of no association—*n*_{11} = *n*_{12} = 10—both and equal zero. As *n*_{12} decreases, approaches one whereas remains minimal, which is expected because the range of is constrained by frequencies of both factors based on their definitions. In contrast, by definition and should have the same range regardless of frequencies. Under the null hypothesis, both equal zero; however, as *n*_{12} decreases to 1 and 0, which correspond to situations of strong association, increases more rapidly than . This suggested the potential advantage of using the *D′*-type tests to detect association for low-frequency exposures, as further illustrated in a GWAS below.

Denote by and the sample estimates of *D′*, and by and the *D′*-type test statistics under the allelic and dominant models, respectively. It can be shown that and in the case of a balanced case–control study for a low-frequency variant (see *Appendix* for details). Thus analyzing data under a dominant model leads to greater power than analyzing data under an allelic model for a low-frequency variant using the *D′*-type test, and therefore we employ the dominant coding scheme in what follows. It is of interest to note that by using the dominant model we also circumvent the potential issue of deviation from HWE of variants, which is a known issue impairing the validity of tests based on the allelic model coding scheme (Sasieni 1997).

In a genome-wide screen the *D′* test is superior to the LRT and Pearson’s test in detecting low-frequency variants, in particular when the sample size is small (Figure 1 and Figure 2); its gain in power diminishes as the sample size increases. Given the sample size, the *D′* test is most powerful in detecting association for low-frequency variants; its power decreases as the MAF increases. In contrast, the power of the other two tests increases as the MAF increases. The power of the three tests converges with increasing MAF and sample size. It is interesting to note that for common variants, when using ranking for the cutoff, the *D′* test selects fewer true positive variants than the other two methods (Figure 1); however, when using the FDR as the criterion to correct for multiple testing, the *D′* test selects more true positive variants, especially when the sample size is small (Figure 2). The underlying reason is that the null distribution and the alternative distribution of the *D′* test statistics are more distinguishable than those of the LRT statistics and Pearson’s test statistics (Table 4).

In Table 5 we list the union of the top 10 SNPs in the genome-wide scan of plasma triglyceride levels in the case–control study of DHS AAs using the *D′* test and the LRT. There was a total of 13 SNPs, among which 7 SNPs were ranked in the top 10 by both methods. All the top 10 SNPs by the LRT were ranked in the top 15 by the *D′* test; however, common variants tended to be ranked higher by the LRT than by the *D′* test. In contrast, there were two low-frequency variants (MAFs < 0.005) ranked in the top 10 by the *D′* test but ranked much lower by the LRT. There was one SNP with local FDR less than 0.05 using the *D′* test, but none using the LRT.

The *D′* test statistic converges to a distribution of as both the MAF and the sample size increase; however, when the MAF is low and the sample size is small, the distribution has a heavy tail (supporting information, Figure S1). Its type I error rate is well preserved at the nominal 0.05 significance level but inflated at more stringent test levels (Table 6). We previously compared Fisher’s exact test, Pearson’s test, and the LRT, concluding that Fisher’s exact test is the most conservative, whereas the LRT is the most liberal (Xing *et al.* 2012). Compared to the LRT, the *D′* test is more conservative at a loose criterion (*α* = 0.05), but more liberal at a stringent criterion (*α* < 0.001), especially in the case of testing for the association of low-frequency variants. This property suggests proper usage of the *D′* test: on one hand, the *D′* test is not a valid test at low test levels; on the other hand, the *D′* test statistic has the advantage of detecting association signals of low-frequency variants over that of common variants in a genome-wide screen setting, as we have demonstrated by both simulated and real data of genome-wide screens.

## Discussion

Conventional wisdom regarding the LD measures *D′* and *r*^{2} is that *D′* is more useful in population genetics, such as for inferring recombination events, whereas *r*^{2} is more useful in association mapping (Mueller 2004), because the power of a test between the causal variant and the proxy variant being examined is proportional to *r*^{2} (Pritchard and Przeworski 2001; Chapman *et al.* 2003; Spencer *et al.* 2009; Lin *et al.* 2012). However, the *r*^{2}-type association test lacks power for low-frequency variants. Instead, in this article we propose a *D′*-type test. Based on *D′*’s biological property of measuring recombination events, and by making use of its low sensitivity to allele frequencies and its upward bias estimate in the case of small sample sizes and low allele frequencies, we aimed to enhance the power of detecting association for low-frequency variants in genome-wide screens. In the *D′*-type test, the inflation of under the null is largely controlled by the large accompanying variance (Zapata *et al.* 1997; Teare *et al.* 2002) and by adding 0.5 to the minimum cell while keeping the marginal totals fixed. The *D′* test is conservative at a loose criterion (α = 0.05), but liberal at a stringent criterion (*α* < 0.001), in particular when testing for association with low-frequency variants, which favors low-frequency variants with modest to large effect sizes in a genome-wide screen. As the sample size and the MAF increase, the test statistic converges to a distribution under the null hypothesis. After standardization, and are comparable in terms of precision, accuracy, and efficiency, but the former has an advantage in the fact that its range of variation is always the same regardless of MAFs (Zapata 2011), which, we speculate, explains its superior power for low-frequency variants in genome-wide screens.

In the literature, a metric closely related to *D′* is the so-called “optimal measure of allelic association” ρ, which equals . It was proposed in a genetic mapping method based on the Malécot model for isolation by distance (Collins and Morton 1998; Morton *et al.* 2001; Morton *et al.* 2007). The Malécot model uses a composite likelihood to combine association information from multiple variants; however, at each site the association information is still summarized by Pearson’s statistic, which is fundamentally different from the *D′*-type test we propose. The metric ρ originated from population genetic theories with an intuitive appeal—it is an estimate of the probability that a random haplotype has descended without recombination from a founder population (Morton *et al.* 2001; Shete 2003). This feature sheds light on the *D′*-type test in detecting association for low-frequency variants; a highly penetrant, low-frequency, causal variant is expected to be in complete LD/association, *i.e.*, *D′* = 1, with disease affection status. Another metric closely related to *D′* is the robust attributable risk δ (Bengtsson and Thomson 1981; Devlin and Risch 1995). In the case of rare diseases and random sampling, such that the disease frequency is smaller than the MAF of a variant, these two metrics are approximately equal (Devlin and Risch 1995). In a case–control study design, the two metrics have the same numerator, which is *D*_{,} but different denominators— and for δ and *D′*, respectively. In a case–control design, , but for a low-frequency variant so that *D′ > δ*. Therefore, *D′* should be a more sensitive metric for detecting association of low-frequency variants. Both ρ and δ were applied in a composite likelihood approach to locating a disease causal gene/variant assuming an evolutionary model (Terwilliger 1995; Devlin *et al.* 1996; Collins and Morton 1998; Morton *et al.* 2001, 2007). On one hand, we suspect that imposing a population genetic model may harm its power in gene mapping for complex diseases with locus and allelic heterogeneity; on the other hand, we speculate that using the information directly from ρ/*D′*, as the *D′*-type test, instead of from Pearson’s statistic at each site to form the composite likelihood, will increase its power, especially in isolated populations with relatively homogenous disease etiology.

The *D′* test is highly sensitive to association signals of low-frequency variants and so it can capture association missed by the *r*^{2}-type test; however, to avoid spurious results, it is important to replicate findings in independent samples; to estimate significance accurately, a permutation procedure is needed. In the case of low-frequency variants, we proved that the *D′* test is more powerful in a dominant coding scheme than in an allelic coding scheme; however, this does not hold over the full allele frequency spectrum. Thus we advocate using the *D′* test in a dominant model as a complementary approach to the *r*^{2}-type test instead of a replacement for it in GWAS. By doing so we also circumvent the potential issues brought up when a variant is out of HWE. Since the advantage of the *D′* test lies in low-frequency variants, it is not a primary interest of this article to construct a *D′*-type genotypic test for a table, although this is under investigation. In this article we present a single-variant *D′* test to enhance the power of detecting association for low-frequency variants with moderate to large effect sizes. Previously, we compared the power of the single-variant LRT and two multivariant analysis approaches—the burden test and variance component method—for analyzing sequence data and showed that as long as there is a variant with a moderate to large effect size and stable expressivity, the single-variant LRT has the potential to identify it (Xing *et al.* 2012). Given that the *D′* test is more powerful than the LRT for low-frequency variants, this conclusion will also hold for the *D′* test. A further potential research topic would be to extend the *D′* test to be multivariant, which, we expect, will enhance the power of detecting loci of multiple low-frequency variants with small effect sizes.

Variants attaining genome-wide significance explain only a small fraction of the genetic variance of complex traits (Manolio *et al.* 2009), and it was shown that a large proportion of the heritability is not so much missing as hidden because the individual effects are too small to pass stringent significance tests (Yang *et al.* 2010); this is relevant to not only GWAS, but also sequencing studies. Apart from increasing the sample size, more efficient study designs and more powerful methods are needed to extract from random noise true associations, in particular, low-frequency variants with moderate to large effect sizes. For example, employing a weighted FDR control procedure we previously identified, with a sample size of ∼7400, low-frequency variants at *FOXA2* associated with plasma glucose levels (Xing *et al.* 2010), which was missed by the regular method with sample sizes of ∼6000 (Banasik *et al.* 2012) and ∼46,000 (Dupuis *et al.* 2010), but was confirmed later with an even larger sample size of >100,000 (Scott *et al.* 2012) and by accounting for interaction with BMI in a sample of >50,000 (Manning *et al.* 2012). In this study we proposed a test based on the standardized LD measure *D′* that is powerful in detecting association for low-frequency variants in genome-wide screens. We recommend using the test as a complementary approach to the conventional *r*^{2}-type test to detect association for variants over the full allele frequency spectrum.

## Acknowledgments

We thank the editor and reviewers for constructive comments that improved the manuscript, as well as Dr. Jonathan Cohen for critically reading the manuscript. We also thank Dr. Helen Hobbs for granting us permission to use the DHS data. This study was supported by the American Heart Association Scientist Development Grant (No. 10SDG4220051) to C.X., by the National Research Foundation of Korea Grant (No. NRF-2011-220-C00004) to R.C.E., and by the National Science Council of Taiwan Grant (No. NSC-102-2118-M-005-002) to C.-Y.L.

## Footnotes

*Communicating editor: N. Yi*

- Received December 17, 2013.
- Accepted January 26, 2014.

- Copyright © 2014 by the Genetics Society of America