## Abstract

It is well established that test statistics and *P*-values derived from discrete data, such as genetic markers, are also discrete. In most genetic applications, the null distribution for a discrete test statistic is approximated with a continuous distribution, but this approximation may not be reasonable. In some cases using the continuous approximation for the expected null distribution may cause truly null test statistics to appear nonnull. We explore the implications of using continuous distributions to approximate the discrete distributions of Hardy–Weinberg equilibrium test statistics and *P*-values. We derive exact *P*-value distributions under the null and alternative hypotheses, enabling a more accurate analysis than is possible with continuous approximations. We apply these methods to biological data and find that using continuous distribution theory with exact tests may underestimate the extent of Hardy–Weinberg disequilibrium in a sample. The implications may be most important for the widespread use of whole-genome case–control association studies and Hardy–Weinberg equilibrium (HWE) testing for data quality control.

MOST analyses of genetic data rely on discrete genetic markers such as single-nucleotide polymorphisms (SNPs), copy number variants (CNVs), or microsatellites, yet most analyses use statistical theory based on continuous distributions such as the normal or chi square. In some cases, use of these theories is satisfactory but in most contemporary genetic analyses there is a need for care, especially with the reported *P*-values for hypothesis tests. The need may be most urgent for the widespread use of whole-genome case–control association studies and Hardy–Weinberg equilibrium (HWE) testing for data quality control.

The issue of approximating discrete distributions with continuous functions has been discussed in the literature. Yates (1934) applied a simple “continuity correction” for goodness-of-fit tests and 50 years later stressed that this made the chi-square test statistic a better approximation to the exact test for 2 × 2 contingency tables (Yates 1984). However, the correction does not alter the fact that test statistics for discrete data have discrete distributions. One prominent issue raised by discrete test statistics is discrete type I error rates; such a type I error rate has a discrete number of possible values. Tocher (1950) described a stochastic hypothesis rejection method that allows any chosen type I error rate to be achieved when working with discrete test statistics. Innan *et al.* (2005) proposed a similar random procedure for the specific case of the haplotype configuration test. These methods effectively correct the rejection region of discrete statistics; however, the methods are seldom applied, likely because their stochastic nature is unappealing to scientists.

Some other discussions of genetic test statistics have been cognizant of the discrete nature of test statistics and corresponding *P*-values (Slatkin 1994; Raymond and Rousset 1995; Rousset and Raymond 1995; Innan *et al.* 2005). However, the implications of using continuous distribution theory with discrete *P*-values have not been sufficiently discussed. Today testing is done by computer and computational issues are of less importance than in the past, making it possible to evaluate the actual discrete *P*-value distributions and use those for inference. We are particularly interested in how data discreteness affects the null distribution of *P*-values, making them nonuniform. Wigginton *et al.* (2005) looked at HWE testing for various sample sizes, significance thresholds, and minor allele frequencies (MAFs). They found that, even with a sample size of 1000, the actual type I error rates for both goodness-of-fit tests and exact tests may be much different from the nominal values. We build on their work by exploring properties of complete discrete *P*-value distributions under both the null and the alternative hypotheses, using simulated and biological data. We focus particularly on discreteness, power, and MAF affects.

In this article, on the 100th anniversary of the original HWE papers (Hardy 1908; Weinberg 1908), we examine the implications of discrete *P*-values in HWE testing. Evidence for departure from HWE has been used in many applications such as inferring the existence of natural selection (Wallace 1958; Lewontin and Cockerham 1959), challenging the statistical analysis of forensic DNA profiles (Cohen *et al.* 1991; Weir 1992), and detecting genotyping errors (Gomes *et al.* 1999; Zou and Donner 2006). We derive the actual null distribution of both the chi-square goodness-of-fit test statistic (Weir 1996) and exact test *P*-values (Weir 1996) by completely enumerating all sets of genotype counts conditional on observed allele counts. These distributions are then used to explore type I error rate, power, an acceptable MAF range, and agreement with real data. Because of the current importance of diallelic SNPs in human genetics, we confine our attention to the two-allele case. We stress that the test statistics have very coarse distributions when the number of copies of the minor allele is small and this calls for caution in applying asymptotic assumptions and in determining significance thresholds in multiple-testing situations such as those in whole-genome scans. For all uses of test statistics and *P*-values, rigorous calculation and accuracy are required when determining the expected distributions to which observed values can be compared.

## METHODS

*P*-value distributions under the null hypothesis:

For genotypes *AA*, *Aa*, *aa*, the sample counts are *n _{AA}*,

*n*,

_{Aa}*n*, summing to

_{aa}*n*. The usual chi-square test statistic for HWE is constructed by comparing these counts to the values expected under HWE: where and . A convenient form for the test statistic is(1)Under HWE, this test statistic is approximately chi-square distributed with 1 d.f., making the

*P*-value for a given data set the area under the -curve to the right of the calculated test statistic

*X*

^{2}.

The test statistic can also be written as , where is the (discrete) maximum-likelihood estimate of the (continuous) within-population inbreeding coefficient *f*. The estimate is just the term in parentheses in Equation 1, and the parameter allows population genotype frequencies to be written in terms of allelic frequencies as . When HWE is true, *f* = 0.

The goodness-of-fit test is equivalent to the 2 × 2 contingency table test of diploid genotype counts, as in Table 1. The test is conditional on the row and column totals, *n _{A}*,

*n*. For any set of genotype counts there are only as many possible test statistic values as there are tables with the same row and column totals. Without loss of generality, we assume that

_{a}*A*is the minor allele, meaning that

*n*=

_{A}*n*+ 2

_{Aa}*n*≤

_{AA}*n*=

_{a}*n*+ 2

_{Aa}*n*. With fixed row and column totals (

_{aa}*n*and

_{A}*n*),

_{a}*n*ranges over 0, 2, 4, … ,

_{Aa}*n*if

_{A}*n*is even and over 1, 3, 5, … ,

_{A}*n*if

_{A}*n*is odd. The number of possible

_{A}*n*values, and therefore the number of test statistic values, is ⌊

_{Aa}*n*/2⌋ + 1, where ⌊

_{A}*x*⌋ indicates the largest integer less than or equal to

*x*.

An exact test does not rest on continuous approximations of discrete distributions and is not thought to be problematic with small numbers, as is a chi-square test. Rather, an exact test is based directly on the discrete sampling distribution of the data under the null hypothesis. With random sampling the multinomial distribution is applicable. Under the HWE null hypothesis the probability of the genotype counts *n _{AA}*,

*n*,

_{Aa}*n*is(2)(Weir 1996). Note that the homozygote counts can be parameterized in terms of the heterozygote and allele counts as

_{aa}*n*= (

_{AA}*n*−

_{A}*n*)/2 and

_{Aa}*n*= (

_{aa}*n*−

_{a}*n*)/2. We use

_{Aa}*n*,

_{AA}*n*for notational simplicity. The

_{aa}*P*-value for this test is calculated as the sum of this probability for an observed data set and the probabilities of all other data sets that have the same or smaller probabilities when HWE is true. The total number of data sets, ⌊

*n*/2⌋ + 1, is usually small enough to allow for calculations based on a complete enumeration of these values. The more complex methods of Guo and Thompson (1992) are needed only for loci with multiple alleles.

_{A}As an illustration, Table 2 shows the eight possible data sets for a sample of 100 individuals with 14 copies of the *A* allele, along with the exact probabilities and *P*-values assuming that there is HWE. The conventional chi-square goodness-of-fit test statistics are also displayed along with the *P*-values from the -distribution. This chi-square test would reject the HWE hypothesis at the 0.05 significance level if there were ≤10 heterozygotes whereas the exact test would change the rejection region to ≤8 heterozygotes. The “spurious significant” result from the chi-square test is removed if the test statistic is corrected for continuity by replacing with for a set of observed (*o*) and expected (*e*) counts (Yates 1934).

The key feature of Table 2 is that the null distributions of the *P*-values are far from uniform for each test. There are only eight distinct *P*-values and, for example, the probabilities that the *P*-values are ≥0.5 are 0.61, 0.00, and 0.93 for the exact, chi-square, and corrected chi-square tests, respectively, instead of the 0.5 value expected for a uniform distribution.

The example in Table 2 is specific to one observed allele frequency (). In Figure 1 we show the exact *P*-value distributions for , and 0.5 with sample size *n* = 1000. If there were only these five equally likely MAFs, the total *P*-value distribution would be the average of the five marginal distributions. This total *P*-value distribution (Figure 1, bottom) is clearly nonuniform. Note that, for any value of , the most probable set of genotype counts has an exact *P*-value of 1.0. The total distribution of *P*-values over many 's will have a spike at 1.0, which may be obscured if *P*-values are binned before plotting. For any given sample size, the minor allele count is constrained to 1, 2, … , *n*, giving *n* possible values for . The distribution over all *n* allele frequencies for the exact test on 100 individuals is shown in Figure 2. Use of a simple average here assumes that all possible sample allele counts are equally likely.

The number of possible *P*-values is equal to the number of possible *n _{Aa}* values, which increases linearly with minor allele count (

*n*), which in turn must be less than or equal to the sample size (

_{A}*n*). Even as

*n*becomes very large, there are a finite number of possible

*P*-values so the

*P*-value distribution remains discrete, although when binned, it increasingly resembles a uniform distribution.

The chi-square test *P*-values have a similarly coarse distribution. However, for the chi-square test, the *P*-value can be 1.0 only if the sample counts are in perfect HWE and this is not usually possible. The null distribution of chi-square test *P*-values has no spike at 1.0 but it is not uniformly distributed. For the remainder of this article we focus on the exact test.

*P*-value distributions under the alternative hypothesis:

A full description of the behavior of hypothesis tests requires consideration of power: the probability of rejecting the null hypothesis when the alternative hypothesis is true. For the chi-square test, the test statistic still has the value but under the alternative hypothesis it has a noncentral chi-square distribution with 1 d.f. and noncentrality parameter λ = *nf*^{2}. The power of the test can be calculated as the probability to the right of *X*^{2} under the noncentral chi-square distribution. This continuous-distribution theory is convenient for simple sample-size calculations: in this 1-d.f. case, the noncentrality value follows from percentiles of the standard normal distribution. If *z _{x}* is the

*x*th percentile of the standard normal, than for significance level α and power 1 − β,

*nf*

^{2}= λ = (

*z*

_{α/2}+

*z*

_{β})

^{2}. For 90% power and 5% significance level, λ = 10.5 and

*n*≥ 10.5/

*f*

^{2}.

The probability of a given set of genotype counts when HWE is not assumed can be written as(Wellek 2004) to show how, for given values of the allele counts, the probability depends only on the heterozygote count (recall that *n _{Aa}*,

*n*,

_{A}*n*determine

_{a}*n*,

_{AA}*n*) and a single parameter θ = . The normalizing constant is found by summing over all possible values of

_{aa}*n*:The exact power of the test for some θ is found by summing values of this probability over all genotype counts for which the test rejects the null hypothesis. The rejection region is calculated under the null hypothesis but the power is calculated under the alternative hypothesis. Although the θ-parameterization avoids relying on allele frequencies and is convenient for performing power calculations, it is possible to retain the inbreeding coefficient approach by recognizing that θ = 4

_{Aa}*p*(1 −

_{A}p_{a}*f*)

^{2}/[(

*p*+

_{A}*fp*)(

_{a}*p*+

_{a}*fp*)].

_{A}Power calculations are illustrated in Table 3 for the case of *n* = 100, *n _{A}* = 14 where the rejection region is chosen for the conventional 0.05 significance level (

*i.e*., the exact test rejects for ≤8 heterozygotes and the chi-square test rejects for ≤10 heterozygotes). A range of θ-values are considered. Note that when θ = 4, the null hypothesis of HWE is true and the power is equivalent to the type I error rate. When θ < 4 and θ > 4, there is an excess of homozygotes and heterozygotes, respectively. In the example shown in Table 3, the most likely genotype configuration under the null hypothesis is 14 heterozygotes and no minor allele homozygotes, as shown in Table 2. So in the case of excess homozygotes (θ < 4), the tests gain power. However, it is not possible to observe an excess of heterozygotes, so this sort of departure is not detectable by either test, causing them to lose power when θ > 4. Figure 3 shows how type I error rate and power vary across

*n*,

_{A}*n*, and θ. Type I error and power increase and decrease simultaneously so that if power is high for a particular MAF, type I error is also high. Because the chi-square rejection region is not smaller than the exact test region, the chi-square test is always at least as powerful as the exact test, even as the power of each varies with MAF.

## AN APPLICATION

In the exact *P*-value distribution over all for *n* = 100 shown in Figure 2, we assumed that MAFs are uniformly distributed, but this may not be true in some SNP panels. When MAFs in a data set are not uniformly distributed, the empirical MAF distribution can be used to weight the null distributions of *P*-values for each MAF. If the sample size is *n* for every SNP, then there are only *n* possible MAF values and marginal null distributions. The number of times each occurs in the data can be used as a weight in constructing the total null distribution. We performed exact HWE tests on real data to demonstrate this point and to explore the consequences of approximating the expected *P*-value distribution as uniform.

The Wellcome Trust Case–Control Consortium (WTCCC) kindly provided us with genomewide SNP data in 1504 individuals from the 1958 birth cohort (Wellcome Trust Case–Control Consortium 2007). This sample, consisting of all individuals born in England, Scotland, and Wales in 1 week in 1958, was previously found to have negligible population structure (Wellcome Trust Case–Control Consortium 2007).

To avoid complications of varying sample size due to missing data, we excluded SNPs with any missing genotypes, for a total of 34,625 polymorphic SNPs. Using unfiltered complete SNPs makes our final SNP panel susceptible to genotyping errors and severe ascertainment bias. This SNP panel is not appropriate for biological analysis or as a surrogate for the 1958 birth cohort data set; however, it is a mix of actual SNPs in HWE and Hardy–Weinberg disequilibrium (HWD) and thus is useful for comparing observed *P*-values with different expected *P*-value distributions. This SNP set may contain a higher proportion of non-HWE SNPs than is typical in quality-controlled data, but that is not relevant when comparing analyses across expected distributions. The nonuniformity of MAFs in this SNP set (Figure 4) justify weighting the expected marginal *P*-value distributions by the observed MAF distribution.

When comparing observed and expected values, such as the HWE exact *P*-values, a *Q*-*Q* plot proves a useful diagnostic by plotting the observed ranked *P*-values against the ranked *P*-values sampled independently from the theoretical null distribution of *P*-values. If the observed and expected values are similarly distributed, all the points lie near the diagonal. For a more detailed examination of the very small *P*-values, −log(*p*) can be used in a *Q*-*Q* plot. Figure 5 shows *Q*-*Q* plots comparing *P*-values resulting from tests of the observed SNPs to expected *P*-values drawn from a uniform distribution and from our calculated null distribution with *n* = 1504 and the observed MAF distribution. Many of the observed SNPs are in strong Hardy–Weinberg disequilibrium. The farthest outliers are likely to be caused by genotyping errors and many other outliers may be due to population structure or selection. Clearly, expected *P*-values drawn from the calculated null distribution match the bulk of the observed *P*-values much closer than *P*-values drawn from a uniform distribution. Many of the observed *P*-values are 1.0, resulting in a horizontal bar in the *Q*-*Q* plots against uniform *P*-values. Using a uniform distribution as the expected distribution in *Q*-*Q* plots leads to inaccurate assessment of expected and observed data agreement.

HWE testing to control SNP quality has often used a somewhat arbitrary significance level, such as 0.001, and excluded markers with *P*-values below that threshold from further analysis. Using this 0.001 significance threshold, our analysis showed an observed rejection rate of 0.002484, which is twice the type I error rate expected under the continuous assumption (0.001) but four times the exact expected rate of 0.0005827. A simple 1-d.f. chi-square test shows that the observed positive rate is significantly higher than expected with asymptotic assumptions (*X*^{2} = 76.23), more so if the calculated expected value is used (*X*^{2} = 214.81). There may exist a case where assuming uniform *P*-values leads to the conclusion that there are not significantly more positives than expected under the null hypothesis when, in fact, there are.

As mentioned previously, the type I error rate can be set exactly using randomized tests that reject the null hypothesis with a specific probability for *P*-values closest to the significance threshold (Tocher 1950; Innan *et al.* 2005; Gibbons 2006). However, randomization can cause the same analysis to call a marker significant on one run and not significant on the next. This inconsistency is unsatisfying and can be avoided by deriving the true *P*-value distribution.

In the 1958 birth cohort analysis outlined above, some SNPs are in HWE and some are in HWD but the true proportions of HWE and HWD SNPs are unknown. To show the relevance of expected distribution choice when all SNPs are in HWE, we performed a simulation study of 100,000 SNPs in 100 individuals with uniformly distributed MAFs under HWE. Figure 6 shows *Q*-*Q* plots comparing exact HWE *P*-values calculated on the basis of the simulated data with the calculated null *P*-values and uniform *P*-values. Similar to the real data analysis, the calculated null *P*-values fit the observed *P*-values much better than do the uniform *P*-values. When comparing observed *P*-values to the uniform, an investigator may be surprised to see a deficit of low observed *P*-values, indicating that many SNPs have a closer fit to HWE than expected under HWE. When comparing the observed *P*-values to the calculated null *P*-values, the distributions align well, indicating that the SNPs are in HWE.

## DISCUSSION

The small number of possible sets of genotype counts for a given set of allele counts can lead to very coarse discrete distributions of *P*-values that may not be adequately approximated by a continuous uniform distribution. Analyses that assume *P*-value uniformity may be reconsidered using the true, discrete, coarse *P*-value distribution. These analyses include type I error rate, power, false discovery rate (FDR) estimation, and distribution agreement assessment.

In most studies across many diallelic markers, a lower bound is set on acceptable MAF. Figure 7 illustrates the effects of MAF lower bounds on power, type I error rate, and FDR. FDR is the proportion of significant tests falsely deemed significant and is often used in multiple-testing situations such as whole-genome scans. Applying a lower bound on MAF actually increases type I error rate and power. Depending on the specific parameters (*n*, *n _{A}*, θ, proportion of truly null and alternative markers), the net effect may be calculated as change in FDR for a specific lower bound. Often MAF lower bounds are used to avoid genotyping errors that are more pronounced in low MAF markers. Even in these cases, we advise informing the decision with an exact analysis of genotyping error rate, type I error rate, power, and FDR depending on MAF lower bound.

The results shown in this article are specific to HWE tests, but the same concept applies to any statistical test based on discrete data. In the field of statistical genetics, this includes most tests using genetic marker data such as simple case–control association (Neuhäuser 2002), linkage disequilibrium (LD) (Zapata and Alvarez 1997), and general allelic association (Zaykin *et al.* 1995) testing. Data for these problems may be framed in a contingency table and the probability of a particular table given the marginal counts can be computed under the null and alternative hypotheses. For example, Zapata and Alvarez (1997) show the contingency table for LD testing and the procedure for the exact LD test. They show the probability of a particular table under the null hypothesis of no LD with *n* haplotypes,(3)where *C* and *D* are loci with alleles *C*/*c* and *D*/*d*. Note that if *n _{C}*,

*n*,

_{c}*n*,

_{D}*n*are held constant, the contingency table consisting of all haplotype counts can be completely specified with

_{d}*n*, the number of

_{CD}*CD*haplotypes present. A certain degree of linkage disequilibrium can be parameterized as κ = (

*p*)/(

_{CD}p_{cd}*p*), where

_{Cd}p_{cD}*p*is the probability of observing the genotype

_{CD}*CD*and so on. The probability of a particular two-locus haplotype configuration for some value of κ can be calculated aswhereUsing these probabilities and the enumeration of all contingency tables conditional on the marginals, a test statistic or

*P*-value distribution can be derived for all possible marginals. This provides the precise test statistic or

*P*-value distribution under a null or an alternative hypothesis.

## Acknowledgments

We thank T. Lumley and W. G. Hill for their thoughtful discussion on this topic. This study makes use of data generated by the Wellcome Trust Case–Control Consortium. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk. This work was supported in part by National Institutes of Health grants R01 GM 75091 and T32 GM07735. Funding for the project was provided by the Wellcome Trust under award 076113.

## Footnotes

Communicating editor: M. W. Feldman

- Received February 11, 2008.
- Accepted September 10, 2008.

- Copyright © 2008 by the Genetics Society of America