Correlation-Based Inference for Linkage Disequilibrium With Multiple Alleles
Dmitri V. Zaykin, Alexander Pudovkin, Bruce S. Weir

Abstract

The correlation between alleles at a pair of genetic loci is a measure of linkage disequilibrium. The square of the sample correlation multiplied by sample size provides the usual test statistic for the hypothesis of no disequilibrium for loci with two alleles and this relation has proved useful for study design and marker selection. Nevertheless, this relation holds only in a diallelic case, and an extension to multiple alleles has not been made. Here we introduce a similar statistic, R2, which leads to a correlation-based test for loci with multiple alleles: for a pair of loci with k and m alleles, and a sample of n individuals, the approximate distribution of n(k – 1)(m – 1)/(km)R2 under independence between loci is $\chi^2_{(k-1)(m-1)}$. One advantage of this statistic is that it can be interpreted as the total correlation between a pair of loci. When the phase of two-locus genotypes is known, the approach is equivalent to a test for the overall correlation between rows and columns in a contingency table. In the phase-known case, R2 is the sum of the squared sample correlations for all km 2 × 2 subtables formed by collapsing to one allele vs. the rest at each locus. We examine the approximate distribution under the null of independence for R2 and report its close agreement with the exact distribution obtained by permutation. The test for independence using R2 is a strong competitor to approaches such as Pearson's chi square, Fisher's exact test, and a test based on Cressie and Read's power divergence statistic. We combine this approach with our previous composite-disequilibrium measures to address the case when the genotypic phase is unknown. Calculation of the new multiallele test statistic and its P-value is very simple and utilizes the approximate distribution of R2. We provide a computer program that evaluates approximate as well as “exact” permutational P-values.

The phenomenon of nonrandom co-occurrence of alleles at two loci on the same haplotype is known as linkage disequilibrium (LD). It is an important population genetic concept with wide applications including theoretical studies of evolutionary dynamics (Lewontin 1974), forensic science (Evett and Weir 1998), conservation genetics and studies of effective population size (Waples 2006), evolutionary history, and human origins (Tishkoff et al. 1996). The extent of LD in populations has been of great interest since the development of molecular techniques allowing genotypes to be obtained at multiple loci throughout the genome. Characterization of LD in human populations has been instrumental in fine mapping of complex genetic traits in both candidate gene and whole-genome association designs. Although diallelic loci (SNPs) are utilized in most association studies, multiallelic markers (microsatellites or SNP haplotypes) will continue to be useful in genetic research, most prominently in forensic applications and studies of population size and history. Multiallelic loci provide greater precision and may yield higher power to detect and characterize LD. A simulation study by Slatkin (1994) reported an increase in power with the number of alleles to detect LD by Fisher's exact test under a finite-allele mutation model with drift and recombination. More generally, power is not a simple function of the number of alleles, as it depends on the actual disequilibria and allelic frequencies (Weir and Cockerham 1978). Formally, the LD coefficient for alleles A and B at loci A and B refers to the deviation of the joint frequency, gametic or haplotypic, from the product of allele frequencies, $D_{AB} = p_{AB} - p_A p_B$. The correlation between alleles is defined as
$$\rho_{AB} = \frac{D_{AB}}{\sqrt{p_A(1-p_A)\,p_B(1-p_B)}}.$$
Strictly speaking, the correlation is for the indicator variables xA and yB that equal 1 when the alleles are A and B and zero otherwise.
This correlation coefficient has drawn much attention during recent years because the quantity $n r_{AB}^2$, where rAB is the value of ρAB in a sample of n gametes, is asymptotically distributed as $\chi^2_1$ under the hypothesis that ρAB = 0. This relation has obvious implications for issues of power of association studies and strategies for selecting subsets of genetic markers representative of common haplotypes for genomewide analysis (Pritchard and Przeworski 2001; International Hapmap Consortium 2003; Terwilliger and Hiekkalinna 2006). However, no similar relation has been proposed for markers with more than two alleles at each locus. There is a statistical difficulty in that, beyond the two-allele case, the total squared correlation R2 does not have a limiting chi-square distribution. Briefly, a sum of squared normal variables, $\sum_i Z_i^2$, has a χ2-distribution only when the variance–covariance matrix of the Zi's is a projection matrix. A more general result is usually stated in the matrix notation, regarding the distribution of a quadratic form, $\mathbf{Z}^T\mathbf{C}\mathbf{Z}$ (Searle 1971, Chap. 2, Theorem 2). In our case, C is an identity matrix. Pearson's X2-statistic is an example of such a sum, while the sum of squared LD correlations is not. Thus, despite the vast theory on contingency tables, the distribution of R2 has not been adopted for testing interactions. Nevertheless, different approximations by a scaled chi-square distribution are possible for a sum of dependent chi-squares (e.g., Box 1954). Here we report a very simply computed chi-square approximation that appears to have good properties. This result is further applied to testing LD at a pair of multiallelic loci when only single-locus genotypes are scored unambiguously. Earlier work on characterization and testing of LD at a pair of multiallelic loci includes accounts by Hill (1975); Yamazaki (1977); Weir and Cockerham (1978); Weir (1979); Karlin and Piazza (1981); Hedrick (1987); Zaykin et al. (1995); Kalinowski and Hedrick (2000); Zapata (2000); Schaid (2004); and Zhao et al. (2005, 2007). Similar to the methods of Weir (1979) and Schaid (2004), our correlation LD approach is based on the composite disequilibrium definition. The composite disequilibrium approach has certain desirable properties. It is robust with respect to single-locus deviations from Hardy–Weinberg equilibrium (HWE). The composite disequilibrium coefficient is estimated directly from genotypic counts, and thus it is readily computed from data with the unknown gametic phase. Earlier work (Weir 1979; Schaid 2004; Zaykin 2004) demonstrated good statistical properties associated with this approach.

We recommend the correlation LD test, which can be readily applied to screen large numbers of pairs of multiallelic loci. It is also applicable to correlation-based tests for interaction in contingency tables. Our program provides exact (permutational) P-values for tests based on R2.

METHODS

Known gametic phase:

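When the gametic phase is unambiguous, the two-locus haplotype observations can be arranged into a k × m contingency table with the sample size N being equal to twice the number of individuals n, N = 2n. The cell counts in the table represent N haplotype observations: the (i, j)th cell has the number nij of haplotypes carrying allele i at the first locus and allele j at the second. We assume multinomial sampling of haplotypes. The observed haplotype frequencies are $\hat{p}_{ij} = n_{ij}/N$. Row and column frequencies for the table of haplotype frequencies correspond to the vectors of allele frequencies at the two loci: {p1, … , pk} and {q1, … , qm}. The observed correlation for the cell (i, j) is
$$r_{ij} = \frac{\hat{p}_{ij} - \hat{p}_i\hat{q}_j}{\sqrt{\hat{p}_i(1-\hat{p}_i)\,\hat{q}_j(1-\hat{q}_j)}}. \tag{1}$$
We propose the following two correlation-based statistics, both having an approximate chi-square distribution (as shown in appendix a). The eigenvalue-based statistic is
$$T_1 = \frac{N}{c}\sum_{i=1}^{k}\sum_{j=1}^{m} r_{ij}^2 \;\dot\sim\; \chi^2_d, \tag{2}$$
where
$$c = \frac{\sum_i \lambda_i^2}{\sum_i \lambda_i}, \qquad d = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2},$$
and the λi are the nonzero eigenvalues of the asymptotic covariance matrix of the sample correlations (appendix a). The statistic T2 is much simpler, as it does not involve a computation of eigenvalues:
$$T_2 = N\,\frac{(k-1)(m-1)}{km}\sum_{i=1}^{k}\sum_{j=1}^{m} r_{ij}^2 \;\dot\sim\; \chi^2_{(k-1)(m-1)}. \tag{3}$$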
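As an illustration of the simpler statistic, the following minimal sketch computes the cell correlations and T2 from a k × m table of haplotype counts. It assumes the scaling N(k – 1)(m – 1)/(km) described in the abstract and appendix a; the function names are hypothetical and are not taken from the authors' program.

```python
from math import sqrt

def allele_correlations(table):
    """Sample correlations r_ij for every cell of a k x m haplotype count table.

    Assumes every row and column total is positive and smaller than N,
    so that no marginal frequency equals 0 or 1.
    """
    N = sum(sum(row) for row in table)
    p = [sum(row) / N for row in table]            # allele frequencies, locus 1
    q = [sum(col) / N for col in zip(*table)]      # allele frequencies, locus 2
    corr = []
    for i, row in enumerate(table):
        corr_row = []
        for j, n_ij in enumerate(row):
            d_ij = n_ij / N - p[i] * q[j]          # sample LD coefficient
            corr_row.append(d_ij / sqrt(p[i] * (1 - p[i]) * q[j] * (1 - q[j])))
        corr.append(corr_row)
    return corr

def t2_statistic(table):
    """T2 = N (k-1)(m-1)/(km) * (sum of squared correlations)."""
    N = sum(sum(row) for row in table)
    k, m = len(table), len(table[0])
    total_r2 = sum(r * r for row in allele_correlations(table) for r in row)
    return N * (k - 1) * (m - 1) / (k * m) * total_r2

if __name__ == "__main__":
    # Illustrative 3 x 5 table of haplotype counts (made-up numbers).
    example = [[4, 6, 8, 10, 12], [3, 5, 6, 7, 9], [2, 3, 4, 5, 6]]
    print(t2_statistic(example))
```

For a 2 × 2 table all four squared correlations coincide, so T2 reduces to N times the squared correlation, the familiar two-allele statistic.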

Unknown haplotype phase:

Scoring genotypes one locus at a time creates ambiguity in determining pairs of haplotypes in individuals that are heterozygous at both loci. A maximum-likelihood solution for obtaining sample haplotype frequencies was suggested by Hill (1974, 1975) and elaborated on by Weir and Cockerham (1979). This approach was extended to multiple loci (Excoffier and Slatkin 1995) with the use of the EM algorithm incorporating the likelihood under the assumption of HWE. Weir (1979) sought to avoid making the HWE assumption and suggested estimating the composite disequilibrium defined as ΔAB = pAB + pA/B − 2pApB, where pA/B is the joint frequency of alleles A and B at two different gametes within individuals. The corresponding composite LD correlation is
$$\rho^{c}_{AB} = \frac{\Delta_{AB}}{\sqrt{\left[p_A(1-p_A) + D_A\right]\left[p_B(1-p_B) + D_B\right]}}, \tag{4}$$
where DA, DB are the Hardy–Weinberg disequilibrium coefficients at the two loci. Strictly speaking, this is the correlation of the number of A and B alleles carried by an individual (Weir 1979; Zaykin 2004). The composite coefficient is directly estimated from two-locus counts by simple counting (Weir 1979). Under HWE, the intergametic disequilibrium term $D_{A/B} = p_{A/B} - p_A p_B = 0$, and the population value of ΔAB = DAB.
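The statement that the composite correlation is the correlation of allele counts can be checked with a short derivation. The sketch below writes the count of allele A in an individual as the sum of two gamete indicators; the indicator notation $x_{A,1}, x_{A,2}$ is introduced here for illustration only, and $D_A = P_{AA} - p_A^2$ is assumed to be the Hardy–Weinberg disequilibrium coefficient for allele A.

```latex
\begin{align*}
x_A &= x_{A,1} + x_{A,2}, \qquad y_B = y_{B,1} + y_{B,2}, \\
\operatorname{Var}(x_A) &= 2\,p_A(1-p_A) + 2\operatorname{Cov}(x_{A,1}, x_{A,2})
                         = 2\left[p_A(1-p_A) + D_A\right], \\
\operatorname{Var}(y_B) &= 2\left[p_B(1-p_B) + D_B\right], \\
\operatorname{Cov}(x_A, y_B) &= 2\,(p_{AB} - p_A p_B) + 2\,(p_{A/B} - p_A p_B)
                              = 2\,\Delta_{AB}, \\
\operatorname{Corr}(x_A, y_B) &= \frac{\Delta_{AB}}
   {\sqrt{\left[p_A(1-p_A) + D_A\right]\left[p_B(1-p_B) + D_B\right]}} .
\end{align*}
```

The factors of 2 cancel, which is why the composite correlation can be estimated from genotype counts without resolving the gametic phase.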

The composite correlations for a pair of alleles in a multiple-allele system areMathWeir and Cockerham (1989) gave a decomposition of the two-locus genotype frequency Math as a sum of functions of allele frequencies and two-locus disequilibria. Writing out the two-locus analog of the Hardy–Weinberg disequilibrium (HWD), Math, in these terms shows that under the two-locus HWE, only the DAB and thus ΔAB disequilibria are nonzero. Therefore, assuming two-locus HWE, a chi-square statistic for testing LD can be written asMath(5)as was suggested by Weir (1979). Under HWE, the composite coefficient estimates the usual LD. On the basis of Fisher's formula for approximate variances, Schaid (2004) derived the covariance matrix of the sample LD coefficients (W). He proposed a chi-square test based on a quadratic form. The test statistic definition involves a generalized inverse, W. This test is analogous to (19). For the vector containing all sample composite LD coefficients Math, Schaid's test statistic, Math, has an asymptotic chi-square distribution with the degrees of freedom equal to the rank of W. Schaid's test explicitly incorporates deviations from HWE.

We base the unknown-phase extension of the correlation LD approach on the approximate sampling distribution of the total composite LD correlation,Math(6)where Math denotes sample values of Math. Comparing this statistic to (5) shows that now the deviations from HWE at both loci are explicitly incorporated into the test.

Schaid's test statistic as well as (Rc)2 assumes that trigenic and quadrigenic two-locus disequilibria can be ignored. These disequilibria compare joint frequencies of three and four alleles at two loci with the products of allele frequencies, after removing any lower-order disequilibria (Weir 1996). To obtain the Box-type approximation (for the statistic T1), the elements of the matrix W are scaled as Math. This gives the correlation matrix WR. As before, the scale parameter is Math, and the degrees of freedom are Math. Then the two statistics with their approximate distributions areMath(7)Math(8)where Math is the average composite correlation.
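Because the composite correlation is the correlation between the numbers of alleles carried by an individual, the simpler statistic can be sketched directly from unphased genotypes. The code below is a minimal illustration, assuming the scaling n(k – 1)(m – 1)/(km) stated in the abstract, with n the number of individuals; the data layout (a pair of allele pairs per individual, alleles coded 0, 1, …) and the function names are hypothetical.

```python
from math import sqrt

def allele_counts(genotypes, locus, n_alleles):
    """Per-individual copy numbers (0, 1 or 2) of each allele at one locus."""
    return [[(g[locus][0] == a) + (g[locus][1] == a) for a in range(n_alleles)]
            for g in genotypes]

def correlation(u, v):
    """Pearson correlation of two equal-length lists.

    Assumes both lists vary, i.e. every allele is present but not fixed.
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n
    var_u = sum((a - mean_u) ** 2 for a in u) / n
    var_v = sum((b - mean_v) ** 2 for b in v) / n
    return cov / sqrt(var_u * var_v)

def composite_t2(genotypes, k, m):
    """n (k-1)(m-1)/(km) times the sum of squared composite correlations."""
    n = len(genotypes)
    x = allele_counts(genotypes, 0, k)   # locus A counts
    y = allele_counts(genotypes, 1, m)   # locus B counts
    total = 0.0
    for i in range(k):
        x_i = [row[i] for row in x]
        for j in range(m):
            y_j = [row[j] for row in y]
            total += correlation(x_i, y_j) ** 2
    return n * (k - 1) * (m - 1) / (k * m) * total

if __name__ == "__main__":
    # Each individual: ((allele, allele) at locus A, (allele, allele) at locus B)
    sample = [((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (0, 0)), ((0, 1), (0, 1))]
    print(composite_t2(sample, k=2, m=2))
```

No haplotype reconstruction is needed: only single-locus genotypes enter the calculation, which is the property the composite approach is designed to provide.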

Type-I error rates, goodness of fit to the null distribution, and power:

A common way to evaluate a test performance under the null hypothesis is to report the type-I error, or the proportion of P-values that fall below a rejection threshold, such as α = 0.05. An empirical estimate of the type-I error is that proportion in a large number of simulations conducted under the null hypothesis. We denote the number of simulations by B. For a more complete evaluation of the P-value distribution produced by a test, we propose to compute a statistic SB that adds up the squares of deviations of ordered P-values from the respective theoretical values expected under the null distribution. A visual method of plotting ordered P-values against the corresponding expected values of order statistics is known as a “rankit plot” (Ipsen and Jerne 1944). Such a plot very closely corresponds to the common “Q-Q” plot (where values are plotted against quantiles instead), unless the value of B is small. The deviation from the null by visual inspection is judged by the deviation of actual P-values from the expected straight line. The essence of the statistic SB is to capture the extent of this deviation. Since the usual type-I error reports the proportion of P-values below a single fixed cutoff point (a nominal level), commonly chosen to be 5%, it is possible that there would be a different degree of closeness to the nominal value at a different cutoff point. In contrast, the statistic SB has an advantage in that it gives a summary of the correspondence of P-values with the null distribution for the entire (0, 1) interval.

We denote the ordered set of P-values obtained from B simulations as $p_{(1)} \le p_{(2)} \le \dots \le p_{(B)}$. The random variable that corresponds to the observed p(i) is denoted by P(i). The summary statistic measuring the lack of fit to the null distribution is
$$S_B = \sum_{i=1}^{B}\left[p_{(i)} - E\left(P_{(i)}\right)\right]^2. \tag{9}$$
Under the null hypothesis, the distribution of the order statistics P(i) would be Beta(i, B − i + 1) if the distribution of the test statistic was continuous and exact, rather than approximate. The computational formula for SB is
$$S_B = \sum_{i=1}^{B}\left(p_{(i)} - \frac{i}{B+1}\right)^2. \tag{10}$$
Larger values of SB indicate larger deviations from the null distribution. When P-values indeed come from the null (uniform) distribution, we find the expected value of this statistic to be
$$E(S_B) = \sum_{i=1}^{B}\operatorname{Var}\left(P_{(i)}\right) = \frac{B}{6(B+1)}. \tag{11}$$
Thus, for any test statistic, the fit of P-values to the uniform distribution can be simply evaluated by computing the proposed statistic, SB. We report and compare the values of SB for competing methods, in addition to the usual empirical type-I error rates.
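The lack-of-fit summary is easy to compute. The sketch below assumes, as the text describes, that SB is the sum of squared deviations of the ordered P-values from their null expectations i/(B + 1), the means of the Beta(i, B − i + 1) order statistics; the function names are hypothetical.

```python
def lack_of_fit_sb(p_values):
    """Sum of squared deviations of ordered P-values from i/(B+1)."""
    B = len(p_values)
    ordered = sorted(p_values)
    return sum((p - i / (B + 1)) ** 2 for i, p in enumerate(ordered, start=1))

def expected_sb(B):
    """Null expectation of S_B: the sum of the variances of the
    Beta(i, B - i + 1) order statistics, which simplifies to B / (6 (B + 1))."""
    return B / (6 * (B + 1))

if __name__ == "__main__":
    import random
    rng = random.Random(2008)
    uniform_p = [rng.random() for _ in range(1000)]
    print(lack_of_fit_sb(uniform_p), expected_sb(1000))
```

For large B the null expectation approaches 1/6, so on the 1000 × SB scale used in the tables a well-calibrated test should give values near 167.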

Performance of the tests under the alternative hypothesis (HA) was characterized by statistical power. Power was estimated as the proportion of P-values that fall below the 5% rejection threshold, using data sets generated under HA.

RESULTS

Known haplotype phase:

The goal of this section is to compare performance of the proposed correlation-based tests. The performance was evaluated in terms of the classical type-I error and power. Additionally, the fit to the null distribution was evaluated with the usage of the coefficient SB, as described above. The following tests were used in this study:

  1. Correlation-based statistic T1 defined by (2).

  2. Correlation-based statistic T2 defined by (3).

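  3. Cressie–Read's power divergence statistic,
     $$C_\lambda = \frac{2}{\lambda(\lambda+1)}\sum_{i=1}^{k}\sum_{j=1}^{m} n_{ij}\left[\left(\frac{n_{ij}}{e_{ij}}\right)^{\lambda} - 1\right] \tag{12}$$
     with λ = 2/3 (Cressie and Read 1984), where eij = ni·n·j/n are the expected counts. Cλ has an asymptotic chi-square distribution with (k – 1)(m – 1) d.f.

  4. Likelihood-ratio (LR) statistic,
     $$G^2 = 2\sum_{i=1}^{k}\sum_{j=1}^{m} n_{ij}\ln\frac{n_{ij}}{e_{ij}}. \tag{13}$$

  5. Pearson's chi-square statistic,
     $$X^2 = \sum_{i=1}^{k}\sum_{j=1}^{m}\frac{(n_{ij} - e_{ij})^2}{e_{ij}}. \tag{14}$$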

  6. Permutation-based tests using statistics as defined above, which we denote as Tp, Math, and X Math. The statistics T1 and T2 correspond to the same permutational test, denoted by Tp.

  7. Fisher's exact test Fp, with the P-value approximated by a permutation test using the statistic Math.
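The three classical statistics in the list above can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions of the power divergence, likelihood-ratio, and Pearson statistics; the function names are hypothetical, and the default λ = 2/3 is the value recommended by Cressie and Read (1984).

```python
from math import log

def expected_counts(table):
    """Expected counts e_ij = (row total)(column total) / (grand total)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_sums] for r in row_sums]

def _cells(table):
    """Yield (observed, expected) pairs for every cell."""
    for obs_row, exp_row in zip(table, expected_counts(table)):
        yield from zip(obs_row, exp_row)

def pearson_x2(table):
    return sum((n - e) ** 2 / e for n, e in _cells(table))

def likelihood_ratio_g2(table):
    # Empty cells contribute zero (the limit of n ln n as n -> 0).
    return 2 * sum(n * log(n / e) for n, e in _cells(table) if n > 0)

def cressie_read(table, lam=2 / 3):
    # Valid for lam not equal to 0 or -1; lam = 1 recovers Pearson's X2.
    return 2 / (lam * (lam + 1)) * sum(n * ((n / e) ** lam - 1)
                                       for n, e in _cells(table))

if __name__ == "__main__":
    t = [[10, 20], [30, 40]]
    print(pearson_x2(t), likelihood_ratio_g2(t), cressie_read(t))
```

Setting λ = 1 in the power divergence family gives Pearson's statistic exactly, which offers a convenient check of an implementation.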

The P-value for a permutation test is defined as the proportion of times the test statistic computed from randomly sampled tables was found to be as extreme or more extreme than the statistic value for the original data. These random tables are generated with marginal counts constrained to be the same as that for the observed data set. We used K = 19,999 permutations to compute each P-value, and the number of simulations in all type-I error evaluation experiments was B = 100,000. Oden (1991) showed that the value of K in simulation experiments can be very much smaller than B. Boos and Zhang (2000) suggested that K can be as small as Math, and if the significance level is α, the value should preferably be such that (K + 1)α is an integer. The number of simulations to evaluate power was 10,000.
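A permutation test of this kind can be sketched as follows. The code shuffles the second-locus labels of the N haplotypes, which keeps both sets of marginal counts fixed, and uses the total squared correlation as the statistic. It is an illustration only: the function names are hypothetical, and the P-value is computed as (count + 1)/(K + 1), a common convention that counts the observed table among the arrangements and that is an assumption here rather than a detail confirmed by the text.

```python
import random
from math import sqrt

def total_r2(table):
    """Sum of squared cell correlations (assumes no empty row or column)."""
    N = sum(sum(row) for row in table)
    p = [sum(row) / N for row in table]
    q = [sum(col) / N for col in zip(*table)]
    return sum((n / N - p[i] * q[j]) ** 2
               / (p[i] * (1 - p[i]) * q[j] * (1 - q[j]))
               for i, row in enumerate(table) for j, n in enumerate(row))

def permutation_p_value(table, statistic=total_r2, K=199, seed=0):
    """Monte Carlo P-value from K random tables with the observed margins."""
    rng = random.Random(seed)
    k, m = len(table), len(table[0])
    # Expand the table into one (row label, column label) pair per haplotype.
    rows = [i for i, row in enumerate(table) for n in row for _ in range(n)]
    cols = [j for row in table for j, n in enumerate(row) for _ in range(n)]
    observed = statistic(table)
    hits = 0
    for _ in range(K):
        rng.shuffle(cols)                      # break the row-column pairing
        perm = [[0] * m for _ in range(k)]
        for i, j in zip(rows, cols):
            perm[i][j] += 1
        if statistic(perm) >= observed - 1e-12:  # as extreme or more extreme
            hits += 1
    return (hits + 1) / (K + 1)

if __name__ == "__main__":
    print(permutation_p_value([[12, 5, 3], [4, 11, 5], [3, 6, 11]], K=999))
```

With this convention, choosing K so that (K + 1)α is an integer, as with K = 19,999 and α = 0.05, lets the Monte Carlo P-value hit the nominal level exactly.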

Tables 1–3 present results for the type-I error rates at the nominal 5% level and the closeness of fit to the null distribution as measured by the SB statistic. The tables of haplotype counts in this set of simulations have fixed margins and the cell counts are generated at random to satisfy the marginal conditions. A similar approach was used in the evaluation of small sample properties of some common tests, such as Pearson's chi square (e.g., Larntz 1978; Fienberg 1979). For example, the marginal frequencies in Table 2 are taken to be proportional to (2:3:5) for the rows and (2:3:4:5:6) for the columns. This matches the first setting of Table 6 in Larntz (1978). Our values for Pearson's X2 and the LR statistic G2 replicate the type-I error results of Larntz, who used sample sizes of 20–100. Across all simulations, our results confirm the previous observations (Larntz 1978) that the LR test (G2) has an inflated type-I error when sample sizes are small to moderate.

TABLE 1

Type-I error rates and values of the statistic measuring lack of fit to the null distribution, 1000 × SB for 3 × 5 tables: row margins, 5:3:2; column margins, 2:3:4:5:6

TABLE 2

Type-I error rates and values of the statistic measuring lack of fit to the null distribution, 1000 × SB for 3 × 5 tables: row margins, 2:3:5; column margins, 2:3:4:5:6

TABLE 3

Type-I error rates and values of the statistic measuring lack of fit to the null distribution, 1000 × SB for 5 × 7 tables: row margins, 2:3:4:5:6; column margins, 1:2:3:4:5:6:7

Both of the proposed statistics, T1 and T2, show a correct type-I error for the corresponding test. Moreover, examination of SB values indicates that the small-to-moderate sample size behavior of these statistics is such that they provide the best fit to the null distribution among the asymptotic/approximate tests studied here. The simpler approximation, T2, shows the best fit.

Tables 4–7 present both power and the behavior under H0, given in terms of the type-I error and the SB values. The null distribution data sets corresponding to the power results were generated by randomly shuffling the data generated under the association model, to produce new counts under the hypothesis of no association. Sample sizes for different simulations are chosen depending on the strength of the population association, to provide intermediate to high power, and highlight the difference between the tests.

TABLE 4

Power and the corresponding H0 behavior for 4 × 3 tables at abs(D′) ± 0.5 (population parameters are defined in Table B1 of appendix b)

TABLE 5

Power and the corresponding H0 behavior for 4 × 3 tables at abs(D′) ± 0.5 (population parameters are defined in Table B2 of appendix b)

TABLE 6

Power and the corresponding H0 behavior for 5 × 5 tables at abs(D′) ∈ (0.19–0.21) (population parameters are defined in Table B3 of appendix b)

TABLE 7

Power and the corresponding H0 behavior for the “sample heterogeneity” model (5 × 5 tables)

The population association values for Tables 4–6 were generated as follows. The association value for the cell (i, j) can be measured in terms of LD, $D_{ij} = p_{ij} - p_i q_j$. The maximum absolute value of Dij is constrained by the marginal frequencies pi, qj and the association values for all cells were set as proportions of the maximum attainable value, $D'_{ij} = D_{ij}/D_{\max}$ (Lewontin 1964). The population frequencies, pij, and the values of $D'_{ij}$ are given in appendix b, Tables B1–B3. Samples for each simulation experiment were obtained by multinomial sampling from these population frequencies. Both of the proposed tests (T1, T2) show type-I error rates close to the nominal 5% level. The simple approximation T2 shows the best fit to the null distribution among the asymptotic/approximate tests. Moreover, the power corresponding to T2 or its permutational equivalent Tp is somewhat higher than that for the rest of the tests. These differences in power are highly significant statistically, due to the paired nature of the data (P-values) and the large number of simulations.
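The normalization by the maximum attainable disequilibrium can be illustrated with a short sketch. It uses the standard bounds on D for one allele vs. the rest at each locus (Lewontin 1964); the function name and the signed return value are choices made here for illustration.

```python
def d_prime(p_ij, p_i, q_j):
    """Signed D' for one allele pair: D divided by its maximum attainable
    magnitude given the marginal frequencies p_i and q_j."""
    d = p_ij - p_i * q_j
    if d >= 0:
        d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
    else:
        d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
    # If an allele is absent or fixed, no disequilibrium is possible.
    return d / d_max if d_max > 0 else 0.0

if __name__ == "__main__":
    # A pair of alleles with frequencies 0.3 and 0.4 and joint frequency 0.2.
    print(d_prime(0.2, 0.3, 0.4))
```

Taking the absolute value of the result gives the abs(D′) scale quoted in the legends of Tables 4–6.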

As mentioned previously, in the known-phase case the test for LD is equivalent to a test for interaction in a contingency table. In principle, the tests based on the total correlation can be used in a classical setting of testing heterogeneity between several multinomial samples. Although a detailed examination of the proposed tests regarding this problem is beyond the scope of this article, a simulation study (Table 7) confirms that the proposed approach provides a competitive test. In the 5 × 5 tables used here, rows represent independent samples taken from five populations; and columns represent five categories (such as sample-specific allele frequencies). Population frequencies for each of the simulations were generated from the Dirichlet distribution with the common parameter, 20. A property of this sampling is such that the 1 and 99% population quantiles for the frequency of any of the five column categories are 0.1 and 0.3, with mean frequency 0.2. This range gives a measure of the between-population variability for each of the categories. Samples for each of the five populations were generated by multinomial sampling for each of the simulation runs. As before, data for the hypothesis of homogeneity (H0) were obtained by taking the sample generated as just described and reshuffling the counts under the constraints that the marginal frequencies of a particular sample are preserved. Table 7 shows good properties of the proposed tests under the hypothesis of no association. The power values are found to be identical to those provided by Fisher's exact test. The asymptotic version of the LR test (G2) shows a higher power; however, this value might be unreliable, because the type-I error of this test was found consistently inflated in all simulations.
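The sampling scheme for the heterogeneity model can be sketched with the standard library alone. Population frequencies are drawn from a symmetric Dirichlet distribution by normalizing gamma variates, and each row of the table is then a multinomial sample. The per-population sample size is not stated in this passage, so the `sample_size` value below, like the function names, is purely illustrative.

```python
import random

def dirichlet(alpha, size, rng):
    """One draw from a symmetric Dirichlet(alpha, ..., alpha) distribution,
    obtained by normalizing independent Gamma(alpha, 1) variates."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(gammas)
    return [g / total for g in gammas]

def heterogeneity_table(n_pops=5, n_cats=5, alpha=20.0, sample_size=50, seed=1):
    """Rows are independent multinomial samples, one per population, each
    drawn from that population's own Dirichlet-distributed frequencies."""
    rng = random.Random(seed)
    table = []
    for _ in range(n_pops):
        freqs = dirichlet(alpha, n_cats, rng)
        draws = rng.choices(range(n_cats), weights=freqs, k=sample_size)
        table.append([draws.count(c) for c in range(n_cats)])
    return table

if __name__ == "__main__":
    for row in heterogeneity_table():
        print(row)
```

Larger values of the common Dirichlet parameter concentrate the population frequencies more tightly around 1/5 and therefore make the samples more homogeneous.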

Unknown haplotype phase:

This section gives results of the comparison between the two “LD correlation” statistics (T1, T2) and a chi-square test recently described by Schaid (2004), which has similarity in that it also utilizes the composite LD definition. Schaid's test (S2) corresponds to Pearson's chi-square in the “known-phase” case; however, there is no simple explicit expression for the test statistic in the ambiguous haplotype phase case. The calculation of S2 involves a generalized inverse of the covariance matrix of the sample composite LD. We assume a common scenario when single-locus genotypes are scored at each locus, without the knowledge about arrangement of the alleles on haplotypes across the loci.

The first set of simulations was designed for a two-locus linkage equilibrium system with five and seven alleles correspondingly. Both loci have high population levels of HWD. The amount of HWD and allele frequencies for various simulation settings are given in the legend to Table 8. The homozygote HWD values for the two loci (Math) are given as proportions of the maximum possible value. The heterozygote HW disequilibria are related to these as Math. The simulation results confirm that both the correlation-based tests and Schaid's test are robust in the presence of high levels of population HWD. Similar to the known-phase results, the simple T2 approximation shows the best fit to the null distribution (under the hypothesis of linkage equilibrium).

TABLE 8

Type-I error and values of the statistic measuring lack of fit to the null distribution, 1000 × SB, for the composite LD tests: locus A, five alleles; locus B, seven alleles; n = 100

The second set of simulations was designed to evaluate power utilizing the population LD derived from an actual set of human short tandem repeat (STR) polymorphisms, described in Rosenberg et al. (2002). We took 30 STR loci from chromosome 1, using a combined sample of 217 Middle-East and European individuals, and identified seven pairs of loci in LD by an exact test (Zaykin et al. 1995). The resulting set of loci used for these simulations had 4–6 alleles after rare alleles were grouped together. Two-locus counts of these data were further used to set the population frequencies. These fixed population frequencies were used to obtain multinomial samples of individuals for each of the simulations. Results of these simulations are shown in Table 9. The permutational (“exact”) version of the correlation-based tests, Tp was included as well. The fit to the null (linkage equilibrium) distribution follows the same pattern found in the previous simulations—the simple approximation T2 shows a better fit than other nonexact tests. The power values are found to be similar in all cases.

TABLE 9

Human diversity panel results

Correspondence between approximations and the exact test for the total correlation:

Overall, we found an excellent agreement between P-values provided by either of the approximations (T1, T2) and the exact P-value given by the test Tp.

Figure 1, a and b, shows a very close P-value correspondence between T2 and its exact version, Tp. Figure 1a plots the T2 P-values against the Tp P-values using the subset of simulations used to produce Table 1 (N = 100). Figure 1b is a similar plot for the unknown haplotype phase data (locus pairs 11 and 23 from Table 9). For comparison, Figure 1c plots T2 P-values against those obtained by Pearson's chi-square test (N = 100, data from Table 1 simulations). There is no similar correspondence, which indicates that the two statistics are capturing somewhat different aspects of sample associations.

Figure 1.—

(a) Plots of T2 P-values against the Tp P-values for the known haplotype phase simulations. (b) Plots of T2 P-values against the Tp P-values for the unknown haplotype phase simulations. (c) Plots of T2 P-values against Pearson's χ2 P-values for the known haplotype phase simulations.

Figure 2 shows the correspondence between the two correlation-based test approximations, T1 and T2. Figure 2a illustrates the correspondence for the known haplotype phase case (N = 100; data for Table 1). Figure 2b illustrates a similar correspondence between the P-values for the unknown haplotype phase case (data from simulations to produce “setting I” in Table 8).

Figure 2.—

(a) Plots of T2 P-values against T1 P-values for the known haplotype phase simulations. (b) Plots of T2 P-values against T1 P-values for the unknown haplotype phase simulations.

Because the P-values from the T1 and T2 tests are so close, and because the T2 statistic is much simpler to compute, we recommend the T2 test over the test based on T1.

DISCUSSION

We introduce correlation-based testing for linkage disequilibrium with multiple alleles. Following earlier work by Weir (1979) and Schaid (2004) we adopt the usage of the composite LD that provides robust inference even under conditions of high deviations from HWE. Simulations confirm that the test maintains the proper error rate even when the HWD reaches its maximum value for some of the genotypes. Our approach provides several advantages. The behavior of the proposed method under the hypothesis of no association is found to be consistently closer to the expected than that of other “nonexact” tests included in this study. Values of the statistic SB that we introduced for evaluation of the null distribution of the studied test statistics show that in 35 of 38 experiments, the approximation T2 was closer to its null expected value than the chi-square statistic (Tables 1–9). Power evaluations suggest that the correlation-based tests provide higher power than other tests under the alternatives where associations are present among multiple pairs of alleles (Tables 4–6). The novelty and advantages of our approach also include tractability of the corresponding test statistic, simplicity, and high speed of computations. The relation of the sum of squared LD correlations to chi-square extends the well-known relation for the two-allele case and thus may have implications for the design of genetic association studies. Good power properties of the test based on a simple statistic Math give justification for usage of the average correlation to characterize and compare multiallelic LD in various settings, including estimation of the effective population size (Waples 2006) and fine mapping of genetic traits, where LD coefficients could be compared between samples with and without a specific disease (Nielsen et al. 2004; Zaykin et al. 2006). Further work may include investigation of confidence intervals for R2, on the basis of the proposed chi-square approximation.

Although the method is motivated by testing the LD, the test provides high power when used to detect heterogeneity among samples in contingency tables. For example, the correlation-based test can be used to compare allele or genotype frequencies (columns) between samples from several populations, represented by rows in a contingency table. In this setting, the power is very similar to the power of common tests such as Pearson's chi-square and Fisher's exact test. Further study may be required to fully investigate properties of this test as a general purpose test for detecting interactions and heterogeneity in contingency tables.

A computer program implementing the methods described here is available at (http://www.niehs.nih.gov/research/atniehs/labs/bb/staff/zaykin/rxc.cfm) or by a request to D.V.Z. The provided implementation computes average correlations with the corresponding P-values on the basis of the T2 statistic, using multilocus genotype data. For those P-values that fall below a user-specified threshold, a Monte Carlo P-value is reported as well. This approach allows rapid computations for large collections of loci. Correlation-based tests for contingency tables are implemented as well.

APPENDIX A

We denote the km × 1 vectors of population and sample frequencies by P and Math; the elements of Math are the observed haplotype frequencies, Math. Under the null hypothesis, H0: P = P0, we have that Math converges in distribution to a multivariate normal. Row and column frequencies for the table of haplotype frequencies correspond to the vectors of allele frequencies at the two loci: p, {p1, … , pk}T; q, {q1, … , qm}T. For complete absence of linkage disequilibrium, the vector of frequencies is a (km × 1) Kronecker product,Mathand the vector of expected (equilibrium) sample frequencies is based on sample valuesMathUnder H0, the covariance matrix of Math isMathand the variance of Math isMath(Holt et al. 1980). The contingency table Pearson's chi-square statistic isMath(A1)DenoteMathThe notation {·} above denotes vectors; e.g., Math. The elements of the vectors Z and R areMath(A2)Math(A3)The elements of R are sample correlations for each pair of alleles, and Math is the sum of squared correlations for the entire table.

Pearson's X2 in (A1) can be expressed differently, using either the vector of the chi-square contributions, Z, or the vector of correlations, R,Math(A4)Math(A5)Math(A6)where V denotes a generalized inverse of V. The reduction to a simple sum was given by Pearson (1922).

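Our primary interest is in the approximate distribution of $\mathbf{R}^T\mathbf{R}$. For any multivariate normal vector Z* with covariance matrix V*, the distribution of (Z*)TZ* is that of $\sum_i \lambda_i \chi^2_{1(i)}$, where λi denote the nonzero eigenvalues of V* and $\chi^2_{1(i)}$ are the 1-d.f. chi-square variables (Box 1954). The asymptotic covariance matrix of $\sqrt{N}\,\mathbf{Z}$ is idempotent with all (k–1)(m–1) nonzero eigenvalues being equal to 1. Hence, the sum of $N Z_{ij}^2$, that is, $N\mathbf{Z}^T\mathbf{Z}$, has an asymptotic χ2-distribution. In contrast, the asymptotic covariance matrix of $\sqrt{N}\,\mathbf{R}$, denoted $\mathbf{V}_R$, with (k–1)(m–1) positive eigenvalues is not idempotent. Therefore, $N\mathbf{R}^T\mathbf{R}$ does not have an asymptotic χ2-distribution. Box (1954) suggested that the distribution of weighted chi-square variables, $\sum_i \lambda_i \chi^2_{v_i}$, where each chi square is with the degrees of freedom vi, can be approximated by a scaled chi-square distribution, $c\chi^2_d$, where
$$c = \frac{\sum_i v_i\lambda_i^2}{\sum_i v_i\lambda_i}, \tag{A7}$$
$$d = \frac{\left(\sum_i v_i\lambda_i\right)^2}{\sum_i v_i\lambda_i^2}. \tag{A8}$$
The degrees of freedom d need not be integral. In the case of a sum of correlations, all vi = 1, and the weights are computed from the eigenvalues of
$$\mathbf{V}_R = \operatorname{Cov}\left(\sqrt{N}\,\mathbf{R}\right). \tag{A9}$$
Since only the sums of eigenvalues or of their squares are needed, and not the eigenvalues themselves, the computations simplify substantially:
$$\sum_i \lambda_i = \operatorname{Tr}(\mathbf{V}_R), \tag{A10}$$
$$\sum_i \lambda_i^2 = \operatorname{Tr}(\mathbf{V}_R\mathbf{V}_R). \tag{A11}$$
This makes use of the fact that eigenvalues of a squared matrix are given by squared eigenvalues of that matrix and that the trace of a symmetric matrix is given by the sum of its eigenvalues. Therefore, for our first scaled chi-square approximation we have
$$\frac{N\,\mathbf{R}^T\mathbf{R}}{c} \;\dot\sim\; \chi^2_d, \tag{A12}$$
where $\dot\sim$ stands for “approximately distributed” and
$$c = \frac{\operatorname{Tr}(\mathbf{V}_R\mathbf{V}_R)}{\operatorname{Tr}(\mathbf{V}_R)}, \tag{A13}$$
$$d = \frac{\left[\operatorname{Tr}(\mathbf{V}_R)\right]^2}{\operatorname{Tr}(\mathbf{V}_R\mathbf{V}_R)}. \tag{A14}$$
In the second and much simpler approximation, we set the degrees of freedom equal to (k – 1)(m – 1), which is the number of nonredundant disequilibrium coefficients, $D_{ij} = p_{ij} - p_i q_j$, and note the expected values
$$E\left(\chi^2_{(k-1)(m-1)}\right) = (k-1)(m-1), \qquad E\left(N r_{ij}^2\right) \approx 1, \qquad E\left(N\,\mathbf{R}^T\mathbf{R}\right) \approx km. \tag{A15}$$
By matching moments, the scale parameter is found to be $km/[(k-1)(m-1)]$. Thus, we obtain our second approximate distribution as
$$N\,\frac{(k-1)(m-1)}{km}\,\mathbf{R}^T\mathbf{R} \;\dot\sim\; \chi^2_{(k-1)(m-1)}. \tag{A16}$$
Note that RTR/(km) is just the average squared correlation.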
Waples (2006) noted that approximately, the distribution of such a coefficient might be a chi square and that with k alleles per locus, the number of independent comparisons (and thus the degrees of freedom) for a comparison of two loci should be (k – 1)2. Nevertheless, he did not provide a distribution explicitly.

APPENDIX B

Tables of population joint frequencies (pij) to provide a specified amount of association (measured by D′) for the power study.

TABLE B1

The 4 × 3 table of joint frequencies with the corresponding level of association

TABLE B2

The 4 × 3 table of joint frequencies with the corresponding level of association

TABLE B3

The 5 × 5 table of joint frequencies with the corresponding level of association, pij\D

Acknowledgments

Shyamal Peddada and David Umbach provided useful discussion. Noah Rosenberg provided STR genotypes for deriving data sets used in the simulation study. Daniel Schaid and Jason Sinnwell provided a program implementing Schaid's S2 test. This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Institute of Environmental Health Sciences, and by NIH GM 07591.

Footnotes

  • Communicating editor: A. D. Long

  • Received March 18, 2008.
  • Accepted July 17, 2008.

References
