Abstract
The usual approach to characterizing and estimating multilocus associations in a diploid population assumes that the population is in Hardy-Weinberg equilibrium. The purpose of this study is to develop a set of summary statistics that can be used to characterize and estimate the multilocus associations in a nonequilibrium population. The concept of “zygotic associations” is first expanded to facilitate the development. The summary statistics are calculated using the distribution of a random variable, the number of heterozygous loci (K) found in diploid individuals in the population. In particular, the variance of K consists of single-locus and multilocus components with the latter being the sum of zygotic associations between pairs of loci. Simulation results show that the multilocus associations in the variance of K are detectable in a sample of moderate size (≥30) when the sum of all pairwise zygotic associations is greater than zero and when gene frequency is intermediate. The method presented here is a generalization of the well-known development for the Hardy-Weinberg equilibrium population and thus may be of more general use in elucidating the multilocus organizations in nonequilibrium and equilibrium populations.
The extent and patterns of nonrandom associations between linked as well as independent loci provide important information about the history of a population, the evolutionary forces governing these loci, and the location of the loci on the chromosomes. Such multilocus associations may arise from many demographic and evolutionary events including epistatic selection, random drift due to population growth and decline, mixing of two or more distinct gene pools, nonrandom mating, and mutation, regardless of whether or not the loci are physically linked (e.g., Hedrick et al. 1978; Brown 1979; Barton and Clark 1990).
A number of statistical measures have been proposed to characterize the multilocus associations, but the literature has focused on characterizing gametic disequilibria, i.e., nonrandom associations of alleles at two loci ordered within gametes (e.g., Hedrick 1987). While these measures are useful for analyzing haploid data or diploid data from a Hardy-Weinberg equilibrium population, they may not be appropriate for a nonequilibrium diploid population in which a complete characterization of two-locus associations also requires other types of disequilibria (Cockerham and Weir 1973; Weir 1979). For example, in a hybrid population arising from mixing of genes from two or more populations or species, alleles derived from the same populations or species tend to cluster together in the same individuals, either because of Wahlund's (1928) effect or because of strong selection against hybrids or both. The resulting Hardy-Weinberg disequilibria at individual loci and multilocus associations across loci may be persistent and may be detectable for a number of generations after an initial mixing of gene pools. Thus, the multilocus associations in the hybrid population need to be characterized at the zygote level.
A related issue about characterizing and testing the multilocus associations is that most of the proposed measures are defined for a pair of loci only. When there are a large number of loci, each having many alleles, pairwise measures may be too many to be readily manageable and interpretable. For example, for 20 loci, each with four alleles in a nonequilibrium population, there are 6 independent Hardy-Weinberg disequilibria for each of the 20 loci, 9 gametic disequilibria, 9 nongametic disequilibria, 54 trigenic disequilibria, and 45 quadrigenic disequilibria for each of 190 locus pairs. Furthermore, unless a stringent significance level is imposed, the large number of required pairwise tests under commonly used significance levels of 5 and 1% may produce spurious association realizations (Karlin and Piazza 1981; Weir 1996, pp. 133–135). Therefore, it is desirable to have a set of summary statistics that adequately describe the extent and patterns of multilocus structure in a nonequilibrium population.
The objective of this study is to develop such a set of summary statistics. The concept of “zygotic associations” (Haldane 1949; Bennett and Binet 1956; Allard et al. 1968) is first expanded to facilitate the development. The summary statistics are calculated using the distribution of a random variable, the number of heterozygous loci (K) found in diploid individuals in the population. A similar method by Brown et al. (1980) has been used to analyze multilocus data collected from haploid, inbred, or random mating populations (e.g., Brown et al. 1980; Whittam et al. 1983; Nevo and Beiles 1989; Maynard Smith et al. 1993; Yeh et al. 1994; Haubold et al. 1998), but it considers only gametic disequilibrium. Numerical analyses are also carried out to depict the dependence of the zygotic associations on gene frequencies and various disequilibria and to examine the sensitivity of our method for detecting the multilocus zygotic associations.
ZYGOTIC ASSOCIATIONS
Let us consider a diploid population in which individual genotypes are known at each of m loci. Two of these m loci are indexed by j and l with alleles j_{u}, u = 1, 2, … , r and l_{y}, y = 1, 2, … , s, respectively. Frequencies of genotypes at loci j and l from the union of gametes j_{u}l_{y} and j_{v}l_{z} are written as ${}_{jl}P^{uy}_{vz} = {}_{jl}P^{vz}_{uy}$. Weir (1979) described various marginal totals that are sums of genotypic frequencies indicated by dots for the indices summed. For example, one-locus genotypic frequencies for j_{u}j_{v} and l_{y}l_{z} are denoted by
$${}_{j}P^{u.}_{v.} = \sum_{y=1}^{s}\sum_{z=1}^{s} {}_{jl}P^{uy}_{vz} \quad\text{and}\quad {}_{l}P^{.y}_{.z} = \sum_{u=1}^{r}\sum_{v=1}^{r} {}_{jl}P^{uy}_{vz},$$
and frequencies of alleles j_{u} and l_{y} are given by
$${}_{j}p_{u} = {}_{j}P^{u.}_{..} = \sum_{v=1}^{r}\sum_{y=1}^{s}\sum_{z=1}^{s} {}_{jl}P^{uy}_{vz} \quad\text{and}\quad {}_{l}p_{y} = {}_{l}P^{.y}_{..} = \sum_{u=1}^{r}\sum_{v=1}^{r}\sum_{z=1}^{s} {}_{jl}P^{uy}_{vz}.$$
Following Bennett and Binet (1956) and Allard et al. (1968), we now define a zygotic association between loci j and l as a deviation of joint frequencies of double heterozygotes from products of frequencies of heterozygotes at the two loci:
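$${}_{jl}\omega^{uy}_{vz} = {}_{jl}P^{uy}_{vz} - {}_{j}P^{u.}_{v.}\,{}_{l}P^{.y}_{.z}. \tag{1}$$
The other three zygotic associations, ${}_{jl}\omega^{uy}_{uz}$, ${}_{jl}\omega^{uy}_{vy}$, and ${}_{jl}\omega^{uy}_{uy}$, can be similarly defined by substituting appropriate allele indexes in (1). It is easy to find the ranges of these zygotic associations. For example, the range of ${}_{jl}\omega^{uy}_{vz}$ is
$$-{}_{j}P^{u.}_{v.}\,{}_{l}P^{.y}_{.z} \le {}_{jl}\omega^{uy}_{vz} \le \min\left[{}_{j}P^{u.}_{v.}\left(1 - {}_{l}P^{.y}_{.z}\right),\ \left(1 - {}_{j}P^{u.}_{v.}\right){}_{l}P^{.y}_{.z}\right]. \tag{2a}$$
This dependence of the zygotic association on the marginal frequencies at single loci suggests a need to normalize the zygotic association ${}_{jl}\omega^{uy}_{vz}$,
$${}_{jl}\omega^{\prime\,uy}_{vz} = \begin{cases} \dfrac{{}_{jl}\omega^{uy}_{vz}}{{}_{j}P^{u.}_{v.}\,{}_{l}P^{.y}_{.z}}, & {}_{jl}\omega^{uy}_{vz} < 0 \\[2ex] \dfrac{{}_{jl}\omega^{uy}_{vz}}{\min\left[{}_{j}P^{u.}_{v.}\left(1 - {}_{l}P^{.y}_{.z}\right),\ \left(1 - {}_{j}P^{u.}_{v.}\right){}_{l}P^{.y}_{.z}\right]}, & {}_{jl}\omega^{uy}_{vz} > 0, \end{cases} \tag{2b}$$
which is analogous to Lewontin's (1964) normalized gametic disequilibrium.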
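As a minimal sketch of definitions (1), (2a), and (2b), the following Python function takes the joint frequency of one two-locus genotype together with its two single-locus marginal frequencies and returns the zygotic association, its normalized value, and its bounds. The function name, argument names, and the numbers used to exercise it are hypothetical and serve only as an illustration.

```python
def zygotic_association(p_joint, p_j, p_l):
    """Return (omega, omega_normalized, (lower, upper)) for one two-locus genotype.

    p_joint : joint frequency of the two-locus genotype
    p_j, p_l: single-locus genotypic frequencies at loci j and l (the marginals)
    """
    omega = p_joint - p_j * p_l                     # equation (1)
    lower = -p_j * p_l                              # lower bound in (2a)
    upper = min(p_j * (1 - p_l), (1 - p_j) * p_l)   # upper bound in (2a)
    if omega < 0:
        omega_norm = omega / (p_j * p_l)            # (2b), negative branch
    elif omega > 0:
        omega_norm = omega / upper                  # (2b), positive branch
    else:
        omega_norm = 0.0                            # no association
    return omega, omega_norm, (lower, upper)


# Illustrative values only: marginals 0.4 and 0.5, joint frequency 0.3.
omega, omega_norm, bounds = zygotic_association(0.3, 0.4, 0.5)
```

With these made-up inputs the association is half of its attainable maximum, so the normalized value lies halfway between 0 and 1, in the same way that Lewontin's normalized gametic disequilibrium rescales D.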
When summing over all alleles at loci j and l, we obtain an overall measure of zygotic associations (ω_{jl}) and the following relations:
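$$\omega_{jl} = \sum_{u=1}^{r}\sum_{y=1}^{s} {}_{jl}\omega^{uy}_{uy} = \sum\sum_{u \ne v}\sum\sum_{y \ne z} {}_{jl}\omega^{uy}_{vz} = -\sum_{u=1}^{r}\sum\sum_{y \ne z} {}_{jl}\omega^{uy}_{uz} = -\sum\sum_{u \ne v}\sum_{y=1}^{s} {}_{jl}\omega^{uy}_{vy}. \tag{3}$$
Thus, the sum $\sum_{u=1}^{r}\sum_{v=1}^{r}\sum_{y=1}^{s}\sum_{z=1}^{s} {}_{jl}P^{uy}_{vz} = 1$ can be expanded into four classes of genotypic frequencies: (i) frequency of being homozygous at both loci; (ii) homozygous at locus j and heterozygous at locus l; (iii) heterozygous at locus j and homozygous at locus l; and (iv) heterozygous at both loci,
$$\begin{aligned} \sum_{u=1}^{r}\sum_{y=1}^{s} {}_{jl}P^{uy}_{uy} &= (1 - H_j)(1 - H_l) + \omega_{jl} \\ \sum_{u=1}^{r}\sum\sum_{y \ne z} {}_{jl}P^{uy}_{uz} &= (1 - H_j)H_l - \omega_{jl} \\ \sum\sum_{u \ne v}\sum_{y=1}^{s} {}_{jl}P^{uy}_{vy} &= H_j(1 - H_l) - \omega_{jl} \\ \sum\sum_{u \ne v}\sum\sum_{y \ne z} {}_{jl}P^{uy}_{vz} &= H_j H_l + \omega_{jl}, \end{aligned} \tag{4}$$
where H_{j} and H_{l} are the population heterozygosities at loci j and l,
$$\begin{aligned} H_j &= \sum\sum_{u \ne v} {}_{j}P^{u.}_{v.} = 1 - \sum_{u=1}^{r} {}_{j}P^{u.}_{u.} = h_j - \sum_{u=1}^{r} {}_{j}D^{u.}_{u.} \\ H_l &= \sum\sum_{y \ne z} {}_{l}P^{.y}_{.z} = 1 - \sum_{y=1}^{s} {}_{l}P^{.y}_{.y} = h_l - \sum_{y=1}^{s} {}_{l}D^{.y}_{.y}, \end{aligned} \tag{5}$$
with $h_j\,(= 1 - \sum_{u=1}^{r} {}_{j}p_u^2)$ and ${}_{j}D^{u.}_{u.}\,(= -\sum_{v \ne u} {}_{j}D^{u.}_{v.})$, for example, being the gene diversity (or expected heterozygosity under Hardy-Weinberg equilibrium) and Hardy-Weinberg disequilibrium for allele u at locus j, respectively.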
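The partition in (4) can be checked numerically. The sketch below assumes that the two-locus genotypic frequencies are supplied as a dictionary keyed by ordered allele indices (u, v, y, z); that data layout, the function name, and the toy population are hypothetical choices made for illustration, not part of the original method.

```python
from itertools import product


def heterozygosity_classes(P, r, s):
    """Collapse two-locus genotypic frequencies into the four classes of (4).

    P maps (u, v, y, z) to the frequency of the ordered genotype formed by
    gametes j_u l_y and j_v l_z; r and s are the numbers of alleles at j and l.
    Returns (f, H_j, H_l, omega_jl), where f[(a, b)] is the frequency of being
    heterozygous (1) or homozygous (0) at locus j (a) and locus l (b).
    """
    f = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for u, v in product(range(r), repeat=2):
        for y, z in product(range(s), repeat=2):
            f[(int(u != v), int(y != z))] += P.get((u, v, y, z), 0.0)
    H_j = f[(1, 0)] + f[(1, 1)]        # heterozygosity at locus j
    H_l = f[(0, 1)] + f[(1, 1)]        # heterozygosity at locus l
    omega_jl = f[(1, 1)] - H_j * H_l   # last line of (4), rearranged
    return f, H_j, H_l, omega_jl


# Toy population: 40% double homozygotes, 60% double heterozygotes
# (the heterozygotes are split over the two gamete orderings).
P = {(0, 0, 0, 0): 0.4, (0, 1, 0, 1): 0.3, (1, 0, 1, 0): 0.3}
f, H_j, H_l, omega_jl = heterozygosity_classes(P, r=2, s=2)
```

In this toy case the heterozygous states of the two loci always coincide, so the zygotic association is strongly positive and the two mixed classes in (4) are empty.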
MULTILOCUS HETEROZYGOSITY
Number of heterozygous loci (K): When a diploid individual is randomly taken from the population (defined above), it can be either homozygous or heterozygous at a given locus. If all m loci are evaluated, then the random variable K is simply the number of heterozygous loci found in the randomly chosen diploid individual from the population. Thus, K is the sum of m indicator variables, $K = \sum_{j=1}^{m} X_j$, where X_{j} takes either 1 or 0, depending on whether the jth locus is heterozygous or homozygous. The probability that this locus is heterozygous is H_{j}, the population heterozygosity at the jth locus, and the probability that it is homozygous is 1 − H_{j}. K can take any integer value from 0 to m. If K = 0, then all m loci are homozygous; if, on the other hand, K = m, then all m loci are heterozygous.
Moments of K: The expected value of K is
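$$E(K) = E\left(\sum_{j=1}^{m} X_j\right) = \sum_{j=1}^{m} E(X_j) = \sum_{j=1}^{m} H_j, \tag{6}$$
and the second to fourth central moments are given by, letting x_{j} = X_{j} − E(X_{j}),
$$\sigma_K^2 = E[K - E(K)]^2 = \sum_{j=1}^{m} E(x_j^2) + 2\sum\sum_{j<l} E(x_j x_l), \tag{7a}$$
$$E[K - E(K)]^3 = \sum_{j=1}^{m} E(x_j^3) + 3\sum_{j=1}^{m}\sum_{l \ne j} E(x_j^2 x_l) + 6\sum\sum\sum_{j<l<o} E(x_j x_l x_o), \tag{7b}$$
and
$$E[K - E(K)]^4 = \sum_{j=1}^{m} E(x_j^4) + 4\sum_{j=1}^{m}\sum_{l \ne j} E(x_j^3 x_l) + 6\sum\sum_{j<l} E(x_j^2 x_l^2) + 12\sum_{j=1}^{m}\sum\sum_{\substack{l,o \ne j \\ l<o}} E(x_j^2 x_l x_o) + 24\sum\sum\sum\sum_{j<l<o<q} E(x_j x_l x_o x_q), \tag{7c}$$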
where, for example, $E(x_j^2 x_l)$ is the {21}th central mixed moment of variables X_{j} and X_{l} for loci j and l (Elandt-Johnson 1971, pp. 106–107). It is evident from (7a)–(7c) that evaluating the ith central moment of K requires a specification of joint genotypic frequencies for i loci, which include various associations for genes at up to i loci. For example, the variance (second central moment) of K is a function of single-locus heterozygosities and two-locus associations only and is independent of higher-order associations involving three or more loci. Similar arguments can be carried out for the third or higher central moments of K. If there is complete interlocus independence, (7a)–(7c) reduce to (3)–(5) of Brown et al. (1980) but we use heterozygosity {H_{j}} instead of gene diversity {h_{j}} to measure genetic variation at individual loci. When the population is in Hardy-Weinberg equilibrium, the heterozygosity equals the gene diversity (cf. Equation 5).
Variance of K: The variance of K as given in (7a) has two components, one being the sum of variances at individual loci and the other being the sum of covariances between pairs of loci,
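$$\sigma_K^2 = E[K - E(K)]^2 = \sum_{j=1}^{m} \mathrm{Var}(X_j) + 2\sum\sum_{j<l} \mathrm{Cov}(X_j, X_l), \tag{8}$$
where $\mathrm{Var}(X_j) = H_j - H_j^2$ and Cov(X_{j}, X_{l}) = ω_{jl} as computed using the joint probability distribution between loci j and l (Table 1). Thus,
$$\sigma_K^2 = \sum_{j=1}^{m} H_j - \sum_{j=1}^{m} H_j^2 + 2\sum\sum_{j<l} \omega_{jl}. \tag{9a}$$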
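The decomposition in (9a) can be illustrated by brute force. The following sketch assumes that the joint distribution of the heterozygosity indicators is available as a dictionary from 0/1 tuples to probabilities; the function name and the three-locus distribution are hypothetical. It computes the single-locus and multilocus components separately and compares their sum with the variance of K obtained directly.

```python
from itertools import combinations


def variance_components(dist):
    """Split Var(K) into the single-locus and multilocus parts of (9a).

    dist maps a tuple of 0/1 indicators (X_1, ..., X_m) to its probability.
    Returns (single_locus, multilocus, var_K).
    """
    m = len(next(iter(dist)))
    # Heterozygosity at each locus: P(X_j = 1)
    H = [sum(p for x, p in dist.items() if x[j]) for j in range(m)]
    # Pairwise zygotic associations: P(X_j = 1, X_l = 1) - H_j * H_l
    omega = {
        (j, l): sum(p for x, p in dist.items() if x[j] and x[l]) - H[j] * H[l]
        for j, l in combinations(range(m), 2)
    }
    single_locus = sum(h - h * h for h in H)
    multilocus = 2 * sum(omega.values())
    # Variance of K computed directly from its distribution
    mean_K = sum(p * sum(x) for x, p in dist.items())
    var_K = sum(p * (sum(x) - mean_K) ** 2 for x, p in dist.items())
    return single_locus, multilocus, var_K


# Hypothetical three-locus distribution in which loci 1 and 2 always agree.
dist = {(0, 0, 0): 0.3, (1, 1, 0): 0.2, (1, 1, 1): 0.3, (0, 0, 1): 0.2}
single_locus, multilocus, var_K = variance_components(dist)
```

Because the variance of a sum involves only pairwise covariances, the two components add up to the directly computed variance whatever higher-order associations the distribution contains.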
It is evident from (1) and (3) that $\omega_{jl} = \sum_{u=1}^{r}\sum_{y=1}^{s}\left[{}_{jl}P^{uy}_{uy} - {}_{j}P^{u.}_{u.}\,{}_{l}P^{.y}_{.y}\right]$, for example. Following Cockerham and Weir (1973) and Weir (1979), the two-locus frequencies $\{{}_{jl}P^{uy}_{uy}\}$ are expressed in terms of gene frequencies and various genic disequilibria. Given these results and those in (5) for {H_{j}}, $\sigma_K^2$ in (9a) can be rewritten as
$$\begin{aligned} \sigma_K^2 = {} & \sum_{j=1}^{m}\left[1 - \sum_{u=1}^{r}\left({}_{j}p_u^2 + {}_{j}D^{u.}_{u.}\right)\right] - \sum_{j=1}^{m}\left[1 - \sum_{u=1}^{r}\left({}_{j}p_u^2 + {}_{j}D^{u.}_{u.}\right)\right]^2 \\ & + 2\sum\sum_{j<l}\sum_{u=1}^{r}\sum_{y=1}^{s}\Big[2\,{}_{j}p_u\,{}_{jl}D^{uy}_{.y} + 2\,{}_{l}p_y\,{}_{jl}D^{uy}_{u.} + {}_{jl}D^{uy}_{uy} + 2\,{}_{j}p_u\,{}_{l}p_y\,{}_{jl}D^{uy}_{..} \\ & \qquad + 2\,{}_{j}p_u\,{}_{l}p_y\,{}_{jl}D^{u.}_{.y} + \left({}_{jl}D^{uy}_{..}\right)^2 + \left({}_{jl}D^{u.}_{.y}\right)^2\Big], \end{aligned} \tag{9b}$$
where each genic disequilibrium (D) is the deviation of a frequency from that based on random association of genes and accounting for any lower-order disequilibria. Definitions and properties of these disequilibria are detailed in many places (e.g., Weir 1979). Here it suffices to recognize that there are five types of disequilibria: (i) single-locus digenic disequilibria (i.e., Hardy-Weinberg disequilibria, ${}_{j}D^{u.}_{u.}$ and ${}_{l}D^{.y}_{.y}$); (ii) two-locus digenic disequilibria for gametic genes (i.e., gametic disequilibria, ${}_{jl}D^{uy}_{..}$); (iii) two-locus digenic disequilibria for nongametic genes (i.e., nongametic disequilibria, ${}_{jl}D^{u.}_{.y}$); (iv) trigenic disequilibria (${}_{jl}D^{uy}_{u.}$ and ${}_{jl}D^{uy}_{.y}$); and (v) quadrigenic disequilibria (${}_{jl}D^{uy}_{uy}$).
Table 2 lists six special cases of $\sigma_K^2$ as given in (9a) or (9b). The first two cases assume that there are no zygotic associations between pairs of loci for all m loci ($\sum_{j<l} \omega_{jl} = 0$), but case 1 further assumes Hardy-Weinberg equilibrium in the population. When genotypes (zygotes) result from random union of gametes, all nongametic disequilibria including Hardy-Weinberg disequilibria at all loci disappear (e.g., ${}_{j}D^{u.}_{u.} = {}_{jl}D^{u.}_{.y} = {}_{jl}D^{uy}_{.y} = {}_{jl}D^{uy}_{uy} = 0$). This leads to $\sigma_K^2(3)$ as given in case 3. $\sigma_K^2(3)$ was previously derived (cf. Equation 15 of Brown et al. 1980). Case 4 states a well-established fact that nonzero quadrigenic disequilibria occur under Hardy-Weinberg disequilibrium, even in a population that is in gametic equilibrium (e.g., Haldane 1949; Bennett and Binet 1956; Weir and Cockerham 1973).
The last two cases in Table 2 are not directly obtainable from (9a) or (9b), but rather serve to illustrate the difficulty of finding the maximum value of $\sigma_K^2$ because the upper bound for ${}_{jl}\omega^{uy}_{vz}$ in (2a) is not unique. Case 5 portrays a scenario where all m loci are absolutely associated (Clegg et al. 1976). The final case constructs a population of hypothetical multilocus zygotes with maximum variance of heterozygosity by ranking the {H_{j}} such that H_{1} > H_{2} > H_{3} > … > H_{m}. Similar expressions for these two cases were given by Brown et al. (1980) and Brown and Burdon (1983) for haploid and random mating populations.
NUMERICAL ANALYSIS
Relationships between zygotic associations and genic disequilibria: It is evident from (9a) and (9b) that the overall measure of zygotic associations between a pair of loci is a complex function of gametic, nongametic, trigenic, and quadrigenic disequilibria weighted appropriately by gene frequencies. The range of values for each of these disequilibria is defined by gene frequencies and disequilibria of lower orders. To further explore such intricate interrelationships among zygotic associations, gene frequencies, and various genic disequilibria, numerical calculations are carried out. For simplicity, let us assume that there are two alleles (1 and 2) at each of the two loci. Frequencies of the ten possible genotypes are denoted as $P^{11}_{11}$, $P^{11}_{12}$, $P^{12}_{12}$, $P^{11}_{21}$, $P^{11}_{22}$, $P^{12}_{21}$, $P^{12}_{22}$, $P^{21}_{21}$, $P^{21}_{22}$, and $P^{22}_{22}$, dropping the identifiers for the two loci. These genotypic frequencies are grouped into four classes (f_{00}, f_{01}, f_{10}, and f_{11}) based on whether genotypes at individual loci are homozygous or heterozygous (Table 3). The marginal totals for the individual loci are, respectively, f_{0.} = f_{00} + f_{01}, f_{1.} = f_{10} + f_{11}, f_{.0} = f_{00} + f_{10}, and f_{.1} = f_{01} + f_{11}. Thus, the overall measure of zygotic associations (ω) can be calculated using the relations given in Table 1. To gauge the relationships between zygotic associations, gene frequencies, and various disequilibria, the two-locus genotypic frequencies are expressed in terms of disequilibrium functions (cf. Table 6.1 of Weir and Cockerham 1989). All types of disequilibria except for Hardy-Weinberg disequilibria affect the zygotic associations because they are genic disequilibria between the two loci.
TABLE 1
Joint frequency distribution of indicator variables X_{j} and X_{l} in terms of heterozygosities (H_{j} and H_{l}) and zygotic associations (ω_{jl}) at loci j and l
We examine the effects of three genic disequilibria (gametic, trigenic, and quadrigenic disequilibria) on the distribution of zygotic associations. Since we assume equal gene frequencies (p) at both loci, the nongametic disequilibrium and gametic disequilibrium are equal, and so are the two trigenic disequilibria. To illustrate the three-way relationship, the effect of gene frequencies and gametic disequilibria on zygotic associations is depicted in Figure 1. In this case, the zygotic association is ω = 2(1 − 2p)^{2}D + 4D^{2}, where $D\,(= D^{11}_{..} = -D^{12}_{..} = -D^{21}_{..} = D^{22}_{..})$ is the gametic disequilibrium. The maximum zygotic association (ω = 0.25) is obtained at p = 0.5 and D = ±0.25, but while ω always increases with D > 0, it can be negative with D < 0 for some gene frequencies as shown in Figure 1. The zygotic association is affected little by trigenic disequilibria, but increases with positive and decreases with negative quadrigenic disequilibria (the 3D plots for trigenic and quadrigenic disequilibria are not presented).
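The expression ω = 2(1 − 2p)^{2}D + 4D^{2} can be reproduced numerically under one simplifying assumption: zygotes are formed by random union of gametes, so that gametic disequilibrium is the only nonzero disequilibrium (as in case 3 of Table 2). The sketch below, with hypothetical function names, builds the gametic frequencies, forms the double-heterozygote frequency, and subtracts the product of the single-locus heterozygosities.

```python
def omega_from_gametes(p, D):
    """Zygotic association for two biallelic loci under random union of gametes.

    p is the frequency of allele 1 at both loci and D the gametic disequilibrium.
    """
    g11 = p * p + D              # gamete carrying allele 1 at both loci
    g12 = p * (1 - p) - D
    g21 = p * (1 - p) - D
    g22 = (1 - p) ** 2 + D
    # Double heterozygotes arise from 11/22 and 12/21 gamete pairs (either order)
    f11 = 2 * g11 * g22 + 2 * g12 * g21
    H = 2 * p * (1 - p)          # single-locus heterozygosity under random union
    return f11 - H * H


def omega_closed_form(p, D):
    """Closed-form expression quoted in the text."""
    return 2 * (1 - 2 * p) ** 2 * D + 4 * D ** 2


# Grid of illustrative (p, D) pairs within the admissible range of D
pairs = [(0.5, 0.25), (0.5, -0.25), (0.3, 0.05), (0.3, -0.05), (0.1, 0.02)]
differences = [abs(omega_from_gametes(p, D) - omega_closed_form(p, D)) for p, D in pairs]
```

The pair (p, D) = (0.3, −0.05) gives a small negative ω, which matches the remark that ω can be negative with D < 0 at some gene frequencies.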
TABLE 2
Single-locus and multilocus components of variance of K, $\sigma_K^2$, under six special cases
Figure 1.
Dependence of zygotic associations on gene frequency and gametic disequilibrium.
Estimating zygotic associations from variance of multilocus heterozygosity: The variance of K in (9a) suggests that the average zygotic associations ($\bar{\omega}$) may be obtained by
$$\bar{\omega} = \sum\sum_{j<l} \omega_{jl} = \frac{1}{2}\left[\sigma_K^2 - \sigma_K^2(2)\right], \tag{10}$$
where $\sigma_K^2(2) = \sum_{j=1}^{m}(H_j - H_j^2)$ is for case 2 of Table 2. To estimate $\bar{\omega}$ from a sample of n diploid individuals with m polymorphic loci, one needs to estimate $\sigma_K^2$ and single-locus heterozygosities, {H_{j}}. There are several discussions of procedure for estimating these parameters from a sample taken from a random mating population or haploid population (e.g., Brown et al. 1980; Brown and Burdon 1983; Chakraborty 1984). Essentially the same estimation procedure is used in the following simulation study.
The nonequilibrium population for two loci each with two alleles is constructed using the fact that each two-locus genotypic frequency can be written as a sum of the product of single-locus frequencies and its zygotic association (Table 4). For a given gene frequency (p) at a locus, Hardy-Weinberg disequilibrium ($D = D^{1.}_{1.} = -D^{1.}_{2.} = -D^{2.}_{1.} = D^{2.}_{2.}$) is bounded by
$$\max\left[-p^2, -(1 - p)^2\right] \le D \le p(1 - p) \tag{11}$$
so that the frequencies of the three genotypes at this locus are completely described by p and D: $P^{1.}_{1.} = p^2 + D$, $P^{1.}_{2.} = 2p(1 - p) - 2D$, and $P^{2.}_{2.} = (1 - p)^2 + D$. We simulate three D values: zero and half the maximum and minimum possible values as given in (11). While bounds of nine individual zygotic associations can be computed from the single-locus genotypic frequencies using (2a), we choose to compute only the four associations ($\omega^{11}_{11}$, $\omega^{21}_{21}$, $\omega^{12}_{12}$, and $\omega^{22}_{22}$) since the remaining five ($\omega^{11}_{12}$, $\omega^{11}_{21}$, $\omega^{11}_{22}$, $\omega^{12}_{22}$, and $\omega^{21}_{22}$) are simply the functions of those four associations as explained in Table 4. For simplicity, a further assumption in our simulation is that only one zygotic association is present in the population and the other three are zero. Under this assumption, the bounds of these four zygotic associations are
$$\begin{aligned} -\min\left(P^{1.}_{1.}P^{.1}_{.1},\ P^{1.}_{2.}P^{.1}_{.2}\right) &\le \omega^{11}_{11} \le \min\left(P^{1.}_{1.}P^{.1}_{.2},\ P^{1.}_{2.}P^{.1}_{.1}\right) \\ -\min\left(P^{1.}_{1.}P^{.2}_{.2},\ P^{1.}_{2.}P^{.1}_{.2}\right) &\le \omega^{12}_{12} \le \min\left(P^{1.}_{1.}P^{.1}_{.2},\ P^{1.}_{2.}P^{.2}_{.2}\right) \\ -\min\left(P^{2.}_{2.}P^{.1}_{.1},\ P^{1.}_{2.}P^{.1}_{.2}\right) &\le \omega^{21}_{21} \le \min\left(P^{2.}_{2.}P^{.1}_{.2},\ P^{1.}_{2.}P^{.1}_{.1}\right) \\ -\min\left(P^{2.}_{2.}P^{.2}_{.2},\ P^{1.}_{2.}P^{.1}_{.2}\right) &\le \omega^{22}_{22} \le \min\left(P^{2.}_{2.}P^{.1}_{.2},\ P^{1.}_{2.}P^{.2}_{.2}\right). \end{aligned} \tag{12}$$
We simulate three values of zygotic association: zero and half the maximum and minimum possible values as given in (12).
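The construction of the simulated populations can be sketched as follows. The code evaluates the bounds in (11), the three single-locus genotypic frequencies, and the first line of (12) for the association between the two 11 homozygotes. Function names are hypothetical, and the joint genotypic frequencies of Table 4 are not reproduced here.

```python
def hw_bounds(p):
    """Lower and upper bounds of Hardy-Weinberg disequilibrium D, equation (11)."""
    return max(-p * p, -(1 - p) ** 2), p * (1 - p)


def genotype_freqs(p, D):
    """Frequencies of genotypes 11, 12, and 22 at one locus, given p and D."""
    return p * p + D, 2 * p * (1 - p) - 2 * D, (1 - p) ** 2 + D


def omega_1111_bounds(P_j, P_l):
    """Bounds of the 11/11 zygotic association, first line of (12).

    P_j and P_l are (P11, P12, P22) tuples for loci j and l.
    """
    lower = -min(P_j[0] * P_l[0], P_j[1] * P_l[1])
    upper = min(P_j[0] * P_l[1], P_j[1] * P_l[0])
    return lower, upper


p = 0.5
d_min, d_max = hw_bounds(p)
simulated_D = [0.0, d_max / 2, d_min / 2]   # zero and half of each extreme
for D in simulated_D:
    P = genotype_freqs(p, D)                # same p and D assumed at both loci
    lower, upper = omega_1111_bounds(P, P)
```

For p = 0.5 the two nonzero D values are ±0.125, and in both cases the upper bound of the 11/11 association evaluates to 0.09375.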
TABLE 3
Frequencies of homozygotes and heterozygotes in terms of frequencies of 10 possible genotypes at two loci (j and l)
From each of 27 constructed populations [3 gene frequencies (p = 0.1, 0.3, and 0.5) × 3 values of Hardy-Weinberg disequilibrium × 3 values of zygotic association], 10,000 replicate samples of size n = 30 or n = 100 are drawn. For a sample of n diploid individuals, let $\tilde{X}_{tj}$ be 1 or 0 according to whether the tth individual in the sample is heterozygous or homozygous at the jth locus. Then the number of heterozygous loci for this individual is $\tilde{K}_t = \sum_{j=1}^{m} \tilde{X}_{tj}$. We compute the sample mean as $\tilde{K} = \sum_{t=1}^{n} \tilde{K}_t / n$ and the sample variance as
$$s_K^2 = \frac{1}{n}\sum_{t=1}^{n}\left(\tilde{K}_t - \tilde{K}\right)^2. \tag{13a}$$
Using various expectations of indicators defined for the sample (Weir et al. 1990; Weir 1996, pp. 142–144), it is easily seen that while the sample mean is an unbiased estimator of the mean of K, $[E(\tilde{K}) = E(K)]$, the sample variance (13a) is not an unbiased estimator of $\sigma_K^2$, i.e., $E(s_K^2) = [(n - 1)/n]\sigma_K^2$, because we have divided by n rather than the customary (n − 1) in computing (13a). Clearly, the bias should be negligible unless the sample size is very small.
Under the null hypothesis of no zygotic association (H_{0}), we estimate $\sigma_K^2$ by computing the sample variance, $s_K^2(2)$, as the sum of sample variances for m loci $\{s_j^2\}$,
$$s_K^2(2) = \sum_{j=1}^{m} s_j^2 = \sum_{j=1}^{m}\left[\frac{1}{n}\sum_{t=1}^{n}\left(\tilde{X}_{tj} - \tilde{X}_j\right)^2\right], \tag{13b}$$
where $\tilde{X}_j = \sum_{t=1}^{n} \tilde{X}_{tj} / n$. While the estimator $s_K^2(2)$ in (13b) is slightly biased for the same reason as in computing $s_K^2$, its expectation and sampling variance can be readily calculated by inserting the appropriate results in (7) under interlocus independence (see also Equations 3, 4 and 5 of Brown et al. 1980) into the well-known formulas of Kendall and Stuart (1977, Equations 10.8 and 10.9),
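$$E(s_K^2 \mid H_0) = \sum_j H_j - \sum_j H_j^2 \tag{14a}$$
and
$$\mathrm{Var}(s_K^2 \mid H_0) = \frac{1}{n}\left\{\sum_j H_j - 7\sum_j H_j^2 + 12\sum_j H_j^3 - 6\sum_j H_j^4 + 2\left[\sum_j H_j - \sum_j H_j^2\right]^2\right\}. \tag{14b}$$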
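The estimators in (13a) and (13b), the sample analogue of (10), and the null sampling variance in (14b) can be sketched in a few lines. The data layout (a list of individuals, each a list of 0/1 heterozygosity indicators) and all function names are hypothetical. The last function assumes that (14b) corresponds to the leading-order approximation (μ4 − μ2²)/n for the variance of a sample variance and evaluates that quantity by enumerating independent loci, which gives an independent check of the closed form.

```python
from itertools import product


def sample_statistics(X):
    """Return (s2_K, s2_K2, omega_bar, H_hat) for an n x m 0/1 indicator table X."""
    n, m = len(X), len(X[0])
    K = [sum(row) for row in X]                            # heterozygous loci per individual
    K_bar = sum(K) / n
    s2_K = sum((k - K_bar) ** 2 for k in K) / n            # (13a), divisor n
    H_hat = [sum(row[j] for row in X) / n for j in range(m)]
    s2_K2 = sum(h - h * h for h in H_hat)                  # (13b): sum of per-locus variances
    omega_bar = 0.5 * (s2_K - s2_K2)                       # sample analogue of (10)
    return s2_K, s2_K2, omega_bar, H_hat


def var_s2_null(H, n):
    """Closed-form sampling variance of s_K^2 under H0, equation (14b)."""
    s1 = sum(H)
    s2 = sum(h ** 2 for h in H)
    s3 = sum(h ** 3 for h in H)
    s4 = sum(h ** 4 for h in H)
    return (s1 - 7 * s2 + 12 * s3 - 6 * s4 + 2 * (s1 - s2) ** 2) / n


def var_s2_null_enumerated(H, n):
    """(mu4 - mu2^2) / n for K as a sum of independent Bernoulli(H_j) indicators."""
    mean = sum(H)
    mu2 = mu4 = 0.0
    for x in product((0, 1), repeat=len(H)):
        prob = 1.0
        for xj, h in zip(x, H):
            prob *= h if xj else 1 - h
        dev = sum(x) - mean
        mu2 += prob * dev ** 2
        mu4 += prob * dev ** 4
    return (mu4 - mu2 ** 2) / n


# Tiny made-up sample: two loci that are always jointly heterozygous or homozygous.
X = [[1, 1], [1, 1], [0, 0], [0, 0]]
s2_K, s2_K2, omega_bar, H_hat = sample_statistics(X)
```

For 0/1 indicators the per-locus sample variance with divisor n reduces to the observed heterozygosity times its complement, which is why (13b) can be computed from the locus frequencies alone.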
Two one-tailed tests are used to determine if the sample variance $s_K^2$ is significantly greater than its expectation under zero zygotic association, $\sigma_K^2(2)$. In the first test, assuming that the distribution of K under H_{0} approximates a normal distribution, the statistic
$$X^2_{s_K^2} = n s_K^2 / \sigma_K^2(2) \tag{15}$$
has a χ^{2} distribution with n d.f., where n is the number of diploid individuals in the sample and $\sigma_K^2(2)$ is estimated using (13b) [The chi-square test (15) would have d.f. = (n − 1) if the customary (n − 1) is used to compute $s_K^2$]. The null hypothesis (H_{0}) is rejected if $X^2_{s_K^2}$ exceeds 43.77 or 124.34, the upper-tailed 5% critical value of χ^{2} distribution with d.f. = 30 or d.f. = 100, respectively. Manly (1985, p. 331) defined a similar statistic for haploid data, but because $s_K^2$ was computed from a sample of n^{2} “dependent” gamete pairs (comparisons) for n haplotypes (Brown et al. 1980), the appropriate degrees of freedom for the chi-square test are yet to be determined. Furthermore, Haubold et al. (1998) recently provided a more appropriate formula to estimate $\sigma_K^2(2)$ for haploid data with an account of the interdependence between the gamete pairs. In the second test, assuming that the sampling distribution of $s_K^2$ approximates normality, Brown et al. (1980) suggested a test criterion of rejecting H_{0} if $s_K^2 > L$, the upper-tailed 5% critical value for $s_K^2$. In our simulation, L is estimated by
$$L \cong s_K^2(2) + 1.645\sqrt{\mathrm{Var}(s_K^2 \mid H_0)}. \tag{16}$$
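A minimal sketch of the two decision rules follows, assuming the inputs have already been computed from a sample. The chi-square critical value is passed in by the caller (43.77 for n = 30 and 124.34 for n = 100, as quoted in the text) so that no statistical library is needed. The function name, the dictionary keys, and the numerical inputs are hypothetical.

```python
import math


def zygotic_association_tests(s2_K, s2_K2, var_s2_H0, n, chi2_critical):
    """Apply the chi-square test (15) and the normal-approximation test (16).

    s2_K          : sample variance of K, equation (13a)
    s2_K2         : null estimate of the variance, equation (13b)
    var_s2_H0     : sampling variance of s2_K under H0, equation (14b)
    n             : number of diploid individuals
    chi2_critical : upper 5% point of chi-square with n d.f., supplied by the caller
    """
    chi2_stat = n * s2_K / s2_K2                  # equation (15)
    L = s2_K2 + 1.645 * math.sqrt(var_s2_H0)      # equation (16), one-tailed 5% limit
    return {
        "chi2": chi2_stat,
        "reject_chi2": chi2_stat > chi2_critical,
        "L": L,
        "reject_L": s2_K > L,
    }


# Made-up inputs: observed variance twice its null expectation, n = 30.
result = zygotic_association_tests(
    s2_K=1.0, s2_K2=0.5, var_s2_H0=0.25 / 30, n=30, chi2_critical=43.77
)
```

Both rules are one-tailed, so an observed variance below its null expectation never leads to rejection. This fits the later finding that negative zygotic associations go undetected.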
Statistical properties of sample zygotic association and $s_K^2$ are examined for the simulated samples of sizes n = 30 and n = 100. Despite the slight downward bias in the mean values of $s_K^2(2)$ by a factor of (n − 1)/n, its observed standard deviations are very close to their expected values even for n = 30 (Table 5), suggesting that (14b) is an adequate approximation to the sampling variance of $s_K^2(2)$. Table 5 also shows that Hardy-Weinberg disequilibrium (D) affects $\sigma_K^2(2)$ in an interesting way. Avoidance of mating between relatives (D < 0) increases heterozygosity whereas inbreeding (D > 0) decreases it. Thus, $\sigma_K^2(2)$ is expected to be greater for D < 0 or smaller for D > 0 than that for the equilibrium population (D = 0). However, this is not true when the gene frequency approaches p = 0.5. At p = 0.5, the maximum $\sigma_K^2(2)$ is obtained only when the population is in Hardy-Weinberg equilibrium (D = 0) and any change in heterozygosity either due to avoidance of mating between relatives or to inbreeding would result in a smaller $\sigma_K^2(2)$. Negligible skewness and kurtosis suggest that the normality of the sampling distribution of $s_K^2(2)$ required for the test criterion (16) is probably adequate even though our simulation results are limited to two loci only. As expected, the estimates of zygotic association in all simulated populations are zero or very close to zero. The increase of sample size from n = 30 to n = 100 (not presented) has improved the results only slightly.
TABLE 4
Joint frequencies of nine genotypes at loci j and l in terms of their single-locus genotypic frequencies and zygotic associations
TABLE 5
Mean, standard deviation, skewness, and kurtosis of $s_K^2(2)$ under zero zygotic association for two gene frequencies (p) and three Hardy-Weinberg disequilibria (D)
The means of $\tilde{\omega}^{11}_{11}$ are close to their respective theoretical values and the sampling variances of $\tilde{\omega}^{11}_{11}$ increase with increasing gene frequencies at n = 30 (Table 6). The increase of sample sizes from 30 to 100 reduces the sampling variances and downward bias of estimated $\sigma_K^2$ (results not presented for n = 100). The $X^2_{s_K^2}$ test statistics are close to their expected values of 30.0 for n = 30 and 100.0 for n = 100 when zygotic association is small at low gene frequencies, but fluctuate with large positive or negative zygotic associations at more intermediate gene frequencies. The standard deviations of the chi-square statistics are also close to their expectations of 7.75 for n = 30 and 14.14 for n = 100 in most cases, but sizable discrepancies occur in the cases of large positive or negative zygotic associations. Similar patterns of sampling behaviors and properties are revealed for $\omega^{12}_{12} = \omega^{21}_{21}$ and $\tilde{\omega}^{22}_{22}$.
Judging from the estimated powers of the two test statistics, the zygotic associations are detectable only when they are positive and when the gene frequencies are close to 0.5 (Table 6). Figure 2 further shows that the powers increase with the large, positive zygotic associations and that zero powers are obtained for the large, negative zygotic associations when p = 0.5 and D = 0.125. Similar patterns are observed for other values of p and D. It is of interest to note that, unlike the nonlinear relationship in Figure 2, a linear relationship of zygotic associations with the variances of K or with chi-square values is observed (results not shown). The power should be 0.05 for the cases of no zygotic associations as a 5% significance level is used to reject these null hypotheses. According to this criterion, both tests perform reasonably well. While test (16) is slightly more powerful than test (15) in most cases, the two tests essentially provide the same amount of power across the range of zygotic associations. The increase of sample size from 30 to 100 results in an increase in the power of detecting the zygotic associations. Hardy-Weinberg disequilibrium (D) has little effect on the detection. For example, with p = 0.5, $\omega^{11}_{11} = 0.0938$ for both D = −0.125 and D = 0.125. The power estimates with n = 30 are 0.810 for D = −0.125 and 0.816 for D = 0.125, according to the chi-square test criterion (15).
DISCUSSION
A wide range of molecular data, from isozymes to newly developed microsatellite markers, is now available for population genetic analysis. The average heterozygosity across all the loci scored has been routinely used to summarize the molecular data at hand. In the presence of nonrandom associations within and among loci, there is a need to characterize various genic disequilibria (e.g., Cockerham and Weir 1973; Weir 1979; Weir and Cockerham 1989), but the number of disequilibria for multiple alleles and many loci quickly increases beyond comprehension. This article has expanded the earlier concept of zygotic associations to effectively summarize those disequilibria within and between pairs of loci [cf. (9a) and (9b)]. The measure of zygotic associations shares most of the properties of gametic disequilibrium, but at the zygote level (Table 4). Further, we have developed a method to compute a set of summary statistics that are used to characterize and estimate the multilocus associations in the nonequilibrium population. This development substantiates and complements the earlier development of Brown et al. (1980) for a Hardy-Weinberg equilibrium population in which the multilocus associations are the function of only one type of two-locus disequilibria, gametic disequilibria. For the equilibrium population, our method reduces to that of Brown et al. (1980) because, in this case, the gametic frequencies can be inferred from the zygotic frequencies at individual loci. However, our method should be of more general use in elucidating the multilocus organizations in nonequilibrium and equilibrium diploid populations. For haploid data such as those from genetic assessment of bacterial or inbred plant populations, the procedures of Brown et al. (1980) and Haubold et al. (1998) should be used to construct the distribution of K through comparing all possible pairs of gametes in a population and to estimate different moments of K with an account of the interdependence between the gamete pairs for detecting multilocus associations.
Figure 2.
The relationships between zygotic associations and the estimated powers of two tests as given in Equation 15 (dashed lines) and Equation 16 (solid lines). Each point represents the power estimated from 10,000 simulated samples of sizes n = 30 (●) and n = 100 (▴).
Our method may be particularly useful for characterizing and estimating the multilocus associations in hybrid populations. Because these populations arise from the mixing of two or more distinct gene pools, strong Wahlund effect and selection against heterozygotes may frequently occur, thereby maintaining Hardy-Weinberg disequilibrium and zygotic associations for a long time. Given that alleles derived from the same parental populations or species tend to cluster together in the same individuals, the majority of pairwise zygotic associations should be positive, leading to an easier detection of the multilocus associations from our summary statistics. Barton and Gale (1993) have recently proposed a somewhat different method of estimating the multilocus associations from the variance of hybrid index for a hybrid population arising from the mixing of two parental gene pools. While their method is based on essentially the same strategy to summarize the multilocus data, it is of limited value in (i) detecting multilocus associations for hybrid populations arising from the mixing of more than two parental gene pools; (ii) using unfixed but informative markers for the multilocus analysis; and (iii) analyzing hybrid populations that are not in Hardy-Weinberg equilibrium.
The zygotic associations and multilocus statistics presented here are for a single nonequilibrium population. When many nonequilibrium populations are studied, the total variance of K may be partitioned into components due to the single-locus and multilocus effects of population subdivision. The method of partitioning the total variance of K among several haploid populations by Brown and Feldman (1981) can be conceivably extended to account for population subdivision related to Hardy-Weinberg disequilibrium and zygotic associations in nonequilibrium populations. However, the analysis of variance (ANOVA) of the K values may be considered for the multilocus diploid data with a complex hierarchical population structure. In this case, the ANOVA procedure of estimating hierarchical F-statistics by Yang (1998) may be extended to assess the effects of multilocus population subdivision at different levels of hierarchy.
The two sample sizes (n = 30 and n = 100) in our simulation probably represent the two ends of what may be used in most experimental population genetic studies for measuring multilocus heterozygosity. The sample of n = 30 appears to provide sufficient power to detect zygotic associations, agreeing with Brown et al.'s (1980) assertion that the multilocus statistics can be used with relatively small samples (in the order of 30). The increase in the sample size to n = 100 results only in slight to moderate reduction in the sampling variance of $s_K^2$ and an increase in the powers of detecting the multilocus associations (Figure 2). We have not simulated samples of very small sizes that may occur in practice. With small sample sizes, the validity of the assumed distributions for the sample variance of K as required by tests (15) and (16) may not be warranted. In this case, the recently developed permutation test (Guo and Thompson 1992) may be a preferred alternative to detect zygotic associations because it requires no assumptions about the distributions of multilocus statistics. In the permutation test, the null distribution [i.e., the distribution of $s_K^2(2)$] is generated by randomly shuffling the single-locus zygotes among individuals in the sample. This is very similar to the randomization scheme described by Haubold et al. (1998) for haploid data, but it retains Hardy-Weinberg disequilibrium in the zygotes. However, the permutation test can be computationally intensive, particularly when the sample size is large. Thus, tests (15) and (16) should be useful for analyzing samples of moderate to large sizes.
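The shuffling scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Guo and Thompson (1992): each locus is represented only by its 0/1 heterozygosity indicator (shuffling whole single-locus genotypes moves the indicator with them, so K is affected identically), the one-tailed p-value counts permuted variances at least as large as the observed one, and the function names, the number of permutations, and the seed are hypothetical choices.

```python
import random


def variance_of_K(X):
    """Sample variance of the number of heterozygous loci, divisor n as in (13a)."""
    n = len(X)
    K = [sum(row) for row in X]
    K_bar = sum(K) / n
    return sum((k - K_bar) ** 2 for k in K) / n


def permutation_test(X, n_perm=999, seed=0):
    """One-tailed permutation p-value for excess variance of K.

    X is a list of n individuals, each a list of m 0/1 heterozygosity indicators.
    Each locus (column) is shuffled among individuals independently, which keeps
    the single-locus frequencies fixed while breaking associations between loci.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    observed = variance_of_K(X)
    columns = [[row[j] for row in X] for j in range(m)]
    exceed = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(col, n) for col in columns]   # permuted copy of each locus
        X_perm = [[shuffled[j][t] for j in range(m)] for t in range(n)]
        if variance_of_K(X_perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Made-up sample of 40 individuals in which all four loci are jointly
# heterozygous or jointly homozygous (extreme positive association).
X = [[1, 1, 1, 1]] * 20 + [[0, 0, 0, 0]] * 20
p_value = permutation_test(X, n_perm=199, seed=1)
```

Because every locus is permuted as a whole, the per-locus heterozygosities, and therefore any Hardy-Weinberg disequilibrium, are unchanged in each permuted sample. The cost grows with both the number of permutations and the sample size.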
Acknowledgments
I thank three reviewers for comments and constructive criticisms on earlier versions of the manuscript. This research has been supported in part by the Natural Sciences and Engineering Research Council of Canada grant OGP0183983.
- Received July 14, 1999.
- Accepted April 3, 2000.
- Copyright © 2000 by the Genetics Society of America