Effects of Population Structure and Admixture on Exact Tests for Association Between Loci
B. Law^a, J. S. Buckleton^b, C. M. Triggs^a, and B. S. Weir^c
a Department of Statistics, University of Auckland, Auckland 1020, New Zealand,
b Institute of Environmental and Scientific Research, Auckland 1004, New Zealand
c Program in Statistical Genetics, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-7566
Corresponding author: B. S. Weir, North Carolina State University, Raleigh, NC 27695-7566. E-mail: weir{at}stat.ncsu.edu
Communicating editor: A. H. D. BROWN
| ABSTRACT |
|---|
The probability of multilocus genotype counts conditional on allelic counts and on allelic independence provides a test statistic for independence within and between loci. As the number of loci increases and each sampled genotype becomes unique, the conditional probability becomes a function of total heterozygosity. In that case, it does not address between-locus dependence directly but only indirectly through detection of the Wahlund effect. Moreover, the test will reject the hypothesis of allelic independence only for small values of heterozygosity. Low heterozygosity is expected for population subdivision but not for population admixture. The test may therefore be inappropriate for admixed populations. If individuals with parents in two different populations are always considered to belong to one of the populations, then heterozygosity is increased in that population and the exact test should not be used for sparse data sets from that population. If such a case is suspected, then alternative testing strategies are suggested.
In forensic science, multilocus genotype frequencies are often estimated as products of allele frequencies. Although this is expected to be appropriate for large random-mating populations, especially for unlinked loci, it is customary to check for evidence of allelic dependencies before invoking the product rule. Exact tests have been shown to have satisfactory power, at least in comparison to alternative testing strategies.
| STATISTICAL TESTING PROCEDURES |
|---|
Exact test:
Suppose that the $l$th of $L$ loci has alleles $A_{li}$, where the range of $i$ is left arbitrary but is understood to depend on the locus index, $l$. Then the product rule expresses the frequency of the $L$-locus genotype, $A_{1i}A_{1j}\,A_{2i}A_{2j} \cdots A_{Li}A_{Lj}$, as the product of the frequencies of all $2L$ constituent alleles, along with a factor of 2 for each locus that is heterozygous. If this genotype is regarded as being the $g$th of all possible $L$-locus genotypes, the product-rule frequency is

$$P_g = \prod_{l=1}^{L} 2^{h_{gl}}\, p_{li}\, p_{lj} \tag{1}$$
where the heterozygosity $h_{gl}$ at the $l$th locus is equal to one if $A_{li}$, $A_{lj}$ are different alleles in the $g$th genotype and is zero if they are the same. Summing over loci gives $h_g = \sum_l h_{gl}$ as the number of loci heterozygous in the $g$th genotype. The population frequency of allele $A_{li}$ is $p_{li}$. Note that the same indices $i$, $j$ are used for different loci to simplify notation, but they are not implied to be equal over loci. There is an implicit assumption that $i \le j$ at each locus to prevent heterozygotes being counted twice.
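With random sampling, the probability of a sample of size $n$ having $n_g$ copies of the $g$th genotype ($n = \sum_g n_g$) is

$$\Pr(\{n_g\}) = \frac{n!}{\prod_g n_g!} \prod_g P_g^{\,n_g}$$

where $P_g$ is the population frequency of the $g$th genotype.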
Under the hypothesis of complete allelic independence, as expressed by the product rule in Equation 1, the alleles sampled at each locus have independent multinomial distributions. If the sample contains $n_{li}$ copies of allele $A_{li}$,

$$\Pr(\{n_{li}\}) = \prod_{l=1}^{L} \frac{(2n)!}{\prod_i n_{li}!} \prod_i p_{li}^{\,n_{li}}$$

($2n = \sum_i n_{li}$). We assume that every locus is scored in every individual.
The conditional probability $P_c$ of the genotype counts given the allelic counts, if Equation 1 holds, is therefore

$$P_c = \frac{n!\; 2^{h} \prod_{l} \prod_i n_{li}!}{[(2n)!]^{L}\, \prod_g n_g!}$$
The quantity $h = \sum_g n_g h_g$ is the total number of heterozygous loci in the sample and lies between 0 and $nL$. The unknown allele probabilities have canceled out and, if the hypothesis of independence is false, small values of $P_c$ will be observed. To carry out an exact test all possible arrays of genotype counts with the same allelic counts as the observed data are examined. The significance level for the test is the sum over all arrays of the values of the conditional probability that are as small or smaller than the value of $P_c$ for the data.
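The conditional probability can be evaluated directly from a list of genotypes. The following is a minimal sketch, not the authors' implementation: the function name `log_pc` and the data layout (each individual is a tuple of loci, each locus a pair of allele labels) are assumptions made for illustration. It works on the log scale because the factorials overflow quickly.

```python
from collections import Counter
from math import lgamma, log, exp


def log_factorial(k):
    """ln(k!) via the log-gamma function."""
    return lgamma(k + 1)


def log_pc(sample):
    """ln P_c for a sample of multilocus genotypes.

    sample: list of individuals; each individual is a tuple of loci and
    each locus is a pair of allele labels (hypothetical layout).
    """
    n = len(sample)
    n_loci = len(sample[0])
    # Sort each allele pair so (a, b) and (b, a) count as one genotype.
    canon = [tuple(tuple(sorted(locus)) for locus in g) for g in sample]
    genotype_counts = Counter(canon)
    # h: total number of heterozygous loci over all individuals.
    h = sum(1 for g in canon for (a, b) in g if a != b)
    value = log_factorial(n) + h * log(2) - n_loci * log_factorial(2 * n)
    for l in range(n_loci):
        allele_counts = Counter(a for g in canon for a in g[l])
        value += sum(log_factorial(c) for c in allele_counts.values())
    value -= sum(log_factorial(c) for c in genotype_counts.values())
    return value


# One locus, two individuals, allele counts A:2, B:2.
two_homozygotes = [(("A", "A"),), (("B", "B"),)]
two_heterozygotes = [(("A", "B"),), (("B", "A"),)]
print(exp(log_pc(two_homozygotes)), exp(log_pc(two_heterozygotes)))
```

For these allele counts the two arrays shown are the only possible ones, so their conditional probabilities sum to one. An exact test would enumerate (or sample by permutation) all such arrays and add up the probabilities no larger than the observed one.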
As the number of alleles at a locus and the number of loci increase, it becomes more and more likely that each multilocus genotype in a sample will occur once only. The product over $g$ of the factorials $n_g!$ therefore tends to one, and this is unlikely to be changed by permuting alleles. Permutation leaves the allele counts $n_{li}$ unchanged but can change the number of heterozygotes in the sample at each locus. In other words, the probability $P_c$ for an array of genotype counts becomes proportional to $2^h$. Evidently the exact test tends toward a test for heterozygosity, but in the sense that only arrays with small values of $h$ can lead to rejection of the hypothesis implied by Equation 1. The test is now a one-sided test for total heterozygosity. The number of heterozygotes, $h$, has additive contributions from each locus and so has no between-locus component, although its distribution and hence its variance are affected by between-locus dependencies. It retains an indirect ability to detect some between-locus dependencies by its ability to detect the Wahlund effect as shown below.
Goodness-of-fit tests:
The original aim of testing for independence over all alleles with the exact test is lost when sampled genotypes are unique as the conditional probability does not involve any direct information on the relationships between the loci observed in the data. Would a goodness-of-fit test for independence over genotypes do any better? When every genotype in a sample is unique, $n_g = 1$, such a chi-square goodness-of-fit statistic becomes

$$X^2 = \sum_{g:\, n_g = 1} \frac{1}{n_{ge}} - n$$

where $n_{ge}$ is the expected count for genotype $g$, under the hypothesis of allelic independence, if it is present in the sample. This statistic does not depend on the sample heterozygosity and is expected to increase with the number of loci if only because the $n_{ge}$ values will decrease as the number of loci increases. Of course there is the usual problem of spurious significance values with chi-square tests as the expected counts become small.
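The form the statistic takes for unique genotypes can be checked in a few lines. This is a short derivation sketch, assuming the statistic is the usual Pearson sum taken over all possible genotypes and that the expected counts sum to the sample size:

```latex
\begin{align*}
X^2 &= \sum_g \frac{(n_g - n_{ge})^2}{n_{ge}}
     = \sum_g \frac{n_g^2}{n_{ge}} - 2\sum_g n_g + \sum_g n_{ge} \\
    &= \sum_{g:\, n_g = 1} \frac{1}{n_{ge}} - n ,
    \qquad \text{since } n_g \in \{0, 1\}
    \text{ and } \sum_g n_g = \sum_g n_{ge} = n .
\end{align*}
```

Only genotypes present in the sample contribute to the first sum, which is why the expected counts of unobserved genotypes never need to be evaluated.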
Heterozygosity test:
We have shown that for sparse data sets, with each sampled genotype being unique, allelic independence between loci is not addressed by the exact test. That test can be regarded instead as testing the hypothesis that the total heterozygosity $h$ has the value expected under independence against the alternative that $h$ is less than expected. We write the test as $T_h$. Alternatively, a one-sided test can be made against the alternative that $h$ is greater than expected. When this distinction is needed the tests will be denoted by $T_h^-$ and $T_h^+$, and they can be performed by noting whether the observed $h$ value lies in the lower or upper tail of the distribution of values found by permuting alleles at each locus.
At each individual locus, the hypothesis that heterozygosity has the value expected under allelic independence (Hardy-Weinberg equilibrium) can be also addressed by a comparison of observed and expected heterozygosities, resulting in a chi-square statistic with 1 d.f. This test is two-sided as both large and small values of heterozygosity can lead to rejection. Under the hypothesis of independent loci the single-locus statistics can be added together. Alternatively, allelic permutation can be carried out separately for each locus and empirical significance levels generated for tests based on h. In this case the empirical significance levels of each of the tests would have to be adjusted because of the multiple comparisons.
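The one-sided heterozygosity tests described above can be sketched with a simple permutation routine. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the data layout (each individual is a tuple of loci, each locus a pair of allele labels), and the choice to report tail proportions that count ties are all assumptions.

```python
import random


def total_heterozygosity(sample):
    """h: number of heterozygous loci summed over all individuals."""
    return sum(1 for g in sample for (a, b) in g if a != b)


def permute_alleles(sample, rng):
    """Shuffle the 2n alleles separately at each locus and re-pair them."""
    n = len(sample)
    n_loci = len(sample[0])
    columns = []
    for l in range(n_loci):
        alleles = [a for g in sample for a in g[l]]
        rng.shuffle(alleles)
        columns.append([(alleles[2 * k], alleles[2 * k + 1]) for k in range(n)])
    return [tuple(columns[l][k] for l in range(n_loci)) for k in range(n)]


def heterozygosity_test(sample, n_perm=2500, seed=0):
    """Return observed h and empirical lower- and upper-tail proportions."""
    rng = random.Random(seed)
    h_obs = total_heterozygosity(sample)
    perm_h = [total_heterozygosity(permute_alleles(sample, rng))
              for _ in range(n_perm)]
    p_lower = sum(h <= h_obs for h in perm_h) / n_perm  # small h: lower tail
    p_upper = sum(h >= h_obs for h in perm_h) / n_perm  # large h: upper tail
    return h_obs, p_lower, p_upper


# A sample made entirely of homozygotes has far fewer heterozygotes than
# allelic independence predicts, so the lower-tail proportion is tiny.
sample = [(("A", "A"), ("C", "C"))] * 10 + [(("B", "B"), ("D", "D"))] * 10
print(heterozygosity_test(sample))
```

Permuting within each locus keeps every allele count fixed, which matches the conditioning used by the exact test.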
Variance-of-heterozygosity test:
Sums of single-locus heterozygosities do not contain information about between-locus dependencies, but there is information in the variance of the single-locus heterozygosities.
For a sample of n genotypes the variance-of-heterozygosity test statistic, V, is given by

The statistic V is affected by both within-locus and all pairwise between-locus associations.
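A variance-of-heterozygosity statistic can be sketched as follows. This is a hedged illustration only: it assumes that V is the ordinary sample variance, with denominator n - 1, of the number of heterozygous loci per individual. The normalization, the function names, and the data layout (each individual is a tuple of loci, each locus a pair of allele labels) are assumptions, not taken from the text.

```python
def heterozygous_loci_per_individual(sample):
    """Number of heterozygous loci carried by each individual."""
    return [sum(1 for (a, b) in g if a != b) for g in sample]


def variance_of_heterozygosity(sample):
    """Sample variance (n - 1 denominator, an assumption) of the
    per-individual counts of heterozygous loci."""
    counts = heterozygous_loci_per_individual(sample)
    n = len(counts)
    mean = sum(counts) / n
    return sum((c - mean) ** 2 for c in counts) / (n - 1)


# Two individuals: one homozygous at both loci, one heterozygous at both.
sample = [(("A", "A"), ("B", "B")), (("A", "C"), ("B", "D"))]
print(variance_of_heterozygosity(sample))
```

Because each individual's count is a sum over loci, its variance contains one term per locus and one covariance term per pair of loci, which is how between-locus associations enter the statistic.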
| GENETIC MODELS |
|---|
Structured populations:
An idealized model of population structure either has all current populations descending from a reference population [...]. A proportion $\pi_k$ of individuals belong to the $k$th subpopulation. The frequency of allele $A_{li}$ is $p_{lik}$ in the $k$th subpopulation and is

$$p_{li} = \sum_k \pi_k\, p_{lik}$$

in the whole population.
Even if there is allelic independence within subpopulations, the Wahlund effect causes a dependence to exist at the population level whenever the allele frequencies vary among subpopulations. One way to quantify this effect is by the difference between actual and expected heterozygosities in the whole population, at locus $l$,

$$H - H_e = -\sum_i \sum_k \pi_k\, (p_{lik} - p_{li})^2 \tag{2}$$

when there is allelic independence within each population. Note that $H$ is the proportion of heterozygous genotypes in the population, whereas previously $h$ has denoted a count. The difference $H - H_e$ is negative and increases in absolute value as the variance in allele frequencies increases. In the idealized population, this increases over time due to drift. For the null hypothesis that the heterozygosity is equal to the value expected under allelic independence, power will therefore increase both with time and with the number of loci, and this situation has been investigated previously.
The Wahlund effect also produces between-locus dependencies. Linkage disequilibrium in the whole population, or the difference between the joint frequency of pairs of alleles at different loci and the product of their separate frequencies, is given by

$$D_{li,\,l'j} = \sum_k \pi_k\, p_{lik}\, p_{l'jk} - p_{li}\, p_{l'j} = \sum_k \pi_k\, (p_{lik} - p_{li})(p_{l'jk} - p_{l'j})$$

when there is linkage equilibrium within the subpopulations.
If the whole population now mates at random, allele frequencies remain at $p_{li}$. Single-locus genotype frequencies become products of these allele frequencies. The population heterozygosity equals the value expected under within-locus allelic independence, so goodness-of-fit tests for heterozygosity, or exact tests for allelic association in sparse data sets, are not expected to give significant results. Linkage disequilibria will decay at a rate depending on the recombination fractions between loci and can be detected by tests at each pair of loci.
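Both consequences of the Wahlund effect, the heterozygote deficit and the between-locus disequilibrium, can be checked numerically. The sketch below is illustrative only: the function name, the argument names, and the example frequencies are assumptions. It takes subpopulation proportions and allele frequencies at two loci and assumes allelic independence and linkage equilibrium within each subpopulation.

```python
def wahlund_summary(weights, freqs_l, freqs_m):
    """Heterozygosity difference at the first locus and the matrix of
    linkage disequilibria between two loci in the pooled population.

    weights[k]    : proportion of individuals in subpopulation k
    freqs_l[k][i] : frequency of allele i at the first locus in subpop k
    freqs_m[k][j] : frequency of allele j at the second locus in subpop k
    """
    pooled_l = [sum(w * f[i] for w, f in zip(weights, freqs_l))
                for i in range(len(freqs_l[0]))]
    pooled_m = [sum(w * f[j] for w, f in zip(weights, freqs_m))
                for j in range(len(freqs_m[0]))]
    # Actual heterozygosity: average of within-subpopulation values.
    actual = sum(w * (1 - sum(p * p for p in f))
                 for w, f in zip(weights, freqs_l))
    expected = 1 - sum(p * p for p in pooled_l)
    # Joint frequency minus product of pooled allele frequencies.
    ld = [[sum(w * fl[i] * fm[j]
               for w, fl, fm in zip(weights, freqs_l, freqs_m))
           - pooled_l[i] * pooled_m[j]
           for j in range(len(pooled_m))]
          for i in range(len(pooled_l))]
    return actual - expected, ld


diff, ld = wahlund_summary(
    weights=[0.5, 0.5],
    freqs_l=[[0.8, 0.2], [0.2, 0.8]],
    freqs_m=[[0.6, 0.4], [0.4, 0.6]],
)
print(diff, ld)
```

With these made-up frequencies the pooled population shows fewer heterozygotes than expected and nonzero disequilibrium between the two loci, even though neither departure exists inside either subpopulation.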
Admixed populations:
A model for human populations that may be more appropriate for recent history has previously distinct populations admixing. Such admixture also creates dependencies among allele frequencies, but in a way different from that of the Wahlund effect. The population structure model assumed that subpopulations remained distinct and provides relationships between genotype and allele frequencies in the whole population. The admixture model assumes the modification of some subpopulations by the influx of alleles from other subpopulations.
A simple example supposes the parental generation to be composed of two random-mating populations, indexed by k = 1, 2, in which the frequencies of alleles Ali at locus l are plik. In the next generation, a proportion mkk of the individuals in the admixed population have both parents in population k, and a proportion 2mkk' have one parent in each of populations k and k'. The offspring genotype proportions in the admixed population are, therefore,

and the allele frequencies are

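It is convenient to introduce the quantities $\pi_k$, $k = 1, 2$, as the probabilities of a random allele in the admixed population having come from parental population $k$, so $\pi_1 = m_{11} + m_{12}$ and $\pi_2 = m_{12} + m_{22}$. We can define an assortative-mating parameter $M$, which measures the tendency for within-population mating,

$$M = \frac{m_{11}m_{22} - m_{12}^2}{\pi_1 \pi_2} = 1 - \frac{m_{12}}{\pi_1 \pi_2}$$

so that $\min(-\pi_1/\pi_2,\, -\pi_2/\pi_1) \le M \le 1$. The expected frequency of $A_{li}A_{lj}$ heterozygotes at locus $l$ in the admixed population is given by

$$2\,p_{li}\,p_{lj} = 2\,(\pi_1 p_{li1} + \pi_2 p_{li2})(\pi_1 p_{lj1} + \pi_2 p_{lj2})$$

Thus the difference between actual and expected heterozygosities at locus $l$ is

$$H - H_e = -M\,\pi_1 \pi_2 \sum_i (p_{li1} - p_{li2})^2 \tag{3}$$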
Equation 3 shows that a preference for within-population matings, M > 0, will result in fewer heterozygotes than expected. The exact test for allelic independence, acting as a one-sided test for heterozygosity, should therefore detect such assortative mating. However, if M < 0 the exact test will not perform well.
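The effect of assortative mating on heterozygosity can be checked numerically from the mating proportions. This is a minimal sketch under stated assumptions: the function name and example frequencies are made up, the ancestry proportions are taken as m11 + m12 and m12 + m22, and M is computed as 1 - m12/(pi1 * pi2), a form chosen because it is zero under random mating and one when there is no between-population mating.

```python
def admixture_het_difference(m11, m12, m22, p1, p2):
    """Actual minus expected heterozygosity at one locus after admixture.

    m11, m22 : proportions of offspring with both parents in population 1 or 2
    m12      : half the proportion with one parent in each population
    p1[i], p2[i] : frequency of allele i in parental populations 1 and 2
    """
    assert abs(m11 + 2 * m12 + m22 - 1) < 1e-9, "proportions must sum to one"
    pi1, pi2 = m11 + m12, m12 + m22          # ancestry proportions
    M = 1 - m12 / (pi1 * pi2)                # assumed form of the parameter
    homozygotes = sum(m11 * a * a + 2 * m12 * a * b + m22 * b * b
                      for a, b in zip(p1, p2))
    pooled = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    actual = 1 - homozygotes
    expected = 1 - sum(p * p for p in pooled)
    # Closed form obtained by expanding the two lines above.
    closed_form = -M * pi1 * pi2 * sum((a - b) ** 2 for a, b in zip(p1, p2))
    return M, actual - expected, closed_form


print(admixture_het_difference(0.25, 0.25, 0.25, [0.8, 0.2], [0.2, 0.8]))
print(admixture_het_difference(0.40, 0.10, 0.40, [0.8, 0.2], [0.2, 0.8]))
```

The first call corresponds to random mating between equally contributing populations and gives no heterozygote deficit. The second, with a preference for within-population mating, gives a deficit that the one-sided exact test would be able to detect.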
Population dominance:
A quite different situation arises when there is some "dominance" in population assignment. If individuals with either one or two parents in population k are assigned to that population, then there will be an excess of heterozygotes in that population. For the admixed population considered in the last section, suppose that individuals with both parents from population 1 are considered to belong to population 1 but individuals with at least one parent in population 2 are considered to belong to population 2. This may be the situation in New Zealand where population 1 represents Caucasians and population 2 represents Maoris.
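Among the population 2 members of the admixed population, homozygote and allele frequencies for $A_{li}$ at locus $l$ are

$$P(A_{li}A_{li}) = \frac{2m_{12}\,p_{li1}p_{li2} + m_{22}\,p_{li2}^2}{2m_{12} + m_{22}}, \qquad p_{li} = \frac{m_{12}\,p_{li1} + (m_{12} + m_{22})\,p_{li2}}{2m_{12} + m_{22}}$$

This leads to the following expression for the difference between actual and expected heterozygosities within the population 2 component of the admixed population:

$$H - H_e = \left(\frac{m_{12}}{2m_{12} + m_{22}}\right)^2 \sum_i (p_{li1} - p_{li2})^2 \tag{4}$$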
This expression is always positive, so there are more heterozygotes than expected and the exact test will not be appropriate for multilocus data although it will still be satisfactory for each locus separately. The population 1 component of the admixed population is wholly descended from population 1 and has no departures from allelic independence if none were within that population.
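The heterozygote excess produced by this assignment rule can be verified with a few lines of arithmetic. The sketch below is illustrative: the function name and the example allele frequencies are assumptions. Individuals with one or two parents in population 2 are treated as the population 2 component, in proportions 2·m12 and m22 before normalization.

```python
def dominance_het_excess(m12, m22, p1, p2):
    """Actual minus expected heterozygosity at one locus among individuals
    assigned to population 2 (one or both parents from population 2).

    p1[i], p2[i] : frequency of allele i in parental populations 1 and 2
    """
    total = 2 * m12 + m22                     # size of the population 2 component
    homozygotes = sum((2 * m12 * a * b + m22 * b * b) / total
                      for a, b in zip(p1, p2))
    freqs = [(m12 * a + (m12 + m22) * b) / total for a, b in zip(p1, p2)]
    actual = 1 - homozygotes
    expected = 1 - sum(f * f for f in freqs)
    # Closed form obtained by expanding the difference above.
    closed_form = (m12 / total) ** 2 * sum((a - b) ** 2 for a, b in zip(p1, p2))
    return actual - expected, closed_form


# Mating proportions from the population-dominance simulations (m11 = 0).
print(dominance_het_excess(1 / 3, 1 / 3, [0.8, 0.2], [0.2, 0.8]))
```

The difference is zero only when the two parental populations have identical allele frequencies or when no mixed-parentage individuals are present. Otherwise the excess is positive, which a one-sided test for low heterozygosity cannot detect.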
| NUMERICAL RESULTS |
|---|
Structured populations:
Simulations of the drift process were performed for 10 populations of size N = 1000 and for L = 1, 2, 3, 4, and 10 loci each with 5 or 10 equiprobable alleles per locus. The power of the exact test was found for samples of n = 200 individuals from the whole population after t = 0, 20, 60, 103, and 213 generations, corresponding to the population structure parameter, $\theta = 1 - (1 - 1/2N)^t$, with values of 0.00, 0.01, 0.03, 0.05, and 0.10. The power was also found for test $T_h^-$ on the basis of values of the total heterozygosity $h$. For each set of simulated data, the exact and heterozygosity tests led to rejection if the $P_c$ or $h$ values were among the smallest 5% of the values found from 2500 sets of data formed by permuting alleles separately at each locus. Powers were calculated as the proportions of rejections from 500 simulated data sets. The standard errors for the estimated powers can be calculated assuming sampling from the binomial distribution. Since 500 replications were used, the standard errors for the estimated powers are <0.0224. The results are shown in Table 1.
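The generation numbers and the standard-error bound quoted above can be reproduced with a short calculation. This is a small sketch with hypothetical function names. It assumes the structure parameter follows 1 - (1 - 1/2N)^t and that the largest binomial standard error occurs at a power of 0.5.

```python
from math import sqrt


def structure_parameter(pop_size, generations):
    """Structure parameter after the given number of generations of drift."""
    return 1 - (1 - 1 / (2 * pop_size)) ** generations


def max_binomial_standard_error(replicates):
    """Largest standard error of an estimated proportion (at 0.5)."""
    return sqrt(0.25 / replicates)


for t in (0, 20, 60, 103, 213):
    print(t, round(structure_parameter(1000, t), 2))
print(max_binomial_standard_error(500))
```

Rounded to two decimals, the five generation numbers give 0.00, 0.01, 0.03, 0.05, and 0.10 for N = 1000, and 500 replicates give a maximum standard error just under 0.0224.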
The power for both tests increases with the structure parameter $\theta$, with the number of loci, and with the number of alleles per locus, as found previously.
The relationship between the two tests is shown graphically in Fig 1, as plots of ln(Pc) against h. For sparse data, the relationship between these two statistics becomes linear, reflecting the dependence of Pc only on h among permuted data sets. Even for three loci and samples of size 200, the data are sufficiently sparse that the exact test does not detect between-locus dependencies.
Admixed populations:
Two of the simulated populations described in the previous section were allowed to contribute equally to an admixed population, $m_{11} = m_{12} = m_{22} = 0.25$. One generation of random mating was simulated and the exact test, the total heterozygosity tests, and the variance-of-heterozygosity test were performed using a sample of n = 200 genotypes. The powers were calculated as in the previous section and results are shown in Table 2. The simulations were repeated using samples of size 500 and the results for $\theta = 0.10$ are shown in Table 3.
As expected, the heterozygosity test has power equal to the significance level since all single-locus heterozygosities are equal to their expected values. There is linkage disequilibrium, however, so the variance-of-heterozygosity test has power that increases with the structure parameter $\theta$. For the exact test, power also increases with $\theta$, much as for the variance-of-heterozygosity test, until the number of loci becomes so large (greater than two) that data sparseness reduces the test to one of heterozygosity.
Admixed populations with population dominance were also simulated by setting m11 = 0 and m12 = m22 = 0.33. The exact test, the total heterozygosity tests, and the variance of heterozygosity test were performed using samples of size 200. The powers calculated from 500 replications are shown in Table 3.
There is a discontinuity in the results of the exact test for admixture under random mating between one and two loci shown in Table 3. For a randomly mating admixed population there will be little or no within-locus association, but substantial between-locus association. Thus the exact test for a single locus has power 0.05. However, the exact test can detect between-locus association for two or more loci if the sample size is sufficiently large. A sample size of n = 500 will detect the between-locus association for two loci with a power of 0.35. As the number of loci increases for a fixed sample size the genotype array becomes increasingly sparse and the power is seen to drop back to 0.05.
The powers of the heterozygosity test $T_h^-$ are less than the significance level while the powers of the $T_h^+$ tests increase with the structure parameter $\theta$, since the heterozygosity tests are one-sided. The exact test is similar to the heterozygosity test $T_h^-$ except for one-locus tests. This is because when only one locus was used in the test, the genotype arrays were not sparse and the exact test is in effect a two-sided test.
The variance-of-heterozygosity test statistic is affected by both within- and between-locus associations. In the case of population dominance, within-locus and between-locus associations have opposite effects on V. When the number of loci used is small, V is affected mostly by the within-locus association. As the number of loci increases, the number of pairwise between-locus associations increases and these balance out the effects of within-locus associations. As a result, the empirical power first increases and then decreases as the number of loci increases.
| DISCUSSION |
|---|
Care is needed in applying tests for allelic independence to check on the validity of the product rule in Equation 1. For small numbers of loci, when multilocus genotypes can occur several times in a sample, the exact test is good for associations both within and between loci.
As the number of loci increases, however, the exact test becomes a test of total heterozygosity and offers no information on between-locus associations.
For populations undergoing random mating following amalgamation of a set of divergent subpopulations, i.e., admixture, there is little point in performing a test for overall heterozygosity or in performing the exact test for sparse data sets. There are no within-locus associations, and the between-locus associations will not contribute to these test statistics.
| ACKNOWLEDGMENTS |
|---|
Very helpful comments were made by the reviewers. This work was supported in part by a New Zealand Institute of Environmental and Scientific Research grant to B. Law and U.S. National Institutes of Health grant GM 45344 to North Carolina State University.
Manuscript received November 14, 2002; Accepted for publication January 14, 2003.
| LITERATURE CITED |
|---|
BROWN, A. H. D., M. W. FELDMAN, and E. NEVO, 1980 Multilocus structure of natural populations of Hordeum spontaneum. Genetics 96:523-536.
CHAKRABORTY, R., 1984 Detection of nonrandom association of alleles from the distribution of the number of heterozygous loci in a sample. Genetics 108:719-731.
EVETT, I. W., and B. S. WEIR, 1998 Interpreting DNA Evidence. Sinauer, Sunderland, MA.
FALCONER, D. S., 1960 Introduction to Quantitative Genetics. Ronald Press, New York.
GUO, S. W. and E. A. THOMPSON, 1992 Performing the exact test of Hardy-Weinberg proportions for multiple alleles. Biometrics 48:361-372.
MAISTE, P. J. and B. S. WEIR, 1995 A comparison of tests for independence in the FBI RFLP databases. Genetica 96:125-138.
PROUT, T., 1973 Population genetics of marine pelecypods. III. Epistasis between functionally related isoenzymes of Mytilus edulis. Genetics 73:487-496.
WALSH, K. A. J. and J. S. BUCKLETON, 1988 A discussion of the law of mutual independence and its application to blood group frequency. J. Forensic Sci. 28:95-98.
WEIR, B. S., 1996 Genetic Data Analysis II. Sinauer, Sunderland, MA.
YANG, R.-C., 2000 Zygotic associations and multilocus statistics in a nonequilibrium diploid population. Genetics 155:1449-1458.
ZAYKIN, D., L. ZHIVOTOVSKY, and B. S. WEIR, 1995 Exact tests for association between alleles at arbitrary numbers of loci. Genetica 96:169-178.