Analysis of population structure and organization with DNA-based markers can provide important information regarding the history and evolution of a species. Linkage disequilibrium (LD) analysis based on allelic associations between different loci is emerging as a viable tool to unravel the genetic basis of population differentiation. In this article, we derive the EM algorithm to obtain the maximum-likelihood estimates of the linkage disequilibria between dominant markers, to study the patterns of genetic diversity for a diploid species. The algorithm was expanded to estimate and test linkage disequilibria of different orders among three dominant markers and can be technically extended to manipulate an arbitrary number of dominant markers. The feasibility of the proposed algorithm is validated by an example of population genetic studies of hickory trees, native to southeastern China, using dominant random amplified polymorphic DNA markers. Extensive simulation studies were performed to investigate the statistical properties of this algorithm. The precision of the estimates of linkage disequilibrium between dominant markers was compared with that between codominant markers. Results from simulation studies suggest that three-locus LD analysis displays increased power of LD detection relative to two-locus LD analysis. This algorithm is useful for studying the pattern and amount of genetic variation within and among populations.
THE pattern and extent of nonrandom associations among polymorphic markers distributed over the genome are related to the evolutionary rate of population structure for a species (Tishkoff et al. 1996, 2001; Stephens et al. 2001; Ardlie et al. 2002; Weiss and Clark 2002). One measure of such nonrandom associations, linkage disequilibrium (LD), is often affected by various evolutionary forces, such as selection, genetic drift, mutation, admixture, population structure, etc., which have operated in the population. For this reason, by estimating and testing for the extent and distribution of LD throughout the genome, the evolution of population structure can be inferred (Tishkoff and Williams 2002). There is a wealth of literature on the application of the LD analysis to understand the population evolution of humans (Reich et al. 2001; Ardlie et al. 2002; Dawson et al. 2002; Gabriel et al. 2002) as well as a variety of plants and animals (Remington et al. 2001; Hansson et al. 2004; Liu et al. 2006).
Unlike humans and several model systems, such as mouse and Arabidopsis, in which high-resolution LD maps have been constructed with codominant markers, such as single-nucleotide polymorphisms (SNPs) and microsatellites, many underrepresented species, like forest trees, still heavily rely upon simple and cheap dominant marker techniques. These markers including random amplified polymorphic DNA (RAPD) (Williams et al. 1990) and amplified fragment length polymorphism (AFLP) (Vos et al. 1995) can be genotyped arbitrarily from the genome with no need of prior knowledge about the structure and sequence of the genome. There have been extensive publications on the use of dominant markers to explore the amount, structure, and distribution of genetic variation in a population (Yan et al. 1999; Zhivotovsky 1999; Holsinger et al. 2002; Miller and Schaal 2006) and manage biological resources and diversity in agriculture and forestry (Kuang et al. 1998; Silbiger et al. 1998; Kremer et al. 2005). In the postgenomic era, proteomic techniques have become increasingly available to produce dominant markers such as presence/absence sport (PAS) and protein quantitative locus (PQL) (Thiellement et al. 1999; Zivy and de Vienne 2000; Consoli et al. 2002). Proteomic markers are often related to biological functions and, therefore, will play an important role in the genetic study of variation in a natural or an experimental population.
For a dominant marker, the plus allele, shown as the presence of a band on the gel, dominates over the null allele, shown as the absence of a band (unamplifiable by PCR). Therefore, heterozygous and homozygous genotypes both with the band cannot be directly distinguished from each other. Statistical methods for estimating linkage disequilibria with codominant markers, for which two homozygotes and the heterozygote are distinguishable, have been well developed (Pritchard and Przeworski 2001; Wang and Wu 2004). However, the estimates of LD among dominant markers have been poorly explored, although they have attracted the interest of geneticists for a long time because of their potential importance in the understanding of population structure (Hill 1974). As compared to codominant markers, dominant markers are less informative and, therefore, there may be some loss of information for their practical use. In this article, we derive a statistical method for estimating and testing the LD between dominant markers within the maximum-likelihood context. This method is implemented with the EM algorithm, allowing for a number of hypotheses regarding the structure and organization of a population. The statistical properties of the method for analyzing dominant markers are investigated and compared with those for codominant markers through simulation studies.
TWO-LOCUS LINKAGE DISEQUILIBRIUM ANALYSIS
The description of our LD analysis model will start with more informative codominant markers. The principle for analyzing codominant markers is then expanded to derive the model for LD analyses of less informative dominant markers.
The codominant model:
Suppose n diploid individuals are randomly drawn from a large population at Hardy–Weinberg equilibrium. Let us consider two diallelic markers A (with alleles A and a) and B (with two alleles B and b). These two markers are genetically associated with the linkage disequilibrium D in a Hardy–Weinberg equilibrium (HWE) population. Allele frequencies are defined as pA and pa (pA + pa = 1) for A and pB and pb (pB + pb = 1) for B, respectively. The frequencies of four haplotypes, AB, Ab, aB, and ab, formed by the two markers, are denoted, respectively, as(1)summing to one. Each of the two markers has three genotypes, that is, AA, Aa, and aa for marker A and BB, Bb, and bb for marker B. Let be the observed number of observations of a two-marker genotype lAlB (lA = AA, Aa, aa; lB = BB, Bb, bb). For a practical data set, the observations of these genotypes and their expected frequencies can be tabulated in a format like Table 1. Under the HWE, the diplotype frequencies are the products of the corresponding haplotype frequencies. The resulting genotype frequencies for each cell are then expressed in terms of haplotype frequencies.
The likelihood of unknown haplotype frequencies, Ω = (pAB, pAb, paB, pab), given observed genotypes (M) can be written as a multinomial form; i.e.,(2)
The maximum-likelihood estimates (MLEs) of haplotype frequencies are then derived as(3)where(4)
Equations 3 and 4 compose a loop for the EM algorithm. Initial values for the haplotype frequencies are provided to calculate the proportion ϕ in the E step (Equation 4). The calculated ϕ-value is then used to estimate haplotype frequencies in the M step (Equation 3). Both the E and the M steps are repeated until the estimates of haplotype frequencies converge. The MLEs of allele frequencies at the two markers and their linkage disequilibrium can be obtained by solving a group of equations (Equation 1); i.e.,(5)The sampling variances for the MLEs of haplotype frequencies (and therefore allele frequencies and linkage disequilibrium) can be calculated from the Fisher information matrix.
The degree of linkage disequilibrium between two markers can be tested by formulating two hypotheses expressed as(6)under which the likelihoods, and , are calculated, respectively, where the tilde corresponds to the MLEs for the null hypothesis and the circumflex corresponds to the MLEs for the alternative hypothesis. Under the null hypothesis above, the allele frequencies of codominant markers can be directly estimated from the marker data without the use of the EM algorithm. The log-likelihood-ratio test statistic is calculated by(7)which is asymptotically χ2-distributed with 1 d.f.
The dominant model:
For a dominant marker (say A), the homozygote (AA) for the dominant allele cannot be distinguished from the heterozygote (Ao). Thus, these two genotypes are observed as a single “phenotype” (A_). Because of this, some cells for the observations and expected genotype frequencies in Table 1 are collapsed in a way as shown in Table 2. We use to denote the observed number of observations of a two-dominant-marker genotype kA/kB . Using the same idea as that conceived for the codominant markers, we formulate the EM algorithm to estimate haplotype frequencies(8)where(9)The E (Equation 9) and M steps (Equation 8) are iterated until convergence. The MLEs of the allele frequencies and linkage disequilibrium are obtained by Equation 5. The linkage disequilibrium is similarly tested with hypothesis (6). A similar iterative algorithm for estimating the LD between dominant markers was derived by Hill (1974).
It should be pointed out that, unlike the codominance case, the estimation of allele frequencies of two dominant markers under the null hypothesis of no linkage disequilibrium should be based on the EM algorithm. The EM algorithm is described as follows:
In the E step, calculateand then in the M step, estimate the allele frequencies of markers A and B by using
THREE-LOCUS LINKAGE DISEQUILIBRIUM ANALYSIS
The codominant model:
Consider three segregating markers, A with alleles A and a, B with alleles B and b, and C with alleles C and c. The notation for marker alleles and their frequencies is given similarly for two markers in the preceding section. Linkage disequilibrium analysis among three markers is characterized by allele frequencies at each marker and linkage disequilibria between any pair of markers and among all the three markers are denoted as DAB, DAC, DBC, and DABC, respectively (Bennett 1954; Nielsen et al. 2004). Three markers generate eight different haplotypes, expressed as ABC, ABc, AbC, Abc, aBC, aBc, abC, and abc, whose frequencies are expressed, respectively, as(10)
These haplotype frequencies are used to describe the diplotype (and genotype) frequencies at the three markers given in Table 3. Let be the observed number of observations of a three-marker genotype lAlBlC (lA = AA, Aa, aa; lB = BB, Bb, bb; lC = CC, Cc, cc). On the basis of the expected frequencies, generally expressed as , as given in Table 3, we construct the multinomial log-likelihood by(11)from which the EM algorithm is derived to estimate haplotype frequencies; i.e.,(12)with , where(13)
Equations 13 and 12 construct a loop for iterations in the EM algorithm. Started with initial values, iterations are undertaken until converged estimates are obtained. The allele frequencies and linkage disequilibrium of different orders can be estimated from the MLEs of haplotype frequencies by(14)
The sampling variances for the MLEs of haplotype frequencies (and therefore allele frequencies and linkage disequilibria of different orders) in a three-locus analysis can be calculated from the Fisher information matrix.
The linkage disequilibria of different orders can be tested in general or individually. The existence of linkage disequilibria is tested using the following hypotheses:(15)The LR test statistic for the significance of LD is calculated by comparing the likelihood values under H1 (full model) and H0 (reduced model) using an equation similar to (7). The LR is considered to asymptotically follow a χ2-distribution with 4 d.f. The MLEs of allelic frequencies under H0 can be estimated using the EM algorithm described above, but with the constraints(16)
The dominant model:
Like two-locus linkage disequilibrium analysis, we derive the EM algorithm to estimate the haplotype frequencies for three-locus linkage disequilibrium analysis of dominant markers. When all three markers are dominant, expected genotype frequencies for codominant markers listed in Table 3 are collapsed to form the simpler Table 4. We use to denote the observed number of observations of a three-dominant-marker genotype kA/kB/kC . In this case, haplotype frequencies are estimated by(17)where(18)Equations 18 and 17 form a loop of the EM algorithm that can provide the MLEs of haplotype frequencies and, therefore, allele frequencies and linkage disequilibria of different orders for three dominant markers. The hypothesis tests for the degree of linkage disequilibria can be made in a way similar to the codominant model shown by Equations 15 and 16.
MONTE CARLO SIMULATION
We perform extensive simulation studies to investigate the statistical properties of the model for estimating linkage disequilibria between different dominant markers. We consider two cases, in which markers display high and low heterozygosity, respectively. Markers of high heterozygosity are simulated by similar frequencies for the alternative alleles, say 0.5 vs. 0.5, whereas those of low heterozygosity are simulated by contrast frequencies for the alternative alleles, say 0.9 vs. 0.1 (Nei 1987). In both cases, the markers are assumed to be associated with a certain LD. These parameters are used to simulate observations for marker genotypes for different sample sizes n = 30, 100, 200, and 400.
Two-locus LD analysis:
For cases of both high and low heterozygosity, LD is assumed to be 0.015. This assumed LD value corresponds to a larger normalized value (Lewontin 1964) for the two markers of lower heterozygosity than of higher heterozygosity. Table 5 gives the MLEs of the allele frequencies and linkage disequilibrium and the square roots of their mean square errors for two dominant markers. The model generally provides reasonable estimates of allele frequencies even when the sample size used is as low as 30. The accuracy and precision of the parameter estimation increase dramatically when the sample size increases to 100 for the markers of high heterozygosity, but to 200 for the markers of low heterozygosity. Regardless of the degree of heterozygosity, a small sample size cannot provide an unbiased estimate of LD. For the markers of high heterozygosity, a reasonable estimate of LD requires at least 200 individuals, whereas for the markers of low heterozygosity a doubled sample size is still not sufficient (Table 5). The simulation studies allow for the estimation of empirical power for the detection of significant LD. In general, the power is quite low for the assumed LD value, which, as expected, increases with the increase of sample size.
The same set of parameters was used to simulate codominant markers with the estimation results also tabulated in Table 5. The estimation accuracy and precision of all the parameters, particularly LD, can be better estimated for codominant than for dominant markers. A small sample size of 30 would be adequate for the reasonable estimate of LD for codominant markers regardless of their heterozygosity. Compared to dominant markers, codominant markers have higher power for the detection of LD, especially when the sample size is large. Different from dominant markers, allele frequencies and LD for codominant markers can be better estimated when the markers have lower heterozygosity (Table 5).
Three-locus LD analysis:
Similar simulation schemes are used for three-locus linkage disequilibrium analysis. More heterozygous markers are simulated by assuming allele frequencies pA = 0.5 and pB = pC = 0.6, whereas less heterozygous markers are simulated by assuming allele frequencies pA = 0.9 and pB = pC = 0.8. In both cases, linkage disequilibria are assumed to be DAB = DAC = DBC = 0.015 and DABC = 0.010. The results from the three-locus analysis are basically consistent with those from the two-locus analysis: For dominant markers, higher heterozygosity favors the estimation of the parameters in terms of estimation precision and power, whereas for codominant markers, lower heterozygosity favors the estimation (Table 6). Beyond two-locus analysis, three-locus analysis provides the estimation of more parameters related to linkage disequilibria and, thus, can be more informative for the understanding of population structure and organization.
Results from Tables 5 and 6 allow for the comparisons between two- and three-locus linkage disequilibrium analysis. Under the same condition, the precision of parameter estimation is similar between these two analysis approaches. But three-locus analysis displays much greater power for LD detection than two-locus analysis for codominant markers with a sample size of ≥100. For dominant markers, the power for LD detection can be increased only for the markers of high heterozygosity. Three-locus analysis does not improve the power for the dominant markers of low heterozygosity.
An additional simulation study was conducted to estimate type I error rates of linkage disequilibrium analyses. Marker data were repeatedly (1000 times) simulated according to the scenarios as designed above for different sample sizes, marker heterozygosities, and marker types by assuming that there is no linkage disequilibrium between the markers. It appears that sample sizes do not largely affect type I error rates for linkage disequilibrium detection. Two- and three-locus analyses may incorrectly find significant linkage disequilibria at a similar rate (Table 7). There is more chance to obtain false-positive results for markers with a high heterozygosity (7–16%) than for those with a low heterozygosity (1–8%). Type I errors are slightly different between dominant and codominant markers, although the pattern of differences depends on the level of marker heterozygosity.
We used the algorithm developed to estimate and test linkage disequilibria between dominant markers in hickory trees. Hickory is an oil woody species naturally distributed in southeastern China. DNA samples were collected from 90 trees randomly drawn from three different stands in Anhui Provinces. A total of 238 RAPD markers were genotyped, to study the structure and pattern of genetic variation in the population of hickory. As a demonstration, we randomly picked up a subset of markers to test and validate our algorithm proposed. Significant linkage disequilibria were detected for many marker combinations in the hickory population (Table 8). The estimates of linkage disequilibria are broadly consistent between two- and three-locus analyses, but the latter has more chance to detect the significance of linkage disequilibria for the same marker pair than the former. This result can be validated by simulation studies in the preceding section. For example, markers 19 and 144 have no significant linkage disequilibrium according to two-locus analysis, but they were detected to be significantly associated by three-locus analysis. In some cases, the significance level of the detection of linkage disequilibria can be increased when three-locus analysis is used, compared with that from two-locus analysis. For example, the estimate of the linkage disequilibrium between markers 46 and 53 is significant at P = 5% for two-locus analysis, but it is significant at P < 0.1% for three-locus analysis. Compared with two-locus analysis, three-locus analysis has an additional advantage in the estimation of high-order linkage disequilibria (e.g., significant DABC's were detected among markers 102, 31, and 227; markers 46, 53, and 196; and markers 167, 8, and 31; Table 8), which may have played an important role in shaping population structure (Nielsen et al. 2004).
The use of individual markers to test for the deviation of a population from Hardy–Weinberg equilibrium has become a routine approach for the inference of the structure and evolution of the population. LD analysis based on multiple markers can provide additional information about population structure by estimating the extent and distribution of nonrandom associations throughout the genome (Stephens et al. 2001; Ardlie et al. 2002; Dawson et al. 2002). For a random-mating population, the LD between two markers decays with generation in a proportion depending on the recombination fraction between the markers (Lynch and Walsh 1998). Thus, by comparing the rate of LD decay over genetic distances, the evolutionary history of a population can be inferred (Dawson et al. 2002; Gabriel et al. 2002; Tishkoff and Williams 2002). Also, the rate of the LD decay as a function of generation has established a fundamental principle for the high-resolution mapping of complex traits in a population (Rafalski and Morgante 2004).
While high-throughput molecular marker techniques, such as SNPs, have been widely used to study the population structure and evolution of humans (Stephens et al. 2001; Dawson et al. 2002), cheap dominant markers still serve as an important tool for genetic research in underrepresented species (Kuang et al. 1998). More importantly, the role of dominant markers in the characterization of biochemical functions has become more pronounced in the postgenomic era with the advent of proteomic techniques (Thiellement et al. 1999; Zivy and de Vienne 2000; Consoli et al. 2002). Statistical approaches have been derived to study the structure and organization of a population with individual dominant markers (Lynch and Milligan 1994; Zhivotosky 1999; Holsinger et al. 2002). However, the estimation of LD with multilocus dominant markers has not been well explored, although the importance of multilocus analysis in population genetic studies has been increasingly recognized (Kremer et al. 2005).
In this article, we present a statistical method for estimating and testing the LD between dominant markers by implementing the EM algorithm. With this algorithm, the frequencies of haplotypes constructed by a series of nonalleles at different markers can be precisely estimated and the estimated haplotype frequencies are further used to estimate allele frequencies and linkage disequilibria. Our presentation of the method was first based on informative codominant markers, for which the estimation of haplotype frequencies can be readily formulated, and then expanded to take into account dominant markers. Although our method was presented for two- and three-locus LD analysis, it is common to estimate linkage disequilibria among four or more markers. Perhaps, a general model should be derived that can estimate and test the LDs for an arbitrary number of codominant and dominant markers. Such a multilocus LD analysis by exploiting more information at one time has advantages for increasing the precision of parameter estimation and the power to detect significant LD. Also, more information, e.g., trigenic or higher-order LD, can be detected.
Extensive simulation studies have been performed to examine the robustness and properties of LD analysis with dominant markers under different degrees of heterozygosity and different sample sizes. The method proposed in this study has proved its power to estimate and test for the LD between dominant markers, but because of their inherent adequacy of information, the use of dominant markers to study the LD is limited under some circumstances, such as a small sample size and low heterozygosities of markers. According to the results from simulation studies, the following guidelines are recommended for dominant markers to be more effectively used in practice:
For dominant markers of high heterozygosity, the sample size used for a two-locus LD analysis should be ≥200, whereas for dominant markers of low heterozygosity, the sample size should be ≥400.
Because multilocus analysis can improve the test power, it is always recommended to perform the LD analysis with three or more dominant markers. Furthermore, multilocus analysis provide additional information about a web of nonrandom associations among multiple markers.
Codominant markers are more powerful for the estimation and test of LD than dominant markers. It is recommended to generate a mix of dominant and codominant markers for a better characterization of the genetic structure of a population (e.g., see Yan et al. 1999).
Zhivotovsky (1999) and Holsinger et al. (2002) presented an approach for estimating population structure in diploids with individual dominant DNA markers. Our method has extended the use of dominant markers to characterize linkage disequilibrium among multiple loci. Although the focus in this study is on dominant markers, the methodological principle of the model proposed can be extended to consider a complex web of polygenic LD of different orders among multiallelic markers. For outcrossing populations, multiallelic markers, such as microsatellites, are crucial for the precise characterization of population structure and diversity. Our approach for LD analysis was derived within the maximum-likelihood context. Bayesian approaches that have several advantages over maximum likelihood (Holsinger et al. 2002; Fu et al. 2005) may offer greater power to understand population structure, especially when a number of markers with multiple alleles are analyzed simultaneously. The computer code to perform linkage disequilibrium analyses can be requested from the corresponding author.
We thank the two anonymous referees for their constructive comments on this manuscript. The preparation of this manuscript was partially supported by National Science Foundation/National Institutes of Health Mathematical Biology grant no. 0540745.
Communicating editor: R. Nielsen
- Received November 26, 2006.
- Accepted May 13, 2007.
- Copyright © 2007 by the Genetics Society of America