Abstract
Linkage analysis and allelic association (also referred to as linkage disequilibrium) studies are two major approaches for mapping genes that control simple or complex traits in plants, animals, and humans. But these two approaches have limited utility when used alone, because they use only part of the information that is available for a mapping population. More recently, a new mapping strategy has been designed to integrate the advantages of linkage analysis and linkage disequilibrium analysis for genome mapping in outcrossing populations. The new strategy makes use of a random sample from a panmictic population and the open-pollinated progeny of the sample. In this article, we extend the new strategy to map quantitative trait loci (QTL), using molecular markers within the EM-implemented maximum-likelihood framework. The most significant advantage of this extension is that both linkage and linkage disequilibrium between a marker and QTL can be estimated simultaneously, thus increasing the efficiency and effectiveness of genome mapping for recalcitrant outcrossing species. Simulation studies are performed to test the statistical properties of the MLEs of genetic and genomic parameters including QTL allele frequency, QTL effects, QTL position, and the linkage disequilibrium of the QTL and a marker. The potential utility of our mapping strategy is discussed.
GENETIC mapping of quantitative trait loci (QTL) has become a routine tool for the genetic study of plants, animals, and humans. With such a tool, many fundamental genetic questions including the inheritance mode of a quantitative trait, genotype × environment interaction, and the genetic basis of heterosis can be addressed (reviewed by Tanksley 1993; Templeton 1999; Wuet al. 2000; Mackay 2001). Genetic mapping also has potential to reshape our understanding of complex biological phenomena, such as human diseases and adaptive plasticity (the capacity of a given individual to change its phenotype across different environments). Most of these phenomena are now viewed as having some genetic components and, therefore, can be modified or changed genetically for a feature beneficial to humans. It can be anticipated that genetic mapping will play an increasingly important role in unraveling the genetic basis of quantitative variation in the next decade with the advent of novel DNA-based molecular marker technologies, such as single-nucleotide polymorphisms (SNPs; Wanget al. 1998).
Because of differences in biological properties of study materials, considerable effort is being made to develop statistical genetic mapping methods for specific species or populations. In terms of the genetic principles behind mapping, the methodology of genetic mapping includes two main areas: linkage analysis and association studies (reviewed by Olsonet al. 1999). Linkage analysis is based on the recombination of nonalleles at a marker and QTL during meiosis and, thus, can directly estimate the map distance (measured by recombination fraction) between the two syntenic loci. However, it is difficult to detect recombination events between closely spaced (<1cM) loci when there are a limited number of meioses occurring in a mapping population (e.g., Hästbacka et al. 1992, 1994; Darvasiet al. 1993; Longet al. 1995). Association studies, on the other hand, use all recombinations generated since nonrandom association of nonalleles at a marker and QTL (commonly referred to as linkage disequilibrium) was introduced into a population, thus increasing the precision of the estimate of the QTL location (Risch and Merikangas 1996; Rabinowitz 1997; Xiong and Guo 1997). Yet, the localization of QTL using linkage disequilibrium mapping is ineffective when the significant linkage disequilibrium detected between a marker and QTL results from the recent occurrence of disequilibrium rather than from a tight linkage between the loci. Such a spurious association detected even when the marker is not physically linked to any causative loci may be due to population subdivision and admixture. Current population-(Gordonet al. 2000; Luoet al. 2000) or family-based analyses [e.g., the transmission/disequilibrium test (TDT); Spielman and Ewens 1996; Allison 1997] of linkage disequilibrium cannot distinguish strong disequilibrium and loose linkage from weak disequilibrium and tight linkage (Whittakeret al. 2000).
The limits of linkage analysis and linkage disequilibrium mapping when they are used alone can be overcome by a new strategy for taking advantage of each approach in genetic mapping. Such a joint linkage and linkage disequilibrium mapping strategy has been recently devised by Wu and Zeng (2001) in that a random sample from a natural population and the open-pollinated progeny of the sample were analyzed jointly. This strategy was established on the principle of gene transmission from the parental to progeny generation during which the linkage between a marker and QTL is broken down due to meiotic recombination. It can therefore divide the composite measure of linkage disequilibrium from traditional population- or family-based association tests relying on recombinations in a single generation into two components: the linkage between the marker and QTL and their linkage disequilibrium created at a historic time. With the measures of these two components, one can clearly determine the mechanistic basis of a significant disequilibrium detected between a marker and QTL, which increases the feasibility for fine mapping QTL affecting a quantitative trait.
In this article, we extend the joint linkage and linkage disequilibrium mapping strategy to map QTL segregating in a natural population. The extension allows for simultaneous estimates for a number of genetic and genomic parameters including the allele frequency of QTL, its effects, its location, and its population association with a known marker locus. Our analysis is performed within the maximum-likelihood framework, implemented with the expectation-maximization (EM) algorithm. The statistical properties of the estimates for different genetic parameters are studied through extensive simulations. A comparison of the power for detecting linkage disequilibrium is made on the basis of traditional disequilibrium analyses and the joint linkage and linkage disequilibrium analysis proposed here.
STATISTICAL METHOD
Population structure theory: Outcrossing species likely have heterogeneous genomes, on which both dominant and codominant loci are distributed. For codominant loci, there are often a high but variable number of alleles from locus to locus (Weber and Wong 1993; Pfeifferet al. 1997). To simplify the descriptions of our mapping model, we consider only biallelic codominant loci in this article. Although straightforward in principle, extensions to other marker types, such as dominant or missing markers and multiallelic markers, require particular mathematical manipulations.
Consider one marker (M) and one QTL (Q), both segregating in a random mating population at Hardy-Weinburg equilibrium. The two alleles are denoted by M1 and M2 at the marker locus and by Q1 and Q2 at the QTL. The frequencies of alleles Mr (r = 1, 2) and Qs (s = 1, 2) in the population are denoted by pr and qs, with
If the marker and QTL are located on the same region of a chromosome, they are likely linked with recombination fraction θ. On the basis of population genetic theory (Nagylaki 1992), it is easy to derive the population frequencies of four two-locus gametes (haplotypes) MrQs (r, s = 1, 2), which are randomly combined to form the current generation t, as
Sampling theory: Assume that we randomly select M unrelated individuals from the population and collect the seeds from the selected individuals. The seeds are germinated and grown into seedlings for a progeny test, which is a regular procedure for traditional quantitative genetic experiments (McKeand and Bridgwater 1998). Both the selected individuals and their progeny are genotyped for molecular markers. Assuming the species
studied is dioecious, the genotypes of the seeds from a selected individual are derived from its own (maternal) gametes each with a frequency given in Table 1 and the paternal gametes from the pollen pool each with a frequency described by Equation 2. Thus, every selected individual represents an open-pollinated (i.e., half-sib) family with the common mother and different (unknown) fathers. On the basis of the sampling theory, the M selected individuals include three different marker genotypes, with the number denoted by Mr1r2 for genotype Mr1Mr2, and the genotypic frequencies in the sampled population are
Probabilities of the gamete genotypes produced by mother plants with different genotypes for the marker and QTL
Estimation theory: Suppose there is a segregating QTL responsible for a quantitative trait in the half-sib families. The phenotypic value of offspring j in an open-pollinated progeny test at the putative QTL is described by a simple statistical model
Conditional probabilities of the QTL genotypes upon the marker genotypes of a progeny produced by a mother plant with different marker genotypes (two-level marker genotypes)
Calculating the MLEs of Ω is equivalent to differentiating the log-likelihood of Equation 5 with respect to each of the unknown genetic parameters, setting the derivatives to equal zero, and solving the log-likelihood equations. On the basis of these procedures, we can obtain the explicit ML estimator of marker allele frequency p1:
For the other parameters
The existence of the QTL under consideration can be tested by formulating the two hypotheses
A log-likelihood ratio (LR) test statistic for the test of these two hypotheses is calculated using
SIMULATION
The statistical properties of the mapping method proposed in this article are examined by using simulated examples. Suppose the mother plants from which seeds are collected and grown into seedlings are randomly sampled from a panmictic population. A biallelic marker locus and a biallelic QTL, each of which is segregating in the population, are genetically associated. A number of factors may affect the precision and power of the method to detect the putative QTL, which include sampling schemes, the degree of marker and QTL segregation, the degree of linkage and linkage disequilibrium, and the mode of QTL gene action.
The effects of sampling schemes and population heterozygosity: How the size of samples and their allocation between and within open-pollinated families affect the behavior of a statistical method in a mapping experiment is an important issue for a practitioner to examine. In this simulation, we investigate the effects of three different sampling schemes on parameter estimation. The three schemes include (1) few large families (10 × 100), (2) moderately sized families of a moderate number (32 × 32), and (3) many small families (100 × 10). Also, the effects of sampling schemes are affected by other factors, such as gene segregation, the degree of nonrandom association between the marker and QTL, and the QTL effect. The effect due to the interaction between sampling schemes and gene segregation is examined. Gene segregation for a gene in a population is described by the difference in the frequencies of alternative alleles at the gene. A larger difference (say 0.10 vs. 0.90) implies that a population is closer to fixture and has a smaller degree of segregation. Otherwise, a population of a smaller difference in allele frequency (say 0.50 vs. 0.50) has a larger degree of segregation. Table 3 gives the parameter values used to simulate the effects of sampling schemes and gene segregation. Assuming each of the M selected open-pollinated families has an equal size, the phenotype and marker data are generated using the following steps:
Step 1. Randomly assign three marker genotypes to the M hypothesized mother plants according to probabilities
Step 2. Randomly assign three marker genotypes to the progeny within a mother plant of a particular marker genotype according to probabilities of the marker genotypes of the progeny (Table 2).
Step 3. Randomly sample joint genotypes at both the marker and QTL for an offspring derived from each mother plant from a multinomial distribution with the probabilities calculated from Table 2.
Step 4. Determine the phenotypic value for an individual with a given marker-QTL joint genotype by its genotypic value of the QTL plus a random number sampled from a normal distribution of mean zero and variance σ2 = 1.
The mean and standard error of the MLE for each of the unknowns over 100 replicates of simulation are given in Table 1. The MLE of the marker allele frequency is estimated directly, using Equation 6. The estimation for the other parameters is viewed as a missing data problem. In general, the EM algorithm derived in this article can provide the unknown parameters with consistent MLEs compared to their actual values. Yet, the precisions of parameter estimations in terms of the standard errors estimated from multiple simulation runs are greater when using a sampling scheme of few large families (10 × 100) than of many small families (100 × 10). Such precision improvement due to the use of a better sampling scheme is much more remarkable when the population sampled is closer to fixture. For example, when the difference in allele frequency for both marker and QTL is 0.80, the standard error for the allele frequency of the QTL is 0.0151 for many small families and 0.0087 for few large families. But the corresponding values are 0.0105 and 0.0081 for a population having an equal frequency for the alternative alleles at the same locus.
The power of detecting a significant linkage disequilibrium using our method is also investigated. For a less segregating population, the power is strongly dependent on the sampling scheme used, with 0.79 for many small families and 0.95 for few large families (Table 1).
The effects of linkage and linkage disequilibrium: Because missing information about the QTL is inferred from the marker genotype, the relationship between the marker and QTL affects the estimates for genetic parameters. Here, four different relationship patterns are compared on the basis of a sampling scheme 32 × 32: (1) tight linkage and weak disequilibrium, (2) tight linkage and strong disequilibrium, (3) loose linkage and weak disequilibrium, and (4) loose linkage and strong disequilibrium (Table 4). In these four patterns, all parameters except recombination fraction and linkage disequilibrium are set equal. As expected, the marker allele frequency can be very well estimated. Given the same linkage between the marker and QTL, a more associated marker tends to provide more precise estimates for both the population genetic (allele frequency) and quantitative genetic parameters of the QTL (the overall mean, additive and dominant effect, and residual variance) than a less associated marker. Also, as shown in our simulation example, there is significantly greater power to detect a QTL using a more associated marker [
Means and standard errors (in parentheses) of the MLEs of the genetic parameters for different sampling schemes and different heterogeneity in allelic frequency from 100 simulation replicates
Means and standard errors (in parentheses) of the MLEs of genetic parameters
The effects of linkage and linkage disequilibrium on parameter estimation vary among different parameters. Generally, these effects are larger on the estimates of the dominant effect of the QTL and residual variance than the additive effect and overall mean (Table 4).
The effects of QTL gene action: It has been well demonstrated that the magnitude of QTL effect affects parameter estimation, with a QTL of large effect being estimated more precisely than a QTL of small effect. Similar results have also been observed in the linkage disequilibrium-based mapping of QTL (Luo and Suhai 1999; Luoet al. 2000). However, it is unclear how different modes of gene action affect the precision and power of parameter estimation in linkage disequilibrium mapping. A simulation here is designed to investigate the effect of gene action of the estimates of QTL parameters.
Our simulation on gene action includes four different patterns: (1) purely additive (δ = 0), (2) partially dominant (0 < δ/α < 1), (3) dominant (δ/α = 1), and (4) overdominant (δ/α > 1). Except for the marker allele frequency, all other parameters have a consistent trend in the precision and power of parameter estimation over gene action (Table 5). As shown by the estimates of standard error, a dominant QTL can be estimated more precisely than an additive QTL. Also, an overdominant QTL can be estimated more precisely than a dominant or partially dominant QTL. However, the power to detect a significant linkage disequilibrium between the marker and QTL is greater for an additive QTL than for a dominant QTL as well as for a partially dominant than for an overdominant QTL (Table 5).
Comparison between traditional disequilibrium mapping an our joint mapping: We conduct an additional simulation study to compare the power for detecting linkage disequilibrium on the basis of the traditional disequilibrium mapping approach (Allison 1997; Luoet al. 2000) and our joint linkage and linkage disequilibrium mapping approach. For comparison, the same sets of genetic parameters are hypothesized between the two approaches, each allowing for different combinations between linkage and disequilibrium (Table 6). For both approaches a sample size of 1000 is assumed. For the pure disequilibrium mapping approach, this sample is randomly derived from a natural population, representing the same generation. But for our joint linkage and linkage disequilibrium mapping approach, this sample is allocated between the parental generation and the open-pollinated progeny generation. Here, the sampling scheme of 32 × 32 is simulated.
Table 6 shows the observed power for detecting linkage disequilibrium using the two mapping approaches. Generally, greater power is observed for the joint linkage and linkage disequilibrium analysis than for the pure disequilibrium analysis. However, the increase of the power by using the joint analysis depends on the degrees of linkage and linkage disequilibrium between a marker and QTL. In the situations where the linkage is loose or the disequilibrium is weak, the joint mapping approach has significantly increased power compared to the traditional disequilibrium mapping approach.
Means and standard errors (in parentheses) of the MLEs of the genetic parameters for different effects of the QTL from 100 simulation replicates
Difference in power to detect the linkage disequilibrium between a marker and QTL using the existing (top) and our (bottom) approaches
DISCUSSION
We have provided a unifying framework for the fine-scale mapping of QTL affecting a quantitative trait in a natural population on the basis of a joint linkage and linkage disequilibrium mapping strategy proposed by Wu and Zeng (2001). We model marker-QTL association on the basis of a random sample (mothers) drawn from a natural population and marker-QTL linkage on the basis of the open-pollinated progeny of the sample in which recombination events happen between different loci. Such a two-stage (mother and progeny) hierarchical modeling provides a simultaneous estimate of the linkage and linkage disequilibrium between the marker and QTL, which is thus beyond many existing composite linkage disequilibrium mapping methods that cannot distinguish strong association and loose linkage from weak association and tight linkage (Terwilliger 1995; Xiong and Guo 1997; Collins and Morton 1998; Terwilliger and Weiss 1998; Luoet al. 2000). Moreover, by unifying the information about linkage and linkage disequilibrium, the joint mapping method displays increased power to detect linkage disequilibrium, compared to the traditional linkage disequilibrium analyses.
As an example, we used a simpler one-biallelic codominant marker/one-biallelic QTL model to demonstrate the statistical properties of the joint linkage and linkage disequilibrium analysis in the precise mapping of individual QTL for complex trait. Linkage analysis requiring informative meioses in a pedigree can rarely detect a target gene that is within 1 cM of markers, but it should be useful for a genome-wide scan for QTL because a high-density map covering the entire genome can be constructed in a single pedigree. Thus, through a genome-wide scan for QTL using linkage analysis, genomic regions containing QTL can first be identified. These identified regions are then saturated by more markers and are further narrowed around QTL, using the joint linkage and linkage disequilibrium mapping strategy. We employ the maximum-likelihood method implemented with the EM algorithm to obtain the MLEs for model parameters including the allele frequency of QTL, its effects, its location, and its linkage disequilibrium with a marker. Extensive simulation studies show that the method can provide reasonable estimates for these genetic and genomic parameters for a wide range of parameter values.
In the current modeling, we have not considered the phenotypes of the genotyped mothers sampled from a natural population and used to supply the next progeny (contained in seeds). Yet, this would not affect the efficiency and utility of the model because we have integrated mothers' marker genotypes and progeny's marker genotypes into a two-level marker genotype framework. Thus, the phenotypes of the progeny population can be directly associated with the two-level marker genotypes. The strategy with no need of mothers' phenotypes is practically advantageous in at least two aspects. First, for species like forest trees, sample mothers from a natural population are easily genotyped, but their phenotypes are difficult to measure. Second, the mothers sampled cannot be compared to their progeny in phenotypes because of different ages and growth environments. However, for some species that can be vegetatively propagated, a field trial can be established with clonal replicates of both mothers and their progeny. In this case, mothers and their progeny with the same age can be simultaneously measured and compared. A further simulation study is needed to examine the advantage of the model implemented with mothers' phenotypes.
Although the codominant marker assumption used can be valid by genotyping markers like SNPs, there are many dominant markers derived from rapid amplified polymorphic DNAs and amplified fragment length polymorphisms in real data analyses for natural outcrossing populations. Also, with no doubt, our one marker-one QTL model is too simplistic for a quantitative trait that may be controlled by multiple genes each with a different effect. For these two practical reasons, the joint linkage and linkage disequilibrium mapping approach needs extension to allow for multiple markers including dominant and multiallelic markers. Linkage analysis in a pedigree using dominant markers is often biased and has low precision especially when a sample size is small (Maliepaardet al. 1997). But these problems can be overcome if they are combined with codominant markers through a Markov chain (Jiang and Zeng 1997). For the linkage disequilibrium analysis of dominant markers, a similar improvement in the precision of parameter estimate can also be expected from their combined use with codominant markers. For multiple alleles and/or loci, basic extension of the single-marker disequilibrium measures presented above has been developed in the current literature. Like linkage analysis, multipoint disequilibrium can be more efficient than single-marker analysis. For example, Hill and Weir (1994) showed that the variance of the linkage disequilibrium between a closely linked marker and a QTL is large, such that the disequilibrium cannot be used for the precise mapping of the QTL. When the disequilibria between all markers and the QTL are analyzed simultaneously, the problem of a high variability of a single linkage disequilibrium is avoided (Meuwissen and Goddard 2000). A likelihood-based multipoint approach to linkage disequilibrium mapping loci can be found in Terwilliger (1995), McPeek and Strahs (1999), Meuwissen and Goddard (2000), and Morris et al. (2000). When a narrow region is being considered for linkage disequilibrium fine-scale mapping, conditioning on the distances between markers allows the use of a composite likelihood to extract information from multiple markers. Xiong and Guo (1997) give a general likelihood framework for linkage disequilibrium mapping that incorporates multiallelic markers, multiple loci, and mutational processes at the disease and marker alleles.
For a multi-QTL model, a number of genetic parameters are treated as unknown. These include the number of QTL, the additive and dominant effect of each QTL, different kinds of epistatic effect between each pair of QTL, the chromosomal location of each QTL (determined by the recombination fraction between each QTL and its flanking markers), the linkage disequilibrium between each pair of QTL, and the linkage disequilibrium between each QTL and each marker. The maximum-likelihood method that works in a one-marker/one-QTL case may be insufficient for handling such a high dimension of unknowns. Markov chain Monte Carlo (MCMC) methods within a Bayesian framework may be a better solution for our multi-QTL linkage and linkage disequilibrium mapping. Unlike the traditional maximum-likelihood method, MCMC methods provide estimates for unobservables by analyzing their posterior distributions (Robert and Casella 1999). In the MCMC paradigm, we are able to incorporate prior information for model parameters including the number of QTL, where appropriate, which is thus advantageous over the maximum-likelihood method. Given the impressive applications of the Bayesian approach in QTL linkage mapping (see Satagopanet al. 1996; Sillanpaa and Arjas 1996, 1999; Uimariet al. 1996; Heath 1997; Stephens and Fisch 1998), we are confident of developing a powerful Bayesian approach for joint linkage and linkage disequilibrium mapping of multiple QTL through the entire genome.
Acknowledgments
We are grateful to Dr. Bruce Weir, Dr. Shizhong Xu, Dr. Nengjun Yi, and Dr. Zhao-Bang Zeng for stimulating discussions about this study, and to the associate editor and two anonymous referees for their helpful comments on this manuscript. This manuscript was approved for publication as journal series no. R-07961 by the Florida Agricultural Experiment Station.
APPENDIX A
We describe a procedure for deriving the conditional probabilities of QTL genotypes given two-level (mother and progeny) marker genotypes. Let us first consider the mothers with marker genotype M1M1. This mother marker genotype is composed of three joint marker-QTL genotypes M1M1Q1Q1, M1M1Q1Q2, and M1M1Q2Q2, with respective population frequencies in the mother generation (t) as
When the marker genotype of a sampled mother is M1M2, three possible joint marker-QTL genotypes are M1M2Q1Q1, M1M2Q1Q2, and M1M2Q2 Q2, with population frequencies as
A similar procedure can be described to derive the conditional probabilities of different QTL genotypes when the mother's marker genotype is M2M2 (Table 2).
APPENDIX B
The MLEs of the unknown parameters
Equations B6–B8.
The estimates obtained from Equations B2, B3, B4, B5, B6, B7 and B8 in Scheme 1 are then used to update H (the E step). In the E step, the posterior probability of progeny j to have QTL genotype κ is calculated using Equation B1. The iteration between the E and M steps is repeated until convergence. The values at convergence are the MLEs of the parameters.
Footnotes
-
Communicating editor: Y.-X. Fu
- Received June 15, 2001.
- Accepted November 20, 2001.
- Copyright © 2002 by the Genetics Society of America