Mapping Quantitative Trait Loci Using Multiple Families of Line Crosses
- Shizhong Xu
- Author e-mail: xu{at}genetics.ucr.edu
Abstract
To avoid a loss in statistical power as a result of homozygous individuals being selected as parents of a mapping population, one can use multiple families of line crosses for quantitative trait genetic linkage analysis. Two strategies of combining data are investigated: the fixed-model and the random-model strategies. The fixed-model approach estimates and tests the average effect of gene substitution for each parent, while the random-model approach treats each effect of gene substitution as a random variable and directly estimates and tests the variance of gene substitution. Extensive Monte Carlo simulations verify that the two strategies perform equally well, although the random model is preferable in combining data from a large number of families. Simulations also show that there may be an optimal sampling strategy (number of families vs. number of individuals per family) in which QTL mapping reaches its maximum power and minimum estimation error. Deviation from the optimal strategy reduces the efficiency of the method.
LINE crossing is a common experimental design for mapping quantitative trait loci (QTLs) in plants and laboratory animals. Statistical methods are well developed for QTL mapping using line-crossing data (Lander and Botstein 1989; Haley and Knott 1992; Martínez and Curnow 1992; Jansen 1993, 1994; Zeng 1994). Methods developed by these authors are mainly designed to handle a single cross, e.g., a single F2 family. Under these methods, the effects of gene substitution (the first moments) are tested and estimated. Because of this, the methods are classified by Xu and Atchley (1995) as the fixed-model approach. The sampling strategy (using a single family) and the statistical methodology (the fixed model) consequently restrain the inference space of the parameter estimation to the particular cross. This is undesirable if the two lines initiating the cross are not segregating at a QTL, for then no matter how many offspring are sampled in the F2 or backcross population, the QTL cannot be detected. If a QTL is present, but is not detected because of fixation to the same allele in both lines, then a type of type II error has occurred. This type II error, referred to as genetic drift error by Xu (1996a), has largely been ignored in the QTL mapping literature.
A type II error of this kind can be reduced or even prevented by using multiple families of line crosses. At the low end, Muranty (1996) claims that QTL detection in a population derived from two parents is often less powerful than in one derived from more parents. He then demonstrates that if QTL heterozygote frequency in the base population is high enough, a mating design with six parents should give a good sample of variance and allow the detection of QTL with reasonable power. Muranty (1996) introduced the idea of multiple-family QTL mapping by using an ideal situation in which the genotype of the QTL is known without error. In practice, the QTL genotype cannot be observed, so the statistical method demonstrated by Muranty must be modified before it is applied to genome scanning using real data.
In this paper, I propose two strategies for combining data from multiple families of line crosses: the fixed-model and the random-model approaches. I then conduct Monte Carlo simulations to show that both the fixed- and the random-model approaches work as expected.
METHODOLOGY
Linear model: Consider t independent F2 families, each derived from a cross of two inbred lines (a total of 2t independent inbred lines are involved). The phenotypic value of a quantitative character can be described by the following linear model:
Let E(zij|IM) and E(wij|IM) be the conditional expectations of zij and wij given marker information (IM). The linear model can be approximated by substituting zij and wij by their conditional expectations (Haley and Knott 1992; Martínez and Curnow 1992),
Fixed-model strategy: The first strategy of combining several different line crosses is to estimate and test {αi, δi} for i = 1, …, t. The null hypothesis is H0: αi = δi = 0 ∀i. This approach is analogous to the nested design for multiple-family analysis (Weller et al. 1990; Knott et al. 1996); i.e., it treats QTL effects as nested within families. Because the first moments are estimated, the method is called the fixed-model strategy. Let ni be the number of individuals in the i-th family and
Under the fixed model, parameters are estimated via the iteratively reweighted least-squares algorithm described below. Given initial guesses of λαi, λδi and λαiδi, matrix R is treated as known. Under the pretense of known R, the solutions of θ = {β, α, δ} and σ²ε are obtained via the following equations:
Note that these solutions are maximum likelihood estimates (MLEs) under Model 6. If N in the denominator of Equation 9 had been replaced by N − 3t, as done in regression analysis, the solutions would no longer be MLEs. Whether N or N − 3t is used to estimate
The least-squares method of Knott et al. (1996) simply ignores the correction for the residual variance, i.e., it assumes R = I. When dense markers are used or the QTL effect is small, or both, this assumption will have little effect on the results (Xu 1995).
Compared with the EM algorithm under a mixture model, this algorithm is extremely fast—only two to three cycles of iteration are required. However, three additional parameters are added to the model for each additional family, so the number of parameters to estimate grows quickly as the number of families increases.
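To make the growth in parameter count concrete, the following minimal sketch tabulates, for the eight sampling designs used in the simulations of this paper, the number of fixed-model parameters (three per family, as stated above) together with the two candidate denominators N and N − 3t. The function name and the output layout are illustrative assumptions, not part of the original analysis.

```python
# Illustrative sketch: parameter count under the fixed model.
# Assumes three parameters per family (beta, alpha, delta), as stated in the text.

# (number of families t, family size n) for the eight simulated designs
DESIGNS = [(1, 500), (3, 167), (6, 83), (10, 50),
           (15, 33), (20, 25), (50, 10), (100, 5)]


def fixed_model_counts(t, n):
    """Return (N, number of parameters 3t, N - 3t) for t families of size n."""
    total = t * n
    n_params = 3 * t
    return total, n_params, total - n_params


if __name__ == "__main__":
    for t, n in DESIGNS:
        total, n_params, reduced = fixed_model_counts(t, n)
        print(f"t={t:3d} n={n:3d} N={total:3d} params={n_params:3d} N-3t={reduced:3d}")
```

With t = 100 and n = 5, the model carries 300 parameters for 500 observations, so N and N − 3t differ substantially, whereas with a single family the two denominators are nearly identical.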
Under the null hypothesis, H0: α = δ = 0, the maximum likelihood is
In QTL mapping with multiple families, effort is redirected from a single family to multiple families. As a consequence, one is no longer interested in the α and δ of any particular family, but rather in α and δ from all families. However,
To estimate
An asymptotically unbiased estimate of
Two special situations may be noted. When only a single family is analyzed (the traditional QTL mapping strategy), A is only a scalar of value 1, which leads to
Random-model strategy: The second strategy of QTL mapping is to directly test and estimate the variances of the QTL effects, and because of this it is called the random-model approach. Consider each αi and δi as randomly sampled from a large hypothetical population with means of zero and variances of
Derivation of (15) is based on the assumption that α and δ are uncorrelated. If the indicator variables, Z and W, were observed, then the variance of y would be
Note that the definitions of λα and λδ are different from those of the fixed model. Note also that the family-specific effects, βs, have been treated as fixed effects, although they can be considered as random effects with a mean of zero and a common variance
Given the expectation and the variance of the model and under the pretense of a normal distribution of y, we have the following likelihood function:
The random-model strategy involves inverting V and computing its determinant. V is an N × N block diagonal matrix, and these operations can be time consuming for large blocks (each block is of n × n dimension). A simple algorithm developed for random-mating designs (S. Xu, unpublished data) can be adopted here. The algorithm provides the following matrix equivalencies:
NUMERICAL COMPARISON
Design of simulations: In this section, the two statistical methods are verified and compared numerically via Monte Carlo simulations. The criteria of verification are the standard errors of the parameter estimates and the statistical powers. Factors considered include (1) marker heterozygosity; (2) relative position of QTL; (3) mode of QTL inheritance; (4) QTL variances; (5) distribution of the QTL allelic effect; and (6) sampling strategy (family number vs. family size). Only a single chromosome segment of length 100 cM covered by 11 evenly spaced codominant markers is simulated. The total number of individuals [N = family number (t) × family size (n)] is set at ≈500 in all simulations. Under each condition, the simulation is repeated 100 times. The standard deviation of an estimated parameter among the 100 replicates provides a measure of the standard error of parameter estimation. The statistical power is determined by counting the number of runs (over the 100 replicates) that have test statistics greater than an empirical threshold. The empirical threshold value under each condition is obtained by choosing the 95th percentile of the highest test statistic over 1000 additional runs under the null model (no QTL is segregating).
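The threshold and power calculations described above can be sketched as follows. This is a minimal illustration rather than the original simulation code: the nearest-rank percentile convention is an assumption, and `toy_max_statistic` is a hypothetical stand-in for one simulated genome scan, not the test statistic used in the paper.

```python
import math
import random


def empirical_threshold(null_max_stats, alpha=0.05):
    """(1 - alpha) percentile of null maxima, nearest-rank convention (an assumption)."""
    ordered = sorted(null_max_stats)
    # round() guards against floating-point noise before taking the ceiling
    rank = math.ceil(round((1.0 - alpha) * len(ordered), 9))
    return ordered[rank - 1]


def empirical_power(max_stats, threshold):
    """Fraction of replicates whose highest test statistic exceeds the threshold."""
    return sum(1 for s in max_stats if s > threshold) / len(max_stats)


def toy_max_statistic(rng, shift=0.0, n_positions=101):
    """Hypothetical stand-in for one scan: highest of n_positions squared normals."""
    return max(rng.gauss(shift, 1.0) ** 2 for _ in range(n_positions))


if __name__ == "__main__":
    rng = random.Random(1998)
    # 1000 runs under the null model (no QTL segregating)
    null_runs = [toy_max_statistic(rng) for _ in range(1000)]
    # 100 replicates with a (toy) signal present
    qtl_runs = [toy_max_statistic(rng, shift=1.5) for _ in range(100)]
    threshold = empirical_threshold(null_runs)
    print("threshold:", round(threshold, 3))
    print("power:", empirical_power(qtl_runs, threshold))
```

Taking the maximum over the whole scanned segment before computing the percentile is what makes the threshold account for testing many positions at once.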
Marker heterozygosity in the population in which the inbred lines are sampled is simulated at three levels: (1) two alleles, (2) four alleles and (3) eight alleles. All alleles are equally frequent so that the marker heterozygosities represented by the three situations are one half, three quarters and seven eighths, respectively.
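The three heterozygosity values follow from the usual expression 1 − Σ pᵢ² with k equally frequent alleles, which reduces to 1 − 1/k. A short check, using exact fractions (the function names are illustrative):

```python
from fractions import Fraction


def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for the given allele frequencies."""
    return 1 - sum(p * p for p in allele_freqs)


def equal_frequencies(k):
    """k alleles, each with frequency 1/k."""
    return [Fraction(1, k)] * k


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(k, expected_heterozygosity(equal_frequencies(k)))
```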
A single QTL is located at one of the three possible positions (measured from the left end of the chromosome): 0 cM (overlapping with the first marker), 25 cM (between markers 3 and 4) and 50 cM (in the middle of the chromosome). The estimated QTL location takes the point of the chromosome segment that has the highest test statistic value.
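The marker layout and the position estimate can be illustrated with a small sketch. It assumes markers numbered from 1 at the left end and a scan evaluated on a discrete grid of positions; the function names and the toy test-statistic profile are hypothetical.

```python
def marker_positions(length_cm=100.0, n_markers=11):
    """Positions (cM) of evenly spaced markers covering the segment."""
    step = length_cm / (n_markers - 1)
    return [i * step for i in range(n_markers)]


def flanking_markers(position_cm, markers):
    """1-based indices (left, right) of the flanking markers; equal if on a marker."""
    for idx, pos in enumerate(markers, start=1):
        if position_cm == pos:
            return idx, idx
    for idx in range(len(markers) - 1):
        if markers[idx] < position_cm < markers[idx + 1]:
            return idx + 1, idx + 2
    raise ValueError("position lies outside the marker map")


def estimated_position(scan_positions, test_stats):
    """Scanned position with the highest test statistic."""
    best = max(range(len(test_stats)), key=test_stats.__getitem__)
    return scan_positions[best]


if __name__ == "__main__":
    markers = marker_positions()
    for qtl in (0, 25, 50):
        print(qtl, flanking_markers(qtl, markers))
    # Toy profile, for illustration only
    print(estimated_position([20, 22, 24, 26, 28], [1.1, 3.4, 7.9, 6.2, 2.0]))
```

With 11 markers over 100 cM the spacing is 10 cM, so the 25-cM QTL sits midway between markers 3 and 4, while the 0-cM and 50-cM QTLs coincide with markers 1 and 6.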
The mode of QTL inheritance is determined by the ratio of
The variance explained by the QTL is
Three distributions of the allelic effect of the QTL are considered. The first is a uniform distribution with 10 equally frequent alleles. Each allele is assigned a value between 0 and 9. The F1 hybrid of each family is generated by randomly sampling two of the 10 alleles with replacement. F2 individuals are then generated by selfing the F1 hybrid. The additive value of an F2 individual is the sum of the effects of the two alleles. The dominance effect takes the product of the two parental alleles. These genetic values (additive and dominance) are finally rescaled so that they have a mean of zero and the assigned variances. The second is a normal distribution with an infinite number of alleles. An F1 hybrid is made of two random alleles, each being assigned a value sampled from an N(0,1) distribution. The dominance effect between any two sampled alleles takes the product of the two allelic effects. When F2 individuals are generated, their genetic values at the QTL are rescaled so that they have the appropriate assigned variances. The third distribution has 10 alleles, each having a value between 0 and 9. The frequency of an allele, however, scales exponentially with its assigned effect. Let pj be the frequency of the j-th allele for j = 0, …, 9, then
The last but most important factor considered in the simulations is the sampling strategy: family number vs. family size (N = t × n ≈ 500). Eight levels are considered: (1) t × n = 1 × 500; (2) t × n = 3 × 167; (3) t × n = 6 × 83; (4) t × n = 10 × 50; (5) t × n = 15 × 33; (6) t × n = 20 × 25; (7) t × n = 50 × 10; and (8) t × n = 100 × 5.
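The link between the sampling strategy and the genetic drift type II error can be illustrated with a back-of-envelope calculation. The sketch below assumes the uniform distribution with 10 equally frequent QTL alleles and independent families, so that an F1 hybrid is homozygous at the QTL with probability Σ pⱼ² = 1/10; a design detects nothing at the QTL if every F1 is homozygous. This is only an illustration under those assumptions, not the power analysis reported in this paper, and the function names are hypothetical.

```python
from fractions import Fraction

# (number of families t, family size n) for the eight simulated designs
DESIGNS = [(1, 500), (3, 167), (6, 83), (10, 50),
           (15, 33), (20, 25), (50, 10), (100, 5)]


def prob_f1_homozygous(allele_freqs):
    """Probability that two alleles sampled with replacement are identical."""
    return sum(p * p for p in allele_freqs)


def prob_no_segregating_family(t, allele_freqs):
    """Probability that all t independent F1 hybrids are homozygous at the QTL."""
    return prob_f1_homozygous(allele_freqs) ** t


if __name__ == "__main__":
    uniform = [Fraction(1, 10)] * 10  # assumption: 10 equally frequent alleles
    for t, n in DESIGNS:
        p_none = prob_no_segregating_family(t, uniform)
        print(f"t={t:3d} n={n:3d} N={t * n:3d} P(no segregating family)={float(p_none):.3g}")
```

Under these assumptions a single family fails to segregate at the QTL one time in ten, whereas the probability falls to one in a thousand with three families. The totals t × n range from 495 to 501 across the eight designs.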
Instead of performing simulations under all possible cases, I simulated a situation in which the central level is chosen for each factor considered. This particular situation is then referred to as the “standard,” which is described as follows: (1) four equally frequent alleles for each marker locus; (2) the QTL located at 25 cM; (3) mixed mode of QTL inheritance, i.e.,
Table 1. Empirical threshold values for significance tests at α = 0.05, where α is the type I error rate
Table 2. Estimates of QTL parameters and empirical powers (α = 0.05) under different levels of marker polymorphism
Results of simulations: The empirical threshold values at a type I error rate of 0.05 are given in Table 1. The number of alleles per marker locus does not seem to have an influence on the threshold values. As the number of families increases, the threshold value increases under the fixed-model strategy considerably more than it does under the random-model strategy. This is expected because increasing the number of families increases the number of parameters tested under the fixed model while the number of parameters tested does not change under the random-model strategy.
When each marker has two equally frequent alleles in the population from which the parental lines are sampled, the two models have similar estimation errors and statistical powers (Table 2). The estimate of the QTL position, however, is biased and has a large error under both methods. Statistical power is also lower with two marker alleles than with more marker alleles. The fixed-model strategy generally provides a biased estimate of the residual variance, as shown in this and subsequent tables.
The proportion of the phenotypic variance explained by the QTL
Table 4 shows that when the QTL is located at one end of the chromosome segment, the estimate of the QTL position is biased toward the center and has a large error under both methods. There is little change in the power to detect a QTL as the true QTL position varies.
The mixed mode of QTL inheritance (additive and dominance) seems to have a higher statistical power than either the additive or the dominance mode of inheritance. The estimate of the QTL position is biased and has a large error under the dominance mode of inheritance. Again, the two methods do not show any major difference (see Table 5).
Distribution of the QTL allelic effect does not affect the comparison of the two methods (Table 6). It does, however, have an effect on the statistical power and the estimation errors of QTL parameters. The uniform distribution produces results similar to (in fact, slightly better than) the normal distribution. The exponential distribution decreases the statistical power and increases errors of parameter estimation.
Finally, the sampling strategy has a major impact on performance (Table 7). First, there seems to be an optimal sampling strategy (10 × 50) that leads to the highest statistical power and the smallest estimation errors of QTL parameters. Second, the sampling strategy of a single family causes a severe loss in power and large biases and errors in QTL parameter estimation. Third, the residual variance is underestimated as the number of families increases. This is especially so for the fixed model. Overall, the two strategies of QTL mapping perform equally well, except that the fixed-model approach is difficult to implement for a large number of families.
Table 3. Estimates of QTL parameters and empirical powers (α = 0.05) under different levels of heritability of the QTL
Table 4. Estimates of QTL parameters and empirical powers (α = 0.05) under three different locations of the QTL
DISCUSSION
Unless it is known that the parents are heterozygous at most QTLs for a trait of interest, it is generally recommended to use at least a few independent families for QTL analysis. Using more than a single family for QTL mapping may reduce the type II error caused by homozygous parents being sampled. In traditional QTL mapping using a single line cross, little attention has been paid to a type II error of this kind. This is because the two parental lines involved are not randomly selected from a pool of available strains; instead, they are selected to be at the opposite extremes for the trait of interest. As a consequence, it is almost guaranteed that most QTLs are heterozygous in the F1 parents, and thus a type II error of this kind is likely avoided. A nonrandom selection of parental lines can increase the statistical power for detecting QTLs responsible for the trait used as the selection criterion, but it may not be helpful in detecting QTLs responsible for other traits. In addition, one must be careful about the statistical inference space of the parameter estimation: because of the nonrandom selection, the estimate of the QTL effect is biased and applies only to the two parental lines, not to the pool of available strains from which the two lines were selected.
Although the two strategies of consensus QTL mapping appear to perform equally well, the fixed-model approach is generally less preferable for the following reasons. With multiple-family QTL mapping, one is no longer interested in the effect of gene substitution in any particular family, but rather is interested in the variance of the substitution effect among different families. In other words, the average effect of gene substitution is considered to be a random variable with variance
Table 5. Estimates of QTL parameters and empirical powers (α = 0.05) under different modes of inheritance of the QTL
Table 6. Estimates of QTL parameters and empirical powers (α = 0.05) under different allelic distributions of the QTL
The random-model approach to QTL mapping was originally developed in human genetic linkage analysis in which a large number of small families are often involved (Haseman and Elston 1972; Goldgar 1990; Schork 1993; Olson and Wijsman 1993; Fulker and Cardon 1994; Kruglyak and Lander 1995; Xu and Atchley 1995). Because linkage phases of markers in the parents are generally not known in small pedigrees, the random-model approach is often implemented through an identical-by-descent (IBD) based variance component analysis. The IBD-based method does not depend on information about linkage phases of the parents; rather, it utilizes information on the number of alleles IBD shared by two siblings. The random-model approach proposed in this paper is closely related to the IBD-based method. Recall that the variance–covariance matrix of the data is
Table 7. Estimates of QTL parameters and empirical powers (α = 0.05) under different sampling strategies (number of families × number of individuals per family)
In the random-model strategy, the family-specific effects, β, have been treated as fixed effects. When the number of families is large, however, it is desirable to treat β as random effects. By doing so, one only estimates a single parameter,
This paper demonstrates the algorithm of QTL mapping by combining multiple F2 families as an example. With the random-model approach, it is easy to extend the algorithm to combine all types of line-cross data, e.g., backcrosses, doubled haploids, and open-pollinated progenies. It is also not difficult to combine data from multiple full-sib and half-sib families. The method provides a general tool for data updating; i.e., QTL linkage analysis can be constantly updated as new data become available.
Acknowledgments
I thank Damian Gessler for helpful comments on the manuscript. This research was supported by the National Institutes of Health grant GM55321-01 and the National Research Initiative Competitive grants program/USDA 95-37205-2313.
Footnotes
- Communicating editor: C.-I Wu
- Received July 6, 1997.
- Accepted September 29, 1997.
- Copyright © 1998 by the Genetics Society of America