## Abstract

The selective genotyping approach, where only individuals from the high and low extremes of the trait distribution are selected for genotyping and the remaining individuals are not genotyped, has been known as a cost-saving strategy to reduce genotyping work and can still maintain nearly equivalent efficiency to complete genotyping in QTL mapping. We propose a novel and simple statistical method based on the normal mixture model for selective genotyping when both genotyped and ungenotyped individuals are fitted in the model for QTL analysis. Compared to the existing methods, the main feature of our model is that we first provide a simple way for obtaining the distribution of QTL genotypes for the ungenotyped individuals and then use it, rather than the population distribution of QTL genotypes as in the existing methods, to fit the ungenotyped individuals in model construction. Another feature is that the proposed method is developed on the basis of a multiple-QTL model and has a simple estimation procedure similar to that for complete genotyping. As a result, the proposed method has the ability to provide better QTL resolution, analyze QTL epistasis, and tackle multiple QTL problem under selective genotyping. In addition, a truncated normal mixture model based on a multiple-QTL model is developed when only the genotyped individuals are considered in the analysis, so that the two different types of models can be compared and investigated in selective genotyping. The issue in determining threshold values for selective genotyping in QTL mapping is also discussed. Simulation studies are performed to evaluate the proposed methods, compare the different models, and study the QTL mapping properties in selective genotyping. The results show that the proposed method can provide greater QTL detection power and facilitate QTL mapping for selective genotyping. 
Also, selective genotyping with larger genotyping proportions may provide power roughly equivalent to that of complete genotyping, whereas selective genotyping with smaller genotyping proportions has difficulty doing so. The R code of our proposed method is available at http://www.stat.sinica.edu.tw/chkao/.

The data in a QTL mapping study are usually composed of two parts, phenotypic trait values and marker genotypes of the individuals, and the cost of producing the data includes both phenotyping and genotyping costs. The cost ratio of phenotyping to genotyping may vary significantly depending on the traits and species under study. For a fixed budget and time frame, both costs must be considered and properly allocated to make the study optimally cost effective. If the total cost is not of primary concern in QTL experiments, all individuals in the entire sample will be genotyped and phenotyped for QTL analysis. However, QTL experiments are usually conducted under a limited budget, and researchers may not be able to genotype and phenotype a large number of individuals for QTL analysis. It is hence necessary to make a reasonable and effective allocation of the genotyping and phenotyping costs in the experiment. The selective genotyping approach has been known as a cost-effective strategy that reduces genotyping work while maintaining efficiency in QTL detection (Lebowitz *et al.* 1987; Lander and Botstein 1989). This approach selects individuals with extreme (high and low) phenotypic values for genotyping and leaves the remaining individuals in the sample ungenotyped. Later, several statistical methods were proposed to study QTL mapping under selective genotyping (Darvasi and Soller 1992; Muranty and Goffinet 1997; Henshall and Goddard 1999; Xu and Vogl 2000; Manichaikul *et al.* 2007). They again confirmed that selective genotyping can achieve reasonable power and precision in detecting QTL as compared to complete genotyping, but at the expense of a moderate increase in the number of phenotyped individuals.
As a larger number of individuals must be phenotyped first, before the required number of extreme individuals can be collected for genotyping, selective genotyping is most suitable for cases in which the phenotyping cost is relatively low compared with the genotyping cost. Many economically and biologically important traits, such as flowering time, fruit size and shape, crop (meat) yield or quality, growth in height and weight, plant stress and disease resistance, survival time, and blood pressure and body mass index in humans, can be measured at relatively low cost, and QTL mapping of these traits by selective genotyping can be more cost effective than that by complete genotyping. Consequently, although genotyping costs have been dropping recently, selective genotyping is still widely employed in QTL mapping for improving these traits and understanding their genetic basis in several plant and animal species (Abdel-Haleem *et al.* 2011; Lu *et al.* 2013; Miller *et al.* 2013; Fontanesi *et al.* 2014). A keyword search using Google Scholar also reveals that selective genotyping remains popular and has been used more frequently in genetics studies in recent years. The reason may be that, as marker genotyping becomes cheaper, more researchers are attracted to QTL mapping analysis but still face budgets insufficient to cover the expense of complete genotyping; selective genotyping then becomes an alternative, cost-effective choice that maintains efficiency similar to that of complete genotyping in QTL mapping.

In general, the statistical methods of QTL mapping for selective genotyping can be grouped into two major types according to whether their models take the ungenotyped individuals (ungenotyping data) into account in the analysis. Methods of the first type use only the genotyped individuals (genotyping data) and exclude the ungenotyping data to develop statistical models for QTL analysis (Darvasi and Soller 1992; Henshall and Goddard 1999; Xu and Vogl 2000). Darvasi and Soller (1992) calculated the trait means of the QTL genotypes from the selected tails and constructed a *t*-test for their difference to determine linkage between a marker and a QTL in the backcross population. Henshall and Goddard (1999) used a logistic regression approach to the analysis of selective genotyping data by treating phenotypes as independent variables and genotypes as dependent variables for QTL mapping. Xu and Vogl (2000) developed a truncated normal mixture model for the interval mapping procedure under the framework of selective genotyping in QTL detection. Methods of the second type consider both genotyped and ungenotyped individuals (full data) from selective genotyping for QTL analysis (Muranty and Goffinet 1997; Ronin *et al.* 1998; Xu and Vogl 2000). By analyzing full data, Muranty and Goffinet (1997) adopted a normal mixture model to detect QTL under selective genotyping in the backcross population. Ronin *et al.* (1998) proposed a mixture model for the study of interval mapping of two QTL under selective genotyping. Xu and Vogl (2000) also used a normal mixture model to tackle the issue of QTL mapping under selective genotyping in the F_{2} population. 
In their normal mixture models, the mixing proportions for the different QTL genotypes of the genotyped individuals are determined by the flanking markers as usual, and those of the ungenotyped individuals are set to the population frequencies (*e.g.*, 1/2 and 1/2 in the backcross, and 1/4, 1/2, and 1/4 in the F_{2} population for a single-QTL model). When comparing the two types of methods in selective genotyping, Xu and Vogl (2000) suggested that, whenever possible, analyses of full data are preferred over those of genotyping data, since the former tend to provide improved estimates and greater test statistics for QTL parameters.

In this article, we develop a novel and simple statistical method on the basis of a normal mixture model to analyze full data from selective genotyping for QTL detection. Compared to the existing methods (Muranty and Goffinet 1997; Ronin *et al.* 1998; Xu and Vogl 2000), the main novelties of our proposed method are twofold. The first novelty is that we obtain the proportions of QTL genotypes in the ungenotyped individuals by deducting the expected QTL frequencies in the genotyped individuals from their population frequencies. Then these proportions instead of the population QTL frequencies, as in the existing methods, are used to model the ungenotyped individuals in model construction, so that the proposed model can fit better to the data and yield better performance in selective genotyping. The second novelty is that our proposed model is developed on the basis of a multiple-QTL model, as has been done in QTL mapping for complete genotyping (Kao *et al.* 1999). The multiple-QTL model approach can take multiple QTL into account to control more genetic variation for improving QTL detection. As a result, with the novelties, our proposed method can provide better resolution, analyze epistasis, and tackle multiple QTL problems in QTL mapping under selective genotyping. In addition, a truncated normal mixture model based on the multiple-QTL model is developed when only genotyping data are used in the analysis. We then compare the differences in QTL mapping between the two types of selective genotyping methods and investigate their properties under the multiple-QTL framework, as their notable differences blurred in the single-QTL framework may appear in the multiple-QTL framework. The threshold values of selective genotyping for different models are also investigated. Simulation studies are conducted to evaluate the proposed method, compare different models, and study their properties in QTL mapping under selective genotyping.

## The Statistical Methods for Selective Genotyping

In selective genotyping, the selected individuals for genotyping are those with high or low trait values in the two extremes, and the remaining unselected individuals are those with average trait values in the middle. Therefore, the full data generated from selective genotyping can be split into two parts: genotyping data containing phenotypically extreme individuals with marker genotypes and ungenotyping data containing intermediate individuals without marker genotypes. As described in the Introduction, statistical QTL mapping methods for analyzing selective genotyping data can either take full data or take just genotyping data into account in their models for QTL detection. When considering full data in the analysis, it is important to know that the genotypic frequencies of QTL in the ungenotyping (or genotyping) data are no longer the same as those in the full data, and they will depend on the underlying QTL parameters and population structures. With such facts, we obtain the frequencies of QTL genotypes in the ungenotyped individuals and use them, rather than the population frequencies of QTL genotypes as in the existing methods, to model the ungenotyped individuals and then to propose a new normal mixture model for QTL mapping when full data are used in the analysis. As opposed to our proposed method, the existing methods using normal mixture model will hereinafter be called population frequency-based (PFB) methods. Furthermore, we also develop a truncated normal mixture model for QTL mapping when only genotyping data are used in the analysis. Both types of models are developed on the basis of a multiple-QTL approach, so that they can be compared under multiple-QTL analyses and be used to deal with the multiple-QTL problem. In the following, we first investigate the genotypic distributions in the genotyped and ungenotyped individuals, then present the truncated model, and finally outline our proposed method for selective genotyping.
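The split of the full data into genotyping and ungenotyping data can be illustrated with a minimal sketch. The function name `select_extremes` and the simulated trait values are illustrative assumptions, not part of the authors' R code; the sketch simply takes a proportion *θ*/2 of the sample from each tail of the trait distribution.

```python
import random

def select_extremes(trait_values, theta):
    """Split individuals into genotyped (two tails) and ungenotyped (middle).

    theta is the total genotyping proportion; theta/2 is taken from each tail.
    Returns two sorted lists of indices into trait_values.
    """
    n = len(trait_values)
    n_tail = int(round(n * theta / 2))  # individuals taken from each tail
    order = sorted(range(n), key=lambda i: trait_values[i])
    lower, upper = order[:n_tail], order[n - n_tail:]
    genotyped = sorted(lower + upper)
    ungenotyped = sorted(order[n_tail:n - n_tail])
    return genotyped, ungenotyped

# Hypothetical example mimicking a 100/200 design (theta = 0.5, n = 200).
random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(200)]
genotyped, ungenotyped = select_extremes(y, theta=0.5)
print(len(genotyped), len(ungenotyped))
```

Only the indices in `genotyped` would carry marker data in a later analysis; the trait values of the indices in `ungenotyped` remain available for a full-data model.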

### Genotypic distributions in the genotyped and ungenotyped individuals

In the F_{2} population, a QTL, *Q*, under consideration has three possible genotypes, *QQ*, *Qq*, and *qq*, with expected frequencies 1/4, 1/2, and 1/4, respectively. Their genotypic values, *G*_{2}, *G*_{1}, and *G*_{0}, can be related to the genotypic mean (*μ*), additive effect (*a*), and dominance effect (*d*) by

$$
\begin{bmatrix} G_2 \\ G_1 \\ G_0 \end{bmatrix}
= \mu \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+ \begin{bmatrix} 1 & -1/2 \\ 0 & 1/2 \\ -1 & -1/2 \end{bmatrix}
\begin{bmatrix} a \\ d \end{bmatrix}
= \mu \mathbf{1} + \mathbf{D}\mathbf{E}, \tag{1}
$$

where **D** is known as the genetic design matrix for characterizing the genetic effects of the QTL in the vector **E**. Under such settings, the trait value of an individual, *y*_{i}, affected by *Q* may have three possible distributions, *i.e.*, *y*_{i} | *QQ* ∼ *N*(*μ*_{1}, *σ*^{2}), *y*_{i} | *Qq* ∼ *N*(*μ*_{2}, *σ*^{2}), or *y*_{i} | *qq* ∼ *N*(*μ*_{3}, *σ*^{2}), where *μ*_{1}, *μ*_{2}, and *μ*_{3} are the corresponding genotypic values. If a selective genotyping approach with genotyping proportion *θ* (*θ*/2 from each tail) is conducted on a sample of size *N*, *N* × *θ* individuals from the two tails will have both trait and marker information, and *N* × (1 − *θ*) individuals from the middle will have only trait information. Let *T*_{R} and *T*_{L} be the right and left truncated points so that *P*(*y*_{i} < *T*_{L}) = *P*(*y*_{i} > *T*_{R}) = *θ*/2. For *N* large enough, the expected genotypic frequencies in the ungenotyped individuals are *w*_{j} × [Φ((*T*_{R} − *μ*_{j})/*σ*) − Φ((*T*_{L} − *μ*_{j})/*σ*)], *j* = 1, 2, 3, where Φ(⋅) denotes the standard normal cumulative distribution function, and *w*_{1} = 1/4, *w*_{2} = 1/2, and *w*_{3} = 1/4 are the expected genotypic frequencies in the whole population. Then, the proportions of the three QTL genotypes in the ungenotyped individuals can be straightforwardly obtained by

$$
\kappa_j = \frac{w_j\left[\Phi\!\left(\frac{T_R-\mu_j}{\sigma}\right)-\Phi\!\left(\frac{T_L-\mu_j}{\sigma}\right)\right]}{\sum_{k=1}^{3} w_k\left[\Phi\!\left(\frac{T_R-\mu_k}{\sigma}\right)-\Phi\!\left(\frac{T_L-\mu_k}{\sigma}\right)\right]}, \qquad j = 1, 2, 3, \tag{2}
$$

where the denominator equals 1 − *θ*, the expected proportion of ungenotyped individuals, so that the *κ*_{j}’s sum to one. As the *κ*_{j}’s are functions of the truncated points (*T*_{L} and *T*_{R}), genetic parameters (*μ*, *a*, *d*, and *σ*), and population frequencies (*w*_{j}’s), the values of the *κ*_{j}’s depend on factors such as genotyping proportions, heritability, sizes and modes of QTL actions, and population structure. For example, if *a* = *d* = 1, *h*^{2} = 0.5, and *θ* = 0.5 (*P*(*y*_{i} < *T*_{L}) = *P*(*y*_{i} > *T*_{R}) = 0.25), the two truncated points are *T*_{L} ≅ −0.785 and *T*_{R} ≅ 0.873. The genotypic frequencies are *κ*_{1} = 0.299 (≅0.150/0.5), *κ*_{2} = 0.599 (≅0.299/0.5), and *κ*_{3} = 0.101 (≅0.051/0.5), respectively, by Equation 2. Similarly, these genotypic proportions are 0.069 (0.332), 0.137 (0.665), and 0.794 (0.003) for *y*_{i} < *T*_{L} (*y*_{i} > *T*_{R}), respectively. It is worth noting that the expected proportions of the three QTL genotypes in the genotyped and ungenotyped samples are no longer the *w*_{j}’s as in the whole sample.
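The single-QTL example above can be checked numerically with a short sketch. It assumes *μ* = 0 (the text does not state the value), uses the genotype coding of Equation 1, derives *σ* from *h*^{2} as the ratio of genetic to total variance, and finds the truncated points by bisection on the mixture distribution. All function names are illustrative, not taken from the authors' R code.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mixture_cdf(t, w, mu, sigma):
    """P(y < t) for a normal mixture with weights w, means mu, common SD sigma."""
    return sum(wj * std_normal_cdf((t - mj) / sigma) for wj, mj in zip(w, mu))

def mixture_quantile(p, w, mu, sigma, lo=-50.0, hi=50.0):
    """Solve mixture_cdf(t) = p by bisection (the CDF is increasing in t)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, w, mu, sigma) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ungenotyped_proportions(w, mu, sigma, theta):
    """Return (T_L, T_R, kappa) following Equation 2."""
    t_left = mixture_quantile(theta / 2.0, w, mu, sigma)
    t_right = mixture_quantile(1.0 - theta / 2.0, w, mu, sigma)
    mass = [wj * (std_normal_cdf((t_right - mj) / sigma)
                  - std_normal_cdf((t_left - mj) / sigma))
            for wj, mj in zip(w, mu)]
    total = sum(mass)  # equals 1 - theta up to numerical error
    return t_left, t_right, [m / total for m in mass]

# Example from the text: a = d = 1, h^2 = 0.5, theta = 0.5; mu = 0 is assumed.
a, d, h2, theta = 1.0, 1.0, 0.5, 0.5
w = [0.25, 0.5, 0.25]
mu = [a - d / 2, d / 2, -a - d / 2]          # QQ, Qq, qq
mean_g = sum(wj * mj for wj, mj in zip(w, mu))
var_g = sum(wj * (mj - mean_g) ** 2 for wj, mj in zip(w, mu))
sigma = math.sqrt(var_g * (1.0 - h2) / h2)   # residual SD implied by h^2
T_L, T_R, kappas = ungenotyped_proportions(w, mu, sigma, theta)
print(round(T_L, 3), round(T_R, 3), [round(k, 3) for k in kappas])
```

Under these assumptions the computed truncated points and proportions should land close to the values quoted in the text; small differences in the third decimal place can arise from rounding.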

Equation 2 can be easily extended to the multiple-QTL case to obtain the genotypic proportions in the ungenotyped population. For example, in the case of two QTL, *Q*_{1} and *Q*_{2}, there are nine possible QTL genotypes, *Q*_{1}*Q*_{1}*Q*_{2}*Q*_{2}, *Q*_{1}*Q*_{1}*Q*_{2}*q*_{2}, *Q*_{1}*Q*_{1}*q*_{2}*q*_{2}, *Q*_{1}*q*_{1}*Q*_{2}*Q*_{2}, *Q*_{1}*q*_{1}*Q*_{2}*q*_{2}, *Q*_{1}*q*_{1}*q*_{2}*q*_{2}, *q*_{1}*q*_{1}*Q*_{2}*Q*_{2}, *q*_{1}*q*_{1}*Q*_{2}*q*_{2}, and *q*_{1}*q*_{1}*q*_{2}*q*_{2}. Their expected population frequencies are *w*_{1} = (1 − *r*)^{2}/4, *w*_{2} = *r*(1 − *r*)/2, *w*_{3} = *r*^{2}/4, *w*_{4} = *r*(1 − *r*)/2, *w*_{5} = (1 − *r*)^{2}/2 + *r*^{2}/2, *w*_{6} = *r*(1 − *r*)/2, *w*_{7} = *r*^{2}/4, *w*_{8} = *r*(1 − *r*)/2, and *w*_{9} = (1 − *r*)^{2}/4, respectively, where *r* is the recombination fraction between the two QTL. For *r* = 0.3, the values of the *w*_{j}’s are ∼0.123, 0.105, 0.023, 0.105, 0.290, 0.105, 0.023, 0.105, and 0.123, respectively. If the two QTL have effects (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) and *h*^{2} = 0.5, the values of the *κ*_{j}’s among the ungenotyped individuals are ∼0.040, 0.109, 0.028, 0.129, 0.377, 0.130, 0.029, 0.059, and 0.100, respectively, for *θ* = 0.5. If the two QTL are unlinked (*r* = 0.5), the values of the *w*_{j}’s are ∼0.063, 0.125, 0.063, 0.125, 0.250, 0.125, 0.063, 0.125, and 0.063, respectively, and the values of the *κ*_{j}’s are 0.043, 0.124, 0.067, 0.131, 0.270, 0.133, 0.067, 0.106, and 0.059, respectively. The differences between the *κ*_{j}’s and *w*_{j}’s may be minor, but can be significant for some genotypes. In general, greater genotyping proportions, higher heritability, and tighter linkage will cause larger differences between the values of the *w*_{j}’s and *κ*_{j}’s. Our proposed QTL mapping method for selective genotyping intends to use the *κ*_{j}’s rather than the *w*_{j}’s to model the relationship between the trait values and unobservable genotypes in the ungenotyped individuals.

The genetic model for multiple QTL, say *m* QTL, can be easily obtained from Equation 1 by augmenting the dimensions of the genetic design matrix according to the effects under consideration. Then, on the basis of the genetic model, the statistical model for fitting these *m* QTL, *Q*_{1}, *Q*_{2}, …, *Q*_{m}, without epistasis at given positions within the *m* separate marker intervals, (*M*_{1}, *N*_{1}), (*M*_{2}, *N*_{2}), …, (*M*_{m}, *N*_{m}), can be written as

$$
y_i = \mu + \sum_{k=1}^{m} \left(a_k x_{ik} + d_k z_{ik}\right) + \varepsilon_i, \tag{3}
$$

where the *x*_{ik}’s and *z*_{ik}’s are coded variables for the additive and dominance effects, *a*_{k}’s and *d*_{k}’s, of the *Q*_{k}’s, *k* = 1, 2, …, *m*, *y*_{i} is the quantitative trait value of the *i*th individual, and *ε*_{i} is a random error assumed to follow *N*(0, *σ*^{2}). Note that *x*_{ik} and *z*_{ik}, associated with *a*_{k} and *d*_{k}, are coded as (1, −1/2), (0, 1/2), and (−1, −1/2) for genotypes *Q*_{k}*Q*_{k}, *Q*_{k}*q*_{k}, and *q*_{k}*q*_{k}, respectively, and the above model can be easily extended to a model with epistasis by introducing product terms as the terms for epistasis. Under complete genotyping, the likelihood function of the statistical model for the parameters Θ is a mixture of 3^{m} normals,

$$
L(\Theta \mid \mathbf{Y}) = \prod_{i=1}^{n} \left[ \sum_{j=1}^{3^m} p_{ij}\, f(y_i \mid \mu_j, \sigma^2) \right], \tag{4}
$$

where *f*(*y*_{i} | *μ*_{j}, *σ*^{2}) is a normal p.d.f. with mean *μ*_{j} and variance *σ*^{2}, the *μ*_{j}’s correspond to the genotypic values of the 3^{m} QTL genotypes, and the *p*_{ij}’s are the mixing proportions inferred from the flanking marker genotypes. If the statistical model is applied to analyze data from selective genotyping, the likelihood will be different and will depend on whether all individuals or just the genotyped individuals are considered in the model, as described below.

### Model to analyze only genotyping data

Under selective genotyping, suppose that, among the *n* individuals, *n*_{s} individuals with extreme trait values (*n*_{s}/2 each from the upper and lower extremes) are selected for marker genotyping, and the remaining *n*_{u} (*n*_{u} = *n* − *n*_{s}) individuals are not genotyped. If only the genotyped individuals from the two extremes are utilized in the analysis, data of this sort are called centrally truncated data, and the methods for analyzing truncated data can be applied (Cohen 1991). Xu and Vogl (2000) incorporated the truncated model into the mixture structure of the interval mapping framework to propose a truncated normal mixture model for QTL analysis. They pointed out that maximizing the truncated normal mixture likelihood is a challenging task, and they used an EM algorithm to obtain the maximum-likelihood estimates (MLEs) of the QTL parameters for the model. Their analysis investigated the detection of a single QTL under selective genotyping. Here, we provide an alternative version of the EM algorithm for obtaining the MLEs of the truncated normal mixture model and use it to address more complicated issues involving mapping multiple QTL and analyzing their epistasis. For the *n*_{s} genotyped individuals, the likelihood function for Θ is

$$
L(\Theta \mid \mathbf{Y}_s) = \prod_{i=1}^{n_s} \left[ \sum_{j=1}^{3^m} p_{ij}\, \frac{f(y_i \mid \mu_j, \sigma^2)}{U_j} \right], \tag{5}
$$

where the *p*_{ij}’s are the conditional probabilities of QTL genotypes given marker genotypes, *f*(*y*_{i} | *μ*_{j}, *σ*^{2}) is the normal density with mean *μ*_{j} and variance *σ*^{2}, and

$$
U_j = 1 - \Phi\!\left(\frac{T_R - \mu_j}{\sigma}\right) + \Phi\!\left(\frac{T_L - \mu_j}{\sigma}\right)
$$

is the cumulative probability that the trait value of an individual with the *j*th QTL genotype is greater than *T*_{R} or lower than *T*_{L}. Statistically, the normal density *f*(*y*_{i} | *μ*_{j}, *σ*^{2}) is standardized by *U*_{j} to become a truncated normal density *f*(*y*_{i} | *μ*_{j}, *σ*^{2})/*U*_{j}. The details of the EM algorithm for obtaining the MLEs of the parameters in the truncated normal mixture likelihood are described in the *Appendix*. In summary, the (*t* + 1)th iteration of the EM algorithm is given below.

E-step: Update the posterior probabilities of the 3^{m} QTL genotypes,

$$
\pi_{ij}^{(t+1)} = \frac{p_{ij}\, f\!\left(y_i \mid \mu_j^{(t)}, \sigma^{2(t)}\right) / U_j^{(t)}}{\sum_{k=1}^{3^m} p_{ik}\, f\!\left(y_i \mid \mu_k^{(t)}, \sigma^{2(t)}\right) / U_k^{(t)}},
$$

for *j* = 1, 2, …, 3^{m}, *i* = 1, 2, …, *n*_{s}.

M-step: Find the estimates that maximize the conditional log-likelihood (see *Appendix*). Equivalently, we can obtain the estimates from the following equations. For *μ*, the QTL effects, and *σ*^{2}, the equations are Equations 6, 7, and 8, in which the posterior probability matrix contains the posterior probabilities of QTL genotypes for the *n*_{s} genotyped individuals, and **E** is a *k* × 1 column vector whose elements denote the QTL effects (*e.g.*, **E**_{2×1} = [*a d*]′ for a single-QTL model considering the additive and dominance effects). In these expressions, **R**_{μ}, **R**_{E}, and the analogous term for *σ*^{2} are correction terms that mainly account for the truncated normal distributions, *D*_{i} is a *k* × 1 column vector in the genetic model that is associated with the corresponding QTL effect in the *i*th element of **E**, and *δ* is an indicator variable. Also, *φ*(⋅) denotes the normal density function arising from the derivatives of log *U*_{j} (see *Appendix*), and the quantities involving it are key components of the correction terms. The E and M steps are iterated until convergence. The converged values of *μ*, **E**, and *σ*^{2} are the MLEs. We intended to express the solutions of the parameters (Equations 6–8) in the general formula format designed for complete genotyping (Kao and Zeng 1997), so that the two sets of equations can be compared and investigated under complete and selective genotyping. When all individuals are genotyped for markers and included in the analysis, the correction terms in Equations 6–8 vanish, and the equations reduce to those for complete genotyping.
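The E-step of the truncated normal mixture model can be sketched as follows, under the reading that each normal component is standardized by its own *U*_{j}, as the text describes. The function names are illustrative; the M-step with its correction terms is not reproduced here because it depends on the Appendix derivations.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(y, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def selection_prob(mu_j, sigma, t_left, t_right):
    """U_j: probability that a trait value from genotype j falls in either tail."""
    return (1.0 - std_normal_cdf((t_right - mu_j) / sigma)
            + std_normal_cdf((t_left - mu_j) / sigma))

def truncated_e_step(y, p, mu, sigma, t_left, t_right):
    """Posterior genotype probabilities for genotyped individuals.

    y: trait values of the genotyped individuals
    p: p[i][j], conditional genotype probabilities given flanking markers
    mu, sigma: current genotypic values and residual SD
    """
    u = [selection_prob(m, sigma, t_left, t_right) for m in mu]
    posterior = []
    for yi, pi in zip(y, p):
        terms = [pij * normal_pdf(yi, mj, sigma) / uj
                 for pij, mj, uj in zip(pi, mu, u)]
        total = sum(terms)
        posterior.append([t / total for t in terms])
    return posterior

# Hypothetical input: two genotyped individuals, one from each tail.
post = truncated_e_step(
    y=[-2.0, 1.8],
    p=[[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]],
    mu=[0.5, 0.5, -1.5], sigma=0.866, t_left=-0.785, t_right=0.873,
)
print([[round(v, 3) for v in row] for row in post])
```

When the two truncated points coincide, every *U*_{j} equals one and the posterior reduces to the usual complete-genotyping form.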

### Model to analyze full data

If all the *n* individuals, including the *n*_{s} genotyped individuals and the *n*_{u} ungenotyped individuals, are fitted into the statistical model (Equation 3) for QTL analysis, the model likelihood can be written as

$$
L(\Theta \mid \mathbf{Y}) = \prod_{i=1}^{n_s} \left[ \sum_{j=1}^{3^m} p_{ij}\, f(y_i \mid \mu_j, \sigma^2) \right] \times \prod_{i=1}^{n_u} \left[ \sum_{j=1}^{3^m} q_j\, f(y_i \mid \mu_j, \sigma^2) \right], \tag{9}
$$

where the first and second terms on the right-hand side are the likelihoods for the *n*_{s} genotyped and for the *n*_{u} ungenotyped individuals, respectively. Both likelihoods are normal mixture densities, as the QTL genotypes are not observed. In the likelihood for the *n*_{s} genotyped individuals, the mixing proportions, *p*_{ij}’s, of a genotyped individual *i* are obtained from the conditional probabilities of the QTL genotypes given its flanking marker genotype. In the likelihood for the *n*_{u} ungenotyped individuals, the mixture density of every individual is given the same mixing proportions, *q*_{j}’s. Since there is no marker available to infer the *q*_{j}’s, ideally, we would use the *κ*_{j}’s (Equation 2), *i.e.*, the proportions of QTL genotypes in the ungenotyped individuals (see *Genotypic distributions in the genotyped and ungenotyped individuals*), to play the role of the *q*_{j}’s in the likelihood for the ungenotyped individuals. However, the values of the *κ*_{j}’s depend on the unknown QTL parameters, which will complicate the maximum-likelihood estimation if they are used directly. To avoid the complication, the PFB methods use the *w*_{j}’s as the *q*_{j}’s in their models, *e.g.*, *q*_{1} = *w*_{1} = 1/4, *q*_{2} = *w*_{2} = 1/2, and *q*_{3} = *w*_{3} = 1/4 for *m* = 1 in the F_{2} model. Here, we propose the quantities

$$
\alpha_j = \frac{w_j - \beta_j}{\sum_{k=1}^{3^m} \left(w_k - \beta_k\right)}, \qquad j = 1, 2, \ldots, 3^m, \tag{10}
$$

for the approximation of the *κ*_{j}’s and use the *α*_{j}’s as the *q*_{j}’s in our proposed method. In *α*_{j}, *β*_{j} is given as

$$
\beta_j = \frac{1}{n} \sum_{i=1}^{n_s} p_{ij},
$$

which sums up the conditional probabilities of a QTL genotype (indexed by *j*) over the *n*_{s} genotyped individuals (indexed by *i*) and then divides the sum by *n*. Therefore, *β*_{j} is the expected frequency of a QTL genotype among the genotyped individuals in the whole sample. By subtracting *β*_{j} from its corresponding population frequency *w*_{j}, *i.e.*, (*w*_{j} − *β*_{j}), we can obtain the expected frequency of a QTL genotype among the ungenotyped individuals in the whole sample. The proposed quantities, the *α*_{j}’s in Equation 10, reweight these differences so that they sum to one and can serve as the proportions of QTL genotypes in the ungenotyped individuals, for use as the *q*_{j}’s in Equation 9. Equation 10 can be better understood through the following example: In the F_{2} population (*w*_{1} = 0.25, *w*_{2} = 0.5, and *w*_{3} = 0.25), if only one QTL coincident with a marker is considered, the expected genotypic frequencies are equivalent to the observed frequencies in the genotyped individuals. Under *θ* = 0.5, assume that the observed genotypic frequencies (relative to the whole sample) are 0.1, 0.2, and 0.2, respectively, in the genotyped individuals; then *β*_{1} = 0.1, *β*_{2} = 0.2, and *β*_{3} = 0.2. Consequently, we have *α*_{1} = 0.3, *α*_{2} = 0.6, and *α*_{3} = 0.1 by Equation 10. In practice, QTL are usually not coincident with markers, and the expected genotypic frequencies, *p*_{ij}’s, will be used to obtain the *β*_{j}’s and then the *α*_{j}’s. On very rare occasions, a negative value may occur in *w*_{j} − *β*_{j}, and a zero value is suggested as a replacement. Equivalently, we propose

$$
\alpha_j = \frac{\max\left(w_j - \beta_j,\, 0\right)}{\sum_{k=1}^{3^m} \max\left(w_k - \beta_k,\, 0\right)}, \qquad j = 1, 2, \ldots, 3^m, \tag{11}
$$

for practical use as the mixing proportions, *q*_{j}’s, in Equation 9. Our proposed model in Equation 9 has two features. First, it is as simple as the PFB methods in that the mixing proportions are fixed and need not be estimated, so that the estimation procedures are similar to those of the QTL mapping model under complete genotyping. In the parameter estimation, the EM algorithm for complete genotyping (Kao and Zeng 1997) can be directly applied to obtain the MLEs for our proposed model. In the E-step, we update the posterior probabilities of QTL genotypes for the *n*_{s} genotyped individuals and the *n*_{u} ungenotyped individuals,

$$
\pi_{ij}^{(t+1)} = \frac{p_{ij}\, f\!\left(y_i \mid \mu_j^{(t)}, \sigma^{2(t)}\right)}{\sum_{k=1}^{3^m} p_{ik}\, f\!\left(y_i \mid \mu_k^{(t)}, \sigma^{2(t)}\right)}
\quad \text{and} \quad
\pi_{ij}^{(t+1)} = \frac{q_j\, f\!\left(y_i \mid \mu_j^{(t)}, \sigma^{2(t)}\right)}{\sum_{k=1}^{3^m} q_k\, f\!\left(y_i \mid \mu_k^{(t)}, \sigma^{2(t)}\right)},
$$

respectively, at the current estimates of the parameters. In the M-step, the solutions for the parameters, *μ*, *σ*^{2}, and the QTL effects, have the same formulations as Equations 6–8 except that the correction terms, **R**_{μ}, **R**_{E}, and the analogous term for *σ*^{2}, vanish. Certainly, the posterior probability matrix must be adjusted according to the numbers of genotyped and ungenotyped individuals. The E and M steps are iterated until convergence. The converged values of the estimates are the MLEs. Second, because the *α*_{j}’s are estimates of the proportions of QTL genotypes in the ungenotyped individuals, our proposed model using the *α*_{j}’s as mixing proportions will fit the ungenotyped individuals better than the PFB method using the *w*_{j}’s. As a result, the proposed method can be more powerful in QTL detection under selective genotyping, as is validated in the simulation study.

## Simulation Result

Simulations were performed to evaluate the performance of our proposed method and to compare it with the currently used methods in QTL mapping under selective genotyping. Assume that a quantitative trait of interest is controlled by two unlinked epistatic QTL, Q_{A} and Q_{B}, in the F_{2} population. The two QTL are placed at 52 and 93 cM of two 150-cM chromosomes. Assume Q_{A} has additive (*a*_{1}) and dominance (*d*_{1}) effects, and Q_{B} has only an additive effect (*a*_{2}). Epistasis between the QTL is assumed to be present only for the additive-by-additive effect (*i*_{aa}). Further, assume four scenarios, (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), (*a*_{1} = 2, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), (*a*_{1} = 1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), and (*a*_{1} = −1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), for the four effects, which reflect different relative sizes of the two QTL and epistasis. For (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), Q_{A} and Q_{B} contribute 76 and 8% to the total genetic variance, and epistasis contributes 16% to the total genetic variance (*V*_{I}/*V*_{G} = 16%). Similarly, the contributions of Q_{A}, Q_{B}, and epistasis are approximately 60, 13, and 27% for (*a*_{1} = 2, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), and approximately 33, 22, and 44% for both (*a*_{1} = 1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) and (*a*_{1} = −1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2). For each scenario of the effect setting, two kinds of marker maps, 5 and 15 cM, two heritabilities of the quantitative trait, 0.1 and 0.2, and two selective genotyping proportions, 50 and 20%, are considered. The sample size is 200 for the selective proportion of 50% (the 100/200 design) and 500 for the selective proportion of 20% (the 100/500 design). Three models, the PFB model, the proposed model, and the truncated model (Xu and Vogl 2000), are used for selective genotyping analysis. Also, the results of complete genotyping in the 100/100, 200/200, and 500/500 designs are presented for comparison.
The number of simulated replicates is 1000. A stepwise selection procedure (Kao *et al.* 1999) was adopted to detect QTL and analyze epistasis. The threshold value of QTL mapping for selective genotyping has been found to be similar to that for complete genotyping (Muranty and Goffinet 1997; Manichaikul *et al.* 2007). We have further confirmed that the threshold values for selective genotyping are similar among the three methods and between the two designs on the basis of 10,000 simulation replicates (results not shown). Here, the approximate threshold values for complete genotyping obtained by the Gaussian stochastic process (Kao and Ho 2012) are used as those for selective genotyping. The obtained values at the 5% level are 9.18 (9.80) and 12.34 (13.35) for one and two degrees of freedom in the 15-cM (5-cM) marker map, respectively.
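The variance contributions quoted for the simulation scenarios can be reproduced with a small sketch. It assumes two unlinked QTL in an F_{2} population, the (1, 0, −1) and (−1/2, 1/2, −1/2) coding described earlier, and an epistatic term coded as the product of the two additive codes; under these assumptions the components are *a*^{2}/2, *d*^{2}/4, and *i*_{aa}^{2}/4. The function names are illustrative.

```python
import itertools

def variance_shares(a1, d1, a2, i_aa):
    """Shares of genetic variance due to Q_A, Q_B, and epistasis (unlinked F2 QTL)."""
    v_a = a1 ** 2 / 2 + d1 ** 2 / 4   # additive + dominance variance of Q_A
    v_b = a2 ** 2 / 2                 # additive variance of Q_B
    v_i = i_aa ** 2 / 4               # additive-by-additive epistatic variance
    v_g = v_a + v_b + v_i
    return v_a / v_g, v_b / v_g, v_i / v_g

def total_genetic_variance(a1, d1, a2, i_aa):
    """Total genetic variance by enumerating the nine two-QTL genotypes."""
    codes = [(1.0, -0.5, 0.25), (0.0, 0.5, 0.5), (-1.0, -0.5, 0.25)]  # (x, z, freq)
    values, freqs = [], []
    for (x1, z1, f1), (x2, _z2, f2) in itertools.product(codes, repeat=2):
        values.append(a1 * x1 + d1 * z1 + a2 * x2 + i_aa * x1 * x2)
        freqs.append(f1 * f2)
    mean = sum(f * v for f, v in zip(freqs, values))
    return sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))

for scenario in [(3, 1, 1, 2), (2, 1, 1, 2), (1, 1, 1, 2), (-1, 1, 1, 2)]:
    print(scenario, [round(s, 3) for s in variance_shares(*scenario)])
```

The enumeration in `total_genetic_variance` serves as a check that the three closed-form components add up to the full genetic variance under the stated coding.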

To shorten the article, the results for the scenario of (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) are reported in detail regarding the power and estimation (Table 1, Table 2, Table 3, and Table 4), and the results of the other scenarios are reported only for power (Table 5). For the (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) scenario, the analyses using the one-QTL model of the three selective genotyping methods are first applied to QTL detection. Under the one-QTL model, the three methods have similar performance in detection power and parameter estimation of QTL effects and positions (results not shown). The powers of the three methods to detect the larger QTL, Q_{A}, are all close to 100%. And their powers to detect the smaller QTL, Q_{B}, in the 15- and 5-cM marker maps are ∼16% and ∼18% under the 100/200 design and are ∼24% and 27% under the 100/500 design. Further analyses using the multiple-QTL model are then followed for all the complete and selective genotyping methods. Table 1 and Table 2 show the results of QTL mapping under complete genotyping (with the 100/100 and 200/200 designs) and selective genotyping (with the 100/200 design) when epistasis is ignored and considered in the 5- and 15-cM marker maps. In general, for all methods and designs, greater power in detection and better quality in estimation can be achieved when the marker map is denser and epistasis is taken into account. The results for considering epistasis in the analyses are described here. In the complete genotyping 100/100 design, the powers of detecting Q_{A} and Q_{B} are 93.4% (90.3%) and 23.0% (19.6%), respectively, in the 5-cM (15-cM) marker map. In the complete genotyping 200/200 design, they are 100% (99.9%) and 56.1% (50.7%), respectively, in the 5-cM (15-cM) marker map. 
In the selective genotyping 100/200 design, the powers of detecting Q_{A} and Q_{B} by the PFB method are 99.8% (99.4%) and 43.6% (38.5%), the powers by the truncated model are 100% (99.5%) and 43.6% (38.1%), and the powers by the proposed method are 99.7% (99.5%) and 56.0% (49.8%) in the 5-cM (15-cM) marker map, respectively. This shows that the proposed method is more powerful in QTL detection than the PFB method and the truncated model and that the proposed method in the 100/200 design can provide power similar to that of complete genotyping under the two marker maps. All selective genotyping methods provide similar precision and accuracy in the estimation of QTL positions. For example, in the 5-cM marker map, the means of the estimated Q_{A} and Q_{B} positions by the PFB method, truncated model, and proposed method are at 52.2 cM (SD 7.0 cM) and 89.9 cM (SD 26.7 cM), 52.3 cM (SD 7.2 cM) and 89.8 cM (SD 27.4 cM), and 52.2 cM (SD 7.0 cM) and 89.0 cM (SD 28.5 cM), respectively. In the estimation of the QTL effects, the three methods generally perform well, as the means of their estimated effects are all very close to the true values. For example, in the 5-cM marker map, the means of the two estimated additive effects are 3.02 (SD 0.57) and 1.04 (SD 0.75) by the PFB method. The means are 3.21 (SD 0.71) and 1.09 (SD 0.82) by the truncated model, and the means are 3.20 (SD 0.64) and 1.00 (SD 0.81) by the proposed method. The epistatic effect between QTL can also be estimated well in selective genotyping. The means of the estimated epistatic effects are 2.01 (SD 1.15), 2.14 (SD 1.35), and 2.24 (SD 1.36), respectively. In complete genotyping, the means are 1.89 (SD 1.68) and 2.06 (SD 0.98) for the 100/100 and 200/200 designs, respectively.

Table 3 and Table 4 show the results of QTL mapping under complete genotyping (with the 100/100 and 500/500 designs) and selective genotyping (with the 100/500 design), with epistasis either ignored or considered, in the 5- and 15-cM marker maps. Again, for all methods and designs, greater detection power and better estimation quality are achieved in the denser marker map and after taking epistasis into account. The results for the analyses that consider epistasis are described here. Except for the cases in the 100/100 design, the powers to detect the larger QTL, Q_{A}, are all 100%. The powers to detect the smaller QTL, Q_{B}, vary. In the complete genotyping 500/500 design, the powers to detect Q_{B} are 96.5 and 93.1% in the 5- and 15-cM marker maps, respectively. In the selective genotyping 100/500 design, the powers of detecting Q_{B} in the 5- and 15-cM marker maps are 69.9 and 62.7% by the PFB method, 59.3 and 55.0% by the truncated model, and 80.9 and 75.0% by the proposed method. The proposed method thus provides greater power to detect Q_{B} than the PFB method and the truncated model. Also, the powers in the 100/500 design provided by the three methods are all substantially lower than those in the 500/500 design. This suggests that selective genotyping in the 100/500 design has difficulty maintaining power equivalent to that of complete genotyping (see *Conclusion and Discussion* for the reason). In parameter estimation, the three selective genotyping methods all perform well and provide similar precision and accuracy for the estimates (see Table 3 and Table 4).

Table 5 presents the detection powers obtained by complete genotyping and the different selective genotyping methods in the four settings under different designs, heritabilities, and marker maps. In general, the proposed method is more powerful than the PFB method and the truncated model, especially for detecting Q_{B} in the (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) setting. For example, for *h*^{2} = 0.2 in the 100/200 (100/500) design, the powers of detecting Q_{B} by our proposed method are 0.560 (0.809) and 0.498 (0.750) under the 5- and 15-cM marker maps, respectively. The powers of detecting Q_{B} by the PFB method are 0.436 (0.699) and 0.385 (0.627), respectively, and those by the truncated model are 0.436 (0.593) and 0.381 (0.550), respectively. In addition, in most cases, the (proposed) model fitting the full data is more powerful than the truncated model fitting only the genotyped data, and this is more evident in the 100/500 design. For example, in the (*a*_{1} = −1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) setting of the 100/500 design, under *h*^{2} = 0.1 and the 5-cM marker map, the powers to detect Q_{A} and Q_{B} by the proposed (PFB) method are 0.948 (0.933) and 0.865 (0.833), and the powers by the truncated model are 0.896 and 0.750. Also, the detection powers obtained by the proposed method in the 100/200 design are close to those obtained by complete genotyping in the 200/200 design. 
For example, in the case of *h*^{2} = 0.1 and the 5-cM marker map in the 100/200 design (200/200 design), the powers to detect Q_{A} and Q_{B} are 0.765 and 0.123 (0.817 and 0.103), 0.722 and 0.280 (0.775 and 0.249), 0.659 and 0.510 (0.699 and 0.536), and 0.667 and 0.506 (0.696 and 0.528), respectively, in the (*a*_{1} = 3, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), (*a*_{1} = 2, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), (*a*_{1} = 1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2), and (*a*_{1} = −1, *d*_{1} = 1, *a*_{2} = 1, *i*_{aa} = 2) settings. Similar trends can be observed in the other cases in the 100/200 design as compared to the results from the corresponding 200/200 design (the slightly higher power with selective genotyping in some cases may be due to simulation error). This shows that our proposed method (in the 100/200 design) is better able to maintain power similar to that of complete genotyping (in the 200/200 design) than the other two models (in the 100/200 design). Nevertheless, depending on the (relative) sizes of the QTL, the detection powers obtained by the selective genotyping methods in the 100/500 design are clearly lower than those obtained by complete genotyping in the 500/500 design (see Table 5).
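
The remark that a slightly higher power under selective genotyping may reflect simulation error can be checked with a binomial standard error. The following is a minimal sketch and not part of the original analysis: the number of simulation replicates (`n_reps = 1000`) is an assumption made purely for illustration, the two power estimates are treated as independent, and the function names are hypothetical.

```python
import math

def power_se(power, n_reps):
    """Binomial standard error of a power estimate based on n_reps replicates."""
    return math.sqrt(power * (1.0 - power) / n_reps)

def power_diff_z(p1, p2, n_reps):
    """z-score for the difference of two power estimates, assumed independent."""
    se_diff = math.sqrt(power_se(p1, n_reps) ** 2 + power_se(p2, n_reps) ** 2)
    return (p1 - p2) / se_diff

n_reps = 1000  # ASSUMPTION: replicate count chosen only for illustration

# Q_B powers quoted above for h^2 = 0.1, 5-cM map, first setting:
# 0.123 (proposed method, 100/200) versus 0.103 (complete genotyping, 200/200).
z = power_diff_z(0.123, 0.103, n_reps)
print(f"z = {z:.2f}")
```

Under the assumed replicate count, the difference between 0.123 and 0.103 is about 1.4 standard errors, below the conventional 1.96 cutoff, which is consistent with attributing it to simulation error.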

## Conclusion and Discussion

The approach of selective genotyping has been widely used for the detection and validation of QTL in genetic studies (Lander and Botstein 1989; Manichaikul *et al.* 2007; Vikram *et al.* 2012; Lu *et al.* 2013; Fontanesi *et al.* 2014). Statistical QTL mapping methods developed for selective genotyping can consider either the full data or only the genotyping data in the QTL analysis (Xu and Vogl 2000). When only the genotyping data are used in the analysis, the truncated mixture model can be applied to the QTL analysis. If the full data are considered in the analysis, the mixture model can be readily applied to QTL detection. In this article, we proposed a novel mixture model on the basis of the multiple-QTL model for selective genotyping when the full data are included in the analysis. Our proposed model uses the proportions of QTL genotypes in the ungenotyped individuals, rather than the population proportions of QTL genotypes as in the PFB model, to fit the ungenotyped individuals in model construction. Consequently, as shown in this article, the proposed method has the ability to provide better resolution for QTL detection, to analyze epistasis, and to deal with multiple-QTL problems in selective genotyping. In addition, we provide a version of the multiple-QTL approach for the truncated mixture model, so that QTL mapping under selective genotyping can be compared and investigated among the different models in multiple-QTL conditions. Simulation results show that our proposed method is more powerful than the PFB and truncated methods. Notably, under the 100/200 design, our selective genotyping method can produce roughly equivalent power to complete genotyping (the 200/200 design), whereas the PFB methods may have difficulty doing so. 
Under the 100/500 design, selective genotyping has difficulty matching complete genotyping (the 500/500 design) in QTL detection power, because the linkage information about QTL from the genotyped individuals is inadequate relative to that from the entire sample. Also, the analysis using full data, such as by using our proposed method, performs better than that using only genotyping data, because additional information from the ungenotyped individuals is incorporated into the analysis (Xu and Vogl 2000).
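
The sampling scheme behind these designs can be sketched in a few lines. This is an illustration only: it reads the 100/200 design as 100 genotyped individuals out of a sample of 200, split equally between the two tails of the trait distribution (both readings are assumptions here), and the function name `select_extremes` and the simulated trait values are hypothetical.

```python
import random

def select_extremes(trait_values, n_genotyped):
    """Return sorted indices of the n_genotyped/2 lowest and n_genotyped/2 highest trait values."""
    half = n_genotyped // 2
    order = sorted(range(len(trait_values)), key=lambda j: trait_values[j])
    return sorted(order[:half] + order[len(order) - half:])

random.seed(1)
# ASSUMPTION: standard normal trait values stand in for real phenotypes.
y = [random.gauss(0.0, 1.0) for _ in range(200)]

genotyped = select_extremes(y, 100)  # individuals sent for marker genotyping
chosen = set(genotyped)
ungenotyped = [j for j in range(len(y)) if j not in chosen]  # trait values only
print(len(genotyped), len(ungenotyped))
```

In a full-data analysis both index sets enter the likelihood, whereas a truncated-model analysis uses only the individuals in `genotyped`.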

In the selective genotyping experiment, the individuals with intermediate trait values (in the central part of the trait distribution) contain less information about QTL than the extreme individuals (Lander and Botstein 1989) and are not genotyped for markers. When fitting them into the model, there is no marker information from which to infer the genotypic distributions of the underlying QTL. As a result, the PFB methods use the (fixed) genotypic distribution of QTL in the population to model the ungenotyped individuals (Muranty and Goffinet 1997; Xu and Vogl 2000). However, the genotypic distribution of the QTL in these ungenotyped individuals will differ from that in the population and depends on several known and unknown factors. The known factors include the population structure and selection intensity. The unknown factors include the modes (additive, dominance, epistasis) of QTL action, the sizes of the QTL, and the linkage strength between QTL. That is to say, the distribution of QTL genotypes in the ungenotyped individuals is informative about these factors. In contrast, the genotypic distribution of QTL in the population is not so informative about these factors in modeling the ungenotyped individuals, as it is equivalent to that in the ungenotyped individuals only under the null hypothesis that there is no QTL. Therefore, a method can gain an advantage and fit the data from selective genotyping better if it uses the distribution of QTL genotypes in the ungenotyped individuals to model their trait values in model construction. This consideration is ignored by the PFB methods but is taken into account in our proposed model (Equation 9). Hence, our proposed method can further improve QTL mapping under selective genotyping. The empirical threshold values for selective genotyping and complete genotyping have been found to be similar (Muranty and Goffinet 1997; Manichaikul *et al.* 2007). 
We have further identified that the proposed, PFB, and truncated models have similar empirical threshold values, which may be because they are all constructed by assuming the same population genomic structure.
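
The argument that the QTL genotype distribution among ungenotyped individuals equals the population distribution only when there is no QTL can be illustrated with Bayes' rule. The sketch below is not the estimator of Equation 9. It assumes a single QTL, an F2-type 1:2:1 population distribution, normally distributed residuals, and fixed selection thresholds, and the function name and all numerical values are hypothetical.

```python
from statistics import NormalDist

def ungenotyped_genotype_probs(pop_probs, genotype_means, sigma, t_low, t_high):
    """P(genotype | t_low < y < t_high) by Bayes' rule under a normal trait model."""
    weights = []
    for p, mu in zip(pop_probs, genotype_means):
        dist = NormalDist(mu, sigma)
        # Prior proportion times the probability of landing in the unselected middle.
        weights.append(p * (dist.cdf(t_high) - dist.cdf(t_low)))
    total = sum(weights)
    return [w / total for w in weights]

pop = [0.25, 0.5, 0.25]  # ASSUMPTION: F2 proportions for the three QTL genotypes

# No QTL: every genotype has the same mean, so the middle group mirrors the population.
null = ungenotyped_genotype_probs(pop, [0.0, 0.0, 0.0], 1.0, -0.7, 0.7)

# Additive QTL: homozygotes are pushed toward the tails and depleted in the middle.
alt = ungenotyped_genotype_probs(pop, [-1.0, 0.0, 1.0], 1.0, -0.7, 0.7)
print(null, alt)
```

With a nonzero additive effect, the heterozygote proportion in the middle group rises above 0.5, so fitting the ungenotyped individuals with the fixed population proportions misstates their genotype distribution.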

It is challenging to explore analytically how the difference between QTL effects affects the efficiency of selective genotyping (relative to complete genotyping). Ronin *et al.* (1998) showed that, under selective genotyping, the power of separating two linked QTL with effects in different directions is higher than that with effects in the same direction, which was also validated by Kao and Zeng (2010) in complete genotyping. Sen *et al.* (2009) found in their information analysis that when one or more QTL have large effects, the effectiveness can be unpredictable. This implies that selective genotyping may show better efficiency in QTL mapping under some combinations of QTL parameters than under others, which is validated in our simulation study. Selective genotyping might be most convenient and suitable for the cases in which only one trait is targeted for selection and analyzed in QTL mapping (Darvasi 1997). Multiple-trait analysis, which includes the targeted trait and its correlated traits in the model at the same time, can also improve the power of QTL detection by taking into account the correlation structure of the traits (Jiang and Zeng 1995; Muranty *et al.* 1997; Ronin *et al.* 1998). When multiple traits are targeted for selective genotyping experiments, extreme individuals in one trait may not be the extreme individuals in the others, and the issues of which traits should be selected and how to define the selection criteria are important and have been discussed by several researchers (Lin and Ritland 1997; Muranty and Goffinet 1997; Muranty *et al.* 1997; Xu and Vogl 2000). Muranty *et al.* (1997) suggested using a selection index combining all trait values to select the extreme individuals and then treating their index values as new single-trait values in the analysis. Our method can be readily applied to analyze the index values for multiple-trait analysis. 
For directly taking multiple traits into account in the analyses, the multivariate versions of the PFB model fitting one and two QTL have been developed by Muranty *et al.* (1997) and Ronin *et al.* (1998), respectively, and the multivariate version of the truncated normal mixture model for fitting one QTL has been proposed by Xu and Vogl (2000). The multivariate version of our proposed model for directly analyzing multiple traits has not been considered here and can be pursued in future studies on the foundation of the multivariate normal mixture model laid by Jiang and Zeng (1995), Muranty *et al.* (1997), Ronin *et al.* (1998), and Xu and Vogl (2000).

## Acknowledgments

The authors are grateful to Mr. Po-Ying Chu for the discussions. We are also grateful to the associate editor and the referees for their valuable comments and suggestions. This study was supported in part by grant NSC102-2313-B-001-019 from the National Science Council, Taiwan, Republic of China.

## Appendix

When applying the model in Equation 3 to analyze genotyping data from selective genotyping, the likelihood function is a mixture of truncated normals with different means and mixing proportions. By treating the truncated normal mixture model as an incomplete-data problem (Little and Rubin 1987), we can apply the EM algorithm to obtain the MLEs of QTL parameters. Let [symbol missing] and [symbol missing] be the left and right truncation points after standardizing by their means and standard deviations. Then, we have [equation missing], where *φ*(⋅) and Φ(⋅) denote the standard normal p.d.f. and c.d.f., respectively. For illustration, we take an additive model for *m* QTL as an example. The EM algorithm for obtaining the MLE of the parameters in the likelihood (Equation 5) is described below. In the E step, the conditional expected complete-data log-likelihood [equation missing] is computed, where the *π*_{ij}'s are the posterior probabilities of QTL genotype. The last term on the right-hand side is a correction term for truncation.
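
The idea of a mixture of truncated normals can be sketched numerically. The code below is one common way to write such a likelihood (the mixture density divided by the mixture probability of falling in a selected tail) and is offered only as an illustration. It is not claimed to reproduce Equation 5, and the function names, the equal-variance assumption, and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, provides pdf (phi) and cdf (Phi)

def selection_prob(mu, sigma, t_low, t_high):
    """P(y <= t_low or y >= t_high) for y ~ N(mu, sigma^2)."""
    tau_low = (t_low - mu) / sigma    # standardized left truncation point
    tau_high = (t_high - mu) / sigma  # standardized right truncation point
    return STD_NORMAL.cdf(tau_low) + 1.0 - STD_NORMAL.cdf(tau_high)

def truncated_mixture_loglik(y, mix_probs, genotype_means, sigma, t_low, t_high):
    """Log-likelihood contribution of one genotyped (selected) individual."""
    pairs = list(zip(mix_probs, genotype_means))
    # Untruncated normal mixture density at y.
    numer = sum(p * STD_NORMAL.pdf((y - mu) / sigma) / sigma for p, mu in pairs)
    # Probability that an individual with these mixing proportions is selected.
    denom = sum(p * selection_prob(mu, sigma, t_low, t_high) for p, mu in pairs)
    return math.log(numer) - math.log(denom)

# ASSUMPTION: illustrative mixing proportions, genotype means, and thresholds.
ll = truncated_mixture_loglik(1.5, [0.1, 0.3, 0.6], [-1.0, 0.0, 1.0], 1.0, -0.7, 0.7)
print(ll)
```

Subtracting the log of the selection probability plays the role of a truncation correction: it vanishes when the two thresholds coincide (no truncation) and raises the log-likelihood of a selected individual otherwise.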

The M step is to find Θ^{(t+1)} that maximizes the conditional expected log-likelihood *Q*(Θ|Θ^{(t)}). We first obtain the derivatives of log *U*_{j} with respect to each parameter as follows. For the effect, say *a*_{i}, associated with the *i*th column of the genetic design matrix **D** (*D*_{i}), the derivative is [equation missing], where *D*_{ij} is the *ij*-entry of the design matrix **D**. For *μ* and *σ*^{2}, the derivatives are [equation missing] and [equation missing], respectively. For simplicity in the following derivations, we define [equation missing]. The partial derivatives of *Q*(Θ|Θ^{(t)}) with respect to each parameter are described below. For one of the QTL effects, say *a*_{1}, the derivative is [equation missing], where # denotes the element-by-element product of two same-order matrices [remaining symbol definitions missing]. The derivatives for the other effects have similar expressions. For *μ* and *σ*^{2}, the derivatives are [equation missing], where [equation missing] and [equation missing]. The solutions of the above partial derivatives can be obtained in closed form. For the effect *a*_{1}, [equation missing]. The solutions for the other effects have similar expressions and can be easily formulated accordingly by operating on their corresponding column vectors in **D**. These solutions for the effects can be arranged together and expressed in matrix form as Equation 7. The solutions for *μ* and *σ*^{2} are [equation missing]. It is worth pointing out that if other possible effects are considered in the model, the solutions can still be formulated as Equation 7.

## Footnotes

*Communicating editor: S. Sen*

- Received July 15, 2014.
- Accepted September 16, 2014.

- Copyright © 2014 by the Genetics Society of America