## Abstract

We present a novel semiparametric method for quantitative trait loci (QTL) mapping in experimental crosses. Conventional genetic mapping methods typically assume parametric models with Gaussian errors and obtain parameter estimates through maximum-likelihood estimation. In contrast with univariate regression and interval-mapping methods, our model requires fewer assumptions and also accommodates various machine-learning algorithms. Estimation is performed with targeted maximum-likelihood learning methods. We demonstrate our semiparametric targeted learning approach in a simulation study and a well-studied barley data set.

METHODOLOGY for quantitative trait loci (QTL) mapping has been an active area of research in recent decades. QTL mapping aims to identify the genes that underlie an observed trait, using genetic markers along the genome. A substantial amount of literature has been devoted to accurately identifying QTL, with a range of procedures (Sax 1923; Thoday 1960; Lander and Botstein 1989; Haley and Knott 1992; Jansen 1993; Zeng 1994; Satagopan *et al.* 1996; Heath 1997; Sillanpaa and Arjas 1998; Kao *et al.* 1999; Lee *et al.* 2008). Analysis of variance for single markers was proposed in early work, while interval mapping (IM), composite-interval mapping (CIM), and multiple-interval mapping (MIM) have emerged as popular approaches in contemporary research. Bayesian models and machine-learning algorithms have also been studied to map QTL.

IM involves setting a multinomial distribution for the genotype and a Gaussian mixture model for the trait value, and then a likelihood-ratio test is used to determine the significance of the QTL effect. Positions are tested separately at small increments across the genome, and a finely scaled whole-genome test statistic profile is constructed (Lander and Botstein 1989). A regression-based method, dubbed Haley–Knott regression, approximates IM. Haley–Knott regression involves imputing the unobserved genotype of a putative QTL, replacing it with its expected value (Haley and Knott 1992). These IM methods require the unrealistic assumption that only one QTL across the genome is responsible for the observed trait. When individual QTL are considered, possibly confounding QTL are ignored. If this conflicts with the true underlying data distribution, the effects of these other QTL are incorporated into the residual variance.

Both CIM and MIM were developed to handle multiple QTL. CIM adds background markers in an IM model to increase the accuracy of QTL effect estimates; the effects are now adjusted for possibly confounding QTL (Jansen 1993; Zeng 1994). MIM also estimates effects and positions of multiple QTL at the same time, but while it has greater power compared to CIM, it is computationally burdensome (Kao *et al.* 1999). There is also a significant problem in deciding which QTL to include. Zeng (1994) provides additional background on traditional regression and IM methods for QTL mapping.

Analysis of variance methods for identifying QTL (Sax 1923; Thoday 1960) do not allow one to examine QTL between the markers. However, as finely scaled single-nucleotide polymorphism (SNP) markers have replaced the traditionally widely spaced microsatellite markers, identifying QTL between markers has become less of an issue. Since SNP data are high dimensional, univariate marker–trait regressions are often favored due to their ease of implementation and computational feasibility despite noisy results.

Machine-learning algorithms, such as random forest (Breiman 2001), have been applied to dense SNP data. They are often adept at identifying interactions between genes and also predicting the conditional expectation of the outcome given the other genetic markers (Lee *et al.* 2008). Unfortunately, their effect measures do not provide *P*-values and are otherwise not targeted toward the effects of interest.

Most of these methods are fully parametric, requiring the specification of a parametric regression model and often assuming a Gaussian distribution for the phenotypic trait (machine-learning methods being the exception). Maximum-likelihood estimators are most commonly used for estimation in these parametric models; these estimators have been rigorously studied and are easily implemented in various software platforms. Unfortunately, in QTL mapping applications, parametric models oversimplify the underlying genetic mechanism, and thus estimates of the effects of interest will be biased. If the parametric model is selected after running several parametric regression models on the full data, the standard errors will not be interpretable. This is due to the fact that the standard errors are based on a single *a priori*-specified parametric model and do not account for the multiple runs of the regression models.

Our approach specifies a less restrictive semiparametric regression model and estimates the parameters of interest within this model with the double-robust and efficient two-step *targeted* maximum-likelihood estimator (TMLE) (van der Laan and Rubin 2006; van der Laan and Rose 2011). The model makes the assumption that the phenotypic trait changes linearly with the QTL while allowing one to explore a larger model space under fewer restrictions. The TMLE framework targets the effects of interest (*i.e.*, the QTL) rather than the entire distribution, providing more accurate estimates of effect sizes. TMLE can also incorporate the prediction power of machine-learning algorithms while maintaining computational feasibility and leading to improved QTL effect estimates and rankings.

## Methods

For the ease of presentation, we use a backcross design to demonstrate our methods. The derivations can be readily extended to other types of experimental crosses, such as intercross (F_{2}) or double haploid (DH). The backcross is produced by backcrossing the first generation (F_{1}) to one of its parental strains, and there are two possible genotypes *Aa* and *aa* at a locus.

### Semiparametric model

Our semiparametric model assumes that the phenotypic trait changes linearly with the QTL, but it does not specify a restrictive parametric model. Thus, we can explore a larger model space. The observed data are i.i.d. realizations of *O _{i}* = (

*Y*,

_{i}*M*) ∼

_{i}*P*

_{0},

*i*= 1, …,

*n*.

*Y*is the phenotypic trait value and the vector

*M*contains the marker genotype. The subscript “0” indicates that

*P*

_{0}is the true underlying distribution of the data, and the subscript “

*i*” indexes the

*i*th subject for

*O*.

_{i}We also define *A* as the genotype of the QTL under consideration. *A* may be observed or unobserved; when *A* lies on a marker, *A* is observed, but when *A* lies between markers, it is unobserved. When *A* is unobserved, we impute *A* as is done in Haley–Knott regression (Haley and Knott 1992). The value of *A* becomes its expected value, obtained from a multinomial distribution calculated with the genotypes and the relative locations of its flanking markers. For unobserved *A*’s, it is important to note that we therefore are estimating the effect of an imputed *A*. Throughout the text, uppercase *A* denotes the random variable, and its lowercase counterpart *a* denotes the realized value of *A*.

The semiparametric regression model used in this article for the effect of *A* at a value *A* = *a* relative to *A* = 0, adjusted for a set of other markers *M*^{−} is (1)The target parameter is the average marginal effect and is given by *β*_{0}. We differentiate a marginal effect, as defined above, from a standard conditional effect found in parametric regression analysis with covariate adjustment. In a backcross population, when the homozygote *aa* is coded 0 and the heterozygote *Aa* is coded 1, our target parameter *β*_{0} measures the effect of the *Aa* genotype on *Y* relative to *aa*. In an F_{2} population, with the coding (*AA*, *Aa*, *aa*) = (1, 0, −1), *β*_{0} can be interpreted as the difference in *Y* when *A* changes from heterozygote to homozygote.

The linearity assumption we make about the QTL effect can be easily seen in Equation 1 (*i.e.*, *β*_{0}*A*). We do not impose any distributional assumption on the data or any functional form on all functions *f*(*M*^{−}) of *M*^{−}. For *β*_{0} to be estimable and well defined, we also need the assumption that *A* is not a perfect surrogate of *M*^{−}. In other words, if we choose to estimate *E*_{0}(*A*|*M*^{−}), the *R*^{2} (coefficient of determination) from the estimator has to be <1.

### The TMLE

The TMLE framework is a new paradigm for efficient double-robust loss-based substitution estimation (van der Laan and Rubin 2006; van der Laan and Rose 2011). The two-stage procedure builds on the foundation of maximum-likelihood estimation. In the first step, one obtains an estimator of the data-generating distribution (possibly using maximum-likelihood estimation or machine learning). The second stage fluctuates this initial estimator in a step focused on making the optimal bias–variance trade-off for the target parameter. This second step is a bias reduction step.

The procedure can also be understood intuitively. The overall conditional expectation for the phenotypic trait value *Y* given the vector *M* in stage one is not targeted toward the parameter of interest; its bias–variance trade-off is for the overall density. The second stage brings in additional information (the conditional expectation for a particular genotype *A* of the QTL under consideration) to reduce the bias of the initial estimate for the conditional expectation of *Y*. Estimator comparisons involving TMLEs have been presented in the literature (*e.g.*, Rose and van der Laan 2008; Gruber and van der Laan 2010a,b; Stitelman and van der Laan 2010; van der Laan and Rose 2011; Wang *et al.* 2011). The statistical properties and flexibility of the TMLE make it ideal for application to QTL mapping.

#### Implementation summary:

The TMLE of *β*_{0}, defined in Equation 1, involves an initial fit of *E*_{0}(*Y*|*M*). With this initial fit, we can obtain a fit of *E*_{0}(*Y*|*A* = 0,*M*^{−}) and map it into a first-stage estimator of *β*_{0} [and thereby of *E*_{0}(*Y*|*A*,*M*^{−})] in our semiparametric model. With an initial estimate of *E*_{0}(*Y*|*A*,*M*^{−}), we now perform the second-stage updating step. This single update is completed using an estimate of the so-called “clever covariate” *A* − *E*_{0}(*A*|*M*^{−}) and fitting a coefficient *ε* in front of this clever covariate with univariate regression, using the initial estimator of *E*_{0}(*Y*|*A*,*M*^{−}) as an offset in the regression. We can now write that the TMLE of *β*_{0} is The TMLE of *β*_{0} is defined in detail below.

#### TMLE algorithm:

Obtain an initial estimator for

*E*_{0}(*Y*|*A*,*M*^{−}). This initial estimator must respect the semiparametric model in Equation 1 and takes the formObtain a reasonable estimate

*g*(_{n}*W*) of the expectation*E*_{0}(*A*|*W*). We typically need to focus on only a subset*W*of*M*^{−}that is viewed as potential confounders of the effect of*A*on*Y*. Hence, we replaced*M*^{−}with*W*, and we wish to name the prediction function*g*(_{n}*W*) as a “marker confounding mechanism.” In our applications,*W*is the set of markers on the same chromosome as*A*.Compute

*r*(*A*,*W*) =*A*−*g*(_{n}*W*). The*r*(*A*,*W*) is the residual of*g*(_{n}*W*), also referred to as the clever covariate. It plays the key role of correcting the bias in the initial estimator.Fit the “

*ε*regression.” This regression is given by*Y*′ ∼*εr*(*A*,*W*), where and the regression coefficient estimate is denoted*ε*._{n}Update. The initial estimate of is updated with and the initial fitted value with

Compute the variance estimate for Using influence curve-based methods, we calculate the variance estimate (van der Laan and Rose 2011):

The FDR-adjusted

*P*-values are then calculated from the variance estimate.

The double robustness of the TMLE for *β*_{0} can be understood as follows: the TMLE will be consistent if either or *g _{n}*(

*W*) is consistent and will be efficient when both are consistent. This means that when is correctly specified, the TMLE of

*β*stays essentially unchanged with only a minor adjustment from the second step. When is misspecified, a correct specification of

*g*(

_{n}*W*) will achieve the full bias reduction for and

*β*

^{0}.

There are some considerations when generating a marker-confounding mechanism *g _{n}*(

*W*) for

*E*

_{0}(

*A*|

*M*

^{−}). If two flanking markers of

*A*are used as a proxy for

*M*

^{−}, this simplifies the estimation problem and also possibly captures a large portion of confounding from the complete set

*M*

^{−}. However, choosing the distance between flanking markers is a nontrivial problem. Previous simulations (Tuglus and van der Laan 2011) indicate that TMLE does not deteriorate for correlations smaller than

*δ*= 0.7 between the marker of interest and the confounders, and this value could be used to describe the window width of the flanking markers. Another alternative is to define a set of correlation values

*δ*and implement

*δ*-specific TMLEs for each value of

*δ*. Subject matter knowledge can also be used to set flanking distance.

## Simulation Study

We present a simulation study to demonstrate how the proposed TMLE procedure corrects for bias. A single chromosome of 100 markers was simulated on 600 backcross subjects. Markers were evenly spaced at 2 cM. Four main QTL effects and three epistatic effects were generated according towhere *Y* denotes the phenotypical value, *M*_{[⋅]} the marker genotype, and *U* an error term drawn from an exponential distribution scaled to have variance 10. The number in the square brackets of *M* indicates the marker’s location in centimorgans. The four main effects respectively explain 3.14%, 4.91%, 1.40%, and 1.40% of the total phenotypic variance, and the three epistatic effects explain 0.41%, 0.80%, and 0.59% of the total variance (epistatic percentages were calculated assuming independence between interacting markers). The total proportion of explained variance by genetic markers is 12.65%. Three hundred replicates were simulated and results take the average.

### Main analyses

In this simulation, the marker density is high, the phenotypic outcome follows a nonnormal distribution, and there are strong counteracting main and epistatic effects in closely linked markers. We focused on markers and analyzed these data sets with univariate regression (UR), CIM, and TMLE. CIM was carried out using the software QTL Cartographer (Basten *et al.* 2001). To make predictions based on CIM for use in TMLE, we used markers detected by MIM with CIM as an initial model in QTL Cartographer. We chose CIM as a primary benchmark because it offers genome-wide profiles of both effect estimates and asymptotic *P*-values comparable with TMLE. Meanwhile, the performance of CIM is reasonably ranked among various mapping methods studied in this article. Univariate regression completely failed to identify the correct QTL, due to serious model misspecification. Its estimates of main effects are biased and its likelihood profile is dominated by a single peak spanning from 80 cM to 160 cM (Supporting Information, Figure S1).

In comparison, CIM detected some signal for QTL 1 and QTL 2, the two strongly linked QTL with counteracting effects. However, CIM was unable to separate QTL 3 and QTL 4 and combined them into a single signal with a striking magnitude. Using the CIM predictions as the initial estimator TMLE was able to correct the biased estimates of CIM in the right direction and effectively separated all four main QTL. The degree of correction and improvement TMLE has over CIM depends on how strongly we adjust for marker-confounding mechanism *g _{n}*(

*W*). Thus, two versions of TMLE were presented. One adjusts for flanking markers 10 cM away in

*g*(

_{n}*W*) and is denoted by TMLE(10), while the other adjusts for markers 20 cM away and is denoted TMLE(20). TMLE(10) improves on the CIM estimate more than TMLE(20), especially for QTL 1 and QTL 2. This was not unexpected, as TMLE(10) uses a linear regression estimate

*g*(

_{n}*W*) that is closer to the truth than that used by TMLE(20). The variance estimates of the effect estimates from TMLE(10) are larger than those from TMLE(20), due to a more aggressive choice for

*g*(

_{n}*W*) in TMLE(10). This implies that the bias correction comes at the price of variance. More details are provided in the text of Simulation Study section in File S1.

In Figure 1, we present the profiles of effect estimates, *P*-values, and rankings at each marker for CIM, TMLE(10), and TMLE(20). The true and the estimated main QTL effects are reported in Table 1 along with their *P*-values, top 10 proportions, and relative orderings. The *P*-value profiles for CIM, TMLE(10), and TMLE(20) are largely aligned with their effect estimate counterparts, although there were slight differences in QTL position estimates, and are reported in Table 2. TMLE(10) has the best performance with respect to separating linked effects. In addition, TMLE(10) has produced the most accurate rankings of the QTL based on QTL *P*-values. Ranking accuracy was evaluated for a specific marker by the proportion of simulations where that marker’s *P*-value ranking was ≤10 among all the markers in a simulation. For TMLE(10), this quantity has mirrored the true effect sizes of the simulated QTL and resulted in a correct relative ordering among the four main QTL effects. CIM missed QTL 1 and merged QTL 3 and QTL 4. The performance of TMLE(20) lies in between that of TMLE(10) and CIM. A ranking cutoff of 5 produced similar results.

### Additional analyses

We also analyzed the simulated data sets, using a variety of shrinkage methods with main-terms linear models, including the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996; Friedman *et al.* 2010), Bayesian shrinkage estimation (Wang *et al.* 2005), iterative adaptive LASSO (Sun *et al.* 2010), and a penalized-likelihood approach (pLOD) presented in Manichaikul *et al.* (2009) (Table 1, Figure 2, Table S1, and Figure S4). Analyses focused on effect estimates as these methods do not always provide *P*-value profiles. LASSO detected four markers carrying main effects and no markers carrying epistatic effects. The effect estimates from LASSO are biased downward. This is expected as the LASSO estimator is a shrinkage estimator. We reestimated the effects of each marker detected by LASSO in a multivariate linear regression. The estimates are moderately improved, on average, and other shrinkage methods produced results similar to those of LASSO (Figure S2). In addition to the above methods, we used pLOD, allowing pairwise interactions, as pLOD is designed to detect epistatic effects. This led to improved performance compared to that of a linear model with main effects only, even though the pLOD effect estimates did not outperform TMLE (Figure S3).

Additionally, we examined how the effect estimates of markers associated with epistatsis were affected when ignoring interaction terms in the model by evaluating the marginal effect of the four epistatic markers *M*_{[44]}, *M*_{[74]}, *M*_{[86]}, and *M*_{[146]} (Table S2). Estimates from all methods are quite biased, yet TMLE(10) produced overall results closest to the true effects. The performance of TMLE(20) is similar to that of CIM. We also investigated how initial estimators (UR, CIM, and LASSO) affect the performance of TMLE. Results were similar for TMLE(10), but differed for TMLE(20), where CIM provided the best performance (Figure S5 and Table S3). In summary, TMLE produced the most accurate estimates among all investigated methods. A more aggressive marker-confounding regression will result in improved bias reduction for a misspecified initial estimator, but also larger variance estimates.

## Barley Data Analysis

We analyzed a barley data set presented in Hayes *et al.* (1993), available from http://www.genenetwork.org/genotypes/SXM.geno and http://wheat.pw.usda.gov/ggpages/SxM/phenotypes.html. This data set contains 150 doubled-haploid lines derived from the F_{1} cross of two barley varieties: “Steptoe” and “Morex.” Phenotypes consist of eight agronomic traits measured across multiple environments. Genotypes include 495 markers distributed along the barley genome. The purpose of the study is to identify QTL linked to measured agronomic traits. For the ease of demonstration, we present only the analysis of a malting quality trait referred to as “lodging.” This trait was measured in six environments and the average was taken as the phenotype in our analysis.

We removed three samples and 54 markers that were of low quality (call rate threshold 90%). The genotypes of each marker were coded 1 for Steptoe allele and −1 for Morex allele. Both CIM and TMLE were run to test 1070 positions along the barley genome at an incremental step of 1 cM. CIM was carried out using QTL Cartographer. In TMLE, the initial was fitted to the entire data set, using an elastic net, with 50% mixtures of *L*_{1} and *L*_{2} penalties. The marker-confounding mechanism *g _{n}*(

*W*) was fitted with a linear regression on flanking QTL that are 20 cM away from the tested position. In Figure 3,

*P*-value profiles of the analysis are presented (for the profile of effect estimates, please see Figure S6), and in Table 3, we report the estimates of all QTL with a

*P*-value <0.0034 (suggested linkage threshold in Lander and Kruglyak 1995).

Two major QTL on chromosomes 2 (QTL 1) and 3 (QTL 4) are identified by both CIM and TMLE. CIM identified several other QTL as highly significant, while these QTL are borderline significant in TMLE. Typical cases are QTL 2, QTL 4, and QTL 5. This difference is likely due to documented epistatic effects at these sites (Hayes *et al.* 1993). CIM is a parametric method and its result is sensitive to model misspecification. Our analysis assumed a main-effect model that ignores interaction effects and may lead to biased effect estimates in CIM. In contrast, TMLE is more robust to model misspecification and hence produces less significant *P*-values at sites with epistatic effects. TMLE also has a better resolution than CIM, particularly evidenced by the peak on chromosome 5. Our findings are also largely consistent with what was reported previously (Hayes *et al.* 1993; Zhao and Xu 2012).

We produced an additional simulation study based on this barely data set to study the classic QTL mapping setting where markers are widely spaced and QTL lie in between markers with unobserved genotypes (File S1 Supplemental Simulation, Table S4, Figure S7 and Figure S8). This simulation informed several of the analytic choices in the barely data analysis above. This included the finding that due to high and variable correlations among markers, adjusting for markers 10 cM away led to overfitting in *g _{n}*(

*W*) and thus was not pursued. We also found two advantages associated with TMLE compared to CIM in this simulation: (1) the resolution of identified QTL from TMLE was better and (2) TMLE, on average, preserved the correct rankings of simulated QTL while CIM did not (Figure S7, B and C).

## Discussion

Targeted maximum-likelihood learning is a novel flexible semiparametric methodology with broad applications in QTL mapping. Analysis of variance, univariate regression, various forms of interval mapping, and machine-learning techniques have been proposed and implemented for QTL mapping. However, current practice relies heavily on parametric models. While parametric methods offer inference via *P*-values, bias is a substantial issue. Machine-learning algorithms offer flexibility, yet lack inference and are designed for prediction over effect estimation. TMLE allows both effective inference and bias reduction, as well as the incorporation of machine learning without substantial computational burden.

In this article, we explored TMLE in QTL mapping with simulations and data analysis. We compared TMLE with popular approaches, such as UR, CIM, and shrinkage methods. In our simulations, these methods were substantially more biased than TMLEs, and TMLEs improved on their estimates and had the most accurate rankings of QTL. TMLE and CIM were also compared in a barley data set, with TMLE displaying less noise. In the analysis of the barley data set, we initialized TMLE with elastic nets, demonstrating its ability to incorporate machine-learning algorithms.

In summary, a targeted learning approach for QTL mapping has multiple advantages. First, semiparametric models place fewer unrealistic restrictions on the functional form of the data. Second, precision of the QTL mapping is improved in the bias-reduction step. Third, machine-learning algorithms, such as random forests or more aggressive ensembling techniques (van der Laan *et al.* 2007; van der Laan and Rose 2011), can be incorporated into the prediction steps to allow for more flexible estimation. Finally, *P*-values are available to generate ranked lists of QTL.

## Acknowledgments

We thank Shizhong Xu for his generosity in providing us a clean and compiled version of the barley data set.

## Footnotes

Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.168955/-/DC1.

*Communicating editor: S. Sen*

- Received July 27, 2014.
- Accepted September 13, 2014.

- Copyright © 2014 by the Genetics Society of America