## Abstract

The existing statistical methods for mapping quantitative trait loci (QTL) assume that the phenotype follows a normal distribution and is fully observed. These assumptions may not be satisfied when the phenotype pertains to the survival time or failure time, which has a skewed distribution and is usually subject to censoring due to random loss of follow-up or limited duration of the experiment. In this article, we propose an interval-mapping approach for censored failure time phenotypes. We formulate the effects of QTL on the failure time through parametric proportional hazards models and develop efficient likelihood-based inference procedures. In addition, we show how to assess genome-wide statistical significance. The performance of the proposed methods is evaluated through extensive simulation studies. An application to a mouse cross is provided.

QUANTITATIVE trait analysis plays an important role in the understanding of genetic variations in plants and animals. Mapping quantitative trait loci (QTL) can lead to improvements in economic traits, such as yield and quality in crop plants and milk production in cows. QTL mapping in animals can also provide valuable insights into the genetic etiologies of complex human diseases (Hilbert *et al.* 1991; Jacob *et al.* 1991; Shepel *et al.* 1998).

Much of the modern statistical methodology for QTL mapping in experimental crosses originates from the seminal work of Lander and Botstein (1989). The Lander-Botstein interval-mapping method postulates that QTL occur at a series of positions within a set of adjacent marker intervals and that the trait value depends on the QTL genotype through a linear regression model. The distance between each pair of genetic markers is assumed known. The method steps through the genome in specified increments, say every 1 or 2 cM, and calculates the likelihood-ratio statistic for testing no QTL present at each position. The position with the largest value of the likelihood-ratio statistic is declared to be the candidate QTL location provided that the value exceeds a certain threshold level. It is widely recognized that the interval-mapping method has higher power and requires fewer progenies than the single-marker analysis (Lander and Botstein 1989; Haley and Knott 1992; Zeng 1994). Doerge *et al.* (1997) described in greater detail this method and various extensions.

Most of the existing QTL-mapping methods require that the phenotype be normally distributed and fully observed. These assumptions are likely to be false when the phenotype pertains to the survival time or failure time. The Weibull distribution and other skewed distributions with long right tails are more appropriate than the normal distribution. Furthermore, the failure time is often subject to censoring so that the trait value is known only to be beyond the censoring time. An example of failure time in plant experiments is the flowering time, which may be censored due to limited duration of the experiment; see Ferreira *et al.* (1995). In animal studies, the failure times of interest include time to tumor and time to death (*i.e.*, survival time), which may be subject to censoring because of limited study duration or death due to unrelated causes. One particular example is a mice cross presented by Broman (2003), in which the trait of interest is time to death after a bacterial infection and in which 30% of the mice are still alive at the end of the study period. Symons *et al.* (2002) presented another interesting study, in which the phenotype is the time until terminal illness due to tumor for Eμ-v-abl transgenic mice.

The incompleteness of the trait values presents major challenges in the application of the interval-mapping approach. Broman (2003) considered a cure model in which the mice that are alive at the end of the study are regarded as cured and in which the survival times among the deaths follow a log-normal distribution. This is a specialized model, which can deal only with the situations in which the potential censoring times are equal among all study subjects. Symons *et al.* (2002) utilized a variant of the expectation-maximization (EM) algorithm (Lipsitz and Ibrahim 1998) to map QTL with censored observations. This method is computationally intensive and its properties have not been investigated.

In this article, we provide a simple and rigorous extension of the interval-mapping approach of Lander and Botstein (1989) to censored quantitative traits. Specifically, we formulate the effects of QTL on the failure time through proportional hazards models (Kalbfleish and Prentice 2002, Sect. 2.3). We then develop efficient likelihood-based methods for locating QTL and estimating the effects of QTL. In addition, we show how to assess genome-wide statistical significance by extending the analytical results of Lander and Botstein (1989) and Dupuis and Siegmund (1999) and by developing an accurate and efficient Monte Carlo procedure. We conduct extensive simulation studies to evaluate the performance of the proposed methods. Finally, we provide an application to the mice data of Broman (2003).

## METHODS

### Interval mapping:

In this section, we develop an interval-mapping method for potentially censored failure-time traits in an F_{2} intercross population by modeling a single QTL. Expanding the results to other crosses is straightforward. Consider *n* progenies from an intercross between two inbred strains. Let *T _{i}* denote the quantitative trait for the

*i*th subject, which pertains to a failure time that can potentially be censored and thus incompletely observed. Let

*C*be the censoring time for the

_{i}*i*th subject. The observation on the trait value of the

*i*th subject consists of two components:

*Y*= min(

_{i}*T*,

_{i}*C*) and Δ

_{i}*=*

_{i}*I*(

*T*≤

_{i}*C*), where

_{i}*I*(𝒜) is the indicator function for event 𝒜. The failure time

*T*is fully observed only when it is uncensored,

_{i}*i.e*., Δ

*= 1.*

_{i}Suppose that we have data on a set of genetic markers with a known genetic map. Let **M*** _{i}* denote the multipoint marker genotype data for the

*i*th subject. We consider a putative QTL locus

*d*in the genome with two possible alleles

*q*and

*Q*from the two inbred parents and define

*G*= −1, 0, or 1 according to whether the

_{i}*i*th subject has genotype

*G*, the hazard function of

_{i}*T*takes the form 1where β

_{i}_{1}and β

_{2}pertain to the additive and dominant effects of the QTL, and λ

_{0}(·) is an unknown baseline hazard function (Kalbfleish and Prentice 2002, Sect. 2.3). In this article, we assume a parametric model for λ

_{0}. In particular, we consider a Weibull hazard function , γ

_{1}> 0, γ

_{2}> 0 (Kalbfleish and Prentice 2002, p. 33).

Write **θ** = (**β**, **γ**), where **β** = (β_{1}, β_{2}) and **γ** = (γ_{1}, γ_{2}). At each locus, we may calculate π_{i}_{,}* _{g}* = Pr(

*G*=

_{i}*g*|

**M**

*) (*

_{i}*g*= −1, 0, 1;

*i*= 1, … ,

*n*), which are the conditional probabilities of the QTL genotypes given the observed marker data. Under the assumptions of no crossover interference and no genotyping errors, these probabilities are determined by the genotypes of the two flanking markers and the location of the QTL; see Equation 15.2 of Lynch and Walsh (1998).

We assume that censoring is noninformative (Kalbfleish and Prentice 2002, p. 195) and that the censoring time is independent of the failure time and QTL genotype. The likelihood for the vector of parameters **θ** based on the complete data (*Y _{i}*, Δ

*,*

_{i}*G*,

_{i}**M**

*) (*

_{i}*i*= 1, … ,

*n*) is proportional to 2while the likelihood based on the observed data (

*Y*, Δ

_{i}*,*

_{i}**M**

*)(*

_{i}*i*= 1, … ,

*n*) is proportional to 3

To obtain the maximum-likelihood estimator (MLE) of **θ**, we may maximize the observed-data likelihood (3) directly. An alternative approach is to apply the EM algorithm (Dempster *et al.* 1977) to (2). The expected value of the complete-data log-likelihood given the observed data can be shown to be, up to a constant, 4where and

In the E-step, we evaluate *p _{i}*

_{,}

*(*

_{g}**θ**) at the current estimate of

**θ**. The M-step can proceed in a similar manner to the case of complete data since expression (4), with

**θ**in

*p*

_{i}_{,}

*(*

_{g}**θ**) fixed, takes the same form as the complete-data log-likelihood. We begin the EM algorithm by assigning an initial value to

**θ**and iterate until convergence. The initial value for

**β**is set to

**0**and that of

**γ**to some value in the parameter space of

**γ**. The resulting MLE is denoted by

**θ̂**. See appendix A for further detail.

We test the null hypothesis of no QTL effects, *i.e.*, H_{0}: **β** = **0**, by the likelihood-ratio statistic where *L*(·) is the observed-data likelihood, and **θ̃** = (**0**, **γ̃**) with **γ̃** being the restricted MLE of **γ** under H_{0}. The LOD score is LR/(2 ln 10). Under H_{0}, LR is asymptotically χ^{2}-distributed with 2 d.f. (appendix B). Note that *p _{i}*

_{,}

*(*

_{g}**θ**),

**θ̂**,

*L*(

**θ̂**), LR, and LOD all depend on the locus

*d*through the dependence of π

_{i}_{,}

*on*

_{g}*d*. In the sequel, we include

*d*in the expressions to emphasize their dependence on

*d*if ambiguity arises. Note also that

**θ̃**and

*L*(

**θ̃**) do not depend on

*d*. Thus, as in the case of standard interval mapping, the likelihood under H

_{0}is calculated once while the likelihood under the alternative is evaluated at each location in the genome to produce a LOD curve for each chromosome. The position with the largest value of the LOD score is declared to be the QTL location provided that the value exceeds a certain threshold level. We show how to determine the threshold level in the following section.

### Thresholds:

When searching the entire chromosome or whole genome for QTL, one should select a threshold level for the LOD score such that the probability (under the null hypothesis) that LOD or some other test statistic exceeds this level anywhere in the genome equals the desired false-positive rate. The pointwise significance level based on the χ^{2}-approximation is inadequate because of the multiple tests while the Bonferroni correction is too conservative because of the dependence of the test statistics at different points in the genome. In appendix C, we show that the likelihood-ratio statistic LR(*d*) can be partitioned into the sum of the squares of two asymptotically independent Ornstein-Uhlenbeck processes. This result is analogous to those of Lander and Botstein (1989) and Dupuis and Siegmund (1999) and implies that the analytical approximations of thresholds for normal traits can be applied to the case of censored failure time observations. These analytical results assume that the markers are dense or equally spaced with no missing data and thus may not work well in practice. Using results of Davies (1977)(1987), Rebai *et al.* (1994) provided approximate thresholds in backcross (BC) and F_{2}, which are applicable in the intermediate map density case. The calculations are formidable, even for F_{2}, and do not accommodate missing marker data.

To overcome the limitations of the analytical approximations, we propose a novel resampling approach to determining the thresholds for genome-wide statistical significance. This approach allows arbitrary distributions of the markers as well as arbitrary test positions. It also accommodates missing marker data and dominant markers.

For evaluating the correlations of the test statistics among different locations, it is more convenient to work with the score statistic than the likelihood-ratio statistic. Let **U**_{β}(**θ̃**; *d*) be the score function (based on the observed-data likelihood) for **β** at location *d*, which can be approximated by the sum of *n* independent zero-mean random vectors , where **θ**_{0} = (**0**, **γ**_{0}) is the true parameter value under H_{0}; see appendix B. Thus, *n*^{−1/2}**U**_{β} is asymptotically zero-mean normal with covariance matrix **V**(*d*) that is the limit of . We replace the unknown quantities in **Ũ*** _{i}*(

**θ**

_{0};

*d*) by their sample estimators to yield

**Û**

*(*

_{i}*d*) shown in appendix B. The score statistic for testing H

_{0}:

**β**=

**0**against H

_{1}:

**β**≠

**0**takes the form where is a consistent estimator of the limiting covariance matrix

**V**(

*d*). It can be shown that

*W*(

*d*) is asymptotically equivalent to LR(

*d*) (Cox and Hinkley 1974, Sect. 9.3).

In general, the limiting distribution of sup* _{d}W*(

*d*) is not analytically tractable. We use a resampling approach similar to that of Lin

*et al*. (1993) to approximate the null distributions of

*n*

^{−1/2}

**U**

_{β}(

**θ̃**;

*d*) and sup

*(*

_{d}W*d*). From appendix B, it is easy to see that

*n*

^{−1/2}

**U**

_{β}(

**θ̃**;

*d*) converges to a zero-mean Gaussian process with covariance function

**ψ**(

*d*

_{1},

*d*

_{2}) that is the limit of at (

*d*

_{1},

*d*

_{2}) and that

**ψ**(

*d*

_{1},

*d*

_{2}) can be consistently estimated by . Define , where

*Z*(

_{i}*i*= 1, … ,

*n*) are independent standard normal random variables that are independent of the observed data. Conditional on the observed data,

*n*

^{−1/2}

**Û**(

*d*) is a Gaussian process with mean

**0**and covariance function at (

*d*

_{1},

*d*

_{2}), which converges to

**ψ**(

*d*

_{1},

*d*

_{2}). Thus, the conditional distribution of

*n*

^{−1/2}

**Û**(

*d*) given the observed data converges to the limiting distribution of

*n*

^{−1/2}

**U**

_{β}(

**θ̃**;

*d*). As a result, the distribution of

*W*(

*d*) can be approximated by that of

To approximate the distribution of sup* _{d}W*(

*d*), we generate the normal random sample (

*Z*

_{1}, … ,

*Z*) a large number of times while holding the observed data fixed; for each sample, we calculate

_{n}*Ŵ*(

*d*) and sup

*(*

_{d}Ŵ*d*). The 100(1 − α)th percentile of the simulated sup

*(*

_{d}Ŵ*d*) is the threshold value for the genome-wide significance level of α. This resampling approach is computationally much more efficient than the use of permutation (Churchill and Doerge 1994) and other simulation methods because it involves only simulation of normal random variables and does not entail repeated analysis of simulated data sets.

## SIMULATIONS

To investigate the operating characteristics of the proposed methods in practical situations, we performed extensive simulation studies. We generated the failure times from the Weibull distributions with baseline hazard function , where γ_{1} = 0.01 and γ_{2} = 2. We reparameterized γ_{1} and γ_{2} according to so as to ensure that the estimates of γ_{1} and γ_{2} are positive. The censoring times were generated from the uniform (0, τ) distribution, where τ was chosen to yield ∼30% censored observations. Assuming no crossover interference, we generated the marker data from the Markov chain. The interval-mapping step size was set at 1 cM.

We considered a chromosome with a total length of 100 cM. Genetic maps with different numbers of equally spaced markers were simulated. Under H_{1}, one QTL located at 35 cM was simulated with β_{1} = 0.35 and β_{2} = 0.30. We generated 10,000 replicates of 300 observations from an F_{2} population. We evaluated the finite-sample properties of the MLEs of the QTL effects at the true QTL location. The results for the estimator of the additive QTL effect are summarized in Table 1. The proposed estimator appears to be virtually unbiased. The standard error estimator reflects accurately the true variation. The confidence intervals have proper coverage probabilities. We obtained similar results for the estimator of the dominant QTL effect (data not shown). We also examined the performance of the proposed interval-mapping methods for locating the QTL and estimating the QTL effects at the location with the maximum LOD. The results are summarized in Table 2. There is little bias for the estimator of the QTL location or the estimators of the QTL effects. The mapping is more precise for denser marker maps.

We conducted additional simulation studies to evaluate the performance of the analytical and resampling methods for determining genome-wide statistical significance. We generated both equally and unequally spaced markers. We also simulated data with missing marker genotypes and dominant markers, which are more comparable with real data. We considered one chromosome with a total length of 100 cM. For the cases of unevenly spaced markers, we placed *m* markers at the following locations, where LOC* _{j}* is the

*j*th marker location and [

*m*/2] is the largest integer that is less than or equal to

*m*/2. In these settings, the first half of the markers is denser than the second half of the markers. We generated 10,000 replicates of 300 observations from an F

_{2}population. The dense-map and sparse-map approximations were obtained from Equation C1 in appendix C . The thresholds for the resampling method were based on 10,000 normal samples. The results are summarized in Tables 3 and 4.

The thresholds based on the resampling method are close to the empirical values, whether the data are generated under H_{0} or H_{1}; consequently, the LR tests based on these thresholds have proper type I error and power. This is true of any genetic map, with or without missing marker genotypes and dominant markers. The dense-map approximations are too conservative and thus result in power loss, while the sparse-map approximations tend to be too liberal. We also assessed the approximations by Rebai *et al.* (1994), which turn out to be conservative when the genetic map is dense. For example, in the case of 51 markers with α = 0.05, the sizes are 2.66 and 2.26% for marker patterns 1 and 2, respectively.

## APPLICATION

To illustrate our methods, we consider the mice data previously analyzed by Broman (2003). A total of 116 female mice from an intercross between the BALB/cByJ and C57BL/6ByJ strains were genotyped at 133 markers, including 2 on the X chromosome. The phenotype of interest is the time to death following infection with *Listeria monocytogenes*. Approximately 30% of the survival times are censored.

Broman (2003) proposed a nonparametric (NP) approach and a two-part model. The NP approach is an extension of the Kruskal-Wallis statistic (Lehmann 1975, Sect. 5.2) by assigning a prior weight (π_{i}_{,}* _{g}*) to the rank of the

*i*th observation for each QTL genotype group

*g*. In this approach, the censored observations are treated as the true failure times and an average rank is assigned to those observations. In the two-part approach, Broman considered a cure model in which the mice that are alive at the end of the study are regarded as cured while the survival times among the deaths follow a log-normal distribution.

We applied the proposed methods to these data, assuming a Weibull baseline hazard, and the results are shown in Table 5 and Figures 1 and 2. The threshold for the LOD score at the 5% genome-wide significance level based on the resampling approach is 3.43, which is close to 3.27, the threshold obtained by permutation for the NP approach of Broman (2003). Our results are fairly consistent with those of Broman (2003). We detect almost the same QTL on chromosomes 5, 13, and 15 except that we detect an additional QTL on chromosome 6 rather than on chromosome 1. The QTL on chromosome 5 appears to have a strong additive effect and the hazard ratio of the survival time with genotype *QQ vs. qq* is ∼9.95. Genotypes *qq* and *Qq* at the QTL on chromosome 6 seem to have similar effects. The QTL on chromosome 13 appears to have both additive and dominant effects. The QTL on chromosome 15 appears to have a strong dominant effect. At most detected QTL locations, our LOD scores are larger than those of Broman's NP approach. This suggests that our approach may be more efficient in detecting QTL.

Figure 1 shows the LOD curves from the three methods: proposed method, nonparametric method, and two-part model. The LOD scores from the two-part model are larger than those of the other two methods at some QTL locations because there are two more free parameters in the two-part model than in the other two methods. This will decrease the power to detect QTL since a larger threshold (*i.e.*, 4.91) is required. To evaluate different methods on a common scale, we converted the LOD curves to the estimated pointwise *P*-values. Figure 2 displays the values of −log_{10}*P* for chromosomes 1, 5, 6, 13, and 15. Comparisons with the nonparametric method and two-part model reveal that the proposed method yields more significant results on the above chromosomes except chromosome 1. Incidentally, the resampling method is ∼100 times faster than the permutation method in this application.

To get some ideas about the adequacy of the Weibull distribution in describing the Listeria data, we fitted both the semiparametric model and the Weibull model at marker D5M357, which is close to the peak of the LOD score on chromosome 5. The estimated QTL effects from the two models are very similar. It would be worthwhile to develop formal goodness-of-fit methods for assessing the adequacy of the parametric survival model at the true QTL location.

## EXTENSIONS

In this section, we extend the single-QTL model to multiple QTL. The approach of interval mapping (IM) considers one putative QTL at a time. The QTL located elsewhere on the genome can have interfering effects, so that the estimators for the locations and effects of QTL may be biased and the power of detecting QTL may be compromised (Lander and Botstein 1989; Haley and Knott 1992; Zeng 1994). Boer *et al.* (2002) showed that the IM method fails to detect three interacting QTL with no main effects through simulation studies. A variety of approaches have been proposed for mapping multiple QTL. These methods can increase the power to detect QTL and reduce biases in the estimators of the QTL effects and locations. In this section, we consider mainly composite-interval mapping (CIM; Jansen 1993; Zeng 1993, 1994) and multiple-interval mapping (MIM; Kao *et al.* 1999) for censored traits.

### Composite-interval mapping:

The idea of CIM is to combine IM with multiple regression analysis in mapping QTL by conditioning on markers outside a region of interest to account for the effects of other QTL. To extend the original CIM model to censored traits, we consider the following proportional hazards model, 5where *j* and *j* + 1 conform to two flanking markers bordering the putative QTL, and *M _{ik}* is the indicator variable for the marker genotype, which takes values −1, 0, and 1 for genotypes

*aa*,

*Aa*, and

*AA*, respectively. We may further enhance the model by considering the interaction effects between the putative QTL and controlling markers. Replacing λ(

*t*|

*G*) in (2) and (3) with (5), we obtain the complete-data and observed-data likelihood functions, respectively. As in the case of standard interval mapping, we can maximize the observed-data likelihood directly or apply the EM algorithm to obtain the MLEs. We can test H

_{i}_{0}:

**β**=

**0**at any position in the genome.

The CIM approach requires that the sample size be large relative to the number of markers included in the model. In practice, the sample size is generally not very large. Thus, Zeng (1994) suggested including in the model only those markers that are more or less evenly spaced in the genome or those preidentified markers that explain most of the genetic variation in the genome. This suggestion also applies to our setting.

### Multiple-interval mapping:

The MIM approach proposed by Kao *et al.* (1999) uses multiple marker intervals simultaneously to fit multiple putative QTL directly in the model. Consider *K* QTL, *Q*_{1}, … , *Q _{K}*, located at

*d*

_{1}, … ,

*d*in the genome. There are 3

_{K}*possible QTL genotypes. Some of the*

^{K}*K*QTL may exhibit epistasis. We formulate the effects of the

*K*QTL on the failure time through a proportional hazards model, such that, conditional on the joint genotype

**G**

*= (*

_{i}*G*

_{1}

*, … ,*

_{i}*G*), the hazard function of

_{Ki}*T*takes the form 6where

_{i}**x**

*= (*

_{ij}*G*, 1 − |

_{ij}*G*|)

_{ij}*, δ*

^{T}*is an indicator variable for epistasis between*

_{jk}*Q*and

_{j}*Q*, and

_{k}**β**

*and*

_{j}**B**

*pertain to the main effects and epistatic effects, respectively. The variable δ*

_{jk}*indicates, by the values 1*

_{jk}*vs.*0, whether or not

*Q*and

_{j}*Q*interact. Given the marker data

_{k}**M**

*for the*

_{i}*i*th subject and assuming no crossover interference, we may calculate π

_{i}_{,}

*(*

_{g}*g*= 1, … , 3

*), the conditional probabilities of the 3*

^{K}*possible genotypes of the*

^{K}*K*QTL. The complete-data likelihood takes the same form as Equation 2 except that the summation of

*g*is now over (1, … , 3

*). To obtain the MLEs and LOD scores, we can again apply the EM algorithm.*

^{K}Since the true number and locations of the QTL are unknown, model selection is a critical issue in the MIM approach. Kao *et al.* (1999) suggested stepwise and chunkwise selection with the likelihood-ratio test statistic as a selection criterion to identify QTL, to separate linked QTL, and to analyze epistasis between QTL. Broman and Speed (2002) developed a modified Bayesian information criterion (Schwarz 1978) for model selection. When many QTL are included in the MIM model, the computation can be very intensive. Sen and Churchill (2001) reduced the computation burden of MIM by replacing the EM algorithm with a Monte Carlo algorithm. All these strategies can be applied to our setting.

## DISCUSSION

We have described our methods in the context of an F_{2} population. All the proposed methods can be easily generalized to other crosses such as BC, F_{3}, or even more complicated crosses such as combined crosses (Zou *et al.* 2001). We may also extend our methods to accommodate covariates, such as block factors in an agriculture field trial or cage number in a mouse cross.

Symons *et al.* (2002) formulated the effects of the QTL on the failure time through the semiparametric proportional hazards model. They utilized a variant of the EM algorithm developed by Lipsitz and Ibrahim (1998), in which Monte Carlo simulation is used to approximate the conditional expectation of the complete-data partial-likelihood score function given the observed data. The solution to this conditional expectation is not a maximum partial-likelihood estimator, so that the method is not statistically efficient. The Monte Carlo simulation is time-consuming. There is no formal justification for the method of Lipsitz and Ibrahim (1998) or that of Symons *et al.* (2002). Our method is computationally much simpler than Monte Carlo simulation. It is based on the maximum-likelihood estimator and is thus statistically efficient. Furthermore, we have established the theoretical properties of our method and assessed its empirical performance through simulation studies. The method of Symons *et al.* (2002) has the advantage that the baseline hazard function is unspecified.

The proposed resampling approach to determining genome-wide significance is applicable to arbitrary genetic maps and accommodates missing marker data and dominant markers. For the CIM and MIM models, no analytical thresholds are available and the permutation method is extremely time-consuming. The proposed resampling approach can be applied to the CIM and MIM models since all the relevant formulas have been presented for an arbitrary likelihood. For the MIM method, the resampling approach produces appropriate thresholds for testing each putative QTL given the others, with or without adjustment for the fact that multiple QTL are tested simultaneously.

We have written a computer program in C to implement the proposed method. This program is available from the authors upon request.

## APPENDIX A:

### COMPUTATIONS OF MLE AND COVARIANCE MATRIX

In this section, we present the formulas for the M-step of the EM algorithm under the Weibull model. In addition, we provide the observed information matrix as well as a consistent estimator of the covariance matrix of MLE **θ̂**. In the M-step of the (*k* + 1)th iteration, the first and second derivatives of the expected value of the log-likelihood given the observed data and current estimate **θ̂**^{(}^{k}^{)} are given by A1A2where

**g** = (*g*, 1 − |*g*|)* ^{T}*, and , which is the cumulative hazard function conditional on the QTL genotype

*g*. Then, we can apply the Newton-Raphson algorithm to update the current estimate with the new maximizer

**θ̂**

^{(}

^{k}^{+1)}.

The observed information matrix is given by where for a column vector **a**, **a**^{⊗2} denotes the matrix **aa*** ^{T}*. Thus, a consistent estimator of the covariance matrix of

**θ̂**is given by the inverse of the observed information matrix evaluated at

**θ̂**,

*i.e.*, .

## APPENDIX B:

### ASYMPTOTIC PROPERTIES OF SCORE AND LIKELIHOOD-RATIO STATISTICS

In this section, we show that LR(*d*) is asymptotically χ^{2}_{2}-distributed under H_{0} and provide the necessary ingredients for deriving the thresholds. Let **U**(**θ**; *d*) and **I**(**θ**; *d*) be the observed-data score function and information matrix at location *d* with the following partitions to conform with the partition (**β**, **γ**) of **θ**, and

Under H_{0}, *n*^{−1}**I**(**θ̃**) converges to **∑**(*d*). Denote

It can be shown that asymptotically (Cox and Hinkley 1974, Sect. 9.3). Through Taylor series expansions, asymptotically, where and **θ**_{0} = (**0**, **γ**_{0}). The replacements of the unknown quantities in **Ũ*** _{i}*(

**θ**

_{0};

*d*) with their sample estimators yield .

Let . Then **z**(*d*) converges to a normal distribution with mean **0** and an identity 2 × 2 covariance matrix. Thus, LR(*d*) is asymptotically distributed as χ^{2}_{2} under H_{0}.

## APPENDIX C:

### ANALYTICAL APPROXIMATIONS OF THRESHOLDS

We show in this appendix that, for infinitely dense markers, the null distribution of LR(*d*) can be approximated by an Ornstein-Uhlenbeck process. Under H_{0}, **∑**(*d*) does not depend on *d*. Let *d*_{1} and *d*_{2} denote two points on the chromosome, and *r* be the recombination fraction corresponding to the genetic distance |*d*_{1} − *d*_{2}|. Under the assumption of no crossover interference, it is easy to show that the correlation between **g**(*d*_{1}) and **g**(*d*_{2}) is given by provided that *r* is small. Since **U**_{β,}* _{i}*(

**θ**;

*d*) = (Δ

*− Λ*

_{i}_{0}(

*Y*))

_{i}**g**

*(*

_{i}*d*) under H

_{0}, we have

The above result implies that *z*_{1}(*d*) and *z*_{2}(*d*), the first and second components of **z**(*d*), are approximately independent Ornstein-Uhlenbeck processes with means zero and variances 1 − 2*r* and 1 − 4*r*, respectively. By the arguments of Dupuis and Siegmund (1999), the tail distribution of sup* _{d}*LR(

*d*) under H

_{0}satisfies C1where Δ is the average marker distance (in morgans),

*C*is the number of chromosomes,

*L*is the total length of the genome (in morgans), and

*v*=

*v*(

*a*(6Δ)

^{1/2}), the definition of which can be found in Siegmund (1985). When Δ = 0, the above formula reduces to that of Lander and Botstein (1989). These results imply that all the analytical thresholds for the normal trait can be applied to our case.

## Footnotes

Communicating editor: G. Churchill

- Received October 31, 2003.
- Accepted June 2, 2004.

- Genetics Society of America