## Abstract

The mapping of quantitative trait loci (QTL) is an important research question in animal and human studies. Missing data are common in such study settings, and ignoring such missing data may result in biased estimates of the genotypic effect and thus may eventually lead to errant results and incorrect inferences. In this article, we developed an expectation-maximization (EM)–likelihood-ratio test (LRT) in QTL mapping. Simulation studies based on two different types of phylogenetic models revealed that the EM-LRT, a statistical technique that uses EM-based parameter estimates in the presence of missing data, offers a greater statistical power compared with the ordinary analysis-of-variance (ANOVA)-based test, which discards incomplete data. We applied both the EM-LRT and the ANOVA-based test in a real data set collected from F_{2} intercross studies of inbred mouse strains. It was found that the EM-LRT makes an optimal use of the observed data and its advantages over the ANOVA *F*-test are more pronounced when more missing data are present. The EM-LRT method may have important implications in QTL mapping in experimental crosses.

ANIMAL models and their corresponding genomes are highly useful for mapping traits that may apply to human diseases (Knoblauch and Lindpaintner 1999). Since genes are conserved throughout evolution, the identification of “evolutionary homologs” in animals is well appreciated in helping to find their counterparts in humans.

There are two primary methods for quantitative trait locus (QTL) mapping: (a) the single-marker method and (b) the interval-mapping method. The single-marker method is a traditional method for detecting the association between individual genetic markers and the quantitative trait of interest (Luo* et al.* 2000). The analysis-of-variance (ANOVA) represents the typical method applied in this kind of analysis. The interval-mapping method uses information provided by multiple linked markers to probabilistically assess potential QTL at chromosomal locations between such markers. In the interval-mapping approach developed by Lander and Botstein (1989), evidence for a putative QTL is summarized by a LOD (log of odds) score that exceeds a predefined threshold at a given chromosomal position.

The presence of missing data in studies usually lowers both the power of QTL mapping and the precision of parameter estimation, because the sample size for the incomplete data is less than it would be if the data were complete. In previous literature, the treatment of such a missing data problem is not adequate. Two simple methods have been most widely applied. One is simply to use the incomplete data by deleting all data records with any values missing, and it is called “listwise deletion.” A second approach is called “pairwise deletion,” which deletes those data records if either the phenotypic data or the genotypic data at the marker of interest are missing. In this article, we propose an expectation-maximization (EM)–likelihood-ratio test (LRT) to incorporate the flanking markers' information in the presence of missing marker data in the single-marker analysis. The LRT is derived from the maximum likelihood calculated using the EM algorithm based on all the observed data.

In the following section, we first introduce the mathematical model and notations, and then we derive the EM algorithm for maximum-likelihood estimation. Afterward, we describe the EM-LRT (or the EM-based Student's *t*-test) and the standard ANOVA-based tests (*F*-test and pairwise *t*-test). Then, we assess the validity of the EM-LRT at various sample sizes and various proportions of missing data, compare the performances of the proposed EM-based tests over the ANOVA-based tests through simulations, and evaluate whether or not it represents a more effective test for real data sets. Finally, we provide a summarization and some further discussions.

We have implemented the algorithm described in this article in the freely available statistical software R (Ihaka and Gentleman 1996). The code is available from the authors upon request.

## MATERIALS AND METHODS

### Model settings and notations:

Let us denote the genotypes at the trait marker locus A (the hypothesized QTL for the trait) as *AA*, *Aa*, and *aa*, the genotypes at its left-side flanking marker locus B as *BB*, *Bb*, *bb*, and the genotypes at its right-side flanking marker locus C as *CC*, *Cc*, and *cc* (note that we consider here only the biallelic markers, such as the simple sequence length polymorphisms). Let *Y* denote the phenotype value; let *X*_{1}, *X*_{2}, and *X*_{3} denote the respective genotype values at the loci A, B, and C, where *X*_{1} = 1, 2, and 3 denotes the three respective genotypes, *AA*, *Aa*, and *aa*, *X*_{2} = 1, 2, and 3 denotes the three respective genotypes, *BB*, *Bb*, and *bb*, and *X*_{3} = 1, 2, and 3 denotes the three respective genotypes, *CC*, *Cc*, and *cc*. Let μ* _{i}* denote

*E*(

*Y*|

*X*

_{1}=

*i*), where

*i*= 1, 2, and 3. Then, what we test here is H

_{0}: μ

_{1}= μ

_{2}= μ

_{3}(locus A is

*not*a QTL for

*Y*),

*vs.*H

_{a}: μ

_{1}, μ

_{2}, and μ

_{3}are not all equal (locus A is a QTL for

*Y*).

This hypothesis test includes the test for both dominant and additive effects of the hypothesized QTL—locus A.

In practice, the genotype measure *X*_{1} at locus A may be missing for some animals. The usual approaches for missing data such as listwise deletion and pairwise deletion would simply exclude such animals from the ANOVA-based tests, resulting in a lower power to detect the QTL. Here, we propose an EM-based approach utilizing information of incomplete data, rather than discarding it. When there are missing data at locus A, the approach makes use of genotype data not only at locus A, but also at its two most closely linked markers, loci B and C. For the three linked markers, A, B, and C, there are a total of 27 possible genotype combinations {*X*_{1} = *j*, *X*_{2} = *k*, *X*_{3} = *l*}, where *j, k, l* = 1, 2, or 3. We denote the probabilities for the occurrence of each combination as *p _{j,k,l}* = Pr(

*X*

_{1}=

*j*,

*X*

_{2}=

*k*,

*X*

_{3}=

*l*).

By assuming a standard ANOVA model relating the phenotype *Y* to the genotype *X*_{1}, we have 1where ε ∼ *N*(0, σ^{2}) and *X*_{1} can take one of the three possible genotype values of 1, 2, or 3 defined above. The complete data set in this case is {(*Y _{i}*,

*X*

_{1,}

*,*

_{i}*X*

_{2,}

*,*

_{i}*X*

_{3,}

*),*

_{i}*i*= 1, … ,

*n*} for a sample size of

*n*.

The log-likelihood of the complete data is , where and θ = (μ_{1}, μ_{2}, μ_{3}, σ^{2}, *p*_{1,1,1}, *p*_{1,1,2}, *p*_{1,1,3}, *p*_{1,2,1}, *p*_{1,2,2}, *p*_{1,2,3}, *p*_{1,3,1}, *p*_{1,3,2}, *p*_{1,3,3}, *p*_{2,1,1}, *p*_{2,1,2}, *p*_{2,1,3}, *p*_{2,2,1}, *p*_{2,2,2}, *p*_{2,2,3}, *p*_{2,3,1}, *p*_{2,3,2}, *p*_{2,3,3}, *p*_{3,1,1}, *p*_{3,1,2}, *p*_{3,1,3}, *p*_{3,2,1}, *p*_{3,2,2}, *p*_{3,2,3}, *p*_{3,3,1}, *p*_{3,3,2}, *p*_{3,3,3}). When there are missing data, 2where *l*_{i} is defined as follows. First, if the phenotype *Y _{i}* and the three genetic markers

*X*

_{1,}

*,*

_{i}*X*

_{2,}

*,*

_{i}*X*

_{3,}

*are all observed for the*

_{i}*i*th animal, obviously, 3second, if the phenotype

*Y*is observed but some genetic markers are missing for the

_{i}*i*th animal, then 4and third, if the phenotype

*Y*is missing for the

_{i}*i*th animal, 5

Here and in the following, the notation of summation ∑_{j∈X1,i} denotes the summation over all possible values of *X*_{1,}* _{i}*. For example, if

*X*

_{1,}

*is observed to be 2, then the summation contains only one case (*

_{i}*i.e.*,

*j*= 2); on the other hand, if

*X*

_{1,}

*is missing, then the summation is taken over all three possible values*

_{i}*j*= 1, 2, and 3.

We propose estimating the parameters by maximizing the log-likelihood *L* as defined in Equations 2–5 above and using the corresponding LRT in hypothesis tests.

Direct maximization of *L* is difficult, as we can see in the complicated equations [(2)–(5)] shown above. The EM algorithm (Dempster* et al.* 1977; Little and Rubin 1987) is an appropriate method for computing the maximum-likelihood estimator θ̂ when missing data are present. In the following, we first derive formulas for the EM algorithm to maximize the log-likelihood *L*. Then, we deduce the LRT using the EM estimations and compare its performance with the ordinary ANOVA-based tests.

### EM algorithm:

We now derive the formulas of the EM algorithm for this problem following standard notations (McLachlan and Krishnan 1997).

We start with an initial estimate θ^{} (which can be either the ANOVA estimate or any other reasonable estimate). At the (*m* + 1)th iteration, we update the current estimate θ^{} by completing the E-step and the M-step as follows.

#### E-step:

Compute . The computation is simplified to 6where denotes the Pr(*X*_{1,}* _{i}* =

*j*,

*X*

_{2,}

*=*

_{i}*k*,

*X*

_{3,}

*=*

_{i}*l*|observed data and θ

^{}). It can be computed according to the following formula: If

*Y*is observed, 7if

_{i}*Y*is missing, 8

_{i}Here φ{*j* ∈ *X*_{1,}* _{i}*,

*k*∈

*X*

_{2,}

*,*

_{i}*l*∈

*X*

_{3,}

*} is the indicator function whether (*

_{i}*j*,

*k*,

*l*) is a possible value for (

*X*

_{1,}

*,*

_{i}*X*

_{2,}

*,*

_{i}*X*

_{3,}

*).*

_{i}#### M-step:

Update the parameter estimate to the value that maximizes *Q*. The maximization over θ becomes rather simple if we further write out the expression

Here, obs(*Y*) denotes the set of *i*'s where *Y _{i}* is observed, and

*n*

_{obs(}

_{Y}_{)}= |obs(

*Y*)|.

The maximization of the above expression is very similar to a linear model and we find explicitly the following updating formula: 91011

The E-step and M-step are then iterated until the estimate θ^{} converges to an estimated value, θ̂.

### Hypothesis testing:

To check whether locus A is a QTL for the trait of interest, *Y*, statistically we test the hypothesis H_{0}: μ_{1} = μ_{2} = μ_{3} (locus A is *not* a QTL for *Y*) *vs.* H_{a}: μ_{1}, μ_{2}, and μ_{3} are not all equal (locus A is a QTL for *Y*). Here we first describe the ordinary ANOVA for single-marker analysis, which is the standard approach in the present literature (Rubattu* et al.* 1996; Vallejo* et al.* 1998; Poyan Mehr* et al.* 2003; Zhao and Meng 2003). When missing data are present, the ordinary ANOVA excludes all the data records with missing information on *X*_{1} or *Y*, and a subset of observations is left {(*Y _{i}*,

*X*

_{1,}

*),*

_{i}*i*= 1, … ,

*n**}, (

*n** ≤

*n*). The ordinary ANOVA then estimates the mean phenotype given the genotype data, where φ is an indicator function. The variance is estimated by

Then, an *F*-test is constructed by comparing σ̂^{2} with the between-group variance, σ̂^{2}_{b}, where . Therefore, the *F*-test statistic is constructed as 12

The *F*-test would reject H_{0} if *F* > *F*_{α;2,}_{n}_{*−3}, where *F*_{α;2,}_{n}_{*−3} is the (1 − α)100th percentile of an *F*-distribution with d.f. = 2 and (*n** − 3).

The ANOVA can also use the pairwise *t*-tests to examine the phenotypic difference between two particular genotypes. This pairwise *t*-test is used to evaluate H_{0}: μ* _{j}* = μ

_{m}vs. H_{a}: μ

*≠ μ*

_{j}*for pairs of genotypes*

_{m}*j*and

*m*(

*e.g.*,

*j*= 1 and

*m*= 2, or

*j*= 1 and

*m*= 3, or

*j*= 2 and

*m*= 3). The

*T*-statistic is calculated as 13

The *t*-test would reject H_{0} (therefore declare a phenotypic difference between genotypes *j* and *m*) when *T* > *t*_{α/2;}_{n}_{*−3}, where *t*_{α/2;}_{n}_{*−3} is the (1 − α/2)100th percentile of a *t*-distribution with d.f. = (*n** − 3).

As pointed out above, the power of the ordinary ANOVA is not optimal because it does not use information for those data records with either phenotype or genotype marker data missing. In the previous section, we proposed using the EM algorithm to incorporate information from the flanking loci (*i.e.*, B and C) in the parameter estimation. Here we describe how to use these EM-based parameter estimates to develop a statistical test that replaces the corresponding *F*-test (or the pairwise *t*-test when applicable) in the ordinary ANOVA.

Basically, the *F*-test in the ordinary ANOVA is replaced by the LRT in the EM approach as follows: (a) use the EM algorithm of (6)–(11) to find the parameter estimate θ̂, and then compute the log-likelihood *L* according to (1); (b) fit the parameters again under H_{0} (by the EM algorithm with formulas described in the next paragraph) to yield an estimate θ̂_{0}, and compute the log-likelihood *L*; and (c) compute the likelihood-ratio statistic (LRS), 14

The LRT will reject H_{0} if LRS > χ^{2}_{α}, where χ^{2}_{α} is the (1 − α)100th percentile of the χ^{2}-distribution with d.f. = 1.

The calculation of the LRS according to Equation 14 requires the calculations of both the maximum log-likelihood *L* under H_{a} and the maximum log-likelihood *L* under H_{0}. We have provided in the previous section EM formulas for fitting θ̂ in Equations 7–11. Here we describe EM formulas for fitting the parameters θ̂_{0} under H_{0}. The EM algorithm under H_{0} is simpler because μ_{1} = μ_{2} = μ_{3} = μ. Therefore, we would estimate μ by the overall sample mean under H_{0}. Correspondingly, the variance is estimated by the sample variance. That is, we can get the estimates without going through any iterations: 10′11′

Thus, for estimating *p _{j,k,l}*'s, we need to iterate only between the E-step, 8′and the M-step, 9′

The estimate θ̂_{0} consists of μ̂* _{j}* in (10′), σ̂ in (11′), and

*p̂*'s that are the values of (9′) at convergence. Then θ̂

_{j,k,l}_{0}is plugged into Equation 1 to calculate

*L*, which is then used to compute the LRS in (14).

The pairwise *t*-test in the ordinary ANOVA is replaced by a corresponding adjusted *t*-test in the EM approach. Since μ̂* _{j}* − μ̂

*= ∑*

_{m}

_{i}_{∈obs(}

_{Y}_{)}

*Y*∑

_{i}*(δ*

_{k,l}*/∑*

_{i,jkl}

_{i}_{∈obs(}

_{Y}_{)}∑

*δ*

_{k,l}*− δ*

_{i,jkl}*/∑*

_{i,mkl}

_{i}_{∈obs(}

_{Y}_{)}∑

*δ*

_{k,l}*), the variance of μ̂*

_{i,mkl}*− μ̂*

_{j}*is approximately ∑*

_{m}

_{i}_{∈obs(}

_{Y}_{)}[∑

*(δ*

_{k,l}*/∑*

_{i,jkl}

_{i}_{∈obs(}

_{Y}_{)}∑

*δ*

_{k,l}*− δ*

_{i,jkl}*/∑*

_{i,mkl}

_{i}_{∈obs(}

_{Y}_{)}∑

*δ*

_{k,l}*)]*

_{i,mkl}^{2}σ̂

^{2}. The adjusted

*t*-test statistic,

*T*, for testing the pair of genotypes

*j*and

*m*is 15where μ̂

*, μ̂*

_{j}*, and σ̂ are from the EM estimate, θ̂. The*

_{m}*t*-test would reject H

_{0}when

*T*>

*t*

_{α/2;}

_{n}_{−30}, where

*t*

_{α/2;}

_{n}_{−30}is the (1 − α/2)100th percentile of a

*t*-distribution with d.f. = (

*n*− 30).

As the proportion of missing data increases, but is kept below the upper limit such that the type I error is not inflated, we would expect the EM-LRT to perform better than the ANOVA-based test in the single-marker analysis.

### Comparison with the interval-mapping method:

The proposed EM-LRT above uses the genotype information at flanking marker loci to allow more efficient QTL detection at the trait locus when there are missing genotype or phenotype data. The idea of using genotype information at flanking marker loci for capturing information of incomplete data is similar to the idea adopted by the interval-mapping method (Lander and Botstein 1989). The interval-mapping method also uses the EM algorithm to incorporate flanking markers' genotype information for inferring the association (expressed as a LOD score) of the phenotypic trait with genetic variation at any given point between the two flanking markers, but there is a significant difference between our method and the interval-mapping method. First, the main strategy is different. Our method is exactly a single-marker test when no data are missing, and it uses information of the flanking markers only when data are missing at the marker of interest; in contrast, the interval-mapping method intends to “screen” any given point, locus *X*, in the interval bracketed by two linked markers, assuming (a) genotypic variation at such theoretical point exists and (b) its recombination rates from the two flanking markers are correctly specified. Therefore, the trait locus *X* is a putative locus and is totally unobserved, and the interval-mapping method uses recombination rates, *r*_{B} and *r*_{C}, to compute the conditional probabilities , thus reducing the number of parameters to 2. However, such reduction of the number of parameters is valid only if the underlying assumptions regarding the recombination rates (*i.e.*, *r*_{B} and *r*_{C} in Figure 1) hold. Our proposed EM-LRT, on the other hand, makes no assumptions on the recombination rates (*i.e.*, *r*_{B} and *r*_{C}), but instead it computes *p*^{j}_{k,l} through *p _{j,k,l}* = Pr(

*X*

_{1}=

*j*,

*X*

_{2}=

*k*,

*X*

_{3}=

*l*), only if there are some incomplete phenotype data or genotype data at locus A (Figure 1). For convenience of mathematical derivation, we have written our formula in terms of

*p*. Hence our EM-LRT involves 27

_{j,k,l}*p*'s and we did not reduce them to two parameters,

_{j,k,l}*r*

_{B}and

*r*

_{C}, which are used in interval-mapping methods. However, the trade-off is that our EM-LRT is more generic with no model assumptions on the specification of recombination rates: for example, for very tightly linked markers, it has been shown that the rate of recombination is no longer a monotone function of the physical distance (Thompson

*et al.*1988), and the assumption of the interval-mapping method would appear to be overly strong. Under such circumstances, when there are missing data, our EM-LRT is still valid. We therefore consider our EM-LRT as a

*complimentary*method for the interval-mapping method, particularly when markers are very densely spaced (<1 cM).

## RESULTS

### Assessment of the validity of EM-LRT in finite samples:

EM-LRT is a valid test asymptotically; however, its validity for finite sample sizes needs to be carefully checked. We used extensive simulations to assess the validity of the EM-LRT for various sample sizes under various proportions of missing data.

We simulated a data set of *n* animals with the phenotype measurement (*Y _{i}*) and three genetic markers (

*X*

_{1,}

*,*

_{i}*X*

_{2,}

*,*

_{i}*X*

_{3,}

*) for each animal*

_{i}*i*: {(

*Y*,

_{i}*X*

_{1,}

*,*

_{i}*X*

_{2,}

*,*

_{i}*X*

_{3,}

*)}, where*

_{i}*i*= 1, … ,

*n*. The phenotype

*Y*for each animal was generated according to the linear model: Equation 1, with parameters μ

_{1}= μ

_{2}= μ

_{3}= 100 and σ = 10. We assigned

*p*to be proportional to (4 −

_{j,k,l}*j*) + (4 −

*k*) + (4 −

*l*). [We initially intended to simulate

*p*proportional to

_{j,k,l}*j*+

*k*+

*l*. However, as

*j*= 1 denotes the homozygous wild-type genotype, it should have higher probability than

*j*= 3. Hence, we used the transformations (4 −

*j*) to flip the probabilities.] We then randomly dropped phenotype observations at the trait marker locus A according to a missing probability.

For each data set, we first fitted the EM estimates θ̂ through iterations of Equations 6–11. The iteration started with the initial estimates:

The iteration would stop when a convergence criterion of 10^{−4} relative change was met. Next, we fitted the EM estimates θ̂_{0} again under H_{0} through Equations 8′–11′. Then θ̂ and θ̂_{0} were used in computing the LRS in (14).

We repeatedly ran the simulation 1000 times. For each simulated data set, we computed the EM-LRT (14) and recorded their values. The empirical type I error of EM-LRT was calculated as the proportions of the 1000 data sets where H_{0} was rejected at the significance level α = 0.05.

We simulated for *n* = 50, 100, 200, 500, and 1000, respectively. For each sample size of *n*, we increased the missing probability from 10% upward, until the type I error exceeds the nominal significance level α = 0.05 significantly (that is, it exceeds by two standard deviations, ). Table 1 shows the type I error for EM-LRT for various sample sizes.

As shown in Table 1, for a small sample size (*n* = 50), the EM-LRT is valid for up to 10% missing observations. When *n* = 100, the EM-LRT is valid when as much as 20% data were missing. When *n* = 200, the EM-LRT can tolerate up to 50% missing data. These simulations showed that we have to be careful in applying the EM-LRT. For a small sample (*e.g.*, *n* = 40), which is often encountered in real-world experiments, the type I error rates were 0.060 and 0.077 for 10 and 20% missing, respectively. Thus, for *n* = 40 (see the real example in III shown below), we can still use EM-LRT if 10% or fewer observations are missing. When there are ≥200 animals, we can use the tests with up to half of all observations missing.

To evaluate the accuracy of parameter estimates, we calculated the coefficient of variability (CV) for each model parameter estimate. CV is conventionally defined as /θ, where MSE(θ̂) denotes the mean squared error of the estimate for parameter θ over 1000 simulation runs. Table 2 shows the average CV for estimates of *p _{j,k,l}*'s, μ

*'s, and σ*

_{j}^{2}. (It turned out the CVs for estimates of

*p*'s were rather similar and thus we presented only their average values.)

_{j,k,l}It can be seen that the ancillary parameters *p _{j,k,l}* were estimated less accurately compared to the estimates of the main parameters μ

*and σ*

_{j}^{2}across the board. However, because

*p*'s are parameters that are used only in the adjustment of the impacts of the missing data on the main parameters, the main parameters of interest (

_{j,k,l}*i.e.*, μ

*'s and σ*

_{j}^{2}) were not much affected by the accuracies of the estimates of

*p*'s. All parameters were estimated more accurately when the sample size

_{j,k,l}*n*became larger. As a result, the EM-LRT is a valid test for increasingly greater missing proportions as

*n*becomes larger.

### Power comparison of EM-LRT with ANOVA-based tests:

To compare the power of EM-LRT with that of the ANOVA-based test, we conducted simulation studies using two types of phylogenetic models.

#### Simulation models:

In the simulations performed, genetic markers were generated according to two phylogenetic models (Figure 2). Let *A*, *B*, and *C* denote the wild-type alleles and *a*, *b*, *c* their corresponding mutant alleles for the three loci, *A*, *B*, and *C*, respectively. We assume that the *A* → *a* event has arisen before either *B* → *b* or *C* → *c* occurred, and *B* → *b* or *C* → *c* events occurred only on the *aBC* haplotype. In model I, the *B* → *b* took place first on the ancestral haplotype *aBC*, followed by the mutation of locus *C* on the haplotype *abC*, resulting in four distinctive haplotypes: *ABC*, *aBC*, *abC*, and *abc*. In model II, the mutation at locus *B* took place first on the ancestral haplotype *aBC*, followed by the mutation of locus *C* on the haplotypes bearing either the wild-type allele (*i.e.*, *aBC*) or the mutant allele (*i.e.*, *abC*) at locus *B*, resulting in five distinctive haplotypes: *ABC*, *aBC*, *abC*, *aBc*, and *abc*.

In model I, we assume that *C* → *c* occurred *only* on the *abC* haplotype, as shown in Figure 2. Let *p _{a}* denote the proportion of the “

*a*” allele in the population,

*p*denote the probability of the

_{b}*B*→

*b*event conditional on the

*A*→

*a*event, and

*p*denote the probability of the

_{c}*C*→

*c*event conditional on the

*B*→

*b*event. Two variants of model I were considered:

model IA: the genotype measures

*X*_{1},*X*_{2}, and*X*_{3}refer to loci*A*,*B*, and*C*, respectively (*e.g.*, genotype “*aaBbCC*” corresponds to*X*_{1}= 3,*X*_{2}= 2,*X*_{3}= 1);model IB: the genotype measures

*X*_{1},*X*_{2}, and*X*_{3}refer to loci*B*,*A*, and*C*, respectively (*e.g.*, genotype*aaBbCC*now corresponds to*X*_{1}= 2,*X*_{2}= 3,*X*_{3}= 1).

In model II, we considered the case where *B* → *b* and *C* → *c* events were independent (see Figure 2); without loss of generality, we assume that *B* → *b* occurred before *C* → *c*. Under this model, *p _{a}* and

*p*were defined similarly as we defined in model I, but

_{b}*p*is defined as the probability of the

_{c}*C*→

*c*event conditional on the

*A*→

*a*event.

In our simulations, we considered the following parameter settings for *p _{a}*,

*p*, and

_{b}*p*:

_{c}*p*=

_{b}*p*= 0.8, and

_{c}*p*of values 0.1, 0.2, and 0.4. For example,

_{a}*p*= 0.2 would mean that the

_{a}*a*allele is present in 20% of the population, and hence ∼32% of the animals have the genotype

*Aa*and 4% have the genotype

*aa*.

#### Simulation and fitting procedures:

For these models, we simulated for *n* = 200: {(*Y _{i}*,

*X*

_{1,}

*,*

_{i}*X*

_{2,}

*,*

_{i}*X*

_{3,}

*)}, where*

_{i}*i*= 1, … ,

*n*. The phenotype

*Y*for each animal is again generated according to the linear model—Equation 1, with parameters μ

_{1}= 100 − Δ, μ

_{2}= 100, μ

_{3}= 100 + Δ, and σ = 10. Here we randomly dropped values from each variable with a probability,

*p*. We conducted simulations under two scenarios: (a)

_{m}*p*= 10% and (b)

_{m}*p*= 20%. Note that in our simulations used for assessing the validity of EM-LRT in finite samples, the missing proportion refers to the missing probability of

_{m}*X*

_{1}. Here,

*p*refers to the missing probability of all variables,

_{m}*Y*,

*X*

_{1},

*X*

_{2}, and

*X*

_{3}. The validity of the EM-LRT for the simulation used here was verified by checking the values of the empirical type I error rates (

*i.e.*, when Δ = 0).

For each model setting, we repeatedly ran the simulation 1000 times. For each simulated data set, we computed the EM-LRT (14) and ANOVA *F*-test (12) statistics and recorded their values. The empirical powers of the EM-LRT and *F*-test were calculated as the proportions of data sets where H_{0} was rejected at the significance level α = 0.05. Figures 3 and 4 display the empirical powers from the 1000 simulation runs for *p _{a}* = 0.2 and 0.4, respectively. The statistical powers were calculated and compared for all three models (IA, IB, and II), for various values of Δ and for different missing probabilities (10 and 20% on the left-hand and right-hand sides, respectively for Figures 3–5). The simulated Δ values were defined as

*K*σ/, where

*K*= 0, 1, 2, … For the simulation runs with

*p*= 0.1 (Figure 5), we replaced the comparison between EM-LRT (14) and ANOVA

_{a}*F*-test (12) with the comparison between the EM-adjusted

*t*-test (15) and the ANOVA

*t*-test (13) for the following reason: when the minor allele (

*a*) frequency is low (

*p*= 0.1), it would be expected that only ∼1% of animals would carry the

_{a}*aa*genotype. Since a total of 200 animals were in each simulation, there were on average <2 animals with the

*aa*genotype in most simulated data sets. In many simulation runs, there was not a single observation in the

*aa*genotype group. Therefore, in this case, the phenotypic comparison is needed only between the pair of genotypes

*AA*and

*Aa*, with respective mean values denoted as μ

_{1}and μ

_{2}. It was thus more appropriate to compare the power of the EM-adjusted

*t*-test (15) with that of the ordinary ANOVA

*t*-test (13).

#### Power comparisons:

For a hypothesis test, a type I error occurs if H_{0} is rejected when it is true. If H_{0} holds, a correct test should have a type I error rate ≤ α. The H_{0} in this case was represented by Δ = 0 (or equivalently, *K* = 0) or the left-most case in Figures 3–5. It can be seen that in those cases the empirical type I error rates for both the EM-LRT and the ANOVA *F*-test were close to α = 0.05, confirming that they were both valid tests.

The power of a test is defined as one minus the type II error. Among valid tests with correct type I error rates, it is clear that a test with a higher power is preferred. It can be seen from Figures 3 and 4 that the empirical powers of EM-LRT were higher than the empirical powers of the *F*-test. Due to simulation variations, however, a higher empirical power does not necessarily mean the real power is higher. To see whether the difference in power is statistically significant, we conducted a pairwise nonparametric test (the Wilcoxon rank-sum test) on the 1000 pairs of *P*-values for EM-LRTs and *F*-tests. The cases where the powers of EM-LRTs are statistically significantly higher are indicated by asterisks in the figures.

As illustrated in Figures 3 and 4, when *p _{m}* = 10%, the power of the EM-LRT was significantly higher than that of

*F*-test when

*K*> 3 for models IA, IB, and II. And when

*p*= 20%, the EM-LRT started to outperform the

_{m}*F*-test when

*K*= 2. Not surprisingly, the power improvement of EM-LRT over the

*F*-tests became more significant when more data were missing.

The comparison results shown in Figure 5 were similar to those of Figures 3 and 4: when 10% of data were missing, the EM-adjusted *t*-test started to significantly outperform the ordinary ANOVA *t*-test for *K* = 3 or 4; when 20% of data were missing, the better performance started when *K* = 2.

### Application to a real data set in experimental crosses:

As an illustration, we applied the proposed method to a real data set based on an F_{2} intercross study. This data set, based on a previously published report (Rosen and Williams 2001), consisted of a total of 36 mice from an F_{2} intercross between a strain with low brain weight (A/J) and a strain with high brain weight (BXD5). Brain volume, striatal volume, striatal neuron number, striatal neuron number residual, striatal volume residual, and brain weight were measured using standard procedures. We studied a total of 13 microsatellite markers—9 markers on chromosome 10 (*D10Mit106*, *D10Mit3*, *D10Mit194*, *D10Mit61*, *D10Mit186*, *D10Mit266*, *D10Mit233*, *D10Mit179*, and *D10Mit180*), and 4 markers on chromosome 18 (*D18Mit20*, *D18Mit120*, *D18Mit122*, and *D18Mit184*). The map locations of the loci studied were obtained from Ensembl (http://www.ensembl.org/Mus_musculus/).

The *P*-values of both the ANOVA *F*-test and EM-LRT are displayed in Tables 2 and 3.

Since few missing observations were present in the data, the differences in *P*-values were very small between the ANOVA *F*-tests (Table 3) and the EM-LRT (Table 4). Both methods showed that *D10Mit186* affects most phenotypes in the study. Also, two markers on chromosome 18, *D18Mit20* and *D18Mit120*, significantly affect brain weight.

To illustrate the effects of missing genotype observations, we randomly dropped 10% of the genotype observations at the interested locus and recalculated the *P*-values of the ANOVA and EM-LRT. Table 5 presents the *P*-values of the ANOVA *F*-test and EM-LRT for all phenotypes of interest with and without the dropped *D10Mit186* genotype data. Similarly, Table 6 presents the *P*-values of the ANOVA *F*-test and EM-LRT for brain weight with and without the dropped *D18Mit20* and *D18Mit120* genotype data.

As we can see from these tables, *P*-values for the ANOVA *F*-tests were more sensitive to the dropped phenotype data than were those for the EM-LRT. For example, in Table 6, the ANOVA tests are no longer able to detect the association at the α = 0.01 level with brain weight when 10% of genotype observations at the interested locus were dropped while the EM-LRT can still detect the association under the same condition. On the other hand, as shown in Table 5, the effect of dropping 10% *D10Mit186* genotype data is less pronounced. The results produced by ANOVA tests led to the same conclusions on the associations of the *D10Mit186* genotype with all the phenotypes except the striatal neuron number residual. The ANOVA test was not able to detect the association between the *D10Mit186* genotype and the striatal neuron number residual when 10% of data were missing while the EM-LRT could still detect the association. By and large, we see that the EM-LRT improves the statistical power over the case when all missing data were excluded.

## DISCUSSION

In this article, we presented an EM-LRT using flanking markers information in single-marker analysis to utilize information contained in incomplete data. By using both simulated and real data sets, we demonstrated that EM-LRT utilizing incomplete data is a valid test for finite samples with moderate proportions of missing values and is a more powerful test compared to ordinary ANOVA-based tests that discarded all missing data from the analysis.

Missing information on either genotype or phenotype can obscure the true genetic effect (Sen and Churchill 2001). To reduce the proportion of missing data, the best solution is to repeat the experiment, but it can be costly and time-consuming. The EM algorithm is a standard maximum-likelihood estimation method for handling missing data (Dempster* et al.* 1977). In the present context, the method fractionally assigns (E-step) the incomplete data to their theoretically possible values on the basis of the current estimates of the parameters and then revises the parameter estimates to maximize (the M-step) the likelihood on the basis of the pseudo-complete data. This two-step, alternating iteration procedure is repeated until convergence can be reached. Statistical theory guarantees that the observed data likelihood increases to a maximum via the algorithm, and thus the EM-LRT can be performed validly (Dempster* et al.* 1977). Likelihood methods with the EM algorithm allow the recovery of much of the lost information and make statistically efficient use of the data. In the simulated data sets, the EM-LRT outperforms the ANOVA-based tests at various marker allele frequencies, and the differences in statistical power became increasingly more pronounced with an increasing portion of missing data or an increasing value of Δ (Figures 3–5). In the real data set example on inbred mouse strains, we found that with 10% missing data the significant associations of *D18Mit20* and *D18Mit120* with brain weight could still be detected by EM-LRT, but not by ANOVA-based tests. Taken together, we argue that the EM-LRT is an attractive statistical method that can utilize information from incomplete data.

The EM-LRT is a valid test asymptotically (*i.e.*, a large *n*). For finite samples, our simulations indicated that, for *n* = 100, the method can tolerate up to 20% missing genotype data; for *n* = 200, the method can tolerate up to 50% missing genotype data. Thus, there is another potential application of the proposed EM-LRT for a combined analysis of different studies. For example, suppose in study I (with a sample size of *n*_{1}) that we already collected phenotype data and genotype data on *D10Mit61* and *D10Mit266*, and later we decide to study other nearby genetic markers, say *D10Mit186* as well as *D10Mit61* and *D10Mit266* in a new, independent study, study II (with a sample size of *n*_{2}). We might combine study I with study II by treating the *D10Mit186* genotype data as missing in study I, and then the EM-LRT can be used to detect the association between the phenotype of interest and *D10Mit186* by merging studies I and II together (with a sample size of *n*_{1} + *n*_{2}). When we use this approach to combine different studies, we have to pay particular attention to the assumption of “missing at random.” That is, the genotype missing probability is not related to the phenotype value. This can be ensured by checking that the animals in different studies come from exactly the same genetic backgrounds (*e.g.*, common F_{0} parents) under the same experimental and breeding conditions. The tests developed in this article can be applied to the combined (studies I and II altogether) data provided that each of the new markers selected is independent from study I. In other words, the new genetic marker is not selected because the flanking markers already showed associations with the phenotype in study I. If the new genetic marker is selected because of an association observed in regard to the flanking markers in study I, then a sequential design is needed. How to adjust our tests for the sequential design is an interesting research topic that deserves further investigation.

## Acknowledgments

We are grateful to the two anonymous reviewers for their comments and suggestions. We thank Glenn D. Rosen at the Beth Israel Deaconess Medical Center, Harvard Medical School for providing the mouse inbred strain data.

## Footnotes

Communicating editor: Y.-X. Fu

- Received July 7, 2003.
- Accepted September 2, 2004.

- Genetics Society of America