## Abstract

A new effective Bayesian quantitative trait locus (QTL) mapping approach for the analysis of single-tail selected samples of the phenotype distribution is presented. The approach extends the affected-only tests to single-tail sampling with quantitative traits such as the log-normal survival time or censored/selected traits. A great benefit of the approach is that it enables the utilization of multiple-QTL models, is easy to incorporate into different data designs (experimental and outbred populations), and can potentially be extended to epistatic models. In inbred lines, the method exploits the fact that the parental mating type and the linkage phases (haplotypes) are known by definition. In outbred populations, two-generation data are needed, for example, selected offspring and one of the parents (the sires) in breeding material. The idea is to statistically (computationally) generate a fully complementary, maximally dissimilar, observation for each offspring in the sample. Bayesian data augmentation is then used to sample the space of possible trait values for the pseudoobservations. The benefits of the approach are illustrated using simulated data sets and a real data set on the survival of F_{2} mice following infection with *Listeria monocytogenes*.

QUANTITATIVE trait locus (QTL) mapping methods often assume that the trait, conditionally on the effects of the QTL, follows a normal distribution. However, nonrandom missing data patterns resulting from single-tail sampling may violate this assumption. The target in single-tail sampling is to increase the expected genotype–phenotype correlation of a sample with respect to the original population parameters. By sampling (ascertaining) individuals from the right tail of the phenotype distribution, the genotype frequencies for QTL with positive phenotype effects are potentially enriched. Similarly, sampling individuals from the left tail of the phenotype distribution can increase our chances to find QTL with negative effects. Single-tail sampling may also arise from censoring or if a quantitative trait exhibits measurable values only for a portion of the individuals, *i.e.*, there is a spike in the phenotype distribution (Broman 2003). However, due to single-tail sampling, the phenotypic variation of a sample may become too small for standard QTL mapping methods to work properly, *i.e.*, the signal is totally masked by the error. Therefore current approaches to QTL mapping of data resulting from single-tail sampling of the phenotype distribution consider the deviation of the allele- (or genotype-) frequency distribution at the marker loci from their Mendelian expectation, use logistic regression-based analysis strategies, or combine both of these approaches (Henshall and Goddard 1999; Beasley *et al.* 2004; Tenesa *et al.* 2005). Alternatively one can apply nonparametric/semiparametric methods, rank-based statistical procedures, or a robust mixture model to analyze such data (Kruglyak and Lander 1995; Zou *et al.* 2002, 2003; Broman 2003; Feenstra and Skovgaard 2004). A disadvantage of these approaches is that a single-QTL model is implicitly assumed, since only a single chromosomal position is tested at a time.

As stated in Luo *et al.* (2005), the viability (survival) of an individual can be simply defined as a binary phenotype indicating whether an individual has survived (*y* = 1) or not (*y* = 0). For continuous survival (or failure) time data, such as time to tumor or time to death (measured in logarithmic scale), the single-tail sampling approach can be considered (Broman 2003). Alternatively, methods exist for survival phenotypes (Diao *et al.* 2004; Moreno *et al.* 2005). In controlled crosses, several methods have been designed specially to map viability loci, the gene positions that have an influence on the fitness or the survival of an individual (*e.g.*, Vogl and Xu 2000; Luo and Xu 2003; Luo *et al.* 2005; Nixon 2006). In outbred populations, similar/related methods are adopted to locate the signatures of selection—the genomic regions having been under selective pressure (subject to natural or artificial selection). It is well known that (1) the variability (diversity) is reduced, (2) the linkage disequilibrium is enriched, and (3) the segregation ratios depart from their Mendelian expectations in the genomic regions at the immediate surroundings of the gene positions that influence survival. The size of the effect (selection intensity) the position has on the survival can be indirectly monitored via the extent of the above influences and their decay as a function of the genetic distance. Thus, the general rationale behind the mapping methods of such loci is in testing distorted segregation, testing linkage disequilibrium patterns, or comparing levels of genetic variability between the particular genomic position and other parts of the genome or between species. Again, as a drawback, a single-QTL model is usually implicitly assumed in these methods. Moreover, a common difficulty in applying these methods in outbred populations is that the demographic history (population growth or recent expansion) leaves kinds of local signs in the genome similar to those of selection (*e.g.*, Schlötterer 2003).

For case–control and association studies of binary traits in human genetics, it is common that one samples affected individuals only (see Greenland 1999). There is the affecteds-only test for trios, where genotyped or haplotyped parents and their affected offspring are both collected (*e.g.*, Falk and Rubinstein 1987; Terwilliger and Ott 1992; Lander and Schork 1994; Gauderman *et al.* 1999). Such a test is constructed between the cases and the controls where the individuals of the control sample, so-called pseudocontrols, “artificial controls,” or “antisibs,” are created from the genetic material that was not transmitted from the parents to the cases (their chromosomes are mirror images of the case chromosomes). The information needed to generate the chromosomes for the antisibs is obtained from the parental haplotypes by taking the complement of the genetic material of the parents that was transmitted to the affected offspring (see Figure 1). The ability to derive such a complement on the basis of the parental genotype data depends only on the parental mating type for the marker (marker informativeness); *e.g.*, it is easy to derive complemental observations for mating type *AB* × *CD*. Some single-locus tests also utilize genotypic pseudoobservations in 1:3 proportions. By adopting the pseudocontrol approach, one can obtain well-matched controls and avoid spurious associations due to ethnic confounding, *i.e*., closer kinship (higher degree of background linkage disequilibrium) in the affected sample (Terwilliger and Weiss 1998).

Here we bring this antisib idea to mapping QTL in experimental crosses (*e.g.*, backcross and F_{2}) of inbred lines as well as provide a theoretical basis for applying this method to outbred populations using a multiple-QTL model. To illustrate the methodology, we extend the affected-only tests to single-tail sampling with quantitative traits. In the method, continuous-trait values are generated for all of the antisibs on the basis of Bayesian hierarchical modeling and data augmentation. For data augmentation, see Albert and Chib (1993), Rubin (1996), and Van Dyk and Meng (2001). Unlike many others who consider mapping QTL (for quantitative, viability, or survival data) from single-tail samples, we use a multiple-QTL model.

To map signatures of selection, viability, and other binary traits from case-only data (where only selected/survived individuals are in our sample and have phenotypic value one), it is straightforward to genotype individuals from the single-phenotypic group only (*i.e.*, survived individuals). In the continuous-trait case, one can selectively sample backcross or F_{2} progenies so that the (case) individuals from only one tail of the phenotype distribution are genotyped. Pseudocontrols, corresponding to observations from the other tail of the phenotype distribution, can then be created as mirror images for each case individual. The binary phenotypic value of the pseudocontrol individuals is zero and their continuous-scale observations (so-called liability values) can be predicted using data augmentation. (Note that the liability values of the case individuals are already observed.) The genotype data for these artificial observations can be created on the basis of the genotypes and linkage phases of the parents, which in inbred line-cross designs are known by definition. For example, in backcross, one can follow the principle that the mirror image of genotype *AA* is *AB* and vice versa. In F_{2}, the mirror images of the three genotypes *AA*, *AB*, *BB* are *BB*, *AB*, *AA*, respectively. The same principle applies for outbred crosses and populations where the parental genotypes and haplotypes are known or estimated.

## MODEL

#### Notation:

Let us consider an inbred line-cross experiment (*e.g.*, backcross, double haploids, or F_{2}) with *N*_{gen} possible genotypes. We assume that measurements of a quantitative trait and marker genotypes at *N* loci have been obtained from *N*_{ind} individuals sampled from a single tail of the phenotype distribution. To consider a pseudocontrol idea (Figure 1) for quantitative traits, it is easy to adopt a liability and threshold model framework (*e.g.*, Albert and Chib 1993). Further, we assume that each sample has a mirror image (hidden observation) in the unsampled part of the phenotype distribution. Using case–control and threshold model terminology this assumption means that for each case individual (whose liability value is measured) we need one control individual (whose liability value is systematically lower/greater than that in case individuals).

Denote the phenotype and marker data of the observed offspring as (*y*^{o}, *M*^{o}) and the hidden phenotype and marker measurements of the offspring data as (*y*^{h}, *M*^{h}). From here on we refer to individual *i* and its unobserved counterpart as pair *i*. (Note that survival data, where genotypes *M*^{h} are observed for censored individuals, are a special case of this setting.) The observed and hidden phenotype vectors, and include the observed and hidden phenotypes of pair *i*, and respectively. Similarly, and are the observed and hidden marker matrices where the elements, and , are the coded genotypes from the set [1, … , *N*_{gen}] for pair *i* on marker *j*. To allow that some marker genotypes may be missing among the observed half of the individuals, the incomplete form of *M*^{o} is denoted by Note that if there are no missing marker genotypes, then *M*^{o} = *M**.

To exploit the “mirroring idea” by applying data augmentation on the hidden observations (*y*^{h}, *M*^{h}), it is helpful to consider that the observed phenotypes (*y*^{o}) give rise to the discrete auxiliary variables (discrete phenotypes). Let denote a discrete phenotype vector where for individual *i*, Here *T* is a (known) discretization threshold and the binary phenotype obtains value 1 if the underlying (observed) continuous phenotype is higher than the threshold *T* and = 0 otherwise. Similarly, for the hidden observations *y*^{h}, we have a discrete vector where To conclude, in single-tail sampling, (1) the threshold *T* uniquely determines the proportion of individuals selected from the phenotype distribution, (2) all elements in vector *z*^{o} are either 0 or 1, and (3) all elements in vector *z*^{h} are either 0 or 1 and opposite to *z*^{o}. In the case that *T* is unknown, depending on which tail has been sampled we can define *T* as the smallest or the highest phenotype value.

#### Phenotype model:

We adopt the additive multiple-QTL model considered earlier by Xu (2003). This model is closely related to the model of Meuwissen *et al.* (2001) and can be viewed as a submodel of Hoti and Sillanpää (2006). Although it is straightforward to include also pairwise epistatic interaction terms in the design matrix of this model (see Zhang and Xu 2005; Xu 2007), we omit such extensions here. For considering other models and designs, see the discussion. In the model, let us assume that the putative QTL can be placed only at marker points. However, this is not a very restrictive assumption because in experimental designs, arbitrary map positions (putative QTL) can be included into the analysis as pseudomarkers (Sen and Churchill 2001). Given the overall mean α and the effect-specific coefficients β = (β_{j,k}) of marker *j*, the phenotypes *s* = *o*, *h* of the pair *i* can be expressed as(1)where the residuals (the phenotypes after correcting for QTL effects) are assumed to be normally distributed, with unknown variance Note that this same model is assumed for both the observed and the hidden phenotypes. The indicator variable if the marker observation equals genotype code *k* and is 0 otherwise. For each marker *j* we introduce the constraint β_{j,1} = 0. Thus for a backcross or double haploids, where *N*_{gen} = 2, only a single coefficient β_{j,2} at each marker is needed to capture the contrast between the two genotypes. Similar treatment for F_{2}, where *N*_{gen} = 3, leads to two coefficients, β_{j,2} and β_{j,3}, that can be estimated for each marker. Here we use a random-variance model, where the genetic coefficients β_{j,k}, for *k* > 1, are assumed to be normally distributed with unknown variances In the following, we denote all unknown QTL parameters together as where σ^{2} = is a vector of the effect-specific variances.

#### Key assumptions (which should be considered jointly because assumption 1 is a necessary condition for assumption 2):

Given haplotypes for parents, genotype data of the hidden observations is (generated to be) maximally dissimilar to the observed data (

*cf.*Terwilliger and Ott 1992): When we create a mirror image (see Figure 1), we actually produce a maximally dissimilar pseudoindividual for each observation with respect to the marker data. This means that by doing so we maximize the information content of the sample (see O'Brien and Funk 2003). Note the related approaches that try to maximize the information in the sample by selecting individuals or pairs of individuals to be phenotyped on the basis of their genetic dissimilarity (Jin*et al.*2004; Jannink 2005; Xu*et al.*2005; Fu and Jansen 2006).The phenotypes of the hidden observations are (generated using data augmentation) either on the left or the right side of the truncation point and the observed phenotypes: We briefly consider what is assumed at the genetic level, in the presence of the additive multiple-QTL model (1), when “mirroring” of the genotypes is performed. Now, we assume ordering of the phenotypes with respect to the threshold

*T*:(2)

To see what this means at the genetic level, we substitute model (1) into both sides of Equation 2 so that α cancels out, and we obtain(3)where residuals and are both independently normally distributed with mean zero. Let us consider the following cases:

When the residuals are ordered in the same way as phenotypes of Equation 2: For such pairs

*i*, Equation 3 imposes an ordering constraint for the sum of the QTL effects that the mirrored genotypes at the QTL loci need to fulfill. However, this ordering constraint is relaxed by the nonnegative factor This means that the model can cope with some proportion of phenocopies (*i.e.*, such data individuals whose phenotype is not in agreement with the QTL model).When the residuals are ordered in the opposite way as the phenotypes of Equation 2: For such pairs

*i*, the ordering constraint is adjusted by the negative factor This means that for some of the phenotypes it is required that the ordering constraint is fulfilled in a tighter form. We now represent Equation 3 as(4)where is a relaxation/adjustment factor of the ordering constraint, whose sign and size depend on the rank and the difference of the two residuals, respectively. Further understanding of this question requires simulation studies that are not in the scope of this article.

Assumptions 1 and 2 together imply that each sample has a mirror image (hidden observation) in the unsampled part of the phenotype distribution.

#### Hierarchical model:

In Bayesian analysis, the aim is to obtain an estimate for the posterior distribution of the model parameters given the data, *p*(θ, *y*^{h}, *M*^{h}, *M*^{o}, *z*^{o}, *z*^{h} | *y*^{o}, *M**). This can be achieved by using Markov chain Monte Carlo (MCMC) methods, exploiting the fact that the posterior is proportional to the joint distribution of the parameters and the data, *p*(θ, *y*^{h}, *M*^{h}, *M*^{o}, *z*^{o}, *z*^{h}, *y*^{o}, *M**). By adopting suitable conditional independence assumptions (leading to the graphical model of Figure 2), the joint distribution of the parameters and the data can be presented as
where the likelihood can be factorized asThe functional forms of *p*( | θ, ) and *p*() are normal densities of the residuals and of model (1) with mean zero and variance (see, *e.g.*, Sillanpää and Arjas 1998).

#### Constraining priors:

This model includes an exceptionally high number of constraining priors, which introduce restrictions into the (MCMC) sampling scheme and take care of the consistency between variables. Their specific forms are given below. The prior for the discretized hidden observations is where *p*(). Here, and Moreover, the prior for the discretized “nonhidden” observations is where and The markers of the pseudoobservations are created according to the mirroring prior where and Also the indicator function prior is used to ensure that the complete marker observations are compatible with the observed data.

#### Other priors:

In inbred line-cross data, the prior used to handle missing observations can be presented as a Markov chain For the actual forms of the transition probabilities ) in various designs, see Jiang and Zeng (1997) and Sillanpää and Arjas (1998). The prior for the QTL parameters can be factorized as *p*(θ) = *p*(α)*p*(β | σ^{2})*p*(σ^{2})*p*(), where *p*(α) ∝ 1, , and Further, *p*(β_{j,k} | ) is the density function of a normal distribution with mean zero and variance and is the Jeffreys scale invariant prior having most of the support (mass) in values near zero. Also, for the residual variance, we choose The use of effect-specific variance components together with Jeffreys' prior is well justified because the prior adaptively shrinks QTL variances at unlinked positions to zero—which then leads to the positioning of QTL with nonnegligible effects (see Xu 2003; Hoti and Sillanpää 2006). Note that Meuwissen *et al.* (2001) fitted a common variance for all coefficients at single locus, which, however, does not lead to an equally sparse solution. It is good to know that even if Jeffreys' prior in this context seems to work extremely nicely, theoretically the posterior is improper because Jeffreys' prior has an infinite amount of mass near zero (*e.g.*, Hopert and Casella 1996; ter Braak *et al.* 2005). One way to avoid the theoretical problem is to specify a small positive number as a lower bound for the parameter in the prior, which we, however, did not apply here.

#### Parameter estimation:

We use a MCMC algorithm (*e.g.*, Casella and George 1992; Chib and Greenberg 1995) to estimate the posterior distribution of the unknown model parameters. Here we assume that the truncation point *T* of the original population is known, equals the smallest (or the highest) phenotypic value in the data, or has been successfully estimated before the analysis. Also if the phenotypic mean of the original population is available, we can utilize it as a starting value for α; otherwise we initialize α to zero (*i.e.*, we set ). We use nonzero starting values for the variances so that nonzero values are proposed for all effects—all positions initially explain the phenotype. In the following, we outline the MCMC sampling scheme used for continuous traits. For survival data, if genotypes *M*^{h} are observed/available for censored individuals, we can omit the generation of mirror images in steps 1 and 3. For binary traits, step 4 below is replaced by the step “the updating liabilities for a binary trait” found in earlier articles (Kilpikari and Sillanpää 2003; Hoti and Sillanpää 2006):

Specify initial values , (β

_{j,k}= 0, = 0.5,*j*= 1, … ,*N*,*k*= 2, … ,*N*_{gen}), = 0.5; initialize the missing genotypes in*M*^{o}from their prior distribution; and generate mirror images*M*^{h}conditionally on*M*^{o}.Update the QTL parameters (θ) needed in the phenotype model according to the Gibbs sampling scheme outlined elsewhere (Xu 2003; Hoti and Sillanpää 2006).

Update missing values in

*M*^{o}and the corresponding mirror images in*M*^{h}using a separate Metropolis–Hastings step for each individual and each marker. Propose the genotypes from their prior distribution*p*(*M*^{o}). The acceptance ratio contains only the likelihood*p*(*y*^{o},*y*^{h}| θ,*M*^{o},*M*^{h}). Note, however, that each change in*M*^{o}also changes the value of*p*(*y*^{h}| θ,*M*^{h}) because*M*^{h}contains the mirror image of the proposed value.Update

*y*^{h}using Gibbs sampling. A new is sampled (for each individual separately) from*p*( | θ,*M*_{i}^{h}, >*T*) if the observations*y*^{o}have been collected from the right tail of the phenotype distribution and is sampled from*p*(*y*_{i}^{h}| θ,*M*_{i}^{h},*y*_{i}^{o}≤*T*) if*y*^{o}are from the left tail. The fully conditional posterior distributions are and , where is the predictive mean and ϕ(·) is the cumulative distribution function of the standard normal distribution. Similarly, as in data augmentation algorithms for binary traits (Albert and Chib 1993; Hoti and Sillanpää 2006), this Gibbs sampling step requires sampling from a truncated normal distribution (for algorithms, see Devroye 1986). Note that the condition uniquely determines the values for and which again imply the constraint for the possible values of .Repeat steps 2–4 until a prespecified number of rounds have been reached.

## DATA ANALYSIS

In the following we present example analyses and comparisons of our method under different sampling schemes (random, single-tail, and two-tail sampling), using simulated backcross data in cases of unlinked and linked QTL. We consider both the average performance (assessed by analyzing 50 or 100 data replicates) and performance under a single realization of a data set (assessed by analyzing several single data sets with small heritability in each). For reasons why correction methods based on truncated normal distribution as “incomplete-data” likelihood are not used here, see the discussion. We used unrealistically large (QTL) heritabilities and small sample sizes in our example data sets to reduce computation time when analyzing data replicates. Albeit this treatment may appear to be unrealistic, the analyses presented here arguably correspond to the analyses with smaller heritabilities and larger samples. One can use existing power tables to find rough correspondence between the two cases (*e.g.*, Van Ooijen 1992; Carbonell *et al.* 1993; Beavis 1998). For example, using a traditional approach and backcross data, the probability of success to find a QTL with heritability 0.05 in a sample of 400 is roughly comparable to that to find a QTL with heritability 0.16 in a sample of 100 (Lander and Botstein 1989). Additionally, we illustrate the performance of our method with survival data and censored observations, using previously analyzed real F_{2} mice data that have some degree of randomly missing genotypes (Broman 2003).

## SIMULATION ANALYSIS OF UNLINKED QTL

#### Simulated data:

The performance of the new approach was tested using simulated data, which were generated in two phases. First, linked marker data for a population of 250 backcross individuals were generated using the QTL Cartographer software (Basten *et al.* 1996). The produced offspring data consisted of 33 markers that span the area on three 100-cM-long chromosomes. Each chromosome had 11 equidistant markers, one every 10 cM. To generate phenotypes, we selected 3 markers (nos. 3, 17, and 30) of 33 as QTL with additive genetic effects as β_{3} = 3, β_{17} = −2, β_{30} = 1, respectively. For each individual *i*, a quantitative phenotype *y*(*i*) was generated using an additive genetic model(5)where the indicator functions and take value 1 if individual *i* has genotype *AB* at positions 3, 17, and 30, respectively. The additive error *e*(*i*) was generated from the normal distribution with mean zero and variance 4. This resulted in a heritability ∼0.5. Note that there were no missing values in the marker data.

#### Sampled subdata:

In the following, to distinguish between the original simulated data set (250 individuals) and a smaller sample (∼40 individuals), which is obtained by sampling according to the sampling scheme, we call these two alternatives data and subdata, respectively.

#### Analyses:

To demonstrate the efficiency of the proposed pseudocontrol approach and to address the sampling variation around the estimates, six different sampling schemes were compared by analyzing simulated subdata replicates. Sampling of the individuals and analysis of the resulting subdata were repeated 100 times under each sampling scheme, using 100 different simulated data sets. All simulated data sets included the same genotype data of 250 offspring (see above) but a different set of phenotypes was generated each time by using the same generating model (Equation 5). (The method is expected to be more sensitive to sampling variation due to phenotypes than due to genotypes.) In each repetition, subdata were sampled from the same phenotype distribution of the 250 individuals according to different sampling schemes. Analyses using six different schemes of sampling from the phenotype distribution were considered: (A) an analysis using a random subdata sample, (B) an analysis using a sample from both the left and the right tails, (C) an analysis using the right tail sample without doing any correction with respect to truncation, (D) the pseudoobservation analysis using the right tail sample, (E) an analysis using the left tail sample only without doing any correction with respect to truncation, and (F) the pseudoobservation analysis using the left tail sample. Analyses D and F were carried out for all the subsamples, using the approach presented in the model section. Analyses A–C, and E were carried out using our implementation of the approach of Xu (2003); see details from Hoti and Sillanpää (2006). Analyses C, D and E, F correspond to the same analysis with and without generating pseudoobservations, respectively. The truncation threshold *T*, which determines the sampled individuals in the subdata, was defined as for the right-tail sampling analyses (C and D) and as for the left-tail sampling analyses (E and F). In the two-tail sampling analysis (B), both thresholds *T*_{l} and *T*_{r} were used. Due to the resampling of the phenotype, the sample sizes used in analyses C–F varied in each repetition. Therefore, to maintain comparability between the schemes (in each repetition), the size of the random sample (A) was chosen to equal the mean of the sample sizes of the left and right tail samples (rounded upward). In the two-tail sampling analysis (B), we randomly sampled half of the individuals (rounded upward) from both tails. The sample sizes varied in the range [35, 51] with median 41, which closely coincides with the theoretical expectation (16% of the samples in a normal distribution should be beyond one standard deviation, here corresponding to 0.16 × 250 ≈ 40 individuals).

#### Results:

We implemented the methods using Matlab software on a personal computer. The posterior estimation (of the effects) for each of the 100 repetitions was based on 10,000 Markov chain Monte Carlo cycles. In each MCMC run the first 1000 initial cycles were discarded from the chain as “burn-in” rounds and thinning of 10 was applied (by saving the values at every 10th cycle) to reduce autocorrelation between the samples. Due to the rather simple data generation model, the MCMC sampler converged rapidly in all 100 cases.

Instead of using the estimated effect size directly to summarize the results, we use a standardized form of the effect size, because then the selected QTL threshold 0.1 (giving definition for QTL as in Hoti and Sillanpää 2006) is directly comparable/applicable to other traits (sampling schemes) and marker data. For a backcross, at marker *j*, the standardized effect is where is the empirical standard deviation of the genotypes at marker *j* and is the empirical standard deviation of the phenotype (calculated from augmented data in D and F). Following Hoti and Sillanpää (2006), we define an indicator variable for the event, that the absolute value of the standardized effect size is larger than the given QTL threshold 0.1. This enables us to estimate the QTL occupancy probability (as a function of the posterior distribution of the QTL effects) in models such as those of Xu (2003), which do not originally include model selection indicators. Thus, we present the results in the form of the posterior probability of the QTL occupancy *P*(|θ_{j}| > 0.1 | data), which we calculate as the proportion of MCMC rounds where |θ_{j}| > 0.1. To summarize the posterior QTL occupancy over the 100 repetitions of data for each of the six sampling schemes, we calculated the mean value of the estimated posterior QTL occupancy probability at each locus *j* as by taking the average over the 100 subdata analyses in the different sampling schemes (Figure 3). In Figure 3, the corresponding means of the standardized effects (calculated over the MCMC samples in each subdata analysis where |θ_{j}| > 0.1) are shown using the curve. At each repetition, the calculation of the QTL occupancy and the posterior mean of the estimated (standardized) effect size was based on 900 effective MCMC samples. [The reason for not taking the average over repetitions with respect to the correctly identified QTL was that we wanted especially to monitor the magnitude of the signals (*cf.* Broman and Speed 2002).]

In Figure 3, A–F, the QTL occupancy probability (bars) and the mean standardized QTL effect (curve) summarize the QTL evidence at each locus over the 100 repetitions. Note that the same scales of the *y*-axis are used throughout. As expected, the analyses without doing any correction for truncated data performed very badly, showing practically no signals (Figure 3, C and E). It becomes evident from the graphs that the single-tail sampling analyses with pseudoobservations (Figure 3, D and F) have clearly more power than the analyses based on random subdata samples (Figure 3A) or the analyses that do not utilize the pseudoobservations (Figure 3, C and E). This is in the sense that analyses (Figure 3, D and F) on average show higher or elevated signals (QTL occupancy probabilities) around the true positions (3, 17, and 30). Surprisingly, in the case of the unlinked QTL, one can even conclude that the single-tail sampling analyses with pseudoobservations (Figure 3, D and F) showed power comparable to the analysis with two-tail sampling (Figure 3B). On the other hand, the mean of the estimated standardized effect size is practically at the same level in most of the analyses, which indicates that standardized effects seem to be (on average) comparable across the different analyses. In Figure 3, D and F, note the negligible bias of the QTL position (around locus 17) in the opposite direction from the direction of sampling. The simulated effect size at position 30 was apparently very small because the position (on average) stayed undetected in most of the cases.

#### Heritability estimation:

In schemes A–F in Table 1, the posterior mean heritability was estimated using the formula where is the empirical phenotypic variance (from augmented data in D and F), is the residual variance at round *t*, and *r* is the total number of MCMC rounds (after burn-in). The mean and the standard deviation were used to summarize the distribution of the estimated heritability over the 100 repetitions for each of the six sampling schemes. The random-sampling analysis (scheme A) seemed to give (on average) slightly small heritability estimates and the standard deviation (sampling variance) was very large. The heritability estimates were negligible (or negative) in almost all analyses based on the single-tail sampling without doing any correction with respect to truncation (schemes C and E). (The negative values arise because the residual variance was not restricted in the prior to be smaller than the phenotypic variance.) On the other hand, the heritability was overestimated in most of the analyses and the sampling variance was relatively small for analyses using single- or two-tail sampling (schemes B, D, and F) and pseudoobservations.

To enable comparison to more traditional methods (based on hypothesis testing) and to obtain an understanding of their potential performance on these data, Figure 4 shows the deviation of the marker genotype frequencies from their expected values in a typical realization of the data after right-tail sampling. One can clearly see the dependence of the frequencies over linked loci and the potential difficulty to control false positives by choosing the significance threshold. By looking at these data, it is easy to understand also the value of the multiple-QTL model.

## SIMULATION ANALYSIS OF TWO LINKED QTL

#### Simulated data:

A base population of 2500 backcross individuals was generated using the QTL Cartographer software (Basten *et al.* 1996). Each individual had a single chromosome with 33 linked loci every 1 cM. Two closely linked QTL (in coupling and 14 cM apart from each other) were placed at 12 and 26 cM with additive genetic effects β_{12} = 1 and β_{26} = 1, respectively. Only 11 markers (at 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, and 31 cM) of the original 33 were included in the offspring data used in the analysis step.

We sampled 50 data replicates of size 1000 from the base population. For each replicate, a quantitative phenotype *y*(*i*) was generated using the additive genetic model(6)where the indicator functions and take value 1 only if individual *i* has genotype *AB* at positions 12 and 26, respectively. The additive error *e*(*i*) was generated from the standard normal distribution. This resulted in a heritability value ∼0.47. Also, here no missing values were introduced into the marker data.

#### Analyses:

For each of the 50 data replicates, subdata were sampled from the phenotype distribution of the 1000 individuals according to the different sampling schemes. Again, the six different analyses were considered: (A) the analysis using a random subdata sample, (B) the analysis using a sample from both left and right tails, (C) the analysis using the right-tail sample without doing any correction with respect to truncation, (D) the analysis using the right-tail sample with pseudoobservations, (E) the analysis using the left-tail sample only without doing any correction with respect to truncation, and (F) the analysis using the left-tail sample with pseudoobservations. The sample sizes used in analyses C–F varied in each repetition, with mean 170 in C and D and mean 172 in E and F. Therefore, as in the first simulation study, the size of the random sample (A) was chosen to equal the mean of the sample sizes of the left- and right-tail samples (rounded upward), which resulted in a sample size of ∼171. In the two-tail sampling analysis (B), we randomly sampled half of the individuals from both tails (rounded upward), which also resulted in a sample size of ∼171. The use of a larger sample size in the simulation analysis with linked QTL was partly motivated by not including the QTL (loci 12 and 26) into the marker set used in the analyses.

#### Results:

For each data replicate, a Matlab implementation of the method was run for 20,000 MCMC cycles from which 2000 burn-in rounds were discarded and only every 10th sample was stored (thinning), resulting in 1800 effective MCMC samples. In Figure 5, A–F, the QTL occupancy probability (bars) and the mean standardized effect (curve) over the 50 repetitions are shown at each locus for each of the sampling schemes. As expected, the QTL localization was clearly more difficult for two linked QTL (Figure 5) than it was for the unlinked QTL (Figure 3). No QTL were found in analyses C and E in Figure 5. In analyses A, B, D, and F in Figure 5, all markers in the region 10–31 cM showed elevated QTL occupancy. Arguably, in all cases (Figure 5, A, B, D, and F), the average QTL occupancy increased clearly at the flanking markers, at 10 and 13 cM and at 25 and 28 cM and showed the highest value at the marker closest to the QTL. One can conclude that in the case of two linked QTL, the single-tail sampling analyses with pseudoobservations (Figure 5, D and F) showed power roughly comparable to the analysis with two-tail sampling (Figure 5B). Further, the power for the analyses using pseudoobservations (Figure 5, D and F) was clearly better than that for the analyses that did not utilize pseudoobservations (Figure 5, C and E), but it was only slightly better than that for the analyses based on random subdata samples (Figure 5A).

#### Heritability estimation:

The mean value and the standard deviation of the posterior mean heritability calculated over the 50 repetitions of the linked QTL data are shown for all six sampling schemes (A–F) in Table 1. As earlier, the random-sampling analysis (A) resulted in (on average) small heritability estimates and in large standard deviation (sampling variance). Also, negligible (or negative) heritability estimates were obtained in the analyses that used single-tail sampling without doing any correction with respect to truncation (C and E). Again, the analyses with single- or two-tail sampling and data augmentation (B, D, and F) produced overestimated heritabilities with small standard deviations. The overestimation was highest for the two-tail sampling analysis (B). The most probable reason for the smaller standard deviations here (for linked QTL) compared to the case of unlinked QTL is that we used a larger sample size and a smaller number of repetitions.

#### Two linked QTL in repulsion (*h*^{2} ≈ 0.1):

For comparison, we also simulated five replicates of two linked QTL (in repulsion and placed on positions at 12 and 26 cM) with additive genetic effects β_{12} = −1 and β_{26} = 1, respectively. The heritability was ∼0.10. The same six schemes of sampling from the phenotype distribution were again considered for each data replicate. Also, here each analysis was run for 20,000 MCMC rounds, which resulted in 1800 effective MCMC samples (after burn-in and thinning). Any notable difference were not found in the results when compared to the case of QTL in coupling. As earlier, the flanking markers showed elevated QTL occupancy in analyses B, D, and F. The QTL evidence was notably smaller in analysis A and no QTL were found in analyses C and E. In Figure 6 (left), the QTL occupancy probabilities (bars) and the unstandardized effect estimates (curve) are shown for one of the data replicates (heritability 0.10); the scale of the *y*-axis differs from the others in Figure 6B. In Table 2, for the same data replicate, the estimated posterior means and 90% credible intervals are shown for the heritabilities and the unstandardized and standardized QTL effects (for analyses B, D, and F) at the loci with highest QTL occupancy. Note that the effect estimates of analyses D and F are surprisingly close to their true simulated values while in B they are clearly overestimated. However, the QTL were not exactly at the markers, which may partly downweight the estimates. The posterior mean heritabilities are highly overestimated in analyses B, D, and F but the true heritability value falls inside the 90% credible interval in all cases. In Table 2, the fact that the credible interval of the QTL effect includes zero is an indication of a somewhat lower QTL occupancy at the locus and therefore downward weighting of the estimate. In general, these kinds of model-averaged estimates are shown to be robust to upward bias of small-effect QTL (see Ball 2001).

#### Two linked QTL in coupling (two realistic scenarios):

To closely monitor performance of our method under a realistic single realization of a data set with large sample size and small heritability, we simulated two additional data sets, with the two linked QTL (in coupling and again placed on positions 12 and 26 cM) in each. The first data set has additive genetic effects β_{12} = 0.5 and β_{26} = 0.2, and the second set has β_{12} = 0.6 and β_{26} = 0.3, respectively. The heritabilities for the two sets were ∼0.11 and 0.15. The same six schemes of sampling from the phenotype distribution were considered for both data sets. All analyses were based on 20,000 MCMC rounds and 1800 effective MCMC samples (after burn-in and thinning). We first sampled 1000 and 1500 individuals from 2500, which was the size of the base population. The sample size in the first data set (subdata) was 162 for the left-tail sample and 161 for the right-tail sample, and in the second data set it was 232 for both the left- and the right-tail samples. Comparable sample sizes were used in other schemes. The QTL occupancy probabilities (bars) and the unstandardized effect estimates (curve) are shown in Figure 6 (center column) for the first data set (heritability 0.11) and in Figure 6 (right column) for the second data set (heritability 0.15); the scale of the *y*-axis differs from the others in Figure 6B on the right. See Table 3 for the estimated posterior means and 90% credible intervals for the heritabilities.

In Figure 6 (center column), the flanking markers showed the highest QTL occupancy probabilities for one of the simulated QTL in analyses A and B, but the other QTL was undetected. In analyses C and E, all loci produced zero QTL occupancy probabilities. In analyses D and F, the loci around simulated QTL gained elevated QTL occupancy probabilities but the highest QTL occupancy probability was not necessarily obtained for the flanking markers. In Figure 6 (right column), one of the flanking markers showed the elevated QTL occupancy probability around one of the simulated QTL in A and around both QTL in B. In contrast to E, the analysis C also showed some elevated QTL occupancy probabilities. Again in analyses D and F, the loci around simulated QTL gained elevated QTL occupancy probabilities but the highest peaks did not necessarily occur at the flanking markers. To conclude the results of mirroring analyses D and F (Figure 6, center and right columns), it seems that the position estimates can be somewhat biased in the presence of two closely linked QTL and small heritability.

The heritabilities were badly overestimated with these data in analyses B, D, and F and the true heritability falls outside the 90% credible interval in all cases (Table 3).

## SURVIVAL DATA ANALYSIS

#### Real mice data:

We selected survival data of Boyartchuk *et al.* (2001) that were previously analyzed by Broman (2003) and Diao *et al.* (2004) using several different methods. Additionally, Jin *et al.* (2007) analyzed chromosomes 1, 5, and 13 using the same data. A quantitative trait, the log time to death (in log hours) following *Listeria monocytogenes* infection, was measured from 116 female F_{2} mice where for ∼30% of the mice the phenotype was censored (they survived to the end of the experiment, 264 hr). The F_{2} mice were obtained from an intercross between BALB/cByJ and C57BL/6ByJ strains. The marker map is known and the genotypes, with some random missing values/partial information, are available at 133 markers covering chromosomes 1–19 and X. We omitted data (2 markers) on the X chomosome because it needs to be treated differently from autosomes in the QTL mapping (Broman *et al.* 2006). Also, the last marker at chromosome 19 was omitted from the mapping panel because it contained a large proportion of missing values. For simplicity we treated partial genotype information as completely missing here.

#### Analyses:

We analyzed the mice data using three different methods: (I) the Bayesian multiple-QTL analysis with only those mice that died without generating pseudoobservations; (II) the Bayesian multiple-QTL analysis with only those mice that died (with generating pseudoobservations and using the highest noncensored phenotype as *T*); and (III) the Bayesian multiple-QTL analysis with all 116 mice, using data augmentation to impute phenotypes (liabilities) for the censored mice with *T* = log(264). Analysis I was carried out using our implementation of Xu (2003); see details from Hoti and Sillanpää (2006). Analyses II and III were carried out using the approach presented in the model section. Note that approach III is in spirit similar to the one proposed for Gaussian mixed-effects models by Sorensen *et al.* (1998). In all analyses, we adopted the F_{2} transition probabilities *p*() as presented in Sillanpää and Arjas (1998).

#### Results:

The Matlab implementation of the method was run for 30,000 MCMC cycles. The first 15,000 rounds were discarded as burn-in and of the remaining samples only every 10th sample was used in the estimation. For the F_{2} design, at marker *j*, β_{j,1} = 0 for heterozygotes and the standardized effect for *k* = {2, 3} is obtained as where is the empirical standard deviation of the indicator and is the empirical standard deviation of the phenotype (calculated from augmented data in II and III). The posterior QTL occupancy probabilities *P*(|θ_{j,k}| > 0.1 | data) are separately calculated for *k* = {2, 3} over chromosomes. In Figure 7, only the maximum of the two, max[*P*(|θ_{j,2}| > 0.1 | data), *P*(|θ_{j,3}| > 0.1 | data)], is shown at each position *j* for the three different methods. Our results closely agree with the results by Broman (2003) who found QTL in chromosomes 1, 5, 13, and 15 using a single-QTL model. As in Broman (2003), our analyses support the conclusion that the QTL on chromosome 1 has an effect on the time to death only among the nonsurvivors (a peak is present only in analysis I). Similarly, QTL on chromosome 5 appear to have an effect only on the change of survival (a peak is present more or less only in analyses II and III). Again consistently with Broman (2003), QTL on chromosomes 13 and 15 have an effect on both (on the time to death among the nonsurvivors and on the change of survival; a peak is present more or less in all the analyses), where the latter chromosome actually has a weak QTL. In contrast to Diao *et al.* (2004), we did not find any QTL evidence on chromosome 6. However, a little support for QTL (with an effect only on the change of survival) was found on chromosomes 12 and 18 in analyses II and III, respectively. This may indicate higher efficiency in detecting QTL by our multiple-QTL analysis.

#### Heritability estimates:

Bayesian point estimates (posterior means) of the heritability for analyses I, II, and III were 0.32, 0.49, and 0.41, respectively. The corresponding 90% credible intervals were [0.09, 0.50], [0.34, 0.61], and [0.25, 0.57].

## DISCUSSION

In the consideration of the single-tail problem it is important to make a clear distinction between the phenotype distribution and the conditional phenotype distribution (*i.e.*, phenotype after correcting for QTL). In principle it is possible to use a truncated normal distribution function as an incomplete-data likelihood for a single-tail sample in a way similar to that of Carriquiry *et al.* (1987); see also Schmee and Hahn (1979). However, these expressions are difficult to deal with analytically, which means that the fully conditional posterior distributions are not available (Sorensen *et al.* 1998). Also such a model is very sensitive to small values of the trait (Cox and Oakes 1984). In contrast, the MCMC sampling distributions concerning QTL model parameters are unaffected by correction in correction methods based on a nontruncated normal distribution as full-data likelihood. Following the latter, we have presented a new method that can be applied together with a multiple-QTL model to improve the efficiency of QTL mapping using single-tail samples. The method effectively utilizes additional information available from the parents in the data augmentation scheme. This is done by generating artificial sample points with the genotypes obtained deductively from the parental mating type and the phenotypes via data augmentation. Generally, the use of data augmentation and missing-data imputation is very common in Bayesian analyses (*e.g.*, Albert and Chib 1993; Sillanpää and Arjas 1998; Sorensen *et al.* 1998; Baker *et al.* 2005). Note that, unlike methods that rely on the Mendelian inheritance assumption, this method can be applied to fitness traits, to loci that are associated with the survival to birth, and to loci that suffer from segregation distortion. When mapping viability (selection) in F_{2}/outbred populations, both the selection intensity and the dominance can be estimated.

#### Different QTL models and designs:

The performance of the method was demonstrated using a multiple-QTL model where only markers (or pseudomarkers) were considered as putative QTL positions (*e.g.*, Xu 2003). However, the presented data augmentation for hidden observations can in principle be applied together with the majority of the existing Bayesian QTL mapping methods for inbred line-cross data, including interval mapping (*e.g.*, Sillanpää and Arjas 1998) and epistatic models (*e.g.*, Yi and Xu 2002; Yi *et al.* 2003; Zhang and Xu 2005). It can also be added to Bayesian models that consider outbred line-cross (*e.g.*, Sillanpää and Arjas 1999) or outbred family/trio data (*e.g.*, Lee and Thomas 2000), given the known or estimated haplotypes in parents. Recall the generality of Figure 1 in generating mirror images of the genotype data. The computational feasibility of each extension depends on the available computer capacity and on the complexity of the model considered. In binary traits, the pseudocontrol sample (based on parental data) can be created directly and analyzed with standard QTL/association mapping methods and software packages designed for binary trait locus/association mapping (*e.g.*, Xu and Atchley 1996; Visscher *et al.* 1996; Yi and Xu 2000; Kilpikari and Sillanpää 2003; Sillanpää and Bhattacharjee 2005), by assuming that there are no missing marker data. In a frequentist setting, one may try to pursue the implementation of the presented method for quantitative traits by using the EM algorithm (Dempster *et al.* 1977; Pettitt 1986; Smith and Helms 1995). The underlying key assumptions made in the mirroring approach for quantitative traits are (1) given haplotypes for parents, the genotype data of the hidden observations are made to be maximally dissimilar to the observed data and (2) conditionally on genotypes, all updated (predicted) phenotypes for the hidden observations are either on the left or on the right side of the truncation point and the observed phenotypes. Note that these assumptions correspond to assuming that observations can be (in the light of the current genetic model) divided into two ordered parts: All hidden phenotypes are smaller (or greater) than the observed phenotypes. However, this assumption (see the model section) does not directly rule out certain genetic models; *e.g.*, for an additive multiple-QTL model, it imposes a constraint for the sum of the QTL effects rather than restricting each QTL individually. Similar kinds of assumption (for liabilities) are also needed for discrete/binary phenotypes in the case of a multiple-QTL model. Such assumptions do not exclude us from using, for example, epistatic QTL models in mapping but it would be valuable to assess the limitations and possible negative influences of these assumptions for several model types in the future.

#### Mapping selection in breeding populations by using single-parent data:

Gomez-Raya *et al.* (2002) proposed the method to find the loci responsible for artificial or natural selection on the basis of testing distorted segregation among selected and nonselected offspring or gametes (obtained by single-sperm typing) of widely used bulls in cattle. As stated in Goddard (2003), this method shows significant departure from the expected 1:1 ratio for two sire alleles, when measured from the selected offspring or gametes, at a locus linked to the selection. Utilization of the pseudoobservation and data augmentation scheme (for quantitative traits) together with this kind of design is also possible. As an advantage, a multiple-QTL model can be applied. Let us assume that our data consist of a small number of sires and their offspring groups, where each sire has a large number of offspring. Let us then assume that only selected offspring (with phenotype value *y* = 1 or a tail-sampled quantitative phenotype) from each sire group are genotyped. Again to form the complete mapping population, ungenotyped individuals can now be generated by applying the pseudoobservation idea. For a heterozygote sire *AB*, the pseudoobservation corresponding to the offspring genotype *A** is *B** and vice versa. (Here * indicates the other/maternal allele.)

#### Heritability and QTL-effect estimates:

As illustrated in the simulated examples, our method tends to overestimate the heritabilities; also, even if not detected here, there may still be some upward bias present in the QTL-effect estimates. In any case, the upward bias does not necessarily imply an increased number of false positives, because the calculation of the QTL occupancy is based on standardized effects. To correct for the potential bias in the heritability/effect estimates, one could estimate the heritability (or the QTL effects) separately from the selected samples using the approach of Henshall and Goddard (1999) or from the full data set where also the phenotypes of the ungenotyped individuals, if available, are included (*cf.* Lander and Botstein 1989; Xu and Vogl 2000). The latter approach should be based on a model without pseudoobservations and where updating of the genotypes of the ungenotyped individuals is done via data augmentation (without mirroring). Another equally useful approach is to estimate the QTL effects (and corresponding heritability) afterward using logistic regression, complete marker data (for the tail sample and the pseudoobservations), and the phenotype data in a discretized form. This is because the logit-link function is robust for ascertainment bias (Kagan 2001; Neuhaus 2002; Grünewald 2004). Note that different types of pseudoobservation approaches have also been applied to estimate the sampling probabilities and to correct for ascertainment bias in association studies (Clayton 2003; Grünewald 2004).

#### QTL position estimates:

QTL mapping accuracy is not reduced if 20–25% of the individuals are sampled from the high and low extremes (Darvasi 1997). However, it is known that the positions of the two linked QTL are biased in the case of selective genotyping (Lin and Ritland 1996). The same can be expected to be true (and was actually found in our examples) in the single-tail sampling using the pseudocontrol approach, because an augmented single-tail sample intuitively corresponds to data sampled from two tails. Again, to correct the bias in the position estimates of two linked QTL, one could estimate the locations from the full data set where also the phenotypes of the ungenotyped individuals, if available, are included (Ronin *et al.* 1998).

#### Efficiency of the sampling:

Selective genotyping, two-tail sampling from the phenotype distribution, has been suggested as a “state-of-the-art” sampling scheme to improve the power of the analysis (Lander and Botstein 1989; Darvasi and Soller 1992; Sen *et al.* 2005). In our simulation analyses, we were able to demonstrate comparable power using data resulting from single-tail sampling. (Improvement of the power by generating pseudoobservations for two-tail sampled data is an open question that needs to be studied in the future.) However, it is well known that the power of any selection scheme depends on the underlying genetic architecture of the trait. Due to this, the two-tail sampling approach may lead to an unexpected drop of power, for example, in the presence of epistatic interactions (see Allison *et al.* 1998). Also, if the QTL effects are small, selective genotyping does not adversely affect the detection of epistasis (Sen *et al.* 2005). The same is evidently true for the single-tail sampling approach. On the other hand, in a number of situations it is possible to obtain data from a single-tail sample only, for example, if the crossing experiments suffer from the lethal effects of inbreeding depression, in which case our method may prove to be an irreplaceable tool.

#### Survival data:

Finally, we briefly comment on analyzing survival data with our method. Diao *et al.* (2004) analyzed censored observations by using parametric proportional hazard models and Broman (2003) used a mixture model for the same purpose. Both methods assumed that genotype data have been observed from all the individuals. Additionally, Broman (2003) assumed that the censoring time is equal among all the study subjects. We made the same assumption in our analysis above. In our method, it is possible to relax both of these assumptions. To account for individual-specific censoring times *T _{i}*, Equation 2 becomes “ for all pairs ,” and the hidden observations are updated by the Gibbs sampling distribution where each individual has its own truncation point (step 4 in parameter estimation). A Matlab implementation of the method is available from the authors upon request.

## Acknowledgments

We are grateful to Karl Broman for his comments on the *Listeria monocytogenes* data and to two anonymous referees for their constructive comments on the manuscript. This work was supported by a research grant (no. 202324) from the Academy of Finland.

## Footnotes

Communicating editor: A. D. Long

- Received August 29, 2007.
- Accepted October 5, 2007.

- Copyright © 2007 by the Genetics Society of America