## Abstract

Haloperidol is an efficacious antipsychotic drug that has serious, unpredictable motor side effects that limit its utility and cause noncompliance in many patients. Using a drug–placebo diallel of the eight founder strains of the Collaborative Cross and their F_{1} hybrids, we characterized aggregate effects of genetics, sex, parent of origin, and their combinations on haloperidol response. Treating matched pairs of both sexes with drug or placebo, we measured changes in the following: open field activity, inclined screen rigidity, orofacial movements, prepulse inhibition of the acoustic startle response, plasma and brain drug level measurements, and body weight. To understand the genetic architecture of haloperidol response we introduce new statistical methodology linking heritable variation with causal effect of drug treatment. Our new estimators, “difference of models” and “multiple-impute matched pairs”, are motivated by the Neyman–Rubin potential outcomes framework and extend our existing Bayesian hierarchical model for the diallel (Lenarcic *et al.* 2012). Drug-induced rigidity after chronic treatment was affected by mainly additive genetics and parent-of-origin effects (accounting for 28% and 14.8% of the variance), with NZO/HILtJ and 129S1/SvlmJ contributions tending to increase this side effect. Locomotor activity after acute treatment, by contrast, was more affected by strain-specific inbreeding (12.8%). In addition to drug response phenotypes, we examined diallel effects on behavior before treatment and found not only effects of additive genetics (10.2–53.2%) but also strong effects of epistasis (10.64–25.2%). In particular: prepulse inhibition showed additivity and epistasis in about equal proportions (26.1% and 23.7%); there was evidence of nonreciprocal epistasis in pretreatment activity and rigidity; and we estimated a range of effects on body weight that replicate those found in our previous work. Our results provide the first quantitative description of the genetic architecture of haloperidol response in mice and indicate that additive, dominance-like inbreeding and parent-of-origin effects contribute strongly to treatment effect heterogeneity for this drug.

- iallel
- MCMC
- Collaborative Cross
- inbred strains
- haloperidol
- causal modeling
- treatment effect heterogeneity
- pharmacogenetics

DRUG treatment at its best can provide a simple solution to a complex problem. It is well recognized that diseases to which drug treatment is most commonly applied are affected by a complex interaction between multiple heritable and environmental factors; yet despite such interindividual differences in etiology, disease extent, and intrinsic susceptibility, it is similarly well recognized that a single drug or drug combination can lead to practical alleviation of symptoms if not disease reversal. Nonetheless, interindividual differences in drug response exist and can present a vexing challenge to any medical intervention. Unwanted heterogeneity in treatment effect can range from mild variation in efficacy to scenarios in which some individuals experience severe adverse drug reactions, and the lack of a means to predict or explain individual responsiveness to treatment or vulnerability to side effects frustrates medical decision making.

There is no question that interindividual differences in drug response have a heritable component (Garrod 1923; Fox 1932; Snyder 1932; Knox 1958; Alexanderson *et al.* 1969; Bertilsson *et al.* 1993; Johansson *et al.* 1993; Meyer 2004) or that quantifying those heritable effects would be useful (*e.g.*, Gamazon and Perera 2012). Yet beyond studies that focus on candidate genes (*e.g.*, Thelma *et al.* 2008; Müller *et al.* 2013), the heritable architectures of responsiveness or vulnerability are largely uncharacterized, resemble to an unknown extent the architecture of the presenting complex disease, and are thus unavailable to inform clinical decisions, pharmaceutical prioritization, or prerequisite basic research. This lack of information inevitably leads to suboptimal policy that, ineluctably, must prioritize average treatment effects across the human population over potentially more beneficial effects limited to targeted subgroups (*e.g.*, Kravitz *et al.* 2004; Wilke and Dolan 2011).

A major obstacle to characterizing heritable architecture of drug response is technical. Twin studies, a standard way of estimating heritability that can inform subsequent genomic and clinical research (Martin *et al.* 1997), offer great potential value but face challenges in this setting because they ideally require twins to have identical diseases and treatments (Vesell 1989; Ozdemir *et al.* 2005; Rahmioglu and Ahmadi 2010). Genome-wide association studies (GWAS) provide valuable information about polygenic inheritance of disease traits (*e.g.*, Purcell *et al.* 2009). However, applied to drug responsiveness, GWAS face similar technical difficulties inherent in matching (*post hoc* or otherwise) treatment regimes with disease extent, making them inefficient for estimating such pharmacogenetic quantities. Moreover, any designed study in humans estimating vulnerability to side effects faces additional ethical constraints.

A powerful but underexploited model system for characterizing the heritable architecture of drug response and vulnerability to side effects is the mouse (Cotsapas 2008). Using mouse models, many of the factors that complicate and confound human studies, including the difficulties of matching dosing regimes, environmental and social contexts, and genetic background, can be eliminated or varied in a controlled manner through appropriate experimental design. The potential translational value of mouse experiments for understanding drug-response genetics was recently demonstrated by Harrill *et al.* (2009). Investigating vulnerability to liver damage induced by acetoaminophen (paracetamol) across 36 inbred strains from the Mouse Phenome Project (MPP) (Maddatu *et al.* 2012), they found that the efficiency of drug metabolism and the shape of the dose–toxicity curve are both subject to strain-specific variation. A subsequent association study for toxicity in these strains helped identify a candidate gene involved in the immune system (not commonly assumed of direct relevance to toxicity), and the human ortholog of this gene showed a replicating association in two independent human cohorts. More recently, using 27 strains from the MPP, our group estimated additive heritability of responsiveness to drug and vulnerability to adverse drug reactions (ADRs) following treatment with haloperidol (Crowley *et al.* 2012a).

Haloperidol is a highly effective antipsychotic. Like every antipsychotic currently in use, however, haloperidol has at least one major side effect and no clinically useful predictors of vulnerability (Lieberman *et al.* 2005; Zhang and Malhotra 2011). In a subset of patients it causes disfiguring motor side effects that are collectively termed extrapyramidal symptoms (EPS) (Hsin-Tung and Simpson 2000; Dayalu and Chou 2008; Thelma *et al.* 2008). With acute treatment, ∼40% of patients experience restlessness, involuntary spasms, and/or muscular rigidity (Simpson 1970). With long-term treatment (>3 months), ∼35% of patients develop the EPS syndrome tardive dyskinesia (TD) (Hsin-Tung and Simpson 2000; Dayalu and Chou 2008), characterized by repetitive, involuntary, and purposeless movements primarily of the orofacial region, such as chewing movements and tongue protrusion (Crane 1968). In ∼50% of affected patients, TD is irreversible (Soares-Weiser and Fernandez 2007), and there is currently no validated and widely accepted treatment (Kaiser *et al.* 2002).

These extrapyramidal symptoms, the most problematic side effects of haloperidol, show evidence of familial clustering (Yassa and Ananth 1981; O’Callaghan *et al.* 1990; Müller *et al.* 2001), but their heritability in humans remains unestimated. Candidate gene studies (Kaiser *et al.* 2002; Herken *et al.* 2003; Matsumoto *et al.* 2004; Lai *et al.* 2005; Lerer *et al.* 2005; Patsopoulos *et al.* 2005; Reynolds *et al.* 2005; Bakker *et al.* 2006; Lee *et al.* 2007) and GWAS (Thelma *et al.* 2008 and references therein; Åberg *et al.* 2010; Müller *et al.* 2013 and references therein) have produced results that are largely inconsistent or that fail to reach high statistical significance, mostly owing to low power resulting from small sample size. Armed with so little information about intrinsic vulnerability, the physician not only is unable to predict whether a patient will develop TD but, in the absence of efficacious treatments, also can expect that many of those for whom haloperidol is prescribed will be left with a permanently disfiguring condition.

Herein we use an all-by-all cross of eight genetically diverse mouse strains to investigate how response to haloperidol treatment and vulnerability to side effects are affected by multiple facets of genetic architecture. Our diallel design enables us to assess not only the additive genetic components of heritability, as might be estimable from strain survey data, but also the effects of dominance, epistasis, parent of origin, and all sex-specific versions thereof. The structure of our experiment, which explicitly matches drug- and placebo-treated individuals on genetic background, sex, and parent of origin, motivates a potential outcomes model of causal treatment effects (Rubin 2005), which in turn allows us to connect our work formally with other literature on heterogeneous treatment effects in clinical trials and population-based studies. Building on our previous work investigating ADRs in mice (Crowley *et al.* 2012a,b) and developing statistical models of diallels (Lenarcic *et al.* 2012), we propose new statistical methodology for estimating the effect of genetics on the causal effect of drug treatment.

Our results represent the first attempted comprehensive estimation of gross genetic architecture of haloperidol response in mice and demonstrate, among other things, the following: additive effects on ADR susceptibility by NZO/HILtJ and 129S1/SvlmJ, with evidence of effects being magnified when inherited through the mother, and contrasting additive effects of NZO/HILtJ and C57BL/6J on plasma drug levels, with evidence of epistasis. In seeking statistical descriptions of genetic architecture relevant to future experiments, we introduce new summary measures describing genetic and epigenetic effects on treatment response. These measures are applicable to studies of pharmacoheritability (intrinsic responsiveness) and toxicoheritability (intrinsic vulnerability) more generally. Because the eight mouse lines we use are the founders of the Collaborative Cross (CC) (Collaborative Cross Consortium 2012) and Diversity Outbred (DO) (Svenson *et al.* 2012) genetic resource populations, the results of our experiment are directly relevant to the design of follow-up experiments in the CC, the DO, and their derivatives.

## Experimental Materials and Methods

### Animals

The mice used in this study consisted of inbred and reciprocal F_{1} hybrids of the eight founder strains of the Collaborative Cross (Collaborative Cross Consortium 2012). This includes five classical strains (short names in parentheses), 129S1/SvlmJ (129S1), A/J (AJ), C57BL/6J (B6), NOD/ShiLtJ (NOD), and NZO/HILtJ (NZO), and three wild-derived strains, CAST/EiJ (CAST), PWK/PhJ (PWK), and WSB/EiJ (WSB). As shown in Figure 1, we tested all eight inbreds and 54 of 56 possible reciprocal F_{1} hybrids; we excluded F_{1} hybrids NZO × CAST and NZO × PWK (female × male) because these crosses are unproductive (Chesler *et al.* 2008). A total of 270 mice were tested, including 137 females (68 drug, 69 placebo) and 133 males (66 drug, 67 placebo). All animals were bred at the University of North Carolina from parents that were fewer than six generations removed from founders acquired from the Jackson Laboratory. Pups were weaned at ≈3 weeks of age and housed two animals per cage, with one randomly assigned to receive haloperidol and the other placebo. Animals were maintained on a 14-hr light, 10-hr dark schedule with lights on at 6:00 am in a room maintained at 20°–24° with 40–50% relative humidity. Mice were housed in standard 20 × 30-cm ventilated polysulfone cages with laboratory grade Bed-O-Cob bedding. Water and Purina Prolab RMH3000 were available *ad libitum*. A small section of PVC pipe was present in each cage for enrichment. All testing procedures were conducted in strict compliance with the *Guide for the Care and Use of Laboratory Animals* (Institute of Laboratory Animal Resources, National Research Council 1996) and approved by the Institutional Animal Care and Use Committee of the University of North Carolina.

### Haloperidol treatment

The goal was to achieve a human-like steady-state concentration of haloperidol (10–50 nM or 3.75–19 ng/ml) (Hsin-Tung and Simpson 2000) for 30 days. Previous work (Crowley *et al.* 2012a,b) showed that implantable pellets from Innovative Research of America (Sarasota, FL) yielded considerably lower coefficients of variation in steady-state haloperidol concentrations than did injections, implantable minipumps, or haloperidol in drinking water. Dose-ranging studies in CC founder strains showed that delivery of 6.7 mg⋅kg^{−1}⋅day^{−1} yielded steady-state plasma haloperidol levels in the 10- to 50-nM range over a 31-day time course (using pellets from Innovative Research of America designed to release drug at a steady rate for at least 21 days). Haloperidol pellets were implanted subcutaneously with a trocar under 2 min of isoflurane anesthesia to minimize handling stress and pain (Crowley *et al.* 2012a). Two pellets of incremental dosages were implanted 2 days apart to compensate for varying body weights and to minimize acute sedation (dosing regimen as in Crowley *et al.* 2012a). Placebo-treated animals were implanted with pellets containing the same matrix material but no drug.

### Measuring drug level after treatment (plasma haloperidol and brain haloperidol)

Following 31 days of exposure to haloperidol (HAL), blood was collected into EDTA-treated tubes via tail nick and centrifuged to isolate plasma. The following day, mice were sacrificed and whole brains collected. The right hemispheric portion of the cerebellum was used for brain level measures. Haloperidol assays were performed using mass spectrometry by the Analytical Psychopharmacology Laboratory at the Nathan Kline Institute for Psychiatric Research (Orangeburg, NY).

### Extrapyramidal symptoms (EPS)

The inclined screen test (Barnes *et al.* 1990) was used as an index of Parkinsonian rigidity and sedation. Mice were placed on a wire mesh screen inclined at 45° and the latency to move all four paws was recorded (to a maximum of 300 sec). Pilot work indicated that haloperidol-induced EPS were greatest after acute, rather than chronic, drug treatment. Therefore, EPS was measured at baseline (day −5) and 48 hr after implantation of the first drug pellet (day 0).

### Open field activity (OFA)

Open field activity was measured on days −7 and +28 relative to the start of drug treatment (day 0). Spontaneous locomotor activity in the open field Crawley (1985) was measured for 30 min, using a photocell-equipped automated open field apparatus (Superflex system; Accuscan Instruments, Columbus, OH; 40-cm wide × 40-cm long × 30-cm high arena). Total distance traveled in 30 min was used as input for diallel analysis.

### Prepulse inhibition (PPI)

The acoustic startle measure was based on the reflexive whole-body flinch following exposure to a sudden noise (Dulawa and Geyer 1996). Animals were tested with a San Diego Instruments SR-Lab system (San Diego), using the procedure described by Paylor and Crawley (1997). Briefly, mice were placed in a small Plexiglas cylinder within a larger, sound-attenuating chamber. The cylinder was seated upon a piezoelectric transducer, which allowed vibrations to be quantified and displayed on a computer. Each test session consisted of 42 trials, presented following a 5-min habituation period. There were 7 different types of trials: the no-stimulus trials, trials with the acoustic startle stimulus (40 msec; 120 dB) alone, and trials in which a prepulse stimulus (20 msec; 74, 78, 82, 86, or 90 dB) had onset 100 msec before the onset of the startle stimulus. The different trial types were presented in blocks of 7, in randomized order within each block, with an average intertrial interval of 15 sec. Measures were taken of the startle amplitude for each trial, defined as the peak response during a 65-msec sampling window that began with the onset of the startle stimulus. Levels of prepulse inhibition at each prepulse sound level were calculated as 100 − [(response amplitude for prepulse stimulus and startle stimulus together/response amplitude for startle stimulus alone) ×100]. Pilot work indicated that haloperidol-induced increases in PPI were greatest after acute, rather than chronic, drug treatment. Therefore, PPI was measured at baseline (day −6) and 24 hr after implantation of the first drug pellet (day 0). Since PPIs across different prepulses were highly correlated, they were reduced to the first principal component for diallel analysis [principal component (PC)1 explained 77% of the variation among the PPI change scores].

### Vacuous chewing movements (VCM)

Orofacial observations were made on days −5 and +30 relative to the start of drug treatment (day 0). High-resolution digital videotapes of orofacial behavior were made by modifying the method of Tomiyama *et al.* (2001). These methods are described in detail in Crowley *et al.* (2012a).

## Statistical Models and Methods

We introduce statistical methodology to measure genetic effects on drug response in a diallel. Starting with the decomposition of diallel effects and the Bayesian regression algorithm developed previously by our group in Lenarcic *et al.* (2012), we use insights from the Neyman–Rubin potential outcomes framework (Rubin 1974, 2005; Holland 1986) to introduce the concept of an implied “genetic × treatment vector”. To estimate this vector, we provide two methods: a “difference of models” estimator and a “multiple-impute matched pairs” estimator. We then advise on data transformation, on the use of “gain” (or “post- minus pre-”) scores as target phenotypes, on specification of prior distributions, and on model selection. We conclude by defining the concept of a diallel treatment-response variance projection, a heritability-like measure that relates the genetic, parent-of-origin, sex, and sex-specific effects varied in the diallel to the total variance in treatment effect of the drug.

### A Bayesian model of genetic effects in the diallel

We begin by reviewing the Bayesian linear mixed model for analyzing inbred diallels proposed in Lenarcic *et al.* (2012), which is implemented as a Gibbs sampler in the R package BayesDiallel. For a single quantitative phenotype *y _{i}*, measured for individuals

*i*∈ {1, … ,

*n*}, using sex and parental strain information, effects are decomposed into

*a*additive,

*b*inbreeding,

*m*maternal parent-of-origin,

*v*symmetric cross-specific, and

*w*asymmetric cross-specific effects and sex-specific versions thereof denoted

*ϕ*,

^{a}*ϕ*,

^{m}*ϕ*,

^{b}*ϕ*, and

^{v}*ϕ*. For individual

^{w}*i*with mother

*j*and father

*k*, we model(1)where

*μ*serves as intercept and

**x**

*is an optional vector of pretreatment experimental covariates, such as information on length, weight, diet, or sleep regimen of an individual. In the above equation,*

_{i}*S*

_{sex}is a sign with for female and for males, such that, for example,

*ϕ*describes the marginal effect of being female. Cross-specific terms represent epistasis that is symmetric with respect to the parents (

*jk*and

*kj*have the same effect) or asymmetric (

*jk*and

*kj*have different effects) and are defined such that

*θ*=

_{jk}*θ*for

_{kj}*θ*∈ {

*v*,

*w*,

*ϕ*,

^{v}*ϕ*} with asymmetry induced by operator

^{w}*S*

_{j}_{<}

*, which is a sign defined as for*

_{k}*j*<

*k*and for

*j*>

*k*. For

*J*= 8 parental strains, there are 8 additive, inbreeding, and maternal coefficients, one dose effect for each strain, and 8 × 7/2 = 28 symmetric and asymmetric cross-specific coefficients. Thus, a full model that includes all possible effects has 160 random-effects coefficients, even though the diallel itself contains only 2 × 8 × 8 = 128 combinations of conditions (or 128 diallel categories). Residual, or individual-specific, noise is normally distributed,

*ε*∼

_{i}*N*(0,

*σ*

^{2}), although BayesDiallel can also model

*t*-distributed noise.

For many of the remaining methods, it suffices to describe Equation 1 more compactly as(2)where the vector **d*** _{i}* encodes all diallel parental-strain and sex information, and the vector

**β**is a vector of length 160 with coefficients for all

*a*,

*b*,

*m*,

*v*, … ,

*ϕ*effects (collectively, “diallel effects”).

_{w}### Modeling causal effects using potential outcomes

We approach estimating causal effects of drug treatment though the framework of potential outcomes (see Rubin 2005 and references therein). As postulated in Rubin (2005), let *y _{i}*{1} be the phenotype of an individual

*i*if it had been treated with the drug and let

*y*{0} be the phenotype of individual

_{i}*i*if it had instead received placebo. Define the treatment effect of the drug on individual

*i*to be the linear difference: (3)If we could observe both

*y*{1} and

_{i}*y*{0}, then measuring this treatment effect would be straightforward. But in practice, only one of

_{i}*y*{1} or

_{i}*y*{0} can ever be observed for a given individual. As a result, the treatment effect can never be observed directly—a limitation known as the “Fundamental Problem of Causal Inference” (Holland 1986, p. 947). The two quantities

_{i}*y*{1} and

_{i}*y*{0} are described as “potential outcomes”: If individual

_{i}*i*was actually assigned drug treatment, then

*y*{1} is its observed potential outcome, while

_{i}*y*{0} is its unobserved potential outcome (or “counterfactual”); if some other individual

_{i}*i*′ received placebo, then its potential outcome

*y*

_{i}_{′}{1} is unobserved.

To estimate for every individual, a potential outcomes analysis requires assumptions and a method to impute the unobserved values, *y _{i}*{1} and

*y*{0}. That is, for individual

_{i}*i*whose

*y*{1} value was observed, a model-driven estimate must be imputed. This is typically achieved either parametrically or through, for example, matching of comparable pairs of individuals (

_{i}*e.g.*, Rubin 2006). Care must be taken to consider the treatment-assignment mechanism; placebo should not, for example, be given only to individuals that inherently score higher

*y*{1} and

_{i}*y*{0}. In our study, treatment was assigned to individuals at random among genetically identical cage mates.

_{i}### Extending potential outcomes to define a genetic **×** treatment vector

The BayesDiallel model in Equation 1 was designed to model the effect of genetics on a single-outcome measure, but not to model how genetics modulate the response of individuals to a drug or, alternatively, how a drug modulates the impact of genetics. If our sole interest was to measure an “average treatment effect” (ATE), then we could easily introduce a treatment indicator with corresponding coefficient in **α** of Equation 2 or alternatively use a standard potential outcomes estimator. It may be, however, that the effect of the drug differs among well-defined subgroups of the population—specifically, differing between those with certain combinations of genetics, parent of origin, and sex. In which case, we seek not to estimate an average but rather to characterize a “heterogenous treatment effect” (HTE) (*e.g.*, Byar 1985; Longford 1999) by constructing a measure that is relatable to genetic structure of the diallel. We here modify the framework of potential outcomes to define a vector that decomposes treatment effects into its many genetic (additive, epistatic), epigenetic (parent-of-origin), and sex-specific (sex, sex-by-genetic, and sex-by-parent-of-origin) targets.

Consider two diallel models governing the phenotype under assignment of drug or placebo for individual *i*, (4)where and In the above equations, {*μ*^{drug}, *μ*^{placebo}}, {α^{drug}, α^{placebo}}, and {β^{drug}, β^{placebo}} all provide potentially different contributions in the drug-treated and placebo state, and and may also differ considerably. Taking the difference of the above equations, we observe that we can model the treatment effect as (5) (6)This allows us to identify a genetic × treatment vector (7)This vector encodes the causal effect of drug varied among all inheritance combinations and is the key quantity of interest in the analysis of drug–placebo-treated diallels. Coefficients in **β**^{treat} represent drug interaction with every diallel category: sex, genetics, parent of origin, and their combinations. For instance, is the additive effect of the B6 genome on the effect of treatment using drug rather than placebo.

Three additional treatment interaction effects are thus defined,where *μ*^{treat} represents the overall shift in intercepts of the two models, and **α**^{treat} estimates the extent that pretreatment covariates (*e.g.*, pretreatment body weight) provide additional resistance or susceptibility to drug.

Regarding the covariance of the residuals for drug and placebo potential outcomes, we make the assumption that correlation *ρ* ≥ 0. With this assumption, Equation 6 suggests that noise is greatest when *ρ* = 0, or with *ρ* = 1, then *ε*^{treat} ∼ *N*(0, (*σ*_{drug} − *σ*_{placebo})^{2}). Since setting *ρ* = 0 assumes the largest effect of noise, we rate the performance of estimators for **β**^{treat} (described next) on this worst-case scenario.

### Estimating causal effects: Difference of models approach

A straightforward approach is to fit BayesDiallel separately for drug and placebo mice to yield separate estimates of and . Because their priors are independent, posteriors for **β**^{drug} and **β**^{placebo} are also independent and separable. This means that Gibbs samples and sampled at time step *t*, can be subtracted to generate a new Gibbs sampler chain: (8)We call this a “difference of models” (DoM) approach. DoM is equivalent to fitting all observed data under the assumption that bivariate noise covariance has *ρ* = 0 (*i.e.*, potential outcomes residuals for an individual are uncorrelated). If *ρ* > 0, then DoM will overrepresent statistical uncertainty about parameters.

### Estimating causal effects: Multiple-impute matched pairs approach

In our drug–placebo diallel, every drug-treated mouse *i* with mother *j* and father *k* has at least one matching mouse *i*′ of the same sex who received placebo. This enables a “matched pairs” (MP) approach, where we estimate mouse *i*’s drug response as (9)In a completely balanced experiment with *n* mice, this produces *n*/2 drug response estimates based on *n*/2 matched pairs. Setting as the outcome in the BayesDiallel regression yields an estimator of β^{treat}.

In our experiment multiple mice were tested under each experimental condition, and the unpredictably of mouse production and maintenance led to sporadic imbalance of drug *vs.* placebo mice per diallel category. In many cases, therefore, a given mouse *i* would have multiple eligible matches. For example, with two drug-treated mice {(*A*, *B*)} and three placebo-treated mice {(*C*, *D*, *E*)}, all in the same diallel category (*e.g*., all B6 × WSB males), six matchings are possible: {(*A*, *C*), (*B*, *D*)}, {(*A*, *E*), (*B*, *D*)}, {(*A*, *D*), (*B*, *C*)}, {(*A*, *E*), (*B*, *C*)}, {(*A*, *C*), (*B*, *E*)}, and {(*A*, *D*), (*B*, *E*)}. Each of these discards one placebo-treated mouse. To ensure our matching does not induce bias, we therefore perform a multiple imputation of matchings: of the combinatoric set of all possible matchings that use as many mice as possible, we sample 10 eligible matchings, perform Gibbs sampling of the BayesDiallel model on each, and then pool the Gibbs samples, noting that all matchings should receive equal weight *a priori*. This produces our final “multiple-impute matched pairs” (MIMP) estimate of β^{treat}. Note that in the case of a single, perfectly matched sample, MIMP reduces to MP; for simulations and theory we will therefore consider MP only (and not multiple imputation).

Because in our experiment mice in the same diallel category were typically housed in the same cage, the MP implicitly controls for environmental effects of housing and, by proxy, time. Even if a large cage effect *E _{i}* perturbed a drug-treated mouse’s phenotype, such that (10)then for matching mouse

*i*′, (11)Subtracting Equation 10 from Equation 11 removes

*E*from the model. Thus, optimally designed matched-pair experiments naturally remove blocking effects from a noise model, provided that matched pairs are contained within a block.

_{i}### Transformations, epistasis, and interpreting effects

For simplicity we have thus far assumed that *y _{i}*{1},

*y*{0}, and are subject to residual noise that is normally distributed. Measures such as OFA, which counts the total number of beam breaks ∈ [0, 50000), or blood haloperidol concentration may on their original scale break this assumption. As is standard in regression, to raw phenotype values we applied suitable normalizing transformations using common monotonic functions (log,

_{i}*x*

^{1/2}, …) that improved Gaussianity of the noise.

In choosing a transformation *f*, we balance interpretability of parameter estimates with a concerted attempt to satisfy the model’s assumptions of residual normality; when the two are in strong conflict, we favor satisfying residual normality for the following reason. The BayesDiallel model includes a comprehensive set of marginal effects that try to explain phenotypic variation through the linear combination of distinct strain-specific effects; but it also provides extensive scope for identifying statistical interactions between strains in the form of cross-specific effects (*i.e.*, epistasis that may or may not depend on parent of origin and/or sex). A highly skewed phenotype when analyzed without transformation will often induce strong statistical interactions that disappear when it is reanalyzed under a normalizing transformation. These interactions are therefore “removable” (Berrington De González and Cox 2007) and potentially say more about the inadequacy of our Gaussian noise assumption than they do about genetic architecture. To ensure parsimonious inference we thus apply normalizing transformations where possible (listed in Table 1), with the implication that estimated effects for, *e.g.*, body weight combine additively on the log scale but multiplicatively on the original scale.

### Pretreatment, post-treatment, and gain scores

Phenotype measurements were available before and after treatment for some phenotypes [EPS, OFA, PPI, and vacuous chewing movements (VCM)], but not others (plasma HAL, brain HAL). The causal effect modeling described above does not require pretreatment measurements to be valid, since randomized assignment means that mice in the same diallel category can be assumed to start at the same baseline for each behavioral phenotype— at least *in expectation*. Nonetheless, including pretreatment measures in the analysis avoids reliance on this expectation, accounts better for individual noise, and therefore can be used to improve estimates of **β**^{treat}.

Consider an individual *i*, a measurable phenotype *y _{i}*, and an experiment in which the individual receives either a drug treatment or a placebo treatment. Suppose individual

*i*is assigned to receive the drug. Let be the phenotype of the individual before treatment and let be the phenotype following treatment with the drug. The “gain score” for drug-treated individual

*i*is defined as (12)This is not the same as the drug response because it incorporates effects of multiple confounding factors unrelated to the drug itself, including: the passage of time; physical aspects of the implantation procedure; learning effects on the tests, and so on. The gain score of a placebo-treated individual is defined similarly as (13)Because individuals were assigned randomly to drug or placebo, we can use the same in either case. For our causal effect modeling, we specifically model potential outcomes

*y*{1} and

_{i}*y*{0} in Equation 6, using the gain scores and

_{i}One might argue that because is not observed in the model, specifically that since (14)it is unnecessary to introduce it in the potential outcomes framework. On the contrary, the role plays in improving estimates is subtle but valuable. In the DoM approach, has lower noise variance than and therefore allows us to estimate β^{drug} and β^{placebo} more precisely. Note that when is used as an input, it must be understood that β^{drug} is a vector that models the change in performance, and not In the MP estimator, and are measured on different individuals *i* and *i*′. Thus we implicitly assume the observed gain score for individual *i*′, is close to which is the unobserved gain score for individual *i*.

In our drug–placebo diallel, the pretreatment, post-treatment, and gain score phenotypes are all potentially influenced by covariates and diallel category and so can be modeled as univariate phenotypes (as in Equation 2; see example in Table 2).

### Bayesian hierarchical priors

The BayesDiallel model includes a few coefficients that we consider “fixed”—that is, not appropriate for grouped modeling. These include *μ*, *β*_{inbred}, *ϕ*, and *ϕ*_{inbred}, which are overall effects averaged over strains. To these, we give independent vague priors of the form *N*(0, 10)^{3}. Pretreatment covariates in **α** are also typically ungrouped in this study and so are similarly modeled as independent fixed effects.

Other effects, such as the strain-specific additive effects {*a*_{1}, … , *a*_{8}} = {*a*_{AJ}, *a*_{B6}, … , *a*_{WSB}}, we consider to be associated together in a group. These are modeled as if drawn from a constrained normal distribution whose variance is itself estimated. For instance, the group prior on the *J* = 8 additive coefficients *a*_{1}, … , *a*_{8} takes the form

In the present study, we use a more efficient version of the group prior of Lenarcic *et al.* (2012); this advance is described in *Appendix A*.

Covariates **x*** _{i}* or response variables with especially large or small ranges (

*e.g.*, >10

^{5}± 10

^{3}are scaled to a more stable range [

*e.g.*, 0 ± 100 order to ensure both numerical stability and that priors on effects are only weakly informative (as in,

*e.g.*, Gelman and Hill 2007)].

### Randomization assumption

We make a last, relevant assumption to causal research, which can be interpreted as a diallel equivalent to a “stable unit treatment value assumption” (SUTVA) (Rubin 2005). SUTVA commonly assumes that treatment assignment to one individual *i* negligibly affects phenotypes of other individuals *i*′. Because placebo and drug-treated individuals in the same diallel category are caged together, we do not consider SUTVA to hold. Only individuals in different diallel categories are independent. Treatment effect is thus defined herein as being *under conditions of paired containment with an alternately treated twin*. Because the treatment assignment mechanism is fully randomized and not based upon pretreatment phenotypes, we consider the DoM and MP estimators to be valid for these conditions.

### Model selection

Estimation of diallel effects described above proceeds under the assumption that all of the 13 inheritance classes are active; that is, effects in each class make a contribution to the phenotype value that could be small but only with infinitesimal probability that is exactly zero. This assumption is justified by the experimental design, in which a relatively small number of interventions are knowingly applied (genetics, sex, parent of origin, treatment) and the goal is to estimate their effects. An alternative perspective, leading to a more parsimonious account of genetic architecture, is provided by Bayesian model selection, where each class is assigned a substantial prior probability (0.5) of making zero contribution to the phenotype (equivalent to this class being excluded from the model). We use the exclusionary Gibbs group sampler of Lenarcic *et al.* (2012) to then evaluate how much the observed data update this probability in the posterior, examining each class’s posterior model inclusion probability (MIP). MIPs near 1 provide evidence for retaining an inheritance class, MIPs near 0 provide evidence for its exclusion, and MIPs near 0.5 indicate that the observed data insufficiently support an informed decision. For reporting purposes, MIPs within the ranges (0.05, 0.25] or [0.75, 0.95) represent positive evidence, (0.01, 0.05] or [0.95, 0.99) represent strong evidence, and [0, 0.01] or [0.99, 1] represent very strong to decisive evidence, approximately following the corresponding Bayes factor interpretations in Kass and Raftery (1995).

Using the MP estimator, MIPs identify significant treatment effects; *i.e.*, When using the DoM estimator, MIP is calculated twice: once for drug-treated individuals and once for placebo-treated mice. These DoM MIPs represent the importance of a group of effects separately to a drug or a placebo model. We argue that the maximum of these two MIPs is the most practical single score for model inclusion to be used with the DoM approach, since the first goal of MIP is to find potentially activated groups of interest.

### Diallel variance projection: A heritability-like summary

In addition to reporting estimates of 160 parameters from the BayesDiallel model, it is convenient to summarize an overall relative contribution from each of the 13 inheritance classes. In explaining the phenotype, we provide a measure that aggregates the contribution of linear effects together (assessing joint contribution of all *a*_{1}, *a*_{2}, … , *a*_{8}) and also describe the contributions of multiple classes (*a*, *b*, *m*, … , *ϕ _{v}*). Superficially, this is similar to a partitioning of “heritability” (

*e.g.*, Mather and Jinks 1982; Lynch and Walsh 1998) but includes other effects, such as sex and parent of origin, that are arguably nongenetic. Our decomposition is explicitly prospective, with the practical goal of forecasting the variance contributed by each class in an idealized, future diallel of the same founders.

For a multivariate, multistrain decomposition involving design matrix **D** = (**d**_{1}, … , **d*** _{n}*)

^{T}, we apply a property of the hat matrix,

**P**=

**D**[

**D**

^{T}

**D**]

^{−1}

**D**

^{T}, which is also known as a “projection” matrix, to decompose the sum-of-squared prediction error into modeled components and noise (Neter

*et al.*1996). If

**β**

*is the subset of coefficients of*

_{a}**β**corresponding to the additive effects

*a*, and

**d**

_{i}_{,}

*are the design matrix values for individual*

_{a}*i*for additive effects, define sum-of-squares (16)and noise sum-of-squares (17)where added to SS

_{noise}equals the total sum of squares of

**y**. Since we do not know

**β**

*, but have statistical draws from a posterior for Gibbs samples*

_{a}*t*, … ,

*T*, we use those Gibbs samples to estimate and report confidence measures for the estimated sum-of-squares contributions.

Although we may not have analyzed a diallel experiment that was complete and balanced, we still choose to express predicted variance *in terms of a balanced*, *complete diallel*. We thus *project* our particular diallel experiment onto future experiments based on a standard-sized diallel population. Instead of using observed *y _{i}*, we use the Gibbs sampler to simulate that is, posterior predictive mean draws for future mice. If for

*i*∈ {1, … , 2

*J*

^{2}} is the design set of all

*i*mice in a future, full, complete balanced diallel, then our Gibbs sampler estimates for SS

*are (18)Assuming that the future experiment is designed to ensure that covariates are either absent or held at a stable value or otherwise controlled, we estimate the diallel population mean, at Gibbs iteration*

_{a}*t*, to be

For the additive inheritance class, the posterior mean proportion of variance is then (19)When reported for all inheritances classes, we call this partitioning the diallel “variance projection” (VarP), since it is both a linear algebra projection of the components of variance and a prediction for the variation of future diallels.

Note that can conceivably be negative and that credibility intervals for can include zero. This is a consequence of having multiple groups of effects (*a*, *b*, … , *ϕ _{w}*) competing against each other to predict

*Y*. In extreme cases, a parameter group, such as

_{i}*w*, will have of opposite sign from the first-order correlation This serves our purposes for a consistent statistical estimator: having credibility intervals that include zero, for instance, a [−0.01, 0.02] interval, means that we do not automatically assume that every component

*a*,

*b*, … ,

*ϕ*contributes positive, nonzero information to the prediction of

_{w}*Y*. When credibility intervals exclude zero, such as when we find statistically significant epistasis in our experimental results, we are expressing strong evidence in the data that this group of effects is important to the model. Simulations in

_{i}*Appendix B*show that this is a reliable method to test the presence of contribution to phenotypic variance.

### Genetic treatment response variability: Summary of gene **×** treatment effects

The variance projection framework extends immediately to measuring “heritability of drug response” or, more accurately, the relative contribution of diallel effects to HTE. We call this form of VarP the diallel treatment-response variance projection (TReVarP). Whereas in VarP, genetic contributions are expressed as a proportion of the total phenotypic variance Var(*Y _{i}*), in TReVarP, we target a subcomponent of this variance. For a future individual

*i*, the variance in treatment effect defined by potential outcomes is (20)where is the average performance of drug-treated individuals in the diallel and is the average of placebo-treated individuals; this is used as the denominator for TReVarP. The difference is the average treatment effect; at Gibbs sample draw

*t*we define this as (21)where the expectation is taken over noise, and is the design matrix of a complete diallel. The Monte Carlo average is then the posterior mean for the average treatment effect.

The numerator sums of squares for TReVarP are defined using Gibbs samples as, for example, (22)From Gibbs samples, we calculate the posterior mean TReVarP as (23)which can be measured with credibility intervals similarly to VarP.

The MP method can estimate the same TReVarP as the DoM by substitution of for and for Thus, our extension of VarP into TReVarP is compatible with the concept of a denominator representing only treatment-response variance as in Equation 20.

## Simulation Results

Theoretical properties of DoM and MP (i.e., MIMP with a single, perfectly matched sample), described in *Appendix C*, suggest that under maximum-likelihood estimation the two approaches give point estimates that are identical but with standard errors that are different. Under Bayesian shrinkage, and when treatment effects are smaller than genetic effects, MP estimators are seen to have less variability whereas DoM estimators have less bias. From a pure estimation standpoint, the two approaches thus have different trade-offs. Other trade-offs relate to experimental design: when placebo *vs.* control groups are difficult to match, the DoM estimator more easily accommodates extra covariates; when matched pairs are housed together, the MP estimator can cancel out environmental noise. We assess relative performance of these methods under different assumptions through simulation.

We simulated matched pairs of placebo and control animals (*n* = 115), sampling with replacement from the 128 = 2 × 8 × 8 possible diallel categories and simulating treatment effects from only three inheritance classes (Table 3).

Individuals were given i.i.d. noise of magnitude and , and every matched pair was additionally perturbed with a single *E _{i}* ∼

*N*(0, 1) environmental effect. This environmental effect is naturally canceled out by MP but cannot be disentangled using DoM. We therefore compared estimates from the two procedures, both with and without environmental noise. Treatment effects were then estimated using the “full” BayesDiallel model, which includes all parameters in Equation 1, with and without the application of model selection. We also considered the performance of an “Oracle match” model, which is a matched pairs estimate made artificially prescient that the true model contains only additive, sex × additive, and maternal effects. Average performance in 400 simulations is reported in Table 4.

All of the methods are robust in that they return ≥95% coverage of effect values. But the estimators using the full BayesDiallel model significantly overcover, with 95% credibility intervals that cover the truth 99.9% of the time. The full BayesDiallel model is inherently overspecified: even in a fully sampled, replicated diallel cross specific effects *v _{jk}* and additive effects

*a*confound with each other, in that a model composed completely of cross-specific effects could completely reproduce an additive model. In the analysis of real data sets, we are unaware of and cannot expect there to be no cross-specific terms

_{j}*v*. But, when we fit these simulated data, for which only

_{jk}*a*,

_{j}*m*, and are nonzero, the estimators using model selection do have narrower credibility intervals that also more appropriately cover the truth (96.2%, 98%, and 96.1%, respectively).

_{j}For our purposes we must accept overconservative intervals as necessary for testing a complex model that includes epistasis and realize that some true (*i.e.*, truly nonzero) effects will go undetected. We see that the percentage of true effects identified, *i.e.*, the percentage of times the 95% credibility interval for a true effect excludes zero, hovers near 90%.

Integrated *L*_{2} error was measured using the Markov chain Monte Carlo (MCMC) chains as and error of posterior mean point estimates was measured as where is the posterior mean. The number of true treatment effects identified was recorded as the number of times credibility intervals for those effects excluded zero. The full model is shown to distinguish true effects from noise >85% of the time. MIPs given for the nonzero effects average >75%, whereas MIPs for the zero effects are notably lower at 18%.

Here in its ideal setting, the MP model has superior power—even more so when given oracle information. Even though in this setting the DoM is at a disadvantage, we see that DoM is nonetheless statistically sound and does not posit epistatic effects from environmental noise. What is surprising is that DoM’s resistance to bias makes its posterior mean seem, on average, superior to the MP estimator—even seeming to benefit from environmental noise. This is a consequence of *Appendix C*'s Equation C1, which shows that the bias of DoM is less when noise is balanced, as in our environmental noise model. Note that, despite a smaller bias for its posterior mean, the DoM estimator does not have more statistical power to confirm significant effects.

## Experimental Results

A diallel cross of inbred mouse strains was generated to investigate the effect of genetics, sex, and parent of origin on response to chronic haloperidol treatment. The parents used were the eight founder strains of the Collaborative Cross. From these eight founders was generated an almost complete diallel (Figure 1), including replicates of all eight inbreds, and 54 of 56 possible reciprocal F_{1} hybrids (the 2 hybrids NZO × CAST and NZO × PWK are unproductive). For each of the 62 genetic combinations available, cage mates of each sex were randomly split into two treatment groups: drug (66 males, 68 females) and placebo (67 males, 69 females). The resulting 270 mice entered a 6-week phenotyping protocol (Figure 2, Table 1). Prior to receiving drug or placebo, mice were weighed and phenotyped for the following behavioral measures: EPS, OFA, PPI, and VCM. These measures were chosen to help quantify the severity of potential adverse reactions to subsequent haloperidol treatment. Within 48 hr of receiving drug or placebo, mice were phenotyped a second time for EPS and PPI—these phenotypes being most responsive to acute haloperidol treatment. Following ∼1 month of chronic treatment, mice were phenotyped a second time for OFA and VCM—these phenotypes being more responsive to chronic haloperidol. At this time, drug-treated mice were also assayed for haloperidol levels in brain and plasma (brain HAL and plasma HAL; see Table 1). Phenotypes collected before treatment were termed “baseline” or “pretreatment” phenotypes and those collected afterward were termed “post-treatment”. The change in value from pre- to post-treatment for a given treatment class (drug or placebo) was termed the gain score. A statistical estimate of the change in gain score moving from placebo to drug treatment was termed the “drug response” or “treatment effect”. This terminology is summarized in Table 2.

Our diallel design allowed us to contrast phenotypes of animals measured on different genetic backgrounds, alternating both parent of origin and sex. Each of the 8 × 8 × 2 = 128 combinations of these was termed a “diallel category”; for example, Figure 1 shows that mice were available for all but 5 diallel categories. Phenotypes collected on mice in each diallel category (File S2) were used to inform the BayesDiallel statistical model (see *Statistical Models and Methods*), which estimates 160 effects parameters (“diallel effects”) that together describe how much a phenotype is affected by differences in sex, parental strain combination, and their interaction. Diallel effects are grouped into 13 distinct inheritance classes, which may be further categorized into overall effects (sex, S; inbreeding, B; and sex-specific inbreeding, B_{S}), related groups of strain- or cross-specific effects (additive, a; inbreeding, b; maternal, m; symmetric epistatic, v; and asymmetric epistatic, w), and sex-specific versions of those grouped effects (a_{S}, b_{S}, m_{S}, v_{S}, and w_{S}; see, *e.g.*, first column of Table 5). Throughout, diallel effects and predictions based on the BayesDiallel model were estimated using MCMC, with estimates for each analyzed phenotype based on 2500 posterior samples collected on five independent MCMC chains.

It was expected that the large-scale intrinsic differences that accompany changes in diallel category (*e.g.*, changing parental strains, changing parent of origin between the same parental strains, changing sex, or multiples of these at once) would affect all aspects of behavior and drug response *to some degree*. The goal of this experiment was to estimate the magnitude, direction, and relative contributions of these different types of effects. From this we hoped to develop a clearer picture of how genetics, parent of origin, sex, and their interactions modulate response to haloperidol—in particular, vulnerability to its side effects.

### Overall effects of haloperidol treatment

Effects of haloperidol averaged across genotype and sex are reported in Figure 3, which for each phenotype compares gain scores of drug- and placebo-treated mice. Consistent with our previous work (Crowley *et al.* 2012a), haloperidol, on average, tended to decrease activity (OFA measures) and increase the following: extrapyramidal side effects (EPS, VCM, and stereotypy), PPI, Parkinsonism, and TD-like and antipsychotic effects. As expected, some of these outcomes were correlated (Figure 4): for example, brain and plasma drug levels (*r* = 0.46), plasma drug levels with EPS severity (*r* = 0.29), OFA measures with each other, and all five stimulus variants of PPI with each other (not shown). For subsequent analysis, PPI was represented by its first principal component, and OFA was represented only by OFA distance (Table 1; details in *Experimental Materials and Methods*).

#### Diallel effects on baseline phenotypes:

Diallel effects were estimated for each of the baseline (pretreatment) phenotypes listed in Table 1. For each phenotype, Table 5 reports the percentage of variance explained by each of the 13 diallel effect classes, along with 95% highest posterior density (HPD) intervals (akin to traditional confidence intervals). Although resembling a breakdown of the broad-sense heritability, we describe each sequence of percentages formally as a diallel VarP because it uses out data to predict variance contributions in a perfectly balanced, complete, future diallel. In our VarP results, HPD intervals for individual classes (*a*, *b*, *m*, … , *ϕ _{w}*) can include zero or stretch to negative percentages; the latter indicates with some posterior probability a class has a negligible contribution (possibly due to correlation in the design matrix with other classes). HPD intervals that do not include zero we consider to be statistically significant at 95% credibility. HPD intervals for all 160 diallel parameters, including those for each strain or strain pair, are reported in Supporting Information, File S1.

An alternative view of the results is provided by model selection. Our model selection analysis judges the evidence for and against inclusion of each class of effects under the starting condition that there is a 0.5 probability *a priori* of that class making no contribution to the phenotype whatsoever (Figure 6). The resulting posterior MIPs are interpreted as follows: high (close to 1) for classes of effect deemed essential to the model fit, low (near 0) for those with apparently negligible effects, and in the middle (∼0.5) for those whose importance, given the data collected so far, remains uncertain (more guidelines for interpretation in *Statistical Models and Methods*).

Body weight was the only nonbehavioral baseline phenotype measured and so is a useful reference point owing to its ubiquity in genetic studies. It is shown to be highly heritable (Figure 5), with noise contributing only 6–9.4% to the total variance, strong effects of additive genetics (VarP[a] ≃ 46–76%, Table 5; MIP[a] ≃ 1, Figure 6) and sex (VarP[S] ≃ 10–17%, MIP[S] ≃ 1), and a small effect of parent of origin (VarP[m] = 0.14–9%; MIP[m] = 0.85; Figure 6). The HPD intervals for the diallel effects (Figure 7) showed a pattern strikingly similar to those we reported previously on an independent diallel from the same parental strains (Lenarcic *et al.* 2012). The congruence of our results on these two populations, despite no attempt to match laboratory conditions or timing, strongly supports the robustness of our general approach.

Although pretreatment behavioral phenotypes were less heritable than body weight, diallel effects nonetheless explained between 41% and 74%, leaving 25–59% attributed to noise (Table 5).

OFA was the best-explained behavioral phenotype (posterior mean of VarP[total] ≃ 74%). It showed strong additive effects for all strains (VarP[a] = 53.2%, MIP[a] ≃ 1), with clear separation between activity-reducing effects of AJ, 129S1, and NZO and activity-increasing effects of B6, NOD, and CAST (File S1). The inbred state induced an additional activity-reducing effect for 129S1 and an activity-increasing effect for NOD and WSB (VarP[b] = 8.9%, MIP[b] ≃ 0.8). Model selection suggested decisive evidence for both symmetric and asymmetric epistasis (MIP[v] = 1, MIP[w] ≃ 1); but the percentage of variance contributed may be relatively small (VarP[v] = 3.27%, VarP[w] = 3.86%). HPD intervals for these types of epistasis included strong positive (activity-increasing) departures from zero induced by the (nonreciprocal) pairings 129S1 × AJ and NZO × B6. A small but well-supported overall effect of OFA was contributed by sex (VarP[S] = 1.7%, MIP[S] ≃ 1), but sex-specific effects of genetics or parent of origin were negligible.

EPS and PPI also showed decisive evidence of symmetric epistasis (*i.e.*, strain-pair effects that are reflected in the diagonal; MIP[v] = 1). In EPS this was accompanied by decisive evidence of asymmetric epistasis (MIP[w] ≃ 1) but, despite a strong overall effect of sex, little evidence of sex-specific strain differences. In PPI, symmetric epistasis (VarP[v] = 23.7%) explained almost as much as additive genetics (VarP[a] = 26.1%). PPI seemed to be driven by effects of NZO (+0.5 additive effect to model) and NOD (−0.5 additive), with the most significant epistatic effects coming from NOD crossed with B6 and 129S1 (File S1). A similar effect was seen in EPS, where NZO contributes a +1.5 effect, with NOD being significantly different from zero but only at a −0.5 level. For both EPS and PPI, the NZO × AJ cross appeared to contribute most significantly to symmetric and asymmetric epistasis. We investigated to what extent this could be explained through an effect of body weight, which itself is affected by NZO. In the case of PPI, when body weight is added as a covariate to the BayesDiallel model, the contributions of NZO disappear (although the strong NOD contribution remains). In the case of EPS, however, adding weight to the model does little to diminish the NZO effect (File S1). Body weight was thus seen to be a strong predictor in the PPI phenotype, albeit confounded with other genetic contributions, but not of EPS. Body weight was also not a strong predictor of baseline OFA. See File S1 for weight-adjusted results on all baseline phenotypes.

VCM was explained the least well by the diallel effects, with variance due to noise estimated at ∼60% (Table 5). Nonetheless, we conclude with confidence that VCM is at least 30% heritable, with nonzero contributions to the variance due in largest part to aggregated effects of parent of origin and symmetric epistasis.

#### Diallel effects on post-treatment drug levels:

Levels of haloperidol in plasma and brain were moderately correlated overall (*r* = 0.5, *P* < 10^{−6}; transformations applied) but showed evidence of distinct genetic architectures (Figure 5 and Figure 8). Diallel effects predicted ∼78% of the variance for plasma HAL (VarP[total] = 68–87% in Table 6) but a much smaller and less certain 24% of brain HAL (VarP[total] = 8.6–42.3%).

In plasma HAL, a powerful additive genetic effect (VarP[a] ≃ 20%, MIP[a] = 0.97) was largely driven by NZO. NZO appeared to increase the plasma drug levels of any strain it was crossed with, particularly when inherited through the maternal line (VarP[m] ≃ 9%, MIP[m] = 0.043), as indicated by the strong banding pattern in Figure 9, A and B, and the HPD intervals in Figure 9C. An additive decreasing effect on plasma HAL was exerted by B6 (Figure 9C). The additive effects of NZO and B6 are consistent with our previous report (Crowley *et al.* 2012a) in which, following chronic haloperidol treatment, drug levels were much lower in B6 (12 ± 2 nM) than in NZO (62 ± 8 nM; *P* < 0.001). Our estimates of symmetric epistasis (VarP[v] = 14.6%, MIP[v] = 0.22; File S1) indicate that regardless of parent-of-origin descent, plasma HAL tends to be lower for hybrids of PWK and NOD and high for hybrids of WSB and PWK. The sex-specific effects show that, overall or in each strain, there was little difference between males and females.

Brain haloperidol concentrations were reduced by additive effects of B6 and NOD, increased by additive effects of CAST and PWK, and generally increased in inbreds (see File S1). Other genetic effects failed to show strong deviations from zero, and model selection provided little evidence for or against most types of effects, suggesting a low signal-to-noise ratio in this phenotype.

#### Diallel effects on drug response:

For each behavioral measure, drug response was estimated as the *increase* in the pre- to post-treatment phenotype (*i.e.*, gain score) associated with drug treatment *less* that seen for placebo treatment (see Table 2 and *Statistical Models and Methods*). To these estimates of drug response we applied the BayesDiallel model, generating posterior intervals for effects of genetics, parent of origin, and sex (as in Figure 11). Estimation of diallel effects on drug response was performed using two different approaches: DoM and MIMP (see *Statistical Models and Methods* for more details). In analyses for which both were applicable, we obtained almost indistinguishable results. For this reason and reasons described below, we predominantly report results from the MIMP estimator and use DoM estimates only for special cases, such as analyses adjusting for body weight. All variations of the analyses are available, however, in File S1.

The MIMP and DoM estimators rely on different assumptions and make different trade-offs: the MIMP estimator controls for cage effects but uses only a subset of data at any one time; the DoM does not control for cage, but uses all of the data at once. The fact that they give nearly identical results reveals something about the effects present in these particular data. In our experiment, mouse numbers for different conditions were nearly balanced, and our design ensured that mice in the same diallel category were housed in the same cage for both drug and placebo treatments. The similarity between the credibility intervals of MIMP and DoM suggests that variance due to cage effects does not inflate uncertainty in the DoM estimator and is consistent with such cage effects being negligible—or at least that the DoM estimator’s reduced bias balances the MIMP estimator’s reduced variability.

The contributions of each inheritance class to drug response are summarized in Table 6. We term this decomposition the diallel TReVarP because it predicts based on the observed data how much each class of effect would influence drug response in a perfectly balanced, complete, future diallel. These contributions are further summarized as aggregated classes in Figure 5B. The results of model selection, applied to drug response via the MIMP estimator, are summarized in Figure 10 (last four columns).

Haloperidol caused high levels of rigidity (EPS; reduced latency to move on inclined screen) in mice inheriting genomic material from NZO and 129S1, with some evidence for this effect being more potent when inheritance is transmitted through the mother (horizontal banding in Figure 12A; additive and maternal effects in Figure 8 and Figure 11). The wild-derived strains CAST, PWK, and to a lesser extent WSB, by contrast, contributed additively to a resistance phenotype (Figure 11). In all, diallel effects on EPS haloperidol response predicted ∼68% of the variance in treatment effect (Figure 5), with this split largely among three classes: additive genetics (posterior mean TReVarP[a] = 28% in Table 6; MIP[a] = 1 in Figure 10), parent of origin (TReVarP[m] = 14.8%; MIP[m] ≃ 0.7), and asymmetric epistasis (a type of parent-of-origin effect; TReVarP[w] = 9.7%, MIP[w] = 0.65). Inbreeding and sex contributed negligibly (Figure 10 and Figure 11), with strong evidence against their overall effects (MIP[S] = 0.02 and MIP[B] = 0.02) and positive evidence against their interaction (MIP[B_{S}] = 0.05).

It is interesting to note that NZO and 129S1, the two strains shown to have positive additive effects on haloperidol levels, experienced the greatest increase in EPS following chronic treatment. This observation is supported by the significant overall positive correlation between plasma HAL and EPS (*r* = 0.21, *P* = 0.017 for predrug EPS; *r* = 0.275, *P* = 0.013 for postdrug EPS) and is consistent with human studies that show dose to be a major predictor of EPS liability (Hsin-Tung and Simpson 2000; Dayalu and Chou 2008). As described above, NZO strongly affects pretreatment weight, and weight significantly predicts drug levels after chronic treatment. The relationship between body weight genetics and EPS haloperidol response is therefore likely to be complex, and our design does not provide a basis for the explicit weight matching that would clarify this matter. Nonetheless, in File S1, we report a DoM analysis that includes body weight as a covariate: this produces wider intervals, reduces the effect of NZO, but mostly retains additive and maternal effects of 129S1. This weight-adjusted analysis requires careful interpretation (see *Body weight as a covariate* below), but its result is consistent with the effects of 129S1 and weight on drug response being relatively independent.

Diallel effects on locomotor activity (OFA) in response to haloperidol contrasted primarily among different strains of inbreds (posterior mean TReVarP[b] = 12.8% in Table 6; MIP[b] = 0.56 in Figure 10). This contrast was driven mostly by the 129S1 inbreds being most sensitive to drug-induced reduction in OFA and the WSB inbreds being most resistant (Figure 8). Specifically, in response to drug, the 129S1 inbreds reduced OFA more than would be expected based on additive effects (Figure 12 and Figure 13), whereas the WSB inbred was marginally more resistant than expected (Figure 12 and Figure 13). Across all inbreds, activity in response to drug reduced more in males than in females (female.inbreed estimate in Figure 13; bottom two categories in OFA part of Figure 8; MIP[B_{S}] = 0.65). We found little evidence for or against diallel effects unrelated to inbreeding (*e.g.*, Figure 5, where inbreeding is categorized within epistasis).

Undersampling of PPI and VCM measurements (*i.e.*, too few mice measured post-treatment) makes estimation of diallel effects on drug response for these phenotypes relatively imprecise. For PPI, diallel effects predicted 40–83% of the variance in drug response (posterior mean TReVarP[total] = 61.8%; Table 6); for VCM, they predicted 35–75% (TReVarP[total] = 54.5%). For PPI drug response, we could not rule out either symmetric epistasis (TReVarP[v] = 0.44–27.57%; MIP[v] = 0.45) or asymmetric epistasis (TReVarP[w] = 0.33–25%; MIP[w] = 0.51), but found strong evidence against overall effects of both sex (MIP[S] = 0.012) and inbreeding (MIP[B] = 0.024). Some combinations of effects were well informed: compared with other diallel categories, PPI was strongly reduced in inbred B6 females (Figure 8). HPDs for diallel effects from the MIMP estimator (File S1) mostly settle around zero, but there is a weak pattern of B6 epistasis with other strains (including PWK, NZO, and WSB). MIMP analysis of drug effect on VCM showed strong evidence against an overall effect of sex (MIP[S] = 0.013) but otherwise little evidence for or against other diallel effect groups.

### Body weight as a covariate

The eight CC parental strains demonstrate about a fourfold range in body weight, with the three wild-derived strains being the lightest and NZO by far the heaviest. Since NZO had such high additive dosage effects on plasma HAL, we decided to take a closer look at the relationship between (pretreatment) body weight and all other phenotypes. As shown in Figure 14, body weight correlates positively with plasma HAL (*r* = 0.17, *P* < 0.046). For OFA, the picture is more complex: weight is negatively correlated with pretreatment OFA (*r* = −0.37, *P* < 1 × 10^{−6}), negatively correlated with gain postdrug (*r* = −0.38, *P* = 0.00011), but uncorrelated with gain post-placebo (*r* = −0.01, *P* = 0.89). Similarly, weight is positively correlated with EPS pretreatment (*r* = 0.42, *P* < 10^{−6}) and gain postdrug (*r* = 0.46, *P* = 3.6 × 10^{−6}), but uncorrelated with gain post-placebo (*r* = −0.03, *P* = 0.76). A correlation between body weight and drug response does not indicate causation because genetic effects on the two cannot be disentangled. Nonetheless, we may still use weight as a covariate in diallel analysis to support hypothesis testing. As shown in Figure 9D, adding weight as a covariate for plasma HAL diminishes the additive and maternal effects of the heaviest strain, NZO, but does not diminish the concentration-lowering effect of B6. In File S1, we present HPD plots with and without body weight as a covariate for all analyses (except MIMP, for which weight adjustment outside of the matching procedure itself would be inappropriate). DoM analyses of drug response showed that in EPS weight was confounded primarily with effects attributed to NZO and that weight adjustment in OFA did not alter the clear (non-NZO driven) pattern of strain-specific inbreeding effects described above.

## Discussion

We have used a quantitative genetics approach, applying causal reasoning to an existing Bayesian hierarchical model, to estimate genetic, parent-of-origin and sex-specific effects on haloperidol-induced adverse drug reactions in the founder mouse strains of the Collaborative Cross and their F_{1} hybrids. Through a large diallel, we generated offspring whose response to haloperidol treatment showed substantial heterogeneity. By examining behavioral phenotypes before and after treatment in both drug and placebo groups, we were able to separate diallel effects on behavior from diallel effects on behavioral response to drug. We could thus estimate drug response (or treatment effect) as the response specifically induced by haloperidol and not by other factors concomitant with treatment. In doing so we found evidence that baseline and drug response phenotypes have distinct genetic architectures.

Our most informed drug response phenotypes, EPS and OFA, showed contrasting patterns of diallel effects. For EPS drug response, the ∼70% of explained variance could be mostly attributed to additive genetics and parent of origin. Severity of EPS following acute treatment was strongly increased by additive genomic contributions of NZO and 129S1, with evidence of 129S1’s effect being stronger still when inherited through the maternal line; it was decreased by additive contributions from the wild-derived strains CAST, PWK, and WSB. This separation of effects among the strains could mean a relatively small number of variants explaining a relatively large amount of variance and motivates future mapping studies.

For OFA drug response, by contrast, we found evidence against substantive effects of additive genetics. We instead found evidence for an effect of inbreeding, especially in males. Indeed, the 129S1 inbred was particularly susceptible to the activity-lowering effects of haloperidol, an observation consistent with recessivity or canalizing epistasis, whereby potentially multiple haplotypes in the 129S1 genome that would otherwise induce susceptibility are neutralized when combined in an F_{1} hybrid with other CC strains. This too suggests potential value in follow-up mapping studies, *e.g.*, backcrossing 129S1 with other strains.

For baseline behavioral phenotypes, our results show strong to decisive evidence of additive genetics and epistasis, with nonreciprocal epistatic interactions making an especially large relative contribution to prepulse inhibition (23.7% of the predicted variance). To our knowledge, this is the first time EPS, PPI, or VCM has been studied in a diallel. Our results for baseline OFA provide an overdue update to older work on activity in inbred diallels (*e.g.*, Henderson 1967; Halcomb *et al.* 1975; Crusio *et al.* 1984), and our finding that baseline and drug-response OFAs have distinct genetic architectures is in line with much earlier diallel studies on OFA response to amphetamine (Anisman 1976; Kitahama and Valatx 1979) and nicotine (Marks *et al.* 1986). Our finding of pervasive epistasis, at least among behavioral phenotypes, is in line with that recently described for other model organisms (*e.g.*, Zwarts *et al.* 2011; Huang *et al.* 2012).

In general, behavior and behavioral responses to haloperidol were explained only in part by additive or dominance genetics and usually had a substantial contribution of more complex effects and unexplained variance. This is in line with the most recent data for human complex traits (Purcell *et al.* 2009; Lander 2011; Visscher *et al.* 2012), where many genes contribute individually small, but collectively large, effects to the phenotype; it also helps explain the difficulty thus far in identifying genes with major effects on haloperidol-induced ADR in relatively small studies on humans (Åberg *et al.* 2010) and mice (Crowley *et al.* 2012b). As with many complex diseases, identifying susceptibility genes for haloperidol-induced ADR in humans will require sample sizes that are very large (>5000). Nonetheless, we show that the heritabilities of these ADRs are similar to those of many other clinical phenotypes that have been highly successful in GWAS (*i.e.*, diabetes, obesity, Crohn’s disease, and schizophrenia). As with these other phenotypes, it will be critical to design a human study that is not only large but also mindful of the trait’s genetic architecture.

Among the most pronounced strain effects in this study was the observation that, compared with the other mouse strains or F_{1} hybrids, NZO and NZO-descended mice have higher steady-state levels of plasma haloperidol. The results of our covariate analysis suggest that these effects are likely confounded by body weight and adiposity. As part of our experimental design haloperidol dosage was calibrated to body weight; but it is possible that more careful calibration is needed for mice with considerable NZO descent. On the other hand, brain levels of haloperidol in NZO mice are not that different from those in other strains, suggesting that the dosage may be adequate for psychological studies. At the opposite end of the spectrum, unusually low levels of haloperidol in plasma were seen in B6 mice, yet these mice still showed a significant drop in activity. Haloperidol seems, therefore, to trigger a more adverse reaction in mice with B6 descent, possibly because the drug is processed more quickly than in other individuals. Further experiments with B6 should note that this strain’s reaction to haloperidol corresponds to a significantly more negative additive effect in OFA than for other CC parental strains.

Our diallel study suggests three additional design factors, beyond pure sample size, that will be important to consider for haloperidol pharmacogenomics studies. First, steady-state drug levels must be collected. In mice, we have maximal control over dosing and are assured of treatment compliance; in humans, neither of these is true. To exploit this, drug levels should be measured in all subjects and considered a primary covariate. Second, drug-response phenotypes must be rigorously defined. Haloperidol-induced ADRs come in a variety of forms, from Parkinsonian tremor to uncontrolled jaw movements. In this study, we collected only a sample of possible ADR-related motor phenotypes but found that the patterns of genetic effects among them differed. Therefore, meaningful genetic results will require accurate diagnosis and quantification of haloperidol ADRs with high interrater reliability. Third, body weight is a meaningful covariate that should be collected. Regardless of how it exerts its effect, be it through pharmacokinetics, pharmacodynamics, or otherwise, body weight is too easy to measure to be ignored.

Our experiment examines treatment effect heterogeneity by estimating the effects of causes that were knowingly applied. Investigating HTE in humans, by contrast, typically starts with a search for potential causes of observed effects. Dissection of HTE in clinical trials has been characterized as “an experiment and a survey rolled into one” (Kravitz *et al.* 2004, p. 667, paraphrasing Longford and Nelder 1999): randomization may be present at the level of treatment assignment, but genetics, the open-textured effects of environment, and interactions between the two represent unavoidable confounds. Statistical methods to analyze HTE in humans, in particular “subset” or “subgroup” analysis, must therefore navigate treacherous waters, taking pains to avoid, for example, data dredging through selective analyses that fail to account for multiplicity (Byar 1985; Kravitz *et al.* 2004; Rothwell 2005; Willke *et al.* 2012). Bayesian hierarchical models that induce shrinkage have found application in this context (Dixon and Simon 1991; Bayman *et al.* 2010). We too benefit from Bayesian shrinkage (Lenarcic *et al.* 2012), but also draw considerably greater strength from experimental design. By using a matched drug–placebo design and limiting our attention to a closed set of knowingly varied primary factors (genetics, parent of origin, and sex), we can work on the (in our view) realistic premise that the effects of these factors (*e.g.*, the effect of switching genetic descent between parents) may be small but are unlikely to be exactly zero. This allows us to focus exclusively on robust joint estimation. To provide a complementary perspective, in our model selection analysis we entertain inference under the premise that such zero effects do have substantial prior probability.

Previously we developed a detailed decomposition of diallel inheritance and a reliable Bayesian hierarchical model for analyzing univariate quantitative phenotypes (Lenarcic *et al.* 2012). But a new methodology was required to measure the interaction between diallel effects and the effect of drug exposure. Motivated by the Neyman–Rubin potential outcomes framework (Rubin 2005), we have proposed to measure the many components of interaction between drug and inheritance as the difference between drug-treated and placebo-treated model coefficients. We have demonstrated two methods for estimating this genetic × treatment vector: difference of models and multiple-impute matched pairs. Both estimation techniques can potentially be extended to other experimental designs that vary genetics for both placebo and drug. Our variance projection statistic, computing the difference in variation structure between experimental samples in future populations, is a flexible breakdown that can serve as an alternative to broad-sense heritability; in TreVarP we extend the concept naturally to summarizing how heritable factors affect drug response.

Heritable effects inferred from a diallel of the Collaborative Cross founders can be more or less presumed to exist in the CC and Diversity Outbred populations. They can also help guide design of follow-up experiments in those resources: for example, strong parent-of-origin epistasis suggests an advantage of reciprocally intercrossing CC lines (*i.e.*, to form a recombinant inbred intercross, or CC-RIX), whereas strong additivity and epistasis but no parent-of-origin effects suggests more immediate value in designs using the CC and DO directly.

We demonstrate here that a randomized-treatment study on the experimental diallel not only confirms a treatment effect but also estimates higher-order genetic-by-treatment interactions. Those interactions would be impossible to discover in an observational study where genetics might have influenced who got treatment and/or where the genetic variation in the population may be poorly aligned for discovering interactions. Our approach to defining and inferring genetic–drug interplay provides a starting point—not only for more sophisticated modeling that explicitly targets causality parameters, but also for prospective studies in mouse and humans that dissect the genetic architecture of adverse reactions of drugs in general and haloperidol in particular.

## Acknowledgments

The authors report no biomedical financial interests or potential conflicts of interest. Major funding was provided by the National Institute of Mental Health NIMH/National Human Genome Research Institute Center of Excellence for Genome Sciences grants (P50-MH090338 and P50-HG006582 to Fernando Pardo-Manuel de Villena). This work was also supported by NIMH grant K01-MH094406 (to James Crowley), National Institute of Diabetes and Kidney Diseases grant P30-DK056350 [University of North Carolina (UNC), Chapel Hill, Nutrition Obesity Research Center], the UNC Lineberger Comprehensive Cancer Center, and National Institute of General Medical Sciences grant R01-GM104125 (to William Valdar).

## Appendix A

### Constrained Priors for Diallel Effects

Although the BayesDiallel model is inherently overspecified, some identifiability can be recovered through the use of constrained priors, as in Equation 15. Fixing the sum of all additive effects to zero ensures that marginal HPD intervals on single coefficients can be used to identify significant differences of interest, for example, whether *a*_{B6} − *a*_{CAST} > 0. We achieve this constraint using a *J* × (*J* − 1) contrast matrix (A1)where This allows transformations of the form where **a** is the original coefficient set for additive effects, and is the postulated set of i.i.d. centered coefficients. Letting **X*** _{a}* be the

*n*×

*J*design matrix corresponding to the original additive

*a*

_{1},

*a*

_{2}, … ,

*a*coefficients, then applying the transformation yields the

_{J}*n*× (

*J*− 1) matrix In this case all

*J*− 1 coefficients defined on the transformed space have an i.i.d. marginal prior distribution All

*J*coefficients in the original space also have a marginal distribution, but they are not independent; this is because the sum of all rows and the sum of all columns

## Appendix B

### Simulations for Treatment Response Variability

Reflecting the simulations in Table 4, in Table B1 we investigated coverage and width for credibility intervals estimating contributions to variance in treatment response. Simulated variance contributions were *a* = 42.6, *ϕ ^{a}* = 7.7,

*m*= 42.6, and

*σ*

^{2}= 7.1. At the 95% level, with sample size 115, for the true effects we see that HPD intervals have approximately a width of ≤15. On average, frequency coverage of all effects was high and greater than the purported HPD coverage. However, there appeared to be a systematic undercoverage of the noise contribution, due to fitting a large complete model with symmetric and asymmetric effects, even though the only nonadditive contribution in this simulation is noise. We reason that this approach has a potential to overestimate whole-model contribution when the whole model is too large, but that individual effects are nonetheless robustly covered.

## Appendix C

### Theoretical Comparison of Difference of Models and Matched Pairs Estimators

DoM and MP procedures are compared in terms of their bias and variance. This comparison is made in regard to estimation of a single set of grouped effects β inferred under Bayesian (or ridge) priors that, in aiming to reduce overall *L*_{2} error, shrink those effects toward zero. For the MP estimator, parameter vector β^{MP} is shrunk to zero by a single dispersion parameter for the DoM estimator, both β^{drug} and β^{placebo} are shrunk to zero separately by dispersions and

Consider fixed values for dispersion parameters and —these represent prior distributions for and respectively—and also fixed noise levels and Following Theobald (1974), and assuming a single dispersion parameter per model, the DoM estimator has bias (C1)and variance (C2)whereas the MP estimator has bias (C3)and variance (C4)Let and note that for any invertible matrix **A** and perturbation matrix **P**, (**A** + **P**)^{−1} = **A**^{−1}(**I** + **A**^{−1}**P**)^{−1}. As a first-order approximation, the difference in variance between DoM and MP estimators is (C5)where and Thus the MP estimator has less variability, as long as parameters are large and the match dispersion is smaller. The MP estimator will have a larger bias, however, if we approximate *b*_{DoM} as (C6)In this case ‖*b*_{MP}‖^{2} − ‖*b*_{DoM}‖^{2} is approximately (C7)and the reduced squared bias of the DoM estimator is *o*(**A**^{−2}), such that DoM is potentially advantageous when matching cannot control for environmental noise. In problems where and are all large, the two estimators are equivalent.

In Figure C1, we keep fixed at 1 and vary from −1 to 1. Fixing and using and we consider a sample size The MP outperforms DoM when differences are large or is small. When is small, DoM is preferable, but if is also small, then it is actually preferable to use the MP. Thus we conclude that both DoM and MP priors and estimators have data settings where one outperforms the other in terms of MSE. The DoM estimator allows for more straightforward modeling of additional unmatchable covariates and is easier to apply when matched pairs are difficult to select or impute.

## Footnotes

*Communicating editor: S. F. Chenoweth*

- Received August 27, 2013.
- Accepted October 14, 2013.

- Copyright © 2014 by the Genetics Society of America

Available freely online through the author-supported open access option.