# Adjusting for Principal Components of Molecular Phenotypes Induces Replicating False Positives

Andy Dahl^{*,1}, Vincent Guillemot^{†}, Joel Mefford^{*}, Hugues Aschard^{†,‡}, and Noah Zaitlen^{*,1}

^{*}Department of Medicine, University of California San Francisco, San Francisco, California 94158; ^{†}Centre de Bioinformatique, Biostatistique et Biologie Intégrative, Institut Pasteur, Paris 75015, France; ^{‡}Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, Massachusetts 02115

^{1}Corresponding authors: Department of Medicine, University of California San Francisco, 1550 4th St., Bldg. 19B, San Francisco, CA 94158. E-mail: andywdahl{at}gmail.com and noah.zaitlen{at}ucsf.edu

## Abstract

High-throughput measurements of molecular phenotypes provide an unprecedented opportunity to model cellular processes and their impact on disease. These highly structured datasets are usually strongly confounded, creating false positives and reducing power. This has motivated many approaches based on principal components analysis (PCA) to estimate and correct for confounders, which have become indispensable elements of association tests between molecular phenotypes and both genetic and nongenetic factors. Here, we show that these correction approaches induce a bias, and that it persists for large sample sizes and replicates out-of-sample. We prove this theoretically for PCA by deriving an analytic, deterministic, and intuitive bias approximation. We assess other methods with realistic simulations, which show that perturbing any of several basic parameters can cause false positive rate (FPR) inflation. Our experiments show the bias depends on covariate and confounder sparsity, effect sizes, and their correlation. Surprisingly, when the covariate and confounder are correlated, standard two-step methods all have many-fold FPR inflation. Our analysis informs best practices for confounder correction in genomic studies, and suggests many false discoveries have been made and replicated in some differential expression analyses.

- confounder
- molecular trait
- quantitative trait loci
- eigenvector perturbation

ASSOCIATION studies of molecular phenotypes have helped characterize basic biological processes, including transcription, methylation, chromatin accessibility, translation, ribosomal occupancy, and expression response to stimuli. These tests can be performed on *cis* and *trans* genetic variants to search for functional quantitative trait loci [*QTL: eQTL (Montgomery *et al.* 2010; Pickrell *et al.* 2010), mQTL (Rakyan *et al.* 2011), caQTL (Degner *et al.* 2012), pQTL (Albert *et al.* 2014), rQTL (Battle *et al.* 2015), sQTL (Rivas *et al.* 2015; Li *et al.* 2016), reQTL (Fairfax *et al.* 2014; Lee *et al.* 2014), and iQTL (Barry *et al.* 2017)]. Functional measurements can also be tested against nongenetic covariates with broad genomic effects, including cell type composition (Houseman *et al.* 2012; Jaffe and Irizarry 2014; Rahmani *et al.* 2017; Yao *et al.* 2017), disease status [*e.g.*, cancer (van’t Veer *et al.* 2002), autism (Parikshak *et al.* 2016), and obesity (Horvath *et al.* 2014)], fetal developmental stage (Colantuoni *et al.* 2011), and ancestry (Galanter *et al.* 2017). We mostly refer to gene expression for simplicity, but our arguments apply to any highly structured, high-dimensional measurements.

Unfortunately, unmeasured and unknown factors are common and often have large effects in functional genomic data, reducing power and skewing null transcriptome-wide *P*-values (Leek and Storey 2007; Gibson 2008). Conditioning on known confounders—like technical batch—is invaluable but incomplete. Because genetic effects are typically small, even modest confounders can induce spurious genetic associations that dwarf real signal (Leek and Storey 2007; Kang *et al.* 2008).

Fortunately, strong confounders induce large, low-dimensional structure in the transcriptome, which is exactly what principal components (PCs) aim to capture (as do their variants, which we collectively call CCs, for confounding components). This blessing of dimensionality motivates a two-step approach where CCs are first estimated and then conditioned on downstream as surrogates for the confounders (Alter *et al.* 2000; Leek and Storey 2008). Domain-specific CC methods, like surrogate variable analysis (SVA) (Leek and Storey 2007) and PEER (Stegle *et al.* 2010), make different assumptions about the structure of the confounders, and often outperform PCA. Two-step CC correction is an essential element of thousands of functional genomics analysis pipelines (Leek *et al.* 2010; Rakyan *et al.* 2011; Stegle *et al.* 2012, 2015; Albert and Kruglyak 2015).

Acknowledging the substantial benefits of CCs in many settings, in this work we explore their adverse impact on the false positive rate (FPR). Theoretically, we derive an unappreciated source of bias created by conditioning on two types of PCs. Our results suggest the two-step approach is biased whenever CCs imperfectly partition the phenotype-covariate correlation. We formalize this in a unifying, unidentified likelihood (1) that the CC methods each resolve with distinct assumptions.

We also study the bias with a range of CCs and simulations using real expression data from the GEUVADIS consortium (Lappalainen *et al.* 2013). First, we find no nontrivial method that avoids bias even in the simple scenario where the covariate has a small effect on a single gene and is added to white noise. In more complex data, this bias can be negligible compared to confounder-induced miscalibration. Next, we perform a series of simulations varying the number and strength of covariate effects and see substantial inflation in all CC methods when their assumptions fail. Finally, we allow confounders to be correlated with the covariate—the ordinary meaning of a confounder (Leek *et al.* 2010)—and find that all CCs can be severely miscalibrated; further, we show these false positives replicate out-of-sample.

## Confounder Estimation and Correction

We write the *P* molecular phenotypes measured on *N* samples as $Y \in \mathbb{R}^{N \times P}$ and let $y_p \in \mathbb{R}^N$ be the *p*-th phenotype. The primary covariate of interest is $x \in \mathbb{R}^N$. We assume *x* and the $y_p$ are standardized to mean 0 and variance 1.

We stylize the standard two-step confounder correction as:

1. Estimate a rank-*K* confounder *U* by solving
   $$(\hat{\alpha}, \hat{U}, \hat{V}) = \operatorname*{arg\,min}_{\alpha, U, V} \; \lVert Y - x\alpha^\top - UV^\top \rVert_F^2 + \mathcal{P}(\alpha, U, V) \quad (1)$$
   where $\lVert \cdot \rVert_F$ is the Frobenius norm and P is some penalty representing (potentially implicit) priors on the causal and confounding patterns. Here, α and *V* are dummy variables.

2. Estimate the $\hat{\beta}_p$ by regressing each gene on *x* given $\hat{U}$:
   $$\hat{\beta}_p = \mathrm{OLS}\bigl(y_p \sim x + \hat{U}\bigr) \quad (2)$$
   where OLS indicates the regression coefficient on *x* from ordinary linear regression of $y_p$ on *x*, $\hat{U}$, and an intercept.

We call approaches “unsupervised” when P constrains $\alpha = 0$ and “supervised” otherwise.
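As a concrete sketch, the two-step procedure above can be written in a few lines of NumPy with unsupervised PCA as the CC estimator; the dimensions, *K*, and toy data below are illustrative choices, not the pipeline used in the paper:

```python
import numpy as np

def two_step_test(Y, x, K):
    """Stylized two-step confounder correction with unsupervised PCA.

    Step 1: estimate K confounding components as the top left singular
    vectors of the column-centered N x P phenotype matrix Y.
    Step 2: regress each gene on x together with the estimated components.
    Returns the per-gene coefficient on x and its t-statistic.
    """
    N, P = Y.shape
    Yc = Y - Y.mean(axis=0)
    U_hat = np.linalg.svd(Yc, full_matrices=False)[0][:, :K]
    X = np.column_stack([np.ones(N), x, U_hat])      # intercept, covariate, CCs
    coef = np.linalg.lstsq(X, Yc, rcond=None)[0]
    resid = Yc - X @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (N - X.shape[1])
    se = np.sqrt(np.linalg.inv(X.T @ X)[1, 1] * sigma2)  # SE of the x coefficient
    return coef[1], coef[1] / se

# Toy data: pure noise, so the t-statistics should be roughly null-distributed
rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 50))
x = rng.standard_normal(100)
beta, t = two_step_test(Y, x, K=3)
print(beta.shape, t.shape)
```

The key structural point is that step 2 treats `U_hat` as a fixed, error-free design matrix, which is exactly the assumption questioned below.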

Solving (1) with $\mathcal{P} = 0$ amounts to maximum likelihood (ML) estimation under an i.i.d. Gaussian noise model for errors in *Y* (Leek and Storey 2007, 2008; Stegle *et al.* 2010). Standard ML inference for α would then (pseudo)invert the information matrix for asymptotic standard errors that account for uncertainty in *U* and *V*. This exposes one problem with two-step confounder correction: step two conditions on a fixed $\hat{U}$ as if it were known without error. Theoretically, this difficulty can be resolved by appealing to assumptions that ensure $\hat{U}$ perfectly estimates *U* so that there is no uncertainty to propagate (Leek and Storey 2008; Wang *et al.* 2017).

Another difficulty for ML inference in (1) is that its solution is not unique (even when requiring, *e.g.*, $U^\top U = I_K$), as
$$x(\alpha + V\gamma)^\top + (U - x\gamma^\top)V^\top = x\alpha^\top + UV^\top \quad \text{for any } \gamma \in \mathbb{R}^K.$$
Because $U - x\gamma^\top$ satisfies the rank-*K* constraint, $\alpha + V\gamma$ can obtain the same likelihood as α: adding any vector in span(*V*) to any α admits an equivalent solution. [We ignore the nonidentifiability of $(U, V)$ from the product $UV^\top$ because only span($\hat{U}$) is used in (2).]
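This nonidentifiability is easy to verify numerically: shifting α by any vector in span(*V*) while absorbing the compensating change into *U* leaves the fitted matrix unchanged. A minimal check, with arbitrary random dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, K = 30, 40, 5
x = rng.standard_normal(N)
alpha = rng.standard_normal(P)
U = rng.standard_normal((N, K))
V = rng.standard_normal((P, K))

fit = np.outer(x, alpha) + U @ V.T

# Shift alpha by an arbitrary vector in span(V) and absorb the change into U.
# U2 @ V.T still has rank <= K, so the constraint in (1) is respected.
gamma = rng.standard_normal(K)
alpha2 = alpha + V @ gamma
U2 = U - np.outer(x, gamma)

fit2 = np.outer(x, alpha2) + U2 @ V.T
print(np.allclose(fit, fit2))   # the two parameterizations fit identically
```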

This nonidentifiability means all (well-defined) CCs must use nontrivial penalty functions P; below, we describe the choices of P roughly made by several popular CC methods. Moreover, this nonidentification means CCs depend heavily on their chosen P capturing the true, unknown parameter structure. In particular, we can easily design simulations where any particular CC behaves badly. This means that choice of CC method is important in practice and should be dataset-specific.

We note that we do not claim to take significant steps toward solving this identification problem, which has been analyzed from various theoretical perspectives elsewhere (*e.g.*, West 2003; Leek and Storey 2008; Gagnon-Bartsch and Speed 2012; Sun *et al.* 2012; Gerard and Stephens 2017; Wang *et al.* 2017).

## Studied Confounder Estimation Methods

Unsupervised PCA takes $\hat{U}$ as the top eigenvectors of $YY^\top$ or, equivalently, the top left singular vectors of *Y*. By definition, PCA solves (1) if P solely constrains $\alpha = 0$. The key problem is that the effect of *x* on *Y* leads PCs to partially capture *x*, analogous to unshielded colliders in a directed graphical model (Figure 1). That is, conditioning on genomic PCs can cause, rather than remove, bias. Related concerns arise when conditioning on heritable covariates in other contexts (Aschard *et al.* 2015, 2017; Day *et al.* 2016). This bias creates test misspecification even marginally, for each gene; in contrast, previous theory assumed marginal tests were valid and focused on correlations between tests of different genes (Leek and Storey 2008).

We approximate the PC conditioning bias for small causal effects with textbook eigenvector perturbation theory. Conditioning on one phenotypic PC, our bias approximation for gene *p* coincides with the bias naively derived by path tracing in Figure 1,
$$\mathrm{bias}_p \approx a\, V_{p1} V_{q1},$$
where *a* is the causal effect, *q* is the causally affected gene, and *V* are the right singular vectors of *Y*.

A similar result can be derived for conditioning on genotypic PCs in genome-wide association studies. In Figure 1, this stylistically corresponds to replacing the phenotypes with SNPs, the covariate (*x*) with the tested phenotype, and reversing the arrow from *x* to $y_q$. However, genotype matrices typically contain three orders of magnitude more variables than expression matrices, concomitantly reducing the bias [entries of *V* are $O(P^{-1/2})$]. Intuitively, causal SNPs have much lower leverage on genetic PCs than causal genes have on expression PCs.

We also test an approach we call supervised PCA, which aims to protect $\hat{U}$ from *x* by first residualizing *Y* on *x*. This solves (1) when P constrains α to its unconditional estimate, $\hat{\alpha}^{\mathrm{OLS}}$. We show this method simply amplifies biases in the unconditional estimate: unsupervised PCs are too correlated with *x*, but supervised PCs are too uncorrelated.

We study several other approaches through simulation. SVA penalizes (1) by assuming α is sparse, which is often plausible. We used the “two-step” and “irw” algorithms to implement unsupervised and supervised SVA, respectively (“irw” requires supervision, but “two-step” is infrequently used). The “irw” version learns which genes are determined by the signal α *vs.* the confounder *V* by testing with *q*-values. These association strengths are used in turn to weight the relative importance of each gene inside a singular value decomposition, effectively weighting PCA toward more-confounded genes to better estimate confounders. We also tested a recent reimplementation, SmartSVA (Chen *et al.* 2017); we found that it performs very similarly to SVA, but much faster.

PEER explicitly penalizes *U* and α through priors on their respective sizes. PEER uses automatic relevance determination priors for the factors *U* and *V* and fits parameters with variational Bayes. In simulations, its default hyperparameters perform well when *x* explains a small fraction of transcriptome-wide variation, but less well for larger α. PANAMA is a closely related approach that greedily adds the most relevant SNPs while learning latent factors. While PANAMA can improve performance over PEER for SNPs with large effects, it is more computationally expensive and is rarely used in practice for human datasets.

RUV and related methods estimate confounders by using only a submatrix of *Y* that is known, *a priori*, to be unaffected by the primary signal (Lucas *et al.* 2006; Gagnon-Bartsch and Speed 2012; Gerard and Stephens 2017; Wang *et al.* 2017). This prior information breaks the identifiability problem by constraining $\alpha_S = 0$ for some subset of genes *S* (corresponding to a barrier penalty function for P). Latent factors can be safely identified by restricting to the negative control data, and then their effects on other genes can be extrapolated. We assume control genes or samples are unavailable, which is common, and do not study these approaches further.

We also assess LEAPP (Sun *et al.* 2012), a recent method that uses sparsity assumptions to disentangle α and *V*. Unlike other methods we study, LEAPP provably obtains oracle performance, asymptotically and assuming that confounders are strong, signals are sparse, and noise is independent across genes modulo confounders (Wang *et al.* 2017).

Finally, we also test the linear mixed model-based method ICE, which uses a random effect with covariance kernel proportional to $YY^\top$ to capture confounding, and tests the fixed effect of *x* against each gene individually (Kang *et al.* 2008). Conceptually, ICE seeks a few genes that are highly correlated with *x* compared to typical genes. As in RUV methods, control genes can be used to improve power, which we did not study (Joo *et al.* 2014).

### The bias for one unsupervised PC

In this section we take *Y* as a *QTL effect plus some deterministic background *C*:
$$Y = x\alpha^\top + C \quad (3)$$
We assume $x \sim \mathcal{N}(0, I_N)$. We allow *C* to be fully general to capture all noise and confounding. This is closely related to the spiked covariance model (Johnstone 2001), though we use a general *C* in place of i.i.d. Gaussian noise. We assume $\alpha_q = a \neq 0$ and $\alpha_p = 0$ for $p \neq q$, and we call *x* a local covariate for gene *q*.

Ideally, the OLS step (2) would condition exactly on the true, unknown confounders. We evaluate conditioning, instead, on the top *K* PCs of *Y*, which we call *U* and define as the top left singular vectors of *Y*. We compare conditioning on *U* to conditioning on the top left singular vectors of *C*, which we call $U^{(0)}$, rather than the true confounders for two reasons. First, *C* is assumed independent of *x*, hence conditioning on $U^{(0)}$ does not cause bias. Second, this allows us not to assume any particular form for confounding, or even its existence.

We aim to quantify the error at gene $p \neq q$ from conditioning on the top feasible PC, *u*, instead of the top oracle PC, $u^{(0)}$:
$$\mathrm{err}_p = \hat{\beta}_p - \hat{\beta}_p^{(0)},$$
where $\hat{\beta}_p^{(0)}$ solves (2) given $u^{(0)}$. Since $\alpha_p = 0$ for $p \neq q$, any error can only be caused by the effect of *x* on *U*.

The bias is the expected error over *x*, the only randomness: $\mathrm{bias}_p = \mathbb{E}_x(\mathrm{err}_p)$. We study $\mathrm{err}_p$ rather than $\hat{\beta}_p$ to focus away from the ordinary regression error due to noise and the confounders, and onto the error caused by *x*’s perturbation of *U*. This enables stronger results; in particular, that $\mathrm{err}_p \approx \mathrm{bias}_p$ deterministically. Because of this, we often refer to this perturbed PC conditioning error as the bias, *i.e.*, the randomness in $\mathrm{err}_p$ is negligible.

We assume α is small so that we can use a standard approximation to the perturbed eigenvector [*e.g.*, (Allez and Bouchaud 2012), Sec. II]: for any small symmetric *E*, the first eigenvector of $\Sigma + E$ is approximately
$$u_1 + \sum_{j > 1} \frac{u_j^\top E\, u_1}{\lambda_1 - \lambda_j}\, u_j \quad (4)$$
where $u_j$ and $\lambda_j$ are the *j*-th eigenvector and eigenvalue of $\Sigma$.
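This first-order approximation is easy to check numerically against an exact eigendecomposition. The toy spectrum below is arbitrary, chosen only so the top eigenvalue is well separated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
# Arbitrary symmetric matrix with a well-separated top eigenvalue
lam_true = np.array([10.0, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.2])
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
Sigma = Q @ np.diag(lam_true) @ Q.T

lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]               # sort eigenpairs descending
u1 = U[:, 0]

E = 1e-3 * rng.standard_normal((n, n))
E = (E + E.T) / 2                            # small symmetric perturbation

# First-order approximation to the perturbed leading eigenvector
u1_approx = u1 + sum((U[:, j] @ E @ u1) / (lam[0] - lam[j]) * U[:, j]
                     for j in range(1, n))

u1_exact = np.linalg.eigh(Sigma + E)[1][:, -1]   # exact leading eigenvector
u1_exact = u1_exact * np.sign(u1_exact @ u1)     # fix the sign convention
print(np.abs(u1_exact - u1_approx).max())        # residual is second order in E
```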

Under our assumption on α, $E \approx a\bigl(x c_q^\top + c_q x^\top\bigr)$, where $c_q$ is the *q*-th column of *C*, giving the perturbation approximation
$$u \approx u^{(0)} + a \sum_{j > 1} \frac{u_j^{(0)\top}\bigl(x c_q^\top + c_q x^\top\bigr) u^{(0)}}{\lambda_1 - \lambda_j}\, u_j^{(0)} \quad (5)$$
This uses simplifying definitions based on rotating with *U*: for example, $\tilde{x} = U^{(0)\top} x$. Note that $\tilde{x}$ is still a spherical Gaussian random variable.

We show in Supplemental Material, Section S1.1 that this approximation can be combined with the standard two-step least squares expression for $\hat{\beta}_p$ to give
$$\mathrm{err}_p \approx a\, w \sum_j \pi_j V_{pj} V_{qj} \quad (6)$$
Here *w* is a condition number for $CC^\top$ and the $\pi_j$ are random weights. The $\pi_j$ partition the perturbation among PCs and are proportional to the (random) squared correlations between *x* and the PCs (*i.e.*, $\pi_j \propto (u_j^{(0)\top} x)^2$). The $\pi_j$ are nonnegative and sum to one in expectation. *w* is deterministic—depending only on the spectrum of *C*—and quantifies the susceptibility of the first PC to perturbation.

These properties of π mean the error in (6) is a (randomly) weighted correlation between the projections of genes *p* and *q*—the tested and the causal genes—onto the eigen-axes, *i.e.*,
$$\mathrm{err}_p \approx a\, w\, \rho_\pi(p, q) \quad (7)$$
where $\rho_\pi(p, q)$ is the correlation weighted by some π. In particular, if π is a vector of 1s, $\rho_\pi$ is the ordinary correlation between the two genes. In contrast, (7) randomly weights the eigen-axes, but with far greatest weight on axis 1 and successively less expected weight on subsequent axes.

$\rho_\pi$ is the only remaining randomness, so the error depends on *x* only through this (random) notion of correlation. And even this randomness is often negligible. First, π places far greater expected weight on the first eigen-axis than on all subsequent axes combined, and this gap grows for increasingly confounded data (Figure S1). Second, $\rho_\pi$ should be very well approximated by its expectation because it is an average over many random weights.

Together, this suggests the approximations $\pi \approx \mathbb{E}(\pi)$ and $\rho_\pi \approx \rho_{\mathbb{E}(\pi)}$, giving a deterministic approximation to the random error. Because the error is approximately deterministic, it can immediately be recognized as a bias approximation, as well:
$$\mathrm{bias}_p \approx a\, w\, \rho_{\mathbb{E}(\pi)}(p, q) \quad (8)$$
This uses the approximation to $\mathbb{E}(\pi)$ (Equation S3). We find (8) is accurate in a realistic simulation (Figure S2).

While $\hat{\beta}_p$ is biased conditional on *q* and *p*, this conditional bias itself has mean zero on average over *q* or *p* ($\rho_{\mathbb{E}(\pi)}$ is mean zero averaged over genes). Nonetheless, our conditional definition of bias conveys the fact that *p* and *q* are biologically meaningful and replicable indices. Moreover, even random biases with mean zero introduce overdispersion that still causes false positive inflation.

We have not generalized these calculations to $K > 1$ PCs, though we suspect an analogous result will hold after appropriately modifying *w*. If correct, $\rho_{\mathbb{E}(\pi)}$ will move toward the ordinary correlation between genes as *K* grows, suggesting the correlation between causal and tested traits is a good intuitive proxy for the bias.

### The bias for supervised PCs

An apparent solution is to project out *x* before computing PCs, which we call supervised PCA. We show this is deeply flawed, even when *x* has no causal effect, supporting existing simulation results (Leek and Storey 2007).

First, after residualizing *x* from *Y*, *x* is the bottom supervised PC, hence orthogonal to the others. Thus the unconditional OLS estimates for $\beta_p$ are unchanged by conditioning on supervised PCs. Similarly, the ratio of the conditional and unconditional SEs is just the ratio of the overall regression error estimates, *i.e.*, $\hat{\sigma}_p^{(\mathrm{sup})} / \hat{\sigma}_p$. Together, the ratio of *t*-statistics testing $\beta_p = 0$ is
$$\frac{t_p^{(\mathrm{sup})}}{t_p} = \frac{\hat{\sigma}_p}{\hat{\sigma}_p^{(\mathrm{sup})}} \quad (9)$$
By definition, *U* explains large amounts of variance in *Y*, making $\hat{\sigma}_p^{(\mathrm{sup})}$ smaller than $\hat{\sigma}_p$. Formally, the ratio (9) is inflated on average (over genes and *x*) and, in practice, is usually inflated (Figure S3 and Section S1.4).
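A small numerical sketch illustrates this mechanism: on pure-noise data, the coefficient on *x* is unchanged by supervised PCs (which are orthogonal to *x* by construction), while the residual variance shrinks, inflating |*t*| on average. Dimensions and *K* below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, K = 200, 300, 10
x = rng.standard_normal(N)
x = (x - x.mean()) / x.std()
Y = rng.standard_normal((N, P))        # pure noise: x truly affects nothing

# Supervised PCA: residualize Y on x, then take top left singular vectors
R = Y - np.outer(x, x @ Y) / (x @ x)
U = np.linalg.svd(R, full_matrices=False)[0][:, :K]

def t_stats(Y, design):
    coef = np.linalg.lstsq(design, Y, rcond=None)[0]
    resid = Y - design @ coef
    s2 = (resid ** 2).sum(axis=0) / (len(Y) - design.shape[1])
    se = np.sqrt(np.linalg.inv(design.T @ design)[1, 1] * s2)
    return coef[1] / se                # t-statistic for the x coefficient

ones = np.ones(N)
t_plain = t_stats(Y, np.column_stack([ones, x]))
t_sup = t_stats(Y, np.column_stack([ones, x, U]))

# x is orthogonal to the supervised PCs, so the coefficient on x is unchanged,
# but the residual variance drops, inflating |t| on average
print(np.mean(np.abs(t_sup) / np.abs(t_plain)))
```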

### Local covariate simulations with white noise

We now demonstrate the CC conditioning bias in a simplistic simulation using (3): the background expression *C* is drawn i.i.d. standard normal; the entries of *x* are (independently) i.i.d. standard normal; and $\alpha = a e_q$, so that *x* is a local *QTL for gene *q*; finally, *q* is drawn, independently of *x* and *Y*, uniformly from the *P* genes, and its effect *a* is varied over a grid. We chose *N* and *P* to match the GEUVADIS data (see below for details).

After simulating *Y* and *x*, we test each gene for association with *x*, using either PCA, SVA, PEER, or their supervised versions to estimate $\hat{U}$. We also test the mixed model implemented in ICE (which failed to converge in a few simulated datasets).

For 1000 independently simulated datasets, we perform one-sided Kolmogorov-Smirnov (KS) tests for deflation in the regression *P*-values at noncausally affected genes. A two-sided test should be used in the first step when testing for general miscalibration (Leek and Storey 2007, 2008); we, however, are testing for estimator bias, which decreases *P*-values, so a one-sided test discriminates inflation from deflation.
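The calibration check can be sketched as follows: collect per-gene regression *P*-values under a null simulation, then apply a one-sided KS test against Uniform(0,1). Here we use SciPy's `kstest` with `alternative='greater'`, which flags an empirical CDF sitting above the uniform CDF (*P*-values systematically too small); dimensions are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N, P = 100, 500
x = rng.standard_normal(N)
Y = rng.standard_normal((N, P))        # null: x affects no gene

# Per-gene regression P-values for the coefficient on x
pvals = np.array([stats.linregress(x, Y[:, p]).pvalue for p in range(P)])

# One-sided KS test: alternative='greater' asks whether the empirical CDF of
# the P-values exceeds the uniform CDF, i.e. P-values are systematically small
ks = stats.kstest(pvals, 'uniform', alternative='greater')
print(ks.pvalue)                       # typically large here: nothing biases the test
```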

Figure 2 presents the QQ plots for the resulting KS *P*-values. Unsurprisingly, excluding all CCs (None) delivers well-calibrated *P*-values because we did not simulate confounders. Confirming our theory, unsupervised PCs cause noticeable bias for large *a*; however, no bias is detected for small *a*, emphasizing that the bias can be negligible for local covariates. Unsupervised PEER results are similar for small *a*, but for large *a* PEER becomes conservative (such observations require one-sided KS tests for the regression *P*-values). ICE is mostly conservative, especially when using REML. Unsupervised SVA correctly declares $\hat{K} = 0$ [with the permutation test from Buja and Eyuboglu (1992)] and is thus equivalent to “None”; although this is ideal behavior, any CC method could use this (or other) tests to choose *K*, and analysts in practice often turn to a different method in this situation (*e.g.*, Pierce *et al.* 2014).

The three supervised methods qualitatively share a different type of bias, growing with *K* and depending little on *a*. We theoretically characterized this for supervised PCA, but PEER and, especially, SVA seem less biased.

Overall, Figure 2 shows that all tested (nontrivial) CC methods create *P*-value miscalibration even in the complete absence of confounding, though the problem is small for small *a*.

### Data availability

We used a high-quality RNA-sequencing dataset from the GEUVADIS consortium (Lappalainen *et al.* 2013) as a realistic simulation baseline. We aligned the raw transcript reads from the European individuals to the reference hg19 transcriptome using RSEM (Li and Dewey 2011). We removed perfectly correlated genes and quantile-normalized the rest to standard normal. The final matrix has samples as rows and genes as columns, with column-means and -variances equal to 0 and 1. Its spectrum is shown in Figure S10. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7040186.
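The quantile-normalization step can be sketched as a rank-based inverse-normal transform; the 0.5 rank offset below is an assumption, as the text does not specify the exact convention used for the GEUVADIS matrix:

```python
import numpy as np
from scipy import stats

def quantile_normalize_to_gaussian(Y):
    """Map each column of Y onto standard normal quantiles by rank.

    Ties get average ranks; the 0.5 offset keeps quantiles strictly
    inside (0, 1) so the extreme values stay finite.
    """
    N = Y.shape[0]
    ranks = stats.rankdata(Y, axis=0)            # ranks 1..N per column
    return stats.norm.ppf((ranks - 0.5) / N)

rng = np.random.default_rng(5)
Y = rng.exponential(size=(1000, 3))              # skewed raw "expression"
Z = quantile_normalize_to_gaussian(Y)
print(Z.mean(axis=0).round(2), Z.std(axis=0).round(2))   # near 0 and 1 per column
```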

## Global Covariate Simulations with Real Traits

We now simulate a global effect, meaning α is much denser and larger. We set 90% of its entries to 0 and the others to i.i.d. Gaussian with mean zero and variance such that *x* explains 1% of transcriptome-wide variation. Nongenetic *x* can easily have these sorts of effects, *e.g.*, even early studies with small sample sizes found broad expression profiles that still inform breast cancer treatment (Sparano *et al.* 2015; Cardoso *et al.* 2016). If *x* were genetic, it would be an extremely strong *trans*-QTL.

We now use the GEUVADIS expression for the noise *C* to make the simulation more realistic, and let the entries of *x* be i.i.d. Binomial(2, 20%), mimicking a common genetic variant. Finally, as we do not aim to match the perturbation theory here, we adopt standard practice and normalize *x* and columns of *Y* to mean 0, variance 1.
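Constructing a sparse α scaled so that *x* explains a target fraction (here 1%) of transcriptome-wide variance can be sketched as follows. The sparsity and the Binomial(2, 20%) genotype mimic the setup described above; the gene count and Gaussian background are placeholders for the real expression matrix, and the scaling algebra is a generic variance-matching calculation:

```python
import numpy as np

rng = np.random.default_rng(6)
N, P = 465, 2000
target = 0.01                          # fraction of variance explained by x

# Genotype-like covariate, standardized
x = rng.binomial(2, 0.2, size=N).astype(float)
x = (x - x.mean()) / x.std()

# 10% of genes get a nonzero effect
alpha = np.zeros(P)
causal = rng.choice(P, size=P // 10, replace=False)
alpha[causal] = rng.standard_normal(P // 10)

C = rng.standard_normal((N, P))        # stand-in for the expression background

# Rescale alpha so var(x alpha^T) is `target` of the total variance:
# scale^2 * var(signal) = target / (1 - target) * var(C)
signal = np.outer(x, alpha)
scale = np.sqrt(target * C.var() / ((1 - target) * signal.var()))
Y = scale * signal + C

frac = (scale * signal).var() / Y.var()
print(round(frac, 3))                  # close to the 1% target
```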

We assess empirical FPR and true positive rate (TPR) at a nominal significance level and average over 250 independently simulated datasets (averaging before log-transforming, Figure 3, a and d). All unsupervised methods and supervised PCA are badly miscalibrated, while the other supervised CC methods, ICE, and LEAPP were calibrated. These calibrated methods, except ICE, had power similar to Oracle, which uses PCs of the pure noise term *C*.

We then decrease the sparsity of α from 90% zeros to 5% zeros, violating the sparsity assumptions of supervised (Smart)SVA and LEAPP. This leads (Smart)SVA to roughly 10-fold FPR inflation (Figure 3b) and LEAPP to lose essentially all power (Figure 3e). Supervised PEER, however, retains near-oracle power and calibration.

Next, we return to 90% sparsity in α but increase its variance explained from 1 to 25%. This apparently violates PEER’s assumptions, as it is ≈five-fold inflated (Figure 3c).

These conclusions qualitatively remain when using a *q*-value threshold (Storey 2003) or a more stringent *P*-value threshold (Figures S4 and S5).

## Confounders Correlated with a Global Covariate

The previous simulations only added a causal *x* effect to some independent *C*. We now add confounders correlated with *x*, which we feel is common in practice for nongenetic *x* (genetic *x* can only be confounded by population structure). For example, even within-tissue, PEER factors had $R^2$ ranging from 10 to 60% with known covariates (GTEx Consortium *et al.* 2015). Further, experimental procedures are often correlated with biological factors (Gilad and Mizrahi-Man 2015). And correlation between technical confounders and primary biological signal can be pernicious: tissue dissociation in quiescent muscle stem cells can resemble cellular activation from muscle injury (van den Brink *et al.* 2017).

We now add a confounder, *u*, that has correlation *ρ* with *x*:
$$Y = x\alpha^\top + u\beta^\top + C$$
We draw each pair $(x_i, u_i)$ i.i.d. from a bivariate Gaussian with mean zero, variances equal to 1, and correlation *ρ*. We repeated our above pipeline, simulating 250 independent datasets and plotting the FPR in Figure 4. Here, we always fix the number of CCs at the value used originally for (Lappalainen *et al.* 2013) plus two, because we have added two rank-one effects.
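Drawing the correlated pair $(x_i, u_i)$ is a one-liner with a bivariate Gaussian; ρ = 0.6 below is an arbitrary illustrative value:

```python
import numpy as np

rng = np.random.default_rng(7)
N, rho = 465, 0.6                      # rho = 0.6 chosen only for illustration

# (x_i, u_i) i.i.d. bivariate Gaussian: mean 0, unit variances, correlation rho
cov = np.array([[1.0, rho], [rho, 1.0]])
xu = rng.multivariate_normal([0.0, 0.0], cov, size=N)
x, u = xu[:, 0], xu[:, 1]

print(round(float(np.corrcoef(x, u)[0, 1]), 2))   # empirical correlation near rho
```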

We draw α as in Figure 3a: 10% of its entries are nonzero, drawn i.i.d. Gaussian with mean zero and variance such that *x* explains a fixed fraction of transcriptome-wide variance. Because *u* represents a confounder, we draw β i.i.d. Gaussian with variance such that *u* explains a larger share of transcriptome-wide variance. This is roughly in line with GTEx (Aguet *et al.* 2017), where the top 15–35 PEER factors collectively explained 59−78% of transcriptome-wide variation. We vary $\rho^2$ from 0 to 1 (*ρ* and $-\rho$ are equivalent).

The results in Figure 4 show that supervised PCA badly inflates FPR, even at small *ρ*. The other supervised methods are nearly as inflated for larger *ρ*, except PEER for larger signals and modest *ρ*. Unsupervised CCs also inflate FPR for moderate *ρ*, though this diminishes as *ρ* grows and *u* becomes near-perfectly captured. In particular, the apparently naive approach of simply ignoring confounding (None) can be less inflated than all CC methods, particularly for small signals or weak confounding. This shows that, even when confounding exists, CC adjustment can cause more harm than good. Qualitatively similar patterns hold using *q*-value or more stringent *P*-value thresholds (Figures S6 and S7).

We found that ICE was always calibrated and LEAPP was always close to calibrated, with a maximum of roughly threefold inflation. The LEAPP inflation occurred for smaller signals and larger confounder correlations, scenarios excluded by assumptions in Wang *et al.* (2017). LEAPP was often more powerful than ICE, but these were typically settings where CC methods also performed well. For example, LEAPP was similar to supervised PEER/(Smart)SVA for small *ρ*, and LEAPP was similar to unsupervised CCs and supervised PEER for large *ρ* (though better calibrated). In the intermediate range, however, LEAPP outperformed all competitors. LEAPP was far slower than the other methods, taking ≈2 hr on average over simulated datasets in Figure 4, compared to ≈1–5 min for two-step methods and ≈10 min for ICE (Figure S8); nonetheless, recent reimplementations of LEAPP are faster (Wang *et al.* 2017).

Similar simulations can be found in Figure S2 of Leek and Storey (2008), but they reached the opposite conclusion, *i.e.*, that SVA is calibrated even when $\rho \neq 0$. To test if this discrepancy is due to their smaller tested data dimensions, we repeated our simulations in Figure 4 after downsampling to 20 samples and 1000 genes (uniformly without replacement, and independently downsampling for each simulation). This reduced FPR for SVA, though SVA can still be inflated (*e.g.*, ≈10-fold in some settings, Figure S9). This remaining discrepancy is likely partially because there is, in fact, inflation in Figure S2 of Leek and Storey (2008): *e.g.*, row 2, column 4 visually seems miscalibrated.

### False positive replication

Assuming *a* and *q* are biologically determined and universal, the PCA bias we derived depends only on the top PCs of *Y*. As *N* grows large, the bias remains, precision grows, and the null is rejected for every gene if *x* affects even one.

Analogously, biases will be similar between datasets with similar top PCs, which occurs in the presence of strong and similar confounders, *e.g.* batch effects (Leek *et al.* 2010), population structure (Listgarten *et al.* 2010), cell type composition (Jaffe and Irizarry 2014), or the signal *x* itself if it is strong.

We empirically assess replication rates by simulating as in Figure 4, with 10% of *α*’s entries drawn i.i.d. Gaussian. We then split the data (*x* and *Y*) into halves, test each separately, compute the replication rate as the fraction of false positive discoveries from the first half that are deemed positive in the second half, and repeat after transposing the splits. We use a nominal significance level in each split. We independently repeat the process 250 times, ignoring initial splits without false discoveries. Splitting an existing dataset simulates worst-case confounder sharing between discovery and replication cohorts.
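The split-half replication experiment can be sketched as below. This version tests with simple per-gene regressions (no CC correction) on all-null data, so the threshold and dimensions are placeholders rather than the paper's exact pipeline:

```python
import numpy as np
from scipy import stats

def false_replication_rate(x, Y, null_genes, level=0.05, rng=None):
    """Split samples in half, test each null gene in each half, and return
    the fraction of half-1 false positives that are also significant with
    the same sign in half 2 (NaN if half 1 has no false positives)."""
    rng = rng if rng is not None else np.random.default_rng()
    perm = rng.permutation(len(x))
    half1, half2 = perm[: len(x) // 2], perm[len(x) // 2:]

    def test(idx):
        fits = [stats.linregress(x[idx], Y[idx, p]) for p in null_genes]
        return (np.array([f.pvalue for f in fits]),
                np.array([f.slope for f in fits]))

    p1, b1 = test(half1)
    p2, b2 = test(half2)
    disc = p1 < level                          # false positives in half 1
    if disc.sum() == 0:
        return np.nan
    rep = (p2[disc] < level) & (np.sign(b1[disc]) == np.sign(b2[disc]))
    return rep.mean()

rng = np.random.default_rng(8)
N, P = 200, 300
x = rng.standard_normal(N)
Y = rng.standard_normal((N, P))                # every gene is null
rate = false_replication_rate(x, Y, range(P), rng=rng)
print(rate)
```

With shared confounding in both halves (as in Figure 4), this same computation would show false positives replicating far above chance.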

The left panel of Figure 5 shows the average (positive) false replication rate. Unsupervised methods perform reasonably well, and supervised SmartSVA and PEER are calibrated. Either performing no confounder correction or using supervised SVA or PCA creates severe spurious replication, with nearly all false positives replicating when *ρ* is large. In this section, we do not assess ICE, as it has low total positive rates.

We then increased the signal strength. Unsupervised methods performed worse, as they more readily capture the larger *x* effect, and supervised PEER became miscalibrated, especially for larger *ρ*. Next, we reduced the confounder correlation to 0: unsupervised methods performed slightly better, while supervised methods (except PCA) became roughly calibrated. Finally, we increased the density of *α* to 90%, causing supervised SmartSVA to suffer similarly to PEER.

## Discussion

We have evaluated unappreciated sources of bias induced by conditioning on estimated confounders in functional genomic association tests. We used a combination of theory and simulation to cover different cases of interest. Overall, no two-step CC method we evaluated generally had calibrated FPR; moreover, most studies use PCA, one of the worst-performing methods in our simulations. Although all methods behave well when their assumptions hold, and these assumptions are often reasonable in practice, confounders even modestly correlated with the primary signal can cause substantial bias. We also showed these false positives can replicate at a high rate.

*QTL studies are often performed only within local genomic windows, which are called *cis*-*QTL studies. In this context, genetic effects are small and restricted to nearby genes, and our results suggest that the bias induced from CC conditioning is minimal. However, unlike in our Figure 2 simulations, *cis*-windows may contain many highly correlated genes, and *cis*-*QTL often causally affect genes other than the nearest one (Zhu *et al.* 2016), both of which serve to inflate FPR.

*QTL studies can also be performed genome-wide, which are called *trans*-*QTL studies. While such *QTL are biologically central, they are difficult to reliably uncover because the signals tend to be dispersed across the transcriptome and genome. Unlike *cis*-*QTL, *trans*-*QTL can have much larger effects on CCs, which has led modern studies to diametrically opposed methodology for testing *trans*-*QTL. For example, Brynedal *et al.* (2017) do not use CCs, despite the fact that “all [significant] gene sets were significantly correlated to ... the top 20 PEER factors.” On the other hand, Yao *et al.* (2017) adjust for 20 PEER factors and cell type composition, and many of the resulting “hotspot” signals are in fact loci known to affect cell type composition^{1}. Similar to Yao *et al.* (2017), Aguet *et al.* (2017) adjust for PEER factors computed per tissue. We have proposed GBAT to address these and other limitations by testing gene-level *trans* associations (Liu *et al.* 2018). GBAT uses supervised SmartSVA, which performs best in our simulations without correlated confounders. This is feasible because we test only thousands of genes rather than millions of SNPs, though SNPs could be prescreened with unsupervised CCs. More generally, we recommend ICE or LEAPP when feasible: ICE was more reliably calibrated but had lower power than LEAPP, while two-step CCs could perform very poorly when their assumptions are violated.

Our global covariate simulations are relevant for differential expression studies performed with linear regression and/or CC correction. Such tests have been broadly applied, including to differences between tissues, sexes, or ages. These factors can easily correlate with latent confounders when experiments are not randomized and can substantially affect the expression of many genes. This is the setting in our simulations underlying Figure 4, where no approach (except ICE) generally gives calibrated *P*-values and CC correction can be worse than a completely uncorrected analysis.

A key limitation of two-step approaches is that step 1 uncertainty is not propagated to the test in step 2. To address this, a multiple imputation-style approach can be used, performing step 2 on several draws from the step 1 CC posterior. We have not evaluated this concept as it is currently developed only within the RUV framework (Gerard and Stephens 2017). Relatedly, the two steps can be integrated, which analytically propagates first-step uncertainty, though this has much greater computational cost when run genome-wide (Stegle *et al.* 2010; Fusi *et al.* 2012; Sun *et al.* 2012; Wang *et al.* 2017).

In the future, it may be useful to pursue other assumptions on *C* in the bias calculation we derived for unsupervised PCA. For example, we could use a spiked covariance model for *C*, using results from Nadler (2008) to approximate the perturbed eigenvector (*e.g.*, replacing our Equation 5 with their Equation 2.15). An advantage of our approach, however, is that we allow general correlations between traits.

In more complex scenarios, the appropriate covariate and confounding model can be unclear. For example, coexpression studies learn complex and subtle graphical models from (partial) covariance (Horvath 2011; Shin *et al.* 2014). But uncorrected confounders (or biased corrections) will yield statistically significant, biologically meaningless networks. Latent variable graphical models may suit this problem (Chandrasekaran *et al.* 2012), and a related two-step approximation was recently proposed for genomics (Parsana *et al.* 2017). Finally, mixed models that learn specifically-genetic graphical models may be adaptable to adjust for low-dimensional confounding (Dahl *et al.* 2013).

## Acknowledgments

We are grateful to Brunilda Balliu and Antonio Berlanga-Taylor for discussions about CC methods. This work was partially supported by National Institutes of Health grants 1U01HG009080-01, 5K25HL121295-03, and 1R03DE025665-01A1.

## Footnotes

Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7040186.

*Communicating editor: C. Sabatti*

^{1}Thanks to Alexander Gusev for this observation.

- Received November 4, 2018.
- Accepted January 23, 2019.

- Copyright © 2019 by the Genetics Society of America