Abstract
We develop a flexible and computationally efficient approach for analyzing high-throughput chemical genetic screens. In such screens, a library of genetic mutants is phenotyped in a large number of stresses. Typically, interactions between genes and stresses are detected by grouping the mutants and stresses into categories, and performing modified t-tests for each combination. This approach does not have a natural extension if mutants or stresses have quantitative or nonoverlapping annotations (e.g., if conditions have doses or a mutant falls into more than one category simultaneously). We develop a matrix linear model (MLM) framework that allows us to model relationships between mutants and conditions in a simple, yet flexible, multivariate framework. It encodes both categorical and continuous relationships to enhance detection of associations. We develop a fast estimation algorithm that takes advantage of the structure of MLMs. We evaluate our method’s performance in simulations and in an Escherichia coli chemical genetic screen, comparing it with an existing univariate approach based on modified t-tests. We show that MLMs perform slightly better than the univariate approach when mutants and conditions are classified in nonoverlapping categories, and substantially better when conditions can be ordered in dosage categories. Therefore, it is an attractive alternative to current methods, and provides a computationally scalable framework for larger and complex chemical genetic screens. A Julia language implementation of MLMs and the code used for this paper are available at https://github.com/janewliang/GeneticScreen.jl and https://bitbucket.org/jwliang/mlm_gs_supplement, respectively.
HIGH-THROUGHPUT genetic screens have revolutionized biology by their ability to answer complex, large-scale scientific questions. This is made possible by advances in automation and multiplexing, the availability of large and comprehensive collections (such as mutant libraries and sequenced genomes), and advances in computational and statistical methodology. Here, we consider a high-throughput genetic screen to observe the fitness of a library of mutants in a variety of growth conditions. Potential goals of such a screen would be to analyze gene × condition interactions or to predict the effect of a new, but related, antibiotic. Matching genes with phenotypes is a particularly valuable application of high-throughput experiments in the age of rapid sequencing technology (van Opijnen and Camilli 2012). These techniques can shed light on the physiological roles of partially redundant gene functions (Typas et al. 2011) and the physiological pathways that are involved with responses to different environmental factors (Ivask et al. 2013). They can reveal relationships between unknown or seemingly unrelated genes (Oh et al. 2011), and provide insights into genes involved in multiple antibiotic resistance (Nichols et al. 2011).
It is now possible to run high-throughput genetic screens cost-effectively in bulk, but the unprecedented scale of these types of studies necessitates the development of generalizable and efficient methods for analyzing their results. Such studies are essentially multivariate problems, but most traditional methods turn them into several univariate problems for computational feasibility. Doing so fails to take advantage of known groupings and correlations in the observations. In the case of the genetic screening example, one can group mutants by gene or gene family, group growth conditions by antibiotic class or temperature, and consider spatial correlation on plates.
We present matrix linear models (MLMs), which provide a formal statistical framework for encoding such known, but perhaps nonexplicit, underlying relationships to enhance detection of associations that might otherwise be masked. This straightforward, multivariate approach can take into account any number of continuous or categorical covariates. Existing methods can encode for different mutant strains and condition types, but are univariate; these approaches are akin to ANOVA or Student’s t-tests. In addition to categorical groupings, MLMs have the distinct advantage of being able to explicitly model more complex and potentially noncategorical information, such as dosage-response levels, null/heterozygous/homozygous genotypes, and spatial correlation on the plates (colonies located on the edge of the plate are expected to exhibit greater growth, since they have fewer neighbors with whom to compete for resources). MLMs offer flexibility over existing methods analogous to what linear regression offers over Student’s t-tests.
Estimation of MLMs is fast even in moderately large dimensions. Using simulations and data from an Escherichia coli genetic screen (Nichols et al. 2011), we show that our method produces results comparable to those of the univariate S score approach (Collins et al. 2006), but with considerably more efficient computation time. We also analyze the data while encoding for dosage response of growth conditions, to demonstrate the method’s ability to incorporate information from continuous covariates and assess relationships more generally.
This paper is organized as follows. The Materials and Methods introduces a motivating E. coli chemical genetic screen data set, and describes the statistical model and estimation (Statistical analysis). The Results section evaluates our method using simulated and real data. We conclude with a Discussion.
Materials and Methods
Colony opacity was recorded for mutant strains grown at high density on agar plates with a range of conditions (Nichols et al. 2011; Shiver et al. 2016). Six “plate arrangements” of mutants were used, with 1536 colonies grown per plate. In this context, a plate arrangement refers to the choice of mutants and exposures, as well as their positioning in the 1536 wells. The 3983 mutant strains were taken from the Keio single-gene deletion library (Baba et al. 2006), essential gene hypomorphs [C-terminally tandem-affinity tagged (Butland et al. 2008) or specific alleles], and a small RNA/small protein knockout library (Hobbs et al. 2010). The colonies were grown in 307 conditions representing different E. coli stresses. More than half were antibiotic/antimicrobial treatments, but other types of conditions, such as temperature and pH, were included. Among the six plate arrangements, a total of 982,902 condition × gene interactions need to be estimated, along with main effects across interactions for the growth conditions and mutant strains. The study aimed to examine the interaction effects between the E. coli genes and the growth conditions. This information can be used to study potential drugs with unknown targets and the mechanism behind drug interactions, as well as identify the genes necessary to support growth in different conditions (Nichols et al. 2011).
Statistical analysis
Model:
Let Y be an matrix of quantitative colony growth from a high-throughput genetic screen, similar to the one described in the previous section. The growth conditions are annotated by
along the n rows and the different mutant strains are annotated by
along the m columns (Figure 1). MLMs are thus given by
(1)with the main and interaction effects contained in
and errors in
. The statistical form of the model is similar to that used by Xiong et al. (2011) for genetic analysis of function-valued phenotypes.
A visualization of the response, covariate, and coefficient matrices for a matrix linear model. The dimensions shown are for illustration only and are not necessarily to scale.
Let ⊗ denote the Kronecker product and vec be the vectorization operator that stacks columns of a matrix into a single-column vector. The vectorized equivalent of the problem is(2)This resembles the familiar linear regression model of
, and so MLMs can theoretically be analyzed using existing software for least squares linear regression methods. Unfortunately, doing so is frequently computationally inefficient or even infeasible when the Kronecker product is too large for memory. We leverage the fact that all of the information in the design matrix is contained in two smaller matrices. By doing so, we avoid the cumbersome Kronecker product and solve two smaller least squares problems. From a philosophical standpoint, vectorizing the data destroys its natural structure, reducing the interpretability of the results.
If the n plates, each exhibiting a certain growth condition, are independent with a common covariance matrix, then . The residual covariance matrix Σ is generally unknown and must be estimated using the data.
Estimation:
To obtain least squares estimates, we choose B to minimize the residual sum of squares,(3)where Y is the observed response matrix of colony growth and
is the linear model function, taking X and Z to be fixed. That is, we want to find B such that f is the best linear combination of X and Z for approximating Y.
This approach may be viewed as a generalized estimation equations approach (Xiong et al. 2011). The solution has a closed form,(4)which can be computed directly without the use of iterative solvers or specifying decision variables. The solution can be viewed as a combination of a least squares on the mutant strains (the Z part) and another on the conditions (the X part). The resulting estimate is asymptotically unbiased (Liang and Zeger 1986).
The variance–covariance matrix of the estimated coefficient is(5)If a consistent estimate of the residual covariance matrix Σ can be obtained, so can a consistent estimate of the variance of the coefficient estimates. Note that the choice of an estimator of Σ (made separately from the least squares minimization problem) will affect the formulation of τ, but not of B (see Appendix for the derivation of the least squares estimate and variance).
The fitted values are easily computed as

Testing:
Similar to the t-test statistic used to assess the coefficients from the univariate linear regression model, we can define a test statistic for our method as(7)where
is the vector of the diagonal entries of
, the estimator of the variance–covariance matrix of the estimated coefficient τ as defined in the previous subsection. Obtaining
requires specifying an estimate of the residual covariance matrix, which we took to be the consistent
. Both the division and square root functions in the test statistic formula are element-wise operators.
Based on simulations and real data, we empirically observed that the test statistics approximate a skewed t distribution. Rather than attempting to specify appropriate parameters for a skewed t distribution, we used permutation tests to obtain P-values for our analysis. At each permutation, we randomly permuted the rows of the Y matrix, i.e., the independent units were taken to be the plates. A null distribution was constructed from the test statistics estimated using the permuted data. P-values for each coefficient were computed by comparing the observed test statistics to this null distribution.
MLMs handle only complete data, so all missing values should be appropriately dropped, smoothed, or estimated beforehand. No such considerations were needed for the E. coli data set, which had no missing values. Weighted least squares can be used for heteroskedastic data by weighing plates and/or colonies (see Appendix for derivation).
Data availability
We implemented our method using the Julia programming language (Bezanson et al. 2017); it is available as a package at https://github.com/janewliang/GeneticScreen.jl. Data preprocessing and visualization was done in R (R Core Team 2018), due to its robust ecosystem of packages for statistics and data analysis (Wickham 2007; Dowle and Srinivasan 2018). Code for data analysis and generating all figures in this paper is available at https://bitbucket.org/jwliang/mlm_gs_supplement. Supplemental material available at FigShare: https://doi.org/10.25386/genetics.8316881.
Results
We applied our method to simulated data and E. coli genetic screening data (Nichols et al. 2011). We compared the results and computation times for MLMs, and the S score, a popular existing method for analyzing high-throughput genetic screening data (Collins et al. 2006). An S score is essentially a Student’s t-test statistic comparing the observations for a given mutant and condition with the observations for a given mutant over all conditions. It is given as(8)where
(9)
,
, and
are the mean, variance, and number of measurements for normalized colony growth for a mutant and a condition of interest.
,
, and
are the median, variance, and median number of measurements for normalized colony growth for a mutant of interest over all conditions. Both
and
are subject to a minimum bound on the SD given by the expected SD in normalized colony growth for a mutant with similar growth phenotypes.
Unlike MLMs, which only assume that the rows of Y are independent, S scores assume that both the rows (plates) and columns (colonies) are independent. So the method is expected to make improvements over S scores, especially when this assumption is violated, such as when the columns of Y (the colonies) are spatially correlated. MLMs go beyond the S score ANOVA-like approach and allow for encoding more complex categorical or continuous relationships, similar to linear regression.
Simulation studies
Using the framework of the X and Z matrices from the E. coli data’s six plate arrangements, we simulated data with 1/2 nonzero main effects and 1/4 nonzero interactions drawn from a Normal(0, 4) distribution. The errors were independent and identically distributed from the standard normal distribution. We then applied both our multivariate method and the S score’s univariate approach to estimate ∼180,000 interactions. Using permutation tests, we obtained P-values corresponding to each interaction for each approach. We used the adaptive Benjamini–Hochberg adjustment (Benjamini and Hochberg 2000) (Team et al. 2017) to account for multiple testing. In the adaptive Benjamini–Hochberg step-up procedure, one controls for the false discovery rate by first estimating the number of “true” null hypotheses (Hochberg and Benjamini 1990) and then applying the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995). For the likely scenario that some of the tested hypotheses are not true, the usual Benjamini–Hochberg adjustment may be underpowered. The initial step helps keep the results from being too conservative.
To compare the results for each plate arrangement, we plotted the receiver operating characteristic (ROC) curve generated by obtaining true-positive rates (TPRs) and false-positive rates (FPRs) at varying P-value cutoffs for both methods (Figure 2). The cutoffs were used to determine which adaptive Benjamini–Hochberg-adjusted P-values corresponded to significant (nonzero) interactions, and these results were then compared to the simulated interactions to obtain the varying TPRs and FPRs. Figure 2 is the ROC plot for the first plate arrangement. The gray reference line that cuts diagonally from the lower left to the upper right is what we would expect the curve to look like for a method that just produces random noise. Its area under the curve (AUC) is 0.5. A method that performs well will have a curve that closely aligns with the upper left corner and an AUC approaching 1 (its TPR will be high even if its FPR is low for a given cutoff). See Figure S1 for ROC plots for the remaining five plates.
ROC curves comparing MLM to S scores applied to data simulated using a framework of first plate arrangement. Interactions corresponding to adaptive Benjamini–Hochberg adjusted P-values (Benjamini and Hochberg 2000; MuToss Coding Team et al. 2017) below a given cutoff are considered significant; these results are compared to the simulated interactions. Curves were generated by plotting the true-positive rates and false-positive rates at varying P-value cutoffs. The two methods perform very similarly, with MLM [AUC of 0.845 (Ekstrøm 2018)] performing slightly better than the S scores (AUC of 0.833). MLM, matrix linear model; ROC, receiver operating characteristic.
The AUCs for our method and Collins’s method were 0.845 and 0.833, respectively (Ekstrøm 2018). Based on both the visual and quantitative summaries, we can observe that MLMs perform as least as well as the S scores. However, this slight positive difference is consistent across all six plate arrangements (Table 1).
Type I error:
We simulated 100 data sets based on the six plate arrangements with 1/2 nonzero main effects and 1/4 nonzero interactions, as described in the section above. To approximate a null distribution, we permuted the rows of each simulated Y, ran MLMs, and obtained permutation P-values for the interactions. For each data set, we computed the FPR as the proportion of P-values below cutoffs of 0.01, 0.05, and 0.1. Table 2 displays the mean proportions across all 100 simulations with 95% normal approximation C.I.s calculated based on the SD across all 100 simulations. The mean proportions are consistently at or slightly below their corresponding thresholds, suggesting that our method preserves type I error.
Dosage-response simulation:
A more interesting case that illustrates the flexibility and benefits of using MLMs is to consider a genetic screen whose plate conditions have multiple dosage levels, and for whom a monotonic dose response is expected. Suppose a given condition has three dosage levels. The S score approach will analyze these condition × gene interactions separately for each of the dosage levels, i.e., ConditionLevel1 × gene, ConditionLevel2 × gene, and ConditionLevel3 × gene. One can analyze the data analogously using MLMs by encoding each of the condition–dosage combinations as separate dummy variables.
However, it is also possible for our method to encode this information as dosage-response levels for a given condition. Instead of treating this hypothetical condition as essentially three unrelated conditions, we can encode the three dosage-response levels together as a single variable corresponding to the condition. The simplest way to do this is to assign a different “slope” to each level in a condition. If the magnitude of the effect of the hypothetical condition is expected to increase (either in a beneficial or harmful manner) with each dosage level, one could create a single-vector variable where ConditionLevel1 is encoded as 1, ConditionlLevel2 is encoded as 2, and ConditionLevel3 is encoded as 3.
To examine this scenario more closely, we used the Z matrix frameworks for mutant strains from each of the six plate arrangements of the E. coli data. For each plate, we simulated effects for an experiment with 10 different conditions, each with three dosage levels and three replicates. First, 1/4 nonzero interactions were drawn from a Normal(0, 1/4) distribution. For these 1/4 nonzero interactions, we further simulated monotonic dosage effects. We did this by first randomly selecting a direction for the condition’s effect (positive or negative with equal probability). Then, for the first dosage level, we simulated an effect from an Exponential(0.5) distribution. For the second dosage level, we took the effect from the first dosage level and added it to a random effect drawn from an Exponential(0.5 ) distribution. For the third dosage level, we summed the second dosage level’s effect with a random effect drawn from Exponential(0.5
). The dosage effects for a given condition were then assigned the appropriate direction that was randomly selected in the first step. This simulation can be extended for any number of conditions with any number of monotonic dosage levels. Simulating monotonic effects is reasonable based on the study design of the experimental data. The chemicals and their concentrations were chosen to be in a range where monotonic concentration-dependent effects are observed.
The ROC curves in Figure 3 illustrate the performance of the dosage-response-encoded approach compared with Collins’s S scores, which can only encode categorical information, for the first plate arrangement. For each method, the ROC curve was generated by obtaining adaptive Benjamini–Hochberg-adjusted P-values (Benjamini and Hochberg 2000) (Team et al. 2017) from permutations and varying the cutoff for determining significant interactions. These were then compared to the true simulated interaction effects to calculate the TPR and FPR.
ROC curves for plate 1 simulations comparing dosage-response-encoded MLMs to categorically encoded S scores. Dosage-response-encoded MLMs outperform all other methods shown. These include S scores for data encoded with categorical condition–dosage combinations, for any of the three representations (Cond.-Conc, 1/3 Hits, and 2/3 Hits), as well as S scores for data encoded with just the conditions. Cond.-Conc.; condition concentration; Dos. Resp., dosage response; MLM, matrix linear model; ROC, receiver operating characteristic.
The red solid line plots the results for MLMs with dosage-response encoding of the conditions.
The blue dashed line plots the results for Collins’s S scores with categorical encoding of condition–dosage combinations, which is the conventional approach.
The green dot-dashed line plots the results for Collins’s S scores with categorical encoding of only conditions, i.e., all the dosage levels for a given condition are encoded as one categorical condition.
The black and gray dotted lines offer an alternate visualization of the Collins’s S scores results, encoded for categorical condition–dosage combinations. For a given condition × gene interaction, there are three corresponding S scores/adjusted P-values (one for each condition–dosage level). When plotting the black dotted “1/3 Hits” line, a true positive is counted when at least one out of the three adjusted P-values for a significant simulated condition × gene interaction is below the cutoff. A false positive is counted when at least one out of the three adjusted P-values for a nonsignificant interaction is below the cutoff. The gray dotted “2/3 Hits” line similarly requires at least two out of the three adjusted P-values to be below the cutoff.
The corresponding AUCs (Ekstrøm 2018) for the plate 1 simulation are shown in Table 3.
Our proposed dosage-response MLMs approach (red solid line) outperforms Collins’s S scores encoded with categorical condition–dosage combinations, regardless of representation (blue dashed, black dotted, and gray dotted lines). It also outperforms Collins’s S scores when they only encode for the conditions without regard for dosage level (green dot-dashed line). These trends are consistent across simulations based on the other plates’ arrangements (Figure S2 and Table S1). From an interpretation standpoint, this encoding can be useful if the investigator is interested in the overall effect of a plate condition as opposed to separately considering the different dosage levels. The more general approach to encoding covariates in MLMs leads to superior analysis in this situation.
Data analysis
We then applied the our method to each of the six plate arrangements in the E. coli genetic screen. The colony opacities were standardized by subtracting the median colony opacity of each plate (which has multiple mutant strains growing under a given condition) and dividing by the interquartile range.
Computational considerations:
When we ran our MLM and S score implementations, encoded with condition–dosage combinations, on the entire data set of six plates, the latter required significantly more computation time (Table 4 and Table 5). A computer with 128 GB memory and a 3.00 GHz dual-core processor was used to obtain the times as averages of 10 runs. MLMs only take ∼3.5 sec to estimate the roughly 1 million interactions, plus main effects (Table 4). In comparison, Collins’s S scores require ∼2.5 min. However, much greater computation time is needed to obtain permutation P-values. If the P-values are calculated based on 1000 permutations (parallelized over five cores), MLMs take just over 18 min (Table 5). S scores require over 8 hr to complete the same procedure (again, parallelized over five cores). This dramatic difference in computation time can have a considerable impact on the scope and feasibility of analyzing such data sets.
Auxotroph analysis:
To assess whether our method identifies significant interactions in the expected manner, we analyzed auxotrophs. Auxotrophs are mutant strains that have lost the ability to synthesize a particular nutrient required for growth. These might include knockout strains for a certain amino acid. Since they should experience little to no colony growth under specific conditions where the required nutrient is not present, we expect negative interactions between auxotrophic mutants and minimal media growth conditions. Auxotrophs are useful as controls, since the phenotype under particular conditions for a mutant strain is typically not known.
In the original univariate analysis of the colony size data, Nichols et al. empirically identified 102 auxotrophs (Nichols et al. 2011). Likewise, a previous study of the Keio Collection auxotrophs, based on colony size, found 238 auxotrophs, 110 of which were mutants included in this data set (Joyce et al. 2006). Nichols et al. and Joyce et al. overlapped by 70%, despite significant experimental differences (e.g., growth in liquid vs. solid media).
In a similar fashion, we empirically identified auxotrophs based on the MLM estimates. We did this by obtaining the quantiles of the MLM interaction scores for each mutant strain under minimal media conditions. Mutants whose 95% quantile for interaction scores with minimal media conditions fell below zero were classified as auxotrophs. Our auxotrophs had an 83% overlap with the Nichols et al. auxotrophs and a 72% overlap with those of Joyce et al. The slightly larger intersection of auxotrophs when comparing with Nichols et al. vs. comparing with Joyce et al. is to be expected, since we are analyzing the same data set. While not all of the MLM interactions between the Nichols et al. and Joyce et al. auxotrophs and minimal media conditions were negative, the vast majority of them were. Figure 4 and Figure S3 are visualizations of the distributions of each auxotroph’s interaction scores across minimal media conditions. The interaction scores are plotted as points, and the median for each auxotroph is plotted as a horizontal bar; most fall below zero.
Distributions of MLM interaction estimates for auxotrophs identified by Nichols et al. (2011) over minimal media conditions. The Nichols et al. auxotrophs are plotted along the x-axis. The MLM interactions between the auxotrophs and minimal media conditions are plotted along the y-axis, with the horizontal bars indicating the median value. Most interactions fall below zero, indicating little growth. MLM, matrix linear model.
Some of the discrepancy may be due to differences between analyzing colony opacity, as we did, and analyzing colony size, as Nichols et al. and Joyce et al. did. Kritikos et al. ran a study of the Keio collection using colony opacity (Kritikos et al. 2017). We used the publicly available Kritikos et al. S scores to replicate the auxotroph-identifying process for MLM t-test statistics. Of the 19 Nichols et al. auxotrophs that MLM was not able to identify, the Kritikos et al. S scores were unable to detect 12. This result suggests that about two-thirds of the Nichols et al. auxotrophs that MLM was unable to detect can be accounted for by differences in the types of measurements used to quantify growth.
Among the remaining 5 out of 19 auxotrophs, fepC is a mutant that results in the loss of ferric enterobactin uptake. There are minimal media conditions for both “high iron” and “low iron.” The MLM t-test statistics are generally positive for interactions between fepC and both iron growth conditions; the Kritikos et al. S scores are positive only for high iron. Thus, fepC might be considered a borderline case in auxotroph determination depending on whether or not it exhibits growth under the low-iron condition. Additionally, there were two auxotrophic mutants that were found through analysis of the MLM t-test statistics, but not through analysis of the Kritikos et al. S scores.
Out of the 31 Joyce et al. auxotrophs that the MLM t-test statistics were unable to find, 22 were also undetectable to the Kritikos et al. S scores. Once again, this suggests that about two-thirds of the discrepant auxotrophs can likely be explained by differences between analyzing colony size and colony opacity (Kritikos et al. 2017).
ROC plots (Figure 5 and Figure S4) can assess the ability of MLMs to correctly identify auxotrophs found by Nichols et al. and Joyce et al. To get the TPRs and FPRs for Figure 5, we took the auxotrophs identified by Nichols et al. to be the true auxotrophs. We then obtained TPRs and FPRs by varying cutoffs for the median minimal media interaction score for the auxotrophs that we identified. Figure S4 was obtained analogously for the Joyce et al. auxotrophs. The AUCs were 0.884 and 0.824, respectively (Ekstrøm 2018); there is high concordance between the three sets of auxotrophs.
ROC curve for the auxotrophs we empirically identified, compared against those identified by Nichols et al. (2011) as the reference. True-positive rates and false-positive rates were calculated based on the median minimal media interaction score for the each of the auxotrophs we identified, at varying cutoffs. The AUC was 0.884 (Ekstrøm 2018). ROC, receiver operating characteristic.
Dosage-response analysis
We also analyzed the data set by running MLMs with dosage-response levels encoded in the X matrix growth conditions. It should be noted that many of the conditions in the Nichols et al. data set had only one dosage level, which makes it somewhat less-than-ideal for illustrating this encoding approach. We then compared the dosage-response results with the results from applying MLMs and Collins’s S scores on data encoded for condition–dosage combinations.
Figure 6 examines the performance of these three approaches for analyzing the first plate arrangement. The other five plate arrangements, shown in Figure S5, produced similar results. For each of the three methods, we used permutation tests to obtain adaptive Benjamini–Hochberg-adjusted P-values (Benjamini and Hochberg 2000) (Team et al. 2017). We then plotted the proportion of adjusted P-values below varying thresholds to generate the three curves. The dosage-response-encoded MLMs were able to detect more significant interactions (adjusted P-values below a given cutoff) than Collins’ method at nearly every threshold. For lower cutoffs, this continuous encoding strategy also detects more significant interactions than MLMs with categorical encoding.
Comparing the proportion of significant interactions detected when encoding for dosage response and when encoding dosage–condition combinations categorically, for the first plate arrangement. The former is only possible in MLMs. The latter is shown for both MLMs and S scores. The curves were generated by obtaining the adaptive Benjamini–Hochberg-adjusted P-values from permutation tests for each method and identifying the proportion of adjusted P-values below varying cutoffs. Dos. Resp., dosage response; MLM, matrix linear model.
This simple example illustrates the potential gains in performance, and detection of significant interactions, when taking advantage of MLMs’ regression-like ability to encode for complex and continuous covariates. Encoding the conditions in this manner also provides interpretable results about the effects of dosage in the interactions as well as the effects of the conditions themselves.
Discussion
We have presented MLMs, a simple framework for encoding relationships and groupings in high-throughput genetic screens. This approach is computationally efficient, running faster than the univariate S score method (Collins et al. 2006). The speed advantage is especially noticeable when for large data sets and when doing permutation tests. We gain this speed by harnessing both the matrix structure of high-throughput data and the strengths of least squares estimation.
MLMs can also improve detection of interaction effects that might otherwise be masked. Comparing our method to the S score approach in simulations and an E. coli genetic screen (Nichols et al. 2011), we show that we achieve comparable results at much less computational expense. Furthermore, unlike the S score, MLMs are not limited to encoding categorical groupings. In this way, the relationship between S scores and MLMs is analogous to that between ANOVA/Student’s t-tests and linear regression. Analysis of simulated and E. coli data demonstrates that MLMs provide a more flexible and powerful approach for scenarios with noncategorical covariates, such as multidose conditions.
In this paper, we utilized a generalized estimating equation approach using least squares as the computational engine. Several extensions are possible. We may want a robust estimation procedure that downweights extreme observations (least squares is well known to be sensitive to outliers). This can be achieved by modifying the loss function from a sum of squares to a robustified version, such as Huber’s loss function (Hastie et al. 2009). The resulting optimization problem is expected to be more complex. Another extension might be to fit penalized MLMs with an L1 penalty. This optimization problem is also challenging, and we expect to report progress in future work. Finally, although we were motivated by high-throughput chemical genetic screens, many other high-throughput data, such as metabolomic data or cancer cell line drug screens, have a similar structure and might benefit from a similar approach.
Acknowledgments
We thank Carol Gross of the University of California, San Francisco (UCSF) for permission to use and share the chemical genetic screen data, and Anthony Shiver of Stanford University for help with accessing, processing, and interpreting the data. This work started when J.W.L. was a summer intern at UCSF, and continued when she was a scientific programmer at the University of Tennessee Health Science Center (UTHSC). We thank both UCSF and UTHSC for funding, and a supportive environment. S.S. was partly supported by National Institutes of Health grants GM-070683, GM-078338, GM-123489, ES-022841, and DA-044223.
Appendix
In this appendix, we provide details on the derivation of the estimation formulas introduced in the Statistical analysis section. We will use some well-known results from linear regression and matrix algebra, also stated below, to help the reader follow the main argument.
Recap: Kronecker Products
For matrices A, B, C, D, and X, the following identities hold.



Recap: Linear Regression
Consider the linear modelwhere
. The least squares estimate of β minimizes
and is
where
is the projection matrix corresponding to X.
The variance of this estimate isIf we substitute Λ in the above formula with a consistent estimator, then we get the so-called “sandwich estimator” of the variance.
Recap: Weighted Linear Regression
If we minimize a weighted least squares criterion, given a known (positive definite) weight matrix W, we minimizeand the solution is
where
is the weighted projection matrix of X with weights W.
reduces to
when the weight matrix is the identity matrix. The formula for the variance estimate is identical to the variance estimate for the least squares estimate, replacing the projection matrix
with the weighted version
.
These results can be used to derive least squares and weighted least squares formulas for the MLM coefficient estimates, and their variances.
Vectorized Version of the Model
Applying the vectorization and transpose operators on Equation 1,which gives us Equation 2. Thus, we can see that this specification is in the form of the familiar linear model
with





Least Squares Estimate
With the above equivalence, we can prove the least squares estimate formula using substitution. The projection matrix for the design matrix corresponding to the vectorized model iswhere we use Kronecker product identities repeatedly to simplify the expression. Therefore,
Transposing, unvectorizing, and substituting the projection matrices, we obtain
which establishes Equation 4.
Variance of Estimate
Again by substitution, we can see thatSubstituting the projection matrices and simplifying, we get Equation 5.
Weighted Estimation
The most common form of weighted estimation occurs when we standardize all the rows or columns of the data matrix. This corresponds to using weighted regression with a weight matrix that is the Kronecker product of two diagonal weight matrices for the vectorized version of the problem. More generally, if we consider weight matrices of the form , where
and
are positive definite, then it is easily seen that the formulas for the weighted version of
and its variance are the same as those for the unweighted version, replacing the projection matrices (
and
) with their weighted counterparts (
and
).
Footnotes
Supplemental material available at FigShare: https://doi.org/10.25386/genetics.8316881.
Communicating editor: M. Beaumont
- Received May 7, 2019.
- Accepted June 6, 2019.
- Copyright © 2019 by the Genetics Society of America