Published in final edited form as: Ann Appl Stat. 2022 Mar 28;16(1):169–192. doi: 10.1214/21-AOAS1444

SPARSE MATRIX LINEAR MODELS FOR STRUCTURED HIGH-THROUGHPUT DATA

Jane W. Liang, Śaunak Sen
PMCID: PMC12499375  NIHMSID: NIHMS2111537  PMID: 41059435

Abstract

Recent technological advancements have led to the rapid generation of high-throughput biological data, which can be used to address novel scientific questions in broad areas of research. These data can be thought of as a large matrix with covariates annotating both rows and columns of this matrix. Matrix linear models provide a convenient way for modeling such data. In many situations, sparse estimation of these models is desired. We present fast, general methods for fitting sparse matrix linear models to structured high-throughput data. We induce model sparsity using an L1 penalty and consider the case when the response matrix and the covariate matrices are large. Due to data size, standard methods for estimation of these penalized regression models fail if the problem is converted to the corresponding univariate regression scenario. By leveraging matrix properties in the structure of our model, we develop several fast estimation algorithms (coordinate descent, FISTA, and ADMM) and discuss their trade-offs. We evaluate our method’s performance on simulated data, E. coli chemical genetic screening data, and two Arabidopsis genetic datasets with multivariate responses. Our algorithms have been implemented in the Julia programming language and are available at https://github.com/senresearch/MatrixLMnet.jl.

Keywords and phrases: FISTA, ADMM, proximal gradient algorithms, gradient descent, Julia, LASSO

MSC2020 subject classifications: Primary 65C60, 62P10, secondary 92D10

1. Introduction.

The rise of high-throughput technology has been a major boon for answering complex biological questions. Advances in automation, multiplexing and miniaturization now enable us to perform biological assays in bulk at vastly lower cost compared to a couple of decades ago. Examples of such technologies include cDNA microarrays, next-generation sequencing technologies, and mass spectrometry. The rise of these technologies has influenced statistical methods by posing new questions. They have also spawned the need for faster computation, since the size of the data forces the analyst to make trade-offs between statistical efficiency (or perfection) and computational feasibility. A well-known example is the wave of statistical innovation on multiple comparisons that followed the adoption of microarrays.

In this note, we consider the problem of modeling structured high-throughput data as the response variable. This is the goal of a wide variety of studies, such as chemical genetic screens using mutant libraries; eQTL experiments (measuring genome-wide gene expression and genotype in a segregating population); and metabolomics studies (measuring a large number of metabolites or chemicals using mass spectrometry). The data from these studies can be presented as a large matrix, with annotations characterizing each row and each column of this matrix. For example, in a chemical genetic screen where a large number of mutant strains are phenotyped in a large number of conditions, the data can be arrayed with each row representing an experimental run, and each column representing a mutant; we have information regarding the environment of each run (row annotations) and the gene mutated (column annotations). The row/column annotations define a priori known structure in the data. The goal is to identify gene-environment interactions in the screen (connections between row and column annotations), with the underlying idea that such interactions are rare (row-column connections are sparse). We propose to accomplish this using sparse matrix linear models that provide considerable flexibility in modeling the data. We achieve sparsity by using an L1 penalty on the model parameters and can handle situations where the covariate matrices are large. This model has broad applicability to a wide range of high-throughput data, and it has attractive computational properties.

Our approach is to provide a unified sparse linear model framework for analyzing matrix-valued data where we have row and column covariates. This approach generalizes the current approach to such data, where a two-step procedure is usually followed. In a microarray study with two conditions (treatment vs control), the data is in a matrix, with each row being a sample (row covariates indicate the sample condition) and each column being a gene. Genes may be grouped into pathways (column covariates). The standard approach is to detect differential expression contrasting two conditions using t-tests for each individual gene expression measurement (Dudoit et al., 2002). These methods have been extended to situations where each sample may have covariates; instead of performing a t-test for each gene expression measurement, a linear model is fit with the covariates as predictors (Ritchie et al., 2015). To understand patterns across genes or gene groups, a second analysis across genes is performed; for example gene set enrichment analysis might be performed (Subramanian et al., 2005). By unifying the two steps into a single linear model, the analyst gains flexibility in modeling (especially in the second step where the analysis can have non-categorical or non-overlapping covariates) and computational speed, as well as power to detect associations (Liang, Nichols and Sen, 2019). In this note, we consider estimation of these models with a sparsity constraint.

In Section 2, we outline the statistical framework underlying our model. We follow by describing some example datasets to which our methods can be applied and which motivated this work (Section 3). The computational strategy is detailed in Section 4, followed by a section on simulation studies and analysis of example datasets (Section 5). We close by summarizing our conclusions and outlining implications for future work in Section 6.

2. Statistical framework.

Suppose that $Y_{n \times m}$ is a response matrix, with rows annotated by covariate matrix $X_{n \times p}$ and columns annotated by covariate matrix $Z_{m \times q}$. Consider the linear model

$$Y = XBZ^\top + E, \qquad (1)$$

which is equivalent to

$$y_{ij} = \sum_{k=1}^{p} \sum_{l=1}^{q} x_{ik}\, z_{jl}\, b_{kl} + e_{ij}, \qquad (2)$$

where the entries of a matrix A are denoted by $a_{ij}$. The matrix $B_{p \times q}$ contains the coefficients that need to be estimated, and $E_{n \times m}$ are the errors (Figure 1). For example, in a high-throughput chemical screen of a library of mutants, the response matrix Y would consist of the colony sizes from growing the library of mutants in a variety of chemical conditions. Each row would be a separate run of the experiment; each column would represent a specific genetic mutant strain. The matrix X would consist of information on the nature and doses of the chemical media in which the mutants were grown, and the matrix Z would have information on which gene was mutated. The linear model allows us to model the effect of both the genes and the chemicals on colony size. The model is an example of a bilinear model. If Z is an identity matrix, the model reduces to fitting a linear model with the same row covariates (X) for all columns of Y. Likewise, if X is an identity matrix, the model reduces to fitting a linear model to each row of Y using the column covariates (Z). Models of this form are also found in functional regression if the response function is observed on a regular grid (Ramsay and Silverman, 2005).

Fig 1.

A visualization of the response (Y:n×m), row covariate (X:n×p), column covariate (Z:m×q) and coefficient (B:p×q) matrices for a matrix linear model. The dimensions shown are for illustration only and not necessarily to scale. X, Y, and Z are taken from the data, and the goal is to estimate B.

Note that our model may be expressed in its vectorized form as follows. If vec is the vectorization operator that stacks columns of a matrix into a single column vector, we can write equation (1) as

$$\mathrm{vec}(Y) = (Z \otimes X)\,\mathrm{vec}(B) + \mathrm{vec}(E). \qquad (3)$$

This is in the form of the familiar linear regression model $y = X\beta + \epsilon$, where $y = \mathrm{vec}(Y)$, $X = Z \otimes X$, $\beta = \mathrm{vec}(B)$, and $\epsilon = \mathrm{vec}(E)$. While the two representations are mathematically equivalent, representing the data as a matrix linear model using Equation (1) is computationally much more efficient. Even if the covariate matrices X and Z are moderately large, their Kronecker product can be prohibitively large. We elaborate more on this point later in this section.
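To illustrate this computational point, here is a minimal Julia sketch (with arbitrary illustrative dimensions) verifying that the matrix form in Equation (1) and the vectorized Kronecker form in Equation (3) give the same fitted values, while only the latter requires building an (nm) × (pq) design matrix:

```julia
using LinearAlgebra

# Matrix-linear-model form vs. vectorized Kronecker form; sizes are illustrative only.
n, m, p, q = 100, 80, 10, 5
X, Z, B = randn(n, p), randn(m, q), randn(p, q)

fitted_matrix = X * B * Z'                  # works directly with the small matrices
fitted_vec    = kron(Z, X) * vec(B)         # requires the (n*m) x (p*q) Kronecker design

fitted_matrix ≈ reshape(fitted_vec, n, m)   # true: the two representations agree
```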

Each row of the design matrix X corresponds to a row in the response matrix Y, analogous to how each row of the design matrix X corresponds to an observation in the response vector y for a univariate problem. The matrix X can contain an intercept (column of 1s), as well as any number of continuous or categorical covariates, such as contrasts encoding for different conditions. Similarly, the rows of the design matrix Z correspond to the columns of Y, and Z can be specified with an intercept as well as other covariates with associated contrasts. It is useful to consider the consequences of the model specification via the summation representation of the model expressed in Equation (2). The estimated coefficient corresponding to the row and column intercepts is the overall intercept in the vectorized model. The coefficients corresponding to the intercept of Z and (non-intercept) columns of X are row main effects, or the regression coefficients if we were performing univariate regression of each column of Y on the variables in X. Similarly, coefficients corresponding to the intercept of X and (non-intercept) columns of Z are column main effects, or the regression coefficients if we were performing univariate regression of each row of Y on the variables in Z. The coefficients corresponding to non-intercept columns of X and Z can be interpreted as interaction terms between row and column covariates; this is clearer when looking at Equation (2).

We consider the scenario when the entries in E are independently distributed with mean zero and the same variance. The estimation process reduces to finding the least squares estimates, which have a closed-form solution that can be computed quickly even with a high-dimensional X or Z matrix (Liang, Nichols and Sen, 2019; Xiong et al., 2011). If the rows are independent and identically distributed, but the columns are correlated, then we can transform the data so that the entries are uncorrelated. In this scenario, the covariance structure is of the form

$$V[\mathrm{vec}(E)] = \Sigma \otimes I,$$

where Σ is the covariance matrix of the columns. If Σ is known, we can multiply the response by the inverse of the square root of this matrix, to reduce it to the uncorrelated form assumed by our model. If the covariance matrix is unknown, we can replace Σ by an estimate (e.g. from the residuals of a least squares fit) to perform the decorrelation.
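The de-correlation step can be sketched in a few lines of Julia; this is a hedged illustration (the function name and interface are ours, not the package's), assuming the residual matrix R is n × m and the estimated covariance is positive definite:

```julia
using LinearAlgebra, Statistics

# Estimate Σ from least squares residuals and post-multiply the response by Σ^{-1/2}
function decorrelate(Y, R)
    Σhat = cov(R; dims = 1)                  # m × m covariance of the columns
    return Y * inv(sqrt(Symmetric(Σhat)))    # columns are now approximately uncorrelated
end
```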

In many problems, B is expected to be sparse, or we may want to use a sparse B for prediction and interpretation. In such settings, sparsity can be induced by adding the convex LASSO or L1 penalty $\lambda\|B\|_1$ to the least squares criterion (Tibshirani, 1996):

$$\tfrac{1}{2}\left\|Y - XBZ^\top\right\|_2^2 + \lambda\|B\|_1, \qquad (4)$$

for which no closed-form solution exists and standard unconstrained optimization methods cannot be applied (Schmidt, Fung and Rosales, 2009). Several approaches for solving the univariate problem are well-established. When the covariate matrix X is high-dimensional (as in genome-wide association studies), Wu and Lange (2008) and Friedman, Hastie and Tibshirani (2010) proposed cyclic coordinate descent.

Proximal algorithms are another general approach for convex optimization in non-smooth, constrained, large-scale, or distributed problems. They use the proximal operator of a function to solve convex optimization sub-problems in closed form or with standard methods. The proximal operator of a closed proper convex function f scaled by ρ, sometimes called the proximal operator of f with parameter ρ, is defined as

$$\mathrm{prox}_{\rho f}(u) = \operatorname*{arg\,min}_x \left\{ f(x) + \frac{1}{2\rho}\|x - u\|_2^2 \right\}. \qquad (5)$$

ρ controls the speed at which proxρf(u) moves points toward the minimum of f relative to staying close to u. It is advantageous to use a proximal algorithm when the proximal operator is simple but the original function is complex.

An example of such proximal algorithms is the family of fast iterative shrinkage-thresholding algorithms (FISTAs) (Beck and Teboulle, 2009), which come from the computer science literature and have been applied to univariate L1-penalized regression. More generally, FISTA is a proximal gradient method proposed for the nonsmooth, convex optimization problem for a parameter vector θ given by a split objective function

$$\min_\theta \left\{ h(\theta) = f(\theta) + g(\theta) \right\}, \qquad (6)$$

where the loss function f is a smooth convex function that is continuously differentiable with a Lipschitz continuous gradient, and the penalty term g is a continuous convex function that may be nonsmooth and has a simple proximal operator. The alternating direction method of multipliers (ADMM), also known as Douglas-Rachford splitting, is another type of proximal algorithm that utilizes the same objective function split. ADMM is efficient when both of the separate proximal operators for f and g are easy to evaluate (Parikh and Boyd, 2014). We have implemented extensions of these three approaches for our multivariate setup, in which $f(B) = \tfrac{1}{2}\|Y - XBZ^\top\|_2^2$ and $g(B) = \lambda\|B\|_1$.

Alternatively, one can approximate the non-smooth objective function with a twice-differentiable surrogate or recast the problem with constraints. In the former case, unconstrained optimization approaches like Newton’s method can be used directly to minimize the suitably chosen approximation. Two general approaches for choosing the surrogate are replacing the non-differentiable penalty g with a fixed smooth approximation or iteratively bounding g from above with a convex function (Schmidt, Fung and Rosales, 2009). ADMM is an example of the latter approach that utilizes a dual decomposition and augmented Lagrangians to perform constrained optimization (Boyd et al., 2011). It has also been used in statistical applications; see for example Tan et al. (2014).

The reader might be wondering why any new algorithms are needed for this problem. The reason is that, while mathematically equivalent to Equation (1), the vectorized form in Equation (3) is computationally cumbersome. For example, the R package glmnet (Friedman, Hastie and Tibshirani, 2010) fails, even for moderate dimensions of X and Z, because their Kronecker product has large memory requirements. Unlike general solvers, we utilize the fact that the design matrix is a Kronecker product, where all of the information is contained in two smaller matrices, and are able to obtain a computationally efficient solution. From a philosophical standpoint, vectorizing the data destroys its natural structure, making the interpretations harder (even though the numbers are the same). Finally, our approach points to how one might fit penalized multivariate regression models for multi-dimensional (tensor-valued) responses, i.e. sparse tensor linear models.

3. Data.

In this section, we give some example high-throughput datasets that contributed to our methodological work. We remark on the sizes of the datasets, the nature of the biological questions, and how they relate to our model.

3.1. E. coli chemical genetic screen.

Nichols et al. (2011) performed a high-throughput genetic screen using 3,983 strains of E. coli that carried a mutation in a non-essential gene. They were grown in 307 conditions representing 114 unique stresses; more than half of them were antibiotic/antimicrobial treatments, but other conditions such as temperature and pH were included. Each experimental run had at least two replicates of the same strain. The goal was to identify condition-gene interactions, the idea being that such interactions would illuminate the functional role of the mutated gene. In a matrix linear model, the row covariates are the growth conditions, the column covariates are the mutants, and the coefficients are the gene-condition interactions.

3.2. Arabidopsis G×E experiment.

A population of 404 Arabidopsis thaliana recombinant lines derived from a cross between an ecotype (strain) originating in Sweden and an ecotype originating in Italy were grown in three consecutive years (2009, 2010, and 2011) in both Italy and Sweden. The main phenotype of interest was fitness, measured by the average number of seeds produced per plant. The lines were genotyped at 348 markers, and the goal was to identify genetic loci (quantitative trait loci, QTL) contributing to fitness across sites and years (main effect QTL) and exhibiting gene-environment interactions (Ågren et al., 2017, 2016). In a matrix linear model, the row covariates are the markers, the column covariates are the environments, and the coefficients are the QTL (main effects and QTL-environment interactions).

3.3. eQTL experiment in two environments.

This study involved a population of 104 recombinant inbred lines derived from the Tsu-1 (Tsushima, Japan) and Kas-1 (Kashmir, India) ecotypes of Arabidopsis thaliana (Lowry et al., 2013; Lovell et al., 2015). Gene expression traits were collected for 25,662 genes, and 450 markers were genotyped. In order to identify main effect (G) and interaction (G×E) expression quantitative trait loci (eQTLs) with drought stress, the experiment was run on wet and dry soil treatments with two replicates. The data structure for this experiment is similar to the previous one, but with many more traits. In a matrix linear model, the row covariates are the markers, the column covariates are the gene identity and treatment information, and the coefficients are the eQTL (both main effects and interactions).

3.4. Environmental screening.

Woodruff, Zota and Schwartz (2011) analyzed biomonitoring data from the National Health and Nutritional Examination Survey (NHANES) to characterize both individual and multiple chemical exposures in U.S. pregnant women. They analyzed data for 163 chemical analytes in 12 chemical classes for subsamples of 268 pregnant women from NHANES 2003–2004. Most of the chemicals were measured using mass spectrometry. In a matrix linear model, the row covariates are the demographics of the subjects, the column covariates are the chemical classes of the chemicals, and the coefficients are the associations between chemical classes and demographic variables.

4. Computational methods.

We outline three algorithms we used for fitting the L1-penalized model, beginning with the most stable algorithm: coordinate descent. Next, we describe two variants of FISTA, a considerably faster, but less stable, approach. A discussion on ADMM, which is known for its fast convergence to an approximate solution but slow convergence to high accuracy, follows. We conclude the section with computational and implementation considerations.

Throughout, as in the univariate case, the intercept is omitted from the penalty term g and is thus not regularized, unless otherwise stated. Standardizing X and Z by subtracting the row means and dividing by the row standard deviations is also recommended.

4.1. Coordinate descent.

Cyclic coordinate descent searches for the minimum of a multivariable function by minimizing it along one coordinate direction at a time and cyclically iterating through each direction until convergence. When using the least squares loss function, it is sometimes known as the shooting algorithm (Fu, 1998). We first calculate the directional derivatives along the forward and backward directions for the coordinate direction uij at each coefficient Bij:

$$d_{u_{ij}} h(B) = \lim_{\tau \downarrow 0} \frac{h(B + \tau u_{ij}) - h(B)}{\tau} = d_{u_{ij}} f(B) + \begin{cases} \lambda, & B_{ij} \geq 0 \\ -\lambda, & B_{ij} < 0 \end{cases}$$
$$d_{-u_{ij}} h(B) = \lim_{\tau \downarrow 0} \frac{h(B - \tau u_{ij}) - h(B)}{\tau} = d_{-u_{ij}} f(B) + \begin{cases} -\lambda, & B_{ij} > 0 \\ \lambda, & B_{ij} \leq 0 \end{cases} \qquad (7)$$

This is possible because the nondifferentiable penalty g has directional derivatives along $u_{ij}$ and $-u_{ij}$. Furthermore, the loss function f is differentiable, so its forward and backward directional derivatives are simply the positive and negative ordinary partial derivatives stored in the gradient $\nabla f$.

$$d_{u_{ij}} f(B) = \frac{\partial}{\partial B_{ij}} f(B) = [\nabla f(\hat{B})]_{ij} = -X_{:i}^\top R\, Z_{:j}, \qquad d_{-u_{ij}} f(B) = -\frac{\partial}{\partial B_{ij}} f(B) = -[\nabla f(\hat{B})]_{ij} = X_{:i}^\top R\, Z_{:j} \qquad (8)$$

Above, $X_{:i}$ and $Z_{:j}$ denote the i-th and j-th columns of X and Z, respectively; $R = Y - X\hat{B}Z^\top$ is the matrix of residuals. Note that calculating the gradient entry $[\nabla f(\hat{B})]_{ij}$ involves only low-dimensional matrix multiplication. Like Wu and Lange (2008), our implementation organizes cyclic updates around the residuals, which makes calculating $[\nabla f(\hat{B})]_{ij}$ fast. For each coefficient, we compute this gradient entry and then update the corresponding coefficient and residual using the soft-thresholding operator defined below.


Here, $S_\lambda$ denotes the soft-thresholding operator $S_\rho$ with $\rho = \lambda$, given by

$$S_\rho(u) = \begin{cases} u - \rho, & u > \rho \\ 0, & |u| \leq \rho \\ u + \rho, & u < -\rho. \end{cases} \qquad (9)$$

Our implementation uses “warm starts” by initializing the coefficients at zero and computing solutions for a decreasing sequence of λ values. The coefficients for each subsequent λ value are then initialized to the previous converged solutions. This strategy is faster and leads to a more stable algorithm (Friedman, Hastie and Tibshirani, 2010). We also take advantage of sparsity by organizing iterations over the active set of coefficients: after performing a full cycle through all of the coefficients, we cyclically update only the active (nonzero) coefficients until convergence. Another full cycle is run, and the process is repeated until the estimates stop changing. Iterating through coefficients randomly instead of cyclically can in practice result in faster convergence as well, so we provide this as an option.
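A minimal Julia sketch of the soft-thresholding operator in Equation (9) and of the warm-start strategy follows; the solver `fit` passed in is a stand-in for any of the algorithms in this section, and its signature here is hypothetical:

```julia
# Scalar soft-thresholding (Equation 9); broadcast as soft_threshold.(U, ρ) to apply element-wise
soft_threshold(u, ρ) = sign(u) * max(abs(u) - ρ, zero(u))

# Warm-start path: solve for a decreasing sequence of λ values, initializing each fit
# at the previous converged solution.
function lambda_path(fit, Y, X, Z, λs)
    B = zeros(size(X, 2), size(Z, 2))        # coefficients start at zero for the largest λ
    path = Matrix{Float64}[]
    for λ in sort(collect(λs); rev = true)
        B = fit(Y, X, Z, λ, B)               # warm start at the previous solution
        push!(path, copy(B))
    end
    return path
end
```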

Note that the usage of the term “active set” here is related to but not the same as active set methods, which represent a class of algorithms that iterate between updating and simultaneously optimizing a set of non-zero variables (Schmidt, Fung and Rosales, 2009). This concept is also connected to the least angle regression (LARS) algorithm (Efron et al., 2004), which updates the predictor most correlated with the response by taking the largest possible step in the direction of the correlation. The process repeats until a second predictor is at least as correlated with the current residuals, and so on. LARS implementations to obtain univariate lasso coordinates exist, but have not enjoyed the same level of popularity as coordinate descent.

4.2. FISTA.

Coordinate descent is a very stable approach with excellent performance for univariate L1-penalized regression (Wu and Lange, 2008). However, it is too slow for matrix linear models of moderately large dimensions, especially if cross-validation is used to tune the λ parameter. Consider instead an iterative shrinkage-thresholding algorithm (ISTA) that calculates the gradient at the previous coefficient estimates and updates all of the coefficients simultaneously at each iteration (Beck and Teboulle, 2009) as

$$B^{k+1} := \mathrm{prox}_{(\mathrm{step}\,\lambda)\, g}\!\left(B^{k} - \mathrm{step}\,\nabla f(B^{k})\right). \qquad (10)$$

Note that the proximal operator of g is simply the soft-thresholding operator $\mathrm{prox}_{\rho g}(u) = S_\rho(u)$ given by Equation (9). The updates are also multiplied by a small, fixed step size (step) that is less than 1. While ISTA may take more iterations than coordinate descent to converge, each iteration is faster because the gradient can be calculated efficiently as a matrix product.

Choosing the step size requires some care, as an overly small step size can result in slow convergence and an overly large one can lead to divergence. A suggested approach for choosing the step size is to use the reciprocal of the (smallest) Lipschitz constant of $\nabla f$, given by $2 \times \{\text{maximum eigenvalue of } (Z \otimes X)^\top (Z \otimes X)\}$. The maximum eigenvalue of $(Z \otimes X)^\top (Z \otimes X)$ is equal to the product of the maximum eigenvalues of $Z^\top Z$ and $X^\top X$, which allows us to bypass computing the Kronecker product.
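The eigenvalue identity behind this shortcut can be checked numerically; here is a small Julia sketch (illustrative sizes) comparing the Lipschitz constant computed from the full Kronecker product against the factorized version:

```julia
using LinearAlgebra

X, Z = randn(30, 5), randn(20, 4)
lip_kron   = 2 * eigmax(Symmetric(kron(Z, X)' * kron(Z, X)))            # forms the Kronecker product
lip_factor = 2 * eigmax(Symmetric(X' * X)) * eigmax(Symmetric(Z' * Z))  # avoids it entirely
lip_kron ≈ lip_factor   # true up to floating-point error
```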

Fast iterative shrinkage-thresholding algorithms (FISTAs) are an extension of ISTA (Beck and Teboulle, 2009; Nesterov, 1983) that calculate the gradient $\nabla f$ based on extrapolated coefficients composed of a linear combination of the coefficients at the previous two iterations. If $\hat{B}$ is the matrix of coefficient estimates from the most recent iteration and $\hat{B}_{\mathrm{prev}}$ is that from the second-to-last iteration, then at iteration k calculate $\nabla f$ at $A = \hat{B} + \frac{k-1}{k+2}\left(\hat{B} - \hat{B}_{\mathrm{prev}}\right)$. This approach takes into account the change between the coefficients in previous iterations, leading to a “damped oscillation” convergence that reduces overshooting when the local gradient is changing quickly (Su, Boyd and Candès, 2016).
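Below is a hedged Julia sketch combining the ISTA update in Equation (10) with this extrapolation, using a fixed step size (the `soft_threshold` helper from the earlier sketch is assumed); this is an illustration of the scheme, not the package implementation:

```julia
using LinearAlgebra

grad_f(B, Y, X, Z) = -X' * (Y - X * B * Z') * Z   # full gradient of f, cf. Equation (8)

function fista_fixed(Y, X, Z, λ; iters = 1000,
                     step = 1 / (2 * eigmax(Symmetric(X' * X)) * eigmax(Symmetric(Z' * Z))))
    B = zeros(size(X, 2), size(Z, 2))
    Bprev, A = copy(B), copy(B)
    for k in 1:iters
        B = soft_threshold.(A .- step .* grad_f(A, Y, X, Z), step * λ)   # prox step
        A = B .+ ((k - 1) / (k + 2)) .* (B .- Bprev)                     # extrapolation
        Bprev = B
    end
    return B
end
```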

Even faster convergence can be achieved by implementing a backtracking line search to find the maximum step size at each iteration, instead of initializing a fixed step size. The idea is that the step size should be small enough that the decrease in the objective function corresponds to the decrease expected by the gradient. First, pick an initial step size and choose a multiplying factor 0<γ<1 with which to iteratively shrink the step size. In practice, we find that an initial step size of 0.01 often works well. At each update step, iteratively shrink the step size by multiplying it with γ until it satisfies the property in equation (11). Then update the coefficients.

$$\tfrac{1}{2}\left\|Y - XBZ^\top\right\|_2^2 \;\leq\; \tfrac{1}{2}\left\|Y - XAZ^\top\right\|_2^2 + \left\langle B - A,\, \nabla f(A) \right\rangle + \frac{1}{2\,\mathrm{step}}\left\|B - A\right\|_2^2 \qquad (11)$$
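One backtracking update around Equation (11) might look like the following hedged Julia sketch, which assumes the `grad_f` and `soft_threshold` helpers from the earlier sketches along with an initial step size and shrink factor γ:

```julia
using LinearAlgebra

function backtracking_update(A, Y, X, Z, λ; step = 0.01, γ = 0.5)
    G  = grad_f(A, Y, X, Z)
    fA = 0.5 * norm(Y - X * A * Z')^2
    while true
        B   = soft_threshold.(A .- step .* G, step * λ)
        lhs = 0.5 * norm(Y - X * B * Z')^2
        rhs = fA + dot(B - A, G) + norm(B - A)^2 / (2 * step)
        lhs <= rhs && return B, step      # condition in Equation (11) satisfied
        step *= γ                         # otherwise shrink the step and try again
    end
end
```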


Like coordinate descent, we implement FISTA using a path of “warm starts”. We note that various further refinements and extensions have been made for FISTA and FISTA-like algorithms in recent years (Florea and Vorobyov, 2017; Kim and Fessler, 2018; Liang and Schönlieb, 2018; Ochs and Pock, 2017). Coordinate descent, ISTA, and FISTA with fixed step size or backtracking each trade off between speed and stability.

4.3. ADMM.

Utilizing the same split of the objective function as ISTA and FISTA, the alternating direction method of multipliers (ADMM) uses the proximal operators of both f and g. To minimize the objective function, one iterates between three updates:

$$B_0^{k+1} := \mathrm{prox}_{\rho f}\!\left(B_1^{k} - B_2^{k}\right) \qquad (12)$$
$$B_1^{k+1} := \mathrm{prox}_{(\lambda/\rho)\, g}\!\left(B_0^{k+1} + B_2^{k}\right) \qquad (13)$$
$$B_2^{k+1} := B_2^{k} + B_0^{k+1} - B_1^{k+1} \qquad (14)$$

B0 and B1 converge to each other and to the optimal coefficient estimates B^, but have slightly different properties.

When working with the vectorized/univariate model given by Equation (3), the proximal operators of $f(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 = \tfrac{1}{2}\left(y^\top y - 2y^\top X\beta + \beta^\top X^\top X \beta\right)$ and $g(\beta) = \lambda\|\beta\|_1$ are known to be

$$\mathrm{prox}_{\rho f}(u) = \left(\rho I + X^\top X\right)^{-1}\left(\rho u + X^\top y\right) \qquad (15)$$

and

$$\mathrm{prox}_{\rho g}(u) = S_{\lambda/\rho}(u). \qquad (16)$$

The soft-thresholding operator in Equation (16) can conveniently be applied element-wise. However, a potential bottleneck in this scheme is the inversion of $\rho I + X^\top X$ in Equation (15), so consider re-formulating $f(\beta)$ in terms of the spectral decomposition $X^\top X = Q \Lambda Q^\top$:

$$f(\beta) = \tfrac{1}{2}\left(y^\top y - 2 y^\top X \beta + \beta^\top X^\top X \beta\right) = \tfrac{1}{2}\left(y^\top y - 2 y^\top X Q Q^\top \beta + \beta^\top Q \Lambda Q^\top \beta\right) = \tfrac{1}{2}\left(y^\top y - 2 y^\top X^* \beta^* + (\beta^*)^\top \Lambda\, \beta^*\right) = f(\beta^*),$$

where $\beta^* = Q^\top \beta$, $\beta = Q \beta^*$, and $X^* = XQ$.

By applying the property that $\mathrm{prox}_{\rho f}(u) = Q\,\mathrm{prox}_{\rho f}\!\left(Q^\top u\right)$ when Q is an orthogonal matrix, an equivalent update can be derived that involves element-wise division instead of matrix inversion.

$$\mathrm{prox}_{\rho f}(u) = Q\,\mathrm{prox}_{\rho f}(u^*) = Q\left(\rho I + \Lambda\right)^{-1}\left[\rho u^* + (X^*)^\top y\right] = Q\left(\rho I + \Lambda\right)^{-1}\left[\rho Q^\top u + (X^*)^\top y\right] = Q\left\{\left[\rho Q^\top u + (X^*)^\top y\right] ./ \left[\rho + \mathrm{diag}(\Lambda)\right]\right\} \qquad (17)$$

In the above expression, ./ denotes element-wise division and $\mathrm{diag}(\Lambda)$ extracts the diagonal elements of $\Lambda$, namely the eigenvalues of $X^\top X$.

To obtain the analogous proximal operators for matrix linear model updates, we return to the vectorized formulation in Equation (3) and recognize that

$$X^\top X = (Z \otimes X)^\top (Z \otimes X) = Z^\top Z \otimes X^\top X = \left(Q_Z \Lambda_Z Q_Z^\top\right) \otimes \left(Q_X \Lambda_X Q_X^\top\right) = \left(Q_Z \otimes Q_X\right)\left(\Lambda_Z \otimes \Lambda_X\right)\left(Q_Z \otimes Q_X\right)^\top,$$

where the third equality follows from the spectral decompositions $Z^\top Z = Q_Z \Lambda_Z Q_Z^\top$ and $X^\top X = Q_X \Lambda_X Q_X^\top$. Then $Q = Q_Z \otimes Q_X$, $\Lambda = \Lambda_Z \otimes \Lambda_X$, and $X^* = XQ = (Z \otimes X)(Q_Z \otimes Q_X)$. Also recall that $y = \mathrm{vec}(Y)$ and $\beta = \mathrm{vec}(B)$, and apply Kronecker product properties to Equation (17) to get the final devectorized expression in Equation (18):

$$\begin{aligned}
\mathrm{prox}_{\rho f}\!\left(\mathrm{vec}(U)\right) &= \left(Q_Z \otimes Q_X\right)\left[\rho \left(Q_Z \otimes Q_X\right)^\top \mathrm{vec}(U) + \left((Z \otimes X)(Q_Z \otimes Q_X)\right)^\top \mathrm{vec}(Y)\right] ./ \left[\rho + \mathrm{diag}\!\left(\Lambda_Z \otimes \Lambda_X\right)\right] \\
&= \left(Q_Z \otimes Q_X\right) \mathrm{vec}\!\left[\left(\rho\, Q_X^\top U Q_Z + Q_X^\top X^\top Y Z Q_Z\right) ./ \left(\rho + \mathrm{diag}\!\left(\Lambda_Z \otimes \Lambda_X\right)\right)\right] \\
&= \left(Q_Z \otimes Q_X\right) \mathrm{vec}\!\left[\left(\rho\, Q_X^\top U Q_Z + Y^*\right) ./ (\rho + L)\right] \\
&= \mathrm{vec}\!\left\{Q_X\left[\left(\rho\, Q_X^\top U Q_Z + Y^*\right) ./ (\rho + L)\right] Q_Z^\top\right\} \\
\mathrm{prox}_{\rho f}(U) &= Q_X\left[\left(\rho\, Q_X^\top U Q_Z + Y^*\right) ./ (\rho + L)\right] Q_Z^\top \qquad (18)
\end{aligned}$$

The only necessary Kronecker product is therefore that between the diagonal matrices $\Lambda_Z$ and $\Lambda_X$, a cheap calculation compared to a Kronecker product of dense matrices. One can also pre-compute $Y^* = Q_X^\top X^\top Y Z Q_Z$ and $L = \mathrm{vec}_{p,q}^{-1}\!\left[\mathrm{diag}\!\left(\Lambda_Z \otimes \Lambda_X\right)\right]$. Here $\mathrm{vec}_{n,m}^{-1}$ denotes the inverse of the vectorization operator, such that $\mathrm{vec}_{n,m}^{-1}[\mathrm{vec}(A)] = A$ for all $A \in \mathbb{R}^{n \times m}$ and $\mathrm{vec}\!\left[\mathrm{vec}_{n,m}^{-1}(a)\right] = a$ for all $a \in \mathbb{R}^{nm}$.
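A hedged Julia sketch of the matrix-form proximal operator in Equation (18), with the eigendecompositions and the quantities Y* and L pre-computed once (the function name and interface are ours, for illustration only):

```julia
using LinearAlgebra

function make_prox_f(Y, X, Z)
    EX = eigen(Symmetric(X' * X))          # X'X = Q_X Λ_X Q_X'
    EZ = eigen(Symmetric(Z' * Z))          # Z'Z = Q_Z Λ_Z Q_Z'
    QX, QZ = EX.vectors, EZ.vectors
    Ystar = QX' * X' * Y * Z * QZ          # p × q, pre-computed once
    L = EX.values * EZ.values'             # L[i, j] = λ_X[i] * λ_Z[j]
    return (U, ρ) -> QX * ((ρ .* (QX' * U * QZ) .+ Ystar) ./ (ρ .+ L)) * QZ'
end
```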

When a rough solution is sufficient, ADMM can be a good approach because it is often easy to implement and converges to approximate estimates quickly. However, ADMM has been observed to be slow when a high degree of accuracy is desired. Like the choice of step size in FISTA, the choice of $\rho > 0$ to tune ADMM has consequences for the speed of convergence. To set the initial value of $\rho$, we follow the suggestion laid out by Ghadimi et al. (2012) for the L1-regularized ADMM algorithm. When $\lambda < \min\{\mathrm{diag}(\Lambda_Z \otimes \Lambda_X)\}$, that is, when the penalty parameter $\lambda$ is less than the minimum eigenvalue of $(Z \otimes X)^\top (Z \otimes X)$, we set $\rho = \min\{\mathrm{diag}(\Lambda_Z \otimes \Lambda_X)\}$. When $\lambda > \max\{\mathrm{diag}(\Lambda_Z \otimes \Lambda_X)\}$, we set $\rho = \lambda$; otherwise, we set $\rho = \max\{\mathrm{diag}(\Lambda_Z \otimes \Lambda_X)\}$.

Furthermore, Boyd et al. (2011) describe a simple approach for varying the ADMM tuning parameter such that the rate of convergence is less dependent on the initial choice of ρ. Define the primal residuals as r=B1-B0 and the dual residuals s as the difference between the values of B1 at the previous and current iterations. At the (k+1)th iteration, update ρ as

$$\rho^{k+1} = \begin{cases} \tau^{\mathrm{incr}}\, \rho^{k} & \text{if } \|r^{k}\|_2 > \mu \|s^{k}\|_2 \\ \rho^{k} / \tau^{\mathrm{decr}} & \text{if } \|s^{k}\|_2 > \mu \|r^{k}\|_2 \\ \rho^{k} & \text{otherwise}, \end{cases}$$

for some choice of parameters $\mu > 1$, $\tau^{\mathrm{incr}} > 1$, and $\tau^{\mathrm{decr}} > 1$. We use the typical values, as indicated in the paper, of $\mu = 10$ and $\tau^{\mathrm{incr}} = \tau^{\mathrm{decr}} = 2$. If $\rho$ changes between iterations, $B_2$ must be rescaled accordingly.
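The adaptive-ρ rule can be written directly from the expression above; here is a hedged Julia sketch with the stated defaults (the rescaling of B2 when ρ changes is left to the caller):

```julia
using LinearAlgebra

function update_rho(ρ, r, s; μ = 10.0, τ_incr = 2.0, τ_decr = 2.0)
    if norm(r) > μ * norm(s)
        return ρ * τ_incr
    elseif norm(s) > μ * norm(r)
        return ρ / τ_decr
    else
        return ρ
    end
end
```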

4.4. Computational Considerations.

We emphasize again that while many solvers are available for the vectorized matrix linear model given by Equation (3), this formulation is impractical or impossible even for moderately sized X and Z because their Kronecker product is too large. Many of the operations in our algorithms, such as matrix multiplication and element-wise operations, are parallelizable. With suitable hardware and software, significant speedups are possible. Our implementation did not use any parallelization.

4.4.1. Shrinkage Parameter Tuning.

To determine the optimal shrinkage/regularization parameter λ, k-fold cross-validation is recommended; a parallel implementation is straightforward. Various criteria can be used to identify optimal performance averaged across the k folds, including mean squared error (MSE), test error, AIC, and BIC. We used MSE for the analyses presented in the Results (Section 5). It is also possible to choose a λ based on the proportion of significant (nonzero) interactions desired.

4.4.2. Software Implementation.

We implemented these algorithms using the high-level programming language Julia (Bezanson et al., 2017). Julia is a relatively young language with an active community that combines ease of prototyping with computational speed. It features a just-in-time compiler and strong data typing, which enable fast computation. It is an attractive candidate for numerical computing problems such as ours, since one does not need to switch between multiple programming languages for implementation, analysis, and visualization. Julia has built-in support for parallelization which is helpful for large-scale analysis.


Our package’s primary function, mlmnet, allows users to specify the data, penalty values, and estimation algorithm. Users can also choose which rows and columns of B should be regularized, and if one or both of the X and Z intercepts should be included and/or regularized. The package implements parallelized cross-validation for tuning λ and includes several functions for summarizing results.

5. Results.

5.1. Simulated data with varying dimensions.

To illustrate the speed of FISTA with backtracking and ADMM, we ran the algorithms on simulated data while fixing the dimensions of the multivariate response matrix and varying the dimensions of the interaction matrix (Table 1), or vice versa (Table 2). The setup represents a two-way layout where the row covariates correspond to a factor with p levels and the column covariates correspond to a factor with q levels. The data was simulated with 1/2 nonzero row and column main effects and 1/8 nonzero interactions drawn from Normal(0, 2) distributions. That is, to get the row main effects, we simulated a vector of length p with a random half of the entries set to zero and the other half of the entries drawn from Normal(0, 2). The column main effects (a vector of length q) and interactions (a p×q matrix) were obtained similarly. The X matrix was generated by repeatedly stacking p×p identity matrices until n rows were reached, and analogously for Z. We also included an intercept in both X and Z by concatenating a column of 1s, to encode for the main effects. Errors were drawn from Normal(0, 3). Times are presented as averages of 100 replicates, each run over 20 λ values. We used a dual CPU Xeon E5–2623 v3 @ 3.00GHz processor with 125 G RAM.
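For concreteness, a hedged Julia sketch of this simulation setup follows (names are ours; the proportions of nonzero effects are drawn approximately rather than exactly, and Normal(0, 2) and Normal(0, 3) are interpreted as standard deviations 2 and 3):

```julia
using LinearAlgebra, Random

function simulate_two_way(n, m, p, q; rng = MersenneTwister(1))
    # approximately the stated proportion of nonzero effects, drawn from Normal(0, 2)
    sparse_effects(dims, prop) = 2 .* randn(rng, dims...) .* (rand(rng, dims...) .< prop)
    row_main = sparse_effects((p,), 1/2)
    col_main = sparse_effects((q,), 1/2)
    inter    = sparse_effects((p, q), 1/8)
    B = [0.0 col_main'; row_main inter]            # coefficient matrix; overall intercept set to zero

    stack_id(k, rows) = repeat(Matrix{Float64}(I, k, k), cld(rows, k), 1)[1:rows, :]
    X = hcat(ones(n), stack_id(p, n))              # intercept plus stacked identity blocks
    Z = hcat(ones(m), stack_id(q, m))
    Y = X * B * Z' + 3 .* randn(rng, n, m)         # errors from Normal(0, 3)
    return Y, X, Z, B
end

Y, X, Z, B = simulate_two_way(1200, 1200, 200, 200)   # e.g. the smallest case in Table 1
```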

Table 1. Ratios of computation times for running FISTA with backtracking and ADMM on simulated data while varying p and q (the dimensions of the interaction matrix).

Times (in minutes) were obtained as averages of 100 replicates, each run over 20 λ values and holding n = m = 1200. The raw runtimes for FISTA and ADMM are reported to the left and right of the forward slash, respectively (FISTA/ADMM). The cell colors indicate the magnitude of the discrepancy between the two methods, based on ratios of the runtimes.

FISTA/ADMM q = 200 q = 400 q = 600 q = 800 q = 1000

p = 200 1.20/0.80 2.05/1.23 2.69/1.71 3.35/2.27 3.91/2.89
p = 400 1.29/0.85 1.85/1.44 2.51/2.10 3.18/2.83 3.77/3.60
p = 600 1.38/1.00 2.05/1.80 2.72/2.73 3.47/3.57 4.09/4.87
p = 800 1.50/1.28 2.16/2.25 2.94/3.34 3.54/4.47 4.14/6.19
p = 1000 1.58/1.59 2.23/2.70 3.09/3.98 3.90/5.89 4.65/7.90

Table 2. Ratios of computation times for running FISTA with backtracking and ADMM on simulated data while varying n and m (the dimensions of the multivariate response matrix).

Times (in minutes) were obtained as averages of 100 replicates, each run over 20 λ values and holding p = q = 400. The raw runtimes for FISTA and ADMM are reported to the left and right of the forward slash, respectively (FISTA/ADMM). The cell colors indicate the magnitude of the discrepancy between the two methods, based on ratios of the runtimes.

FISTA/ADMM m = 400 m = 800 m = 1200 m = 1600 m = 2000

n = 400 0.53/1.35 0.51/0.69 0.64/0.73 0.83/0.81 1.00/0.94
n = 800 0.70/0.76 0.99/0.91 1.28/1.11 1.55/1.26 1.85/1.52
n = 1200 0.99/0.91 1.48/1.13 1.83/1.46 2.36/1.77 3.07/2.07
n = 1600 1.27/1.06 1.86/1.44 2.59/1.86 3.24/2.28 3.92/2.70
n = 2000 1.54/1.23 2.33/1.71 3.13/2.19 4.10/2.75 4.94/3.32

Both algorithms remain fast even when scaling to greater dimensions. Interestingly, ADMM is much faster than FISTA in cases where n and m are large relative to p and q. Its runtimes also scale better when increasing n and m. However, when p and q approach n and m (i.e. when X and/or Z are close to being square matrices), the computational performance of ADMM suffers greatly. QX is p×p and QZ is q×q, so the matrix multiplication used to transform and back-transform B0 in the ADMM updates relies heavily on the size of p and q rather than n and m.

The rate at which the runtimes increase is also not entirely symmetrical for the two methods, both individually and relative to each other. For example, it appears to be more computationally expensive to increase the number of columns q in Z than it is to increase the number of columns p in X, for either method. However, the runtimes also increase more quickly for ADMM than for FISTA when scaling up q compared to scaling up p.

5.2. Environmental screening simulations.

We simulated data modeled after an environmental screening study (Woodruff, Zota and Schwartz, 2011) using mass spectrometry. The study measured environmental chemical concentrations in pregnant women across various demographics in several tissues. We simulated data from 100 chemicals, each measured in 10 tissues for 108 women. The tissues, chemicals, and each unique combination of tissues and chemicals were encoded in the Z matrix. That is, the Z matrix is comprised of an intercept, 100 dummy variables for the chemicals, 10 dummy variables for the tissues, and 100×10 = 1000 dummy variables for the unique chemical-tissue combinations:

$$Z_{1000 \times 1111} = \left[\; 1 \;\middle|\; \text{100 chemicals} \;\middle|\; \text{10 tissues} \;\middle|\; \text{100 chemicals} \times \text{10 tissues} \;\right]. \qquad (19)$$
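A hedged Julia sketch of this Z design (assuming, for illustration, that the columns of Y are ordered with tissue varying fastest within chemical):

```julia
using LinearAlgebra

nchem, ntissue = 100, 10
Ichem   = Matrix{Float64}(I, nchem, nchem)
Itissue = Matrix{Float64}(I, ntissue, ntissue)
Z = hcat(ones(nchem * ntissue),
         kron(Ichem, ones(ntissue)),     # 100 chemical dummies
         kron(ones(nchem), Itissue),     # 10 tissue dummies
         kron(Ichem, Itissue))           # 1000 chemical-by-tissue dummies
size(Z)                                  # (1000, 1111)
```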

We then simulated an X matrix with 19 continuous demographic covariates drawn from the standard normal distribution, and included an intercept. We simulated effects drawn from a Normal(0, 2) distribution for each tissue, for 1/4 of the chemicals, for 1/2 of the demographic covariates, and for 1/8 of the interactions between chemicals/tissues and demographics. This was done similarly to how we generated the main effects and interactions in Section 5.1. In this case, the demographic effects correspond to the row main effects, and the element-wise sum of the tissue and chemical effects corresponds to the column effects. Errors were drawn from Normal(0, 3). We used FISTA with backtracking to estimate the main and interaction effects. The receiver operating characteristic (ROC) curves in Figure 2 compare the performance of L1-penalized matrix linear models (MLM) to the conventional approach of running a univariate linear model for each chemical and tissue combination.

  • The solid black line plots the results for the L1-penalized MLM. We obtained true positive rates (TPR) and false positive rates (FPR) by varying λ and comparing the nonzero and zero interaction estimates to the true interactions.

  • The dotted red line is from running the 1000 univariate linear regression models for each combination of the 100 simulated chemicals and 10 simulated tissues. We obtained the adaptive Benjamini-Hochberg adjusted p-values (Benjamini and Hochberg, 2000; R Core Team, 2017) for each model’s coefficient estimates and varied the cutoff for determining significant interactions. These were compared to the true interaction effects to calculate the TPR and FPR.

  • The blue lines of various line types offer an alternate interpretation of the univariate linear models. For each chemical, there were 10 chemical × demographic interactions, one for each of the 10 tissues. We flagged an interaction if at least 1/5, 2/5, 3/5, or 4/5 of the 10 different p-values were below the cutoff. A plot with curves for the 10 tissues, each of which corresponds to p-values from 100 different linear models, yields similar results.

Fig 2. ROC curves for simulations comparing L1-penalized matrix linear models to univariate linear regression for identifying chemical interactions in environmental screening data (Woodruff, Zota and Schwartz, 2011).

The AUC (Ekstrøm, 2018) for each method is given in parentheses in the legend. L1-penalized matrix linear models outperform the univariate approach.

Our method consistently outperforms variations of the conventional univariate approach. The L1 penalized MLM resulted in an area under the curve (AUC) (Ekstrøm, 2018) of 0.884; the AUC for the univariate linear regression interpretations was at most 0.686, which occurred when an interaction was flagged if at least 1/5 of the ten univariate p-values (“hits”) were significant.

5.3. E. coli chemical genetic screen.

A study by Nichols et al. (2011) aimed to examine the interaction effects between 3,983 E. coli mutant strains and 307 growth conditions. The mutant strains were taken from the Keio single-gene deletion library (Baba et al., 2006); essential gene hypomorphs (C-terminally tandem-affinity tagged (Butland et al., 2008) or specific alleles); and a small RNA/small protein knockout library (Hobbs, Astarita and Storz, 2010). Colony opacity was recorded for mutant strains grown in high density on agar plates. Six plate arrangements of mutants were used, with 1536 colonies grown per plate. In this context, a “plate arrangement” refers to the choice of mutants and exposures as well as their positioning in the 1536 wells. More than half of the growth conditions were antibiotic/antimicrobial treatments, but other types of conditions, such as temperature and pH, were included. For this data application, we specified a separate model for each of the six plate arrangements. In each case, X was a design matrix encoding the growth conditions factor in the plate arrangement and Z was a design matrix encoding the mutants as a factor. We included an intercept for the main effects in both the X and Z matrices.

Auxotrophs are mutant strains that have lost the ability to synthesize a particular nutrient required for growth. Since they should experience little to no colony growth under specific conditions where the required nutrient is not present, we anticipate negative interactions between auxotrophic mutants and minimal media growth conditions. While using sparse estimates for this analysis may not be a good modeling choice because we expect many of the interactions to be negative rather than zero, examining auxotrophs as controls is nevertheless useful, since the phenotype under particular conditions for a mutant strain is typically not known.

In their original analysis of the colony size data, Nichols et al. (2011) empirically identified 102 auxotrophs. Similar to what we did for the least-squares t-statistics (obtained by dividing the least squares coefficient estimates by their standard errors) in Liang, Nichols and Sen (2019), we empirically identified auxotrophs based on the sparse estimates. To do this, we obtained the quantiles of the interaction estimates for a given λ penalty for each mutant strain under minimal media conditions. Mutants whose 95% quantile for interactions with minimal media conditions fell below zero were classified as auxotrophs. The lambda that minimizes the mean-squared error within one cross-validation standard error was λ=0.46 for three of the plates and λ=0.35 for the other three plates. When setting λ=0.46 across all six plates, our auxotrophs had an 85% overlap with the Nichols et al. (2011) auxotrophs. This is consistent with the 83% overlap found in our earlier work on least-squares t-statistics (Liang, Nichols and Sen, 2019). Some of the discrepancy may be due to differences between analyzing colony opacity, as we did, and analyzing colony size, as Nichols et al. (2011) did. Figure 3 visualizes the distributions of each auxotroph’s sparse interactions (λ=0.46) across minimal media conditions. The interaction estimates are plotted as points, and the median for each auxotroph is plotted as a horizontal bar; most fall below zero.

Fig 3. Distributions of sparse matrix linear model interaction estimates (λ=0.46) for auxotrophs identified by Nichols et al. (2011) over minimal media conditions in the E. coli chemical genetic screening data.

The Nichols et al. (2011) auxotrophs are plotted along the horizontal axis. The L1 penalized MLM interactions between the auxotrophs and minimal media conditions are plotted along the vertical axis, with the horizontal bars indicating the median value. Most interactions fall below zero, indicating little growth.

To compute the AUC, we took the auxotrophs identified by Nichols et al. (2011) to be the “true” auxotrophs. We then obtained TPRs and FPRs by varying cutoffs for the median minimal media interactions for the auxotrophs that we identified using L1 penalized estimates and least squares t-statistics. The AUC was 0.892 for the L1 penalized estimates and 0.884 for the least squares t-statistics (Ekstrøm, 2018). Supplemental Figure S1 (Liang and Sen, 2021a) plots the ROC curves for the two methods, which appear very similar. So, there is high concordance between our empirically identified auxotrophs and those identified by Nichols et al. (2011), as well as between our two approaches.

The auxotrophs are not expected to be sparse, but we can also consider all of the condition × mutant interactions by using simulated data. As we did in Liang, Nichols and Sen (2019), we used the framework of the X and Z matrices from the E. coli data’s six plate arrangements to simulate data with 1/2 nonzero main effects and 1/4 nonzero interactions drawn from a Normal(0, 4) distribution. The errors were independent and identically distributed from the standard normal distribution. We then obtained the adaptive Benjamini-Hochberg adjusted (Benjamini and Hochberg, 2000; R Core Team, 2017) permutation p-values from the least squares estimates and the L1-penalized estimates for 50 λ values. To compare the results for each plate arrangement, we considered the AUCs and plotted the ROC curves. For the least squares approach, we obtained TPRs and FPRs by varying p-value cutoffs to determine which adjusted p-values correspond to significant (nonzero) interactions. For the L1-penalized solutions, we obtained TPRs and FPRs by varying λ and comparing the nonzero and zero interaction estimates to the true interactions.

The AUCs were very similar between the two methods for all plates (Table 3), but were consistently higher for the least squares approach. Figure 4 plots the ROC curves for the first plate arrangement, in which it is apparent that the two curves are nearly identical in trajectory. The ROC curves for the remaining five plates, which look quite similar to those in Figure 4, are shown in Supplemental Figure S2 (Liang and Sen, 2021a). In this situation, it appears that it is both simpler and more effective to use the closed-form least squares estimates, rather than attempting regularization.

Table 3. Area under the curve (Ekstrøm, 2018) for simulations based on each of the six plate arrangements in the E. coli chemical genetic screening data (Nichols et al., 2011).

The least squares matrix linear models are similar to, but consistently outperform, the L1-penalized solutions in each of the six cases.

Plate 1 2 3 4 5 6

L1-Penalized 0.835 0.830 0.778 0.843 0.838 0.843
Least Squares 0.845 0.843 0.853 0.852 0.847 0.854

Fig 4. ROC curves comparing least squares to L1-penalized estimates applied to data simulated using framework of first plate arrangement in the E. coli chemical genetic screening data (Nichols et al., 2011).

The two methods perform very similarly, with the least squares t-statistics (AUC (Ekstrøm, 2018) of 0.845) performing slightly better than the L1-penalized solutions (AUC of 0.835).

5.4. Arabidopsis G×E experiment.

Ågren et al. (2013) studied 404 Arabidopsis thaliana recombinant inbred lines derived by crossing two ecotypes originating in Italy and Sweden. The plants were grown in six environments: in two sites (Italy and Sweden) measured over three years (2009–2011). The investigators genotyped 348 markers with the goal of mapping quantitative trait loci (QTL) to explain genetic mechanisms of fitness adaptation to local environments (Ågren et al., 2017, 2016). We ran L1-penalized MLMs to determine significant interactions between the markers and each of the two sites and six environments. The 348 markers were encoded as dummy variables in the X matrix, which also contained an intercept. The Z matrix was comprised of a contrast between the two sites (Italy and Sweden), representing the QTL-site interactions, and a regularized intercept, representing main effect QTL:

$$Z_{6 \times 2} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & -1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}. \qquad (20)$$

We used fruit production per seedling as the response data, and only considered the 390 lines with complete response data for all 6 environments. Data pre-processing was performed in R (R Core Team, 2018) with the help of the R/qtl package (Broman et al., 2003).

We performed 10-fold cross-validation with MSE as the criterion to determine an optimal λ penalty size of 6.2. Figure 5 plots the absolute main QTL effects (above the x-axis) and the absolute QTL-site (Italy vs. Sweden) interactions (below the x-axis) against marker position on the five chromosomes. Dotted vertical reference lines separate the chromosomes, and the peaks correspond to loci with significant, nonzero interactions. Several peaks are apparent on chromosomes 1, 2, 4, and 5. These results are largely aligned with the significant QTL found by Ågren et al. (2013), although it is notable that they identified significant QTL on chromosome 3, and we did not find any. Our approach has the advantage of being able to quickly analyze all six environments simultaneously by encoding the Z matrix with the site information.

Fig 5. Absolute QTL effects plotted against marker position for the Ågren et al. (2013) Arabidopsis data.

The absolute main QTL effects are plotted above the x-axis and the absolute QTL-site (Italy vs. Sweden) interactions are plotted below the x-axis. Dotted vertical reference lines separate the five chromosomes. We see main effect QTLs on chromosomes 1, 4, and 5; and interaction effects on chromosomes 1, 2, 4, and 5.

Table 4 compares the times for running our implementations of the different algorithms for obtaining L1-penalized estimates on this dataset. We averaged 100 replicates obtained from a dual CPU Xeon E5–2623 v3 @ 3.00GHz processor with 125 G RAM. For this moderately-sized data and 50 λ values, the L1-penalized coefficients can be computed within a few minutes using FISTA and ADMM. In this case, ADMM is the fastest, and FISTA with backtracking is slightly slower than FISTA with a fixed step size. Both coordinate descent algorithms take several times longer, with average runtimes of over 10 minutes. Here, cyclic coordinate descent is faster than random coordinate descent, but iterating over random directions may be faster in other scenarios. ISTA is notable for taking more than an hour to complete, which may be in part due to this particular dataset and in part due to the choice of fixed step size. Based on our experience, while it is unusual to observe such an egregiously slow performance from ISTA, it is consistently slower than FISTA.

Table 4. Computation time (minutes) to obtain interactions for the Ågren et al. (2013) Arabidopsis data for cyclic coordinate descent, random coordinate descent, ISTA with fixed step size, FISTA with fixed step size, FISTA with backtracking, and ADMM.

Times were obtained as averages of 100 replicates, each run over 50 λ values.

Algorithm Time (min)

Coordinate descent (cyclic) 10.56
Coordinate descent (random) 10.97
ISTA (fixed step size) 67.01
FISTA (fixed step size) 1.39
FISTA (backtracking) 1.50
ADMM 1.02

5.5. eQTL experiment in two environments.

Lowry et al. (2013) examined the regulation and evolution of gene expression by considering drought stress. This expression quantitative trait locus (eQTL) mapping experiment studied 104 individuals from the Tsu-1 (Tsushima, Japan) × Kas-1 (Kashmir, India) recombinant inbred line population of Arabidopsis thaliana. It was conducted across wet and dry soil treatments with two replicates. Gene expression phenotypes were collected for 25,662 genes, and 450 markers were genotyped (Lowry et al., 2013; Lovell et al., 2015). The goal was to identify main effect (G) and interaction (G × E) eQTLs for the environmental conditions. Here, the X matrix encoded dummy variables for the 450 markers, an additional treatment contrast encoding cytoplasm, and the intercept. The Z matrix, which encoded for main effects and drought treatment interactions for the 25,662 expression phenotypes, can be expressed as

$$Z_{51324 \times 51324} = I_{25662} \otimes \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \qquad (21)$$

Data pre-processing was performed in R (R Core Team, 2018) with the help of the R/data.table (Dowle and Srinivasan, 2018) and R/qtl packages (Broman et al., 2003).

It took 6.96 hours to run the FISTA algorithm for a path of 16 λ penalties and 13.06 hours for ADMM (ADMM is likely slow because Z is a huge square matrix). We used a dual CPU Xeon E5–2623 v3 @ 3.00GHz processor with 125 G RAM. However, applying the R/qtl package’s stepwiseqtl function (Broman et al., 2003) one by one for each phenotype is estimated to take many times as long, at 80.00 hours. This estimate was obtained by running stepwiseqtl on 100 random phenotypes and extrapolating the resulting time, averaged over 10 runs, to the full set of 51,324 phenotypes. The large Z matrix also showcases another advantage of using FISTA with backtracking. Performing the spectral decomposition needed to compute the fixed step size for FISTA or to update B0 in ADMM easily exceeds memory limits for a typical computer. We were only able to run ADMM in this case because the Z matrix has a special structure such that the eigenvectors form an identity matrix. Using a backtracking line search sidesteps these dilemmas altogether.

Figure 6, which reproduces Figure 2 in Lowry et al. (2013) using our results, visually summarizes the main effect and interaction eQTLs identified by our method when λ=1.73. Our method was able to detect many of the same main effects and G × E effects.

Fig 6. Distribution of eQTL across genome in the Lowry et al. (2013) dataset.

Main effects are shown in open blue circles and G×E interactions in solid red squares.

While speed was perhaps not a huge practical concern for the Ågren et al. (2013) data because all of the algorithms finished within minutes, the large number of phenotypes in this dataset makes performing L1-penalized MLMs a much more computationally-intensive endeavor. We compared the runtimes for the different algorithms applied to subsets of the 25,662 genes used to record gene expression phenotypes. To do this, we took random subsets of 25, 50, 100, 200, 400, 800, and 1600 genes. Note that because there were both wet and dry soil environments, the number of genes is equal to half the number of columns of Y and the number of rows and columns of Z, i.e. if we use a subset of 20 genes, m=q=2×20=40. We used 16 λ values for each run and averaged 100 replicates obtained from a dual CPU Xeon E5–2623 v3 @ 3.00GHz processor with 125 G RAM. The results are plotted in Figure 7 and reported in Table 5.

Fig 7. Computation time (seconds) to estimate interactions for subsets of the Lowry et al. (2013) data, plotted against the number of genes taken in the subset.

Random subsets were taken of 25, 50, 100, 200, 400, 800, and 1600 genes. Runtimes are shown for cyclic coordinate descent, random coordinate descent, ISTA with fixed step size, FISTA with fixed step size, FISTA with backtracking, and ADMM. Times were obtained as averages of 100 replicates, each run over 16 λ values. Both axes are shown on a log scale with base 2.

Table 5. Computation time (minutes) to obtain interactions for subsets of the Lowry et al. (2013) data for cyclic coordinate descent, random coordinate descent, ISTA with fixed step size, FISTA with fixed step size, FISTA with backtracking, and ADMM.

Random subsets were taken of 25, 50, 100, 200, 400, 800, and 1600 genes. Times were obtained as averages of 100 replicates, each run over 16 λ values.

Number of genes 25 50 100 200 400 800 1600
Coord. desc. (cyclic) 0.06 0.35 0.64 3.67 16.49 54.20 236.59
Coord. desc. (random) 0.09 0.43 0.99 5.29 22.65 77.02 380.74
ISTA (fixed step size) 0.21 0.75 0.96 2.62 7.50 18.34 70.93
FISTA (fixed step size) 0.04 0.10 0.15 0.40 1.07 3.13 11.07
FISTA (backtracking) 0.06 0.16 0.23 0.54 1.46 3.19 12.66
ADMM 0.09 0.21 0.33 0.82 2.54 6.94 24.43

When only a subset of 25 genes are included in the dataset, estimating the interactions can be done in a matter of seconds for all six algorithms. In such cases, there may not be much of a practical advantage in using an algorithm that requires specifying a good step size or other tuning parameters, compared to the more stable coordinate descent. However, as the number of genes included increases, the gap between the amount of time needed by coordinate descent vs. the amount of time needed by FISTA and ADMM grows. (As the number of genes and the size of q grows, it also becomes apparent that ADMM is slower compared to FISTA in this case.) When 1600 genes are randomly subsetted, FISTA and ADMM still finish within 10 to 25 minutes, compared to the several hours required by coordinate descent. Extrapolating from this, it is not unreasonable to conclude that it would take on the order of days or even weeks to run coordinate descent on the full 25,662 genes in the dataset. This is not very viable, especially when faster alternatives exist.

Cyclic coordinate descent is consistently faster than random coordinate descent for a given number of subsetted genes, which was also observed for the Ågren et al. (2013) data. There is some additional overhead associated with randomly selecting and accessing the coefficients. Moreover, because the QTLs per gene are more-or-less independent, iterating through the coefficients randomly may not be advantageous compared to doing it cyclically.

If the number of subsetted genes is small, ISTA is the slowest method. However, ISTA quickly overtakes the coordinate descent algorithms as the number of genes increases. The primary difference between ISTA and coordinate descent is updating the gradient simultaneously for all of the coefficients using matrix multiplication vs. updating the coefficients one at a time. It appears that ISTA does not enjoy efficiencies from matrix multiplication when the dimensions of the coefficient matrix B are relatively small, like they are when only a few genes are subsetted or in the Ågren et al. (2013) data. Likewise, FISTA with fixed step size is also faster than FISTA with backtracking, but the discrepancy shrinks as the number of subsetted genes increases. If the dimensions are small, the additional overhead of performing a backtracking line search costs more than the speed gained by having a dynamically updated step size.

6. Discussion.

We have developed a fast fitting procedure for sparse, L1-penalized matrix linear models and demonstrated its use on several high-throughput data problems. This approach opens up analytic options for many studies using high-throughput data. Our method takes advantage of the structure of matrix linear models to speed up existing computational algorithms that cannot feasibly be run in the vectorized, univariate setting. Analyses of simulated data and of several previously analyzed datasets illustrate our method's applicability. We note that, as with univariate linear regression models, whether or not to use L1-penalized matrix linear models is a decision that should be dictated by the scientific goals of the study.

The choice between coordinate descent, the various flavors of (F)ISTA, and ADMM is largely a trade-off between speed and stability. Coordinate descent is a reasonably fast approach for computing L1-penalized estimates in univariate linear models, but it is too slow for our multivariate scenario. Instead, we turned to the latter two options, combined with exploitation of the matrix properties and sparsity of our model. The relative speed of ADMM compared to FISTA may depend on the relative dimensions of the data. When the number of interactions (implied by the sizes of p and q) is low relative to the dimensions of the response data (implied by the sizes of n and m), ADMM is likely to be the fastest option. We also note that a from-scratch implementation of ADMM is quite straightforward, compared to FISTA with a backtracking line search.
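As an illustration of that last point, below is a minimal sketch of ADMM for the generic vectorized lasso problem, minimize (1/2)‖y − Ab‖² + λ‖b‖₁. This is a hedged, from-scratch example, not our package's implementation, which operates on the matrix form and never constructs the Kronecker design A explicitly.

    using LinearAlgebra

    # From-scratch ADMM for the generic vectorized lasso
    #   minimize (1/2) * ||y - A*b||^2 + lambda * ||b||_1
    # via the splitting b = z. Illustrative only.
    soft_thresh(v, t) = sign.(v) .* max.(abs.(v) .- t, 0.0)

    function admm_lasso(A, y, lambda; rho = 1.0, iters = 200)
        p = size(A, 2)
        b, z, u = zeros(p), zeros(p), zeros(p)
        F   = cholesky(Symmetric(A' * A + rho * I))   # factor once, reuse each iteration
        Aty = A' * y
        for _ in 1:iters
            b = F \ (Aty .+ rho .* (z .- u))          # ridge-like quadratic subproblem
            z = soft_thresh(b .+ u, lambda / rho)     # proximal step: soft-thresholding
            u = u .+ b .- z                           # scaled dual update
        end
        return z
    end

Each iteration is a linear solve, an elementwise shrinkage, and a vector addition, which is why a bare-bones ADMM takes only a handful of lines.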

Our work demonstrates the feasibility of fitting matrix linear models with moderately large dimensions. It can be extended in several promising directions that would further broaden the applicability of this class of models. First, we can extend the fitting procedure to models with different loss (f) and penalty (g) functions. For example, we can fit the elastic net (Zou and Hastie, 2005) by changing the penalty function, adding an L2 penalty to the L1 penalty. We can also make the solution less sensitive to outliers by using a robust loss function, such as Huber's loss, instead of the squared error loss. Our current implementation allows users to specify which row and column effects (including intercepts) to regularize. If we incorporate the elastic net, a natural extension would be to also allow users to decide separately which penalty terms (if any) should be applied to each of the coefficients.
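For instance, in a proximal-gradient (ISTA/FISTA) loop, swapping in the elastic-net penalty only changes the proximal step. A minimal Julia sketch of that proximal operator is shown below; the function name and the mixing parameter alpha are illustrative assumptions, not part of the current implementation.

    # Proximal operator of the elastic-net penalty
    #   g(B) = lambda * (alpha * ||B||_1 + (1 - alpha)/2 * ||B||_F^2),
    # applied elementwise with step size t. A sketch of how the penalty g could be
    # swapped inside an ISTA/FISTA loop (illustrative only).
    function prox_elastic_net(B, t, lambda, alpha)
        soft = sign.(B) .* max.(abs.(B) .- t * lambda * alpha, 0.0)  # L1 part: soft-threshold
        return soft ./ (1 + t * lambda * (1 - alpha))                # L2 part: uniform shrinkage
    end

Setting alpha = 1 recovers the pure L1 (lasso) proximal step, so the same loop covers both penalties.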

Another direction would be the development of confidence intervals and hypothesis tests in this setting to complement our estimation algorithms. There has been some recent promising work in this direction (Javanmard and Montanari, 2014; Reid, Tibshirani and Friedman, 2016) for L1 penalized univariate regression models. Since the matrix linear model can be vectorized to a univariate linear model, these results may be expected to apply to the matrix case.

A third direction would be to extend the models to multi-dimensional (tensor-valued) responses. In Equation (3), the design matrix is a Kronecker product of two matrices. Through iterative vectorization, this model can be extended to more than two matrices to handle a multi-dimensional response tensor $\mathcal{Y}$. Consider a 3-way tensor $\mathcal{Y}_{n \times m \times l}$ with rows, columns, and pages (horizontal, lateral, and frontal slices) annotated by $X_{n \times p}$, $Z_{m \times q}$, and $W_{l \times r}$, respectively. The goal is to estimate a 3-way tensor of coefficients $\mathcal{B}_{p \times q \times r}$. $\mathcal{Y}$ can be matricized into an $(nm) \times l$ matrix $Y^*$: each $n \times m$ frontal slice is vectorized, and the resulting column vectors are laid out into $l$ columns. One can similarly define $B^*_{(pq) \times r}$, the matricized version of $\mathcal{B}_{p \times q \times r}$, and $E^*_{(nm) \times l}$, the matricized version of the errors $\mathcal{E}_{n \times m \times l}$. Let $X^*_{(nm) \times (pq)} = Z_{m \times q} \otimes X_{n \times p}$. Then the tensor linear model can be written in the form of Equation (1)

$$Y^* = X^* B^* W^\top + E^* \qquad (22)$$

and further reduced to the form of Equation (3)

$$\mathrm{vec}(Y^*) = (W \otimes X^*)\,\mathrm{vec}(B^*) + \mathrm{vec}(E^*) \qquad (23)$$
$$\mathrm{vec}(\mathcal{Y}) = (W \otimes Z \otimes X)\,\mathrm{vec}(\mathcal{B}) + \mathrm{vec}(\mathcal{E}). \qquad (24)$$

In summation notation, the model can be expressed as

$$y_{ijk} = \sum_{s=1}^{p} \sum_{t=1}^{q} \sum_{u=1}^{r} x_{is}\, z_{jt}\, w_{ku}\, b_{stu} + e_{ijk}. \qquad (25)$$

The extensions for models involving higher-dimensional tensors follow analogously with additional iterative vectorization. These models might be attractive for handling, for example, time series high-throughput data or 3-D imaging data. Further work is needed to explore the performance, scalability, and stability of the fitting algorithms for tensor linear models.
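The matricization identities in Equations (22)–(24) are easy to check numerically. The following Julia sketch, with small arbitrary dimensions chosen purely for illustration, builds the noiseless Y* from a random coefficient tensor and verifies that it agrees with the fully vectorized triple-Kronecker form.

    using LinearAlgebra

    # Numerical check of the matricization identities (22)-(24) on small, arbitrary
    # dimensions (illustrative only).
    n, m, l, p, q, r = 4, 3, 2, 3, 2, 2
    X, Z, W = randn(n, p), randn(m, q), randn(l, r)
    B = randn(p, q, r)                        # 3-way coefficient tensor

    Bstar = reshape(B, p * q, r)              # matricize: vectorize each frontal slice
    Xstar = kron(Z, X)                        # (nm) × (pq) design for rows and columns
    Ystar = Xstar * Bstar * W'                # Equation (22), noiseless case

    # Equation (24): fully vectorized model with a triple Kronecker product.
    yvec = kron(W, kron(Z, X)) * vec(B)
    @assert vec(Ystar) ≈ yvec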

A fourth direction would be to consider faster implementations, especially those with multi-threaded, distributed, or GPU computing options, which have had recent success in machine learning. While we have used some of those ideas in our implementation, there is room for considerable improvement. Reducing the computational burden would be important for tensor linear models.

Our algorithms have been implemented in the Julia (Bezanson et al., 2017) programming language and are available at https://github.com/senresearch/MatrixLMnet.jl (Liang and Sen, 2021b). The Julia and R (R Core Team, 2018) code for reproducing the analysis and generating the figures in this paper can be found at https://github.com/senresearch/mlm_l1_supplement (Liang and Sen, 2021c).

Supplementary Material

Supplement A: Supplemental figures (DOI: 10.1214/21-AOAS1444SUPPA; .pdf). Additional figures from analyzing the E. coli chemical genetic screening data.
Supplement B: Julia implementation for L1-penalized matrix linear models (DOI: 10.1214/21-AOAS1444SUPPB; .zip). MatrixLMnet Julia package for estimating L1-penalized matrix linear models; most up-to-date version available at https://github.com/senresearch/MatrixLMnet.jl.
Supplement C: Code to reproduce paper analysis (DOI: 10.1214/21-AOAS1444SUPPC; .zip). Repository with code to perform the analysis and generate the figures in the paper, also available at https://github.com/senresearch/mlm_l1_supplement.


Acknowledgments

This work was started when JWL was a summer intern at UCSF and continued when she was a scientific programmer at UTHSC. We thank both UCSF and UTHSC for funding and for providing supportive environments for this work. We thank Jon Ågren, Thomas E. Juenger, and Tracey J. Woodruff for granting permission to use their data for analysis. SS was partly supported by NIH grants GM123489, GM070683, DA044223, AI121144, and ES022841.

REFERENCES

1. Ågren J, Oakley CG, McKay JK, Lovell JT and Schemske DW (2013). Genetic mapping of adaptation reveals fitness tradeoffs in Arabidopsis thaliana. Proceedings of the National Academy of Sciences 110 21077–21082.
2. Ågren J, Oakley CG, Lundemo S and Schemske DW (2016). Adaptive divergence in flowering time among natural populations of Arabidopsis thaliana: estimates of selection and QTL mapping. Data from: Dryad Digital Repository. 10.5061/dryad.77971.
3. Ågren J, Oakley CG, Lundemo S and Schemske DW (2017). Adaptive divergence in flowering time among natural populations of Arabidopsis thaliana: Estimates of selection and QTL mapping. Evolution 71 550–564.
4. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL and Mori H (2006). Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular Systems Biology 2.
5. Beck A and Teboulle M (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 183–202.
6. Benjamini Y and Hochberg Y (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics 25 60–83.
7. Bezanson J, Edelman A, Karpinski S and Shah VB (2017). Julia: A fresh approach to numerical computing. SIAM Review 59 65–98.
8. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 1–122.
9. Broman KW, Wu H, Sen Ś and Churchill GA (2003). R/qtl: QTL mapping in experimental crosses. Bioinformatics 19 889–890.
10. Butland G, Babu M, Díaz-Mejía JJ, Bohdana F, Phanse S, Gold B, Yang W, Li J, Gagarinova AG, Pogoutse O et al. (2008). eSGA: E. coli synthetic genetic array analysis. Nature Methods 5 789–795.
11. Dowle M and Srinivasan A (2018). data.table: Extension of 'data.frame'. R package version 1.11.8.
12. Dudoit S, Yang Y, Callow M and Speed T (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12 111–139.
13. Efron B, Hastie T, Johnstone I, Tibshirani R et al. (2004). Least angle regression. The Annals of Statistics 32 407–499.
14. Ekstrøm CT (2018). MESS: Miscellaneous Esoteric Statistical Scripts. R package version 0.5.2.
15. Florea MI and Vorobyov SA (2017). A robust FISTA-like algorithm. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4521–4525. IEEE.
16. Friedman J, Hastie T and Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 1.
17. Fu WJ (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics 7 397–416.
18. Ghadimi E, Teixeira A, Shames I and Johansson M (2012). On the optimal step-size selection for the alternating direction method of multipliers. IFAC Proceedings Volumes 45 139–144.
19. Hobbs EC, Astarita JL and Storz G (2010). Small RNAs and small proteins involved in resistance to cell envelope stress and acid shock in Escherichia coli: analysis of a bar-coded mutant collection. Journal of Bacteriology 192 59–67.
20. Javanmard A and Montanari A (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research 15 2869–2909.
21. Kim D and Fessler JA (2018). Another look at the fast iterative shrinkage/thresholding algorithm (FISTA). SIAM Journal on Optimization 28 223–250.
22. Liang JW, Nichols RJ and Sen Ś (2019). Matrix linear models for high-throughput chemical genetic screens. Genetics 212 1063–1073.
23. Liang J and Schönlieb C-B (2018). Improving FISTA: Faster, smarter and greedier. arXiv preprint arXiv:1811.01430.
24. Liang JW and Sen Ś (2021a). Supplemental figures for "Sparse matrix linear models for structured high-throughput data."
25. Liang JW and Sen Ś (2021b). MatrixLMnet.jl Julia package for "Sparse matrix linear models for structured high-throughput data."
26. Liang JW and Sen Ś (2021c). Code to reproduce analysis for "Sparse matrix linear models for structured high-throughput data."
27. Lovell JT, Mullen JL, Lowry DB, Awole K, Richards JH, Sen S, Verslues PE, Juenger TE and McKay JK (2015). Exploiting differential gene expression and epistasis to discover candidate genes for drought-associated QTLs in Arabidopsis thaliana. The Plant Cell 27 969–983.
28. Lowry DB, Logan TL, Santuari L, Hardtke CS, Richards JH, Derose-Wilson LJ, McKay JK, Sen S and Juenger TE (2013). Expression quantitative trait locus mapping across water availability environments reveals contrasting associations with genomic features in Arabidopsis. The Plant Cell 25 3266–3279.
29. Nesterov Y (1983). A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady 27 372–376.
30. Nichols RJ, Sen S, Choo YJ, Beltrao P, Zietek M, Chaba R, Lee S, Kazmierczak KM, Lee KJ, Wong A et al. (2011). Phenotypic landscape of a bacterial cell. Cell 144 143–156.
31. Ochs P and Pock T (2017). Adaptive FISTA for non-convex optimization. arXiv preprint arXiv:1711.04343.
32. Parikh N and Boyd S (2014). Proximal algorithms. Foundations and Trends in Optimization 1 123–231.
33. Ramsay JO and Silverman BW (2005). Functional Data Analysis, 2nd ed. Springer Series in Statistics. Springer, New York.
34. Reid S, Tibshirani R and Friedman J (2016). A study of error variance estimation in lasso regression. Statistica Sinica.
35. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W and Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43 e47–e47.
36. Schmidt M, Fung G and Rosales R (2009). Optimization methods for l1-regularization. University of British Columbia, Technical Report TR-2009 19.
37. Su W, Boyd S and Candès EJ (2016). A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research 17 1–43.
38. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES and Mesirov JP (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102 15545–15550.
39. Tan KM, London P, Mohan K, Lee S-I, Fazel M and Witten D (2014). Learning graphical models with hubs. Journal of Machine Learning Research 15 3297–3331.
40. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
41. MuToss Coding Team, Blanchard G, Dickhaus T, Hack N, Konietschke F, Rohmeyer K, Rosenblatt J, Scheer M and Werft W (2017). mutoss: Unified Multiple Testing Procedures. R package version 0.1–12.
42. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 267–288.
43. Woodruff TJ, Zota AR and Schwartz JM (2011). Environmental chemicals in pregnant women in the United States: NHANES 2003–2004. Environmental Health Perspectives 119 878.
44. Wu TT and Lange K (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics 224–244.
45. Xiong H, Goulding EH, Carlson EJ, Tecott LH, McCulloch CE and Sen Ś (2011). A flexible estimating equations approach for mapping function-valued traits. Genetics 189 305–316.
46. Zou H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 301–320.
