Author manuscript; available in PMC: 2024 Apr 23.
Published in final edited form as: Stat Sin. 2024 Apr;34(2):527–546. doi: 10.5705/ss.202021.0170

Multi-response Regression for Block-missing Multi-modal Data without Imputation

Haodong Wang 1, Quefeng Li 2, Yufeng Liu 3
PMCID: PMC11035992  NIHMSID: NIHMS1892525  PMID: 38655129

Abstract

Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative.

Keywords: Inverse covariance matrix estimation, LASSO, Missing data, Moment estimation

1. Introduction

With the prevalence of large-scale multi-modal data in various scientific fields, multi-response linear regression is attracting increasing attention in the statistics and machine learning communities (Rothman et al., 2010; Lee and Liu, 2012; Loh and Zheng, 2013). Although linear regressions with a scalar response are well studied, many applications may have a vector as the response, for example, in biological problems (Kim and Xing, 2012). For example, for multi-tissue joint expression quantitative trait loci (eQTL) mapping (Molstad et al., 2020), researchers predict gene expression values in multiple tissues simultaneously by using a weighted sum of eQTL genotypes. A separate prediction for each tissue is inefficient if the same genes in different tissues are correlated because of shared genetic variants or other unmeasured common regulators. In order to use data from all tissues simultaneously, Molstad et al. (2020) propose a joint eQTL model that considers cross-tissue expression dependence.

To apply variable selection methods to multi-response problems, one option is to separately fit each response using a single-response model. For example, the lasso is a well-studied variable selection method for single-response linear regression models (Tibshirani, 1996). However, although this is a straightforward method, it neglects the dependency structure between responses. Incorporating the dependency structure of the response vector enables us to obtain a more efficient multi-response linear regression approach in terms of estimation and prediction.

For multi-response regression problems, Breiman and Friedman (1997) proposed the curds and whey method to improve the prediction performance by using the dependencies between responses. Specifically, they first fit a single-response regression model for each response, and then modify the predicted values from these regressions by shrinking them using the canonical correlations between the response variables and the predictors. Another popular approach is to use dimension reduction. In particular, the reduced-rank regression (Izenman, 1975) minimizes the least squares criterion, subject to a constraint on the rank of the regression parameter matrix. Yuan et al. (2007) extended this method to include the high-dimensional settings, reducing the dimension by encouraging sparsity among the singular values of the parameter matrix. Nevertheless, although these methods achieve better prediction performance than when using a separate univariate regression, they do not address the problem of variable selection.

In order to handle correlated responses together with variable selection, we can estimate the precision matrix of the response vector, given the predictors, and the regression parameter matrix either separately or simultaneously (Lee and Liu, 2012). For a separate estimation, Cai et al. (2013) use a constrained $\ell_1$ minimization that can be treated as a multivariate extension of the Dantzig selector to estimate the regression parameter matrix. After removing the regression effect using the estimated regression parameter matrix, the precision matrix of the error terms can be estimated accordingly. A potential drawback of this indirect method is that it ignores the relationships between the responses, given the predictors, when estimating the regression parameter matrix. Thus, in order to use all information more efficiently, it may be better to estimate the precision matrix and regression parameter matrix simultaneously. Existing joint estimation techniques include those of Rothman et al. (2010), Yin and Li (2011), and Lee and Liu (2012), who formulate the multi-response regression problem in a penalized log-likelihood framework to estimate the parameter and precision matrices simultaneously. Using a similar idea, Chen et al. (2018) propose an estimation procedure that estimates the parameter and precision matrices simultaneously based on the generalized Dantzig selector.

However, most existing multi-response linear regression methods deal only with complete data without missing entries, even though multi-modal data are often incomplete in practice. For instance, studies on Alzheimer’s disease (AD) use data from different sources, including magnetic resonance imaging (MRI) of the brain, positron emission tomography (PET), and cerebrospinal fluid (CSF). In practice, observations of a certain modality can be missing completely, because patients drop out or other practical issues arise, leading to a block-wise missing data structure. Thus, it is important to integrate data from all modalities to improve model prediction and variable selection.

One way of handling incomplete multi-modal data is to simply remove observations with missing entries. However, this procedure may greatly reduce the number of observations and lead to loss of information. Another approach is to perform data imputation. However, existing imputation methods, such as matrix completion (Johnson, 1990) algorithms, may be unstable when the missing values occur in blocks. For such cases, Yu et al. (2020) proposed a direct sparse regression procedure using the covariance from the block-missing multi-modal data (DISCOM). They first use all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable, and then use these estimates and an extended Lasso-type estimator to estimate the coefficients. However, the DISCOM method considers only single-response regressions. Recently, Xue and Qu (2021) proposed the multiple block-wise imputation (MBI) method for a single-response regression when the data are block-wise missing. They developed an estimating equation approach to accommodate block-wise missing patterns in multi-modal data. The method is shown to have high selection accuracy and a low estimation error for a single-response regression with block-wise missing data. However, because their imputation method requires analyzing all combinations of blocks, it can be computationally expensive when the number of modalities is large.

Here, we consider a multi-response regression model for block-wise missing data. The main contribution of our method is to allow missing values in both the responses and the predictors, as well as correlations between responses. In contrast to most traditional methods, the proposed method can also be applied when no subject has complete observations. Our method includes two steps. The first step estimates each element of the covariance and cross-covariance matrices using all available observations without imputation. The second step uses a penalized approach to simultaneously estimate the sparse regression coefficient matrix and the precision matrix of the error terms. We show that this method exhibits estimation and model selection consistency in a high-dimensional setting. The results of our numerical studies and an analysis of Alzheimer’s Disease Neuroimaging Initiative (ADNI) data confirm that the proposed method performs competitively for block-wise missing data.

The remainder of the paper is organized as follows. In Section 2, we introduce the problem background and our model. In Section 3, we establish some theoretical properties of our proposed method, and in Sections 4 and 5, we present our simulation studies and a multi-modal ADNI data example, respectively.

2. Methodology

2.1. Problem setup and notation

Consider the following multi-response linear regression model:

$$Y = XB^* + E, \tag{2.1}$$

where $B^*=(b_{jk})\in\mathbb{R}^{p\times q}$ is an unknown $p\times q$ parameter matrix, $Y=(y_1,\ldots,y_n)^\top$ is the $n\times q$ response matrix, $X=(x_1,\ldots,x_n)^\top$ is the $n\times p$ design matrix, and $E=(\epsilon_1,\ldots,\epsilon_n)^\top$ is an $n\times q$ error matrix. We assume that $\{x_i\}_{i=1}^n$ are independent and identically distributed (i.i.d.) realizations of a random vector $(X_1,\ldots,X_p)^\top$ with zero mean and covariance matrix $\Sigma_{XX}=(\sigma_{ij}^{XX})\in\mathbb{R}^{p\times p}$. We use $\Sigma_{XY}=(\sigma_{ij}^{XY})\in\mathbb{R}^{p\times q}$ to denote the cross-covariance matrix between $x_i$ and $y_i$. We assume that the predictors come from multiple modalities, and there are $p_k$ predictors in the $k$th modality. In addition, $X$ has block-missing values. That is, for one sample, its measurements in one modality can be entirely missing. We assume the elements of $Y$ can also be missing. The errors $\epsilon_i=(\epsilon_{i1},\ldots,\epsilon_{iq})^\top$, for $i=1,\ldots,n$, are i.i.d. realizations from a random vector $\epsilon$ with zero mean and covariance matrix $\Sigma_\epsilon=(\sigma_{ij}^{EE})\in\mathbb{R}^{q\times q}$. We let $C^*=\Sigma_\epsilon^{-1}$. Moreover, we assume $x_i$ and $\epsilon_i$ are uncorrelated. Denote the supports of $B^*$ and $C^*$ as $S_B=\{j:\mathrm{vec}(B^*)_j\neq 0\}$ and $S_C=\{j:\mathrm{vec}(C^*)_j\neq 0\}$, respectively, where "vec" denotes column-wise vectorization. For a set $S$, we denote $|S|$ as its cardinality. Denote $s_B=|S_B|$, $s_C=|S_C|$, and $s=\max(s_B,s_C)$.

We employ the following notation throughout. The symbol $S_+^{d\times d}$ denotes the set of $d\times d$ symmetric positive-definite matrices. For a square matrix $C=(c_{ii'})\in\mathbb{R}^{p\times p}$, we denote its trace as $\mathrm{tr}(C)=\sum_i c_{ii}$ and its diagonal matrix as $\mathrm{diag}(C)$. For a matrix $A=(a_{ij})\in\mathbb{R}^{p\times q}$, we define its entrywise $\ell_1$-norm as $\|A\|_1=\sum_{i,j}|a_{ij}|$, and its entrywise $\ell_\infty$-norm as $\|A\|_\infty=\max_{i,j}|a_{ij}|$. In addition, we define its matrix $\ell_1$-norm as $\|A\|_{L_1}=\max_j\sum_i|a_{ij}|$, the matrix $\ell_\infty$-norm as $\|A\|_{L_\infty}=\max_i\sum_j|a_{ij}|$, the spectral norm as $\|A\|_2=\max_{\|x\|_2=1}\|Ax\|_2$, the Frobenius norm as $\|A\|_F=\sqrt{\sum_{i,j}a_{ij}^2}$, and the number of nonzero elements as $\|A\|_0=\sum_{i,j}I(a_{ij}\neq 0)$. Denote the largest and smallest eigenvalues of $A$ by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively. Denote the sub-matrix of $A$ with row and column indices in $I_1$ and $I_2$ as $A_{I_1I_2}$. For a vector $v\in\mathbb{R}^p$, denote $v_{I_1}$ as the sub-vector of $v$ with indices in $I_1$, $\|v\|_1=\sum_i|v_i|$, $\|v\|_\infty=\max_i|v_i|$, $\|v\|_{\min}=\min_i|v_i|$, and $\|v\|_2=\sqrt{\sum_i v_i^2}$. For a function $h(X)$, we use $\nabla_X h$ to denote a gradient or subgradient of $h$ with respect to $X$, if it exists. Finally, we write $a_n\lesssim b_n$ if $a_n\le cb_n$ for some constant $c$, and write $a_n\asymp b_n$ if $a_n\lesssim b_n$ and $b_n\lesssim a_n$.
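To make the distinction between the entrywise and the matrix (operator) norms concrete, the following NumPy sketch evaluates each of them on a small matrix. The matrix `A` is an arbitrary illustration and is not taken from the paper.

```python
import numpy as np

# Arbitrary 2 x 3 example matrix (illustration only).
A = np.array([[1.0, -2.0, 0.0],
              [3.0, 0.0, -4.0]])

entrywise_l1 = np.abs(A).sum()            # sum_{i,j} |a_ij|
entrywise_linf = np.abs(A).max()          # max_{i,j} |a_ij|
matrix_l1 = np.linalg.norm(A, 1)          # max_j sum_i |a_ij| (largest column sum)
matrix_linf = np.linalg.norm(A, np.inf)   # max_i sum_j |a_ij| (largest row sum)
spectral = np.linalg.norm(A, 2)           # largest singular value
frobenius = np.linalg.norm(A, "fro")      # sqrt(sum_{i,j} a_ij^2)
n_nonzero = np.count_nonzero(A)           # number of nonzero entries
```

Note that the entrywise $\ell_\infty$-norm (4 here) and the matrix $\ell_\infty$-norm (7 here) generally differ, which is why the two are given separate symbols.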

2.2. Proposed multi-DISCOM method

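If we separately apply a least squares estimation with the $\ell_1$-norm penalty to each response, the multi-response linear regression model (2.1) essentially solves

$$\arg\min_B E\left[\|Y-XB\|_F^2\right]+\lambda\|B\|_1=\arg\min_B \mathrm{tr}\left(\tfrac12 B^\top\Sigma_{XX}B-\Sigma_{XY}^\top B\right)+\lambda\|B\|_1, \tag{2.2}$$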

where $\lambda$ is a tuning parameter. We refer to this method as the separate lasso, with the solution denoted as $\hat B^{\mathrm{LASSO}}$. However, the approach fails to consider correlations between the responses, and may lead to poor predictive performance (see, e.g., Breiman and Friedman (1997)). To produce a better estimator, we propose incorporating $\Sigma_\epsilon$ into the estimation of $B^*$ and solving the following problem:

$$\hat B_0=\arg\min_B \mathrm{tr}\left[C^*\hat\Sigma_{YY}+C^*B^\top\hat\Sigma_{XX}B-2C^*B^\top\hat\Sigma_{XY}\right]+\lambda\|B\|_1, \tag{2.3}$$

where $\lambda$ is a tuning parameter, and $\hat\Sigma_{YY}$, $\hat\Sigma_{XX}$, and $\hat\Sigma_{XY}$ are estimators of $\Sigma_{YY}$, $\Sigma_{XX}$, and $\Sigma_{XY}$, respectively.

In practice, $C^*$ is usually unknown. In this case, we first estimate $C^*$ using $\hat C$, and then plug this into (2.3) and solve the following problem:

$$\hat B_0=\arg\min_B \mathrm{tr}\left[\hat C\hat\Sigma_{YY}+\hat CB^\top\hat\Sigma_{XX}B-2\hat CB^\top\hat\Sigma_{XY}\right]+\lambda\|B\|_1. \tag{2.4}$$

We refer to this method as the two-step weighted lasso. However, as shown by the toy example in Section 2.2.1, the separate lasso may outperform this method in some problems.

We propose estimating B* and C* simultaneously by solving the following optimization problem:

$$(\hat B,\hat C)=\arg\min_{C\in S_+^{q\times q},\,B}\mathrm{tr}\left[C\hat\Sigma_{YY}+CB^\top\hat\Sigma_{XX}B-2CB^\top\hat\Sigma_{XY}\right]+\lambda_B\|B\|_1+\lambda_C\|C\|_1-\log\det C, \tag{2.5}$$

where $\lambda_B$ and $\lambda_C$ are tuning parameters. When $\lambda_C$ is sufficiently large, Theorem 4 of Banerjee et al. (2008) implies that all off-diagonal entries in $\hat C$ become zero. Then, our proposed method (2.5) reduces to the separate lasso (2.2). For a univariate response regression problem, our proposed method (2.5) reduces to the DISCOM algorithm (Yu et al., 2020). When there are no missing entries, (2.5) reduces to the sparse conditional Gaussian graphical model of Yin and Li (2011).

In the toy example in Section 2.2.1, our joint estimation model (2.5) outperforms the two-step weighted lasso and the separate lasso.

2.2.1. Toy example

For illustration, we consider a toy example similar to that in Lee and Liu (2012). Assume $p=q=2$, $\Sigma_{XX}=I$, and $\Sigma_\epsilon=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}$, where $\rho$ is an unknown constant. We perform simulation studies for this example with 200 training samples, 300 tuning samples, and 1000 testing samples. Set $B^*=\begin{pmatrix}0&0\\2&3.5\end{pmatrix}$ in Case 1, and $B^*=\begin{pmatrix}0&0\\-2&3.5\end{pmatrix}$ in Case 2. Figure 1 shows the estimation error for the separate lasso, two-step weighted lasso, and joint estimation model (2.5). In Case 1, the two-step weighted lasso has a smaller estimation error than that of the separate lasso when $\rho$ is positive. The reverse is true when $\rho$ is negative. In Case 2, the separate lasso has a smaller estimation error than that of the two-step weighted lasso when $\rho$ is positive. The joint estimation model performs best in all cases.

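Figure 1: Plots of the estimation errors for the separate lasso, two-step weighted lasso, and joint estimation when $\Sigma_\epsilon=\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}$. The left panel is for $B^*=\begin{pmatrix}0&0\\2&3.5\end{pmatrix}$, and the right panel is for $B^*=\begin{pmatrix}0&0\\-2&3.5\end{pmatrix}$.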

The simulation results can be explained by the following calculations. With the penalty parameter $\lambda$, the solution of the separate lasso is given by $\hat B_{ij}^{\mathrm{LASSO}}=\mathrm{sign}(\hat B_{ij}^S)\left[|\hat B_{ij}^S|-\lambda/2\right]_+$, where $[u]_+=u$ if $u\ge 0$, $[u]_+=0$ if $u<0$, and $\hat B^S=(X^\top X)^{-1}X^\top Y$.
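The closed form above is ordinary soft-thresholding of the least squares estimate. A minimal sketch follows, assuming $\Sigma_{XX}=I$ as in the toy example; the function names and the numerical values of `B_ls` are hypothetical and chosen only for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def separate_lasso_identity_design(B_ls, lam):
    """Separate-lasso solution when Sigma_XX = I: every entry of the
    least squares estimate is shrunk toward zero by lam / 2."""
    return soft_threshold(B_ls, lam / 2.0)

# Hypothetical least squares estimate: small noise in row 1, signal in row 2.
B_ls = np.array([[0.1, -0.2],
                 [2.0, 3.5]])
B_lasso = separate_lasso_identity_design(B_ls, lam=0.6)
```

With `lam=0.6` the threshold is 0.3, so both noise entries in the first row are set to zero while the entries in the second row are each reduced by 0.3.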

We can show that the two-step weighted lasso (2.4) is equivalent to

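$$\hat B^{2\mathrm{step}}=\arg\min_B\left[\left(\mathrm{vec}(B)-\mathrm{vec}(\hat B^S)\right)^\top(I_2\otimes\hat C)\left(\mathrm{vec}(B)-\mathrm{vec}(\hat B^S)\right)+\lambda\|\mathrm{vec}(B)\|_1\right]. \tag{2.6}$$

When the estimate $\hat C$ is accurate, $\hat B^{2\mathrm{step}}$ should be very close to the solution of (2.3), where we use $\Sigma_\epsilon^{-1}$ as the weight. After we plug $\hat C=\Sigma_\epsilon^{-1}$ into (2.6), the solution is given by $\hat B_{ij}^{2\mathrm{step}}=\mathrm{sign}(\hat B_{ij}^S)\left[|\hat B_{ij}^S|-\lambda(1+\rho)/2\right]_+$ when $\mathrm{sign}(\hat B_{i1}^S\hat B_{i2}^S)=1$, and $\hat B_{ij}^{2\mathrm{step}}=\mathrm{sign}(\hat B_{ij}^S)\left[|\hat B_{ij}^S|-\lambda(1-\rho)/2\right]_+$ when $\mathrm{sign}(\hat B_{i1}^S\hat B_{i2}^S)=-1$. Compared with $\hat B_{ij}^{\mathrm{LASSO}}=\mathrm{sign}(\hat B_{ij}^S)\left[|\hat B_{ij}^S|-\lambda/2\right]_+$, $\hat B_{ij}^{2\mathrm{step}}$ differs only in the shrinkage amount for each entry. The shrinkage amounts for all entries of the separate lasso are the same, and depend only on the tuning parameter $\lambda$. The shrinkage amounts for the entries of the two-step weighted lasso depend on $\rho$, $\lambda$, and the signs of $\hat B^S$, so different entries may be shrunk by different amounts.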

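We consider two cases of $\rho$ in Case 1, where $B^*=\begin{pmatrix}0&0\\2&3.5\end{pmatrix}$. Because $B_{21}^*$ and $B_{22}^*$ are far from zero, for simplicity, we assume that $\mathrm{sign}(\hat B_{21}^S)=\mathrm{sign}(\hat B_{22}^S)=1$.

  1. Consider $\rho=-0.4$. When $\mathrm{sign}(\hat B_{11}^S\hat B_{12}^S)=-1$, the shrinkage amounts for $\hat B_{21}^{2\mathrm{step}}$ and $\hat B_{22}^{2\mathrm{step}}$ are $\lambda(1+\rho)/2=0.3\lambda$, and those for $\hat B_{11}^{2\mathrm{step}}$ and $\hat B_{12}^{2\mathrm{step}}$ are $\lambda(1-\rho)/2=0.7\lambda$. Thus, the shrinkage amounts for $\hat B_{21}^{2\mathrm{step}}$ and $\hat B_{22}^{2\mathrm{step}}$ are smaller than those for $\hat B_{11}^{2\mathrm{step}}$ and $\hat B_{12}^{2\mathrm{step}}$. Therefore, with the tuning parameter $\lambda$ that shrinks $\hat B_{11}^{2\mathrm{step}}$ and $\hat B_{12}^{2\mathrm{step}}$ to zero, the shrinkage amounts for $\hat B_{21}^{2\mathrm{step}}$ and $\hat B_{22}^{2\mathrm{step}}$ are also smaller than those for $\hat B_{21}^{\mathrm{LASSO}}$ and $\hat B_{22}^{\mathrm{LASSO}}$. Thus, the two-step weighted lasso has a smaller estimation error than that of the separate lasso in this scenario. When $\mathrm{sign}(\hat B_{11}^S\hat B_{12}^S)=1$, the shrinkage amounts for all entries in $\hat B^{2\mathrm{step}}$ are equal.

  2. Consider $\rho=0.4$. When $\mathrm{sign}(\hat B_{11}^S\hat B_{12}^S)=-1$, the shrinkage amounts for $\hat B_{21}^{2\mathrm{step}}$ and $\hat B_{22}^{2\mathrm{step}}$ are $\lambda(1+\rho)/2=0.7\lambda$, and those for $\hat B_{11}^{2\mathrm{step}}$ and $\hat B_{12}^{2\mathrm{step}}$ are $\lambda(1-\rho)/2=0.3\lambda$. Therefore, with the tuning parameter $\lambda$ that shrinks $\hat B_{11}^{2\mathrm{step}}$ and $\hat B_{12}^{2\mathrm{step}}$ to zero, the shrinkage amounts for $\hat B_{21}^{2\mathrm{step}}$ and $\hat B_{22}^{2\mathrm{step}}$ are larger than those for $\hat B_{21}^{\mathrm{LASSO}}$ and $\hat B_{22}^{\mathrm{LASSO}}$. Thus, the separate lasso is preferred to the two-step weighted lasso in this scenario. When $\mathrm{sign}(\hat B_{11}^S\hat B_{12}^S)=1$, all entries in $\hat B^{2\mathrm{step}}$ have the same shrinkage amount.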

In Case 2, where $B^*=\begin{pmatrix}0&0\\-2&3.5\end{pmatrix}$, the two-step weighted lasso is preferred to the separate lasso only when $\rho$ is negative. In conclusion, the performance of the two-step weighted lasso compared with that of the separate lasso depends on the sign of $B^*$ and the covariance matrix $\Sigma_\epsilon$. In contrast, the joint estimation model (2.5) is more flexible. When $\Sigma_\epsilon$ and $B^*$ favor the separate lasso, the joint estimation model (2.5) performs better by choosing a large $\lambda_C$. Otherwise, it can perform better by choosing a relatively small $\lambda_C$, and thus performs competitively in all cases.

2.2.2. Covariance estimation

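Now, we show how to obtain $\hat\Sigma_{XX}$, $\hat\Sigma_{XY}$, and $\hat\Sigma_{YY}$ when the data exhibit block-missing values. The following notation is used throughout. For the $j$th predictor, define $S_j^X=\{i: x_{ij}\text{ is not missing}\}$. For the $j$th response, define $S_j^Y=\{i: y_{ij}\text{ is not missing}\}$. Define $S_{jk}^{XX}=\{i: x_{ij}\text{ and }x_{ik}\text{ are not missing}\}$, $S_{jk}^{XY}=\{i: x_{ij}\text{ and }y_{ik}\text{ are not missing}\}$, $S_{jkl}^{XX/Y}=\{i: x_{ij}, x_{ik}\text{ are not missing, but }y_{il}\text{ is missing}\}$, $S_{jkl}^{XY/X}=\{i: x_{ij}, y_{ik}\text{ are not missing, but }x_{il}\text{ is missing}\}$, and $S_{jk}^{YY}=\{i: y_{ij}\text{ and }y_{ik}\text{ are not missing}\}$. Denote the cardinalities of $S_j^X$, $S_j^Y$, $S_{jk}^{XX}$, $S_{jk}^{XY}$, $S_{jkl}^{XX/Y}$, $S_{jkl}^{XY/X}$, and $S_{jk}^{YY}$ as $n_j^X$, $n_j^Y$, $n_{jk}^{XX}$, $n_{jk}^{XY}$, $n_{jkl}^{XX/Y}$, $n_{jkl}^{XY/X}$, and $n_{jk}^{YY}$, respectively. Denote $n_X=\min_j|S_j^X|$, $n_{XX}=\min_{j,k}|S_{jk}^{XX}|$, $n_{XY}=\min_{j,k}|S_{jk}^{XY}|$, $n_{YY}=\min_{j,k}|S_{jk}^{YY}|$, $n_{XX/Y}=\max_{j,k,l}|S_{jkl}^{XX/Y}|$, and $n_{XY/X}=\max_{j,k,l}|S_{jkl}^{XY/X}|$.

As initial estimators of $\Sigma_{XX}$, $\Sigma_{XY}$, and $\Sigma_{YY}$, we propose using the sample covariance matrices computed from all available data, that is, $\tilde\Sigma_{XX}=(\tilde\sigma_{jt}^{XX})$, $\tilde\Sigma_{XY}=(\tilde\sigma_{jt}^{XY})$, and $\hat\Sigma_{YY}=(\hat\sigma_{jt}^{YY})$, where $\tilde\sigma_{jt}^{XX}=\sum_{i\in S_{jt}^{XX}}x_{ij}x_{it}/n_{jt}^{XX}$, $\tilde\sigma_{jt}^{XY}=\sum_{i\in S_{jt}^{XY}}x_{ij}y_{it}/n_{jt}^{XY}$, and

$$\hat\sigma_{jt}^{YY}=\frac{1}{n_{jt}^{YY}}\sum_{i\in S_{jt}^{YY}}y_{ij}y_{it}. \tag{2.7}$$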

Note that our method requires that $\tilde\Sigma_{XX}$, $\tilde\Sigma_{XY}$, and $\hat\Sigma_{YY}$ be unbiased estimators of their counterparts. When the missingness in $X$ and $Y$ is completely at random, the unbiasedness assumption is satisfied. However, this assumption may also hold under other missing mechanisms. For our theory, we do not specify any particular missing mechanism, and the unbiasedness assumption suffices.
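The available-case estimators described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: missing entries are encoded as NaN, the variables are taken to have zero mean as in the model, and the function name `pairwise_moment` is hypothetical.

```python
import numpy as np

def pairwise_moment(U, V):
    """Available-case estimate of E[u v^T] for zero-mean columns.

    Entry (j, t) averages u_ij * v_it over the rows i in which both
    values are observed (NaN marks a missing value). The matrix of
    pairwise sample sizes n_jt is returned as well.
    """
    obs_U = ~np.isnan(U)
    obs_V = ~np.isnan(V)
    U0 = np.where(obs_U, U, 0.0)   # zero out missing entries so they drop from sums
    V0 = np.where(obs_V, V, 0.0)
    counts = obs_U.astype(float).T @ obs_V.astype(float)
    sums = U0.T @ V0
    with np.errstate(invalid="ignore", divide="ignore"):
        est = sums / counts        # NaN where a pair is never jointly observed
    return est, counts

# Tiny example: 3 samples, 2 predictors, one missing value per column.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 4.0]])
S_xx, n_xx = pairwise_moment(X, X)
```

Calling `pairwise_moment(X, Y)` gives the cross-moment matrix in the same way, and `n_xx.min()` corresponds to the quantity $\min_{j,k}|S_{jk}^{XX}|$ defined above.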

For block-missing data $X$, the estimate $\tilde\Sigma_{XX}$ can be ill-conditioned and have negative eigenvalues. Therefore, it may not be a good estimate of $\Sigma_{XX}$, and cannot be used in (2.5) directly. Next, we introduce an estimator that is both well conditioned and more accurate than the initial estimate $\tilde\Sigma_{XX}$. According to the partition of the predictors into $K$ modalities, $\tilde\Sigma_{XX}$ can be partitioned into $K^2$ blocks, denoted by $\tilde\Sigma_{k_1k_2}$, for $1\le k_1,k_2\le K$, where $\tilde\Sigma_{k_1k_2}$ is a $p_{k_1}\times p_{k_2}$ matrix. We denote

$$\tilde\Sigma_I=\begin{pmatrix}\tilde\Sigma_{11}&&&\\&\tilde\Sigma_{22}&&\\&&\ddots&\\&&&\tilde\Sigma_{KK}\end{pmatrix}\quad\text{and}\quad\tilde\Sigma_C=\begin{pmatrix}0&\tilde\Sigma_{12}&\cdots&\tilde\Sigma_{1K}\\\tilde\Sigma_{21}&0&\cdots&\tilde\Sigma_{2K}\\\vdots&\vdots&\ddots&\vdots\\\tilde\Sigma_{K1}&\tilde\Sigma_{K2}&\cdots&0\end{pmatrix},$$

where $\tilde\Sigma_I$ is called the intra-modality sample covariance matrix and is a $p\times p$ block-diagonal matrix containing the $K$ diagonal blocks of $\tilde\Sigma_{XX}$, and $\tilde\Sigma_C=\tilde\Sigma_{XX}-\tilde\Sigma_I$ is called the cross-modality sample covariance matrix containing all off-diagonal blocks of $\tilde\Sigma_{XX}$. Let $\Sigma_I$ and $\Sigma_C$ be the true intra-modality and cross-modality covariance matrices, respectively. For block-missing multi-modal data, the imbalanced sample sizes mean that the estimate $\tilde\Sigma_I$ can be relatively accurate, while the estimate $\tilde\Sigma_C$ can be inaccurate. In that case, we estimate $\Sigma_{XX}$ using a linear combination of $\tilde\Sigma_I$ and $\tilde\Sigma_C$ with different weights. In addition, to ensure the positive definiteness of our estimation, we adopt the idea of a shrinkage estimation of the covariance matrix (Fisher and Sun, 2011) and add the diagonal matrix $\mathrm{diag}(\tilde\Sigma_I)$ to our estimator,

$$\hat\Sigma_{XX}=\alpha_1\tilde\Sigma_I+(1-\alpha_1)\mathrm{diag}(\tilde\Sigma_I)+\alpha_2\tilde\Sigma_C, \tag{2.8}$$

where $\alpha_1,\alpha_2\in[0,1]$ are two shrinkage weights. We add the diagonal matrix $\mathrm{diag}(\tilde\Sigma_I)$ to ensure the diagonal entries of our estimator are not shrunk.

By Weyl's theorem, the eigenvalues of our estimator are greater than or equal to $\alpha_1\lambda_{\min}(\tilde\Sigma_I)+(1-\alpha_1)\lambda_{\min}(\mathrm{diag}(\tilde\Sigma_I))+\alpha_2\lambda_{\min}(\tilde\Sigma_C)$. Because $\mathrm{diag}(\tilde\Sigma_I)$ is a positive-definite matrix, we can guarantee that the eigenvalues of our estimator are positive by carefully selecting the tuning parameters $\alpha_1$ and $\alpha_2$.
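The effect of the shrinkage weights on positive definiteness can be seen in a small numerical sketch of (2.8). The input matrix, the block sizes, and the weights below are hypothetical values chosen so that the initial estimate is indefinite; the function name `shrink_cov` is likewise an assumption for illustration.

```python
import numpy as np

def shrink_cov(S_tilde, block_sizes, alpha1, alpha2):
    """Estimator (2.8): alpha1 * S_I + (1 - alpha1) * diag(S_I) + alpha2 * S_C."""
    p = S_tilde.shape[0]
    intra = np.zeros((p, p), dtype=bool)
    start = 0
    for size in block_sizes:                    # mark the diagonal (intra-modality) blocks
        intra[start:start + size, start:start + size] = True
        start += size
    S_I = np.where(intra, S_tilde, 0.0)         # intra-modality part
    S_C = S_tilde - S_I                         # cross-modality part
    D = np.diag(np.diag(S_I))
    return alpha1 * S_I + (1.0 - alpha1) * D + alpha2 * S_C, S_I, S_C, D

# Hypothetical indefinite pairwise estimate: two modalities of sizes 2 and 1.
S_tilde = np.array([[1.0, 0.9, 0.9],
                    [0.9, 1.0, -0.9],
                    [0.9, -0.9, 1.0]])
a1, a2 = 0.8, 0.3
S_hat, S_I, S_C, D = shrink_cov(S_tilde, [2, 1], a1, a2)

lam_min = lambda M: np.linalg.eigvalsh(M)[0]    # smallest eigenvalue of a symmetric matrix
weyl_bound = a1 * lam_min(S_I) + (1 - a1) * lam_min(D) + a2 * lam_min(S_C)
```

Here `lam_min(S_tilde)` is negative while `lam_min(S_hat)` is positive, and `lam_min(S_hat)` is no smaller than `weyl_bound`. In this example the bound itself is still negative, which illustrates that it is a sufficient rather than necessary check.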

As discussed previously, our estimator $\hat\Sigma_{XX}$ is a shrinkage estimator. Using a similar idea, we use a shrinkage estimator to estimate $\Sigma_{XY}$. That is, we propose estimating $\Sigma_{XY}$ by

$$\hat\Sigma_{XY}=\alpha_3\tilde\Sigma_{XY}, \tag{2.9}$$

where $\alpha_3\in[0,1]$ is the shrinkage weight. Here, we want to find the optimal linear combination $\hat\Sigma_{XY}^*=\alpha_3^*\tilde\Sigma_{XY}$ that minimizes the expected quadratic loss $E\|\hat\Sigma_{XY}^*-\Sigma_{XY}\|_F^2$.

Here, we consider only a relatively low dimension of $Y$, with not too many incomplete observations, so we use $\hat\Sigma_{YY}$ defined in (2.7) directly. However, when the dimension of $Y$ is very high or there are many incomplete observations of $Y$, a shrinkage estimator of $\Sigma_{YY}$ is recommended instead.

Denote $\gamma^*=(\gamma_1^*,\ldots,\gamma_K^*)=(\mathrm{tr}(\Sigma_{11})/p_1,\ldots,\mathrm{tr}(\Sigma_{KK})/p_K)$, $\delta_I^2=E\|\tilde\Sigma_I-\Sigma_I\|_F^2$, $\delta_C^2=E\|\tilde\Sigma_C-\Sigma_C\|_F^2$, $\delta_{XY}^2=E\|\tilde\Sigma_{XY}-\Sigma_{XY}\|_F^2$, and $\theta=\|\mathrm{diag}(\tilde\Sigma_I)-\Sigma_I\|_F$. The optimal choice for the weights $\alpha_1$, $\alpha_2$, and $\alpha_3$ is stated in Proposition 2.1.

Proposition 2.1.

The solutions to the two optimization problems

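$$(\alpha_1^*,\alpha_2^*)=\arg\min_{\alpha_1,\alpha_2}E\|\hat\Sigma_{XX}-\Sigma_{XX}\|_F^2, \tag{2.10}$$
$$\alpha_3^*=\arg\min_{\alpha_3}E\|\hat\Sigma_{XY}-\Sigma_{XY}\|_F^2 \tag{2.11}$$

are

$$\alpha_1^*=\frac{\theta^2}{\theta^2+\delta_I^2},\quad\alpha_2^*=\frac{\|\Sigma_C\|_F^2}{\|\Sigma_C\|_F^2+\delta_C^2},\quad\text{and}\quad\alpha_3^*=\frac{\|\Sigma_{XY}\|_F^2}{\|\Sigma_{XY}\|_F^2+\delta_{XY}^2}.$$

In addition, for $\hat\Sigma_{XX}^*=\alpha_1^*\tilde\Sigma_I+(1-\alpha_1^*)\mathrm{diag}(\tilde\Sigma_I)+\alpha_2^*\tilde\Sigma_C$ and $\hat\Sigma_{XY}^*=\alpha_3^*\tilde\Sigma_{XY}$, we have

$$E\|\hat\Sigma_{XX}^*-\Sigma_{XX}\|_F^2=\frac{\delta_I^2\theta^2}{\delta_I^2+\theta^2}+\frac{\delta_C^2\|\Sigma_C\|_F^2}{\delta_C^2+\|\Sigma_C\|_F^2}\le\delta_I^2+\delta_C^2=E\|\tilde\Sigma_{XX}-\Sigma_{XX}\|_F^2,$$
$$E\|\hat\Sigma_{XY}^*-\Sigma_{XY}\|_F^2=\frac{\delta_{XY}^2\|\Sigma_{XY}\|_F^2}{\delta_{XY}^2+\|\Sigma_{XY}\|_F^2}\le\delta_{XY}^2=E\|\tilde\Sigma_{XY}-\Sigma_{XY}\|_F^2.$$

Define the $\ell_2$-error of the estimators $\hat\Sigma_{XX}$ and $\hat\Sigma_{XY}$ as $E\|\hat\Sigma_{XX}-\Sigma_{XX}\|_F^2$ and $E\|\hat\Sigma_{XY}-\Sigma_{XY}\|_F^2$, respectively. Proposition 2.1 shows that our estimator is more accurate than the sample covariance matrix.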
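The form of the optimal weight for the cross-covariance can be checked by a short calculation. The sketch below covers only $\alpha_3$ and assumes the unbiasedness condition stated earlier, $E\tilde\Sigma_{XY}=\Sigma_{XY}$, writing $\delta_{XY}^2$ for $E\|\tilde\Sigma_{XY}-\Sigma_{XY}\|_F^2$; it is an illustration, not the proof given in the Supplementary Material.

```latex
\begin{align*}
E\|\tilde\Sigma_{XY}\|_F^2
  &= \|\Sigma_{XY}\|_F^2 + \delta_{XY}^2
  && \text{(unbiasedness)} \\
E\|\alpha_3\tilde\Sigma_{XY}-\Sigma_{XY}\|_F^2
  &= \alpha_3^2\bigl(\|\Sigma_{XY}\|_F^2+\delta_{XY}^2\bigr)
     - 2\alpha_3\|\Sigma_{XY}\|_F^2 + \|\Sigma_{XY}\|_F^2 \\
\frac{\partial}{\partial\alpha_3} = 0
  &\;\Longrightarrow\;
  \alpha_3^* = \frac{\|\Sigma_{XY}\|_F^2}{\|\Sigma_{XY}\|_F^2+\delta_{XY}^2} \\
E\|\alpha_3^*\tilde\Sigma_{XY}-\Sigma_{XY}\|_F^2
  &= \|\Sigma_{XY}\|_F^2 - \frac{\|\Sigma_{XY}\|_F^4}{\|\Sigma_{XY}\|_F^2+\delta_{XY}^2}
   = \frac{\delta_{XY}^2\,\|\Sigma_{XY}\|_F^2}{\delta_{XY}^2+\|\Sigma_{XY}\|_F^2}
   \le \delta_{XY}^2 .
\end{align*}
```

The last inequality holds because $ab/(a+b)\le a$ for nonnegative $a$ and $b$, so the shrunken estimator is never worse than the unshrunken one in this expected loss.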

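Proposition 2.1 is closely related to Proposition 1 of Yu et al. (2020). They calculated the optimal weight and estimation error for their proposed estimator $\hat\Sigma_{XX,\mathrm{DISCOM}}^*$ of $\Sigma_{XX}$, where the estimation error is

$$E\|\hat\Sigma_{XX,\mathrm{DISCOM}}^*-\Sigma_{XX}\|_F^2=\frac{\delta_I^2\tilde\theta^2}{\delta_I^2+\tilde\theta^2}+\frac{\delta_C^2\|\Sigma_C\|_F^2}{\delta_C^2+\|\Sigma_C\|_F^2},$$

and $\tilde\theta^2=\|\mathrm{tr}(\Sigma)I_p/p-\Sigma_I\|_F^2$. Here, our estimator $\hat\Sigma_{XX}$ has a smaller $\ell_2$-error than that of their estimator, and our weighted estimator $\hat\Sigma_{XY}$ is more accurate than the sample covariance matrix.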

2.3. Computational algorithm

In this section, we describe the computational algorithm used to solve the optimization problem (2.5). Because (2.5) is a bi-convex problem, the standard approach to solving it is to use the alternating minimization method. In particular, starting with some given initial point $(\hat B_0,\hat C_0)$, at the $t$th iteration we solve the following problems:

$$\hat B_t=\arg\min_B\mathrm{tr}\left[\hat C_{t-1}\hat\Sigma_{YY}+\hat C_{t-1}B^\top\hat\Sigma_{XX}B-2\hat C_{t-1}B^\top\hat\Sigma_{XY}\right]+\lambda_B\|B\|_1, \tag{2.12}$$
$$\hat C_t=\arg\min_{C\in S_+^{q\times q}}\mathrm{tr}\left[C\hat\Sigma_{YY}+C\hat B_{t-1}^\top\hat\Sigma_{XX}\hat B_{t-1}-2C\hat B_{t-1}^\top\hat\Sigma_{XY}\right]+\lambda_C\|C\|_1-\log\det C. \tag{2.13}$$

In each iteration of our algorithm, given $\hat C_{t-1}$, we first update the estimator $\hat B_t$ by solving (2.12). Because (2.12) is quadratic in $B$, we use the coordinate descent algorithm to solve it. Then, we adopt the graphical lasso method of Friedman et al. (2008) to solve (2.13). We summarize the above procedures in Algorithm 1.
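As an illustration of the $B$-update, the following is a minimal coordinate descent sketch for the objective in (2.12) with the weight matrix held fixed. It is not the authors' implementation: the function name `update_B`, the fixed number of sweeps, and the zero starting value are assumptions, and the $C$-update (a graphical-lasso problem) is not shown.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def update_B(S_xx, S_xy, C, lam, n_sweeps=200):
    """Coordinate descent for
        min_B tr[C B' S_xx B - 2 C B' S_xy] + lam * ||B||_1,
    assuming C is symmetric positive definite and S_xx has a positive diagonal.
    (The term tr[C S_yy] does not depend on B and is dropped.)
    """
    p, q = S_xy.shape
    B = np.zeros((p, q))
    for _ in range(n_sweeps):
        for j in range(p):
            for k in range(q):
                a = S_xx[j, j] * C[k, k]                       # curvature in b_jk
                g = S_xx[j, :] @ B @ C[:, k] - S_xy[j, :] @ C[:, k]  # half-gradient
                r = g - a * B[j, k]                            # part of g not involving b_jk
                # minimize a*b^2 + 2*r*b + lam*|b| over b
                B[j, k] = soft_threshold(-r, lam / 2.0) / a
    return B

# Toy check with p = 1, q = 2, S_xx = I and C = inverse of [[1, rho], [rho, 1]].
rho = 0.4
C = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
B_same_sign = update_B(np.eye(1), np.array([[2.0, 3.5]]), C, lam=1.0)
B_opp_sign = update_B(np.eye(1), np.array([[-2.0, 3.5]]), C, lam=1.0)
```

In this toy setting the two entries are shrunk by $\lambda(1+\rho)/2=0.7$ when they share a sign and by $\lambda(1-\rho)/2=0.3$ when their signs differ, matching the closed-form shrinkage amounts quoted for the two-step weighted lasso in Section 2.2.1. With `C` equal to the identity, the update reduces to soft-thresholding at `lam / 2`.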


3. Theoretical study

We establish the following theoretical results. First, we prove in Theorem 3.1 that the proposed estimators $\hat\Sigma_{XX}$, $\hat\Sigma_{XY}$, and $\hat\Sigma_{YY}$ are consistent with high probability. We then show the convergence rate of our proposed estimators $\hat B$ and $\hat C$ in Theorem 3.4. Finally, the selection consistency of our proposed method is shown in Theorem 3.5. The technical assumptions (A1) to (A5) and all proofs are provided in the Supplementary Material. In the following analysis, we allow $p$ and $q$ to diverge as $n_{XX}$, $n_{XY}$, and $n_{YY}$ increase.

In Theorem 3.1, we prove the large deviation bounds for our proposed estimators $\hat\Sigma_{XX}$, $\hat\Sigma_{XY}$, and $\hat\Sigma_{YY}$.

Theorem 3.1.

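Suppose $1-\alpha_1=O(\sqrt{\log p/n_X})$, $1-\alpha_2=O(\sqrt{\log p/n_{XX}})$, and $1-\alpha_3=O(\sqrt{\log(pq)/n_{XY}})$. If Conditions (A1) and (A2) hold, there exist positive constants $v_1$, $v_2$, and $v_3$ such that

$$P\left(\|\hat\Sigma_{XX}-\Sigma_{XX}\|_\infty\ge v_1\sqrt{\frac{\log p}{n_{XX}}}\right)\le\frac{4}{p}, \tag{3.1}$$
$$P\left(\|\hat\Sigma_{XY}-\Sigma_{XY}\|_\infty\ge v_2\sqrt{\frac{\log(pq)}{n_{XY}}}\right)\le\frac{4}{pq}, \tag{3.2}$$
$$P\left(\|\hat\Sigma_{YY}-\Sigma_{YY}\|_\infty\ge v_3\sqrt{\frac{\log q}{n_{YY}}}\right)\le\frac{4}{q}. \tag{3.3}$$

If we only use samples with complete observations, the sample covariance estimators $\tilde\Sigma_{XX,\mathrm{complete}}$, $\tilde\Sigma_{XY,\mathrm{complete}}$, and $\tilde\Sigma_{YY,\mathrm{complete}}$ have the following convergence rates:

$$\|\tilde\Sigma_{XX,\mathrm{complete}}-\Sigma_{XX}\|_\infty=O_p\left(\sqrt{(\log p)/n_{\mathrm{complete}}}\right),$$
$$\|\tilde\Sigma_{XY,\mathrm{complete}}-\Sigma_{XY}\|_\infty=O_p\left(\sqrt{(\log(pq))/n_{\mathrm{complete}}}\right),$$
$$\|\tilde\Sigma_{YY,\mathrm{complete}}-\Sigma_{YY}\|_\infty=O_p\left(\sqrt{(\log q)/n_{\mathrm{complete}}}\right),$$

where $n_{\mathrm{complete}}$ is the number of samples with complete observations; see Yu et al. (2020). For block-missing data, $n_{\mathrm{complete}}$ can be much smaller than $n_{XX}$, $n_{XY}$, and $n_{YY}$.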

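Next, we give the properties of the initial estimators $\hat B_0$ and $\hat C_0$. The following lemma describes the estimation consistency of the initial estimator $\hat B_0$.

Lemma 3.2.

Suppose Conditions (A1)-(A4) hold, $1-\alpha_1=O(\sqrt{\log p/n_X})$, $1-\alpha_2=O(\sqrt{\log p/n_{XX}})$, and $1-\alpha_3=O(\sqrt{\log(pq)/n_{XY}})$. If we choose $\lambda_B^0=C(\log(pq)/\min(n_{XY},n_{XX}))^{1/2}\|B^*\|_{L_1}$ for some large enough constant $C$, then with probability at least $1-4/p-4/(pq)$, the initial estimator $\hat B_0=\arg\min_B\mathrm{tr}[\hat\Sigma_{YY}+B^\top\hat\Sigma_{XX}B-2B^\top\hat\Sigma_{XY}]+\lambda_B^0\|B\|_1$ satisfies

$$\|\hat B_0-B^*\|_F\lesssim\sqrt{qs_B}\,\|\hat\Sigma_{XY}-\hat\Sigma_{XX}B^*\|_\infty\lesssim\|B^*\|_{L_1}\sqrt{\frac{qs_B\log(pq)}{\min(n_{XX},n_{XY})}}.$$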

Cai et al. (2013) showed that when there is no missing data and the true coefficient $B^*$ is exactly sparse, their estimator $\hat B_{\mathrm{Cai}}$ has the convergence rate $\|\hat B_{\mathrm{Cai}}-B^*\|_F=O_p(N_p\sqrt{qs_B\log(pq)/n})$, where $n$ is the sample size of the data and $N_p$ is the upper bound of $\|\Sigma_{XX}^{-1}\|_{L_\infty}$. When there is no missing data, our initial estimator $\hat B_0$ has the convergence rate $\|\hat B_0-B^*\|_F=O_p(\|B^*\|_{L_1}\sqrt{qs_B\log(pq)/n})$. If we assume $\|B^*\|_{L_1}\asymp\|\Sigma_{XX}^{-1}\|_{L_\infty}$, the convergence rate of $\hat B_0$ is the same as that of $\hat B_{\mathrm{Cai}}$. When the data are block-wise missing, and we only use complete samples to estimate $B^*$, we will have $\|\hat B_0-B^*\|_F=O_p(\|B^*\|_{L_1}\sqrt{qs_B\log(pq)/n_{\mathrm{complete}}})$, which can be much slower than the rate in Lemma 3.2, as $n_{\mathrm{complete}}$ is typically much smaller than $n_{XX}$ and $n_{XY}$ for block-wise missing data.

For the single-response regression with block-wise missing data, the result in Lemma 3.2 is the same as Theorem 2 in Yu et al. (2020), and the estimator $\hat B_0$ performs well when the dimension of $Y$ is small. However, when the dimension of $Y$ becomes large, the estimator $\hat B_0$ may perform poorly.

The following lemma describes the consistency of our initial estimator $\hat C_0$.

Lemma 3.3.

Suppose Conditions (A1)-(A4) hold, $1-\alpha_1=O(\sqrt{\log p/n_X})$, $1-\alpha_2=O(\sqrt{\log p/n_{XX}})$, and $1-\alpha_3=O(\sqrt{\log(pq)/n_{XY}})$. If we choose $\lambda_C^0=C\|C^*\|_2^2\|B^*\|_{L_1}(\|B^*\|_{L_1}+\sqrt{s_Bq})(\log(pq)/\min(n_{XX},n_{XY}))^{1/2}$ for a large enough $C$, it holds with probability at least $1-4/p-4/(pq)-4/q$ that

$$\|\hat C_0-C^*\|_F\lesssim\sqrt{s_C}\,\|C^*\|_2^2\,\|\Sigma_\epsilon-\hat C_0^{-1}\|_\infty\lesssim\|C^*\|_2^2\|B^*\|_{L_1}\left(\|B^*\|_{L_1}+\sqrt{s_Bq}\right)\sqrt{\frac{s_C\log(pq)}{\min(n_{XX},n_{XY})}}.$$

There are two terms in the estimation error bound of $\hat C_0$. The first term, $\|C^*\|_2^2\|B^*\|_{L_1}^2\sqrt{s_C\log(pq)/\min(n_{XX},n_{XY})}$, comes from the error induced by using incomplete observations to estimate $\Sigma_{XX}$ and $\Sigma_{XY}$. The second term, $\|C^*\|_2^2\|B^*\|_{L_1}\sqrt{s_Bs_Cq\log(pq)/\min(n_{XX},n_{XY})}$, comes from the estimation error of $\hat B_0$.

We next derive the convergence rates of $\hat B$ and $\hat C$. The convergence rates are related to $n_{XX/Y}$ and $n_{XY/X}$, which are fractions of $n_{XX}$ and $n_{XY}$, respectively. Hence, we let $n_{XX/Y}\asymp n_{XX}^{\tau_1}$ and $n_{XY/X}\asymp n_{XY}^{\tau_2}$ with $\tau_1,\tau_2\in\{-\infty\}\cup[0,1]$. When the responses are complete while the covariates have missing entries, $n_{XX/Y}=0$ and $\tau_1=-\infty$, $n_{XY/X}>0$ and $\tau_2\in[0,1]$. When the covariates are complete while the responses have missing entries, $n_{XY/X}=0$ and $\tau_2=-\infty$, $n_{XX/Y}>0$ and $\tau_1\in[0,1]$. When both the responses and covariates are complete, $n_{XX/Y}=n_{XY/X}=0$ and $\tau_1=\tau_2=-\infty$. Theorem 3.4 below establishes the consistency of the proposed estimators $\hat B$ and $\hat C$ in (2.5).

Theorem 3.4.

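Suppose Conditions (A1)-(A4) hold, $1-\alpha_1=O(\sqrt{\log p/n_X})$, $1-\alpha_2=O(\sqrt{\log p/n_{XX}})$, and $1-\alpha_3=O(\sqrt{\log(pq)/n_{XY}})$. If we choose $\lambda_B$ and $\lambda_C$ satisfying

$$\lambda_B=C\left(\frac{(\log p)^{1/2}}{\min(n_{XX}^{1-\tau_1/2},n_{XY}^{1-\tau_2/2})}\|B^*C^*\|_{L_1}+\left\{\frac{\log(pq)}{n_{XY}}\right\}^{1/2}\right)$$

and

$$\lambda_C=C\|C^*\|_2^2\left[\|B^*\|_{L_1}^2+\frac{\sqrt{s_B}\,\|B^*C^*\|_{L_1}}{\min(n_{XX}^{1/2-\tau_1/2},n_{XY}^{1/2-\tau_2/2})}\right]\left(\frac{\log(pq)}{\min(n_{XX},n_{XY})}\right)^{1/2}$$

for a large enough $C$, then it holds with probability at least $1-4/p-4/(pq)-4/q$ that

$$\|\hat B-B^*\|_F\lesssim\sqrt{s_B}\left(\|B^*C^*\|_{L_1}\frac{(\log(pq))^{1/2}}{\min(n_{XX}^{1-\tau_1/2},n_{XY}^{1-\tau_2/2})}+\left\{\frac{\log(pq)}{n_{XY}}\right\}^{1/2}\right),$$
$$\|\hat C-C^*\|_F\lesssim\sqrt{s_C}\,\|C^*\|_2^2\left(\sqrt{s_B}\,\|B^*C^*\|_{L_1}\frac{(\log(pq))^{1/2}}{\min(n_{XX}^{1-\tau_1/2},n_{XY}^{1-\tau_2/2})}+\|B^*\|_{L_1}^2\frac{(\log(pq))^{1/2}}{\min(n_{XX}^{1/2},n_{XY}^{1/2})}\right),$$
$$\|\hat B-B^*\|_1\lesssim s_B\left(\|B^*C^*\|_{L_1}\frac{(\log(pq))^{1/2}}{\min(n_{XX}^{1-\tau_1/2},n_{XY}^{1-\tau_2/2})}+\left\{\frac{\log(pq)}{n_{XY}}\right\}^{1/2}\right),$$
$$\|\hat C-C^*\|_1\lesssim s_C\,\|C^*\|_2^2\left(\sqrt{s_B}\,\|B^*C^*\|_{L_1}\frac{(\log(pq))^{1/2}}{\min(n_{XX}^{1-\tau_1/2},n_{XY}^{1-\tau_2/2})}+\|B^*\|_{L_1}^2\frac{(\log(pq))^{1/2}}{\min(n_{XX}^{1/2},n_{XY}^{1/2})}\right).$$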

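Next, we discuss some direct implications of Theorem 3.4. First, we show that our estimators are at least as good as the initial estimators under some conditions. Since $\tau_1,\tau_2\le 1$, as $n_{jkl}^{XX/Y}\le n_{jk}^{XX}$ and $n_{jkl}^{XY/X}\le n_{jk}^{XY}$, the convergence rate of $\|\hat B-B^*\|_F$ is no slower than $O_p(\max(\|B^*C^*\|_{L_1},1)\sqrt{s_B\log(pq)/\min(n_{XX},n_{XY})})$. Similarly, the convergence rate of $\|\hat C-C^*\|_F$ is no slower than $O_p(\sqrt{s_C}\,\|C^*\|_2^2(\|B^*\|_{L_1}^2+\sqrt{s_B}\,\|B^*C^*\|_{L_1})\sqrt{\log(pq)/\min(n_{XX},n_{XY})})$. Here the two slowest convergence rates are achieved when $\tau_1=\tau_2=1$. If we assume $\|B^*C^*\|_{L_1}=O(\|B^*\|_{L_1}\sqrt{q})$, the upper bounds of $\|\hat B-B^*\|_F$ and $\|\hat C-C^*\|_F$ are at least as tight as those of $\|\hat B_0-B^*\|_F$ and $\|\hat C_0-C^*\|_F$.

On the other hand, if $\|B^*C^*\|_{L_1}=o(\|B^*\|_{L_1}\sqrt{q})$, or $\max(\tau_1,\tau_2)<1$ and $\|B^*C^*\|_{L_1}^2=o(\min(n_{XX}^{1/2-\tau_1/2},n_{XY}^{1/2-\tau_2/2}))$, the upper bounds of $\|\hat B-B^*\|_F$ and $\|\hat C-C^*\|_F$ are strictly tighter than those of $\|\hat B_0-B^*\|_F$ and $\|\hat C_0-C^*\|_F$. One example is when $\mathrm{var}(\epsilon_j)>1/\sqrt{q}$ for all $j\le q$ and $\mathrm{cov}(\epsilon_j,\epsilon_k)=0$ for $j\neq k$. Another example is when $n_{XX/Y}=o(n_{XX})$, $n_{XY/X}=o(n_{XY})$, and $\|B^*C^*\|_{L_1}^2=o(\min(n_{XX}^{1/2-\tau_1/2},n_{XY}^{1/2-\tau_2/2}))$.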

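When $Y$ is complete while $X$ has missing entries, $\tau_1=-\infty$ and $\tau_2\in[0,1]$. Then the convergence rate of $\hat B$ in Theorem 3.4 becomes

$$\|\hat B-B^*\|_F\lesssim\sqrt{s_B}\left(\|B^*C^*\|_{L_1}\frac{(\log(pq))^{1/2}}{n_{XY}^{1-\tau_2/2}}+\left\{\frac{\log(pq)}{n_{XY}}\right\}^{1/2}\right).$$

When $X$ is complete while $Y$ has missing entries, $\tau_2=-\infty$ and $\tau_1\in[0,1]$. In this case, we can set $\alpha_1=\alpha_2=1$ and have

$$\|\hat B-B^*\|_F\lesssim\sqrt{s_B}\left(\|B^*C^*\|_{L_1}\frac{(\log(pq))^{1/2}}{n_{XX}^{1-\tau_1/2}}+\left\{\frac{\log(pq)}{n_{XY}}\right\}^{1/2}\right).$$

When both $X$ and $Y$ are complete, $\tau_1=\tau_2=-\infty$. In this case, we can set $\alpha_1=\alpha_2=\alpha_3=1$ and have

$$\|\hat B-B^*\|_F\lesssim\sqrt{s_B\log(pq)/n}, \tag{3.4}$$

where $n$ is the sample size. The error bound in (3.4) is the minimax rate of the $\ell_1$-penalized estimator, as shown in Raskutti et al. (2011).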

In Theorem 3.5 below, we show that our proposed method is model selection consistent.

Theorem 3.5.

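Assume that Conditions (A1)-(A5) hold. Suppose $1-\alpha_1=O(\sqrt{\log p/n_X})$, $1-\alpha_2=O(\sqrt{\log p/n_{XX}})$, and $1-\alpha_3=O(\sqrt{\log(pq)/n_{XY}})$. If $(\log(pq)/n_{XY})^{1/2-\gamma_2}/\lambda_B=o(1)$, $\lambda_B\|((C\otimes\Sigma_{XX})_{S_BS_B})^{-1}\|_{L_\infty}/\min_{j\in S_B}|\beta_j|=o(1)$, $s_B\|((C\otimes\Sigma_{XX})_{S_BS_B})^{-1}\|_{L_\infty}(\log p/n_{XX})^{1/2-\gamma_2}=o(1)$, and $s_B(\log p/n_{XX})^{1/2-\gamma_1-\gamma_2}/\lambda_B=o(1)$, then with probability at least $1-4/p-4/(pq)-4/q$, there exists a solution $\hat B$ to (2.5) such that $\mathrm{sign}(\hat B)=\mathrm{sign}(B^*)$.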

4. Numerical study

In this section, we examine the performance of our proposed method (Multi-DISCOM) in terms of Σϵ, the signal-to-noise ratio, and the distribution of the error ϵ using numerical studies. We compare the efficiency of our proposed method with that of the following methods: (1) the complete lasso, which separately applies the lasso to each response using only samples with complete observations (both X and Y have no missing values); (2) the imputed lasso, which separately applies the lasso to each response using all samples, where missing data are imputed using the soft-thresholded SVD method; (3) the MBI, which separately applies the MBI (Xue and Qu, 2021) to each response using all samples, and the missing data are imputed using multiple block-wise imputation; (4) DISCOM, which separately applies the DISCOM method (Yu et al., 2020) to each response; and (5) the imputed-MRCE, which runs the MRCE (Rothman et al., 2010) using all samples, with missing data imputed using the soft-thresholded SVD method.

In all examples, we set $q=4$ and $x_i=(x_{i1},\ldots,x_{ip})^\top\sim N(0,\Sigma)$, with $\sigma_{jt}=0.6^{|j-t|}$. The $i$th row of the coefficient matrix $B^*$ is $(1, 1.5, 1, 1.5)$ for $i=1$, $p_1+1$, $p_1+p_2+1$, and zero otherwise. The response $Y$ has entries missing completely at random, with the missing proportion 0.01.

For each example, the data are generated from three modalities, with dimensions p1, p2, and p3, respectively. The training data set contains n1 samples with complete observations, n2 samples from the third modality, n3 samples from the first and third modalities, and n4 samples from the first modality. The tuning data set contains 75 samples with complete observations, and the testing data set includes 300 samples with complete observations. For each method, we train our model with different tuning parameters on the training data set. Then we choose the optimal tuning parameter by minimizing the mean squared error (MSE) on the tuning data set.
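The training design described above can be generated as in the following sketch. The default sizes match those used in the examples ($n_1=\cdots=n_4=30$, $p_1=p_2=p_3=30$); the function name `simulate_design`, the seed, and the NaN encoding of missing blocks are assumptions made for illustration.

```python
import numpy as np

def simulate_design(n_groups=(30, 30, 30, 30), p_mod=(30, 30, 30), seed=0):
    """Draw X ~ N(0, Sigma) with sigma_jt = 0.6^|j-t| and blank out modality blocks.

    Row groups, in order: complete; modality 3 only; modalities 1 and 3;
    modality 1 only. NaN marks a missing entry.
    """
    rng = np.random.default_rng(seed)
    p = sum(p_mod)
    idx = np.arange(p)
    Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])
    X_full = rng.multivariate_normal(np.zeros(p), Sigma, size=sum(n_groups))

    col = np.cumsum((0,) + tuple(p_mod))        # modality column boundaries
    row = np.cumsum((0,) + tuple(n_groups))     # sample-group row boundaries
    observed = [(0, 1, 2), (2,), (0, 2), (0,)]  # modalities observed in each group

    X = X_full.copy()
    for g, mods in enumerate(observed):
        for k in range(len(p_mod)):
            if k not in mods:
                X[row[g]:row[g + 1], col[k]:col[k + 1]] = np.nan

    # B*: rows 1, p1 + 1, p1 + p2 + 1 (1-indexed) equal (1, 1.5, 1, 1.5).
    B_star = np.zeros((p, 4))
    B_star[col[:-1], :] = [1.0, 1.5, 1.0, 1.5]
    return X, X_full, B_star

X, X_full, B_star = simulate_design()
n_complete = int(np.sum(~np.isnan(X).any(axis=1)))
```

Responses would be formed from the unmasked design as `X_full @ B_star` plus errors before any entries of $Y$ are removed. With these sizes only 30 of the 120 training rows are complete, although the first and third modalities are each observed in 90 rows.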

For each example, we repeat the simulation 50 times. To evaluate the selection performance of the algorithm, we use the false-positive rate (FPR) and false-negative rate (FNR) as criteria: FPR=FP/(FP+TN) and FNR=FN/(FN+TP), where FN is the number of coefficients wrongly detected as zero, TN is the number of coefficients correctly detected as zero, TP is the number of coefficients correctly detected as nonzero, and FP is the number of coefficients wrongly detected as nonzero. Furthermore, to evaluate the accuracy of our estimators, we use the MSE on the testing data set and the $\ell_2$-distance $\|\hat{B}-B^*\|_F$ as criteria.
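These criteria are straightforward to compute from an estimate and the true coefficient matrix. The following is a minimal sketch; the function names, the tolerance argument, and the toy matrices are assumptions introduced for illustration only.

```python
import numpy as np

def selection_rates(B_hat, B_star, tol=0.0):
    """Return (FPR, FNR) for the support of B_hat relative to B_star."""
    est = np.abs(B_hat) > tol          # coefficients detected as nonzero
    true = B_star != 0                 # coefficients that are truly nonzero
    tp = np.sum(est & true)
    fp = np.sum(est & ~true)
    tn = np.sum(~est & ~true)
    fn = np.sum(~est & true)
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0.0
    return fpr, fnr

def frobenius_error(B_hat, B_star):
    """Frobenius norm of the estimation error."""
    return np.linalg.norm(B_hat - B_star, ord="fro")

# Toy example with 3 features and 2 responses.
B_star = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 3.0]])
B_hat = np.array([[0.9, 0.1], [0.0, 0.0], [0.0, 2.5]])
fpr, fnr = selection_rates(B_hat, B_star)
err = frobenius_error(B_hat, B_star)
```

In the toy example there is one false positive among three true zeros and one false negative among three true nonzeros, so both rates equal 1/3.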

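In Example 1, we examine how the performance of our method depends on $\Sigma_\epsilon$. Let $n_1=n_2=n_3=n_4=30$ and $p_1=p_2=p_3=30$. We set the error $\epsilon_i=(\epsilon_{i1},\ldots,\epsilon_{iq})\sim N(0,\Sigma_\epsilon)$, with $\Sigma_\epsilon=3I_2\otimes\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}$. We choose $\rho$ between −0.4 and 0.4.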
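To make the error structure concrete, the sketch below constructs $\Sigma_\epsilon$ for $q=4$ and draws Gaussian errors. It assumes that the covariance is the Kronecker product of $3I_2$ with the $2\times 2$ correlation block, which is our reading of the formula above; the function name and seed are illustrative.

```python
import numpy as np

def error_covariance(rho, scale=3.0):
    """scale * (I_2 kron [[1, rho], [rho, 1]]), a 4 x 4 block-diagonal matrix."""
    block = np.array([[1.0, rho], [rho, 1.0]])
    return scale * np.kron(np.eye(2), block)

Sigma_eps = error_covariance(rho=-0.4)
rng = np.random.default_rng(1)      # seed chosen only for reproducibility
eps = rng.multivariate_normal(np.zeros(4), Sigma_eps, size=5)
```

Under this reading, responses 1 and 2 are correlated with each other, as are responses 3 and 4, while the two pairs are independent of each other.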

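In Example 2, we examine how the performance of our method depends on the signal-to-noise ratio. Let $n_1=n_2=n_3=n_4=30$ and $p_1=p_2=p_3=30$. We set the error $\epsilon_i=(\epsilon_{i1},\ldots,\epsilon_{iq})\sim N(0,\Sigma_\epsilon)$, with $\Sigma_\epsilon=\alpha I_2\otimes\begin{pmatrix}1&-0.4\\ -0.4&1\end{pmatrix}$, and choose $\alpha$ between one and five.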

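In Example 3, we examine the robustness of our method when the error follows a heavy-tailed distribution. Let $n_1=n_2=n_3=n_4=30$ and $p_1=p_2=p_3=30$. We set the error $\epsilon_i=(\epsilon_{i1},\ldots,\epsilon_{iq})\sim t_{10}(0,\Sigma_\epsilon)$, where $\Sigma_\epsilon=3I_2\otimes\begin{pmatrix}1&-0.4\\ -0.4&1\end{pmatrix}$, and $t_\nu(0,\Sigma_\epsilon)$ refers to Student’s t distribution with location vector 0 and scale matrix $\Sigma_\epsilon$.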
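One standard way to draw such heavy-tailed errors is to divide a Gaussian vector by the square root of an independent scaled chi-squared variable. The sketch below uses this construction; it is an illustration rather than the authors' implementation, and the function name, seed, and sample size are assumptions. The scale matrix is built as a Kronecker product, following our reading of the formula above.

```python
import numpy as np

def sample_multivariate_t(n, scale_matrix, df, rng):
    """Draw n vectors from a multivariate t with location 0 and the given scale matrix."""
    q = scale_matrix.shape[0]
    z = rng.multivariate_normal(np.zeros(q), scale_matrix, size=n)
    w = rng.chisquare(df, size=n)
    # Each Gaussian row is divided by sqrt(w / df) for its own chi-squared draw.
    return z / np.sqrt(w / df)[:, None]

rng = np.random.default_rng(2)      # seed chosen only for reproducibility
block = np.array([[1.0, -0.4], [-0.4, 1.0]])
Sigma_eps = 3.0 * np.kron(np.eye(2), block)
eps = sample_multivariate_t(100_000, Sigma_eps, df=10, rng=rng)
sample_cov = np.cov(eps, rowvar=False)
```

Because the covariance of a multivariate t with $\nu>2$ degrees of freedom is $\nu/(\nu-2)$ times the scale matrix, the errors in this example have covariance $1.25\,\Sigma_\epsilon$, so the noise level is somewhat higher than in the Gaussian case with the same $\Sigma_\epsilon$.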

To demonstrate the results, we focus on Example 1. We report the results of the other examples in the Supplementary Material.

The results in Table 1 indicate that the multi-DISCOM method delivers the best performance in all settings. Specifically, the multi-DISCOM method produces a smaller MSE and smaller estimation errors than the other methods in all settings, especially when there are large correlations between the responses. In addition, the lasso method with imputed data may deliver worse selection performance, possibly because of the randomness in the imputation of the block-missing data. The results in Table 4 in the Supplementary Material indicate that the multi-DISCOM method has a greater advantage when the signal-to-noise ratio is small. When the ratio is smaller, the noise has a stronger effect on Y, and hence considering the precision matrix is more helpful for our estimation.

Table 1:

Performance comparison for the methods in Example 1 with different ρ. The values in parentheses are the standard errors of the measures.

ρ       Method          $\|\hat{B}-B^*\|_F$   MSE           FPR           FNR
−0.4    Lasso           1.51(0.06)            3.70(0.06)    0.09(0.02)    0.00(0.00)
−0.4    Imputed-Lasso   1.73(0.06)            3.57(0.06)    0.11(0.01)    0.00(0.00)
−0.4    MBI             2.10(0.08)            4.26(0.09)    0.12(0.02)    0.11(0.03)
−0.4    DISCOM          1.44(0.04)            3.56(0.06)    0.05(0.00)    0.05(0.01)
−0.4    Imputed-MRCE    1.53(0.05)            3.72(0.08)    0.17(0.03)    0.08(0.02)
−0.4    Multi-DISCOM    1.40(0.04)            3.39(0.08)    0.02(0.01)    0.09(0.02)

0.4     Lasso           1.55(0.06)            3.77(0.06)    0.11(0.02)    0.00(0.00)
0.4     Imputed-Lasso   1.75(0.06)            3.61(0.06)    0.13(0.01)    0.00(0.00)
0.4     MBI             2.14(0.08)            4.30(0.09)    0.13(0.02)    0.11(0.03)
0.4     DISCOM          1.46(0.04)            3.59(0.06)    0.06(0.00)    0.05(0.01)
0.4     Imputed-MRCE    1.54(0.05)            3.73(0.08)    0.19(0.03)    0.09(0.02)
0.4     Multi-DISCOM    1.43(0.04)            3.44(0.08)    0.04(0.01)    0.07(0.02)

5. Application to the ADNI study

We apply the multi-DISCOM method to data from the ADNI study (Mueller et al., 2005), and compare it with several existing approaches. A primary goal of this analysis is to identify biological markers and neuropsychological assessments that measure the progression of mild cognitive impairment (MCI) and early AD. We are interested in predicting the Mini-Mental State Examination (MMSE), ADAS1, and ADAS2 scores, which are common diagnostic scores for AD. The data processing steps are summarized in the Supplementary Material.

After data processing, we have 93 features from MRI, 93 features from PET, and five features from CSF. There are 805 subjects in total, including 199 subjects with complete MRI, PET, and CSF features, 197 subjects with MRI and PET features only, 201 subjects with MRI and CSF features only, and 208 subjects with MRI features only.

In our analysis, we divide the data into training, tuning, and testing sets. The training set consists of all subjects with incomplete observations and 40 randomly selected subjects with complete features. The tuning set consists of another 40 randomly selected subjects with complete observations. The testing set contains the remaining 119 subjects with complete observations. We train our model using different tuning parameters on the training set, choosing the tuning parameter that minimizes the MSE on the tuning set. The testing set is used to evaluate the methods. We apply all methods considered in the simulation study to predict the MMSE, ADAS1, and ADAS2 scores. For each method, the analysis is repeated 30 times using different partitions of the data. In addition to the overall MSE, defined as the sum of the MSEs of the three responses, we report the MSE for each response ($\mathrm{MSE}_{\mathrm{MMSE}}$, $\mathrm{MSE}_{\mathrm{ADAS1}}$, and $\mathrm{MSE}_{\mathrm{ADAS2}}$). We also compare the number of features selected by each method.
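The partition described above can be sketched as follows. This is an illustrative reconstruction from the reported subject counts, not the authors' code; the function name and the seed are assumptions, and the indices refer to positions among the 199 complete subjects.

```python
import numpy as np

def split_complete_subjects(n_complete, n_train, n_tune, rng):
    """Randomly split complete subjects into training, tuning, and testing indices."""
    perm = rng.permutation(n_complete)
    train = perm[:n_train]
    tune = perm[n_train:n_train + n_tune]
    test = perm[n_train + n_tune:]
    return train, tune, test

rng = np.random.default_rng(0)      # a different seed gives a different partition
train_complete, tune, test = split_complete_subjects(199, 40, 40, rng)

# All subjects with incomplete features go to the training set.
n_incomplete = 197 + 201 + 208
n_training_total = n_incomplete + len(train_complete)
```

With these counts, each replication trains on 646 subjects, of which only 40 have all three modalities, and evaluates on 119 complete subjects.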

As shown in Table 2, the multi-DISCOM method outperforms all other methods. The DISCOM method has a similar overall MSE to that of the multi-DISCOM method, but worse $\mathrm{MSE}_{\mathrm{ADAS1}}$ and $\mathrm{MSE}_{\mathrm{ADAS2}}$. One possible reason for this is that ADAS1 and ADAS2 are highly correlated, so accounting for the precision matrix of the responses can help. Because there are 208 subjects with MRI features only, the MBI method may not impute the missing features of those 208 subjects accurately. As a result, the MBI method may not perform well in this case.

Table 2:

Performance comparison for the ADNI data.

Method          Overall MSE    MSE (MMSE)    MSE (ADAS1)    MSE (ADAS2)    # of Selected Features
Lasso           93.37(3.82)    5.31(0.19)    29.84(1.35)    58.23(2.40)    54.20
Imputed-Lasso   80.40(1.62)    4.54(0.12)    25.80(0.51)    50.07(1.15)    165.00
MBI             91.84(3.02)    5.13(0.14)    28.43(1.17)    58.29(2.16)    59.87
DISCOM          67.47(1.33)    4.26(0.11)    21.76(0.51)    41.45(0.86)    72.87
Imputed-MRCE    67.41(2.02)    4.29(0.10)    21.61(0.65)    41.50(1.33)    218.50
Multi-DISCOM    65.82(1.21)    4.22(0.12)    21.18(0.46)    40.41(0.80)    89.67

With regard to model selection, both the DISCOM method and the multi-DISCOM method deliver relatively simple models. Figure 2 shows the selection frequency of the 191 features when predicting ADAS1. The selection frequency of each feature is defined as the number of times it is selected in the 30 replications. As shown in Figure 2, for our method, some features are often selected, and many other features are rarely selected. Thus, our method delivers robust model selection. However, the features selected by the imputed lasso method vary across replications. One possible reason for the unstable performance in terms of model selection is the randomness in the imputation of the block-missing data. Hippocampus formation left (69th region) and amygdala right (83rd feature) are frequently selected by our method, and have been shown to be highly correlated with AD and MCI (Jack et al., 1999; Misra et al., 2009; Zhang and Shen, 2012); however, the DISCOM method rarely selects these features.

Figure 2:

Selection frequency of 191 features for prediction of ADAS1 score.
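The selection frequency plotted in Figure 2 is a simple count over replications. The sketch below shows one way to compute it from a list of estimated coefficient matrices; the function name, the column index used for the response, and the toy estimates are hypothetical.

```python
import numpy as np

def selection_frequency(coef_list, response_idx, tol=0.0):
    """Count how often each feature has a nonzero coefficient for one response."""
    counts = np.zeros(coef_list[0].shape[0], dtype=int)
    for B_hat in coef_list:
        counts += np.abs(B_hat[:, response_idx]) > tol
    return counts

# Toy example: 3 replications, 3 features, 2 responses.
reps = [
    np.array([[0.5, 0.0], [0.0, 0.0], [1.2, 0.3]]),
    np.array([[0.4, 0.0], [0.0, 0.1], [0.0, 0.2]]),
    np.array([[0.6, 0.0], [0.0, 0.0], [0.9, 0.0]]),
]
freq = selection_frequency(reps, response_idx=0)
```

A stable method concentrates these counts near zero or near the number of replications, whereas an unstable one spreads them across intermediate values.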

6. Conclusion

We have proposed a joint estimation method in a penalized framework with an entry-wise $\ell_1$-regularization using block-missing multi-modal predictors. We first estimate the covariance matrix of the predictors using a linear combination of the estimates of the variance of each predictor, the estimates of the intra-modality covariance matrix, and the cross-modality covariance matrix. The proposed estimator of the covariance matrix can be positive semidefinite and more accurate than the sample covariance matrix. In the second step, we use the estimated covariance matrix and a penalized estimator to deliver a sparse estimate of the coefficients in the optimal linear prediction. We also establish the theory for the estimation and feature selection consistency. Extensive simulation studies indicate that our method exhibits promising performance in terms of estimation, prediction, and model selection for block-missing multi-modal data. Finally, we apply the multi-DISCOM method to the ADNI data set, showing that our model has good prediction power and meaningful interpretation.

Acknowledgments

The authors thank the editor, associate editor, and reviewers for their helpful comments and suggestions. This research was supported in part by NSF grant DMS-2100729 and NIH grants R01GM126550 and R01AG073259. Haodong Wang gratefully acknowledges the partial support from the National Science Foundation, award NSF-DMS-1929298 to the Statistical and Applied Mathematical Sciences Institute.

Footnotes

Supplementary Material

Supplementary Material includes additional results of our numerical studies, technical conditions and proofs.

Contributor Information

Haodong Wang, Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill.

Quefeng Li, Department of Biostatistics, The University of North Carolina at Chapel Hill.

Yufeng Liu, Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill.

References

  1. Banerjee O, El Ghaoui L, and d’Aspremont A (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research 9, 485–516.
  2. Breiman L and Friedman JH (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1), 3–54.
  3. Cai TT, Li H, Liu W, and Xie J (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100(1), 139–156.
  4. Chen J, Xu P, Wang L, Ma J, and Gu Q (2018). Covariate adjusted precision matrix estimation via nonconvex optimization. In International Conference on Machine Learning, pp. 922–931.
  5. Fisher TJ and Sun X (2011). Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Computational Statistics and Data Analysis 55(5), 1909–1918.
  6. Friedman J, Hastie T, and Tibshirani R (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
  7. Izenman AJ (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5(2), 248–264.
  8. Jack CR, Petersen RC, Xu YC, O’Brien PC, Smith GE, Ivnik RJ, Boeve BF, Waring SC, Tangalos EG, and Kokmen E (1999). Prediction of AD with MRI-based hippocampal volume in mild cognitive impairment. Neurology 52(7), 1397–1397.
  9. Johnson CR (1990). Matrix completion problems: a survey. In Matrix Theory and Applications, Volume 40, pp. 171–198. Amer. Math. Soc.
  10. Kim S and Xing EP (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. The Annals of Applied Statistics 6(3), 1095–1117.
  11. Lee W and Liu Y (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. Journal of Multivariate Analysis 111, 241–255.
  12. Loh P-L and Wainwright MJ (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. The Journal of Machine Learning Research 16(1), 559–616.
  13. Loh W-Y and Zheng W (2013). Regression trees for longitudinal and multiresponse data. The Annals of Applied Statistics, 495–522.
  14. Misra C, Fan Y, and Davatzikos C (2009). Baseline and longitudinal patterns of brain atrophy in MCI patients, and their use in prediction of short-term conversion to AD. NeuroImage 44(4), 1415–1422.
  15. Molstad AJ, Sun W, and Hsu L (2020). A covariance-enhanced approach to multi-tissue joint eQTL mapping with application to transcriptome-wide association studies. arXiv preprint arXiv:2001.08363.
  16. Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W, Trojanowski JQ, Toga AW, and Beckett L (2005). The Alzheimer’s Disease Neuroimaging Initiative. Neuroimaging Clinics 15(4), 869–877.
  17. Raskutti G, Wainwright MJ, and Yu B (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory 57(10), 6976–6994.
  18. Rothman AJ, Levina E, and Zhu J (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics 19(4), 947–962.
  19. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
  20. Xue F and Qu A (2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association 116(536), 1914–1927.
  21. Yin J and Li H (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. The Annals of Applied Statistics 5(4), 2630.
  22. Yu G, Li Q, Shen D, and Liu Y (2020). Optimal sparse linear prediction for block-missing multi-modality data without imputation. Journal of the American Statistical Association 115(531), 1406–1419.
  23. Yuan M, Ekici A, Lu Z, and Monteiro R (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), 329–346.
  24. Zhang D and Shen D (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59(2), 895–907.
