Multi-response Regression for Block-missing Multi-modal Data without Imputation

Haodong Wang; Quefeng Li; Yufeng Liu

doi:10.5705/ss.202021.0170

Author manuscript; available in PMC: 2024 Apr 23.

Published in final edited form as: Stat Sin. 2024 Apr;34(2):527–546. doi: 10.5705/ss.202021.0170

PMCID: PMC11035992 NIHMSID: NIHMS1892525 PMID: 38655129

Abstract

Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative.

Keywords: Inverse covariance matrix estimation, LASSO, Missing data, Moment estimation

1. Introduction

With the prevalence of large-scale multi-modal data in various scientific fields, multi-response linear regression is attracting increasing attention in the statistics and machine learning communities (Rothman et al., 2010; Lee and Liu, 2012; Loh and Zheng, 2013). Although linear regressions with a scalar response are well studied, many applications may have a vector as the response, for example, in biological problems (Kim and Xing, 2012). For example, for multi-tissue joint expression quantitative trait loci (eQTL) mapping (Molstad et al., 2020), researchers predict gene expression values in multiple tissues simultaneously by using a weighted sum of eQTL genotypes. A separate prediction for each tissue is inefficient if the same genes in different tissues are correlated because of shared genetic variants or other unmeasured common regulators. In order to use data from all tissues simultaneously, Molstad et al. (2020) propose a joint eQTL model that considers cross-tissue expression dependence.

To apply variable selection methods to multi-response problems, one option is to separately fit each response using a single-response model. For example, the lasso is a well-studied variable selection method for single-response linear regression models (Tibshirani, 1996). However, although this is a straightforward method, it neglects the dependency structure between responses. Incorporating the dependency structure of the response vector enables us to obtain a more efficient multi-response linear regression approach in terms of estimation and prediction.

For multi-response regression problems, Breiman and Friedman (1997) proposed the curds and whey method to improve the prediction performance by using the dependencies between responses. Specifically, they first fit a single-response regression model for each response, and then modify the predicted values from these regressions by shrinking them using the canonical correlations between the response variables and the predictors. Another popular approach is to use dimension reduction. In particular, the reduced-rank regression (Izenman, 1975) minimizes the least squares criterion, subject to a constraint on the rank of the regression parameter matrix. Yuan et al. (2007) extended this method to include the high-dimensional settings, reducing the dimension by encouraging sparsity among the singular values of the parameter matrix. Nevertheless, although these methods achieve better prediction performance than when using a separate univariate regression, they do not address the problem of variable selection.

In order to handle correlated responses together with variable selection, we can estimate the precision matrix of the response vector, given the predictors, and the regression parameter matrix either separately or simultaneously (Lee and Liu, 2012). For a separate estimation, Cai et al. (2013) use a constrained $ℓ_{1}$ minimization that can be treated as a multivariate extension of the Dantzig selector to estimate the regression parameter matrix. After removing the regression effect using the estimated regression parameter matrix, the precision matrix of the error terms can be estimated accordingly. A potential drawback of this indirect method is that it ignores the relationships between the responses, given the predictors, when estimating the regression parameter matrix. Thus, in order to use all information more efficiently, it may be better to estimate the precision matrix and regression parameter matrix simultaneously. Existing joint estimation techniques include those of Rothman et al. (2010), Yin and Li (2011), and Lee and Liu (2012) who formulate the multi-response regression problem in a penalized log-likelihood framework to estimate the parameter and precision matrices simultaneously. Using a similar idea, Chen et al. (2018) propose an estimation procedure that estimates the parameter and precision matrices simultaneously based on the generalized Dantzig selector.

However, most existing multi-response linear regression methods deal only with complete data without missing entries, even though multi-modal data are often incomplete in practice. For instance, studies on Alzheimer’s disease (AD) use data from different sources, including magnetic resonance imaging (MRI) of the brain, positron emission tomography (PET), and cerebrospinal fluid (CSF). In practice, observations of a certain modality can be missing completely, because patients drop out or other practical issues arise, leading to a block-wise missing data structure. Thus, it is important to integrate data from all modalities to improve model prediction and variable selection.

One way of handling incomplete multi-modal data is to simply remove observations with missing entries. However, this procedure may greatly reduce the number of observations and lead to loss of information. Another approach is to perform data imputation. However, existing imputation methods, such as matrix completion (Johnson, 1990) algorithms, may be unstable when the missing values occur in blocks. For such cases, Yu et al. (2020) proposed a direct sparse regression procedure using the covariance from the block-missing multi-modal data (DISCOM). They first use all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable, and then use these estimates and an extended Lasso-type estimator to estimate the coefficients. However, the DISCOM method considers only single-response regressions. Recently, Xue and Qu (2021) proposed the multiple block-wise imputation (MBI) method for a single-response regression when the data are block-wise missing. They developed an estimating equation approach to accommodate block-wise missing patterns in multi-modal data. The method is shown to have high selection accuracy and a low estimation error for a single-response regression with block-wise missing data. However, because their imputation method requires analyzing all combinations of blocks, it can be computationally expensive when the number of modalities is large.

Here, we consider a multi-response regression model for block-wise missing data. The main contribution of our method is to allow missing values in both the responses and the predictors, as well as correlations between responses. In contrast to most traditional methods, the proposed method can also be applied when no subject has complete observations. Our method includes two steps. The first step estimates each element of the covariance and cross-covariance matrices using all available observations without imputation. The second step uses a penalized approach to simultaneously estimate the sparse regression coefficient matrix and the precision matrix of the error terms. We show that this method exhibits estimation and model selection consistency in a high-dimensional setting. The results of our numerical studies and an analysis of Alzheimer’s Disease Neuroimaging Initiative (ADNI) data confirm that the proposed method performs competitively for block-wise missing data.

The remainder of the paper is organized as follows. In Section 2, we introduce the problem background and our model. In Section 3, we establish some theoretical properties of our proposed method, and in Sections 4 and 5, we present our simulation studies and a multi-modal ADNI data example, respectively.

2. Methodology

2.1. Problem setup and notation

Consider the following multi-response linear regression model:

Y = X B^{*} + E,

(2.1)

where $B^{*} = (b_{jk}) \in R^{p \times q}$ is an unknown $p \times q$ parameter matrix, $Y = (y_{1}, \dots, y_{n})^{⊤}$ is the $n \times q$ response matrix, $X = (x_{1}, \dots, x_{n})^{⊤}$ is the $n \times p$ design matrix, and $E = (ϵ_{1}, \dots, ϵ_{n})^{⊤}$ is an $n \times q$ error matrix. We assume that $\{x_{i}\}_{i=1}^{n}$ are independent and identically distributed (i.i.d.) realizations of a random vector $(X_{1}, \dots, X_{p})^{⊤}$ with zero mean and covariance matrix $Σ_{XX} = (σ_{ij}^{XX}) \in R^{p \times p}$. We use $Σ_{XY} = (σ_{ij}^{XY}) \in R^{p \times q}$ to denote the cross-covariance matrix between $x_{i}$ and $y_{i}$. We assume that the predictors come from multiple modalities, and there are $p_{k}$ predictors in the $k$th modality. In addition, $X$ has block-missing values. That is, for one sample, its measurements in one modality can be entirely missing. We assume the elements of $Y$ can also be missing. The errors $ϵ_{i} = (ϵ_{i1}, \dots, ϵ_{iq})^{⊤}$, for $i = 1, \dots, n$, are i.i.d. realizations from a random vector $ϵ$ with zero mean and covariance matrix $Σ_{ϵ} = (σ_{ij}^{EE}) \in R^{q \times q}$. We let $C^{*} = Σ_{ϵ}^{-1}$. Moreover, we assume $x_{i}$ and $ϵ_{i}$ are uncorrelated. Denote the supports of $B^{*}$ and $C^{*}$ as $S_{B} = \{j : \mathrm{vec}(B^{*})_{j} \neq 0\}$ and $S_{C} = \{j : \mathrm{vec}(C^{*})_{j} \neq 0\}$, respectively, where “vec” denotes the column-wise vectorization operator. For a set $S$, we denote $|S|$ as its cardinality. Denote $s_{B} = |S_{B}|$, $s_{C} = |S_{C}|$, and $s = \max(s_{B}, s_{C})$.

We employ the following notation throughout. The symbol $S_{+}^{d \times d}$ denotes the set of $d \times d$ symmetric positive-definite matrices. For a square matrix $C = (c_{ii'}) \in R^{p \times p}$, we denote its trace as $\mathrm{tr}(C) = \sum_{i} c_{ii}$ and its diagonal matrix as $\mathrm{diag}(C)$. For a matrix $A = (a_{ij}) \in R^{p \times q}$, we define its entrywise $ℓ_{1}$-norm as $∥A∥_{1} = \sum_{i,j} |a_{ij}|$, and its entrywise $ℓ_{\infty}$-norm as $∥A∥_{\infty} = \max_{i,j} |a_{ij}|$. In addition, we define its matrix $ℓ_{1}$-norm as $∥A∥_{L_{1}} = \max_{j} \sum_{i} |a_{ij}|$, the matrix $ℓ_{\infty}$-norm as $∥A∥_{L_{\infty}} = \max_{i} \sum_{j} |a_{ij}|$, the spectral norm as $∥A∥_{2} = \max_{∥x∥_{2} = 1} ∥Ax∥_{2}$, the Frobenius norm as $∥A∥_{F} = \sqrt{\sum_{i,j} a_{ij}^{2}}$, and the number of nonzero elements as $∥A∥_{0} = \sum_{i,j} I(a_{ij} \neq 0)$. Denote the largest and smallest eigenvalues of $A$ by $λ_{\max}(A)$ and $λ_{\min}(A)$, respectively. Denote the sub-matrix of $A$ with row and column indices in $I_{1}$ and $I_{2}$ as $A_{I_{1} I_{2}}$. For a vector $v \in R^{p}$, denote $v_{I_{1}}$ as the sub-vector of $v$ with indices in $I_{1}$, $∥v∥_{1} = \sum_{i} |v_{i}|$, $∥v∥_{\infty} = \max_{i} |v_{i}|$, $∥v∥_{\min} = \min_{i} |v_{i}|$, and $∥v∥_{2} = \sqrt{\sum_{i} v_{i}^{2}}$. For a function $h(X)$, we use $\nabla_{X} h$ to denote a gradient or subgradient of $h$ with respect to $X$, if it exists. Finally, we write $a_{n} ≲ b_{n}$ if $a_{n} \leq c b_{n}$ for some constant $c$, and write $a_{n} ≍ b_{n}$ if $a_{n} ≲ b_{n}$ and $b_{n} ≲ a_{n}$.
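As a quick illustration of how the entrywise and matrix norms above differ, the following NumPy sketch evaluates each of them on a small matrix. The matrix `A` and the variable names are illustrative choices, not part of the paper.

```python
import numpy as np

# Small illustrative matrix (hypothetical values).
A = np.array([[1.0, -2.0, 0.0],
              [3.0, 0.0, -4.0]])

entrywise_l1 = np.abs(A).sum()            # ||A||_1: sum of |a_ij|
entrywise_linf = np.abs(A).max()          # ||A||_inf: largest |a_ij|
matrix_l1 = np.linalg.norm(A, 1)          # ||A||_{L_1}: largest absolute column sum
matrix_linf = np.linalg.norm(A, np.inf)   # ||A||_{L_inf}: largest absolute row sum
spectral = np.linalg.norm(A, 2)           # ||A||_2: largest singular value
frobenius = np.linalg.norm(A, "fro")      # ||A||_F
nonzeros = np.count_nonzero(A)            # ||A||_0
```

Note that the entrywise $ℓ_{\infty}$-norm (4 here) and the matrix $ℓ_{\infty}$-norm (7 here) generally disagree, which matters when reading the rates stated later.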

2.2. Proposed multi-DISCOM method

If we separately apply a least squares estimation with the $ℓ_{1}$ -norm penalty to each response, the multi-response linear regression model (2.1) essentially solves

\arg\min_{B} E [∥Y - XB∥_{F}^{2}] + λ ∥B∥_{1} = \arg\min_{B} \mathrm{tr} (\frac{1}{2} B^{⊤} Σ_{XX} B - Σ_{XY}^{⊤} B) + λ ∥B∥_{1},

(2.2)

where $λ$ is a tuning parameter. We refer to this method as the separate lasso, with the solution denoted as ${\hat{B}}^{L A S S O}$ . However, the approach fails to consider correlations between the responses, and may lead to poor predictive performance (see, e.g., Breiman and Friedman (1997)). To produce a better estimator, we propose incorporating $Σ_{ϵ}$ into the estimation of $B^{*}$ and solving the following problem:

\hat{B}^{0} = \arg\min_{B} \mathrm{tr} [C^{*} \hat{Σ}_{YY} + C^{*} B^{⊤} \hat{Σ}_{XX} B - 2 C^{*} B^{⊤} \hat{Σ}_{XY}] + λ ∥B∥_{1},

(2.3)

where $λ$ is a tuning parameter, and ${\hat{Σ}}_{Y Y}$ , ${\hat{Σ}}_{X X}$ , and ${\hat{Σ}}_{X Y}$ are estimators of $Σ_{Y Y}$ , $Σ_{X X}$ , and $Σ_{X Y}$ , respectively.

In practice, $C^{*}$ is usually unknown. In this case, we first estimate $C^{*}$ by $\hat{C}$, and then plug this estimate into (2.3) and solve the following problem:

\hat{B}^{0} = \arg\min_{B} \mathrm{tr} [\hat{C} \hat{Σ}_{YY} + \hat{C} B^{⊤} \hat{Σ}_{XX} B - 2 \hat{C} B^{⊤} \hat{Σ}_{XY}] + λ ∥B∥_{1} .

(2.4)

We refer to this method as the two-step weighted lasso. However, as shown by the toy example in Section 2.2.1, the separate lasso may outperform this method in some problems.

We propose estimating $B^{*}$ and $C^{*}$ simultaneously by solving the following optimization problem:

\begin{array}{l} (\hat{B}, \hat{C}) = arg min_{C \in S_{+}^{q \times q}, B} tr [C {\hat{Σ}}_{YY} + C B^{⊤} {\hat{Σ}}_{XX} B - 2 C B^{⊤} {\hat{Σ}}_{XY}] \\ + λ_{B} ‖ B ‖_{1} + λ_{C} ‖ C ‖_{1} - \log \det C, \end{array}

(2.5)

where $λ_{B}$ and $λ_{C}$ are tuning parameters. When $λ_{C}$ is sufficiently large, Theorem 4 of Banerjee et al. (2008) implies that all off-diagonal entries in $\hat{C}$ become zero. Then, our proposed method (2.5) reduces to the separate lasso (2.2). For a univariate response regression problem, our proposed method (2.5) reduces to the DISCOM algorithm (Yu et al., 2020). When there are no missing entries, (2.5) reduces to the sparse conditional Gaussian graphical model of Yin and Li (2011).
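To make the criterion in (2.5) concrete, the following is a minimal sketch of how it could be evaluated for a candidate pair $(B, C)$. The function name `joint_objective` and its argument names are hypothetical, and the penalty on $C$ is applied to all entries, as (2.5) is written.

```python
import numpy as np

def joint_objective(B, C, S_xx, S_xy, S_yy, lam_B, lam_C):
    """Evaluate the penalized criterion in (2.5) at (B, C).

    S_xx, S_xy, S_yy stand for the covariance estimates; returns +inf
    when C is not positive definite, since (2.5) restricts C to that set.
    """
    if np.linalg.eigvalsh(C).min() <= 0:
        return np.inf
    _, logdet = np.linalg.slogdet(C)
    fit = np.trace(C @ S_yy + C @ B.T @ S_xx @ B - 2.0 * C @ B.T @ S_xy)
    penalty = lam_B * np.abs(B).sum() + lam_C * np.abs(C).sum()
    return fit + penalty - logdet
```

Such a function is mainly useful for monitoring whether an iterative solver decreases the criterion from one iteration to the next.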

In the toy example in Section 2.2.1, our joint estimation model (2.5) outperforms the two-step weighted lasso and the separate lasso.

2.2.1. Toy example

For illustration, we consider a toy example similar to that in Lee and Liu (2012). Assume $p = q = 2$, $X^{⊤} X = I$, and $Σ_{ϵ} = \begin{pmatrix} 1 & ρ \\ ρ & 1 \end{pmatrix}$, where $ρ$ is an unknown constant. We perform simulation studies for this example with 200 training samples, 300 tuning samples, and 1000 testing samples. Set $B^{*} = \begin{pmatrix} 0 & 0 \\ 2 & 3.5 \end{pmatrix}$ in Case 1, and $B^{*} = \begin{pmatrix} 0 & 0 \\ -2 & 3.5 \end{pmatrix}$ in Case 2. Figure 1 shows the estimation error for the separate lasso, two-step weighted lasso, and joint estimation model (2.5). In Case 1, the two-step weighted lasso has a smaller estimation error than that of the separate lasso when $ρ$ is positive. The reverse is true when $ρ$ is negative. In Case 2, the separate lasso has a smaller estimation error than that of the two-step weighted lasso when $ρ$ is positive. The joint estimation model performs best in all cases.

Figure 1: Plots of the estimation errors for the separate lasso, two-step weighted lasso, and joint estimation when $Σ_{ϵ} = \begin{pmatrix} 1 & ρ \\ ρ & 1 \end{pmatrix}$. The left panel is for $B^{*} = \begin{pmatrix} 0 & 0 \\ 2 & 3.5 \end{pmatrix}$, and the right panel is for $B^{*} = \begin{pmatrix} 0 & 0 \\ -2 & 3.5 \end{pmatrix}$.

The simulation results can be explained by the following calculations. With the penalty parameter $λ$, the solution of the separate lasso is given by $\hat{B}_{ij}^{\text{LASSO}} = \mathrm{sign}(\hat{B}_{ij}^{S}) [|\hat{B}_{ij}^{S}| - λ/2]_{+}$, where $[u]_{+} = u$ if $u \geq 0$, $[u]_{+} = 0$ if $u < 0$, and $\hat{B}^{S} = (X^{⊤} X)^{-1} X^{⊤} Y$.

We can show that the two-step weighted lasso (2.4) is equivalent to

\hat{B}^{\text{2step}} = \arg\min_{B} [(\mathrm{vec}(B) - \mathrm{vec}(\hat{B}^{S}))^{⊤} (I_{2} \otimes \hat{C}) (\mathrm{vec}(B) - \mathrm{vec}(\hat{B}^{S})) + λ ∥\mathrm{vec}(B)∥_{1}] .

(2.6)

When the estimate $\hat{C}$ is accurate, $\hat{B}^{\text{2step}}$ should be very close to the solution of (2.3), where we use $Σ_{ϵ}^{-1}$ as the weight. After we plug $\hat{C} = Σ_{ϵ}^{-1}$ into (2.6), the solution is given by $\hat{B}_{ij}^{\text{2step}} = \mathrm{sign}(\hat{B}_{ij}^{S}) [|\hat{B}_{ij}^{S}| - λ(1 + ρ)/2]_{+}$ when $\mathrm{sign}(\hat{B}_{i1}^{S} \hat{B}_{i2}^{S}) = 1$, and $\hat{B}_{ij}^{\text{2step}} = \mathrm{sign}(\hat{B}_{ij}^{S}) [|\hat{B}_{ij}^{S}| - λ(1 - ρ)/2]_{+}$ when $\mathrm{sign}(\hat{B}_{i1}^{S} \hat{B}_{i2}^{S}) = -1$. Compared with $\hat{B}_{ij}^{\text{LASSO}} = \mathrm{sign}(\hat{B}_{ij}^{S}) [|\hat{B}_{ij}^{S}| - λ/2]_{+}$, $\hat{B}_{ij}^{\text{2step}}$ differs only in the shrinkage amount for each entry. The shrinkage amounts for all entries of the separate lasso are the same, and depend only on the tuning parameter $λ$. The shrinkage amounts for the entries of the two-step weighted lasso depend on $ρ$, $λ$, and the signs of the entries of $\hat{B}^{S}$, so different entries may have different shrinkage amounts.
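The closed-form shrinkage amounts above can be tabulated with a few lines of code. This is a minimal sketch that simply evaluates $λ(1 + ρ)/2$ and $λ(1 - ρ)/2$; the helper names `two_step_shrinkage` and `soft_threshold` are hypothetical.

```python
import numpy as np

def two_step_shrinkage(rho, lam, same_sign):
    """Shrinkage amount of the two-step weighted lasso in the toy example.

    same_sign is True when the two entries in a row of B-hat^S share a sign.
    """
    return lam * (1.0 + rho) / 2.0 if same_sign else lam * (1.0 - rho) / 2.0

def soft_threshold(b, amount):
    """sign(b) * [|b| - amount]_+ applied entrywise."""
    return np.sign(b) * np.maximum(np.abs(b) - amount, 0.0)

lam = 1.0
for rho in (-0.4, 0.4):
    same = two_step_shrinkage(rho, lam, same_sign=True)
    opposite = two_step_shrinkage(rho, lam, same_sign=False)
    print(f"rho={rho:+.1f}: same-sign rows {same:.1f}*lam, "
          f"opposite-sign rows {opposite:.1f}*lam, separate lasso {lam / 2:.1f}*lam")
```

Rows whose least squares entries share a sign are shrunk less than the separate lasso amount $λ/2$ when $ρ < 0$ and more when $ρ > 0$, and the opposite holds for rows with entries of opposite signs.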

We consider two cases of $ρ$ in Case 1, where $B^{*} = \begin{pmatrix} 0 & 0 \\ 2 & 3.5 \end{pmatrix}$. Because $B_{21}^{*}$ and $B_{22}^{*}$ are far from zero, for simplicity, we assume that $\mathrm{sign}(\hat{B}_{21}^{S}) = \mathrm{sign}(\hat{B}_{22}^{S}) = 1$.

Consider $ρ = -0.4$. When $\mathrm{sign}(\hat{B}_{11}^{S} \hat{B}_{12}^{S}) = -1$, the shrinkage amounts for $\hat{B}_{21}^{\text{2step}}$ and $\hat{B}_{22}^{\text{2step}}$ are $λ(1 + ρ)/2 = 0.3λ$, and those for $\hat{B}_{11}^{\text{2step}}$ and $\hat{B}_{12}^{\text{2step}}$ are $λ(1 - ρ)/2 = 0.7λ$. Thus, the shrinkage amounts for $\hat{B}_{21}^{\text{2step}}$ and $\hat{B}_{22}^{\text{2step}}$ are smaller than those for $\hat{B}_{11}^{\text{2step}}$ and $\hat{B}_{12}^{\text{2step}}$. Therefore, with the tuning parameter $λ$ that shrinks $\hat{B}_{11}^{\text{2step}}$ and $\hat{B}_{12}^{\text{2step}}$ to zero, the shrinkage amounts for $\hat{B}_{21}^{\text{2step}}$ and $\hat{B}_{22}^{\text{2step}}$ are also smaller than those for $\hat{B}_{21}^{\text{LASSO}}$ and $\hat{B}_{22}^{\text{LASSO}}$. Thus, the two-step weighted lasso has a smaller estimation error than that of the separate lasso in this scenario. When $\mathrm{sign}(\hat{B}_{11}^{S} \hat{B}_{12}^{S}) = 1$, the shrinkage amounts for all entries in $\hat{B}^{\text{2step}}$ are equal.
Consider $ρ = 0.4$. When $\mathrm{sign}(\hat{B}_{11}^{S} \hat{B}_{12}^{S}) = -1$, the shrinkage amounts for $\hat{B}_{21}^{\text{2step}}$ and $\hat{B}_{22}^{\text{2step}}$ are $λ(1 + ρ)/2 = 0.7λ$, and those for $\hat{B}_{11}^{\text{2step}}$ and $\hat{B}_{12}^{\text{2step}}$ are $λ(1 - ρ)/2 = 0.3λ$. Therefore, with the tuning parameter $λ$ that shrinks $\hat{B}_{11}^{\text{2step}}$ and $\hat{B}_{12}^{\text{2step}}$ to zero, the shrinkage amounts for $\hat{B}_{21}^{\text{2step}}$ and $\hat{B}_{22}^{\text{2step}}$ are larger than those for $\hat{B}_{21}^{\text{LASSO}}$ and $\hat{B}_{22}^{\text{LASSO}}$. Thus, the separate lasso is preferred to the two-step weighted lasso in this scenario. When $\mathrm{sign}(\hat{B}_{11}^{S} \hat{B}_{12}^{S}) = 1$, all entries in $\hat{B}^{\text{2step}}$ have the same shrinkage amount.

In Case 2, where $B^{*} = (\begin{matrix} 0 & 0 \\ - 2 & 3.5 \end{matrix})$ , the two-step weighted lasso is preferred to the separate lasso only when $ρ$ is negative. In conclusion, the performance of the two-step weighted lasso compared with that of the separate lasso depends on the sign of $B^{*}$ and the covariance matrix $Σ_{ϵ}$ . In contrast, the joint estimation model (2.5) is more flexible. When $Σ_{ϵ}$ and $B^{*}$ favor the separate lasso, the joint estimation model (2.5) performs better by choosing a large $λ_{C}$ . Otherwise, it can perform better by choosing a relatively small $λ_{C}$ , and thus performs competitively in all cases.

2.2.2. Covariance estimation

Now, we show how to obtain $\hat{Σ}_{XX}$, $\hat{Σ}_{XY}$, and $\hat{Σ}_{YY}$ when the data exhibit block-missing values. The following notation is used throughout. For the $j$th predictor, define $S_{j}^{X} = \{i : x_{ij} \text{ is not missing}\}$. For the $j$th response, define $S_{j}^{Y} = \{i : y_{ij} \text{ is not missing}\}$. Define $S_{jk}^{XX} = \{i : x_{ij} \text{ and } x_{ik} \text{ are not missing}\}$, $S_{jk}^{XY} = \{i : x_{ij} \text{ and } y_{ik} \text{ are not missing}\}$, $S_{jkl}^{XX/Y} = \{i : x_{ij}, x_{ik} \text{ are not missing, but } y_{il} \text{ is missing}\}$, $S_{jkl}^{XY/X} = \{i : x_{ij}, y_{ik} \text{ are not missing, but } x_{il} \text{ is missing}\}$, and $S_{jk}^{YY} = \{i : y_{ij} \text{ and } y_{ik} \text{ are not missing}\}$. Denote the cardinality of $S_{j}^{X}$, $S_{j}^{Y}$, $S_{jk}^{XX}$, $S_{jk}^{XY}$, $S_{jkl}^{XX/Y}$, $S_{jkl}^{XY/X}$, and $S_{jk}^{YY}$ as $n_{j}^{X}$, $n_{j}^{Y}$, $n_{jk}^{XX}$, $n_{jk}^{XY}$, $n_{jkl}^{XX/Y}$, $n_{jkl}^{XY/X}$, and $n_{jk}^{YY}$, respectively. Denote $n_{X} = \min_{j} |S_{j}^{X}|$, $n_{XX} = \min_{j,k} |S_{jk}^{XX}|$, $n_{XY} = \min_{j,k} |S_{jk}^{XY}|$, $n_{YY} = \min_{j,k} |S_{jk}^{YY}|$, $n_{XX/Y} = \max_{j,k,l} |S_{jkl}^{XX/Y}|$, and $n_{XY/X} = \max_{j,k,l} |S_{jkl}^{XY/X}|$.

As initial estimators of $Σ_{XX}$, $Σ_{XY}$, and $Σ_{YY}$, we propose using the sample covariance matrices computed from all available data, that is, $\tilde{Σ}_{XX} = (\tilde{σ}_{jt}^{XX})$, $\tilde{Σ}_{XY} = (\tilde{σ}_{jt}^{XY})$, and $\hat{Σ}_{YY} = (\hat{σ}_{jt}^{YY})$, where $\tilde{σ}_{jt}^{XX} = \sum_{i \in S_{jt}^{XX}} x_{ij} x_{it} / n_{jt}^{XX}$, $\tilde{σ}_{jt}^{XY} = \sum_{i \in S_{jt}^{XY}} x_{ij} y_{it} / n_{jt}^{XY}$, and

{\hat{σ}}_{j t}^{Y Y} = \frac{1}{n_{j t}^{Y Y}} \sum_{i \in S_{j t}^{Y Y}} y_{i j} y_{i t} .

(2.7)

Note that our method requires that ${\tilde{Σ}}_{X X}$ , ${\tilde{Σ}}_{X Y}$ , and ${\hat{Σ}}_{Y Y}$ be unbiased estimators of their counterparts. When the missingness in $X$ and $Y$ is completely at random, the unbiasedness assumption is satisfied. However, this assumption may also hold under other missing mechanisms. For our theory, we do not specify any particular missing mechanism, and the unbiasedness assumption suffices.
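The available-case estimators above can be computed in a vectorized way. The following sketch assumes, for illustration only, that missing entries are coded as NaN and that the columns are already centered (the model assumes zero mean); the function name `pairwise_second_moment` is hypothetical.

```python
import numpy as np

def pairwise_second_moment(U, V):
    """Available-case estimate of E[u_j v_t] for every pair (j, t).

    U is n x a and V is n x b, with NaN marking a missing entry.
    Entry (j, t) averages u_ij * v_it over the rows i in which both
    are observed. Returns the estimate and the pairwise counts n_jt.
    """
    obs_u = ~np.isnan(U)
    obs_v = ~np.isnan(V)
    counts = obs_u.astype(float).T @ obs_v.astype(float)
    sums = np.nan_to_num(U).T @ np.nan_to_num(V)  # missing entries contribute 0
    with np.errstate(invalid="ignore", divide="ignore"):
        estimate = sums / counts                  # NaN where no row is shared
    return estimate, counts
```

Calling `pairwise_second_moment(X, X)`, `pairwise_second_moment(X, Y)`, and `pairwise_second_moment(Y, Y)` would give the three initial estimates, and the minimum of each returned count matrix corresponds to $n_{XX}$, $n_{XY}$, and $n_{YY}$.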

For block-missing data $X$ , the estimate ${\tilde{Σ}}_{X X}$ can be ill-conditioned and have negative eigenvalues. Therefore, it may not be a good estimate of $Σ_{X X}$ , and cannot be used in (2.5) directly. Next, we introduce an estimator that is both well conditioned and more accurate than the initial estimate ${\tilde{Σ}}_{X X}$ . According to the partition of the predictors into $K$ modalities, ${\tilde{Σ}}_{X X}$ can be partitioned into $K^{2}$ blocks, denoted by ${\tilde{Σ}}^{k_{1} k_{2}}$ , for $1 \leq k_{1}$ , $k_{2} \leq K$ , where ${\tilde{Σ}}^{k_{1} k_{2}}$ is a $p_{k_{1}} \times p_{k_{2}}$ matrix. We denote

\tilde{Σ}_{I} = \begin{pmatrix} \tilde{Σ}^{11} & 0 & \dots & 0 \\ 0 & \tilde{Σ}^{22} & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & \tilde{Σ}^{KK} \end{pmatrix} \quad \text{and} \quad \tilde{Σ}_{C} = \begin{pmatrix} 0 & \tilde{Σ}^{12} & \dots & \tilde{Σ}^{1K} \\ \tilde{Σ}^{21} & 0 & \dots & \tilde{Σ}^{2K} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \tilde{Σ}^{K1} & \tilde{Σ}^{K2} & \dots & 0 \end{pmatrix},

where $\tilde{Σ}_{I}$ is called the intra-modality sample covariance matrix and is a $p \times p$ block-diagonal matrix containing the $K$ diagonal blocks of $\tilde{Σ}_{XX}$, and $\tilde{Σ}_{C} = \tilde{Σ}_{XX} - \tilde{Σ}_{I}$ is called the cross-modality sample covariance matrix containing all off-diagonal blocks of $\tilde{Σ}_{XX}$. Let $Σ_{I}$ and $Σ_{C}$ be the true intra-modality and cross-modality covariance matrices, respectively. For block-missing multi-modal data, the imbalanced sample sizes mean that the estimate $\tilde{Σ}_{I}$ can be relatively accurate, while the estimate $\tilde{Σ}_{C}$ can be inaccurate. We therefore estimate $Σ_{XX}$ using a linear combination of $\tilde{Σ}_{I}$ and $\tilde{Σ}_{C}$ with different weights. In addition, to ensure the positive definiteness of our estimator, we adopt the idea of shrinkage estimation of the covariance matrix (Fisher and Sun, 2011) and add the diagonal matrix $\mathrm{diag}(\tilde{Σ}_{I})$ to our estimator,

\hat{Σ}_{XX} = α_{1} \tilde{Σ}_{I} + (1 - α_{1}) \mathrm{diag}(\tilde{Σ}_{I}) + α_{2} \tilde{Σ}_{C},

(2.8)

where $α_{1}, α_{2} \in [0,1]$ are two shrinkage weights. We add the diagonal matrix $\mathrm{diag}(\tilde{Σ}_{I})$ to ensure the diagonal entries of our estimator are not shrunk.

By Weyl’s theorem, the eigenvalues of our estimator are greater than or equal to $α_{1} λ_{\min}(\tilde{Σ}_{I}) + (1 - α_{1}) λ_{\min}(\mathrm{diag}(\tilde{Σ}_{I})) + α_{2} λ_{\min}(\tilde{Σ}_{C})$. Because $\mathrm{diag}(\tilde{Σ}_{I})$ is a positive-definite matrix, we can guarantee that the eigenvalues of our estimator are positive by carefully selecting the tuning parameters $α_{1}$ and $α_{2}$.
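The effect of the weights in (2.8) on positive definiteness can be checked numerically. The sketch below builds the estimator from a given initial estimate and a list of modality sizes; the function name `shrink_sigma_xx`, the argument `block_sizes`, and the example matrix are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def shrink_sigma_xx(S_tilde, block_sizes, alpha1, alpha2):
    """Form alpha1 * S_I + (1 - alpha1) * diag(S_I) + alpha2 * S_C as in (2.8).

    block_sizes lists the number of predictors p_k in each modality,
    in the order in which they appear along the rows of S_tilde.
    """
    p = S_tilde.shape[0]
    intra = np.zeros((p, p), dtype=bool)
    start = 0
    for size in block_sizes:
        intra[start:start + size, start:start + size] = True
        start += size
    S_I = np.where(intra, S_tilde, 0.0)   # intra-modality (block-diagonal) part
    S_C = S_tilde - S_I                   # cross-modality (off-diagonal blocks)
    D = np.diag(np.diag(S_tilde))         # equals diag(S_I)
    return alpha1 * S_I + (1.0 - alpha1) * D + alpha2 * S_C

# A symmetric but indefinite "initial estimate" with two modalities (sizes 2 and 1).
S_tilde = np.array([[1.0, 0.9, 0.9],
                    [0.9, 1.0, -0.9],
                    [0.9, -0.9, 1.0]])
S_hat = shrink_sigma_xx(S_tilde, [2, 1], alpha1=0.5, alpha2=0.5)
min_eig_before = np.linalg.eigvalsh(S_tilde).min()
min_eig_after = np.linalg.eigvalsh(S_hat).min()
```

In this example the smallest eigenvalue moves from negative to positive while the diagonal entries are left unchanged, which is the behavior the diagonal term is meant to secure.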

As discussed previously, our estimator $\hat{Σ}_{XX}$ is a shrinkage estimator. Using a similar idea, we use a shrinkage estimator to estimate $Σ_{XY}$. That is, we propose estimating $Σ_{XY}$ by

{\hat{Σ}}_{X Y} = α_{3} {\tilde{Σ}}_{X Y},

(2.9)

where $α_{3} \in [0,1]$ is the shrinkage weight. Here, we want to find the optimal rescaling $\hat{Σ}_{XY}^{*} = α_{3}^{*} \tilde{Σ}_{XY}$ that minimizes the expected quadratic loss $E ∥\hat{Σ}_{XY}^{*} - Σ_{XY}∥_{F}^{2}$.

Here, we consider only a relatively low dimension of $Y$, with not too many incomplete observations, so we use $\hat{Σ}_{YY}$ defined in (2.7) directly. However, when the dimension of $Y$ is very high or there are many incomplete observations of $Y$, a shrinkage estimator of $Σ_{YY}$ is recommended instead.

Denote $γ^{*} = (γ_{1}^{*}, \dots, γ_{K}^{*})^{⊤} = (\mathrm{tr}(Σ^{11})/p_{1}, \dots, \mathrm{tr}(Σ^{KK})/p_{K})^{⊤}$, $δ_{I} = \sqrt{E ∥\tilde{Σ}_{I} - Σ_{I}∥_{F}^{2}}$, $δ_{C} = \sqrt{E ∥\tilde{Σ}_{C} - Σ_{C}∥_{F}^{2}}$, $δ_{XY} = \sqrt{E ∥\tilde{Σ}_{XY} - Σ_{XY}∥_{F}^{2}}$, and $θ = ∥\mathrm{diag}(\tilde{Σ}_{I}) - Σ_{I}∥_{F}$. The optimal choice of the weights $α_{1}$, $α_{2}$, and $α_{3}$ is stated in Proposition 2.1.

Proposition 2.1.

The solutions to the two optimization problems

(α_{1}^{*}, α_{2}^{*}) = \arg\min_{α_{1}, α_{2}} E ∥\hat{Σ}_{XX} - Σ_{XX}∥_{F}^{2}

(2.10)

α_{3}^{*} = \arg\min_{α_{3}} E ∥\hat{Σ}_{XY} - Σ_{XY}∥_{F}^{2}

(2.11)

are

α_{1}^{*} = \frac{θ^{2}}{θ^{2} + δ_{I}^{2}}, \quad α_{2}^{*} = \frac{∥Σ_{C}∥_{F}^{2}}{∥Σ_{C}∥_{F}^{2} + δ_{C}^{2}}, \quad \text{and} \quad α_{3}^{*} = \frac{∥Σ_{XY}∥_{F}^{2}}{∥Σ_{XY}∥_{F}^{2} + δ_{XY}^{2}} .

In addition, for $\hat{Σ}_{XX}^{*} = α_{1}^{*} \tilde{Σ}_{I} + (1 - α_{1}^{*}) \mathrm{diag}(\tilde{Σ}_{I}) + α_{2}^{*} \tilde{Σ}_{C}$ and $\hat{Σ}_{XY}^{*} = α_{3}^{*} \tilde{Σ}_{XY}$, we have

\begin{array}{c} E ∥\hat{Σ}_{XX}^{*} - Σ_{XX}∥_{F}^{2} = \frac{δ_{I}^{2} θ^{2}}{δ_{I}^{2} + θ^{2}} + \frac{δ_{C}^{2} ∥Σ_{C}∥_{F}^{2}}{δ_{C}^{2} + ∥Σ_{C}∥_{F}^{2}} \leq δ_{I}^{2} + δ_{C}^{2} = E ∥\tilde{Σ}_{XX} - Σ_{XX}∥_{F}^{2}, \\ E ∥\hat{Σ}_{XY}^{*} - Σ_{XY}∥_{F}^{2} = \frac{δ_{XY}^{2} ∥Σ_{XY}∥_{F}^{2}}{δ_{XY}^{2} + ∥Σ_{XY}∥_{F}^{2}} \leq δ_{XY}^{2} = E ∥\tilde{Σ}_{XY} - Σ_{XY}∥_{F}^{2} . \end{array}

Define the $ℓ_{2}$ -error of the estimators ${\hat{Σ}}_{X X}$ and ${\hat{Σ}}_{X Y}$ as $E {∥ {\hat{Σ}}_{X X} - Σ_{X X} ∥}_{F}^{2}$ and $E {∥ {\hat{Σ}}_{X Y} - Σ_{X Y} ∥}_{F}^{2}$ , respectively. Proposition 2.1 shows that our estimator is more accurate than the sample covariance matrix.
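The form of $α_{3}^{*}$ can be recovered in a few lines. The following is a sketch of the calculation under the unbiasedness assumption stated earlier, $E \tilde{Σ}_{XY} = Σ_{XY}$; it is offered as a reading aid and not as a reproduction of the proof in the Supplementary Material.

```latex
% Under unbiasedness, E\|\tilde\Sigma_{XY}\|_F^2 = \|\Sigma_{XY}\|_F^2 + \delta_{XY}^2, so
\begin{aligned}
E\|\alpha_3 \tilde\Sigma_{XY} - \Sigma_{XY}\|_F^2
  &= \alpha_3^2 \, E\|\tilde\Sigma_{XY}\|_F^2
     - 2\alpha_3 \, \mathrm{tr}\!\left(E[\tilde\Sigma_{XY}]^{\top} \Sigma_{XY}\right)
     + \|\Sigma_{XY}\|_F^2 \\
  &= \alpha_3^2 \left(\|\Sigma_{XY}\|_F^2 + \delta_{XY}^2\right)
     - 2\alpha_3 \|\Sigma_{XY}\|_F^2 + \|\Sigma_{XY}\|_F^2 .
\end{aligned}
% Setting the derivative with respect to \alpha_3 to zero gives
\alpha_3^{*} = \frac{\|\Sigma_{XY}\|_F^2}{\|\Sigma_{XY}\|_F^2 + \delta_{XY}^2},
\qquad
E\|\alpha_3^{*} \tilde\Sigma_{XY} - \Sigma_{XY}\|_F^2
  = \frac{\delta_{XY}^2 \, \|\Sigma_{XY}\|_F^2}{\delta_{XY}^2 + \|\Sigma_{XY}\|_F^2}
  \le \delta_{XY}^2 .
```

The weight is close to one when the sampling error $δ_{XY}^{2}$ is small relative to the signal $∥Σ_{XY}∥_{F}^{2}$, which matches the conditions on $1 - α_{3}$ used in the theory.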

Proposition 2.1 is closely related to Proposition 1 of Yu et al. (2020). They calculated the optimal weight and estimation error for their proposed estimator $\hat{Σ}_{XX, \text{DISCOM}}^{*}$ of $Σ_{XX}$, where the estimation error is

E ∥\hat{Σ}_{XX, \text{DISCOM}}^{*} - Σ_{XX}∥_{F}^{2} = \frac{δ_{I}^{2} \tilde{θ}^{2}}{δ_{I}^{2} + \tilde{θ}^{2}} + \frac{δ_{C}^{2} ∥Σ_{C}∥_{F}^{2}}{δ_{C}^{2} + ∥Σ_{C}∥_{F}^{2}},

and $\tilde{θ}^{2} = ∥\mathrm{tr}(Σ) I_{p} / p - Σ_{I}∥_{F}^{2}$. Here, our estimator $\hat{Σ}_{XX}$ has a smaller $ℓ_{2}$-error than that of their estimator, and our weighted estimator $\hat{Σ}_{XY}$ is more accurate than the sample covariance matrix.

2.3. Computational algorithm

In this section, we describe the computational algorithm used to solve the optimization problem (2.5). Because (2.5) is a bi-convex problem, the standard approach to solving it is to use the alternating minimization method. In particular, starting from a given initial point $(\hat{B}_{0}, \hat{C}_{0})$, at the $t$th iteration we solve the following problems:

{\hat{B}}_{t} = \arg \min_{B} tr [{\hat{C}}_{t - 1} {\hat{Σ}}_{Y Y} + {\hat{C}}_{t - 1} B^{⊤} {\hat{Σ}}_{X X} B - 2 {\hat{C}}_{t - 1} B^{⊤} {\hat{Σ}}_{X Y}] + λ_{B} ∥ B ∥_{1},

(2.12)

{\hat{C}}_{t} = \arg \min_{C \in S_{+}^{q \times q}} tr [C {\hat{Σ}}_{Y Y} + C {\hat{B}}_{t - 1}^{⊤} {\hat{Σ}}_{X X} {\hat{B}}_{t - 1} - 2 C {\hat{B}}_{t - 1}^{⊤} {\hat{Σ}}_{X Y}] + λ_{C} ∥ C ∥_{1} - \log \det C .

(2.13)

In each iteration of our algorithm, given ${\hat{C}}_{t - 1}$ , we first update the estimator ${\hat{B}}_{t}$ by solving (2.12). Because (2.12) is quadratic in $B$ , we use the coordinate descent algorithm to solve it. Then, we adopt the graphical lasso method of Friedman et al. (2008) to solve (2.13). We summarize the above procedures in Algorithm 1.
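The $B$-update (2.12) can be sketched as follows. Note that this sketch swaps in proximal gradient descent (ISTA) in place of the coordinate descent used in the paper, because it is shorter to write. The function names are hypothetical, $\hat{Σ}_{XX}$ is assumed positive semidefinite and the fixed $C$ positive definite. The $C$-update (2.13) is not shown and would be handled by a graphical lasso solver.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Entrywise soft-thresholding: sign(z) * [|z| - tau]_+."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def update_B(C, S_xx, S_xy, lam_B, n_iter=500):
    """Proximal-gradient sketch for (2.12) with C held fixed.

    Minimizes tr[C B' S_xx B - 2 C B' S_xy] + lam_B * ||B||_1 over B;
    the term tr[C S_yy] does not depend on B and is dropped.
    """
    p, q = S_xy.shape
    B = np.zeros((p, q))
    # The smooth part has gradient 2 * (S_xx B C - S_xy C), which is
    # Lipschitz with constant 2 * lambda_max(S_xx) * lambda_max(C).
    lipschitz = 2.0 * np.linalg.eigvalsh(S_xx).max() * np.linalg.eigvalsh(C).max()
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = 2.0 * (S_xx @ B @ C - S_xy @ C)
        B = soft_threshold(B - step * grad, step * lam_B)
    return B
```

When $\hat{Σ}_{XX} = I$ and $C = I$, this update returns the entrywise soft-thresholding of $\hat{Σ}_{XY}$ at $λ_{B}/2$, matching the closed form of the separate lasso in the toy example.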

3. Theoretical study

We establish the following theoretical results. First, we prove in Theorem 3.1 that the proposed estimators ${\hat{Σ}}_{X X}$ , ${\hat{Σ}}_{X Y}$ and ${\hat{Σ}}_{Y Y}$ are consistent with high probability. We then show the convergence rate of our proposed estimators $\hat{B}$ and $\hat{C}$ in Theorem 3.4. Finally, the selection consistency of our proposed method is shown in Theorem 3.5. The technical assumptions (A1) to (A5), and all proofs are provided in the Supplementary Material. In the following analysis, we allow $p$ and $q$ to diverge as $n_{X X}$ , $n_{X Y}$ and $n_{Y Y}$ increase.

In Theorem 3.1, we prove the large deviation bounds for our proposed estimators ${\hat{Σ}}_{X X}$ , ${\hat{Σ}}_{X Y}$ and ${\hat{Σ}}_{Y Y}$ .

Theorem 3.1.

Suppose $1 - α_{1} = O(\sqrt{\log p / n_{X}})$, $1 - α_{2} = O(\sqrt{\log p / n_{XX}})$, and $1 - α_{3} = O(\sqrt{\log(pq) / n_{XY}})$. If Conditions (A1) and (A2) hold, there exist positive constants $v_{1}^{'}$, $v_{2}^{'}$, and $v_{3}^{'}$ such that

P ({∥ {\hat{Σ}}_{X X} - Σ_{X X} ∥}_{\infty} \geq v_{1}^{'} \sqrt{\frac{\log p}{n_{X X}}}) \leq \frac{4}{p},

(3.1)

P ({∥ {\hat{Σ}}_{X Y} - Σ_{X Y} ∥}_{\infty} \geq v_{2}^{'} \sqrt{\frac{\log (p q)}{n_{X Y}}}) \leq \frac{4}{p q},

(3.2)

P ({∥ {\hat{Σ}}_{Y Y} - Σ_{Y Y} ∥}_{\infty} \geq v_{3}^{'} \sqrt{\frac{\log q}{n_{Y Y}}}) \leq \frac{4}{q} .

(3.3)

If we only use samples with complete observations, the sample covariance estimators ${\tilde{Σ}}_{X X, complete}$ , ${\tilde{Σ}}_{X Y, complete}$ and ${\tilde{Σ}}_{Y Y, complete}$ have the following convergence rates:

{∥ {\tilde{Σ}}_{X X, complete} - Σ_{X X} ∥}_{\infty} = O_{p} (\sqrt{(\log p) / n_{complete}}),

{∥ {\tilde{Σ}}_{X Y, complete} - Σ_{X Y} ∥}_{\infty} = O_{p} (\sqrt{(\log (p q)) / n_{complete}}),

{∥ {\tilde{Σ}}_{Y Y, complete} - Σ_{Y Y} ∥}_{\infty} = O_{p} (\sqrt{(\log q) / n_{complete}}),

where $n_{complete}$ is the number of samples with complete observations; see Yu et al. (2020). For block-missing data, $n_{complete}$ can be much smaller than $n_{X X}$ , $n_{X Y}$ and $n_{Y Y}$ .

Next, we give the properties of initial estimators ${\hat{B}}_{0}$ and ${\hat{C}}_{0}$ . The following lemma describes estimation consistency of the initial estimator ${\hat{B}}_{0}$ .

Lemma 3.2.

Suppose Conditions (A1)-(A4) hold, $1 - α_{1} = O (\sqrt{\log p / n_{X}})$ , $1 - α_{2} = O (\sqrt{\log p / n_{X X}})$ , and $1 - α_{3} = O (\sqrt{\log (p q) / n_{X Y}})$ . If we choose $λ_{B_{0}} = C {(\log (p q) / \min (n_{X Y}, n_{X X}))}^{\frac{1}{2}} {∥ B^{*} ∥}_{L_{1}}$ for some large enough constant $C$ , then with probability at least $1 - 4 / p - 4 / (p q)$ , the initial estimator ${\hat{B}}_{0} = \arg \min_{B} tr [{\hat{Σ}}_{Y Y} + B^{⊤} {\hat{Σ}}_{X X} B - 2 B^{⊤} {\hat{Σ}}_{X Y}] + λ_{B_{0}} ∥ B ∥_{1}$ satisfies

\begin{array}{l} {∥ {\hat{B}}_{0} - B^{*} ∥}_{F} ≲ \sqrt{q s_{B}} {∥ {\hat{Σ}}_{X Y} - {\hat{Σ}}_{X X} B^{*} ∥}_{\infty} \\ ≲ {∥ B^{*} ∥}_{L_{1}} \sqrt{\frac{q s_{B} \log (p q)}{\min (n_{X X}, n_{X Y})}} . \end{array}

Cai et al. (2013) showed that when there is no missing data and the true coefficient $B^{*}$ is exactly sparse, their estimator ${\hat{B}}_{Cai}$ has the convergence rate ${∥ {\hat{B}}_{Cai} - B^{*} ∥}_{F} = O_{p} (N_{p} \sqrt{q s_{B} \log (p q) / n})$ , where $n$ is the sample size of the data and $N_{p}$ is the upper bound of ${∥ Σ_{X X}^{- 1} ∥}_{L_{\infty}}$ . When there is no missing data, our initial estimator ${\hat{B}}_{0}$ has the convergence rate ${∥ {\hat{B}}_{0} - B^{*} ∥}_{F} = O_{p} ({∥ B^{*} ∥}_{L_{1}} \sqrt{q s_{B} \log (p q) / n})$ . If we assume ${∥ B^{*} ∥}_{L_{1}} ≍ {∥ Σ_{X X}^{- 1} ∥}_{L_{\infty}}$ , the convergence rate of ${\hat{B}}_{0}$ is the same as that of ${\hat{B}}_{Cai}$ . When the data are block-wise missing and we use only the complete samples to estimate $B^{*}$ , we have ${∥ {\hat{B}}_{0} - B^{*} ∥}_{F} = O_{p} ({∥ B^{*} ∥}_{L_{1}} \sqrt{q s_{B} \log (p q) / n_{complete}})$ , which can be much slower than the rate in Lemma 3.2 because $n_{complete}$ is typically much smaller than $n_{X X}$ and $n_{X Y}$ for block-wise missing data.

For the single-response regression with block-wise missing data, the result in Lemma 3.2 is the same as Theorem 2 in Yu et al. (2020) and the estimator ${\hat{B}}_{0}$ performs well when the dimension of $Y$ is small. But when the dimension of $Y$ becomes large, the estimator ${\hat{B}}_{0}$ may perform poorly.

The following lemma describes consistency of our initial estimator ${\hat{C}}_{0}$ .

Lemma 3.3.

Suppose Conditions (A1)-(A4) hold, $1 - α_{1} = O (\sqrt{\log p / n_{X}})$ , $1 - α_{2} = O (\sqrt{\log p / n_{X X}})$ , and $1 - α_{3} = O (\sqrt{\log (p q) / n_{X Y}})$ . If we choose $λ_{C_{0}} = C {∥ C^{*} ∥}_{2}^{2} {∥ B^{*} ∥}_{L_{1}} ({∥ B^{*} ∥}_{L_{1}} + s_{B} \sqrt{q}) {(\log (p q) / \min (n_{X X}, n_{X Y}))}^{1 / 2}$ for a large enough $C$ , it holds with probability at least $1 - 4 / p - 4 / (p q) - 4 / q$ that

{∥ {\hat{C}}_{0} - C^{*} ∥}_{F} ≲ \sqrt{s_{C}} {∥ C^{*} ∥}_{2}^{2} {∥ Σ_{ϵ} - {\hat{C}}_{0}^{- 1} ∥}_{\infty} ≲ {∥ C^{*} ∥}_{2}^{2} {∥ B^{*} ∥}_{L_{1}} ({∥ B^{*} ∥}_{L_{1}} + s_{B} \sqrt{q}) \sqrt{\frac{s_{C} \log (p q)}{\min (n_{X X}, n_{X Y})}} .

There are two terms in the estimation error bound of ${\hat{C}}_{0}$ . The first term, ${∥ C^{*} ∥}_{2}^{2} {∥ B^{*} ∥}_{L_{1}}^{2} \sqrt{\frac{s_{C} \log (p q)}{\min (n_{X X}, n_{X Y})}}$ , comes from the error induced by using incomplete observations to estimate $Σ_{X X}$ and $Σ_{X Y}$ . The second term, ${∥ C^{*} ∥}_{2}^{2} {∥ B^{*} ∥}_{L_{1}} s_{B} \sqrt{\frac{s_{C} q \log (p q)}{\min (n_{X X}, n_{X Y})}}$ , comes from the estimation error of ${\hat{B}}_{0}$ .

We next derive the convergence rates of $\hat{B}$ and $\hat{C}$ . The convergence rates are related to $n_{X X / Y}$ and $n_{X Y / X}$ , which are fractions of $n_{X X}$ and $n_{X Y}$ , respectively. Hence, we let $n_{X X / Y} ≍ n_{X X}^{τ_{1}}$ and $n_{X Y / X} ≍ n_{X Y}^{τ_{2}}$ with $τ_{1}, τ_{2} \in \{- \infty\} \cup [0,1]$ . When the responses are complete while the covariates have missing entries, $n_{X X / Y} = 0$ and $τ_{1} = - \infty$ , $n_{X Y / X} > 0$ and $τ_{2} \in [0,1]$ . When the covariates are complete while the responses have missing entries, $n_{X Y / X} = 0$ and $τ_{2} = - \infty$ , $n_{X X / Y} > 0$ and $τ_{1} \in [0,1]$ . When both the responses and covariates are complete, $n_{X X / Y} = n_{X Y / X} = 0$ and $τ_{1} = τ_{2} = - \infty$ . Theorem 3.4 below establishes the consistency of the proposed estimators $\hat{B}$ and $\hat{C}$ in (2.5).

Theorem 3.4.

Suppose Conditions (A1)–(A4) hold, $1 - α_{1} = O (\sqrt{\log p / n_{X}})$ , $1 - α_{2} = O (\sqrt{\log p / n_{X X}})$ , and $1 - α_{3} = O (\sqrt{\log (p q) / n_{X Y}})$ . If we choose $λ_{B}$ and $λ_{C}$ satisfying $λ_{B} = C ({∥ B^{*} C^{*} ∥}_{L_{1}} {(\log p)}^{1 / 2} / \min (n_{X X}^{1 - τ_{1} / 2}, n_{X Y}^{1 - τ_{2} / 2}) + {(\log (p q) / n_{X Y})}^{1 / 2})$ and $λ_{C} = C {∥ C^{*} ∥}_{2}^{2} [{∥ B^{*} ∥}_{L_{1}}^{2} + s_{B} {∥ B^{*} C^{*} ∥}_{L_{1}} / \min (n_{X X}^{1 / 2 - τ_{1} / 2}, n_{X Y}^{1 / 2 - τ_{2} / 2})] {(\log (p q) / \min (n_{X X}, n_{X Y}))}^{1 / 2}$ for a large enough $C$ , then it holds with probability at least $1 - 4 / p - 4 / (p q) - 4 / q$ that

{∥ \hat{B} - B^{*} ∥}_{F} ≲ \sqrt{s_{B}} (\frac{{∥ B^{*} C^{*} ∥}_{L_{1}} {(\log (p q))}^{1 / 2}}{\min (n_{X X}^{1 - τ_{1} / 2}, n_{X Y}^{1 - τ_{2} / 2})} + {(\frac{\log (p q)}{n_{X Y}})}^{1 / 2}),

{∥ \hat{C} - C^{*} ∥}_{F} ≲ \sqrt{s_{C}} {∥ C^{*} ∥}_{2}^{2} (\frac{s_{B} {∥ B^{*} C^{*} ∥}_{L_{1}} {(\log (p q))}^{1 / 2}}{\min (n_{X X}^{1 - τ_{1} / 2}, n_{X Y}^{1 - τ_{2} / 2})} + \frac{{∥ B^{*} ∥}_{L_{1}}^{2} {(\log (p q))}^{1 / 2}}{\min (n_{X X}^{1 / 2}, n_{X Y}^{1 / 2})}),

{∥ \hat{B} - B^{*} ∥}_{1} ≲ s_{B} (\frac{{∥ B^{*} C^{*} ∥}_{L_{1}} {(\log (p q))}^{1 / 2}}{\min (n_{X X}^{1 - τ_{1} / 2}, n_{X Y}^{1 - τ_{2} / 2})} + {(\frac{\log (p q)}{n_{X Y}})}^{1 / 2}),

{∥ \hat{C} - C^{*} ∥}_{1} ≲ s_{C} {∥ C^{*} ∥}_{2}^{2} (\frac{s_{B} {∥ B^{*} C^{*} ∥}_{L_{1}} {(\log (p q))}^{1 / 2}}{\min (n_{X X}^{1 - τ_{1} / 2}, n_{X Y}^{1 - τ_{2} / 2})} + \frac{{∥ B^{*} ∥}_{L_{1}}^{2} {(\log (p q))}^{1 / 2}}{\min (n_{X X}^{1 / 2}, n_{X Y}^{1 / 2})}) .

Next, we discuss some direct implications of Theorem 3.4. First, we show that our estimators are at least as good as the initial estimators under some conditions. Since $τ_{1}, τ_{2} \leq 1$ as $n_{j k l}^{X X / Y} \leq n_{j k}^{X X}$ and $n_{j k l}^{X Y / X} \leq n_{j k}^{X Y}$ , the convergence rate of ${∥ \hat{B} - B^{*} ∥}_{F}$ is no slower than $O_{p} (\max ({∥ B^{*} C^{*} ∥}_{L_{1}}, 1) \sqrt{s_{B} \log (p q) / \min (n_{X X}, n_{X Y})})$ . Similarly, the convergence rate of ${∥ \hat{C} - C^{*} ∥}_{F}$ is no slower than $O_{p} (\sqrt{s_{C}} {∥ C^{*} ∥}_{2}^{2} ({∥ B^{*} ∥}_{L_{1}}^{2} + s_{B} {∥ B^{*} C^{*} ∥}_{L_{1}}) \sqrt{\frac{\log (p q)}{\min (n_{X X}, n_{X Y})}})$ . Here the two slowest convergence rates are achieved when $τ_{1} = τ_{2} = 1$ . If we assume ${∥ B^{*} C^{*} ∥}_{L_{1}} = O ({∥ B^{*} ∥}_{L_{1}} \sqrt{q})$ , the upper bounds of ${∥ \hat{B} - B^{*} ∥}_{F}$ and ${∥ \hat{C} - C^{*} ∥}_{F}$ are at least as tight as those of ${∥ {\hat{B}}_{0} - B^{*} ∥}_{F}$ and ${∥ {\hat{C}}_{0} - C^{*} ∥}_{F}$ .
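
To make the worst case explicit, the following short calculation substitutes $τ_{1} = τ_{2} = 1$ into the Frobenius-norm bound for $\hat{B}$ in Theorem 3.4. It is a sketch added for illustration that uses only that bound and the fact that $n_{X Y} \geq \min (n_{X X}, n_{X Y})$ ; the constant 2 is absorbed into the $O_{p}$ notation.

```latex
\min\bigl(n_{XX}^{1-\tau_1/2},\, n_{XY}^{1-\tau_2/2}\bigr)\Big|_{\tau_1=\tau_2=1}
  = \min(n_{XX}, n_{XY})^{1/2},
\qquad
\|\hat{B}-B^{*}\|_{F}
  \lesssim \sqrt{s_B}\,\bigl(\|B^{*}C^{*}\|_{L_1}+1\bigr)
           \sqrt{\frac{\log(pq)}{\min(n_{XX}, n_{XY})}}
  \le 2\max\bigl(\|B^{*}C^{*}\|_{L_1},\,1\bigr)
           \sqrt{\frac{s_B\log(pq)}{\min(n_{XX}, n_{XY})}} .
```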

On the other hand, if ${∥ B^{*} C^{*} ∥}_{L_{1}} = o ({∥ B^{*} ∥}_{L_{1}} \sqrt{q})$ or $\max (τ_{1}, τ_{2}) < 1$ and ${∥ B^{*} C^{*} ∥}_{L_{1}}^{2} = o (\min (n_{X X}^{1 / 2 - τ_{1} / 2}, n_{X Y}^{1 / 2 - τ_{2} / 2}))$ , the upper bounds of ${∥ \hat{B} - B^{*} ∥}_{F}$ and ${∥ \hat{C} - C^{*} ∥}_{F}$ are strictly tighter than those of ${∥ {\hat{B}}_{0} - B^{*} ∥}_{F}$ and ${∥ {\hat{C}}_{0} - C^{*} ∥}_{F}$ . One example is when $\mathrm{var} (ϵ_{j}) > \frac{1}{\sqrt{q}}$ for all $j \leq q$ and $\mathrm{cov} (ϵ_{j}, ϵ_{k}) = 0$ for $j \neq k$ . Another example is when $n_{X X / Y} = o (n_{X X})$ , $n_{X Y / X} = o (n_{X Y})$ , and ${∥ B^{*} C^{*} ∥}_{L_{1}}^{2} = o (\min (n_{X X}^{1 / 2 - τ_{1} / 2}, n_{X Y}^{1 / 2 - τ_{2} / 2}))$ .

When $Y$ is complete while $X$ has missing entries, $τ_{1} = - \infty$ and $τ_{2} \in [0,1]$ . Then the convergence rate of $\hat{B}$ in Theorem 3.4 becomes

{∥ \hat{B} - B^{*} ∥}_{F} ≲ \sqrt{s_{B}} (\frac{{∥ B^{*} C^{*} ∥}_{L_{1}} {(\log (p q))}^{1 / 2}}{n_{X Y}^{1 - τ_{2} / 2}} + {(\frac{\log (p q)}{n_{X Y}})}^{1 / 2}) .

When $X$ is complete while $Y$ has missing entries, $τ_{2} = - \infty$ and $τ_{1} \in [0,1]$ . In this case, we can set $α_{1} = α_{2} = 1$ and have

{∥ \hat{B} - B^{*} ∥}_{F} ≲ \sqrt{s_{B}} (\frac{{∥ B^{*} C^{*} ∥}_{L_{1}} {(\log (p q))}^{1 / 2}}{n_{X X}^{1 - τ_{1} / 2}} + {(\frac{\log (p q)}{n_{X Y}})}^{1 / 2}) .

When both $X$ and $Y$ are complete, $τ_{1} = τ_{2} = - \infty$ . In this case, we can set $α_{1} = α_{2} = α_{3} = 1$ and have

{∥ \hat{B} - B^{*} ∥}_{F} ≲ \sqrt{s_{B} \log (p q) / n},

(3.4)

where $n$ is the sample size. The error bound in (3.4) is the minimax rate of the $ℓ_{1}$ -penalized estimator as shown in Raskutti et al. (2011).
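
The reduction to (3.4) can be read off from the Frobenius-norm bound for $\hat{B}$ in Theorem 3.4. As a brief sketch added for illustration, and under the convention that $n^{1 - τ / 2} = + \infty$ when $τ = - \infty$ , the first term of the bound vanishes and $n_{X Y} = n$ when the data are complete:

```latex
\tau_1=\tau_2=-\infty
\;\Longrightarrow\;
\frac{\|B^{*}C^{*}\|_{L_1}\,(\log(pq))^{1/2}}
     {\min\bigl(n_{XX}^{1-\tau_1/2},\, n_{XY}^{1-\tau_2/2}\bigr)} = 0,
\qquad
\|\hat{B}-B^{*}\|_{F}
  \lesssim \sqrt{s_B}\left(\frac{\log(pq)}{n_{XY}}\right)^{1/2}
  = \sqrt{\frac{s_B\log(pq)}{n}} .
```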

In Theorem 3.5 below, we show that our proposed method is model selection consistent.

Theorem 3.5.

Assume that Conditions (A1)-(A5) hold. Suppose $1 - α_{1} = O (\sqrt{\log p / n_{X}})$ , $1 - α_{2} = O (\sqrt{\log p / n_{X X}})$ , and $1 - α_{3} = O (\sqrt{\log (p q) / n_{X Y}})$ . If ${(\log (p q) / n_{X Y})}^{\frac{1}{2} - γ_{2}} / λ_{B} = o (1)$ , $λ_{B} {∥ {({(C^{*} \otimes Σ_{X X})}_{S_{B} S_{B}})}^{- 1} ∥}_{L_{\infty}} / \min_{j \in S_{B}} | β_{j}^{*} | = o (1)$ , $s_{B} {∥ {({(C^{*} \otimes Σ_{X X})}_{S_{B} S_{B}})}^{- 1} ∥}_{L_{\infty}} {(\log p / n_{X X})}^{\frac{1}{2} - γ_{2}} = o (1)$ , and $s_{B} {(\log p / n_{X X})}^{\frac{1}{2} - γ_{1} - γ_{2}} / λ_{B} = o (1)$ , then with probability at least $1 - 4 / p - 4 / (p q) - 4 / q$ , there exists a solution $\hat{B}$ to (2.5) such that $\mathrm{sign} (\hat{B}) = \mathrm{sign} (B^{*})$ .

4. Numerical study

In this section, we examine the performance of our proposed method (Multi-DISCOM) in terms of $Σ_{ϵ}$ , the signal-to-noise ratio, and the distribution of the error $ϵ$ using numerical studies. We compare the efficiency of our proposed method with that of the following methods: (1) the complete lasso, which separately applies the lasso to each response using only samples with complete observations (both $X$ and $Y$ have no missing values); (2) the imputed lasso, which separately applies the lasso to each response using all samples, where missing data are imputed using the soft-thresholded SVD method; (3) the MBI, which separately applies the MBI (Xue and Qu, 2021) to each response using all samples, and the missing data are imputed using multiple block-wise imputation; (4) DISCOM, which separately applies the DISCOM method (Yu et al., 2020) to each response; and (5) the imputed-MRCE, which runs the MRCE (Rothman et al., 2010) using all samples, with missing data imputed using the soft-thresholded SVD method.

In all examples, we set $q = 4$ and $x_{i} = {(x_{i 1}, \dots, x_{i p})}^{⊤} \sim N (0, Σ)$ , with $σ_{j t} = {0.6}^{| j - t |}$ . The $i$th row of the coefficient matrix $B^{*}$ is (1, 1.5, 1, 1.5), for $i = 1, p_{1} + 1, p_{1} + p_{2} + 1$ , and zero otherwise. The response $Y$ has entries missing completely at random, with the missing proportion 0.01.
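
The design above can be written down in a few lines. The following is a minimal sketch, assuming NumPy and the dimensions $p_{1} = p_{2} = p_{3} = 30$ used in the examples; the helper names `ar1_cov` and `true_coef`, the seed, and the number of draws are hypothetical choices made here for illustration.

```python
import numpy as np


def ar1_cov(p, rho=0.6):
    """Covariance matrix with entries rho ** |j - t|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def true_coef(p1, p2, p3, row=(1.0, 1.5, 1.0, 1.5)):
    """Coefficient matrix B* with one nonzero row at the start of each modality."""
    B = np.zeros((p1 + p2 + p3, len(row)))
    # Zero-based positions of rows 1, p1 + 1, and p1 + p2 + 1
    for i in (0, p1, p1 + p2):
        B[i] = row
    return B


rng = np.random.default_rng(0)  # seed chosen arbitrarily
p1 = p2 = p3 = 30
Sigma = ar1_cov(p1 + p2 + p3)
B_star = true_coef(p1, p2, p3)
# Complete predictor draws; block-missingness would be imposed afterwards
X = rng.multivariate_normal(np.zeros(p1 + p2 + p3), Sigma, size=120)
```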

For each example, the data are generated from three modalities, with dimensions $p_{1}$ , $p_{2}$ , and $p_{3}$ , respectively. The training data set contains $n_{1}$ samples with complete observations, $n_{2}$ samples from the third modality, $n_{3}$ samples from the first and third modalities, and $n_{4}$ samples from the first modality. The tuning data set contains 75 samples with complete observations, and the testing data set includes 300 samples with complete observations. For each method, we train our model with different tuning parameters on the training data set. Then we choose the optimal tuning parameter by minimizing the mean squared error (MSE) on the tuning data set.
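
To see how much data each covariance entry can draw on under this pattern, the sketch below counts, for every pair of modalities, the number of training samples in which both are observed. It is an illustration only: the function name `modality_mask` is hypothetical, and the counts use $n_{1} = n_{2} = n_{3} = n_{4} = 30$ as in the examples.

```python
import numpy as np


def modality_mask(n1, n2, n3, n4):
    """Rows are samples; columns indicate whether modalities 1, 2, 3 are observed."""
    patterns = (
        [(1, 1, 1)] * n1    # complete observations
        + [(0, 0, 1)] * n2  # third modality only
        + [(1, 0, 1)] * n3  # first and third modalities
        + [(1, 0, 0)] * n4  # first modality only
    )
    return np.array(patterns, dtype=int)


M = modality_mask(30, 30, 30, 30)
# Entry (a, b) counts the samples in which modalities a and b are both observed
pair_counts = M.T @ M
n_complete = int(M.all(axis=1).sum())
```

In this design, within-modality covariances for the first and third modalities can use 90 samples and their cross-covariance 60, whereas a complete-case analysis is limited to 30 samples.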

For each example, we repeat the simulation 50 times. To evaluate the selection performance of the algorithm, we use the false-positive rate (FPR) and false-negative rate (FNR) as criteria: $F P R = F P / (F P + T N)$ and $F N R = F N / (F N + T P)$ , where FN is the number of coefficients wrongly detected as zero, TN is the number of coefficients correctly detected as zero, TP is the number of coefficients correctly detected as nonzero, and FP is the number of coefficients wrongly detected as nonzero. Furthermore, to evaluate the accuracy of our estimators, we use the MSE on the testing data set and the $ℓ_{2}$ -distance ${∥ \hat{B} - B^{*} ∥}_{F}$ as criteria.
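
The two selection criteria can be computed directly from the estimated and true coefficient matrices. The following is a minimal sketch; the function name `selection_rates` and the optional tolerance `tol` for treating tiny estimates as zero are assumptions made here, not part of the original procedure.

```python
def selection_rates(B_hat, B_true, tol=0.0):
    """Return (FPR, FNR) for an estimate B_hat of a sparse matrix B_true.
    Both arguments are sequences of rows with matching shapes."""
    fp = fn = tp = tn = 0
    for row_hat, row_true in zip(B_hat, B_true):
        for est, truth in zip(row_hat, row_true):
            selected = abs(est) > tol  # coefficient detected as nonzero
            active = truth != 0        # coefficient truly nonzero
            if selected and active:
                tp += 1
            elif selected and not active:
                fp += 1
            elif not selected and active:
                fn += 1
            else:
                tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```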

In Example 1, we examine the performance of our method with respect to $Σ_{ϵ}$ . Let $n_{1} = n_{2} = n_{3} = n_{4} = 30$ and $p_{1} = p_{2} = p_{3} = 30$ . We set the error $ϵ_{i} = (ϵ_{i 1}, \dots, ϵ_{i q}) \sim N (0, Σ_{ϵ})$ , with $Σ_{ϵ} = 3 I_{2} \otimes (\begin{matrix} 1 & ρ \\ ρ & 1 \end{matrix})$ . We choose $ρ$ between −0.4 and 0.4.

In Example 2, we examine the performance of our method with respect to the signal-to-noise ratio. Let $n_{1} = n_{2} = n_{3} = n_{4} = 30$ and $p_{1} = p_{2} = p_{3} = 30$ . We set the error $ϵ_{i} = (ϵ_{i 1}, \dots, ϵ_{i q}) \sim N (0, Σ_{ϵ})$ , with $Σ_{ϵ} = α I_{2} \otimes (\begin{matrix} 1 & - 0.4 \\ - 0.4 & 1 \end{matrix})$ , and choose $α$ between one and five.

In Example 3, we examine the robustness of our method when the error follows a heavy-tailed distribution. Let $n_{1} = n_{2} = n_{3} = n_{4} = 30$ and $p_{1} = p_{2} = p_{3} = 30$ . We set the error $ϵ_{i} = (ϵ_{i 1}, \dots, ϵ_{i q}) \sim t_{10} (0, Σ_{ϵ})$ , where $Σ_{ϵ} = 3 I_{2} \otimes (\begin{matrix} 1 & - 0.4 \\ - 0.4 & 1 \end{matrix})$ , and $t_{ν} (0, Σ_{ϵ})$ refers to Student’s $t$ distribution with location vector 0 and scale matrix $Σ_{ϵ}$ .
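
The error structures in the three examples share the Kronecker form $Σ_{ϵ} = c I_{2} \otimes (\begin{matrix} 1 & ρ \\ ρ & 1 \end{matrix})$ . The sketch below, assuming NumPy, builds this matrix and draws Gaussian and multivariate $t$ errors; the function names and the seed are hypothetical, and the $t$ draws use the standard representation of a multivariate $t$ vector as a Gaussian vector divided by the square root of an independent scaled chi-square variable.

```python
import numpy as np


def error_cov(scale, rho):
    """scale * (I_2 kron [[1, rho], [rho, 1]]), a 4 x 4 block-diagonal matrix."""
    block = np.array([[1.0, rho], [rho, 1.0]])
    return scale * np.kron(np.eye(2), block)


def gaussian_errors(rng, n, Sigma_eps):
    """n draws from N(0, Sigma_eps), one per row."""
    q = Sigma_eps.shape[0]
    return rng.multivariate_normal(np.zeros(q), Sigma_eps, size=n)


def t_errors(rng, n, Sigma_eps, df=10):
    """n draws from a multivariate t with location 0 and scale matrix Sigma_eps."""
    z = gaussian_errors(rng, n, Sigma_eps)
    w = rng.chisquare(df, size=n)
    return z / np.sqrt(w / df)[:, None]


rng = np.random.default_rng(1)  # seed chosen arbitrarily
Sigma_eps = error_cov(3.0, -0.4)
E_gauss = gaussian_errors(rng, 120, Sigma_eps)
E_t = t_errors(rng, 120, Sigma_eps, df=10)
```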

To demonstrate the results, we focus on Example 1. We report the results of the other examples in the Supplementary Material.

The results in Table 1 indicate that the multi-DISCOM method delivers the best performance in all settings. Specifically, the multi-DISCOM method produces smaller MSEs and estimation errors than the other methods in all settings, especially when there are large correlations between the responses. In addition, the lasso method with imputed data may deliver worse selection performance, possibly because of the randomness in the imputation of the block-missing data. The results in Table 4 in the Supplementary Material indicate that the multi-DISCOM method has a greater advantage when the signal-to-noise ratio is small. When the ratio is smaller, the noise has a stronger effect on $Y$ , and hence considering the precision matrix is more helpful for our estimation.

Table 1:

Performance comparison for the methods in Example 1 with different ρ. The values in parentheses are the standard errors of the measures.

	Method	${∥ \hat{B} - B^{*} ∥}_{F}$	MSE	FPR	FNR

ρ = −0.4	Lasso	1.51(0.06)	3.70(0.06)	0.09(0.02)	0.00(0.00)
	Imputed-Lasso	1.73(0.06)	3.57(0.06)	0.11(0.01)	0.00(0.00)
	MBI	2.10(0.08)	4.26(0.09)	0.12(0.02)	0.11(0.03)
	DISCOM	1.44(0.04)	3.56(0.06)	0.05(0.00)	0.05(0.01)
	Imputed-MRCE	1.53(0.05)	3.72(0.08)	0.17(0.03)	0.08(0.02)
	Multi-DISCOM	1.40(0.04)	3.39(0.08)	0.02(0.01)	0.09(0.02)

ρ = 0.4	Lasso	1.55(0.06)	3.77(0.06)	0.11(0.02)	0.00(0.00)
	Imputed-Lasso	1.75(0.06)	3.61(0.06)	0.13(0.01)	0.00(0.00)
	MBI	2.14(0.08)	4.30(0.09)	0.13(0.02)	0.11(0.03)
	DISCOM	1.46(0.04)	3.59(0.06)	0.06(0.00)	0.05(0.01)
	Imputed-MRCE	1.54(0.05)	3.73(0.08)	0.19(0.03)	0.09(0.02)
	Multi-DISCOM	1.43(0.04)	3.44(0.08)	0.04(0.01)	0.07(0.02)


5. Application to the ADNI study

We apply the multi-DISCOM method to data from the ADNI study (Mueller et al., 2005), and compare it with several existing approaches. A primary goal of this analysis is to identify biological markers and neuropsychological assessments to measure the progression of mild cognitive impairment (MCI) and early AD. We are interested in predicting the Mini-Mental State Examination (MMSE), ADAS1, and ADAS2 scores, which are common diagnostic scores for AD. The data processing steps are summarized in the Supplementary Material.

After data processing, we have 93 features from MRI, 93 features from PET, and five features from CSF. There are 805 subjects in total, including 199 subjects with complete MRI, PET, and CSF features, 197 subjects with MRI and PET features only, 201 subjects with MRI and CSF features only, and 208 subjects with MRI features only.

In our analysis, we divide the data into training, tuning, and testing sets. The training set consists of all subjects with incomplete observations and 40 randomly selected subjects with complete features. The tuning set consists of another 40 randomly selected subjects with complete observations. The testing set contains the remaining 119 subjects with complete observations. We train our model using different tuning parameters on the training set, choosing the tuning parameter that minimizes the MSE on the tuning set. The testing set is used to evaluate the methods. We use all methods shown in the simulation study to predict the MMSE score. For each method, the analysis is repeated 30 times using different partitions of the data. In addition to the sum of the MSE of all three responses, we compare the MSEs for each response ( ${MSE}_{MMSE}$ , ${MSE}_{ADAS1}$ , and ${MSE}_{ADAS2}$ ) as criteria. We also compare the number of features selected by each method.
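
The partition described above can be sketched as follows. This is an illustration only: the integer subject identifiers, the function name `split_complete`, and the seed are hypothetical, while the group sizes (199 complete subjects; 197, 201, and 208 incomplete subjects) are those reported for the processed data.

```python
import random


def split_complete(complete_ids, n_train=40, n_tune=40, seed=0):
    """Randomly split complete-case subjects into training, tuning, and testing parts."""
    ids = list(complete_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    tune = ids[n_train:n_train + n_tune]
    test = ids[n_train + n_tune:]
    return train, tune, test


complete_ids = range(199)        # placeholder identifiers for the complete subjects
n_incomplete = 197 + 201 + 208   # subjects missing at least one modality
train_complete, tune, test = split_complete(complete_ids, seed=0)
# All incomplete subjects are added to the training set
train_size = len(train_complete) + n_incomplete
```

Repeating the call with 30 different seeds would give the 30 partitions used in the analysis.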

As shown in Table 2, the multi-DISCOM method outperforms all other methods. The DISCOM method has a similar overall MSE to that of the multi-DISCOM method, but worse ${MSE}_{ADAS1}$ and ${MSE}_{ADAS2}$ . One possible reason for this is that ADAS1 and ADAS2 are highly correlated, which means considering the precision matrix can help. Because there are 208 subjects with MRI features only, the MBI method may not impute those 208 subjects accurately. As a result, the MBI method may not perform well in this case.

Table 2:

Performance comparison for the ADNI data.

Method	Overall MSE	MSE_MMSE	MSE_ADAS1	MSE_ADAS2	# of Selected Features

Lasso	93.37(3.82)	5.31(0.19)	29.84(1.35)	58.23(2.40)	54.20
Imputed-Lasso	80.40(1.62)	4.54(0.12)	25.80(0.51)	50.07(1.15)	165.00
MBI	91.84(3.02)	5.13(0.14)	28.43(1.17)	58.29(2.16)	59.87
DISCOM	67.47(1.33)	4.26(0.11)	21.76(0.51)	41.45(0.86)	72.87
Imputed-MRCE	67.41(2.02)	4.29(0.10)	21.61(0.65)	41.50(1.33)	218.50
Multi-DISCOM	65.82(1.21)	4.22(0.12)	21.18(0.46)	40.41(0.80)	89.67


With regard to model selection, both the DISCOM method and the multi-DISCOM method deliver relatively simple models. Figure 2 shows the selection frequency of the 191 features when predicting ADAS1. The selection frequency of each feature is defined as the number of times it is selected in the 30 replications. As shown in Figure 2, for our method, some features are often selected, and many other features are rarely selected. Thus our method delivers robust model selection. However, the features selected by the imputed lasso method vary across replications. One possible reason for the unstable performance in terms of model selection is the randomness in the imputation of the block-missing data. Hippocampus formation left (69th feature) and amygdala right (83rd feature) are frequently selected by our method, and have been shown to be highly correlated with AD and MCI (Jack et al., 1999; Misra et al., 2009; Zhang and Shen, 2012); however, the DISCOM method rarely selects these features.

6. Conclusion

We have proposed a joint estimation method in a penalized framework with an entry-wise ℓ₁-regularization using block-missing multi-modal predictors. We first estimate the covariance matrix of the predictors using a linear combination of the estimates of the variance of each predictor, the estimates of the intra-modality covariance matrix, and the cross-modality covariance matrix. The proposed estimator of the covariance matrix can be positive semidefinite and more accurate than the sample covariance matrix. In the second step, we use the estimated covariance matrix and a penalized estimator to deliver a sparse estimate of the coefficients in the optimal linear prediction. We also establish the theory for the estimation and feature selection consistency. Extensive simulation studies indicate that our method exhibits promising performance in terms of estimation, prediction, and model selection for block-missing multi-modal data. Finally, we apply the multi-DISCOM method to the ADNI data set, showing that our model has good prediction power and meaningful interpretation.

Acknowledgments

The authors thank the editor, associate editor, and reviewers for their helpful comments and suggestions. This research was supported in part by NSF grant DMS-2100729 and NIH grants R01GM126550 and R01AG073259. Haodong Wang gratefully acknowledges the partial support from the National Science Foundation, award NSF-DMS-1929298 to the Statistical and Applied Mathematical Sciences Institute.

Footnotes

Supplementary Material

Supplementary Material includes additional results of our numerical studies, technical conditions and proofs.

Contributor Information

Haodong Wang, Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill.

Quefeng Li, Department of Biostatistics, The University of North Carolina at Chapel Hill.

Yufeng Liu, Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill.

References

Banerjee O, El Ghaoui L, and d’Aspremont A (2008). Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research 9, 485–516.
Breiman L and Friedman JH (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1), 3–54.
Cai TT, Li H, Liu W, and Xie J (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100(1), 139–156.
Chen J, Xu P, Wang L, Ma J, and Gu Q (2018). Covariate adjusted precision matrix estimation via nonconvex optimization. In International Conference on Machine Learning, pp. 922–931.
Fisher TJ and Sun X (2011). Improved stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Computational Statistics and Data Analysis 55(5), 1909–1918.
Friedman J, Hastie T, and Tibshirani R (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
Izenman AJ (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5(2), 248–264.
Jack CR, Petersen RC, Xu YC, O’Brien PC, Smith GE, Ivnik RJ, Boeve BF, Waring SC, Tangalos EG, and Kokmen E (1999). Prediction of ad with mri-based hippocampal volume in mild cognitive impairment. Neurology 52(7), 1397–1397.
Johnson CR (1990). Matrix completion problems: a survey. In Matrix Theory and Applications, Volume 40, pp. 171–198. Amer. Math. Soc.
Kim S and Xing EP (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eqtl mapping. The Annals of Applied Statistics 6(3), 1095–1117.
Lee W and Liu Y (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized gaussian maximum likelihood. Journal of Multivariate Analysis 111, 241–255.
Loh P-L and Wainwright MJ (2015). Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. The Journal of Machine Learning Research 16(1), 559–616.
Loh W-Y and Zheng W (2013). Regression trees for longitudinal and multiresponse data. The Annals of Applied Statistics, 495–522.
Misra C, Fan Y, and Davatzikos C (2009). Baseline and longitudinal patterns of brain atrophy in mci patients, and their use in prediction of short-term conversion to ad. Neuroimage 44(4), 1415–1422.
Molstad AJ, Sun W, and Hsu L (2020). A covariance-enhanced approach to multi-tissue joint eqtl mapping with application to transcriptome-wide association studies. arXiv preprint arXiv:2001.08363.
Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W, Trojanowski JQ, Toga AW, and Beckett L (2005). The alzheimer’s disease neuroimaging initiative. Neuroimaging Clinics 15(4), 869–877.
Raskutti G, Wainwright MJ, and Yu B (2011). Minimax rates of estimation for high-dimensional linear regression over lq-balls. IEEE Transactions on Information Theory 57(10), 6976–6994.
Rothman AJ, Levina E, and Zhu J (2010). Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics 19(4), 947–962.
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
Xue F and Qu A (2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association 116(536), 1914–1927.
Yin J and Li H (2011). A sparse conditional gaussian graphical model for analysis of genetical genomics data. The Annals of Applied Statistics 5(4), 2630.
Yu G, Li Q, Shen D, and Liu Y (2020). Optimal sparse linear prediction for block-missing multi-modality data without imputation. Journal of the American Statistical Association 115(531), 1406–1419.
Yuan M, Ekici A, Lu Z, and Monteiro R (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), 329–346.
Zhang D and Shen D (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in alzheimer’s disease. NeuroImage 59(2), 895–907.
