Published in final edited form as: Adv Neural Inf Process Syst. 2014 Dec;27:5630.

Multivariate Regression with Calibration*

Han Liu¹, Lie Wang², Tuo Zhao³

Abstract

We propose a new method named calibrated multivariate regression (CMR) for fitting high dimensional multivariate regression models. Compared to existing methods, CMR calibrates the regularization for each regression task with respect to its noise level, so that it is tuning insensitive and achieves improved finite-sample performance. Computationally, we develop an efficient smoothed proximal gradient algorithm with a worst-case iteration complexity O(1/ε), where ε is a pre-specified numerical accuracy. Theoretically, we prove that CMR achieves the optimal rate of convergence in parameter estimation. We illustrate the usefulness of CMR by thorough numerical simulations and show that CMR consistently outperforms other high dimensional multivariate regression methods. We also apply CMR to a brain activity prediction problem and find that it is competitive with a handcrafted model created by human experts.

1 Introduction

Given a design matrix X ∈ ℝ^{n×d} and a response matrix Y ∈ ℝ^{n×m}, we consider a multivariate linear model Y = XB^0 + Z, where B^0 ∈ ℝ^{d×m} is an unknown regression coefficient matrix and Z ∈ ℝ^{n×m} is a noise matrix [1]. For a matrix A = [A_{jk}] ∈ ℝ^{d×m}, we denote its jth row and kth column by $A_{j*} = (A_{j1}, \ldots, A_{jm}) \in \mathbb{R}^m$ and $A_{*k} = (A_{1k}, \ldots, A_{dk})^T \in \mathbb{R}^d$ respectively. We assume that all $Z_{i*}$'s are independently sampled from an m-dimensional Gaussian distribution with mean 0 and covariance matrix Σ ∈ ℝ^{m×m}.

We can represent the multivariate linear model as an ensemble of univariate linear regression models: $Y_{*k} = XB^0_{*k} + Z_{*k}$, k = 1, …, m. We then obtain a multi-task learning problem [3, 2, 26]. Multi-task learning exploits shared common structure across tasks to obtain improved estimation performance. In the past decade, significant progress has been made towards designing a variety of modeling assumptions for multivariate regression.

A popular assumption is that all the regression tasks share a common sparsity pattern, i.e., many $B^0_{j*}$'s are zero vectors. Such a joint sparsity assumption is a natural extension of that for univariate linear regressions. Similar to the L1-regularization used in the Lasso [23], we can adopt group regularization to obtain a good estimator of B^0 [25, 24, 19, 13]. Besides the aforementioned approaches, there are other methods that aim to exploit the covariance structure of the noise matrix Z [7, 22]. For instance, [22] assumes that all $Z_{i*}$'s follow a multivariate Gaussian distribution with a sparse inverse covariance matrix Ω = Σ^{-1}, and proposes an iterative algorithm to estimate sparse B^0 and Ω by maximizing the penalized Gaussian log-likelihood. Such an iterative procedure is effective in many applications, but its theoretical analysis is difficult due to the nonconvex formulation.

In this paper, we assume an uncorrelated structure for the noise matrix Z, i.e., $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_{m-1}^2, \sigma_m^2)$. Under this setting, we can efficiently solve the resulting estimation problem with the following convex program:

$$\hat{B} = \mathop{\mathrm{argmin}}_{B} \; \frac{1}{n}\|Y - XB\|_F^2 + \lambda \|B\|_{1,p}, \qquad (1.1)$$

where λ > 0 is a tuning parameter and $\|A\|_F = \sqrt{\sum_{j,k} A_{jk}^2}$ is the Frobenius norm of a matrix A. Popular choices of p include p = 2 and p = ∞: $\|B\|_{1,2} = \sum_{j=1}^d \sqrt{\sum_{k=1}^m B_{jk}^2}$ and $\|B\|_{1,\infty} = \sum_{j=1}^d \max_{1\le k\le m} |B_{jk}|$. Computationally, the optimization problem in (1.1) can be efficiently solved by first order algorithms [11, 12, 4].

The problem with the uncorrelated noise structure is amenable to statistical analysis. Under suitable conditions on the noise and design matrices, let σ_max = max_k σ_k; if we choose $\lambda = 2c\cdot\sigma_{\max}\big(\sqrt{\log d} + m^{1-1/p}\big)$ for some c > 1, then the estimator in (1.1) achieves the optimal rate of convergence¹ [13], i.e., there exists some universal constant C such that with high probability, we have

$$\frac{1}{\sqrt{m}}\|\hat{B} - B^0\|_F \le C\cdot\sigma_{\max}\left(\sqrt{\frac{s\log d}{nm}} + \sqrt{\frac{s\, m^{1-2/p}}{n}}\right),$$

where s is the number of rows with non-zero entries in B^0. However, the estimator in (1.1) has two drawbacks: (1) All the tasks are regularized by the same tuning parameter λ, even though different tasks may have different σ_k's. Thus more estimation bias is introduced to the tasks with smaller σ_k's to compensate for the tasks with larger σ_k's; in other words, the tasks are not calibrated. (2) The tuning parameter selection involves the unknown quantity σ_max. This requires tuning the regularization parameter over a wide range of potential values to obtain good finite-sample performance.

To overcome the above two drawbacks, we formulate a new convex program named calibrated multivariate regression (CMR). The CMR estimator is defined to be the solution of the following convex program:

$$\hat{B} = \mathop{\mathrm{argmin}}_{B} \; \|Y - XB\|_{2,1} + \lambda \|B\|_{1,p}, \qquad (1.2)$$

where $\|A\|_{2,1} = \sum_k \sqrt{\sum_j A_{jk}^2}$ is the nonsmooth L_{2,1} norm of a matrix A = [A_{jk}] ∈ ℝ^{d×m}. This is a multivariate extension of the square-root Lasso [5]. Similar to the square-root Lasso, the tuning parameter selection of CMR does not involve σ_max. Moreover, the L_{2,1} loss function can be viewed as a special case of the weighted least square loss, which calibrates each regression task (see more details in §2). Thus CMR adapts to different σ_k's and achieves better finite-sample performance than the ordinary multivariate regression estimator (OMR) defined in (1.1).
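To make the contrast with (1.1) concrete, the following minimal NumPy sketch (an illustration for p = 2, not the authors' implementation; the helper names are ours) evaluates the two objectives on a small random instance. Only the loss term differs: OMR uses the squared Frobenius loss, while CMR sums the column-wise residual norms.

```python
import numpy as np

def group_norm_1_2(B):
    # ||B||_{1,2}: sum of row-wise L2 norms
    return np.sum(np.linalg.norm(B, axis=1))

def omr_objective(B, X, Y, lam):
    # Ordinary multivariate regression (1.1): squared Frobenius loss
    n = X.shape[0]
    return np.linalg.norm(Y - X @ B, 'fro') ** 2 / n + lam * group_norm_1_2(B)

def cmr_objective(B, X, Y, lam):
    # Calibrated multivariate regression (1.2): nonsmooth L_{2,1} loss,
    # i.e. the sum of column-wise residual norms
    R = Y - X @ B
    return np.sum(np.linalg.norm(R, axis=0)) + lam * group_norm_1_2(B)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 50, 20, 5
    X = rng.standard_normal((n, d))
    B0 = np.zeros((d, m)); B0[:3] = rng.standard_normal((3, m))  # 3 nonzero rows
    Y = X @ B0 + rng.standard_normal((n, m))
    B = rng.standard_normal((d, m))
    print(omr_objective(B, X, Y, 0.1), cmr_objective(B, X, Y, 0.1))
```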

Since both the loss and penalty functions in (1.2) are nonsmooth, CMR is computationally more challenging than OMR. To efficiently solve CMR, we propose a smoothed proximal gradient (SPG) algorithm with an iteration complexity O(1/ε), where ε is the pre-specified accuracy of the objective value [18, 4]. Theoretically, we provide sufficient conditions under which CMR achieves the optimal rates of convergence in parameter estimation. Numerical experiments on both synthetic and real data show that CMR universally outperforms existing multivariate regression methods. For a brain activity prediction task, prediction based on the features selected by CMR significantly outperforms that based on the features selected by OMR, and is even competitive with that based on the handcrafted features selected by human experts.

Notations

Given a vector v = (v_1, …, v_d)^T ∈ ℝ^d, for 1 ≤ p ≤ ∞, we define the L_p-vector norm of v as $\|v\|_p = (\sum_{j=1}^d |v_j|^p)^{1/p}$ if 1 ≤ p < ∞ and $\|v\|_\infty = \max_{1\le j\le d} |v_j|$ if p = ∞. Given two matrices A = [A_{jk}] and C = [C_{jk}] ∈ ℝ^{d×m}, we define their inner product as $\langle A, C\rangle = \sum_{j=1}^d\sum_{k=1}^m A_{jk}C_{jk} = \mathrm{tr}(A^T C)$, where tr(A) is the trace of a matrix A. We use A_{*k} = (A_{1k}, …, A_{dk})^T and A_{j*} = (A_{j1}, …, A_{jm}) to denote the kth column and jth row of A. Let $\mathcal{S}$ be some subspace of ℝ^{d×m}; we use $A_{\mathcal{S}}$ to denote the projection of A onto $\mathcal{S}$: $A_{\mathcal{S}} = \mathop{\mathrm{argmin}}_{C\in\mathcal{S}} \|C - A\|_F^2$. Moreover, we define the Frobenius and spectral norms of A as $\|A\|_F = \sqrt{\langle A, A\rangle}$ and $\|A\|_2 = \psi_1(A)$, where ψ_1(A) is the largest singular value of A. In addition, we define the matrix block norms as $\|A\|_{2,1} = \sum_{k=1}^m \|A_{*k}\|_2$, $\|A\|_{2,\infty} = \max_{1\le k\le m}\|A_{*k}\|_2$, $\|A\|_{1,p} = \sum_{j=1}^d \|A_{j*}\|_p$, and $\|A\|_{\infty,q} = \max_{1\le j\le d}\|A_{j*}\|_q$, where 1 ≤ p ≤ ∞ and 1 ≤ q ≤ ∞. It is easy to verify that $\|A\|_{2,1}$ is the dual norm of $\|A\|_{2,\infty}$. With the convention 1/∞ = 0, if 1/p + 1/q = 1, then $\|A\|_{\infty,q}$ and $\|A\|_{1,p}$ are also dual norms of each other.
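For concreteness, the block norms defined above can be computed as follows (an illustrative NumPy sketch; the function names are ours).

```python
import numpy as np

def norm_2_1(A):
    # ||A||_{2,1}: sum of column-wise L2 norms
    return np.sum(np.linalg.norm(A, axis=0))

def norm_2_inf(A):
    # ||A||_{2,inf}: maximum column-wise L2 norm (dual norm of ||.||_{2,1})
    return np.max(np.linalg.norm(A, axis=0))

def norm_1_p(A, p):
    # ||A||_{1,p}: sum of row-wise Lp norms
    return np.sum(np.linalg.norm(A, ord=p, axis=1))

def norm_inf_q(A, q):
    # ||A||_{inf,q}: maximum row-wise Lq norm (dual of ||.||_{1,p} when 1/p + 1/q = 1)
    return np.max(np.linalg.norm(A, ord=q, axis=1))
```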

2 Method

We solve the multivariate regression problem by the following convex program,

$$\hat{B} = \mathop{\mathrm{argmin}}_{B} \; \|Y - XB\|_{2,1} + \lambda \|B\|_{1,p}. \qquad (2.1)$$

The only difference between (2.1) and (1.1) is that we replace the L_2 loss function by the nonsmooth L_{2,1} loss function. The L_{2,1} loss function can be viewed as a special case of the weighted square loss function. More specifically, we consider the following optimization problem,

$$\hat{B} = \mathop{\mathrm{argmin}}_{B} \; \sum_{k=1}^m \frac{1}{\sigma_k\sqrt{n}}\|Y_{*k} - XB_{*k}\|_2^2 + \lambda\|B\|_{1,p}, \qquad (2.2)$$

where $\frac{1}{\sigma_k\sqrt{n}}$ is a weight assigned to calibrate the kth regression task. Without prior knowledge of the σ_k's, we use the following plug-in replacement:

$$\sigma_k = \frac{1}{\sqrt{n}}\|Y_{*k} - XB_{*k}\|_2, \qquad k = 1, \ldots, m. \qquad (2.3)$$

By plugging (2.3) into the objective function in (2.2), we recover (2.1). In other words, CMR calibrates different tasks by solving a penalized weighted least square program with weights defined in (2.3).
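This plug-in identity is easy to verify numerically. Below is a minimal sketch (illustrative NumPy code with assumed variable names): plugging the weights (2.3) into the weighted loss in (2.2) recovers the L_{2,1} loss in (2.1) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 30, 10, 4
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, m))
B = rng.standard_normal((d, m))

R = Y - X @ B                                    # residual matrix
l21_loss = np.sum(np.linalg.norm(R, axis=0))     # ||Y - XB||_{2,1}

# Weighted least-squares loss from (2.2) with the plug-in weights (2.3):
# sigma_k = ||Y_{*k} - X B_{*k}||_2 / sqrt(n)
sigma = np.linalg.norm(R, axis=0) / np.sqrt(n)
weighted_loss = np.sum(np.linalg.norm(R, axis=0) ** 2 / (sigma * np.sqrt(n)))

assert np.allclose(l21_loss, weighted_loss)      # the two loss terms coincide
```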

The optimization problem in (2.1) can be solved by the alternating direction method of multipliers (ADMM) with a global convergence guarantee [20]. However, ADMM does not take full advantage of the problem structure in (2.1). For example, even though the L_{2,1} norm is nonsmooth, it is nondifferentiable only when a task achieves an exactly zero residual, which is unlikely in applications. In this paper, we apply the dual smoothing technique proposed by [18] to obtain a smooth surrogate function, so that we can avoid directly evaluating the subgradient of the L_{2,1} loss function and thus gain the computational efficiency of a smooth loss function.

We consider Fenchel's dual representation of the L_{2,1} loss:

$$\|Y - XB\|_{2,1} = \max_{\|U\|_{2,\infty}\le 1} \langle U, Y - XB\rangle. \qquad (2.4)$$

Let μ > 0 be a smoothing parameter. The smooth approximation of the L2,1 loss can be obtained by solving the following optimization problem

$$\|Y - XB\|_{\mu} = \max_{\|U\|_{2,\infty}\le 1} \langle U, Y - XB\rangle - \frac{\mu}{2}\|U\|_F^2, \qquad (2.5)$$

where $\frac{1}{2}\|U\|_F^2$ is the proximity function. Due to the fact that $\|U\|_F^2 \le m\|U\|_{2,\infty}^2 \le m$ over the feasible set, we obtain the following uniform bound by combining (2.4) and (2.5):

$$\|Y - XB\|_{2,1} - \frac{m\mu}{2} \le \|Y - XB\|_{\mu} \le \|Y - XB\|_{2,1}. \qquad (2.6)$$

From (2.6), we see that the approximation error introduced by the smoothing procedure can be controlled by a suitable μ. Figure 2.1 shows several two-dimensional examples of the L_2 norm smoothed by different μ's. The optimization problem in (2.5) has a closed form solution $\hat{U}^B$ with $\hat{U}^B_{*k} = (Y_{*k} - XB_{*k})/\max\{\|Y_{*k} - XB_{*k}\|_2, \mu\}$.

Figure 2.1. The L_2 norm (μ = 0) and its smooth surrogates with μ = 0.1, 0.25, 0.5. A larger μ makes the approximation smoother, but introduces a larger approximation error.
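Column by column, the maximization in (2.5) can be carried out in closed form, which yields a Huber-type expression for the smoothed loss. The following sketch (illustrative NumPy code, not the authors' implementation) computes ||R||_μ for a residual matrix R = Y − XB via the closed-form maximizer Û and checks the uniform bound (2.6).

```python
import numpy as np

def smoothed_l21(R, mu):
    """Smoothed L_{2,1} loss ||R||_mu from (2.5), computed column by column.

    Maximizing <u_k, r_k> - (mu/2)||u_k||_2^2 over ||u_k||_2 <= 1 gives
    u_k = r_k / max(||r_k||_2, mu), so each column contributes
    ||r_k||_2 - mu/2 if ||r_k||_2 >= mu, and ||r_k||_2^2 / (2 mu) otherwise.
    """
    norms = np.linalg.norm(R, axis=0)
    U = R / np.maximum(norms, mu)                      # closed-form maximizer U_hat
    vals = np.where(norms >= mu, norms - mu / 2.0, norms ** 2 / (2.0 * mu))
    return vals.sum(), U

rng = np.random.default_rng(2)
R = rng.standard_normal((30, 13))
mu = 0.25
val, U = smoothed_l21(R, mu)
l21 = np.sum(np.linalg.norm(R, axis=0))
m = R.shape[1]
# Uniform bound (2.6): ||R||_{2,1} - m*mu/2 <= ||R||_mu <= ||R||_{2,1}
assert l21 - m * mu / 2.0 <= val + 1e-12 <= l21 + 1e-12
```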

The next lemma shows that ||YXB||μ is smooth in B with a simple form of gradient.

Lemma 2.1

For any μ > 0, ||YXB||μ is a convex and continuously differentiable function in B. In addition, Gμ(B)—the gradient of ||YXB||μ w.r.t. B—has the form

$$G_{\mu}(B) = \frac{\partial\big(\langle \hat{U}^B, Y - XB\rangle - \frac{\mu}{2}\|\hat{U}^B\|_F^2\big)}{\partial B} = -X^T\hat{U}^B. \qquad (2.7)$$

Moreover, let $\gamma = \|X\|_2^2$; then $G_{\mu}(B)$ is Lipschitz continuous in B with Lipschitz constant γ/μ, i.e., for any B′, B″ ∈ ℝ^{d×m},

$$\|G_{\mu}(B') - G_{\mu}(B'')\|_F = \|X^T(\hat{U}^{B'} - \hat{U}^{B''})\|_F \le \frac{1}{\mu}\|X^TX(B' - B'')\|_F \le \frac{\gamma}{\mu}\|B' - B''\|_F.$$
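The gradient formula (2.7) can be sanity-checked against a finite-difference approximation, as in the following sketch (illustrative NumPy code; the helper names are ours, and the smoothed value is computed column-wise as above).

```python
import numpy as np

def grad_smoothed(X, Y, B, mu):
    # Gradient from (2.7): G_mu(B) = -X^T U_hat^B, where
    # U_hat^B_{*k} = (Y_{*k} - X B_{*k}) / max(||Y_{*k} - X B_{*k}||_2, mu)
    R = Y - X @ B
    U = R / np.maximum(np.linalg.norm(R, axis=0), mu)
    return -X.T @ U

def smoothed_value(X, Y, B, mu):
    # Column-wise closed-form value of ||Y - XB||_mu
    norms = np.linalg.norm(Y - X @ B, axis=0)
    return np.sum(np.where(norms >= mu, norms - mu / 2, norms ** 2 / (2 * mu)))

# Finite-difference check of one entry of the gradient at a random point.
rng = np.random.default_rng(3)
n, d, m, mu, eps = 40, 8, 5, 0.3, 1e-6
X, Y, B = rng.standard_normal((n, d)), rng.standard_normal((n, m)), rng.standard_normal((d, m))
G = grad_smoothed(X, Y, B, mu)
E = np.zeros((d, m)); E[2, 1] = 1.0               # perturb a single entry of B
fd = (smoothed_value(X, Y, B + eps * E, mu) - smoothed_value(X, Y, B - eps * E, mu)) / (2 * eps)
assert np.isclose(G[2, 1], fd, atol=1e-4)
```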

Lemma 2.1 is a direct result of Theorem 1 in [18] and implies that ||YXB||μ has good computational structure. Therefore we apply the smooth proximal gradient algorithm to solve the smoothed version of the optimization problem as follows,

$$\bar{B} = \mathop{\mathrm{argmin}}_{B} \; \|Y - XB\|_{\mu} + \lambda\|B\|_{1,p}. \qquad (2.8)$$

We then adopt the fast proximal gradient algorithm to solve (2.8) [4]. To derive the algorithm, we first define three sequences of auxiliary variables {A^{(t)}}, {V^{(t)}}, and {H^{(t)}} with A^{(0)} = H^{(0)} = V^{(0)} = B^{(0)}, a sequence of weights {θ_t = 2/(t + 1)}, and a nonincreasing sequence of step sizes {η_t > 0}. For simplicity, we can set η_t = μ/γ. In practice, we use backtracking line search to dynamically adjust η_t and boost the performance. At the tth iteration, we first take $V^{(t)} = (1 - \theta_t)B^{(t-1)} + \theta_t A^{(t-1)}$. We then consider a quadratic approximation of $\|Y - XH\|_{\mu}$ as

$$Q(H, V^{(t)}, \eta_t) = \|Y - XV^{(t)}\|_{\mu} + \langle G_{\mu}(V^{(t)}), H - V^{(t)}\rangle + \frac{1}{2\eta_t}\|H - V^{(t)}\|_F^2.$$

Consequently, letting $\tilde{H}^{(t)} = V^{(t)} - \eta_t G_{\mu}(V^{(t)})$, we take

$$H^{(t)} = \mathop{\mathrm{argmin}}_{H} \; Q(H, V^{(t)}, \eta_t) + \lambda\|H\|_{1,p} = \mathop{\mathrm{argmin}}_{H} \; \frac{1}{2\eta_t}\|H - \tilde{H}^{(t)}\|_F^2 + \lambda\|H\|_{1,p}. \qquad (2.9)$$

When p = 2, (2.9) has a closed form solution $H^{(t)}_{j*} = \tilde{H}^{(t)}_{j*}\cdot\max\{1 - \eta_t\lambda/\|\tilde{H}^{(t)}_{j*}\|_2,\, 0\}$. More details about other choices of p in the L_{1,p} norm can be found in [11] and [12]. To ensure that the objective value is nonincreasing, we choose

$$B^{(t)} = \mathop{\mathrm{argmin}}_{B\in\{H^{(t)},\, B^{(t-1)}\}} \; \|Y - XB\|_{\mu} + \lambda\|B\|_{1,p}. \qquad (2.10)$$

Finally, we take $A^{(t)} = B^{(t-1)} + \frac{1}{\theta_t}(H^{(t)} - B^{(t-1)})$. The algorithm stops when $\|H^{(t)} - V^{(t)}\|_F \le \varepsilon$, where ε is the stopping precision.
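Putting the pieces together, the iteration can be sketched as follows for p = 2 with the fixed step size η_t = μ/γ (illustrative Python code under our notation, not the authors' MATLAB implementation; the backtracking line search is omitted for brevity).

```python
import numpy as np

def prox_group_l2(H_tilde, tau):
    # Closed-form solution of (2.9) for p = 2: row-wise group soft-thresholding,
    # H_{j*} = H_tilde_{j*} * max(1 - tau / ||H_tilde_{j*}||_2, 0)
    norms = np.linalg.norm(H_tilde, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return H_tilde * scale

def spg_cmr(X, Y, lam, mu=1e-4, eps=1e-4, max_iter=5000):
    n, d = X.shape
    m = Y.shape[1]

    def f_mu(B):                                   # smoothed L_{2,1} loss ||Y - XB||_mu
        norms = np.linalg.norm(Y - X @ B, axis=0)
        return np.sum(np.where(norms >= mu, norms - mu / 2, norms ** 2 / (2 * mu)))

    def grad(B):                                   # Lemma 2.1: G_mu(B) = -X^T U_hat^B
        R = Y - X @ B
        return -X.T @ (R / np.maximum(np.linalg.norm(R, axis=0), mu))

    def obj(B):                                    # smoothed objective in (2.8)
        return f_mu(B) + lam * np.sum(np.linalg.norm(B, axis=1))

    gamma = np.linalg.norm(X, 2) ** 2              # gamma = ||X||_2^2 (squared spectral norm)
    eta = mu / gamma                               # fixed step size eta_t = mu / gamma
    B = np.zeros((d, m)); A = B.copy()
    for t in range(1, max_iter + 1):
        theta = 2.0 / (t + 1)
        V = (1 - theta) * B + theta * A            # momentum combination
        H_tilde = V - eta * grad(V)                # gradient step
        H = prox_group_l2(H_tilde, eta * lam)      # proximal step (2.9)
        B_new = H if obj(H) <= obj(B) else B       # monotone update (2.10)
        A = B + (H - B) / theta                    # auxiliary update
        if np.linalg.norm(H - V) <= eps:           # stopping rule ||H - V||_F <= eps
            return B_new
        B = B_new
    return B
```

For example, spg_cmr(X, Y, lam) returns an estimate of B̂ for a given regularization parameter λ; with backtracking line search, the step size would be adjusted adaptively instead of being fixed at μ/γ.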

The numerical rate of convergence of the proposed algorithm with respect to the original optimization problem (2.1) is presented in the following theorem.

Theorem 2.2

Given a pre-specified accuracy ε, let μ = ε/m. Then after $t = 2\sqrt{m\gamma}\,\|B^{(0)} - \hat{B}\|_F/\varepsilon - 1 = O(1/\varepsilon)$ iterations, we have $\|Y - XB^{(t)}\|_{2,1} + \lambda\|B^{(t)}\|_{1,p} \le \|Y - X\hat{B}\|_{2,1} + \lambda\|\hat{B}\|_{1,p} + \varepsilon$.

The proof of Theorem 2.2 is provided in Appendix A.1. This result achieves the minimax optimal rate of convergence over all first order algorithms [18].

3 Statistical Properties

For notational simplicity, we define a rescaled noise matrix W = [W_{ik}] ∈ ℝ^{n×m} with W_{ik} = Z_{ik}/σ_k, where $\mathbb{E}Z_{ik}^2 = \sigma_k^2$. Thus W is a random matrix with all entries having mean 0 and variance 1. We define G^0 to be the gradient of $\|Y - XB\|_{2,1}$ at B = B^0. It is easy to see that

$$G^0_{*k} = \frac{X^TZ_{*k}}{\|Z_{*k}\|_2} = \frac{X^TW_{*k}\sigma_k}{\|W_{*k}\sigma_k\|_2} = \frac{X^TW_{*k}}{\|W_{*k}\|_2}$$

does not depend on the unknown quantities σ_k for any k = 1, …, m; thus $G^0_{*k}$ serves as an important pivotal quantity in our analysis. Moreover, our analysis exploits the decomposability of the L_{1,p} norm [17]. More specifically, we assume that B^0 has s rows with non-zero entries and define

$$\mathcal{S} = \{C \in \mathbb{R}^{d\times m} \mid C_{j*} = 0 \text{ for all } j \text{ such that } B^0_{j*} = 0\}, \qquad (3.1)$$
$$\mathcal{N} = \{C \in \mathbb{R}^{d\times m} \mid C_{j*} = 0 \text{ for all } j \text{ such that } B^0_{j*} \ne 0\}. \qquad (3.2)$$

Note that $B^0 \in \mathcal{S}$ and the L_{1,p} norm is decomposable with respect to the pair $(\mathcal{S}, \mathcal{N})$, i.e.,

$$\|A\|_{1,p} = \|A_{\mathcal{S}}\|_{1,p} + \|A_{\mathcal{N}}\|_{1,p}.$$

The next lemma shows that when λ is suitably chosen, the solution to the optimization problem in (2.1) lies in a restricted set.

Lemma 3.1

Let $B^0 \in \mathcal{S}$, let $\hat{B}$ be the optimum of (2.1), and let 1/p + 1/q = 1. Denote the estimation error by $\hat{\Delta} = \hat{B} - B^0$. If $\lambda \ge c\|G^0\|_{\infty,q}$ for some c > 1, then

$$\hat{\Delta} \in \mathcal{M}_c := \Big\{\Delta \in \mathbb{R}^{d\times m} \;\Big|\; \|\Delta_{\mathcal{N}}\|_{1,p} \le \frac{c+1}{c-1}\|\Delta_{\mathcal{S}}\|_{1,p}\Big\}. \qquad (3.3)$$

The proof of Lemma 3.1 is provided in Appendix B.1. To prove the main result, we also need to assume that the design matrix X satisfies the following condition.

Assumption 3.1

Let $B^0 \in \mathcal{S}$; then there exist positive constants κ and c > 1 such that

$$\kappa \le \min_{\Delta\in\mathcal{M}_c\setminus\{0\}} \frac{\|X\Delta\|_F}{\sqrt{n}\,\|\Delta\|_F}.$$

Assumption 3.1 is a generalization of the restricted eigenvalue conditions used for analyzing univariate sparse linear models [17, 15, 6]. Many common examples of random designs satisfy this assumption [13, 21].

Note that Lemma 3.1 is a deterministic result about the CMR estimator for a fixed λ. Since G^0 is essentially a random matrix, we need to show that $\lambda \ge c\|G^0\|_{\infty,q}$ holds with high probability in order to deliver a concrete rate of convergence for the CMR estimator, which we do in the next theorem.

Theorem 3.2

We assume that each column of X is normalized as $m^{1/2-1/p}\|X_{*j}\|_2 = \sqrt{n}$ for all j = 1, …, d. Then for some universal constant c_0 and large enough n, taking

$$\lambda = \frac{2c\big(m^{1-1/p} + \sqrt{\log d}\big)}{1 - c_0}, \qquad (3.4)$$

with probability at least $1 - 2\exp(-2\log d) - 2\exp(-nc_0^2/8 + \log m)$, we have

$$\frac{1}{\sqrt{m}}\|\hat{B} - B^0\|_F \le \frac{16c\,\sigma_{\max}}{\kappa^2(c-1)}\cdot\frac{1+c_0}{1-c_0}\left(\sqrt{\frac{s\,m^{1-2/p}}{n}} + \sqrt{\frac{s\log d}{nm}}\right).$$

The proof of Theorem 3.2 is provided in Appendix B.2. Note that when we choose p = 2, the column normalization condition reduces to $\|X_{*j}\|_2 = \sqrt{n}$. Meanwhile, the corresponding error bound reduces to

$$\frac{1}{\sqrt{m}}\|\hat{B} - B^0\|_F = O_P\left(\sqrt{\frac{s}{n}} + \sqrt{\frac{s\log d}{nm}}\right),$$

which achieves the minimax optimal rate of convergence presented in [13]. See Theorem 6.1 in [13] for more technical details. From Theorem 3.2, we see that CMR achieves the same rates of convergence as the noncalibrated counterpart, but the tuning parameter λ in (3.4) does not involve σk’s. Therefore CMR not only calibrates all the regression tasks, but also makes the tuning parameter selection insensitive to σmax.

4 Numerical Simulations

To compare the finite-sample performance of calibrated multivariate regression (CMR) and ordinary multivariate regression (OMR), we generate a training dataset of 200 samples. More specifically, we use the following data generation scheme (a code sketch of this scheme is given below): (1) Generate each row of the design matrix X_{i*}, i = 1, …, 200, independently from an 800-dimensional normal distribution N(0, Σ), where Σ_{jj} = 1 and Σ_{jℓ} = 0.5 for all ℓ ≠ j. (2) For k = 1, …, 13, set the regression coefficient matrix B^0 ∈ ℝ^{800×13} as $B^0_{1k} = 3$, $B^0_{2k} = 2$, $B^0_{4k} = 1.5$, and $B^0_{jk} = 0$ for all j ≠ 1, 2, 4. (3) Generate the random noise matrix Z = WD, where W ∈ ℝ^{200×13} has all entries independently generated from N(0, 1), and D is either of the following scale matrices:

$$D_I = \sigma_{\max}\cdot\mathrm{diag}\big(2^{0/4}, 2^{-1/4}, \ldots, 2^{-11/4}, 2^{-12/4}\big) \in \mathbb{R}^{13\times 13} \quad \text{or} \quad D_H = \sigma_{\max}\cdot I_{13\times 13}.$$

We generate a validation set of 200 samples for the regularization parameter selection and a testing set of 10,000 samples to evaluate the prediction accuracy.
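A sketch of this data generation scheme in Python (the paper's experiments are implemented in MATLAB; the function name, default arguments, and seed handling here are ours) is given below.

```python
import numpy as np

def generate_data(n, sigma_max, scale="DI", d=800, m=13, seed=0):
    """Generate (X, Y, B0) following the simulation scheme of Section 4."""
    rng = np.random.default_rng(seed)
    # (1) Rows of X drawn i.i.d. from N(0, Sigma), Sigma_jj = 1, Sigma_jl = 0.5 (l != j)
    Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    # (2) Sparse coefficient matrix: only rows 1, 2, 4 (1-indexed) are nonzero
    B0 = np.zeros((d, m))
    B0[0, :], B0[1, :], B0[3, :] = 3.0, 2.0, 1.5
    # (3) Noise Z = W D with W_ij ~ N(0, 1) and a diagonal scale matrix D
    W = rng.standard_normal((n, m))
    if scale == "DI":                  # heterogeneous noise levels
        D = sigma_max * np.diag(2.0 ** (-np.arange(m) / 4.0))
    else:                              # "DH": homogeneous noise levels
        D = sigma_max * np.eye(m)
    Y = X @ B0 + W @ D
    return X, Y, B0

X_tr, Y_tr, B0 = generate_data(200, sigma_max=2, scale="DI")
```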

In the numerical experiments, we set σ_max = 1, 2, and 4 to illustrate the tuning insensitivity of CMR. The regularization parameter λ of both CMR and OMR is chosen over a grid $\Lambda = \{2^{40/4}\lambda_0, 2^{39/4}\lambda_0, \ldots, 2^{-17/4}\lambda_0, 2^{-18/4}\lambda_0\}$, where $\lambda_0 = \sqrt{\log d} + \sqrt{m}$. The optimal regularization parameter λ̂ is determined by the validation prediction error, $\hat\lambda = \mathop{\mathrm{argmin}}_{\lambda\in\Lambda} \|\tilde{Y} - \tilde{X}\hat{B}_\lambda\|_F^2$, where $\hat{B}_\lambda$ denotes the estimate obtained with regularization parameter λ, and X̃ and Ỹ denote the design and response matrices of the validation set.
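The grid search over λ can be sketched as follows (illustrative Python code; it assumes a fitting routine such as the spg_cmr sketch above, and uses λ_0 as reconstructed in the text).

```python
import numpy as np

def select_lambda(fit, X_tr, Y_tr, X_val, Y_val):
    """Pick the regularization parameter over the dyadic grid Lambda by
    minimizing the prediction error on the validation set.

    `fit(X, Y, lam)` is any routine returning an estimate of B, e.g. the
    spg_cmr sketch above; lambda_0 below is the value reconstructed in the text.
    """
    d, m = X_tr.shape[1], Y_tr.shape[1]
    lam0 = np.sqrt(np.log(d)) + np.sqrt(m)
    grid = [2.0 ** (j / 4.0) * lam0 for j in range(40, -19, -1)]   # 2^{40/4} lam0, ..., 2^{-18/4} lam0
    best = (np.inf, None, None)
    for lam in grid:
        B_hat = fit(X_tr, Y_tr, lam)
        err = np.linalg.norm(Y_val - X_val @ B_hat, 'fro') ** 2    # validation prediction error
        if err < best[0]:
            best = (err, lam, B_hat)
    return best[1], best[2]                                        # (lambda_hat, B_hat)
```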

Since the noise levels σ_k differ across regression tasks, we adopt the following three criteria to evaluate the empirical performance: $\text{Pre.Err.} = \frac{1}{10000}\|\bar{Y} - \bar{X}\hat{B}\|_F^2$, $\text{Adj.Pre.Err.} = \frac{1}{10000\,m}\|(\bar{Y} - \bar{X}\hat{B})D^{-1}\|_F^2$, and $\text{Est.Err.} = \frac{1}{m}\|\hat{B} - B^0\|_F^2$, where X̄ and Ȳ denote the design and response matrices of the testing set.

All simulations are implemented in MATLAB on a PC with an Intel Core i5 3.3GHz CPU and 16GB of memory. CMR is solved by the proposed smoothed proximal gradient algorithm with stopping precision ε = 10^{-4} and smoothing parameter μ = 10^{-4}. OMR is solved by the monotone fast proximal gradient algorithm with stopping precision ε = 10^{-4}. We set p = 2, but the extension to arbitrary p > 2 is straightforward.

We first compare the smoothed proximal gradient (SPG) algorithm with the ADMM algorithm (the detailed derivation of ADMM can be found in Appendix A.2). We use backtracking line search with shrinkage parameter α = 0.8 to accelerate both algorithms, set σ_max = 2 for the adopted multivariate linear models, and conduct 200 simulations. The results are presented in Table 4.1. The SPG and ADMM algorithms attain similar objective values and similar estimation errors, but SPG is up to 4 times faster than ADMM.

Table 4.1.

Quantitative comparison of the computational performance of SPG and ADMM with the noise matrices generated using D_I. The results are averaged over 200 replicates with standard errors in parentheses. SPG and ADMM attain similar objective values, but SPG is up to about 4 times faster than ADMM.

λ       Algorithm   Timing (seconds)    Obj. Val.          Num. Iter.         Est. Err.
2λ0     SPG         2.8789 (0.3141)     508.21 (3.8498)    493.26 (52.268)    0.1213 (0.0286)
2λ0     ADMM        8.4731 (0.8387)     508.22 (3.7059)    437.70 (37.4532)   0.1215 (0.0291)
λ0      SPG         3.2633 (0.3200)     370.53 (3.6144)    565.80 (54.919)    0.0819 (0.0205)
λ0      ADMM        11.976 (1.460)      370.53 (3.4231)    600.94 (74.629)    0.0822 (0.0233)
0.5λ0   SPG         3.7868 (0.4551)     297.24 (3.6125)    652.53 (78.140)    0.1399 (0.0284)
0.5λ0   ADMM        18.360 (1.9678)     297.25 (3.3863)    1134.0 (136.08)    0.1409 (0.0317)

We then compare the statistical performance of CMR and OMR. Tables 4.2 and 4.3 summarize the results averaged over 200 replicates. In addition, we present the results of the oracle estimator, which is obtained by solving (2.2) with the true values of the σ_k's (known in simulation); the oracle estimator is only for comparison purposes and is not a practical estimator. Since CMR calibrates the regularization for each task with respect to σ_k, CMR universally outperforms OMR and achieves almost the same performance as the oracle estimator when we adopt the scale matrix D_I to generate the random noise. Meanwhile, when we adopt the scale matrix D_H, where all σ_k's are equal, CMR and OMR achieve similar performance. This further implies that CMR can be a safe replacement for OMR in multivariate regression.

Table 4.2.

Quantitative comparison of the statistical performance of CMR and OMR with the noise matrices generated using D_I. The results are averaged over 200 simulations with standard errors in parentheses. CMR universally outperforms OMR and achieves almost the same performance as the oracle estimator.

σmax   Method   Pre. Err.          Adj. Pre. Err.     Est. Err.
1      Oracle   5.8759 (0.0834)    1.0454 (0.0149)    0.0245 (0.0086)
1      CMR      5.8761 (0.0673)    1.0459 (0.0123)    0.0249 (0.0071)
1      OMR      5.9012 (0.0701)    1.0581 (0.0162)    0.0290 (0.0091)
2      Oracle   23.464 (0.3237)    1.0441 (0.0148)    0.0926 (0.0342)
2      CMR      23.465 (0.2598)    1.0446 (0.0121)    0.0928 (0.0279)
2      OMR      23.580 (0.2832)    1.0573 (0.0170)    0.1115 (0.0365)
4      Oracle   93.532 (0.8843)    1.0418 (0.0962)    0.3342 (0.1255)
4      CMR      93.542 (0.9794)    1.0421 (0.0118)    0.3346 (0.1063)
4      OMR      94.094 (1.0978)    1.0550 (0.0166)    0.4125 (0.1417)

Table 4.3.

Quantitative comparison of the statistical performance of CMR and OMR with the noise matrices generated using D_H. The results are averaged over 200 simulations with standard errors in parentheses. CMR and OMR achieve similar performance.

σmax   Method   Pre. Err.          Adj. Pre. Err.     Est. Err.
1      CMR      13.565 (0.1408)    1.0435 (0.0108)    0.0599 (0.0164)
1      OMR      13.697 (0.1554)    1.0486 (0.0142)    0.0607 (0.0128)
2      CMR      54.171 (0.5771)    1.0418 (0.0110)    0.2252 (0.0649)
2      OMR      54.221 (0.6173)    1.0427 (0.0118)    0.2359 (0.0821)
4      CMR      215.98 (2.104)     1.0384 (0.0101)    0.80821 (0.25078)
4      OMR      216.19 (2.391)     1.0394 (0.0114)    0.81957 (0.31806)

In addition, we examine the optimal regularization parameters selected for CMR and OMR over all replicates. We visualize the distribution of all 200 selected λ̂'s using a kernel density estimator; in particular, we adopt the Gaussian kernel, with the bandwidth selected by 10-fold cross validation. Figure 4.1 illustrates the estimated density functions. The horizontal axis corresponds to the rescaled regularization parameter $\log\big(\hat\lambda/(\sqrt{\log d} + \sqrt{m})\big)$. We see that the optimal regularization parameters of OMR vary significantly with σ_max. In contrast, the optimal regularization parameters of CMR are much more concentrated. This is consistent with our claimed tuning insensitivity.

Figure 4.1. The distributions of the selected regularization parameters, estimated with a kernel density estimator. The numbers in parentheses are the σ_max's. The optimal regularization parameters of OMR are more spread out across different σ_max than those of CMR and the oracle estimator.

5 Real Data Experiment

We apply CMR to a brain activity prediction problem, which aims to build a parsimonious model that predicts a person's neural activity in response to a stimulus word. As illustrated in Figure 5.1, for a given stimulus word, we first encode it into an intermediate semantic feature vector using corpus statistics. We then model the brain's neural activity pattern using CMR. Creating such a predictive model not only enables us to explore new analytical tools for fMRI data, but also helps us gain a deeper understanding of how the human brain represents knowledge [16].

Figure 5.1. An illustration of the fMRI brain activity prediction problem [16]. (a) To collect the data, a human participant sees a sequence of English words and their images; the corresponding fMRI images are recorded to represent the brain activity patterns. (b) To build a predictive model, each stimulus word is encoded into intermediate semantic features (e.g., the co-occurrence statistics of the stimulus word in a large text corpus). These intermediate features can then be used to predict the brain activity pattern.

Our experiments involve 9 participants, and Table 5.1 summarizes the prediction performance of different methods on these participants. We see that prediction based on the features selected by CMR significantly outperforms prediction based on the features selected by OMR, and is as competitive as prediction based on the handcrafted features selected by human experts. Due to space limitations, we present the details of the real-data experiment in the technical report version.

Table 5.1.

Prediction accuracies of different methods (higher is better). CMR outperforms OMR for 8 out of 9 participants, and outperforms the handcrafted basis words for 6 out of 9 participants.

Method P.1 P.2 P.3 P.4 P.5 P.6 P.7 P.8 P.9
CMR 0.840 0.794 0.861 0.651 0.823 0.722 0.738 0.720 0.780
OMR 0.803 0.789 0.801 0.602 0.766 0.623 0.726 0.749 0.765
Handcraft 0.822 0.776 0.773 0.727 0.782 0.865 0.734 0.685 0.819

6 Discussions

A related method is the square-root sparse multivariate regression [8], which solves a convex program with the Frobenius loss function and the L_{1,p} regularization function:

$$\hat{B} = \mathop{\mathrm{argmin}}_{B} \; \|Y - XB\|_F + \lambda\|B\|_{1,p}. \qquad (6.1)$$

The Frobenius loss function in (6.1) makes the regularization parameter selection independent of σmax, but it does not calibrate different regression tasks. Note that we can rewrite (6.1) as

$$(\hat{B}, \hat{\sigma}) = \mathop{\mathrm{argmin}}_{B,\sigma} \; \frac{1}{\sqrt{nm}\,\sigma}\|Y - XB\|_F^2 + \lambda\|B\|_{1,p} \quad \text{s.t.} \quad \sigma = \frac{1}{\sqrt{nm}}\|Y - XB\|_F. \qquad (6.2)$$

Since σ in (6.2) is not specific to any individual task, it cannot calibrate the regularization. Thus it is fundamentally different from CMR.
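To see this explicitly, substituting the constraint of (6.2) into its loss term gives

$$\frac{1}{\sqrt{nm}\,\sigma}\|Y - XB\|_F^2 = \frac{\|Y - XB\|_F^2}{\sqrt{nm}\cdot\frac{1}{\sqrt{nm}}\|Y - XB\|_F} = \|Y - XB\|_F,$$

so (6.2) recovers the objective of (6.1) with a single global scale σ̂, whereas CMR introduces one scale per task via (2.3).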

Footnotes

*

The authors are listed in alphabetical order. This work is partially supported by the grants NSF IIS1408910, NSF IIS1332109, NSF Grant DMS-1005539, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

1

The rate of convergence is optimal when p = 2, i.e., when the regularization function is $\|B\|_{1,2}$.

Contributor Information

Han Liu, Department of Operations Research and Financial Engineering, Princeton University.

Lie Wang, Department of Mathematics, Massachusetts Institute of Technology.

Tuo Zhao, Department of Computer Science, Johns Hopkins University.

References

1. Anderson TW. An Introduction to Multivariate Statistical Analysis. Wiley; New York: 1958.
2. Ando RK, Zhang T. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research. 2005;6(11):1817–1853.
3. Baxter J. A model of inductive bias learning. Journal of Artificial Intelligence Research. 2000;12:149–198.
4. Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing. 2009;18(11):2419–2434. doi: 10.1109/TIP.2009.2028250.
5. Belloni A, Chernozhukov V, Wang L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika. 2011;98(4):791–806.
6. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics. 2009;37(4):1705–1732.
7. Breiman L, Friedman JH. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B. 2002;59(1):3–54.
8. Bunea F, Lederer J, She Y. The group square-root lasso: theoretical properties and fast algorithms. IEEE Transactions on Information Theory. 2013;60:1313–1325.
9. Johnstone IM. Chi-square oracle inequalities. Lecture Notes–Monograph Series. 2001:399–418.
10. Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Springer; 2011.
11. Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In: Proceedings of the 26th Annual International Conference on Machine Learning; ACM; 2009. pp. 649–656.
12. Liu J, Ye J. Efficient ℓ1/ℓq norm regularization. Technical report, Arizona State University; 2010.
13. Lounici K, Pontil M, van de Geer S, Tsybakov AB. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics. 2011;39(4):2164–2204.
14. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B. 2010;72(4):417–473.
15. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37(1):246–270.
16. Mitchell TM, Shinkareva SV, Carlson A, Chang KM, Malave VL, Mason RA, Just MA. Predicting human brain activity associated with the meanings of nouns. Science. 2008;320(5880):1191–1195. doi: 10.1126/science.1152876.
17. Negahban SN, Ravikumar P, Wainwright MJ, Yu B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science. 2012;27(4):538–557.
18. Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103(1):127–152.
19. Obozinski G, Wainwright MJ, Jordan MI. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics. 2011;39(1):1–47.
20. Ouyang H, He N, Tran L, Gray A. Stochastic alternating direction method of multipliers. In: Proceedings of the 30th International Conference on Machine Learning; 2013. pp. 80–88.
21. Raskutti G, Wainwright MJ, Yu B. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research. 2010;11(8):2241–2259.
22. Rothman AJ, Levina E, Zhu J. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19(4):947–962. doi: 10.1198/jcgs.2010.09188.
23. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B. 1996;58(1):267–288.
24. Turlach BA, Venables WN, Wright SJ. Simultaneous variable selection. Technometrics. 2005;47(3):349–363.
25. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2005;68(1):49–67.
26. Zhang J. A Probabilistic Framework for Multi-Task Learning. PhD thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science; 2006.
