Transfer Learning for High-Dimensional Linear Regression: Prediction, Estimation and Minimax Optimality

Sai Li; T Tony Cai; Hongzhe Li

doi:10.1111/rssb.12479

. Author manuscript; available in PMC: 2023 Feb 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2021 Nov 16;84(1):149–173. doi: 10.1111/rssb.12479

Transfer Learning for High-Dimensional Linear Regression: Prediction, Estimation and Minimax Optimality

Sai Li ¹, T Tony Cai ², Hongzhe Li ³

PMCID: PMC8863181 NIHMSID: NIHMS1755759 PMID: 35210933

Abstract

This paper considers estimation and prediction of a high-dimensional linear regression in the setting of transfer learning where, in addition to observations from the target model, auxiliary samples from different but possibly related regression models are available. When the set of informative auxiliary studies is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. When the set of informative auxiliary samples is unknown, we propose a data-driven procedure for transfer learning, called Trans-Lasso, and show its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer. The proposed procedures are demonstrated in numerical studies and are applied to a dataset concerning the associations among gene expressions. It is shown that Trans-Lasso leads to improved performance in gene expression prediction in a target tissue by incorporating data from multiple different tissues as auxiliary samples.

1. Introduction

Modern scientific research is characterized by massive and diverse data sets. It is of significant interest to integrate different data sets to make a more accurate prediction and statistical inference. Given a target problem to solve, transfer learning (Torrey and Shavlik, 2010) aims at transferring the knowledge from different but related samples to improve the learning performance of the target problem. A typical example of transfer learning is that one can improve the accuracy of recognizing cars by using not only the labeled data for cars but some labeled data for trucks (Weiss et al., 2016). Besides classification, another important transfer learning problem is linear regressions with auxiliary samples. In biomedical studies, some clinical or biological outcomes are hard to obtain due to ethical or cost issues, in which case transfer learning can be leveraged to boost the prediction and estimation performance by effectively utilizing information from related studies.

Transfer learning has been applied to problems in medical and biological studies, including predictions of protein localization (Mei et al., 2011), biological imaging diagnosis (Shin et al., 2016), drug sensitivity prediction (Turki et al., 2017), and integrative analysis of “multi-omics” data, see, for instance, Sun and Hu (2016), Hu et al. (2019), and Wang et al. (2019). It has also been applied to natural language processing (Daumé III, 2007) and recommendation systems (Pan and Yang, 2013) in machine learning. The application that motivated the present paper is the integration of the gene expression measurements in different issues for understanding the gene regulations using the Genotype-Tissue Expression (GTEx) data (https://gtexportal.org/). These datasets are always high-dimensional with relatively small sample sizes. When studying the gene regulation relationships of a specific tissue or cell-type, it is possible to incorporate information from other tissues to enhance the learning accuracy. This motivates us to consider transfer learning in high-dimensional linear regression.

1.1. Transfer Learning in High-dimensional Linear Regression

Regression analysis is one of the most widely used statistical methods to understand the association of an outcome with a set of covariates. In many modern applications, the dimension of the covariates is usually very high as compared to the sample size. Typical examples include genome-wide association and gene expression studies. In this paper, we consider transfer learning in high-dimensional linear models. Formally, the target model can be written as

y_{i}^{(0)} = {(x_{i}^{(0)})}^{⊤} β + ϵ_{i}^{(0)}, i = 1, \dots, n_{0},

(1)

where $({(x_{i}^{(0)})}^{⊤}, y_{i}^{(0)})$ , i = 1, . . . , n₀, are independent samples, $β \in ℝ^{p}$ is the coefficient vector of interest, and $ϵ_{i}^{(0)}$ , i = 1, . . . , n₀ are independently distributed random noises with $E [ϵ_{i}^{(0)} ∣ x_{i}^{(0)}] = 0$ . In the high-dimensional regime, where p can be larger and much larger than n₀, β is often assumed to be sparse such that the number of nonzero elements of β, denoted by s, is much smaller than p.

In the context of transfer learning, we observe additional samples from K auxiliary studies, That is, we observe $({(x_{i}^{(k)})}^{⊤}, y_{i}^{(k)})$ generated from the auxiliary model

y_{i}^{(k)} = {(x_{i}^{(k)})}^{⊤} w^{(k)} + ϵ_{i}^{(k)}, i = 1, \dots, n_{k}, k = 1, \dots, K,

(2)

where $w^{(k)} \in ℝ^{p}$ is the regression vector for the k-th study, and $ϵ_{i}^{(k)}$ is the random noise such that $E [ϵ_{i}^{(k)} ∣ x_{i}^{(k)}] = 0$ . The regression coefficients w^(k) are unknown and different from our target β in general. The number of auxiliary studies, K, is allowed to grow but practically K may not be too large. We will study the estimation and prediction of target model (1) utilizing the primary data $({(x_{i}^{(0)})}^{⊤}, y_{i}^{(0)})$ , i = 1, . . . , n₀, as well as the data from K auxiliary studies $({(x_{i}^{(k)})}^{⊤}, y_{i}^{(k)})$ , i = 1, . . . , n_k, k = 1, . . . ,K.

If an auxiliary model is “similar” to the target model, we say that this auxiliary sample/study is informative. In this work, we characterize the informative level of the k-th auxiliary study using the sparsity of the difference between w^(k) and β. Let δ^(k) = β − w^(k) denote the contrast between w^(k) and β. The set of informative auxiliary samples are those whose contrasts are sufficiently sparse:

A_{q} = {1 \leq k \leq K : {‖ δ^{(k)} ‖}_{q} \leq h},

(3)

for some q ∈ [0, 1]. The set $A_{q}$ contains the auxiliary studies whose contrast vectors have ℓ_q-sparsity at most h and is called the informative set. It will be seen later that as long as h is relatively small compared to the sparsity of β, the studies in $A_{q}$ can be useful in improving the prediction and estimation of β. In the case of q = 0, the set $A_{q}$ corresponds to the auxiliary samples whose contrast vectors have at most h nonzero elements. We also consider approximate sparsity constraints (q ∈ (0, 1]), which allows all of the coefficients to be nonzero but their magnitude decays at a relatively rapid rate. For any q ∈ [0, 1], smaller h implies that the auxiliary samples in $A_{q}$ are more informative; larger cardinality of $A_{q} (| A_{q} |)$ implies that a larger number of informative auxiliary samples. Therefore, smaller h and larger $| A_{q} |$ should be favorable. We allow $A_{q}$ to be empty in which case none of the auxiliary samples is informative. For the auxiliary samples outside of $A_{q}$ , we do not assume sparse δ^(k) and hence w^(k) can be very different from β for $k \notin A_{q}$ .

In polygenic risk score (PRS) prediction and gene-expression partial-correlation analysis, this similarity characterization of two different high dimensional regression models is motivated by commonly adopted assumptions. In PRS prediction, for example, high-dimensional sparse regression models are commonly assumed (Mak et al., 2017). In addition, it has been observed that many complex traits have a shared genetic etiology, including various autoimmune diseases (Li et al., 2015; Zhernakova et al., 2009) and psychiatric disorders (Lee et al., 2013; Cross-Disorder Group of the Psychiatric Genomics Consortium, 2019). The similarity characterization we proposed captures the sparse nature of genome-wide association data and shared genetic etiology of multiple genetically related traits. In the gene expression data analysis, one is interested in understanding how a set of genes regulate another gene based on data measured in different tissues. Such an analysis provides useful insights into gene regulatory networks, which are often sparse. In addition, many tissues have shared regulatory relationships among the genes (Pierson et al., 2015; Fagny et al., 2017). In such applications, we also expect sparse and similar regression coefficients for the models assumed for different tissues.

There is a paucity of methods and fundamental theoretical results for high-dimensional linear regression in the transfer learning setting. In the case where the set of informative auxiliary samples $A_{q}$ is known, there is a lack of rate optimal estimation and prediction methods. A closely related topic is multi-task learning (Ando and Zhang, 2005; Lounici et al., 2009; Agarwal et al., 2012), where the goal is to estimate multiple models simultaneously. The multi-task learning considered in Lounici et al. (2009) estimates multiple high-dimensional sparse linear models under the assumption that the supports of all the regression coefficients are the same. In multi-task learning, different regularization formats have been considered to model the similarity among different studies (Chen et al., 2010; Danaher et al., 2014; Dondelinger et al., 2020).

The goal of transfer learning is however different, as one is only interested in estimating the target model and this remains to be a largely unsolved problem. Cai and Wei (2021) studied the minimax and adaptive methods for nonparametric classification in the transfer learning setting under the assumption that all the auxiliary samples are similar to the target distribution (Cai and Wei, 2021, Definition 5). In a more challenging setting where the set $A_{q}$ is unknown as is typical in real applications, it is unclear how to avoid the effects of adversarial auxiliary samples. Bastani (2020) studied estimation and prediction in high-dimensional linear models with one informative auxiliary study and q = 1, where the sample size of the auxiliary study is larger than the number of covariates. The current work considers more general scenarios under weaker assumptions. Specifically, the sample size of auxiliary samples can be smaller than the number of covariates and some auxiliary studies can be non-informative, which is more practical in applications. Additional challenges include the heterogeneity among the design matrices, which does not arise in the conventional high-dimensional regression problems and hence requires novel proposals.

The problem we study here is certainly related to the high-dimensional prediction and estimation in the conventional settings where only samples from the target model are available. Several penalized or constrained minimization methods have been proposed for prediction and estimation for high-dimensional linear regression; see, for example, Tibshirani (1996); Fan and Li (2001); Zou (2006); Candes and Tao (2007); Zhang (2010). The minimax optimal rates for estimation and prediction are studied in Raskutti et al. (2011) and Verzelen (2012).

1.2. Our Contributions

In the setting where the informative set $A_{q}$ is known, we propose a transfer learning algorithm, called Oracle Trans-Lasso, for estimation of the target regression vector and prediction and prove its minimax optimality under mild conditions. The results demonstrate a faster rate of convergence when $A_{q}$ is non-empty and h is sufficiently smaller than s, in which case the knowledge from the informative auxiliary samples can be optimally transferred to substantially improve estimation and prediction of the regression problem under the target model.

In the more challenging setting where $A_{q}$ is unknown a priori, we introduce a data-driven algorithm, called Trans-Lasso, to adapt to the unknown $A_{q}$ . The adaption is achieved by aggregating a number of candidate estimators. The desirable properties of the aggregation methods guarantee that the Trans-Lasso does not perform much worse than the best one among the candidate estimators. We construct the candidate estimators and demonstrate the robustness and the efficiency of Trans-Lasso under mild conditions. In terms of robustness, the Trans-Lasso is guaranteed to be not much worse than the Lasso estimator using only the primary samples no matter how adversarial the auxiliary samples are. In terms of efficiency, the knowledge from a subset of the informative auxiliary samples can be transferred to the target problem under proper conditions. Furthermore, If the contrast vectors in the informative samples are sufficiently sparse, the Trans-Lasso estimator performs as if the informative set $A_{q}$ is known.

When the distributions of the design matrices are distinct in different samples, the effect of heterogeneous designs in transfer learning is studied. The performance of the proposed algorithm is investigated theoretically and numerically in various settings.

1.3. Organization and Notation

The rest of this paper is organized as follows. Section 2 focuses on the setting where the informative set $A_{q}$ is known and with the sparsity in (3) measured in ℓ₁-norm. A transfer learning algorithm is proposed for estimation and prediction of the target parameter and its minimax optimality is established. In Section 3, we study the estimation and prediction of the target model when $A_{q}$ is unknown for q = 1. In Section 4, we justify the theoretical performance of our proposals under heterogeneous designs. In Section 5, the numerical performance of the proposed methods is studied in various settings. In Section 6, the proposed algorithms are applied to the GTEx data to investigate the association of one gene with other genes in a target tissue by leveraging data measured on other related tissues or cell types. The proofs and results for ℓ_q-sparse contrasts with q ∈ [0, 1) are provided in the supplementary materials (Li et al., 2020).

We finish this section with notation. Let $X^{(0)} \in ℝ^{n_{0} \times p}$ and $y^{(0)} \in ℝ^{n_{0}}$ denote the design matrix and the response vector for the primary data, respectively. Let $X^{(k)} \in ℝ^{n_{k} \times p}$ and $y^{(k)} \in ℝ^{n_{k}}$ denote the design matrix and the response vector for the k-th auxiliary data, respectively. For a class of matrices $R_{l} \in ℝ^{n_{l} \times p_{0}}$ , $l \in L$ , we use ${R_{l}}_{l \in L}$ to denote R_l, $l \in L$ . Let $n_{A_{q}} = \sum_{k \in A_{q}} n_{k}$ . For a generic semi-positive definite matrix $Σ \in ℝ^{m \times m}$ , let Λ_max(Σ) and Λ_min(Σ) denote the largest and smallest eigenvalues of Σ, respectively. Let Tr(Σ) denote the trace of Σ. Let e_j be a vector such that its j-th element is 1 and all other elements are zero. Let a∨b denote max{a, b} and a∧b denote min{a, b}. We use c, c₀, c₁, . . . to denote generic constants which can be different in different statements. Let a_n = O(b_n) and a_n ≲. b_n denote |a_n/b_n| ≤ c for some constant c when n is large enough. Let a_n ≍ b_n denote |a_n/b_n| → c for some constant c as n → ∞. Let a_n = O_P (b_n) and $a_{n} ≲_{ℙ} b_{n}$ denote $ℙ (| a_{n} / b_{n} | \leq c) \to 1$ for some constant c < ∞. Let a_n = o_P (b_n) denote $ℙ (| a_{n} / b_{n} | > c) \to 0$ for any constant c > 0.

2. Estimation with Known Informative Auxiliary Samples

We consider in this section transfer learning for high-dimensional linear regression when the informative set $A_{q}$ is known. The focus is on the ℓ₁-sparse characterization of the contrast vectors. The notation $A_{1}$ will be abbreviated as $A$ in the sequel without special emphasis. Section C in the supplementary materials generalizes the sparse contrasts from ℓ₁-constraint to ℓ_q-constraint for q ∈ [0, 1) and presents a rate-optimal estimator in this setting.

2.1. Oracle Trans-Lasso Algorithm

We propose a transfer learning algorithm, called Oracle Trans-Lasso, for estimation and prediction when $A$ is known. As an overview, we first compute an initial estimator using all the informative auxiliary samples. However, its probabilistic limit is biased from β as w^(k) ≠ β in general. We then correct its bias using the primary data in the second step. Algorithm 1 formally presents our proposed Oracle Trans-Lasso algorithm.

Algorithm 1:

Oracle Trans-Lasso algorithm

Input : Primary data (X⁽⁰⁾, y⁽⁰⁾) and informative auxiliary samples

{X^{(k)}, y^{(k)}}_{k \in A}

Output:

\hat{β}

Step 1. Compute

{\hat{w}}^{A} = \underset{w \in ℝ^{p}}{\arg \min} {\frac{1}{2 n_{A}} \sum_{k \in A} {‖ y^{(k)} - X^{(k)} w ‖}_{2}^{2} + λ_{w} ‖ w ‖_{1}}

(4)

for

λ_{w} = c_{1} \sqrt{\log p / n_{A}}

with some constant c₁.

Step 2. Let

\hat{β} = {\hat{w}}^{A} + {\hat{δ}}^{A},

(5)

where

{\hat{δ}}^{A} = \underset{δ \in ℝ^{p}}{\arg \min} {\frac{1}{2 n_{0}} {‖ y^{(0)} - X^{(0)} ({\hat{w}}^{A} + δ) ‖}_{2}^{2} + λ_{δ} ‖ δ ‖_{1}}

(6)

for

λ_{δ} = c_{2} \sqrt{\log p / n_{0}}

with some constant c₂.

Open in a new tab

In Step 1, ${\hat{w}}^{A}$ is realized based on the Lasso (Tibshirani, 1996) using all the informative auxiliary samples. Its probabilistic limit is $w^{A}$ , which can be defined via the following moment condition

E [\sum_{k \in A} {(X^{(k)})}^{⊤} (y^{(k)} - X^{(k)} w^{A})] = 0.

Denoting $E [x_{i}^{(k)} {(x_{i}^{(k)})}^{⊤}] = Σ^{(k)}$ , $w^{A}$ has the following explicit form:

w^{A} = β + δ^{A}

(7)

for $δ^{A} = \sum_{k \in A} α_{k} δ^{(k)}$ and $α_{k} = n_{k} / n_{A}$ given that Σ^(k) = Σ⁽⁰⁾ for all $k \in A$ . That is, the probabilistic limit of ${\hat{w}}^{A}$ , $w^{A}$ , has bias $δ^{A}$ , which is a weighted average of δ^(k). Step 1 is related to the approach for high-dimensional misspecified models (Bühlmann and van de Geer, 2015) and moment estimators. The estimator ${\hat{w}}^{A}$ converges relatively fast as the sample size used in Step 1 is relatively large. Step 2 corrects the bias, $δ^{A}$ , using the primary samples. In fact, $δ^{A}$ is a sparse high-dimensional vector whose ℓ₁-norm is no larger than h. Hence, the error of step 2 is under control for a relatively small h. The choice of the tuning parameters λ_w and λ_δ will be further specified in Theorem 1.

We compare the proposed Oracle Trans-Lasso method to the multi-task regression methods, say Section 3.4.3 of Agarwal et al. (2012) and Danaher et al. (2014). The Oracle Trans-Lasso does not penalize the differences among the regression coefficients in the auxiliary studies. This is again because the focus of transfer learning is only the target study. Theoretically, extra penalization terms and the joint analysis of multiple estimators may not help improve the estimation accuracy of the parameter of interest.

2.2. Theoretical Properties of Oracle Trans-Lasso

Formally, the parameter space we consider can be written as

Θ_{q} (s, h) = {B = (β, δ^{(1)}, \dots, δ^{(K)}) : ‖ β ‖_{0} \leq s, \max_{k \in A_{q}} {‖ δ^{(k)} ‖}_{q} \leq h}

(8)

for $A_{q} \subseteq {1, \dots, K}$ and q ∈ [0, 1]. We study the rate of convergence for the Oracle Trans-Lasso algorithm under the following two conditions.

Condition 1.

For each $k \in A \cup {0}$ , each row of X^(k) is i.i.d. Gaussian distributed with mean zero and covariance matrix Σ. The smallest and largest eigenvalues of Σ are bounded away from zero and infinity, respectively.

Condition 2.

For each $k \in A \cup {0}$ , $E [{(y_{i}^{(k)})}^{2}]$ is finite and the random noises $ϵ_{i}^{(k)}$ are i.i.d. sub-Gaussian with mean zero and variance $σ_{k}^{2}$ . For some constant C₀, it holds that $\max_{k \in A \cup {0}} E [\exp {t ϵ_{i}^{(k)}}] \leq \exp {t^{2} C_{0}}$ for all $t \in ℝ$ .

Condition 1 assumes Gaussian designs, which provides convenience for bounding the restricted eigenvalues of sample covariance matrices. Moreover, the designs are identically distributed for $k \in A \cup {0}$ . This assumption simplifies some technical conditions and will be relaxed in Section 4. We mention that the conditions on the eigenvalues of Σ can be replaced with some eigenvalue conditions restricted to a convex cone. Condition 2 assumes sub-Gaussian random noises for primary and informative auxiliary samples and the second moment of the response vector is finite. Conditions 1 and 2 make no assumptions on the non-informative auxiliary samples as they are not used in the Oracle Trans-Lasso algorithm. In the next theorem, we prove the convergence rate of the Oracle Trans-Lasso. Let $η_{h} = h \sqrt{\log p / n_{0}} \land h^{2}$ .

Theorem 1 (Convergence Rate of Oracle Trans-Lasso).

Assume that Condition 1 and Condition 2 hold true. Suppose that $A$ is known with $h ≲ s \sqrt{\log p / n_{0}}$ and $n_{0} ≲ n_{A}$ . We take $λ_{w} = \max_{k \in A} c_{1} \sqrt{E [{(y_{i}^{(k)})}^{2}] \log p / n_{A}}$ and $λ_{δ} = c_{2} \sqrt{\log p / n_{0}}$ for some sufficiently large constants c₁ and c₂. If $s \log p / n_{A} + h {(\log p / n_{0})}^{1 / 2} = o (1)$ , then there exists some positive constant c₁ such that

\inf_{B \in Θ_{1} (s, h)} ℙ (\frac{1}{n_{0}} {‖ X^{(0)} (\hat{β} - β) ‖}_{2}^{2} \lor ‖ \hat{β} - β ‖_{2}^{2} ≲ \frac{s \log p}{n_{A} + n_{0}} + \frac{s \log p}{n_{0}} \land η_{h}) \geq 1 - \exp (- c_{1} \log p) .

(9)

where B = {β,w⁽¹⁾, . . . ,w^(k)} denotes all the unknown parameters. Theorem 1 provides the convergence rate of $\hat{β}$ for any true parameters in Θ₁(s, h) when an informative set $A$ is known. We illustrate Theorem 1 by contrasting to the estimation results of the Lasso. First, the results of Theorem 1 hold under a weaker condition on s, i.e., $s \log p = O (n_{A})$ when $n_{A} ≳ n_{0}$ , while s log p = o(n₀) is always assumed in the single-task regression. Hence, the Oracle Trans-Lasso can deal with more challenging scenarios with less sparse target parameter. Second, the right-hand side of (9) is sharper than the convergence rate of Lasso, s log p/n₀, if $h ≪ s \sqrt{\log p / n_{0}}$ and $n_{A} ≫ n_{0}$ . That is, if the informative auxiliary samples have contrast vectors sufficiently sparser than β and the total sample size is significantly larger than the primary sample size, then the knowledge from the auxiliary samples can significantly improve the learning performance of the target model. The condition for improvement, $h ≪ s \sqrt{\log p / n_{0}}$ , allows a wide range of h. For example, the typical regime for single-task regression is s log p/n₀ = O(1) and it implies that $s \sqrt{\log p / n_{0}}$ can be as large as $\sqrt{n_{0} / \log p}$ . Hence, the condition for improvement of Theorem 1 allows h to be as large as $\sqrt{n_{0} / \log p}$ . Larger the s, weaker the condition for improvement.

The sample size requirement in Theorem 1 guarantees the lower restricted eigenvalues of the sample covariance matrices in use are bounded away from zero with high probability. The proof of Theorem 1 involves an error analysis of ${\hat{w}}^{A}$ and that of ${\hat{δ}}^{A}$ . While $w^{A}$ may be neither ℓ₀-nor ℓ₁-sparse, it can be decomposed into an ℓ₀-sparse component plus an ℓ₁-sparse component as illustrated in (7). Exploiting this sparse structure is a key step in proving Theorem 1. Regarding the choice of tuning parameters, λ_w depends on the second moment of $y_{i}^{(k)}$ , which can be consistently estimated by ${‖ y^{(k)} ‖}_{2}^{2} / n_{k}$ . The other tuning parameter λ_δ depends on the noise levels, which can be estimated by the scaled Lasso (Sun and Zhang, 2012). In practice, cross validation can be performed for selecting tuning parameters.

We now establish the minimax lower bound for estimating β in the transfer learning setup, which shows the minimax optimality of the Oracle Trans-Lasso algorithm in Θ₁(s, h).

Theorem 2 (Minimax lower bound for q = 1).

Assume Condition 1 and Condition 2. If $\max {s \log p / (n_{A} + n_{0}), h {(\log p / n_{0})}^{1 / 2}} = o (1)$ , then

\inf_{\hat{β}} \sup_{B \in Θ_{1} (s, h)} ℙ (‖ \hat{β} - β ‖_{2}^{2} \geq c_{1} \frac{s \log p}{n_{A} + n_{0}} + c_{2} \frac{s \log p}{n_{0}} \land η_{h}) \geq \frac{1}{2}

for some positive constants c₁ and c₂.

Theorem 2 implies that $\hat{β}$ obtained by the Oracle Trans-Lasso algorithm is minimax rate optimal in Θ₁(s, h) under the conditions of Theorem 1. To understand the lower bound, the term $s \log p / (n_{A} + n_{0})$ is the optimal convergence rate when w^(k) = β for all $k \in A$ . This is an extremely ideal case where we have $n_{A} + n_{0}$ i.i.d. samples from the target model. The second term in the lower bound is the optimal convergence rate when w^(k) = 0 for all $k \in A$ , i.e., the auxiliary samples are not helpful at all. Let $B_{q} (r) = {u \in ℝ^{p} : ‖ u ‖_{q} \leq r}$ denote the ℓ_q-ball with radius r centered at zero. In this case, the definition of Θ₁(s, h) implies that $β \in B_{0} (s) \cap B_{1} (h)$ and the second term in the lower bound is indeed the minimax optimal rate for estimation when $β \in B_{0} (s) \cap B_{1} (h)$ with n₀ i.i.d. samples (Tsybakov, 2014).

3. Unknown Set of Informative Auxiliary Samples

The Oracle Trans-Lasso algorithm is based on the knowledge of the informative set $A$ . In some applications, the informative set $A$ is not given, which makes the transfer learning problem more challenging. In this section, we propose a data-driven method for estimation and prediction when $A$ is unknown. The proposed algorithm is described in detail in Sections 3.1 and 3.2. Its theoretical properties are studied in Section 3.3.

3.1. The Trans-Lasso Algorithm

Our proposed algorithm, called Trans-Lasso, consists of two main steps. First, we construct a collection of candidate estimators, each of which is based on an estimate of $A$ . Second, we perform an aggregation step (Rigollet and Tsybakov, 2011; Dai et al., 2012, 2018) on these candidate estimators. Under proper conditions, the aggregated estimator is guaranteed to be not much worse than the best candidate estimator under consideration in terms of prediction. For technical reasons, we need the candidate estimators and the samples for aggregation to be independent. Hence, we start with sample splitting. We need some more notation. For a generic estimate of β, b, denote its sum of squared prediction error as

\hat{Q} (I, b) = \sum_{i \in I} ‖ y_{i}^{(0)} - {(x_{i 2}^{(0)})}^{⊤} b ‖_{2}^{2},

where $I$ is a subset of {1, . . . , n₀}. Let $Λ^{L + 1} = {ν \in ℝ^{L + 1} : ν_{l} \geq 0, \sum_{l = 0}^{L} ν_{l} = 1}$ denote an L-dimensional simplex. The Trans-Lasso algorithm is presented in Algorithm 2.

As an illustration, steps 2 and 3 of the Trans-Lasso algorithm construct some initial estimates of β, $\hat{β} ({\hat{G}}_{l})$ . They are computed using the Oracle Trans-Lasso algorithm by treating each ${\hat{G}}_{l}$ as the set of informative auxiliary samples. We construct ${\hat{G}}_{l}$ to be some estimates of $A$ using the procedure provided in Section 3.2. Step 4 is based on the Q-aggregation proposed in Dai et al. (2012) with a uniform prior, a Kullback–Leibler penalty, and a simplified tuning parameter. The Q-aggregation can be viewed as a weighted version of least square aggregation and exponential aggregation (Rigollet and Tsybakov, 2011) and it has been shown to be rate optimal both in expectation and with high probability for model selection aggregation problems.

Algorithm 2:

Trans-Lasso Algorithm

Input : Primary data (X⁽⁰⁾, y⁽⁰⁾) and samples from K auxiliary studies

{X^{(k)}, y^{(k)}}_{k = 1}^{K}

Output:

{\hat{β}}^{\hat{θ}}

Step 1. Let

I

be a random subset of {1, . . ., n₀} such that

| I | \approx c_{0} n_{0}

with some constant 0 < c₀ < 1. Let

I^{c} = {1, \dots, n_{0}} \ I

Step 2. Construct L + 1 candidate sets of f

A

{{\hat{G}}_{0}, {\hat{G}}_{1}, \dots, {\hat{G}}_{L}}

such that

{\hat{G}}_{0} = \emptyset

and

{\hat{G}}_{1}, \dots, {\hat{G}}_{L}

are based on (14) using

(X_{I, .}^{(0)}, y_{I}^{(0)})

and

{X^{(k)}, y^{(k)}}_{k = 1}^{K}

Step 3. For each 0 ≤ l ≤ L, run the Oracle Trans-Lasso algorithm with primary sample

(X_{I, .}^{(0)}, y_{I}^{(0)})

and auxiliary samples

(X_{I, .}^{(0)}, y_{I}^{(0)})

and auxiliary samples

{X^{(k)}, y^{(k)}}_{k \in {\hat{G}}_{l}}

. Denote the output as

\hat{β} ({\hat{G}}_{l})

for 0 ≤ l ≤ L.

Step 4. Compute

\hat{θ} = \underset{θ \in Λ^{L + 1}}{\arg \min} {\hat{Q} (I^{c}, \sum_{l = 0}^{L} \hat{β} ({\hat{G}}_{l}) θ_{l}) + \sum_{l = 0}^{L} θ_{l} \hat{Q} (I^{c}, \hat{β} ({\hat{G}}_{l})) + \frac{2 λ_{θ}}{n_{0}} \sum_{l = 0}^{L} θ_{l} \log (θ_{l})}

(10)

for some λ_θ > 0. Output

{\hat{β}}^{\hat{θ}} = \sum_{l = 0}^{L} {\hat{θ}}_{l} \hat{β} ({\hat{G}}_{l}) .

(11)

Open in a new tab

Model selection aggregation is an effective method for the transfer learning task under consideration. On one hand, it guarantees the robustness of Trans-Lasso in the following sense. Notice that $\hat{β} ({\hat{G}}_{0})$ corresponds to the single-task Lasso estimator and it is always included in our dictionary. The purpose is that, invoking the property of model selection aggregation, the performance of ${\hat{β}}^{\hat{θ}}$ is guaranteed to be not much worse than the performance of the original Lasso estimator under mild conditions. This shows that the performance of Trans-Lasso will not be ruined by adversarial auxiliary samples. Formal statements are provided in Section 3.3. On the other hand, the gain of Trans-Lasso relates to the qualities of ${\hat{G}}_{1}, \dots, {\hat{G}}_{L}$ . If

ℙ ({\hat{G}}_{l} \subseteq A, for some 1 \leq l \leq L) \to 1,

(12)

i.e., ${\hat{G}}_{l}$ is a non-empty subset of the informative set $A$ , then the model selection aggregation property implies that the performance of ${\hat{β}}^{\hat{θ}}$ is not much worse than the performance of the Oracle Trans-Lasso with $\sum_{k \in {\hat{G}}_{l}} n_{k}$ informative auxiliary samples. Ideally, one would like to achieve ${\hat{G}}_{l} = A$ for some 1 ≤ l ≤ L with high probability. However, it can rely on strong assumptions that may not be guaranteed in practical situations.

To motivate our constructions of ${\hat{G}}_{l}$ , let us first point out a naive construction of candidate sets, which consists of 2^K candidates. These candidates are all different combinations of {1, . . . ,K}, denoted by ${\hat{G}}_{1}, \dots, {\hat{G}}_{2^{K}}$ . It is obvious that $A$ is an element of these candidate sets. However, the number of candidates is too large and it can be computationally burdensome. Furthermore, the cost of aggregation can be significantly high, which is of order K/n₀ as will be seen in Lemma 1. In contrast, we would like to pursue a much smaller number of candidate sets such that the cost of aggregation is almost negligible and (12) can be achieved under mild conditions. We introduce our proposed construction of candidate sets in the next subsection.

3.2. Constructing the Candidate Sets for Aggregation

As illustrated in Section 3.1, the goal of Step 2 is to have a class of candidate sets, ${{\hat{G}}_{0}, \dots, {\hat{G}}_{L}}$ , that satisfy (12) under certain conditions. Our idea is to exploit the sparsity patterns of the contrast vectors. Recall that the definition of $A$ implies that ${δ^{(k)}}_{k \in A}$ are sparser than ${δ^{(k)}}_{k \in A^{c}}$ , where $A^{c} = {1, \dots, K} \ A$ . This property motivates us to find a sparsity index R^(k) and its estimator ${\hat{R}}^{(k)}$ for each 1 ≤ k ≤ K such that

\max_{k \in A^{o}} R^{(k)} < \min_{k \in A^{c}} R^{(k)} and ℙ (\max_{k \in A^{o}} {\hat{R}}^{(k)} < \min_{k \in A^{c}} {\hat{R}}^{(k)}) \to 1,

(13)

where $A^{o}$ is some subset of $A$ . In words, the sparsity indices in $A^{o}$ are no larger than the sparsity indices in $A^{c}$ and so are their estimators with high probability. To utilize (13), we can define the candidate sets as

{\hat{G}}_{l} = {1 \leq k \leq K : {\hat{R}}^{(k)} is among the first l smallest of all}

(14)

for 1 ≤ l ≤ K. That is, ${\hat{G}}_{l}$ is the set of auxiliary samples whose estimated sparsity indices are among the first l smallest. A direct consequence of (13) and (14) is that $ℙ ({\hat{G}}_{| A^{o} |} = A^{o}) \to 1$ and hence the desirable property (12) is satisfied. To achieve the largest gain with transfer learning, we would like to find proper sparsity indices such that (13) holds for $\sum_{k \in A_{1}^{°}} n_{k}$ as large as possible. Notice that ${\hat{G}}_{K + 1} = {1, \dots, K}$ is always included as candidates according to (14). Hence, in the special cases where all the auxiliary samples are informative or none of the auxiliary samples are informative, it holds that ${\hat{G}}_{| A |} = A$ and the Trans-Lasso is not much worse than the Oracle Trans-Lasso. The more challenging cases are $0 < | A | < K$ .

As ${δ^{(k)}}_{k \in A^{c}}$ are not necessarily sparse, the estimation of δ^(k) or functions of δ^(k), 1 ≤ k ≤ K, is not trivial. As an example, an intuitive sparsity index can be ∥δ^(k)∥₁ and its estimate is ${‖ \hat{β} ({\hat{G}}_{0}) - {\hat{w}}^{(k)} ‖}_{1}$ , where ${\hat{w}}^{(k)}$ is the Lasso estimate of w^(k) based on the k-th study. However, such a Lasso-based estimate is not guaranteed to converge to the oracle ∥δ^(k)∥₁ when δ^(k) is non-sparse. Therefore, we consider using $R^{(k)} = {‖ \sum δ^{(k)} ‖}_{2}^{2}$ , which is a function of the population-level marginal statistics, as the oracle sparsity index for k-th auxiliary sample. The advantage of R^(k) is that it has a natural unbiased estimate even when δ^(k) is non-sparse. Let us relate R^(k) to the sparsity of δ^(k) using a Bayesian characterization of sparse vectors assuming Σ^(k) = Σ for all 0 ≤ k ≤ K. If $δ_{j}^{(k)}$ are i.i.d. Laplacian distributed with mean zero and variance $ν_{k}^{2}$ for each k, then it follows from the properties of Laplacian distribution (Liu and Kozubowski, 2015) that $E [{‖ δ^{(k)} ‖}_{1}] ≍ E^{1 / 2} [{‖ \sum δ^{(k)} ‖}_{2}^{2}]$ . Hence, the rank of $E [{‖ Σ δ^{(k)} ‖}_{2}^{2}]$ is the same as the rank of $E [{‖ δ^{(k)} ‖}_{1}]$ . As $\max_{k \in A} {‖ δ^{(k)} ‖}_{1} < \min_{k \in A^{c}} {‖ δ^{(k)} ‖}_{1}$ , it is reasonable to expect $\max_{k \in A} {‖ \sum δ^{(k)} ‖}_{2}^{2} < \min_{k \in A^{c}} {‖ \sum δ^{(k)} ‖}_{2}^{2}$ . The above derivation holds for many other zero mean prior distributions besides Laplacian. This illustrates our motivation for considering R^(k) as the oracle sparsity index.

We next introduce the estimated version, ${\hat{R}}^{(k)}$ , based on the primary data ${{(x_{i}^{(0)})}^{⊤}, y_{i}^{(0)}}_{i \in I}$ (after sample splitting) and auxiliary samples ${X^{(k)}, y^{(k)}}_{k = 1}^{K}$ . We first perform a SURE screening (Fan and Lv, 2008) on the marginal statistics to reduce the effects of random noises. We summarize our proposal for Step 2 of the Trans-Lasso as follows (Algorithm 3). Let $n_{*} = \min_{0 \leq k \leq K} n_{k}$ .

Algorithm 3:

Step 2 of the Trans-Lasso Algorithm

Step 2.1. For 1 ≤ k ≤ K, compute the marginal statistics

{\hat{Δ}}^{(k)} = \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} x_{i}^{(k)} y_{i}^{(k)} - \frac{1}{| I |} \sum_{i \in I} x_{i}^{(0)} y_{i}^{(0)} .

(15)

For each k ∈ {1, . . ., K}, let

{\hat{T}}_{k}

be obtained by SURE screening such that

{\hat{T}}_{k} = {1 \leq j \leq p : | {\hat{Δ}}_{j}^{(k)} | is among the first t_{*} largest of all}

for a fixed

t_{*} = n_{*}^{α}

, 0 ≤ α < 1.

Step 2.2. Define the estimated sparse index for the k-th auxiliary sample as

{\hat{R}}^{(k)} = {‖ {\hat{Δ}}_{{\hat{T}}_{k}}^{(k)} ‖}_{2}^{2} .

(16)

Step 2.3. Compute

{\hat{G}}_{l}

as in (14) for l = 1, . . ., L.

Open in a new tab

One can see that ${\hat{Δ}}^{(k)}$ are empirical marginal statistics such that $E [{\hat{Δ}}^{(k)}] = Σ δ^{(k)}$ for $k \in A$ . The set ${\hat{T}}_{k}$ is the set of first t_* largest marginal statistics for the k-th sample. The purpose of screening the marginal statistics is to reduce the magnitude of noise. Notice that the un-screened version ${‖ {\hat{Δ}}^{(k)} ‖}_{2}^{2}$ is a sum of p random variables and it contains noise of order p/(n_k ∧ n₀), which diverges fast as p is much larger than the sample sizes. By screening with t_* of order $n_{*}^{α}$ , α < 1, the errors induced by the random noises is under control. In practice, the auxiliary samples with very small sample sizes can be removed from the analysis as their contributions to the target problem is mild. Desirable choices of ${\hat{T}}_{k}$ should keep the variation of Σδ^(k) as much as possible. Under proper conditions, SURE screening can consistently select a set of strong marginal statistics and hence is appropriate for the current purpose. In Step 2.2, we compute ${\hat{R}}^{(k)}$ based on the marginal statistics which are selected by SURE screening. In practice, different choices of t_* may lead to different realizations of ${\hat{G}}_{l}$ . One can compute multiple sets of ${{\hat{R}}^{(k)}}_{k = 1}^{K}$ with different t_* which give multiple sets of ${{\hat{G}}_{l}}_{l = 1}^{K}$ . It will be seen from Lemma 1 that a finite number of choices on t_* does not affect the rate of convergence.

3.3. Theoretical Properties of Trans-Lasso

In this subsection, we derive the theoretical guarantees for the Trans-Lasso algorithm. We first establish the model selection aggregation type of results for the Trans-Lasso estimator ${\hat{β}}^{\hat{θ}}$ .

Lemma 1 (Q-Aggregation for Trans-Lasso).

Assume that Condition 1 and Condition 2 hold true. Let $\hat{θ}$ be computed via (10) with $λ_{θ} \geq 4 σ_{0}^{2}$ . With probability at least 1 − t, it holds that

\frac{1}{| I^{c} |} {‖ X_{I^{c}, .}^{(0)}, {\hat{β}}^{\hat{θ}} - β ‖}_{2}^{2} \leq \min_{0 \leq l \leq L} \frac{1}{| I^{c} |} {‖ X_{I^{c}, .}^{(0)}, (\hat{β} ({\hat{G}}_{l}) - β) ‖}_{2}^{2} + \frac{λ_{θ} \log (L / t)}{n_{0}} .

(17)

If L ≤ c₁n₀ for some small enough constant c₁, then

‖ {\hat{β}}^{\hat{θ}} - β ‖_{2}^{2} ≲_{ℙ} \min_{0 \leq l \leq L} ‖ \hat{β} ({\hat{G}}_{l}) - β ‖_{2}^{2} + \frac{\log L}{n_{0}} .

(18)

Lemma 1 implies that the performance of ${\hat{β}}^{\hat{θ}}$ only depends on the best candidate regardless of the performance of other candidates under mild conditions. As commented before, this result guarantees the robustness and efficiency of Trans-Lasso, which can be formally stated as follows. As the original Lasso is always in our dictionary, (17) and (18) imply that ${\hat{β}}^{\hat{θ}}$ is not much worse than the Lasso in prediction and estimation. Formally, “not much worse” refers to the last term in (17), which can be viewed as the cost of “searching” for the best candidate model within the dictionary which is of order log L/n₀. This term is almost negligible, say, when L = O(K), which corresponds to our constructed candidate estimators. This demonstrates the robustness of ${\hat{β}}^{\hat{θ}}$ to adversarial auxiliary samples. Furthermore, if (12) holds, then the prediction and estimation errors of Trans-Lasso are comparable to the Oracle Trans-Lasso using the auxiliary samples in $A^{o}$ .

The prediction error bound in (17) follows from Corollary 3.1 in Dai et al. (2012). However, the aggregation methods do not have theoretical guarantees in estimation errors in general. Indeed, an estimator with ℓ₂-error guarantee is crucial for more challenging tasks, such as out-of-sample prediction and inference. For our transfer learning task, we show in (18) that the estimation error is of the same order if the cardinality of the dictionary is L ≤ cn₀ for some small enough c. For our constructed dictionary, it suffices to require K ≤ cn₀. In many practical applications, K is relatively small compared to the sample sizes and hence this assumption is not very restrictive.

In the following, we provide sufficient conditions such that the desirable property (13) holds with ${\hat{R}}^{(k)}$ defined in (16) and hence (12) is satisfied. For each $k \in A^{c}$ , define a set

H_{k} = {1 \leq j \leq p : | Σ_{j, .}^{(k)} w^{(k)} - Σ_{j, .}^{(0)} β | > n_{*}^{- κ}, κ < α / 2} .

(19)

Recall that α < 1 is defined such that t_* = n^α. In fact, H_k is the set of “strong” marginal statistics that can be consistently selected into ${\hat{T}}_{k}$ for each $k \in A^{c}$ . We see that $Σ_{j, .}^{(k)} w^{(k)} - Σ_{j, .}^{(0)} β = Σ_{j, .} δ^{(k)}$ if Σ^(k) = Σ⁽⁰⁾ for $k \in A^{c}$ . The definition of H_k in (19) allows for heterogeneous designs among non-informative auxiliary samples.

Condition 3.

For each $k \in A^{c}$ , each row of X^(k) is i.i.d. Gaussian with mean zero and covariance matrix Σ^(k) and $\max_{k \in A^{c}} Λ_{\max} (\sum^{(k)})$ is finite. For each $k \in A^{c}$ , the random noises $ϵ_{i}^{(k)}$ are i.i.d. Gaussian with mean zero and variance $σ_{k}^{2}$ and $E [{(y_{i}^{(k)})}^{2}]$ is finite.
It holds that $\log p \lor \log K \leq c_{1} \sqrt{n_{*}}$ for a small enough constant c₁. Moreover,
$\min_{k \in A^{c}} \sum_{j \in H_{k}} {| Σ_{j, .}^{(k)} w^{(k)} - Σ_{j, .}^{(0)} β |}^{2} \geq \frac{c_{2} \log p}{n_{*}^{1 - α}}$ (20)
for some constant c₂ > 0.

The Gaussian assumptions in Condition 3(a) guarantee the desirable properties of SURE screening for the non-informative auxiliary studies. In fact, the largest eigenvalue of Σ^(k), $k \in A^{c}$ can grow as $O (n_{*}^{τ})$ for some τ ≥ 0 and τ + α < 1 following the proof in Fan and Lv (2008). The Gaussian assumption can be relaxed to be sub-Gaussian random variables according to some recent studies (Ahmed and Bajwa, 2019). For the conciseness of the proof, we consider Gaussian distributed random variables with bounded eigenvalues. Condition 3(b) puts a constraint on the relative dimensions. It is trivial in the regime that $p \lor K \leq n_{*}^{ξ}$ for any finite ξ > 0. The expression (20) requires that for each $A^{c}$ , there exists a subset of strong marginal statistics with not-so-small cardinality. This condition is mild by choosing α such that $\log p ≪ n_{*}^{1 - α}$ and α = 1/2 is an obvious choice revoking the first part of Condition 3(b). For instance, if $\min_{k \in A^{c}} {‖ E [{\hat{Δ}}^{(k)}] ‖}_{\infty} \geq c_{0} > 0$ , then (20) holds with any α ≤ 1/2. In words, a sufficient condition for (20) is that at least one marginal statistic in the k-th study is of constant order for $k \in A^{c}$ . We see that larger n_* makes Condition 3 weaker. As mentioned before, it is helpful to remove the auxiliary samples with very small sample sizes from the analysis.

In the next theorem, we demonstrate the theoretical properties of ${\hat{R}}^{(k)}$ and provide a complete analysis of the Trans-Lasso algorithm. Let $A^{o}$ be a subset of $A$ such that

A^{o} = {k \in A : {‖ Σ^{(0)} δ^{(k)} ‖}_{2}^{2} \leq c_{1} \min_{k \in A^{c}} \sum_{j \in H_{k}} {| Σ_{j, .}^{(k)} w^{(k)} - Σ_{j, .}^{(0)} β |}^{2}}

for some small constant c₁ < 1 and H_k defined in (19). In general, one can see that the informative auxiliary samples with sparser δ^(k) are more likely to be included into $A^{o}$ . Specially, the fact that $\max_{k \in A} {‖ \sum^{(0)} δ^{(k)} ‖}_{2}^{2} \leq {‖ Σ^{(0)} ‖}_{2}^{2} h^{2}$ implies $A^{o} = A$ when h is sufficiently small. We will show (13) for such $A^{o}$ with ${\hat{R}}^{(k)}$ defined in (16). Let $n_{A^{o}} = \sum_{k \in A^{°}} n_{k}$ .

Theorem 3 (Convergence Rate of the Trans-Lasso).

Assume Conditions 1, 2, and 3. Then

ℙ (\max_{k \in A^{°}} {\hat{R}}^{(k)} < \min_{k \in A^{c}} {\hat{R}}^{(k)}) \to 1.

(21)

Let ${\hat{β}}^{\hat{θ}}$ be computed using the Trans-Lasso algorithm with $λ_{θ} \geq 4 σ_{0}^{2}$ . If $s \log p / (n_{A^{o}} + n_{0}) + {h {(\log p / n_{0})}^{1 / 2}} \land (s \log p / n_{0}) = o (1)$ and K ≤ cn₀ for a sufficiently small constant c > 0, then

\inf_{B \in Θ_{1} (s, h)} ℙ (\frac{1}{| I^{c} |} {‖ X_{I^{c}, .}^{(0)} ({\hat{β}}^{\hat{θ}} - β) ‖}_{2}^{2} \lor {‖ {\hat{β}}^{\hat{θ}} - β ‖}_{2}^{2} ≲ \frac{s \log p}{n_{A^{o}} + n_{0}} + \frac{s \log p}{n_{0}} \land η_{h} + \frac{\log K}{n_{0}}) \to 1

(22)

as $(n_{0}, n_{A^{o}}, p) \to \infty$ .

Remark 1.

Under the conditions of Theorem 3, if

‖ Σ^{(0)} ‖_{2}^{2} h^{2} \leq c \min_{k \in A^{c}} \sum_{j \in H_{k}} | Σ_{j, .}^{(k)} w^{(k)} - Σ_{j, .}^{(0)} β |^{2} for some c < 1,

then $ℙ (\max_{k \in A} {\hat{R}}^{(k)} < \min_{k \in A^{c}} {\hat{R}}^{(k)}) \to 1$ and as $(n_{0}, n_{A}, p) \to \infty,$

\inf_{B \in Θ_{1} (s, h)} ℙ (\frac{1}{| I^{c} |} {‖ X_{I^{c}, .}^{(0)} ({\hat{β}}^{\hat{θ}} - β) ‖}_{2}^{2} \lor {‖ {\hat{β}}^{\hat{θ}} - β ‖}_{2}^{2} ≲ \frac{s \log p}{n_{A} + n_{0}} + \frac{s \log p}{n_{0}} \land η_{h} + \frac{\log K}{n_{0}}) \to 1.

Theorem 3 establishes the convergence rate of the Trans-Lasso when $A$ is unknown. The result in (21) implies the estimated sparse indices in $A^{o}$ and in $A^{c}$ are separated with high probability. As illustrated before, a consequence of (21) is (12) for the candidate sets ${\hat{G}}_{l}$ defined in (14). Together with Theorem 1 and Lemma 1, we arrive at (22).

It is worth mentioning that Condition 3 is only employed to show the gain of Trans-Lasso. The robustness property of Trans-Lasso holds without any conditions on the non-informative samples (Lemma 1). In practice, missing a few informative auxiliary samples may not be a grave concern. One can see that when $n_{A^{o}}$ is large enough such that the first term on the right-hand side of (22) no longer dominates, increasing the number of auxiliary samples will not improve the convergence rate. In contrast, it is more important to guarantee that the estimator is not affected by the adversarial auxiliary samples. The empirical performance of Trans-Lasso is carefully studied in Section 5.

4. Extensions to Heterogeneous Designs

In this section, we extend the algorithms and theoretical results developed in Sections 2 and 3 to the case where the covariates have different covariance structures in different studies.

The Oracle Trans-Lasso algorithm proposed in Section 2 can be directly applied to the setting where the design matrices are moderately heterogeneous. Formally, we first introduce a relaxed version of Condition 1 as follows. Define

C_{Σ} = 1 + \max_{j \leq p} \max_{k \in A} {‖ e_{j}^{⊤} (Σ^{(k)} - Σ^{(0)}) {(\sum_{k \in A} α_{k} Σ^{(k)})}^{- 1} ‖}_{1},

which characterizes the differences between Σ^(k) and Σ⁽⁰⁾ for $k \in A$ . Notice that C_Σ is a constant if $\max_{1 \leq j \leq p} {‖ e_{j}^{⊤} (Σ^{(k)} - Σ^{(0)}) ‖}_{0} \leq C < \infty$ for all $k \in A$ , where examples include block diagonal Σ^(k) with constant block sizes or banded Σ^(k) with constant bandwidths for $k \in A$ .

Condition 4.

For each $k \in A \cup {0}$ , each row of X^(k) is i.i.d. Gaussian with mean zero and covariance matrix Σ^(k). The smallest eigenvalue of Σ^(k) are bounded away from zero for all $k \in A \cup {0}$ . The largest eigenvalue of Σ⁽⁰⁾ is bounded away from infinity.

The following theorem characterizes the rate of convergence of the Oracle Trans-Lasso estimator in terms of C_Σ. Let $η_{h, Σ} = (C_{Σ} h \sqrt{\log p / n_{0}}) \land (C_{Σ}^{2} h^{2})$ .

Theorem 4 (Oracle Trans-Lasso with heterogeneous designs).

Assume that Condition 2 and Condition 4 hold true. Suppose $A$ is known with $C_{Σ} h ≲ s \sqrt{\log p / n_{0}}$ and $n_{0} ≲ n_{A}$ . We take λ_w and λ_δ as in Theorem 1. If $s \log p / n_{A} + C_{Σ} h {(\log p / n_{0})}^{1 / 2} = o (1)$ , then

\inf_{B \in Θ_{1} (s, h)} ℙ (\frac{1}{n_{0}} {‖ X^{(0)} (\hat{β} - β) ‖}_{2}^{2} \lor ‖ \hat{β} - β ‖_{2}^{2} ≲ \frac{s \log p}{n_{A} + n_{0}} + \frac{s \log p}{n_{0}} \land η_{h, Σ}) \geq 1 - \exp (- c_{1} \log p) .

(23)

The right-hand side of (9) is sharper than s log p/n₀ if $n_{A} ≫ n_{0}$ and $C_{Σ} h \sqrt{\log p / n_{0}} ≪ s$ . We see that small C_Σ is favorable. This implies that the Oracle Trans-Lasso is guaranteed to perform well with sparse contrasts and similar covariance matrices to the primary one.

We now provide theoretical guarantees for the Trans-Lasso with heterogeneous designs when $A$ is unknown. In this case, the sparsity index R^(k) takes the format ${‖ Σ^{(k)} w^{(k)} - Σ^{(0)} β ‖}_{2}^{2}$ . It measures the sparsity of δ^(k) but also the covariance heterogeneity. We consider ${\tilde{A}}^{o}$ , a subset of $A$ such that

{\tilde{A}}^{o} = {k \in A : ‖ Σ^{(k)} w^{(k)} - Σ^{(0)} β ‖_{2}^{2} < c_{1} \min_{k \in A^{c}} \sum_{j \in H_{k}} {| Σ_{j, .}^{(k)} w^{(k)} - Σ_{j, .}^{(0)} β |}^{2}}

for some c₁ < 1 and H_k defined in (19). This is a generalization of $A^{°}$ to the case of heterogeneous designs.

Corollary 1 (Trans-Lasso with heterogeneous designs).

Assume Conditions 2, 3, and 4. Let ${\hat{β}}^{\hat{θ}}$ be computed via the Trans-Lasso algorithm with $λ_{θ} \geq 4 σ_{0}^{2}$ . If $s \log p / (n_{{\tilde{A}}^{o}} + n_{0}) + {C_{Σ} h {(\log p / n_{0})}^{1 / 2}} \land (s \log p / n_{0}) = o (1)$ and K ≤ cn₀ for a small enough constant c, then

\inf_{B \in Θ_{1} (s, h)} ℙ (\frac{1}{| I^{c} |} {‖ X_{I c, .,}^{(0)} ({\hat{β}}^{\hat{θ}} - β) ‖}_{2}^{2} \lor {‖ {\hat{β}}^{\hat{θ}} - β ‖}_{2}^{2} ≲ \frac{s \log p}{n_{{\tilde{A}}^{o}} + n_{0}} + \frac{s \log p}{n_{0}} \land η_{h, Σ} + \frac{\log K}{n_{0}}) \to 1

as $(n_{0}, n_{{\tilde{A}}^{o}}, p) \to \infty$ .

Corollary 1 provides an upper bound for the Trans-Lasso with heterogeneous designs. The numerical experiments for this setting are studied in Section 5.

5. Simulation Studies

In this section, we evaluate the empirical performance of the proposed methods and some other comparable methods in various numerical experiments. Specifically, we evaluate the performance of five methods, including Lasso, Oracle Trans-Lasso proposed in Section 2.1, Trans-Lasso proposed in Section 3.1, and two other ad hoc transfer learning methods related to ours. The first one implements Trans-Lasso except that the bias-correction step (Step 2) of the Oracle Trans-Lasso is omitted. We call this method the “aggregated Lasso” (Agg-Lasso), as it implements our proposed adaptive aggregation step and applies Lasso to each candidate set. The purpose is to understand the necessity of the bias-correction step in Oracle Trans-Lasso. The second one follows the steps of Trans-Lasso but uses a different aggregation step. Specifically, we consider ${\hat{R}}^{(k)} = {‖ {\hat{β}}^{L} - {\hat{w}}^{(k)} ‖}_{1}$ , k = 1, . . . ,K, where ${\hat{β}}^{L}$ and ${\hat{w}}^{(k)}$ are the Lasso estimators based on each of the corresponding studies. Moreover, the Q-aggregation step is replaced with the cross-validation, where we select the set ${\hat{G}}_{l}$ that minimizes the out-of-sample prediction errors. We call this algorithm “Ad hoc ℓ₁-transfer”. The purpose of including this method is to understand the performance of our proposed ${\hat{R}}^{(k)}$ based on SURE screening and Q-aggregation. In the Supplementary Materials, we report the performance of the estimated sparse indices ${\hat{R}}^{(k)}$ based on Trans-Lasso and Ad hoc ℓ₁-transfer. The R code for all the methods are available at https://github.com/saili0103/TransLasso.

5.1. Identity Covariance Matrix for the Designs

We consider p = 500, n₀ = 150, and n₁, . . . , n_K = 100 for K = 20. The covariates $x_{i}^{(k)}$ are i.i.d. Gaussian with mean zero and identity covariance matrix for all 0 ≤ k ≤ K and $ϵ_{i}^{(k)}$ are i.i.d. Gaussian with mean zero and variance one for all 0 ≤ k ≤ K. For the target parameter β, we set s = 16, β_j = 0.3 for j ∈ {1, . . . , s}, and β_j = 0 otherwise. For the regression coefficients in auxiliary samples, we consider two configurations.

For a given $A$ , if $k \in A$ , let
$w_{j}^{(k)} = β_{j} - 0.3 1 (j \in H_{k}),$
where H_k is a random subset of [p] with |H_k| = h ∈ {2, 6, 12}. If $k \notin A$ , we set H_k to be a random subset of [p] with |H_k| = 2s and $w_{j}^{(k)} = β_{j} - 0.5 1 (j \in H_{k})$ . We set $w_{1}^{(k)} = - 0.3$ for k = 1, . . . ,K.
For a given $A$ , if $k \in A$ , let H_k = {1, . . . , 100} and
$w_{j}^{(k)} = β_{j} + ξ_{j} 1 (k \in H_{k}), where ξ_{j} \sim_{i . i . d .} N (0, h / 100),$
where h ∈ {2, 6, 12} and N(a, b) is the normal with mean a and standard deviation b. If $k \notin A$ , we set H_k = {1, . . . , 100} and
$w_{j}^{(k)} = β_{j} + ξ_{j} 1 (j \in H_{k}), where ξ_{j} \sim_{i . i . d .} N (0, 2 s / 100) .$
We set $w_{1}^{(k)} = - 0.3$ for k = 1, . . . ,K. The setting (i) can be treated as either ℓ₀- or ℓ₁-sparse contrasts. In practice, the true parameters are unknown and we use $A$ to denote the set of auxiliary samples without distinguishing ℓ₀- or ℓ₁-sparsity. We consider $| A | \in {0, 4, 8, \dots, 20}$ .

In Figure 1, we report sum of squared estimation errors (SSE) for each estimator $| A |$ . Each point is summarized from 200 independent simulations. As expected, the performance of the Lasso does not change as $| A |$ increases. On the other hand, all four other transfer learning-based algorithms have estimation errors decreasing as $A$ increases. As h increases, the problem gets harder and the estimation errors of all four methods increase. In settings (i) and (ii), the Oracle Trans-Lasso has the smallest estimation errors in most settings. The proposed Trans-Lasso, which is agnostic to $A$ , is always the second-best. The gap between the Oracle Trans-Lasso and Trans-Lasso is a result of the uncertainty of aggregation and sample splitting for constructing the initial estimators. We also observe that when $A = \emptyset$ , the Trans-Lasso can have smaller errors than the oracle Trans-Lasso where the latter one does not use auxiliary information. This implies that some auxiliary information can still be borrowed. Due to the randomness of the parameter generation, our definition of $A$ may not always be the best subset of auxiliary samples that give the smallest estimation errors.

Among the two variants, Ad hoc ℓ₁-transfer is also adaptive but has slightly larger estimation errors than Trans-Lasso when h is large. This demonstrates the advantage of Q-aggregation with our proposed sparsity index over the cross-validation type of aggregation with ℓ₁-distance based sparsity index. The Agg-Lasso method has larger estimation errors than Trans-Lasso and Ad hoc ℓ₁-transfer, even when h is small. This demonstrates the necessity of the bias-correction step in the Oracle Trans-Lasso.

5.2. Homogeneous Designs among $A \cup {0}$

We now consider $x_{i}^{(k)}$ as i.i.d. Gaussian with mean zero and a equi-correlated covariance matrix, where Σ_j,j = 1 and Σ_j,k = 0.8 if j ≠ k for $k \in A \cup {0}$ . For $k \notin A \cup {0}$ , $x_{i}^{(k)}$ are i.i.d. Gaussian with mean zero and a Toeplitz covariance matrix whose first row is

Σ_{1, .}^{(k)} = (1, \underset{2 k - 1}{\underset{︸}{1 / (k + 1), \dots, 1 / (k + 1)}}, 0_{p - 2 k}) .

(24)

Other true parameters and the dimensions of the samples are set to be the same as in Section 5.1. From the results presented in Figure 2, we see that the Trans-Lasso and Oracle Trans-Lasso have reliable performance in the current setting. The average estimation errors are larger in Figure 2 than those in Section 5.1 as the covariates are highly correlated in the current setting. When h is relatively large, we see that Agg-Lasso and Ad hoc ℓ₁-transfer have significantly larger estimation errors than Trans-Lasso. This again demonstrates the advantage of Trans-Lasso over some ad hoc methods.

5.3. Heterogeneous Designs

We next consider a setting where Σ^(k) are distinct for k = 0, . . . ,K. Specifically, for k = 1, . . . ,K, let $x_{i}^{(k)}$ as i.i.d. Gaussian with mean zero and a Toeplitz covariance matrix whose first row is (24). Moreover, $Σ^{(0)} = I_{p}$ . Other parameters and the dimensions of the samples are set to be the same as in Section 5.1. Figure 3 shows that the general patterns observed under homogeneous designs still hold. Trans-Lasso still gives the best estimation performance under the heterogeneous designs as compared with alternative methods.

Figure 3. — Estimation errors of the Ad hoc ℓ₁-transfer, Agg-Lasso, Lasso, Oracle Trans-Lasso, and Trans-Lasso with heterogeneous covariance matrices. The two rows correspond to configurations (i) and (ii), respectively. The y-axis corresponds to $‖ b - β ‖_{2}^{2}$ for some estimator b.

6. Application to Genotype-Tissue Expression Data

In this section, we demonstrate the performance of our proposed transfer learning algorithm in analyzing the Genotype-Tissue Expression (GTEx) data (https://gtexportal.org/). Overall, the data sets measure gene expression levels from 49 tissues of 838 human donors, in total comprising 1,207,976 observations of 38,187 genes. In our analysis, we focus on genes that are related to the central nervous system (CNS), which were assembled as MODULE 137 (https://www.gsea-msigdb.org/gsea/msigdb/cards/MODULE_137.html). This module includes a total of 545 genes and additional 1,632 genes that are significantly enriched in the same experiments as the genes of the module. A complete list of genes can be found at http://robotics.stanford.edu/~erans/cancer/modules/module_137.

6.1. Data Analysis Method

It is of biological interest to understand the CNS gene regulations in different tissues/cell types. Statistically, we consider predicting the expression levels of a target gene using other CNS genes in multiple tissues. Such an analysis provides insights on how other genes regulate the expression of a target gene. To demonstrate the replicability of our proposal, we consider multiple target genes and multiple target tissues and estimate their corresponding models one by one.

For an illustration of the computation process, we consider gene JAM2 (Junctional adhesion molecule B), as the response variable. JAM2 is a protein coding gene on chromosome 21 interacting with a variety of immune cell types and may play a role in lymphocyte homing to secondary lymphoid organs (Johnson-Léger et al., 2002). Mutations in JAM2 has been found to cause primary familial brain calcification (Cen et al., 2020; Schottlaender et al., 2020). We consider the association between JAM2 and other CNS genes in a brain tissue as the target models and the association between JAM2 and other CNS genes in other tissues as the auxiliary models. As there are multiple brain tissues in the dataset, we treat each of them as the target at each time. The list of target tissues can be found in Figure 4. The min, average, and max of primary sample sizes in these target tissues are 126, 177, and 237, respectively. More information on the target tissues is given in the Supplementary Materials. JAM2 expresses in 49 tissues in our dataset and we use 47 tissues with more than 120 measurements on JAM2. The average number of auxiliary samples for each target model is 14,837 over all the non-target tissues. The covariates in use are the genes that are in the enriched MODULE_137 and do not have missing values in all of the 47 tissues. The final covariates include a total of 1,079 genes. The data is standardized before analysis.

Figure 4. — Prediction errors of Agg-Lasso, Naive Trans-Lasso, Trans-Lasso, and Ad hoc ℓ₁-transfer relative to the Lasso evaluated via 5-fold cross validation for gene JAM2 in multiple tissues.

We compare the prediction performance of Trans-Lasso with Lasso, Agg-Lasso, Ad hoc ℓ₁-transfer, and Naive Trans-Lasso. Implementation of the first four methods is the same as in Section 5. The Naive Trans-Lasso implements the Oracle Trans-Lasso algorithm assuming all the auxiliary studies are informative. Evaluating this method can help us understand the overall informative level of the auxiliary samples. We split the target sample into five folds and use four folds to train the algorithms and use the remaining fold to test their prediction performance. We repeat this process five times each with a different fold of test samples. We mention that one individual can provide expression measurements on multiple tissues and these measurements are hard to be independent. While the dependence of the samples can reduce the efficiency of the estimation algorithms, using auxiliary samples may still be beneficial. However, one need to choose proper tuning parameters. The tuning parameter for the Lasso and λ_w are chosen by 8-fold cross-validation. The tuning parameter λ_δ is set to be $λ_{w} \sqrt{\sum_{k \in A} n_{k} / n_{0}}$ . Other tuning parameters and configurations are the same as for the simulations.

6.2. Prediction Performance of the Trans-Lasso for JAM2 Expression

Figure 4 demonstrates the prediction errors of different methods for predicting gene expression JAM2 using other genes. We see that all the transfer learning methods in consideration make improvements over the Lasso in most experiments. The performance of Naive Trans-Lasso implies that there is heterogeneity among tissues and some auxiliary studies can be non-informative. Hence, adaptation to unknown $A$ is important. Among the adaptive transfer learning methods, Trans-Lasso achieves the smallest prediction errors in almost all the experiments. Its average gain is 22% comparing to the Lasso. This shows that our characterization of the similarity between a target model and a given auxiliary model is suitable for the current problem. Agg-Lasso gives similar prediction errors as Trans-Lasso in most of the tissues but has significantly worse performance for Cortex, Hippocampus, and Pituitary tissues. The average proportion of explained variance given by the Lasso and that given by the Trans-Lasso are 0.75 and 0.80, respectively, indicating improved fit from transfer learning.

6.3. Prediction Performance of Other 25 Genes on Chromosome 21

To demonstrate the replicability of our proposal, we also consider other genes on chromosome 21 which are in Module_137 as our target genes. We report the overall prediction performance of these 25 genes in Figure 5. A complete list of these genes and some summary information can be found in the Supplementary Materials. Generally speaking, we see that the Trans-Lasso has the best overall performance among all the target tissues when compared to the other two related methods, Agg-Lasso and Ad hoc ℓ₁-transfer. The deteriorating performance of the naive Trans-Lasso implies that adaptation to the unknown informative set is crucial for successful knowledge transfer.

Figure 5. — Prediction errors of Ad hoc ℓ₁-transfer, Agg-Lasso, Naive Trans-Lasso*, and Trans-Lasso relative to the Lasso for the 25 genes on chromosome 21 and in Module 137, in multiple target tissues. The Naive Trans-Lasso has two outliers for the tissue Cerebellum not showing in the figure with values 1.61 and 1.95.

7. Discussion

This paper studies high-dimensional linear regression in the presence of auxiliary samples. The similarity of the target model and a given auxiliary model is characterized by the sparsity of their contrast vectors. Transfer learning algorithms for estimation and prediction are developed that are adaptive to the unknown informative set. Numerical experiments and GTEx data analysis support the theoretical findings and demonstrate its effectiveness in applications.

In the machine learning literature, transfer learning methods have been proposed for different purposes, but few have statistical guarantees. There are several interesting problems related to the present paper for further research. First, transfer learning in nonlinear models can be studied. Using our similarity characterization of the auxiliary studies, transfer learning in high-dimensional generalized linear models (GLMs) can be formulated. GLMs include logistic and Poisson models that are widely used for classification. The main challenge is that the moment equation above (7) is nonlinear and the resulting $δ^{A}$ is not necessarily sparse. Hence, transfer learning beyond linear models remain open problems and can be studied under different characterizations for the similarity structure. Second, it is interesting to study statistical inference, such as constructing confidence intervals and hypothesis testing with auxiliary samples. Given the results derived in this paper, one may expect weaker sample size conditions in the transfer learning setting than those in the single-task setting. It is interesting to provide a precise characterization and to develop a minimax optimal confidence interval in the transfer learning setting.

Supplementary Material

supinfo

NIHMS1755759-supplement-supinfo.pdf^{(394.7KB, pdf)}

Acknowledgments

This research was supported by NIH grants GM129781 and GM123056 and NSF Grant DMS-1712735.

Contributor Information

Sai Li, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennvania, Philadelphia, PA 19104.

T. Tony Cai, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104.

Hongzhe Li, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104.

References

Agarwal A, Negahban S, and Wainwright MJ (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics 40(2), 1171–1197. [Google Scholar]
Ahmed T and Bajwa WU (2019). Exsis: Extended sure independence screening for ultrahigh-dimensional linear models. Signal Processing 159, 33–48. [Google Scholar]
Ando RK and Zhang T (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6, 1817–1853. [Google Scholar]
Bastani H (2020). Predicting with proxies: Transfer learning in high dimension. Management Science 67(5), 2657–3320. [Google Scholar]
Bühlmann P and van de Geer S (2015). High-dimensional inference in misspecified linear models. Electronic Journal of Statistics 9(1), 1449–1473. [Google Scholar]
Cai TT and Wei H (2021). Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics 49(1), 100–128. [Google Scholar]
Candes E and Tao T (2007). The dantzig selector: Statistical estimation when p is much larger than n. The annals of Statistics 35(6), 2313–2351. [Google Scholar]
Cen Z, Chen Y, Chen S, et al. (2020). Biallelic loss-of-function mutations in jam2 cause primary familial brain calcification. Brain 143(2), 491–502. [DOI] [PubMed] [Google Scholar]
Chen X, Kim S, Lin Q, Carbonell JG, and Xing EP (2010). Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:1005.3579. [Google Scholar]
Cross-Disorder Group of the Psychiatric Genomics Consortium (2019). Genomic relationships, novel loci, and pleiotropic mechanisms across eight psychiatric disorders. Cell 179(7), 1469–1482. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dai D, Han L, Yang T, and Zhang T (2018). Bayesian Model Averaging with Exponentiated Least Squares Loss. IEEE Transactions on Information Theory 64(5), 3331–3345. [Google Scholar]
Dai D, Rigollet P, and Zhang T (2012). Deviation optimal learning using greedy q-aggregation. The Annals of Statistics 40(3), 1878–1905. [Google Scholar]
Danaher P, Wang P, and Witten DM (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society. Series B (Statistical methodology) 76(2), 373–397. [DOI] [PMC free article] [PubMed] [Google Scholar]
Daumé III H (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 256–263. [Google Scholar]
Dondelinger F, Mukherjee S, and Initiative ADN (2020). The joint lasso: high-dimensional regression for group structured data. Biostatistics 21(2), 219–235. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fagny M, Paulson JN, Kuijjer ML, et al. (2017). Exploring regulation in tissues with eqtl networks. Proceedings of the National Academy of Sciences 114(37), E7841–E7850. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96(456), 1348–1360. [Google Scholar]
Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hu Y, Li M, Lu Q, et al. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature genetics 51(3), 568–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
Johnson-Léger CA, Aurrand-Lions M, Beltraminelli N, et al. (2002). Junctional adhesion molecule-2 (jam-2) promotes lymphocyte transendothelial migration. Blood, The Journal of the American Society of Hematology 100(7), 2479–2486. [DOI] [PubMed] [Google Scholar]
Lee SH, Ripke S, Neale BM, et al. (2013). Genetic relationship between five psychiatric disorders estimated from genome-wide snps. Nature genetics 45, 984–994. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li S, Cai TT, and Li H (2020). Supplements to “Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality”. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li YR, Li J, Zhao SD, and et al. (2015). Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nature Medicine 21, 1018–1027. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liu Y and Kozubowski TJ (2015). A folded laplace distribution. Journal of Statistical Distributions and Applications 2(1), 1–17. [Google Scholar]
Lounici K, Pontil M, and Tsybakov AB (2009). Taking advantage of sparsity in multi-task learning. arXiv:0903.1468. [Google Scholar]
Mak TSH, Porsch RM, Choi SW, Zhou X, and Sham PC (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology 41(6), 469–480. [DOI] [PubMed] [Google Scholar]
Mei S, Fei W, and Zhou S (2011). Gene ontology based transfer learning for protein subcellular localization. BMC bioinformatics 12, 44. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pan W and Yang Q (2013). Transfer learning in heterogeneous collaborative filtering domains. Artificial intelligence 197, 39–55. [Google Scholar]
Pierson E, Koller D, Battle A, et al. (2015). Sharing and specificity of co-expression networks across 35 human tissues. PLoS Comput Biol 11(5), e1004220. [DOI] [PMC free article] [PubMed] [Google Scholar]
Raskutti G, Wainwright MJ, and Yu B (2011). Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE transactions on information theory 57(10), 6976–6994. [Google Scholar]
Rigollet P and Tsybakov A (2011). Exponential screening and optimal rates of sparse estimation. The Annals of Statistics 39(2), 731–771. [Google Scholar]
Schottlaender LV, Abeti R, Jaunmuktane Z, et al. (2020). Bi-allelic jam2 variants lead to early-onset recessive primary familial brain calcification. The American Journal of Human Genetics 106(3), 412–421. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shin H-C, Roth HR, Gao M, et al. (2016). Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35(5), 1285–1298. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sun T and Zhang C-H (2012). Scaled sparse linear regression. Biometrika 99(4), 879–898. [Google Scholar]
Sun YV and Hu Y-J (2016). Integrative analysis of multi-omics data for discovery and functional studies of complex human diseases. In Advances in genetics, Volume 93, pp. 147–190. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288. [Google Scholar]
Torrey L and Shavlik J (2010). Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. IGI Global. [Google Scholar]
Tsybakov AB (2014). Aggregation and minimax optimality in high-dimensional estimation. In Proceedings of the International Congress of Mathematicians, Volume 3, pp. 225–246. [Google Scholar]
Turki T, Wei Z, and Wang JT (2017). Transfer learning approaches to improve drug sensitivity prediction in multiple myeloma patients. IEEE Access 5, 7381–7393. [Google Scholar]
Verzelen N (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics 6, 38–90. [Google Scholar]
Wang S, Shi X, Wu M, and Ma S (2019). Horizontal and vertical integrative analysis methods for mental disorders omics data. Scientific Reports, 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
Weiss K, Khoshgoftaar TM, and Wang D (2016). A survey of transfer learning. Journal of Big Data 3, 9. [Google Scholar]
Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics 38(2), 894–942. [Google Scholar]
Zhernakova A, Van Diemen CC, and Wijmenga C (2009). Detecting shared pathogenesis from the shared genetics of immune-related diseases. Nature Reviews Genetics 10(1), 43–55. [DOI] [PubMed] [Google Scholar]
Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association 101(476), 1418–1429. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supinfo

NIHMS1755759-supplement-supinfo.pdf^{(394.7KB, pdf)}

[R1] Agarwal A, Negahban S, and Wainwright MJ (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics 40(2), 1171–1197. [Google Scholar]

[R2] Ahmed T and Bajwa WU (2019). Exsis: Extended sure independence screening for ultrahigh-dimensional linear models. Signal Processing 159, 33–48. [Google Scholar]

[R3] Ando RK and Zhang T (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6, 1817–1853. [Google Scholar]

[R4] Bastani H (2020). Predicting with proxies: Transfer learning in high dimension. Management Science 67(5), 2657–3320. [Google Scholar]

[R5] Bühlmann P and van de Geer S (2015). High-dimensional inference in misspecified linear models. Electronic Journal of Statistics 9(1), 1449–1473. [Google Scholar]

[R6] Cai TT and Wei H (2021). Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics 49(1), 100–128. [Google Scholar]

[R7] Candes E and Tao T (2007). The dantzig selector: Statistical estimation when p is much larger than n. The annals of Statistics 35(6), 2313–2351. [Google Scholar]

[R8] Cen Z, Chen Y, Chen S, et al. (2020). Biallelic loss-of-function mutations in jam2 cause primary familial brain calcification. Brain 143(2), 491–502. [DOI] [PubMed] [Google Scholar]

[R9] Chen X, Kim S, Lin Q, Carbonell JG, and Xing EP (2010). Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:1005.3579. [Google Scholar]

[R10] Cross-Disorder Group of the Psychiatric Genomics Consortium (2019). Genomic relationships, novel loci, and pleiotropic mechanisms across eight psychiatric disorders. Cell 179(7), 1469–1482. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Dai D, Han L, Yang T, and Zhang T (2018). Bayesian Model Averaging with Exponentiated Least Squares Loss. IEEE Transactions on Information Theory 64(5), 3331–3345. [Google Scholar]

[R12] Dai D, Rigollet P, and Zhang T (2012). Deviation optimal learning using greedy q-aggregation. The Annals of Statistics 40(3), 1878–1905. [Google Scholar]

[R13] Danaher P, Wang P, and Witten DM (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society. Series B (Statistical methodology) 76(2), 373–397. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Daumé III H (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 256–263. [Google Scholar]

[R15] Dondelinger F, Mukherjee S, and Initiative ADN (2020). The joint lasso: high-dimensional regression for group structured data. Biostatistics 21(2), 219–235. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Fagny M, Paulson JN, Kuijjer ML, et al. (2017). Exploring regulation in tissues with eqtl networks. Proceedings of the National Academy of Sciences 114(37), E7841–E7850. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96(456), 1348–1360. [Google Scholar]

[R18] Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Hu Y, Li M, Lu Q, et al. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature genetics 51(3), 568–576. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Johnson-Léger CA, Aurrand-Lions M, Beltraminelli N, et al. (2002). Junctional adhesion molecule-2 (jam-2) promotes lymphocyte transendothelial migration. Blood, The Journal of the American Society of Hematology 100(7), 2479–2486. [DOI] [PubMed] [Google Scholar]

[R21] Lee SH, Ripke S, Neale BM, et al. (2013). Genetic relationship between five psychiatric disorders estimated from genome-wide snps. Nature genetics 45, 984–994. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Li S, Cai TT, and Li H (2020). Supplements to “Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality”. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Li YR, Li J, Zhao SD, and et al. (2015). Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nature Medicine 21, 1018–1027. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Liu Y and Kozubowski TJ (2015). A folded laplace distribution. Journal of Statistical Distributions and Applications 2(1), 1–17. [Google Scholar]

[R25] Lounici K, Pontil M, and Tsybakov AB (2009). Taking advantage of sparsity in multi-task learning. arXiv:0903.1468. [Google Scholar]

[R26] Mak TSH, Porsch RM, Choi SW, Zhou X, and Sham PC (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology 41(6), 469–480. [DOI] [PubMed] [Google Scholar]

[R27] Mei S, Fei W, and Zhou S (2011). Gene ontology based transfer learning for protein subcellular localization. BMC bioinformatics 12, 44. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] Pan W and Yang Q (2013). Transfer learning in heterogeneous collaborative filtering domains. Artificial intelligence 197, 39–55. [Google Scholar]

[R29] Pierson E, Koller D, Battle A, et al. (2015). Sharing and specificity of co-expression networks across 35 human tissues. PLoS Comput Biol 11(5), e1004220. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] Raskutti G, Wainwright MJ, and Yu B (2011). Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE transactions on information theory 57(10), 6976–6994. [Google Scholar]

[R31] Rigollet P and Tsybakov A (2011). Exponential screening and optimal rates of sparse estimation. The Annals of Statistics 39(2), 731–771. [Google Scholar]

[R32] Schottlaender LV, Abeti R, Jaunmuktane Z, et al. (2020). Bi-allelic jam2 variants lead to early-onset recessive primary familial brain calcification. The American Journal of Human Genetics 106(3), 412–421. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] Shin H-C, Roth HR, Gao M, et al. (2016). Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35(5), 1285–1298. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Sun T and Zhang C-H (2012). Scaled sparse linear regression. Biometrika 99(4), 879–898. [Google Scholar]

[R35] Sun YV and Hu Y-J (2016). Integrative analysis of multi-omics data for discovery and functional studies of complex human diseases. In Advances in genetics, Volume 93, pp. 147–190. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288. [Google Scholar]

[R37] Torrey L and Shavlik J (2010). Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. IGI Global. [Google Scholar]

[R38] Tsybakov AB (2014). Aggregation and minimax optimality in high-dimensional estimation. In Proceedings of the International Congress of Mathematicians, Volume 3, pp. 225–246. [Google Scholar]

[R39] Turki T, Wei Z, and Wang JT (2017). Transfer learning approaches to improve drug sensitivity prediction in multiple myeloma patients. IEEE Access 5, 7381–7393. [Google Scholar]

[R40] Verzelen N (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics 6, 38–90. [Google Scholar]

[R41] Wang S, Shi X, Wu M, and Ma S (2019). Horizontal and vertical integrative analysis methods for mental disorders omics data. Scientific Reports, 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] Weiss K, Khoshgoftaar TM, and Wang D (2016). A survey of transfer learning. Journal of Big Data 3, 9. [Google Scholar]

[R43] Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics 38(2), 894–942. [Google Scholar]

[R44] Zhernakova A, Van Diemen CC, and Wijmenga C (2009). Detecting shared pathogenesis from the shared genetics of immune-related diseases. Nature Reviews Genetics 10(1), 43–55. [DOI] [PubMed] [Google Scholar]

[R45] Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association 101(476), 1418–1429. [Google Scholar]

PERMALINK

Transfer Learning for High-Dimensional Linear Regression: Prediction, Estimation and Minimax Optimality

Sai Li

T Tony Cai

Hongzhe Li

Abstract

1. Introduction

1.1. Transfer Learning in High-dimensional Linear Regression

1.2. Our Contributions

1.3. Organization and Notation

2. Estimation with Known Informative Auxiliary Samples

2.1. Oracle Trans-Lasso Algorithm

Algorithm 1:

2.2. Theoretical Properties of Oracle Trans-Lasso

Condition 1.

Condition 2.

Theorem 1 (Convergence Rate of Oracle Trans-Lasso).

Theorem 2 (Minimax lower bound for q = 1).

3. Unknown Set of Informative Auxiliary Samples

3.1. The Trans-Lasso Algorithm

Algorithm 2:

3.2. Constructing the Candidate Sets for Aggregation

Algorithm 3:

3.3. Theoretical Properties of Trans-Lasso

Lemma 1 (Q-Aggregation for Trans-Lasso).

Condition 3.

Theorem 3 (Convergence Rate of the Trans-Lasso).

Remark 1.

4. Extensions to Heterogeneous Designs

Condition 4.

Theorem 4 (Oracle Trans-Lasso with heterogeneous designs).

Corollary 1 (Trans-Lasso with heterogeneous designs).

5. Simulation Studies

5.1. Identity Covariance Matrix for the Designs

Figure 1.

5.2. Homogeneous Designs among A∪{0}

Figure 2.

5.3. Heterogeneous Designs

Figure 3.

6. Application to Genotype-Tissue Expression Data

6.1. Data Analysis Method

Figure 4.

6.2. Prediction Performance of the Trans-Lasso for JAM2 Expression

6.3. Prediction Performance of Other 25 Genes on Chromosome 21

Figure 5.

7. Discussion

Supplementary Material

Acknowledgments

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

5.2. Homogeneous Designs among $A \cup {0}$