CHEER: Rich Model Helps Poor Model via Knowledge Infusion

Cao Xiao; Trong Nghia Hoang; Shenda Hong; Tengfei Ma; Jimeng Sun

doi:10.1109/tkde.2020.2989405

Author manuscript; available in PMC: 2023 Feb 1.

Published in final edited form as: IEEE Trans Knowl Data Eng. 2020 Apr 22;34(2):531–543. doi: 10.1109/tkde.2020.2989405


Cao Xiao ^1,^#, Trong Nghia Hoang ^2,^#, Shenda Hong ^3,^#, Tengfei Ma ⁴, Jimeng Sun ⁵

PMCID: PMC9879310 NIHMSID: NIHMS1770886 PMID: 36712193

Abstract

There is a growing interest in applying deep learning (DL) to healthcare, driven by the availability of data with multiple feature channels in rich-data environments (e.g., intensive care units). However, in many other practical situations, we can only access data with far fewer feature channels in poor-data environments (e.g., at home), which often results in predictive models with poor performance. How can we boost the performance of models learned from such a poor-data environment by leveraging knowledge extracted from existing models trained using rich data in a related environment? To address this question, we develop a knowledge infusion framework named CHEER that can succinctly summarize such a rich model into transferable representations, which can be incorporated into the poor model to improve its performance. The infused model is analyzed theoretically and evaluated empirically on several datasets. Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.

Keywords: Health Analytics, Representation Learning, Embedding

1. Introduction

In rich-data environments with strong observation capabilities, data often come with rich representations that encompass multiple channels of features. For example, the multi-lead electrocardiogram (ECG) signals used in hospitals for diagnosing heart diseases are measured in intensive care units (ICU), and each lead is considered a feature channel. The availability of such rich-data environments has thus sparked strong interest in applying deep learning (DL) for predictive health analytics, as DL models built on data with multi-channel features have demonstrated promising results in healthcare (1). However, in many practical scenarios, such rich data are often not accessible because of privacy concerns. Thus, we often have to develop DL models on lower-quality data comprising fewer feature channels, which were collected from poor-data environments with limited observation capabilities (e.g., home monitoring devices which provide only a single feature channel). Inevitably, the performance of state-of-the-art DL models, which are fueled by the abundance and richness of data, becomes much less impressive in such poor-data environments (2).

To alleviate this issue, we hypothesize that learning patterns consolidated by DL models trained in one environment often encode information that can be transferred to related environments. For example, a heart disease detection model trained on rich data from 12 ECG channels in a hospital will likely carry pertinent information that can help improve a similar model trained on poor data from a single ECG channel collected by a wearable device, due to the correlation between their data. Motivated by this intuition, we further postulate that given access to a prior model trained on rich data, the performance of a DL model built on related poor data can be improved if we can extract transferable information from the rich model and infuse it into the poor model. This is related to deep transfer learning and knowledge distillation but with a new setup that has not been addressed before, as elaborated in Section 2 below.

In this work, we propose a knowledge infusion framework, named CHEER, to address the aforementioned challenges. In particular, CHEER aims to effectively transfer domain-invariant knowledge consolidated from a rich model with high-quality data demand to a poor model with low data demand and model complexity, which is more suitable for deployment in poor-data settings. We also demonstrate empirically that CHEER helps bridge the performance gap between DL models applied in rich- and poor-data settings. Specifically, we have made the following key contributions:

We develop a transferable representation that summarizes the rich model and then infuses the summarized knowledge effectively into the poor model (Section 3.2). The representation can be applied to a wide range of existing DL models.
We perform theoretical analysis to demonstrate the efficiency of knowledge infusion mechanism of CHEER. Our theoretical results show that under practical learning configurations and mild assumptions, the poor model’s prediction will agree with that of the rich model with high probability (Section 4).
Finally, we also conduct extensive empirical studies to demonstrate the efficiency of CHEER on several healthcare datasets. Our results show that CHEER outperformed the second best approach (knowledge distillation) and the baseline without knowledge infusion by 5.60% and 46.80%, respectively, in terms of macro-F1 score and demonstrated more robust performance (Section 5).

2. Related Works

Deep Transfer Learning:

Most existing deep transfer learning methods transfer knowledge across domains while assuming the target and source models have equivalent modeling and/or data representation capacities. For example, deep domain adaptation has focused mainly on learning domain-invariant representations between very specific domains (e.g., image data) (3; 4; 5; 6; 7; 8; 9; 10; 11). Furthermore, this can only be achieved by training both models jointly on source and target domain data.

More recently, another type of deep transfer learning (12) has been developed to transfer only the attention mechanism (13) from a complex to a shallow neural network to boost its performance. Both source and target models, however, need to be jointly trained on the same dataset. This does not fit our setting, where the source and target datasets are not available at the same time and the target model often has to adopt representations with significantly less modeling capacity to be compatible with the poor-data domain with weak observation capabilities.

Knowledge Distillation:

Knowledge distillation (14) or mimic learning (15) aims to transfer the predictive power from a high-capacity but expensive DL model to a simpler model such as a shallow neural network for ease of deployment (16; 17; 18; 19). This can usually be achieved via training simple models on soft labels learned from high-capacity models, which, however, assumes that both models operate on the same domain and have access to the same data or at least datasets with similar qualities. In our setting, we only have access to low-quality data with poor feature representation, and an additional set of limited paired data that include both rich and poor representations (e.g., high-quality ICU data and lower-quality health-monitoring information from personal devices) of the same object.

Domain Adaptation:

There also exists another body of non-deep-learning transfer paradigms that are often referred to as domain adaptation. These, however, often include methods that not only assume access to domain-specific (20; 21; 22; 23; 24) and/or model-specific knowledge of the domains being adapted (25; 26; 27; 28; 29; 30; 31; 32), but are also not applicable to deep learning models (33; 34) with arbitrary architecture as addressed in our work.

In particular, our method does not impose any specific assumption on the data domain and the deep learning model of interest. We recognize that our method is only demonstrated on deep models (with arbitrary architecture) in this research, but our formulation can be straightforwardly extended to non-deep models as well. We omit such detail in the current manuscript to keep the focus on deep models, which are of greater interest in the healthcare context due to their expressive representation in modeling multi-channel data.

3. The CHEER Method

3.1. Data and Problem Definition

Rich and Poor Datasets.

Let $𝓗_{r} ≜ {(x_{i}^{r}, y_{i}^{r})}_{i = 1}^{n}$ and $𝓗_{p} ≜ {(x_{i}^{p}, y_{i}^{p})}_{i = 1}^{m}$ denote the rich and poor datasets, respectively. The subscript i indexes the i-th data point (e.g., the i-th patient in healthcare applications), which contains input feature vector $x_{i}^{r}$ or $x_{i}^{p}$ and output target $y_{i}^{r}$ or $y_{i}^{p}$ of the rich or poor datasets. The rich and poor input features $x_{i}^{r} \in ℝ^{r}$ and $x_{i}^{p} \in ℝ^{p}$ are r- and p-dimensional vectors, respectively, with $p ≪ r$ . The output targets, $y_{i}^{r}$ and $y_{i}^{p} \in {1 \dots c}$ , are categorical variables. The input features of these datasets (i.e., $x_{i}^{r}$ and $x_{i}^{p}$ ) are non-overlapping as they are assumed to be collected from different channels of data (i.e., different data modalities). In the remainder of this paper, we will use data channel and data modality interchangeably.

For example, the rich data can be the physiological data from ICU (e.g., vital signs, continuous blood pressure and electrocardiography) or temporal event sequences such as electronic health records with discrete medical codes, while the poor data are collected from personal wearable devices. The target can be the mortality status of those patients, the onset of heart diseases, etc. Note that these raw data are not necessarily plain feature vectors. They can be arbitrary rich features such as time series, images and text data. We will present one detailed implementation using time series data in Section 3.3.

Input Features.

We (implicitly) assume that the raw data of interest comprises multiple (say, p or r) sensory channels, each of which can be represented by or embedded¹ into a particular feature signal (i.e., one feature per channel). This results in an embedded feature vector of size p or r (per data point), respectively. In a different practice, a single channel may be encoded by multiple latent features and our method will still be applicable. In this paper, however, we will assume one embedded feature per channel to remain close to the standard setting of our healthcare scenario, which is detailed below.

Paired Dataset.

To leverage both rich and poor datasets, we need a small amount of paired data to learn the relationships between them, which is denoted as $𝓗_{o} ≜ {(x_{i}^{r}, x_{i}^{p}, y_{i})}_{i = 1}^{k}$ . Note that the paired dataset contains both rich and poor input features, i.e. $x_{i}^{r}$ and $x_{i}^{p}$ , of the same subjects (hence, sharing the same target y_i).

Concretely, this means a concatenated input $x_{i}^{o} = [x_{i}^{r}, x_{i}^{p}]$ of the paired dataset has $o = p + r$ features where the first r features are collected from r rich channels (with highly accurate observation capability) while the remaining p features are collected from p poor channels (with significantly more noisy observations). We note that our method and analysis also apply to settings where $x_{i}^{p} \subseteq x_{i}^{r}$ . In such cases, $x_{i}^{o} = x_{i}^{r}$ and o = r (though the number of data points i for which $x_{i}^{r}$ is accessible as paired data is much smaller than the number of those with accessible $x_{i}^{p}$ ). To avoid confusion, however, we will proceed with the implicit assumption that there is no feature overlap between poor and rich datasets in the remainder of this paper.

For example, the paired dataset may comprise rich data from ICU ( $x_{i}^{r}$ ) and poor data from wearable sensors ( $x_{i}^{p}$ ), which are extracted from the same patient i. The paired dataset often contains far fewer data points (i.e., patients) than the rich and poor datasets themselves, and cannot be used alone to train a high-quality prediction model.

Problem Definition.

Given (1) a poor dataset $𝓗_{p}$ collected from a particular patient cohort of interest, (2) a paired dataset $𝓗_{o}$ collected from a limited sample of patients, and (3) a rich model $𝓣 (y ∣ x^{r})$ which was pre-trained on private (rich) data of the same patient cohort, we are interested in learning a model $𝓢 (y ∣ x^{p})$ using all of $𝓗_{p}$ , $𝓗_{o}$ and $𝓣 (y ∣ x^{r})$ , which can perform better than a vanilla model $𝓓 (y ∣ x^{p})$ generated using only $𝓗_{p}$ or $𝓗_{o}$ .

Challenges.

This requires the ability to transfer the learned knowledge from $𝓣 (y ∣ x^{r})$ to improve the prediction quality of $𝓢 (y ∣ x^{p})$ . This is however a highly non-trivial task because (a) $𝓣 (y ∣ x^{r})$ only generates meaningful prediction if we can provide input from rich data channels, (b) its training data is private and cannot be accessed to enable knowledge distillation and/or domain adaptation, and (c) the paired data is limited and cannot be used alone to build an accurate prediction model.

Solution Sketch.

Combining these sources of information coherently to generate a useful prediction model on the patient cohort of interest is therefore a challenging task which has not been investigated before. To address this challenge, the idea is to align both rich and poor models using a transferable representation described in Section 3.2. This representation in turn helps infuse knowledge from the rich model into the poor model, thus improving its performance. The overall structure of CHEER is shown in Figure 1. The notations are summarized in Table 1.

Fig. 1: — CHEER: (a) a *rich* model was first built using rich multimodal or multi-channel data; (b) the behaviors of *rich* model are then infused into the poor model using paired data (i.e., behavior infusion); and (c) the *poor* model is trained to fit both the rich model’s predictions on paired data and its *poor* dataset (i.e., target infusion).

TABLE 1:

Notations used in CHEER.

Notation	Definition

$𝓗_{r} ≜ {(x_{i}^{r}, y_{i}^{r})}_{i = 1}^{n}$ ; $𝓣 (y ∣ x^{r})$	rich data; rich model
$𝓗_{p} ≜ {(x_{i}^{p}, y_{i}^{p})}_{i = 1}^{m}$ ; $𝓢 (y ∣ x^{p})$	poor data; poor model
$𝓗_{o} ≜ {(x_{i}^{r}, x_{i}^{p}, y_{i})}_{i = 1}^{k}$	paired data
$Q_{r} ≜ [q_{r}^{1} \dots q_{r}^{l_{r}}] \in ℝ^{d \times l_{r}}$	domain-specific features
$A_{r} (x^{r}) ≜ [a_{r}^{(1)} (x^{r}) \dots a_{r}^{(d)} (x^{r})]$	feature scoring functions
$O_{r} ≜ O_{r} (Q_{r}^{⊤} A_{r} (x^{r}))$	feature aggregation component


3.2. Learning Transferable Rich Model

In our knowledge infusion task, the rich model is assumed to be trained in advance using the rich dataset $𝓗_{r} ≜ {(x_{i}^{r}, y_{i}^{r})}_{i = 1}^{n}$ . The rich dataset is, however, not accessible and we only have access to the rich model. The knowledge infusion task aims to consolidate the knowledge acquired by the rich model and infuse it with a simpler model (i.e., the poor model).

Transferable Representation.

To characterize a DL model, we first describe the building blocks and then discuss how they would interact to generate the final prediction scores. In particular, let $Q_{r} (x^{r})$ , $A_{r} (x^{r})$ and $O_{r}$ denote the building blocks, namely Feature Extraction, Feature Scoring and Feature Aggregation, respectively. Intuitively, the Feature Extraction first transforms the raw input feature x^r into a vector of high-level features $Q_{r} (x^{r})$ , whose importance is then scored by the Feature Scoring function $A_{r} (x^{r})$ . The high-level features $Q_{r} (x^{r})$ are combined first via a linear transformation $Q_{r} {(x^{r})}^{⊤} A_{r} (x^{r})$ that focuses the model’s attention on important features. The results are translated into a vector of final predictive probabilities via the Feature Aggregation function $O_{r} (Q_{r} {(x^{r})}^{⊤} A_{r} (x^{r}))$ , which implements a non-linear transformation. Mathematically, the above workflow can be succinctly characterized using the following conditional probability distribution:

𝓣 (y ∣ x^{r}) ≜ \Pr (y ∣ Q_{r}^{⊤} (x^{r}) A_{r} (x^{r}); O_{r}) .

(1)

We will describe these building blocks in more details next.

Feature Extraction.

When dealing with complex input data such as time series, images and text, it is common to derive more effective features instead of directly using the raw input x^r. The extracted features are denoted as $Q_{r} (x^{r}) ≜ [q_{r}^{1} (x^{r}) \dots q_{r}^{l_{r}} (x^{r})] \in ℝ^{d \times l_{r}}$ where $q_{r}^{i} (x^{r}) \in ℝ^{d}$ is a d-dimensional feature vector extracted by the i-th feature extractor from the raw input x^r. Each feature extractor is applied to a separate segment of the time series input (defined later in Section 3.3). To avoid cluttering the notations, we shorten $Q_{r} (x^{r})$ as $Q_{r}$ .

Feature Scoring.

Since the extracted features are of various importance to each subject, they are combined via weights specific to each subject. More formally, the extracted features $Q_{r} (x^{r})$ of the rich model are combined via $Q_{r}^{⊤} (x^{r}) A_{r} (x^{r})$ using subject-specific weight vector

A_{r} (x^{r}) ≜ [a_{r}^{(1)} (x^{r}) \dots a_{r}^{(d)} (x^{r})] \in ℝ^{d} .

Essentially, each weight component $a_{r}^{(i)} (x^{r})$ maps from the raw input feature x^r to the importance score of its i-th extracted feature. For each dimension i, $a_{r}^{(i)} (x^{r}) ≜ a_{r}^{(i)} (x^{r}; ω_{r}^{(i)})$ is parameterized by a set of parameters $ω_{r}^{(i)}$ , which are learned using the rich dataset.

Feature Aggregation.

The feature aggregation implements a nonlinear transformation $O_{r}$ (e.g., a feed-forward layer) that maps the combined features into final predictive scores. The input to this component is the linearly combined feature $Q_{r} {(x^{r})}^{⊤} A_{r} (x^{r})$ and the output is a vector of logistic inputs,

r (x^{r}) ≜ [r_{1} \dots r_{c}] = O_{r} (Q_{r} {(x^{r})}^{⊤} A_{r} (x^{r})),

(2)

which is subsequently passed through the softmax function to compute the predictive probability for each candidate label,

𝓣 (y = j ∣ x_{i}^{r}) ≜ \exp (r_{j}) / (\sum_{κ = 1}^{c} \exp (r_{κ})) .

(3)
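
To make the workflow in Eqs. (1)-(3) concrete, the following is a minimal sketch in plain Python, not the implementation used in this paper. It assumes that the extracted features Q and the scores a have already been computed, and it stands in for the aggregation component with a single affine layer; the names `combine_features`, `dense`, `softmax` and `rich_model_predict`, as well as all numeric values, are hypothetical.

```python
import math

def combine_features(Q, a):
    """Return Q^T a, where Q has d rows of length l and a has length d."""
    d, l = len(Q), len(Q[0])
    return [sum(Q[i][m] * a[i] for i in range(d)) for m in range(l)]

def dense(z, W, b):
    """Hypothetical aggregation layer: an affine map from R^l to R^c."""
    return [sum(w * v for w, v in zip(row, z)) + bias
            for row, bias in zip(W, b)]

def softmax(r):
    """Eq. (3): turn the logits r_1..r_c into class probabilities."""
    top = max(r)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - top) for v in r]
    total = sum(exps)
    return [e / total for e in exps]

def rich_model_predict(Q, a, W, b):
    """Chain the three building blocks: score, aggregate, normalize."""
    z = combine_features(Q, a)   # linear combination Q^T A
    r = dense(z, W, b)           # logits, as in Eq. (2)
    return softmax(r)            # probabilities, as in Eq. (3)

# Toy values (assumed for illustration): d = 2, l_r = 3, c = 2.
Q = [[0.2, -0.1, 0.4],
     [0.5, 0.3, -0.2]]
a = [0.7, 0.3]
W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 1.0]]
b = [0.0, 0.1]
probs = rich_model_predict(Q, a, W, b)
print(probs)
```

The same three-stage structure is reused for the poor model in Section 3.4, which is what allows the scoring and aggregation components to be matched one by one.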

3.3. A DNN Implementation of Rich Model

This section describes an instantiation of the aforementioned abstract building blocks using a popular DNN architecture with self-attention mechanism (35) for modeling multivariate time series (36):

Raw Features.

Raw data from rich data environment often consist of multivariate time series such as physiological signals collected from hospital or temporal event sequences such as electronic health records (EHR) with discrete medical codes. In the following, we consider the raw feature input $x_{i}^{r}$ as continuous monitoring data (e.g., blood pressure measures) for illustration purpose.

Feature Extraction.

To handle such continuous time series, we extract a set of domain-specific features using CNN and RNN models. More specifically, we split the raw time series $x_{i}^{r}$ into $l_{r}$ non-overlapping segments of equal length.

That is, $x_{i}^{r} ≜ (s_{i, m}^{r})$ where $m = 1 \dots l_{r}$ and $s_{i, m}^{r} \in ℝ^{D_{r}}$ such that $D_{r} \times l_{r} = r$ , where r denotes the number of features of the rich data. Then, we apply stacked 1-D convolutional neural networks ( $ℂ ℕ ℕ_{r}$ ) with mean pooling ( $ℙ_{r}$ ) on each segment, i.e.

h_{i, m}^{r} ≜ ℙ_{r} (ℂ ℕ ℕ_{r} (s_{i, m}^{r}))

(4)

where $h_{i, m}^{r} \in ℝ^{k_{r}}$ , and k_r denotes the number of filters of the CNN components of the rich model. After that, we place a recurrent neural network ( $ℝ ℕ ℕ_{r}$ ) across the output segments of the previous CNN and Pooling layers:

q_{i, m}^{r} ≜ ℝ ℕ ℕ_{r} (q_{i, m - 1}^{r}, h_{i, m}^{r}) \in ℝ^{d},

(5)

The output segments of the RNN layer are then concatenated to generate the feature matrix,

Q_{r}^{(i)} ≜ [q_{i, 1}^{r} \dots q_{i, l_{r}}^{r}] \in ℝ^{d \times l_{r}},

(6)

which corresponds to our domain-specific feature extractors $Q_{r} (x_{i}^{r}) ≜ [q_{r}^{1} (x_{i}^{r}) \dots q_{r}^{l_{r}} (x_{i}^{r})]$ where $q_{r}^{t} (x_{i}^{r}) = q_{i, t}^{r} \in ℝ^{d}$ , as defined previously in our transferable representation (Section 3.2).

Feature Scoring.

The concatenated features $Q_{r}^{(i)}$ are then fed to the self-attention component $A T T_{r}$ to generate a vector of importance scores for the output components, i.e. $a_{r}^{(i)} ≜ A T T_{r} (Q_{r}^{(i)}) \in ℝ^{d}$ . For more details on how to construct this component, see (37; 38; 39) and (35). The result corresponds to the feature scoring functions² $A_{r} (x_{i}^{r}) ≜ [a_{r}^{(1)} (x_{i}^{r}) \dots a_{r}^{(d)} (x_{i}^{r})]$ where $a_{r}^{(t)} (x_{i}^{r}) = {[a_{r}^{(i)}]}_{t} \in ℝ$ .

Feature Aggregation.

The extracted features $Q_{r}^{(i)}$ are combined using the above feature scoring functions, which yields $Q_{r}^{{(i)}^{⊤}} a_{r}^{(i)}$ . The combined features are subsequently passed through a linear layer with densely connected hidden units ( $D E ℕ S E_{r}$ ),

g_{r}^{(i)} ≜ D E ℕ S E_{r} (Q_{r}^{{(i)}^{⊤}} a_{r}^{(i)}; w_{r}),

(7)

where $g_{r}^{(i)} \in ℝ^{c}$ , c denotes the number of class labels and w_r denotes the parametric weights of the dense layers. Then, the output of the dense layer is transformed into a probability distribution over class labels via the following softmax activation function parameterized with softmax temperature τ_r:

𝓣 (y = j ∣ x_{i}^{r}) ≜ \exp ({[g_{r}^{(i)}]}_{j} / τ_{r}) / \sum_{κ = 1}^{c} \exp ({[g_{r}^{(i)}]}_{κ} / τ_{r})

The entire process corresponds to the feature aggregation function $O_{r} (Q_{r} {(x^{r})}^{⊤} A_{r} (x^{r}))$ parameterized by {w_r, τ_r}.
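
Two of the simpler steps in this instantiation, the segmentation of the raw series and the temperature-scaled softmax, can be illustrated with a short sketch. This is only an illustration under simplifying assumptions: the CNN, RNN and attention layers are omitted, mean pooling is applied directly to the raw segments, and the function names and toy values are hypothetical.

```python
import math

def split_segments(series, num_segments):
    """Split a length-r series into l_r equal, non-overlapping segments."""
    if len(series) % num_segments != 0:
        raise ValueError("series length must be divisible by num_segments")
    width = len(series) // num_segments  # D_r, so that D_r * l_r = r
    return [series[m * width:(m + 1) * width] for m in range(num_segments)]

def mean_pool(values):
    """Mean pooling over one segment (stand-in for CNN followed by pooling)."""
    return sum(values) / len(values)

def softmax_with_temperature(logits, tau):
    """Softmax over logits divided by a temperature tau > 0."""
    scaled = [g / tau for g in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy series with r = 12 samples split into l_r = 3 segments of D_r = 4.
series = [float(v) for v in range(12)]
segments = split_segments(series, 3)
pooled = [mean_pool(s) for s in segments]

# A larger temperature flattens the predictive distribution.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 1.0)
soft = softmax_with_temperature(logits, 5.0)
print(pooled, sharp, soft)
```

A higher temperature yields softer class probabilities, which is the usual motivation for exposing τ_r as a parameter when the outputs are later used as fitting targets for another model.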

3.4. Knowledge Infusion for Poor Model

To infuse the above knowledge extracted from the rich model to the poor model, we adopt the same transferable representation for the poor model as follows:

𝓢 (y ∣ x^{p}) ≜ \Pr (y ∣ Q_{p}^{⊤} (x^{p}) A_{p} (x^{p}); O_{p})

where $Q_{p}, A_{p} (x^{p}) ≜ [a_{p}^{(1)} (x^{p}; ω_{p}^{(1)}) \dots a_{p}^{(d)} (x^{p}; ω_{p}^{(d)})] \in ℝ^{d}$ and O_p are the poor model’s domain-specific feature extractors, feature scoring functions and feature aggregation functions, which are similar in format to those of the rich model. Infusing knowledge from the rich model to the poor model can then be boiled down to matching these components between them. This process can be decomposed into two steps:

Behavior Infusion.

As mentioned above, each scoring function $a_{p}^{(i)} (x^{p}; ω_{p}^{(i)})$ is defined by a weight vector $ω_{p}^{(i)}$ . The collection of these weight vectors thus defines the poor model’s learning behaviors (i.e., its feature scoring mechanism).

Given the input components ${(x_{t}^{p}, x_{t}^{r})}_{t = 1}^{k}$ of the subjects included in the paired dataset $𝓗_{o}$ and the rich model’s scoring outputs ${a_{r}^{(i)} (x_{t}^{r})}_{t = 1}^{k}$ at those subjects, we can construct an auxiliary dataset $𝓑_{i} ≜ {(x_{t}^{p}, a_{r}^{(i)} (x_{t}^{r}))}_{t = 1}^{k}$ to learn the corresponding behavior $ω_{p}^{(i)}$ of the poor model so that its scoring mechanism is similar to that of the rich model. That is, we want to learn a mapping from a poor data point x^p to the importance score assigned to its i-th latent feature by the rich model. Formally, this can be cast as the optimization task given by Eq. 8:

\underset{ω_{p}^{(i)}}{minimize} \; 𝓛_{i} (ω_{p}^{(i)}) ≜ \frac{1}{2} \sum_{t = 1}^{k} {(a_{p}^{(i)} (x_{t}^{p}; ω_{p}^{(i)}) - a_{r}^{(i)} (x_{t}^{r}))}^{2} + λ {‖ ω_{p}^{(i)} ‖}_{2}^{2} .

(8)

For example, if we parameterize $a_{p}^{(i)} (x_{t}^{p}; ω_{p}^{(i)}) = ω_{p}^{{(i)}^{⊤}} x_{t}^{p}$ and choose λ = 0, then Eq. 8 reduces to a linear regression task, which can be solved analytically. Alternatively, by choosing λ = 1, Eq. 8 reduces to a maximum a posteriori (MAP) inference task with a normal prior imposed on $ω_{p}^{(i)}$ , which is also analytically solvable.
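
For the linear parameterization, the analytic solution can be sketched as follows. Setting the gradient of Eq. 8 to zero gives the normal equations $(X^{⊤} X + 2 λ I) ω = X^{⊤} a$ , where the rows of X are the poor inputs and a collects the rich model’s scores. The code below is an illustrative sketch rather than the implementation used in our experiments; the helper names and the toy data are assumptions.

```python
def solve_linear_system(M, v):
    """Solve M w = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; try lam > 0")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - s) / aug[i][i]
    return w

def fit_linear_scoring(X_poor, rich_scores, lam):
    """Minimize 0.5 * sum_t (w.x_t - a_t)^2 + lam * ||w||^2 in closed form."""
    p = len(X_poor[0])
    # Left-hand side X^T X + 2 * lam * I of the normal equations.
    gram = [[sum(x[i] * x[j] for x in X_poor) + (2.0 * lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    # Right-hand side X^T a.
    rhs = [sum(x[i] * a for x, a in zip(X_poor, rich_scores))
           for i in range(p)]
    return solve_linear_system(gram, rhs)

# Toy paired data (assumed): k = 4 poor inputs with p = 2 features, and
# rich scores generated by the hypothetical weights [0.5, -0.25].
X_poor = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
rich_scores = [0.5, -0.25, 0.25, 0.75]
w_ls = fit_linear_scoring(X_poor, rich_scores, lam=0.0)   # linear regression
w_map = fit_linear_scoring(X_poor, rich_scores, lam=1.0)  # ridge / MAP
print(w_ls, w_map)
```

With λ = 0 the noiseless toy scores are reproduced exactly, while λ = 1 shrinks the weights toward zero, as expected from a zero-mean normal prior.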

Incorporating more sophisticated, non-linear parameterization for $a_{p}^{(i)} (x_{t}^{p}; ω_{p}^{(i)})$ (e.g., deep neural network with varying structures) is also possible, but Eq. 8 can then only be optimized approximately via numerical methods (see Section 3.2). Eq. (8) can be solved via standard gradient descent. The complexity of deriving the solution thus depends on the number of iterations τ and the cost of computing the gradient with respect to $ω_{p}^{(i)}$ , which depends on the parameterization of $a_{p}^{(i)}$ but is usually $O (w)$ per data point where $w = \max_{i} | ω_{p}^{(i)} |$ . As such, the cost of computing the gradient of the objective function with respect to a particular i is O(kw). As there are τ iterations, the cost of solving for the optimal $ω_{p}^{(i)}$ is $O (τ k w)$ . Lastly, since we are doing this for d values of i, the total complexity would be $O (τ k w d)$ .
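
The loop structure behind the $O (τ k w d)$ estimate can be seen in the following sketch of batch gradient descent on Eq. 8. For readability it assumes the linear scoring function $ω^{⊤} x$ ; the step size, iteration count, function names and toy data are illustrative choices and not values from our experiments.

```python
def fit_scoring_by_gradient_descent(X_poor, rich_scores, lam=0.0,
                                    step=0.05, iterations=500):
    """Minimize Eq. 8 for one scoring dimension with a linear score w.x."""
    p = len(X_poor[0])
    w = [0.0] * p
    for _ in range(iterations):                    # tau iterations
        grad = [2.0 * lam * wi for wi in w]        # gradient of lam * ||w||^2
        for x, a in zip(X_poor, rich_scores):      # k paired points
            residual = sum(wi * xi for wi, xi in zip(w, x)) - a
            for j in range(p):                     # O(w) work per point
                grad[j] += residual * x[j]
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w

def behavior_infusion(X_poor, rich_score_matrix, **kwargs):
    """Fit one weight vector per scoring dimension i = 1..d.

    rich_score_matrix[t][i] holds the rich model's i-th score on the
    rich input paired with X_poor[t].
    """
    d = len(rich_score_matrix[0])
    return [fit_scoring_by_gradient_descent(
                X_poor, [row[i] for row in rich_score_matrix], **kwargs)
            for i in range(d)]                     # d independent problems

# Toy paired data (assumed): k = 4, p = 2, d = 2. The two score columns
# were generated by the hypothetical weights [0.5, -0.25] and [-1.0, 2.0].
X_poor = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
rich_score_matrix = [[0.5, -1.0], [-0.25, 2.0], [0.25, 1.0], [0.75, 0.0]]
weights = behavior_infusion(X_poor, rich_score_matrix)
print(weights)
```

Because the d problems are independent given the paired data, they could also be solved in parallel, which changes the wall-clock time but not the total operation count.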

Target Infusion.

Given the poor model’s learned behaviors ${ω_{p}^{(i)}}_{i = 1}^{d}$ (which were fitted to those of the rich model via solving Eq. 8), we now want to optimize the poor model’s feature aggregation O_p and feature extraction Q_p components so that its predictions will (a) fit those of the rich model on paired data $𝓗_{o}$ ; and also (b) fit the ground truth ${y_{t}^{p}}_{t = 1}^{m}$ provided by the poor data $𝓗_{p}$ . Formally, this can be achieved by solving the following optimization task:

\min_{O_{p}, Q_{p}} 𝓛_{p} ≜ \frac{1}{k} \sum_{t = 1}^{k} \sum_{y = 1}^{c} {(𝓣 (y ∣ x_{t}^{r}; O_{r}, Q_{r}) - 𝓢 (y ∣ x_{t}^{p}; O_{p}, Q_{p}))}^{2} + \frac{1}{m} \sum_{t = 1}^{m} {(1 - 𝓢 (y_{t}^{p} ∣ x_{t}^{p}; O_{p}, Q_{p}))}^{2}

(9)

To understand the above, note that the first term tries to fit poor model $𝓢$ to rich model $𝓣$ in the context of the paired dataset $𝓗_{o} ≜ {(x_{t}^{p}, x_{t}^{r}, y_{t})}_{t = 1}^{k}$ while the second term tries to adjust the poor model’s fitted behavior and target in a local context of its poor data $𝓗_{p} ≜ {(x_{t}^{p}, y_{t}^{p})}_{t = 1}^{m}$ . This allows the second term to act as a filter that downplays distilled patterns which are irrelevant in the poor data context. Again, Eq. 9 can be solved depending on how we parameterize the aforementioned components $(O_{p}, Q_{p})$ .

For example, $O_{p}$ can be set as a linear feed-forward layer with densely connected hidden units, which are activated by a softmax function. Again, Eq. (9) can be solved via standard gradient descent. The cost of computing the gradient depends linearly on the total number n_p of neurons in the parameterization of $O_{p}$ and $Q_{p}$ for the poor model. In particular, the gradient computation complexity for one iteration is $O (n_{p} (m p c + k c^{2}))$ . For τ iterations, the total cost would be $O (τ n_{p} (m p c + k c^{2}))$ .
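
As a hedged illustration of the objective in Eq. 9, the sketch below evaluates the loss from precomputed probability vectors, reading the first term as a class-by-class comparison of the two models on the paired data (as in Eq. 10). It does not perform the optimization itself. Class labels are 0-indexed here for convenience, and the function name and toy numbers are hypothetical.

```python
def target_infusion_loss(rich_probs, poor_probs_paired,
                         poor_probs_own, labels_own):
    """Evaluate the two terms of Eq. 9 from precomputed probabilities.

    rich_probs[t]        : rich model's class probabilities on x_t^r (paired)
    poor_probs_paired[t] : poor model's class probabilities on x_t^p (paired)
    poor_probs_own[t]    : poor model's class probabilities on the poor data
    labels_own[t]        : ground-truth label (0-indexed) on the poor data
    """
    k = len(rich_probs)
    m = len(labels_own)
    # First term: squared gap between rich and poor predictions, per class.
    imitation = sum((t_y - s_y) ** 2
                    for t_vec, s_vec in zip(rich_probs, poor_probs_paired)
                    for t_y, s_y in zip(t_vec, s_vec)) / k
    # Second term: squared shortfall of the probability on the true label.
    supervised = sum((1.0 - probs[y]) ** 2
                     for probs, y in zip(poor_probs_own, labels_own)) / m
    return imitation + supervised

# Toy numbers (assumed): k = 2 paired points, m = 3 poor points, c = 2.
rich_probs = [[0.9, 0.1], [0.2, 0.8]]
poor_probs_paired = [[0.8, 0.2], [0.3, 0.7]]
poor_probs_own = [[0.6, 0.4], [0.1, 0.9], [0.5, 0.5]]
labels_own = [0, 1, 0]
loss = target_infusion_loss(rich_probs, poor_probs_paired,
                            poor_probs_own, labels_own)
print(loss)
```

In a full implementation this value would be differentiated with respect to the parameters of O_p and Q_p, for example by an automatic differentiation library.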

Both steps of behavior infusion and target infusion are succinctly summarized in Algorithm 1 below.

4. Theoretical Analysis

In this section, we provide theoretical analysis for CHEER. Our goal is to show that under certain practical assumptions and with respect to a random instance $x = (x^{p}, x^{r}) \sim 𝓟 (x)$ drawn from an arbitrary data distribution $𝓟$ , the prediction $y^{p} ≜ argmax 𝓢 (y ∣ x^{p})$ of the resulting poor model will agree with that of the rich model, $y^{r} ≜ argmax 𝓣 (y ∣ x^{r})$ , with high probability, thus demonstrating the accuracy of our knowledge infusion algorithm in Section 3.4.

Algorithm 1.

CHEER $(𝓗_{p}, 𝓣 (y ∣ x^{r}), 𝓗_{o})$

1:	Input: rich model $𝓣 (y ∣ x^{r})$ , poor data $𝓗_{p}$ and paired data $𝓗_{o}$
2:	Infuse rich model’s behavior via $𝓗_{o}$
3:	i ← 1
4:	while i ≤ d do
5:	$ω_{p}^{(i)} \leftarrow argmin 𝓛_{i} (ω_{p}^{(i)})$ via (8);
6:	$a_{p}^{(i)} (x^{p}) \leftarrow a_{p}^{(i)} (x^{p}; ω_{p}^{(i)})$
7:	$i \leftarrow i + 1$
8:	end while
9:	$A_{p} \leftarrow [a_{p}^{(1)} (x^{p}; ω_{p}^{(1)}) \dots a_{p}^{(d)} (x^{p}; ω_{p}^{(d)})]$
10:	Infuse rich model’s target via $(𝓗_{o}, 𝓗_{p})$ and A_p
11:	$(Q_{p}, O_{p}) \leftarrow argmin 𝓛_{p}$ via (9)
12:	Output: poor model $𝓢 (y ∣ x^{p}) \leftarrow (A_{p}, Q_{p}, O_{p})$


High-Level Ideas.

To achieve this, our strategy is to first bound the expected target fitting loss (see Definition 2) on a random instance $x ≜ (x^{p}, x^{r}) \sim 𝓟 (x)$ of the poor model with respect to its optimized scoring function A_p, feature extraction Q_p and feature aggregation O_p components via solving Eq. 8 and Eq. 9 in Section 3.4 (see Lemma 1).

We can then characterize the sufficient condition on the target fitting loss (see Definition 1) with respect to a particular instance $x ≜ (x^{p}, x^{r})$ for the poor model to agree with the rich model on their predictions of x^p and x^r, respectively (see Lemma 2). The probability that this sufficient condition happens can then be bounded in terms of the bound on the expected target fitting loss in Lemma 1 (see Theorem 1), which in turn characterizes how likely the poor model will agree with the rich model on the prediction of a random data instance. To proceed, we put forward the following assumptions and definitions:

Definition 1.

Let $θ_{p} ≜ {O_{p}, Q_{p}, A_{p}}$ denote an arbitrary parameterization of the poor model. The particular target fitting loss of the poor model with respect to a data instance $x ≜ (x^{p}, x^{r})$ is

{\hat{𝓛}}_{x} (θ_{p}) ≜ \sum_{y = 1}^{c} {(𝓣 (y ∣ x^{r}) - 𝓢 (y ∣ x^{p}))}^{2} + \frac{1}{m} \sum_{t = 1}^{m} {(1 - 𝓢 (y_{t}^{p} ∣ x_{t}^{p}))}^{2},

(10)

where c denotes the number of classes, and $𝓣 (y ∣ x^{r})$ and $𝓢 (y ∣ x^{p})$ denote the probability scores assigned to candidate class y by the rich and poor models, respectively.

Definition 2.

Let θ_p be defined as in Definition 1. The expected target fitting loss of the poor model with respect to the parameterization θ_p is defined below,

𝓛 (θ_{p}) ≜ E_{x \sim 𝓟 (x)} [{\hat{𝓛}}_{x} (θ_{p})],

(11)

where the expectation is over the unknown data distribution $𝓟 (x)$ .

Definition 3.

Let $x = (x^{p}, x^{r})$ and $y (x) ≜ {argmax}_{y = 1}^{c} 𝓣 (y ∣ x^{r})$ . The robustness constant of the rich model is defined below,

ϕ ≜ \frac{1}{2} \min_{(x^{p}, x^{r})} (𝓣 (y (x) ∣ x^{r}) - \max_{y \neq y (x)} 𝓣 (y ∣ x^{r})),

(12)

That is, if the probability scores of the model are perturbed additively within $ϕ$ , its prediction outcome will not change.
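
The following sketch estimates the robustness constant of Eq. 12 on a finite sample of rich-model outputs and checks the stated invariance on one example. Since Eq. 12 takes the minimum over all instances, the value computed on a sample can only overestimate the true constant; the function names and probability vectors are assumptions made for illustration, and each vector is assumed to have at least two classes with a unique top score.

```python
def margin(probs):
    """Gap between the largest and second-largest probability."""
    ranked = sorted(probs, reverse=True)
    return ranked[0] - ranked[1]

def robustness_constant(rich_predictions):
    """Half of the smallest top-two margin over the given predictions."""
    return 0.5 * min(margin(p) for p in rich_predictions)

def argmax(probs):
    """Index of the largest entry."""
    return max(range(len(probs)), key=lambda j: probs[j])

# Toy rich-model outputs (assumed) for three instances and c = 3 classes.
rich_predictions = [[0.7, 0.2, 0.1],
                    [0.5, 0.3, 0.2],
                    [0.1, 0.1, 0.8]]
phi = robustness_constant(rich_predictions)

# Perturb the tightest instance by slightly less than phi in the worst
# direction: lower the top score and raise the runner-up.
original = rich_predictions[1]
shift = 0.9 * phi
perturbed = [original[0] - shift, original[1] + shift, original[2]]
print(phi, argmax(original), argmax(perturbed))
```

The tightest instance determines the constant, so a rich model with a few near-tied predictions will have a small ϕ even if most of its predictions are confident.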

Assumption 1.

The paired data points $x_{i} = (x_{i}^{p}, x_{i}^{r})$ of $𝓗_{o}$ are assumed to be distributed independently and identically from $𝓟 (x)$ .

Assumption 2.

The hard-label predictions $y^{p} ≜ argmax 𝓢 (y ∣ x^{p})$ and $y^{r} ≜ argmax 𝓣 (y ∣ x^{r})$ of the poor and rich models are unique.

Given the above, we are now ready to state our first result:

Lemma 1.

Let $θ_{p}^{*}$ and ${\hat{θ}}_{p}$ denote the optimal parameterization of the poor model that yields the minimum expected target fitting loss (see Definition 2) and the optimal solution found by minimizing the objective functions in Eq. 8 and Eq. 9, respectively. Let $α ≜ 𝓛 (θ_{p}^{*})$ , $ϵ > 0$ , $δ \in (0, 1)$ and let c denote the number of classes in our predictive task. If $k ≜ | 𝓗_{o} | \geq ({(c + 1)}^{2} / (2 ϵ^{2})) \log (2 / δ)$ then,

\Pr (𝓛 ({\hat{θ}}_{p}) \leq α + 2 ϵ) \geq 1 - δ .

(13)

Proof.

We first note that by definition in Eq. (10), for all x, ${\hat{𝓛}}_{x} (θ) \leq c + 1$ . Then, let us define the empirical target fitting loss as

\hat{𝓛} (θ) ≜ \frac{1}{k} \sum_{i = 1}^{k} {\hat{𝓛}}_{x^{(i)}} (θ),

(14)

where ${{\hat{𝓛}}_{x^{(i)}} (θ)}_{i = 1}^{k}$ can be treated as identically and independently distributed random variables in $(0, c + 1)$ . Then, by Definition 2, it also follows that $𝓛 (θ) = E [\hat{𝓛} (θ)]$ . Thus, by Hoeffding's inequality:

\Pr (| 𝓛 (θ) - \hat{𝓛} (θ) | \leq ϵ) \geq 1 - 2 \exp (- \frac{2 k ϵ^{2}}{{(c + 1)}^{2}}) .

(15)

Then, for an arbitrary $δ \in (0, 1)$ , setting $δ \geq 2 \exp (- 2 k ϵ^{2} / {(c + 1)}^{2})$ and solving for k yields $k \geq ({(c + 1)}^{2} / (2 ϵ^{2})) \log (2 / δ)$ . Thus, for $k \geq ({(c + 1)}^{2} / (2 ϵ^{2})) \log (2 / δ)$ , with probability at least $1 - δ$ , $| 𝓛 (θ) - \hat{𝓛} (θ) | \leq ϵ$ holds simultaneously for all θ. When that happens, we have:

\begin{array}{l} 𝓛 ({\hat{θ}}_{p}) \leq \hat{𝓛} ({\hat{θ}}_{p}) + ϵ \\ \leq \hat{𝓛} (θ_{p}^{*}) + ϵ \leq 𝓛 (θ_{p}^{*}) + 2 ϵ = α + 2 ϵ . \end{array}

(16)

That is, $\Pr (𝓛 ({\hat{θ}}_{p}) \leq α + 2 ϵ) \geq 1 - δ$ , which completes our proof for Lemma 1. Note that the first and third inequalities above follow from $| 𝓛 (θ) - \hat{𝓛} (θ) | \leq ϵ$ applied at ${\hat{θ}}_{p}$ and $θ_{p}^{*}$ , respectively, while the second inequality follows from the definition of ${\hat{θ}}_{p} ≜ {argmin}_{θ} \hat{𝓛} (θ)$ , which implies $\hat{𝓛} ({\hat{θ}}_{p}) \leq \hat{𝓛} (θ_{p}^{*})$ .

This result implies the expected target fitting loss $𝓛 ({\hat{θ}}_{p})$ incurred by our knowledge infusion algorithm in Section 3.4 can be made arbitrarily close (with high confidence) to the optimal expected target fitting loss $α ≜ 𝓛 (θ_{p}^{*})$ with a sufficiently large paired dataset $𝓗_{o}$ .
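
To make the sample-size condition of Lemma 1 concrete, the following minimal sketch evaluates the smallest paired-dataset size k satisfying $k \geq ({(c + 1)}^{2} / (2 ϵ^{2})) \log (2 / δ)$ . The function name and the example values of c, ϵ and δ are illustrative assumptions and do not come from the experiments.

```python
import math


def min_paired_samples(c: int, eps: float, delta: float) -> int:
    """Smallest integer k with k >= ((c + 1)^2 / (2 * eps^2)) * log(2 / delta)."""
    if c < 1 or eps <= 0 or not (0 < delta < 1):
        raise ValueError("need c >= 1, eps > 0 and 0 < delta < 1")
    return math.ceil(((c + 1) ** 2) / (2 * eps ** 2) * math.log(2 / delta))


# Hypothetical setting: 8 classes, tolerance eps = 0.5, confidence 1 - delta = 0.95.
k = min_paired_samples(c=8, eps=0.5, delta=0.05)
print(k)
```

Because the bound scales with $1 / ϵ^{2}$ , halving ϵ roughly quadruples the required number of paired samples, whereas tightening δ only adds a logarithmic factor.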

Lemma 2.

Let $x = (x^{p}, x^{r})$ and let ${\hat{θ}}_{p}$ be as defined in Lemma 1. If the corresponding particular target fitting loss (see Definition 1) ${\hat{𝓛}}_{x} ({\hat{θ}}_{p}) \leq ϕ^{2}$ , then both poor and rich models agree on their predictions for x^p and x^r, respectively. That is, $y^{p} ≜ {argmax}_{y} 𝓢 (y ∣ x^{p})$ and $y^{r} ≜ {argmax}_{y} 𝓣 (y ∣ x^{r})$ are the same.

Proof.

Let y^p and y^r be defined as in the statement of Lemma 2. We have:

\begin{array}{l} 𝓢 (y^{p} ∣ x^{p}) \geq 𝓢 (y^{r} ∣ x^{p}) \geq 𝓣 (y^{r} ∣ x^{r}) - ϕ \\ \geq 𝓣 (y^{p} ∣ x^{r}) + 2 ϕ - ϕ \\ \geq 𝓢 (y^{p} ∣ x^{p}) + 2 ϕ - 2 ϕ = 𝓢 (y^{p} ∣ x^{p}) . \end{array}

(17)

To understand Eq. (17), note that the first inequality follows from the definition of y^p. The second inequality follows from the fact that ${\hat{𝓛}}_{x} ({\hat{θ}}_{p}) \leq ϕ^{2}$ , which implies $\forall y {(𝓢 (y ∣ x^{p}) - 𝓣 (y ∣ x^{r}))}^{2} \leq ϕ^{2}$ and hence, $| 𝓢 (y^{r} ∣ x^{p}) - 𝓣 (y^{r} ∣ x^{r}) | \leq ϕ$ or $𝓢 (y^{r} ∣ x^{p}) \geq 𝓣 (y^{r} ∣ x^{r}) - ϕ$ . The third inequality follows from the definitions of $ϕ$ (see Definition 3) and y^r. Finally, the last inequality follows from the definition of y^p and that ${\hat{𝓛}}_{x} ({\hat{θ}}_{p}) \leq ϕ^{2}$ , which also implies $𝓣 (y^{p} ∣ x^{r}) \geq 𝓢 (y^{p} ∣ x^{p}) - ϕ$ .

Eq. (17) thus implies $𝓢 (y^{p} ∣ x^{p}) \geq 𝓢 (y^{r} ∣ x^{p}) \geq 𝓢 (y^{p} ∣ x^{p})$ and hence, $𝓢 (y^{p} ∣ x^{p}) = 𝓢 (y^{r} ∣ x^{p})$ . Since the hard-label prediction is unique (see Assumption 2), this means $y^{r} = y^{p}$ and hence, by definitions of y^r and y^p, the poor and rich models yield the same prediction. This completes our proof for Lemma 2. Intuitively, Lemma 2 specifies the sufficient condition under which the poor model will yield the same hard-label prediction on a particular data instance x as the rich model. Thus, if we know how likely this sufficient condition is to hold, we will also know how likely the poor model is to imitate the rich model successfully on a random data instance. This intuition is the key result of our theoretical analysis and is formalized below:

Theorem 1.

Let $δ \in (0, 1)$ and let $x = (x^{p}, x^{r})$ denote a random instance drawn from $𝓟 (x)$ . Let $k ≜ | 𝓗_{o} |$ denote the size of the paired dataset $𝓗_{o}$ , which was used to fit the learning behaviors of the poor model to those of the rich model, and let $𝓔$ denote the event that both models agree on their predictions of x. If $k \geq ({(c + 1)}^{2} / (2 ϵ^{2})) \log (2 / δ)$ , then with probability at least 1 − δ,

\Pr (𝓔) \geq 1 - \frac{1}{ϕ^{2}} (α + 2 ϵ) .

(18)

Proof.

Since ${\hat{𝓛}}_{x} ({\hat{θ}}_{p}) \leq ϕ^{2}$ implies $𝓔$ , it follows that

\Pr (𝓔) \geq \Pr ({\hat{𝓛}}_{x} ({\hat{θ}}_{p}) \leq ϕ^{2}) .

(19)

Then, by Markov inequality, we have

\Pr ({\hat{𝓛}}_{x} ({\hat{θ}}_{p}) > ϕ^{2}) \leq ϕ^{- 2} E [{\hat{𝓛}}_{x} ({\hat{θ}}_{p})] = ϕ^{- 2} 𝓛 ({\hat{θ}}_{p}) .

(20)

Subtracting both sides of Eq. (20) from a unit probability yields

\Pr ({\hat{𝓛}}_{x} ({\hat{θ}}_{p}) \leq ϕ^{2}) \geq 1 - ϕ^{- 2} 𝓛 ({\hat{θ}}_{p}),

(21)

where the equality in Eq. (20) holds because $E [{\hat{𝓛}}_{x} ({\hat{θ}}_{p})] = 𝓛 ({\hat{θ}}_{p})$ , which follows immediately from Definitions 1-2 and Assumption 1. Thus, plugging Eq. (21) into Eq. (19) yields

\Pr (𝓔) \geq 1 - ϕ^{- 2} 𝓛 ({\hat{θ}}_{p}) .

(22)

Applying Lemma 1, we know that with probability at least 1 − δ, $𝓛 ({\hat{θ}}_{p}) \leq α + 2 ϵ$ . Thus, plugging this into Eq. (22) yields

\Pr (𝓔) \geq 1 - ϕ^{- 2} (α + 2 ϵ) .

(23)

That is, by the union bound, with probability at least $1 - δ - ϕ^{- 2} (α + 2 ϵ)$ , the poor model yields the same prediction as the rich model on an arbitrary instance, i.e., knowledge infusion succeeds. This completes our proof for Theorem 1.
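
As a small numerical illustration of the final bound $1 - δ - (1 / ϕ^{2}) (α + 2 ϵ)$ , the sketch below evaluates it for given α, ϵ, ϕ and δ. The function name and the example values are hypothetical; they are chosen only to show how the terms interact and are not taken from the experiments.

```python
def agreement_lower_bound(alpha: float, eps: float, phi: float, delta: float) -> float:
    """Lower bound 1 - delta - (alpha + 2 * eps) / phi^2, clipped at 0."""
    if phi <= 0 or not (0 < delta < 1):
        raise ValueError("need phi > 0 and 0 < delta < 1")
    bound = 1.0 - delta - (alpha + 2.0 * eps) / phi ** 2
    # A negative value means the bound is vacuous, so report 0 instead.
    return max(0.0, bound)


# Hypothetical values: small optimal loss, small tolerance, robustness margin 0.2.
print(agreement_lower_bound(alpha=0.001, eps=0.002, phi=0.2, delta=0.05))
```

The bound is informative only when $α + 2 ϵ$ is small relative to $ϕ^{2}$ ; otherwise it drops to zero and says nothing about agreement.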

5. Experiments

5.1. Experimental Settings

Datasets.

We use the following datasets in our evaluation.

MIMIC-III Critical Care Database (MIMIC-III)³ is collected from more than 58,000 ICU patients at the Beth Israel Deaconess Medical Center (BIDMC) from June 2001 to October 2012 (40). We collect a subset of 9,488 patients who have one of the following (most frequent) diseases in their main diagnosis: (1) acute myocardial infarction, (2) chronic ischemic heart disease, (3) heart failure, (4) intracerebral hemorrhage, (5) specified procedures complications, (6) lung diseases, (7) endocardium diseases, and (8) septicaemia. The task is disease diagnosis classification (i.e., predicting which of 8 diseases the patient has) based on features collected from 6 data channels: vital sign time series including Heart Rate (HR), Respiratory Rate (RR), Blood Pressure mean (BPm), Blood Pressure systolic (BPs), Blood Pressure diastolic (BPd) and Blood Oxygen Saturation (SpO2). We randomly divided the data into training (80%), validation (10%) and testing (10%) sets.
PTB Diagnostic ECG Database (PTBDB) ⁴ is a 15-channel 1000 Hz ECG time series including 12 conventional leads and 3 Frank leads (41; 42) collected from both healthy controls and cases of heart diseases, which amounts to a total number of 549 records. The given task is to classify ECG to one of the following categories: (1) myocardial infarction, (2) healthy control, (3) heart failure, (4) bundle branch block, (5) dysrhythmia, and (6) hypertrophy. We down-sampled the data to 200 Hz and pre-processed it following the “frame-by-frame” method (43) with sliding windows of 10-second duration and 5-second stepping between adjacent windows.
NEDC TUH EEG Artifact Corpus (EEG) ⁵ is a 22-channel 500 Hz sensor time series collected from over 30,000 EEGs spanning the years from 2002 to present (44). The task is to classify 5 types of EEG events including (1) eye movements (EYEM), (2) chewing (CHEW), (3) shivering (SHIV), (4) electrode pop, electrode static, and lead artifacts (ELPP), and (5) muscle artifacts (MUSC). We randomly divided the data into training (80%), validation (10%) and testing (10%) sets by records.
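
The windowing step described for PTBDB (10-second windows with a 5-second step on 200 Hz data) can be sketched as follows. This is only an illustration of the window and step arithmetic, not the "frame-by-frame" method of (43) itself; the function name, the synthetic record and the choice to drop trailing samples that do not fill a window are assumptions.

```python
import numpy as np


def sliding_windows(signal: np.ndarray, fs: int, window_sec: float, step_sec: float) -> np.ndarray:
    """Cut a (n_samples, n_channels) signal into overlapping windows.

    Returns an array of shape (n_windows, window_len, n_channels).
    Trailing samples that do not fill a whole window are dropped (assumption).
    """
    win = int(round(window_sec * fs))
    step = int(round(step_sec * fs))
    n_samples, n_channels = signal.shape
    if n_samples < win:
        return np.empty((0, win, n_channels), dtype=signal.dtype)
    starts = range(0, n_samples - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])


# Synthetic 60-second, 15-channel record already down-sampled to 200 Hz.
record = np.random.default_rng(0).standard_normal((200 * 60, 15))
windows = sliding_windows(record, fs=200, window_sec=10, step_sec=5)
print(windows.shape)
```

With a 5-second step and a 10-second window, adjacent windows overlap by half, so each record yields roughly twice as many training segments as non-overlapping windows would.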

The statistics of the above datasets, as well as the architectures of the rich and poor models on each dataset are summarized in the tables below.

Baselines.

We compare CHEER against the following baselines:

Direct:

In all experiments, we train a neural network model parameterized with CHEER directly on the poor dataset without knowledge infusion from the rich model. The resulting model can be used to produce a lower bound of predictive performance on each dataset.

Knowledge Distilling (KD) (14):

KD transfers predictive power from teacher to student models via soft labels produced by the teacher model. In our experiments, all KD models have complexity similar to that of the infused model generated by CHEER. The degree of label softness (i.e., the temperature parameter of the soft-max activation function) in KD is set to 5.

Attention Transfer (AT) (12):

AT enhances shallow neural networks by leveraging an attention mechanism (13) to learn attention behavior similar to that of a full-fledged deep neural network (DNN). In our experiments, we first train a DNN with an attention component, which can be parameterized by CHEER. The trained attention component of the DNN is then transferred to that of a shallow neural network in the poor-data environment via activation-based attention transfer with L₂-normalization.

Heterogeneous Domain Adaptation (HDA) (27):

Maximum Mean Discrepancy (MMD) loss (45) has been successfully used in domain adaptation, e.g., in (8). However, one drawback is that these works only consider homogeneous settings where the source and target domains have the same feature space or use the same neural network architecture. To mitigate this limitation, HDA (27) proposed a modification of the soft MMD loss to handle heterogeneity between the source and target domains.
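
For the KD baseline, the role of the temperature parameter (set to 5 above) can be illustrated with a short NumPy sketch. It shows the commonly used soft-label formulation, in which both teacher and student logits are divided by the temperature before the soft-max; whether the baseline combines this term with a hard-label loss, and with what weight, is not stated here, so the loss below is an assumption rather than the exact objective used in the experiments.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=5.0):
    """Soft-max over the last axis after dividing logits by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_label_cross_entropy(student_logits, teacher_logits, temperature=5.0):
    """Mean cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    return float(-(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean())


# Hypothetical logits for two instances and three classes.
teacher = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]])
student = np.array([[2.0, 1.5, 0.0], [1.0, 1.0, 1.0]])
print(softmax_with_temperature(teacher, temperature=1.0)[0])
print(softmax_with_temperature(teacher, temperature=5.0)[0])
print(soft_label_cross_entropy(student, teacher))
```

A higher temperature flattens the teacher distribution, so the student also receives information about the relative scores of the non-predicted classes.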

Performance Metrics.

The tested methods’ prediction performance was compared based on their corresponding areas under the Precision-Recall (PR-AUC) and Receiver Operating Characteristic curves (ROC-AUC) as well as the accuracy and F1 score, which are often used in multi-class classification to evaluate the tested method’s prediction quality. In particular, accuracy is measured by the ratio between the number of correctly classified instances and the total number of test instances. The F1 score is the harmonic mean of precision (the proportion of true positive cases among the predicted positive cases) and recall (the proportion of positive cases that are correctly identified). A threshold of 0.5 determines whether a predictive probability for being positive is large enough (larger than the threshold) to assign a positive label to the case being considered.

Then, we use the average of F1 scores evaluated for each label (i.e., macro-F1 score) to summarize the averaged predictive performance of all tested methods across all classes. The ROC-AUC and PR-AUC scores are computed based on predicted probabilities and ground-truths directly. For ROC-AUC, it is the area under the curve produced by points of true positive rate (TPR) and false positive rate (FPR) at various threshold settings. Likewise, the PR-AUC score is the area under the curve produced by points of (precision, recall) at various threshold settings. In our experiments, we report the average PR-AUC and ROC-AUC since all three tasks are multi-class classification.
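
The macro-F1 computation described above can be sketched as follows: each class is treated as a one-vs-rest problem, a positive label is assigned when the predicted probability exceeds 0.5, and the per-class F1 scores are averaged. The function name, the toy inputs and the convention of scoring a class as 0 when it has no true or predicted positives are assumptions.

```python
import numpy as np


def macro_f1(y_true, y_prob, threshold=0.5):
    """Average of per-class F1 scores with one-vs-rest thresholding.

    y_true: (n,) integer class labels; y_prob: (n, n_classes) probabilities.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    scores = []
    for k in range(y_prob.shape[1]):
        truth = y_true == k
        pred = y_prob[:, k] > threshold
        tp = np.sum(truth & pred)
        fp = np.sum(~truth & pred)
        fn = np.sum(truth & ~pred)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0 (assumption).
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


y_true = [0, 1, 2, 1]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.8, 0.1],
          [0.3, 0.3, 0.4],
          [0.6, 0.3, 0.1]]
print(macro_f1(y_true, y_prob))
```

Because every class contributes equally to the average, a class that is never predicted pulls the macro-F1 down sharply, which is consistent with macro-F1 being much lower than accuracy on class-imbalanced data.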

Training Details

For each method, the reported results (mean performance and its empirical standard deviation) are averaged over 20 independent runs. For each run, we randomly split the entire dataset into training (80%), validation (10%) and test sets (10%). All models are built using the training and validation sets and then evaluated using the test set. We use the Adam optimizer (46) to train each model, with the default learning rate set to 0.001. The number of training epochs for each model is set to 200 and an early stopping criterion is invoked if the performance does not improve in 20 epochs. All models are implemented in Keras with Tensorflow backend and tested on a system equipped with 64GB RAM, 12 Intel Core i7–6850K 3.60GHz CPUs and Nvidia GeForce GTX 1080. For fair comparison, we use the same model architecture and hyper-parameter setting for Direct, KD, AT, HDA and CHEER. For the rich dataset, we use the entire dataset with the full set of data features. For the poor dataset, we vary the size of the paired dataset and the number of features to analyze the effect of knowledge infusion in different data settings, as shown in Section 5.3. The default maximum amount of paired data is set to 50% of the entire dataset, and the default number of data features used in the poor dataset is set to half of the entire set of data features. In Section 5.2, to compare the tested methods’ knowledge infusion performance, we use the default settings for all models (including CHEER and other baselines).
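
The evaluation protocol above (20 independent runs, each with a fresh random 80%/10%/10% split, reported as mean and standard deviation) can be sketched as follows. The function names, the use of the run index as the random seed and the use of the sample standard deviation (ddof=1) are assumptions made for illustration.

```python
import numpy as np


def split_indices(n_records, seed, train_frac=0.8, val_frac=0.1):
    """Randomly partition record indices into train, validation and test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_records)
    n_train = int(round(train_frac * n_records))
    n_val = int(round(val_frac * n_records))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


def summarize(scores):
    """Mean and sample standard deviation of per-run scores."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))


# One split per run; 9,488 is the number of MIMIC-III subjects used in this paper.
splits = [split_indices(9488, seed=run) for run in range(20)]
train_idx, val_idx, test_idx = splits[0]
print(len(train_idx), len(val_idx), len(test_idx))
```

Re-drawing the split in every run means the reported standard deviation reflects both the randomness of training and the variability of the test set.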

5.2. Performance Comparison

Results on MIMIC-III, PTBDB and EEG datasets are reported in Table 9, Table 10 and Table 11, respectively. In this experiment, we set the size of paired dataset to 50% of the size of the rich data, and set the number of features used in poor-data environment to 3, 7, 11 for MIMIC-III, PTBDB and EEG, respectively. In all datasets, it can be observed that the infused model generated by CHEER consistently achieves the best predictive performance among those of the tested methods, which demonstrates the advantage of our knowledge infusion framework over existing transfer methods such as KD and AT.

TABLE 9:

Performance comparison on MIMIC-III dataset.

	ROC-AUC	PR-AUC	Accuracy	Macro-F1
Direct	0.622±0.062	0.208±0.044	0.821±0.012	0.141±0.057
KD	0.686±0.043	0.257±0.029	0.833±0.012	0.196±0.049
AT	0.645±0.064	0.225±0.044	0.826±0.013	0.167±0.057
HDA	0.655±0.034	0.225±0.029	0.824±0.011	0.157±0.038
CHEER	0.697±0.024	0.266±0.023	0.835±0.010	0.207±0.030

Rich Model	0.759±0.014	0.341±0.024	0.852±0.007	0.295±0.027

TABLE 10:

Performance comparison on PTBDB dataset.

	ROC-AUC	PR-AUC	Accuracy	Macro-F1
Direct	0.686±0.114	0.404±0.088	0.920±0.015	0.275±0.057
KD	0.714±0.096	0.439±0.093	0.925±0.016	0.295±0.043
AT	0.703±0.117	0.402±0.078	0.921±0.016	0.283±0.056
HDA	0.685±0.113	0.430±0.080	0.924±0.011	0.299±0.051
CHEER	0.724±0.103	0.441±0.080	0.927±0.017	0.299±0.052

Rich Model	0.732±0.110	0.483±0.101	0.930±0.017	0.366±0.071

TABLE 11:

Performance comparison on EEG dataset.

	ROC-AUC	PR-AUC	Accuracy	Macro-F1
Direct	0.797±0.064	0.506±0.083	0.888±0.015	0.425±0.078
KD	0.772±0.083	0.512±0.082	0.888±0.021	0.445±0.097
AT	0.793±0.071	0.502±0.082	0.884±0.012	0.417±0.062
HDA	0.805±0.050	0.523±0.073	0.884±0.019	0.455±0.073
CHEER	0.808±0.066	0.535±0.061	0.895±0.016	0.460±0.076

Rich Model	0.854±0.069	0.657±0.077	0.922±0.014	0.595±0.070

Notably, in terms of the macro-F1 scores, CHEER improves over KD, AT, HDA and Direct by 5.60%, 23.95%, 31.84% and 46.80%, respectively, on the MIMIC-III dataset. The infused model generated by CHEER also achieves 81.69% of the rich model’s performance on PTBDB in terms of the macro-F1 score (i.e., 0.299/0.366, see Table 10) while adopting an architecture that is 15.14 times smaller than the rich model’s (see Tables 5 and 6). We have also performed a significance test to validate the improvement of CHEER over the baselines, with results reported in Table 12.
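
The relative improvements quoted above can be checked against the macro-F1 column of Table 9 with a few lines of arithmetic. Since the table entries are rounded to three decimals, the recomputed percentages agree with the reported ones only up to small rounding differences; the variable and function names are illustrative.

```python
# Macro-F1 means on MIMIC-III, copied from Table 9.
macro_f1_mimic = {"Direct": 0.141, "KD": 0.196, "AT": 0.167, "HDA": 0.157, "CHEER": 0.207}


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


gains = {
    name: relative_gain(macro_f1_mimic["CHEER"], score)
    for name, score in macro_f1_mimic.items()
    if name != "CHEER"
}

# Fraction of the rich model's macro-F1 retained on PTBDB (Table 10), in percent.
ptbdb_retained = 100.0 * 0.299 / 0.366

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
print(f"PTBDB retained: {ptbdb_retained:.2f}%")
```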

TABLE 5:

The architecture of rich model in PTBDB, which includes a total of 688.8k parameters.

Layer	Type	Hyper-parameters	Activation
1	Split	n_seg=10
2	Convolution1D	n_filter=128, kernel_size=16, stride=2	ReLU
3	Convolution1D	n_filter=128, kernel_size=16, stride=2	ReLU
4	Convolution1D	n_filter=128, kernel_size=16, stride=2	ReLU
5	AveragePooling1D
6	LSTM	hidden_units=128	ReLU
7	PositionAttention
8	Dense	hidden_units=n_classes	Linear
9	Softmax

TABLE 6:

The architecture of the infused, poor model used by CHEER, Direct, KD and AT for knowledge infusion in PTBDB, which includes a total of 45.0k parameters.

Layer	Type	Hyper-parameters	Activation
1	Split	n_seg=10
2	Convolution1D	n_filter=32, kernel_size=16, stride=2	ReLU
3	Convolution1D	n_filter=32, kernel_size=16, stride=2	ReLU
4	Convolution1D	n_filter=32, kernel_size=16, stride=2	ReLU
5	AveragePooling1D
6	LSTM	hidden_units=32	ReLU
7	PositionAttention
8	Dense	hidden_units=n_classes	Linear
9	Softmax

TABLE 12:

The p-values of corresponding t-tests (on one-tail) for every two samples of ROC-AUC scores of CHEER and a tested benchmark (i.e., Direct, KD, AT and HDA) on MIMIC-III, PTBDB and EEG datasets, respectively. The corresponding significance percentage (s) is provided in the parentheses next to each reported p-value.

	MIMIC-III	PTBDB	EEG
with Direct	0.0000 (s = 01%)	0.0154 (s = 05%)	0.1720 (s = 20%)
with KD	0.0450 (s = 05%)	0.1874 (s = 20%)	0.0124 (s = 05%)
with AT	0.0007 (s = 01%)	0.0421 (s = 05%)	0.1821 (s = 20%)
with HDA	0.0000 (s = 01%)	0.0741 (s = 10%)	0.4823 (s = 20%)

Furthermore, it can also be observed that the performance variance of the infused model generated by CHEER (as reflected in the reported standard deviation) is the lowest among all tested methods’, which suggests that CHEER’s knowledge infusion is more robust. Our investigation in Section 5.3 further shows that this is the result of CHEER being able to perform both target and behavior infusion. This helps the infused model generated by CHEER achieve better and more stable performance than those of KD, HDA and AT, which match either the prediction target or the reasoning behavior of the rich and poor models (but not both). This consequently leads to their less robust performance with wide fluctuation in different data settings, as demonstrated next in Section 5.3.

5.3. Analyzing Knowledge Infusion Effect in Different Data Settings

To further analyze the advantages of CHEER’s knowledge infusion over those of the existing works (e.g., KD and AT), we perform additional experiments to examine how the variations in (1) sizes of the paired dataset and (2) the number of features of the poor dataset will affect the infused model’s performance. The results are shown in Fig. 3 and Fig. 4, respectively. In particular, Fig. 3 shows how the ROC-AUC of the infused model generated by each tested method varies when we increase the ratio between the size of the paired dataset and that of the rich data. Fig. 4, on the other hand, shows how the infused model’s ROC-AUC varies when we increase the number of features of the poor dataset. In both settings, the reported performance of all methods is averaged over 10 independent runs.

Fig. 3: — Graphs of achieved ROC-AUC scores on (a) MIMIC-III, (b) PTBDB and (c) EEG of the infused models generated by Direct, KD, AT, HDA and CHEER with different sizes of the paired datasets. The X-axis shows the ratio between the size of the paired dataset and that of the *rich* dataset.

Fig. 4: — Graphs of achieved ROC-AUC scores on (a) MIMIC-III, (b) PTBDB and (c) EEG of the infused models generated by Direct, KD, AT, HDA and CHEER with different numbers of data channels (i.e., features) included in the poor dataset. Notice that all methods use the same set of selected features for each run.

Varying Paired Data.

Fig. 3 shows that (a) CHEER outperforms all baselines with varying sizes of the paired data and (b) direct learning on poor data yields significantly worse performance across all settings, both of which are consistent with our earlier observations on the superior knowledge infusion performance of CHEER. The infused models generated by KD, HDA and AT all perform consistently worse than that of CHEER by a substantial margin across all datasets. Their performance also fluctuates over a much wider range (especially on EEG data) than that of CHEER when we vary the size of the paired datasets. This shows that CHEER’s knowledge infusion is more data efficient and robust under different data settings.

On another note, we also notice that when the amount of paired data increases from 20% to 30% of the rich data, there is a performance drop that happens to all tested methods with attention transfer (i.e., CHEER and AT) on MIMIC-III but not on PTBDB and EEG. This is, however, not surprising since, unlike PTBDB and EEG, MIMIC-III comprises more heterogeneous types of signals and its data distribution is also more unbalanced, which affects the attention learning and causes similar performance drop patterns between methods with attention transfer such as CHEER and AT.

Varying The Number of Features.

Fig. 4 shows how the prediction performance of the infused models generated by tested methods changes as we vary the number of features in poor data. In particular, it can be observed that the performance of CHEER’s infused model on all datasets increases steadily as we increase the number of input features observed by the poor model, which is expected.

On the other hand, it is perhaps surprising that as the number of features increases, the performance of KD, HDA, AT and Direct fluctuates more widely on PTBDB and EEG datasets, which is in contrast to our observation of CHEER. This is, however, not unexpected since different features differ in informativeness and hence, to utilize and combine them effectively, we need an accurate feature weighting/scoring mechanism. This is not possible in the cases of Direct, KD, HDA and AT because (a) Direct completely lacks knowledge infusion from the rich model, (b) KD and HDA only perform target transfer from the rich to the poor model and ignore the weighting/scoring mechanism, and (c) AT only transfers the scoring mechanism to the poor model (i.e., attention transfer) but not the feature aggregation mechanism, which is also necessary to combine the weighted features correctly. In contrast, CHEER transfers both the weighting/scoring (via behavior infusion) and feature aggregation (via target infusion) mechanisms, and thus performs more robustly and is able to produce steady gains (without radical fluctuations) in terms of performance when the number of features increases. This supports our earlier observations regarding the lowest performance variance achieved by the infused model of CHEER, which also suggests that CHEER’s knowledge infusion scheme is more robust than those of KD, HDA and AT.

Finally, to demonstrate how the performance of CHEER varies with different choices of feature sets for poor data, we computed the mutual information between each feature and the class label, and then ranked them in decreasing order. The performance of CHEER on all datasets is then reported in two cases, which include (a) K features with highest mutual information, and (b) K features with lowest mutual information. In particular, the reported results (see Table 13) show that a feature set with low mutual information to the class label will induce worse transfer performance and conversely, a feature set (with the same number of features) with high mutual information will likely improve the transfer performance.
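
The feature-ranking step described above can be sketched with a simple histogram-based estimate of mutual information. How the mutual information was estimated for time-series channels is not specified, so this sketch assumes each channel has been reduced to one scalar summary per record (a hypothetical preprocessing step) and uses equal-width binning; the function names and the synthetic data are likewise illustrative.

```python
import numpy as np


def mutual_information(feature, labels, n_bins=10):
    """Histogram estimate (in nats) of the MI between a scalar feature and class labels."""
    feature = np.asarray(feature, dtype=float)
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    binned = np.digitize(feature, edges[1:-1])  # bin indices in 0 .. n_bins - 1
    classes, label_idx = np.unique(np.asarray(labels), return_inverse=True)
    joint = np.zeros((n_bins, classes.size))
    np.add.at(joint, (binned, label_idx.ravel()), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def rank_features(X, y, k):
    """Return indices of the k highest-MI and k lowest-MI columns of X, plus all scores."""
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-scores)  # decreasing mutual information
    return order[:k], order[-k:], scores


# Synthetic example: column 0 depends on the label, the other columns are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)
X = rng.standard_normal((600, 4))
X[:, 0] += 3.0 * y
top, bottom, scores = rank_features(X, y, k=1)
print(top, bottom, np.round(scores, 3))
```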

TABLE 13:

CHEER’s performance on MIMIC-III, PTBDB and EEG with (left-column) K features with highest mutual information (MI) to the class label as features of the poor dataset; and (right-column) K features with lowest mutual information to the class label as features of the poor dataset. K is set to 2 for MIMIC-III, 5 for PTBDB and 7 for EEG.

Dataset	Highest MI Features	Lowest MI Features
MIMIC-III	0.672 ± 0.044	0.657 ± 0.012
PTBDB	0.646 ± 0.133	0.639 ± 0.115
EEG	0.815 ± 0.064	0.807 ± 0.042

To further inspect the effects of used modalities in CHEER, we also computed the averaged entropy of each modality across all classes, and ranked them in decreasing order for each dataset. Then, we selected a small number of top-ranked, middle-ranked and bottom-ranked features from the entire set of modalities. These are marked as Top, Middle and Bottom respectively in Table 14.

TABLE 14:

CHEER’s performance using different sets of data features with different information quality (as measured by their entropy scores).

	MIMIC-III	PTBDB	EEG
Top	0.688 ± 0.010	0.710 ± 0.131	0.839 ± 0.044
Middle	0.676 ± 0.014	0.682 ± 0.132	0.788 ± 0.065
Bottom	0.664 ± 0.012	0.633 ± 0.130	0.758 ± 0.066

Rich	0.759 ± 0.014	0.732 ± 0.110	0.854 ± 0.069

The number of selected features for each rank is 2, 4 and 5 for MIMIC-III, PTBDB and EEG, respectively. Finally, we report the ROC-AUC scores achieved by the corresponding infused models generated by CHEER for each of those feature settings in Table 14. It can be observed from this table that the ROC-AUC of the infused model degrades consistently across all datasets when we change the features of poor data from those in Top to Middle and then to Bottom. This verifies our earlier statement that different data features differ in informativeness.

6. Conclusion

This paper develops a knowledge infusion framework (named CHEER) that helps infuse knowledge acquired by a rich model trained on feature-rich data into a poor model which only has access to feature-poor data. The developed framework leverages a new model representation to reparameterize the rich model and, consequently, consolidate its learning behaviors into succinct summaries that can be infused efficiently into the poor model to improve its performance. To demonstrate the efficiency of CHEER, we evaluated it on multiple real-world datasets, with very promising results. We also develop a formal theoretical analysis to guarantee the performance of CHEER under practical assumptions. Future extensions of CHEER include the following potential settings: incorporating meta/contextual information as part of the features and/or learning from data with missing labels.

Fig. 2: — The DNN Implementation of CHEER.

TABLE 2:

Data statistics.

	MIMIC-III	PTBDB	EEG
# subjects	9,488	549	213
# classes	8	6	5
# features	6	15	22
Average length	48	108,596	13,007
Sample frequency	1 per hour	1,000 Hz	500 Hz

TABLE 3:

The architecture of rich model in MIMIC-III, which includes a total of 51.6k parameters.

Layer	Type	Hyper-parameters	Activation
1	Split	n_seg=6
2	Convolution1D	n_filter=64, kernel_size=4, stride=1	ReLU
3	Convolution1D	n_filter=64, kernel_size=4, stride=1	ReLU
4	AveragePooling1D
5	LSTM	hidden_units=64	ReLU
6	PositionAttention
7	Dense	hidden_units=n_classes	Linear
8	Softmax

TABLE 4:

The architecture of the infused, poor model used by CHEER, Direct, KD and AT for knowledge infusion in MIMIC-III, which includes a total of 3.5k parameters.

Layer	Type	Hyper-parameters	Activation
1	Split	n_seg=6
2	Convolution1D	n_filter=16, kernel_size=4, stride=1	ReLU
3	Convolution1D	n_filter=16, kernel_size=4, stride=1	ReLU
4	AveragePooling1D
5	LSTM	hidden_units=16	ReLU
6	PositionAttention
7	Dense	hidden_units=n_classes	Linear
8	Softmax

TABLE 7:

The architecture of rich model in EEG, which includes a total of 417.4k parameters.

Layer	Type	Hyper-parameters	Activation
1	Split	n_seg=5
2	Convolution1D	n_filter=128, kernel_size=8, stride=2	ReLU
3	Convolution1D	n_filter=128, kernel_size=8, stride=2	ReLU
4	Convolution1D	n_filter=128, kernel_size=8, stride=2	ReLU
5	AveragePooling1D
6	LSTM	hidden_units=128	ReLU
7	PositionAttention
8	Dense	hidden_units=n_classes	Linear
9	Softmax

TABLE 8:

The architecture of the infused, poor model used by CHEER, Direct, KD and AT for knowledge infusion in EEG, which includes a total of 51.6k parameters.

Layer	Type	Hyper-parameters	Activation
1	Split	n_seg=5
2	Convolution1D	n_filter=32, kernel_size=8, stride=2	ReLU
3	Convolution1D	n_filter=32, kernel_size=8, stride=2	ReLU
4	Convolution1D	n_filter=32, kernel_size=8, stride=2	ReLU
5	AveragePooling1D
6	LSTM	hidden_units=32	ReLU
7	PositionAttention
8	Dense	hidden_units=n_classes	Linear
9	Softmax

Biographies

Cao Xiao is the Director of Machine Learning at Analytics Center of Excellence of IQVIA. She is leading IQVIA’s North America machine learning teams to drive next generation healthcare AI. Her research focuses on developing machine learning and deep learning models to solve diverse real world healthcare challenges. Particularly, she is interested in deep phenotyping on electronic health records, graph neural networks for insilico drug modeling, patient segmentation for neuro-degenerative diseases. The results of her research have been published in leading AI conferences including NIPS, ICLR, KDD, AAAI, IJCAI, SDM, ICDM, WWW and top health informatics journals such as Nature Scientific Reports and JAMIA. Prior to IQVIA, she acquired her Ph.D. degree from University of Washington, Seattle in 2016 and was a research staff member in the AI for Healthcare team at IBM Research from 2017 to 2019 and served as member of the IBM Global Technology Outlook Committee from 2018 to 2019.

Trong Nghia Hoang is a Research Staff Member of MIT-IBM Watson AI Lab, IBM Research. His research interests span the areas of stochastic planning, active learning and Bayesian nonparametric methods for distributed/federated learning. The results of his research have been published in premier AI/ML conferences such as AAAI, IJCAI, AAMAS and ICML. Prior to joining IBM, he obtained his Ph.D. in Computer Science from National University of Singapore (NUS) in 2015. From 2015 to 2017, he was a Research Fellow at Sensor-enhanced Social Media (SeSaMe) Centre, Interactive and Digital Media Institute (IDMI), NUS. He then worked as a Postdoctoral Research Associate at Laboratory of Information and Decision Systems (LIDS), MIT during 2017–2018.

Shenda Hong is a fifth-year Ph.D. student in School of Electronics Engineering and Computer Sciences at Peking University. He received B.S. from Beijing University of Posts and Telecommunications in 2014. His research interests are machine learning and data mining for healthcare, especially deep learning methods on Electronic Health Records and biomedical signals.

Tengfei Ma is a research staff member of IBM T.J. Watson Research Center. Prior to that, he obtained his PhD from The University of Tokyo in 2015; and he was a researcher in IBM Research-Tokyo from 2015 to 2016. His research interests have spanned a range of topics in machine learning, natural language processing and healthcare. Particularly his recent research is focused on graph neural networks and deep learning based healthcare analytics. The results of his research have been published in premier AI conferences such as NeurIPS, ICLR, IJCAI, AAAI, EMNLP.

Jimeng Sun is an Associate Professor of College of Computing at Georgia Tech. Prior to Georgia Tech, he was a researcher at IBM TJ Watson Research Center. His research focuses on data mining and health analytics, especially in tensor factorizations, deep learning, and large-scale predictive modeling systems. Dr. Sun has been collaborating with many healthcare organizations. He published over 120 papers and filed over 20 patents (5 granted). He has received SDM/IBM early career research award 2017, ICDM best research paper award in 2008, SDM best research paper award in 2007, and KDD Dissertation runner-up award in 2008. Dr. Sun received B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, PhD in Computer Science from Carnegie Mellon University in 2007 advised by Christos Faloutsos.

Footnotes

^1.

We embed these channels jointly rather than separately to capture their latent correlation.

^2.

We use the notation [a]_t to denote the t-th component of vector a.

^3.

https://mimic.physionet.org/

^4.

https://physionet.org/physiobank/database/ptbdb/

^5.

https://www.isip.piconepress.com/projects/tuh_eeg/html/overview.shtml

Contributor Information

Cao Xiao, Analytics Center of Excellence, IQVIA, Cambridge, MA, 02139.

Trong Nghia Hoang, MIT-IBM Watson AI Lab, Cambridge, MA, 02142.

Shenda Hong, Department of Computer Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332.

Tengfei Ma, IBM Research, Yorktown Heights, NY, 10598.

Jimeng Sun, Department of Computer Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332.

References

[1].Xiao C, Choi E, and Sun J, “Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review,” Journal of the American Medical Informatics Association, 2018. [DOI] [PMC free article] [PubMed]
[2].Salehinejad H, Barfett J, Valaee S, and Dowdell T, “Training neural networks with very little data - A draft,” 2018.
[3].Glorot X, Bordes A, and Bengio Y, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Proceedings of the 28th International Conference on Machine Learning, ser. ICML’11. USA: Omnipress, 2011, pp. 513–520. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104482.3104547 [Google Scholar]
[4].Chen M, Xu Z, Weinberger KQ, and Sha F, “Marginalized denoising autoencoders for domain adaptation,” in Proceedings of the 29th International Conference on Machine Learning, ser. ICML’12. USA: Omnipress, 2012, pp. 1627–1634. [Online]. Available: http://dl.acm.org/citation.cfm?id=3042573.3042781 [Google Scholar]
[5].Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, and Lempitsky V, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res, vol. 17, no. 1, pp. 2096–2030, Jan. 2016. [Online]. Available: http://dl.acm.org/citation.cfm?id=2946645.2946704 [Google Scholar]
[6].Zhou G, Xie Z, Huang X, and He T, “Bi-transferring deep neural networks for domain adaptation,” in ACL, 2016.
[7].Bousmalis K, Trigeorgis G, Silberman N, Krishnan D, and Erhan D, “Domain separation networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. USA: Curran Associates Inc., 2016, pp. 343–351. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157096.3157135 [Google Scholar]
[8].Long M, Cao Y, Wang J, and Jordan MI, “Learning transferable features with deep adaptation networks,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, 2015, pp. 97–105. [Online]. Available: http://proceedings.mlr.press/v37/long15.html [Google Scholar]
[9].Huang S, Zhao J, and Liu Z, “Cost-effective training of deep cnns with active model adaptation,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19–23, 2018, 2018, pp. 1580–1588. [Online]. Available: 10.1145/3219819.3220026 [DOI] [Google Scholar]
[10].Rozantsev A, Salzmann M, and Fua P, “Beyond sharing weights for deep domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 41, no. 4, pp. 801–814, 2019. [Online]. Available: 10.1109/TPAMI.2018.2814042 [DOI] [PubMed] [Google Scholar]
[11].Xu Z, Huang S, Zhang Y, and Tao D, “Webly-supervised fine-grained visual categorization via deep domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 40, no. 5, pp. 1100–1113, 2018. [Online]. Available: 10.1109/TPAMI.2016.2637331 [DOI] [PubMed] [Google Scholar]
[12].Zagoruyko S. and Komodakis N, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in ICLR, 2017. [Online]. Available: https://arxiv.org/abs/1612.03928
[13].Bahdanau D, Cho K, and Bengio Y, “Neural machine translation by jointly learning to align and translate,” ICLR, 2015.
[14].Hinton G, Vinyals O, and Dean J, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[15].Ba J. and Caruana R, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.
[16].Sau BB and Balasubramanian VN, “Deep model compression: Distilling knowledge from noisy teachers,” arXiv preprint arXiv:1610.09650, 2016.
[17].Radosavovic I, Dollár P, Girshick R, Gkioxari G, and He K, “Data distillation: Towards omni-supervised learning,” arXiv preprint arXiv:1712.04440, 2017.
[18].Yim J, Joo D, Bae J, and Kim J, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2017. [Google Scholar]
[19].Lopez-Paz D, Bottou L, Schölkopf B, and Vapnik V, “Unifying distillation and privileged information,” ICLR, 2015.
[20].Wei P, Ke Y, and Goh CK, “A general domain specific feature transfer framework for hybrid domain adaptation,” IEEE Trans. Knowl. Data Eng, vol. 31, no. 8, pp. 1440–1451, 2019. [Online]. Available: 10.1109/TKDE.2018.2864732 [DOI] [Google Scholar]
[21].Wang B, Qiu M, Wang X, Li Y, Gong Y, Zeng X, Huang J, Zheng B, Cai D, and Zhou J, “A minimax game for instance based selective transfer learning,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4–8, 2019., 2019, pp. 34–43. [Online]. Available: 10.1145/3292500.3330841 [DOI] [Google Scholar]
[22].Luo C, Chen Z, Tang LA, Shrivastava A, Li Z, Chen H, and Ye J, “TINET: learning invariant networks via knowledge transfer,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19–23, 2018, 2018, pp. 1890–1899. [Online]. Available: 10.1145/3219819.3220003 [DOI] [Google Scholar]
[23].Peng P, Tian Y, Xiang T, Wang Y, Pontil M, and Huang T, “Joint semantic and latent attribute modelling for cross-class transfer learning,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 40, no. 7, pp. 1625–1638, 2018. [Online]. Available: 10.1109/TPAMI.2017.2723882 [DOI] [PubMed] [Google Scholar]
[24].Tang Y, Wang J, Wang X, Gao B, Dellandréa E, Gaizauskas RJ, and Chen L, “Visual and semantic knowledge transfer for large scale semi-supervised object detection,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 40, no. 12, pp. 3045–3058, 2018. [Online]. Available: 10.1109/TPAMI.2017.2771779 [DOI] [PubMed] [Google Scholar]
[25].Pan SJ, Tsang IW, Kwok JT, and Yang Q, “Domain adaptation via transfer component analysis,” in IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11–17, 2009, 2009, pp. 1187–1192. [Online]. Available: http://ijcai.org/Proceedings/09/Papers/200.pdf [Google Scholar]
[26].Pan SJ, “Domain adaptation via transfer component analysis,” IEEE Trans. Neural Networks, vol. 22, no. 2, pp. 199–210, 2011. [Online]. Available: 10.1109/TNN.2010.2091281 [DOI] [PubMed] [Google Scholar]
[27].Yao Y, Zhang Y, Li X, and Ye Y, “Heterogeneous domain adaptation via soft transfer network,” in ACM Multimedia, 2019.
[28].Jiang W, Gao H, Lu W, Liu W, Chung F, and Huang H, “Stacked robust adaptively regularized auto-regressions for domain adaptation,” IEEE Trans. Knowl. Data Eng, vol. 31, no. 3, pp. 561–574, 2019. [Online]. Available: 10.1109/TKDE.2018.2837085 [DOI] [Google Scholar]
[29].Li W, Xu Z, Xu D, Dai D, and Gool LV, “Domain generalization and adaptation using low rank exemplar svms,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 40, no. 5, pp. 1114–1127, 2018. [Online]. Available: 10.1109/TPAMI.2017.2704624 [DOI] [PubMed] [Google Scholar]
[30].Luo Y, Wen Y, Liu T, and Tao D, “Transferring knowledge fragments for learning distance metric from a heterogeneous domain,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 41, no. 4, pp. 1013–1026, 2019. [Online]. Available: 10.1109/TPAMI.2018.2824309 [DOI] [PubMed] [Google Scholar]
[31].Segev N, Harel M, Mannor S, Crammer K, and El-Yaniv R, “Learn on source, refine on target: A model transfer learning framework with random forests,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 39, no. 9, pp. 1811–1824, 2017. [Online]. Available: 10.1109/TPAMI.2016.2618118 [DOI] [PubMed] [Google Scholar]
[32].Xu Y, Pan SJ, Xiong H, Wu Q, Luo R, Min H, and Song H, “A unified framework for metric transfer learning,” IEEE Trans. Knowl. Data Eng, vol. 29, no. 6, pp. 1158–1171, 2017. [Online]. Available: 10.1109/TKDE.2017.2669193 [DOI] [Google Scholar]
[33].Ghifary M, Balduzzi D, Kleijn WB, and Zhang M, “Scatter component analysis: A unified framework for domain adaptation and domain generalization,” IEEE Trans. Pattern Anal. Mach. Intell, vol. 39, no. 7, pp. 1414–1430, 2017. [Online]. Available: 10.1109/TPAMI.2016.2599532 [DOI] [PubMed] [Google Scholar]
[34].Wu Q, Wu H, Zhou X, Tan M, Xu Y, Yan Y, and Hao T, “Online transfer learning with multiple homogeneous or heterogeneous sources,” IEEE Trans. Knowl. Data Eng, vol. 29, no. 7, pp. 1494–1507, 2017. [Online]. Available: 10.1109/TKDE.2017.2685597 [DOI] [Google Scholar]
[35].Lin Z, Feng M, Santos C. N. d., Yu M, Xiang B, Zhou B, and Bengio Y, “A structured self-attentive sentence embedding,” ICLR, 2017.
[36].Choi K, Fazekas G, Sandler M, and Cho K, “Convolutional recurrent neural networks for music classification,” arXiv preprint arXiv:1609.04243, 2016.
[37].Chorowski JK, Bahdanau D, Serdyuk D, Cho K, and Bengio Y, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
[38].Hermann KM, Kocisky T, Grefenstette E, Espeholt L, Kay W, Suleyman M, and Blunsom P, “Teaching machines to read and comprehend,” in Advances in Neural Information Processing Systems, 2015, pp. 1693–1701.
[39].Ba J, Mnih V, and Kavukcuoglu K, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
[40].Johnson AE, Pollard TJ, Shen L, Lehman L.-w. H., Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG, “Mimic-iii, a freely accessible critical care database,” Scientific data, vol. 3, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
[41].Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng C-K, and Stanley HE, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000. [DOI] [PubMed] [Google Scholar]
[42].Bousseljot R, Kreiseler D, and Schnabel A, “Nutzung der ekg-signaldatenbank cardiodat der ptb über das internet,” Biomedizinische Technik/Biomedical Engineering, vol. 40, no. s1, pp. 317–318, 1995. [Google Scholar]
[43].Reiss A. and Stricker D, “Creating and benchmarking a new dataset for physical activity monitoring,” in Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 2012, p. 40. [Google Scholar]
[44].Obeid I. and Picone J, “The temple university hospital eeg data corpus,” Frontiers in neuroscience, vol. 10, p. 196, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
[45].Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, and Smola AJ, “A kernel method for the two-sample-problem,” in Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4–7, 2006, 2006, pp. 513–520. [Online]. Available: http://papers.nips.cc/paper/3110-a-kernel-method-for-the-two-sample-problem [Google Scholar]
[46].Kingma D. and Ba J, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [Google Scholar]

