AMIA Annual Symposium Proceedings. 2021 Jan 25; 2020:763–772.

Multi-task Learning via Adaptation to Similar Tasks for Mortality Prediction of Diverse Rare Diseases

Luchen Liu 1, Zequn Liu 1, Haoxian Wu 1, Zichang Wang 1, Jianhao Shen 1, Yiping Song 1, Ming Zhang 1
PMCID: PMC8075548  PMID: 33936451

Abstract

The mortality prediction of diverse rare diseases using electronic health record (EHR) data is a crucial task for intelligent healthcare. However, data insufficiency and the clinical diversity of rare diseases make it hard to train deep learning models. Mortality prediction for these patients with different diseases can be viewed as a multi-task learning problem with insufficient data but a large number of tasks. On the other hand, insufficient training data makes it difficult to train the task-specific modules of multi-task learning models. To address the challenges of data insufficiency and task diversity, we propose an initialization-sharing multi-task learning method (Ada-SiT). Ada-SiT learns the parameter initialization and dynamically measures the tasks' similarities, which are used for fast adaptation. We use Ada-SiT to train long short-term memory network (LSTM) based prediction models on longitudinal EHR data. The experimental results demonstrate that the proposed model is effective for mortality prediction of diverse rare diseases.

1. Introduction

Mortality prediction [1] of diseases plays a crucial role in clinical work and helps doctors to take early interventions based on timely alert of patients’ adverse health status. With the immense accumulation of Electronic Health Records (EHR) available [2, 3], deep learning models [4], typically requiring a large volume of data, have been developed for mortality prediction of common diseases, demonstrating state-of-the-art performance. However, mortality prediction of rare diseases is relatively unexplored in the domain of intelligent healthcare and personalized medicine.

Predicting mortality of rare diseases suffers from the problem of data insufficiency. A rare disease is one that affects a small percentage of the population and has a disease mechanism different from common ones. Collectively, however, more than 300 million people worldwide live with one of the approximately 7,000 known rare diseases (according to the US organization Global Genes). As a result, there is rarely enough data for any specific rare disease: in some real-world data sources [3], only tens of samples on average can be collected for each rare disease.

Besides data insufficiency, the clinical behavior diversity of these diseases is another challenge for mortality prediction of rare diseases. Behaviors of different diseases vary widely and can conflict with each other during the training of global deep learning models [5, 6, 7]. For example, a high heart rate raises the mortality risk of people with heart disease, but for patients with a cold, it is a common symptom that does not indicate danger. So the insufficiency of diverse rare disease data cannot simply be resolved by training a global model on samples from all common diseases.

Multi-task learning models [8, 9] can be used to settle the problem of disease behavior diversity. Mortality prediction for each kind of rare disease is viewed as a task, and multi-task learning is supposed to capture the task-specific characteristics as well as the shared information of all tasks. However, data insufficiency of the rare disease tasks makes it hard to train the corresponding modules and further utilize the shared characteristics of similar tasks in a multi-task learning framework.

Meta-learning methods can learn meta-knowledge of training models, which makes it possible to learn fast with few samples, such as few-shot learning [10]. To build a better multi-task model suitable for tasks with little training data, we bring in the idea of fast adaptation in meta learning, which learns a shared initialization as meta-knowledge for adaptation to new tasks. However, since the fast adaptation method adapts shared initialization to each task independently, it cannot directly take into account the relationship of similar tasks, which is important and can provide useful information to enhance multi-task learning [11, 12].

To deal with the above-mentioned challenges (summarized in Table 1), we propose an initialization-shared multi-task learning method, named Ada-SiT (Adaptation to Similar Tasks), in which task similarity is dynamically measured during meta-training according to our new definition. The task similarity, as part of the learned meta-knowledge, can thus enhance the fast adaptation procedure of generic meta-learning models. Moreover, Ada-SiT is model-agnostic and can employ any existing deep learning based approach as the basic predictive model. Experimental results on real medical datasets demonstrate that the proposed model is able to make similar tasks cooperate in initialization-shared multi-task learning, and it outperforms state-of-the-art global models as well as multi-task methods for mortality prediction of diverse rare diseases.

Table 1:

Comparing the three kinds of methods in terms of their ability to handle three main challenges in mortality prediction for rare diseases

Method                Examples                          Data Insufficiency   Task Similarity   Task Diversity
Global Models         LSTM [5], TCN [6]                 ✓                    ✓                 ×
Multi-task Models     Multi-SAnD [8], Multi-Dense [9]   ×                    ✓                 ✓
Meta-learning Models  MAML [10]                         ✓                    ×                 ✓
Our Model             Ada-SiT                           ✓                    ✓                 ✓

It is worthwhile to highlight the contributions of the proposed model as follows:

  • To the best of our knowledge, this is the first attempt to simultaneously tackle the challenge of disease diversity and data insufficiency in mortality prediction of rare diseases.

  • We propose a novel initialization-shared multi-task learning method Ada-SiT, which can utilize information of task similarity for adaptation to each small sample-size task.

2. Related Works

2.1. Deep Learning for Healthcare

The accumulation of Electronic Health Records (EHR) has enabled research on deep learning methods for healthcare [13, 14, 15, 16, 17]. Multi-layer Perceptrons (MLP) [18], Convolutional Neural Networks (CNN) [19] and Recurrent Neural Networks (RNN) [4, 20, 21] have been used in the healthcare domain, and many of these works address mortality prediction. The good performance of these models depends on a large volume of EHR data, a requirement that cannot be satisfied in our scenario of mortality prediction for diverse rare diseases. As a result, these models cannot make precise mortality predictions for patients with different rare diseases. Our work is suitable for this setting because it simultaneously tackles the challenges of disease diversity and data insufficiency. Furthermore, our method is a general framework and can be applied to train deep learning models to improve their performance.

2.2. Multi-task Learning

Multi-task learning is an efficient method to improve performance by jointly learning multiple related tasks. In deep multi-task learning models, the information sharing mechanism is based on specific network structures, including shared layers [22], shared functions [23] and additional constraints [24]. However, task similarity cannot be directly interpreted in these models. We propose a model-agnostic multi-task learning method that shares the parameter initialization for fast adaptation to each task, so that task similarity can be dynamically measured in the training process.

In the healthcare domain, multi-task learning has been used for prediction of various clinical events [25], mortality prediction of multiple patient cohorts [9] and patient-specific diagnosis [26], in which the "tasks" have different definitions. Similar to [9], our work also treats the mortality prediction of a certain patient cohort as a task. However, the method proposed by Suresh [9] is suitable for a small number of tasks with a large volume of data, whereas our method for mortality prediction of rare diseases is designed to deal with hundreds of tasks with insufficient data.

2.3. Optimization-based Meta Learning

To solve the problem of data insufficiency and task diversity, our method borrows the idea behind optimization-based meta-learning [10], which can adapt to new environments with a few training samples by modifying the parameter optimization process. MAML (Model Agnostic Meta Learning) [10] uses fast adaptation to find a good initialization for the parameters of deep neural networks, and the idea of meta-learning has also been applied in personalized dialog systems to improve the diversity of text generation [27]. Our work is similar to MAML in using fast adaptation to obtain the parameters of each task, but differs from it in two ways. First, the objective of MAML is to learn a good parameter initialization for fast adaptation to new tasks, while our objective is to find good model parameters for each given task. Second, our work dynamically measures task similarity in model space and uses samples from similar tasks to assist the adaptation to each task, whereas MAML adapts to each task independently.

In the clinical scenario, MetaPred [28] uses MAML for clinical risk prediction with limited patient electronic health records. Its task is similar to ours, but its method differs from ours in two ways. First, it trains a parameter initialization on a source domain and a simulated target domain via fast adaptation, whereas our method can learn from multiple small target domains without source domain knowledge. Second, like MAML, MetaPred does not consider task similarity.

3. Data and Task Descriptions

We give the notations and data descriptions of the predictive tasks in the following.

3.1. Heterogeneous Temporal Events in EHR data

The input X of each mortality prediction task is a given episode of patient EHR data, which can be represented as T heterogeneous temporal events [4]: X = {e_t}_{1≤t≤T}. Each e_t in this sequence is a tuple with four elements, e_t = (type, value_c, value_n, time), where type is the clinical event type, value_c and value_n are the categorical and numerical attributes of e_t, and time is the record time of e_t.
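The event tuple above can be sketched as a small data structure; this is an illustrative sketch, not the paper's implementation, and the class and field names are our own.

```python
from typing import NamedTuple, Optional, List

class Event(NamedTuple):
    """One heterogeneous temporal event e_t = (type, value_c, value_n, time)."""
    type: str                 # clinical event type, e.g. a lab test
    value_c: Optional[str]    # categorical attribute, if any
    value_n: Optional[float]  # numerical attribute, if any
    time: float               # record time

def make_episode(events: List[Event]) -> List[Event]:
    """A patient episode X = {e_t}, 1 <= t <= T, kept in temporal order."""
    return sorted(events, key=lambda e: e.time)
```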

3.2. Multi-Task Mortality Prediction

The mortality prediction of each rare disease is defined as a task. Specifically, assuming that there are M diseases, we refer to task_i as the corpus of the i-th mortality prediction task for patients with rare disease d_i, containing N_i samples:

task_i = {(X_k^(i), ŷ_k^(i))}_{k=1}^{N_i}   (1)

where X_k^(i) and ŷ_k^(i) denote the k-th sample and its label, respectively, in the i-th task.

Specifically, X is a given episode of a patient’s EHR data, represented as heterogeneous temporal events, and ŷ is the binary label indicating whether the patient will die within 24 hours.

3.3. Patient Cohort Setup

3.3.1. Heterogeneous Temporal Event Datasets

We set up two heterogeneous temporal event datasets based on MIMIC-III [3] database and eICU [2] database. MIMIC-III is a large, freely-available database comprising health data of patients in critical care units from Beth Israel Deaconess Medical Center between 2001 and 2012, and eICU is populated with data from a combination of many critical care units throughout the continental United States between 2014 and 2015.

The two datasets have the same data preprocessing framework: For each patient, we select all the events with their features from the original database and arrange the events in the temporal order. The descriptions of the selected events in eICU are listed in Table 2. The details of the events in MIMIC-III can be found in [13]. Then we annotate the mortality label for each patient event sequence.

Table 2:

The tables in eICU used to construct heterogeneous temporal events

Table Name in eICU — Description
lab — Laboratory tests mapped to a standard set of measurements. E.g. labTypeID, labResult
intakeoutput — Intake and output recorded for patients. E.g. intakeTotal, outputTotal, dialysisTotal
medication — Active medication orders for patients. E.g. drugHiclSeqno, dosage
infusiondrug — Details of drug infusions. E.g. drugRate, infusionRate, drugAmount
careplan — Documentation relating to care planning. E.g. cplGeneralID, cplItemValue
admissiondrug — Details of medications that a patient was taking prior to admission to the ICU. E.g. drugUnit
nursecharting — Information entered in a semi-structured form by the nurse. E.g. nursingchartcelltypecat
physicalexam — Patients’ results of the physical exam. E.g. physicalExamText, physicalExamValue
diagnosis — Diagnosis information of patients. E.g. ICD9Code, diagnosisPriority
respiratorycare — Information related to respiratory care for patients. E.g. airwayType, airwaySize, cuffPressure
allergy — Details of patient allergies. E.g. allergyType

3.3.2. Rare Disease Selection

We reorganize the heterogeneous temporal event datasets to get two rare disease datasets, MiniMIMIC and MiniEICU.

For each ICD (International Classification of Diseases) code in MIMIC-III and eICU, we calculate its sample size (i.e. the number of patients with this code). We select 858 ICD codes with fewer than 40 samples in MIMIC-III and 70 ICD codes with fewer than 100 samples in eICU as rare diseases. The heterogeneous temporal event sequences of patients with selected rare disease di form taski. So MiniMIMIC is the task list {task1, ..., task858} and MiniEICU is the task list {task1, ..., task70}.
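The selection step above amounts to counting patients per ICD code and thresholding; a minimal sketch, assuming `patient_codes` holds each patient's ICD code list (the function and variable names are ours, not the paper's):

```python
from collections import Counter

def select_rare_diseases(patient_codes, threshold):
    """Pick ICD codes whose sample size (number of patients carrying the
    code) is below the rarity threshold: 40 for MIMIC-III, 100 for eICU."""
    counts = Counter(code for codes in patient_codes for code in set(codes))
    return sorted(code for code, n in counts.items() if n < threshold)

def build_tasks(patients, patient_codes, rare_codes):
    """Group patients' event sequences into one task per rare disease code."""
    return {code: [p for p, codes in zip(patients, patient_codes) if code in codes]
            for code in rare_codes}
```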

The statistics of MiniMIMIC and MiniEICU are summarized in Table 3.

Table 3:

Characteristics of Datasets

Name MiniMIMIC MiniEICU
# of tasks 858 70
# of samples 16610 7000
positive sample rate (mortality rate) 7% 13%
max # of samples per task 40 100
min # of samples per task 10 100
mean # of samples per task 19.36 100

Each taski is split into 3 parts with fixed proportions, namely Traini (70%), Validi (10%) and Testi (20%). The validation set is used for early stopping and for selecting hyper-parameters. The evaluation metrics on the test set, together with their standard deviations, are used to compare different models.
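The per-task split can be sketched as follows; the exact shuffling and rounding scheme of the paper is not specified, so this is one plausible implementation with our own function name.

```python
import random

def split_task(samples, seed=0):
    """Split one task's samples into Train (70%), Valid (10%) and Test (20%)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_tr, n_va = int(0.7 * len(samples)), int(0.1 * len(samples))
    train = [samples[i] for i in idx[:n_tr]]
    valid = [samples[i] for i in idx[n_tr:n_tr + n_va]]
    test = [samples[i] for i in idx[n_tr + n_va:]]
    return train, valid, test
```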

4. Methodology

In this section, we begin with some basic notations for the adaptation to a single task and the backbone deep network for mortality prediction based on EHR data. Then we introduce the framework of our method Ada-SiT (Adaptation to Similar Tasks), which can be applied to train multiple mortality prediction models for different rare diseases.

4.1. Adaptation to a Single Task

For a task task_i of a given disease and given initial parameters θ (either random or learned), we formulate the learning process Learn(task_i; θ) of model parameters θ_i as minimizing the loss function on the data of the given task, starting from the initialization θ:

θ_i = Learn(task_i; θ) = argmin_θ′ L_{task_i}(θ′)   (2)

We assume p(y|X, θ) is the mortality rate predicted by the model with parameters θ. The loss function L_{task_i}(θ) of model parameters θ on task corpus task_i is defined using the cross entropy CE(·) between the model output p(y|X, θ) and the true label ŷ of the outcome:

L_{task_i}(θ) = Σ_{(X, ŷ) ∈ task_i} CE(p(y|X, θ), ŷ)   (3)
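Eq. (3) can be sketched directly in numpy; here `predict` stands in for the model p(y|X, θ), whose concrete form is described below, and the function names are ours.

```python
import numpy as np

def cross_entropy(p, y):
    """CE(p(y|X, theta), y_hat) for a binary mortality label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def task_loss(theta, predict, task):
    """L_task_i(theta): cross-entropy summed over the task corpus (Eq. 3).
    `predict(X, theta)` returns the predicted mortality probability."""
    return float(sum(cross_entropy(predict(X, theta), y) for X, y in task))
```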

In this work, the mortality prediction model p(y|X, θ) based on heterogeneous temporal events in EHR data is mainly composed of attributed event embedding [4] and a long short-term memory (LSTM) network. First, clinical events with attributes are embedded into vectors via information fusion of their type, categorical attributes and numerical attributes. The temporal dependencies encoded in the sequence of embedded vectors are then captured by the LSTM [5], which outputs the mortality prediction through a sigmoid layer at the last LSTM cell. The mortality prediction model is illustrated in the middle of Figure 2.

Figure 2:

Training the Mortality Prediction Model with Ada-SiT

4.2. Adaptation to Similar Tasks

The architecture of our proposed multi-task learning method Ada-SiT is presented in Figure 1, where Ada-SiT is compared with generic multi-task methods and generic meta-learning based multi-task methods. In the following, we first introduce the architecture of Ada-SiT and then the task similarity measurement module, a key component of Ada-SiT. The overall framework for training the mortality prediction model with Ada-SiT is illustrated in Figure 2.

Figure 1:

(A) Generic Architecture of Multi-Task Learning. The shared module (in blue), whose output is taken as the input of the task-specific modules, is shared between different tasks. (B) Generic Architecture of Meta-Learning based Multi-Task Learning. The parameter initialization θ (in blue), which will be adapted to each specific task task_i, is shared between all tasks. (C) Architecture of our Ada-SiT (Adaptation to Similar Tasks). The shared initialization θ is adapted to the similar tasks N(task_i) of each task task_i, resulting in a predictive model θ_i for each task.

4.2.1. Architecture of Ada-SiT

Fast adaptation, a meta-learning technique, is applied in Ada-SiT for the multi-task learning scenario. Ada-SiT differs from the original fast adaptation [10], which adapts to each task separately: Ada-SiT dynamically measures task similarity and learns to adapt to multiple similar tasks.

The idea of fast adaptation is to learn a good parameter initialization from which new tasks can be learned quickly. In our multi-task learning scenario, this means using patient data from all the disease tasks to find a good shared initialization θ, from which the mortality prediction model parameters θ_i of each disease are adapted.

The initialization θ is learned by repeatedly simulating scenarios of mortality prediction of each disease with its similar diseases. We achieve this goal by defining the meta-objective function as:

min_θ Σ_{task_i} L_{N(task_i)}(Learn(N(task_i); θ))   (4)

where N(taski) is the extended sample set composed of tasks similar to taski. Details of N(taski) will be described in the next subsection.

We minimize the meta-objective function using stochastic approximation with gradient descent. In each epoch, we find the similar tasks N(taski) for each task, and then independently sample two subsets (Dtr and Dval) from the training sets of the tasks in N(taski). The former, Dtr, is used to simulate the learning process of the mortality prediction models, and the latter, Dval, is used to evaluate the precision of the learned models for updating the shared initialization. A single-step gradient descent is applied in the training simulation:

θ_i = Learn(N(task_i); θ) = θ − α ∇_θ L^tr_{N(task_i)}(θ)   (5)

where α is the learning rate and L^tr_{N(task_i)}(·) is the loss function calculated on the sample set D_tr drawn from N(task_i).

Next, we evaluate the updated task parameters θi on Dval. The gradient computed from this evaluation, referred to as the meta-gradient, is used to update the initial parameters θ. Gradients from a batch of tasks are aggregated to update θ as follows:

θ ← θ − β ∇_θ Σ_{task_i} L^val_{N(task_i)}(θ_i)   (6)

where β is the meta learning rate, and L^val_{N(task_i)}(·) is the loss function calculated on the sample set D_val drawn from N(task_i).

When the meta-gradient is calculated by the chain rule, the second-order derivative term can be ignored without much accuracy loss [10], so the meta-gradient can be approximated by the following simplified gradient:

∇_θ L^val_{N(task_i)}(θ_i) = ∇_{θ_i} L^val_{N(task_i)}(θ_i) (I − α H_θ(L^tr_{N(task_i)}(θ))) ≈ ∇_{θ_i} L^val_{N(task_i)}(θ_i)   (7)

where the term containing H_θ(L^tr_{N(task_i)}(θ)), the Hessian matrix (the square matrix of second-order partial derivatives of the loss L^tr_{N(task_i)}(·) at θ), is ignored.

As our goal is multi-task learning via adaptation to similar tasks, the model parameters θi of each task are adapted from the newly updated initialization θ at the end of each iteration. These per-task parameters are comparable in the model space and are used to calculate task similarity in the next iteration.
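The loop described by Eqs. (5)–(7), combined with the similarity measure defined in Section 4.2.2, can be sketched in first-order form on toy quadratic tasks L_i(θ) = ||θ − c_i||², where each target c_i stands in for a task's data. This is a self-contained illustration under our own simplifications (deterministic full-batch gradients instead of the sampled D_tr/D_val, per-task validation loss in the meta-update), not the paper's implementation.

```python
import numpy as np

def _cos(a, b):
    """Cosine of the angle between two adaptation directions."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 1.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def ada_sit(targets, epochs=200, alpha=0.1, beta=0.05, eta=0.7):
    """First-order sketch of the Ada-SiT loop on tasks L_i(theta) = ||theta - c_i||^2."""
    theta = np.zeros_like(targets[0])
    thetas = [theta.copy() for _ in targets]        # per-task parameters
    for _ in range(epochs):
        dirs = [t_i - theta for t_i in thetas]      # adaptation directions
        meta_grad = np.zeros_like(theta)
        for i, c_i in enumerate(targets):
            # N(task_i): tasks with similar adaptation directions (Eq. 8)
            nbrs = [j for j in range(len(targets)) if _cos(dirs[i], dirs[j]) > eta]
            # inner step (Eq. 5): one gradient step on the pooled neighbor loss
            grad_tr = np.mean([2 * (theta - targets[j]) for j in nbrs], axis=0)
            theta_i = theta - alpha * grad_tr
            # first-order meta-gradient (Eq. 7) of the task's validation loss
            meta_grad += 2 * (theta_i - c_i)
        # meta-update of the shared initialization (Eq. 6)
        theta = theta - beta * meta_grad / len(targets)
        # re-adapt each task from the new init for next iteration's similarity
        thetas = [theta - alpha * 2 * (theta - c_i) for c_i in targets]
    return theta, thetas
```

On two clusters of tasks, the adapted θ_i drift toward their own cluster while θ stays a shared compromise, which is the cooperation effect the method aims for.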

4.2.2. Task Similarity Measurement in Model Space

To measure the similarity of tasks in terms of clinical behavior, we define task similarity in the model space, where the predictive model of each task is represented as a vector composed of all its parameters.

Formally, the similar tasks N(task_i) of task_i are defined as N_cos(·):

N_cos(task_i) = {s | s ∈ task_j ∧ cos(θ_i − θ, θ_j − θ) > η}   (8)

where θ_i and θ_j are the parameters of the models corresponding to task_i and task_j, and θ is the initial parameters. cos(θ_i − θ, θ_j − θ) is the cosine of the angle between the gradient directions of task_i and task_j, and η is the threshold of the cosine function.

Models of similar diseases have similar gradient directions when they are adapted from the initialization, so a cosine value close to 1 between two gradient directions indicates that the tasks are similar.

Notice that a natural alternative for finding similar tasks is selecting the k nearest neighbors in the model space. However, the absolute distance between models is more meaningful than their relative distance, because model distances are generated by gradient descent of the adaptation process from a common initialization. Selecting models within a fixed neighborhood of a certain model as its similar models is therefore more suitable for Ada-SiT, as demonstrated by the experimental results in Section 5.4.
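The contrast between the two selection rules can be made concrete; a sketch with our own function names, where the threshold rule follows Eq. (8) and the KNN rule is the rejected alternative.

```python
import numpy as np

def neighbors_threshold(i, thetas, theta, eta=0.7):
    """N_cos(task_i) per Eq. (8): tasks whose adaptation direction
    theta_j - theta has cosine similarity above eta with that of task_i."""
    d_i = thetas[i] - theta
    out = []
    for j, t_j in enumerate(thetas):
        d_j = t_j - theta
        denom = np.linalg.norm(d_i) * np.linalg.norm(d_j)
        if denom > 0 and float(d_i @ d_j) / denom > eta:
            out.append(j)
    return out

def neighbors_knn(i, thetas, k):
    """The alternative: k nearest neighbors by distance in model space,
    which may pull in dissimilar tasks when no close neighbor exists."""
    dists = [np.linalg.norm(thetas[i] - t_j) for t_j in thetas]
    return sorted(range(len(thetas)), key=lambda j: dists[j])[:k]
```

For a task with no genuinely similar peers, the threshold rule returns only the task itself, whereas KNN is forced to include dissimilar tasks, illustrating the interference discussed in Section 5.4.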

5. Results and Discussions

5.1. Comparing Methods and Experimental Settings

We compare Ada-SiT to both global single-task learning methods and multi-task learning methods for mortality prediction. The data size of each task is too small to train separate single-task models, so these baselines have not been included.

The global single-task learning models are trained on all the patients in the training set.

  • LSTM LSTM [5] is used to learn the representation of the heterogeneous event sequence for each patient. Binary predictions for mortality based on the learned representations can be generated with a logistic regression layer.

  • TCN The architecture for prediction is the same as LSTM, except that a Temporal Convolutional Network (TCN) [6] is used instead of the LSTM to learn patient representation vectors from the heterogeneous event sequences.

The multi-task learning methods use Traini to train the model for each taski and produce prediction results, such as predicted labels and probabilities, on Testi.

  • Multi-SAnD Multi-SAnD [8] is a Transformer-based multi-task learning method that uses the weighted sum of the loss functions on all tasks as its loss function.

  • MMoE MMoE [29] is a Multi-gate Mixture-of-Experts model which shares the expert submodel across all tasks and has a gating network trained to optimize each task.

  • Multi-Dense Multi-Dense is proposed by Suresh [9]. It has a shared LSTM layer for representation learning, followed by task-specific dense layers and logistic regression output layers.

The models in this section are implemented with Tensorflow [30] and trained with Adam. In Ada-SiT, we set N(taski) = Ncos(taski) for the task similarity measurement, with cosine threshold η = 0.7. α and β are 0.0005 and 0.001 respectively.

5.2. Evaluation Metrics

The data for the target prediction tasks are label-imbalanced, so metrics for binary labels such as accuracy are not suitable for measuring performance. Similar to [20], we adopt AUC (the area under the ROC (Receiver Operating Characteristic) curve) and AP (the area under the precision-recall curve) for evaluation. Both reflect the overall quality of the predicted scores at each decision time.
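Both metrics have simple closed forms; a minimal sketch with our own function names (library implementations such as scikit-learn's would normally be used instead).

```python
def auc_score(labels, scores):
    """AUC via the rank (Mann-Whitney) statistic: the probability that a
    random positive sample is scored above a random negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ap_score(labels, scores):
    """AP: precision averaged over the ranks of the positive samples,
    i.e. the area under the precision-recall curve."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / sum(labels)
```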

5.3. Quantitative Results

Table 4 shows the AUC and AP of Ada-SiT, global single-task models, and multi-task models on MiniMIMIC and MiniEICU. From the results in Table 4, we draw the following conclusions.

Table 4:

Performance of different models on MiniMIMIC and MiniEICU (standard deviations in parentheses)

Model Class         Model            MiniMIMIC                          MiniEICU
                                     AUC              AP                AUC              AP
Global Single-task  LSTM [5]         0.8162 (0.0026)  0.3830 (0.0055)   0.6642 (0.0227)  0.2692 (0.0193)
                    TCN [6]          0.8008 (0.0024)  0.4120 (0.0011)   0.6107 (0.0055)  0.1945 (0.0052)
Multi-task          Multi-SAnD [8]   0.8036 (0.0161)  0.2754 (0.0063)   0.6215 (0.0075)  0.1592 (0.0016)
                    MMoE [29]        0.7181 (0.0117)  0.2195 (0.0097)   0.6300 (0.0023)  0.1364 (0.0030)
                    Multi-Dense [9]  0.8325 (0.0036)  0.3997 (0.0096)   0.6730 (0.0071)  0.1147 (0.0039)
Ours                Ada-SiT          0.8729 (0.0112)  0.4543 (0.0241)   0.6746 (0.0090)  0.2961 (0.0103)

First, Ada-SiT significantly improves on the global single-task learning methods. On both datasets, Ada-SiT performs better than LSTM and TCN; for example, on MiniMIMIC, it improves AUC and AP by around 6.9% and 19.1% respectively compared to LSTM. We conclude that Ada-SiT can capture the specific characteristics of diverse tasks without interference from data conflicts.

Second, Ada-SiT outperforms the compared multi-task learning methods. For example, on MiniMIMIC, Ada-SiT improves the AUC and AP of Multi-SAnD by 8.6% and 65.0% respectively. On MiniEICU, it improves the AUC of Multi-SAnD and MMoE by 8.5% and 7.1% respectively. It should be noted that most of the multi-task baselines do not perform better than the global single-task baselines, possibly because the task-specific parameters of multi-task models cannot be trained well given the data insufficiency of each task. We conclude that Ada-SiT has a more robust information-sharing mechanism among tasks on small-sample datasets than the traditional multi-task learning baselines.

5.4. Ablation Experiments of Task Similarity Measurement

To evaluate the effect of task similarity measurement in Ada-SiT, we vary this module while keeping the other parts of the model unchanged. We implement the following variants of Ada-SiT:

  • Ada-SiT (MAML) This variant removes the similar-task measurement (i.e. N(taski) = taski), making it nearly the same as MAML [10].

  • Ada-SiT (Static) According to the work [31], many static features can be used to measure task similarity. In Ada-SiT (Static), we choose the mortality rate as the static feature in the clinical scenarios and use it to measure task similarity instead of the proposed similarity measurement.

  • Ada-SiT (KNN) Ada-SiT (KNN) selects k nearest neighbors instead of neighbors within a certain distance for Ncos(taski) while finding similar tasks.

Table 5 shows the results of the ablation experiments on task similarity measurement, from which we draw the following conclusions. First, the information in similar tasks improves fast adaptation: Ada-SiT (KNN) and Ada-SiT (Static) improve the AP of Ada-SiT (MAML), and the full Ada-SiT also improves its AUC. Second, task similarity measurement in model space outperforms the static measurement: Ada-SiT improves the AUC and AP of Ada-SiT (Static) by 3.0% and 9.7% respectively. This is because traditional task similarity measurements via static features only leverage the metadata of tasks, whereas our measurement in model space can find latent information in the mapping from samples to labels. Third, finding neighbors within a fixed distance as Ncos(taski) is the most suitable way to obtain similar tasks; for example, Ada-SiT improves the AUC of Ada-SiT (KNN) by around 5.6%. Notably, the AUC of Ada-SiT (KNN) is even lower than that of Ada-SiT (MAML), possibly because some of the k nearest neighbors of taski are far from it in model space and interfere with its fast adaptation.

5.5. Relationship between Task Similarity and Mortality Rate

There is a strong correlation between task similarity and mortality rate. Treating each model’s parameter vector θi as a point in the model space, we use t-SNE [32] to visualize similar-task clusters in Figure 3. The two task clusters in Figure 3 represent two types of rare diseases: the average mortality rate of diseases in the blue cluster is 0.6% and that in the yellow cluster is 32.1%, while the overall average mortality rate of MiniMIMIC is 7%. We can see that the mortality rate is the main factor determining task similarity. This suggests that our task similarity measurement module is reasonable and consistent with clinical knowledge, because diseases with a low mortality rate and life-threatening diseases with a high mortality rate have different clinical behavior.

Figure 3:

Visualization of Tasks in the Model Space

6. Conclusion

In this paper, we propose Ada-SiT, a novel method for learning predictive models for diverse tasks with insufficient data. Ada-SiT has a new task similarity measurement and a new knowledge-sharing scheme in which the shared initialization is learned for fast adaptation to similar tasks. Experimental results show that our method is suitable for mortality prediction of diverse rare diseases and improves performance compared to global single-task models and generic multi-task models.

Acknowledgments

This paper is supported by the National Key Research and Development Program of China (Grant No. 2018AAA0101900 / 2018AAA0101902) and the National Natural Science Foundation of China (NSFC Grants No. 61772039 and No. 91646202).

Footnotes

Equal contribution.

Corresponding Author.


Table 5:

Ablation study of task similarity measurement

Methods AUC AP
Ada-SiT (MAML) 0.8577 (0.0015) 0.3936 (0.0025)
Ada-SiT (Static) 0.8474 (0.0123) 0.4143 (0.0144)
Ada-SiT (KNN) 0.8264 (0.0110) 0.4059 (0.0112)
Ada-SiT 0.8729 (0.0112) 0.4543 (0.0241)

References

  • [1]. Alok Sharma, Anupam Shukla, Ritu Tiwari, Apoorva Mishra. “Mortality Prediction of ICU patients using Machine Learning: A survey”. ICCDA. 2017.
  • [2]. Tom Pollard, Alistair Johnson, Jesse Raffa, Leo Celi, Roger Mark, Omar Badawi. “The eICU Collaborative Research Database, a freely available multi-center database for critical care research”. Scientific Data. 2018;5:180178. doi: 10.1038/sdata.2018.178.
  • [3]. Alistair EW Johnson, Tom J Pollard, Lu Shen, et al. “MIMIC-III, a freely accessible critical care database”. Scientific Data. 2016;3:160035. doi: 10.1038/sdata.2016.35.
  • [4]. Luchen Liu, Jianhao Shen, Ming Zhang, Zichang Wang, Jian Tang. “Learning the Joint Representation of Heterogeneous Temporal Events for Clinical Endpoint Prediction”. AAAI. 2018.
  • [5]. Sepp Hochreiter, Jürgen Schmidhuber. “Long short-term memory”. Neural Computation. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735.
  • [6]. Shaojie Bai, J Zico Kolter, Vladlen Koltun. “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling”. arXiv preprint arXiv:1803.01271. 2018.
  • [7]. Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. “Attention is all you need”. Advances in Neural Information Processing Systems. 2017:5998–6008.
  • [8]. Huan Song, Deepta Rajan, Jayaraman J Thiagarajan, Andreas Spanias. “Attend and diagnose: Clinical time series analysis using attention models”. AAAI. 2018.
  • [9]. Harini Suresh, Jen J Gong, John Guttag. “Learning Tasks for Multitask Learning: Heterogenous Patient Populations in the ICU”. 2018.
  • [10]. Chelsea Finn, Pieter Abbeel, Sergey Levine. “Model-agnostic meta-learning for fast adaptation of deep networks”. ICML. 2017:1126–1135.
  • [11]. Theodoros Evgeniou, Charles A Micchelli, Massimiliano Pontil. “Learning multiple tasks with kernel methods”. Journal of Machine Learning Research. 2005;6:615–637.
  • [12]. Laurent Jacob, Jean-Philippe Vert, Francis R Bach. “Clustered multi-task learning: A convex formulation”. NeurIPS. 2009:745–752.
  • [13]. Luchen Liu, Haoran Li, Zhiting Hu, et al. “Learning hierarchical representations of electronic health records for clinical outcome prediction”. AMIA. 2019.
  • [14]. Zichang Wang, Haoran Li, Luchen Liu, Haoxian Wu, Ming Zhang. “Predictive Multi-level Patient Representations from Electronic Health Records”. BIBM. 2019.
  • [15]. Luchen Liu, Haoxian Wu, Zichang Wang, Zequn Liu, Ming Zhang. “Early Prediction of Sepsis From Clinical Data via Heterogeneous Event Aggregation”. CinC. 2019.
  • [16]. Liantao Ma, Junyi Gao, Yasha Wang, et al. “AdaCare: Explainable Clinical Health Status Representation Learning via Scale-Adaptive Feature Extraction and Recalibration”. Thirty-Fourth AAAI Conference on Artificial Intelligence. 2020.
  • [17]. Liantao Ma, Chaohe Zhang, Yasha Wang, et al. “ConCare: Personalized Clinical Feature Embedding via Capturing the Healthcare Context”. Thirty-Fourth AAAI Conference on Artificial Intelligence. 2020.
  • [18]. Zhengping Che, David Kale, Wenzhe Li, Mohammad Taha Bahadori, Yan Liu. “Deep Computational Phenotyping”. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2015:507–516.
  • [19]. Qiuling Suo, Fenglong Ma, Ye Yuan, et al. “Personalized disease prediction using a cnn-based similarity learning method”. BIBM. 2017:811–816.
  • [20]. Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, Walter Stewart. “Retain: An interpretable predictive model for healthcare using reverse time attention mechanism”. NeurIPS. 2016:3504–3512.
  • [21]. Qiuling Suo, Fenglong Ma, Giovanni Canino, et al. “A multi-task framework for monitoring health conditions via attention-based recurrent neural networks”. AMIA. 2017:1665.
  • [22]. Mingsheng Long, Jianmin Wang. “Learning multiple tasks with deep relationship networks”. arXiv preprint arXiv:1506.02117. 2015.
  • [23]. Junkun Chen, Xipeng Qiu, Pengfei Liu, Xuanjing Huang. “Meta multi-task learning for sequence modeling”. AAAI. 2018.
  • [24]. Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, Martial Hebert. “Cross-stitch networks for multi-task learning”. CVPR. 2016:3994–4003.
  • [25]. Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Aram Galstyan. “Multitask learning and benchmarking with clinical time series data”. arXiv preprint arXiv:1703.07771. 2017.
  • [26]. Nozomi Nori, Hisashi Kashima, Kazuto Yamashita, Susumu Kunisawa, Yuichi Imanaka. “Learning implicit tasks for patient-specific risk modeling in ICU”. AAAI. 2017.
  • [27]. Yiping Song, Zequn Liu, Wei Bi, Rui Yan, Ming Zhang. “Learning to Customize Model Structures for Few-shot Dialogue Generation Tasks”. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020:5832–5841.
  • [28]. Xi Sheryl Zhang, Fengyi Tang, Hiroko Dodge, Jiayu Zhou, Fei Wang. “MetaPred: Meta-Learning for Clinical Risk Prediction with Limited Patient Electronic Health Records”. 2019.
  • [29]. Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, Ed H Chi. “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts”. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018:1930–1939.
  • [30]. Martín Abadi, Paul Barham, Jianmin Chen, et al. “Tensorflow: a system for large-scale machine learning”. OSDI. 2016;16:265–283.
  • [31]. Sebastian Ruder, Barbara Plank. “Learning to select data for transfer learning with Bayesian Optimization”. arXiv preprint arXiv:1707.05246. 2017.
  • [32]. Laurens van der Maaten, Geoffrey Hinton. “Visualizing data using t-SNE”. Journal of Machine Learning Research. 2008;9:2579–2605.
