DDL: Deep Dictionary Learning for Predictive Phenotyping

Tianfan Fu; Trong Nghia Hoang; Cao Xiao; Jimeng Sun

doi:10.24963/ijcai.2019/812

Author manuscript; available in PMC: 2021 Mar 24.

Published in final edited form as: IJCAI (U S). 2019 Aug;2019:5857–5863. doi: 10.24963/ijcai.2019/812

PMCID: PMC7990269 NIHMSID: NIHMS1675238 PMID: 33767572

Abstract

Predictive phenotyping is about accurately predicting what phenotypes will occur in the next clinical visit based on longitudinal Electronic Health Record (EHR) data. While deep learning (DL) models have recently demonstrated strong performance in predictive phenotyping, they require access to a large amount of labeled data, which are expensive to acquire. To address this label-insufficient challenge, we propose a deep dictionary learning framework (DDL) for phenotyping, which utilizes unlabeled data as a complementary source of information to generate a better, more succinct data representation. Our empirical evaluations on multiple EHR datasets demonstrate that DDL outperforms the existing predictive phenotyping methods on a wide variety of clinical tasks that require patient phenotyping. The results also show that unlabeled data can be used to generate a better data representation, which helps improve DDL’s phenotyping performance over existing methods that use only labeled data.

1. Introduction

The recent rise in popularity of deep learning (DL) and the widespread use of electronic health record (EHR) data in clinical research have sparked strong interest in using DL for electronic phenotyping, which is of paramount importance in various health analytics tasks such as chronic disease diagnosis [Ma et al., 2017], patient sub-typing [Baytas et al., 2017] and disease prediction [Esteban et al., 2016]. The key advantage of DL models is their ability to construct deep features that efficiently capture complex and long-range data dependencies, both of which are characteristic traits of biomedical data. This helps DL models achieve better performance and require less effort in feature engineering than classical machine learning (ML) models.

Despite these successes, the training of DL-based phenotyping models often relies on direct supervision via labeled data, which in turn requires access to a large volume of EHR data annotated with relevant medical labels. This highlights a key weakness of existing DL-based phenotyping models in terms of practical efficiency and data utilization. First, acquiring labeled data is often prohibitively expensive due to the incurred labor-intensive data annotation, which makes the training practically inefficient. Second, most DL models are trained using supervised learning methods, which do not make use of unlabeled EHR data [Choi et al., 2016b; Ma et al., 2017; Ma et al., 2018]. Ignoring unlabeled data, however, leads to a severe under-utilization of available information since such data can still be exploited to learn a better representation of labeled data (hence, better prediction performance) even though they do not contain predictive information. To mitigate the above weaknesses, this paper aims to develop a deep dictionary learning (DDL) framework that utilizes and combines both labeled and unlabeled data to learn a better data representation for predictive phenotyping.

To elaborate, DDL implements a hybrid learning architecture that combines ideas from both dictionary learning and recurrent neural network (RNN) for temporal prediction. This is achieved by first embedding raw, unlabeled EHR data into a latent feature space using DDL’s RNN component. Then, dictionary learning is applied on the resulting set of embedded representations to search for a smaller set of patterns that occur frequently across different patients’ representations. This in turn allows DDL to characterize each patient’s embedded representation as a linear combination of those frequent patterns, which constitutes their deep dictionary representations (Section 2.2).

Furthermore, to optimize the above deep dictionary representation, DDL develops a deep dictionary decoder that reconstructs a patient’s unlabeled EHR data from its deep dictionary representation. This is achieved by minimizing the decoder’s reconstruction loss, which helps establish a parsimonious set of hidden-layer patient prototypes (Section 2.3). Different patients can then be viewed as different combinations of these prototypes, where the combination coefficients represent each patient’s high-level feature description. These coefficients are then coupled with the patients’ labeled data to learn a mapping from each patient’s high-level description to his/her target outcomes (Section 2.4). Our experiments (Section 3) demonstrate that such a representation indeed induces better performance than the existing baselines for predictive phenotyping.

Last but not least, DDL also develops an end-to-end training mechanism that optimizes the above data embedding (Section 2.2), reconstruction (Section 2.3) and prediction (Section 2.4) components simultaneously. This intuitively allows DDL’s representation components (built with unlabeled data) to coordinate with its prediction component (built with labeled data) to avoid generating representations which are biased towards artifacts in unlabeled data. As a result, DDL generates a less biased representation that is beneficial for both reconstruction and prediction loss minimization, thus leading to better predictive phenotyping performance. This is in contrast to existing semi-supervised ML methods [Mairal et al., 2009; Tariyal et al., 2016; Dligach et al., 2015], which perform training with separate supervised and unsupervised steps, and are therefore vulnerable to biased representations caused by artifacts in unlabeled data.

To demonstrate the aforementioned advantages of DDL, we evaluated its performance and compared it against those of a set of selected state-of-the-art predictive phenotyping baselines on an extensive benchmark (Section 3) comprising multiple real-world EHR datasets and a wide variety of predictive phenotyping tasks (e.g., heart failure classification, mortality and sequential prediction). The reported results demonstrate improved prediction performance of DDL consistently over all baselines, which provides strong empirical evidence to support our key contribution statement that unlabeled data can be coupled with labeled data via semi-supervised learning to boost the performance of predictive phenotyping.

2. Method

This section presents the technical details of our developed DDL framework. In particular, our notations are first introduced in Section 2.1. Section 2.2 presents our developed deep dictionary encoder that generates deep dictionary representations for EHR data. Section 2.3 then presents a deep dictionary decoder that maps the generated deep dictionary representations back to EHR data, and optimizes it via minimizing the reconstruction loss. Section 2.4 develops a deep dictionary predictor that maps each patient’s deep dictionary representation to relevant target outcomes. Finally, Section 2.5 presents a collaborative training architecture (see Fig. 1) that jointly trains all the aforementioned components, thus allowing them to converge on the best deep dictionary representation that minimizes both reconstruction and prediction losses.

Figure 1: The architecture of our DDL framework includes 3 interconnected modules: (1) an encoder that generates a deep dictionary representation for a patient’s EHR data, (2) a decoder and (3) a predictor, both of which are connected to the encoder. This allows the modules to collaboratively decide on the best representation that minimizes both the reconstruction and prediction losses.

2.1. Notations and Definitions

Our EHR data comprise the medical records of N different patients. For each patient n = 1 … N, the input $X_{n} ≜ [x_{n}^{(1)}; x_{n}^{(2)}; \dots; x_{n}^{(k_{n})}]$ denotes the collection of records from his/her past visits to the clinic, where k_n is the number of visits of the n-th patient. Each visit record $x_{n}^{(i)} \in \{0, 1\}^{p}$ is a multi-hot vector that indicates which of the p unique medical codes the patient was associated with during his/her i-th visit. The corresponding output/label y_n ∈ {0, 1}^c is also a multi-hot vector, indicating which of the c unique target diseases or conditions the patient was diagnosed with. In our setting, the label y_n is only available for a small subset of M < N patients, indexed by n = 1, …, M, which forms our labeled training dataset $\mathcal{D} ≜ \{(X_{n}, y_{n})\}_{n = 1}^{M}$ (written in calligraphic type to distinguish it from the dictionary D of Section 2.2). Given the labeled dataset $\mathcal{D}$ and the unlabeled dataset $\mathcal{U} ≜ \{X_{n}\}_{n = 1}^{N}$ of previous patients, the aim of the predictive phenotyping task is to learn a latent function that maps a patient’s EHR data X to an output vector y ∈ {0, 1}^c that accurately characterizes his/her phenotype.
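As a concrete illustration of this notation, the following minimal sketch encodes one patient’s visits as multi-hot vectors. The function names (`multi_hot`, `encode_patient`), the integer code indices and the toy patient are assumptions made for illustration; they are not taken from the datasets used in this paper.

```python
from typing import List, Sequence


def multi_hot(codes: Sequence[int], p: int) -> List[int]:
    """Return a length-p 0/1 vector with ones at the given medical-code indices."""
    x = [0] * p
    for code in codes:
        if not 0 <= code < p:
            raise ValueError(f"code {code} is outside the range [0, {p})")
        x[code] = 1
    return x


def encode_patient(visits: Sequence[Sequence[int]], p: int) -> List[List[int]]:
    """Encode one patient's k_n visits as X_n, a list of k_n multi-hot vectors."""
    return [multi_hot(visit, p) for visit in visits]


# Hypothetical patient with k_n = 3 visits over p = 6 medical codes.
X_n = encode_patient([[0, 2], [2], [1, 4, 5]], p=6)
```

Each row of `X_n` corresponds to one visit record, and the number of rows varies across patients, which is why a sequence model is needed to obtain a fixed-size embedding.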

2.2. Deep Dictionary Encoder

This section develops a deep dictionary encoder that succinctly captures embedded representations of the patients’ unlabeled EHR data $\mathcal{U} = \{X_{n}\}_{n = 1}^{N}$ in a mutually consistent manner. This helps establish a parsimonious set of hidden-layer patient prototypes, which constitutes our deep dictionary representation. It is achieved by combining ideas from deep learning and dictionary learning, an emerging paradigm for predictive deep models in computer vision.

Intuitively, the key idea is to first exploit the capability of DL models to embed complex data (e.g., EHR data with varying length due to different numbers of clinical visits among different patients) into fixed-size, low-level embeddings. To incorporate high-level representations for better predictive performance, we use dictionary learning to further decompose these embeddings into fundamental patterns that succinctly characterize a space of high-level features. The patient embeddings can then be projected onto this space, and the projection coefficients can be leveraged as high-level features to improve the predictive capability of the model.

To represent each patient, we employ an RNN with Long Short Term Memory (LSTM) architecture [Hochreiter and Schmidhuber, 1997], which is well-known for its ability to capture long-range dependencies within longitudinal data such as patient’s EHR. In particular, the LSTM module generates a latent embedding g_n from the multi-hot encoded patient record X_n,

$$g_{n} = \mathrm{LSTM}(X_{n}) = \mathrm{LSTM}(x_{n}^{(1)}, x_{n}^{(2)}, \dots, x_{n}^{(k_{n})}), \tag{1}$$

where $g_{n} \in ℝ^{d}$ is the output of the LSTM, which is used as the embedded, fixed-length patient representation. In the above equation, we concatenate the multi-hot vectors (corresponding to the patient’s different clinical visits) into an extended input vector for the LSTM. Based on the above embedded representation, we further adopt dictionary learning to extract both a dictionary of hidden-layer prototypes (which is patient-independent) and a collection of weight vectors (one per patient), which specifies how these prototypes can be combined to characterize a particular patient accurately. This is achieved via minimizing the following regularized projection loss,

$$L_{d}(D, \{r_{n}\}_{n = 1}^{N}) ≜ \sum_{n = 1}^{N} \left( \frac{1}{2} ‖ g_{n} - D r_{n} ‖_{2}^{2} + λ_{1} ‖ r_{n} ‖_{1} \right) + λ_{2} ‖ D ‖_{F}^{2}, \tag{2}$$

with respect to $\{r_{n}\}_{n = 1}^{N}$ (with $r_{n} \in ℝ^{k}$) and $D \in ℝ^{d \times k}$, which denote the weight vectors (or sparse codes) that characterize each patient and the dictionary of patient prototypes represented in the embedded space, respectively. To avoid generating trivial solutions, the above loss is regularized by the penalty terms $λ_{1} ‖ r_{n} ‖_{1}$ and $λ_{2} ‖ D ‖_{F}^{2}$ on the complexities of the sparse codes and the dictionary, with λ₁ and λ₂ being the regularization parameters. The embedded patient prototypes can then be mapped back to the patient space via a deep dictionary decoder (see Section 2.3). Note that the loss in Eq. 2 is convex in D given $\{r_{n}\}_{n = 1}^{N}$ and vice versa. Thus, Eq. 2 can be optimized (with local convergence guarantee) via alternating minimization [Chatterji and Bartlett, 2017].
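To make the terms of Eq. 2 concrete, the sketch below evaluates the projection loss for given embeddings, dictionary and sparse codes. It is an illustrative plain-Python version with assumed names (`matvec`, `projection_loss`) and toy values, not the TensorFlow implementation used in the experiments.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major; the dictionary D has d rows and k columns


def matvec(A: Matrix, v: Vector) -> Vector:
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def projection_loss(G: List[Vector], D: Matrix, R: List[Vector],
                    lam1: float, lam2: float) -> float:
    """Eq. 2: sum_n (0.5*||g_n - D r_n||_2^2 + lam1*||r_n||_1) + lam2*||D||_F^2."""
    total = lam2 * sum(x * x for row in D for x in row)  # Frobenius penalty on D
    for g, r in zip(G, R):
        residual = [gi - pi for gi, pi in zip(g, matvec(D, r))]
        total += 0.5 * sum(e * e for e in residual)      # fit of D r_n to g_n
        total += lam1 * sum(abs(x) for x in r)           # sparsity penalty on r_n
    return total


# Toy values (assumed): d = k = 2 and a single patient.
D = [[1.0, 0.0], [0.0, 1.0]]
loss = projection_loss(G=[[1.0, 2.0]], D=D, R=[[1.0, 1.0]], lam1=0.1, lam2=0.01)
```

With D held fixed the loss separates over patients, and with the sparse codes held fixed it is a ridge-type least-squares problem in D, which is the structure that alternating minimization exploits.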

2.3. Deep Dictionary Decoder

To map the patient’s deep dictionary representation (D, r_n) back to the original patient space, we develop a deep dictionary decoder module that maps (D, r_n) to a vector $q_{n} ≜ [q_{n}^{(1)} \dots q_{n}^{(p)}] \in [0, 1]^{p}$ of probability scores such that $q_{n}^{(i)} \in [0, 1]$ denotes the probability that medical code i contributes actively to the patient’s predicted outcome y_n. We parameterize the decoder, written $\mathrm{Dec}(D, r_{n}; W)$ to keep it distinct from the dictionary D, as a neural network with a fully-connected layer followed by a sigmoid activation,

$$q_{n} ≜ \mathrm{Dec}(D, r_{n}; W) ≜ σ(W D r_{n}), \tag{3}$$

where the sigmoid operator σ(x) ≜ 1/(1 + exp(−x)) is applied point-wise to each element of the logit vector WDr_n. The above neural network is parameterized by the weight W of the dense, fully-connected layer. To optimize W, we construct the augmented dataset $\{((D, r_{n}), {\bar{X}}_{n})\}_{n = 1}^{N}$ from the deep dictionary embedding presented in Section 2.2. The augmented dataset is then used to train the above neural network via back-propagation with the following cross-entropy reconstruction loss,

$$L_{r}(W, D, \{r_{n}\}_{n = 1}^{N}) = - \sum_{n = 1}^{N} \sum_{i = 1}^{p} \left[ {\bar{X}}_{n}^{(i)} \log q_{n}^{(i)} + (1 - {\bar{X}}_{n}^{(i)}) \log (1 - q_{n}^{(i)}) \right], \tag{4}$$

where ${\bar{X}}_{n}^{(i)} \in [0, 1]$ is the average occurrence of medical code i over all clinical visits of patient n. Optimizing Eq. 4 thus yields the deep dictionary decoder. To apply this decoder to an unseen patient, we project the patient’s embedded representation g onto D by minimizing $‖ g - D r ‖_{2}^{2}$ with respect to r, which can be solved analytically.
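The decoder and its training target can be sketched for a single patient as follows. This is a minimal plain-Python illustration under stated assumptions: W is taken to have shape p × d so that W D r_n yields one logit per medical code, the function names and toy values are hypothetical, and the loss is written with a leading minus sign, the usual convention for a cross-entropy that is minimized. Eq. 4 sums this per-patient quantity over all N patients.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_occurrence(X: Sequence[Sequence[int]]) -> List[float]:
    """Average each medical code over the patient's visits (the target X-bar_n)."""
    return [sum(column) / len(X) for column in zip(*X)]


def decode(W, D, r) -> List[float]:
    """q_n = sigmoid(W D r_n): form the embedding D r first, then the logits W (D r)."""
    Dr = [sum(a * b for a, b in zip(row, r)) for row in D]
    return [sigmoid(sum(a * b for a, b in zip(row, Dr))) for row in W]


def reconstruction_loss(x_bar: Sequence[float], q: Sequence[float],
                        eps: float = 1e-12) -> float:
    """Cross-entropy between soft targets x_bar and predicted probabilities q."""
    return -sum(t * math.log(qi + eps) + (1.0 - t) * math.log(1.0 - qi + eps)
                for t, qi in zip(x_bar, q))


# Toy patient (assumed): 2 visits, p = 3 codes, d = k = 2.
X_n = [[1, 0, 1], [0, 0, 1]]
x_bar = mean_occurrence(X_n)                    # [0.5, 0.0, 1.0]
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]        # untrained decoder weights
D = [[1.0, 0.0], [0.0, 1.0]]
q_n = decode(W, D, r=[0.3, -0.2])               # all logits are 0, so every q is 0.5
loss_n = reconstruction_loss(x_bar, q_n)
```

Because the targets are averages rather than 0/1 indicators, the loss rewards the decoder for matching how often a code appears across visits, not only whether it appears.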

2.4. Deep Dictionary Predictor for Labeled Data

Given the deep dictionary representation (D, r_n) of the labeled training data $\mathcal{D} = \{(X_{n}, y_{n})\}_{n = 1}^{M}$, we further build a deep dictionary predictor that maps a patient’s sparse code r_n to a vector o_n of predicted probabilities that the patient will be associated with each target outcome (e.g., disease or mortality). This is achieved by parameterizing the predictor with a fully-connected layer followed by a soft-max activation,

$$o_{n} = \mathrm{softmax}(V r_{n} + b) \in ℝ^{c}, \tag{5}$$

where $V \in ℝ^{c \times k}$ and $b \in ℝ^{c}$ are the weight matrix and bias vector, respectively. These parameters can then be learned via minimizing the cross-entropy between the predicted probabilities o_n and the ground-truth label y_n,

$$L_{c}(V, b, \{r_{n}\}_{n = 1}^{M}) ≜ - \sum_{n = 1}^{M} \sum_{i = 1}^{c} y_{n}^{(i)} \log o_{n}^{(i)}, \tag{6}$$

with respect to V and b. Here, c is the number of prediction targets, n = 1 … M indexes the patients in the labeled training data $\mathcal{D}$, and y_n ∈ {0, 1}^c denotes the corresponding patient label characterizing his/her phenotype. $y_{n}^{(i)}$ and $o_{n}^{(i)}$ are the i-th entries of y_n and o_n, respectively.
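The predictor in Eq. 5 and the loss in Eq. 6 can be sketched in a few lines. The function names and toy values below are assumptions for illustration, and the small constant `eps` is added only to guard the logarithm; it is not part of Eq. 6.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(V, b, r) -> List[float]:
    """Eq. 5: o_n = softmax(V r_n + b), with V of shape c x k and b of length c."""
    logits = [sum(v * x for v, x in zip(row, r)) + bias for row, bias in zip(V, b)]
    return softmax(logits)


def prediction_loss(Y, O, eps: float = 1e-12) -> float:
    """Eq. 6: cross-entropy summed over labeled patients and targets."""
    return -sum(y * math.log(o + eps)
                for y_n, o_n in zip(Y, O)
                for y, o in zip(y_n, o_n))


# Toy example (assumed): c = 2 targets, k = 2 dictionary atoms, one labeled patient.
V = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
o_n = predict(V, b, r=[0.0, 0.0])       # equal logits give [0.5, 0.5]
loss = prediction_loss(Y=[[1, 0]], O=[o_n])
```

Only the sparse code r_n enters the predictor, so any improvement in the shared representation is passed on to the prediction loss directly.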

2.5. Collaborative Prediction and Reconstruction

This section develops a collaborative architecture that connects the previously developed deep dictionary decoder and deep dictionary predictor through a common deep dictionary encoder layer (see Fig. 1).

This allows us to optimize the above components simultaneously so that the deep dictionary encoder could interact with both the decoder and predictor modules to figure out a viable communication medium (i.e., the dictionary) that allows them to reach a consensus. Intuitively, the encoder plays the role of a mediator that suggests communication options for the decoder and predictor, and depending on the resulting quality of communication between them (i.e., the incurred prediction and reconstruction losses), the encoder will revise the communication medium until it facilitates an acceptable consensus between the decoder and predictor. This is in contrast to a naive solution that trains these components separately, which does not allow communication/coordination between the encoder, decoder and predictor; and therefore, cannot guarantee that the learned results would be optimized for both reconstruction and prediction.

To facilitate the aforementioned coordination, we aggregate the loss functions of these components (i.e., encoder, decoder and predictor) to generate a combined performance feedback, which can be exploited to update them jointly via stochastic gradient back-propagation,

$$L_{l} = η_{d} L_{d} + η_{r} L_{r} + η_{c} L_{c}, \tag{7}$$

where $L_{l}$ is parameterized by W, D, V, b and $\{r_{n}\}_{n = 1}^{N}$: the projection loss $L_{d}$ depends only on $(D, \{r_{n}\}_{n = 1}^{N})$, the reconstruction loss $L_{r}$ depends on $(W, D, \{r_{n}\}_{n = 1}^{N})$, and the prediction loss $L_{c}$ depends on $(V, b, \{r_{n}\}_{n = 1}^{M})$. The extra hyper-parameters η_d, η_r and η_c are manually tuned to trade off between the individual losses.

In addition, note that the projection and reconstruction losses do not depend on the training labels, so the corresponding components can be pre-trained in advance using only the unlabeled data $\mathcal{U}$. In this case, the loss function reduces to

$$L_{u} = η_{d} L_{d} + η_{r} L_{r}, \tag{8}$$

which is first minimized (using unlabeled data) with respect to D, W and $\{r_{n}\}_{n = 1}^{N}$ to generate a good starting point for these parameters. They are then further optimized via Eq. 7, jointly with the predictor’s parameters (V, b), using labeled data. This still allows end-to-end training that simultaneously optimizes both the supervised and unsupervised components of DDL (albeit from a starting point generated by pre-training its unsupervised component). Training all of these parameters from scratch is, however, computationally inefficient: the large number of parameters being optimized results in very slow convergence. To address this issue, we instead first compute a warm start for the sparse codes $\{r_{n}\}_{n = 1}^{N}$ (as detailed below), which serves as a good starting point from which to initiate gradient back-propagation on the entire network. The main algorithm is shown in Algorithm 1.

To initialize the sparse code r_n for each patient, we fix the dictionary D and the data embedding layer of the network (hence, the patient’s embedded representation g_n). Given (D, g_n), the projection-loss part (Eq. 2) of the objective in Eq. 7 is convex with respect to r_n, and the warm-start problem reduces to the following form,

$$r_{n} = \underset{r \in ℝ^{k}}{\arg\min} \left( \frac{1}{2} ‖ g_{n} - D r ‖_{2}^{2} + λ_{1} ‖ r ‖_{1} \right), \tag{9}$$

which can be solved efficiently using the proximal gradient method [Parikh and Boyd, 2014], as detailed in Algorithm 2. The use of proximal gradient descent is a technical necessity in this context since Eq. 9 is not differentiable everywhere due to the L₁ regularization λ₁∥r∥₁. Proximal gradient descent sidesteps this issue by decomposing Eq. 9 into two parts: (a) the differentiable part $f(r) ≜ \frac{1}{2} ‖ g_{n} - D r ‖_{2}^{2}$, and (b) the non-differentiable part h(r) ≜ λ₁∥r∥₁. Thus, starting from a random initialization of r, we first take a gradient descent step on the differentiable part f(r) and then update r with respect to the non-differentiable part h(r) via the following proximal operator,

$$r \leftarrow \mathrm{prox}_{h}(r) ≜ \underset{r'}{\arg\min} \left( h(r') + \frac{1}{2} ‖ r' - r ‖_{2}^{2} \right), \tag{10}$$

which can be solved analytically. Interested readers are referred to [Parikh and Boyd, 2014] for further details.
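For the L₁ penalty, the proximal operator has the well-known closed form of element-wise soft-thresholding. The sketch below applies it inside a basic proximal gradient loop for Eq. 9. The step size, iteration count, zero initialization (used here for reproducibility in place of a random start) and function names are assumptions for illustration; the sketch is not a reproduction of Algorithm 2. With a step size `step`, the threshold becomes `step * lam1`, and convergence of this basic scheme requires `step` to be at most the reciprocal of the largest eigenvalue of DᵀD.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major; D has d rows and k columns


def soft_threshold(v: Vector, tau: float) -> Vector:
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return [max(x - tau, 0.0) - max(-x - tau, 0.0) for x in v]


def sparse_code(g: Vector, D: Matrix, lam1: float,
                step: float, n_iter: int = 200) -> Vector:
    """Approximately solve Eq. 9 by proximal gradient descent."""
    d, k = len(D), len(D[0])
    r = [0.0] * k  # assumed zero start
    for _ in range(n_iter):
        # Residual D r - g, then the gradient of f(r), which is D^T (D r - g).
        residual = [sum(D[i][j] * r[j] for j in range(k)) - g[i] for i in range(d)]
        grad = [sum(D[i][j] * residual[i] for i in range(d)) for j in range(k)]
        # Gradient step on f, followed by the proximal step on h = lam1 * ||.||_1.
        r = soft_threshold([rj - step * gj for rj, gj in zip(r, grad)], step * lam1)
    return r


# Toy check (assumed values): with D = I and step = 1, the minimizer of Eq. 9
# is soft_threshold(g, lam1), so the small second entry is set exactly to zero.
r = sparse_code(g=[1.0, 0.03], D=[[1.0, 0.0], [0.0, 1.0]], lam1=0.05, step=1.0)
```

The exact zero in the second coordinate is the practical benefit of the proximal step: plain gradient descent on a smoothed penalty would only make such entries small, not sparse.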

3. Experiment

In this section, we empirically evaluate the performance of DDL against several state-of-the-art baseline methods on 3 healthcare datasets, Heart Failure (HF) [Ma et al., 2018], MIMIC-III and a subset of Truven MarketScan Data¹, which contain 16794, 58000 and 72179 EHR samples, respectively. The numbers of clinical variables (p) in HF, MIMIC-III and TRUVEN are 1865, 283 and 283, respectively. In HF and MIMIC-III, the task is to predict the HF and mortality binary outcomes. For TRUVEN data, the prediction task is to predict which clinical variables are positive in the patient’s latest visit given data of his/her past visits.

3.1. Experiment Settings

For each experiment, we randomly generate 5 different partitions of the entire dataset into training, validation and testing sets with a 7 : 1 : 2 ratio. The reported result of each tested method is its averaged result over 5 independent runs corresponding to the 5 different data partitions. Our method is implemented with TensorFlow 1.9.0 and Python 3.5²; and tested on an Intel Xeon E5–2690 machine with 256 GB of RAM and 8 NVIDIA Pascal Titan X GPUs. We evaluate each method using its best fine-tuned hyper-parameter configuration. The best configurations of DDL are described below.
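One simple way to generate such partitions is sketched below. The function name, the use of one seed per partition and the integer rounding are assumptions for illustration; the paper does not specify how the random partitions were drawn.

```python
import random
from typing import List, Tuple


def split_indices(n: int, seed: int) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them 7 : 1 : 2 into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # a separate generator per partition
    n_train = n * 7 // 10
    n_val = n // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # the remainder, roughly 20%
    return train, val, test


# Five partitions of a dataset with 16794 samples (the size reported for HF).
partitions = [split_indices(16794, seed) for seed in range(5)]
```

Each method would then be trained and evaluated once per partition, with the mean and standard deviation taken over the five runs.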

For the HF dataset, the number of hidden units of DDL’s RNN component is set to 100 and its dictionary size is set to 10. The learning rate for gradient back-propagation on the aggregate loss (Eq. 7) is set to 1e-2. To trade off between the projection, reconstruction and prediction losses, we set η_d = 1e-1, η_c = 1 and η_r = 1e-3 in Eq. 7. For the projection loss $L_{d}$ in Eq. 2, the regularization hyper-parameters are set to λ₁ = 5e-2 and λ₂ = 1e-3. For the MIMIC-III dataset, we use the same configuration except for the learning rate (5e-2) and the trade-off coefficients (η_d = η_c = 1 and η_r = 2e-3). For the TRUVEN dataset, we use a similar configuration but with the dictionary size and the RNN component’s hidden size set to 15 and 200, respectively, and with the trade-off coefficients in Eq. 7 adjusted to η_d = 1e-1 and η_c = η_r = 1. The batch size of DDL’s stochastic gradient descent is set to 32 on both HF and MIMIC-III, and to 64 on TRUVEN (since the TRUVEN dataset is larger than the others).

3.2. Baseline Methods

To demonstrate the advantage of DDL, we evaluate and compare its performance with those of the baseline methods, which include (a) state-of-the-art deep phenotyping methods such as Doctor-AI (RNN) [Choi et al., 2016a], Reverse Time Attention RNN (RETAIN) [Choi et al., 2016b] and Denoising Autoencoders for Phenotype Stratification (DAPS) [Beaulieu-Jones and Greene, 2016], and (b) two traditional ML methods, namely Logistic Regression (LR) with an L₂ regularizer and Dictionary Learning (DictLearn) [Mairal et al., 2009]. The LR, DictLearn and DAPS baselines were trained using the aggregated feature vector ${\tilde{X}}_{n} \in \{0, 1\}^{p}$ (instead of the original input X_n), which was derived from the original EHR data X_n such that ${\tilde{X}}_{n}^{(i)} = 1$ if the i-th clinical variable occurred at least once across the clinical visits in X_n, and ${\tilde{X}}_{n}^{(i)} = 0$ otherwise. Furthermore, we also include a simplified version of DDL that excludes the decoder module.
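The aggregated feature vector used by the non-sequential baselines amounts to a logical OR over a patient’s visits: entry i is 1 if code i occurs in at least one visit and 0 if it never occurs. A minimal sketch, with an assumed function name and toy data:

```python
from typing import List, Sequence


def aggregate_visits(X: Sequence[Sequence[int]]) -> List[int]:
    """Collapse a patient's multi-hot visit vectors into one 0/1 vector per patient."""
    return [1 if any(column) else 0 for column in zip(*X)]


# Toy patient (assumed): 2 visits over p = 3 clinical variables.
x_tilde = aggregate_visits([[1, 0, 0], [0, 0, 1]])
```

This aggregation discards both the order of the visits and how often a variable recurs, which is information the sequence-based models retain.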

3.3. Comparison in Supervised Setting

This section evaluates and reports the performance of DDL and the baseline methods in the supervised setting, that is, when training uses only labeled EHR data. For experiments on the HF and MIMIC-III datasets, the performance of each method is measured by the standard ROC-AUC (Area Under the Receiver Operating Characteristic Curve) score for binary classification, where a higher score implies better performance. For the TRUVEN dataset, the final prediction is generated by combining the results of individual binary classification tasks (one for each variable in the patient’s last clinical visit). In particular, the top k variables with the largest predicted probabilities of being positive are treated as positive labels, while the others are associated with negative labels in our multi-label prediction. Thus, the performance of each method on the TRUVEN dataset is measured using the following top-k recall metric:

$$\text{top-}k\ \text{recall} ≜ \frac{\#\ \text{of true positives in the top-}k\ \text{predictions}}{\#\ \text{of true positives}} \tag{11}$$
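For a single patient, the metric in Eq. 11 can be computed as in the sketch below. The function name and the toy probabilities are assumptions for illustration; how per-patient values are averaged over the test set is not specified here and is left out.

```python
from typing import Sequence


def top_k_recall(probs: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of truly positive variables that appear among the k highest scores."""
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("top-k recall is undefined when there are no positives")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / n_positive


# Toy example (assumed): 5 clinical variables, 3 of which are truly positive.
recall = top_k_recall(probs=[0.9, 0.1, 0.8, 0.3, 0.2], labels=[1, 0, 0, 1, 1], k=2)
```

In this toy case the two highest-scoring variables contain one of the three true positives, so the recall is one third; increasing k can only keep the value the same or raise it.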

The averaged performance (with standard deviation) over 5 different runs of all tested methods is reported in Table 1 for the HF and MIMIC-III datasets, and in Table 3 for the TRUVEN dataset. The incurred training times of all methods are reported in Table 2 (HF and MIMIC-III) and Table 3 (TRUVEN). The reported results show that: (1) DDL consistently outperforms all baseline methods in terms of prediction accuracy (ROC-AUC or top-k recall) on all datasets, which demonstrates the advantage of using hybrid deep dictionary learning in predictive phenotyping; (2) among the baselines, DL-based methods perform significantly better than traditional ML baselines such as LR and DictLearn, which justifies the choice of using DL as the base model (in our framework) to be combined with dictionary learning; and (3) DDL with decoder achieves better results than DDL without decoder, thus supporting our earlier statement that jointly training both supervised and unsupervised components (instead of training them separately) allows them to coordinate on a better data representation.

Table 1:

Averaged prediction accuracy (ROC-AUC) of DDL and baseline methods (with standard deviation) evaluated on HF and MIMIC-III datasets. Higher ROC-AUC implies better performance.

Model	HF	MIMIC-III
LR	0.636 ± 0.003	0.763 ± 0.004
DictLearn	0.634 ± 0.004	0.771 ± 0.006
DAPS	0.651 ± 0.002	0.794 ± 0.004
RNN	0.668 ± 0.004	0.807 ± 0.004
RETAIN	0.675 ± 0.003	0.817 ± 0.004
DDL	0.682 ± 0.004	0.819 ± 0.004
DDL w.o. decoder	0.669 ± 0.006	0.813 ± 0.004

Table 3:

Averaged prediction accuracy (top-k recall) and incurred time (sec) of DDL and baseline methods (with standard deviation) on TRUVEN dataset. Higher top-k recall implies better performance.

Model	Top-50 recall	Top-30 recall	Time (sec)
LR	0.753 ± 0.002	0.605 ± 0.002	43.0 ± 1.2
DictLearn	0.752 ± 0.002	0.610 ± 0.002	84.2 ± 2.0
DAPS	0.773 ± 0.003	0.621 ± 0.002	232 ± 34
RNN	0.783 ± 0.003	0.630 ± 0.002	113.0 ± 8.2
RETAIN	0.786 ± 0.007	0.632 ± 0.005	134.4 ± 12.0
DDL	0.791 ± 0.004	0.634 ± 0.003	624 ± 34
DDL w.o. decoder	0.780 ± 0.05	0.627 ± 0.005	427 ± 31

Table 2:

Averaged training time (sec) incurred by DDL and baseline methods (with standard deviation) on HF and MIMIC-III datasets.

Model	HF	MIMIC-III
LR	34.3 ± 2.1	21.2 ± 1.9
DictLearn	49.3 ± 2.9	31.2 ± 3.0
DAPS	125.0 ± 5.6	86.4 ± 2.3
RNN	64.6 ± 8.4	25.6 ± 3.2
RETAIN	100.7 ± 10.1	79.4 ± 14.3
DDL	234.4 ± 21.3	178.3 ± 11.0
DDL w.o. decoder	183 ± 14.1	128.3 ± 6.0

3.4. Comparison in Semi-Supervised Setting

To showcase the advantage of leveraging unlabeled data, we design two scenarios to demonstrate it empirically. In the first scenario, we fix the number of labeled data samples and observe the changes in DDL’s and baseline methods’ performance when the number of unlabeled data samples varies. The results are plotted in Figure 2, which show the tested methods’ average prediction accuracies of 3 independent runs and their confidence intervals. It can be observed that across all datasets, the prediction performance increases as more unlabeled data is used for training, thus supporting our claim earlier that unlabeled data can be exploited as an extra source of information to improve performance of predictive phenotyping models.

Figure 2: Graphs of average accuracy with 95% confidence interval of DDL and DAPS on (a) HF, (b) MIMIC-III, and (c) TRUVEN datasets. The reported performance is generated using 150 labeled data samples and a varying number of unlabeled data samples.

In the second scenario, we fix the total number of training data samples while varying the percentage of labeled samples within the training set, and observe how the tested methods’ performance changes accordingly. The results in Figure 3 show that DDL outperforms other methods most significantly when the amount of labeled data is limited. In this regime, the performance of RNN and RETAIN degrades the most since they cannot make use of unlabeled data. DAPS, on the other hand, can leverage unlabeled data and performs better than RNN and RETAIN, but still worse than DDL since it does not jointly train its supervised and unsupervised components and is thus vulnerable to artifacts in unlabeled data.

Figure 3: Graphs of average accuracy (with 95% confidence interval) of DDL and baseline methods on (a) HF, (b) MIMIC-III, and (c) TRUVEN datasets. In all experiments, the size of the training dataset is fixed while the fraction of labeled data is varied.

4. Related Works

In this section, we briefly discuss the successful use of deep learning in predictive phenotyping. [Choi et al., 2016a] used a Recurrent Neural Network (RNN) to model temporal data in Electronic Health Records (EHR). [Choi et al., 2016b; Ma et al., 2017; Choi et al., 2017] used attention mechanisms to detect important visits and clinical variables. [Choi et al., 2018] utilized multilevel structures to model EHR data in a finer-grained manner. [Ma et al., 2018] extracted comprehensive patient patterns using time-aware neural architectures.

5. Conclusion

This paper addressed the label-insufficiency issue in predictive healthcare, where unlabeled data are abundant but labeled data are limited. To combine both labeled and unlabeled data for better predictive performance, we proposed a deep dictionary learning (DDL) framework that utilizes unlabeled data to learn a more generalizable representation of data for the predictive model. DDL was evaluated on real-world healthcare datasets (MIMIC-III, Heart Failure and Truven) for disease prediction. The results consistently show that the representations learned from unlabeled data generalize better and improve the predictive accuracy.

Acknowledgements

This work was supported by the National Science Foundation awards IIS-1418511, CCF-1533768 and IIS-1838042, and the National Institutes of Health awards 1R01MD011682-01 and R56HL138415.

Footnotes

¹ https://marketscan.truvenhealth.com/

² Code is available at https://github.com/futianfan/dictionary.

References

Baytas Inci, Xiao Cao, Zhang Xi, Wang Fei, Jain Anil, and Zhou Jiayu. Patient Subtyping via Time-Aware LSTM Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 65–74, Halifax, Canada, September 2017. ACM.
Beaulieu-Jones Brett and Greene Casey. Semi-Supervised Learning of the Electronic Health Record for Phenotype Stratification. Journal of Biomedical Informatics, 64(2):168–178, October 2016.
Chatterji Niladri and Bartlett Peter. Alternating Minimization for Dictionary Learning with Random Initialization. In Advances in Neural Information Processing Systems, pages 1997–2006, Long Beach, USA, December 2017.
Choi Edward, Bahadori Mohammad Taha, Schuetz Andy, Stewart Walter F., and Sun Jimeng. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. In Machine Learning for Healthcare Conference, pages 301–318, Stanford, USA, May 2016a. JMLR.org.
Choi Edward, Bahadori Mohammad Taha, Sun Jimeng, Kulas Joshua, Schuetz Andy, and Stewart Walter F. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. In Advances in Neural Information Processing Systems, pages 1493–1501, Barcelona, Spain, December 2016b.
Choi Edward, Bahadori Mohammad Taha, Song Le, Stewart Walter F., and Sun Jimeng. GRAM: Graph-based Attention Model for Healthcare Representation Learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795, Halifax, Canada, September 2017. ACM.
Choi Edward, Xiao Cao, Stewart Walter F., and Sun Jimeng. MIME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare. In Advances in Neural Information Processing Systems, pages 4547–4557, Montreal, Canada, December 2018.
Dligach Dmitriy, Miller Timothy, and Savova Guergana K. Semi-Supervised Learning for Phenotyping Tasks. In AMIA Annual Symposium Proceedings, pages 502–511, San Francisco, USA, April 2015. American Medical Informatics Association, AMIA.
Esteban Cristóbal, Staeck Oliver, Baier Stephan, Yang Yinchong, and Tresp Volker. Predicting Clinical Events by Combining Static and Dynamic Information using Recurrent Neural Networks. In IEEE International Conference on Healthcare Informatics (ICHI), pages 93–101, Xi-an, China, October 2016. IEEE Computer Society.
Hochreiter Sepp and Schmidhuber Jürgen. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, Apr-Jun 1997.
Ma Fenglong, Chitta Radha, Zhou Jing, You Quanzeng, Sun Tong, and Gao Jing. DIPOLE: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1903–1911, Halifax, Canada, September 2017. ACM.
Ma Tengfei, Xiao Cao, and Wang Fei. Health-ATM: A Deep Architecture for Multifaceted Patient Health Record Representation and Risk Prediction. In Proceedings of the 2018 SIAM International Conference on Data Mining, pages 261–269, San Diego, USA, May 2018. SIAM.
Mairal Julien, Bach Francis, Ponce Jean, and Sapiro Guillermo. Online Dictionary Learning for Sparse Coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696, Montreal, Canada, June 2009. ACM.
Parikh Neal and Boyd Stephen. Proximal Algorithms. Foundations and Trends® in Optimization, 1(3):127–239, January 2014.
Tariyal Snigdha, Majumdar Angshul, Singh Richa, and Vatsa Mayank. Deep Dictionary Learning. IEEE Access, 4(1):10096–10109, Apr-Jun 2016.

