Transitive Sequencing Medical Records for Mining Predictive and Interpretable Temporal Representations

Hossein Estiri; Zachary H Strasser; Jeffery G Klann; Thomas H McCoy, Jr; Kavishwar B Wagholikar; Sebastien Vasey; Victor M Castro; MaryKate E Murphy; Shawn N Murphy

2020 Jun 18;1(4):100051. doi: 10.1016/j.patter.2020.100051

PMCID: PMC7301790 NIHMSID: NIHMS1611213 PMID: 32835307

Summary

Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performances of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype “differentiators” and predictors than the “atemporal” EHR records. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient's history.

Keywords: electronic health records, temporal representations, sequencing, data representation, phenotyping, diagnosis prediction, machine learning, dimensionality reduction, disease trajectories

Highlights

• A transitive sequential pattern mining (TSPM) approach is presented for clinical data
• TSPM mines temporal sequences from discrete electronic health records (EHR) data
• TSPM representations from EHRs are superior disease “differentiators” and predictors
• An ML pipeline is introduced to apply TSPM in phenotype prediction and classification

The Bigger Picture

Over the past decade, billions of dollars have been spent to institute meaningful use of electronic health record (EHR) systems. For a multitude of reasons, however, EHR data are still complex and have ample quality issues, which make it difficult to leverage these data to address pressing health issues, especially during pandemics such as COVID-19, when rapid responses are needed. In this paper, we propose a transitive sequential pattern mining algorithm for exploiting the temporal information in the EHRs that are distorted by layers of administrative and healthcare system processes. Perhaps more importantly, we propose a machine learning (ML) pipeline that is capable of engineering predictive features without the need for expert involvement to model diseases and health outcomes. Together, the temporal sequences and the ML pipeline can be rapidly deployed to develop computational models for identifying and validating novel disease markers and advancing medical knowledge discovery.

Electronic health records (EHRs) contain valuable temporal information about the progression of diseases and treatment trajectories. However, time is not precisely captured in clinical data. In a study of congestive heart failure, Estiri and colleagues propose an approach for extracting temporal sequential patterns from EHR data that improve phenotype prediction and classification and are also interpretable.

Introduction

The widespread adoption of electronic health records (EHRs) has led to the accumulation of an unprecedented amount of patient health information. EHRs contain important temporal information that provides an opportunity for discovering new insights about disease progression and treatment trajectories. However, EHR observations reflect a complex set of processes that thwart their seamless translation into actionable knowledge. Namely, the raw EHR records may not be direct indicators of patients' “true” health states at different time points, but rather reflect the clinical processes (e.g., policies and workflows of the provider and payor), the patients' interactions with the system, and the recording processes.¹,²,³

Biomedical researchers increasingly apply conventional association studies to EHR data, yet the temporality of these data has not been fully exploited by current methods.¹,⁴ The temporal properties of EHR data reflect the complexities of the healthcare processes, which makes incorporating temporal information into common temporal analysis techniques challenging.⁵ The challenge is twofold. First, EHR observations are recorded asynchronously across time (i.e., measured at different times and irregularly), which poses a fundamental challenge for directly applying common temporal analysis methods.⁶,⁷,⁸,⁹ Second, translating the temporal nature of discrete EHR data into useful data representations (or features) for standard machine learning (ML) algorithms is challenging.¹⁰,¹¹ This study aims to address this critical gap by developing and testing a temporal representation mining algorithm for asynchronously recorded discrete EHR data.

We propose a transitive sequencing algorithm for constructing temporal representations from medication and diagnosis data in EHRs. We apply it to classifying and predicting congestive heart failure (CHF). Our results demonstrate that temporal sequences of electronic medical records improve computational classification of patient cohorts and also improve phenotype prediction when no record of the phenotype exists in the medical records. The proposed transitive sequential representations are interpretable and are also more predictive features for standard ML algorithms than “atemporal” representations of discrete EHR data. We found that the sequential representations improve CHF classification by over 4% and prediction by over 13%. We also demonstrate that, for ML using clinical data that inherit various biases from the recording process, the proposed transitive sequential representations are better suited than the sequences mined through traditional sequential pattern mining (SPM) algorithms.

Background

An extensive body of work exists on extracting interval-based symbolic representations from clinical measurement data (e.g., laboratory test results),¹²,¹³ often known as the temporal abstraction (TA) approach.¹⁴,¹⁵ Although forms of such representations have been utilized as features in classification/prediction tasks,¹⁶,¹⁷,¹⁸,¹⁹ application in ML is not the focus of the TA agenda. Furthermore, the development of TA methods has been largely confined to continuous clinical measurement data.¹²,²⁰,²¹ In addition to continuous data, however, EHRs contain discrete data, such as records of diagnoses, medications, and procedures.

For discrete clinical data, SPM²² approaches, such as the frequent sequential pattern (FSP),²² are promising. The goal in SPM is to discover relevant subsequences from a large set of sequences (of events or items) with time constraints. Several SPM algorithms have been developed to improve mining efficiency and address specific domain needs (for recent surveys of SPM algorithms, refer to the studies by Fournier-Viger et al. and Truong-Chi and Fournier-Viger²³,²⁴). As a result, SPM algorithms are fairly mature in their computational and data management schemes. However, because the SPM approaches were primarily developed for transaction data, the importance of a sequential pattern for use in downstream ML algorithms is often determined by an occurrence frequency threshold, known as the minimum support.²⁵,²⁶ The goal is to cut the number of data representations by finding the most frequent temporal patterns among all patterns. This strategy has been used by the majority of the literature applying SPM approaches to clinical data mining.⁴,⁷,¹⁰,²⁶,²⁷,²⁸,²⁹ However, because the temporal patterns are mined based on frequency alone, some may not make clinical sense or may not suit clinical data that inherit various biases through the recording process.
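To make the minimum support criterion concrete, the following is a minimal sketch, not one of the cited SPM algorithms. It assumes each patient's record is an ordered list of event names, restricts patterns to ordered pairs, and defines support as the fraction of sequences containing the pair; the function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pattern_support(sequences):
    """Fraction of sequences that contain each ordered pair (a before b)."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves input order, so a always precedes b.
        # A set ensures each pattern is counted once per sequence.
        pairs = {(a, b) for a, b in combinations(seq, 2)}
        counts.update(pairs)
    n = len(sequences)
    return {pattern: c / n for pattern, c in counts.items()}


def frequent_patterns(sequences, min_support):
    """Keep only patterns whose support reaches the minimum support."""
    support = pattern_support(sequences)
    return {p: s for p, s in support.items() if s >= min_support}


# Toy records (illustrative only, not study data).
sequences = [
    ["hypertension", "lisinopril", "heart failure"],
    ["hypertension", "heart failure", "metoprolol"],
    ["dorsalgia", "hypertension", "lisinopril"],
]
frequent = frequent_patterns(sequences, min_support=2 / 3)
```

In this toy data only the two pairs starting with "hypertension" survive the threshold, which illustrates how a frequency cutoff discards rarer patterns regardless of their clinical relevance.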

Apriori-based SPM methods are popular in the healthcare domain. The Apriori property is that if a sequence cannot pass the minimum support test (i.e., is not considered frequent), all of its supersequences will be ignored. Examples of Apriori-based algorithms with use cases in clinical data are sequential pattern mining with regular expression constraints (SPIRIT),³⁰ sequential pattern discovery using equivalence classes (SPADE),³¹ and sequential pattern mining (SPAM).³²

In particular, a couple of studies have applied adjustments to the SPM's frequency-based criterion for feature selection using clinical data. Liu et al.³³ proposed a temporal graph approach to predict the onset of CHF by summarizing multiple sequences of events recorded for a patient.³³ Due to added complexities in the network of events encapsulated in the temporal graphs, the authors had to work through a specialized generalization method. Guo et al.³⁴ showed the efficiency of sequential patterns in predicting the risk of acute ischemic stroke over Framingham and CHA₂DS₂-VASc models.³⁴ They applied feature selection procedures to reduce the dimensionality of the temporal patterns mined by the popular SPM algorithm (SPAM³²).

Considering EHRs as “indirect” reflections of a patient's true health state, we propose an algorithm for mining transitive sequential patterns from discrete EHR data, apply dimensionality reduction, and use the top features in phenotype classification and prediction. Our approach provides a specification for SPM and a formal dimensionality reduction procedure, minimize sparsity and maximize relevance (MSMR), that integrates feature selection into the classification task.

Results

Study design is illustrated in Figure 1. A summary of the patient characteristics is provided in Table 1. Changes in the number of unique medication and diagnosis records through feature engineering and initial dimensionality reduction processes are presented in Table 2. For classification, we began with more than 45,000 unique medication/diagnosis records in the aggregated vector representation (AVR) approach, from which we mined over 137 million unique transitive sequence representations. These numbers were over 25,000 records and 30 million sequenced representations in the prediction task. Sparsity screening at the threshold of 1% (68 patients for classification and 58 patients for prediction) resulted in 6,469 AVR representations (i.e., unique medication/diagnosis records) and over 1,300,000 sequenced representations in the classification task. For prediction, the sparsity screening resulted in a reduction in dimensionality to over 2,500 AVR and 100,000 sequenced representations.
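The sparsity screening step can be sketched as follows. This is an illustration under stated assumptions, not the study's code: it assumes a feature is kept when it is recorded for at least 1% of patients, that the patient cutoff is rounded down (which reproduces 68 for the 6,851 training patients in Table 1), and that each patient is represented by a set of feature names. The function name and data layout are hypothetical.

```python
def sparsity_screen(patient_features, percent=1):
    """Keep features recorded for at least `percent`% of patients.

    patient_features: dict mapping patient id -> set of feature names.
    """
    n_patients = len(patient_features)
    # Integer arithmetic rounds the cutoff down (assumption): 6,851 -> 68.
    min_patients = n_patients * percent // 100
    counts = {}
    for features in patient_features.values():
        for name in features:
            counts[name] = counts.get(name, 0) + 1
    return {name for name, c in counts.items() if c >= min_patients}


# Toy cohort of 200 patients, so the 1% cutoff is 2 patients.
cohort = {i: {"common"} for i in range(200)}
cohort[0] = {"common", "borderline", "rare"}
cohort[1] = {"common", "borderline"}
kept = sparsity_screen(cohort)
```

Here "rare" appears for a single patient and is dropped, while "borderline" just meets the cutoff and is kept.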

Figure 1.

The Study Design Encompasses Representation Mining, Dimensionality Reduction, and ML Experiments

Table 1.

Summary of Patient Characteristics in the Test and Training Sets

	Test Set (N = 56)	Training Set (N = 6,851)
Gender	female: 52%	female: 42%
	male: 48%	male: 58%
Race	white: 86%	white: 86%
	black: 9%	black: 7%
	Asian: ∼0%	Asian: 1%
	other/unknown: 5%	other/unknown: 6%
Ethnicity	Hispanic: ∼0%	Hispanic: 4%
Observation range	mean: 17 years (SD: 6.5)	mean: 16 years (SD: 7.8)
Age	mean: 72 years (SD: 14.7)	mean: 68 years (SD: 13.8)

Table 2.

Number of Unique Representation Records through Representation Mining and Dimensionality Reduction Steps

	Task	Start	Sparsity Screen	Mutual Information	Variable Importance (MDG)
AVR	classification	45,767	6,469	3,000	200
AVR	prediction	25,478	2,552	2,552	200
BOS	classification	137,735,403	1,349,704	3,000	200
BOS	prediction	30,962,075	107,760	3,000	200
BOS-AVR				6,000	200

At the conclusion of the MSMR algorithm, for each approach the top 200 representations were selected based on their mean decrease in Gini (MDG) (in tied situations, mutual information and prevalence were used). In the next section we compare the classification results obtained from the four different classifiers using the top 200 features.
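The mutual information step listed in Table 2 can be illustrated with a short sketch. It assumes each representation is encoded as a binary presence/absence indicator per patient and that the label is binary; the helper names are hypothetical, and the subsequent MDG ranking (a random forest variable importance) is not shown.

```python
import math


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length binary lists."""
    n = len(x)
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            joint = sum(1 for a, b in zip(x, y) if a == xv and b == yv) / n
            px = sum(1 for a in x if a == xv) / n
            py = sum(1 for b in y if b == yv) / n
            if joint > 0:  # zero-probability cells contribute nothing
                mi += joint * math.log2(joint / (px * py))
    return mi


def top_k_by_mi(features, labels, k):
    """Rank feature names by mutual information with the label; keep the top k."""
    ranked = sorted(
        features,
        key=lambda name: mutual_information(features[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data: six patients, three candidate representations.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "informative": [1, 1, 1, 0, 0, 0],
    "noisy": [1, 0, 1, 0, 1, 0],
    "constant": [1, 1, 1, 1, 1, 1],
}
selected = top_k_by_mi(features, labels, k=2)
```

A feature that never varies carries no information about the label, so it ranks last and is dropped first.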

In general, we found that the bag-of-sequences (BOS) approach outperforms the AVR approach for both classification and prediction tasks. We review the results by hypothesis. Testing median and best area under the receiver-operating characteristic (ROC) curve (AUC-ROC) values from each approach, classifier, and feature set are provided in Table 3, and performance metrics from cross-validation are reported in Table S1. Figure 2 presents the classification (top plot) and prediction (bottom plot) AUC values as well as the distribution of those values across all three approaches (AVR, BOS, and BOS-AVR), and by classifiers.

Table 3.

Test Set Median and Best Classification and Prediction Performances across Approaches and Classifiers

		Classification					Prediction
Classifier	AUC-ROC	AVR	BOS	Δ (%)	BOS-AVR	Δ (%)	AVR	BOS	Δ (%)	BOS-AVR	Δ (%)
Bayesian GLM^a	median	0.83	0.87	4.4%	0.88	4.4%	0.73	0.79	8.4%	0.67	−8.4%
Bayesian GLM^a	best	0.84	0.88	4.3%	0.89	4.9%	0.73	0.79	8.4%	0.67	−8.4%
L1 Logistic Reg.^b	median	0.83	0.85	3%	0.88	5.89%	0.67	0.79	16.7%	0.66	−2.4%
L1 Logistic Reg.^b	best	0.86	0.88	1.8%	0.89	3.3%	0.72	0.79	9.5%	0.67	−7%
MA Neural Net.^c	median	0.80	0.85	6.9%	0.85	6.9%	0.67	0.81	20.4%	0.75	12.4%
MA Neural Net.^c	best	0.87	0.88	0.6%	0.92	5.4%	0.77	0.85	10.8%	0.79	2.4%
SVM CW^d	median	0.79	0.80	2.3%	0.83	5.6%	0.79	0.84	6.7%	0.78	−1.4%
SVM CW^d	best	0.83	0.83	0%	0.85	2.5%	0.82	0.87	6.2%	0.83	1.3%
Average	median	0.81	0.84	4.1%	0.86	6.2%	0.72	0.81	13%	0.72	0.1%
Average	best	0.85	0.86	1.5%	0.89	4%	0.76	0.83	8.7%	0.74	−2.9%

Median and best AUC-ROC values are obtained from ten classification iterations with bootstrap cross-validation.

Best AUC-ROC performances are highlighted by bold font.

^a Bayesian generalized linear model.

^b Regularized logistic regression (L1).

^c Model averaged neural network.

^d Support vector machines with class weights.
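For readers unfamiliar with the metric, the sketch below shows one way to compute an AUC-ROC value and to summarize several iterations by their median and best values, as reported in Table 3. It is an illustration only: it uses the rank-based definition of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counting half), and the function names and numbers are hypothetical rather than taken from the study.

```python
from statistics import median


def auc_roc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Median and best AUC across repeated iterations."""
    return {"median": median(aucs), "best": max(aucs)}


# Toy example: four patients, one mis-ranked pair out of four.
example_auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
summary = summarize([0.80, 0.85, 0.87])
```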

Classification

We found that, across the four classifiers, the BOS representations provided an improvement in classification performance over the AVR representations by an average of 4.1% AUC-ROC (Table 3). The average median classification AUC-ROC values were 0.81 (AVR), 0.84 (BOS), and 0.85 (BOS-AVR). The best overall classification performances were from model averaged neural networks (MA Neural Net.) that resulted in AUC-ROC values of 0.87, 0.88, and 0.92 using the AVR, BOS, and the BOS-AVR data representations, respectively. For the purpose of classification, combining data representations from the AVR and BOS approaches resulted in the best performance (AUC-ROC = 0.92).

Prediction

For predicting heart failure, the performance improvements provided by the BOS representations were even more substantial when using medication and diagnosis records from prior to the first diagnosis of the phenotype in the medical records. On average, the BOS representations improved median prediction performance from AVR representations by 13% in AUC-ROC (Table 3). The average median prediction AUC-ROC values were 0.72 (AVR), 0.81 (BOS), and 0.72 (BOS-AVR). For the purpose of prediction, the sequenced data representations (BOS) provided the best performances (AUC-ROC = 0.87). When the AVR and BOS representations were combined, the prediction performance was inferior to BOS-only representations. Using the data from before the first record of the CHF, the best overall prediction performances were from support vector machines with class weights (SVM CW) with AUC-ROC values of 0.82, 0.87, and 0.83, respectively from the AVR, BOS, and the BOS-AVR data representations.

Sequential Pattern Mining

As described in Experimental Procedures, we also mined the traditional sequential patterns and selected the most frequent to represent the SPM for comparison. Because this comparison was not the focus of this study, we only compared the results of the SPM against BOS with regularized logistic regressions. As illustrated in Figure 2, in the classification task, the SPM approach's performance was only slightly inferior to that of BOS, but statistically inferior to the BOS-AVR approach. Similar to the BOS and the BOS-AVR approaches, the SPM resulted in better classification performance than the AVR. In prediction, however, the transitive sequencing algorithm clearly outperformed the SPM: median AUC-ROC 0.634 (SPM) versus 0.79 (BOS). These results suggest that the transitive sequencing algorithm is better suited to modeling clinical data than the conventional SPM.

Clinical Interpretations

We used the visual dashboard to further explore the top 200 sequences for classification and prediction. A snapshot of the visualizations and functionalities of the dashboard is provided in Figure 3. The landing page in the dashboard provides an interactive flow diagram (Sankey plots) constructed from the classification/prediction sequences identified by the MSMR algorithm. The user has the ability to zoom into specific sequences by either selecting the sequence or specifying the rank threshold. In addition, the dashboard provides queryable tabular pages that present metrics including the prevalence, mutual information, and MDGs for selected sequences. Additionally, it provides donut chart visualizations of the likelihood of heart failure given a specific sequence versus the CHF likelihood for the individual elements of the sequence (for examples see Figure 4).

Figure 3.

A Snapshot of the Developed Graphical Dashboard for Further Exploration of the Sequences

Figure 4.

Phenotype Probability by Data Representation

The donut charts illustrate the probability of phenotype for patients who have at least one record of the feature. AVR data representations are common medication or diagnosis codes in EHRs. The BOS data representations are sequenced features mined in this research.

The transitive sequences are, in general, better “differentiators” for identifying heart failure than the “atemporal” EHRs. The signal obtained from individual features as to whether a patient truly has (or does not have) heart failure often strengthens when the features are in sequences. In Figure 4, the records for “heart failure,” “chronic obstructive pulmonary disease,” and “benzodiazepines” give probabilities of CHF at 45%, 47%, and 63%, respectively. However, when these features are sequenced with one another, the probability of heart failure increases. For example, the sequence “heart failure → benzodiazepine” has a likelihood of 64% for heart failure and “heart failure → other chronic obstructive pulmonary disease” has a likelihood of 78%. The temporal sequences confer a greater signal that a patient truly has heart failure compared with the raw elements (i.e., AVR features).
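The probabilities discussed here can be read as simple conditional proportions. The sketch below assumes the definition given for the donut charts (the share of patients with at least one record of a feature who have the phenotype); the data structure, function name, and toy patients are hypothetical and do not reproduce the study's figures.

```python
def phenotype_probability(patients, feature):
    """Share of patients with at least one record of `feature` who have the phenotype."""
    with_feature = [p for p in patients if feature in p["features"]]
    if not with_feature:
        return None  # the feature never occurs, so the proportion is undefined
    return sum(p["has_phenotype"] for p in with_feature) / len(with_feature)


# Toy cohort: features may be single codes (AVR) or sequences (BOS).
patients = [
    {"features": {"heart failure", "heart failure → metoprolol"}, "has_phenotype": True},
    {"features": {"heart failure"}, "has_phenotype": False},
    {"features": {"heart failure", "heart failure → metoprolol"}, "has_phenotype": True},
    {"features": {"metoprolol"}, "has_phenotype": False},
]
p_code = phenotype_probability(patients, "heart failure")
p_sequence = phenotype_probability(patients, "heart failure → metoprolol")
```

In this toy cohort the sequence is a sharper indicator than the single code because it excludes the patient who has the code but not the phenotype.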

From among the top sequences, clinical experts identified those sequences that match a common clinical narrative among patients with heart failure versus those that lack an obvious clinical explanation. Table 4 has specific examples of the two groups of transitive sequences. For example, the sequence “abnormalities of breathing → cardiomyopathy” matches a common clinical scenario in heart failure patients. One would expect a heart failure patient to have the symptom of difficulty breathing, and it is likely for the patient to then be given the diagnosis of cardiomyopathy based on subsequent imaging studies. Another sequence that is easy to interpret is “heart failure → metoprolol.” Metoprolol is a common medication for heart failure and is frequently started after a diagnosis of heart failure. A less obvious sequence is “topical anti-infectives → unspecified kidney failure.” Neither of these two components is clearly related to heart failure in the same way that difficulty breathing, cardiomyopathy, and metoprolol are related to heart failure.

Table 4.

Illustrative Examples of Expected and Obscure Transitive Sequences

Easy to Explain	Difficult to Explain
Classification
• abnormalities of breathing → cardiomyopathy	• topical anti-infectives → unspecified kidney failure
• abnormalities of breathing → captopril	• dorsalgia → cardiomyopathy
• magnesium gluconate → cardiomyopathy	• GERD → pacemaker
• heart failure → pleural effusion	• metoprolol → levofloxacin
• heart failure → metoprolol	• nail disorders → furosemide
Prediction
• essential hypertension → lisinopril	• docusate → SSRI anti-depressants
• chronic ischemic heart disease → angina pectoris	• docusate → propofol
• essential hypertension → type 2 diabetes mellitus	• ondansetron → glucose
• cardiomyopathy → platelet aggregation inhibitors	• fever of unknown origins → vancomycin
• cardiac murmur → complications of heart disease	• vancomycin → fentanyl

GERD, gastroesophageal reflux disease; SSRI, selective serotonin reuptake inhibitor.

Discussion

Using transitive sequences of EHR observations, we constructed data representations that are both predictive and interpretable. In the context of phenotyping CHF patients, our results demonstrate that harnessing the knowledge of disease progression through temporal sequencing (the BOS approach) improves classification and prediction over the conventional approach (AVR). The classification and prediction performances obtained in this study are comparable with the state-of-the-art classification/prediction models for heart failure. For example, the highest median AUC-ROC in Wu et al.³⁵ was 0.77. Similarly, Liu et al.³³ observed an AUC-ROC of 0.72, and Miotto et al.³⁶ observed 0.87. Our best overall classification AUC-ROC values from the BOS and BOS-AVR models were 0.88 and 0.92, respectively.

Shah et al.³⁷ found that AUC-ROC values ranged from 0.70 to 0.76 for predicting a combined outcome of death and cardiovascular hospitalization. Our best overall prediction AUC-ROC values were 0.82, 0.87, and 0.83, respectively from the AVR, BOS, and the BOS-AVR data representations.

The temporal relationships encoded in the BOS approach capture some of the complexities of the clinical process that are lost in the conventional approach. Certain sequences in the BOS model undoubtedly correspond to the clinical narrative that is common for CHF patients. For example, the sequence “abnormalities in breathing → captopril” in the BOS model is a better indicator than either feature on its own in the AVR model (65% versus 49% and 57%). This improved accuracy could be attributable to the sequence's ability to capture the relative timing of the two events. Patients who have abnormalities in breathing may have their symptoms attributable to not just heart failure but also pneumonia, chronic obstructive pulmonary disease, or other pulmonary diseases. Similarly, many patients taking captopril take it for hypertension, chronic kidney disease, or after a myocardial infarction. However, if a patient has abnormalities of breathing and then goes on to take captopril, there is a greater chance that the underlying cause of his or her symptoms and the need for this particular medication is due to CHF. The disease offers a unified explanation for both the symptoms (abnormalities in breathing) and the treatment (captopril). The transitive sequence is better able to represent the patient's clinical experience than either of the individual components on their own.

The sequences that initially seem less obvious could give insight into a disease. For example, as mentioned above, the sequence “topical anti-infectives → unspecified kidney failure” was labeled as difficult to explain. However, an argument could be made that this sequence is still clinically interpretable for heart failure patients. For example, it is likely that patients with heart failure are chronically ill and therefore more susceptible to dermatological infections. Their heart failure could lead to and exacerbate kidney disease. While few physicians would cite this sequence as obvious among this population, it could still be a less recognized but common relationship among heart failure patients. Also, in certain cases such sequences could potentially generate hypotheses for novel clinical relationships not previously appreciated.

Many of the prediction sequences are risk factors for CHF, the corresponding symptoms and medications for those risk factors, or the symptoms of the disease itself. For example, the sequence “essential hypertension → type 2 diabetes mellitus” was identified as a significant sequence. Both features are common and specific risk factors for developing heart failure. It makes sense that such risk factors would be important clinical sequences because they have a known pathological process that leads to the development of heart failure. Moreover, like the classifying sequences, there are also prediction sequences that are less obvious. For example, “gout → encounter for immunization” was identified as an important sequence. At first glance, neither component of this feature seems related to heart failure. However, encounters for immunization are often performed in patients with poorly controlled diabetes mellitus, alcohol use disorder, or cardiovascular disease, all of which are risk factors for heart failure. Likewise, risk factors for gout include chronic kidney disease, diabetes mellitus, and cardiovascular disease, all of which are also known risk factors for heart failure. Despite neither feature itself having an obvious direct relation to heart failure, on further analysis both seem likely to have a positive correlation with the disease. The BOS method has the potential to identify predictive sequences that the physician may not otherwise appreciate.

Potential Clinical Utilities of Transitive Sequences

One can envision several potential clinical uses for the transitive sequencing approach. The BOS prediction model can be applied to compute real-time CHF probabilities for all patients who do not have a diagnosis of heart failure. A healthcare provider who is caring for these patients could be given a probability of the patient having the disease based on the available sequences in the chart. This could be of particular value for practicing medicine in under-resourced settings where patients may see healthcare providers less frequently. This tool could help identify patients in a community at risk of developing a particular disease and recommend their evaluation by a healthcare provider. Such predictive algorithms could also assist the provider to consider other diagnoses. There is the potential to use this tool for a variety of different diseases.

Although this is a generic potential use case for any ML algorithms at the point of care, we showed that classification and prediction using transitive sequences have higher accuracy in computing CHF probabilities than what can be computed from raw features.

Sequences from the BOS model can be used as a medication recommender system for patients who have a history of heart failure. The model can provide real-time probabilities for different sequences of diagnoses/medications based on trajectories learned from a large cohort of heart failure patients. For example, our model identified “heart failure → metoprolol” as an important sequence marker for patients with heart failure. Beta-blockers, such as metoprolol, are a standard treatment and have been proved to reduce mortality in CHF patients with a reduced ejection fraction. The BOS model could use these particular sequences as a clinical decision support tool to suggest to providers that they should consider prescribing the medication to their heart failure patients.

Another example is the sequence “atrial fibrillation and flutter → coumarins and indandiones.” If a patient with atrial fibrillation has an elevated CHA₂DS₂-VASc score (a standardized score for assessing stroke risk) above a certain threshold, he or she should take an anti-coagulation agent. The score depends on various pre-existing conditions, one of which is CHF. The sequencing approach would recommend anti-coagulation after atrial fibrillation if the patient has a CHF record in his or her medical history. Such a clinical decision support tool could be especially useful for generating recommendations for patients with complex histories, multiple providers, and health records that span many years.
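As an illustration of how a CHF record feeds into this score, the sketch below computes CHA₂DS₂-VASc using the commonly published point assignments, which are not stated in this article and are included here as background. The function name and argument names are hypothetical, and no treatment threshold is implied.

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_or_tia, vascular_disease):
    """CHA2DS2-VASc stroke risk score from age and pre-existing conditions."""
    score = 0
    score += 1 if chf else 0                          # C: congestive heart failure
    score += 1 if hypertension else 0                 # H: hypertension
    score += 2 if age >= 75 else 1 if age >= 65 else 0  # A2 / A: age bands
    score += 1 if diabetes else 0                     # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0                # S2: prior stroke/TIA
    score += 1 if vascular_disease else 0             # VA: vascular disease
    score += 1 if female else 0                       # Sc: sex category
    return score


# The same 70-year-old male patient, with and without a CHF record.
without_chf = cha2ds2_vasc(70, False, False, False, False, False, False)
with_chf = cha2ds2_vasc(70, False, True, False, False, False, False)
```

A CHF record raises the score by one point, which is why a sequence-based tool that detects CHF in the history could change whether a patient crosses a given threshold.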

Finally, the use of classification sequences in cohort identification can have utility in large patient cohorts. They can more accurately identify appropriate patients for clinical trials, quality assessment, and biomedical research. For example, if a researcher wanted to rapidly select all patients with heart failure in a given population, this could be done with greater accuracy using transitive sequences and the BOS model.

Relationship to Recurrent Neural Networks

To some extent, this study can be viewed as a simplified reverse engineering of recurrent neural networks (RNNs).³⁸ Instead of searching for n-deep sequences, we use the shortest sequences and perform dimensionality reduction. RNNs and RNN-based models such as long short-term memory (LSTM)³⁹ and gated recurrent unit (GRU)⁴⁰ have been recently applied to clinical questions to account for time.⁴¹,⁴²,⁴³,⁴⁴ However, the challenge of applying RNN-based algorithms to EHR data is twofold. On the one hand, there is a tradeoff between accuracy and interpretability that needs to be carefully considered. More complex algorithms often result in highly accurate models but are difficult to understand.⁴⁵ Although ways of making sense of RNN-based models are being explored, real application in clinical care, where both accuracy and interpretability are critical, is still a long way away. On the other hand, discrete healthcare records in EHRs often do not precisely reflect the true health status of a patient. Expecting a set of recurrent layers to somehow make sense of data points that may or may not be reliable is naive.

Limitations and Future Work

A limitation of this work is that we did not filter disease observations by their phenotypic expression patterns. Patients' health states evolve over diverse time scales. Acute diseases such as pneumonia are more isolated spontaneous occurrences, while chronic conditions such as diabetes develop and progress over years. Acute conditions tend to have lower entropies, indicating an inherent link between the predictability of disease and their phenotypic expression pattern.⁴⁶ Scattered in EHRs are records of acute conditions, which often do not exhibit long-range patterns. Therefore, filtering acute conditions out may improve the temporally correlated predictive power.

Also, visual dashboard development was not a primary objective of this study. While our preliminary findings suggest that a visual dashboard can be useful in explaining the complex relationships to clinicians, a formal user study and further evaluations are needed to develop an interactive visual interface.

Finally, our modeling primarily focused on CHF. Further research is needed to evaluate the performance of transitive sequences for classifying/predicting other diseases and to further explore the possibilities of incorporating the sequencing approach into decision support tools at the point of care.

Conclusion

Innovative methods that enable us to properly incorporate time and understand the complexities involved in the healthcare process can yield interpretable findings from large-scale clinical databases. We found that data representations mined from sequences of EHR events are better phenotype “differentiators” and predictors than the “atemporal” EHRs that are widely used as the primary data representations in ML.

Given the rapidly increasing prevalence of EHR systems in today's practice, exploiting the temporal information in EHRs can advance medical knowledge discovery and meaningfully change clinical care by identifying and validating novel disease markers. The transitive sequencing approach presented here allows for limited expert involvement in feature engineering. However, the graphical experiment was helpful in that it resulted in refining the important sequences. Much as the genomics community catalogs and shares its sequence data, the identified sequences of medical records can be cataloged and shared on an accessible platform that would allow for the collaborative clinical use of the sequences as risk factors for diseases in many domains.

Experimental Procedures

Resource Availability

Lead Contact

Hossein Estiri, PhD. hestiri@mgh.harvard.edu.

Materials Availability

This study did not generate any new unique reagents or materials.

Data and Code Availability

Protected Health Information restrictions apply to the availability of the clinical data here, which were used under Institutional Review Board approval for use only in the current study. As a result, these datasets are not publicly available. All code used for modeling and dimensionality reduction in this study uses open-source R packages. The transitive sequencing code is available at https://github.com/hestiri/TSPM under the Mozilla Public License 2.0.

Study Design

Modeling the temporal information in clinical data can uncover other dimensions of healthcare delivery that generate signals for disease classification or prediction.³ Thus, we hypothesize that temporal sequences of electronic medical records will improve (1) computational classification of patient cohorts and (2) phenotype prediction. To test these hypotheses, we construct a set of baseline models applying the conventional approach of aggregating the frequency of medical events as features for downstream ML algorithms. We also mine a set of representations by transitive sequencing of the medication and diagnosis events in electronic medical records. We call this proposed approach the BOS approach. The study design can be characterized under representation mining, dimensionality reduction, and ML experiments, and is illustrated in Figure 1.

Data

To test the study hypotheses, we used EHR data from the Mass General Brigham Biobank. Data from all Biobank patients with at least one International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code for CHF (428.0) were included in this study. This cohort consisted of 6,857 patients consented into the Partners Biobank. To create a gold-standard dataset, a board-certified nurse reviewed the clinical notes of a random sample of these patients. The review resulted in 56 patients with gold-standard CHF labels. Specifically, for each patient in the gold-standard dataset we had curated labels that determined whether the patient truly had (or did not have) heart failure. We used data from the 56 patients for testing. For training, we used the data from the remaining cohort of 6,851 Biobank patients. Table 1 provides summary information about the testing and training data. The training data included approximate training labels (i.e., silver-standard labels) curated by validated algorithms,⁴⁷,⁴⁸ which use the distributional properties of a small number of representative features in the gold-standard population to estimate disease probabilities for all patients. The silver-standard labels were not verified by human experts but are crucial for scaling up ML training on large-scale clinical data. The use of data for this study was approved by the Mass General Brigham Institutional Review Board (2017P000282).

We used only the medication and diagnosis records. For diagnoses, we used ICD-9-CM and ICD-10-CM codes. For medications, we used RxNorm codes.

For classification, we used all data available for patients in the cohort; therefore, data from both before and after the CHF diagnosis codes were potentially used. The prediction data are a subset of the classification data, in which we only extracted medication and diagnosis records from before the first time a record of heart failure was observed in a patient's medical records. For a number of patients, the medical records began with a CHF diagnosis. Hence, the cohort size in the prediction dataset is slightly smaller. As a result of the way we defined the datasets, none of the patients remaining in the data for training prediction classifiers had a record of CHF elsewhere in their medical records. There is no specific time frame for prediction. In the prediction dataset, the next encounter recorded in the electronic medical records for a patient will include a CHF diagnosis.
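The construction of the prediction subset can be sketched as follows. This is a minimal illustration in Python, not the authors' R code: the function name `prediction_records`, the `(date, code)` tuple layout, the use of the single code 428.0 as the CHF marker, and the strict "before" comparison for same-day records are all assumptions made for the example.

```python
from datetime import date

CHF_CODE = "428.0"  # ICD-9-CM code for CHF named in the text; assumed to be the only marker here


def prediction_records(records):
    """Keep only the records dated strictly before the first CHF record.

    `records` is a list of (date, code) tuples for one patient (hypothetical layout).
    Returns None when nothing precedes the first CHF record, in which case the
    patient drops out of the prediction cohort. A patient without any CHF record
    is returned unchanged (an assumption; every cohort patient has one).
    """
    chf_dates = [d for d, code in records if code == CHF_CODE]
    if not chf_dates:
        return list(records)
    first_chf = min(chf_dates)
    before = [(d, code) for d, code in records if d < first_chf]
    return before or None


history = [
    (date(2010, 1, 1), "401.9"),
    (date(2011, 5, 2), "428.0"),
    (date(2012, 1, 1), "250.00"),
]
print(prediction_records(history))
```

A patient whose history starts with the CHF code yields `None`, which mirrors why the prediction cohort is slightly smaller than the classification cohort.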

Representation Mining

To test the study hypotheses that transitive sequences of electronic medical records can improve computational classification of patient cohorts and phenotype prediction, we mined two vectors of data representations. The study design is illustrated in Figure 1. We first constructed a baseline method that applies the conventional approach for using EHR observations as features for phenotype classification and prediction. We henceforth call this the AVR approach. The frequency of all medical events is counted for each patient in the EHR data, and the patient is represented by a vector of the length equal to the number of unique events in her or his medical records.

Given a list $O_{1}, O_{2}, \dots, O_{n}$ of diagnosis or medication observations, for each patient $p$ we have the times $t_{i 1}^{p} \leq t_{i 2}^{p} \leq \dots \leq t_{i k_{i}^{p}}^{p}$ at which the observation $O_{i}$ was recorded (we allow $k_{i}^{p} = 0$, in which case observation $O_{i}$ was never recorded for patient $p$).

AVR Representations

In the AVR approach, our features are the possible observations, and for each patient $p$ we record only the numbers $k_{1}^{p}, k_{2}^{p}, \dots, k_{n}^{p}$ of records of each observation. For each $i$, we think of the $k_{i}^{p}$'s as samples of a random variable $X_{i}$. Our goal is then to predict the class label $Y$, given $X_{1}, X_{2}, \dots, X_{n}$.
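As a small illustration of the aggregated vector, the sketch below counts how often each observation appears in one patient's record. It is written in Python for illustration only; the function name `avr_vector`, the example codes, and the fixed `vocabulary` ordering are assumptions, not part of the original study code.

```python
from collections import Counter


def avr_vector(patient_codes, vocabulary):
    """Return the counts k_1, ..., k_n of each observation for one patient."""
    counts = Counter(patient_codes)
    return [counts[code] for code in vocabulary]


# Hypothetical vocabulary of n = 3 observations and one patient's record stream.
vocabulary = ["401.9", "250.00", "RX:furosemide"]
patient = ["401.9", "RX:furosemide", "401.9"]
print(avr_vector(patient, vocabulary))  # [2, 0, 1]
```

Timing plays no role here: any reordering of the patient's records gives the same vector, which is the sense in which the representation is "atemporal".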

We also mined a set of representations by transitive sequencing of the medication and diagnosis events in electronic medical records. In this approach, the patient is represented by a vector of the length equal to the number of sequences in her/his medical records. We call this proposed approach the BOS approach.

BOS Representations

In the BOS approach, the features are all possible pairs of distinct observations $(O_{i}, O_{j})$, $i \neq j$. For a fixed patient $p$ and $i \neq j$ with $i, j \leq n$, we set $r_{i j p}$ to be 1 if $k_{i}^{p} \geq 1$, $k_{j}^{p} \geq 1$, and $t_{i 1}^{p} \leq t_{j 1}^{p}$, and 0 otherwise. In words, $r_{i j p}$ is 1 if and only if both $O_{i}$ and $O_{j}$ were recorded for the patient, and the first record of observation $i$ was before, or at the same time as, the first record of observation $j$. Again, for each fixed $i \neq j$, we think of the $r_{i j p}$'s as samples of a random variable $X_{i j}$. Our goal is then to predict the class label $Y$ given ${(X_{i j})}_{i \neq j}$.
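The definition of $r_{i j p}$ can be sketched directly from first-occurrence times. The following Python snippet is an illustrative reading of the definition, not the published TSPM implementation; the names `first_times` and `transitive_sequences` and the `(time, observation)` record layout are assumptions.

```python
from itertools import permutations


def first_times(records):
    """Map each observation to the time of its first record (t_{i1})."""
    first = {}
    for t, obs in records:
        if obs not in first or t < first[obs]:
            first[obs] = t
    return first


def transitive_sequences(records):
    """Return the ordered pairs (O_i, O_j), i != j, for which r_ijp = 1."""
    first = first_times(records)
    return {(a, b) for a, b in permutations(first, 2) if first[a] <= first[b]}


# One hypothetical patient: A at time 1, B at 2, A again at 3, C at 4.
records = [(1, "A"), (2, "B"), (3, "A"), (4, "C")]
print(sorted(transitive_sequences(records)))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Because the comparison is non-strict, two observations first recorded at the same time produce both orderings, which follows the "before, or at the same time as" wording of the definition.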

The use of the first record (rather than all records) is a major difference in the way we mined sequences compared with SPM routines in the general data mining community. We felt that this choice better reflects the actual patient state and handles repeated problem list entries, for two reasons. First, diagnosis records are generally carried forward in patients' medical records (often known as problem list entries) through all succeeding encounters, making the first occurrence the true start of a sequential pattern. Second, the high-resolution timing of repeated diagnosis records reflects an implementation detail of the clinical system rather than clinically meaningful evidence of the patient's medical history.

It is important to emphasize that we call the sequential pairs in the BOS approach transitive sequences, as they embody distinctive modifications to the conventional SPM. Imagine a sequential pattern where observation $A$ happened directly before $B$ , and $B$ happened directly before $C$ ( $A$ → $B$ → $C$ ). SPM mines subsequences $A$ → $B$ and $B$ → $C$ . To account for the potential biases in EHRs, the transitive sequencing algorithm mines subsequences $A^{'}$ → $B^{'}$ , $B^{'}$ → $C^{'}$ , but also $A^{'}$ → $C^{'}$ from the sequence $A^{'}$ → $B^{'}$ → $C^{'}$ , where $A^{'}$ , $B^{'}$ , and $C^{'}$ are the first records of $A$ , $B$ , $C$ in the medical records. To evaluate the performance difference between the BOS approach and the SPM, we also mined the SPM sequences and used the most frequent sequences for classification.
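The contrast in the $A$ → $B$ → $C$ example can be made concrete with a short Python sketch. It follows the simplified description above, in which SPM is represented by directly adjacent pairs; real SPM algorithms are more general, and the helper names and the event list are hypothetical. Ties in time are ignored here for brevity.

```python
from itertools import combinations


def first_occurrence_order(events):
    """Reduce a time-ordered event list to the order of first records (A', B', C')."""
    seen = []
    for event in events:
        if event not in seen:
            seen.append(event)
    return seen


def adjacent_pairs(ordered):
    """Directly adjacent pairs only, standing in for the SPM subsequences in the text."""
    return list(zip(ordered, ordered[1:]))


def transitive_pairs(ordered):
    """All ordered pairs in which the first element precedes the second."""
    return list(combinations(ordered, 2))


events = ["A", "B", "A", "C", "B"]  # repeated records of A and B are collapsed
order = first_occurrence_order(events)
print(adjacent_pairs(order))    # [('A', 'B'), ('B', 'C')]
print(transitive_pairs(order))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The extra pair `('A', 'C')` is the transitive sequence that adjacency-based mining would not produce from this record.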

Dimensionality Reduction

We applied a form of entropy-based temporal representation mining of discrete events from clinical data, which deviates from the traditional SPM and TA approaches that use frequency-based criteria for selecting subsequences. If all pairs of sequences in the BOS approach exist, there will be exactly $n (n - 1)$ ordered pairs $(i, j)$ with $i \neq j$ and $i, j \leq n$. Thus, the number of sequential features is roughly quadratic in the number of observations. As demonstrated in Results, the sequence mining resulted in an explosion of sequences and therefore left us with a high-dimensional vector of representations. To both the BOS and AVR representations, we applied the MSMR formal dimensionality reduction procedure.

To minimize sparsity, we removed any feature with a prevalence smaller than 1%. On the remaining features, we computed the empirical mutual information using an estimate of the entropy of the empirical probability distribution.⁴⁹,⁵⁰ Mutual information measures the mutual dependence between two random variables and, unlike most correlation measures, can capture non-linear relationships.⁵⁰,⁵¹ We ranked the data representations based on their mutual information with the labeled outcome (ties were broken by prevalence) and conventionally selected the top 3,000 representations from the AVR and BOS approaches.
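The two filtering steps, a prevalence threshold followed by a mutual information ranking, might look like the Python sketch below. It uses a plain plug-in estimate of mutual information (in nats) on discrete features, which is an assumption: the study relied on R packages and the estimator described in the cited references. The function names, the dictionary-of-columns layout, and the toy data are hypothetical.

```python
import math
from collections import Counter


def prevalence(column):
    """Fraction of patients for whom the feature is present (non-zero)."""
    return sum(1 for v in column if v) / len(column)


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) from paired discrete samples, in nats."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def rank_features(features, labels, min_prevalence=0.01, top_k=3000):
    """Drop sparse features, then rank by MI with the label (ties by prevalence)."""
    kept = {k: col for k, col in features.items() if prevalence(col) >= min_prevalence}
    ranked = sorted(
        kept,
        key=lambda k: (mutual_information(kept[k], labels), prevalence(kept[k])),
        reverse=True,
    )
    return ranked[:top_k]


labels = [1, 1, 0, 0]
features = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0], "c": [0, 0, 0, 0]}
print(rank_features(features, labels, top_k=2))  # ['a', 'b']
```

Feature `c` never occurs, so it is removed by the prevalence filter before any mutual information is computed; feature `a` matches the label exactly and ranks first.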

We further scrutinized relevance through random forests (RF)⁵² using the MDG, also known as Gini importance, for variable importance. The Gini importance measures the gain in node purity from splitting on a variable.⁵³ A variable's MDG is a forest-wide weighted average of the decrease in the Gini impurity metric resulting from splitting on the variable across all of the individual trees that make up the forest.⁵⁴ A higher MDG indicates higher variable importance. At the end of this step, we ranked features by their median MDGs and conveniently curated a feature set for each approach containing the top 200 features. We also combined the two feature vectors prior to the MDG computation step and computed MDGs for the combined data representations as a hybrid approach (AVR-BOS). At the conclusion of the MSMR procedure, we had curated three feature sets containing the top 200 AVR, BOS, and AVR-BOS representations. We also added the top 200 frequent sequences to represent the conventional frequent SPM approach.
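To make the quantity behind the MDG concrete, the Python sketch below computes the decrease in Gini impurity for a single split on one binary feature. This is only the building block: a random forest averages such decreases over every split on the variable in every tree, which is not reproduced here. The function names and toy data are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(feature, labels):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [y for x, y in zip(feature, labels) if not x]
    right = [y for x, y in zip(feature, labels) if x]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children


labels = [1, 1, 0, 0]
print(gini_decrease([1, 1, 0, 0], labels))  # 0.5, the split separates the classes
print(gini_decrease([1, 0, 1, 0], labels))  # 0.0, the split is uninformative
```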

Training Classification and Prediction Classifiers

We trained four different classifier algorithms on each vector of data representations using bootstrap cross-validation: (1) logistic regression with L1 regularization; (2) Bayesian generalized linear model; (3) model averaged neural network; and (4) support vector machines with class weights.⁵⁵ For SPM we only trained the regularized logistic regression (L1) classifier. All variables were scaled and centered. All classifiers were trained and tested on the 6,851-patient data. All performance metrics were computed using the 56-patient test set. Furthermore, we iterated the training process ten times with bootstrap sampling and used the median performance metrics for comparing the approaches. Overall, for each of the approaches, we trained 40 classifiers (4 algorithms × 10 bootstrap sampling iterations).
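The repeated bootstrap scheme can be outlined as follows. This Python sketch only shows the resampling and the median aggregation; `fit_and_score` is a hypothetical callback that would train one classifier on the resampled rows and return its performance metric, and the seed and function names are likewise assumptions rather than details from the study.

```python
import random
import statistics


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def median_bootstrap_metric(n_train, fit_and_score, iterations=10, seed=0):
    """Run `iterations` bootstrap fits and return the median of their scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        idx = bootstrap_indices(n_train, rng)
        scores.append(fit_and_score(idx))
    return statistics.median(scores)


# Counts stated in the text: 4 algorithms x 10 iterations per approach.
algorithms, iterations = 4, 10
print(algorithms * iterations)  # 40
```

Using the median rather than the mean keeps a single unlucky resample from dominating the comparison between approaches.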

Evaluation

To evaluate our two hypotheses, we compared classifier performances using the AUC-ROC curve. We applied non-parametric and post hoc tests for comparing and ranking the classification performances across the 120 classifiers (40 classifiers × 3 approaches). The goal was to evaluate whether the AUC-ROC values would provide enough statistical evidence that the classifiers have different performances.
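For readers unfamiliar with the metric, the AUC-ROC can be computed as the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one, counting ties as one half. The Python sketch below implements that pairwise definition; it is an illustration under that standard definition, not the evaluation code used in the study, and the scores shown are made up.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


print(auc_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc_roc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```

The quadratic loop is adequate for a 56-patient test set; a rank-based formulation would be preferable for large cohorts.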

Finally, we developed an interactive graphical dashboard to evaluate the important sequential patterns discovered through the dimensionality reduction process. The dashboard provided different visualization and table views of the top 200 transitive sequences using the RStudio Shiny platform. The dashboard design involved an iterative process with a small number of physicians in our group to adjust the visualizations and tables and add new ones to facilitate their interpretation of the sequences. Using the graphical dashboard, the physicians evaluated the top 200 BOS transitive sequences for their clinical significance. Specific sequences that were identified as having clinical meaning were evaluated further based on their accuracy for identifying patients with heart failure.

Acknowledgments

This work was funded through the U.S. National Human Genome Research Institute grant R01-HG009174.

Author Contributions

Conceptualization, H.E., T.H.M., and S.N.M.; Methodology, H.E.; Formal Analysis, H.E.; Investigation, H.E., S.N.M., and Z.H.S.; Writing – Original Draft, H.E. and Z.H.S.; Writing – Review & Editing, H.E., Z.H.S., J.G.K., K.B.W., and S.V.; Visualization, H.E.; Data Curation, V.M.C. and M.E.M.; Funding Acquisition, S.N.M.

Declaration of Interests

The authors declare no competing interests.

Published: June 18, 2020

Footnotes

Supplemental Information can be found online at https://doi.org/10.1016/j.patter.2020.100051.

Supplemental Information

Table S1. Cross-Validation Classification and Prediction Performances across Approaches and Classifiers

mmc1.pdf (89.6 KB, pdf)

Document S2. Article plus Supplemental Information

mmc2.pdf (2.1 MB, pdf)

References

1. Hripcsak G., Albers D.J., Perotte A. Exploiting time in electronic health record correlations. J. Am. Med. Inform. Assoc. 2011;18(Suppl 1):i109–i115. doi: 10.1136/amiajnl-2011-000463.
2. Hripcsak G., Albers D.J. Next-generation phenotyping of electronic health records. J. Am. Med. Inform. Assoc. 2013;20:117–121. doi: 10.1136/amiajnl-2012-001145.
3. Agniel D., Kohane I.S., Weber G.M. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361 doi: 10.1136/bmj.k1479.
4. Moskovitch R., Choi H., Hripcsak G., Tatonetti N. Prognosis of clinical outcomes with temporal patterns and experiences with one class feature selection. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017;14:555–563. doi: 10.1109/TCBB.2016.2591539.
5. Dahlem D., Maniloff D., Ratti C. Predictability bounds of electronic health records. Sci. Rep. 2015;5:11865. doi: 10.1038/srep11865.
6. Madkour M., Benhaddou D., Tao C. Temporal data representation, normalization, extraction, and reasoning: a review from clinical domain. Comput. Methods Programs Biomed. 2016;128:52–68. doi: 10.1016/j.cmpb.2016.02.007.
7. Batal I., Valizadegan H., Cooper G.F., Hauskrecht M. A temporal pattern mining approach for classifying electronic health record data. ACM Trans. Intell. Syst. Technol. 2013;4 doi: 10.1145/2508037.2508044.
8. Liu, Z., Wu, L., and Hauskrecht, M. (2013). Modeling clinical time series using Gaussian process sequences. In Proceedings of the 2013 SIAM International Conference on Data Mining.
9. Albers D.J., Hripcsak G. Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series. Chaos Solitons Fractals. 2012;45:853–860. doi: 10.1016/j.chaos.2012.03.003.
10. Zhao J., Papapetrou P., Asker L., Boström H. Learning from heterogeneous temporal data in electronic health records. J. Biomed. Inform. 2017;65:105–119. doi: 10.1016/j.jbi.2016.11.006.
11. Pivovarov R., Elhadad N. Automated methods for the summarization of electronic health records. J. Am. Med. Inform. Assoc. 2015;22:938–947. doi: 10.1093/jamia/ocv032.
12. Stacey M., McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif. Intell. Med. 2007;39:1–24. doi: 10.1016/j.artmed.2006.08.002.
13. Orphanou K., Stassopoulou A., Keravnou E. Temporal abstraction and temporal Bayesian networks in clinical domains: a survey. Artif. Intell. Med. 2014;60:133–149. doi: 10.1016/j.artmed.2013.12.007.
14. Shahar Y., Musen M.A. RÉSUMÉ: a temporal-abstraction system for patient monitoring. Comput. Biomed. Res. 1993;26:255–273. doi: 10.1006/cbmr.1993.1018.
15. Shahar Y., Musen M.A. Knowledge-based temporal abstraction in clinical domains. Artif. Intell. Med. 1996;8:267–298. doi: 10.1016/0933-3657(95)00036-4.
16. Moskovitch R., Shahar Y. Classification-driven temporal discretization of multivariate time series. Data Min. Knowl. Discov. 2015;29:871–913.
17. Moskovitch R., Walsh C., Hripcsak G., Tatonetti N. Prediction of biomedical events via time intervals mining. J. Biomed. Inform. 2014;75:70–82. doi: 10.1016/j.jbi.2017.07.018.
18. Batal I., Cooper G., Fradkin D., Harrison J., Moerchen F., Hauskrecht M. An efficient pattern mining approach for event detection in multivariate temporal data. Knowl. Inf. Syst. 2016;46:115–150. doi: 10.1007/s10115-015-0819-6.
19. Shknevsky A., Shahar Y., Moskovitch R. Consistent discovery of frequent interval-based temporal patterns in chronic patients’ data. J. Biomed. Inform. 2017;75:83–95. doi: 10.1016/j.jbi.2017.10.002.
20. Bellazzi R., Larizza C., Riva A. Temporal abstractions for interpreting diabetic patients monitoring data. Intell. Data Anal. 1998;2:97–122.
21. Bellazzi R., Larizza C., Magni P., Montani S., Stefanelli M. Intelligent analysis of clinical time series: an application in the diabetes mellitus domain. Artif. Intell. Med. 2000;20:37–57. doi: 10.1016/s0933-3657(00)00052-x.
22. Agrawal R., Srikant R., Others Mining sequential patterns. icde. 1995;95:3–14.
23. Fournier-Viger P., Lin J.C.-W., Kiran R.U., Koh Y.S., Thomas R. A survey of sequential pattern mining. Data Sci. Pattern Recognit. 2017;1:54–77.
24. Truong-Chi T., Fournier-Viger P. A survey of high utility sequential pattern mining. In: Fournier-Viger P., Lin J.C.-W., Nkambou R., Vo B., Tseng V., editors. High-Utility Pattern Mining. 2019. pp. 97–129.
25. Mabroukeh N.R., Ezeife C.I. A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv. 2010;43 doi: 10.1145/1824795.1824798.
26. Berlingerio, M., Bonchi, F., Giannotti, F., and Turini, F. (2007). Mining clinical data with a temporal dimension: a case study. 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007), 429–436.
27. Moskovitch R., Shahar Y. Classification of multivariate time series via temporal abstraction and time intervals mining. Knowl. Inf. Syst. 2015;45:35–74.
28. Batal I., Sacchi L., Bellazzi R., Hauskrecht M. A temporal abstraction framework for classifying clinical temporal data. AMIA Annu. Symp. Proc. 2009;2009:29–33.
29. Moskovitch R., Polubriaginof F., Weiss A., Ryan P., Tatonetti N. Procedure prediction from symbolic Electronic Health Records via time intervals analytics. J. Biomed. Inform. 2017;75:70–82. doi: 10.1016/j.jbi.2017.07.018.
30. Garofalakis M.N., Rastogi R., Shim K. SPIRIT: sequential pattern mining with regular expression constraints. VLDB. 1999;99:7–10.
31. Zaki M.J. Parallel sequence mining on shared-memory machines. J. Parallel Distrib. Comput. 2001;61:401–426.
32. Ayres, J., Flannick, J., Gehrke, J., and Yiu, T. (2002). Sequential PAttern mining using a bitmap representation. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 429–435.
33. Liu, C., Wang, F., Hu, J., and Xiong, H. (2015). Temporal phenotyping from longitudinal electronic health records. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). doi: 10.1145/2783258.2783352.
34. Guo S., Li X., Liu H., Zhang P., Du X., Xie G., Wang F. Integrating temporal pattern mining in ischemic stroke prediction and treatment pathway discovery for atrial fibrillation. AMIA Jt. Summits Transl. Sci. Proc. 2017;2017:122–130.
35. Wu J., Roy J., Stewart W.F. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med. Care. 2010;48:S106–S113. doi: 10.1097/MLR.0b013e3181de9e17.
36. Miotto R., Li L., Kidd B.A., Dudley J.T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 2016;6 doi: 10.1038/srep26094.
37. Shah S.J., Katz D.H., Selvaraj S., Burke M.A., Yancy C.W., Gheorghiade M., Bonow R.O., Huang C.-C., Deo R.C. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation. 2015;131:269–279. doi: 10.1161/CIRCULATIONAHA.114.010637.
38. Rumelhart D.E., Hinton G.E., Williams R.J. Learning representations by back-propagating errors. Nature. 1986;323:533–536.
39. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735.
40. Cho K., van Merrienboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv. 2014 1406.1078.
41. Wu S., Liu S., Sohn S., Moon S., Wi C.-I., Juhn Y., Liu H. Modeling asynchronous event sequences with RNNs. J. Biomed. Inform. 2018;83:167–177. doi: 10.1016/j.jbi.2018.05.016.
42. Lipton Z.C., Kale D.C., Elkan C., Wetzel R.C. Learning to diagnose with LSTM recurrent neural networks. arXiv. 2015 1511.03677.
43. Pham T., Tran T., Phung D., Venkatesh S. Predicting healthcare trajectories from medical records: a deep learning approach. J. Biomed. Inform. 2017;69:218–229. doi: 10.1016/j.jbi.2017.04.001.
44. Choi, E., Bahadori, M.T., Song, L., Stewart, W.F., and Sun, J. (2017). GRAM: graph-based attention model for healthcare representation learning. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 787–795.
45. Choi E., Bahadori M.T., Sun J., Kulas J., Schuetz A., Stewart W. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Lee D.D., Sugiyama M., V Luxburg U., Guyon I., Garnett R., editors. Vol. 29. Curran Associates, Inc.; 2016. pp. 3504–3512. (Advances in Neural Information Processing Systems).
46. Perotte A., Hripcsak G. Temporal properties of diagnosis code time series in aggregate. IEEE J. Biomed. Heal. Inform. 2013;17:477–483. doi: 10.1109/JBHI.2013.2244610.
47. Yu S., Ma Y., Gronsbell J., Cai T., Ananthakrishnan A.N., Gainer V.S., Churchill S.E., Szolovits P., Murphy S.N., Kohane I.S. Enabling phenotypic big data with PheNorm. J. Am. Med. Inform. Assoc. 2018;25:54–60. doi: 10.1093/jamia/ocx111.
48. Wagholikar K.B., Estiri H., Murphy M., Murphy S.N. Polar labeling: silver standard algorithm for training disease classifiers. Bioinformatics. 2020;36:3200–3206. doi: 10.1093/bioinformatics/btaa088.
49. Meyer P.E. Univ. Libr. Bruxelles; 2008. Information-Theoretic Variable Selection and Network Inference from Microarray Data. Ph. D. Thesis.
50. Cover T.M., Thomas J.A. John Wiley & Sons; 2012. Elements of Information Theory.
51. Paninski L. Estimation of entropy and mutual information. Neural Comput. 2003;15:1191–1253.
52. Breiman L. Random forests. Mach. Learn. 2001;45:5–32.
53. Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable importances in forests of randomized trees. 26th International Conference on Advances in Neural Information Processing Systems (NIPS 2013), 431–439.
54. Menze B.H., Kelm B.M., Masuch R., Himmelreich U., Bachert P., Petrich W., Hamprecht F.A. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics. 2009;10:213. doi: 10.1186/1471-2105-10-213.
55. Batuwita R., Palade V. Class imbalance learning methods for support vector machines. In: He H., Ma Y., editors. Imbalanced Learning: Foundations, Algorithms, and Applications. IEEE/Wiley; 2013. pp. 83–99.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Table S1. Cross-Validation Classification and Prediction Performances across Approaches and Classifiers

mmc1.pdf^{(89.6KB, pdf)}

Document S2. Article plus Supplemental Information

mmc2.pdf^{(2.1MB, pdf)}

Data Availability Statement

[bib1] 1.Hripcsak G., Albers D.J., Perotte A. Exploiting time in electronic health record correlations. J. Am. Med. Inform. Assoc. 2011;18(Suppl 1):i109–i115. doi: 10.1136/amiajnl-2011-000463. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Hripcsak G., Albers D.J. Next-generation phenotyping of electronic health records. J. Am. Med. Inform. Assoc. 2013;20:117–121. doi: 10.1136/amiajnl-2012-001145. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Agniel D., Kohane I.S., Weber G.M. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361 doi: 10.1136/bmj.k1479. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Moskovitch R., Choi H., Hripcsak G., Tatonetti N. Prognosis of clinical outcomes with temporal patterns and experiences with one class feature selection. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017;14:555–563. doi: 10.1109/TCBB.2016.2591539. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Dahlem D., Maniloff D., Ratti C. Predictability bounds of electronic health records. Sci. Rep. 2015;5:11865. doi: 10.1038/srep11865. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Madkour M., Benhaddou D., Tao C. Temporal data representation, normalization, extraction, and reasoning: a review from clinical domain. Comput. Methods Programs Biomed. 2016;128:52–68. doi: 10.1016/j.cmpb.2016.02.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Batal I., Valizadegan H., Cooper G.F., Hauskrecht M. A temporal pattern mining approach for classifying electronic health record data. ACM Trans. Intell. Syst. Technol. 2013;4 doi: 10.1145/2508037.2508044. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Liu, Z., Wu, L., and Hauskrecht, M. (2013). Modeling clinical time series using Gaussian process sequences. In Proceedings of the 2013 SIAM International Conference on Data Mining.

[bib9] 9.Albers D.J., Hripcsak G. Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series. Chaos Solitons Fractals. 2012;45:853–860. doi: 10.1016/j.chaos.2012.03.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Zhao J., Papapetrou P., Asker L., Boström H. Learning from heterogeneous temporal data in electronic health records. J. Biomed. Inform. 2017;65:105–119. doi: 10.1016/j.jbi.2016.11.006. [DOI] [PubMed] [Google Scholar]

[bib11] 11.Pivovarov R., Elhadad N. Automated methods for the summarization of electronic health records. J. Am. Med. Inform. Assoc. 2015;22:938–947. doi: 10.1093/jamia/ocv032. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Stacey M., McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif. Intell. Med. 2007;39:1–24. doi: 10.1016/j.artmed.2006.08.002. [DOI] [PubMed] [Google Scholar]

[bib13] 13.Orphanou K., Stassopoulou A., Keravnou E. Temporal abstraction and temporal Bayesian networks in clinical domains: a survey. Artif. Intell. Med. 2014;60:133–149. doi: 10.1016/j.artmed.2013.12.007. [DOI] [PubMed] [Google Scholar]

[bib14] 14.Shahar Y., Musen M.A. RÉSUMÉ: a temporal-abstraction system for patient monitoring. Comput. Biomed. Res. 1993;26:255–273. doi: 10.1006/cbmr.1993.1018. [DOI] [PubMed] [Google Scholar]

[bib15] 15.Shahar Y., Musen M.A. Knowledge-based temporal abstraction in clinical domains. Artif. Intell. Med. 1996;8:267–298. doi: 10.1016/0933-3657(95)00036-4. [DOI] [PubMed] [Google Scholar]

[bib16] 16.Moskovitch R., Shahar Y. Classification-driven temporal discretization of multivariate time series. Data Min. Knowl. Discov. 2015;29:871–913. [Google Scholar]

[bib17] 17.Moskovitch R., Walsh C., Hripsack G., Tatonetti N. Prediction of biomedical events via time intervals mining. J. Biomed. Inform. 2014;75:70–82. doi: 10.1016/j.jbi.2017.07.018. [DOI] [PubMed] [Google Scholar]

[bib18] 18.Batal I., Cooper G., Fradkin D., Harrison J., Moerchen F., Hauskrecht M. An efficient pattern mining approach for event detection in multivariate temporal data. Knowl. Inf. Syst. 2016;46:115–150. doi: 10.1007/s10115-015-0819-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Shknevsky A., Shahar Y., Moskovitch R. Consistent discovery of frequent interval-based temporal patterns in chronic patients’ data. J. Biomed. Inform. 2017;75:83–95. doi: 10.1016/j.jbi.2017.10.002. [DOI] [PubMed] [Google Scholar]

[bib20] 20.Bellazzi R., Larizza C., Riva A. Temporal abstractions for interpreting diabetic patients monitoring data. Intell. Data Anal. 1998;2:97–122. [Google Scholar]

[bib21] 21.Bellazzi R., Larizza C., Magni P., Montani S., Stefanelli M. Intelligent analysis of clinical time series: an application in the diabetes mellitus domain. Artif. Intell. Med. 2000;20:37–57. doi: 10.1016/s0933-3657(00)00052-x. [DOI] [PubMed] [Google Scholar]

[bib22] 22.Agrawal R., Srikant R., Others Mining sequential patterns. icde. 1995;95:3–14. [Google Scholar]

[bib23] 23.Fournier-Viger P., Lin J.C.-W., Kiran R.U., Koh Y.S., Thomas R. A survey of sequential pattern mining. Data Sci. Pattern Recognit. 2017;1:54–77. [Google Scholar]

[bib24] 24.Truong-Chi T., Fournier-Viger P. A survey of high utility sequential pattern mining. In: Fournier-Viger P., Lin J.C.-W., Nkambou R., Vo B., Tseng V., editors. High-Utility Pattern Mining. 2019. pp. 97–129. [Google Scholar]

[bib25] 25.Mabroukeh N.R., Ezeife C.I. A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv. 2010;43 doi: 10.1145/1824795.1824798. [DOI] [Google Scholar]

[bib26] 26.Berlingerio, M., Bonchi, F., Giannotti, F., and Turini, F. (2007). Mining clinical data with a temporal dimension: a case study 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007), 429–436.

[bib27] 27.Moskovitch R., Shahar Y. Classification of multivariate time series via temporal abstraction and time intervals mining. Knowl. Inf. Syst. 2015;45:35–74.

[bib28] 28.Batal I., Sacchi L., Bellazzi R., Hauskrecht M. A temporal abstraction framework for classifying clinical temporal data. AMIA Annu. Symp. Proc. 2009;2009:29–33.

[bib29] 29.Moskovitch R., Polubriaginof F., Weiss A., Ryan P., Tatonetti N. Procedure prediction from symbolic Electronic Health Records via time intervals analytics. J. Biomed. Inform. 2017;75:70–82. doi: 10.1016/j.jbi.2017.07.018.

[bib30] 30.Garofalakis M.N., Rastogi R., Shim K. SPIRIT: sequential pattern mining with regular expression constraints. VLDB. 1999;99:7–10.

[bib31] 31.Zaki M.J. Parallel sequence mining on shared-memory machines. J. Parallel Distrib. Comput. 2001;61:401–426.

[bib32] 32.Ayres, J., Flannick, J., Gehrke, J., and Yiu, T. (2002). Sequential PAttern mining using a bitmap representation. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 429–435.

[bib33] 33.Liu, C., Wang, F., Hu, J., and Xiong, H. (2015). Temporal phenotyping from longitudinal electronic health records. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). 10.1145/2783258.2783352.

[bib34] 34.Guo S., Li X., Liu H., Zhang P., Du X., Xie G., Wang F. Integrating temporal pattern mining in ischemic stroke prediction and treatment pathway discovery for atrial fibrillation. AMIA Jt. Summits Transl. Sci. Proc. 2017;2017:122–130.

[bib35] 35.Wu J., Roy J., Stewart W.F. Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med. Care. 2010;48:S106–S113. doi: 10.1097/MLR.0b013e3181de9e17.

[bib36] 36.Miotto R., Li L., Kidd B.A., Dudley J.T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 2016;6. doi: 10.1038/srep26094.

[bib37] 37.Shah S.J., Katz D.H., Selvaraj S., Burke M.A., Yancy C.W., Gheorghiade M., Bonow R.O., Huang C.-C., Deo R.C. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation. 2015;131:269–279. doi: 10.1161/CIRCULATIONAHA.114.010637.

[bib38] 38.Rumelhart D.E., Hinton G.E., Williams R.J. Learning representations by back-propagating errors. Nature. 1986;323:533–536.

[bib39] 39.Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735.

[bib40] 40.Cho K., van Merrienboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv. 2014 1406.1078.

[bib41] 41.Wu S., Liu S., Sohn S., Moon S., Wi C.-I., Juhn Y., Liu H. Modeling asynchronous event sequences with RNNs. J. Biomed. Inform. 2018;83:167–177. doi: 10.1016/j.jbi.2018.05.016.

[bib42] 42.Lipton Z.C., Kale D.C., Elkan C., Wetzel R.C. Learning to diagnose with LSTM recurrent neural networks. arXiv. 2015 1511.03677.

[bib43] 43.Pham T., Tran T., Phung D., Venkatesh S. Predicting healthcare trajectories from medical records: a deep learning approach. J. Biomed. Inform. 2017;69:218–229. doi: 10.1016/j.jbi.2017.04.001.

[bib44] 44.Choi, E., Bahadori, M.T., Song, L., Stewart, W.F., and Sun, J. (2017). GRAM: graph-based attention model for healthcare representation learning. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 787–795.

[bib45] 45.Choi E., Bahadori M.T., Sun J., Kulas J., Schuetz A., Stewart W. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Lee D.D., Sugiyama M., von Luxburg U., Guyon I., Garnett R., editors. Vol. 29. Curran Associates, Inc.; 2016. pp. 3504–3512. (Advances in Neural Information Processing Systems).

[bib46] 46.Perotte A., Hripcsak G. Temporal properties of diagnosis code time series in aggregate. IEEE J. Biomed. Heal. Inform. 2013;17:477–483. doi: 10.1109/JBHI.2013.2244610.

[bib47] 47.Yu S., Ma Y., Gronsbell J., Cai T., Ananthakrishnan A.N., Gainer V.S., Churchill S.E., Szolovits P., Murphy S.N., Kohane I.S. Enabling phenotypic big data with PheNorm. J. Am. Med. Inform. Assoc. 2018;25:54–60. doi: 10.1093/jamia/ocx111.

[bib48] 48.Wagholikar K.B., Estiri H., Murphy M., Murphy S.N. Polar labeling: silver standard algorithm for training disease classifiers. Bioinformatics. 2020;36:3200–3206. doi: 10.1093/bioinformatics/btaa088.

[bib49] 49.Meyer P.E. Univ. Libr. Bruxelles; 2008. Information-Theoretic Variable Selection and Network Inference from Microarray Data. Ph.D. Thesis.

[bib50] 50.Cover T.M., Thomas J.A. John Wiley & Sons; 2012. Elements of Information Theory.

[bib51] 51.Paninski L. Estimation of entropy and mutual information. Neural Comput. 2003;15:1191–1253.

[bib52] 52.Breiman L. Random forests. Mach. Learn. 2001;45:5–32.

[bib53] 53.Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable importances in forests of randomized trees. 26th International Conference on Advances in Neural Information Processing Systems (NIPS 2013) 431–439

[bib54] 54.Menze B.H., Kelm B.M., Masuch R., Himmelreich U., Bachert P., Petrich W., Hamprecht F.A. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics. 2009;10:213. doi: 10.1186/1471-2105-10-213.

[bib55] 55.Batuwita R., Palade V. Class imbalance learning methods for support vector machines. In: He H., Ma Y., editors. Imbalanced Learning: Foundations, Algorithms, and Applications. IEEE/Wiley; 2013. pp. 83–99.
