Author manuscript; available in PMC: 2022 Sep 28.
Published in final edited form as: Int J Med Inform. 2020 Dec 16;147:104368. doi: 10.1016/j.ijmedinf.2020.104368

Using Multi-Step, Multivariate Long Short-Term Memory Neural Network to Detect Aberrant Signals in Health Data for Quality Assurance

Seyed M Miran 1, Stuart J Nelson 1, Doug Redd 1, Qing Zeng-Treitler 1
PMCID: PMC9518650  NIHMSID: NIHMS1657574  PMID: 33401168

Abstract

Background:

The data quality of electronic health records (EHR) has been a topic of increasing interest to clinical and health services researchers. One indicator of possible errors in data is a large change in the frequency of observations in chronic illnesses. In this study, we built and demonstrated the utility of a stacked multivariate LSTM model to predict an acceptable range for the frequency of observations.

Methods:

We applied the LSTM approach to a large EHR dataset with over 400 million total encounters. We computed sensitivity and specificity for predicting if the frequency of an observation in a given week is an aberrant signal.

Results:

Compared with the simple frequency monitoring approach, our proposed multivariate LSTM approach increased the sensitivity of finding aberrant signals in 6 randomly selected diagnostic codes from 75% to 88% and the specificity from 68% to 91%. We also experimented with two different LSTM algorithms, namely direct multi-step and recursive multi-step. Both models were able to detect the aberrant signals, with the recursive multi-step algorithm performing better.

Conclusions:

Simply monitoring the frequency trend, as is the common practice in systems that do monitor data quality, would not be able to distinguish between fluctuations caused by seasonal disease changes, seasonal patient visit patterns, and changes in data sources. Our study demonstrated the ability of stacked multivariate LSTM models to recognize true data quality issues rather than fluctuations caused by other factors, including seasonal changes and outbreaks.

Keywords: Electronic Health Records, Health Data Quality, LSTM models

1. Introduction

Information from electronic health records (EHR) has become a crucial part of clinical research and healthcare quality improvement [1, 2]. EHR data is widely used by health professionals and researchers to discover knowledge, develop hypotheses, predict outcomes and assess risks [3, 4].

Ideally, the content of a database should be complete and accurate, including all health-related observations about each individual. In reality, however, healthcare databases contain inaccurate and missing values, stemming from different systematic and nonsystematic errors [5]. Such problems are compounded in large, heterogeneous databases because of changes over time and across source sites in coding standards (e.g., the ICD-9-CM to ICD-10-CM transition, annual updates to terminologies, and the semantic drift of meanings of terms). EHR systems (e.g., Veterans Affairs and the Department of Defense are transitioning to the Cerner EHR), health care organizations (e.g., hospital mergers and acquisitions), and practices (e.g., increases and decreases in opioid prescription and opioid use disorder diagnosis) are also in flux. Implementing an automated system that can identify data quality problems can help researchers and health providers take better advantage of any EHR database and improve research and clinical practice.

Several research studies have examined the quality of EHR databases. Weiskopf and Weng's 2013 systematic review, for instance, identified five dimensions of data quality (completeness, correctness, concordance, plausibility, and currency) [6]. They also identified data quality assessment methods including comparison with gold standards, data element agreement, data source agreement, distribution comparison, validity checks, log review, and element presence. Some tools have been developed to perform periodic checking or continuous monitoring of data quality [7-10]. They are, however, not widely used. For example, even simple frequency-based monitoring is not ubiquitously implemented in EHR repositories and, when implemented, is not usually applied to every individual variable. DeShazo and Hoffman [5] compared demographic variables, diagnosis groups, and procedures in a large EHR database (Cerner Health Facts®) and in the HCUP Nationwide Inpatient Sample. They found that both databases had statistically similar patterns for different data items; the researchers, however, did not suggest a method to automatically scrutinize the differences. Koszalinski et al. [11] investigated the problem of missing values in HealthFacts and noted that the number of missing values fluctuated across years, but they did not suggest how to differentiate between missing values that occur due to lack of information, such as items that are difficult to obtain (e.g., self-reports of race), and systematic errors that could be recognized and prevented. A recent paper by Wang et al. described a rule-based data quality assessment system for EHR [12]. However, standard approaches and novel methods for data quality monitoring in the management of clinical and clinical research data are still lacking [13-15].

1.1. The study objective

At least two approaches can be considered for ascertaining whether the number of observations/findings at a population level arises from a systematic error. The first approach, which is commonly used, is to define a normal range for the data frequency based on the historical frequency and to identify large deviations. Simple monitoring of data frequency, however, is not ideal for the detection of quality issues in large and heterogeneous databases, because other factors, including seasonal changes, outbreaks, and changes in data sources, often make the frequency of specific observations fluctuate.

A second approach is to define a normal range for data frequency based on a large number of co-occurring clinical observations. Such a multivariate analysis avoids mistaking seasonal changes and outbreaks for data quality issues. For instance, if we observed a sudden drop in diabetes diagnoses while the numbers of HbA1c tests and insulin prescriptions remained unchanged, it would be considered a potential data quality issue.

Our goal in this study was to determine whether it is possible to detect data quality issues by modeling the temporal pattern of frequencies of individual clinical observations in the context of other clinical observations. To achieve this goal, we developed prediction models as well as prediction intervals (PI). Aberrant signals are defined as observed values that fall outside the PI.

2. Methods

In this study, we used the Cerner HealthFacts® database to test the two approaches. HealthFacts is a relational database of EHR data maintained by Cerner Corp.; the July 2016 version contains de-identified data on 414,435,400 patient encounters. In HealthFacts, the average number of encounters per patient identifier is 2.6. Because each patient in this database may have more than one identifier due to the de-identification process, we did not report the number of patients. Data in HealthFacts is extracted directly from the EMRs of hospitals with which Cerner has a data use agreement. Encounters include pharmacy data, clinical and microbiology laboratory data, admission records, and billing information from affiliated patient care locations. All admissions, medication orders, dispensing, laboratory orders, and specimen data are time-stamped, providing a temporal relationship between treatment patterns and clinical information. Cerner has established Health Insurance Portability and Accountability Act-compliant operating policies for the de-identification of HealthFacts.

From a list of the 50 most used ICD-9 diagnosis codes [12], we selected six diagnosis codes at random and calculated, for each week of three consecutive years, the proportion of encounters with each of these ICD codes to all encounters in the same week (a small illustrative sketch of this weekly aggregation follows Table 1). The total number of encounters with these diagnostic codes exceeded 13 million over the 3-year period. In order to evaluate the performance of the two approaches, a physician expert on the team (SJN) created a list of lab tests, symptoms, and medications commonly associated with the diagnoses, which were used as correlating variables for this study. For routine exams, we simply used the seasonal pattern as the correlating variable. A comparison of the temporal patterns of those variables with the temporal patterns of the diagnoses (Table 1) could then be performed.

Table 1:

Descriptions of the six use cases: ICD codes, average number of encounters per week, time span, and key variable associated with diagnosis

Diagnosis | ICD-9 | Avg. encounters per week | Years for calculating the frequency | Weeks for prediction and evaluation | Key variable associated with diagnosis
--- | --- | --- | --- | --- | ---
Diabetes | 250.x | 42,949 | 2013–2015 | The last week of Oct. 2015 and the first four weeks of Nov. 2015 | HbA1c test
Routine medical exam | V70.0 | 2,871 | 2009–2011 | The last five weeks of 2011 | Seasonal pattern
Hyperlipidemia | 272.4 | 23,612 | 2013–2015 | The last three weeks of Dec. 2013 and the first two weeks of Jan. 2014 | Statins
Hypercholesterolemia | 272.0 | 5,948 | 2013–2015 | The last three weeks of Dec. 2013 and the first two weeks of Jan. 2014 | Statins
Anxiety disorder | 300.02 | 998 | 2013–2015 | The last three weeks of Dec. 2013 and the first two weeks of Jan. 2014 | Lorazepam
Urinary tract infection | 599.0 | 7,133 | 2013–2015 | The last three weeks of Dec. 2013 and the first two weeks of Jan. 2014 | Dysuria
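As a rough illustration of the weekly proportion computation described above, the sketch below groups encounters by week and divides the count of encounters carrying a given ICD-9 prefix by the total count. The column names (`admit_date`, `diagnoses`) and the data layout are assumptions for illustration, not the HealthFacts schema.

```python
import pandas as pd

def weekly_proportion(encounters, icd_prefix):
    """Proportion of encounters per week whose diagnoses start with `icd_prefix`."""
    enc = encounters.copy()
    # Assign each encounter to a calendar week (assumed admit_date column).
    enc["week"] = pd.to_datetime(enc["admit_date"]).dt.to_period("W")
    # Assumed: `diagnoses` holds an iterable of ICD-9 codes per encounter.
    has_dx = enc["diagnoses"].apply(
        lambda codes: any(str(c).startswith(icd_prefix) for c in codes))
    total = enc.groupby("week").size()
    with_dx = enc[has_dx].groupby("week").size()
    return (with_dx / total).fillna(0.0)

# e.g., weekly_proportion(encounters, "250") for diabetes (ICD-9 250.x)
```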

2.1. Frequency monitoring

For implementing the frequency monitoring approach, we created a normal range for the frequency of a diagnosis in a week based on the frequency of the same diagnosis in the past 47 weeks as follows:

$NI_i = \left[\left(\frac{1}{47}\sum_{w=1}^{47} f_w\right) - 2\sigma,\ \left(\frac{1}{47}\sum_{w=1}^{47} f_w\right) + 2\sigma\right]$

where $f_w$ is the frequency of the diagnosis in week $w$ and $\sigma$ is the standard deviation of the frequencies over those 47 weeks. We used 47 weeks of data to predict the following 5 weeks, roughly corresponding to a 90%/10% split of the data; such a split is not uncommon in machine learning.

If the actual frequency in a week was out of the normal range, the frequency for that week was considered abnormal. For each of the six diagnoses, we repeated the same procedure five times to investigate the normality of frequencies in five consecutive weeks.
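A minimal sketch of this frequency-monitoring baseline is shown below, assuming the weekly frequencies of one diagnosis are held in a NumPy array; the variable names and placeholder data are illustrative only.

```python
import numpy as np

def frequency_normal_interval(past_freqs):
    """Normal interval from the past 47 weekly frequencies (mean +/- 2 SD)."""
    mean = past_freqs.mean()
    sd = past_freqs.std()
    return mean - 2 * sd, mean + 2 * sd

def is_aberrant(past_freqs, observed):
    """Flag the observed weekly frequency if it falls outside the normal interval."""
    low, high = frequency_normal_interval(past_freqs)
    return not (low <= observed <= high)

# Hypothetical usage: 52 weekly proportions for one diagnosis; the first 47
# weeks define the interval and week 48 is then checked.
weekly_freqs = np.random.rand(52)  # placeholder data
print(is_aberrant(weekly_freqs[:47], weekly_freqs[47]))
```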

2.2. LSTM modeling

In order to implement the second approach, different models, including Simple Exponential Smoothing (SES) and Autoregressive Integrated Moving Average (ARIMA), could be used [16-18]. Recent advances in computing power have enabled researchers to use more sophisticated machine learning algorithms for time series analysis. Most notably, Long Short-Term Memory (LSTM) networks have a special structure of layers and gates that enables learning non-linear functions and long-term dependencies while preventing vanishing gradients [19]. It has been shown that LSTM performance in predicting time-series data is superior to that of Support Vector Machines (SVM), ARIMA, and Random Forests [20, 21]. We chose to implement a multivariate LSTM neural network-based approach to predict observation frequencies and normal intervals.

To enable the model to capture the clinical context of different weeks and assess the normality of the frequency of a diagnosis in a given week, we took into account the 250 most prevalent procedures, the 250 most prevalent medications (by generic drug name), and the 250 most prevalent lab tests (grouped by analyte), calculating the frequency of their occurrence in each week. Selecting the top 250 of each allowed us to account for over 95% of the procedures, medications, and lab tests. For each diagnosis, using a sliding window, we created 105 samples in which 47 weeks of data were used to train a stacked LSTM model to predict a normal range for the frequency of the diagnosis in the following 5 weeks. The aim was to create a prediction interval (PI) rather than a point estimate; therefore, a multi-step multivariate time series model was used.
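As a concrete illustration of this sliding-window setup, the sketch below builds 47-week input windows and 5-week targets from a weekly feature matrix. The array shapes and the assumption that the diagnosis frequency sits in column 0 are ours, not specified by the paper.

```python
import numpy as np

def make_windows(series, n_in=47, n_out=5):
    """Slide a window over weekly data: 47 input weeks -> next 5 target weeks.

    series: array of shape (n_weeks, 751), i.e. the diagnosis of interest plus
            750 co-occurring clinical features, one row per week.
    Returns X of shape (n_samples, 47, 751) and y of shape (n_samples, 5).
    """
    X, y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start:start + n_in])
        # Target: the diagnosis frequency (assumed column 0) in the next 5 weeks.
        y.append(series[start + n_in:start + n_in + n_out, 0])
    return np.array(X), np.array(y)

# 156 weeks of data yields 156 - 47 - 5 + 1 = 105 samples, matching the paper.
weeks = np.random.rand(156, 751)  # placeholder weekly frequencies
X, y = make_windows(weeks)
print(X.shape, y.shape)  # (105, 47, 751) (105, 5)
```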

In designing a five-step prediction, there are two options. The first is a recursive algorithm, in which the multiple values are predicted one by one; the other is a direct algorithm, in which the multiple values are predicted all at once. The following sections explain each of these two approaches in detail [22].

2.2.1. Recursive multi-step forecasting

This method uses the prediction of $X_t$ as an input for the prediction of $X_{t+1}$. All 751 features, including the feature of interest (i.e., the proportion of encounters with one of the six diagnoses) as well as the 750 other clinical features, are predicted at each step. The complete algorithm is as follows:

Algorithm R

Step 1: $X_t = \mathrm{LSTM}_i(X_{t-1}, X_{t-2}, X_{t-3}, \ldots, X_{t-47})$, where in our case each $X_{t-i}$, $47 \geq i \geq 1$, is a matrix of size (training samples) $\times$ 751.
Step 2: $X_{t+1} = \mathrm{LSTM}_i(X_t, X_{t-1}, X_{t-2}, \ldots, X_{t-46})$
Step 3: $X_{t+2} = \mathrm{LSTM}_i(X_{t+1}, X_t, X_{t-1}, \ldots, X_{t-45})$
Step 4: $X_{t+3} = \mathrm{LSTM}_i(X_{t+2}, X_{t+1}, X_t, \ldots, X_{t-44})$
Step 5: $X_{t+4} = \mathrm{LSTM}_i(X_{t+3}, X_{t+2}, X_{t+1}, \ldots, X_{t-43})$

A concern with this algorithm is that prediction errors can accumulate. In order to overcome this issue, a variant of the recursive algorithm has been proposed that fits a new LSTM model at each step to reduce the accumulated errors [23]. Although this adjusted recursive approach can lower the error rate somewhat, it requires a new model to be fitted at each step. Such an approach is computationally expensive and was not used in our experiments.
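A minimal sketch of the recursive scheme (Algorithm R) in Keras is shown below, assuming a one-step model that predicts all 751 weekly features from a 47-week input window. The layer width and other hyperparameters are illustrative choices, not those tuned in the study.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_one_step_model(n_features=751, n_in=47):
    """One-step-ahead model: predicts all 751 features for the next week."""
    model = Sequential([
        LSTM(64, input_shape=(n_in, n_features)),  # illustrative width
        Dense(n_features),
    ])
    model.compile(loss="mse", optimizer="adam")
    return model

def recursive_forecast(model, history, n_steps=5):
    """Feed each one-step prediction back as input for the next step."""
    window = history.copy()                            # shape (47, 751)
    preds = []
    for _ in range(n_steps):
        next_week = model.predict(window[np.newaxis, ...], verbose=0)[0]
        preds.append(next_week)
        window = np.vstack([window[1:], next_week])    # slide the window forward
    return np.array(preds)                             # shape (5, 751)
```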

2.2.2. Direct multi-step forecasting

In this algorithm, all steps of the feature of interest are predicted at once through a vector of size n, where n is the step size. In this approach, there is no need to predict the other 750 clinical features. The algorithm can be represented as below:

Algorithm D

$[x_t, x_{t+1}, x_{t+2}, x_{t+3}, x_{t+4}] = \mathrm{LSTM}_i(X_{t-1}, X_{t-2}, X_{t-3}, \ldots, X_{t-47})$
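A minimal sketch of the direct scheme (Algorithm D) under the same assumed input shape is shown below; the model outputs the 5 future values of the diagnosis frequency in a single forward pass, so the other 750 features are never predicted. The layer width is again an illustrative choice.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_direct_model(n_features=751, n_in=47, n_out=5):
    """Direct multi-step model: one output per forecast week."""
    model = Sequential([
        LSTM(64, input_shape=(n_in, n_features)),  # illustrative width
        Dense(n_out),                              # 5 forecast weeks at once
    ])
    model.compile(loss="mse", optimizer="adam")
    return model

# Hypothetical usage: fit with y_train of shape (samples, 5), then
# model.predict(X_test) returns the 5-week forecast directly.
```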

2.2.3. Execution of the approach

In order to compare the performance of the direct and recursive algorithms, identify the better approach for multi-step prediction, and tune the hyperparameters, 80% of the samples were selected for training and 10% for validation. Each algorithm was run with different numbers of LSTM layers, varying from 1 to 7, to elucidate how the performance of the two algorithms changed with the number of LSTM layers and to find the best configuration for the final analysis. A dropout method [24], omitting 30% of the nodes, was used to prevent overfitting.
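The sketch below shows one way to build such a stacked LSTM in Keras with a configurable depth and 30% dropout, compiled with the MSE loss and Adam optimizer mentioned later in this section. The number of units per layer and the commented training settings are assumptions for illustration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_stacked_lstm(n_layers, n_features=751, n_in=47, n_out=5, units=64):
    """Stacked LSTM with `n_layers` layers (1-7 in the paper's sweep) and 30% dropout."""
    model = Sequential()
    for i in range(n_layers):
        return_seq = i < n_layers - 1      # all but the last LSTM layer return sequences
        if i == 0:
            model.add(LSTM(units, return_sequences=return_seq,
                           input_shape=(n_in, n_features)))
        else:
            model.add(LSTM(units, return_sequences=return_seq))
        model.add(Dropout(0.3))            # omit 30% of the nodes to limit overfitting
    model.add(Dense(n_out))                # one output per forecast week
    model.compile(loss="mse", optimizer="adam")
    return model

# Hypothetical layer sweep, keeping the depth with the lowest validation error:
# for n_layers in range(1, 8):
#     model = build_stacked_lstm(n_layers)
#     model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, verbose=0)
```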

We then repeated that procedure three times and calculated the median absolute error of each model. The average of the three errors for each model trained with the two algorithms was used to tune the hyperparameters.

In the next step, for each of the six cases, the stacked LSTM model that performed best on the validation data, i.e., the model with the lowest error, was used to generate a prediction interval on the remaining 10% of the data (the test data) for the proportion of encounters with the diagnosis of interest to all encounters in the following five weeks.
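For concreteness, a small sketch of the error computation and model selection is given below, under the assumption that each candidate configuration's validation predictions and per-run errors are available; the three-run averaging mirrors the procedure described above.

```python
import numpy as np

def median_absolute_error(y_true, y_pred):
    """Median absolute error between observed and predicted weekly frequencies."""
    return np.median(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def select_best(configs, run_errors):
    """Pick the configuration with the lowest error averaged over its runs.

    run_errors: dict mapping each configuration to its list of per-run errors,
    e.g. {("R", 3): [e1, e2, e3], ...} (hypothetical keys for illustration).
    """
    return min(configs, key=lambda c: np.mean(run_errors[c]))
```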

In order to create PIs, the test dataset was bootstrapped with replacement ten times (a parameter chosen empirically), taking the mean of the predicted values for each of the five steps in each bootstrap sample, $\mu_{ij}$, $1 \le i \le 5$, $1 \le j \le 10$, calculating the standard deviation across bootstrap samples for each step, $\sigma_i$, and then generating the prediction interval for each step (week) as follows:

$PI_i = \left[\frac{1}{10}\sum_{j=1}^{10}\mu_{ij} - 2\sigma_i,\ \frac{1}{10}\sum_{j=1}^{10}\mu_{ij} + 2\sigma_i\right]$

If the actual value of the frequency of that diagnosis fell outside the interval, it was considered an aberrant signal. Using six diagnosis codes and investigating the normality of the frequency of each diagnosis over five weeks resulted in a total of 30 observations to scrutinize. Researchers sometimes use a "three-sigma rule," which identifies data lying outside three standard deviations of the mean as outliers. In our case, this would increase the sensitivity at the cost of decreasing the specificity. Our goal, however, was to keep both metrics as high as possible.
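The sketch below reflects our reading of this bootstrap procedure, assuming `model` is the selected stacked LSTM and `X_test` holds the held-out 47-week windows; only the resample count of ten comes from the paper, and the rest is illustrative.

```python
import numpy as np

def bootstrap_prediction_interval(model, X_test, n_boot=10, rng=None):
    """Per-step PI: mean over bootstrap predictions +/- 2 SD across bootstraps."""
    rng = rng or np.random.default_rng(0)
    step_means = []                                   # one 5-vector (mu_ij) per bootstrap
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_test), size=len(X_test))   # resample with replacement
        preds = model.predict(X_test[idx], verbose=0)          # shape (n, 5)
        step_means.append(preds.mean(axis=0))                  # mean prediction per step
    step_means = np.array(step_means)                 # shape (10, 5)
    center = step_means.mean(axis=0)                  # mean over bootstraps, per step
    sd = step_means.std(axis=0)                       # sigma_i per step
    return center - 2 * sd, center + 2 * sd           # lower / upper PI per week

# A week is flagged as aberrant if its observed frequency falls outside [lower, upper].
```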

Python 3, including the Keras and Scikit-Learn packages, was used to implement the LSTM. We used MSE as the loss function and Adam as the optimizer. The computations were conducted on a MacBook (2.8 GHz Intel Core i7 CPU, 16 GB RAM, 2 GB Intel HD Graphics).

2.2.4. Further explanation of the approach through two use cases

Among the six diagnoses, two use cases, diabetes and routine exam encounters, were chosen to provide further clarification of our proposed approach. For the first use case, aiming to predict the acceptable range for the frequency of encounters with diabetes as the diagnosis, all encounters with a diabetes diagnosis were identified based on ICD-9 codes from January 2013 to December 2015, along with the additional 750 features.

For the second use case, predicting the acceptable range for the frequency of encounters with routine exam as the diagnosis, the proportion of encounters with routine exam as the diagnosis to all encounters was calculated, aggregated by week, from January 2011 to December 2013. The frequencies of the 750 features mentioned in section 2.2 were also calculated for each of the 156 weeks, yielding a total of 105 samples. As previously mentioned, 47 weeks of data were used for training to predict the frequency of encounters with routine exams in the following 5 weeks. The temporal patterns of diabetes diagnoses and HbA1c tests were compared, and the seasonal pattern of routine exam encounters was examined.

3. Results and Discussion

3.1. Sensitivity and Specificity of the frequency and LSTM approaches

For each of the six diagnosis codes, the best stacked LSTM model was used to generate a prediction interval and identify aberrant signals. Note that the aberrant signals were labeled on a weekly basis. In total, out of 30 weeks, there were 22 normal weeks and 8 abnormal weeks.

The performance of the proposed LSTM approach is shown in Table 2 along with the frequency approach as a control. Table 3 shows the sensitivity and specificity of each approach.

Table 2:

The contingency tables for both approaches

 | True Normal | True Abnormal
--- | --- | ---
Predicted Normal | 15 (frequency) / 20 (LSTM) | 2 (frequency) / 1 (LSTM)
Predicted Abnormal | 7 (frequency) / 2 (LSTM) | 6 (frequency) / 7 (LSTM)

Table 3:

The sensitivity and the specificity of the two approaches

Approach | Sensitivity | Specificity
--- | --- | ---
Frequency monitoring (Approach 1) | 75.00% | 68.18%
LSTM method (Approach 2) | 87.50% | 90.91%

Compared to the frequency-based approach, the LSTM approach increased the sensitivity from 75.00% to 87.50%, and the specificity from 68.18% to 90.91%.
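As a quick check on these figures, the snippet below recomputes sensitivity and specificity from the Table 2 counts, treating abnormal weeks as the positive class; it is an illustration, not part of the study code.

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

print(sens_spec(tp=6, fn=2, tn=15, fp=7))   # frequency approach -> (0.75, ~0.6818)
print(sens_spec(tp=7, fn=1, tn=20, fp=2))   # LSTM approach      -> (0.875, ~0.9091)
```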

3.2. Results of the two use cases

Among the six diagnoses, two cases, diabetes and routine exam encounters, were chosen for in-depth explanation. The performance of the two LSTM algorithms was compared on the diabetes diagnosis data and on the routine medical exam data. Figure 1 shows the median absolute error for each of the two algorithms, D and R, in LSTM models with different numbers of layers. The results are shown for the diabetes diagnosis in Figure 1.a and for the frequency of encounters with routine exam as the diagnosis in Figure 1.b.

Figure 1.

Median absolute error vs. number of LSTM layers for both algorithm D and algorithm R, when the diabetes dataset is used (Figure 1.a) and when the routine exam dataset is used (Figure 1.b).

As can be seen in Figure 1, algorithm R (recursive multi-step prediction) outperforms algorithm D (direct multi-step prediction) in both cases, as the median absolute error of algorithm R is lower than that of algorithm D. The effect of the number of LSTM layers in the stacked LSTM model appears to vary with the particular diagnosis under consideration. In case 1 (diabetes, shown in Figure 1.a), the median absolute error of algorithm R varies relatively little (by 49.77%) as the number of LSTM layers changes. In case 2 (routine exam, shown in Figure 1.b), the same measure varies by 600% as the number of LSTM layers changes. The median absolute error of algorithm D also varied with the number of layers.

As described in the Methods section, the best stacked LSTM model, i.e., the one with the lowest error, was used to bootstrap and build a prediction interval on the test data for the proportion of encounters with diabetes diagnoses to all encounters over five weeks, starting from the last week of October 2015 and ending in the last week of November 2015. The actual proportion of encounters with diabetes diagnoses fell within the prediction interval for the last week of October 2015 (week 143), showing that the actual value was not considered a signal of a systematic error in that week. The actual values for the second through fifth predicted weeks, weeks 144–147 (the four weeks of November 2015), fell outside the prediction interval and were considered a signal of error. We investigated whether this reflected a systematic error in reporting all relevant medical records in those weeks or a systematic error in reporting only diabetes diagnoses. If it were the former, the frequency of HbA1c tests should also have dropped. We did not, however, find any significant change in the frequencies of the associated medications and tests in the weeks of November 2015. On reflection, it was noted that the second predicted week coincided with the ICD-10 implementation date [25]. When the model was run to investigate whether the frequency of diabetes diagnoses was normal in three other periods, weeks 14–18, weeks 50–54, and weeks 101–105, all observations fell inside the normal range predicted by the model.

Figure 2 depicts the temporal view of both the frequency of diabetes diagnoses and the frequency of HbA1c tests. Considering only the frequency of diabetes in each week, rather than using a multivariate model that also takes into account the other clinical observations in that week, would suggest the drops in the frequency of diabetes diagnoses in weeks 18, 54, and 105 as aberrant signals (see Figure 2). In those three weeks, however, the frequency of HbA1c tests, which should be strongly correlated with the frequency of diabetes diagnoses, dropped as well. The proposed model did not flag the drops in these three weeks as aberrant signals, suggesting that the drop in the frequency of diabetes diagnoses was not a systematic error in reporting the diabetes diagnosis. In weeks 144 through 147, however, the frequency of diabetes diagnoses dropped while the frequency of HbA1c tests showed only modest variation, suggesting a systematic error with regard to the frequency of diabetes diagnoses in those weeks.

Figure 2.

Temporal visualization of the frequency of diabetes diagnoses and HbA1c tests in different weeks, starting from January 2013

In the case of the routine exam diagnoses, the stacked LSTM model was run three times, each time creating the PI for 5 weeks. The actual observed frequencies were all within the PI, and thus the model did not recognize the actual frequency of routine exams as aberrant signals.

Figure 3 depicts the frequency of encounters with routine exam as the diagnosis over 156 weeks, starting from January 2009. The frequencies in all fifteen weeks for which a PI was created, from week 142 through week 156, lie in the green region, indicating that they were considered normal. Although the frequency dropped sharply in week 156, the model did not consider it an aberrant signal related to data quality issues. We observed that the frequency of encounters with a routine exam diagnosis was at a local minimum in week 1, week 104, and week 156. Given that week 1 corresponds to January of the first year, week 104 (52+52) to December of the second year, and week 156 (52+52+52) to December of the third year, this suggests that the frequency of encounters with routine exam as the diagnosis decreases around the New Year. The model was able to learn this seasonal pattern and did not consider the sharp drop an aberrant signal. This finding bolsters the notion that a sophisticated model is needed to distinguish the fluctuation of a clinical observation due to a quality issue from fluctuations due to seasonal changes, outbreaks, and changes in data sources. The multivariate LSTM enabled investigation of the fluctuation of clinical observations in the context of a large number of other clinical observations, including medications, comorbid conditions, and lab tests.

Figure 3.

The frequency of routine exam encounters in 156 consecutive weeks, starting from January 2009

4. Discussion

Aberrant temporal signals are common in EHR databases for a variety of reasons, including data quality issues, seasonal changes, outbreaks, and changes in clinical practice. Implementing a method that can automatically recognize aberrant signals caused by data quality problems can assist researchers and health providers who use EHR data. In this paper, a method to analyze time series datasets and predict a normal interval for observation frequencies for the next n weeks is presented. If the actual value falls within the normal interval, the data can be considered normal; if it falls outside the interval, it is considered aberrant. As such, this study is very different from prior and ongoing research efforts to identify and impute missing values.

Our results demonstrated the ability of stacked multivariate LSTM models to recognize true data quality issues rather than fluctuations caused by other factors. As the data showed, there is a fair amount of fluctuation in the HealthFacts data. Some of the fluctuations may be caused by seasonal disease changes (e.g., flu season), seasonal patient visit patterns (e.g., fewer routine physicals scheduled between Christmas and New Year), alterations in practice patterns (e.g., opioid use for pain management), or a change in data sources (e.g., a hospital switching EHR systems). Simply monitoring the frequency trend, as is the common practice in systems that do monitor data quality, would not be able to distinguish these causes from aberrant data associated with data quality issues.

The method described here can be applied to a wide range of healthcare data. It can help meet the need for accurate data in observational research. For example, an ongoing study of the role of magnesium in diabetes prevention requires the use of ICD codes along with HbA1c tests to identify diabetic patients. As such, the quality of the ICD coding needs review. A large discordance between the frequencies of ICD-coded diabetes and HbA1c tests appears to be a sign of a data quality problem. The ability to automatically discover such problems will help avoid using the erroneous data in subsequent analyses.

Stacked LSTM models were used to create a weekly prediction interval for each of the six cases for the following five weeks. Two LSTM algorithms, direct multi-step forecasting and recursive multi-step forecasting, were evaluated. Each of these two algorithms has a potential disadvantage. Since the direct multi-step forecasting algorithm predicts all steps simultaneously, it is more complex, resulting in a greater chance of overfitting [19]. The recursive multi-step prediction algorithm uses the prediction of an observation as an input for the prediction of the next-step observation; it can accumulate errors and would be expected to have lower prediction accuracy over a longer time horizon. This study found that, regardless of the number of LSTM layers, the recursive algorithm outperformed the direct algorithm. The optimal number of LSTM layers in the stacked LSTM model varied with the diagnosis under study.

In this study, a "hybrid" recursive multi-step forecasting approach, in which a separate model is built for the prediction of each step, was not attempted. This approach seems likely to outperform an ordinary recursive model, as it suppresses errors at each step and prevents error accumulation. Although hybrid algorithms are expected to outperform ordinary ones, they are time-consuming to train. In a future study, the performance of hybrid algorithms should be compared in a similar setting.

4.1. Limitations and future studies

There are several limitations of this work that can be addressed in future studies. The first is that only EHR data from the HealthFacts database was used. Future studies should include other EHR databases, e.g., the VA Central Data Warehouse, to test the generalizability of the LSTM methods. A second limitation is that the examination of aberrant signals was limited in this study to diagnoses; the method could also be applied to create prediction intervals for medications, lab tests, and medical procedures. The third limitation is that we used only one multivariate time series model, the LSTM, to validate our proposed approach. In a future study, the performance of different multivariate time series models, including the Transformer, should be compared.

Finally, the specific causes of the aberrant signals could not be identified in this study, even when there was confidence that data quality issues existed. Working with a fully de-identified database as end users limits the ability to trace issues back to their original data sources. Nevertheless, it is important for clinical researchers to be aware of quality issues so that they can design their studies accordingly.

Summary table

What was already known on the topic | What this study added to our knowledge
--- | ---
• Although electronic health records (EHR) are a promising source of clinical information, they contain inaccurate and missing values stemming from different systematic and nonsystematic errors. | • Stacked multivariate LSTM models are able to distinguish between true data quality issues and fluctuations caused by other factors, such as seasonal changes, outbreaks, and changes in clinical practice.
• Fluctuations in the frequency of a clinical observation in an EHR-based database can be caused by seasonal disease changes, seasonal patient visit patterns, or a quality issue. | • Our method analyzes time series datasets and predicts a normal interval for observation frequencies for the next n weeks to recognize aberrant signals.


Acknowledgements

The authors are partly funded by the following grants: NIH 1UL1TR001876; VA 1I01 HX002422; VA 1I01HX002679; NIH R21LM012929.


References

  • [1]. Steinbrook R. Health care and the American Recovery and Reinvestment Act. New England Journal of Medicine, 2009. 360(11): p. 1057–1060.
  • [2]. Hillestad R, et al. Can electronic medical record systems transform health care? Potential health benefits, savings, and costs. Health Affairs, 2005. 24(5): p. 1103–1117.
  • [3]. Kohane IS. Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics, 2011. 12(6): p. 417.
  • [4]. Klompas M, et al. Automated surveillance and public health reporting for gestational diabetes incidence and care using electronic health record data. Using administrative databases to identify cases of chronic kidney disease: a systematic review, 2011: p. 23.
  • [5]. DeShazo JP and Hoffman MA. A comparison of a multistate inpatient EHR database to the HCUP Nationwide Inpatient Sample. BMC Health Services Research, 2015. 15(1): p. 384.
  • [6]. Weiskopf NG and Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc, 2013. 20(1): p. 144–51.
  • [7]. Kahn MG, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC), 2016. 4(1): p. 1244.
  • [8]. Simon KC, et al. Building of EMR Tools to Support Quality and Research in a Memory Disorders Clinic. Front Neurol, 2019. 10: p. 161.
  • [9]. Khare R, et al. Design and Refinement of a Data Quality Assessment Workflow for a Large Pediatric Research Network. EGEMS (Wash DC), 2019. 7(1): p. 36.
  • [10]. Observational Health Data Sciences and Informatics Software. [cited 2020 January 27]; Available from: https://www.ohdsi.org/analytic-tools/.
  • [11]. Koszalinski R, et al. Missing Data, Data Cleansing, and Treatment From a Primary Study: Implications for Predictive Models. CIN: Computers, Informatics, Nursing, 2018. 36(8): p. 367–371.
  • [12]. Wang Z, et al. A Rule-Based Data Quality Assessment System for Electronic Health Record Data. Appl Clin Inform, 2020. 11(4): p. 622–634.
  • [13]. Dixon BE, et al. A vision for the systematic monitoring and improvement of the quality of electronic health data. Stud Health Technol Inform, 2013. 192: p. 884–8.
  • [14]. Chen H, et al. A review of data quality assessment methods for public health information systems. Int J Environ Res Public Health, 2014. 11(5): p. 5170–207.
  • [15]. Brown PJ, Harwood J, and Brantigan P. Data quality probes--a synergistic method for quality monitoring of electronic medical record data accuracy and healthcare provision. Stud Health Technol Inform, 2001. 84(Pt 2): p. 1116–9.
  • [16]. McElroy T and Wildi M. Multi-step-ahead estimation of time series models. International Journal of Forecasting, 2013. 29(3): p. 378–394.
  • [17]. Earnest A, et al. Using autoregressive integrated moving average (ARIMA) models to predict and monitor the number of beds occupied during a SARS outbreak in a tertiary hospital in Singapore. BMC Health Services Research, 2005. 5(1): p. 36.
  • [18]. Mohamed B and Mohamad M. Using autoregressive integrated moving average (ARIMA) models to predict the number of patient admission in paediatric clinic at a private hospital in Kuantan. Journal of Sciences and Management Research (ISSN 2600-738X).
  • [19]. Hochreiter S and Schmidhuber J. Long short-term memory. Neural Computation, 1997. 9(8): p. 1735–1780.
  • [20]. Siami-Namini S, Tavakoli N, and Namin AS. A comparison of ARIMA and LSTM in forecasting time series. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). 2018. IEEE.
  • [21]. Zhang J and Nawata K. A comparative study on predicting influenza outbreaks. Bioscience Trends, 2017.
  • [22]. Brownlee J. Long Short-Term Memory Networks with Python: Develop Sequence Prediction Models with Deep Learning. 2017: Machine Learning Mastery.
  • [23]. Akhlaghi S, Zhou N, and Huang Z. A multi-step adaptive interpolation approach to mitigating the impact of nonlinearity on dynamic state estimation. IEEE Transactions on Smart Grid, 2016. 9(4): p. 3102–3111.
  • [24]. Srivastava N, et al. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014. 15(1): p. 1929–1958.
  • [25]. Centers for Medicare and Medicaid Services. 2020. [cited 2019 Dec 4]; Available from: https://www.cms.gov/ICD10/.
