Abstract
Inaccurate body weight measures can cause critical safety events in clinical settings and hinder the utilization of clinical data for retrospective research. This study focused on developing a machine learning-based automated weight abnormality detector (AWAD) to analyze growth dynamics in pediatric weight charts and detect abnormal weight values. In two reference-standard-based evaluations of real-world clinical data, the machine learning models showed good capacity for detecting weight abnormalities and significantly outperformed the methods proposed in the literature (p-value <0.05). A deep learning model with bi-directional long short-term memory networks achieved the best predictive performance, with AUCs ≥0.989 across the two datasets. The positive predictive value and sensitivity achieved by the system suggest the potential for more than 98% screening effort reduction in weight abnormality detection. Consequently, we hypothesize that the AWAD, when fully deployed, holds great potential to facilitate clinical research and healthcare delivery that rely on accurate and reliable weight measures.
Introduction
Body weight, an important parameter in pediatrics, is widely utilized to track the growth and development of children and to calculate drug doses in clinical settings1, 2. It is also an important variable for secondary data analysis in clinical research, such as computerized phenotyping3. In current practice, body weight is usually measured and entered into the electronic health records (EHRs) manually by clinical staff, and can be recorded incorrectly in various ways, including improper operation of the weight scale, typing errors, conversion errors (e.g., weight measured in pounds but documented in kilograms), and weight estimations instead of actual measurements4. Inaccurate body weight data carry a high risk of causing patient harm; studies have reported that 18-22% of medication errors in pediatrics result from "improper dose/quantity", a rate significantly higher than in adult settings2, 5. Erroneous data also hinder the full utilization of EHRs for research purposes due to error propagation in downstream analyses.
Manual chart review, the current standard practice for correcting weight abnormalities, is labor-intensive and impractical to perform in busy clinical settings. In particular, capturing weight abnormalities with high accuracy can be difficult, especially for neonates and for growing children with acute or chronic medical conditions. There is a critical need for an accurate and cost-effective approach to detecting abnormalities in pediatric weight charts to prevent weight-based dosing/medication errors. Such an approach could also be applied to cleaning abnormal weight values from EHRs to provide high-quality data for subsequent research.
Using recent informatics technologies, several approaches have been proposed to identify abnormal weights in patient charts6. One established method was developed by the Centers for Disease Control and Prevention (CDC); it standardizes weight points with z-score normalization and identifies outliers based on their standard deviations from the mean7. Another computerized approach was proposed by the Children's Hospital of Philadelphia, which compares the standard deviation of a weight point against a weighted moving average of the chart to identify abnormal values8. In our earlier study, we also developed a regression approach to model the weight trend in a chart and determine whether a weight point is an outlier9. However, all these methods are rule-based, relying on empirical thresholds developed from subjects' age and sex to identify abnormal values. Although easy to implement, these approaches have low detection capacity, particularly for complex weight charts.
Machine learning is a field of artificial intelligence that utilizes computerized algorithms to learn relations in data and make predictions on it. These technologies have been widely used for a variety of clinical decision support tasks, including patient clinical status detection, workflow optimization, and computerized phenotype discovery10-12. More recently, deep learning, a branch of machine learning focused on neural network-based algorithms, has gained increasing popularity in clinical informatics and has been applied to analyze sequence data such as clinical narratives13-15. Nevertheless, no studies have used machine learning to analyze patient weight charts.
Our research is specifically directed at developing an accurate and scalable informatics-based solution, an automated weight abnormality detector (AWAD), to identify errors in pediatric weight charts. In our earlier studies, we developed a visual annotation tool to enable large-scale annotation of weight abnormalities6, 16. As the next step, this study focused on developing an automated approach to analyze patient charts and identify abnormal weight values. We hypothesized that, by using state-of-the-art machine learning and deep learning technologies, the AWAD could detect weight abnormalities in individual charts with high sensitivity and specificity. To our knowledge, this study is the first to investigate large-scale detection of weight abnormalities via machine learning technologies.
Materials and methods
Figure 1 diagrams the overall processes of the study. We first collected and selected weight charts from the institutional EHRs (process 1 in Figure 1). Annotation was then performed to identify abnormal weight values in individual charts (process 2). Features were extracted from each weight chart to capture weight characteristics and growth dynamics (process 3) and were then fed into machine learning- and deep learning-based algorithms to detect abnormal weight values (process 4). Finally, a separate annotated weight set was applied to assess the generalizability of the developed algorithms (process 5).
Figure 1.
The overall processes of the study.
Data collection and weight chart selection
A total of 4.3 million weight points were collected from the EHRs for all 347,056 patients visiting Cincinnati Children's Hospital Medical Center (CCHMC) between 2010 and 2018. Use of the de-identified dataset was approved by the University of Cincinnati Institutional Review Board (study ID: 2017-2075). Following our earlier study6, we excluded three types of weight points from analysis: 1) weight points documented in the first 24 months (1,137,410 points from 31,876 patients), because newborns can have more complex weight-changing patterns; 2) weight points recorded after 240 months (17,211 points from 7,932 patients); and 3) all weight points from patients with fewer than four weight measures (213,388 points from 87,013 patients), to ensure sufficient information in each weight chart. After data exclusion, a second-degree polynomial regression model was built to identify charts with potentially abnormal weight values based on five parameters: 1) maximum absolute residual, 2) median absolute residual, 3) root mean square error, 4) maximum ratio of absolute residual to fitted value, and 5) mean ratio of absolute residual to fitted value. The regression model identified 107,336 candidate charts with at least one weight point outside the 99% confidence interval. A set of 15,000 charts (denoted the MAIN dataset) was randomly sampled from the candidate charts for weight abnormality annotation.
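For concreteness, the following is a minimal sketch of this pre-screening step, assuming a per-chart quadratic fit of weight on age and using statsmodels' observation-level 99% prediction interval as the flagging criterion; function and variable names are ours, not from the study code.

```python
# Sketch of per-chart pre-screening: fit weight ~ age + age^2, compute the five
# residual-based screening parameters, and flag charts with any point outside
# the 99% prediction interval (an assumed stand-in for the study's interval).
import numpy as np
import statsmodels.api as sm

def screen_chart(age, weight, alpha=0.01):
    X = sm.add_constant(np.column_stack([age, age ** 2]))
    fit = sm.OLS(weight, X).fit()
    abs_resid = np.abs(fit.resid)
    params = {
        "max_abs_resid": abs_resid.max(),
        "median_abs_resid": np.median(abs_resid),
        "rmse": np.sqrt(np.mean(fit.resid ** 2)),
        "max_resid_ratio": (abs_resid / fit.fittedvalues).max(),
        "mean_resid_ratio": (abs_resid / fit.fittedvalues).mean(),
    }
    pred = fit.get_prediction(X).summary_frame(alpha=alpha)
    outside = (weight < pred["obs_ci_lower"].values) | (weight > pred["obs_ci_upper"].values)
    return params, bool(outside.any())   # True = candidate chart for annotation
```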
Weight abnormality annotation
We recruited domain experts with appropriate medical training and knowledge to review the selected weight charts and identify abnormal weight values. The annotation guideline specified that 1) weight points should be evaluated retrospectively based on all available points in a chart, and 2) weight points should be labeled as abnormal based on their clinical importance and risk of leading to patient harm. Using this guideline and a visual annotation tool developed in-house6, 16, 18 domain experts annotated the 15,000 weight charts to create a reference-standard set of weight abnormalities. During annotation, each expert was assigned a random sample of 2,500 patient charts such that each chart was reviewed by three annotators. An example chart with annotation is presented in Figure 2. All weight points defaulted to blue (normal). Abnormal points with high clinical importance were labeled red, while potentially abnormal values requiring second opinions were labeled orange.
Figure 2.
An example chart with weight abnormality annotation. The weight characteristics presented in the box were available to annotators during annotation. Color codes: 1) blue represents normal weight entries, 2) red represents abnormal weight points with high clinical importance, and 3) orange represents potentially abnormal points requiring second opinions.
A scoring approach was applied to summarize the annotations, with red scoring two, orange one, and blue zero. A weight point was considered abnormal if the sum of scores from the three annotators was greater than two (i.e., at least one red and one orange decision, or three orange decisions). Weight points scoring exactly two were further reconciled by our clinical champion (Dr. Spooner) to determine their abnormality. Adjacent data points with identical age and weight values were reviewed manually to avoid diluting the effects of predictors with contradictory outcome labels. The interrater agreement (IRA) was calculated using Fleiss' Kappa17.
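The scoring rule is simple enough to state directly in code; the sketch below implements it as described above (the function name is ours).

```python
# Summarize three annotators' color labels for one weight point:
# red = 2, orange = 1, blue = 0; abnormal when the scores sum to > 2,
# and a sum of exactly 2 goes to the clinical champion for reconciliation.
SCORES = {"red": 2, "orange": 1, "blue": 0}

def summarize_annotation(labels):
    total = sum(SCORES[label] for label in labels)
    if total > 2:
        return "abnormal"
    if total == 2:
        return "reconcile"
    return "normal"

assert summarize_annotation(["red", "orange", "blue"]) == "abnormal"
assert summarize_annotation(["orange", "orange", "orange"]) == "abnormal"
assert summarize_annotation(["orange", "orange", "blue"]) == "reconcile"
```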
Weight feature extraction
For each weight point, nine variables were generated to capture weight characteristics and growth dynamics: 1) subject weight in kilograms, 2) subject age in years, 3) subject sex, 4) the z-score based on the Box-Cox transformation, median, and generalized coefficient of variation (the LMS method) for the subject's sex and age18, 5) the modified LMS-based z-score using the weight-for-age data from a CDC reference population, designed to identify extreme weight values7, 6) the percentage of the reference population below the weight value (i.e., the percentile), 7) the absolute age difference from the immediately previous weight point, 8) the absolute weight difference from the immediately previous weight point, and 9) the absolute z-score difference from the immediately previous weight point. The numerical variables were standardized with z-score normalization19. The categorical variable (i.e., sex) was binarized with a dummy variable to avoid linear dependencies. After feature extraction, each weight point was represented by a nine-dimensional numerical vector.
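As a point of reference for feature 4, the standard LMS z-score formula is sketched below; L, M, and S are the Box-Cox power, median, and generalized coefficient of variation taken from the CDC reference tables for the subject's sex and age (the table lookup is not shown).

```python
# LMS-based z-score: ((X/M)^L - 1) / (L*S) when L != 0, and ln(X/M)/S when L == 0.
import math

def lms_zscore(weight, L, M, S):
    if L != 0:
        return ((weight / M) ** L - 1.0) / (L * S)
    return math.log(weight / M) / S
```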
Traditional machine learning classifiers
We modeled detection of weight abnormalities as a binary classification task and implemented five types of machine learning classifiers: logistic regression (LR) with L1 and L2 regularization20; support vector machines with linear (SVM-L), polynomial (SVM-P), and radial basis function (SVM-R) kernels21, 22; decision trees (DTs)23; random forests (RFs)24; and one-layer artificial neural networks (aNNs)25.
Deep learning models
Unlike traditional classifiers that make predictions on individual examples, the long short-term memory (LSTM) network is capable of propagating information along a data sequence to improve prediction capacity26. Literature studies have shown the effectiveness of LSTM-based models in processing sequence data such as free-text narratives and time-series signals13, 27. To achieve the best detection capacity, we also developed an LSTM-based model to capture growth trends and identify weight abnormalities. The weight feature sequence of each chart was used as model input. Zero padding was applied so that all charts had the same length of weight records; padding values were identified via masking and ignored during model training and evaluation. A bi-directional LSTM (Bi-LSTM) was developed to aggregate information from a weight chart forwards and backwards to detect weight abnormalities28. To simulate a prospective setting where future weights are not available, we also developed a unidirectional LSTM model. Variants such as LSTM with conditional random field models were explored but excluded from further analysis due to lack of performance improvement.
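A minimal sketch of such a per-point Bi-LSTM classifier over zero-padded charts is shown below, using Keras masking so padded timesteps are ignored; the shapes and layer width are illustrative assumptions, and the actual hyper-parameters were tuned as described in the experimental setup.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, N_FEATURES = 200, 9  # padded chart length (assumed) and feature dimension

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN, N_FEATURES)),
    layers.Masking(mask_value=0.0),                 # zero-padded timesteps are skipped
    layers.Bidirectional(layers.LSTM(100, return_sequences=True)),  # forward + backward pass
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),  # per-point abnormality probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# X: (n_charts, MAX_LEN, N_FEATURES) zero-padded feature sequences
# y: (n_charts, MAX_LEN, 1) 0/1 abnormality labels
# model.fit(X, y, epochs=..., batch_size=...)
```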
Baseline approaches
The machine learning and deep learning models were compared with three baselines proposed in the literature7-9. The first method was developed by the CDC (denoted CDC); it computes modified weight-for-age z-scores for a weight chart and considers weight points outside the range [-5, 8] abnormal. The second method is a computerized approach developed by the Children's Hospital of Philadelphia (denoted CHOP), which compares the standard deviations of weight points against a weighted moving average of the chart to identify abnormal points with significant deviation between recorded and expected values. We re-implemented the algorithm based on the supplementary code from the publication8. Two error types identified by the CHOP method, 'duplicate' measurements on the same day and 'carried forward' weights within 90 days of a prior measurement, were allowed and considered 'normal' in our study. The third method is a regression approach developed in our earlier study (denoted REG)9, which models the weight trend based on age, sex, previous weight values, and time from previous weight points to determine whether a weight point is an outlier. It is worth noting that the CDC method identifies weight abnormality based on the current weight point and its immediate predecessor, while the CHOP and REG methods use all weight points in a chart.
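The CDC baseline reduces to a simple cutoff rule; a minimal sketch as described above (vectorized for a whole chart, with an assumed function name):

```python
# Flag weight points whose modified weight-for-age z-score falls outside [-5, 8].
import numpy as np

def cdc_baseline(modified_z):
    z = np.asarray(modified_z)
    return (z < -5) | (z > 8)   # True = abnormal
```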
Experimental setup
A stratified random sampling was performed based on individual subjects to split the MAIN dataset into two parts, 70% for training and 30% for evaluation. Ten-fold cross-validation was utilized to train the machine learning and deep learning models, where grid search was applied to optimize hyper-parameters, including 1) cost parameters for L1- and L2-normalized LR20, SVM-L, SVM-P, and SVM-R22 (screened from 10^-6 to 10^6); 2) minimum number of observations in a node (3, 5, 10, 15, and 20) and complexity parameters (screened from 10^-6 to 10^-1, 0.3, 0.5, and 0.8) for DT23; 3) number of trees (screened from 2^6 to 2^11) and minimum number of observations in a node (3, 5, 10, 15, 25, 40, and 50) for RFs24; 4) polynomial degree for SVM-P (2 and 3); 5) parameter γ for SVM-R (screened from 2^-8 to 2^9); 6) number of neurons (10, 15, and 20) and activation functions (rectified linear unit [ReLU] and hyperbolic tangent [tanh]) for aNNs25; 7) number of neurons (screened in increments of 50 from 50 to 200) for LSTM26 and Bi-LSTM28; and 8) learning rates for aNN, LSTM, and Bi-LSTM (screened from 10^-4 to 10^-2, plus 2×10^-4 and 5×10^-4). Stratified down-sampling was integrated into cross-validation to mitigate class imbalance29. The machine learning and deep learning classifiers were implemented in R or Python30, 31.
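As one way this setup can be realized (shown for RF only, and mapping "minimum observations in a node" to scikit-learn's `min_samples_leaf` is our assumption), a minimal sketch of grid search with down-sampling folded into each cross-validation fold:

```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Down-sampling is applied only to the training folds inside the pipeline,
# so validation folds keep their natural class imbalance.
pipe = Pipeline([
    ("down", RandomUnderSampler(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])
param_grid = {
    "rf__n_estimators": [2 ** k for k in range(6, 12)],   # 2^6 ... 2^11 trees
    "rf__min_samples_leaf": [3, 5, 10, 15, 25, 40, 50],
}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc",
                      cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
# search.fit(X_train, y_train)  # 9-d feature vectors and 0/1 abnormality labels
```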
Generalizability validation
To evaluate system generalizability, a separate set of 223,725 weight points was extracted for 7,190 patients in the CCHMC Discover Together Biobank. The data were collected as part of a larger study of the Electronic Medical Records and Genomics (eMERGE) network32. The dataset was processed using the same procedure described above. Twenty percent of the charts were randomly selected and annotated by two study team members (Lei Liu and Dr. Ni) specializing in clinical informatics (denoted the eMERGE dataset). The best-performing model identified in the earlier experiments was trained on the full MAIN dataset and validated on the annotated eMERGE data. The results were compared with those generated by the baseline methods for model comparison.
Evaluation measures
We adopted the area under the ROC curve (AUC) as the primary measure and reported the positive predictive value (PPV), sensitivity (SEN), negative predictive value (NPV), and specificity (SPEC) when SEN reached 90% (a level required for production use)33-35. To identify limitations, we also performed an error analysis of the best-performing model on the MAIN test set, where patient charts with false positive or false negative predictions were visualized and inspected manually to identify potential causes of system errors.
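For concreteness, a minimal sketch of how these operating-point metrics can be computed (our helper, not the study's evaluation code): pick the threshold at which SEN first reaches 90% on the ROC curve, then report PPV/SEN/NPV/SPEC there alongside the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def metrics_at_sen(y_true, y_prob, target_sen=0.90):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    i = int(np.argmax(tpr >= target_sen))        # first operating point reaching target SEN
    pred = (y_prob >= thr[i]).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    tn = int(((pred == 0) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    return {"PPV": tp / (tp + fp), "SEN": tp / (tp + fn),
            "NPV": tn / (tn + fn), "SPEC": tn / (tn + fp),
            "AUC": roc_auc_score(y_true, y_prob)}
```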
Results
Descriptive statistics of the datasets
The summary of the annotation outcomes on the MAIN dataset (15,000 patient charts with 260,912 weight points) is presented in Table 1. We received a total of 775,670 annotation records. One hundred and forty-six records from 142 patient charts (0.95%; categories 3-4 in Table 1) contained display errors or incomplete annotations and were excluded from analysis. The IRA between the experts and the scoring-based outcome was substantial36 (Fleiss' Kappa = 0.644, p-value <0.05). During data quality inspection, we further excluded 14 charts (0.09%) with clearly abnormal weight values (i.e., LMS-based z-score ≥100), resulting in a set of 257,989 weight points from 14,844 patient charts. The abnormality rate of the dataset was 0.60% (1,549 abnormal weight values in 1,412 patient charts). After stratified sampling, the training set contained 181,008 weight points from 10,391 charts, and the test set had 76,981 weight points from 4,453 charts. No missing values were present in the training or test set. In the eMERGE dataset, 215,736 weight points from 6,049 patients met the inclusion criteria. We randomly selected 20% of the charts for annotation, resulting in a validation set of 44,747 weight points (1,209 charts) with an abnormality rate of 0.61%.
Table 1.
Summary of the annotation outcomes on the MAIN dataset.
| ID | Description | Number of patient charts | Number of weight points |
|---|---|---|---|
| 1 | Abnormal weight points | 1,412 | 1,549 |
| 2 | Normal weight entries | 14,844 | 256,440 |
| 3 | Charts containing display errors in weight points* | 3 | 40 |
| 4 | Charts containing incomplete annotations* | 139 | 2,364 |
| 5 | Charts containing clearly abnormal weight values (i.e., LMS-based z-score ≥100)* | 14 | 519 |
*Charts were excluded from analysis.
Predictive performance on the MAIN dataset
Figure 3 summarizes the performance of the machine learning models and baselines; detailed results are presented in Table 2. All machine learning and deep learning models achieved significantly better AUCs than the baseline approaches (p-value <0.05). Among the machine learning models, Bi-LSTM achieved the best performance, with an AUC of 0.996/0.989 on the training/test data. It also achieved the best PPV and SPEC when SEN was adjusted to ≥90%. The improvements of Bi-LSTM over the other models were statistically significant at the 0.05 level. RF achieved the second-best performance, followed by aNN, SVM-R, and LSTM.
Figure 3.
Model performance for detecting abnormal weight values. AUC: area under the ROC curve.
Table 2.
Model performance for predicting abnormal weight values when SEN reached 90%. CV columns report ten-fold cross-validation performance; test columns report held-out test set performance.
| Classifier | PPV (CV) | SEN (CV) | NPV (CV) | SPEC (CV) | AUC (CV) | PPV (test) | SEN (test) | NPV (test) | SPEC (test) | AUC (test) |
|---|---|---|---|---|---|---|---|---|---|---|
| LR | 1.82% | 90.32% | 99.87% | 58.87% | 0.885 | 1.90% | 90.11% | 99.91% | 71.15% | 0.909 |
| SVM-L | 1.48% | 90.50% | 99.89% | 55.15% | 0.904 | 2.23% | 90.11% | 99.92% | 75.51% | 0.924 |
| SVM-P | 2.36% | 90.50% | 99.92% | 73.56% | 0.922 | 2.35% | 90.11% | 99.92% | 76.71% | 0.918 |
| SVM-R | 2.76% | 90.50% | 99.93% | 79.38% | 0.944 | 3.14% | 90.11% | 99.93% | 82.75% | 0.942 |
| DT | 1.86% | 95.29% | 91.49% | 48.09% | 0.907 | 0.61% | 99.16% | 60.00% | 0.01% | 0.896 |
| RF | 4.46% | 90.32% | 99.93% | 86.73% | 0.961 | 4.66% | 90.11% | 99.93% | 88.56% | 0.961 |
| aNN | 4.54% | 90.50% | 99.93% | 87.21% | 0.960 | 3.95% | 90.11% | 99.93% | 86.40% | 0.957 |
| LSTM | 3.07% | 90.06% | 99.93% | 82.87% | 0.941 | 3.47% | 90.11% | 99.93% | 84.41% | 0.938 |
| Bi-LSTM | 41.82% | 90.05% | 99.94% | 99.24% | 0.996 | 35.91% | 90.11% | 99.94% | 99.00% | 0.989 |
| Baseline | PPV (CV) | SEN (CV) | NPV (CV) | SPEC (CV) | AUC (CV) | PPV (test) | SEN (test) | NPV (test) | SPEC (test) | AUC (test) |
| CDC | 0.65% | 90.48% | 99.43% | 11.48% | 0.553 | 0.52% | 90.08% | 99.54% | 11.20% | 0.546 |
| REG | 0.66% | 90.48% | 99.46% | 12.22% | 0.580 | 0.52% | 90.08% | 99.57% | 12.07% | 0.578 |
| CHOP | 0.62% | 90.32% | 99.55% | 13.81% | 0.618 | 0.64% | 90.11% | 99.53% | 13.30% | 0.620 |
Predictive performance on the eMERGE dataset
To assess its generalizability, the Bi-LSTM model optimized on the full MAIN dataset was compared with the baseline approaches on the eMERGE dataset (Table 3). We present results at three probability thresholds: 1) the natural threshold (0.5), 2) the threshold at which Bi-LSTM reached 90% SEN on the MAIN dataset (0.026), and 3) the threshold at which Bi-LSTM reached 90% SEN on the eMERGE dataset (0.0014). Again, Bi-LSTM achieved significantly better performance than the baselines (p-value <0.05). Its AUC (0.989) was close to that achieved on the MAIN dataset (0.996/0.989 on training/test). When using the probability threshold estimated from the MAIN dataset, the model had a SEN of 63.60% and a PPV of 68.65%. To yield a SEN of 90% on the eMERGE dataset, the system required a substantially lower threshold.
Table 3.
Performance of the optimized Bi-LSTM model and three baselines on the eMERGE dataset.
| Classifier | PPV | SEN | NPV | SPEC | AUC |
|---|---|---|---|---|---|
| Bi-LSTM (0.50) | 92.22% | 30.51% | 99.58% | 99.98% | 0.989 |
| Bi-LSTM (0.026*) | 68.65% | 63.60% | 99.78% | 99.82% | 0.989 |
| Bi-LSTM (0.0014**) | 16.97% | 90.07% | 99.94% | 97.30% | 0.989 |
| Baseline | PPV | SEN | NPV | SPEC | AUC |
| CDC | 0.61% | 90.07% | 99.40% | 10.44% | 0.510 |
| REG | 0.62% | 90.07% | 99.49% | 12.33% | 0.586 |
| CHOP | 0.63% | 90.07% | 99.52% | 13.12% | 0.605 |
*The probability threshold when Bi-LSTM reached 90% SEN on the MAIN dataset. **The probability threshold when Bi-LSTM reached 90% SEN on the eMERGE dataset.
Error analysis
By adjusting SEN to 90.11%, the optimized Bi-LSTM model achieved a PPV of 35.91% on the MAIN test set, resulting in 764 false positive predictions (in 617 charts) and 47 false negative predictions (in 44 charts). We performed an error analysis on these patient charts to identify potential causes of error. Table 4 summarizes the error categories, and example charts are visualized in Figure 4.
Table 4.
Categorization and distribution of false positives (a) and false negatives (b) made by Bi-LSTM on the MAIN test set.
| ID | Category description | False positives | Percentage |
|---|---|---|---|
| 1 | Weight fluctuation between multiple measurements within 48 hours. | 196 | 25.65% |
| 2 | A patient had a large weight increase/decrease because of 1) rapid weight loss or weight gain, 2) frequent weight fluctuation, 3) weight gain above the 97th percentile or weight loss below the 3rd percentile, or 4) the previous or next weight point being labeled 'abnormal' by annotators. | 183 | 23.95% |
| 3 | The current point was the first/last point in a chart and lacked past or future information. | 176 | 23.04% |
| 4 | Experts and the model had different tolerance to weight changes at data points with low weight values. | 167 | 21.86% |
| 5 | Errors with unidentified reasons | 42 | 5.50% |
| (a) | |||
| ID | Category description | False negatives | Percentage |
|---|---|---|---|
| 1 | There was a large weight change between two adjacent measurements with a long time interval. | 23 | 48.94% |
| 2 | The current point was the first/last point in a chart and lacked past or future information. | 9 | 19.15% |
| 3 | Experts and the model had different tolerance to weight changes at data points with low weight values. | 7 | 14.89% |
| 4 | Weight fluctuation between multiple measurements within 48 hours. | 4 | 8.51% |
| 5 | Two adjacent data points had similar age and weight values and were both annotated as ‘abnormal’ by annotators, while the classifier only predicted one as ‘abnormal’. | 4 | 8.51% |
| (b) | |||
Figure 4.
Example errors made by the optimized Bi-LSTM model on the MAIN test set. Arrows indicate the false positive/false negative cases in the error analysis.
Discussion
In this study, we developed an AWAD to analyze patient weight charts and detect abnormal values in a pediatric population and setting. Compared with the error detection methods proposed in the literature, the machine learning-based approaches demonstrated significantly better detection capacity (Table 2). This finding illustrates the advantage of machine learning technologies over knowledge-driven rules: the ability to learn latent patterns from data. In particular, the deep learning-based Bi-LSTM achieved excellent performance, and its improvements over the other classifiers were statistically significant (p-value <0.05). The PPV (35.91%) and SEN (90.11%) achieved by Bi-LSTM on the test set suggest that, using the AWAD, a researcher could capture over 90% of abnormal weight values by reviewing only 1.56% of the data. Nevertheless, the unidirectional LSTM did not show improved performance over traditional classifiers such as RF, aNN, and SVM-R, suggesting that the advantage of Bi-LSTM came from aggregating information from future weight points. The similar performance of Bi-LSTM on the eMERGE dataset confirmed its improvements over the literature methods and supported its generalizability (Table 3). The results also suggested that different probability thresholds would be needed to balance sensitivity and specificity on different datasets.
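To make the screening-effort arithmetic explicit (a back-of-the-envelope check, assuming a test-set abnormality prevalence $p$ of roughly 0.62%, which is inferred from the reported PPV and SEN rather than stated directly): the fraction of points a reviewer must inspect is

$$\frac{TP+FP}{N} \;=\; \frac{TP}{\mathrm{PPV}\cdot N} \;=\; \frac{\mathrm{SEN}\cdot p}{\mathrm{PPV}} \;\approx\; \frac{0.9011 \times 0.0062}{0.3591} \;\approx\; 1.56\%.$$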
The developed algorithms and findings could have a significant impact on both research and clinical care. For instance, the Bi-LSTM approach can be used to facilitate data cleaning and improve data quality for secondary analysis (e.g., EHR-based phenotyping as conducted by the eMERGE network3). Its PPV and SEN (35.91%/90.11%) suggest the potential for more than 98% screening effort reduction in weight abnormality detection. In practice, the abnormality predictions could be filtered with an empirical probability threshold to balance sensitivity and specificity. The developed system can also be implemented in prospective settings to improve healthcare delivery. For example, weight-based medication orders are created based on the most recent weight point in the EHRs. By predicting and visualizing abnormalities on a weight chart, additional review and timely correction could mitigate the effects of entry errors on patient safety. Even though Bi-LSTM is not directly applicable in this setting due to the unavailability of future weight information, the high performance achieved by the second-best RF (0.961/0.961 AUCs on the MAIN training/test sets) still supports the effectiveness of our application.
Error analysis, limitations and future work
The error analysis of the Bi-LSTM uncovered several areas for improvement. Over 25% of false positives (category 1 in Table 4a) were due to weight fluctuation among measurements taken within a very short time period (48 hours). Similarly, a large portion of errors (category 4) were caused by differing tolerance to change between the annotations and the model predictions when weight values were small, particularly at younger ages. We hypothesized that the annotators had different tolerances of weight variation over short time periods and at low-weight points, such that the model captured a "compromise" tolerance that optimized predictive performance. This issue could potentially be addressed when we perform the study in prospective settings, where the weight abnormalities collected are true weight errors. In addition, the model often labeled large weight changes as abnormal (category 2), which, however, could be genuine weight gain or loss caused by factors such as health conditions, medication effects, and changes in living environment. Enriching the feature set with more comprehensive patient information (e.g., medical history, medication use) may contribute to better performance and should be explored in future studies. Finally, 23% of errors (category 3) were due to lack of information because the weight point was the first or last one in a chart. This finding reveals a major limitation of data-driven technologies, and incorporating knowledge-based rules might help mitigate the issue. For false negatives, approximately 50% of errors (category 1 in Table 4b) involved a large weight change between two adjacent measurements separated by a long time interval. These errors might be caused by insufficient documentation between measurement ages and will be further investigated in the future. The other causes (categories 2-4) were similar to those of the false positives. A few errors were caused by successive abnormal weight points (category 5), which could potentially be addressed by data post-processing (e.g., if an abnormal point is detected, any adjacent points with close weight and age are classified as abnormal; see the sketch below).
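As one possibility, a minimal sketch of that post-processing rule (the function name and tolerance thresholds are illustrative assumptions, not tuned values):

```python
# Propagate 'abnormal' flags to adjacent points with very close age and weight,
# so near-duplicate measurements receive consistent labels.
def propagate_abnormal(ages, weights, flags, age_tol=0.02, weight_tol=0.1):
    flags = list(flags)
    for i in range(1, len(flags)):
        close = (abs(ages[i] - ages[i - 1]) <= age_tol and
                 abs(weights[i] - weights[i - 1]) <= weight_tol)
        if close and (flags[i] or flags[i - 1]):
            flags[i] = flags[i - 1] = True
    return flags
```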
While this study makes an important contribution to advancing methods for weight abnormality detection, there are limitations to consider. First, the analysis excluded weight measures before age two, which could exhibit more complex weight-changing patterns. Additional development and evaluation are therefore required for this age group. Likewise, the reported system performance is based on data from a single pediatric institution. Building upon this work, a web-service-based AWAD has been developed to analyze weight charts, identify abnormal weight values, and visualize weight trends and errors. Once the web-service-based AWAD has been deployed, we anticipate assessing its generalizability with more diverse patient populations and institutions. It is worth noting that machine learning techniques support retraining the model when new data become available. As such, if generalizability is not satisfactory, appropriate active learning approaches could be implemented to re-tune the system automatically as new data become available37. Finally, this study was restricted to retrospective data. Once reliability and generalizability are established, the AWAD can be transferred to a production environment to adequately assess its usability and utility with prospective data in future studies.
Conclusions
By utilizing machine learning technologies, we developed an automated approach, the AWAD, to analyze pediatric weight charts and identify abnormal weight values. In two reference-standard-based evaluations of real-world clinical data, the machine learning models showed good capacity for detecting weight abnormalities and significantly outperformed the methods proposed in the literature. The best-performing model (Bi-LSTM) achieved AUCs ≥0.989 across the two datasets, with good PPVs when sensitivities were adjusted to ≥90%. Given its high performance at this stage of development, we hypothesize that the AWAD, when fully deployed, holds great potential to facilitate clinical research and healthcare delivery that rely on accurate and reliable weight measures.
Acknowledgements
We thank Mr. Milan Parikh at the University of Cincinnati for his efforts in reviewing and documenting the scripts of the baseline approaches. We also thank the eMERGE network for providing the weight dataset. Particular thanks go to PJ Van Camp, who developed the visual annotation tool and managed the annotation process.
References
- 1. Centers for Disease Control and Prevention. CDC Growth Charts. United States; 2009 [cited February 25, 2021]. Available from: https://www.cdc.gov/growthcharts/background.htm
- 2. Rinke ML, Shore AD, Morlock L, Hicks RW, Miller MR. Characteristics of pediatric chemotherapy medication errors in a national error reporting database. Cancer. 2007;110(1):186-95. doi: 10.1002/cncr.22742.
- 3. Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. Journal of the American Medical Informatics Association. 2013;20(e1):e147-e154. doi: 10.1136/amiajnl-2012-000896.
- 4. Hagedorn PA, Kirkendall ES, Kouril M, et al. Assessing frequency and risk of weight entry errors in pediatrics. JAMA Pediatrics. 2017;171(4):392-393. doi: 10.1001/jamapediatrics.2016.3865.
- 5. Pham JC, Story JL, Hicks RW, et al. National study on the frequency, types, causes, and consequences of voluntarily reported emergency department medication errors. J Emerg Med. 2011;40(5):485-92. doi: 10.1016/j.jemermed.2008.02.059.
- 6. Wu DT, Meganathan K, Newcomb M, et al. A comparison of existing methods to detect weight data errors in a pediatric academic medical center. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association; 2018. p. 1103.
- 7. Centers for Disease Control and Prevention. A SAS Program for the 2000 CDC Growth Charts (ages 0 to <20 years). 2019 [cited February 25, 2021]. Available from: https://www.cdc.gov/nccdphp/dnpao/growthcharts/resources/sas.htm
- 8. Daymont C, Ross ME, Russell Localio A, Fiks AG, Wasserman RC, Grundmeier RW. Automated identification of implausible values in growth data from pediatric electronic health records. Journal of the American Medical Informatics Association. 2017;24(6):1080-1087. doi: 10.1093/jamia/ocx037.
- 9. Spooner SA, Shields S, Dexheimer JW, Mahdi CM, Hagedorn P, Minich T. Weight entry error detection: a web service for real-time statistical analysis. Am Acad Pediatrics; 2018.
- 10. Zhai H, Brady P, Li Q, et al. Developing and evaluating a machine learning based algorithm to predict the need of pediatric intensive care unit transfer for newly hospitalized children. Resuscitation. 2014;85(8):1065-1071. doi: 10.1016/j.resuscitation.2014.04.009.
- 11. Liu L, Ni Y, Zhang N, Pratap J. Mining patient-specific and contextual data with machine learning technologies to predict cancellation of children's surgery. International Journal of Medical Informatics. 2019;129:234-241. doi: 10.1016/j.ijmedinf.2019.06.007.
- 12. Ni Y, Alwell K, Moomaw CJ, et al. Towards phenotyping stroke: leveraging data from a large-scale epidemiological study to detect stroke diagnosis. PLoS One. 2018;13(2):e0192586. doi: 10.1371/journal.pone.0192586.
- 13. Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform. 2017;75S:S34-S42. doi: 10.1016/j.jbi.2017.05.023.
- 14. Liu Z, Yang M, Wang X, et al. Entity recognition from clinical texts via recurrent neural network. BMC Med Inform Decis Mak. 2017;17(Suppl 2):67. doi: 10.1186/s12911-017-0468-7.
- 15. Luo Y. Recurrent neural networks for classifying relations in clinical notes. J Biomed Inform. 2017;72:85-95. doi: 10.1016/j.jbi.2017.07.006.
- 16. Van Camp PJ, Mahdi CM, Liu L, Ni Y, Spooner SA, Wu DTY. Development and preliminary evaluation of a visual annotation tool to rapidly collect expert-annotated weight errors in pediatric growth charts. Stud Health Technol Inform. 2019;264:853-857. doi: 10.3233/SHTI190344.
- 17. Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy. 2013;9(3):330-338. doi: 10.1016/j.sapharm.2012.04.004.
- 18. Flegal KM, Cole TJ. Construction of LMS parameters for the Centers for Disease Control and Prevention 2000 growth charts. Natl Health Stat Report. 2013;(63):1-3.
- 19. Jain A, Nandakumar K, Ross A. Score normalization in multimodal biometric systems. Pattern Recognition. 2005;38(12):2270-2285.
- 20. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.
- 21. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273-297.
- 22. Shawe-Taylor J, Cristianini N. Kernel methods for pattern analysis. Cambridge University Press; 2004.
- 23. Quinlan JR. Induction of decision trees. Machine Learning. 1986;1(1):81-106.
- 24. Mitchell TM. Artificial neural networks. In: Machine Learning. McGraw-Hill; 1997. p. 81-127.
- 25. Breiman L. Random forests. Machine Learning. 2001;45(1):5-32.
- 26. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735-1780. doi: 10.1162/neco.1997.9.8.1735.
- 27. Siami-Namini S, Tavakoli N, Namin AS. A comparison of ARIMA and LSTM in forecasting time series. 2018. p. 1394-1401.
- 28. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing. 1997;45(11):2673-2681.
- 29. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. 1997. p. 179-186.
- 30. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2019. https://www.R-project.org/
- 31. van Rossum G, Drake FL. The Python language reference manual. Network Theory Ltd; 2011.
- 32. McCarty CA, Chisholm RL, Chute CG, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics. 2011;4(1):13. doi: 10.1186/1755-8794-4-13.
- 33. Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ. 1994;308(6943):1552. doi: 10.1136/bmj.308.6943.1552.
- 34. Altman DG, Bland JM. Statistics notes: diagnostic tests 2: predictive values. BMJ. 1994;309(6947):102. doi: 10.1136/bmj.309.6947.102.
- 35. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997;30(7):1145-1159.
- 36. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159.
- 37. Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res. 2002;2:45-66.