Abstract
With the pervasiveness of electronic health records in many hospital systems, the application of machine learning techniques to health informatics has become far more feasible as large amounts of data become accessible. In our experiment, we evaluated several convolutional neural network architectures typically used in text classification tasks, testing each on 1,113 history of present illness (HPI) notes. This data was run through both sequential and multi-channel architectures, as well as a structure that implemented attention methods designed to focus the model on learning the influential data points within the text. We found that the multi-channel model performed best with an accuracy of 92%, while the attention and sequential models performed worse with accuracies of 90% and 89%, respectively.
Keywords: Machine Learning, Deep Learning, Bioinformatics, Natural Language Processing, Medical Informatics Computing
I. Introduction
The utilization of electronic health records (EHRs) has become a growing standard throughout the health informatics research community as the availability and size of datasets have increased over the past decade [1]. This increase gives machine learning models the data they need to train and make predictions. EHRs contain a variety of patient health information that often goes either underused or completely omitted in machine learning models. While rich with contextual data about the patient, these records are often overlooked in favor of data that can be integrated into more traditional algorithms to predict patient outcomes in a meaningful way. We aim to determine how best to utilize this data to classify patients using distinct convolutional neural network architectures.
Similar data has been used to make predictions with convolutional neural networks (CNNs) in many other contexts, such as named entity recognition and sentiment analysis [2]. Text classification, often built on supervised word embeddings and CNNs, can predict and classify texts based on their construction and has been shown to outperform many more traditional forms of machine learning [3]. This principle has been applied to medical tasks in specific cases, each using a similar dataset but for different purposes [4].
A related study applied attention-based CNNs to radiology reports in comparison with traditional bag-of-words and support vector models [6]. That study showed that CNNs consistently outperformed the older models and demonstrated the value of attention methods, as both CNN variants performed better than the highest-performing baseline model, the support vector machine. We investigated whether classification performance differs significantly between attention models and their counterparts.
Although the overarching paradigm of CNNs remains constant, the ways in which they can be built vary widely and have differing effects on model accuracy. In this paper, we discuss the effects of a multi-channel CNN architecture, a sequential architecture, and an attention-based model on the reliability of our models' predictions. This experiment was developed for the automated detection of altered mental status (AMS) in emergency department physician notes to improve decision support.
II. Objectives
The objective of this experiment is to compare several convolutional neural network model architectures and determine which is best suited for patient classification tasks based on unstructured clinical notes. By comparing these three popular CNN model architectures, we hope to provide useful guidance for developing models that classify patients based on clinical text notes.
III. Methods
We extracted Emergency Department (ED) physician notes from the Blinded Research Data Warehouse, which holds data extracted from the EHR system. The notes were collected over a period of six years, all from the same EHR system. The dataset included records from adult patients whose visits were tagged with International Classification of Diseases (ICD)-10 codes indicating AMS (e.g., codes under the R41 ICD-10 hierarchy, which covers symptoms and signs involving cognitive functions and awareness) and an equal number of records from patients without AMS ICD codes as controls or negative cases. The notes were parsed into the components of a clinical record, including the history of present illness (HPI). The parsed notes were imported into REDCap [7] and made available to clinical experts on our research team (including ED physicians) for review and labeling as either AMS or not AMS, in preparation for the supervised learning experiments. The clinical members of the team labeled 1,113 HPI notes; of those, 487 (43%) were labeled as AMS and 626 (57%) as non-AMS.
A. Text Processing:
We used Python 3.6 [8], NumPy [9], and Pandas [10] for pre-processing in the machine learning pipeline, and Keras/TensorFlow [11] for building, training, and testing the three deep learning models used in this experiment. We ran the text through three CNN models. Text processing for these models included lowercasing, sentence splitting, punctuation removal, word tokenization, and sequence padding.
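As a concrete illustration, the following is a minimal sketch of these pre-processing steps using Keras utilities. The function name is ours, sentence splitting is omitted for brevity, and the 630-token padded length is taken from the input size described in Section III-E.

```python
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 630  # padded sequence length, matching the 1x630 model input described below

def preprocess(notes):
    """Lower-case, strip punctuation, tokenize, and pad a list of HPI notes."""
    cleaned = [n.lower().translate(str.maketrans('', '', string.punctuation))
               for n in notes]
    tokenizer = Tokenizer()                # word tokenization via Keras
    tokenizer.fit_on_texts(cleaned)
    sequences = tokenizer.texts_to_sequences(cleaned)
    # Pad every sequence to a fixed length so the CNN input shape is constant.
    padded = pad_sequences(sequences, maxlen=MAX_LEN, padding='post')
    return padded, tokenizer.word_index    # word_index feeds the embedding step
```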
B. Tuning:
For each of the models we built, we performed extensive hyperparameter tuning to determine the settings that yielded the best overall performance. For the sequential and attention models, we were also able to perform Bayesian optimization using the Hyperas Python package [12]. Parameters such as filter sizes, drop rate, and learning rate were varied between their minimums and maximums over 100 runs, and the models were then configured with the parameters found to be most effective. A schematic sketch of such a search appears below.
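The sketch below shows the general shape of a Hyperas search; the stand-in data loader, the model body, and the specific search ranges are illustrative assumptions, not the exact configuration we used.

```python
import numpy as np
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform

def data():
    # Stand-in arrays; in the experiment these were the padded HPI sequences.
    x = np.random.randint(0, 1000, (100, 630))
    y = np.eye(2)[np.random.randint(0, 2, 100)]
    return x, y, x, y

def create_model(x_train, y_train, x_test, y_test):
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Embedding, Conv1D, GlobalMaxPooling1D,
                                         Dropout, Dense)
    from tensorflow.keras.optimizers import Adam
    model = Sequential([
        Embedding(1000, 200, input_length=630),
        Conv1D({{choice([64, 128, 200])}}, {{choice([3, 4, 5])}},
               activation='relu'),                # filter count and size searched
        GlobalMaxPooling1D(),
        Dropout({{uniform(0, 0.5)}}),             # drop rate searched over its range
        Dense(2, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr={{choice([1e-2, 1e-3, 1e-4])}}),
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=3, batch_size=32, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}

best_run, best_model = optim.minimize(model=create_model, data=data,
                                      algo=tpe.suggest, max_evals=100,
                                      trials=Trials())
```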
C. Word Embeddings:
Each of the models utilizes Google's Word2Vec skip-gram algorithm [13]. Because of our small data size, we did not have enough tokens to train our own Word2Vec model without a significant impact on the reliability of the results and massive overfitting. We therefore used a pretrained Word2Vec model to improve each model's overall performance in classifying patients. This 200-dimension Word2Vec model was obtained from NLPLab and was trained on texts from PubMed and PMC [14]. From it, we weighted and mapped our entire vocabulary of 1,130 words, exposing our models to weights learned from 5.5 billion tokens drawn from the PubMed and PMC databases.
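A minimal sketch of this mapping, assuming the gensim library for loading the pretrained vectors (the vector file name here is an assumption):

```python
import numpy as np
from gensim.models import KeyedVectors

EMBED_DIM = 200  # dimensionality of the pretrained NLPLab model

# Load the pretrained PubMed/PMC vectors (file name assumed).
w2v = KeyedVectors.load_word2vec_format('PubMed-and-PMC-w2v.bin', binary=True)

def build_embedding_matrix(word_index):
    """Map each word in our vocabulary to its pretrained 200-d vector."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))  # row 0 reserved for padding
    for word, i in word_index.items():
        if word in w2v:                 # out-of-vocabulary words keep a zero vector
            matrix[i] = w2v[word]
    return matrix
```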
D. Reproducibility:
We ran our tests using an NVIDIA RTX 2080 Ti on a Windows 10 server hosted and operated by the Medical University of South Carolina. Results cannot be guaranteed to be completely reproducible on other computer architectures; however, these models were tested on other operating systems using CPU-only training and produced similar results.
E. Convolutional Neural Network Structures:
A convolutional neural network is a deep learning model originally developed for image processing by replicating the biological functions of vision [15], [16]. The convolutional layer feeds its output forward to be processed [17] by subsequent layers such as pooling and dropout. We used this as the basis for each of the models outlined in the subsections below. For each model, we use a pretrained word embedding layer from the Natural Language Processing Lab (NLPLab), built from medical texts from PubMed, with 200 dimensions per word (D200). This gave our models pretrained weights for the words in our dataset, as mentioned above.
1). Multi-Channel CNN Model:
We based the architecture of this model on the one described by Kekic [18]. A multi-channel model, at its best, selects for different feature sets by splitting the filter sizes into three separate pipelines and concatenating the resulting features from each pipeline. Adding additional pipelines detracted greatly from the performance of the model, as it was slower to train and less accurate. Other models using a similar architecture have found benefits in a multi-channel design. Our architecture differs slightly from the one proposed in Opalka's study [19]: we have a significantly reduced channel count of just three separate channels, and our model does not separate the data into segments but runs it through separate channels and selects the best features in the concatenation step.

We started with an input tensor of size 1x630, representing the unique texts drawn from the word embedding. We then apply the embedding layer mentioned above, which spreads the word weights across the entire input vector. This gives the model some context, allowing it to pick up on important trends earlier. Because the current tensor is formatted as a single vector, we reshape it into a 2D form that is more easily processed by the algorithms, applying a reshape tensor to give it shape 630x200x1.

The next steps are the convolution layers. We generate three convolution layers, each with a separate filter size, that take in the output from the reshape layer. We experimented with changing the filter sizes from (1,3,5) all the way up to (5,7,9); there were no improvements from moving past filter sizes of (3,4,5), as most of the relevant features were probably within that range. Each filter selects for specific features it deems important to the particular data it is analyzing. In an image processing application, this might differentiate between black and white pixels; in our case, it finds the most important word or words in a text and reduces the variables that give the text a particular meaning.

These values are output and fed into max-pooling layers, which take the multi-dimensional output of the convolutional layers and reduce it to a 1x1 value. This amplifies the features selected in the convolutional steps, filtering out the noise of competing but less effective features. The pooled values are then concatenated into a 3x1 tensor that is flattened and reshaped. We then apply a dropout rate of 0.25 to decrease overfitting [20], [21]; a dropout rate of 0.5 was used originally, but hyperparameter tuning showed that 0.25 yielded more accurate results. Finally, we add a single node with a softmax output to give us our result. A sketch of this architecture appears below.
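The following is a minimal sketch of this architecture in the Keras functional API. The 630-length input, 200-dimension embedding, (3,4,5) filter sizes, and 0.25 dropout come from the description above; the filter count per branch (100) and the two-node softmax (the working equivalent of a single softmax node for a binary label) are our assumptions.

```python
import numpy as np
from tensorflow.keras.layers import (Input, Embedding, Reshape, Conv2D,
                                     MaxPooling2D, Concatenate, Flatten,
                                     Dropout, Dense)
from tensorflow.keras.models import Model

SEQ_LEN, EMBED_DIM, VOCAB = 630, 200, 1131       # 1,130 words plus a padding row
embedding_matrix = np.zeros((VOCAB, EMBED_DIM))  # placeholder; see the embedding sketch above

inp = Input(shape=(SEQ_LEN,))
emb = Embedding(VOCAB, EMBED_DIM, weights=[embedding_matrix])(inp)
resh = Reshape((SEQ_LEN, EMBED_DIM, 1))(emb)     # 630x200x1, as described

branches = []
for fs in (3, 4, 5):                             # one channel per filter size
    conv = Conv2D(100, (fs, EMBED_DIM), activation='relu')(resh)
    # Pool each feature map down to a single 1x1 value.
    pool = MaxPooling2D(pool_size=(SEQ_LEN - fs + 1, 1))(conv)
    branches.append(pool)

merged = Concatenate(axis=1)(branches)           # 3x1 stack of pooled features
x = Flatten()(merged)
x = Dropout(0.25)(x)                             # tuned dropout rate
out = Dense(2, activation='softmax')(x)

model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```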
2). Sequential CNN Model:
A sequential model is built through the concatenation of layers one after another. This is the basis of many early models, including Kim's early text classification models discussed in Section I [3]. All data is funneled into one pipeline, where convolutional layers perform feature selection and pare down the data as necessary [22]. Comparatively, this is the simplest model constructed in this experiment. Similarly to the multi-channel model, we start with the embedding layer mentioned above. A 1D convolution layer was added after the data was reshaped into the proper format of 200x1. This model has a much larger filter size than the multi-channel model, with a window of 200. We decided on a single convolutional layer to contrast sharply with the multi-channel model, which concatenates multiple layers across three distinct pipelines. We then add a single max pooling layer that outputs a single 1x1 matrix of features deciding whether the text indicates that a patient's history is indicative of AMS. Finally, we add a dropout layer with a rate of 0.2 to reduce over-fitting, and flatten before outputting to a softmax function. A sketch follows below.
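A minimal sketch of this model follows. The 200-wide convolution window and 0.2 dropout come from the description above, while the filter count is an assumption.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Dropout, Flatten, Dense)

SEQ_LEN, EMBED_DIM, VOCAB = 630, 200, 1131
embedding_matrix = np.zeros((VOCAB, EMBED_DIM))  # placeholder; see the embedding sketch above

model = Sequential([
    Embedding(VOCAB, EMBED_DIM, input_length=SEQ_LEN,
              weights=[embedding_matrix]),
    Conv1D(filters=64, kernel_size=200, activation='relu'),  # large 200-wide window
    MaxPooling1D(pool_size=SEQ_LEN - 200 + 1),   # pool the remaining steps down to one
    Dropout(0.2),                                # reduce over-fitting
    Flatten(),
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```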
3). Attention-Based CNN Model:
Attention is a model's ability to focus on the most influential sections of the data. To implement it, we used kMaxPooling and Folding methods as described by Hughes [23]. This replicates the pooling process a total of k times to amplify the best features available from the set produced by the convolutional layers. We used a sequential model here because of its suitability for quick prototyping. We also had to use 1-dimensional convolution layers instead of 2-dimensional ones because of the constraints of the kMaxPooling and Folding layers. While this may influence the output of the model, it should not play a significant role in the metrics of the specific model outlined here. We start with the base embedding layer and add padding; padding the embedding layer improves performance, as the model no longer needs to infer the length of each input. Without it, we could not perform deep enough convolutions for the kMaxPooling to take effect properly, as the volume of data would be too sparse [24]. We then add a 1-dimensional convolution layer followed by a kMaxPooling layer with k=5, which passes on the five strongest pooled values. We then add another convolution layer with half the window size of the first, add a folding layer, and apply another kMaxPool. We finally flatten and output through a softmax function. A sketch of the custom layers and the model follows below.
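The sketch below shows one way to implement the custom layers and the model. The k=5 pooling and the halved second window come from the description above; the window sizes (8 and 4) and filter counts are assumptions, and this simplified top-k pooling discards the temporal ordering that a full k-max pooling layer preserves.

```python
import tensorflow as tf
from tensorflow.keras.layers import Layer, Embedding, Conv1D, Flatten, Dense
from tensorflow.keras.models import Sequential

class KMaxPooling(Layer):
    """Keep the k largest activations of each feature map along the time axis."""
    def __init__(self, k=5, **kwargs):
        super().__init__(**kwargs)
        self.k = k
    def call(self, x):
        # (batch, steps, channels) -> (batch, channels, steps) so top_k runs over steps
        swapped = tf.transpose(x, [0, 2, 1])
        top_k = tf.nn.top_k(swapped, k=self.k).values
        return tf.transpose(top_k, [0, 2, 1])    # back to (batch, k, channels)

class Folding(Layer):
    """Sum adjacent pairs of feature maps, halving the channel dimension."""
    def call(self, x):
        return x[:, :, ::2] + x[:, :, 1::2]

model = Sequential([
    Embedding(1131, 200, input_length=630),
    Conv1D(64, 8, padding='same', activation='relu'),
    KMaxPooling(k=5),                  # pass on the five strongest activations
    Conv1D(64, 4, padding='same', activation='relu'),  # half the first window
    Folding(),
    KMaxPooling(k=5),
    Flatten(),
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```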
F. Training and Evaluation:
Each of the models described above was trained using a 5-fold train/test cycle. The data was divided into an 80%-20% split between training and held-out sets: 80% of the data was used for training, 10% for validation, and 10% for evaluation testing. The data was split into stratified k-folds, giving us even splits of positively and negatively labeled data. The splitting of the data and the k-fold evaluations used random seeds to reduce bias towards more optimal selections of our dataset. The accuracy, precision, recall, F1-score, and loss from each of the five folds were averaged. The area under the receiver operating characteristic curve (AUC) was calculated based on all predictions taken from the testing data in each of the five folds. Each model was trained for up to 100 epochs with a batch size of 1,000. We used an early stopping technique to help prevent over-fitting, halting training if the loss increased. Our models had a patience of 10, meaning that if the loss increased, the model had 10 epochs to correct itself before we ended training prematurely. This allowed us to test differing learning rates without adjusting the callback parameters each time: a well-performing model would continue to train without issue, while a poor one would terminate so we could continue testing other hyperparameters. A sketch of this training cycle appears below.
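The following sketch assumes scikit-learn for the stratified folds. The placeholder data, the `build_model()` stand-in for any of the three architectures, and the even split of each held-out fold into validation and evaluation halves (approximating the 80/10/10 division above) are our assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import EarlyStopping

# X: padded sequences, y: 0/1 AMS labels, y_cat: one-hot labels (placeholders here).
X = np.random.randint(0, 1131, (1113, 630))
y = np.random.randint(0, 2, 1113)
y_cat = np.eye(2)[y]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stopper = EarlyStopping(monitor='val_loss', patience=10,
                        restore_best_weights=True)

fold_scores = []
for train_idx, held_idx in skf.split(X, y):
    # Split the held-out 20% evenly into validation and evaluation halves.
    val_idx, eval_idx = held_idx[::2], held_idx[1::2]
    model = build_model()                       # any of the three architectures above
    model.fit(X[train_idx], y_cat[train_idx],
              validation_data=(X[val_idx], y_cat[val_idx]),
              epochs=100, batch_size=1000,
              callbacks=[stopper], verbose=0)
    fold_scores.append(model.evaluate(X[eval_idx], y_cat[eval_idx], verbose=0))

print(np.mean(fold_scores, axis=0))             # averaged loss and accuracy
```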
The above methods were run 10 times per experiment, and the experiment was run 5 times for each model to ensure the validity of the analysis. The results of the five overall experiments were averaged and presented as the final results of this paper. We also used the combined results of the experiments to calculate the confidence interval of each model's AUC, as sketched below.
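A sketch of that confidence-interval computation; a t-distribution interval at 95% is assumed, as the exact method is not stated above, and the example AUC values are illustrative.

```python
import numpy as np
from scipy import stats

def auc_confidence_interval(aucs, confidence=0.95):
    """Mean AUC with a t-distribution confidence interval over repeated runs."""
    aucs = np.asarray(aucs, dtype=float)
    mean = aucs.mean()
    half = stats.sem(aucs) * stats.t.ppf((1 + confidence) / 2, len(aucs) - 1)
    return mean, (mean - half, mean + half)

# Example: AUCs from five repeated experiments (illustrative values).
print(auc_confidence_interval([0.971, 0.975, 0.973, 0.976, 0.974]))
```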
G. Ethical Considerations:
This study was approved by the Institutional Review Board at the Blinded, headed by Dr. Blinded as IRB Chair, under protocol No. Pro00080055. The experiment was approved as a chart study, and therefore no human subjects were involved.
IV. Results
The multi-channel model outperformed both the sequential and attention models in the patient classification task on AUC (97.4%), accuracy (92.3%), and loss (0.294) (Table I). There was overlap in the confidence intervals, indicating that in some cases the sequential model outperformed the multi-channel model (Table I). The attention and sequential models scored similarly on AUC, but the sequential model outperformed the attention model with an accuracy of 90.7% versus the attention model's 89.4%.
TABLE I.
Accuracy Comparisons of CNN Architectures, Part 1

| Model | Accuracy | Area Under the Curve | Loss | Confidence Interval (AUC) |
|---|---|---|---|---|
| Multi-Channel | 92.3% | 97.4% | 0.294 | 96.1-98.7% |
| Sequential | 90.7% | 96.4% | 0.340 | 95.9-96.8% |
| Attention | 89.4% | 95.6% | 0.356 | 95.4-95.8% |
TABLE II.
Accuracy Comparisons of CNN Architectures, Part 2

| Model | Recall | Precision | F1-Score |
|---|---|---|---|
| Multi-Channel | 92.2% | 92.2% | 92.6% |
| Sequential | 91.3% | 91.3% | 92.0% |
| Attention | 90.0% | 91.0% | 90.0% |
V. Discussion
Between the three models, it is clear from the results that the multi-channel model performed best on the patient classification task. It achieved a 97.4% area under the curve and outperformed the accuracy of the sequential and attention models by roughly 2-3 percentage points across our testing. This aligned with our prediction, as the multi-channel model is the most computationally expensive of the three and should make better predictions with sparse data similar to our own dataset. The sequential and attention models performed well and may have their place in systems with hardware limitations or energy restrictions, such as embedded devices or mobile predictors, but our testing makes clear that they are not the best fit for the current task, as they were outperformed on every metric by the multi-channel model.
We had theorized that adding an attention component would improve the model's ability to assess patient histories and classify patients with AMS; however, this was not borne out in our testing. While adding attention should in theory allow the model to discover smaller details in the data, our particular dataset consisted of relatively short texts, which may have limited the benefit of attention.
Given the success of the multi-channel CNN model, we can say that text classification may have a useful place in patient classification tasks, as well as in other physician-assisted decision-making tools. Going forward, we see these models being tested over larger datasets to better determine their ability to generalize to other forms of illness. This sort of classification allows for greater and more accurate prediction models, as textual data that once went unutilized can now be used in a more meaningful way. We envision models that incorporate text classification as one component of a larger system of predictors spanning both text and numerical data analysis.
A. Limitations:
The data used for this experiment represents only one health system and one EHR system, so it is difficult to generalize the performance of these models to other environments. Our experiment was also specifically designed as a text classifier to identify AMS, and we used only the HPI portion of the EHR clinical text. We also had a small dataset of only 1,113 HPI notes; our results might have differed had more data been available for our study. More experiments will need to be conducted at other institutions, examining different types of notes at a much larger dataset size, before we can generalize.
VI. Conclusion
Although our experiment had limitations, our tests demonstrated that the multi-channel convolutional neural network is the best suited of the three for patient classification on EHR clinical text. Even with the differences between models, the multi-channel model's performance demonstrates the power that neural networks offer for medical text classification tasks. Our results show that a multi-channel model is the most robust of the three tested; however, sequential models could be viable depending on the medical text classification task.
Acknowledgment
This project was supported in part by the South Carolina SmartState Program, the South Carolina Clinical & Translational Research Institute, with an academic home at the Medical University of South Carolina, through the National Institutes of Health - National Center for Advancing Translational Sciences grant number UL1-TR001450, and the Delaware-CTR Accel program through the National Institute of General Medical Sciences grant number U54-GM104941. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Contributor Information
Kevin Gagnon, Computer Science, University of South Carolina, Columbia SC.
Tami L Crawford, Biomedical Informatics, Medical University of South Carolina, Charleston, SC.
Jihad Obeid, Biomedical Informatics, Medical University of South Carolina, Charleston, SC.
References
- [1] Henry J, Pylypchuk Y, Searcy T, and Patel V, "Adoption of electronic health record systems among US non-federal acute care hospitals: 2008–2015," ONC Data Brief, vol. 35, pp. 1–9, 2016.
- [2] Siencnik SK, "Adapting word2vec to named entity recognition," in Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), 2015, pp. 239–243.
- [3] Kim Y, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
- [4] Mujtaba G et al., "Clinical text classification research trends: Systematic literature review and open issues," Expert Syst. Appl., vol. 116, pp. 494–520, 2019.
- [5] Rajkomar A et al., "Scalable and accurate deep learning with electronic health records," NPJ Digit. Med., vol. 1, no. 1, p. 18, 2018.
- [6] Shin B, Chokshi FH, Lee T, and Choi JD, "Classification of radiology reports using neural attention models," in 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, May 2017, pp. 4363–4370, doi: 10.1109/IJCNN.2017.7966408.
- [7] Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, and Conde JG, "Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support," J. Biomed. Inform., vol. 42, no. 2, pp. 377–381, Apr. 2009, doi: 10.1016/j.jbi.2008.08.010.
- [8] VanRossum G and Drake FL, The Python Language Reference. Amsterdam, Netherlands: Python Software Foundation, 2010.
- [9] Oliphant TE, A Guide to NumPy, vol. 1. USA: Trelgol Publishing, 2006.
- [10] McKinney W, "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference, Austin, Texas, 2010, pp. 56–61, doi: 10.25080/Majora-92bf1922-00a.
- [11] Chollet F, "Keras: The Python deep learning library," Astrophysics Source Code Library, record ascl:1806.022, 2018.
- [12] Wistuba M, Schilling N, and Schmidt-Thieme L, "Hyperparameter search space pruning – A new component for sequential model-based hyperparameter optimization," in Machine Learning and Knowledge Discovery in Databases, vol. 9285, Appice A, Rodrigues PP, Santos Costa V, Gama J, Jorge A, and Soares C, Eds. Cham: Springer International Publishing, 2015, pp. 104–119.
- [13] Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J, "Distributed representations of words and phrases and their compositionality," arXiv preprint arXiv:1310.4546, Oct. 2013. [Online]. Available: http://arxiv.org/abs/1310.4546
- [14] Moen S and Ananiadou TSS, "Distributional semantics resources for biomedical text processing," Proc. LBM, pp. 39–44, 2013.
- [15] Hubel DH and Wiesel TN, Brain and Visual Perception: The Story of a 25-Year Collaboration. New York, NY: Oxford University Press, 2005.
- [16] Fukushima K, "Neocognitron," Scholarpedia, vol. 2, no. 1, p. 1717, 2007, doi: 10.4249/scholarpedia.1717.
- [17] Svozil D, Kvasnicka V, and Pospichal J, "Introduction to multi-layer feed-forward neural networks," Chemom. Intell. Lab. Syst., vol. 39, no. 1, pp. 43–62, Nov. 1997, doi: 10.1016/S0169-7439(97)00061-0.
- [18] Kekic M, "CNN in Keras with pretrained word2vec weights," Oct. 27, 2017. [Online]. Available: https://www.kaggle.com/marijakekic/cnn-in-keras-with-pretrained-word2vec-weights
- [19] Opalka S, Stasiak B, Szajerman D, and Wojciechowski A, "Multi-channel convolutional neural networks architecture feeding for effective EEG mental tasks classification," Sensors, vol. 18, no. 10, p. 3451, Oct. 2018, doi: 10.3390/s18103451.
- [20] Poernomo A and Kang D-K, "Biased Dropout and Crossmap Dropout: Learning towards effective dropout regularization in convolutional neural network," Neural Netw., vol. 104, pp. 60–67, Aug. 2018, doi: 10.1016/j.neunet.2018.03.016.
- [21] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, and Salakhutdinov R, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
- [22] Jaderberg M, Vedaldi A, and Zisserman A, "Speeding up convolutional neural networks with low rank expansions," in Proceedings of the British Machine Vision Conference 2014, Nottingham, 2014, pp. 88.1–88.13, doi: 10.5244/C.28.88.
- [23] Hughes M, Li I, Kotoulas S, and Suzumura T, "Medical text classification using convolutional neural networks," Stud. Health Technol. Inform., vol. 235, pp. 246–250, 2017.
- [24] Li F-F, Karpathy A, and Johnson J, Convolutional Neural Networks for Visual Recognition, Stanford CS231n course notes, 2015.
