Journal of the American Medical Informatics Association (JAMIA). 2019 Aug 28;26(12):1600–1608. doi: 10.1093/jamia/ocz146

Using convolutional neural networks to identify patient safety incident reports by type and severity

Ying Wang,1 Enrico Coiera,1 Farah Magrabi1
PMCID: PMC7647259  PMID: 31730700

Abstract

Objective

To evaluate the feasibility of a convolutional neural network (CNN) with word embedding to identify the type and severity of patient safety incident reports.

Materials and Methods

A CNN with word embedding was applied to identify 10 incident types and 4 severity levels. Model training and validation used data sets (n_type = 2860, n_severity = 1160) collected from a statewide incident reporting system. Generalizability was evaluated using an independent hospital-level reporting system. CNN architectures were examined by varying layer size and hyperparameters. Performance was evaluated by F score, precision, and recall, and compared with binary support vector machine (SVM) ensembles on 3 testing data sets (type/severity: n_benchmark = 286/116, n_original = 444/4837, n_independent = 6000/5950).

Results

A CNN with 6 layers was the most effective architecture, outperforming SVMs and showing better generalizability in identifying incidents by type and severity. The CNN achieved high F scores (> 85%) across all test data sets when identifying common incident types including falls, medications, pressure injury, and aggression. When identifying common severity levels (medium/low), the CNN outperformed SVMs, improving F scores by 11.9%–45.1% across all 3 test data sets.

Discussion

Automated identification of incident reports using machine learning is challenging because of a lack of large labelled training data sets and the unbalanced distribution of incident classes. The standard classification strategy is to build multiple binary classifiers and pool their predictions. CNNs can extract hierarchical features and assist in addressing class imbalance, which may explain their success in identifying incident report types.

Conclusion

A CNN with word embedding was effective in identifying incidents by type and severity, providing better generalizability than SVMs.

Keywords: neural networks, word embedding, multiple classification, text classification, clinical incident reports, patient safety

INTRODUCTION

Approximately 10% of admissions to acute-care hospitals are associated with adverse events.1,2 Events that could have resulted, or did result, in unnecessary harm are called patient safety incidents. Reports about patient safety incidents provide important information for understanding how and why incidents occur and are useful for designing strategies to prevent similar events from occurring.3 Learning from incidents needs to occur at multiple levels, including hospital, health system, and national levels.4 At a hospital level, clinicians or managers learn from experience and take their own actions to follow up problems. Beyond the local level, quick identification of a cluster of incidents reported from different hospitals but with similar contributing factors can help health departments learn from local experiences and disseminate guidelines or change practices at a health system or national level.

Improving the efficiency of incident analysis at a health system and national level is an urgent problem. Current analysis of incidents does not support learning at a population level because it relies on retrospective review of reports by humans, which is time-consuming and labor intensive.5 With the wide implementation of incident monitoring across health systems, the growing volume of reports poses a challenge for timely response to incidents and active learning.6,7

Categorizing incidents by incident type and severity level can help to prioritize incidents with significant consequences for immediate follow-up.3,4 However, the lack of uniform classification prevents identification of incident clusters. While this can be addressed by asking reporters to categorize incidents in a standardized manner using structures like the Agency for Healthcare Research and Quality Common Formats,8 the ratings provided by reporters are often inaccurate.9,10 That is because reporters are typically health care workers from a range of professional groups, including clinicians and hospital administrators, who may not be expert in categorizing incidents.10 There is high discordance among labelers because of a lack of uniform training.10,11 Moreover, labeling may be delayed, incomplete, or absent, thereby reducing the ability to respond in near real time.9

BACKGROUND AND SIGNIFICANCE

To reduce human effort and improve classification efficiency, machine learning methods have been applied to identify incident reports.12–18 Initially, binary classifiers based on text classification were applied to distinguish specific types of incident reports (eg, handover, patient identification,15 extreme-risk events,16 or health information technology18,19). Subsequently, studies increasingly aimed to identify the multiple types of incidents encountered in health care. To address this multiclass classification problem, some studies have applied unsupervised methods such as topic modelling to identify patient safety incidents.17,20,21 However, the mapping between topics and incident types is not straightforward, as the same topic may include multiple types of incidents. Other studies using supervised classifiers have shown the feasibility of categorizing reports into predefined incident types.12,13,22

In previous work, we decomposed this multiclass classification problem into a group of binary classification problems, achieving high F scores in identifying the common classes of incidents, such as patient falls and medication errors.12,13 However, these classifiers were not effective in identifying rare classes of incidents—for example, deteriorating patients.12,13 Moreover, the classifiers did not generalize well to identify incident reports from independent reporting systems.12

The decomposition strategy adopted in these studies was to divide an m-class problem into an ensemble of multiple paired binary problems.12,13,23 However, this approach imposes a heavy computational cost when testing a large number of binary classifiers.12,22 Promisingly, convolutional neural networks (CNNs) have achieved great success as single models in multiclass classification of images and text.24,25 CNNs have been applied to medical images to classify brain lesions26,27 or predict the stages of Alzheimer’s disease28 with an accuracy of 98.8%. They have also been adopted to analyze electronic health record (EHR) data (eg, predicting patient age29 or health trajectory30).

Recent studies have shown that deep convolutional architectures are more suitable for image classification, while shallow architectures are better suited to text mining.31 However, the success of a CNN depends on finding an optimal architecture for a given problem.32 We therefore designed and examined CNN architectures for categorizing incident reports.

Another challenge for identifying incident reports is determining how to effectively represent the complex patient safety concepts in reports. Feature engineering is required for training machine learning models, especially when training data are limited.33 Algorithms based on the bag-of-words model have been widely used to extract word existence or frequency in documents.34 However, they ignore word order and lose the semantic context of words. Word embedding has been shown to be effective in text classification, as it can capture useful semantic properties and linguistic relationships between words.35 A vector representation of words might thus be a better way to represent incident reports.

We therefore aim to develop a single classifier by combining word embedding with a CNN and to evaluate its feasibility to identify multiple types of incident reports and severity levels. Furthermore, we compare performance of this approach with support vector machine (SVM) ensembles from our previous work.12

MATERIALS AND METHODS

An overview of our approach comprising 4 main steps is shown in Figure 1.

Figure 1. An overview of the multiclass classification of incident reports by type and severity (benchmark data set: balanced testing sets from AIMS; original data set: imbalanced testing sets from AIMS; independent data set: stratified testing sets from Riskman).

  1. Data preprocessing: incident reports were collected from 2 separate reporting systems, the Advanced Incident Management System (AIMS) and Riskman.3,4,12 The reports were then labeled by patient safety experts to provide a gold standard for experiments.

  2. Feature extraction: the narratives of reports from the training data set were processed to generate a word-embedding space.

  3. Model training and validation: various CNN architectures were trained and validated under a cross-validation process. An optimal CNN architecture was selected based on its performance.

  4. Model performance testing: undertaken on 3 testing data sets from the AIMS and Riskman systems. This process was applied to identify incident reports by type and severity, separately. Model performance was evaluated and compared with SVM ensembles.12

Step 1: Data preparation

Incident reporting systems

AIMS has been used since 1998 in Australia across the public hospital system in 4 of the 8 states and territories: New South Wales, Western Australia, South Australia, and the Northern Territory.3,4 A total of 137 522 incidents were reported to AIMS in an Australian state from January to December 2011 and were categorized into 20 types by reporters. For each incident type, 300 reports were randomly selected based on reporters’ labels. In total, we collected 6000 reports.

The Riskman system is an independent tool used across the state of Victoria and in a number of private hospitals across the country. Using a random sampling approach, 6000 incident reports were selected from a hospital-level Riskman system between January 2005 and July 2012.

Upon collection, all reports were read and any identifiable or potentially patient-identifying information (eg, name, date of birth) was removed in accordance with jurisdictional privacy requirements. Ethical approval was obtained from university committees as well as a committee governing the hospital and state data sets.

Data labeling

Given the inconsistency of reporter labels,10 3 experts in the classification of incidents reviewed and validated labels based on the international classification for patient safety.4 The labels provided by the experts were used as a “gold standard” for training and testing the performance of classifiers. The reports were classified into 20 incident types, and this study focused on 10 of these, which have been recognized as priority areas for safety and quality improvement4,12,36 (Table 1). Inter-rater reliability for determining incident types was Cohen’s kappa = 0.93 (P < .001; 95% CI, 0.9301–0.9319); a sketch of how such agreement can be computed is shown below. To cover the whole data set, an “Others” set was created using a random sampling approach to ensure representativeness of the 10 other unrelated incident types (see Supplementary Material, Appendix B).12
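As a minimal illustration, inter-rater agreement of this kind can be computed with scikit-learn’s cohen_kappa_score. The two label lists below are hypothetical stand-ins for experts’ incident-type assignments, not the study data.

```python
# Minimal sketch: Cohen's kappa between two raters' labels (illustrative data).
from sklearn.metrics import cohen_kappa_score

expert_a = ["Falls", "Medication", "Falls", "Aggression", "Others"]
expert_b = ["Falls", "Medication", "Aggression", "Aggression", "Others"]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa: {kappa:.2f}")
```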

Table 1.

The composition of balanced and stratified data sets used for CNN training and testing. The same data were used to train SVM ensembles12

                          Balanced AIMS   Stratified AIMS   Stratified Riskman
                          benchmark       original          independent
                          n               n       %         n       %
Incident type
 Falls                    260             90      20        872     15
 Medication               260             68      15        1053    18
 Pressure Injury          260             37      8         190     3
 Aggression               260             49      11        487     8
 Documentation            260             26      6         252     4
 Blood Product            260             5       1         59      1
 Patient Identification   260             7       2         86      1
 Infection                260             6       1         22      <1
 Clinical Handover        260             7       2         87      1
 Deteriorating Patient    260             1       <1        14      <1
 Others                   260             148     33        2878    48
 Total                    2860            444               6000
Severity level
 SAC1                     290             25      <1        23      <1
 SAC2                     290             95      2         105     2
 SAC3                     290             2198    45        2609    44
 SAC4                     290             2519    52        3213    54
 Total                    1160            4837              5950

Abbreviations: AIMS, Advanced Incident Management System; SAC, Severity Assessment Codes; SVM, support vector machine.

The seriousness of an incident was rated using an internationally accepted rating system, the Severity Assessment Codes (SAC), developed by the US Veterans Administration.2 Based on the severity of an incident and the likelihood of recurrence, 4 risk ratings (I, extreme; II, high; III, medium; IV, low) were used.37 The gold standard was based on the assignment of SAC ratings by local patient safety managers who had received training in assessing incident seriousness and were familiar with the nature of incidents and their consequences.

Training and testing data sets

For classifier training, we used 260 reports from each incident type and 290 reports for each SAC level to create a balanced data set (Table 1). The sample sizes were based on previous studies.12 The balanced data set was further divided into training (80%), validation (10%), and testing (10%) subsets under a 10-fold subsampling cross-validation process, as sketched below. The training and validation subsets were used to examine CNN architectures and hyperparameters, and the testing subsets (benchmark) were used to generate benchmark results.
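The following sketch illustrates one way such a repeated 80/10/10 subsampling split could be implemented; it assumes the balanced incident-type set of 2860 reports and is not the authors’ exact code.

```python
# Illustrative 80/10/10 subsampling split, repeated over 10 folds.
import numpy as np

n_reports = 2860  # balanced incident-type data set (Table 1)

for fold in range(10):
    rng = np.random.default_rng(seed=fold)
    idx = rng.permutation(n_reports)
    n_train, n_val = int(0.8 * n_reports), int(0.1 * n_reports)
    train_idx = idx[:n_train]                # 80% training
    val_idx = idx[n_train:n_train + n_val]   # 10% validation
    test_idx = idx[n_train + n_val:]         # 10% benchmark testing
    # ... train on train_idx, tune hyperparameters on val_idx,
    # and report benchmark results on test_idx
```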

To evaluate the CNN’s applicability in real-world conditions, it was further tested on imbalanced, or “stratified,” data sets from AIMS (original). The stratified data sets were randomly selected from the remaining AIMS reports based on the real-world ratio of incidents by type and severity (Table 1). To examine generalizability to an independent incident reporting system, the CNN was also tested on a stratified Riskman data set (independent).

Step 2: Feature extraction

Incident reports consist of a number of structured and free-text fields used to describe the safety event and its consequences (Box 1). For our experiments, only the descriptive narratives in reports were used, including the incident description, patient outcome, actions taken, prevention steps, and investigation findings and results. All codes, punctuation, and nonalphanumeric characters were removed, and the text was converted to lower case; a minimal sketch of this preprocessing follows Box 1.

Box 1. Incident report format and element examples

Report format   Basic elements of report
Structured      incident ID; date; time; incident type(s); severity assessment code (SAC)
Free text       incident description; patient outcome; actions taken; prevention steps; investigation findings and results

Training word-embedding space

Word embedding is an unsupervised method for capturing the relationships between words. It creates a lower-dimensional vector space in which each discrete word is mapped to a vector, and words with similar semantic context are close to each other in this space.35 Word2Vec is one of the most popular embedding algorithms38 and outperforms other embedding methods when the semantic space is small.39 Given the common use of short sentences in incident reports, we chose Word2Vec. It first constructs a vocabulary from the training documents and then generates a vector representation of words based on how often a word appears close to other words in the source text. To generate the vectors, we used the skip-gram model, which has been shown to be efficient with infrequent words, given that incident reports typically deal with a variety of different topics (incident types).39 The resulting word vectors were used as input features to the CNN; a minimal training sketch is shown below.
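A minimal sketch of training such a skip-gram space with the gensim library follows; the two tokenized reports are invented examples, and the paper does not specify the authors’ exact tooling.

```python
from gensim.models import Word2Vec

tokenized_reports = [
    ["patient", "fell", "in", "bathroom", "no", "injury"],
    ["wrong", "dose", "of", "insulin", "administered"],
]

# sg=1 selects the skip-gram model; vector_size is the embedding dimensionality
model = Word2Vec(sentences=tokenized_reports, vector_size=100,
                 window=5, min_count=1, sg=1, epochs=10)
vector = model.wv["fell"]  # a 100-dimensional word vector
```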

The embedding dimensionality (the size of a vector) is a hyperparameter of the word-embedding space and is usually set empirically. To search for the optimal embedding dimensionality for incident reports, we examined the range from 50 to 200 through the CNN training and validation process.

Extracting features

With a trained word-embedding space, a report was transformed into a feature matrix in which each word was represented as a vector. However, incident reports varied in length, and the size of the feature matrices varied accordingly. To fit the CNN input layer, it was necessary to adjust the reports into sequences of word vectors of the same length. To achieve this, we first set a target length. Reports that were shorter than the target length were left-padded with zeros, and reports that exceeded the target length were truncated. Note that the descriptive narratives in reports began with the incident description, followed by other incident management information (Box 1). Truncations were thus made from the bottom of reports to retain as much of the incident description as possible (see the sketch below). The target length was treated as another hyperparameter (ranging from 20 to 100) when training CNNs.
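A sketch of this fixed-length encoding, assuming a hypothetical embed(word) lookup into the trained embedding space:

```python
import numpy as np

def to_matrix(tokens, embed, dim, target_len):
    """Encode a report as a (target_len, dim) matrix of word vectors."""
    vecs = [embed(w) for w in tokens[:target_len]]     # truncate from the bottom
    padding = [np.zeros(dim)] * (target_len - len(vecs))
    return np.stack(padding + vecs)                    # left-pad with zeros
```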

Step 3: Model training and validation

CNN architecture

The general architecture of a neural network is composed of an input layer, an output layer, and multiple hidden layers in between. CNN architectures follow the general design principle of applying multiple convolutional layers to the input layer. Given our design goals of a lightweight structure and better accuracy, we started with a typical CNN architecture for a classification problem with 6 layers: input layer, convolution layer, rectifier layer, pooling layer, fully connected layer, and classification (output) layer31,32 (Figure 2).

Figure 2. The architecture of a CNN examined to identify incident reports, including the layers from input layer to convolution, rectifier, pooling, fully connected, and classification layers. Convolution and rectifier layers can be repeated.

The convolution layer uses a set of filters to generate feature maps by sliding them through the input space. There are 2 hyperparameters in the convolution layer: the number of filters and the region size, which is the area in the input space that a filter scans. The rectifier layer usually follows the convolution layer, acting as an activation function that maintains positive values by mapping negative values to zero; the rectified linear unit (ReLU) is a widely used activation function in neural networks.32,40 The pooling layer reduces the spatial size of the network and combats overfitting. The fully connected layer serves as a high-level reasoning layer in which the neurons have connections to all activations in the previous layer. The classification layer specifies how training penalizes the deviation between predicted and true labels and then generates classification probabilities for assigning a single class out of multiple classes. The operations of convolution and pooling can be repeated. A minimal sketch of this 6-layer design is shown below.
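The following Keras sketch shows one way to assemble the 6-layer design described above. The hyperparameter values are placeholders rather than the optima reported in Table 2, and the paper does not state which framework the authors used.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(75, 100, 1)),                # report length x embedding dim
    layers.Conv2D(filters=20, kernel_size=(2, 2)),  # convolution layer
    layers.ReLU(),                                  # rectifier layer
    layers.MaxPooling2D(pool_size=(2, 2)),          # pooling layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),            # fully connected layer
    layers.Dense(11, activation="softmax"),         # classification layer (10 types + Others)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```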

In practice, smaller networks with 1 or 2 convolutional layers have been shown to be effective at learning categories when training with small data sets.41 We explored the design space of CNN architectures by varying the number of layers performing convolution and pooling. We also examined the number of convolution filters and the size of the region that a filter scans through the word-embedding input space. Apart from these CNN hyperparameters, we examined CNN performance while varying the dimensionality of the word-embedding space and the report length; a sketch of this search is shown below. With each architecture (Table 2), the CNN was trained to identify incidents by type and severity, separately, and performance was evaluated using pooled F scores on validation data sets. The optimal architecture was selected based on a trade-off between discrimination power and the risk of overfitting from a large number of hyperparameters.
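A hedged sketch of a grid-style search over the hyperparameters listed in Table 2; build_cnn and validate_f_score are hypothetical helpers standing in for the training and cross-validation loop.

```python
from itertools import product

grid = {
    "embedding_dim": [50, 100, 150, 200],             # examined range: 50-200
    "doc_length": [20, 50, 75, 100],                  # examined range: 20-100
    "filter_count": [2, 10, 20, 40],                  # examined range: 2-40
    "region_size": [(2, 2), (3, 2), (2, 3), (3, 3)],
    "use_pooling": [False, True],
    "extra_conv_layer": [False, True],
}

best_score, best_params = -1.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = validate_f_score(build_cnn(**params))  # pooled validation F score
    if score > best_score:
        best_score, best_params = score, params
```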

Table 2.

The hyperparameters examined in CNN models and the optimal model architectures to identify incident type and severity level

                 Input layer (word-embedding space)  Convolution + rectifier layer 1   Additional layers
                 Embedding        Document           Filter   Region                   Pooling   Convolution +
                 dimensionality   length             count    size                               rectifier 2
Values examined  50–200           20–100             2–40     (2, 2), (3, 2),          Yes/No    Yes/No
                                                              (2, 3), (3, 3)
Examined CNN models with varying hyperparameters
 Model 1         50               20                 2        (2, 2)                   No        No
 Model 2         55               20                 2        (2, 2)                   No        No
 Model 3         60               20                 3        (2, 2)                   No        No
 Model i         75               50                 15       (2, 3)                   No        Yes
 Model N         200              100                40       (3, 3)                   Yes       Yes
Optimal model architectures
 Incident type   100              75                 20       (2, 2)                   No        No
 Severity level  120              60                 10       (3, 3)                   No        No

Step 4: Performance evaluation and comparison

Individual classification performance

Classification performance for each individual incident type or severity level was measured using F score, precision, and recall.12 A confusion matrix was also used to visualize the performance of classifiers.

Overall classification performance

Overall classification performance was examined using 2 widely accepted averaging measures in multiclass classification studies: microaveraging and macroaveraging. The microaveraged measure is based on the cumulative numbers of true positives, true negatives, false positives, and false negatives per type, while the macroaveraged measure of precision, recall, and F score is the average over all classes, with equal weight given to each incident type12 (see the sketch below). With the balanced validation data sets, we compared the performance of different CNN architectures. Using the most effective CNN architecture, we evaluated its performance across the benchmark, original, and independent data sets. We then compared performance with previous work using the one-versus-one ensemble of SVMs with a binary count representation of bag-of-words features.12
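A minimal sketch of both averages using scikit-learn; the label arrays are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = ["Falls", "Falls", "Medication", "Aggression", "Others"]
y_pred = ["Falls", "Medication", "Medication", "Aggression", "Others"]

print(f1_score(y_true, y_pred, average="micro"))  # pools TP/FP/FN across classes
print(f1_score(y_true, y_pred, average="macro"))  # equal weight per class
print(confusion_matrix(y_true, y_pred))           # rows: true; columns: predicted
```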

RESULTS

CNN architectures

CNN architectures were examined by varying layers, the sizes for convolution filters, word-embedding space dimensionality, and report length. The most effective architecture in identifying incident types included the input, convolution and rectifier, fully connected, and classification layers (Table 2). A single convolution layer was used and the pooling layer was excluded. With the input layer, the optimal embedding dimensionality was 100 and report length was 75. For the convolution layer, 20 filters were applied to scan through a region of (2, 2) in the word-embedding space. The convolution operation using these 2 hyperparameters generated the most effective feature maps to identify incident types.

For severity level, the best-performing CNN consisted of the same layers as those used to identify incident type. A 120-dimension embedding space along with a 60-word document length was the most effective word-embedding setting for the input layer. The convolution operation was performed using 10 filters scanning over a region of (3, 3) of the input layer.

CNN performance

Overall performance

The best-performing CNN achieved high and similar F scores on the benchmark (81.1%), original (87.6%), and independent (85.1%) data sets when identifying incident types (Table 3). When identifying severity, the CNN performed well on the benchmark (84.5%) and original (81.8%) data sets but relatively worse on the independent (66.1%) data set. The CNN outperformed SVM ensembles across all 3 testing data sets in identifying both incident types and severity levels (Table 3, micro-averaged F scores). Precision and recall measures are reported in Supplementary Material Table A.

Table 3.

Classification performance of the CNN (F score, %) in identifying incidents by type and severity, compared with SVM ensembles12

                          Benchmark         Original          Independent
                          CNN     SVM       CNN     SVM       CNN     SVM
Incident type
 Micro-averaged F score   81.1    78.3      87.6    73.9      85.1    68.5
 Macro-averaged F score   81.7    77.5      78.2    62.7      71.5    51.4
 Falls                    92.0    89.3      96.0    96.1      96.2    88.8
 Medication               84.6    76.9      89.6    85.9      90.5    79.8
 Pressure Injury          96.0    93.9      94.4    88.0      92.6    85.2
 Aggression               94.3    90.6      94.9    79.2      92.0    70.6
 Documentation            78.4    53.3      72.7    37.5      67.6    24.0
 Blood Product            92.0    87.5      76.9    76.9      85.9    56.6
 Patient Identification   67.8    71.0      30.8    37.0      34.0    30.5
 Infection                76.2    90.6      50.0    52.6      52.4    20.0
 Clinical Handover        81.5    72.4      66.7    29.4      44.7    20.8
 Deteriorating Patient    75.6    88.9      100.0   40.0      44.0    19.4
 Others                   60.6    38.1      88.2    66.7      86.9    69.0
SAC level
 Micro-averaged F score   84.5    62.9      81.8    50.1      66.1    52.7
 Macro-averaged F score   84.7    62.4      56.6    34.1      41.1    33.2
 SAC 1                    87.1    87.3      25.0    19.8      9.5     12.5
 SAC 2                    90.9    49.0      28.7    12.3      11.1    12.0
 SAC 3                    84.2    49.1      86.9    42.6      66.4    48.3
 SAC 4                    75.9    64.0      85.8    61.8      77.5    60.0

Abbreviations: CNN, convolutional neural network; SAC, Severity Assessment Codes; SVM, support vector machine.

The CNN performed well in identifying the common classes of reports: falls, medication, pressure injury, and aggression, as well as SAC3 and SAC4, which predominate among reported incidents. More importantly, for these common classes, the CNN performed similarly on the benchmark and original (AIMS) data sets and on the independent data set (Riskman), indicating that it has potential to be applied to external reporting systems.

Identifying incident types

The CNN performed slightly better than SVM ensembles in identifying the 4 common incident types (falls, medications, pressure injury, and aggression), achieving high F scores above 85% on the benchmark and original data sets (Table 3). Notably, performance improved by 7.4% to 21.4% on the independent data set compared with SVMs.

Compared with SVM ensembles, the CNN was more effective in identifying documentation, with F scores increasing dramatically from 53.3% to 78.4% on the benchmark data set, from 37.5% to 72.7% on the original data set, and from 24.0% to 67.6% on the independent data set. Similarly, for the miscellaneous incident type Others, the CNN outperformed SVM ensembles, achieving consistent F scores on the 2 stratified data sets: 88.2% (original) and 86.9% (independent). Although the CNN performed worse on the benchmark data set, with an F score of 60.6%, it was still much better than the SVM ensembles at 38.1%. On the 2 stratified data sets, Others was most likely to be misidentified as documentation, patient identification, or clinical handover (Supplementary Material Figure A1).

When identifying rarer types (eg, blood products, patient identification, infection, and clinical handover), the CNN trained on a balanced data set performed relatively poorly in the real-world setting. A similar pattern was observed with SVM ensembles. The difference was that the CNN generalized well from the original to the independent data set, whereas the SVM ensembles performed worse on the independent data set. In addition, the CNN performed better in identifying infection and clinical handover on the original and independent data sets. With patient identification, the performance of the CNN and SVM ensembles was comparable across the 3 testing data sets.

Identifying severity levels

Compared with SVM ensembles, the CNN achieved better performance in identifying SAC2, SAC3, and SAC4 and comparable performance on SAC1. Specifically, for SAC3 and SAC4, the CNN was consistently more effective than SVM ensembles. F scores for SAC3 increased dramatically from 49.1% to 84.2% on the benchmark, from 42.6% to 86.9% on the original, and from 48.3% to 66.4% on the independent data set. For SAC4, performance improved by 11.9% or more across the 3 testing data sets.

When identifying the rare classes SAC1 and SAC2, although the CNN achieved high F scores on the benchmark data set (SAC1: 87.1%; SAC2: 90.9%), it showed a similar pattern to the SVM ensembles, performing poorly on the original and independent data sets (Supplementary Material Figure A2). For example, the CNN’s F scores for SAC1 decreased from 87.1% to 25.0% and 9.5% on the original and independent data sets, respectively.

DISCUSSION

Main findings and implications

We evaluated the feasibility of a CNN with word embedding to classify incident reports from 2 independent reporting systems. A CNN trained on balanced data sets effectively identified incident types including falls, medications, pressure injury, aggression, documentation, and Others on stratified data sets. These types made up 96% of all reports (Table 1). Similarly, for severity level, the CNN performed well on the 3 testing data sets in identifying medium-risk SAC3 and low-risk SAC4 events, which composed 98% of reports. These results show that a simple CNN can identify incident reports by type and severity and can generalize to unseen data sets.12 Together, they indicate that this CNN has potential to be applied in real-world settings as a first step to group incidents when human resources are lacking.

Word embedding

Compared with the bag-of-words model, word embedding provides a more expressive and efficient representation by maintaining the contextual similarity of words while building low-dimensional vectors. To generate a word-embedding space, we need to consider the training data resource and the approach used to discover word relationships. In terms of data resources, 2 are commonly used to train word embeddings: domain-specific corpora (eg, AIMS) and general data resources (eg, Wikipedia).35,39,42 Exploiting general data resources can capture the semantics of general English words and provide semantic relations between frequent and generic terms from a large corpus. However, embeddings have been shown to differ from one domain to another because of lexical and semantic variation.43 Domain-specific terms are challenging for general-domain embeddings because there are few statistical clues in the underlying corpora for these items. This problem is present in the patient safety domain, where text in incident reports may use language unique to the domain.42 Some studies suggest that word embeddings trained on biomedical data sets capture the semantics of medical terms better and find semantically relevant medical terms closer to human experts’ judgement than general embeddings pretrained on GloVe or Google News.44 To the best of our knowledge, little work has investigated word embeddings trained from the patient safety perspective. To capture domain-specific semantic relationships, we used the specific resource of AIMS incident data for training embeddings.

A 75-word sequence was the optimal report length for identifying incident type, while 60 words was optimal for severity level. We found the optimal length was very close to the mean word length of the free text in training reports: 78.5 in AIMS (range, 5–308; SD, 35.5) and 63.4 in Riskman (range, 5–404; SD, 31.6).

CNN architectures

There is no standard for choosing the architecture and hyperparameters of a CNN; the network architecture depends on the data. Our reports were mostly made up of short sentences, and a bigger structure is not necessarily better with shorter sentences.31,41 Furthermore, smaller CNNs require less hardware memory, which makes them more feasible to deploy in practice. We conducted comprehensive experiments examining small architectures by including additional convolution and pooling layers and varying the filter sizes and the number of filters in the convolution layer.

The convolutional layer is the core building block of a CNN, as it learns features localized to subregions of the input space. This layer’s parameters comprise a set of independent filters that scan through the word-embedding vectors from the input layer. Each filter is convolved, via dot products, across the width and height of the input space (the filter size) and produces an activation map.32 If different filters are applied, a set of activation maps is created as the result of the convolution layer. The optimal filter size was found empirically through experimental validation. Because incident reports were often made up of short sentences, we constrained the filters to small sizes, which allowed effective classification performance while avoiding an exhaustive search. In addition, the convolutional layer is repeatable; however, additional layers increase the complexity of the CNN architecture and the risk of overfitting. Our results demonstrated that additional layers did not help in identifying incidents, indicating that a single layer is sufficient to capture the key features that discriminate incidents by type and severity. This might be caused by the limited size of training samples for multiple classes.

We found that a pooling layer was not necessary for identifying incident reports when training on a small incident data set. Pooling is used to progressively reduce the number of features and thus to prevent the CNN from overfitting. Although a pooling layer reduced the computational complexity of the CNN, it may have led to the loss of useful feature information and hence poorer model sensitivity.

Comparison between CNN and SVM ensembles

In our previous study, the multiclass classification problem was transformed into a series of binary classification problems.12 For an equitable comparison, we trained a single CNN using the same training sets and evaluated it with the same testing data sets as in that study.12

Identification of incident types

Unlike the SVM ensembles, which performed relatively poorly on the independent data set, the CNN showed consistent performance across the 3 testing data sets in identifying all common types and 2 “difficult” types that were more likely to be misidentified by SVM ensembles (documentation and Others). The Others type was most likely to be misclassified as documentation by both the CNN and SVM ensembles. However, the number of reports misclassified by the CNN decreased dramatically to 63, from the 458 misidentified by SVM ensembles (Supplementary Material Figure A1).

For rare types of incidents, although the performance of the CNN on the stratified data sets was poorer than on the balanced data set, it was consistent between the original and independent data sets for most rare types except Deteriorating Patient. These results indicate that the CNN had better generalizability than SVM ensembles. The inconsistent performance with Deteriorating Patient might be caused by the small data size: 11 of 14 incidents involving deteriorating patients were identified correctly; however, 21 Others reports were misidentified as Deteriorating Patient, leading to lower precision and F score.

Identification of severity levels

The CNN was robust in identifying the 2 most common levels, SAC3 and SAC4, across the 3 testing data sets. With SVM ensembles, SAC3 events were more likely to be misidentified as SAC4. Interestingly, we found that SAC3 was misidentified as SAC4 and SAC2 equally often by the CNN. A possible reason is that the boundary between medium and low risk levels is not always clear, which results in inconsistent rating of these events by patient safety professionals. A study evaluating the reliability of the severity rating scale for medication errors in the UK showed marked differences in severity ratings between different health professional groups, within groups, and for individuals at different time points,11 making severity rating highly subjective.45 Another reason is that the CNN was a single classifier trained by considering all 4 levels of risk events together. Each SAC level included multiple incident types, making it harder to obtain distinct vocabularies between severity levels.

Overall, the CNN outperformed SVM ensembles in terms of classification performance and generalizability. Word embedding captured the semantics of, and similarity between, words, which might be one reason for the good generalizability. CNNs are also more practical, requiring a single model at test time, compared with assessing each classifier in an SVM ensemble. However, the classification of rare classes on independent data sets needs further investigation.

Limitations and future work

There are several limitations. First, although the CNN generalized well across the 2 independent reporting systems in Australia, we did not examine generalizability across nations. Second, the training data sets were balanced. Given the class imbalance problem with incident types in real-world settings, a stratified training data set may be more desirable when large training data sets are available. Third, the word embedding was generated using a domain-specific data set. The use of general embeddings pretrained on larger data sets requires further investigation.

CONCLUSION

Our experiments demonstrated that a small single CNN outperformed an ensemble of SVMs in identifying common incident types and severity levels. For rare classes, both the CNN and SVM ensembles trained on balanced data sets tended to be weaker on stratified data sets, but the CNN generalized well from the original to the independent data sets. More training data may improve classification performance, given the deep learning nature of CNNs.

FUNDING

This research is supported in part by grants from the Australian National Health and Medical Research Council (NHMRC), project grant APP1022964; and the Centre for Research Excellence in Digital Health, grant 1134919. The funding source did not play any role in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.

AUTHOR CONTRIBUTIONS

FM, YW, and EC conceptualized the study. YW designed and implemented the training and evaluation of the classification model. She is responsible for the integrity of the work. YW and FM drafted the paper. All authors participated in writing and revising the paper. All aspects of the study (design; collection, analysis, and interpretation of data; writing of the report; and decision to publish) were led by the authors. All authors read and approved the final manuscript.

ETHICS APPROVAL

Ethical approvals were obtained from committees of Macquarie University and the University of New South Wales, as well as the committees governing the hospital and the state data sets.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.


ACKNOWLEDGMENTS

We thank Bronwyn Shumack, Katrina Pappas, and Diana Arachi for assisting with the extraction of the incident reports. We also thank Anita Deakin, Alison Agers, and Sara Suffolk for their assistance with labeling reports.

Conflict of interest statement

None declared.

AVAILABILITY OF DATA AND MATERIALS

Individual patient safety report data cannot be shared.

REFERENCES

  1. Rafter N, Hickey A, Condell S, et al. Adverse events in healthcare: learning from mistakes. QJM 2015; 108(4): 273–7.
  2. Runciman B, Walton M. Safety and Ethics in Health Care: A Guide to Getting It Right. Farnham: Ashgate Publishing; 2007.
  3. Clinical Excellence Commission NSW. Clinical Incident Management in the NSW Public Health System; 2016. http://www.cec.health.nsw.gov.au/clinical-incident-management. Accessed August 5, 2019.
  4. Runciman WB, Williamson JA, Deakin A, et al. An integrated framework for safety, quality and risk management: an information and incident management system based on a universal patient safety classification. Qual Saf Health Care 2006; 15(Suppl 1): i82–90.
  5. Pronovost PJ, Morlock LL, Sexton JB, et al. Improving the value of patient safety reporting systems. In: Henriksen K, Battles JB, Keyes MA, et al., eds. Advances in Patient Safety: New Directions and Alternative Approaches (Vol. 1: Assessment). Rockville (MD): Agency for Healthcare Research and Quality; 2008: 1–9.
  6. Mitchell I, Schuster A, Smith K, et al. Patient safety incident reporting: a qualitative study of thoughts and perceptions of experts 15 years after ‘To Err is Human’. BMJ Qual Saf 2016; 25(2): 92–9.
  7. Mahajan RP. Critical incident reporting and learning. Br J Anaesth 2010; 105(1): 69–75.
  8. Agency for Healthcare Research and Quality. Patient safety organization common formats. https://www.pso.ahrq.gov/common/. Accessed August 5, 2019.
  9. Gong Y. Data consistency in a voluntary medical incident reporting system. J Med Syst 2011; 35(4): 609–15.
  10. Haines TP, Massey B, Varghese P, et al. Inconsistency in classification and reporting of in-hospital falls. J Am Geriatr Soc 2009; 57(3): 517–23.
  11. Williams SD, Ashcroft DM. Medication errors: how reliable are the severity ratings reported to the national reporting and learning system? Int J Qual Health Care 2009; 21(5): 316–20.
  12. Wang Y, Coiera E, Runciman W, et al. Using multiclass classification to automate the identification of patient safety incident reports by type and severity. BMC Med Inform Decis Mak 2017; 17(1): 84.
  13. Wang Y, Coiera E, Runciman W, et al. Automating the identification of patient safety incident reports using multi-label classification. Stud Health Technol Inform 2017; 245: 609–13.
  14. Marella WM, Sparnon E, Finley E. Screening electronic health record-related patient safety reports using machine learning. J Patient Saf 2017; 13(1): 31–6.
  15. Ong MS, Magrabi F, Coiera E. Automated categorisation of clinical incident reports using statistical text classification. Qual Saf Health Care 2010; 19(6): e55.
  16. Ong MS, Magrabi F, Coiera E. Automated identification of extreme-risk events in clinical incident reports. J Am Med Inform Assoc 2012; 19(e1): e110–8.
  17. Fong A, Ratwani R. An evaluation of patient safety event report categories using unsupervised topic modeling. Methods Inf Med 2015; 54(4): 338–45.
  18. Chai K, Anthony S, Coiera E, et al. Using statistical text classification to identify health information technology incidents. J Am Med Inform Assoc 2013; 20(5): 980–5.
  19. Kang H, Yu Z, Gong Y. Initializing and growing a database of health information technology (HIT) events by using TF-IDF and biterm topic modeling. AMIA Annu Symp Proc 2017: 1024–33.
  20. Fong A, Hettinger AZ, Ratwani RM. Exploring methods for identifying related patient safety events using structured and unstructured data. J Biomed Inform 2015; 58: 89–95.
  21. Ratwani RM, Fong A. ‘Connecting the dots’: leveraging visual analytics to make sense of patient safety event reports. J Am Med Inform Assoc 2015; 22(2): 312–7.
  22. Liang C, Gong Y. Automated classification of multi-labeled patient safety reports: a shift from quantity to quality measure. Stud Health Technol Inform 2017; 245: 1070–4.
  23. Sun MH. A multiclass support vector machine: theory and model. Int J Inf Tech Dec Mak 2013; 12(6): 1175–99.
  24. Kim Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882; 2014.
  25. Xiao C, Choi E, Sun JM. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc 2018; 25(10): 1419–28.
  26. Lopez-Zorrilla A, de Velasco-Vazquez M, Serradilla-Casado O, et al. Brain white matter lesion segmentation with 2D/3D CNN. Nat Artif Comput Biomed Neurosci 2017; 10337: 394–403.
  27. Kamnitsas K, Ledig C, Newcombe VFJ, et al. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal 2017; 36: 61–78.
  28. Farooq A, Anwar SM, Awais M, et al. A deep CNN based multiclass classification of Alzheimer's disease using MRI. In: IEEE Conference on Imaging Systems; 2017: 111–6.
  29. Wang Z, Li L, Glicksberg BS, et al. Predicting age by mining electronic medical records with deep learning characterizes differences between chronological and physiological age. J Biomed Inform 2017; 76: 59–68.
  30. Pham T, Tran T, Phung D, et al. Predicting health care trajectories from medical records: a deep learning approach. J Biomed Inform 2017; 69: 218–29.
  31. Le HT, Cerisara C, Denis A. Do convolutional networks need to be deep for text classification? arXiv preprint arXiv:1707.04108; 2017.
  32. Albelwi S, Mahmood A. A framework for designing the architectures of deep convolutional neural networks. Entropy 2017; 19(6): 242.
  33. Gibaja E, Ventura S. A tutorial on multilabel learning. ACM Comput Surv 2015; 47(3): 1.
  34. Joachims T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Norwell: Kluwer Academic; 2002.
  35. Ge LH, Moh TS. Improving text classification with word embedding. In: 2017 IEEE International Conference on Big Data; 2017: 1796–805.
  36. Runciman W, Hibbert P, Thomson R, et al. Towards an international classification for patient safety: key concepts and terms. Int J Qual Health Care 2009; 21(1): 18–26.
  37. Bagian JP, Lee C, Gosbee J, et al. Developing and deploying a patient safety program in a large health care delivery system: you can't fix what you don't know about. Jt Comm J Qual Improv 2001; 27(10): 522–32.
  38. Goldberg Y, Levy O. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722; 2014.
  39. Naili M, Chaibi AH, Ben Ghezala HH. Comparative study of word embedding methods in topic segmentation. Procedia Comput Sci 2017; 112: 340–9.
  40. Xu HT, Dong M, Zhu DX, et al. Text classification with topic-based word embedding and convolutional neural networks. In: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics; October 2–5, 2016; Seattle, WA, USA.
  41. Iandola F, Keutzer K. Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures. arXiv preprint arXiv:1710.02759; 2017.
  42. Wang P, Xu B, Xu JM, et al. Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 2016; 174: 806–14.
  43. Bollegala D, Maehara T, Kawarabayashi K. Embedding semantic relations into word representations. arXiv preprint arXiv:1505.00161; 2015.
  44. Wang YS, Liu SJ, Afzal N, et al. A comparison of word embeddings for the biomedical natural language processing. J Biomed Inform 2018; 87: 12–20.
  45. Centre for Clinical Governance Research in Health. Evaluation of the Safety Improvement Program in New South Wales: Study No 6 Report on Program Outcomes. Sydney: University of New South Wales, Centre for Clinical Governance Research in Health; 2005.


