Abstract
Background
In medical research, the effectiveness of machine learning algorithms depends heavily on the accuracy of labelled data. This study aimed to assess inter-rater reliability (IRR) in a retrospective electronic medical chart review to create high-quality labelled data on comorbidities and adverse events (AEs).
Methods
Six registered nurses with diverse clinical backgrounds reviewed patient charts and extracted data on 20 predefined comorbidities and 18 AEs. All reviewers underwent four iterative rounds of training aimed at enhancing accuracy and fostering consensus. Periodic monitoring was conducted at the beginning, middle and end of the testing phase to ensure data quality. Weighted kappa coefficients were calculated with their associated 95% confidence intervals (CIs).
Results
Seventy patient charts were reviewed. The overall agreement, measured by Conger's kappa, was 0.80 (95% CI 0.78 to 0.82). IRR scores remained consistently high (ranging from 0.70 to 0.87) throughout each phase.
Conclusion
Our study suggests that the detailed Manual for Chart Review and structured training regimen resulted in a consistently high level of agreement among reviewers during the chart review process. This establishes a robust foundation for generating high-quality labelled data, thereby enhancing the potential for developing accurate machine learning algorithms.
Keywords: Adverse events, epidemiology and detection; Patient safety; Chart review methodologies
Introduction
Machine learning is rapidly transforming medical and health services research by extracting valuable insights from extensive datasets through computational algorithms. The accuracy and effectiveness of supervised algorithms depend heavily on the quality and representativeness of labelled data.1 A widely acknowledged approach to generating labelled data is manual chart review. Inter-rater reliability (IRR), which measures agreement among reviewers, is therefore a key factor in ensuring the quality of data extraction.2
This study aimed to examine IRR during a large retrospective chart review designed to create labelled data on comorbidities and adverse events (AEs). These high-quality labelled data will be used to develop machine learning algorithms that improve AE case identification and classification.
Methods
Six registered nurses with a minimum of three years of clinical practice, specialising in surgery, general internal medicine, intensive care, oncology or cardiology, were recruited and underwent rigorous training to independently review inpatient electronic hospital charts. The reviewed documents included cover pages, discharge summaries, trauma and resuscitation records, admission, consultation, diagnostic, surgery, pathology and anaesthesia reports, and multidisciplinary daily progress notes. Reviewers extracted data on 20 predefined comorbidities and 18 AEs.3 The definitions of these comorbidities and AEs were explicitly outlined in the Manual for Chart Review (online supplemental appendix), which guided the review process. Chart review data collection was performed using REDCap, a secure web-based software platform.4
All reviewers underwent comprehensive training aimed at enhancing accuracy and fostering consensus. This training method has been employed and validated in a previous study.5 The foundation phase involved an initial review of the same 16 charts, focusing on encouraging discussion and resolving discrepancies through consensus. Results, including data extraction accuracy and agreement levels, were discussed during an in-person meeting after this review period. Subsequently, the training phase encompassed four iterative rounds, each involving the review of 10 charts, aimed at enhancing consistency and refining reviewer skills in data extraction. On achieving an overall excellent agreement level (k>0.8), the testing phase (official data collection, n=11 000) commenced. Periodic monitoring of IRR followed, involving 30 additional charts reviewed at the beginning (n=3000), middle (n=5000) and end (n=9000) of the primary study, to ensure consistent data extraction quality. Weighted kappa coefficients (Conger, Brennan and Prediger, Gwet and Fleiss) and their associated 95% CIs were calculated for comorbidities and AEs using STATA V.16.2 software.6
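The four coefficients named above are all chance-corrected agreement statistics of the same general form and differ only in how chance agreement is estimated. As a brief orientation using standard definitions (not reproduced from the study protocol), with observed agreement p_o and chance-expected agreement p_e:

```latex
% General form of a chance-corrected agreement coefficient
\kappa = \frac{p_o - p_e}{1 - p_e}

% Common chance-agreement terms for q rating categories with pooled
% category proportions \pi_k:
p_e^{\mathrm{Fleiss}} = \sum_{k=1}^{q} \pi_k^{2}, \qquad
p_e^{\mathrm{Brennan-Prediger}} = \frac{1}{q}, \qquad
p_e^{\mathrm{Gwet\ (AC1)}} = \frac{1}{q-1} \sum_{k=1}^{q} \pi_k (1 - \pi_k)

% Conger's kappa replaces the pooled proportions with rater-specific
% marginal proportions, generalising Cohen's kappa to multiple raters.
```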
Results
A total of 70 patient charts were selected and reviewed by each reviewer independently over the training and testing phases. The overall agreement measured by Conger’s kappa was 0.80 (95% CI 0.78 to 0.82). Other agreement metrics also demonstrated consistently high levels of agreement, including per cent agreement at 0.94 (95% CI 0.94 to 0.95), Brennan and Prediger’s kappa at 0.88 (95% CI 0.87 to 0.90), Gwet’s kappa at 0.92 (95% CI 0.91 to 0.93) and Fleiss’ kappa at 0.80 (95% CI 0.78 to 0.82).
IRR scores were consistently high (ranging from 0.70 to 0.87) at each phase (figure 1). Rater agreement (table 1) on comorbidities was robust (k=0.84, 95% CI 0.81 to 0.86), with higher agreement observed for diabetes (k=0.94, 95% CI 0.89 to 0.99) and psychosis (k=0.92, 95% CI 0.74 to 1.0). IRR was slightly lower for peptic ulcer disease (k=0.65, 95% CI 0.51 to 0.80) and renal disease (k=0.63, 95% CI 0.53 to 0.74). Rater agreement on AEs was 0.77 (95% CI 0.74 to 0.80). Notably, agreement on pressure injury (k=0.90, 95% CI 0.81 to 0.98), fall (k=0.87, 95% CI 0.79 to 0.96) and thromboembolic event (k=0.86, 95% CI 0.77 to 0.94) was robust. Reviewers encountered challenges achieving consensus on adverse drug events (k=0.44, 95% CI 0.33 to 0.55) and retained surgical item (k=0.40, 95% CI 0.39 to 0.40).
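The pattern of very high per cent agreement alongside a low kappa for rare events such as retained surgical item (table 1) reflects a known property of chance-corrected statistics: when one category dominates, expected chance agreement is already high, so kappa can be modest despite near-perfect raw agreement. A minimal illustrative sketch with two hypothetical raters and fabricated labels (not the study's six-rater data, which were analysed with Conger's kappa) shows the effect:

```python
# Illustrative only: two hypothetical raters labelling 100 charts for a
# rare adverse event (1 = event present, 0 = absent). Fabricated data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([1, 1, 0] + [0] * 97)  # flags the event on charts 0 and 1
rater_b = np.array([1, 0, 1] + [0] * 97)  # flags the event on charts 0 and 2

percent_agreement = np.mean(rater_a == rater_b)   # 0.98
kappa = cohen_kappa_score(rater_a, rater_b)       # about 0.49

print(f"per cent agreement = {percent_agreement:.2f}, kappa = {kappa:.2f}")
# Raw agreement is 98%, yet kappa is only about 0.49 because chance
# agreement on the dominant 'absent' category is already about 0.96.
```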
Figure 1.
Inter-rater reliability changes across training and testing phases. Inter-rater reliability was measured by Conger’s kappa and its 95% CI.
Table 1.
Overall inter-rater reliability among six reviewers for comorbidities and adverse events
Comorbidities | Per cent agreement (95% CI) | Conger’s kappa (95% CI) |
Cerebrovascular disease | 0.95 (0.91 to 0.98) | 0.73 (0.57 to 0.89) |
Chronic pulmonary disease | 0.96 (0.93 to 0.99) | 0.88 (0.78 to 0.98) |
Congestive heart failure | 0.94 (0.90 to 0.97) | 0.75 (0.63 to 0.87) |
Dementia | 0.98 (0.96 to 1.00) | 0.81 (0.60 to 1.00) |
Depression | 0.95 (0.92 to 0.98) | 0.85 (0.75 to 0.94) |
Diabetes | 0.97 (0.95 to 1.00) | 0.94 (0.89 to 0.99) |
HIV disease (HIV/AIDS) | NS | NS |
Hypertension | 0.93 (0.88 to 0.97) | 0.83 (0.73 to 0.93) |
Leukaemia | 1.00 | 1.00 |
Liver disease | 0.96 (0.92 to 0.99) | 0.78 (0.63 to 0.94) |
Lymphoma | 0.99 (0.98 to 1.00) | 0.90 (0.71 to 1.00) |
Metastatic cancer | 0.98 (0.96 to 1.00) | 0.79 (0.54 to 1.00) |
Myocardial infarct | 0.95 (0.91 to 0.99) | 0.76 (0.60 to 0.93) |
Paralysis | 0.99 (0.98 to 1.00) | 0.80 (0.37 to 1.00) |
Peptic ulcer disease | 0.89 (0.84 to 0.95) | 0.65 (0.51 to 0.80) |
Peripheral vascular disease | 0.98 (0.95 to 1.00) | 0.77 (0.56 to 0.98) |
Psychosis | 0.99 (0.98 to 1.00) | 0.92 (0.74 to 1.00) |
Renal disease | 0.82 (0.76 to 0.87) | 0.63 (0.53 to 0.74) |
Rheumatological disease | 0.94 (0.90 to 0.98) | 0.72 (0.56 to 0.88) |
Solid tumour without metastasis | 0.95 (0.92 to 0.99) | 0.84 (0.75 to 0.93) |
AEs | ||
Accidental puncture or laceration | 0.95 (0.92 to 0.99) | 0.83 (0.71 to 0.94) |
Acute kidney injury | 0.88 (0.83 to 0.92) | 0.75 (0.66 to 0.84) |
Adverse drug event | 0.73 (0.67 to 0.79) | 0.44 (0.33 to 0.55) |
Central line or intravenous infection | 0.91 (0.86 to 0.95) | 0.74 (0.61 to 0.87) |
Fall | 0.95 (0.92 to 0.99) | 0.87 (0.79 to 0.96) |
Iatrogenic pneumothorax | 0.97 (0.94 to 0.99) | 0.83 (0.69 to 0.96) |
Postoperative respiratory failure | 0.88 (0.83 to 0.93) | 0.66 (0.52 to 0.79) |
Postoperative wound dehiscence | 0.96 (0.93 to 0.99) | 0.82 (0.71 to 0.93) |
Pressure injury | 0.96 (0.92 to 0.99) | 0.90 (0.81 to 0.98) |
Retained surgical item | 0.99 (0.97 to 1.00) | 0.40 (0.39 to 0.40) |
Sepsis | 0.89 (0.85 to 0.94) | 0.70 (0.57 to 0.82) |
Surgical site infection | 0.86 (0.81 to 0.91) | 0.60 (0.48 to 0.72) |
Thromboembolic event | 0.93 (0.89 to 0.97) | 0.86 (0.77 to 0.94) |
Transfusion reaction | 0.98 (0.96 to 1.00) | 0.83 (0.65 to 1.00) |
Urinary catheter infection | 0.90 (0.86 to 0.95) | 0.73 (0.61 to 0.85) |
Ventilator-associated pneumonia | 0.93 (0.90 to 0.97) | 0.83 (0.73 to 0.93) |
Ventilator-related injury | 0.99 (0.97 to 1.00) | 0.66 (0.14 to 1.00) |
Wrong site surgery | NS | NS |
NS, insufficient data.
Discussion
In our study, the detailed Manual for Chart Review and the structured training regimen resulted in a consistently high level of agreement among reviewers during the chart review process. The increase in kappa agreement during the testing phase may be attributed to reviewers' increased exposure to a larger number of charts, leading to greater familiarity with key definitions and fostering a common understanding of content knowledge. While excellent agreement on most items implies meticulous and well-documented patient charts, instances of disagreement among reviewers point to incomplete chart information, highlighting inherent imperfections in chart documentation. Additionally, subjective judgements made by reviewers during the review process may lead to variations in assessments and contribute to disparities; however, given the iterative training and the detailed guidance provided by the Manual for Chart Review, this impact should be minimal. Overall, this high level of agreement establishes a robust foundation for generating high-quality labelled data, enhancing the potential for developing accurate machine learning algorithms.
Acknowledgments
We express our gratitude to Hannah Qi, Noopur Swadas, Chris King, Olga Grosu and Jennifer Crotts for their dedicated efforts in conducting the chart review.
Footnotes
Contributors: YX, HQ, CE, DAS and GW conceived and designed the study. NS led the chart review team, and GW conducted the linkage and data analysis. GW created the tables and figures and drafted the manuscript. All authors contributed to data interpretation and manuscript revision.
Funding: Canadian Institutes of Health Research (CIHR, funding number: DC0190GP, funder website: https://cihr-irsc.gc.ca/e/193.html). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Competing interests: GW is supported by the Canadian Institutes of Health Research (CIHR) Fellowship, the O’Brien Institute for Public Health Postdoctoral Scholarship and Cumming School of Medicine Postdoctoral Scholarship at the University of Calgary. CE and YX are supported by the CIHR grant (grant number: DC0190GP). The other authors have no conflicts of interest to report.
Provenance and peer review: Not commissioned; externally peer reviewed.
Supplemental material: This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
Ethics statements
Patient consent for publication
Not applicable.
Ethics approval
Research ethics was approved by the Conjoint Health Research Ethics Board at the University of Calgary (REB21-0416).
References
1. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019;380:1347–58. doi:10.1056/NEJMra1814259
2. Hallgren KA. Computing inter-rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol 2012;8:23–34. doi:10.20982/tqmp.08.1.p023
3. Wu G, Eastwood C, Zeng Y, et al. Developing EMR-based algorithms to identify hospital adverse events for health system performance evaluation and improvement: study protocol. PLoS One 2022;17:e0275250. doi:10.1371/journal.pone.0275250
4. Harris PA, Taylor R, Minor BL, et al. The REDCap consortium: building an international community of software platform partners. J Biomed Inform 2019;95. doi:10.1016/j.jbi.2019.103208
5. Eastwood CA, Southern DA, Doktorchik C, et al. Training and experience of coding with the World Health Organization's International Classification of Diseases. Health Inf Manag 2023;52:92–100. doi:10.1177/18333583211038633
6. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22:276–82.
Supplementary Materials
bmjoq-2023-002722supp001.pdf (22MB, pdf)