Exploring the reliability of inpatient EMR algorithms for diabetes identification

Seungwon Lee; Elliot A Martin; Jie Pan; Cathy A Eastwood; Danielle A Southern; David J T Campbell; Abdel Aziz Shaheen; Hude Quan; Sonia Butalia

doi:10.1136/bmjhci-2023-100894

2023 Dec 20;30(1):e100894.

PMCID: PMC10749029 PMID: 38123357

Abstract

Introduction

Accurate identification of medical conditions within a real-time inpatient setting is crucial for health systems. Current inpatient comorbidity algorithms rely on integrating various sources of administrative data, but at times, there is a considerable lag in obtaining and linking these data. Our study objective was to develop electronic medical records (EMR) data-based inpatient diabetes phenotyping algorithms.

Materials and methods

A chart review of 3040 individuals was completed, of whom 583 had diabetes. We linked EMR data on these individuals to the International Classification of Diseases (ICD) administrative databases. The following EMR data-based diabetes algorithms were developed: (1) laboratory data, (2) medication data, (3) laboratory and medication data, (4) diabetes concept keywords and (5) a diabetes free-text algorithm. Combined algorithms joined the above algorithms with OR statements. Algorithm performance was measured using chart review as the gold standard. We defined the best-performing algorithm as the one that showed high performance on both sensitivity (SN) and positive predictive value (PPV).

Results

The algorithms tested generally performed well: ICD-coded data, SN 0.84, specificity (SP) 0.98, PPV 0.93 and negative predictive value (NPV) 0.96; medication and laboratory algorithm, SN 0.90, SP 0.95, PPV 0.80 and NPV 0.97; all document types algorithm, SN 0.95, SP 0.98, PPV 0.94 and NPV 0.99.

Discussion

A free-text data-based diabetes algorithm can yield comparable or superior performance to a commonly used ICD-coded algorithm and could supplement existing methods. These types of inpatient EMR-based algorithms for case identification may become a key method for timely resource planning and care delivery.

Keywords: health services research, electronic health records, medical informatics, medical record linkage

WHAT IS ALREADY KNOWN ON THIS TOPIC

Identifying people with diabetes in databases is typically carried out by using International Classification of Diseases codes, laboratory results and/or medications.

WHAT THIS STUDY ADDS

The diabetes identification algorithm based on free-text electronic medical records (EMR) notes shows excellent performance. This study further supports the idea that EMRs contain a wealth of details that can be leveraged to complement existing methods to identify people with diabetes within databases.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

This study provides evidence that free-text EMR data could enhance the flow of diabetes information in clinical care and improve associated downstream processes in case identification, surveillance and clinical outcome research.

Introduction

Accurate identification of chronic conditions, such as diabetes, within acute care facilities or hospitals is imperative for delivering optimal care.¹ Information regarding comorbidity status is typically gathered by healthcare professionals during their care encounters and stored within electronic medical records (EMRs). The collected information is subsequently conveyed to other care providers based on individual needs. Comorbidity information is useful not only in point-of-care clinical encounters but also for research, quality improvement and resource planning. In the Canadian context, a key inpatient administrative health database, the discharge abstract database (DAD),² is used by >90% of hospitals. It is populated by trained coding specialists who review physician documentation, such as discharge summaries, in the EMRs and assign International Classification of Diseases 10th Revision, Canadian modification (ICD-10-CA) codes to each encounter.³ The DAD serves various purposes such as care service planning activities, fiscal planning, operational planning, population surveillance and epidemiology. However, there is a considerable delay in obtaining these data.

The increased adoption of EMRs within acute care facilities,⁴ coupled with the integration of artificial intelligence techniques in healthcare,⁵ has created the potential to extract chronic conditions and comorbidities directly from EMRs. This has the benefit of enhancing operational data practices in health systems and ensuring timelier information. For example, diabetes definitions have typically used ICD codes and laboratory and medication data to define diabetes in clinical datasets.⁶ However, most detailed contextual information in healthcare is stored within free-text notes in paper charts or EMRs. The advancement of natural language processing (NLP) techniques now enables using EMR free-text data to refine condition definitions and facilitate the identification process, thereby enhancing various healthcare processes, including real-time point of care, research and care planning processes.

Our hypothesis is that a diabetes algorithm using clinical free-text notes can perform similarly to or better than existing standard methods. We were also interested in assessing whether different components of free-text notes could contribute to phenotyping diabetes. The purpose of this study was to develop diabetes algorithms based on different EMR data modalities and compare their performance.

Materials and methods

Study population and design

This study is a retrospective cohort study covering the period of 1 January 2015 to 30 June 2015, from Alberta, Canada. This cohort was assembled from data sources listed below.

Data sources and linkage

The EMR and administrative data records were linked using the Personal Health Number (PHN) and the Patient Identification and Encounter Management details (eg, encounter number, health record number) generated by the Clinibase system. These represent a unique set of identifiers for patient encounters⁷ that are loaded into the EMR system. The combination ensured that the linkage was pinpointed to the correct admission period contained within the EMR data. We developed this linkage mechanism in a previous study⁸ and subsequently created multiple EMR databases linked to administrative health databases. The PHN and other personal identifiers were anonymised after the linkage was completed. The following sources of information were used: the chart review database, the Allscripts Sunrise Clinical Manager EMR and the DAD.
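The linkage step described above can be illustrated with a minimal sketch of a deterministic (exact-match) join on the PHN and encounter number, followed by replacement of identifiers with study IDs. The record types, field names and the `link_records` and `anonymise` helpers are hypothetical and chosen for illustration only; they are not the study's actual schema or code.

```python
from typing import NamedTuple


class ChartRecord(NamedTuple):
    phn: str               # Personal Health Number (hypothetical field name)
    encounter_number: str  # Clinibase encounter number (hypothetical field name)
    diabetes: bool         # chart review label


class EmrEncounter(NamedTuple):
    phn: str
    encounter_number: str
    admit_date: str


def link_records(charts, encounters):
    """Exact-match join on the (PHN, encounter number) pair."""
    index = {(e.phn, e.encounter_number): e for e in encounters}
    linked = []
    for chart in charts:
        match = index.get((chart.phn, chart.encounter_number))
        if match is not None:
            linked.append((chart, match))
    return linked


def anonymise(linked):
    """Drop personal identifiers after linkage and assign sequential study IDs."""
    return [
        {"study_id": i, "diabetes": chart.diabetes, "admit_date": enc.admit_date}
        for i, (chart, enc) in enumerate(linked, start=1)
    ]
```

Joining on the pair rather than on the PHN alone is what ties a chart review label to one specific admission when a patient has several encounters.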

Chart review database

A previously conducted project assembled a chart review cohort of randomly selected patients in acute-care facilities in Calgary, Alberta.⁹ The chart review data recorded patients’ chronic disease status (binary) which included diabetes status, admission date and other system variables for linking it to the DAD and other data sources. The chart review included a total of 51 medical conditions and 3 healthcare-related adverse events. The chart review team consisted of six nurses who received training and followed a consistent protocol to review the charts. These reviewers were blinded to the ICD coding status.

Allscripts Sunrise Clinical Manager EMR

Sunrise Clinical Manager (SCM) has been used as the inpatient EMR for several acute-care sites operated by Alberta Health Services (AHS), the single health authority in the province of Alberta, since 2009. This EMR contains (but is not limited to) patient demographic information, laboratory information, medications, free-text history and physical notes, interdisciplinary progress notes, and discharge summaries for inpatient encounters. Detailed description of this EMR system is available in our previous work.¹

Discharge abstract database

The DAD is a national Canadian administrative health database which includes all inpatient separations (by discharge or death) through a collaborative system set up between provincial and territorial governments and the Canadian Institute for Health Information (CIHI). CIHI sets national training requirements for those responsible for coding the data. Administrative health data based on ICD codes, such as the DAD, are widely acknowledged as the reference standard in Canada for both research activities¹⁰ and public health initiatives.¹¹

Data extraction

Once the coded patient records were deterministically linked to the EMR using PHN and Clinibase variables, linkage to the EMR subtables of interest was conducted through system variables (eg, table record identifier, health record number). We extracted and cleaned the EMR subtables that contained the following information: (1) the inpatient laboratory subtable (all laboratory tests conducted within a patient encounter period), (2) the inpatient medication subtable (all medications prescribed and fulfilled to the patient within a patient encounter period) and (3) the subtable containing all clinical notes (free-text notes documented throughout the patient encounter period). ICD codes were obtained from the linked DAD data. These EMR subtables were used to develop the diabetes algorithms listed in the next section.

Diabetes algorithm development

Chart review labels served as the gold standard labels for algorithm development.

Operational standards—validated administrative data-based ICD codes algorithm

Current operational algorithm standards for surveillance and research are based on ICD-coded data. The National Diabetes Surveillance System (NDSS)¹² employs an ICD code-based algorithm developed by Quan et al,¹³ which includes ICD-10-CA codes E10–E14 recorded during hospitalisation. We assessed the performance of the algorithm by Quan et al against the chart review labels.
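As a minimal sketch of the ICD-based rule, an encounter can be flagged when any of its discharge codes falls in the E10–E14 range. The function name and the input format (a list of code strings per encounter) are assumptions made for illustration.

```python
import re

# ICD-10-CA categories E10 to E14, with or without a decimal subcategory.
_DIABETES_CODE = re.compile(r"^E1[0-4]")


def has_diabetes_code(icd_codes):
    """Return True if any code on the encounter starts with E10, E11, E12, E13 or E14."""
    return any(_DIABETES_CODE.match(code.strip().upper()) for code in icd_codes)
```

For example, `has_diabetes_code(["I10", "E11.9"])` is true, while an encounter coded only with `I10` is not flagged.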

EMR data-based algorithms

Various approaches were implemented for developing algorithms accounting for different data modalities. All algorithms were compared against the chart review labels for performance measurements.

Laboratory data-based clinical diagnosis algorithm

To identify diabetes, we used haemoglobin A1C (HbA1c) tests, oral glucose tolerance tests, random plasma glucose tests or fasting plasma glucose tests, adhering to the thresholds outlined in Diabetes Canada’s national guidelines for diagnosis. The criteria and thresholds for these tests have been published.¹⁴ While Diabetes Canada requires at least two separate test types for a diabetes diagnosis, the recommended tests were not consistently available for every patient, so we used a single test meeting the diagnostic criteria¹⁵ for performance reporting in this study.
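The single-test rule can be sketched as follows. The thresholds are the 2018 Diabetes Canada diagnostic cut-offs as commonly stated (HbA1c ≥6.5%, fasting plasma glucose ≥7.0 mmol/L, 2-hour plasma glucose in a 75 g oral glucose tolerance test ≥11.1 mmol/L, random plasma glucose ≥11.1 mmol/L) and should be verified against the guideline before use. The test names and the `lab_positive` function are hypothetical.

```python
# Diagnostic thresholds (verify against the Diabetes Canada guideline before use).
THRESHOLDS = {
    "hba1c_percent": 6.5,
    "fasting_glucose_mmol_l": 7.0,
    "ogtt_2h_glucose_mmol_l": 11.1,
    "random_glucose_mmol_l": 11.1,
}


def lab_positive(results):
    """results: iterable of (test_name, value) pairs for one encounter.

    Returns True if any single result meets or exceeds its threshold.
    Tests that are not in THRESHOLDS are ignored.
    """
    return any(
        name in THRESHOLDS and value >= THRESHOLDS[name]
        for name, value in results
    )
```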

Medication data-based clinical diagnosis algorithm

The medication clinical algorithm included any use of one or more agents that are commonly used to treat diabetes. The list of diabetes medications was derived from Diabetes Canada’s national guidelines, reviewed by clinicians (endocrinologists) and validated against Canada’s Drug Product Database¹⁴ (online supplemental appendix table 1).
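A minimal sketch of the medication rule is shown below. The agent list is a short illustrative subset of well-known antihyperglycaemic agents, not the study's validated list (which is in the online supplemental appendix), and the function name and free-text order format are assumptions.

```python
import re

# Illustrative subset only; the study's full list is in its supplemental appendix.
ANTIDIABETIC_AGENTS = {
    "metformin", "insulin", "gliclazide",
    "sitagliptin", "empagliflozin", "liraglutide",
}


def medication_positive(medication_names):
    """Return True if any medication record mentions a listed agent."""
    for name in medication_names:
        tokens = set(re.findall(r"[a-z]+", name.lower()))
        if tokens & ANTIDIABETIC_AGENTS:
            return True
    return False
```

Matching on whole word tokens rather than substrings avoids flagging unrelated drug names that merely contain an agent's name.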

Inpatient laboratory and medication data-based clinical diagnosis algorithm

This clinical diagnosis algorithm included both laboratory and medication data. Specifically, the absence of diabetes was defined as the highest HbA1c laboratory result below 6.5%¹⁶ ¹⁷ with no evidence of prescribed or fulfilled medications. Pre-diabetes was defined by the highest HbA1c falling within the range of 6.0%–6.4%, or by an oral glucose tolerance test, random plasma glucose test or fasting plasma glucose test adhering to the thresholds listed in the Diabetes Canada guidelines, with no prescribed antidiabetic medications. Diabetes status was categorised as follows: (1) HbA1c ≥6.5% with no evidence of medication; (2) meeting glycaemic targets: HbA1c <7.0%, supported by evidence of both prescribed and dispensed medications; and (3) not meeting glycaemic targets: the highest HbA1c laboratory result closest to discharge >7.0%.¹⁸ Another subgroup of individuals with diabetes was identified as those with appropriately intensified therapy with agents known to confer cardiorenal benefit: (1) a glucagon-like peptide-1 receptor agonist (GLP1RA) for those with obesity or a history of cardiovascular disease or stroke, and (2) a sodium-glucose cotransporter-2 (SGLT2) inhibitor for those with chronic kidney disease (low glomerular filtration rate or albuminuria) or cardiovascular disease. These data were analysed in a time-series context, and all laboratory and medication records were used.
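The HbA1c and medication categories above can be sketched as a simple decision function. This is a simplification under stated assumptions: it uses only the highest HbA1c and a single medication flag, ignores the other glucose tests and the cardiorenal subgroup, treats an HbA1c of exactly 7.0% as meeting targets (the text leaves this value unassigned), and returns placeholder labels when HbA1c is missing. The function and label names are hypothetical.

```python
from typing import Optional


def classify(highest_hba1c: Optional[float], on_medication: bool) -> str:
    """Assign a diabetes category from the highest HbA1c (%) and medication evidence."""
    if on_medication:
        if highest_hba1c is None:
            return "diabetes, control unknown"      # assumption: not defined in the text
        if highest_hba1c <= 7.0:                    # assumption: 7.0% counted as on target
            return "diabetes, meeting targets"
        return "diabetes, not meeting targets"
    if highest_hba1c is None:
        return "unclassified"                       # assumption: not defined in the text
    if highest_hba1c >= 6.5:
        return "diabetes, no medication"
    if highest_hba1c >= 6.0:
        return "pre-diabetes"
    return "no diabetes"
```

The pre-diabetes check runs before the no-diabetes fallback because the 6.0%–6.4% range sits inside the broader "below 6.5%" range.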

NLP clinical notes-based machine learning (ML) algorithm

Free-text notes were cleaned and decoded into American Standard Code for Information Interchange (ASCII) to ensure that the extracted notes were in an analysable format. All free-text notes were then stratified by document type. The default clinical pipeline of clinical Text Analysis and Knowledge Extraction Systems (cTAKES)¹⁹ was used to process the raw text documents into Unified Medical Language System (UMLS) concept unique identifiers (CUIs) for each patient.²⁰ Two algorithms were developed: the first was a CUI search for the diabetes concept, which encompasses its synonyms (eg, diabetes, diabetes mellitus, hyperglycaemia), and the second was a data-driven model of all CUIs extracted from all document types. These CUIs covered anatomical sites, signs/symptoms, procedures, diseases/disorders and medications.
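The first (keyword) algorithm amounts to checking whether any diabetes-related CUI was extracted for a patient. In the sketch below, the two CUIs are believed to correspond to "Diabetes Mellitus" and "Hyperglycemia" in the UMLS but are given for illustration and should be verified against the current UMLS release; the study's actual concept list is not reproduced here, and the function name is hypothetical.

```python
# Illustrative concept set (verify identifiers against the UMLS before use).
DIABETES_CUIS = {
    "C0011849",  # believed to be: Diabetes Mellitus
    "C0020456",  # believed to be: Hyperglycemia
}


def keyword_positive(patient_cuis):
    """Return True if any extracted CUI for the patient is in the diabetes concept set."""
    return not DIABETES_CUIS.isdisjoint(patient_cuis)
```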

A data-driven supervised ML model on all document types and CUIs was developed (figure 1), closely following our previous work.²¹ The Boruta²² feature selection algorithm was applied to reduce the dimension of the CUIs. An XGBoost²³ algorithm was trained against the chart review cohort. The dataset was divided into training and testing sets at an 80:20 ratio, stratified by the diabetes outcome to ensure that a similar ratio between the labels was maintained. Fivefold cross-validation was employed, and a grid search of hyperparameters was conducted. Feature importance was assessed for the top predictive pairs of CUI and document name (ie, a specific CUI in a specific document type) associated with diabetes. The top 20 predictive document type and concept features were identified after fitting the XGBoost algorithm.
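The data partitioning described above can be sketched with the standard library alone. The code below shows only the stratified 80:20 split and stratified fivefold assignment; it does not implement Boruta, XGBoost or the grid search, and the function names and seed handling are assumptions rather than the study's code.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train and test sets, preserving the label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(indices, labels, k=5, seed=0):
    """Assign the given indices to k folds, dealing each label out round-robin."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx in indices:
        by_label[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_label.values():
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds
```

In cross-validation, each fold in turn serves as the validation set while the remaining four are used for fitting, so every hyperparameter combination in the grid is scored five times on the training portion only.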

Figure 1. Clinical Text Analysis and Knowledge Extraction Systems (cTAKES) and XGBoost free-text algorithm. After free-text notes were extracted from the Sunrise Clinical Manager (SCM) electronic medical record (EMR), these notes were processed by document type using cTAKES. Boruta feature selection was employed and an XGBoost classification model was fit. This diagram was adapted and modified from our previous work on hypertension.

Combined algorithms joined the above algorithms with OR statements: an individual was classified as having diabetes if any of the component algorithms was positive.
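A minimal sketch of such a combination, assuming each component algorithm has already produced one boolean flag per individual in the same order (the `combine_or` name is hypothetical):

```python
def combine_or(*flags_per_algorithm):
    """Element-wise logical OR across the per-individual flags of several algorithms."""
    return [any(flags) for flags in zip(*flags_per_algorithm)]
```

Because a combined algorithm is positive whenever any component is positive, it can only add positives relative to each component: sensitivity cannot fall, while every component's false positives are carried over, which tends to lower PPV.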

Evaluation metrics and validation

Several evaluation metrics were calculated to assess the model performance. These metrics included sensitivity (SN), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV). Statistical tests such as t-test, χ² and Kruskal-Wallis one-way analysis of variance test were applied for continuous, categorical and ordinal variables, respectively.
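For reference, the four metrics (and the F1 score reported alongside them in the results) follow directly from the confusion matrix against the chart review labels. The sketch below uses the standard definitions; the function names are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives from paired boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    # Undefined ratios (zero denominator) are reported as NaN.
    return numerator / denominator if denominator else float("nan")


def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV and F1 from confusion-matrix counts."""
    return {
        "SN": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }
```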

Figure 2 schematically presents the process flow from data linkage to algorithm development. Figure 1 depicts the detailed algorithm development process of applying cTAKES to the free-text data. We defined the best-performing algorithm as the one that showed high performance on both SN and PPV.

Figure 2. Flow process of algorithm development. Chart review data were deterministically linked to the discharge abstract database (DAD) and inpatient data. The International Classification of Diseases (ICD) algorithm was developed by Quan *et al*. Laboratory and medication algorithms used Diabetes Canada’s³ established definitions. Medications were ascertained on Canada’s Drug Product Database. The free-text algorithm employed clinical Text Analysis and Knowledge Extraction Systems (cTAKES) for extracting concept unique identifiers (CUIs) from clinical notes, and XGBoost was applied.

Results

Cohort overview

We analysed the charts of 3040 individuals, and their demographic details are summarised in table 1. The median age was 62.5 years and the distribution between males and females was approximately equal. The median body mass index of the cohort was 23.8 kg/m², and 1617 individuals (53.2%) had no Charlson comorbidities. Among these 3040 individuals, 583 (19.2%) had diabetes based on the chart review ‘gold standard’. The cohort with diabetes was, on average, 10 years older than the cohort without diabetes (p<0.01). Within the diabetes cohort, there was a higher proportion of males than females (p<0.01). Additionally, the comorbidity profiles differed between the two groups, with the diabetes subcohort exhibiting a higher prevalence of comorbidities (p<0.01).

Table 1.

Demographics of people with diabetes from the chart review cohort

	Chart review cohort (n=3040)	Diabetes cohort (n=583)	No diabetes cohort (n=2457)	P value
Demographics
Age in years, median (IQR)	62.5 (28.0)	69.0 (19.0)	59.4 (19.6)	<0.01
Sex (F), n (%)	1530 (50.3)	235 (40.3)	1161 (47.3)	<0.01
Body mass index, median (n, IQR)	23.8 (29.4)	24.7 (31.5)	23.8 (28.8)	0.56
Charlson comorbidities				<0.01
0	1617	57 (9.8)	1559 (63.5)
1	896	240 (41.2)	654 (26.6)
2	419	203 (34.8)	214 (8.7)
3+	111	83 (14.2)	30 (1.2)


Feature selection on all document type ML model

The cTAKES system successfully processed 692 918 free-text records across 59 document types within this cohort. The system also extracted negation status and experiencer details, distinguishing between patients and family members. We retained only CUIs that were not negated and had the patient as the experiencer, resulting in a total of 83 107 CUIs. The Boruta method recommended the inclusion of 42 ranked features, with an additional three features identified as tentative. We therefore considered the top 45 ranked features, which constituted the training dataset for the XGBoost model.
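The filtering step and the construction of document type and CUI features can be sketched as follows. The `Mention` record and its field names are hypothetical stand-ins for the negation and experiencer attributes produced by the NLP pipeline, not the actual cTAKES output schema.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    patient_id: str
    document_type: str
    cui: str
    negated: bool
    experiencer: str  # hypothetical values: "patient" or "family_member"


def keep_mention(mention):
    """Keep only affirmed concepts whose experiencer is the patient."""
    return not mention.negated and mention.experiencer == "patient"


def build_features(mentions):
    """Map each patient to the set of (document type, CUI) pairs that survive filtering."""
    features = {}
    for mention in mentions:
        if keep_mention(mention):
            features.setdefault(mention.patient_id, set()).add(
                (mention.document_type, mention.cui)
            )
    return features
```

Keying features on the pair rather than on the CUI alone lets the downstream model distinguish the same concept appearing in, say, a discharge summary from its appearance in a progress note.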

Algorithm performance

Table 2 presents the performance of the diabetes clinical and ML algorithms on the testing dataset. The administrative database ICD-based algorithm yielded an SN of 0.84, SP of 0.98, PPV of 0.93 and NPV of 0.96; the medication data-based clinical algorithm, an SN of 0.89, SP of 0.98, PPV of 0.91 and NPV of 0.98; selected keyword concepts from free-text notes, an SN of 0.73, SP of 0.93, PPV of 0.70 and NPV of 0.93; and the ML algorithm based on free-text notes, an SN of 0.95, SP of 0.98, PPV of 0.94 and NPV of 0.99. The performance of the combined clinical and ML algorithms is also shown in table 2.

Table 2.

Performance of clinical and ML algorithms on the testing dataset (n=609)

Algorithm type	Sensitivity	Specificity	PPV	NPV	F1
ICD (Quan et al)¹³	0.84	0.98	0.93	0.96	0.88
SCM EMR data
Laboratory	0.37	0.96	0.69	0.86	0.48
Medications	0.89	0.98	0.91	0.98	0.90
Free-text (CUI: keywords search)	0.73	0.93	0.70	0.93	0.71
Free-text (all documents; CUI and XGBoost)	0.95	0.98	0.94	0.99	0.95
Combinations
Labs+Meds	0.90	0.95	0.80	0.97	0.85
Labs+Meds + Free text XGBoost	0.97	0.95	0.81	0.99	0.88
Labs+Meds + Free text XGBoost+ICD	0.97	0.94	0.79	0.99	0.87
Medications+free text XGBoost (SCM EMR)	0.97	0.98	0.90	0.99	0.94
Medications+free text XGBoost+ICD algorithm	0.97	0.96	0.87	0.99	0.92


CUIs, concept unique identifiers; EMR, electronic medical record; ICD, International Classification of Diseases; SCM, Sunrise Clinical Manager.

Discussion

This study explored various EMR data-based case definitions for diabetes, uncovering algorithms with excellent performance. We used chart review labels as our gold standard. While the validated administrative data-based ICD-code algorithm demonstrated strong performance, the findings support our hypothesis that harnessing free-text notes can yield comparable or superior results to existing standard methods. The ML algorithm that included all document types of free-text notes was the top performer in this study cohort, with 0.95 SN and 0.94 PPV. Meanwhile, the combination of free-text algorithm, medication, and ICD codes improved the SN to 0.97 but experienced a decline in PPV to 0.87.

The current operational standards for defining diabetes for surveillance (ie, NDSS)¹² and research purposes in Canada were shaped by the administrative data-based ICD code algorithm.¹³ These methodologies rely on readily available, standardised ICD-code databases, like the DAD, established in both national and international settings. In the Canadian context, these DAD records are reliant on the quality of ICD codes produced by the trained coders who review the charts. Diabetes is a chronic condition which is heavily emphasised for ICD coding in Alberta, and yet the algorithms that solely use these codes resulted in a lower SN compared with the free-text algorithm. This discrepancy stems from the fact that ICD coders in Canada primarily review physician documentation from free-text documents within the EMR system for ICD coding, as dictated by the system design. Challenges and limitations encountered in ICD coding have been described in previous studies,²⁴ indicating the information overload experienced by the healthcare system and workers in various areas when dealing with EMR data.

A recent scoping review highlighted that diabetes definitions typically incorporate laboratory and medications data, along with ICD codes.⁸ Laboratory data typically employ values surpassing specific clinical thresholds to determine disease status. When a patient is being treated with antihyperglycaemic medication, these clinical values are presumed not to reach that threshold due to the medication’s effect. In our study, the combined clinical diagnosis algorithm of laboratory and medication data had an SN of 0.90 and a PPV of 0.80, which is comparable to algorithms described in the above-mentioned review. A systematic review²⁵ on the applications of NLP in diabetes care showed that out of 38 studies, 17 aimed to define diabetes, but most of these studies relied on single concept words or keyword-based definitions (ie, diabetes). In our cohort, the keyword algorithm had an SN of 0.73 and a PPV of 0.70, potentially reflecting the quality of documentation or the practice of data being entered into the EMR from the front end. Figure 3 showed that several consistent diabetes-related medication terms (eg, metformin and insulin) were captured across multiple EMR document types. The ML-based algorithm which included all types of free-text documents performed the best in this study cohort, achieving an SN of 0.95 and a PPV of 0.94, which raises several important considerations. The ICD code algorithm had an SN of 0.84 and a PPV of 0.93. Combined algorithms often increased SN but reduced PPV, which was expected.

EMR systems, such as SCM¹ and Connect Care (Alberta’s newly implemented province-wide clinical information system),²⁶ based on Epic software (Madison, WI), typically have a front-end graphical user interface for delivering clinical care. It is important to note that not all healthcare workers or providers have access to complete patient charts, and access is typically determined based on assigned roles in the system. Information overload from EMR data can occur if too much information is given,²⁷ and communication oversight could arise if insufficient information is provided.²⁸ Additionally, the quality of clinical notes documentation can be heavily influenced by interactions between the care providers and patients or their family members, potentially triggering varying sets of orders and interventions documented in the EMR system. This project extracted all free-text notes from the back end of the EMR system and processed these documents using a standardised medical terminology dictionary (ie, UMLS). Our findings demonstrated that various types of healthcare workers and providers are documenting similar medical concepts across multiple EMR document types for diabetes. Therefore, analysing the commonality in documentation across roles to consolidate and centralise information for shared awareness would enhance information flow in clinical care settings and improve downstream processes, such as improving the quality of the administrative health databases.

Current diabetes definitions based on ICD-code databases are not integrated into clinical practices within the Canadian context, as DAD coding systems and EMR systems operate separately from each other. Alberta’s Connect Care clinical information system, which includes Epic-based EMR infrastructure and is now in operation throughout AHS-operated acute care and ambulatory facilities, has the capacity to integrate ML models,²⁹ with potential outputs incorporated into dashboards. The integration of inpatient data-specific case definitions could facilitate easier identification of comorbidities and the design of automated risk prediction algorithms within the EMR, which could be implemented at the point of care as needed. As EMR adoption in Canada continues to rise,⁴ the implementation of EMR data-based diabetes case definitions from both inpatient and outpatient care³⁰ has the potential to enhance the quality of DAD data for diabetes. This, from a research operations standpoint, could assist with cohort selection for epidemiological and clinical studies. The subsequent improvement in the DAD will, in turn, enhance the surveillance capabilities of the NDSS for Alberta in the long run.

This study is not without limitations. First, as we used a single geographic setting, external validation in a different geographical setting is needed. Second, our algorithms do not differentiate between type 1 and type 2 diabetes, the two most common forms of diabetes. With the prevalence of both types increasing, as well as differences in management and care, differentiating between these types is important and will be an area of future work. Third, we appreciate the immaturity of the proposed application in real-life practice, but importantly this study is foundational work for ML in healthcare systems. Fourth, we appreciate the limited interpretability of the prediction model. Importantly, in our study, we demonstrated explainability by showing that the top features (figure 3) coincide with what is documented within clinical practice. This strengthens the application of our model in real-world practice. Finally, we appreciate the lack of system infrastructure to implement models, as existing EMRs often do not have the capacity to implement designed ML models. AHS has recently implemented an Epic-based clinical information system, which has the capacity to integrate ML models into EMR systems, in AHS-operated and partner acute and subacute care sites, ambulatory care locations, clinical lab services and diagnostic imaging areas. That being said, our study has many strengths. These include taking a multimodal EMR data approach to develop a case definition for diabetes and comparing it with existing standards, integrating ML and NLP with EMR data, and using randomly selected chart review data as the gold standard.

Our future studies will expand to include Connect Care data and eventually validate this work in other jurisdictions. Furthermore, we will evaluate the implementation of our ML models into existing clinical information systems. Recent advancements in large language models have increased interest within the NLP field in developing such models for eventual deployment in healthcare systems. While we acknowledge that we have not considered these deep learning NLP models for this study, a future study is in the planning stages, aiming to explore large language model methods on a study cohort with a much larger disease prevalence.¹⁰

Conclusion

As NLP techniques are advancing, there is the potential to leverage them in healthcare, particularly for using free text data within EMRs. As such, we assessed several algorithms and found the free-text algorithm performed the best in this cohort. Determining the ideal algorithm or combinations for implementation would be dependent on the needs, the clinical practice culture and data availability. These types of inpatient EMR-based algorithms for case identification are ideal for timely care delivery and resource planning.

Footnotes

Contributors: SL, HQ and SB conceptualised this study. CAE and DAS provided the chart review data. EAM and SL conducted data extraction and linkage. SL conducted analysis. EAM and JP assisted with analysis. DJTC and AAS refined the laboratory and medications algorithm and reviewed the medications list. SL and SB drafted the manuscript. All authors reviewed the contents of the manuscript. SL, HQ and SB are the guarantors of this study.

Funding: This work was supported by Canadian Institutes of Health Research, Foundation Grant FDN-167272, awarded to HQ.

Competing interests: None declared.

Provenance and peer review: Not commissioned; externally peer reviewed.

Supplemental material: This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

Data availability statement

Data may be obtained from a third party and are not publicly available. Restrictions apply to the availability of these data. Data were obtained from Alberta Health Services and are available with the permission of Alberta Health Services.

Ethics statements

Patient consent for publication

Not applicable.

Ethics approval

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Conjoint Health Research Ethics Board of the University of Calgary (REB15-0790; approved 7 May 2023).

References

1. Lee S, Xu Y, D’Souza AG, et al. Unlocking the potential of electronic health records for health research. Int J Popul Data Sci 2020;5:1123. doi:10.23889/ijpds.v5i1.1123
2. CIHI. Discharge abstract database metadata (DAD). 2023. Available: https://www.cihi.ca/en/discharge-abstract-database-metadata-dad
3. O’Malley KJ, Cook KF, Price MD, et al. Measuring diagnoses: ICD code accuracy. Health Serv Res 2005;40:1620–39. doi:10.1111/j.1475-6773.2005.00444.x
4. Chang F, Gupta N. Progress in electronic medical record adoption in Canada. Can Fam Physician 2015;61:1076–84.
5. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J 2019;6:94–8. doi:10.7861/futurehosp.6-2-94
6. Lee S, Doktorchik C, Martin EA, et al. Electronic medical record–based case phenotyping for the Charlson conditions: scoping review. JMIR Med Inform 2021;9:e23934. doi:10.2196/23934
7. Alberta Health Services. Patient care information system. 2023. Available: https://albertahealthservices.ca/webapps/elearning/CB/Inpatient/a001_topic_1_introduction_to_clinibase_patient_care_information_system.html
8. Lee S, Li B, Martin EA, et al. CREATE: a new data resource to support cardiac precision health. CJC Open 2021;3:639–45. doi:10.1016/j.cjco.2020.12.019
9. Eastwood CA, Southern DA, Khair S, et al. Field testing a new ICD coding system: methods and early experiences with ICD-11 beta version 2018. BMC Res Notes 2022;15:343. doi:10.1186/s13104-022-06238-2
10. De Coster C, Quan H, Finlayson A, et al. Identifying priorities in methodological research using ICD-9-CM and ICD-10 administrative data: report from an international consortium. BMC Health Serv Res 2006;6:77. doi:10.1186/1472-6963-6-77
11. Escorpizo R, Kostanjsek N, Kennedy C, et al. Harmonizing WHO’s International Classification of Diseases (ICD) and International Classification of Functioning, Disability and Health (ICF): importance and methods to link disease and functioning. BMC Public Health 2013;13:742. doi:10.1186/1471-2458-13-742
12. LeBlanc AG, Jun Gao Y, McRae L, et al. At-a-glance - twenty years of diabetes surveillance using the Canadian chronic disease surveillance system. Health Promot Chronic Dis Prev Can 2019;39:306–9. doi:10.24095/hpcdp.39.11.03
13. Quan H, Li B, Couris CM, et al. Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries. Am J Epidemiol 2011;173:676–82. doi:10.1093/aje/kwq433
14. Government of Canada. Drug product database. 2023. Available: https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/drug-product-database.html
15. Punthakee Z, Goldenberg R, et al, Diabetes Canada Clinical Practice Guidelines Expert Committee. Classification and diagnosis of diabetes, prediabetes and metabolic syndrome. Can J Diabetes 2018;42 Suppl 1:S10–5. doi:10.1016/j.jcjd.2017.10.003
16. Morris DH, Khunti K, Achana F, et al. Progression rates from HbA1c 6.0–6.4% and other prediabetes definitions to type 2 diabetes: a meta-analysis. Diabetologia 2013;56:1489–93. doi:10.1007/s00125-013-2902-4
17. Sherwani SI, Khan HA, Ekhzaimy A, et al. Significance of HbA1c test in diagnosis and prognosis of diabetic patients. Biomark Insights 2016;11:95–104. doi:10.4137/BMI.S38440
18. Nunes JPL, DeMarco JP. A 7.0–7.7% value for glycated haemoglobin is better than a <7% value as an appropriate target for patient-centered drug treatment of type 2 diabetes mellitus. Ann Transl Med 2019;7:S122. doi:10.21037/atm.2019.05.43
19. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13. doi:10.1136/jamia.2009.001560
20. McInnes BT, Pedersen T, Carlis J, eds. Using UMLS concept unique identifiers (CUIs) for word sense disambiguation in the biomedical domain. AMIA annual symposium proceedings; American Medical Informatics Association, 2007.
21. Martin EA, D’Souza AG, Lee S, et al. Hypertension identification using inpatient clinical notes from electronic medical records: an explainable, data-driven algorithm study. CMAJ Open 2023;11:E131–9. doi:10.9778/cmajo.20210170
22. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw 2010;36:1–13.
23. Chen T, He T, Benesty M, et al. Xgboost: extreme gradient boosting [R package version 0.4-2]. 2015;1:1–4.
24. Tang KL, Lucyk K, Quan H. Coder perspectives on physician-related barriers to producing high-quality administrative data: a qualitative study. CMAJ Open 2017;5:E617–22. doi:10.9778/cmajo.20170036
25. Turchin A, Florez Builes LF. Using natural language processing to measure and improve quality of diabetes care: a systematic review. J Diabetes Sci Technol 2021;15:553–60. doi:10.1177/19322968211000831
26. Alberta Health Services. Connect care. 2023. Available: https://www.albertahealthservices.ca/cis/cis.aspx
27. Nijor S, Rallis G, Lad N, et al. Patient safety issues from information overload in electronic medical records. J Patient Saf 2022;18:e999–1003. doi:10.1097/PTS.0000000000001002
28. Tiwary A, Rimal A, Paudyal B, et al. Poor communication by health care professionals may lead to life-threatening complications: examples from two case reports. Wellcome Open Res 2019;4:7. doi:10.12688/wellcomeopenres.15042.1
29. Sendak M, Gao M, Nichols M, et al. Machine learning in health care: a critical appraisal of challenges and opportunities. EGEMS (Wash DC) 2019;7:1. doi:10.5334/egems.287
30. Williamson T, Green ME, Birtwhistle R, et al. Validating the 8 CPCSSN case definitions for chronic disease surveillance in a primary care database of electronic health records. Ann Fam Med 2014;12:367–72. doi:10.1370/afm.1644

Supplementary Materials

Supplementary data: bmjhci-2023-100894supp001.pdf (63.9 KB, PDF)

Supplementary data: bmjhci-2023-100894supp002.pdf (101.3 KB, PDF)

Supplementary data: bmjhci-2023-100894supp003.pdf (30.6 KB, PDF)

Supplementary data: bmjhci-2023-100894supp004.pdf (29.5 KB, PDF)

