Abstract
Primary care electronic medical records (EMRs) contain rich data that can support proactive identification of chronic health conditions. However, leveraging unstructured EMR data requires the use of novel computational methods. We applied natural language processing and machine learning (ML) techniques to structured and unstructured EMR data to detect arthritis, chronic kidney disease, diabetes, hypertension, and respiratory diseases. Using data from 449 community-dwelling older adults in one Canadian primary care clinic, we developed an analytical pipeline that included preprocessing of unstructured data, Latent Dirichlet Allocation topic modelling, and supervised ML models (regularized logistic regression [RLR], support vector machine [SVM], artificial neural networks [ANNs]) with class-weighted learning and the Synthetic Minority Oversampling Technique to address class imbalance. Integrating unstructured clinical notes improved model performance, particularly for conditions often under-coded in structured data. For example, the area under the receiver operating characteristic curve increased from 0.724 to 0.841 for SVM classifiers in arthritis detection and from 0.733 to 0.890 for ANNs in respiratory disease detection. Less pronounced improvements were observed for diabetes, hypertension, and CKD. These findings highlight that while performance gains from unstructured data vary by condition, leveraging these data can improve disease detection in primary care EMR data.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-026-38594-5.
Keywords: Primary care, Electronic medical records, Natural language processing, Machine learning, Text mining, Chronic conditions
Subject terms: Health care, Computer science
Introduction
Effective management of chronic diseases in primary care relies on the accurate identification of patients at risk of these diseases to support early intervention. However, this goal is often undermined by the fragmented nature of data captured in electronic medical records (EMRs). Although Canadian EMR systems capture a broad spectrum of patient information1, essential diagnostic details (i.e., symptom descriptions, family history, lifestyle details) that are required for proactive identification and management of chronic conditions are frequently available only in unstructured (or free-text) clinical notes2–4. These clinical notes cannot be easily processed by conventional rule-based machine learning (ML) algorithms, which rely on structured data inputs, such as laboratory results and International Classification of Diseases (ICD) codes5–10. Compounding the issue is the fact that ICD codes are used inconsistently and mostly for billing purposes in Canadian primary care11,12. Recent cohort studies from Ontario underscore the limitations of relying exclusively on structured EMR data for chronic disease surveillance13,14.
Advances in computational methods, particularly natural language processing (NLP) and ML, offer new opportunities to leverage the rich data found in unstructured clinical notes. While using both structured and unstructured data has been advocated for disease detection15–18, existing studies tend to limit disease identification efforts to leveraging structured data only19, largely due to its ease of extraction and analysis. Nevertheless, several studies have demonstrated the utility of traditional supervised ML models—logistic regression (LR), support vector machines (SVM), and artificial neural networks (ANNs)—in detecting complex chronic conditions, such as type 2 diabetes mellitus (T2DM)20, conditions associated with frailty (i.e., mortality, disability, urgent hospitalization, fracture, preventable hospitalization, and accessing the emergency department)21, and cognitive decline22.
In this study, we address this gap by applying a combination of NLP and ML techniques to structured and unstructured data extracted from Canadian primary care EMRs to identify five common chronic conditions: hypertension, diabetes, arthritis, chronic kidney disease (CKD), and respiratory diseases. To evaluate the potential performance gains from integrating unstructured clinical notes, we compared estimates from three ML models, regularized logistic regression (RLR), support vector machines (SVM), and artificial neural networks (ANNs), trained on structured data alone versus models trained on a combination of structured and unstructured data.
Method
We analyzed structured and unstructured EMR data from 449 community-dwelling adults aged 60 years or older who received care in one primary care clinic in Alberta, Canada. This clinic serves as a training site for medical residents and students and is a sentinel site within the Northern Alberta Primary Care Research Network (NAPCReN), a regional arm of the Canadian Primary Care Sentinel Surveillance Network (CPCSSN)23,24. CPCSSN extracts and de-identifies structured and unstructured EMR data from primary care clinics across Canada using established cleaning, coding, and transformation protocols to ensure high-quality data for research and surveillance. Structured data comprised ICD-9 billing codes, medication lists, laboratory results, and encounter diagnoses, and unstructured data included timestamped clinical notes (e.g., patient complaints, progress notes, diagnostic assessments, reasons for visits). The data comprised 1,872,562 records documented in the clinic’s EMR between January 1, 2013, and December 31, 2021. On average, each patient contributed over 4,100 records to the final dataset. The study was approved by the University of Alberta Research Ethics Board (ID: Pro00088808), with informed consent obtained from the data custodians (i.e., physicians participating in the study).
An overview of the analytical workflow is presented in Fig. 1, with a detailed data visualization analysis available in Figures A2–A4 in the Supplementary File. The analytical plan included: (1) pre-processing of unstructured data using tidy text mining and NLP techniques; (2) topic modelling using Latent Dirichlet Allocation (LDA) for all five chronic conditions; (3) training and validation of ML models for the five chronic conditions.
Fig. 1.
A flowchart of the NLP- and ML-based chronic disease detection system.
Pre-processing of unstructured EMR data
To prepare clinical notes for text mining and condition detection using NLP techniques, we removed code artifacts such as “xxxx”; converted all text to lowercase to prevent case sensitivity issues; applied stemming and lemmatization to reduce words to their base or root forms (e.g., “testing” to “test”) and thus reduce vocabulary size and improve ML model performance; broke down text into individual words or tokens; expanded contractions to their full forms to standardize the text; removed punctuation marks; used the standard tm R package25,26 to filter out stop words (e.g., articles, prepositions, pronouns, common adverbs)27; removed non-alphanumeric words and characters; and replaced medical abbreviations with their expanded forms to enhance consistency and interpretability.
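The study implemented these steps in R (tm package); purely as an illustration, the same pipeline can be sketched in Python. The abbreviation map, stop-word list, and contraction table below are illustrative placeholders, not the resources used in the study:

```python
import re

# Illustrative resources (placeholders, not the study's actual lists)
ABBREVIATIONS = {"htn": "hypertension", "dm": "diabetes mellitus"}
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "was", "is"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def preprocess(note: str) -> list[str]:
    text = note.lower()                                  # prevent case-sensitivity issues
    text = re.sub(r"x{2,}", " ", text)                   # strip code artifacts such as "xxxx"
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove punctuation / non-alphanumerics
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]  # expand medical abbreviations
    tokens = " ".join(tokens).split()                    # expansions may be multi-word
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    # crude suffix stripping as a stand-in for full stemming/lemmatization
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

print(preprocess("Pt xxxx can't tolerate TESTING for HTN."))
# → ['pt', 'cannot', 'tolerate', 'test', 'hypertension']
```

In practice, a full stemmer/lemmatizer and a curated abbreviation dictionary would replace the toy rules above.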
Once data cleaning was complete, we developed an ad-hoc rule-based method that leverages regular expressions and linguistic cues to detect negation words (e.g., “not,” “no,” “never,” “without,” “lack,” “failed,” “recover”). This rule-based approach does not require large amounts of labelled training data27,28 and was shown to outperform more sophisticated, ML-based techniques29–31. To create our own rule-based algorithm tailored to the EMR notes used in this project, we manually curated a list of regular expressions representing negation triggers (e.g., “failed,” “recover,” “no evidence for,” “was ruled out”). Medical terms that appeared immediately before or after negation triggers were considered negated. In random samples of 100, 150, and 200 sentences extracted from EMR notes, our rule-based algorithm successfully identified 95% of all negations flagged by NZ.
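A simplified version of such a rule-based negation detector can be sketched as follows; the trigger list here is an illustrative subset, not the study's full curated list, and the 3-word context window is an assumption:

```python
import re

# Illustrative negation triggers (a subset of a curated list)
NEGATION_TRIGGERS = ["no evidence for", "was ruled out", "no", "not", "never",
                     "without", "denies", "failed", "recover"]

def negated_terms(sentence: str, terms: list[str]) -> set[str]:
    """Flag medical terms whose immediate context contains a negation trigger."""
    words = re.findall(r"[a-z]+", sentence.lower())
    flagged = set()
    for i, w in enumerate(words):
        if w in terms:
            # words immediately before/after the candidate term (3-word window)
            ctx = " " + " ".join(words[max(0, i - 3):i] + words[i + 1:i + 4]) + " "
            if any(f" {t} " in ctx for t in NEGATION_TRIGGERS):
                flagged.add(w)
    return flagged

print(negated_terms("Patient denies chest pain; no evidence for pneumonia.",
                    ["pain", "pneumonia", "hypertension"]))
```

The padded-substring check lets multi-word triggers such as "no evidence for" match inside the context window while avoiding partial-word matches (e.g., "no" inside "normal").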
Condition-specific terms identification
To inform the NLP algorithms, the authors with clinical expertise (MA, SK) compiled a comprehensive list of condition-specific terms, including relevant keywords, synonyms, diagnostic codes, and abbreviations. To identify these condition-specific terms in EMR data, we employed a stepped approach that identified and matched words in each paragraph with our list of descriptions: (1) Short terms (i.e., those with fewer than two characters) were omitted, and stemming was applied to the rest of the terms (i.e., only stems of these terms remained in the unstructured data). (2) Compound terms in the descriptions were prioritized for matching, as they often carry more specific meanings. (3) When compound terms had the same number of words, the more commonly used terms were given higher matching priority.
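Steps (1) and (2) of this matching procedure can be sketched as below; the condition lexicon shown is a made-up placeholder for the clinician-curated list, and the frequency-based tie-breaking of step (3) is omitted for brevity:

```python
import re

# Illustrative condition lexicon (a placeholder for the clinician-curated list)
CONDITION_TERMS = {
    "diabetes": ["diabetes mellitus", "insulin", "metformin", "t2dm"],
    "hypertension": ["high blood pressure", "hypertension", "ramipril"],
}

def match_terms(paragraph: str, lexicon: dict[str, list[str]]) -> dict[str, list[str]]:
    text = " " + re.sub(r"[^a-z0-9\s]", " ", paragraph.lower()) + " "
    hits: dict[str, list[str]] = {c: [] for c in lexicon}
    for condition, terms in lexicon.items():
        usable = [t for t in terms if len(t) >= 2]         # step 1: drop very short terms
        # step 2: compound (multi-word) terms get matching priority
        for term in sorted(usable, key=lambda t: -len(t.split())):
            if f" {term} " in text:
                hits[condition].append(term)
                text = text.replace(f" {term} ", "  ")     # consume match to avoid double-counting
    return hits
```

Consuming each matched span ensures that, for example, "diabetes mellitus" is not also counted again as the unigram "diabetes".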
Tidy text mining
To quantify the importance of words associated with specific chronic conditions, a document-term matrix (DTM) was constructed. In this matrix, each row corresponded to an observation (i.e., a patient’s record), each unique term in the corpus was treated as a distinct input variable, and the numerical values represented the frequency of each term within the unstructured data. The sparsity of the DTM (i.e., many zero values) resulted from the fact that not all terms appeared in every observation. Terms that were absent from more than 99.5% of EMR notes were removed under the assumption that such rare terms were unlikely to be relevant or informative, as per existing recommendations32.
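The DTM construction and sparse-term filtering described above (performed in R in the study) can be illustrated with a minimal Python sketch; the toy corpus and the lowered sparsity threshold in the usage example are for demonstration only:

```python
from collections import Counter

def build_dtm(docs: list[list[str]], max_sparsity: float = 0.995):
    """Build a document-term matrix; drop terms absent from more than max_sparsity of docs."""
    counts = [Counter(doc) for doc in docs]
    n = len(docs)
    vocab = sorted({t for c in counts for t in c})
    # keep a term only if it appears in at least (1 - max_sparsity) of documents
    keep = [t for t in vocab if sum(t in c for c in counts) / n >= 1 - max_sparsity]
    dtm = [[c[t] for t in keep] for c in counts]
    return keep, dtm

# Toy example with a 0.5 sparsity threshold: "knee" appears in only 1 of 4 notes and is dropped
terms, dtm = build_dtm([["pain", "knee"], ["pain"], ["cough"], ["pain", "cough"]],
                       max_sparsity=0.5)
print(terms)  # → ['cough', 'pain']
```

With the study's 99.5% threshold, only terms appearing in at least 0.5% of notes are retained.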
After removing sparse terms, the remaining features were weighted using Term Frequency–Inverse Document Frequency (TF-IDF), a statistic that reflects how important a word is in relation to a specific condition within the collection of EMR notes. TF-IDF was selected for its transparency and interpretability in highlighting condition-specific terms, both of which are critical for clinical validation. More complex embedding approaches (e.g., Word2Vec, BERT) generally require larger datasets to achieve robust performance, which was beyond the scope of this study. As shown in Fig. 2, TF-IDF helped identify key descriptive words associated with each chronic condition. For instance, datasets labelled with “Arthritis” commonly included terms such as “motion,” “bilaterally,” “respiratory,” “osteoarthritis,” and “disorders,” while datasets labelled with “Diabetes” contained keywords “metformin,” “glargine,” and “mellitus.”
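The TF-IDF weighting can be written out directly; this sketch uses the classic tf(t, d) × log(N / df(t)) formulation (the exact normalization used in the study's R implementation may differ), with a made-up toy corpus:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Per-document TF-IDF scores: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))    # document frequency of each term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

docs = [["metformin", "glargine", "diabetes"], ["diabetes", "knee"], ["knee", "motion"]]
s = tf_idf(docs)
# "metformin" appears in one note only, so it outscores the more common "diabetes"
print(s[0]["metformin"] > s[0]["diabetes"])  # → True
```

Terms concentrated in notes for one condition (e.g., "metformin" for diabetes) thus receive higher weights than terms spread across many notes.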
Fig. 2.
Words ranked by TF-IDF scores in clinical notes associated with each chronic condition.
To visualize the relationships among words and conditions simultaneously, rather than focusing on only the top few words for a single chronic condition at a time, we constructed a correlation network to reveal structural patterns within EMR notes (Figure A1 in Supplementary file). For example, the five chronic conditions studied, CKD, Diabetes, Respiratory, Arthritis, and Hypertension, form distinct central nodes, each surrounded by related descriptive terms. Notably, the words “respiratory,” “disorders,” and “hypertension” are highly frequent and appear across several chronic conditions: i.e., Respiratory diseases, Arthritis, and Hypertension. Moreover, words clustered around each chronic condition node align with the high TF-IDF terms identified in Fig. 2, thus reinforcing their relevance to each chronic condition.
Although Fig. 2 highlights the importance of individual words (unigrams) associated with each chronic condition, some high TF-IDF scoring words carry limited meaning. To generate more accurate and contextually rich descriptions for each condition, we applied bigram text mining, tokenizing EMR notes into pairs of consecutive words (bigrams) instead of individual tokens. As shown in Fig. 3, bigrams with TF-IDF resulted in more meaningful descriptions. For example, EMR notes labelled with “Arthritis” include bigrams “allied disorders,” “upper respiratory,” and “respiratory infections,” while those labelled with “Diabetes” include “insulin glargine,” “diabetes mellitus,” and “insulin dependent.”
Fig. 3.
Bigrams with the highest TF-IDF scores in clinical notes associated with each chronic condition.
Application of ML techniques
Unsupervised topic (chronic condition) modelling
Topic modelling is an unsupervised method that groups text data based on similar words they contain. Latent Dirichlet Allocation (LDA) is a widely used approach for performing topic modelling. It treats each document as a mixture of topics, and each topic as a mixture of words, thus allowing documents to share content across multiple topics, which mirrors the way language is typically used35. We conducted LDA-based chronic condition modelling using the DTM. The dominant word distributions for each topic are shown in Fig. 4. For example, the most frequent words in Topic 1, i.e. “respiratory,” “acute,” “infections,” and “peripheral”, suggest Arthritis. Topic 3, characterized by words such as “diabetes,” “mellitus,” and “metformin,” clearly corresponds to Diabetes. Topic 5, characterized by “hypertension,” “ramipril,” and “amlodipine,” represents Hypertension.
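The study performed LDA in R; to make the mechanics concrete, the following is a self-contained toy collapsed Gibbs sampler (an illustrative stand-in, with made-up example notes) showing how per-document topic probabilities — the "gamma" values discussed below — arise:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA (illustrative, not the study's implementation)."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})                     # vocabulary size
    z = [[rng.randrange(k) for _ in d] for d in docs]         # topic assignment per token
    ndk = [[0] * k for _ in docs]                             # document-topic counts
    nkw = [defaultdict(int) for _ in range(k)]                # topic-word counts
    nk = [0] * k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]; ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                                   # remove token's current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # resample its topic from the collapsed conditional
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # per-document topic probabilities ("gamma")
    gamma = [[(ndk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    return gamma, nkw

docs = [["diabetes", "metformin", "insulin"] * 2,
        ["hypertension", "ramipril", "amlodipine"] * 2,
        ["diabetes", "metformin", "hypertension"]]
gamma, topic_words = lda_gibbs(docs, k=2, iters=50)
```

Each row of `gamma` sums to 1, and a mixed note (the third toy document) receives probability mass on both topics, mirroring the cross-topic overlap seen in Fig. 5.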
Fig. 5.
Gamma probabilities for each clinical note across the five LDA topics (x-axis in each subplot: topic index 1–5; y-axis in each subplot: topic probability).
It is important to note that some terms, such as “respiratory” and “hypertension”, appear across multiple chronic conditions. This reflects a key advantage of LDA-based chronic condition modelling over hard clustering methods, as it acknowledges that the same terminology can be used across different clinical conditions36. Another advantage of the LDA-based modelling is its ability to highlight meaningful differences between topics. For example, words related to ‘respiratory’ and ‘acute’ appear predominantly in Topic 1, while ‘hypertension,’ ‘ramipril,’ and ‘amlodipine’ are more prominent in Topic 5.
To validate our unsupervised LDA-based chronic condition model, we compared the LDA-generated topics (Fig. 4) with a comprehensive list of condition-specific terms identified by clinicians on the team. Unlike the TF-IDF weighting with bigram tokenization used for keyword visualization (Fig. 3), LDA topic modelling was performed on unigrams to optimize topic coherence and interpretability while avoiding sparsity issues. To evaluate both topic coherence and interpretability, we computed topic-level coherence scores using the top 10 terms per topic. The LDA model yielded coherence scores of −0.84, −0.72, −1.09, −0.46, and −1.07 for Topics 1 to 5, respectively. These negative values are expected and fall within acceptable interpretability ranges for clinical notes. To assess the effectiveness of unsupervised learning in identifying the five chronic conditions, we examined the per-document-per-topic probabilities, denoted as gamma (Fig. 5). The x-axis of Fig. 5 represents the LDA topic index (Topics 1–5), while the y-axis indicates the gamma probability (topic contribution) for each clinical note. Ideally, the words associated with each chronic condition should be primarily (or exclusively) generated from the corresponding condition. Figure 5 suggests that some words from the “Respiratory” and “Arthritis” chronic conditions are also associated with other topics, which aligns with the interpretation of Fig. 4, where some overlap between chronic conditions was observed.
Fig. 4.
LDA-generated topics for each chronic condition.
Although LDA-generated topics were not included as features in the supervised models, topic modeling was used as an interpretive tool to validate the clinical relevance of extracted keywords and highlight the thematic structure of EMR notes. This step ensured that the NLP-derived features used for model training were consistent with clinical expertise and meaningful for chronic disease identification.
Remark: Figure 4 shows the dominant word distributions for the five latent topics identified via Latent Dirichlet Allocation (LDA) applied to clinical notes. Each panel corresponds to one latent topic (Topics 1–5), as automatically generated by the LDA model. The top-ranked words reveal the underlying chronic condition themes: Topic 1 is characterized by terms such as “respiratory,” “acute,” and “infections,” suggesting a Respiratory/Arthritis-related condition; Topic 3, dominated by “diabetes,” “mellitus,” and “metformin,” corresponds to Diabetes; Topic 5, with “hypertension,” “ramipril,” and “amlodipine,” represents Hypertension. Topics 2 and 4 are associated with other clinically relevant conditions (e.g., COPD and Osteoarthritis). The numerical labels (1–5) are standard LDA topic indices and are retained to emphasize the unsupervised nature of the model rather than manual condition labeling.
Supervised ML models for classification of chronic conditions
A trained research assistant (RAD) reviewed and labelled patient records with five chronic conditions. When uncertain, RAD consulted with SK, who conducted a secondary review of the records and made the final decision about the presence or absence of a specific condition. Patients without and with a condition were assigned a ‘0’ and ‘1’, respectively. Among 449 patients, 121 were identified as having diabetes, 225 – hypertension, 189 – arthritis, 82 – CKD, and 102 – respiratory diseases. The proportion of patients with diabetes, CKD, and respiratory diseases indicates class imbalance37,38, as the number of patients with these conditions is much smaller than the number of patients without them. To analyze class-imbalanced datasets, class-based weighting of misclassification errors (i.e., class-weighted learning [CWL]) or data resampling techniques (such as the Synthetic Minority Oversampling Technique [SMOTE]) have been recommended33. We used both CWL and SMOTE techniques and compared the results with those obtained in the original imbalanced datasets.
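Both imbalance strategies can be sketched briefly. The inverse-frequency weighting formula below is one common CWL choice (an assumption, not necessarily the study's exact scheme), and the SMOTE sketch interpolates between a minority point and a random other minority point rather than a true k-nearest neighbour:

```python
import random

def class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights: n / (n_classes * n_c), a common CWL choice."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}

def smote(minority: list[list[float]], n_new: int, seed: int = 0) -> list[list[float]]:
    """Minimal SMOTE sketch: interpolate between two minority-class points.

    Full SMOTE draws the second point from the first point's k nearest neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()                                   # random interpolation factor
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# 8 negatives vs. 2 positives: the minority class receives the larger weight
print(class_weights([0] * 8 + [1] * 2))  # → {0: 0.625, 1: 2.5}
```

Misclassifying a minority-class patient thus costs the model four times as much as misclassifying a majority-class patient in this toy example.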
Following pre-processing of unstructured EMR notes and extraction of condition-specific keywords, we combined these NLP-derived features with structured EMR data (ICD-9 billing codes, medication lists, and laboratory values) to form a single document-term matrix. This integrated dataset, which captures both structured clinical variables and text-derived features, served as the input for the supervised machine learning models (RLR, SVM, and ANN) (Fig. 1). To train and validate the supervised ML models, the integrated dataset was split into training and test sets of 359 (80%) and 90 (20%) randomly selected patients, respectively. We trained the following supervised ML classifiers on the CWL-balanced, SMOTE-balanced, and original (imbalanced) datasets: regularized logistic regression (RLR), support vector machine (SVM), and artificial neural networks (ANNs). All NLP features, ML classifiers, and imbalanced-data learning strategies used in this study are listed in Supplementary file Table A1.
Unlike traditional logistic regression, RLR incorporates a penalty term for each regression coefficient to minimize the impact of individual features on the overall model’s performance and prevent overfitting (detailed formulas for RLR are presented in Supplementary file), particularly in datasets with many features34. SVM models a linear decision boundary (i.e., a hyperplane) that separates different classes in a high-dimensional feature space39. The parameters of SVM models determine how the data are transformed into this high-dimensional space and how the decision boundary is determined. ANNs consist of interconnected artificial neurons arranged in layers, where each neuron receives inputs, computes a weighted sum, applies an activation function to transform the result, and passes it on to the next layer. The weights of neurons and their arrangement within the network affect the model’s performance. Hyperparameter tuning for SVM and ANN models was conducted via grid search with 10-fold cross-validation on the training dataset, using parameter ranges and step sizes specified in Table A2.
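To make the RLR penalty term concrete, here is a minimal pure-Python sketch of L2-regularized logistic regression trained by batch gradient descent (an illustration of the general technique, not the study's implementation; hyperparameters are arbitrary):

```python
import math

def train_rlr(X, y, lam=0.01, lr=0.5, epochs=500):
    """L2-regularized logistic regression via batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            pred = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = pred - yi
            for j in range(p):
                gw[j] += err * xi[j]
            gb += err
        # the penalty term lam * w shrinks each coefficient toward zero
        # (the intercept b is conventionally not penalized)
        w = [wj - lr * (gwj / n + lam * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(X, w, b):
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in X]
```

The `lam * w` term in the weight update is what distinguishes RLR from plain logistic regression: it limits the influence of any single feature and reduces overfitting in wide, sparse feature matrices such as a DTM.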
Evaluation metrics
Although accuracy (i.e., the number of correctly classified patients divided by the total number of patients) is one of the most commonly used evaluation metrics, it is not recommended when measuring performance on imbalanced classes40, as is the case in our study. Therefore, our interpretation focuses on the following metrics: sensitivity, positive predictive value (PPV), area under the receiver operating characteristic curve (AUROC), and F1 score. Sensitivity is the proportion of actual positive cases that are correctly classified by the model. PPV (or precision) is the proportion of true positives among all cases predicted as positive. AUROC values summarize the model’s ability to discriminate between positive and negative cases across all possible classification thresholds. F1 score is the harmonic mean of sensitivity and PPV (see Eq. (1)) and is a recommended evaluation metric to report in the context of imbalanced data41.
F1 Score = (2 × Sensitivity × PPV) / (Sensitivity + PPV)  (1)
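These metrics follow directly from confusion-matrix counts, as the short sketch below shows:

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Sensitivity, PPV, and F1 score from binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    f1 = 2 * sensitivity * ppv / (sensitivity + ppv)             # Eq. (1)
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}

m = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
# tp=2, fn=1, fp=1 → sensitivity = ppv = f1 = 2/3
```

Note that none of these quantities reward correct classification of the majority (negative) class, which is why they are preferred over accuracy for imbalanced data.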
Cross-validation
The 10-fold cross-validation technique was employed to evaluate model performance and detect overfitting. Specifically, the dataset was randomly divided into 10 equal-sized folds: nine folds were used for model training and one for validation. To optimize the model hyperparameters and maximize the classification accuracy of the models, iterative adjustment and testing were performed in 200 bootstrap samples of the training dataset. The final models, with the selected parameters (Supplementary Table A2), were then evaluated in the test dataset.
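The fold construction underlying this procedure can be sketched as follows (a generic illustration of k-fold partitioning, not the study's exact R code):

```python
import random

def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> list[list[int]]:
    """Randomly partition n sample indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Each fold serves once as the validation set; the remaining k-1 folds form
# the training set:
folds = k_fold_indices(359, k=10)
for v in range(10):
    val = folds[v]
    train = [i for f in range(10) if f != v for i in folds[f]]
```

Hyperparameter candidates are then scored by their average validation performance across the 10 folds before the final model is refit on the full training set and evaluated on the held-out test set.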
Results
We evaluated model performance using CWL and SMOTE class imbalance strategies and compared them to model performance in the original (imbalanced) data. This section focuses on the results from the SMOTE-balanced models only (for results in the original and CWL-balanced datasets, see Table A3 and Table A4 in Supplementary File 1). The ML models (RLR, SVM, ANNs) were trained on (1) structured EMR data alone (Table 1) and (2) both structured and unstructured EMR data (Table 2). Incorporating unstructured EMR data resulted in substantial improvements in model performance (particularly AUROC and F1 scores) across all five chronic conditions. For example, AUROC values increased from 0.733 to 0.890 in ANNs models for respiratory diseases detection, and from 0.724 to 0.841 in SVM models for arthritis detection. Less pronounced but notable improvements were observed for CKD, hypertension, and diabetes.
Table 1.
Performance of supervised ML models (i.e., regularized logistic regression, support vector machine, and artificial neural networks) for the identification of five chronic conditions in SMOTE-balanced structured EMR data.
| Model | AUROC [95% CI] | Sensitivity [95% CI] | PPV [95% CI] | F1 Score [95% CI] |
|---|---|---|---|---|
| Arthritis | ||||
| RLR | 0.720 [0.501, 0.871] | 0.799 [0.557, 0.895] | 0.833 [0.822, 0.980] | 0.550 [0.466, 0.692] |
| SVM | 0.724 [0.515, 0.892] | 0.810 [0.647, 0.912] | 0.827 [0.806, 0.900] | 0.532 [0.510, 0.664] |
| ANNs | 0.731 [0.517, 0.891] | 0.796 [0.560, 0.902] | 0.824 [0.812, 0.903] | 0.595 [0.433, 0.620] |
| CKD | ||||
| RLR | 0.667 [0.472, 0.831] | 0.446 [0.000, 0.851] | 0.775 [0.756, 0.956] | 0.600 [0.566, 0.690] |
| SVM | 0.656 [0.542, 0.873] | 0.417 [0.000, 0.832] | 0.762 [0.739, 0.907] | 0.578 [0.516, 0.597] |
| ANNs | 0.687 [0.511, 0.895] | 0.490 [0.000, 0.857] | 0.812 [0.619, 0.920] | 0.582 [0.521, 0.620] |
| Diabetes | ||||
| RLR | 0.949 [0.915, 1.000] | 0.917 [0.714, 1.000] | 0.917 [0.907, 1.000] | 0.635 [0.603, 0.677] |
| SVM | 0.916 [0.796, 0.977] | 0.872 [0.791, 0.942] | 0.905 [0.851, 1.000] | 0.612 [0.590, 0.667] |
| ANNs | 0.907 [0.799, 0.953] | 0.871 [0.713, 0.922] | 0.916 [0.879, 1.000] | 0.610 [0.521, 0.652] |
| Hypertension | ||||
| RLR | 0.901 [0.737, 0.968] | 0.922 [0.816, 0.975] | 0.833 [0.679, 0.865] | 0.531 [0.417, 0.679] |
| SVM | 0.897 [0.798, 0.931] | 0.932 [0.862, 0.961] | 0.831 [0.678, 0.846] | 0.655 [0.510, 0.697] |
| ANNs | 0.881 [0.787, 0.955] | 0.919 [0.793, 0.961] | 0.828 [0.601, 0.874] | 0.609 [0.501, 0.632] |
| Respiratory diseases | ||||
| RLR | 0.703 [0.571, 0.839] | 0.449 [0.000, 0.736] | 0.770 [0.714, 0.893] | 0.573 [0.355, 0.779] |
| SVM | 0.709 [0.579, 0.846] | 0.509 [0.495, 0.738] | 0.795 [0.742, 0.851] | 0.596 [0.469, 0.792] |
| ANNs | 0.733 [0.628, 0.871] | 0.664 [0.475, 0.818] | 0.831 [0.770, 0.901] | 0.582 [0.511, 0.827] |
Abbreviations: ANNs: Artificial neural networks; AUROC: Area under the receiver operating characteristic curve; CI: Confidence interval; RLR: Regularized logistic regression; PPV: Positive predictive value; SVM: Support vector machine.
Table 2.
Performance of supervised ML models (i.e., regularized logistic regression, support vector machine, and artificial neural network) for the identification of five chronic conditions using a combination of SMOTE-balanced structured and unstructured EMR data.
| Model | AUROC [95% CI] | Sensitivity [95% CI] | PPV [95% CI] | F1 Score [95% CI] |
|---|---|---|---|---|
| Arthritis | ||||
| RLR | 0.784 [0.653, 0.898] | 0.788 [0.633, 0.917] | 0.878 [0.691, 0.882] | 0.679 [0.501, 0.780] |
| SVM | 0.841 [0.727, 0.939] | 0.848 [0.720, 0.964] | 0.873 [0.829, 0.980] | 0.698 [0.567, 0.721] |
| ANNs | 0.833 [0.717, 0.901] | 0.784 [0.645, 0.899] | 0.842 [0.611, 0.857] | 0.695 [0.651, 0.750] |
| CKD | ||||
| RLR | 0.684 [0.477, 0.886] | 0.600 [0.500, 0.778] | 0.797 [0.732, 0.899] | 0.698 [0.591, 0.701] |
| SVM | 0.681 [0.468, 0.895] | 0.429 [0.000, 0.833] | 0.791 [0.721, 0.818] | 0.774 [0.572, 0.816] |
| ANNs | 0.707 [0.558, 0.920] | 0.566 [0.000, 0.876] | 0.873 [0.795, 0.910] | 0.582 [0.468, 0.732] |
| Diabetes | ||||
| RLR | 0.927 [0.775, 1.000] | 0.923 [0.750, 1.000] | 0.926 [0.923, 1.000] | 0.622 [0.504, 0.681] |
| SVM | 0.979 [0.944, 1.000] | 0.923 [0.750, 1.000] | 0.914 [0.806, 1.000] | 0.746 [0.599, 0.790] |
| ANNs | 0.919 [0.904, 0.998] | 0.944 [0.811, 1.000] | 0.921 [0.914, 1.000] | 0.537 [0.475, 0.782] |
| Hypertension | ||||
| RLR | 0.932 [0.865, 0.983] | 0.957 [0.864, 1.000] | 0.868 [0.567, 0.900] | 0.640 [0.522, 0.703] |
| SVM | 0.867 [0.757, 0.964] | 0.957 [0.864, 1.000] | 0.844 [0.732, 0.952] | 0.775 [0.591, 0.798] |
| ANNs | 0.917 [0.781, 0.943] | 0.954 [0.873, 1.000] | 0.824 [0.732, 0.952] | 0.681 [0.616, 0.697] |
| Respiratory diseases | ||||
| RLR | 0.853 [0.727, 0.967] | 0.733 [0.471, 0.944] | 0.846 [0.838, 0.971] | 0.578 [0.442, 0.602] |
| SVM | 0.852 [0.726, 0.966] | 0.733 [0.278, 0.778] | 0.871 [0.850, 1.000] | 0.696 [0.570, 0.736] |
| ANNs | 0.890 [0.701, 0.897] | 0.761 [0.447, 0.800] | 0.918 [0.872, 1.000] | 0.606 [0.494, 0.729] |
Abbreviations: ANNs: Artificial neural networks; AUROC: Area under the receiver operating characteristic curve; CI: Confidence interval; RLR: Regularized logistic regression; PPV: Positive predictive value; SVM: Support vector machine.
Similarly, we observed improvements in F1 scores: for arthritis, F1 score increased from 0.550 to 0.679 in RLR models, and from 0.532 to 0.698 in SVM models; for CKD, F1 scores improved from 0.578 to 0.774 in SVM models. Sensitivity increased for conditions that are not systematically captured in structured EMR fields, such as arthritis and respiratory diseases. For instance, sensitivity for arthritis detection increased from 0.810 to 0.848 in SVM models, and for respiratory diseases, from 0.664 to 0.761 in ANNs models. See Fig. 6 for a visual representation of improvements in model performance. We used paired t-tests to examine changes in AUROC and F1 scores in models based on structured data alone (Table 1) and those supplemented by unstructured data (Table 2). The improvements in model performance were statistically significant: AUROC increased from 0.785 to 0.844 (p = 0.002), and F1 score, from 0.589 to 0.667 (p = 0.001).
Fig. 6.
Improvements in classification AUROC and F1 scores with unstructured EMR data.
We also observed that RLR and SVM models achieved comparable, and in some cases superior, performance to ANN models. For example, the SVM model achieved an F1 score of 0.698 for arthritis detection (vs. 0.695 in ANNs) and 0.774 for CKD detection (vs. 0.582 in ANNs) (Table 2).
Discussion
This study aimed to evaluate the impact of mining both structured (e.g., ICD codes, diagnostic results) and unstructured (e.g., clinical notes) EMR data to identify five chronic conditions, i.e., hypertension, diabetes, arthritis, CKD, and respiratory diseases, in Canadian primary care EMR data. We found that models trained on both structured and unstructured EMR data to identify these conditions outperformed those trained on structured data alone. These performance gains were especially pronounced for arthritis and respiratory diseases, which are often inconsistently coded in structured EMR fields.
Our findings are consistent with prior research demonstrating the added value of incorporating unstructured data in clinical phenotyping. A systematic review by Ford et al. (2016) reported that NLP-enhanced models improved sensitivity in detecting under-coded conditions15,42: supplementing structured EMR data with unstructured data resulted in significant improvements in ML algorithms’ sensitivity (78% vs. 62%) and AUROC (95% vs. 88%). Since the publication of this systematic review, several studies specific to the Canadian context have emerged and deserve attention. For example, a study by Martin et al.43 describes the use of clinical notes in the Canadian acute care setting to identify hypertension. The authors reported significantly better performance of ML algorithms that used EMR notes compared to those based solely on ICD codes, with sensitivity reaching more than 90%, compared to 47% for ICD-based algorithms. Another study by Lee et al.44 used EMR-based algorithms to identify diabetes in Canadian inpatient EMR records. They compared the performance of algorithms based on laboratory data only, medication data only, both laboratory and medication data, diabetes concept keywords, diabetes free-text notes, and a combination of all of these inputs. The authors found that the latter algorithm significantly outperformed all other algorithms: for example, sensitivity increased from 37% with laboratory data alone and 89% with medication data alone to 95% when these data were supplemented with concept keywords and free text.
In our study, we report similarly high sensitivity for the identification of hypertension and diabetes, but our findings also suggest that the performance gains are particularly pronounced for those conditions that are often under-coded in structured EMR fields, namely, arthritis and respiratory diseases. For example, the sensitivity for identification of hypertension and diabetes remained > 90%, regardless of whether only structured or both structured and unstructured EMR data were used. Yet, sensitivity for respiratory disease identification improved substantially in ANNs (from 0.664 to 0.761) and RLR models (from 0.449 to 0.733), and AUROC improved by over 20% (from 0.733 to 0.890) in ANNs models. Similarly, arthritis detection also benefited from the inclusion of unstructured data, with AUROC increasing from 0.724 to 0.841 and F1 score from 0.532 to 0.698 in SVM models. This finding suggests that the decision as to whether structured EMR data alone are sufficient, or whether they should be supplemented by unstructured EMR data, should consider how well each specific condition is documented in the EMR. Additionally, while we focused only on five chronic conditions, it is possible that this finding applies to other conditions that are not consistently documented in structured EMR fields (e.g., dementia, depression, parkinsonism45–47).
Moreover, our study highlights the value of engaging clinicians in the development of algorithms to detect chronic conditions in EMRs and underscores the importance of fostering collaboration between clinicians and computer scientists. For example, a recent study in Alberta, Canada48, reported substantially better performance from large language models that were informed by clinical expertise as opposed to those models trained solely on data. In fact, the sensitivity of clinician-informed models for detecting acute myocardial infarction, diabetes, and hypertension, was higher than that of data-informed models: 94.9% vs. 84.0%, 93.6% vs. 91.2%, 96.2% vs. 80.0%, respectively. Recognizing the critical role of clinical input in developing phenotyping models, a new field of research, prompt engineering, has emerged in recent years and is gaining traction49.
A key strength of this study is the development of a unique analytic pipeline optimized for Canadian EMRs. Our approach integrated rule-based negation detection, topic modelling using LDA, and feature harmonization across structured and unstructured data. This approach enabled effective model training despite a relatively small sample size and ensured clinical interpretability and relevance. We argue that it can pave the way for the identification of other health conditions that are not routinely captured in structured data. Importantly, we leveraged TF-IDF models in this project. A systematic review of NLP methods for processing unstructured data for chronic disease detection3 showed that traditional TF-IDF models achieve AUROC values between 0.75 and 0.88. Our results align with these values, with an AUROC of 0.841 for SVM in arthritis detection and 0.890 for ANNs in respiratory disease detection. Deep learning models50 often achieve AUROC values exceeding 0.90; however, these models require substantial computational resources and datasets much larger than the one used in our analyses. Notably, Goh et al.51 developed an artificial intelligence algorithm for early sepsis detection that surpassed an AUROC of 0.90, yet our non-deep-learning models achieved comparably high AUROC values (e.g., 0.890 for ANNs in respiratory disease detection), demonstrating strong performance and suitability for EMR-based patient phenotyping in real-world primary care settings. Moreover, although RLR and SVM models are less computationally complex than ANNs, their performance was comparable to that of ANNs at our sample size. This suggests that RLR and SVM models may be well suited to similar data-constrained contexts, which aligns with prior research22 reporting that LR and SVM outperformed XGBoost (XGB) on audio-recorded patient-nurse verbal communication, EHR data and clinical notes, and the integration of all datasets.
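To make the TF-IDF step concrete, the following is a minimal, dependency-free Python sketch of term frequency–inverse document frequency weighting (illustrative only: the study's pipeline is implemented in R, and the note snippets below are invented, not patient data):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build TF-IDF vectors: term frequency scaled by inverse document frequency,
    so terms common across all notes are down-weighted relative to distinctive ones."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    # Smoothed IDF: log((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[w] / len(toks) * idf[w] for w in vocab])
    return vocab, rows

# Hypothetical de-identified note snippets (not real EMR data)
notes = [
    "joint pain and stiffness in both knees",
    "wheezing and shortness of breath on exertion",
    "knee pain worse in the morning",
]
vocab, X = tfidf_matrix(notes)
# A condition-specific term ("wheezing") outweighs a ubiquitous one ("and")
print(X[1][vocab.index("wheezing")] > X[1][vocab.index("and")])  # True
```

The resulting vectors can then be fed to any of the classifiers discussed above (RLR, SVM, ANNs); in practice, vocabulary pruning and n-gram features are added on top of this basic weighting.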
In contrast, another study20, with a much larger sample of 36,652 eligible participants from the Henan Rural Cohort Study, found that an ANN model (AUC = 0.872) performed better than LR (AUC = 0.841) and SVM (AUC = 0.835). Finally, it is worth noting that the clinic serving as our data source is a training site, meaning EMR notes were written by multiple physicians and residents, each with a unique writing style and level of detail, a factor that likely contributed to the narrative diversity of our dataset.
Despite its strengths, this study has several limitations that warrant consideration. First, our sample of patients was relatively small (n = 449) and drawn from a single clinic. While future research should test our approach on larger and more diverse datasets gathered in multiple clinics, it is important to note that the analytic sample included over 1.8 million clinical notes gathered over many years. Because this study uses data from a single primary care clinic, the evaluation relied on internal validation (10-fold cross-validation and bootstrap); external validation on independent datasets from other clinics or regions is needed. Furthermore, as noted above, many physicians and residents were involved in these patients’ care; therefore, our analytic sample has high narrative variability that likely contributed to the richness of the training datasets and improved model performance. Second, class imbalance may have affected model sensitivity and AUROC values, especially for less prevalent conditions. Although we applied the SMOTE technique to mitigate this issue, future work could explore complementary strategies, such as cost-sensitive learning52 or ensemble methods53. Additionally, future research should explore other advanced techniques, such as contextual embeddings or concept normalization54,55, and consider modelling temporal patterns to account for changes in documentation practices over time.
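The SMOTE idea referenced above can be sketched compactly. The following is a simplified, dependency-free illustration of SMOTE-style oversampling (a synthetic point is interpolated between a minority sample and one of its k nearest minority neighbours); the function name, toy data, and parameters are hypothetical, not the study's R implementation:

```python
import random

def smote_like_oversample(minority, k=2, n_new=4, seed=42):
    """Generate synthetic minority-class samples by interpolating between a
    minority point and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D feature vectors for the minority (disease-positive) class
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.4)]
new_points = smote_like_oversample(minority)
print(len(new_points))  # 4 synthetic samples added to rebalance the classes
```

Because synthetic points lie on line segments between existing minority samples, they densify the minority region rather than duplicating observations, which is why SMOTE tends to generalize better than naive oversampling.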
Although SMOTE-based oversampling outperformed undersampling in our dataset, future work with larger EMR datasets could explore hybrid strategies (e.g., combining oversampling and undersampling) to reduce sampling bias and improve robustness. While our study focused on RLR, SVM, and ANNs as representative models of linear, kernel-based, and neural network paradigms, ensemble tree-based approaches (e.g., Random Forest [RF]) and gradient boosting methods (e.g., XGB) have also shown strong performance in medical classification tasks. However, these methods typically require larger datasets; given our sample size, we chose models with a lower risk of overfitting. Future research with larger, multi-site EMR datasets should explore RF and XGB methods to complement the models presented in this work, and should explore the use of advanced embedding models (e.g., BERT or Word2Vec) to assess whether they further improve predictive performance and interpretability. While our primary objective was to demonstrate predictive gains from including unstructured EMR notes, future studies could also apply model-specific interpretability techniques (e.g., regression coefficient analysis, variable importance) to quantify the contributions of structured and unstructured data to model performance.
Conclusion
This study demonstrates that integrating structured and unstructured data through NLP and ML techniques significantly improves chronic disease identification in primary care EMRs within the Canadian context. This finding highlights the value of clinical notes for disease phenotyping, and the approach described in this paper offers a practical framework for applying ML models to real-world EMR data. Future research should validate this approach in larger, multi-site datasets and explore its application to the identification of other conditions that are primarily documented in free text.
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
The present study was supported by the Mitacs Accelerate Grant (grant IT30841) received by the University of Alberta, Edmonton, Canada. CARM&A Health provided matching funding based on Mitacs requirements.
Abbreviations
- ANNs
Artificial neural networks
- AUROC
Area under the receiver operating characteristic curve
- CI
Confidence interval
- CKD
Chronic kidney disease
- CPCSSN
Canadian primary care sentinel surveillance network
- CWL
Class-weighted learning
- DTM
Document term matrix
- eFI
Electronic frailty index
- EMR
Electronic medical record
- FI
Frailty index
- ICD
International classification of diseases
- ML
Machine learning
- NAPCReN
Northern Alberta primary care research network
- NLP
Natural language processing
- PPV
Positive predictive value
- RLR
Regularized logistic regression
- SVM
Support vector machine
- SMOTE
Synthetic minority oversampling technique
- UK
United Kingdom
Author contributions
All authors contributed to the interpretation of findings, writing of the manuscript, and have approved the final version. LK secured the funding. MA and SK conceptualized the idea, methodological approach and statistical plan. RA manually labelled the data. NZ and MB conducted analyses. LK supervised the statistical analysis. NZ, MA, and SK drafted the first version of the manuscript.
Data availability
The datasets analysed during the current study were extracted from a Canadian primary care clinic’s EMRs and hence are not publicly available. Access to these datasets is subject to obtaining approvals from the data custodians and NAPCReN. Requests to access the dataset can be directed to the corresponding author via marjan.abbasi@albertahealthservices.ca.
Code availability
Comprehensive R code implementing the full analysis pipeline (NLP preprocessing, bigram and LDA, TF-IDF feature engineering, and machine learning models) is provided in the Supplementary R Code section of the Supplementary file to ensure the reproducibility of our results for researchers with appropriate data access.
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval and consent to participate
This study used retrospective, de-identified electronic medical record (EMR) data extracted from a sentinel primary care clinic affiliated with the Northern Alberta Primary Care Research Network (NAPCReN). The data custodians were two attending family physicians at the clinic, who provided consent for the use of the data. All methods were carried out in accordance with relevant guidelines and regulations. Individual informed consent was not required, as the data were de-identified prior to access and used in compliance with applicable privacy legislation and institutional policies. The study was approved by the University of Alberta Research Ethics Board (ID: Pro00088808), with informed consent obtained from the data custodians (i.e., physicians participating in the study).
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.AHS Annual Report. Alberta Health Services; (2022).
- 2.Canadian Primary Care Sentinel Surveillance Network. Northern Alberta. In (2023). Available from: https://cpcssn.ca/regional-networks-2/alberta/northern-alberta/
- 3.Sheikhalishahi, S. et al. Natural Language processing of clinical notes on chronic diseases: systematic review. JMIR Med. Inf.7 (2), e12239 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Chen, W. et al. Development and validation of algorithms to identify patients with chronic kidney disease and related chronic diseases across the Northern Territory, Australia. BMC Nephrol.23 (1), 1–12 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Richter, A. N. & Khoshgoftaar, T. M. A review of statistical and machine learning methods for modeling cancer risk using structured clinical data. Artif. Intell. Med.90, 1–4 (2018). [DOI] [PubMed] [Google Scholar]
- 6.Wang, W. et al. A systematic review of machine learning models for predicting outcomes of stroke with structured data. PloS One. 12 (6), e0234722 (2020 June). [DOI] [PMC free article] [PubMed]
- 7.Makowski, D. et al. Automated results reporting as a practical tool to improve reproducibility and methodological best practices adoption [Internet]. (2023). Available from: https://cran.r-project.org/web/packages/report/citation.html
- 8.Javaid, M., Haleem, A., Singh, R. P., Suman, R. & Rab, S. Significance of machine learning in healthcare: Features, pillars and applications. Int. J. Intell. Netw.3, 58–73 (2022). [Google Scholar]
- 9.Zhang, A., Xing, L., Zou, J. & Wu, J. C. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng.Dec;6 (12), 1330–1345 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kennedy, J. et al. Predicting a diagnosis of ankylosing spondylitis using primary care health records–a machine learning approach. PLoS One. 31 (3), e0279076 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Lix, L. M., Walker, R., Quan, H. & Nesdole, R. Features of physician services databases in Canada. In: Health Promotion and Chronic Disease Prevention in Canada. (2012). [PubMed]
- 12.Tu, K. et al. Are family physicians comprehensively using electronic medical records such that the data can be used for secondary purposes? A Canadian perspective. BMC Med. Inf. Decis. Mak.15, 1–2 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Raji, S. Regional Integration: Physician Perceptions on Electronic Medical Record Use and Impact in South West Ontario. Electron Thesis Diss Repos [Internet]. ; (2020). Available from: https://ir.lib.uwo.ca/etd/7980
- 14.Savage, D. W. et al. Characterizing the services provided by family physicians in Ontario, canada: A retrospective study using administrative billing data. PloS One. 8 (1), e0316554 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Teixeira, P. L. et al. Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals. J. Am. Med. Inf. Assoc.24 (1), 162–171 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Liao, K. P. et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res.62 (8), 1120–1127 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Zheng, L. et al. Web-based Real-Time case finding for the population health management of patients with diabetes mellitus: A prospective validation of the natural Language Processing–Based algorithm with statewide electronic medical records. JMIR Med. Inf.4 (4), e6328 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Casey, J. A., Schwartz, B. S., Stewart, W. F. & Adler, N. E. Using electronic health records for population health research: A review of methods and applications. Annu. Rev. Public Health 37, 61–81 (2016). [DOI] [PMC free article] [PubMed]
- 19.Extracting information from the text of electronic medical records to improve case detection: a systematic review [Internet]. [cited 2025 May 20]. Available from: https://academic.oup.com/jamia/article/23/5/1007/2379833?login=true [DOI] [PMC free article] [PubMed]
- 20.Zhang, L., Wang, Y., Niu, M., Wang, C. & Wang, Z. Machine learning for characterizing risk of type 2 diabetes mellitus in a rural Chinese population: the Henan rural cohort study. Sci. Rep.10 (1), 4406 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Tarekegn, A., Ricceri, F., Costa, G., Ferracin, E. & Giacobini, M. Predictive modeling for frailty conditions in elderly people: machine learning approaches. JMIR Med. Inf.4 (6), e16678 (2020 June). [DOI] [PMC free article] [PubMed]
- 22.Zolnoori, M. et al. Beyond electronic health record data: leveraging natural Language processing and machine learning to uncover cognitive insights from patient-nurse verbal communications. J. Am. Med. Inf. Assoc.32 (2), 328–340 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Garies, S., Birtwhistle, R., Drummond, N., Queenan, J. & Williamson, T. Data resource profile: National electronic medical record data from the Canadian primary care sentinel surveillance network (CPCSSN). Int. J. Epidemiol. 46 (4), 1091–1092 (2017). [DOI] [PubMed] [Google Scholar]
- 24.Kotecha, J. A. et al. Ethics and privacy issues of a practice-based surveillance system. Can. Fam Physician. 57 (10), 1165–1173 (2011). [PMC free article] [PubMed] [Google Scholar]
- 25.Feinerer, I. Introduction to the tm Package: Text Mining in R.
- 26.Ghassemi, M. et al. A Review of Challenges and Opportunities in Machine Learning for Health. AMIA Summits Transl Sci Proc. ;2020:191–200. (2020). [PMC free article] [PubMed]
- 27.Mykowiecka, A., Marciniak, M. & Kupść, A. Rule-based information extraction from patients’ clinical data. J. Biomed. Inf.42 (5), 923–936 (2009). [DOI] [PubMed] [Google Scholar]
- 28.Tanushi, H. et al. Negation scope delimitation in clinical text using three approaches: NegEx, PyConTextNLP and SynNeg. In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) [Internet]. Oslo, Norway: Linköping University Electronic Press; pp. 387–97. Available from: https://aclanthology.org/W13-5635 (2013).
- 29.Goryachev, S., Sordo, M., Zeng, Q. & Ngo, L. H. Implementation and evaluation of four different methods of negation detection [Internet]. (2007). Available from: https://www.semanticscholar.org/paper/Implementation-and-Evaluation-of-Four-Different-of-Goryachev-Sordo/49517539055234e73bfa6140a7a84b74cfc12685
- 30.Peng, Y. et al. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits Transl Sci Proc. ;2018:188–96. (2018). [PMC free article] [PubMed]
- 31.Ayre, K. et al. Developing a natural Language processing tool to identify perinatal self-harm in electronic healthcare records. PLOS ONE. 16 (8), e0253809 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Harrison, C. J. & Sidey-Gibbons, C. J. Machine learning in medicine: a practical introduction to natural Language processing. BMC Med. Res. Methodol. ;21(1). (2021). [DOI] [PMC free article] [PubMed]
- 33.Wongvorachan, T., He, S. & Bulut, O. A comparison of Undersampling, Oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining. Information14 (1), 54 (2023). [Google Scholar]
- 34.Algamal, Z. Y. & Lee, M. H. A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Adv. Data Anal. Classif.13 (3), 753–771 (2019). [Google Scholar]
- 35.Vayansky, I. & Kumar, S. A. A review of topic modeling methods. Information Systems 94, 101582 (2020).
- 36.Churchill, R. & Singh, L. The evolution of topic modeling. ACM Computing Surveys 54 (10s), 1–35 (2022).
- 37.Wang, L., Han, M., Li, X., Zhang, N. & Cheng, H. Review of classification methods on unbalanced data sets. IEEE Access.9, 64606–64628 (2021). [Google Scholar]
- 38.Lu, H., Ehwerhemuepha, L. & Rakovski, C. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance. BMC Med. Res. Methodol.22 (1), 1–12 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Ghaddar, B. & Naoum-Sawaya, J. High dimensional data classification and feature selection using support vector machines. Eur. J. Oper. Res.265 (3), 993–1004 (2018). [Google Scholar]
- 40.Hicks, S. A. et al. On evaluation metrics for medical applications of artificial intelligence. Sci. Rep.12, 5979 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Diallo, R., Edalo, C. & Awe, O. O. Machine learning evaluation of imbalanced health data: A comparative analysis of balanced Accuracy, MCC, and F1 score. In: Practical Statistical Learning and Data Science Methods [Internet]. Springer, Cham; (2025). [cited 2025 Apr 22]. 283–312.
- 42.Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc.; 23(5):1007–15 (2016). [DOI] [PMC free article] [PubMed]
- 43.Martin, E. A. et al. Hypertension identification using inpatient clinical notes from electronic medical records: an explainable, data-driven algorithm study. Can. Med. Assoc. Open. Access. J.11 (1), E131–E139 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Lee, S. et al. Exploring the reliability of inpatient EMR algorithms for diabetes identification. BMJ Health Care Inf.30 (1), e100894 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Newby, D., Taylor, N., Joyce, D. W. & Winchester, L. M. Optimising the use of electronic medical records for large scale research in psychiatry. Transl Psychiatry. 1 (1), 1–10 (2024 June). [DOI] [PMC free article] [PubMed]
- 46.Shankar, R., Bundele, A. & Mukhopadhyay, A. Natural Language processing of electronic health records for early detection of cognitive decline: a systematic review. Npj Digit. Med.8 (1), 1–10 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Hill, E. J. et al. Parkinson’s disease diagnosis codes are insufficiently accurate for electronic health record research and differ by race. Parkinsonism Relat. Disord. 114, 105764 (2023). [DOI] [PubMed]
- 48.Pan, J. et al. Integrating large language models with human expertise for disease detection in electronic health records. Comput. Biol. Med. 191, 110161 (2025). [DOI] [PubMed]
- 49.Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet Res.25 (1), e50638 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Wu, S. et al. Deep learning in clinical natural Language processing: a methodical review. J. Am. Med. Inf. Assoc.27 (3), 457–470 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Goh, K. H. et al. Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare. Nat. Commun.12 (1), 711 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Araf, I., Idri, A. & Chairi, I. Cost-sensitive learning for imbalanced medical data: a review. Artif. Intell. Rev.57 (4), 1–72 (2024). [Google Scholar]
- 53.Khan, A. A., Chaudhari, O. & Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 244, 122778 (2024).
- 54.Si, Y., Wang, J., Xu, H. & Roberts, K. Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inf. Assoc.26 (11), 1297–1304 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Zalte, J. & Shah, H. Contextual classification of clinical records with bidirectional long short-term memory (Bi-LSTM) and bidirectional encoder representations from Transformers (BERT) model. Comput. Intell.40 (4), e12692 (2024). [Google Scholar]