Abstract
The integration of electronic medical records (EMRs) with artificial intelligence (AI) is enhancing medical research, particularly in real-world evidence (RWE) studies. Extracting insights from coded medical data, such as ICD-10 codes, is essential for patient characterization. Traditional techniques, such as one-hot encoding (OHE), face limitations, particularly in managing high-dimensional data. In this study, a Bidirectional Encoder Representations from Transformers (BERT) approach is introduced to encode ICD-10 diagnostic codes, significantly improving model performance and reducing dimensionality. Data from 495,269 patients who visited the Cardiology Department at Asan Medical Center between 2000 and 2020 were used. The performance of models trained with OHE and ClinicalBERT embeddings was compared. For predicting major adverse cardiovascular events within one year following percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG), the ClinicalBERT (code-embedded) model outperformed OHE. It achieved an AUC of 0.746 compared to 0.719, while also significantly reducing the dimensionality from 2,492 to 128. This method, which integrates diagnostic and medication data, provides valuable insights into patient care, enhancing the precision of predictions and supporting healthcare professionals in making more informed decisions.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12911-025-03145-x.
Keywords: Electronic medical records, ICD-10, Diagnosis, Medication, MACE, Artificial intelligence, ClinicalBERT, Embeddings, One-hot encoding, W2V, Representation learning
Clinical trial number
Not applicable.
Introduction
Recent advances in healthcare technology have significantly increased the volume of EMR data due to the growing number of patients [1]. Electronic medical records (EMRs), containing vast repositories of patient medical information within healthcare systems, are essential for clinical and medical artificial intelligence (AI) research [2, 3]. Consequently, increasing attention is being paid to leveraging EMR data to enhance the performance of medical AI models [4]. However, variations in coding practices and administrative errors often lead to missing, inconsistent, or redundant ICD codes in EMRs, creating additional challenges for data analysis [5]. EMRs provide comprehensive information about patient demographics and medical history, highlighting the importance of extracting meaningful insights from medical text.
EMRs typically store diagnostic and medication information in structured code formats, such as ICD-10 codes, which accurately reflect a patient’s clinical history and enable precise cohort identification in research [6]. By utilizing diagnostic codes, researchers can standardize patient characteristics across large datasets, supporting the development of reliable predictive models. In real-world evidence (RWE) studies, ICD-10 codes are often used to define target cohorts based on research objectives. However, large-scale EMR data presents challenges due to its unstructured, text-based nature, making direct analysis difficult. In addition, in large-scale EMR analyses, the high cardinality and significant redundancy of diagnostic codes pose critical challenges, potentially leading to issues such as data sparsity and overfitting [7]. To address this, natural language processing (NLP) techniques are required to extract relationships between diagnoses, treatments, and outcomes [8].
Predicting cardiovascular events is critical in clinical settings as it enables early intervention and helps reduce mortality and morbidity. Traditionally, EMR data, including ICD-10 codes, medications, and treatments, have been transformed into vector representations using One-Hot Encoding (OHE), which are then fed into deep learning (DL) models [9, 10]. While OHE is a standard technique for encoding categorical data, it presents several challenges. OHE struggles to capture complex structures and semantic nuances in EMR text, and when applied to large-scale datasets, it often leads to dimensionality issues and data redundancy, which hinder model performance.
Recent advancements in NLP have introduced more sophisticated methods for embedding medical text, going beyond traditional techniques. Models such as BERT (Bidirectional Encoder Representations from Transformers) [11], T5 (Text-to-Text Transfer Transformer) [12], GPT (Generative Pre-trained Transformer) [13], and domain-specific models such as BioBERT [14] and ClinicalBERT [15] have demonstrated superior capabilities in capturing the complexities of medical text data. Several recent studies have reported that embedding-based approaches can significantly enhance the performance of clinical prediction tasks by effectively representing subtle semantic relationships within EMR data [16]. These models address the limitations of OHE, managing dimensionality issues and enhancing the interpretability of diagnostic patterns within ICD-10 and medication codes.
Developing methods to efficiently encode medical information from these coded data sources can thus significantly improve the performance of cardiovascular event prediction models. In our study, we employed sophisticated preprocessing strategies that leveraged both structured and unstructured data to resolve discrepancies in mapping internal hospital codes to ICD-10, thereby ensuring that the final diagnostic profiles accurately reflected patients’ clinical histories.
This study presents an NLP-based framework that utilizes ClinicalBERT embeddings to represent ICD-10 diagnostic codes, effectively overcoming the limitations of traditional encoding methods and improving the predictive accuracy of medical AI models in cardiovascular event prediction. By comparing ClinicalBERT embeddings directly with OHE and W2V, we provide a fair evaluation of embedding methods tailored to real-world clinical data. Our contributions are three-fold:
We propose a novel NLP framework that leverages BERT-based models, specifically ClinicalBERT, to embed ICD-10 diagnostic codes, effectively managing high-dimensional EMR data and improving model interpretability.
We conducted extensive experiments on a real-world EMR dataset, utilizing records from 495,269 patients at Asan Medical Center to validate the proposed framework. Our ClinicalBERT-based embedding method consistently outperforms strong baselines, including OHE and Word2Vec (W2V).
Additionally, we demonstrate that the integration of diagnostic and medication data enhances the model’s predictive accuracy in identifying major adverse cardiovascular events (MACE), offering valuable insights into patient care and supporting healthcare professionals in making informed decisions.
Related work
To overcome the limitations of OHE, word-embedding methods such as the Skip-gram algorithm from Word2Vec (W2V) [17] have garnered significant attention. W2V represents words as continuous vectors, enabling the calculation of their similarity. This algorithm has been employed to organize ICD codes, test results, and medication codes within EMRs into a unified space, thereby facilitating the generation of low-dimensional representations of medical concepts [18, 19]. The sliding window feature of Skip-gram enables the clear visualization of code sequences, which can be utilized as a similarity matching method for patients within a cohort. Furthermore, word-embedding approaches have been used to link diagnostic codes with medications, assisting in the identification of drugs associated with specific diseases for treatment or prevention.
Although W2V can handle text data in ways that OHE cannot, it still fails to fully capture the context and unique characteristics of EMRs. Specifically, W2V is unable to differentiate between multiple meanings of the same term depending on context, a crucial limitation in medical texts where commonly used terms can carry different meanings, and a single term may have several interpretations. Recent studies have addressed these issues by applying NLP techniques that incorporate contextual information [20, 21]. These methods consider surrounding words to embed the same term differently based on its context.
For instance, the term “wave” in the sentences “The ocean’s wave crashed against the shore” and “She gave a simple hand wave” has different meanings. Contextual embeddings thus allow for better interpretation of meaning based on context.
Research has shown that context-aware embeddings, such as those produced by BERT, significantly improve the accuracy of medical predictions by capturing subtle language patterns within EMR data. Additionally, fine-tuning pre-trained language models (PLMs) like BERT, T5, and GPT on domain-specific data further enhances their performance [22, 23]. However, BERT models are often pre-trained on external datasets such as MIMIC (medical information mart for intensive care), and studies using MIMIC data tend to focus on common codes, limiting the models’ ability to predict rare diseases. Models trained solely on MIMIC data may not fully capture the entire spectrum of patient diagnoses, highlighting the need for models trained on more diverse clinical datasets that reflect a broader range of diseases [24].
Recently, EMR analysis has evolved beyond simple text embedding, with embedding-based approaches leveraging both structured and unstructured data to improve clinical prediction performance [25, 26]. Furthermore, various methodologies such as deep learning-based sequence models, graph-based models, and hybrid models have been proposed for EMR analysis, which effectively capture the complex interactions and temporal patterns within medical data [27].
Enhancing the precision and effectiveness of NLP-based medical AI models requires the integration of real-world diagnostic patterns. Such an approach enables AI models to account for the complexity of diseases and key medical variables, leading to more accurate diagnoses and better support for comprehensive patient management and treatment decisions. For instance, Med-BERT [28], pre-trained on medical records from approximately 20 million patients, demonstrates how larger cohorts and longer visit sequences allow models to capture a more comprehensive clinical context. However, incorporating additional data into PLMs is resource- and time-intensive, which can limit its practicality in certain research settings. Despite these challenges, PLMs remain valuable for their ability to learn from large datasets and capture complex patterns.
Recent studies have shown that domain-specific models like BioBERT and ClinicalBERT [15] perform exceptionally well in various medical NLP tasks. Building on these findings, we employ clinical domain-specific PLMs trained on large-scale medical datasets and scientific literature to capture the complex and nuanced language used in healthcare.
Our approach utilizes ClinicalBERT embeddings to more effectively capture and compress medical information into lower-dimensional representations, providing improved performance compared to traditional methods such as OHE and W2V. Real-world EMR data from Asan Medical Center (AMC) were leveraged to design a model that accurately reflects the diagnostic patterns observed in clinical practice, enhancing predictive accuracy while capturing the nuanced medical details essential for decision-making. In particular, our study is the first to directly compare ClinicalBERT embeddings with traditional encoding methods (OHE and W2V) in the context of cardiovascular event prediction, thereby demonstrating the advantages of domain-specific embeddings in enhancing the accuracy and utility of clinical predictive models.
Methods
Data acquisition
In this study, EMRs were obtained from patients who visited Asan Medical Center in Seoul between January 1, 2000, and December 31, 2019. The EMR records from January 2000 to December 2016 were sourced from the CardioNet DB [29], which contains data from 572,811 individuals, specifically focusing on visits to the Departments of Cardiology and Thoracic Surgery at AMC. From January 1, 2017, to December 31, 2019, EMRs additionally included personal information on 189,413 individuals who visited AMC. The comprehensive EMR database incorporated records from all other departments accessed by patients visiting the Cardiology Department during this period. The extracted EMR data from 762,224 individuals included various types of medical information. Structured data encompassed demographic details, physical measurements, visit histories, surgical procedures, medication records, test results, diagnostic codes (ICD-10), prescription codes, and detailed medication ingredient names. Additionally, records of Percutaneous Coronary Intervention (PCI) and Coronary Artery Bypass Grafting (CABG) procedures were included. Unstructured data consisted of narrative diagnostic details, free-text discharge notes, and other non-coded information.
Ethical approval
The Institutional Review Board (IRB) of Asan Medical Center (AMC) approved the study protocols (No. 2021-0303) in accordance with the Declaration of Helsinki (2008). This study utilized data from the Asan Biomedical Research Environment (ABLE), a de-identified electronic medical record (EMR) database maintained by AMC, a major tertiary hospital in South Korea [30]. Because the ABLE database comprises anonymized information, the IRB waived the requirement for informed consent. All experiments were conducted in compliance with pertinent guidelines and regulations. Patient data, including diagnoses, laboratory test results, and reports, were extracted for patients admitted to AMC from January 1, 2000, to December 30, 2020.
Data preprocessing
The raw data were extracted from the ABLE research database at AMC, comprising both structured and unstructured formats, including patient demographics, clinical notes, diagnostic codes, medication records, and test results. Several preprocessing steps were implemented to ensure data quality and consistency for model training.
To ensure accuracy, structured data, such as patient ages, vital signs, and other numerical information, were cleaned by removing outliers and correcting inaccuracies based on clinical standards. In instances where a patient’s age was recorded as less than 0 or greater than 120, the value was corrected by referring to previous clinical records or the record was excluded from the dataset if deemed unreliable. Likewise, systolic blood pressure (SBP) readings recorded as less than 50 mmHg or greater than 300 mmHg were considered erroneous and were corrected using the median value or verified with other clinical records.
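As an illustrative sketch, the cleaning rules above can be expressed as simple validity checks in Python (the thresholds follow the text; the function names and fallback behavior are our own simplifications):

```python
def clean_age(age, fallback=None):
    # Ages outside [0, 120] are treated as recording errors: use a value
    # recovered from earlier clinical records if available, else drop (None).
    if age is None or age < 0 or age > 120:
        return fallback
    return age

def clean_sbp(sbp, cohort_median=120.0):
    # Systolic BP readings outside [50, 300] mmHg are considered erroneous
    # and are replaced with the median (or verified against other records).
    if sbp is None or sbp < 50 or sbp > 300:
        return cohort_median
    return sbp
```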
Diagnostic codes were standardized by mapping AMC’s internal codes to ICD-10 codes, ensuring consistency across the dataset. AMC’s internal codes, which are used to define specific medical procedures within the hospital, occasionally lacked a one-to-one correspondence with ICD-10 codes. To resolve these discrepancies, clinical notes and prescription records were cross-checked to assign the ICD-10 code that most accurately reflected the actual diagnosis.
In particular:
The official AMC code conversion table was used to directly map AMC internal codes to ICD-10 codes.
In cases where an internal code lacked a detailed diagnostic name or did not exactly match an ICD-10 code, the patient’s diagnostic description, prescription history, and clinical notes were reviewed to assign the most semantically appropriate ICD-10 code.
As an illustration, if AMC internal code ‘A12345’ could be mapped to either ‘I21’ (acute myocardial infarction) or ‘I25.1’ (atherosclerotic heart disease), the patient’s cardiac test records and prescribed medications were consulted to determine the most appropriate code.
To capture a broad range of symptom patterns, all ICD-10 codes were retained, including less common ones. Since patient records at AMC could span up to 20 years, including all diagnostic records in a single sequence would have resulted in excessively long code sequences (Fig. 1). To address this, the distribution of sequence lengths was analyzed, revealing sequences ranging from 3 to 395 codes, with an average length of 12.3 codes per patient. A frequency sorting approach was applied to improve the contextual understanding of the BERT model and optimize the sequences for model training. Each diagnostic code was treated as a single word, and sequences were constructed with lengths between 2 and 50 codes to align with the model’s natural language processing requirements. Padding and truncation techniques were employed to ensure that the model could effectively process sequences of varying lengths.
Fig. 1.
Overview of preprocessing methods for ICD-10 codes based on patient visits. This figure illustrates the preprocessing pipeline designed to encode the entire set of ICD-10 codes, taking into account the visit units in each patient’s diagnostic history
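The frequency sorting and length normalization described above can be sketched as follows (a minimal illustration; the maximum length of 50 follows the text, while the pad token and alphabetical tie-breaking are assumptions):

```python
from collections import Counter

def frequency_sort(codes):
    # Order a patient's codes from most to least frequent in their record;
    # ties are broken alphabetically for determinism (an assumption).
    counts = Counter(codes)
    return sorted(codes, key=lambda c: (-counts[c], c))

def pad_or_truncate(codes, max_len=50, pad_token="[PAD]"):
    # Fit a diagnostic-code sequence to a fixed length for the model.
    if len(codes) >= max_len:
        return codes[:max_len]
    return codes + [pad_token] * (max_len - len(codes))
```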
To reduce dimensionality while preserving clinical relevance, ICD-10 codes from each patient’s records were truncated to their first three characters. This approach captures the broader diagnostic categories (e.g., E11.1 was reduced to E11), ensuring that key information is retained without overcomplicating the model with finer diagnostic subcategories that may add redundancy or noise to the data.
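This category-level truncation is a one-line transformation; a minimal sketch:

```python
def truncate_icd10(code):
    # Keep the three-character ICD-10 category, e.g. 'E11.1' -> 'E11'.
    return code.strip().upper()[:3]
```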
To construct meaningful diagnostic sequences, we employed a frequency-based ordering method. Rather than enforcing strict chronological ordering, which may introduce excessive variability, we structured diagnostic sequences by ensuring that a patient’s prior medical history is retained while capturing deviations from expected patterns.
For instance, if a patient had multiple visits within a one-year period and was diagnosed with both “I10 Hypertension” (a highly common condition) and “G20 Parkinson’s disease” (a less frequently occurring diagnosis), the model first learns the baseline representation of I10 as part of the patient’s medical history. Since I10 is frequently observed, the model may anticipate its recurrence in future diagnoses. However, when an unexpected diagnosis such as G20 appears, this deviation from the expected pattern can serve as a more informative signal, helping the model distinguish between routine and clinically significant changes in the patient’s condition.
This approach ensures that common diagnoses provide a stable context, while less frequent diagnoses contribute additional predictive value by indicating possible changes in a patient’s disease trajectory. Rather than simply prioritizing high-frequency diagnoses, the model learns the underlying distribution of disease occurrences and detects deviations that may be clinically relevant.
Additionally, to effectively incorporate the temporal aspect of diagnostic sequences, we segmented each patient’s record based on their first visit rather than using a fixed calendar year. Instead of defining sequences from January 1st to December 31st, we used a rolling one-year window specific to each patient. For example, if Patient A’s first recorded visit was on November 30, 2021, their EMR data was grouped from November 30, 2021, to November 29, 2022. This method ensures that each patient’s full one-year medical history remains intact, rather than being artificially split by the calendar year.
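The patient-anchored one-year window can be sketched as follows (we approximate a year as 365 days purely for illustration):

```python
from datetime import date

def assign_patient_year(visit_date, first_visit):
    # Index of the patient-specific one-year window containing visit_date:
    # window 0 starts at the patient's first visit, not on January 1st.
    return (visit_date - first_visit).days // 365
```

For Patient A above, every visit from November 30, 2021 through November 29, 2022 falls into window 0.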
By implementing frequency-based ordering within each time window and dynamically adjusting the sequence period based on each patient’s first visit, our transformer-based model effectively captures diagnostic importance and temporal trends while minimizing fragmentation of medical histories. This enables the model to recognize time-dependent patterns in disease progression and patient management with greater accuracy.
Because AMC is a tertiary hospital, many patients visit solely for prescription refills, where diagnostic codes may be automatically assigned without reflecting an updated clinical assessment. In such cases, identical ICD-10 codes may appear multiple times within a short period due to administrative factors rather than disease progression. To address this, only the first occurrence of each ICD-10 code within a year was retained, unless a significant change in the patient’s condition was documented. This preserved the temporal structure of disease trajectories while filtering out irrelevant duplications.
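The first-occurrence filter can be sketched as follows (assuming visits are already in chronological order and annotated with a patient-year index; the exception for documented condition changes is omitted for brevity):

```python
def dedupe_codes_within_year(visits):
    # Keep only the first occurrence of each ICD-10 code per patient-year.
    # `visits` is a chronological list of (year_index, code) pairs.
    seen = set()
    kept = []
    for year, code in visits:
        if (year, code) not in seen:
            seen.add((year, code))
            kept.append((year, code))
    return kept
```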
Prescription data, including medication codes, active ingredients, and classification codes, were processed similarly. AMC’s internal medication codes are alphanumeric (e.g., ABC123 or ABCDE). Since both the medication code and drug name were extracted from the EMR, a preprocessing approach analogous to that used for ICD-10 and diagnostic name mapping was applied. Rather than grouping medications into therapeutic classes, actual prescribed drug names were retained, leveraging the reduced dimensionality from duplicate removal. For instance, if a patient switched from atorvastatin to rosuvastatin, both were explicitly recorded rather than collapsed under “statin therapy.” This ensured a precise representation of pharmacological history, enhancing the model’s ability to capture drug-disease interactions.
By structuring diagnostic and medication sequences in this manner, the embedding method’s capacity to predict clinical outcomes was effectively optimized.
Data construction
Our methodology involved fine-tuning a pre-trained BERT model and optimizing a disease prediction model to achieve the study’s objectives. Two distinct cohorts were created: one for predicting heart disease and the other for forecasting major adverse cardiovascular events (MACE).
Initially, 762,224 EMR records were extracted from the AMC database (Fig. 2). Since this dataset included patients with only a single visit, we retained only those with multiple visits, reducing the cohort to 563,970. Further exclusions were made for cases with missing or erroneous visit dates and extreme diagnostic code counts, resulting in the final Heart Disease Cohort of 495,269 individuals. This dataset includes preprocessed diagnostic codes, medication codes, and ingredient names, supporting heart disease prediction and facilitating a comparison between the OHE and code-embedded models.
Fig. 2.
Formation of patient cohorts. A detailed process illustrating the selection and formation of the heart disease cohort and MACE prediction cohort from the AMC database. Patients were included based on criteria such as diagnostic record availability, accurate date formatting, and relevant clinical characteristics
Patients with excessively high diagnostic code counts were primarily those with frequent prescription refill visits or long-term hospitalizations, where daily treatment adjustments generated redundant diagnostic entries. Such cases were excluded to maintain data quality and ensure that clinically significant diagnoses were not overshadowed by administrative artifacts. By filtering these cases, we ensured the dataset captured meaningful diagnostic patterns relevant to disease progression and prediction.
In addition to the Heart Disease Cohort, the MACE Prediction Cohort was constructed to evaluate our model’s performance in predicting MACE in real-world clinical settings. This cohort was derived by selectively sampling 1,578 patients from the initial pool of the Heart Disease Cohort. Inclusion criteria focused on individuals who underwent PCI or CABG surgeries within approximately 7 days of admission, based on the admission date. The MACE Prediction Cohort incorporates both unstructured EMR data on diagnoses and prescribed medications and structured data, including essential demographic information such as age, gender, and BMI, as well as standardized measurements like vital signs, LDL-C, and triglycerides. Diagnostic criteria for PCI or CABG and MACE were determined using unstructured free-text surgical records in the EMR, with final patient selection guided by clinical consultation.
Comorbidities were defined using ICD-10 codes and diagnostic and prescription history in EMRs:
Chronic Kidney Disease (CKD) was defined as having an eGFR ≤ 90 before the index date (ICD-10 code N18).
Hypertension included individuals with ICD-10 codes I10–I13 or I15, a history of prescriptions for beta-blockers or RAAS inhibitors, or the use of one or more calcium channel blockers.
Diabetes Mellitus (DM) was defined as having an HbA1c level of ≥ 6.5% (ICD-10 code E11).
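As an illustration, these comorbidity rules can be sketched as boolean flags (the thresholds and code lists follow the definitions above; the prescription-history criteria for hypertension are omitted here for brevity):

```python
def comorbidity_flags(icd_codes, egfr=None, hba1c=None):
    # Boolean comorbidity flags derived from 3-character ICD-10 categories
    # plus laboratory thresholds; medication criteria are not modeled here.
    codes = {c[:3] for c in icd_codes}
    return {
        "ckd": "N18" in codes and egfr is not None and egfr <= 90,
        "hypertension": bool(codes & {"I10", "I11", "I12", "I13", "I15"}),
        "dm": "E11" in codes and hba1c is not None and hba1c >= 6.5,
    }
```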
Through the establishment of these two cohorts, the efficacy of our models across different clinical scenarios and disease prediction tasks was evaluated.
BERT (ClinicalBERT)
The BERT model was selected for this study due to its ability to process long diagnostic code sequences typically found in medical records. Traditional models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) often face challenges in handling these sequences efficiently [31, 32]. BERT’s transformer architecture, which utilizes multilayer bidirectional transformers, offers a more nuanced understanding of context, making it particularly well-suited for predicting diseases based on complex diagnostic data. In contrast, GPT-based models use a unidirectional approach that relies primarily on preceding tokens for prediction, which may limit the model’s ability to capture the complete context [33]. Moreover, while GPT-based models generate text using only a decoder, BERT uses an encoder to perform bidirectional attention across the entire input sequence, thereby incorporating context from both preceding and following tokens. This bidirectional mechanism can offer advantages in capturing subtle dependencies and relationships among different diagnostic and prescription codes within long sequences [34–36].
Recent generative LLMs such as GPT-4o, Claude, DeepSeek, and LLaMA-3 have demonstrated impressive performance in a variety of clinical tasks. However, despite their excellent performance across diverse applications, these generative models are primarily offered as closed-source, API-based systems, which do not allow users to modify the model weights directly and may raise privacy concerns when actual hospital data are uploaded. Consequently, for tasks involving clinical data, open-source models such as ClinicalBERT, BioBERT, or specialized clinical language models like Meditron-70B and Meerkat-7B are preferred [37, 38]. Meditron-70B and Meerkat-7B, however, are primarily designed for text generation and equipped with multi-step reasoning capabilities for solving complex medical problems [39]. This structure may make them less suitable for our purpose, which involves converting EMR data, such as diagnostic and prescription codes, into embeddings for clinical prediction tasks. In particular, ClinicalBERT and BioBERT, which are pre-trained on large-scale medical corpora, further enhance the understanding of clinical language and diagnostic patterns, thereby improving the accuracy of clinical disease prediction.
BERT employs two primary training objectives: masked language modeling (MLM), in which selected tokens in the input sequence are masked and then predicted, and next sentence prediction (NSP), which evaluates sentence continuity. For instance, in the sequence ‘pain in the throat and chest, hypertension, [MASK] infarction,’ BERT may predict ‘myocardial’ or ‘cerebral,’ helping the model capture complex patterns in diagnostic codes (Fig. 3). To address potential concerns regarding masking individual tokens, we also reviewed the structure of ICD-10 codes in our dataset. Many codes are relatively short descriptions or single words, which limits the need for whole-description masking.
Fig. 3.
Visualization of BERT’s Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) training objectives. The figure illustrates how MLM involves predicting masked tokens within sequences, while NSP assesses the continuity between sentence pairs. These objectives enable BERT to capture complex patterns and dependencies in diagnostic codes, enhancing its ability to predict future diseases based on historical patient data
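A toy sketch of MLM-style masking over a diagnostic-code sequence (the 15% masking probability matches BERT’s convention; in the actual model, masking operates on WordPiece tokens rather than whole codes):

```python
import random

def mask_codes(codes, mask_prob=0.15, mask_token="[MASK]", seed=42):
    # Replace each code with [MASK] with probability mask_prob; the model
    # is trained to recover the original code from bidirectional context.
    rng = random.Random(seed)
    masked, labels = [], []
    for code in codes:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(code)   # target for the MLM loss
        else:
            masked.append(code)
            labels.append(None)   # position not scored
    return masked, labels
```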
This bidirectionality allows BERT to understand the context of a word based on both its preceding and following tokens, making it particularly effective in healthcare applications where the precise interpretation of medical terminology depends heavily on context [40, 41]. Furthermore, BERT’s MLM and NSP training objectives are well-suited for capturing complex relationships within long diagnostic code sequences and clinical narratives, which is crucial for accurate disease prediction in EMRs [34, 42, 43].
In this study, we focus on effectively converting EMR data into embeddings by leveraging BERT’s bidirectional architecture and ClinicalBERT, which is specialized for the clinical domain. Moreover, we fine-tuned ClinicalBERT using real EMR patient data from AMC, including diagnostic and prescription codes. This fine-tuning enables ClinicalBERT to process medical terminology and clinical context more effectively, thereby maximizing performance in tasks such as heart disease code prediction.
Heart disease code-embedded model
Two code-embedded models were developed based on ClinicalBERT to evaluate the performance of the embedding methods.
The first model, the Heart Disease Code-Embedded Model, was designed to predict heart-related diseases using unstructured diagnostic text data from the Heart Disease Cohort (Fig. 4). This model utilizes ClinicalBERT, which encodes both ICD-10 codes and diagnostic names, generating dense embedding vectors through max pooling. The embeddings from ClinicalBERT were then used as inputs to an XGBoost model [44], which predicted the occurrence of 10 major diagnostic codes associated with heart disease. These codes were selected in consultation with cardiology specialists to ensure the robustness of the experiment. The heart disease-related ICD-10 codes used in the model are as follows:
Fig. 4.
Overview of the code-embedded model architecture. This figure provides an overview of the two models utilizing ClinicalBERT. Both models use embeddings generated from the fine-tuned ClinicalBERT model on EMR data as inputs to the XGBoost model for binary classification tasks. Hyperparameter tuning was performed using a random search method, with 30% of the training dataset reserved for validation. Cross-validation techniques were applied to prevent overfitting and ensure unbiased results
C16: Malignant neoplasm of the stomach
C34: Malignant neoplasm of bronchus and lung
C50: Malignant neoplasm of the breast
G20: Parkinson’s disease
I21: Acute myocardial infarction
I48: Atrial fibrillation and flutter
I50: Heart failure
I63: Cerebral infarction
I67: Other cerebrovascular diseases
J18: Pneumonia, unspecified organism
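The max-pooling step that collapses ClinicalBERT’s per-token outputs into a single patient vector can be sketched with NumPy (a minimal illustration; the model call itself is omitted, and `token_embeddings` stands in for the last-layer hidden states). The resulting pooled vector is what the XGBoost classifier consumes:

```python
import numpy as np

def max_pool_embedding(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, hidden) last-layer states.
    # attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    # Padding positions are set to -inf so they never win the element-wise max.
    masked = np.where(attention_mask[:, None].astype(bool),
                      token_embeddings, -np.inf)
    return masked.max(axis=0)
```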
MACE prediction code-embedded model
The second model, the MACE Prediction Code-Embedded Model, aims to predict MACE within one year following PCI or CABG [45]. MACE encompasses conditions such as myocardial infarction (MI), unstable angina, all-cause mortality, and ischemic stroke (IS). The specific definitions of MACE are as follows:
Myocardial Infarction (MI) (ICD-10 codes I21–I23) or Unstable Angina (ICD-10 code I20) includes patients admitted to hospitals, including those admitted through the emergency room.
Ischemic Stroke (ICD-10 codes I63 and G45) comprises patients who underwent at least one brain CT or MR imaging test within 30 days of the date of diagnostic code registration.
Patients were initially sorted based on diagnostic codes, and their discharge notes were reviewed by clinicians to confirm the final MACE diagnosis status. Unlike the Heart Disease Code-Embedded Model, the MACE Prediction Code-Embedded Model utilized data from the MACE Prediction Cohort, incorporating not only diagnostic data but also baseline clinical variables, comorbidities, and prescribed medications. Additional details regarding the tasks are provided in the Methods section.
The MACE prediction model also used a pre-trained BERT model with an added dimensionality-reduction layer to transform the unstructured sequences of each patient’s diagnostic and medication codes into continuous numeric vector representations of varying dimensionality (128d, 256d, 512d, and 768d). These embedding vectors were then used as input to the XGBoost model to predict MACE outcomes.
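As a hedged sketch of this pipeline: 768-dimensional ClinicalBERT embeddings are projected to a lower dimension and passed to a gradient-boosted tree classifier. The random projection and simulated data below are placeholders for the fine-tuned model’s learned reduction layer and real patient vectors, and scikit-learn’s `GradientBoostingClassifier` stands in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Simulated 768-d ClinicalBERT embeddings for 200 patients
# (in the study these come from the fine-tuned ClinicalBERT model).
X768 = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)  # hypothetical 1-year MACE labels

# Dimensionality-reduction layer: a fixed random linear projection here,
# standing in for the learned dense layer described in the paper.
W = rng.normal(size=(768, 128)) / np.sqrt(768)
X128 = X768 @ W

# Gradient-boosted trees as a stand-in for the study's XGBoost model.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X128, y)
proba = clf.predict_proba(X128)[:, 1]
print(X128.shape)  # (200, 128)
```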
Experimental setup
The models were developed using Python 3.8 and the TensorFlow deep learning framework (version 2.6.0), based on a BERT-base architecture. The learning rate for fine-tuning was set to 10⁻³, and a batch size of 32 was chosen based on the training data size and the GPU memory capacity of a GeForce RTX 3090. The maximum sequence length was capped at 256 tokens, and the vocabulary size was restricted to 7,000 tokens, covering both diagnostic codes and medication names. Fine-tuning used the Adam optimizer, and the optimal number of epochs was determined from accuracy and loss metrics.
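These settings can be summarized as a configuration sketch (the dictionary and its key names are illustrative; the study used TensorFlow 2.6.0):

```python
# Hedged sketch of the fine-tuning configuration described above.
FINE_TUNE_CONFIG = {
    "learning_rate": 1e-3,    # fine-tuning learning rate
    "batch_size": 32,         # sized for a GeForce RTX 3090
    "max_seq_length": 256,    # maximum tokens per code sequence
    "vocab_size": 7_000,      # diagnostic codes + medication names
    "optimizer": "adam",
    "epochs": None,           # chosen from accuracy/loss curves
}
print(FINE_TUNE_CONFIG["batch_size"])  # 32
```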
Evaluation
To assess the performance of the two code-embedding methods, we conducted three experiments designed to measure their efficacy in different predictive tasks.
Experiment 1: Comparing the effects of code subsequence alignment methods
The first experiment aimed to evaluate the impact of various diagnostic code subsequence alignment methods on the performance of the BERT model. Prior to this, a preliminary investigation was conducted to determine the optimal temporal increment for splitting code sequences. This investigation involved clinician review of patient records segmented into intervals of 1, 2, 3, 5, or 7 years. Clinicians evaluated whether the diagnostic code lengths and redundancies were appropriate and confirmed that an annual split effectively captured sufficient patient history while avoiding excessive sequence length or redundancy. Based on these findings, the alignment of code subsequences was standardized to one year per patient.
Subsequently, three models were compared using different diagnostic code sorting techniques. Throughout training, loss and accuracy metrics were closely monitored until validation loss plateaued. This real-time evaluation allowed for precise identification of the point at which each model reached optimal performance.
Experiment 2: Performance comparison between OHE and code-embedded methods for heart-related disease prediction
The second experiment sought to compare the performance of the OHE method with the ‘Heart Disease Code-Embedded Model’ in predicting subsequent heart disease diagnoses. To achieve this, XGBoost was employed for both models, using the diagnostic dataset from the Heart Disease Cohort. Two distinct models were developed: the OHE Prediction Model and the Heart Disease Code-Embedded Model.
The performance of the two methods was evaluated using the AUROC (Area Under the Receiver Operating Characteristic curve) metric [46], a widely accepted measure for assessing classification model performance. In addition to accuracy, the semantic relationships among diseases were examined. To visualize these relationships, the t-SNE [47] algorithm was applied, enabling the comparison of code sequences in vector space and allowing for the measurement of disease similarity across models.
Experiment 3: Predicting major adverse cardiovascular events (MACE) with the MACE prediction code-embedded model
The third experiment aimed to evaluate the predictive performance of the MACE Prediction Code-Embedded Model in forecasting MACE within one year for patients who underwent PCI or CABG surgery. This experiment followed a similar methodology to Experiment 2 but utilized the MACE Prediction Cohort dataset.
The performance of the code-embedded approach was evaluated using the AUROC metric. Additionally, the predictive ability of the model was compared to other encoding methods, including OHE and W2V. This comparison enabled a comprehensive evaluation of each model’s capability to predict actual clinical outcomes in a real-world setting.
Results
Data characteristics
We compiled a dataset of 495,269 patient records, encompassing diagnostic codes and related diagnostic information, admission and discharge details, and baseline patient data. The cohort consisted of patients admitted to the Department of Cardiology or Thoracic Surgery at AMC between January 1, 2000, and December 31, 2019.
The mean age of patients in the overall cohort (Heart Disease Cohort) was 58.99 years (SD 13.21). The dataset comprised 46.15% women (228,587 of 495,269) and 53.85% men (266,680 of 495,269). The average length of stay (LOS) per visit was 1.94 days (SD 11.26). Primary diagnoses encompassed a range of heart-related diseases, including I10 (hypertension), E11 (diabetes mellitus), E78 (lipid metabolism disorders), R07 (chest pain), and I63 (cerebral infarction), among others. A detailed distribution of the diagnostic codes among the target patients is provided in the supplementary material.
In patients who underwent PCI or CABG, the mean age was 64.92 years (SD 9.69), and 73.32% (1,157 of 1,578) were male. The mean BMI was 24.88 kg/m² (SD 2.90), and the mean LDL-C level was 94.23 mg/dL (SD 35.28). Comorbidities included chronic kidney disease (34.54%), diabetes mellitus (38.34%), and hypertension (97.66%). The demographic distribution of this evaluation cohort, the ‘MACE Prediction Cohort,’ is detailed in Table 1.
Table 1.
‘MACE prediction cohort’ demographic distribution table for prediction of clinical disease in patients who underwent PCI or CABG
|  | Study population (n = 1,578), N (%) |
|---|---|
| Mean (SD) Age (years) | 64.92 (9.69) |
| Male | 1,157 (73.32) |
| Mean (SD) BMI (kg/m2) | 24.88 (2.90) |
| Mean (SD) SBP (mmHg) | 126.81 (18.39) |
| Mean (SD) DBP (mmHg) | 72.89 (11.39) |
| Mean (SD) HDL-C (mg/dL) | 44.38 (11.66) |
| Mean (SD) Triglycerides (mg/dL) | 145.36 (83.19) |
| Mean (SD) Total cholesterol (mg/dL) | 177.59 (42.39) |
| Mean (SD) LDL-C (mg/dL) | 94.23 (35.28) |
| Chronic Kidney Disease | 545 (34.54) |
| Diabetes mellitus | 605 (38.34) |
| Hypertension | 1,541 (97.66) |
| Metabolic syndrome** | 1,538 (97.47) |
** Defined as having at least three of the following: (i) elevated waist circumference (≥ 90 cm for males, ≥ 85 cm for females); (ii) elevated TG levels (≥ 150 mg/dL or use of a relevant drug); (iii) reduced HDL-C levels (< 40 mg/dL for males, < 50 mg/dL for females, or use of a relevant drug); (iv) elevated blood pressure (systolic ≥ 130 mmHg and/or diastolic ≥ 80 mmHg, or use of an antihypertensive medication); and (v) elevated fasting plasma glucose levels (≥ 100 mg/dL or use of an antidiabetic drug)
Performance of BERT with partial sequence code sorting
An experiment was conducted to evaluate the impact of various code subsequence alignment techniques on BERT’s performance and to identify the most effective model. This experiment compared the loss and accuracy metrics of three distinct models during fine-tuning.
The Frequency Model, which sorted diagnostic codes by frequency of occurrence, achieved the best performance with an accuracy of 0.977, followed by the Time Model (0.965) and the Alphabetical Model (0.952). Based on these results, the Frequency Model was selected as the optimal approach, as shown in Table 2.
Table 2.
Performance of various BERT models by code sequence alignment method
| Model | Examples of code sequences | Loss | Accuracy |
|---|---|---|---|
| Time model | Z72 Other problems related to lifestyle; I10 Other and unspecified primary hypertension; J98 Other disorders of lung; I20 Angina pectoris unspecified. | 0.225 | 0.965 |
| Alphabetical model | I10 Other and unspecified primary hypertension; I20 Angina pectoris unspecified; J98 Other disorders of lung; Z72 Other problems related to lifestyle. | 0.274 | 0.952 |
| Frequency model | I10 Other and unspecified primary hypertension; I20 Angina pectoris unspecified; Z72 Other problems related to lifestyle; J98 Other disorders of lung. | 0.124 | 0.977 |
The first model, referred to as the Time Model, used a code sequence sorted according to the day-by-day diagnostic order for each patient. The second model, the Alphabetical Model, arranged diagnostic codes alphabetically within each sequence. Finally, the third model, called the Frequency Model, arranged diagnostic codes based on their frequency of occurrence, following our proposed method
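A minimal sketch of the three alignment strategies, assuming one year’s chronologically recorded codes per patient; the frequency variant deduplicates codes and places the most frequent first, as in the Table 2 examples (the study’s exact tie-breaking rule is not specified, so ties are broken alphabetically here):

```python
from collections import Counter

def sort_codes(codes, method):
    """Order one patient's annual diagnostic-code subsequence.

    Hedged sketch of the three alignment strategies compared in Table 2;
    `codes` is the chronologically recorded list for one year.
    """
    if method == "time":           # keep day-by-day diagnostic order
        return list(codes)
    if method == "alphabetical":
        return sorted(codes)
    if method == "frequency":      # proposed method: most frequent first
        counts = Counter(codes)
        # deduplicate; ties broken alphabetically (an assumption here)
        return sorted(counts, key=lambda c: (-counts[c], c))
    raise ValueError(f"unknown method: {method}")

year = ["Z72", "I10", "I10", "J98", "I10", "I20", "I20"]
print(sort_codes(year, "frequency"))  # ['I10', 'I20', 'J98', 'Z72']
```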
Our proposed frequency-based alignment method demonstrated a significant reduction in loss, by approximately 0.1 or more, compared to both the time-based and alphabetical ordering methods (Fig. 5). Additionally, it achieved an accuracy of over 0.97 at an early stage of training. These findings suggest that the sequence in which diagnostic codes are sorted has a significant impact on the performance of the BERT model. The results confirm that models prioritizing specific data sorting techniques, such as frequency-based alignment, can effectively minimize loss and enhance accuracy during training.
Fig. 5.
Comparison of loss and accuracy across models trained with different sequence-ordering methods using a dataset of diagnostic code sequences
Performance of the XGBoost model for predicting heart-related diseases using OHE and the code-embedded method
This section compares the performance of medical AI models for heart disease prediction using OHE and code-embedded methods. To ensure a comprehensive evaluation of the models’ classification effectiveness, we focused on both their discriminatory power and predictive accuracy. The hyperparameters of the XGBoost models were kept identical for both approaches, and predictions were made for 10 ICD codes (see Methods). The diagnostic code prediction task employed ClinicalBERT, BioBERT, and BERT-uncased for the code-embedded method, while OHE utilized all available features.
The evaluation metrics used in this study include:
AUC (Area Under the Receiver Operating Characteristic Curve): a measure of the model’s ability to distinguish between classes across varying thresholds, providing a robust assessment of classification performance.
ACC (Accuracy): representing the proportion of correctly predicted instances, serving as an indicator of overall model performance.
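A minimal illustration of how these two metrics are computed with scikit-learn, using toy labels and scores rather than study data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Toy predictions for one diagnostic-code task (illustrative values only).
y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.4, 0.8, 0.6, 0.3, 0.9])

auc = roc_auc_score(y_true, y_score)                      # threshold-free
acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))  # at 0.5 cutoff
print(auc, acc)  # 1.0 1.0 (every positive outranks every negative)
```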
These metrics provide a comprehensive evaluation of the model’s predictive power, focusing on both its ability to differentiate between outcomes and its overall accuracy. In addition, we experimented with various classifiers, including XGBoost, Random Forest (RF) [48], Logistic Regression (LR) [49], and Support Vector Machine (SVM) [50], along with hyperparameter experiments. Ultimately, XGBoost (0.797 ± 0.094) demonstrated the best performance and was selected as the final comparison model (RF: 0.768 ± 0.071; LR: 0.688 ± 0.120; SVM: 0.609 ± 0.100). Details of the experimental parameters are provided in Table S1, and additional performance comparisons are presented in Supplementary Figure S1.
As shown in Table 3, while OHE achieved the highest accuracy (0.978 ± 0.010), its AUC (0.795 ± 0.089) was significantly lower than the embedding-based models, suggesting that OHE’s performance is primarily driven by its accuracy in high-dimensional diagnostic features rather than its ability to effectively distinguish between classes. In contrast, ClinicalBERT (code-embedded) achieved the highest AUC (0.864 ± 0.028), demonstrating superior performance in terms of its ability to discriminate between positive and negative cases, while maintaining a strong accuracy (0.976 ± 0.011). Similarly, BioBERT (code-embedded) also performed well, with an AUC of 0.855 ± 0.034 and an ACC of 0.976 ± 0.011. The BERT-uncased model, on the other hand, lagged behind with an AUC of 0.766 ± 0.038, though its ACC (0.975 ± 0.011) remained comparable to the other models. These results highlight the strengths of ClinicalBERT and BioBERT in capturing complex patterns in clinical data, leading to superior AUC scores compared to OHE, which, despite its high accuracy, may struggle to fully capture the underlying relationships between diagnostic codes. Furthermore, the lower performance of the BERT-uncased model suggests that its general language representations may be insufficient for clinical diagnostic tasks, thereby reinforcing the suitability of domain-specific models like ClinicalBERT and BioBERT for capturing clinical nuances.
Table 3.
Comparison of OHE, ClinicalBERT, BioBERT, and BERT-uncased models for heart disease code prediction
| Model | AUC (mean ± std) | ACC (mean ± std) |
|---|---|---|
| OHE | 0.795 ± 0.089 | 0.978 ± 0.010 |
| ClinicalBERT(code-embedded) | 0.864 ± 0.028 | 0.976 ± 0.011 |
| BioBERT(code-embedded) | 0.855 ± 0.034 | 0.976 ± 0.011 |
| BERT-uncased(code-embedded) | 0.766 ± 0.038 | 0.975 ± 0.011 |
This table presents the performance comparison of OHE, ClinicalBERT, BioBERT, and BERT-uncased models for predicting heart disease codes using diagnostic information. The mean AUC and ACC values represent the average prediction performance across 10 diagnostic codes. The OHE model transformed all diagnostic codes into 2,492 distinct features, while the code-embedded models used pre-trained BERT-based embeddings with 768 dimensions. All models were trained on diagnostic datasets from the Heart Disease Cohort, including ICD-10 codes related to heart disease (details in Methods)
The evaluation of predictive models for heart-related diseases, employing both the OHE model and our code-embedded ClinicalBERT embeddings, demonstrated a clear distinction in performance across various ICD codes. For instance, in the prediction of diseases coded as I63, ClinicalBERT achieved an AUC of 0.837, significantly higher than the 0.725 observed with OHE. Similar trends were noted with other ICD codes such as J18, I48, and I67, where ClinicalBERT consistently outperformed OHE, underscoring its robustness in handling complex diagnostic data. These results highlight the model’s enhanced sensitivity and specificity, suggesting that the code-embedded model (ClinicalBERT) can more effectively differentiate between positive and negative cases in a clinical setting. The improved accuracy in disease prediction by ClinicalBERT is attributed to its advanced embedding techniques, which capture deeper semantic relationships within the diagnostic data compared to traditional encoding methods like OHE (Fig. 6).
Fig. 6.
ROC curve comparing the OHE and code-embedded models for predicting heart-related diseases. This figure presents the Receiver Operating Characteristic (ROC) curves comparing the performance of OHE and code-embedded models (ClinicalBERT) in predicting heart-related diseases. The curves are plotted for 4 representative ICD codes to illustrate differences in model performance
In contrast, although the OHE model achieves commendable accuracy (Table 3), its performance on the ROC curve reveals challenges in attaining high AUC values. This outcome suggests that, despite utilizing a large number of features, OHE may exhibit limitations in its generalization across various decision thresholds. OHE tends to perform adequately in settings where features are distinctly separated; however, in scenarios requiring the identification of subtler patterns, such as in heart-related disease prediction, the embedding-based approach (ClinicalBERT) offers clear advantages. These findings highlight the significant contribution of ClinicalBERT-based embeddings in capturing complex relationships within diagnostic data, thereby enhancing predictive accuracy and supporting more nuanced decision-making in clinical settings.
This visualization further reinforces the conclusion that domain-specific embeddings, such as ClinicalBERT, outperform traditional feature extraction methods like OHE, particularly in tasks involving complex, real-world clinical data.
The goal of this analysis was to assess how effectively each representation technique used in heart-related disease prediction models captured the essential relationships among diseases. To evaluate this, the t-SNE algorithm was applied to compare visual representations of diagnostic code relationships between OHE models and code-embedded models. These visual patterns were validated in consultation with clinicians to confirm the appropriateness of the observed relationships among diagnostic codes within disease groups.
Figure 7 illustrates a two-dimensional t-SNE visualization of diagnostic code vectors, highlighting whether diseases from related clinical groups cluster together based on the classification results. For example, codes such as I10 (hypertension) and I20 (angina), which are commonly associated with the development of heart-related conditions, are closely clustered in the code-embedded model. Furthermore, diagnostic code pairs such as I10–I67 (hypertension, other cerebrovascular diseases) and I10–I63 (hypertension, cerebral infarction) show strong correlations, reflecting their frequent diagnostic co-occurrence.
Fig. 7.
Visualization of ICD-10 code relationships using t-SNE to reduce diagnostic code vectors to two dimensions
In contrast, the OHE models display these codes as more dispersed in vector space, indicating less precise grouping of related conditions. The code-embedded model, which leverages BERT-based contextual embeddings, more accurately captures significant relationships between diagnostic codes, offering a more nuanced and precise representation of patient diagnostic patterns.
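The visualization step can be sketched as follows; the code vectors here are random placeholders for the model-derived embeddings, and the small perplexity suits the toy sample size:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
codes = ["I10", "I20", "I63", "I67", "C16", "C34"]
# Stand-in 768-d embedding vectors for each ICD-10 code
# (the study derived these from ClinicalBERT; these are random placeholders).
vecs = rng.normal(size=(len(codes), 768))

# Reduce to two dimensions for plotting, as in Fig. 7; perplexity must be
# smaller than the number of points, so a tiny value is used for this toy set.
xy = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vecs)
print(xy.shape)  # (6, 2)
```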
Predicting MACE within one year using medication and diagnostic code embeddings
To assess the effectiveness and generalizability of code-embedded methodologies in clinical settings, experiments were conducted to predict the occurrence of MACE within one year for patients who underwent PCI or CABG. The MACE Prediction Code-Embedded Model incorporated additional clinical variables, including patient demographic information, prescription codes, medication names, and medication classification codes (detailed cohort definitions are provided in the Methods section).
The additional evaluation metrics included:
PRC (Precision-Recall Curve AUC): summarizing the trade-off between precision (positive predictive value) and recall (sensitivity), which is particularly useful when dealing with imbalanced datasets.
These metrics offer a comprehensive evaluation of the model’s predictive power, focusing on both its ability to differentiate between outcomes and its overall accuracy while accounting for class imbalance. Since MACE is not a frequently occurring condition, label imbalance may be present. To ensure a robust and consistent evaluation, we employed a bootstrap resampling approach. The model’s performance was assessed across multiple resampled datasets, and for each evaluation, the average and standard deviation were computed to quantify variability and reliability.
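A hedged sketch of this bootstrap procedure for the AUC metric (the function name and resample count are illustrative, not from the study):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=200, seed=0):
    """Mean ± std AUC over bootstrap resamples.

    Hedged sketch: resample indices with replacement, skipping resamples
    that happen to contain only one class (AUC is undefined there).
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if len(set(y_true[idx])) < 2:
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return float(np.mean(aucs)), float(np.std(aucs))

# Toy labels/scores (illustrative only).
y = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])
s = np.array([.1, .9, .3, .7, .8, .2, .6, .4, .95, .15])
mean_auc, std_auc = bootstrap_auc(y, s)
```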
Table 4 presents the comparative performance of various embedding techniques, including ClinicalBERT at different dimensions (128d, 256d, 512d, 768d), Word2Vec (W2V), and one-hot encoding (OHE), in predicting MACE. Although the default encoder size of BERT is 768, we conducted experiments with various dimensions to evaluate how reducing dimensionality affects performance, particularly in mitigating model complexity and overfitting.
Table 4.
Comparison of ClinicalBERT, OHE, and W2V models for predicting one-year MACE in patients undergoing PCI or CABG using diagnostic and medication information
| Model | #Dim | AUC (mean ± std) | ACC (mean ± std) | PRC (mean ± std) |
|---|---|---|---|---|
| ClinicalBERT(co-medi)-128d | 151 | 0.746 ± 0.016 | 0.711 ± 0.016 | 0.648 ± 0.023 |
| ClinicalBERT(co-medi)-256d | 279 | 0.737 ± 0.015 | 0.701 ± 0.014 | 0.633 ± 0.022 |
| ClinicalBERT(co-medi)-512d | 535 | 0.743 ± 0.015 | 0.705 ± 0.015 | 0.639 ± 0.022 |
| ClinicalBERT(co-medi)-768d | 791 | 0.740 ± 0.015 | 0.705 ± 0.015 | 0.641 ± 0.022 |
| W2V (vector size = 200) | 223 | 0.736 ± 0.017 | 0.701 ± 0.015 | 0.646 ± 0.022 |
| OHE (All features) | 2,492 | 0.719 ± 0.018 | 0.731 ± 0.014 | 0.601 ± 0.021 |
| OHE (Selected features = 256) | 279 | 0.633 ± 0.021 | 0.675 ± 0.015 | 0.452 ± 0.026 |
| Only clinical values | 23 | 0.610 ± 0.023 | 0.635 ± 0.018 | 0.516 ± 0.031 |
Among these, ClinicalBERT with 128-dimensional embeddings demonstrated the best performance, achieving the highest AUC (0.746 ± 0.016), ACC (0.711 ± 0.016), and PRC (0.648 ± 0.023). This highlights its effectiveness in accurately predicting MACE when both diagnostic and medication data are incorporated.
The W2V model also performed well, with an AUC of 0.736 ± 0.017, ACC of 0.701 ± 0.015, and PRC of 0.646 ± 0.022, though slightly lower than ClinicalBERT. In contrast, the OHE model, which utilized all features, showed weaker discrimination (AUC: 0.719 ± 0.018, ACC: 0.731 ± 0.014, PRC: 0.601 ± 0.021), underscoring its limitations in capturing complex relationships within the data.
Notably, the model trained using only clinical variables without embedding techniques yielded the lowest scores across all metrics (AUC: 0.610 ± 0.023, ACC: 0.635 ± 0.018, PRC: 0.516 ± 0.031), further demonstrating the importance of incorporating both diagnostic and medication data for accurate prediction.
These results confirm that the ClinicalBERT-based MACE Prediction Code-Embedded Model (128d) not only reduces dimensionality compared to OHE but also significantly improves predictive performance. The integration of prescription medication-related codes alongside diagnostic codes enhances the model’s ability to predict real-world clinical outcomes, suggesting its potential for broader applications in medical AI and clinical disease prediction.
Discussion
This study demonstrates the potential of ClinicalBERT in enhancing medical AI models using large-scale EMRs. Compared to OHE, ClinicalBERT reduced the dimensionality from 2,492 to 128 (approximately 95% reduction) while improving the AUC for MACE prediction from 0.719 to 0.746, a relative improvement of approximately 3.8%. This dimensionality reduction underscores the importance of precise model scaling, with the 128-dimensional representation striking an optimal balance between performance and computational efficiency. While larger models may generally offer better performance, our findings indicate that this reduced representation effectively captures complex clinical relationships while maintaining computational feasibility.
For simpler tasks such as predicting heart-related diseases based solely on ICD codes, OHE outperformed ClinicalBERT in terms of accuracy (0.978 vs. 0.976), but ClinicalBERT demonstrated a higher AUC (0.864 vs. 0.795). This suggests that while OHE may still be a valid option for straightforward, code-based classification tasks, ClinicalBERT is more effective for complex, multifactorial prediction tasks such as predicting MACE within one year after procedures like PCI or CABG, as it captures deeper semantic relationships among diagnoses, medications, and clinical baseline data.
ICD-10 remains widely used in clinical settings, although ICD-11 offers a more refined structure that could potentially improve model performance [51]. Due to the unique characteristics of individual hospital systems, full ICD-11 implementation remains challenging, which is why our experiments continued to utilize ICD-10. During the mapping process, we encountered difficulties in manually aligning AMC’s internal codes with ICD-10 due to discrepancies in terminology and granularity. Given the enhancements offered by ICD-11, future integration of mapping between ICD-10 and ICD-11 could yield performance gains, warranting further exploration of these challenges.
We chose to retain all ICD-10 codes, including the less common ones, to provide a comprehensive representation of diagnoses. However, this approach can introduce sparsity into the model. While studies using OHE often mitigate this issue by selecting only the most frequent codes [52, 53], our embedding-based approach effectively accommodates rare codes. Moreover, integrating SNOMED CT, which has a richer hierarchical structure and detailed clinical terminology, may further enhance code mapping precision by clarifying relationships between diagnoses, treatments, and symptoms [54]. This integration has the potential to improve predictive accuracy and capture more nuanced clinical patterns when combined with ClinicalBERT, a strategy that merits further investigation.
Although this study focused on a single institution’s EMR dataset, the use of standardized coding systems suggests that our findings could be generalized to other clinical settings. Expanding the model’s application to various hospitals that use similar coding systems and incorporating broader datasets from different specialties would provide a more robust assessment of the effectiveness of embedding-based representations in diverse clinical scenarios.
In conclusion, while OHE may suffice for simpler and more independent classification tasks, embedding-based models like ClinicalBERT clearly outperform traditional methods in complex and multifactorial prediction tasks. Integrating a wide range of clinical data such as diagnoses, medications and even environmental or genetic factors will be essential for advancing medical AI. Future research should focus on incorporating broader datasets, exploring alternative coding systems, and including additional patient data types to develop more comprehensive and accurate prediction models [55].
Conclusion
This study introduces a BERT-based model, specifically a code-embedded version of ClinicalBERT, to address the challenges posed by the unstructured text in EMRs. By embedding EMR data and reducing the dimensionality of code features, the model significantly improves clinical prediction performance. The code-embedded ClinicalBERT outperformed traditional OHE, particularly in predicting complex clinical outcomes, such as MACE, by capturing the nuanced interactions between diagnoses, medications, and prescriptions.
While OHE may be effective for simpler classification tasks, the embedding approach demonstrated clear advantages in real-world clinical scenarios. The integration of diagnostic and medication data enhances prediction accuracy, providing healthcare professionals with valuable insights for data-driven decision-making. Future research should focus on validating these findings across different medical specialties and incorporating additional factors, such as genetic and lifestyle information, to further strengthen the model’s applicability in clinical settings.
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
This research was supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea. The funding was provided equally by two grants (50% each): HR20C0026 and HI23C0896.
Abbreviations
- ABLE
Asan Biomedical Research Environment database
- AMC
Asan Medical Center
- AI
Artificial intelligence
- AUROC
Area under the receiver operating characteristic curve
- BERT
Bidirectional encoder representation from transformers
- CABG
Coronary artery bypass grafting
- CKD
Chronic kidney disease
- CNN
Convolutional neural network
- CVD
Cardiovascular disease
- DB
Database
- DL
Deep learning
- DM
Diabetes mellitus
- EMR
Electronic medical record
- RF
Random forest
- GPT
Generative pre-trained transformer
- HF
Heart failure
- ICD
International Classification of Diseases of the World Health Organization
- IS
Ischemic stroke
- LOS
Length of stay
- LR
Logistic regression
- MACE
Major adverse cardiovascular events
- MI
Myocardial infarction
- MIMIC
Medical information mart for intensive care
- ML
Machine learning
- MLM
Masked language model
- NLP
Natural language processing
- OHE
One-hot encoding
- PAID
Patient identification
- PCI
Percutaneous coronary intervention
- PLMs
Pre-training language model
- RNN
Recurrent neural network
- RWE
Real-world evidence
- SVM
Support vector machine
- TG
Triglycerides
- T5
Text-to-Text Transfer Transformer
- t-SNE
t-distributed stochastic neighbor embedding
- W2V
Word2vec
- XGBoost
Extreme gradient boosting
Author contributions
Minkyoung Kim: Conceptualization, Methodology, Writing - Original Draft Yunha Kim: Investigation, Hee Jun Kang: Resource, Hyeram Seo: Investigation, Heejung Choi: Data curation, JiYe Han: Writing - Review & Editing, Gaeun Kee: Data curation, Soyoung Ko: Resource, HyoJe Jung: Resource, Byeolhee Kim: Resource, Boeun Choi: Resource, Tae Joon Jun: Supervision, Young-Hak Kim: Supervision, Project administration.
Funding
This research was supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea. The funding was provided equally by two grants (50% each): HR20C0026 and HI23C0896.
Data availability
The de-identified electronic medical record data that support the findings of this study are available from the Asan Biomedical Research Environment (ABLE) database at Asan Medical Center, Seoul, Republic of Korea. Due to institutional and privacy restrictions, these data are not publicly accessible via a repository. Researchers may request access by contacting the corresponding author and obtaining approval from the Asan Medical Center Institutional Review Board (approval No. 2021-0303).
Declarations
Ethics approval and consent to participate
This study was approved by the Institutional Review Board of Asan Medical Center (AMC) (approval no. 2021-0303) in accordance with the Declaration of Helsinki (2008). The requirement for informed consent was waived by the IRB because all data were fully anonymized within the Asan Biomedical Research Environment (ABLE) database. All study procedures were conducted in compliance with relevant guidelines and regulations.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Tae Joon Jun and Young-Hak Kim contributed equally to this work.
References
- 1.Kim HS, Lee S, Kim JH. Real-world evidence versus randomized controlled trial: clinical research based on electronic medical records. J Korean Med Sci. 2018;33(34). [DOI] [PMC free article] [PubMed]
- 2.Hannan TJ. Electronic medical records. Health informatics: an overview. 1996. p. 133.
- 3.Zhao C, Jiang J, Xu Z, Guan Y. A study of EMR-based medical knowledge network and its applications. Comput Methods Programs Biomed. 2017;143:13–23. [DOI] [PubMed] [Google Scholar]
- 4.Moradi M, Dorffner G, Samwald M. Deep contextualized embeddings for quantifying the informative content in biomedical text summarization. Comput Methods Programs Biomed. 2020;184:105117. [DOI] [PubMed] [Google Scholar]
- 5.Yuan Z, Tan C, Huang S. Code synonyms do matter: multiple synonyms matching network for automatic ICD coding. arXiv preprint arXiv:2203.01515. 2022.
- 6.Yan C, Fu X, Liu X, Zhang Y, Gao Y, Wu J, Li Q. A survey of automated international classification of diseases coding: development, challenges, and applications. Intell Med. 2022;2(03):161–73. [Google Scholar]
- 7.Polnaszek B, Gilmore-Bykovskyi A, Hovanes M, Roiland R, Ferguson P, Brown R, Kind AJ. Overcoming the challenges of unstructured data in multisite, electronic medical record-based abstraction. Med Care. 2016;54(10):e65–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Sun W, Cai Z, Li Y, Liu F, Fang S, Wang G. Data processing and text mining technologies on electronic medical records: a review. J Healthc Eng. 2018;2018. [DOI] [PMC free article] [PubMed]
- 9.Si Y, Du J, Li Z, Jiang X, Miller T, Wang F, et al. Deep representation learning of patient data from electronic health records (EHR): a systematic review. J Biomed Inform. 2021;115:103671. [DOI] [PMC free article] [PubMed]
- 10.Xiang X, Duan S, Pan H, Han P, Cao J, Liu C. From one-hot encoding to privacy-preserving synthetic electronic health records embedding. In: Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies; 2020 Dec. p. 407–413.
- 11.Devlin J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
- 12.Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(140):1–67.
- 13.Brown TB. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
- 14.Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323. 2019.
- 16.Peissig PL, Costa VS, Caldwell MD, Rottscheit C, Berg RL, Mendonca EA, Page D. Relational machine learning for electronic health record-driven phenotyping. J Biomed Inform. 2014;52:260–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Mikolov T. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
- 18.Choi Y, Chiu CYI, Sontag D. Learning low-dimensional representations of medical concepts. AMIA Summits Transl Sci Proc. 2016;2016:41. [PMC free article] [PubMed]
- 19.Zhang J, Kowsari K, Harrison JH, Lobo JM, Barnes LE. Patient2vec: a personalized interpretable deep representation of the longitudinal electronic health record. IEEE Access. 2018;6:65333–46. [Google Scholar]
- 20.Sajjad H, Alam F, Dalvi F, Durrani N. Effect of post-processing on contextualized word representations. arXiv preprint arXiv:2104.07456. 2021.
- 21.Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013;26.
- 22.Huang K, Singh A, Chen S, Moseley ET, Deng CY, George N, Lindvall C. Clinical XLNet: Modeling sequential clinical notes and predicting prolonged mechanical ventilation. arXiv preprint arXiv:1912.11975. 2019.
- 23.Li F, Jin Y, Liu W, Rawat BPS, Cai P, Yu H. Fine-tuning bidirectional encoder representations from transformers (BERT)–based models on large-scale electronic health record notes: an empirical study. JMIR Med Inform. 2019;7(3):e14830. [DOI] [PMC free article] [PubMed]
- 24.Gao S, Alawad M, Young MT, Gounley J, Schaefferkoetter N, Yoon HJ, et al. Limitations of transformers on clinical text classification. IEEE J Biomed Health Inform. 2021;25(9):3596–3607. [DOI] [PMC free article] [PubMed]
- 25.Xu J, Xi X, Chen J, Sheng VS, Ma J, Cui Z. A survey of deep learning for electronic health records. Appl Sci. 2022;12(22):11709. [Google Scholar]
- 26.Song H, Rajan D, Thiagarajan J, Spanias A. Attend and diagnose: Clinical time series analysis using attention models. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2018 Apr;32(1).
- 27.Choi E, Xu Z, Li Y, Dusenberry M, Flores G, Xue E, Dai A. Learning the graphical structure of electronic health records with graph convolutional transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2020 Apr;34(01):606–613.
- 28.Rasmy L, Xiang Y, Xie Z, Tao C, Zhi D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021;4(1):1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Ahn I, Na W, Kwon O, Yang DH, Park GM, Gwon H, et al. CardioNet: a manually curated database for artificial intelligence-based research on cardiovascular diseases. BMC Med Inform Decis Mak. 2021;21:1–15. [DOI] [PMC free article] [PubMed]
- 30.Shin SY, Lyu Y, Shin Y, Choi HJ, Park J, Kim WS, Lee JH. Lessons learned from development of de-identification system for biomedical research in a Korean tertiary hospital. Healthc Inform Res. 2013;19(2):102–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):18. [DOI] [PMC free article] [PubMed]
- 32.Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: Predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference; 2016 Dec. p. 301–318. PMLR. [PMC free article] [PubMed]
- 33.Pawar CS, Makwana A. Comparison of bert-base and GPT-3 for marathi text classification. In: Futuristic Trends in Networks and Computing Technologies: Select Proceedings of Fourth International Conference on FTNCT 2021. Singapore: Springer Nature Singapore; 2022 Nov. p. 563–574.
- 34.Ji S, Hölttä M, Marttinen P. Does the magic of BERT apply to medical code assignment? A quantitative study. Comput Biol Med. 2021;139:104998. [DOI] [PubMed] [Google Scholar]
- 35.Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. 2019.
- 36.Ethayarajh K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512. 2019.
- 37.Llaquet SI, Sallinen A, Boye G, Zhang M, Dupont-Roc M, Bernath B, et al. Llama-Tree-Meditron [70B] [Master’s thesis]. Universitat Politècnica de Catalunya; 2024.
- 38.Kim H, Hwang H, Lee J, Park S, Kim D, Lee T, et al. Small language models learn enhanced reasoning skills from medical textbooks. arXiv preprint arXiv:2404.00376. 2024.
- 39.Chen Z, Cano AH, Romanou A, Bonnet A, Matoba K, Salvi F, et al. Meditron-70B: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079. 2023.
- 40.Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368. 2023.
- 41.Peng Y, Yan S, Lu Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474. 2019.
- 42.Turchin A, Masharsky S, Zitnik M. Comparison of BERT implementations for natural language processing of narrative medical documents. Inform Med Unlocked. 2023;36:101139. [Google Scholar]
- 43.Beaney T, Jha S, Alaa A, Smith A, Clarke J, Woodcock T, et al. Comparing natural language processing representations of coded disease sequences for prediction in electronic health records. J Am Med Inform Assoc. 2024;31(7):1451–1462. [DOI] [PMC free article] [PubMed]
- 44.Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug. p. 785–794.
- 45.Kwon O, Na W, Kang H, Jun TJ, Kweon J, Park GM, et al. Electronic medical record–based machine learning approach to predict the risk of 30-day adverse cardiac events after invasive coronary treatment: Machine learning model development and validation. JMIR Med Inform. 2022;10(5):e26801. [DOI] [PMC free article] [PubMed]
- 46.Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36. [DOI] [PubMed] [Google Scholar]
- 47.Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11).
- 48.Rigatti SJ. Random forest. J Insur Med. 2017;47(1):31–9. [DOI] [PubMed] [Google Scholar]
- 49.Suthaharan S, Suthaharan S. Support vector machine. In: Machine learning models and algorithms for big data classification: Thinking with examples for effective learning. 2016. p. 207–235.
- 50.LaValley MP. Logistic regression. Circulation. 2008;117(18):2395–9. [DOI] [PubMed] [Google Scholar]
- 51.Chute CG, Çelik C. Overview of ICD-11 architecture and structure. BMC Med Inform Decis Mak. 2021;21(Suppl 6):378. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Choi EJ, Jun TJ, Park HS, Lee JH, Lee KH, Kim YH, et al. Predicting long-term survival after allogeneic hematopoietic cell transplantation in patients with hematologic malignancies: Machine learning–based model development and validation. JMIR Med Inform. 2022;10(3):e32313. [DOI] [PMC free article] [PubMed]
- 53.Han J, Kim Y, Kang HJ, Seo J, Choi H, Kim M, et al. Predicting low density lipoprotein cholesterol target attainment using machine learning in patients with coronary artery disease receiving moderate-dose statin therapy. Sci Rep. 2025;15(1):5346. [DOI] [PMC free article] [PubMed]
- 54.Lee D, de Keizer N, Lau F, Cornet R. Literature review of SNOMED CT use. J Am Med Inform Assoc. 2014;21(e1):e11–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Zhang D, Yin C, Zeng J, Yuan X, Zhang P. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med Inform Decis Mak. 2020;20:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
The de-identified electronic medical record data that support the findings of this study are available from the Asan Biomedical Research Environment (ABLE) database at Asan Medical Center, Seoul, Republic of Korea. Due to institutional and privacy restrictions, these data are not publicly accessible via a repository. Researchers may request access by contacting the corresponding author and obtaining approval from the Asan Medical Center Institutional Review Board (approval No. 2021-0303).