Large language models improve transferability of electronic health record-based predictions across countries and coding systems

Matthias Kirchler; Matteo Ferro; Veronica Lorenzini; Robin P van de Water; FinnGen; Christoph Lippert; Andrea Ganna

doi:10.1038/s41746-026-02363-5

2026 Jan 22;9:177. doi: 10.1038/s41746-026-02363-5

Matthias Kirchler ^1,2,3,#, Matteo Ferro ^3,#, Veronica Lorenzini ^4, Robin P van de Water ^1,2; FinnGen, Christoph Lippert ^1,2, Andrea Ganna ^3,4,✉

PMCID: PMC12916935 PMID: 41571946

Abstract

Variation in medical practices and reporting standards across healthcare systems limits the transferability of prediction models based on structured electronic health record data. Prior studies have demonstrated that embedding medical codes into a shared semantic space can help address these discrepancies, but real-world applications remain limited. Here, we show that leveraging embeddings from a large language model alongside a transformer-based prediction model provides an effective and scalable solution to enhance generalizability. We call this approach GRASP and apply it to predict the onset of 21 diseases and all-cause mortality in over one million individuals. Trained on the UK Biobank (UK) and evaluated in FinnGen (Finland) and Mount Sinai (USA), GRASP achieved an average ΔC-index that was 88% and 47% higher than language-unaware models, respectively. GRASP also showed significantly higher correlations with polygenic risk scores for 62% of diseases, and maintained robust performance even when datasets were not harmonized to the same data model.

Subject terms: Medical research, Risk factors

Introduction

Accurate disease risk estimation is important for guiding screening, preventative interventions, and early-stage treatments. Most disease prediction models used in clinical practice are based on a limited set of risk factors measured in prospective cohort studies^1–3. However, the increasing availability of electronic health record (EHR) data, combined with advances in machine learning, has opened avenues for automated risk prediction directly from EHRs. For example, promising results have been shown in predicting diseases such as pancreatic cancer⁴ and cardiovascular conditions^5–7.

Despite these opportunities, EHR data is inherently complex and lacks standardization across healthcare systems. Large providers with robust infrastructure can develop and deploy in-house models, but smaller healthcare providers often lack the necessary resources. Consequently, the ability to transfer predictive models across healthcare settings is crucial to democratizing access to EHR-based predictions.

A common approach to harmonizing health records involves mapping EHR data to a common data model (CDM), such as OMOP-CDM⁸, which uses standardized vocabularies like SNOMED or RxNorm. For example, harmonization efforts across EHR systems are envisioned by the European Health Data Space Act of the European Union. However, these efforts are resource-intensive, and even after standardization, EHR data remains heterogeneous due to factors such as local clinical practices, regulatory environments, and coding differences. This heterogeneity undermines the performance and generalizability of prediction models, limiting their applicability across diverse healthcare systems.

Previous work has shown that, despite the heterogeneous sources of medical coding vocabularies, it is possible to capture similarities between underlying medical concepts. This can be achieved by learning their latent representations using embedding-based approaches^9–13. These approaches often rely on natural language processing models trained on medical terms, many of which are variations of Bidirectional Encoder Representations from Transformers (BERT) models^14–17. Such embeddings, often derived from co-occurrence patterns, show promise but face generalization challenges when deployed externally¹⁸. Rare medical concepts might occur insufficiently often in the training data to learn high-quality embeddings but may be highly relevant for accurate disease prediction. Other approaches derive fixed embeddings from external sources, such as clinical ontologies¹⁹. However, integrating explicit knowledge graphs often demands substantial effort to align disparate sources of medical codes²⁰. A critical limitation of both ontology-based and co-occurrence-based embeddings is their dependency on the initial training vocabulary; this is often caused by training on a limited amount of data from a small number of centers. Concepts absent during the embedding creation process cannot be incorporated during inference, restricting model adaptability in real-world deployment. This limitation becomes particularly pronounced as medical vocabularies evolve; for example, many embeddings were trained exclusively on ICD-9 codes, rendering them obsolete for applications requiring ICD-10 or newer coding systems¹¹. Moreover, treatment practices might differ significantly between healthcare systems, a difference that is insufficiently reflected in the training dataset(s)²¹. Some prior studies have explored direct text interpretation of medical concepts using a large language model (LLM)^18,22–25, but they lack concrete applications for disease prediction across diverse healthcare systems.

In this study, we aim to empirically evaluate whether using pre-trained semantic embeddings of medical history can improve the transferability of EHR-based disease prediction models across large-scale EHR datasets from multiple countries. We specifically focus on LLM-derived embeddings but also consider embeddings trained directly on medical concepts.

We implement these approaches in a deep learning framework called Generalizable Risk Assessment with Semantic Projection (GRASP). Rather than relying solely on a CDM, GRASP maps medical codes into a unified semantic space using an LLM. A downstream transformer network then processes patient medical histories to predict disease risk. By capturing semantic similarities between medical codes, GRASP helps bridge differences in coding systems across EHR datasets. Notably, GRASP can successfully predict disease risk even for medical codes that were absent from the training data.

GRASP is resource-efficient and can be deployed even in environments with low computational resources. The LLM is used exclusively to generate a lookup table of embeddings, eliminating the need to expose patient data directly to the model. This allows GRASP to operate in secure processing environments without internet connectivity. We applied GRASP to predict 21 diseases and all-cause mortality using EHR data from over one million individuals across three countries (the United Kingdom, Finland, and the United States of America). Our results demonstrate that GRASP significantly improves model transferability across healthcare systems compared to state-of-the-art models like XGBoost²⁶, even when data is harmonized to OMOP-CDM.

Notably, GRASP facilitates generalization across healthcare systems without the need for harmonization of medical codes, leveraging semantic similarities to align otherwise distinct medical codes. The inductive bias of the LLM allows GRASP to achieve the same predictive performance as conventional models with far less training data, enhancing data efficiency. Furthermore, GRASP identifies individuals at higher genetic risk for diseases more accurately than language-unaware models, underscoring the value of integrating semantic embeddings in disease risk prediction.

Results

GRASP architecture

By replacing conventional medical code embeddings with language-aware representations generated by an LLM from corresponding clinical descriptions, we aim to improve the transferability of EHR-based disease predictions while ensuring high predictive performance and training efficiency.

We begin by mapping all OMOP vocabulary concepts to semantic embeddings using an LLM (OpenAI – text-embedding-3-large). This step does not require individual-level data. The LLM processes the natural-language name or description of each concept (e.g., “Acute upper respiratory infection” instead of OMOP code 54398005) and generates a high-dimensional embedding. These embeddings form a lookup table that links every concept to its vector representation. A patient’s medical history is subsequently encoded by querying this lookup table, avoiding the need for repeated LLM evaluations during model inference.
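The lookup-table step can be illustrated with a minimal sketch. It assumes a hypothetical stand-in embedder, `toy_embed` (a hashed bag of words), in place of the actual LLM call; the helper names `build_lookup` and `encode_history` and the concept IDs are also made up for illustration and are not taken from the GRASP code.

```python
import hashlib
import math

DIM = 8  # toy dimensionality; real LLM embeddings are far larger


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an LLM embedding call (hashed bag of words)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_lookup(concept_names: dict[int, str]) -> dict[int, list[float]]:
    """Embed every concept name once; no patient-level data is involved."""
    return {cid: toy_embed(name) for cid, name in concept_names.items()}


def encode_history(history: list[int], lookup: dict[int, list[float]]) -> list[list[float]]:
    """Encode a patient's history by table lookup only (no embedding call)."""
    return [lookup[cid] for cid in history if cid in lookup]


# Illustrative concept IDs and names (the IDs are made up for this sketch).
concepts = {
    1: "Acute upper respiratory infection",
    2: "Hyperglycemia",
    3: "Essential hypertension",
}
lookup = build_lookup(concepts)
patient = [1, 3, 99]  # 99 is not in the table and is skipped
encoded = encode_history(patient, lookup)
print(len(encoded), len(encoded[0]))
```

Because the table is built from concept names alone, only the finished table needs to enter the secure environment where patient records are held.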

Next, we introduce a multi-layer transformer neural network (Methods) that uses the encoded medical history and predicts the risk of developing 22 different health outcomes (21 diseases and all-cause mortality) (Fig. 1; Supplementary Tables 1, 2). Continuous variables (e.g., age) can be encoded with positional embeddings added onto the concepts. The use of pre-trained embeddings and a decoder-only transformer makes the architecture lightweight and efficient to train and deploy. The network is trained jointly across all 22 endpoints using a Cox proportional hazards loss function.
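For readers unfamiliar with the Cox loss, the following is a minimal sketch of the negative log partial likelihood for a single endpoint. It assumes Breslow-style handling of tied event times and averaging over observed events; the exact formulation used in GRASP is not specified in this section, and the function name and toy data are illustrative.

```python
import math


def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative Cox log partial likelihood (Breslow ties), averaged over events.

    risk:  predicted log-hazard scores, one per individual
    time:  follow-up time until event or censoring
    event: 1 if the event was observed, 0 if censored
    """
    n_events = sum(event)
    if n_events == 0:
        return 0.0
    total = 0.0
    for i in range(len(risk)):
        if not event[i]:
            continue
        # Risk set: everyone still under observation at time[i].
        log_denominator = math.log(
            sum(math.exp(risk[j]) for j in range(len(risk)) if time[j] >= time[i])
        )
        total += risk[i] - log_denominator
    return -total / n_events


# Toy data: individuals 0 and 2 have events, 1 and 3 are censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 0]
good_scores = [2.0, 0.0, 1.0, -1.0]  # higher risk for earlier events
bad_scores = [-1.0, 1.0, 0.0, 2.0]   # ranking reversed

loss_good = cox_neg_log_partial_likelihood(good_scores, times, events)
loss_bad = cox_neg_log_partial_likelihood(bad_scores, times, events)
print(loss_good < loss_bad)
```

In a joint model over several endpoints, one such term per endpoint could be summed or averaged; how the 22 terms are combined here is an assumption left open by this sketch.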

Fig. 1 — A The model to jointly predict the first occurrence of 21 diseases and all-cause mortality is developed in UK Biobank and validated in FinnGen and Mount Sinai. B We also evaluate the benefit of fine-tuning the model in Mount Sinai. C Two patients have the same diagnosis but different OMOP codes, for example, because of different coding practices. LLM embeddings can be used to match the two codes based on the corresponding natural-language name. D GRASP maps the entire medical history of a patient using the text-embedding-3-large model from OpenAI and trains a transformer to predict the 22 outcomes. This allows the generation of a score representing the disease risk for each outcome.

The core advantage of GRASP stems from the LLM’s language understanding. For example, consider a scenario where the concept “High glucose level in blood” is prevalent in the source dataset, while “Hyperglycemia” is rare or absent. In the target dataset, the reverse holds true—only “Hyperglycemia” appears. The LLM embeds both concepts closely due to their semantic similarity. This allows the downstream transformer to generalize and recognize “High glucose level in blood” as nearly synonymous with “Hyperglycemia,” even if the latter was never encountered during training.

Crucially, this semantic alignment does not solely depend on surface-level similarities (e.g., spelling) but rather on the underlying meaning. By positioning semantically similar concepts closer and unrelated concepts farther apart, GRASP achieves greater data efficiency, enabling robust predictions even from small training sets. Additionally, GRASP facilitates zero-shot generalization to previously unseen concepts, enhancing model adaptability across datasets.
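The intuition behind zero-shot generalization can be sketched with cosine similarity. The three-dimensional vectors below are hand-written, made-up values rather than real LLM embeddings, and the explicit nearest-neighbour search is only an illustration: in GRASP the transformer receives the embeddings directly and is not described as performing such a lookup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, table):
    """Return the name in `table` whose vector is most similar to `query_vec`."""
    return max(table, key=lambda name: cosine(query_vec, table[name]))


# Concepts seen during training (toy vectors, values are made up).
train_concepts = {
    "High glucose level in blood": [0.9, 0.1, 0.0],
    "Fracture of femur": [0.0, 0.2, 0.95],
}
# Concept that appears only in the target dataset (toy vector).
unseen_vec = [0.85, 0.2, 0.05]  # "Hyperglycemia"

match = nearest(unseen_vec, train_concepts)
print(match)
```

A concept that lands close to a well-represented training concept presents the network with a nearly identical input, which is why no retraining is needed for it to contribute to the prediction.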

Cohort characteristics and study design

We utilized three distinct datasets, comprising two large biobank-based studies—UK Biobank²⁷ (UKB) from the United Kingdom (N = 391,921) and FinnGen²⁸ from Finland (N = 253,991)—along with an extensive EHR dataset from the Mount Sinai Health System in the United States (N = 386,755). These datasets represent diverse populations and health systems, allowing robust assessment of model transferability.

Part of UK Biobank was used to train the models, which were evaluated both internally within the UK Biobank and externally using FinnGen and Mount Sinai (Methods). The observation period for each individual, as defined in the main experimental setup (Methods and Supplementary Fig. 2), extended on average up to 12 years in UK Biobank, 26 years in FinnGen, and 6 years in Mount Sinai. Predictors included age, sex, and observed OMOP-mapped concepts reflecting prior disease diagnoses, procedures, and drug prescriptions. On average, 29 unique concepts were observed per individual in UK Biobank, 39 in FinnGen, and 19 in Mount Sinai. To reduce the risk that conditions closely related to the target disease are used as predictors, we implemented a two-year washout period following the baseline date.

Model predictions were assessed over the follow-up period, which spanned an average of 11 years in UK Biobank, 10 years in FinnGen, and 6 years in Mount Sinai. During this period, we predicted the time-to-first event by fitting a Cox model adjusting for age, sex and the risk score from GRASP. We considered all-cause mortality and 21 common diseases of significant public health relevance, as detailed in Supplementary Table 1. In UK Biobank, knee osteoarthritis exhibited the highest incidence rate (4.3%), whereas inflammatory bowel disease had the lowest (0.41%). Cohort characteristics and descriptive statistics are provided in Supplementary Table 2.
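The C-index and ΔC-index reported throughout can be illustrated with a minimal sketch of Harrell's concordance index computed directly from risk scores. This is a simplification: in the study the C-index comes from a Cox model that includes age, sex, and the GRASP score, whereas the function below, its name, and the toy data are assumptions for illustration only.

```python
def concordance_index(risk, time, event):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when i had an observed event and j was
    still under observation after time[i]. Tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(risk)
    for i in range(n):
        if not event[i]:
            continue
        for j in range(n):
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: individuals 0 and 2 have events, 1 and 3 are censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 0]
model_scores = [2.0, 0.0, 1.0, -1.0]     # hypothetical model output
baseline_scores = [0.0, 1.0, 2.0, -1.0]  # hypothetical baseline output

c_model = concordance_index(model_scores, times, events)
c_baseline = concordance_index(baseline_scores, times, events)
delta_c = c_model - c_baseline
print(c_model, c_baseline, delta_c)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; ΔC-index is the difference between a model's C-index and that of the baseline.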

GRASP improves transferability across OMOP-mapped datasets

To assess the capabilities of GRASP when applied to OMOP-mapped datasets, we evaluated: (1) the model performance within the held-out UK dataset, (2) the cross-country transferability from UK to Finland and USA (without any additional fine-tuning), and (3) the cross-country transferability from UK to USA with additional fine-tuning. To perform these assessments, we compared GRASP performance to three other models: (1) an age-and-sex-only model (Methods), to quantify the baseline predictability of outcomes using only basic demographic information; (2) a transformer model with randomly initialized embeddings, to isolate the contribution of semantic knowledge; and (3) a tabular model (XGBoost), to compare GRASP against a widely used standard alternative for EHR prediction tasks^29,30. Within the UK Biobank cross-validation test set, GRASP, XGBoost, and the model using random embeddings all consistently outperformed the age- and sex-based baseline (average ΔC-index of 0.069, 0.078, and 0.081 for the random-embedding, XGBoost, and GRASP models, respectively; Supplementary Fig. 1; Supplementary Table 3). However, GRASP did not perform significantly better than both of the other models for any specific outcome.

When transferred to external OMOP-mapped datasets, GRASP demonstrated comparable and in some cases improved performance. Relative to random embeddings, GRASP achieved on average higher C-index values in FinnGen (average C-index of 0.712 vs. 0.676; Fig. 2, panel A; Supplementary Table 4) and Mount Sinai (average C-index of 0.698 vs. 0.677; Fig. 2, panel B; Supplementary Table 5). It also improved upon XGBoost in both external datasets (average C-index of 0.721 vs. 0.689 in FinnGen, and average C-index of 0.698 vs. 0.688 in Mount Sinai; Fig. 2, panels A and B; Supplementary Tables 4 and 5). GRASP achieved statistically significant improvements (p < 0.05) for 12 of the 22 outcomes in Finland and 5 of the 22 outcomes in the USA, while showing similar performance to other models for the remaining outcomes. Consistent improvements were observed for asthma, chronic kidney disease (CKD), and heart failure prediction across both datasets (Fig. 2, panels A and B).

Fig. 2 — Performances (C-index) of a model with demographic (age and sex, blue) information only, random embeddings (light blue), XGBoost (red), and OpenAI embeddings (GRASP, orange) to predict the first occurrence of 22 outcomes. Models are trained in UK Biobank and tested in external datasets from: A Finland (FinnGen), and B USA (Mount Sinai). Horizontal lines represent 95% confidence intervals obtained via bootstrapping. An asterisk symbol is added to those outcomes for which GRASP performance is significantly better (p < 0.05) than all other models tested.
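The bootstrapped confidence intervals shown in the figures can be illustrated with a minimal percentile-bootstrap sketch. The number of resamples, the percentile method, and the fixed seed are assumptions, as the bootstrap details are not given in this section; the statistic used here is a simple mean of made-up values, whereas in the study it would be the C-index recomputed on resampled individuals.

```python
import random


def bootstrap_ci(values, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping the sample size.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lo_idx = round((alpha / 2) * (n_boot - 1))
    hi_idx = round((1 - alpha / 2) * (n_boot - 1))
    return estimates[lo_idx], estimates[hi_idx]


def mean(xs):
    """Arithmetic mean, used here as a placeholder statistic."""
    return sum(xs) / len(xs)


# Made-up per-individual values standing in for any per-sample quantity.
toy_values = [i % 10 for i in range(200)]
lower, upper = bootstrap_ci(toy_values, mean)
print(mean(toy_values), lower, upper)
```

To compare two models, the same resampled individuals can be scored by both and the interval taken over the difference in the statistic.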

Finally, to further assess whether adapting GRASP to an external dataset could enhance its transferability, we fine-tuned the weights initially trained on UK Biobank data using a small, independent training set from Mount Sinai, separate from the test set. This approach allows GRASP to leverage the larger UK Biobank dataset while adapting to the specific characteristics of the target data. For comparison, we also evaluated GRASP trained solely on Mount Sinai data (without UK Biobank pre-training) and XGBoost trained exclusively on Mount Sinai. Fine-tuned GRASP outperformed both XGBoost and the GRASP model trained exclusively on Mount Sinai data (average C-index of 0.721 vs average C-index of 0.703 and 0.713, respectively; Supplementary Table 6).

GRASP transfers well across datasets mapped to different data models

So far, we have explored the transferability of GRASP on datasets mapped to a common data model (OMOP-CDM). Next, we investigated whether GRASP’s language-based embeddings could facilitate translation between different data models without requiring re-training. To assess this, we evaluated GRASP’s performance on Mount Sinai data under two conditions: (1) using only OMOP-mapped disease concepts, where the same data model was applied in both the training and testing datasets; and (2) using a different data model in the test set, where disease conditions were coded in the ICD-10-CM format. Notably, no explicit ontology mapping between OMOP and ICD-10-CM was applied.

As expected, the highest performance was observed when both the training and test datasets were mapped to OMOP, resulting in an average ΔC-index of 0.056 compared to the age and sex-only baseline (Fig. 3 and Supplementary Table 7). Even when evaluated using solely ICD-10-CM codes in Mount Sinai, GRASP demonstrated notable improvements, achieving an average ΔC-index of 0.036 over the baseline, with only 9 of the 22 outcomes showing significantly lower performance using ICD-10-CM (Fig. 3). This result is particularly striking given that no direct mapping between OMOP and ICD-10-CM was provided to GRASP. The model successfully inferred the relationships between the two coding systems by leveraging the semantic similarity of disease names, an ability that enables cross-data-model transfer, a task that conventional models such as XGBoost cannot perform.

Fig. 3 — Models are trained in UK Biobank to jointly predict the first occurrence of 22 outcomes. The figure reports the performances in the Mount Sinai dataset of a model that just uses age and sex (blue), GRASP model applied to a different data model (ICD-10-CM, light blue), or the same data model (OMOP-CDM, orange). Horizontal lines represent 95% confidence intervals obtained via bootstrapping. An asterisk symbol is added to those outcomes for which the performance of the OMOP-based models is significantly better (p < 0.05) than the ICD-10-CM-based models.

GRASP improves training-efficiency with small sample sizes

We reasoned that GRASP introduces significant inductive bias by positioning similar concepts nearby and unrelated concepts far apart, which can result in more efficient data utilization with fewer individuals. To test this hypothesis, we re-trained GRASP on subsets of 10,000 to 200,000 individuals in the UK Biobank and evaluated performances within UK Biobank via cross-validation and externally in FinnGen and Mount Sinai (Methods).

We found that, across the 22 health outcomes, GRASP performance was significantly higher than that of the same model with random embeddings and of XGBoost. This effect was especially pronounced at very small sample sizes (Fig. 4 and Supplementary Table 8). For example, the average ΔC-index improvement against XGBoost was 0.1 at N = 10,000 vs 0.01 at N = 200,000 in the cross-validation in UK Biobank. Unlike GRASP, XGBoost transferred poorly to Mount Sinai and FinnGen when trained on smaller sample sizes and significantly improved its performance only with larger sample sizes.

Fig. 4 — Average prediction performances across 22 health outcomes as a function of training sample size in UK Biobank. Comparison between GRASP (orange) and XGBoost (red). All models are trained in UK Biobank and evaluated in: A the cross-validation test-sets of UK Biobank, B FinnGen, and C Mount Sinai. The shaded area represents 95% confidence intervals obtained via bootstrapping.

Impact of concept-specific text on GRASP performance and comparison with other embedding methods

We hypothesized that incorporating additional information about each concept/medical code could enhance model performance by providing a more comprehensive representation of its semantic and ontological context. To evaluate this hypothesis, we enriched the embedded text for each concept with supplementary data extracted directly from the OMOP ontology. This included hierarchical relationships such as concept ancestors and descendants, and attributes like associated morphologies and anatomical finding sites (Methods). We found no substantial difference between using these enriched concept texts compared to the simpler embeddings using only the concept names (Supplementary Table 9), suggesting that just the name of the concept was sufficient to determine its similarity with other medical concepts.

We compared GRASP, which uses embeddings from OpenAI's generalist LLM, with other embedding methods trained using biomedical data: GatorTron³¹, an open-source clinical LLM, and SapBERT³², a biomedical BERT embedding method. All three approaches resulted in similar prediction improvements over a model using random embeddings, with no single method consistently outperforming the others across all diseases (Supplementary Fig. 3 and Supplementary Tables 3, 4, and 5).

Models’ calibration

GRASP using OpenAI, GatorTron, or SapBERT achieved similarly good calibration in FinnGen, with average integrated calibration index (ICI) values of 0.00138, 0.00132, and 0.00133, respectively (Supplementary Fig. 4 and Supplementary Table 10). Similar findings were observed in the Mount Sinai dataset (Supplementary Table 11). Calibration from GRASP was not better than that obtained with random embeddings.

These results were obtained based on our main approach, which fitted a Cox model in the test set including age, sex and GRASP-predicted disease risk. This procedure can be considered a form of re-calibration. Therefore, we also evaluated calibration when the GRASP-derived risk scores were directly used to predict disease incidence or all-cause mortality in the test set, without additional adjustment. In this scenario, we observed overall poorer calibration, with average ICI values of 0.01266, 0.00957, and 0.01164 for the three embedding methods when tested in FinnGen (Supplementary Table 10). This decrease in calibration performance is expected, given the differences in disease incidence rates between UK Biobank, FinnGen, and Mount Sinai (Supplementary Table 1). As an example, Supplementary Fig. 5 shows calibration plots for osteoporosis, comparing performance when the model is re-calibrated using age and sex versus when no recalibration is performed.

Understanding how GRASP generalizes medical concepts

We wanted to better understand how semantic similarities between different OMOP codes can improve model performance and transferability. We used UMAP to provide a two-dimensional representation of the embeddings of OMOP codes in UK Biobank, FinnGen, and Mount Sinai, focusing on depression as an example (Fig. 5A).

We first show that similar concepts cluster together, even when they are present in only one of the three datasets. For example, we identified a cluster of concepts related to substance abuse. It is well-established that depression and substance abuse are closely linked³³. Within this cluster, the concept of ‘opioid abuse,’ observed only in the UK Biobank, was closely positioned to ‘opioid dependence,’ which was also observed in FinnGen (Fig. 5B). Similarly, ‘cocaine dependence,’ present only in the UK Biobank, clustered near other drug-related concepts. However, we also observed some clear misclassifications; for example, ‘adult victim of abuse’ clustered near drug abuse concepts, likely due to the shared term ‘abuse’ in their descriptions.

Most concepts in the substance abuse cluster have similar importance in predicting depression (Fig. 5C) despite large differences in the frequency of concepts across biobanks (Fig. 5D). These results highlight how the semantic embedding in GRASP overall helps overcome differences in use and frequency of OMOP concepts across the three datasets.

GRASP semantic embeddings result in a stronger association with polygenic scores

We aimed to validate GRASP’s enhanced predictive capabilities through an orthogonal approach. Specifically, we hypothesized that GRASP’s risk estimates would more accurately identify individuals at higher genetic risk for diseases compared to models that do not incorporate language-based embeddings. Polygenic scores, which aggregate the effects of thousands of genetic variants, capture genetic risk and serve as an independent method to identify individuals at elevated disease risk. In FinnGen, we computed polygenic scores for 16 diseases following the approach described in Mars et al.³⁴ and assessed their correlation with predictions from GRASP and a comparable model using random embeddings.

GRASP demonstrated significantly stronger correlations (p < 0.05) with polygenic scores for 10 out of 16 diseases (Fig. 6), indicating its superior ability to identify individuals with high genetic risk. These findings suggest that GRASP’s language-informed embeddings improve the model’s capacity to capture underlying disease susceptibility beyond what is achievable with language-unaware models.
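The comparison can be illustrated with a minimal sketch that correlates two sets of model predictions with a polygenic score. The use of Pearson correlation is an assumption, since the type of correlation is not stated in this section, and all numbers below are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five individuals.
polygenic_score = [0.1, 0.4, 0.5, 0.9, 1.2]
semantic_model_risk = [0.2, 0.3, 0.6, 0.8, 1.0]  # hypothetical language-aware model
random_model_risk = [0.5, 0.1, 0.9, 0.3, 0.7]    # hypothetical random-embedding model

r_semantic = pearson(semantic_model_risk, polygenic_score)
r_random = pearson(random_model_risk, polygenic_score)
print(r_semantic > r_random)
```

Whether the difference between two such correlations is statistically significant requires a separate test, for example resampling individuals, which this sketch does not include.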

Discussion

Harmonization of EHR data across healthcare systems through CDMs is a valuable yet resource-intensive process. Initiatives like EHDEN have successfully promoted the adoption of the OMOP data model across European countries. However, the majority of EHR systems remain unmapped to OMOP, limiting interoperability. Even when datasets are fully harmonized to the same CDM, discrepancies persist due to variations in how medical codes are applied across healthcare systems.

Kather and colleagues recently argued that the reliance on standardized medical codes may be outdated in the era of LLMs, proposing that natural language should become the universal interface for healthcare³⁵. While this vision holds promise, medical codes continue to serve critical roles beyond clinical practice, including healthcare management and billing. As such, it is unlikely that medical coding systems will be entirely replaced by natural language in the foreseeable future.

In this study, we provide empirical evidence, based on three widely-used real datasets and across three different countries, that embedding medical concepts into a unified semantic space, combined with a transformer-based architecture, can enhance model transferability across different EHR datasets and coding systems without the need for explicit manual mappings.

Our results indicate that GRASP performs comparably to baseline models (e.g., XGBoost) within the training cohort (UK Biobank) and shows evidence of improved generalization when transferred to external datasets. Performance gains were more pronounced in FinnGen (Finland) and more modest in Mount Sinai (USA). GRASP achieved better performance than competing models for a subset of outcomes, including asthma, CKD, and heart failure, where consistent improvements were observed across both external datasets. Notably, GRASP maintains strong performance even when the external dataset is not mapped to the same CDM, a setting where many existing models fail to transfer at all. By leveraging semantic language similarities rather than relying solely on direct mappings, GRASP bridges gaps in data interoperability, allowing for cross-system model deployment.

Johnson and colleagues recently introduced unified clinical vocabulary embeddings derived from a clinical knowledge graph, applying them to disease risk prediction in an Israeli healthcare system¹⁹. While this method offers an intriguing alternative, their embeddings could not be mapped to all OMOP codes used in this study and lacked compatibility with ICD-10 codes. These limitations underscore the advantages of using embeddings derived from general-purpose LLMs, which provide broader coverage and are more easily integrated across diverse coding systems.

Using embeddings from OpenAI’s LLM is just one possible approach, but it critically relies on closed-source technology. Open-source embeddings represent valid alternatives, and we tested two popular models specifically trained on biomedical data: GatorTron³¹ and SapBERT³². Their performance was overall similar to that of OpenAI, indicating that training on biomedical data does not, in itself, offer a clear advantage in improving model transferability. This is consistent with recent experiments showing that LLMs fine-tuned on biomedical data do not outperform generalist LLMs³⁶.

GRASP offers several strengths. It is computationally efficient, fast to train, and can be deployed in secure, resource-limited environments without exposing patient-level data to external systems. Moreover, GRASP effectively utilizes small training sets by exploiting the inductive biases inherent in language models. Its ability to achieve zero-shot transferability across datasets with varying vocabularies or differing coding schemes highlights its potential to enhance data interoperability and improve predictive modeling across healthcare systems, and shows advantages compared to tabular data-based methods such as XGBoost.

Nevertheless, several limitations remain. First, the current implementation of GRASP does not model the longitudinal sequence of medical codes, instead assuming that all codes are observed at a single point prior to baseline; this may explain the similar performance compared to XGBoost. Incorporating sequential architectures that capture the temporal progression of medical events, similar to approaches used in generative models like Delphi-2M³⁷, could enhance predictive performance. Incorporating positional embeddings could allow GRASP to model richer longitudinal data, such as lab results, vital signs, and free-text clinical notes, extending beyond the diagnoses, procedures, and medication information commonly found in EHR data. Moreover, different tokenization strategies could prove to be superior³⁸. Additionally, future evaluations could include comparisons with other baselines, such as a multilayer perceptron (MLP) classifier or sequence-based models like GRU and LSTM, to better contextualize GRASP’s performance. Second, GRASP’s performance has been evaluated using data from three high-income countries with advanced healthcare infrastructure. Its generalizability to underrepresented settings, including low- and middle-income countries, remains untested. Furthermore, as with all models built on LLMs, GRASP may inherit biases from the language model’s training data, potentially reinforcing systemic inequalities or misrepresenting minority populations^39,40. Addressing these biases is critical to ensure equitable and safe deployment in real-world healthcare settings. Third, while GRASP improves prediction accuracy, it does not improve model calibration compared to a language-unaware approach. This highlights the importance of re-calibrating any prediction model in the clinically relevant population. Fourth, GRASP performance was only evaluated against the same model with other embeddings (random and open-source ones) and an XGBoost model; further comparisons against conventional risk scores and more advanced prediction models are needed to fully understand the prediction capabilities of the suggested framework.

In conclusion, GRASP offers a potential solution to improve the transferability of EHR-based disease predictions across diverse healthcare systems.

Methods

This study followed the TRIPOD AI reporting guideline to ensure transparent and comprehensive reporting of the development, validation, and performance of the artificial intelligence predictive model⁴¹.

Datasets

The UK Biobank²⁷ is a large-scale biomedical database containing diverse health information from 500,000 middle-aged individuals recruited between 2006 and 2010 from across the UK. It includes extensive phenotypic information, genetic data, imaging, and EHR. We used EHR data mapped onto OMOP CDM by Regeneron and Odysseus Data Services and provided by the UK Biobank (field 20142). Data was originally sourced from hospital inpatient data, assessment center data, and partially from primary care data; however, primary care data is only available for ~45% of the cohort and is not necessarily complete. To reduce the likelihood of including individuals with incomplete data or those who may have transitioned between EHR systems, we restricted the cohort to participants with at least one recorded condition before and after the baseline date.

FinnGen (https://www.finngen.fi/en), launched in 2017, is a public-private research project combining genome and digital healthcare data on about 500,000 Finns²⁸. The nationwide research project aims to provide novel medically and therapeutically relevant insight into human diseases. FinnGen is a pre-competitive partnership of Finnish biobanks and their background organizations (universities and university hospitals), international pharmaceutical industry partners, and the Finnish biobank cooperative (FINBB). All FinnGen partners are listed here: https://www.finngen.fi/en/partners. A list of FinnGen authors is provided in Supplementary Table 12.

The Mount Sinai Health System is a large network of hospitals and health-care providers in New York City. Longitudinal clinical data are recorded and mapped to the OMOP CDM in the Mount Sinai Data Warehouse (MSDW), and data are made available to researchers via the AI-ready Mount Sinai (AIR·MS) platform. The MSDW contains records for more than 11 million patients across more than 87 million patient encounters and is updated regularly. It provides longitudinal data on diagnoses, lab results, prescriptions, hospitalizations, and procedures. Due to the urban location of the hospital system, this dataset consists of a highly diverse population, representing multiple ethnicities and socioeconomic groups. To reduce the likelihood of including individuals with incomplete data or those who may have transitioned between EHR systems, we restricted the cohort to participants with at least one recorded condition before and after the baseline date.

Endpoint definitions

Even though all three datasets are mapped to the OMOP CDM, they used partially different mappings. This makes creating homogeneously defined, clinically meaningful endpoints challenging. We decided to utilize pre-defined, harmonized endpoints from the FinnGen project. These were originally defined using ICD revisions 8, 9, and 10, as well as additional information from ICD-O-3, procedure codes, drug reimbursement codes, and ATC codes. For FinnGen individuals, all endpoints were already precomputed. For Mount Sinai data, we used the original ICD-10-CM codes and defined endpoints using the ICD-10 FinnGen definitions.

In UK Biobank, we had less detailed access to ICD-mapped codes, with significant discrepancies between OMOP mappings and ICD mappings, which required a more complex workaround. In particular, we used ICD-10 codes from FinnGen as a starting point and mapped those to SNOMED conditions using the OMOP ontology (“Non-standard to Standard map”/“maps to” relationship, using both ICD-10 and ICD-10-CM). Unfortunately, this mapping was error-prone, and we manually filtered incorrectly mapped concepts (e.g., OMOP concept “Periodontal disease” (OMOP ID 134398) mapped to ICD-10-CM “Type 2 diabetes mellitus with periodontal disease” (ICD-10-CM code E11.630), which is a subcategory of “Type 2 diabetes mellitus” (E11)). To additionally reduce the amount of endpoint leakage, for each endpoint we manually inspected the concepts with the highest occurrence discrepancy between cases and controls and added these concepts if needed. The final endpoint was then defined from this list of OMOP concepts. Despite these quality controls, a small amount of endpoint leakage from OMOP concepts is to be expected across all three datasets, slightly inflating model performance for all evaluated methods. Only for the endpoint “Type 2 diabetes” did we additionally create a list of concepts to exclude individuals with such an occurrence; we excluded individuals if T2D could not be distinguished from other types of diabetes (e.g., the concept “Autonomic neuropathy due to diabetes mellitus”), or if the individual had another type of diabetes (e.g., “Type 1 diabetes mellitus uncontrolled”).

Main experiments setup

The same study design was implemented in all cohorts; a schematic can be found in Supplementary Fig. 2.

First, EHR information for each patient was extracted during the observation period, which runs from birth to two years before the baseline date. In UK Biobank, the baseline date was set to the first assessment center visit, while for FinnGen and Mount Sinai, a fixed date (01-01-2009 for FinnGen and 01-01-2018 for Mount Sinai) was uniformly applied across all individuals. We implemented the two-year washout period to mitigate the risk that conditions closely related to the disease being predicted are used as predictors. The final list of predictors used in the analysis included: age at baseline, sex, and the observed OMOP-mapped concepts reflecting disease diagnoses, procedures, and drug prescriptions.

Following the observation period, outcomes were assessed during the follow-up period, which extended from the baseline date until the occurrence of the first event of interest, death, or the end of follow-up. We predicted the time-to-first event for all-cause mortality and 21 common diseases of significant public health relevance, as detailed in Supplementary Table 1.

We train all model setups (Random embedding, Gradient boosting and GRASP) on UK Biobank data using 4-fold cross-validation, using a similar evaluation strategy to Steinfeldt et al.⁴². We created four train/test splits with disjoint test sets where approximately 75% of the data are training and 25% of the data are test sets. Within each training set we set apart 10% (relative) for validation and hyperparameter selection. All model setups use identical data splits.
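The split scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_cv_splits`, the random seed, and the round-robin fold assignment are assumptions made for the example.

```python
import random


def make_cv_splits(ids, n_folds=4, val_fraction=0.10, seed=0):
    """Return one (train, val, test) triple of ID lists per fold.

    Test sets are disjoint and together cover all IDs. Within each
    training portion, `val_fraction` is held out for validation.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    # Deal IDs round-robin so that fold sizes differ by at most one.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rng.shuffle(rest)
        n_val = int(round(val_fraction * len(rest)))
        val, train = rest[:n_val], rest[n_val:]
        splits.append((train, val, test))
    return splits


# Toy cohort of 1000 individual IDs.
splits = make_cv_splits(range(1000))
```

Reusing the same `splits` object for every model setup is one simple way to guarantee identical data splits across methods.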

Reported evaluations on UK Biobank data denote the overall C-index when aggregating all test-split predictions (for more information on the C-index evaluation, see the section “GRASP architecture and design”, Step 3: Model application).

In the external datasets, we have four independent models (one for each train/test split in UK Biobank) and use them to generate four risk scores for each individual and endpoint. We average the four scores to obtain an overall risk score and feed this risk score, together with age and sex, into a simple Cox-PH model fit on 20% of the target data; the remaining data are used for evaluation.
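The ensembling step amounts to a per-individual mean over the four cross-validation models. The sketch below illustrates only this averaging; the function name and the toy scores are hypothetical.

```python
def average_risk_scores(per_model_scores):
    """Average risk scores across models.

    `per_model_scores` holds one list per model; each list contains the
    risk scores for the same individuals in the same order.
    """
    n_models = len(per_model_scores)
    n_individuals = len(per_model_scores[0])
    if any(len(scores) != n_individuals for scores in per_model_scores):
        raise ValueError("all models must score the same individuals")
    return [
        sum(scores[i] for scores in per_model_scores) / n_models
        for i in range(n_individuals)
    ]


# Made-up scores from four models for three individuals.
scores = [
    [0.2, 1.0, -0.5],
    [0.4, 0.8, -0.3],
    [0.0, 1.2, -0.7],
    [0.2, 1.0, -0.5],
]
ensemble = average_risk_scores(scores)
```

The averaged score would then enter the three-variable Cox-PH model as a single covariate next to age and sex, for example via the `CoxPHFitter` class of the lifelines library.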

GRASP architecture and design

The GRASP architecture consists of three phases (Fig. 1, Panel D). First, we map the full EHR vocabulary onto semantic embeddings using an LLM. Second, we use these embeddings as input to a multi-layer transformer neural network to predict risk scores for all target endpoints jointly. Finally, for each endpoint, we tune a linear Cox Proportional Hazards model on only age, sex, and the EHR-based risk score.

Step 1: LLM embedding setup. Implementations of OMOP all depend on a common set of standard vocabularies, such as SNOMED for conditions and procedures or RxNorm for drug prescriptions. While datasets can use additional non-standard vocabularies or custom concepts, the majority are either already available in the standard vocabularies or can be mapped using non-standard to standard mappings provided by OMOP (e.g., from ICD-10-CM to SNOMED). In the first step of the GRASP architecture, we embed all open OMOP standard vocabularies into a semantic embedding space using OpenAI’s state-of-the-art embedding LLM (“text-embedding-3-large”). For this, we only take the concept name, e.g., “Hyperglycemia” for the concept with OMOP Concept ID 4214376 and ignore other metadata (see Supplementary Table 8 for experiments with additional contextual information). Concept names can have varying length, with conditions and procedures often relatively short (e.g., “Essential hypertension”, “Cough”, or “Radiologic examination of knee”) and concept names for drug exposures sometimes including a list of active components and brand names (e.g., “floxacillin 250 MG Oral Capsule” or “60 ACTUAT fluticasone propionate 0.25 MG/ACTUAT/salmeterol 0.05 MG/ACTUAT Dry Powder Inhaler [Wixela]”).

We created a large lookup table from all available concepts once. Notably, this did not require the use of any dataset-specific data at all: we mapped standard and valid conditions (n = 173,526), procedures (n = 254,010), and drugs (n = 2,007,406) from the freely available standard vocabularies as provided by OHDSI. This excluded, however, the non-free procedure vocabulary CPT-4 commonly used in the US, including the Mount Sinai dataset, which we mapped back to SNOMED procedures using the OMOP “CPT-4 to SNOMED equivalent” relationship. After creation of the lookup table, our approach did not require any further access to an LLM and did not expose any patient data directly or indirectly to third parties. All possible OMOP concepts encountered at inference are covered by the lookup table, even if a given dataset may only contain a subset of these concepts. New concepts would only arise if the OHDSI OMOP-CDM reference (https://athena.ohdsi.org/search-terms/start) is updated, in which case the lookup table would need to be regenerated.

In addition to the conditions, procedures, and drug exposures we also created embeddings for sex (with text “Biological sex is female” or “Biological sex is male”) and age (with text “Age at baseline”, see next section for details on age encoding).

The resulting embedding has a dimensionality of 3072 and is normalized to unit Euclidean norm. Semantically similar concepts are positioned close together even if they don’t coincide on a character level⁴³.
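To make the lookup-table idea concrete, the sketch below normalizes vectors to unit Euclidean norm and compares concepts by dot product, which equals cosine similarity for unit vectors. The three-dimensional vectors and the concept name "High blood glucose" are made-up stand-ins for the 3072-dimensional LLM embeddings; no embedding API is called.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    """Dot product; equals cosine similarity when u and v are unit-norm."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical lookup table: concept name -> unit-norm embedding.
lookup = {
    "Hyperglycemia": normalize([0.9, 0.1, 0.2]),
    "High blood glucose": normalize([0.8, 0.2, 0.2]),
    "Radiologic examination of knee": normalize([0.1, 0.9, 0.3]),
}

sim_close = cosine_similarity(
    lookup["Hyperglycemia"], lookup["High blood glucose"]
)
sim_far = cosine_similarity(
    lookup["Hyperglycemia"], lookup["Radiologic examination of knee"]
)
```

In this toy table the two glucose-related names end up closer to each other than to the imaging procedure, mirroring the behavior expected from a semantic embedding.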

Step 2: Model architecture and training. In traditional machine learning settings, an individual’s medical history would be required to be encoded as a single vector of fixed length, with each dimension likely corresponding to a single concept. By contrast, in our setup, an individual’s medical history $h_{i}$ is represented as the variable-size collection of encountered OMOP concepts, mapped to their respective $d$-dimensional embeddings: $h_{i} = [e_{1}, \dots, e_{n_{i}}] \in \mathbb{R}^{d \times n_{i}}$.

To encode an individual’s age, we incorporate a sinusoidal positional embedding⁴⁴ into the “age at baseline” embedding. For other concepts we omit associated quantitative information, such as time of occurrence or sequence order, though these could also be included using positional embeddings. We retain all occurrences of each concept rather than keeping only a single occurrence of each unique concept.
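A sinusoidal encoding of a scalar such as age can be sketched as below. This follows the standard sine/cosine formulation; how the code is combined with the "Age at baseline" text embedding is not specified above, so the element-wise addition in `encode_age` is an assumption, as are the function names and the 8-dimensional toy vector.

```python
import math


def sinusoidal_encoding(position, dim, base=10000.0):
    """Sinusoidal encoding of a scalar position (here: age in years).

    Consecutive pairs hold (sin, cos) at geometrically spaced frequencies.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc


def encode_age(age_embedding, age_years):
    """Add the sinusoidal code to the age text embedding (assumed scheme)."""
    code = sinusoidal_encoding(age_years, len(age_embedding))
    return [e + c for e, c in zip(age_embedding, code)]


# Hypothetical 8-dimensional stand-in for the text embedding.
age_token = encode_age([0.0] * 8, age_years=57.0)
```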

Our primary model is based on a standard transformer encoder architecture⁴⁴ without causal masking. For a given individual $i$, the transformer linearly maps all input embeddings $[e_{1}, \dots, e_{n_{i}}]$ to 256-dimensional tokens and adds a learnable class token, creating the input to the first transformer layer, $h_{i}^{0} = [e_{1}^{0}, \dots, e_{n_{i} + 1}^{0}]$.

Each transformer layer $l$ transforms its input $h_{i}^{l - 1}$ into an output $h_{i}^{l} = [e_{1}^{l}, \dots, e_{n_{i} + 1}^{l}]$ with identical dimensions.

Each transformer layer applies two main components: (1) a multi-head attention (MHA) mechanism with scaled dot-product attention, and (2) a fully-connected MLP. Both components are integrated with layer-normalization⁴⁵ and a residual connection. The multi-head attention layer combines several smaller attention blocks (“heads”) for computational efficiency. Intuitively, these attention blocks dynamically relate each input token $e_{j}^{l - 1}$ to all other input tokens in the same layer, $[e_{1}^{l - 1}, \dots, e_{n_{i} + 1}^{l - 1}]$, to determine which tokens (i.e., concepts) to focus on. The MLP block then non-linearly transforms the output of the MHA to enable the model to learn more complex features. In our case, the MLP consists of two linear layers with intermediate dimension 1024 and GELU nonlinearity. After processing through all $L$ layers, we discard the token-specific embedding outputs $[e_{1}^{L}, \dots, e_{n_{i}}^{L}]$ and use only the class token embedding for the final prediction. A linear layer on top of the class token generates risk scores for each endpoint.

We used L = 4 layers and multi-head attention with 8 attention heads.
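The core operation inside each attention head can be illustrated with a small, dependency-free sketch of scaled dot-product attention without a causal mask. It is a simplification under stated assumptions: queries, keys, and values are passed in directly rather than produced by learned projections, and `HEAD_DIM` assumes the 256-dimensional tokens are split evenly across the 8 heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention; every token attends to every token."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output coordinate at a time.
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, values))
             for c in range(len(values[0]))]
        )
    return outputs, weights


# Per-head width under the even-split assumption: 256 / 8 = 32.
HEAD_DIM = 256 // 8

# Three toy 2-dimensional tokens used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
```

Because no mask is applied, the attention weights for each token form a full probability distribution over all tokens of the individual, which is what lets the class token aggregate information from the entire medical history.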

In contrast to more traditional machine learning models, a transformer network can process inputs of varying length $n_{i}$. During training, however, we randomly sample n = 64 concepts per individual to enable faster training with homogeneously sized batches. If fewer than 64 concepts are available for an individual, we pad the available concepts with 0-valued empty concepts. During inference, we always include all concepts for an individual without padding.
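The training-time sampling and padding can be sketched as follows. The helper name `sample_or_pad` is hypothetical, and sampling without replacement is an assumption, since the text does not state how the 64 concepts are drawn.

```python
import random


def sample_or_pad(embeddings, dim, n=64, rng=None):
    """Return exactly n vectors for one individual.

    Long histories are randomly subsampled (without replacement);
    short histories are kept in full and padded with zero vectors.
    """
    rng = rng or random.Random()
    if len(embeddings) >= n:
        return rng.sample(embeddings, n)
    padding = [[0.0] * dim for _ in range(n - len(embeddings))]
    return list(embeddings) + padding


rng = random.Random(0)
# Toy 2-dimensional embeddings for a short and a long medical history.
short_history = [[1.0, 2.0] for _ in range(10)]
long_history = [[float(i), 0.0] for i in range(200)]
batch = [sample_or_pad(h, dim=2, rng=rng) for h in (short_history, long_history)]
```

Both entries of `batch` have the same length, so they can be stacked into one fixed-size tensor regardless of how many concepts each individual originally had.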

We use mini-batch stochastic gradient descent with the AdamW optimizer⁴⁶ to train each model for all 22 endpoints jointly with a batch-wise Cox log-partial likelihood from the Cox Proportional Hazards model^47,48. We average the loss over all endpoints. If the loss for a single endpoint is not well-defined, we drop that endpoint’s loss for that batch, implicitly oversampling uncensored individuals. After hyperparameter selection in the UK Biobank cross-validation setup, we train all models with a batch-size of 512 for 8 epochs using a base learning rate of 0.001 with linear learning rate warmup for 2 epochs and cosine decay afterwards.
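A batch-wise negative Cox log-partial likelihood, including the rule of dropping endpoints whose loss is undefined, can be sketched as below. This is an illustration under assumptions: ties are handled in the Breslow style, the per-endpoint loss is normalized by the number of events, and an endpoint is treated as undefined when the batch contains no events. Function names are hypothetical.

```python
import math


def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative Cox log-partial likelihood for one endpoint in one batch.

    Returns None when the batch contains no events (loss undefined).
    """
    n_events = sum(event)
    if n_events == 0:
        return None
    total = 0.0
    for i in range(len(risk)):
        if not event[i]:
            continue
        # Risk set: everyone still under observation at time[i].
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        m = max(at_risk)
        log_sum = m + math.log(sum(math.exp(r - m) for r in at_risk))
        total += risk[i] - log_sum
    return -total / n_events


def joint_loss(per_endpoint_batches):
    """Average the loss over endpoints, skipping undefined ones."""
    losses = [cox_neg_log_partial_likelihood(*b) for b in per_endpoint_batches]
    defined = [loss for loss in losses if loss is not None]
    return sum(defined) / len(defined) if defined else None


# Toy batch: (risk scores, follow-up times, event indicators).
concordant = ([2.0, 0.0], [1.0, 2.0], [1, 1])
discordant = ([0.0, 2.0], [1.0, 2.0], [1, 1])
no_events = ([0.3, 0.1], [1.0, 2.0], [0, 0])

good_loss = cox_neg_log_partial_likelihood(*concordant)
bad_loss = cox_neg_log_partial_likelihood(*discordant)
mean_loss = joint_loss([concordant, no_events])
```

Assigning the higher risk score to the individual with the earlier event yields the lower loss, and the endpoint without events contributes nothing to the average.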

Step 3: Model application. The main model is trained on all endpoints and all individuals jointly on UK Biobank data. For each endpoint, we fit an additional (linear) Cox-PH model using the lifelines Python library v0.27.8. This model takes as input age, sex, and the EHR-based risk score and performs time-to-event prediction with (1) proper exclusion of events of the same endpoint before the baseline (+2 year washout period), and (2) right censoring based on the death or end-of-follow-up date of each individual. Based on time-to-event predictions, the C-index metric is subsequently computed for each model setup.
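For readers unfamiliar with the metric, a plain implementation of Harrell's C-index under right censoring is sketched below. It is a quadratic-time illustration rather than the routine used in the study; the convention of skipping pairs with tied times and counting tied risk scores as half-concordant is an assumption.

```python
def concordance_index(time, event, risk):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable if individual i has an observed event
    strictly before the follow-up time of j. It is concordant if the
    model assigns i the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third individual is censored at time 3.
time = [1.0, 2.0, 3.0, 4.0]
event = [1, 1, 0, 1]
c_perfect = concordance_index(time, event, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(time, event, [1.0, 2.0, 3.0, 4.0])
c_constant = concordance_index(time, event, [0.0, 0.0, 0.0, 0.0])
```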

Baseline methods

We compare GRASP against three baselines: (1) we compare against a model with identical architectural setup but restricted to the age- and sex-embedding vectors only, effectively yielding a baseline model based solely on demographic information; (2) we compare against a model with identical architectural setup but with (fixed) randomized embeddings for the observed diagnosis, procedures, and medication history, to disentangle how much of GRASP’s performance is due to the use of embeddings versus architectural design; (3) we compare against gradient boosting trees (XGBoost) to compare against a powerful state-of-the-art prediction model.

Random Embeddings. We use an identical setup and architecture in this setting. The only difference is that embeddings are drawn at random from a Gaussian distribution with Euclidean normalization. While GRASP can adapt to unseen concepts in the target dataset with zero-shot adaptation, this is not possible for random embeddings, and we only use concepts that have been seen during training. We performed an identical hyperparameter search as for GRASP but found that both models are stable with respect to architectural details such as number of layers, embedding dimensions, and optimization parameters, and used identical settings.

Gradient Boosting (XGBoost). As a more powerful baseline model, we used a gradient boosting algorithm²⁶. We encode an individual’s medical history as a 0–1-coded (or count-coded) p-dimensional vector, where each dimension denotes the occurrence or absence of the associated concept. Binary encoding was determined by the presence of a concept during the observational window. If a concept occurred at least once, the corresponding feature was assigned a value of 1; otherwise, it was set to 0. In contrast to GRASP, this model can only handle fixed-size inputs and can only utilize concepts seen during training. To enable better generalizability of the model, we used OMOP’s ontology to also activate ancestor features of each concept. XGBoost implements survival prediction with an accelerated failure time model instead of the Cox proportional hazards model. We performed hyperparameter search over the minimum number of occurrences required to include a concept (set to include all), whether to include concept occurrence counts or binarized indicators (set to binarized), learning rate (set to 0.1), max depth per tree (set to 2), number of boosting rounds (set to 500), scale of the AFT loss distribution (set to 3.0), and the number of ancestors to turn on in the ontology (set to 5) in the UK Biobank cross-validation.
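The binary encoding with ancestor activation can be illustrated as follows. The sketch simplifies the OMOP ontology to a single-parent hierarchy, interprets "number of ancestors" as the number of levels walked upward, and uses a hypothetical toy vocabulary and hierarchy; all of these are assumptions for the example.

```python
def encode_history(concepts, vocabulary, parent_of, max_ancestors=5):
    """Binary feature vector over a fixed training vocabulary.

    Each observed concept switches on its own feature and the features
    of up to `max_ancestors` ancestors. Concepts and ancestors that are
    not part of the training vocabulary are ignored.
    """
    index = {c: k for k, c in enumerate(vocabulary)}
    features = [0] * len(vocabulary)
    for concept in set(concepts):
        current = concept
        # The concept itself plus at most `max_ancestors` levels above it.
        for _ in range(max_ancestors + 1):
            if current is None:
                break
            if current in index:
                features[index[current]] = 1
            current = parent_of.get(current)
    return features


# Hypothetical training vocabulary and simplified hierarchy.
vocabulary = [
    "Essential hypertension",
    "Diabetes mellitus",
    "Type 2 diabetes mellitus",
    "Cough",
]
parent_of = {
    "Type 2 diabetes mellitus": "Diabetes mellitus",
    "Type 1 diabetes mellitus": "Diabetes mellitus",
}

# "Type 1 diabetes mellitus" was never seen in training, but its
# ancestor "Diabetes mellitus" was, so that feature is switched on.
features = encode_history(
    ["Type 1 diabetes mellitus", "Cough", "Cough"], vocabulary, parent_of
)
```

The example shows why ancestor activation helps transfer: a concept absent from the training vocabulary still contributes signal through a shared parent.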

Since we already trained boosting models per endpoint instead of jointly for all endpoints, in UK Biobank evaluations we used the predictions directly without the three-variable Cox-PH model on top. For adjusting predictions in the external datasets, we used the same setup as for GRASP: we first applied the boosting model to obtain a risk score, which we then used together with age and sex as input to a Cox-PH model trained on 20% of the full dataset and evaluated on the remaining data.

OMOP-to-ICD experiment

To assess the models’ transferability across coding schemes, we used the same model trained on UK Biobank data from all used OMOP tables (diagnosis, procedures, drugs, and age and sex), but evaluated it in two settings: (i) using only the diagnosis table (along with age and sex) in the OMOP-formatted Mount Sinai dataset, and (ii) using diagnosis data coded in the original ICD-10-CM vocabulary (also with age and sex) from Mount Sinai.

Instead of running the full ensemble of all four models, we only use a single model trained on UK Biobank data. For the OMOP condition evaluation, we predict EHR-based risk scores identically to the previous setting but discarding procedure and drug exposure data for better comparability.

We created embeddings for ICD-10-CM data analogously to OMOP-coded data, by mapping each possible concept name directly with the same LLM to a semantic embedding. As data on diagnosis in the Mount Sinai data warehouse were originally recorded in ICD-10-CM format, we could use the original diagnostic codes instead of mapping OMOP concepts to ICD codes via the ontology.

The remaining setup was identical to the main experiments.

Small-n experiment

To simulate smaller sample sizes, we subset the UK Biobank training data to 10,000, 25,000, 50,000, 100,000, and 200,000 individuals, respectively. We use the same train-test cross-validation splits from the main experimental setup. However, we keep the validation sets identical across all settings (10% of the original training split) and only subset the remaining training data to the respective dataset size to ensure comparability across splits. The test splits remain the same across all settings.

As in the main experiments, we train individual models per cross-validation split and apply all four models in the external datasets to obtain an averaged risk score per individual.

Ontology-enriched embeddings experiment

We designed more detailed texts for each possible concept to investigate if additional context or information beyond the concept name would improve embedding performance and therefore model performance. We used the OMOP ontology to create this enriched concept text with synonyms and related concepts. For a given concept, we collected all available synonyms and all pairwise relationships with the concept as the first of the two. We created a new string from this information in the form “{concept_name}. Synonym: {synonym1}. Synonym: {synonym2}. […] {relationship_name}: {other_concept} […]”. For drugs we limited the maximum number of synonyms to 10. To keep string lengths manageable, we also excluded approximately 100 relationship types, mostly consisting of mappings between different drug coding systems. The resulting string, for example for concept 43530605, is “Pulmonary embolism with pulmonary infarction. Synonym: Pulmonary embolism with pulmonary infarction (disorder). Synonym: Pulmonary embolism with infarction. Is a: Pulmonary infarction. Is a: Injury of artery. Has finding site: Structure of artery of pulmonary circulation. Is a: Pulmonary embolism. Has associated morphology: Embolus. Has finding site: Lung structure. Has associated morphology: Infarct”.
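The string template described above can be reproduced with a few lines of code. The helper name, its parameters, and the relationship name "Hypothetical drug mapping" used to demonstrate exclusion are illustrative assumptions; the synonyms and relationships for concept 43530605 are taken from the example string in the text.

```python
def build_enriched_text(concept_name, synonyms, relationships,
                        max_synonyms=None, excluded_relationships=frozenset()):
    """Join a concept name, its synonyms, and its relationships.

    `relationships` is a list of (relationship_name, other_concept)
    pairs in which the concept is the first element of the relationship.
    """
    if max_synonyms is not None:
        synonyms = synonyms[:max_synonyms]
    parts = [concept_name]
    parts += [f"Synonym: {s}" for s in synonyms]
    parts += [
        f"{rel}: {other}"
        for rel, other in relationships
        if rel not in excluded_relationships
    ]
    return ". ".join(parts)


text = build_enriched_text(
    "Pulmonary embolism with pulmonary infarction",
    synonyms=[
        "Pulmonary embolism with pulmonary infarction (disorder)",
        "Pulmonary embolism with infarction",
    ],
    relationships=[
        ("Is a", "Pulmonary infarction"),
        ("Is a", "Injury of artery"),
        ("Has finding site", "Structure of artery of pulmonary circulation"),
        ("Is a", "Pulmonary embolism"),
        ("Has associated morphology", "Embolus"),
        ("Has finding site", "Lung structure"),
        ("Has associated morphology", "Infarct"),
        # Made-up relationship type, dropped by the exclusion list below.
        ("Hypothetical drug mapping", "Some other code"),
    ],
    excluded_relationships={"Hypothetical drug mapping"},
)
```

For drug concepts, passing `max_synonyms=10` would apply the synonym cap mentioned above.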

We replaced the original embeddings with these enriched embeddings but kept all other settings identical to the main setup.

Other embeddings experiment

We used an open-source clinical LLM (GatorTron) and a biomedical BERT model (SapBERT) to generate alternative medical code embedding lookup tables that could be swapped into the GRASP architecture.

Fine-tuning experiment

For the fine-tuning experiment on Mount Sinai data, we used the same train and test splits as for the main evaluation where we only tuned the Cox-PH model on top of the frozen model trained on UK Biobank.

Instead of evaluating all four ensemble models as in the main setting, we only trained and evaluated a single model per setting. For hyperparameter selection we performed a 75%/25% train-validation split on the 20% of the data separated for training. We trained three separate models: (1) retraining GRASP from scratch; (2) fine-tuning GRASP with weights initialized from training in UK Biobank; and (3) training an XGBoost model from scratch.

For both the fine-tuning and the retraining from scratch we used identical optimization settings, training for 8 epochs with a batch size of 512, and randomly sampling 64 tokens per individual. For fine-tuning we did not freeze any layers as preliminary results did not indicate any improved performance.
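The learning rate schedule described for the main training setup (base learning rate 0.001, linear warmup for 2 epochs, cosine decay afterwards, 8 epochs in total) can be sketched as a function of the optimizer step. Whether the same schedule was reused for fine-tuning, whether warmup is applied per step or per epoch, and whether the cosine decays to exactly zero are not stated, so these are assumptions of the sketch; the value of 100 steps per epoch is arbitrary.

```python
import math


def learning_rate(step, steps_per_epoch, base_lr=1e-3,
                  warmup_epochs=2, total_epochs=8):
    """Linear warmup to `base_lr`, then cosine decay towards zero."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at every step of an 8-epoch run with 100 steps per epoch.
schedule = [learning_rate(s, steps_per_epoch=100) for s in range(800)]
```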

Due to the different set of available concepts between UK Biobank and Mount Sinai, there is no straightforward way to fine-tune boosting models between datasets. Instead, we train the XGBoost model only from scratch. We used the same train-validation splits as for the GRASP models and performed a hyperparameter selection over the loss distribution scale, while keeping other training parameters identical to the training in UK Biobank.

We evaluated all models on the same large test set (80% of the Mount Sinai data) that was used in the main experiments.

Model calibration results and experiments

We evaluated the calibration of the GRASP model using OpenAI, GatorTron, or SapBERT embeddings for all 22 endpoints. Calibration was evaluated for models trained in UK Biobank and tested in FinnGen and Mount Sinai. To quantify calibration, we use the Integrated Calibration Index (ICI), a summary metric that quantifies how closely a model’s predicted probabilities align with observed outcomes by averaging the absolute differences across the full range of predicted risks. We also plot calibration curves for all endpoints, but here report only one example, for osteoporosis.
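For reference, the ICI is commonly written as the mean absolute gap between predicted risk and smoothed observed risk. The notation below is introduced for illustration only: $\hat{p}_{k}$ denotes the predicted risk of individual $k$ at a fixed time horizon, $\hat{p}^{c}_{k}$ the corresponding observed risk estimated from a smooth calibration curve, and $n$ the number of individuals. The particular smoother and time horizon are assumptions left open here.

$$\mathrm{ICI} = \frac{1}{n} \sum_{k = 1}^{n} \left| \hat{p}_{k} - \hat{p}^{c}_{k} \right|$$

An ICI of 0 corresponds to perfect agreement between predicted and observed risk; larger values indicate worse calibration.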

We compared the calibration obtained with our main approach, in which a Cox model adjusting for age, sex, and GRASP-predicted disease risk is fit in the test set, with a “raw” (not re-calibrated) approach in which GRASP directly outputs the disease risk.

Explainability

We require feature importances for each individual concept instead of per input dimension as is common in traditional machine learning settings or simpler fully connected neural networks. We derived concept-level feature importances for a single individual and endpoint by evaluating the model’s predicted risk score after occluding (i.e., setting to 0) the full embedding for each available input concept for that individual and subtracting it from the originally predicted risk score. We only used unique occurrences of concepts per individual. If an individual has k unique concepts, computation of this score would require k + 1 model evaluations. If a concept increases the individual’s risk for the endpoint, the feature importance score will be positive, otherwise negative or close to 0. To get overall feature importance scores for a dataset, we created these occlusion scores for all individuals in the dataset and averaged them over all individuals with this concept. Hence, these feature importance scores are dependent on the endpoint and on the dataset they were evaluated on; if a concept is not available in a dataset there is no associated feature importance score. Due to computational constraints, we only evaluated scores for a single model (instead of all four ensemble models). To better contrast the movement of concepts, in UK Biobank we evaluated the scores on the training set, in the external datasets on all individuals.
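The occlusion procedure can be sketched as follows. The `toy_model` is a hypothetical stand-in for the trained network (it simply sums the first coordinate of each embedding), and the two toy individuals are made up; only the occlusion and averaging logic mirrors the description above.

```python
def occlusion_importances(model, embeddings_by_concept):
    """Concept-level importances for one individual and one endpoint.

    Importance = original risk score minus the risk score obtained
    after zeroing the concept's full embedding (k + 1 model calls).
    """
    names = list(embeddings_by_concept)
    vectors = [embeddings_by_concept[name] for name in names]
    baseline = model(vectors)
    importances = {}
    for k, name in enumerate(names):
        occluded = [
            [0.0] * len(v) if j == k else v for j, v in enumerate(vectors)
        ]
        importances[name] = baseline - model(occluded)
    return importances


def average_importances(per_individual):
    """Average each concept's score over the individuals who have it."""
    totals, counts = {}, {}
    for importances in per_individual:
        for name, value in importances.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


def toy_model(vectors):
    """Hypothetical risk model: sum of the first embedding coordinate."""
    return sum(v[0] for v in vectors)


individual_a = {"Essential hypertension": [0.5, 0.1], "Cough": [-0.1, 0.3]}
individual_b = {"Essential hypertension": [0.3, 0.0]}

imp_a = occlusion_importances(toy_model, individual_a)
imp_b = occlusion_importances(toy_model, individual_b)
dataset_importance = average_importances([imp_a, imp_b])
```

In this toy setting, "Essential hypertension" receives a positive score because removing it lowers the predicted risk, while "Cough" receives a negative score.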

For the plot in Fig. 5, we filtered all available condition concepts for a minimum count of 5 occurrences in at least one of the three datasets, retained only the top 700 concepts according to their feature importance (using the maximum over all three datasets), and applied a UMAP dimensionality reduction.

Polygenic scores

Polygenic scores used were previously calculated by Mars et al.³⁴ in FinnGen and are available at PGS Catalog (https://www.pgscatalog.org/, publication ID PGP000364).

Acknowledgements

We want to acknowledge the participants and investigators of the FinnGen study. See Supplementary Table 12 for a list of all consortium members and their affiliations. The FinnGen project is funded by two grants from Business Finland (HUS 4685/31/2016 and UH 4386/31/2016) and the following industry partners: AbbVie Inc., AstraZeneca UK Ltd, Biogen MA Inc., Bristol Myers Squibb Inc. (and Celgene Corporation & Celgene International II Sàrl), Genentech Inc., Merck Sharp & Dohme LLC, Pfizer Inc., GlaxoSmithKline Intellectual Property Development Ltd., Sanofi US Services Inc., Maze Therapeutics Inc., Johnson&Johnson Innovative Medicine Inc., Novartis AG, Boehringer Ingelheim International GmbH and Bayer AG. The following biobanks are acknowledged for delivering biobank samples to FinnGen: Auria Biobank (www.auria.fi/biopankki), THL Biobank (www.thl.fi/biobank), Helsinki Biobank (www.helsinginbiopankki.fi), Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki/Pages/Biobank-Borealis-briefly-in-English.aspx), Finnish Clinical Biobank Tampere (www.tays.fi/en-US/Research_and_development/Finnish_Clinical_Biobank_Tampere), Biobank of Eastern Finland (www.ita-suomenbiopankki.fi/en), Central Finland Biobank (www.ksshp.fi/fi-FI/Potilaalle/Biopankki), Finnish Red Cross Blood Service Biobank (www.veripalvelu.fi/verenluovutus/biopankkitoiminta), Terveystalo Biobank (www.terveystalo.com/fi/Yritystietoa/Terveystalo-Biopankki/Biopankki/) and Arctic Biobank (https://www.oulu.fi/en/university/faculties-and-units/faculty-medicine/northern-finland-birth-cohorts-and-arctic-biobank). All Finnish Biobanks are members of BBMRI.fi infrastructure (https://www.bbmri-eric.eu/national-nodes/finland/). Finnish Biobank Cooperative, FINBB (https://finbb.fi/), is the coordinator of BBMRI-ERIC operations in Finland. The Finnish biobank data can be accessed through the Fingenious® services (https://site.fingenious.fi/en/) managed by FINBB.
This research was conducted using the UK Biobank Resource under Application Number 77717. It was supported by the Office of Research Infrastructure of the National Institutes of Health (award number S10OD026880), the AI-Ready Mount Sinai (AIR·MS) research platform (developed through collaboration between the Hasso Plattner Institute for Digital Health at Mount Sinai and Data4Life), and the computational and data resources provided by the Icahn School of Medicine at Mount Sinai, supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Additionally, the research was funded by the European Commission in the Horizon 2020 project INTERVENE (Grant agreement ID: 101016775). A.G. received funding from the European Research Council under the Horizon 2020 research and innovation programme (grant number 945733) and from the Academy of Finland fellowship grant no. 323116.

Author contributions

M.K. and M.F. wrote the manuscript with input and comments from A.G., C.L., and R.P.W. M.K. trained and tested all models used in this study, in all cohorts. M.F. generated all results figures and supplementary information, with help from V.L. for Fig. 1. A.G. and C.L. supervised the study.

Data availability

The code for the project is available at https://github.com/mkirchler/grasp. The individual-level data in these studies are protected for data privacy; access is regulated through the biobanks. The Finnish biobank data can be accessed through the Fingenious® services (https://site.fingenious.fi/en/) managed by FINBB. UK Biobank data are available through a procedure described at http://www.ukbiobank.ac.uk. Mount Sinai EHR data can be accessed via use agreement with researchers at Mount Sinai.

Competing interests

A.G. is the founder of Real World Genetics Oy. The other authors do not have a competing interest.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A list of authors and their affiliations appears at the end of the paper.

A full list of members and their affiliations appears in the Supplementary Information.

These authors contributed equally: Matthias Kirchler, Matteo Ferro.

Contributor Information

Andrea Ganna, Email: andrea.ganna@helsinki.fi.

FinnGen:

Andrea Ganna

Supplementary information

The online version contains supplementary material available at 10.1038/s41746-026-02363-5.

References

1.Wishart, G. C. et al. PREDICT: a new UK prognostic model that predicts survival following surgery for invasive breast cancer. Breast Cancer Res.12, R1 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
2.SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. Eur. Heart J. 42, 2439–2454 (2021). [DOI] [PMC free article] [PubMed]
3.Lee, A. et al. BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors. Genet. Med.21, 1708–1718 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Placido, D. et al. A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nat. Med.29, 1113–1122 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Forrest, I. S. et al. Machine learning-based marker for coronary artery disease: derivation and validation in two longitudinal cohorts. Lancet Lond. Engl. 401, 215–225 (2023).
6. Petrazzini, B. O. et al. Coronary risk estimation based on clinical data in electronic health records. J. Am. Coll. Cardiol. 79, 1155–1166 (2022).
7. Zhao, J. et al. Learning from longitudinal data in electronic health record and genetic data to improve cardiovascular event prediction. Sci. Rep. 9, 717 (2019).
8. Voss, E. A. et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J. Am. Med. Inform. Assoc. JAMIA 22, 553–564 (2015).
9. Choi, Y., Chiu, C. Y.-I. & Sontag, D. Learning low-dimensional representations of medical concepts. AMIA Jt Summits Transl. Sci. Proc. 2016, 41–50 (2016).
10. Nelson, C. A., Butte, A. J. & Baranzini, S. E. Integrating biomedical research and electronic health records to create knowledge-based biologically meaningful machine-readable embeddings. Nat. Commun. 10, 3045 (2019).
11. Finch, A. et al. Exploiting hierarchy in medical concept embedding. JAMIA Open 4, ooab022 (2021).
12. Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digit. Med. 3, 96 (2020).
13. Choi, E., Xiao, C., Stewart, W. F. & Sun, J. MiME: Multilevel medical embedding of electronic health records for predictive healthcare. Adv. Neur. Inform. Process. Syst. 31 (2018).
14. Vithanage, D., Yu, P., Wang, L. & Deng, C. Contextual word embedding for biomedical knowledge extraction: a rapid review and case study. J. Healthc. Inform. Res. 8, 158–179 (2024).
15. Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. Npj Digit. Med. 4, 1–13 (2021).
16. Li, Y. et al. BEHRT: transformer for electronic health records. Sci. Rep. 10, 7155 (2020).
17. Pang, C. et al. CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. Proc. Machine Learning for Health 158, 239–260 (2021).
18. Hegselmann, S. et al. Large Language Models are Powerful EHR Encoders. Preprint at 10.48550/arXiv.2502.17403 (2025).
19. Johnson, R. et al. Unified clinical vocabulary embeddings for advancing precision. Preprint at 10.1101/2024.12.03.24318322 (2024).
20. Abu-Salih, B. et al. Healthcare knowledge graph construction: a systematic review of the state-of-the-art, open issues, and opportunities. J. Big Data 10, 81 (2023).
21. Beaulieu-Jones, B. K. et al. Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians? Npj Digit. Med. 4, 1–6 (2021).
22. Hur, K. et al. Unifying heterogeneous electronic health records systems via text-based code embedding. ACM Conference on Health, Inference, and Learning (2021).
23. Hur, K. et al. GenHPF: general healthcare predictive framework for multi-task multi-source learning. In IEEE J. Biomed. Health Inform (IEEE, 2023).
24. Lee, S. A. et al. Clinical decision support using pseudo-notes from multiple streams of EHR data. npj Digit. Med. 8, 394 (2025).
25. Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing, 1998–2022 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).
26. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
27. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
28. Kurki, M. I. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023).
29. Shwartz-Ziv, R. & Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).
30. Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Proc. 36th International Conference on Neural Information Processing Systems (NIPS ’22) 37, 507–520 (Curran Associates Inc, Red Hook, NY, USA, 2022).
31. Yang, X. et al. A large language model for electronic health records. npj Digit. Med. 5, 194 (2022).
32. Liu, F. et al. Self-Alignment Pretraining for Biomedical Entity Representations. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4228–4238, Online (Association for Computational Linguistics, 2021).
33. Hunt, G. E., Malhi, G. S., Lai, H. M. X. & Cleary, M. Prevalence of comorbid substance use in major depressive disorder in community and clinical settings, 1990-2019: Systematic review and meta-analysis. J. Affect. Disord. 266, 288–304 (2020).
34. Mars, N. et al. Systematic comparison of family history and polygenic risk across 24 common diseases. Am. J. Hum. Genet. 109, 2152–2162 (2022).
35. Kather, J. N., Ferber, D., Wiest, I. C., Gilbert, S. & Truhn, D. Large language models could make natural language again the universal interface of healthcare. Nat. Med. 30, 2708–2710 (2024).
36. Dorfner, F. J. et al. Biomedical large languages models seem not to be superior to generalist models on unseen medical data. Preprint at 10.48550/arXiv.2408.13833 (2024).
37. Shmatko, A. et al. Learning the natural history of human disease with generative transformers. Nature 647, 248–256 (2025).
38. Oufattole, N. et al. MEDS-torch: an ML pipeline for inductive experiments for EHR medical foundation models. In NeurIPS Workshop on Time Series in the Age of Large Models (2024).
39. Ranjan, R., Gupta, S. & Singh, S. N. A comprehensive survey of bias in LLMs: current landscape and future directions. Preprint at 10.48550/arXiv.2409.16430 (2024).
40. Taubenfeld, A., Dover, Y., Reichart, R. & Goldstein, A. Systematic biases in LLM simulations of debates. In Proc. 2024 Conference on Empirical Methods in Natural Language Processing (eds Al-Onaizan, Y., Bansal, M. & Chen, Y.-N.) 251–267. 10.18653/v1/2024.emnlp-main.16 (Association for Computational Linguistics, 2024).
41. Collins, G. S. et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
42. Steinfeldt, J. et al. Medical history predicts phenome-wide disease onset and enables the rapid response to emerging health threats. Nat. Commun. 16, 585 (2025).
43. Neelakantan, A. et al. Text and code embeddings by contrastive pre-training. Preprint at 10.48550/arXiv.2201.10005 (2022).
44. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems (NIPS’17). 6000–6010 (Curran Associates Inc, Red Hook, NY, USA, 2017).
45. Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at 10.48550/arXiv.1607.06450 (2016).
46. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. International Conference on Learning Representations (2017).
47. Cox, D. R. Regression models and life-tables. J. R. Stat. Soc. Ser. B Methodol. 34, 187–220 (1972).
48. Katzman, J. L. et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 18, 24 (2018).
