
This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2025 May 11:2025.05.09.25327341. [Version 1] doi: 10.1101/2025.05.09.25327341

Longitudinal Masked Representation Learning for Pulmonary Nodule Diagnosis from Language Embedded EHRs

Thomas Z Li 1,2, John M Still 3, Lianrui Zuo 4, Yihao Liu 4, Aravind R Krishnan 4, Kim L Sandler 5, Fabien Maldonado 6, Thomas A Lasko 3, Bennett A Landman 1,3,4,5
PMCID: PMC12083608  PMID: 40385386

Abstract

Electronic health records (EHRs) are a rich source of clinical data, yet exploiting longitudinal signals for pulmonary nodule diagnosis remains challenging due to the administrative noise and high level of clinical abstraction present in these records. Because of this complexity, classification models are prone to overfitting when labeled data is scarce. This study explores masked representation learning (MRL) as a strategy to improve pulmonary nodule diagnosis by modeling longitudinal EHRs across multiple modalities: clinical conditions, procedures, and medications. We leverage a web-scale text embedding model to encode EHR event streams into semantically embedded sequences. We then pretrain a bidirectional transformer using MRL conditioned on time encodings on a large cohort of general pulmonary conditions from our home institution. Evaluation on a cohort of diagnosed pulmonary nodules demonstrates significant improvement in diagnosis accuracy with a model finetuned from MRL (0.781 AUC, 95% CI: [0.780, 0.782]) compared to a supervised model with the same architecture (0.768 AUC, 95% CI: [0.766, 0.770]) when integrating all three modalities. These findings suggest that language-embedded MRL can facilitate downstream clinical classification, offering potential advancements in the comprehensive analysis of longitudinal EHR modalities.

1. Introduction

Electronic health records (EHRs) capture most of the important touch points patients have with the health care system, which makes them the most comprehensive source of clinical data available. Modeling longitudinal signals from EHRs for lung cancer diagnosis is largely unexplored. Extracting useful signals from EHRs is challenging, however, because many are obscured by the multitude of administrative tasks that healthcare practitioners perform daily. For example, a single lab test can trigger a cascade of additional tests, seemingly unrelated diagnostic codes from differential diagnoses, and medication prescriptions from empirical treatments that may not target the true disease process. Common modeling approaches, such as counting events over a period or analyzing variables at a single time point, can destroy the longitudinal structure essential for modeling complex relationships.

Masked representation learning (MRL) has been empirically shown to be a leading method for learning longitudinal features. It involves masking a portion of the features and training a model to recover the masked portion. For data with strong temporal relationships (e.g., time series and other sequential data), masking in the time domain arises naturally as a way to have models predict the present from the past. BEHRT [1] and Med-BERT [2] applied masked token prediction to structured EHR sequences, improving disease prediction. For unstructured data, ClinicalBERT [3] and BioBERT [4] used MRL to learn representations from clinical text.

Despite these advances, longitudinal EHR modeling for pulmonary nodule diagnosis remains underexplored. Existing frameworks primarily focus on general-purpose pretraining with a single modality (e.g., ICD codes) within a short time frame. Our work addresses this gap by conducting masked representation learning over multi-year time scales across three modalities: conditions, procedures, and medications. We leverage an open-source language embedding to jointly represent these modalities and train a model to predict randomly masked events conditioned on their time encodings. For pretraining, we selected a large cohort encompassing patients with a broad set of pulmonary conditions. Evaluating on a separate cohort of patients with indeterminate pulmonary nodules, we find that MRL improves pulmonary nodule diagnosis compared to a supervised counterpart with the same model architecture. These preliminary results demonstrate the potential for language-embedded MRL to enable lifetime longitudinal analysis of EHRs towards a clinically meaningful diagnosis.

2. Methods

2.1. Data

All data was collected from the Research Derivative [5] of Vanderbilt University Medical Center under IRB #140274. We pulled variables from three modalities (diagnostic codes, procedural codes, and medication codes) as event streams from 2000 to 2024. Here we denote an event as a recorded variable of any modality with a corresponding timestamp in the EHR. International Classification of Diseases, 9th revision (ICD-9) and ICD-10 diagnostic codes were mapped to the Systematized Nomenclature of Medicine Clinical Terms (SNOMED) [6] vocabulary, representing diagnoses and conditions. Meanwhile, all Current Procedural Terminology (CPT) codes were retrieved as procedural codes, and medications were pulled from the database in the RxNorm vocabulary [7]. A unique identifier was assigned to each unique code such that two events with the same code would share this identifier. In addition, each unique code had a concept name following its respective vocabulary, typically resembling a taxonomy-standardized descriptive phrase. As all the vocabularies employed in this study have a hierarchical taxonomy, the concept names varied in specificity. For example, “Systolic heart failure” came from a lower-level SNOMED code that fell under the parent code “Heart failure”. Lastly, we pulled the birthdate, birth sex, and race of each patient. Following the IRB protocol, dates were shifted and patient identifiers were hashed for anonymization purposes.

We built a discovery set of 109,031 patients selected for having a broad range of pulmonary diseases according to the SNOMED vocabulary. This set was used for pretraining the masked representation model in a self-supervised manner. To evaluate the usefulness of the learned representations in predicting lung cancer, we identified an SPN set of 13069 patients with a solitary pulmonary nodule (SPN) code and no prior history of any cancer. Lung cancer cases were identified as patients with a billing code for any lung malignancy occurring 4–1095 days after the SPN date, and controls were those with no such code during the same period [8]. The SPN set was split into training (n=10530) and testing (n=2639) sets (Table 1). The discovery set and the SPN training set partially overlapped, but the discovery set and the SPN test set were made to be disjoint.
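The case and control definition above can be sketched as a small labeling function. This is only an illustration under stated assumptions: the function name, its inputs, and the inclusive handling of the window boundaries are hypothetical, and the exclusion of patients with a prior cancer history is assumed to happen upstream.

```python
from datetime import date

# Window taken from the text: a lung malignancy code 4-1095 days after the SPN date.
WINDOW_START_DAYS = 4
WINDOW_END_DAYS = 1095


def label_patient(spn_date, malignancy_dates):
    """Return 1 (case) if any lung malignancy code falls in the window, else 0 (control).

    `spn_date` and `malignancy_dates` are hypothetical inputs; treating both
    window boundaries as inclusive is an assumption of this sketch.
    """
    for code_date in malignancy_dates:
        delta_days = (code_date - spn_date).days
        if WINDOW_START_DAYS <= delta_days <= WINDOW_END_DAYS:
            return 1
    return 0
```

A malignancy code recorded two days after the SPN date, for instance, would fall outside the window and leave the patient labeled as a control in this sketch.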

Table 1.

Characteristics of Discovery and Solitary Pulmonary Nodule (SPN) Datasets.

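| Dataset | Discovery Set | SPN Training Set | SPN Test Set |
| --- | --- | --- | --- |
| Patients | 109,031 | 10,530 | 2,639 |
| Events per patient: Conditions | 160±197 | 127±215 | 122±192 |
| Events per patient: Procedures | 102±128 | 54.9±103 | 53±91 |
| Events per patient: Medications | 217±238 | 573±1323 | 560±1313 |
| Lung Cancer, n (%) | N/A | 614 (5.8%) | 153 (5.8%) |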

2.2. Natural Language Preprocessing

We removed patients who had either negligible or excess engagement with the health care system, defined as having fewer than 10 events or being in the 95th percentile of record length, respectively. We did not remove patients with excess engagement from the SPN sets because lung nodule management may itself present as a dense record. We designed a natural language template for each variable combining its modality (i.e. condition, procedure, or medication) with its concept ID and concept name in the following form:

[Modality] #[Concept ID]: [Concept name]

Although it has no semantic value, the concept ID was included to help distinguish between semantically similar but unique concepts. Treating the natural language form of each variable as a sentence, we computed a vector embedding for each variable from the state-of-the-art text embedding model (TEM) dunzhang/stella_en_400M_v5 (https://huggingface.co/dunzhang/stella_en_400M_v5) [9,10]. This specific TEM was chosen for its small memory footprint and high ranking on the Massive Text Embedding Benchmark [11]. First, a SentencePiece tokenizer [12] transformed each sentence in the dataset into a sequence of sub-words. Then, the TEM encoded the sequence and averaged the result to generate a single text embedding for the variable. In this manner we precomputed embeddings for all unique variables, enabling event embeddings to be retrieved efficiently via a cached mapping back to their corresponding variable embeddings.
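The template, averaging, and caching steps described above can be illustrated with a minimal sketch. The names `render_event`, `mean_pool`, `EmbeddingCache`, and `toy_token_encoder` are hypothetical, and the toy encoder is only a stand-in for the real TEM so that the example runs without downloading a model; the concept ID in the usage note is likewise made up.

```python
def render_event(modality, concept_id, concept_name):
    """Format one variable as '[Modality] #[Concept ID]: [Concept name]'."""
    return f"{modality} #{concept_id}: {concept_name}"


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def toy_token_encoder(sentence):
    """Hypothetical stand-in for the TEM: one 2-d vector per whitespace token."""
    return [[float(len(tok)), float(sum(map(ord, tok)) % 7)] for tok in sentence.split()]


class EmbeddingCache:
    """Compute each unique variable's embedding once, then reuse it for every event."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.cache = {}

    def get(self, modality, concept_id, concept_name):
        key = (modality, concept_id)
        if key not in self.cache:
            sentence = render_event(modality, concept_id, concept_name)
            self.cache[key] = mean_pool(self.encoder(sentence))
        return self.cache[key]
```

With this sketch, `render_event("Condition", 12345, "Heart failure")` yields `Condition #12345: Heart failure`, and repeated lookups of the same variable return the cached vector rather than re-encoding the sentence.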

2.3. Masked Representation Learning

Our self-supervised model first projected the preprocessed event stream (natural language transformed and tokenized) of each patient into the TEM’s pretrained embedding space in $\mathbb{R}^{1024}$. The input event stream was collated so that the output of the TEM was itself a sequence of tokens, each of which was an embedded event. From here on, “token” refers to the vector embedding of an event rather than the sub-word encodings from the SentencePiece tokenizer. This allowed us to add a time encoding [13] to each token according to the event’s timestamp. The resulting sequence formed the input to a bidirectional transformer that was optimized from random weights using BERT-style MRL (Figure 1) [14].
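To make the idea of adding a time encoding to each event token concrete, here is a minimal sketch. The source cites [13] for its time encoding but does not give the formula here, so the sinusoidal form, the `max_period` parameter, and the use of days as the time unit are all assumptions of this illustration.

```python
import math


def time_encoding(t_days, dim, max_period=10000.0):
    """Sinusoidal encoding of a continuous timestamp (assumed form, not confirmed by the source)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (max_period ** (i / dim))
        enc.append(math.sin(t_days * freq))
        if i + 1 < dim:
            enc.append(math.cos(t_days * freq))
    return enc


def add_time(event_embeddings, timestamps_days):
    """Add a time encoding to each event embedding, element-wise."""
    out = []
    for emb, t in zip(event_embeddings, timestamps_days):
        te = time_encoding(t, len(emb))
        out.append([e + p for e, p in zip(emb, te)])
    return out
```

Because the encoding is added rather than concatenated, each token keeps the dimensionality of the text embedding, so a 1024-dimensional event embedding stays 1024-dimensional.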

Figure 1.


Streams of events (defined as an episodic variable and its timestamp) were pulled from the VUMC EHRs and preprocessed into natural language templates. We leveraged a leading text embedding model to embed each event as feature vectors $x_0, \ldots, x_l$. Corresponding time encodings $t_0, \ldots, t_l$ were added to each token according to the event’s timestamp. From this sequence, we pretrained a transformer with masked representation learning and finetuned it on lung cancer labels.

During MRL, 15% of tokens are randomly and non-contiguously selected. Since the masking procedure is not applied at the label-prediction stage, we do not mask all the selected tokens, in order to prevent overfitting on the masked prediction task. Instead, only 80% of the selected tokens are replaced with a mask token, which is a vector embedding that is randomly initialized at the start of training and fixed during training. Another 10% of the selected tokens remain unchanged, and the remaining 10% are replaced with an “incorrect” variable selected at random from the corpus of possible variables. Importantly, time encodings were added to the masked tokens as well to provide the model with temporal context. Given the logits from the masked tokens, $x_1, \ldots, x_N$, and a set of classes defined as the corpus of $m$ unique variables $c_1, \ldots, c_m$, the loss for a sequence (i.e. a single patient) is computed with cross-entropy as follows:

$$L = -\frac{1}{N} \sum_{k=1}^{N} \left( x_{k,c_k} - \log \sum_{j=1}^{m} e^{x_{k,j}} \right)$$

where $x_{k,c_k}$ is the logit for the correct class of the $k$th token and $x_{k,j}$ is the logit for class $j$ of the $k$th token.
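The masking scheme and loss described in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: it operates on integer event IDs instead of vector embeddings, uses a hypothetical `mask_id` placeholder in place of the mask embedding, and all function names are assumptions. Positions are drawn independently, so they are not selected as contiguous spans, although neighboring positions can still be chosen by chance.

```python
import math
import random

MASK, RANDOM, KEEP = "mask", "random", "keep"


def choose_corruptions(seq_len, rng, select_frac=0.15):
    """Pick positions to predict and how to corrupt each (80% mask, 10% random, 10% keep)."""
    n_select = max(1, round(seq_len * select_frac))
    positions = sorted(rng.sample(range(seq_len), n_select))
    plan = {}
    for pos in positions:
        u = rng.random()
        if u < 0.8:
            plan[pos] = MASK
        elif u < 0.9:
            plan[pos] = RANDOM
        else:
            plan[pos] = KEEP
    return plan


def apply_corruptions(event_ids, plan, vocab_size, rng, mask_id=-1):
    """Return corrupted ids; targets are the original ids at the selected positions."""
    corrupted = list(event_ids)
    for pos, action in plan.items():
        if action == MASK:
            corrupted[pos] = mask_id
        elif action == RANDOM:
            corrupted[pos] = rng.randrange(vocab_size)
    targets = {pos: event_ids[pos] for pos in plan}
    return corrupted, targets


def masked_cross_entropy(logits, targets):
    """Mean over selected tokens of -(x[c] - log(sum_j exp(x[j])))."""
    total = 0.0
    for x, c in zip(logits, targets):
        m = max(x)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
        total += -(x[c] - log_sum_exp)
    return total / len(logits)
```

Note that the loss is computed over all selected positions, including those left unchanged or replaced with a random variable, which matches the cross-entropy above being averaged over the $N$ selected tokens.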

3. Experiments

We pretrained the masked representation model on the discovery set with the TEM module frozen. We evaluated the usefulness of this model (Table 2: MRL-finetune) through fine-tuned prediction of pulmonary nodule diagnosis within 3 years using the SPN training and test sets. During both the pretraining and fine-tuning stages we employed Adam optimization [15] and set the learning rate scheduler to cosine annealing with warm restart cycles [16]. We initialized finetuning with the pretrained MRL weights and added a “CLS” token [17], which fed into a multi-layer perceptron for classification. As a comparative baseline, we reused the model architecture from the fine-tuning stage but trained it from random weights (Table 2: Supervised). We repeated these experiments with a single modality (diagnostic codes) and with all three modalities.
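For readers unfamiliar with the scheduler mentioned above, the following sketch computes the learning rate under cosine annealing with warm restarts, following the schedule in [16]. The hyperparameters (`lr_max`, `lr_min`, `cycle_len`, `cycle_mult`) are hypothetical, since the source does not report the values used.

```python
import math


def sgdr_learning_rate(step, lr_max, lr_min=0.0, cycle_len=10, cycle_mult=1):
    """Learning rate at `step` under cosine annealing with warm restarts."""
    t_cur, t_i = step, cycle_len
    while t_cur >= t_i:  # walk forward to the cycle that contains `step`
        t_cur -= t_i
        t_i *= cycle_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

Within each cycle the rate decays from `lr_max` towards `lr_min` along a half cosine, then jumps back to `lr_max` at the restart; setting `cycle_mult` above 1 lengthens each successive cycle.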

Results are reported as mean area under the receiver operating characteristic curve (AUC) and 95% confidence intervals from 1000 bootstrap samples of the test set predictions, sampling with replacement. We used the two-sided Wilcoxon signed-rank test to determine whether the performance of each approach was significantly different from that of its supervised counterpart for the same set of modalities at p<0.05.
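The bootstrap evaluation described above can be sketched in plain Python. The function names are hypothetical, and two details are assumptions of this illustration rather than facts from the source: the interval is taken from the 2.5th and 97.5th percentiles of the bootstrap AUCs, and resamples containing only one class are redrawn.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_boot=1000, seed=0):
    """Mean AUC and a percentile 95% interval over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes in the resample
            aucs.append(auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int(0.025 * n_boot)]
    upper = aucs[int(0.975 * n_boot) - 1]
    return sum(aucs) / n_boot, lower, upper
```

The pairwise AUC here is quadratic in the number of patients, which is fine for illustration; a rank-based computation would be the usual choice at test-set scale.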

4. Results

When given only diagnostic codes, we did not observe a difference in pulmonary nodule diagnosis between the MRL-finetune and supervised models (0.773 [0.772, 0.774] vs. 0.772 [0.769, 0.775], resp.). In the tri-modality context, however, MRL-finetune (0.781 [0.780, 0.782]) significantly outperformed the supervised approach (0.768 [0.766, 0.770]) (Table 2).

Table 2.

Mean AUC [95% CI] in single modality and tri-modality learning

| Model | Diagnoses | Diagnoses, procedures, medications |
| --- | --- | --- |
| Supervised | 0.772 [0.769, 0.775] | 0.768 [0.766, 0.770] |
| MRL-finetune | 0.773 [0.772, 0.774] | 0.781 [0.780, 0.782] * |

* p<0.05 compared to values in the same column

5. Discussion

This study is a preliminary investigation of enhancing three-year pulmonary nodule diagnosis using MRL on longitudinal EHRs spanning multiple decades. Our contributions are (1) the integration of time encodings with language embeddings developed from web-scale efforts as a basis for MRL and (2) leveraging this approach to improve the diagnosis of incidentally-detected pulmonary nodules from a purely supervised approach.

The nuances of health care administration specific to each modality might explain why MRL only led to performance improvements when all three modalities were jointly learned. The billing of diagnostic codes is known to be heavily abstracted from medical reality for many reasons, including the fact that many initial diagnoses are incorrect but still recorded, the lack of specificity from assigning parent-level codes when more specific codes would significantly alter the clinical context, and the time of billing being chronologically separated from the patient’s management [18–22]. We suspect that procedural codes suffer from a much smaller degree of abstraction, as there is intuitively little ambiguity in the treatments that patients receive. However, the benefit of procedural codes is somewhat contradicted by the decreased performance of the supervised tri-modal model. We suspect that this model overfit on the expanded feature space introduced by the additional modalities [23,24]. As intended, our unsupervised MRL approach may have mitigated this overfitting, resulting in improved performance in the multi-modal experiment.

We also consider medications to be considerably abstracted in the EHR because elements such as dosage, units of measurement, route of administration, frequency, duration, and titration schedule are not seamlessly recorded. In this study, these elements were either included in the concept name (e.g., “acetaminophen 500 MG Oral Tablet”) or exceedingly difficult to capture without parsing clinical notes. In simplifying medication events to their concept names, we relied on masked representation learning to capture their temporal dynamics and semantic relationship with other modalities. In examining a t-SNE map of event streams from two patients embedded with the masked representation model, we observed clusters spanning multiple modalities, suggesting overarching clinical concepts (Figure 2). Whether MRL effectively learns concepts of medication prescription is an area of future investigation.

Figure 2.


A. & B. Embeddings of event streams from two patients, combining diagnoses, procedures, and medications, were visualized with t-SNE. Events from different modalities cluster together, forming overarching clinical concepts. C. The embeddings of 30 events preceding a pulmonary nodule code were concatenated for 2000 patients and visualized with t-SNE.

6. Acknowledgments

This research was funded by the NIH through F30CA275020, 2U01CA152662, and R01CA253923-02, as well as NSF CAREER 1452485 and NSF 2040462. This study was also funded by the Vanderbilt Institute for Surgery and Engineering through T32EB021937-07, the Vanderbilt Institute for Clinical and Translational Research through UL1TR002243-06, and the Pierre Massion Directorship in Pulmonary Medicine. We utilize generative AI to generate code segments based on task descriptions, as well as to assist with debugging, editing, and autocompleting code. Additionally, generative AI has been employed to refine sentence structure and ensure grammatical accuracy. However, all conceptualization, ideation, and prompts provided to the AI stem entirely from the authors’ creative and intellectual efforts. We take full responsibility for reviewing and verifying all AI-generated content in this work.

7. References

  • [1] Li Y., Rao S., Solares J.R.A., Hassaine A., Ramakrishnan R., Canoy D., Zhu Y., Rahimi K., Salimi-Khorshidi G., BEHRT: Transformer for Electronic Health Records, Scientific Reports 10 (2020) 1–12. 10.1038/s41598-020-62922-y.
  • [2] Rasmy L., Xiang Y., Xie Z., Tao C., Zhi D., Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, npj Digital Medicine 4 (2021) 1–13. 10.1038/s41746-021-00455-y.
  • [3] Huang K., Altosaar J., Ranganath R., ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission, (2019). https://arxiv.org/abs/1904.05342v3 (accessed March 4, 2025).
  • [4] Lee J., Yoon W., Kim S., Kim D., Kim S., So C.H., Kang J., BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. 10.1093/bioinformatics/btz682.
  • [5] Danciu I., Cowan J.D., Basford M., Wang X., Saip A., Osgood S., Shirey-Rice J., Kirby J., Harris P.A., Secondary Use of Clinical Data: the Vanderbilt Approach, J Biomed Inform 52 (2014) 28. 10.1016/J.JBI.2014.02.003.
  • [6] National Institutes of Health, SNOMED CT to ICD-10-CM map, U.S. National Library of Medicine. (n.d.). https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html.
  • [7] RxNorm, (n.d.). https://www.nlm.nih.gov/research/umls/rxnorm/index.html (accessed March 5, 2025).
  • [8] Li T.Z., Xu K., Chada N.C., Chen H., Knight M., Antic S., Sandler K.L., Maldonado F., Landman B.A., Lasko T.A., Curating Retrospective Multimodal and Longitudinal Data for Community Cohorts at Risk for Lung Cancer, MedRxiv (2023) 2023.11.03.23298020. 10.1101/2023.11.03.23298020.
  • [9] Zhang X., Zhang Y., Long D., Xie W., Dai Z., Tang J., Lin H., Yang B., Xie P., Huang F., Zhang M., Li W., Zhang M., mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval, (2024). https://arxiv.org/abs/2407.19669v2 (accessed March 5, 2025).
  • [10] Lee C., Roy R., Xu M., Raiman J., Shoeybi M., Catanzaro B., Ping W., NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, (n.d.). https://huggingface.co/nvidia/NV-Embed-v2 (accessed May 8, 2025).
  • [11] Muennighoff N., Tazi N., Magne L., Reimers N., MTEB: Massive Text Embedding Benchmark, EACL 2023 – 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (2022) 2006–2029. 10.18653/v1/2023.eacl-main.148.
  • [12] Kudo T., Richardson J., SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, EMNLP 2018 – Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Proceedings (2018) 66–71. 10.18653/v1/d18-2012.
  • [13] Li T.Z., Xu K., Gao R., Tang Y., Lasko T.A., Maldonado F., Sandler K.L., Landman B.A., Time-distance vision transformers in lung cancer diagnosis from longitudinal computed tomography, 12464 (2023) 229–238. 10.1117/12.2653911.
  • [14] Devlin J., Chang M.W., Lee K., Toutanova K., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL HLT 2019 – 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – Proceedings of the Conference 1 (2018) 4171–4186. https://arxiv.org/abs/1810.04805v2 (accessed November 20, 2023).
  • [15] Kingma D.P., Welling M., Auto-Encoding Variational Bayes, 2nd International Conference on Learning Representations, ICLR 2014 – Conference Track Proceedings (2013). https://arxiv.org/abs/1312.6114v10 (accessed November 30, 2021).
  • [16] Loshchilov I., Hutter F., SGDR: Stochastic Gradient Descent with Warm Restarts, 5th International Conference on Learning Representations, ICLR 2017 – Conference Track Proceedings (2016). 10.48550/arxiv.1608.03983.
  • [17] Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I., Attention Is All You Need, (n.d.).
  • [18] Cowie M.R., Blomster J.I., Curtis L.H., Duclaux S., Ford I., Fritz F., Goldman S., Janmohamed S., Kreuzer J., Leenay M., Michel A., Ong S., Pell J.P., Southworth M.R., Stough W.G., Thoenes M., Zannad F., Zalewski A., Electronic health records to facilitate clinical research, Clin Res Cardiol 106 (2017). 10.1007/S00392-016-1025-6.
  • [19] Rajkomar A., Oren E., Chen K., Dai A.M., Hajaj N., Hardt M., Liu P.J., Liu X., Marcus J., Sun M., Sundberg P., Yee H., Zhang K., Zhang Y., Flores G., Duggan G.E., Irvine J., Le Q., Litsch K., Mossin A., Tansuwan J., Wang D., Wexler J., Wilson J., Ludwig D., Volchenboum S.L., Chou K., Pearson M., Madabushi S., Shah N.H., Butte A.J., Howell M.D., Cui C., Corrado G.S., Dean J., Scalable and accurate deep learning with electronic health records, NPJ Digit Med 1 (2018) 1–10. 10.1038/s41746-018-0029-1.
  • [20] Reisman M., EHRs: The Challenge of Making Electronic Data Usable and Interoperable, Pharmacy and Therapeutics 42 (2017) 572. /pmc/articles/PMC5565131/ (accessed April 3, 2023).
  • [21] Wornow M., Xu Y., Thapa R., Patel B., Steinberg E., Fleming S., Pfeffer M.A., Fries J., Shah N.H., The shaky foundations of large language models and foundation models for electronic health records, npj Digital Medicine 6 (2023) 1–10. 10.1038/s41746-023-00879-8.
  • [22] Rudrapatna V.A., Glicksberg B.S., Avila P., Harding-Theobald E., Wang C., Butte A.J., Accuracy of medical billing data against the electronic health record in the measurement of colorectal cancer screening rates, BMJ Open Qual 9 (2020) e000856. 10.1136/BMJOQ-2019-000856.
  • [23] Wang W., Tran D., Feiszli M., What Makes Training Multi-Modal Classification Networks Hard?, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2020) 12692–12702. 10.1109/CVPR42600.2020.01271.
  • [24] Li T.Z., Xu K., Krishnan A., Gao R., Kammer M.N., Antic S., Xiao D., Knight M., Martinez Y., Paez R., Lentz R.J., Deppen S., Grogan E.L., Lasko T.A., Sandler K.L., Maldonado F., Landman B.A., Performance of Lung Cancer Prediction Models for Screening-detected, Incidental, and Biopsied Pulmonary Nodules, Radiol Artif Intell (2025). 10.1148/RYAI.230506.

Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
