Skip to main content
. 2025 Jun 6;8:340. doi: 10.1038/s41746-025-01703-1

Fig. 3. Text processing pipeline for generating document embeddings.

Fig. 3

Flowchart illustrating the five-step pipeline for constructing document embeddings. The process begins with selecting clinical notes based on a predefined character cutoff threshold, followed by text preprocessing. Next, term frequency-inverse document frequency (TF-IDF) weighting is applied to each participant’s text corpus (“patient-level corpora”). Preprocessed text is then mapped to word embeddings using BioWord2Vec, and a final document-level embedding is obtained by matrix-multiplying the TF-IDF matrix with the word embedding matrix.