Abstract
Objective:
Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (NLP). The complexity of EHR presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.
Methods:
Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer’s disease patients.
Results:
ARCH produces high-quality clinical embeddings and KG for over 60,000 codified and narrative EHR concepts. The KG and embeddings are visualized in the R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms’ performance. Notably, it successfully categorized Alzheimer’s patients into two subgroups with varying mortality rates.
Conclusion:
The proposed ARCH algorithm generates large-scale high-quality semantic representations and knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
Keywords: Electronic health records, natural language processing, representation learning, knowledge graph
Graphical Abstract

1. Introduction
The increasing adoption of electronic health record (EHR) systems has provided opportunities for clinical studies and biomedical research [1, 2, 3]. EHR data often cover hundreds of thousands of unique clinical features from both codified data and unstructured clinical narrative notes. To analyze these two types of data simultaneously, the main challenges lie in combining the codified and unstructured data efficiently, representing their covered clinical features meaningfully, and quantifying statistically the presence-absence as well as the strength of relationships between different features.
The integration of codified and unstructured data is driven by their complementary nature, collectively providing a more comprehensive view of a patient’s medical history. While codified data are readily usable due to standardized formats [4], extracting valuable insights from unstructured clinical notes requires natural language processing (NLP) techniques. These extracted NLP concepts, also referred to as clinical concepts of unique identifiers (CUIs) in the Unified Medical Language System (UMLS) [5], provide complementary information to the codified data.
Many studies have shown that incorporating this textual information into analyses can enhance model performance by significant margins [6, 7]. For example, NLP concepts are particularly valuable for capturing drug side effects, as a significant proportion of these effects, such as symptoms, cannot be adequately represented by diagnostic codes [8]. Furthermore, combining codified and unstructured data yields benefits for disease phenotyping. In the United States, a diagnosis code is required by the healthcare provider during the evaluation of a condition. Even if the patient is ultimately diagnosed with a different condition, the initial diagnosis code will remain in the patient’s record and may be misleading if viewed in isolation [9]. It has been shown that prediction models that combine unstructured clinical notes with codified data outperform models that utilize either unstructured or codified data alone [10, 11].
To generate prior knowledge on the relationship among the clinical codes and NLP concepts, a potential solution is to construct a large-scale clinical knowledge graph (KG) on these concepts [12, 13, 14]. Representing EHR concepts with low-dimensional semantic embedding provides a quantitative glimpse into the degree of inter-relatedness of medical entities. These high-quality embeddings can improve the efficiency of downstream applications in biomedical and healthcare research including information retrieval [15, 16, 17], cohort selection [18, 19], and risk prediction [20, 21, 22].
In recent years, word embedding techniques [23, 24, 25] in NLP have been successfully applied for representing clinical concepts in a low-dimensional space. However, many methods do not naturally generate a sparse KG that indicates whether a link exists between entities. In addition, joint representation of large-scale codified and NLP EHR concepts is currently lacking, as summarized in Table 1. Recently, Bai et al. [26] proposed to jointly learn vector representations of medical concepts and words using MIMIC-III data [27]. However, their work was limited in two ways. First, they did not represent words in the clinical notes as CUIs, thus limiting the reproducibility of these representations. Second, the MIMIC-III data only contains 58,597 in-patient visits, which limits model performance and prevents inference of broader information for outpatients. As a result, their embeddings cannot be used to generate high-quality knowledge graphs capturing general clinical information. To the best of our knowledge, there is no existing work that derives comprehensive embeddings for both codes and CUIs from a comprehensive EHR with both inpatient and outpatient data.
Table 1:
A summary of existing EHR-derived medical embeddings.
Generating a KG with a large number of entities, however, is challenging for several reasons. First, an efficient computational algorithm is needed to embed all concepts when both the number of concepts and the number of EHR records are large. Second, no existing KG embedding methods provide statistical certainty on whether a link exists between two entities. Most existing KG methods predict links in a supervised fashion by optimizing prediction tasks using the labeled links between entity pairs, which requires mapping EHR codes and narrative concepts to existing entity pairs, itself a challenging task. In addition, these methods necessitate the use of unlinked entity pairs, which are not readily available. Moreover, it is challenging to handle narrative notes, which are unstructured and not ready to use. With the Narrative Information Linear Extraction NLP software, as detailed in Appendix S.4, we are able to map clinical terms to CUIs in the UMLS. The codified and NLP data can then be organized as triplets (patient ID, date, concept) and are thus aligned.
In summary, there is a great unmet need for an approach that can integrate and summarize these high-dimensional and large-scale clinical data into a KG for studies. In this paper, we will address this need by proposing an Aggregated naRrative Codified Health (ARCH) records analysis which is an efficient statistical algorithm that can generate KG embeddings along with uncertainty measures on the links. With pairwise co-occurrence counts of all EHR concepts and a few simple summary statistics, the ARCH algorithm generates low-dimensional embeddings for each concept and performs large-scale hypothesis testing based on the cosine similarity between these embedding vectors. The connectivity of entity pairs is assessed jointly by controlling for a target false discovery rate (FDR). We validate the clinical utility of the ARCH KG, generated from EHR data from the Veterans Affairs (VA), along with semantic embeddings through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer’s disease (AD) patients.
| Summary | Description |
|---|---|
| Problem or Issue | Managing massive and complex electronic health record (EHR) data presents challenges in feature representation and uncertainty quantification for both codified data and narrative notes. |
| What is Already Known | Existing word embedding techniques in natural language processing (NLP) or based on co-occurrence matrix lack inherent support for sparse knowledge graph (KG) creation and joint representation of large-scale codified and NLP EHR concepts. |
| What this Paper Adds | This study introduces a method for generating low-dimensional KG embeddings for both codified and NLP EHR concepts, incorporating uncertainty measures on the links, and constructing a sparse, large-scale KG. |
2. Methods
2.1. Generative model for the knowledge graph
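Suppose there are a total of $d$ EHR codified and NLP concepts, indexed by $\mathcal{V} = \{1, \ldots, d\}$. The semantic meaning of each concept is represented by a $p$-dimensional embedding vector $\mathbf{V}_j \in \mathbb{R}^p$ for $j \in \mathcal{V}$. These embeddings are generated from a latent Gaussian graphical model [32]: each column of $\mathbb{V} = (\mathbf{V}_1, \ldots, \mathbf{V}_d)^\top \in \mathbb{R}^{d \times p}$ is independent and identically distributed from $N(\mathbf{0}, \Sigma)$, where the precision matrix $\Omega = \Sigma^{-1}$ embeds the conditional dependency network of the concepts, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with the vertex set $\mathcal{V}$ representing all EHR concepts and the edge set $\mathcal{E}$ characterizing the conditional dependency between the concepts. Our goal is to learn the KG $\mathcal{G}$, with $\mathcal{E}$ characterized by $\Omega$ in that $(j, k) \in \mathcal{E}$ if and only if $\Omega_{jk} \neq 0$ or, equivalently, $\mathbf{V}_j$ is conditionally dependent on $\mathbf{V}_k$ given all remaining embeddings. We aim to identify $\mathcal{E}$ through testing the set of hypotheses $\{H_{0,jk}\}$:

$$H_{0,jk}: \Omega_{jk} = 0 \quad \text{versus} \quad H_{a,jk}: \Omega_{jk} \neq 0, \qquad 1 \le j < k \le d. \tag{1}$$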
For each patient, we observed the recorded clinical concepts for each visit and the date of each visit. We assume that the observed clinical concepts in the EHR are generated from a latent Markov process driven by the embeddings sampled from the graphical model [33]. Specifically, let $w_t$ be the concept recorded at the visit date $t$; the occurrence probability of concept $j$, with embedding $\mathbf{V}_j$, is modeled by

$$P(w_t = j \mid \mathbf{c}_t) = \frac{\exp(\langle \mathbf{V}_j, \mathbf{c}_t \rangle)}{\sum_{k} \exp(\langle \mathbf{V}_k, \mathbf{c}_t \rangle)},$$

where the sum runs over all concepts, and the latent discourse vector $\mathbf{c}_t$ represents the embedding of the topic at time $t$ and is generated from an autoregressive model

$$\mathbf{c}_{t+1} = \sqrt{\alpha}\, \mathbf{c}_t + \sqrt{1 - \alpha}\, \mathbf{r}_{t+1},$$

where $\mathbf{r}_{t+1}$ is a random displacement vector and $\alpha$ is the weight parameter. Figure 1 illustrates the generation process. The vector $\mathbf{c}_t$ represents the latent topic at each time (e.g., phenotype, treatment, lab measurement, etc.) and changes gradually. The stochasticity of the topic vector explains the change of health status or disease stage of a patient, while the slow random walk (meaning that $\mathbf{c}_{t+1}$ is obtained from $\mathbf{c}_t$ with a small random displacement vector) ensures that nearby features are generated under similar discourses. Under this model, the embedding inner product $\mathbf{V}_j^\top \mathbf{V}_k$ can be approximated by the population positive point-wise mutual information (PPMI) between concepts $j$ and $k$ [34]:

$$\mathrm{PPMI}(j, k) = \max\{0, \mathrm{PMI}(j, k)\}, \qquad \mathrm{PMI}(j, k) = \log \frac{p_{jk}}{p_j\, p_k},$$

where $p_{jk}$ is the co-occurrence probability of the concept pair $(j, k)$ and $p_j$ is the occurrence probability of the concept $j$. Therefore, when the number of concepts has a larger order than the square root of the sample size used to estimate the PPMI, testing whether $\mathbf{V}_j^\top \mathbf{V}_k = 0$ can be achieved by testing whether $\mathrm{PPMI}(j, k) = 0$ based on the estimated PPMI.
Figure 1:
Data generation process of the EHR occurrence data, and data analytics pipeline. The embeddings of concepts are generated from a graphical model and the occurrence is then driven by a Markov process.
2.2. ARCH representation learning and graph recovery
For large-scale EHR datasets with a massive number of concepts and patient records, it is both statistically and computationally challenging to infer the network because the embeddings are latent and the number of hypotheses involved is large. Our ARCH representation learning approach carries out the inference in two steps by (i) first screening for candidate edges by identifying marginally dependent concept pairs with nonzero PPMI, and (ii) inferring the Gaussian graphical model structure of the embeddings via sparse regression. In the first step, we apply the SURE screening [35] by selecting pairs with $\mathrm{PPMI}(j, k) \neq 0$ after controlling for a desired FDR. In the second step, we further infer the edges of the network via node-wise regression [36]. As the embedding vectors follow the Gaussian graphical model, the conditional distribution of embeddings is

$$\mathbf{V}_j \mid \{\mathbf{V}_k\}_{k \in \mathcal{S}_j} \sim N\Big(\sum_{k \in \mathcal{S}_j} \beta_{jk} \mathbf{V}_k,\; \sigma_j^2\, \mathbf{I}_p\Big), \tag{2}$$

where $\mathcal{S}_j$ is the set of concepts related to concept $j$ obtained from the first prescreening step, and the regression coefficient $\beta_{jk}$ is nonzero only when concepts $j$ and $k$ are conditionally dependent.
2.2.1. Pre-screening by PMI testing
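To form a test statistic for the screening step and estimate the embedding matrix $\mathbb{V}$, we first calculated the empirical PPMI as $\widehat{\mathrm{PPMI}}(j, k) = \max\{0, \widehat{\mathrm{PMI}}(j, k)\}$, with $\widehat{\mathrm{PMI}}(j, k) = \log \frac{C_{jk}\, C_{\cdot\cdot}}{C_{j\cdot}\, C_{k\cdot}}$, where $C_{j\cdot}$ is the $j$th row sum of the co-occurrence matrix $\mathbb{C} = [C_{jk}]$, and $C_{\cdot\cdot}$ is the total sum of the co-occurrences. Details of the construction of $\mathbb{C}$ are given in Section 2.3. We next took a singular value decomposition (SVD) of the empirical PPMI matrix, $\widehat{\mathbb{PPMI}} = \mathbb{U} \Lambda \mathbb{U}^\top$, from which we estimate $\mathbb{V}$ and the population PPMI matrix of the concepts respectively as

$$\widehat{\mathbb{V}} = \mathbb{U}_p \Lambda_p^{1/2}, \qquad \widetilde{\mathbb{PPMI}} = \widehat{\mathbb{V}} \widehat{\mathbb{V}}^\top,$$

where $\mathbb{U}_p$ contains the first $p$ singular vectors of $\widehat{\mathbb{PPMI}}$ with positive eigenvalues and $\Lambda_p$ is the diagonal matrix of those eigenvalues. The dimension $p$ can be selected to optimize embedding quality similar to KESER [14] by maximizing the area under the Receiver Operating Characteristics curve (AUC) of distinguishing known relation pairs from random pairs, where known relation pairs are curated from online sources, as detailed in the validation studies in Section 2.4.1.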
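As a minimal sketch of the empirical PPMI computation described above, the following uses only the Python standard library and a made-up 3×3 co-occurrence matrix. The function name `empirical_ppmi` and the toy counts are illustrative assumptions, not part of ARCH, and the subsequent SVD step is not shown.

```python
import math

def empirical_ppmi(cooc):
    """Empirical PPMI from a symmetric co-occurrence count matrix (list of lists)."""
    d = len(cooc)
    row_sums = [sum(row) for row in cooc]   # row sums of the co-occurrence matrix
    total = sum(row_sums)                   # total sum of all co-occurrences
    ppmi = [[0.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(d):
            c = cooc[j][k]
            if c > 0:  # pairs that never co-occur keep a PPMI of 0
                pmi = math.log(c * total / (row_sums[j] * row_sums[k]))
                ppmi[j][k] = max(0.0, pmi)
    return ppmi

# Hypothetical counts for three concepts (illustrative only).
toy_cooc = [
    [0, 8, 1],
    [8, 0, 1],
    [1, 1, 4],
]
toy_ppmi = empirical_ppmi(toy_cooc)
```

In this toy matrix, concepts 0 and 1 co-occur far more often than expected by chance and receive a positive PPMI, while the pair (0, 2) has a negative PMI that is truncated to zero.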
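The estimator $\widetilde{\mathrm{PPMI}}(j, k) = \widehat{\mathbf{V}}_j^\top \widehat{\mathbf{V}}_k$ is close to the population PPMI with a high approximation rate and is asymptotically normal, which allows us to approximate $\mathbf{V}_j^\top \mathbf{V}_k$ with $\widehat{\mathbf{V}}_j^\top \widehat{\mathbf{V}}_k$. Furthermore, we may form the test statistic $T_{jk} = \widetilde{\mathrm{PPMI}}(j, k) / \widehat{\sigma}_{jk}$ to identify related concept pairs, since $T_{jk}$ approximately follows a standard normal distribution under the null hypothesis [34], where $\widehat{\sigma}_{jk}$ is an estimated standard error for $\widetilde{\mathrm{PPMI}}(j, k)$, as shown in Figure 1 and detailed in Appendix S.1. To control for multiple comparisons, we performed the Benjamini-Hochberg (BH) procedure under dependence and identified related concept pairs with $|T_{jk}|$ higher than a BH-controlled threshold, as detailed in Appendix S.2.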
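To illustrate the multiple-testing step, the sketch below converts approximately standard normal test statistics into two-sided p-values and applies the standard BH step-up rule. It is an illustration under stated assumptions: the dependence-adjusted variant used by ARCH is described in Appendix S.2 and is not reproduced here, and the function names and test statistics are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def benjamini_hochberg(p_values, fdr=0.05):
    """Indices rejected by the standard BH step-up rule at the target FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # keep the largest rank whose p-value falls under the BH line
        if p_values[idx] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical test statistics for six concept pairs (illustrative only).
z_stats = [4.5, 0.3, 3.2, -0.8, 2.9, 1.1]
p_vals = [two_sided_p(z) for z in z_stats]
selected = benjamini_hochberg(p_vals, fdr=0.05)
```

With these toy statistics, the pairs at indices 0, 2, and 4 are declared related at a target FDR of 0.05.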
2.2.2. Sparse embedding regression
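The FDR-controlled testing procedure based on the screening statistics serves as a prescreening of related concepts from the large number of concept pairs. To further screen for the most relevant concepts to form the edges of the KG, we performed a sparse regression of $\widehat{\mathbf{V}}_j$ against all embedding vectors identified as related to concept $j$ after initial screening, denoted by $\mathcal{S}_j$, to recover the regression coefficients $\boldsymbol{\beta}_j$ and hence the associated graph structure. Due to the potentially large number of elements identified in the pre-screening stage, we adopted an adaptive elastic-net penalized regression [37], which is a hybrid of Lasso and Ridge regression:

$$\widehat{\boldsymbol{\beta}}_j = \arg\min_{\boldsymbol{\beta}} \Big\{ \big\| \widehat{\mathbf{V}}_j - \widehat{\mathbb{V}}_{\mathcal{S}_j}^\top \boldsymbol{\beta} \big\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \Big\},$$

where $\widehat{\mathbb{V}}_{\mathcal{S}_j}$ is the submatrix of the estimated embedding matrix $\widehat{\mathbb{V}}$ whose rows are the embeddings of the concepts in $\mathcal{S}_j$. The tuning parameters $\lambda_1$ and $\lambda_2$ control the support of $\widehat{\boldsymbol{\beta}}_j$ and hence the network structure. We determined the optimal values of the hyperparameters $\lambda_1$ and $\lambda_2$ for each target concept by performing a grid search to balance the external and internal validation losses. Specifically, we computed the average of the internal Akaike information criterion (AIC) loss [38] and an external validation loss, which was obtained using an independent dataset, as detailed in Appendix S.3.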
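The node-wise sparse regression can be sketched with a small coordinate-descent solver. This is a plain (non-adaptive) elastic net with the objective rescaled to $(1/2n)\|y - Xb\|_2^2 + \lambda_1\|b\|_1 + (\lambda_2/2)\|b\|_2^2$, so its tuning parameters are not on the same scale as those above; the function names and toy data are assumptions for illustration only. Rows of `X` play the role of embedding coordinates and columns the candidate concepts retained after screening.

```python
def soft_threshold(x, t):
    """Soft-thresholding operator used by Lasso-type coordinate updates."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def elastic_net(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam1*||b||_1 + (lam2/2)*||b||^2."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # residual y - X @ beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            # correlation of column j with the partial residual (column j added back)
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam1) / (col_sq[j] + lam2)
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

# Toy example: the target embedding depends strongly on candidate 0,
# very weakly on candidate 1, and not at all on candidate 2.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [2.02, 1.98, 2.02, 1.98]
beta_hat = elastic_net(X, y, lam1=0.1, lam2=0.1)
```

Here the weak and absent candidates are shrunk exactly to zero, which is how the support of the fitted coefficients translates into edges of the sparse graph.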
2.3. EHR data sources and preprocessing
We trained a large-scale ARCH KG using EHR data from the VA Corporate Data Warehouse (CDW), integrating both codified and narrative data from 12.5 million patients with at least 1 visit between 2000–2019. All raw data are processed as detailed in Appendix S.4, and then used to create a co-occurrence matrix by counting the number of co-occurrences within a 30-day window across all patients. To reduce noise, we removed concepts that have fewer than 3000 occurrences and concept pairs that have fewer than 1000 co-occurrences. Furthermore, we removed all concepts that co-occur with more than 95% of other concepts as they tend to be overly non-specific. This results in a total of over 61,000 concepts, of which 51,423 are CUIs and 9,586 are codified concepts.
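A minimal sketch of windowed co-occurrence counting from (patient ID, date, concept) triplets is given below. The exact counting convention used for ARCH is specified in Appendix S.4; here we simply assume that two distinct concepts co-occur whenever they are recorded for the same patient at most 30 days apart. The function name and the example records are hypothetical.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

def cooccurrence_counts(records, window_days=30):
    """Count pairs of distinct concepts recorded for one patient within the window."""
    by_patient = defaultdict(list)
    for patient_id, day, concept in records:
        by_patient[patient_id].append((day, concept))
    counts = defaultdict(int)
    for events in by_patient.values():
        events.sort()  # chronological order, so d1 <= d2 below
        # quadratic in events per patient; fine for a sketch, not for 12.5M patients
        for (d1, c1), (d2, c2) in combinations(events, 2):
            if c1 != c2 and (d2 - d1).days <= window_days:
                counts[tuple(sorted((c1, c2)))] += 1
    return dict(counts)

# Hypothetical triplets (illustrative identifiers only).
records = [
    ("p1", date(2019, 1, 1), "PheCode:250.2"),
    ("p1", date(2019, 1, 15), "RXNORM:6809"),
    ("p1", date(2019, 6, 1), "C0011849"),
    ("p2", date(2018, 3, 1), "PheCode:250.2"),
    ("p2", date(2018, 3, 20), "RXNORM:6809"),
]
counts = cooccurrence_counts(records)
```

Only the code and drug recorded within 30 days of each other are counted, once for each of the two patients; the concept recorded months later contributes nothing.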
2.4. Validation analyses
The ARCH KG was validated in four downstream tasks: (1) detecting known similar or related clinical concepts; (2) detecting drug side effects; (3) disease phenotyping; and (4) profiling of patients with AD. For the detection of known relationships and drug side effects, we also compared to embedding vectors from GloVe [25], from node2vec [39], and from several pretrained language models (PLMs). GloVe is an unsupervised learning algorithm that leverages feature-feature co-occurrence statistics to obtain vector representations. We used the same co-occurrence matrix to train embeddings from both ARCH and GloVe. Node2vec, which is a representation learning algorithm designed for networks, optimizes the likelihood of preserving the neighborhoods in a network. To generate node2vec concept embeddings, we transformed the PPMI matrix into a binary network in which the presence of an edge between two concepts is determined by thresholding their PPMI. However, due to its substantial memory requirements, we limited node2vec’s application to codified data only, as running it on the entire dataset would require more than 700GB of memory. The PLM embeddings we compared to are based on Bidirectional Encoder Representations from Transformers (BERT) [40], including Self-aligning pretrained BERT (SAPBERT) [41], BERT for Biomedical Text Mining (BioBERT) [42], and BERT pretrained with PubMed (PubMedBERT) [43]. BERT’s model architecture is a multi-layer bidirectional Transformer encoder, while BioBERT, PubMedBERT and SAPBERT are pretrained on different sources based on BERT. BioBERT is pretrained on both general domain corpora and biomedical domain corpora (PubMed abstracts and PMC full-text articles), PubMedBERT is pretrained purely with in-domain text (PubMed text), and SAPBERT is pretrained on the biomedical KG of UMLS. The language model based embeddings were obtained only based on the description of the EHR concepts (e.g., preferred term for the CUI and code description).
To examine the performance of including both codified data and NLP data, we also perform the proposed ARCH algorithm on codified data alone and NLP data alone.
2.4.1. Detecting known relationship pairs
We curated different categories of known relation pairs from online knowledge sources including similar pairs and related pairs as detailed in Appendix S.6.1. For each type of relationship, we calculated the cosine similarities of the embedding vectors of related pairs and those of randomly selected pairs to calculate the AUC of the cosine similarities in distinguishing known pairs from random pairs. The dimension of the embedding is chosen by optimizing the AUC. We then performed the ARCH testing procedure to determine whether a pair of entities are related with FDR chosen at 1%, 5%, and 10%, and reported the power of the ARCH procedure. The power of the ARCH procedure refers to the proportion of true positive relationships correctly identified among all known relationships. Since no existing procedures can control FDR, we calculated the power of other algorithms in detecting known relationships by ranking entity pairs according to the cosine similarity generated from their corresponding embeddings and then selecting the top $K$ entity pairs as significant, where $K$ is the number of entity pairs selected by ARCH. Among those pairs, we calculated the proportion of those known to be related as their power. We also computed the Spearman’s correlation between the algorithm output and domain expert ratings on 507 UMLS concept pairs for semantic similarity and ratings on 515 UMLS concept pairs for semantic relatedness. These reference standards are obtained from [44, 45].
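The AUC evaluation described above can be sketched as follows: cosine similarities are computed for known and random pairs, and the AUC is the probability that a known pair scores higher than a random pair. The embeddings, concept labels, and pair lists are toy assumptions for illustration and do not come from the ARCH data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def auc(pos_scores, neg_scores):
    """Probability that a known pair outscores a random pair (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 2-dimensional embeddings for four concepts.
emb = {"A": [1.0, 0.1], "B": [0.9, 0.2], "C": [0.0, 1.0], "D": [0.1, 0.9]}
known_pairs = [("A", "B"), ("C", "D")]
random_pairs = [("A", "C"), ("B", "D")]

pos = [cosine(emb[a], emb[b]) for a, b in known_pairs]
neg = [cosine(emb[a], emb[b]) for a, b in random_pairs]
toy_auc = auc(pos, neg)
```

The pairwise comparison is quadratic in the number of pairs; for evaluation sets of realistic size a rank-based computation would be preferable.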
2.4.2. Detecting drug side effect
We first obtained labels from the Side Effect Resource (SIDER) database of drugs and adverse drug reactions [46] as detailed in Appendix S.6.2, resulting in 128,220 drug-AE (Adverse Events) pairs. The negative pairs were obtained from [47] and we kept the same number of negative pairs as the number of positive pairs. The AUC and power for detecting drug side effects based on ARCH embeddings or p-values were calculated similarly to those for the relation detection. We further evaluated the quality of ARCH embeddings based on the performance of a few-shot supervised model [48] for this task, as detailed in Appendix S.7. We used 1% of the positive and negative pairs to estimate model parameters, another 1% as validation data to select optimal tuning parameters, and the remaining 98% of pairs as a test data set for evaluation.
2.4.3. Disease phenotyping
A major bottleneck for conducting translational research studies with EHR is the lack of large-scale precise data on disease outcomes needed for predictive modeling. Most unsupervised algorithms require the specification of relevant features [49, 50, 51, 52, 53]. We next illustrate how the ARCH network can serve as an effective feature selection tool for EHR phenotyping and compare it to the existing KG-based feature selection tool, KESER [14]. KESER is a method that generates embeddings for codified data from EHR data and constructs a knowledge graph through sparse embedding regression. It can serve as both an embedding construction technique and a feature selection tool for codified concepts. We compared PheNorm [51] trained with ARCH selected features, PheNorm trained with KESER selected features, the MAP algorithm which only uses counts of the main PheCode and CUI, and healthcare utilization [52], as well as two benchmark methods that use the logarithm of the count of the main disease ICD code plus one (Main ICD Only) and the logarithm of the count of the disease CUI plus one (Main NLP Only) as the disease predictive scores, respectively. PheNorm is a phenotyping algorithm that computes the score of each patient for the target disease as an inner product of the patient’s normalized feature vector and a coefficient vector, where the feature vector aggregates the information in the candidate feature set with denoising self-regression via dropout training. We trained these phenotyping algorithms using EHR data from 53,549 Mass General Brigham (MGB) Biobank participants for 8 conditions: coronary artery disease (CAD), Crohn’s disease (CD), rheumatoid arthritis (RA), ulcerative colitis (UC), congestive heart failure (CHF), type 1 diabetes mellitus (T1DM), type 2 diabetes mellitus (T2DM) and depression.
To evaluate their accuracy, the CAD, CD, RA, UC, CHF, T1DM, T2DM and depression phenotyping algorithms were validated against 187, 138, 154, 127, 114, 540, 285 and 540 labeled observations, respectively, curated via manual chart review, and the AUCs were reported.
2.4.4. Profiling of AD patient via ARCH embeddings
Semantic representation of the EHR concepts can be linked with patient-level EHR data to represent patient clinical profile [54, 55, 56]. These patient embeddings can then be applied to perform downstream tasks such as identifying “patient like me” [57] and mortality prediction [58]. However, representing a patient’s clinical profile for a specific condition, such as AD, requires the knowledge of other EHR features relevant to AD progression as well as their relative importance [59]. Our ARCH KG serves this purpose in that it can generate embeddings to represent an AD patient. To demonstrate this, we used EHR data of 38,267 patients with AD diagnosis, collected from the University of Pittsburgh Medical Center (UPMC) over the period 2011–2021. We selected the AD-relevant features and generated patient embeddings using the term frequency-inverse document frequency (TF-IDF) procedure as detailed in Appendix S.8. As an illustration, we applied the K-means algorithm to cluster patients into two groups using the patient embeddings. We analyzed the mortality risk of the two groups using the Kaplan-Meier (KM) curve of the time from first AD diagnosis to death. We characterized the between-group differences in patient profiles with respect to the distributions of AD-related features selected via ARCH.
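One way to picture the TF-IDF patient embedding is as a weighted sum of concept embeddings, with weights given by a concept's count for the patient times the log inverse of the fraction of patients having that concept. The exact weighting used for ARCH is given in Appendix S.8; the formula, function name, concept labels, and data below are assumptions made only for this sketch.

```python
import math

def patient_embeddings(patient_counts, concept_emb):
    """TF-IDF weighted sum of concept embeddings for each patient."""
    n_patients = len(patient_counts)
    doc_freq = {}
    for counts in patient_counts.values():
        for concept in counts:
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
    dim = len(next(iter(concept_emb.values())))
    result = {}
    for pid, counts in patient_counts.items():
        vec = [0.0] * dim
        for concept, tf in counts.items():
            if concept not in concept_emb:
                continue  # skip features without an embedding
            weight = tf * math.log(n_patients / doc_freq[concept])
            for i, x in enumerate(concept_emb[concept]):
                vec[i] += weight * x
        result[pid] = vec
    return result

# Hypothetical 2-dimensional concept embeddings and per-patient feature counts.
concept_emb = {"memory_loss": [1.0, 0.0], "pneumonia": [0.0, 1.0], "AD_dx": [1.0, 1.0]}
patient_counts = {
    "p1": {"AD_dx": 3, "memory_loss": 2},
    "p2": {"AD_dx": 1, "pneumonia": 4},
}
pat_emb = patient_embeddings(patient_counts, concept_emb)
```

Because every patient in the toy cohort has the AD diagnosis feature, its inverse document frequency is zero and the patient vectors are driven by the distinguishing features, which is what lets a clustering step separate the two profiles.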
3. Results
By optimizing the AUC of distinguishing known relation pairs from random pairs as detailed in Section 2.4.1, we selected the dimension of the embeddings output from ARCH to optimize the embedding quality. Visualizations of the ARCH network can be found at https://celehs.hms.harvard.edu/ARCH/ and https://phenomics.va.ornl.gov/web/cipher/vistools, which enable users to visualize concepts relevant to a set of target concepts.
3.1. Detecting known relationship pairs
The AUCs and power in detecting known relationships are summarized in Table 2, with detailed accuracy for specific types of relationships provided in Table A4 in Appendix S.10. The embeddings trained by ARCH achieved an AUC of 0.826 for detecting CUI-CUI similar pairs and 0.844 for detecting CUI-CUI related pairs, while GloVe derived embeddings attained lower AUCs of 0.825 and 0.760. Pretrained language model derived embeddings, including PubMedBERT, BioBERT and SAPBERT, achieved much lower AUCs, ranging from 0.572 to 0.759. The ARCH screening procedure had a power of 0.981 for code-code similar pairs and 0.901 for code-code related pairs under the target FDR of 0.05. In comparison, the highest power among the five benchmarks (GloVe, node2vec, PubMedBERT, BioBERT, and SAPBERT) was 0.956 for code-code similar pairs and 0.909 for code-code related pairs.
Table 2:
AUCs and power at target FDR = 0.05 in detecting known similar (s) pairs and related (r) pairs with ARCH p-values (p) and cosine similarities (c), GloVe, node2vec (n2v), PubMedBERT (Pub), BioBERT (Bio), and SAPBERT (SAP). For ARCH, we trained using full (f), codified only (code), and CUI only (CUI) features. GloVe is trained with full features, node2vec is trained with codified-only features due to memory limitation (node2vec requires more than 700GB memory if trained with full features), and PubMedBERT, BioBERT, SAPBERT utilize the code description of full features. Shown also are the Spearman’s correlation (Corr) between the algorithm output and domain expert ratings for the human curated pairs with similarity and relatedness rankings.
| Metric | Pair | Type | ARCH p(f) | ARCH c(f) | ARCH p(code) | ARCH p(CUI) | ARCH c(code) | ARCH c(CUI) | GloVe (f) | n2v (code) | Pub (f) | Bio (f) | SAP (f) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC | code-code | s | 0.849 | 0.904 | 0.867 | - | 0.906 | - | 0.790 | 0.796 | 0.631 | 0.603 | 0.775 |
| AUC | code-code | r | 0.776 | 0.797 | 0.770 | - | 0.781 | - | 0.779 | 0.667 | 0.584 | 0.544 | 0.636 |
| AUC | CUI-code | s | 0.943 | 0.965 | - | - | - | - | 0.903 | - | 0.746 | 0.726 | 0.908 |
| AUC | CUI-CUI | s | 0.864 | 0.826 | - | 0.857 | - | 0.825 | 0.825 | - | 0.728 | 0.641 | 0.759 |
| AUC | CUI-CUI | r | 0.862 | 0.844 | - | 0.855 | - | 0.839 | 0.760 | - | 0.670 | 0.572 | 0.712 |
| Power | code-code | s | 0.981 | 0.980 | 0.962 | - | 0.962 | - | 0.955 | 0.956 | 0.764 | 0.681 | 0.907 |
| Power | code-code | r | 0.901 | 0.897 | 0.825 | - | 0.819 | - | 0.909 | 0.810 | 0.757 | 0.660 | 0.790 |
| Power | CUI-code | s | 0.925 | 0.924 | - | - | - | - | 0.863 | - | 0.579 | 0.521 | 0.786 |
| Power | CUI-CUI | s | 0.896 | 0.887 | - | 0.882 | - | 0.870 | 0.874 | - | 0.675 | 0.526 | 0.708 |
| Power | CUI-CUI | r | 0.876 | 0.865 | - | 0.855 | - | 0.841 | 0.828 | - | 0.635 | 0.517 | 0.547 |
| Corr | - | s | 0.497 | 0.598 | - | 0.506 | - | 0.595 | 0.310 | - | 0.120 | 0.085 | 0.234 |
| Corr | - | r | 0.478 | 0.573 | - | 0.494 | - | 0.572 | 0.333 | - | 0.085 | 0.035 | 0.217 |
When the proposed ARCH model is trained on codified data alone, the power for detecting known similar and related code-code pairs decreases. For example, the power of ARCH p-values in detecting code-code related pairs at FDR = 0.05 was 0.901 when using all concepts and 0.825 when using codified data only. Node2vec performed worse in detecting these code-code pairs, with AUCs of 0.796 and 0.667 and power of 0.956 and 0.810 for similar and related pairs, respectively. The Spearman’s rank correlation between the cosine similarities from the ARCH embedding vectors and the manual ratings is as high as 0.598 for similar pairs and 0.573 for related pairs. Other benchmarks, including GloVe and PLM embeddings, result in correlations lower than 0.35.
3.2. Identifying drug side effects
Table 3 shows the AUC-ROC and power of different methods in detecting drug side effects. The unsupervised ARCH embeddings and the screening test p-values, which are trained on both structured and unstructured data, achieved substantially higher AUCs of 0.723 and 0.747, compared to those trained on the structured data or unstructured data alone, and those from other benchmarks, which ranged from 0.584 to 0.708. With few-shot supervised training, the ARCH embeddings attained an AUC of 0.826, while the AUC was 0.818 for the fine-tuned GloVe embeddings, 0.65 for the fine-tuned node2vec embeddings, and below 0.73 for the fine-tuned PLM embeddings. Comparing the power in detecting drug side effects using either codified data or NLP data alone versus both codified and NLP data, we find that jointly analyzing codified and NLP data greatly improved the ability to capture side effects for most drug classes, as shown in Figure 2. Furthermore, we show that most of the side effects of Levothyroxine and Hydrocodone can be detected by ARCH, while a significant fraction of the side effects can only be captured with the help of NLP data. More examples of word-cloud figures are shown in the Appendix.
Table 3:
AUCs and the sensitivities of different benchmark methods compared with ARCH for identifying drug side effects. The first and the third blocks show the performance of each method without supervision (AUC(U) stands for AUC with unsupervised embeddings and Power(U) stands for power with unsupervised embeddings), while the second and the fourth blocks show the performance of the method with supervised learning (AUC(S) stands for AUC with supervised embeddings and Power(S) stands for power with supervised embeddings). ARCH has several variants: p(f) stands for the p-value output from ARCH trained with both codified data and NLP data; c(f) stands for the cosine similarity between the embedding vectors output from ARCH trained with both codified data and NLP data; p(code) stands for the p-value output from ARCH trained with codified data alone; c(CUI) stands for the cosine similarity between the embedding vectors output from ARCH trained with NLP data alone. n2v stands for node2vec, Pub stands for PubMedBERT, Bio stands for BioBERT and SAP stands for SAPBERT. GloVe is trained with full features, node2vec is trained with codified-only features due to memory limitation (node2vec requires more than 700GB memory if trained with full features), and PubMedBERT, BioBERT, SAPBERT utilize the code description of full features.
| Metric | FDR | ARCH p(f) | ARCH c(f) | ARCH p(code) | ARCH p(CUI) | ARCH c(code) | ARCH c(CUI) | GloVe (f) | n2v (code) | Pub (f) | Bio (f) | SAP (f) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC(U) | - | 0.760 | 0.759 | 0.531 | 0.588 | 0.538 | 0.586 | 0.708 | 0.526 | 0.637 | 0.620 | 0.597 |
| AUC(S) | - | - | 0.899 | - | - | 0.615 | 0.686 | 0.818 | 0.651 | 0.649 | 0.627 | 0.724 |
| Power(U) | 0.1 | 0.599 | 0.545 | 0.211 | 0.451 | 0.300 | 0.320 | 0.577 | 0.143 | 0.383 | 0.346 | 0.386 |
| Power(U) | 0.05 | 0.586 | 0.533 | 0.206 | 0.444 | 0.295 | 0.312 | 0.568 | 0.138 | 0.371 | 0.334 | 0.374 |
| Power(U) | 0.01 | 0.564 | 0.510 | 0.195 | 0.428 | 0.286 | 0.296 | 0.551 | 0.130 | 0.346 | 0.314 | 0.352 |
| Power(S) | 0.1 | - | 0.647 | - | - | 0.437 | 0.494 | 0.652 | 0.243 | 0.473 | 0.403 | 0.462 |
| Power(S) | 0.05 | - | 0.638 | - | - | 0.431 | 0.484 | 0.643 | 0.237 | 0.463 | 0.389 | 0.449 |
| Power(S) | 0.01 | - | 0.618 | - | - | 0.419 | 0.466 | 0.623 | 0.224 | 0.441 | 0.366 | 0.424 |
Figure 2:
(a) Sensitivity of detecting drug-side effect pairs with ARCH under target FDR 0.05 using only codified data, only NLP data, and both codified and NLP data. The word clouds of the side effects of two sample drugs: (b) Levothyroxine on the left and (c) Hydrocodone on the right. The surrounding words describe side effects. The words colored red are detected using codified data only, while the words colored orange or red are detected using both codified data and NLP codes. The words colored grey are undetected. The size of the words is determined by the cosine similarity with the target drug code.
3.3. Disease phenotyping
Figure 3 shows the AUCs of 8 phenotyping algorithms validated on labeled data from MGB. PheNorm with ARCH-selected features performs best among all methods. The AUCs of the PheNorm algorithms with features selected by ARCH exceeded 0.9 for all 8 diseases and on average were 0.028 (p-value ), 0.067 (p-value ), 0.081 (p-value ), and 0.076 (p-value ) higher than those of PheNorm with ARCH trained on codified features alone, PheNorm with KESER features, MAP, ICD only and NLP only. The gain in performance is particularly noteworthy for conditions that benefit from NLP features. For example, after applying ARCH in the feature selection step, the AUC of the PheNorm algorithm for depression increased from 0.857 with KESER to 0.927 (p-value ).
Figure 3:
The AUC of different phenotyping algorithms trained with different feature sets across 8 diseases.
3.4. Profiling of AD patient via ARCH embeddings
The AD cohort consists of about 64.7% female patients, 90.3% white and 7.6% black patients, with an average age of 82 years at the first ICD code for AD and an average lifespan of 86 years. K-means clustering of the ARCH-based patient embeddings resulted in two subgroups: a late stage group consisting of 12.3% of the patients and an early stage group formed by the remaining patients. As shown in Figure 4, the 5-year survival rate was 42.0% (95% CI: [38.6%, 45.7%]) and 80.9% (95% CI: [80.3%, 81.6%]) for the late and early stage groups, respectively.
Figure 4:
The KM survival curves for the early stage and late stage groups identified via K-means clustering of the ARCH patient-level embeddings.
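The two-step analysis above (clustering patient embeddings into two groups, then comparing survival) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `two_means` is a plain Lloyd's algorithm with a fixed initialisation, `kaplan_meier` is the standard product-limit estimator without confidence intervals, and the embeddings and follow-up data are made up.

```python
import math


def two_means(points, n_iter=20):
    """Lloyd's algorithm with k=2, initialised at the first and last points."""
    centers = [list(points[0]), list(points[-1])]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min((0, 1), key=lambda k: math.dist(p, centers[k])) for p in points]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def kaplan_meier(times, events, horizon):
    """Kaplan-Meier estimate of the survival probability at `horizon`.

    times  : follow-up time of each patient (years)
    events : 1 if death was observed at that time, 0 if censored
    """
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1 and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
    return survival


# Hypothetical patient embeddings and follow-up data.
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (3.0, 3.1), (2.9, 3.0)]
years = [6.0, 5.5, 7.0, 1.0, 2.5]
died = [0, 1, 0, 1, 1]

groups = two_means(embeddings)
for g in (0, 1):
    t = [y for y, lab in zip(years, groups) if lab == g]
    e = [d for d, lab in zip(died, groups) if lab == g]
    print(g, kaplan_meier(t, e, horizon=5.0))
```

In practice a library implementation of K-means with multiple random restarts, and a survival package that also reports confidence intervals, would be the natural choice; the sketch only makes the logic of the comparison explicit.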
Figure 5 highlights the top disease and drug features with the largest differences between the late and early stage groups. The phenotype features associated with the smaller group are common phenotypes at the late stage of AD. Pneumonia is one of the two most serious medical conditions seen in late-stage AD patients [60]; hypovolemia and hypernatremia may be found in association with dehydration, which can occur in impaired late-stage AD patients who depend on others for fluid intake [61, 62, 63]. On the other hand, the features that appear more frequently in the early stage group, which are colored blue in the figure, are either common signs or possible causes of AD. Memory deficits begin in the early stage of AD [64], while vitamin deficiency and hypothyroidism are risk factors for AD [65, 66, 67]. As shown in the network of drug features and procedure features, the features ‘atorvastatin’, ‘metformin’, ‘escitalopram’, and ‘melatonin’, among others, have been shown to moderate AD or slow the progression of cognitive impairment in AD patients [68, 69, 70, 71]. Memantine, an N-methyl-D-aspartate receptor antagonist, is the only drug approved for use in moderate to severe AD under current AD treatment guidelines [72, 73]; Rivastigmine and Donepezil are the other drugs approved by the Food and Drug Administration (FDA) for AD treatment, besides Memantine and two accelerated-approval drugs; all three of these drugs are more common in the late stage group. Consistent with these references, the clustering of patients appears clinically plausible, which supports the quality of the patient embeddings based on feature selection by ARCH.
Figure 5:
The word cloud and the network of (a)(c) phenotype features; and (b)(d) drug features that drive the differences between the two subgroups. The size of the feature is determined by the between-group difference in the average intensity of such a feature. Red-colored features represent higher average intensity in the late stage group and blue-colored features represent higher intensity in the early stage group.
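The sizing rule in this caption amounts to a between-group difference in mean feature intensity. The sketch below illustrates that computation; the definition of "intensity" is not given in this section, so the values are treated as generic non-negative numbers per patient and feature, and the feature names, group labels, and data are hypothetical.

```python
def group_mean_differences(intensities, groups, feature_names):
    """Return (feature, late_mean - early_mean), sorted by absolute difference."""
    late = [row for row, g in zip(intensities, groups) if g == "late"]
    early = [row for row, g in zip(intensities, groups) if g == "early"]
    diffs = []
    for j, name in enumerate(feature_names):
        late_mean = sum(r[j] for r in late) / len(late)
        early_mean = sum(r[j] for r in early) / len(early)
        diffs.append((name, late_mean - early_mean))
    # Largest absolute difference first: these features would be drawn largest.
    return sorted(diffs, key=lambda x: abs(x[1]), reverse=True)


# Hypothetical per-patient intensities (rows) for three features (columns).
features = ["pneumonia", "memory loss", "hypothyroidism"]
intensity = [[3, 1, 0], [5, 1, 0], [0, 2, 1], [1, 4, 1], [2, 3, 1]]
stage = ["late", "late", "early", "early", "early"]

for name, diff in group_mean_differences(intensity, stage, features):
    color = "red" if diff > 0 else "blue"  # red: higher in late stage group
    print(name, diff, color)
```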
3.5. Computation cost
The main steps of our proposed algorithm are fast. First, the computation time for constructing the co-occurrence matrix depends heavily on the number of patients and features. However, the co-occurrence matrix can be created through distributed computing using the code at https://github.com/rusheniii/LargeScaleClinicalEmbedding. For the 12.6 million patients from VA, computing the co-occurrence matrix took 3 days using 12 machines with 128 cores. Second, the variance estimation and sparse embedding regression steps are much faster when run in parallel. Split into 10 jobs, these two steps finish in less than 10 hours.
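The reason the co-occurrence step distributes well is that counts computed on disjoint chunks of patients can simply be added. The sketch below shows that map-and-merge pattern on toy data. It is not the code from the repository above: it assumes, for simplicity, that two features co-occur whenever they appear in the same patient record, and the feature names are placeholders.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patients):
    """For one chunk of patients, count how many patients have each unordered feature pair."""
    counts = Counter()
    for features in patients:
        # Sort the distinct features so each pair has one canonical key.
        for a, b in combinations(sorted(set(features)), 2):
            counts[(a, b)] += 1
    return counts


def merge_counts(partial_counts):
    """Add up the per-chunk counts into one co-occurrence table."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total


# Two hypothetical chunks, e.g. one per worker machine.
chunks = [
    [["dx_AD", "rx_memantine"], ["dx_AD", "cui_memory_loss", "rx_memantine"]],
    [["dx_AD", "cui_memory_loss"]],
]

# Each chunk is independent, so this map step could run on separate workers.
partials = [cooccurrence_counts(chunk) for chunk in chunks]
total = merge_counts(partials)
print(total)
```

Because the merge is a plain sum, the same pattern extends to summary-level counts contributed by different sites, provided the feature keys refer to the same concepts.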
4. Discussion and Conclusion
Utilizing summary-level EHR data, the ARCH KG learning approach provides a highly scalable method for representing codified and narrative EHR concepts on a large scale, while also recovering their network structure. The proposed ARCH algorithm outputs both a cosine similarity and a p-value to evaluate the relationship between two concepts. The p-value includes an uncertainty assessment, making it suitable for reproducible and robust analysis. The cosine similarity is based on similarities between the embedding vectors and lacks a measure of uncertainty, making it more appropriate for predictive tasks. The differences in performance stem from these underlying characteristics. To choose between these variants, one should consider the specific requirements of the use case. The p-value should be used when reliability and robustness are critical, such as when reproducibility is important or when the sample size is small. For example, when adding new features to a knowledge graph or working with a new dataset, the p-value output of ARCH is preferred because it assesses the significance of relationships and is more generalizable. Cosine similarity is better suited to tasks that prioritize prediction accuracy; it is also more computationally efficient and can be useful when computational resources are limited. To address varying evaluation requirements, we propose a complementary integration of p-values and cosine similarity scores. By normalizing these metrics (e.g., via Min-Max scaling), their combination in downstream tasks can enable a balanced approach, prioritizing p-values for statistical rigor and cosine similarities for computational efficiency. This framework enhances ARCH’s adaptability and ensures it meets diverse analytical needs.
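One possible form of the proposed integration is sketched below. The source only states that the two metrics are normalized (e.g., by Min-Max scaling) and combined; the use of -log10(p) as the evidence scale, the linear blend, and the `weight` parameter are assumptions made here for illustration.

```python
import math


def min_max(values):
    """Rescale values linearly to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_score(p_values, cosines, weight=0.5):
    """Blend p-value evidence and cosine similarity on a common [0, 1] scale.

    weight : hypothetical tuning parameter; 1.0 uses p-values only,
             0.0 uses cosine similarity only.
    """
    # Smaller p-values mean stronger evidence, so scale -log10(p).
    # The floor guards against log10(0) for p-values reported as exactly 0.
    evidence = min_max([-math.log10(max(p, 1e-300)) for p in p_values])
    similarity = min_max(cosines)
    return [weight * e + (1.0 - weight) * s for e, s in zip(evidence, similarity)]


# Hypothetical p-values and cosine similarities for three concept pairs.
p_vals = [1e-4, 1e-2, 1.0]
cos_sims = [0.8, 0.2, 0.5]
print(combined_score(p_vals, cos_sims))
```

Setting `weight` closer to 1 favors statistical rigor, while values closer to 0 favor the cosine similarity; the appropriate choice would depend on the downstream task.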
In conclusion, the VA EHR-derived ARCH embeddings represent the first large-scale EHR embeddings to include both codified and NLP concepts, with the incorporation of NLP concepts proving particularly beneficial in real-world applications and a wide range of predictive modeling tasks such as drug side effect monitoring and disease phenotyping. Additionally, the network structure derived from ARCH is constructed with a statistically guaranteed false discovery rate.
The versatility of the learned ARCH embeddings makes them ideal for a broad range of downstream tasks. These embeddings demonstrate greater robustness than existing PLM-based embeddings. Our semantic representation evaluations and drug side effect prediction studies show that the ARCH embeddings can effectively capture the semantic relationships between EHR entities and concepts. Our results indicate that the ARCH embeddings with few-shot training have the potential to achieve high accuracy in KG-related tasks, such as entity matching and relation extraction. Additionally, the ARCH embeddings can serve as pre-trained representations of EHR concepts that can be linked to individual-level EHR data, further improving patient-level prediction tasks, as demonstrated in the AD patient profiling study. Joint representations of both codified and NLP data also enable more comprehensive multi-modal modeling of EHR data, significantly enhancing prediction performance for outcomes that require predictors that are not well-coded.
The use of summary-level data in learning the ARCH network creates an opportunity for collaborative training of knowledge graphs across multiple institutions. This approach can enhance the quality of the trained representation and improve the portability of downstream prediction algorithms. However, co-training ARCH embeddings using multi-institutional data faces a challenge in dealing with coding differences between institutions. Even for institutions that have mapped their local EHR codes to a common ontology, such mappings are often incomplete. Future research needs to explore co-training knowledge graphs for overlapping yet non-identical EHR concepts from multiple institutions based on summary-level data. Furthermore, averaging embeddings across a range of representative contexts may help to improve their stability and quality, and thus their usefulness for downstream tasks. Currently, the ARCH network relies solely on EHR occurrence patterns of concepts, disregarding valuable information contained in their descriptions. Incorporating both occurrence patterns and descriptions through language models is an intriguing avenue for further research in improving the network. The global embeddings for narrative concepts trained with large corpora of full-length biomedical articles have also attained strong results on the UMNSRS dataset [74]. The rank correlations between human annotation and cosine similarity were 0.62 and 0.58 for embeddings from [74] and 0.598 and 0.573 for ARCH embeddings. More detailed comparisons between these two sets of embeddings warrant further research. However, the results presented in the sections above collectively highlight the substantial value of the ARCH framework, not only in providing robust, scalable, and reproducible embeddings for clinical data but also in paving the way for innovative and impactful applications in biomedical research and healthcare delivery, despite limitations seen in specific evaluation sets.
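The rank correlation used in the UMNSRS comparison can be computed as a Pearson correlation between ranks (Spearman's coefficient). The sketch below is a minimal, dependency-free version with average ranks for ties; the human scores and cosine similarities are made up and are not UMNSRS data.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human relatedness scores and embedding cosine similarities
# for five concept pairs.
human = [1.0, 2.5, 3.0, 4.5, 5.0]
cosine = [0.10, 0.30, 0.25, 0.70, 0.90]
print(spearman(human, cosine))
```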
Supplementary Material
Highlights.
Problem or Issue: Managing massive and complex electronic health record (EHR) data presents challenges in feature representation and uncertainty quantification for both codified data and narrative notes.
What is Already Known: Existing word embedding techniques in natural language processing (NLP) or based on co-occurrence matrix lack inherent support for sparse knowledge graph (KG) creation and joint representation of large-scale codified and NLP EHR concepts.
What this Paper Adds: This study introduces a method for generating low-dimensional KG embeddings for both codified and NLP EHR concepts, incorporating uncertainty measures on the links, and constructing a sparse, large-scale KG.
Acknowledgments
We would like to acknowledge the invaluable contributions arising from the collaboration between Veterans Affairs (VA) and the Department of Energy (DOE) which provided the computing infrastructure essential to develop and test these approaches at scale with nation-wide VA EHR data. This project was supported by the NIH grants 1OT2OD032581, R01 HL089778 and R01 LM013614, P30 AR072577, and the Million Veteran Program, Department of Veterans Affairs, Office of Research and Development, Veterans Health Administration, and was supported by the award #MVP000. This research used resources from the Knowledge Discovery Infrastructure at Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05–00OR22725. This publication does not represent the views of the Department of Veterans Affairs or the U.S. government.
Footnotes
Author contributions
ZG: Methodology, Software, Writing – original draft. DZ: Methodology, Software, Writing – original draft. ER: Resources. VAP: Data curation. YH: Data curation. GO: Resources. ZX: Methodology. SS: Writing - review & editing. XX: Visualization. KFG: Writing - review & editing. CH: Writing - review & editing. CB: Visualization. JW: Data curation. LC: Writing - review & editing. TC: Writing - review & editing. EB: Writing - review & editing. ZX: Writing - review & editing. JMG: Writing - review & editing. KPL: Writing - review & editing. KC: Conceptualization, Writing – review & editing, Supervision. TC: Methodology, Conceptualization, Writing – review & editing, Supervision. JL: Methodology, Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.
Declaration of interests
The authors declare no competing interests.
Data availability
The data that support the findings of this study are available from the Veterans Affairs (VA) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of VA. All original code has been deposited at 10.5281/zenodo.10426582 and is publicly available.
References
- 1. Halpern Y, Horng S, Choi Y, Sontag D, Electronic medical record phenotyping using the anchor and learn framework, JAMIA 23 (4) (2016) 731–740.
- 2. Choi E, Schuetz A, Stewart WF, Sun J, Using recurrent neural network models for early detection of heart failure onset, JAMIA 24 (2) (2017) 361–370.
- 3. Christopoulou F, Tran TT, Sahu SK, Miwa M, Ananiadou S, Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods, JAMIA 27 (1) (2020) 39–46.
- 4. Jin B, Che C, Liu Z, Zhang S, Yin X, Wei X, Predicting the risk of heart failure with ehr sequential data modeling, IEEE Access 6 (2018) 9256–9261.
- 5. McInnes BT, Pedersen T, Carlis J, Using UMLS Concept Unique Identifiers (CUIs) for word sense disambiguation in the biomedical domain, in: AMIA Symposium, Vol. 2007, 2007, pp. 533–537.
- 6. Ghassemi M, Naumann T, Doshi-Velez F, Brimmer N, Joshi R, Rumshisky A, Szolovits P, Unfolding physiological state: Mortality modelling in intensive care units, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 75–84.
- 7. Caballero Barajas KL, Akella R, Dynamically modeling patient’s health state from electronic medical records: A time series approach, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 69–78.
- 8. Tayefi M, Ngo P, Chomutare T, Dalianis H, Salvi E, Budrionis A, Godtliebsen F, Challenges and opportunities beyond structured data in analysis of electronic health records, Wiley Interdisciplinary Reviews: Computational Statistics 13 (6) (2021) e1549.
- 9. Abhyankar S, Demner-Fushman D, Callaghan FM, McDonald CJ, Combining structured and unstructured data to identify a cohort of icu patients who received dialysis, JAMIA 21 (5) (2014) 801–807.
- 10. Zhang D, Yin C, Zeng J, Yuan X, Zhang P, Combining structured and unstructured data for predictive models: a deep learning approach, BMC Med Inform Decis Mak 20 (2020) 280.
- 11. Wang Y, Ng K, Byrd RJ, Hu J, Ebadollahi S, Daar Z, Defilippi C, Steinhubl SR, Stewart WF, Early detection of heart failure with varying prediction windows by structured and unstructured data in electronic health records, in: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2015, pp. 2530–2533.
- 12. Bauer-Mehren A, Lependu P, Iyer SV, Harpaz R, Leeper NJ, Shah NH, Network analysis of unstructured ehr data for clinical research, AMIA Summits on Translational Science Proceedings 2013 (2013) 14–18.
- 13. Finlayson SG, LePendu P, Shah NH, Building the graph of medicine from millions of clinical narratives, Scientific Data 1 (2014) 140032.
- 14. Hong C, Rush E, Liu M, Zhou D, Sun J, Sonabend A, Castro VM, Schubert P, Panickan VA, Cai T, et al., Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data, NPJ Digit Med 4 (1) (2021) 1–11.
- 15. Agarwal P, Searls DB, Can literature analysis identify innovation drivers in drug discovery?, Nature Reviews Drug Discovery 8 (11) (2009) 865–878.
- 16. Cohen T, Widdows D, Empirical distributional semantics: methods and biomedical applications, J Biomed Inform 42 (2) (2009) 390–405.
- 17. De Vine L, Zuccon G, Koopman B, Sitbon L, Bruza P, Medical semantic similarity with a neural language model, in: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 2014, pp. 1819–1822.
- 18. Glicksberg BS, Miotto R, Johnson KW, Shameer K, Li L, Chen R, Dudley JT, Automated disease cohort selection using word embeddings from electronic health records, Pac Symp Biocomput (2018) 145–156.
- 19. Segura-Bedmar I, Raez P, Cohort selection for clinical trials using deep learning models, JAMIA 26 (11) (2019) 1181–1188.
- 20. Feng Y, Min X, Chen N, Chen H, Xie X, Wang H, Chen T, Patient outcome prediction via convolutional neural networks based on multi-granularity medical concept embedding, in: 2017 IEEE Int Conf Bioinformatics Biomed, IEEE, 2017, pp. 770–777.
- 21. Choi E, Xiao C, Stewart W, Sun J, Mime: Multilevel medical embedding of electronic health records for predictive healthcare, Adv Neural Inf Process Syst 31 (2018).
- 22. Li Z, Roberts K, Jiang X, Long Q, Distributed learning from multiple ehr databases: contextual embedding models for medical events, J Biomed Inform 92 (2019) 103138.
- 23. Bengio Y, Ducharme R, Vincent P, A neural probabilistic language model, JMLR 3 (2003) 1137–1155.
- 24. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J, Distributed representations of words and phrases and their compositionality, Adv Neural Inf Process Syst 26 (2013) 3111–3119.
- 25. Pennington J, Socher R, Manning CD, Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543.
- 26. Bai T, Chanda AK, Egleston BL, Vucetic S, EHR phenotyping via jointly embedding medical concepts and words into a unified vector space, BMC Med Inform Decis Mak 18 (4) (2018) 15–25.
- 27. Johnson AE, Pollard TJ, Shen L, Lehman L.-w. H., Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG, MIMIC-III, a freely accessible critical care database, Scientific Data 3 (1) (2016) 1–9.
- 28. Choi Y, Chiu CY-I, Sontag D, Learning low-dimensional representations of medical concepts, AMIA Summits on Translational Science Proceedings 2016 (2016) 41–50.
- 29. Kartchner D, Christensen T, Humpherys J, Wade S, Code2vec: Embedding and clustering medical diagnosis data, in: 2017 IEEE International Conference on Healthcare Informatics, 2017, pp. 386–390.
- 30. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J, Multi-layer representation learning for medical concepts, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1495–1504.
- 31. Zhou D, Gan Z, Shi X, Patwari A, Rush E, Bonzel C-L, Panickan VA, Hong C, Ho Y-L, Cai T, et al., Multiview incomplete knowledge graph integration with application to cross-institutional ehr data harmonization, J Biomed Inform 133 (2022) 104147.
- 32. Koller D, Friedman N, Probabilistic graphical models: principles and techniques, MIT press, 2009.
- 33. Arora S, Li Y, Liang Y, Ma T, Risteski A, A latent variable model approach to pmi-based word embeddings, Trans Assoc Comput Linguist 4 (2016) 385–399.
- 34. Xu Z, Gan Z, Zhou D, Shen S, Lu J, Cai T, Inference of dependency knowledge graph for electronic health records, arXiv preprint arXiv:2312.15611 (2023).
- 35. Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space, J R Stat Soc Series B Stat Methodol 70 (5) (2008) 849–911.
- 36. Zhou S, Rütimann P, Xu M, Bühlmann P, High-dimensional covariance estimation based on gaussian graphical models, JMLR 12 (2011) 2975–3026.
- 37. James G, Witten D, Hastie T, Tibshirani R, et al., An introduction to statistical learning, Vol. 112, Springer, 2013.
- 38. Akaike H, Information theory and an extension of the maximum likelihood principle, in: Selected papers of hirotugu akaike, Springer, 1998, pp. 199–213.
- 39. Grover A, Leskovec J, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
- 40. Devlin J, Chang M, Lee K, Toutanova K, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019, pp. 4171–4186.
- 41. Liu F, Shareghi E, Meng Z, Basaldella M, Collier N, Self-alignment pretraining for biomedical entity representations, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4228–4238.
- 42. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (4) (2020) 1234–1240.
- 43. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H, Domain-specific language model pretraining for biomedical natural language processing, ACM Transactions on Computing for Healthcare 3 (2021) 1–23.
- 44. Pakhomov S, McInnes B, Adam T, Liu Y, Pedersen T, Melton GB, Semantic similarity and relatedness between clinical terms: an experimental study, in: AMIA annual symposium proceedings, Vol. 2010, American Medical Informatics Association, 2010, p. 572.
- 45. Pakhomov S, Semantic relatedness and similarity reference standards for medical terms (2018).
- 46. Kuhn M, Letunic I, Jensen LJ, Bork P, The sider database of drugs and side effects, Nucleic Acids Research 44 (D1) (2016) D1075–D1079.
- 47. Hao Y, Tatonetti NP, Predicting negative control drugs to support research in drug safety, bioRxiv (2018) 380832.
- 48. Yuan Z, Zhao Z, Sun H, Li J, Wang F, Yu S, Coder: Knowledge-infused cross-lingual medical term embedding for term normalization, J Biomed Inform (2022) 103983.
- 49. Agarwal V, Podchiyska T, Banda JM, Goel V, Leung TI, Minty EP, Sweeney TE, Gyang E, Shah NH, Learning statistical models of phenotypes using noisy labeled training data, JAMIA 23 (6) (2016) 1166–1173.
- 50. Levine ME, Albers DJ, Hripcsak G, Methodological variations in lagged regression for detecting physiologic drug effects in ehr data, J Biomed Inform 86 (2018) 149–159.
- 51. Yu S, Ma Y, Gronsbell J, Cai T, Ananthakrishnan AN, Gainer VS, Churchill SE, Szolovits P, Murphy SN, Kohane IS, et al., Enabling phenotypic big data with phenorm, JAMIA 25 (1) (2018) 54–60.
- 52. Liao KP, Sun J, Cai TA, Link N, Hong C, Huang J, Huffman JE, Gronsbell J, Zhang Y, Ho Y-L, et al., High-throughput multimodal automated phenotyping (MAP) with application to PheWAS, JAMIA 26 (11) (2019) 1255–1262.
- 53. Ahuja Y, Zhou D, He Z, Sun J, Castro VM, Gainer V, Murphy SN, Hong C, Cai T, surelda: A multidisease automated phenotyping method for the electronic health record, JAMIA 27 (8) (2020) 1235–1243.
- 54. Miotto R, Li L, Kidd BA, Dudley JT, Deep patient: an unsupervised representation to predict the future of patients from the electronic health records, Sci Rep 6 (2016) 26094.
- 55. Zhu Z, Yin C, Qian B, Cheng Y, Wei J, Wang F, Measuring patient similarities via a deep architecture with medical concept embedding, in: 2016 IEEE 16th International Conference on Data Mining, 2016, pp. 749–758.
- 56. Dubois S, Romano N, Kale DC, Shah N, Jung K, Learning effective representations from clinical notes, arXiv preprint arXiv:1705.07025 (2017).
- 57. Sharafoddini A, Dubin JA, Lee J, et al., Patient similarity in prediction models based on health data: a scoping review, JMIR Medical Informatics 5 (1) (2017) e6730.
- 58. Allyn J, Allou N, Augustin P, Philip I, Martinet O, Belghiti M, Provenchere S, Montravers P, Ferdynus C, A comparison of a machine learning model with euroscore ii in predicting mortality after elective cardiac surgery: a decision curve analysis, PLoS one 12 (1) (2017) e0169772.
- 59. Lei L, Zhou Y, Zhai J, Zhang L, Fang Z, He P, Gao J, An effective patient representation learning for time-series prediction tasks based on EHRs, in: 2018 IEEE Int Conf Bioinformatics Biomed, 2018, pp. 885–892.
- 60. Kalia M, Dysphagia and aspiration pneumonia in patients with Alzheimer’s disease, Metabolism 52 (2003) 36–38.
- 61. Lauriola M, Mangiacotti A, D’Onofrio G, Cascavilla L, Paris F, Paroni G, Seripa D, Greco A, Sancarlo D, Neurocognitive disorders and dehydration in older patients: clinical experience supports the hydromolecular hypothesis of dementia, Nutrients 10 (5) (2018) 562.
- 62. Farlow MR, Alzheimer’s disease, Continuum: Lifelong Learning in Neurology 13 (2) (2007) 39–68.
- 63. Lee TJ, Kolasa KM, Feeding the person with late-stage Alzheimer’s disease, Nutrition Today 46 (2) (2011) 75–79.
- 64. Mimura M, Yano M, Memory impairment and awareness of memory deficits in early-stage alzheimer’s disease, Reviews in the Neurosciences 17 (1–2) (2006) 253–266.
- 65. Chai B, Gao F, Wu R, Dong T, Gu C, Lin Q, Zhang Y, Vitamin D deficiency as a risk factor for dementia and Alzheimer’s disease: an updated meta-analysis, BMC Neurology 19 (1) (2019) 1–11.
- 66. Kim JH, Lee HS, Kim YH, Kwon MJ, Kim J-H, Min CY, Yoo DM, Choi HG, The association between thyroid diseases and Alzheimer’s disease in a national health screening cohort in Korea, Frontiers in Endocrinology 13 (2022) 815063.
- 67. Hong CH, Falvey C, Harris TB, Simonsick EM, Satterfield S, Ferrucci L, Metti AL, Patel KV, Yaffe K, Anemia and risk of dementia in older adults: findings from the health abc study, Neurology 81 (6) (2013) 528–533.
- 68. Sparks DL, Sabbagh MN, Connor DJ, Lopez J, Launer LJ, Browne P, Wasser D, Johnson-Traver S, Lochhead J, Ziolwolski C, Atorvastatin for the treatment of mild to moderate alzheimer disease: preliminary results, Archives of neurology 62 (5) (2005) 753–757.
- 69. Liao W, Xu J, Li B, Ruan Y, Li T, Liu J, Deciphering the roles of metformin in Alzheimer’s disease: a snapshot, Frontiers in Pharmacology 12 (2022) 728315.
- 70. Barak Y, Plopski I, Tadger S, Paleacu D, Escitalopram versus risperidone for the treatment of behavioral and psychotic symptoms associated with Alzheimer’s disease: a randomized double-blind pilot study, International Psychogeriatrics 23 (9) (2011) 1515–1519.
- 71. Lin L, Huang Q-X, Yang S-S, Chu J, Wang J-Z, Tian Q, Melatonin in alzheimer’s disease, International Journal of Molecular Sciences 14 (7) (2013) 14575–14593.
- 72. Liu J, Chang L, Song Y, Li H, Wu Y, The role of NMDA receptors in Alzheimer’s disease, Frontiers in Neuroscience 13 (2019) 43.
- 73. Tariot PN, Farlow MR, Grossberg GT, Graham SM, McDonald S, Gergel I, Group MS, et al., Memantine treatment in patients with moderate to severe alzheimer disease already receiving donepezil: a randomized controlled trial, Journal of the American Medical Association 291 (3) (2004) 317–324.
- 74. Pakhomov SV, Finley G, McEwan R, Wang Y, Melton GB, Corpus domain effects on distributional semantic modeling of medical terms, Bioinformatics 32 (23) (2016) 3635–3644.
- 75. Benjamini Y, Hochberg Y, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal statistical society: series B (Methodological) 57 (1) (1995) 289–300.
- 76. Yu S, Cai T, Cai T, Nile: fast natural language processing for electronic health records, arXiv preprint arXiv:1311.6063 (2013).
- 77. Bodenreider O, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (Database issue) (2004) D267–D270.
- 78. Wen J, Zhang X, Rush E, Panickan VA, Li X, Cai T, Zhou D, Ho Y-L, Costa L, Begoli E, et al., Multimodal representation learning for predicting molecule–disease relations, Bioinformatics 39 (2) (2023) btad085.