This is a preprint.

It has not yet been peer reviewed by a journal.

medRxiv [Preprint]. 2025 Oct 13:2024.09.29.24314557. Originally published 2024 Sep 30. [Version 3] doi: 10.1101/2024.09.29.24314557

ALADYNOULLI: A Bayesian approach to disease progression modeling for genomic discovery and clinical prediction

Sarah M Urbut 1,2,3,4, Yi Ding 5, Tetsushi Nakao 1,2,4, Xilin Jiang 6, Leslie Gaffney 4, Anika Misra 1,2,4, Whitney Hornsby 1,2,4, Jordan W Smoller 3,4,7,8, Alexander Gusev 5,3,4, Pradeep Natarajan 1,2,3,4, Giovanni Parmigiani 9,10
PMCID: PMC11577253  PMID: 39568791

Abstract

Understanding how disease patterns evolve over a lifetime remains a key challenge in medicine. While electronic health records provide rich longitudinal data, existing models typically analyze each disease in isolation, missing the complex interplay between multiple conditions and genetic factors. Here, we combine longitudinal health records with genetic data to model individual trajectories, using a novel dynamic Bayesian framework called ALADYNOULLI that identifies latent disease signatures from longitudinal health records while modeling individual-specific trajectories. Applied across three biobanks with up to 50 years of follow-up, our model discovers clinically interpretable disease signatures that demonstrate remarkable consistency across diverse populations (79.2% cross-cohort correspondence) and show strong genetic correlations, enabling both accurate prediction of patient risk and discovery of novel genetic associations. The model achieves dramatic improvements in disease prediction across 24 conditions, outperforming established clinical risk scores like PCE, PREVENT and GAIL over short and longer-term horizons. Furthermore, our signature-based approach identifies over 150 genetic loci, many missed by single-disease GWAS, with multiple signatures showing strong genetic signals (cardiovascular h² = 0.041, musculoskeletal h² = 0.035). Critically, this unified modeling approach significantly improves predictive performance for multiple diseases while revealing distinct biological subtypes within traditional diagnostic categories, demonstrating substantial heterogeneity across diverse conditions including cancer, metabolic disorders, and psychiatric conditions, with Cohen’s d effect sizes up to 3.87 for signature differences between patient clusters (p ≤ 1 × 10⁻⁸ for 95% of comparisons).
In conclusion, ALADYNOULLI combines genetics and longitudinal diagnosis to achieve both improved disease prediction and enhanced genetic discovery through a unified framework that captures the complex interplay between genetic predisposition and time-varying disease patterns. Application with link to simulated results available at http://aladynoulli.hms.harvard.edu. Code available with github permission enabled.

Introduction

The risk of disease varies substantially between individuals and throughout life, with complex interactions between genetic predisposition, environmental factors, and accumulated comorbidities. Understanding these dynamic patterns of risk could transform early detection, prevention, and personalized treatment strategies (1–3). The increasing availability of large-scale electronic health records (EHRs) linked to genetic data provides unprecedented opportunities to model these complex disease trajectories at a population scale (4, 5). However, extracting meaningful patterns from these rich, longitudinal datasets remains challenging due to patient population heterogeneity, the temporal nature of disease progression, and intricate relationships between diverse conditions.

Traditional approaches to analyzing EHR data often focus on isolated diseases or simple pairwise associations, failing to capture how multiple conditions evolve together over time (6). Recent unsupervised methods have attempted to identify disease clusters or trajectories (7), but typically do not account for temporal dynamics of disease risk or individual-level heterogeneity, particularly the influence of genetic factors on disease progression rates (8, 9). Furthermore, many models assume conditional independence of diseases, missing the opportunity to leverage information across related conditions for both prediction and discovery (10,11). Consider a patient who develops rheumatoid arthritis at age 45, followed by hypertension at 48, and eventually suffers a myocardial infarction at 52. Traditional approaches may treat these as separate events or simple comorbidities, missing the underlying metabolic-inflammatory signature that drives this progression. Also, they do not typically leverage information from patients with similar patterns to improve prediction for rare conditions, where limited data makes traditional disease-specific models less reliable.

We present ALADYNOULLI, a generative model that integrates genetic data with longitudinal EHRs to identify latent disease signatures while modeling individual-specific health trajectories over time. ALADYNOULLI addresses these limitations by identifying shared disease signatures that capture biological pathways common across multiple conditions, enabling more accurate prediction even for rare diseases through information sharing with related, more common conditions. Our approach offers several key advantages over existing methods: (1) Interpretability: disease signatures correspond to clinically meaningful biological processes rather than abstract statistical factors; (2) Temporal modeling: captures how disease risk evolves dynamically over the life course rather than static risk assessment; (3) Genetic integration: directly incorporates genetic information into the model architecture rather than as post-hoc analysis; (4) Unified framework: simultaneously models multiple diseases, sharing information across related conditions and improving prediction even for diseases with sparse data (12); and (5) Individual-specific trajectories: provides personalized risk profiles that adapt as new clinical information becomes available. By jointly modeling multiple diseases and their genetic determinants, ALADYNOULLI enables both improved prediction of future disease risk and enhanced discovery of genetic architecture underlying complex phenotypes, while revealing meaningful patient subgroups with distinct biological mechanisms that could inform personalized interventions.

Results

ALADYNOULLI captures temporal disease signatures and individual trajectories

Disease patterns among individuals vary by onset, progression speed, and composition, reflecting different underlying biological mechanisms. Unlike allocation-based topic models that conditionally allocate observed diseases to categories (7,10), ALADYNOULLI models the probability of each disease for an individual by integrating across multiple latent signatures (Figure 1).

Figure 1: ALADYNOULLI model overview and applications.

Top: Example patient timeline showing the sequence and timing of major diagnoses over the life course. Middle: Key model components. Left: Population-level disease signatures (φ), with each line representing the age-dependent risk trajectory for a specific disease within a signature. Center: Individual signature loadings (λ) transformed to θ via softmax, for a representative patient, showing how contributions from different signatures evolve over time. Right: Disease risk prediction (π) for selected diseases, integrating population-level signatures and individual loadings to generate personalized risk trajectories. Bottom: Applications of the model, including genomic discovery, therapeutic targeting, and patient matching (e.g., digital twin identification or stratification of patients with the same diagnosis but different risk profiles).

For each individual i, disease d, and time point t, we model the probability of disease occurrence $\pi_{idt}$ as a weighted combination of signature-specific probabilities, where each signature captures patterns of diseases that tend to occur together (Table S1):

$$\pi_{idt} = \kappa \sum_{k=1}^{K} \theta_{ikt}\,\mathrm{sigmoid}(\phi_{kdt}), \tag{1}$$

where $\mathrm{sigmoid}(\phi_{kdt}) = 1/(1 + e^{-\phi_{kdt}})$, $\kappa$ is a global calibration parameter, $\theta_{ikt}$ represents individual i’s normalized, time-varying association with signature k at time t, and $\phi_{kdt}$ captures the relationship between signature k and disease d at time t.

The normalized individual-signature associations (loadings) $\theta_{ikt}$ are derived from latent variables $\lambda_{ikt}$ through a softmax transformation:

$$\theta_{ikt} = \frac{\exp(\lambda_{ikt})}{\sum_{k'=1}^{K} \exp(\lambda_{ik't})}. \tag{2}$$
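As a minimal sketch of Equations 1 and 2 for a single individual, disease, and time point, the Python fragment below applies the softmax to a vector of latent loadings and then forms the weighted sum of signature-specific probabilities. The function names and all numerical values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def softmax(values):
    """Map latent loadings lambda_{ikt} (one per signature k) to weights summing to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def disease_probability(lam_it, phi_dt, kappa=1.0):
    """Equation 1 for one individual i, one disease d, and one time point t.

    lam_it : latent loadings lambda_{ikt}, one entry per signature k
    phi_dt : signature-disease logits phi_{kdt}, one entry per signature k
    kappa  : global calibration parameter (1.0 here is an assumption)
    """
    theta = softmax(lam_it)
    return kappa * sum(th * sigmoid(ph) for th, ph in zip(theta, phi_dt))

# Hypothetical values for K = 3 signatures.
lam = [2.0, 0.0, -1.0]
phi = [-3.0, -6.0, -8.0]
theta = softmax(lam)
pi = disease_probability(lam, phi)
print(theta, pi)
```

With kappa equal to 1, the resulting probability always lies between the smallest and largest signature-specific probability, because the weights are non-negative and sum to one.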

These latent variables $\lambda_{ikt}$ follow a Gaussian process (13) prior wherein we model the effects of genetic factors and time (see Methods; Figure S1). Specifically:

$$\lambda_{ik} \sim \mathcal{GP}\left(r_k + g_i\Gamma_k,\ \Omega_\lambda\right) \tag{3}$$

where $r_k$ is a signature-specific reference level, $\Gamma_k$ captures genetic effects on signature predisposition, $g_i$ represents individual genetic factors (here, polygenic scores) affecting the mean of $\lambda_{ik}$, and $\Omega_\lambda$ is a temporal covariance kernel modeling smooth trajectories for $\lambda_{ikt}$ over time.

Similarly, the disease-signature associations follow a Gaussian process prior:

$$\phi_{kd} \sim \mathcal{GP}\left(\mu_d + \psi_{kd},\ \Omega_\phi\right) \tag{4}$$

where $\mu_d$ is a disease-specific baseline (the logit of the population prevalence), $\psi_{kd}$ represents the overall strength of association between signature k and disease d, and $\Omega_\phi$ allows for temporal variation in these associations.
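The Gaussian process priors in Equations 3 and 4 share one structure: a constant mean plus a temporal covariance. The sketch below draws a single trajectory for the latent loading of one individual and one signature. The squared-exponential kernel, its hyperparameters, the age grid, and every numerical value are assumptions made for illustration, since the text only states that the kernel models smooth trajectories. The same code would apply to Equation 4 with the mean replaced by the disease baseline plus the signature-disease association.

```python
import math
import random

def sq_exp_kernel(ages, amplitude=1.0, length_scale=10.0, jitter=1e-4):
    """Squared-exponential covariance over ages (an assumed kernel form)."""
    n = len(ages)
    cov = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            dist = (ages[a] - ages[b]) / length_scale
            cov[a][b] = amplitude ** 2 * math.exp(-0.5 * dist ** 2)
        cov[a][a] += jitter  # small diagonal term keeps the matrix positive definite
    return cov

def cholesky(cov):
    """Lower-triangular factor `low` such that low * low^T equals cov."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = sum(low[r][j] * low[c][j] for j in range(c))
            if r == c:
                low[r][c] = math.sqrt(cov[r][r] - acc)
            else:
                low[r][c] = (cov[r][c] - acc) / low[c][c]
    return low

def sample_gp(mean, cov, rng):
    """Draw one trajectory as mean + low * z, with z standard normal."""
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(low[r][j] * z[j] for j in range(r + 1))
            for r, m in enumerate(mean)]

# Hypothetical inputs: an age grid, a reference level r_k, two polygenic
# scores g_i for one individual, and their effects Gamma_k on signature k.
ages = [40, 45, 50, 55, 60, 65, 70]
r_k = -1.0
g_i = [0.8, -0.3]
gamma_k = [0.2, 0.1]

prior_mean = r_k + sum(g * w for g, w in zip(g_i, gamma_k))  # r_k + g_i Gamma_k
omega = sq_exp_kernel(ages)
lambda_ik = sample_gp([prior_mean] * len(ages), omega, random.Random(0))
print(prior_mean, lambda_ik)
```

In this sketch the genetic term shifts the whole trajectory up or down through the mean, while the kernel controls only how smoothly the trajectory varies with age.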

A key innovation of our approach lies in its formulation as a mixture of probabilities rather than a probability of a mixture, as in traditional sparse factor analysis approaches (6). Unlike allocation-based topic models that conditionally assign diseases to individuals after the event has necessarily occurred (7), ALADYNOULLI directly models the probability of disease occurrence as a weighted combination of signature-specific disease probabilities.

This crucial distinction allows our model to: (1) predict future disease onset rather than merely explain observed diagnoses; (2) accommodate multiple contributing disease processes simultaneously rather than forcing competitive allocation to a single signature; and (3) accurately model chronic conditions that persist over time rather than treating each diagnosis as an independent event. The combination of softmax-transformed individual loadings (θ) and sigmoid-transformed disease probabilities (ϕ) ensures proper probability scaling.
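One way to read the distinction between a mixture of probabilities and a probability of a mixture is sketched below. It assumes that the latter means applying the sigmoid to a weighted sum of logits, which is an interpretation and is not spelled out in the text. The weights and logits are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical individual with equal weight on two signatures, one in which
# the disease is rare (logit -6) and one in which it is common (logit 0).
theta = [0.5, 0.5]
phi = [-6.0, 0.0]

# Mixture of probabilities (the form of Equation 1 with kappa = 1):
# average the signature-specific probabilities.
mixture_of_probabilities = sum(t * sigmoid(p) for t, p in zip(theta, phi))

# Probability of a mixture (assumed reading): average the logits first,
# then apply the sigmoid once.
probability_of_mixture = sigmoid(sum(t * p for t, p in zip(theta, phi)))

print(mixture_of_probabilities, probability_of_mixture)
```

The two quantities differ whenever the signature logits differ, because the sigmoid is nonlinear. In this example the mixture of probabilities stays close to the weighted share of the high-risk signature, whereas averaging on the logit scale pulls the result toward the low-risk signature.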

Terminology clarification:

We note that the factor analysis literature exhibits inconsistent terminology that can be confusing. In some traditions (e.g., sparse factor analysis (6, 14)), “loadings” refer to individual-specific weights (our λ parameters), while “factors” or “coefficients” refer to feature importance (our ϕ parameters). In other traditions, “loadings” refer to feature importance (our ϕ parameters), while individual components are called “scores” or “weights” (our λ parameters). Throughout this work, we use “loadings” to refer to individual-specific signature associations (λ and θ) and “signature-disease associations” to refer to feature importance (ϕ), consistent with the sparse factor analysis convention where loadings represent individual variation and factors represent feature structure.

Two complementary applications:

ALADYNOULLI serves two distinct but complementary purposes, each requiring different analytical approaches. For biomedical discovery, ALADYNOULLI operates with complete hindsight, leveraging entire patient trajectories to maximize our ability to identify biological patterns and mechanisms. This retrospective analysis transforms our understanding of disease patterns, progression speed, genetic relationships, and disease associations by using all available longitudinal data to characterize disease signatures, quantify genetic influences, and reveal patient heterogeneity within diagnostic categories. For clinical prediction, we operate under strict temporal constraints that mirror real-world clinical decision making (see Figure S1 for distinction). We employ a rigorous temporal validation framework that uses only information available up to a prediction time point (see Figure S2). This prospective approach simulates real-world clinical scenarios where physicians must predict future risk based solely on a patient’s history to date, ensuring our performance metrics reflect true predictive capability rather than retrospective explanation.

Applying ALADYNOULLI identifies consistent signature patterns across diverse populations

We applied ALADYNOULLI to three independent cohorts: UK Biobank (UKB, n=427,239), Mass General Brigham (MGB, n=48,069), and All of Us (AoU, n=208,263) (Table S2, Figure S3). We obtained ICD-10 codes from hospitalization diagnoses in each biobank (4, 15) and transformed these to pheCodes (16), following established approaches for EHR phenotyping (17) (see Methods). A set of 348 pheCodes was selected, representing diseases with at least 1000 unique occurrences in UK Biobank hospitalization episode statistics (18), as in (7). Despite differences in population characteristics, healthcare systems, and data collection methodologies across these cohorts, our model identified remarkably consistent signature patterns (Table S3; Figure S4).

We set K=20 for our model, which converged well (S5) across all three biobanks and successfully identified 20 distinct disease signatures from the data. These model-derived signatures corresponded to recognized disease processes and captured diverse disease domains including cardiovascular, metabolic, pulmonary, psychiatric, musculoskeletal, and oncologic conditions (S3). Each signature demonstrates characteristic temporal patterns, with disease probabilities evolving dynamically with age across biobanks (Figure 2A; Figures S6, S7, S8). For example, the cardiovascular signature shows steadily increasing probabilities for conditions like atrial fibrillation and heart failure after age 55 years, while the malignancy signature displays a sharp rise in metastatic disease probabilities between ages 60–75 years. The specificity of each signature for a given disease, as modeled by ψkd, is preserved but heterogeneous, reflecting the model’s ability to disentangle signature-disease specificity (Figure 2B).

Figure 2: Population-level disease signatures inferred by ALADYNOULLI.

(A) Age-dependent log hazard ratios for four representative disease signatures (cardiovascular, cancer, pulmonary, and cerebrovascular), as estimated by the model. Each line represents the predicted risk trajectory for a specific disease within the signature, illustrating distinct temporal patterns of disease onset. (B) Heatmap of signature-disease specificity parameters (ψkd) learned by the model, with red indicating strong positive association and blue indicating negative association between diseases and signatures. (C) Cluster correspondence matrices comparing model-inferred disease groupings across biobanks (UK Biobank, MGB, and All of Us), demonstrating the consistency of disease clusters for common diseases. (D) Model-predicted age-specific probabilities of disease onset for a range of conditions, showing the temporal emergence of diseases across the lifespan. (E) Comparison of signature trajectories for cardiovascular and malignancy signatures across three independent biobanks (MGB, AoU, UKB), demonstrating the robustness and replicability of the model’s temporal patterns across cohorts.

Furthermore, the model’s tensor structure (Figure S9) enables rapid disease hazard calculation using the average loadings (Equation 3) and population-level ϕkd. The average age-specific hazard probabilities for a wide range of diseases are visualized in Figure 2C, highlighting temporal risk patterns.

As stated, these signature patterns show strong consistency across the three independent cohorts (Figure 2D, Figures S6, S7, S8). When comparing the membership of diseases within signatures between any two cohorts, we observed high concordance (median modified Jaccard index = 0.792, IQR = 0.65–0.89, across all pairwise comparisons between biobanks for similarity-matched signatures; the index is the number of diseases shared by two matched signatures, normalized to the total number of diseases within the matched UKB signature; Figure S4). Figure 2E illustrates this consistency for two key signatures: cardiovascular disease and malignancy. Despite differences in healthcare systems and coding practices, the temporal patterns of key diseases within these signatures remain remarkably consistent, supporting the biological validity of the discovered patterns.
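The modified Jaccard index described above can be sketched as the size of the intersection of two matched signatures divided by the size of the UKB signature. The function name and the disease lists below are hypothetical, and how signatures are matched across cohorts and how diseases are assigned to signatures are left to the Methods.

```python
def modified_jaccard(ukb_signature, other_signature):
    """Shared diseases divided by the number of diseases in the UKB signature.

    Unlike the standard Jaccard index, the denominator is the size of the
    reference (UKB) set and not the size of the union.
    """
    ukb = set(ukb_signature)
    other = set(other_signature)
    if not ukb:
        raise ValueError("the reference (UKB) signature must not be empty")
    return len(ukb & other) / len(ukb)

# Hypothetical disease memberships for one matched cardiovascular signature.
ukb_cardio = ["atrial fibrillation", "heart failure",
              "coronary atherosclerosis", "myocardial infarction"]
mgb_cardio = ["atrial fibrillation", "heart failure",
              "coronary atherosclerosis", "hypertension"]

index = modified_jaccard(ukb_cardio, mgb_cardio)
print(index)
```

Because the denominator is the UKB signature alone, the index is not symmetric: swapping the two arguments can change the value when the signatures differ in size.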

The model also captures disease-specific temporal dynamics that match clinical expectations. For instance, Type 1 diabetes peaks earlier in life compared to Type 2 diabetes within the metabolic signature (Fig 2E), while primary malignancies precede metastatic disease within the cancer signature. These nuanced temporal relationships emerge directly from the model without explicit encoding, demonstrating ALADYNOULLI’s ability to learn clinically meaningful disease trajectories.

Personalized trajectories reveal heterogeneity within disease categories

Beyond population-level signatures, ALADYNOULLI provides individual-specific trajectory information through the time-varying λikt parameters that reveal distinct disease progression patterns.

While each patient in Figures 3A-C demonstrates similar average signature loadings in aggregate (horizontal ‘static model summary’ profile), their disease journeys reveal biological differences among patients sharing a diagnosis of myocardial infarction (MI), reflecting true heterogeneity (i.e., the presence of distinct subgroups with different underlying disease signature distributions) within the diagnostic category. Patient C (Panel C) experiences MI at age 54 following a complex trajectory of gastrointestinal and musculoskeletal conditions, with cardiovascular signature activation beginning subtly around age 50 and accelerating dramatically in the years preceding the event. In contrast, Patient B (Panel B) develops MI at age 72 after a markedly different prodrome dominated by respiratory and dermatologic conditions, showing a more gradual cardiovascular signature evolution. Post-MI trajectories in these two patients also diverge substantially: Patient C subsequently develops multiple cardiovascular complications and metabolic disorders, while Patient B’s post-MI course is characterized by different comorbidity patterns including genitourinary and infectious disease manifestations. These distinct temporal signatures preceding and following identical clinical endpoints illustrate how ALADYNOULLI captures the biological heterogeneity masked by traditional diagnostic categories, revealing that “myocardial infarction” encompasses diverse pathophysiological pathways that may require different prevention and treatment strategies. Multiple additional examples (Figure S10) demonstrate the diversity in temporal loadings that would be missed by a summative approach considering only average loading.

Figure 3: Individual-level trajectories and dynamic risk profiles.

(A–C) Patient-specific normalized signature loadings θ over time for three representative individuals. The lower panels show the disease timeline and key diagnoses for each patient. (D) Comparison of early-onset (<55 years) and late-onset (>70 years) MI: Average signature loadings and their temporal velocities reveal distinct dynamic patterns and rates of change associated with age of onset. (E) Decomposition of myocardial infarction (MI) risk for a representative patient: Top, time-varying signature loadings; middle, heatmap of log disease probabilities by signature and age; bottom, stacked area plot showing the aggregate risk over time. (F) Signature heterogeneity within disease subtypes: Stacked area plots show deviations in signature proportions from the population average for selected diseases (malignant neoplasm of female breast, major depressive disorder, and myocardial infarction), highlighting the diversity of underlying biological processes among patients with the same clinical diagnosis.

Our model also illustrates how individual-level trajectories and population phenomena combine to elicit time-varying personalized disease probabilities. Figure 3E is a heatmap of log disease probabilities by signature and age for MI, showing how overall MI risk is decomposed into the contributions of various time-varying signature loadings. This visualization reveals the complex interplay between multiple signatures in determining disease risk. While the cardiovascular signature contributes most significantly to MI risk, other signatures—particularly those related to metabolic conditions and inflammation—also play important roles. The stacked area plot below demonstrates how these contributions integrate to form the aggregate risk profile, revealing periods of accelerated risk accumulation that may represent critical windows for preventive intervention.

Aggregating these individual patterns reveals distinct group-level differences. In a retrospective analysis, the comparison of early-onset (≤ 55 years, mean age of event 49.7 years) and late-onset (≥ 70 years, mean age of onset 74.9 years) MI in Figure 3D shows that early-onset patients exhibit a higher and earlier peak in cardiovascular signature contribution, as well as a more rapid increase in signature loading prior to the event, compared to late-onset cases. These quantitative differences in trajectory characteristics suggest that early- and late-onset MI, while sharing the same clinical diagnosis, may represent distinct disease entities requiring different preventive strategies.

This pattern of heterogeneity within diagnostic categories extends broadly across diseases. Figure 3F captures signature heterogeneity within disease subtypes through stacked area plots showing deviations from the population average, highlighting the diversity of underlying biological processes among patients sharing the same clinical diagnosis.

To systematically quantify differences in signature composition among patients with the same clinical diagnosis for three representative diseases (myocardial infarction, breast cancer, and major depressive disorder), we applied k-means clustering to patients’ time-averaged signature loadings for each disease (Figure 3F). We then calculated cluster-specific Cohen’s effect sizes (19) $C^{\mathrm{SIG}}_{ck}$ as follows (Figure S11; Extended Data S1-S3). For cluster c and signature k, $C^{\mathrm{SIG}}_{ck}$ is the standardized difference in mean time-averaged signature loadings between individuals in cluster c and those in all other clusters (see Figure S11). This measures how distinct each cluster is with regard to each disease signature.
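The cluster-specific effect size can be sketched as a standardized mean difference between one cluster and all other clusters. The version below standardizes by the pooled standard deviation, the usual convention for Cohen's d. That choice is an assumption here, since the exact definition is given in the Methods, and the loadings are hypothetical.

```python
import math
import statistics

def cohens_d(in_cluster, out_cluster):
    """Standardized difference in means, using the pooled standard deviation."""
    n_in, n_out = len(in_cluster), len(out_cluster)
    mean_in = statistics.fmean(in_cluster)
    mean_out = statistics.fmean(out_cluster)
    var_in = statistics.variance(in_cluster)    # sample variance (n - 1)
    var_out = statistics.variance(out_cluster)
    pooled_sd = math.sqrt(
        ((n_in - 1) * var_in + (n_out - 1) * var_out) / (n_in + n_out - 2)
    )
    return (mean_in - mean_out) / pooled_sd

# Hypothetical time-averaged loadings on one signature k.
loadings_in_cluster_c = [0.5, 0.6, 0.7]
loadings_all_other_clusters = [0.1, 0.2, 0.3]

effect = cohens_d(loadings_in_cluster_c, loadings_all_other_clusters)
print(effect)
```

A positive value indicates that the signature is enriched in cluster c relative to the remaining patients with the same diagnosis, and a negative value indicates depletion.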

This analysis revealed that the vast majority of signature differences between clusters were not only large in magnitude (with many $C^{\mathrm{SIG}}_{ck}$ values exceeding 0.8, and some as high as 2.5–3.9), but also highly statistically significant (p ≤ 1 × 10⁻⁸ for nearly all clusters). The largest effect size occurs in major depressive disorder, where the acute illness signature 16 (septicemia, acute renal failure, and critical care conditions) in cluster 2 showed $C^{\mathrm{SIG}}_{2,16} = 3.87$ (p ≈ 0), revealing a medically complex depression subgroup with severe acute comorbidities. In myocardial infarction, the cardiovascular signature 5 (encompassing coronary atherosclerosis, ischemic heart disease, and hypercholesterolemia) shows $C^{\mathrm{SIG}}_{3,5} = 2.86$ (p ≈ 0) in only one cluster, indicating that even within cardiovascular diseases, the cardiovascular signature itself reveals substantial heterogeneity between patient subgroups. In breast cancer, the cardiovascular signature also showed strong differentiation ($C^{\mathrm{SIG}}_{3,5} = 2.46$, p ≈ 0). Similarly, the pain/inflammatory/metabolic signature (Signature 7, characterized by asthma, migraine, osteoporosis, depression, and obesity) achieved near-complete patient separation in all conditions examined, with $C^{\mathrm{SIG}}$ values ranging from 1.84 to 2.51. Effect sizes of this magnitude indicate near-complete separation between patient subgroups, suggesting distinct underlying disease processes within the same diagnostic category. These results demonstrate that the observed heterogeneity is both quantitatively substantial and statistically robust, supporting the biological relevance of the patient subgroups we identified (see Extended Data S1-S3).

The model’s ability to identify such distinct temporal trajectories and biological subtypes even among patients with similar diagnoses illustrates ALADYNOULLI’s potential for personalized risk assessment and intervention timing (Figure S10). By capturing how an individual’s signature associations evolve with each new diagnosis, ALADYNOULLI provides a dynamic framework for monitoring disease progression, predicting future complications, and identifying optimal windows for preventive measures. Unlike traditional risk scores that provide a single probability estimate, ALADYNOULLI offers a comprehensive view of an individual’s evolving disease landscape—revealing not just what conditions might develop but when and in what sequence—critical information for precision medicine.

Genetic factors influence signature trajectories

A key innovation of ALADYNOULLI is its integration of genetic information directly into the model, allowing us to quantify how genetic factors influence disease signature associations. We examined both the direct genetic effects on signature loadings through the Γk parameters and the association between polygenic risk scores (PRS) and signature trajectories. Importantly, to avoid double-dipping, we used external PRS that were developed independently of our signature analysis, ensuring that genetic information was not used both in training the model and in evaluating PRS-signature associations.

Genetic analysis revealed substantial genetic influence on signature associations through the Γk parameters. Using batch-aggregated effect estimates across model replicates and a Bonferroni correction for 36 PRS per signature (p ≤ 6.6 × 10⁻⁵), we identified 75 significant PRS-signature associations out of 756 tests (9.9%) (Fig 4A; see Extended Data S0). The strongest genetic effects were observed for signatures with known heritable components: coronary artery disease PRS on the cardiovascular signature (Signature 5, γ=0.24), LDL cholesterol PRS on Signature 5 (γ=0.11), and type 2 diabetes PRS on the metabolic signature (Signature 15, γ=0.22). Coronary, metabolic, and psychiatric signatures showed the strongest overall genetic influences (Figure 4A), consistent with the high heritability of these disease categories. Several PRS, including BMI, T2D, and HT, showed pleiotropic effects across multiple signatures (20), highlighting shared genetic architecture across disease processes. Importantly, the heterogeneous patient groups identified in our trajectory analysis (Figures 3D and 3F) show corresponding heterogeneity in underlying polygenic risk scores (Figure 4B), demonstrating that genetic variation contributes to the diverse disease progression patterns we observe.
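The reported threshold and proportion can be checked with a few lines of arithmetic. The family-wise error rate of 0.05 is an assumption (it is not stated in the text); under that assumption the threshold matches a correction over all 756 tests.

```python
# Numbers reported in the text.
n_prs_per_signature = 36
n_tests = 756
n_significant = 75

# Assumed family-wise error rate (not stated in the text).
alpha = 0.05

n_signatures_tested = n_tests // n_prs_per_signature  # 756 / 36
bonferroni_threshold = alpha / n_tests                # about 6.6e-5
fraction_significant = n_significant / n_tests        # about 9.9%

print(n_signatures_tested)
print(f"{bonferroni_threshold:.2e}")
print(f"{100 * fraction_significant:.1f}%")
```

Dividing 0.05 by 36 alone would give a threshold near 1.4 × 10⁻³, so the reported value is consistent with correcting across every PRS-signature pair and not within each signature separately.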

Figure 4: Genetic architecture and polygenic risk stratification of ALADYNOULLI disease signatures.

(A) Top polygenic risk score (PRS) associations for each disease signature, showing effect sizes for the most significant PRS-signature pairs across disease categories. (B) Heatmaps of mean PRS values by cluster for three representative diseases: major depressive disorder, breast cancer, and myocardial infarction, demonstrating the stratification of polygenic risk across model-inferred patient clusters. (C) UpSet plot showing the overlap of genome-wide significant loci between disease signatures and individual traits, with analyses performed without PRS prior, highlighting shared genetic mechanisms across diseases. We consider SNPs as shared if they are within 1 Mb of a lead locus in each component trait. (D) Heatmap of positive genetic correlations rg between disease signatures and complex traits, computed using LD score regression without PRS prior, revealing shared genetic architecture and pleiotropy.

We quantified the variability of PRS scores across patient clusters by computing Cohen’s effect sizes $C^{\mathrm{PRS}}_{cp}$ for cluster c and PRS p, analogously to what was done earlier for signatures (see also Methods). This analysis revealed substantial differences in polygenic risk scores between patient subgroups that parallel the biological variation observed in signature loadings (Figure 4B; Extended Data S4-S6). For major depressive disorder, signature loadings showed dramatic cluster-specific effects, with Signature 16 (likely psychiatric) showing extreme enrichment in Cluster 2 ($C^{\mathrm{SIG}}_{2,16} = 3.87$) and Signature 7 (likely inflammatory) showing strong depletion in Cluster 1 ($C^{\mathrm{SIG}}_{1,7} = 1.37$) but enrichment in Cluster 3 ($C^{\mathrm{SIG}}_{3,7} = 2.47$). This signature variability was mirrored by corresponding PRS patterns: Cluster 3 showed strong enrichment for cardiovascular risk factors ($C^{\mathrm{PRS}}_{3,\mathrm{BMI}} = 0.40$, $C^{\mathrm{PRS}}_{3,\mathrm{CVD}} = 0.43$, $C^{\mathrm{PRS}}_{3,\mathrm{CAD}} = 0.38$, $C^{\mathrm{PRS}}_{3,\mathrm{HT}} = 0.58$), while Cluster 2 showed depletion in these same traits (see Extended Data S4-S6).

To systematically identify genetic loci associated with signature trajectories, we performed genome-wide association studies (GWAS) on lifetime signature exposure for each signature, computed as the area under each individual’s signature loading curve over their entire follow-up period (S12; Methods), after refitting our model excluding the genetic mean (giΓk in Equation 3) from the prior on λ. This approach investigates whether signature trajectories themselves have heritable components beyond the genetic effects we explicitly modeled. LD score regression analysis revealed significant SNP-based heritability for multiple signatures (Table S9), with the strongest signal observed for the cardiovascular signature (h² = 0.041, SE = 0.003), followed by musculoskeletal (h² = 0.035, SE = 0.002) and pain/inflammation signatures (h² = 0.027, SE = 0.002). All analyses showed appropriate genomic control (λgc = 1.02–1.22) and negligible population stratification (intercept ≈ 1.0), confirming that our signatures represent biologically meaningful patterns with distinct genetic architectures.
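The lifetime signature exposure used as the GWAS phenotype is described as the area under an individual's signature loading curve. A minimal sketch using the trapezoidal rule is shown below. The integration rule, the age grid, and the loading values are assumptions made for illustration, and the authors' exact computation is described in the Methods.

```python
def lifetime_exposure(ages, loadings):
    """Area under a signature loading curve, by the trapezoidal rule."""
    if len(ages) != len(loadings):
        raise ValueError("ages and loadings must have the same length")
    area = 0.0
    for idx in range(1, len(ages)):
        width = ages[idx] - ages[idx - 1]
        mean_height = 0.5 * (loadings[idx] + loadings[idx - 1])
        area += width * mean_height
    return area

# Hypothetical loadings for one individual on one signature.
ages = [40, 50, 60]
theta_k = [0.1, 0.3, 0.5]

exposure = lifetime_exposure(ages, theta_k)
print(exposure)
```

The resulting continuous value, one per individual and signature, is the kind of quantitative trait that a standard GWAS can then regress on genotype.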

This genetic validation analysis identified 150 genome-wide significant loci across 15 of 21 signatures, with the cardiovascular signature alone accounting for 56 loci (37% of all discoveries) (Extended Data S7-S27 for lead variants across signatures). This signature-based approach substantially outperformed traditional single-disease GWAS in detecting disease-associated variants: our cardiovascular signature analysis identified 23 unique loci compared to external GWAS assessing associations with myocardial infarction (29 loci), hypercholesterolemia (42 loci), and angina (26 loci) (Figure 4C). This enhanced discovery stems from three key factors: aggregation of signals across related conditions increases effective sample size; the continuous nature of signature loadings provides greater statistical power than binary disease endpoints; and signatures capture shared biological processes that may have stronger genetic determinants than individual disease manifestations. When associating significant loci in each signature with component trait genotype dosage, we found similar improvements across signatures (Figure S13). This substantial genetic signal independent of our explicitly modeled genetic effects provides strong evidence that our disease signatures capture genuine biological processes with distinct genetic architectures rather than statistical artifacts (Extended Data Files 7–26).

For regional overlap analysis visualized in UpSet plots (Fig 4C), we defined variants as overlapping if they were located within 1 Mb windows of each other, reflecting the potential for different lead variants to tag the same causal locus through linkage disequilibrium. This approach substantially increased the overlap between our cardiovascular signature and individual disease GWAS compared to exact SNP matching, revealing shared genetic architecture that might be missed by traditional single-disease analyses. In contrast, for direct genotype-phenotype association testing, we used the exact signature lead SNPs to test for association with component trait phenotypes, providing a threshold-independent assessment of biological effects (S13).
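The regional overlap rule can be sketched as a same-chromosome distance check. The sketch assumes that "within 1 Mb windows of each other" means an absolute distance of at most 1,000,000 base pairs between lead variants on the same chromosome. The function name and the coordinates are hypothetical.

```python
def shared_loci(signature_leads, trait_leads, window_bp=1_000_000):
    """Signature lead variants with a trait lead variant within window_bp.

    Each variant is a (chromosome, base_pair_position) tuple. Setting
    window_bp to 0 reduces the rule to exact position matching.
    """
    shared = []
    for chrom, pos in signature_leads:
        if any(chrom == t_chrom and abs(pos - t_pos) <= window_bp
               for t_chrom, t_pos in trait_leads):
            shared.append((chrom, pos))
    return shared

# Hypothetical lead variants (chromosome, position in base pairs).
signature_leads = [("1", 55_000_000), ("9", 22_100_000), ("19", 11_200_000)]
trait_leads = [("9", 22_600_000), ("19", 45_400_000), ("2", 55_000_000)]

regional = shared_loci(signature_leads, trait_leads)
exact = shared_loci(signature_leads, trait_leads, window_bp=0)
print(regional, exact)
```

In this toy example the chromosome 9 variants are 500 kb apart, so they count as one shared locus under the regional rule but not under exact matching, which mirrors the increase in overlap described above.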

Linkage disequilibrium score regression (21) analysis across a broad set of representative traits confirmed expected trait enrichment and depletion in non-signature associated traits (Figure 4D; Table S9). These findings demonstrate that ALADYNOULLI’s unified modeling approach not only improves disease prediction but also enhances genetic discovery by leveraging shared biological pathways across related conditions, potentially informing more targeted prevention strategies based on an individual’s genetic risk profile and signature associations.

Dynamic risk assessment improves disease prediction

A primary motivation for modeling longitudinal disease patterns is to improve prediction of future disease events. To rigorously evaluate ALADYNOULLI’s predictive performance, we implemented comprehensive, leakage-free validation strategies that mimic real-world clinical follow-up (Table S4). Our primary approach uses landmarking methodology (22), where we evaluate prediction performance at 30 distinct time points (landmarks) during follow-up, spanning ages 40 to 70 years. At each landmark, we use a model trained specifically for that time point, ensuring predictions are based only on information available up to that time. This approach reflects how the model would be used in clinical practice and provides a systematic temporal evaluation of model performance, capturing how predictive accuracy evolves as patients accumulate new diagnoses over time. The dynamic nature of this evaluation reflects the real-world scenario where clinicians must make predictions at various points in a patient’s journey, and the ability of ALADYNOULLI to update with new information. We also evaluate predictions made once per patient at recruitment (i.e., 2006–2010 in UKB, Fig S2) against 1-year and 10-year outcomes (ALADYNOULLI recruitment 1 year, 10 year) for comparison with traditional clinical risk scores. All analyses were performed strictly prospectively, ensuring that only data available up to prediction time was used for each individual predicted. Individuals with prevalent disease at prediction time were excluded (see Methods). Finally, we compared against traditional Cox models (23) that use age as the time scale, also on 10-year outcomes, with and without ALADYNOULLI predictions as a covariate.
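As a rough illustration of the landmark evaluation, the sketch below computes one AUC per landmark after excluding prevalent cases and then summarizes across landmarks with the median. The data layout (a list of dictionaries with `risk`, `event_next_year` and `prevalent` fields) is an assumption made for this example, and the pairwise AUC is a generic estimator rather than the authors' evaluation code.

```python
from statistics import median


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        return None  # AUC is undefined without both classes
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def median_landmark_auc(landmarks):
    """Median of per-landmark AUCs, restricted to individuals at risk.

    Each landmark is a dict (hypothetical layout) with parallel lists:
      'risk'            predicted risk at the landmark,
      'event_next_year' 1 if the disease occurs within the next year, else 0,
      'prevalent'       True if the disease was already present at the landmark.
    """
    aucs = []
    for lm in landmarks:
        keep = [i for i, prev in enumerate(lm["prevalent"]) if not prev]
        value = auc([lm["risk"][i] for i in keep],
                    [lm["event_next_year"][i] for i in keep])
        if value is not None:
            aucs.append(value)
    return median(aucs) if aucs else None
```

The pairwise estimator is quadratic in sample size, so a rank-based implementation would be preferable for cohort-scale data.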

As shown in Figure 5A and Table S5, ALADYNOULLI demonstrates three key advantages over traditional approaches. First, it achieves substantial improvements in predictive accuracy across a broad range of diseases (AUC increase up to 0.20) and prediction periods. The dynamic risk predictions, which update in this analysis at 30 distinct time points during follow-up using only information available up to that time point, yield substantially higher AUCs than standard Cox models without ALADYNOULLI (e.g., ASCVD: 0.901 vs 0.634; Heart Failure: 0.838 vs 0.592; Diabetes: 0.814 vs 0.600; Table S6, S16, S18). Second, this systematic evaluation across multiple prediction timepoints demonstrates the model’s robust performance in real-world scenarios where predictions must be made at various points in a patient’s clinical journey. Finally, the prediction of the featured diseases (and beyond) comes simultaneously from the ALADYNOULLI model, a key strength of which is its ability to provide robust, simultaneous predictions across multiple disease categories without disease-specific optimization. Unlike traditional approaches that require separate models for each condition, our unified framework leverages shared information across related diseases, which is especially valuable for conditions where limited training data can be supplemented by biological connections to more common diseases. For example, secondary cancers (annual incidence ≈ 0.03%) showed substantial improvement in prediction accuracy (AUC: 0.712 vs 0.508 for traditional models), likely due to shared biological pathways with more common primary malignancies captured by our unified signature approach.

Figure 5: Multi-Disease Risk Prediction Performance and Model Interpretation.

(A) Discrimination performance across the top 16 diseases, measured by the area under the ROC curve (AUC) in a prospective, leakage-free framework. Each dot represents a different modeling approach. The primary approach, Median Aladynoulli 1-year (highlighted), reflects clinical practice: 1-year AUCs are computed for each year of follow-up using only data available up to that year, and the median AUC across years is reported. This represents how the model would be used in real-world clinical settings, making 1-year predictions at each patient visit. Aladynoulli Recruitment (1-year) uses predictions made at recruitment to evaluate 1-year outcomes, while Aladynoulli Recruitment (10-year) uses predictions made at recruitment to evaluate 10-year outcomes for comparison with clinical risk scores. PREVENT and PCE models are evaluated for their ability to predict 10-year outcomes using only recruitment data available at the time of study center visit. Cox models are fit using age as the time scale and include either Aladynoulli predictions, family history, and sex, or only family history and sex as covariates. All analyses exclude individuals with prevalent disease at time of prediction and use only information available up to the time of prediction, ensuring a fully prospective evaluation. (B) Calibration plot across all follow-up periods for all at-risk individuals, showing observed versus predicted event rates on a log-log scale. Each point represents a bin of predicted risk, annotated with sample size; summary statistics (MSE, mean predicted, mean observed, total N) are provided. (C) Model 10-year risk predictions versus incidence-based risk for ASCVD, stratified by age and percentiles. Solid lines show model-predicted mean and percentiles; the dashed line shows prevalence-based risk. R2 indicates the correlation between predicted and observed risk. 
(D) ROC curves for each year of the 10-year ASCVD prediction horizon, comparing the Aladynoulli model (AUC = 0.90), the PREVENT model (AUC = 0.649), and the Pooled Cohort Equations (PCE, AUC = 0.664). (E) Softmax trajectory patterns for the latent patient loadings (λ): the upper panel shows individual patient trajectories for myocardial infarction (MI), censored prior to event; the lower panel shows mean trajectories for MI cases and controls, illustrating dynamic risk evolution over age.

We further evaluated ALADYNOULLI’s performance across age-specific prediction time points spanning ages 40 to 70 years, providing a comprehensive assessment of how predictive accuracy evolves across the adult lifespan. This analysis, which evaluated 30 distinct prediction time points using cumulative data inclusion, revealed substantial discrimination. Key diseases showed remarkable age-specific discrimination: ASCVD achieved a median AUC of 0.985 (0.969, 0.99) across 28 years of evaluation, while Breast Cancer demonstrated a median AUC of 0.981 (0.961, 0.991) across 23 years, and Diabetes reached a median AUC of 0.948 across 25 years (Extended Figure S14). The systematic evaluation across multiple age-specific cohorts demonstrates the model’s robust performance in real-world scenarios where predictions must be made at various points in a patient’s clinical journey, with performance generally improving as more cumulative data becomes available. This approach also revealed that the previous 10-year rolling window methodology significantly underestimated the model’s performance.

The model also demonstrates excellent calibration (Figure 5B), with predicted probabilities closely matching observed event rates across the risk spectrum. This is crucial for clinical decision making, which requires reliable and actionable risk estimates. To illustrate ALADYNOULLI’s ability to capture evolving risk, we examined signature activation trajectories preceding disease onset by censoring individual data 5 years prior to events (Figure 5E; Methods). Examples of patients diagnosed with myocardial infarction reveal increases in cardiovascular signature activation 2–3 years before clinical events. Notably, these patterns emerge even when the target disease is censored from input data, indicating that the model captures informative signals from related comorbidities.

We further evaluated ALADYNOULLI’s performance for ASCVD (atherosclerotic cardiovascular disease) risk prediction, first in the general population and then in specific high-risk subgroups. In the overall cohort, ALADYNOULLI outperformed both PREVENT (AUC: 0.649) and PCE (AUC: 0.664) (Figure 5D), with particularly strong performance in sex-based analyses (males: ALADYNOULLI 0.701 vs PREVENT 0.597; females: ALADYNOULLI 0.667 vs PREVENT 0.657, Figure S15). We also evaluated the GAIL model (24) for breast cancer because of the availability of family history data for comparison in the UKB. Of note, many disease-specific clinical scores require information not available from biobank-level interviews, though the detailed nature of the UK Biobank did provide these variables. ALADYNOULLI achieved a higher 10-year AUC than the GAIL model (0.649 vs 0.543) (Figure S17).

We then specifically evaluated ALADYNOULLI’s performance in patients with pre-existing rheumatoid arthritis (RA) and breast cancer (BC), comparing against both the Pooled Cohort Equations (PCE) and the PREVENT (25) model (Figures S15, S16, S18). This analysis investigates whether ALADYNOULLI maintains predictive accuracy in the presence of confounding comorbidities that can mask cardiovascular risk signals, a common challenge in clinical practice. For 10-year ASCVD outcomes, we used the static version of our leakage-free prediction approach to compute baseline risk for each individual, as the number of ASCVD events per year in these high-risk subgroups was too small to allow stable estimation of dynamic 1-year AUCs (see Methods). Under this strict evaluation, ALADYNOULLI outperformed existing models, achieving AUCs of 0.681 (RA) and 0.630 (BC) compared to 10-year risk for PREVENT (RA: 0.659, BC: 0.54).

Discussion

We presented ALADYNOULLI, a novel Bayesian framework for modeling dynamic disease signatures and individual health trajectories from longitudinal health records and germline genetic data. By integrating these two data modalities, ALADYNOULLI provides a unified framework for understanding disease comorbidities, predicting future disease events, and discovering genetic architecture underlying complex phenotypes. This work addresses a critical gap in precision medicine (26, 27), where the integration of diverse data sources remains challenging despite the promise of personalized approaches to disease management (28). Unlike traditional disease-specific predictive models that require separate development for each condition, ALADYNOULLI’s unified framework simultaneously captures risk for multiple diseases, enabling information-sharing across related conditions, improved prediction for diseases with sparse data, and comprehensive decision support across clinical disciplines.

Our model’s identification of consistent disease signatures across three independent cohorts supports their biological validity and clinical relevance. These signatures capture meaningful disease relationships that align with known pathophysiological processes while revealing novel connections between conditions that may share underlying mechanisms. The temporal dynamics of these signatures further enhance our understanding of how disease risk evolves throughout the life course, addressing the need for more sophisticated approaches to understanding disease progression beyond static risk assessment (29).

The integration of genetic information also represents a significant advance over existing approaches. By directly modeling genetic influences on signature associations, ALADYNOULLI provides biological interpretability while improving predictive performance. The identification of genetic variants that associate more strongly with signature loadings than with individual diseases suggests our approach may uncover novel mechanisms with pleiotropic effects across many established diseases but weaker effects on individual diagnoses—these may represent more biologically critical pathways and better targets for therapeutic interventions than traditional single-disease GWAS approaches.

Beyond risk prediction, ALADYNOULLI’s identification of disease signatures and individual trajectories has important implications for precision medicine and therapeutic development (26,27). By revealing distinct patient subgroups with shared biological mechanisms, the model can inform more targeted therapeutic strategies that align with the vision of personalized medicine (28). This approach addresses the critical need for better patient stratification in clinical practice, where traditional diagnostic categories often mask underlying biological heterogeneity (30).

First, signature profiles can help identify patients likely to respond to specific interventions. For example, individuals with strong metabolic signature contributions to their coronary disease may benefit more from intensive glucose management, while those with inflammatory signature patterns might respond better to anti-inflammatory approaches. This targeted approach represents a key advancement toward the promise of precision medicine (26), where treatments are tailored to individual biological profiles rather than applied uniformly across broad diagnostic categories.

Second, the model can detect changing risk profiles in real-time as patients accumulate new diagnoses, allowing for dynamic adjustment of preventive strategies. Figure 3AC demonstrates this capability, showing how patients’ risk trajectories are updated following new clinical information, potentially triggering changes in monitoring or intervention intensity. This dynamic approach aligns with the emerging paradigm of digital medicine (31), where continuous monitoring and real-time risk assessment enable more responsive and personalized care.

Finally, signature-based patient stratification has strong potential to enhance clinical trial efficiency by identifying more homogeneous patient populations and more appropriate controls. By enrolling patients with similar signature profiles, trials might achieve greater treatment effects and identify responder subgroups more effectively. This approach could mitigate the high failure rates in clinical trials by ensuring more biologically appropriate study populations (32), while also advancing our understanding of treatment response heterogeneity.

Several limitations should be acknowledged. First, our model relies on EHR data, which may contain biases related to healthcare access, diagnostic coding practices, and incomplete capture of disease history. These limitations are common to all EHR-based studies and highlight the importance of validating findings across multiple healthcare systems, as we have done here. Second, while we incorporate genetic factors, we do not explicitly model environmental exposures or lifestyle factors that significantly influence disease risk (29). Third, our use of established PRS may miss genetic effects that act directly on signatures but weakly on individual diagnoses, as our signature-based GWAS identified loci not captured by traditional single-disease approaches. Fourth, our model makes several important assumptions including linearity in genetic effects and additivity in signature contributions, which may not capture all complex interactions. Future work could integrate these additional data sources and relax these assumptions to further enhance predictive performance and biological insight, addressing the complex interplay between genetic and environmental factors that shape the risk of disease (33).

Despite these limitations, ALADYNOULLI represents a significant advance in longitudinal health modeling with important implications for precision medicine (26, 27). By capturing the complex interplay between genetic predisposition and time-varying disease patterns, our approach provides a framework for more personalized risk assessment and potential therapeutic targeting. Our model’s ability to identify meaningful patient subgroups within traditional disease categories, coupled with enhanced genetic discovery power, moves beyond simple risk prediction to provide deeper insights into disease biology and patient heterogeneity. These capabilities could inform more targeted clinical trials and intervention strategies, ultimately leading to more effective personalized prevention and treatment approaches.

As healthcare increasingly moves toward data-driven precision approaches (28, 31), a method like ALADYNOULLI that can integrate diverse data sources and model complex temporal relationships can become increasingly valuable for improving patient outcomes. The integration of longitudinal EHR data with genetic information represents a powerful approach to understanding disease biology and improving clinical decision-making, addressing key challenges in modern medicine including the need for more accurate risk prediction, better patient stratification, and enhanced therapeutic targeting (30). This work contributes to the broader vision of precision medicine where individual biological profiles guide clinical decision-making, moving beyond the limitations of traditional diagnostic categories toward more nuanced and personalized approaches to disease management.

Materials and Methods

Cohorts

Data are drawn from three distinct biobanks: Massachusetts General Brigham (MGB), UK Biobank (UKB), and All of Us (AoU). Each cohort is described in Table S2 and below (S3).

Massachusetts General Brigham Biobank (MGBB)

MGBB is an integrated research initiative based in Boston, Massachusetts (15). It collects biological samples and health data from consenting individuals at Massachusetts General Hospital, Brigham and Women’s Hospital, and local healthcare sites within the MGB network. Since July 1, 2010, the MGBB has enrolled more than 140,000 participants and extracted DNA from approximately 90,000 participants’ samples, and 53,306 participants were genotyped by Illumina Global Screening Array (Illumina, CA). All participants provided their informed written / electronic consent. EHR data are available on all participants from approximately 1990 (see S7). We used a subset of 48,069 for whom EHR and genetic data were available.

UK Biobank (UKB)

The UKB is a large-scale, population-based cohort that recruited over 500,000 participants aged 40–69 years between 2006 and 2010 from across the United Kingdom (4, 34). The cohort includes extensive phenotypic data, biological samples, and longitudinal follow-up of health outcomes. Genotyping was performed using the UK BiLEVE array or the UKB Axiom array, with subsequent imputation to the Haplotype Reference Consortium (HRC) and UK10K reference panels. Participants were genotyped to investigate genetic contributions to various health and disease traits, with particular attention to the relationship between genetic variants and cardiometabolic diseases. Electronic health records are available on all participants from approximately 1980, and some as early as 1980 (35), and thus allow access to clinical diagnostic data prior to the recruitment date. We used the subset of 427,239 for whom genomic and EHR data were available. Polygenic risk scores (PRS) were obtained from an external set of controls (36).

All of Us (AOU)

The AOU research program (37) is a large-scale cohort study designed to increase the representation of historically understudied populations in biomedical research. Since 2018, AOU has enrolled adults (age ≥ 18) at more than 730 US sites. Of the 800,000+ consented participants, more than 560,000 have completed core enrollment requirements, including health questionnaires and biospecimen collection. Data from these participants are continuously linked to electronic health records (EHR), which capture ICD-9 / ICD-10, SNOMED, and CPT codes. Genetic data include array-based genotyping from 315,000 participants and whole genome sequencing (WGS) from 245,394 participants who were then available to contribute polygenic risk scores for downstream analyses.

Preprocessing and Disease Encoding

Following the approach of Jiang et al. (7), we initially analyzed 348 PheCode diseases from UK Biobank that were selected based on prevalence thresholds (≥ 1,000 occurrences) to ensure sufficient statistical power for comorbidity analysis. Disease records were mapped from ICD-10/ICD-10CM codes to PheCodes using a standardized three-step procedure. To validate our findings across independent populations, we then applied the same disease selection strategy to All of Us (AOU) and Mass General Brigham (MGB) cohorts using their respective ICD coding systems. In AOU, we extracted ICD-9 and ICD-10 codes directly from the OMOP Common Data Model condition occurrence tables (37), successfully reproducing all 348 diseases from the UK Biobank selection. In MGB, we similarly used ICD-9 and ICD-10 codes, reproducing 346 of the 348 diseases for validation analyses. This multi-cohort approach enabled us to assess the generalizability of disease signatures across different healthcare systems and populations while maintaining consistency in the underlying disease definitions used for ALADYNOULLI model development and validation. We observed a 79.2% correspondence between matched signatures (2).

Model

We recapitulate the model’s formulation and elaborate on important modeling choices and implementation details. The results in this paper describe our application to the UKB dataset; however we also applied this to the MGBB and AOU datasets to establish consistency as in (2).

Mathematical Formulation

The ALADYNOULLI model represents the probability of disease occurrence for patient i, disease d, at time t as:

$$
\pi_{i,d,t} = \kappa \sum_{k=1}^{K} \theta_{i,k,t}\,\mathrm{sigmoid}\left(\phi_{k,d,t}\right),
$$

where $\kappa$ is a global calibration parameter, $\theta_{i,k,t}$ represents patient $i$’s time-varying association with signature $k$, and $\phi_{k,d,t}$ captures the relationship between signature $k$ and disease $d$ over time.

The patient-signature associations are parameterized as a softmax function of latent variables $\lambda_{i,k,t}$:

$$
\theta_{i,k,t} = \frac{\exp\left(\lambda_{i,k,t}\right)}{\sum_{k'=1}^{K} \exp\left(\lambda_{i,k',t}\right)}
$$
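The two expressions above combine into a single computation for one patient, disease and time point. The following is a minimal sketch in plain Python; the function names and the example values are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(lam):
    """Softmax over signatures; subtracting the maximum avoids overflow."""
    m = max(lam)
    exps = [math.exp(v - m) for v in lam]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def disease_probability(lam_ikt, phi_kdt, kappa=1.0):
    """pi_{i,d,t} = kappa * sum_k theta_{i,k,t} * sigmoid(phi_{k,d,t}).

    lam_ikt: latent loadings lambda_{i,k,t} for k = 1..K (one patient, one time)
    phi_kdt: signature-disease values phi_{k,d,t} for k = 1..K (one disease, one time)
    kappa:   global calibration parameter (1.0 here is an arbitrary default)
    """
    theta = softmax(lam_ikt)
    return kappa * sum(th * sigmoid(ph) for th, ph in zip(theta, phi_kdt))
```

Because the weights $\theta_{i,k,t}$ sum to one, the result before scaling by $\kappa$ always lies between the smallest and largest per-signature probability.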

These patient-specific latent variables in turn follow a Gaussian process prior:

$$
\lambda_{i,k} \sim \mathcal{GP}\left(r_k + \Gamma_k^{\top} g_i,\ \Omega_\lambda\right)
$$

where $r_k$ is a signature-specific baseline, $\Gamma_k$ captures how genetic/demographic factors $g_i$ influence patient-signature associations, and $\Omega_\lambda$ is a kernel function ensuring temporal smoothness. The covariate matrix $G$ contains 36 polygenic risk scores plus sex (37 features total), providing genetic and demographic information for each individual.

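The kernel function $\Omega_\lambda$ is defined as:

$$
\Omega_\lambda(t, t') = \alpha_\lambda^2 \exp\left(-\frac{(t - t')^2}{2 l_\lambda^2}\right).
$$

In our implementation, the amplitude parameter $\alpha_\lambda$ is set to 100 and the length-scale parameter $l_\lambda$ is set to $T/4$.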

Similarly, the disease-signature associations follow a Gaussian process:

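$$
\phi_{k,d} \sim \mathcal{GP}\left(\mu_d + \psi_{k,d},\ \Omega_\phi\right),
$$

where $\mu_d$ is a disease-specific baseline derived from the logit of the population prevalence, $\psi_{k,d}$ represents the overall strength of association between signature $k$ and disease $d$, and $\Omega_\phi$ is a kernel function defined as:

$$
\Omega_\phi(t, t') = \alpha_\phi^2 \exp\left(-\frac{(t - t')^2}{2 l_\phi^2}\right),
$$

where the amplitude is fixed to $\alpha_\phi = 100$ and the length-scale is set to $l_\phi = T/3$ in our implementation.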
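Both kernels share the same squared-exponential form and differ only in their length-scale. A minimal sketch of building the corresponding covariance matrix on a discrete time grid follows; it assumes that $T$ denotes the number of grid points and that time is indexed $0, \dots, T-1$, neither of which is stated explicitly in the text.

```python
import math


def se_kernel(times, amplitude, length_scale):
    """Squared-exponential covariance matrix:
    K[a][b] = amplitude^2 * exp(-(t_a - t_b)^2 / (2 * length_scale^2))."""
    return [
        [amplitude ** 2 * math.exp(-((t - s) ** 2) / (2.0 * length_scale ** 2))
         for s in times]
        for t in times
    ]


def lambda_and_phi_kernels(n_times, amplitude=100.0):
    """Kernels with the length-scales quoted in the text (T/4 and T/3),
    taking T = n_times (an assumption for this sketch)."""
    grid = list(range(n_times))
    omega_lambda = se_kernel(grid, amplitude, n_times / 4.0)
    omega_phi = se_kernel(grid, amplitude, n_times / 3.0)
    return omega_lambda, omega_phi
```

The longer length-scale for $\phi$ implies slower decay of correlation with time separation, so the disease-signature curves are smoother a priori than the individual loadings.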

Dynamic Range of Predictions

While our mixture-of-probabilities formulation has key advantages, as described in the main text, this approach introduces technical challenges that require careful parameterization. A key challenge with the mixture of probabilities formulation is that it naturally leads to a reduction in the variation of the predicted probabilities across individuals. When multiple sigmoid-transformed values are averaged, the resulting mixture tends to concentrate around moderate values, reducing the dynamic range of predictions. To address this:

  1. Signature-Disease Specificity $\psi_{k,d}$: We introduce the time-independent parameters $\psi_{k,d}$ above to allow each signature to have strong positive or negative associations with specific diseases. This increases the separation between signatures and ensures a realistic dynamic range of the disease probabilities within each signature.

  2. Global Calibration $\kappa$: The global calibration parameter $\kappa$ is necessary to rescale the final probabilities to obtain realistic overall disease prevalences. In our implementation, $\kappa$ is learned from the data but in principle could be fixed.

This balance between expressiveness (through ψ’s) and calibration (through κ) ensures that the model can capture both rare and common diseases accurately while maintaining interpretable signature contributions. The combination of softmax-transformed individual loadings (θ) and sigmoid-transformed disease probabilities (ϕ) with these additional parameters ensures proper probability scaling.
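The effect of the signature-disease offsets on dynamic range can be seen in a toy two-signature example. The sketch below compares two hypothetical patients whose loadings favor opposite signatures and reports the gap in their disease probabilities as the offset grows. The symmetric offsets $\pm\psi$, the loadings (0.8, 0.2) and (0.2, 0.8), and all names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixture(theta, phi, kappa=1.0):
    """kappa * sum_k theta_k * sigmoid(phi_k) for one disease and time point."""
    return kappa * sum(t * sigmoid(p) for t, p in zip(theta, phi))


def spread(psi, theta_a=(0.8, 0.2), theta_b=(0.2, 0.8), kappa=1.0):
    """Gap in disease probability between two patients when signature 1 has
    offset +psi and signature 2 has offset -psi for this disease."""
    phi = (psi, -psi)
    return mixture(theta_a, phi, kappa) - mixture(theta_b, phi, kappa)
```

With no offset the two patients receive the same probability, and the gap widens monotonically as the offset increases, which is the separation that the $\psi$ parameters are introduced to provide.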

Censored Data

A critical aspect of ALADYNOULLI is its careful handling of censored observations, which is part of what allows it to function as a generative model of disease progression rather than a retrospective analysis tool. The loss function incorporates time-to-event information through disease-specific censoring times $E_{i,d}$. For each individual $i$ and disease $d$, we observe the data only up to time $E_{i,d}$, which represents either:

  • The time of disease onset (event time), or

  • The time of last follow-up without disease (censoring time)

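The likelihood is constructed to respect this censoring structure (23, 38). Consider first $\mathcal{L}_{\mathrm{NLL}}$, the negative log of the likelihood, that is, the negative log of the probability of observed disease histories conditional on the disease probabilities $\pi$:

$$
\mathcal{L}_{\mathrm{NLL}} = -\sum_{i=1}^{N} \sum_{d=1}^{D} \Bigg[ \underbrace{\sum_{t=1}^{E_{i,d}-1} \log\left(1 - \pi_{i,d,t}\right)}_{\text{Pre-event survival}} + \underbrace{Y_{i,d,E_{i,d}} \log \pi_{i,d,E_{i,d}}}_{\text{Event occurrence}} + \underbrace{\left(1 - Y_{i,d,E_{i,d}}\right) \log\left(1 - \pi_{i,d,E_{i,d}}\right)}_{\text{Censored observation}} \Bigg]
$$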

This expression has three key components:

  1. Pre-event survival ($t < E_{i,d}$): For all time points before the event/censoring time, we know that the individual did not have the disease, contributing $\log\left(1 - \pi_{i,d,t}\right)$ to the log-likelihood at each time.

  2. Event occurrence ($t = E_{i,d}$, $Y_{i,d,E_{i,d}} = 1$): If the disease occurred at time $E_{i,d}$, we observe the event. The corresponding contribution is $\log \pi_{i,d,E_{i,d}}$.

  3. Censored observation ($t = E_{i,d}$, $Y_{i,d,E_{i,d}} = 0$): If the individual was censored without disease at time $E_{i,d}$, we only know they were disease-free at that time, contributing $\log\left(1 - \pi_{i,d,E_{i,d}}\right)$.
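The three components can be written out for a single individual-disease pair as follows. This is a minimal sketch: the function name, the 1-based time convention, and the list layout are assumptions for illustration, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def censored_nll(pi, event_time, had_event):
    """Negative log-likelihood for one individual-disease pair.

    pi:         probabilities pi_{i,d,t} for t = 1..T (pi[0] is t = 1)
    event_time: E_{i,d}, the 1-based time of onset or of last follow-up
    had_event:  Y_{i,d,E_{i,d}}; True for disease onset, False if censored
    """
    nll = 0.0
    # Pre-event survival: disease-free at every t < E
    for t in range(1, event_time):
        nll -= math.log(1.0 - pi[t - 1])
    p_at_e = pi[event_time - 1]
    if had_event:
        nll -= math.log(p_at_e)        # event occurrence at t = E
    else:
        nll -= math.log(1.0 - p_at_e)  # censored, still disease-free at t = E
    return nll
```

Time points after `event_time` never enter the sum, which is how the likelihood avoids using information from beyond the event or censoring time.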

Key distinction from retrospective models:

This censoring-aware likelihood ensures that the model learns to predict disease risk prospectively. Unlike retrospective clustering approaches that analyze complete disease histories, ALADYNOULLI models the probability of future disease onset given only the information available up to each time point. This makes it suitable for: (1) real-time risk prediction in clinical settings, (2) modeling chronic diseases that develop over extended periods, (3) handling varying follow-up times across individuals, and (4) avoiding information leakage from future events.

Objective Function for Maximum a Posteriori (MAP) Computation

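The computational derivation of the MAP estimate of the unknown parameters in the model proceeds by optimising the function $\mathcal{L}_{\mathrm{total}}$, which includes the negative log-likelihood $\mathcal{L}_{\mathrm{NLL}}$ as well as terms arising from the Gaussian process priors.

The negative logs of the Gaussian process prior terms, as previously specified and up to additive constants, are:

$$
\mathcal{L}_{\mathrm{GP}}^{\lambda} = \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{1}{2} \left(\lambda_{i,k} - \mu_{i,k}\right)^{\top} \Omega_\lambda^{-1} \left(\lambda_{i,k} - \mu_{i,k}\right)
$$

$$
\mathcal{L}_{\mathrm{GP}}^{\phi} = \sum_{k=1}^{K} \sum_{d=1}^{D} \frac{1}{2} \left(\phi_{k,d} - \mu_d - \psi_{k,d}\right)^{\top} \Omega_\phi^{-1} \left(\phi_{k,d} - \mu_d - \psi_{k,d}\right)
$$

where $\mu_{i,k} = r_k + \Gamma_k^{\top} g_i$ is the prior mean of $\lambda_{i,k}$.

Combining these terms, the negative log posterior (the “loss” for short) is:

$$
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{NLL}} + \mathcal{L}_{\mathrm{GP}}^{\lambda} + \mathcal{L}_{\mathrm{GP}}^{\phi}
$$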

This formulation enables ALADYNOULLI to learn disease progression patterns that respect the temporal structure of the data and provide clinically actionable predictions. Specifically, the negative log-likelihood leads the model to accurately predict the timing of disease onset. To encourage smoothness and temporal coherence in the latent trajectories, we use GP priors on both the individual-specific latent variables (λ) and, if applicable, the disease-time effects (ϕ). The implied regularization terms penalize deviations from the GP prior mean and covariance structure. In models with cluster structure, additional penalties may be included to encourage biologically meaningful clustering of diseases or signatures.
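Each Gaussian process term is a quadratic form in the deviation of a trajectory from its prior mean. A minimal sketch of evaluating one such term through a Cholesky factorization, and of assembling the total loss, is given below. The diagonal jitter is an assumption added for numerical stability and is not described in the text; all function names are illustrative.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gp_penalty(x, mean, cov, jitter=1e-6):
    """0.5 * (x - mean)^T cov^{-1} (x - mean), computed as 0.5 * ||L^{-1} r||^2."""
    n = len(x)
    a = [[cov[i][j] + (jitter if i == j else 0.0) for j in range(n)] for i in range(n)]
    low = cholesky(a)
    r = [xi - mi for xi, mi in zip(x, mean)]
    z = []
    for i in range(n):  # forward substitution: solve L z = r
        z.append((r[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return 0.5 * sum(v * v for v in z)


def total_loss(nll, lambda_penalties, phi_penalties):
    """L_total = L_NLL + sum of lambda GP terms + sum of phi GP terms."""
    return nll + sum(lambda_penalties) + sum(phi_penalties)
```

Using the Cholesky factor avoids forming the inverse kernel matrix explicitly; the same factor can be reused for every trajectory that shares a kernel.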

Training, Validation and Testing Architecture

Our analytical approach consists of two distinct stages: retrospective analysis for model training and cross-cohort validation, followed by prospective analysis for prediction evaluation, summarized in Figure S1.

Retrospective Analysis (Figures 2–4):

We performed disease signature characterization across all three cohorts to demonstrate reproducibility. For computational efficiency and uncertainty quantification, we divided each cohort into non-overlapping subsets: UK Biobank into $B = 39$ subsets of approximately 10,000 individuals each (reserving one subset of 10,000 individuals as a held-out test set), All of Us into 30 subsets, and Mass General Brigham into 4 subsets (reflecting the smaller dataset sizes). For each subset $b$ within each cohort, we jointly estimated both the disease-signature associations $\hat{\phi}^{(b)}$ and individual loadings $\hat{\lambda}^{(b)}$ using all available observed data up to age 80 or censoring time, whichever came first. This retrospective approach utilized the complete disease trajectory for each individual, enabling us to characterize the full spectrum of disease signatures and their temporal dynamics. Each subset uses the same pre-computed initial clusters and $\psi$ parameters, ensuring consistent signature interpretations across all subsets within each cohort.

For UK Biobank, the final disease-signature parameters used for population-level analysis were computed as the average across the 39 training subsets (excluding the held-out test set):

$$
\phi = \frac{1}{39} \sum_{b=1}^{39} \phi^{(b)}
$$

This subset-averaging approach ensures robust parameter estimates while maintaining computational tractability. The AOU and MGB cohorts serve as external validation datasets, demonstrating the reproducibility of disease signatures across different populations and healthcare systems, but no prediction tasks were performed on these external cohorts.
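The subset average is an element-wise mean over the per-subset estimates. A minimal sketch, assuming each estimate is stored as a nested list indexed by signature, disease and time (a layout chosen only for this example):

```python
def average_phi(phi_subsets):
    """Element-wise mean of per-subset estimates phi^(b).

    phi_subsets: list of B arrays, each a nested list indexed [k][d][t]
    with identical shapes (signatures x diseases x time points).
    """
    n_subsets = len(phi_subsets)
    first = phi_subsets[0]
    n_k, n_d, n_t = len(first), len(first[0]), len(first[0][0])
    return [
        [
            [sum(phi[k][d][t] for phi in phi_subsets) / n_subsets for t in range(n_t)]
            for d in range(n_d)
        ]
        for k in range(n_k)
    ]
```

Averaging is only meaningful because every subset starts from the same initial clusters, so signature $k$ refers to the same disease grouping in each subset.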

This subset-averaged $\phi$ was used to generate the disease signature visualizations in Figure 2, including the temporal patterns and cross-cohort consistency analyses. Individual trajectory analyses (Figure 3) utilized the subset-specific $\lambda^{(b)}$ estimates, and subsequent clustering of patients by their time-averaged signature loadings was performed within each disease category. The genetic analyses (Figure 4) were performed exclusively in UK Biobank, employing the area-under-the-curve of individual signature trajectories as quantitative phenotypes for genome-wide association studies.

Prospective Analysis (Figure 5):

For prediction evaluation, we implemented a strictly prospective framework using only UK Biobank data. We used the subset-averaged $\phi$ parameters from the 39 training subsets as fixed, population-level disease-signature associations, and then estimated individual loadings ($\hat{\lambda}$) on the held-out test set of 10,000 UK Biobank individuals, using only data available up to specific prediction time points.

Specifically, for each prediction time point, we censored individual disease histories at that time point and re-estimated only the individual loadings while keeping ϕ fixed. This approach simulates real-world clinical scenarios where population-level disease patterns are known from prior research, but individual risk trajectories must be estimated from available clinical history up to the prediction time. All prediction performance metrics reported in this paper are based exclusively on this UK Biobank held-out test set.
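The censoring step can be sketched for a single individual as below. The representation of a disease history as a mapping from disease name to age at first diagnosis, the disease labels, and the function names are assumptions made for illustration; re-estimating the loadings themselves is not shown.

```python
def censor_history(onset_age, landmark_age):
    """Keep only diagnoses observed at or before the landmark.

    onset_age: dict mapping disease -> age at first diagnosis, or None if
    never diagnosed (hypothetical layout). Later diagnoses become None so
    that nothing after the landmark can inform the loadings.
    """
    return {
        disease: (age if age is not None and age <= landmark_age else None)
        for disease, age in onset_age.items()
    }


def at_risk(onset_age, disease, landmark_age):
    """True if the individual is free of the disease at the landmark
    (prevalent cases are excluded from evaluation)."""
    age = onset_age.get(disease)
    return age is None or age > landmark_age


def outcome_within(onset_age, disease, landmark_age, horizon_years=1.0):
    """True if onset falls in (landmark, landmark + horizon]."""
    age = onset_age.get(disease)
    return age is not None and landmark_age < age <= landmark_age + horizon_years
```

The censored history is what the model sees when loadings are re-estimated, while the uncensored history is used only to score the prediction afterwards.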

This two-stage design ensures that: (1) our characterization of disease signatures leverages the full richness of longitudinal data to identify biologically meaningful patterns across multiple populations, while (2) our prediction evaluation maintains strict temporal boundaries to prevent data leakage and provides realistic estimates of clinical predictive performance using a completely independent test set.

Model Initialization

To ensure computational feasibility and parameter stability across large datasets, we implement a two-stage initialization approach. In the first stage, we perform the spectral clustering described below and ψ initialization once on the entire data set to establish stable disease clusters and signature-disease associations. These initial clusters and ψ values are then saved and reused in all subsequent subset analyses for b = 1, …, 40.

We initialize the model parameters using spectral clustering (scikit-learn) on disease co-occurrence patterns. We compute a disease co-occurrence matrix C, where $C_{d,d'}$ represents the frequency with which diseases d and d′ co-occur in the same patient. We apply spectral clustering to this matrix to identify K disease clusters.
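The co-occurrence matrix can be sketched as below. This assumes an (N, D) binary "ever diagnosed" matrix and interprets "frequency" as the fraction of patients carrying both diagnoses; both are assumptions for illustration, and `cooccurrence_matrix` is a hypothetical name.

```python
import numpy as np

def cooccurrence_matrix(ever_diagnosed):
    """ever_diagnosed: (N, D) binary array, 1 if patient i ever had disease d.
    Returns a (D, D) matrix whose (d, d') entry is the fraction of patients
    with both diseases (normalization by N is an assumption)."""
    Y = np.asarray(ever_diagnosed, dtype=float)
    return (Y.T @ Y) / Y.shape[0]

# Toy data: 4 patients, 3 diseases.
Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]])
C = cooccurrence_matrix(Y)
```

A matrix of this kind could then be passed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering(n_clusters=K, affinity="precomputed")`.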

We initialize the ψk,d parameters based on cluster membership: for diseases in cluster k, we set ψk,d = 2.0 + ϵ, where ϵ is small random noise, and for diseases not in cluster k, we set ψk,d = −2.0 + ϵ. For the time-varying parameters λi,k,t and ϕk,d,t, we initialize by drawing a single sample from the corresponding Gaussian process prior with reduced variance to preserve the structured mean initialization. Specifically, we initialize λi,k,t via a random draw from the Gaussian process prior with mean rk + Γk gi and covariance kernel scaled by amplitude α = 0.1, where Γk is initialized using regression on disease occurrences. We initialize ϕk,d,t via a random draw from the Gaussian process prior with mean μd + ψk,d and the same reduced amplitude, where μd is derived from the logit of disease prevalence. The reduced amplitude ensures that random deviations do not bias the parameters arbitrarily away from the informative mean structure while maintaining temporal smoothness. We generate the GP samples via Cholesky decomposition of the scaled kernel matrices.

This approach provides a structured and plausible initialization that reflects our prior smoothness assumptions, rather than a purely random or fixed initialization. For the cluster-specific parameters ψk,d, we use deterministic values based on cluster membership, with small random noise added for variability.
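The initialization can be sketched as follows, under several stated assumptions: the kernel is taken to be squared-exponential (this section does not state the kernel form), the in-cluster and out-of-cluster ψ values are passed as arguments, and the example uses the values 1 and −2 quoted under Hyperparameter Specification. All function names are hypothetical.

```python
import numpy as np

def init_psi(cluster_of_disease, K, in_value, out_value, noise_sd, rng):
    """psi[k, d] = in_value + noise if disease d is in cluster k, else out_value + noise."""
    cluster_of_disease = np.asarray(cluster_of_disease)
    member = cluster_of_disease[None, :] == np.arange(K)[:, None]  # (K, D) bool
    psi = np.where(member, in_value, out_value).astype(float)
    return psi + noise_sd * rng.normal(size=psi.shape)

def se_kernel(T, length_scale, amplitude):
    """Squared-exponential kernel on time points 0..T-1 (kernel form is an assumption)."""
    t = np.arange(T, dtype=float)
    sq_dist = (t[:, None] - t[None, :]) ** 2
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_draw(mean, kernel, rng, jitter=1e-8):
    """One draw from N(mean, kernel) via Cholesky decomposition."""
    L = np.linalg.cholesky(kernel + jitter * np.eye(kernel.shape[0]))
    return mean + L @ rng.normal(size=kernel.shape[0])

rng = np.random.default_rng(0)
# Three diseases assigned to clusters 0, 0, 1 (toy assignment).
psi = init_psi([0, 0, 1], K=2, in_value=1.0, out_value=-2.0, noise_sd=0.01, rng=rng)
kernel = se_kernel(T=10, length_scale=3.0, amplitude=0.1)  # reduced amplitude 0.1
mu_d = np.log(0.01 / 0.99)  # logit of an illustrative 1% prevalence
phi_init = gp_draw(np.full(10, mu_d + psi[0, 0]), kernel, rng)
```

With amplitude 0.1, the draw stays close to the structured mean μd + ψk,d, which is the stated purpose of the reduced variance.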

Hyperparameter Specification

Model selection and hyperparameter specification were performed as follows. The number of latent signatures K was chosen to provide a parsimonious balance between model complexity and interpretability, based on prior experience and exploratory analysis. The hyperparameter ψ was initialized at values of 1 and −2, which on the log scale correspond to a 20-fold difference in odds (i.e., 10−9 vs 10−11), thereby spanning a broad range of plausible values for disease risk. All other model parameters were estimated using only the training data, and the final performance of the model was prospectively evaluated on an independent test set. This approach ensures that all reported performance metrics reflect true out-of-sample predictive accuracy, without overfitting or data leakage from hyperparameter tuning.

Optimization Details

We trained the model using gradient-based optimization of the loss Ltotal. The solution can be interpreted as approximating a Maximum a Posteriori (MAP) estimate of the parameters, assuming dispersed priors on the λ’s, Γ’s, μ’s, ψ’s and ϕ’s. We minimize Ltotal over a fixed maximum number of epochs (i.e., complete passes through the entire training dataset). At each epoch, we compute gradients via backpropagation and update parameters using the Adam optimizer in PyTorch, a deep learning framework that allows efficient computation of gradients through automatic differentiation. The model was trained using a learning rate of 0.001. Learning rates and regularization strengths are treated as hyperparameters. The model is optimized at a time scale of one year and is thus trained to provide the most accurate 1-year risk predictions. Longer-term risk (e.g., 10-year) can also be derived by simple manipulations of the estimated π’s.

For computational efficiency, we used the Cholesky decomposition to compute the Gaussian process contributions and to sample from the Gaussian process prior during initialization. We also used a jitter term of $10^{-8}$ to ensure numerical stability when computing the inverse of the kernel matrices. We trained the model for up to 1000 epochs, with early stopping based on validation loss to prevent overfitting. In practice, for our subsets of 10,000 individuals, the model converged after 200 epochs.
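One way a Gaussian process term could be evaluated through a Cholesky factor with jitter is sketched below. It assumes the contribution takes the form of a multivariate normal negative log-density; the exact form of the GP term in Ltotal is not given in this section, and `gp_neg_log_density` is a hypothetical name.

```python
import numpy as np

def gp_neg_log_density(x, mean, kernel, jitter=1e-8):
    """Negative log-density of x under N(mean, kernel + jitter * I)."""
    r = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    n = r.shape[0]
    L = np.linalg.cholesky(kernel + jitter * np.eye(n))
    # Solve L z = r so that z.z equals r^T K^{-1} r. A general solver is used
    # for brevity; a triangular solver would exploit the structure of L.
    z = np.linalg.solve(L, r)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (z @ z + log_det + n * np.log(2.0 * np.pi))

value = gp_neg_log_density([1.0, 2.0, 3.0], np.zeros(3), np.eye(3))
```

The factor avoids forming the explicit inverse, and the jitter keeps the factorization stable when the kernel matrix is nearly singular.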

Computation of Probabilities of Future Events

To ensure that our model provides prospective predictions without data leakage, we implement a strict censoring strategy that distinguishes between cohort recruitment and prediction time. This approach allows us to simulate real-world clinical scenarios where predictions are made based only on information available at a specific point in time.

Cohort enrollment time refers to the time point when an individual enters ALADYNOULLI, which for our purposes is age 30. For example, in our UKB analysis, all individuals were followed in the EHR from 1980 forward (39, 40) and were thus assigned an enrollment time in our study at age 30 (young adulthood) or at the start of EHR follow-up, whichever comes later.

Cohort recruitment time refers to the time when individuals joined the biobank. The UKB recruited individuals aged between 40 and 69 years in the time frame from 2006 to 2010, ensuring a comprehensive cohort for analysis and research purposes.

Prediction time refers to the time when we imagine making a clinical prediction, with knowledge of the health history up until that time. In practice this coincides with recruitment time as above to compare with clinical risk scores.

For example, in the UKB, an individual is observed in the EHR from adulthood and contributed data to our analysis from age 30 until the end of follow-up. However, for prediction analyses, we imagine making predictions at different time points after the cohort recruitment time (see S2). This is also consistent with the time at which an individual presented to the UKB, AOU or MGB recruitment center and contributed the samples necessary to calculate the common clinical risk scores.

For each individual i and disease d, we encode the time-to-event data for each disease using the standard convention in survival analysis (41) defining the event / censoring time t˜i,d and the event indicator δ˜i,d at prediction time as follows:

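$$\tilde{t}_{i,d} = \min\left(t_{i,d},\, t_i^{\mathrm{pred}}\right)$$

$$\tilde{\delta}_{i,d} = \begin{cases} \delta_{i,d} & \text{if } t_{i,d} \le t_i^{\mathrm{pred}} \\ 0 & \text{if } t_{i,d} > t_i^{\mathrm{pred}} \end{cases}$$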

where ti,d is the observed event or censoring time for individual i and disease d (measured as years since age 30), and $t_i^{\mathrm{pred}}$ is the prediction time for individual i, computed as the individual’s recruitment age plus the prediction offset, converted to years since age 30: $t_i^{\mathrm{pred}} = \max(0,\ \text{recruitment age}_i + \text{offset} - 30)$. This censoring procedure ensures that for each prediction scenario, we use only the disease history that would have been available at that specific time point, thereby preventing data leakage from future events.

This ensures that the model is trained and evaluated only on data that would have been available at the time of prediction, thereby preventing any potential data leakage from future events.
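The censoring rule can be illustrated with a short sketch. The function names are hypothetical, and times are in years since age 30, as in the text.

```python
def prediction_time(recruitment_age, offset, start_age=30):
    """Prediction time in years since start_age, floored at zero."""
    return max(0, recruitment_age + offset - start_age)

def censor_at_prediction(t_event, delta, t_pred):
    """Return (censored time, censored event indicator) at prediction time.

    Events after t_pred are hidden: the time is truncated to t_pred and the
    indicator is set to 0."""
    t_tilde = min(t_event, t_pred)
    delta_tilde = delta if t_event <= t_pred else 0
    return t_tilde, delta_tilde

# Individual recruited at age 55, prediction made at recruitment (offset 0).
t_pred = prediction_time(55, 0)                 # 25 years since age 30
before = censor_at_prediction(20, 1, t_pred)    # event visible at prediction
after = censor_at_prediction(30, 1, t_pred)     # future event is censored
```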

Simulation Study

To validate the ability of the ALADYNOULLI model to recover latent disease clusters and temporal dynamics, we conducted a simulation study using ALADYNOULLI itself as the generative model. This approach allows us to test whether the model can accurately recover known ground truth parameters from synthetic data that follows the exact same probabilistic structure as our proposed model (Figure S19).

We generated synthetic data with N = 10,000 individuals, D = 20 diseases, T = 50 time points (ages 30–79), K = 5 latent disease signatures, and P = 5 genetic covariates. The data generation process follows ALADYNOULLI’s exact mathematical formulation. We first created K = 5 distinct disease clusters with 4 diseases per cluster, assigning strong positive associations within clusters (ψk,d = 1.0) and strong negative associations outside clusters (ψk,d = −3.0). Disease baseline trajectories μd were generated on the logit scale with diverse prevalence patterns ranging from rare (logit prevalence ≈ −14) to common (logit prevalence ≈ −8), incorporating realistic age-dependent onset patterns with varying peak ages and slopes.

For individual trajectories, we generated genetic covariates $G \in \mathbb{R}^{N \times P}$ and genetic effect matrices $\Gamma \in \mathbb{R}^{P \times K}$ (with columns $\Gamma_k$), then sampled individual signature loadings λi,k,t from Gaussian processes with means $g_i^\top \Gamma_k$ and temporal covariance with length scale T/4 = 12.5 years. Disease-signature associations ϕk,d,t were sampled from Gaussian processes with means μd + ψk,d and temporal covariance with length scale T/3 ≈ 16.7 years. Event probabilities were computed using the exact ALADYNOULLI formula, $\pi_{i,d,t} = \sum_{k=1}^{K} \mathrm{softmax}_k(\lambda_{i,k,t})\,\mathrm{sigmoid}(\phi_{k,d,t})$, where the softmax is taken across signatures, and disease events were sampled from these time-varying probabilities.
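The probability formula can be sketched in NumPy as below. The array layouts (N, K, T) for λ and (K, D, T) for ϕ are assumptions, and the final sampling step draws an independent Bernoulli outcome per time point, which is a simplification for illustration rather than the simulation's exact event mechanism.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def event_probabilities(lam, phi):
    """lam: (N, K, T) latent scores; phi: (K, D, T) logit-scale associations.
    Returns pi: (N, D, T), a signature-weighted mixture of disease probabilities."""
    theta = softmax(lam, axis=1)  # weights over signatures sum to 1
    return np.einsum("nkt,kdt->ndt", theta, sigmoid(phi))

rng = np.random.default_rng(0)
lam = rng.normal(size=(4, 5, 6))             # toy sizes: N=4, K=5, T=6
phi = rng.normal(loc=-3.0, size=(5, 20, 6))  # D=20
pi = event_probabilities(lam, phi)
events = rng.binomial(1, pi)                 # simplified independent draws
```

Because the softmax weights sum to one and each sigmoid lies in (0, 1), every π is a convex combination of valid probabilities.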

When we applied ALADYNOULLI to these synthetic data, the model successfully recovered the correct number of clusters (5/5), achieved high accuracy in disease cluster assignments (median Jaccard similarity 0.795), and accurately reconstructed the temporal trajectories and genetic effects, demonstrating the model’s ability to identify meaningful biological patterns rather than fitting noise.

Analysis

Stability Across Subsets and Cohorts

We empirically verified that ϕ estimates were highly stable across subsets, with small standard errors (e.g., in UKB, mean SE = 0.0010, median SE = 0.0002, with 95% of SE values ≤ 0.004), demonstrating the robustness of our disease-signature associations (Figure S5). This stability supports our subset-averaging approach and indicates that the identified disease signatures represent replicable biological patterns rather than subset-specific noise.

Furthermore, when ϕ parameters were independently re-estimated in the AOU and MGB cohorts, they demonstrated strong replicability with the UKB-derived estimates, with high correlation coefficients across disease signatures (median proportion shared between matched signatures 0.792). This cross-cohort replicability provides additional evidence that disease signatures reflect universal biological patterns rather than population-specific variation or healthcare system-specific artifacts.

To further assess the replicability of our disease signatures across different populations (shown in Figure 2C of the main paper), we performed cluster correspondence as follows (Fig 2D). We examined the correspondence between disease clusters identified in each biobank by creating normalized confusion matrices. For each pair of biobanks (UKB vs MGB and UKB vs AoU), we identified the set of diseases common to both biobanks, mapped each disease to its assigned cluster in each biobank, created a cross-tabulation matrix showing the proportion of diseases in each UKB cluster that were assigned to each MGB/AoU cluster, and normalized the counts by row to show the distribution of cluster assignments.

We computed a modified Jaccard similarity index to quantify cross-cohort correspondence. For each UKB cluster k, we identified its best-matching cluster in the comparison cohort (the cluster receiving the highest proportion of diseases from that UKB cluster). The modified Jaccard similarity for cluster k is defined as:

$$J_k = \frac{\left|D_{k,\mathrm{UKB}} \cap D_{k^*,\mathrm{other}}\right|}{\left|D_{k,\mathrm{UKB}}\right|}$$

where $D_{k,\mathrm{UKB}}$ is the set of diseases in UKB cluster k, $D_{k^*,\mathrm{other}}$ is the set of diseases in the best-matching cluster k* in the comparison cohort, and |·| denotes set cardinality. The overall cross-cohort similarity is the median of these cluster-specific similarities, $J = \mathrm{median}(J_1, J_2, \ldots, J_K)$, across all UKB clusters. This metric ranges from 0 (no correspondence) to 1 (perfect correspondence), where higher values indicate stronger replicability of disease clustering patterns across populations.
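A small sketch of this correspondence metric follows, assuming each cohort's clustering is given as a mapping from disease name to cluster label and that the comparison is restricted to diseases present in both cohorts. The function name and the toy disease labels are illustrative.

```python
from statistics import median

def modified_jaccard(ukb_clusters, other_clusters):
    """Both arguments map disease -> cluster label.
    Returns (per-UKB-cluster similarity, median similarity)."""
    shared = set(ukb_clusters) & set(other_clusters)
    scores = {}
    for k in sorted({ukb_clusters[d] for d in shared}):
        d_k = {d for d in shared if ukb_clusters[d] == k}
        # Best match: the comparison cluster receiving most of d_k's diseases.
        counts = {}
        for d in d_k:
            counts[other_clusters[d]] = counts.get(other_clusters[d], 0) + 1
        best = max(counts, key=counts.get)
        d_best = {d for d in shared if other_clusters[d] == best}
        scores[k] = len(d_k & d_best) / len(d_k)
    return scores, median(scores.values())

ukb = {"MI": 0, "angina": 0, "HF": 0, "breast_ca": 1, "colon_ca": 1}
other = {"MI": 5, "angina": 5, "HF": 7, "breast_ca": 2, "colon_ca": 2}
scores, overall = modified_jaccard(ukb, other)
```

In this toy case, two of the three diseases in UKB cluster 0 land in the same comparison cluster, giving J = 2/3 for that cluster, while cluster 1 matches perfectly.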

This analysis revealed strong correspondence between clusters across biobanks (median modified Jaccard similarity = 0.792), particularly for cardiovascular and malignancy signatures, suggesting robust biological patterns that transcend population differences.

For temporal pattern analysis, we performed a detailed comparison of the temporal patterns (ϕ trajectories) for diseases shared across all three biobanks, focusing on two key signatures: the cardiovascular signature (MGB: Sig 5, AoU: Sig 16, UKB: Sig 5) and the malignancy signature (MGB: Sig 11, AoU: Sig 11, UKB: Sig 6). For each signature, we identified diseases assigned to that signature in all three biobanks, plotted the temporal patterns (ϕ values) for each shared disease, overlaid the average pattern across all three biobanks (gray dashed line, 2), and used consistent colors for each disease across biobanks to facilitate comparison. This analysis demonstrated remarkable consistency in the temporal patterns of disease risk across different populations, with shared diseases showing similar risk trajectories despite being modeled independently in each biobank.

Individual Patient Trajectory Visualization

To illustrate the complex interplay of disease signatures in individual patients (shown in Figure 2 A-C of the main paper), we analyzed detailed trajectories for patients with multiple conditions. We identified patients who had at least one target disease of interest, developed multiple conditions (minimum of 2), and had complete follow-up data.

For each selected patient, we created a three-panel visualization. The Signature Dynamics Panel (Top Left) shows the temporal evolution of normalized signature loadings (θ) over time, with each signature represented by a distinct colored line, vertical dotted lines marking the timing of each disease diagnosis, and colors consistent across panels matching the primary signature of each diagnosed condition. The Disease Timeline Panel (Bottom Left) displays a chronological sequence of diagnosed conditions, with each condition represented by a horizontal line in its primary signature’s color, diagnosis points marked with filled circles, providing a visualization of disease progression and timing. The Signature Summary Panel (Right) shows a stacked bar chart of time-averaged signature loadings, with each segment representing the average contribution of a signature over the patient’s follow-up, colors matching the signature colors in the other panels, providing a static summary of the patient’s overall signature profile.

This visualization approach allows us to track how signature loadings change before and after each diagnosis, identify which signatures are most active at different time points, understand the temporal relationship between different conditions, and compare the relative contributions of different signatures to the patient’s overall disease profile.

Disease-Specific Trajectory and Heterogeneity Analysis

To systematically quantify differences in signature composition among patients with the same clinical diagnosis and understand disease progression heterogeneity and associated genetic architectures (Figures 3F, 4B-D), we performed trajectory clustering analysis using the ALADYNOULLI model. For each disease of interest (e.g., breast cancer, major depressive disorder, myocardial infarction), we implemented the following analysis pipeline:

Patient Selection and Temporal Averaging.

For each disease, we identified all patients who developed the condition and computed their time-averaged normalized signature loadings:

$$\bar{\theta}_{i,k} = \frac{1}{T_i}\sum_{t=1}^{T_i} \theta_{i,k,t}$$

where θi,k,t represents signature loadings for individual i, signature k, and time t.
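The time averaging can be sketched as below, assuming Ti is the number of observed time points for individual i and that loadings are stored in a padded (N, K, T) array. Both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def time_averaged_loadings(theta, n_timepoints):
    """theta: (N, K, T) loadings; n_timepoints: (N,) observed time points T_i.
    Averages theta[i, k, :T_i] over time for every individual and signature."""
    theta = np.asarray(theta, dtype=float)
    T_i = np.asarray(n_timepoints)
    # mask[i, t] is True for the first T_i time points of individual i.
    mask = np.arange(theta.shape[2])[None, :] < T_i[:, None]
    return (theta * mask[:, None, :]).sum(axis=2) / T_i[:, None]

theta_toy = np.array([[[1.0, 2.0, 3.0]],
                      [[4.0, 5.0, 6.0]]])  # N=2, K=1, T=3
theta_bar = time_averaged_loadings(theta_toy, [3, 2])
```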

Patient Clustering.

We applied k-means clustering (k=3, chosen to balance interpretability with cluster distinctiveness) to the time-averaged signature loading matrix θ¯i,k to identify distinct patient subgroups within each disease category. This approach identifies distinct subgroups of patients who share similar underlying disease signature profiles despite having the same clinical diagnosis.

Trajectory Visualization.

We computed cluster-specific mean trajectories across individuals within the cluster, $\mu_{c,k,t} = \frac{1}{|C_c|}\sum_{i \in C_c} \theta_{i,k,t}$, and visualized deviations from the population reference as stacked area plots for each time point: $\Delta_{c,k,t} = \mu_{c,k,t} - \mathrm{ref}_{k,t}$, where $\mathrm{ref}_{k,t}$ represents the population-average signature loading.
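A sketch of the cluster mean trajectories and their deviations from the reference, assuming integer cluster labels. If no reference is supplied, the mean over the individuals passed in is used, which is an assumption made here for illustration.

```python
import numpy as np

def cluster_deviations(theta, labels, reference=None):
    """theta: (N, K, T) loadings; labels: (N,) integer cluster id per patient.
    Returns {cluster id: (K, T) array of mean trajectory minus reference}."""
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    if reference is None:
        reference = theta.mean(axis=0)  # fallback reference (assumption)
    deviations = {}
    for c in np.unique(labels):
        mu_c = theta[labels == c].mean(axis=0)  # cluster mean trajectory
        deviations[int(c)] = mu_c - reference
    return deviations

theta_toy = np.array([[[1.0, 1.0]], [[3.0, 3.0]], [[5.0, 5.0]], [[7.0, 7.0]]])
dev = cluster_deviations(theta_toy, [0, 0, 1, 1])
```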

Genetic Architecture Analysis.

For each cluster, we computed mean polygenic risk scores (PRS) across individuals in the cluster, and created heatmaps showing cluster-specific values of these scores. To quantify variability of PRS scores among individuals with the same disease, we calculated Cohen’s d effect sizes for each PRS comparing in-cluster versus out-of-cluster distributions:

$$d = \frac{\bar{X}_{\mathrm{in}} - \bar{X}_{\mathrm{out}}}{s_{\mathrm{pooled}}}$$

where $\bar{X}_{\mathrm{in}}$ and $\bar{X}_{\mathrm{out}}$ are the mean PRS values for patients within and outside each cluster, respectively, and $s_{\mathrm{pooled}}$ is the pooled standard deviation. Cohen’s d values of 0.2, 0.5, and 0.8 correspond to small, medium, and large effect sizes, respectively, providing a standardized measure of genetic differentiation between patient subgroups.
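For concreteness, Cohen’s d can be computed as in the sketch below. The text does not spell out the pooled standard deviation, so the conventional degrees-of-freedom-weighted form is assumed here.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(in_cluster, out_cluster):
    """Standardized mean difference between in-cluster and out-of-cluster values.
    Uses the conventional pooled SD with sample variances (an assumption)."""
    n_in, n_out = len(in_cluster), len(out_cluster)
    pooled_var = ((n_in - 1) * variance(in_cluster)
                  + (n_out - 1) * variance(out_cluster)) / (n_in + n_out - 2)
    return (mean(in_cluster) - mean(out_cluster)) / sqrt(pooled_var)

# Toy PRS values: means 4 and 3, both sample variances 4, so pooled SD is 2.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

The same function applies unchanged to time-averaged signature loadings, since only the input values differ.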

We applied the same Cohen’s d formula to both the time-averaged signature loadings and the mean polygenic risk scores (PRS) to quantify the degree of separation between clusters. For the signature loadings, Cohen’s d measures the standardized difference in mean time-averaged signature values between individuals in a given cluster and those in all other clusters, providing a measure of biological heterogeneity within each disease category. For the PRS, Cohen’s d quantifies the genetic differentiation between clusters, comparing the mean PRS values for individuals within a cluster to those outside the cluster. In both cases, a larger absolute value of d indicates greater separation between clusters.

We then calculated cluster-specific Cohen’s effect sizes (19) $C^{\mathrm{SIG}}_{ck}$ as follows. For cluster c and signature k, $C^{\mathrm{SIG}}_{ck}$ is the standardized difference in mean time-averaged signature loadings between individuals in cluster c and those in all other clusters. This measures how distinct each cluster is with regard to each disease signature. Similarly, for cluster c and PRS p, $C^{\mathrm{PRS}}_{cp}$ is the standardized difference in mean PRS values between individuals in cluster c and those in all other clusters.

Confidence intervals and p-values for Cohen’s d were estimated, and significance was assessed to determine whether the observed cluster differences were likely to be due to chance. This analysis revealed substantial standardized differences in signature loadings between patient subgroups, reflecting biological processes not typically considered in diagnoses.

Genetic Analysis of Signature Trajectories

For each individual i, we compute the temporal signature loadings θi,k(t) for each signature k and timepoint t using the softmax transformation:

$$\theta_{i,k}(t) = \frac{\exp\left(\lambda_{i,k}(t)\right)}{\sum_{k'} \exp\left(\lambda_{i,k'}(t)\right)}$$

where λi,k(t) is the latent score for individual i, signature k, and time t. The softmax is computed across the signature dimension for each individual and timepoint. To summarize each individual’s overall exposure to a given signature, we integrate the signature trajectory over time:

$$\mathrm{AEX}_{i,k} = \int \theta_{i,k}(t)\,dt \approx \sum_{t=1}^{T-1} \frac{\theta_{i,k}(t) + \theta_{i,k}(t+1)}{2}$$

where T is the total number of timepoints. The resulting average signature exposure over time (AEX) for each signature is used as a quantitative phenotype for downstream genetic association analysis (Figure S12).
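The softmax and trapezoidal summation can be sketched as below, assuming latent scores are stored as an (N, K, T) array on a unit-spaced (yearly) time grid. Function names are hypothetical.

```python
import numpy as np

def signature_loadings(lam):
    """lam: (N, K, T) latent scores -> theta: (N, K, T), softmax over signatures."""
    shifted = lam - lam.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def aex(theta):
    """Trapezoidal sum over unit-spaced time points: (N, K, T) -> (N, K)."""
    return 0.5 * (theta[:, :, :-1] + theta[:, :, 1:]).sum(axis=2)

# With equal latent scores, each of K=4 signatures has loading 0.25 at all
# T=5 time points, so the trapezoidal sum over 4 intervals is 1.0.
theta_flat = signature_loadings(np.zeros((2, 4, 5)))
aex_flat = aex(theta_flat)
```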

We perform GWAS using the AEX values as quantitative phenotypes. For each signature k, we test for association between the AEX phenotype and genome-wide SNP genotypes. Association testing is performed using the Regenie (42) software (described below), which implements a two-step ridge regression approach for computational efficiency and control of population structure. The following covariates are included in all association models: sex, age at recruitment, and the first 20 principal components (PCs) of genetic ancestry (4).

For each signature, we identify genome-wide significant SNPs (e.g., P < 5 × 10⁻⁸) and further analyze their relationships with individual disease phenotypes. The analysis proceeds as follows. First, we extract the lead SNPs from the GWAS summary statistics for each signature. Second, for each top SNP, we test its association with a panel of binary constituent disease phenotypes, which comprise our signature inputs, using logistic regression controlling for sex and the first 20 PCs. Third, we visualize the matrix of SNP-phenotype Z-statistics using heatmaps, highlighting SNPs that are associated with the signature but not with any individual disease (i.e., “signature-specific” loci). Fourth, we use UpSet plots to visualize the overlap of significant variants across signatures and individual diseases, and compute Jaccard similarity indices to quantify the sharing of genetic associations. Fifth, for variants shared between signatures and diseases, we assess the consistency of effect directions across traits.

GWAS details

Regenie is run in two successive steps. Step 1 involves fitting a whole-genome ridge regression model to account for relatedness and population structure. Step 2 involves single-variant association testing using the residuals from Step 1, with covariate adjustment for sex, age at prediction, and the first 20 genetic PCs. This approach provides well-calibrated association statistics and is robust to case-control imbalance and relatedness in large biobank-scale datasets.

Model Evaluation and Comparison

Figure 5 presents a comprehensive evaluation of our multi-disease risk prediction model in the UKB, and comparisons with important single-disease models. Each is evaluated in a strictly prospective, leakage-free framework. In the testing data, all parameters were estimated using only information available up to the time of prediction. Individuals with prevalent disease at prediction were excluded from the risk set for that disease. In UKB, all individuals are followed for at least 10 years from recruitment. In our analysis we consider only these initial 10 years, both to avoid comparing metrics obtained across differing risk sets and to remain comparable to existing scores, but this can easily be extended over the full set of available prediction times. We considered the following prediction tasks and metrics.

Median AUC Aladynoulli Dynamic:

This metric evaluates the model’s ability to make dynamic predictions at multiple time points during follow-up: it is derived by refitting the Aladynoulli model using fixed ϕ^ parameters, previously estimated from the full-history training data, and now applied to a series of one-year prediction tasks. Critically, while the fixed ϕ^’s were estimated from the full training data, the λ^’s for each prediction task are now estimated using data only up to the point of prediction. For each of the first 10 years after recruitment, the individual loadings are re-estimated using only data available up to that point for the held-out test set (in orange in Figure S1). The median area under the receiver operating characteristic curve (AUC) across these ten dynamic one-year fits is reported. This captures how predictive accuracy evolves as patients accumulate new diagnoses and leverages the flexibility of our method to perform dynamic, prospective risk estimation at any time point. Individuals with prevalent disease at prediction time were excluded from the risk set.
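The aggregation of yearly AUCs into a median can be sketched as below. This is a pure-Python illustration with hypothetical function names and input layout. It uses the pairwise (Mann-Whitney) definition of the AUC, which is quadratic in sample size and intended only for clarity.

```python
from statistics import median

def auc(scores, outcomes):
    """Probability that a random case scores above a random non-case
    (ties count one half). Returns None if either group is empty."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        return None
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))

def dynamic_median_auc(yearly_scores, yearly_outcomes, yearly_prevalent):
    """Median of one-year AUCs, excluding prevalent cases from each risk set."""
    aucs = []
    for scores, outcomes, prevalent in zip(yearly_scores, yearly_outcomes,
                                           yearly_prevalent):
        keep = [i for i, p in enumerate(prevalent) if not p]
        a = auc([scores[i] for i in keep], [outcomes[i] for i in keep])
        if a is not None:
            aucs.append(a)
    return median(aucs)

# Two toy prediction years with three individuals each.
med = dynamic_median_auc(
    [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5]],
    [[1, 0, 0], [1, 0, 0]],
    [[False, False, False], [False, False, True]],
)
```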

Aladynoulli Recruitment (1-year):

This metric uses the Aladynoulli model’s predicted 1-year risk at the time of recruitment, evaluated against observed 1-year outcomes. The risk estimate is π˜i,d,1, the predicted 1-year risk for individual i and disease d at year 1 after recruitment. In practice, any age of prediction could be chosen, but we use the age of recruitment to the UK Biobank given the availability of additional clinical variables for comparison, which improves comparability with some of the alternative approaches. As above, only information available at recruitment is used for estimation of individual loadings λ^ik on ϕ^full,kd estimated from the training set.

Aladynoulli Recruitment (10-year):

This metric uses the Aladynoulli model’s predicted 1-year risk at the time of recruitment, evaluated against observed 10-year outcomes. The risk estimate is π˜i,d,1, as in the 1-year predictions.

Cox with Aladynoulli:

This model is a Cox proportional hazards regression using age as the time scale and including the Aladynoulli risk prediction at recruitment, family history, and sex as covariates. This approach benchmarks the added value of the Aladynoulli prediction in a standard clinical modeling framework.

Cox without Aladynoulli:

This baseline Cox model uses only family history and sex as covariates, representing a minimal clinical model that does not require any model curation or disease-specific features. This highlights the fact that our approach does not rely on disease-specific risk factors or manual feature engineering.

When benchmarking AUC performance, we compared the ALADYNOULLI model not only to the benchmarking Cox proportional hazards model above, but also to established clinical risk scores: PREVENT (25), Pooled Cohort Equation (43) for ASCVD, and Gail (24) for breast cancer. These are models for diseases where such scores are available only after specific and often expensive curation. Of note, these clinical risk scores require laboratory values and biomarkers that are either collected during targeted clinical visits (introducing selection bias) or, when extracted from routine EHR data, may be subject to measurement bias since sicker patients typically receive more frequent testing. In contrast, our approach leverages routinely collected diagnostic codes (ICD codes) that are systematically recorded for all patients regardless of disease severity, providing a less biased data source for risk prediction.

This approach ensured that all model comparisons were fair, prospective, and reflective of the information available at the time of risk assessment.

Additional evaluation (Table S7)

Dynamic 10-year Rolling:

This approach demonstrates the model’s interpolation capabilities by evaluating how probability estimates evolve as new information becomes available. For each year of the 10-year horizon, we update the model’s predictions using information available up to that time point, then aggregate the cumulative 10-year risk as $1 - \prod_{t=1}^{10}\left(1 - \hat{\pi}_{i,d,t}\right)$, where $\hat{\pi}_{i,d,t}$ is the predicted risk for individual i and disease d at year t after recruitment. While this rolling evaluation does not use knowledge of the future outcome of interest, it is not leakage-free. This is because it does incorporate future information about events that are potentially correlated with, or even resulting from, the event of interest, because the model’s probability estimates at year t are influenced by information available up to year t. Thus it is best understood as interpolation rather than extrapolation. While this metric cannot be used for prospective evaluation, it demonstrates the model’s technical capabilities for dynamic risk assessment and shows how probability estimates evolve over time.
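As a small worked example of this aggregation, the sketch below combines yearly risks into a cumulative risk as one minus the product of yearly survival probabilities. The function name is illustrative.

```python
def cumulative_risk(yearly_risks):
    """Cumulative risk over a horizon: 1 - prod_t (1 - pi_t)."""
    survival = 1.0
    for p in yearly_risks:
        survival *= 1.0 - p  # probability of remaining event-free through year t
    return 1.0 - survival

# A constant 10% yearly risk over two years gives 1 - 0.9 * 0.9 = 0.19.
two_year = cumulative_risk([0.1, 0.1])
```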

Age-specific evaluation across 30 timepoints:

To comprehensively assess model performance across the adult lifespan, we evaluated predictions at 30 distinct age-specific timepoints spanning ages 40 to 70 years. This approach differs from the recruitment-time evaluation in that each timepoint represents a specific age cohort (e.g., age 40, 41, 42, etc.) rather than mixed-age groups at different follow-up times. For each age-specific timepoint, we used the cumulative data inclusion approach, where all available data from age 30 up to the prediction age is included, rather than a fixed 10-year window. This methodology ensures that predictions at each age benefit from the full available patient history while maintaining proper temporal alignment. We evaluated performance only for years with sufficient events (≥ 5 events) to ensure reliable AUC estimates, and computed median AUC values across all qualifying years for each disease. This approach revealed substantial improvements over the previous 10-year rolling window methodology, demonstrating the importance of proper data inclusion strategies in survival prediction models.

All analyses were performed using Python, with survival models implemented in lifelines and scikit-survival and validated in R (version 4.0) using the survival package. Calibration and discrimination metrics were computed using standard epidemiological methods.

Supplementary Material

Supplement 1

Acknowledgments

Funding:

This work was supported by National Institutes of Health grants (R01HL155915, R01HL157635, R35HL144758) to P.N. and American Heart Association grants (19SFRN34800000, 19SFRN34850009) to P.N.

Footnotes

Competing interests: The authors declare no competing interests.

Data and materials availability:

The code for implementing ALADYNOULLI is available at https://github.com/surbu, with all analyses and code necessary for reproduction available upon request from the authors. Access to individual-level UK Biobank data requires approval from the UK Biobank (https://www.ukbiobank.ac.uk/). Access to Mass General Brigham data requires approval from the Mass General Brigham Institutional Review Board. Access to All of Us data requires approval through the All of Us Researcher Workbench (https://www.researchallofus.org/).

References and Notes

Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
