Abstract
Understanding how disease patterns evolve over a lifetime remains a key challenge in medicine. While electronic health records provide rich longitudinal data, existing models typically analyze each disease in isolation, missing the complex interplay between multiple conditions and genetic factors. Here, we combine longitudinal health records with genetic data in a novel dynamic Bayesian framework called ALADYNOULLI, which identifies latent disease signatures while modeling individual-specific trajectories. Applied across three biobanks with up to 50 years of follow-up, our model discovers clinically interpretable disease signatures that are consistent across diverse populations (79.2% cross-cohort correspondence) and show strong genetic correlations, enabling both accurate prediction of patient risk and discovery of novel genetic associations. The model achieves substantial improvements in disease prediction across 24 conditions, outperforming established clinical risk scores such as PCE, PREVENT and GAIL over short- and longer-term horizons. Furthermore, our signature-based approach identifies over 150 genetic loci, many missed by single-disease GWAS, with multiple signatures showing strong genetic signals, including the cardiovascular ($h^2 = 0.041$) and musculoskeletal signatures. Critically, this unified modeling approach improves predictive performance for multiple diseases while revealing distinct biological subtypes within traditional diagnostic categories. It demonstrates substantial heterogeneity across diverse conditions including cancer, metabolic disorders, and psychiatric conditions, with Cohen’s $d$ effect sizes up to 3.87 for signature differences between patient clusters ($p \le 1 \times 10^{-8}$ for 95% of comparisons).
In conclusion, ALADYNOULLI combines genetics and longitudinal diagnoses to achieve both improved disease prediction and enhanced genetic discovery through a unified framework that captures the complex interplay between genetic predisposition and time-varying disease patterns. An application with a link to simulated results is available at http://aladynoulli.hms.harvard.edu. Code is available on GitHub with permission.
Introduction
The risk of disease varies substantially between individuals and throughout life, with complex interactions between genetic predisposition, environmental factors, and accumulated comorbidities. Understanding these dynamic patterns of risk could transform early detection, prevention, and personalized treatment strategies (1–3). The increasing availability of large-scale electronic health records (EHRs) linked to genetic data provides unprecedented opportunities to model these complex disease trajectories at a population scale (4, 5). However, extracting meaningful patterns from these rich, longitudinal datasets remains challenging due to patient population heterogeneity, the temporal nature of disease progression, and intricate relationships between diverse conditions.
Traditional approaches to analyzing EHR data often focus on isolated diseases or simple pairwise associations, failing to capture how multiple conditions evolve together over time (6). Recent unsupervised methods have attempted to identify disease clusters or trajectories (7), but typically do not account for temporal dynamics of disease risk or individual-level heterogeneity, particularly the influence of genetic factors on disease progression rates (8, 9). Furthermore, many models assume conditional independence of diseases, missing the opportunity to leverage information across related conditions for both prediction and discovery (10,11). Consider a patient who develops rheumatoid arthritis at age 45, followed by hypertension at 48, and eventually suffers a myocardial infarction at 52. Traditional approaches may treat these as separate events or simple comorbidities, missing the underlying metabolic-inflammatory signature that drives this progression. Also, they do not typically leverage information from patients with similar patterns to improve prediction for rare conditions, where limited data makes traditional disease-specific models less reliable.
We present ALADYNOULLI, a generative model that integrates genetic data with longitudinal EHRs to identify latent disease signatures while modeling individual-specific health trajectories over time. ALADYNOULLI addresses these limitations by identifying shared disease signatures that capture biological pathways common across multiple conditions, enabling more accurate prediction even for rare diseases through information sharing with related, more common conditions. Our approach offers several key advantages over existing methods: (1) Interpretability: disease signatures correspond to clinically meaningful biological processes rather than abstract statistical factors; (2) Temporal modeling: captures how disease risk evolves dynamically over the life course rather than static risk assessment; (3) Genetic integration: directly incorporates genetic information into the model architecture rather than as post-hoc analysis; (4) Unified framework: simultaneously models multiple diseases, sharing information across related conditions and improving prediction even for diseases with sparse data (12); and (5) Individual-specific trajectories: provides personalized risk profiles that adapt as new clinical information becomes available. By jointly modeling multiple diseases and their genetic determinants, ALADYNOULLI enables both improved prediction of future disease risk and enhanced discovery of genetic architecture underlying complex phenotypes, while revealing meaningful patient subgroups with distinct biological mechanisms that could inform personalized interventions.
Results
ALADYNOULLI captures temporal disease signatures and individual trajectories
Disease patterns among individuals vary by onset, progression speed, and composition, reflecting different underlying biological mechanisms. Unlike allocation-based topic models that conditionally allocate observed diseases to categories (7,10), ALADYNOULLI models the probability of each disease for an individual by integrating across multiple latent signatures (Figure 1).
Figure 1: ALADYNOULLI model overview and applications.
Top: Example patient timeline showing the sequence and timing of major diagnoses over the life course. Middle: Key model components. Left: Population-level disease signatures $\phi$, with each line representing the age-dependent risk trajectory for a specific disease within a signature. Center: Individual signature loadings $\lambda$, transformed to $\theta$ via softmax, for a representative patient, showing how contributions from different signatures evolve over time. Right: Disease risk prediction for selected diseases, integrating population-level signatures and individual loadings to generate personalized risk trajectories. Bottom: Applications of the model, including genomic discovery, therapeutic targeting, and patient matching (e.g., digital twin identification or stratification of patients with the same diagnosis but different risk profiles).
For each individual $i$, disease $d$, and time point $t$, we model the probability of disease occurrence $\pi_{i,d,t}$ as a weighted combination of signature-specific probabilities, where each signature captures patterns of diseases that tend to occur together (Table S1):
$$\pi_{i,d,t} = \kappa \sum_{k=1}^{K} \theta_{i,k,t}\,\mathrm{sigmoid}(\phi_{k,d,t}) \tag{1}$$
where $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$, $\kappa$ is a global calibration parameter, $\theta_{i,k,t}$ represents the normalized, time-varying association of individual $i$ with signature $k$ at time $t$, and $\phi_{k,d,t}$ captures the relationship between signature $k$ and disease $d$ at time $t$.
The normalized individual-signature associations (loadings) $\theta_{i,k,t}$ are derived from latent variables $\lambda_{i,k,t}$ through a softmax transformation:
$$\theta_{i,k,t} = \frac{\exp(\lambda_{i,k,t})}{\sum_{k'=1}^{K} \exp(\lambda_{i,k',t})} \tag{2}$$
These latent variables follow a Gaussian process (13) prior wherein we model the effects of genetic factors and time (see Methods; Figure S1). Specifically:
$$\lambda_{i,k} \sim \mathcal{GP}\left(r_k + g_i^{\top} \Gamma_k,\; \Omega_{\lambda}\right) \tag{3}$$
where $r_k$ is a signature-specific reference level, $\Gamma_k$ captures genetic effects on signature predisposition, $g_i$ represents individual genetic factors, here polygenic scores, affecting the mean of $\lambda_{i,k}$, and $\Omega_{\lambda}$ is a temporal covariance kernel modeling smooth trajectories for $\lambda_{i,k}$ over time.
Similarly, the disease-signature associations follow a Gaussian process prior:
$$\phi_{k,d} \sim \mathcal{GP}\left(\mu_d + \psi_{k,d},\; \Omega_{\phi}\right) \tag{4}$$
where $\mu_d$ is a disease-specific baseline, or the logit of the population prevalence, $\psi_{k,d}$ represents the overall strength of association between signature $k$ and disease $d$, and $\Omega_{\phi}$ allows for temporal variation in these associations.
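As an illustration of the Gaussian process prior on the latent individual-signature variables, the following minimal sketch draws one individual's latent trajectories with a mean given by a signature reference level plus a genetic term. The squared-exponential kernel, its hyperparameters, the array sizes, and all variable names are assumptions made for this example; the text specifies only that a temporal covariance kernel is used.

```python
import numpy as np

def sq_exp_kernel(ages, amplitude=1.0, length_scale=10.0, jitter=1e-6):
    """Squared-exponential covariance over an age grid (an assumed kernel choice)."""
    diff = ages[:, None] - ages[None, :]
    cov = amplitude**2 * np.exp(-0.5 * (diff / length_scale) ** 2)
    # Small diagonal jitter keeps the matrix numerically positive definite.
    return cov + jitter * np.eye(len(ages))

rng = np.random.default_rng(0)
ages = np.arange(30.0, 81.0)      # hypothetical yearly grid, T = 51
n_prs, n_sig = 36, 3              # hypothetical sizes for illustration

r = rng.normal(size=n_sig)                        # signature reference levels
Gamma = rng.normal(scale=0.1, size=(n_prs, n_sig))  # genetic effects per signature
g = rng.normal(size=n_prs)                        # one individual's polygenic scores

# Prior mean of the latent trajectory for each signature (constant in time).
mean = r + g @ Gamma                              # shape (n_sig,)

# Draw smooth trajectories: mean + L z, where cov = L L^T and z ~ N(0, I).
cov = sq_exp_kernel(ages)
chol = np.linalg.cholesky(cov)
lam = mean[:, None] + (chol @ rng.normal(size=(len(ages), n_sig))).T  # (n_sig, T)
```

A longer `length_scale` yields smoother trajectories, and the genetic term shifts the whole trajectory up or down without changing its temporal shape.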
A key innovation of our approach lies in its formulation as a mixture of probabilities rather than a probability of a mixture, as in traditional sparse factor analysis approaches (6). Unlike allocation-based topic models that conditionally assign diseases to individuals after the event has necessarily occurred (7), ALADYNOULLI directly models the probability of disease occurrence as a weighted combination of signature-specific disease probabilities.
This crucial distinction allows our model to: (1) predict future disease onset rather than merely explain observed diagnoses; (2) accommodate multiple contributing disease processes simultaneously rather than forcing competitive allocation to a single signature; and (3) accurately model chronic conditions that persist over time rather than treating each diagnosis as an independent event. The combination of softmax-transformed individual loadings and sigmoid-transformed disease probabilities ensures proper probability scaling.
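The distinction between a mixture of probabilities and a probability of a mixture can be made concrete with a short numerical sketch. The code below computes disease probabilities as a calibrated, softmax-weighted sum of sigmoid-transformed signature-disease terms, and contrasts this with applying the sigmoid after mixing. Array shapes, random inputs, and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disease_probability(lam, phi, kappa=1.0):
    """Mixture of probabilities.

    lam: latent loadings, shape (N individuals, K signatures, T times)
    phi: signature-disease terms, shape (K, D diseases, T)
    returns: disease probabilities, shape (N, D, T)
    """
    theta = softmax(lam, axis=1)  # loadings sum to 1 over signatures
    return kappa * np.einsum("nkt,kdt->ndt", theta, sigmoid(phi))

rng = np.random.default_rng(1)
lam = rng.normal(size=(4, 3, 5))
phi = rng.normal(loc=-4.0, size=(3, 6, 5))

pi = disease_probability(lam, phi)

# For contrast: the probability of a mixture applies the sigmoid after mixing.
theta = softmax(lam, axis=1)
prob_of_mixture = sigmoid(np.einsum("nkt,kdt->ndt", theta, phi))
```

With `kappa = 1`, each entry of `pi` is a convex combination of the signature-specific probabilities, so it always lies between the smallest and largest signature-specific probability for that disease and time.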
Terminology clarification:
We note that the factor analysis literature uses inconsistent terminology that can be confusing. In some traditions (e.g., sparse factor analysis (6, 14)), “loadings” refer to individual-specific weights (our $\lambda$ and $\theta$ parameters), while “factors” or “coefficients” refer to feature importance (our $\phi$ parameters). In other traditions, “loadings” refer to feature importance (our $\phi$ parameters), while individual components are called “scores” or “weights” (our $\lambda$ and $\theta$ parameters). Throughout this work, we use “loadings” to refer to individual-specific signature associations ($\lambda$ and $\theta$) and “signature-disease associations” to refer to feature importance ($\phi$), consistent with the sparse factor analysis convention where loadings represent individual variation and factors represent feature structure.
Two complementary applications:
ALADYNOULLI serves two distinct but complementary purposes, each requiring different analytical approaches. For biomedical discovery, ALADYNOULLI operates with complete hindsight, leveraging entire patient trajectories to maximize our ability to identify biological patterns and mechanisms. This retrospective analysis transforms our understanding of disease patterns, progression speed, genetic relationships, and disease associations by using all available longitudinal data to characterize disease signatures, quantify genetic influences, and reveal patient heterogeneity within diagnostic categories. For clinical prediction, we operate under strict temporal constraints that mirror real-world clinical decision making (see Figure S1 for distinction). We employ a rigorous temporal validation framework that uses only information available up to a prediction time point (see Figure S2). This prospective approach simulates real-world clinical scenarios where physicians must predict future risk based solely on a patient’s history to date, ensuring our performance metrics reflect true predictive capability rather than retrospective explanation.
Applying ALADYNOULLI identifies consistent signature patterns across diverse populations
We applied ALADYNOULLI to three independent cohorts: UK Biobank (UKB, n=427,239), Mass General Brigham (MGB, n=48,069), and All of Us (AoU, n=208,263) (Table S2, Figure S3). We obtained ICD-10 codes from hospitalization diagnoses in each biobank (4, 15) and transformed these to pheCodes (16), following established approaches for EHR phenotyping (17) (see Methods). A set of 348 pheCodes was selected, representing diseases with at least 1000 unique occurrences in UK Biobank hospitalization episode statistics (18), as in (7). Despite differences in population characteristics, healthcare systems, and data collection methodologies across these cohorts, our model identified remarkably consistent signature patterns (Table S3; Figure S4).
We set K=20 for our model, which converged well (S5) across all three biobanks and successfully identified 20 distinct disease signatures from the data. These model-derived signatures corresponded to recognized disease processes and captured diverse disease domains including cardiovascular, metabolic, pulmonary, psychiatric, musculoskeletal, and oncologic conditions (S3). Each signature demonstrates characteristic temporal patterns, with disease probabilities evolving dynamically with age across biobanks (Figure 2A; Figures S6, S7, S8). For example, the cardiovascular signature shows steadily increasing probabilities for conditions like atrial fibrillation and heart failure after age 55 years, while the malignancy signature displays a sharp rise in metastatic disease probabilities between ages 60–75 years. The specificity of each signature for a given disease, as modeled by the signature-disease association strength $\psi$, is preserved but heterogeneous, reflecting the model’s ability to disentangle signature-disease specificity (Figure 2B).
Figure 2: Population-level disease signatures inferred by ALADYNOULLI.
(A) Age-dependent log hazard ratios for four representative disease signatures (cardiovascular, cancer, pulmonary, and cerebrovascular), as estimated by the model. Each line represents the predicted risk trajectory for a specific disease within the signature, illustrating distinct temporal patterns of disease onset. (B) Heatmap of signature-disease specificity parameters ($\psi$) learned by the model, with red indicating strong positive association and blue indicating negative association between diseases and signatures. (C) Cluster correspondence matrices comparing model-inferred disease groupings across biobanks (UK Biobank, MGB, and All of Us), demonstrating the consistency of disease clusters for common diseases. (D) Model-predicted age-specific probabilities of disease onset for a range of conditions, showing the temporal emergence of diseases across the lifespan. (E) Comparison of signature trajectories for cardiovascular and malignancy signatures across three independent biobanks (MGB, AoU, UKB), demonstrating the robustness and replicability of the model’s temporal patterns across cohorts.
Furthermore, the model’s tensor structure (Figure S9) enables rapid disease hazard calculation using the average loadings (Equation 3) and the population-level signature-disease associations $\phi$. The average age-specific hazard probabilities for a wide range of diseases are visualized in Figure 2D, highlighting temporal risk patterns.
As stated, these signature patterns show strong consistency across the three independent cohorts (Figure 2C; Figures S6, S7, S8). When comparing the membership of diseases within signatures between any two cohorts, we observed high concordance (median modified Jaccard index = 0.792, IQR = 0.65–0.89 across all pairwise comparisons between biobanks; Figure S4). Signatures were matched across cohorts by similarity, and the modified index is the size of the intersection between matched signatures normalized to the total number of diseases within the matched UKB signature. Figure 2E illustrates this consistency for two key signatures: cardiovascular disease and malignancy. Despite differences in healthcare systems and coding practices, the temporal patterns of key diseases within these signatures remain remarkably consistent, supporting the biological validity of the discovered patterns.
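The modified Jaccard index described here, the intersection of two matched signatures normalized by the size of the UKB signature, can be sketched as follows. The best-overlap matching rule, the function names, and the toy disease sets are assumptions for illustration; the text does not specify how similarity matching was performed.

```python
def modified_jaccard(ukb_sig, other_sig):
    """|UKB ∩ other| divided by |UKB| (normalized to the UKB signature size)."""
    ukb_sig, other_sig = set(ukb_sig), set(other_sig)
    if not ukb_sig:
        raise ValueError("UKB signature must contain at least one disease")
    return len(ukb_sig & other_sig) / len(ukb_sig)

def match_and_score(ukb_signatures, other_signatures):
    """For each UKB signature, pick the other-cohort signature with the largest
    modified Jaccard index (an assumed matching rule) and report the score."""
    results = {}
    for name, diseases in ukb_signatures.items():
        best = max(
            other_signatures,
            key=lambda o: modified_jaccard(diseases, other_signatures[o]),
        )
        results[name] = (best, modified_jaccard(diseases, other_signatures[best]))
    return results

# Hypothetical toy signatures, not taken from the study.
ukb = {
    "cardiovascular": {"MI", "AF", "HF", "angina"},
    "metabolic": {"T2D", "obesity", "hyperlipidemia"},
}
mgb = {
    "s1": {"MI", "AF", "HF"},
    "s2": {"T2D", "obesity", "gout"},
}
scores = match_and_score(ukb, mgb)
```

Unlike the standard Jaccard index, this variant is asymmetric: diseases present only in the comparison cohort's signature do not lower the score.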
The model also captures disease-specific temporal dynamics that match clinical expectations. For instance, Type 1 diabetes peaks earlier in life compared to Type 2 diabetes within the metabolic signature (Fig 2E), while primary malignancies precede metastatic disease within the cancer signature. These nuanced temporal relationships emerge directly from the model without explicit encoding, demonstrating ALADYNOULLI’s ability to learn clinically meaningful disease trajectories.
Personalized trajectories reveal heterogeneity within disease categories
Beyond population-level signatures, ALADYNOULLI provides individual-specific trajectory information through the time-varying signature loadings $\theta$, which reveal distinct disease progression patterns.
Although the patients in Figure 3A–C show similar average signature loadings in aggregate (horizontal ‘static model summary’ profile), their disease journeys reveal biological differences among patients sharing the same diagnosis, reflecting true heterogeneity within the diagnostic category, i.e., the presence of distinct subgroups with different underlying disease signature distributions. Patient C (Panel C) experiences MI at age 54 following a complex trajectory of gastrointestinal and musculoskeletal conditions, with cardiovascular signature activation beginning subtly around age 50 and accelerating dramatically in the years preceding the event. In contrast, Patient B (Panel B) develops MI at age 72 after a markedly different prodrome dominated by respiratory and dermatologic conditions, showing a more gradual cardiovascular signature evolution. Post-MI trajectories in these two patients also diverge substantially: Patient C subsequently develops multiple cardiovascular complications and metabolic disorders, while Patient B’s post-MI course is characterized by different comorbidity patterns including genitourinary and infectious disease manifestations. These distinct temporal signatures preceding and following identical clinical endpoints illustrate how ALADYNOULLI captures the biological heterogeneity masked by traditional diagnostic categories, revealing that “myocardial infarction” encompasses diverse pathophysiological pathways that may require different prevention and treatment strategies. Multiple additional examples (Figure S10) demonstrate the diversity in temporal loadings that would be missed by a summative approach considering only the average loading.
Figure 3: Individual-level trajectories and dynamic risk profiles.
(A–C) Patient-specific normalized signature loadings over time for three representative individuals. The lower panels show the disease timeline and key diagnoses for each patient. (D) Comparison of early-onset (<55 years) and late-onset (>70 years) MI: Average signature loadings and their temporal velocities reveal distinct dynamic patterns and rates of change associated with age of onset. (E) Decomposition of myocardial infarction (MI) risk for a representative patient: Top, time-varying signature loadings; middle, heatmap of log disease probabilities by signature and age; bottom, stacked area plot showing the aggregate risk over time. (F) Signature heterogeneity within disease subtypes: Stacked area plots show deviations in signature proportions from the population average for selected diseases (malignant neoplasm of female breast, major depressive disorder, and myocardial infarction), highlighting the diversity of underlying biological processes among patients with the same clinical diagnosis.
Our model also illustrates how individual-level trajectories and population phenomena combine to elicit time-varying personalized disease probabilities. Figure 3E is a heatmap of log disease probabilities by signature and age for MI, showing how overall MI risk is decomposed into the contributions of various time-varying signature loadings. This visualization reveals the complex interplay between multiple signatures in determining disease risk. While the cardiovascular signature contributes most significantly to MI risk, other signatures—particularly those related to metabolic conditions and inflammation—also play important roles. The stacked area plot below demonstrates how these contributions integrate to form the aggregate risk profile, revealing periods of accelerated risk accumulation that may represent critical windows for preventive intervention.
Aggregating these individual patterns reveals distinct group-level differences. In a retrospective analysis, the comparison of early-onset (≤ 55 years, mean age of event 49.7 years) and late-onset (≥ 70 years, mean age of onset 74.9 years) MI in Figure 3D shows that early-onset patients exhibit a higher and earlier peak in cardiovascular signature contribution, as well as a more rapid increase in signature loading prior to the event, compared to late-onset cases. These quantitative differences in trajectory characteristics suggest that early- and late-onset MI, while sharing the same clinical diagnosis, may represent distinct disease entities requiring different preventive strategies.
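The rate of change of a signature loading before an event can be approximated by a finite-difference derivative with respect to age. The sketch below is only an illustration of that idea; the function name and the toy trajectory are hypothetical, and the authors' exact definition of temporal velocity is not given in this section.

```python
import numpy as np

def loading_velocity(theta, ages):
    """Finite-difference derivative of loadings with respect to age.

    theta: array whose last axis indexes age, e.g. shape (K, T)
    ages:  age grid of shape (T,)
    """
    return np.gradient(theta, ages, axis=-1)

ages = np.arange(40.0, 61.0)
# Hypothetical loading that rises linearly by 0.02 per year.
theta = 0.1 + 0.02 * (ages - ages[0])
velocity = loading_velocity(theta[None, :], ages)
```

Averaging such velocities within the early-onset and late-onset groups, aligned on time to event, would give the kind of group-level comparison described here.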
This pattern of heterogeneity within diagnostic categories extends broadly across diseases. Figure 3F captures signature heterogeneity within disease subtypes through stacked area plots showing deviations from the population average, highlighting the diversity of underlying biological processes among patients sharing the same clinical diagnosis.
To systematically quantify differences in signature composition among patients with the same clinical diagnosis for three representative diseases (myocardial infarction, breast cancer, and major depressive disorder), we applied k-means clustering to patients’ time-averaged signature loadings for each disease (Figure 3F). We then calculated cluster-specific Cohen’s $d$ effect sizes (19) as follows (Figure S11; Extended Data S1-S3). For cluster $c$ and signature $k$, $d_{c,k}$ is the standardized difference in mean time-averaged signature loadings between individuals in cluster $c$ and those in all other clusters (see Figure S11). This measures how distinct each cluster is with regard to each disease signature.
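A cluster-versus-rest Cohen's d of this kind can be sketched as follows. The pooled-standard-deviation form used here is the common textbook definition and is an assumption; the exact standardization used in the study is described in its Methods, which are not part of this section. Function names and inputs are hypothetical.

```python
import numpy as np

def cohens_d(in_cluster, rest):
    """Standardized mean difference using the pooled standard deviation."""
    a = np.asarray(in_cluster, dtype=float)
    b = np.asarray(rest, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def cluster_effect_sizes(mean_loadings, labels):
    """Effect size for every (cluster, signature) pair.

    mean_loadings: (N patients, K signatures) time-averaged loadings
    labels:        (N,) cluster assignment, e.g. from k-means
    returns: dict mapping cluster id -> array of K effect sizes
    """
    mean_loadings = np.asarray(mean_loadings, dtype=float)
    labels = np.asarray(labels)
    out = {}
    for c in np.unique(labels):
        mask = labels == c
        out[int(c)] = np.array([
            cohens_d(mean_loadings[mask, k], mean_loadings[~mask, k])
            for k in range(mean_loadings.shape[1])
        ])
    return out

# Toy check: means 2 and 5, both sample variances 1, so d = -3.
d_example = cohens_d([1, 2, 3], [4, 5, 6])
```

By the usual convention, absolute values above 0.8 are considered large, which is the benchmark used when interpreting the cluster differences.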
This analysis revealed that the vast majority of signature differences between clusters were not only large in magnitude (with many $d$ values exceeding 0.8, and some as high as 2.5–3.9), but also highly statistically significant ($p \le 1 \times 10^{-8}$ for nearly all clusters). The largest effect size occurred in major depressive disorder, where the acute illness signature (Signature 16: septicemia, acute renal failure, and critical care conditions) in cluster 2 showed $d = 3.87$ ($p \approx 0$), revealing a medically complex depression subgroup with severe acute comorbidities. In myocardial infarction, the cardiovascular signature (Signature 5, encompassing coronary atherosclerosis, ischemic heart disease, and hypercholesterolemia) showed strong differentiation ($p \approx 0$) in only one cluster, indicating that even within cardiovascular diseases, the cardiovascular signature itself reveals substantial heterogeneity between patient subgroups. In breast cancer, the cardiovascular signature also showed strong differentiation ($p \approx 0$). Similarly, the pain/inflammatory/metabolic signature (Signature 7, characterized by asthma, migraine, osteoporosis, depression, and obesity) achieved near-complete patient separation in all conditions examined, with $d$ values ranging from 1.84 to 2.51. Effect sizes of this magnitude suggest distinct underlying disease processes, and therefore distinct biological subgroups, within the same diagnostic category. These results demonstrate that the observed heterogeneity is both quantitatively substantial and statistically robust, supporting the biological relevance of the patient subgroups we identified (see Extended Data S1-S3).
The model’s ability to identify such distinct temporal trajectories and biological subtypes even among patients with similar diagnoses illustrates ALADYNOULLI’s potential for personalized risk assessment and intervention timing (Figure S10). By capturing how an individual’s signature associations evolve with each new diagnosis, ALADYNOULLI provides a dynamic framework for monitoring disease progression, predicting future complications, and identifying optimal windows for preventive measures. Unlike traditional risk scores that provide a single probability estimate, ALADYNOULLI offers a comprehensive view of an individual’s evolving disease landscape—revealing not just what conditions might develop but when and in what sequence—critical information for precision medicine.
Genetic factors influence signature trajectories
A key innovation of ALADYNOULLI is its integration of genetic information directly into the model, allowing us to quantify how genetic factors influence disease signature associations. We examined both the direct genetic effects on signature loadings through the genetic effect parameters $\Gamma$ and the association between polygenic risk scores (PRS) and signature trajectories. Importantly, to avoid double-dipping, we used external PRS that were developed independently of our signature analysis, ensuring that genetic information was not used both in training the model and in evaluating PRS-signature associations.
Genetic analysis revealed substantial genetic influence on signature associations through the $\Gamma$ parameters. Using batch-aggregated effect estimates across model replicates and a Bonferroni correction over all 756 tests (36 PRS per signature; $p \le 6.6 \times 10^{-5}$), we identified 75 significant PRS-signature associations out of 756 tests (9.9%) (Fig 4A; see Extended Data S0). The strongest genetic effects were observed for signatures with known heritable components: coronary artery disease PRS on the cardiovascular signature (Signature 5), LDL cholesterol PRS on Signature 5, and type 2 diabetes PRS on the metabolic signature (Signature 15). Coronary, metabolic, and psychiatric signatures showed the strongest overall genetic influences (Figure 4A), consistent with the high heritability of these disease categories. Several PRS, including BMI, T2D, and HT, showed pleiotropic effects across multiple signatures (20), highlighting shared genetic architecture across disease processes. Importantly, the heterogeneous patient groups identified in our trajectory analysis (Figures 3D and 3F) show corresponding heterogeneity in underlying polygenic risk scores (Figure 4B), demonstrating that genetic variation contributes to the diverse disease progression patterns we observe.
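The reported significance threshold and hit rate can be checked with a few lines of arithmetic. The family-wise error rate of 0.05 is an assumption (it is not stated in the text), chosen because it reproduces the reported threshold when divided by the 756 tests.

```python
n_prs = 36          # PRS tested per signature (from the text)
n_tests = 756       # total PRS-signature tests (from the text)
alpha = 0.05        # assumed family-wise error rate

n_signatures = n_tests // n_prs     # signatures implied by the test count
threshold = alpha / n_tests         # Bonferroni-corrected p-value threshold
hit_rate = 75 / n_tests             # fraction of tests that were significant

print(f"signatures: {n_signatures}")
print(f"threshold:  {threshold:.2e}")
print(f"hit rate:   {hit_rate:.1%}")
```

Dividing 0.05 by 36 alone would give a much looser threshold of about 1.4 × 10⁻³, so the reported value corresponds to correcting over all PRS-signature pairs.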
Figure 4: Genetic architecture and polygenic risk stratification of ALADYNOULLI disease signatures.
(A) Top polygenic risk score (PRS) associations for each disease signature, showing effect sizes for the most significant PRS-signature pairs across disease categories. (B) Heatmaps of mean PRS values by cluster for three representative diseases: major depressive disorder, breast cancer, and myocardial infarction, demonstrating the stratification of polygenic risk across model-inferred patient clusters. (C) UpSet plot showing the overlap of genome-wide significant loci between disease signatures and individual traits, with analyses performed without PRS prior, highlighting shared genetic mechanisms across diseases. We consider SNPs as shared if they are within 1 Mb of a lead locus in each component trait. (D) Heatmap of positive genetic correlations between disease signatures and complex traits, computed using LD score regression without PRS prior, revealing shared genetic architecture and pleiotropy.
We quantified the variability of PRS across patient clusters by computing Cohen’s $d$ effect sizes for each cluster and PRS, analogously to what was done earlier for signatures (see also Methods). This analysis revealed substantial differences in polygenic risk scores between patient subgroups that parallel the biological variation observed in signature loadings (Figure 4B; Extended Data S4-S6). For major depressive disorder, signature loadings showed dramatic cluster-specific effects, with Signature 16 (likely psychiatric) showing extreme enrichment in Cluster 2 and Signature 7 (likely inflammatory) showing strong depletion in Cluster 1 but enrichment in Cluster 3. This signature variability was mirrored by corresponding PRS patterns: Cluster 3 showed strong enrichment for cardiovascular risk factors, while Cluster 2 showed depletion in these same traits (see Extended Data S4-S6).
To systematically identify genetic loci associated with signature trajectories, we performed genome-wide association studies (GWAS) on lifetime signature exposure for each signature, computed as the area under each individual’s signature loading curve over their entire follow-up period (S12; Methods), after refitting our model excluding the genetic mean (the $g_i^{\top} \Gamma_k$ term in Equation 3) from the prior on the latent loadings $\lambda$. This approach investigates whether signature trajectories themselves have heritable components beyond the genetic effects we explicitly modeled. LD score regression analysis revealed significant SNP-based heritability for multiple signatures (Table S9), with the strongest signal observed for the cardiovascular signature ($h^2 = 0.041$, SE = 0.003), followed by the musculoskeletal (SE = 0.002) and pain/inflammation (SE = 0.002) signatures. All analyses showed appropriate genomic control and negligible population stratification (intercept ≈ 1.0), confirming that our signatures represent biologically meaningful patterns with distinct genetic architectures.
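Lifetime signature exposure, defined as the area under an individual's loading curve, can be approximated with the trapezoidal rule. This is a minimal sketch that assumes all individuals share one age grid; in practice follow-up periods differ between individuals, and the function name and inputs here are hypothetical.

```python
import numpy as np

def lifetime_exposure(theta, ages):
    """Trapezoidal area under each signature loading curve.

    theta: loadings with age on the last axis, e.g. shape (N, K, T)
    ages:  age grid of shape (T,), in years
    returns: array of shape theta.shape[:-1]
    """
    theta = np.asarray(theta, dtype=float)
    ages = np.asarray(ages, dtype=float)
    widths = np.diff(ages)                                # interval lengths
    heights = 0.5 * (theta[..., 1:] + theta[..., :-1])    # mean of the endpoints
    return (heights * widths).sum(axis=-1)

ages = np.arange(40.0, 51.0)                   # ten years of follow-up
constant = np.full((1, 1, len(ages)), 0.25)    # loading fixed at 0.25
exposure = lifetime_exposure(constant, ages)   # 0.25 * 10 years
```

The resulting per-individual, per-signature values form a continuous phenotype that can be passed to standard GWAS software.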
This genetic validation analysis identified 150 genome-wide significant loci across 15 of 21 signatures, with the cardiovascular signature alone accounting for 56 loci (37% of all discoveries) (Extended data S7-S27 for lead variants across signatures). This signature-based approach substantially outperformed traditional single-disease GWAS in detecting disease-associated variants: our cardiovascular signature analysis identified 23 loci unique to the signature when compared with external GWAS of myocardial infarction (29 loci), hypercholesterolemia (42 loci), and angina (26 loci) (Figure 4C). This enhanced discovery stems from three key factors: aggregation of signals across related conditions increases effective sample size; the continuous nature of signature loadings provides greater statistical power than binary disease endpoints; and signatures capture shared biological processes that may have stronger genetic determinants than individual disease manifestations. When associating significant loci in each signature with component trait genotype dosage, we found similar improvements across signatures (Figure S13). This substantial genetic signal, independent of our explicitly modeled genetic effects, provides strong evidence that our disease signatures capture genuine biological processes with distinct genetic architectures rather than statistical artifacts (Extended Data Files 7–26).
For the regional overlap analysis visualized in UpSet plots (Fig 4C), we defined variants as overlapping if they were located within 1 Mb of each other, reflecting the potential for different lead variants to tag the same causal locus through linkage disequilibrium. This approach substantially increased the overlap between our cardiovascular signature and individual disease GWAS compared to exact SNP matching, revealing shared genetic architecture that might be missed by traditional single-disease analyses. In contrast, for direct genotype-phenotype association testing, we used the exact signature lead SNPs to test for association with component trait phenotypes, providing a threshold-independent assessment of biological effects (Figure S13).
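The window-based overlap rule can be sketched as a simple distance check between lead variants on the same chromosome. The interpretation of the window as a maximum distance of 1 Mb between lead variants, the function name, and the toy coordinates are assumptions for illustration only.

```python
def shared_loci(signature_loci, trait_loci, window=1_000_000):
    """Return signature lead variants lying within `window` base pairs of any
    trait lead variant on the same chromosome.

    Each locus is a (chromosome, position) tuple.
    """
    shared = []
    for chrom, pos in signature_loci:
        if any(c == chrom and abs(p - pos) <= window for c, p in trait_loci):
            shared.append((chrom, pos))
    return shared

# Hypothetical coordinates, not taken from the study.
signature = [("1", 55_000_000), ("9", 22_100_000), ("19", 45_400_000)]
trait = [("1", 55_600_000), ("9", 30_000_000), ("2", 45_400_000)]

overlap = shared_loci(signature, trait)
unique_to_signature = [locus for locus in signature if locus not in overlap]
```

Exact SNP matching corresponds to `window=0`, which is why the windowed definition reports more overlap between a signature and its component traits.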
Linkage disequilibrium score regression (21) analysis across a broad set of representative traits confirmed expected trait enrichment and depletion in non-signature associated traits (Figure 4D; Table S9). These findings demonstrate that ALADYNOULLI’s unified modeling approach not only improves disease prediction but also enhances genetic discovery by leveraging shared biological pathways across related conditions, potentially informing more targeted prevention strategies based on an individual’s genetic risk profile and signature associations.
Dynamic risk assessment improves disease prediction
A primary motivation for modeling longitudinal disease patterns is to improve prediction of future disease events. To rigorously evaluate ALADYNOULLI’s predictive performance, we implemented comprehensive, leakage-free validation strategies that mimic real-world clinical follow-up (Table S4). Our primary approach uses landmarking methodology (22), where we evaluate prediction performance at 30 distinct time points (landmarks) during follow-up, spanning ages 40 to 70 years. At each landmark, we use a model trained specifically for that time point, ensuring predictions are based only on information available up to that time. This approach reflects how the model would be used in clinical practice, where clinicians must make predictions at various points in a patient’s journey, and provides a systematic temporal evaluation of how predictive accuracy evolves as patients accumulate new diagnoses and ALADYNOULLI updates with new information. We also evaluate predictions made once per patient at recruitment (i.e., 2006–2010 in UKB, Fig S2) against 1-year and 10-year outcomes (ALADYNOULLI recruitment 1 year, 10 year) for comparison with traditional clinical risk scores. All analyses were performed strictly prospectively, ensuring that only data available up to the prediction time were used for each individual. Individuals with prevalent disease at prediction time were excluded (see Methods). Finally, we compared with traditional Cox modeling (23) using age as the time scale, also on 10-year outcomes, with or without ALADYNOULLI as a predictor.
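The landmark evaluation can be sketched as a loop over landmark ages that excludes prevalent cases, labels events in the following horizon, and summarizes discrimination by the median AUC. This simplified illustration ignores censoring and assumes the risk predictions at each landmark were already produced using only data available up to that age; all names and toy values are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U) with average ranks for tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos, n_neg = int(labels.sum()), int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined without both classes
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def median_landmark_auc(risk, onset_age, landmarks, horizon=1.0):
    """Median AUC over landmarks.

    risk:      (N, L) predicted risk at each landmark
    onset_age: (N,) age at first diagnosis, np.inf if never diagnosed
    landmarks: (L,) landmark ages
    """
    aucs = []
    for j, age in enumerate(landmarks):
        at_risk = onset_age > age                      # drop prevalent disease
        event = at_risk & (onset_age <= age + horizon)  # onset within horizon
        aucs.append(auc(risk[at_risk, j], event[at_risk]))
    return float(np.nanmedian(aucs))

onset_age = np.array([50.5, np.inf, 51.5, np.inf])
risk = np.array([[0.9, 0.0], [0.1, 0.2], [0.3, 0.8], [0.2, 0.1]])
summary = median_landmark_auc(risk, onset_age, landmarks=np.array([50.0, 51.0]))
```

In the toy data, the first individual is excluded at the second landmark because onset occurred before age 51, mirroring the exclusion of prevalent disease at prediction time.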
As shown in Figure 5A and Table S5, ALADYNOULLI demonstrates three key advantages over traditional approaches. First, it achieves substantial improvements in predictive accuracy across a broad range of diseases (AUC increase up to 0.20) and prediction periods. The dynamic risk predictions, which update in this analysis at 30 distinct time points during follow-up using only information available up to that time point, yield substantially higher AUCs than standard Cox models without ALADYNOULLI (e.g., ASCVD: 0.901 vs 0.634; Heart Failure: 0.838 vs 0.592; Diabetes: 0.814 vs 0.600; Table S6, S16, S18). Second, this systematic evaluation across multiple prediction time points demonstrates the model’s robust performance in real-world scenarios where predictions must be made at various points in a patient’s clinical journey. Finally, predictions for the featured diseases (and beyond) come simultaneously from a single ALADYNOULLI model: a key strength is its ability to provide robust predictions across multiple disease categories without disease-specific optimization. Unlike traditional approaches that require separate models for each condition, our unified framework leverages shared information across related diseases, which is especially valuable for conditions where limited training data can be supplemented by biological connections to more common diseases. For example, secondary cancers (annual incidence ≈ 0.03%) showed substantial improvement in prediction accuracy (AUC: 0.712 vs 0.508 for traditional models), likely due to shared biological pathways with more common primary malignancies captured by our unified signature approach.
Figure 5: Multi-Disease Risk Prediction Performance and Model Interpretation.
(A) Discrimination performance across the top 16 diseases, measured by the area under the ROC curve (AUC) in a prospective, leakage-free framework. Each dot represents a different modeling approach. The primary approach, Median Aladynoulli 1-year (highlighted), reflects clinical practice: 1-year AUCs are computed for each year of follow-up using only data available up to that year, and the median AUC across years is reported. This represents how the model would be used in real-world clinical settings, making 1-year predictions at each patient visit. Aladynoulli Recruitment (1-year) uses predictions made at recruitment to evaluate 1-year outcomes, while Aladynoulli Recruitment (10-year) uses predictions made at recruitment to evaluate 10-year outcomes for comparison with clinical risk scores. PREVENT and PCE models are evaluated for their ability to predict 10-year outcomes using only recruitment data available at the time of study center visit. Cox models are fit using age as the time scale and include either Aladynoulli predictions, family history, and sex, or only family history and sex as covariates. All analyses exclude individuals with prevalent disease at time of prediction and use only information available up to the time of prediction, ensuring a fully prospective evaluation. (B) Calibration plot across all follow-up periods for all at-risk individuals, showing observed versus predicted event rates on a log-log scale. Each point represents a bin of predicted risk, annotated with sample size; summary statistics (MSE, mean predicted, mean observed, and total sample size) are provided. (C) Model 10-year risk predictions versus incidence-based risk for ASCVD, stratified by age and percentiles. Solid lines show model-predicted mean and percentiles; the dashed line shows prevalence-based risk. The annotated coefficient indicates the correlation between predicted and observed risk.
(D) ROC curves for each year of the 10-year ASCVD prediction horizon, comparing the Aladynoulli model (AUC = 0.90), the PREVENT model (AUC = 0.649), and the Pooled Cohort Equations (PCE, AUC = 0.664). (E) Softmax-transformed trajectories of the latent patient loadings: the upper panel shows individual patient trajectories for myocardial infarction (MI), censored prior to event; the lower panel shows mean trajectories for MI cases and controls, illustrating dynamic risk evolution over age.
We further evaluated ALADYNOULLI’s performance across age-specific prediction time points spanning ages 40 to 70 years, providing a comprehensive assessment of how predictive accuracy evolves across the adult lifespan. This analysis, which evaluated 30 distinct prediction time points using cumulative data inclusion, showed strong discrimination. Key diseases showed high age-specific discrimination: ASCVD achieved a median AUC of 0.985 (0.969, 0.99) across 28 years of evaluation, Breast Cancer a median AUC of 0.981 (0.961, 0.991) across 23 years, and Diabetes a median AUC of 0.948 across 25 years (Extended Figure S14). The systematic evaluation across multiple age-specific cohorts demonstrates the model’s robust performance in real-world scenarios where predictions must be made at various points in a patient’s clinical journey, with performance generally improving as more cumulative data becomes available. This approach also revealed that the previous 10-year rolling window methodology significantly underestimated the model’s performance.
The model also demonstrates excellent calibration (Figure 5B), with predicted probabilities closely matching observed event rates across the risk spectrum. This is crucial for clinical decision making, which requires reliable and actionable risk estimates. To illustrate ALADYNOULLI’s ability to capture evolving risk, we examined signature activation trajectories preceding disease onset by censoring individual data 5 years prior to events (Figure 5E; Methods). Examples of patients diagnosed with myocardial infarction reveal increases in cardiovascular signature activation 2–3 years before clinical events. Notably, these patterns emerge even when the target disease is censored from input data, indicating that the model captures informative signals from related comorbidities.
We further evaluated ALADYNOULLI’s performance for ASCVD (atherosclerotic cardiovascular disease) risk prediction, first in the general population and then in specific high-risk subgroups. In the overall cohort, ALADYNOULLI outperformed both PREVENT (AUC: 0.649) and PCE (AUC: 0.664) (Figure 5D), with particularly strong performance in sex-based analyses (males: ALADYNOULLI 0.701 vs PREVENT 0.597; females: ALADYNOULLI 0.667 vs PREVENT 0.657, Figure S15). We also evaluated the GAIL model (24) for breast cancer because family history data were available for comparison in the UKB. Of note, many disease-specific clinical scores require information not available from biobank-level interviews, though the detailed nature of the UK Biobank did provide these variables. ALADYNOULLI achieved a higher 10-year AUC than the GAIL model (0.649 vs 0.543) (Figure S17).
We then specifically evaluated ALADYNOULLI’s performance in patients with pre-existing rheumatoid arthritis (RA) and breast cancer (BC), comparing against both the Pooled Cohort Equations (PCE) and the PREVENT (25) model (Figures S15, S16, S18). This analysis investigates whether ALADYNOULLI maintains predictive accuracy in the presence of confounding comorbidities that can mask cardiovascular risk signals, a common challenge in clinical practice. For 10-year ASCVD outcomes, we used the static version of our leakage-free prediction approach to compute baseline risk for each individual, as the number of ASCVD events per year in these high-risk subgroups was too small to allow stable estimation of dynamic 1-year AUCs (see Methods). Under this strict evaluation, ALADYNOULLI outperformed existing models, achieving AUCs of 0.681 (RA) and 0.630 (BC) compared to 10-year risk for PREVENT (RA: 0.659, BC: 0.54).
Discussion
We presented ALADYNOULLI, a novel Bayesian framework for modeling dynamic disease signatures and individual health trajectories from longitudinal health records and germline genetic data. By integrating these two data modalities, ALADYNOULLI provides a unified framework for understanding disease comorbidities, predicting future disease events, and discovering genetic architecture underlying complex phenotypes. This work addresses a critical gap in precision medicine (26, 27), where the integration of diverse data sources remains challenging despite the promise of personalized approaches to disease management (28). Unlike traditional disease-specific predictive models that require separate development for each condition, ALADYNOULLI’s unified framework simultaneously captures risk for multiple diseases, enabling information-sharing across related conditions, improved prediction for diseases with sparse data, and comprehensive decision support across clinical disciplines.
Our model’s identification of consistent disease signatures across three independent cohorts supports their biological validity and clinical relevance. These signatures capture meaningful disease relationships that align with known pathophysiological processes while revealing novel connections between conditions that may share underlying mechanisms. The temporal dynamics of these signatures further enhance our understanding of how disease risk evolves throughout the life course, addressing the need for more sophisticated approaches to understanding disease progression beyond static risk assessment (29).
The integration of genetic information also represents a significant advance over existing approaches. By directly modeling genetic influences on signature associations, ALADYNOULLI provides biological interpretability while improving predictive performance. The identification of genetic variants that associate more strongly with signature loadings than with individual diseases suggests our approach may uncover novel mechanisms with pleiotropic effects across many established diseases but weaker effects on individual diagnoses—these may represent more biologically critical pathways and better targets for therapeutic interventions than traditional single-disease GWAS approaches.
Beyond risk prediction, ALADYNOULLI’s identification of disease signatures and individual trajectories has important implications for precision medicine and therapeutic development (26,27). By revealing distinct patient subgroups with shared biological mechanisms, the model can inform more targeted therapeutic strategies that align with the vision of personalized medicine (28). This approach addresses the critical need for better patient stratification in clinical practice, where traditional diagnostic categories often mask underlying biological heterogeneity (30).
First, signature profiles can help identify patients likely to respond to specific interventions. For example, individuals with strong metabolic signature contributions to their coronary disease may benefit more from intensive glucose management, while those with inflammatory signature patterns might respond better to anti-inflammatory approaches. This targeted approach represents a key advancement toward the promise of precision medicine (26), where treatments are tailored to individual biological profiles rather than applied uniformly across broad diagnostic categories.
Second, the model can detect changing risk profiles in real-time as patients accumulate new diagnoses, allowing for dynamic adjustment of preventive strategies. Figure 3A–C demonstrates this capability, showing how patients’ risk trajectories are updated following new clinical information, potentially triggering changes in monitoring or intervention intensity. This dynamic approach aligns with the emerging paradigm of digital medicine (31), where continuous monitoring and real-time risk assessment enable more responsive and personalized care.
Finally, signature-based patient stratification has strong potential to enhance clinical trial efficiency by identifying more homogeneous patient populations and more appropriate controls. By enrolling patients with similar signature profiles, trials might achieve greater treatment effects and identify responder subgroups more effectively. This approach could mitigate the high failure rates in clinical trials by ensuring more biologically appropriate study populations (32), while also advancing our understanding of treatment response heterogeneity.
Several limitations should be acknowledged. First, our model relies on EHR data, which may contain biases related to healthcare access, diagnostic coding practices, and incomplete capture of disease history. These limitations are common to all EHR-based studies and highlight the importance of validating findings across multiple healthcare systems, as we have done here. Second, while we incorporate genetic factors, we do not explicitly model environmental exposures or lifestyle factors that significantly influence disease risk (29). Third, our use of established PRS may miss genetic effects that act directly on signatures but weakly on individual diagnoses, as our signature-based GWAS identified loci not captured by traditional single-disease approaches. Fourth, our model makes several important assumptions including linearity in genetic effects and additivity in signature contributions, which may not capture all complex interactions. Future work could integrate these additional data sources and relax these assumptions to further enhance predictive performance and biological insight, addressing the complex interplay between genetic and environmental factors that shape the risk of disease (33).
Despite these limitations, ALADYNOULLI represents a significant advance in longitudinal health modeling with important implications for precision medicine (26, 27). By capturing the complex interplay between genetic predisposition and time-varying disease patterns, our approach provides a framework for more personalized risk assessment and potential therapeutic targeting. Our model’s ability to identify meaningful patient subgroups within traditional disease categories, coupled with enhanced genetic discovery power, moves beyond simple risk prediction to provide deeper insights into disease biology and patient heterogeneity. These capabilities could inform more targeted clinical trials and intervention strategies, ultimately leading to more effective personalized prevention and treatment approaches.
As healthcare increasingly moves toward data-driven precision approaches (28, 31), a method like ALADYNOULLI that can integrate diverse data sources and model complex temporal relationships can become increasingly valuable for improving patient outcomes. The integration of longitudinal EHR data with genetic information represents a powerful approach to understanding disease biology and improving clinical decision-making, addressing key challenges in modern medicine including the need for more accurate risk prediction, better patient stratification, and enhanced therapeutic targeting (30). This work contributes to the broader vision of precision medicine where individual biological profiles guide clinical decision-making, moving beyond the limitations of traditional diagnostic categories toward more nuanced and personalized approaches to disease management.
Materials and Methods
Cohorts
Data are drawn from three distinct biobanks: Massachusetts General Brigham (MGB), UK Biobank (UKB), and All of Us (AOU). Each cohort is described below and in Table S2 (see also S3).
Massachusetts General Brigham Biobank (MGBB)
MGBB is an integrated research initiative based in Boston, Massachusetts (15). It collects biological samples and health data from consenting individuals at Massachusetts General Hospital, Brigham and Women’s Hospital, and local healthcare sites within the MGB network. Since July 1, 2010, the MGBB has enrolled more than 140,000 participants and extracted DNA from approximately 90,000 participants’ samples, and 53,306 participants were genotyped by Illumina Global Screening Array (Illumina, CA). All participants provided their informed written / electronic consent. EHR data are available on all participants from approximately 1990 (see S7). We used a subset of 48,069 for whom EHR and genetic data were available.
UK Biobank (UKB)
The UKB is a large-scale, population-based cohort that recruited over 500,000 participants aged 40–69 years between 2006 and 2010 from across the United Kingdom (4, 34). The cohort includes extensive phenotypic data, biological samples, and longitudinal follow-up of health outcomes. Genotyping was performed using the UK BiLEVE array or the UKB Axiom array, with subsequent imputation to the Haplotype Reference Consortium (HRC) and UK10K reference panels. Participants were genotyped to investigate genetic contributions to various health and disease traits, with particular attention to the relationship between genetic variants and cardiometabolic diseases. Electronic health records are available on all participants from approximately 1980 (35) and thus allow access to clinical diagnostic data prior to the recruitment date. We used the subset of 427,239 for whom genomic and EHR data were available. Polygenic risk scores (PRS) were obtained from an external set of controls (36).
All of Us (AOU)
The AOU research program (37) is a large-scale cohort study designed to increase the representation of historically understudied populations in biomedical research. Since 2018, AOU has enrolled adults at more than 730 US sites. Of the 800,000+ consented participants, more than 560,000 have completed core enrollment requirements, including health questionnaires and biospecimen collection. Data from these participants are continuously linked to electronic health records (EHR), which capture ICD-9 / ICD-10, SNOMED, and CPT codes. Genetic data include array-based genotyping from 315,000 participants and whole genome sequencing (WGS) from 245,394 participants, who were then available to contribute polygenic risk scores for downstream analyses.
Preprocessing and Disease Encoding
Following the approach of Jiang et al. (7), we initially analyzed 348 PheCode diseases from UK Biobank that were selected based on prevalence thresholds (≥1,000 occurrences) to ensure sufficient statistical power for comorbidity analysis. Disease records were mapped from ICD-10/ICD-10CM codes to PheCodes using a standardized three-step procedure. To validate our findings across independent populations, we then applied the same disease selection strategy to All of Us (AOU) and Mass General Brigham (MGB) cohorts using their respective ICD coding systems. In AOU, we extracted ICD-9 and ICD-10 codes directly from the OMOP Common Data Model condition occurrence tables (37), successfully reproducing all 348 diseases from the UK Biobank selection. In MGB, we similarly used ICD-9 and ICD-10 codes, reproducing 346 of the 348 diseases for validation analyses. This multi-cohort approach enabled us to assess the generalizability of disease signatures across different healthcare systems and populations while maintaining consistency in the underlying disease definitions used for ALADYNOULLI model development and validation. We observed a 79.2% correspondence between matched signatures (2).
Model
We recapitulate the model’s formulation and elaborate on important modeling choices and implementation details. The results in this paper describe our application to the UKB dataset; however we also applied this to the MGBB and AOU datasets to establish consistency as in (2).
Mathematical Formulation
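The ALADYNOULLI model represents the probability of disease occurrence $\pi_{i,d,t}$ for patient $i$, disease $d$, at time $t$ as:
$$\pi_{i,d,t} = \kappa \sum_{k=1}^{K} \theta_{i,k,t}\,\mathrm{sigmoid}(\phi_{k,d,t}), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}},$$
where $\kappa$ is a global calibration parameter, $\theta_{i,k,t}$ represents patient $i$’s time-varying association with signature $k$, and $\phi_{k,d,t}$ captures the relationship between signature $k$ and disease $d$ over time.
The patient-signature associations are parameterized as a softmax function of latent variables $\lambda_{i,k,t}$ as:
$$\theta_{i,k,t} = \frac{\exp(\lambda_{i,k,t})}{\sum_{k'=1}^{K} \exp(\lambda_{i,k',t})}.$$
These patient-specific latent variables in turn follow a Gaussian process prior:
$$\lambda_{i,k} \sim \mathcal{GP}\left(r_k + g_i^{\top}\Gamma_k,\; K_\lambda\right),$$
where $r_k$ is a signature-specific baseline, $\Gamma_k$ captures how the genetic/demographic factors $g_i$ influence patient-signature associations, and $K_\lambda$ is a kernel function ensuring temporal smoothness. The covariate matrix $G$ (with rows $g_i$) contains 36 polygenic risk scores plus sex (37 features total), providing genetic and demographic information for each individual.
The kernel function is defined as:
$$K_\lambda(t, t') = \sigma_\lambda^{2} \exp\left(-\frac{(t - t')^{2}}{2\ell_\lambda^{2}}\right).$$
In our implementation, the amplitude parameter is set to 100 and the length-scale parameter $\ell_\lambda$ is held fixed.
Similarly, the disease-signature associations follow a Gaussian process:
$$\phi_{k,d} \sim \mathcal{GP}\left(\mu_d + \psi_{k,d},\; K_\phi\right),$$
where $\mu_d$ is a disease-specific baseline derived from the logit of the population prevalence, $\psi_{k,d}$ represents the overall strength of association between signature $k$ and disease $d$, and $K_\phi$ is a kernel function defined as:
$$K_\phi(t, t') = \sigma_\phi^{2} \exp\left(-\frac{(t - t')^{2}}{2\ell_\phi^{2}}\right),$$
where both the amplitude and the length-scale $\ell_\phi$ are fixed in our implementation.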
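The formulation above can be illustrated for a single patient, disease, and time point. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the numbers are invented, and it assumes the mixture form described in the text (softmax-transformed loadings weighting sigmoid-transformed signature-disease terms, rescaled by a global calibration parameter).

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def disease_probability(lam_i_t, phi_d_t, kappa=1.0):
    """Mixture of probabilities for one patient, one disease, one time point.

    lam_i_t: latent loadings, one value per signature (hypothetical input)
    phi_d_t: signature-disease logits, one value per signature
    kappa:   global calibration parameter
    """
    theta = softmax(lam_i_t)  # patient-signature weights, sum to 1
    return kappa * sum(th * sigmoid(ph) for th, ph in zip(theta, phi_d_t))

# Toy example with three signatures (all numbers are made up).
lam = [2.0, 0.0, -1.0]
phi = [-2.0, -6.0, -8.0]
pi = disease_probability(lam, phi, kappa=1.0)
```

Because the weights sum to one, the resulting probability (before rescaling) always lies between the smallest and largest per-signature probability.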
Dynamic Range of Predictions
While our mixture-of-probabilities formulation has key advantages, as described in the main text, this approach introduces technical challenges that require careful parameterization. A key challenge with the mixture of probabilities formulation is that it naturally leads to a reduction in the variation of the predicted probabilities across individuals. When multiple sigmoid-transformed values are averaged, the resulting mixture tends to concentrate around moderate values, reducing the dynamic range of predictions. To address this:
Signature-Disease Specificity ($\psi_{k,d}$): We introduce the time-independent parameters $\psi_{k,d}$ above to allow each signature to have strong positive or negative associations with specific diseases. This increases the separation between signatures and ensures a realistic dynamic range of the disease probabilities within each signature.
Global Calibration ($\kappa$): The global calibration parameter $\kappa$ is necessary to rescale the final probabilities to obtain realistic overall disease prevalences. In our implementation, $\kappa$ is learned from the data but in principle could be fixed.
This balance between expressiveness (through the $\psi_{k,d}$’s) and calibration (through $\kappa$) ensures that the model can capture both rare and common diseases accurately while maintaining interpretable signature contributions. The combination of softmax-transformed individual loadings and sigmoid-transformed disease probabilities with these additional parameters ensures proper probability scaling.
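A small numerical illustration of the dynamic-range argument follows. It is a sketch under stated assumptions: two signatures, a hypothetical baseline logit of −5, and signature-disease offsets of either 0 (no specificity) or 1 and −2 (the initialization values mentioned under Hyperparameter Specification). The function name and all numbers are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mixture_prob(theta, mu_d, psi, kappa=1.0):
    """kappa * sum_k theta_k * sigmoid(mu_d + psi_k) for one disease."""
    return kappa * sum(th * sigmoid(mu_d + ps) for th, ps in zip(theta, psi))

mu_d = -5.0        # hypothetical baseline logit (rare disease)
high = [0.8, 0.2]  # patient loaded mostly on signature 0
low = [0.2, 0.8]   # patient loaded mostly on signature 1

# Without signature-disease specificity, loadings cannot separate patients.
flat = [0.0, 0.0]
ratio_flat = mixture_prob(high, mu_d, flat) / mixture_prob(low, mu_d, flat)

# With a strong positive offset on one signature and a negative one elsewhere,
# the same loadings produce clearly different risks.
specific = [1.0, -2.0]
ratio_specific = mixture_prob(high, mu_d, specific) / mixture_prob(low, mu_d, specific)
```

In this toy setting the risk ratio between the two patients is 1 without the offsets and above 3 with them, while the calibration parameter rescales both risks by the same factor and leaves the ratio unchanged.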
Censored Data
A critical aspect of ALADYNOULLI is its careful handling of censored observations, which is part of what allows it to function as a generative model of disease progression rather than a retrospective analysis tool. The loss function incorporates time-to-event information through disease-specific censoring times $E_{i,d}$. For each individual $i$ and disease $d$, we observe the data only up to time $E_{i,d}$, which represents either:
The time of disease onset (event time), or
The time of last follow-up without disease (censoring time)
The likelihood is constructed to respect this censoring structure (23, 38). Consider first $\mathcal{L}_{\mathrm{NLL}}$, the negative log-likelihood, that is, the negative log of the probability of the observed disease histories conditional on the disease probabilities $\pi_{i,d,t}$:
$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{i}\sum_{d}\left[\sum_{t < E_{i,d}} \log\left(1 - \pi_{i,d,t}\right) + Y_{i,d}\,\log \pi_{i,d,E_{i,d}} + \left(1 - Y_{i,d}\right)\log\left(1 - \pi_{i,d,E_{i,d}}\right)\right],$$
where $Y_{i,d} = 1$ if disease $d$ occurred in individual $i$ at time $E_{i,d}$ and $Y_{i,d} = 0$ if the individual was censored at that time.
This expression has three key components:
Pre-event survival: For all time points $t < E_{i,d}$ before the event/censoring time, we know that the individual did not have the disease, contributing a factor $1 - \pi_{i,d,t}$ to the likelihood at each time.
Event occurrence: If the disease occurred at time $E_{i,d}$, we observe the event. The corresponding contribution is $\pi_{i,d,E_{i,d}}$.
Censored observation: If the individual was censored without disease at time $E_{i,d}$, we only know they were disease-free at that time, contributing $1 - \pi_{i,d,E_{i,d}}$.
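The three components above can be written out for a single individual-disease pair. This is a minimal sketch assuming a discrete yearly time grid and probabilities strictly between 0 and 1; the function and argument names are hypothetical and the example probabilities are invented.

```python
import math

def censored_nll(pi, event_time, had_event):
    """Negative log-likelihood for one individual-disease pair.

    pi:         per-time-step disease probabilities (index = time step)
    event_time: index of disease onset or of last follow-up
    had_event:  True if the disease occurred at event_time, False if censored
    """
    nll = 0.0
    # Pre-event survival: disease-free at every step before event_time.
    for t in range(event_time):
        nll -= math.log(1.0 - pi[t])
    if had_event:
        # Event occurrence at event_time.
        nll -= math.log(pi[event_time])
    else:
        # Censored observation: still disease-free at event_time.
        nll -= math.log(1.0 - pi[event_time])
    # Time steps after event_time are never used.
    return nll

pi = [0.01, 0.02, 0.05, 0.10]
nll_event = censored_nll(pi, event_time=2, had_event=True)
nll_censored = censored_nll(pi, event_time=2, had_event=False)
```

The full loss would sum this quantity over all individuals and diseases; note that predicted probabilities after the event or censoring time do not enter the sum.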
Key distinction from retrospective models:
This censoring-aware likelihood ensures that the model learns to predict disease risk prospectively. Unlike retrospective clustering approaches that analyze complete disease histories, ALADYNOULLI models the probability of future disease onset given only the information available up to each time point. This makes it suitable for: (1) real-time risk prediction in clinical settings, (2) modeling chronic diseases that develop over extended periods, (3) handling varying follow-up times across individuals, and (4) avoiding information leakage from future events.
Objective Function for Maximum a Posteriori (MAP) Computation
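The computational derivation of the MAP estimate of the unknown parameters in the model proceeds by optimizing the function $\mathcal{L}$, which includes the negative log-likelihood $\mathcal{L}_{\mathrm{NLL}}$ as well as terms arising from the Gaussian process priors.
The logs of the Gaussian process terms, as previously specified, are (up to additive constants):
$$\log p(\lambda_{i,k}) = -\tfrac{1}{2}\left(\lambda_{i,k} - (r_k + g_i^{\top}\Gamma_k)\mathbf{1}\right)^{\top} K_\lambda^{-1} \left(\lambda_{i,k} - (r_k + g_i^{\top}\Gamma_k)\mathbf{1}\right),$$
$$\log p(\phi_{k,d}) = -\tfrac{1}{2}\left(\phi_{k,d} - (\mu_d + \psi_{k,d})\mathbf{1}\right)^{\top} K_\phi^{-1} \left(\phi_{k,d} - (\mu_d + \psi_{k,d})\mathbf{1}\right),$$
where $\lambda_{i,k}$ and $\phi_{k,d}$ denote the latent trajectories over the time grid, $r_k + g_i^{\top}\Gamma_k$ and $\mu_d + \psi_{k,d}$ are the corresponding prior means, and $K_\lambda$ and $K_\phi$ are the kernel matrices.
Combining these terms, the negative log posterior (the “loss” for short) is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{NLL}} - \sum_{i,k} \log p(\lambda_{i,k}) - \sum_{k,d} \log p(\phi_{k,d}).$$
This formulation enables ALADYNOULLI to learn disease progression patterns that respect the temporal structure of the data and provide clinically actionable predictions. Specifically, the negative log-likelihood leads the model to accurately predict the timing of disease onset. To encourage smoothness and temporal coherence in the latent trajectories, we use GP priors on both the individual-specific latent variables $\lambda$ and, if applicable, the disease-time effects $\phi$. The implied regularization terms penalize deviations from the GP prior mean and covariance structure. In models with cluster structure, additional penalties may be included to encourage biologically meaningful clustering of diseases or signatures.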
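To make the regularization term concrete, the sketch below evaluates the quadratic GP penalty for one latent trajectory using a Cholesky factorization, in plain Python. It rests on assumptions not confirmed by the text: a squared-exponential kernel, a yearly time grid, and illustrative amplitude and length-scale values. All function names are hypothetical.

```python
import math

def se_kernel(times, amplitude, length_scale, jitter=1e-8):
    """Squared-exponential kernel matrix with jitter on the diagonal (assumed form)."""
    n = len(times)
    K = [[amplitude ** 2 * math.exp(-0.5 * ((times[a] - times[b]) / length_scale) ** 2)
          for b in range(n)] for a in range(n)]
    for a in range(n):
        K[a][a] += jitter
    return K

def cholesky(K):
    """Lower-triangular L with L L^T = K, for a positive definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def gp_quadratic_penalty(x, mean, K):
    """0.5 * (x - mean)^T K^{-1} (x - mean), computed via a Cholesky solve."""
    L = cholesky(K)
    r = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution solves L z = r, so that r^T K^{-1} r = z^T z.
    z = []
    for i in range(len(r)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((r[i] - s) / L[i][i])
    return 0.5 * sum(v * v for v in z)

times = [0, 1, 2, 3, 4]
K = se_kernel(times, amplitude=1.0, length_scale=1.5)
mean = [0.0] * 5
smooth = [0.5, 0.5, 0.5, 0.5, 0.5]     # constant offset from the prior mean
wiggly = [0.5, -0.5, 0.5, -0.5, 0.5]   # same magnitude, rapidly oscillating
penalty_smooth = gp_quadratic_penalty(smooth, mean, K)
penalty_wiggly = gp_quadratic_penalty(wiggly, mean, K)
```

The oscillating trajectory receives a much larger penalty than the smooth one of equal magnitude, which is the sense in which the prior encourages temporal coherence.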
Training, Validation and Testing Architecture
Our analytical approach consists of two distinct stages: retrospective analysis for model training and cross-cohort validation, followed by prospective analysis for prediction evaluation, summarized in Figure S1.
Retrospective Analysis (Figures 2–4):
We performed disease signature characterization across all three cohorts to demonstrate reproducibility. For computational efficiency and uncertainty quantification, we divided each cohort into non-overlapping subsets: UK Biobank into subsets of approximately 10,000 individuals each (reserving one subset of 10,000 individuals as a held-out test set), All of Us into 30 subsets, and Mass General Brigham into 4 subsets (reflecting the smaller dataset sizes). For each subset within each cohort, we jointly estimated both the disease-signature associations and individual loadings using all available observed data up to age 80 or censoring time, whichever came first. This retrospective approach utilized the complete disease trajectory for each individual, enabling us to characterize the full spectrum of disease signatures and their temporal dynamics. Each subset uses the same pre-computed initial clusters and parameters, ensuring consistent signature interpretations across all subsets within each cohort.
For UK Biobank, the final disease-signature parameters used for population-level analysis were computed as the average across the 39 training subsets (excluding the held-out test set):
$$\bar{\phi} = \frac{1}{39}\sum_{b=1}^{39} \hat{\phi}^{(b)},$$
where $\hat{\phi}^{(b)}$ denotes the disease-signature parameters estimated on training subset $b$.
This subset-averaging approach ensures robust parameter estimates while maintaining computational tractability. The AOU and MGB cohorts serve as external validation datasets, demonstrating the reproducibility of disease signatures across different populations and healthcare systems, but no prediction tasks were performed on these external cohorts.
This subset-averaged $\bar{\phi}$ was used to generate the disease signature visualizations in Figure 2, including the temporal patterns and cross-cohort consistency analyses. Individual trajectory analyses (Figure 3) utilized the subset-specific estimates, and subsequent clustering of patients by their time-averaged signature loadings was performed within each disease category. The genetic analyses (Figure 4) were performed exclusively in UK Biobank, employing the area-under-the-curve of individual signature trajectories as quantitative phenotypes for genome-wide association studies.
Prospective Analysis (Figure 5):
For prediction evaluation, we implemented a strictly prospective framework using only UK Biobank data. We used the subset-averaged parameters from the 39 training subsets as fixed, population-level disease-signature associations, and then estimated individual loadings on the held-out test set of 10,000 UK Biobank individuals, using only data available up to specific prediction time points.
Specifically, for each prediction time point, we censored individual disease histories at that time point and re-estimated only the individual loadings while keeping the population-level disease-signature parameters fixed. This approach simulates real-world clinical scenarios where population-level disease patterns are known from prior research, but individual risk trajectories must be estimated from available clinical history up to the prediction time. All prediction performance metrics reported in this paper are based exclusively on this UK Biobank held-out test set.
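The censoring step used for prediction can be sketched as a split of each person's history at the landmark. This is an illustrative sketch only: the data layout (a mapping from disease name to age at first diagnosis), the function name, and the example ages are assumptions, and end-of-follow-up censoring is not handled here.

```python
def landmark_view(onset_ages, landmark_age, target, horizon=1):
    """Split one person's history at a landmark age.

    onset_ages:   dict mapping disease -> age at first diagnosis
                  (diseases never diagnosed are simply absent)
    landmark_age: age at which the prediction is made
    target:       disease being predicted
    horizon:      prediction window in years

    Returns (visible_history, at_risk, outcome).
    """
    # Only diagnoses recorded up to the landmark may inform the prediction.
    visible = {d: a for d, a in onset_ages.items() if a <= landmark_age}
    # Individuals with prevalent target disease are excluded from evaluation.
    at_risk = target not in visible
    onset = onset_ages.get(target)
    # Outcome: first diagnosis of the target within (landmark, landmark + horizon].
    outcome = at_risk and onset is not None and landmark_age < onset <= landmark_age + horizon
    return visible, at_risk, outcome

person = {"hypertension": 52, "diabetes": 58, "ASCVD": 61}
visible_60, at_risk_60, event_1y = landmark_view(person, 60, "ASCVD", horizon=1)
_, at_risk_62, _ = landmark_view(person, 62, "ASCVD", horizon=1)
_, _, event_10y_from_50 = landmark_view(person, 50, "ASCVD", horizon=10)
```

In the described framework, individual loadings would then be re-estimated from the visible history alone, with the population-level parameters held fixed.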
This two-stage design ensures that: (1) our characterization of disease signatures leverages the full richness of longitudinal data to identify biologically meaningful patterns across multiple populations, while (2) our prediction evaluation maintains strict temporal boundaries to prevent data leakage and provides realistic estimates of clinical predictive performance using a completely independent test set.
Model Initialization
To ensure computational feasibility and parameter stability across large datasets, we implement a two-stage initialization approach. In the first stage, we perform the spectral clustering and initialization described below once on the entire data set to establish stable disease clusters and signature-disease associations. These initial clusters and values are then saved and reused in all subsequent subset analyses.
We initialize the model parameters using spectral clustering (scikit-learn) on disease co-occurrence patterns. We compute a disease co-occurrence matrix $C$, where $C_{d,d'}$ represents the frequency with which diseases $d$ and $d'$ co-occur in the same patient. We apply spectral clustering to this matrix to identify disease clusters.
We initialize the $\psi_{k,d}$ parameters based on cluster membership: for diseases in cluster $k$, we set $\psi_{k,d} = 1 + \epsilon$, where $\epsilon$ is a small random noise, and for diseases not in cluster $k$, we set $\psi_{k,d} = -2$. For the time-varying parameters $\lambda$ and $\phi$, we initialize by drawing a single sample from the corresponding Gaussian process prior with reduced variance to preserve the structured mean initialization. Specifically, we initialize $\lambda_{i,k}$ via a random draw from the Gaussian process prior with mean $r_k + g_i^{\top}\Gamma_k$ and covariance kernel scaled by a reduced amplitude, where $\Gamma_k$ is initialized using regression on disease occurrences. We initialize $\phi_{k,d}$ via a random draw from the Gaussian process prior with mean $\mu_d + \psi_{k,d}$ and the same reduced amplitude, where $\mu_d$ is derived from the logit of disease prevalence. The reduced amplitude ensures that random deviations do not bias the parameters arbitrarily away from the informative mean structure while maintaining temporal smoothness. We generate the GP samples via Cholesky decomposition of the scaled kernel matrices.
This approach provides a structured and plausible initialization that reflects our prior smoothness assumptions, rather than a purely random or fixed initialization. For the cluster-specific parameters $\psi_{k,d}$, we use deterministic values based on cluster membership, with small random noise added for variability.
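The first part of this initialization can be sketched as follows. The example builds a disease co-occurrence matrix from per-patient disease lists and assigns cluster-based starting values of 1 (in-cluster, plus small noise) and −2 (out-of-cluster), the values given under Hyperparameter Specification. The clustering step itself is not shown: cluster labels are taken as given, and the function names, noise scale, and toy data are assumptions for illustration.

```python
import random

def cooccurrence(patient_diseases, n_diseases):
    """C[a][b] = number of patients in whom diseases a and b both occur."""
    C = [[0] * n_diseases for _ in range(n_diseases)]
    for diseases in patient_diseases:
        ds = sorted(set(diseases))
        for a in ds:
            for b in ds:
                if a != b:
                    C[a][b] += 1
    return C

def init_psi(cluster_of, n_signatures, in_value=1.0, out_value=-2.0,
             noise_sd=0.01, seed=0):
    """Cluster-based starting values: one row per signature, one column per disease.

    cluster_of[d] is the cluster label of disease d (assumed to come from
    spectral clustering of the co-occurrence matrix, not computed here).
    """
    rng = random.Random(seed)
    return [[(in_value + rng.gauss(0.0, noise_sd)) if cluster_of[d] == k else out_value
             for d in range(len(cluster_of))]
            for k in range(n_signatures)]

# Toy data: 4 diseases, 4 patients (disease indices are arbitrary).
patients = [[0, 1], [0, 1, 2], [2, 3], [2, 3]]
C = cooccurrence(patients, n_diseases=4)
cluster_of = [0, 0, 1, 1]  # hypothetical cluster labels
psi = init_psi(cluster_of, n_signatures=2)
```

In practice the matrix `C` would be passed to a spectral clustering routine to obtain `cluster_of`; that call is omitted here to keep the sketch dependency-free.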
Hyperparameter Specification
Model selection and hyperparameter specification were performed as follows. The number of latent signatures was chosen to provide a parsimonious balance between model complexity and interpretability, based on prior experience and exploratory analysis. The hyperparameter $\psi$ was initialized at values of 1 and −2, which on the log scale correspond to a 20-fold difference in odds (i.e., 10−9 vs 10−11), thereby spanning a broad range of plausible values for disease risk. All other model parameters were estimated using only the training data, and the final performance of the model was prospectively evaluated on an independent test set. This approach ensures that all reported performance metrics reflect true out-of-sample predictive accuracy, without overfitting or data leakage from hyperparameter tuning.
Optimization Details
We trained the model using gradient descent on the loss. The solution can be interpreted as approximating a maximum a posteriori (MAP) estimate of the parameters, assuming dispersed priors on the model parameters. We minimize the loss over a fixed maximum number of epochs (i.e., complete passes through the entire training dataset). At each epoch, we compute gradients via backpropagation and update parameters using the Adam optimizer in PyTorch, a deep learning framework that allows efficient computation of gradients through automatic differentiation. The model was trained using a learning rate of 0.001; learning rates and regularization strengths are treated as hyperparameters. The model is optimized at a time scale of one year and is thus trained to provide the most accurate 1-year risk predictions. Longer-term risk (e.g., 10-year) can also be derived by simple manipulations of the estimated 1-year risks.
For computational efficiency, we used the Cholesky decomposition to compute the Gaussian process contributions and to sample from the Gaussian process prior during initialization. We also used a jitter term of $10^{-8}$ to ensure numerical stability when computing the inverse of the kernel matrices. We trained the model for up to 1000 epochs, with early stopping based on validation loss to prevent overfitting. In practice, for our subsets of 10,000 individuals, the model converged after 200 epochs.
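As an illustration of how a Gaussian process term can be evaluated with a Cholesky factor and a small jitter, the sketch below computes the negative log density of a trajectory under a multivariate normal prior. The function name `gp_neg_log_density` and the toy kernel matrix are hypothetical; how this term enters the actual loss is not specified here.

```python
import numpy as np

def gp_neg_log_density(x, mean, K, jitter=1e-8):
    """Negative log N(x | mean, K + jitter * I), computed without forming an explicit inverse."""
    n = len(x)
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    alpha = np.linalg.solve(L, x - mean)           # solves L alpha = x - mean
    quad = alpha @ alpha                           # (x - mean)^T (K + jitter I)^{-1} (x - mean)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log |K + jitter I|
    return 0.5 * (quad + log_det + n * np.log(2.0 * np.pi))

K_toy = np.array([[1.0, 0.5], [0.5, 1.0]])         # toy 2 x 2 kernel matrix
x_toy = np.array([0.3, -0.2])
nll = gp_neg_log_density(x_toy, np.zeros(2), K_toy)
```

The jitter only perturbs the diagonal, so for well-conditioned kernels the result is essentially unchanged, while near-singular kernels (long length scales on dense grids) remain factorizable.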
Computation of Probabilities of Future Events
To ensure that our model provides prospective predictions without data leakage, we implement a strict censoring strategy that distinguishes between cohort recruitment and prediction time. This approach allows us to simulate real-world clinical scenarios where predictions are made based only on information available at a specific point in time.
Cohort enrollment time refers to the time point when an individual enters ALADYNOULLI, which for our purposes is age 30. For example, in our UKB analysis, all individuals were followed in the EHR from 1980 forward (39, 40) and thus assigned an enrollment time in our study at young adulthood (age 30) or at the start of EHR follow-up, whichever comes later.
Cohort recruitment time refers to the time when individuals joined the biobank. The UKB recruited individuals aged between 40 and 69 years in the time frame from 2006 to 2010, ensuring a comprehensive cohort for analysis and research purposes.
Prediction time refers to the time when we imagine making a clinical prediction, with knowledge of the health history up until that time. In practice this coincides with recruitment time as above to compare with clinical risk scores.
For example, in the UKB, an individual is observed in the EHR from adulthood and contributes data to our analysis from age 30 until the end of follow-up. However, for prediction analyses, we imagine making predictions at different time points after the cohort recruitment time (see S2). This is also consistent with the time at which an individual presented to the UKB, AoU or MGB recruitment center and contributed the samples necessary to calculate the common clinical risk scores.
For each individual and disease , we encode the time-to-event data for each disease using the standard convention in survival analysis (41) defining the event / censoring time and the event indicator at prediction time as follows:
where is the observed event or censoring time for individual and disease (measured as years since age 30), and is the prediction time for individual , computed as the individual’s recruitment age plus the prediction offset, converted to years since age 30: . This censoring procedure ensures that for each prediction scenario, we use only the disease history that would have been available at that specific time point, thereby preventing data leakage from future events.
This ensures that the model is trained and evaluated only on data that would have been available at the time of prediction, thereby preventing any potential data leakage from future events.
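The censoring step can be illustrated with a short sketch. It assumes the usual survival-analysis convention (time truncated at the prediction time, indicator kept only for events at or before it); the function name, argument names, and the choice to count an event occurring exactly at the prediction time as observed are assumptions of this example.

```python
import numpy as np

def censor_at_prediction(event_time, event_observed, recruit_age,
                         offset_years=0.0, origin_age=30.0):
    """Return (time, indicator) as they would be known at each individual's prediction time.

    event_time:     (N, D) event or censoring times, in years since origin_age
    event_observed: (N, D) booleans, True where event_time is an event
    recruit_age:    (N,)   age at biobank recruitment
    """
    # Prediction time in years since age 30: recruitment age plus offset, shifted by the origin.
    tau = (recruit_age + offset_years - origin_age)[:, None]
    time_at_pred = np.minimum(event_time, tau)
    indicator_at_pred = event_observed & (event_time <= tau)
    return time_at_pred, indicator_at_pred

event_time = np.array([[5.0, 25.0], [12.0, 40.0]])
event_observed = np.array([[True, True], [False, True]])
recruit_age = np.array([50.0, 45.0])    # prediction times are 20 and 15 years since age 30
t_pred, e_pred = censor_at_prediction(event_time, event_observed, recruit_age)
```

In this toy input, the first individual's second disease occurs 25 years after age 30, which is after the prediction time of 20, so it is treated as censored at 20.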
Simulation Study
To validate the ability of the ALADYNOULLI model to recover latent disease clusters and temporal dynamics, we conducted a simulation study using ALADYNOULLI itself as the generative model. This approach allows us to test whether the model can accurately recover known ground truth parameters from synthetic data that follows the exact same probabilistic structure as our proposed model (Figure S19).
We generated synthetic data with individuals, diseases, time points (ages 30–79), latent disease signatures, and genetic covariates. The data generation process follows ALADYNOULLI’s exact mathematical formulation. We first created distinct disease clusters with 4 diseases per cluster, assigning strong positive associations within clusters and strong negative associations outside clusters . Disease baseline trajectories were generated on the logit scale with diverse prevalence patterns ranging from rare (logit prevalence ≈ −14) to common (logit prevalence ≈ −8), incorporating realistic age-dependent onset patterns with varying peak ages and slopes.
For individual trajectories, we generated genetic covariates and genetic effect matrices , then sampled individual signature loadings from Gaussian processes with means and temporal covariance with length scale years. Disease-signature associations were sampled from Gaussian processes with means and temporal covariance with length scale years. Event probabilities were computed using the exact ALADYNOULLI formula: , and disease events were sampled from these time-varying probabilities.
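A minimal sketch of the final sampling step is given below. It assumes that event probabilities are a mixture of signature-specific sigmoid probabilities weighted by softmax-normalized individual loadings; this functional form, the array names `lam` and `phi`, and the independent Bernoulli draw per year are assumptions of the example and may omit details of the full model (for instance calibration terms or removal from the risk set after a first event).

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_events(lam, phi, rng):
    """lam: (N, K, T) individual latent scores; phi: (K, D, T) disease-signature logits."""
    theta = softmax(lam, axis=1)                           # (N, K, T), sums to 1 over signatures
    pi = np.einsum("nkt,kdt->ndt", theta, sigmoid(phi))    # (N, D, T) yearly event probabilities
    events = rng.random(pi.shape) < pi                     # independent Bernoulli draws
    return pi, events

rng = np.random.default_rng(1)
lam = rng.standard_normal((3, 2, 4))          # 3 individuals, 2 signatures, 4 time points
phi = rng.standard_normal((2, 5, 4)) - 4.0    # 5 diseases with low baseline risk
pi, events = simulate_events(lam, phi, rng)
```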
When we applied ALADYNOULLI to these synthetic data, the model successfully recovered the correct number of clusters (5/5), achieved high accuracy in disease cluster assignments (median Jaccard similarity 0.795), and accurately reconstructed the temporal trajectories and genetic effects, demonstrating the model’s ability to identify meaningful biological patterns rather than fitting noise.
Analysis
Stability Across Subsets and Cohorts
We empirically verified that estimates were highly stable across subsets, with remarkably small standard errors (e.g. in UKB mean SE = 0.0010, median SE = 0.0002, with 95% of SE values ≤ 0.004) demonstrating the robustness of our disease-signature associations (Figure S5). This high stability validates our robust subset-averaging approach and confirms that the identified disease signatures represent replicable biological patterns rather than subset-specific noise (Figure S5).
Furthermore, when parameters were independently re-estimated in the AoU and MGB cohorts, they demonstrated strong replicability with the UKB-derived estimates, with high correlation coefficients across disease signatures (median proportion shared between matched signatures 0.792). This cross-cohort replicability provides additional evidence that disease signatures reflect universal biological patterns rather than population-specific variation or healthcare system-specific artifacts.
To further assess the replicability of our disease signatures across different populations (shown in Figure 2C of the main paper), we performed cluster correspondence as follows (Fig 2D). We examined the correspondence between disease clusters identified in each biobank by creating normalized confusion matrices. For each pair of biobanks (UKB vs MGB and UKB vs AoU), we identified the set of diseases common to both biobanks, mapped each disease to its assigned cluster in each biobank, created a cross-tabulation matrix showing the proportion of diseases in each UKB cluster that were assigned to each MGB/AoU cluster, and normalized the counts by row to show the distribution of cluster assignments.
We computed a modified Jaccard similarity index to quantify cross-cohort correspondence. For each UKB cluster , we identified its best-matching cluster in the comparison cohort (the cluster receiving the highest proportion of diseases from that UKB cluster). The modified Jaccard similarity for cluster is defined as:
where is the set of diseases in UKB cluster , is the set of diseases in the best-matching cluster in the comparison cohort, and denotes set cardinality. The overall cross-cohort similarity is the median of these cluster-specific similarities: across all UKB clusters. This metric ranges from 0 (no correspondence) to 1 (perfect correspondence), where higher values indicate stronger replicability of disease clustering patterns across populations.
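The correspondence procedure can be sketched as below. The example uses the standard intersection-over-union Jaccard index restricted to shared diseases; the "modified" index used in the paper may differ in detail, and the function name and toy disease labels are hypothetical.

```python
import numpy as np

def cluster_correspondence(ref_assign, cmp_assign):
    """ref_assign / cmp_assign: dicts mapping disease name -> cluster label in each cohort."""
    shared = sorted(set(ref_assign) & set(cmp_assign))     # diseases common to both cohorts
    ref_labels = sorted({ref_assign[d] for d in shared})
    cmp_labels = sorted({cmp_assign[d] for d in shared})
    counts = np.zeros((len(ref_labels), len(cmp_labels)))
    for d in shared:
        counts[ref_labels.index(ref_assign[d]), cmp_labels.index(cmp_assign[d])] += 1
    row_norm = counts / counts.sum(axis=1, keepdims=True)  # row-normalized confusion matrix
    sims = []
    for r, ref_label in enumerate(ref_labels):
        best = cmp_labels[int(np.argmax(row_norm[r]))]     # best-matching comparison cluster
        a = {d for d in shared if ref_assign[d] == ref_label}
        b = {d for d in shared if cmp_assign[d] == best}
        sims.append(len(a & b) / len(a | b))
    return row_norm, float(np.median(sims))

ukb = {"mi": 0, "angina": 0, "hf": 0, "breast_ca": 1, "colon_ca": 1}
other = {"mi": 5, "angina": 5, "hf": 7, "breast_ca": 7, "colon_ca": 7, "gout": 3}
row_norm, median_sim = cluster_correspondence(ukb, other)
```

In the toy input each reference cluster shares two of three diseases with its best match, so both cluster-level similarities, and hence the median, equal 2/3.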
This analysis revealed strong correspondence between clusters across biobanks (median modified Jaccard similarity = 0.792), particularly for cardiovascular and malignancy signatures, suggesting robust biological patterns that transcend population differences.
For temporal pattern analysis, we performed a detailed comparison of the temporal patterns ( trajectories) for diseases shared across all three biobanks, focusing on two key signatures: the cardiovascular signature (MGB: Sig 5, AoU: Sig 16, UKB: Sig 5) and the malignancy signature (MGB: Sig 11, AoU: Sig 11, UKB: Sig 6). For each signature, we identified diseases assigned to that signature in all three biobanks, plotted the temporal patterns ( values) for each shared disease, overlaid the average pattern across all three biobanks (gray dashed line, 2), and used consistent colors for each disease across biobanks to facilitate comparison. This analysis demonstrated remarkable consistency in the temporal patterns of disease risk across different populations, with shared diseases showing similar risk trajectories despite being modeled independently in each biobank.
Individual Patient Trajectory Visualization
To illustrate the complex interplay of disease signatures in individual patients (shown in Figure 2 A-C of the main paper), we analyzed detailed trajectories for patients with multiple conditions. We identified patients who had at least one target disease of interest, developed multiple conditions (minimum of 2), and had complete follow-up data.
For each selected patient, we created a three-panel visualization. The Signature Dynamics Panel (Top Left) shows the temporal evolution of normalized signature loadings over time, with each signature represented by a distinct colored line, vertical dotted lines marking the timing of each disease diagnosis, and colors consistent across panels matching the primary signature of each diagnosed condition. The Disease Timeline Panel (Bottom Left) displays a chronological sequence of diagnosed conditions, with each condition represented by a horizontal line in its primary signature’s color, diagnosis points marked with filled circles, providing a visualization of disease progression and timing. The Signature Summary Panel (Right) shows a stacked bar chart of time-averaged signature loadings, with each segment representing the average contribution of a signature over the patient’s follow-up, colors matching the signature colors in the other panels, providing a static summary of the patient’s overall signature profile.
This visualization approach allows us to track how signature loadings change before and after each diagnosis, identify which signatures are most active at different time points, understand the temporal relationship between different conditions, and compare the relative contributions of different signatures to the patient’s overall disease profile.
Disease-Specific Trajectory and Heterogeneity Analysis
To systematically quantify differences in signature composition among patients with the same clinical diagnosis and understand disease progression heterogeneity and associated genetic architectures (Figures 3F, 4B-D), we performed trajectory clustering analysis using the ALADYNOULLI model. For each disease of interest (e.g., breast cancer, major depressive disorder, myocardial infarction), we implemented the following analysis pipeline:
Patient Selection and Temporal Averaging.
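For each disease, we identified all patients who developed the condition and computed their time-averaged normalized signature loadings:
$$\bar{\theta}_{ik} = \frac{1}{T} \sum_{t=1}^{T} \theta_{ikt}$$
where $\theta_{ikt}$ represents signature loadings for individual $i$, signature $k$, and time $t$, and $T$ is the number of timepoints.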
Patient Clustering.
We applied k-means clustering (k=3, chosen to balance interpretability with cluster distinctiveness) to the time-averaged signature loading matrix to identify distinct patient subgroups within each disease category. This approach identifies distinct subgroups of patients who share similar underlying disease signature profiles despite having the same clinical diagnosis.
Trajectory Visualization.
We computed cluster-specific mean trajectories across individuals within the cluster and visualized deviations from population reference as stacked area plots for each time point: , where represents the population-average signature loading.
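A minimal sketch of the deviation computation follows, assuming loadings are stored as an array indexed by individual, signature, and time, and that cluster labels have already been obtained from the k-means step. The function and argument names are hypothetical.

```python
import numpy as np

def cluster_deviation(theta_cases, labels, theta_population):
    """Cluster-mean signature trajectories minus the population-average trajectory.

    theta_cases:      (N, K, T) loadings for patients with the disease
    labels:           (N,)      cluster label per patient
    theta_population: (M, K, T) loadings for the reference population
    """
    reference = theta_population.mean(axis=0)    # (K, T) population-average loading
    return {int(c): theta_cases[labels == c].mean(axis=0) - reference
            for c in np.unique(labels)}

# Toy input: 4 patients, 2 signatures, 3 time points.
theta_cases = np.empty((4, 2, 3))
theta_cases[:2, 0], theta_cases[:2, 1] = 0.7, 0.3    # cluster 0 loads on signature 0
theta_cases[2:, 0], theta_cases[2:, 1] = 0.2, 0.8    # cluster 1 loads on signature 1
labels = np.array([0, 0, 1, 1])
theta_population = np.full((10, 2, 3), 0.5)
deviations = cluster_deviation(theta_cases, labels, theta_population)
```

Because loadings sum to one across signatures, the deviations at each time point sum to zero, which is what makes a stacked area plot around a zero baseline a natural display.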
Genetic Architecture Analysis.
For each cluster, we computed mean polygenic risk scores (PRS) across individuals in the cluster, and created heatmaps showing cluster-specific values of these scores. To quantify variability of PRS values among individuals with the same disease, we calculated Cohen’s $d$ effect sizes for each PRS comparing in-cluster versus out-of-cluster distributions:
$$d = \frac{\mu_{\text{in}} - \mu_{\text{out}}}{s_{\text{pooled}}}$$
where $\mu_{\text{in}}$ and $\mu_{\text{out}}$ are the mean PRS values for patients within and outside each cluster, respectively, and $s_{\text{pooled}}$ is the pooled standard deviation. Cohen’s $d$ values of 0.2, 0.5, and 0.8 correspond to small, medium, and large effect sizes, respectively, providing a standardized measure of genetic differentiation between patient subgroups.
We applied the same Cohen’s $d$ formula to both the time-averaged signature loadings and the mean polygenic risk scores (PRS) to quantify the degree of separation between clusters. For the signature loadings, Cohen’s $d$ measures the standardized difference in mean time-averaged signature values between individuals in a given cluster and those in all other clusters, providing a measure of biological heterogeneity within each disease category. For the PRS, Cohen’s $d$ quantifies the genetic differentiation between clusters, comparing the mean PRS values for individuals within a cluster to those outside the cluster. In both cases, a larger absolute value of $d$ indicates greater separation between clusters.
We then calculated cluster-specific Cohen’s $d$ effect sizes (19) as follows. For cluster $c$ and signature $k$, $d_{c,k}$ is the standardized difference in mean time-averaged signature loadings between individuals in cluster $c$ and those in all other clusters. This measures how distinct each cluster is with regard to each disease signature. Similarly, for cluster $c$ and PRS $p$, $d_{c,p}$ is the standardized difference in mean PRS values between individuals in cluster $c$ and those in all other clusters.
Confidence intervals and p-values for Cohen’s $d$ were estimated, and significance was assessed to determine whether the observed cluster differences were likely to be due to chance. This analysis revealed substantial standardized differences in signature loadings between patient subgroups, reflecting biological processes not typically considered in diagnoses.
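The in-cluster versus out-of-cluster effect size can be computed as in the sketch below, which uses the textbook pooled standard deviation with sample variances. The function names are hypothetical, and the same routine applies whether the columns of `values` hold time-averaged signature loadings or PRS.

```python
import numpy as np

def cohens_d(in_cluster, out_cluster):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(in_cluster), len(out_cluster)
    v1, v2 = np.var(in_cluster, ddof=1), np.var(out_cluster, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(in_cluster) - np.mean(out_cluster)) / pooled_sd

def cluster_effect_sizes(values, labels):
    """values: (N, F) features; labels: (N,). Returns {cluster: (F,) array of d}."""
    return {int(c): np.array([cohens_d(values[labels == c, f], values[labels != c, f])
                              for f in range(values.shape[1])])
            for c in np.unique(labels)}

d_example = cohens_d(np.array([2.0, 4.0, 6.0]), np.array([1.0, 2.0, 3.0]))
```

For the toy input the means are 4 and 2 and the pooled variance is 2.5, so `d_example` equals 2 / sqrt(2.5), roughly 1.26.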
Genetic Analysis of Signature Trajectories
For each individual $i$, we compute the temporal signature loadings $\theta_{ikt}$ for each signature $k$ and timepoint $t$ using the softmax transformation:
$$\theta_{ikt} = \frac{\exp(\lambda_{ikt})}{\sum_{k'=1}^{K} \exp(\lambda_{ik't})}$$
where $\lambda_{ikt}$ is the latent score for individual $i$, signature $k$, and time $t$. The softmax is computed across the signature dimension for each individual and timepoint. To summarize each individual’s overall exposure to a given signature, we integrate the signature trajectory over time:
$$\mathrm{AEX}_{ik} = \frac{1}{T} \sum_{t=1}^{T} \theta_{ikt}$$
where $T$ is the total number of timepoints. The resulting average signature exposure over time (AEX) for each signature is used as a quantitative phenotype for downstream genetic association analysis (Figure S12).
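The AEX phenotype construction can be sketched in a few lines. The array layout (individual, signature, time) and the names `lam`, `signature_loadings`, and `average_exposure` are assumptions of this example.

```python
import numpy as np

def signature_loadings(lam):
    """lam: (N, K, T) latent scores -> (N, K, T) loadings via softmax over the signature axis."""
    shifted = lam - lam.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def average_exposure(theta):
    """Average each signature trajectory over time: (N, K, T) -> (N, K)."""
    return theta.mean(axis=2)

lam = np.zeros((2, 4, 5))                 # equal latent scores across 4 signatures
aex = average_exposure(signature_loadings(lam))
```

With equal latent scores every signature receives a loading of 1/4 at every time point, so each AEX value is 0.25. Each row of `aex` sums to one, which means the AEX phenotypes of one individual are not independent of each other.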
We perform GWAS using the AEX values as quantitative phenotypes. For each signature , we test for association between the AEX phenotype and genome-wide SNP genotypes. Association testing is performed using the Regenie (42) software (described below), which implements a two-step ridge regression approach for computational efficiency and control of population structure. The following covariates are included in all association models: sex, age at recruitment, and the first 20 principal components (PCs) of genetic ancestry (4).
For each signature, we identify genome-wide significant SNPs (e.g., $P < 5 \times 10^{-8}$) and further analyze their relationships with individual disease phenotypes. The analysis proceeds as follows. First, we extract the lead SNPs from the GWAS summary statistics for each signature. Second, for each lead SNP, we test its association with a panel of binary constituent disease phenotypes, which comprise our signature inputs, using logistic regression controlling for sex and the first 20 PCs. Third, we visualize the matrix of SNP-phenotype association statistics using heatmaps, highlighting SNPs that are associated with the signature but not with any individual disease (i.e., "signature-specific" loci). Fourth, we use UpSet plots to visualize the overlap of significant variants across signatures and individual diseases, and compute Jaccard similarity indices to quantify the sharing of genetic associations. Fifth, for variants shared between signatures and diseases, we assess the consistency of effect directions across traits.
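Flagging signature-specific loci from the resulting p-values can be sketched as follows. The function name is hypothetical, and applying the same genome-wide threshold to the single-disease tests is an assumption of this example; the threshold actually used for individual diseases is not stated above.

```python
import numpy as np

def signature_specific_loci(sig_p, disease_p, sig_alpha=5e-8, disease_alpha=5e-8):
    """Flag lead SNPs significant for the signature but for none of the constituent diseases.

    sig_p:     (S,)   signature GWAS p-values for S lead SNPs
    disease_p: (S, D) p-values for the same SNPs against D disease phenotypes
    """
    sig_hit = sig_p < sig_alpha
    any_disease_hit = (disease_p < disease_alpha).any(axis=1)
    return sig_hit & ~any_disease_hit

sig_p = np.array([1e-9, 1e-10, 1e-3])
disease_p = np.array([[0.2, 0.01], [1e-12, 0.5], [0.3, 0.4]])
flags = signature_specific_loci(sig_p, disease_p)
```

Here only the first SNP is flagged: the second is also significant for a single disease, and the third does not reach significance for the signature.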
GWAS details
Regenie is run in two successive steps. Step 1 involves fitting a whole-genome ridge regression model to account for relatedness and population structure. Step 2 involves single-variant association testing using the residuals from Step 1, with covariate adjustment for sex, age at prediction, and the first 20 genetic PCs. This approach provides well-calibrated association statistics and is robust to case-control imbalance and relatedness in large biobank-scale datasets.
Model Evaluation and Comparison
Figure 5 presents a comprehensive evaluation of our multi-disease risk prediction model in the UKB, and comparisons with important single-disease models. Each is evaluated in a strictly prospective, leakage-free framework. In the testing data, all parameters were estimated using only information available up to the time of prediction. Individuals with prevalent disease at prediction were excluded from the risk set for that disease. In UKB all individuals are followed for at least 10 years from recruitment. In our analysis we consider only these initial 10 years, both to avoid comparing metrics obtained across differing risk sets and to remain comparable to existing scores; the analysis can easily be extended over the full set of available prediction times. We considered the following prediction tasks and metrics.
Median AUC Aladynoulli Dynamic:
This metric evaluates the model’s ability to make dynamic predictions at multiple time points during follow-up: it is derived by refitting the Aladynoulli model using fixed population-level parameters, previously estimated from the full-history training data, and now applied to a series of one-year prediction tasks. Critically, while these fixed parameters were estimated from the full training data, the individual-specific loadings for each prediction task are now estimated using data only up to the point of prediction. For each of the first 10 years after recruitment, the model is retrained using only data available up to that point for the held-out test set (in orange in Figure S1). The median area under the receiver operating characteristic curve (AUC) across these ten dynamic one-year fits is reported. This captures how predictive accuracy evolves as patients accumulate new diagnoses and leverages the flexibility of our method to perform dynamic, prospective risk estimation at any time point. Individuals with prevalent disease at prediction time were excluded from the risk set.
Aladynoulli Recruitment (1-year):
This metric uses the Aladynoulli model’s predicted 1-year risk at the time of recruitment, evaluated against observed 1-year outcomes. The risk estimate is the predicted 1-year risk for each individual and disease at year 1 after recruitment. In practice, any age of prediction could be chosen, but we use the age of recruitment to the UK Biobank because additional clinical variables are available at that time, which improves comparability with some of the alternative approaches. As above, only information available at recruitment is used to estimate the individual loadings, with the remaining parameters fixed at the values estimated from the training set.
Aladynoulli Recruitment (10-year):
This metric uses the Aladynoulli model’s predicted 1-year risk at the time of recruitment, evaluated against observed 10-year outcomes. The risk estimate is , as in the 1-year predictions.
Cox with Aladynoulli:
This model is a Cox proportional hazards regression using age as the time scale and including the Aladynoulli risk prediction at recruitment, family history, and sex as covariates. This approach benchmarks the added value of the Aladynoulli prediction in a standard clinical modeling framework.
Cox without Aladynoulli:
This baseline Cox model uses only family history and sex as covariates, representing a minimal clinical model that does not require any model curation or disease-specific features. This highlights the fact that our approach does not rely on disease-specific risk factors or manual feature engineering.
When benchmarking AUC performance, we compared the ALADYNOULLI model not only to the benchmarking Cox proportional hazards model above, but also to established clinical risk scores: PREVENT (25) and the Pooled Cohort Equation (43) for ASCVD, and Gail (24) for breast cancer. These scores exist only for specific diseases and often require expensive curation. Of note, these clinical risk scores require laboratory values and biomarkers that are either collected during targeted clinical visits (introducing selection bias) or, when extracted from routine EHR data, may be subject to measurement bias since sicker patients typically receive more frequent testing. In contrast, our approach leverages routinely collected diagnostic codes (ICD codes) that are systematically recorded for all patients regardless of disease severity, providing a less biased data source for risk prediction.
This approach ensured that all model comparisons were fair, prospective, and reflective of the information available at the time of risk assessment.
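The exclusion of prevalent cases and the AUC evaluation used in the metrics above can be sketched as follows. The AUC is computed from ranks (the Mann-Whitney formulation, with average ranks for ties) so the example needs only NumPy; the function names and the toy data are hypothetical.

```python
import numpy as np

def rank_auc(scores, outcomes):
    """AUC as the normalized Mann-Whitney U statistic, with average ranks for tied scores."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # mean of the 1-based ranks i+1 .. j+1
        i = j + 1
    n_pos = int(outcomes.sum())
    n_neg = len(outcomes) - n_pos
    return (ranks[outcomes].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def incident_auc(risk, event_in_window, prevalent):
    """AUC among individuals free of the disease at prediction time."""
    at_risk = ~np.asarray(prevalent, dtype=bool)
    return rank_auc(np.asarray(risk)[at_risk], np.asarray(event_in_window)[at_risk])

risk = np.array([0.9, 0.1, 0.8, 0.3, 0.2])
event = np.array([True, False, True, False, True])
prevalent = np.array([True, False, False, False, False])
auc = incident_auc(risk, event, prevalent)
```

After dropping the prevalent case, three of the four case-control pairs are ranked correctly, giving an AUC of 0.75.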
Additional evaluation (Table S7)
Dynamic 10-year Rolling:
This approach demonstrates the model’s interpolation capabilities by evaluating how probability estimates evolve as new information becomes available. For each year of the 10-year horizon, we update the model’s predictions using information available up to that time point, then aggregate the cumulative 10-year risk as , where is the predicted risk for individual and disease at year after recruitment. While this rolling evaluation does not use knowledge of the future outcome of interest, it is not leakage-free. This is because it does incorporate future information about events that are potentially correlated with, or even resulting from, the event of interest, because the model’s probability estimates at year are influenced by information available up to year . Thus it is best understood as interpolation rather than extrapolation. While this metric cannot be used for prospective evaluation, it demonstrates the model’s technical capabilities for dynamic risk assessment and shows how probability estimates evolve over time.
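One common way to aggregate yearly risks into a cumulative risk is the complement of the product of yearly survival probabilities. The sketch below assumes this discrete-time form, treating each yearly risk as conditional on being event-free at the start of that year; the function name is hypothetical and the exact aggregation used in the paper may differ.

```python
import numpy as np

def cumulative_risk(yearly_risk):
    """yearly_risk: (..., H) one-year risks over H years; returns 1 - prod_t (1 - p_t)."""
    yearly_risk = np.asarray(yearly_risk, dtype=float)
    return 1.0 - np.prod(1.0 - yearly_risk, axis=-1)

flat = cumulative_risk(np.full(10, 0.01))          # constant 1% yearly risk over 10 years
rising = cumulative_risk(np.linspace(0.01, 0.02, 10))
```

A constant 1% yearly risk gives a 10-year risk of 1 - 0.99^10, slightly below the naive sum of 10%.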
Age-specific evaluation across 30 timepoints:
To comprehensively assess model performance across the adult lifespan, we evaluated predictions at 30 distinct age-specific timepoints spanning ages 40 to 70 years. This approach differs from the recruitment-time evaluation in that each timepoint represents a specific age cohort (e.g., age 40, 41, 42, etc.) rather than mixed-age groups at different follow-up times. For each age-specific timepoint, we used the cumulative data inclusion approach, where all available data from age 30 up to the prediction age is included, rather than a fixed 10-year window. This methodology ensures that predictions at each age benefit from the full available patient history while maintaining proper temporal alignment. We evaluated performance only for years with sufficient events (≥ 5 events) to ensure reliable AUC estimates, and computed median AUC values across all qualifying years for each disease. This approach revealed substantial improvements over the previous 10-year rolling window methodology, demonstrating the importance of proper data inclusion strategies in survival prediction models.
All analyses were performed in Python, with survival models implemented in lifelines and scikit-survival and validated in R (version 4.0) using the survival package; calibration and discrimination metrics were computed using standard epidemiological methods.
Supplementary Material
Acknowledgments
Funding:
This work was supported by National Institutes of Health grants (R01HL155915, R01HL157635, R35HL144758) to P.N. and American Heart Association grants (19SFRN34800000, 19SFRN34850009) to P.N.
Footnotes
Competing interests: The authors declare no competing interests.
Data and materials availability:
The code for implementing ALADYNOULLI is available at https://github.com/surbu, with all analyses and code necessary for reproduction available upon request from the authors. Access to individual-level UK Biobank data requires approval from the UK Biobank (https://www.ukbiobank.ac.uk/). Access to Mass General Brigham data requires approval from the Mass General Brigham Institutional Review Board. Access to All of Us data requires approval through the All of Us Researcher Workbench (https://www.researchallofus.org/).
References and Notes
- 1.Berry D. A., Bayesian clinical trials. Nature Reviews Drug Discovery 5 (1), 27–36 (2006), number: 1 Publisher: Nature Publishing Group, doi: 10.1038/nrd1927, https://www.nature.com/articles/nrd1927. [DOI] [PubMed] [Google Scholar]
- 2.Bellot A., Schaar M. V. D., Flexible Modelling of Longitudinal Medical Data: A Bayesian Nonparametric Approach. ACM Transactions on Computing for Healthcare 1 (1), 1–15 (2020), doi: 10.1145/3377164, https://dl.acm.org/doi/10.1145/3377164. [DOI] [Google Scholar]
- 3.Angus D. C., Chang C.-C. H., Heterogeneity of Treatment Effect: Estimating How the Effects of Interventions Vary Across Individuals. JAMA 326 (22), 2312–2313 (2021), doi: 10.1001/jama.2021.20552, https://doi.org/10.1001/jama.2021.20552. [DOI] [PubMed] [Google Scholar]
- 4.Sudlow C., et al. , UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine 12 (3), e1001779 (2015), doi: 10.1371/journal.pmed.1001779, https://dx.plos.org/10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Pedersen E. M., et al. , ADuLT: An efficient and robust time-to-event GWAS. Nature Communications 14 (1), 5553 (2023), publisher: Nature Publishing Group, doi: 10.1038/s41467-023-41210-z, https://www.nature.com/articles/s41467-023-41210-z. [DOI] [Google Scholar]
- 6.Wang W., Stephens M., Empirical Bayes Matrix Factorization. arXiv:1802.06931 [stat] (2021), arXiv: 1802.06931, http://arxiv.org/abs/1802.06931. [Google Scholar]
- 7.Jiang X., et al. , Age-dependent topic modeling of comorbidities in UK Biobank identifies disease subtypes with differential genetic risk. Nature Genetics 55 (11), 1854–1865 (2023), doi: 10.1038/s41588-023-01522-8, https://www.nature.com/articles/s41588-023-01522-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Urbut S. M., et al. , Dynamic Importance of Genomic and Clinical Risk for Coronary Artery Disease Over the Life Course. medRxiv (2023), publisher: Cold Spring Harbor Laboratory Preprints. [Google Scholar]
- 9.Hyttinen V., Kaprio J., Kinnunen L., Koskenvuo M., Tuomilehto J., Genetic liability of type 1 diabetes and the onset age among 22,650 young Finnish twin pairs: a nationwide follow-up study. Diabetes 52 (4), 1052–1055 (2003), doi: 10.2337/diabetes.52.4.1052. [DOI] [PubMed] [Google Scholar]
- 10.Blei D. M., Lafferty J. D., Dynamic topic models, in Proceedings of the 23rd international conference on Machine learning - ICML ‘06 (ACM Press, Pittsburgh, Pennsylvania) (2006), pp. 113–120, doi: 10.1145/1143844.1143859, http://portal.acm.org/citation.cfm?doid=1143844.1143859. [DOI] [Google Scholar]
- 11.Blei D. M., Ng A. Y., Jordan M. I., Latent dirichlet allocation. J. Mach. Learn. Res. 3 (null), 993–1022 (2003). [Google Scholar]
- 12.Caruana R., Multitask Learning. Machine Learning 28 (1), 41–75 (1997), publisher: Springer Science and Business Media LLC, doi: 10.1023/a:1007379606734, https://link.springer.com/10.1023/A:1007379606734. [DOI] [Google Scholar]
- 13.Rasmussen C. E., Williams C. K. I., Gaussian Processes for Machine Learning (The MIT Press; ) (2006). [Google Scholar]
- 14.Engelhardt B. E., Stephens M., Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis. PLoS Genet 6 (9), e1001117 (2010), doi: 10.1371/journal.pgen.1001117, http://dx.doi.org/10.1371/journal.pgen.1001117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Koyama S., et al. , Decoding Genetics, Ancestry, and Geospatial Context for Precision Health. medRxiv (2023), publisher: Cold Spring Harbor Laboratory Preprints. [Google Scholar]
- 16.Bastarache L., Using Phecodes for Research with the Electronic Health Record: From PheWAS to PheRS. Annual review of biomedical data science 4, 1–19 (2021), doi: 10.1146/annurev-biodatasci-122320-112352, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9307256/. [DOI] [Google Scholar]
- 17.Hripcsak G., Albers D. J., Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association 20 (1), 117–121 (2013), publisher: Oxford University Press (OUP), doi: 10.1136/amiajnl-2012-001145, https://academic.oup.com/jamia/article-lookup/doi/10.1136/amiajnl-2012-001145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Yeung M. W., Van Der Harst P., Verweij N., ukbpheno v1.0: An R package for phenotyping health-related outcomes in the UK Biobank. STAR Protocols 3 (3), 101471 (2022), doi: 10.1016/j.xpro.2022.101471, https://linkinghub.elsevier.com/retrieve/pii/S2666166722003513. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Cohen J., Statistical Power Analysis for the Behavioral Sciences (Routledge; ), 0 ed. (2013), doi: 10.4324/9780203771587, https://www.taylorfrancis.com/books/9781134742707. [DOI] [Google Scholar]
- 20.Solovieff N., Cotsapas C., Lee P. H., Purcell S. M., Smoller J. W., Pleiotropy in complex traits: challenges and strategies. Nature Reviews. Genetics 14 (7), 483–495 (2013), doi: 10.1038/nrg3461. [DOI] [Google Scholar]
- 21.Bulik-Sullivan B. K., et al. , LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics 47 (3), 291–295 (2015), doi: 10.1038/ng.3211, http://www.nature.com.proxy.uchicago.edu/ng/journal/v47/n3/full/ng.3211.html. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Putter H., van Houwelingen H. C., Understanding Landmarking and Its Relation with Time-Dependent Cox Regression. Stat Biosci 9 (2), 489–503 (2017), doi: 10.1007/s12561-016-9157-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Cox D. R., Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological) 34 (2), 187–220 (1972), publisher: [Royal Statistical Society, Wiley], https://www.jstor.org/stable/2985181. [Google Scholar]
- 24.Gail M. H., et al. , Projecting Individualized Probabilities of Developing Breast Cancer for White Females Who Are Being Examined Annually. JNCI: Journal of the National Cancer Institute 81 (24), 1879–1886 (1989), doi: 10.1093/jnci/81.24.1879, https://doi.org/10.1093/jnci/81.24.1879. [DOI] [PubMed] [Google Scholar]
- 25.Khan S S, et al., Development and Validation of the American Heart Association’s PREVENT Equations 149 (6), 430–449, _eprint: https://www.ahajournals.org/doi/pdf/10.1161/CIRCULATIONAHA.123.067626, doi: 10.1161/CIRCULATIONAHA.123.067626, https://www.ahajournals.org/doi/abs/10.1161/CIRCULATIONAHA.123.067626. [DOI] [Google Scholar]
- 26.Ashley E. A., Towards precision medicine. Nature Reviews Genetics 17 (9), 507–522 (2016), publisher: Springer Science and Business Media LLC, doi: 10.1038/nrg.2016.86, https://www.nature.com/articles/nrg.2016.86.
- 27.Collins F. S., Varmus H., A New Initiative on Precision Medicine. New England Journal of Medicine 372 (9), 793–795 (2015), publisher: Massachusetts Medical Society, doi: 10.1056/nejmp1500523, http://www.nejm.org/doi/10.1056/NEJMp1500523.
- 28.Schork N. J., Personalized medicine: Time for one-person trials. Nature 520 (7549), 609–611 (2015), publisher: Springer Science and Business Media LLC, doi: 10.1038/520609a, https://www.nature.com/articles/520609a.
- 29.Price A. L., Spencer C. C. A., Donnelly P., Progress and promise in understanding the genetic basis of common diseases. Proceedings of the Royal Society B: Biological Sciences 282 (1821), 20151684 (2015), publisher: The Royal Society, doi: 10.1098/rspb.2015.1684, https://royalsocietypublishing.org/doi/10.1098/rspb.2015.1684.
- 30.Joyner M. J., Paneth N., Promises, promises, and precision medicine. Journal of Clinical Investigation 129 (3), 946–948 (2019), publisher: American Society for Clinical Investigation, doi: 10.1172/jci126119, https://www.jci.org/articles/view/126119.
- 31.Steinhubl S. R., Topol E. J., Digital medicine, on its way to being just plain medicine. npj Digital Medicine 1 (1) (2018), publisher: Springer Science and Business Media LLC, doi: 10.1038/s41746-017-0005-1, https://www.nature.com/articles/s41746-017-0005-1.
- 32.Simon N., Simon R., Adaptive enrichment designs for clinical trials. Biostatistics 14 (4), 613–625 (2013), publisher: Oxford University Press (OUP), doi: 10.1093/biostatistics/kxt010, https://academic.oup.com/biostatistics/article-lookup/doi/10.1093/biostatistics/kxt010.
- 33.Bailey Z. D., et al., Structural racism and health inequities in the USA: evidence and interventions. The Lancet 389 (10077), 1453–1463 (2017), publisher: Elsevier BV, doi: 10.1016/s0140-6736(17)30569-x, https://linkinghub.elsevier.com/retrieve/pii/S014067361730569X.
- 34.Bycroft C., et al., The UK Biobank resource with deep phenotyping and genomic data. Nature 562 (7726), 203–209 (2018), doi: 10.1038/s41586-018-0579-z, https://www.nature.com/articles/s41586-018-0579-z.
- 35.Urbut S., et al., MS Gene: Multistate Modeling of Dynamic Lifetime Risk of Coronary Artery Disease Using Electronic Health Records in the UK Biobank. Circulation 148 (Suppl 1), A14747–A14747 (2023), publisher: Lippincott Williams & Wilkins Hagerstown, MD.
- 36.Thompson D. J., et al., UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits. medRxiv (2022), doi: 10.1101/2022.06.16.22276246, https://www.medrxiv.org/content/10.1101/2022.06.16.22276246v2.
- 37.The All of Us Research Program Investigators, The “All of Us” Research Program. New England Journal of Medicine 381 (7), 668–676 (2019), doi: 10.1056/NEJMsr1809937, http://www.nejm.org/doi/10.1056/NEJMsr1809937.
- 38.Kalbfleisch J. D., Prentice R. L., The Statistical Analysis of Failure Time Data (John Wiley & Sons, 2011).
- 39.Urbut S. M., et al., MSGene: Derivation and validation of a multistate model for lifetime risk of coronary artery disease using genetic risk and the electronic health record. medRxiv (2023), publisher: Cold Spring Harbor Laboratory Preprints.
- 40.Yeung M. W., Verweij N., niekverw/ukbpheno: v1.0.0 (2022), doi: 10.5281/zenodo.6557829, https://zenodo.org/record/6557829.
- 41.Kalbfleisch J. D., Prentice R. L., The statistical analysis of failure time data (Wiley, 1980).
- 42.Mbatchou J., et al., Computationally efficient whole-genome regression for quantitative and binary traits. Nature Genetics 53 (7), 1097–1103 (2021), doi: 10.1038/s41588-021-00870-7, https://doi.org/10.1038/s41588-021-00870-7.
- 43.Lloyd-Jones D. M., et al., Estimating Longitudinal Risks and Benefits From Cardiovascular Preventive Therapies Among Medicare Patients. Journal of the American College of Cardiology 69 (12), 1617–1636 (2017), publisher: American College of Cardiology Foundation, doi: 10.1016/j.jacc.2016.10.018, https://www.jacc.org/doi/10.1016/j.jacc.2016.10.018.
- 44.Ambrosio M., et al., Performance of PREVENT and pooled cohort equations for predicting 10-Year ASCVD risk in the UK Biobank. American Journal of Preventive Cardiology 22, 101009 (2025), publisher: Elsevier BV, doi: 10.1016/j.ajpc.2025.101009, https://linkinghub.elsevier.com/retrieve/pii/S2666667725000844.
Data Availability Statement
The code for implementing ALADYNOULLI is available at https://github.com/surbu, with all analyses and code necessary for reproduction available upon request from the authors. Access to individual-level UK Biobank data requires approval from the UK Biobank (https://www.ukbiobank.ac.uk/). Access to Mass General Brigham data requires approval from the Mass General Brigham Institutional Review Board. Access to All of Us data requires approval through the All of Us Researcher Workbench (https://www.researchallofus.org/).





