Introduction
Since the launch of the first genomic dataset (Controlled Data Repository version 5 [CDRv5]) in 2022, the All of Us Research Program has continued to expand its genomic offerings, enabling a broad spectrum of genomic research across a variety of domains (Table 1). The latest release in spring of 2025 (Controlled Data Repository version 8 [CDRv8]) increased the number of participants with genotype arrays to more than 447,000, 414,000 of whom also have short read (sr) whole-genome sequencing (WGS). These data, along with other phenotypic data such as electronic health records (EHRs), survey information, wearable devices, and physical measures, are available to researchers via the cloud-based Researcher Workbench. To date, more than 16,000 researchers using the data have produced over 700 peer-reviewed publications, including over 130 that are genomics focused.
Table 1.
Content and chronology of the genomic data releases
| Data release name | CDR v5 (beta) | CDR v6 | CDR v7 | CDR v8 |
|---|---|---|---|---|
| Data release date | March 15, 2022 | June 22, 2022 | April 20, 2023 (June 4, 2024) | February 3, 2025 |
| Participants with genotype arrays | 165,208 | 165,127 | 312,945 | 447,278 |
| Participants with srWGS | 98,622 | 98,590 | 245,394 | 414,830 |
| Participants with structural variant calls | N/A | N/A | 11,390 (97,940) | 97,061 |
| Participants with lrWGS | N/A | N/A | 1,027 | 2,800 |
Last year, the All of Us Research Program published its inaugural AJHG year in review that highlighted the utility of All of Us genomics data across multiple domains, including cardiovascular disease, diabetes, and kidney disease. It focused on papers published from January 2023 through April 2024.1 This year, the All of Us Research Program Genomics Publications Working Group evaluated over 40 papers, published from April 2024 to December 2024, to complete the review of 2024 publications. The selected papers reflect work done using the genomic datasets released in 2022 through 2023 (Table 1). Publications reviewed met the established selection criteria.1 The working group identified a selection of articles that exemplify scientific advances that the All of Us dataset is enabling. In the following summaries, we will highlight these studies and the role All of Us data played in their primary conclusions. Papers are listed alphabetically by the first author.
Intersection of rare pathogenic variants from TCGA in the All of Us Research Program v6
Bates, B.A., Bates, K.E., Boris, S.A., Wessman, C., Stone, D., Bryan, J., Davis, M.F., and Bailey, M.H. (2025). Intersection of rare pathogenic variants from TCGA in the All of Us Research Program v6. HGG Adv 6, 100405.
Health conditions: cancer
All of Us data types: EHR (Systematized Nomenclature of Medicine [SNOMED] codes, International Classification of Diseases [ICD]-9 and ICD-10 codes), srWGS
All of Us dataset: controlled tier CDR v6
This study integrated All of Us whole-genome sequencing (WGS) and electronic health record (EHR) data with tumor genomic data from The Cancer Genome Atlas (TCGA) to evaluate the phenotypic impact of 586 cancer predisposition variants (across 99 genes) from TCGA. The authors identified 280 variants across 79 genes in the All of Us study population (∼67,000 individuals) that fit the selection criteria. They found that 1,865 individuals (2.8%) in the study population harbored one or more pathogenic variants and that these individuals were largely genetically similar to European reference populations. They also found that ocular neoplasms had the highest proportion of samples with a cancer predisposition variant, whereas hand, head, and neck tumors had one of the lowest proportions. Analyses were divided into two broad genetic similarity categories (European-like and non-European-like) and grouped by gene. A cancer-focused phenome-wide association study of the seven genes that had at least 20 individuals with a predisposition variant revealed six associations: five with BRCA2 (breast and ovarian cancer) and one with SH2B3 (colorectal cancer). A broader analysis of phenotypes revealed an association of BRIP1 with neuroendocrine tumors. This study highlights that many known cancer predisposition variants identified in individuals with genetic similarity to Europeans fail to show significant effects in individuals who are not genetically similar to European reference populations, and it exemplifies the power of diverse datasets like the All of Us Research Program to address key gaps in our understanding of cancer pathogenesis.
Secure discovery of genetic relatives across large-scale and distributed genomic datasets
Hong, M.M., Froelicher, D., Magner, R., Popic, V., Berger, B., and Cho, H. (2024). Secure Discovery of Genetic Relatives across Large-Scale and Distributed Genomic Datasets. Res Comput Mol Biol 14758, 308–313.
Health conditions: not applicable, methods focused
All of Us data types: srWGS
All of Us dataset: controlled tier CDR v5
This study introduces SF-Relate, a privacy-preserving federated algorithm for identifying genetic relatives across distributed genomic datasets, a key step in reducing false positives in federated analyses. SF-Relate combines a novel locality-sensitive hashing approach with multiparty homomorphic encryption to enable secure and scalable kinship detection without sharing participant-level genomic data. The authors used two datasets from the UK Biobank and data from All of Us, split each of these three datasets into two, and then used their method to demonstrate that they could detect relatives up to the third degree with 97% accuracy. In the larger dataset of 200,000 from the UK Biobank, the run time was 15 h. The use of All of Us data demonstrated that the methodology was robust to complex genomic diversity and applicable in cloud environments like the Researcher Workbench, illustrating an application for broad use in comparing relatedness across genomic databases for secure federated analyses.
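To make "kinship detection up to the third degree" concrete, the following is a minimal sketch that scores a pair of individuals with a KING-robust-style kinship coefficient and maps it to a relationship class using the commonly cited KING cut-offs. It is an illustration only and not SF-Relate: it runs in the clear on toy genotype vectors, with no hashing or encryption, and the summary above does not state which estimator the authors use. The function names `king_robust_kinship` and `classify_relationship` are hypothetical.

```python
from typing import Sequence

# Commonly cited KING kinship lower bounds for each relationship class.
# Assumption: these are illustrative cut-offs, not values taken from the paper.
KING_THRESHOLDS = [
    (0.354, "duplicate or monozygotic twin"),
    (0.177, "first degree"),
    (0.0884, "second degree"),
    (0.0442, "third degree"),
]


def king_robust_kinship(g1: Sequence[int], g2: Sequence[int]) -> float:
    """KING-robust-style kinship from genotype dosages coded 0, 1, or 2.

    Computes 0.5 - sum((g1 - g2)^2) / (4 * min(het1, het2)), where het1 and
    het2 are each individual's count of heterozygous sites.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must have the same length")
    het1 = sum(1 for g in g1 if g == 1)
    het2 = sum(1 for g in g2 if g == 1)
    denom = min(het1, het2)
    if denom == 0:
        raise ValueError("each individual needs at least one heterozygous site")
    squared_diff = sum((a - b) ** 2 for a, b in zip(g1, g2))
    return 0.5 - squared_diff / (4 * denom)


def classify_relationship(kinship: float) -> str:
    """Map a kinship coefficient to a coarse relationship class."""
    for lower_bound, label in KING_THRESHOLDS:
        if kinship > lower_bound:
            return label
    return "unrelated or more distant"


if __name__ == "__main__":
    a = [1, 1, 1, 1]
    b = [1, 1, 0, 2]
    phi = king_robust_kinship(a, b)
    print(phi, classify_relationship(phi))
```

In a federated setting the expensive part is deciding which cross-dataset pairs to score at all, which is where the hashing and encryption described above come in; the sketch covers only the final scoring step under the stated assumptions.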
Calibrated prediction intervals for polygenic scores across diverse contexts
Hou, K., Xu, Z., Ding, Y., Mandla, R., Shi, Z., Boulier, K., Harpak, A., and Pasaniuc, B. (2024). Calibrated prediction intervals for polygenic scores across diverse contexts. Nat Genet 56, 1386–1396.
Health conditions: nine traits (height, body mass index [BMI], waist-hip ratio [WHR], diastolic blood pressure, systolic blood pressure, low-density lipoprotein [LDL], cholesterol, high-density lipoprotein [HDL], and triglycerides)
All of Us data types: EHR (measurements, lab values), survey (education years, years at current address, employment), microarray
All of Us dataset: controlled tier CDR v7
This study used data from All of Us to improve polygenic scores (PGSs), which are becoming widely used for risk prediction in personalized medicine. Currently, PGS accuracy varies considerably across contexts, including age, sex, socioeconomic status, and genetic similarity by principal component analysis (PCA). To address this challenge, the authors introduce CalPred, a statistical framework that models multiple contextual factors jointly and produces calibrated trait prediction intervals reflecting variable PGS accuracy. The authors analyzed 72 traits in the UK Biobank and 12 matching traits in the All of Us Research Program. They demonstrate pervasive context-specific variation in PGS accuracy, with socioeconomic factors affecting accuracy more in the All of Us data than in the UK Biobank, an effect that remained significant when accounting for genetic similarity groups. CalPred generates prediction intervals adaptive to context, correcting the miscalibration seen in existing methods that ignore such variability. For some traits, interval adjustments reach up to 80%–90%. Their approach facilitates more reliable, individualized genomic predictions by incorporating diverse contextual factors, advancing precision health applications and equitable use of PGSs across heterogeneous populations.
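The idea of a context-adaptive prediction interval can be sketched as follows. This is a toy illustration, not the CalPred implementation: it assumes a Gaussian trait model whose mean is linear in the PGS and whose log-variance is linear in a single context variable, and every coefficient and name (`prediction_interval`, `beta0`, `beta1`, `gamma0`, `gamma1`) is hypothetical.

```python
import math
from statistics import NormalDist


def prediction_interval(
    pgs: float,
    context: float,
    beta0: float,
    beta1: float,
    gamma0: float,
    gamma1: float,
    level: float = 0.90,
) -> tuple[float, float]:
    """Return a (lower, upper) interval for a trait under a toy Gaussian model.

    Assumed model (illustrative only):
        mean     = beta0 + beta1 * pgs
        variance = exp(gamma0 + gamma1 * context)
    so the interval is wider in contexts where the PGS is less accurate.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    mean = beta0 + beta1 * pgs
    sd = math.sqrt(math.exp(gamma0 + gamma1 * context))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    return mean - z * sd, mean + z * sd


if __name__ == "__main__":
    # Same PGS, two hypothetical contexts: only the interval width changes.
    for ctx in (0.0, 1.0):
        lo, hi = prediction_interval(
            pgs=1.0, context=ctx, beta0=0.0, beta1=0.5, gamma0=0.0, gamma1=0.8
        )
        print(f"context={ctx}: [{lo:.3f}, {hi:.3f}]")
```

A method that ignores context would report one interval width for everyone; letting the variance depend on context is what allows the interval to widen or narrow with the expected accuracy of the PGS.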
Exome sequence analysis identifies rare coding variants associated with a machine-learning-based marker for coronary artery disease
Petrazzini, B.O., Forrest, I.S., Rocheleau, G., Vy, H.M.T., Marquez-Luna, C., Duffy, A., Chen, R., Park, J.K., Gibson, K., Goonewardena, S.N. et al. (2024). Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease. Nat Genet 56, 1412–1419.
Health conditions: coronary artery disease
All of Us data types: EHR (ICD-10 codes, current procedural terminology codes, laboratory values, and other classifications), srWGS
All of Us dataset: controlled tier CDR v7
This study illustrates how a digital marker that captures the spectrum of a complex disease like coronary artery disease (CAD) can enhance the discovery of rare variants. The authors developed the in silico score for coronary artery disease (ISCAD) from electronic health record (EHR) data using machine learning. Using data from the UK Biobank, All of Us Research Program, and BioMe Biobank, the authors performed variant- and gene-level association testing to identify 17 genes associated with increasing or decreasing ISCAD scores. They were able to recapitulate previous rare variant findings for CAD with the same direction of effect, and 14 of these genes had moderate to strong independent evidence supporting their role in CAD. Using a continuous score model like ISCAD improves upon previous association studies using binary CAD diagnoses, in which misclassification of cases and controls due to bias or suboptimal EHR phenotyping reduced statistical power. In addition to contributing to the understanding of the genetic etiology of CAD, this approach could be applied to other complex health conditions in EHR-linked databases.
Genetic determinants and phenotypic consequences of blood T cell proportions in 207,000 diverse individuals
Poisner, H., Faucon, A., Cox, N., and Bick, A.G. (2024). Genetic determinants and phenotypic consequences of blood T-cell proportions in 207,000 diverse individuals. Nat Commun 15, 6732.
Health conditions: phenome-wide investigation with findings in hematopoietic system, endocrine diseases, infectious diseases, circulatory/respiratory systems, and conditions related to pregnancy
All of Us data types: EHR (phecodes, blood cell counts), srWGS
All of Us dataset: controlled tier CDR v7
This study utilized the T cell ExTRECT method to calculate the T cell fraction in the blood from short read whole-genome sequencing data in two large electronic health record (EHR)-linked databases: TOPMed and All of Us. The authors noted significant correlations between T cell fraction and demographic characteristics such as age and sex. They also found a significant correlation with the proportion of estimated similarity to global reference populations. Genome-wide association study meta-analysis identified a total of 27 unique loci associated with T cell abundance. The inclusion of a substantial number of people with diverse genetic similarity spread across global reference populations by principal component analysis (PCA) in the cross-group analyses and meta-analysis was key in uncovering associations that would not have been detectable in data from participants predominantly genetically similar to European reference populations. The phenome-wide association study (PheWAS) analyses revealed associations between T cell fraction-associated loci and several diseases and phenotypes, in particular certain phenotypes associated with pregnancy. Dynamic changes in T cell abundance were observed throughout the course of pregnancy, which normalized post-partum, validating previous findings and supporting the use of this method to characterize T cell dynamics. The findings in this study relied on the diversity of the All of Us cohort and the breadth of available EHR data, including longitudinal EHR data available on pregnant women.
Functional analysis of G6PD variants associated with low G6PD activity in the All of Us Research Program
Powell, N.R., Geck, R.C., Lai, D., Shugg, T., Skaar, T.C., and Dunham, M.J. (2024). Functional analysis of G6PD variants associated with low G6PD activity in the All of Us Research Program. Genetics 228.
Health conditions: hemolytic anemia due to oxidative stress, G6PD deficiency
All of Us data types: EHR (lab values; ICD-10 codes), srWGS
All of Us dataset: controlled tier CDR v7
This study used the All of Us Research Program’s srWGS and EHR data to identify previously unclassified variants in G6PD likely to be associated with hemolytic anemia. The authors extracted the variants around G6PD from srWGS from over 245,000 participants and identified 359 coding variants, 296 of which had not been previously categorized as World Health Organization Class I/II/III/IV or Class A/B/C. Of these, 150 were non-synonymous, and 32 were not previously reported in ClinVar or other clinical databases. Through X-chromosome-association analyses and yeast-based functional assays, the study expanded pathogenic classifications for multiple variants. It also highlighted that 13% of individuals with deficiency-causing alleles in All of Us would be missed if only screened for the commonly tested c.202G>A variant. The All of Us dataset enabled reclassification of variants of uncertain significance, identification of disease-associated variants (e.g., c.430C>G and c.595A>G), and improved understanding of variant penetrance across different populations. This study highlights the power of the genetic diversity in All of Us data to enhance precision in pharmacogenomic screening.
Rare genetic variants in LDLR, APOB, and PCSK9 are associated with aortic stenosis
Ramo, J.T., Jurgens, S.J., Kany, S., Choi, S.H., Wang, X., Smirnov, A.N., Friedman, S.F., Maddah, M., Khurshid, S., Ellinor, P.T., and Pirruccello, J.P. (2024). Rare Genetic Variants in LDLR, APOB, and PCSK9 Are Associated With Aortic Stenosis. Circulation 150, 1767–1780.
Health conditions: aortic stenosis
All of Us data types: EHR (lab values, ICD-10 codes), srWGS
All of Us dataset: controlled tier CDR v7
This study investigates the impact of rare variants in LDLR, APOB, and PCSK9 on aortic stenosis (AS) risk using whole-genome sequencing (WGS) data and electronic health records (EHRs) from 421,049 unrelated participants (5,621 with AS) in the UK Biobank and 195,519 unrelated participants (1,087 with AS) in the All of Us Research Program. Individuals with protein-disrupting variants in LDLR showed greater mean low-density lipoprotein cholesterol (LDL-C) levels and increased risk for AS, while those with protein-disrupting variants in PCSK9 or APOB showed lower mean LDL-C levels and reduced risk for AS. Validation of these findings across the different populations represented in the All of Us dataset enhanced the generalizability of these results. The study suggests that lifelong alterations in LDL-C influence AS development, highlighting the potential of early and sustained lipid-lowering therapies to prevent AS progression.
Proteogenomic analysis integrated with electronic health record data reveals disease-associated variants in Black Americans
Tahir, U.A., Barber, J.L., Cruz, D.E., Kars, M.E., Deng, S., Tuftin, B., Gillman, M.G., Benson, M.D., Robbins, J.M., Chen, Z.Z. et al. (2024). Proteogenomic analysis integrated with electronic health records data reveals disease-associated variants in Black Americans. J Clin Invest 134.
Health conditions: phenome-wide analysis, with associations found for sarcoidosis, cardiomyopathy, primary angle closure glaucoma, type 1 diabetes, Helicobacter pylori infection, non-Hodgkin lymphoma, multiple sclerosis, end-stage renal disease, primary biliary cirrhosis, anemia, etc.
All of Us data types: EHR (phecodes), srWGS
All of Us dataset: controlled tier CDR v7
Large-scale genomics studies have identified thousands of loci implicated in disease. To understand how these variants impact pathological processes, the authors utilized proteomic data from the Jackson Heart Study to identify loci that affect protein levels, known as protein quantitative trait loci (pQTLs). They replicated these pQTLs in proteomic data from the Multi-Ethnic Study of Atherosclerosis (MESA). They then performed a phenome-wide association study (PheWAS) across two large multiethnic electronic health record (EHR) systems, All of Us and BioMe, to identify loci associated with specific health conditions and intersected these data with the pQTLs to identify loci that affect protein levels and are also associated with disease. Through this process, they identified multiple variants implicated in increasing or lowering protein levels that plausibly associate with disease. Further study of these protein associations could lead to the identification of biomarkers or drug targets. Importantly, multiple identified variants were rare or absent in individuals with genetic similarity to European reference populations or had impacts only evident in local ancestry analysis. This study highlighted the value of intersecting multi-omic data with EHR-linked data in understanding genetically mediated disease pathology and the need for these studies to be performed in diverse datasets, such as All of Us and the three other databases utilized.
Analysis of rare genetic variants in All of Us cohort patients with common variable immunodeficiency
von Beck, T., Patel, M., Patel, N.C., and Jacob, J. (2024). Analysis of rare genetic variants in All of Us cohort patients with common variable immunodeficiency. Front Genet 15, 1409754.
Health conditions: common variable immunodeficiency (CVID)
All of Us data types: EHR (diagnoses, drug prescriptions, age, etc.), srWGS
All of Us dataset: controlled tier CDR v7
This study investigates genetic variants in All of Us participants with common variable immunodeficiency (CVID), a genetic disorder characterized by low levels of immunoglobulin antibodies. This rare disease often manifests in adults with varying severity ranging from recurrent infections to no symptoms, or hypogammaglobulinemia of undetermined significance (HGUS). Twenty-one individuals with CVID or HGUS were identified using electronic health record (EHR) data, including review by a clinical immunologist. Using All of Us short read whole-genome sequencing (srWGS) data, the researchers isolated protein-coding non-synonymous variants. Variants affecting loci previously associated with CVID in Online Mendelian Inheritance in Man (OMIM) were screened for prior mention of CVID in the literature. An additional 61 genes implicated in antibody deficiency disorders, including the 14 OMIM-defined genes, were also explored. Finally, these individuals were assessed across all genetic loci for homozygous or male hemizygous stop-gain, start-loss, or frameshift variants. Known pathogenic variants were found in 4 of the 21 participants. Additional variants were identified in multiple genes, including two rare homozygous loss-of-function variants in MUC4 and SPAG11A, which will require further follow-up. The findings reported in this paper highlight the utility of large databases with multiple data types, like All of Us, to identify rare variants associated with rare diseases that impact long-term healthcare utilization.
Protein-truncating variant in APOL3 increases chronic kidney disease risk in epistasis with APOL1 risk alleles
Zhang, D.Y., Levin, M.G., Duda, J.T., Landry, L.G., Witschey, W.R., Damrauer, S.M., Ritchie, M.D., and Rader, D.J. (2024). Protein-truncating variant in APOL3 increases chronic kidney disease risk in epistasis with APOL1 risk alleles. JCI Insight 9.
Health conditions: chronic kidney disease, end-stage renal disease
All of Us data types: EHR (phecodes and laboratory values), srWGS
All of Us dataset: controlled tier CDR v7
This study used a genome-first approach to assess APOL variation and its association with chronic kidney disease (CKD) among participants genetically similar to African populations. APOL3, specifically the rs1108978 stop-gain variant, was independently associated with increased risk of CKD. The variant also interacts epistatically with APOL1 variants, most significantly affecting monoallelic APOL1 carriers, and was significantly associated with other CKD traits such as decreased kidney volume. The authors first identified the variant in data from the Penn Medicine Biobank and replicated these findings using data from the Veterans Affairs Million Veteran Program and the All of Us Research Program. The study findings highlight the utility of studying population-specific variants to better understand disease mechanisms and differences in outcomes across populations, especially in genetic ancestries not well represented in current genetic research.
Conclusion
These highlighted papers represent only a fraction of the genomic analyses currently enabled by the All of Us Research Program. They emphasize the breadth of genomic inquiry that can be conducted in the Researcher Workbench environment using All of Us data alone or in combination with other large-scale datasets such as UK Biobank (Hong et al., Hou et al., Petrazzini et al., and Ramo et al.). The featured papers enhance identification and interpretation of rare pathogenic mutations, improve interpretation of known genetic risk across diverse genetic ancestries, and further refine our understanding of the pathogenesis of a variety of diseases influenced by genetics, including immunodeficiency (von Beck et al.), coronary artery disease (Petrazzini et al.), cancer (Bates et al.), and aortic stenosis (Ramo et al.). A common theme among these papers is that discoveries or new insights were gained by performing these analyses in a database with broad genetic diversity. These findings underscore the importance of including populations from broad backgrounds in biomedical research. While several of the authors noted that their conclusions were limited by low numbers of participants meeting inclusion criteria, the All of Us Research Program is now over halfway to its goal of 1 million participants with genomic data: the current release includes srWGS data for more than 400,000 participants, and data releases are expected to exceed 500,000 within the next year. We expect that our next review will highlight new analyses enabled both by the increase in the number of participants with genomic data and by the addition of new data types such as wearable data and long read (lr)WGS. We hope the research community will continue to innovate and combine the multiple layers of data within All of Us to gain new insights into how genomic variation contributes to lifetime health.
Acknowledgments
We gratefully acknowledge All of Us participants for their contributions, without whom the research described in these papers would not have been possible.
Reference
- 1. Kozlowski E., Farrell M.M., Faust E.J., Gallagher C.S., Jones G., Landis E., Litwin T.R., Lunt C., Mian S.H., Mockrin S.C., et al. All of Us Research Program year in review: 2023-2024. Am. J. Hum. Genet. 2024;111:1800–1804. doi: 10.1016/j.ajhg.2024.07.023.
