Abstract
The All of Us (AoU) Research Program and UK Biobank (UKBB) contain a wealth of electronic health record (EHR) data, which can be harnessed to refine cohort selection via rule-based phenotyping algorithms. The Observational Health Data Sciences and Informatics (OHDSI) Phenotype Library (PL) hosts many complex phenotyping rules. Here, we compare prevalence for 423 OHDSI PL cohorts in AoU and UKBB. For three selected diseases (T2D, COPD, Acute MI), we analyze differences in demographics, social determinants of health (SDOH), geographic prevalence, and genome-wide association study (GWAS) results. We found that AoU has a significantly higher prevalence for 80% of phenotypes compared to UKBB. We also found that, for the selected diseases, SDOH variables differ significantly between the two biobanks. Geographic prevalence findings for each of these three diseases confirm known regions of high risk. Additionally, GWAS in UKBB discovered more genes associated with each of the three diseases than GWAS in AoU.
Introduction
Precision medicine leverages data at the intersection of medical history, genomics, environmental exposures, and more, and will be key to improving our understanding of disease etiology.1 Biobanks act as the foundation for precision medicine efforts by housing a wide variety of data for each participant.2 Two prominent biobanks are the UK Biobank (UKBB) and the All of Us Research Program (AoU). The UKBB contains data from over 500,000 residents of the United Kingdom (UK), all aged 40-69 years at recruitment, a majority of whom are White.3,4 In contrast, AoU emphasizes participant racial and ethnic diversity and contains data on more than 400,000 individuals over 18 years old in the United States.5 The recruitment strategies of these two biobanks differ as well. AoU has primarily targeted large academic medical centers, Veterans Affairs centers, and federally qualified health centers for recruitment.5 For example, the New York-Presbyterian/Columbia Irving Medical Center (NYP), a large academic medical center serving the New York City metropolitan area, is an AoU recruitment site. The UKBB, instead, recruited all eligible individuals through the National Health Service, the UK’s publicly funded healthcare system whose central registrars contain contact information for 98% of the population.3
Characterizing these biobanks and understanding how they differ is essential to contextualize any research findings involving the two biobanks. Previous work has shown that the UKBB population is healthier than the general UK population, while the AoU population has a higher disease burden compared with the US population, with the exception of psychiatric diagnoses.4,6,7 Furthermore, a recent study by Zeng et al. found that the majority of diseases have significantly higher prevalence in AoU than in the UKBB.6 Across each of these studies, disease cohort identification was performed via Phecode phenotyping or by self-reported outcomes from participants.4,6,7 Phecode phenotyping is a commonly used rule-based phenotyping approach that defines cases for a disease of interest as individuals with specific International Classification of Disease (ICD) codes reported on two or more distinct dates.8 However, the richness of Electronic Health Record (EHR) information contained in each of these biobanks allows researchers to define more expressive cohorts for downstream analysis.9,10 Increased expressiveness in phenotype construction can account for the inherent heterogeneity and missingness in EHR data, and provide researchers with greater options to correct for implicit biases in their phenotype algorithm, hence allowing for better and less biased comparison across different healthcare systems.11–13 The Observational Health Data Sciences and Informatics program (OHDSI) is an open-science collaborative that has developed a common EHR data standard called the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) and a Phenotype Library consisting of rules to allow for the creation and analysis of phenotypes across institutions to enable observational studies.14–18 The OHDSI Phenotype Library (PL) serves as a repository for high-quality phenotype definitions, including those generated by domain experts and subjected to a peer review process, and thus is a rich resource for high-throughput computational phenotyping.19 Phenotype definitions in the OHDSI PL are operationalized as SQL queries that can be run on any OMOP CDM database platform.19 The OMOP CDM enables increasingly complex phenotyping algorithms, in part, because it contains temporal data from multiple domains, such as conditions, measurements, and procedures.15,20 Importantly, AoU stores EHR data in OMOP and the UKBB has been converted to OMOP, facilitating easier comparisons.21
Our goal is to present a comparison of 423 disease cohorts created using OHDSI’s PL between UKBB and AoU. To enable this comparison, we need to be able to create cohorts both in the AoU workbench and the UKBB. To facilitate the creation of these cohorts in the AoU workbench, we also aim to develop a tool called Atlas2AoU. This tool will adapt the original OHDSI PL queries for use in the AoU workbench. The adaptation is necessary because, in the current version of the AoU workbench, observation periods do not align with OMOP CDM standards.22 This misalignment arises from the inclusion of data not only from EHRs but also from other sources (e.g., surveys). For a select set of disease cohorts (Type II Diabetes (T2D), Chronic Obstructive Pulmonary Disease (COPD), and Acute Myocardial Infarction (MI) cohorts), we perform detailed analyses of geographic disease prevalence, demographics and social determinants of health (SDOH), and identify variants significantly associated with each disease through genome-wide association studies (GWAS). Additionally, we compare 14 peer-reviewed OHDSI PL phenotypes between the NYP OMOP and AoU. A complete workflow of methods is provided in Figure 1.
Figure 1.

Workflow of methods. Created with BioRender.com.
Methods
Phenotyping on the AoU Workbench
ATLAS query modification: To enable correct cohort construction in AoU, Atlas2AoU alters the original ATLAS query in two ways. First, a temporary observation period table is created and queried in place of the original. The temporary table is constructed following OHDSI’s suggestion of taking the minimum and maximum dates of recorded clinical events for each individual across nine tables: specimen, death, visit occurrence, procedure occurrence, drug exposure, device exposure, condition occurrence, measurement, and observation.23 Second, due to permissions on the AoU workbench, queries do not write to a cohort table, but instead return the final table directly as a pandas DataFrame. Atlas2AoU is available at https://github.com/G2Lab/Atlas2AoU.
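The construction of the temporary observation period table can be sketched as follows. This is a minimal illustration rather than the Atlas2AoU implementation: it uses an in-memory SQLite database instead of the AoU workbench, the table name `temp_observation_period` and the function names are hypothetical, and only one date column per OMOP table is considered (a simplifying assumption; end dates could be included as well).

```python
import sqlite3

# The nine OMOP CDM event tables named in the text, each paired with one
# date column. Using a single date column per table is an assumption made
# to keep the sketch short.
EVENT_DATE_COLUMNS = {
    "specimen": "specimen_date",
    "death": "death_date",
    "visit_occurrence": "visit_start_date",
    "procedure_occurrence": "procedure_date",
    "drug_exposure": "drug_exposure_start_date",
    "device_exposure": "device_exposure_start_date",
    "condition_occurrence": "condition_start_date",
    "measurement": "measurement_date",
    "observation": "observation_date",
}


def build_temp_observation_period(conn):
    """Create one row per person spanning their earliest to latest event."""
    union = " UNION ALL ".join(
        f"SELECT person_id, {col} AS event_date FROM {table}"
        for table, col in EVENT_DATE_COLUMNS.items()
    )
    conn.execute(
        "CREATE TEMP TABLE temp_observation_period AS "
        "SELECT person_id, "
        "MIN(event_date) AS observation_period_start_date, "
        "MAX(event_date) AS observation_period_end_date "
        f"FROM ({union}) GROUP BY person_id"
    )


def demo():
    """Build toy tables, derive observation periods, and return the rows."""
    conn = sqlite3.connect(":memory:")
    for table, col in EVENT_DATE_COLUMNS.items():
        conn.execute(f"CREATE TABLE {table} (person_id INTEGER, {col} TEXT)")
    # ISO-formatted date strings sort chronologically, so MIN/MAX work on text.
    conn.execute("INSERT INTO condition_occurrence VALUES (1, '2015-03-01')")
    conn.execute("INSERT INTO measurement VALUES (1, '2019-07-15')")
    conn.execute("INSERT INTO drug_exposure VALUES (2, '2018-01-10')")
    build_temp_observation_period(conn)
    return conn.execute(
        "SELECT * FROM temp_observation_period ORDER BY person_id"
    ).fetchall()
```

In this toy example, person 1 receives an observation period from their first condition record to their last measurement, and person 2, who has a single record, receives a one-day period.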
ATLAS query testing: Before deploying the modified ATLAS query produced by Atlas2AoU on the AoU database, we first tested that the modified cohort query returned the same set of individuals as the original ATLAS query. To do so, both the modified and original queries were deployed on an OMOP CDM PostgreSQL database with data from the UKBB. If the modified and original queries returned the same participant sets in the UKBB OMOP CDM, the modified query was approved for deployment on AoU. The AoU workspace is available at https://workbench.researchallofus.org/workspaces/aou-rw-a975274f/atlas2aou/analysis.
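The approval check described above amounts to a set comparison of participant identifiers. A minimal sketch, assuming each query result has already been reduced to a list of person IDs (the function name and return structure are hypothetical):

```python
def compare_cohorts(original_ids, modified_ids):
    """Compare the participant sets returned by two cohort queries."""
    original, modified = set(original_ids), set(modified_ids)
    return {
        # The modified query is approved only if both sets match exactly.
        "identical": original == modified,
        "only_in_original": sorted(original - modified),
        "only_in_modified": sorted(modified - original),
    }
```

Reporting the two set differences alongside the boolean makes it easier to see which participants caused a mismatch when a modified query fails the check.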
Phenotyping and categorization: 423 phenotype algorithms from the OHDSI PL v3.1.6 were modified using Atlas2AoU, tested on the UKBB OMOP CDM, and successfully deployed on the AoU OMOP CDM. Some of the 423 phenotype algorithms represent the same diseases. Further detail on the OHDSI PL v3.1.6 phenotype algorithms can be found on the OHDSI Phenotype Library GitHub page. Each phenotype was classified into one of 30 broader disease categories by SNOMED index codes using the mapping present in table S1 of Liu et al.24 Namely, all condition occurrence inclusion codes were traced up to the SNOMED index per the SNOMED hierarchy, and the phenotype category was determined by majority vote.24
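The majority-vote step can be illustrated with a short sketch. It assumes that each inclusion code has already been traced up the SNOMED hierarchy to a category; the `code_to_category` mapping, the function name, and the tie-breaking rule (first category encountered) are illustrative assumptions, since the text does not specify how ties are resolved.

```python
from collections import Counter


def categorize_phenotype(inclusion_codes, code_to_category):
    """Return the most common category among a phenotype's inclusion codes.

    Codes without a mapped category are ignored. Ties are resolved in favor
    of the category seen first, which is an assumption of this sketch.
    """
    votes = Counter(
        code_to_category[code]
        for code in inclusion_codes
        if code in code_to_category
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```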
Prevalence estimation
For all analyses, cases were defined as those identified by the OHDSI PL query, and controls were defined as all other participants in the OMOP CDM Person table. The total number of individuals (N) for each database was 502,365 for the UKBB and 287,012 for AoU. Prevalence estimates for phenotypes with 20 or fewer cases were masked. Prevalence ratios were obtained by dividing the AoU prevalence by the UKBB prevalence. A two-proportion Z-test was performed to test for a significant difference between prevalence estimates in the two databases. As in Zeng et al., we define statistically significant higher prevalence in AoU as a p-value < 2 × 10⁻⁵ and prevalence ratio > 1.1, while a p-value < 2 × 10⁻⁵ and prevalence ratio < 0.91 constitutes statistically significant lower prevalence in AoU.6 AoU prevalence estimates for 14 peer-reviewed OHDSI PL phenotypes were also compared to estimates from the NYP OMOP database, which represents data from 3,724,487 individuals at a large academic medical center serving the greater New York City area.
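The decision rule above can be sketched in a few lines. The thresholds (p < 2 × 10⁻⁵, ratio > 1.1 or < 0.91, masking at 20 or fewer cases) come from the text; the use of a pooled standard error and a two-sided p-value is an assumption of this sketch, and the function names are hypothetical.

```python
import math

P_THRESHOLD = 2e-5      # significance threshold stated in the text
MIN_CASES = 20          # phenotypes with this many cases or fewer are masked


def two_proportion_z_test(cases_a, n_a, cases_b, n_b):
    """Two-sided two-proportion Z-test using a pooled standard error."""
    p_a, p_b = cases_a / n_a, cases_b / n_b
    pooled = (cases_a + cases_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def classify_prevalence(cases_aou, n_aou, cases_ukbb, n_ukbb):
    """Label a phenotype as higher, lower, or not different in AoU."""
    if cases_aou <= MIN_CASES or cases_ukbb <= MIN_CASES:
        return "masked"
    ratio = (cases_aou / n_aou) / (cases_ukbb / n_ukbb)
    _, p_value = two_proportion_z_test(cases_aou, n_aou, cases_ukbb, n_ukbb)
    if p_value < P_THRESHOLD and ratio > 1.1:
        return "higher in AoU"
    if p_value < P_THRESHOLD and ratio < 0.91:
        return "lower in AoU"
    return "no significant difference"
```

For example, calling `classify_prevalence(50244, 287012, 36170, 502365)` uses the AoU T2D cohort size from Table 1 together with an illustrative UKBB case count (36,170 is back-calculated here from the reported prevalence of 0.072 and is not a figure from the study); the prevalence ratio is roughly 2.4.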
Detailed cohort comparisons
Detailed cohort comparisons were conducted for phenotype definitions for T2D (cohort ID: 288), COPD (cohort ID: 28), and acute MI (cohort ID: 71) between the UKBB and AoU.25 These phenotypes were chosen since they represent widespread health concerns, and their corresponding definitions incorporate multiple data domains.
Detailed cohort comparisons – Geographic disease prevalence
For the UKBB, data fields 129 and 130 were used to obtain coordinates of each participant’s birthplace. The birthplace was then labeled by its corresponding county or unitary authority in the UK per 2023 boundaries from the Office for National Statistics (ONS) geoportal.26 For AoU, 2-digit participant zip codes in the continental US were obtained and matched to 2-digit Zip Code Tabulation Areas (ZCTAs) per 2020 US Census Bureau boundaries.27 Regions with a case or control count less than or equal to 20 were not included in the analysis.
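The regional exclusion rule can be sketched as follows, assuming each participant has already been assigned a region label (a county or unitary authority for the UKBB, a 2-digit zip area for AoU). The function name and input format are hypothetical.

```python
from collections import defaultdict

MIN_COUNT = 20  # regions need more than this many cases AND controls


def regional_prevalence(participants):
    """Compute prevalence per region from (region, is_case) pairs.

    Regions whose case count or control count is less than or equal to
    MIN_COUNT are excluded, following the rule stated in the text.
    """
    counts = defaultdict(lambda: [0, 0])  # region -> [cases, controls]
    for region, is_case in participants:
        counts[region][0 if is_case else 1] += 1
    return {
        region: cases / (cases + controls)
        for region, (cases, controls) in counts.items()
        if cases > MIN_COUNT and controls > MIN_COUNT
    }
```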
Detailed cohort comparisons – Demographics and SDOH
Participant age at recruitment or consent was obtained from each database. Data on race and biological sex were obtained from the OMOP CDM Person table. Finer-grained race concepts were collapsed into the three categories of Asian, Black, and White as per the AoU data dissemination policy.
Ten SDOH variables with comparable versions across both biobanks were chosen, covering education, socioeconomic status, housing, social interactions, and lifestyle choices. Complete case analysis was performed for all variables. For the nine categorical variables, fine-grained mapping between responses in each database was performed by two researchers. Detailed information on how raw SDOH variables were altered for comparison, and the mapping between categorical responses in the two databases, is provided in the Supplementary Material. Chi-squared tests of independence were performed to test for a significant difference between categorical variables in UKBB and AoU cohorts. For the numeric SDOH variable, pack-years of smoking, a t-test was performed. All p-values were capped at 1 × 10⁻³⁰⁰. We used a Bonferroni-corrected significance level of 0.05/10.
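As an illustration of the testing procedure, the sketch below computes the chi-squared statistic for a contingency table (rows as biobanks, columns as response categories) and applies the Bonferroni threshold and p-value cap stated above. Converting the statistic to a p-value requires the chi-squared survival function, which is omitted here to keep the example free of third-party dependencies; all function names are hypothetical.

```python
N_TESTS = 10               # ten SDOH variables
ALPHA = 0.05 / N_TESTS     # Bonferroni-corrected significance level
P_FLOOR = 1e-300           # p-values are capped at this value


def chi_squared_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def cap_p_value(p_value):
    """Replace p-values below the floor (including underflow to 0) by the floor."""
    return max(p_value, P_FLOOR)


def is_significant(p_value):
    """Apply the Bonferroni-corrected threshold to a capped p-value."""
    return cap_p_value(p_value) < ALPHA
```

With ten tests, the corrected threshold is 0.005, so a variable with p = 0.004 would be reported as significantly different while one with p = 0.006 would not.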
Detailed cohort comparisons – GWAS
To perform GWAS in the UKBB, genotype quality control (QC) was first performed on the 93M imputed autosomal variants released by the UKBB.28 All imputed variants with genotype probability above 0.9 were hard-called from dosages using a threshold of 0.1 with PLINK v2.29,30 Variants with genotype missingness above 5%, minor allele frequency (MAF) less than 5%, and Hardy-Weinberg equilibrium (HWE) p-value above 1 × 10⁻⁶ were removed. All variants with duplicate rsids (reference SNP cluster ids) and all insertions and deletions were removed. Additionally, we removed any SNPs with an imputation INFO score less than or equal to 0.3, resulting in 3,686,405 bi-allelic SNPs. Sample QC was performed on the 487,159 samples with imputed variant data. All samples had a missingness rate less than or equal to 5% on a subset of 605,836 directly measured variants used by the UKBB for missingness calculations (tag ‘in_HetMiss’).31 KING v2.3.2 was used to perform kinship calculations with a set of 93,511 directly measured variants selected by the UKBB for relatedness calculations (tag ‘in_Relatedness’).31,32 From kinship calculations, 405,811 participants were selected for downstream analysis, none of whom were third-degree relatives or closer. Next, flashpca v2.0 was used to calculate 20 principal components on the 405,811 participants and their genotypes for 147,606 directly measured SNPs previously selected by the UKBB for principal component analysis (tag ‘in_PCA’).31,33 Additionally, genotype array (UK BiLEVE vs. UK Biobank) was determined by calculating missingness at variants unique to the UK BiLEVE Axiom array. For the GWAS, the 20 principal components, genotype array, participant age (2012 minus year of birth per the OMOP CDM Person table), and sex at birth per the OMOP CDM Person table were included as covariates. This methodology closely follows that of Cai et al.34
To perform GWAS in the AoU workbench, SNPs from the ACAF threshold callset were used.35 The ACAF threshold callset is the largest of three callsets and contains variants from short-read sequencing (srWGS) that have a population-specific allele frequency or allele count greater than 1% or 100, respectively, in genetic ancestry subpopulations.35 AoU recommends using these curated smaller callsets to speed up GWAS analysis and reduce cloud computing costs. Quality control was performed using PLINK v1.9 by removing SNPs with MAF less than 5%.30,36 Related samples provided by AoU were removed from analysis. No other QC was performed per AoU guidelines, as sample and genotype QC were already performed on this callset.37 After quality control, 9.6M variants remained. GWAS was performed on 188,852 individuals. Participant year of birth and sex at birth per the OMOP CDM Person table were included as covariates. Additionally, all 16 principal components from the srWGS genetic predicted ancestry table for AoU were included as covariates.38 Previous work has analyzed the distribution of these genetic ancestries across the US.39 For both UKBB and AoU GWAS, PLINK v2 was used to perform the logistic regression, as it is a widely used, efficient, and scalable tool for large-scale GWAS.29,30 Phenotype-genotype associations with no error code were used in all downstream analysis. Quantile-quantile plots generally suggested adequate control for population structure. A significance level of 5 × 10⁻⁸ was set to identify variants significantly associated with the phenotype, as is standard in GWAS. Significant variants were mapped to exons of protein-coding genes using the GENCODE Release v44 comprehensive gene annotation GTF file.40
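The final mapping step, from genome-wide significant variants to exons of protein-coding genes, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: it parses GTF text lines directly, relies on the `gene_type` and `gene_name` attributes found in GENCODE GTF files, and uses a linear scan over exons, which would be replaced by an interval index for millions of variants. The function names and the tuple format for associations are hypothetical.

```python
import re

GENOME_WIDE_SIGNIFICANCE = 5e-8  # threshold stated in the text


def parse_protein_coding_exons(gtf_lines):
    """Return (chrom, start, end, gene_name) for protein-coding exons.

    GTF coordinates are 1-based and inclusive at both ends.
    """
    exons = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Attributes look like: gene_type "protein_coding"; gene_name "X";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_type") != "protein_coding":
            continue
        exons.append(
            (fields[0], int(fields[3]), int(fields[4]), attrs.get("gene_name"))
        )
    return exons


def genes_for_variant(chrom, pos, exons):
    """Names of genes with a protein-coding exon overlapping the position."""
    return sorted(
        {gene for c, start, end, gene in exons if c == chrom and start <= pos <= end}
    )


def significant_coding_hits(associations, exons):
    """Map significant (chrom, pos, p_value) associations to gene names."""
    hits = {}
    for chrom, pos, p_value in associations:
        if p_value < GENOME_WIDE_SIGNIFICANCE:
            genes = genes_for_variant(chrom, pos, exons)
            if genes:
                hits[(chrom, pos)] = genes
    return hits
```

Variants that pass the threshold but fall outside every protein-coding exon are dropped from the result, which mirrors the focus on coding variants in the analyses that follow.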
Results
Prevalence estimation
We found that AoU has significantly higher prevalence compared to the UKBB for 335 of the 423 phenotypes (Figure 2a). In contrast, the UKBB only has significantly higher prevalence for 23 phenotypes. Thus, we find that AoU has a much higher disease burden than the UKBB, consistent with previous observations.6
Figure 2.

Prevalence comparison for OHDSI PL cohorts in AoU, the UKBB, and NYP
We observed the highest prevalence ratios for diseases in the Psychiatry/Psychology category, including two phenotype definitions specific to Attention Deficit Hyperactivity Disorder. Furthermore, we found that AoU has a higher prevalence for 100% of phenotypes in the Psychiatry/Psychology, Ophthalmology, Endocrinology, Sleep, Hematology, Rheumatology, Metabolism and Nutrition, Miscellanea, and Orthopedics categories. We observed the lowest prevalence ratios for diseases Peripheral Ischemia, Chilblains, and Cerebrovascular accidents. Lastly, we compared the 12 unmasked, peer-reviewed OHDSI PL phenotypes not only across biobanks but against NYP data and found that AoU has significantly higher prevalence than NYP for all 12 phenotypes (Figure 2b).
Detailed cohort comparisons – T2D
We next performed detailed cohort comparisons for T2D (Figure 3). Overall, we found that, in agreement with previous studies6, AoU has significantly higher T2D prevalence than the UKBB. Regions with the highest prevalence by birthplace in the UKBB are Newport, Blaenau Gwent, and Caerphilly, reflecting the fact that Wales is known to have the highest prevalence of diabetes in the UK (Figure 3a).41 The 2-digit zip codes with the highest prevalence of T2D in the US correspond to areas of Kansas and Missouri (Figure 3a). We found that the Phecode phenotyping performed by a previous study6 resulted in higher prevalence estimates (UKBB: 0.091; AoU: 0.191) than OHDSI PL phenotyping (UKBB: 0.072; AoU: 0.175). However, the lower prevalence estimate provided by the OHDSI PL phenotype for the UKBB is closer to the general prevalence estimate of diabetes for the UK, 0.058, as well as average UKBB self-reported estimates for those aged 45-65 (0.053).4,42 On the other hand, the T2D prevalence estimates on AoU by both OHDSI PL and Phecode are much higher than the US population estimate of diagnosed diabetes (0.113 vs. OHDSI PL: 0.175 and Phecode: 0.191).43
Figure 3.
Geographic, SDOH, genomic, and demographic profiles for T2D
We next performed GWAS on both cohorts. We found a significant association for a coding variant on the EML2 gene in the UKBB T2D cohort (Figure 3d). This mutation was previously found to be associated with T2D in an East Asian population.44 Despite a larger T2D cohort size in AoU and a larger effective sample size, the UKBB GWAS found more significant coding variants. However, both the UKBB and the AoU GWAS discovered significant associations on coding variants of genes that are known risk factors for T2D or known to be involved in metabolic control mechanisms linked to diabetes.45–49 Lastly, we compared the SDOH variables between cohorts and found that smoking pack years, companionship, income, alcohol frequency, home ownership, and education level are significantly different (Figure 3c). The AoU cohort has lower smoking pack years, frequency of alcohol consumption, and home ownership, and a greater proportion of individuals experiencing companionship, belonging to households with higher income, and completing a high school degree or higher. Thus, in terms of SDOH variables alone, the UKBB population exhibits greater vulnerability to known risk factors for T2D.50,51 Nonetheless, the AoU population has a much greater proportion of individuals who are Black. The rates of T2D are known to be disproportionately higher among Black adults than the general adult population in the US.52
Detailed cohort comparisons – COPD
Next, we performed detailed cohort comparisons for COPD (Figure 4). We found that the prevalence of COPD is significantly higher in AoU than in the UKBB (Figure 4e). Areas of high prevalence for COPD in the UK are West Dunbartonshire, Glasgow City, and West Lothian (Figure 4a), and in the US are zip codes corresponding to areas of Kansas and Missouri (Figure 4b). These regions of high prevalence coincide with regions characterized by high smoking rates, a known risk factor for COPD.53,54 We found that Phecode phenotyping previously performed6 resulted in a similar prevalence estimate in AoU (OHDSI PL: 0.086; Phecode: 0.087), but a much higher estimate in the UKBB (OHDSI PL: 0.03; Phecode: 0.06). We next investigated demographic differences between the OHDSI PL COPD cohorts in AoU and the UKBB and found that the AoU cohort has a higher proportion of Black participants and is generally older than the UKBB cohort. We next compared SDOH variables and found that smoking pack years, companionship, income, alcohol frequency, home ownership, and education level are significantly different between the AoU and UKBB cohorts (Figure 4c). The direction of the difference follows the same trends as described above for T2D. While smoking pack years are significantly different between cohorts, there is no difference in the proportion of individuals who have smoked 100 cigarettes in their lifetime. Lastly, we performed GWAS on both cohorts. While the AoU GWAS did not detect any significant coding variants, the UKBB GWAS detected significant coding variants of genes (Figure 4d) known to harbor variants associated with COPD.55
Figure 4.
Geographic, SDOH, genomic, and demographic profiles for COPD
Detailed cohort comparisons – Acute MI
We next performed detailed cohort comparisons for acute MI (Figure 5). We found that the UKBB has significantly higher disease prevalence than AoU (Figure 5e). The regions of highest prevalence in the UK are East Lothian, Midlothian, and Knowsley (Figure 5a). The zip codes of highest prevalence in the US correspond to areas of Alabama, Michigan, and Florida (Figure 5b). These zip codes are also known to have high obesity rates, a recognized risk factor for acute MI.56–58 We found that previous Phecode phenotyping prevalence estimates6 (UKBB: 0.05; AoU: 0.034) were higher than OHDSI PL estimates (UKBB: 0.025; AoU: 0.022). We next compared demographic differences between OHDSI PL cohorts (Figure 5e), and found that the UKBB cohort has a much greater proportion of men than women, while the AoU cohort is more balanced. This difference is interesting considering recent efforts to properly diagnose acute MI in women, who often show different disease presentation than men.59 We compared SDOH variables between cohorts and found that smoking pack years, companionship, income, alcohol frequency, home ownership, and education level are significantly different between the AoU and UKBB cohorts (Figure 5c). As with T2D and COPD, we find that the AoU cohort has lower smoking pack years, frequency of alcohol consumption, and home ownership, and a greater proportion of individuals experiencing companionship, belonging to households with higher income, and completing a high school degree or higher. Lastly, we performed GWAS on both cohorts (Figure 5d). The UKBB GWAS found significant coding variants in genes (Figure 5d) that have been previously associated with myocardial infarction and other cardiovascular complications.60–66 The AoU GWAS did not detect any significant coding variants.
Figure 5.
Geographic, SDOH, genomic, and demographic profiles for Acute MI
ATLAS query modification
Table 1 demonstrates the impact of deploying an ATLAS query on the AoU database without proper configuration of the observation period table for three phenotypes. False positives are defined as individuals who were found by the original query but not by the Atlas2AoU modified query, while false negatives were missing from the original query but discovered by the Atlas2AoU query. As shown in the table, across all three phenotypes, the Atlas2AoU query results in a larger cohort size, increasing power of downstream analysis.
Table 1.
Cohort size comparison between original and Atlas2AoU queries for three phenotypes
| Phenotype | Original query false positives | Original query false negatives | Original query cohort size | Atlas2AoU query cohort size |
|---|---|---|---|---|
| T2D | 370 | 43,111 | 7,503 | 50,244 |
| COPD | 488 | 23,044 | 2,256 | 24,812 |
| Acute MI | 0 | 6,131 | 194 | 6,325 |
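The columns of Table 1 are related by a simple identity: the Atlas2AoU cohort size equals the original cohort size minus the original query's false positives plus its false negatives. The short sketch below checks this identity against the reported values (the dictionary layout and function name are illustrative):

```python
# Values transcribed from Table 1.
TABLE_1 = {
    "T2D": {"false_pos": 370, "false_neg": 43111, "original": 7503, "atlas2aou": 50244},
    "COPD": {"false_pos": 488, "false_neg": 23044, "original": 2256, "atlas2aou": 24812},
    "Acute MI": {"false_pos": 0, "false_neg": 6131, "original": 194, "atlas2aou": 6325},
}


def reconciles(row):
    """Check that original - false positives + false negatives = Atlas2AoU size."""
    return row["original"] - row["false_pos"] + row["false_neg"] == row["atlas2aou"]


def fraction_missed(row):
    """Share of the Atlas2AoU cohort that the original query failed to find."""
    return row["false_neg"] / row["atlas2aou"]
```

All three rows satisfy the identity. Under these numbers, the original query misses the large majority of each Atlas2AoU cohort, for example 43,111 of 50,244 T2D cases.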
Discussion and Conclusions
In this work, we found that AoU exhibits a significantly higher disease prevalence for 80% of OHDSI PL phenotypes compared to the UKBB. This result aligns with well-documented differences between the two biobanks: the UKBB primarily sampled from a relatively healthy cohort compared to the general UK population, whereas AoU recruited from a population with a higher disease burden than the general US population.4,6 Furthermore, we observed that AoU shows significantly higher prevalence across 12 OHDSI PL phenotypes compared to NYP. For conditions such as T2D, COPD, and acute MI, the Phecode phenotyping method used in previous studies generally produced higher prevalence estimates than the OHDSI PL phenotyping approach. This observation suggests a potential trade-off between sensitivity and specificity in phenotype definitions, though further investigation is needed to fully compare these two methods.
A more detailed look into the phenotypes of T2D, COPD, and acute MI confirmed many existing findings. Namely, geographic prevalence analysis of each cohort reflects overall trends in the US and the UK, and can inform targeted public health interventions. Additionally, regions of high prevalence found by geographic analysis were also regions with high prevalence of known disease risk-factors, for example smoking and COPD. Furthermore, the UKBB GWAS found significant coding variants in genes known to be associated with each disease. The AoU GWAS detected fewer genes across all three diseases studied, despite testing coding variants in the exons found significant in the UKBB and achieving larger effective sample sizes for T2D and COPD. This suggests possible loss of power from unaccounted population structure or cryptic relatedness, and future work may benefit from using linear mixed models to better control for confounding.67,68 Additionally, future work may involve gene-by-environment analysis to investigate how SDOH and genetic variants interact to influence disease risk. Lastly, across all three phenotypes, the SDOH variables of smoking pack years, companionship, income, alcohol frequency, home ownership and education level were significantly different between the UKBB and AoU, indicating that these differences may be reflective of the underlying sample population in each database, as opposed to reflecting different disease etiologies. This underscores a limitation in cross-biobank comparisons, as there exists confounding from the different recruitment strategies employed by each biobank.
Prevalence estimates in AoU are calculated using the number of participants with EHR data available (287,012). However, there are participants in AoU whose data come from sources outside of the EHR.69,70 These participants may have data relevant to the phenotype of interest, and may be classified as cases after OHDSI PL phenotyping. In this scenario, the prevalence estimate for the phenotype of interest would be an overestimate of the true prevalence. In our case, of the 423 phenotypes, this only occurred for one phenotype of obesity. While the Atlas2AoU tool provides users an opportunity to create OHDSI PL disease cohorts on the AoU Research Workbench, future work includes finding a way to create a permanent OMOP CDM compliant observation period table that considers different data provenance. This will allow for greater integration with OHDSI tools, though there are still technical challenges of running OHDSI’s large-scale analysis packages (in R) within the AoU Research Workbench that need to be resolved to replicate analyses done within the OHDSI research network.
Acknowledgments
We gratefully acknowledge All of Us participants for their contributions. We also thank the NIH All of Us Research Program for making available the participant data examined in this study. This publication was supported by a Roy and Diana Vagelos Precision Medicine Award, a Warren Alpert Foundation award, and NIH award R35GM147004 to G.G.; NIH awards 5U2COD023196 and 3OT2OD026556 to K.N.; and NIH award T15LM007079 to A.N.
References
- 1.Johnson KB, Wei W, Weeraratne D, Frisse ME, Misulis K, Rhee K, et al. Precision medicine, AI, and the future of personalized health care. Clin Transl Sci. 2021 Jan;14(1):86–93. doi: 10.1111/cts.12884. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Kim EY. Biobanks as a treasury for precision medicine. Healthc Inform Res. 2021 Apr;27(2):93–4. doi: 10.4258/hir.2021.27.2.93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine. 2015 Mar;12(3) [Google Scholar]
- 4.Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology. 2017 Nov 1;186(9):1026–34. doi: 10.1093/aje/kwx246. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.All of Us Research Program Investigators. Denny J, Rutter J, Goldstein D, Philippakis A, Smoller J, et al. The “All of Us” research program. N Engl J Med. 2019 Aug 15;381(7):668–76. doi: 10.1056/NEJMsr1809937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Zeng C, Schlueter DJ, Tran TC, Babbar A, Cassini T, Bastarache LA, et al. Comparison of phenomic profiles in the All of Us Research Program against the US general population and the UK Biobank. Journal of the American Medical Informatics Association. 2024 Jan 23:ocad260. [Google Scholar]
- 7.Barr PB, Bigdeli TB, Meyers JL. Prevalence, comorbidity, and sociodemographic correlates of psychiatric diagnoses reported in the All of Us Research Program. JAMA Psychiatry. 2022 Jun 1;79(6):622–8. doi: 10.1001/jamapsychiatry.2022.0685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Bastarache L. Using phecodes for research with the electronic health record: from PheWAS to PheRS. Annual Review of Biomedical Data Science. 2021.
- 9.Swerdel JN, Ramcharran D, Hardin J. Using a data-driven approach for the development and evaluation of phenotype algorithms for systemic lupus erythematosus. PLoS One. 2023 Feb 16;18(2):e0281929. doi: 10.1371/journal.pone.0281929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Hardin J, Murray G, Swerdel J. Phenotype algorithms to identify hidradenitis suppurativa using real-world data: development and validation study. JMIR Dermatol. 2022 Nov 30;5(4):e38783. doi: 10.2196/38783. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 2013;20(1):117–21. doi: 10.1136/amiajnl-2012-001145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Banda JM, Seneviratne M, Hernandez-Boussard T, Shah NH. Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annual Review of Biomedical Data Science. 2018 Jul 20;1:53–68. [Google Scholar]
- 13.Sun TY, Bhave SA, Altosaar J, Elhadad N. Assessing phenotype definitions for algorithmic fairness. AMIA Annu Symp Proc. 2023 Apr 29;2022:1032–41. [PMC free article] [PubMed] [Google Scholar]
- 14.Software Tools – OHDSI [Internet] Available from: https://www.ohdsi.org/software-tools/
- 15.OHDSI: Observational Health Data Sciences and Informatics; Standardized Data: The OMOP Common Data Model [Internet] Available from: https://www.ohdsi.org/data-standardization/ [Google Scholar]
- 16.Hripcsak G, Ryan PB, Duke JD, Shah NH, Park RW, Huser V, et al. Characterizing treatment pathways at scale using the OHDSI network. Proceedings of the National Academy of Sciences. 2016 Jul 5;113(27):7329–36. [Google Scholar]
- 17.Naderalvojoud B, Curtin CM, Yanover C, El-Hay T, Choi B, Park RW, et al. Towards global model generalizability: independent cross-site feature evaluation for patient-level risk prediction models using the OHDSI network. Journal of the American Medical Informatics Association. 2024 Feb 27:ocae028. [Google Scholar]
- 18.Wang Q, Reps JM, Kostka KF, Ryan PB, Zou Y, Voss EA, et al. Development and validation of a prognostic model predicting symptomatic hemorrhagic transformation in acute ischemic stroke at scale in the OHDSI network. PLOS ONE. 2020 Jan 7;15(1):e0226718. doi: 10.1371/journal.pone.0226718. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Rao G. PhenotypeLibrary: The OHDSI Phenotype Library [Internet] 2024.
- 20.Observational Health Data Sciences and Informatics, Kostka K. Chapter 10 Defining Cohorts | The Book of OHDSI [Internet] Available from: https://ohdsi.github.io/TheBookOfOhdsi/
- 21.Papez V, Moinat M, Voss EA, Bazakou S, Van Winzum A, Peviani A, et al. Transforming and evaluating the UK Biobank to the OMOP Common Data Model for COVID-19 research and beyond. Journal of the American Medical Informatics Association. 2023 Jan 1;30(1):103–11.
- 22.OMOP CDM v5.4 [Internet] Available from: https://ohdsi.github.io/CommonDataModel/cdm54.html#observation_period.
- 23.Philofsky M EHR Working Group. OHDSI: Observational Health Data Sciences and Informatics; Observation period considerations for EHR data [Internet] Available from: https://ohdsi.github.io/CommonDataModel/ehrObsPeriods.html.
- 24.Liu H, Carini S, Chen Z, Phillips Hey S, Sim I, Weng C. Ontology-based categorization of clinical studies by their conditions. Journal of Biomedical Informatics. 2022 Nov 1;135:104235. doi: 10.1016/j.jbi.2022.104235.
- 25.Rao G. Cohort Definitions in OHDSI Phenotype Library [Internet] Available from: https://github.com/OHDSI/PhenotypeLibrary/blob/ac17b7af55b01ec91eb2ac1ca1ea30473f8ba621/inst/Cohorts.csv.
- 26.Office for National Statistics. Open Geography Portal; Counties and Unitary Authorities (May 2023) Boundaries UK BFC [Internet]
- 27.United States Census Bureau. ZIP Code Tabulation Areas (ZCTAs) [Internet]
- 28.Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018 Oct;562(7726):203–9. doi: 10.1038/s41586-018-0579-z.
- 29.Purcell S. PLINK 2.0 [Internet] Available from: http://pngu.mgh.harvard.edu/purcell/plink/
- 30.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559–75. doi: 10.1086/519795.
- 31.Resource 1955 [Internet] Available from: https://biobank.ctsu.ox.ac.uk/ukb/refer.cgi?id=1955.
- 32.Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010 Nov 15;26(22):2867–73. doi: 10.1093/bioinformatics/btq559.
- 33.Abraham G, Qiu Y, Inouye M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics. 2017 Sep 1;33(17):2776–8. doi: 10.1093/bioinformatics/btx299.
- 34.Cai N, Revez JA, Adams MJ, Andlauer TFM, Breen G, Byrne EM, et al. Minimal phenotyping yields genome-wide association signals of low specificity for major depression. Nat Genet. 2020 Apr;52(4):437–47. doi: 10.1038/s41588-020-0594-5.
- 35.Smaller callsets for analyzing short read WGS SNP & indel data with Hail MT, VCF, and PLINK [Internet] User Support. 2024.
- 36.Purcell S. PLINK 1.9 [Internet] Available from: http://pngu.mgh.harvard.edu/purcell/plink/
- 37.Why do I see a high sample no call rate in the smaller callsets? [Internet] User Support. 2024.
- 38.How the All of Us genomic data are organized [Internet] User Support. 2024.
- 39.Sharma S, Nagar SD, Pemu P, Zuchner S, Mariño-Ramírez L, Meller R, et al. Genetic ancestry and population structure in the All of Us Research Program cohort. Nat Commun. 2025 May 3;16(1):4123. doi: 10.1038/s41467-025-59351-8.
- 40.Human Release 44 [Internet] GENCODE. 2024.
- 41.The British Diabetic Association. Diabetes in Wales [Internet] Diabetes UK. Available from: https://www.diabetes.org.uk/in_your_area/wales/diabetes-in-wales.
- 42.Health Survey for England - 2010, Trend tables [Internet] NHS England Digital.
- 43.Centers for Disease Control and Prevention. National Diabetes Statistics Report [Internet] Diabetes. 2023.
- 44.Lu Z, Zhang H, Yang Y, Zhao H. Sex differences of the shared genetic landscapes between type 2 diabetes and peripheral artery disease in East Asians and Europeans. Hum Genet. 2023 Jul 1;142(7):965–80. doi: 10.1007/s00439-023-02573-x.
- 45.Mohlke KL, Boehnke M. Recent advances in understanding the genetic architecture of type 2 diabetes. Human Molecular Genetics. 2015 Oct 15;24(R1):R85–92. doi: 10.1093/hmg/ddv264.
- 46.Yan P, Zhang L, Yang C, Zhang W, Wang Y, Zhang M, et al. Observational and genetic analyses clarify the relationship between type 2 diabetes mellitus and gallstone disease. Front Endocrinol (Lausanne) 2024 Jan 31;14:1337071. doi: 10.3389/fendo.2023.1337071.
- 47.Yuan S, Xu F, Li X, Chen J, Zheng J, Mantzoros CS, et al. Plasma proteins and onset of type 2 diabetes and diabetic complications: Proteome-wide Mendelian randomization and colocalization analyses. Cell Reports Medicine. 2023 Sep;4(9):101174. doi: 10.1016/j.xcrm.2023.101174.
- 48.Glunk V, Laber S, Sinnott-Armstrong N, Sobreira DR, Strobel SM, Batista TM, et al. A non-coding variant linked to metabolic obesity with normal weight affects actin remodelling in subcutaneous adipocytes. Nat Metab. 2023 May;5(5):861–79. doi: 10.1038/s42255-023-00807-w.
- 49.Ren X, Feng C, Wang Y, Chen P, Wang S, Wang J, et al. SLC39A10 promotes malignant phenotypes of gastric cancer cells by activating the CK2-mediated MAPK/ERK and PI3K/AKT pathways. Exp Mol Med. 2023 Aug;55(8):1757–69. doi: 10.1038/s12276-023-01062-5.
- 50.Wu Y, Ding Y, Tanaka Y, Zhang W. Risk factors contributing to type 2 diabetes and recent advances in the treatment and prevention. Int J Med Sci. 2014 Sep 6;11(11):1185–200. doi: 10.7150/ijms.10001.
- 51.Demakakos P, Marmot M, Steptoe A. Socioeconomic position and the incidence of type 2 diabetes: the ELSA study. Eur J Epidemiol. 2012 May 1;27(5):367–78. doi: 10.1007/s10654-012-9688-4.
- 52.Haw JS, Shah M, Turbow S, Egeolu M, Umpierrez G. Diabetes complications in racial and ethnic minority populations in the USA. Curr Diab Rep. 2021 Jan 9;21(1):2. doi: 10.1007/s11892-020-01369-x.
- 53.Centers for Disease Control and Prevention. Map of current cigarette use among adults [Internet] CDC State Tobacco Activities Tracking and Evaluation (STATE) System.
- 54.Revie L, Davies B, Mais D. Adult smoking habits in the UK: 2021 [Internet] Office for National Statistics.
- 55.Cho MH, McDonald MLN, Zhou X, Mattheisen M, Castaldi PJ, Hersh CP, et al. Risk loci for chronic obstructive pulmonary disease: a genome-wide association study and meta-analysis. Lancet Respir Med. 2014 Mar;2(3):214–25. doi: 10.1016/S2213-2600(14)70002-5.
- 56.Jefferson County, Alabama obesity and tobacco use prevention [Internet] Centers for Disease Control and Prevention; Communities Putting Prevention to Work.
- 57.Duval County Community Health Assessment 2017 [Internet] Florida Health Duval County. 2018.
- 58.2018 Detroit Community Health Assessment [Internet] Detroit Health Department. 2018.
- 59.Mehta LS, Beckie TM, DeVon HA, Grines CL, Krumholz HM, Johnson MN, et al. Acute myocardial infarction in women. Circulation. 2016 Mar;133(9):916–47. doi: 10.1161/CIR.0000000000000351.
- 60.Iqbal R, Jahan N, Sun Y, Xue H. Genetic association of lipid metabolism related SNPs with myocardial infarction in the Pakistani population. Mol Biol Rep. 2014 Mar 1;41(3):1545–52. doi: 10.1007/s11033-013-3000-x.
- 61.Dai X, Wiernek S, Evans JP, Runge MS. Genetics of coronary artery disease and myocardial infarction. World J Cardiol. 2016 Jan 26;8(1):1–23. doi: 10.4330/wjc.v8.i1.1.
- 62.Kontou P, Pavlopoulou A, Braliou G, Bogiatzi S, Dimou N, Bangalore S, et al. Identification of gene expression profiles in myocardial infarction: a systematic review and meta-analysis. BMC Med Genomics. 2018 Nov 27;11(1):109. doi: 10.1186/s12920-018-0427-x.
- 63.Fuior EV, Gafencu AV. Apolipoprotein C1: its pleiotropic effects in lipid metabolism and beyond. International Journal of Molecular Sciences. 2019 Nov 26;20(23):5939. doi: 10.3390/ijms20235939.
- 64.Jiang B, Li X, Li M, Zhou W, Zhao M, Wu H, et al. Genome-wide and exome-wide association study identifies genetic underpinning of comorbidity between myocardial infarction and severe mental disorders. Biomedicines. 2024 Oct 10;12(10):2298. doi: 10.3390/biomedicines12102298.
- 65.Satomi-Kobayashi S, Ueyama T, Mueller S, Toh R, Masano T, Sakoda T, et al. Deficiency of Nectin-2 leads to cardiac fibrosis and dysfunction under chronic pressure overload. Hypertension. 2009 Oct;54(4):825–31. doi: 10.1161/HYPERTENSIONAHA.109.130443.
- 66.Di Stolfo G, Mastroianno S, Soldato N, Massaro RS, De Luca G, Seripa D, et al. The role of TOMM40 in cardiovascular mortality and conduction disorders: an observational study. Journal of Clinical Medicine. 2024 Jan;13(11):3177. doi: 10.3390/jcm13113177.
- 67.Yao Y, Ochoa A. Nordborg M, Weigel D, editors. Limitations of principal components in quantitative genetic association models for human studies. eLife. 2023 May 4;12:e79238. doi: 10.7554/eLife.79238.
- 68.Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018 Sep;50(9):1335–41. doi: 10.1038/s41588-018-0184-y.
- 69.Koumae A, Hollis H, Rodriguez K, Master H. C2022Q4R9 v7 Data characterization report: overall All of Us cohort demographics [Internet]
- 70.Types of All of Us data and how they are organized [Internet] User Support. 2023.



