Abstract
Large-scale sequencing has enabled unparalleled opportunities to investigate the role of rare coding variation in human phenotypic variability. Here, we present a pan-ancestry analysis of sequencing data from three large biobanks, including the All of Us Research Program. Using mixed-effects models, we performed gene-based rare variant testing for 601 diseases across 748,879 individuals, including 155,236 with ancestry dissimilar to European. We identified 363 significant associations, which highlighted core genes for the human disease phenome and identified potential novel associations, including UBR3 for cardiometabolic disease and YLPM1 for psychiatric disease. Pan-ancestry burden testing represented an inclusive and useful approach for discovery in diverse datasets, although we also highlight the importance of ancestry-specific sensitivity analyses in this setting. Finally, we found that effect sizes for rare protein-disrupting variants were concordant between samples similar to European ancestry and other genetic ancestries (βdeming = 0.7-1). Our results have implications for multi-ancestry and cross-biobank approaches in sequencing association studies for human disease.
Introduction
In recent years, the advent of large-scale sequencing has propelled studies into the role of rare coding variation in human phenotypic variability, including the human disease phenome1–6. However, for binary disease endpoints, previous work has had limitations in terms of power and/or statistical methodology. These limitations have included the use of simple tests that do not account for ancestry and other covariates, or models that produce mis-calibrated test statistics for highly imbalanced phenotypes. Furthermore, discovery analyses typically focused on individuals of European genetic ancestry1,4,7,8, limiting interpretability, transferability and equity of genomic findings9–11.
Several contemporary biobank initiatives have prioritized the inclusion of samples from understudied groups, with a goal to increase equitable understanding of health and disease2,12–14. Notably, the All of Us Research Program has completed whole-genome sequencing on almost 250 thousand participants across the United States, of which most are enrolled from underrepresented communities and approximately half are of non-European genetic ancestry12,15. In this work, we set out to create a dataset of gene-based rare coding variant associations for human disease across large biobanks with sequencing data, and assess the role of diverse ancestral composition in rare variant association analyses.
Results
Ancestry distributions across three sequenced biobanks
We combined large-scale exome sequence data from the UK Biobank (UKB) and the Mass General Brigham Biobank (MGB), with whole-genome sequencing data from All of Us (AoU) (Figure 1). While several phenome-wide association studies (PheWAS) for protein-coding variants have been published from the European ancestry subset of UKB1,3,7, AoU and MGB represent lesser characterized cohorts. MGB is a health-system biobank from Eastern Massachusetts with a relatively high disease prevalence16. AoU is a diverse biobank that is actively enrolling participants from both health systems and via population-based ascertainment across the United States12,17, with an emphasis on underrepresented groups. After quality-control procedures (Supplementary Note, Supplementary Figure 1), we had a total of 748,879 individuals including 454,162 from UKB, 242,902 from AoU and 51,815 from MGB (Supplementary Table 1).
Figure 1. Study overview for rare variant discovery across human disease.
Three studies were included in the analysis: All of Us (AoU) with whole-genome sequence data, UK Biobank (UKB) with exome sequencing data, and Mass General Brigham Biobank (MGB) with exome sequence data. Over 600 disease Phecodes were identified using a hierarchical clustering algorithm. Disease Phecodes were analyzed using exome-wide gene-based testing of rare genetic variants using three masks (LOF, LOF+missense and ultra-rare missense), after which P-values were combined into a single P-value using the Cauchy distribution for each gene-disease pair.
While ancestry is not truly categorical18, we grouped individuals into major continental ancestry groups based on their genetic similarity to samples from the 1000 Genomes project19 (Supplementary Note), namely African, Admixed-American, East-Asian, European, and South Asian ancestries. Furthermore, individuals not falling clearly within predefined categories may not truly have ‘mixtures’ of such categorical ancestries; nevertheless, we refer to such samples as having ‘admixed’ ancestry. As expected, in our data, the ancestral diversity was greatest in AoU with 49.9% of participants having a genetically determined ancestry other than European (most notably 21.0% African and 16.6% Admixed-American ancestry). In contrast, 94.4% and 83.5% of samples from UKB and MGB were genetically determined to be of European ancestry (Figure 2a; Supplementary Table 1). Across the three datasets, 119,660 individuals (16.0%) were similar to a defined continental ancestry other than European, and another 35,576 samples (4.7%) were of ‘admixed’ ancestry, totaling 155,236 samples with an ancestry dissimilar to European ancestry (20.7%) (Supplementary Table 1).
Figure 2. Multi-ancestry meta-analysis of rare genetic variation across three sequenced biobanks in over 750,000 individuals identifies 363 rare variant associations.
Panel a shows a stacked bar chart with the proportion of each continental ancestry on the y-axis and dataset on the x-axis. Ancestral diversity was largest in AoU. Panel b is a violin plot with overlaid boxplot showing the prevalence of Phecodes on the y-axis and each dataset on the x-axis. Plotted Phecodes were those included in the analysis with at least 50 cases in each dataset (N=546). Panel c is a stacked bar chart showing the number of identified disease associations (Cauchy Q<0.01) on the y-axis and each dataset on the x-axis, as well as the meta-analysis results. Bars are stacked by the class of mask that yielded the lowest P-value (from LOF masks, LOF+missense masks, and ultra-rare missense masks). Panel d is a multi-trait gene-based Manhattan plot highlighting results from the overall meta-analysis, each dot representing one gene-trait test, with the -log10 of the Cauchy P-value on the y-axis and different disease categories on the x-axis. For disease categories with strong associations, the top three non-redundant genes are annotated with the gene names. Panel e is a violin plot with overlaid boxplot showing the distribution of inflation factors by phenotype (λ estimated at the 95th percentile) on the y-axis, and different rare variant masks on the x-axis, as well as the distribution for the Cauchy combination results (on the far right). Dotted lines show the 0.75 and 1.25 cutoffs for inflation factor on the y-axis. The number of phenotypes is 601 in all violins. Similarly, panel f shows the distribution of inflation factors by gene across the different masks and for the Cauchy combination results (on the far right), where the number of genes equals 14,388, 15,529, 17,809, 16,742, 15,462, 18,238, and 18,456. Note: Cauchy P-values represent the omnibus P-value of all masks for a gene-phecode pair (unadjusted for multiple testing) after combining them using the Cauchy distribution.
The Cauchy Q-values represent the Benjamini-Hochberg FDR adjustments of these Cauchy P-values. P-values for mask-phecode pairs - prior to the Cauchy combination - were derived from Z-score-based meta-analyses of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. All boxplots show median (center), 25th percentile (bottom of box), 75th percentile (top of box), smallest/largest value within 1.5*inter quartile range from hinge (bottom/top whiskers, respectively), and data points outside of this range (dots). Abbreviations: EUR, European ancestry; AFR, African ancestry; AMR, Admixed American Ancestry; SAS, South-Asian ancestry; EAS, East-Asian ancestry; UND, undefined ancestry; LOF, loss-of-function.
Phenotype and disease distributions
When comparing the three datasets, we found that several quantitative measures were similar across cohorts, including standing height, HDL cholesterol and creatinine levels (Supplementary Table 1). Nevertheless, several measures substantially differed between UKB and the US-based cohorts. BMI and HbA1c levels were lower in UKB than in MGB and AoU, which potentially reflects higher obesity rates in the USA as compared to the UK20. LDL cholesterol, triglycerides and blood pressure levels were lower in AoU and MGB as compared to UKB (Supplementary Table 1), which might reflect different practices regarding hypertension control between the countries21,22 and increased utilization of lipid-lowering therapy over the last decade, especially in the US23,24. Overall, quantitative measures were comparable between AoU and MGB.
To define disease endpoints for our main genetic analyses, we created up to 1,866 phecodes (disease phenotypes) from International Classification of Disease (ICD) code mappings, after which we pruned down to a list of 601 index codes using a hierarchical clustering algorithm (Methods), as previously applied7. This approach was chosen to limit the number of highly correlated phenotypes – and thereby many potentially redundant rare variant associations – that have been found in many previous PheWAS approaches. In our primary analyses we set a minimum of 50 cases, which left 546 phecodes in UKB, 601 in AoU and 601 in MGB (Supplementary Table 2). We note that not all AoU participants had complete electronic health record (EHR) linkage, although inclusion of such samples did not meaningfully affect any genetic association analyses (Supplementary Note; Supplementary Figure 2).
In keeping with the health system-based ascertainment of MGB, we found markedly higher phecode prevalence estimates in MGB as compared to UKB and AoU (Figure 2b; Supplementary Note), as well as higher rates of likely pathogenic variants for cardiomyopathy (Extended Data Figure 1). Furthermore, we found that disease prevalence estimates were generally lower in UKB than AoU, likely due to sampling procedures and due to slightly different ICD coding systems (ICD10 in UKB and ICD10-CM in AoU and MGB). Despite the differences between cohorts, disease prevalence estimates were highly correlated across datasets (Spearman’s r in range of 0.7 and 0.9; Supplementary Figure 3). Furthermore, gene-based effect sizes for three masks correlated reasonably between datasets (Supplementary Figure 4).
Rare variant meta-analysis across biobanks
For the 601 phecode endpoints, we then performed exome-wide, gene-based burden testing in each dataset followed by a meta-analysis. We assessed six rare variant masks, including various combinations of loss-of-function (LOF) variants and missense variants, and using various frequency filters (maximum continental population minor allele frequency [MAFpopulation-max] <0.1% and <0.001%). For each gene-phecode pair, we combined P-values from each mask into one P-value using the Cauchy distribution (Figure 1 and Methods). For various sensitivity analyses, we also performed burden testing inclusive of both rare and low-frequency variants (MAFpopulation-max<1%; Figure 1). Analyses in AoU and MGB yielded mostly positive-control associations, including many that were identified in the large UKB dataset; conversely, there were associations where AoU/MGB afforded better yield (i.e., associations that did not reach significance in UKB) (Supplementary Note).
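The Cauchy combination step described above can be illustrated with a minimal sketch. It assumes equal weights across masks, which may differ from the weighting used in the Methods; the function name `cauchy_combine` and the example mask P-values are hypothetical and serve only as illustration.

```python
import math

def cauchy_combine(p_values, weights=None):
    """Combine P-values (each strictly between 0 and 1) with the Cauchy combination test.

    Equal weights are an assumption of this sketch, not a statement
    about the weighting used in the study.
    """
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    # tan((0.5 - p) * pi) equals 1 / tan(p * pi); the latter form stays
    # numerically accurate for very small P-values.
    t = sum(w / math.tan(p * math.pi) for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy variate at t,
    # written with atan2 so that large t does not lose precision.
    return math.atan2(1.0, t) / math.pi

# Hypothetical per-mask P-values for a single gene-phecode pair.
mask_p = {"LOF": 1e-8, "LOF+missense": 0.5, "ultra-rare missense": 0.9}
combined = cauchy_combine(list(mask_p.values()))
print(f"Cauchy-combined P = {combined:.3g}")
```

A single strong mask dominates the combined statistic, which is why the omnibus P-value stays close to the smallest per-mask P-value multiplied by the number of masks.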
Per-cohort test statistics were very well-calibrated for all masks, as well as for the Cauchy combination, highlighting the robustness of our mixed-effects regression framework (Supplementary Figures 5-10). In an initial meta-analysis, however, we found an earlier-than-expected deviation of test statistics (Supplementary Figures 11-12). This inflation was largely due to the meta-analysis of AoU and MGB. Given partial recruitment from the same sites, we investigated test-wide correlations of test statistics between AoU and MGB in more detail; phecodes showed a median of ∼0.05 exome-wide correlation in test statistics (Supplementary Figure 13; Supplementary Note). We therefore applied a Z-score based meta-analysis with correction for sample overlap, which markedly improved the calibration (especially for high allele count masks that were most affected; Supplementary Figures 14-15). As large sequencing biobanks continue to grow, issues relating to sample overlap will also increase; in the future, central biobank policies might need reconsideration to allow identification of overlapping participants. Considering acceptable calibration of test statistics using our approach, we proceeded with the overlap-corrected meta-analysis.
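To make the effect of the overlap correction concrete, the sketch below shows one common formulation of a sample-size-weighted Z-score meta-analysis in which the variance of the weighted sum is adjusted by the between-study correlation of test statistics. This is an assumption for illustration and is not necessarily the exact estimator used in the Methods; the function name and the per-study Z-scores are hypothetical, whereas the sample sizes and the ∼0.05 AoU-MGB correlation are taken from the text.

```python
import math

def overlap_corrected_z(z_scores, sample_sizes, corr):
    """Sample-size-weighted Z-score meta-analysis with a correlation correction.

    corr is the between-study correlation matrix of the Z-scores
    (the identity matrix if the studies are independent).
    """
    w = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(wi * zi for wi, zi in zip(w, z_scores))
    # Variance of the weighted sum is w' R w rather than sum(w^2).
    k = len(w)
    variance = sum(w[i] * w[j] * corr[i][j] for i in range(k) for j in range(k))
    z = numerator / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    return z, p

# Order: UKB, AoU, MGB. Z-scores are made up for illustration.
z_by_study = [3.0, 2.5, 2.0]
n_by_study = [454162, 242902, 51815]
independent = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
aou_mgb_overlap = [[1, 0, 0], [0, 1, 0.05], [0, 0.05, 1]]

print(overlap_corrected_z(z_by_study, n_by_study, independent))
print(overlap_corrected_z(z_by_study, n_by_study, aou_mgb_overlap))
```

With a positive correlation between two studies, the denominator grows and the meta-analysis Z-score shrinks relative to the naive analysis, which is the direction of the calibration improvement reported here.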
Genetic association data quality
When assessing individual datasets, the largest number of significant associations was observed within the UKB (N=185 at Benjamini-Hochberg false-discovery rate [FDR] Q<0.01; FDR across all genes by all phecodes), while MGB yielded the fewest associations (N=52 at Q<0.01; Figure 2c, Supplementary Figures 5-8, Supplementary Table 3). Across the 11,060,516 unique gene-phecode pairs in our multi-ancestry meta-analysis, 363 gene-based associations reached significance at an overall FDR Q<0.01 (Figure 2d, Supplementary Table 3), spanning 165 unique phenotypes and 123 unique genes. Of note, 464 signals would have been identified in a naïve meta-analysis without correction for sample overlap; in a meta-analysis omitting MGB we would have identified 319 significant associations. After correction for sample overlap, meta-analysis test statistics were reasonably calibrated within different bins for disease case counts and rare variant carrier counts (Supplementary Figures 14-17, Supplementary Table 4). Consistently, no individual phecode showed evidence of major test-statistic inflation in our final meta-analysis (all λ95%<1.16; Figure 2e).
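The FDR Q-values referred to above follow the Benjamini-Hochberg procedure, which can be sketched in a few lines. The function name and the toy P-values are hypothetical; in the study the adjustment is applied across all gene-phecode pairs at once.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted Q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each Q-value
    # is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Toy example with four tests.
toy_p = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(toy_p, benjamini_hochberg(toy_p)):
    print(f"P = {p:<6} Q = {q:.3f}")
```

An association is called significant at Q<0.01 when its adjusted value falls below that threshold, so the effective P-value cutoff depends on how many strong signals the full set of tests contains.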
In contrast, we found several genes with strong inflation (N=198 genes with λ95%>1.5, 11 genes with λ95%>2.5; Figure 2f, Supplementary Table 5). Per-gene inflation may be caused by uncorrected confounders (i.e., overlap, population-stratification), stochastics (given small number of tests per gene; ≤601), or alternatively by widespread deleteriousness or pleiotropy of rare variants in the gene. In support of the latter, we found that many inflated genes represent known causes of Mendelian disease (e.g., PKD1, APC, TTN and FBN1), for which inflation was most prominent in relevant disease categories (Supplementary Figure 18). Furthermore, inflated genes were enriched for LOF intolerance25,26 (LOEUF<0.5: OR 2.8, 95%CI [2.1; 3.7], P=5.5x10-12; pLI>0.9: OR 2.6, 95%CI [1.9; 3.6], P=4.3x10-9; two-sided Fisher exact tests). In a sensitivity analysis restricting to samples of European ancestry, we observed a markedly better test statistic calibration for a minority of genes, but for most it was not substantial (Supplementary Figure 18). Finally, we observed that a matched analysis of two synonymous masks yielded no signals at FDR Q<0.1 (Supplementary Note), with the most significant signals including IGLL5 for white blood cell-related traits27,28. These results suggest that a substantial proportion of gene-based inflation in our primary analysis was due to deleterious effects and/or stochastics, although a degree of finer (subcontinental) population stratification cannot be excluded.
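For readers less familiar with the inflation factor reported in Figure 2e-f, the sketch below computes λ at a chosen quantile as the ratio of the observed to the expected 1-degree-of-freedom chi-square quantile. This is the common definition and is assumed here rather than taken from the Methods; the function name and the simple linear-interpolation quantile are illustrative choices.

```python
from statistics import NormalDist

def lambda_at_quantile(p_values, q=0.95):
    """Inflation factor at quantile q from two-sided P-values in (0, 1]."""
    nd = NormalDist()
    # Convert each two-sided P-value to a 1-df chi-square statistic.
    chi2 = sorted(nd.inv_cdf(p / 2.0) ** 2 for p in p_values)
    # Empirical quantile by linear interpolation between order statistics.
    pos = q * (len(chi2) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(chi2) - 1)
    observed = chi2[lo] + (pos - lo) * (chi2[hi] - chi2[lo])
    # Expected chi-square(1) quantile at q under the null.
    expected = nd.inv_cdf((1.0 - q) / 2.0) ** 2
    return observed / expected

# Evenly spaced P-values mimic a perfectly calibrated null distribution.
null_p = [(i + 0.5) / 10000 for i in range(10000)]
print(round(lambda_at_quantile(null_p), 3))
```

Because a single gene contributes at most 601 tests, an estimate of λ at the 95th percentile rests on roughly the 30 most significant phecodes, which helps explain why per-gene values are noisier than per-phecode values.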
Assessment of bias from pan-ancestry analyses in diverse populations
Given the increasing numbers and size of ancestrally diverse biobanks (e.g., AoU), it is important to understand whether pan-ancestry burden testing yields reasonable results. We therefore assessed the potential bias introduced by performing pan-ancestry analyses. To this end, we performed sensitivity analyses restricting to individuals with genetic ancestry similar to European ancestry (Figure 3; Supplementary Note; Supplementary Table 6). For the 363 significant signals, we compared the P-values from the all-ancestry and European-ancestry analysis, which flagged 6 potentially problematic associations that were markedly weaker in European ancestry individuals (Figure 3a). However, several of the weakened signals represented well-known gene-disease links, and comparison of log(OR) estimates showed a very high consistency between the pan-ancestry and European-ancestry analysis (Figure 3b).
Figure 3. Assessment of bias from inclusion of non-European samples among the significant associations.
Panel a shows a scatter plot with each dot representing a gene-phecode pair that reached test-wide significance in our primary analysis (Q<0.01), with -log10(PCauchy) from the primary analysis on the x-axis, and the -log10(PCauchy) derived from a European ancestry sensitivity analysis on the y-axis (both log-transformed for clarity). Specific cutoffs on the y-axis are highlighted using dotted lines. Any strong deviation of P-values could indicate bias in our multi-ancestry approach, or alternatively indicate markedly lower power among European samples. No associations were abolished when restricting to European samples. There were 6 additional strongly attenuated genes (0.05>PEUR>0.0005). Among these, several represent known gene-disease links (Supplementary Note). Panel b shows a scatter plot with the effect sizes for significant associations from the primary analysis on the x-axis, with the effect sizes from European-only sensitivity analyses on the y-axis. The effect size for the most significant mask is plotted for each gene-phecode pair, restricting to masks that had adequate allele counts in both the primary analysis and in the sensitivity analysis (cMAC≥20). Any large deviations from the dotted line (x=y) indicate bias from our multi-ancestry approach. Strikingly, no strong deviations of effect sizes were observed in this sensitivity analysis. For 8 associations there were insufficient alleles among European ancestry samples to compute an effect size, although these represented well-known gene-disease links (Supplementary Note). Taken together, these results show that the bias from inclusion of non-European samples was not substantial. Note: Bias is defined here as the spurious change in effect sizes / test statistics that is caused by inclusion of multiple ancestries but is not caused by true biological differences.
Cauchy P-values represent the omnibus P-value of all masks for a gene-phecode pair (unadjusted for multiple testing) after combining them using the Cauchy distribution. The Cauchy Q-values represent the Benjamini-Hochberg FDR adjustments of these P-values. P-values for mask-phecode pairs (prior to the Cauchy combination) were derived from Z-score-based meta-analyses of score tests from logistic mixed-effects models with saddle-point-approximation. All statistical tests and P-values are two-sided. ORs were estimated using inverse-variance-weighted meta-analysis of two-sided Firth’s logistic regression results. Abbreviations: ALL, all ancestry individuals; EUR, European ancestry individuals; OR, odds ratio.
To assess the effect of ancestry bias in a highly diverse dataset, we then repeated these analyses restricting to AoU only (Supplementary Note). In AoU, we found a limited number of likely false-positive signals driven by potential ancestry bias (2 of 111-121 signals; one gene; Supplementary Figure 19). Finally, we assessed whether genes were associated with our categorical ancestry outcomes. While several gene burdens were associated with ancestry, the significant genes from our primary analysis did not overlap ancestry-associated genes (Supplementary Note).
Taken together, our findings indicate that pan-ancestry burden testing – using mixed regression-based methods – may be a reasonable and inclusive approach to identify rare variant association signals, in diverse datasets where cases and controls are well-represented across continental ancestries. At the same time, our results outline important ancestry-specific sensitivity analyses that should be considered to scrutinize such signals.
Somatic variation impacting sequencing association studies
We noticed several genes associated with clonal hematopoiesis of indeterminate potential (CHIP) among the inflated genes29–31. We explored somatic variation further, through prediction of age by rare variant carrier status, and by evaluation of the phenotypic associations found for known CHIP and known somatic leukemia genes (Supplementary Note). As expected, known CHIP genes were most strongly associated with age (DNMT3A, TET2, SRSF2, SF3B1, ASXL1; Extended Data Figure 2a). These CHIP genes - and several known somatic leukemia genes (TP53, NOTCH1, IDH2, KLHL6, RUNX1, CHD2, DDX41) - were also associated with hematological traits and leukemic outcomes (Supplementary Tables 6-8, Extended Data Figure 2b). We conclude that somatic variation affecting hematological outcomes and CHIP genes is likely, although most of the significant associations are likely causal - albeit by somatic variation rather than germline variation. The effect of somatic variation in driving associations for nonhematological traits seems small in our dataset (Supplementary Note). Nevertheless, we advise careful interpretation of results from sequencing of blood-derived DNA for hematological outcomes and known hematological genes.
Genetic effects of core genes for the human disease phenome
Among the 363 significant associations in our meta-analysis, 301 were directly reported in the Online Mendelian Inheritance in Man (OMIM) database or were plausibly related to entries in this database (82.9%; Supplementary Table 8). Indeed, the significant signals from our analyses highlight pleiotropic disease genes (i.e., those associated with multiple disease outcomes and sequelae) and genes associated with large effect sizes (Figure 4), pointing towards core genes for the human disease phenome.
Figure 4. Large genetic effect sizes and pleiotropic associations identify core genes for the human disease phenome.
Panel a shows stacked bar charts for all genes from the meta-analysis that showed at least 3 associations, with the number of associations on the y-axis and gene on the x-axis. Bars are stacked by the class of the best mask (LOF, LOF+missense masks, or ultra-rare missense masks) for each gene-trait association. Panel b represents grouped boxplots showing rare variant effect size distributions per Phecode category, with log-scaled ORs on the y-axis and categories on the x-axis. The figure is restricted to gene-trait associations reaching Cauchy Q<0.01 and restricting to rare variant masks with P<2.6x10-6. Per category, only masks with at least 7 associations are shown (and therefore some categories do not show all masks and not all categories are plotted). In box plots, the number of contributing associations from left to right equals 53, 44, 23, 15, 9, 16, 37, 30, 9, 15, 8, 22, 32, 54, 58, 18, 9, 9, and 8. All boxplots show median (center), 25th percentile (bottom of box), 75th percentile (top of box), smallest/largest value within 1.5*inter quartile range from hinge (bottom/top whiskers, respectively), and data points outside of this range (dots). Panel c represents a multiple jittered lollipop chart showing rare variant effect sizes for each Phecode category. The x-axis shows the log-scaled OR with each dot representing an association (restricting to gene-trait associations with Cauchy Q<0.01 and rare variant masks with P<2.6x10-6), and Phecode categories on the y-axis. Horizontal lines start at 1 and end at the largest estimated effect size within the category. Dots are colored by class of rare variant mask. Select genes are annotated within each category to highlight large-effect size genes for the respective category. 
Note: In all panels, ORs were estimated using inverse-variance-weighted meta-analysis of two-sided Firth’s logistic regression results, while mask-phecode P-values were estimated from Z-score-based meta-analysis of score tests from logistic mixed-effects models with saddle-point-approximation. Cauchy P-values represent the omnibus P-value of all masks for a gene-phecode pair after combining them using the Cauchy distribution (unadjusted for multiple testing), while the Cauchy Q-values represent the Benjamini-Hochberg FDR adjustments of these P-values. All statistical tests and P-values are two-sided. Abbreviations: LOF, loss-of-function; OR, odds ratio.
Notable examples include associations of FBN1, a causative gene for Marfan’s syndrome (MIM154700), with 13 diseases across cardiovascular and genetic disease codes (Figure 4a, Supplementary Tables 6-8). FBN1 showed the largest effect size for Chromosomal anomalies and genetic disorders (ORLOF 569.08, PCauchy= 9.3x10-75; Figure 4bc, Supplementary Table 6). Similarly, the known adenomatous polyposis coli gene APC was associated with colorectal cancer (MIM175100; PCauchy=2.8x10-18, ORLOF 12.7), and 22 other codes largely related to gastro-intestinal disease (Figure 4, Supplementary Tables 6-8). The largest number of gene-based associations was identified for PKD1, a gene causative in autosomal dominant polycystic kidney disease (MIM173900). PKD1 was associated with 29 codes (Figure 4a, Supplementary Tables 6-8), most notably genitourinary congenital anomalies (PCauchy=1.1x10-153, ORLOF 78.71) and chronic renal failure (PCauchy= 1.8x10-73; ORLOF 17.36). Notably, the present analysis identifies various disease sequelae associated with known Mendelian disease genes (such as PKD1 and APC), many of which were not identified in previous PheWAS approaches (Supplementary Table 8).
Similarly, our large sample size allowed identification of many Mendelian gene-disease links that were not observed in previous PheWAS (Supplementary Table 8). For example, PTEN was associated with several phenotypes that recapitulate Cowden syndrome (MIM 158350), including congenital anomalies and thyroid disease; LMNA and TNNT2 were associated with cardiomyopathy and various sequelae (MIM 601494; 135150); CFTR was associated with cystic fibrosis (MIM 219700); FLCN was associated with congenital anomalies and renal cancers (MIM 135150); SMAD3, COL3A1 and LDLR were associated with vascular aneurysms (MIM 613795, 130050, 143890); PAX6 was associated with congenital diseases of the eye (MIM 120430); (potentially somatic) variants in TP53 were associated with various cancers; NODAL was associated with congenital heart disease (MIM 270100); and SOD1 was associated with anterior horn cell disease (MIM 105400). These results highlight how the continued growth in sequencing is enabling an increased detection of bona fide Mendelian contributors to the disease phenome.
Most of the significant associations were driven by masks that combined LOF variants with missense variants (e.g., a LOF+missense mask had the lowest P-value; N=193 associations), while LOF-only masks drove results for 157 gene-phecode pairs and missense-only masks drove results in only 13 cases (Figure 2c). For instance, among the highly pleiotropic genes, associations for LMNA, TP53, BRCA2 and BRCA1 were strongest for LOF+missense masks, while associations for MYH7 were driven largely by ultra-rare missense variation (Figure 4a, Supplementary Tables 6-7) consistent with genetic mechanisms in cardiomyopathy32,33.
Understanding the effect sizes conferred by rare variants from a genome-first view may enable unbiased interpretation of risk and allow comparison to common variant effects. For disease categories with multiple associations, we tabulated the distributions of effect sizes (Figure 4b, Supplementary Table 9). For the Circulatory System the median OR for LOF variants was 4.5 (1st quartile – 3rd quartile, Q1-Q3 [2.7-16.6], 53 pairs), with the largest effect identified for FBN1 and aortic aneurysm (ORLOF 108.5; Figure 4c, Supplementary Tables 6 and 9). The category Neoplasms showed a median ORLOF of 8.3 (Q1-Q3 [5.1-19.5], 54 pairs) with the largest effect conferred by (potentially somatic) variants in the leukemia gene NOTCH1 (ORLOF 118.9, Figure 4bc). As expected, the largest median effect size was identified for associations in the category Congenital Anomalies (ORLOF of 24.3; Q1-Q3 [15.0-119.8], 15 pairs; Figure 4b, Supplementary Table 9). While we acknowledge that association yield is determined by statistical power, and therefore larger sample sizes may identify additional smaller-effect associations, our current work provides a useful reference of human Mendelian variation for common disease categories within the adult population.
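According to the figure notes, the ORs summarized above were pooled across biobanks by inverse-variance-weighted meta-analysis of per-cohort Firth regression estimates. A minimal sketch of that pooling step on the log-OR scale is given below; the function name and the per-cohort estimates are hypothetical and do not correspond to any gene in the study.

```python
import math

def ivw_meta(log_ors, standard_errors):
    """Fixed-effect inverse-variance-weighted pooling of log odds ratios."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Made-up per-cohort estimates (log OR, SE) for one gene-phecode pair.
cohort_log_or = [math.log(4.0), math.log(5.5), math.log(3.0)]
cohort_se = [0.20, 0.35, 0.60]

beta, se = ivw_meta(cohort_log_or, cohort_se)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(f"Pooled OR = {math.exp(beta):.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

Cohorts with more carriers yield smaller standard errors and therefore dominate the pooled estimate, so the largest biobank tends to drive the reported OR for most gene-phecode pairs.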
Power from pan-ancestry approaches
In our pan-ancestry approach we identified markedly more significant associations than a restrictive approach including only individuals with ancestries similar to the largest continental ancestry in our dataset (European ancestry; 18.2% fewer associations [N=297]). The improved yield may reflect the larger total sample size, or additional power afforded by the inclusion of diverse ancestries. To assess this more formally, we down-sampled AoU to an ancestrally-diverse dataset of equal sample size to the ‘European ancestry’ subset (Methods) – N=106,057 samples with complete EHR linkage – and reran our main rare variant analyses. We observed comparable or slightly fewer numbers of significant signals in the ancestrally-diverse subsets of AoU as compared to the European subset (Supplementary Table 10). When including low-frequency variant masks (MAFpopulation-max<1%), we still did not observe a yield benefit of the ancestrally-diverse subset as compared to the European-only subset (Extended Data Figure 3).
Case frequencies across different ancestries may have contributed to this finding: overall, disease prevalence was higher among samples with genetically determined European ancestry than among other individuals (Extended Data Figure 3c; Supplementary Table 11). It currently remains unclear whether this represents an artifact of AoU sampling34, or reflects broader bias in medical care in underrepresented populations35,36. Despite this, removal of phecodes that were strongly enriched among European ancestry samples did not markedly alter the results (Supplementary Table 12).
For common variant GWAS it has been shown that ancestry-specific variants may strongly contribute to genetic signal and discovery37–39. In contrast, our analyses did not establish a marked increase in discovery yield from ancestral diversity for rare variant burden testing of disease phenotypes, at a phenome-wide scale. This result may partly represent higher disease frequency among individuals genetically similar to European ancestry. In addition, we note that i) certain distinct rare variant signals may exist in populations dissimilar to European ancestry; ii) specific phenotypes might have increased yield in populations dissimilar to European ancestry, for instance if the phenotype is enriched in that population (Extended Data Figure 3c); and iii) our results may not translate to founder populations and populations with high degrees of consanguinity8,40. Furthermore, it is possible that sample sizes for underrepresented groups currently remain too small to confer meaningful boosts in power for burden testing.
Rare variant signals informing disease biology
We identified several biologically-plausible gene-disease links, which were recently described in biobank studies4,5,41–42 (Supplementary Tables 6-8). These included PIEZO1, encoding a mechano-sensing protein, with varicose veins (ORLOF 1.92, PCauchy= 5.3x10-8); AJUBA, encoding a protein involved in cell-cell adhesion, with erythematosquamous dermatosis (ORLOF 26.1, PCauchy= 2.1x10-7); and GIGYF1, encoding a regulator of insulin-like-growth-factor signaling, with type 2 diabetes (ORLOF 3.3, PCauchy= 4.8x10-14).
Overall, 42.4% of significant associations (154/363) did not reach significance in two previous biobank-scale PheWAS (Supplementary Table 8); 8.8% of associations (32/363) were also not reported in the OMIM database. Among these signals, several were consistent with recent literature. Examples include the associations of the SRCAP-complex-encoding genes DMAP1 (ORLOF 3.7, PCauchy= 6.3x10-8) and YEATS4 (ORLOF 3.8, PCauchy= 5.6x10-8) with benign neoplasms of uterus43; of APOB, encoding a lipid particle apolipoprotein, with chronic liver disease and cirrhosis44 (ORLOF 2.3, PCauchy= 1.0x10-8); and of NOS3, encoding a nitric oxide synthase, with ischemic heart disease45 (ORLOF 1.7, PCauchy= 9.1x10-9).
We additionally focus on select novel findings, restricting to those that survived sensitivity analyses (Supplementary Table 6). For instance, we found that rare variants in YLPM1 were significantly associated with bipolar disorder (PCauchy=8.1x10-9; ORLOF, 3.9) and personality disorders (PCauchy=2.0x10-7; ORLOF, 7.8; Extended Data Figure 4). Common variants near YLPM1 are associated with mood instability, depressed affect and neuroticism46–48, and OpenTargets49 reports strong colocalization for a YLPM1 eQTL with “Feeling worry” and “Feeling nervous” (posterior probability of colocalization >0.8). Furthermore, a rare YLPM1 missense variant was among several variants co-segregating with apparent autosomal-dominant bipolar disorder in one pedigree50. In a recent exome sequencing study of bipolar disorder, ultra-rare YLPM1 LOF and missense variants reached nominal significance (OR 3.4, P=0.01; one-sided Fisher exact test), although some case overlap with our discovery samples is possible51. YLPM1 is expressed in many tissues, including brain, although it has not been widely studied functionally.
We further found that UBR3 variants were associated with an adverse metabolic profile, including hypertension (PCauchy=6.7x10-9; ORLOF, 2.8), type 2 diabetes (PCauchy=3.8x10-8; ORLOF, 3.6), and a suggestive signal for obesity (PCauchy=1.8x10-6; ORLOF, 2.6; Extended Data Figure 5). In OpenTargets, there was moderate evidence for colocalization between a common UBR3 sQTL and BMI-adjusted waist-to-hip ratio (posterior probability for colocalization of 0.66). A previous mutational screen in mice identified Ubr3 loss as a strong inducer of increased weight and fat-to-lean mass in both male and female mice52, and a paralog of UBR3, UBR2, was found in a recent sequencing study for body-mass-index53. UBR3 and UBR2 are highly constrained (pLI=1) and encode ubiquitin protein ligase components54.
Other novel associations include MIB1, encoding a Notch signaling protein55 found to regulate pancreatic β-cell formation in mice56, with type 2 diabetes (PCauchy=5.3x10-8; ORLOF, 1.3); and SYTL1, which encodes a synaptotagmin, a protein class involved in neuronal and endocrine exocytosis57,58, with hypothyroidism (PCauchy=6.5x10-8; ORLOF, 1.7). While we identified initial replication evidence for these genes (Supplementary Note), novel associations will require further external replication in other large datasets.
Consistency of rare variant effects across ancestries
Finally, we asked whether our multi-ancestry dataset could answer whether the effects of rare coding variation for human disease are consistent across ancestries. To this end, we used a three-sample approach to assess whether effects are consistent between European ancestry samples and individuals of other genetic ancestries. We first identified suggestively significant signals (P<2.6x10-6) from a meta-analysis of European individuals from UKB and MGB, and then assessed the effect sizes of these signals in the diverse AoU dataset (Methods).
Phenome-wide significant burden effect sizes from European ancestry samples correlated well with the estimated effects from other ancestries (Figure 5; Supplementary Table 13), similar to previous trans-ancestry findings for common variants59–61 and quantitative traits1. To better incorporate measurement error and assess calibration, we then used Deming regression (Methods). We found highly significant slopes (all P<2.4 x 10-8) for LOF and missense masks, which in most cases were consistent with a calibration of 1 (Figure 5; Supplementary Table 13). For instance, for LOF variant masks, the regression slope was 0.9 for European versus African ancestry (P=3.9x10-23, 95%CI [0.72; 1.08]), and 0.9 for European versus Admixed-American ancestry (P=6.4x10-47, 95%CI [0.78; 1.02]).
Figure 5. Effect sizes of rare coding variants for disease correlate between genetic European and other genetic ancestries.
Figure 5 shows scatter plots with the effect sizes from European-ancestry analyses on the x-axis with the respective effect sizes estimated among individuals dissimilar to European ancestry on the y-axis. In each figure panel, a three-sample design was applied: Significant mask-disease pairs were identified from a EUR meta-analysis of UKB and MGB (significance determined at P<2.6x10-6), after which those mask-disease pairs were assessed within different ancestry groups from the AoU dataset. Each panel shows effect sizes (ie, log[OR]) for EUR analysis on the x-axis and effect sizes from other ancestries on the y-axis; the left panels show EUR versus all non-European samples, the middle panels show EUR vs African ancestry samples, and the right panels show EUR versus Admixed-American samples. Part a shows results for rare LOF variant masks with at least 20 carriers in both ancestry assignments, while part b shows results for ultra-rare missense0.5 variant masks with at least 20 carriers in both ancestry assignments. Linear trend lines from error-in-variable total-least-squares Deming regression are added to the plots. Statistics from Deming regression, including estimated β [95%CI] and P-values, are added in text in the top left corners. A regression coefficient (βsens) and 95%CI is also provided in the bottom right corners, showing results from a combined sensitivity analysis where genes associated with age or leukemic outcomes are removed, and where analyses are adjusted for quantiles of effective sample size (Supplementary Table 14). Note: All ORs were estimated using Firth’s logistic regression models among unrelated participants. Deming regression was run using beta coefficients and their standard errors, making the analysis comparable to York regression with the assumption of uncorrelated errors. Standard errors were computed using Jackknife estimators. All statistical tests and P-values are two-sided. 
Abbreviations: LOF, loss-of-function; OR, odds ratio; EUR, European ancestry; AFR, African ancestry; AMR, Admixed American ancestry; nonEUR, defined ancestry other than European; sens, sensitivity analysis.
We then performed several sensitivity analyses. These included the removal of genes associated with age and/or leukemic outcomes, and analyses accounting for bins of effective sample size. These analyses produced largely consistent results, although estimated coefficients for ultra-rare missense variants tended to be somewhat attenuated (Supplementary Tables 14-15; Figure 5). Finally, for LOF variants, we used a random-effects inverse-variance-weighted (IVW) approach to combine phenotype-specific results for phecodes with at least three qualifying genes. While we caution against overinterpretation of the individually noisy estimates, the meta-analysis yielded consistent results when comparing European ancestry to non-European ancestry (βDeming-IVW 0.82-0.87, P<2x10-17) and when comparing European ancestry to African ancestry (βDeming-IVW 0.84-0.87, P<0.004; Supplementary Table 15).
Broadly, our results provide evidence that effect sizes for rare LOF variants have reasonable consistency between European and other genetic ancestries, justifying further trans-ancestry approaches to improve discovery power in disease sequencing association studies. In addition, these analyses support the notion that causal variants share high consistency in their effects across different ancestries62. Nevertheless, our analyses assume homogeneous effects across phenotypes and genes; subgroup analyses with respect to specific diseases and genes were not adequately powered at the current sample size and remain directions for future work. Furthermore, our analyses were not powered to assess other major continental ancestries (eg, Asian ancestry) at this time.
Publicly available data via the Human Disease Knowledge Portal
We have released a web portal to browse our gene-based results, through the Human Disease Knowledge Portal (https://hugeamp.org:8000/research.html?pageid=600_traits_app_home). Users may browse results from individual datasets (UKB, AoU, MGB), various meta-analyses (including uncorrected and sample overlap-corrected meta-analyses), different ancestries (all-comers, European ancestry only) and various mask filters (MAFpopulation-max<0.1%; MAFpopulation-max <1%). For instance, our portal highlights 8 and 9 exome-wide significant (P<1e-6) genes for Cardiomyopathy and Diabetes Mellitus, respectively (Extended Data Figures 6-7). For de novo discovery or replication, researchers might want to restrict to specific subsets of our data; to this end, the individual cohort results and results for meta-analyses of UKB+AoU and UKB+MGB are available.
Discussion
The present work is imperfect and subject to several limitations. First, phecode definitions may imperfectly capture disease status, and therefore a degree of phenotype misclassification is likely. Second, the overlapping samples between AoU and MGB may have introduced bias despite applying an overlap-aware meta-analysis, as rare variant burdens may be affected differently by sample overlap than single common variants. Reassuringly, however, the uncorrected inflation was most notable for higher allele count burdens – which are expected to behave more similarly to common variants - while few significant results were observed for low allele count burdens (Supplementary Figures 16-17). Third, while our main signals were not driven by continental population stratification, it is possible that finer population stratification introduced some bias. Fourth, we applied a more liberal cutoff based on an FDR of 1% in our PheWAS analyses. For all the above reasons, any specific novel gene-disease links will require replication in independent datasets. Fifth, our statistical analyses were focused specifically on rare variant burden testing and might not translate to rare single variant and/or variance component tests.
Relatedly, our analyses of discovery yield in diverse datasets may still have been limited by sample size. Future studies with even larger diverse datasets might be needed to identify benefits for rare variant burden testing, especially considering our focus on binary outcomes. Finally, our analyses were restricted to protein-coding genes, while rare non-coding regions remain largely unexplored on a population scale. The All of Us Research Program aims to eventually release WGS data on over 1 million participants, and the UK Biobank recently made WGS data available on almost 500 thousand samples; these data will be instrumental to extend our findings to additional populations and non-coding regions.
In conclusion, through pan-ancestry meta-analysis of over 750,000 sequences, we present a dataset of gene-based rare variant associations across a wide range of human disease phenotypes. Our results provide insights into the consistent effects of ultra-rare coding genetic variation for human disease across ancestries, while providing analytical implications for future sequencing approaches. These findings are of relevance given the important and continued efforts to sequence underrepresented populations10–13,63,64. To propel use of our data, we have made our results available for download and browsing in the Human Disease Knowledge Portal.
Methods
Study datasets
In the present study we utilized three large biobanks with available sequencing data and linkage to electronic health records.
UK Biobank
The UK Biobank (UKB) is a large population-based prospective cohort study from the United Kingdom that included over 500,000 individuals with deep phenotypic data, including medical interviews, electronic health record linkage and death registry linkage65,66. Participants were recruited between 2006 and 2010 at ages of 40-69 years66. Relevant genomic data currently includes exome sequencing on over 450,000 samples funded through industry partnerships1,67. Exomes were captured using the revised version of the IDT xGen Exome Research Panel v1.0 on Illumina NovaSeq 6000 machines (https://www.ukbiobank.ac.uk/media/najcnoaz/access_064-uk-biobank-exome-release-faq_v11-1_final-002.pdf). Alignment using BWA-MEM, calling using DeepVariant, and joint genotyping using GLNexus have been described in detail elsewhere (https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/UKB_WES_Protocol.pdf). In the present study, we utilized the OQFE exome call set and closely followed a previously published pipeline to perform stringent quality-control (QC) of the exome sequencing data, including genotype QC, variant QC and sample QC5. Details on custom QC, principal component analysis, ancestry inference and relatedness inference are described in the Supplementary Note. After quality-control, we were left with 18,752,405 high-quality autosomal variants and 454,210 high-quality samples, of which 454,162 could be linked to their phenotypic data. The UK Biobank resource was approved by the UK Biobank Research Ethics Committee and all participants provided written informed consent to participate. Use of UK Biobank data was performed under application number 17488 and was approved by the Mass General Brigham Institutional Review Board.
All of Us
The NIH’s All of Us Research Program (AoU) is a longitudinal cohort study that aims to include 1 million racially, ancestrally and demographically diverse participants from across the United States, combining phenotypic data from various sources including patient-derived information and electronic health record linkage68. One of the goals set by AoU was to recruit individuals that have been and continue to be underrepresented in biomedical research because of limited access to health care12,68. Consistently, AoU prioritized underrepresented participants for genome sequencing and data collection and included them in the first few releases of the dataset, resulting in a diverse research population with rich phenotypic data. As part of the release in April 2023, whole genome sequencing (WGS) was performed on approximately 250,000 participants using Illumina NovaSeq 6000 machines following manufacturer’s best practices. The same protocol for library preparation (PCR Free Kapa HyperPrep) and the same software for variant calling (DRAGEN v3.4.12) were used to keep the WGS data generated at different AoU Genome Centers consistent. A stringent central QC procedure was applied, as described in the program’s genomic quality report [https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-], leaving 245,394 samples (47.7% described as racial/ethnic minorities). We performed further genotype, variant, and sample QC procedures on the exome-region callset released by the program (containing variants within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon), resulting in 242,902 eligible samples and 31,247,262 high-quality genetic variants. Details on the QC procedure, ancestry inference, principal component analysis and relatedness inference are described in the Supplementary Note. All enrolled participants provided informed consent to AoU.
Use of AoU data was approved under a data use agreement between the Massachusetts General Hospital and the All of Us research program.
Mass General Brigham Biobank
The Mass General Brigham Biobank (MGB; formerly known as Partners Biobank) is an ongoing observational research project enrolling participants from a multicenter health system in Eastern Massachusetts69. Participants are enrolled with broad-based consent collected by local research coordinators, either as part of a collaborative research study or electronically through a patient portal70. Demographic data, blood samples and surveys are collected at baseline and linked to electronic health record data. All adult patients provided informed consent to participate. A small number of children were enrolled with IRB-approved assent forms; upon reaching 18 years of age all enrolled children had to provide consent or were removed from the study. The Human Research Committee of MGB approved the Biobank protocol (2009P002312). Exome sequencing has currently been completed for over 53,000 MGB participants, partly within the NHGRI’s Centers for Common Disease Genomics initiative and partly through industry partnership with IBM health. Samples were sequenced on Illumina NovaSeq machines with a custom exome panel (TWIST Human Core Exome), with a target of at least 20X coverage at >85% of target sites. Alignment, processing and joint-calling of variants were performed using the Genome Analysis ToolKit (GATK v4.1) following GATK best practices, after which we applied a stringent QC pipeline on the sequencing data (comparable to the pipeline applied in the UKB). Details on QC, ancestry inference, principal component analysis and relatedness inference are described in the Supplementary Note. After stringent QC, we were left with 12,421,458 autosomal genetic variants across 52,059 high-quality samples, of which 51,815 could be matched to their electronic health records.
Ancestry definitions
In all analyses across all datasets, ancestry labels were based on inference from the genetic data. In all datasets, we defined labels for continental ancestries, namely European (EUR), East Asian (EAS), South Asian (SAS), African (AFR) and Admixed American (AMR) ancestries. Methods for genetic inference of ancestry differed between UKB and MGB, as compared to AoU. Methodology for ancestry inference is described in the Supplementary Note.
Phenotype construction
We defined a harmonized set of disease endpoints across the included datasets. To this end, we used the R-package PheWAS (v1.0, https://github.com/PheWAS/PheWAS) to create disease phecodes mapped from various ICD-10 billing codes71. We required at least one instance of an ICD code to define a sample as a case, while all other samples were considered controls for the given phecode. Prevalent and incident cases were pooled. In MGB and AoU, 1,866 and 1,835 phecodes could be mapped from ICD-10-CM code data, respectively, while in UKB available ICD-10 code data allowed mapping to 1,591 phecodes. In UKB, we manually curated a select number of traits, which had low case numbers in UKB due to absence of available ICD-10 codes (but had high case numbers in AoU/MGB; Supplementary Note). Given the high degree of correlation between various phecodes, we then utilized a clustering algorithm to identify important index phecodes7; we performed the clustering algorithm within the most phenotypically rich dataset, MGB. We first excluded any phecode with <50 cases in MGB (leaving 1,770 phecodes), which we then used to create a cosine similarity matrix and a cosine distance matrix (1- similarity matrix). We used Ward’s method to hierarchically cluster the cosine distance matrix, using a clustering tree height cutoff of 1.0 to define meaningful phecode clusters. We defined the index phecode as the phenotype with the highest case count within a cluster, utilizing the sum of case counts across UKB, a previous release of AoU (N=98k), and MGB. Therefore, it is possible that a given index phecode is not present in each dataset; however, we keep the phecode yielding a high overall case number to increase statistical power for downstream genetic analysis. The clustering process left 519 index phecodes; we manually inspected the codes that were removed and pulled back 82 phecodes, leaving a final set of 601 largely independent phecodes for analysis. 
Of the 601 phecodes, 555, 601 and 601 codes were found in UKB, AoU and MGB, respectively, of which 546, 601 and 601 had at least 50 cases.
Variant annotation
In each dataset, variants were annotated using dbNSFP (v.4.2a for MGB and v.4.3a for UKB and AoU; ref.72) and the Loss-of-Function Transcript Effect Estimator (LOFTEE; ref.25) plug-in implemented in the Variant Effect Predictor (VEP; v.105)73 (https://github.com/konradjk/loftee). VEP was used to ascertain the most severe consequence of a given variant for each gene. LOFTEE was implemented to identify high-confidence LOF variants, which include frameshift indels, stop-gain variants and splice site disrupting variants. LOFs flagged by LOFTEE as dubious were removed. Missense variants were assigned a missense score representing the proportion of bioinformatics tools predicting a damaging effect, following previously published methods5. In short, we used information from 30 tools included in the dbNSFP database to score each missense variant by the number of tools predicting a damaging/deleterious effect, and divided this value by the number of tools that gave a prediction. Missense variants with <7 predictions were removed. For instance, if 14 tools predicted a damaging effect and 28 total tools gave a prediction, then the missense score would equal 0.5 (14/28). Details on the contributing tools are provided in the Supplementary Note. Finally, variants were annotated with the highest continental allele frequency from gnomAD v2 exomes (extracting frequencies for EUR, EAS, SAS, AFR and AMR super-populations) denoted as ‘gnomAD popmax’25. Within a dataset, the highest MAF between gnomAD popmax and the within-dataset MAF was designated the MAFpopulation-max.
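The missense score and the MAFpopulation-max annotation described above can be illustrated with a minimal Python sketch. The function names and the representation of tool predictions as True/False/None are assumptions made for illustration; they are not taken from the study's pipeline.

```python
from typing import Optional, Sequence

MIN_PREDICTIONS = 7  # variants with fewer tool predictions are removed


def missense_score(predictions: Sequence[Optional[bool]]) -> Optional[float]:
    """Proportion of predicting tools that call the variant damaging.

    `predictions` holds one entry per tool (hypothetical encoding):
    True = damaging, False = not damaging, None = no prediction.
    Returns None when fewer than MIN_PREDICTIONS tools gave a prediction.
    """
    called = [p for p in predictions if p is not None]
    if len(called) < MIN_PREDICTIONS:
        return None
    return sum(called) / len(called)


def maf_population_max(gnomad_popmax: float, dataset_maf: float) -> float:
    """Highest of the gnomAD popmax frequency and the within-dataset MAF."""
    return max(gnomad_popmax, dataset_maf)


# Worked example from the text: 14 damaging calls among 28 predictions
# (here 2 of the 30 tools give no prediction).
example = [True] * 14 + [False] * 14 + [None] * 2
print(missense_score(example))  # 0.5
```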
Rare variant analyses
In each dataset, we performed exome-wide rare variant collapsing tests across the included disease phecodes with ≥50 cases. We assessed 6 rare variant masks in our main discovery analysis, namely
i) ‘rare LOF’ mask restricting to LOF variants with MAFpopulation-max <0.1% (ie, MAF<0.1% in the dataset and gnomAD popmax<0.1%),
ii) ‘rare LOF+missense0.8’ mask including both LOF variants and predicted-deleterious missense variants with missense score >0.8 and MAFpopulation-max <0.1%,
iii) ‘rare LOF+missense0.5’ mask including both LOF variants and predicted-deleterious missense variants with missense score >0.5 and MAFpopulation-max <0.1%,
iv) ‘ultra-rare LOF+missense0.5’ mask including both LOF variants and predicted-deleterious missense variants with missense score >0.5 and MAFpopulation-max <0.001% (for within dataset filtering, we used MAC<5 if more inclusive),
v) ‘ultra-rare missense0.5’ mask restricting to missense variants with missense score >0.5 and MAFpopulation-max <0.001% (for within dataset filtering, we used MAC<5 if more inclusive), and
vi) ‘ultra-rare missense0.2’ mask restricting to missense variants with missense score >0.2 and MAFpopulation-max <0.001% (for within dataset filtering, we used MAC<5 if more inclusive).
The stringent frequency cutoffs were chosen to limit results to very rare genetic variation in an attempt to enforce orthogonality to conventional common variant GWAS results1,5.
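As an illustration of how the six primary masks partition variants, the following Python sketch assigns a single annotated variant to every mask it qualifies for. The `Variant` fields and helper names are hypothetical, and the handling of "MAC<5 if more inclusive" (within-dataset MAF<0.001% or MAC<5) is our reading of the mask definitions rather than a confirmed implementation detail.

```python
from dataclasses import dataclass
from typing import List, Optional

RARE = 1e-3        # 0.1%
ULTRA_RARE = 1e-5  # 0.001%


@dataclass
class Variant:
    is_lof: bool                     # high-confidence LOF
    missense_score: Optional[float]  # None if not a scored missense variant
    gnomad_popmax: float             # highest continental gnomAD frequency
    dataset_maf: float               # within-dataset minor allele frequency
    dataset_mac: int                 # within-dataset minor allele count


def is_rare(v: Variant) -> bool:
    return v.gnomad_popmax < RARE and v.dataset_maf < RARE


def is_ultra_rare(v: Variant) -> bool:
    # Assumed reading: within the dataset, use whichever of the two
    # filters (MAF<0.001% or MAC<5) is more inclusive.
    within = v.dataset_maf < ULTRA_RARE or v.dataset_mac < 5
    return v.gnomad_popmax < ULTRA_RARE and within


def missense_above(v: Variant, cutoff: float) -> bool:
    return (not v.is_lof and v.missense_score is not None
            and v.missense_score > cutoff)


def primary_masks(v: Variant) -> List[str]:
    """Names of the primary masks that the variant contributes to."""
    masks = []
    if is_rare(v):
        if v.is_lof:
            masks.append("rare LOF")
        if v.is_lof or missense_above(v, 0.8):
            masks.append("rare LOF+missense0.8")
        if v.is_lof or missense_above(v, 0.5):
            masks.append("rare LOF+missense0.5")
    if is_ultra_rare(v):
        if v.is_lof or missense_above(v, 0.5):
            masks.append("ultra-rare LOF+missense0.5")
        if missense_above(v, 0.5):
            masks.append("ultra-rare missense0.5")
        if missense_above(v, 0.2):
            masks.append("ultra-rare missense0.2")
    return masks
```

Under these assumptions, a LOF singleton absent from gnomAD enters all four LOF-containing masks, while a missense variant with score 0.6 and a frequency of 0.04% enters only the ‘rare LOF+missense0.5’ mask.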
In secondary analyses, we also performed burden testing inclusive of low-frequency variants (MAFpopulation-max <1%; Figure 1), for
i) LOF variant mask (MAFpopulation-max <1%),
ii) LOF+missense0.8 mask (MAFpopulation-max <1%), and
iii) LOF+missense0.5 mask (MAFpopulation-max <1%).
For a given phenotype, rare variant masks were analyzed in a two-sided logistic mixed-effects score test using custom software (https://github.com/seanjosephjurgens/UKBB_200KWES_CVD/tree/v1.2), which is a previously-described adaptation (ref.)5 of the R-package GENESIS (v2.18; ref.74).
Fixed effects included age, age^2, sex, sequencing batch (if applicable; Supplementary Note), ancestral principal components 1 to 4, and any of components 5 to 20 that was associated with the phecode (nominal P<0.05 among unrelated samples). In AoU only the first 16 components were available. We accounted for relatedness by including a sparse kinship matrix as a random effect (Supplementary Note) and P-values were computed using the saddle point approximation (SPA) to account for case-control imbalance75. In cases where the mixed-effects model failed to converge, analyses were conducted using regular logistic regression among unrelated individuals. Missing genetic data were imputed to zero. For tests reaching nominal significance (P<0.05), odds ratios (ORs) and standard errors (SEs) were estimated using an approximate Firth’s bias-reduced logistic regression76,77 in the unrelated subset of each dataset.
Meta-analyses
To compute meta-statistics we used a score-based meta-analysis approach. For each phenotype-mask test, we computed the scoremeta as the sum of study-specific score statistics, and the score variancemeta as the sum of study-specific score variances78. To account for case-control imbalance in our meta-analysis, we recomputed the score variances in each dataset using the SPA P-values prior to meta-analysis (ref.)79 (Supplementary Note). To prevent false-positives driven by low minor allele count, we removed any tests with cumulative minor allele count (cMAC) <10 in the study-specific results prior to meta-analysis. Because AoU does not allow extraction of summary statistics describing results from <20 individuals, the minimum number of alternative allele carriers for AoU was set to 20 prior to extraction of data from the AoU web portal. After meta-analysis, we removed any results with cMAC<20. Therefore, our meta-analysis results include only tests with cMAC ≥20, where each contributing study has cMAC ≥10. Effect sizes for significant associations were estimated using an inverse-variance weighted meta-analysis of ORs and SEs.
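The score-based meta-analysis can be sketched in a few lines of Python. This is a minimal illustration rather than the study's code: the function names are hypothetical, and the variance recomputation assumes that the SPA-adjusted variance is the value for which score²/variance reproduces the two-sided SPA P-value under a normal reference distribution.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple

_NORM = NormalDist()


def spa_adjusted_variance(score: float, spa_p: float) -> float:
    """Score variance implied by a two-sided SPA P-value (assumed form)."""
    z = -_NORM.inv_cdf(spa_p / 2.0)  # |Z| matching the two-sided P-value
    if z == 0.0:
        raise ValueError("P-value of 1 carries no variance information")
    return score ** 2 / z ** 2


def score_meta(scores: Sequence[float],
               variances: Sequence[float]) -> Tuple[float, float, float]:
    """Sum study-level scores and variances; return (score, variance, P)."""
    score = sum(scores)
    variance = sum(variances)
    z = score / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    return score, variance, p


# Two hypothetical studies for one phenotype-mask test
print(score_meta([2.0, 1.0], [1.0, 3.0]))
```

The cMAC filters described above (study-level cMAC ≥10, meta-level cMAC ≥20) would be applied before and after calling such a function, respectively.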
Because we found that there was evidence of sample overlap and/or cryptic relatedness between AoU and MGB (median 0.05 test statistic correlation; Results), we then applied an approach to correct the meta-analytical P-values for this issue (Supplementary Note). In short, we used a weighted Z-score meta-analysis that i) first estimates the spurious test statistic correlation across datasets - estimated separately for each phenotype (we found that correlations were approximately consistent across masks; Supplementary Figure 13); and ii) then corrects the meta-analytical weights, accounting for the spurious correlation. While not perfect, we found that this approach yielded a substantially better calibration of meta-analytical test statistics (Results). We note that this correction does not directly correct the effect size estimation, and therefore the variance of the effect sizes might be underestimated; nevertheless, we found that the corrected P-values were reasonable for hypothesis testing.
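To make the idea of an overlap-aware combination concrete, the sketch below shows one common form of a correlation-adjusted weighted Z-score meta-analysis, in which the variance of the weighted sum includes the estimated between-study correlation. It is an illustration under stated assumptions: the exact correction used in this study is described in the Supplementary Note and may differ, and the function name, the weights and the example correlation are hypothetical.

```python
import math
from typing import Sequence, Tuple


def correlated_z_meta(z: Sequence[float],
                      w: Sequence[float],
                      corr: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Weighted Z-score meta-analysis allowing correlated test statistics.

    z    : study-level Z-scores
    w    : study weights (for example, square roots of effective sample size)
    corr : matrix of test statistic correlations between studies,
           with ones on the diagonal
    Returns the combined Z-score and its two-sided P-value.
    """
    n = len(z)
    numerator = sum(w[i] * z[i] for i in range(n))
    # Var(sum_i w_i Z_i) = sum_i sum_j w_i w_j r_ij
    variance = sum(w[i] * w[j] * corr[i][j]
                   for i in range(n) for j in range(n))
    z_meta = numerator / math.sqrt(variance)
    p = math.erfc(abs(z_meta) / math.sqrt(2.0))
    return z_meta, p


# Two studies with a spurious correlation of 0.05 between test statistics
independent = correlated_z_meta([2.0, 2.0], [1.0, 1.0], [[1, 0], [0, 1]])
corrected = correlated_z_meta([2.0, 2.0], [1.0, 1.0], [[1, 0.05], [0.05, 1]])
print(independent, corrected)
```

With positive correlation the denominator grows, so the corrected Z-score is smaller in magnitude than the uncorrected one, which is the direction of adjustment needed when overlap inflates test statistics.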
To compute a single P-value per gene-phecode pair, we used the Cauchy distribution to combine the mask-specific P-values (from all six different masks) into a single omnibus P-value. The Cauchy distribution allows for valid aggregation of multiple, potentially correlated, test statistics into a single test statistic80. A Benjamini-Hochberg false-discovery-rate (FDR) correction was then applied to these Cauchy P-values to compute multiple testing-corrected Q-values, taking into account all gene-phecode pairs in one FDR correction. Q-values <0.01 were considered significant. All discovery analyses and meta-analyses considered all samples irrespective of ancestry. In sensitivity analyses, all discovery and meta-analyses were repeated restricting to samples genetically determined to be similar to European ancestry. We also analyzed matched synonymous variants, as described in the Supplementary Note.
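The omnibus test and the FDR step can be sketched as follows. The Cauchy combination below uses equal weights across masks, which is an assumption (the text does not state the weighting), and the small-P and large-statistic approximations are standard numerical safeguards rather than details taken from the study. Function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cauchy_combine(pvalues: Sequence[float],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Combine P-values with the Cauchy combination test."""
    n = len(pvalues)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights (assumption)
    t = 0.0
    for w, p in zip(weights, pvalues):
        if p < 1e-15:
            t += w / (p * math.pi)  # tan((0.5 - p) * pi) ~ 1 / (p * pi)
        else:
            t += w * math.tan((0.5 - p) * math.pi)
    if t > 1e15:
        return 1.0 / (t * math.pi)  # tail approximation for huge statistics
    return 0.5 - math.atan(t) / math.pi


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg Q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Six hypothetical mask-specific P-values for one gene-phecode pair
omnibus = cauchy_combine([1e-9, 3e-7, 0.002, 0.04, 0.3, 0.8])
print(omnibus)
```

In the analysis described above, one omnibus P-value would be computed per gene-phecode pair, and the Q-values would then be derived from the full set of omnibus P-values in a single correction, with Q<0.01 considered significant.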
Assessment of power benefits from diverse ancestries
To investigate the effect of ancestral diversity on discovery yield, we compared the number of identified associations at various significance cutoffs, when using all samples, and when using only samples of genetically-determined European ancestry. We assessed the number of signals at Bonferroni-corrected significance, P<1e-7, and at standard exome-wide significance (2.6x10-6). To disentangle whether differences in the number of discovered associations were due to the diminished sample size for the European-only analysis compared to the entire dataset, we then applied a down-sampling approach. We down-sampled the AoU dataset (ie, removed samples) so that the remaining sample size matched the sample size of the European subset of AoU. While doing so, we ensured the most ancestrally-diverse composition of the down-sampled dataset (Supplementary Note). As such, we created two equally sized subsets of AoU: one of exclusively European ancestry, and one highly diverse dataset. We performed exome-wide discovery analyses as described for our primary analysis. We then assessed the discovery yield – measured by number of significant associations at P<1e-7 and exome-wide significance – in both datasets. We further performed meta-analyses where we combined either dataset with UKB, to assess whether an ancestrally-diverse dataset may improve yield when combined in meta-analysis with a large homogeneous set (Supplementary Table 10).
Assessment of rare variant effect sizes across ancestry
We then aimed to assess whether rare variant effects estimated from genetically-determined European samples were consistent in other ancestries. For this analysis, we considered rare LOF variant (MAF<0.1%) masks and ultra-rare missense0.5 (MAF<0.001%) variant masks. We employed a three-sample design to avoid bias from Winner’s curse. First, we identified suggestively significant mask-phecode associations from a European ancestry meta-analysis of UKB and MGB, defined as P<2.6x10-6 among samples of genetically-determined European ancestry. We then estimated effect sizes for those masks in various subsets of AoU. For instance, our main analysis focused on the effect sizes of those signals among European ancestry individuals in AoU, and the respective effect sizes among samples with a genetically-defined ancestry dissimilar to European (AFR, AMR, EAS, SAS, admixed). Effect sizes and standard errors for both groups were estimated using Firth’s regression among unrelated samples, requiring at least 20 rare variant carriers in both groups. In secondary analyses, we performed similar comparisons, this time comparing effect sizes from European ancestry samples with the respective effect sizes from two defined ancestry groups with sufficient sample size in AoU, namely African and Admixed American ancestries.
To compare phenome-wide effect sizes between different groups, we computed Pearson correlation estimates, quantifying the correlation between rare variant effect sizes from European samples and the respective effect sizes in non-European samples. Because effect sizes from our analysis are estimated with error (large standard errors given low numbers of carriers), correlation and regression estimates can be biased downward, a phenomenon known as attenuation bias81. Given known standard errors of our estimates, we also computed disattenuated correlation coefficients providing upper bound estimates of the possible true correlations between European effect sizes and non-European effect sizes (Supplementary Note). We then aimed to build regression models quantifying the relationship between European and non-European effect sizes. To incorporate the error in effect estimates, we used Deming regression62,82, a form of error-in-variables total-least-squares regression, to regress non-European effect sizes on European effect sizes (using function deming() in R-package deming v1.4). Since the standard errors for each beta coefficient were known, these were directly fed into the regression model. As such, the assumption of equal error ratios was relaxed, making the regressions comparable to York regression with the assumption of uncorrelated errors. Regression weights were applied to account for potential heteroscedasticity, and standard errors were computed using Jackknife estimators for all regressions including >8 data points.
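For readers unfamiliar with Deming regression, the Python sketch below implements the textbook closed-form estimator with a single error-variance ratio. This is a deliberate simplification: the analysis described above passed per-estimate standard errors to deming() in R, which relaxes the constant-ratio assumption, and used jackknife standard errors, neither of which is reproduced here. The function name and the choice to derive the ratio from mean squared standard errors are assumptions for illustration.

```python
import math
from statistics import fmean
from typing import Sequence, Tuple


def deming_fit(x: Sequence[float], y: Sequence[float],
               se_x: Sequence[float], se_y: Sequence[float]
               ) -> Tuple[float, float]:
    """Classic Deming regression of y on x; returns (slope, intercept).

    delta is the ratio of error variances (y errors over x errors),
    approximated here by the ratio of mean squared standard errors.
    """
    n = len(x)
    delta = fmean([s * s for s in se_y]) / fmean([s * s for s in se_x])
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    if sxy == 0.0:
        raise ValueError("zero covariance: slope is undefined")
    d = syy - delta * sxx
    slope = (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    return slope, my - slope * mx


# Hypothetical log(OR) estimates: EUR on x, another ancestry group on y
eur = [0.5, 1.0, 1.5, 2.0]
other = [0.9 * b for b in eur]
print(deming_fit(eur, other, [0.2] * 4, [0.3] * 4))
```

Unlike ordinary least squares, which attributes all error to the y-axis and therefore shrinks the slope toward zero when x is noisy, this estimator splits the error between both axes, which is why it is better suited to assessing whether the slope is consistent with 1.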
In sensitivity analyses, we removed genes associated with leukemic outcomes and/or age, to assess potential effects from somatic variation on phenome-wide effect size correlations. We also performed analyses accounting for bins of effective sample size, to better account for differential discovery power across phenotypes (Supplementary Table 14); in these analyses, we performed Deming regression within quantiles of effective sample size (computed within European ancestry samples), and then combined the quantile-specific results using a random-effects inverse-variance weighted (IVW) meta-analysis. Finally, for LOF variants, we performed analyses using phenotype-specific effect size correlations. To this end, we used phenotypes with at least 3 qualifying genes and performed Deming regression for each phenotype separately. We then used the IVW approach to combine the phenotype-specific Deming coefficients.
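The pooling of quantile-specific (or phenotype-specific) Deming slopes can be sketched as a random-effects inverse-variance weighted meta-analysis. The example below assumes the DerSimonian-Laird estimator for the between-stratum variance, which is a common choice but is not stated in the text; the function name and input values are hypothetical.

```python
import math


def random_effects_ivw(estimates, std_errors):
    """Random-effects IVW meta-analysis (DerSimonian-Laird, assumed here).

    Returns (pooled_estimate, pooled_se, tau2).
    """
    w = [1.0 / (se * se) for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, estimates)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, estimates))
    k = len(estimates)
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight each stratum by its total (within + between) variance
    w_re = [1.0 / (se * se + tau2) for se in std_errors]
    sw_re = sum(w_re)
    pooled = sum(wi * b for wi, b in zip(w_re, estimates)) / sw_re
    return pooled, math.sqrt(1.0 / sw_re), tau2


# Toy Deming slopes from four effective-sample-size quantiles (made up)
slopes = [0.72, 0.85, 0.91, 0.80]
slope_ses = [0.15, 0.12, 0.10, 0.20]
pooled_slope, pooled_se, tau2 = random_effects_ivw(slopes, slope_ses)
```

When the estimated between-stratum variance is zero, the result reduces to the fixed-effect IVW estimate.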
Acknowledgements
We gratefully thank all participants of UK Biobank, All of Us, and Mass General Brigham Biobank, as this study would not have been possible without their contributions. We also thank the National Institutes of Health’s All of Us Research Program, the UK Biobank resource (under application number 17488), and the Massachusetts Biobank team, for making available the participant data examined in this study. P.T.E. was supported by funding from the National Institutes of Health (1RO1HL092577, 1R01HL157635), by a grant from the American Heart Association (18SFRN34110082, 961045) and from the European Union (MAESTRIA 965286). This work was also supported by an American Heart Association Strategically Focused Research Networks (SFRN) postdoctoral fellowship (18SFRN34110082) to L.-C.W. This work was supported by the John S. LaDue Memorial Fellowship for Cardiovascular Research, a Sarnoff Scholar award from the Sarnoff Cardiovascular Research Foundation, and by a National Institutes of Health grant (K08HL159346) to J.P.P. This work was further supported by a grant from the National Institutes of Health (1K08HL153937) and a grant from the American Heart Association (862032) to K.G.A. This work was supported by a Sigrid Jusélius Fellowship to J.T.R. This work was also supported by an Amsterdam UMC doctoral fellowship and the Junior Clinical Scientist Fellowship (03-007-2022-0035) from the Dutch Heart Foundation, to S.J.J. This work was supported by the BioData Ecosystem fellowship to S.H.C.
Footnotes
Author Contributions
S.J.J. and P.T.E. conceived and designed the study. S.J.J., X.W., S.H.C., L.-C.W., S.Koyama and J.P.P. performed data curation and data processing. S.J.J. and X.W. performed the main statistical and bioinformatic analyses, with S.H.C. providing important bioinformatic support. M.C., R.W., C.R., K.J.B., S.Kany, A.L.E., L.F.J.M.W. and J.T.R. contributed critically to the analysis plan. P.N., K.G.A., C.R.B., S.A.L., K.L.L., and P.T.E. supervised the study. T.N., P.S. and J.D. created the online web portal on the Human Disease Knowledge Portal. J.F. and N.P.B. supervised the creation of the online web portal. S.J.J., X.W. and P.T.E. wrote the manuscript. All authors critically revised and approved the manuscript.
Competing Interests
P.T.E. has received sponsored research support from IBM Health, Bayer AG, Bristol Myers Squibb, and Pfizer; he has consulted for Bayer AG. S.A.L. is a full-time employee of Novartis Institutes of BioMedical Research as of July 18, 2022. S.A.L. previously received sponsored research support from Bristol Myers Squibb, Pfizer, Boehringer Ingelheim, Fitbit, Medtronic, Premier, and IBM, and has consulted for Bristol Myers Squibb, Pfizer, Blackstone Life Sciences, and Invitae. P.N. has received sponsored research support from Amgen, Apple, Boston Scientific, Novartis, and AstraZeneca, and personal fees from Apple, AstraZeneca, Genentech / Roche, Novartis, Allelica, Foresite Labs, Blackstone Life Sciences, and HeartFlow; is a scientific advisory board member of Esperion Therapeutics, geneXwell, and TenSixteen Bio; is a scientific co-founder of TenSixteen Bio; and reports spousal employment at Vertex, all unrelated to the present work. B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson. The remaining authors declare no competing interests.
Data Availability
Results from our gene-based association analyses are available for browsing and download through our online portal (https://hugeamp.org:8000/research.html?pageid=600_traits_app_home). Bulk download of summary statistics is possible via the Cardiovascular Disease Knowledge Portal (https://cvd.hugeamp.org/downloads.html). Access to individual level UK Biobank data, both phenotypic and genetic, is available to bona fide researchers through application on the UK Biobank website (https://www.ukbiobank.ac.uk). The final release of the exome sequencing dataset of UK Biobank is available only through the DNAnexus Research Analysis Platform (https://www.ukbiobank.ac.uk/enable-your-research/research-analysis-platform). Additional information about registration for access to the data is available at http://www.ukbiobank.ac.uk/register-apply/. Use of UK Biobank data was performed under application number 17488. Access to individual phenotypic and genetic data from All of Us is currently available to bona fide researchers within the United States through the All of Us Researcher Workbench, a cloud-based computing platform (https://www.researchallofus.org/register/). A publicly available data browser is provided by the research program: https://databrowser.researchallofus.org/. Access to individual level data for participants from the Mass General Brigham Biobank is currently not publicly available.
Other datasets used in this manuscript include: the dbNSFP database v.4.2a and v.4.3a (https://sites.google.com/site/jpopgen/dbNSFP); gnomAD exomes v.2.1 (https://gnomad.broadinstitute.org/downloads); the Online Mendelian Inheritance in Man (OMIM) database (omim.org), accessed on August 25th 2022; Ensembl release 105 (https://www.ensembl.org/info/data/index.html); and the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/), accessed in December 2022.
Code Availability
Quality control of individual level data was performed using Hail version 0.2 (https://hail.is) as well as PLINK version 2.0.a (https://www.cog-genomics.org/plink/2.0/). Variant annotation was performed using VEP version 105 (https://github.com/Ensembl/ensembl-vep). Main rare variant association analyses were performed using an adaptation of the R package GENESIS version 2.18 (https://rdrr.io/bioc/GENESIS/man/GENESIS-package.html), which has previously been made available by us through the GitHub repository https://github.com/seanjosephjurgens/UKBB_200KWES_CVD/ version 1.2 (DOI: 10.5281/zenodo.11638262). Meta-analyses were performed using custom code available in the same repository, and using METAL (2017-12-21 release). Analyses in R were run within R version 4 (https://www.r-project.org).
References
- 1. Backman JD, et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021;599:628–634. doi: 10.1038/s41586-021-04103-z.
- 2. Taliun D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
- 3. Wang Q, et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature. 2021 doi: 10.1038/s41586-021-03855-y.
- 4. Karczewski KJ, et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genomics. 2022;2 doi: 10.1016/j.xgen.2022.100168.
- 5. Jurgens SJ, et al. Analysis of rare genetic variation underlying cardiometabolic diseases and traits among 200,000 individuals in the UK Biobank. Nat Genet. 2022;54:240–250. doi: 10.1038/s41588-021-01011-w.
- 6. Gudbjartsson DF, et al. Large-scale whole-genome sequencing of the Icelandic population. Nat Genet. 2015;47:435–44. doi: 10.1038/ng.3247.
- 7. Sun BB, et al. Genetic associations of protein-coding variants in human disease. Nature. 2022;603:95–102. doi: 10.1038/s41586-022-04394-w.
- 8. Heyne HO, et al. Mono- and biallelic variant effects on disease at biobank scale. Nature. 2023;613:519–525. doi: 10.1038/s41586-022-05420-7.
- 9. Need AC, Goldstein DB. Next generation disparities in human genomics: concerns and remedies. Trends Genet. 2009;25:489–94. doi: 10.1016/j.tig.2009.09.012.
- 10. Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a.
- 11. Hindorff LA, et al. Prioritizing diversity in human genomics research. Nat Rev Genet. 2018;19:175–185. doi: 10.1038/nrg.2017.89.
- 12. Ramirez HA, et al. The All of Us Research Program: Data quality, utility, and diversity. Patterns. 2022;3:100570. doi: 10.1016/j.patter.2022.100570.
- 13. Gurdasani D, et al. The African Genome Variation Project shapes medical genetics in Africa. Nature. 2015;517:327–32. doi: 10.1038/nature13997.
- 14. Gaziano JM, et al. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214–23. doi: 10.1016/j.jclinepi.2015.09.016.
- 15. Investigators, TA.o.U.R.P.G. Genomic data in the All of Us Research Program. Nature. 2024 doi: 10.1038/s41586-023-06957-x.
- 16. Koyama S, et al. Decoding Genetics, Ancestry, and Geospatial Context for Precision Health. medRxiv. 2023
- 17. Denny JC, et al. The “All of Us” Research Program. N Engl J Med. 2019;381:668–676. doi: 10.1056/NEJMsr1809937.
- 18. Ding Y, et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023;618:774–781. doi: 10.1038/s41586-023-06079-4.
- 19. Auton A, et al. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 20. Janssen F, Bardoutsos A, Vidra N. Obesity Prevalence in the Long-Term Future in 18 European Countries and in the USA. Obes Facts. 2020;13:514–527. doi: 10.1159/000511023.
- 21. Marshall A, et al. Comparison of hypertension healthcare outcomes among older people in the USA and England. J Epidemiol Community Health. 2016;70:264–70. doi: 10.1136/jech-2014-205336.
- 22. Joffres M, et al. Hypertension prevalence, awareness, treatment and control in national surveys from England, the USA and Canada, and correlation with stroke and ischaemic heart disease mortality: a cross-sectional study. BMJ Open. 2013;3:e003423. doi: 10.1136/bmjopen-2013-003423.
- 23. Matyori A, Brown CP, Ali A, Sherbeny F. Statins utilization trends and expenditures in the U.S. before and after the implementation of the 2013 ACC/AHA guidelines. Saudi Pharm J. 2023;31:795–800. doi: 10.1016/j.jsps.2023.04.002.
- 24. Gao Y, Shah LM, Ding J, Martin SS. US Trends in Cholesterol Screening, Lipid Levels, and Lipid-Lowering Medication Use in US Adults, 1999 to 2018. J Am Heart Assoc. 2023;12:e028205. doi: 10.1161/JAHA.122.028205.
- 25. Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 26. Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–91. doi: 10.1038/nature19057.
- 27. Kasar S, et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat Commun. 2015;6:8866. doi: 10.1038/ncomms9866.
- 28. Jurgens SJ, et al. Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nat Genet. 2023;55:544–548. doi: 10.1038/s41588-023-01342-w.
- 29. Jaiswal S. Clonal hematopoiesis and nonhematologic disorders. Blood. 2020;136:1606–1614. doi: 10.1182/blood.2019000989.
- 30. Asada S, Kitamura T. Clonal hematopoiesis and associated diseases: A review of recent findings. Cancer Sci. 2021;112:3962–3971. doi: 10.1111/cas.15094.
- 31. Mitchell E, et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature. 2022;606:343–350. doi: 10.1038/s41586-022-04786-y.
- 32. Ingles J, et al. Evaluating the Clinical Validity of Hypertrophic Cardiomyopathy Genes. Circ Genom Precis Med. 2019;12:e002460. doi: 10.1161/CIRCGEN.119.002460.
- 33. Walsh R, et al. Reassessment of Mendelian gene pathogenicity using 7,855 cardiomyopathy cases and 60,706 reference samples. Genet Med. 2017;19:192–203. doi: 10.1038/gim.2016.90.
- 34. National Academies of Sciences, E., and Medicine, Affairs, P.a.G., Committee on Women in Science, E.g., and Medicine & Research, C.o.I.t.R.o.W.a.U.M.i.C.T.a. Improving Representation in Clinical Trials and Research: Building Research Equity for Women and Underrepresented Groups. 2022
- 35. Ward E, et al. Cancer disparities by race/ethnicity and socioeconomic status. CA Cancer J Clin. 2004;54:78–93. doi: 10.3322/canjclin.54.2.78.
- 36. Suther S, Kiros GE. Barriers to the use of genetic testing: a study of racial and ethnic disparities. Genet Med. 2009;11:655–62. doi: 10.1097/GIM.0b013e3181ab22aa.
- 37. Wojcik GL, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4.
- 38. Vujkovic M, et al. Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis. Nat Genet. 2020;52:680–691. doi: 10.1038/s41588-020-0637-y.
- 39. Graham SE, et al. The power of genetic diversity in genome-wide association studies of lipids. Nature. 2021;600:675–679. doi: 10.1038/s41586-021-04064-3.
- 40. Wall JD, et al. South Asian medical cohorts reveal strong founder effects and high rates of homozygosity. Nat Commun. 2023;14:3377. doi: 10.1038/s41467-023-38766-1.
- 41. Van Hout CV, et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature. 2020;586:749–756. doi: 10.1038/s41586-020-2853-0.
- 42. Deaton AM, et al. Gene-level analysis of rare variants in 379,066 whole exome sequences identifies an association of GIGYF1 loss of function with type 2 diabetes. Sci Rep. 2021;11:21565. doi: 10.1038/s41598-021-99091-5.
- 43. Välimäki N, et al. Inherited mutations affecting the SRCAP complex are central in moderate-penetrance predisposition to uterine leiomyomas. Am J Hum Genet. 2023;110:460–474. doi: 10.1016/j.ajhg.2023.01.009.
- 44. Haas ME, et al. Machine learning enables new insights into genetic contributions to liver fat accumulation. Cell Genom. 2021;1 doi: 10.1016/j.xgen.2021.100066.
- 45. Khera AV, et al. Gene Sequencing Identifies Perturbation in Nitric Oxide Signaling as a Nonlipid Molecular Subtype of Coronary Artery Disease. Circ Genom Precis Med. 2022;15:e003598. doi: 10.1161/CIRCGEN.121.003598.
- 46. Ward J, et al. Genome-wide analysis in UK Biobank identifies four loci associated with mood instability and genetic correlation with major depressive disorder, anxiety disorder and schizophrenia. Transl Psychiatry. 2017;7:1264. doi: 10.1038/s41398-017-0012-7.
- 47. Luciano M, et al. Association analysis in over 329,000 individuals identifies 116 independent variants influencing neuroticism. Nat Genet. 2018;50:6–11. doi: 10.1038/s41588-017-0013-8.
- 48. Nagel M, et al. Meta-analysis of genome-wide association studies for neuroticism in 449,484 individuals identifies novel genetic loci and pathways. Nat Genet. 2018;50:920–927. doi: 10.1038/s41588-018-0151-7.
- 49. Mountjoy E, et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet. 2021;53:1527–1533. doi: 10.1038/s41588-021-00945-5.
- 50. Liu FR, et al. Pedigree-based study to identify GOLGB1 as a risk gene for bipolar disorder. Transl Psychiatry. 2022;12:390. doi: 10.1038/s41398-022-02163-x.
- 51. Palmer DS, et al. Exome sequencing in bipolar disorder identifies AKAP11 as a risk gene shared with schizophrenia. Nat Genet. 2022;54:541–547. doi: 10.1038/s41588-022-01034-x.
- 52. Cui J, et al. Disruption of Gpr45 causes reduced hypothalamic POMC expression and obesity. J Clin Invest. 2016;126:3192–206. doi: 10.1172/JCI85676.
- 53. Akbari P, et al. Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science. 2021;373 doi: 10.1126/science.abf8683.
- 54. Yamazaki O, Hirohama D, Ishizawa K, Shibata S. Role of the Ubiquitin Proteasome System in the Regulation of Blood Pressure: A Review. Int J Mol Sci. 2020;21 doi: 10.3390/ijms21155358.
- 55. Li XY, Zhai WJ, Teng CB. Notch Signaling in Pancreatic Development. Int J Mol Sci. 2015;17 doi: 10.3390/ijms17010048.
- 56. Horn S, et al. Mind bomb 1 is required for pancreatic β-cell formation. Proc Natl Acad Sci U S A. 2012;109:7356–61. doi: 10.1073/pnas.1203605109.
- 57. Potter GB, Facchinetti F, Beaudoin GM, Thompson CC. Neuronal expression of synaptotagmin-related gene 1 is regulated by thyroid hormone during cerebellar development. J Neurosci. 2001;21:4373–80. doi: 10.1523/JNEUROSCI.21-12-04373.2001.
- 58. Moghadam PK, Jackson MB. The functional significance of synaptotagmin diversity in neuroendocrine secretion. Front Endocrinol (Lausanne) 2013;4:124. doi: 10.3389/fendo.2013.00124.
- 59. Brown BC, Ye CJ, Price AL, Zaitlen N, Consortium AGENTD. Transethnic Genetic-Correlation Estimates from Summary Statistics. Am J Hum Genet. 2016;99:76–88. doi: 10.1016/j.ajhg.2016.05.001.
- 60. Galinsky KJ, et al. Estimating cross-population genetic correlations of causal effect sizes. Genet Epidemiol. 2019;43:180–188. doi: 10.1002/gepi.22173.
- 61. Yengo L, et al. A saturated map of common genetic variants associated with human height. Nature. 2022;610:704–712. doi: 10.1038/s41586-022-05275-y.
- 62. Hou K, et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat Genet. 2023;55:549–558. doi: 10.1038/s41588-023-01338-6.
- 63. Ziyatdinov A, et al. Genotyping, sequencing and analysis of 140,000 adults from the Mexico City Prospective Study. bioRxiv. 2022
- 64. Fatumo S, Inouye M. African genomes hold the key to accurate genetic risk prediction. Nat Hum Behav. 2023;7:295–296. doi: 10.1038/s41562-023-01549-1.
Methods-only references
- 65. Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.
- 66. Sudlow C, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
- 67. Szustakowski JD, et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat Genet. 2021;53:942–948. doi: 10.1038/s41588-021-00885-0.
- 68. Cronin RM, et al. Development of the Initial Surveys for the All of Us Research Program. Epidemiology. 2019;30:597–608. doi: 10.1097/EDE.0000000000001028.
- 69. Karlson EW, Boutin NT, Hoffnagle AG, Allen NL. Building the Partners HealthCare Biobank at Partners Personalized Medicine: Informed Consent, Return of Research Results, Recruitment Lessons and Operational Considerations. J Pers Med. 2016;6 doi: 10.3390/jpm6010002.
- 70. Boutin NT, et al. Implementation of Electronic Consent at a Biobank: An Opportunity for Precision Medicine Research. J Pers Med. 2016;6 doi: 10.3390/jpm6020017.
- 71. Wu P, et al. Mapping ICD-10 and ICD-10-CM Codes to Phecodes: Workflow Development and Initial Evaluation. JMIR Med Inform. 2019;7:e14325. doi: 10.2196/14325.
- 72. Liu X, Wu C, Li C, Boerwinkle E. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Hum Mutat. 2016;37:235–41. doi: 10.1002/humu.22932.
- 73. McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.
- 74. Gogarten SM, et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics. 2019;35:5346–5348. doi: 10.1093/bioinformatics/btz567.
- 75. Zhou W, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y.
- 76. Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Statistics in Medicine. 2006;25:4216–4226. doi: 10.1002/sim.2687.
- 77. Mbatchou J, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet. 2021;53:1097–1103. doi: 10.1038/s41588-021-00870-7.
- 78. Tang ZZ, Lin DY. MASS: meta-analysis of score statistics for sequencing studies. Bioinformatics. 2013;29:1803–5. doi: 10.1093/bioinformatics/btt280.
- 79. Zhao Z, et al. UK Biobank Whole-Exome Sequence Binary Phenome Analysis with Robust Region-Based Rare-Variant Test. Am J Hum Genet. 2020;106:3–12. doi: 10.1016/j.ajhg.2019.11.012.
- 80. Liu Y, et al. ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies. Am J Hum Genet. 2019;104:410–421. doi: 10.1016/j.ajhg.2019.01.002.
- 81. Muchinsky PM. The correction for attenuation. Educational & Psychological Measurement. 1996:63–75.
- 82. Deming WE. Statistical adjustment of data. Wiley; 1943.