
This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2023 Jul 3:2023.06.29.23291992. [Version 1] doi: 10.1101/2023.06.29.23291992

Exome-wide evidence of compound heterozygous effects across common phenotypes in the UK Biobank

Frederik H Lassen 1,2,*, Samvida S Venkatesh 1,2, Nikolas Baya 1,2, Wei Zhou 4,5,6, Alex Bloemendal 4,7,8, Benjamin M Neale 4,5,6, Benedikt M Kessler 3, Nicola Whiffin 1,2,4, Cecilia M Lindgren 1,2,9,, Duncan S Palmer 2,†,*
PMCID: PMC10350147  PMID: 37461573

Abstract

Exome-sequencing association studies have successfully linked rare protein-coding variation to risk of thousands of diseases. However, the relationship between rare deleterious compound heterozygous (CH) variation and its phenotypic impact has not been fully investigated. Here, we leverage advances in statistical phasing to accurately phase rare variants (MAF ~ 0.001%) in exome sequencing data from 176,587 UK Biobank (UKBB) participants, which we then systematically annotate to identify putatively deleterious CH coding variation. We show that 6.5% of individuals carry such damaging variants in the CH state, with 90% of variants occurring at MAF < 0.34%. Using a logistic mixed model framework, systematically accounting for relatedness, polygenic risk, nearby common variants, and rare variant burden, we investigate recessive effects in common complex diseases. We find six exome-wide significant (P < 1.68×10⁻⁷) and 17 nominally significant (P < 5.25×10⁻⁵) gene-trait associations. Among these, only four would have been identified without accounting for CH variation in the gene. We further incorporate age-at-diagnosis information from primary care electronic health records, to show that genetic phase influences lifetime risk of disease across 20 gene-trait combinations (FDR < 5%). Using a permutation approach, we find evidence for genetic phase contributing to disease susceptibility for a collection of gene-trait pairs, including FLG-asthma (P=0.00205) and USH2A-visual impairment (P=0.0084). Taken together, we demonstrate the utility of phasing large-scale genetic sequencing cohorts for robust identification of the phenome-wide consequences of compound heterozygosity.


Thousands of independent genetic variants have been robustly associated with common, complex human diseases, leading to important advancements in therapeutic development1. Naturally occurring variants that disrupt protein-coding sequences are of interest in the context of drug discovery as they modulate potential biological targets with measurable effects on human physiology2,3. Thus, individuals who carry loss-of-function (LoF) variants on both the maternal and paternal copies of a gene are, in principle, ‘experiments of nature’, and their identification can help to determine causality between gene function and phenotype4-6.

Coding variants in a gene can either be homozygous, when both gene copies harbor the same variant, or CH when both copies harbor different variants, usually at distinct genetic locations within the same gene locus. Alternatively, when two variants are located on a single gene copy, they are said to be ‘in cis’. Although both copies of a gene are disrupted in two-hit (CH or homozygous) carriers, analyses of the phenotypic impact of coding variation have typically ignored genetic phase information, that is, the separation or ‘phasing’ of an individual’s genome into maternally and paternally derived alleles7,8. Large-scale studies of bi-allelic damaging variation have generally been restricted to homozygotes in populations with excess homozygosity, such as Icelanders9, Finns10,11, and consanguineous populations12. In contrast, CH are expected to be more common in outbred populations, but are largely under-studied outside of rare disorders13-17.
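The three configurations defined above can be made concrete with a minimal sketch. It assumes each damaging variant in a gene is stored as a phased genotype pair (allele on haplotype 0, allele on haplotype 1); the function name, the labels, and the rule that homozygous status takes priority over CH status are illustrative assumptions, not the authors' pipeline.

```python
from typing import List, Tuple

# Hypothetical representation: one (hap0, hap1) pair per damaging variant,
# where 1 is the damaging alternate allele and 0 is the reference allele.
PhasedGenotype = Tuple[int, int]


def classify_gene(genotypes: List[PhasedGenotype]) -> str:
    """Label one individual's configuration of damaging variants in one gene."""
    # Homozygous: the same variant sits on both gene copies.
    # Assumption: homozygous status takes priority over the other labels.
    if any(g == (1, 1) for g in genotypes):
        return "homozygous"
    hits_hap0 = sum(1 for g in genotypes if g == (1, 0))
    hits_hap1 = sum(1 for g in genotypes if g == (0, 1))
    # Compound heterozygous: different variants on opposite copies (in trans).
    if hits_hap0 >= 1 and hits_hap1 >= 1:
        return "compound_het"
    # In cis: two or more variants on the same copy, the other copy intact.
    if hits_hap0 >= 2 or hits_hap1 >= 2:
        return "cis"
    if hits_hap0 + hits_hap1 == 1:
        return "heterozygous"
    return "wildtype"
```

Note that the CH and cis cases carry the same number of variants; only the phase distinguishes them, which is why unphased genotypes cannot separate the two.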

Various methods exist to infer the genetic phase of two variants. ‘Phasing by transmission’ employs family member genotyping and Mendelian inheritance principles18, while ‘read-backed phasing’ utilizes physical relationships among variants within sequencing reads19. In large-scale biobanks, extensively genotyping family members is impractical, and short-read sequencing technologies only allow read-backed phasing for variants in close proximity. Therefore, ‘statistical phasing’, which models the generative process of newly arising genetic data subject to recombination and mutation18,20-23, is typically used to phase haplotypes in genetic biobank data. Obtaining high-quality statistically phased genetic data requires large sample sizes (10⁵–10⁶ individuals) and tends to require large reference panels21. Furthermore, statistical phasing is more error prone for rare variants, which are precisely the collection of variants that we would like to investigate as they are a priori more likely to be deleterious variants of large effect under purifying selection. This difficulty in the accurate statistical phasing of rare variation has historically deterred the analysis of CH variants in biobanks. However, recent advancements in statistical phasing24, achieved by combining common variation across genotyping arrays and exome sequencing to create haplotype ‘scaffolds’22, enable accurate phasing of rare variants. By using this new accurate phase information, which extends down into rare allele frequencies, we can identify damaging CH variants to expand the pool of identifiable two-hit carriers and screen for phenotypic consequences.

We describe and apply a systematic analytical approach to test for autosomal bi-allelic effects, gene-by-gene, across 311 traits in the UKBB 200k exome sequencing (ES) release, combining both CH and homozygous variation. We iteratively refine the candidate associations by adjusting for polygenic background, nearby common variant risk, and rare variant burden within the analyzed gene. Our approach identifies both known and novel bi-allelic-trait associations, providing important insights into the phenotypic impact of gene knockout in humans.

Results

Accurate phase inference and validation using parent-offspring trios and short-read sequences

We identified 13,377,336 high-quality variants in 176,935 individuals exome sequenced in the UKBB (Methods). To identify variants co-occurring on the same haplotype (in cis) or on opposite haplotypes (in trans) gene-by-gene, we jointly phased ES and genotype array data in the UKBB using SHAPEIT5 (ref. 25) (Methods) following an investigation into the performance of current phasing software (Supplementary Table 4). Rare variants (MAF < 0.001) are assigned a posterior probability (PP) of true haplotype assignment, known as the phasing confidence score. Confidence in our ability to accurately statistically phase variants decreases with MAF (Supplementary Fig. 5). However, we a priori expect a disproportionate recessive damaging signal to reside in CH variants with at least one rare variant, and as a result, choosing a PP cutoff represents a trade-off in the signal-to-noise ratio. Following phasing, we restrict to 176,587 individuals of genetically-ascertained non-Finnish European (NFE) ancestry (Methods).

To assess statistical phasing quality, we benchmarked against phasing determined with parent-offspring trio data and read-backed phasing. We quantified phasing quality before and after filtering by PP ≥ 0.9 in 96 parent-offspring trios by calculating switch error rates (SER), estimated using Mendelian transmission, across 2,044,234 unique variants stratified by minor allele count (MAC) (Fig. 1a, Supplementary Fig. 6, Supplementary Tables 5-6). Across the 96 children, 93.1% of protein coding genes contained variants that were phased without switch errors (Supplementary Table 7). SERs among singletons (MAC=1) and variants with 2 ≤ MAC ≤ 5 were 12.1% (95% CI = 8.42 – 17.2) and 0.27% (0.13–0.53), respectively (Fig. 1a).
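As a rough illustration of the switch error rate used above, the sketch below compares an inferred phase against a reference phase (for example, one derived from trio transmission) at an individual's heterozygous sites. The encoding (0 or 1 for the haplotype carrying the alternate allele) and the function name are assumptions made for this example; the stratification by MAC described in the text is not reproduced.

```python
from typing import Sequence


def switch_error_rate(truth: Sequence[int], inferred: Sequence[int]) -> float:
    """Fraction of adjacent heterozygous-site pairs whose relative phase is wrong.

    Each entry is 0 or 1: the haplotype that carries the alternate allele
    at that heterozygous site, listed in genomic order.
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")
    if len(truth) < 2:
        raise ValueError("at least two heterozygous sites are required")
    # Haplotype labels are arbitrary, so only changes in agreement between
    # consecutive sites count as errors, not the agreement itself.
    agrees = [t == i for t, i in zip(truth, inferred)]
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    return switches / (len(truth) - 1)
```

A globally flipped but otherwise identical phasing therefore scores zero switch errors, which matches the intuition that haplotype 0 and haplotype 1 are interchangeable labels.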

Fig. 1: CH variants composed of at least one ultra-rare variant (MAC ≤ 10) can be robustly identified in large scale biobanks.


a) Trio SER depicted on y-axis as a function of MAC bin (x-axis) for phased variants with MAF ≤ 5%, stratified by phasing confidence score PP ≥ 0.5 or PP ≥ 0.9. b) Counts of samples harboring different classes of variation with at least two variants in UKBB. Each set of three bars depicts the number of individuals with at least one CH variant, homozygous variant, or multi-hit (cis) variant, respectively. Here, we define a CH pLoF+damaging missense variant as any combination of pLoF and/or damaging missense variation on opposite haplotypes. A qualifying carrier for each bar occurs according to the configuration displayed above the bars, and is grouped by variant consequence according to the color legend. c-d) Number of CH or homozygous carriers per gene. e) 1 - cumulative fraction (y-axis) of homozygous (dashed line) and CH carriers as a function of lowest MAF (x-axis) in bi-allelic variant pairs for which both variants phased at PP ≥ 0.9 (solid line), stratified by variant consequence according to the color key.

Although calculation of SER using trios is the gold-standard approach for phasing quality estimation21, it is limited by the number of parent-offspring trios available. For this reason, we also performed read-backed phasing of 62,762 unique pairs of variants using UKBB short read sequences on chromosomes 20-22 in 176,586 NFE individuals using WhatsHap26 (Methods). While read-backed phasing only permits ascertainment of genetic phase among pairs of variants spanning one or a few overlapping short read sequences (with typical lengths of 150-300 bp), read-backed phasing accuracy is independent of allele frequency, and therefore represents an orthogonal approach for evaluating the quality of statistically phased variation. Consistent with trio-SER, we observed increasing agreement between pairs of statistically and read-backed phased variants with increasing MAC (Supplementary Fig. 7, Supplementary Table 8). Filtering to phased variants with PP ≥ 0.9, singletons and variants with 2 ≤ MAC ≤ 5, agreement between read-backed phasing and statistical phasing was 85.1% (95% CI = 83.7 – 86.3%) and 99.1% (95% CI = 98.98 – 99.16%) respectively (Supplementary Table 8, Supplementary Fig. 8).
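The agreement between statistical and read-backed phasing reported above is a proportion with a confidence interval. The sketch below shows one way such a figure could be computed for a set of variant pairs. The text does not state which interval method was used, so the Wilson score interval here is an assumption, as are the function names and the 'cis'/'trans' string labels.

```python
import math
from typing import Sequence, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def phasing_agreement(statistical: Sequence[str],
                      read_backed: Sequence[str]) -> Tuple[float, float, float]:
    """Agreement rate and interval over variant pairs labeled 'cis' or 'trans'."""
    if len(statistical) != len(read_backed):
        raise ValueError("both inputs must describe the same variant pairs")
    matches = sum(1 for s, r in zip(statistical, read_backed) if s == r)
    low, high = wilson_interval(matches, len(statistical))
    return matches / len(statistical), low, high
```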

Taken together, our benchmarking suggests that statistical phasing of the UKBB dataset is of high quality for rare to ultra-rare variants, increasing our confidence in the identification of damaging CH variation. Given our observations of well-calibrated PP, and the distribution of phasing confidence binned by MAC, we selected the empirical cutoff of PP ≥ 0.9 to retain 8,616,236 variants (43% of which are singletons) for downstream characterization and testing (Supplementary Fig. 5).

Identification and examination of CH variation in the UKBB

To interrogate the functional role of mono- and bi-allelic variation in the population, we annotated 8,616,236 variants with PP ≥ 0.9 and MAF ≤ 5% across 17,998 autosomal protein-coding genes. We enriched our search for variants with putatively large effect sizes by restricting analyses to two categories of predicted damaging variation: first, we annotated 289,981 high-confidence protein truncating variants, including stop-gain, essential splice and frame-shift variants identified as high-confidence by Loss-Of-Function Transcript Effect Estimator (LOFTEE)27, which we refer to as ‘putative loss-of-function (pLoF) variants’. Second, we annotated 444,804 missense variants classified as damaging by both REVEL score ≥ 0.6 and Phred scaled Combined Annotation Dependent Depletion (CADD) score ≥ 20, or LOFTEE low confidence (LC) protein truncating variants; we refer to these variants collectively as ‘damaging missense/protein altering’ (Supplementary Fig. 9, Supplementary Table 9). For each individual, we then determined the set of genes predicted to be affected by pLoFs+damaging missense/protein-altering variants in a CH, homozygous or in cis state on the same haplotype.
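The two annotation categories described above amount to a small decision rule, sketched below. The thresholds (REVEL ≥ 0.6, CADD Phred ≥ 20) and the LOFTEE high/low confidence split come from the text; the consequence strings, argument names, and return labels are hypothetical and do not correspond to any particular annotation tool's output format.

```python
from typing import Optional

# Hypothetical consequence labels for protein truncating variants.
PTV_CONSEQUENCES = {"stop_gained", "splice_acceptor", "splice_donor", "frameshift"}


def classify_variant(consequence: str,
                     loftee: Optional[str],
                     revel: Optional[float],
                     cadd_phred: Optional[float]) -> str:
    """Assign a variant to 'pLoF', 'damaging_missense' or 'other'."""
    is_ptv = consequence in PTV_CONSEQUENCES
    # High-confidence protein truncating variants are treated as pLoF.
    if is_ptv and loftee == "HC":
        return "pLoF"
    # Low-confidence truncating variants join the damaging
    # missense/protein-altering category.
    if is_ptv and loftee == "LC":
        return "damaging_missense"
    # Missense variants must pass BOTH score thresholds.
    if (consequence == "missense"
            and revel is not None and cadd_phred is not None
            and revel >= 0.6 and cadd_phred >= 20):
        return "damaging_missense"
    return "other"
```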

As we a priori expected that essential genes would be less tolerant of bi-allelic damaging variants than non-essential genes, we investigated tolerance towards predicted bi-allelic pLoF and pLoF+damaging missense/protein-altering variants across the genome. As some genes carry bi-allelic variants more often than others (owing to a variety of factors such as gene length and baseline mutation frequency28), we fit counts of the number of individuals carrying bi-allelic variants per gene using a Poisson regression model accounting for variation in gene length and mutation rate (Methods, Supplementary Tables 10-11). Both pLoF and pLoF+damaging missense/protein-altering bi-allelic variants (homozygous and CH) were significantly depleted in five of the six analyzed essential gene-sets (P < 0.05/6 ≈ 0.0083) (Supplementary Fig. 12). Conversely, across three non-essential gene-sets, bi-allelic pLoFs+damaging missense/protein-altering variants were enriched among LoF tolerant genes27 (P < 0.05/3 ≈ 0.0167). We found that the degree and direction of effects were consistent across CH, homozygous bi-allelic, and heterozygous variants (Supplementary Fig. 12).

In founder9 and bottle-necked10 populations, some alleles are enriched to high frequency by chance, resulting in better powered association studies for the subset of variant alleles that are inherited from the parental population6 at higher frequency. To explore the diversity of bi-allelic variation in UKBB, a largely outbred population, we enumerated two-hit carriers across 176,587 individuals. We observed complete bi-allelic ‘knockout’ of 1,174 unique genes strictly owed to pLoF variants, identifying 1,431 (0.8%) CH and 8,582 (4.8%) homozygous individuals with bi-allelic pLoF variants in at least one gene (Fig. 1b). Across genes, 307 (26.1%) CH and 560 (47.7%) homozygous ‘knockouts’ were observed only in a single individual (Fig. 1c-d). We reasoned that inclusion of damaging missense/protein-altering variants in addition to pLoFs, would expand the number of identifiable damaging bi-allelic variants compared to assessing the two categories independently. Across 3,288 unique genes, we observed 11,491 (6.5%) CH and 17,863 (10.1%) homozygous carriers of pLoF+damaging missense/protein-altering variants. Of these, 1,112 (0.6%) CH and 436 (0.2%) homozygotes were carriers of bi-allelic pLoF+damaging missense/protein-altering variants in genes linked to traits with autosomal recessive mode of inheritance in Online Mendelian Inheritance in Man (OMIM)29. We generally observed a higher prevalence of carriers with variants in cis compared to CH, with over a third of individuals (64,555, 36.6%) carrying ≥2 pLoF+damaging missense/protein-altering variants on a single haplotype (Fig. 1b).

To better understand the evolutionary dynamics giving rise to pathogenic variants in trans, we examined the spectrum of allele frequencies of the constituent variants among our confidently called damaging CH variants. CH variants tend to comprise two variants, one residing on a common haplotype and the other on a rare haplotype, with a median difference in MAC of 1,181 (Supplementary Fig. 13-14). Approximately 90% of CH-constituent variants have MAF ≤ 0.0038, compared to homozygotes in which 90% are detected at MAF ≥ 0.0022 (Fig. 1d), suggesting that identifying deleterious bi-allelic CH variants requires phasing of rare alleles (Supplementary Fig. 15-16).

Multiple studies have assessed the prospects of ascertaining bi-allelic LoF variation at larger sample sizes in consanguineous, bottle-necked, and outbred populations6,12. To investigate empirically how the number of unique genes with bi-allelic variants scales in an outbred population, we performed down-sampling of UKBB participants. Consistent with previous literature, additional CH and homozygous variants can be inferred by considering both pLoF and damaging missense/protein-altering variation at even larger sample sizes (Supplementary Fig. 19).
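A down-sampling analysis of this kind can be sketched as follows: shuffle the participants once, then count the unique genes with at least one bi-allelic carrier among increasingly large prefixes of the shuffled list. The input layout (a mapping from sample ID to the set of genes in which that sample is a two-hit carrier), the nested sampling scheme, and the function name are assumptions for illustration; the authors' exact procedure is described in their Methods.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def downsample_gene_counts(biallelic_genes: Dict[str, Set[str]],
                           fractions: Sequence[float],
                           seed: int = 0) -> List[Tuple[int, int]]:
    """Return (n_samples, n_unique_genes) for each requested cohort fraction."""
    sample_ids = sorted(biallelic_genes)
    # Shuffle once so that smaller subsamples are nested in larger ones.
    random.Random(seed).shuffle(sample_ids)
    results = []
    for fraction in fractions:
        n = int(round(fraction * len(sample_ids)))
        genes: Set[str] = set()
        for sample_id in sample_ids[:n]:
            genes |= biallelic_genes[sample_id]
        results.append((n, len(genes)))
    return results
```

Because the subsamples are nested, the gene count can only grow with sample size, which makes the resulting curve easier to read than one built from independent draws.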

Systematic evaluation of bi-allelic effects on common disease

We performed a series of association analyses using Scalable and Accurate Implementation of GEneralized mixed model (SAIGE)30, a generalized mixed model that uses a saddle-point approximation to provide accurate P-values for traits with extreme case-control ratio imbalance. This allowed us to investigate the effects of bi-allelic variants in 176,587 individuals across 311 phenotypes with varying population prevalence identified from primary and secondary care electronic health records (EHRs) (Methods). We restricted to 952 protein-coding genes with at least 5 individuals carrying bi-allelic variants in the same gene, which allowed us to detect odds ratio (OR) ≥ 10, for traits at approximately 2% population prevalence, with 80% power at exome-wide significance (Bonferroni P < 0.05/(952×311) ≈ 1.68×10⁻⁷) (Methods, Supplementary Fig. 20). Using simulations, we confirmed our ability to detect recessive signals of association with well-calibrated false positive rates across a range of effect sizes (Methods, Supplementary Fig. 21a-c). We tested a total of 299,854 gene-trait combinations, and identified 30 gene-trait associations at nominal significance (P < 0.05/952 ≈ 5.25×10⁻⁵), of which seven remained significant following stringent Bonferroni correction (P < 1.68×10⁻⁷) (Supplementary Table 14, Supplementary Fig. 10).
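The two significance thresholds in this paragraph follow from dividing 0.05 by the number of tests: 952 genes times 311 phenotypes for exome-wide significance, and 952 genes for nominal significance. A minimal check of that arithmetic, with a helper name chosen for this example:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P-value threshold that controls the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Exome-wide: every gene-trait combination counts as one test.
exome_wide = bonferroni_threshold(0.05, 952 * 311)
# Nominal: corrected for the number of genes only.
nominal = bonferroni_threshold(0.05, 952)

print(f"exome-wide threshold: {exome_wide:.3e}")
print(f"nominal threshold:    {nominal:.3e}")
```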

A recessive gene-trait association may be driven by a variety of genetic factors unrelated to CH or homozygous status, such as polygenic background or through genetic tagging of a nearby common variant association. To mitigate these factors, we created a pipeline to condition on external genetic effects within the gene-trait regression model (Methods). First, we trained polygenic risk scores (PRS) for 111 significantly heritable traits (h²_SNP P < 0.05 and n_eff ≥ 5,000) using LDPred2 (ref. 31) (Methods, Supplementary Table 12), a tool that allows PRS derivation based on summary statistics and linkage information. To control for polygenic risk and potentially boost power for association32, we included the off-chromosome PRS as an additional covariate (Supplementary Table 14). While the resulting P-values were altered by less than a single order of magnitude with the incorporation of PRS (Supplementary Fig. 11), controlling for PRS resulted in the abrogation of four nominally significant (P < 5.25×10⁻⁵) gene-trait associations. To capture the effects of any causal common variants in linkage disequilibrium (LD) with the pLoF or damaging missense/protein-altering variants constituting the CH or homozygous variant, we further conditioned on nearby (within 1 megabase (Mb) of the associated gene) common (MAF > 1%) variant association signals (Methods, Supplementary Table 13), which abrogated (P>0.05) the signal of two gene-trait pairs.

Lastly, we investigated whether any of the identified putative recessive associations could be accounted for by assuming an additive genetic architecture. To do this, we counted the number of gene copies affected by pLoF+damaging missense/protein-altering variants in each individual. For each putative recessive association, we re-ran the analysis while simultaneously conditioning on the number of affected haplotypes. We also employed a complementary variant-level approach and repeated the analysis, conditioned on all low-frequency (MAC > 10, MAF < 5%) and ultra-rare (MAC ≤ 10) damaging variants (pLoF+damaging missense/protein-altering), including those that constitute the bi-allelic variant in question. Conditioning on the additive effects abrogated the signal of a single nominally significant gene-trait pair (P < 5.25×10⁻⁵) (Supplementary Table 14).

Together, these analyses refined the list of putative gene-trait associations to 23 nominally significant associations, of which six are significant after correcting for multiple testing (conservative Bonferroni P < 1.68×10⁻⁷) (Fig. 2a-2b, Supplementary Table 14), comprising 17 unique genes and 22 traits. Notably, only six of the 23 associations remained nominally significant (P < 5.25×10⁻⁵) when restricted to only CH variant-carriers, and just four of 23 remained nominally significant when testing homozygous variants alone, underscoring the power of jointly analyzing these variant sets.

Fig. 2: Conditional recessive and additive modeling of gene copy disruption in 311 phenotypes across 176,587 participants.


a) Recessive Manhattan plot depicting −log10-transformed gene-trait association P-values against chromosomal location. Associations are colored red or orange based on whether they are Bonferroni (P < 1.68×10⁻⁷) or nominally (P < 5.25×10⁻⁵) significant. Transparent coloring represents the resulting P-value when conditioning only on PRS, whereas solid coloring with black outline represents the P-value derived after conditioning on off-chromosome PRS, nearby (500 kb) common variant association signal, and rare variants within the gene when applicable (Methods). The Bonferroni and nominal significance thresholds are also displayed as orange and red dashed lines respectively. A gene may appear multiple times if it is associated with more than one phenotype. A qualifying example of the recessive inheritance pattern is shown in the top right: disruption of both gene copies results in an effect on the phenotype (y). b) QQ-plot for genes with bi-allelic damaging variants after conditioning on off-chromosome PRS. The shaded area depicts the 95% CI under the null. Gene-trait associations passing Bonferroni significance are labeled accordingly. c-d) Additive Manhattan plot and corresponding QQ-plot for genes with mono- and bi-allelic damaging variants associated with at least one phenotype after conditioning on off-chromosome PRS when applicable (Methods). No additional conditioning was performed in this analysis. Gene-trait associations are colored red and orange based on whether they are respectively Bonferroni (P < 9.8×10⁻⁹) or nominally (P < 3.05×10⁻⁶) significant. The additive inheritance model is depicted in the top right: each affected haplotype results in an incremental effect on the phenotype (y).

We observed recessive gene-trait relationships across multiple organ systems (nervous, respiratory, circulatory, and genitourinary among others). All six associations that met the significance threshold after Bonferroni correction (P < 1.68×10⁻⁷) have previously been reported in the literature. For example, individuals with bi-allelic variants in MUTYH, a gene involved in oxidative DNA-damage repair33, are at significantly increased risk of developing colorectal cancer (log10(OR) = 4.7 (95% CI = 3.38–6.01), P = 2.2×10⁻¹²). We also find that bi-allelic variants in FLG increase risk of both asthma34 (log10(OR) = 0.33 (0.26–0.39), P = 2.09×10⁻²²) and dermatitis35 (log10(OR) = 0.28 (0.22–0.33), P = 2.65×10⁻²⁰). In addition, we observe that bi-allelic variants in GJB2 increase the risk of hearing loss29 (log10(OR) = 1.66 (1.05–2.26), P = 9.93×10⁻⁸). At nominal significance (P < 5.25×10⁻⁵), 10 of 23 associations have previously been reported. For example, bi-allelic variants in USH2A, linked to retina homeostasis36, increase risk of visual impairment (log10(OR) = 5.77 (2.93–8.62), P = 3.5×10⁻⁵). For the remaining unreported hits, we observe gene-trait associations with plausible mechanistic insights. For example, we observe that putatively damaging bi-allelic variation in FAAH, a fatty acid amide hydrolase37, is associated with increased risk of dementia (log10(OR) = 22.92 (12.35–33.48), P = 1.06×10⁻⁵), consequently offering evidence supporting the hypothesis that lipid metabolism dysfunction is central to dementia pathogenesis38.

Boosting power in gene-level regression models through rare variant haplotype collapsing

Complementary to the recessive models described above, rare variant burden testing, which involves the aggregation of rare variants within a gene, has proven to be a robust method to collectively assess the phenotypic impact of rare variation across individuals. Rare variants are aggregated because their low allele frequency leaves single-marker association tests underpowered. However, these frameworks generally ignore the genetic phase within each individual, and therefore do not differentiate between scenarios in which multiple damaging variants reside on the same (in cis) or opposite (in trans) haplotypes, despite these two forms having potentially distinct functional and phenotypic effects. We conducted additive genome-wide association analyses by testing for associations between the number of disrupted gene copies (across 16,363 protein-coding genes with at least 10 haplotypes carrying pLoF+damaging missense/protein-altering variation) in an individual and case status (across 311 phenotypes) (Methods, Fig. 2c-2d). After adjusting for polygenic contribution, we found 38 nominally significant gene-trait associations (nominal P < 0.05/16,363 ≈ 3.05×10⁻⁶), among which 12 were significant associations after multiple-testing correction (Bonferroni P < 0.05/(16,363×311) ≈ 9.8×10⁻⁹, Supplementary Table 15). Among the significant hits are previously reported associations, including association between the number of putatively damaged copies of BRCA2 (P = 6.16×10⁻¹⁵) and CHEK2 (P = 3.34×10⁻¹⁵), and breast cancer.
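The difference between a phase-agnostic burden count and the phase-aware count of disrupted gene copies can be shown with a small sketch. It assumes each damaging variant is stored as a phased (haplotype 0, haplotype 1) allele pair; the function names are illustrative, and the regression step itself is not shown.

```python
from typing import List, Tuple

PhasedGenotype = Tuple[int, int]  # (hap0 allele, hap1 allele); 1 = damaging


def burden_count(genotypes: List[PhasedGenotype]) -> int:
    """Phase-agnostic burden: total number of damaging alleles in the gene."""
    return sum(g[0] + g[1] for g in genotypes)


def disrupted_copies(genotypes: List[PhasedGenotype]) -> int:
    """Phase-aware count (0, 1 or 2) of gene copies with a damaging variant."""
    hap0_hit = any(g[0] == 1 for g in genotypes)
    hap1_hit = any(g[1] == 1 for g in genotypes)
    return int(hap0_hit) + int(hap1_hit)


# Two damaging variants on the same copy versus on opposite copies.
in_cis = [(1, 0), (1, 0)]
in_trans = [(1, 0), (0, 1)]
print(burden_count(in_cis), disrupted_copies(in_cis))
print(burden_count(in_trans), disrupted_copies(in_trans))
```

Both example carriers have the same burden count, but only the in-trans carrier has both gene copies disrupted, which is the distinction the phased analysis relies on.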

Permutation testing to establish the impact of genetic phase on disease risk

It is commonly accepted that compound heterozygosity drives recessive disease risk by disruption of both copies of an implicated gene13-15. However, this notion has not been well studied in a large-scale population cohort. To assess the degree to which compound heterozygosity, rather than co-occurring variants on the same haplotype, drives disease risk, we permuted the genetic phase of observed pLoF+damaging missense/protein-altering variants within a gene to generate an empirical distribution of t-statistics corresponding to disease-association strength in the absence of phase information (Fig. 3a-b). To ensure a sufficiently large sampling distribution, we restricted our analysis to 5 nominally significant (P < 5.25×10⁻⁵) gene-trait combinations with at least ten individuals that are either CH variant-carriers or with two or more pLoF or damaging missense/protein-altering variants on the same haplotype (Methods).
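The logic of the phase permutation can be sketched in a few lines. This is a simplified stand-in rather than the analysis described above: instead of re-running a regression and collecting t-statistics, it uses the difference in case rate between in-trans and in-cis carriers as the test statistic. Function names, the add-one P-value estimator, and the default number of permutations are assumptions for illustration.

```python
import random
from typing import List, Sequence


def case_rate_difference(is_trans: Sequence[bool], is_case: Sequence[int]) -> float:
    """Case rate among in-trans carriers minus case rate among in-cis carriers."""
    trans = [c for t, c in zip(is_trans, is_case) if t]
    cis = [c for t, c in zip(is_trans, is_case) if not t]
    if not trans or not cis:
        raise ValueError("both trans and cis carriers are required")
    return sum(trans) / len(trans) - sum(cis) / len(cis)


def permutation_p_value(is_trans: Sequence[bool], is_case: Sequence[int],
                        n_permutations: int = 10000, seed: int = 0) -> float:
    """One-tailed empirical P-value from shuffling cis/trans labels."""
    rng = random.Random(seed)
    observed = case_rate_difference(is_trans, is_case)
    labels: List[bool] = list(is_trans)
    as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling keeps the number of trans and cis carriers fixed while
        # breaking any link between phase and case status.
        rng.shuffle(labels)
        if case_rate_difference(labels, is_case) >= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (as_extreme + 1) / (n_permutations + 1)
```

With this estimator the smallest attainable P-value is 1/(n_permutations + 1), so the number of permutations bounds the resolution of the test.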

Fig. 3: In-silico permutation of genetic phase provides evidence for CH-specific effects.


a) Overview of the permutation pipeline. To be sufficiently powered to detect effects, we considered five significant (P<0.01) gene-trait pairs from the genome-wide analysis that have at least ten individuals harboring pLoF or damaging missense/protein-altering variants on the same haplotypes or CH carriers. Then, we shuffled CH trans and cis labels across samples and re-ran the association analysis, resulting in a null distribution of permuted t-statistics corresponding to the association strength in the absence of phase information. We derive the one-tailed empirical P-value by comparing the observed t-statistics with the empirical null distribution. b) The resulting distributions of permuted (white and black box plots) and observed t-statistic (red dot) for each gene-trait and the resulting empirical P-value. P-values shown in bold indicate Bonferroni significance (P < 0.05/5 = 0.01). Box and whisker plots display the quartiles of the empirical null distribution.

We found evidence for significant (Bonferroni P < 0.05/5 = 0.01) compound-heterozygous specific effects in three of the five analyzed gene-trait combinations: CH variants in FLG are associated with increased risk of both asthma (P=0.00205) and dermatitis (P=0.0084), while CH variants in USH2A are associated with increased risk of visual impairment and blindness (P=0.00286) (Fig. 3c). We identified an additional gene-trait association at nominal significance (P<0.05), namely CH variants in SEPTIN10 associated with hyperplasia of prostate (P=0.011). Of these, FLG-asthma, FLG-dermatitis, and USH2A-visual impairment associations have previously been linked to disease in the CH state39-41. These observations demonstrate, on a large scale, the effect of compound heterozygosity in driving disease susceptibility, and by extension, how appropriately integrating genetic phase can lead to increased power to discover gene-trait associations.

Non-additive effects of compound heterozygous variants elevate lifetime risk of disease

Bi-allelic effects may be associated with earlier age at onset of disease, which is also often correlated with disease severity. We therefore explored whether CH and homozygous variants had longitudinal effects by evaluating age-at-diagnosis of 278 phenotypes with Cox proportional-hazards models. To identify effects owed to disruption of both gene copies, as opposed to haploinsufficiency, we compared bi-allelic variant carriers against a reference group comprising carriers of a single heterozygous variant for each gene. We tested 267,400 gene-trait combinations with at least five bi-allelic variants (homozygotes or CH) and 100 heterozygotes (Fig. 4a). After adjustment for polygenic risk via off-chromosome PRS, we identified seven gene-trait associations with significantly earlier age-at-diagnosis in bi-allelic variants compared to heterozygous carriers of pLoF+damaging missense/protein-altering variants (P < 0.05/(952×278) ≈ 1.89×10⁻⁷, Fig. 4b-c, Supplementary Tables 16-17). Beyond these seven associations, we also identified 13 additional gene-trait relationships at a false discovery rate (FDR) < 5% (Fig. 4b, Supplementary Fig. 22). For six out of the seven Bonferroni significant gene-trait combinations, we found no evidence (P > 0.05/6 ≈ 0.00833) that carrying a single heterozygous variant altered lifetime disease risk compared to carrying two copies of the reference allele.
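The text reports associations at FDR < 5% without naming the procedure in this section. As one common choice, the Benjamini-Hochberg step-up procedure is sketched below; treating it as the method used here is an assumption, and the function name is illustrative.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.05) -> List[bool]:
    """Return, for each P-value, whether it is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    # Step-up rule: reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            rejected[idx] = True
    return rejected
```

Because the rule is step-up, a P-value that misses its own rank threshold can still be rejected when a larger P-value further down the ranking meets its threshold.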

Fig. 4: Age-at-diagnosis modeling reveals novel recessive effects driven by damaging bi-allelic variants.


a) Flow diagram of our approach. To investigate whether homozygous and/or CH effects are associated with a difference in lifetime risk of disease development, we perform Cox proportional-hazards modeling for gene-trait combinations in which ≥ 5 samples are two-hit carriers (CH or homozygotes) and ≥ 100 samples are heterozygotes. Among Bonferroni significant associations (P < 1.89×10⁻⁷), we filter to gene-trait pairs for which at least five samples carry multiple variants disrupting the same haplotype, and test for an association between CH or homozygous carrier status and lifetime disease risk (corresponding to HRs>1). b) HRs when comparing CH and homozygous status versus heterozygous carrier status. Throughout, we display hazard ratios and corresponding P-values with (circles) and without (triangles) taking the polygenic contribution into account by conditioning on off-chromosome PRSs for heritable traits that pass our quality control cutoffs. P-values following inclusion of polygenic contribution to disease status are provided where PRS are predictive. HRs for gene-traits with two or more individuals with multiple cis variants on the same haplotype are displayed in pink. Associations that pass Bonferroni significance (P < 1.89×10⁻⁷) and the FDR < 5% cutoff are illustrated in the top and bottom respectively. c) HRs when comparing bi-allelic status versus heterozygous carrier status for gene-trait pairs with ≥ 3 individuals harboring variants disrupting the same haplotype, allowing ascertainment of confidence intervals. d) HRs when comparing wildtype, heterozygous, CH and homozygous status against individuals that harbor two damaging variants on the same haplotype. 95% CIs are shown in the figure. Abbreviations: CC (colorectal cancer), COPD (chronic obstructive pulmonary disease).

We further sought to disentangle the effects of homozygous and CH variants on lifetime disease risk from that attributable to multiple damaging rare variant effects on a single haplotype. To do this, we analyzed such effects in the three gene-trait pairs with both (1) at least five CH and/or homozygous variants and (2) at least five individuals harboring ≥2 variants on the same haplotype (Fig. 4d, Supplementary Table 18). Compared to individuals with a single disrupted haplotype, both homozygous and CH carriers of pLoF+damaging missense/protein-altering variants in ATP2C2 were at increased lifetime risk of developing chronic obstructive pulmonary disease (COPD) (homozygote HR = 6.65 (95% CI = 4.5–8.8); P=0.084, CH HR = 8.98 (7.79–10.17); P=0.00028). Similarly, both homozygous and CH variants of FLG were at increased lifetime risk of asthma (homozygote HR = 1.97 (1.1–2.84); P=0.126, CH HR = 2.44 (1.61–3.26); P=0.033) and dermatitis (homozygote HR = 1.7 (0.88–2.5); P=0.20, CH HR = 1.16 (0.38–1.94); P=0.7) (Fig. 4c). For these gene-trait relationships, information encoded in genetic phase influences the risk of disease, with mono-allelic disruption leading to virtually unaltered risk while bi-allelic disruption may result in a dramatic increase in lifetime risk of disease.

Biological insights into common complex disorders implicated by CH variation

Six of the seven gene-trait combinations for which we identify Bonferroni significant associations with lifetime disease risk are also significant in our cross-sectional recessive association analysis (Supplementary Table 19). All six have previously been reported in the literature, albeit without age-at-onset effects. These include MUTYH and colorectal cancer, GJB2 and hearing loss, and a pleiotropic association of FLG with both dermatitis and asthma. ATP2C2-COPD is a novel candidate association with plausible mechanistic effects.

MUTYH-associated polyposis is considered a highly penetrant Mendelian cancer syndrome leading to adenomatous polyposis42. We link bi-allelic variants of MUTYH to elevated risk of benign neoplasms of the colon, with bi-allelic carriers having a median age at diagnosis of 53.7 years (interquartile range (IQR) = 47.9 - 56.3 years), as compared to heterozygotes (median age at diagnosis = 61.7 (56.2 - 66.7) years) and wildtypes (61.1 (54.6 - 66.5) years); as well as malignant neoplasms of the colon (median age at diagnosis for bi-allelic carriers = 52.1 (IQR = 48.6 - 53.4) years, heterozygotes = 63.2 (57.7 - 67.0) years, and wildtypes = 62.9 (57.2 - 67.9) years) (Fig. 5a-b). Because benign growths can be precursors to malignant neoplasms, and since risk of both disorders was elevated in MUTYH bi-allelic carriers (benign HR = 18 (95% CI = 17.72 - 18.9), P = 4.7×10⁻²²; malignant HR = 31.4 (95% CI = 30.57 - 32.3), P = 5.2×10⁻¹⁵), we examined the co-occurrence of variants across colorectal cancer outcomes. The same set of CH and homozygous variants is involved in the pathophysiology of benign and malignant neoplasms of the colon, suggesting that MUTYH-variant composition alone is insufficient to explain the dichotomy between malignant and benign polyposis (Supplementary Fig. 17).

Fig. 5: Trajectories of haplotype disruption in common disease.

a-b) Kaplan-Meier survival curves for CH (red), homozygous (orange), and heterozygous (blue) carriers, and for carriers with a single disrupted haplotype (pink), owing to pLoF or damaging missense/protein-altering mutations. Wildtypes and bi-allelic variants (CH or homozygous) are shown with green and black lines respectively. Both CH and homozygous MUTYH-variant carriers are at elevated lifetime risk of developing benign neoplasm of the colon compared to heterozygous carriers and wildtypes. c-d) Kaplan-Meier survival curves for ATP2C2 mono- and bi-allelic variant carriers. Carriers of CH variants develop COPD earlier than heterozygous carriers and wildtypes. Moreover, individuals who harbor a single putatively disrupted haplotype owing to ≥2 damaging variants develop COPD at the same frequency as heterozygotes and wildtypes. e) Gene plots for ATP2C2, displaying protein coding variants for samples that carry ≥ 2 pLoF or damaging missense/protein-altering variants stratified by exon or intron. CH variants, multiple variants in cis, and homozygous variants are highlighted by lines joining the positions of co-occurring variants in a sample. Lines are colored by number of cases for the shown variant configurations: gray lines indicate that no observed samples are cases, orange lines that some samples are cases, and red lines that all observed samples are cases. Variants are labeled by position (GRCh38) and according to inferred consequence (missense, stop gain, splice acceptor/donor). Protein domains are highlighted accordingly44.

ATP2C2, a calcium-transporting ATPase linked to surfactant protein D levels43 (a causal risk factor for COPD), is associated with COPD in our gene-trait analyses (HR = 8.3 (95% CI = 7.54 - 9.05); P = 3.56×10⁻⁸). As we did not observe any common variants near ATP2C2 (within 1 Mb upstream or downstream) associated with cross-sectional COPD (all P > 5×10⁻⁶), the association between bi-allelic variants of ATP2C2 and COPD is potentially driven by the unique configurations of damaging-missense (n=7) and pLoF (n=1) variants that primarily reside in functional protein domains (Fig. 5e, Supplementary Fig. 18, Supplementary Table 20). Seven of eight (87.5%) identified bi-allelic carriers of ATP2C2 (6 CH and 2 homozygous) were diagnosed with COPD (median age of diagnosis = 54.1 (IQR = 46.2 - 67.5) years) (Fig. 5c-d). In contrast, only 6.9% of individuals harboring multiple pLoF+damaging missense/protein-altering variants on the same ATP2C2 haplotype were diagnosed with COPD, and at a similar median age (60.8 (53.7 - 67.9) years) to heterozygotes (58.0 (48.5 - 64.1) years) and wildtypes (59.2 (51.3 - 65.1) years).

FLG plays a pivotal role in the differentiation and maintenance of skin barriers34. FLG variants have been selectively associated with individuals with both asthma and atopic dermatitis, but not with those who have asthma without atopic dermatitis35. The exact nature of this relationship remains unclear. Our findings indicate that individuals carrying a single deleterious FLG allele face increased risk of dermatitis (P = 7.2×10⁻⁵), but not asthma (P = 0.018), when compared to wildtypes. In contrast, individuals carrying two variant alleles have an increased risk of developing both dermatitis (P = 5.27×10⁻¹⁵) and asthma (P = 4.47×10⁻²⁷), suggesting a recessive mode of inheritance for FLG-related asthma and a semi-dominant inheritance pattern for FLG-related dermatitis29. This implies that the loss of a single FLG copy can result in dermatitis, while the loss of both copies can lead to asthma. Together, this may help clarify why FLG-related asthma is seldom observed without the presence of FLG-related dermatitis.

Discussion

In this large biobank-scale effort, we systematically interrogate the role of bi-allelic coding variants in genes conferring risk for common complex diseases. In the cross-sectional and longitudinal analyses we identify 20 nominally significant (P < 5.25×10⁻⁵) and 23 significant (FDR < 5%) gene-trait associations, respectively. Together, we find 36 unique gene-trait associations that both replicate established relationships and identify previously unreported gene-trait associations for a range of binary phenotypes across the common disease spectrum.

We show that 90% of deleterious CH variants occur at MAF < 0.34%. Given that phasing quality is directly correlated with allele frequency, it is essential to filter to the set of variants phased at high confidence to eliminate false positive identifications. Here, we quantified the increase in phasing quality using Mendelian inheritance logic in parent-offspring relationships and compared pairs of statistically phased variants to read-backed phased variants using short read sequences. While read-backed phasing is computationally expensive and restricted to variants in close proximity, we demonstrate that it can be employed to evaluate statistical phasing quality in cohorts that lack trio relationships, with error rates comparable to trio switch error rates.

CH disease associations have mainly been explored in rare disorders13-17, but are seldom investigated in the study of common disease. This is due to the low prevalence of variants in the CH state and the genetic architecture of common complex traits, which are typically influenced by environmental factors and numerous loci with low to modest contribution to risk. In this study, we address these challenges and offer multiple lines of evidence to demonstrate the role of CH effects in driving disease risk for common traits. We employed two complementary analyses to detect gene-trait associations: a genome-wide logistic association analysis and a time-to-event model. Through these methods, we identified associations in which variants in the homozygous or CH state resulted in increased disease risk compared to wildtypes and individuals carrying multiple pathogenic variants on the same haplotype. Our findings show that for certain gene-trait pairs, individuals with a single disrupted gene copy have a risk of developing disease that is virtually indistinguishable from that of wildtypes, suggesting non-additive gene dosage effects. Further, by permuting the genetic phase, we find evidence that incorporation of confidently phased CH variants can boost power to detect recessive associations in common disease. Collectively, our results emphasize the importance of considering each individual’s specific genetic context when assessing their genetic risk in a clinical setting. Simply identifying the presence of multiple pathogenic variants in a gene, disregarding the phase, may not be sufficient to fully understand an individual’s risk profile.

Many common complex traits have polygenic architectures, which should be accounted for when performing gene-trait association testing. The presence of bi-allelic variants in individuals with such diseases might be coincidental and not causally related to the trait, which may instead be a result of a high polygenic risk. However, across the significant recessive genome-wide associations, we observed that inclusion of PRS as a covariate affected the resulting association P-value by less than a single order of magnitude for the binary traits we analyzed. Because of low case numbers, we were only able to account for the polygenic contribution to disease development for 111 diseases with significant common variant heritability in the UKBB. Nonetheless, these observations suggest that incorporation of polygenic background has limited influence on the degree of association when evaluating ultra-rare variation across binary traits.

We found that the majority of bi-allelic gene-disease associations are driven by variant combinations containing at least one missense variant, which would have been excluded under a stricter high-confidence pLoF criterion. Although our less stringent inclusion threshold enabled us to identify a greater number of bi-allelic variants, it is likely that some damaging missense or protein-altering variants would incorrectly be predicted as damaging, or may exhibit gain-of-function rather than loss-of-function effects, consequently reducing the signal-to-noise ratio in our analyses. Even ‘knockouts’ by bona fide pLoF variants may only result in partial gene inactivation, and not necessarily complete gene knockdown. Additionally, pLoF variants may be ‘rescued’ and not lead to complete or even partial loss-of-function. While we show that including damaging missense/protein-altering variants to define bi-allelic variants can improve power for certain phenotypic associations, further manual curation and experimental validation will be required to demonstrate that these variants truly result in loss-of-function.

The likelihood of damaging alleles occurring on the same haplotype is influenced by a complex interplay of factors, including population structure and the balance between selection, drift, mutation, and recombination. We and others45 find that damaging CH variants occur less frequently than multiple damaging variants affecting the same haplotype, suggesting that in certain circumstances, natural selection operates on a haplotype level. Once a LoF variant occurs and expands in the population, there is no selection against the affected haplotype acquiring additional damaging mutations. This has implications for association studies that investigate CH effects by counting the number of damaging variants in a gene while assuming that each variant is equally likely to affect either haplotype8, as such frameworks may overestimate the frequency of CH events.

‘Human knockouts’ have been extensively discussed in the context of therapeutic development. Examining both bi-allelic and mono-allelic carriers can help assess the safety of therapeutic interventions by analyzing how varying degrees of target modulation affect biological response3,6. We showcase several gene-trait relationships where the number of affected haplotypes influences the lifetime risk of disease, potentially representing the manifestation of ‘adverse events’, which are important endpoints in clinical trials. The absence of adverse events in mono-allelic carriers can potentially imply that partial pharmacological inhibition of a target may be a safe and effective approach. However, adverse effects observed in bi-allelic carriers of damaging variation within the same locus could indicate potential risks associated with complete target inhibition. A natural extension of this work could involve investigating mono- and bi-allelic effects on quantitative outcomes, such as serum proteins. Changes in biomarkers (or other continuous outcomes) may reflect direct or indirect consequences of gene modulation and could serve as potential pharmacodynamic biomarkers commonly used to assess target engagement in clinical trials.

This work showcases the value of statistical phasing of damaging rare variants, and that association analyses that account for compound heterozygosity can be better-powered for gene-trait discovery. We show that this approach can be employed to discover well-established and novel non-additive and additive gene-trait relationships across a wide range of disease etiologies. From a clinical perspective, we demonstrate the importance of interrogating the genetic phase when dealing with CH variants in traits with recessive mode of inheritance. This is an important step towards uncovering the phenome-wide consequences of bi-allelic disruption across the human genome.

Methods

Exome sequencing quality control summary

We perform a series of hard-filters on genotype, sample, and variant metrics (Table 1, Table 2-3). We check reported sex against genetic sex, and restrict analysis to samples of genetically ascertained NFE ancestry, using random forest (RF) classifiers (Fig. 2-3). Finally, we filter based on a second collection of sample and variant filters (Tables 2-3). We used Hail 0.2 (ref. 46) and PLINK 1.9 (ref. 47) to perform all QC steps, and R (4.0.2) scripts for plotting and filtering. Data were manipulated in R using data.table (1.14.2) and dplyr (1.0.7), random forest classifiers were trained using the randomForest (4.6-14) library, and plotting was performed using ggplot2 (3.3.5).

Exome sequencing quality control, full details

Sample filters

We evaluated sample-level quality control (QC) metrics on the 200,643 UKBB ES multi-sample project level variant call format (VCF) call-set files46 (Supplementary Table 1). All metrics were calculated for bi-allelic single nucleotide polymorphisms (SNPs), except for metrics involving insertions and deletions. We regressed out the first 21 principal components (PCs)48, and filtered out samples whose residuals were outliers for each metric based on median absolute deviation (MAD) thresholds (Supplementary Table 1). Samples without PC data were subject to more stringent thresholds (Supplementary Table 1).

Genotype filters

Multi-allelic variants were split into bi-allelic variants, and insertions and deletions (indels) were left-aligned49. Genotype calls meeting any of the following criteria were set to missing:

  1. Genotype quality (GQ) ≤ 20

  2. Total sequencing depth (DP) ≤ 10

  3. Heterozygous calls:
    1. SNPs: 1-sided binomial test of alternate allele depth relative to total read depth, P < 1×10⁻³
    2. Indels: alternate allele read depth / total read depth < 0.3
  4. Homozygous indel calls: alternate allele read depth / total read depth < 0.7
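
The genotype-level criteria above can be sketched as a single predicate. This is a minimal illustration in plain Python, not the Hail code used for the analysis: the function names and arguments are hypothetical, DP is assumed to be the total read depth used in the allele-balance ratios, and the one-sided binomial test is assumed to test for too few alternate reads against an expected fraction of 0.5.

```python
from math import comb


def binom_lower_tail(k: int, n: int) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k + 1)) / 2**n


def keep_genotype(gq, dp, alt_depth, is_het, is_indel, is_hom_alt=False):
    """Return True if a call passes; False means 'set to missing'."""
    if gq <= 20 or dp <= 10:
        return False
    if is_het:
        if is_indel:
            return alt_depth / dp >= 0.3
        # Assumption: lower-tail test (too few ALT reads vs. 50% expected).
        return binom_lower_tail(alt_depth, dp) >= 1e-3
    if is_indel and is_hom_alt:
        return alt_depth / dp >= 0.7
    return True
```

A balanced heterozygous SNP call (15 alternate reads out of 30) passes, whereas a call with 2 alternate reads out of 30 is set to missing.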

Variant-level filters

Retain variants satisfying all of the following conditions:

  1. Not in a low complexity region (LCR)50.

  2. In sequencing target regions ±50 base pairs.

  3. MAF > 0 following genotype QC.

  4. Excess heterozygosity (ExcessHet < 54.69) filter: Phred-scaled P-value for exact test of excess heterozygosity51 in founders as determined by relatedness estimates and recorded ages of UKBB participants48. Variants were retained as recommended in genome analysis toolkit (GATK)51

Additional ES quality control

To perform further QC we use Hail, an open-source Python library which focuses on the analysis of large-scale genetic data sets. We used Hail to create our own methods, and we take advantage of the functionality that has been rewritten to enable fast and scalable analysis of large exome and genome sequencing projects. Unless otherwise stated, all of the following data curation and quality control steps were performed in Hail46.

Briefly, we apply a collection of hard-filters on sample metrics. We check reported sex against genotypic sex, remove samples with excess ultra-rare variants (URVs), and restrict analysis to samples of genetically ascertained NFE ancestry. Finally, we apply a second collection of sample and variant hard filters. As an initial pass to remove low quality and contaminated samples, we filter out samples with call rate < 0.95, mean DP < 19.5× or mean GQ < 47.8 (Fig. 1).

Sex imputation

To confirm participant sex and calculate PCs, we extracted high-quality common variants (allele frequency between 0.01 and 0.99, call rate > 0.98) and LD-pruned them to pseudo-independent SNPs using --indep 50 5 2 in PLINK 1.9. A mismatch between reported and genotypic sex may signal a sample swap. Using the F-statistic computed for each sample on the non-pseudo-autosomal region of chromosome X, we identify and remove samples whose reported sex is not confirmed by the sequence data (Fig. 2). Specifically, we remove samples satisfying at least one of the following criteria:

  • Sex is unknown in the phenotype files.

  • F-statistic > 0.6 and the sex is female in the phenotype file.

  • F-statistic < 0.6 and the sex is male in the phenotype file.

  • F-statistic > 0.6 and number of calls on the Y chromosome is < 100.
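
The removal criteria above amount to a short rule set. The following is a minimal sketch under stated assumptions: the function name and the string coding of reported sex are hypothetical, and the F-statistic and Y-chromosome call count are taken as precomputed per-sample values.

```python
def fails_sex_check(reported_sex, f_stat, n_y_calls):
    """Return True if a sample should be removed by the sex-check criteria."""
    if reported_sex not in ("female", "male"):
        return True  # sex unknown in the phenotype files
    if f_stat > 0.6 and reported_sex == "female":
        return True
    if f_stat < 0.6 and reported_sex == "male":
        return True
    if f_stat > 0.6 and n_y_calls < 100:
        return True
    return False
```

For example, a reported male with F-statistic 0.9 but only 50 calls on the Y chromosome is removed, while the same sample with 500 Y calls is retained.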

Defining a set of samples with non-Finnish European ancestry

To ensure adequate case and control counts for as many traits as possible, we restricted our analysis to a set of genetically ascertained NFE samples. To do this, we perform a number of principal component analysis (PCA) steps. We first run PCA on the 1000 Genomes (1KGP) samples (minus the small subset of related individuals within the 1KGP) after subsetting to LD-pruned autosomal variants. We then project in the UKBB samples, ensuring that we correctly account for shrinkage bias in the projection52. Next, we remove samples outside of the European population (EUR) using an RF classifier: we train an RF on the super-population labels of 1KGP and predict the super-population for each of the UKBB samples (Fig. 3). We define the strictly defined European subset as those samples with probability > 0.99 of being European according to the classifier. Another RF classifier is trained following restriction of the 1KGP samples to Europeans to determine NFE, using a classifier probability of 0.95. RF classifiers were trained using the randomForest (4.6) library in R. Samples not assigned to the NFE cluster were removed from downstream analysis.

Final hard filters

For our final variant filtering step, we restrict to the NFE subset, remove samples with incorrectly defined or unknown sex, and run variant QC. We then filter out variants with call rate < 0.97 and variants out of Hardy-Weinberg equilibrium (HWE) (P < 1×10⁻⁶), and remove sites left invariant by the previous sample-based filters. After restricting to these high quality variants, we perform a final set of sample filters to finalize the quality controlled data. We evaluate a collection of sample metrics and remove samples falling more than four SDs from the sequencing batch mean (Ti/Tv, Het/HomVar, Insertion/Deletion ratios), and remove samples with over 175 singletons. The resultant curated analysis-ready data set consists of 176,935 samples and 9,169,408 variants (Supplementary Table 2-3).

A summary of sample and variant filters is provided in Supplementary Tables 2-3. The high quality ES call-set consisted of 176,935 samples and 9,169,408 variants.

Phasing

Combining ES data with genotype array data

We combined genotyping array (UK BiLEVE Axiom array and UKBB Axiom array) and ES (IDT xGen Exome Research Panel v1.0) variants after general ES quality control using Hail46 and BCFtools53 (1.12). For variants present in both data sets, we preferentially retained the ES calls. We excluded genotyping array variants with missingness > 5% after performing a liftover to GRCh38 using Hail46. To avoid biasing the phasing quality estimates, we excluded parents among trio relationships prior to phasing. We first created a common variant scaffold by phasing variants in the combined (exome sequencing and genotyping array) data with MAF > 0.1% and otherwise default parameters using the SHAPEIT5_PHASE_COMMON module. We then phased the remaining rare variants onto the common variant scaffold using SHAPEIT5_PHASE_RARE with recommended parameters. To ensure computational tractability, we phased overlapping chunks of 100,000 variants with ≥ 50,000 variant overlap between consecutive chunks using Hail46. Following chunk phasing, we removed the initial and final 22,500 variants from each chunk, so that 5,000 overlapping variants remained between contiguous phased chunks. We then combined the phased chunks, matching haplotype phase using bcftools53 (1.12) with the --ligate option. Finally, we restricted this phased genetic dataset to the set of samples and variants present in the analysis-ready NFE subset (Supplementary Tables 2-3).
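
The chunk arithmetic can be checked with a small sketch. The function below is hypothetical and operates on variant indices only. It assumes a fixed stride of 50,000 variants and, as a further assumption not stated in the text, leaves the outer edges of the first and last chunk untrimmed so that every variant remains covered.

```python
def phased_chunks(n_variants, size=100_000, overlap=50_000, trim=22_500):
    """Return half-open (start, end) index ranges retained from each chunk."""
    step = size - overlap
    kept, start = [], 0
    while True:
        end = min(start + size, n_variants)
        is_first, is_last = start == 0, end == n_variants
        # Assumption: only interior chunk boundaries are trimmed.
        kept_start = start if is_first else start + trim
        kept_end = end if is_last else end - trim
        kept.append((kept_start, kept_end))
        if is_last:
            return kept
        start += step


chunks = phased_chunks(250_000)
overlaps = [prev_end - next_start
            for (_, prev_end), (next_start, _) in zip(chunks, chunks[1:])]
```

With these settings each pair of consecutive trimmed chunks shares 100,000 - 50,000 - 2 × 22,500 = 5,000 variants, which is the overlap available for matching haplotype phase during ligation.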

Trio-switch error rates

We assessed phasing quality by comparing statistically phased genotypes to those implied in 96 trios using Mendelian inheritance logic. Switch errors are determined by traversing the statistically phased and parent-offspring transmitted haplotypes simultaneously and scanning for inconsistencies in phase between pairs of contiguous variants. This method only allows us to consider sites at which one parent is heterozygous and the other is homozygous for the reference or alternate allele, and thus does not consider de novo variants or Mendelian inconsistencies in the trio data. To assess switch error in a site-specific manner, we modified and recompiled bcftools53 (1.12) to output errors by genomic position. We then used the modified version to assess switch errors by variant category, for example by genetic data modality (genotyping array or ES), or by MAF bin. To evaluate switch errors across different phasing confidence thresholds, we filtered the VCF using Hail46 and then repeated the switch error calculation step. We calculated binomial 95% confidence intervals (CIs) for switch error rates (SERs) using the R package Hmisc54 (4.7).
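
To make the switch error definition concrete, the sketch below counts switches between a trio-derived phase and a statistically phased haplotype, and attaches a binomial 95% CI. It is an illustration rather than the modified bcftools code: both function names are hypothetical, each heterozygous site is coded 0 or 1 according to which haplotype carries the alternate allele, and a Wilson score interval is assumed because the interval method is not stated in the text.

```python
from math import sqrt


def count_switch_errors(truth, estimate):
    """Return (switch errors, opportunities) over consecutive het sites."""
    mismatch = [t != e for t, e in zip(truth, estimate)]
    # A switch occurs where agreement with the truth flips between neighbors.
    switches = sum(a != b for a, b in zip(mismatch, mismatch[1:]))
    return switches, max(len(mismatch) - 1, 0)


def wilson_interval(errors, n, z=1.959964):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


errors, sites = count_switch_errors([0, 1, 0, 1, 1, 0], [0, 1, 1, 0, 0, 0])
low, high = wilson_interval(errors, sites)
```

Note that an estimate that is the exact complement of the truth has zero switch errors, since the haplotype labels are arbitrary and only changes in relative phase are counted.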

Read-backed phasing

We performed read-backed phasing with UKBB ES short paired-end read sequences using .cram files provided by UKBB. As WhatsHap is computationally expensive, we restricted our analysis to pairs of variants on chromosomes 20-22 in 176,586 genetically ascertained NFEs. We phased both single nucleotide variants (SNVs) and indels with WhatsHap26 using the default recommended parameters. WhatsHap outputs lists of phased variants within ‘phased sets’. We carried forward reads overlapping no more than two variants, for which phase could be inferred. We combined these phased variants with statistically phased variants from SHAPEIT5 using Hail46, and determined agreement between estimated phasing in WhatsHap and SHAPEIT5 (Fig. 7).

Phenotype curation

We considered a collection of 282 binary quality controlled and publicly available common complex phenotypes for analysis55. To complement these, we also considered 28 common complex phenotypes that were obtained through manual curation, resulting in a total of 311 binary phenotypes for analysis. To increase our power for analyses of binary traits, we amalgamated a collection of phenotypes where possible, combining the phenotype curation of Censin et al.56 with the primary care mappings file provided by UK Biobank (all_lkps_map_v3.xlsx) and our own manual curation. We aggregated across ICD-10, ICD-9, operating codes, nurse interview reports, and self-reported diagnosis by doctor from the main phenotype file, as well as v2 and v3 read codes in the primary care data. As in Censin et al., we made use of the careful definitions of Eastwood et al.57, subsequently applied by Udler et al.58 for diabetes subtype curation. Briefly, the algorithm developed in Eastwood et al. bins individuals into putative diabetes status using a collection of phenotypes in the UK Biobank data, including self-reported diabetes diagnosis, age of diagnosis, medications, and start of insulin within a year of diagnosis. We defined cases as those placed in the probable and possible case categories in the algorithm's output. Controls were defined as samples labeled as ‘diabetes unlikely’ by the algorithm.

Variant annotation masks

We annotated coding variation using Variant Effect Predictor (VEP)59 (v95), taking the worst consequence by gene within ‘canonical’ transcripts. We classified variants into four categories: protein truncating variants (PTVs), missense variants, synonymous variants, and other variants (Supplementary Table 9). We then split PTVs into high-confidence (HC) putative loss-of-function (pLoF) variants and low-confidence (LC) loss-of-function variants using LOFTEE60, and labeled missense variants with both Rare Exome Variant Ensemble Learner (REVEL)61 score ≥ 0.6 and CADD62 score ≥ 20 as ‘damaging missense’, and all others as ‘other missense’. Finally, we combine the resultant ‘damaging missense’ category with LC loss-of-function variants, which we denote as ‘damaging missense/protein-altering’.
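
The masking rules can be summarized as a small classifier. This is a sketch under assumptions: the function name, the consequence-class strings, and the treatment of missing REVEL or CADD scores (as not damaging) are hypothetical choices rather than details taken from the annotation pipeline.

```python
def annotation_mask(consequence_class, loftee=None, revel=None, cadd=None):
    """Map a variant's broad consequence class and scores to a mask label."""
    if consequence_class == "ptv":
        # LOFTEE high-confidence PTVs are pLoF; low-confidence PTVs are
        # grouped with damaging missense variants.
        return "pLoF" if loftee == "HC" else "damaging missense/protein-altering"
    if consequence_class == "missense":
        # Assumption: a missing score means the threshold is not met.
        damaging = (revel is not None and revel >= 0.6
                    and cadd is not None and cadd >= 20)
        return "damaging missense/protein-altering" if damaging else "other missense"
    if consequence_class == "synonymous":
        return "synonymous"
    return "other"
```

A missense variant with REVEL 0.7 and CADD 25 falls in the ‘damaging missense/protein-altering’ mask, whereas one with REVEL 0.7 and CADD 15 is labeled ‘other missense’ because both thresholds must be met.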

Bi-allelic encoding and recessive association modeling

Using custom Hail scripts, we define and annotate individuals as being ‘bi-allelic’ for a gene if they harbor at least one pLoF or damaging missense variant with MAF < 5% on both inherited copies of the gene. For each sample, we encoded the absence and presence of a damaging bi-allelic variant for each gene as zero and two, respectively. We encode this information in a .vcf file and test for an association between presence of a damaging bi-allelic variant in a gene and a trait using SAIGE63, adjusting for sex, age, sex × age, age², UKBB centre, genotyping batch and the first 10 PCs. We took relatedness into account using a sparse genetic relatedness matrix (GRM) fitted on NFE. We restrict analysis to (gene, trait) pairs with at least five bi-allelic carriers in the curated ES with non-missing corresponding phenotype data (corresponding to a minimum MAC ≥ 10), and adjust for multiple testing at Bonferroni significance (P < 0.05/gene-trait pairs).

Gene copy dosage encoding and additive association modeling

We annotate individuals as being ‘mono-allelic’ for a gene if they harbor at least one pLoF or damaging missense variant with MAF < 5% on a single copy of the gene. Furthermore, if they harbor at least one pLoF or damaging missense variant on both inherited copies of the gene, we annotate them as ‘bi-allelic’. Using custom Hail scripts, we encode wildtypes, mono-allelic and bi-allelic carriers as 0, 1 and 2 respectively, thus representing the number of affected gene copies in an individual. We test for association using SAIGE63, adjusting for sex, age, sex × age, age², UKBB centre, genotyping batch and the first 10 PCs. Again, we took relatedness into account using a sparse GRM fitted on NFE. We restricted to gene-trait pairs with at least 10 disrupted haplotypes (corresponding to a minimum MAC ≥ 10), and adjust for multiple testing at Bonferroni significance (P < 0.05/gene-trait pairs).
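
The dosage encoding follows directly from phased genotypes. The sketch below is illustrative only and is not the custom Hail code: the function names are hypothetical, and each qualifying variant in a gene is assumed to be supplied for one individual as a phased pair (allele on haplotype 1, allele on haplotype 2), with 1 denoting the damaging allele.

```python
def affected_copies(phased_gts):
    """Number of gene copies (0, 1 or 2) carrying >= 1 damaging allele."""
    hap1 = any(a == 1 for a, _ in phased_gts)
    hap2 = any(b == 1 for _, b in phased_gts)
    return int(hap1) + int(hap2)


def carrier_class(phased_gts):
    """Label an individual's configuration for a single gene."""
    copies = affected_copies(phased_gts)
    n_carried = sum(1 for a, b in phased_gts if a == 1 or b == 1)
    if any(a == 1 and b == 1 for a, b in phased_gts):
        return "homozygous"
    if copies == 2:
        return "compound heterozygous"
    if copies == 1:
        return "cis (multiple variants)" if n_carried >= 2 else "heterozygous"
    return "wildtype"


def recessive_encoding(phased_gts):
    """Pseudo-variant for recessive testing: 2 if bi-allelic, else 0."""
    return 2 if affected_copies(phased_gts) == 2 else 0
```

Two variants phased as 0|1 and 1|0 yield two affected copies (CH), whereas the same two variants phased as 1|0 and 1|0 yield one affected copy, which is the distinction that an unphased variant count cannot make.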

Polygenic risk scores

Curation of array-based genetic data

We generated PRSs using imputed genotypes provided by UKBB48. In the following, we make the distinction between training and testing data. The former are the samples used for fitting LDPred2 (ref. 64) weights and parameters, while the latter are the samples with bi-allelic variant (homozygous or CH status) information, in which we assess the predictive accuracy of the fitted LDPred2-based PRS. For the training data, we took the genetically ascertained NFE samples and filtered to 246,152 unrelated samples (kinship coefficient < 2^−4.5) that did not have quality controlled ES data available. NFE samples with high quality ES and imputed genotype data available were used for testing. Where predictive (nominally significant h²_SNP and n_eff ≥ 5,000), we include PRS as a covariate for downstream bi-allelic association testing to account for common variant polygenic risk for the trait under investigation.

Genotype variant filtering

We followed best practices from Privé et al.64, and filtered to common Haplotype Map 3 (HM3) SNPs65. Additionally, we exclude any variants with genotyping proportion < 1% and MAF < 1%, resulting in a total of 1,165,296 common autosomal variants for fitting PRS weights. To reduce the likelihood of spurious correlations between low-frequency variants in traits with low case or control count, we restricted to binary phenotypes with at least 1,250 cases and controls. Additionally, we imposed a phenotype specific MAF filter based on the number of cases and controls in a trait, specifically:

$$\mathrm{MAF} > \max\left(0.01,\ 2 \times \min\left(n_{\mathrm{cases}}, n_{\mathrm{controls}}\right)\right), \tag{1}$$

where n_cases and n_controls are the numbers of cases and controls with high quality imputed sequence data available, respectively, to guard against non-causal variants that are overrepresented in cases or controls leading to false positive associations.

Common variant association testing

We tested for associations between the 1,165,296 common autosomal HM3 variants and phenotypes using Hail46, running logistic regression (logistic_regression_rows) with a Wald test, adjusting for sex, age, sex × age, age², UKBB assessment centre, genotyping batch and the first 10 PCs.

Estimating heritability

We generated LD-scores for HM3 variants in sample, using a random subset of 10,000 of the 246,152 unrelated genetically ascertained NFEs without haplotype information. Using the genome-wide association study (GWAS) summary statistics and LD-scores, we estimated SNP heritability h²_SNP and standard errors (SEs) using LD score regression (LDSC)66,67. We evaluated PRS only for phenotypes with nominally significant (P < 0.05) LDSC-based h²_SNP estimates and effective sample size n_eff ≥ 5,000, where:

$$n_{\mathrm{eff}} = \frac{4}{\frac{1}{n_{\mathrm{cases}}} + \frac{1}{n_{\mathrm{controls}}}}. \tag{2}$$
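
As a worked illustration of the effective sample size, the hypothetical helper below evaluates the expression for a few case and control counts. The counts are made-up inputs chosen for illustration, not values from the study.

```python
def effective_sample_size(n_cases, n_controls):
    """n_eff = 4 / (1 / n_cases + 1 / n_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


balanced = effective_sample_size(1_250, 1_250)      # equal cases and controls
few_cases = effective_sample_size(1_250, 170_000)   # controls vastly outnumber cases
more_cases = effective_sample_size(2_000, 170_000)
```

Because the smaller group dominates the harmonic mean, 1,250 cases give an effective sample size below 5,000 regardless of how many controls are available, whereas 2,000 cases with 170,000 controls exceed it.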

Generating PRS using LDPred2

For a given phenotype, we trained a PRS predictor with LDPred2-auto64, using as input marginal effect size estimates evaluated on the 246,152 unrelated NFE samples (kinship coefficient < 2^−4.5) without ES data in the 200k ES UKBB release, h²_SNP as estimated by LDSC, and an in-sample reference panel to evaluate local LD. We removed any invariant sites and mean-imputed missing genotypes before training the predictor. Following PRS training, we then predict into the 176,266 samples with ES and high-quality imputed genotype data.

Validation of polygenic risk scores

We assessed the ability of the resulting PRS to discriminate between case status by evaluating area under the curve (AUC) on the held-out unrelated set of samples with both HM3 SNPs and phased exome data. We used the function AUCBoot from the R package bigstatsr68 (1.5.6) to extract 10,000 bootstrap replicates of individuals and compute the 95% CIs for AUC.

Conditional analysis

Off-chromosome PRS conditional analysis

For each chromosome, C, we evaluated ‘off-chromosome’ PRS by setting weights on chromosome C to 0. We repeated this for each phenotype with PRS available and fit SAIGE63 models while controlling for off-chromosome PRS by including it as a covariate in the null SAIGE model.
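
A minimal sketch of the off-chromosome score for one individual follows. The function name and the dictionary layout for weights and dosages are hypothetical, and the weights are made-up numbers used only to show the calculation.

```python
def off_chromosome_prs(weights, dosages, gene_chrom):
    """PRS with the weights on `gene_chrom` set to zero.

    weights: {variant_id: (chromosome, effect weight)}
    dosages: {variant_id: allele dosage for one individual}
    """
    return sum(
        beta * dosages.get(variant, 0.0)
        for variant, (chrom, beta) in weights.items()
        if chrom != gene_chrom
    )


weights = {"v1": ("1", 0.5), "v2": ("2", -0.2), "v3": ("1", 0.1)}
dosages = {"v1": 2, "v2": 1, "v3": 0}
score_for_chr1_gene = off_chromosome_prs(weights, dosages, "1")
```

Excluding the chromosome that harbors the tested gene keeps the covariate from absorbing signal that is in linkage with the gene itself.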

Common variant conditional analysis

To assess whether a putative signal in a gene is driven by nearby common variation, we filtered to samples that have both ES and imputed genotypes, and to imputed variants with MAF > 1% and imputation INFO score > 0.5. Then, for each gene that passed exome-wide significance in the primary analysis (P < 5×10⁻⁶), we tested for common variant associations in the region (1 Mb upstream and downstream of the gene). For each of these regions, we took an iterative approach, testing for common variant associations using SAIGE63, conditioning on the lead variant and repeating the regression until no remaining variant had a conditional P below 5×10⁻⁶, allowing up to 25 ‘independent’ associations in the region. We used the same covariates as in the primary analysis. For every variant that passed exome-wide significance (P < 5×10⁻⁶), we encoded the genotypes as dosages and embedded them alongside pseudo variants (bi-allelic variants) in a VCF. We then re-ran the primary analysis twice (with and without controlling for off-chromosome PRS), while conditioning on any nearby common variant signals of association with the phenotype of interest.

Rare and ultra-rare variant conditional analysis

For each significant ($P < 1.68 \times 10^{-7}$) gene-trait association in the genome-wide analysis after conditioning on PRS and nearby common variant association signals, we considered a further conditioning step. We sought to determine whether the residual signal of association could be explained by additive rare variant effects within the associated gene. To do this, we further conditioned on rare (MAC ≥ 10, MAF ≤ 0.05) and ultra-rare (MAC ≤ 10) variants annotated as either pLoF or damaging missense within each gene. Because conditioning on ultra-rare variation can lead to convergence issues, we performed a gene-wide collapsing of ultra-rare (MAC ≤ 10) variants, thus aggregating them into a single ‘super’ variant to represent the burden of ultra-rare damaging variation in the gene. Following this collapsing, we were able to condition on the ultra-rare and rare variant contribution using SAIGE, while also conditioning on PRS and nearby common variant association signals when applicable.
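A minimal Python sketch of the collapsing step follows. It assumes genotypes are stored per variant as alternate-allele counts (0, 1 or 2) per sample, and it encodes the ‘super’ variant as the maximum dosage across ultra-rare variants; that encoding and the function name are assumptions for illustration, not a description of the SAIGE internals.

```python
def collapse_ultra_rare(genotypes, mac_threshold=10):
    """Split variants by minor allele count and collapse the ultra-rare ones.

    `genotypes[v][s]` is the alternate-allele count (0/1/2) of sample s at
    variant v. Returns the indices of variants kept as individual covariates
    and one collapsed 'super' variant (per-sample maximum dosage).
    """
    n_samples = len(genotypes[0])
    kept, ultra_rare = [], []
    for idx, calls in enumerate(genotypes):
        alt_count = sum(calls)
        mac = min(alt_count, 2 * n_samples - alt_count)
        if mac <= mac_threshold:
            ultra_rare.append(idx)
        else:
            kept.append(idx)
    super_variant = [
        max((genotypes[v][s] for v in ultra_rare), default=0)
        for s in range(n_samples)
    ]
    return kept, super_variant


# Toy gene (hypothetical): 3 variants in 4 samples, threshold lowered to MAC <= 1.
toy_genotypes = [
    [1, 1, 0, 1],  # MAC 3: kept as its own covariate
    [1, 0, 0, 0],  # MAC 1: ultra-rare
    [0, 0, 1, 0],  # MAC 1: ultra-rare
]
kept, super_variant = collapse_ultra_rare(toy_genotypes, mac_threshold=1)
print(kept, super_variant)
```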

Permutation of genetic phase

To test whether a putative gene-trait association is driven by compound heterozygosity, we designed a permutation-based pipeline that could be systematically applied and scaled across phenotypes and genes. We labeled each sample as either a CH carrier or a carrier of multiple heterozygous variants in cis, and then randomly shuffled these labels. For each permutation, we re-ran the association analysis, conditioning on covariates as previously discussed (including off-chromosome PRS and nearby common variants), and determined the resultant association strength under this label shuffling. Repeating this permutation procedure many times yields an empirical null distribution of t-statistics, and corresponding P-values, that reflects the degree of association expected if phase were random. We evaluate the one-sided empirical P-value, specifically:

$$P_{\mathrm{empirical}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left(t_i \geq t_{\mathrm{observed}}\right) \tag{3}$$

where $n$ is the number of permutations, $t_i$ is the t-statistic under the $i$th random label shuffling, and $t_{\mathrm{observed}}$ is the observed t-statistic determined using the observed genetic phase. To ensure sampling of t-statistics at a sufficiently large number of configurations of the genetic phase, we analyzed gene-trait pairs with at least ten compound heterozygotes and/or samples with multiple variants on the same haplotype. We permuted up to 100,000 times. To control for multiple testing, we corrected for the five gene-trait pairs tested (Bonferroni significance threshold $P < 0.05/5 = 0.01$).
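The permutation logic can be illustrated with a short Python sketch. To keep it self-contained, the SAIGE t-statistic is replaced by a simple difference in case proportions between CH carriers and cis carriers; this statistic, the function names and the toy data are assumptions for this example only.

```python
import random


def case_rate_difference(is_ch, is_case):
    """Stand-in statistic: case proportion in CH carriers minus cis carriers."""
    ch = [y for label, y in zip(is_ch, is_case) if label]
    cis = [y for label, y in zip(is_ch, is_case) if not label]
    return sum(ch) / len(ch) - sum(cis) / len(cis)


def empirical_p(is_ch, is_case, n_perm=2000, seed=7, statistic=case_rate_difference):
    """One-sided empirical P: fraction of shuffles with statistic >= observed."""
    rng = random.Random(seed)
    t_observed = statistic(is_ch, is_case)
    labels = list(is_ch)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the link between phase label and phenotype
        if statistic(labels, is_case) >= t_observed:
            exceed += 1
    return exceed / n_perm


# Toy data (hypothetical): 10 CH carriers and 10 cis carriers.
is_ch = [1] * 10 + [0] * 10
is_case = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0] + [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(empirical_p(is_ch, is_case))
```

Only the phase labels are shuffled, so the number of CH and cis carriers and the phenotype vector stay fixed across permutations.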

Gene-set enrichment of bi-allelic variation

Analyzed gene-sets

We included the following gene lists in our gene-set enrichment analyses: essential in mice69, essential gnomAD27, essential ADaM70, essential in culture71, essential CRISPR72, genes with pLI > 0.9 in gnomAD27, non-essential in culture71, homozygous LoF tolerant27, and non-essential gnomAD27.

Poisson regression to assess enrichment of CH variants in gene-sets

We test for depletion and enrichment of bi-allelic variants in gene-sets using Poisson regression. We model the count of bi-allelic variants across samples as a function of gene-set membership and mutation rate using the glm function in R.

$$\text{samples with} > 1 \text{ variant of class } x \text{ in gene} \sim I(\text{gene-set}) + \text{mutation rate} \tag{4}$$

where $x$ is a pair $(x_1, x_2)$ with $x_1 \in \{$pLoF, damaging missense, pLoF and/or damaging missense, other missense, synonymous$\}$ and $x_2 \in \{$heterozygote, CH, bi-allelic variants$\}$. For each annotation category we use the transcript-specific mutation rate28. 95% confidence intervals are determined using confint.glm from the MASS package (v7.3-58.1).
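For readers without R at hand, the sketch below fits a Poisson regression with a log link by Newton-Raphson in plain Python. It is a stand-in for R's glm, not the code used in the analysis; the design matrix layout (intercept, gene-set indicator, mutation-rate covariate), the toy counts and all function names are assumptions.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def poisson_glm(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood Poisson regression (log link); column 0 is the intercept."""
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start at the overall mean count
    for _ in range(n_iter):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        score = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                 for j in range(p)]
        info = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy genes (hypothetical): columns are intercept, gene-set indicator and a
# centred log mutation rate; y is the per-gene count of carrier samples.
X = [[1, 0, -1.0], [1, 0, 0.0], [1, 0, 1.0], [1, 1, -1.0], [1, 1, 0.0], [1, 1, 1.0]]
y = [3, 5, 9, 1, 2, 4]
beta = poisson_glm(X, y)
print("rate ratio for gene-set membership:", math.exp(beta[1]))
```

A rate ratio below 1 for the gene-set indicator corresponds to depletion and a ratio above 1 to enrichment, after adjusting for the mutation-rate covariate.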

Homozygote and CH down-sampling

To investigate the number of identifiable CH or homozygous events across varying sample sizes and variant annotations, we performed down-sampling across the total population of 176,587 individuals. To do this, we defined a set of 35 regularly spaced cutoffs between 1,000 and 176,587 samples using increments of 5,000. To quantify uncertainty in our estimates of the number of unique genes affected by homozygous and/or CH variants, we randomly sampled individuals 100 times, with replacement, at each cutoff. We calculated the 95% CI by taking the 2.5% and 97.5% quantiles of the number of unique genes affected at a given sample size, and repeated this across annotations (Fig. 19).
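The down-sampling procedure can be sketched as follows in Python. Each individual is represented by the set of genes in which they carry a CH or homozygous event; this representation, the toy cohort and the nearest-rank quantiles are assumptions made for the example.

```python
import random


def downsample_unique_genes(carrier_genes, size, n_rep=100, seed=11):
    """2.5% and 97.5% quantiles of the number of unique affected genes.

    `carrier_genes` holds one set of gene names per individual (empty set for
    non-carriers). Individuals are drawn with replacement `n_rep` times.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_rep):
        drawn = rng.choices(carrier_genes, k=size)
        genes = set()
        for individual in drawn:
            genes |= individual
        counts.append(len(genes))
    counts.sort()
    lower = counts[int(0.025 * (n_rep - 1))]
    upper = counts[int(round(0.975 * (n_rep - 1)))]
    return lower, upper


# Toy cohort (hypothetical): 1,000 individuals, about 5% carry one of 40 genes.
rng = random.Random(0)
cohort = [
    {f"GENE{rng.randrange(40)}"} if rng.random() < 0.05 else set()
    for _ in range(1000)
]
for size in (100, 500, 1000):
    print(size, downsample_unique_genes(cohort, size))
```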

Power analysis for bi-allelic association

We performed a power analysis based on bi-allelic (including both CH and homozygous) variant frequencies in the population. To do this, we adopted code73 that allowed us to determine the effect size on the OR scale across candidate configurations of binary case-control counts by substituting alternate allele frequencies with bi-allelic variant frequencies. We calculated effect sizes at 80% power at Bonferroni significance ($P < 1.7 \times 10^{-7}$) for hypothetical traits with 823 (0.5%), 1766 (1%), 3532 (2%), 5298 (3%), and 8829 (5%) cases among 176,587 total samples.
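One way such a calculation can be set up is sketched below in Python. It assumes the standard case-control approximation in which the test statistic is chi-square with 1 degree of freedom and non-centrality $2f(1-f)\,n\,\phi(1-\phi)\,\beta^2$, with the allele frequency $f$ replaced by a bi-allelic carrier frequency as described above. The formula choice, the bisection solver and the example frequency are assumptions and may differ from the code actually used.

```python
import math
from statistics import NormalDist

_norm = NormalDist()


def power(beta, f, n, phi, alpha):
    """Power of a 1-df test for log-OR `beta` (assumed non-centrality formula).

    f: bi-allelic carrier frequency, n: total samples, phi: case proportion.
    """
    ncp = 2 * f * (1 - f) * n * phi * (1 - phi) * beta ** 2
    z = _norm.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(ncp)
    # A 1-df non-central chi-square is the square of a shifted standard normal.
    return (1 - _norm.cdf(z - shift)) + _norm.cdf(-z - shift)


def log_or_at_power(f, n, phi, alpha=1.7e-7, target=0.8):
    """Smallest log-OR reaching `target` power, found by bisection."""
    low, high = 0.0, 50.0
    for _ in range(200):
        mid = (low + high) / 2
        if power(mid, f, n, phi, alpha) < target:
            low = mid
        else:
            high = mid
    return high


# Hypothetical example: carrier frequency 0.1%, trait prevalence 1%.
beta_80 = log_or_at_power(f=0.001, n=176587, phi=0.01)
print("OR detectable at 80% power:", math.exp(beta_80))
```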

Simulation

Simulation of synthetic phenotypes using real genotypes

We performed a series of simulations to test whether our pipeline would detect a CH effect in the presence of a true signal. We sampled 100,000 genetically-ascertained NFEs in the UKBB data and extracted chromosome 22, which we then used to simulate phenotypic data with a recessive genetic architecture. To emulate a scenario in which defects in protein-coding genes lead to disease, we annotated the filtered UKBB genetic data and determined the collection of samples harboring damaging bi-allelic variants in each gene (compound heterozygous and homozygous, comprising variants annotated as pLoF or damaging missense). We then defined an $n$ samples × $m$ genes matrix $\tilde{B}$ with entries:

$$\tilde{B}_{i,j} = \begin{cases} 1, & \text{if a damaging bi-allelic variant is present in sample } i \text{ at gene } j \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

We then simulated liability under the following model:

$$y_i = \sum_{j=1}^{m} B_{i,j}\,\theta_j + \varepsilon_i \tag{6}$$

where $B_{i,j}$ is the $(i,j)$th entry of $B$ after standardizing the columns of $\tilde{B}$, $E[\theta_j] = b/m$, $\mathrm{Var}[\theta_j] = h^2/m$, and $\varepsilon_i \sim N(0, 1-h^2)$. Here, we implicitly assume that the presence of at least one homozygous or CH variant of any type within a given gene contributes the same risk to disease, whose average across genes is set by the parameter $b$. The resultant liability $y_i$ has mean 0 and variance 1. Note that the standardization of $B$ imposes a frequency-dependent relationship between the prevalence of bi-allelic damaging variants in a gene and variance explained. We simulated under the spike-and-slab model:

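$$\theta_j \sim \begin{cases} N\left(\dfrac{b}{m\pi_\theta}, \dfrac{h^2}{m\pi_\theta}\right), & \text{if } p_j < \pi_\theta \\ 0, & \text{otherwise} \end{cases} \qquad p_j \sim \mathrm{Bernoulli}(\pi_\theta)$$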

in which $\pi_\theta \in [0,1]$ is the proportion of causal genes with a recessive contribution to the phenotype. Finally, to obtain binary traits we used the liability threshold model assuming a case prevalence of 10%. In the following simulations, we set $\pi_\theta = 0.25$, and considered $h^2$ values of $h^2 \in \{0, 0.01, 0.02, 0.05, 0.10\}$ and $b$ values of $b \in \{0, 0.5, 1, 2, 10\}$.
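The simulation model can be sketched in plain Python as below. The sketch reads the spike-and-slab step as "each gene is causal with probability $\pi_\theta$" and places the liability threshold at the standard normal quantile for the stated prevalence; both readings, the toy carrier matrix and the function name are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist


def simulate_binary_trait(B_tilde, h2=0.05, b=1.0, pi=0.25, prevalence=0.10, seed=3):
    """Simulate case status from a 0/1 matrix of bi-allelic carrier indicators."""
    rng = random.Random(seed)
    n, m = len(B_tilde), len(B_tilde[0])

    # Standardize each gene column (columns without carriers stay at 0).
    B = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [B_tilde[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        if sd > 0:
            for i in range(n):
                B[i][j] = (col[i] - mean) / sd

    # Spike-and-slab effects: causal with probability pi (assumed reading).
    theta = []
    for _ in range(m):
        if rng.random() < pi:
            theta.append(rng.gauss(b / (m * pi), math.sqrt(h2 / (m * pi))))
        else:
            theta.append(0.0)

    # Liability = genetic component + noise with variance 1 - h2.
    liability = [
        sum(B[i][j] * theta[j] for j in range(m)) + rng.gauss(0.0, math.sqrt(1 - h2))
        for i in range(n)
    ]

    # Liability threshold model on the standard normal scale.
    threshold = NormalDist().inv_cdf(1 - prevalence)
    cases = [1 if value > threshold else 0 for value in liability]
    return cases, theta


# Toy carrier matrix (hypothetical): 500 samples, 10 genes, 5% carriers per gene.
rng = random.Random(0)
B_tilde = [[1 if rng.random() < 0.05 else 0 for _ in range(10)] for _ in range(500)]
cases, theta = simulate_binary_trait(B_tilde, h2=0.05, b=1.0)
print(sum(cases), "cases;", sum(t != 0 for t in theta), "causal genes")
```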

Longitudinal effects

Time-to-event data curation

We curated age-at-diagnosis for 278 binary phenotypes from the UKBB-linked primary care and hospital record data. 251 phenotypes were curated using the mapping tables generated by Kuan et al.55, excluding any codes related to “history of…” events for which accurate age-at-diagnosis could not be extracted. The remaining 27 phenotypes individuals’ records were left-truncated at the age of first record (of any code) in either the primary care or hospital data, and right-censored at the age of the last record.

Cox proportional-hazards modeling

For each gene-trait combination to test, we performed Cox proportional-hazards modeling to estimate differences in lifetime risk of developing the phenotype between heterozygous carriers of pLoF + damaging missense variants in the gene (reference group) and individuals who are bi-allelic carriers (compound-heterozygous or homozygous), multi-hit cis-heterozygous carriers, and wildtypes. All effects were adjusted for sex, the first 10 genetic PCs, birth cohort (in ten-year intervals from 1930 to 1970), and UKBB assessment center. For phenotypes with a significantly heritable PRS, we additionally adjusted for off-chromosome PRS. We visualized survival probabilities using Kaplan-Meier curves74. Finally, for gene-trait combinations where we were powered to detect differences between compound-heterozygous and multi-hit cis-heterozygous carriers of variants, i.e. where each group contained at least five cases of the phenotype, we repeated the above analysis with multi-hit cis-heterozygous carriers as the reference group. Cox proportional-hazards regression was performed using the R package survival (v3.3.1)75 and Kaplan-Meier plots were drawn with the R package survminer (v0.4.9)76.
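To make the survival estimate behind the Kaplan-Meier curves concrete, the Python sketch below computes the product-limit estimator for right-censored data. It ignores left truncation and covariate adjustment, does not reproduce the survival or survminer packages, and uses invented ages and a hypothetical function name.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    `times` are ages at diagnosis or censoring; `events` are 1 for a
    diagnosis and 0 for right-censoring. Returns a list of (time, survival).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        diagnosed = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:
            diagnosed += data[i][1]
            leaving += 1
            i += 1
        if diagnosed:
            survival *= 1 - diagnosed / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # diagnosed and censored individuals leave the risk set
    return curve


# Toy group (hypothetical ages): four carriers, one censored at age 52.
curve = kaplan_meier([45, 52, 60, 71], [1, 0, 1, 1])
for age, surv in curve:
    print(age, surv)
```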

Supplementary Material

Supplement 1
media-1.xlsx (5MB, xlsx)

Acknowledgments

F.H.L. is supported by the Wellcome Trust (award 224894/Z/21/Z), and the Medical Sciences Doctoral Training Centre at the University of Oxford. S.S.V. is supported by the Rhodes Scholarship, Clarendon Fund, and the Medical Sciences Doctoral Training Centre at the University of Oxford. N.B. is supported by the Clarendon Fund, and the Medical Sciences Doctoral Training Centre at the University of Oxford. W.Z. is supported by the National Human Genome Research Institute of the National Institutes of Health under award number K99HG012222. A.B. is supported by the Novo Nordisk Center for Genomic Mechanisms of Disease at the Broad Institute (NNF21SA0072102). N.W. is supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (220134/Z/20/Z) and research grant funding from the Rosetrees Trust (PGL19-2/10025). C.M.L. is supported by the Li Ka Shing Foundation, NIHR Oxford Biomedical Research Centre, Oxford, NIH (1P50HD104224-01), Gates Foundation (INV-024200), and a Wellcome Trust Investigator Award (221782/Z/20/Z). This research has been conducted using the UK Biobank resource under application number 10844.

Footnotes

Code availability

The code required to reproduce our analyses is publicly available at https://github.com/frhl/wes_ko_ukbb. Data produced in the present study are available upon reasonable request to the authors.

Competing interests

B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. All other authors declare no competing interests.

References

  • 1. Nelson M. R., Tipney H., Painter J. L., et al. The support of human genetic evidence for approved drug indications. Nature Genetics 47, 856–860 (Aug. 2015).
  • 2. Plenge R. M., Scolnick E. M. & Altshuler D. Validating therapeutic targets through human genetics. Nature Reviews Drug Discovery 12, 581–594 (Aug. 2013).
  • 3. Whiffin N., Armean I. M., Kleinman A., et al. The effect of LRRK2 loss-of-function variants in humans. Nature Medicine 26, 869–877 (June 2020).
  • 4. Tobert J. A. Lovastatin and beyond: the history of the HMG-CoA reductase inhibitors. Nature Reviews Drug Discovery 2, 517–526 (July 2003).
  • 5. Do R. Q., Vogel R. A. & Schwartz G. G. PCSK9 Inhibitors: potential in cardiovascular therapeutics. Current Cardiology Reports 15, 345 (Mar. 2013).
  • 6. Minikel E. V., Karczewski K. J., Martin H. C., et al. Evaluating drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (May 2020).
  • 7. Van Hout C. V., Tachmazidou I., Backman J. D., et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (Oct. 2020).
  • 8. DeBoever C., Tanigawa Y., Lindholm M. E., et al. Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study. Nature Communications 9, 1612 (Apr. 2018).
  • 9. Sulem P., Helgason H., Oddson A., et al. Identification of a large set of rare complete human knockouts. Nature Genetics 47, 448–452 (May 2015).
  • 10. Heyne H. O., Karjalainen J., Karczewski K. J., et al. Mono- and biallelic variant effects on disease at biobank scale. Nature 613, 519–525 (Jan. 2023).
  • 11. Lim E. T., Würtz P., Havulinna A. S., et al. Distribution and Medical Impact of Loss-of-Function Variants in the Finnish Founder Population. PLoS Genetics 10 (ed Cutler D.) e1004494 (July 2014).
  • 12. Saleheen D., Natarajan P., Armean I. M., et al. Human knockouts and phenotypic analysis in a cohort with a high rate of consanguinity. Nature 544, 235–239 (Apr. 2017).
  • 13. De Rosa M., Fasano C., Panariello L., et al. Evidence for a recessive inheritance of Turcot’s syndrome caused by compound heterozygous mutations within the PMS2 gene. Oncogene 19, 1719–1723 (Mar. 2000).
  • 14. Hague S., Rogaeva E., Hernandez D., et al. Early-onset Parkinson’s disease caused by a compound heterozygous DJ-1 mutation. Annals of Neurology 54, 271–274 (Aug. 2003).
  • 15. Robinson J. P., Johnson V. L., Rogers P. A., et al. Evidence for an Association between Compound Heterozygosity for Germ Line Mutations in the Hemochromatosis (HFE) Gene and Increased Risk of Colorectal Cancer. Cancer Epidemiology, Biomarkers & Prevention 14, 1460–1463 (June 2005).
  • 16. Maffei L., Rochira V., Zirilli L., et al. A novel compound heterozygous mutation of the aromatase gene in an adult man: reinforced evidence on the relationship between congenital oestrogen deficiency, adiposity and the metabolic syndrome. Clinical Endocrinology 67, 218–224 (Aug. 2007).
  • 17. Wang X.-H., Xie L., Chen S., et al. Identification of Novel Compound Heterozygous MYO15A Mutations in Two Chinese Families with Autosomal Recessive Nonsyndromic Hearing Loss. Neural Plasticity 2021, 9957712 (2021).
  • 18. Delaneau O., Zagury J.-F., Robinson M. R., et al. Accurate, scalable and integrative haplotype estimation. Nature Communications 10, 5436 (Nov. 2019).
  • 19. Maestri S., Maturo M. G., Cosentino E., et al. A Long-Read Sequencing Approach for Direct Haplotype Phasing in Clinical Settings. International Journal of Molecular Sciences 21, 9177 (Jan. 2020).
  • 20. Li N. & Stephens M. Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics 165, 2213–2233 (Dec. 2003).
  • 21. Loh P.-R., Danecek P., Palamara P. F., et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics 48, 1443–1448 (Nov. 2016).
  • 22. Barton A. R., Sherman M. A., Mukamel R. E., et al. Whole-exome imputation within UK Biobank powers rare coding variant association and fine-mapping analyses. Nature Genetics 53, 1260–1269 (Aug. 2021).
  • 23. Browning S. R. & Browning B. L. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics 12, 703–714 (Oct. 2011).
  • 24. Hofmeister R. J., Ribeiro D. M., Rubinacci S., et al. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank (Oct. 2022).
  • 25. Delaneau O., Zagury J.-F., Robinson M. R., et al. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (Nov. 2019).
  • 26. Martin M., Patterson M., Garg S., et al. WhatsHap: fast and accurate read-based phasing (Nov. 2016).
  • 27. Karczewski K. J., Francioli L. C., Tiao G., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (May 2020).
  • 28. Samocha K. E., Kosmicki J. A., Karczewski K. J., et al. Regional missense constraint improves variant deleteriousness prediction. BioRxiv (2017).
  • 29. Hamosh A., Scott A. F., Amberger J. S., et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33, D514–D517 (Jan. 2005).
  • 30. Zhou W., Nielsen J. B., Fritsche L. G., et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics 50, 1335–1341 (Sept. 2018).
  • 31. Privé F., Arbel J. & Vilhjálmsson B. J. LDpred2: better, faster, stronger. Bioinformatics 36 (ed Schwartz R.) 5424–5431 (Apr. 2021).
  • 32. Jurgens S. J., Pirruccello J. P., Choi S. H., et al. Adjusting for common variant polygenic scores improves yield in rare variant association analyses. Nature Genetics, 1–5 (Mar. 2023).
  • 33. Kavec M. J., Urbanova M., Makovicky P., et al. Oxidative Damage in Sporadic Colorectal Cancer: Molecular Mapping of Base Excision Repair Glycosylases MUTYH and hOGG1 in Colorectal Cancer Patients. International Journal of Molecular Sciences 23, 5704 (May 2022).
  • 34. Rice N. E., Patel B. D., Lang I. A., et al. Filaggrin gene mutations are associated with asthma and eczema in later life. The Journal of Allergy and Clinical Immunology 122, 834 (Oct. 2008).
  • 35. Palmer C. N. A., Irvine A. D., Terron-Kwiatkowski A., et al. Common loss-of-function variants of the epidermal barrier protein filaggrin are a major predisposing factor for atopic dermatitis. Nature Genetics 38, 441–446 (Apr. 2006).
  • 36. Sandberg M. A., Rosner B., Weigel-DiFranco C., et al. Disease course in patients with autosomal recessive retinitis pigmentosa due to the USH2A gene. Investigative Ophthalmology & Visual Science 49, 5532–5539 (Dec. 2008).
  • 37. Bajaj S., Zameer S., Jain S., et al. Effect of the MAGL/FAAH Dual Inhibitor JZL-195 on Streptozotocin-Induced Alzheimer’s Disease-like Sporadic Dementia in Mice with an Emphasis on Aβ, HSP-70, Neuroinflammation, and Oxidative Stress. ACS Chemical Neuroscience 13, 920–932 (Apr. 2022).
  • 38. Chew H., Solomon V. A. & Fonteh A. N. Involvement of Lipids in Alzheimer’s Disease Pathology and Potential Therapies. Frontiers in Physiology 11 (2020).
  • 39. Carlsen B. C., Meldgaard M., Johansen J. D., et al. Filaggrin compound heterozygous patients carry mutations in trans position. Experimental Dermatology 22, 572–575 (Sept. 2013).
  • 40. Riethmuller C., McAleer M. A., Koppes S. A., et al. Filaggrin breakdown products determine corneocyte conformation in patients with atopic dermatitis. Journal of Allergy and Clinical Immunology 136, 1573–1580.e2 (Dec. 2015).
  • 41. Liu X., Tang Z., Li C., et al. Novel USH2A compound heterozygous mutations cause RP/USH2 in a Chinese family. Molecular Vision 16, 454–461 (Mar. 2010).
  • 42. Weren R. D., Ligtenberg M. J., Geurts van Kessel A., et al. NTHL1 and MUTYH polyposis syndromes: two sides of the same coin? The Journal of Pathology 244, 135–142 (2018).
  • 43. Obeidat M., Li X., Burgess S., et al. Surfactant protein D is a causal risk factor for COPD: results of Mendelian randomisation. European Respiratory Journal 50 (Nov. 2017).
  • 44. Finn R. D., Bateman A., Clements J., et al. Pfam: the protein families database. Nucleic Acids Research 42, D222–230 (Jan. 2014).
  • 45. Guo M. H., Francioli L. C., Stenton S. L., et al. Inferring compound heterozygosity from large-scale exome sequencing data. 2023.03.19.533370 (Mar. 2023).
  • 46. Hail Team. Hail (Oct. 2022).
  • 47. Chang C. C., Chow C. C., Tellier L. C., et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (Feb. 2015).
  • 48. Bycroft C., Freeman C., Petkova D., et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (Oct. 2018).
  • 49. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (Nov. 2011).
  • 50. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (Oct. 2014).
  • 51. Van der Auwera G. A. & O’Connor B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, Inc., Apr. 2020).
  • 52. Zhang D., Dey R. & Lee S. Fast and robust ancestry prediction using principal component analysis. Bioinformatics 36, 3439–3446 (June 2020).
  • 53. Danecek P., Bonfield J. K., Liddle J., et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (Feb. 2021).
  • 54. Harrell F. E. Jr & Harrell M. F. E. Jr Package ‘hmisc’. CRAN2018 2019, 235–236 (2019).
  • 55. Kuan V., Denaxas S., Gonzalez-Izquierdo A., et al. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. The Lancet Digital Health 1, e63–e77 (June 2019).
  • 56. Censin J. C., Peters S. A. E., Bovijn J., et al. Causal relationships between obesity and the leading causes of death in women and men. PLoS Genet. 15, e1008405 (Oct. 2019).
  • 57. Eastwood S. V., Mathur R., Atkinson M., et al. Algorithms for the Capture and Adjudication of Prevalent and Incident Diabetes in UK Biobank. PLoS One 11, e0162388 (Sept. 2016).
  • 58. Udler M. S., McCarthy M. I., Florez J. C., et al. Genetic Risk Scores for Diabetes Diagnosis and Precision Medicine. Endocr. Rev. 40, 1500–1520 (Dec. 2019).
  • 59. McLaren W., Gil L., Hunt S. E., et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (June 2016).
  • 60. Karczewski K. J., Francioli L. C., Tiao G., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (May 2020).
  • 61. Ioannidis N. M., Rothstein J. H., Pejaver V., et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 99, 877–885 (Oct. 2016).
  • 62. Rentzsch P., Witten D., Cooper G. M., et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (Jan. 2019).
  • 63. Zhou W., Nielsen J. B., Fritsche L. G., et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (Sept. 2018).
  • 64. Privé F., Aschard H., Carmi S., et al. High-resolution portability of 245 polygenic scores when derived and applied in the same cohort. medRxiv (2021).
  • 65. International HapMap 3 Consortium, Altshuler D. M., Gibbs R. A., et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (Sept. 2010).
  • 66. Bulik-Sullivan B. K., Loh P.-R., Finucane H. K., et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (Mar. 2015).
  • 67. Finucane H. K., Bulik-Sullivan B., Gusev A., et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (Nov. 2015).
  • 68. Privé F., Aschard H., Ziyatdinov A., et al. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics 34, 2781–2787 (Aug. 2018).
  • 69. Georgi B., Voight B. F. & Bućan M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genetics 9, e1003484 (May 2013).
  • 70. Vinceti A., Karakoc E., Pacini C., et al. CoRe: a robustly benchmarked R package for identifying core-fitness genes in genome-wide pooled CRISPR-Cas9 screens. BMC Genomics 22, 828 (Nov. 2021).
  • 71. Hart T., Chandrashekhar M., Aregger M., et al. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 163, 1515–1526 (Dec. 2015).
  • 72. Hart T., Tong A. H. Y., Chan K., et al. Evaluation and Design of Genome-Wide CRISPR/SpCas9 Knockout Screens. G3 Genes|Genomes|Genetics 7, 2719–2727 (Aug. 2017).
  • 73. Pirinen M. GWAS 3: Statistical power (Feb. 2023).
  • 74. Kaplan E. L. & Meier P. Nonparametric Estimation from Incomplete Observations. J. Am. Stat. Assoc. 53, 457 (June 1958).
  • 75. Therneau T. M. A Package for Survival Analysis in R. R package version 3.2-3 (2020).
  • 76. Kassambara A., Kosinski M. & Biecek P. survminer: Drawing Survival Curves using ‘ggplot2’. R package version 0.4.9 (2021).


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
