Inferring compound heterozygosity from large-scale exome sequencing data

Michael H Guo; Laurent C Francioli; Sarah L Stenton; Julia K Goodrich; Nicholas A Watts; Moriel Singer-Berk; Emily Groopman; Philip W Darnowsky; Matthew Solomonson; Samantha Baxter; gnomAD Project Consortium; Grace Tiao; Benjamin M Neale; Joel N Hirschhorn; Heidi L Rehm; Mark J Daly; Anne O’Donnell-Luria; Konrad J Karczewski; Daniel G MacArthur; Kaitlin E Samocha

doi:10.1038/s41588-023-01608-3

Author manuscript; available in PMC: 2024 Jul 1.

Published in final edited form as: Nat Genet. 2023 Dec 6;56(1):152–161. doi: 10.1038/s41588-023-01608-3

Michael H Guo ¹,²,*, Laurent C Francioli ²,³,*, Sarah L Stenton ²,³,⁴, Julia K Goodrich ²,³, Nicholas A Watts ²,³, Moriel Singer-Berk ², Emily Groopman ²,⁴, Philip W Darnowsky ², Matthew Solomonson ²,³, Samantha Baxter ²; gnomAD Project Consortium †, Grace Tiao ²,³, Benjamin M Neale ²,³,⁵,⁶, Joel N Hirschhorn ²,⁷,⁸,⁹, Heidi L Rehm ²,³,¹⁰, Mark J Daly ²,³,¹¹, Anne O’Donnell-Luria ²,³,⁴,¹⁰, Konrad J Karczewski ²,³,⁶,¹⁰, Daniel G MacArthur ²,³,¹²,¹³, Kaitlin E Samocha ²,³,¹⁰,‡

¹ Department of Neurology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA

² Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

³ Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA

⁴ Division of Genetics and Genomics, Boston Children’s Hospital, Boston, MA, USA

⁵ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁶ The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA

⁷ Departments of Genetics and Pediatrics, Harvard Medical School, Boston, MA, USA

⁸ Division of Endocrinology, Boston Children’s Hospital, Boston, MA, USA

⁹ Center for Basic and Translational Obesity Research, Boston Children’s Hospital, Boston, MA, USA

¹⁰ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA

¹¹ Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland

¹² Centre for Population Genomics, Garvan Institute of Medical Research and UNSW Sydney, Sydney, Australia

¹³ Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, Australia

These authors contributed equally: Michael H. Guo and Laurent C. Francioli

† A list of authors and their affiliations appears at the end of the paper

Author contributions

M.H.G., L.C.F., S.L.S., J.K.G., A.O.D.L., K.J.K., D.G.M., and K.E.S. conceived and designed experiments. M.H.G., L.C.F., S.L.S., and J.K.G. performed the analyses. N.A.W., P.W.D., and M.S. developed visualizations for the web browser. E.G. and M.S-B. performed variant curation. G.T., B.M.N., J.N.H., H.L.R., M.J.D., A.O.D.L., and K.J.K. provided data and analysis suggestions. J.N.H., D.G.M., and K.E.S. supervised the work. M.H.G., L.C.F., S.L.S., J.K.G., and K.E.S. completed the primary writing of the manuscript with input and approval of the final version from all other authors.

‡ Correspondence should be addressed to K.E.S. (samocha@broadinstitute.org)

PMCID: PMC10872287 NIHMSID: NIHMS1958298 PMID: 38057443

Abstract

Recessive diseases arise when both copies of a gene are impacted by a damaging genetic variant. When a patient carries two potentially causal variants in a gene, accurate diagnosis requires determining that these variants occur on different copies of the chromosome (i.e., are in trans) rather than on the same copy (i.e., in cis). However, current approaches for determining phase, beyond parental testing, are limited in clinical settings. Here, we developed a strategy for inferring phase for rare variant pairs within genes, leveraging genotypes observed in the Genome Aggregation Database (gnomAD v2, n=125,748 exomes). Our approach estimates phase with 96% accuracy, both in trio data and in patients with Mendelian conditions and presumed causal compound heterozygous variants. We provide a public resource of phasing estimates for coding variants and counts per gene of rare variants in trans that can aid interpretation of rare co-occurring variants in the context of recessive disease.

Determination of phase has important implications in clinical genetics in the diagnosis of recessive diseases that result from disruption of both copies of a gene, either by homozygous variants or compound heterozygous variants. Compound heterozygous variants present a challenge because two variants observed within a gene can occur in trans or in cis, and only the former results in compound heterozygosity. Currently, phasing in clinical settings is performed using parental data, which is expensive and not always available. Thus, there is an important need for other approaches to determine phase of variants accurately, easily, and cheaply.

There are several approaches for directly inferring phase for variant pairs observed in an individual. Phase may be determined directly using data from sequencing reads. However, for typical short-read sequencing technologies, read-based phasing methods are generally only possible for variants in close proximity to each other¹, although sophisticated algorithms can phase some variant pairs at slightly longer distances²⁻⁴. Long-read sequencing technologies that would allow for direct phasing are more expensive and have not yet been widely applied in clinical settings⁵,⁶, while laboratory-based molecular methods for determining phase of variant pairs are low-throughput and technically challenging⁷. Phase can also be determined based on transmission of variants from parents to offspring, but this approach increases cost, and parental DNA is often unavailable or not feasible to obtain. Thus, these direct phasing approaches all present critical limitations for determining phase of variant pairs within an individual in a clinical setting.

In contrast, indirect approaches for phasing rely on statistical methods applied to population data by identifying shared haplotypes among individuals in a population⁸⁻¹¹. However, these methods (reviewed in Tewhey et al.¹² and Browning and Browning¹³) require large numbers of reference samples (typically n >10⁵ individuals), are computationally intensive, and perform less well for rare variants. Furthermore, these approaches cannot be readily applied to exome sequencing data, which does not provide enough density of surrounding variants. Despite these limitations, these population-based approaches are attractive because they do not require sequencing of additional family members or application of expensive sequencing approaches.

We sought to address existing challenges of phasing in clinical settings, particularly for rare variants observed in exome sequencing data. We leverage the Genome Aggregation Database (gnomAD), which performed aggregation and joint genotyping of exome sequencing data from 125,748 individuals¹⁴. We use the haplotype patterns observed in gnomAD to generate a resource for phasing rare coding variants observed in an individual and identify factors that influence the accuracy of our approach. Additionally, to contextualize the background rate of biallelic rare variants when they are observed in individuals with rare diseases, we provide statistics for how often variant pairs are observed in trans within gnomAD, stratified by allele frequency (AF) and mutational consequence. Finally, we disseminate these resources in a user-friendly fashion via the gnomAD browser for community use.

Results

Inference of phase in gnomAD

We sought to address the challenges of phasing variants observed in individuals in clinical settings by applying the principle that haplotypes are usually shared across individuals in a population (Fig. 1a). If two variants are in trans in many individuals in a population, then they are likely to be in trans in any given individual’s genome, and vice versa. The presence of a variant pair in trans in the population also indicates that the variant combination may be tolerated in trans. We reasoned that by generating phasing estimates from a large reference population, we could infer the phase of variants observed in an individual.

Fig. 1: — a, Schematic of phasing approach. b, Histogram of $P_{trans}$ scores for variant pairs in *cis* (top, blue) and in *trans* (bottom, red). c, Proportion of variant pairs in each $P_{trans}$ bin that are in *trans*. Each point represents variant pairs with $P_{trans}$ bin size of 0.01. Blue dashed line at 10% indicates the $P_{trans}$ threshold at which ≥ 90% of variant pairs in bin are on the same haplotype ($P_{trans} \leq 0.02$). Red dashed line at 90% indicates the $P_{trans}$ threshold at which ≥ 90% of variant pairs in bin are on opposite haplotypes ($P_{trans} \geq 0.55$). Calculations are performed using variant pairs with population AF ≥ 1×10⁻⁴. d, Performance of $P_{trans}$ for distinguishing variant pairs in *cis* and *trans*. Accuracy is calculated as the proportion of variant pairs correctly phased (green bars) divided by the proportion of variant pairs phased using $P_{trans}$ (orange plus green bars). b–d, $P_{trans}$ scores are population-specific.

To predict phase, we need to first estimate the haplotype frequencies in the population for a given pair of variants. To estimate haplotype frequencies, we used exome sequencing samples from gnomAD v2, a large sequencing aggregation database with 125,748 samples after rigorous quality control (Methods)¹⁴. There are several key advantages of using gnomAD as a reference dataset for calculating haplotype frequencies. First, samples in gnomAD undergo uniform processing and variant-calling, mitigating the impact of technical artifacts. Second, gnomAD provides sufficient sample size to estimate haplotype frequencies below 1×10⁻⁵. Lastly, gnomAD offers significant diversity, allowing results of our study to be applied beyond samples with European ancestry.

We focus on pairs of rare exonic variants occurring in the same gene, which are of the greatest interest in the context of Mendelian conditions. We required both variants to have a global minor AF in gnomAD exomes <5% and to be coding, flanking intronic (from position −1 to −3 in acceptor sites, and +1 to +8 in donor sites) or in the 5’/3’ UTRs. Across 19,877 genes, there were 5,320,037,963 unique variant pairs. Of these, 11,786,014 variant pairs are carried by the same individual at least once in gnomAD; 105,322 of those consist of two singleton variants observed in the same individual, for which we are unable to make a phase prediction. We performed estimates based on all exome sequencing samples in gnomAD v2, as well as separate estimates within each genetic ancestry group (African/African American [AFR]: n=8128; Admixed American [AMR]: 17296; Ashkenazi Jewish [ASJ]: 5040; East Asian [EAS]: 9179; Finnish [FIN]: 10824; non-Finnish European [NFE]: 56885; Other: 3070; South Asian [SAS]: 15308).

For each pair of variants, we first generated pairwise genotype counts in gnomAD, with nine possible pairwise genotypes for each pair of variants (Fig. 1a). We then applied the Expectation-Maximization (EM) algorithm to each pair of variants to generate haplotype frequency estimates based on the observed pairwise genotype counts¹⁵. For a given pair of variants observed in an individual, the probability of two variants being in trans ($P_{trans}$) is the probability of inheriting each of the haplotypes that contain only one of the two variant alleles.
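The calculation described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the uniform starting frequencies, and the convergence rule are assumptions made for illustration. It also assumes that $P_{trans}$ for a doubly heterozygous individual is the trans haplotype product divided by the sum of the trans and cis products, which is one natural reading of the definition. The input is the 3×3 table of pairwise genotype counts; only the double-heterozygote cell is ambiguous with respect to phase.

```python
def em_haplotype_freqs(counts, max_iter=1000, tol=1e-12):
    """Estimate two-variant haplotype frequencies by EM (illustrative sketch).

    counts[i][j] = number of individuals carrying i alternate alleles at
    variant 1 and j alternate alleles at variant 2 (a 3x3 table).
    Haplotype keys: "AB" (ref, ref), "Ab" (ref, alt), "aB" (alt, ref),
    "ab" (alt, alt).
    """
    n = counts
    total = 2 * sum(sum(row) for row in n)  # total number of haplotypes
    if total == 0:
        raise ValueError("genotype table is empty")

    # Haplotype counts that are fixed by the eight unambiguous genotypes.
    fixed = {
        "AB": 2 * n[0][0] + n[0][1] + n[1][0],
        "Ab": 2 * n[0][2] + n[0][1] + n[1][2],
        "aB": 2 * n[2][0] + n[1][0] + n[2][1],
        "ab": 2 * n[2][2] + n[1][2] + n[2][1],
    }
    double_het = n[1][1]  # phase-ambiguous individuals

    freqs = {h: 0.25 for h in fixed}  # assumed starting point
    for _ in range(max_iter):
        # E-step: split double heterozygotes into cis (AB/ab) and trans (Ab/aB).
        cis_w = freqs["AB"] * freqs["ab"]
        trans_w = freqs["Ab"] * freqs["aB"]
        denom = cis_w + trans_w
        frac_cis = cis_w / denom if denom > 0 else 0.5
        # M-step: re-estimate frequencies from the expected haplotype counts.
        new = {
            "AB": (fixed["AB"] + double_het * frac_cis) / total,
            "ab": (fixed["ab"] + double_het * frac_cis) / total,
            "Ab": (fixed["Ab"] + double_het * (1 - frac_cis)) / total,
            "aB": (fixed["aB"] + double_het * (1 - frac_cis)) / total,
        }
        delta = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if delta < tol:
            break
    return freqs


def p_trans(freqs):
    """Probability that a double heterozygote carries the variants in trans."""
    cis_w = freqs["AB"] * freqs["ab"]
    trans_w = freqs["Ab"] * freqs["aB"]
    denom = cis_w + trans_w
    return trans_w / denom if denom > 0 else None
```

In this sketch, a table in which the two variants are only ever seen together (for example, 990 non-carriers and 10 double heterozygotes) yields a $P_{trans}$ close to 0, whereas two variants that are each seen in carriers but never together yield a $P_{trans}$ of 1.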

Validation of phasing estimates using trio data

To measure the accuracy of our approach, we analyzed variants in a set of 4,992 trios that underwent exome sequencing and joint processing with gnomAD. In this trio structure, we could use parental transmission as a gold standard for phase and could compare with phase as predicted using the EM algorithm in gnomAD samples. We first estimated the genetic ancestry of each individual in the trios by projecting on the principal components of ancestry in the gnomAD v2 samples (Supplementary Fig. 1). Of the 4,992 children from the trios, 4,775 were assigned to one of seven genetic ancestry groups (AFR: 73; AMR: 358; ASJ: 62; EAS: 1252; FIN: 149; NFE: 2815; SAS: 46). We removed any samples in our trio dataset that did not fall into one of the seven aforementioned genetic ancestry groups. We used our approach leveraging gnomAD data to estimate phase for every pair of rare (global AF < 5% and population AF < 5%) coding and flanking intronic/UTR variants within genes observed in either of the parents in the trios. Across the 4,775 trio samples, we identified 339,857 unique variant pairs and 1,115,347 total variant pairs (mean 241.7 variant pairs per trio sample) (Supplementary Fig. 2a). On average, each trio sample had 64.4 variant pairs where both variants were missense, inframe insertions/deletions (indels), or predicted loss-of-function (pLoF), and 0.35 pLoF/pLoF variant pairs (Supplementary Fig. 2b–c). Nearly all of the variants identified in the trios were single nucleotide variants, with only 2.7% being short indels (functional consequences depicted in Supplementary Fig. 3a).

The majority (91.1%) of unique variant pairs in the trio samples were observed in gnomAD at least once and thus amenable to our phasing approach (Fig. 1d). By contrast, only 2.1% of variant pairs in these samples were within 10 bp of each other, the range in which we previously found read-back phasing of the physical read data to be most effective¹ (Supplementary Fig. 3b), and only 8.2% of variant pairs were within 150 bp, the typical length of an Illumina exome sequencing read. Thus, our approach can phase a much larger fraction of variant pairs than physical read-back phasing.

For each variant pair, we calculated the probability of being in trans ($P_{trans}$) based on the haplotype frequencies estimated using the EM algorithm applied to gnomAD as described above. We found a bimodal distribution of $P_{trans}$ scores: the majority of probabilities were either very high (> 0.99; suggesting a high likelihood of being in trans), or they were very low (< 0.01; suggesting a high likelihood of being in cis) (Fig. 1b, Supplementary Fig. 4a–g). Using trio phasing-by-transmission as a gold standard, we generated receiver operating characteristic curves for distinguishing whether a variant pair is likely in trans and found high sensitivity and specificity (area under curve [AUC] ranging from 0.892 to 0.997 across the component genetic ancestry groups) (Supplementary Fig. 5a) and high precision and recall (Supplementary Fig. 5b).
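As a reminder of what the AUC summarizes, the sketch below computes it directly as the probability that a randomly chosen trans pair receives a higher $P_{trans}$ than a randomly chosen cis pair, with ties counted as one half. The function name and the toy scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed; it is not the procedure used to produce the reported values.

```python
def roc_auc(scores_trans, scores_cis):
    """AUC as the probability that a trans pair outscores a cis pair."""
    if not scores_trans or not scores_cis:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for t in scores_trans:
        for c in scores_cis:
            if t > c:
                wins += 1.0
            elif t == c:
                wins += 0.5  # ties contribute half a win
    return wins / (len(scores_trans) * len(scores_cis))


# Toy example: truth labels would come from trio phasing-by-transmission.
auc = roc_auc([0.99, 0.90, 0.60], [0.01, 0.02, 0.70])
```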

We next defined $P_{trans}$ thresholds for classifying variant pairs as being in cis versus trans (see Methods). To set these thresholds, we binned variant pairs observed in the trio data based on their $P_{trans}$ scores calculated from gnomAD samples from the same genetic ancestry group. We used only variants on odd chromosomes (i.e., chromosomes 1, 3, 5, etc.) to determine $P_{trans}$ thresholds. For each $P_{trans}$ bin, we calculated the proportion of trio variant pairs that were in cis or trans based on phasing-by-transmission. The $P_{trans}$ threshold for variant pairs in trans was defined as the minimum $P_{trans}$ such that ≥90% of variant pairs in that bin were in trans based on trio phasing-by-transmission, with a similar approach used for the threshold for variants in cis. This resulted in $P_{trans}$ values of ≤ 0.02 and ≥ 0.55 as the threshold for variants in cis and trans, respectively (Fig. 1c).
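The binning procedure can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, and the sketch assumes that each threshold is found by walking inward from the extremes of the $P_{trans}$ range and stopping at the first populated bin that fails the 90% purity criterion.

```python
from collections import defaultdict


def trans_fraction_by_bin(pairs, n_bins=100):
    """Fraction of variant pairs in trans within each P_trans bin.

    pairs: iterable of (p_trans, is_trans), where is_trans is the phase
    determined by trio transmission. Bin width is 1 / n_bins (0.01 here).
    """
    totals = defaultdict(int)
    trans = defaultdict(int)
    for p, is_trans in pairs:
        b = min(int(p * n_bins), n_bins - 1)  # P_trans = 1.0 goes in the top bin
        totals[b] += 1
        trans[b] += bool(is_trans)
    return {b: trans[b] / totals[b] for b in totals}


def pick_thresholds(frac, n_bins=100, purity=0.9):
    """Return (cis_threshold, trans_threshold) from per-bin trans fractions."""
    cis_thr = trans_thr = None
    for b in range(n_bins):  # walk up from P_trans = 0
        if b not in frac:
            continue  # skip empty bins
        if 1 - frac[b] < purity:
            break
        cis_thr = (b + 1) / n_bins  # upper edge of the last cis-pure bin
    for b in reversed(range(n_bins)):  # walk down from P_trans = 1
        if b not in frac:
            continue
        if frac[b] < purity:
            break
        trans_thr = b / n_bins  # lower edge of the last trans-pure bin
    return cis_thr, trans_thr
```

Restricting the input pairs to odd chromosomes, as in the text, keeps threshold selection separate from the pairs later used to measure accuracy.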

We assessed how well our $P_{trans}$ thresholds performed by measuring phasing accuracy using the phasing estimates generated by the EM algorithm applied to gnomAD against trio phasing-by-transmission. For measuring accuracy, we used only variant pairs observed on even chromosomes (i.e., chromosomes 2, 4, 6, etc.). Of the 91.1% of unique variant pairs that were amenable to phasing using the EM algorithm in gnomAD, only a minority (8.9%) had an intermediate $P_{trans}$ score (i.e., $0.02 < P_{trans} < 0.55$) and therefore an indeterminate phase (Fig. 1d). We calculated accuracy as the percentage of phaseable variant pairs (i.e., both variants present in gnomAD and $P_{trans}$ score ≤ 0.02 or ≥ 0.55) that were correctly phased. Based on these $P_{trans}$ thresholds, the overall phasing accuracy was 95.8%. The accuracy for unique variant pairs that are in cis based on trio data was 91.7%, and the accuracy for variant pairs in trans was 99.7%. Further exploration of the limitations of this approach, including how sample size impacts the number of variant pairs that can be phased, is detailed in the Supplementary Note and Supplementary Figure 6.
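The accuracy definition used here (correct calls divided by all pairs that received a cis or trans call) can be made concrete with a short sketch. The function names and the toy data are hypothetical; the two cutoffs are the ones reported in the text.

```python
def classify_phase(p_trans, cis_max=0.02, trans_min=0.55):
    """Map a P_trans score to a phase call using the reported cutoffs."""
    if p_trans <= cis_max:
        return "cis"
    if p_trans >= trans_min:
        return "trans"
    return "indeterminate"


def phasing_accuracy(pairs):
    """pairs: iterable of (p_trans, true_phase), true_phase in {"cis", "trans"}.

    Returns (accuracy among phased pairs, fraction of pairs phased).
    """
    calls = [(classify_phase(p), truth) for p, truth in pairs]
    phased = [(call, truth) for call, truth in calls if call != "indeterminate"]
    if not phased:
        return None, 0.0
    correct = sum(call == truth for call, truth in phased)
    return correct / len(phased), len(phased) / len(calls)


# Toy example: truth labels would come from trio phasing-by-transmission.
toy = [(0.001, "cis"), (0.99, "trans"), (0.30, "trans"),
       (0.01, "trans"), (0.70, "trans")]
accuracy, fraction_phased = phasing_accuracy(toy)
```

Note that indeterminate pairs lower the fraction phased but do not count against accuracy under this definition.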

We also calculated the overall percentage of variant pairs correctly phased in a given individual (i.e., variant pairs are counted more than once if seen multiple times in the trio data). Per individual, 96.9% of variant pairs had both variants present in gnomAD and therefore were amenable to phasing, and 92.3% of variant pairs observed in a given individual were correctly phased using our approach. For rarer variant pairs (AF < 0.1%), 80.1% of variant pairs per individual were correctly phased. Together, these results suggest that our approach can generate highly accurate phasing estimates.

Accuracy of phasing across allele frequencies

Since rare variants are most likely to be of interest in clinical genetics, we assessed the accuracy of phasing at different AF bins. We found high accuracy (i.e., proportion correct classifications) ranging from 0.779 to 0.988 across pairs of AF bins (Fig. 2). Accuracy remained high across allele frequencies for variant pairs in trans. For variant pairs in cis based on trio phasing data, accuracy was high when both variants in the pair were more common (AF ≥ 0.001). However, accuracy was much lower for rare variants in cis (AF < 1×10⁻⁴), particularly when one variant in the pair is rare and the other is more common (Fig. 2c). Variant pairs where both variants are singletons (i.e., observed once in gnomAD) were phased fairly accurately for variants in trans based on the trio phasing data (accuracy of 0.993). Given the lack of information, we do not report the phasing estimates for singleton/singleton variant pairs in cis (see Supplementary Note).

Fig. 2: — Phasing accuracy at different AF bins for all variant pairs (a), variant pairs in *trans* (b), and variant pairs in *cis* (c). Shading of squares and numbers in each square represent the phasing accuracy. Y-axis labels refer to the more frequent variant in each variant pair and X-axis labels refer to the rarer variant in each variant pair. Accuracy is the proportion of correct classifications (i.e., correct classifications / all classifications) and is calculated for all unique variant pairs seen in the trio data across all genetic ancestry groups using population-specific $P_{trans}$ calculations.

Accuracy of phasing across genetic ancestry groups

In the above analyses, we used $P_{trans}$ estimates calculated from samples in gnomAD with the same genetic ancestry group (“population”) in which the variant pair was seen in the trio data. We next asked if using all samples in gnomAD to calculate $P_{trans}$ (“cosmopolitan”) would improve accuracy given larger sample sizes from which to calculate $P_{trans}$ (Supplementary Fig. 7), with the caveat that using the full set of gnomAD samples would result in some genetic ancestry mismatching. We found that accuracy was generally similar when using population-specific estimates as compared to cosmopolitan estimates (Fig. 3a–b). However, for AFR and EAS, accuracy was slightly lower when using cosmopolitan estimates as compared to population-specific estimates, specifically for variants in trans in these populations. For example, the phasing accuracy for variants in trans in the AFR ancestry group was 0.995 when using AFR-specific $P_{trans}$ estimates, but dropped to 0.952 when using cosmopolitan $P_{trans}$ estimates. These results suggest that cosmopolitan estimates allow a greater proportion of variants to be phased with generally similar accuracy to population-specific estimates, though we do identify certain scenarios where more caution is required.

Fig. 3: — Population-specific $P_{trans}$ estimates are shown in light blue and cosmopolitan $P_{trans}$ estimates are shown in medium blue. Accuracies are shown separately for variants in *trans* (a, left) and variants in *cis* (b, right).

Effect of distance and mutation rate on phasing accuracy

Recombination events, which disrupt the haplotype configuration of variant pairs, should influence phasing accuracy. To explore the impact of recombination, we plotted the accuracy of our phasing estimates as a function of physical distance between variant pairs. For variant pairs in trans, phasing accuracy was maintained across physical distances. However, for variant pairs in cis, accuracy rapidly decreased with longer physical distances (Fig. 4a). Since physical distance is only a proxy for recombination frequency, we also performed this analysis using interpolated genetic distances (Fig. 4b). We found again that variants in trans had preserved phasing accuracy across genetic distances, while variants in cis had phasing accuracy that decreased substantially with genetic distance, particularly at distances greater than 0.1 centiMorgan. We also tested the effect of recombination using a set of multinucleotide variants, which are variant pairs in cis and very close in physical distance (see Supplementary Note).
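Interpolated genetic distances of this kind are commonly obtained by linear interpolation on a genetic map. The sketch below shows that idea under stated assumptions: the function names and the toy map are hypothetical, the map is given as sorted base-pair positions with matching cumulative centiMorgan values, and positions outside the map are clamped to its ends. The text does not specify the interpolation scheme actually used.

```python
from bisect import bisect_right


def interpolate_cm(pos, map_bp, map_cm):
    """Linearly interpolate the genetic position (cM) at base-pair position pos.

    map_bp: sorted base-pair positions of the map markers.
    map_cm: cumulative genetic positions (cM) at those markers.
    """
    if pos <= map_bp[0]:
        return map_cm[0]
    if pos >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, pos)  # first marker strictly to the right of pos
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def genetic_distance(pos1, pos2, map_bp, map_cm):
    """Genetic distance (cM) between two positions on the same chromosome."""
    return abs(interpolate_cm(pos2, map_bp, map_cm)
               - interpolate_cm(pos1, map_bp, map_cm))


# Toy map with three markers (hypothetical values).
toy_bp = [1000, 2000, 4000]
toy_cm = [0.0, 0.1, 0.5]
distance = genetic_distance(1500, 3000, toy_bp, toy_cm)
```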

Fig. 4: — a, Phasing accuracy (y-axis) as a function of physical distance (in base pairs on log₁₀ scale) between variants (x-axis). Blue represents variants on the same haplotype (in *cis*), and red represents variants on opposite haplotypes (in *trans*). b, Same as a, except the x-axis shows genetic distance (in centiMorgans). Accuracies for a and b are calculated based on unique variant pairs observed across all genetic ancestry groups using population-specific $P_{trans}$ estimates.

Recurrent germline mutations can also result in inaccurate phasing. Rates of recurrent mutations are dependent on mutation type (e.g., transition versus transversion) and epigenetic marks (particularly CpG methylation), among other factors¹⁶⁻²⁰. Notably, transitions have higher mutation rates than transversions¹⁸,²¹ and CpG transitions have the highest mutation rates, which increase with higher levels of methylation at the CpG¹⁴. To better understand the impact of mutation rates on phasing accuracy, we classified each single nucleotide variant in the trio data as a transversion, non-CpG transition, or CpG transition, with further subclassifications of these as having low, medium, or high DNA methylation as before¹⁴. We then calculated phasing accuracy as a function of combinations of mutation types using the trio data. We found similar accuracy for transversions and transitions (~0.97) (Supplementary Fig. 8a). However, mutation rates had a strong impact on accuracy for variant pairs in cis but not those in trans (Supplementary Fig. 8b–c). For variant pairs in cis, the phasing accuracies were lower at medium and high methylation CpG sites (0.82–0.89) than they were for low methylation sites (0.96). These results are consistent with recurrent mutations contributing to inaccurate phasing estimates, particularly for variant pairs in cis.
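The three-way classification of single nucleotide variants can be sketched as below. The function name and the use of a three-base reference context are assumptions for illustration, and the methylation subclassification is omitted because it requires external methylation data. A CpG transition is taken to be C>T where the next base is G, or the reverse-strand equivalent G>A where the previous base is C.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_snv(ref, alt, context):
    """Classify an SNV as transversion, non-CpG transition, or CpG transition.

    context: three-base reference sequence (5'->3') centered on the variant,
    so context[1] must equal ref.
    """
    ref, alt, context = ref.upper(), alt.upper(), context.upper()
    if len(context) != 3 or context[1] != ref:
        raise ValueError("context must be 3 bases centered on the reference allele")
    if (ref, alt) not in TRANSITIONS:
        return "transversion"
    # C>T in a CG dinucleotide, or G>A when the C of the CpG lies on the other strand.
    is_cpg = (ref == "C" and alt == "T" and context[2] == "G") or (
        ref == "G" and alt == "A" and context[0] == "C"
    )
    return "CpG transition" if is_cpg else "non-CpG transition"
```

For example, a C>T change in the context ACG is a CpG transition, whereas a C>G change in the same context is a transversion.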

Accuracy in a cohort of patients with Mendelian disorders

To demonstrate our approach in a clinically relevant scenario, we turned to a set of 627 patients from the Broad Institute Center for Mendelian Genetics (CMG)²². All patients had a confident or strong candidate genetic diagnosis of a Mendelian condition based on carrying two rare variants in a recessive disease gene consistent with the patient’s phenotype. For 293 of the 627 patients, both variants were present in gnomAD and thus amenable to phasing (Supplementary Table 1). For these 293 variant pairs, we used population-specific $P_{trans}$ estimates when available (n=215), and cosmopolitan $P_{trans}$ estimates for the remaining 78 variant pairs. Our phasing approach predicted 281 (95.9%) variant pairs to be in trans, seven variant pairs (2.4%) to be in cis, and five (1.7%) as indeterminate ($0.02 < P_{trans} < 0.55$ or singleton/singleton variant in the same individual). Had only cosmopolitan $P_{trans}$ estimates been used, one of the 281 in trans predictions would have been predicted in cis and one indeterminate. Of the seven variant pairs predicted to be in cis, six were from patients with proband-only sequencing. For these patients, the responsible clinician was contacted to ensure phenotype overlap with the disease gene and to pursue parental Sanger sequencing for confirmatory phasing by transmission or long-read sequencing, where possible. The remaining variant pair predicted to be in cis originated from a patient with parental data confirming trans phase, showing our inferred phase to be incorrect (Supplementary Table 1). Overall, the results suggest that our phasing approach is highly accurate in clinical scenarios in patients with suspected Mendelian conditions and can be applied to a large fraction (~50% in our cohort) of candidate diagnoses.

Bi-allelic predicted damaging variants

We tabulated for each gene the number of individuals in gnomAD who carry two rare heterozygous variants, stratified by the predicted phase using $P_{trans}$ cutoffs (i.e., in trans, unphased [intermediate $P_{trans}$], and in cis), AF, and the predicted functional consequence of the least damaging variant in the pair. For comparison, we tabulated individuals with homozygous variants in the same manner. We classified predicted functional consequences as pLoF, missense with deleteriousness scored by REVEL²³ in line with recent ClinGen recommendations²⁴, or synonymous.
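The stratification by the least damaging variant in a pair can be sketched as follows. The severity ranking, category labels, record layout, and gene name are hypothetical simplifications: REVEL-based missense tiers and AF strata are omitted, and the sketch assumes one record per individual and gene, so that counting records is the same as counting individuals.

```python
from collections import Counter

# Hypothetical ranking, from least to most damaging.
SEVERITY = {"synonymous": 0, "missense": 1, "pLoF": 2}


def pair_consequence(csq1, csq2):
    """Consequence assigned to a pair: that of its least damaging variant."""
    return min(csq1, csq2, key=SEVERITY.__getitem__)


def tabulate(records):
    """Count records per (gene, predicted phase, pair consequence).

    records: iterable of (gene, phase, csq1, csq2), with phase in
    {"trans", "unphased", "cis"}.
    """
    table = Counter()
    for gene, phase, csq1, csq2 in records:
        table[(gene, phase, pair_consequence(csq1, csq2))] += 1
    return table


# Toy records for a placeholder gene.
toy_records = [
    ("GENE1", "trans", "pLoF", "missense"),
    ("GENE1", "trans", "pLoF", "pLoF"),
    ("GENE1", "cis", "synonymous", "pLoF"),
]
toy_table = tabulate(toy_records)
```

Under this rule, a pLoF/missense pair is counted as missense, so only pairs in which both variants are pLoF contribute to the pLoF stratum.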

Overall, the number of individuals with rare, compound heterozygous (in trans), predicted damaging variants was low (median 0 individuals per gene with compound heterozygous loss-of-function variants at ≤ 1% AF, range 0–9) and only occurred in a small number of genes (Fig. 5 and Supplementary Fig. 9). At the ≤ 1% AF cutoff, 28 genes carried compound heterozygous pLoF variants (in 56 individuals) and an additional four genes carried compound heterozygous variants with at least a strong REVEL missense predicted consequence (in six individuals). The vast majority of these genes have not, to date, been associated with disease (Fig. 5b). Manual curation of the pLoF variants resulted in seven high confidence “human knock-out” genes (ARHGEF37, CCDC66, FAM81B, FYB2, GNLY, RBKS, and SDSL). These genes are neither associated with Mendelian disease nor known to be essential (see Methods). In the remaining 21 of the 28 genes with compound heterozygous pLoF variants, true loss-of-function was found to be uncertain or unlikely following manual curation, due, for example, to the variant falling in the last exon of the gene, in a weakly conserved exon, or in a minority of isoforms (Supplementary Table 2).

Fig. 5: — a, Proportion of genes with one or more individuals in gnomAD carrying predicted compound heterozygous (in *trans*) variants or a homozygous variant at ≤ 1% and ≤ 5% AF stratified by predicted functional consequence. b, Number of genes with ≥ 1 individual in gnomAD carrying compound heterozygous (in *trans*) or homozygous predicted damaging variants at ≤ 1% AF, stratified by predicted functional consequence and Mendelian disease-association in the Online Mendelian Inheritance in Man database. In total, 28 genes (25 non-disease, 2 autosomal dominant, and 1 autosomal recessive) carried predicted compound heterozygous loss-of-function variants at ≤ 1% AF, only seven of which were high confidence “human knock-out” events following manual curation. For predicted compound heterozygous variants, both variants in the variant pair must be annotated with a consequence at least as severe as the consequence listed (i.e., a compound heterozygous loss-of-function variant would be counted under the pLoF category but also included with a less deleterious variant under the other categories). All homozygous pLoF variants previously underwent manual curation as part of Karczewski et al¹⁴.

Generation of public resource

To make our resource widely usable to clinicians and researchers, we have calculated and released pairwise genotype counts and phasing estimates for each pair of rare coding variants occurring in the same gene for gnomAD. We have included all variant pairs within a gene where both variants have global minor AF in gnomAD exomes < 5%, and are either coding, flanking intronic (from position −1 to −3 in acceptor sites, and +1 to +8 in donor sites) or in the 5’/3’ UTRs. We have integrated these data into the gnomAD browser so that users can easily look up a pair of variants to obtain the genotype counts, haplotype frequency estimates, $P_{trans}$ estimates, and likely co-occurrence pattern (Extended Data Fig. 1a). These results are shown for each individual genetic ancestry group and across all genetic ancestries in gnomAD v2. In addition, the data are available as a downloadable table for all variant pairs that were seen in at least one individual.

Furthermore, we have incorporated count tables detailing the number of individuals carrying two rare variants, stratified by AF and functional consequence. The first table counts individuals carrying two rare heterozygous variants by predicted phase (in trans, unphased, and in cis) and the second table counts individuals carrying homozygous variants (Extended Data Fig. 1b). We envision that these data will aid the medical genetics community in interpreting the clinical significance of co-occurring variants in the context of recessive conditions. The data for all genes are also available as a downloadable table within gnomAD v2.

Discussion

In this work, we leveraged a large exome sequencing cohort to estimate haplotype frequencies for pairs of rare variants within genes and show that these haplotype frequency estimates can be utilized to predict phase of pairs of variants. We achieve high accuracy across a range of allele frequencies and across genetic ancestries and demonstrate that our approach is able to distinguish variants that are likely compound heterozygous in a clinical setting. We freely disseminate our results in an easy-to-use browser for the community.

Our work focuses on the challenging, yet common, scenario of determining phase for rare variants identified in exome sequencing of rare disease patients. While this scenario is common in medical genetics, other phasing approaches such as phasing-by-transmission or population-based phasing are challenging to apply. Our approach of using estimated haplotype frequencies from gnomAD to predict phase of variant pairs was generally accurate across a range of AFs (even for singleton variants) and across genetic ancestries. Most notably, 96.9% of rare (AF < 5%) variant pairs in a given individual had both variants present in gnomAD and therefore were amenable to phasing using our approach, which is much higher than the proportion amenable to phasing using physical read data. Overall, 92.3% of variant pairs observed per individual were correctly phased using our approach. We did find that our approach was less accurate for rare variant pairs in cis (see Supplementary Note). We also found that using “cosmopolitan” phasing estimates that leverage more samples in gnomAD generally had similar accuracy to using population-specific estimates, except for individuals of EAS and AFR genetic ancestry (see Supplementary Note). Thus, our approach can be applied to nearly all rare variant pairs and can generate accurate phasing estimates for variants of medical importance in rare recessive genetic diseases.

We utilized the EM algorithm to phase pairs of variants instead of more sophisticated population-based phasing approaches for several reasons^8–11. First, exome and targeted gene panel sequencing data are sparse, precluding the use of common non-coding variants as a “scaffold” for population-based phasing approaches. Recent work performed population-based phasing of rare variants from exome sequencing data by combining exome data with SNP genotyping arrays^11,25. However, SNP genotyping data are not usually generated in conjunction with a clinical sequencing test and were not readily available for much of gnomAD. Second, rare variants, which are of the greatest interest in Mendelian diseases, are challenging to phase using population-based approaches given the small numbers of shared haplotypes from which to derive phasing estimates in the population. Recent methods have shown accurate phasing of rare variants using genome sequencing data^10,11,26, but rely on a large genome reference panel. As the number of whole genome sequencing samples increases in future releases of gnomAD, this may represent a tractable and more accurate approach for phasing of rare variants. Exome sequencing and targeted gene sequencing remain commonly used in clinical settings, and thus we anticipate that our approach and the resources we have generated will remain useful. Third, we found that applying the EM algorithm to pairs of variants made it more intuitive to illustrate how phasing estimates were derived from genotype data, allowing users to more easily assess the reliability of phasing estimates. Together, the EM algorithm provided us with the unique ability to phase pairs of rare variants in exome data in an intuitive fashion.

We found that there are only a small number of “human knock-out” genes affected by predicted compound heterozygous (in trans) loss-of-function variants, and that this number is substantially lower than is observed for homozygous loss-of-function variants. These compound heterozygous “human knock-out” events occurred in genes that are not known to be essential, an expected finding given that gnomAD is largely depleted of individuals with severe and early-onset conditions. When analyzing the 23,672 individuals that carry two rare (AF ≤ 1%) pLoF variants, we predict that in 20,421 (86%) of those individuals the variant pair is in cis, and that in only ~0.2% is it confidently predicted to be in trans. This observation emphasizes that when a pair of rare pLoF variants is observed in the same gene in an individual from a general population sample, it is vastly more likely that these variants are carried on the same haplotype than that the individual is a genuine “knock-out” due to compound heterozygosity. We note, however, that our ability to identify rare variant pairs in trans in gnomAD v2 individuals is limited by the fact that the same dataset was used for training (see Supplementary Note). We have released counts of co-occurring variant pairs within genes as observed in gnomAD, which will aid with interpretation of the clinical significance of co-occurring variant pairs.

There are several other important limitations to our work. First, to limit computational burden, we only report phasing estimates for rare coding and flanking intronic/UTR variant pairs within genes. These are the variant pairs of most interest to the medical genetics community, though we acknowledge that phase of deeper intronic variation will become important as more genome sequencing is performed. Second, future studies would benefit from even larger sample sizes, especially for genetic ancestry groups not well represented in our present study. Finally, we have only tested our phasing accuracy in a clinical setting in a retrospective manner and future prospective studies will be needed to confirm the clinical utility of our approach.

Methods

Ethical compliance and informed consent statement

Our collaborators obtained informed consent for all participants in the Broad Institute Center for Mendelian Genetics (CMG), and individual-level data, including genomics and clinical data, were de-identified and coded prior to our analyses in this work. We have complied with all relevant ethical regulations. The Broad Institute of MIT and Harvard and the Mass General Brigham IRB approved this work.

gnomAD characteristics and data processing

In this work, we used exome sequencing data from the gnomAD v2.1 dataset (n = 125,748 individuals, GRCh37). These data were uniformly processed, underwent joint variant calling, and rigorous quality control, as described in Karczewski et al.¹⁴. Briefly, we aggregated ~200k exome and ~20k genome sequencing samples, primarily from case-control studies of common adult-onset conditions, and applied a BWA-Picard-GATK pipeline²⁷. Using Hail (https://github.com/hail-is/hail), we then removed samples that (1) failed population- and platform-specific quality control, (2) had second-degree or closer relations in the dataset, (3) did not have appropriate consent for release, and (4) had known severe, early-onset conditions. For variant quality control, we trained a random forest on site-level and genotype-level metrics (e.g., the quality by depth, QD), and demonstrated that it achieved both high precision and recall for both common and rare variants.

We subsetted the final cleaned gnomAD dataset to variants with global AF in gnomAD exomes < 5% that were either coding, flanking intronic (from position −1 to −3 in acceptor sites, and +1 to +8 in donor sites) or in the 5’/3’ UTRs. In total, this encompasses 5,320,037,963 unique variant pairs across 19,877 genes when removing singleton/singleton variant pairs observed in the same individual. We specifically extracted 20,921,100 pairs of variants, most of which were observed at least once in the same individual, to create a more manageable downloadable file.

We performed analysis in this manuscript using Hail version 0.2.105²⁸, and analysis code is available at https://github.com/broadinstitute/gnomad_chets.

Haplotype estimates

Consider two variants, A and B. A and B represent the major alleles, and a and b represent the respective minor alleles. There are thus 9 pairwise genotypes for A and B: AABB, AaBB, aaBB, AABb, AaBb, aaBb, AAbb, Aabb, and aabb. Of these pairwise genotypes, only the phase for the double heterozygote (AaBb) is unknown. From these 9 possible genotypes, there are four possible haplotype configurations: AB, Ab, aB, and ab.

For each pair of variants, we applied the expectation-maximization (EM) algorithm¹⁵ to estimate haplotype frequencies from genotype counts. We set the initial conditions of the EM algorithm by partitioning the doubly heterozygous (AaBb) genotype counts equally between the AB|ab and Ab|aB haplotype configurations. We ran the EM algorithm until convergence or until the absolute value of the difference between consecutive maximum likelihood function values was less than 1×10⁻⁷. We calculated haplotype frequencies based on genotypes present within the same genetic ancestry group (“population-specific”) or using all samples from gnomAD (“cosmopolitan”). We performed these analyses of haplotype frequency estimates using Hail.
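The EM procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Hail implementation used in the study: the function name `em_haplotype_freqs`, the 3×3 count layout, and the `max_iter` safeguard are hypothetical, and genotype probabilities are assumed to follow Hardy-Weinberg proportions given the haplotype frequencies.

```python
import math

def em_haplotype_freqs(n, tol=1e-7, max_iter=1000):
    """Estimate haplotype frequencies for two biallelic variants.

    n[i][j] is the number of individuals carrying i copies of minor
    allele a and j copies of minor allele b (hypothetical layout).
    """
    total_haps = 2 * sum(sum(row) for row in n)
    if total_haps == 0:
        raise ValueError("no genotypes supplied")

    # Haplotype counts contributed by the eight unambiguous genotypes.
    base = {
        "AB": 2 * n[0][0] + n[0][1] + n[1][0],
        "Ab": 2 * n[0][2] + n[0][1] + n[1][2],
        "aB": 2 * n[2][0] + n[1][0] + n[2][1],
        "ab": 2 * n[2][2] + n[2][1] + n[1][2],
    }
    n_dh = n[1][1]  # double heterozygotes (AaBb), phase unknown

    def freqs(frac_cis):
        # M-step: assign a fraction of AaBb individuals to AB|ab, the rest to Ab|aB.
        counts = {
            "AB": base["AB"] + n_dh * frac_cis,
            "ab": base["ab"] + n_dh * frac_cis,
            "Ab": base["Ab"] + n_dh * (1 - frac_cis),
            "aB": base["aB"] + n_dh * (1 - frac_cis),
        }
        return {h: c / total_haps for h, c in counts.items()}

    def log_lik(p):
        probs = [
            [p["AB"] ** 2, 2 * p["AB"] * p["Ab"], p["Ab"] ** 2],
            [2 * p["AB"] * p["aB"],
             2 * (p["AB"] * p["ab"] + p["Ab"] * p["aB"]),
             2 * p["Ab"] * p["ab"]],
            [p["aB"] ** 2, 2 * p["aB"] * p["ab"], p["ab"] ** 2],
        ]
        return sum(n[i][j] * math.log(probs[i][j])
                   for i in range(3) for j in range(3) if n[i][j] > 0)

    p = freqs(0.5)  # initial condition: split AaBb equally between configurations
    prev = log_lik(p)
    for _ in range(max_iter):
        cis, trans = p["AB"] * p["ab"], p["Ab"] * p["aB"]
        if n_dh == 0 or cis + trans == 0:
            break
        p = freqs(cis / (cis + trans))  # E-step feeds the M-step
        cur = log_lik(p)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return p
```

For example, with 90 AABB individuals and 10 AaBb individuals and no other genotypes, the sketch assigns essentially all minor alleles to the ab haplotype, because neither minor allele is ever seen without the other.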

We then calculate $P_{trans}$ as the likelihood that any given pair of doubly heterozygous variants (AaBb) in a patient is compound heterozygous (Ab|aB). $P_{trans}$ can be calculated simply from the haplotype frequency estimates (AB, Ab, aB, and ab):

$$P_{trans} = \frac{Ab \times aB}{AB \times ab + Ab \times aB}$$

Thus, $P_{trans}$ simply represents the probability that the patient is compound heterozygous by inheriting both the Ab and aB haplotypes.
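As a small worked illustration of this ratio, the following sketch computes the trans probability from four haplotype frequencies. The function name `p_trans` and the dictionary layout are hypothetical, and returning `None` when both products are zero is an assumption made here for pairs that carry no phase information.

```python
def p_trans(freq):
    """Probability that a double heterozygote carries Ab|aB rather than AB|ab.

    freq maps the haplotype labels "AB", "Ab", "aB", "ab" to estimated
    frequencies (hypothetical layout).
    """
    trans = freq["Ab"] * freq["aB"]
    cis = freq["AB"] * freq["ab"]
    if trans + cis == 0:
        return None  # assumption: no information to phase this pair
    return trans / (trans + cis)


# Minor alleles never seen together on one haplotype: fully trans.
example_trans = p_trans({"AB": 0.98, "Ab": 0.01, "aB": 0.01, "ab": 0.0})
# Minor alleles only ever seen together: fully cis.
example_cis = p_trans({"AB": 0.95, "Ab": 0.0, "aB": 0.0, "ab": 0.05})
```

When all four haplotypes have non-zero estimated frequency, the value falls strictly between 0 and 1 and reflects the relative weight of the two possible configurations.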

Determination of $P_{trans}$ cutoffs

To determine $P_{trans}$ cutoffs for classifying variants as occurring in cis or trans, we binned variant pairs on odd chromosomes (chromosomes 1, 3, 5, etc.) in $P_{trans}$ increments of 0.01. For each bin, we calculated the proportion of variant pairs in that bin that are in trans based on phasing by trio data. We determined the $P_{trans}$ threshold for variants in trans as the minimum $P_{trans}$ such that 90% of variants in the bin are in trans based on trio data. We determined the $P_{trans}$ threshold for variants in cis as the maximum $P_{trans}$ such that 90% of variants in the bin are in cis based on trio data. For these calculations, we used only variant pairs where both variants had a population AF ≥ 1×10⁻⁴. We used trio samples across all genetic ancestry groups and population-specific $P_{trans}$ values for determination of the $P_{trans}$ cutoffs.
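One way to read the binning procedure is sketched below. The function names, the input format (a list of trans-probability and trio-truth pairs), and the choice of bin edge are assumptions made for illustration: the trans cutoff is taken as the lower edge of the lowest bin in which at least 90% of pairs are in trans, and the cis cutoff as the upper edge of the highest bin in which at least 90% are in cis. The text does not state which edge was used.

```python
import math
from collections import defaultdict

def trans_fraction_by_bin(pairs, n_bins=100):
    """pairs: iterable of (p_trans, is_trans_by_trio). Returns {bin: fraction in trans}."""
    tallies = defaultdict(lambda: [0, 0])  # bin index -> [n_trans, n_total]
    for p, is_trans in pairs:
        # Small epsilon guards against floating-point values landing just under a bin edge.
        b = min(math.floor(p * n_bins + 1e-9), n_bins - 1)
        tallies[b][0] += int(is_trans)
        tallies[b][1] += 1
    return {b: t / n for b, (t, n) in tallies.items()}

def determine_cutoffs(pairs, purity=0.9, n_bins=100):
    """Return (cis_cutoff, trans_cutoff); either may be None if no bin qualifies."""
    frac = trans_fraction_by_bin(pairs, n_bins)
    trans_bins = [b for b, f in frac.items() if f >= purity]
    cis_bins = [b for b, f in frac.items() if 1 - f >= purity]
    trans_cut = min(trans_bins) / n_bins if trans_bins else None
    cis_cut = (max(cis_bins) + 1) / n_bins if cis_bins else None
    return cis_cut, trans_cut
```

Applied to toy data in which bins 0.00–0.02 are almost entirely cis and bins from 0.55 upward are almost entirely trans, this sketch returns cutoffs of 0.02 and 0.55.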

Trio validation data

For validation of our phasing approach, we utilized 4,992 parent-child trios that were jointly processed and variant-called with gnomAD. Having access to parental genotypes allows us to perform phase-by-transmission and accurately determine whether two co-occurring variants in the same gene are in cis or in trans.

First, we estimated genetic ancestry of each individual in the trios by using ancestry inference estimates from the full gnomAD dataset, as previously described¹⁴. Briefly, we selected bi-allelic variants that passed all hard filters, had allele frequencies in a joint exome and genome callset > 0.001, and high joint call rates (> 0.99). The variants were then LD-pruned (r² = 0.1) and used in a principal component analysis (PCA). We previously used samples with known genetic ancestry to train a random forest on the first 20 principal components (PCs) and assigned samples to a genetic ancestry group based on having a random forest probability > 0.9. For the trios in this cohort, we projected their PCs for genetic ancestry onto the same gnomAD v2 samples to infer the genetic ancestry used here (Supplementary Fig. 1). Of these 4,992 trios, 4,775 of the children from the trios were assigned to one of the seven genetic ancestry groups in this study based on PCA and were used in this study.

We then phased the trio data using the Hail phase_by_transmission (https://hail.is/docs/0.2/experimental/index.html#hail.experimental.phase_by_transmission) function, which uses Mendelian transmission of alleles to infer haplotypes in trios for all sites that are not heterozygous in all members of the trio. Assigning haplotypes in trios based on parental genotype has traditionally been the gold standard, has switch error rates below 0.1%, and, importantly, its errors are not dependent on the allele frequency of the variants phased²⁹. To maximize our confidence in the genotypes and phasing, we filtered genotypes to include only those with genotype quality (GQ) > 20, depth > 10 and allele balance > 0.2 for heterozygous genotypes prior to phasing. Sex chromosomes were excluded. In total, there were 339,857 unique variant pairs and 1,115,347 total variant pairs.
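The logic of phasing a variant pair by transmission can be illustrated with a simplified sketch. This is not Hail's implementation: the function names are hypothetical, genotypes are encoded as alternate-allele counts (0, 1, or 2), the child is assumed heterozygous at both sites, and de novo or Mendelian-inconsistent sites are simply treated as uninformative.

```python
def alt_origin(mother, father):
    """Parent of origin of the alternate allele in a heterozygous child.

    mother, father: alternate-allele counts (0, 1, or 2).
    Returns "M", "P", or None when transmission is ambiguous.
    """
    if mother == 0 and father == 0:
        return None  # apparent de novo: not phased in this sketch
    if father == 0 or (mother == 2 and father == 1):
        return "M"
    if mother == 0 or (father == 2 and mother == 1):
        return "P"
    return None  # both parents heterozygous, or Mendelian inconsistency

def phase_pair(site1, site2):
    """site1, site2: (mother, father) alternate-allele counts at each variant.

    Returns "cis" if both alternate alleles come from the same parent,
    "trans" if they come from different parents, else None.
    """
    o1, o2 = alt_origin(*site1), alt_origin(*site2)
    if o1 is None or o2 is None:
        return None
    return "cis" if o1 == o2 else "trans"
```

A pair in which one variant is carried only by the mother and the other only by the father is called in trans, whereas a pair carried by the same parent is called in cis.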

We compared trio phasing-by-transmission with phasing using our approach on even chromosomes (chromosomes 2, 4, 6, etc.). Of the 4,775 trio samples, 3,836 were in the full release of gnomAD and were removed from gnomAD for trio validation. This resulted in a set of 121,912 gnomAD samples from which we derived haplotype estimates. We then performed phasing using the EM algorithm and calculated $P_{trans}$ as above.

Based on the $P_{trans}$ estimates, we classified trio variant pairs into 1) unable to phase using our approach (either variant not seen in gnomAD, or singleton/singleton variant pairs observed in the same individual in gnomAD), 2) indeterminate phase (those with intermediate $0.02 < P_{trans} < 0.55$), 3) incorrectly phased, or 4) correctly phased. We calculated accuracy as the number of variant pairs correctly phased divided by the number of pairs correctly and incorrectly phased.
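The classification and accuracy definition above can be expressed compactly. In this sketch the function names and category labels are hypothetical, and a missing trans probability (`None`) stands in for pairs that cannot be phased; the cutoffs of 0.02 and 0.55 are the ones given in the text.

```python
def classify(p_trans, cis_max=0.02, trans_min=0.55):
    """Map a trans probability to a phase call using the cutoffs in the text."""
    if p_trans is None:
        return "unable_to_phase"
    if p_trans <= cis_max:
        return "cis"
    if p_trans >= trans_min:
        return "trans"
    return "indeterminate"

def phasing_accuracy(pairs):
    """pairs: iterable of (p_trans, trio_truth) with trio_truth in {"cis", "trans"}.

    Accuracy counts only pairs that receive a cis or trans call;
    unphaseable and indeterminate pairs are excluded from the denominator.
    """
    correct = incorrect = 0
    for p, truth in pairs:
        call = classify(p)
        if call in ("cis", "trans"):
            if call == truth:
                correct += 1
            else:
                incorrect += 1
    if correct + incorrect == 0:
        return None
    return correct / (correct + incorrect)
```

Note that under this definition accuracy can be high even when many pairs are indeterminate, so it is best read alongside the proportion of pairs that receive a call.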

CpG analysis

We divided single nucleotide variants seen in the trio data into transitions and transversions. Transitions were further subdivided into those that are CpG mutations (5’-CpG-3’ mutating to 5’-TpG-3’) and those that are not. For each CpG transition, we calculated the mean DNA methylation values across 37 tissues in the Roadmap Epigenomics Project³⁰ and then stratified CpG transitions into 3 levels: low (missing or < 0.2), medium (0.2–0.6), and high (> 0.6) methylation, as detailed in Karczewski et al.¹⁴. We calculated phasing accuracy as the number of variant pairs correctly phased divided by the number of pairs correctly and incorrectly phased. We calculated phasing accuracy for all pairwise combinations of transversions, non-CpG transitions, low methylation CpG transitions, medium methylation CpG transitions, and high methylation CpG transitions. We included all single nucleotide variants in the analysis and utilized population-specific EM estimates.
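The five mutation classes can be assigned with a short helper. This is an illustrative sketch: the function name, argument names, and class labels are hypothetical, the flanking bases are assumed to be given on the reference strand, and a G>A change preceded by C is treated as the reverse-strand form of a CpG transition.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def snv_class(ref, alt, prev_base, next_base, mean_methylation=None):
    """Assign a single nucleotide variant to one of five mutation classes.

    prev_base and next_base are the reference bases flanking the variant;
    mean_methylation is the mean methylation value at the site, or None if missing.
    """
    if (ref, alt) not in TRANSITIONS:
        return "transversion"
    is_cpg = ((ref, alt, next_base) == ("C", "T", "G")
              or (ref, alt, prev_base) == ("G", "A", "C"))
    if not is_cpg:
        return "non_cpg_transition"
    if mean_methylation is None or mean_methylation < 0.2:
        return "cpg_transition_low"
    if mean_methylation <= 0.6:
        return "cpg_transition_medium"
    return "cpg_transition_high"
```

Each variant pair can then be keyed by the unordered pair of classes of its two variants before accuracy is computed per combination.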

Calculating accuracy as a function of genetic distance

To estimate the genetic distance between pairs of genetic variants, we interpolated genetic distances between variant pairs using a genetic map from HapMap2³¹ (https://github.com/joepickrell/1000-genomes-genetic-maps). We utilized a HapMap2 genetic map representing the average of recombination rates in the CEU, YRI, and ASN populations. We then ran interpolate_maps.py (downloaded from https://github.com/joepickrell/1000-genomes-genetic-maps/blob/master/scripts/interpolate_maps.py) for all variant pairs in the phasing trio data. We used population-specific $P_{trans}$ estimates and calculated phasing accuracy as the number of variant pairs correctly phased divided by the number of pairs correctly and incorrectly phased.
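The idea of interpolating a genetic map can be sketched as follows. This is not the interpolate_maps.py script itself: linear interpolation between neighbouring map positions is assumed here, positions outside the map are clamped to the nearest map endpoint, and the function names and the two-list map representation are hypothetical.

```python
from bisect import bisect_right

def interpolate_cm(pos, map_bp, map_cm):
    """Linearly interpolate the genetic position (cM) of a base-pair position.

    map_bp: strictly increasing base-pair positions on one chromosome.
    map_cm: cumulative genetic positions (cM) at those base-pair positions.
    """
    if pos <= map_bp[0]:
        return map_cm[0]
    if pos >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, pos)  # map_bp[i-1] <= pos < map_bp[i]
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)

def genetic_distance(pos1, pos2, map_bp, map_cm):
    """Genetic distance (cM) between two variants on the same chromosome."""
    return abs(interpolate_cm(pos1, map_bp, map_cm)
               - interpolate_cm(pos2, map_bp, map_cm))
```

Variant pairs can then be grouped by genetic distance before computing accuracy within each group.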

MNV analysis

We obtained multi-nucleotide variant pairs for which read-back phasing had previously been calculated¹. We included all multi-nucleotide variant pairs where each constituent variant was analyzed in our study. We utilized cosmopolitan $P_{trans}$ estimates and calculated phasing accuracy as the number of variant pairs correctly phased divided by the number of pairs correctly and incorrectly phased.

Rare disease patient analysis

We selected 627 patients from the Broad Institute Center for Mendelian Genetics (CMG)²² with a confident or strong candidate genetic diagnosis of a Mendelian condition. Each patient carried two presumed bi-allelic variants in an autosomal recessive disease gene consistent with the patient’s phenotype. For 293 of the patients, both variants were present in gnomAD and thus were amenable to our phasing approach. For 168 of the 293 patients, trio-sequencing (i.e., sequencing of the proband and the two unaffected biological parents) had been performed. For these 168 individuals with parental DNA sequencing, we were able to confirm phasing of the two variants via phase-by-transmission.

Determining counts of individuals with two rare, damaging variants

We annotated variants by the worst consequence on the canonical transcript by the Ensembl Variant Effect Predictor (VEP)³². We applied LOFTEE¹⁴ to annotate LoF variants and counted only high confidence LoF variants as “pLoF”. We used REVEL²³ in line with recent ClinGen recommendations²⁴: we counted REVEL scores ≥ 0.932 as “strong_revel_missense”, ≥ 0.773 as “moderate_to_strong_revel_missense”, and ≥ 0.644 as “weak_to_strong_revel_missense”.
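The REVEL thresholds above define nested categories, which the following sketch makes explicit. The function name and the list-based return value are hypothetical; the cutoffs and category labels are those given in the text.

```python
# (label, minimum REVEL score), ordered from most to least stringent.
REVEL_BINS = (
    ("strong_revel_missense", 0.932),
    ("moderate_to_strong_revel_missense", 0.773),
    ("weak_to_strong_revel_missense", 0.644),
)

def revel_categories(score):
    """Return every REVEL category whose threshold the score meets."""
    return [label for label, cutoff in REVEL_BINS if score >= cutoff]
```

Because the categories are nested, a missense variant with a score of 0.95 is counted in all three, whereas one with a score of 0.70 is counted only in the weakest category.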

We predicted phase (cis or trans) based on the $P_{trans}$ thresholds for all variant pairs. All singleton/singleton variant pairs (AC = 1) and variant pairs with indeterminate $P_{trans}$ values ($0.02 < P_{trans} < 0.55$) were annotated as unphased.

We selected five AF thresholds for analysis and filtered variant pairs based on the highest global AF and, where available, the “popmax” AF of each variant in gnomAD (i.e., the highest AF among the non-bottlenecked populations [AFR, AMR, EAS, NFE, SAS]): 0.5%, 1%, 1.5%, 2%, and 5%. We also filtered out all variant pairs containing a variant with an AF > 5% in a bottlenecked population.

We performed gene-wise calculations of the number of individuals carrying a variant pair (irrespective of phase) and the number predicted to be in trans, unphased (indeterminate), and the number predicted to be in cis. We performed gene-wise calculations separately by AF threshold and functional consequences (26 consequence groups). If individuals carried multiple variant pairs in the same gene with different phase predictions, we counted the individual in only one phase group, prioritizing in trans over unphased and unphased over in cis. These gene-wise counts are displayed in the “Variant Co-occurrence” gnomAD browser feature. For individuals carrying multiple variant pairs in the same gene with different phase predictions, we also performed separate calculations allowing these individuals to be counted in multiple phase groups (data available to download).
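The prioritization rule for individuals carrying several variant pairs in one gene can be sketched as below. The function names, group labels, and input format (a mapping from individual identifier to that individual's per-pair phase predictions within a single gene) are hypothetical.

```python
from collections import Counter

# Highest priority first: in trans over unphased, unphased over in cis.
PRIORITY = ("in_trans", "unphased", "in_cis")

def individual_phase_group(pair_predictions):
    """Collapse one individual's per-pair predictions in a gene to a single group."""
    for group in PRIORITY:
        if group in pair_predictions:
            return group
    raise ValueError("no recognised phase prediction")

def gene_counts(predictions_by_individual):
    """Count individuals per phase group for one gene, plus the total irrespective of phase."""
    counts = Counter(individual_phase_group(preds)
                     for preds in predictions_by_individual.values())
    counts["any_phase"] = len(predictions_by_individual)
    return counts
```

The alternative tabulation mentioned in the text, in which an individual may appear in several phase groups, would instead count each distinct prediction once per individual.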

Essential gene lists

We queried the following essential gene lists for the presence of the true “human knock-out” genes identified in this study:

2,454 genes essential in mice from Georgi et al. 2013³³
553 pan-cancer core fitness genes from Behan et al. 2019³⁴
360 core essential genes from genomic perturbation screens from Hart et al. 2014³⁵
684 genes essential in culture by CRISPR screening from Hart et al. 2017³⁶
1,075 genes annotated by the ADaM analysis of a large collection of gene dependency profiles (CRISPR-Cas9 screens) across human cancer cell lines from Vinceti et al. 2021³⁷

Extended Data

Supplementary Material

Supplementary Tables 1 and 2

NIHMS1958298-supplement-Supplementary_Tables_1_and_2.xlsx (34.4 KB, xlsx)

Supplementary Note

NIHMS1958298-supplement-Supplementary_Note.pdf (17.4 MB, pdf)

Acknowledgements

We would like to thank all members of the gnomAD team for helpful comments and suggestions, and to particularly recognize the members of the gnomAD methods and browser teams who worked hard over many years to provide cleaned datasets, easy-to-use browsers, and visualizations. This work was supported by NHGRI U24HG011450 (H.L.R., M.J.D.), UM1HG008900 (D.G.M., H.L.R.), and U01HG011755 (A.O.D.L., H.L.R.).

Genome Aggregation Database Consortium

Maria Abreu¹⁴, Carlos A. Aguilar Salinas¹⁵, Tariq Ahmad¹⁶, Christine M. Albert^17,18, Jessica Alföldi^2,3, Diego Ardissino¹⁹, Irina M. Armean^2,3,20, Gil Atzmon^21,22, Eric Banks²³, John Barnard²⁴, Samantha M. Baxter², Laurent Beaugerie²⁵, Emelia J. Benjamin^26,27,28, David Benjamin²³, Louis Bergelson²³, Michael Boehnke²⁹, Lori L. Bonnycastle³⁰, Erwin P. Bottinger³¹, Donald W. Bowden^32,33,34, Matthew J. Bown^35,36, Steven Brant³⁷, Sarah E. Calvo^2,10, Hannia Campos^38,39, John C. Chambers^40,41,42, Juliana C. Chan⁴³, Katherine R. Chao², Sinéad Chapman^2,3,5, Daniel Chasman^17,44, Siwei Chen^2,3, Rex L. Chisholm⁴⁵, Judy Cho³¹, Rajiv Chowdhury⁴⁶, Mina K. Chung⁴⁷, Wendy K. Chung^48,49,50, Kristian Cibulskis²³, Bruce Cohen^44,51, Ryan L. Collins^2,10,52, Kristen M. Connolly⁵³, Adolfo Correa⁵⁴, Miguel Covarrubias²³, Beryl Cummings^2,52, Dana Dabelea⁵⁵, Mark J. Daly^2,3,11, John Danesh⁴⁶, Dawood Darbar⁵⁶, Joshua Denny⁵⁷, Stacey Donnelly², Ravindranath Duggirala⁵⁸, Josée Dupuis^59,60, Patrick T. Ellinor^2,61, Roberto Elosua^62,63,64, James Emery²³, Eleina England², Jeanette Erdmann^65,66,67, Tõnu Esko^2,68, Emily Evangelista², Yossi Farjoun²³, Diane Fatkin^69,70,71, Steven Ferriera⁷², Jose Florez^44,73,74, Laurent C. Francioli^2,3, Andre Franke⁷⁵, Martti Färkkilä⁷⁶, Stacey Gabriel⁷⁷, Kiran Garimella²³, Laura D. Gauthier²³, Jeff Gentry²³, Gad Getz^44,77,78, David C. Glahn^79,80, Benjamin Glaser⁸¹, Stephen J. Glatt⁸², David Goldstein^83,84, Clicerio Gonzalez⁸⁵, Julia K. Goodrich^2,3, Leif Groop^86,87, Sanna Gudmundsson^2,3,4, Namrata Gupta^2,72, Andrea Haessly²³, Christopher Haiman⁸⁸, Ira Hall⁸⁹, Craig Hanis⁹⁰, Matthew Harms^91,92, Mikko Hiltunen⁹³, Matti M. Holi⁹⁴, Christina M. Hultman^95,96, Chaim Jalas⁹⁷, Thibault Jeandet²³, Mikko Kallela⁹⁸, Diane Kaplan²³, Jaakko Kaprio⁸⁷, Konrad J. Karczewski^2,3,6,10, Sekar Kathiresan^10,44,99, Eimear Kenny^96,100, Bong-Jo Kim¹⁰¹, Young Jin Kim¹⁰¹, George Kirov¹⁰², Zan Koenig², Jaspal Kooner^41,103,104, Seppo Koskinen¹⁰⁵, Harlan M. 
Krumholz¹⁰⁶, Subra Kugathasan¹⁰⁷, Soo Heon Kwak¹⁰⁸, Markku Laakso^109,110, Nicole Lake¹¹¹, Trevyn Langsford²³, Kristen M. Laricchia^2,3, Terho Lehtimäki¹¹², Monkol Lek¹¹¹, Emily Lipscomb², Christopher Llanwarne²³, Ruth J.F. Loos^31,113, Steven A. Lubitz^2,61, Teresa Tusie Luna^114,115, Ronald C.W. Ma^43,116,117, Daniel G. MacArthur^2,3,12,13, Gregory M. Marcus¹¹⁸, Jaume Marrugat^64,119, Alicia R. Martin², Kari M. Mattila¹¹², Steven McCarroll^5,120, Mark I. McCarthy^121,122,123, Jacob McCauley^124,125, Dermot McGovern¹²⁶, Ruth McPherson¹²⁷, James B. Meigs^2,44,128, Olle Melander¹²⁹, Andres Metspalu¹³⁰, Deborah Meyers¹³¹, Eric V. Minikel², Braxton D. Mitchell¹³², Vamsi K. Mootha^2,133, Ruchi Munshi²³, Aliya Naheed¹³⁴, Saman Nazarian^135,136, Benjamin M. Neale^2,3,5,6, Peter M. Nilsson¹³⁷, Sam Novod²³, Anne H. O’Donnell-Luria^2,3,4,10, Michael C. O’Donovan¹⁰², Yukinori Okada^138,139,140, Dost Ongur^44,51, Lorena Orozco¹⁴¹, Michael J. Owen¹⁰², Colin Palmer¹⁴², Nicholette D. Palmer¹⁴³, Aarno Palotie^3,5,87, Kyong Soo Park^108,144, Carlos Pato¹⁴⁵, Nikelle Petrillo²³, William Phu^2,4, Timothy Poterba^2,3,5, Ann E. Pulver¹⁴⁶, Dan Rader^135,147, Nazneen Rahman¹⁴⁸, Heidi L. Rehm^2,3,10, Alex Reiner^149,150, Anne M. Remes¹⁵¹, Dan Rhodes², Stephen Rich^152,153, John D. Rioux^154,155, Samuli Ripatti^87,156,157, David Roazen²³, Dan M. Roden^158,159, Jerome I. Rotter¹⁶⁰, Valentin Ruano-Rubio²³, Nareh Sahakian²³, Danish Saleheen^161,162,163, Veikko Salomaa¹⁶⁴, Andrea Saltzman², Nilesh J. Samani^35,36, Kaitlin E. Samocha^2,3,10, Jeremiah Scharf^2,5,10, Molly Schleicher², Heribert Schunkert^165,166, Sebastian Schönherr¹⁶⁷, Eleanor Seaby², Cotton Seed^3,5, Svati H. Shah¹⁶⁸, Megan Shand²³, Moore B. Shoemaker¹⁶⁹, Tai Shyong^170,171, Edwin K. Silverman^172,173, Moriel Singer-Berk², Pamela Sklar^174,175,176, J. Gustav Smith^157,177,178, Jonathan T. Smith²³, Hilkka Soininen¹⁷⁹, Harry Sokol^180,181,182, Matthew Solomonson^2,3, Rachel G. 
Son², Jose Soto²³, Tim Spector¹⁸³, Christine Stevens^2,3,5, Nathan Stitziel^89,184, Patrick F. Sullivan^95,185, Jaana Suvisaari¹⁶⁴, E. Shyong Tai^186,187,188, Michael E. Talkowski^2,5,10, Yekaterina Tarasova², Kent D. Taylor¹⁶⁰, Yik Ying Teo^186,189,190, Grace Tiao^2,3, Kathleen Tibbetts²³, Charlotte Tolonen²³, Ming Tsuang^191,192, Tiinamaija Tuomi^87,193,194, Dan Turner¹⁹⁵, Teresa Tusie-Luna^196,197, Erkki Vartiainen¹⁹⁸, Marquis Vawter¹⁹⁹, Christopher Vittal^2,3, Gordon Wade²³, Arcturus Wang^2,3,5, Qingbo Wang^2,138, James S. Ware^2,200,201, Hugh Watkins²⁰², Nicholas A. Watts^2,3, Rinse K. Weersma²⁰³, Ben Weisburd²³, Maija Wessman^87,204, Nicola Whiffin^2,205,206, Michael W. Wilson^2,3, James G. Wilson²⁰⁷, Ramnik J. Xavier^208,209, Mary T. Yohannes²

¹⁴University of Miami Miller School of Medicine, Gastroenterology, Miami, USA

¹⁵Unidad de Investigacion de Enfermedades Metabolicas, Instituto Nacional de Ciencias Medicas y Nutricion, Mexico City, Mexico

¹⁶Peninsula College of Medicine and Dentistry, Exeter, UK

¹⁷Division of Preventive Medicine, Brigham and Women’s Hospital, Boston, MA, USA

¹⁸Division of Cardiovascular Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA

¹⁹Department of Cardiology University Hospital, Parma, Italy

²⁰European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK

²¹Department of Biology Faculty of Natural Sciences, University of Haifa, Haifa, Israel

²²Departments of Medicine and Genetics, Albert Einstein College of Medicine, Bronx, NY, USA

²³Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

²⁴Department of Quantitative Health Sciences, Lerner Research Institute Cleveland Clinic, Cleveland, OH, USA

²⁵Sorbonne Université, APHP, Gastroenterology Department Saint Antoine Hospital, Paris, France

²⁶NHLBI and Boston University’s Framingham Heart Study, Framingham, MA, USA

²⁷Department of Medicine, Boston University Chobanian and Avedisian School of Medicine, Boston, MA, USA

²⁸Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA

²⁹Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA

³⁰National Human Genome Research Institute, National Institutes of Health Bethesda, MD, USA

³¹The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA

³²Department of Biochemistry, Wake Forest School of Medicine, Winston-Salem, NC, USA

³³Center for Genomics and Personalized Medicine Research, Wake Forest School of Medicine, Winston-Salem, NC, USA

³⁴Center for Diabetes Research, Wake Forest School of Medicine, Winston-Salem, NC, USA

³⁵Department of Cardiovascular Sciences, University of Leicester, Leicester, UK

³⁶NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, UK

³⁷Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

³⁸Harvard School of Public Health, Boston, MA, USA

³⁹Central American Population Center, San Pedro, Costa Rica

⁴⁰Department of Epidemiology and Biostatistics, Imperial College London, London, UK

⁴¹Department of Cardiology, Ealing Hospital, NHS Trust, Southall, UK

⁴²Imperial College, Healthcare NHS Trust Imperial College London, London, UK

⁴³Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong, China

⁴⁴Department of Medicine, Harvard Medical School, Boston, MA, USA

⁴⁵Northwestern University Feinberg School of Medicine, Chicago, IL, USA

⁴⁶University of Cambridge, Cambridge, England

⁴⁷Departments of Cardiovascular, Medicine Cellular and Molecular Medicine Molecular Cardiology, Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA

⁴⁸Department of Pediatrics, Columbia University Irving Medical Center, New York, NY, USA

⁴⁹Herbert Irving Comprehensive Cancer Center, Columbia University Medical Center, New York, NY, USA

⁵⁰Department of Medicine, Columbia University Medical Center, New York, NY, USA

⁵¹McLean Hospital, Belmont, MA, USA

⁵²Division of Medical Sciences, Harvard Medical School, Boston, MA, USA

⁵³Genomics Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁵⁴Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA

⁵⁵Department of Epidemiology Colorado School of Public Health Aurora, CO, USA

⁵⁶Department of Medicine and Pharmacology, University of Illinois at Chicago, Chicago, IL, USA

⁵⁷Vanderbilt University Medical Center, Nashville, TN, USA

⁵⁸Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX, USA

⁵⁹Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA

⁶⁰National Heart Lung and Blood Institute’s Framingham Heart Study, Framingham, MA, USA

⁶¹Cardiac Arrhythmia Service and Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA

⁶²Cardiovascular Epidemiology and Genetics Hospital del Mar Medical Research Institute, (IMIM) Barcelona Catalonia, Spain

⁶³CIBER CV Barcelona, Catalonia, Spain

⁶⁴Department of Medicine, Medical School University of Vic-Central, University of Catalonia, Vic Catalonia, Spain

⁶⁵Institute for Cardiogenetics, University of Lübeck, Lübeck, Germany

⁶⁶German Research Centre for Cardiovascular Research, Hamburg/Lübeck/Kiel, Lübeck, Germany

⁶⁷University Heart Center Lübeck, Lübeck, Germany

⁶⁸Estonian Genome Center, Institute of Genomics University of Tartu, Tartu, Estonia

⁶⁹Victor Chang Cardiac Research Institute, Darlinghurst, NSW, Australia

⁷⁰Faculty of Medicine, UNSW Sydney, Kensington, NSW, Australia

⁷¹Cardiology Department, St Vincent’s Hospital, Darlinghurst, NSW, Australia

⁷²Broad Genomics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁷³Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA

⁷⁴Programs in Metabolism and Medical & Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁷⁵Institute of Clinical Molecular Biology, (IKMB) Christian-Albrechts-University of Kiel, Kiel, Germany

⁷⁶Helsinki University and Helsinki University Hospital Clinic of Gastroenterology, Helsinki, Finland

⁷⁷Bioinformatics Program MGH Cancer Center and Department of Pathology, Boston, MA, USA

⁷⁸Cancer Genome Computational Analysis, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁷⁹Department of Psychiatry and Behavioral Sciences, Boston Children’s Hospital and Harvard Medical School, Boston, MA, USA

⁸⁰Harvard Medical School Teaching Hospital, Boston, MA, USA

⁸¹Department of Endocrinology and Metabolism, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Israel

⁸²Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Syracuse, NY, USA

⁸³Institute for Genomic Medicine, Columbia University Medical Center Hammer Health Sciences, New York, NY, USA

⁸⁴Department of Genetics & Development Columbia University Medical Center, Hammer Health Sciences, New York, NY, USA

⁸⁵Centro de Investigacion en Salud Poblacional, Instituto Nacional de Salud Publica, Mexico

⁸⁶Lund University Sweden, Sweden

⁸⁷Institute for Molecular Medicine Finland, (FIMM) HiLIFE University of Helsinki, Helsinki, Finland

⁸⁸Lund University Diabetes Centre, Malmö, Skåne County, Sweden

⁸⁹Washington University School of Medicine, St Louis, MO, USA

⁹⁰Human Genetics Center University of Texas Health Science Center at Houston, Houston, TX, USA

⁹¹Department of Neurology Columbia University, New York City, NY, USA

⁹²Institute of Genomic Medicine, Columbia University, New York City, NY, USA

⁹³Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland

⁹⁴Department of Psychiatry, Helsinki University Central Hospital Lapinlahdentie, Helsinki, Finland

⁹⁵Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden

⁹⁶Icahn School of Medicine at Mount Sinai, New York, NY, USA

⁹⁷Bonei Olam, Center for Rare Jewish Genetic Diseases, Brooklyn, NY, USA

⁹⁸Department of Neurology, Helsinki University, Central Hospital, Helsinki, Finland

⁹⁹Cardiovascular Disease Initiative and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

¹⁰⁰Charles Bronfman Institute for Personalized Medicine, New York, NY, USA

¹⁰¹Division of Genome Science, Department of Precision Medicine, National Institute of Health, Republic of Korea

¹⁰²MRC Centre for Neuropsychiatric Genetics & Genomics, Cardiff University School of Medicine, Cardiff, Wales

¹⁰³Imperial College, Healthcare NHS Trust, London, UK

¹⁰⁴National Heart and Lung Institute Cardiovascular Sciences, Hammersmith Campus, Imperial College London, London, UK

¹⁰⁵Department of Health THL-National Institute for Health and Welfare, Helsinki, Finland

¹⁰⁶Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, Center for Outcomes Research and Evaluation Yale-New Haven Hospital, New Haven, CT, USA

¹⁰⁷Division of Pediatric Gastroenterology, Emory University School of Medicine, Atlanta, GA, USA

¹⁰⁸Department of Internal Medicine, Seoul National University Hospital, Seoul, Republic of Korea

¹⁰⁹The University of Eastern Finland, Institute of Clinical Medicine, Kuopio, Finland

¹¹⁰Kuopio University Hospital, Kuopio, Finland

¹¹¹Department of Genetics, Yale School of Medicine, New Haven, CT, USA

¹¹²Department of Clinical Chemistry Fimlab Laboratories and Finnish Cardiovascular Research Center-Tampere Faculty of Medicine and Health Technology, Tampere University, Finland

¹¹³The Mindich Child Health and Development, Institute Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹¹⁴National Autonomous University of Mexico, Mexico City, Mexico

¹¹⁵Salvador Zubirán National Institute of Health Sciences and Nutrition, Mexico City, Mexico

¹¹⁶Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong, China

¹¹⁷Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Hong Kong, China

¹¹⁸University of California San Francisco Parnassus Campus, San Francisco, CA, USA

¹¹⁹Cardiovascular Research REGICOR Group, Hospital del Mar Medical Research Institute (IMIM), Barcelona, Catalonia, Spain

¹²⁰Department of Genetics, Harvard Medical School, Boston, MA, USA

¹²¹Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital Old Road Headington, Oxford, OX, LJ, UK

¹²²Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX, BN, UK

¹²³Oxford NIHR Biomedical Research Centre, Oxford University Hospitals, NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX, DU, UK

¹²⁴John P. Hussman Institute for Human Genomics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA

¹²⁵The Dr. John T. Macdonald Foundation Department of Human Genetics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA

¹²⁶F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute Cedars-Sinai Medical Center, Los Angeles, CA, USA

¹²⁷Atherogenomics Laboratory, University of Ottawa Heart Institute, Ottawa, Canada

¹²⁸Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA

¹²⁹Department of Clinical Sciences, University Hospital Malmö Clinical Research Center, Lund University, Malmö, Sweden

¹³⁰Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, Estonia

¹³¹University of Arizona Health Science, Tucson, AZ, USA

¹³²Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA

¹³³Howard Hughes Medical Institute and Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA

¹³⁴International Centre for Diarrhoeal Disease Research, Bangladesh

¹³⁵Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

¹³⁶Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

¹³⁷Lund University, Dept. Clinical Sciences, Skåne University Hospital, Malmö, Sweden

¹³⁸Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan

¹³⁹Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Suita, Japan

¹⁴⁰Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Suita, Japan

¹⁴¹Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, Mexico

¹⁴²Medical Research Institute, Ninewells Hospital and Medical School University of Dundee, Dundee, UK

¹⁴³Wake Forest School of Medicine, Winston-Salem, NC, USA

¹⁴⁴Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Republic of Korea

¹⁴⁵Department of Psychiatry Keck School of Medicine at the University of Southern California, Los Angeles, CA, USA

¹⁴⁶Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA

¹⁴⁷Children’s Hospital of Philadelphia, Philadelphia, PA, USA

¹⁴⁸Division of Genetics and Epidemiology, Institute of Cancer Research, London, SM, NG

¹⁴⁹University of Washington, Seattle, WA, USA

¹⁵⁰Fred Hutchinson Cancer Research Center, Seattle, WA, USA

¹⁵¹Medical Research Center, Oulu University Hospital, Oulu Finland and Research Unit of Clinical Neuroscience Neurology University of Oulu, Oulu, Finland

¹⁵²Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA

¹⁵³Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA

¹⁵⁴Research Center Montreal Heart Institute, Montreal, Quebec, Canada

¹⁵⁵Department of Medicine, Faculty of Medicine Université de Montréal, Québec, Canada

¹⁵⁶Department of Public Health Faculty of Medicine, University of Helsinki, Helsinki, Finland

¹⁵⁷Broad Institute of MIT and Harvard, Cambridge, MA, USA

¹⁵⁸Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA

¹⁵⁹Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA

¹⁶⁰The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA

¹⁶¹Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

¹⁶²Department of Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA

¹⁶³Center for Non-Communicable Diseases, Karachi, Pakistan

¹⁶⁴National Institute for Health and Welfare, Helsinki, Finland

¹⁶⁵Deutsches Herzzentrum, München, Germany

¹⁶⁶Technische Universität München, Germany

¹⁶⁷Institute of Genetic Epidemiology, Department of Genetics and Pharmacology, Medical University of Innsbruck, 6020 Innsbruck, Austria

¹⁶⁸Duke Molecular Physiology Institute, Durham, NC

¹⁶⁹Division of Cardiovascular Medicine, Nashville VA Medical Center, Vanderbilt University School of Medicine, Nashville, TN, USA

¹⁷⁰Division of Endocrinology, National University Hospital, Singapore

¹⁷¹NUS Saw Swee Hock School of Public Health, Singapore

¹⁷²Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA

¹⁷³Harvard Medical School, Boston, MA, USA

¹⁷⁴Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁷⁵Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁷⁶Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁷⁷The Wallenberg Laboratory/Department of Molecular and Clinical Medicine, Institute of Medicine, Gothenburg University and the Department of Cardiology, Sahlgrenska University Hospital, Gothenburg, Sweden

¹⁷⁸Department of Cardiology, Wallenberg Center for Molecular Medicine and Lund University Diabetes Center, Clinical Sciences, Lund University and Skåne University Hospital, Lund, Sweden

¹⁷⁹Institute of Clinical Medicine Neurology, University of Eastern Finland, Kuopio, Finland

¹⁸⁰Sorbonne Université, INSERM, Centre de Recherche Saint-Antoine, CRSA, AP-HP, Saint Antoine Hospital, Gastroenterology department, F-75012 Paris, France

¹⁸¹INRA, UMR1319 Micalis & AgroParisTech, Jouy en Josas, France

¹⁸²Paris Center for Microbiome Medicine (PaCeMM) FHU, Paris, France

¹⁸³Department of Twin Research and Genetic Epidemiology King’s College London, London, UK

¹⁸⁴The McDonnell Genome Institute at Washington University, Seattle, WA, USA

¹⁸⁵Departments of Genetics and Psychiatry, University of North Carolina, Chapel Hill, NC, USA

¹⁸⁶Saw Swee Hock School of Public Health National University of Singapore, National University Health System, Singapore

¹⁸⁷Department of Medicine, Yong Loo Lin School of Medicine National University of Singapore, Singapore

¹⁸⁸Duke-NUS Graduate Medical School, Singapore

¹⁸⁹Life Sciences Institute, National University of Singapore, Singapore

¹⁹⁰Department of Statistics and Applied Probability, National University of Singapore, Singapore

¹⁹¹Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, CA, USA

¹⁹²Institute of Genomic Medicine, University of California San Diego, San Diego, CA, USA

¹⁹³Endocrinology, Abdominal Center, Helsinki University Hospital, Helsinki, Finland

¹⁹⁴Institute of Genetics, Folkhalsan Research Center, Helsinki, Finland

¹⁹⁵Juliet Keidan Institute of Pediatric Gastroenterology, Shaare Zedek Medical Center, The Hebrew University of Jerusalem, Jerusalem, Israel

¹⁹⁶Instituto de Investigaciones Biomédicas, UNAM, Mexico City, Mexico

¹⁹⁷Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico

¹⁹⁸Department of Public Health Faculty of Medicine University of Helsinki, Helsinki, Finland

¹⁹⁹Department of Psychiatry and Human Behavior, University of California Irvine, Irvine, CA, USA

²⁰⁰National Heart & Lung Institute & MRC London Institute of Medical Sciences, Imperial College, London, UK

²⁰¹Royal Brompton & Harefield Hospitals, Guy’s and St. Thomas’ NHS Foundation Trust, London, UK

²⁰²Radcliffe Department of Medicine, University of Oxford, Oxford, UK

²⁰³Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, Netherlands

²⁰⁴Folkhälsan Institute of Genetics, Folkhälsan Research Center, Helsinki, Finland

²⁰⁵National Heart & Lung Institute and MRC London Institute of Medical Sciences, Imperial College London, London, UK

²⁰⁶Cardiovascular Research Centre, Royal Brompton & Harefield Hospitals NHS Trust, London, UK

²⁰⁷Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, MS, USA

²⁰⁸Program in Infectious Disease and Microbiome, Broad Institute of MIT and Harvard, Cambridge, MA, USA

²⁰⁹Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA, USA

Footnotes

Competing interests

L.C.F. is currently an employee of, and owns stock in, Vertex Pharmaceuticals. B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora, Inc. (f/k/a RBNC Therapeutics). H.L.R. has received support from Illumina and Microsoft to support rare disease gene discovery and diagnosis. M.J.D. is a founder of Maze Therapeutics and Neumora Therapeutics, Inc. (f/k/a RBNC Therapeutics). A.O.D.L. has consulted for Tome Biosciences and Ono Pharma USA Inc, and is a member of the scientific advisory board for Congenica Inc and the Simons Foundation SPARK for Autism study. K.J.K. is a consultant for Tome Biosciences and Vor Biosciences, and a member of the Scientific Advisory Board of Nurture Genomics. D.G.M. is a paid advisor to GlaxoSmithKline, Insitro, Variant Bio and Overtone Therapeutics, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Google, Merck, Microsoft, Pfizer, and Sanofi-Genzyme. K.E.S. has received support from Microsoft for work related to rare disease diagnostics. The remaining authors declare no competing interests.

Code Availability

The code used to compute $P_{trans}$ estimates for variant pairs and to determine the number of individuals carrying rare, compound heterozygous variants can be found at: https://github.com/broadinstitute/gnomad_chets

The code has also been uploaded to Zenodo (https://doi.org/10.5281/zenodo.10034663).
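For orientation only, the following is a minimal Python sketch of how a $P_{trans}$ value can be estimated for a single variant pair. It is not taken from the repository above. It assumes the standard two-locus Expectation-Maximization estimate of haplotype frequencies (ref. 15) and takes $P_{trans}$ to be the share of the double-heterozygote probability attributable to the trans configuration. The function name `estimate_p_trans`, the 3×3 `counts` layout, the uniform starting frequencies, and the convergence settings are all illustrative assumptions.

```python
def estimate_p_trans(counts, n_iter=100, tol=1e-12):
    """Estimate P_trans for one variant pair from a 3x3 genotype count table.

    counts[i][j] = number of individuals carrying i alternate alleles at
    variant 1 and j alternate alleles at variant 2 (i, j in 0..2).
    Haplotype labels (illustrative): A/a = ref/alt at variant 1,
    B/b = ref/alt at variant 2. Returns None when P_trans is undefined.
    """
    n_individuals = sum(sum(row) for row in counts)
    if n_individuals == 0:
        return None
    n_haplotypes = 2 * n_individuals

    # Haplotype counts contributed by genotypes whose phase is unambiguous.
    base = {
        "AB": 2 * counts[0][0] + counts[0][1] + counts[1][0],
        "Ab": 2 * counts[0][2] + counts[0][1] + counts[1][2],
        "aB": 2 * counts[2][0] + counts[1][0] + counts[2][1],
        "ab": 2 * counts[2][2] + counts[1][2] + counts[2][1],
    }
    n_double_het = counts[1][1]  # the only phase-ambiguous genotype

    freqs = {hap: 0.25 for hap in base}  # assumed uniform starting point
    for _ in range(n_iter):
        # E-step: split double heterozygotes into cis (AB/ab) and trans (Ab/aB).
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        p_cis = cis / (cis + trans) if (cis + trans) > 0 else 0.5

        # M-step: re-estimate haplotype frequencies from expected counts.
        updated = {
            "AB": (base["AB"] + n_double_het * p_cis) / n_haplotypes,
            "ab": (base["ab"] + n_double_het * p_cis) / n_haplotypes,
            "Ab": (base["Ab"] + n_double_het * (1 - p_cis)) / n_haplotypes,
            "aB": (base["aB"] + n_double_het * (1 - p_cis)) / n_haplotypes,
        }
        delta = max(abs(updated[hap] - freqs[hap]) for hap in updated)
        freqs = updated
        if delta < tol:
            break

    cis = freqs["AB"] * freqs["ab"]
    trans = freqs["Ab"] * freqs["aB"]
    if cis + trans == 0:
        return None  # e.g. one of the variants is never observed
    return trans / (cis + trans)
```

Under these assumptions, a pair whose alternate alleles are only ever seen together in the same individuals gives a value near 0 (likely in cis), and a pair whose alternate alleles are only seen in different individuals gives a value of 1 (likely in trans). EM can converge to different solutions depending on the starting point, so the uniform initialization here is a simplification.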

Data Availability

The gnomAD v2 dataset can be accessed at https://gnomad.broadinstitute.org. We made use of prior quality control processing of these and related data. In addition, we downloaded HapMap2 genetic maps from https://github.com/joepickrell/1000-genomes-genetic-maps.

We provide both web-based look-up tools and downloads for the data generated here. A look-up tool to find the likely co-occurrence pattern between two rare (global allele frequency in gnomAD exomes < 5%) coding, flanking intronic (from position −1 to −3 in acceptor sites, and +1 to +8 in donor sites) or 5’/3’ UTR variants can be found at: https://gnomad.broadinstitute.org/variant-cooccurrence
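The eligibility criteria stated above can be expressed as a small pre-check before querying the look-up tool. The sketch below is illustrative only: the `Variant` record, its field names, and the region labels are hypothetical and do not correspond to any gnomAD API or file schema.

```python
from dataclasses import dataclass
from typing import Optional

MAX_GLOBAL_AF = 0.05  # "rare": global AF in gnomAD exomes below 5%


@dataclass
class Variant:
    """Hypothetical minimal record for one variant."""
    global_af: float                     # global allele frequency in gnomAD exomes
    region: str                          # "coding", "utr", "acceptor_intron" or "donor_intron"
    splice_offset: Optional[int] = None  # intronic position relative to the exon boundary


def is_lookup_eligible(variant: Variant) -> bool:
    """Return True if the variant falls in the classes covered by the look-up tool."""
    if not variant.global_af < MAX_GLOBAL_AF:
        return False
    if variant.region in ("coding", "utr"):
        return True
    if variant.splice_offset is None:
        return False
    if variant.region == "acceptor_intron":
        return -3 <= variant.splice_offset <= -1  # positions -1 to -3
    if variant.region == "donor_intron":
        return 1 <= variant.splice_offset <= 8    # positions +1 to +8
    return False


def pair_is_lookup_eligible(first: Variant, second: Variant) -> bool:
    """Both variants of a pair must individually meet the criteria."""
    return is_lookup_eligible(first) and is_lookup_eligible(second)
```

A pair that fails this check falls outside the variant classes described for the look-up tool, so no co-occurrence prediction should be expected for it.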

Additionally, we display the per-gene counts tables that detail the number of individuals with two rare variants, stratified by AF and functional consequence, on each gene’s main page. One table details counts of individuals with two heterozygous variants and includes predicted phase, while the second details individuals with homozygous variants. Both can be found by clicking on the “Variant Co-occurrence” tab on each gene’s main page.

All variant co-occurrence tables can be downloaded from: https://gnomad.broadinstitute.org/downloads#v2-variant-cooccurrence

References

1. Wang Q et al. Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nat. Commun. 11, 2539 (2020).
2. Bansal V, Halpern AL, Axelrod N & Bafna V. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res. 18, 1336–1346 (2008).
3. Patterson M et al. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J. Comput. Biol. 22, 498–509 (2015).
4. Hager P, Mewes H-W, Rohlfs M, Klein C & Jeske T. SmartPhase: Accurate and fast phasing of heterozygous variant pairs for genetic diagnosis of rare diseases. PLoS Comput. Biol. 16, e1007613 (2020).
5. Maestri S et al. A Long-Read Sequencing Approach for Direct Haplotype Phasing in Clinical Settings. Int. J. Mol. Sci. 21, (2020).
6. Mantere T, Kersten S & Hoischen A. Long-Read Sequencing Emerging in Medical Genetics. Front. Genet. 10, 426 (2019).
7. Snyder MW, Adey A, Kitzman JO & Shendure J. Haplotype-resolved genome sequencing: experimental methods and applications. Nat. Rev. Genet. 16, 344–358 (2015).
8. Li N & Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
9. Loh P-R et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
10. Browning BL, Tian X, Zhou Y & Browning SR. Fast two-stage phasing of large-scale sequence data. Am. J. Hum. Genet. 108, 1880–1890 (2021).
11. Hofmeister RJ, Ribeiro DM, Rubinacci S & Delaneau O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat. Genet. 55, 1243–1249 (2023).
12. Tewhey R, Bansal V, Torkamani A, Topol EJ & Schork NJ. The importance of phase information for human genomics. Nat. Rev. Genet. 12, 215–223 (2011).
13. Browning SR & Browning BL. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).
14. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
15. Excoffier L & Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921–927 (1995).
16. Hodgkinson A & Eyre-Walker A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).
17. Ségurel L, Wyman MJ & Przeworski M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).
18. Rahbari R et al. Timing, rates and spectra of human germline mutation. Nat. Genet. 48, 126–133 (2016).
19. Lek M et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
20. Carlson J et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nat. Commun. 9, 3753 (2018).
21. Lynch M. Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. U. S. A. 107, 961–968 (2010).
22. Baxter SM et al. Centers for Mendelian Genomics: A decade of facilitating gene discovery. Genet. Med. 24, 784–797 (2022).
23. Ioannidis NM et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 99, 877–885 (2016).
24. Pejaver V et al. Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am. J. Hum. Genet. 109, 2163–2177 (2022).
25. Lassen FH et al. Exome-wide evidence of compound heterozygous effects across common phenotypes in the UK Biobank. medRxiv 2023.06.29.23291992 (2023). doi: 10.1101/2023.06.29.23291992.
26. Sharp K, Kretzschmar W, Delaneau O & Marchini J. Phasing for medical sequencing using rare variants and large haplotype reference panels. Bioinformatics 32, 1974–1980 (2016).

Methods-only references

27. Van der Auwera GA et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.33 (2013).
28. Hail Team. Hail 0.2.105-acd89e80c345. GitHub; https://github.com/hail-is/hail/commit/acd89e80c345.
29. Choi Y, Chan AP, Kirkness E, Telenti A & Schork NJ. Comparison of phasing strategies for whole human genomes. PLoS Genet. 14, e1007308 (2018).
30. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
31. International HapMap Consortium et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007).
32. McLaren W et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
33. Georgi B, Voight BF & Bućan M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).
34. Behan FM et al. Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature 568, 511–516 (2019).
35. Hart T, Brown KR, Sircoulomb F, Rottapel R & Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Syst. Biol. 10, 733 (2014).
36. Hart T et al. Evaluation and Design of Genome-Wide CRISPR/SpCas9 Knockout Screens. G3 7, 2719–2727 (2017).
37. Vinceti A et al. CoRe: a robustly benchmarked R package for identifying core-fitness genes in genome-wide pooled CRISPR-Cas9 screens. BMC Genomics 22, 828 (2021).

Supplementary Materials

Supplementary Tables 1 and 2

NIHMS1958298-supplement-Supplementary_Tables_1_and_2.xlsx (34.4 KB, xlsx)

Supplementary Note

NIHMS1958298-supplement-Supplementary_Note.pdf (17.4 MB, pdf)
