This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv [Preprint]. 2023 May 2:2023.05.01.538953. [Version 2] doi: 10.1101/2023.05.01.538953

The landscape of tolerated genetic variation in humans and primates

Hong Gao 1,, Tobias Hamp 1,, Jeffrey Ede 1, Joshua G Schraiber 1, Jeremy McRae 1, Moriel Singer-Berk 2, Yanshen Yang 1, Anastasia Dietrich 1, Petko Fiziev 1, Lukas Kuderna 1,3, Laksshman Sundaram 1, Yibing Wu 1, Aashish Adhikari 1, Yair Field 1, Chen Chen 1, Serafim Batzoglou 1,, Francois Aguet 1, Gabrielle Lemire 2,4, Rebecca Reimers 4, Daniel Balick 5, Mareike C Janiak 6, Martin Kuhlwilm 3,7,8, Joseph D Orkin 3,9, Shivakumara Manu 10,11, Alejandro Valenzuela 3, Juraj Bergman 12,13, Marjolaine Rouselle 12, Felipe Ennes Silva 14,15, Lidia Agueda 16, Julie Blanc 16, Marta Gut 16, Dorien de Vries 6, Ian Goodhead 6, R Alan Harris 17, Muthuswamy Raveendran 17, Axel Jensen 18, Idriss S Chuma 19, Julie Horvath 20,21,22,23,24, Christina Hvilsom 25, David Juan 3, Peter Frandsen 25, Fabiano R de Melo 26, Fabricio Bertuol 27, Hazel Byrne 28, Iracilda Sampaio 29, Izeni Farias 27, João Valsecchi do Amaral 30,31,32, Mariluce Messias 33,34, Maria N F da Silva 35, Mihir Trivedi 11, Rogerio Rossi 36, Tomas Hrbek 27,37, Nicole Andriaholinirina 38, Clément J Rabarivola 38, Alphonse Zaramody 38, Clifford J Jolly 39, Jane Phillips-Conroy 40, Gregory Wilkerson 41,§, Christian Abee 42, Joe H Simmons 41, Eduardo Fernandez-Duque 42,43, Sree Kanthaswamy 44, Fekadu Shiferaw 45, Dongdong Wu 46, Long Zhou 47, Yong Shao 46, Guojie Zhang 47,48,49,50,51, Julius D Keyyu 52, Sascha Knauf 53, Minh D Le 54, Esther Lizano 3,55, Stefan Merker 56, Arcadi Navarro 3,57,58,59, Thomas Batallion 12, Tilo Nadler 60, Chiea Chuen Khor 61, Jessica Lee 62, Patrick Tan 61,63,64, Weng Khong Lim 63,64,65, Andrew C Kitchener 66,67, Dietmar Zinner 68,70, Ivo Gut 16,71, Amanda Melin 72,73, Katerina Guschanski 18,74, Mikkel Heide Schierup 12, Robin M D Beck 6, Govindhaswamy Umapathy 10,11, Christian Roos 75, Jean P Boubli 6, Monkol Lek 76, Shamil Sunyaev 77,5, Anne O’Donnell 2,4,78, Heidi Rehm 2,79, Jinbo Xu 1,80, Jeffrey Rogers 17,*,, Tomas Marques-Bonet 3,16,55,57,*, Kyle Kai-How Farh 1,*
PMCID: PMC10187174  PMID: 37205491

Abstract

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole genome sequencing data for 809 individuals from 233 primate species, and identified 4.3 million common protein-altering variants with orthologs in human. We show that these variants can be inferred to have non-deleterious effects in human based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.

One Sentence Summary:

Deep learning classifier trained on 4.3 million common primate missense variants predicts variant pathogenicity in humans.


A scalable approach for interpreting the effects of human genetic variants and their impact on disease risk is urgently needed to realize the promise of personalized genomic medicine (1–3). Out of more than 70 million possible protein-altering variants in the human genome, only ~0.1% are annotated in clinical variant databases such as ClinVar (4), with the remainder being variants of uncertain clinical significance (5, 6). Despite collaborative efforts by the scientific community, the rarity of most human genetic variants has meant that progress towards deciphering personal genomes has been incremental (7, 8). Consequently, clinical sequencing tests frequently return without definitive diagnoses, a frustrating outcome for both patients and clinicians (9, 10). In certain cases, patients have needed to be recontacted and diagnoses reversed when the presumed pathogenic variant was later found to be a common variant in previously understudied human populations (11–13). Common variants can often be ruled out as the cause of penetrant genetic disease, since their high frequency in the population indicates that they are tolerated by natural selection, aside from rare exceptions due to founder effects and balancing selection (14–16).

An emerging strategy for solving clinical variant interpretation on a genome-wide scale is the use of information from closely related primate species to infer the pathogenicity of orthologous human variants (17). Because chimpanzees and humans share 99.4% protein sequence identity (18), a protein-altering variant present in one species can be expected to produce similar effects on the protein in the other species. By conducting population sequencing studies in closely related non-human primate species, it is feasible to systematically catalog common variants and rule these out as pathogenic in human, analogous to how sequencing more diverse human populations has helped to advance clinical variant interpretation (8, 17). Nonetheless, earlier work (17) was limited by the very small primate population sequencing datasets available, which bounded the number of common variants discovered, and the scale of machine learning classifiers that could be trained.

RESULTS

A database of 4.3 million benign missense variants across the primate lineage

To expand upon this strategy, we sequenced 703 individuals from 211 primate species (19), and aggregated these with data from previous studies (19–26), yielding a total of 809 individuals from 233 species. We identified 4.3 million unique missense (protein-altering) variants and 6.7 million unique synonymous (non-protein altering) variants (Fig. 1A), after excluding variants at positions that lacked unambiguous 1:1 mapping with human, or which resulted in non-concordant amino acid translation outcomes because of changes at neighboring nucleotides (fig. S1). The species selected for sequencing represent close to half of the 521 extant primate species on Earth (27) and cover all major primate families, from Old World monkeys and New World monkeys to lemurs and tarsiers. We targeted a small number of individuals per species (3.5 on average) to ensure that we primarily sampled common variants that have been filtered by natural selection rather than rare mutations (fig. S2).

Fig. 1. Common primate variants are largely benign in human.

(A) Counts of missense (solid green) and synonymous (shaded grey) variants from primates compared to the gnomAD database. Missense : synonymous counts and ratios are displayed above each bar. (B) Fractions of all possible human synonymous (grey) and missense variants (green) observed in primates. (C) Counts of benign (grey) and pathogenic (red) missense variants with two-star review status or above in the overall ClinVar database (left pie chart), compared to ClinVar variants observed in gnomAD (middle), and compared to ClinVar variants observed in primates (right). Conflicting benign and pathogenic annotations and variants interpreted only with uncertain significance were excluded. (D) Observed gnomAD (green) or primate (blue) missense variants in each amino acid position in the CACNA1A gene. Red circles represent the positions of annotated ClinVar pathogenic missense variants. Bottom scatterplot shows PrimateAI-3D predicted pathogenicity scores for all possible missense substitutions along the gene. (E) Multiple sequence alignment showing the ClinVar pathogenic variant chr11:77181548 G>A (red arrow) creating a cryptic splice site in human sequence (extended splice motif, blue). This variant is tolerated in Cebus albifrons and other species with a G>C synonymous change in the adjacent nucleotide that stops the splice motif from forming. (F) Pie charts showing the fraction of benign (grey) and pathogenic (red) missense variants with ClinVar two-star review status or above in great apes, Old World monkeys, New World monkeys, lemurs/tarsiers, mammals, chicken, and zebrafish. (G) Missense : synonymous ratios across the human allele frequency spectrum, with the missense : synonymous ratio of human variants seen in primates shown for comparison. The blue dashed line represents the expected missense : synonymous ratio of de novo variants. Colors and legend are the same as (A).

Compared to the genome Aggregation Database (gnomAD) cohort of 141,456 human individuals from diverse populations (28, 29), the primate sequencing cohort contained ~20% more exome variants despite sequencing 1/175th the number of individuals (Fig. 1A and fig. S3), attesting to the remarkable genetic diversity present in non-human primate species (19, 30), many of which are critically endangered (31). The overlap of primate variants with gnomAD was low, consistent with independent mutational origins in each species (fig. S3). Out of the 22 million possible synonymous variants in the human genome, 30% were observed in the primate cohort, compared to just 6% of possible missense mutations (Fig. 1B). Because de novo mutations would have laid down unbiased proportions of missense and synonymous variants, the observed depletion of missense mutations in the primate cohort is consistent with the majority of newly-arising human missense mutations being removed by natural selection due to their deleteriousness (8, 32–34). The surviving missense variants are seen at high frequencies in primate populations, and represent a subset of missense variants that have tolerated filtering by natural selection and are unlikely to be pathogenic (35).

Missense variants from the primate cohort are strongly enriched for benign consequence in the ClinVar clinical variant database (Fig. 1C). Amongst ClinVar variants with higher review levels (2-star and above, indicating consensus by multiple submitters) (4), missense variants found in at least one non-human primate species were Benign or Likely Benign ~99% of the time, compared to 63% for ClinVar missense variants in general, and 80% for missense variants seen in gnomAD (Fig. 1C). The high fraction of pathogenic variants in gnomAD is consistent with the majority of these variants having arisen recently. Indeed, recent exponential human population growth introduced large numbers of rare variants through random de novo mutations (95% of variants in the gnomAD cohort are at < 0.01% population allele frequency), without sufficient time for selection to purge deleterious variants from the population (36–40). Consequently, the gnomAD cohort provides a comparatively unfiltered look at variation caused by random mutations, whereas primate common variants represent the subset of random mutations that have survived.

The regions of human disease genes that were most densely populated by ClinVar pathogenic variants were also strongly depleted for primate common variants, with examples shown for CACNA1A (Fig. 1D) and CREBBP (fig. S4), genes responsible for familial epilepsy (41, 42) and Rubinstein-Taybi syndrome (43, 44). Missense variants in the gnomAD cohort were partially depleted within these same critical regions (Fig. 1D and fig. S4), indicating that humans and primates experience similar selective pressures. However, deleterious variants were incompletely removed in humans, consistent with the shorter amount of time they were exposed to natural selection.

Prior to using primate data as an indicator of benign consequence in a diagnostic setting, it is vital to understand why a handful of human pathogenic ClinVar variants appear as tolerated common variants in primates. Our clinical laboratory independently reviewed evidence for each of the 36 ClinVar pathogenic variants that appeared in the primate cohort, according to ACMG guidelines (14). Among these 36 variants, 8 were reclassified as variants of uncertain significance based on insufficient evidence of pathogenicity in the literature and an additional 9 were hypomorphic or mild clinical variants (table S1). The remaining 19 variants appear to be truly pathogenic in human, and are presumably tolerated in primate because of primate-human differences, such as interactions with changes in the neighboring sequence context (45, 46). In one such example, a compensatory synonymous sequence change at an adjacent nucleotide explains why the variant is benign in primate, but creates a pathogenic splice defect in human (Fig. 1E). We also expect that some of the variants identified among primates are rare pathogenic variants by chance, despite the small number of individuals sequenced within each species. By expanding our cohort to sequence a large number of individuals per species, we would definitively exclude rare variation from our catalogue of primate variation, as well as grow the database of benign variants to improve clinical variant interpretation.

As evolutionary distance from human increases, cases where the surrounding sequence context has changed sufficiently to alter the effect of the variant should also increase, until common variants in more distant species could no longer be reliably counted on as benign in human. We examined variation in each major branch of the primate tree, as well as variation from mammals (mouse, rat, cow, dog), chicken, and zebrafish, and evaluated their pathogenicity in ClinVar (Fig. 1F). Common variants from species throughout the primate lineage, including more distant branches such as lemurs and tarsiers, varied from 98.6% to 99% benign in the human ClinVar database, but this dropped to 87% for placental mammals, and 71% for chicken. The high fraction of variants that are pathogenic in human, yet tolerated as common variants in more distant vertebrates, indicates that selection on orthologous variants diverges substantially in distantly-related species, as a consequence of changes in the surrounding sequence context and other differences in the species’ biology (fig. S5).

We have made the primate population variant database, which contains over 4.3 million likely benign missense variants, publicly available at https://primad.basespace.illumina.com as a reference for the genomics community. Overall, this resource is over 50 times larger than ClinVar in terms of number of annotated missense variants, and consists almost entirely of variants of previously unknown significance. Most primate variants are rare or absent in the human population, with 98% of these variants at allele frequency < 0.01% (fig. S6). This makes it challenging to establish their pathogenicity through other means, since even the largest sequencing laboratories would be unlikely to observe any given variant in more than one unrelated patient. Despite their rarity, the subset of human variants that appear in primates have a low missense : synonymous ratio consistent with being depleted of deleterious missense variants (Fig. 1G). This contrasts with the high missense : synonymous ratio for rare human variants in the overall gnomAD cohort, which approaches the 2.2:1 ratio expected for random de novo mutations in the absence of selective constraint (47). At higher allele frequencies, natural selection has had more time to purge deleterious missense variants, allowing the human missense : synonymous ratio to start to converge toward the ratio observed for the subset of human variants that are present in other primates.

Gene-level selective constraint in humans versus non-human primates

The primate variant resource makes it possible to compare natural selection acting on individual genes across the primate lineage and identify human-specific evolutionary differences. Since the current primate cohort only contains an average of 3–4 individuals per species, we focused on comparing selective constraint in human genes versus primates as a whole. We found that the missense : synonymous ratios of individual genes were well-correlated between human and primates (Spearman r = 0.637) (Fig. 2A), indicating that genes which were depleted for deleterious missense mutations in human were also consistently depleted throughout the primate lineage. Moreover, the missense : synonymous ratios of both human and primate genes correlated similarly well with the probability of genes being loss of function intolerant (pLI) (Spearman correlation −0.534 and −0.489, respectively) (28). Had there been substantial divergence between human and primate, pLI, an independent metric derived from human protein-truncating variation, would have been expected to show much clearer agreement with human missense : synonymous ratios than primate.

Fig. 2. Selective constraint of primate genes compared to human.

(A) Scatter plot of missense : synonymous ratios between primate and human genes. Each gene is colored by its pLI score, with darker points showing haploinsufficient genes. (B) Observed and expected counts of synonymous (top) and missense (bottom) variants per gene in gnomAD (left) and primates (right). Genes are colored by their pLI scores. (C) Distributions of observed/expected ratios of synonymous (dashed lines) and missense (solid lines) variants for all genes. Results for primate genes (orange) and gnomAD genes (blue) are shown. (D) Scatter plot of missense : synonymous ratios between primate and human genes. Highlighted points are genes that are under significantly stronger (blue) or weaker (red) constraint in humans compared to non-human primates under both methods (Benjamini-Hochberg FDR < 0.05), while grey points show non-significant genes. The top 10 genes with the largest effect sizes in either direction are labeled.

To measure the selective constraint on each gene, we calculated the observed versus expected number of variants per gene, using trinucleotide mutation rates to model the expected probability of observing each variant (fig. S7) (28, 29). We modeled each primate species separately to account for differences in genetic diversity and the number of individuals sampled per species. The expected and observed counts of synonymous variants were highly correlated in both the gnomAD and primate cohorts, indicating that our model accurately captured the background distribution of neutral mutations (Fig. 2B; Spearman correlation 0.933 and 0.949, respectively). In contrast, for missense variants the expected and observed counts per gene diverged substantially (Spearman correlation 0.896 and 0.561 for human and primate, respectively), due to depletion of deleterious missense variants by natural selection in highly constrained genes (for example, high pLI genes). The most highly constrained genes were almost completely scrubbed of common missense variants in the primate cohort, whereas rare missense variants in the gnomAD cohort were depleted to a more modest extent due to the large sample size of gnomAD (Fig. 2C).
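
As a rough illustration of the observed versus expected calculation, the sketch below sums per-variant mutation rates over each gene and rescales them so that the total expected synonymous count matches the total observed synonymous count. The gene names, rates, and counts are invented, and the single calibration factor is an assumption made for illustration; it is not the per-species model fitted in this study.

```python
def calibrate_scale(syn_rates, syn_observed):
    """Choose a scale so total expected synonymous variants equal the observed total."""
    total_rate = sum(sum(rates) for rates in syn_rates.values())
    return sum(syn_observed.values()) / total_rate


def expected_counts(gene_rates, scale):
    """Expected variants per gene = scale * sum of per-variant mutation rates."""
    return {gene: scale * sum(rates) for gene, rates in gene_rates.items()}


def obs_exp_ratio(observed, expected):
    """Observed / expected ratio per gene; values well below 1 suggest constraint."""
    return {gene: observed[gene] / expected[gene] for gene in observed}


# Hypothetical toy data: per-variant (trinucleotide-context) mutation rates.
syn_rates = {"GENE_A": [1e-8, 2e-8, 1e-8], "GENE_B": [2e-8, 2e-8]}
mis_rates = {"GENE_A": [1e-8] * 6, "GENE_B": [2e-8] * 4}
syn_obs = {"GENE_A": 4, "GENE_B": 4}
mis_obs = {"GENE_A": 2, "GENE_B": 7}

scale = calibrate_scale(syn_rates, syn_obs)
mis_exp = expected_counts(mis_rates, scale)
mis_oe = obs_exp_ratio(mis_obs, mis_exp)
```

In this toy setup GENE_A retains a smaller share of its expected missense variants than GENE_B, which is the pattern the text describes for highly constrained genes.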

We next aimed to identify genes whose selective constraint was different in human compared to the rest of the primate lineage, a task made difficult by differences in diversity, allele frequency, and sample size between the human and primate cohorts (34, 48, 49). To this end, we developed two orthogonal strategies, and took the intersection of genes identified under both approaches. First, we used population genetic modeling (34, 50, 51) to estimate the average selection coefficient, s, ranging from 0 (benign) to 1 (severely pathogenic), of missense mutations in each gene, using a model of recent human population growth (figs. S7 and S8). We fit a single value of s per gene across non-human primate species, and identified genes that differed between s_primate and s_human using a likelihood ratio test, which we validated using population simulations (fig. S9). In a second approach, we fit a curve approximating the relationship between human and primate missense : synonymous ratios using a Poisson generalized linear mixed model (52), and identified genes where the observed human missense : synonymous ratio deviated from what would have been expected given the gene’s missense : synonymous ratio in primates (fig. S10). We also adjusted for gene length to account for shorter genes having more variability in their missense : synonymous ratio measurements than longer genes. The two methods were broadly concordant, with a Spearman correlation of 0.80 between the genes’ effect sizes in the two tests. Estimates of selection coefficients and observed and expected counts for each gene in human and primate are provided in table S2.

In total, we found 39 genes where selective constraint differed significantly between human and other primates under both methods (Benjamini-Hochberg FDR < 0.05 (53); Fig. 2D). The top three genes where s_human decreased the most relative to s_primate were CFTR, GJB2, and CD36, autosomal recessive disease genes for cystic fibrosis (54), hereditary deafness (55), and platelet glycoprotein deficiency (56), respectively. All three genes are known for deleterious mutations that are unusually common in local geographic human populations (57–60), suggesting that they may be experiencing reduced selection due to heterozygote advantage that protects against specific environmental pathogens (60–64). On the other end of the spectrum, TERT, known for its role in maintaining telomere length (65, 66), was among the top genes where s_human increased the most relative to s_primate. Humans have adapted to a much longer lifespan compared with other primate species, which have a median lifespan of 20–30 years, suggesting that increased selection on TERT may have occurred as part of human adaptation towards extended longevity. We note that with the current size of the primate cohort, it is not possible to distinguish whether the increased selection on TERT occurred only in humans, or if it is part of a gradual trend towards extended longevity that began earlier in the great ape lineage, whose members also have longer lifespans than other primates (~40 years). Expanding the primate cohort by sequencing more individuals per species would improve detection of additional species-specific and lineage-specific evolutionary adaptation, and shed light on the evolutionary path that led to the present human condition.
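
The multiple-testing step can be sketched as follows: a minimal Benjamini-Hochberg adjustment, followed by keeping only genes that pass the FDR threshold under both tests. Applying the correction separately to each method's p-values is an assumption here, and the function names and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_in_both(genes, p_method1, p_method2, fdr=0.05):
    """Genes whose adjusted p-value is below `fdr` under both methods."""
    q1 = benjamini_hochberg(p_method1)
    q2 = benjamini_hochberg(p_method2)
    return [g for g, a, b in zip(genes, q1, q2) if a < fdr and b < fdr]


# Hypothetical p-values for three genes under the two tests.
hits = significant_in_both(
    ["G1", "G2", "G3"],
    [0.001, 0.2, 0.01],
    [0.002, 0.01, 0.5],
)
```

Requiring significance under both tests is conservative: a gene flagged by only one approach is dropped.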

PrimateAI-3D, a deep learning network for classifying protein-altering variants

We constructed PrimateAI-3D, a semi-supervised 3D-convolutional neural network for variant pathogenicity prediction, which we trained using 4.5 million common missense variants with likely benign consequence (Fig. 3A). In a departure from prior deep learning architectures that operated on linear sequence (17, 67), we voxelized the 3D structure of the protein at 2 Angstrom resolution (figs. S11 and S12) and used 3D-convolutions to enable the network to recognize key structural regions that may not be apparent from sequence alone (Fig. 3A). As an example, we show PrimateAI-3D predictions for STK11 (Fig. 3B), the tumor suppressor gene responsible for Peutz-Jeghers hereditary polyposis syndrome (68–71), with each amino acid position colored by the average PrimateAI-3D score at that position. Common primate variants used for training and annotated ClinVar pathogenic variants from separate parts of the linear sequence form distinct clusters in 3D space. Although ClinVar variants are shown for illustration, it is important to note that the network was not trained on either human-engineered features or annotated variants from clinical variant databases, thereby avoiding potential human biases in variant annotation. Rather, it learns to infer pathogenicity based on the local enrichment or depletion of common primate variants, taking only the protein’s multiple sequence alignment and 3D structure as inputs.
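
To make the voxelization idea concrete, the sketch below bins atom coordinates into a cubic grid of 2 Angstrom voxels centered on a target residue. Only the 2 Angstrom resolution comes from the text; the grid size of 7 voxels per side, the simple atom-count feature, and the function name are assumptions for illustration, and the actual input features used by PrimateAI-3D are not reproduced here.

```python
import math


def voxelize(atoms, center, grid_size=7, voxel_size=2.0):
    """Bin atoms into a cubic grid centered on `center`.

    atoms: iterable of (x, y, z) coordinates in Angstroms.
    Returns {(i, j, k): atom_count} for occupied voxels only.
    """
    half = grid_size * voxel_size / 2.0
    grid = {}
    for atom in atoms:
        # Shift so the grid corner sits at the origin, then bin by voxel size.
        idx = tuple(
            math.floor((coord - c + half) / voxel_size)
            for coord, c in zip(atom, center)
        )
        # Atoms outside the cube around the target residue are ignored.
        if all(0 <= i < grid_size for i in idx):
            grid[idx] = grid.get(idx, 0) + 1
    return grid


# Hypothetical coordinates around a target residue placed at the origin.
atoms = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (2.5, 0.0, 0.0), (20.0, 0.0, 0.0)]
grid = voxelize(atoms, center=(0.0, 0.0, 0.0))
```

A grid like this can be fed to 3D convolutions in the same way an image is fed to 2D convolutions, which is what lets spatially adjacent residues from distant parts of the sequence influence each other.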

Fig. 3. PrimateAI-3D architecture and variant classification performance.

(A) PrimateAI-3D workflow. Human protein structures and multiple sequence alignments are voxelized (left) as input to a 3D convolutional neural network that predicts pathogenicity of all possible point mutations of a target residue (middle). The network is trained using a loss function with three components (right): common human and primate variants; fill-in-the-blank of a protein structure; score ranks from language models. (B) Protein structure of the STK11 gene, colored by PrimateAI-3D pathogenicity prediction scores (blue: benign; red: pathogenic). Spheres indicate residues with common human and primate variants (left) or residues with pathogenic mutations from ClinVar (right). For spheres, the color corresponds to the pathogenicity score of only the variant. For other residues, pathogenicity scores are averaged over all variants at that site. (C) Scatterplot shows performance of methods that predict missense variant pathogenicity in two clinical benchmarks (DDD and UKBB). Datasets are a subset of variants for which all methods have predictions. (D) Six barplots show method performance for six testing datasets (DMS assays, UKBB, ClinVar, DDD, ASD, and CHD).

PrimateAI-3D can utilize protein structures from either experimental sources or computational prediction (72–76); we used AlphaFold DB (72, 73) and HHpred (74) predicted structures for the broadest coverage across human genes. For training data, we incorporated all common missense variants from the 233 non-human primate species (17), and common human missense variants (allele frequency > 0.1% across populations) in gnomAD (28, 29), TOPMed (77, 78), and UK Biobank (79, 80), resulting in a total of 4.5 million unique missense variants of likely benign consequence. This dataset covers 6.34% of all possible human missense variants, and is over 50-fold larger than the current ClinVar database (79,381 missense variants after excluding variants of uncertain significance and those with conflicting annotations), greatly enlarging the training dataset available for machine learning approaches. Because the training dataset consists only of variants labeled as benign, we created a control set of randomly selected variants that were matched to the common variants by trinucleotide mutation rate, and trained PrimateAI-3D to separate common variants from matched controls as a semi-supervised learning task.
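
One simple way to build such a matched control set is sketched below: for each benign variant, draw a random unobserved variant with the same trinucleotide context. Matching on context class is used here as a stand-in for matching on mutation rate, which is an assumption; the variant identifiers, context strings, and function name are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_matched_controls(benign, candidates, seed=0):
    """Draw control variants matched to benign variants by trinucleotide context.

    benign, candidates: lists of (variant_id, context) tuples,
    where context is a label such as "ACG>ATG".
    """
    rng = random.Random(seed)
    benign_ids = {vid for vid, _ in benign}
    needed = Counter(ctx for _, ctx in benign)

    # Group candidate variants by context, skipping anything already benign.
    pool = defaultdict(list)
    for vid, ctx in candidates:
        if vid not in benign_ids:
            pool[ctx].append(vid)

    controls = []
    for ctx, n in needed.items():
        available = pool.get(ctx, [])
        # Sample without replacement; take everything if the pool is too small.
        picked = rng.sample(available, min(n, len(available)))
        controls.extend((vid, ctx) for vid in picked)
    return controls


benign = [("b1", "ACG>ATG"), ("b2", "ACG>ATG"), ("b3", "TTA>TCA")]
candidates = [
    ("c1", "ACG>ATG"), ("c2", "ACG>ATG"), ("c3", "ACG>ATG"),
    ("c4", "TTA>TCA"), ("b1", "ACG>ATG"), ("c5", "GGG>GAG"),
]
controls = sample_matched_controls(benign, candidates)
```

Matching the mutational background means a classifier cannot separate the two sets by mutability alone and has to rely on signals related to selection.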

In parallel with the variant classification task, we generated amino acid substitution probabilities for each position in the protein by masking the residue and using the sequence context to predict the missing amino acid, borrowing from language model architectures that are trained to predict missing words in sentences (81, 82). We trained both a 3D convolutional “fill-in-the-blank” model, which tasked the network with predicting the missing amino acid in a gap in the voxelized 3D protein structure, and separately, a language model utilizing the transformer architecture to predict the missing amino acid using the surrounding multiple sequence alignment as context (83). We implemented these models as additional loss functions to further refine the PrimateAI-3D predictions (fig. S13). We also trained a variational autoencoder (67) on multiple sequence alignments and found that it performed comparably to our transformer architecture (fig. S14). Hence, we incorporated the average of their predictions in the loss function, which performed better than either alone.

We evaluated PrimateAI-3D and 15 other published machine learning methods (67, 84) on their ability to distinguish between benign and pathogenic variants along six different axes (Fig. 3C, 3D, and fig. S15): predicting the effects of rare missense variants on quantitative clinical phenotypes in a cohort of 200,643 individuals from the UK Biobank (UKBB); distinguishing missense de novo mutations (DNM) seen in 31,058 patients with neurodevelopmental disorders (85–87) (DDD) from de novo missense mutations in 2,555 healthy controls (88–93); distinguishing de novo missense mutations seen in 4,295 patients with autism spectrum disorders (88–94) (ASD) from de novo missense mutations in the shared set of 2,555 healthy controls; distinguishing de novo missense mutations seen in 2,871 patients with congenital heart disease (95) (CHD) from de novo missense mutations in the shared set of 2,555 healthy controls; separating annotated ClinVar benign and pathogenic variants (ClinVar) (4); and average correlation with in vitro deep mutational scan experimental assays across 9 genes (96–105) (DMS assays). Our set of clinical benchmarks is the most comprehensive to date, and has a particular focus on rigorously testing the performance of classifiers on large patient cohorts across a diverse range of real world clinical settings (table S3).

For the UK Biobank benchmark, we analyzed 200,643 individuals with both exome sequencing data and broad clinical phenotyping, and identified 42 genes where the presence of rare missense variants was associated with changes in a quantitative clinical phenotype controlling for confounders such as population stratification, age, sex, and medications (table S4). These gene-phenotype associations included diverse clinical lab measurements such as low-density lipoprotein (LDL) cholesterol (increased by rare missense variants in LDLR, decreased by variants in PCSK9), blood glucose (increased by variants in GCK), and platelet count (increased by variants in JAK2, decreased by variants in GP1BB), as well as other quantitative phenotypes such as standing height (increased by variants in ZFAT) (table S4). To test each classifier’s ability to distinguish between pathogenic and benign missense variants, we measured the correlation between pathogenicity prediction score and quantitative phenotype for patients carrying rare missense variants in each of these genes. We report the average correlation across all gene-phenotype pairs for each classifier, taking the absolute value of the correlation, since these genes may be associated with either increase or decrease in the quantitative clinical phenotype.
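
The benchmark metric described above can be sketched as a mean of absolute correlations over gene-phenotype pairs. The text does not specify the correlation type at this point, so Spearman rank correlation is assumed here; the scores and phenotype values are invented.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_abs_correlation(pairs):
    """pairs: {(gene, phenotype): (prediction_scores, phenotype_values)}."""
    return sum(abs(spearman(s, p)) for s, p in pairs.values()) / len(pairs)


# Hypothetical carriers: higher scores raise LDL for LDLR and lower it for PCSK9.
pairs = {
    ("LDLR", "LDL"): ([0.1, 0.5, 0.9], [100, 130, 190]),
    ("PCSK9", "LDL"): ([0.2, 0.6, 0.8], [120, 90, 70]),
}
score = mean_abs_correlation(pairs)
```

Taking the absolute value lets genes with opposite directions of effect, such as LDLR and PCSK9 in the text, contribute equally to a classifier's score.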

The neurodevelopmental disorders cohort (DDD), autism spectrum disorders cohort (ASD), and congenital heart disease cohort (CHD) are among the largest published trio-sequencing studies to date, and consist of thousands of families with a child with rare genetic disease and their unaffected parents. In each cohort, we cataloged de novo missense mutations that appeared in affected probands but were absent in their parents, as well as de novo missense mutations that appeared in a set of shared healthy controls. We evaluated the ability of each classifier to separate the de novo missense mutations that appear in cases versus controls on the basis of their prediction scores, using the Mann-Whitney U test to measure performance.
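
For illustration, the Mann-Whitney U statistic for case versus control scores can be computed directly by counting pairs, as in the sketch below. The scores are invented, the quadratic-time loop is chosen for clarity rather than speed, and the p-value computation is omitted.

```python
def mann_whitney_u(case_scores, control_scores):
    """U statistic for cases: pairs where the case outscores the control.

    Ties count as half a pair.
    """
    u = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                u += 1.0
            elif case == control:
                u += 0.5
    return u


def auc_from_u(u, n_cases, n_controls):
    """Probability that a random case scores above a random control."""
    return u / (n_cases * n_controls)


# Hypothetical pathogenicity scores for de novo missense mutations.
cases = [0.9, 0.8, 0.4]
controls = [0.5, 0.3]
u = mann_whitney_u(cases, controls)
auc = auc_from_u(u, len(cases), len(controls))
```

Dividing U by the number of case-control pairs gives a value between 0 and 1 that is comparable across cohorts of different sizes.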

PrimateAI-3D outperformed all other classifiers at distinguishing pathogenic from benign variants in the four patient cohorts we tested (UKBB, DDD, ASD, CHD); it was also the top performer at separating pathogenic from benign variants in the ClinVar annotation database, and had the highest average correlation with the deep mutational scan assays (Fig. 3D and fig. S15). After PrimateAI-3D, there was no clear runner-up, with second place occupied by six different classifiers in the six different benchmarks. We observed a moderate correlation between the performance of different classifiers in UKBB and DDD (Spearman r = 0.556; Fig. 3C), which are the two largest clinical cohorts and therefore likely the most robust for benchmarking (with 200,643 and 33,613 patients, respectively), but outside of PrimateAI-3D, strong performance of a classifier on one task had limited generalizability to other tasks. Our results underscore the importance of validating machine learning classifiers along multiple dimensions, particularly in large real-world cohorts, to avoid overgeneralizing a classifier’s performance based upon an impressive showing along a single axis.

PrimateAI-3D’s top-ranked performance at separating benign and pathogenic missense variants in ClinVar was unexpected, since the other machine learning classifiers (with the exception of EVE) were trained either directly on ClinVar, or on other variant annotation databases with a high degree of content overlap. Because they are primarily based on variants described in the literature, clinical variant databases are subject to ascertainment bias (12, 106, 107), which may have contributed to supervised classifiers picking up on tendencies of human variant annotation that are unrelated to the task of separating benign from pathogenic variants (figs. S16, S17, and S18). Given the challenges with human annotation, we also investigated whether PrimateAI-3D could assist in revising incorrectly labeled ClinVar variants, by comparing annotations in the current ClinVar database and those from a September 2017 snapshot. Disagreement between PrimateAI-3D and the 2017 version of ClinVar was highly predictive of future revision and the odds of revision increased with PrimateAI-3D confidence (fig. S19). Among variants with the 10% most confident PrimateAI-3D predictions, the odds of revision were 10-fold elevated if PrimateAI-3D was in disagreement with the ClinVar label (P < 10^−14).

The performance of PrimateAI-3D on clinical variant benchmarks scaled directly with training dataset size, indicating that additional primate sequencing data will be the key to unlocking further gains (Fig. 4 and fig. S20). The current primate cohort already covers 30% of all possible synonymous variants in the human genome, despite containing only 809 individuals from 233 species (Fig. 4B). By increasing the number of species and the number of individuals sequenced per species, we expect to saturate the majority of the remaining tolerated substitutions in the human genome (fig. S21), including both coding and noncoding variation, leaving the remaining deleterious variants to be deduced by process of elimination.

Fig. 4. Impact of training dataset size on classification accuracy.

(A) Improved performance of PrimateAI-3D with increasing number of common human and primate variants in the training dataset (x-axis). Performance of each dataset (y-axis) was divided by the maximum performance observed across all training dataset sizes. (B) Cumulative fractions of all possible human synonymous (grey) and missense (green) variants observed as common variants in 234 primate species, including human (allele frequency > 0.1%). Each point shows the average of ten permutations, calculated with a different random ordering of the list of primate species each time.
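The permutation averaging described for panel B can be sketched as follows. The species names, variant sets, and the count of possible variants are invented placeholders, and the function is a hypothetical helper rather than the code used to draw the figure.

```python
import random


def mean_cumulative_fraction(species_variants, n_possible, n_permutations=10, seed=0):
    """Average, over random species orderings, of the fraction of all possible
    variants seen after adding the 1st, 2nd, ... species."""
    rng = random.Random(seed)
    names = list(species_variants)
    totals = [0.0] * len(names)
    for _ in range(n_permutations):
        rng.shuffle(names)  # a new random ordering of the species list
        seen = set()
        for i, name in enumerate(names):
            seen |= species_variants[name]
            totals[i] += len(seen) / n_possible
    return [t / n_permutations for t in totals]


# Toy input: variant identifiers observed as common variants in each species.
toy = {"sp1": {1, 2, 3}, "sp2": {3, 4}, "sp3": {5}}
curve = mean_cumulative_fraction(toy, n_possible=10)
print(curve)
```

Averaging over orderings removes the dependence of the curve's shape on which species happens to be added first; the final point is the same in every permutation.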

Discovery of candidate disease genes for neurodevelopmental disorders

We applied PrimateAI-3D to improve statistical power for discovering candidate disease genes that are enriched for pathogenic de novo mutations in the neurodevelopmental disorders cohort (fig. S22). De novo missense mutations from affected individuals in the DDD cohort (87) were enriched 1.36-fold above expectation, based on estimates of background mutation rate using trinucleotide context (47). We selected a PrimateAI-3D classification threshold of 0.821, which called an equal number of pathogenic missense mutations (n=7,238) as the excess of de novo missense mutations in the cohort (Fig. 5A). Stratifying missense mutations by this threshold increased enrichment of pathogenic de novo missense mutations to 2.0-fold, substantially increasing statistical power for disease gene discovery in the cohort (Fig. 5B).
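The threshold selection described above can be illustrated with a short sketch: the cutoff is placed so that the number of variants called pathogenic equals the excess of observed de novo missense mutations over expectation. The scores and expected count are toy values, the function name is hypothetical, and tied scores are not handled.

```python
def threshold_for_excess(scores, expected_count):
    """Return the score cutoff that calls as many variants pathogenic as the
    observed excess over expectation (observed minus expected)."""
    excess = round(len(scores) - expected_count)
    if excess <= 0:
        raise ValueError("no excess over expectation")
    ranked = sorted(scores, reverse=True)
    return ranked[excess - 1]  # variants scoring >= this value are called


# Toy cohort: 8 observed de novo missense variants, 5 expected by chance.
scores = [0.95, 0.10, 0.88, 0.40, 0.83, 0.30, 0.67, 0.91]
cutoff = threshold_for_excess(scores, expected_count=5.0)
called = [s for s in scores if s >= cutoff]
print(cutoff, len(called))
```

In this toy case the observed enrichment is 8/5 = 1.6-fold, so the three top-scoring variants are called pathogenic.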

Fig. 5. Enrichment of de novo mutations in the neurodevelopmental disorder cohort over expectation.

(A) Enrichment of DNMs from Kaplanis et al. (87) across all genes. Enrichment ratios are given for synonymous, all missense, and protein-truncating variants (PTV), along with missense split by PrimateAI-3D score into benign (<0.821) and pathogenic (>0.821). (B) Enrichment of benign and pathogenic missense above expectation at varying PrimateAI-3D thresholds for classifying pathogenic missense.

By applying PrimateAI-3D to prioritize pathogenic missense variants, we identified 290 genes associated with intellectual disability at genome-wide significance (P < 6.4×10^−7) (Table 1), of which 272 were previously discovered genes that either appeared in the Genomics England intellectual disability gene panel (108), or were already identified in the prior study (109) without stratifying missense variants (table S5). We excluded two genes, BMPR2 and RYR1, as borderline significant genes that already had well-annotated non-neurological phenotypes. Further clinical studies are needed to independently validate this list of candidate genes and understand their range of phenotypic effects.

Table 1. Additional genes discovered in intellectual disability.

HGNC symbol  Protein-truncating  Missense DNMs with           All missense  Missense P value,            Missense P value,
             variants            PrimateAI-3D score ≥ 0.821   DNMs          PrimateAI-3D score ≥ 0.821   all missense
AP1G1        2                   4                            5             4.1×10^−7                    5.9×10^−5
ATP2B2       1                   9                            11            2.1×10^−7                    1.4×10^−3
CELF2        2                   4                            4             1.2×10^−7                    6.7×10^−5
MAP4K4       2                   6                            7             3.9×10^−7                    5.0×10^−4
MED13        3                   6                            9             6.6×10^−8                    3.5×10^−5
MFN2         0                   6                            8             3.4×10^−7                    1.0×10^−5
NR4A2        2                   4                            5             3.7×10^−7                    3.3×10^−5
PIP5K1C      0                   8                            9             2.8×10^−8                    4.9×10^−4
RAB5C        2                   4                            5             8.6×10^−8                    1.5×10^−5
SPOP         1                   4                            6             4.1×10^−7                    1.7×10^−6
SPTBN2       1                   10                           16            3.9×10^−7                    4.5×10^−3
XPO1         1                   7                            7             5.0×10^−7                    7.2×10^−4
EIF4A2       2                   4                            4             1.7×10^−7                    2.1×10^−4
LMBRD2       0                   3                            4             6.0×10^−7                    1.3×10^−4
MARK2        4                   3                            5             2.3×10^−7                    3.8×10^−5
NOTCH1       4                   6                            17            4.1×10^−7                    1.3×10^−6

Genes achieving genome-wide significance (P < 6.4×10^−7) are shown when considering only missense de novo mutations with PrimateAI-3D scores ≥ 0.821. Counts of protein-truncating and missense DNMs are provided. P values for gene enrichment are shown when the statistical test was run only with missense mutations with PrimateAI-3D score ≥ 0.821, and when it was repeated for all missense mutations.

Discussion

Our results demonstrate the successful pairing of primate population sequencing with state-of-the-art deep learning models to make meaningful progress towards solving variants of uncertain significance. Primate population sequencing and large-scale human sequencing are likely to fill complementary roles in advancing clinical understanding of human genetic variants. From the perspective of acquiring additional benign variants to train PrimateAI-3D, humans are not suitable, as the discovery of common human variants (>0.1% allele frequency) plateaus at roughly 100,000 missense variants after only a few hundred individuals (17), and further population sequencing into the millions mainly contributes rare variants which cannot be ruled out for deleterious consequence. On the other hand, these rare human variants, because they have not been thoroughly filtered by natural selection, preserve the potential to exert highly penetrant phenotypic effects, making them indispensable for discovering new gene-phenotype relationships in large population sequencing and biobank studies. Fittingly, classifiers trained on common primate variants may accelerate these target discovery efforts, by helping to differentiate between benign and pathogenic rare variation.

The genetic diversity found in the 520 known non-human primate species is the result of ongoing natural experiments on genetic variation that have been running uninterrupted for millions of years. Today, over 60% of primate species on Earth are threatened with extinction in the next decade due to man-made factors (31). We must decide whether to act now to preserve these irreplaceable species, which act as a mirror for understanding our genomes and ourselves, and are each valuable in their own right, or bear witness to the conclusion of many of these experiments.

Materials and Methods

Primate polymorphism data

We aggregated high-coverage whole genomes of 809 primate individuals across 233 primate species, including 703 newly sequenced samples and 106 previously sequenced samples from the Great Ape Genome project (19). Samples that passed quality evaluation were then aligned to 32 high-quality primate reference genomes (110) and mapped to the GRCh38 human genome build.

We developed a random forest (RF) classifier to identify false positive variant calls and errors resulting from ambiguity in the species mapping. In addition, we removed variants that fell in primate codons that did not match the human codon at that position, as well as those residing in primate transcripts with likely annotation errors. We also devised quality metrics based on the distribution of RF scores and Hardy-Weinberg equilibrium, and developed a unique mapping filter to exclude variants in regions of non-unique mapping between primate species.

Identifying differential selection between humans and primates via population modeling

We first established a neutral background distribution of mutation rates per gene for each primate species by fitting the Poisson Random Field (PRF) model to the segregating synonymous variants in each species. The observed number of segregating synonymous sites is a Poisson random variable, with the mean determined by mutation rate, demography, and sample size (34). For simplicity, we assumed an equilibrium (i.e. constant) demography for all species besides human; for human, we used Moments (51) to find a best fitting demographic history based on the folded site frequency spectrum of synonymous sites. We adopted a Gamma distributed prior on mutation rates, which also accounts for the impact of GC content on mutation rate. We optimized the prior parameters via maximum likelihood and computed the posterior distribution of the mutation rate per gene.
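To make the Poisson and Gamma pairing concrete, the sketch below works through the simplest case only: a neutral, constant-size population, where the expected number of segregating sites in a sample of n chromosomes is the population-scaled mutation rate times the harmonic sum of 1/i for i from 1 to n−1. With a Gamma prior on that rate, the posterior is again a Gamma distribution. The prior parameters and counts are invented, the function names are hypothetical, the GC-content dependence of the prior is left out, and this closed-form update is an assumption about how such a model could be evaluated, not a description of the authors' implementation.

```python
def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the neutral equilibrium scaling factor
    for a sample of n chromosomes."""
    return sum(1.0 / i for i in range(1, n))


def posterior_gamma(shape, rate, n_segregating, exposure):
    """Conjugate update: a Gamma(shape, rate) prior on theta combined with
    S ~ Poisson(theta * exposure) gives a Gamma(shape + S, rate + exposure)
    posterior."""
    return shape + n_segregating, rate + exposure


# Toy gene: 12 segregating synonymous sites in a sample of 8 chromosomes.
n_chromosomes = 8
exposure = harmonic(n_chromosomes)
post_shape, post_rate = posterior_gamma(2.0, 1.0, n_segregating=12, exposure=exposure)
posterior_mean = post_shape / post_rate
print(post_shape, post_rate, posterior_mean)
```

Under a non-equilibrium demography, as fitted for human, the harmonic sum would be replaced by the corresponding expectation under that demographic history.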

The number of segregating nonsynonymous sites is modeled as a Poisson random variable, as for synonymous sites, but with additional selection parameters. We assumed that every nonsynonymous mutation in a gene shares the same population-scaled selection coefficient γ_{ig}. To explicitly estimate the selection coefficient of each gene per species, we devised a two-step procedure analogous to an EM algorithm to control for differences in population size across species.

To identify genes where human constraint is different from non-human primate selection, we developed a likelihood ratio test to test whether population scaled selection coefficients are significantly different between human and other primates. We then assessed whether our population genetic modeling improved the correlation of selection estimates of our primate data with previous gene-constraint metrics in humans, including pLI (28) and s_het (111). To validate the performance of our model, we performed population genetic simulations.

Poisson generalized linear mixed modeling of selection between humans and primates

In addition to the population genetics model described above, we also applied an orthogonal approach to detect differences in selection between humans and primates based on the missense-to-synonymous ratio (MSR). We fit a Poisson generalized linear mixed model (GLMM) to the pooled polymorphic synonymous and missense mutations across all primates to estimate the depletion of missense variants in each gene. Then, we fit a second Poisson GLMM to the human data, controlling for the primate depletion estimates, and compared the pooled primate MSR to the human MSR for each gene.

PrimateAI-3D Model

PrimateAI-3D is a 3D convolutional neural network that uses protein structures and multiple sequence alignments (MSA) to predict the pathogenicity of human missense variants. To generate the input for a 3D convolutional neural network, we voxelized the protein structure and evolutionary conservation in the region surrounding the missense variant. The network was trained to optimize three objectives: distinction between benign and unknown human variants; prediction of a masked amino acid at the variant site; per-gene variant ranks based on protein language models.

Protein structures and multiple sequence alignments

For 341 species, we used vertebrate and mammal MSAs from UCSC Multiz100 (112, 113) and Zoonomia (114). Another 251 species appeared in UniProt for at least 75% of all human proteins (115). For each protein, alignments from all 341+251=592 species were merged. Human protein structures were taken from AlphaFold DB (June 2021) (73). Proteins that did not sequence-match exactly to our hg38 proteins (2590; 13.5%) were homology modeled using HHpred (74) and Modeller (116).

Protein voxelization and voxel features

A regular sized 3D grid of 7×7×7 voxels, each spanning 2Å×2Å×2Å, was centered at the Cα atom of the residue containing the target variant (Fig. S11). For each voxel, we provided a vector of distances between its center and the nearest Cα and Cβ atoms of each amino acid type (Fig. S11; details in Supplementary Text section 1). We also provided additional voxel features including the pLDDT confidence metric from AlphaFold DB (Fig. S12), and the evolutionary profile, consisting of each amino acid’s frequency at the corresponding position in the 592 species alignment.
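A minimal sketch of the voxel distance features, assuming NumPy is available: it builds the 7×7×7 grid of 2 Å voxels around a target Cα atom and records, for each voxel and each of 20 residue types, the distance to the nearest atom of that type. The coordinates and residue-type indices are invented, the function names are hypothetical, and leaving absent types at infinity is a placeholder choice; how the published model encodes or caps such distances is described in its Supplementary Text, not here.

```python
import numpy as np


def voxel_centers(center, n=7, size=2.0):
    """Centers of an n x n x n grid of cubic voxels (edge `size`, in angstroms)
    centered on `center`."""
    offsets = (np.arange(n) - (n - 1) / 2) * size
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1) + np.asarray(center)


def nearest_atom_distances(centers, atom_coords, atom_types, n_types=20):
    """For each voxel and residue type, the distance to the nearest atom of that
    type. Types with no atoms present stay at infinity (placeholder choice)."""
    out = np.full(centers.shape[:3] + (n_types,), np.inf)
    for t in range(n_types):
        pts = atom_coords[atom_types == t]
        if len(pts):
            d = np.linalg.norm(centers[..., None, :] - pts, axis=-1)
            out[..., t] = d.min(axis=-1)
    return out


# Toy structure: three C-alpha atoms with made-up residue-type indices.
ca_coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [0.0, 6.0, 0.0]])
ca_types = np.array([0, 5, 0])

grid = voxel_centers(center=ca_coords[0])
features = nearest_atom_distances(grid, ca_coords, ca_types)
print(features.shape)
```

Running the same function once on Cα coordinates and once on Cβ coordinates would give the two sets of distance channels mentioned in the text.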

Model architecture

The first layers of the PrimateAI-3D model reduce the voxel tensor to a 64-vector through repeated valid-padded 3D convolutions with a kernel size of 3×3×3. A final hidden dense layer transforms this 64-length vector into a 20-length vector, corresponding to one output unit per amino acid at that position. The model was trained simultaneously using multiple loss functions to optimize the following complementary aspects of pathogenicity:
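The shape arithmetic behind this reduction can be checked with a few lines. The sketch assumes stride 1 and no pooling, which the text does not state; under that assumption each valid-padded 3×3×3 convolution trims one voxel from each side of every axis.

```python
def valid_conv_size(size, kernel=3, stride=1):
    """Output length along one axis of a valid-padded (unpadded) convolution."""
    return (size - kernel) // stride + 1


# Follow one spatial axis of the 7x7x7 grid until it collapses to a single voxel.
sizes = [7]
while sizes[-1] > 1:
    sizes.append(valid_conv_size(sizes[-1]))
print(sizes)
```

With 64 output channels in the last of these convolutions, the resulting 1×1×1 volume is a 64-length vector.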

Benign primate variants

Using 4.5 million benign missense variants from primates, we sampled the same number of unknown variants from the set of all possible human missense variants, with the distribution of mutational probabilities matching the benign set, based on a trinucleotide mutation rate model. Variants for the same protein position were combined in a 20-length vector (benign: 0, unknown: 1) which was the target label for the network. We used mean squared error (MSE) as the loss function for non-missing labels and ignored missing labels.
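A loss that skips missing labels can be sketched as below, assuming NumPy. Encoding a missing label as NaN is an assumption made for this example, and the function name is hypothetical.

```python
import numpy as np


def masked_mse(pred, target):
    """Mean squared error over labelled entries only; NaN in `target` marks a
    missing label and contributes nothing to the loss."""
    mask = ~np.isnan(target)
    if not mask.any():
        return 0.0
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


# One protein position: 20 possible amino acids, only two of them labelled.
target = np.full(20, np.nan)
target[3] = 0.0    # observed as a benign primate variant
target[11] = 1.0   # sampled as an unknown human variant
pred = np.full(20, 0.5)

loss = masked_mse(pred, target)
print(loss)
```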

3D fill-in-the-blank

We removed all atoms of a target residue before voxelization, discarding any information about the residue from the input tensor to the network. The network was then trained to predict a 20-length vector, labeled 0 (benign) for amino acids that occur at the target site in any of the 592 species and 1 (pathogenic) otherwise. All human protein positions with at least one possible missense variant were included in this dataset.
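The labelling rule for this objective can be illustrated for a single alignment column. The residues below are invented, the amino acid ordering is an arbitrary choice for the example, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering chosen for this example only


def fill_in_the_blank_labels(alignment_column):
    """alignment_column: residues observed at one position across species.
    Returns 20 labels: 0 if the amino acid occurs in any species, else 1."""
    observed = set(alignment_column)
    return [0 if aa in observed else 1 for aa in AMINO_ACIDS]


# Toy column; "-" marks an alignment gap and matches no amino acid.
column = ["L", "L", "I", "L", "V", "-", "L"]
labels = fill_in_the_blank_labels(column)
print(labels)
```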

Variant ranks from language models

For each gene, we took the average pathogenicity ranking from two protein language models: the PrimateAI language model (PrimateAI LM, described below) and our reimplementation of the EVE variational autoencoder algorithm, which we extended to all human proteins (EVE*) (67). We calculated the pairwise logistic rank loss as described in Pasumarthi et al. (117).
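The standard form of a pairwise logistic rank loss sums log(1 + exp(−(s_i − s_j))) over every pair in which variant i should rank above variant j. The sketch below implements that textbook form on invented numbers; the function name is hypothetical, and any weighting or normalization used in the actual model is not reproduced.

```python
import math


def pairwise_logistic_loss(scores, target_ranks):
    """Sum of log(1 + exp(-(s_i - s_j))) over pairs (i, j) where variant i has
    the higher target rank, i.e. should be scored as more pathogenic."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if target_ranks[i] > target_ranks[j]:
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
    return loss


# Toy gene with three variants; a higher target rank means more pathogenic.
target = [3, 1, 2]
good = pairwise_logistic_loss([2.0, -1.0, 0.5], target)   # order agrees
bad = pairwise_logistic_loss([-1.0, 2.0, 0.5], target)    # order reversed
print(good, bad)
```

Because only score differences within a gene enter the loss, the objective constrains the ordering of variants within each gene rather than their absolute scores.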

PrimateAI Language Model

The PrimateAI language model (PrimateAI LM) is an MSA transformer (83) for fill-in-the-blank residue classification, which was trained end-to-end on MSAs of UniRef-50 proteins (118, 119) to minimize an unsupervised masked language modelling (MLM) objective (81). Our model requires ~50x less computation for training than previous MSA transformers due to several improvements in architecture and training (Fig. S9).

Model training procedure

Each batch had the same number of samples from each of the three variant datasets (~33 with a batch size of 100). For the language model ranks dataset, all 33 samples had to come from the same protein. The number of times a protein was chosen for a batch was proportional to the length of the protein. In order to make our model robust against protein orientations, we randomly rotated the protein atomic coordinates in 3D before voxelizing a variant.
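The rotation augmentation can be sketched as follows, assuming NumPy. The rotation is drawn uniformly using the QR decomposition of a Gaussian matrix, and the coordinates are rotated about the first atom; both the sampling method and the choice of rotation center are assumptions for this example, and the function name is hypothetical.

```python
import numpy as np


def random_rotation(rng):
    """Draw a 3x3 rotation matrix uniformly at random via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs so the draw is uniform
    if np.linalg.det(q) < 0:      # a determinant of -1 is a reflection; flip one axis
        q[:, 0] = -q[:, 0]
    return q


rng = np.random.default_rng(0)

# Toy atomic coordinates, rotated about the first atom before voxelization.
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0]])
center = coords[0]
rotated = (coords - center) @ random_rotation(rng).T + center
print(rotated)
```

A rotation leaves all interatomic distances unchanged, so only the orientation of the structure relative to the voxel grid varies between training samples.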

Model Evaluation

We compared performance of our model and other models (84) on variants for which all models had scores. Deep mutational scanning assays were available for 9 human genes: Amyloid-beta (102), YAP1 (96), MSH2 (120), SYUA (101), VKOR1 (121), PTEN (99, 100), BRCA1 (122), TP53 (123), and ADRB2 (124). For each assay and prediction model, we calculated the absolute Spearman rank correlation between prediction and assay scores. The UK Biobank dataset (79, 80) contains 42 gene-phenotype pairs which were significantly associated by rare variant burden testing using all rare missense variants, without applying missense pathogenicity prioritization. The evaluation was the same as with DMS assays, except that correlations were calculated from the quantitative phenotypes of individuals carrying the variant, instead of the assay score for the variant. For ClinVar (4), we filtered to high-quality 2-star variants and evaluated model performance by calculating per-gene area under the receiver operating characteristic curve (AUC). For the rare disease cohorts, we collected de novo missense mutations from patients with developmental disorders (85–87), autism spectrum disorders (88–94) or congenital heart disorders (95). For all three datasets, we compared against DNMs from healthy controls (88–93). We applied the Mann-Whitney U test to measure how well each model’s prediction scores could distinguish patient variants from control variants.

Supplementary Material

Supplement 1
media-1.pdf (5.1MB, pdf)
Supplement 2
media-2.xlsx (88KB, xlsx)
Supplement 3
media-3.xlsx (2.8MB, xlsx)
Supplement 4
media-4.xlsx (11.1KB, xlsx)
Supplement 5
media-5.xlsx (13.3KB, xlsx)
Supplement 6
media-6.xlsx (416.7KB, xlsx)
Supplement 7
media-7.xlsx (12.2KB, xlsx)

Acknowledgments:

We would like to thank Daniel MacArthur, Yun Song, and Mark Daly for helpful discussions, and the gnomAD team at the Broad Institute for their assistance with the website.

Funding:

LFKK was supported by an EMBO STF 8286 (to LFKK). MCJ was supported by NERC grant NE/T000341/1 (to RMDB, JPB, IG, DdV and MCJ). MK was supported by “la Caixa” Foundation (ID 100010434 to MK), fellowship code LCF/BQ/PR19/11700002 (to MK), and by the Vienna Science and Technology Fund (WWTF) and the City of Vienna through project VRG20-001 (to MK). JDO was supported by “la Caixa” Foundation (ID 100010434 to JDO) and the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 847648 (to JDO). The fellowship code is LCF/BQ/PI20/11760004 (to JDO). FES was supported by Brazilian National Council for Scientific and Technological Development (CNPq) (process numbers: 200502/2015-8, 302140/2020-4, 300365/2021-7, 301407/2021-5, 301925/2021-6 to FES), and received funding from International Primatological Society - Conservation grant; The Rufford Foundation (14861-1, 231172), the Margot Marsh Biodiversity Foundation (SMA-CCO-G0000000023, SMA-CCOG0000000037), Primate Conservation Inc. (#1713 and #1689), and from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801505 (to FES). The Mamirauá Institute for Sustainable Development received funds from Gordon and Betty Moore Foundation (Grant #5344 to FES). Fieldwork for samples collected in the Brazilian Amazon was funded by grants from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq/SISBIOTA Program #563348/2010-0 to IPF), Fundação de Amparo à Pesquisa do Estado do Amazonas (FAPEAM/SISBIOTA #2317/2011 to IPF), and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES AUX # 3261/2013) to IPF. Sampling of nonhuman primates in Tanzania was funded by the German Research Foundation (KN1097/3-1 to SK and RO3055/2-1 to CR) and by the US National Science Foundation (BNS83-03506 to JPC). No animals in Tanzania were sampled purposely for this study.
Details of the original study on Treponema pallidum infection can be requested from SK. Sampling of baboons in Zambia was funded by US NSF grant BCS-1029451 to JPC, CJJ and JR. The research reported in this manuscript was also funded by the Vietnamese Ministry of Science and Technology’s Program 562 (grant no. ĐTĐL.CN-64/19). ANC is supported by AEI-PGC2018-101927-BI00 704 (FEDER/UE to ANC), FEDER (Fondo Europeo de Desarrollo Regional)/FSE (Fondo Social Europeo), “Unidad de Excelencia María de Maeztu”, funded by the AEI (CEX2018-000792-M to ANC) and Secretaria d’Universitats i Recerca and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2017 SGR 880 to ANC). ADM was supported by the Natural Sciences and Engineering Research Council of Canada and Canada Research Chairs program. The authors would like to thank the Veterinary and Zoology staff at Wildlife Reserves Singapore for their help in obtaining the tissue samples, as well as the Lee Kong Chian Natural History Museum for storage and provision of the tissue samples. We wish to thank H. Doddapaneni, D.M. Muzny and M.C. Gingras for their support of sequencing at the Baylor College of Medicine Human Genome Sequencing Center. We greatly appreciate the support of Richard Gibbs, Director of HGSC for this project and thank Baylor College of Medicine for internal funding. TMB is supported by funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 864203 to TMB), BFU2017-86471-P (MINECO/FEDER, UE to TMB), “Unidad de Excelencia María de Maeztu”, funded by the AEI (CEX2018-000792-M to TMB), NIH 1R01HG010898-01A1 (to TMB) and Secretaria d’Universitats i Recerca and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2021 SGR 00177 to TMB), Howard Hughes International Early Career (to TMB), Obra Social “La Caixa” and internal funds from Baylor College of Medicine. HLR receives funding from Illumina, Inc. to support rare disease gene discovery and diagnosis. JPB, RMDB, IG and DV were supported by a UKRI Grant NERC (NE/T000341/1). We thank Dr. Praveen Karanth (IISc), Dr. H.N. Kumara (SACON) for collecting and providing us with some of the samples from India. SMA was supported by a BINC fellowship from the Department of Biotechnology (DBT), India. We acknowledge the support provided by the Council of Scientific and Industrial Research (CSIR), India to GU for the sequencing at the Centre for Cellular and Molecular Biology (CCMB), India. Aotus azarae samples from Argentina were obtained with grant support to EFD from the Zoological Society of San Diego, Wenner-Gren Foundation, the L.S.B. Leakey Foundation, the National Geographic Society, the U.S. National Science Foundation (NSF-BCS-0621020, 1232349, 1503753, 1848954; NSF-RAPID-1219368, NSF-FAIN-1952072; NSF-DDIG-1540255; NSF-REU 0837921, 0924352, 1026991) and the U.S. National Institute on Aging (NIA- P30 AG012836-19, NICHD R24 HD-044964-11). JHS was supported in part by the NIH under award number P40OD024628 - SPF Baboon Research Resource. This research is supported by the National Research Foundation Singapore under its National Precision Medicine Programme (NPM) Phase II Funding (MOH-000588 to PT and WKL) and administered by the Singapore Ministry of Health’s National Medical Research Council. JR is also a Core Scientist at the Wisconsin National Primate Research Center, Univ. of Wisconsin, Madison.
We acknowledge the institutional support of the Spanish Ministry of Science and Innovation through the Instituto de Salud Carlos III and the 2014–2020 Smart Growth Operating Program, to the EMBL partnership and institutional co-financing with the European Regional Development Fund (MINECO/FEDER, BIO2015-71792-P). We also acknowledge the support of the Centro de Excelencia Severo Ochoa, and the Generalitat de Catalunya through the Departament de Salut, Departament d’Empresa i Coneixement and the CERCA Programme to the institute.

Footnotes

Competing interests: Employees of Illumina, Inc. are indicated in the list of author affiliations. Serafim Batzoglou is currently affiliated with Seer, Inc. Heidi Rehm receives funding to support rare disease research and tool development from Illumina, Inc. and Microsoft, Inc. Patents related to this work are (1) title: Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures, filing No.: US 17/232,056, authors: Tobias Hamp, Kai-How Farh, Hong Gao; (2) title: Transfer learning-based use of protein contact maps for variant pathogenicity prediction, filing No.: US 17/876,481, authors: Chen Chen, Hong Gao, Laksshman Sundaram, Kai-How Farh; (3) title: Multi-channel protein voxelization to predict variant pathogenicity using deep convolutional neural networks, filing No.: US 17/703,935, authors: Tobias Hamp, Kai-How Farh, Hong Gao; (4) title: Transformer language model for variant pathogenicity, filing No.: US 17/975,536 and US 17/975,547, authors: Jeffrey Ede, Tobias Hamp, Anastasia Dietrich, Yibing Wu, Kai-How Farh.

Data and materials availability:

All sequencing data have been deposited at the European Nucleotide Archive under the accession number PRJEB49549. Primate variants and PrimateAI-3D prediction scores are available with a non-commercial license upon request and are displayed on https://primad.basespace.illumina.com. The source code of PrimateAI-3D is accessible via https://github.com/Illumina/PrimateAI-3D and is also archived at https://doi.org/10.5281/zenodo.7738731. To reduce problems with circularity that have become a concern for the field, the authors explicitly request that the prediction scores from the method not be incorporated as a component of other classifiers, and instead ask that interested parties employ the provided source code and data to directly train and improve upon their own deep learning models.

References and Notes:

  • 1. MacArthur D. G. et al., Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014).
  • 2. Nussbaum R. L., Rehm H. L., ClinGen, ClinGen and Genetic Testing. N Engl J Med 373, 1379 (2015).
  • 3. Rehm H. L. et al., ClinGen--the Clinical Genome Resource. N Engl J Med 372, 2235–2242 (2015).
  • 4. Landrum M. J. et al., ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 44, D862–868 (2016).
  • 5. Liu X., Wu C., Li C., Boerwinkle E., dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Hum Mutat 37, 235–241 (2016).
  • 6. Stenson P. D. et al., The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 133, 1–9 (2014).
  • 7. Rehm H. L., Evolving health care through personal genomics. Nat Rev Genet 18, 259–267 (2017).
  • 8. Whiffin N. et al., Using high-resolution variant frequencies to empower clinical genome interpretation. Genet Med 19, 1151–1158 (2017).
  • 9. Caspar S. et al., Clinical sequencing: from raw data to diagnosis with lifetime value. Clinical genetics 93, 508–519 (2018).
  • 10. Yang Y. et al., Molecular findings among patients referred for clinical whole-exome sequencing. Jama 312, 1870–1879 (2014).
  • 11. SoRelle J. A., Thodeson D. M., Arnold S., Gotway G., Park J. Y., Clinical Utility of Reinterpreting Previously Reported Genomic Epilepsy Test Results for Pediatric Patients. JAMA Pediatr 173, e182302 (2019).
  • 12. Shah N. et al., Identification of Misclassified ClinVar Variants via Disease Population Prevalence. Am J Hum Genet 102, 609–619 (2018).
  • 13. Campuzano O. et al., Reanalysis and reclassification of rare genetic variants associated with inherited arrhythmogenic syndromes. EBioMedicine 54, 102732 (2020).
  • 14. Richards S. et al., Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17, 405–424 (2015).
  • 15. Kim Y. E., Ki C. S., Jang M. A., Challenges and Considerations in Sequence Variant Interpretation for Mendelian Disorders. Ann Lab Med 39, 421–429 (2019).
  • 16. Slatkin M., A population-genetic test of founder effects and implications for Ashkenazi Jewish diseases. The American Journal of Human Genetics 75, 282–293 (2004).
  • 17. Sundaram L. et al., Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161–1170 (2018).
  • 18. Consortium C. S. A., Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005).
  • 19. Prado-Martinez J. et al., Great ape genome diversity and population history. Nature 499, 471–475 (2013).
  • 20. Fan Z. et al., Ancient hybridization and admixture in macaques (genus Macaca) inferred from whole genome sequences. Mol Phylogenet Evol 127, 376–386 (2018).
  • 21. Liu Z. et al., Genomic Mechanisms of Physiological and Morphological Adaptations of Limestone Langurs to Karst Habitats. Mol Biol Evol 37, 952–968 (2020).
  • 22. Wang L. et al., A high-quality genome assembly for the endangered golden snub-nosed monkey (Rhinopithecus roxellana). Gigascience 8, (2019).
  • 23. Zoonomia C., A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020).
  • 24. Evans B. J. et al., Speciation over the edge: gene flow among non-human primate species across a formidable biogeographic barrier. R Soc Open Sci. 4, 170351 (2017).
  • 25. Yu L. et al., Genomic analysis of snub-nosed monkeys (Rhinopithecus) identifies genes and processes related to high-altitude adaptation. Nat Genet 48, 947–952 (2016).
  • 26. Osada N., Matsudaira K., Hamada Y., Malaivijitnond S., Testing sex-biased admixture origin of macaque species using autosomal and X-chromosomal genomic sequences. Genome Biol. Evol. 13, (2021).
  • 27. Rylands A. B., Mittermeier R. A., Primate Behavioral Ecology. (Routledge, New York, ed. 6, 2021), pp. 407–428.
  • 28. Lek M. et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
  • 29. Karczewski K. J. et al., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
  • 30. Leffler E. M. et al., Revisiting an old riddle: what determines genetic diversity levels within species? PLoS Biol 10, e1001388 (2012).
  • 31. Estrada A. et al., Impending extinction crisis of the world’s primates: Why primates matter. Sci Adv 3, e1600946 (2017).
  • 32. Ohta T., Slightly deleterious mutant substitutions in evolution. Nature 246, 96–98 (1973).
  • 33. Reich D. E., Lander E. S., On the allelic spectrum of human disease. Trends Genet 17, 502–510 (2001).
  • 34. Sawyer S. A., Hartl D. L., Population genetics of polymorphism and divergence. Genetics 132, 1161–1176 (1992).
  • 35. Eyre-Walker A., Keightley P. D., The distribution of fitness effects of new mutations. Nature Reviews Genetics 8, 610–618 (2007).
  • 36. Fu W. et al., Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220 (2013).
  • 37. Simons Y. B., Turchin M. C., Pritchard J. K., Sella G., The deleterious mutation load is insensitive to recent population history. Nature genetics 46, 220–224 (2014).
  • 38. Do R. et al., No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nature Genetics 47, 126–131 (2015).
  • 39. Albers P. K., McVean G., Dating genomic variants and shared ancestry in population-scale sequencing data. PLoS biology 18, e3000586 (2020).
  • 40. Mathieson I., McVean G., Demography and the age of rare variants. PLoS Genet 10, e1004528 (2014).
  • 41. Damaj L. et al., CACNA1A haploinsufficiency causes cognitive impairment, autism and epileptic encephalopathy with mild cerebellar symptoms. Eur J Hum Genet 23, 1505–1512 (2015).
  • 42. Reinson K. et al., Biallelic CACNA1A mutations cause early onset epileptic encephalopathy with progressive cerebral, cerebellar, and optic nerve atrophy. Am J Med Genet A 170, 2173–2176 (2016).
  • 43. Bentivegna A. et al., Rubinstein-Taybi Syndrome: spectrum of CREBBP mutations in Italian patients. BMC Med Genet 7, 77 (2006).
  • 44.Stef M. et al. , Spectrum of CREBBP gene dosage anomalies in Rubinstein-Taybi syndrome patients. Eur J Hum Genet 15, 843–847 (2007). [DOI] [PubMed] [Google Scholar]
  • 45.Kondrashov A. S., Sunyaev S., Kondrashov F. A., Dobzhansky-Muller incompatibilities in protein evolution. Proc Natl Acad Sci U S A 99, 14878–14883 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Jordan D. M. et al. , Identification of cis-suppression of human disease mutations by comparative genomics. Nature 524, 225–229 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Samocha K. E. et al. , A framework for the interpretation of de novo mutation in human disease. Nat Genet 46, 944–950 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Bustamante C. D., Wakeley J., Sawyer S., Hartl D. L., Directional selection and the site-frequency spectrum. Genetics 159, 1779–1788 (2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Huang X. et al. , Inferring genome-wide correlations of mutation fitness effects between populations. Molecular Biology and Evolution. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5, e1000695 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Jouganous J., Long W., Ragsdale A. P., Gravel S., Inferring the Joint Demographic History of Multiple Populations: Beyond the Diffusion Approximation. Genetics 206, 1549–1567 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Bates D., Mächler M., Bolker B., Walker S., Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67, 1–48 (2015). [Google Scholar]
  • 53.Benjamini Y., Hochberg Y., Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B 57, 289–300 (1995). [Google Scholar]
  • 54.Rowntree R. K., Harris A., The phenotypic consequences of CFTR mutations. Annals of human genetics 67, 471–485 (2003). [DOI] [PubMed] [Google Scholar]
  • 55.Wilcox S. A. et al. , High frequency hearing loss correlated with mutations in the GJB2 gene. Human genetics 106, 399–405 (2000). [DOI] [PubMed] [Google Scholar]
  • 56.Shu H. et al. , The role of CD36 in cardiovascular disease. Cardiovascular Research, (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Bobadilla J. L., Macek M. Jr, Fine J. P., Farrell P. M., Cystic fibrosis: a worldwide analysis of CFTR mutations—correlation with incidence data and application to screening. Human mutation 19, 575–606 (2002). [DOI] [PubMed] [Google Scholar]
  • 58.Chaleshtori M. H. et al. , High carrier frequency of the GJB2 mutation (35delG) in the north of Iran. International journal of pediatric otorhinolaryngology 71, 863–867 (2007). [DOI] [PubMed] [Google Scholar]
  • 59.Liu J. et al. , Distribution of CD36 deficiency in different Chinese ethnic groups. Human Immunology 81, 366–371 (2020). [DOI] [PubMed] [Google Scholar]
  • 60.Aitman T. J. et al. , Malaria susceptibility and CD36 mutation. Nature 405, 1015–1016 (2000). [DOI] [PubMed] [Google Scholar]
  • 61.Common J. E., Di W.-L., Davies D., Kelsell D. P., Further evidence for heterozygote advantage of GJB2 deafness mutations: a link with cell survival. Journal of medical genetics 41, 573–575 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.D’Adamo P. et al. , Does epidermal thickening explain GJB2 high carrier frequency and heterozygote advantage? European Journal of Human Genetics 17, 284–286 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Schroeder S. A., Gaughan D. M., Swift M., Protection against bronchial asthma by CFTR ΔF508 mutation: a heterozygote advantage in cystic fibrosis. Nature medicine 1, 703–705 (1995). [DOI] [PubMed] [Google Scholar]
  • 64.Pier G. B. et al. , Salmonella typhi uses CFTR to enter intestinal epithelial cells. Nature 393, 79–82 (1998). [DOI] [PubMed] [Google Scholar]
  • 65.Bojesen S. E. et al. , Multiple independent variants at the TERT locus are associated with telomere length and risks of breast and ovarian cancer. Nature genetics 45, 371–384 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Heidenreich B., Kumar R., TERT promoter mutations in telomere biology. Mutation Research/Reviews in Mutation Research 771, 15–31 (2017). [DOI] [PubMed] [Google Scholar]
  • 67.Frazer J. et al. , Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021). [DOI] [PubMed] [Google Scholar]
  • 68.Chae H. D., Jeon C. H., Peutz-Jeghers syndrome with germline mutation of STK11. Ann Surg Treat Res 86, 325–330 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Hernan I. et al. , De novo germline mutation in the serine-threonine kinase STK11/LKB1 gene associated with Peutz-Jeghers syndrome. Clin Genet 66, 58–62 (2004). [DOI] [PubMed] [Google Scholar]
  • 70.Nakanishi C. et al. , Germline mutation of the LKB1/STK11 gene with loss of the normal allele in an aggressive breast cancer of Peutz-Jeghers syndrome. Oncology 67, 476–479 (2004). [DOI] [PubMed] [Google Scholar]
  • 71.Yang H. R., Ko J. S., Seo J. K., Germline mutation analysis of STK11 gene using direct sequencing and multiplex ligation-dependent probe amplification assay in Korean children with Peutz-Jeghers syndrome. Dig Dis Sci 55, 3458–3465 (2010). [DOI] [PubMed] [Google Scholar]
  • 72.Jumper J. et al. , Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Varadi M. et al. , AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Söding J., Biegert A., Lupas A. N., The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research 33, W244–W248 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Kallberg M. et al. , Template-based protein structure modeling using the RaptorX web server. Nat Protoc 7, 1511–1522 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Wang S., Li W., Liu S., Xu J., RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Res 44, W430–435 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Burgess D. J., The TOPMed genomic resource for human health. Nat Rev Genet 22, 200 (2021). [DOI] [PubMed] [Google Scholar]
  • 78.Taliun D. et al. , Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Bycroft C. et al. , The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Sudlow C. et al. , UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 12, e1001779 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Devlin J., Chang M.-W., Lee K., Toutanova K., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 1, 4171–4186 (2019). [Google Scholar]
  • 82.You Y. et al. , in International Conference on Learning Representations. (2020). [Google Scholar]
  • 83.Rao R. M. et al. , MSA Transformer. Proceedings of the 38th International Conference on Machine Learning 139, 8844–8856 (2021). [Google Scholar]
  • 84.Liu X., Li C., Mou C., Dong Y., Tu Y., dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Medicine 12, 103 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Deciphering Developmental Disorders Study, Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Deciphering Developmental Disorders Study, Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Kaplanis J. et al. , Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.An J. Y. et al. , Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.De Rubeis S. et al. , Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Iossifov I. et al. , The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Iossifov I. et al. , De novo gene disruptions in children on the autistic spectrum. Neuron 74, 285–299 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Sanders S. J. et al. , Insights into Autism Spectrum Disorder Genomic Architecture and Biology from 71 Risk Loci. Neuron 87, 1215–1233 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Sanders S. J. et al. , De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237–241 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.O’Roak B. J. et al. , Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Jin S. C. et al. , Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat Genet 49, 1593–1601 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Araya C. L. et al. , A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proceedings of the National Academy of Sciences 109, 16858–16863 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Chiasson M. A. et al. , Multiplexed measurement of variant abundance and activity reveals VKOR topology, active site and human variant impact. eLife 9, (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Jia X. et al. , Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. American journal of human genetics 108, 163–175 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Matreyek K. A. et al. , Multiplex assessment of protein variant abundance by massively parallel sequencing. Nature Genetics 50, 874–882 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Mighell T. L., Evans-Dutson S., O’Roak B. J., A Saturation Mutagenesis Approach to Understanding PTEN Lipid Phosphatase Activity and Genotype-Phenotype Relationships. American journal of human genetics 102, 943–955 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Newberry R. W., Leong J. T., Chow E. D., Kampmann M., DeGrado W. F., Deep mutational scanning reveals the structural basis for α-synuclein activity. Nature Chemical Biology 16, 653–659 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Seuma M., Faure A. J., Badia M., Lehner B., Bolognesi B., The genetic landscape for amyloid beta fibril nucleation accurately discriminates familial Alzheimer’s disease mutations. Elife 10, e63364 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Giacomelli A. O. et al. , Mutational processes shape the landscape of TP53 mutations in human cancer. Nature genetics 50, 1381–1387 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Starita L. M. et al. , Massively Parallel Functional Analysis of BRCA1 RING Domain Variants. Genetics 200, 413–422 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Jones E. M. et al. , Structural and functional characterization of G protein–coupled receptors with deep mutational scanning. eLife 9, e54895 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Amorim C. E. G. et al. , The population genetics of human disease: The case of recessive, lethal mutations. PLoS Genet 13, e1006915 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Quintans B., Ordonez-Ugalde A., Cacheiro P., Carracedo A., Sobrido M. J., Medical genomics: The intricate path from genetic variant identification to clinical interpretation. Appl Transl Genom 3, 60–67 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Martin A. R. et al. , PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat Genet 51, 1560–1565 (2019). [DOI] [PubMed] [Google Scholar]
  • 109.Thormann A. et al. , Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat Commun 10, 2373 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Kuderna L. F. et al. , A global catalog of whole-genome diversity from 233 primate species. Submitted. [DOI] [PubMed] [Google Scholar]
  • 111.Cassa C. A. et al. , Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat Genet 49, 806–810 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Tyner C. et al. , The UCSC Genome Browser database: 2017 update. Nucleic Acids Res 45, D626–D634 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Kent W. J., Sugnet C. W., Furey T. S., Roskin K. M., Pringle T. H., Zahler A. M., Haussler D., The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Genereux D. P. et al. , A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Suzek B. E., Wang Y., Huang H., McGarvey P. B., Wu C. H., UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics (Oxford, England) 31, 926–932 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Sali A., Blundell T. L., Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779–815 (1993). [DOI] [PubMed] [Google Scholar]
  • 117.Pasumarthi R. K. et al. , TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2970–2978 (2019). [Google Scholar]
  • 118.Suzek B. E., Huang H., McGarvey P., Mazumder R., Wu C. H., UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007). [DOI] [PubMed] [Google Scholar]
  • 119.Suzek B. E., Wang Y., Huang H., McGarvey P. B., Wu C. H., UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Jia X. et al. , Massively parallel functional testing of MSH2 missense variants conferring Lynch syndrome risk. Am J Hum Genet 108, 163–175 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121.Chiasson M. A. et al. , Multiplexed measurement of variant abundance and activity reveals VKOR topology, active site and human variant impact. Elife 9, (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Starita L. M. et al. , Massively Parallel Functional Analysis of BRCA1 RING Domain Variants. Genetics 200, 413–422 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 123.Giacomelli A. O. et al. , Mutational processes shape the landscape of TP53 mutations in human cancer. Nature Genetics 50, 1381–1387 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 124.Jones E. M. et al. , Structural and functional characterization of G protein–coupled receptors with deep mutational scanning. Elife 9, e54895 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125.Gudmundsson S. et al. , Addendum: The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 597, E3–E4 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126.Vanderpool D. et al. , Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression. PLoS Biol 18, e3000954 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Cock P. J. et al. , Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 128.Eberle M. A. et al. , A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res 27, 157–164 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 129.de Manuel M. et al. , Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science 354, 477–481 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 130.Leffler E. M., Gao Z., Pfeifer S., Ségurel L., Auton A., Venn O., Bowden R., Bontrop R., Wall J. D., Sella G., Donnelly P., Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science 339, 1578–1582 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.Eilertson K. E., Booth J. G., Bustamante C. D., SnIPRE: Selection Inference Using a Poisson Random Effects Model. PLoS Comput Biol 8, e1002806 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J., Basic local alignment search tool. J Mol Biol 215, 403–410 (1990). [DOI] [PubMed] [Google Scholar]
  • 133.Johnson L. S., Eddy S. R., Portugaly E., Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11, 431 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 134.Baek M. et al. , Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 135.Marks D. S. et al. , Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 136.Kingma D., Ba J., Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, (2014). [Google Scholar]
  • 137.Mirdita M. et al. , Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Research 45, D170–D176 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 138.Steinegger M. et al. , HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 139.Rao R. M. et al. , MSA Transformer, in Proceedings of the 38th International Conference on Machine Learning, Meila M., Zhang T., Eds. (PMLR, 2021), vol. 139, pp. 8844–8856. [Google Scholar]
  • 140.Ba J. L., Kiros J. R., Hinton G. E., Layer Normalization. Paper presented at the Advances in NIPS 2016 Deep Learning Symposium (2016). [Google Scholar]
  • 141.Hendrycks D., Gimpel K., Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415, (2020). [Google Scholar]
  • 142.Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014). [Google Scholar]
  • 143.Micikevicius P. et al. , Mixed Precision Training. International Conference on Learning Representations, (2018). [Google Scholar]
  • 144.Rajbhandari S., Rasley J., Ruwase O., He Y., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–16 (2020). [Google Scholar]
  • 145.Bandaru P. et al. , Deconstruction of the Ras switching cycle through saturation mutagenesis. Elife 6, (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 146.Weile J. et al. , A framework for exhaustively mapping functional missense variants. Mol Syst Biol 13, 957 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147.Brenan L. et al. , Phenotypic Characterization of a Comprehensive Set of MAPK1/ERK2 Missense Mutants. Cell Rep 17, 1171–1183 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 148.Awad M. M. et al. , Acquired Resistance to KRAS G12C Inhibition in Cancer. New England Journal of Medicine 384, 2382–2393 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 149.Zhang L. et al. , SLCO1B1: Application and Limitations of Deep Mutational Scanning for Genomic Missense Variant Function. Drug Metab Dispos, DMD-AR-2020-000264 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 150.Gray V. E. et al. , Elucidating the Molecular Determinants of Aβ Aggregation with Deep Mutational Scanning. G3 (Bethesda) 9, 3683–3689 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 151.Adzhubei I. A. et al. , A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 152.Feng B.-J., PERCH: A Unified Framework for Disease Gene Prioritization. Human Mutation 38, 243–251 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 153.Rentzsch P., Schubach M., Shendure J., Kircher M., CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Medicine 13, 31 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 154.Quang D., Chen Y., Xie X., DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761–763 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 155.Raimondi D. et al. , DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Research 45, W201–W206 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 156.Malhis N., Jacobson M., Jones S. J. M., Gsponer J., LIST-S2: taxonomy based sorting of deleterious missense mutations across species. Nucleic Acids Research 48, W154–W161 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 157.Jagadeesh K. A. et al. , M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nature Genetics 48, 1581–1586 (2016). [DOI] [PubMed] [Google Scholar]
  • 158.Steinhaus R. et al. , MutationTaster2021. Nucleic Acids Research 49, W446–W451 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 159.Choi Y., Sims G. E., Murphy S., Miller J. R., Chan A. P., Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 160.Ioannidis N. M. et al. , REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet 99, 877–885 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 161.Sim N.-L. et al. , SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Research 40, W452–457 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 162.Carter H., Douville C., Stenson P. D., Cooper D. N., Karchin R., Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 14 Suppl 3, S3 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 163.Shihab H. A. et al. , An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536–1543 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 164.Meier J. et al. , Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv, (2021). [Google Scholar]
  • 165.Riesselman A. J., Ingraham J. B., Marks D. S., Deep generative models of genetic variation capture the effects of mutations. Nat Methods 15, 816–822 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 166.Stenson P. D. et al. , Human Gene Mutation Database: towards a comprehensive central mutation database. J Med Genet 45, 124–126 (2008). [DOI] [PubMed] [Google Scholar]
  • 167.Stenson P. D. et al. , The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum Genet 139, 1197–1207 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 168.Zhang Y., Skolnick J., Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004). [DOI] [PubMed] [Google Scholar]
  • 169.Ekeberg M., Lövkvist C., Lan Y., Weigt M., Aurell E., Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Physical Review E 87, 012707 (2013). [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Supplement 1: media-1.pdf (5.1 MB, PDF)
Supplement 2: media-2.xlsx (88 KB, XLSX)
Supplement 3: media-3.xlsx (2.8 MB, XLSX)
Supplement 4: media-4.xlsx (11.1 KB, XLSX)
Supplement 5: media-5.xlsx (13.3 KB, XLSX)
Supplement 6: media-6.xlsx (416.7 KB, XLSX)
Supplement 7: media-7.xlsx (12.2 KB, XLSX)

Data Availability Statement

All sequencing data have been deposited at the European Nucleotide Archive under the accession number PRJEB49549. Primate variants and PrimateAI-3D prediction scores are available with a non-commercial license upon request and are displayed on https://primad.basespace.illumina.com. The source code of PrimateAI-3D is accessible via https://github.com/Illumina/PrimateAI-3D and is also archived at https://doi.org/10.5281/zenodo.7738731. To reduce problems with circularity that have become a concern for the field, the authors explicitly request that the prediction scores from the method not be incorporated as a component of other classifiers, and instead ask that interested parties employ the provided source code and data to directly train and improve upon their own deep learning models.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
