Abstract
Advances in gene sequencing technologies have accelerated the identification of genetic variants, but better tools are needed to understand which are causal of disease. This would be particularly useful in fields where gene therapy is a potential therapeutic modality for a disease-causing variant such as inherited retinal disease (IRD). Here, we apply structure-based network analysis (SBNA), which has been successfully utilized to identify variant-constrained amino acid residues in viral proteins, to identify residues that may cause IRD if subject to missense mutation. SBNA is based entirely on structural first principles and is not fit to specific outcome data, which makes it distinct from other contemporary missense prediction tools. In 4 well-studied human disease-associated proteins (BRCA1, HRAS, PTEN, and ERK2) with high-quality structural data, we find that SBNA scores correlate strongly with deep mutagenesis data. When applied to 47 IRD genes with available high-quality crystal structure data, SBNA scores reliably identified disease-causing variants according to phenotype definitions from the ClinVar database. Finally, we applied this approach to 63 patients at Massachusetts Eye and Ear (MEE) with IRD but for whom no genetic cause had been identified. Untrained models built using SBNA scores and BLOSUM62 scores for IRD-associated genes successfully predicted the pathogenicity of novel variants (AUC = 0.851), allowing us to identify likely causative disease variants in 40 IRD patients. Model performance was further augmented by incorporating orthogonal data from EVE scores (AUC = 0.927), which are based on evolutionary multiple sequence alignments. In conclusion, SBNA can used to successfully identify variants as causal of disease in human proteins and may help predict variants causative of IRD in an unbiased fashion.
Subject terms: Disease genetics, Molecular medicine, Genetics research
Introduction
As genetic sequencing has become increasingly available and less costly, a growing number of patients with clinical presentations of suspected genetic origin are undergoing targeted or whole-exome sequencing. Despite improved accessibility, the genetic basis of disease for a considerable proportion of these patients remains unclear following sequencing1. Inherited retinal diseases (IRD), whereby rod and cone photoreceptors degenerate, are a group of Mendelian disorders that represent an important cause of vision loss2. With the advancement in the availability of genetic testing3–6 and the lower cost of exome sequencing, IRD is a field with increasing promise and possibility for gene therapy interventions. However, among patients with an IRD, ~30% do not have a clear genetic basis despite classic retinal changes and a decrease in visual and retinal function7, making them ineligible candidates for treatment. For these patients, additional tools are needed to better define genetic variants that are not among the group of known causal variants (i.e., variants of uncertain significance—VUS).
Numerous computational tools that aim to predict the phenotype of genetic variants have been described8–10, many of which were trained on existing variant classifications or combine multiple metrics that use this type of training data11–19. These data were limited by sparsely available annotations20, and previous studies suggest that some of these algorithms may also have considerable false discovery rates21,22. Furthermore, many of these algorithms are limited by circularity, with duplication of variants in the training and test datasets as well as variants from the same protein within the training and test datasets23. The clinical applicability of these approaches has been limited as a result23,24. Another class of methods uses sequence conservation to estimate the likelihood that a particular variant will have a deleterious phenotypic effect25,26. Sequence conservation is an important feature to consider (and provides independent information from protein structure) but can be an imperfect proxy for the functional importance of a particular amino acid position within a protein of interest; this has been previously demonstrated within the context of both model and human proteins27,28 and also the human immunodeficiency virus (HIV), which has a per-base mutation rate ~104 times that of the human genome and thus serves as a model for an accelerated rate of genomic evolution29–31. Some of these approaches also leverage recent advances in machine learning, which results in limited post hoc transparency regarding the basis for a particular estimated probability of pathogenicity and is also subject to circularity and over-fitting26,32. Functional in vitro assays, such as a high throughput assay for the RHO gene, can help to characterize individual genetic variants33. However, these crucial experiments are not always feasible on a large scale for the entire set of patient-genotype combinations, given the need for appropriate experimental setup, tissue-type, and readout for the particular variant of interest.
To add to these approaches, we developed structure-based network analysis (SBNA) which leverages the application of network theory to protein structure data with the goal of quantifying local residue connectivity, bridging interactions, and ligand proximity in order to identify amino acid residues that are topologically important29,34. Using x-ray crystallography and cryogenic electron microscopy (cryo-EM) data, it models proteins as networks of connected amino acids to quantitatively estimate the topological importance of each amino acid as it relates to others in the protein, protein complex, or protein–ligand interaction. This approach is distinct from prior computational tools that use structural information because it is not reliant on pre-defined secondary structure elements; rather, it analyzes the crystallized tertiary structure of the folded protein as a network of weighted inter-residue interactions. Additionally, this approach does not require training on pre-labeled phenotypic data which means that it can provide a metric that is specifically focused on first-principle structural information. SBNA has been previously used to identify highly mutationally constrained amino acid residues and CD8+ T cell epitopes in HIV and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)29,35. In the case of HIV, these epitopes were found to be preferentially targeted by individuals able to suppress HIV replication in the absence of antiretroviral therapy29. Within the context of SARS-CoV-2, highly networked regions have resisted ongoing sequence evolution during the pandemic and thereby may be capable of conferring broad T cell-mediated protection against sarbecovirus infection35.
In light of these successes with highly mutable viruses, we chose to apply SBNA to human proteins which exist within a complex system of biological interactions. Such an application would potentially aid in better understanding which genetic variants discovered through sequencing are causal of disease. We first verified this approach in well-studied human proteins: breast cancer gene 1 (BRCA1), GTPase HRas (HRAS), phosphatase and tensin homolog (PTEN), and mitogen-activated protein kinase 1 (ERK2). We then sought to investigate whether this approach could generate meaningful results for IRD proteins of interest to estimate the phenotypic impact of missense variants, given that such a sizable number of patients with IRDs harbor VUSs. Therefore, we assessed the performance of SBNA in IRD protein-coding genes and their respective missense variants described in ClinVar, a large public database of reported pathogenic and benign genetic variation. Finally, we addressed a cohort of 63 affected subjects for whom a genetic cause of their condition has yet to be discovered to investigate the use of SBNA in real-world clinical scenarios.
Results
Structure-based network analysis accurately identifies missense variants in human proteins associated with in vitro loss of function and pathogenic clinical phenotypes
To evaluate whether SBNA could be meaningfully applied to human proteins, we first applied the approach to four well-studied human proteins—BRCA1, HRAS, PTEN, and ERK2—which were selected for analysis due to the availability of high-quality structural data. In addition, these proteins had published in vitro saturation mutagenesis experiments, which allowed us to extract the functional consequence of all missense variants and quantify mutational tolerance36–39. We next generated network scores for all amino acid residues in the available protein databank (PDB) files (Fig. 1a) and evaluated the correlation with mutational tolerance (Fig. 1b). All four proteins showed a strong inverse correlation between mutational tolerance and network score, which was consistent with previous findings for viral proteins and other model proteins29. Of note, only 17% of BRCA1 has been crystallized (the N- and C-terminal ends), but SBNA scores still performed reasonably well (Spearman correlation coefficient −0.514, p = 3.45e-22) despite limited structural data.
We next compared network scores to pathogenicity categorizations derived from human data using the ClinVar and gnomAD genetic databases for these same four proteins (Fig. 1c). Missense variants in all subsequent analyses were categorized with respect to human clinical data in line with the American College of Medical Genetics and Genomics/Association of Molecular Pathology (ACMG/AMP)24 as benign (encompassing “benign” or “likely benign” within ClinVar), VUS or pathogenic (encompassing “pathogenic” or “likely pathogenic” within ClinVar). We restricted our analysis to ClinVar missense variants with at least two-star level evidence, and gnomAD was used to identify additional relatively benign missense variants (variants with at least 250 alternative allele counts across 100,000 individuals). Across all four proteins, network scores assigned to pathogenic variants were significantly greater than those assigned to benign variants (median network scores for benign missense variants −0.936 and pathogenic variants 0.866, p = 1.836e-5, Fig. 1c). Scores assigned to VUSs fell in between those assigned to benign and pathogenic variants. These trends were observed consistently within individual proteins, though smaller sample sizes limit statistical power (Fig. 1d). Overall, network scores correlate with available clinical phenotype data for the four well-studied human proteins (Spearman correlation coefficient 0.228, p = 2.116e-23), suggesting that SBNA can be meaningfully applied to human proteins. Of note, we found that SBNA seems to capture structural properties beyond simple solvent accessibility (i.e., proximity to the core) because relative solvent accessibility (RSA) scores show less correlation with mutational tolerance scores than did SBNA (Supplementary Fig. 1).
SBNA predicts variants in IRD genes associated with pathogenic clinical phenotypes
Having validated SBNA on four canonical, well-studied human proteins, we then applied SBNA to additional human proteins. We analyzed the relationship between network scores and pathogenicity designations from high-quality ClinVar variants and benign gnomAD variants for the 47 human genes associated with IRDs, for which high-quality structural data is available for the encoded protein. This dataset includes both membrane proteins (e.g., ABCA4 and RHO) as well as cytoplasmic proteins (e.g., RPE65 and RPGR) (Supplementary Table 1). We found that pathogenic variants were assigned significantly greater network scores compared to both benign variants and VUSs (median benign −0.841, median VUS −0.188, median pathogenic 0.947 p = 3.140e-29 for benign vs. pathogenic, p = 1.753e-60 for VUS vs. pathogenic; Fig. 2a and Supplementary Fig. 2), which is similar to the pattern observed for the canonical human proteins. Because the number of high-quality ClinVar entries for missense variants in each of the 47 IRD-associated proteins varies considerably (Fig. 2b), we wanted to evaluate whether the difference between median benign and pathogenic network scores remained statistically significant even for proteins with limited variant data available in ClinVar. We grouped proteins by the amount of available high-quality entries in ClinVar and calculated the difference between the median benign and pathogenic network scores across each group of proteins. The magnitude of these differences was robust in the setting of differing levels of available clinical data across genes and was detectable down to 40 high-quality entries per gene (Fig. 2c).
Incorporating network scores and BLOSUM62 scores successfully predicts variant pathogenicity
To move from aggregate statistics to prediction of pathogenicity using network scores, we constructed a modified score that incorporates not only the SBNA network score but also the degree of chemical and side chain dissimilarity between the reference and mutant amino acid at that position (since missense variants vary in this regard). To capture the latter effect, we subtracted the BLOSUM62 matrix score from the SBNA score (which we will now refer to as the modified SBNA score) to allow for a distinction between non-conservative substitutions (e.g., ILE → TRP) and conservative ones (e.g., ILE → LEU)40. BLOSUM62 scores for non-synonymous substitutions range from −4 to 3, while 95% of the raw SBNA scores range from −2.88 to 4.97; thus, the two metrics are on similar scales and can be combined with simple subtraction. We then calculated receiver operating characteristic (ROC) curve statistics for high-quality ClinVar variants and benign gnomAD variants based on the modified SBNA score. Modified SBNA scores predicted variant pathogenicity (AUC 0.851) and outperformed network scores alone, BLOSUM62 scores alone, and RSA scores alone (Fig. 3a, b and Supplementary Fig. 3). We tested multivariable logistic regression modeling with 500 iterations of a 70/30% train-test split as well as a leave-one-out approach using the labels derived from high-quality ClinVar and gnomAD variants and found similar performance to the raw scores (AUC 0.835, Supplementary Figs. 3, 4). Given the advantage that using the raw scores has over a trained approach (which can be subject to poor phenotypic labeling and data circularity20,23), all downstream clinical applications described here use the raw modified SBNA score.
Comparison of SBNA to existing methods reveals similar performance without dependence on phenotype labels or evolutionary sequence data
We set out to compare SBNA to three tools that are built from different underlying data: Polyphen2, AlphaMissense, and EVE. Polyphen2 is a widely used computational prediction tool for variant pathogenicity41. It is important to note however that PolyPhen2 is used by the ACMG/AMP to assess pathogenicity designations in databases such as ClinVar and HumDiv, so using ClinVar as ground truth might overestimate the performance of PolyPhen2. AlphaMissense32 incorporates structural data from AlphaFold242, protein language modeling, and evolutionary multiple sequence alignments into a machine-learning model using ClinVar labels to train. Similar to PolyPhen2, there is a risk of over-fitting and circularity with AlphaMissense. EVE is a variant pathogenicity approach which relies only on evolutionary sequence data rather than clinical labeling for model training and is thus not subject to circularity, similar to SBNA26. EVE performs well on two of the well-studied human proteins (Spearman correlation ranging from −0.478 to −0.513, benign versus pathogenic P < 0.05 for all; Supplementary Fig. 5); EVE predictions are not available for ERK2. To minimize bias, all algorithms were tested on an independent dataset of 2800 rare variants derived from patients with an IRD who were seen at MEE, though of note, ground truth is still determined using the ClinVar database in accordance with the clinical standard in the field. Thus, methods that train on ClinVar must be interpreted with caution. The modified SBNA scores were compared to results generated using PolyPhen2 trained on two different datasets, HumDiv and HumVar12,41, as well as EVE scores26. All methods showed a significant difference between benign and pathogenic variants, and the modified SBNA scores correlated with the scores from other methods (Supplementary Figs. 6A, B, 7A, B, Spearman correlation coefficient range 0.510–0.571, p < 5e-24 for all). With a threshold of 1.5 for modified SBNA scores, the sensitivity was 0.548, specificity 0.908, positive predictive value 0.963, and negative predictive value 0.312.
ROC curves were generated by testing each of the methods on the dataset of 2800 variants present in MEE patients (AUC range: 0.788 [modified SBNA], 0.829–0.833 [PolyPhen2], 0.819 [AlphaMissense], and 0.809 [EVE]), Supplementary Figs. 6C, 7C). The modified SBNA, PolyPhen2, AlphaMissense, and EVE performed similarly, though PolyPhen2 and AlphaMissense scores may be inflated due to training on ClinVar pathogenicity labels.
Combining SBNA and EVE outperforms all methods individually
The correlation and prediction results suggest that structural information and sequence conservation provide distinctly important insight into pathogenicity, and thus, incorporating orthogonal metrics into a single score may help to improve model correlation with phenotypic benchmarks. We thus sought to use the two unbiased methods (modified SBNA and EVE). The modified network scores were scaled and added to EVE scores to create a combined score with a range of 0 to 2 (as EVE scores fall between 0 and 1, and SBNA scores were scaled based on the maximum and minimum values across all proteins such that they fell between 0 and 1 before adding to the EVE scores). When applied to the 2800 rare variants from MEE patients as well as the 47 IRD genes, this combined score distinguished pathogenicity (Supplementary Fig. 7D, benign vs. pathogenic p = 3.056e-17; Fig. 3d, benign vs. pathogenic p = 8.839e-69) and outperformed all other models with an AUC of 0.859 (Supplementary Fig. 7E) for the 2800 variants and 0.927 (Fig. 3c) for the IRD genes. With a threshold of 1.0 for the combination score, the sensitivity was 0.765, specificity 0.899, positive predictive value of 0.976 and negative predictive value of 0.416. While we note that EVE alone performed quite well (AUC = 0.915), adding the modified SBNA improves performance and, importantly, unlike EVE alone, offers a direct structural explanation for the mechanism through which variation may affect phenotype.
Model incorporating the modified SBNA scores identifies putative disease-causing variants with an unclear genetic basis for clinical disease
A significant percentage of patients with clinical presentations consistent with IRD lack an identified genetic basis for their phenotype, and this is also observed for patients who receive care from the Inherited Retinal Disease Service at MEE. We therefore evaluated genetic data from 3621 probands with a clinical diagnosis of an IRD based on visual acuity, visual field testing, clinical exam, fundus autofluorescence imaging, optical coherence tomography and electroretinogram in individuals who underwent targeted or whole-exome sequencing. Mitochondrial causes of IRD were excluded. Missense variants of interest were defined using variant ranking software43 and residence in one of the studied 47 IRD genes. There were 2948 unique variants identified and categorized as either “pathogenic”, “likely pathogenic”, “VUS”, or “benign” based on a known variant consequence in the literature using ACMG/AMP criteria44 and ClinVar designations20 (Supplementary Fig. 8). Variants were further categorized in the context of individual patients as “likely causal” if they were pathogenic or likely pathogenic, the zygosity was consistent with known modes of inheritance, and the clinical presentation was consistent with the known consequence of the affected gene. Of all the reviewed patients, 455 patients carried variants of interest in the 47 IRD genes. Before applying SBNA, 357 were found to have variants that were “likely causal”, while 63 patients harbored one or more VUSs that prohibited a molecular diagnosis. The remaining 35 patients had non-missense variants or variants within a region that lacked available structural data and were therefore excluded.
We generated pathogenicity predictions for the 2,948 IRD gene variants discovered in the probands (Fig. 4a). Variants receiving a modified SBNA score of >1.5 (calibrated using probability estimates from the regression modeling) were deemed pathogenic. Scores were considered in the context of the identified genetic variants in known IRD genes for each patient with a clinical presentation consistent with IRD. Variants were matched with any phenotypic data available in ClinVar to roughly benchmark the pathogenicity scores. Similar to the ClinVar analysis of IRD genes, there were observable differences between the modified SBNA scores assigned to known pathogenic variants as compared to benign variants and VUS/variants without any available clinical data (benign median −1.478, VUS median −0.656, pathogenic median 2.148; benign vs. pathogenic p = 1.713e-30).
For the 357 patients who harbored known pathogenic variants sufficient to cause disease, the modified SBNA scores were concordant with these pathogenicity categorizations in 96.0% of cases (Fig. 4b). For the 63 patients with VUSs as categorized by ACMG/AMP standards44 and/or ClinVar, the modified SBNA scores offered support for a genetic cause of disease for 40 patients (23 unique variants, Fig. 4c). By contrast, EVE scores offered support for a genetic cause of disease in 25 patients (20 in common with SBNA), PolyPhen2 scores trained on HumVar offered support in 33 patients (27 in common with SBNA), and AlphaMissense offered support in 40 patients (34 in common with SBNA). Combined EVE and modified SBNA scores offered support in 23 patients (20 in common with SBNA), but this analysis was limited because EVE does not provide scores for 15 patients. In the 15 patients where modified SBNA scores suggested a putative causative variant but EVE scores did not, one patient had a variant for which no EVE score was provided in the database, and the remainder had at least one candidate variant with an EVE score below the predicted pathogenicity threshold. Modes of inheritance included autosomal recessive (in combination with a known pathogenic variant or a second VUS with a high estimated probability of pathogenicity; n = 15), autosomal dominant (n = 2), and X-linked recessive (n = 5). For example, a patient with autosomal recessive ABCA4-related disease was found to have variants p.(Pro1380Leu) (known pathogenic) and p.(Arg1097Ser) (VUS). The p.(Arg1097Ser) variant scored highly (SBNA score 3.672, BLOSUM62 score -1, score difference 4.672), suggesting it is likely pathogenic and thus completing the genetic solution for this patient. Similarly, the VUS p.(Cys302Tyr) in RPGR was found in a hemizygous patient with phenotypic findings consistent with X-linked IRD and also scored highly (SBNA score 2.154, BLOSUM62 score -2, score difference 4.154) (Supplementary Fig. 9). For 19 patients, the modified SBNA scores contributed towards identifying a possible but not completely solved genetic cause, such as only one heterozygous VUS receiving a strong score. Finally, for the four remaining unsolved patients, SBNA was not possible due to lack of crystal structure data for those portions of the IRD-associated proteins or due to the presence of non-missense variants.
AlphaFold2 can improve structural coverage for SBNA
Despite evidence of strong performance when applied to IRD-associated proteins, SBNA remains broadly limited by the availability of high-quality structural data for proteins of interest. This structural coverage must also overlap with the availability of high-quality phenotypic data from ClinVar, limiting the scope of analysis (Fig. 5a). Applying AlphaFold2 may provide a path toward overcoming this limitation. For example, NR2E3 was excluded from the set of 47 IRD genes because the only available structural data is from a chimera formed between one domain of NR2E3 and an unrelated bacterial protein (PDB: 4LOG). SBNA performs poorly on this non-physiologic structure with relatively poor coverage of NR2E3 (47%) (Fig. 5b). However, when SBNA is applied to the full AlphaFold2-generated human NR2E3 structure, performance improves considerably (Fig. 5c).
To further establish that AlphaFold2 can help to overcome the limited availability of high-quality structural data for SBNA, we selected ten IRD-associated genes without available structural data that had a considerable amount of data in ClinVar (Fig. 5d). We performed SBNA on the AlphaFold2-generated structures for these genes and found a significant difference between the SBNA scores assigned to benign and pathogenic variants (p = 1.201e-7; benign median −0.539, VUS median −0.491, pathogenic median 0.076) (Fig. 5e). However, the magnitude of difference between the median benign and pathogenic network scores was decreased compared to SBNA performed on the structural data from the 47 IRD-associated genes. Still, these results suggest that AlphaFold2 could potentially be useful in expanding the applicability of SBNA, although it is not clear that the quality of this analysis would be superior to that performed on experimentally generated structural data.
Discussion
In this study, we applied SBNA to human proteins and demonstrated that the resulting residue network scores, especially when augmented with BLOSUM62 substitution scores, correlate strongly with missense variants that cause functional deficit or clinical disease. We leveraged those scores, in combination with BLOSUM62 substitution scores, to generate estimates of the likelihood that a particular missense variant was pathogenic. Using these estimates, we were able to nominate putative genetic solutions for 40 patients with clinical evidence of IRD. These results suggest that SBNA provides meaningful insights for patients with an unclear genetic basis for their clinical symptoms, particularly for patients with inherited retinal diseases. Importantly, we note that this method is based purely on structural principles rather than training on labeled outcome data, which means that it not only provides an unbiased prediction of pathogenicity but also the mechanism of pathogenicity (i.e., structural change) is proposed.
Numerous computational tools for the prediction of variant pathogenicity have been developed15,32,45–56. Virtually all of these tools use sequence conservation and protein secondary structure or domain information. These two categories of features can work well, especially when used in conjunction with one another. However, SBNA is distinct in that it captures the structural topology of individual amino acid residues in the context of the protein architecture and does not rely on pre-existing annotations. Importantly, unlike all existing tools, the performance of SBNA does not seem to rely on a training step involving thousands of phenotype-genotype combinations or multiple sequence alignment. As shown, the raw SBNA score combined with BLOSUM62 (without training) achieves an AUC of 0.851 on the dataset of 47 IRD proteins. Furthermore, the biophysical basis for SBNA scores can be analyzed on an atomic level. This transparency facilitates downstream applications, such as the consideration of possible gene therapy targets, and is not available within models that heavily leverage machine-learning approaches26,32. The predictive value of SBNA can be further augmented by incorporating EVE scores, which are ultimately based on evolutionary multiple sequence alignments26, to achieve an AUC of 0.927. This suggests that incorporating multiple orthogonal metrics may strengthen predictive models.
Applying SBNA may provide additional insight into patient candidacy for both approved and developmental gene therapy treatments. By identifying candidate pathogenic variants with high confidence, SBNA could be used to nominate a limited subset of variants for expedited in vivo validation to fast-track delivery of appropriate therapies to patients. Broadly, the patients that may benefit clinically from SBNA fall into three categories. The first are patients who may be a candidate for gene-specific, variant-agnostic therapies, such as the FDA-approved RPE65-targeted gene therapy voretigene neparvovec57. This also applies to any gene with variant-agnostic therapies in ongoing clinical trials. A second category of patients could potentially benefit from variant-specific gene editing therapies, such as those facilitated by CRISPR-Cas9-mediated non-homologous end joining, which are currently under development for CEP29058. The third category of patients with IRD who may benefit from SBNA are not yet candidates for any existing clinical therapies. However, using SBNA on a large scale to identify candidate disease-causing variants may be informative for nominating new genes for variant-agnostic therapy development and new variants for gene editing therapy development in order to maximize potential patient benefit.
This approach is limited by the availability of high-quality structural data, a requirement for SBNA29, though we note that any tool that proposes to use structural data will be reliant on this. Numerous proteins, including the majority that correspond to known genetic variants associated with IRD, lack available x-ray crystallography or cryo-EM structures altogether. In some cases, structures are available but are not of sufficient resolution to facilitate downstream network analysis. Furthermore, this approach will only be applicable to variants that result in a negative structural change. Other types of variants – such as splice site variants – will not be captured with this approach. However, given that this tool only requires protein structure data, we expect that the performance of SBNA will continue to improve. Software that leverages artificial intelligence to predict protein structure, such as AlphaFold2 or RoseTTAFold, have already emerged to help compensate for the lack of available structural data42,59. We have demonstrated that SBNA can be meaningfully applied to some of these in silico-generated structures, which may provide a path towards overcoming the limitation of high-quality structural data availability going forward.
In conclusion, SBNA is a new tool to estimate the likely extent of the phenotypic impact of missense variants by assessing the topological important of affected residues to protein structure. We demonstrated that this technique could be meaningfully applied to human proteins and showcased the use of SBNA in IRD patients who lack a clear genetic diagnosis. These types of insights could contribute to the design of novel gene therapies targeted at implicated genetic variants57,60,61.
Methods
Structural data
All structural data were downloaded from the Protein Data Bank34. Individual accession numbers for the well-studied human proteins and inherited retinal disease proteins are listed in Supplementary Table 1. For the non-human benchmark proteins, the same files were used as previously described29. In cases where multiple human structures were available, the highest oligomeric state of the protein with a resolution of ~3 angstroms or better was used. In cases where no human structure was available, the structure of one or more homologs was used. If multiple chains were present within the PDB file due to crystal packing rather than true oligomerization, only one of these chains was used for SBNA. Solvent and water molecules were removed from all PDB files prior to SBNA, but ligands and protein binding partners were included in the analysis. Only the protein of interest was designated to have a network score calculated by SBNA.
Structure-based Network Analysis
Structure-based network analysis was used to calculate network scores as previously described29,35. The details of this method have been described previously. As before, in cases where multiple conformations of a structure were used, network scores were averaged for each amino acid position. In cases where no human structure was available, the structures of one or more non-human homologs as listed in Supplementary Table 1 were used as templates in Modeler to generate a predicated human structure. Relative solvent accessibility (RSA) was calculated using DSSP files62 and previously published maximum solvent accessible surface area values63. Code to perform structure-based network analysis is publicly available via Zenodo (10.5281/zenodo.2597484).
Data analysis and visualization
Data analysis was performed using Python (version 3.8.2), with visualizations generated using the “matplotlib” package. Logistic regressions were performed using the “glm” package in R (version 4.0.4). Intercepts were set to zero for all logistic regression models. SBNA and logistic regression results for all tested variants are available via the Harvard Dataverse (10.7910/DVN/YEEPDY). Network score visualizations were generated in R (version 4.0.4) using the “rgl” package to implement OpenGL. The backbone centroid (centroid of nitrogen, alpha carbon, and carbon) positions were plotted along x, y, and z axes, and nodes were colored and given sphere radii corresponding to network scores. The protein structure backbone was then plotted, connecting the alpha carbons. Plotted structures were manually rotated, and two-dimensional views of interest were downloaded for inclusion.
Phenotype data
Clinical phenotype data was downloaded from ClinVar. Clinical evidence with a two-gold star level designation (meaning that “two or more submitters with assertion criteria and evidence [or a public contact] provided the same interpretation”) or better was included in the published analyses20. Pathogenicity designations from ClinVar were binned into Benign (“Benign”, “Benign/Likely benign”, “Likely benign”), VUS (“Uncertain significance”, “not provided”, “Conflicting interpretations of pathogenicity”), and Pathogenic (“Pathogenic”, “Pathogenic/Likely pathogenic”, “Likely pathogenic”) categories for analysis.
Additionally, allelic variation data from gnomAD64 was considered for each gene. Loci with at least 250 available allelic variants in gnomAD were considered benign variants. Results from ClinVar and gnomAD for each gene were considered in combination for these analyses.
Functional data
Functional data for the four well-studied human proteins—BRCA136, ERK239, PTEN37, and HRAS38—that had been previously published was used for this analysis. In cases where functional scores were assigned to multiple amino acid variants at the same position (e.g., if different functional scores were calculated for Ala101Pro and Ala101Gln), the arithmetic mean of all functional scores at that position was used.
Patient data
Patient data was gathered from among those presenting to the Inherited Retinal Disorders Service at Massachusetts Eye and Ear between the 1980s and 2020s. Appropriate written informed consent was obtained from all included patients, and approval was granted by the Mass General Brigham/Massachusetts Eye and Ear Institutional Review Board Protocol Number 2019P001098. This study was approved by the local institutional review board and adhered to the Declaration of Helsinki. Informed consent was obtained from all individuals on whom clinical data and genetic testing were performed.
Genetic analyses
Pre-existing genetic solutions were available for all of the 3621 IRD cases, where genetic diagnoses for 3018 cases was obtained by targeted next-generation sequencing approaches43 or whole genome sequencing, and the remaining solutions were available from prior single-strand conformation polymorphism (SSCP) analysis and Sanger sequencing. Sequence data was aligned to the hg38 genome build, and the subsequent variant calling, annotation, and analyses were performed as described43.
Model Comparisons
SBNA was compared to results from PolyPhen241 (trained on both HumDiv and HumVar and EVE26, with scores from both models generated as described in their respective publications. EVE scores were downloaded from https://evemodel.org; PolyPhen2 scores were generated using the “batch query” function at http://genetics.bwh.harvard.edu/pph2/. To analyze the likelihood of pathogenicity, SBNA scores, EVE scores, and PolyPhen2 scores (trained on HumVar) were calculated for a set of 2800 variants. Cutoffs for each model were as follows: modified SBNA score >1.5 (corresponding to an approximate pathogenicity probability of 75% in the logistic regression model), EVE score ≥0.65, and PolyPhen2 designation of “probably damaging”. The “combined score” was generated by scaling modified SBNA scores to fall between 0 and 1 based on the minimum and maximum values across all proteins and adding them to EVE scores. A threshold of 1.0 was selected as this corresponds to the sum of the minimum pathogenic EVE score (0.65) and the scaled version of the minimum pathogenic modified SBNA score (1.5, which scales to 0.37) with two significant figures.
Statistical analysis
Statistical analysis, including the generation of ROC curves, was performed using the “scipy.stats” package in Python as well as GraphPad Prism (version 9). Comparisons between three or more categories were made using the non-parametric Kruskal-Wallis test with Dunn’s post hoc analysis corrected for multiple comparisons with a Bonferroni correction. Comparisons between the two categories were performed using the non-parametric Mann–Whitney U-test. The correlation between the two datasets was calculated using non-parametric Spearman correlation coefficients. Spearman correlation coefficients between network scores and ClinVar pathogenicity designations was calculated by assigning 0 to benign variants, 1 to VUS, and 2 to pathogenic variants.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Supplementary information
Acknowledgements
B.M.H. is supported by award number T32GM144273 from the National Institute of General Medical Sciences and award number F30AI160908 from the National Institute of Allergy and Infectious Diseases. G.D.G. is supported by award numbers DP2AI154421, R01AI176533, and DP1DA058476, the Bill and Melinda Gates Foundation, the Burroughs Wellcome Career Award for Medical Scientists, and the Howard Goodman Fellowship. E.J.R. is supported by award number K12EY016335 from the National Eye Institute and the Massachusetts Lions Eye Research Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author contributions
Conceptualization: B.M.H., G.D.G., and E.J.R.; Methodology: B.M.H., Y.L., A.N., G.D.G., and E.J.R.; Software: B.M.H., Y.L., and E.J.R.; Validation: B.M.H., Y.L., A.N., J.C., E.A.P., E.M.P., K.M.B., G.D.G., and E.J.R.; Formal analysis: B.M.H., Y.L., E.M.P., K.M.B., and E.J.R.; Resources: E.A.P., E.M.P., K.M.B., G.D.G., and E.J.R.; Data curation: B.M.H., J.C., E.A.P., E.M.P., K.M.B., and E.J.R.; Writing—original draft: B.M.H. and E.J.R.; Writing—review and editing: B.M.H., Y.L., A.N., D.G.V., J.C., E.A.P., E.M.P., K.M.B., G.D.G., and E.J.R.; Visualization: B.M.H. and E.J.R.; Supervision: D.G.V., J.C., E.A.P., K.M.B., G.D.G., and E.J.R.; Project administration and funding acquisition: E.J.R.
Data availability
ClinVar, gnomAD, and Protein Data Bank data were publicly available and can be accessed using the gene names and unique identifiers in Supplementary Table 1. Patient data from Massachusetts Eye and Ear cannot be shared publicly as specified by the Institutional Review Board due to concerns regarding possible identifiability of patients with relatively rare genetic conditions. Qualified researchers may be able to access the data via collaboration and data usage agreement by contacting the corresponding author.
Code availability
Code to perform structure-based network analysis is publicly available via Zenodo (10.5281/zenodo.2597484).
Competing interests
G.D.G. discloses research funding from Merck. E.J.R. and G.D.G. have filed two patent applications related to the use of SBNA in HIV and SARS-CoV-2.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41525-024-00416-w.
References
- 1.Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 2011;12:628–640. doi: 10.1038/nrg3046. [DOI] [PubMed] [Google Scholar]
- 2.Berger W, Kloeckener-Gruissem B, Neidhardt J. The molecular basis of human retinal and vitreoretinal diseases. Prog. Retin. Eye Res. 2010;29:335–375. doi: 10.1016/j.preteyeres.2010.03.004. [DOI] [PubMed] [Google Scholar]
- 3.Consugar MB, et al. Panel-based genetic diagnostic testing for inherited eye diseases is highly accurate and reproducible, and more sensitive for variant detection, than exome sequencing. Genet. Med. 2015;17:253–261. doi: 10.1038/gim.2014.172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Wang F, et al. Next generation sequencing-based molecular diagnosis of retinitis pigmentosa: identification of a novel genotype-phenotype correlation and clinical refinements. Hum. Genet. 2014;133:331–345. doi: 10.1007/s00439-013-1381-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ge Z, et al. NGS-based molecular diagnosis of 105 eyeGENE((R)) probands with retinitis pigmentosa. Sci. Rep. 2015;5:18287. doi: 10.1038/srep18287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Hafler BP. Clinical progress in inherited retinal degenerations: gene therapy clinical trials and advances in genetic sequencing. Retina. 2017;37:417–423. doi: 10.1097/IAE.0000000000001341. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Ben-Yosef, T. Inherited retinal diseases. Int. J. Mol. Sci. 10.3390/ijms232113467 (2022). [DOI] [PMC free article] [PubMed]
- 8.Peterson TA, Doughty E, Kann MG. Towards precision medicine: advances in computational approaches for the analysis of human variants. J. Mol. Biol. 2013;425:4047–4063. doi: 10.1016/j.jmb.2013.08.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Niroula A, Vihinen M. Variation interpretation predictors: principles, types, performance, and choice. Hum. Mutat. 2016;37:579–597. doi: 10.1002/humu.22987. [DOI] [PubMed] [Google Scholar]
- 10.Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett. 2016;590:2327–2341. doi: 10.1002/1873-3468.12307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Pejaver V, et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat. Commun. 2020;11:5918. doi: 10.1038/s41467-020-19669-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Jagadeesh KA, et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 2016;48:1581–1586. doi: 10.1038/ng.3703. [DOI] [PubMed] [Google Scholar]
- 14.Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47:D886–D894. doi: 10.1093/nar/gky1016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ioannidis NM, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 2016;99:877–885. doi: 10.1016/j.ajhg.2016.08.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Feng BJ. PERCH: a unified framework for disease gene prioritization. Hum. Mutat. 2017;38:243–251. doi: 10.1002/humu.23158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Raimondi D, et al. DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Res. 2017;45:W201–W206. doi: 10.1093/nar/gkx390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ionita-Laza I, McCallum K, Xu B, Buxbaum JD. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 2016;48:214–220. doi: 10.1038/ng.3477. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Dong C, et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 2015;24:2125–2137. doi: 10.1093/hmg/ddu733. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Landrum MJ, et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 2020;48:D835–D844. doi: 10.1093/nar/gkz972. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Wang T, et al. Probability of phenotypically detectable protein damage by ENU-induced mutations in the Mutagenetix database. Nat. Commun. 2018;9:441. doi: 10.1038/s41467-017-02806-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Miosge LA, et al. Comparison of predicted and actual consequences of missense mutations. Proc. Natl Acad. Sci. USA. 2015;112:E5189–E5198. doi: 10.1073/pnas.1511585112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Grimm DG, et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 2015;36:513–523. doi: 10.1002/humu.22768. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015;17:405–424. doi: 10.1038/gim.2015.30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Frazer J, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599:91–95. doi: 10.1038/s41586-021-04043-8. [DOI] [PubMed] [Google Scholar]
- 27.Mishra P, Flynn JM, Starr TN, Bolon DNA. Systematic mutant analyses elucidate general and client-specific aspects of Hsp90 function. Cell Rep. 2016;15:588–598. doi: 10.1016/j.celrep.2016.03.046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Hopf TA, et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 2017;35:128–135. doi: 10.1038/nbt.3769. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Gaiha GD, et al. Structural topology defines protective CD8(+) T cell epitopes in the HIV proteome. Science. 2019;364:480–484. doi: 10.1126/science.aav5095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000;156:297–304. doi: 10.1093/genetics/156.1.297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Cuevas JM, Geller R, Garijo R, Lopez-Aldeguer J, Sanjuan R. Extremely high mutation rate of HIV-1 in vivo. PLoS Biol. 2015;13:e1002251. doi: 10.1371/journal.pbio.1002251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Cheng J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381:eadg7492. doi: 10.1126/science.adg7492. [DOI] [PubMed] [Google Scholar]
- 33.Wan A, Place E, Pierce EA, Comander J. Characterizing variants of unknown significance in rhodopsin: a functional genomics approach. Hum. Mutat. 2019;40:1127–1144. doi: 10.1002/humu.23762. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Berman HM, et al. The protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Nathan A, et al. Structure-guided T cell vaccine design for SARS-CoV-2 variants and sarbecoviruses. Cell. 2021;184:4401–4413 e4410. doi: 10.1016/j.cell.2021.06.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Findlay GM, et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature. 2018;562:217–222. doi: 10.1038/s41586-018-0461-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Mighell TL, Evans-Dutson S, O’Roak BJ. A saturation mutagenesis approach to understanding PTEN lipid phosphatase activity and genotype-phenotype relationships. Am. J. Hum. Genet. 2018;102:943–955. doi: 10.1016/j.ajhg.2018.03.018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Hidalgo, F. et al. A saturation-mutagenesis analysis of the interplay between stability and activation in Ras. Elife10.7554/eLife.76595 (2022). [DOI] [PMC free article] [PubMed]
- 39.Brenan L, et al. Phenotypic characterization of a comprehensive set of MAPK1/ERK2 missense mutants. Cell Rep. 2016;17:1171–1183. doi: 10.1016/j.celrep.2016.09.061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 2004;22:1035–1036. doi: 10.1038/nbt0804-1035. [DOI] [PubMed] [Google Scholar]
- 41.Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using PolyPhen-2. Curr. Protoc. Hum. Genet.10.1002/0471142905.hg0720s76 (2013). [DOI] [PMC free article] [PubMed]
- 42.Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Zampaglione E, et al. The importance of automation in genetic diagnosis: lessons from analyzing an inherited retinal degeneration cohort with the Mendelian Analysis Toolkit (MATK) Genet. Med. 2022;24:332–343. doi: 10.1016/j.gim.2021.09.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Kleinberger J, Maloney KA, Pollin TI, Jeng LJ. An openly available online tool for implementing the ACMG/AMP standards and guidelines for the interpretation of sequence variants. Genet. Med. 2016;18:1165. doi: 10.1038/gim.2016.13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Bao L, Zhou M, Cui Y. nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res. 2005;33:W480–482,. doi: 10.1093/nar/gki372. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Choi Y, Chan AP. PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels. Bioinformatics. 2015;31:2745–2747. doi: 10.1093/bioinformatics/btv195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11:863–874. doi: 10.1101/gr.176601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Schwarz JM, Cooper DN, Schuelke M, Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat. Methods. 2014;11:361–362. doi: 10.1038/nmeth.2890. [DOI] [PubMed] [Google Scholar]
- 49.Tang H, Thomas PD. PANTHER-PSEP: predicting disease-causing genetic variants using position-specific evolutionary preservation. Bioinformatics. 2016;32:2230–2232. doi: 10.1093/bioinformatics/btw222. [DOI] [PubMed] [Google Scholar]
- 50.Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002;30:3894–3900. doi: 10.1093/nar/gkf493. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Ferrer-Costa C, Orozco M, de la Cruz X. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol. 2002;315:771–786. doi: 10.1006/jmbi.2001.5255. [DOI] [PubMed] [Google Scholar]
- 52.Li B, et al. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25:2744–2750. doi: 10.1093/bioinformatics/btp528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Saunders CT, Baker D. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol. 2002;322:891–901. doi: 10.1016/S0022-2836(02)00813-6. [DOI] [PubMed] [Google Scholar]
- 54.Seifi M, Walter MA. Accurate prediction of functional, structural, and stability changes in PITX2 mutations using in silico bioinformatics algorithms. PLoS ONE. 2018;13:e0195971. doi: 10.1371/journal.pone.0195971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Shihab HA, et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 2013;34:57–65. doi: 10.1002/humu.22225. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Tavtigian SV, et al. Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. J. Med. Genet. 2006;43:295–305. doi: 10.1136/jmg.2005.033878. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Russell S, et al. Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with RPE65-mediated inherited retinal dystrophy: a randomised, controlled, open-label, phase 3 trial. Lancet. 2017;390:849–860. doi: 10.1016/S0140-6736(17)31868-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Burnight ER, et al. Using CRISPR-Cas9 to generate gene-corrected autologous iPSCs for the treatment of inherited retinal degeneration. Mol. Ther. 2017;25:1999–2013. doi: 10.1016/j.ymthe.2017.05.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Nuzbrokh Y, Ragi SD, Tsang SH. Gene therapy for inherited retinal diseases. Ann. Transl. Med. 2021;9:1278. doi: 10.21037/atm-20-4726. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Fenner BJ, et al. Gene-based therapeutics for inherited retinal diseases. Front. Genet. 2021;12:794805. doi: 10.3389/fgene.2021.794805. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Joosten RP, et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. 2011;39:D411–419,. doi: 10.1093/nar/gkq1105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilites of residues in proteins. PLoS ONE. 2013;8:e80635. doi: 10.1371/journal.pone.0080635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
ClinVar, gnomAD, and Protein Data Bank data were publicly available and can be accessed using the gene names and unique identifiers in Supplementary Table 1. Patient data from Massachusetts Eye and Ear cannot be shared publicly as specified by the Institutional Review Board due to concerns regarding possible identifiability of patients with relatively rare genetic conditions. Qualified researchers may be able to access the data via collaboration and data usage agreement by contacting the corresponding author.
Code to perform structure-based network analysis is publicly available via Zenodo (10.5281/zenodo.2597484).