Abstract
Variants in the cystic fibrosis transmembrane conductance regulator gene (CFTR) result in cystic fibrosis–a lethal autosomal recessive disorder. Missense variants that alter a single amino acid in the CFTR protein are among the most common cystic fibrosis variants, yet tools for accurately predicting molecular consequences of missense variants have been limited to date. AlphaMissense (AM) is a new technology that predicts the pathogenicity of missense variants based on dual learned protein structure and evolutionary features. Here, we evaluated the ability of AM to predict the pathogenicity of CFTR missense variants. AM predicted a high pathogenicity for CFTR residues overall, resulting in a high false positive rate and fair classification performance on CF variants from the CFTR2.org database. AM pathogenicity score correlated modestly with pathogenicity metrics from persons with CF including sweat chloride level, pancreatic insufficiency rate, and Pseudomonas aeruginosa infection rate. Correlation was also modest with CFTR trafficking and folding competency in vitro. By contrast, the AM score correlated well with CFTR channel function in vitro–demonstrating the dual structure and evolutionary training approach learns important functional information despite lacking such data during training. Different performance across metrics indicated AM may determine if polymorphisms in CFTR are recessive CF variants yet cannot differentiate mechanistic effects or the nature of pathophysiology. Finally, AM predictions offered limited utility to inform on the pharmacological response of CF variants i.e., theratype. Development of new approaches to differentiate the biochemical and pharmacological properties of CFTR variants is therefore still needed to refine the targeting of emerging precision CF therapeutics.
Introduction
Cystic fibrosis (CF) is a lethal genetic disease caused by variants in the epithelial anion channel cystic fibrosis transmembrane conductance regulator (CFTR) [1]. CFTR is composed of an N-terminal lasso motif, two nucleotide binding domains (NBDs), two transmembrane domains (TMDs) and an unstructured regulatory domain (RD) [2]. Loss of CFTR protein production or function results in osmotic dysregulation at the epithelium of the skin, pancreatic duct, and lungs–leading to high sweat chloride levels, pancreatic insufficiency, and lung infections respectively [3]. Standard treatment paradigms for CF involve supplementation of salt, vitamins, and digestive enzymes, together with airway clearance therapies and small molecule CFTR modulators known as potentiators and correctors. CFTR variants experience distinct structural defects and proteostasis states leading to divergent pharmacological response profiles to modulators also known as theratypes [4–7].
At present, elexacaftor-tezacaftor-ivacaftor (ETI) is the best available highly effective modulator therapy for CF. This triple combination is clinically approved for ~170 CFTR variants, including the most commonly reported allele, deletion of phenylalanine 508 (F508del) [8–11]. ETI is composed of one gating potentiator (ivacaftor, VX-770) and two protein maturation correctors, tezacaftor (VX-661) and elexacaftor (VX-445). The corrector compounds have been suggested to directly bind unique subdomains of CFTR: VX-661 to TMD1 [12, 13], and VX-445 to the N-terminal lasso and TMD2 [14, 15]. Correctors contribute intermolecular interactions that favor the properly folded, trafficking competent state of CFTR. Due to the distinct binding sites, VX-661 and VX-445 elicit different mechanisms of action and confer variable theratype responses. Thus, profiling CFTR variant theratypes to these and other emerging modulators remains an important priority for CF personalized medicine.
Increasing implementation of next-generation sequencing approaches for CFTR DNA analysis has rapidly accelerated the pace of novel CFTR variant discovery and thus heightened the need for more accurate pathogenicity prediction tools. This is particularly relevant to individuals with CFTR related metabolic syndrome (CRMS), also known as CF Screen Positive Inconclusive Diagnosis (CFSPID). Patients are diagnosed with this condition if they have a positive newborn screen for CF and meet either of the following criteria: (1) normal sweat chloride value (<30 mEq/L) and two identified CFTR variants, at least one of which exhibits unclear phenotypic consequences; or (2) intermediate sweat chloride value (30–59 mEq/L) and detection of one or zero CF-causing variants [16]. Clinical symptoms worsen for approximately 11–48% of CRMS/CFSPID patients, who eventually convert to a CF diagnosis [17, 18]. Insufficient data exist to predict which CFTR variants (or other factors) enhance the risk for progression to CF.
Furthermore, high-throughput methods for characterizing CFTR variant severity are limited. Only 804 of the reported 2,111 variants have been annotated for disease association according to in vitro or clinical data [19]. The majority of these CFTR variants are single amino acid substitutions or missense variants [20]. Recently, AlphaMissense (AM) was published as a technology designed to predict the pathogenicity of missense variants throughout the human proteome [21]. Among well-characterized genetic diseases, AM included CF pathogenicity predictions for every possible CFTR single amino acid substitution. AM provides a significant advance beyond previous attempts to model a limited number of CFTR variants [22, 23]. Here, we evaluated the predictive validity of AM across several metrics of CF data such as pathogenicity in people with CF, in vitro CFTR folding and function, and theratype.
The increasing pace of novel CFTR variant discovery has created a need for pathogenicity prediction, especially among wild-type (WT) heterozygous individuals, e.g. carriers, and CRMS/CFSPID individuals whose variants remain uncharacterized. Our analysis suggests AM predicts the relative pathogenicity of severe CF-causing variants well, while performing modestly for variants of unknown significance (VUS) or variants of varying clinical consequence (VVCC). Overall, AM showed a high false positive rate for predicting CFTR2 patient outcomes [19]. Among VUSs, two variants from CFTR2 and 368 variants from ClinVar [24] were predicted as pathogenic. By contrast, the S912L VUS from CFTR2 was predicted benign despite clinical outcomes indicating ~half the people with this variant display hallmarks of CF disease. AM scores correlated modestly with CF pathogenicity metrics and CFTR trafficking/folding competency in vitro. Correlation improved when compared to CFTR channel functional data. These analyses imply AM has learned important trends in variant function despite not training on such data. Finally, we provide evidence that AM offers little power in predicting CFTR variant theratype, although we note this measure is beyond its intended design. Thus, AM may offer capabilities in predicting the pathogenicity of emerging variants but proved less useful for theratyping variants.
Results and discussion
I. AlphaMissense predictions of CFTR pathogenicity
AM assigns pathogenicity classes using score cutoffs calibrated to 90% accuracy against ClinVar data [21]. For CFTR, AM predicted scores from 0.56–1.00 as pathogenic, scores from 0.34–0.56 as ambiguous, and scores from 0.04–0.34 as benign. We mapped the average AM prediction score per residue onto a CFTR structure (PDB ID 5UAK) [25] (Fig 1A). TMDs showed a propensity for pathogenicity in contrast to residue conservation as calculated by ConSurf [26], which suggested the TMDs are comparatively variable across species (Fig 1A & S1A Fig). Since the regulatory domain (RD) is disordered and not resolved in the CFTR structure 5UAK, we also plotted the average AM score for RD residues against an RD map highlighting key features such as transiently formed α-helices and phosphorylation sites [4] (Fig 1B). Despite noted difficulty with disordered regions [21], AM predicted RD residues ~760–775 as a hotspot for pathogenicity. This is consistent with the role of transient helix 752–778 in CFTR gating through interactions with a conserved region of the NBD2 C-terminus [27, 28].
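The score ranges quoted above can be expressed as a small classification helper. This is a minimal sketch, not the AlphaMissense implementation: the function name `classify_am_score` is hypothetical, and because the text gives 0.34 and 0.56 as shared range endpoints, assigning exact boundary values to the ambiguous class is an assumption.

```python
def classify_am_score(score: float,
                      benign_max: float = 0.34,
                      pathogenic_min: float = 0.56) -> str:
    """Map an AM score to a class using the cutoffs quoted in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"AM score must lie in [0, 1], got {score}")
    if score < benign_max:
        return "benign"
    if score > pathogenic_min:
        return "pathogenic"
    # Assumption: scores exactly on a cutoff are treated as ambiguous.
    return "ambiguous"


# Scores for three of the VUS listed in Table 1.
for variant, score in [("R31L", 0.17), ("V201M", 0.36), ("D924N", 0.83)]:
    print(variant, classify_am_score(score))
```

Applied to the scores in Table 1, the helper reproduces the classes listed there for these three variants.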
We sought to evaluate AM’s ability to correctly predict CF pathogenesis on 169 classified missense variants from CFTR2.org [19]. The CFTR2 database offered a rich patient metric repository including pathogenicity classifications as CF-causing, VVCC, non-CF-causing, or VUS (S1 Table). VVCC are defined by CFTR2 as variants that may cause CF when found heterozygous with CF-causing variants, which results in variable clinical diagnosis of CF, e.g., a person with a VVCC and a CF-causing variant may or may not present with CF [19].
AM showed 95% accuracy (104/110) for predicting pathogenic variants and 78% accuracy (14/18) for predicting benign variants based on variant determination in CFTR2 (S1 Table). We calculated the receiver operating characteristic (ROC) curve for all pairwise comparisons of pathogenicity predicted by AM (Fig 1C, See Methods). Briefly, each prediction class (pathogenic, ambiguous, or benign) was taken in turn to be the true positive, and the other two prediction classes were taken to be false positives. We considered pathogenic to predict CF-causing, ambiguous to predict VVCC, and benign to predict non-CF-causing; VUS were not used. While looping through all possible score thresholds, the corresponding true positive and false positive rates were calculated and plotted. Benign predictions showed the highest area under the curve (AUC) (0.91), followed by pathogenic (0.80) and ambiguous (0.66) predictions, suggesting that AM has a high false positive rate, particularly for ambiguous predictions (Fig 1C). A high false positive rate could in principle be attributed to a poor AlphaFold2 (AF2) predicted structure of CFTR. However, the AF2 predicted CFTR [29] shows a root mean squared deviation of just 2.5 Å from the active state cryo-EM model (PDB ID 6MSM, resolution 3.2 Å [30]) (S1B Fig).
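The threshold sweep described above can be sketched as a one-vs-rest ROC calculation. This is an illustrative, dependency-free sketch under stated assumptions rather than the code used for Fig 1C: the function names, the toy scores, and the toy labels are hypothetical, and only the two directional cases are shown (high scores indicating CF-causing, low scores indicating non-CF-causing) because the text does not specify how the ambiguous class was scored.

```python
def roc_points(scores, is_positive):
    """Return (FPR, TPR) pairs from sweeping every observed score cutoff.

    A variant is called positive when its score is >= the cutoff.
    """
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_positive) if s >= cutoff and p)
        fp = sum(1 for s, p in zip(scores, is_positive) if s >= cutoff and not p)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def one_vs_rest_auc(scores, labels, positive_label, higher_is_positive=True):
    """AUC with one class as positive and all other classes as negative."""
    ranked = scores if higher_is_positive else [-s for s in scores]
    flags = [label == positive_label for label in labels]
    return auc(roc_points(ranked, flags))


# Hypothetical toy data, not values from CFTR2.
toy_scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20, 0.10]
toy_labels = ["CF-causing", "CF-causing", "VVCC", "CF-causing",
              "VVCC", "non-CF-causing", "non-CF-causing"]

print(one_vs_rest_auc(toy_scores, toy_labels, "CF-causing"))
print(one_vs_rest_auc(toy_scores, toy_labels, "non-CF-causing",
                      higher_is_positive=False))
```

Negating the scores for the benign comparison lets the same sweep treat low AM scores as the positive call.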
We noted seven VUS in the CFTR2.org database and their respective AM predictions (Table 1). The location of these variants in the CFTR structure is shown (S1C Fig). Benign-predicted R31L disrupts the arginine-framed tripeptide motif at R29-R31, important for folding evaluation prior to ER export [31], and may affect endocytosis rates [32]. V201M was ambiguously predicted, consistent with our previous report describing this variant as mildly mis-trafficked and selectively sensitive to VX-661 [23]. A349V (benign prediction) and Y1014C (ambiguous prediction) showed trafficking and function slightly below WT [33], suggesting these variants are benign. Benign-predicted variant S912L lies close to the CFTR glycosylation sites at N894 and N900; thus, we speculated this mutation could interfere with glycan processing. Nevertheless, S912L trafficking and function remained sufficient compared to WT in vitro [34].
Table 1. CFTR2.org Variants of Unknown Significance (VUS).
Variant | AlphaMissense Score | Pathogenicity Prediction |
---|---|---|
R31L | 0.17 | benign |
V201M | 0.36 | ambiguous |
A349V | 0.26 | benign |
S912L | 0.12 | benign |
D924N | 0.83 | pathogenic |
M952T | 0.85 | pathogenic |
Y1014C | 0.37 | ambiguous |
Variants D924N and M952T, both located in transmembrane helix 8, are predicted as pathogenic (S1C Fig). D924N resides in the potentiator binding hotspot [35, 36] and, according to clinical data, may cause pancreatic insufficiency but not lung disease [37]. M952T displays robust functional expression in vitro [38], and two patients with an M952T/F508del genotype exhibit normal chloride transport measured from intestinal mucosa [38]–suggesting this variant is likely not pathogenic, despite the AM prediction.
For performance comparison, we also plotted ROC curves for AM predictions of the 115 ClinVar variants from the AlphaMissense study and observed 96% average accuracy as presented previously [21] (S1D Fig). To validate this finding, we additionally downloaded a dataset of 209 ClinVar variants directly from ClinVar, including 96 overlapping variants from the AlphaMissense benchmark set. The ROC curve for our expanded ClinVar dataset showed >90% prediction accuracy with additional variants (S1E Fig). Finally, we plotted a ROC curve for 113 ClinVar variants not included in the AM benchmark set, which revealed a >90% accuracy and indicates AM performs well on ClinVar data outside of the training set (S1F and S1G Fig). In addition to classified variants used for performance evaluation, ClinVar contains 1,277 CFTR VUS [24]. AM predicted VUS ClinVar variants to contain 728 benign, 181 ambiguous, and 368 pathogenic variants (S1H Fig, S2 Table).
AM performance was also compared to two other pathogenicity prediction tools, Evolutionary Scale Modeling (ESM) [39] and Evolutionary model of Variant Effect (EVE) [40] (S2A and S2B Fig). In the ROC AUCs of benign variants, the AM value (0.91) was higher than those obtained for ESM (0.78) or EVE (0.78). A similar observation was made for pathogenic ROC AUCs, with AM (0.80) slightly above ESM (0.76) or EVE (0.73). ROC AUCs for ambiguous variants were nearly uniform across all methods (AM, 0.66; ESM, 0.65; EVE, 0.64). AM therefore offers a slight advantage for predicting pathogenic or benign variants and less utility regarding ambiguous variants.
Previous analysis of CFTR variants across sampled genetic information indicates the NBD1-intracellular loop 4 (ICL4) interface is a hotspot of pathogenicity [41]. Thus, we generated a heatmap of AM scores for NBD1 residues 485–565, which encompass the α-helical subdomain, structurally diverse region (SDR), and the entire NBD1-ICL4 boundary (Fig 1D). For the Q-loop (residues 486–495) and helix 3 (residues 496–512), potential substitutions are largely predicted as pathogenic except for residues 494 and 511. Of note, AM predicts position 508 as intolerant to substitution. Deletion of the encoded phenylalanine (F508del) is the most frequently reported variant among worldwide CF populations [8, 42–44]. Most variations calculated as benign or ambiguous occur within helix 4/4b of the α-helical subdomain (residues 511–532) or SDR (residues 533–547) (Fig 1D). Positions across helix 4/4b where possible substitutions are predicted as pathogenic include V520 and C524. V520F [7] and C524X [45] are not presently approved for treatment with CFTR correctors and potentiators. Within the SDR, benign predictions are the most common (40% benign, 35% pathogenic), as expected based on the lack of structure in the region.
In contrast, variations at the NBD1-ICL4 interface are overwhelmingly scored as severe. Residues 548–565 comprise the NBD1 core helix 5, which directly interacts with ICL4 and demonstrates the strongest sensitivity to mutation, with most potential substitutions predicted as pathogenic (7% benign, 82% pathogenic) (Fig 1D). This region contains numerous CF-causing variants, some of which are refractory to available CFTR modulators, such as R560T/K/S [4, 41]. Within the ICL4 region (residues 1048–1084), AM scores indicate 14% benign and 69% pathogenic predictions (Fig 1E). The heatmap reveals residues 1069, 1072, 1076, and 1084 as relatively tolerant to substitution. Together, these data suggested AM pathogenicity scores matched previous findings, as well as our general understanding about residue conservation throughout CFTR, while providing specific information about every possible substitution.
II. Cystic Fibrosis pathogenicity correlated modestly with AlphaMissense predictions
In addition to classifying variant pathogenicity, the CFTR2.org database annotates clinical outcomes for persons with CF including sweat chloride levels, pancreatic insufficiency rates, Pseudomonas aeruginosa infection rates, and lung function [19]. We curated the clinical outcomes for all CFTR missense variants with available data (S1 Table, See Methods). We then analyzed the ability of AM to predict patient pathogenicity metrics. Briefly, CFTR2.org data were downloaded from the Variant List History tab and filtered for 176 missense variants (169 classified and 7 VUS). Then, clinical outcome data were manually assembled by searching each variant and recording the sweat chloride (mEq/L), pancreatic insufficiency rate (%), P. aeruginosa infection rate (%), and lung function (forced expiratory volume in one second (FEV1), % predicted). Of note, CFTR2 data was based on individual alleles, e.g. missense variants.
First, we plotted AM score versus CF sweat chloride levels for 123 missense variants with sweat chloride values reported (Fig 2A). AM score correlated modestly with sweat chloride levels (Pearson Correlation Coefficient: 0.46, Spearman Correlation Coefficient: 0.48). CF-causing variants, shown in blue, clustered in the top right corner, indicative of high AM scores and elevated sweat chloride levels. By contrast, VVCCs, shown in yellow, clustered in the bottom right corner, reflecting an excessive AM score (Fig 2A). When considering CF-causing or VVCC separately, we note a reduced correlation between sweat chloride levels and AM scores (S3A and S3B Fig), suggesting AM captures the trend across all variant types rather than performing better on pathogenic variants.
Next, we plotted AM score versus pancreatic insufficiency rates for 116 missense variants present on at least one allele of persons with CF with CFTR2 outcomes reported (Fig 2B). AM scores correlated poorly with pancreatic insufficiency rates (Pearson coefficient: 0.31, Spearman Coefficient: 0.41) compared to sweat chloride. Again, AM failed to predict VVCCs, shown in yellow, on this metric (Fig 2B). However, considering CF-causing variants and VVCCs separately failed to change the correlation for pancreatic insufficiency (S3C and S3D Fig). Finally, we plotted AM score versus P. aeruginosa infection rates for 114 missense variants on at least one allele with CFTR2 outcomes reported (Fig 2C). AM correlated better here than for pancreatic insufficiency rates, but worse than for sweat chloride (Pearson Coefficient: 0.38, Spearman Coefficient: 0.44). AM performed better on VVCCs for this metric, yet correlation was again reduced when CF-causing variants or VVCCs were considered separately (S3E and S3F Fig).
Taken together, AM correlated modestly with clinical data and performed poorly on VVCCs and VUSs. For example, VUS S912L was predicted benign with an AM score of 0.12. However, this variant was associated with sweat chloride levels of 60 mEq/L (Fig 2A), which resides exactly at the diagnostic cutoff for CF. S912L displays a pancreatic insufficiency rate of 57% (Fig 2B) and P. aeruginosa infection rate of 50% (Fig 2C)–suggesting this variant may present with more pathologic characteristics than predicted or annotated in CFTR2. Unfortunately, pathogenic-predicted variants such as D924N and M952T have insufficient data available on CFTR2 for comparison. Weak performance by AM could be attributable to high false positive rates and/or compound heterozygous genotypes. The latter factor likely complicates interpretation of clinical data, as people with complex CF alleles may exhibit differing degrees of variant severity on each chromosome (e.g. one CF-causing paired with a VUS/VVCC) compared to patients with the same variant severity on each allele (e.g. two CF-causing).
III. AlphaMissense predicts CFTR function beyond folding and trafficking competency
Much CFTR biochemical and functional data was also available for comparison, including recent deep mutational scanning (DMS), theratype screening, and spatial covariance studies [23, 33, 46] (S3 and S4 Tables). In the DMS study, fluorescence-activated cell sorting was used to measure the cell surface immunostaining intensity of an epitope-tagged library of 129 CFTR variants including 100 missense variants [23]. In the theratype study, 655 variants including 585 missense variants were screened for their trafficking efficiency and function [33]. In the spatial covariance study, a CFTR trafficking and a chloride conductance index were established to characterize variant temperature response [46]. Variable, albeit high, overlap was observed between the CFTR2 dataset and the in vitro data sets discussed below (S2C Fig).
First, we evaluated the ability of AM to predict CFTR folding competency, which is well characterized to correlate with cell surface expression and trafficking efficiency [47–50]. We plotted AM prediction scores for 100 missense variants versus DMS cell immunostaining intensity (Fig 3A, S4 Table), which showed an inverse relationship with poor correlation (Pearson coefficient: -0.37, Spearman Coefficient: -0.37). Notably, among variants in the top right corner, i.e. high AM score and high surface staining, we observed several gating variants (e.g., G551D/S, R347H, S1251N, and G1244E) (Fig 3A). Mis-gating variants traffic normally, but they are CF-causing due to disrupted properties of channel opening and closing. This result demonstrated that AM failed to infer the nature of the variant defect.
Next, we plotted AM prediction scores for 538 missense variants versus CFTR trafficking efficacy as measured by the ratio of mature, fully-glycosylated CFTR (band C) to the immature glycoform (band B) on western blot (C/B band ratio) (Fig 3B, S3 Table) [33]. Experimental data were filtered for plotting clarity (S4 Fig, See Methods). We removed highly variable experimental data with a standard error of the mean (SEM) greater than 30. Most CFTR variants show a C:B ratio less than 30% of WT, so an SEM above this value indicates a lack of reproducibility for the measurement (8% of data points removed, 92% retained). AM scores displayed improved inverse correlation with the larger trafficking efficiency dataset (Pearson coefficient: -0.50, Spearman Coefficient: -0.53). This finding suggested AM can predict CFTR folding competency across diverse types of variants. Several off-axis variants were annotated that were predicted benign (AM score <0.3) despite poor trafficking (<30% of WT), based on the distribution of all trafficking data (S4A and S4B Fig).
Finally, we evaluated the ability of AM to predict CFTR function as measured by transepithelial current clamp conductance [33]. We plotted AM prediction scores versus forskolin (FSK)-induced basal CFTR channel activity as percent WT (FSK %WT) for 546 missense variants (Fig 3C). Again, highly variable experimental data (SEM greater than 20) were filtered out, as most variants showed activity below 20% of WT (S4 Fig, See Methods), leaving 93% of the experimental data for comparison to AM. AM scores inversely correlated best with CFTR function measured by conductance (Pearson coefficient: -0.70, Spearman Coefficient: -0.69). Several off-axis variants were noted that were predicted benign (AM score <0.3) despite poor channel function (<30% of WT), based on the distribution of functional data (S4C and S4D Fig).
We verified the increased capability to predict CFTR function by correlating AM scores with a spatial covariance study (S5 Fig). This study describes trafficking (measured by western blot band shift assay) and chloride conductance indices and presents data for both metrics at 37 °C and reduced temperature (27 °C) [46]. Reduced temperature is a well-established method for partially rescuing F508del biogenesis [51]. We observed a modest correlation (Pearson coefficient: -0.46, Spearman Coefficient: -0.44) with trafficking index at 37 °C, and a similar correlation at 27 °C (Pearson coefficient: -0.48, Spearman Coefficient: -0.49) (S5A and S5B Fig). Again, correlation increased when compared to chloride conductance index (Pearson coefficient: -0.58, Spearman Coefficient: -0.54 at 37 °C vs. Pearson coefficient: -0.50, Spearman Coefficient: -0.53 at 27 °C) (S5C and S5D Fig). Together these results indicated that AM scores are closely aligned with pathogenicity but cannot differentiate between variants that compromise expression versus function.
IV. AlphaMissense cannot predict CFTR variant theratype
Given the rapid and continuous emergence of novel CFTR variants detected by next-generation sequencing technologies, as well as a robust pipeline of new modulators and other CFTR-directed treatments under development, the need remains for optimized approaches to CF precision therapeutics. CFTR variant theratyping is an established method for quantifying in vitro CFTR sensitivity to pharmacologic agents, results of which are utilized to predict treatment responses for genotype-matched patients [6, 52]. CF treatment involves two corrector compounds, VX-661 and VX-445, that likely bind directly to two unique sites on CFTR [14], show distinct mechanisms, and hence distinct response profiles across variants. Thus, theratyping variant response remains an important task for CF personalized medicine.
We sought to determine whether AM offered any predictive power for CFTR theratyping, although this task is beyond the intended scope of AM. Theratype-distinguishing plots were generated and colored by AM pathogenicity score to assess for potential patterns. We split VX-445-sensitive variants from VX-661-sensitive variants along a diagonal axis of best fit by plotting CFTR immunostaining intensity for VX-445 versus VX-661 (Fig 4A). Variants responsive to VX-445 fell above the dotted line, and variants responsive to VX-661 fell below the dotted line [23]. Variants were then colored by AM pathogenicity score, although the color distribution across the responsive spectrum revealed few discernible patterns (Fig 4A). We also plotted basal CFTR immunostaining intensity versus VX-661, VX-445, or the combination thereof, then shaded variants by AM pathogenicity (S6A–S6C Fig). Similarly, AM scores showed little-to-no color pattern and appeared randomly distributed.
Next, we used the theratyping study CFTR functional data [33] to plot the VX-445 + VX-661 FSK-mediated response (% of WT) versus basal activity, then colored the values by AM pathogenicity score (Fig 4B). Benign variants fell along a linear diagonal, suggesting that benign predicted variants all experience a linear response to CFTR correctors. We speculate this shift may reflect well-documented WT modulator response, implying an inherent stabilizing effect of VX-445 and VX-661. C:B band ratio response colored by AM score portrayed a random distribution of score color (S7A Fig). Pathogenic predicted variants in both plots show a random distribution. To determine whether theratype was predicted by variant structural location within CFTR, combined with AM score, we subdivided the plot in Fig 4B by domain (S7B–S7E Fig). Each domain individually showed a similar random distribution of score colors. Finally, we calculated the relative degree of CFTR correction by subtracting basal FSK (% of WT) from VX-445+VX-661 correction FSK (% of WT) and plotted this difference against AM score (Fig 4C). Again, no obvious pattern was observed. In summary, we found AM score afforded little predictive power for profiling pharmacologic responsiveness of CFTR variants. However, AM score could potentially be a useful machine learning feature for future theratype prediction methods.
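The correction metric plotted in Fig 4C is a simple difference, which can be sketched as follows. The function name, the record layout, and all numeric values here are hypothetical placeholders for illustration, not data from the theratyping study.

```python
def correction_delta(basal_pct_wt: float, corrected_pct_wt: float) -> float:
    """Corrector-induced gain in FSK response, in percentage points of WT."""
    return corrected_pct_wt - basal_pct_wt


# Hypothetical records: AM score plus basal and VX-445+VX-661 FSK response (% of WT).
records = [
    {"variant": "var_A", "am_score": 0.92, "basal": 5.0, "corrected": 45.0},
    {"variant": "var_B", "am_score": 0.88, "basal": 4.0, "corrected": 6.0},
    {"variant": "var_C", "am_score": 0.15, "basal": 80.0, "corrected": 110.0},
]

# Pair each AM score with its correction delta, as would be plotted on x and y.
pairs = [(r["am_score"], correction_delta(r["basal"], r["corrected"]))
         for r in records]
for am_score, delta in pairs:
    print(f"AM={am_score:.2f}  delta={delta:.1f}")
```

In this made-up example, two variants with similar AM scores show very different deltas, which is the kind of scatter that would indicate little predictive power for theratype.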
Conclusion
AlphaMissense has the exciting potential to aid with pathogenicity classification of rare and emerging variants identified during genetic screening. CF provided a valuable case study for evaluating AM performance because of the abundance of clinical outcome data and in vitro variant classifications available. AM predicted pathogenicity of severe CF-causing variants well, albeit with a high false positive rate, and matched previous studies of CFTR variant pathogenicity in the NBD1/ICL4 interface [41]. However, AM performed modestly for pathogenicity predictions of VUSs and VVCCs, and the tool does not appear useful for CFTR theratype predictions. For pathogenic missense variants, AM score correlated modestly with trafficking data and correlated well with channel activity functional data. Thus, predictions offer little information for distinguishing pathogenicity mechanism. AM may provide guidance in determining if polymorphisms in CFTR are benign, but performance on less severe disease variants indicates that caution must be taken when interpreting AM predictions. In vitro measurements of variant severity may aid in evaluating prediction quality and will remain necessary for CFTR theratyping.
Methods
Data curation and collection
AlphaMissense (AM) predictions for all single amino acid substitutions in the human proteome were downloaded, decompressed with gunzip, and searched using the vim text editor for the CFTR accession number/UniProt ID P13569. CFTR AM predictions were extracted into a separate file for analysis. ESM score predictions were downloaded from https://huggingface.co/spaces/ntranoslab/esm_variants by searching accession number P13569. EVE predictions were downloaded from https://evemodel.org/proteins/CFTR_HUMAN#variantsTableContainer by searching accession number P13569.
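The extraction step above was done by hand in a text editor; a scripted equivalent might look like the sketch below. The column names (`uniprot_id`, `protein_variant`, `am_pathogenicity`, `am_class`) and the `#`-prefixed comment header are assumptions about the layout of the downloaded file, and the function name and the non-CFTR row are hypothetical. The two CFTR rows reuse scores from Table 1.

```python
import csv
import io


def extract_protein_rows(lines, uniprot_id="P13569"):
    """Yield parsed rows for one UniProt accession from tab-separated lines.

    Blank lines and lines starting with '#' are skipped before the header
    row is read (assumed file layout).
    """
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        if row["uniprot_id"] == uniprot_id:
            yield row


# Small in-memory stand-in for the decompressed predictions file.
sample = (
    "# illustrative header comment\n"
    "uniprot_id\tprotein_variant\tam_pathogenicity\tam_class\n"
    "P13569\tR31L\t0.17\tbenign\n"
    "X00000\tA1C\t0.50\tambiguous\n"
    "P13569\tD924N\t0.83\tpathogenic\n"
)

cftr_rows = list(extract_protein_rows(io.StringIO(sample)))
for row in cftr_rows:
    print(row["protein_variant"], row["am_pathogenicity"])
```

For a real compressed download, passing the file object returned by `gzip.open(path, "rt")` as `lines` would stream the file without decompressing it to disk first.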
Cystic Fibrosis clinical outcome data was initially downloaded from the Variant List History tab on CFTR2.org. The table of 804 variants was filtered for 176 missense variants by removing in/dels, splicing variants, premature stop codons, etc. The patient information was manually curated by searching each variant and annotating the sweat chloride (mEq/L), pancreatic insufficiency rate (%), P. aeruginosa infection rate (%), lung function ages < 10 (FEV1%), lung function ages 10<20 (FEV1%), lung function ages >20 (FEV1%) (S1 Table). Lung function data proved too highly variable for comparison and was not used, but was still included in the Supporting Table for reference. CFTR2 definitions for these variants are as follows [19]: CF-causing: “A variant in one copy of the CFTR gene that always causes CF, as long as it is paired with another CF-causing variant in the other copy of the CFTR gene.” Non-CF-causing: “A variant in one copy of the CFTR gene that does not cause CF, even when it is paired with a CF-causing variant in the other copy of the CFTR gene.” Variant of Variable Clinical Consequence (VVCC): “A variant that may cause CF, when paired with a CF-causing variant in the other copy of the CFTR gene.” Variant of Unknown Significance (VUS): “A variant for which we do not have enough information to determine whether or not it falls into the other three categories.”
In vitro modulator response data were downloaded from [33] and deep mutational scanning data were downloaded from [23]. The 650 variants from [33] were filtered to 585 missense variants. ClinVar data were downloaded after searching for CFTR. ClinVar predictions were filtered for missense variants by removing in/dels, stop codons, double missense variants, etc. Filtering yielded 1768 missense variant pathogenicity predictions (S2 Table). For performance comparison, the missense variants were filtered by clinical significance. We removed classifications such as no interpretation, conflicting interpretations, uncertain significance, and drug response. This left 219 variants classified as pathogenic or likely benign for performance evaluation and ROC plotting.
Filtering experimental data
Experimental data from Bihler et al. [33] were filtered to exclude highly variable data based on the SEM due to lack of reproducibility. We plotted both the distribution of the data itself, to look for outliers on the y axis of our correlation plots (Fig 3B and 3C), and the distribution of the SEM (S4 Fig). We labeled outliers with a C:B ratio of less than 30 but a benign AM prediction of less than 0.3 (S4A Fig, Fig 3B). Variants with a C:B band ratio SEM greater than 30 were excluded from the analysis and not plotted for clarity, leaving 538 variants for analysis (92% of the experimental data) (S4B Fig). For the functional data, we labeled outliers of interest with a FSK % of WT of less than 30 but a benign AM prediction of less than 0.3 (S4C Fig, Fig 3C). Variants with a FSK % of WT SEM greater than 20 were excluded from the analysis and not plotted for clarity, leaving 546 variants for analysis (93% of the available experimental data).
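The two filtering steps described above, dropping measurements with a high SEM and then labeling off-axis variants, can be sketched as follows. The record layout, the function names, and every numeric value are hypothetical; only the cutoffs (SEM of 30 for the C:B ratio, a measured value below 30, and an AM score below 0.3) come from the text.

```python
def filter_by_sem(records, sem_cutoff):
    """Keep measurements whose SEM does not exceed the cutoff."""
    return [r for r in records if r["sem"] <= sem_cutoff]


def flag_off_axis(records, value_max=30.0, am_max=0.3):
    """Return variants with a low measured value but a benign-range AM score."""
    return [r["variant"] for r in records
            if r["value"] < value_max and r["am_score"] < am_max]


# Hypothetical C:B ratio measurements (% of WT) with their SEM.
records = [
    {"variant": "var_1", "am_score": 0.95, "value": 5.0, "sem": 2.0},
    {"variant": "var_2", "am_score": 0.20, "value": 12.0, "sem": 4.0},
    {"variant": "var_3", "am_score": 0.15, "value": 95.0, "sem": 45.0},
    {"variant": "var_4", "am_score": 0.10, "value": 88.0, "sem": 10.0},
]

kept = filter_by_sem(records, sem_cutoff=30.0)
print(f"retained {len(kept)}/{len(records)} variants")
print("off-axis:", flag_off_axis(kept))
```

For the functional (FSK % of WT) data, the same helpers would be called with `sem_cutoff=20.0`, per the cutoff stated in the text.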
Analysis
Data were analyzed and plotted in Python 3. Raw Excel files were imported and parsed using the pandas data frame library, and plots were generated with the matplotlib.pyplot and seaborn libraries. Pearson and Spearman correlation coefficients were calculated with the scipy.stats library using the pearsonr() and spearmanr() functions, respectively. Plots were generated for all variants with available data for a given metric. The receiver operating characteristic (ROC) curve for the AM pathogenicity score was calculated against CFTR2 classification. CFTR2 classifies variants as CF causing, variable clinical consequence (VVCC), or non-CF causing. We equated the AM prediction pathogenic to CF causing, ambiguous to variable, and benign to non-CF causing. Since ROC is used for binary classification, all pairwise comparisons were considered. In each ROC curve, a different prediction (pathogenic, ambiguous, or benign) was taken to be a true positive, and the other two predictions to be false positives. The corresponding true positive and false positive rates were then calculated by considering all possible score cutoffs for pathogenicity. Theratype discerning plots were generated to distinguish responsive from non-responsive variants graphically and were colored by each variant's AM score.
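To make the ROC construction concrete, the sketch below treats one class (here CF-causing) as the positive class and the remaining classes as negatives, then sweeps every observed score as a cutoff. It is written from scratch in plain Python for transparency and is not the code used in this work; in practice a library routine such as scikit-learn's `roc_curve` would typically be used. The scores and labels are made-up toy values, and the helper names are hypothetical.

```python
def roc_points(scores, is_positive):
    """Return (fpr, tpr) lists obtained by sweeping every observed score cutoff."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    fpr, tpr = [0.0], [0.0]  # cutoff above every score: nothing called positive
    for cutoff in sorted(set(scores), reverse=True):
        # A variant is called positive when its score is at or above the cutoff.
        tp = sum(1 for s, p in zip(scores, is_positive) if s >= cutoff and p)
        fp = sum(1 for s, p in zip(scores, is_positive) if s >= cutoff and not p)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr))
    )


# Toy data: higher score means a more pathogenic prediction.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = ["CF-causing", "CF-causing", "VVCC", "CF-causing", "non-CF-causing"]

# One class is the positive class; the other two are pooled as negatives.
positive_class = "CF-causing"
is_positive = [label == positive_class for label in labels]

fpr, tpr = roc_points(scores, is_positive)
auc_value = auc(fpr, tpr)
print(f"AUC for {positive_class} vs rest: {auc_value:.3f}")
```

Repeating the calculation with a different `positive_class` gives the other curves; for a class associated with low scores, the score would need to be inverted first, which is an assumption of this sketch rather than something stated in the Methods.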
Supporting information
Data Availability
All relevant data are within the manuscript and its Supporting information files.
Funding Statement
This work was supported by R35 GM133552 (NIGMS), R01 HL167046 (NHLBI), R00HL151965 (NIH) and OLIVER22A0-KB (CFF). EFM was supported by a predoctoral fellowship F31 HL162483 (NHLBI) and Chemical-Biology Interface training grant T32 GM065086 (NIGMS).
References
- 1. Welsh MJ, Smith AE. Molecular mechanisms of CFTR chloride channel dysfunction in cystic fibrosis. Cell. 1993 Jul;73(7):1251–4. doi: 10.1016/0092-8674(93)90353-r
- 2. Riordan JR, Rommens JM, Kerem BS, Alon N, Rozmahel R, Grzelczak Z, et al. Identification of the Cystic Fibrosis Gene: Cloning and Characterization of Complementary DNA. Science. 1989 Sep 8;245(4922):1066–73. doi: 10.1126/science.2475911
- 3. Cutting GR. Cystic fibrosis genetics: from molecular understanding to clinical application. Nat Rev Genet. 2015 Jan;16(1):45–56. doi: 10.1038/nrg3849
- 4. McDonald EF, Meiler J, Plate L. CFTR Folding: From Structure and Proteostasis to Cystic Fibrosis Personalized Medicine. ACS Chem Biol. 2023 Sep 20;acschembio.3c00310. doi: 10.1021/acschembio.3c00310
- 5. Oliver KE, Han ST, Sorscher EJ, Cutting GR. Transformative therapies for rare CFTR missense alleles. Curr Opin Pharmacol. 2017 Jun;34:76–82. doi: 10.1016/j.coph.2017.09.018
- 6. Clancy JP, Cotton CU, Donaldson SH, Solomon GM, VanDevanter DR, Boyle MP, et al. CFTR modulator theratyping: Current status, gaps and future directions. J Cyst Fibros. 2019 Jan;18(1):22–34. doi: 10.1016/j.jcf.2018.05.004
- 7. Molinski SV, Ahmadi S, Hung M, Bear CE. Facilitating Structure-Function Studies of CFTR Modulator Sites with Efficiencies in Mutagenesis and Functional Screening. SLAS Discov. 2015 Dec;20(10):1204–17. doi: 10.1177/1087057115605834
- 8. CF Foundation Patient Registry, https://www.cff.org/medical-professionals/patient-registry. 2022.
- 9. Middleton PG, Mall MA, Dřevínek P, Lands LC, McKone EF, Polineni D, et al. Elexacaftor–Tezacaftor–Ivacaftor for Cystic Fibrosis with a Single Phe508del Allele. N Engl J Med. 2019 Nov 7;381(19):1809–19. doi: 10.1056/NEJMoa1908639
- 10. Oliver KE, Carlon MS, Pedemonte N, Lopes-Pacheco M. The revolution of personalized pharmacotherapies for cystic fibrosis: what does the future hold? Expert Opin Pharmacother. 2023 Sep 22;24(14):1545–65. doi: 10.1080/14656566.2023.2230129
- 11. Trikafta Prescribing Information. 2023.
- 12. Baatallah N, Elbahnsi A, Mornon JP, Chevalier B, Pranke I, Servel N, et al. Pharmacological chaperones improve intra-domain stability and inter-domain assembly via distinct binding sites to rescue misfolded CFTR. Cell Mol Life Sci. 2021 Dec;78(23):7813–29. doi: 10.1007/s00018-021-03994-5
- 13. Fiedorczuk K, Chen J. Mechanism of CFTR correction by type I folding correctors. Cell. 2022;185(1):158–168.e11. doi: 10.1016/j.cell.2021.12.009
- 14. Fiedorczuk K, Chen J. Molecular structures reveal synergistic rescue of Δ508 CFTR by Trikafta modulators. Science. 2022 Oct 21;378(6617):284–90.
- 15. Wang C, Yang Z, Loughlin BJ, Xu H, Veit G, Vorobiev S, et al. Mechanism of dual pharmacological correction and potentiation of human CFTR [Internet]. Biophysics; 2022 Oct [cited 2023 Dec 23]. http://biorxiv.org/lookup/doi/10.1101/2022.10.10.510913
- 16. Kallam EF, Kasi AS, Barr E, Linnemann RW, Guglani L. Diagnostic challenges in CFTR-related metabolic syndrome: Where the guidelines fall short. Paediatr Respir Rev. 2023 Aug;S1526054223000489. doi: 10.1016/j.prrv.2023.08.004
- 17. Barben J, Castellani C, Munck A, Davies JC, De Winter–de Groot KM, Gartner S, et al. Updated guidance on the management of children with cystic fibrosis transmembrane conductance regulator-related metabolic syndrome/cystic fibrosis screen positive, inconclusive diagnosis (CRMS/CFSPID). J Cyst Fibros. 2021 Sep;20(5):810–9. doi: 10.1016/j.jcf.2020.11.006
- 18. Southern KW, Barben J, Gartner S, Munck A, Castellani C, Mayell SJ, et al. Inconclusive diagnosis after a positive newborn bloodspot screening result for cystic fibrosis; clarification of the harmonised international definition. J Cyst Fibros. 2019 Nov;18(6):778–80. doi: 10.1016/j.jcf.2019.04.010
- 19. The Clinical and Functional TRanslation of CFTR (CFTR2); http://cftr2.org.
- 20. Cystic Fibrosis Mutation Database, http://www.genet.sickkids.on.ca/. 2023.
- 21. Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023 Sep 19;eadg7492. doi: 10.1126/science.adg7492
- 22. McDonald EF, Woods H, Smith ST, Kim M, Schoeder CT, Plate L, et al. Structural Comparative Modeling of Multi-Domain F508del CFTR. Biomolecules. 2022 Mar 18;12(3):471. doi: 10.3390/biom12030471
- 23. McKee AG, McDonald EF, Penn WD, Kuntz CP, Noguera K, Chamness LM, et al. General trends in the effects of VX-661 and VX-445 on the plasma membrane expression of clinical CFTR variants. Cell Chem Biol. 2023 Jun 15;30(6):632–642.e5.
- 24. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018 Jan 4;46(D1):D1062–7. doi: 10.1093/nar/gkx1153
- 25. Liu F, Zhang Z, Csanády L, Gadsby DC, Chen J. Molecular Structure of the Human CFTR Ion Channel. Cell. 2017;169(1):85–92. doi: 10.1016/j.cell.2017.02.024
- 26. Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, et al. ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res. 2016 Jul;44(W1):W344–50. doi: 10.1093/nar/gkw408
- 27. Naren AP, Cormet-Boyaka E, Fu J, Villain M, Blalock JE, Quick MW, Kirk KL. CFTR Chloride Channel Regulation by an Interdomain Interaction. Science. 1999 Oct;286(5439):544–8. doi: 10.1126/science.286.5439.544
- 28. Bozoky Z, Krzeminski M, Muhandiram R, Birtley JR, Al-Zahrani A, Thomas PJ, et al. Regulatory R region of the CFTR chloride channel is a dynamic integrator of phospho-dependent intra- and intermolecular interactions. Proc Natl Acad Sci U S A. 2013;110(47). doi: 10.1073/pnas.1315104110
- 29. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug 26;596(7873):583–9. doi: 10.1038/s41586-021-03819-2
- 30. Zhang Z, Liu F, Chen J. Molecular structure of the ATP-bound, phosphorylated human CFTR. Proc Natl Acad Sci. 2018;115(50):12757–62. doi: 10.1073/pnas.1815287115
- 31. Farinha CM, Canato S. From the endoplasmic reticulum to the plasma membrane: mechanisms of CFTR folding and trafficking. Cell Mol Life Sci. 2017;74(1):39–55. doi: 10.1007/s00018-016-2387-7
- 32. Jurkuvenaite A, Varga K, Nowotarski K, Kirk KL, Sorscher EJ, Li Y, et al. Mutations in the Amino Terminus of the Cystic Fibrosis Transmembrane Conductance Regulator Enhance Endocytosis. J Biol Chem. 2006 Feb;281(6):3329–34. doi: 10.1074/jbc.M508131200
- 33. Bihler H, Sivachenko A, Millen L, Bhatt P, Patel AT, Chin J, et al. In Vitro Modulator Responsiveness of 655 CFTR Variants Found in People With CF [Internet]. Pharmacology and Toxicology; 2023 Jul [cited 2023 Sep 22]. http://biorxiv.org/lookup/doi/10.1101/2023.07.07.548159
- 34. Clain J, Lehmann-Che J, Girodon E, Lipecka J, Edelman A, Goossens M, et al. A neutral variant involved in a complex CFTR allele contributes to a severe cystic fibrosis phenotype. Hum Genet. 2005 May;116(6):454–60. doi: 10.1007/s00439-004-1246-z
- 35. Liu F, Zhang Z, Levit A, Levring J, Touhara KK, Shoichet BK, et al. Structural identification of a hotspot on CFTR for potentiation. Science. 2019;364(6446):1184–8. doi: 10.1126/science.aaw7611
- 36. Yeh HI, Qiu L, Sohma Y, Conrath K, Zou X, Hwang TC. Identifying the molecular target sites for CFTR potentiators GLPG1837 and VX-770. J Gen Physiol. 2019 Jul 1;151(7):912–28. doi: 10.1085/jgp.201912360
- 37. Koyano S, Hirano Y, Nagamori T, Tanno S, Murono K, Fujieda K. A Rare Mutation in Cystic Fibrosis Transmembrane Conductance Regulator Gene in a Recurrent Pancreatitis Patient Without Respiratory Symptoms. Pancreas. 2010 Jul;39(5):686–7. doi: 10.1097/MPA.0b013e3181c65c2e
- 38. Hatton A, Bergougnoux A, Zybert K, Chevalier B, Mesbahi M, Altéri JP, et al. Reclassifying inconclusive diagnosis after newborn screening for cystic fibrosis. Moving forward. J Cyst Fibros. 2022 May;21(3):448–55. doi: 10.1016/j.jcf.2021.12.010
- 39. Brandes N, Goldman G, Wang CH, Ye CJ, Ntranos V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet. 2023 Sep;55(9):1512–22. doi: 10.1038/s41588-023-01465-0
- 40. Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021 Nov 4;599(7883):91–5. doi: 10.1038/s41586-021-04043-8
- 41. Molinski SV, Shahani VM, Subramanian AS, MacKinnon SS, Woollard G, Laforet M, et al. Comprehensive mapping of cystic fibrosis mutations to CFTR protein identifies mutation clusters and molecular docking predicts corrector binding site. Proteins Struct Funct Bioinforma. 2018;86(8):833–43. doi: 10.1002/prot.25496
- 42. Campagna G, Amato A, Majo F, Ferrari G, Quattrucci S, Padoan R, et al. Registro italiano Fibrosi Cistica (RIFC). Rapporto 2019–2020. Epidemiol Prev. 2022 Sep;46(4S2):1–38.
- 43. Zampoli M, Verstraete J, Frauendorf M, Kassanjee R, Workman L, Morrow BM, et al. Cystic fibrosis in South Africa: spectrum of disease and determinants of outcome. ERJ Open Res. 2021 Jul;7(3):00856–2020. doi: 10.1183/23120541.00856-2020
- 44. Vaidyanathan S, Trumbull AM, Bar L, Rao M, Yu Y, Sellers ZM. CFTR genotype analysis of Asians in international registries highlights disparities in the diagnosis and treatment of Asian patients with cystic fibrosis. Genet Med. 2022 Oct;24(10):2180–6. doi: 10.1016/j.gim.2022.06.009
- 45. Jones CT, McIntosh L, Keston M, Ferguson A, Brock DJH. Three novel mutations in the cystic fibrosis gene detected by chemical cleavage: analysis of variant splicing and a nonsense mutation. Hum Mol Genet. 1992;1(1):11–7. doi: 10.1093/hmg/1.1.11
- 46. Anglès F, Wang C, Balch WE. Spatial covariance analysis reveals the residue-by-residue thermodynamic contribution of variation to the CFTR fold. Commun Biol. 2022 Apr 13;5(1):356. doi: 10.1038/s42003-022-03302-2
- 47. Mendoza JL, Schmidt A, Li Q, Nuvaga E, Barrett T, Bridges RJ, et al. Requirements for efficient correction of ΔF508 CFTR revealed by analyses of evolved sequences. Cell. 2012;148(1–2):164–74.
- 48. Rabeh WM, Bossard F, Xu H, Okiyoneda T, Bagdany M, Mulvihill CM, et al. Correction of Both NBD1 Energetics and Domain Interface Is Required to Restore ΔF508 CFTR Folding and Function. Cell. 2012 Jan;148(1–2):150–63.
- 49. Protasevich I, Yang Z, Wang C, Atwell S, Zhao X, Emtage S, et al. Thermal unfolding studies show the disease causing F508del mutation in CFTR thermodynamically destabilizes nucleotide-binding domain 1. Protein Sci. 2010;19(10):1917–31. doi: 10.1002/pro.479
- 50. He L, Aleksandrov LA, Cui L, Jensen TJ, Nesbitt KL, Riordan JR. Restoration of domain folding and interdomain assembly by second‐site suppressors of the ΔF508 mutation in CFTR. FASEB J. 2010;24(8):3103–12.
- 51. Denning GM, Anderson MP, Amara JF, Marshall J, Smith AE, Welsh MJ. Processing of mutant cystic fibrosis transmembrane conductance regulator is temperature-sensitive. Nature. 1992 Aug;358(6389):761–4. doi: 10.1038/358761a0
- 52. McDonald EF, Sabusap CMP, Kim M, Plate L. Distinct proteostasis states drive pharmacologic chaperone susceptibility for cystic fibrosis transmembrane conductance regulator misfolding mutants. Miller E, editor. Mol Biol Cell. 2022 Jun 1;33(7):ar62.