Abstract
High-throughput deep mutational scanning (DMS) experiments have significantly impacted protein engineering, drug discovery, immunology, cancer biology, and evolutionary biology by enabling the systematic understanding of protein functions. However, the mutational space associated with proteins is astronomically large, making it overwhelming for current experimental capabilities. Therefore, alternative methods for DMS are imperative. We propose a topological deep learning (TDL) paradigm to facilitate in silico DMS. We utilize a new topological data analysis (TDA) technique based on the persistent spectral theory, also known as persistent Laplacian, to capture both topological invariants and homotopic shape evolution of data. To validate our TDL-DMS model, we use SARS-CoV-2 datasets and show excellent accuracy and reliability for binding interface mutations. This finding is significant for SARS-CoV-2 variant forecasting and designing effective antibodies and vaccines. Our proposed model is expected to have a significant impact on drug discovery, vaccine design, precision medicine, and protein engineering.
Keywords: SARS-CoV-2, infectivity, antibody-resistance, deep mutational scanning, topological deep learning
1. Introduction
Protein mutations refer to changes in the DNA sequence that result in alterations in the amino acid sequence of a protein. These changes can significantly affect the protein’s structure, function, and stability, including protein folding stability, protein binding affinity, and protein-protein interactions (PPIs). Protein mutations play a paramount role in evolutionary biology, cancer biology, immunology, directed evolution, and protein engineering.
Accurately analyzing the impact of mutations is crucial in many fields, such as identifying deleterious and benign mutations and developing novel antibody therapies for emerging virus variants. However, experimental evaluation of mutational outcomes can be time-consuming and expensive, as it requires the expression and purification of variant proteins and measurement of their activity over time [1]. Furthermore, measurements of site-directed mutagenesis for a single mutation may vary dramatically across different experimental approaches [2]. Therefore, leveraging accurate and reliable computational methods to predict the impact of mutations would have a profound effect on the throughput and accessibility of protein engineering and drug discovery.
Recent computational predictions of the impact of mutations on protein stability and PPI binding affinity have proven to be an important alternative to experimental mutagenesis analysis for systematically exploring protein structural functions, disease connections, virus infectivity, structural instability, and organism evolution directions [3-5]. Computational approaches offer a rapid, economical, and potentially accurate alternative to site-directed mutational experiments. Many computational methods have been employed for fields as diverse as protein folding energy changes and PPI binding free energy changes upon mutation.
Various computational methods have been developed to predict the impact of mutations on protein stability, with differing accuracies and computational requirements. Such methods include I-Mutant [6], FoldX [3], SDM [7], DUET [8], PoPMuSiC [9], Rosetta [10], SAAFEC [11], PPSC [12], PROVEAN [13], ELASPIC [14], STRUM [15], and EASE-MM [16]. DUET, for instance, demonstrates a high correlation in a blind test set and outperforms individual methods like SDM and mCSM [8]. FoldX has the advantage of being easier to run locally, while PROVEAN offers reasonable results with lower computational costs and without requiring a protein structure [17]. Computational approaches designed for PPI binding free energy changes upon mutation typically rely on physical force fields, electrostatics, conformational sampling, and hydrophobic packing. These computationally efficient methods include DFIRE [18], FoldX [3], Discovery Studio [19], EGAD [20], CC/PBSA [21], Rosetta [22], PoPMuSiC and BeAtMuSiC [9,23], and mCSM [24, 25]. Several studies have compared the performance of computational methods in predicting protein stability and binding affinity changes upon mutation. One such study assessed six methods, CC/PBSA, EGAD, FoldX, I-Mutant2.0, Rosetta, and Hunter, in predicting protein stability changes [26]. Another investigation evaluated the bASA, dDFIRE, DFIRE, STATIUM, Rosetta, FoldX, and Discovery Studio scoring potentials in predicting antibody binding affinity changes upon mutation, reporting Pearson correlations of 0.22, 0.19, 0.31, 0.32, 0.16, 0.34, and 0.45, respectively [27].
Computational approaches for calculating protein biophysical properties generally fall into three categories: empirical models, physical models, and data-driven machine learning techniques. Empirical models implement potential terms and empirical functions to describe the free energy perturbation, and are valid only within the range of conditions for which they were developed [28, 29]. Physical models make use of multiscale implicit solvent models and molecular mechanics approaches; however, they depend on accurate and self-sufficient predictions derived from the underlying measurements [30].
Alternatively, data-driven approaches employ machine learning (ML) and deep learning (DL) techniques to unveil the relationship between protein stability or binding and complex structures or polypeptides. A major advantage of data-driven mutation modeling is its ability to handle high-throughput and diverse mutation datasets. Importantly, the predictive performance of DL approaches relies heavily on the availability and accuracy of training sets. The computational prediction of the impact of mutations on protein stability and protein-protein interactions (PPIs) plays a crucial role in drug repositioning and drug-target interaction. These predictions are essential for identifying deleterious and benign mutations, developing novel antibody therapies for emerging virus variants, and facilitating the throughput and accessibility of protein engineering [31] and drug discovery [32-35].
Data availability benefits directly from deep mutational scanning (DMS), a high-throughput experimental technique used to study the effects of thousands of mutations on a protein’s function, such as fitness, stability, and reactivity [36]. This approach combines site-directed mutagenesis with next-generation sequencing to measure the fitness of each mutation in a population based on its enrichment (i.e., change in frequency) during selection or screening. DMS has emerged as a primary approach for protein engineering [36-38] and provides reliable analysis of mutational impacts on protein stability, binding free energy, or evolutionary directions. DMS can measure tens of thousands of variants in a single experiment, which provides datasets for machine/deep learning studies. For example, the stochastic gradient boosting model, Envision, utilizes 21,026 variant effect measurements from nine mutational scan studies to create a unified mutant effect predictor. This predictor outperforms other missense variant effect predictors on both large-scale mutagenesis data and an independent test dataset consisting of 2,312 TP53 variants [39]. Sarfati et al. combined deep mutational scanning data and machine learning to predict mutant impacts using sequence and structure features of variants, evaluated by the overall correlation between the predicted and measured enrichment results [40].
With the advances in experimental techniques and computational approaches, we are now better equipped to study the emergence and evolution of viruses. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a global pandemic since late 2019 and has evolved into many different variants that cause several waves of coronavirus disease 2019 (COVID-19) infections. SARS-CoV-2 exploits mutations to improve its evolutionary fitness. Two mechanisms of SARS-CoV-2 evolution, namely natural selection via infectivity strengthening and antibody resistance, were identified in early 2020 [4] and late 2021 [41], respectively based on molecular biophysics, topological deep learning, and genotyping of viral genomes isolated from patients. The molecular model underlying the first mechanism is that mutations on the spike protein (S protein) receptor-binding domain (RBD) enhance the virus host cell entry by strengthening the binding of RBD and host angiotensin-converting enzyme 2 (ACE2), giving rise to more infectious variants [1, 4, 42-46]. The molecular model underlying the second mechanism is that RBD mutations are able to disrupt the RBD and antibody binding, leading to serious vaccine breakthroughs in the populations of Europe and the US that had the earliest access to vaccines [41].
DMS, as a reliable option, is used to measure the impact of single-amino acid mutations on the RBD-ACE2 binding affinity [47-50] and RBD-antibody binding affinity [48, 51, 52]. A study reports the stabilization of the original SARS-CoV-2 spike protein RBD by integrating deep mutational scanning and computational design [53].
Topological deep learning (TDL), first introduced in 2017 [54], is an emerging paradigm that combines topological data analysis (TDA) and deep learning techniques to analyze complex and high-dimensional data. TDA is a branch of mathematics that focuses on understanding the shape and structure of data [55,56]. It is most successful in cases where standard approaches fare poorly, but it can also contribute significantly where standard approaches work very well, by supplying novel topological fingerprints. The basic idea behind TDL is to incorporate topological features of the data into deep learning models to improve their performance. This can be done by using topological descriptors to simplify the structural complexity of biomolecules [57-59] and embed physical interactions into topological invariants [54]. The TopNetTree model [60] was designed for predicting PPI binding free energy changes upon mutation. These studies have significant implications for computational biology and the study of complex biological systems. Recently, the TopNetmAb model has been further validated with DMS data and applied to predict RBD mutation-induced RBD-antibody binding free energy (BFE) changes [5].
Persistent homology, a key method in TDA, was employed in the early TopNetTree [60] and TopNetmAb [5] models. Recently, we developed a persistent Laplacian-based TDL model for predicting PPI binding free energy changes upon mutation [61]. Persistent Laplacians, also known as persistent spectral theory [62], are a special case in a family of persistent topological Laplacians that includes persistent path Laplacians [63], persistent sheaf Laplacians [64], and persistent hyperdigraph Laplacians [65]. Persistent topological Laplacians are designed to address the limitations of current TDA methods.
In this work, we propose a TDL-DMS predictor for mutation-induced protein-protein interaction BFE changes. We collect five DMS datasets focusing on the SARS-CoV-2 S protein RBD in RBD-ACE2 complexes and RBD-antibody complexes, including deep mutational scanning of the S protein receptor-binding domain (RBD) in the RBD-ACE2 complex [49], in another RBD-ACE2 complex [47,48], in the RBD-CTC-445.2 complex [48], and in the BA.1 and BA.2 variants [66]. We use an improved TDL model based on persistent spectral theory [62] to construct both persistent topological invariants and persistent spectra for predicting single-amino acid mutation impacts on protein-protein interactions, using the aforementioned DMS datasets as training sets. The three-dimensional (3D) structures of the appropriate RBD-ACE2 and RBD-antibody complexes are also utilized in our TDL-DMS models. Our models are validated by leave-one-dataset-out and 10-fold cross validation. Finally, we demonstrate the performance of TDL-DMS models for in silico DMS.
2. Results
Five SARS-CoV-2 RBD DMS datasets are collected as the training set of the TDL model (see Table 1 for more details about the datasets). To illustrate the performance of the proposed model, we employ dataset-level leave-one-out validations on these five SARS-CoV-2 RBD DMS datasets (10-fold cross validations are described in the Supplemental Material). Thus, the neural network model consists of six hidden layers with 15,000 neurons in each layer and generates six outputs for each dataset. For the validation process, when five out of the six DMS datasets are utilized as the training data, the remaining dataset functions as the test set. We provide a comprehensive statistical analysis for each validation, presenting the results as scatter and histogram plots organized by mutation location. Additionally, we include five schematic representations whose structural regions are defined from the relative accessible surface area (rASA) [67]. Based on their rASA, residues can be classified into structural regions (interior and surface) or interface categories (support, rim, and core), which aids in analyzing TDL-DMS predictions of the SARS-CoV-2 spike protein RBD while accounting for continuous amino acid exposure.
Table 1:
2.1. DMS of the RBD in the original RBD-ACE2 complex
To guide subsequent experiments and analyses by understanding the mutational impacts on SARS-CoV-2 infectivity and antibody resistance, we first conducted an in silico DMS of the RBD in the original RBD-ACE2 complex, using the dataset with experimental DMS results on the SARS-CoV-2 RBD by Starr et al. [47]. A yeast-surface-display platform was used to measure the expression of folded RBD protein and its binding to ACE2, and functional scores for RBD-ACE2 binding affinity were derived from per-barcode counts obtained during the experiments [47]. The dataset was released at the beginning of the pandemic and has been widely used for studying the SARS-CoV-2 RBD-ACE2 interaction and for vaccine and antibody design. Readers interested in the specifics are referred to the authors’ GitHub repository (https://github.com/jbloomlab/SARS-CoV-2-RBD_DMS). In our leave-one-out TDL prediction, the protein structure 6M0J of the RBD-ACE2 complex [68] (see Figure 1) was used. There are 3669 single mutations on the RBD, with an overall Pearson correlation of Rp=0.63 between the experimental and predicted results. It is important to note that the experimental DMS enrichment ratios were converted into binding free energies with some error, and some discrepancies were observed for the interior and surface mutations. There were a large number of very negative values (< −4.0) in the DMS data, while the predicted values range from −5 to 1. For example, 101 values were set to −4.8. For the interior and surface mutations, the correlations drop to 0.52 and 0.57, respectively. The mutations with very negative values belong to the interior and surface (see Figure S1).
Nevertheless, correlations in the support, rim, and core regions are higher than those in the other regions, with Rp of 0.73, 0.69, and 0.79, respectively. Therefore, the TDL-DMS model performs well on the binding interface of the RBD-ACE2 complex, which is the most relevant region for understanding mutational impacts on SARS-CoV-2 infectivity and antibody resistance.
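The per-region correlations reported above amount to a grouped Pearson calculation. The following Python sketch illustrates the idea only: the record layout (region, experimental value, predicted value) and all numbers are hypothetical placeholders, not data from this study.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation coefficient Rp between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_by_region(records):
    """Group (region, experimental, predicted) records and compute Rp per region."""
    groups = defaultdict(lambda: ([], []))
    for region, experimental, predicted in records:
        groups[region][0].append(experimental)
        groups[region][1].append(predicted)
    return {region: pearson(xs, ys) for region, (xs, ys) in groups.items()}

# Hypothetical placeholder records, not values from the DMS datasets.
records = [
    ("core", 0.0, 0.1), ("core", -1.0, -0.9), ("core", -2.0, -1.9),
    ("surface", 0.0, -0.5), ("surface", -1.0, -0.2), ("surface", -2.0, -1.5),
]
rp = pearson_by_region(records)
for region, value in rp.items():
    print(region, round(value, 2))
```

In this toy input the "core" predictions are a constant shift of the experimental values, so their Rp is 1 up to rounding, while the noisier "surface" group gives a lower positive value.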
Our next DMS dataset of the original SARS-CoV-2 RBD was provided by Linsky et al. [48]. The authors studied the de novo design of hACE2 decoys to neutralize SARS-CoV-2 and provided a monovalent decoy that potently neutralizes SARS-CoV-2. In the experiment, approximately 1700 single mutations were tested, while our analysis considered 1539 single mutations, limited by the 6M0J RBD protein structure, which lacks residues at both ends [68]. Here, the proteins were expressed by yeast display, and the DMS data are presented as enrichment. Calculation details can be found in the Supplementary Materials of Ref. [48]. Overall, the predicted values have a correlation of 0.72 (see Figure S2). The population distribution of the experimental DMS data has two peaks, while that of the predicted results has only one. One of the two experimental peaks lies around −2.5, which indicates mutations that moderate the RBD-ACE2 binding (see Figure S2). Interestingly, for the interior mutations, which contribute most to the second peak, the correlation is 0.53 (see Figure S2). This difference might be caused either by lower performance of the TDL-DMS model for interior mutations or by experimental bias of the dataset in certain regions. For mutations near the binding interface, i.e., support, rim, and core, predictions have relatively high correlations with the experimental data. Interface residues differ in their rASAs between the monomer and the complex, and play key roles in SARS-CoV-2 mutations. The high correlations in these regions suggest that the TDL-DMS model makes accurate predictions there. The highest correlation between the experimental data and predicted results is observed for core mutations, Rp = 0.82 (see Figure S2).
2.2. DMS of the RBD in the RBD-CTC-445.2 complex
In the same work, Linsky et al. tested RBD binding to their de novo designed protein CTC-445.2 and scanned 1539 mutations on the RBD [48]. The experimental protocol is the same as for the previous dataset. For this dataset, the overall correlation of our TDL-DMS predictions is 0.67. Among the five DMS datasets, the TDL-DMS model performs worst on the interface of this dataset, with correlations of 0.57, 0.60, and 0.39 for the support, rim, and core regions, respectively. The suboptimal performance, particularly for the core region, is potentially due to limited data quality and quantity. As in the previous dataset, the population distribution of the experimental DMS data has two peaks, only one of which was predicted by TDL-DMS.
2.3. DMS of the RBD in variant RBD-ACE2 complexes
Lastly, we examine two datasets featuring distinct mutations on the RBD. Figure 4 and Figure 5 show the correlations between predictions and DMS experimental data [66] for the RBD in the BA.1 RBD-ACE2 complex (PDB: 7T9L [69]) and the BA.2 RBD-ACE2 complex (PDB: 7XB0 [70]), respectively. A yeast-surface-display platform was deployed in the experiments [66], and the converted binding affinity of the experimental DMS data is compared with our predicted values. For details on the calculation of the converted binding affinity, see the authors’ repository (https://github.com/jbloomlab/SARS-CoV-2-RBD_DMS_Omicron). Leave-one-out cross-validation on the RBDs of the BA.1 and BA.2 variants binding to ACE2 gives the same overall Pearson correlation coefficient (Rp) of 0.84 for both datasets. For both RBDs, the Rp values vary across the support, rim, and core regions (see Figures 4 and 5), and the correlations for interface mutations are notably high as well. BA.1 and BA.2 exhibit seven unique mutations on the RBD (BA.1: S371L, G446S, G496S; BA.2: S371F, T376A, D405N, R408S). Thus, when performing leave-one-out cross-validation, our proposed TDL-DMS model learns from one variant dataset when predicting the results for the other.
To show the detailed performance, we demonstrate the experimental and predicted DMSs of the RBD in the BA.2 RBD-ACE2 complex in Figure 6. Overall, our prediction captures the general pattern very well.
3. Discussion
First, a residue can be considered buried when its rASA is less than a certain cutoff, which prompts the definition of two structural regions: the interior and the surface, as shown in Figure 7. Because amino acids’ relative exposure is continuous, the discreteness introduced by a cutoff might raise a concern. However, studies of Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens databases concluded that an rASA cutoff of roughly 25% distinguishes the surface from the interior [67]. A similar idea is applied to the interface of protein-protein complexes. Within the same work [67], three regions of binding interfaces are defined as support, rim, and core, which require the rASA in both the monomer and the complex (see Figure 7). It is noted that interface residues contribute most to the binding energy [72]. The interface classification is therefore important for analyzing TDL-DMS predictions for the RBD of the SARS-CoV-2 spike protein. In the following discussion, results are analyzed by category, i.e., interior, surface, support, rim, and core.
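As an illustration of how such a five-way classification could be coded, the Python sketch below assigns a residue to a region from its rASA in the monomer and in the complex. Only the 25% cutoff is stated in the text; the specific rules for support, rim, and core are our reading of the scheme in [67] and should be treated as assumptions, as should the function name and the use of fractions in [0, 1] rather than percentages.

```python
def classify_region(rasa_monomer, rasa_complex, cutoff=0.25):
    """Assign a residue to interior, surface, support, rim, or core.

    rasa_monomer / rasa_complex: relative accessible surface area (fractions
    in [0, 1]) of the residue in the isolated chain and in the complex.
    Assumed rules (our reading of [67]):
      no burial on binding: interior if rASA < cutoff, else surface
      burial on binding:    support if buried already in the monomer,
                            rim if still exposed in the complex,
                            core if exposed in the monomer but buried in the complex
    """
    buried_on_binding = rasa_monomer - rasa_complex > 0
    if not buried_on_binding:
        return "interior" if rasa_complex < cutoff else "surface"
    if rasa_monomer < cutoff:
        return "support"
    if rasa_complex >= cutoff:
        return "rim"
    return "core"

# Hypothetical rASA pairs (monomer, complex), chosen only to hit each branch.
examples = [(0.10, 0.10), (0.60, 0.60), (0.20, 0.05), (0.70, 0.40), (0.50, 0.10)]
labels = [classify_region(m, c) for m, c in examples]
print(labels)  # ['interior', 'surface', 'support', 'rim', 'core']
```

Residues exactly at the cutoff are treated as exposed here; that tie-breaking choice is arbitrary.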
We present a comparison of correlations between experimental and predicted DMS values for each mutation type in Figure 8, using all five datasets. There are 361 amino acid mutation types. Among them, 20 mutation types (red color) have negative correlations between experimental and predicted DMS values, while 5 mutation types (white color) have no correlation. The remaining mutation types have positive correlations: more than half of the mutation types (236) have Rp > 0.50, and 76 mutation types have Rp > 0.70.
The pattern of DMS results over different mutation types is crucial for protein design, including the design of monoclonal antibodies (mAbs). We evaluate how well our TDL-DMS model predictions resemble the distribution in experimental data by examining the behavior of our model for 20 distinct amino acid types across the five DMS datasets. Remarkably, our predicted patterns align closely with the experimental data in terms of both average DMS results and variance of DMS results (see Figure 9). The overall predictions exhibit more negative changes, indicated by a higher prevalence of deep red-colored squares. In addition to considering amino acid size, we also classify them into charged, polar, hydrophobic, and special-case groups. Regarding changes in DMS results, we observe that most mutations from charged/polar to other residues yield a positive change (e.g., mutating from K or T to others). This suggests that mutations from charged/polar residues to other residues contribute to increased stability within the SARS-CoV-2 PPI system. Although our model exhibits a similar pattern in the variance of value changes as experimental data, the variance of the model predictions is generally lower, as shown in Figure 9.
Although achieving accurate predictions with a diversity level comparable to experimental data remains a challenging task, future trends are quite clear, as shown in Figure 9. Essentially, residues K, S, and T are relatively stable. In contrast, residues R, C, I, and Y are prone to mutations. Additionally, many mutations will generate D, E, and P.
In this study, we emphasize the leave-one-dataset-out validation approach due to the unique nature of our data and the specific objectives of our research. The test data consist of multiple datasets, each representing a specific SARS-CoV-2 spike RBD associated with ACE2 and antibodies. These datasets are distinct and provide different contexts for the evaluation of the proposed model. The leave-one-dataset-out validation approach allows us to assess the generalizability of the model across these different contexts, that is, how well the model can adapt to new and unseen data. This validation approach provides a robust estimate of the model’s performance. It reduces the risk that overfitting inflates the reported performance, as the model is tested on data that it has not seen during training. This gives us confidence that our model’s good performance is not due to memorizing the training data, but rather to its ability to generalize from training.
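The splitting logic behind leave-one-dataset-out validation can be sketched in a few lines of Python. The dataset names, mutation identifiers, and values below are hypothetical placeholders; the sketch shows only how the folds are formed, not the model training itself.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, training_records, test_records), one split per dataset."""
    for held_out in datasets:
        train = [rec for name, recs in datasets.items() if name != held_out for rec in recs]
        test = list(datasets[held_out])
        yield held_out, train, test

# Hypothetical placeholder datasets: name -> list of (mutation_id, measured_value).
datasets = {
    "dataset_A": [("mut1", -0.3), ("mut2", 0.1)],
    "dataset_B": [("mut1", -0.5), ("mut3", -1.2)],
    "dataset_C": [("mut4", 0.4)],
}
splits = list(leave_one_dataset_out(datasets))
for held_out, train, test in splits:
    # Every record of the held-out dataset is excluded from training.
    print(held_out, len(train), len(test))
```

Each dataset is held out exactly once, so the number of folds equals the number of datasets, in contrast to 10-fold cross validation, where records from all datasets are mixed within every fold.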
In light of the findings and challenges encountered in this study, future work will focus on refining the data collection and preprocessing methods to reduce noise and improve data quality. The quality and quantity of experimental data used for training and testing the model significantly impact machine learning performance. It is important to expand experimental datasets, particularly for regions where the model’s performance was weak. Additionally, we aim to implement experimental validation as an additional check on the model’s predictions. This will provide a more robust evaluation of the model’s performance and help identify areas for improvement. We believe that these steps will enhance the model’s predictive accuracy and contribute to the development of more effective tools for predicting DMS. Furthermore, we will continue to explore the potential of topological deep learning in the analysis of intricate and high-dimensional data. We are particularly interested in leveraging topological descriptors to simplify the structural complexity of biomolecules and embed physical interactions into topological invariants. This development will have significant implications for computational biology and complex biological systems.
4. Methods
This section provides an overview of spectral graph theory, simplicial complex, and persistent Laplacian methods for feature generation. These mathematical concepts play a crucial role in understanding the topological and spectral properties of protein-protein interactions. Additionally, machine learning and deep learning models are discussed in the context of test datasets and validation settings, highlighting their applications in the analysis and interpretation of these features. This overview aims to equip readers with the essential knowledge required for further exploration and implementation of these techniques.
4.1. Spectral graph theory
Spectral graph theory focuses on the study of the spectra of graph Laplacians, connecting the algebraic connectivity and spectral properties of underlying graphs or networks. Mathematically, a graph $G$ is an ordered pair $G = (V, E)$, where $V = \{v_i;\ i = 1, 2, \ldots, N\}$ is the vertex set with size $N$, and $E = \{e_{ij} = (v_i, v_j);\ 1 \le i < j \le N\}$ is the edge set. Let $\deg(v_i)$ denote the degree of each vertex $v_i$, i.e., the number of edges connected to $v_i$. A specific Laplacian matrix can be given by
$$L(G)(i,j) = \begin{cases} \deg(v_i), & \text{if } i = j, \\ -1, & \text{if } v_i \text{ and } v_j \text{ are adjacent}, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$
where “adjacent” refers to a specific definition or connection rule.
We can order the eigenvalues of the graph Laplacian matrix as
$$0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N. \tag{2}$$
The kernel dimension of $L(G)$ is the multiplicity of 0 eigenvalues, indicating the number of connected components of $G$, which is a topological property of the graph. The non-zero eigenvalues of $L(G)$ contain information about the graph properties. In particular, $\lambda_2$ is called the algebraic connectivity.
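As a minimal sketch of the Laplacian definition and the kernel-dimension statement above, the following Python snippet builds the Laplacian of a small graph and checks that its nullity equals the number of connected components. The graph is a hypothetical example (a triangle, a separate edge, and an isolated vertex), and exact rational arithmetic is an implementation choice made here to avoid floating-point rank decisions.

```python
from fractions import Fraction

def graph_laplacian(n, edges):
    """Laplacian of a simple graph: degree on the diagonal, -1 for adjacent pairs."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1
        L[j][i] -= 1
        L[i][i] += 1
        L[j][j] += 1
    return L

def kernel_dimension(M):
    """Nullity of a square matrix via Gaussian elimination over the rationals."""
    A = [[Fraction(x) for x in row] for row in M]
    rank = 0
    for col in range(len(A[0]) if A else 0):
        pivot = next((r for r in range(rank, len(A)) if A[r][col] != 0), None)
        if pivot is None:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, len(A)):
            factor = A[r][col] / A[rank][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[rank])]
        rank += 1
    return len(M) - rank

# Hypothetical graph: triangle {0, 1, 2}, edge {3, 4}, isolated vertex 5.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
L = graph_laplacian(6, edges)
components = kernel_dimension(L)
print(components)  # 3
```

The three connected components of the example graph appear as the three-dimensional kernel of its Laplacian.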
4.2. Simplicial complexes
Graph Laplacians allow only pairwise interactions (edges) and exclude high-order many-body interactions. In contrast, simplicial complexes offer a high-order generalization. Simplicial complexes serve as an elegant and robust mathematical framework for capturing the high-order interactions in graphs and networks. At the heart of this framework lies the concept of a $q$-simplex, which is formed from a set of $q+1$ affinely independent points. Examples of simplices encompass various geometric elements such as vertices, edges, triangles, and tetrahedrons. A simplicial complex is an assemblage of simplices that adhere to specific conditions, and its dimension is established by the maximum dimension of its constituting simplices.
In graph theory, the degree of a vertex encapsulates the number of edges adjacent to it. However, when extending this idea to $q$-simplices, one must take into account both lower and upper adjacency, as $q$-simplices can simultaneously have $(q-1)$-simplices and $(q+1)$-simplices adjacent to them. Lower adjacency pertains to the sharing of a common $(q-1)$-face, while upper adjacency entails being faces of a common $(q+1)$-simplex.
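To delve deeper into the topological properties of simplicial complexes, it is useful to examine the boundary operator and chain complexes. The boundary operator, denoted as $\partial_q$, maps the chain group $C_q(K)$ of a simplicial complex $K$ to $C_{q-1}(K)$. For a $q$-simplex $\sigma_q = [v_0, v_1, \ldots, v_q]$,
$$\partial_q \sigma_q = \sum_{i=0}^{q} (-1)^i [v_0, \ldots, \hat{v}_i, \ldots, v_q], \tag{3}$$
where $\hat{v}_i$ is the vertex to be excluded.
Chain complexes consist of sequences of chain groups interconnected by boundary operators:
$$\cdots \xrightarrow{\partial_{q+2}} C_{q+1}(K) \xrightarrow{\partial_{q+1}} C_q(K) \xrightarrow{\partial_q} C_{q-1}(K) \xrightarrow{\partial_{q-1}} \cdots \tag{4}$$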
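The boundary operator and the chain-complex property can be made concrete with a short Python sketch. It builds boundary matrices for a small hypothetical complex (one filled triangle with a dangling edge) using the alternating-sign rule, and checks that applying two consecutive boundary operators gives zero. The function names and the tuple encoding of simplices are illustrative choices.

```python
def signed_faces(simplex):
    """Faces of a simplex (sorted vertex tuple): drop vertex i with sign (-1)^i."""
    return [((-1) ** i, simplex[:i] + simplex[i + 1:]) for i in range(len(simplex))]

def boundary_matrix(q_simplices, lower_simplices):
    """Boundary matrix: rows index (q-1)-simplices, columns index q-simplices."""
    row = {s: r for r, s in enumerate(lower_simplices)}
    B = [[0] * len(q_simplices) for _ in lower_simplices]
    for col, simplex in enumerate(q_simplices):
        for sign, face in signed_faces(simplex):
            B[row[face]][col] = sign
    return B

def matmul(A, B):
    """Plain matrix product of two non-empty list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical complex: filled triangle (0, 1, 2) plus the edge (2, 3).
vertices = [(0,), (1,), (2,), (3,)]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
triangles = [(0, 1, 2)]

B1 = boundary_matrix(edges, vertices)    # edges -> vertices
B2 = boundary_matrix(triangles, edges)   # triangles -> edges
composition = matmul(B1, B2)             # boundary of a boundary
print(composition)  # [[0], [0], [0], [0]]
```

The vanishing product is the matrix form of the statement that the boundary of a boundary is empty, which is what makes the sequence of chain groups a chain complex.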
In essence, simplicial complexes offer an effective tool for probing the topological properties of graphs and networks. By analyzing the degrees of various simplices, their adjacencies, and the intricacies of boundary operator and chain complexes, we can gain a deeper understanding of the structure and connectivity inherent in complex systems.
4.3. Combinatorial Laplacian
In 1944, Eckmann introduced simplicial complexes into graph Laplacians, giving rise to the combinatorial Laplacian, or topological Laplacian [73]. The combinatorial Laplacian was devised to enrich the topological and geometric information inherent in simplicial complexes. Foundational concepts revolve around the oriented simplicial complex and the $q$-combinatorial Laplacian. Comprehensive information on these topics can be found in the cited literature [74-77]. The subsequent discussion covers the properties of the $q$-combinatorial Laplacian matrix and its associated spectra.
The $q$-combinatorial Laplacian is predicated on oriented simplicial complexes, which harness both lower- and higher-dimensional simplices to investigate a specifically oriented simplicial complex. An oriented simplicial complex, $K$, is characterized by the orientation of all its constituent simplices. When two $q$-simplices $\sigma_q^1$ and $\sigma_q^2$ of $K$ are upper adjacent, sharing a common upper $(q+1)$-simplex $\tau_{q+1}$, they are deemed similarly oriented if both exhibit the same sign in $\partial_{q+1}(\tau_{q+1})$, and dissimilarly oriented if the signs are contrary. Moreover, if $\sigma_q^1$ and $\sigma_q^2$ are lower adjacent, sharing a common lower $(q-1)$-simplex $\eta_{q-1}$, they are similarly oriented if $\eta_{q-1}$ bears the same sign in both $\partial_q(\sigma_q^1)$ and $\partial_q(\sigma_q^2)$, and dissimilarly oriented if the signs are in opposition. In a similar vein, $q$-chains $C_q(K)$ can be defined on the oriented simplicial complex $K$, along with the $q$-boundary operator.
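The $q$-combinatorial Laplacian is a linear operator $\Delta_q: C_q(K) \to C_q(K)$ for integers $q \ge 0$,
$$\Delta_q := \partial_{q+1}\partial_{q+1}^{*} + \partial_q^{*}\partial_q, \tag{5}$$
where $\partial_q^{*}$ denotes the coboundary operator, mapping $C_{q-1}(K) \to C_q(K)$. The property $\partial_q \partial_{q+1} = 0$ is preserved, implying that $\operatorname{Im}(\partial_{q+1}) \subset \ker(\partial_q)$. The matrix representation of the $q$-combinatorial Laplacian operator, denoted by $\mathcal{L}_q$, is given by
$$\mathcal{L}_q = \mathcal{B}_{q+1}\mathcal{B}_{q+1}^{T} + \mathcal{B}_q^{T}\mathcal{B}_q, \tag{6}$$
where $\mathcal{B}_q$ and $\mathcal{B}_q^{T}$ represent the matrix representations of the $q$-boundary operator and $q$-coboundary operator, respectively, in relation to the standard bases for $C_q(K)$ and $C_{q-1}(K)$ with specific orderings. Consequently, the number of rows in $\mathcal{B}_q$ corresponds to the quantity of $(q-1)$-simplices, while the number of columns reflects the quantity of $q$-simplices in $K$. Furthermore, the upper and lower $q$-combinatorial Laplacian matrices are denoted by $\mathcal{L}_q^{U} = \mathcal{B}_{q+1}\mathcal{B}_{q+1}^{T}$ and $\mathcal{L}_q^{L} = \mathcal{B}_q^{T}\mathcal{B}_q$, respectively. It is important to note that $\partial_0$ is the zero map, resulting in $\mathcal{B}_0$ being a zero matrix. Hence, $\mathcal{L}_0(K) = \mathcal{B}_1\mathcal{B}_1^{T}$, with $K$ representing the (oriented) simplicial complex of dimension 1, which is essentially a simple graph. In particular, the 0-combinatorial Laplacian matrix $\mathcal{L}_0(K)$ is actually the Laplacian matrix as defined in spectral graph theory.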
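The matrix form in Eq. (6), the sum of an upper term built from the (q+1)-boundary matrix and a lower term built from the q-boundary matrix, can be checked numerically on a small example. The Python sketch below is illustrative only: the complex (a triangle with a dangling edge, hollow or filled) is hypothetical, and the dictionary encoding and function names are our own choices. It verifies that the 0-combinatorial Laplacian coincides with the graph Laplacian and that the kernel dimensions give the Betti numbers.

```python
from fractions import Fraction

def boundary_matrix(q_simplices, faces):
    """q-boundary matrix: rows index (q-1)-simplices, columns index q-simplices."""
    row = {f: r for r, f in enumerate(faces)}
    B = [[0] * len(q_simplices) for _ in faces]
    for col, s in enumerate(q_simplices):
        for i in range(len(s)):
            B[row[s[:i] + s[i + 1:]]][col] = (-1) ** i  # drop vertex i, sign (-1)^i
    return B

def combinatorial_laplacian(K, q):
    """Upper term B_{q+1} B_{q+1}^T plus lower term B_q^T B_q; K is {dim: [simplices]}."""
    cur, up, low = K.get(q, []), K.get(q + 1, []), K.get(q - 1, [])
    B_up = boundary_matrix(up, cur)
    B_q = boundary_matrix(cur, low) if q > 0 else []  # the 0-boundary is the zero map
    n = len(cur)
    return [[sum(B_up[i][k] * B_up[j][k] for k in range(len(up)))
             + sum(B_q[k][i] * B_q[k][j] for k in range(len(low)))
             for j in range(n)] for i in range(n)]

def nullity(M):
    """Kernel dimension of a square matrix by exact Gaussian elimination."""
    A = [[Fraction(x) for x in row] for row in M]
    rank = 0
    for col in range(len(A[0]) if A else 0):
        pivot = next((r for r in range(rank, len(A)) if A[r][col] != 0), None)
        if pivot is None:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, len(A)):
            factor = A[r][col] / A[rank][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[rank])]
        rank += 1
    return len(M) - rank

# Hypothetical complexes: a triangle with a dangling edge, hollow and then filled.
hollow = {0: [(0,), (1,), (2,), (3,)], 1: [(0, 1), (0, 2), (1, 2), (2, 3)]}
filled = dict(hollow)
filled[2] = [(0, 1, 2)]

L0 = combinatorial_laplacian(hollow, 0)
betti_hollow = [nullity(combinatorial_laplacian(hollow, q)) for q in (0, 1)]
betti_filled = [nullity(combinatorial_laplacian(filled, q)) for q in (0, 1)]
print(L0)
print(betti_hollow, betti_filled)  # [1, 1] [1, 0]
```

Filling the triangle adds an upper term to the 1-Laplacian that removes the one-dimensional kernel associated with the loop, while the 0-Laplacian is unchanged.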
4.4. Persistent Laplacians
The persistent Laplacian, also known as persistent spectral graphs or the persistent combinatorial Laplacian [62], has emerged as a popular tool in topological data analysis. It was proposed to overcome a limitation of persistent homology, namely its inability to capture the homotopic shape evolution of data. Persistent homology is based on a filtration process that converts a data set into a sequence of nested simplicial complexes with increasing levels of complexity. At each level, the Betti numbers are calculated, and the changes in the Betti numbers are tracked as the resolution of the data set increases. These changes in the Betti numbers, called topological persistence, provide a measure of the robustness of the topological features of the data set.
In order to study the persistence of the spectral properties of graphs or simplicial complexes, one can use the notion of persistent Laplacians, which are a family of Laplacian matrices that encode the topological and geometric information of the simplicial complexes at different resolutions. The main idea is to construct a sequence of nested simplicial complexes by successively adding simplices to the complex, and associate a Laplacian matrix with each complex. By comparing the spectra of the Laplacian matrices at different resolutions, one can study the persistent spectral properties of simplicial complexes.
There are different ways to construct persistent Laplacians, depending on the filtration process and the type of Laplacian used [62]. One common approach is to use the combinatorial Laplacian matrix of the simplicial complexes defined in the previous section. Given a sequence of nested simplicial complexes with increasing dimension, one can also define a sequence of combinatorial Laplacian matrices by setting . Then, one can study the persistent spectral properties of the sequence of Laplacian matrices, such as the persistent eigenvalues and eigenvectors.
The harmonic spectra of persistent Laplacians at various scales are the same as the persistent Betti numbers, while the non-harmonic spectra can capture both topological changes and homotopic shape evolution of the data, see Figure 11. Note that, in the figure, each of the five charts on the top panel is represented by a segment of the non-harmonic spectra, i.e., the first non-zero eigenvalues in red. In contrast, persistent homology (blue bars) does not capture homotopic shape evolution (i.e., the states in the third chart and the fifth chart). As a result, persistent Laplacians offer an enriched representation of data.
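The difference between harmonic and non-harmonic spectra can be illustrated with a toy filtration. The sketch below is a simplification under stated assumptions: it computes only the 0-Laplacian of each Vietoris-Rips graph along the filtration, not the full persistent Laplacian between two filtration values described in [62], and the point cloud and radii are made up for illustration.

```python
import numpy as np

def graph_laplacian(points, radius):
    """0-Laplacian of the Rips graph: connect points at distance <= radius."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = ((dist <= radius) & (dist > 0)).astype(float)
    return np.diag(adj.sum(axis=1)) - adj

def spectrum_summary(points, radius, tol=1e-9):
    """Return (number of zero eigenvalues, smallest non-zero eigenvalue)."""
    eig = np.linalg.eigvalsh(graph_laplacian(points, radius))
    harmonic = int(np.sum(eig < tol))          # equals Betti-0 of the graph
    nonzero = eig[eig >= tol]
    lambda_min = float(nonzero.min()) if nonzero.size else 0.0
    return harmonic, lambda_min

# Four corners of a unit square (hypothetical data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

for radius in (0.5, 1.0, 1.5):
    print(radius, spectrum_summary(points, radius))
```

Between radius 1.0 (a 4-cycle) and radius 1.5 (all pairs connected) the number of zero eigenvalues stays at one, so Betti-0 sees no change, while the smallest non-zero eigenvalue moves from 2 to 4. This is the kind of shape change without topological change that the non-harmonic spectra are meant to register.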
4.5. Protein-protein interactions
PPIs are analyzed by topological and shape analysis. We first partition the atoms in a protein-protein complex into several subsets:
$\mathcal{A}_{\rm m}$: atoms at the mutation sites.
$\mathcal{A}_{\rm mn}(r)$: atoms in the vicinity of the mutation site, within a cut-off distance $r$.
$\mathcal{A}_{\rm A}(r)$: protein A atoms within $r$ of the binding site.
$\mathcal{A}_{\rm B}(r)$: protein B atoms within $r$ of the binding site.
$\mathcal{A}_{\rm ele}({\rm E})$: atoms of element type E within the system.
We design the distance matrix to exclude interactions between atoms from the same set. For interactions between atoms $a_i$ and $a_j$ in sets $\mathcal{A}$ and/or $\mathcal{B}$, we define the modified distance as follows:
$$D_{\rm mod}(a_i, a_j) = \begin{cases} \infty, & \text{if } a_i, a_j \in \mathcal{A} \text{ or } a_i, a_j \in \mathcal{B}, \\ D_e(a_i, a_j), & \text{if } a_i \in \mathcal{A} \text{ and } a_j \in \mathcal{B}, \end{cases} \tag{7}$$
where $D_e(a_i, a_j)$ represents the Euclidean distance between $a_i$ and $a_j$. Atoms are treated as points, denoted by $v_0, v_1, \ldots, v_q$, with $q+1$ affinely independent points spanning a $q$-simplex in a simplicial complex. Persistent spectral graphs are designed to capture multiscale topological and geometrical information along a filtration [62], yielding essential feature vectors for machine learning methods. Features generated by binned barcode vectorization can represent the strength of atomic bonds and van der Waals interactions, and are readily incorporated into machine learning models that discern and characterize local patterns.
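The rule that excludes same-set interactions can be written as a small matrix operation. This is a minimal NumPy sketch, assuming two atom sets given as coordinate arrays; the function name `modified_distance` and the choice to keep the diagonal at zero (a common requirement of Rips filtration software) are assumptions, not details taken from the paper.

```python
import numpy as np

def modified_distance(coords_a, coords_b):
    """Distance matrix with infinite distance between atoms of the same set.

    coords_a, coords_b: arrays of shape (n_a, 3) and (n_b, 3).
    Returns an (n_a + n_b) x (n_a + n_b) symmetric matrix.
    """
    coords = np.vstack([coords_a, coords_b])
    labels = np.array([0] * len(coords_a) + [1] * len(coords_b))
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Atoms in the same set never form an edge.
    same_set = labels[:, None] == labels[None, :]
    dist[same_set] = np.inf
    # Assumption: an atom is at distance zero from itself.
    np.fill_diagonal(dist, 0.0)
    return dist

# Hypothetical coordinates: two atoms in the first set, one in the second.
set_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
set_b = np.array([[0.0, 0.0, 3.0]])
D = modified_distance(set_a, set_b)
print(D)
```

A matrix of this form can be passed to any Vietoris-Rips routine that accepts precomputed distances, so that only edges running between the two sets enter the filtration.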
Using atom subsets, such as the mutation-site atoms $\mathcal{A}_{\rm m}$ and their neighbors $\mathcal{A}_{\rm mn}(r)$, we create simplicial complexes by considering only the edges from $\mathcal{A}_{\rm m}$ to $\mathcal{A}_{\rm mn}(r)$ for Vietoris-Rips complexes. Barcodes generated from persistent homology are then enumerated by bar lengths within specific intervals, with numbers 0 or 1, as part of the Vietoris-Rips complex filtration. Concurrently, for each complex in the filtration, we compute eigenvalues using graph Laplacian analysis. We gather statistics of eigenvalues, such as sum, maximum, minimum, mean, and standard deviation, to obtain normalized features for machine learning methods. An alternative vectorization approach involves extracting statistics of bar lengths, birth values, and death values, including sum, maximum, minimum, mean, and standard deviation. This technique is applied to vectorize Betti-1 (H1) and Betti-2 (H2) barcodes obtained from alpha complex filtration, based on the observation that higher-dimensional barcodes are sparser than H0 barcodes.
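The two vectorization schemes described above can be sketched as follows. This is an illustrative NumPy version under assumptions: barcodes are given as arrays of (birth, death) pairs, the bin edges are arbitrary, and the function names are hypothetical. The normalization step and the exact binning used in the paper are not reproduced here.

```python
import numpy as np

def binned_bar_counts(bars, bin_edges):
    """Count bars whose length (death - birth) falls in each interval."""
    bars = np.asarray(bars, dtype=float).reshape(-1, 2)
    lengths = bars[:, 1] - bars[:, 0]
    counts, _ = np.histogram(lengths, bins=bin_edges)
    return counts.astype(float)

def summary_stats(values):
    """Sum, maximum, minimum, mean, and standard deviation (zeros if empty)."""
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return np.zeros(5)
    return np.array([values.sum(), values.max(), values.min(),
                     values.mean(), values.std()])

def barcode_stats(bars):
    """Statistics of bar lengths, birth values, and death values (15 numbers)."""
    bars = np.asarray(bars, dtype=float).reshape(-1, 2)
    return np.concatenate([
        summary_stats(bars[:, 1] - bars[:, 0]),  # lengths
        summary_stats(bars[:, 0]),               # births
        summary_stats(bars[:, 1]),               # deaths
    ])

# Hypothetical H0 barcode and Laplacian eigenvalues at one filtration value.
h0_bars = [[0.0, 1.5], [0.0, 2.5], [0.0, 3.0]]
eigenvalues = [0.0, 2.0, 2.0, 4.0]

features = np.concatenate([
    binned_bar_counts(h0_bars, bin_edges=[0.0, 2.0, 4.0]),
    summary_stats(eigenvalues),
    barcode_stats(h0_bars),
])
print(features.shape)
```

Returning zeros for an empty barcode keeps the feature length fixed, which matters for the sparser H1 and H2 barcodes.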
In summary, this methodology integrates topological representations and persistent Laplacian spectra to analyze protein-protein interactions. By categorizing atoms in a protein-protein complex into subsets, we can construct simplicial complexes and generate feature vectors for machine learning algorithms. This approach effectively captures the essential topological and geometrical information of the underlying molecular structures, facilitating the study of protein-protein interactions and their biological implications.
4.6. Machine learning
The features generated from the persistent spectral graph are tested using a deep neural network (Net). Validations are performed on the datasets discussed in the results section. Accurately predicting mutation-induced binding affinity changes in protein-protein complexes is a significant challenge. After generating effective features, machine learning or deep learning models are required for validation and real-world applications. A deep neural network is a network of neurons that maps an input feature layer to an output layer. The neural network loosely mimics the human brain, solving problems with numerous neuron units, and employs backpropagation to update the weights of each layer. To capture input features at different levels and abstract more properties, one can add more layers and more neurons in each layer, creating a deep neural network. Optimization methods for feedforward neural networks and dropout are applied to prevent overfitting. The number of layers and the number of neurons in each layer are determined by grid searches based on 10-fold cross-validation. Then, the hyperparameters of stochastic gradient descent (SGD) with momentum are set based on the network structure. The network has 7 layers with 10,000 neurons in each layer. For SGD with momentum, the hyperparameters are momentum = 0.9 and weight_decay = 0. The learning rate is 0.002 and the number of epochs is 400. The Net is implemented using PyTorch [78].
Figure 11 provides the workflow of the proposed TDL-DMS methodology. The input is a protein-protein complex, and the output is the predicted DMS (the heatmap on the left). The protein-protein complex is partitioned into subsets, and simplicial complexes are constructed using the Vietoris-Rips complex and filtration. Barcodes are generated from persistent homology, and eigenvalues are computed from persistent graph Laplacians. The barcodes and eigenvalues are used to generate feature vectors. The feature vectors are then used as the inputs for the deep learning network to predict the binding affinity changes of mutations in protein-protein complexes. The model is trained with the experimental DMS data as the ground truth (the heatmap on the right).
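The stated hyperparameters translate into a short PyTorch configuration. The sketch below is not the released implementation: it assumes that "7 layers" means seven hidden fully connected layers, and the ReLU activation, dropout rate, mean-squared-error loss, and full-batch updates are placeholders, since the text does not specify them. The demonstration at the bottom uses a much narrower network and random data so that it runs quickly.

```python
import torch
from torch import nn

def build_net(n_features, hidden_width=10_000, n_hidden_layers=7, dropout=0.1):
    """Fully connected regressor; activation and dropout rate are assumptions."""
    layers = []
    width_in = n_features
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(width_in, hidden_width), nn.ReLU(), nn.Dropout(dropout)]
        width_in = hidden_width
    layers.append(nn.Linear(width_in, 1))  # one predicted value per mutation
    return nn.Sequential(*layers)

def train(model, features, targets, epochs=400, lr=0.002):
    """SGD with momentum 0.9 and weight_decay 0, as stated in the text."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=0)
    loss_fn = nn.MSELoss()  # assumption: the loss is not named in the text
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(features).squeeze(-1), targets)
        loss.backward()
        optimizer.step()
    return loss.item()

# Tiny demonstration on random data (hypothetical sizes, not the paper's).
torch.manual_seed(0)
x = torch.randn(32, 20)
y = torch.randn(32)
small = build_net(n_features=20, hidden_width=16)
final_loss = train(small, x, y, epochs=5)
print(final_loss)
```

Calling `build_net(n_features)` with the defaults gives the width stated in the text; at 10,000 neurons per layer each hidden weight matrix has 10^8 entries, so memory use is dominated by the hidden layers rather than by the input features.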
5. Conclusion
Deep mutational scanning (DMS) is a high-throughput experimental technique that enables the systematic analysis of the impact of mutations on protein function, providing insights into the structure-function relationships and evolutionary trends and constraints of proteins. DMS has been successfully applied to a wide range of biological systems, including enzymes, receptors, transcription factors, and viruses. It can be used to design proteins with improved properties, identify drug targets and inhibitors, and understand the mechanisms of protein evolution and adaptation. However, the mutational space of a typical protein is astronomically large and intractable for experimental means.
Computational approaches to DMS offer viable alternatives, although in silico DMS has hardly been reported. One challenge is the lack of accurate and reliable biophysical models for dealing with complex protein functions and protein-protein interactions (PPIs). Another challenge is the lack of high-quality DMS data for data-driven machine learning predictions.
Currently, it is well-understood that the SARS-CoV-2 spike protein plays the most important role in viral transmission, and its receptor-binding domain (RBD) binds to human ACE2 to facilitate viral entry into host cells. Emerging SARS-CoV-2 variants are spreading worldwide with increased transmissibility due to the natural selection of RBD mutations with higher infectivity [4] and/or stronger antibody resistance [41]. As a result, researchers have conducted various DMS studies on the original spike RBD and variant RBD in recent years [47, 48, 66]. This development enables the artificial intelligence (AI)-based prediction of DMS of future SARS-CoV-2 variants.
Topological deep learning (TDL) has led to the discovery of two SARS-CoV-2 evolutionary mechanisms [4, 41] and accurate forecasting of future dominant SARS-CoV-2 variants Omicron [79], Omicron BA.2 [80], and Omicron BA.5 [61]. Recently, a new generation of topological data analysis (TDA) techniques was proposed [62] and implemented for SARS-CoV-2 variant prediction [61]. Built on these experimental, mathematical, and computational advances, we develop our TDL-DMS model for SARS-CoV-2 RBDs.
We performed leave-one-dataset-out validation on the proposed TDL-DMS on five datasets involving various SARS-CoV-2 spike RBDs associated with ACE2 and antibodies. We found that our TDL-DMS model works well in general and offers excellent DMS predictions for RBD binding interface mutations, which are particularly important in forecasting future dominant SARS-CoV-2 variants.
We expect the proposed TDL-DMS framework to have potential applications in protein engineering, drug discovery, and directed evolution.
Supplementary Material
Present a novel approach to analyzing protein DMS data using topological deep learning methods.
Validated with DMS data using leave-one-dataset-out.
Predict mutation-induced protein-protein interaction.
Highlight the potential of topological deep learning in handling intricate and high-dimensional data.
Acknowledgments
This work was supported in part by NIH grants R01GM126189, R01AI164266, and R01AI146210, NSF grants DMS-2052983, DMS-1761320, and IIS-1900473, NASA grant 80NSSC21M0023, MSU Foundation, Bristol-Myers Squibb 65109, and Pfizer. JC thanks Dr. Daniel-Adriano Silva for the assistance in converting the experimental enrichment ratios and BFE changes.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data Availability
All datasets are available at https://weilab.math.msu.edu/DataLibrary/3D/DMS.tar.gz.
Code Availability
The source codes are available at https://github.com/WeilabMSU/TopNetDMS.
References
- [1] Hoffmann Markus, Kleine-Weber Hannah, Schroeder Simon, Krüger Nadine, Herrler Tanja, Erichsen Sandra, Schiergens Tobias S, Herrler Georg, Wu Nai-Huei, Nitsche Andreas, et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell, 181(2):271–280, 2020.
- [2] Guo Ruiqiong, Gaffney Kristen, Yang Zhongyu, Kim Miyeon, Sungsuwan Suttipun, Huang Xuefei, Hubbell Wayne L, and Hong Heedeok. Steric trapping reveals a cooperativity network in the intramembrane protease GlpG. Nature chemical biology, 12(5):353–360, 2016.
- [3] Guerois Raphael, Nielsen Jens Erik, and Serrano Luis. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. Journal of molecular biology, 320(2):369–387, 2002.
- [4] Chen Jiahui, Wang Rui, Wang Menglun, and Wei Guo-Wei. Mutations strengthened SARS-CoV-2 infectivity. J. Mol. Biol., 432(19):5212–5226, 2020.
- [5] Chen Jiahui, Gao Kaifu, Wang Rui, and Wei Guo-Wei. Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies. Chemical Science, 12(20):6929–6948, 2021.
- [6] Capriotti Emidio, Fariselli Piero, and Casadio Rita. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic acids research, 33(suppl_2):W306–W310, 2005.
- [7] Worth Catherine L, Preissner Robert, and Blundell Tom L. SDM—a server for predicting effects of mutations on protein stability and malfunction. Nucleic acids research, 39(suppl_2):W215–W222, 2011.
- [8] Pires Douglas EV, Ascher David B, and Blundell Tom L. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic acids research, 42(W1):W314–W319, 2014.
- [9] Dehouck Yves, Grosfils Aline, Folch Benjamin, Gilis Dimitri, Bogaerts Philippe, and Rooman Marianne. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics, 25(19):2537–2543, 2009.
- [10] Kellogg Elizabeth H, Leaver-Fay Andrew, and Baker David. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins: Structure, Function, and Bioinformatics, 79(3):830–838, 2011.
- [11] Getov Ivan, Petukh Marharyta, and Alexov Emil. SAAFEC: predicting the effect of single point mutations on protein folding free energy using a knowledge-modified MM/PBSA approach. International journal of molecular sciences, 17(4):512, 2016.
- [12] Yang Yang, Chen Biao, Tan Ge, Vihinen Mauno, and Shen Bairong. Structure-based prediction of the effects of a missense variant on protein stability. Amino Acids, 44(3):847–855, 2013.
- [13] Choi Yongwook, Sims Gregory E, Murphy Sean, Miller Jason R, and Chan Agnes P. Predicting the functional effect of amino acid substitutions and indels. 2012.
- [14] Berliner Niklas, Teyra Joan, Colak Recep, Lopez Sebastian Garcia, and Kim Philip M. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PloS one, 9(9):e107353, 2014.
- [15] Quan Lijun, Lv Qiang, and Zhang Yang. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics, 32(19):2936–2946, 2016.
- [16] Folkman Lukas, Stantic Bela, Sattar Abdul, and Zhou Yaoqi. EASE-MM: sequence-based prediction of mutation-induced stability changes with feature-based multiple models. Journal of Molecular Biology, 428(6):1394–1405, 2016.
- [17] Strokach Alexey, Corbi-Verge Carles, and Kim Philip M. Predicting changes in protein stability caused by mutation using sequence-and structure-based methods in a CAGI5 blind challenge. Human mutation, 40(9):1414–1423, 2019.
- [18] Zhang Chi, Liu Song, and Zhou Yaoqi. Accurate and efficient loop selections by the DFIRE-based all-atom statistical potential. Protein science, 13(2):391–399, 2004.
- [19] Dassault Systèmes Biovia et al. Discovery studio modeling environment, 2017.
- [20] Pokala Navin and Handel Tracy M. Energy functions for protein design: adjustment with protein–protein complex affinities, models for the unfolded state, and negative design of solubility and specificity. Journal of molecular biology, 347(1):203–227, 2005.
- [21] Benedix Alexander, Becker Caroline M, de Groot Bert L, Caflisch Amedeo, and Böckmann Rainer A. Predicting free energy changes using structural ensembles. Nature methods, 6(1):3–4, 2009.
- [22] Barlow Kyle A, Conchuir Shane O, Thompson Samuel, Suresh Pooja, Lucas James E, Heinonen Markus, and Kortemme Tanja. Flex ddG: Rosetta ensemble-based estimation of changes in protein–protein binding affinity upon mutation. The Journal of Physical Chemistry B, 122(21):5389–5399, 2018.
- [23] Dehouck Yves, Kwasigroch Jean Marc, Rooman Marianne, and Gilis Dimitri. BeAtMuSiC: prediction of changes in protein–protein binding affinity on mutations. Nucleic acids research, 41(W1):W333–W339, 2013.
- [24] Pires Douglas EV and Ascher David B. mCSM-AB: a web server for predicting antibody–antigen affinity changes upon mutation with graph-based signatures. Nucleic acids research, 44(W1):W469–W473, 2016.
- [25] Rodrigues Carlos HM, Myung Yoochan, Pires Douglas EV, and Ascher David B. mCSM-PPI2: predicting the effects of mutations on protein–protein interactions. Nucleic acids research, 47(W1):W338–W344, 2019.
- [26] Potapov Vladimir, Cohen Mati, and Schreiber Gideon. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein engineering, design & selection, 22(9):553–560, 2009.
- [27] Sirin Sarah, Apgar James R, Bennett Eric M, and Keating Amy E. AB-Bind: antibody binding mutational database for computational affinity predictions. Protein Science, 25(2):393–409, 2016.
- [28] Steinbrecher Thomas and Labahn Andreas. Towards accurate free energy calculations in ligand protein-binding studies. Current medicinal chemistry, 17(8):767–785, 2010.
- [29] King Gregory and Warshel Arieh. Investigation of the free energy functions for electron transfer reactions. The Journal of Chemical Physics, 93(12):8682–8692, 1990.
- [30] Del Rio-Chanona Ehecatl Antonio, Ahmed Nur Rashid, Wagner Jonathan, Lu Yinghua, Zhang Dongda, and Jing Keju. Comparison of physics-based and data-driven modelling techniques for dynamic optimisation of fed-batch bioprocesses. Biotechnology and bioengineering, 116(11):2971–2982, 2019.
- [31] Qiu Yuchi and Wei Guo-Wei. Persistent spectral theory-guided protein engineering. Nature Computational Science, 3(2):149–163, 2023.
- [32] Zhao Bo-Wei, Wang Lei, Hu Peng-Wei, Wong Leon, Su Xiao-Rui, Wang Bao-Quan, You Zhu-Hong, and Hu Lun. Fusing higher and lower-order biological information for drug repositioning via graph representation learning. IEEE Transactions on Emerging Topics in Computing, 2023.
- [33] Su Xiaorui, Hu Pengwei, Yi Haicheng, You Zhuhong, and Hu Lun. Predicting drug-target interactions over heterogeneous information network. IEEE Journal of Biomedical and Health Informatics, 27(1):562–572, 2022.
- [34] Wu Hao, Chen Zhongli, Wu Yingfu, Zhang Hongming, and Liu Quanzhong. Integrating protein-protein interaction networks and somatic mutation data to detect driver modules in pan-cancer. Interdisciplinary Sciences: Computational Life Sciences, pages 1–17, 2021.
- [35] Chen Jinxiang, Wang Miao, Zhao Defeng, Li Fuyi, Wu Hao, Liu Quanzhong, and Li Shuqin. MSINGB: A novel computational method based on NGBoost for identifying microsatellite instability status from tumor mutation annotation data. Interdisciplinary Sciences: Computational Life Sciences, 15(1):100–110, 2023.
- [36] Fowler Douglas M and Fields Stanley. Deep mutational scanning: a new style of protein science. Nature methods, 11(8):801–807, 2014.
- [37] Araya Carlos L and Fowler Douglas M. Deep mutational scanning: assessing protein function on a massive scale. Trends in biotechnology, 29(9):435–442, 2011.
- [38] Gasperini Molly, Starita Lea, and Shendure Jay. The power of multiplexed functional analysis of genetic variants. Nature protocols, 11(10):1782–1787, 2016.
- [39] Gray Vanessa E, Hause Ronald J, Luebeck Jens, Shendure Jay, and Fowler Douglas M. Quantitative missense variant effect prediction using large-scale mutagenesis data. Cell systems, 6(1):116–124, 2018.
- [40] Sarfati Hagit, Naftaly Si, Papo Niv, and Keasar Chen. Predicting mutant outcome by combining deep mutational scanning and machine learning. Proteins: Structure, Function, and Bioinformatics, 90(1):45–57, 2022.
- [41] Wang Rui, Chen Jiahui, and Wei Guo-Wei. Mechanisms of SARS-CoV-2 evolution revealing vaccine-resistant mutations in Europe and America. The Journal of Physical Chemistry Letters, 12:11850–11857, 2021.
- [42] Tao Kaiming, Tzou Philip L, Nouhin Janin, Gupta Ravindra K, de Oliveira Tulio, Pond Sergei L Kosakovsky, Fera Daniela, and Shafer Robert W. The biological and clinical significance of emerging SARS-CoV-2 variants. Nature Reviews Genetics, 22(12):757–773, 2021.
- [43] Li Wendong, Shi Zhengli, Yu Meng, Ren Wuze, Smith Craig, Epstein Jonathan H, Wang Hanzhong, Crameri Gary, Hu Zhihong, Zhang Huajun, et al. Bats are natural reservoirs of SARS-like coronaviruses. Science, 310(5748):676–679, 2005.
- [44] Qu Xiu-Xia, Hao Pei, Song Xi-Jun, Jiang Si-Ming, Liu Yan-Xia, Wang Pei-Gang, Rao Xi, Song Huai-Dong, Wang Sheng-Yue, Zuo Yu, et al. Identification of two critical amino acid residues of the severe acute respiratory syndrome coronavirus spike protein for its variation in zoonotic tropism transition via a double substitution strategy. Journal of Biological Chemistry, 280(33):29588–29595, 2005.
- [45] Song Huai-Dong, Tu Chang-Chun, Zhang Guo-Wei, Wang Sheng-Yue, Zheng Kui, Lei Lian-Cheng, Chen Qiu-Xia, Gao Yu-Wei, Zhou Hui-Qiong, Xiang Hua, et al. Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human. Proceedings of the National Academy of Sciences, 102(7):2430–2435, 2005.
- [46] Walls Alexandra C, Park Young-Jun, Tortorici M Alejandra, Wall Abigail, McGuire Andrew T, and Veesler David. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell, 2020.
- [47] Starr Tyler N, Greaney Allison J, Hilton Sarah K, Ellis Daniel, Crawford Katharine HD, Dingens Adam S, Navarro Mary Jane, Bowen John E, Tortorici M Alejandra, Walls Alexandra C, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell, 182(5):1295–1310, 2020.
- [48] Linsky Thomas W, Vergara Renan, Codina Nuria, Nelson Jorgen W, Walker Matthew J, Su Wen, Barnes Christopher O, Hsiang Tien-Ying, Esser-Nobis Katharina, Yu Kevin, et al. De novo design of potent and resilient hACE2 decoys to neutralize SARS-CoV-2. Science, 370(6521):1208–1214, 2020.
- [49] Procko Erik. The sequence of human ACE2 is suboptimal for binding the S spike protein of SARS coronavirus 2. bioRxiv, 2020.
- [50] Starr Tyler N, Greaney Allison J, Hannon William W, Loes Andrea N, Hauser Kevin, Dillen Josh R, Ferri Elena, Farrell Ariana Ghez, Dadonaite Bernadeta, McCallum Matthew, et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. bioRxiv, 2022.
- [51] Cao Longxing, Goreshnik Inna, Coventry Brian, Case James Brett, Miller Lauren, Kozodoy Lisa, Chen Rita E, Carter Lauren, Walls Alexandra C, Park Young-Jun, et al. De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science, 370(6515):426–431, 2020.
- [52] Greaney Allison J, Starr Tyler N, Gilchuk Pavlo, Zost Seth J, Binshtein Elad, Loes Andrea N, Hilton Sarah K, Huddleston John, Eguia Rachel, Crawford Katharine HD, et al. Complete mapping of mutations to the SARS-CoV-2 spike receptor-binding domain that escape antibody recognition. Cell host & microbe, 29(1):44–57, 2021.
- [53] Leonard Alison C, Weinstein Jonathan J, Steiner Paul J, Erbse Annette H, Fleishman Sarel J, and Whitehead Timothy A. Stabilization of the SARS-CoV-2 receptor binding domain by protein core redesign and deep mutational scanning. Protein Engineering, Design and Selection, 35, 2022.
- [54] Cang Zixuan and Wei Guo-Wei. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS computational biology, 13(7):e1005690, 2017.
- [55] Edelsbrunner Herbert, Harer John, et al. Persistent homology-a survey. Contemporary mathematics, 453(26):257–282, 2008.
- [56] Zomorodian Afra and Carlsson Gunnar. Computing persistent homology. In Proceedings of the twentieth annual symposium on Computational geometry, pages 347–356, 2004.
- [57] Townsend Jacob, Micucci Cassie Putman, Hymel John H, Maroulas Vasileios, and Vogiatzis Konstantinos D. Representation of molecular structures with persistent homology for machine learning applications in chemistry. Nature communications, 11(1):3230, 2020.
- [58] Meng Zhenyu and Xia Kelin. Persistent spectral–based machine learning (PerSpect ML) for protein-ligand binding affinity prediction. Science advances, 7(19):eabc5329, 2021.
- [59] Gameiro Marcio, Hiraoka Yasuaki, Izumi Shunsuke, Kramar Miroslav, Mischaikow Konstantin, and Nanda Vidit. A topological measurement of protein compressibility. Japan Journal of Industrial and Applied Mathematics, 32:1–17, 2015.
- [60] Wang Menglun, Cang Zixuan, and Wei Guo-Wei. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nature Machine Intelligence, 2(2):116–123, 2020.
- [61] Chen Jiahui, Qiu Yuchi, Wang Rui, and Wei Guo-Wei. Persistent Laplacian projected Omicron BA.4 and BA.5 to become new dominating variants. Computers in Biology and Medicine, 151:106262, 2022.
- [62] Wang Rui, Nguyen Duc Duy, and Wei Guo-Wei. Persistent spectral graph. International journal for numerical methods in biomedical engineering, 36(9):e3376, 2020.
- [63] Wang Rui and Wei Guo-Wei. Persistent path Laplacian. Foundations of Data Science, 5:26–55, 2023.
- [64] Wei Xiaoqi and Wei Guo-Wei. Persistent sheaf Laplacians. arXiv preprint arXiv:2112.10906, 2021.
- [65] Chen Dong, Liu Jian, Wu Jie, and Wei Guo-Wei. Persistent hyperdigraph homology and persistent hyperdigraph Laplacians. arXiv preprint arXiv:2304.00345, 2023.
- [66] Starr Tyler N, Greaney Allison J, Stewart Cameron M, Walls Alexandra C, Hannon William W, Veesler David, and Bloom Jesse D. Deep mutational scans for ACE2 binding, RBD expression, and antibody escape in the SARS-CoV-2 Omicron BA.1 and BA.2 receptor-binding domains. PLoS pathogens, 18(11):e1010951, 2022.
- [67] Levy Emmanuel D. A simple definition of structural regions in proteins and its use in analyzing interface evolution. Journal of molecular biology, 403(4):660–670, 2010.
- [68] Lan Jun, Ge Jiwan, Yu Jinfang, Shan Sisi, Zhou Huan, Fan Shilong, Zhang Qi, Shi Xuanling, Wang Qisheng, Zhang Linqi, et al. Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature, 581(7807):215–220, 2020.
- [69] Mannar Dhiraj, Saville James W, Zhu Xing, Srivastava Shanti S, Berezuk Alison M, Tuttle Katharine S, Marquez Ana Citlali, Sekirov Inna, and Subramaniam Sriram. SARS-CoV-2 Omicron variant: Antibody evasion and cryo-EM structure of spike protein–ACE2 complex. Science, 375(6582):760–764, 2022.
- [70] Li Linjie, Liao Hanyi, Meng Yumin, Li Weiwei, Han Pengcheng, Liu Kefang, Wang Qing, Li Dedong, Zhang Yanfang, Wang Liang, et al. Structural basis of human ACE2 higher binding affinity to currently circulating Omicron SARS-CoV-2 sub-variants BA.2 and BA.1.1. Cell, 185(16):2952–2960, 2022.
- [71] Goodsell David S, Autin Ludovic, and Olson Arthur J. Illustrate: software for biomolecular illustration. Structure, 27(11):1716–1720, 2019.
- [72] Bogan Andrew A and Thorn Kurt S. Anatomy of hot spots in protein interfaces. Journal of molecular biology, 280(1):1–9, 1998.
- [73] Eckmann Beno. Harmonische Funktionen und Randwertaufgaben in einem Komplex. Commentarii Mathematici Helvetici, 17(1):240–255, 1944.
- [74] Serrano Daniel Hernández and Gómez Darío Sánchez. Higher order degree in simplicial complexes, multi combinatorial Laplacian and applications of TDA to complex networks. arXiv preprint arXiv:1908.02583, 2019.
- [75] Maletić Slobodan and Rajković Milan. Consensus formation on a simplicial complex of opinions. Physica A: Statistical Mechanics and its Applications, 397(March):111–120, 2014.
- [76] Goldberg Timothy E. Combinatorial Laplacians of simplicial complexes. Senior Thesis, Bard College, 2002.
- [77] Horak Danijela and Jost Jürgen. Spectra of combinatorial Laplace operators on simplicial complexes. Advances in Mathematics, 244:303–336, 2013.
- [78] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- [79] Chen Jiahui, Wang Rui, Gilby Nancy Benovich, and Wei Guo-Wei. Omicron variant (B.1.1.529): Infectivity, vaccine breakthrough, and antibody resistance. J Chem Inf Model, 62(2):412–422, 2022.
- [80] Chen Jiahui and Wei Guo-Wei. Omicron BA.2 (B.1.1.529.2): High potential for becoming the next dominant variant. The Journal of Physical Chemistry Letters, 13:3840–3849, 2022.