Abstract
The role of genomic variants in disease has expanded significantly with the advent of advanced sequencing techniques. The rapid increase in identified genomic variants has led to many variants being classified as Variants of Uncertain Significance or as having conflicting evidence, posing challenges for their interpretation and characterization. Additionally, current methods for predicting pathogenic variants often lack insights into the underlying molecular mechanisms. Here, we introduce MAVISp (Multi‐layered Assessment of VarIants by Structure for proteins), a modular structural framework for variant effects, accompanied by a web server (https://services.healthtech.dtu.dk/services/MAVISp-1.0/) to enhance data accessibility, consultation, and re‐usability. MAVISp currently provides data on over 1000 proteins, encompassing more than 10 million variants. A team of biocurators regularly analyzes and updates protein entries using standardized workflows, incorporating free‐energy calculations and biomolecular simulations. We illustrate the utility of MAVISp through selected case studies. The framework facilitates the analysis of variant effects at the protein level and has the potential to advance the understanding and application of mutational data in disease research.
Keywords: cancer genomics, free energy calculations, long‐range structural communication, protein function, protein stability, protein structures, variant effects
1. INTRODUCTION
We are witnessing unprecedented advances in cancer genomics, sequencing (Xuan et al., 2013), structural biology (Jumper et al., 2021), and high‐throughput multiplex‐based assays (Fowler et al., 2023; Weile & Roth, 2018). While sequencing approaches can identify genomic alterations, understanding the molecular mechanisms of these variants remains a challenge. Although many variants in human genes associated with disease are currently known, the identification of their effects on human health is lagging behind (Fowler & Rehm, 2024). Substantial evidence, which is necessary to classify variants according to their effects, is often lacking or contradictory in nature. Consequently, Variants of Uncertain Significance (VUS) or variants with conflicting evidence are continuously identified and reported in variant databases (Burke et al., 2022; Cerami et al., 2012; Gao et al., 2013; Landrum et al., 2020; Stenson et al., 2020; Tate et al., 2019). VUS remain an outstanding problem that complicates diagnosis and leads to suboptimal diagnosis or choice of therapy (Burke et al., 2022).
At the same time, the bioinformatics community has developed various approaches for predicting the impact of variants on human health, many of which are benchmarked against or complemented by experimental data and cellular readouts (Cagiada et al., 2021; Høie et al., 2022; Jepsen et al., 2020; Laine et al., 2019; Riesselman et al., 2018). In this context, experimental multiplex assays deliver good quality and high‐throughput assessment of the effect of variants on different readouts and have effectively been used to aid clinical variant interpretation (Gelman et al., 2019; McEwen et al., 2025). These computational and experimental approaches allow variants to be classified according to their potential pathogenic or benign effects, which are then reported in different repositories and compendia (Cerami et al., 2012; Gao et al., 2013; Landrum et al., 2020; Tate et al., 2019). In fact, computational methods are currently considered supporting evidence for variant classification, according to recent revisions of the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) variant classification guidelines (Pejaver et al., 2022). Variant effect predictors (VEPs), methods designed to predict the effect of a mutation at the genome or protein level, have made considerable progress, as outlined in recent reviews (Gerasimavicius et al., 2025; Livesey & Marsh, 2025; Rastogi et al., 2025). VEPs have classically relied on sequence data and variants with known classifications.
Nonetheless, in recent years, the advent of AlphaFold2 (Evans et al., 2021; Jumper et al., 2021; Varadi et al., 2022) and other similar methodologies has enabled the prediction of accurate three‐dimensional (3D) protein structures and complexes, often with a quality comparable to that of experiments. This, in turn, enabled the inclusion of information about protein structure in machine learning models, which are among the best‐performing available VEPs (Gerasimavicius et al., 2025). A well‐known example of this is AlphaMissense (Cheng et al., 2023a), which is based on a deep learning model similar to AlphaFold2. Additionally, it simultaneously learns to perform structure prediction and trains an unsupervised protein language model, thereby incorporating structural information into the prediction. The latter was then fine‐tuned for a variant classification task. Approaches based on protein language models (such as ESM‐1b [Rives et al., 2021] or, more recently, ESM‐2 [Lin et al., 2023] and ESM‐3 [Hayes et al., 2025]), which are unsupervised models of protein sequence, have also shown good performance when used in variant effect prediction tasks (Brandes et al., 2023; Hayes et al., 2025). ESM‐3 (Hayes et al., 2025) already incorporates structural information into its training, through specialized tokens, whereas protein sequence models have been used in conjunction with structural information in various ways (Blaabjerg et al., 2024; Sun & Shen, 2024). Even a model such as GEMME, which is an epistatic model entirely based on sequence conservation, has been supplemented with structural information as structure‐derived features in ESCOTT (Tekpinar et al., 2025). Rhapsody‐2 is a VEP that incorporates features derived from protein structure and dynamics within a machine learning framework (Banerjee et al., 2025). 
Finally, the ability to perform long and accurate biomolecular simulations and robust physical models enables the exploration of conformational changes and protein dynamics across different timescales (Hollingsworth & Dror, 2018).
In previous pilot projects, we explored structure‐based methods to analyze the impact of variants in coding regions of cancer‐related genes, focusing on their consequences on the protein product (Degn et al., 2022; Fas et al., 2020; Kumar & Papaleo, 2020). We propose that these methodologies could be widely applied to study disease‐associated variants. When formalized and standardized, this approach can complement existing methods for predicting pathogenic variants, such as the aforementioned AlphaMissense (Cheng et al., 2023a). Most available VEPs estimate the likelihood of damaging effects of variants but do not provide evidence of variant effects in relation to specific altered protein functions at the cellular level. On the contrary, with this contribution, we aim to link variant effects to specific underlying molecular mechanisms (Degn et al., 2022). A mechanistic understanding of variant effects can help the design of strategies in disease prevention, genetic counseling, clinical care, and treatment. Moreover, from a fundamental research perspective, mechanistic knowledge is also essential for designing and prioritizing experiments to investigate the underlying molecular causes of disease.
Considering this, we developed MAVISp (Multi‐layered Assessment of VarIants by Structure for proteins) to enable high‐throughput variant analysis within standardized workflows. MAVISp integrates results from VEPs and structure‐based predictions of variant effects on several protein properties. The data are accessible through a Streamlit‐based website for consultation and download (https://services.healthtech.dtu.dk/services/MAVISp-1.0/). Additionally, we maintain a Gitbook resource with detailed reports for individual proteins (https://elelab.gitbook.io/mavisp/).
With this publication, we provide data on in silico saturation mutagenesis for all possible variants at each mutation site with structural coverage for 1216 proteins and over 10 million variants. New data and updates of existing entries will be continuously released. Currently, we are capable of processing up to 20 new proteins weekly, which are deposited in a local version of the database. The public database is updated quarterly. Based on recent statistics (https://elelab.gitbook.io/mavisp/documentation/coverage-and-statistics), we anticipate providing 80–100 new proteins with each update, along with additional modules for existing entries. In this manuscript, we provide an overview of the methodology and show examples of data analysis and application.
2. RESULTS
2.1. Overview of MAVISp and its database
MAVISp performs a set of independent predictions, each assessing the effect of a specific amino acid substitution on a different aspect of protein function and structural stability, starting from one or more protein structures. These independent predictions are executed by the so‐called MAVISp modules (Figure 1a). MAVISp can be applied to individual three‐dimensional (3D) protein structures and their complexes (simple mode) or to an ensemble of structures generated through various approaches (ensemble mode). The framework is modular, allowing all the modules or only a selected subset to be applied, depending on the case study. Each module relies on Snakemake, Dask workflows, or Python scripts, all of which are supported by specific virtual environments. The modules are divided into two main categories: (i) modules to retrieve and select structures for analyses (shown in orange in Figure 1a), (ii) modules to perform analyses related to variant assessment or annotations (shown in blue in Figure 1a). Each module includes a strictly defined protocol for computational analysis that can be carried out either step by step or automatically embedded in more comprehensive pipelines (Methods). They are designed to ensure consistency across all the proteins under investigation and to enhance reproducibility and repeatability. Our prediction modules are also complemented by available experimental data or already available predictions that can be integrated in the MAVISp dataset, such as those for VEPs (shown in green in Figure 1a). All the resources used in the MAVISp framework are reported in Table S1, some of which have been developed within this work.
FIGURE 1.

Overview of MAVISp components. (a) MAVISp includes different modules, each managed by workflow engines or dedicated tools. The modules highlighted in orange handle the selection and collection of protein structures, while the modules in blue and purple are dedicated to structural analyses of variant effects in relation to protein functional‐ or stability‐related properties. Additionally, the framework provides modules with results from VEPs and scores derived by experiments, such as deep mutational scans (green). The procedure begins with a gene name, its UniProt and RefSeq identifiers, and the desired structural coverage. For each gene, all the steps can be conducted on a standard server with 32–64 CPUs. The only exceptions are: (i) the ENSEMBLE GENERATION module, which includes all‐atom MD simulations, and (ii) Rosetta‐based calculations on binding free energies and folding/unfolding free energy calculations. Depending on the simulation length and system size, these might require access to HPC facilities. On the left, the simple mode for the assessment is illustrated, which uses single experimental structures or models from AlphaFold2 or AlphaFold3. On the right, the ensemble mode is schematized, in which a conformational ensemble for the target protein or its complexes is applied. Here, we consider a conformational ensemble to be a collection of 3D conformations of the protein generated by a sampling method such as molecular dynamics or provided by NMR structures in the PDB. (b) Scheme of the current workflow for the MAVISp database and webserver. Biocurators apply specific workflows and protocols within each MAVISp module to generate structure‐based predictions of changes linked to variants in each protein target. The results are gathered into a text‐based database. The data are further processed by the MAVISp Python package, which performs consistency checks, aggregates the data and outputs human‐readable CSV table files that make up the MAVISp database. 
These CSV files are imported by the Streamlit web app, powering the MAVISp webserver (https://services.healthtech.dtu.dk/services/MAVISp-1.0/), where the data are available for interactive visualization and download. In addition, the MAVISp database can be used to generate graphical representations of the data, such as dot plots, lollipop plots, and UpSet plots. Finally, based on the information gathered so far, we provide GitBook reports to facilitate the interpretation of the results: https://elelab.gitbook.io/mavisp/.
The modules are used in the context of the overall MAVISp workflow (Figure 1b), which is designed to enable multiple biocurators to work concurrently and independently on distinct proteins. Data managers defined a priority list of targets that are analyzed in batches by biocurators, depending on the specific research project requirements. Additional targets of interest for the research community can be requested, as explained in the documentation on GitBook.
The workflow is designed as a set of consecutive steps that act on one protein of interest at a time. As the first step, once a protein of interest has been selected, a biocurator retrieves structural and functional information about it, along with key identifiers (e.g., gene name, UniProt AC, RefSeq identifier) for the next steps. Additionally, the biocurator proposes a trimming strategy for the protein, for example, identifying one or more sets of contiguous residues in the protein structure that can effectively serve as input for the prediction steps. This step entails considering only well‐structured and high‐accuracy regions of our proteins, which is crucial since most MAVISp modules are not designed to handle large intrinsically disordered regions. In selected cases, to avoid potential bias in our structural calculations, the curator may edit the structure by removing long disordered inclusions in structured regions. Furthermore, in the MAVISp ensemble mode, where the ENSEMBLE GENERATION module should be carried out, the biocurator identifies the initial structures for the simulations to be performed on the protein target in its free or bound state with other biomolecules and performs the necessary simulations to obtain the final structural ensemble. Once the protein structure or structural ensemble, depending on the mode, is available, the biocurator works with each available module and obtains: (i) a list of variants that MAVISp will annotate (see Materials and Methods for details) and (ii) the final predictions for each module. To do so, biocurators adhere to strict workflows for data collection based on a set of procedures codified in each module, which are mostly automated using Snakemake pipelines. Once this is completed, the MAVISp data managers will import and aggregate the data using the MAVISp Python package (https://github.com/ELELAB/MAVISp). 
This step also allows performing sanity checks, per‐module data classifications, and writing the results in a human‐readable table format, constituting the MAVISp database. The database files are the first product of MAVISp and contain the collected data and metadata for each of the identified variants (https://services.healthtech.dtu.dk/services/MAVISp-1.0/).
The datasets from the MAVISp database can then be further used in two ways. First, biocurators or data managers can perform a set of analyses, referred to as downstream analyses, which are generated downstream of database creation. These analyses result in the generation of publication‐ready figures that summarize the predicted effects for each variant and assist with results interpretation.
Furthermore, the biocurators use data from the downstream analysis to create a report in GitBook (https://elelab.gitbook.io/mavisp/), using a standard Markdown template and a semi‐automated procedure. Biocurators and data managers also act as reviewers for reports created by their peers. A review status is assigned to each GitBook entry to guide users regarding the quality and integrity of the curated data. To achieve this, we defined four review status levels (i.e., stars) for each protein entry (https://elelab.gitbook.io/mavisp/documentation/mavisp-review-status).
Finally, the MAVISp database is presented through a user‐friendly Streamlit‐based website (https://services.healthtech.dtu.dk/services/MAVISp-1.0/). The web app includes various visualizations to aid the interpretation of MAVISp results that are essentially equivalent to the downstream analyses outlined above: (a) a dot plot displaying classifications for each variant across MAVISp modules, experimental data (if available), and the VEP results, (b) a lollipop plot aggregating relevant mechanistic indicators (i.e., MAVISp‐identified effects at the structural level) associated with potentially pathogenic variants, and (c) an interactive representation on the 3D structure, showing the localization of mutation sites identified in (b). These features are designed to support the interpretation of results and facilitate the identification of variants with specific mechanisms and multiple effects. The source code for the MAVISp Python package and the web application are available on GitHub (https://github.com/ELELAB/MAVISp), while the complete dataset can be downloaded from an OSF repository (https://osf.io/ufpzm/). The OSF repository also includes previous versions of the database. Both source code and data are freely available and released under open‐source or free licenses.
We invite requests on targets or variants that are not yet available in MAVISp or scheduled for curation. We also welcome contributors as biocurators or developers, pending training and adherence to our guidelines (https://elelab.gitbook.io/mavisp). To facilitate entrance into the MAVISp community of biocurators and developers, we organize training events, research visits, and workshops.
Notably, a comprehensive update will be conducted annually to incorporate new versions of external tools or resources used by MAVISp, ensuring that resources remain current. Moreover, we continuously expand our toolkit and develop new modules to enable even more comprehensive assessments. The criteria for including new methods and approaches in the framework are detailed in the GitBook documentation (https://elelab.gitbook.io/mavisp/documentation/how-to-contribute-as-a-developer).
2.2. MAVISp modules for structure collection and selection
MAVISp includes various modules to select and model the structures of interest in both ensemble and simple mode (Figure 1a).
The STRUCTURE SELECTION module enables biocurators to identify the starting structure for their study, both for models of the free and bound states of the protein of interest. This module includes structure retrieval from the Protein Data Bank (PDB) (Berman et al., 2000), the AlphaFold Protein Structure Database (Varadi et al., 2022), or through the generation of initial models with AlphaFold3 (Abramson et al., 2024), AlphaFold2 (Jumper et al., 2021) and AlphaFold‐multimer (Evans et al., 2021). In addition, it streamlines the selection of structures in terms of structural quality, experimental resolution, missing residues, amino acid substitutions with respect to the UniProt reference sequence, as well as the AlphaFold per‐residue confidence score (pLDDT), integrating tools such as PDBminer (Degn et al., 2023). Using AlphaFill (Hekkelman et al., 2022) further assists in identifying cofactors to be included in the model structure or to identify mutation sites that should be flagged, if located in the proximity of a missing cofactor in the structure model. When necessary, a workflow is available to reconstruct missing residues or design linkers to replace large, disordered loops within structured domains (Methods).
According to the protocol established for the generation of the models, we retain 3D structures with reasonable accuracy based on parameters such as pLDDT, Predicted Aligned Error (PAE), and pDOCKQ2 (Zhu et al., 2023). In addition, the module includes protocols based on AlphaFold (Evans et al., 2021; Tsaban et al., 2022) or comparative modeling (Di Rita et al., 2018; Holdgaard et al., 2019) when the complex between the protein target and the interactor involves Short Linear Motifs (SLiMs).
The INTERACTOME module aids the identification of protein interactors for the target protein and their complex structures by querying the Mentha database (Calderone et al., 2013), the PDB, and experimentally validated proteome‐wide AlphaFold models (Burke et al., 2023), as well as the STRING database (Szklarczyk et al., 2025) (Methods). Once a suitable set of interactors has been identified, the information is used to predict protein complex structures, which are then utilized in the subsequent steps (i.e., the LOCAL_INTERACTIONS module, see below).
The ENSEMBLE GENERATION module allows the use of structural ensembles from different sources, such as NMR structures deposited in PDB, coarse‐grained models for protein flexibility (e.g., CABS‐flex; Kuriata et al., 2018), or all‐atom Molecular Dynamics (MD) simulations (with GROMACS [Abraham et al., 2015] and PLUMED [Bonomi et al., 2019; Tribello et al., 2013]) of the protein structure or its complexes. The choice of methods is based on the required accuracy of the generated ensemble and the available computational resources. Once individual structures or structural ensembles for the protein candidate are selected, either alone or with interactors, the analysis modules can be used.
2.3. MAVISp modules for structural analysis
MAVISp integrates different analysis modules for both ensemble and simple mode (Figure 1a). The minimal set of data required to import a protein target and its variants into the MAVISp database includes the results from the STABILITY and PTM modules, along with predictions from VEPs. The STABILITY module is devoted to estimating the effects of the variants on the protein structural stability using folding free energy calculations (Methods). This module leverages workflows for high‐throughput in silico mutagenesis scans (Sora, Laspiur, et al., 2023; Tiberti, Terkelsen, et al., 2022) and a newly implemented protocol for RaSP (Blaabjerg et al., 2023) (Methods). All the methods used in this module predict the change in folding free energy upon amino acid substitution, and predictions are performed using FoldX, Rosetta, or RaSP. Once these predictions have been collected, MAVISp applies a consensus approach to classify the effect of the variants (Methods). The defined thresholds for changes in free energy are based on evidence that shows that variants with changes in folding free energy below 3 kcal/mol do not exhibit a marked decrease in stability at the cellular level (Abildgaard et al., 2019; Nielsen et al., 2017). Thus, MAVISp defines the following classes for changes in stability: stabilizing (ΔΔG ≤ −3 kcal/mol with both methods, FoldX and Rosetta or RaSP), destabilizing (ΔΔG ≥ 3 kcal/mol), neutral (−2 < ΔΔG < 2 kcal/mol), and uncertain (−3 < ΔΔG ≤ −2 kcal/mol or 2 ≤ ΔΔG < 3 kcal/mol). A variant is also classified as uncertain if the two methods would classify the effect of the variant differently. Of note, differences between FoldX and Rosetta/RaSP predictions are expected and reflect the distinct modeling assumptions and sampling strategies of the two classes of approaches. 
MAVISp leverages this complementarity by interpreting predictor agreement as increased confidence and disagreement as an indicator of uncertainty in the mechanistic annotation.
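The consensus rule described above can be illustrated with a minimal Python sketch. It is not the code of the MAVISp package: the function names `classify_folding_ddg` and `stability_consensus` are hypothetical, and the sketch assumes that a ΔΔG value in kcal/mol is available from both methods for the variant.

```python
def classify_folding_ddg(ddg: float) -> str:
    """Class assigned by a single method from a folding ddG (kcal/mol)."""
    if ddg <= -3.0:
        return "stabilizing"
    if ddg >= 3.0:
        return "destabilizing"
    if -2.0 < ddg < 2.0:
        return "neutral"
    # Remaining cases: -3 < ddG <= -2 or 2 <= ddG < 3
    return "uncertain"


def stability_consensus(ddg_foldx: float, ddg_second: float) -> str:
    """Consensus of FoldX and a second method (Rosetta or RaSP).

    Agreement keeps the shared class; disagreement is reported as uncertain.
    """
    class_foldx = classify_folding_ddg(ddg_foldx)
    class_second = classify_folding_ddg(ddg_second)
    return class_foldx if class_foldx == class_second else "uncertain"


if __name__ == "__main__":
    # Illustrative values only, not taken from the MAVISp database
    for pair in [(3.5, 4.1), (0.4, -1.2), (3.5, 1.0), (2.0, 2.5)]:
        print(pair, stability_consensus(*pair))
```

In this sketch a variant is reported as destabilizing or stabilizing only when both predictors cross the 3 kcal/mol threshold in the same direction, which mirrors the interpretation of predictor agreement as increased confidence.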
Since March 2024, we have adopted RaSP‐FoldX consensus as a default for data collection, after performing a benchmark using the MAVISp datasets (Supplementary Text S1 and https://github.com/ELELAB/MAVISp_RaSP_benchmark). RaSP provides a more suitable solution for high‐throughput data collection than CPU‐intensive scans based on Rosetta. In low‐throughput studies, where we focus in detail on a target protein, we can include Rosetta data, which is computationally more demanding.
The LOCAL INTERACTION module can be applied if the STRUCTURE SELECTION and INTERACTOME modules identify at least one suitable structure of the complex between the target protein and another biomolecule. The LOCAL INTERACTION module is based on estimating changes in binding free energy for variants at protein sites within 10 Å of the interaction interface, using protocols and consensus strategies that mirror those for STABILITY. In this case, we use a combination of FoldX and Rosetta calculations (Methods). Binding free energy thresholds are set based on the expected error margins of the predictors, approximately ±1 kcal/mol, as outlined by the authors of the methods and in accordance with general good practice in the literature. This approach addresses the scarcity of experimental datasets on amino acid substitutions that impact protein–protein interactions (Jankauskaitė et al., 2019; Sampson et al., 2024; Sargsyan & Lim, 2024), which are often constrained by system heterogeneity, limited mutation numbers, or both, thereby complicating reliable benchmarking. We rely on a consensus approach based on FoldX and Rosetta results for changes in binding free energies upon amino acid substitution. We classify a variant as stabilizing (both methods predict ΔΔG ≤ −1 kcal/mol), neutral (−1 kcal/mol < ΔΔG < 1 kcal/mol), or destabilizing (ΔΔG ≥ 1 kcal/mol). Cases in which the two methods disagree on the classification, or for which we do not have a prediction for both methods, and the side chain relative solvent accessible area of the residue is ≥25%, are classified as uncertain. This is because, in high‐throughput data collection, we cannot exclude the possibility that the site interacts if it is solvent‐exposed. In fact, very often only part of the 3D structures of protein–protein complexes are available or can be modeled. We also included support for LOCAL INTERACTION for protein‐DNA interactions and homodimers. 
Notably, a strength of our approach is to provide annotations for the effects of protein variants on various biological interfaces for the same target protein.
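As an illustration of the binding free energy rule, the following Python sketch encodes one possible reading of the text. The function names are hypothetical and this is not the MAVISp implementation. Two assumptions are made: a missing prediction from either method is treated as "no prediction for both methods", and, since the text does not state the outcome for buried sites (relative side-chain solvent accessibility below 25%) with disagreeing or missing predictions, the sketch returns `None` for them.

```python
from typing import Optional


def classify_binding_ddg(ddg: float) -> str:
    """Class assigned by a single method from a binding ddG (kcal/mol)."""
    if ddg <= -1.0:
        return "stabilizing"
    if ddg >= 1.0:
        return "destabilizing"
    return "neutral"


def local_interaction_class(
    ddg_foldx: Optional[float],
    ddg_rosetta: Optional[float],
    rel_sasa_percent: float,
) -> Optional[str]:
    """Consensus class for a variant near an interaction interface."""
    if ddg_foldx is not None and ddg_rosetta is not None:
        class_foldx = classify_binding_ddg(ddg_foldx)
        class_rosetta = classify_binding_ddg(ddg_rosetta)
        if class_foldx == class_rosetta:
            return class_foldx
    # Disagreement or missing prediction: exposed sites are uncertain
    if rel_sasa_percent >= 25.0:
        return "uncertain"
    # Assumption: outcome for buried sites is not specified in the text
    return None


if __name__ == "__main__":
    print(local_interaction_class(1.4, 2.0, 40.0))
    print(local_interaction_class(1.4, 0.2, 40.0))
    print(local_interaction_class(None, 0.2, 10.0))
```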
In ensemble mode, the STABILITY and LOCAL INTERACTION modules are used on ensembles of at least 25 structures from the simulations or on a number of main representative structures upon clustering, depending on the free energy calculation scheme to apply. The results obtained for each structure are then averaged, and classification is performed with the same strategies we use in simple mode using these average values. This approach is used to mitigate limitations due to lack of backbone flexibility when these free energy methods are applied to a single 3D structure (Degn et al., 2022; Peccati et al., 2023; Sapozhnikov et al., 2023; Tiberti, Terkelsen, et al., 2022).
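The averaging step in ensemble mode can be sketched as follows. This is a simplified illustration with a hypothetical function name and invented values, not the MAVISp code: it only shows that the per-structure ΔΔG values are reduced to their mean before the simple-mode thresholds are applied.

```python
from statistics import fmean
from typing import Sequence


def ensemble_average_ddg(per_structure_ddg: Sequence[float]) -> float:
    """Mean ddG (kcal/mol) over the structures of an ensemble."""
    if len(per_structure_ddg) == 0:
        raise ValueError("at least one structure is required")
    return fmean(per_structure_ddg)


if __name__ == "__main__":
    # Invented ddG values for one variant on three representative structures
    values = [2.5, 3.5, 4.0]
    average = ensemble_average_ddg(values)
    # The average, not each individual value, is compared with the thresholds
    print(round(average, 2), "destabilizing" if average >= 3.0 else "other")
```

In this invented example one structure alone would fall in the uncertain range, while the ensemble average crosses the 3 kcal/mol threshold used for folding free energies.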
The LONG‐RANGE module predicts mutational effects in the context of allostery, a fundamental mechanism of protein regulation, in which perturbations at one site redistribute conformational populations across the free energy landscape to modulate distal functional regions (Nussinov & Tsai, 2013). Increasing evidence indicates that disease‐associated variants frequently exert their effects through this mechanism, termed allosteric polymorphism, rather than through direct local destabilization or disruption of functional sites (Berezovsky & Nussinov, 2025; Tan et al., 2022; Tee et al., 2019; Weng et al., 2024). These observations underscore the need to account for long‐range allosteric effects of mutations for accurate interpretation of variants. Furthermore, as folding stability and allostery are fundamentally thermodynamically linked aspects of the free energy landscape, efforts should be made to account for long‐range effects and attempt to deconvolute true allosteric effects from stability‐mediated effects (Escobedo et al., 2025; Faure et al., 2024; Weng et al., 2024). The module applies coarse‐grained models to estimate allosteric free‐energy changes upon amino acid substitution based on AlloSigMA2 (Tan et al., 2020). The protocol followed by the LONG‐RANGE module has recently been updated and benchmarked using experimental data from deep mutational scans (Krzesińska et al., 2025). Details on the parameters and steps for analysis are also provided in the Methods. Variants are annotated as destabilizing (positive changes in allosteric free energy), stabilizing (negative changes in allosteric free energy), mixed effects (both conditions occur), or neutral if the variant does not cause any significant change. Additionally, variants that do not cause a significant change in residue side‐chain volume are annotated as uncertain. 
In ensemble mode, we evaluated the shortest communication paths using an atomic contact‐based Protein Structure Network (Sora, Tiberti, et al., 2023). This analysis, combined with the AlloSigMA2 data, allows pinpointing variants with long‐range effects to functional sites or protein pockets that could serve as interfaces to recruit interactors or ligands.
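The annotation logic of the LONG‐RANGE module in simple mode can be outlined with a short Python sketch. It is only a schematic reading of the description above: the function name is hypothetical, the significance cutoff is a required argument because its value is given in the Methods rather than here, and representing the prediction as a list of allosteric free energy changes at response sites is an assumption.

```python
from typing import Sequence


def long_range_class(
    allosteric_dg: Sequence[float],
    cutoff: float,
    significant_volume_change: bool,
) -> str:
    """Annotate a variant from allosteric free energy changes (kcal/mol).

    allosteric_dg: changes at the response sites (assumed input layout).
    cutoff: magnitude above which a change counts as significant.
    """
    if not significant_volume_change:
        return "uncertain"
    has_positive = any(value >= cutoff for value in allosteric_dg)
    has_negative = any(value <= -cutoff for value in allosteric_dg)
    if has_positive and has_negative:
        return "mixed effects"
    if has_positive:
        return "destabilizing"
    if has_negative:
        return "stabilizing"
    return "neutral"


if __name__ == "__main__":
    # The cutoff of 1.0 is a placeholder for illustration only
    print(long_range_class([0.1, 1.6, -0.2], cutoff=1.0,
                           significant_volume_change=True))
```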
The FUNCTIONAL SITES module in simple mode allows evaluating the effect of variants at (or in the proximity of) the active site of enzymes or cofactor binding sites of proteins and it is based on analyses of contacts with the second sphere of coordination of the residues belonging to these sites (see Methods).
The FUNCTIONAL DYNAMICS module in ensemble mode includes enhanced‐sampling simulations to further assess the local or long‐range effects of a variant. As a first example, we applied this class of methods to validate the long‐range effects predicted for p53 variants on the DNA‐binding loops (Degn et al., 2022), and included such results in the MAVISp database.
The PTM module currently supports phosphorylation, annotating the effect of variants at phosphorylatable sites. It evaluates how loss or changes in phosphorylation sites may impact protein regulation, stability, or interaction with partners. To this end, the module collects analyses and annotations such as solvent accessibility of the mutation site, inclusion of the site in a phosphorylatable linear motif, and comparison between predicted changes in folding or binding free energy upon amino acid substitution or upon phosphorylation at the site of interest. In the module, we applied a custom decision logic (Supplementary Text S2) to classify each variant as neutral, damaging, unknown effect, potentially damaging or uncertain. The identification of the phosphorylation sites in the PTM module is based on known experimental phosphosites and SLiMs, as retrieved by Cancermuts (Tiberti, Di Leo, et al., 2022). These data are complemented by a manually curated selection of phospho‐modulated SLiMs (https://github.com/ELELAB/MAVISp/blob/main/mavisp/data/phosphoSLiMs_09062023.csv). For solvent‐inaccessible phosphorylatable residues, the effects are classified as uncertain in simple mode. In these cases, the ensemble mode is required to investigate whether a cryptic phosphorylated site may become accessible upon conformational changes (Henriques & Lindorff‐Larsen, 2020; Orioli et al., 2022). Of note, the current version of the PTM module has been designed based on fundamental principles of how phosphorylation can affect protein structure and should be used to identify variants for further investigation, particularly for experimental research. Benchmarking the effectiveness of this module would be difficult at the present time, given the relatively small number of amino acid substitutions that can affect phosphorylation currently present in the MAVISp database, especially considering those for which experimental data is available. 
To this end, we are currently in the process of curating and including more proteins relevant to benchmarking the PTM module. These will include experimental data on protein stability and protein–protein interactions upon phosphorylation (Huang et al., 2018; Potel et al., 2021).
MAVISp includes additional analyses and annotations, such as predictions on regions involved in early folding events (Raimondi et al., 2017), pLDDT score, secondary structure, and side‐chain solvent accessibility, which can assist in the interpretation of the results.
2.4. Variant effect predictors included in MAVISp
MAVISp provides annotations for the variant interpretation reported in ClinVar (Landrum et al., 2020), or calculated with REVEL (Ioannidis et al., 2016), DeMaSk (Munro & Singh, 2021), GEMME (Laine et al., 2019), EVE (Evolutionary model of variant effect) (Frazer et al., 2021), and AlphaMissense (Cheng et al., 2023a). In MAVISp, each of them is handled by a separate module. The results of these VEPs can be combined with those from the MAVISp structure‐based modules to understand variant effects and to prioritize variants for other studies, as detailed in the examples below.
2.5. Sources of variants supported by MAVISp
By default, we apply in silico saturation mutagenesis, which means that we provide predicted effects for each variant of a target protein at any position with structural coverage. Additionally, all variants reported for the target protein in COSMIC, cBioPortal, and ClinVar are annotated within MAVISp. We routinely update and maintain the entries in the MAVISp database to include up‐to‐date annotations using Cancermuts (Tiberti, Di Leo, et al., 2022). All Cancermuts annotations for MAVISp and other protein targets are also available at the Cancermuts webserver, https://services.healthtech.dtu.dk/services/Cancermuts-1.0/. In addition, annotations from lists of variants from other studies, such as cohort‐ or nationwide‐based studies or other disease‐related genomic initiatives, can be manually added.
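The in silico saturation mutagenesis described above can be illustrated with a minimal sketch. It is not taken from the MAVISp code base: the function name, its arguments, and the variant notation (wild-type residue, position, mutant residue) are assumptions made for illustration only.

```python
# Illustrative sketch only; names and conventions are hypothetical,
# not part of the MAVISp code base.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def saturation_variants(sequence, first_residue=1, covered=None):
    """List every single amino acid substitution at covered positions.

    sequence: one-letter wild-type sequence.
    first_residue: residue number assigned to sequence[0].
    covered: optional set of residue numbers with structural coverage;
             None means that every position is treated as covered.
    """
    variants = []
    for offset, wild_type in enumerate(sequence):
        position = first_residue + offset
        if covered is not None and position not in covered:
            continue  # skip positions without structural coverage
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                variants.append(f"{wild_type}{position}{mutant}")
    return variants


# Toy three-residue sequence starting at residue 7
toy = saturation_variants("MKT", first_residue=7)
print(len(toy), toy[:3])
```

Each covered position contributes 19 substitutions, so the size of the resulting list scales linearly with the length of the structurally covered region.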
Currently, MAVISp includes data on 10+ million variants from 1216 proteins (as of 01/03/2026). An overview of the currently available data and how to use them to address different research questions is described in detail in the next sections. The first targeted studies in which MAVISp has been applied to understand the impact of variants in rare genetic diseases (Scrima et al., 2024) or in proteins involved in cancer hallmarks (Utichi et al., 2025; Sahu et al., 2023) also serve as examples.
2.6. Interpretation of the results of MAVISp
MAVISp provides a comprehensive set of results for many variants; therefore, we have devised a few strategies that can be useful to make sense of the MAVISp data for a few common use cases that users might encounter.
One of the most important outputs from the downstream analyses of MAVISp is the so‐called dot plot, which is available in the GitBook reports or released within the targeted studies of specific proteins (see below for examples). A dot plot can also be generated within the MAVISp webserver in the “Classification” tab for up to 50 variants of choice simultaneously. This plot showcases (i) the classification of the different VEPs integrated in MAVISp, (ii) the classification performed by each MAVISp module, and (iii) the classification of variants in ClinVar, when available, as variant label colors. The code to generate dot plots from MAVISp CSV files is also available on GitHub (https://github.com/ELELAB/MAVISp_downstream_analysis). In MAVISp, each module classifies mutations independently; therefore, the interpretation of any assigned label is specific to each module. For example, a damaging classification from a variant effect predictor typically indicates that the variant is predicted to impair protein function or be pathogenic, depending on the predictor used. In contrast, a variant classified as damaging by the stability module indicates a predicted decrease in protein structural stability, without implying a direct effect on function.
Another representation, the lolliplot, depends on further processing of a text output created by dot_plot.py (i.e., mechanistic_indicators_out.csv) and provides a concise view of the classes of mechanistic indicators found for each variant. Lolliplots are also available in the GitBook report or in the “Damaging mutation” tab on the website, which shows only those variants that are at the same time: (i) classified as pathogenic by AlphaMissense, (ii) classified as loss‐of‐fitness or gain‐of‐fitness by DeMaSk, and (iii) damaging for the respective structure‐based module of MAVISp. The downstream analysis toolkit also provides the code to prepare upset plots or Venn diagrams for the variant source (as reported in GitBook). Finally, it includes a script (filter_pLDDT.py) that takes the mechanistic_indicators_out.csv file as input and flags the variants for which MAVISp predicts mechanistic indicators but that are characterized by low confidence in the AlphaFold model used for the analyses. These cases would require further attention and investigation, for example in the ensemble mode.
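The idea of flagging variants that carry mechanistic indicators but sit in low-confidence regions of the AlphaFold model can be sketched as follows. This is not the filter_pLDDT.py script itself: the dictionary keys, the toy variant names, and the cutoff of 70 are assumptions, and the confidence bands simply follow the pLDDT ranges given in the caption of Figure 2.

```python
# Illustrative sketch only; this is NOT the actual filter_pLDDT.py script.
# Keys ("variant", "plddt", "indicators") and the cutoff are hypothetical.

def plddt_band(plddt):
    """Map a pLDDT score to the confidence bands used in Figure 2."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def flag_low_confidence(rows, cutoff=70.0):
    """Return variants that have at least one mechanistic indicator
    but whose mutation site has a pLDDT at or below the cutoff."""
    flagged = []
    for row in rows:
        if row["indicators"] and row["plddt"] <= cutoff:
            flagged.append((row["variant"], plddt_band(row["plddt"])))
    return flagged


# Toy records, not real MAVISp data
rows = [
    {"variant": "A10V", "plddt": 95.2, "indicators": ["stability"]},
    {"variant": "G55D", "plddt": 62.0, "indicators": ["stability", "long_range"]},
    {"variant": "S80P", "plddt": 41.5, "indicators": []},
]
flagged = flag_low_confidence(rows)
print(flagged)
```

In this toy example only the second record is flagged: the first has a high-confidence site, and the third has no mechanistic indicator to begin with.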
Consulting the available dot plot for an entry of interest is therefore the most straightforward way to start accessing MAVISp data. We have defined the following strategy for the data‐driven discovery of variants of interest for which little other information is available (i.e., VUS, variants with conflicting evidence, or variants not reported in ClinVar). In this case, the dot plot first shows which variants are predicted to be pathogenic according to the AlphaMissense classification; these are the ones reported as Damaging in the AlphaMissense row. For these, we also consider the output of DeMaSk, which defines whether the variant is classified as gain‐of‐fitness or loss‐of‐fitness. If a variant fulfils these criteria, we then consider the structure‐based MAVISp predictions for mechanistic indicators, which give us one or more possible explanations for the effect of the variant. For instance, a variant that destabilizes the protein structure will be reported with altered stability as its mechanistic indicator. Another common use case is to use MAVISp to obtain a mechanistic interpretation of variants already known in ClinVar. In this case, if the variant already has an interpretation of Pathogenic, Likely pathogenic, Benign, or Likely benign, we can refer directly to the MAVISp mechanistic interpretation.
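The three-step discovery strategy described above can be written down as a small filter. The sketch below is only illustrative: the record layout, the label strings, and the toy variants are hypothetical and do not correspond to the column names of the MAVISp CSV files.

```python
# Illustrative sketch only; the record layout and label strings are
# hypothetical and do not mirror the actual MAVISp CSV columns.

def prioritize(rows):
    """Keep variants that are (i) pathogenic for AlphaMissense,
    (ii) loss- or gain-of-fitness for DeMaSk, and (iii) damaging for
    at least one structure-based module; report those modules as the
    candidate mechanistic indicators."""
    selected = []
    for row in rows:
        if row["alphamissense"] != "pathogenic":
            continue
        if row["demask"] not in ("loss_of_fitness", "gain_of_fitness"):
            continue
        indicators = sorted(
            module for module, label in row["modules"].items()
            if label == "damaging"
        )
        if indicators:
            selected.append((row["variant"], indicators))
    return selected


# Toy records, not real MAVISp data
rows = [
    {"variant": "L10P", "alphamissense": "pathogenic",
     "demask": "loss_of_fitness",
     "modules": {"stability": "damaging", "long_range": "neutral"}},
    {"variant": "A20V", "alphamissense": "benign",
     "demask": "loss_of_fitness",
     "modules": {"stability": "damaging"}},
    {"variant": "K30E", "alphamissense": "pathogenic",
     "demask": "neutral",
     "modules": {"stability": "damaging"}},
    {"variant": "R40W", "alphamissense": "pathogenic",
     "demask": "gain_of_fitness",
     "modules": {"stability": "neutral", "ptm": "uncertain"}},
]
shortlist = prioritize(rows)
print(shortlist)
```

Variants that pass the two VEP filters but have no damaging structure-based label (such as the last toy record) are dropped here; in practice they may still be of interest, since their effect could depend on a mechanism not yet covered by the available modules.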
Importantly, researchers should always refer to the specific biological or phenotypic context when interpreting predictions from MAVISp, including their knowledge of the biological role of the protein under investigation and of the nature of the disease of interest. For instance, predictions might lead to different conclusions depending on whether the protein under consideration is a tumor suppressor or an oncogene.
In the next section we illustrate some of the applications of data collected with MAVISp through case studies (Table S2 for mapping of case studies and modules).
2.7. COSMIC tumor suppressor genes and oncogenes
At first, we prioritized MAVISp data collection for known driver genes in cancer, that is, tumor suppressors and oncogenes. To this end, we collected data for the COSMIC Tumor Suppressor Genes (COSMIC v96), while the collection of the COSMIC Oncogene and Dual Role targets is ongoing. Furthermore, we have been including genes reported as candidate drivers by the Network of Cancer Genes (NCG) (Repana et al., 2019).
The MAVISp datasets on cancer driver genes can assist the identification of molecular mechanisms of predicted or known pathogenic variants in these genes, as well as aid the characterization of Variants of Uncertain Significance (VUS). A recent example is the study we performed on BRCA2 (Sahu et al., 2023). In this study, we analyzed BRCA2 variants reported in ClinVar, comparing the predictions from the STABILITY and LOCAL INTERACTIONS modules of MAVISp with results from a multiplex assay which measured the impact of these variants on cell viability. We were able to explain the effect of 84 BRCA2 variants, which were classified as non‐functional by the assay, and for which MAVISp predicted effects on protein stability or binding to the binding partner SEM1.
In the case of tumor suppressors, identifying variants that might lead to loss of function is particularly important. Given the structure–function relationship in proteins, structural stability is a key determinant that can be disrupted by amino acid substitutions, potentially leading to local or more severe protein misfolding, with consequent loss of function (Cagiada et al., 2021). As an example, we report the analysis of the MAVISp entry for the tumor suppressor BLM, a DNA helicase involved in DNA replication, recombination, and repair (Manthei & Keck, 2013). We identified a total of 1170 predicted destabilizing variants according to the STABILITY module, of which 45 were annotated in ClinVar (Figure 2a). Among these, 82% of destabilizing variants were found in structured regions of the protein, while the remaining 18% are located in disordered regions (Figure 2b). Of the ClinVar‐reported variants, 42 are classified as VUS. Y811C and C901Y are reported with conflicting interpretations, and only G952A is reported as likely pathogenic.
FIGURE 2.

Variants with effects on structural stability in the tumor suppressor protein BLM. (a) The cartoon representation shows the trimmed model BLM368‐1290, and the spheres highlight the Cɑ atoms of the 41 positions harboring the 45 variants predicted as destabilizing by the MAVISp STABILITY module (RaSP/FoldX consensus) and annotated in ClinVar. Among these, Y764C, G891E, and L896P are also reported in cBioPortal, whereas F663I, L845P, and C901Y are also reported in COSMIC. The two views correspond to the same domain rotated by 180°. The backbone and spheres are colored according to the AlphaFold pLDDT scores, that is, blue for very high (pLDDT >90), cyan for confident (70 < pLDDT ≤90), yellow for low (50 < pLDDT ≤70), and orange for very low (pLDDT ≤50). The labels indicate the mutation sites and the corresponding variants and are colored by ClinVar classification: uncertain significance (black), conflicting interpretation of pathogenicity (orange), and likely pathogenic (red). (b) The stacked bar plot shows the distribution of destabilizing BLM variants across secondary structure elements as defined by DSSP (i.e., H = ɑ‐helix, B = residue in isolated β‐bridge, E = extended strand, participates in β ladder, G = 3‐helix (310 helix), I = 5‐helix (π‐helix), T = hydrogen bonded turn, S = bend, and “–” = no secondary structure identified). The results refer to the data available in the MAVISp database on 12th September 2025. More information about BLM analyses with MAVISp can be found in the corresponding GitBook report: https://elelab.gitbook.io/Mavisp/proteins/blm.
These results provide a starting point for variant characterization and prioritization. As suggested in the previous section, our results can be used to guide the selection of a subset of variants with a predicted damaging impact from AlphaMissense, a loss‐of‐fitness signature according to DeMaSk, and that we predict to be damaging for stability. These would be suitable candidates for experimental validation. Concerning BLM, MAVISp identifies 41 ClinVar VUS or variants with conflicting evidence that could be prioritized according to these criteria (Table S3).
For example, depending on the size of the library to validate, methods such as flow cytometry sorting or cycloheximide chase assays (McKinnon, 2018; Miao et al., 2023), or approaches based on multiplex technologies (Levy et al., 2014; Matreyek et al., 2018; Tsuboyama et al., 2023; Yen et al., 2008), would be useful to validate our predictions.
2.8. Integration of MAVISp data with experimental data
A useful feature of MAVISp is a dedicated module to curate and import experimentally derived scores of variant effects on different biological readouts (i.e., the EXPERIMENTAL DATA module, Figure 1a). These data can be directly compared with the structural properties we predict with MAVISp, for a variety of purposes. For example, they can serve as an additional layer of information with respect to the structure‐based mechanistic indicators themselves. Additionally, as done in the BRCA2 study, they can be used as a source of information on variants with a known detrimental effect, which may depend on a different mechanism of action for each variant that can be investigated using MAVISp. In cases such as this, MAVISp helps identify the possible mechanisms by which variants have an effect, for further in‐depth investigation.
Experimental data can also be used to validate the results of certain MAVISp modules when the predicted structural properties are related to the experimentally tested biological readouts. Additionally, deep mutational scans (DMS) can be used to benchmark or tune the classification thresholds used by the MAVISp modules, including those based on structural properties. In this context, the format of MAVISp database files is convenient for subsequent data processing, for example, using biostatistical models or machine learning. For PTEN, we included data from available deep mutational scans, reporting effects of mutations on cellular abundance or phosphatase activity (Matreyek et al., 2018; Mighell et al., 2020; Post et al., 2020), in its MAVISp entry. Cellular abundance is a critical property that is often perturbed by missense mutations and can be altered by changes in protein structural stability. We therefore compared predictions from the MAVISp STABILITY module—based on a consensus of RaSP and FoldX—with protein abundance scores obtained from VAMP‐seq assays (Matreyek et al., 2018; Mighell et al., 2020; Post et al., 2020). To compare the classification obtained by the stability module with the experimental data, we considered how the abundance scores from the experiment have been used for classification. Multiple classification strategies have been used for these data: on one side, the ProteinGym benchmark dataset (Notin et al., 2023) applies a threshold based on the median of the abundance score (i.e., 0.77). Variants that scored lower than this threshold (≥22% reduction of abundance relative to the wild‐type) were classified as low abundant, whereas those that scored higher were considered to be similar to the wild‐type. The second classification followed the original PTEN study deposited in MaveDB (MaveDB ID urn:mavedb:00000013‐a‐1), which defines four abundance classes. 
In this scheme, the 5% lowest‐abundance synonymous variants corresponded to a score of 0.71 (Matreyek et al., 2018), and variants were classified into bin 1 (low‐abundant, both score and confidence interval <0.71), bin 2 (likely low‐abundant, score <0.71 but confidence interval >0.71), bin 3 (likely WT‐like abundant, score >0.71 but confidence interval <0.71), and bin 4 (WT‐like abundant, both score and confidence interval >0.71). For this analysis, we retained only variants in bins 1 and 4 to ensure an unambiguous classification. After applying these filters and excluding uncertain variants defined by the STABILITY module, the MaveDB‐based classification contained 1690 variants. To enable a direct comparison between the two classification strategies, the ProteinGym‐based dataset, which initially comprised 3211 variants, was filtered to include the same 1690 variants as the filtered MaveDB dataset. The two classification schemes were largely concordant, differing only for variants with abundance scores between 0.71 and 0.77, which were considered damaging by ProteinGym and neutral by MaveDB.
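The two classification schemes for the abundance scores can be sketched in a few lines. The sketch is written under explicit assumptions: the confidence interval is represented by a lower and an upper bound, the upper bound is the one compared with the cutoff for low scores and the lower bound for high scores, and scores exactly equal to a cutoff are treated as not low. None of these details are specified in the text above, and the function names are hypothetical.

```python
# Illustrative sketch only. Assumptions: the confidence interval is given
# as (ci_lower, ci_upper); ties at the cutoff count as "not low".
MAVEDB_CUTOFF = 0.71      # 5% lowest-abundance synonymous variants
PROTEINGYM_CUTOFF = 0.77  # median of the abundance scores


def mavedb_bin(score, ci_lower, ci_upper, cutoff=MAVEDB_CUTOFF):
    """Assign one of the four abundance bins."""
    if score < cutoff:
        # bin 1: score and interval below the cutoff; bin 2: interval crosses it
        return 1 if ci_upper < cutoff else 2
    # bin 4: score and interval above the cutoff; bin 3: interval crosses it
    return 4 if ci_lower > cutoff else 3


def mavedb_label(score, ci_lower, ci_upper):
    """Keep only the unambiguous bins 1 and 4; return None otherwise."""
    abundance_bin = mavedb_bin(score, ci_lower, ci_upper)
    if abundance_bin == 1:
        return "low"
    if abundance_bin == 4:
        return "wt_like"
    return None


def proteingym_label(score, cutoff=PROTEINGYM_CUTOFF):
    """Binary split at the median-based threshold."""
    return "low" if score < cutoff else "wt_like"


# A toy score between the two cutoffs is where the schemes disagree
score, ci_lower, ci_upper = 0.74, 0.72, 0.76
print(mavedb_label(score, ci_lower, ci_upper), proteingym_label(score))
```

The toy value falls in the 0.71 to 0.77 window, so it is labeled WT-like by the bin-based scheme and low-abundance by the median-based one, which is the only kind of disagreement expected between the two schemes once bins 2 and 3 are removed.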
Figure 3 and Table 1 illustrate the performance of the MAVISp STABILITY classification against the classification of experimental data on protein abundance for PTEN. In this first comparison, we applied the same threshold used for this dataset in the benchmarking dataset ProteinGym (Notin et al., 2023), based on the median of the DMS scores. The consensus approach provided by the STABILITY module of MAVISp (accuracy 0.814) performs better overall than GEMME or DeMaSk in identifying variants that are found to be damaging in the assay (Figure 3b). Nevertheless, this approach has a lower sensitivity (0.666) compared to GEMME. We thus wondered if the relatively low sensitivity we obtained was due to cases with experimental scores too close to the median (Figure 3c). Additionally, in the original study for PTEN and as deposited in the MaveDB (Esposito et al., 2019; Rubin et al., 2025), a different classification for the variant scoring based on four abundance levels was proposed, as detailed above. We thus compared the MAVISp results with the experimental dataset for the PTEN experiment from MaveDB using abundance‐level classes as a threshold (Figure 3c), resulting in increased sensitivity for the methods applied within MAVISp. The results on PTEN from MAVISp fit well with recent computational studies of PTEN variants using Rosetta calculations of protein stability and analyses of sequence conservation (Cagiada et al., 2021; Jepsen et al., 2020).
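For readers who wish to reproduce this kind of comparison on their own data, the performance measures reported in Table 1 follow the standard confusion-matrix definitions. The sketch below uses toy labels only; the function name, the label strings, and the choice to skip predictions labeled as uncertain are assumptions made for illustration.

```python
# Illustrative sketch only; label strings and the handling of
# "uncertain" predictions are assumptions.
import math


def binary_metrics(predicted, observed, positive="damaging",
                   skip=("uncertain",)):
    """Standard confusion-matrix metrics for paired classifications."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predicted, observed):
        if pred in skip:
            continue  # variants without a clear prediction are excluded
        if pred == positive and obs == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif obs == positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy labels, not PTEN data
predicted = ["damaging", "damaging", "neutral", "neutral",
             "damaging", "neutral", "uncertain"]
observed = ["damaging", "neutral", "neutral", "damaging",
            "damaging", "neutral", "damaging"]
metrics = binary_metrics(predicted, observed)
print(metrics)
```

Sensitivity and specificity are reported separately because a predictor can gain one at the expense of the other, as seen when comparing the consensus stability classification with GEMME.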
FIGURE 3.

Comparison of GEMME, DeMaSk, and MAVISp STABILITY module predictions with experimentally‐derived scores for protein abundance and phosphatase activity of PTEN. (a) The trimmed AlphaFold structure (residues 1–351) of PTEN used for MAVISp stability module calculations is shown as a cartoon, colored according to pLDDT scores. (b–e) Histograms with performances of MAVISp STABILITY module, DeMaSk, and GEMME in predicting the effect of variants using VAMP‐seq scores with ProteinGym (b) and MaveDB (c) thresholds. (d, e) illustrate the performance of the same tools or their combination against an experimental functional readout that assesses the phosphatase activity at the cellular level.
TABLE 1.
Performances of MAVISp modules and VEP predictors against experimental measurements of protein abundance and phosphatase activity for PTEN.
| Assay | Predictor | Threshold mode | Sensitivity | Specificity | Accuracy | Precision | F1 score |
|---|---|---|---|---|---|---|---|
| Protein abundance assay | RaSP/FoldX consensus | ProteinGym | 0.666 | 0.954 | 0.814 | 0.928 | 0.771 |
| Protein abundance assay | GEMME | ProteinGym | 0.719 | 0.743 | 0.731 | 0.717 | 0.718 |
| Protein abundance assay | DeMaSk | ProteinGym | 0.66 | 0.814 | 0.741 | 0.763 | 0.708 |
| Protein abundance assay | RaSP/FoldX consensus | MaveDB | 0.707 | 0.955 | 0.845 | 0.927 | 0.802 |
| Protein abundance assay | GEMME | MaveDB | 0.759 | 0.748 | 0.753 | 0.706 | 0.732 |
| Protein abundance assay | DeMaSk | MaveDB | 0.7 | 0.818 | 0.766 | 0.754 | 0.726 |
| Phosphatase assay | RaSP/FoldX + GEMME | MaveDB | 0.776 | 0.653 | 0.697 | 0.551 | 0.644 |
| Phosphatase assay | RaSP/FoldX + DeMaSk | MaveDB | 0.739 | 0.763 | 0.754 | 0.631 | 0.681 |
| Phosphatase assay | GEMME | MaveDB | 0.751 | 0.68 | 0.705 | 0.563 | 0.644 |
| Phosphatase assay | DeMaSk | MaveDB | 0.707 | 0.787 | 0.758 | 0.645 | 0.675 |
We next assessed whether MAVISp could also inform predictions of variant effects on PTEN phosphatase activity. Experimental data from a cellular phosphatase assay (MaveDB ID urn:mavedb:00000054‐a‐1) were classified as reduced (<0.89), wildtype‐like (0.89–1), or hyperactive (>1). Here, we investigated whether integrating MAVISp STABILITY data with VEP results could enhance predictive power, since reduced stability is not the only possible mechanism for loss of phosphatase activity in mutated variants. We combined the STABILITY results with GEMME or DeMaSk, applying a priority rule that gave damaging calls from GEMME/DeMaSk priority. This strategy produced performance comparable to GEMME or DeMaSk alone (Figure 3d,e; Table 1). Notably, combining changes in folding free energies with GEMME increased sensitivity (0.776) but reduced specificity (0.653), yielding an F1 score (0.644) equal to that of GEMME alone but lower than that of DeMaSk alone (0.675; Figure 3d,e; Table 1).
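The classification of the phosphatase scores and the priority rule can be sketched as follows. The thresholds are those stated above, whereas the exact form of the priority rule is an assumption: here a damaging call from the VEP always wins, and the STABILITY label is used as a fallback in all other cases. Function names and label strings are hypothetical.

```python
# Illustrative sketch only. The fallback to the STABILITY label when the
# VEP call is not damaging is an assumption about the priority rule.

def activity_class(score):
    """Classify a phosphatase activity score with the stated thresholds."""
    if score < 0.89:
        return "reduced"
    if score <= 1.0:
        return "wt_like"
    return "hyperactive"


def combine_calls(vep_call, stability_call):
    """Give priority to a damaging VEP call; otherwise keep the
    structure-based STABILITY classification."""
    if vep_call == "damaging":
        return "damaging"
    return stability_call


print(activity_class(0.5), activity_class(0.95), activity_class(1.2))
print(combine_calls("damaging", "neutral"))
print(combine_calls("neutral", "damaging"))
print(combine_calls("neutral", "neutral"))
```

Under this reading, a variant is called damaging when either source flags it, which is consistent with the higher sensitivity and lower specificity observed for the combination relative to the VEP alone.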
Overall, this section illustrates how to use MAVISp data to compare predictions and experiments, as well as how to integrate MAVISp modules on structural properties with VEP results.
2.9. Proteins involved in cancer hallmarks
To expand the contents of the MAVISp database, we have also been focusing on protein targets related to cancer hallmarks (Hanahan, 2022), and in particular on proteins involved in cancer hallmarks related to protein clearance at the cellular level, that is, the ability to escape cell death through apoptosis and autophagy, as well as kinases or transcription factors involved in the regulation of cellular proliferation. Mitochondrial apoptosis is tightly regulated by a network of protein–protein interactions between pro‐survival and pro‐apoptotic proteins. We investigated the mutational landscape in cancer of this group of proteins in a previous study (Kønig et al., 2019), which includes structural analyses with different simple mode MAVISp modules for both the pro‐survival proteins BCL2, BCL2L1, BCL2L2, BCL2L10, MCL1, and BCL2A1, as well as the pro‐apoptotic members of the family BOK, BAX, and BAK1. In these analyses, the C‐terminal transmembrane helix has been removed since the current version of our approach does not support transmembrane proteins or domains, illustrating how the STRUCTURE SELECTION module works.
Autophagy is a clearance mechanism with a dual role in cancer. The autophagy pathway relies on approximately 40 proteins, constituting the core autophagy machinery (Suzuki et al., 2017). As an example of the application of MAVISp to this group of proteins, we applied the simple mode to the marker of autophagosome formation MAP1LC3B and to the central kinase ULK1, building on the knowledge provided by previous work (Fas et al., 2020; Kumar & Papaleo, 2020).
For ULK1, we expanded our analysis to cover a larger portion of the protein, including both the N‐terminal (residues 7–279) and C‐terminal (837–1046) domains for the MAVISp assessment. ULK1 also demonstrates how to customize the AlphaFold model trimming to exclude disordered regions or linkers containing residues with low pLDDT scores in simple mode. In fact, their inclusion could lead to predictions of questionable quality. Disordered regions cannot be properly represented by a single conformation, and the ensemble mode would be necessary to derive more reliable conclusions.
ULK1 featured 215 variants reported in COSMIC, cBioPortal, and/or ClinVar, as shown in its dot plot (Figure 4a) generated using the MAVISp downstream analysis tools. Using the simple mode, 59 variants had predicted long‐range mixed effects. Furthermore, eight had a damaging effect on stability, one had a damaging PTM effect on regulation (S954N), one had a possible damaging PTM effect on function (S1042T), and four variants (L53P, G183V, E191G, and L215P) are characterized by effects on both stability and long‐range communication (Figure 4b–d). Most of the variants that were predicted to have long‐range and structure‐destabilizing effects are in the N‐terminal kinase domain of the protein. These predictions suggest that mutations in this region could compromise the local fold and alter the long‐range communication, potentially leading to the inactivation of ULK1. We then performed a one‐microsecond MD simulation of the ULK1 N‐terminal kinase domain (residues 3–279, PDB ID: 5CI7) to generate a structural ensemble for the MAVISp ensemble mode. In this case, we used an approach based on graph analysis from a contact‐based PSN (Methods) provided by the LONG‐RANGE module to verify whether long‐range communication occurs between mutation and response sites predicted by the coarse‐grain model used in the simple mode. The ensemble mode also validates the predictions on the effect of variants on stability made in simple mode, as it compensates for the limited or absent mobility of the protein main chain that characterizes the STABILITY module in simple mode. Overall, the application of the ensemble mode allowed us to validate five variants with predicted long‐range damaging effects (H72D, H72N, E73D, E73K, and R160L) and two variants with a damaging effect on stability (G183V and L215P). The predicted destabilizing variant L215P (Figure 4e) has also been identified in samples from The Cancer Genome Atlas (TCGA) (Kumar & Papaleo, 2020).
FIGURE 4.

MAVISp ensemble mode to identify damaging variants in the autophagy kinase ULK1. (a) We examined the central autophagy kinase ULK1 using MAVISp, generating a saturation of all possible variants within the N‐terminal (residues 7–279) and C‐terminal domains (residues 837–1046), leading to a total of 8962 variants. Of these, 215 variants have been identified in COSMIC, cBioPortal, and/or ClinVar databases. (b) Among the ones reported in the previous databases, eight variants were reported as pathogenic by AlphaMissense (L21P, L53P, G183V, E191G, G208S, V211F, L215P, and W894G) and among these, four variants are predicted to have a damaging effect on both protein stability and long‐range communication (L53P, G183V, E191G, and L215P). (c) Using MAVISp simple and ensemble modes, we identified 22 variants with destabilizing effects in terms of folding free energy, long‐range effects, or PTM effects in regulation or in function. The mutation sites are highlighted with spheres on the AlphaFold models of the ULK1 N‐terminal (left) and C‐terminal (right) domains. (d) We showed the predicted changes in folding free energy upon amino acid substitution for each of the 22 variants as calculated by the STABILITY module of MAVISp with MutateX and RosettaDDGPrediction with the simple mode (left) or with the ensemble mode (right). Interestingly, most variants that alter structural stability are located in the catalytic domain of the enzyme. This suggests potential mechanisms for ULK1 inactivation. (e) Summary of the predicted effects on the 22 variants of ULK1 that have been found damaging with at least one MAVISp module with the simple mode (upper) or with the ensemble mode (lower) using the dot plot representation provided by the MAVISp toolkit for downstream analyses. Of note, the lower legend refers to the color of variants on the X‐axis, which are related to the ClinVar effect category.
The MAVISp entry for the autophagy marker MAP1LC3B provides an example of how data for the LOCAL INTERACTION module can be obtained for a protein that interacts with a functional motif embedded in intrinsically disordered proteins, that is, a short linear motif (SLiM). MAP1LC3B, in fact, is able to bind to proteins harboring a so‐called LC3‐interacting region (LIR) (Rogov et al., 2023). In MAVISp, we report results on the effect of variants using three examples of this mode of interaction: the binding of MAP1LC3B to the LIR regions of its binding partners SQSTM1 (Figure 5a), ATG13, and Optineurin. In this case, we first applied the protocols for (phospho)‐SLiM identification developed within the MAVISp framework (Methods) and PDBminer to identify possible starting structures. In the case of Optineurin, we further modeled the flanking regions (Utichi et al., 2025). We identified 10 variants annotated in ClinVar: nine reported as VUS (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, and R11L) and one as benign, that is, E25Q (Figure 5a). MAVISp predicted a putative mechanistic explanation for the effect of four variants (Figure 5b–d): T29I is predicted to disrupt regulation by phosphorylation, L44P has an effect on both structural stability and long‐range communication to distal sites, whereas L44F and R11L have long‐range effects (Figure 5b). Additionally, a variant found in cancer studies, P32Q, is predicted to have a detrimental effect on structural stability, confirming previous experimental results that showed a propensity for aggregation (Fas et al., 2020). Of note, this variant is identified as having an uncertain effect on stability in MAVISp simple mode, whereas two different approaches for generating conformational ensembles that account for protein dynamics predicted a destabilizing effect (Figure 5d).
Additionally, all the variants with a mechanistic indicator from MAVISp are also predicted as pathogenic by AlphaMissense (Figure 5c,d) and are good candidates for further experimental studies of their effects on the autophagy flux or other functional readouts. V91I is likely to be a benign variant since all the predictors identified neutral effects (Figure 5c,d).
FIGURE 5.

Analysis of MAP1LC3B VUS Variants from ClinVar. (a) A structural model (PDB ID: 2ZJD) of the MAP1LC3B (green) interaction with the LIR motif of SQSTM1 (pink) highlights 10 ClinVar‐reported variants (E102K, H86D, T29I, V91I, P2R, L44P, L44F, D56G, R11L, and E25Q) along with the cancer‐related variant P32Q. These variants are depicted as blue spheres on the structure. (b) Among these variants, five (R11L, T29I, P32Q, L44F, and L44P) are predicted as damaging by AlphaMissense. Interestingly, L44P is predicted to have a damaging effect on both long‐range communication and stability. (c, d) Summary of the predicted effects on the 11 variants of MAP1LC3B as reported by MAVISp with the simple mode (c) or with the ensemble mode (d) using the dot plot representation provided by the MAVISp toolkit for downstream analyses. Of note, the lower legend refers to the colors of variants on the X‐axis, which correspond to the ClinVar effect category.
2.10. Application of MAVISp to transmembrane proteins and to variants associated with other diseases
The STABILITY and LOCAL INTERACTION modules do not support predictions for variants in transmembrane regions. A survey of methods to predict folding free energy changes induced by amino acid substitutions in transmembrane proteins suggested that existing protocols, based on FoldX or Rosetta, are suitable for soluble proteins (Geng et al., 2019). Therefore, the protocols implemented in the MAVISp modules for transmembrane proteins only retain those variants that are not in contact with the membrane. An example of a MAVISp entry for this class of proteins is PILRA, whose transmembrane region has a low pLDDT score and has therefore been excluded from the model; the analyses focus on the variants in the 32–153 region. In addition, we included other transmembrane proteins in the database, such as ATG9A and EGFR.
PILRA is a protein target connected to neurodegenerative diseases (Li et al., 2021), along with KIF5A, CFAP410, and CYP2R1, illustrating the broad applicability of MAVISp to proteins involved in different diseases. Proteins associated with other diseases, such as TTR, SOD1, and SMPD1, have also been included in the MAVISp database.
SMPD1 has been recently investigated in a targeted study using the ensemble mode of MAVISp together with other methodologies, validating our results by means of experimental data measuring the residual catalytic activity of enzyme variants (Scrima et al., 2024). As previously stated, MAVISp integrates curated experimental data for specific target proteins, which can be analyzed alongside the results from the computational modules. When a set of experimental data is available, it is possible to evaluate the correlation between predictions and experimental data. For SMPD1, we obtained literature data on the residual catalytic activity of the enzyme for 135 variants (Scrima et al., 2024). Using the MAVISp protocol, we predicted the changes in folding free energy upon amino acid substitution and collected the predicted functional effects from VEPs (Figure 6a), which can be compared with the experimental data (Figure 6b,c). First, we evaluated whether DeMaSk and GEMME capture overlapping or complementary information using correlation statistics (Figure 6a). We found that the two predictors gave highly correlated results, with a Pearson correlation coefficient of 0.8. The scores produced by the VEPs were moderately correlated with the residual activity measurements (Pearson correlation coefficient ~0.6, Figure 6b,c). Of note, most of the variants with predicted destabilizing effects are found at values of experimental residual activity lower than 20%, confirming what was observed in our previous study (Scrima et al., 2024). This suggests that changes in the stability of SMPD1 can help identify damaging variants of this enzyme. Nonetheless, in this case, the experimental readout cannot be explained solely by stability changes.
Thus, variants that have low residual activity, are predicted as functionally damaging (GEMME and DeMaSk scores lower than −3 and −0.25, respectively), and are neutral for stability according to MAVISp are good candidates for further investigation. For example, biomolecular simulations or computational chemistry methods could be used to investigate the effects of these variants on the catalytic mechanism of the enzyme and lipid transport. Finally, variants such as Y500H, which have a low residual activity, high loss‐of‐fitness scores, and are uncertain for the STABILITY module, can be analyzed for their propensity to fall in early folding regions (see entry in the MAVISp database) and could be investigated in the ensemble mode using enhanced sampling simulations to accurately estimate their folding free energy profiles.
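The prioritization described above can be sketched as a simple filter. This is a minimal illustration rather than part of the MAVISp code base: the record layout, the function name, and the 20% cutoff used to define "low residual activity" are assumptions, and the example records are invented. The GEMME and DeMaSk cutoffs follow the classification thresholds reported in Section 4.9.

```python
from dataclasses import dataclass

# VEP cutoffs as reported in the Methods; the activity cutoff is an assumption.
GEMME_DAMAGING = -3.0
DEMASK_LOSS_OF_FITNESS = -0.25
LOW_ACTIVITY_PERCENT = 20.0  # assumed definition of "low residual activity"


@dataclass
class Variant:
    name: str
    residual_activity: float  # percent of wild-type activity
    gemme: float
    demask: float
    stability: str  # "Destabilizing", "Neutral" or "Uncertain"


def candidates_for_follow_up(variants):
    """Names of variants with low activity, damaging VEP scores and neutral stability."""
    return [
        v.name
        for v in variants
        if v.residual_activity < LOW_ACTIVITY_PERCENT
        and v.gemme <= GEMME_DAMAGING
        and v.demask <= DEMASK_LOSS_OF_FITNESS
        and v.stability == "Neutral"
    ]


# Invented records for illustration only (not SMPD1 measurements).
example = [
    Variant("V1", 5.0, -4.2, -0.40, "Neutral"),
    Variant("V2", 8.0, -5.0, -0.55, "Destabilizing"),
    Variant("V3", 85.0, -1.0, -0.05, "Neutral"),
]
print(candidates_for_follow_up(example))
```

Variants that pass this filter lose activity without a predicted stability change, so they are the ones for which mechanisms other than folding stability should be explored.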
FIGURE 6.

MAVISp, GEMME, and DeMaSk predictions on the impact of a subset of SMPD1 variants. A subset of SMPD1 variants for which experimental data on enzyme activity are available is shown, along with predictions from the VEPs GEMME and DeMaSk. (a) Scatter plot comparing DeMaSk and GEMME scores. The red line represents the linear regression fit. (b, c) Scatter plots comparing DeMaSk (b) and GEMME (c) scores against experimental assay scores for enzymatic activity. The red line represents the regression, while the dotted line marks the threshold below which the enzyme is considered inactive. Dots are colored based on the MAVISp STABILITY module classification (Rosetta/FoldX consensus): Destabilizing, Neutral, or Uncertain.
3. CONCLUSIONS AND FUTURE PERSPECTIVE
MAVISp provides a multi‐layered assessment of the effects of variants found in cancer studies or other diseases using structural methods. MAVISp results are especially useful for variant interpretation and prioritization. These results can be useful as a complementary resource to available pathogenic scores or high‐throughput experiments. MAVISp can help pinpoint the effects of a pathogenic variant for further studies.
A significant advantage of MAVISp is its comprehensive coverage, which extends beyond clinically identified variants to include novel variants that have not yet been characterized. This makes MAVISp a valuable resource for researchers and clinicians, facilitating the exploration of novel variants and their underlying pathogenic mechanisms. MAVISp can help on one side by associating mechanistic indicators with known or predicted pathogenic variants and on the other by aiding in the characterization of the effects of VUS or variants with conflicting evidence at the molecular level. Finally, we envision that MAVISp could, over time, become a community‐driven effort and serve as a repository of functional data on the effects of disease‐related variants. We have, in a recent review article, framed MAVISp in the context of other computational frameworks that collect data from different sources or integrate different structure‐based methods to characterize variants (Arnaudi et al., 2025). This idea has led to different attempts. Missense3D (Khanna et al., 2021) predicts the impact of variants on an array of structural features; ADDRESS (Woodard et al., 2021) includes predictions on stability and intermolecular contacts for variants found in UniProt humsavar; MUTATIONEXPLORER (Philipp et al., 2024) uses Rosetta and RaSP to predict the effect of amino acid substitutions on stability or binding on user‐provided structures; VUStruct (Moth et al., 2024) selects relevant protein structures for the protein of interest and performs a wide array of predictions, including on the effect of variants on stability, binding surface, and PTMs. The Genomics 2 Proteins (Kwon et al., 2024) portal includes data from several sources, including some that overlap with MAVISp such as Phosphosite or MaveDB, as well as features calculated from protein structure. 
ProtVar (Stephenson et al., 2024) also aggregates variants from different sources and includes variant effect predictors, predictions of changes in stability upon amino acid substitution, and predictions of complex structures. MAVISp is, to our knowledge, the first resource to integrate data on binding free energies, long‐range effects, data derived from simulations, and experimental data from different sources into a centralized database. Protein stability, dynamics, and allosteric signaling are coupled aspects of the protein free‐energy landscape but remain mechanistically distinct (Escobedo et al., 2025; Faure et al., 2024; Weng et al., 2024). By integrating a wide range of mutational effects, MAVISp captures the multidimensional nature of mutations and provides a broad mechanistic context. This can guide the interpretation of mutational effects and help distinguish stability‐mediated changes from alterations driven by changes in binding or interactions, which is particularly relevant for its possible applications in precision medicine and the identification of potential therapeutic targets (Berezovsky & Nussinov, 2025; Tan et al., 2022). While MAVISp currently has lower coverage than other resources, it includes carefully curated manual steps, such as during protein structure preparation and simulation.
As the database grows, it will provide high quality data on various structural properties, which can also be used for benchmarking or as features in machine learning models. To this end, the data collection we designed and present here is pivotal for building meaningful and accurate predictive models.
We would like to highlight previous studies that have demonstrated the usefulness of MAVISp and its protocols. For example, we have showcased the versatility of MAVISp in characterizing the effects induced by a redox post‐translational modification of cysteine (S‐nitrosylation) using structural methods (Papaleo et al., 2023). We focused on variants found in cancer samples for their capability to alter the propensity of cysteine to be S‐nitrosylated, or a population‐shift mechanism induced by the PTM. The collection of data using MAVISp modules has been pivotal in aggregating variants for each target of interest in the study on S‐nitrosylation. The pipelines developed in the S‐nitrosylation study will be integrated into the MAVISp PTM module, extending it beyond its current support for phosphorylation.
Alterations in transcription factors are often linked to aberrant gene expression, including processes such as proliferation, cell death, and other cancer hallmarks (Hanahan, 2022). Different mechanisms underlie alterations in the activity of transcription factors in cancer, including point mutations. A previous study on TP53 served as a platform for developing the modules currently available in MAVISp (Degn et al., 2022). We thus aim to expand the MAVISp database to include more transcription factors. To this end, one of the datasets under collection covers protein targets from the TRRUST2 database (Han et al., 2018), which includes experimentally characterized transcription factors and their targets, of which 150 have already been processed and included in the MAVISp database.
Furthermore, MAVISp provides pre‐calculated values of changes in folding or binding free energies and other metrics that can also be reanalyzed in the context of other research projects. By providing examples of PTEN and SMPD1, we introduced the curation of experimental data in MAVISp as a source of experimental validation. The implementation of additional modules for MAVISp (e.g., degron [Larsen et al., 2025] and aggregation propensity [Zambrano et al., 2015]) would likely improve coverage of the diverse mechanisms regulating protein abundance. Of note, MAVISp supports either data from multiplex assays of variant effects or experimental data from literature mining of the biocurators. The purpose of collecting experimental data is to validate our findings, update protocols, and continuously improve the included methodologies. The reliability of our predictions depends on their alignment with experimental results, which can be used as reference data to benchmark and improve our predictions over time. The database currently includes 16 protein entries with experimental data.
At this stage, MAVISp can provide annotations for variants of transmembrane proteins exclusively in regions that are not in contact with the membrane. Recently published approaches (Tiemann et al., 2023) could enable the application of the STABILITY module to transmembrane regions as well. In addition, we will include support for intrinsically disordered regions in the ensemble mode, designing new modules that reflect their most important properties.
We foresee that MAVISp will provide a large amount of data on structure‐based properties related to the changes that variants can exert at the protein level. These data could be exploited for the design of experimental biological readouts, for machine‐learning applications in variant assessment and classification, or to understand the importance of specific variants in connection with clinical variables, such as drug resistance, risk of relapse, and more.
4. METHODS
4.1. Initial structures for MAVISp and STRUCTURE_SELECTION module
By default, in high‐throughput data collection, we use models from the AlphaFold2 database (Varadi et al., 2022) for most target proteins and trim them to remove regions with pLDDT scores <70 at the N‐ or C‐termini or very long disordered linkers between folded domains. For proteins coordinating cofactors, in low‐throughput targeted studies, we remodeled the relevant cofactors upon analyses with AlphaFill (Hekkelman et al., 2022) and, where needed, through MODELLER (Webb & Sali, 2016). A summary of the initial structures used for each protein included in the database is available in the metadata on the MAVISp webserver. In selected cases, we have replaced long disordered loops with short residue stretches using a custom pipeline based on MODELLER (https://github.com/ELELAB/MAVISp_loop_replacer). This was done to avoid potential bias in our structural calculations arising from the arbitrary conformation of such loops and their spurious contacts with the rest of the structure. In addition, for proteins with transmembrane regions, we used the PPM (Positioning of Proteins in Membrane) server 3.0 from OPM (Orientations of Proteins in Membrane) (Lomize et al., 2012; Lomize et al., 2022). For target proteins with more than 2700 residues, whose structures are not provided by the AlphaFold2 database, we used AlphaFold3 to produce the initial structure.
The advantage of using AlphaFold‐predicted structures in the default high‐throughput data collection of MAVISp lies in their ability to achieve quality comparable to experimental data, as demonstrated in previous work (Jumper et al., 2021), and at the same time circumventing limitations typically associated with experimental approaches, such as artifacts, missing atoms, and incomplete or absent residues.
The models used for the analyses undergo quality control using summary statistics from Procheck (Laskowski et al., 1993). The percentage of residues in core, allowed, and disallowed regions of the Ramachandran plot, together with the number of bad contacts (normalized per 100 residues), is extracted for each structure. Experimental structures are classified as high quality if they contain ≥75% residues in core regions, ≥95% in core and allowed regions, ≤0.5% in disallowed regions, and ≤2 bad contacts per 100 residues; relaxed thresholds define an intermediate category, while structures exceeding these limits are excluded. For AlphaFold and homology models, more permissive criteria are applied to account for the absence of experimental refinement and the known enrichment of geometric outliers in low‐confidence regions (≥70% core, ≥90% core + allowed, ≤5% disallowed, ≤5 bad contacts per 100 residues).
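As an illustration, the two sets of thresholds given above can be encoded as a small check. This is a sketch under stated assumptions: the function name and the input layout are hypothetical, the Procheck summary statistics are assumed to have been parsed already, and the relaxed thresholds of the intermediate category are not reproduced because their values are not listed here.

```python
# Each tuple: (min % core, min % core + allowed, max % disallowed,
#              max bad contacts per 100 residues), as quoted in the text.
EXPERIMENTAL_HIGH_QUALITY = (75.0, 95.0, 0.5, 2.0)
PREDICTED_MODEL = (70.0, 90.0, 5.0, 5.0)


def meets_criteria(core, allowed, disallowed, bad_contacts_per_100, criteria):
    """Return True if the Ramachandran and bad-contact statistics satisfy the criteria.

    core, allowed and disallowed are percentages of residues in the
    corresponding Ramachandran regions (hypothetical, pre-parsed inputs).
    """
    min_core, min_core_allowed, max_disallowed, max_bad = criteria
    return (
        core >= min_core
        and core + allowed >= min_core_allowed
        and disallowed <= max_disallowed
        and bad_contacts_per_100 <= max_bad
    )


# The same invented statistics fail the stricter experimental criteria
# but satisfy the more permissive ones used for predicted models.
stats = dict(core=72.0, allowed=20.0, disallowed=1.0, bad_contacts_per_100=3.0)
print(meets_criteria(**stats, criteria=EXPERIMENTAL_HIGH_QUALITY))
print(meets_criteria(**stats, criteria=PREDICTED_MODEL))
```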
4.2. INTERACTOME module
In the INTERACTOME module, implemented in the freely available PPI2PDB toolkit (https://github.com/ELELAB/PPI2PDB), we identify known interactors of the target protein by extracting data from the Mentha database (Calderone et al., 2013) and match them to available PDB structures using the mentha2pdb script. The script also examines the dimeric complexes of experimentally validated interactions from the HuRI and HuMAP databases that were modeled with AlphaFold2 by Burke et al. (2023). In addition, mentha2pdb provides annotations of the interactors and generates input files for AlphaFold‐Multimer.
Complementarily, we retrieve interactors from the STRING database (Szklarczyk et al., 2025) and process them analogously using STRING2PDB, which maps STRING interactions to available PDB structures. The tool restricts retrieval to the physical subnetwork of STRING, with interactions supported by either curated database annotation or experimental data.
As a final step, we aggregate all interaction data for the target protein into a single table, ranking interactors primarily by Mentha and secondarily by STRING score to prioritize experimentally supported pairs. We then add complexes retrieved directly from the PDB using pdbminer‐complexes (https://github.com/ELELAB/MAVISp_automatization/tree/main/mavisp_templates/) to capture interactions not yet reflected in PPI databases.
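The ranking step can be illustrated with a two-level sort. This is only a sketch: the field names are hypothetical, and both the assumption that higher scores rank first and the choice to place missing scores last are ours rather than documented behavior of the PPI2PDB toolkit.

```python
def rank_interactors(rows):
    """Sort interactors by Mentha score, then by STRING score (both descending).

    rows is a list of dicts with the hypothetical keys 'interactor',
    'mentha_score' and 'string_score'; a missing score is given as None
    and is ranked after any available score (an assumption).
    """
    def sort_key(row):
        mentha = row["mentha_score"] if row["mentha_score"] is not None else -1.0
        string = row["string_score"] if row["string_score"] is not None else -1.0
        # Negate the scores so that ascending sort yields descending scores;
        # the interactor name breaks ties deterministically.
        return (-mentha, -string, row["interactor"])

    return sorted(rows, key=sort_key)


# Invented scores for illustration.
table = [
    {"interactor": "A", "mentha_score": 0.9, "string_score": 0.5},
    {"interactor": "B", "mentha_score": 0.9, "string_score": 0.8},
    {"interactor": "C", "mentha_score": None, "string_score": 0.99},
    {"interactor": "D", "mentha_score": 0.4, "string_score": None},
]
print([row["interactor"] for row in rank_interactors(table)])
```

In this toy table, B precedes A because the STRING score breaks the tie in Mentha score, while C is ranked last despite its high STRING score because it has no Mentha score.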
We also use other methods to identify four different classes of short linear motifs (BRCT, LIR, BH3, and UIM) in our target proteins. Depending on the type, we use a combination of simple regular expression matching; SLiMfast, a method designed by us for structure‐based identification of short linear motifs (available at https://github.com/ELELAB/SLiMfast); phospho‐iLIR (https://github.com/ELELAB/phospho-iLIR), a method for predicting changes in secondary structure propensity that may be induced by phosphorylation in the core of putative LIR motifs; and DeepLoc 2.0 (Thumuluri et al., 2022) for predicting the subcellular localization of the protein, which is especially useful for BRCT motifs.
4.3. Free energy calculations for STABILITY, LOCAL INTERACTION and LONG‐RANGE modules
We used MutateX to calculate changes in folding free energy upon amino acid substitution with the BuildModel module of the FoldX5 suite (Delgado et al., 2019), averaging over five independent runs. For the folding free energy calculations with the Rosetta suite, we used the cartddg2020 protocol and the ref2015 energy function. In this protocol, only one structure is generated at the relax step and then optimized in Cartesian space. Five rounds of Cartesian space optimization provide five pairs of wild‐type and mutant structures for each variant. The change in folding free energy is then calculated as the difference between the free energies of the pair characterized by the lowest value of free energy for the mutant variant, as described in the original protocol (Frenz et al., 2020).
We used MutateX to calculate changes in binding free energy for the LOCAL INTERACTION module using the BuildModel and AnalyzeComplex functions of the FoldX5 suite and averaging over five runs. With Rosetta, we used the flexddg protocol as implemented in RosettaDDGPrediction and the talaris2014 energy function. We used 35,000 backrub trials and a threshold for the absolute score for minimization convergence of 1 Rosetta Energy Unit (REU). The protocol then generates an ensemble of 35 structures for each mutant variant and calculates the average changes in binding free energy. We used Rosetta version 2022.11 for both stability and binding calculations. In the applications with RosettaDDGPrediction, the REUs were converted to kcal/mol with available conversion factors (Frenz et al., 2020). We also applied RaSP using the same protocol provided in the original publication (Blaabjerg et al., 2023) and adjusting the code in a workflow according to MAVISp‐compatible formats (https://github.com/ELELAB/RaSP_workflow). As of 27/02/2026, we have included data on complexes for 91 proteins.
For the calculations of allosteric free energy, we used the structure‐based statistical mechanical model of allostery (SBSMMA) (Guarnera & Berezovsky, 2016; Tee et al., 2018) implemented in AlloSigMA2 (Tan et al., 2020). The model describes the mutated variants as “UP” or “DOWN” mutations depending on the difference in steric hindrance upon the substitution. We followed a recently updated and benchmarked protocol (Krzesińska et al., 2025). In brief, we classified as uncertain those variants for which the absolute changes in the volume of the side chain upon the amino acid substitution were lower than 5 Å³, as recently applied to p53 (Degn et al., 2022). As a default, we considered as having an effect only variants that were exposed to the solvent (≥25% relative solvent accessibility of the side chain), with associated changes in absolute value of allosteric free energy larger than 2 kcal/mol. We considered as remote response sites those that were at a distance greater than 5.5 Å from the mutation site, considering all heavy atoms, and that belong to pockets as identified by Fpocket (Le Guilloux et al., 2009) (see workflow at https://github.com/ELELAB/MAVISp_allosigma2_workflow/).
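The two ways of reducing replicate calculations to a single value can be sketched as follows. The function names and the input layout are hypothetical, the energies are invented, and the sign convention (mutant minus wild type) is our assumption of the usual ΔΔG definition; this sketch does not call FoldX, MutateX, or Rosetta.

```python
from statistics import mean


def foldx_ddg(run_values):
    """Average the folding ddG values (kcal/mol) from independent FoldX runs."""
    return mean(run_values)


def cartesian_ddg(pairs):
    """ddG from the pair whose mutant structure has the lowest energy.

    pairs holds one (wild_type_energy, mutant_energy) tuple per round of
    Cartesian-space optimization. The pair with the lowest mutant energy is
    selected and the difference mutant - wild type is returned (assumed
    sign convention).
    """
    wild_type, mutant = min(pairs, key=lambda pair: pair[1])
    return mutant - wild_type


# Invented energies for illustration (three rounds shown for brevity).
rounds = [(-100.0, -97.0), (-101.0, -99.5), (-100.5, -98.0)]
print(cartesian_ddg(rounds))  # pair 2 has the lowest mutant energy: -99.5 - (-101.0) = 1.5
print(foldx_ddg([1.0, 1.2, 0.8, 1.1, 0.9]))
```

Selecting a single pair instead of averaging means that the Rosetta estimate reflects the best-optimized mutant structure, whereas the FoldX estimate reflects the mean over replicates.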
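The filters described above can be combined into a simplified decision function. This is a sketch, not the MAVISp implementation: the function name, the input layout, and the three output labels are assumptions, and the actual workflow linked above may report more detailed classes.

```python
def classify_allosteric_mutation(volume_change, relative_sasa, response_sites):
    """Simplified classification of a mutation for long-range effects.

    volume_change: change in side-chain volume upon substitution (cubic angstrom).
    relative_sasa: relative solvent accessibility of the side chain (percent).
    response_sites: list of (delta_g, distance, in_pocket) tuples, where delta_g
        is the allosteric free energy change (kcal/mol), distance is the
        heavy-atom distance from the mutation site (angstrom), and in_pocket
        tells whether the site belongs to a pocket identified beforehand.
    """
    if abs(volume_change) < 5.0:
        return "uncertain"
    if relative_sasa < 25.0:
        return "no effect"
    remote_hits = [
        site for site in response_sites
        if abs(site[0]) > 2.0 and site[1] > 5.5 and site[2]
    ]
    return "effect" if remote_hits else "no effect"


# Invented values for illustration.
print(classify_allosteric_mutation(3.0, 50.0, [(3.0, 10.0, True)]))
print(classify_allosteric_mutation(20.0, 40.0, [(2.5, 12.0, True)]))
```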
4.4. EFOLDMINE module
The EFOLDMINE module, integrated within the simple mode of MAVISp, predicts residues with early folding propensity using the EfoldMine tool (Raimondi et al., 2017). Trained on residue‐level hydrogen/deuterium exchange nuclear magnetic resonance (HDX NMR) folding data from the Start2Fold database (Pancsa et al., 2016), this tool uses secondary structure propensity and backbone/side‐chain dynamics in a support‐vector machine algorithm to predict early folding regions based on the target's sequence.
In MAVISp, we incorporated EfoldMine to determine whether point mutations in variants fall within the predicted early folding regions, using a threshold of 0.169 to define residues involved in early folding events as suggested by the developers of the method (Raimondi et al., 2017) and considering only regions with a minimum length of three early folding residues to exclude isolated peaks (Krzesińska et al., 2025).
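The post-processing of the per-residue scores can be sketched as a run-length filter. The function names are hypothetical, the scores are invented, and treating a score equal to the threshold as early folding is our assumption; the sketch starts from precomputed scores and does not run EfoldMine itself.

```python
def early_folding_regions(scores, threshold=0.169, min_length=3):
    """Return (start, end) residue ranges (1-based, inclusive) of early folding regions.

    A region is a run of at least min_length consecutive residues whose
    score is at or above the threshold; shorter runs are discarded as
    isolated peaks.
    """
    regions = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position
        else:
            if start is not None and position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


def in_early_folding_region(position, regions):
    """True if the mutated position falls within any early folding region."""
    return any(start <= position <= end for start, end in regions)


# Invented per-residue scores for a ten-residue sequence.
scores = [0.1, 0.2, 0.3, 0.25, 0.05, 0.4, 0.1, 0.2, 0.2, 0.2]
regions = early_folding_regions(scores)
print(regions)  # [(2, 4), (8, 10)]
print(in_early_folding_region(6, regions))  # False: residue 6 is an isolated peak
```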
4.5. FUNCTIONAL SITE module
The FUNCTIONAL SITES module aids the identification of variants that might impact cofactor binding sites or active site residues, as well as the residues within the second coordination sphere of these sites. It is based on a contact analysis performed with the Arpeggio software (Jubb et al., 2017). Before the analysis, the model structure is subjected to energy minimization with Conjugate Gradients (Richard & August, 1994) in 50 steps, using the MMFF94 force field (Halgren, 1996), a van der Waals cutoff of 0.1, an interacting cutoff of 5.0 Å, and a physiological pH of 7.4. Subsequently, the output is further preprocessed to exclude clashes and proximal contacts (https://github.com/ELELAB/mavisp_accessory_tools).
4.6. Molecular dynamics simulations for MAVISp ensemble mode
We used either previously published (Degn et al., 2022; Fas et al., 2020; Invernizzi et al., 2014; Nygaard et al., 2016; Papaleo et al., 2014; Salamanca Viloria et al., 2017; Srinivasan et al., 2024) or newly collected one microsecond all‐atom molecular dynamics simulations performed using the CHARMM22* or CHARMM36m force fields (Piana et al., 2011). All the simulations have been carried out in the canonical ensemble after final equilibration steps and using explicit solvent and periodic boundary conditions. The template files used for the simulations are provided in OSF (https://osf.io/y3p2x/).
Ensembles generated using simulations are then subject to quality control, either using Mol_Analysis (Lambrughi et al., 2019) or MetaD_Analysis (https://github.com/ELELAB/MetaD-Analysis) tools.
As a first example of how we intend to use metadynamics data for the FUNCTIONAL_DYNAMICS module, we used the simulations from TP53 where the effects of amino acid substitutions on an interface for protein–protein interaction (residues 207–213) were investigated. We used a collective variable based on the distance between two residues (D208‐R156), which was effective in capturing open (active) and closed (inactive) conformations of the loop. See repositories associated with the enhanced sampling simulations of TP53 (Degn et al., 2022). All the newly generated trajectories will be deposited as different entries in OSF, and the link is reported in the metadata on the MAVISp webserver. As of 02/03/2026, we have included 96 protein targets in the ensemble mode, mostly using unbiased MD simulations of 500 ns or 1 μs as the source of the ensemble, as detailed in the corresponding metadata on the MAVISp webserver. In some cases, we included ensembles generated by a coarse‐grained model of flexibility or using the conformations provided by NMR structures from the PDB.
4.7. Protein Structure Networks and path analysis for MAVISp ensemble mode
In the ensemble mode, we apply a module that builds upon the simple mode LONG_RANGE module. It uses AlloSigma2‐PSN (https://github.com/ELELAB/MAVISp_allosigma2_workflow/), in which we construct an atomic‐contact PSN on the full trajectories using PyInteraph2 (Sora, Tiberti, et al., 2023). Pairs of residues were retained only if their sequence distance exceeded the Proxcut threshold of 1 and their contact distance was lower than 4.5 Å, based on the thresholds described in PyInteraph2 (Sora, Tiberti, et al., 2023). We retained edges with an occurrence greater than the Pcrit threshold of 50% across the ensemble frames, weighted on the interaction strength, with an Imin of 3.
Subsequently, we used the path_analysis function of PyInteraph2 to identify the shortest paths of communication between each pair of AlloSigMA2 (Tan et al., 2020) predicted mutations and respective response sites, using a minimum distance threshold of 5.5 Å and retained paths that were four residues or longer.
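The edge filtering and path search can be illustrated with a small, dependency-free sketch. It does not call PyInteraph2: the function names, the residue labels, and the occurrence values are invented, and an unweighted breadth-first search is used here as a stand-in for the shortest-path calculation of path_analysis.

```python
from collections import deque


def build_graph(edge_occurrence, p_crit=50.0):
    """Keep edges whose occurrence (percent of frames) exceeds p_crit."""
    graph = {}
    for (res_a, res_b), occurrence in edge_occurrence.items():
        if occurrence > p_crit:
            graph.setdefault(res_a, set()).add(res_b)
            graph.setdefault(res_b, set()).add(res_a)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the list of residues or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def retained_path(graph, mutation_site, response_site, min_residues=4):
    """Return the shortest path only if it spans at least min_residues residues."""
    path = shortest_path(graph, mutation_site, response_site)
    return path if path is not None and len(path) >= min_residues else None


# Invented edge occurrences; the direct A10-A77 edge is below the 50% cutoff.
occurrences = {
    ("A10", "A25"): 80.0, ("A25", "A40"): 65.0, ("A40", "A77"): 90.0,
    ("A10", "A77"): 30.0, ("A77", "A90"): 55.0,
}
graph = build_graph(occurrences)
print(retained_path(graph, "A10", "A77"))  # ['A10', 'A25', 'A40', 'A77']
print(retained_path(graph, "A10", "A25"))  # None: only two residues
```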
4.8. CABS‐flex ensembles for MAVISp ensemble mode
We used the coarse‐grained CABS‐flex 2.0 method and software (Kuriata et al., 2018) as a part of a Snakemake (Mölder et al., 2021) pipeline, available at https://github.com/ELELAB/MAVISp_CABSflex_pipeline. The pipeline includes the possibility to tune the calculations by different restraints, secondary structure definition, ligand binding, and more. It also contains a quality control step to evaluate the secondary structure content of the generated structures with respect to the starting one, using DSSP (Kabsch & Sander, 1983) and the SOV‐refine score (Liu & Wang, 2018).
4.9. Variant effect prediction
We used DeMaSk (Munro & Singh, 2021), GEMME (Laine et al., 2019), EVE (Frazer et al., 2021), REVEL (Ioannidis et al., 2016) and AlphaMissense (Cheng et al., 2023a) as predictors for the effect of any possible amino acid substitution to natural amino acids, on the full protein sequence of the main UniProt (Consortium TU, 2023) isoform of each protein. We used available default parameters for each method unless noted otherwise. We used the standalone version of DeMaSk as available on its public GitHub (commit ID 10fa198), with BLAST+ 2.13.0. We followed the protocol available on GitHub: we first generated the aligned homologs sequence file by using the demask.homologs module and then calculated fitness impact predictions. Finally, we classified as loss‐of‐fitness those variants having a DeMaSk delta fitness score lower than or equal to −0.25, as gain‐of‐fitness those with a score higher than 0.25, and as neutral otherwise (Supplementary Text S3). We used the available online webserver to obtain variant effect predictions with GEMME, upon setting the number of JET iterations to 5, to obtain more precise results (Pancsa et al., 2016). We have classified variants having a GEMME score ≤ −3 as damaging, and neutral otherwise. Thresholds were selected according to our benchmarking (Supplementary Text S3). To obtain EVE scores, we have used the scripts, protocol and parameters available on the EVE GitHub (commit ID 740b0a7) as part of a custom‐built Snakemake (Mölder et al., 2021)‐based pipeline, available at https://github.com/ELELAB/MAVISp_EVE_pipeline. Using EVE first requires building a protein‐specific Bayesian variational autoencoder model, which learns evolutionary constraints between residues from a multiple sequence alignment.
In the current MAVISp release, we generated such alignments using EVcouplings (Hopf et al., 2019), using the UniRef100 (Suzek et al., 2015) sequence database released on 01/03/2023, by keeping sequences with at least 50% of coverage with the target protein sequence, alignment positions with a minimum of 70% residue occupancy, and using a bit score threshold for inclusion of 0.5 bits with no further hyperparameter exploration. We then used our pipeline to perform model training and calculation of the evolutionary index, and used a global–local mixture of Gaussian Mixture Models to obtain a pathogenicity score and classification. We have used pre‐computed REVEL scores for variants as available in dbNSFP (Liu et al., 2011; Liu et al., 2020), accessed through MyVariant.info (Lelong et al., 2022; Xin et al., 2016), as implemented in Cancermuts. We have classified as damaging variants that have a REVEL score larger than or equal to 0.5 (Hopkins et al., 2023). We included AlphaMissense pathogenicity prediction scores and classification as available in the dataset of predictions for all possible amino acid substitutions in UniProt canonical isoforms, release version 2 (Cheng et al., 2023b).
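The score thresholds stated above for DeMaSk, GEMME, and REVEL translate into the following classification rules. The function names and output labels are hypothetical, and labeling REVEL scores below 0.5 as neutral is our assumption, since only the damaging class is defined in the text.

```python
def classify_demask(delta_fitness):
    """Loss-of-fitness at or below -0.25, gain-of-fitness above 0.25, else neutral."""
    if delta_fitness <= -0.25:
        return "loss_of_fitness"
    if delta_fitness > 0.25:
        return "gain_of_fitness"
    return "neutral"


def classify_gemme(score):
    """Damaging at or below -3, neutral otherwise."""
    return "damaging" if score <= -3.0 else "neutral"


def classify_revel(score):
    """Damaging at or above 0.5; the 'neutral' label below 0.5 is an assumption."""
    return "damaging" if score >= 0.5 else "neutral"


# Invented scores for a single variant.
print(classify_demask(-0.31), classify_gemme(-3.4), classify_revel(0.42))
```

Note the asymmetry in the DeMaSk rule: the loss-of-fitness boundary is inclusive, while the gain-of-fitness boundary is exclusive.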
4.10. Annotations from experimental data for EXPERIMENTAL_DATA module
We developed Python scripts to identify the overlap in coverage between the Mave database (MaveDB) (Esposito et al., 2019) and MAVISp, and to retrieve the score sets associated with the shared entries from the MaveDB (Esposito et al., 2019) database through their API (https://api.mavedb.org/docs). Where available, we also extracted information on methods and classification thresholds. For entries where this information was incomplete, the corresponding publications were manually reviewed to extract thresholds for variant classification.
The ProteinGym (Notin et al., 2023) repository was locally downloaded from GitHub, and a custom Python script was used to process the datasets based on the reference files provided in the repository. The datasets used for the analysis contained the experimental scores and the classification provided by the authors either based on the median of the score distributions or via manual annotation. The scores and their classifications were then compiled into the final database file generated by MAVISp through a module dedicated to the experimental data.
4.11. Identification of RefSeq identifiers
To ensure the correct RefSeq annotations in MAVISp, we implemented a Python tool, compare_seq.py (https://github.com/ELELAB/mavisp_accessory_tools/), to verify the sequence identity between the canonical UniProt sequence used in our analyses and the corresponding RefSeq protein identifier to be used for the ClinVar search. The UniProt sequences were retrieved using the UniProt REST API, while the RefSeq protein sequences were fetched from the NCBI Entrez Protein database. We implemented a global pairwise alignment using the Biopython pairwise2 module with the globalxx scheme to assess sequence identity. Each comparison was classified as an exact match, a mismatch (identity <100%), or unresolved due to missing or unresolvable sequences. To improve performance, the analyses were parallelized using the Python concurrent.futures module. The results were logged into structured CSV reports for consultation. This allows data managers to identify existing entries in MAVISp with RefSeq identifiers inconsistent with the provided UniProt accession code and assign them to biocurators for entry review.
Additionally, we provide the biocurators with a Python‐based script (uniprot2refseq, https://github.com/ELELAB/mavisp_accessory_tools/) that identifies RefSeq IDs for the UniProt canonical protein isoform. For each UniProt AC, we queried the UniProt REST API to obtain RefSeq protein cross‐references (NP_* IDs) from the canonical entry in JSON format. Only protein‐level RefSeq entries were considered. The canonical UniProt protein sequence was downloaded in FASTA format, and each RefSeq sequence was retrieved from the NCBI Protein database using Biopython and the Entrez API. Pairwise global alignments were performed using the Biopython pairwise2 module and we estimate the percentage sequence identity as the number of identical residues over the length of the longer sequence. Results were saved in tabular format, including UniProt AC, RefSeq ID, and sequence identity. This approach helps biocurators to identify the RefSeq IDs for the protein canonical isoform before data collection. The script is expected to be used by the biocurators before each run with the MAVISp automatization workflow described below.
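The identity calculation can be illustrated without external dependencies. The globalxx scheme scores identical residues as 1 and applies no mismatch or gap penalties, so its optimal score equals the length of the longest common subsequence of the two sequences; the sketch below computes that quantity directly as a stand-in for the Biopython call. The function names and the example sequences are hypothetical, and the sequences are assumed to have been retrieved already.

```python
def identical_residues(seq_a, seq_b):
    """Number of identical aligned residues in an optimal gap-penalty-free
    global alignment, computed as the longest common subsequence length."""
    previous = [0] * (len(seq_b) + 1)
    for res_a in seq_a:
        current = [0]
        for j, res_b in enumerate(seq_b, start=1):
            if res_a == res_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def percent_identity(seq_a, seq_b):
    """Identical residues over the length of the longer sequence, in percent."""
    longer = max(len(seq_a), len(seq_b))
    if longer == 0:
        raise ValueError("both sequences are empty")
    return 100.0 * identical_residues(seq_a, seq_b) / longer


def compare(uniprot_seq, refseq_seq):
    """Classify a UniProt/RefSeq pair as exact match, mismatch or unresolved."""
    if not uniprot_seq or not refseq_seq:
        return "unresolved"
    return "exact match" if percent_identity(uniprot_seq, refseq_seq) == 100.0 else "mismatch"


# Invented sequences: the second one lacks a single residue.
print(percent_identity("MKTAYIAK", "MKTAYAK"))  # 7 identical residues / 8 = 87.5
print(compare("MKTAYIAK", "MKTAYIAK"))
```

Normalizing by the longer sequence means that a RefSeq entry that is a truncated or extended form of the UniProt sequence is reported with an identity below 100%, which is the desired behavior when checking isoforms.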
4.12. Workflows for automatization and data collection within MAVISp
We provide and maintain two Snakemake workflows for the data collection of the default modules of MAVISp. The first is a Snakemake pipeline to automate MutateX runs as much as possible. It is designed to automatically download the chosen structure(s) from the AlphaFold structural database, or a custom structure input file, when necessary, trim them as requested, and generate desired MutateX folding free energy scans with a predictable directory structure. It only requires as input a csv file with metadata on the desired scan and a configuration file with details on the run to be performed. It is available at https://github.com/ELELAB/mutatex_pipelines/tree/main/custom_collect_scan.
Once such a scan is available, it is possible to use a second Snakemake pipeline, called MAVISp_automatization, which performs most of the steps that are necessary to annotate a protein for a MAVISp simple mode entry. Similarly to the previous pipeline, it only requires metadata on the target protein to be analyzed, as well as a MutateX mutational scan. It generates a dataset that can then be imported into the MAVISp database, except for predictions performed using Rosetta‐based methods, since these are much more computationally expensive and need to be performed separately using the RosettaDDGPrediction pipeline (Sora, Laspiur, et al., 2023). Using a Snakemake pipeline allows for improved efficiency and scalability, allowing the use of a multi‐core system to process several proteins or perform different analyses in parallel. It is available at https://github.com/ELELAB/MAVISp_automatization.
4.13. Criteria of inclusion for tools within MAVISp
The selection of tools to be included within MAVISp follows specific technical requirements related to open access licenses, scalability and reproducibility. Tools must provide programmatic access (e.g., API) or a standalone version to enable large‐scale automated data collection. This requirement currently limits the inclusion of some valuable web‐server–only resources in future updates for MAVISp.
AUTHOR CONTRIBUTIONS
Conceptualization: Elena Papaleo. Data curation: Matteo Arnaudi, Mattia Utichi, Kristine Degn, Laura Bauer, Simone Scrima, Karolina Krzesińska, Pablo Sánchez‐Izquierdo Besora, Katrine Meldgård, Ludovica Beltrame, Terézia Dorčaková, Anna Melidi, Lorenzo Favaro, Eleni Kiachaki, Anu Oswal, Alberte Heering Estad, Joachim Breitenstein, Jordan Safer, Francesca Maselli, Burcu Aykac Fas, Guglielmo Tedeschi, Philipp Becker, Jérémy Vinhas, Matteo Lambrughi, Matteo Tiberti, Elena Papaleo. Formal analysis: Matteo Arnaudi, Laura Bauer, Katrine Meldgård, Mattia Utichi, Matteo Lambrughi, Elena Papaleo. Funding acquisition: Elena Papaleo, Matteo Arnaudi, Claudia Cava, Sumaiya Iqbal, Mef Nilbert, Anna Rohlin. Methodology: Matteo Arnaudi, Mattia Utichi, Laura Bauer, Kristine Degn, Simone Scrima, Karolina Krzesińska, Katrine Meldgård, Ludovica Beltrame, Eleni Kiachaki, Joachim Breitenstein, Alberto Pettenella, Jérémy Vinhas, Matteo Lambrughi, Matteo Tiberti, Elena Papaleo. Project administration: Elena Papaleo. Resources: Peter Wad Sackett, Elena Papaleo, Sumaiya Iqbal, Claudia Cava. Software: Matteo Arnaudi, Kristine Degn, Karolina Krzesińska, Simone Scrima, Ludovica Beltrame, Alberto Pettenella, Matteo Lambrughi, Matteo Tiberti, Peter Wad Sackett. Supervision: Elena Papaleo, Matteo Lambrughi, Matteo Tiberti, Simone Scrima, Mattia Utichi, Matteo Arnaudi, Laura Bauer. Validation: All the coauthors. Visualization: Matteo Arnaudi, Mattia Utichi, Kristine Degn, Laura Bauer, Matteo Tiberti. Writing – original draft: Elena Papaleo, Matteo Tiberti. Writing – review and editing: All the coauthors.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
Supporting information
Figure S1.1: Distribution of the different types of amino acid substitutions. From is the type of the wild‐type amino acid. To is the type of the mutated amino acid.
Figure S1.2: Scatter plot of the free energy changes (kcal/mol) obtained with RaSP and Rosetta. Each point corresponds to an amino acid substitution. Blue dots mark substitutions that receive the same MAVISp classification from both methods; red dots mark substitutions that are classified differently. For clarity, the MAVISp decision boundaries are drawn as dotted lines.
Figure S1.3: Boxplot of the distribution of Rosetta and RaSP predictions illustrating both the high values predicted by Rosetta and the overall lower scores assigned by RaSP.
Figure S1.4: Number of amino acid substitutions in each class for the RaSP vs. Rosetta protocols.
Figure S1.5: Confusion matrix between the RaSP and Rosetta protocols.
Figure S1.6: Accuracy of amino acid substitutions. Amino acid substitutions are categorized into From and To. From is the wild‐type amino acid type, and To is the mutated amino acid type.
Figure S1.7: Density plot of SASA.
Figure S3.1: Workflow for the annotations of variant effects in the PTM REGULATION module.
Figure S3.2: Workflow for the classification of mutations in the PTM STABILITY module.
Figure S3.3: Workflow for the classification of mutations in the PTM FUNCTION module.
Figure S4.1: Distribution of DeMaSk scores used in the analyses. The figure consists of a bar plot, illustrating the distribution of DeMaSk scores used in the study, and the density of data points calculated via kernel density estimation. In particular, the graph shows the frequency of DeMaSk scores predicted for mutations that had a ClinVar review status of 3 or 4.
Figure S4.2: DeMaSk ROC and Sensitivity‐Specificity Curves. (A) The panel illustrates the ROC curve generated for DeMaSk using variants from the MAVISp database with ClinVar review status 3 or 4. (B) The panel shows the sensitivity (red), specificity (blue) and the D function (green), representing the classification performance of DeMaSk. D is the Euclidean distance between the ROC curve and the (0,1) point in the plot, which would represent optimal specificity and sensitivity. Lower D values indicate better performance. A vertical line marks the threshold with the lowest D.
Figure S4.3: Distribution plot, ROC and Sensitivity‐Specificity Curves for GEMME. (A) The panel illustrates the distribution of the GEMME scores. (B) The panel shows the ROC curve generated for GEMME using variants from the MAVISp database with ClinVar review status 3 or 4. (C) The panel shows the sensitivity (red), specificity (blue) and the D function (green), representing the classification performance of GEMME. D is the Euclidean distance between the ROC curve and the (0,1) point in the plot, which would represent optimal specificity and sensitivity. Lower D values indicate better performance. A vertical line marks the threshold with the lowest D.
Table S1:
Table S2:
Table S3:
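The D function described in the captions of Figures S4.2 and S4.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the figures: the function name `min_distance_threshold`, the toy scores and labels, and the convention that scores at or above the threshold are called positive are all hypothetical. The optimal corner is taken to be (0, 1) in ROC space (false‐positive rate 0, sensitivity 1), so that D = sqrt((1 − sensitivity)² + (1 − specificity)²).

```python
import math


def min_distance_threshold(scores, labels):
    """Return (threshold, D, sensitivity, specificity) for the threshold
    whose ROC point lies closest to the (0, 1) corner.

    Assumptions (not taken from the paper): labels are 1 (positive) or
    0 (negative), and a variant is called positive when score >= threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    best = None
    # Every distinct observed score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens = tp / n_pos
        spec = tn / n_neg
        # The ROC point is (1 - spec, sens); D is its distance to (0, 1).
        d = math.hypot(1.0 - spec, 1.0 - sens)
        if best is None or d < best[1]:
            best = (t, d, sens, spec)
    return best


if __name__ == "__main__":
    # Toy data for illustration only (hypothetical, not MAVISp values).
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    toy_labels = [0, 0, 1, 1]
    print(min_distance_threshold(toy_scores, toy_labels))
```

For a predictor in which lower scores indicate a more damaging substitution, the scores can be negated before calling the function so that the same comparison applies.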
ACKNOWLEDGMENTS
Our research has been supported by the Carlsberg Foundation (CF18‐0314), Danmarks Grundforskningsfond (DNRF125), Hartmanns Fond (R241‐A33877), LEO Foundation (LF17006), and NovoNordisk Fonden (NNF20OC0065262) to the E.P. group. Part of the calculations have been supported by a EuroHPC (EHPC‐BEN‐2023B02‐010) and a EuroHPC (EHPC‐REG‐2023R01‐051) on Discoverer. The work is also supported by a PhD Fellowship from the Danish Data Science Academy (DDSA) to Matteo Arnaudi. Sumaiya Iqbal and Jordan Safer would like to thank the Merkin Institute of Transformative Technologies in Healthcare. Anna Rohlin is supported by The Assar Gabrielsson's Foundation and The Healthcare Board, Region Västra Götaland. Pablo Sánchez‐Izquierdo Besora is supported by a KBVU Pre‐Graduate Fellowship (R361‐A21156). This project is also based in part upon work from COST Action ML4NGP, CA21160, supported by COST (European Cooperation in Science and Technology), which supported a short‐term visit for G.T. in the E.P. group. Kristine Degn, Matteo Tiberti and Elena Papaleo would like to thank Kresten Lindorff‐Larsen, Lasse M. Blaabjerg and Matteo Cagiada for their input and feedback on porting the RaSP code into our workflow. We would also like to thank the group of Stefano Vanni for providing published MD trajectories for analysis within the ensemble mode of MAVISp. We would like to thank Jeppe Samuelsen, Kledi Salla, Amalie Drud Nielsen, Julie Bruun Brockhoff, Laura Mattioli, Zeming Hou, Subayan Akhuli, Ona Saulianskaite, Subhayan Akhuli, Martina Bellini, Eirini Giannakopoulou, Beatrice Drago, Eszter Toldi, Laura Kappel, Alessia Campo, Angeliki Vliora, Vinit Nilesh Vasa, Edene Levine, Konstantina Gkopi, and Alicia Llorente for preliminary work on targets reported in the database.
DATA AVAILABILITY STATEMENT
The data can be consulted through our web server (https://services.healthtech.dtu.dk/services/MAVISp-1.0/) or downloaded as individual CSV files from the OSF repository (https://osf.io/ufpzm/). Other raw data and utilities can be found at the MAVISp extended data OSF repository (https://osf.io/y3p2x/). Reports for several proteins are available at https://elelab.gitbook.io/mavisp/.
REFERENCES
- Abildgaard AB, Stein A, Nielsen SV, Schultz‐Knudsen K, Papaleo E, Shrikhande A, et al. Computational and cellular studies reveal structural destabilization and degradation of MLH1 variants in Lynch syndrome. Elife. 2019;8:e49138. Available from: https://elifesciences.org/articles/49138
- Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, et al. GROMACS: high performance molecular simulations through multi‐level parallelism from laptops to supercomputers. SoftwareX. 2015;1–2:19–25. Available from: https://linkinghub.elsevier.com/retrieve/pii/S2352711015000059
- Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. Available from: https://www.nature.com/articles/s41586-024-07487-w
- Arnaudi M, Utichi M, Tiberti M, Papaleo E. Predicting the structure‐altering mechanisms of disease variants. Curr Opin Struct Biol. 2025;91:102994. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0959440X25000120
- Banerjee A, Bogetti AT, Bahar I. Accurate identification and mechanistic evaluation of pathogenic missense variants with Rhapsody‐2. Proc Natl Acad Sci U S A. 2025;122:e2418100122. 10.1073/pnas.2418100122
- Berezovsky IN, Nussinov R. Allostery in disease: from mutations, mechanisms, and signalling partners to diagnostic and drug therapies. J Mol Biol. 2025;437:169407. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0022283625004735
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. Available from: https://pubmed.ncbi.nlm.nih.gov/10592235/
- Blaabjerg LM, Jonsson N, Boomsma W, Stein A, Lindorff‐Larsen K. SSEmb: a joint embedding of protein sequence and structure enables robust variant effect predictions. Nat Commun. 2024;15(1):1–9. Available from: https://www.nature.com/articles/s41467-024-53982-z
- Blaabjerg LM, Kassem MM, Good LL, Jonsson N, Cagiada M, Johansson KE, et al. Rapid protein stability prediction using deep learning representations. Elife. 2023;12:e82593. Available from: https://elifesciences.org/articles/82593
- Bonomi M, Bussi G, Camilloni C, Tribello GA, Banáš P, Barducci A, et al. Promoting transparency and reproducibility in enhanced molecular simulations. Nat Methods. 2019;16:670–673. Available from: https://www.nature.com/articles/s41592-019-0506-8
- Brandes N, Goldman G, Wang CH, Ye CJ, Ntranos V. Genome‐wide prediction of disease variant effects with a deep protein language model. Nat Genet. 2023;55:1512–1522. Available from: https://www.nature.com/articles/s41588-023-01465-0
- Burke DF, Bryant P, Barrio‐Hernandez I, Memon D, Pozzati G, Shenoy A, et al. Towards a structurally resolved human protein interaction network. Nat Struct Mol Biol. 2023;30:216–225. Available from: https://www.nature.com/articles/s41594-022-00910-8
- Burke W, Parens E, Chung WK, Berger SM, Appelbaum PS. The challenge of genetic variants of uncertain clinical significance. Ann Intern Med. 2022;175:994–1000. 10.7326/M21-4109
- Cagiada M, Johansson KE, Valanciute A, Nielsen SV, Hartmann‐Petersen R, Yang JJ, et al. Understanding the origins of loss of protein function by analyzing the effects of thousands of variants on activity and abundance. Mol Biol Evol. 2021;38:3235–3246. Available from: https://academic.oup.com/mbe/article/38/8/3235/6199445
- Calderone A, Castagnoli L, Cesareni G. Mentha: a resource for browsing integrated protein‐interaction networks. Nat Methods. 2013;10:690–691. Available from: https://www.nature.com/articles/nmeth.2561
- Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012;2:401–404. Available from: https://aacrjournals.org/cancerdiscovery/article/2/5/401/3246/The-cBio-Cancer-Genomics-Portal-An-Open-Platform
- Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Accurate proteome‐wide missense variant effect prediction with AlphaMissense. Science. 2023a;381:eadg7492. Available from: http://www.ncbi.nlm.nih.gov/pubmed/37733863
- Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Predictions for AlphaMissense. 2023b. Available from: https://zenodo.org/record/8360242
- Consortium TU. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531. 10.1093/nar/gkac1052
- Degn K, Beltrame L, Dahl Hede F, Sora V, Nicolaci V, Vabistsevits M, et al. Cancer‐related mutations with local or long‐range effects on an allosteric loop of p53. J Mol Biol. 2022;434:167663. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0022283622002558
- Degn K, Beltrame L, Tiberti M, Papaleo E. PDBminer to find and annotate protein structures for computational analysis. J Chem Inf Model. 2023;63:7274–7281. 10.1021/acs.jcim.3c00884
- Delgado J, Radusky LG, Cianferoni D, Serrano L. FoldX 5.0: working with RNA, small molecules and a new graphical interface. Bioinformatics. 2019;35:4168–4169. Available from: https://academic.oup.com/bioinformatics/article/35/20/4168/5381539
- Di Rita A, Peschiaroli A, D'Acunzo P, Strobbe D, Hu Z, Gruber J, et al. HUWE1 E3 ligase promotes PINK1/PARKIN‐independent mitophagy by regulating AMBRA1 activation via IKKα. Nat Commun. 2018;9:3755. Available from: https://www.nature.com/articles/s41467-018-05722-3
- Escobedo A, Voigt G, Faure AJ, Lehner B. Genetics, energetics, and allostery in proteins with randomized cores and surfaces. Science. 2025;389:eadq3948. 10.1126/science.adq3948
- Esposito D, Weile J, Shendure J, Starita LM, Papenfuss AT, Roth FP, et al. MaveDB: an open‐source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 2019;20:223. 10.1186/s13059-019-1845-6
- Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold‐Multimer. bioRxiv. 2021. 10.1101/2021.10.04.463034
- Fas BA, Maiani E, Sora V, Kumar M, Mashkoor M, Lambrughi M, et al. The conformational and mutational landscape of the ubiquitin‐like marker for autophagosome formation in cancer. Autophagy. 2020;17(10):2818–2841. 10.1080/15548627.2020.1847443
- Faure AJ, Martí‐Aranda A, Hidalgo‐Carcedo C, Beltran A, Schmiedel JM, Lehner B. The genetic architecture of protein stability. Nature. 2024;634:995–1003. Available from: https://www.nature.com/articles/s41586-024-07966-0
- Fowler DM, Adams DJ, Gloyn AL, Hahn WC, Marks DS, Muffley LA, et al. An atlas of variant effects to understand the genome at nucleotide resolution. Genome Biol. 2023;24:147. 10.1186/s13059-023-02986-x
- Fowler DM, Rehm HL. Will variants of uncertain significance still exist in 2030? Am J Hum Genet. 2024;111:5–10. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0002929723004007
- Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599:91–95. Available from: https://www.nature.com/articles/s41586-021-04043-8
- Frenz B, Lewis SM, King I, DiMaio F, Park H, Song Y. Prediction of protein mutational free energy: benchmark and sampling improvements increase classification accuracy. Front Bioeng Biotechnol. 2020;8:558247. Available from: https://pubmed.ncbi.nlm.nih.gov/33134287/
- Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6:pl1. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23550210
- Gelman H, Dines JN, Berg J, Berger AH, Brnich S, Hisama FM, et al. Recommendations for the collection and use of multiplexed functional data for clinical variant interpretation. Genome Med. 2019;11:1–11. 10.1186/s13073-019-0698-7
- Geng C, Xue LC, Roel‐Touris J, Bonvin AMJJ. Finding the ΔΔG spot: are predictors of binding affinity changes upon mutations in protein–protein interactions ready for it? WIREs Comput Mol Sci. 2019;9:e1410. 10.1002/wcms.1410
- Gerasimavicius L, Teichmann SA, Marsh JA. Leveraging protein structural information to improve variant effect prediction. Curr Opin Struct Biol. 2025;92:103023. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0959440X25000417
- Guarnera E, Berezovsky IN. Structure‐based statistical mechanical model accounts for the causality and energetics of allosteric communication. PLoS Comput Biol. 2016;12:e1004678. 10.1371/journal.pcbi.1004678
- Halgren TA. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J Comput Chem. 1996;17:490–519.
- Han H, Cho J‐W, Lee S, Yun A, Kim H, Bae D, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018;46:D380–D386. Available from: http://academic.oup.com/nar/article/46/D1/D380/4566018
- Hanahan D. Hallmarks of cancer: new dimensions. Cancer Discov. 2022;12:31–46. Available from: https://aacrjournals.org/cancerdiscovery/article/12/1/31/675608/Hallmarks-of-Cancer-New-DimensionsHallmarks-of
- Hayes T, Rao R, Akin H, Sofroniew NJ, Oktay D, Lin Z, et al. Simulating 500 million years of evolution with a language model. Science. 2025;387:850–858. 10.1126/science.ads0018
- Hekkelman ML, de Vries I, Joosten RP, Perrakis A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat Methods. 2022;20:205–213. Available from: https://www.nature.com/articles/s41592-022-01685-y
- Henriques J, Lindorff‐Larsen K. Protein dynamics enables phosphorylation of buried residues in Cdk2/cyclin‐A‐bound p27. Biophys J. 2020;119:2010–2018.
- Høie MH, Cagiada M, Beck Frederiksen AH, Stein A, Lindorff‐Larsen K. Predicting and interpreting large‐scale mutagenesis data using analyses of protein stability and conservation. Cell Rep. 2022;38:110207. Available from: http://www.cell.com/article/S2211124721017113/fulltext
- Holdgaard SG, Cianfanelli V, Pupo E, Lambrughi M, Lubas M, Nielsen JC, et al. Selective autophagy maintains centrosome integrity and accurate mitosis by turnover of centriolar satellites. Nat Commun. 2019;10:4176. Available from: https://www.nature.com/articles/s41467-019-12094-9
- Hollingsworth SA, Dror RO. Molecular dynamics simulation for all. Neuron. 2018;99:1129–1143. 10.1016/j.neuron.2018.08.011
- Hopf TA, Green AG, Schubert B, Mersmann S, Schärfe CPI, Ingraham JB, et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics. 2019;35:1582–1584. 10.1093/bioinformatics/bty862
- Hopkins JJ, Wakeling MN, Johnson MB, Flanagan SE, Laver TW. REVEL is better at predicting pathogenicity of loss‐of‐function than gain‐of‐function variants. Hum Mutat. 2023;2023:1–6. Available from: https://www.hindawi.com/journals/humu/2023/8857940/
- Huang H, Arighi CN, Ross KE, Ren J, Li G, Chen SC, et al. iPTMnet: an integrated resource for protein post‐translational modification network discovery. Nucleic Acids Res. 2018;46:D542–D550. 10.1093/nar/gkx1104
- Invernizzi G, Tiberti M, Lambrughi M, Lindorff‐Larsen K, Papaleo E. Communication routes in ARID domains between distal residues in helix 5 and the DNA‐binding loops. PLoS Comput Biol. 2014;10:e1003744. 10.1371/journal.pcbi.1003744
- Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet. 2016;99:877–885. Available from: http://www.cell.com/article/S0002929716303706/fulltext
- Jankauskaitė J, Jiménez‐García B, Dapkūnas J, Fernández‐Recio J, Moal IH. SKEMPI 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and thermodynamics upon mutation. Xenarios I, editor. Bioinformatics. 2019;35:462–469. Available from: https://academic.oup.com/bioinformatics/article/35/3/462/5055583
- Jepsen MM, Fowler DM, Hartmann‐Petersen R, Stein A, Lindorff‐Larsen K. Classifying disease‐associated variants using measures of protein activity and stability. Protein homeostasis diseases. New York (NY): Academic Press; 2020. p. 91–107. 10.1016/B978-0-12-819132-3.00005-1
- Jubb HC, Higueruelo AP, Ochoa‐Montaño B, Pitt WR, Ascher DB, Blundell TL. Arpeggio: a web server for calculating and visualising interatomic interactions in protein structures. J Mol Biol. 2017;429:365–371. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0022283616305332
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. Available from: https://pubmed.ncbi.nlm.nih.gov/34265844/
- Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers. 1983;22:2577–2637. 10.1002/bip.360221211
- Khanna T, Hanna G, Sternberg MJE, David A. Missense3D‐DB web catalogue: an atom‐based analysis and repository of 4M human protein‐coding genetic variants. Hum Genet. 2021;140:805–812. 10.1007/s00439-020-02246-z
- Kønig SM, Rissler V, Terkelsen T, Lambrughi M, Papaleo E. Alterations of the interactome of Bcl‐2 proteins in breast cancer at the transcriptional, mutational and structural level. PLoS Comput Biol. 2019;15:e1007485. 10.1371/journal.pcbi.1007485
- Krzesińska K, Degn K, Llorente A, Giannakopoulou E, Tiberti M, Papaleo E. Deciphering long‐range effects of mutations: an integrated approach using elastic network models and protein structure networks. J Mol Biol. 2025;437:169359. Available from: https://www.sciencedirect.com/science/article/pii/S0022283625004255?via%3Dihub
- Kumar M, Papaleo E. A pan‐cancer assessment of alterations of the kinase domain of ULK1, an upstream regulator of autophagy. Sci Rep. 2020;10:14874.
- Kuriata A, Gierut AM, Oleniecki T, Ciemny MP, Kolinski A, Kurcinski M, et al. CABS‐flex 2.0: a web server for fast simulations of flexibility of protein structures. Nucleic Acids Res. 2018;46:W338–W343. Available from: https://academic.oup.com/nar/article/46/W1/W338/4995689
- Kwon S, Safer J, Nguyen DT, Hoksza D, May P, Arbesfeld JA, et al. Genomics 2 proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures. Nat Methods. 2024;21:1947–1957. Available from: https://www.nature.com/articles/s41592-024-02409-0
- Laine E, Karami Y, Carbone A. GEMME: a simple and fast global epistatic model predicting mutational effects. Mol Biol Evol. 2019;36:2604–2619. Available from: https://academic.oup.com/mbe/article/36/11/2604/5548199
- Lambrughi M, Tiberti M, Allega MF, Sora V, Nygaard M, Toth A, et al. Analyzing biomolecular ensembles. Methods in molecular biology. Volume 2022. New York (NY): Humana; 2019. p. 415–451. 10.1007/978-1-4939-9608-7_18
- Landrum MJ, Chitipiralla S, Brown GR, Chen C, Gu B, Hart J, et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 2020;48:D835–D844. Available from: https://pubmed.ncbi.nlm.nih.gov/31777943/
- Larsen FB, Voutsinos V, Jonsson N, Johansson KE, Ethelberg FD, Lindorff‐Larsen K, et al. Comprehensive degron mapping in human transcription factors. bioRxiv. 2025;2025.05.16.654404. 10.1101/2025.05.16.654404
- Laskowski RA, MacArthur MW, Moss DS, Thornton JM. PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Cryst. 1993;26:283–291. Available from: https://journals.iucr.org/paper?S0021889892009944
- Le Guilloux V, Schmidtke P, Tuffery P. Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics. 2009;10:1–11. 10.1186/1471-2105-10-168
- Lelong S, Zhou X, Afrasiabi C, Qian Z, Cano MA, Tsueng G, et al. BioThings SDK: a toolkit for building high‐performance data APIs in biomedical research. Bioinformatics. 2022;38:2077.
- Levy ED, Kowarzyk J, Michnick SW. High‐resolution mapping of protein concentration reveals principles of proteome architecture and adaptation. Cell Rep. 2014;7:1333–1340. Available from: https://linkinghub.elsevier.com/retrieve/pii/S2211124714002964
- Li Y, Laws SM, Miles LA, Wiley JS, Huang X, Masters CL, et al. Genomics of Alzheimer's disease implicates the innate and adaptive immune systems. Cell Mol Life Sci. 2021;78:7397–7426. 10.1007/s00018-021-03986-5
- Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary‐scale prediction of atomic‐level protein structure with a language model. Science. 2023;379:1123–1130. 10.1126/science.ade2574
- Liu T, Wang Z. SOV‐refine: a further refined definition of segment overlap score and its significance for protein structure similarity. Source Code Biol Med. 2018;13:1–10. 10.1186/s13029-018-0068-7
- Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat. 2011;32:894–899. 10.1002/humu.21517
- Liu X, Li C, Mou C, Dong Y, Tu Y. dbNSFP v4: a comprehensive database of transcript‐specific functional predictions and annotations for human nonsynonymous and splice‐site SNVs. Genome Med. 2020;12:1–8. 10.1186/s13073-020-00803-9
- Livesey BJ, Marsh JA. Variant effect predictor correlation with functional assays is reflective of clinical classification performance. Genome Biol. 2025;26:1–27. 10.1186/s13059-025-03575-w
- Lomize AL, Todd SC, Pogozheva ID. Spatial arrangement of proteins in planar and curved membranes by PPM 3.0. Protein Sci. 2022;31:209–220. Available from: http://www.ncbi.nlm.nih.gov/pubmed/34716622
- Lomize MA, Pogozheva ID, Joo H, Mosberg HI, Lomize AL. OPM database and PPM web server: resources for positioning of proteins in membranes. Nucleic Acids Res. 2012;40:D370–D376. 10.1093/nar/gkr703
- Manthei KA, Keck JL. The BLM dissolvasome in DNA replication and repair. Cell Mol Life Sci. 2013;70(21):4067–4084. 10.1007/s00018-013-1325-1
- Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat Genet. 2018;50:874–882. Available from: https://www.nature.com/articles/s41588-018-0122-z
- McEwen AE, Tejura M, Fayer S, Starita LM, Fowler DM. Multiplexed assays of variant effect for clinical variant interpretation. Nat Rev Genet. 2025;27(2):137–154. Available from: https://pubmed.ncbi.nlm.nih.gov/40691352/
- McKinnon KM. Flow cytometry: an overview. Curr Protoc Immunol. 2018;120:5.1.1–5.1.11. 10.1002/cpim.40
- Miao Y, Du Q, Zhang H, Yuan Y, Zuo Y, Zheng H. Cycloheximide (CHX) chase assay to examine protein half‐life. Bio Protoc. 2023;13(11):e4690. Available from: https://bio-protocol.org/e4690
- Mighell TL, Thacker S, Fombonne E, Eng C, O'Roak BJ. An integrated deep‐mutational‐scanning approach provides clinical insights on PTEN genotype–phenotype relationships. Am J Hum Genet. 2020;106:818–829. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0002929720301233
- Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins‐Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. Available from: https://f1000research.com/articles/10-33/v1
- Moth CW, Sheehan JH, Al Mamun A, Sivley RM, Gulsevin A, Rinker D, et al. VUStruct: a compute pipeline for high throughput and personalized structural biology. bioRxiv. 2024. 10.1101/2024.08.06.606224
- Munro D, Singh M. DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction. Bioinformatics. 2021;36:5322–5329. Available from: https://academic.oup.com/bioinformatics/article/36/22-23/5322/6039113
- Nielsen SV, Stein A, Dinitzen AB, Papaleo E, Tatham MH, Poulsen EG, et al. Predicting the impact of Lynch syndrome‐causing missense mutations from structural calculations. PLoS Genet. 2017;13:e1006739. 10.1371/journal.pgen.1006739
- Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, et al. ProteinGym: large‐scale benchmarks for protein design and fitness prediction. Adv Neural Inf Process Syst. 2023;36:64331–64379. 10.1101/2023.12.07.570727
- Nussinov R, Tsai C‐J. Allostery in disease and in drug discovery. Cell. 2013;153:293–305. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23582321
- Nygaard M, Terkelsen T, Vidas Olsen A, Sora V, Salamanca Viloria J, Rizza F, et al. The mutational landscape of the oncogenic MZF1 SCAN domain in cancer. Front Mol Biosci. 2016;3:78. 10.3389/fmolb.2016.00078
- Orioli S, Henning Hansen CG, Lindorff‐Larsen K. Transient exposure of a buried phosphorylation site in an autoinhibited protein. Biophys J. 2022;121:91–101. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0006349521038832
- Pancsa R, Varadi M, Tompa P, Vranken WF. Start2Fold: a database of hydrogen/deuterium exchange data on protein folding and stability. Nucleic Acids Res. 2016;44:D429–D434. 10.1093/nar/gkv1185
- Papaleo E, Sutto L, Gervasio FL, Lindorff‐Larsen K. Conformational changes and free energies in a proline isomerase. J Chem Theory Comput. 2014;10:4169–4174. 10.1021/ct500536r
- Papaleo E, Tiberti M, Arnaudi M, Pecorari C, Faienza F, Cantwell L, et al. TRAP1 S‐nitrosylation as a model of population‐shift mechanism to study the effects of nitric oxide on redox‐sensitive oncoproteins. Cell Death Dis. 2023;14:284. Available from: https://www.nature.com/articles/s41419-023-05780-6
- Peccati F, Alunno‐Rufini S, Jiménez‐Osés G. Accurate prediction of enzyme thermostabilization with Rosetta using AlphaFold ensembles. J Chem Inf Model. 2023;63:898–909. 10.1021/acs.jcim.2c01083
- Pejaver V, Byrne AB, Feng BJ, Pagel KA, Mooney SD, Karchin R, et al. Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria. Am J Hum Genet. 2022;109:2163–2177. Available from: https://pubmed.ncbi.nlm.nih.gov/36413997/
- Philipp M, Moth CW, Ristic N, Tiemann JKS, Seufert F, Panfilova A, et al. MutationExplorer: a webserver for mutation of proteins and 3D visualization of energetic impacts. Nucleic Acids Res. 2024;52:W132–W139. 10.1093/nar/gkae301
- Piana S, Lindorff‐Larsen K, Shaw DE. How robust are protein folding simulations with respect to force field parameterization? Biophys J. 2011;100:L47–L49. Available from: https://pubmed.ncbi.nlm.nih.gov/21539772/
- Post KL, Belmadani M, Ganguly P, Meili F, Dingwall R, McDiarmid TA, et al. Multi‐model functionalization of disease‐associated PTEN missense mutations identifies multiple molecular mechanisms underlying protein dysfunction. Nat Commun. 2020;11:2073. Available from: https://www.nature.com/articles/s41467-020-15943-0
- Potel CM, Kurzawa N, Becher I, Typas A, Mateus A, Savitski MM. Impact of phosphorylation on thermal stability of proteins. Nat Methods. 2021;18:757–759. Available from: https://www.nature.com/articles/s41592-021-01177-5
- Raimondi D, Orlando G, Pancsa R, Khan T, Vranken WF. Exploring the sequence‐based prediction of folding initiation sites in proteins. Sci Rep. 2017;7:8826. Available from: https://www.nature.com/articles/s41598-017-08366-3
- Rastogi R, Chung R, Li S, Li C, Lee K, Woo J, et al. Critical assessment of missense variant effect predictors on disease‐relevant variant data. Hum Genet. 2025;144:281–293. 10.1007/s00439-025-02732-2
- Repana D, Nulsen J, Dressler L, Bortolomeazzi M, Venkata SK, Tourna A, et al. The network of cancer genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol. 2019;20:1. 10.1186/s13059-018-1612-0
- Richard J, August S. An introduction to the conjugate gradient method without the agonizing pain. Pittsburgh: School of Computer Science; Carnegie Mellon University; 1994.
- Riesselman AJ, Ingraham JB, Marks DS. Deep generative models of genetic variation capture the effects of mutations. Nat Methods. 2018;15:816–822. Available from: https://www.nature.com/articles/s41592-018-0138-4
- Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118:e2016239118. 10.1073/pnas.2016239118
- Rogov VV, Nezis IP, Tsapras P, Zhang H, Dagdas Y, Noda NN, et al. Atg8 family proteins, LIR/AIM motifs and other interaction modes. Autophagy Rep. 2023;2(1):2188523. 10.1080/27694127.2023.2188523
- Rubin AF, Stone J, Bianchi AH, Capodanno BJ, Da EY, Dias M, et al. MaveDB 2024: a curated community database with over seven million variant effects from multiplexed functional assays. Genome Biol. 2025;26:13. 10.1186/s13059-025-03476-y
- Sahu S, Galloux M, Southon E, Caylor D, Sullivan T, Arnaudi M, et al. AVENGERS: analysis of variant effects using next generation sequencing to enhance BRCA2 stratification. BioRxiv. 2023. 10.1101/2023.12.14.571713
- Salamanca Viloria J, Allega MF, Lambrughi M, Papaleo E. An optimal distance cutoff for contact‐based Protein Structure Networks using side‐chain centers of mass. Sci Rep. 2017;7:2838. Available from: https://www.nature.com/articles/s41598-017-01498-6
- Sampson JM, Cannon DA, Duan J, Epstein JCK, Sergeeva AP, Katsamba PS, et al. Robust prediction of relative binding energies for protein–protein complex mutations using free energy perturbation calculations. J Mol Biol. 2024;436:168640. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0022283624002353
- Sapozhnikov Y, Patel JS, Ytreberg FM, Miller CR. Statistical modeling to quantify the uncertainty of FoldX‐predicted protein folding and binding stability. BMC Bioinformatics. 2023;24:426. 10.1186/s12859-023-05537-0
- Sargsyan K, Lim C. Using protein language models for protein interaction hot spot prediction with limited data. BMC Bioinformatics. 2024;25:115. 10.1186/s12859-024-05737-2
- Scrima S, Lambrughi M, Tiberti M, Fadda E, Papaleo E. ASM variants in the spotlight: a structure‐based atlas for unraveling pathogenic mechanisms in lysosomal acid sphingomyelinase. Biochim Biophys Acta. 2024;1870:167260. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0925443924002497
- Sora V, Laspiur AO, Degn K, Arnaudi M, Utichi M, Beltrame L, et al. RosettaDDGPrediction for high‐throughput mutational scans: from stability to binding. Protein Sci. 2023;32:e4527. 10.1002/pro.4527
- Sora V, Tiberti M, Beltrame L, Dogan D, Robbani SM, Rubin J, et al. PyInteraph2 and PyInKnife2 to analyze networks in protein structural ensembles. J Chem Inf Model. 2023;63:4237–4245. 10.1021/acs.jcim.3c00574
- Srinivasan S, Di Luca A, Álvarez D, John Peter AT, Gehin C, Lone MA, et al. The conformational plasticity of structurally unrelated lipid transport proteins correlates with their mode of action. PLoS Biol. 2024;22:e3002737. 10.1371/journal.pbio.3002737
- Stenson PD, Mort M, Ball EV, Chapman M, Evans K, Azevedo L, et al. The human gene mutation database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum Genet. 2020;139:1197–1207. Available from: https://pubmed.ncbi.nlm.nih.gov/32596782/
- Stephenson JD, Totoo P, Burke DF, Jänes J, Beltrao P, Martin MJ. ProtVar: mapping and contextualizing human missense variation. Nucleic Acids Res. 2024;52:W140–W147. 10.1093/nar/gkae413
- Sun Y, Shen Y. Structure‐informed protein language models are robust predictors for variant effects. Hum Genet. 2024;144:209–225. 10.1007/s00439-024-02695-w
- Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. 10.1093/bioinformatics/btu739
- Suzuki H, Osawa T, Fujioka Y, Noda NN. Structural biology of the core autophagy machinery. Curr Opin Struct Biol. 2017;43:10–17. 10.1016/j.sbi.2016.09.010
- Szklarczyk D, Nastou K, Koutrouli M, Kirsch R, Mehryary F, Hachilif R, et al. The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res. 2025;53:D730–D737. 10.1093/nar/gkae1113
- Tan ZW, Guarnera E, Tee W‐V, Berezovsky IN. AlloSigMA 2: paving the way to designing allosteric effectors and to exploring allosteric effects of mutations. Nucleic Acids Res. 2020;48:W116–W124. Available from: https://academic.oup.com/nar/article/48/W1/W116/5835812
- Tan ZW, Tee W‐V, Samsudin F, Guarnera E, Bond PJ, Berezovsky IN. Allosteric perspective on the mutability and druggability of the SARS‐CoV‐2 spike protein. Structure. 2022;30:590–607.e4. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0969212621004639
- Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2019;47:D941–D947. Available from: https://academic.oup.com/nar/article/47/D1/D941/5146192
- Tee W‐V, Guarnera E, Berezovsky IN. On the allosteric effect of nsSNPs and the emerging importance of allosteric polymorphism. J Mol Biol. 2019;431:3933–3942. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0022283619304413
- Tee W‐V, Guarnera E, Berezovsky IN. Reversing allosteric communication: from detecting allosteric sites to inducing and tuning targeted allosteric response. PLoS Comput Biol. 2018;14:e1006228. 10.1371/journal.pcbi.1006228
- Tekpinar M, David L, Henry T, Carbone A. PRESCOTT: a population aware, epistatic, and structural model accurately predicts missense effects. Genome Biol. 2025;26:1–42. 10.1186/s13059-025-03581-y
- Thumuluri V, Almagro Armenteros JJ, Johansen AR, Nielsen H, Winther O. DeepLoc 2.0: multi‐label subcellular localization prediction using protein language models. Nucleic Acids Res. 2022;50:W228–W234. Available from: https://academic.oup.com/nar/article/50/W1/W228/6576357
- Tiberti M, Di Leo L, Vistesen MV, Kuhre RS, Cecconi F, De Zio D, et al. The Cancermuts software package for the prioritization of missense cancer variants: a case study of AMBRA1 in melanoma. Cell Death Dis. 2022;13:872. Available from: https://www.nature.com/articles/s41419-022-05318-2
- Tiberti M, Terkelsen T, Degn K, Beltrame L, Cremers TC, da Piede I, et al. MutateX: an automated pipeline for in silico saturation mutagenesis of protein structures and structural ensembles. Brief Bioinform. 2022;23:bbac074. Available from: https://pubmed.ncbi.nlm.nih.gov/35323860/
- Tiemann JKS, Zschach H, Lindorff‐Larsen K, Stein A. Interpreting the molecular mechanisms of disease variants in human transmembrane proteins. Biophys J. 2023;122(11):2176–2191. 10.1016/j.bpj.2022.12.031
- Tribello GA, Bonomi M, Branduardi D, Camilloni C, Bussi G. PLUMED 2: new feathers for an old bird. Comput Phys Commun. 2013;185:604–613. Available from: http://arxiv.org/abs/1310.0980
- Tsaban T, Varga JK, Avraham O, Ben‐Aharon Z, Khramushin A, Schueler‐Furman O. Harnessing protein folding neural networks for peptide–protein docking. Nat Commun. 2022;13:176. Available from: https://pubmed.ncbi.nlm.nih.gov/35013344/
- Tsuboyama K, Dauparas J, Chen J, Laine E, Mohseni Behbahani Y, Weinstein JJ, et al. Mega‐scale experimental analysis of protein folding stability in biology and design. Nature. 2023;620:434–444. Available from: https://www.nature.com/articles/s41586-023-06328-6
- Utichi M, Antonescu ON, Sora V, Marjault HB, Tiberti M, Maiani E, et al. Decoding phospho‐regulation and flanking regions in autophagy‐associated short linear motifs. Commun Biol. 2025;8:1255. 10.1038/s42003-025-08399-9
- Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein‐sequence space with high‐accuracy models. Nucleic Acids Res. 2022;50:D439–D444. Available from: https://academic.oup.com/nar/article/50/D1/D439/6430488
- Webb B, Sali A. Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics. 2016;54:5.6.1–5.6.37. 10.1002/cpbi.3
- Weile J, Roth FP. Multiplexed assays of variant effects contribute to a growing genotype–phenotype atlas. Hum Genet. 2018;137:665–678. 10.1007/s00439-018-1916-x
- Weng C, Faure AJ, Escobedo A, Lehner B. The energetic and allosteric landscape for KRAS inhibition. Nature. 2024;626:643–652. Available from: https://www.nature.com/articles/s41586-023-06954-0
- Woodard J, Zhang C, Zhang Y. ADDRESS: a database of disease‐associated human variants incorporating protein structure and folding stabilities. J Mol Biol. 2021;433:166840. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0022283621000346
- Xin J, Mark A, Afrasiabi C, Tsueng G, Juchler M, Gopal N, et al. High‐performance web services for querying gene and variant annotation. Genome Biol. 2016;17:91. 10.1186/s13059-016-0953-9
- Xuan J, Yu Y, Qing T, Guo L, Shi L. Next‐generation sequencing in the clinic: promises and challenges. Cancer Lett. 2013;340:284–295. 10.1016/j.canlet.2012.11.025
- Yen H‐CS, Xu Q, Chou DM, Zhao Z, Elledge SJ. Global protein stability profiling in mammalian cells. Science. 2008;322:918–923. 10.1126/science.1160489
- Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015;43:W306–W313. 10.1093/nar/gkv359
- Zhu W, Shenoy A, Kundrotas P, Elofsson A. Evaluation of AlphaFold‐Multimer prediction on multi‐chain protein complexes. Bioinformatics. 2023;39(7):btad424. 10.1093/bioinformatics/btad424
Supplementary Materials
Figure S1.1: Distribution of the different types of amino acid substitutions. From is the type of the wild‐type amino acid. To is the type of the mutated amino acid.
Figure S1.2: Scatter plot of the free energy changes (kcal/mol) predicted by RaSP and Rosetta. Each point corresponds to an amino acid substitution. Blue dots mark substitutions that receive the same MAVISp classification from both methods; red dots mark substitutions that are classified differently. The MAVISp decision boundaries are drawn as dotted lines.
Figure S1.3: Boxplot of the distribution of Rosetta and RaSP predictions illustrating both the high values predicted by Rosetta and the overall lower scores assigned by RaSP.
Figure S1.4: Number of amino acid substitutions in each class for RaSP vs. Rosetta protocol.
Figure S1.5: Confusion matrix between the RaSP and Rosetta protocol.
Figure S1.6: Accuracy of amino acid substitutions. Amino acid substitutions are categorized into From and To. From is the wild‐type amino acid type, and To is the mutated amino acid type.
Figure S1.7: Density plot of the solvent accessible surface area (SASA).
Figure S3.1: Workflow for the annotations of variant effects in the PTM REGULATION module.
Figure S3.2: Workflow for the classification of mutations in the PTM STABILITY module.
Figure S3.3: Workflow for the classification of mutations in the PTM FUNCTION module.
Figure S4.1: Distribution of DeMaSk scores used in the analyses. The figure consists of a bar plot, illustrating the distribution of DeMaSk scores used in the study, and the density of data points calculated via kernel density estimation. In particular, the graph shows the frequency of DeMaSk scores predicted for mutations that had a ClinVar review status of 3 or 4.
Figure S4.2: DeMaSk ROC and Sensitivity‐Specificity Curves. (A) The panel illustrates the ROC curve generated for DeMaSk using variants from the MAVISp database with ClinVar review status 3 or 4. (B) The panel shows the sensitivity (red), specificity (blue) and the D function (green), representing the classification performance of DeMaSk. D is the Euclidean distance between the ROC curve and the (0,1) point in the plot, which would represent optimal specificity and sensitivity. Lower D values indicate better performance. A vertical line marks the threshold with the lowest D.
Figure S4.3: Distribution plot, ROC and Sensitivity‐Specificity Curves for GEMME. (A) The panel illustrates the distribution of the GEMME scores. (B) The panel shows the ROC curve generated for GEMME using variants from the MAVISp database with ClinVar review status 3 or 4. (C) The panel shows the sensitivity (red), specificity (blue) and the D function (green), representing the classification performance of GEMME. D is the Euclidean distance between the ROC curve and the (0,1) point in the plot, which would represent optimal specificity and sensitivity. Lower D values indicate better performance. A vertical line marks the threshold with the lowest D.
Table S1:
Table S2:
Table S3:
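The D function described in the captions of Figures S4.2 and S4.3 can be illustrated with a minimal sketch. It is not the code used to produce those figures. It assumes that D is computed at each candidate threshold as the Euclidean distance between the ROC point (1 − specificity, sensitivity) and the corner with perfect sensitivity and specificity, and that a variant is called damaging when its score is at or above the threshold; the actual direction depends on the predictor. The function names and the toy scores and labels are hypothetical.

```python
import math


def roc_distance_curve(scores, labels):
    """Return (threshold, sensitivity, specificity, D) for every candidate threshold.

    Assumptions (not taken from the article): labels are 1 for pathogenic and
    0 for benign, and a variant is predicted pathogenic when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    rows = []
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        # Distance from the ROC point (1 - specificity, sensitivity) to the
        # ideal corner, where both sensitivity and specificity equal 1.
        d = math.hypot(1.0 - specificity, 1.0 - sensitivity)
        rows.append((threshold, sensitivity, specificity, d))
    return rows


def best_threshold(scores, labels):
    """Pick the candidate threshold with the lowest D."""
    return min(roc_distance_curve(scores, labels), key=lambda row: row[3])


# Hypothetical toy data: eight variants with predictor scores and known labels.
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]

threshold, sensitivity, specificity, d = best_threshold(toy_scores, toy_labels)
print(f"threshold={threshold} sensitivity={sensitivity} specificity={specificity} D={d}")
```

On this toy set the lowest D falls at a threshold of 0.4, where sensitivity is 1.0 and specificity is 0.75, which corresponds to the vertical line drawn in the sensitivity‐specificity panels.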
Data Availability Statement
The data can be consulted through our web server (https://services.healthtech.dtu.dk/services/MAVISp-1.0/) or downloaded as individual CSV files from the OSF repository (https://osf.io/ufpzm/). Other raw data and utilities can be found at the MAVISp extended data OSF repository (https://osf.io/y3p2x/). Reports for several proteins are available at https://elelab.gitbook.io/mavisp/.
