Abstract
3DLigandSite is a web tool for the prediction of ligand-binding sites in proteins. Here, we report a significant update since the first release of 3DLigandSite in 2010. The overall methodology remains the same, with candidate binding sites in proteins inferred using known binding sites in related protein structures as templates. However, the initial structural modelling step now uses the newly available structures from the AlphaFold database or alternatively Phyre2 when AlphaFold structures are not available. Further, a sequence-based search using HHSearch has been introduced to identify template structures with bound ligands that are used to infer the ligand-binding residues in the query protein. Finally, we introduced a machine learning element as the final prediction step, which improves the accuracy of predictions and provides a confidence score for each residue predicted to be part of a binding site. Validation of 3DLigandSite on a set of 6416 binding sites obtained 92% recall at 75% precision for non-metal binding sites and 52% recall at 75% precision for metal binding sites. 3DLigandSite is available at https://www.wass-michaelislab.org/3dligandsite. Users submit either a protein sequence or structure. Results are displayed in multiple formats including an interactive Mol* molecular visualization of the protein and the predicted binding sites.
Graphical Abstract
INTRODUCTION
Elucidation of protein function remains a difficult and important task, with many millions of proteins present in UniProt (1) and only a small fraction of them functionally annotated (2,3), making automated sequence annotation tools essential. Small molecules that bind to proteins are intimately related to protein function; they can be substrates or products of an enzyme reaction, cofactors (4) that play an essential role in catalysis or have important structural or regulatory roles (5).
Methods for predicting ligand-binding sites [reviewed in (6)] use a range of different approaches, including sequence conservation (7), structural approaches such as identifying pockets on the protein surface, the combined analysis of sequence and structural information (8), and machine and deep learning (9–16). 3DLigandSite and methods such as firestar (17), FINDSITE (18,19), COACH-D (20) and FunFOLD2 (21) utilize knowledge of existing binding sites in solved protein structures present in the Protein Data Bank (PDB) (22). 3DLigandSite, FINDSITE, FunFOLD2 and COACH-D combine the modelling of protein structure with the identification of homologous proteins in the PDB that have ligands bound to them. These binding sites are then used to infer binding sites in the query protein. By contrast, firestar uses FireDB (23), a database of ligand-binding residues extracted from protein structures in the PDB and also catalytic residues extracted from the Catalytic Site Atlas (24).
Here, we present the first major update to the 3DLigandSite web server. 3DLigandSite was first developed in 2010 (25) to automate an approach that was successfully used in the ligand-binding site experiment in the eighth round of the critical assessment of protein structure prediction (CASP) community experiment (26,27). Over the last 12 years, 3DLigandSite has become widely used, attracting an average of 125 000 submissions per year, for a diverse range of purposes, including genome annotation (28,29), antiviral screening (30), the analysis of single-nucleotide variants (SNVs) associated with disease (31–35), the development of fluorescent sensors (36) and most recently for analysis of SARS coronavirus-2 proteins (37–39). Over the last 3 years, 3DLigandSite binding site predictions have been incorporated into the Protein Data Bank in Europe [PDBe (22)] Knowledgebase (40,41), making binding site predictions for protein structures in the PDBe widely available.
The basic 3DLigandSite algorithm remains the same in the new version, but it now makes use of the latest sequence searching methods and the highly accurate protein 3D structural models available from the AlphaFold Protein Structure Database [AlphaFold DB (42,43)]. 3DLigandSite now also incorporates machine learning as the final step in the prediction process, which improves prediction accuracy and associates a confidence score with each individual residue predicted to be part of a binding site. This is combined with a new web server that offers improved functionality for users to investigate the predicted binding sites.
THE 3DLigandSite METHOD
A summary of the 3DLigandSite methodology is outlined in Figure 1. Users submit either a protein sequence in FASTA format or a protein structure in PDB format. Where a sequence is submitted, the PDB is first searched for an existing structure with identical sequence that can be used. Where a match is not found, AlphaFold DB (42) is searched for an existing structural model. Finally, where a model is not available, Phyre2 (44) is used to perform template-based modelling.
The next step focuses on the identification of ligand-bound structures that are homologous to the query protein. Originally, 3DLigandSite used MAMMOTH (45) to perform a structural search of the query structure against a structural library of proteins from the PDB, which was a time-consuming step, typically taking between 40 and 80 min. This has been replaced by a sequence-based search using HHSearch (46) to screen a sequence library of ligand-bound proteins from the PDB (detailed later), which only takes a few minutes to run. All sequence matches with an HHSearch probability score >75% are retained, and their protein structures are aligned to the query structure using TM-align (47). The user can reduce the HHSearch probability cut-off if they would like to use less confident matches to the query sequence.
Where matches to the library of ligand-bound proteins are not identified by the sequence-based search, a structural search is performed using TM-align (47) that retains alignments with a TM-score of 0.6 or greater. The structural search is also available as an advanced option that users can choose to perform at the time of submission.
The ligands present in the library structures are superimposed onto the query structure by aligning the library structures with the query structure. These ligands are then clustered. Originally, 3DLigandSite used single linkage clustering to cluster ligands, which could result in very large clusters. To avoid this, 3DLigandSite now generates clusters such that 50% of each ligand must overlap with at least one of the other ligands in the same cluster. This change also required that metal and non-metal ligands are separately clustered given that metal ligands are single ions, while non-metal ligands are larger molecules (e.g. ATP, NAD). Individual predictions of metal and non-metal binding sites are also made for these separate clusters.
The final step of the prediction process is to determine the residues in the protein that are predicted to form the binding site associated with each cluster of ligands. Each cluster may contain multiple different ligands or many instances of the same (or similar) ligands in different poses. 3DLigandSite originally predicted any residue within 0.8 Å of at least 25% of the ligands in a cluster to be part of the binding site. This has been replaced by the introduction of a logistic regression classifier (detailed below) to perform this final prediction step. This also associates a confidence score (range 0–1) with each residue in the predicted binding site.
Generation of the library of ligand-bound protein structures
To generate the library of biologically relevant protein binding sites, protein structures were extracted from the PDB and filtered to retain only those containing ligands classed as cognate by FireDB (23). The library focuses on monomeric proteins. Where binding sites were located in the interface between two proteins, the multimer was split into monomeric structures and the ligands associated with both of the monomers. The protein structures were clustered, and the ligands from proteins in each cluster mapped onto a representative structure to reduce search time. The amino acid sequences of the retained structures were clustered using CD-HIT (48) using an 80% sequence identity threshold. The protein models in each cluster were then aligned to the cluster representative (obtained from CD-HIT) using TM-align (47), and the ligands were superimposed onto the representative structure and retained. An HHSearch (49) sequence database was built from the representative sequences for searching user-submitted protein sequences against.
Calculating residue conservation
To calculate residue conservation, HHblits (50) was used to search the query sequence against the UniClust30 database (51). The multiple sequence alignment was then used to calculate the Jensen–Shannon divergence (52) conservation score.
Machine learning-based prediction of binding site residues
The machine learning step was introduced to predict accurately which residues are most likely to be part of the binding site around a cluster of ligands. An equal number of binding and non-binding residues on the query protein were used for training and testing. For each of these residues, a set of features was extracted and converted to a 0–1 range (Supplementary Table S1). Several features were considered for best determining binding propensity. The features included distance measurements to the ligand cluster, residue conservation and amino acid properties such as charge, hydrophobicity and van der Waals volume (Supplementary Table S1). Solvent accessibility scores were obtained from ProAct2 (53). Distance-based features were calculated, including the minimum, maximum and average distances of each residue to ligands in the cluster, and the percentage of ligands in the cluster within 0.8 Å + van der Waals radii of the amino acid.
Univariate feature selection was used to identify ligand contacts. The three distance features and residue conservation were the most informative features for predicting ligand binding, as well as the negative charge residue feature for metal binding sites. A single distance metric was selected to avoid overtraining on a similar feature, resulting in the ligand contacts, minimum ligand distance, negatively charged and residue conservation as the selected features.
The scikit-learn Python package was used to train support vector machines (54), Extra-Trees, logistic regression and random forest classifiers. The data were then fitted with optimum parameters from 100 random iterations and three cross-validation steps using GridSearchCV within scikit-learn. A randomly generated 80:20 train–test split was used to fit the models.
The training and test sets comprise monomers with cognate ligands bound. These structures were identified by filtering the PDB, clustering their sequences using MMseqs2 (55) at a maximum sequence identity of 40%. This resulted in 5223 metal and 4995 non-metal binding sites. A subset of 1600 metal and 1573 non-metal binding sites was randomly selected for testing and training. The remaining binding sites were used as a validation set to evaluate performance on the trained classifiers (that had not been used in training) (Supplementary Figure S1). The PDB identifiers and chains of all sequences used are provided in Supplementary Table S2.
Binding residues were classed as all residues within van der Waals radii + 0.8 Å of the ligand present in the protein structure, with all other residues classed as non-binding. This resulted in 1976 and 6950 metal and non-metal binding residues, respectively, and an equal number of randomly selected non-binding residues were also randomly extracted (Supplementary Table S3), providing the positive and negative examples required for training the machine learning classifiers.
EVALUATING 3DLigandSite PERFORMANCE
The performance of 3DLigandSite was assessed using the validation set (see the ‘The 3DLigandSite Method’ section), which contained 59 203 and 16 166 (Supplementary Table S3) non-metal and metal binding residues, respectively, that had not been used in the testing or training of the classifiers. Performance was assessed using multiple measures of precision, sensitivity (recall) and the receiver operator characteristic (ROC).
The logistic regression classifier performed best on the non-metal binding sites, with an area under the receiver operating characteristic curve (AUROC) of 0.99, though a similar performance was observed for Extra-Trees and random forest classifiers (Figure 2A and Supplementary Table S4). For metal ligands, the logistic regression classifier performed best with an AUROC of 0.99 (Figure 2C and Supplementary Table S4). As the data set has a skewed distribution with many more negative examples than positive examples (i.e. many non-binding residues compared to those that are binding residues in each protein; Supplementary Table S3), precision–recall metrics provide a better indication of performance (56,57). 3DLigandSite obtained 92% recall at 75% precision for non-metal binding sites (Figure 2B) and 52% recall at 75% precision for metal binding sites (Figure 2D). We compared the performance of the new version of 3DLigandSite with the original version (25). The original 3DLigandSite did not make separate predictions for metal and non-metal ligands, so we assessed performance on the combined metal and non-metal binding sites. On the validation set, the original 3DLigandSite obtained recall of 56% at 59% precision.
The performance of 3DLigandSite was also evaluated on the 70 targets used for assessment of binding site prediction in CASP8 (26), CASP9 (58) and CASP10 (59). Using the sequence-based homology search 3DLigandSite obtained recall of 85% at 65% precision, and a Matthews’ correlation coefficient [MCC (60)] of 0.73. Performance using the structural search option was comparable, with slightly lower recall of 80% at 67% precision and an MCC of 0.72. Structural search results at a range of TM-score thresholds for inclusion of template structures are shown in Supplementary Table S5. On this data set, the sequence-based search was not inferior to the structure-based search, although recent studies have suggested that structural searches are better at identifying related protein structures (61,62). Given, the extra time taken to perform the structural search (∼4 h per submission), the sequence search is recommended and is the default for the web server.
THE 3DLigandSite WEB SERVER
The 3DLigandSite web server is available at https://www.wass-michaelislab.org/3dligandsite. The web server is free to all without a login requirement. Users can select to submit either a protein sequence (in FASTA format) or a protein structure (in PDB format). Where a sequence is submitted, the first step of the prediction process is to obtain a model of the protein structure. To do this, the PDB is first searched for a matching structure, followed by AlphaFold DB (42). Where a suitable model is not available in either database, Phyre2 (44) is used to generate a template-based model of the structure. The runtime for submissions that require Phyre2 is longer as modelling the protein structure is time-consuming, typically taking a few hours to complete. Where users submit a protein structure, the runtime is typically <5 min using default settings. Users who provide an email address receive an email upon submission and once their results are ready for viewing. The web server includes a help section that includes recordings that work users through both the submission process and interpretation of data in the Results pages.
Results’ output
3DLigandSite Results pages are split into three main sections. Results are initially presented as a sequence view (Figure 3), which shows the amino acid sequence of the submitted protein, residue conservation and a row for each cluster of ligands that has been identified as a potential binding site (Figure 3). This provides users with an easily interpretable view of the predicted binding sites.
The second section of results shows the cluster table, which includes details of the clusters identified, the number of ligands present in each cluster and the number of structures that these ligands originate from. The ligands are represented by the three-letter codes from the mmCIF dictionary and are linked to the small molecule details in the PDB (Figure 3). Clusters are sorted according to the number of ligands present in the cluster. There is greater confidence that a cluster represents a binding site when there is evidence for this from multiple protein structures. The second table in this section contains a tab for each ligand cluster and lists the residues predicted to be in the binding site along with the conservation score, solvent accessibility and the probability calculated by the logistic regression classifier.
The final section of the Results page contains a Mol* molecular viewer (www.molstar.org) (63) that by default displays the protein structure in a cartoon format along with the ligands in the top-ranked cluster, highlighting the predicted binding site residues in red (Figure 3). The Mol* viewer enables users to inspect the predicted binding sites within the protein structure and offers multiple features for exploring the structure. The 3DLigandSite control panel to the right of the viewer provides easy-to-use functions such as changing the colour or format of the display of the ligands and the protein structure. Further functionality is available via the Mol* built-in options shown on the top right of the viewer. The control panel also includes a button enabling users to generate publication-quality images of the current display in the viewer.
USE CASES
As set out in the ‘Introduction’ section, 3DLigandSite predictions have been widely used for a range of different biological and biomedical purposes. For example, with widespread use of sequencing technologies, there is extensive interest in the analysis of non-synonymous SNVs (nsSNVs). The aim here is to identify those nsSNVs that may alter protein structure and function and be associated with a phenotype such as a disease. Thus, 3DLigandSite has been used to analyse such nsSNVs for a range of diseases, from liver disease (31) to cardiomyopathies (33).
One application has been to study nsSNVs present in individuals with cystinuria, which is caused by variants in two genes, SLC7A9 and SLC3A1, that encode a dimeric amino acid transporter (64). Cystinuria is caused by variants that affect the ability of this transporter to transport cystine into cells, which results in the formation of kidney stones. In a recent study (34,65), 3DLigandSite was used to model the structure and ligand-binding sites of the two encoded proteins and to analyse how the set of nsSNVs observed in a cohort of patients may affect transporter function and be linked with the severity of the disease that patients experienced. Figure 3 shows the protein b(0+)AT, which is encoded by SLC7A9, and the predicted amino acid binding sites in the protein.
CONCLUDING REMARKS
The 3DLigandSite web server provides free access to an easy-to-use resource for modelling small molecule binding sites in proteins. This widely used resource has been extensively updated to offer improved functionality and to reduce the runtime of user submissions. Our benchmarking demonstrates that 3DLigandSite can obtain high recall with high precision, therefore accurately predicting binding sites in proteins that users are researching.
DATA AVAILABILITY
All data are provided in the manuscript or supplementary material.
Supplementary Material
ACKNOWLEDGEMENTS
Author contributions: M.N.W., M.M. and M.J.E.S. devised the research. J.E.M., H.U. and M.A. performed the research and analysed data. M.N.W. and J.E.M. wrote manuscript with contributions from all other authors.
Contributor Information
Jake E McGreig, School of Biosciences, Division of Natural Sciences, University of Kent, Canterbury, Kent CT2 7NJ, UK.
Hannah Uri, School of Biosciences, Division of Natural Sciences, University of Kent, Canterbury, Kent CT2 7NJ, UK.
Magdalena Antczak, School of Biosciences, Division of Natural Sciences, University of Kent, Canterbury, Kent CT2 7NJ, UK.
Michael J E Sternberg, Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London SW7 2AZ, UK.
Martin Michaelis, School of Biosciences, Division of Natural Sciences, University of Kent, Canterbury, Kent CT2 7NJ, UK.
Mark N Wass, School of Biosciences, Division of Natural Sciences, University of Kent, Canterbury, Kent CT2 7NJ, UK.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Engineering and Physical Sciences Research Council [to J.E.M.]; Wellcome Trust [WT104955MA and 218242/Z/19/Z (in part) to M.J.E.S.]; Biotechnology and Biological Sciences Research Council [BB/J0105451/1 (in part) to M.J.E.S.]. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Conflict of interest statement. None declared.
REFERENCES
- 1. UniProt Consortium UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021; 49:D480–D489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Zhou N., Jiang Y., Bergquist T.R., Lee A.J., Kacsoh B.Z., Crocker A.W., Lewis K.A., Georghiou G., Nguyen H.N., Hamid M.N.et al.. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 2019; 20:244–223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Jiang Y., Oron T.R., Clark W.T., Bankapur A.R., D’Andrea D., Lepore R., Funk C.S., Kahanda I., Verspoor K.M., Ben-Hur A.et al.. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016; 17:184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Mukhopadhyay A., Borkakoti N., Pravda L., Tyzack J.D., Thornton J.M., Velankar S.. Finding enzyme cofactors in Protein Data Bank. Bioinformatics. 2019; 35:3510–3511. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Torrance J.W., Macarthur M.W., Thornton J.M.. Evolution of binding sites for zinc and calcium ions playing structural roles. Proteins. 2008; 71:813–830. [DOI] [PubMed] [Google Scholar]
- 6. Zhao J., Cao Y., Zhang L.. Exploring the computational methods for protein–ligand binding site prediction. Comput. Struct. Biotechnol. J. 2020; 18:417–426. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Capra J.A., Singh M.. Characterization and prediction of residues determining protein functional. Bioinformatics. 2008; 24:1473–1480. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Capra J.A., Laskowski R.A., Thornton J.M., Singh M., Funkhouser T.A.. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput. Biol. 2009; 5:e10000585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Krivák R., Hoksza D.. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J. Cheminform. 2018; 10:39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Jendele L., Krivák R., Skoda P., Novotny M., Hoksza D.. PrankWeb: a web server for ligand binding site prediction and visualization. Nucleic Acids Res. 2019; 47:W345–W349. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Santana C.A., Silveira S.A., Moraes J.P.A., Izidoro S.C., de Melo-Minardi R.C., Ribeiro A.J.M., Tyzack J.D., Borkakoti N., Thornton J.M. GRaSP: a graph-based residue neighborhood strategy to predict binding sites. Bioinformatics. 2020; 36:i726–i734. [DOI] [PubMed] [Google Scholar]
- 12. Jiménez J., Doerr S., Martínez-Rosell G., Rose A.S., De Fabritiis G.. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics. 2017; 33:3036–3042. [DOI] [PubMed] [Google Scholar]
- 13. Aggarwal R., Gupta A., Chelur V., Jawahar C.V., Priyakumar U.D.. DeepPocket: ligand binding site detection and segmentation using 3D convolutional neural networks. J. Chem. Inf. Model. 2021; 10.1021/acs.jcim.1c00799. [DOI] [PubMed] [Google Scholar]
- 14. Stepniewska-Dziubinska M., Zielenkiewicz P., Siedlecki P.. Improving detection of protein–ligand binding sites with 3D segmentation. Sci. Rep. 2020; 1:5035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Kandel J., Tayara H., Chong K.T.. PUResNet: prediction of protein–ligand binding sites using deep residual neural network. J. Cheminform. 2021; 13:65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Mylonas S.K., Axenopoulos A., Daras P.. DeepSurf: a surface-based deep learning approach for the prediction of ligand binding sites on proteins. Bioinformatics. 2021; 37:1681–1690. [DOI] [PubMed] [Google Scholar]
- 17. Lopez G., Maietta P., Rodriguez J.M., Valencia A., Tress M.L.. firestar—advances in the prediction of functionally important residues. Nucleic Acids Res. 2011; 39:W235–W241. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Brylinski M., Skolnick J.. A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc. Natl Acad. Sci. U.S.A. 2008; 105:129–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Feinstein W.P., Brylinski M.. eFindSite: enhanced fingerprint-based virtual screening against predicted ligand binding sites in protein models. Mol. Inform. 2014; 33:135–150. [DOI] [PubMed] [Google Scholar]
- 20. Wu Q., Peng Z., Zhang Y., Yang J.. COACH-D: improved protein–ligand binding sites prediction with refined ligand-binding poses through molecular docking. Nucleic Acids Res. 2018; 46:W438–W442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Roche D.B., Buenavista M.T., McGuffin L.J.. FunFOLD2 server for the prediction of protein–ligand interactions. Nucleic Acids Res. 2013; 41:W303–W307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Armstrong D.R., Berrisford J.M., Conroy M.J., Gutmanas A., Anyango S., Choudhary P., Clark A.R., Dana J.M., Deshpande M., Dunlop R.et al.. PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res. 2020; 48:D335–D343. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Maietta P., Lopez G., Carro A., Pingilley B.J., Leon L.G., Valencia A, Tress M.L.. FireDB: a compendium of biological and pharmacologically relevant ligands. Nucleic Acids Res. 2013; 42:D267–D272. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Ribeiro A.J.M., Holliday G.L., Furnham N., Tyzack J.D., Ferris K., Thornton J.M.. Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res. 2018; 46:D618–D623. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Wass M.N., Kelley L.A., Sternberg M.J.E.. 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res. 2010; 38:W469–W473. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Lopez G., Ezkurdia I., Tress M.L.. Assessment of ligand binding residue predictions in CASP8. Proteins. 2009; 77:138–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Wass M.N., Sternberg M.J.E.. Prediction of ligand binding sites using homologous structures and conservation at CASP8. Proteins. 2009; 77:147–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Antczak M., Michaelis M., Wass M.N.. Environmental conditions shape the nature of a minimal bacterial genome. Nat. Commun. 2019; 10:3100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Nishiyama T., Sakayama H., de Vries J., Buschmann H., Saint-Marcoux D., Ullrich K.K., Haas F.B., Vanderstraeten L., Becker D., Lang D.et al.. The Chara genome: secondary complexity and implications for plant terrestrialization. Cell. 2018; 74:448–464. [DOI] [PubMed] [Google Scholar]
- 30. Kuhlmann F.M., Robinson J.I., Bluemling G.R., Ronet C., Fasel N., Beverley S.M.. Antiviral screening identifies adenosine analogs targeting the endogenous dsRNA Leishmania RNA virus 1 (LRV1) pathogenicity factor. Proc. Natl Acad. Sci. U.S.A. 2017; 114:E811–E819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Chambers J.C., Zhang W., Sehmi J., Li X., Wass M.N., Van der Harst P., Holm H., Sanna S., Kavousi M., Baumeister S.E.et al.. Genome-wide association study identifies loci influencing concentrations of liver enzymes in plasma. Nat. Genet. 2011; 43:1131–1138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Bernkopf M., Webersinke G., Tongsook C., Koyani C.N., Rafiq M.A., Ayaz M., Muller D., Enzinger C., Aslam M., Naeem F.et al.. Disruption of the methyltransferase-like 23 gene METTL23 causes mild autosomal recessive intellectual disability. Hum. Mol. Genet. 2014; 23:4015–4023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. O’Grady G.L., Best H.A., Sztal T.E., Schartner V., Sanjuan-Vazquez M., Donkervoort S., Neto O.A., Sutton R.B., Ilkovski B., Romero N.B.et al.. Variants in the oxidoreductase PYROXD1 cause early-onset myopathy with internalized nuclei and myofibrillar disorganization. Am. J. Hum. Genet. 2016; 99:1086–1105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Martell H.J., Wong K.A., Martin J.F., Kassam Z., Thomas K., Wass M.N. Associating mutations causing cystinuria with disease severity with the aim of providing precision medicine. BMC Genomics. 2017; 18:550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Papalardo M., Wass M.N.. VarMod: modelling the functional effects of non-synonymous variants. Nucleic Acids Res. 2014; 42:W331–W336. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Ho C.H., Frommer W.B.. Fluorescent sensors for activity and regulation of the nitrate transceptor CHL1/NRT1.1 and oligopeptide transporters. eLife. 2014; 3:e01917. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Bojkova D., McGreig J.E., McLaughlin K.M., Masterson S.G., Antczak M., Widera M., Krahling V., Ciesek S., Wass M.N., Michaleis M.et al.. Differentially conserved amino acid positions may reflects differences in SAR-CoV-2 and SARS-CoV behaviour. Bioinformatics. 2021; 37:2282–2288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Agrawal A., Varshney R., Pathak M., Patel S.K., Rai V., Sulabh S., Gupta R., Solanki K.S., Varshney R., Nimmanapalli. R. Exploration of antigenic determinants in spike glycoprotein of SARS-CoV2 and identification of five salient potential epitopes. Virusdisease. 2021; 32:774–783. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Venkateshan M., Muthu M., Suresh J., Kumar R.R.. Azafluorene derivatives as inhibitors of SARS CoV-2 RdRp: synthesis, physicochemical, quantum chemical, modeling and molecular docking analysis. J. Mol. Struct. 2020; 1220:128741. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. PDBe-KB Consortium PDBe-KB: a community-driven resource for structural and functional annotations. Nucleic Acids Res. 2020; 48:D344–D353. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. PDBe-KB Consortium PDBe-KB: collaboratively defining the biological context of structural data. Nucleic Acids Res. 2022; 50:D534–D542. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Varadi M., Anyango S., Deshpande M., Nair S., Natassia C., Yordanova G., Yuan D., Stroe O., Wood G., Laydon A.et al.. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022; 50:D439–D444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Jumper J., Evans E., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Zidek A., Potapenko A.et al.. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Kelley L.A., Mezulis S., Yates C.M., Wass M.N., Sternberg M.J.E.. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 2015; 10:845–858. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Ortiz A.R., Strauss C.E.M, Olmea O.. MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 2002; 11:2606–2621. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Soding J. Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005; 21:951–960. [DOI] [PubMed] [Google Scholar]
- 47. Zhang Y., Skolnick J.. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005; 33:2302–2309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Fu L., Niu B., Zhu Z., Wu S., Li W.. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28:3150–3152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Steinegger M., Meier M., Mirdita M., Vöhringer H., Haunsberger S.J., Söding J.. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics. 2019; 20:473–415. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Remmert M., Biegert A., Hauser A., Söding J.. HHblits: lightning-fast iterative protein sequence searching by HMM–HMM alignment. Nat. Methods. 2012; 9:173–175. [DOI] [PubMed] [Google Scholar]
- 51. Mirdita M., von den Driesch L., Galiez C., Martin M.J., Söding J., Steinegger M.. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017; 45:D170–D176. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Capra J.A., Singh M.. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007; 23:1875–1882. [DOI] [PubMed] [Google Scholar]
- 53. Williams M.A., Goodfellow J.M., Thornton J.M.. Buried waters and internal cavities in monomeric proteins. Protein Sci. 1994; 3:1224–1235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Cortes C., Vapnik V.. Support-Vector Networks. Mach. Learn. 1995; 20:273–297. [Google Scholar]
- 55. Steinegger M., Söding J.. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017; 35:1026–1028. [DOI] [PubMed] [Google Scholar]
- 56. Wass M.N., Sternberg M.J.E.. ConFunc—functional annotation in the twilight zone. Bioinformatics. 2008; 24:798–806. [DOI] [PubMed] [Google Scholar]
- 57. Davis J., Goadrich M.. The relationship between precision–recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (ICML). 2006; Pittsburgh, PA, USA. [Google Scholar]
- 58. Schmidt T., Haas J., Cassarino T.G., Schwede T.. Assessment of ligand-binding residue predictions in CASP9. Proteins. 2011; 79:126–136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Cassarino T.G., Bordoli L., Schwede T.. Assessment of ligand binding site predictions in CASP10. Proteins. 2014; 82:154–163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Matthews B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta. 1975; 405:442–451. [DOI] [PubMed] [Google Scholar]
- 61. Chen J., Guo M., Wang X., Liu B.. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief. Bioinform. 2018; 19:231–244. [DOI] [PubMed] [Google Scholar]
- 62. Yan R., Xu D., Yang J., Walker S., Zhang Y.. A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci. Rep. 2013; 3:2619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Sehnal D., Rose A., Koca J., Burley S., Velankar S.. Mol*: towards a common library and tools for web molecular graphics. Workshop on Molecular Graphics and Visual Analysis of Molecular Data. 2018; [Google Scholar]
- 64. Thomas K., Wong K., Withington J., Bultitude M, Doherty A.. Cystinuria—a urologist’s perspective. Nat. Rev. Urol. 2014; 11:270–277. [DOI] [PubMed] [Google Scholar]
- 65. Wong K.A., Wass M., Thomas K.. The role of protein modelling in predicting the disease severity of cystinuria. Eur. Urol. 2016; 69:543–544. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All data are provided in the manuscript or supplementary material.