Real-time structure search and structure classification for AlphaFold protein models

Tunde Aderinwale; Vijay Bharadwaj; Charles Christoffer; Genki Terashi; Zicong Zhang; Rashidedin Jahandideh; Yuki Kagaya; Daisuke Kihara

doi:10.1038/s42003-022-03261-8

2022 Apr 5;5:316. doi: 10.1038/s42003-022-03261-8

PMCID: PMC8983703 PMID: 35383281

Abstract

Last year saw a breakthrough in protein structure prediction, where the AlphaFold2 method showed a substantial improvement in the modeling accuracy. Following the software release of AlphaFold2, predicted structures by AlphaFold2 for proteins in 21 species were made publicly available via the AlphaFold Database. Here, to facilitate structural analysis and application of AlphaFold2 models, we provide the infrastructure, 3D-AF-Surfer, which allows real-time structure-based search for the AlphaFold2 models. In 3D-AF-Surfer, structures are represented with 3D Zernike descriptors (3DZD), which is a rotationally invariant, mathematical representation of 3D shapes. We developed a neural network that takes 3DZDs of proteins as input and retrieves proteins of the same fold more accurately than direct comparison of 3DZDs. Using 3D-AF-Surfer, we report structure classifications of AlphaFold2 models and discuss the correlation between confidence levels of AlphaFold2 models and intrinsic disordered regions.

Subject terms: Protein structure predictions, Molecular modelling

3D-AF-Surfer is presented as a computational resource for real-time protein structure comparison search between AlphaFold2 models and PDB entries within seconds to a few minutes.

Introduction

Structural biology has entered a phase when structure prediction methods, particularly a recent method, AlphaFold2¹, consistently produce reliable computational structure models with atomic accuracy. Protein structure prediction has been extensively studied in the computational biology community. Taking advantage of the accumulated protein sequence and structure information in the Protein Data Bank (PDB)², numerous methods have been developed based on different scientific disciplines, ideas, and various computational techniques. In the past few years, methods that use machine learning methods, particularly deep neural networks^3–9, made a large improvement in structure prediction accuracy in the Critical Assessment of techniques in protein Structure Prediction (CASP)¹⁰. In CASP14, a breakthrough¹¹ was achieved by AlphaFold2¹, which showed the best performance among participants with a substantial gap to the second-best method. Remarkably, the accuracy of AlphaFold2 models often reaches what would be expected from X-ray crystallography. It has been reported that models generated by AlphaFold2 have indeed helped experimental protein structure determination, as such models were successfully used for molecular replacement in X-ray crystallography and for density interpretation of cryo-EM maps^12,13.

Soon after the release of the AlphaFold2 code, predicted structure models by AlphaFold2 for proteins from 21 major model species have been released at the AlphaFold Protein Structure Database¹⁴. This is an invaluable resource for the biology community as modeled protein structures can be easily obtained without installing and running the AlphaFold2 software. Many proteins that do not have experimentally determined structures now have computational models with an expected high accuracy.

Here, we provide the infrastructure, 3D-AF-Surfer, for real-time protein structure model search within AlphaFold2 models and across entries in PDB at https://kiharalab.org/3d-surfer/submitalphafold.php. In any database, the functionality for quick entry search and comparison is essential. In 3D-AF-Surfer, a quick structure search against the entire PDB and AlphaFold2 models is realized with 3D Zernike descriptors (3DZD), which are rotationally invariant, mathematical representations of 3D shapes^15,16 (see Methods for more technical details). 3DZDs were shown to be effective in rapid protein structure database search^17–20, other tasks that involve biomolecular shape comparison and matching^21–25, mapping the global shape space of known protein structures²⁶, binding pocket comparison^27,28, drug screening^28,29, and protein docking²². To the best of our knowledge, 3D-AF-Surfer is the only tool that can search between AlphaFold2 models and PDB entries in real time, within seconds to a couple of minutes. In 3D-AF-Surfer, we further developed neural networks that take 3DZDs of proteins as input and achieve more accurate retrieval of proteins of the same fold than a direct comparison of 3DZDs.

Results

Domains with high confidence in AlphaFold2 models

In 3D-AF-Surfer, protein structure models generated by AlphaFold2 for 21 proteomes were retrieved from the European Bioinformatics Institute’s FTP server of the AlphaFold Database (https://ftp.ebi.ac.uk/pub/databases/alphafold) on July 22, 2021, and this release was still up to date as of November 8, 2021. AlphaFold2 assigns one of four confidence levels, from very high confidence to very low confidence, to each amino acid position in a model. The confidence levels were assigned by the predicted local distance difference test (pLDDT) score³⁰, which examines the accuracy of Cα atom distances in a model. Since many models have low or very low confidence regions, which often have unfolded conformation, we extracted confident domain region(s) from each model in 3D-AF-Surfer (see Methods). In total, this procedure yielded 508,787 domains, which cover 48.8% of residues in all the AlphaFold2 models. The statistics of model counts are provided in Table 1.

Table 1.

Proteomes and structure models considered.

Species	Common name	Reference proteome	# unique UniProt IDs	# original models	# domain models	# structure predictions with no domains (1D)
Arabidopsis thaliana	Arabidopsis	UP000006548	27,434	27,434	37,682	5722
Caenorhabditis elegans	Nematode worm	UP000001940	19,694	19,694	26,160	4277
Candida albicans	C. albicans	UP000000559	5974	5974	9978	743
Danio rerio	Zebrafish	UP000000437	24,664	24,664	42,135	2530
Dictyostelium discoideum	Dictyostelium	UP000002195	12,622	12,622	18,963	2986
Drosophila melanogaster	Fruit fly	UP000000803	13,458	13,458	19,881	2335
Escherichia coli	E. coli	UP000000625	4363	4363	5397	417
Glycine max	Soybean	UP000008827	55,799	55,799	72,217	14,146
Homo sapiens	Human	UP000005640	20,504	23,391	44,827	3302
Leishmania infantum	L. infantum	UP000008153	7924	7924	12,257	1579
Methanocaldococcus jannaschii	M. jannaschii	UP000000805	1773	1773	2097	131
Mus musculus	Mouse	UP000000589	21,615	21,615	35,216	2477
Mycobacterium tuberculosis	M. tuberculosis	UP000001584	3988	3988	5170	351
Oryza sativa	Asian rice	UP000059680	43,649	43,649	39,775	19,756
Plasmodium falciparum	P. falciparum	UP000001450	5187	5187	7283	1162
Rattus norvegicus	Rat	UP000002494	21,272	21,272	33,818	2664
Saccharomyces cerevisiae	Budding yeast	UP000002311	6040	6040	9837	967
Schizosaccharomyces pombe	Fission yeast	UP000002485	5128	5128	8173	637
Staphylococcus aureus	S. aureus	UP000008816	2888	2888	3283	415
Trypanosoma cruzi	T. cruzi	UP000002296	19,036	19,036	26,205	5436
Zea mays	Maize	UP000007305	39,299	39,299	48,433	11,582

For each proteome, the number of unique proteins, total original/domain models, and total original models containing no confident domains are given. The definition of the confident domains is given in the main text. The human original model count is underlined, indicating that the number of original models does not match the number of unique proteins. The human structure predictions retrieved from the AlphaFold Database contain models which are 1400-residue slices of larger proteins.

3D-AF-Surfer

Figure 1 illustrates the input and output panels of 3D-AF-Surfer, available at https://kiharalab.org/3d-surfer/submitalphafold.php. In the input panel, users can enter an AlphaFold model ID or a PDB ID, or upload a file of the query structure (Fig. 1a). When the first few letters of an ID are entered, candidate completions are listed. Then, the representation of protein structures used to compute the 3DZD needs to be specified (full atom or main chain). Next, users select the database to search against, which can be the full AlphaFold proteome database, structures from PDB (complexes, domain structures), or both combined. Users also have an option to select the method of the database search: a deep neural network-based search (the default setting), which is suitable for retrieving proteins with the same fold (see below), or the original 3DZD-based search that is provided in 3D-Surfer. The result page shows a table where the query structure is displayed on the left side and a list of retrieved structures ranked by their similarity to the query is shown on the right side (Fig. 1b). Clicking a retrieved structure invokes a new search using the selected structure as the query and allows users to “surf” in the protein structure universe. The panel also provides the option to compute the root mean square deviation (RMSD) between the query and the displayed similar structure. Pockets in the query structure can be identified using VisGrid³¹ or LIGSITE³². Finally, the 3DZD of the query structure is shown at the bottom of the page.

Fig. 1 — a The input page (see text). b An example output page. The query was PDB ID: 7tim-A, a TIM-barrel fold and search was against AlphaFold2 models using the deep neural network. As shown, retrieved top 25 hits are all TIM-barrel folds with a distance of 0.0, indicating that the network judged that these structures are highly likely to belong to the same fold.

PDB entries in 3D-AF-Surfer are updated bi-weekly. As of November 29, 2021, the server holds 547,639 protein chains and 249,163 additional domain structures from PDB, and 508,787 domain structures from the AlphaFold Database. Average time for a search measured over ten queries is as follows, when the neural network is used: Against AlphaFold domains: 55 s; PDB chains: 1 min 10 s; PDB domains: 22 s; PDB chains+domains: 1 min 15 s; All of the above: 2 min 26 s. Search is faster if 3DZD is used: 3 s against AlphaFold domains; 1.35 s, 1.45 s, 1.93 s against PDB chains, domains, and chains+domains, respectively, and 2.45 s for All of the above.

We further compared the computational time of 3D-AF-Surfer with DaliLite³³, TM-align³⁴, MADOKA³⁵, SPalignNS³⁶, and ZEAL³⁷. DaliLite and TM-align are conventional, commonly used structure alignment methods, while MADOKA and SPalignNS are more recent methods. ZEAL is a method that uses 3D Zernike moments instead of 3DZDs (see “Methods”). Table 2 reports the computational time of these methods on structure comparison of 4950 protein pairs formed from randomly sampled 100 proteins. For 3D-AF-Surfer, both direct 3DZD comparison and the neural network (3DZD-NN) were evaluated. 3DZD is the fastest of all methods, followed by 3DZD-NN. MADOKA was the next fastest, but it was 10 times slower than 3DZD-NN. ZEAL was the slowest of all the methods.

Table 2.

Comparison of computational time.

Method	Running time
3DZD	1.64 s
3DZD-NN	4.06 s
DaliLite	4 min 37.2 s
TM-align	10 min 14.4 s
MADOKA	41.4 s
SPalignNS	19 min 18.55 s
ZEAL	3 days 3 h 22 min 7.47 s

We ran the programs on a Linux machine with an Intel(R) Core i7-6900K CPU @ 3.20 GHz. min, minutes; sec, seconds. The running times reported are the average of three independent runs.
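The pair count and the relative speeds quoted above can be checked with a few lines of arithmetic. The sketch below only restates the running times from Table 2 in seconds; the dictionary and helper names are illustrative and are not part of 3D-AF-Surfer.

```python
from math import comb

# Running times from Table 2, converted to seconds.
RUNNING_TIME_S = {
    "3DZD": 1.64,
    "3DZD-NN": 4.06,
    "MADOKA": 41.4,
    "DaliLite": 4 * 60 + 37.2,
    "TM-align": 10 * 60 + 14.4,
    "SPalignNS": 19 * 60 + 18.55,
    "ZEAL": ((3 * 24 + 3) * 60 + 22) * 60 + 7.47,
}

# All unordered pairs formed from 100 sampled proteins.
N_PROTEINS = 100
n_pairs = comb(N_PROTEINS, 2)


def slowdown(method: str, baseline: str = "3DZD-NN") -> float:
    """How many times slower `method` is than `baseline` on the benchmark."""
    return RUNNING_TIME_S[method] / RUNNING_TIME_S[baseline]


def seconds_per_pair(method: str) -> float:
    """Average wall-clock time spent on one pairwise comparison."""
    return RUNNING_TIME_S[method] / n_pairs
```

With these numbers, `n_pairs` is 4950 and `slowdown("MADOKA")` is about 10.2, which matches the statement that MADOKA was roughly 10 times slower than 3DZD-NN.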

Secondary structure class of AlphaFold2 models

Figure 2a shows a breakdown of the secondary structure class of domain structures of AlphaFold2 models in comparison with SCOPe^38,39. Four secondary structure classes were considered, α, β, αβ, and small proteins. αβ corresponds to the α+β and α/β classes in SCOPe. The classification was performed with a machine learning method, a bagged⁴⁰ ensemble of support vector machine classifiers (SVMs) using the secondary structure content of SCOPe domains (see Methods). The bagged ensemble had an accuracy of 91.5% (Table 3). The method had the highest accuracy among all the methods compared, which include handmade classification procedures and different architectures of SVM. The classification result for SCOPe (Fig. 2a) is qualitatively consistent with earlier statistics of CATH⁴¹, where the αβ class occupies over 50% and the share of α-class is around 15%. On the other hand, we note a greater prevalence of α-class structures among the AlphaFold2 domains (Fig. 2b) than in the SCOPe statistics (Fig. 2a). This result probably indicates that α-class structure models tend to have higher confidence than other classes.

Fig. 2 — a The secondary structure classes were assigned to SCOPe domains and domains of high confidence in AlphaFold2 models. Four classes were considered, α, β, αβ, and small proteins. Left, SCOPe (232,630 domains); right, domains of high confidence in AlphaFold2 models (508,787 domains). The classification was performed using a bagged SVM ensemble (see Methods). SCOPe domains (left) were also classified with the SVM ensemble to be able to compare with the results on AlphaFold2 domains (right). b Fold classification of the AlphaFold2 structure domains of high confidence. The classification was performed with the deep neural networks that were trained on the fold assignment provided in SCOPe (see Methods). The outer wheel indicates the fraction of each fold. Folds were ordered according to SCOPe IDs. Left, the fold distribution of AlphaFold2 domains using the deep network trained on 3DZDs of full atom domain structure surface. The inner wheel shows the fraction of secondary structure classes. Since this classification was based on the fold assignment, the fractions are overall consistent but not identical to those shown in panel (a). The top 10 most abundant folds are indicated. Right, the fold distribution using the deep network trained on 3DZDs of surface shapes with main-chain atoms. c The 10 most abundant folds among AlphaFold2 domains. The fraction of each fold is indicated in the wheel diagram on the left in panel b. For each fold, an example of AlphaFold domains is shown. (1) Non-globular all-alpha subunits of globular proteins (a.137). Example shown is A0A1D6E4Z3_F1, residue 823-895 (*maize*). (2) ROP-like (a.30): A0A1D6MV33_F1, residue 758-815 (*maize*). (3) Mediator hinge subcomplex-like (a.252). Q4DL50_F1, residue 384-495 (*T. cruzi*). (4) BAR/IMD domain-like (a.238). Q8LE58_F1, residue 2-133 (*Arabidopsis*). (5) Intrinsically disordered proteins (g.88). I1L2C2_F1, residue 210-284 (*soybean*). (6) N-terminal domain of bifunctional PutA protein (a.176). A7MBM2_F1, residue 157-225 (*human*). (7) L27 domain (a.194). A0A1D6PKM6_F1, residue 314-375 (*maize*). (8) alpha-alpha superhelix (a.118). K7KHY8_F, residue 213-524 (*soybean*). (9) Spectrin repeat-like (a.7). P38637_F1, residue 149-238_AFv1 (*S. cerevisiae*). (10) SRF-like (d.88). A0A1D6NUQ9_F1, residue 2-74 (*maize*).

Table 3.

Accuracy of fold class assignment on SCOPe.

Method	Overall accuracy	α	β	αβ	Small proteins
Expert handmade (without optimization)	0.852	0.683	0.771	0.961	0.357
Expert handmade (optimized)	0.880	0.759	0.889	0.928	0.500
Multinomial logistic regression	0.863	0.916	0.861	0.851	0.818
SVM (linear)	0.445	0.991	0.927	0.069	0.548
SVM (RBF kernel)	0.896	0.947	0.869	0.896	0.861
Bagged SVM (RBF kernel)	0.915	0.943	0.882	0.937	0.621

Fold classes were assigned to AlphaFold2 models based on secondary structure content and sequence length. Here we show the benchmark results from optimizing these classifiers on the original manually curated SCOPe fold classes. For the expert handmade classifiers, secondary structure content and protein length conditions were defined for each fold class. The first classifier without optimization used the following conditions: $\text{length} < 50\,\text{aa} \to \text{small}$; else $\text{helix} \geq 60\% \to α$; else $\text{sheet} \geq 35\%$ and $\text{helix} < 20\% \to β$; else $\to αβ$. The second one optimized the actual threshold values by a parameter sweep with increments of 5% for secondary structure content and 5 aa for the sequence length. The optimized mapping was: $\text{length} < 55\,\text{aa} \to \text{small}$; else $\text{helix} \geq 55\% \to α$; else $\text{sheet} \geq 25\%$ and $\text{helix} < 20\% \to β$; else $\to αβ$. For the other classifiers, lengths and secondary structure proportions were used directly as features. For each classifier, accuracy is shown both overall and per-class.
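The two expert handmade rule sets described above are simple enough to write down directly. The following is a minimal sketch of those cascaded rules; the function name, argument names, and class labels are illustrative choices, and the helix and sheet contents are assumed to be given as fractions between 0 and 1.

```python
def classify_handmade(length: int, helix: float, sheet: float,
                      optimized: bool = False) -> str:
    """Assign a secondary structure class with the cascaded threshold rules.

    length: number of residues; helix, sheet: fractions of residues (0 to 1).
    """
    if optimized:
        # Thresholds found by the parameter sweep.
        small_below, helix_min, sheet_min = 55, 0.55, 0.25
    else:
        # Initial thresholds, before optimization.
        small_below, helix_min, sheet_min = 50, 0.60, 0.35
    helix_max_for_beta = 0.20  # the same in both rule sets

    if length < small_below:
        return "small"
    if helix >= helix_min:
        return "alpha"
    if sheet >= sheet_min and helix < helix_max_for_beta:
        return "beta"
    return "alpha-beta"
```

For example, a 52-residue domain with 10% helix and 30% sheet is assigned to αβ by the initial rules but to the small protein class by the optimized rules, because the length cutoff moved from 50 to 55 residues.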

Fold classification by deep neural network

To have an overall grasp of the fold distribution of AlphaFold2 models, we used the deep neural network of 3D-AF-Surfer and classified AlphaFold domain structures into SCOPe folds (Fig. 2b). For this classification, we considered 1101 folds in the class a (all α proteins), b (all β proteins), c (α/β proteins), d (α+β proteins), and g (small proteins) in the SCOPe database. The neural network takes 3DZDs of two protein structures and outputs the probability that the two structures belong to the same SCOPe fold⁴² (Fig. 3; see “Methods”). This neural network architecture has shown significant performance in the yearly-held 3D Shape Retrieval Contests (SHREC) protein retrieval categories^42,43.

Fig. 3 — The network takes as input two protein structures represented by their 3DZD vectors. The encoder layer uses three hidden layers, with 250, 200, and 150 nodes, respectively, to encode the features in the 3DZD. The encoding vector of a length of 1452 is then input into the feature extractor layer, which is used to compare the encoded features of the two structures using four distance metrics, the Euclidean distance, the cosine distance, the Manhattan (absolute value) distance, and dot product. The FC network takes the feature extractor output and predicts the probability that the two structures belong to the same fold.
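To make the feature extractor step concrete, the sketch below computes the four comparison values named in the caption for a pair of already-encoded vectors. It is an illustration only: the function name is hypothetical, the cosine distance is assumed to be one minus the cosine similarity, and the handling of zero vectors is an assumption, none of which is specified in the text. The encoder and the fully connected layers are not reproduced here.

```python
import math
from typing import List, Sequence


def pair_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Return [Euclidean, cosine distance, Manhattan, dot product] for u and v."""
    if len(u) != len(v):
        raise ValueError("both encodings must have the same length")

    dot = sum(a * b for a, b in zip(u, v))
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    manhattan = sum(abs(a - b) for a, b in zip(u, v))

    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        cosine_distance = 1.0  # assumption: treat a zero vector as maximally dissimilar
    else:
        cosine_distance = 1.0 - dot / (norm_u * norm_v)

    return [euclidean, cosine_distance, manhattan, dot]
```

Identical encodings give zero for all three distances, while the dot product grows with the magnitude of the encodings, so the four values carry complementary information for the downstream layers.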

For the current work we newly trained two networks, one that uses 3DZDs computed from the full-atom protein surface and the other that takes 3DZDs computed from main-chain Cα, C, and N atoms⁴⁴. The network with the main-chain atoms showed higher classification accuracy (97.7%) than the full-atom network (95.4%; Table 4). This accuracy was higher than that of the original 3D-Surfer¹⁷, which compares 3DZDs directly with the Euclidean distance.

We also compared the structural classification performance of 3DZD and 3DZD-NN with SPalignNS, because Janan et al.⁴⁵ performed a comprehensive analysis of eighteen structure alignment methods and reported SPalignNS as the best method for fold classification (Supplementary Fig. 1). This comparison was performed on randomly sampled 2,500 positive (i.e. same-fold) and 2,500 negative (i.e. different-fold) pairs from the validation dataset used in Table 4. As shown in the figure, 3DZD-NN showed the highest AUC of 0.998, followed by SPalignNS with an AUC of 0.976. The AUC of 3DZD was the lowest, at 0.789.

Table 4.

Fold classification accuracy by 3DZD and the deep neural network.

Method	3DZD Type	Accuracy	Precision	Recall	F-Measure
Fold
3DZD-NN	Full Atom	0.954	0.945	0.964	0.954
3DZD-NN	Main Chain	0.977	0.974	0.979	0.977
3DZD	Full Atom	0.508	0.504	0.998	0.670
3DZD	Main Chain	0.616	0.571	0.939	0.710

This benchmark is computed using the test set from the SCOPe dataset. Balanced positive and negative test pairs were constructed from the set of 2521 protein structures in SCOPe. There were 167,872 test pairs in total. 3DZD is the original method where the 3DZD of two structures are compared with a score that uses Euclidean distance of 3DZDs of two proteins, which is defined as 1/(1+Euclidean distance). Thus, the score ranges from 0 to 1. 3DZD-NN is the deep network that outputs predicted probability that input two structures are in the same SCOPe fold. Probability values output by 3DZD-NN range from 0 to 1. We used the best threshold that maximized F-measure. The threshold values of 3DZD-NN full atom, 3DZD-NN main-chain, and 3DZD were 0.5, 0.6, and 0.1, respectively. See Table 1 in Supplementary Information for results of all different thresholds. See Methods for definitions of accuracy, precision, recall, and F-measure.
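The direct 3DZD score defined above is a one-line transformation of the Euclidean distance. A minimal sketch follows; the function names are illustrative, and treating a score exactly equal to the threshold as a positive call is an assumption, since the text does not say whether the comparison is strict.

```python
import math
from typing import Sequence


def zd_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity score 1 / (1 + Euclidean distance) between two 3DZD vectors."""
    return 1.0 / (1.0 + math.dist(a, b))


def same_fold(score: float, threshold: float) -> bool:
    """Call a pair 'same fold' when the score reaches the threshold (assumed >=)."""
    return score >= threshold
```

Under this definition, identical descriptors score 1, and the 3DZD threshold of 0.1 corresponds to a Euclidean distance of 9: any pair closer than that is called the same fold, which is consistent with the high recall and low precision of direct 3DZD comparison in Table 4.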

Illustrative cases of misclassifications of folds

Although 3D-AF-Surfer showed high fold classification performance as discussed above, there are certainly cases where it failed to provide a correct classification. Some such cases come from the inherent methodology of using 3DZDs as discussed in our earlier paper¹⁹. We show four examples in Fig. 4. The two pairs in panels a and b are false negatives, where the two structures belong to the same SCOPe fold while both 3DZD-NN and 3DZD considered them as different folds. The pair in Fig. 4a (d2d0oa2 and d3g25d1) have similar secondary structure arrangement along the sequences but their spatial packings are different. Consequently, these two structures have different overall surface shapes for 3DZD. In the pair in Fig. 4b, although the two structures have a bent β-sheet structure in common, extra α-helices in d1mjxb_ made the two folds less similar, which also led to differences in their surface shapes.

Fig. 4c, d shows examples of false positives, i.e. two pairs of structures of different folds where both 3DZD-NN and 3DZD recognized them as the same fold. The structures in Fig. 4c have similar spatial arrangements of secondary structures, each with a large β-sheet in the middle and a long, kinked characteristic α-helix on the side, although structure superimposition shows an RMSD over 15 Å. Figure 4d shows two proteins with different secondary structure classes but with a similar C-shaped surface shape. Detecting similar surface shape of proteins regardless of their main-chain conformations is characteristic of the performance of 3DZD, which, in these two cases, led to false positives. However, note that while these false positive pairs have a score above the detection threshold, they do not practically affect a database search against the entire PDB or AlphaFold2 models because there are many far more similar structures that occupy top hits in a search as shown in Supplementary Fig. 2.

In Fig. 5 we discuss cases where 3DZD-NN improved over 3DZD, where the neural network correctly classified two proteins as being in the same fold or not while 3DZD failed. In the pairs in Fig. 5a, b surface shapes of the two proteins are apparently different due to a tail that flipped out from the main body of the protein volume. 3DZD was confused by the shape difference, but the neural network was still able to correctly identify the pair as belonging to the same fold with high confidence. Figure 5c and Fig. 5d show cases where 3DZD had a slightly higher score than the threshold and considered them as the same fold while 3DZD-NN considered them as different folds. In both cases, while 3DZD could not differentiate the pairs due to their similar surface shapes, the neural network is able to differentiate the pair as not belonging to the same fold.

To summarize, surface shape similarity of proteins, which 3DZD detects, can lead to misclassification of protein folds if fold assignment is the main interest of users. But in many cases the neural network was able to correct such misclassification by 3DZD. It is worth noting that identifying proteins with similar surface shapes but different main-chain conformations by 3DZD often leads to findings of functionally related proteins, which would otherwise be missed due to the lack of main-chain and sequence-level similarity^19,37.

Fold distribution of AlphaFold2 models

We now discuss abundant folds observed in AlphaFold2 models. In Fig. 2b, the fold classifications are shown in wheel diagrams. The inner and the outer wheels of the pie charts show the classification result at the secondary structure class level and at the level of individual SCOPe folds, respectively. The distribution of the secondary structure class levels is consistent with Fig. 2a, which was classified from secondary structure content of models. Classifications using the main-chain atoms (the left panel in Fig. 2b) and full-atoms (the right panel) were also consistent. Overall, the α-class folds are dominant when all the proteomes are considered.

In Fig. 2c, we show the 10 most abundant folds from all the 21 species. Among them, eight belong to the α-class, one to the α+β-class (d.88), and one to the small protein class (g.88). Supplementary Table 2 breaks down the statistics into individual species. Reflecting the overall abundance of α-class proteins as shown in Fig. 2, α-class folds dominate the top 10 rankings in all the species. On average, 7.0 α-class folds ranked within the top 10 in each species, in contrast to the small numbers of folds in the α/β or α+β-class (1.67 folds) and β-class (0.71 folds). These results for AlphaFold2 models are largely different from statistics taken from the SUPERFAMILY2.0 database⁴⁶, which is a reference for the current understanding of protein fold distribution (Supplementary Table 3, 4). As shown in Supplementary Table 4, the 21 species in SUPERFAMILY2.0 have more α/β or α+β-class folds within the top 10: On average, 5.24 folds from the α/β or α+β-class are within the top 10, in contrast to 1.9 α-class folds. The dominance of the α/β and α+β-class observed in SUPERFAMILY2.0 is consistent with earlier works by Gerstein⁴⁷, which is shown in Supplementary Table 5, and by Kihara & Skolnick⁴⁸ (Supplementary Table 6), which assigned folds by a threading method. In Supplementary Table 2, folds that also appear in the SUPERFAMILY2.0 statistics (Supplementary Table 4) are underlined. There are not many common folds between the two tables. Seven species did not have common folds. For the rest of the species, there were one to three common folds.

Low-confidence regions of AlphaFold2 models

Finally, we analyzed low-confidence regions of AlphaFold2 models, as they are not handled in 3D-AF-Surfer and were thus left out of the above analysis. In particular, we analyzed the correlation between the low-confidence regions (pLDDT $\leq$ 0.5 and 0.7) from AlphaFold2 models and disorder predictions. We used two disorder prediction methods, SPOT-Disorder-Single⁴⁹ and flDPnn⁵⁰. According to the two methods, about 14–18% of residues are disordered (Fig. 6a). On the other hand, considering 0.5 and 0.7 pLDDT as cutoffs, more residues, 25% and 36.5%, in AlphaFold2 models were in low-confidence regions (Fig. 6b). The percentage of low-confidence residues varies for different species. Low-confidence regions are relatively small (7–13%) in the four bacterial proteomes, while D. discoideum has the largest fraction of low-confidence residues, 58.4%. For the other species, low-confidence residues account for about 30–40%.

In Fig. 6d, e, we compared disorder predictions and the model confidence scores using two score cutoffs, pLDDT of 0.5 and 0.7. When SPOT-Disorder-Single was used for disorder prediction (Fig. 6d), 52.6% and 44.2% of low-confidence regions defined with a pLDDT cutoff of 0.5 and 0.7, respectively, were predicted as disordered. Conversely, 47.4% and 55.8% of low-confidence regions were predicted as ordered. On the other hand, almost all high-confidence regions were predicted to be ordered. The result was essentially the same when flDPnn was used (Fig. 6e), except that the fraction of disordered residues in low-confidence regions was even smaller, 33.5% and 30.9% using pLDDT of 0.5 and 0.7 as a cutoff, respectively. The results indicate that low-confidence regions do not always correspond to disordered regions: at most only 30 to 50% do, and the rest would be folded in native protein structures. Figure 6f–i shows several examples. The first three panels (f, g, h) are similar cases. Low-confidence residues at pLDDT around 0.4 or lower have a wide range of disorder propensities, and about half of such residues have low disorder propensity and probably would be folded in the native structures. The model shown in Fig. 6i does not have residues with high disorder propensity, implying that the protein would be well folded in the native form.

Discussion

We developed 3D-AF-Surfer, which performs protein structure comparison against the entire PDB and all the AlphaFold2 models within a couple of minutes. Thus, it can serve as the 3D protein structure counterpart of the BLAST⁵¹ sequence database search tool. At the time of writing, there is no other method that can perform such a fast structure comparison for all the AlphaFold2 models and PDB. As demonstrated in Results, 3D-AF-Surfer maintains high accuracy yet is still able to perform a real-time structure search, which allows users to analyze AlphaFold2 models interactively. Currently, 3D-AF-Surfer is running on a single CPU on a regular Linux machine and all searches are performed on the fly. Therefore, further speed-up can be easily achieved by using multiple CPUs or by applying other standard techniques of database management. With such an expansion of the server, 3D-AF-Surfer will be able to handle the future release of more structure models by the AlphaFold Database, which is expected to happen in the near future.

Methods

Extraction of confident domain regions in AlphaFold2 models

To extract a confident domain in an AlphaFold2 model, we first extracted all contiguous regions of more than 50 confident residues that have a pLDDT score greater than 70.0. Then, confident regions separated by at most 5 non-confident residues were merged, along with the intervening residues regardless of confidence level. AlphaFold2 models were discarded if they had no confident domains. In total, this procedure yielded 508,787 domains. 83,615 (22.9%) models out of 365,198 total AlphaFold2 models contain no confident domains. The statistics of model counts are provided in Table 1. In terms of total residues, the domain dataset in 3D-AF-Surfer contains 48.8% (78,133,986 residues) of the residues in all the AlphaFold2 models (160,235,650 residues).
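The extraction procedure above can be sketched as a two-pass scan over the per-residue pLDDT values. This is an illustrative reading of the description, not the code used for 3D-AF-Surfer: the function and parameter names are hypothetical, indices are 0-based and inclusive, and the gap between two retained regions is measured as the total number of intervening residues.

```python
from typing import List, Sequence, Tuple


def confident_domains(plddt: Sequence[float], cutoff: float = 70.0,
                      min_len: int = 50, max_gap: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of confident domains in one model."""
    # Pass 1: contiguous runs of residues with pLDDT above the cutoff.
    runs: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(plddt):
        if score > cutoff:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(plddt) - 1))

    # Keep only runs with more than min_len residues.
    runs = [(s, e) for s, e in runs if e - s + 1 > min_len]

    # Pass 2: merge retained runs separated by at most max_gap residues,
    # absorbing the intervening residues regardless of their confidence.
    domains: List[Tuple[int, int]] = []
    for s, e in runs:
        if domains and s - domains[-1][1] - 1 <= max_gap:
            domains[-1] = (domains[-1][0], e)
        else:
            domains.append((s, e))
    return domains
```

A model for which this function returns an empty list would be counted among the structure predictions with no domains in Table 1.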

SCOPe benchmark dataset for structure classification

We downloaded the SCOPe dataset release 2.07, the latest version at the time, from the download page of the SCOPe website (https://scop.berkeley.edu/downloads/). The dataset included 256,391 structures in 1,430 folds after removing structures in class l (Artifacts). For each of the protein structures we used EDTSurf⁵² to generate the solvent excluded surface, for which a 3DZD vector was computed. We computed two types of 3DZD vector for a structure. The first one was computed using all atoms of the protein structure. The second 3DZD was computed using only the main-chain Cα, C, and N atoms from the structure, because this main-chain surface representation performed better in our previous work⁴⁴.

Classification of secondary structure class with bagged SVM

The secondary structure class assignment was performed with a bagged ensemble of SVMs using the secondary structure content of SCOPe domains. In bagging, $N$ = 20 different classifiers were each trained on 5% of the SCOPe dataset selected randomly with replacement. The output classes were then decided by voting. On the training set, the bagged ensemble had an accuracy of 91.5%. This accuracy was higher than those of five other methods we compared, which were a multinomial logistic regression, two SVM architectures, and two expert-designed approaches. In the expert-designed approaches, thresholds on the secondary structure content, i.e. the fractions of amino acids of a protein in α helices, β strands, and coil (other structures), were considered. A detailed comparison of these methods is provided in Table 3.
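The bagging scheme itself (bootstrap samples plus majority voting) is independent of the base classifier, so it can be sketched without an SVM implementation. In the sketch below, `train_base` is a hypothetical caller-supplied function that fits one classifier on a sample and returns a prediction function; in the setting described above it would fit an RBF-kernel SVM, which is not reproduced here. The names, the seed, and the tie-breaking behavior are assumptions.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Sequence

Predictor = Callable[[Sequence[float]], Hashable]
Trainer = Callable[[List[Sequence[float]], List[Hashable]], Predictor]


def train_bagged(X: Sequence[Sequence[float]], y: Sequence[Hashable],
                 train_base: Trainer, n_estimators: int = 20,
                 sample_frac: float = 0.05, seed: int = 0) -> List[Predictor]:
    """Train n_estimators base classifiers, each on a bootstrap sample."""
    rng = random.Random(seed)
    n_sample = max(1, round(sample_frac * len(X)))
    models: List[Predictor] = []
    for _ in range(n_estimators):
        # Draw indices with replacement, so a sample may repeat examples.
        idx = [rng.randrange(len(X)) for _ in range(n_sample)]
        models.append(train_base([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagged(models: Sequence[Predictor], x: Sequence[float]) -> Hashable:
    """Majority vote over the ensemble; ties go to the label seen first."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Because each base classifier sees only a small random subset, the individual models differ from one another, and the vote averages out some of their individual errors.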

Performance metrics

We measured the performance of the method using Accuracy, Precision, Recall and F-measure.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{TP + FN}

F\text{-}Measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. A true positive is a protein pair that belongs to the same fold and that the method correctly predicts to be in the same fold. Analogously, a true negative is a protein pair that belongs to different folds and that the method correctly predicts to belong to different folds.

A false positive is a protein pair that belongs to different folds but that the method wrongly predicts to be in the same fold. A false negative is a protein pair that belongs to the same fold but that the method wrongly predicts to belong to different folds.
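As a worked illustration, the four metrics can be computed from pair counts as below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch, not part of the method.

```python
def pair_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and F-measure from pair counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For example, with 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives, the accuracy is 85/100 = 0.85 and the recall is 40/50 = 0.8.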

3D Zernike descriptors (3DZD)

3DZDs are mathematical, rotation-invariant, moment-based descriptors. For a protein structure, a surface was constructed from a set of atoms and then mapped to a 3D cubic grid of size $N^{3}$ ($N$ = 200). Each voxel (a cube defined by the grid) is assigned either 1 or 0: 1 for a surface voxel that lies closer than 1.7 grid intervals to any triangle defining the protein surface, and 0 otherwise. This grid was considered as a 3D function $f (x)$, for which a series expansion was computed in terms of the Zernike–Canterakis basis¹⁵:

Z_{n l}^{m} (r, ϑ, φ) = R_{n l} (r) Y_{l}^{m} (ϑ, φ)

with $- l \leq m \leq l$, $0 \leq l \leq n$, and $(n - l)$ even. $Y_{l}^{m} (ϑ, φ)$ are spherical harmonics. $R_{n l} (r)$ are radial functions defined by Canterakis, constructed so that $Z_{n l}^{m} (r, ϑ, φ)$ are homogeneous polynomials when written in terms of Cartesian coordinates. 3D Zernike moments of $f (x)$ are defined as the coefficients of the expansion in this orthonormal basis, i.e. by the formula

Ω_{n l}^{m} = \frac{3}{4 π} \int_{∣x∣ \leq 1} f (x) {\bar{Z}}_{n l}^{m} (x) d x

3D Zernike moments change if the 3D object, $f (x)$, is rotated to a different orientation. Thus, they could be used to evaluate differences of shapes convolved with differences in orientation of two objects or to align objects³⁷. To achieve rotation invariance, the moments are collected into $(2 l + 1)$-dimensional vectors $Ω_{n l} = (Ω_{n l}^{l}, Ω_{n l}^{l - 1}, Ω_{n l}^{l - 2}, Ω_{n l}^{l - 3}, \dots, Ω_{n l}^{- l})$, and the rotationally invariant 3D Zernike descriptors $F_{n l}$ are defined as the norms of the vectors $Ω_{n l}$²¹. Thus,

F_{n l} = \sqrt{\sum_{m = - l}^{l} {\left| Ω_{n l}^{m} \right|}^{2}}

Index n is called the order of the descriptor. The rotational invariance of 3D Zernike descriptors means e.g. that calculating $F_{n l}$ for a protein and its rotated version would yield the same result. We used 20 as the order because it gave reasonable results in our previous works on protein 3D shape comparison^17,19,44,53. A 3DZD with an order n of 20 represents a 3D structure as a vector of 121 invariants¹⁹.
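The count of 121 invariants follows from enumerating the index pairs (n, l) allowed by the constraints above. The short sketch below does so and also computes the invariants as vector norms; the function names and the dictionary layout of the moments are assumptions for illustration, and the modulus is taken because the moments are complex in general.

```python
import math

def zernike_index_pairs(order):
    """List all (n, l) with 0 <= l <= n <= order and (n - l) even."""
    return [
        (n, l)
        for n in range(order + 1)
        for l in range(n + 1)
        if (n - l) % 2 == 0
    ]

def descriptor(moments):
    """Map each (n, l) to the norm of its (2l + 1) moments.

    moments: dict from (n, l) to a sequence of complex moments,
    one per m = -l, ..., l (hypothetical input layout).
    """
    return {
        nl: math.sqrt(sum(abs(z) ** 2 for z in vec))
        for nl, vec in moments.items()
    }

if __name__ == "__main__":
    # Number of invariants for the order used in this work (n = 20).
    print(len(zernike_index_pairs(20)))
```

Each n contributes floor(n/2) + 1 values of l, and summing this over n = 0, ..., 20 gives 121.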

Deep neural network for fold classification

Using the generated 3DZDs, we trained a deep neural network that outputs the probability that a given pair of protein structures belongs to the same fold. The network (Fig. 3) takes the 3DZDs of two protein shapes as input. The encoder has three hidden layers of 250, 200, and 150 neurons, respectively, whose outputs were used as the encoding of an input 3DZD. The encoder is connected to the feature comparator, a fully-connected network, which takes the 3DZDs of the two proteins, the encodings from the three hidden layers, four metrics that compare two vectors (the Euclidean distance, the cosine distance, the element-wise absolute difference, and the element-wise product), and two features of the two protein shapes (the differences in the number of vertices and faces). In total, the number of input features of the feature comparator is 2 * 121 + 2 * (250 + 200 + 150) + 2 * 4 + 2 = 1,452. The first term is the 3DZDs of order 20 ($n$ = 20), each a 121-element vector, of the two protein shapes. The second term is the encodings of the two proteins from the three hidden layers. The third term, 2 * 4, comes from the four comparison metrics applied to two representations of the two proteins: the original 3DZDs and the encodings, which concatenate the output of the input layer and the three intermediate layers of the encoder. The last term is the two shape features. The feature comparator outputs a score between 0 and 1 using a sigmoid activation function, which is the probability that the two proteins are classified into the same fold in the SCOPe database.
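The feature count can be checked with a few lines of arithmetic, shown here together with the two distance metrics named in the text. The constant and function names are hypothetical, and the count follows the formula as stated, in which each of the four metrics contributes one feature per representation.

```python
import math

ZD_DIM = 121               # 3DZD length at order n = 20
HIDDEN = (250, 200, 150)   # encoder hidden layer sizes
N_METRICS = 4              # comparison metrics per representation
N_REPRESENTATIONS = 2      # original 3DZDs and encodings
N_SHAPE_FEATURES = 2       # vertex-count and face-count differences

def comparator_input_size():
    """Total number of input features of the feature comparator."""
    return (
        2 * ZD_DIM
        + 2 * sum(HIDDEN)
        + N_REPRESENTATIONS * N_METRICS
        + N_SHAPE_FEATURES
    )

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```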

The training and validation were performed on the aforementioned structure dataset of SCOPe. Out of 256,391 structures in 1,430 unique folds, we set aside 2,541 structures for model validation. For each of the structures in the database, we generated positive and negative pairs. Positive pairs are protein structures that belong to the same fold, while negative pairs are from different folds. For training, we randomly sampled a balanced set of positive and negative pairs based on the batch size (i.e. 32 positive pairs and 32 negative pairs for a batch size of 64). We used Adam for parameter optimization with a binary cross-entropy loss function. The learning rate was explored from 1e−3 to 7e−3 and from 0.1 to 0.7 in our previous work and set to 0.005⁴². The accuracy of the networks was evaluated on the positive and negative pairs generated from the 2,541 structures, which total 167,872 pairs.
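The balanced sampling of training pairs can be sketched as follows. The function name, the dictionary input, and the exact sampling order are assumptions; the sketch only illustrates drawing equal numbers of same-fold and different-fold pairs for one batch.

```python
import random

def sample_balanced_batch(fold_of, batch_size=64, rng=random):
    """Return a list of (id_a, id_b, label) with equal positives and negatives.

    fold_of: dict mapping a structure id to its fold label (hypothetical input).
    Requires at least two folds and one fold with two or more members.
    """
    by_fold = {}
    for sid, fold in fold_of.items():
        by_fold.setdefault(fold, []).append(sid)
    folds = list(by_fold)
    # Positive pairs can only come from folds with at least two members.
    multi = [f for f in folds if len(by_fold[f]) >= 2]

    half = batch_size // 2
    batch = []
    for _ in range(half):
        # Positive pair: two distinct structures from the same fold.
        a, b = rng.sample(by_fold[rng.choice(multi)], 2)
        batch.append((a, b, 1))
    for _ in range(half):
        # Negative pair: one structure from each of two different folds.
        fold_a, fold_b = rng.sample(folds, 2)
        batch.append((rng.choice(by_fold[fold_a]), rng.choice(by_fold[fold_b]), 0))
    rng.shuffle(batch)
    return batch
```

Passing a seeded `random.Random` instance as `rng` makes the batches reproducible.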

To assign a fold to a query protein, the query was compared with 10 randomly selected structures from each SCOPe fold, and the fold that showed the highest probability for the query was assigned. Although each network was trained on the folds of all classes except the artifact class (class I), in the pie charts in Fig. 2 we assigned domains only to folds that belong to the α, β, αβ (α+β and α/β), and small protein classes, because the other classes consider factors other than structural features.
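The assignment step can be sketched as below. The text does not state how the probabilities of the 10 reference structures of a fold are combined, so taking their maximum is an assumption of this sketch, as are the function name and the `same_fold_probability` callable, which stands in for the trained network.

```python
import random

def assign_fold(query, fold_members, same_fold_probability, n_refs=10, rng=random):
    """Return (fold, probability) for the best-scoring fold.

    fold_members: dict mapping a fold label to a non-empty list of structures.
    same_fold_probability: callable(query, reference) -> value in [0, 1];
    a placeholder for the trained network.
    """
    best_fold, best_prob = None, -1.0
    for fold, members in fold_members.items():
        # Compare against up to n_refs randomly chosen structures of the fold.
        refs = rng.sample(members, min(n_refs, len(members)))
        # Assumed aggregation: score a fold by its best-matching reference.
        prob = max(same_fold_probability(query, ref) for ref in refs)
        if prob > best_prob:
            best_fold, best_prob = fold, prob
    return best_fold, best_prob
```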

Disorder region prediction methods

We used two methods, flDPnn⁵⁰ and SPOT-Disorder-Single⁴⁹. flDPnn uses profile information computed by three other methods, which is processed by a deep learning architecture to output residue-wise disorder predictions. flDPnn showed the top performance in the most recent Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment⁵⁴. Following the instructions of the software, residues with a disorder propensity score above 0.3 were considered disordered. We used the open-source implementation and trained models at http://biomine.cs.vcu.edu/servers/flDPnn/.

SPOT-Disorder-Single is a fast method that computes predictions from the single sequence of the query. It uses an ensemble of nine models, each constructed from ResNet blocks and/or LSTM BRNN blocks. Following the instructions of the software, residues with a disorder propensity score above 0.426 were considered disordered. We adopted the local version of SPOT-Disorder-Single available at http://sparks-lab.org/server/SPOT-Disorder-Single and kept the default configuration.
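The two cutoffs can be applied to per-residue propensity scores as in the following sketch. The function names and the plain list input are assumptions; parsing the actual output files of either program is not shown.

```python
# Disorder propensity cutoffs stated in the text for each method.
DISORDER_CUTOFFS = {"flDPnn": 0.3, "SPOT-Disorder-Single": 0.426}

def disordered_residues(scores, method):
    """Flag residues whose propensity score is above the method's cutoff."""
    cutoff = DISORDER_CUTOFFS[method]
    return [score > cutoff for score in scores]

def disordered_fraction(scores, method):
    """Fraction of residues flagged as disordered (0.0 for an empty input)."""
    flags = disordered_residues(scores, method)
    return sum(flags) / len(flags) if flags else 0.0
```

Because the cutoffs differ, the same score can be labeled disordered by one method and ordered by the other, so scores from the two methods should be thresholded separately.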

Statistics and reproducibility

The computational run time experiments (Table 2) were performed three times. We reported the parameters used to reproduce SCOPe database fold classification and released the trained neural network to reproduce the AlphaFold2 database fold classification.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Supplementary information

Supplementary Information (713.3KB, pdf)

Reporting Summary (912.7KB, pdf)

Acknowledgements

This work was partly supported by the National Institutes of Health (R01GM133840, R01GM123055, and 3R01GM133840-02S1) and the National Science Foundation (CMMI1825941, MCB1925643, and DBI2003635).

Author contributions

D.K. conceived the study. T.A. developed the deep network and performed the benchmark studies and associated analyses. V.B. developed the website in discussion with T.A. C.C. and G.T. participated in constructing the benchmark dataset and processing AlphaFold2 models. Y.K. participated in the website development and the genome-level fold analysis. Z.Z. and R.J. analyzed low-confidence regions of AlphaFold2 models. All authors analyzed the results. T.A., C.C., Z.Z., R.J., Y.K., and D.K. drafted the manuscript and D.K. critically edited it. All authors approved the manuscript.

Peer review

Peer review information

Communications Biology thanks Shiyong Liu and the other, anonymous, reviewers for their contribution to the peer review of this work. Primary Handling Editors: Yuedong Yang and Gene Chong.

Data availability

Data used in this webserver were obtained from PDB and the AlphaFold Database and are fully and freely available to the public.

Code availability

The webserver described in this work is freely available to the public at https://kiharalab.org/3d-surfer/submitalphafold.php. The code used for classifying a protein structure into a secondary structure class and a fold is available at https://github.com/kiharalab/3d-af_surfer.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Tunde Aderinwale, Vijay Bharadwaj.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-022-03261-8.

References

1. Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021. doi: 10.1038/s41586-021-03819-2.
2. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
3. Yang J, et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA. 2020;117:1496–1503. doi: 10.1073/pnas.1914677117.
4. Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
5. Jain A, et al. Analyzing effect of quadruple multiple sequence alignments on deep learning based protein inter-residue distance prediction. Sci. Rep. 2021;11:7574. doi: 10.1038/s41598-021-87204-z.
6. Xu J. Distance-based protein folding powered by deep learning. Proc. Natl Acad. Sci. USA. 2019;116:16856–16865. doi: 10.1073/pnas.1821309116.
7. Zheng W, et al. Protein structure prediction using deep learning distance and hydrogen-bonding restraints in CASP14. Proteins. 2021. doi: 10.1002/prot.26193.
8. Bhattacharya D, Cao R, Cheng J. UniCon3D: de novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics. 2016;32:2791–2799. doi: 10.1093/bioinformatics/btw316.
9. AlQuraishi M. End-to-end differentiable learning of protein structure. Cell Syst. 2019;8:292–301 e293. doi: 10.1016/j.cels.2019.03.006.
10. Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J. Critical assessment of methods of protein structure prediction (CASP)-Round XIV. Proteins. 2021. doi: 10.1002/prot.26237.
11. Lupas AN, et al. The breakthrough in protein structure prediction. Biochem J. 2021;478:1885–1890. doi: 10.1042/BCJ20200963.
12. Millan C, et al. Assessing the utility of CASP14 models for molecular replacement. Proteins. 2021. doi: 10.1002/prot.26214.
13. Kryshtafovych A, et al. Computational models in the service of X-ray and cryo-electron microscopy structure determination. Proteins. 2021. doi: 10.1002/prot.26223.
14. Tunyasuvunakool K, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021. doi: 10.1038/s41586-021-03828-1.
15. Canterakis N. 3D Zernike moments and Zernike affine invariants for 3D image analysis and recognition. Proc. 11th Scandinavian Conference on Image Analysis. 1999:85–93.
16. Novotni M, Klein R. 3D Zernike descriptors for content based shape retrieval. Proc. 8th ACM Symposium on Solid Modeling and Applications. 2003:216–225.
17. La D, et al. 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics. 2009;25:2843–2844. doi: 10.1093/bioinformatics/btp542.
18. Esquivel-Rodriguez J, et al. Navigating 3D electron microscopy maps with EM-SURFER. BMC Bioinform. 2015;16:181. doi: 10.1186/s12859-015-0580-6.
19. Sael L, et al. Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins. 2008;72:1259–1273. doi: 10.1002/prot.22030.
20. Han X, Wei Q, Kihara D. Protein 3D structure and electron microscopy map retrieval using 3D-SURFER2.0 and EM-SURFER. Curr. Protoc. Bioinform. 2017;60:3 14 11–13 14 15. doi: 10.1002/cpbi.37.
21. Kihara D, Sael L, Chikhi R, Esquivel-Rodriguez J. Molecular surface representation using 3D Zernike descriptors for protein shape comparison and docking. Curr. Protein Pept. Sci. 2011;12:520–530. doi: 10.2174/138920311796957612.
22. Venkatraman V, Yang YD, Sael L, Kihara D. Protein-protein docking using region-based 3D Zernike descriptors. BMC Bioinform. 2009;10:407. doi: 10.1186/1471-2105-10-407.
23. Venkatraman V, Sael L, Kihara D. Potential for protein surface shape analysis using spherical harmonics and 3D Zernike descriptors. Cell Biochem. Biophys. 2009;54:23–32. doi: 10.1007/s12013-009-9051-x.
24. Venkatraman V, Chakravarthy PR, Kihara D. Application of 3D Zernike descriptors to shape-based ligand similarity searching. J. Cheminformatics. 2009;1:19. doi: 10.1186/1758-2946-1-19.
25. Shin WH, Zhu X, Bures MG, Kihara D. Three-dimensional compound comparison methods and their application in drug discovery. Molecules. 2015;20:12841–12862. doi: 10.3390/molecules200712841.
26. Han X, Terashi G, Christoffer C, Chen S, Kihara D. VESPER: global and local cryo-EM map alignment using local density vectors. Nat. Commun. 2021;12:2090. doi: 10.1038/s41467-021-22401-y.
27. Sael L, Kihara D. Detecting local ligand-binding site similarity in nonhomologous proteins by surface patch comparison. Proteins. 2012;80:1177–1195. doi: 10.1002/prot.24018.
28. Zhu X, Xiong Y, Kihara D. Large-scale binding ligand prediction by improved patch-based method Patch-Surfer2.0. Bioinformatics. 2015;31:707–713. doi: 10.1093/bioinformatics/btu724.
29. Shin WH, Bures MG, Kihara D. PatchSurfers: two methods for local molecular property-based binding ligand prediction. Methods. 2016;93:41–50. doi: 10.1016/j.ymeth.2015.09.026.
30. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013;29:2722–2728. doi: 10.1093/bioinformatics/btt473.
31. Li B, et al. Characterization of local geometry of protein surfaces with the visibility criterion. Proteins. 2008;71:670–683. doi: 10.1002/prot.21732.
32. Hendlich M, Rippmann F, Barnickel G. LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J. Mol. Graph Model. 1997;15:359–363. doi: 10.1016/s1093-3263(98)00002-3.
33. Holm L. Benchmarking fold detection by DaliLite v.5. Bioinformatics. 2019;35:5326–5327. doi: 10.1093/bioinformatics/btz536.
34. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–2309. doi: 10.1093/nar/gki524.
35. Deng L, Zhong G, Liu C, Luo J, Liu H. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching. BMC Bioinform. 2019;20:662. doi: 10.1186/s12859-019-3235-1.
36. Brown P, Pullan W, Yang Y, Zhou Y. Fast and accurate non-sequential protein structure alignment using a new asymmetric linear sum assignment heuristic. Bioinformatics. 2016;32:370–377. doi: 10.1093/bioinformatics/btv580.
37. Ljung F, Andre I. ZEAL: Protein structure alignment based on shape similarity. Bioinformatics. 2021. doi: 10.1093/bioinformatics/btab205.
38. Chandonia JM, Fox NK, Brenner SE. SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res. 2019;47:D475–D481. doi: 10.1093/nar/gky1134.
39. Fox NK, Brenner SE, Chandonia JM. SCOPe: structural classification of proteins–extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014;42:D304–D309. doi: 10.1093/nar/gkt1240.
40. Breiman L. Bagging predictors. Mach. Learn. 1996;24:123–140.
41. Orengo CA, et al. CATH–a hierarchic classification of protein domain structures. Structure. 1997;5:1093–1108. doi: 10.1016/s0969-2126(97)00260-8.
42. Raffo A, et al. SHREC 2021: retrieval and classification of protein surfaces equipped with physical and chemical properties. Comput. Graph. 2021;99:1–21.
43. Langenfeld F, et al. Surface-based protein domains retrieval methods from a SHREC2021 challenge. J. Mol. Graph. Model. 2022;111:108103. doi: 10.1016/j.jmgm.2021.108103.
44. Sael L, Kihara D. Improved protein surface comparison and application to low-resolution protein structure data. BMC Bioinform. 2010;11:S2. doi: 10.1186/1471-2105-11-S11-S2.
45. Sykes J, Holland BR, Charleston MA. Benchmarking methods of protein structure alignment. J. Mol. Evol. 2020;88:575–597. doi: 10.1007/s00239-020-09960-2.
46. Pandurangan AP, Stahlhacke J, Oates ME, Smithers B, Gough J. The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. Nucleic Acids Res. 2019;47:D490–D494. doi: 10.1093/nar/gky1130.
47. Gerstein M. Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins. 1998;33:518–534. doi: 10.1002/(sici)1097-0134(19981201)33:4<518::aid-prot5>3.0.co;2-j.
48. Kihara D, Skolnick J. Microbial genomes have over 72% structure assignment by the threading algorithm PROSPECTOR_Q. Proteins. 2004;55:464–473. doi: 10.1002/prot.20044.
49. Hanson J, Paliwal K, Zhou Y. Accurate single-sequence prediction of protein intrinsic disorder by an ensemble of deep recurrent and convolutional architectures. J. Chem. Inf. Model. 2018;58:2369–2376. doi: 10.1021/acs.jcim.8b00636.
50. Hu G, et al. flDPnn: accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat. Commun. 2021;12:4438. doi: 10.1038/s41467-021-24773-7.
51. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
52. Xu D, Zhang Y. Generating triangulated macromolecular surfaces by Euclidean Distance Transform. PLoS ONE. 2009;4:e8140. doi: 10.1371/journal.pone.0008140.
53. Sael L, La D, Li B, Rustamov R, Kihara D. Rapid comparison of properties on protein surface. Proteins. 2008;73:1–10. doi: 10.1002/prot.22141.
54. Necci M, Piovesan D, CAID Predictors, DisProt Curators, Tosatto SCE. Critical assessment of protein intrinsic disorder prediction. Nat. Methods. 2021;18:472–481. doi: 10.1038/s41592-021-01117-3.
