Abstract
Proteins are essential to nearly all cellular mechanisms and are the effectors of the cell's activities. As such, they often interact through their surface with other proteins or with other cellular ligands such as ions or organic molecules. Evolution generates a wide variety of proteins with unique abilities, but also proteins with related functions and hence similar 3D surface properties (shape, physico-chemical properties, …). Protein surfaces are therefore of primary importance for their activity. In the present work, we assess the ability of different methods to detect such similarities based on the geometry of the protein surfaces (described as 3D meshes), using either their shape only, or their shape and the electrostatic potential (a biologically relevant property of protein surfaces). Five different groups participated in this contest using the shape-only dataset, and one group extended its pre-existing method to handle the electrostatic potential. Our comparative study reveals both the ability of the methods to detect related proteins and their difficulty in distinguishing between highly related proteins. Our study also allows us to analyze the putative influence of electrostatic information in addition to that of protein shape alone. Finally, the discussion compares the results of the extended method with those it obtained in previous contests. The source code of each presented method has been made available online.
Keywords: SHREC2021, Proteins surface, 2000 MSC: 92-08
1. Introduction
Proteins are key molecular effectors at the cellular level. They are linear assemblies of amino acids that fold into specific, energy-driven 3D structures [1,2] linked to their activity. Identifying similarities within protein structures is therefore of tremendous importance in various fields, from biochemistry to drug design. Numerous methods in structural bioinformatics have been dedicated to the structural similarity search of proteins; they mainly rely on the analysis of the 3D point clouds defined by the 3D coordinates of their individual atoms [3–7]. These methods are mostly based on the conserved core structure of proteins and may therefore be inefficient at detecting proteins that share similar surfaces. The protein surface is a higher-level description of the protein structure that abstracts the underlying protein sequence, structure and fold into a continuous shape with geometric and chemical features that fingerprint its interactions with the other molecules of its environment (solvent, ligands, proteins, nucleic acids, …) [8]. Methods able to detect surficial similarities between proteins are thus of major importance.
Only a limited number of methods have been proposed so far:
Sael et al. use 3D Zernike descriptors to detect either global or local similarities between protein surfaces [9]. This method can also use surficial physico-chemical properties such as the electrostatic potential or the hydrophobicity [10].
MaSIF (molecular surface interaction fingerprinting) is a geometric deep learning framework that fingerprints biomolecular surfaces [11]. Both geometric and chemical features are extracted and embedded into numerical vectors, which are subsequently processed in an application-dependent manner.
FTIP (Farthest point sampling-enhanced Triangulation-based Iterative-closest-Point) is a global protein surface comparison method that uses the farthest point sampling method to extract a subset of protein surface points, and then uses an efficient triangulation-based Iterative-closest-Point algorithm to align these so-called feature points [12].
BioZernike [13] adopts a slightly different approach: instead of using the 3D point cloud formed by the atomic coordinates, it uses the electron density distribution. A 3D Zernike moment normalization procedure is applied to orient the electron density volumes to be compared, allowing for fast retrieval of proteins and protein assemblies.
The aim of this track is therefore to assess the performance of currently available methods and to stimulate the development of novel methods. To this end, the dataset encompasses (1) a variety of protein domains, some of them closely related, to query the dataset, (2) a dataset of experimental structures that contain one or more domains, (3) a few protein shapes corresponding to proteins that contain two of the query domains, (4) two versions of the same dataset, one made of the protein shapes only, and the other with an additional physico-chemical property, the electrostatic potential, encoded along the shapes. We selected this surficial physico-chemical feature as it is the main driving force in many biological recognition processes, such as protein-ligand and protein-protein interactions [14].
In the present work, we detail the dataset proposed by the challenge organizers to the participants and how it differs from the previously proposed datasets in Section 2. In Section 3, we describe the 5 methods submitted to the contest. The evaluation metrics are briefly introduced in Section 4, and the performance of the methods is presented in Section 5. Finally, we discuss the outcomes of the different submitted methods in Section 6.
2. The dataset
2.1. Constitution of the SHREC′21 dataset
The SHREC′21 protein dataset is based on the Pfam 33.1 database [16]. This database classifies protein sequences into domains and families, which can be grouped into clans whenever they are evolutionarily related. Protein domains of structures from the Protein Data Bank (PDB [17,18]) can therefore be attributed to a Pfam domain and, possibly, a clan. To build the track dataset, we relied on this notion of domain, and manually selected 10 Pfam domains: the SH2 domain (PfamID PF00017), the SH3 domain (PfamID PF00018), the variant SH3 domain (SH3_2, PfamID PF07653), the PDZ domain (PfamID PF00595), the PDZ_6 domain (PfamID PF17820), the peptidase family M50 (m50, PfamID PF02163), the bromodomain family (PfamID PF00439), the DNA-binding domain of the STAT protein (STAT-binding, PfamID PF02864), the PHD-finger domain (PfamID PF00628), and the C2H2 Zinc-finger domain (zf-C2H2, PfamID PF00096).
For each selected domain, all corresponding structures from the PDB were listed, and the best resolution structures were retrieved to serve as queries for the track. When applicable, the NMR (Nuclear Magnetic Resonance) structures were assigned an arbitrary resolution of 2.25 Å [19], while structures with no resolution were discarded. The residues corresponding to the Pfam domains were then extracted from the selected structures when necessary, so that only the selected domains remained. For example, only the DNA-binding domain of the STAT (Signal Transducer and Activator of Transcription) protein was kept as a query, its other domains being discarded.
The remaining structures were filtered according to their Uniprot [20] identifier, and duplicates were discarded to (1) ensure a diversity of sequence structures among the dataset, (2) limit the dataset size to a tractable size given the track timeline. Finally, only the best resolution structures for each Uniprot entry were kept (Table A.6). When NMR structures were selected, only the first model was considered. Unlike the query structures, we kept the other domains present in these structures that eventually constitute the dataset. Therefore, many dataset structures display several domains, at least one of which is one of the query domains.
For all structures (queries and dataset structures), we removed all hetero-atoms and unwanted chains. The resulting PDB structures were then protonated using pdb2pqr [21], using propka [22,23] to compute the pKa values of the ionizable groups at pH = 7.2. The solvent-excluded surfaces of all protonated structures were computed using the default parameters of EDTSurf [24,25], except that inner cavities were discarded. We then computed the electrostatics using the APBS suite [26], and used the multivalue program to compute the electrostatic potential at the mesh vertex locations. Two datasets were then assembled, one with only the protein surface shapes (shape-only dataset) and one combining the protein surface shapes and electrostatic values (shape + electrostatic dataset). Similarly, two sets of query surfaces were produced (shape-only and shape + electrostatic). Each dataset (shape-only and shape + electrostatic) includes 554 molecular surfaces, which were made available to the track participants, along with the 2 sets of 10 queries, on the track web page (http://shrec2021.drugdesign.fr).
Regarding the dataset, it is important to note that the SH3 and SH3_2 domains were annotated as similar according to HHsearch (a tool commonly used to detect homologous proteins [27]), as were the PDZ, PDZ_6 and m50 domains. We present in Fig. 1 the TM-score matrix for all queries of the dataset. A TM-score above 0.5 indicates that the two structures are likely to share the same topology, while unrelated structures are usually associated with TM-scores below 0.17 [28]. The SH3 and SH3_2 query structures show a TM-score of 0.84, while the PDZ and PDZ_6 query structures show a TM-score of 0.79. Surprisingly, the m50 query structure only has TM-scores of 0.28 and 0.32 with the PDZ and PDZ_6 structures, respectively. A visual inspection of these structures confirmed that the peptidase M50 is topologically different from both the PDZ and PDZ_6 domains: while the peptidase M50 is mainly α-helical, PDZ and PDZ_6 are mixed α − β proteins. Overall, most query structures present an intermediate topological similarity with all other queries, as evidenced by the fact that all TM-scores range from 0.19 to 0.47, except for the aforementioned pairs of classes (SH3/SH3_2 and PDZ/PDZ_6).
Fig. 1.

Structural similarity between the protein structure queries. The TM-score (in the (0, 1] range) measures the topological similarity between two protein structures: the higher the TM-score, the more similar the two structures. Scores below 0.17 correspond to unrelated proteins, while those above 0.5 usually indicate two structures having the same fold [15]. The corresponding RMSD values are presented in Fig. B.11.
Finally, as the dataset encompasses multi-domain proteins, 22 dataset proteins display two of the query domains. Namely, 9 proteins of the dataset encompass both an SH2 and an SH3 domain, 6 proteins encompass both an SH2 and a STAT-binding domain, 4 proteins encompass both a bromodomain and a PHD-finger domain, 2 proteins encompass both a PDZ_6 and a peptidase family M50 domain, and one structure encompasses both a PDZ and a peptidase M50 domain. The final structure of the dataset and the size of each class are summarized in Fig. 2.
Fig. 2.

The UpSet plot of the ten selected Pfam domains in the SHREC2021 challenge datasets. The dataset is composed of 554 individual shapes, of which 22 bear two of the domains of the dataset.
2.2. Comparison to previous SHREC datasets on proteins
Compared to previous SHREC datasets dealing with protein structures or surfaces [29–34], the SHREC′21 protein track dataset is characterized by four main aspects: (1) the presence of two datasets representing the same set of proteins, one shape-only and one shape + electrostatic dataset, (2) the close evolutionary relationship of some of the query domains, further characterized by a similar topology (Fig. 1), (3) the intermediate similarity of the domain topologies (Fig. 1), and (4) the use of individual domains to query a dataset of single- as well as multi-domain protein shapes.
The main novelty of the SHREC′21 track is arguably the availability of protein surfaces with electrostatic values, a feature that has been shown to improve the retrieval performance of protein surfaces [11,35]. This additional feature might therefore help to better distinguish structurally related proteins based on their surficial properties and improve the methods' performance.
2.3. Challenge proposed to the participants
SHREC, or 3D Shape Retrieval Challenges, are challenges primarily organized in order to evaluate the effectiveness of 3D-shape retrieval algorithms. A group organizes a challenge by building up a dataset, then proposes the challenge publicly to the community, and finally gathers, analyses and verifies the results. The theme of the challenge may vary from one to another, but all challenges take place in a limited time, which ranges from 1 to 1 1/2 months.
In our contest, the participants were asked, given each of the 10 query surfaces, to retrieve the molecular surfaces of proteins from the dataset that encompass the same domain as the query. Each query-to-dataset-surface distance was expected to be expressed as a dissimilarity score. The results were therefore 10 × 554 matrices of dissimilarity scores. Each participant was allowed to submit one dissimilarity matrix for each dataset: one matrix for the shape-only dataset, and one matrix for the shape + electrostatic dataset.
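As a minimal sketch of how such a submission can be read, the snippet below turns a dissimilarity matrix into one ranked retrieval list per query. The function name `rank_dataset` and the toy values are illustrative assumptions and are not part of the track material; the real matrices have 10 rows and 554 columns.

```python
import numpy as np

def rank_dataset(dissimilarity: np.ndarray) -> np.ndarray:
    """Return, for each query (row), the dataset indices sorted from the
    most similar surface to the least similar one."""
    # A lower dissimilarity means a closer match, so each row is sorted
    # in ascending order; a stable sort keeps ties in dataset order.
    return np.argsort(dissimilarity, axis=1, kind="stable")

# Toy example (hypothetical values): 2 queries against 4 dataset surfaces.
toy = np.array([[0.9, 0.1, 0.5, 0.3],
                [0.2, 0.8, 0.4, 0.6]])
ranking = rank_dataset(toy)
```

Each row of `ranking` can then be compared with the ground-truth domain labels to compute retrieval metrics.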
3. Participants and methods
Among the seven groups that initially registered for this track, only 5 were able to produce results in time and returned a shape-only dissimilarity matrix. Only one method (3DZD, see Section 3.1) returned a dissimilarity matrix for the shape + electrostatic dataset. The other groups were either unable to produce a satisfying training dataset or unwilling to extend their algorithm to handle the electrostatic values. In the following subsections, each group describes its method.
3.1. Network trained with encoded 3DZD (3DZD) by Tunde Aderinwale, Charles Christoffer, Woong-Hee Shin, Genki Terashi, Xiao Wang & Daisuke Kihara
Our group submitted two (shape-only and shape + electrostatic) dissimilarity matrices of the target proteins to the 10 query proteins provided by the organizers. These methods are based on the 3D Zernike Descriptor (3DZD). 3DZD is the rotation-invariant shape descriptor derived from the coefficients of 3D Zernike-Canterakis polynomials [36].
3.1.1. Summary of the 3DZD method
Similar to SHREC′20 [34], our group trained two types of neural network to output a score that measures the dissimilarity between a pair of protein shapes. Briefly, the first framework (the Extractor model) is structured into multiple layers: an encoder layer with 3 hidden units of size 250, 200 and 150, a feature comparator layer which computes the Euclidean distance, cosine distance, element-wise absolute difference and product, and a fully connected layer with 2 hidden units of size 100 and 50. There are multiple hidden units in each layer, and our group uses the ReLU activation function everywhere except at the output of the fully connected layer, where the sigmoid activation function is used to output the probability that the two proteins belong to the same protein or species level in the SCOPe dataset classification [37,38]. The second framework (the end-to-end model) is similar to the first, except that the feature comparator layer is removed and the output of the encoder is directly connected to the fully connected layer.
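The four quantities computed by the feature comparator layer can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `compare_features` is hypothetical, and concatenating the four results into one vector is an assumption, since the text does not say how they are combined before the fully connected layer.

```python
import numpy as np

def compare_features(a, b, eps=1e-12):
    """Compare two encoded descriptors of equal length.

    Returns [Euclidean distance, cosine distance, |a - b|, a * b]
    concatenated into one vector (the concatenation is an assumption).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    euclidean = np.linalg.norm(a - b)
    # Cosine distance = 1 - cosine similarity; eps guards against zero vectors.
    cosine = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    abs_diff = np.abs(a - b)   # element-wise absolute difference
    product = a * b            # element-wise product
    return np.concatenate(([euclidean, cosine], abs_diff, product))

# Two orthogonal toy vectors; with 150-dimensional encodings the output
# would have 2 + 150 + 150 = 302 entries.
features = compare_features([1.0, 0.0], [0.0, 1.0])
```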
The network is trained on the latest SCOPe dataset of 259,385 protein structures. 2500 protein structures were set aside for network validation. Proteins in Class I (Artifacts) were removed. Each of the two network frameworks is trained with two datasets. The first dataset is 3DZDs of the surface shape of proteins and the second one is feature vectors that concatenate 3DZD of shape and 3DZD of the electrostatic properties.
Our group examined the performance of the networks on the validation dataset to determine which models to use. For the shape-only dataset, we submitted predictions generated by the Extractor model. For the shape + electrostatic dataset, we submitted the average predictions between the Extractor model and the end–to–end model.
For each protein in the provided dataset, our group performs a pre–processing step as follows: (1) the PLY mesh data file is converted to a volumetric skin representation (Situs file) where points within 1.7 grid intervals are assigned with values that are interpolated from the mesh [9]. For the electrostatic features, the interpolated values are the potentials at the mesh vertices. For the shape feature, a constant of 1 is assigned to grids that overlap with the surface. (2) The resulting Situs file is fed into the EM-Surfer pipeline [39] to compute 3DZD.
3.1.2. Runtimes and computational cost
It takes approximately 12–13 min to pre-process each file. Generating the 3DZD took ~8.00 s on average for each protein on an Intel® Xeon® CPU E5-2630 0 @ 2.30 GHz. The training of each model took 12 h. Dissimilarity prediction between two proteins using the trained model took ~0.22 s on average on an Nvidia® Titan X GPU. The averaging of the two matrices was almost instantaneous. The code is available at the following url: https://github.com/kiharalab/shrec_2021_shape_retrieval.
3.2. ProteinNet: deep learning based protein characterization from 3D point clouds (ProteinNet) by Halim Benhabiles, Karim Hammoudi, Adnane Cabani, Feryal Windal & Mahmoud Melkemi
Our group proposes a deep learning approach to calculate a protein descriptor from its 3D point cloud. To this end, we developed a variant of PointNet [40], a point cloud deep architecture dedicated to 3D classification and segmentation. We adapted this architecture in order to learn an affine transformation matrix that aligns the coordinates of the input 3D protein point cloud into a canonical representation. The new representation maintains interesting properties demonstrated in Ref. [40], including invariance to rigid geometric transformations as well as to point order permutations. The ProteinNet deep architecture is illustrated in Fig. 3. More specifically, the architecture is based on a PT-Net module (Protein Transformer Network), which is inspired by the T-Net (Transformer Network) module of the original PointNet architecture. The PT-Net module is trained to predict an affine transformation matrix M that is constrained to be close to an orthogonal matrix, namely |(M.Mt) − I| = 0 (step 1 in Fig. 3). The matrix M is used to transform the input protein into its canonical representation (step 2 in Fig. 3). A cosine similarity loss between the original protein and the transformed one is then calculated (step 3 in Fig. 3) in order to back-propagate the error over the network (step 4 in Fig. 3) and optimize the matrix M.
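The orthogonality constraint on M can be illustrated with a small NumPy sketch. The function name `orthogonality_penalty` and the use of the Frobenius norm are assumptions made for illustration; the text only states that M·Mᵗ should be close to the identity.

```python
import numpy as np

def orthogonality_penalty(M):
    """Frobenius norm of (M @ M.T - I); it is zero exactly when M is orthogonal."""
    M = np.asarray(M, dtype=float)
    identity = np.eye(M.shape[0])
    # np.linalg.norm on a 2D array defaults to the Frobenius norm.
    return np.linalg.norm(M @ M.T - identity)

# A rotation about the z axis is orthogonal, so its penalty vanishes.
angle = np.pi / 3
rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle),  np.cos(angle), 0.0],
                     [0.0,            0.0,           1.0]])
penalty_rotation = orthogonality_penalty(rotation)

# A uniform scaling by 2 is not orthogonal: M @ M.T - I = 3 * I.
penalty_scaling = orthogonality_penalty(2.0 * np.eye(3))
```

In a training setting, such a term would typically be added to the main loss as a regularizer, as done for the T-Net in Ref. [40].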
Fig. 3.

ProteinNet deep architecture for protein point cloud transformation into canonical representation. Step (1): affine transformation matrix estimation. Step (2): protein point cloud transformation using the estimated affine matrix. Step (3): similarity calculation between the original protein point cloud (the input) and its transformed point cloud. Step (4): cosine similarity loss calculation between the original input protein point cloud and its transformation; and back–propagation over the network to optimize the estimation of the affine transformation matrix.
3.2.1. PT-Net module
The module is composed of a sequence of 3 convolution blocks (32, 64 and 512 layers) followed by a global max pooling layer and 3 successive dense layers (256, 128 and 9). As shown in Fig. 3, each convolution block as well as each dense layer (except the last one) is followed by a batch normalization and a hyperbolic tangent activation function. The output of the last dense layer of 9 units is reshaped into the (3 × 3) matrix M (see details in Ref. [40]).
3.2.2. Data preparation and architecture training
All the proteins of the dataset of the track have been sampled to 2048 points using a Poisson disk sampling technique [41] and then normalized into a zero-center unit sphere based on their respective minimum bounding spheres [42]. The architecture has then been trained using a batch size of 16 on 80% of the dataset over 150 epochs and validated on the remaining 20% of the data. The training data were augmented on-the-fly (during the training process) by adding some geometric noise (e.g. random displacement of point coordinates in a limited interval).
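The normalization and on-the-fly augmentation steps can be sketched as below. This is a simplified illustration and not the authors' code: the centroid and the largest point radius are used in place of an exact minimum bounding sphere, and the function names and the noise amplitude `max_shift` are hypothetical.

```python
import numpy as np

def normalize_to_unit_sphere(points):
    """Centre a point cloud and scale it so that it fits in the unit sphere.

    Assumption: the centroid approximates the bounding-sphere centre; an
    exact minimum bounding sphere (Ref. [42]) would need a dedicated algorithm.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    if radius == 0.0:
        return centered  # degenerate cloud: all points coincide
    return centered / radius

def jitter(points, rng, max_shift=0.01):
    """Displace every coordinate by uniform noise in [-max_shift, max_shift]."""
    return points + rng.uniform(-max_shift, max_shift, size=points.shape)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(2048, 3)) * 5.0 + 10.0   # synthetic stand-in for a protein
unit_cloud = normalize_to_unit_sphere(cloud)
noisy_cloud = jitter(unit_cloud, rng)
```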
3.2.3. Protein feature extractor
The trained ProteinNet model has then been exploited to calculate a protein feature descriptor, for each input protein, by extracting its intermediate Global Max Pooling hidden layer. This descriptor corresponds to a 1-dimension vector of 512 values.
3.2.4. Dissimilarity matrix computation
The dissimilarity matrix between the ten protein shape queries and the set of 554 protein shapes has been calculated using the Euclidean distance between their respective 512-dimensional feature vectors.
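A minimal NumPy sketch of this computation is given below, with random vectors standing in for the learned descriptors (the actual descriptors come from the trained network and are not reproduced here). The function name `euclidean_dissimilarity` is illustrative.

```python
import numpy as np

def euclidean_dissimilarity(queries, dataset):
    """Pairwise Euclidean distances between query and dataset descriptors.

    queries: (n_queries, d) array; dataset: (n_shapes, d) array.
    Returns an (n_queries, n_shapes) dissimilarity matrix.
    """
    # Broadcasting builds every query/dataset difference vector at once.
    diff = queries[:, None, :] - dataset[None, :, :]
    return np.linalg.norm(diff, axis=2)

rng = np.random.default_rng(0)
queries = rng.normal(size=(10, 512))     # placeholder query descriptors
dataset = rng.normal(size=(554, 512))    # placeholder dataset descriptors
D = euclidean_dissimilarity(queries, dataset)
```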
3.2.5. Runtimes and computational cost
This framework has been developed in Python 3.7.6 using different libraries, namely Open3D 0.8.0.0, and Keras 2.2.4-tf on a TensorFlow-GPU 2.1.0 backend. The experiments have been conducted on an Intel Xeon® Gold® 5118 CPU@2.30 GHz with 128 GB of memory and NVIDIA® Tesla® T4 GPU with 16 GB of memory. The running times in s of each stage performed on CPU are reported in Table 1 for one protein. Table 2 shows the training times of the ProteinNet model trained on GPU. The code is available at the following url: https://github.com/Benhabiles-JUNIA/ProteinNet.
Table 1.
Running times in s using CPU for each stage of the ProteinNet framework obtained for one protein.
| Point cloud size (largest and smallest proteins) | Point cloud sampling (2,048 points) and normalization | Feature descriptor (512) calculation for one protein | Distances from one protein to the whole dataset (554 proteins) |
|---|---|---|---|
| 582,496 points | 1.14 | 0.005 | 0.004 |
| 37,658 points | 0.9 | | |
Table 2.
Training time in s of the ProteinNet model trained on GPU.
| Deep learning model | ProteinNet |
|---|---|
| Training data size | 499 |
| Epochs | 150 |
| Training time (s) | 155 |
3.3. Fisher kernel agglomerated local Augmented Point pair feature descriptors, trained with Gaussian Mixture Model (APPFD-FK-GMM) by Ekpo Otu, Reyer Zwiggelaar, David Hunter & Yonghuai Liu
Our group presents a novel 3D shape retrieval method, APPFD-FK-GMM (see Fig. 4), based on Fisher Kernel (FK) and Gaussian Mixture Model (GMM) agglomeration of the Augmented Point-pair Feature Descriptor (APPFD) [43]: a 3D key point shape descriptor that robustly captures the physical geometric characteristics of 3D surface regions. The previous APPFD binning technique involves bucketing the 6-dimensional features of the APPFD into a multi-dimensional histogram with at least 7 bins in each feature dimension, resulting in a final feature vector (APPFD) of approximately 7^6 = 117,649 dimensions, which is a very high-dimensional descriptor.
Fig. 4.

APPFD-FK-GMM processing pipeline involving Phase 1 (fitting a GMM to all the keypoints or LSPs descriptor, i.e. local APPFDs from each 3D protein surface and for all database protein surfaces) and Phase 2 (computing a single compact descriptor called fisher-vector (FV) for each 3D protein by aggregating all its keypoints or local APPFDs using the fisher kernel (FK) framework and the trained GMM in Phase 1). Within each LSP around a keypoint, six different geometric features are first extracted, and each feature-dimension is binned into a 1D histogram with 35 bins, where all histograms are combined to form a 210-dimensional descriptor, i.e. local APPFD for each LSP. All such LSP descriptors from each 3D protein are compacted into a 4210-dimensional FV for that protein model, as in Phase 2.
In this work, we contribute a simpler approach, where each of the 6 feature dimensions is binned into a 1-dimensional histogram with 35 bins to produce a 210-dimensional local descriptor (APPFD) for every key point or local surface patch (LSP). Finally, using the FK and GMM [44] framework, the locally computed APPFDs are agglomerated into a compact code called the Fisher Vector (FV) with 4210 dimensions, which is L2 and power-normalized and represents a single protein model.
The goal of the APPFD-FK-GMM method is to provide a straightforward, efficient, robust, and compact representation describing the geometry of 3D protein surfaces, with a knowledge-based (i.e. non-learning) approach. While a single protein surface in this challenge contains an average of 120,000 vertices and 200,000 triangular faces, our implementation addresses this very large data size by reducing the 3D protein surface representation to a sample of 3,500 points.
3.3.1. Summary of the APPFD-FK-GMM method
Our method involves two main stages: (1) computing local APPFDs for selected key points on 3D protein surface, (2) key points APPFDs aggregation with FK and GMM described below.
Fig. 4 shows the processing pipeline of the APPFD-FK-GMM algorithm with complete implementation details provided in Ref. [44]. The reader is referred to Ref. [45], for further details regarding this method.
Stage (1) Computing Local APPFDs. Following key points determination for each 3D protein surface, represented as point cloud (P), the 4-dimensional feature, f1 = (α, β, γ, δ) in Ref. [46] is augmented with a locally-extracted 2-dimensional feature: f2(pi, pj) = (φ, θ) for every possible combination of point pair, pi, pj (without their estimated normals, ni, nj) in the local surface patch (LSP), {Pi, i = 1: K} around each key point in Ps, where K is the number of key points. The feature f2 (see Fig. 5) is extracted because f1 alone is not robust enough to capture the entire geometric details of the underlying surface, whereas the PPF approach opens up possibilities for an additional feature space.
Fig. 5.

Local Surface Patch (LSP), Pi with pairwise points (pi, pj) as part of a surflet-pair relation for (pi, ni) and (pj, nj), with pi being the origin. θ and φ are the angles of vectors projection about the origin, pi. θ is the projection angle from vector 〈pi − pj〉 to vector 〈pi − pc〉 while φ is the projection angle from vector 〈pi − pj〉 to vector 〈pi − l〉. The LSP centre is given by pc, keypoint is given as where i = 2. Finally, l is the vector position of [34].
The angular projections θ and φ in Fig. 5 are derived by taking the scalar products of 〈pi − pj〉 with 〈pi − pc〉 for ∠1, and of 〈pi − pj〉 with 〈pi − l〉 for ∠2, about a point pi in a given LSP. Mathematically, scalar products defined in this manner are homogeneous (i.e. invariant) under scaling and rotation. Therefore, f2 is considered rotation and scale invariant for 3D shapes under rigid and non-rigid transformations [34].
Finally, a 6-dimensional feature f3, combining f2 and f1, is obtained locally as f3(pi, pj) = (f2(pi, pj), f1(pi, pj)) = (φ, θ, α, β, γ, δ). Each feature dimension is binned into a 1-dimensional histogram with 35 bins, normalized, and the histograms are concatenated to give a single 35 × 6 = 210-dimensional local APPFD per LSP.
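The binning step can be sketched as follows, assuming the six features of every point pair in an LSP are already available as the rows of an array. The function name `local_descriptor`, the per-histogram normalization to unit sum and the use of the observed minimum and maximum as bin range are illustrative assumptions, not details given in the text.

```python
import numpy as np

def local_descriptor(features, bins=35):
    """Bin each feature dimension separately and concatenate the histograms.

    features: (n_pairs, 6) array of (phi, theta, alpha, beta, gamma, delta).
    Returns a vector of length 6 * bins (210 for 35 bins).
    """
    features = np.asarray(features, dtype=float)
    histograms = []
    for dim in range(features.shape[1]):
        column = features[:, dim]
        # Assumption: the bin range spans the observed values of this feature.
        counts, _ = np.histogram(column, bins=bins,
                                 range=(column.min(), column.max()))
        # Assumption: each histogram is normalised to sum to one.
        histograms.append(counts / max(counts.sum(), 1))
    return np.concatenate(histograms)

rng = np.random.default_rng(0)
pair_features = rng.uniform(size=(500, 6))   # synthetic point-pair features
appfd = local_descriptor(pair_features)
```

Compared with a joint 6-dimensional histogram, whose size grows as the number of bins to the power 6, this per-dimension binning grows only linearly with the number of bins.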
Stage (2) Key points APPFDs Aggregation with FK and GMM. Here, the final descriptor (i.e. fisher-vector, FV) computation approach involves an initial step of training a GMM, given aggregated key points local APPFDs for all database 3D objects, then FK is applied on the trained model and a single protein’s local APPFDs to derive a global signature (APPFD-FK-GMM) for the protein surface (see Fig. 4).
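The aggregation step can be illustrated with a generic Fisher vector encoding for a diagonal-covariance GMM, following the standard formulation with gradients with respect to the mixture weights, means and standard deviations. This is a sketch under assumptions and not the authors' implementation: the GMM parameters are taken as given (random placeholders here), and K = 10 components is an assumption chosen because K × (2 × 210 + 1) = 4210 matches the reported FV size; the text does not state the number of components.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Encode local descriptors X (T, D) with a K-component diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D).
    Returns a power- and L2-normalised vector of length K * (2 * D + 1).
    """
    X = np.asarray(X, dtype=float)
    T = X.shape[0]
    diff = X[:, None, :] - means[None, :, :]                      # (T, K, D)
    # Log-density of every descriptor under every Gaussian component.
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)               # stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                     # posteriors (T, K)

    z = diff / np.sqrt(variances)                                 # whitened residuals
    sqrt_w = np.sqrt(weights)
    g_weights = (gamma - weights).sum(axis=0) / (T * sqrt_w)
    g_means = np.einsum("tk,tkd->kd", gamma, z) / (T * sqrt_w[:, None])
    g_sigmas = (np.einsum("tk,tkd->kd", gamma, z ** 2 - 1.0)
                / (T * np.sqrt(2.0 * weights)[:, None]))

    fv = np.concatenate([g_weights, g_means.ravel(), g_sigmas.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                        # power normalisation
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                          # L2 normalisation

rng = np.random.default_rng(0)
K, D, T = 10, 210, 200                      # K = 10 is an assumption (see above)
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
local_appfds = rng.normal(size=(T, D))      # placeholder local descriptors
fv = fisher_vector(local_appfds, weights, means, variances)
```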
3.3.2. Runtimes and computational cost
Our group submitted a 10 × 554 dissimilarity matrix D, where the entry D[i, j] corresponds to the L2 distance from the ith FV in the query set to the jth FV in the collection set.
While implementing the APPFD-FK-GMM for this task, K is specified by the parameter vs = 0.20, which is the voxel size for point cloud down-sampling, while the radius parameter r = 0.50 specifies the size of P. Regarding point cloud size, P = 3,500 points are sampled.
In conclusion, we present a pure Python 3.60 implementation code that computes the APPFD-FK-GMM method. All experiments were carried out under Windows® 7 desktop PC with Intel® Core® i7-4790 CPU @ 3.60 GHz, 32 GB RAM. It takes an average of 30 s to compute the APPFD-FK-GMM. The implementation code is available at the following url: https://tinyurl.com/shrec21
3.4. Projected Wave Kernel Signature maps (PWKSM) by Léa Sirugue & Matthieu Montès.
This method is based on the 2D projection of the surface and the Wave Kernel Signature (WKS) descriptor. Wave Kernel Signature [47] is an isometric invariant descriptor that has been extensively improved and used in the field of computer vision [48–51]. We have combined WKS with a 2D projection on a unit sphere [52]. Lowering one dimension of the space allows us to have a fast and dense comparison of the surface while having a smaller storage size for files.
Descriptor calculation.
In a first step, the WKS descriptor is computed on the surface of the 3D object for each point of the mesh. The surface is flattened on the unit sphere using a conformal transformation [52]. Then, the 2D spherical coordinates of the unit sphere are converted into 2D cartesian coordinates on the plane [53]. A map of size (θmax − θmin)/δ × (φmax − φmin)/δ is created, where θmax and θmin are the maximum and minimum values of θ on the unit sphere, and likewise for φ, each representing an angular coordinate. δ is a coefficient that sets the resolution. This type of projection is similar to topographic maps, which is why our group called this descriptor Projected Wave Kernel Signature Maps (PWKSM). An interpolation in the space of discrete integers is done to densify the maps. To reduce the impact of deformation at the poles when converting to 2D cartesian coordinates, we computed 7 different maps with different pole orientations.
Descriptor comparison.
A dense comparison is made using a GPGPU sum reduction technique [54–56]. The WKS of each point of a PWKSM is compared to the WKS of all points of another PWKSM. The Earth Mover’s distance L is used to compare the WKS descriptors of two points. For each point of a first PWKSM T, the smallest distance to all points of a second PWKSM V is selected. These smallest distances are summed over all points of the first PWKSM to give the score ST. The same is done, with the roles of T and V exchanged, to compute SV.
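$$S_T = \sum_{t \in T} \min_{v \in V} L\left(\mathrm{WKS}(t), \mathrm{WKS}(v)\right), \qquad S_V = \sum_{v \in V} \min_{t \in T} L\left(\mathrm{WKS}(v), \mathrm{WKS}(t)\right) \qquad (1)$$
The final score is the average of ST and SV defined as follows:
$$S = \frac{S_T + S_V}{2} \qquad (2)$$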
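The scoring scheme described above (for each point, the smallest Earth Mover's distance to any point of the other map, summed over the map, then averaged over both directions) can be sketched on the CPU as follows. This is an illustration under assumptions, not the GPGPU implementation: each WKS is treated as a non-negative 1D signature over identical, unit-spaced bins, so that the Earth Mover's distance reduces to a difference of cumulative sums, and all function names are hypothetical.

```python
import numpy as np

def emd_1d(a, b):
    """Earth Mover's distance between two 1D signatures on the same bins.

    Assumption: both signatures are non-negative and are normalised to unit
    mass; for unit-spaced bins the EMD is the L1 distance between the
    cumulative sums.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / a.sum()
    b = b / b.sum()
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

def directed_score(T, V):
    """Sum over the points of T of the smallest distance to any point of V."""
    return sum(min(emd_1d(t, v) for v in V) for t in T)

def pwksm_score(T, V):
    """Symmetric score: average of the two directed scores S_T and S_V."""
    return 0.5 * (directed_score(T, V) + directed_score(V, T))

# Toy maps: each row is the signature of one map point (hypothetical values).
map_T = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
map_V = np.array([[0.0, 1.0, 0.0]])
score = pwksm_score(map_T, map_V)
```

The nested loops make the quadratic cost of the dense comparison explicit, which is the part the method offloads to the GPU.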
3.4.1. Runtimes and computational cost
All the calculations were made on a computer based on a 64-bit OS with an Intel® Xeon® CPU @ 2.30 GHz, a Nvidia® Quadro® k4200 GPU with 4 GB and 32 GB of RAM.
Computing the WKS took on average 9 min and 31 s. It required on average 44 s to compute one PWKSM. The comparison of two surfaces (i.e 7 versus 7 PWKSM) takes on average 23 s. The code is available at the following url: https://gitlab.cnam.fr/gitlab/siruguej/PWKSM.
3.5. Graph-based learning methods for surface-based protein domains retrieval (DGCNN) by Huu-Nghia H. Nguyen, Tuan-Duy H. Nguyen, Vinh-Thuyen Nguyen-Truong, Danh Le, Hai-Dang Nguyen & Minh-Triet Tran
In this deep learning method, our group exploits the availability of protein class labels from Ref. [35] to optimize the representation of protein surfaces without any additional properties. Particularly, we designed a message-passing graph convolutional neural network (MPGCNN) with the Edge Convolution (EdgeConv) paradigm [57] for the protein classification objective. Then, the latent representation of protein surfaces from this neural network is used for the retrieval task in this track (Fig. 6).
Fig. 6.

Dynamic edge convolutional neural network.
3.5.1. Data pre–processing
For the meshes in each 3D model of a protein surface, we first sample 512 points on the surfaces of the meshes based on the area of the meshes. Then, to re-assign a topological structure to the sampled points, we connect each node with its k-Nearest Neighbors based on their original coordinates (k = 16).
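The graph construction can be sketched with a brute-force k-Nearest Neighbors search, which is affordable for 512 points. The function name `knn_edges` and the (2, n·k) edge layout are illustrative choices, and random points stand in for the area-weighted surface samples.

```python
import numpy as np

def knn_edges(points, k=16):
    """Connect every point to its k nearest neighbours (self excluded).

    points: (n, 3) array. Returns a (2, n * k) array whose first row holds
    the source vertex and whose second row holds the neighbour vertex.
    """
    points = np.asarray(points, dtype=float)
    # Squared Euclidean distances between all pairs of points.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)             # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    sources = np.repeat(np.arange(len(points)), k)
    return np.stack([sources, neighbours.ravel()])

rng = np.random.default_rng(0)
sampled_points = rng.normal(size=(512, 3))   # placeholder surface samples
edges = knn_edges(sampled_points, k=16)
```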
3.5.2. Edge convolution
In this geometry-only setting, the initial node features are the coordinates of the sampled points. Each protein surface is represented by a k-Nearest Neighbors graph generated in the pre–processing step with 512 vertices (nodes).
The module that performs the graph message-passing function is the EdgeConv layer [57]. In the EdgeConv layer, the information of a vertex i after layer l is calculated as follows:
| (3) |
where N is the neighboring vertices of vertex i with
| (4) |
where ReLU is Rectified Linear Unit (in this implementation, we used LeakyReLU—a variant of ReLU), MLP is a standard multilayer perceptron (MLP), and ⊕ is the concatenation operator.
In this implementation, our group uses a dynamic variant of EdgeConv instead of the standard EdgeConv described above. At each Dynamic EdgeConv layer, the k-Nearest Neighbors of each vertex are re-computed in the feature space produced by the previous layer; the standard EdgeConv operation is then applied to the recomputed graph.
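The following sketch illustrates one Dynamic EdgeConv step in plain Python. It assumes the formulation of the cited EdgeConv reference [57], in which the message on an edge (i, j) is computed from the concatenation of x_i and (x_j − x_i) and the messages are aggregated with a channel-wise maximum; these choices, the single linear layer standing in for the MLP, the LeakyReLU slope of 0.2 and all function names are assumptions made for illustration, not the participants' code.

```python
import math

def leaky_relu(x, slope=0.2):
    """LeakyReLU activation; the slope value is a hypothetical choice."""
    return x if x > 0 else slope * x

def knn(features, k):
    """Indices of the k nearest rows (self excluded) for each row of `features`."""
    out = []
    for i, f in enumerate(features):
        order = sorted((j for j in range(len(features)) if j != i),
                       key=lambda j: math.dist(f, features[j]))
        out.append(order[:k])
    return out

def edge_conv(features, neighbors, weight, bias):
    """EdgeConv step: channel-wise max over neighbors j of
    LeakyReLU(W . [x_i ; x_j - x_i] + b), with one linear layer as the MLP."""
    out = []
    for i, xi in enumerate(features):
        messages = []
        for j in neighbors[i]:
            edge = list(xi) + [xj_d - xi_d for xi_d, xj_d in zip(xi, features[j])]
            messages.append([leaky_relu(sum(w * e for w, e in zip(row, edge)) + b)
                             for row, b in zip(weight, bias)])
        out.append([max(col) for col in zip(*messages)])
    return out

def dynamic_edge_conv(features, k, weight, bias):
    """Dynamic variant: rebuild the k-NN graph from the current features first."""
    return edge_conv(features, knn(features, k), weight, bias)

# Toy input (hypothetical): four 2D points, two output channels.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
weight = [[1.0, 0.0, 0.0, 0.0],   # channel 0 reads x_i[0]
          [0.0, 0.0, 1.0, 0.0]]   # channel 1 reads (x_j - x_i)[0]
bias = [0.0, 0.0]
out = dynamic_edge_conv(points, k=2, weight=weight, bias=bias)
```

Stacking several such layers means that the neighbor lists of the second and later layers are computed from learned features rather than from the original coordinates.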
After the pre–processing phase, the vertex features first go through four Dynamic EdgeConv layers, whose per-vertex output dimensions are 64, 64, 128, and 256, respectively. The outputs of these four layers are then concatenated into a 512-dimensional vector for each vertex (64 + 64 + 128 + 256 = 512). This vector is fed through another Dynamic EdgeConv layer, producing an output vector v with 512 dimensions. The feature vector v is pooled by concatenating the outputs of a max-pooling layer and a mean-pooling layer to generate the first graph-level feature vector. This vector is passed through two MLP blocks with BatchNorm, LeakyReLU, and Dropout layers. Finally, the vector is passed through a Fully-Connected layer for classification (Fig. 6).
The latent representation of the graph is extracted as vectors by removing the last Fully-Connected layer from the network. The retrieval task is then performed by exploiting the L2-distances between these vectors.
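The retrieval step can be sketched as follows, assuming that each surface has already been mapped to an embedding vector by the truncated network. The function names and the toy two-dimensional embeddings are hypothetical; only the use of the L2 distance comes from the description above.

```python
import math

def dissimilarity_matrix(query_embeddings, target_embeddings):
    """L2 distance between every query embedding and every target embedding."""
    return [[math.dist(q, t) for t in target_embeddings] for q in query_embeddings]

def rank_targets(row):
    """Target indices sorted from most to least similar (smallest distance first)."""
    return sorted(range(len(row)), key=lambda j: row[j])

# Toy embeddings (hypothetical values).
queries = [(0.0, 0.0)]
targets = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
matrix = dissimilarity_matrix(queries, targets)
ranking = rank_targets(matrix[0])
```

Here the distances are 5, 1 and 2, so the second target is ranked first.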
3.5.3. Runtimes and computational cost
This method is implemented in Python 3.8 [58], using the PyTorch [59] and PyTorch Geometric [60] libraries. The experiments were carried out on a machine with an Intel® Core® i7-8700K 6-core CPU @ 3.70 GHz, 32 GB of RAM, and an NVIDIA® TITAN V GPU with 12 GB of VRAM. Training and the extraction of the test set embeddings use both the CPU and the GPU, while the computation of the distance matrix only uses the CPU. The detailed time report is presented in Table 3. The code is available at the following URL: https://github.com/huunghia160799/SHREC-protein-domains.
Table 3.
Time report of each step of the DGCNN method.
| Training | Test Set Extraction | Matrix Computation | Total |
|---|---|---|---|
| ≈1100 min | ≈7 min | ≈3 min | 1173 min |
4. Evaluation metrics
We use common evaluation metrics to assess the performance of the proposed methods, most of which are used in other SHREC tracks [61] or in similar works evaluating retrieval performance [62]. For each method, we compute the overall metrics (i.e., the metrics averaged over all queries) and the individual metrics (i.e., the metrics for each query) to provide a better understanding of the performance of each method on each query. Two composite classes are also presented: the SH3-like and the PDZ-like classes, which group the SH3 and SH3_2 classes, and the PDZ and PDZ_6 classes, respectively. To set the dissimilarity values for these composite classes, we kept, for each entry of the dataset, the minimal dissimilarity value from the SH3/SH3_2 queries and from the PDZ/PDZ_6 queries.
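The construction of a composite query can be sketched as an element-wise minimum over the two corresponding rows of the dissimilarity matrix. The function name and the toy values are hypothetical.

```python
def composite_row(row_a, row_b):
    """Element-wise minimum of two query rows of a dissimilarity matrix."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must cover the same dataset entries")
    return [min(a, b) for a, b in zip(row_a, row_b)]

# Toy dissimilarities of three dataset entries to two related queries.
sh3_row = [0.2, 0.9, 0.5]
sh3_2_row = [0.4, 0.1, 0.5]
sh3_like_row = composite_row(sh3_row, sh3_2_row)
```

With this rule, a dataset entry is considered close to the composite query as soon as it is close to either of the two original queries.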
4.1. Nearest neighbor, first tier and second tier
These metrics measure the proportion of relevant objects found among the top k retrieved objects, and range in the interval [0, 1]. For the nearest neighbor (NN), only the first retrieved object is considered (k = 1), while the top C objects are considered for the first tier (FT), and the top 2 × C objects for the second tier (ST). Here, C represents the cardinality of the class under investigation, i.e., the size of the class to which the query belongs. Higher values of nearest neighbor, first tier and second tier indicate better performance.
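A minimal sketch of these three metrics is given below. It assumes that both tiers are normalized by the class size C, which is consistent with per-class second tier values above 0.5 in Table C.10 (such values could not occur with a normalization by 2 × C); the function names and the toy ranking are hypothetical.

```python
def nearest_neighbor(ranked_labels, query_label):
    """1.0 if the first retrieved object belongs to the query class, else 0.0."""
    return 1.0 if ranked_labels[0] == query_label else 0.0

def tier(ranked_labels, query_label, class_size, multiple=1):
    """Relevant objects within the top (multiple * C) results, divided by C.
    multiple=1 gives the first tier, multiple=2 the second tier."""
    top = ranked_labels[: multiple * class_size]
    return sum(1 for label in top if label == query_label) / class_size

# Toy ranking (hypothetical): class "A" has C = 3 members.
ranked = ["B", "A", "A", "B", "A", "B"]
nn = nearest_neighbor(ranked, "A")
ft = tier(ranked, "A", class_size=3, multiple=1)
st = tier(ranked, "A", class_size=3, multiple=2)
```

In this toy case the first hit is irrelevant, two of the three class members appear in the top 3, and all three appear in the top 6.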
4.2. Precision-recall curves
The Precision (P) represents the fraction of relevant objects among the top k retrieved objects: P = |relevant ∩ retrieved|/|retrieved|. Therefore, precision can be evaluated at different values of k. The Recall (R) represents the fraction of relevant objects retrieved compared to the size C of the class of the query: R = |relevant ∩ retrieved|/|relevant|. Both metrics range from 0 to 1. The Precision-Recall curve plots the precision values at given recall values, which produces, in an ideal case, a horizontal line at P = 1 that spans the entire range of recall values.
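The points of a precision-recall curve can be obtained by walking down the ranked list and recording the precision each time a new relevant object is found. This is a sketch under that assumption; the track's exact sampling of recall values is not specified here, and the function name and toy ranking are hypothetical.

```python
def precision_recall_points(ranked_labels, query_label, class_size):
    """(recall, precision) pairs recorded at each rank where a relevant object appears."""
    points = []
    hits = 0
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            points.append((hits / class_size, hits / k))
    return points

# Toy ranking (hypothetical): class "A" has C = 2 members.
curve = precision_recall_points(["A", "B", "A", "B"], "A", class_size=2)
```

A perfect ranking would place all relevant objects first, so that every recorded precision equals 1.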
4.3. Confusion matrix
A confusion matrix (CM) is a square matrix whose rows represent the classes of the queries and whose columns represent the classes of the dataset. Each element CM(i, j) gives the number of objects from class j retrieved using the query i, considering the top k = C retrieved objects, C being the size of the class corresponding to query i. The diagonal elements CM(i, i) indicate the objects classified correctly, while the off-diagonal elements indicate misclassified objects. To ease the comparison between classes of different sizes, the numbers were normalized over the class sizes C of the queries. Consequently, the sum of all elements of each row equals 1.
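One row of the normalized confusion matrix can be sketched as follows, with one row per query. The function name, the class list and the toy ranking are hypothetical, and the row only sums to 1 when every label in the top C results belongs to the listed classes.

```python
from collections import Counter

def confusion_row(ranked_labels, class_size, classes):
    """Share of each class among the top C results of one query (C = query class size)."""
    counts = Counter(ranked_labels[:class_size])
    return [counts[c] / class_size for c in classes]

# Toy ranking (hypothetical) for a query whose class has C = 4 members.
classes = ["PDZ", "PDZ_6", "SH3"]
ranked = ["PDZ", "PDZ_6", "PDZ", "SH3", "PDZ_6", "SH3"]
row = confusion_row(ranked, class_size=4, classes=classes)
```

Only the first four results are counted here, so the row reads 0.5 for PDZ, 0.25 for PDZ_6 and 0.25 for SH3.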
4.4. Reciprocal rank and mean reciprocal rank
The Reciprocal Rank (RR) measures the ability of a method to find the first relevant item. For a given query, it equals the inverse of the rank r of the first relevant item found: RR = 1/r. The Reciprocal Rank ranges from 0 (no relevant object found) to 1 (the first retrieved object is relevant). The Mean Reciprocal Rank (MRR) is the Reciprocal Rank averaged over all queries. This metric is useful for two reasons: (1) it is order-aware, contrary to the previous metrics, and (2) typical use cases only consider the first few retrieved items. The higher the reciprocal rank, the better the performance.
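Both quantities can be sketched in a few lines. The function names and the toy ranking are hypothetical; the ten per-class values used for the average are the 3DZD reciprocal ranks of the original (non-composite) classes reported in Table C.8.

```python
def reciprocal_rank(ranked_labels, query_label):
    """Inverse of the rank of the first relevant item, or 0.0 if none is found."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rr_values):
    """Reciprocal rank averaged over all queries."""
    return sum(rr_values) / len(rr_values)

# Toy ranking (hypothetical): the first relevant item is at rank 3.
rr = reciprocal_rank(["B", "C", "A", "A"], "A")

# 3DZD reciprocal ranks of the ten original classes, taken from Table C.8.
rr_3dzd = [0.143, 0.033, 1, 1, 0.030, 0.002, 0.021, 1, 1, 1]
mrr_3dzd = mean_reciprocal_rank(rr_3dzd)
```

The average of the ten tabulated values is 0.5229, which matches, after rounding, the Mean Reciprocal Rank of 0.523 reported for 3DZD in Table 4.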
5. Results
Among the participants of the track, all teams returned a dissimilarity matrix for the shape-only dataset, and only one method (3DZD) was adapted to handle the shape + electrostatics dataset.
5.1. Shape-only challenge
The results for the shape-only dataset are presented in Table 4 and in Figs. 7 and 8. Table 4 summarizes the performance of all submitted matrices for the shape-only dataset. For each metric (Nearest Neighbor, First Tier, Second Tier and Mean Reciprocal Rank), the highest value is indicated in bold. Given the dataset structure and the selected query domains, the best method achieves an overall level of 0.5 for the nearest neighbor metric, 0.160 for the first tier, 0.292 for the second tier and 0.523 for the mean reciprocal rank. These results must be weighed against the fact that a few classes have only a small number of models (namely, the Stat-binding and m50 classes only have 6 and 4 members, respectively, see Fig. 2), which negatively impacts the averaged results. For completeness, Tables C7, C8, C9 and C10 in Appendix C contain the per-class evaluation metrics for all methods.
Table 4.
Summary of the average evaluation metrics for the shape-only dataset. The composite classes are excluded from the average; they are presented in Tables C7 to C10.
| Method | Nearest Neighbor | First Tier | Second Tier | Mean Reciprocal Rank |
|---|---|---|---|---|
| 3DZD | 0.5 | 0.160 | 0.292 | 0.523 |
| ProteinNet | 0 | 0.088 | 0.195 | 0.126 |
| APPFD | 0.3 | 0.136 | 0.237 | 0.410 |
| PWKSM | 0.1 | 0.105 | 0.201 | 0.236 |
| DGCNN | 0 | 0.098 | 0.189 | 0.193 |
Fig. 8.

Confusion matrices of all methods for the shape-only dataset. The color-range is the same for all matrices. Confusion ranges from 0 (white background) to 1 (deep purple background). The original classes are separated from the composite classes (SH3-like and PDZ-like) by a black line.
Fig. 7.

Per-query precision-recall curves for the shape-only dataset, for each method. All plots are colored according to the legend on the bottom right of the figure.
The precision-recall curves for each individual class (Fig. 7) show that most methods display a similar behavior for all classes, characterized by a quick drop of the precision at low recall values. A few methods, however, show a different pattern for a few classes (Fig. 7): see, for instance, the PDZ class for the 3DZD method (green curve, top left plot) or the SH3 class for the APPFD-FK-GMM method (dark blue curve, middle left plot), whose curves display medium precision values at medium recall. The confusion matrices for all methods are shown in Fig. 8. Combined with Fig. 1, they allow us to put the performance into perspective. For instance, PDZ and PDZ_6 domains are topologically very similar (TM-score: 0.79, Fig. 1) and were therefore expected to be confused. When using the PDZ_6 query, ProteinNet retrieved only 1 (4%) of the 26 PDZ_6 shapes within the first 26 retrieved results, but also 12 (46%) shapes from the PDZ class (Fig. 8, middle top confusion matrix). More strikingly, 3DZD only found 1 (3%) of the 33 SH3_2 shapes within the 33 first retrieved shapes using the SH3_2 query, while the other 32 retrieved shapes belong to the SH3 class (Fig. 8, second row of the top left confusion matrix), which is closely related to the SH3_2 class (TM-score: 0.84, Fig. 1).
5.2. Shape + electrostatics challenge
As for the shape-only dataset, results for the shape + electrostatics dataset are presented in Table 5 and Figs. 9 and 10. Only one team returned a dissimilarity matrix for the shape + electrostatics dataset. The evaluation metrics are listed in Table 5. The results show similar trends compared to the shape-only dataset, with a nearest neighbor value of 0.5, a first tier value of 0.160, a second tier value of 0.321 and a mean reciprocal rank of 0.454. These metrics are similar to the results obtained from the shape-only dataset for the 3DZD method (the second tier value increased from 0.292 to 0.321, while the mean reciprocal rank decreased from 0.523 to 0.454). The per-class metrics are shown in Appendix D (Tables D11, D12, D13 and D14).
Table 5.
Summary of the average evaluation metrics for the shape + electrostatics dataset. The composite classes are excluded from the average; they are presented in Tables D11 to D14.
| Method | Nearest Neighbor | First Tier | Second Tier | Mean Reciprocal Rank |
|---|---|---|---|---|
| 3DZD | 0.5 | 0.160 | 0.321 | 0.454 |
Fig. 9.

Per-query precision-recall curves for the shape + electrostatics dataset, for each method. All plots are colored according to the legend on the far right of the figure.
Fig. 10.

Confusion matrix for the shape + electrostatics dataset. Confusion ranges from 0 (white background) to 1 (deep purple background). The original classes are separated from the composite classes (SH3-like and PDZ-like) by a black horizontal line.
The precision-recall curves (Fig. 9) show a similar overall behavior for the 3DZD method, whose performance improved significantly for the SH3 domain but decreased significantly for the PDZ domain (dark blue and green curves, respectively, left plot of Fig. 9). The confusion matrix (Fig. 10) is in line with the previous results, indicating that 3DZD performs similarly overall, with a few differences in the per-class results.
6. Discussion and concluding remarks
6.1. Shape-only dataset
Overall, the 3DZD method obtained the best results, in line with the previous tracks on protein shapes, where this group similarly obtained good overall results [33,34]. This method relies on 3D Zernike polynomials, which have been successfully used to retrieve proteins based on their shapes [9] or their Cα coordinates [13]. It then uses a neural network trained on the SCOPe [37,38] database, whose classification largely overlaps with the classification of the Pfam database [16]. For instance, in the SCOPe database, the SH3 domain and the SH3_2 domain are classified in two different SCOPe domains, similarly to the Pfam classification. The DGCNN method used the data from another SHREC′21 track, the Retrieval and classification of protein surfaces equipped with physical & chemical properties track [35]. The organizers of this track, similarly to a previous track [33], proposed a set of shapes derived from NMR structures along with their surficial physico-chemical properties to allow the participants to train their methods, and the resulting classification of the proteins was derived from the SCOPe database as well. The DGCNN and 3DZD methods were therefore trained on similar data, but performed differently.
Another point to consider is that the DGCNN method uses a sampling (down to ≈8,000 points) of the initial point clouds (see 3.5.1) that potentially resulted in a loss of information, which might explain the difference in performance between these two groups (DGCNN and 3DZD). However, the ProteinNet and APPFD-FK-GMM methods use more severe down-sampling steps as well, reducing the number of points down to 2,048 for ProteinNet (see 3.2.2) and to 3,500 for APPFD-FK-GMM (see 3.3), with varying outcomes in terms of performance. These numbers should be compared to the initial mesh sizes, which range from 37,658 to 582,496 points. The APPFD-FK-GMM group, however, was able to better retrieve relevant results within the first hits, as evidenced by higher values of Nearest Neighbor and Mean Reciprocal Rank for the shape-only dataset (Table 4).
While some methods were able to maintain medium precision levels at medium recall values (see section 5.1), a few queries were difficult to handle for all methods. For the DNA-binding domain from the STAT protein family or the peptidase M50 domain, the low number of such surfaces (6 and 4, respectively) in the datasets explains the low performance observed for all methods. For other queries, like the PDZ_6 domain or the SH3_2 domain, the explanation is the presence of closely related domains, the PDZ and SH3 domains, respectively. These confusing classes are significantly more populated (128 versus 26 for the PDZ/PDZ_6 domains, and 115 versus 33 for the SH3/SH3_2 domains). This is supported by the confusion matrices, which showed that, for instance, the ProteinNet group retrieved a large number of SH3 domains (and almost no SH3_2 domains) within the top results using the SH3_2 query, and that the DGCNN group retrieved a significantly higher proportion of PDZ domains than PDZ_6 domains within the first results using the PDZ_6 query. In these cases, the high level of similarity between the domains, coupled with the imbalanced size of the classes, has negatively impacted the results. At the same time, these results highlight the limits of the currently available methods to distinguish between the most closely related proteins.
Also, we observed different results for order-aware (mean reciprocal rank) and order-unaware (nearest neighbor, first tier and second tier) metrics. While the DGCNN, ProteinNet and PWKSM methods display similar values for order-unaware metrics, PWKSM displays a higher Mean Reciprocal Rank (0.236, versus 0.193 for DGCNN and 0.126 for ProteinNet; Table 4). When converted back to ranks, these results mean that, on average, PWKSM ranked the first relevant match higher, but found a similar number of relevant items within the top results.
6.2. Shape + electrostatic dataset
While the shape of a protein is of tremendous importance, its surficial properties are important as well. Therefore, in this track, we generated the shape + electrostatic dataset, which encompasses both properties, in order to stimulate the development of methods able to handle them. However, only one group returned a dissimilarity matrix for this dataset, namely the 3DZD group. Most groups that participated in this challenge come from the computer vision field. As such, most of the methods presented in this work are the result of methodological developments dedicated to the analysis of 3D point clouds. The development of new, specific methods to handle both shapes and electrostatics would require an amount of time far greater than the SHREC timeline. Nevertheless, we hope that this challenge, along with the other challenge dedicated to protein shapes [35], will stimulate the development of such methods.
The overall results showed that the treatment of the electrostatics by the 3DZD method marginally improved the results, compared to the shape-only results. Interestingly, the electrostatic potential impacted each class differently: for instance, it improved the ability of the 3DZD method to handle the SH3 or the Bromodomain classes, as evidenced by the precision-recall curves (Figs. 7 and 9, Tables in Appendix C and Appendix D). The 3DZD method is derived from previous attempts by the same team to couple shape and electrostatics analysis to classify protein surfaces [10]. As noted in this exploratory work, electrostatics may be suitable to compare closely related proteins, while our dataset was mainly composed of loosely related proteins (Fig. 1). The electrostatic potential is likely to be more beneficial for the comparison of local features of protein surfaces than for the comparison of global shapes. In such local cases (like the comparison of catalytic or binding sites), local electrostatic hot spots would represent the major local feature rather than one of the many features of the global protein surface, as is the case for the MaSIF method [11].
6.3. Current machine learning–based methods: pitfalls and challenges
The pitfalls of the previously mentioned methods lie in the protein datasets they exploit and in their characteristics, notably the class imbalance and the high inter-class shape similarity. Indeed, protein datasets are often highly imbalanced in terms of protein classes, which introduces a bias in the training process of these methods and hinders the efficient learning of class representations. One common technique consists of using data augmentation to overcome this lack of original data. Hence, it is important to bear in mind that several protein classes cannot be considered as representative groups of protein families. Moreover, our problem involves a large number of protein classes, some of which are, e.g., visually highly similar. In such a case, a challenge lies in the design of new methods with a high discriminating power, able to extract the most significant features for distinguishing between protein classes. In this sense, other aspects of the proteins (in addition to the shape), such as molecular properties and electrostatic properties, could be considered. These parameters have to be carefully analyzed through experiments before envisaging method generalizations.
6.4. Concluding remarks
In conclusion, we have presented the results of the SHREC′21 challenge on Surface-based protein domains retrieval. The number of participants remained stable compared to the last two years, indicating a constant interest of the shape retrieval community in biologically relevant problems. Each group relied on different methods and theoretical backgrounds, in line with recommended modeling/machine learning practices [63,64], to solve the problem proposed by the organizers, and together they represent a variety of approaches to the same problem. As a step towards open science, all participants agreed to share their programs publicly with the community. Overall, the results are lower than in similar past tracks [34]. Indeed, two methods based on descriptors similar to 3DZD and APPFD-FK-GMM (3DZD and HAPPS, respectively) were presented in the SHREC′20 contest and performed very well (e.g., both methods exceeded 0.95 for the NN metric) on a problem similar to the shape-only problem (see Tables 6 and 7 of [34]). However, the adapted versions (3DZD and APPFD-FK-GMM) did not reach the same level of performance on this new, particular dataset of proteins. This decrease in performance (and the low performance of the three other methods) reveals that this year's dataset was particularly hard to analyze, and that there is still room for improvement. Among the proposed methods, we observed that 3 out of 5 used a learning-based protocol at some point. This proportion is in line with last year's track, and shows that such approaches continue to be investigated, as they usually improve the results. In this regard, the SHREC′21 track on Retrieval and classification of protein surfaces equipped with physical & chemical properties might highlight some interesting points on the best architecture to learn protein surficial properties [35].
Similarly, the organizers of that track computed a set of additional chemical properties (electrostatic potential, location of potential hydrogen bond donors and acceptors, hydrophobicity). In that track, the participants first used the surface geometry, then the combination of the geometry and physico-chemical features of the protein surfaces. The results showed that all methods improved their results when using both the geometric and physico-chemical data compared with the geometry only. In particular, the results generated by the machine learning-based methods improved more than those of the other methods. As the physico-chemical features were not considered individually, it remains hard to know whether one feature has a greater importance than the others. However, in their work, Gainza et al. [11] showed that the electrostatic potential has the greatest impact among the physico-chemical features they computed. In our work, the electrostatic potential was used by the 3DZD team as an additional feature to help the retrieval task. The results reveal only slight improvements compared to the results from the shape-only dataset. As noted in Ref. [10], the electrostatic potential may be of better use to compare closely related proteins than to compare loosely related proteins, as is the case in our work. Alternatively, shapes and electrostatics may be used in a hierarchical way, i.e., using first the shape, then the electrostatics, to achieve a better result.
This track reveals significantly lower results when compared to past tracks [31–34]. However, satisfactory solutions exist to distinguish between loosely related proteins, or to identify identical proteins with different conformations, based on their shapes only. Our work also reveals some limits of the methods used by the participants for the challenge. Very closely related proteins (such as the SH3 and SH3_2 protein domains), i.e., proteins displaying a high topological similarity and limited variations of their amino-acid sequences, hence of their surfaces, are still difficult to separate into different classes, although some methods distinguish them from the other classes. Also, for the DNA-binding domain from the STAT proteins, no method was able to produce satisfactory results. While the DNA-binding domain used as a query has a globular shape, the STAT proteins are significantly bigger, with a non-globular shape, and have 3 additional domains (1 of which is an SH2 domain, a domain included in the dataset), which means that only partial matches to the query may be achieved. This specific issue (the comparison of partially overlapping objects) may require further development.
In the future, this latter point could be the subject of a dedicated SHREC track, and a good indicator of the overall progress made in the field of the retrieval of proteins based on their surfaces. Currently, most methods have difficulty handling such cases, which are quite common. Solving this challenge would be a step forward for the community. At the same time, explainable artificial intelligence (XAI) methods [65] may highlight the latent features responsible for good or bad predictions, and help decipher the results of machine learning–based methods. XAI methods may help explain the performance differences observed for each class of protein, provide a human-interpretable representation of machine learning descriptors, and therefore help identify the current limits of these algorithms. Finally, deciphering to what extent, if any, the standard physico-chemical features (electrostatic potential, charge distribution, hydrophobicity, etc.) improve the results may be the main focus of the next SHREC tracks devoted to protein surfaces.
Acknowledgments
The authors thank the 3DOR 2021 Workshop organizing committee for maintaining this workshop despite the current COVID-19 pandemic. Matthieu Montès and Florent Langenfeld thank Taoufik Labib for setting up and maintaining the track webpage.
Funding
Léa Sirugue, Matthieu Montès and Florent Langenfeld are supported by the European Research Council Executive Agency under the research grant number 640283. Daisuke Kihara acknowledges support from the National Institutes of Health (R01GM133840, R01GM123055) and the National Science Foundation (DBI2003635, CMMI1825941, and MCB1925643). Charles Christoffer is supported by an NIGMS-funded pre-doctoral fellowship (T32 GM132024). Huu-Nghia H. Nguyen, Tuan-Duy H. Nguyen, Vinh-Thuyen Nguyen-Truong, Danh Le, Hai-Dang Nguyen, and Minh-Triet Tran are supported by National University Ho Chi Minh City (VNU-HCM) (DS2020-42-01).
Appendix A. List of PDB structures used as queries for the dataset
Table A.6.
| Domain name | Pfam ID | PDB code | chain | residues | Reference |
|---|---|---|---|---|---|
| SH2 domain | PF00017 | 1P13 | B | 161–243 | [66,67] |
| SH3 domain | PF00018 | 1ABO | B | 67–113 | [68,69] |
| Variant SH3 domain (SH3_2) | PF07653 | 5O99 | B | 474–527 | [70,71] |
| PDZ domain | PF00595 | 2HE2 | B | 421–499 | [72,73] |
| PDZ_6 domain | PF17820 | 3KHF | B | 982–1034 | [74] |
| Peptidase family M50 (m50) | PF02163 | 3B4R | B | 111–186 | [75] |
| Bromodomain | PF00439 | 6CW0 | B | 10–95 | [76] |
| PHD-finger domain | PF00628 | 3KV5 | D | 39–88 | [77,78] |
| Zinc-finger domain, C2H2 type (zf-C2H2) | PF00096 | 4ISI | D | 472–493 | [79,80] |
| STAT protein, DNA-binding domain (Stat-binding) | PF02864 | 5D39 | D | 277–413 | [81,82] |
Appendix B. RMSD between query structures
Fig. B.11.

Cα-RMSD (Root Mean Square Deviation) between query structures. The higher the RMSD, the more distant the structures.
Appendix C. Evaluation metrics details for the shape-only dataset: reciprocal rank, per-class nearest-neighbor, first tier and second tier
Table C.7.
Per-class nearest-neighbor for the shape-only dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| ProteinNet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| APPFD | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| PWKSM | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| DGCNN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table C.8.
Reciprocal rank for the shape-only dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0.143 | 0.033 | 1 | 1 | 0.030 | 0.002 | 0.021 | 1 | 1 | 1 | 1 | 1 |
| ProteinNet | 0.071 | 0.036 | 0.25 | 0.5 | 0.25 | 0.005 | 0.002 | 0.010 | 0.071 | 0.067 | 1 | 0.5 |
| APPFD | 0.5 | 0.042 | 0.1 | 1 | 0.2 | 0.01 | 0.022 | 1 | 1 | 0.167 | 0.333 | 1 |
| PWKSM | 0.333 | 0.053 | 0.125 | 1 | 0.333 | 0.015 | 0.042 | 0.083 | 0.042 | 0.333 | 0.333 | 1 |
| DGCNN | 0.333 | 0.083 | 0.167 | 0.5 | 0.333 | 0.037 | 0.006 | 0.042 | 0.1 | 0.333 | 0.25 | 0.5 |
Table C.9.
Per-class first tier for the shape-only dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0.35 | 0.03 | 0.05 | 0.42 | 0.00 | 0.00 | 0.00 | 0.24 | 0.34 | 0.17 | 0.70 | 0.39 |
| ProteinNet | 0.26 | 0.03 | 0.16 | 0.28 | 0.04 | 0.00 | 0.00 | 0.00 | 0.06 | 0.05 | 0.35 | 0.31 |
| APPFD | 0.40 | 0.03 | 0.15 | 0.22 | 0.04 | 0.00 | 0.00 | 0.15 | 0.25 | 0.13 | 0.44 | 0.27 |
| PWKSM | 0.18 | 0.06 | 0.21 | 0.26 | 0.08 | 0.00 | 0.00 | 0.11 | 0.08 | 0.08 | 0.20 | 0.29 |
| DGCNN | 0.19 | 0.09 | 0.13 | 0.25 | 0.12 | 0.00 | 0.00 | 0.04 | 0.08 | 0.09 | 0.26 | 0.30 |
Table C.10.
Per-class second tier for the shape-only dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0.55 | 0.45 | 0.13 | 0.72 | 0.04 | 0.00 | 0.00 | 0.33 | 0.45 | 0.27 | 0.79 | 0.69 |
| ProteinNet | 0.55 | 0.09 | 0.43 | 0.51 | 0.08 | 0.00 | 0.00 | 0.02 | 0.15 | 0.13 | 0.68 | 0.60 |
| APPFD | 0.58 | 0.12 | 0.30 | 0.53 | 0.08 | 0.00 | 0.00 | 0.18 | 0.34 | 0.23 | 0.66 | 0.59 |
| PWKSM | 0.39 | 0.12 | 0.39 | 0.49 | 0.12 | 0.00 | 0.00 | 0.18 | 0.11 | 0.20 | 0.46 | 0.56 |
| DGCNN | 0.37 | 0.09 | 0.33 | 0.39 | 0.12 | 0.00 | 0.00 | 0.20 | 0.25 | 0.16 | 0.49 | 0.52 |
Appendix D. Evaluation metrics details for the shape + electrostatics dataset: reciprocal rank, per-class nearest-neighbor, first tier and second tier
Table D.11.
Per-class nearest-neighbor for the shape + electrostatics dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
Table D.12.
Reciprocal Rank for the shape + electrostatics dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0.5 | 0.167 | 0.333 | 1 | 0.010 | 0.003 | 0.030 | 1 | 1 | 0.5 | 1 | 0.5 |
Table D.13.
Per-class first-tier for the shape + electrostatics dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0.50 | 0.12 | 0.14 | 0.20 | 0.00 | 0.00 | 0.00 | 0.36 | 0.28 | 0.23 | 0.58 | 0.19 |
Table D.14.
Per-class second tier for the shape + electrostatics dataset.
| Method | SH3 | SH3_2 | SH2 | PDZ | PDZ_6 | m50 | STAT | zf-C2H2 | PHD | Bromodomain | SH3-like | PDZ-like |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class size | 115 | 33 | 92 | 128 | 26 | 4 | 6 | 55 | 53 | 64 | 148 | 154 |
| 3DZD | 0.68 | 0.36 | 0.38 | 0.55 | 0.00 | 0.00 | 0.00 | 0.51 | 0.36 | 0.36 | 0.74 | 0.74 |
Footnotes
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
