Abstract
BACKGROUND
Modeling protein–protein interactions (PPIs) using docking algorithms is useful for understanding biomolecular interactions and mechanisms. Typically, a docking algorithm generates a large number of docking poses, and it is often challenging to select the best native-like pose. A further challenge is to recognize key residues, termed hotspots, at protein–protein interfaces, which contribute more than other residues to stabilizing the interface.
RESULTS
We had earlier developed a computer algorithm, called PPCheck, which ascribes pseudoenergies to measure the strength of PPIs. When applied to a separate set of decoys generated using FRODOCK, PPCheck successfully identified native-like poses in 27 out of 30 test cases. PPCheck, along with conservation and accessibility scores, was able to differentiate native-like and non-native-like poses among 1883 decoys of Critical Assessment of Prediction of Interactions (CAPRI) targets with an accuracy of 60%. PPCheck was trained on a 10-fold mixed dataset and tested on a 10-fold mixed test set for hotspot prediction. We obtained an accuracy of 72%, which is on par with other methods, and a sensitivity of 59%, which is better than most existing hotspot prediction methods that use similar datasets. Other relevant tests suggest that PPCheck can also be reliably used to identify conserved residues in a protein and to perform computational alanine scanning.
CONCLUSIONS
The PPCheck webserver can be used to differentiate native-like and non-native-like docking poses generated by docking algorithms. The webserver can also be a convenient platform for calculating residue conservation, for performing computational alanine scanning, and for predicting protein–protein interface hotspots. While PPCheck alone can differentiate the generated decoys into native-like and non-native-like decoys with an accuracy of about 45% on the CAPRI decoys, the accuracy improves to about 60% when features like conservation and accessibility are included. The method can be used in ranking/scoring the decoys obtained from docking algorithms.
Keywords: algorithm, protein–protein interactions, protein hotspots, computational alanine scanning, residue conservation, protein–protein docking
Background
Protein–protein interactions (PPIs) are critical for the normal functioning of a cell. Proteins interact with one another and carry out biological processes such as signal transduction, gene regulation, and immune response. Since interactions of two proteins often result in one or more biological processes, it is important to gain knowledge on interacting partners of proteins and hence their functions. We now have several databases1–3 that record PPI data from high-throughput experimental methods for thousands of proteins.
Protein–protein docking, the method of assembling two identical/different proteins together using physics-based computational algorithms, is still a challenging problem. Although methods now exist that can explore all the possible rigid-body conformational states for the two proteins in a bound state, selection and ranking of the native-like pose is still a major problem. It is well understood that two proteins perform their designated function only when they bind to one another in the correct conformation, and hence, selecting the native-like pose assumes central importance.
Several methods have been developed both for sampling the conformational space of the two proteins, which bind with one another, and to select the best native-like pose from the pool of generated complexes. Many of these algorithms perform grid-based searches to sample the conformational space. These algorithms inherently use shape/size complementarity, desolvation, or electrostatic interactions,4,5 physical force fields,6,7 empirical functions,8–10 and knowledge-based potentials derived from existing determined structures11–14 to dock the two proteins. Scoring of the docked poses is generally performed using either residue-level potentials or atomic potentials. While most residue-level potentials are simple to construct, easy to use, and are computationally very fast, atomic potentials are more accurate and also computationally more demanding. Several methods have been developed to computationally identify and/or predict the interacting partners for the proteins. These methods predict/identify interactions using features such as hydrophobic patch size,15 evolutionary-conserved positions,15,16 position-specific sequence profiles and residue neighbor list,17 surface accessibility,17–19 distance,20 structural similarity,16 secondary structure contributions,21 amino-acid composition,21,22 dipeptide composition, and biochemical tripeptide composition.22 PISA,23 one of the most popular prediction tools for predicting the stability of a complex, uses several features such as free energy of formation, solvation energy gain, interface area, hydrogen bonds, salt-bridges across the interface, and hydrophobic specificity.
Earlier, in our lab, we developed COILCHECK24 and COILCHECK+,25 which can be used for structural analyses and validation of a special class of PPIs, namely, coiled-coils. PPCheck is an improved version of these tools with many significant new features (such as consideration of only interface residues in the calculation of normalized energy per residue, inclusion of interface waters in hydrogen bond energy calculations, and implementation of an optimum distance cutoff that can be used in a generalized way for calculating electrostatic interactions). PPCheck can be used to analyze a diverse set of protein–protein interfaces, where simple distance criteria are employed to screen for interface residues, followed by a quantitative view of the strength of interactions at the interface. PPCheck has been applied to a benchmark dataset of protein–protein complexes, and standard values for the number of interface residues and the strength of PPIs were obtained earlier.26 PPCheck can be used to analyze a set of docking decoys to recognize the native-like and the non-native-like poses. It uses a combination of energy, conservation, and accessibility scores to rank the models that are generated by the docking algorithm(s). When applied separately to a dataset of 30 dimers and to decoys from six targets of Critical Assessment of Prediction of Interactions (CAPRI),27–31 it showed promising results in differentiating the native-like and the non-native-like poses.
Experimental alanine scanning mutagenesis is a time-consuming and expensive way of finding structurally important residues. FoldX32 is one of the popular webservers used for predicting the important interactions that provide stability to protein complexes, but it is not very accurate. Thus, there is a need for a tool that can reliably predict the change in binding energy of a protein complex when one of its residues is mutated to alanine. When applied to a set of 40 mutations from experimental studies, PPCheck performed well in calculating the changes in binding free energy of the complexes.
Hotspots are the interface residues that contribute maximally toward the stability of the complex, and when mutated to alanine, they impart an appreciable decrease in binding strength (difference of 2 kcal/mol or more in the binding energy). They are generally seen to exist in clusters called hot regions33 and are more conserved than other interface residues. Protein hotspots, apart from providing stability to the complex, also contribute to the specificity at the binding sites. Studies have also shown that hotspots mainly remain buried in the interface.33 Only a few databases, such as Alanine Scanning Energetics database (ASEdb)34 and Binding Interface Database (BID),35 contain information about a handful of experimentally determined hotspots, and hence, there is a need to develop computational tools to predict them. Several methods employ energy-based models,36,37 knowledge-based models,38–42 and molecular dynamics-based models.43–45 Graph theoretical approaches have also been applied to identify and analyze protein hotspots.46 PPCheck attributes pseudoenergies to a protein–protein interface as a sum of non-covalent interaction energies (van der Waals, electrostatic, and hydrogen bond energies). Graph theoretical parameters, such as degree or extent of spatial residue interaction (ESRI) (an alternative term used in the present study for simplicity) for all the interface residues, are employed as features to predict hotspots. On the test dataset, PPCheck achieved accuracy comparable to, and sensitivity higher than, most existing methods of hotspot prediction.
PPCheck is freely available as a webserver at http://caps.ncbs.res.in/ppcheck/.
Methods
Identification of non-covalent interactions
Simple distance criteria are employed for the preliminary identification of non-covalent interactions, such as van der Waals, electrostatic, and hydrogen bonding. The respective energies are calculated using standard force fields as described in the following.
If the atoms of amino acids from two neighboring chains come within a distance of 7Å, then they are considered to be interacting and contribute to van der Waals interaction energy.
Van der Waals interaction energy is calculated as
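$$V = E\left[\left(\frac{R_i + R_j}{r}\right)^{12} - 2\left(\frac{R_i + R_j}{r}\right)^{6}\right] \tag{1}$$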
V is the van der Waals energy; Ri and Rj are the van der Waals radii for the atoms i and j, respectively; E is the van der Waals well depth; and r is the distance between the atoms.
Electrostatic interactions are reported, and the corresponding energies are calculated using the CHARMM package,47 if the charged residues lie within an optimum distance cutoff of 10Å (inclusive). Coulomb’s equation was used to quantify these interactions as follows:
$$E = \frac{q_1 q_2}{D\,r} \tag{2}$$
E is the electrostatic energy; q1 and q2 are partial atomic charges for the charged amino acids as obtained from CHARMM package; r is the distance between atoms; and D is the distance-dependent dielectric constant (D = 2r).
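The electrostatic term described above can be sketched in a few lines. This is a minimal illustration, not the PPCheck implementation: the function name is hypothetical, no unit-conversion prefactor is applied because the text does not state one, and applying the 10Å cutoff to each atom pair (rather than to residue pairs) is a simplifying assumption.

```python
def coulomb_energy(q1: float, q2: float, r: float, cutoff: float = 10.0) -> float:
    """Coulomb term with a distance-dependent dielectric D = 2r.

    q1, q2: partial charges (units of e); r: interatomic distance (angstrom).
    Assumption: no unit-conversion prefactor is applied, because the
    text does not state one.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    if r > cutoff:
        # Pairs beyond the cutoff are not counted (simplifying assumption).
        return 0.0
    dielectric = 2.0 * r  # D = 2r, as stated in the text
    return (q1 * q2) / (dielectric * r)
```

Because D grows with r, the energy falls off as 1/r² rather than 1/r, which damps the contribution of distant charge pairs.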
Hydrogen bonds are identified, and the corresponding energy is calculated using Kabsch and Sander’s equation as used in the DSSP program as follows:
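$$E = q_1 q_2 f \left(\frac{1}{r_{ON}} + \frac{1}{r_{CH}} - \frac{1}{r_{OH}} - \frac{1}{r_{CN}}\right) \tag{3}$$
where q1 = 0.42e and q2 = 0.20e are the magnitudes of the partial charges on the C, O (+q1, −q1) and N, H (−q2, +q2) atoms, respectively; f = 332 is the dimensional factor that gives the energy in kcal/mol; and r_XY is the distance (in Å) between atoms X and Y.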
Water molecules, when present at the interface, are considered when they form bridging hydrogen bonds with amino acids from two interacting protein chains. The single point charge model of water is considered,48 and hence, the values of the charges are chosen as follows for the water–amino acid interactions in the Kabsch and Sander’s equation49:
q1 = 0.42e, q2 = 0.41e, when water acts as hydrogen bond donor.
q1 = 0.82e, q2 = 0.20e, when water acts as hydrogen bond acceptor.
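The Kabsch and Sander hydrogen bond energy, with the three charge sets quoted above, can be sketched as follows. The function and dictionary names are hypothetical, and how the four distances are assigned when water is the donor or acceptor is an assumption left to the caller; the text does not spell it out.

```python
# (q1, q2) pairs quoted in the text, in units of e.
CHARGES = {
    "protein": (0.42, 0.20),
    "water_donor": (0.42, 0.41),
    "water_acceptor": (0.82, 0.20),
}


def hbond_energy(r_on: float, r_ch: float, r_oh: float, r_cn: float,
                 kind: str = "protein", f: float = 332.0) -> float:
    """Kabsch-Sander electrostatic hydrogen bond energy in kcal/mol.

    Distances are in angstrom between the acceptor C=O group and the
    donor N-H group; f = 332 is the dimensional factor.
    """
    q1, q2 = CHARGES[kind]
    return q1 * q2 * f * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)
```

The energy is negative (favourable) when the O···H distance is short relative to the other three distances, as in a well-formed hydrogen bond.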
All the non-covalent interaction energies are summed to give the total energy, and the ratio of the total energy to the number of interface residues is termed the normalized energy per residue.
PPCheck, like COILCHECK,24 also reports residues involved in hydrophobic interactions, strong electrostatic interactions (salt-bridges), and short contacts, based on the distance between specific atoms and the nature of the amino acids. Hydrophobic amino acids, such as Leu, Ile, Val, Trp, Phe, and Tyr, are reported as forming hydrophobic interactions if their Cβ–Cβ distance is 7Å or less. Oppositely charged amino acids with a Cβ–Cβ distance of 4Å or less are reported as potential salt-bridges. Atoms are reported to be engaged in short contacts if their spatial distance at the interface is less than the allowed van der Waals distance.
Short contacts are calculated as
(4) |
where R is the sum of van der Waals radii of the two atoms and r is the distance between the atoms.
Implementation
PPCheck has been developed using a combination of HTML, Perl, and PHP. The webserver works on all browsers and platforms. It takes ∼22 seconds to identify interactions between two protein chains of ∼150 amino acids each.
Selection of right cutoff for electrostatic interactions
Electrostatic interactions are long-range forces. In order to select the right cutoff for these kinds of interactions, the number of charged residues at the interfaces with Cβ–Cβ distance within various cutoff distances (7Å, 8Å, 10Å, 12Å, and 15Å) was calculated in the training dataset.
Dataset for prediction of native-like docking pose
In an earlier study,26 PPCheck was applied on 270 non-redundant, well-characterized, and high-quality protein–protein interfaces where crystal structures were determined with resolution better than 2.5Å. It was observed that in most of the stable PPIs, the number of residues at the interface ranges from 51 to 150 and the normalized energy per residue is better than −2 kJ/mol (ie, less than −2 kJ/mol). These values were used as standardized criteria to distinguish the native-like docking pose from the non-native-like ones.
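The two standardized criteria can be expressed as a simple filter. This is a sketch under stated assumptions: the function names are hypothetical, and the text does not say whether PPCheck applies the two criteria jointly in exactly this way when ranking decoys.

```python
def normalized_energy_per_residue(total_energy_kj: float, n_interface_residues: int) -> float:
    """Total pseudoenergy (kJ/mol) divided by the number of interface residues."""
    if n_interface_residues <= 0:
        raise ValueError("an interface needs at least one residue")
    return total_energy_kj / n_interface_residues


def meets_native_like_criteria(total_energy_kj: float, n_interface_residues: int) -> bool:
    """True if the pose has 51-150 interface residues and a normalized
    energy per residue below -2 kJ/mol (assumed to be applied jointly)."""
    if not 51 <= n_interface_residues <= 150:
        return False
    return normalized_energy_per_residue(total_energy_kj, n_interface_residues) < -2.0
```

For example, a pose with 100 interface residues and a total energy of −300 kJ/mol has a normalized energy of −3 kJ/mol and passes both criteria.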
As studied earlier,50 a set of six CAPRI targets (T24, T25, T26, T29, T32, and T36) with a total of 1883 decoys (best-predicted models by CAPRI participants) was collected from the CAPRI website maintained at the European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk/msd-srv/capri/capri.html). In all, 132 of 1883 decoys were deemed as near-native-like models.
Residue conservation as an additional parameter with PPCheck to predict accuracy of native-like docking decoys
To improve the ability of PPCheck to differentiate native-like docking poses from non-native-like ones, residue conservation and solvent accessibility were used as additional features. To determine the extent of residue conservation of a protein, homologous sequences were identified using Position-Specific Iterative (PSI)-BLAST,51 against the non-redundant (NR) database (March 2013) with one iteration (Blastp), at an e-value cutoff of 10⁻². The resultant hits were collected and clustered at 45% sequence identity using CD-HIT52 (word size = 2). Multiple sequence alignment was then performed using ClustalW,53 and the extent of residue conservation at each position was scored using the structure-based substitution matrix of Johnson and Overington.54 The conservation score for each residue is calculated as
(5) |
where Ci is the total conservation score for the residue at the ith position, a is the residue type present in the query sequence at the ith position in the multiple sequence alignment, b is the residue type present in the jth homologous sequence at the ith position in the multiple sequence alignment, Sab is the amino-acid conservation score from structure-based matrix when residue type a in query sequence is substituted by residue type b in homologous sequence at the ith position, and n is the number of homologs for the query sequence in the multiple sequence alignment.
(6) |
Conservation score for each residue was further normalized to a range between 0 and 1 by dividing it by 100 (maximum amino-acid substitution score in structure-based matrix is 100 for cysteine–cysteine substitution).
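One plausible reading of the scoring scheme is sketched below: the per-position score is taken as the mean substitution score of the query residue against the aligned residues of the n homologs, then divided by 100. The averaging step is an assumption made so that the result stays between 0 and 1, and the matrix shown is a hypothetical fragment; only the Cys–Cys value of 100 is taken from the text.

```python
def conservation_score(query_residue: str, homolog_residues: list[str],
                       matrix: dict[tuple[str, str], float]) -> float:
    """Mean substitution score of the query residue against the residues
    aligned to it in the homologs, scaled to the range 0..1."""
    if not homolog_residues:
        raise ValueError("at least one homolog is required")
    total = sum(matrix[(query_residue, b)] for b in homolog_residues)
    mean_score = total / len(homolog_residues)  # assumed averaging over n homologs
    return mean_score / 100.0                   # 100 = maximum (Cys-Cys) score


# Hypothetical matrix fragment; the Cys-Ser value is a placeholder only.
toy_matrix = {("C", "C"): 100.0, ("C", "S"): 40.0}
fully_conserved = conservation_score("C", ["C", "C", "C"], toy_matrix)
```

Under this reading, a cysteine that is retained in every homolog scores 1.0, and any substitution lowers the score in proportion to the matrix value.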
Solvent accessibility is calculated using protein surface accessibility (PSA) module of JOY55 program. A residue is treated as exposed if the relative accessibility as compared to a model tripeptide is more than 7%. If the relative accessibility of a residue is less than 7%, then it is considered as buried.
PPCheck was applied on an earlier studied dataset26 of well characterized 270 protein–protein complexes in order to select the optimum values of solvent accessibility and number of conserved residues (conservation score ≥0.65) present at the interface, which can help in clear differentiation of native-like poses from the non-native-like ones. It was observed that more than 75% of the protein–protein interfaces (209/270 protein–protein interfaces) (Supplementary File 1) fell either in strong interactor or medium interactor categories. An interface was termed as strong interactor type if both the interacting chains had 10 or more exposed-conserved residues in their monomeric form. However, if one of the chains at the interface had 5 or more exposed-conserved residues in the monomeric form, while its interacting partner had 10 or more exposed conserved residues in the monomeric form, then such a complex was termed as medium interactor type. Since majority (209/270 protein–protein interfaces) of the complexes belonged to these (strong interactor or medium interactor) types, they were treated as gold standards for selecting/predicting the native-like docking pose.
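The interactor categories can be written down as a small classifier. The function names and the "other" label are hypothetical; the thresholds (relative accessibility above 7%, conservation score ≥0.65, and 5 or 10 exposed-conserved residues) are those stated in the text.

```python
def count_exposed_conserved(residues: list[tuple[float, float]]) -> int:
    """Count residues that are exposed (>7% relative accessibility) and
    conserved (score >= 0.65). Each item is (accessibility_percent, score)."""
    return sum(1 for acc, cons in residues if acc > 7.0 and cons >= 0.65)


def interactor_type(n_chain_a: int, n_chain_b: int) -> str:
    """Classify an interface from the exposed-conserved counts of its two chains."""
    low, high = sorted((n_chain_a, n_chain_b))
    if low >= 10:
        return "strong"   # both chains have 10 or more
    if low >= 5 and high >= 10:
        return "medium"   # one chain has 5 or more, its partner 10 or more
    return "other"        # hypothetical label for everything else
```

Sorting the two counts makes the rule symmetric, so it does not matter which chain is passed first.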
Studies on an additional dataset for evaluating the effectiveness of methodology
In order to assess the prediction accuracy of PPCheck, a non-redundant set of 30 dimers (15 homodimers and 15 heterodimers) was selected, as studied earlier,56 to constitute the current test dataset. For the 30 dimers, the two interacting chains were separated and then allowed to dock using FRODOCK.57 For all the generated docked poses for each of the 30 pairs of chains (30 complexes), PPCheck was applied and its ability to predict best native-like docking pose was evaluated. Top-5, top-10, and top-20 models ranked by PPCheck were compared with the already available native pose (in PDB) with respect to (i) root mean square deviation (rmsd) of C-alpha atoms, as obtained from MMalign,58 and (ii) percentage of common interface residues (pcir) as observed in the native pose.
Similarly, rmsd values and pcir were calculated and compared between the top-1, top-5, top-10, and top-20 models of the test dataset, when conservation and solvent accessibility scores were used to assist PPCheck in differentiating the native-like docking poses from the non-native-like docking poses.
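The pcir measure can be sketched as a set overlap. This assumes that the percentage is taken relative to the interface residues of the native pose, which is how the definition above reads; the function name and residue labels are hypothetical.

```python
def pcir(native_interface: set[str], model_interface: set[str]) -> float:
    """Percentage of native interface residues also found at the
    interface of the docked model."""
    if not native_interface:
        raise ValueError("native interface is empty")
    common = native_interface & model_interface
    return 100.0 * len(common) / len(native_interface)


# Hypothetical residue labels in "chain:position" form.
native = {"A:10", "A:11", "B:5", "B:6"}
model = {"A:10", "B:5", "B:7"}
overlap = pcir(native, model)  # two of the four native residues are recovered
```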
Dataset for hotspot prediction
A set of 192 residue mutations from ASEdb and 126 residue mutations from BID were considered for training and testing PPCheck for hotspot prediction. The two datasets ASEdb and BID are mutually exclusive and independent of each other. Also, the same datasets have been largely used by almost all the existing methods of hotspot prediction; thus, consistency in the datasets is maintained. This also ensures a fair comparison between the existing methods.
ASEdb contains information about differences in binding energy of the complex when a single residue is mutated to alanine. A residue is considered a hotspot if mutating it to alanine results in a gain of 2 kcal/mol or more in the free energy of interaction. Non-hotspots are residues that, when mutated to alanine, change the binding energy by less than 0.4 kcal/mol. A total of 77 hotspots and 115 non-hotspots from ASEdb and 39 hotspots and 87 non-hotspots from BID were mixed and randomly rotated 10 times to form mutually exclusive 10-fold training and 10-fold test datasets. All further studies were carried out using these datasets.
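The labelling rule can be sketched as below. Two points are assumptions rather than statements from the text: the change in binding free energy is treated as a signed value, and mutations that fall between the two thresholds are left unlabelled.

```python
from typing import Optional


def label_mutation(ddg_kcal_per_mol: float) -> Optional[str]:
    """Label an alanine mutation from its change in binding free energy."""
    if ddg_kcal_per_mol >= 2.0:
        return "hotspot"
    if ddg_kcal_per_mol < 0.4:
        return "non-hotspot"
    return None  # intermediate values are assumed to be excluded
```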
Extent of spatial residue interaction
The number of residues from the partner chain with which a particular residue (say A) in the protein chain interacts is defined as the ESRI represented as Di of that residue (A). For example, if a residue at the 100th position in a protein is present at the interface and is spatially interacting with 10 other residues from the partner chain, then the ESRI of this residue (present at the 100th position) is said to be 10.
Normalized extent of spatial residue interaction
The ratio of the number of residues that interact with an interface residue (say x) and the average number of residues that all the interface residues interact with is known as the normalized extent of spatial residue interaction (NESRI) of that residue (x). The definition of NESRI has been illustrated in a better way using an example in Supplementary File 2.
Extent of energy contribution
The energy contributed by a residue (say A), present at the interface, while interacting with residues present at the interface of the interacting partner is known as the extent of energy contribution (EEC), represented as De of that residue (A). For example, if a residue at the 100th position is present at the interface and while interacting with other residues from the interacting chain, it contributes a total of −10 kJ/mol energy, then the EEC of this residue (present at the 100th position) is said to be −10.
Normalized extent of energy contribution
Normalized extent of energy contribution (NEEC) of a residue is the ratio of EEC of a residue to the average EEC of all the residues present at the interface. The definition of NEEC is explained in a detailed manner, using an example, in Supplementary File 2.
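The four quantities defined above share one normalization, which the following sketch makes explicit. The function names, residue labels, and energy values are hypothetical, and the average is taken over whichever interface residues are supplied; the text does not state whether both chains are pooled.

```python
def esri(contacts: dict[str, set[str]]) -> dict[str, int]:
    """ESRI: number of partner-chain residues each interface residue contacts."""
    return {res: len(partners) for res, partners in contacts.items()}


def normalize_by_mean(values: dict[str, float]) -> dict[str, float]:
    """Divide each residue's value by the mean over all interface residues.
    Applied to ESRI this gives NESRI; applied to EEC it gives NEEC."""
    if not values:
        raise ValueError("no interface residues given")
    mean = sum(values.values()) / len(values)
    if mean == 0:
        raise ValueError("mean is zero, so the ratio is undefined")
    return {res: v / mean for res, v in values.items()}


# Hypothetical two-residue interface.
contacts = {"A:100": {"B:7", "B:8", "B:9"}, "A:101": {"B:9"}}
eec_kj = {"A:100": -10.0, "A:101": -2.0}

nesri = normalize_by_mean(esri(contacts))  # {"A:100": 1.5, "A:101": 0.5}
neec = normalize_by_mean(eec_kj)           # about 1.67 and 0.33
```

Because both the residue energy and the mean are negative for a stabilizing interface, NEEC comes out positive, and values above 1 mark residues that contribute more than the average interface residue.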
Comparison of hotspot prediction performance of various methods
In order to evaluate the performance of PPCheck, parameters such as sensitivity, specificity, accuracy, and F-score were compared with some of the other available methods for hotspot prediction. For this study, all the methods were applied on 126 mutations of test dataset (from BID) and the various parameters were computed and compared. The various parameters are calculated as
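$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{7}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{8}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$
$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{11}$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{12}$$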
where TP (true positive) is a hotspot that is predicted as a hotspot, TN (true negative) is a non-hotspot that is predicted as a non-hotspot, FP (false positive) is a non-hotspot that is predicted as a hotspot, and FN (false negative) is a hotspot that is predicted as a non-hotspot.
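For readers who wish to recompute these measures from a confusion matrix, a minimal sketch using the standard definitions follows. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
from math import sqrt


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard binary classification measures from TP, TN, FP, and FN."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    precision = _ratio(tp, tp + fp)
    f_score = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(tp * tn - fp * fn,
                 sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_score": f_score,
        "mcc": mcc,
    }
```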
Dataset for performing computational alanine scanning
ASEdb records information about changes in binding energy when a residue is mutated to alanine. A set of randomly selected 40 such mutations (Supplementary File 3) was extracted from the database, and computational alanine scanning was performed using PPCheck.
Results and Discussion
Selection of optimum cutoff for electrostatic interaction
The number of charged residues at protein–protein interfaces was calculated from Cβ–Cβ distances at cutoffs of 7Å, 8Å, 10Å, 12Å, and 15Å within the training dataset. The optimum distance cutoff is taken as the value beyond which a significantly large number of residues are spuriously included as interface residues. A slight increase (of at most two residues) was observed when the cutoff was increased from 7Å to 8Å in all the 262 complexes in the training dataset. When the distance cutoff was increased from 8Å to 10Å, the increase was still at most two residues in all except one complex. However, when the cutoff was increased from 10Å to 12Å, the number of interface residues in the complexes started increasing by 10 extra residues. Hence, 10Å was selected as the optimum distance cutoff for identifying electrostatic interactions. The chosen cutoff is also important because larger distances include a large number of charged residues at the interface that do not contribute significantly to the stability. Figure 1 shows the increase in the number of interface charged residues (in bins) as the distance cutoff is increased gradually from 7Å to 15Å.
PPCheck as a reliable tool for predicting native-like docking pose out of many decoys
In an earlier study,26 PPCheck was applied on 270 non-redundant, high-quality protein–protein interfaces, and it was observed that the number of residues in a stable protein–protein interacting complex ranges from 51 to 150, whereas the normalized energy per residue is better (less) than −2 kJ/mol. These values can be considered as gold standard for optimal normalized energy at protein–protein interfaces and hence were used to differentiate the native-like docking poses from the non-native ones.
Results on a dataset of 30 dimers
PPCheck was applied on 30 dimers (15 homodimers and 15 heterodimers) that were earlier used to train DockScore,56 an in-house algorithm for ranking docking decoys. For these 30 complexes, each chain was separated and then redocked using FRODOCK after altering its orientation. The aim was to check the consistent efficiency of PPCheck in predicting the native-like models in top-1, top-5, top-10, and top-20 ranks, respectively, out of 99 generated poses.
We observed that PPCheck and conservation and solvent accessibility scores could successfully rank native-like docking pose in top-1 position, from the 99 generated poses by FRODOCK (with an average C-alpha rmsd <4Å from the crystal structure of the complex) in 27 and 29 complexes out of 30. When poses within the top-5 PPCheck ranks are considered, 26 out of 30 complexes could be identified, but with increasing structural deviations from the native crystal structure.
Similarly, PPCheck could successfully rank native-like docking pose in top-1 position (>60% of common interface residues) between native pose and FRODOCK-generated pose in 25 out of 30 complexes. Figure 2 shows the performance of PPCheck in ranking the native-like poses in top-1, top-5, top-10, and top-20 positions.
For analyzing the performance of PPCheck in differentiating native-like and non-native-like docking pose, CAPRI dataset was selected. A total of 1883 decoys, which were all submitted as best models by the respective CAPRI participants, corresponding to six CAPRI targets50 (T24, T25, T26, T29, T32, and T36), were selected. When PPCheck alone was applied on these 1883 decoys/models, it could successfully identify 91 (out of 132; 68.9%) near-native-like models (Table 1).
Table 1.
METHOD(S)/SOURCE(S) | # OF TOTAL MODELS | TP | TN | FP | FN | ACCURACY (%) |
---|---|---|---|---|---|---|
PPCheck | 1883 | 91 | 758 | 993 | 41 | 45.09 |
PPCheck + Conservation + Accessibility | 1883 | 71 | 1057 | 694 | 61 | 59.90 |
Residue conservation, solvent accessibility, and PPCheck
Although PPCheck showed significant capabilities in differentiating the native-like docking poses from the non-native-like ones in a stringent dataset such as CAPRI decoys, we included residue conservation and solvent accessibility as additional parameters to further improve the accuracy of differentiation. Conservation score for all the residues of the interacting chains was obtained by collecting homologous sequences followed by multiple sequence alignment (please see the Methods section for greater details).
Out of the 1883 decoys from the six CAPRI targets, the CAPRI assessment team recognized 132 decoys as native-like poses and the remaining 1751 decoys as non-native-like poses. PPCheck, along with residue conservation and solvent accessibility, could correctly classify 71 native-like poses and 1057 non-native-like poses. Thus, an overall accuracy of ∼60% was achieved in differentiating the native-like (71 out of 132) and non-native-like (1057 out of 1751) docking poses.
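These figures follow directly from the counts in Table 1. As a worked check (the sensitivity and specificity values below are derived here from those counts and are not reported separately in the table):

```latex
\begin{align*}
\text{PPCheck alone:}\quad
  &\text{Accuracy} = \frac{91 + 758}{1883} \approx 0.451, &
  &\text{Sensitivity} = \frac{91}{132} \approx 0.689, &
  &\text{Specificity} = \frac{758}{1751} \approx 0.433 \\
\text{With conservation and accessibility:}\quad
  &\text{Accuracy} = \frac{71 + 1057}{1883} \approx 0.599, &
  &\text{Sensitivity} = \frac{71}{132} \approx 0.538, &
  &\text{Specificity} = \frac{1057}{1751} \approx 0.604
\end{align*}
```

The gain in overall accuracy therefore comes from rejecting more non-native-like decoys (higher specificity), at the cost of recovering fewer of the 132 native-like ones (lower sensitivity).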
PPCheck as computational alanine scanning tool
PPCheck was applied on a set of randomly selected 40 mutations from ASEdb and the change in binding energy of the complex when a specific residue was mutated to alanine was recorded. A correlation of 0.716 was observed (Fig. 3) between the changes in binding energy as recorded from experimental studies and as obtained from PPCheck. This correlation should be considered with caution as it has been obtained from a comparatively small dataset and the proportionality constant between the two axes is not 1. Further, there appears to be better correlation for mutations with large changes in binding energy.
Selection of optimum ESRI
Hotspots are interface residues that are generally seen to occur in clusters,59 and they contribute to the stability of the complex,27 along with providing specificity to the complex. Therefore, interface residues are expected to interact with a large number of residues within the protein and from the partner chains. PPCheck was therefore applied on the 10-fold mixed training dataset (192 mutations obtained from ASEdb; 126 mutations obtained from BID), and the ESRI (please see the Methods section for explanation) of all the interface residues was calculated using residue-centric normalized PPCheck energies. We then checked how well the top-5 to top-15 residues, having the highest normalized ESRI (more than 1) and NEEC more than 1, were observed as hotspots in various protein–protein complexes in the training dataset. Figure 4 shows how the various parameters, such as true positive rate (TPR) and false positive rate (FPR), vary when the top-5 to top-15 residues were considered as hotspots from the training dataset. The best results were obtained when the top-9 residues having the highest ESRI were considered as hotspots.
Selection of optimum EEC
Hotspots are the residues that bring a change of more than 2 kcal/mol when mutated to alanine,59 and they contribute more energy than an average interface residue. Thus, we believe that those interface residues that contribute high pseudoenergies while interacting with residues present at the interface will also have the maximum tendency to act as a hotspot. In order to support our assumption, PPCheck was applied on the 10-fold mixed training dataset (192 mutations obtained from ASEdb; 126 mutations obtained from BID) and the EEC of all the interface residues was calculated. We then checked how well the top-5, top-6, top-7, and top-15 residues having the highest EEC (NEEC more than 1) and NESRI more than 1 were observed as hotspots in the various protein–protein complexes in the training dataset. Supplementary File 4 shows that the optimum results for hotspot prediction on 10-fold mixed training dataset were obtained when the EEC is selected as 8.
Selection of optimum ESRI for predicting hotspots
A comparison between ESRI and EEC when top-5 to top-15 residues were treated as hotspots revealed that ESRI gave better results than EEC in almost all the cutoff values. Hence, top-9 residues having the highest ESRI (NESRI more than 1) and NEEC more than 1 were selected as hotspots.
How does PPCheck predict hotspots? Selection of ESRI or EEC for prediction
For every cutoff from top-5 to top-15 for ESRI and EEC methods of hotspot prediction, ESRI method gave improved results, ie, ratio of TPRs and FPRs from the ESRI method was better than that from the EEC method. Hence, the ESRI method was used for prediction of hotspots, ie, top-9 residues having the highest ESRI (NESRI more than 1) and NEEC more than 1 were selected as hotspots.
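The selection rule can be sketched as a filter followed by a ranking. The function name and example values are hypothetical, and the order of operations (filter by both thresholds first, then keep the nine highest-ranked residues) is an assumption; the text does not say whether filtering precedes or follows the top-9 cut. Ranking by NESRI is equivalent to ranking by ESRI within one interface, because all residues share the same divisor.

```python
def predict_hotspots(nesri: dict[str, float], neec: dict[str, float],
                     top_n: int = 9) -> list[str]:
    """Return up to top_n residues with NESRI > 1 and NEEC > 1,
    ranked from highest to lowest NESRI."""
    eligible = [res for res in nesri
                if nesri[res] > 1.0 and neec.get(res, 0.0) > 1.0]
    eligible.sort(key=lambda res: nesri[res], reverse=True)
    return eligible[:top_n]


# Hypothetical values for a four-residue interface.
example_nesri = {"A:20": 2.0, "A:58": 1.5, "A:61": 0.8, "A:74": 1.2}
example_neec = {"A:20": 1.4, "A:58": 0.9, "A:61": 2.0, "A:74": 1.1}
predicted = predict_hotspots(example_nesri, example_neec)  # ["A:20", "A:74"]
```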
PPCheck as a hotspot prediction tool
PPCheck and other hotspot prediction tools, such as Robetta,60 FOLDEF,30 KFC,61 MINERVA,62 HotPoint,63 KFC2a, and KFC2b,64 were tested on 126 mutations from the BID dataset, and various parameters such as sensitivity, specificity, accuracy, F-score, and Matthews correlation coefficient were computed and compared in order to evaluate the ability of each program to correctly predict hotspots. Table 2 shows that PPCheck achieved 58.8% sensitivity, which is higher than that of existing programs such as FOLDEF, KFC, Robetta, HotPoint, and MINERVA. It also reported an F-score of 0.556, which is on par with other existing methods of hotspot prediction. Only KFC2a achieved higher sensitivity than PPCheck in predicting hotspots. KFC2 performed better than PPCheck (raw data for comparison were collected from an earlier study),64 perhaps because of its consideration of solvent accessibility of residues. However, there were some cases where PPCheck outperformed KFC2b (PPCheck results were a subset of KFC2a results, and hence, the two were not compared with each other). We discuss two such cases in the following.
Table 2.
METHOD | SENSITIVITY (%) | SPECIFICITY (%) | ACCURACY (%) | F-SCORE | MATTHEWS CORRELATION COEFFICIENT |
---|---|---|---|---|---|
FOLDEF | 26.36 | 93.51 | 73.37 | 0.369 | 0.272 |
Robetta | 44.85 | 92.21 | 77.99 | 0.548 | 0.432 |
KFC | 47.27 | 89.61 | 76.91 | 0.549 | 0.410 |
MINERVA | 51.82 | 93.51 | 81.00 | 0.619 | 0.517 |
KFC2b | 56.06 | 90.91 | 81.73 | 0.631 | 0.509 |
HotPoint | 57.58 | 83.12 | 75.46 | 0.583 | 0.410 |
PPCheck | 58.79 | 77.92 | 72.18 | 0.556 | 0.356 |
KFC2a | 78.49 | 83.12 | 81.73 | 0.720 | 0.590 |
Successful cases
Complex between soluble tissue factor protein and blood coagulation factor VII-A protein (PDB ID – 1FAK; interacting chains – T and L): Alanine scanning mutagenesis results show that two residues in soluble tissue factor, T-LYS-20 and T-ASP-58, act as hotspots. Of these two, KFC2b could predict only T-LYS-20 as a hotspot, whereas PPCheck could identify both residues as hotspots (Fig. 5A).
Complex between ATP-dependent HSLU protease ATP-binding subunit HSLU and ATP-dependent protease HSLV (PDB ID – 1G3I; interacting chains – A and G/H): Experimental studies reported six residues, A-ASP-438, A-LEU-439, A-ARG-441, A-PHE-442, A-ILE-443, and A-LEU-444, as hotspots. KFC2b could successfully predict only three of them (A-PHE-442, A-ILE-443, and A-LEU-444), whereas PPCheck could predict all six hotspots (Fig. 5B).
Cases where both programs fared equally well
Complex between EPO receptor and EPO mimetics peptide 1 (PDB ID – 1EBP; interacting chains – A and C/D): Residues A-PHE-93, A-MET-150, A-PHE-205, and C-TRP-13 were found to be hotspots as per alanine scanning mutagenesis results. Both KFC2b and PPCheck could successfully predict three residues each as hotspots for this complex. While KFC2b reported A-MET-150, A-PHE-205, and C-TRP-13 residues as hotspots, PPCheck reported A-PHE-93, A-MET-150, and C-TRP-13 as hotspots (Supplementary File 5).
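The 1EBP comparison can be written out with simple set operations. The sketch below uses only the residues named above; the helper `compare_predictions` is a hypothetical name, and any additional residues that either program may have reported for this complex are not considered.

```python
# Experimentally determined hotspots for 1EBP, as listed in the text.
experimental = {"A-PHE-93", "A-MET-150", "A-PHE-205", "C-TRP-13"}

# Hotspots reported by each program for 1EBP, as listed in the text.
predicted = {
    "KFC2b": {"A-MET-150", "A-PHE-205", "C-TRP-13"},
    "PPCheck": {"A-PHE-93", "A-MET-150", "C-TRP-13"},
}


def compare_predictions(experimental: set, predicted: dict) -> dict:
    """For each method, list the recovered and missed experimental hotspots."""
    report = {}
    for method, residues in predicted.items():
        found = experimental & residues
        report[method] = {
            "found": sorted(found),
            "missed": sorted(experimental - residues),
            "recall": len(found) / len(experimental),
        }
    return report


report = compare_predictions(experimental, predicted)
for method, result in report.items():
    print(method, result)

# Hotspots that neither program recovered (empty for this complex).
missed_by_all = experimental - set().union(*predicted.values())
print("missed by both:", sorted(missed_by_all))
```

Each program recovers three of the four hotspots but misses a different one, so the union of the two predictions covers all four experimental hotspots for this complex.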
Unsuccessful cases
The majority of the hotspots could be predicted successfully by one or more of the available hotspot prediction programs. However, some experimentally determined hotspots could not be predicted by any of them. Examples are H-HIS-76 in the complex between DES-GLA factor VII-A (heavy chain) and peptide E-76 (PDB ID – 1DVA; interacting chains – X and H) (Supplementary File 6A), A-ASP-427 in the complex between Nidogen-1 and Basement membrane-specific heparan sulfate proteoglycan core protein (PDB ID – 1GL4; interacting chains – A and B) (Supplementary File 6B), and B-LYS-345 in the complex between beta-catenin and adenomatous polyposis coli protein (PDB ID – 1JPP; interacting chains – B and D) (Supplementary File 6C). When analyzed in detail, all these residues were found to contribute more energy than an average interface residue (normalized energy per residue greater than 1), but they were exposed (solvent accessibility more than 7%) in the complexed form. These residues also interacted with fewer interface residues from the interacting partner (ESRI less than 8) and were among the moderately or less conserved amino acids in the proteins. High solvent accessibility (exposed nature of residues in the complexed form) and moderate or low conservation could be the reason why none of the available hotspot prediction methods identified these residues. These examples also show that it is informative to apply multiple algorithms for the identification of hotspots, because the method with the best statistical measures may not perform best in every case.
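The profile shared by these missed hotspots can be stated as a simple check. The sketch below uses the three numeric cut-offs quoted above (normalized energy greater than 1, solvent accessibility more than 7%, ESRI less than 8). The class, the function name, and the example values are hypothetical, and conservation is left out because no numeric threshold is given for it here.

```python
from dataclasses import dataclass


@dataclass
class InterfaceResidue:
    label: str                 # e.g. chain-residue-number
    normalized_energy: float   # energy relative to an average interface residue
    accessibility_pct: float   # solvent accessibility in the complexed form (%)
    esri: int                  # ESRI value of the residue, as used in the text


def matches_missed_profile(residue: InterfaceResidue) -> bool:
    """True if a residue is energetically important yet exposed and sparsely
    contacted, the profile observed for hotspots that no program predicted."""
    return (
        residue.normalized_energy > 1
        and residue.accessibility_pct > 7
        and residue.esri < 8
    )


# Hypothetical values for illustration only; not measurements from the study.
exposed = InterfaceResidue("X-HIS-1", normalized_energy=1.4,
                           accessibility_pct=12.0, esri=5)
buried = InterfaceResidue("X-LEU-2", normalized_energy=1.4,
                          accessibility_pct=3.0, esri=5)
print(matches_missed_profile(exposed), matches_missed_profile(buried))
```

Such a check does not predict hotspots by itself. It only marks residues for which the available predictors may be less reliable, so that they can be examined with more than one method.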
Conclusion
PPCheck is an objective energy scoring scheme to analyze PPIs. It is a valuable resource that can be used for various purposes, such as the identification of non-covalent interactions at a protein–protein interface, given the coordinates of the two (interacting) chains in a single PDB file. An average docking algorithm generates hundreds of decoys for a given pair of proteins. Most of these generated conformations are incorrect, i.e., they do not show any similarity with the native complex. Other successful scoring schemes, such as DockScore, provide scores on a relative basis within the sampled poses and are not meant to recognize cases where all the poses could be incorrect. If more than one software/algorithm is used for obtaining docked complexes, the complexity of determining the correct pose increases further, as the best models predicted by each algorithm can be entirely different. In such cases, PPCheck can be reliably used to differentiate the native-like docking poses from the non-native-like decoys, since universal energy ranges have been obtained by studying a large number of protein–protein complexes.26 PPCheck reported an accuracy of ∼60% in differentiating the native-like and non-native-like docking poses over a range of CAPRI targets. Prediction of PPIs and recognition of hotspots at the interface region form the central focus for understanding biochemical pathways and for bioengineering/drug design experiments, respectively. PPCheck can also be used to predict the critical residues (hotspots) at the interface, which provide stability and specificity to the complex. PPCheck also finds application in calculating residue conservation and performing computational alanine scanning.
Thus, PPCheck is the only webserver, to the best of our knowledge, that can be reliably used for identifying non-covalent interactions, predicting hotspots at the interface, calculating residue conservation, performing computational alanine scanning, and differentiating native-like and non-native-like docking poses. The availability of PPCheck, which provides objective measures of the strength of interactions, should be valuable.
Supplementary Materials
Acknowledgments
This work and AS are supported by the Centres of Excellence (CoE), Department of Biotechnology (DBT), India. AS thanks the vice chancellor of SASTRA University for encouragement and support. The authors also thank Anu G. Nair and Oommen K. Mathew for technical assistance and U.S. Raghavender, Prashant Shingate, and Malini Manoharan for useful discussions.
Footnotes
ACADEMIC EDITOR: J.T. Efird, Editor in Chief
PEER REVIEW: Four peer reviewers contributed to the peer review report. Reviewers’ reports totaled 1,109 words, excluding any confidential comments to the academic editor.
FUNDING: This study was funded by DBT-CoE grant (grant number: BT/01/COE/09/01). The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.
COMPETING INTERESTS: Authors disclose no potential conflicts of interest.
Author Contributions
Designed the project: RS. Carried out the computational work, including scripting and data analysis: AS. Interpreted the results and included additional features: RS, AS. Wrote the first draft of the manuscript: AS. Provided critical comments to improve the manuscript: RS. Both authors reviewed and approved of the final manuscript.
REFERENCES
- 1. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000;28:289–91. doi: 10.1093/nar/28.1.289.
- 2. Snel B, Lehmann G, Bork P, Huynen MA. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000;28:3442–4. doi: 10.1093/nar/28.18.3442.
- 3. Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(Database issue):D535–9. doi: 10.1093/nar/gkj109.
- 4. Andre P, Cedex EV. A coarse-grained protein–protein potential derived from an all-atom force field. Proteins. 2007;111:9390–9. doi: 10.1021/jp0727190.
- 5. Ferna J, Totrov M, Abagyan R. ICM-DISCO docking by global energy optimization with fully flexible side-chains. Proteins. 2003;117:113–7. doi: 10.1002/prot.10383.
- 6. Cheng TM, Blundell TL, Fernandez-Recio J. pyDock: electrostatics and desolvation for effective scoring of rigid-body protein–protein docking. Proteins. 2007 Apr;515:503–15. doi: 10.1002/prot.21419.
- 7. Bertonati C, Honig B, Alexov E. Poisson-Boltzmann calculations of nonspecific salt effects on protein–protein binding free energies. Biophys J. 2007;92:1891–9. doi: 10.1529/biophysj.106.092122.
- 8. Vajda S, Sippl M, Novotny J. Empirical potentials and functions for protein folding and binding. Curr Opin Struct Biol. 1997;7:222–8. doi: 10.1016/s0959-440x(97)80029-2.
- 9. Pierce B, Weng Z. ZRANK: reranking protein docking predictions with an optimized energy function. Proteins. 2007;67:1078–86. doi: 10.1002/prot.21373.
- 10. Andrusier N, Nussinov R, Wolfson HJ. FireDock: fast interaction refinement in molecular docking. Proteins. 2007;69:139–59. doi: 10.1002/prot.21495.
- 11. Sippl MJ. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol. 1990;213:859–83. doi: 10.1016/s0022-2836(05)80269-4.
- 12. Moont G, Gabb HA, Sternberg MJE. Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins. 1999;373:364–73.
- 13. Glaser F, Steinberg DM, Vakser IA, Ben-Tal N. Residue frequencies and pairing preferences at protein–protein interfaces. Proteins. 2001;102:89–102.
- 14. Huang S-Y, Zou X. An iterative knowledge-based scoring function for protein–protein recognition. Proteins. 2008;72:557–79. doi: 10.1002/prot.21949.
- 15. Neuvirth H, Raz R, Schreiber G. ProMate: a structure based prediction program to identify the location of protein–protein binding sites. J Mol Biol. 2004;338:181–99. doi: 10.1016/j.jmb.2004.02.040.
- 16. Keskin O, Nussinov R, Gursoy A. PRISM: protein–protein interaction prediction by structural matching. Methods Mol Biol. 2008;484:505–21. doi: 10.1007/978-1-59745-398-1_30.
- 17. Chen H, Zhou HX. Prediction of interface residues in protein–protein complexes by a consensus neural network method: test against NMR data. Proteins. 2005;61:21–35. doi: 10.1002/prot.20514.
- 18. Porollo A, Meller J. Prediction-based fingerprints of protein–protein interactions. Proteins. 2007;645:630–45. doi: 10.1002/prot.21248.
- 19. Negi SS, Schein CH, Oezguen N, Power TD, Braun W. InterProSurf: a web server for predicting interacting sites on protein surfaces. Bioinformatics. 2007;23(24):3397–9. doi: 10.1093/bioinformatics/btm474.
- 20. Tina KG, Bhadra R, Srinivasan N. PIC: protein interactions calculator. Nucleic Acids Res. 2007;35(Web Server issue):W473–6. doi: 10.1093/nar/gkm423.
- 21. Reynolds C, Damerell D, Jones S. ProtorP: a protein–protein interaction analysis server. Bioinformatics. 2009;25:413–4. doi: 10.1093/bioinformatics/btn584.
- 22. Rashid M, Ramasamy S, Raghava GPS. A simple approach for predicting protein–protein interactions. Curr Protein Pept Sci. 2010;11:589–600. doi: 10.2174/138920310794109120.
- 23. Krissinel E, Henrick K. Inference of macromolecular assemblies from crystalline state. J Mol Biol. 2007;372:774–97. doi: 10.1016/j.jmb.2007.05.022.
- 24. Alva V, Syamala Devi DP, Sowdhamini R. COILCHECK: an interactive server for the analysis of interface regions in coiled coils. Protein Pept Lett. 2008;15:33–8. doi: 10.2174/092986608783330314.
- 25. Sunitha MS, Nair AG, Charya A, Jadhav K, Mukhopadhyay S, Sowdhamini R. Structural attributes for the recognition of weak and anomalous regions in coiled-coils of myosins and other motor proteins. BMC Res Notes. 2012;5:530. doi: 10.1186/1756-0500-5-530.
- 26. Sukhwal A, Sowdhamini R. Oligomerisation status and evolutionary conservation of interfaces of protein structural domain superfamilies. Mol Biosyst. 2013;9:1652–61. doi: 10.1039/c3mb25484d.
- 27. Henrick K, Moult J, Eyck LT, et al. CAPRI: a critical assessment of predicted interactions. Proteins. 2003;9:2–9. doi: 10.1002/prot.10381.
- 28. Janin J. Assessing predictions of protein–protein interaction: the CAPRI experiment. Protein Sci. 2005;14:278–83. doi: 10.1110/ps.041081905.
- 29. Janin J. Protein–protein docking tested in blind predictions: the CAPRI experiment. Mol Biosyst. 2010;6:2351–62. doi: 10.1039/c005060c.
- 30. Lensink MF, Wodak SJ. Docking and scoring protein interactions: CAPRI 2009. Proteins. 2010;78:3073–84. doi: 10.1002/prot.22818.
- 31. Lensink MF, Wodak SJ. Docking, scoring, and affinity prediction in CAPRI. Proteins. 2013;81:2082–95. doi: 10.1002/prot.24428.
- 32. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005;33(Web Server issue):W382–8. doi: 10.1093/nar/gki387.
- 33. Keskin O, Ma B, Nussinov R. Hot regions in protein–protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol. 2005;345:1281–94. doi: 10.1016/j.jmb.2004.10.077.
- 34. Thorn KS, Bogan AA. ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics. 2001;17:284–5. doi: 10.1093/bioinformatics/17.3.284.
- 35. Fischer TB, Arunachalam KV, Bailey D, et al. The binding interface database (BID): a compilation of amino acid hot spots in protein interfaces. Bioinformatics. 2003;19:1453–4. doi: 10.1093/bioinformatics/btg163.
- 36. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002;320:369–87. doi: 10.1016/S0022-2836(02)00442-4.
- 37. Kortemme T, Kim DE, Baker D. Computational alanine scanning of protein–protein interfaces. Sci STKE. 2004;2004:l2. doi: 10.1126/stke.2192004pl2.
- 38. Darnell SJ, Page D, Mitchell JC. Predicting protein interaction hot spots. Proteins. 2007;68:813–23. doi: 10.1002/prot.21474.
- 39. Guney E, Tuncbag N, Keskin O, Gursoy A. HotSprint: database of computational hot spots in protein interfaces. Nucleic Acids Res. 2008;36(Database issue):D662–6. doi: 10.1093/nar/gkm813.
- 40. Lise S, Archambeau C, Pontil M, Jones DT. Prediction of hot spot residues at protein–protein interfaces by combining machine learning and energy-based methods. BMC Bioinformatics. 2009;10:365. doi: 10.1186/1471-2105-10-365.
- 41. Ofran Y, Rost B. Protein–protein interaction hotspots carved into sequences. PLoS Comput Biol. 2007;3:e119. doi: 10.1371/journal.pcbi.0030119.
- 42. Tuncbag N, Gursoy A, Keskin O. Identification of computational hot spots in protein interfaces: combining solvent accessibility and inter-residue potentials improves the accuracy. Bioinformatics. 2009;25:1513–20. doi: 10.1093/bioinformatics/btp240.
- 43. Gonzalez-Ruiz D, Gohlke H. Targeting protein–protein interactions with small molecules: challenges and perspectives for computational binding epitope detection and ligand finding. Curr Med Chem. 2006;13:2607–25. doi: 10.2174/092986706778201530.
- 44. Huo S, Massova I, Kollman PA. Computational alanine scanning of the 1:1 human growth hormone–receptor complex. J Comput Chem. 2002;23(1):15–27. doi: 10.1002/jcc.1153.
- 45. Rajamani D, Thiel S, Vajda S, Camacho CJ. Anchor residues in protein–protein interactions. Proc Natl Acad Sci U S A. 2004;101:11287–92. doi: 10.1073/pnas.0401942101.
- 46. Brinda KV, Kannan N, Vishveshwara S. Analysis of homodimeric protein interfaces by graph-spectral methods. Protein Eng. 2002;15:265–77. doi: 10.1093/protein/15.4.265.
- 47. Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. Program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem. 1983;4:187–217.
- 48. Forces I, Company RP. Interaction models for water in relation to protein hydration. Intermol Forces. 1981;331:331–8.
- 49. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–637. doi: 10.1002/bip.360221211.
- 50. Oliva R, Vangone A, Cavallo L. Ranking multiple docking solutions based on the conservation of inter-residue contacts. Proteins. 2013;81:1571–84. doi: 10.1002/prot.24314.
- 51. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402. doi: 10.1093/nar/25.17.3389.
- 52. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–59. doi: 10.1093/bioinformatics/btl158.
- 53. Larkin MA, Blackshields G, Brown NP, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–8. doi: 10.1093/bioinformatics/btm404.
- 54. Johnson MS, Overington JP. A structural basis for sequence comparisons. An evaluation of scoring methodologies. J Mol Biol. 1993;233:716–38. doi: 10.1006/jmbi.1993.1548.
- 55. Mizuguchi K, Deane CM, Blundell TL, Johnson MS, Overington JP. JOY: protein sequence-structure representation and analysis. Bioinformatics. 1998;14:617–23. doi: 10.1093/bioinformatics/14.7.617.
- 56. Malhotra S, Sankar K, Sowdhamini R. Structural interface parameters are discriminatory in recognising near-native poses of protein–protein interactions. PLoS One. 2014;9:e80255. doi: 10.1371/journal.pone.0080255.
- 57. Garzon JI, López-Blanco JR, Pons C, et al. FRODOCK: a new approach for fast rotational protein–protein docking. Bioinformatics. 2009;25:2544–51. doi: 10.1093/bioinformatics/btp447.
- 58. Mukherjee S, Zhang Y. MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res. 2009;37:1–10. doi: 10.1093/nar/gkp318.
- 59. Bogan AA, Thorn KS. Anatomy of hot spots in protein interfaces. J Mol Biol. 1998;280:1–9. doi: 10.1006/jmbi.1998.1843.
- 60. Kortemme T, Baker D. A simple physical model for binding energy hot spots in protein–protein complexes. Proc Natl Acad Sci U S A. 2002;99(22):14116–21. doi: 10.1073/pnas.202485799.
- 61. Darnell SJ, LeGault L, Mitchell JC. KFC Server: interactive forecasting of protein interaction hot spots. Nucleic Acids Res. 2008;36(Web Server issue):W265–9. doi: 10.1093/nar/gkn346.
- 62. Cho K, Kim D, Lee D. A feature-based approach to modeling protein–protein interaction hot spots. Nucleic Acids Res. 2009;37:2672–87. doi: 10.1093/nar/gkp132.
- 63. Tuncbag N, Keskin O, Gursoy A. HotPoint: hot spot prediction server for protein interfaces. Nucleic Acids Res. 2010;38(Web Server issue):W402–6. doi: 10.1093/nar/gkq323.
- 64. Zhu X, Mitchell JC. KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density, and plasticity features. Proteins. 2011;79:2671–83. doi: 10.1002/prot.23094.