Protein Science: A Publication of the Protein Society. 2021 Apr 20;30(6):1247–1257. doi: 10.1002/pro.4074

Identifying metal binding amino acids based on backbone geometries as a tool for metalloprotein engineering

Hoang Nguyen 1, Jesse Kleingardner 2
PMCID: PMC8138524  PMID: 33829594

Abstract

Metal cofactors within proteins perform a versatile set of essential cellular functions. In order to take advantage of the diverse functionality of metalloproteins, researchers have been working to design or modify metal binding sites in proteins to rationally tune the function or activity of the metal cofactor. This study has performed an analysis on the backbone atom geometries of metal‐binding amino acids among 10 different metal binding sites within the entire protein data bank. A set of 13 geometric parameters (features) was identified that is capable of predicting the presence of a metal cofactor in the protein structure with overall accuracies of up to 97% given only the relative positions of their backbone atoms. The decision tree machine‐learning algorithm used can quickly analyze an entire protein structure for the presence of sets of primary metal coordination spheres upon mutagenesis, independent of their original amino acid identities. The methodology was designed for application in the field of metalloprotein engineering. A cluster analysis using the data set was also performed and demonstrated that the features chosen are useful for identifying clusters of structurally similar metal‐binding sites.

Keywords: machine learning, metal binding, metalloproteins, protein engineering

1. INTRODUCTION

Metalloproteins are protein structures that contain a functional metal ion and represent an estimated 30–40% of all proteins. 1 The number of protein structures has more than doubled in the past decade, and along with this increase in structural information has come a number of structural analysis methods focused on characterizing metal sites within proteins. 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 The various methods focus on different applications, such as structural model validation using the CheckMyMetal server tool, 9 prediction of metal sites given structures with unknown metal binding properties, 4 , 8 or classification of metal site structures exemplified by the “minimal functional site” approach taken by Andreini and coworkers. 5 , 7 , 10

Interest in the field of metalloprotein engineering is growing, as engineered metalloproteins can facilitate the discovery of detailed structure–function relationships in existing metalloproteins, tailor the properties of a metalloprotein toward a particular application, or expand the function of metalloproteins beyond those known in naturally‐occurring proteins. 11 , 12 , 13 Metalloprotein engineering can be performed by a number of strategies. In de novo metalloprotein design, a sequence is designed that is predicted computationally to fold into a particular three‐dimensional shape which can accommodate a desired metal‐site structure. 14 , 15 , 16 In protein redesign, a protein scaffold whose structure is already known is modified by site‐directed mutagenesis to accommodate a desired metal binding site. Frequently, the initial scaffold already contains a metal‐binding site that is being modified by mutagenesis and/or metal‐ion substitution, 17 , 18 , 19 , 20 , 21 but it is also possible that a new metal‐binding site can be introduced into an otherwise non‐metalloprotein template. 22 Alternative strategies include the design of metal binding sites as oligomerizing crosslinks between protein monomers to form a supramolecular assembly 23 or the incorporation of a synthetic metal cofactor into a protein scaffold. 24 The incorporation of unnatural amino acids 25 , 26 and the use of directed evolution 24 have enhanced our ability to engineer metalloproteins with desired functions and properties.

There are a number of computational tools already available to introduce a metal site into a scaffold in silico. 27 One of the earliest tools for metal‐site identification within a protein scaffold, METALSEARCH, 28 stands out for its use of backbone structure alone to identify favorable metal‐binding sites. The use of backbone structure alone for metalloprotein engineering is ideal, since it is independent of the side chains present in the scaffold structure, thus eliminating the need to model specific sets of metal‐binding amino acid side chains and their various rotamers when determining whether a metal‐site could be incorporated into a scaffold. The sequence and structural degrees of freedom would be greatly reduced compared to force field‐based protein design methods. However, METALSEARCH was limited to the design of tetrahedral sites ligated by only cysteine and/or histidine residues since it involved an explicit calculation of the desirable backbone geometry needed for binding.

DEZYMER 29 was developed to be more versatile in the target sites that can be incorporated, and more recently developed tools such as IPRO 30 and RosettaDesign 31 incorporate force field‐based backbone and side chain refinements. Both of these tools use a fast initial screening procedure to identify suitable locations for a predetermined substructure to be incorporated into a template structure, and this initial screening is necessary to quickly limit and prioritize the potential binding sites to be assessed by force field methods.

Given the growth of the amount of structural information available in the Protein Data Bank, 32 the speed of modern computers, and the high accessibility of modern machine learning algorithms that have already been successfully applied to metal‐binding site predictions in proteins, 33 a new method for the initial screening of template protein structures is proposed. Instead of using a predetermined metal‐binding substructure, data from a wide range of metal‐binding sites in the protein data bank, each with a diverse set of primary coordination spheres, is used to train a machine‐learning algorithm. In this way, a template structure for metalloprotein engineering can be quickly compared with the entire set of metal‐binding sites in the protein data bank using a decision tree machine‐learning algorithm. The analysis is based solely on the relative positions of backbone atoms for amino acids directly bound to each metal cofactor, with the goal of using the prediction algorithm as an aid to identifying templates in which to engineer metal‐binding sites without using an in silico mutagenesis and rotamer selection procedure. The method is tested here to determine whether a variety of metal‐binding sites are distinguishable from each other and from non‐metal binding sites on the basis of the backbone structure of the coordinating amino acids alone.

2. RESULTS

2.1. Machine learning metal‐site classification

The machine‐learning methodology used a Random Forest decision tree classifier to distinguish common types of metal sites found in protein structures on the basis of the three‐dimensional arrangement of the backbone atoms of those amino acids that coordinate the metal center. The relative geometries are represented by numerical values calculated from atomic coordinates. These values, referred to as features throughout this article, are used as input for machine learning. The calculation of these features is summarized in Figure 1, where the use of side chain atom coordinates was intentionally avoided so that the model might be used to find templates for metalloprotein engineering. As can be seen in Figure 2, the accuracy of the machine‐learning prediction of metal binding depended strongly upon how many amino acids bound the metal ion (n aa) and weakly on which Feature Set was used. The model is able to better distinguish between different types of metal sites with four ligating amino acids compared to metal sites with three ligating amino acids. This increase in accuracy between three and four coordinating amino acids likely results from increased separation of the metal and non‐metal data points within the feature space, since the number of different metal site classes is similar for both the n aa = 3 (eight classes) and the n aa = 4 (seven classes) data sets.

FIGURE 1.


The overall scheme used to generate the machine‐learning data sets from metalloprotein structures present in the Protein Data Bank. 32 The example shown is a zinc site in cellulose synthase from Salmonella typhimurium (PDBID 5OLT) ligated by the side chains of four amino acids

FIGURE 2.


Overall accuracy of the machine‐learning classification model as a function of the number of ligating amino acids (n aa), the feature set used, and whether the single metal ion sites (CU, FE, MN, CO, NI, and ZN) were grouped into a single category (M). The descriptions of the three data sets are shown in Table 1. The list of classes that the algorithm was distinguishing between is shown for each data group. A description of the classes and the grouping scheme can be found in Table S3

Feature Set 1 uses 13 features to represent the backbone geometry of the amino acids directly coordinating to metal centers (Table S4). However, to fully encode the relative positions of N points (N ≥ 3) in three‐dimensional space, 3N–6 features are necessary, typically represented using internal coordinates in a series of distances, angles, and dihedrals. In Feature Set 3, this complete set of internal coordinate features was used to fully represent the 3D arrangement of the n coordinating amino acids (12 features for n aa = 3 and 18 for n aa = 4; Table 1). Since internal coordinates are dependent on the ordering of the amino acids, the data set generated for Feature Set 3 recorded every possible ordering of the n amino acids, resulting in a large data set whose analysis required notably more computing resources. Feature Set 1, on the other hand, used only 13 features that were independent of the ordering of the amino acids, which could be collected and analyzed more quickly. The results in Figure 2 demonstrate that the compression and simplification of the geometric relationship between the coordinating amino acids that was performed for Feature Set 1 causes no loss of accuracy when compared with the order‐dependent Feature Set 3. Feature Set 1 therefore provides a computationally efficient compression of the geometry of the amino acid backbone atoms that make up a metal‐binding site.
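The feature counts quoted for Feature Set 3 can be checked with a few lines of arithmetic. The sketch below assumes that each coordinating amino acid contributes two points (its C α and modeled C β, as described in the Methods); that assumption is ours, but it reproduces the totals of 12 and 18 listed in Table 1. The function names are hypothetical and are not taken from the authors' script.

```python
from math import factorial


def internal_coordinate_count(n_points: int) -> int:
    """Number of internal coordinates (3N - 6) that fix N >= 3 points in 3D."""
    if n_points < 3:
        raise ValueError("at least three points are required")
    return 3 * n_points - 6


def feature_set_3_size(n_aa: int, points_per_residue: int = 2) -> int:
    """Internal coordinates for n_aa residues.

    points_per_residue=2 is an assumption (C-alpha plus modeled C-beta).
    """
    return internal_coordinate_count(n_aa * points_per_residue)


def orderings(n_aa: int) -> int:
    """Number of distinct orderings of n_aa residues (n!)."""
    return factorial(n_aa)


if __name__ == "__main__":
    for n_aa in (3, 4):
        print(n_aa, feature_set_3_size(n_aa), orderings(n_aa))
```

Recording every ordering multiplies the number of rows by n!, which is one way to see why the order‐dependent data set is so much larger than the order‐independent one.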

TABLE 1.

The descriptions of the data included in each of the three data sets used for this study

| Feature set | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Order‐independent geometric parameters of coordinating amino acids (Table S4) | Included | Included | Not included |
| The number of coordinating amino acids of each type: C, D, E, H, K, M, N, Q, R, S, T, and Y | Not included | Included | Not included |
| Internal coordinates for the backbone atoms of coordinating amino acids | Not included | Not included | Included |
| Total number of features | 13 | 25 | 12 (n aa = 3) or 18 (n aa = 4) |

Feature Set 2 includes the order‐independent features, but also includes the number of each type of amino acid that binds to metals in this sample set (Cys, Asp, Glu, His, Lys, Met, Asn, Gln, Arg, Ser, Thr, Tyr) within the set of coordinating amino acids. Including the identities of the metal‐binding amino acids in this way improves performance by 7.2% in the n aa = 3 data set but only 3.4% in the n aa = 4 data set (Figure 2).
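As a rough illustration of how the amino acid identities could be appended to the geometric features, the sketch below counts the one‐letter codes of the coordinating residues in the order given in Table 1. The helper names and the placeholder geometry vector are hypothetical; only the 12 residue types and the totals of 13 and 25 features come from the text.

```python
from collections import Counter

# One-letter codes of the metal-binding residue types listed in Table 1.
LIGAND_TYPES = "CDEHKMNQRSTY"


def residue_type_counts(ligands: str) -> list[int]:
    """Count each ligand type, e.g. 'CCHH' -> two Cys and two His."""
    counts = Counter(ligands)
    unknown = set(counts) - set(LIGAND_TYPES)
    if unknown:
        raise ValueError(f"unexpected residue types: {sorted(unknown)}")
    return [counts[aa] for aa in LIGAND_TYPES]


def feature_set_2(geometry_features: list[float], ligands: str) -> list[float]:
    """Concatenate the order-independent geometry features with the counts."""
    return list(geometry_features) + residue_type_counts(ligands)


if __name__ == "__main__":
    placeholder_geometry = [0.0] * 13  # stand-in for the 13 Feature Set 1 values
    row = feature_set_2(placeholder_geometry, "CHHM")
    print(len(row))
```

Because the counts do not depend on the order in which the residues are listed, the combined vector stays order‐independent.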

A breakdown of the overall accuracy of the classification model by metal ion is shown in Figure 3a, demonstrating variation in the accuracy of classification between the different metal sites. Non‐metal sites (None), Zn2+ sites, Fe2–S2 clusters (FES), Fe4–S4 clusters (SF4), and c‐type hemes are classified well by the model with greater than 90% accuracy, whereas Fe2+/3+, Mn2+/3+, Ni2+, and Co2+/3+ sites are classified poorly with less than 60% accuracy. The accuracy is partially correlated with the number of metal sites in the data set (Table S3), with non‐metal and Zn2+ sites being the most frequent and most accurate. However, a more detailed view of which sets of metal sites are commonly confused by the model is shown via confusion matrices in Figure S1, and demonstrates that the amino acid backbone geometries are not sufficiently different to distinguish between metals within mono‐ or di‐nuclear metal binding sites, which include Fe2+/3+, Mn2+/3+, Zn2+, Ni2+, and Cu1+/2+. When these individually labeled metal‐binding sites were grouped into the same class (M), the overall classification accuracy of the model increased to 95% (Figure 3b). When all metal sites were grouped together, the classification accuracy increased further to 97%.

FIGURE 3.


Accuracy of the machine‐learning classification model as a function of both metal‐site and the number of ligating amino acids. The feature set used for this data, Feature Set 1, was the order‐independent features listed in Table S4. The metal sites were each classified into separate categories (a) or were grouped where class M constitutes either an Fe, Mn, Zn, Co, Ni, or Cu site (b). Some combinations of metal site and n aa were not populated enough to use for the machine learning, such as Fe4–S4 or Fe2–S2 clusters with n aa = 3 or Cu, Ni, and Co sites with n aa = 4

The Random Forest classifier offers an opportunity to not only assign a data point to a particular class, but also to report the predicted probability that a given data point belongs to each class. In this case, the data point is the set of values for each feature in the set computed from relative amino acid backbone geometries. The classes are either a nonmetal site or any one of the different metal binding sites used for the study, and a probability is generated for each class based on the percentage of the decision trees predicting that classification. The higher the probability, the greater the confidence the algorithm has in its class assignment. The resulting probabilities were filtered for each metal site and sorted into true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs). The probabilities for each category were then used to generate histograms showing the distribution of probabilities in each category. Select examples of these histograms are shown in Figure S2. A lack of overlap between TPs and FPs, or between TNs and FNs, demonstrates a correlation between the confidence of the algorithm and the likelihood of a correct classification. Most of the incorrectly classified metal sites are predicted with lower confidence, as one would expect if the probabilities generated were reliable measures of likelihood. For example, in the cases where the classification had greater than or equal to 80% confidence, the overall accuracy of the prediction model increases from 90.2% to 96.4% (n aa = 3, grouped) and from 95.2% to 98.9% (n aa = 4, grouped). The accuracy increases up to 98.5% (n aa = 3) and 99.6% (n aa = 4) if a confidence level over 90% is required.
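The confidence filtering described above can be sketched in a few lines. The example takes the description literally (class probability as the fraction of trees voting for that class) and uses invented toy votes and labels; it is not the authors' code, and library implementations of Random Forest may compute probabilities somewhat differently. The function names are hypothetical.

```python
from collections import Counter


def vote_probabilities(tree_votes, classes):
    """Fraction of trees voting for each class (toy stand-in for a forest)."""
    counts = Counter(tree_votes)
    return {c: counts[c] / len(tree_votes) for c in classes}


def accuracy_above(predictions, threshold):
    """Accuracy restricted to predictions whose top probability >= threshold.

    predictions: list of (true_label, {class: probability}) pairs.
    Returns (accuracy or None, number of predictions kept).
    """
    kept = correct = 0
    for true_label, probs in predictions:
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            kept += 1
            correct += label == true_label
    return (correct / kept if kept else None), kept


if __name__ == "__main__":
    classes = ("None", "M", "HEC")
    toy = [
        ("M", vote_probabilities(["M"] * 9 + ["None"], classes)),
        ("None", {"None": 0.55, "M": 0.45, "HEC": 0.0}),
        ("HEC", {"None": 0.10, "M": 0.60, "HEC": 0.30}),  # misclassified, low confidence
    ]
    for threshold in (0.0, 0.8):
        print(threshold, accuracy_above(toy, threshold))
```

Raising the threshold trades coverage for accuracy: fewer sites are retained, but a larger share of the retained sites is classified correctly when the probabilities are informative.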

2.2. Feature set data visualization

Visualization of the data can give a qualitative sense of how much overlap there is between backbone geometries characterized by a particular feature set. In addition, visualization can help show the distribution of the backbone geometries within a class of metal sites. Complete visualization of the 13‐dimensional data of Feature Set 1 is impossible, but projection onto the first two principal components allows one to visualize most of the variance in the data (see Figure S3). First, visualization of the data shows less separation in general between points when there are just three ligating amino acids (Figure 4b) than when there are four (Figure 4d). This is consistent with the increase in classification accuracy observed with four ligands as opposed to just three. The two‐dimensional projection in Figure 4d shows clear separation between different types of metal centers and the presence of clusters within some of the metal groups.

FIGURE 4.


Scatter plot visualization of the data for Feature Set 1 after normalization and projection onto the first two principal components (PC1 and PC2). (a) The data set was filtered to include heme c sites ligated by three amino acids, and the color of the data points represents a different set of ligating amino acids (i.e., histidine vs. lysine axial ligation). (b) The data set was filtered to include all metal and non‐metal sites bound by three amino acids, and the color of the data points represents different classes of metal sites. (c) The data set was filtered to include heme c sites ligated by three amino acids, and the color of the data points represents the four different clusters identified via K‐means clustering. (d) The data set was filtered to include all metal and non‐metal sites bound by four amino acids, and the color of the data points represents different classes of metal sites

To explore how the features collected from the geometries of the backbone atoms of ligating amino acids are affected by the identities of those amino acids, plots were generated for each metal that separately labeled the data points according to the identities of the ligating amino acids. These plots are shown for heme c in Figure 4a (n aa = 3), and for other metal sites in Figure S4 (n aa = 3) and Figure S5 (n aa = 4). Figure 4a shows that for hemes c ligated by three amino acids (two cysteines and a proximal ligand), there are a small number of clusters showing different geometries of the backbone atoms of these three amino acids. The CCH‐ligated hemes are clearly split between two clusters. The 10 CCK‐ligated heme sites in the data set are split into two clusters and one outlier, with one CCK‐ligated cluster overlapping one of the CCH‐ligated clusters.

These different clusters observed for hemes c ligated by three amino acids were analyzed with a K‐means clustering analysis, resulting in four different clusters which are plotted in Figure 4c. The heme sites nearest to the center of each cluster (when the cluster size was greater than two) were then aligned with one another and overlaid to show the structural differences present between the different clusters (Figure 5). The comparison of representative CCH‐ligated heme structures shows that the major structural difference is in the axial histidine ligand, where a significant backbone rearrangement is coupled with a rotation of the axial histidine relative to the heme. The angle of the histidine relative to the heme is therefore correlated with a difference in backbone geometry (Figure S6A). The CCK‐ligated nitrite‐reductases were separated into two backbone geometries of the CXXCK motif (analogous to the CCH‐ligated hemes), with a unique outlier that is bound by an axial lysine outside of the CXXCK motif. A similar difference is found among hemes c with four ligating amino acids, where one cluster has a different histidine rotation angle compared with the rest (Figure S6B), which is also correlated with a shift toward a lower distribution of an out‐of‐plane heme ruffling distortion (Figure S6D). The analysis shows the correlation between the geometry of the protein backbone, the identities of the ligating amino acids, and side chain conformational changes in the primary coordination sphere.
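The selection of a representative site per cluster can be illustrated with a small, dependency‐free K‐means sketch. The initialization (farthest‐point seeding), the stopping rule, and the two‐dimensional toy points are all assumptions made for this example; the text specifies only that K‐means was used and that the site nearest each cluster center was chosen.

```python
from math import dist


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels, centers


def representatives(points, labels, centers):
    """Index of the point closest to each cluster center."""
    reps = {}
    for j, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            reps[j] = min(members, key=lambda i: dist(points[i], center))
    return reps


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers = kmeans(toy, k=2)
    print(labels, representatives(toy, labels, centers))
```

In practice the points would be the normalized feature vectors of each heme site, and the returned indices would identify the structures to align and overlay.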

FIGURE 5.


Overlaid structures of the heme c sites closest to the center of the four clusters identified in Feature Set 1 (see Table 1). The clusters are plotted in Figure 4c with matching colors. For each set, the heme cofactors were aligned using PyMOL and the C α and C β carbons were displayed as spheres. Left (His‐ligated hemes): heme A1248 of split‐Soret cytochrome c (PDB: 1H21; cluster 0; cluster size = 35; purple) overlaid with the heme A201 of cytochrome P460 (PDB: 6AMG; cluster 3; cluster size = 22; yellow). The figure highlights the different relative positions of the C α and C β carbons of the axial histidine ligand, and the different angles of the histidine imidazole rings relative to the heme. Right (Lys‐ligated hemes): Structure of heme A802 of octaheme tetrathionate reductase (PDB: 1SP3; cluster 2; cluster size = 1; green) overlaid with heme A513 of pentaheme cytochrome c nitrite reductase (PDB: 3BNJ, cluster 1; cluster size = 6; blue). The figure shows in green the unique ligation of a lysine occurring outside of its CXXCH motif, contrasted with the ligation of a lysine within a CXXCK motif in blue

3. DISCUSSION

The representation of sequence and structure data for machine‐learning has enabled advances in protein design and structure prediction. 34 , 35 This study showed that backbone geometries of a small number of amino acids (three or four) involved in metal binding can be compressed to an efficient, order‐independent, mathematical representation that captures the three dimensional geometry of the metal‐binding amino acids with no loss of prediction accuracy compared with the full set of order‐dependent internal coordinates. The compressed features can be fed into a simple and fast machine learning model that can distinguish between different kinds of metal cofactors with an overall accuracy of 95%, and the probabilities that are generated by the machine‐learning model are useful for further discernment between FPs and TPs.

The approach presented here can also be useful for distinguishing between different metal‐binding structures, and in particular shows the interplay between the protein backbone geometry and the structure of the primary coordination sphere of the metal, which was illustrated with the analysis of the heme‐histidine angle in Figure 5 and Figure S6. The assigned cluster of each heme group is based on an unsupervised learning algorithm using the features computed only from the backbone atom coordinates of the ligating amino acids. The correlation of the assigned cluster to the rotation of the histidine relative to the heme supports the strength of the feature set to discern between heme c binding sites with the same set of ligands but with different ligand binding geometries. The assigned cluster is also correlated to a non‐planar distortion of the heme known as ruffling, which is expected since the relationship between axial histidine rotation and out‐of‐plane heme ruffling has been observed and discussed previously. 36 In this study, however, the ruffling and the axial histidine rotation angle were both shown to be clearly correlated to the peptide backbone structure of the ligating amino acids found in a Cys‐X‐X‐Cys‐His or Cys‐X‐X‐Cys‐Lys motif. The cluster analysis and the visualization of the projected data set can therefore aid in quickly assessing the minimum number of different types of binding site geometries that are seen in the PDB for a given metal cofactor and ligand set.

The approach of classifying metal‐binding sites by backbone geometries alone could be useful for applications in metalloprotein engineering, where suitable sets of amino acids that can be mutated to generate a metal‐binding site can be quickly identified. As an illustration of how this can be used for protein engineering, every combination of closely‐spaced amino acids within the structure of poplar apoplastocyanin, the apo form of a Type 1 copper protein, was analyzed and the data for Feature Set 1 was exported (PDB 2PCY, 99 amino acids, 711 4‐ligand combinations). The machine learning classification, using the training data from the metal sites within the PDB as input, was used to classify these 711 sites, outputting probabilities that each site bound to an Fe2–S2 cluster (FES), an Fe4–S4 cluster (SF4), a c‐type heme, a mononuclear metal ion, or no metals. For every potential binding site, the metal‐binding site within the PDB closest in feature set‐space was found and exported, which could be used for further structure comparison or modeling. The computational time for all of these tasks was less than a minute on a standard laptop, demonstrating that this would be easy to scale up to high‐throughput analysis of many protein templates or even frames of a large molecular dynamics simulation to take into account structural heterogeneity.
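The lookup of the closest known metal site in feature space can be sketched as a brute‐force nearest‐neighbor search. The distance metric (plain Euclidean distance on already normalized features), the site identifiers, and the two‐dimensional toy vectors are assumptions for illustration only; the text does not state how the closest site was computed.

```python
from math import dist


def nearest_reference(query, reference):
    """Return (site_id, distance) of the reference site closest to `query`.

    reference: list of (site_id, feature_vector) pairs, all vectors having
    the same length as `query`. Euclidean distance is an assumption here.
    """
    site_id, features = min(reference, key=lambda item: dist(query, item[1]))
    return site_id, dist(query, features)


if __name__ == "__main__":
    # Hypothetical identifiers and toy 2-D feature vectors.
    reference = [("site_a", (0.0, 0.0)), ("site_b", (3.0, 4.0))]
    print(nearest_reference((2.5, 3.5), reference))
```

For a few hundred candidate sites and a reference set drawn from the PDB, a linear scan of this kind is usually fast enough; a spatial index would only matter at much larger scale.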

The classification of all the possible metal‐binding sites in apoplastocyanin was based on backbone atom geometry alone, and seven sites out of all 711 possibilities were predicted to bind to a metal with a high probability (P ≥ 90%), and another 39 sites were predicted to bind to a metal with a moderate probability (90% > P ≥ 60%). These sites indicate a set of the most likely sites that could be engineered by mutagenesis to bind to a metal without significant rearrangement of the protein backbone. The actual Type 1 copper binding site for plastocyanin had a moderate (70%) probability, which given the nearly identical structures of apo‐ and holo‐plastocyanin, is likely due to the fact that, although there are a handful of Type 1 copper binding sites with a Cys‐His‐His‐Met ligation in the PDB (28 in the data set used), there are also a number of similarly‐arranged combinations of four amino acids with varying sequences that are not metal‐binding. Whether these existing sites could be converted to Type 1 copper sites by mutagenesis would need further empirical investigation.

There are pitfalls inherent to using geometries of the backbone atoms alone for the structural classification of metal‐binding sites. If a metal‐binding site identified with this method involves the mutation of a glycine or a proline, the backbone structure of the protein may shift significantly, rendering the results inaccurate. Steric clashes and the presence of desirable secondary coordination sphere interactions are not included in the analysis, so candidate sites would have to be refined using different methods such as RMSD alignments, steric clash analysis, topographic steric maps, 37 analysis of secondary sphere interactions, or the force field methods that are described in the literature. 30 , 31 It is also clear that this method does not distinguish well between metal cofactors with only three ligating amino acids or between different mononuclear metal sites. Without specifying the set of binding amino acids, the method would not be a good predictor of, for example, whether zinc or iron would bind better to an engineered binding site without further structural modeling or analysis. The geometries of amino acid backbone atoms have significant overlap between different cofactors composed of single metal ions, and the amino acid identities and side chain positioning would need to be included in a subsequent analysis of metal binding site preferences. This lack of discernment between different mononuclear metal ions could also be seen as a strength of the model, showing a lack of overfitting based on the recognition of different protein folds commonly seen in the PDB. After all, it would be inaccurate for the model to predict an iron binding site if either an iron or a zinc could be incorporated, depending on the set of amino acid substitutions.

Importantly, because the training data comes from the Protein Data Bank, its generalizability toward primary coordination spheres (including amino acid identities and coordinating geometries) not found in the Protein Data Bank has yet to be tested. Geometries from the same metals and different ligand sets do not overlay well. Geometries from the same metal and ligand sets, yet with different side chain conformations, are also separated in the feature space. Therefore, many potential metal‐binding sites could exist within a protein template that would not be found by this set of training data because a similar binding site does not exist in the Protein Data Bank. In the short term, this pitfall could only be overcome with the introduction of structural data of accurate simulated metal binding site structures, the use of which is currently being explored.

4. CONCLUSIONS

This study proposes a methodology for encoding the geometries of a small number of metal‐binding amino acids within a protein to use as input for a machine learning algorithm that predicts with up to 97% accuracy the presence or absence of a metal cofactor. The accuracy is strongly dependent on the metal cofactor being analyzed, with the structurally constrained hemes c being the most accurately predicted. The encoded metal‐site geometries can be used for structural classification, demonstrated by their ability to cluster hemes c with an open distal coordination site into groups with either a different proximal ligand or a rotation of the axial histidine relative to the heme. Since the features used were intentionally computed from amino acid backbone coordinates alone and do not include the identities of the amino acids, the methodology can be used toward identifying sites within a protein structure that have favorable backbone geometries to bind to metal ions upon mutagenesis.

5. MATERIALS AND METHODS

5.1. Data set description

5.1.1. Selection of protein structures and metal sites

Protein structures in the Protein Data Bank with a resolution of 2.5 Angstroms or better as of January 11, 2021 were filtered by selecting only those with one of the following metal‐containing chemical groups: MN, MN3, FE, FE2, FES, F3S, SF4, HEC, HEM, CO, 3CO, NI, CU, CU1, or ZN. This list of chemical groups represents every Chemical ID in the PDB containing Mn, Fe, Co, Ni, Cu, or Zn that is present in at least 100 PDB structures to provide enough data for the machine learning algorithm. HEM and HEC are two Chemical IDs used for c‐type hemes. The list of PDB structures was further filtered by selecting a representative structure at 90% sequence identity to reduce redundancy in the database. The 4‐digit PDB codes from the resulting 9,955 structures were fed into PyMOL 38 using a custom script written in Python 3. The script loaded each PDB structure, deleted duplicate chains to remove redundancy, and selected the metal‐containing chemical groups (listed above). Metal cofactors that differed from one another by oxidation state alone were grouped together. Table S1 shows the description of each metal‐containing ligand, the number of PDB files for each, and the metal site groupings.

5.1.2. Amino acid set selection and ligand number

For every metal site, a list of residues binding to that metal site was generated by selecting amino acids that have an atom within a certain distance to the metal ion. The cutoff distances varied slightly by element (2.5 Angstroms for N, 2.65 Angstroms for O, and 2.75 Angstroms for S), and were chosen to conservatively capture the majority of amino acids in the metal's primary coordination sphere while ensuring that those amino acids not directly bonded to the metal were excluded. Distance‐to‐metal histograms for each element are shown in Data S1 (Figure S7). In addition, amino acids containing an atom within 2.75 Angstroms of atom CAB or atom CAC on a heme group were also selected in order to capture the amino acids responsible for covalent ligation of c‐type hemes. The ligands were filtered further by the type of atom from the amino acid ligand that was closest to the metal ion, using only canonical amino acids with a side chain N, side chain O, or a side chain S as the atom closest to the metal (see Table S2 for a description of the ligand atoms used). Structural data from metal sites were collected only if they had three or four different amino acid side chains binding to them (which excluded b‐type hemes).
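The element‐specific cutoffs can be expressed as a simple lookup. The sketch below is a simplified illustration rather than the authors' PyMOL script: it assumes the caller passes in side chain atoms only, treats "within" as less than or equal to the cutoff, and omits the heme CAB/CAC rule. The residue identifiers and coordinates are invented.

```python
from math import dist

# Cutoffs in Angstroms, as stated in the text.
CUTOFF = {"N": 2.5, "O": 2.65, "S": 2.75}


def ligating_residues(metal_xyz, side_chain_atoms):
    """Map residue_id -> (element, distance) of its closest qualifying atom.

    side_chain_atoms: iterable of (residue_id, element, xyz) tuples.
    """
    closest = {}
    for residue_id, element, xyz in side_chain_atoms:
        cutoff = CUTOFF.get(element)
        if cutoff is None:
            continue  # only N, O, and S donors are considered
        d = dist(metal_xyz, xyz)
        if d <= cutoff and (residue_id not in closest or d < closest[residue_id][1]):
            closest[residue_id] = (element, d)
    return closest


def usable_site(ligands) -> bool:
    """Sites were kept only with three or four coordinating residues."""
    return len(ligands) in (3, 4)


if __name__ == "__main__":
    atoms = [
        ("res1", "N", (2.0, 0.0, 0.0)),
        ("res2", "S", (0.0, 2.3, 0.0)),
        ("res3", "N", (0.0, 0.0, 2.1)),
        ("res4", "S", (0.0, 0.0, -2.9)),  # beyond the 2.75 A sulfur cutoff
        ("res5", "O", (2.7, 0.0, 0.0)),   # beyond the 2.65 A oxygen cutoff
    ]
    ligands = ligating_residues((0.0, 0.0, 0.0), atoms)
    print(sorted(ligands), usable_site(ligands))
```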

5.1.3. Data collection for non‐metal sites

To obtain structural data on non‐metal sites, the same set of representative protein structures at 90% sequence identity that had at least one metal site was used. Since there is a very large number of combinations of non‐metal binding amino acids in any given protein structure, a random selection of 1 out of every 200 (with 1 being the minimum) amino acids was made and their neighbors within 5.5 Angstroms were found. The number of amino acid ligands (either 3 or 4) used to extract features was chosen to give a distribution similar to the one observed within the metal site data set, where 30% of metal sites had three ligands and 70% of metal sites had four ligands.

Once the number of amino acids (n aa) to collect features from was determined, a complete list of all possible combinations of n aa amino acids among the set of neighbors was calculated. For the metal binding amino acids, no two amino acids could be more than 5.5 Angstroms apart, since in the metal sites each amino acid had to be within 2.75 Angstroms of the metal ion. Accordingly, for the non‐metal site comparison data, every combination of amino acids in which any two amino acids were farther than 5.5 Angstroms apart was removed. From the remaining amino acid combinations, a random selection of 1 out of every 200 combinations (with 1 being the minimum) was chosen to collect structural data from. Table S3 shows the final number of each type of metal site (including the non‐metal reference) separated by the number of ligating amino acids.
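The sampling of non‐metal reference sites can be illustrated as follows. This is a sketch under stated assumptions: each amino acid is represented by a single coordinate (the text does not specify which atoms define the distance between two amino acids), and the function names are hypothetical.

```python
import itertools
import math
import random

MAX_PAIR_DIST = 5.5  # Angstroms, from the text
SAMPLE_RATE = 200    # keep 1 out of every 200, with a minimum of 1

def choose_n_aa(rng=random):
    """Draw 3 or 4 ligands with the 30%/70% split seen for metal sites."""
    return 3 if rng.random() < 0.30 else 4

def compact_combinations(neighbors, n_aa):
    """All n_aa-sized combinations in which no pair exceeds MAX_PAIR_DIST.

    `neighbors` maps a residue id to one representative (x, y, z) point;
    the choice of representative atom is an assumption of this sketch.
    """
    combos = []
    for combo in itertools.combinations(sorted(neighbors), n_aa):
        if all(math.dist(neighbors[a], neighbors[b]) <= MAX_PAIR_DIST
               for a, b in itertools.combinations(combo, 2)):
            combos.append(combo)
    return combos

def subsample(items, rate=SAMPLE_RATE, rng=random):
    """Randomly keep 1 out of every `rate` items, but at least one."""
    if not items:
        return []
    return rng.sample(items, max(1, len(items) // rate))

# Made-up neighbor set: four mutually close residues and one distant residue.
example_neighbors = {
    "A": (0.0, 0.0, 0.0), "B": (3.0, 0.0, 0.0), "C": (0.0, 3.0, 0.0),
    "D": (0.0, 0.0, 3.0), "E": (10.0, 0.0, 0.0),
}
example_combos = compact_combinations(example_neighbors, 3)
example_sample = subsample(example_combos, rng=random.Random(0))
```

In this toy case residue E is excluded from every combination because it lies more than 5.5 Angstroms from at least one other member, and the 1‐in‐200 rule falls back to its minimum of one sampled combination.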

5.1.4. Feature selection

The three‐dimensional arrangement of the amino acids that bind the metal cofactor was analyzed to extract features that could be fed into a machine learning algorithm. Coordinates of the backbone C α, N, and carbonyl C were extracted from the PDB structure and used to calculate the position of C β using a singular value decomposition (SVD) alignment procedure to align the backbone atoms with those of free alanine. This way, the spatial orientation of the amino acid could be captured, while glycine could still be used to collect the control data for non‐metal sites, since only the structure of the backbone atoms was used in the calculation. The compiled C α and modeled C β atomic coordinates were then used to compute features that describe the structural relationship between them. Features that depend upon the ordering of the input coordinates were avoided. In addition, features were chosen that could be determined in the same way regardless of the number of input amino acid coordinates in order to simplify the data set for machine learning, such that the same number of features would exist regardless of the number of ligating amino acids. If this was not possible, as was the case for the area of a triangle formed by three coordinate points, a suitable correlating parameter was chosen, that is, the volume of a polyhedron formed by more than three coordinates. See Figure 1 for a depiction of the data set collecting scheme, and see Table S4 for a complete description of the features used.
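The C β modeling step can be sketched with a standard SVD‐based (Kabsch) superposition. The template coordinates below are commonly used idealized alanine values and are an assumption of this sketch, not the coordinates used in this study; the function names are likewise hypothetical.

```python
import numpy as np

# Idealized alanine template in Angstroms (assumed values, not from the paper).
IDEAL_BACKBONE = np.array([
    [-0.525, 1.363, 0.000],   # N
    [0.000, 0.000, 0.000],    # C alpha
    [1.526, 0.000, 0.000],    # carbonyl C
])
IDEAL_CB = np.array([-0.529, -0.774, -1.205])

def kabsch(mobile, target):
    """Rotation R and translation t that best map `mobile` onto `target`."""
    mob_c = mobile.mean(axis=0)
    tgt_c = target.mean(axis=0)
    H = (mobile - mob_c).T @ (target - tgt_c)   # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ mob_c
    return R, t

def place_cb(n, ca, c):
    """Model C beta from backbone N, C alpha, and carbonyl C coordinates."""
    target = np.array([n, ca, c], dtype=float)
    R, t = kabsch(IDEAL_BACKBONE, target)
    return R @ IDEAL_CB + t
```

Because only the three backbone atoms enter the fit, the same call works for any residue, including glycine, which is the property the text relies on for collecting non‐metal control data.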

To test whether the order‐independent features sufficiently captured the three‐dimensional arrangement of the coordinating amino acids, the order‐dependent internal coordinates (distance, angle, dihedral) of the C α and C β atoms of each ligand were also collected in a separate data set. For the order‐dependent data set, the internal coordinates for every possible ordering of the coordinating amino acids (P = n aa! permutations) were collected as separate data points.
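The difference between the two kinds of descriptors can be made concrete with a small example. The sketch below is illustrative only: it uses one point per ligand rather than separate C α and C β atoms, and the sorted pairwise distances stand in for the features of Table S4, which are not reproduced here.

```python
import itertools
import math
import numpy as np

def sorted_pairwise_distances(points):
    """Order-independent descriptor: all pairwise distances, sorted."""
    pts = np.asarray(points, dtype=float)
    return sorted(float(np.linalg.norm(pts[i] - pts[j]))
                  for i, j in itertools.combinations(range(len(pts)), 2))

def angle(a, b, c):
    """Angle a-b-c in degrees."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return math.degrees(math.acos(np.clip(cosang, -1.0, 1.0)))

def dihedral(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees."""
    b0, b1, b2 = a - b, c - b, d - c
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return math.degrees(math.atan2(np.dot(np.cross(b1, v), w), np.dot(v, w)))

def internal_coordinates(points):
    """Order-dependent descriptor along the given ordering of points."""
    p = np.asarray(points, dtype=float)
    dists = [float(np.linalg.norm(p[i + 1] - p[i])) for i in range(len(p) - 1)]
    angles = [angle(p[i], p[i + 1], p[i + 2]) for i in range(len(p) - 2)]
    diheds = [dihedral(*p[i:i + 4]) for i in range(len(p) - 3)]
    return dists + angles + diheds

def ordered_data_points(points):
    """One internal-coordinate vector per permutation (n aa! in total)."""
    p = np.asarray(points, dtype=float)
    return [internal_coordinates(p[list(perm)])
            for perm in itertools.permutations(range(len(p)))]

# Made-up coordinates for four ligands.
example_points = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0],
                           [1.0, 4.2, 0.0], [1.5, 1.2, 3.9]])
example_ordered = ordered_data_points(example_points)
```

For four ligands this yields 24 order‐dependent data points for a single site, whereas the order‐independent descriptor returns the same vector no matter how the ligands are listed.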

In addition to the feature data, information was extracted to help with data filtering and analysis. This included the one‐letter symbols of the set of ligating amino acids (in alphabetical order), the total number of amino acids, and the number of backbone carbonyls coordinating to each metal. The PDB code, the chain and residue number of the metal, and the chain and residue number of each amino acid were also collected to uniquely identify each metal site. Finally, the type of metal site (or None for a non‐metal site) was listed as input to train the machine learning algorithm.

5.2. Metal site classification methods

The machine learning classification was implemented in Python using the scikit‐learn package. The number of non‐metal sites used was set to be a maximum of half of the total number of metal sites for any given data set analyzed. A Random Forest Classifier was used as the classification model, and the GridSearchCV module was used to tune the hyperparameters max_features and n_estimators. Each classification was performed five times using a Stratified K‐fold method with five splits, such that in each split 80% of the data were used to train the model and the remaining 20% of the data were used to test the accuracy of the model. The results reported are averages of the five splits. Data collected using a different number of ligating amino acids (three or four) were modeled separately. If a specific type of metal site made up less than 2% of the data set, it was excluded from the analysis. Analysis and workup of the results were performed using a Python script developed in‐house.
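A workflow of this kind can be sketched with scikit‐learn as shown below. The data are synthetic, the hyperparameter grid is a hypothetical placeholder, and nesting the grid search inside each of the five splits is an assumption, since the exact arrangement is not specified in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for a 13-feature data set (not the data from this study).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           n_classes=3, random_state=0)

# Five stratified splits: 80% of the data for training, 20% for testing.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid over the two hyperparameters named in the text.
param_grid = {"max_features": [3, 6], "n_estimators": [25, 50]}

scores = []
for train_idx, test_idx in outer.split(X, y):
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=3)
    search.fit(X[train_idx], y[train_idx])
    # Accuracy of the refitted best model on the held-out 20%.
    scores.append(search.score(X[test_idx], y[test_idx]))

mean_accuracy = float(np.mean(scores))
```

Stratification keeps the proportion of each site type roughly constant across splits, which matters when some metal site types are much rarer than others.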

5.3. Feature set data visualization

The data were filtered by metal and by the identities of the ligating amino acids using the Pandas package in Python. In order to visualize the 13‐dimensional data from feature set 1, the data were first filtered into a relevant subset of metal and/or non‐metal sites with a particular number of ligating amino acids (n aa = 3 or n aa = 4). Next, the data were normalized to set the mean of each feature to zero and the standard deviation to 1. A principal component analysis was performed on the normalized subset of data. The data were then projected onto the first two principal components in order to retain as much of the variance between data points within feature set 1 as possible while compressing it to only two dimensions. Data points were then plotted as 2D scatter plots using the matplotlib Python package. The points were color‐coded according to the type of metal site (if there was more than one), the identities of the ligating amino acids (for a single metal site), or the cluster number assigned by K‐means clustering. For some plots, outliers (e.g., unique sequences) were excluded for simpler visualization, but these outliers were still included in all the analyses. The structures that represented the center of each cluster were aligned and visualized using PyMOL. 38
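The normalization and projection steps can be illustrated with NumPy alone. This sketch computes the principal components through an SVD of the standardized data, which is equivalent to a standard principal component analysis but is not necessarily the routine used in this study; the data are random placeholders, and the plotting and K‐means steps are omitted.

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit standard deviation.

    Assumes no feature is constant (a zero standard deviation would
    cause a division by zero).
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_project(X_std, n_components=2):
    """Project data onto its leading principal components via SVD."""
    X_centered = X_std - X_std.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)   # fraction of variance per component
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Random placeholder for a 13-feature subset (not the data from this study).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 13)) * np.arange(1, 14)

Z = standardize(features)
coords_2d, explained = pca_project(Z)
```

The two columns of `coords_2d` are the values that would be passed to a 2D scatter plot, and `explained` indicates how much of the total variance survives the compression to two dimensions.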

AUTHOR CONTRIBUTIONS

Hoang Nguyen: Conceptualization; data curation; formal analysis; investigation; methodology; software; validation; visualization. Jesse Kleingardner: Conceptualization; data curation; formal analysis; investigation; methodology; project administration; software; supervision; validation; visualization; writing‐original draft; writing‐review & editing.

CONFLICT OF INTEREST

The authors have no conflicts of interest to declare.

Supporting information

Data S1. Supporting Information.

ACKNOWLEDGMENTS

The authors thank Binbin Huang for his helpful discussions and suggestions about the features to use to represent ligand geometries.

Nguyen H, Kleingardner J. Identifying metal binding amino acids based on backbone geometries as a tool for metalloprotein engineering. Protein Science. 2021;30:1247–1257. 10.1002/pro.4074

REFERENCES

  • 1. Andreini C, Bertini I, Rosato A. Metalloproteomes: A bioinformatic approach. Acc Chem Res. 2009;42(10):1471–1479.
  • 2. Ajitha M, Sundar K, Mugilan SA, Arumugam S. Development of METAL‐ACTIVE SITE and ZINCCLUSTER tool to predict active site pockets. Proteins. 2018;86(3):322–331.
  • 3. Andreini C, Cavallaro G, Lorenzini S. FindGeo: A tool for determining metal coordination geometry. Bioinformatics. 2012;28(12):1658–1660.
  • 4. He W, Liang Z, Teng M, Niu L. mFASD: A structure‐based algorithm for discriminating different types of metal‐binding sites. Bioinformatics. 2015;31(12):1938–1944.
  • 5. Putignano V, Rosato A, Banci L, Andreini C. MetalPDB in 2018: A database of metal sites in biological macromolecular structures. Nucleic Acids Res. 2018;46:D459–D464.
  • 6. Srivastava A, Kumar M. Prediction of zinc binding sites in proteins using sequence derived information. J Biomol Struct Dyn. 2017;36(16):4413–4423.
  • 7. Valasatava Y, Rosato A, Cavallaro G, Andreini C. MetalS3, a database‐mining tool for the identification of structurally similar metal sites. JBIC J Biol Inorg Chem. 2014;19(6):937–945.
  • 8. Zhao W, Xu M, Liang Z, et al. Structure‐based de novo prediction of zinc‐binding sites in proteins of unknown function. Bioinformatics. 2011;27(9):1262–1268.
  • 9. Zheng H, Cooper DR, Porebski PJ, Shabalin IG, Handing KB, Minor W. CheckMyMetal: A macromolecular metal‐binding validation tool. Acta Crystallogr D Struct Biol. 2017;73(Pt 3):223–233.
  • 10. Valasatava Y, Andreini C, Rosato A. Hidden relationships between metalloproteins unveiled by structural comparison of their metal sites. Sci Rep. 2015;5:9486.
  • 11. Lu Y, Yeung N, Sieracki N, Marshall NM. Design of functional metalloproteins. Nature. 2009;460:855–862.
  • 12. Nastri F, D'Alonzo D, Leone L, Zambrano G, Pavone V, Lombardi A. Engineering metalloprotein functions in designed and native scaffolds. Trends Biochem Sci. 2019;44:1022–1040.
  • 13. Zastrow ML, Pecoraro VL. Designing functional metalloproteins: From structural to catalytic metal sites. Coord Chem Rev. 2013;257(17):2565–2588.
  • 14. Lombardi A, Pirro F, Maglio O, Chino M, DeGrado WF. De novo design of four‐helix bundle metalloproteins: One scaffold, diverse reactivities. Acc Chem Res. 2019;52(5):1148–1159.
  • 15. Tebo AG, Pinter TBJ, García‐Serres R, et al. Development of a rubredoxin‐type center embedded in a de novo‐designed three‐helix bundle. Biochemistry. 2018;57(16):2308–2316.
  • 16. Zhang S‐Q, Chino M, Liu L, et al. De novo design of tetranuclear transition metal clusters stabilized by hydrogen‐bonded networks in helical bundles. J Am Chem Soc. 2018;140(4):1294–1304.
  • 17. Natoli SN, Hartwig JF. Noble‐metal substitution in hemoproteins: An emerging strategy for abiological catalysis. Acc Chem Res. 2019;52(2):326–335.
  • 18. Selvan D, Prasad P, Farquhar ER, et al. Redesign of a copper storage protein into an artificial hydrogenase. ACS Catal. 2019;9(7):5847–5859.
  • 19. Slater JW, Shafaat HS. Nickel‐substituted rubredoxin as a minimal enzyme model for hydrogenase. J Phys Chem Lett. 2015;6(18):3731–3736.
  • 20. Ward TR. Directed evolution of iridium‐substituted myoglobin affords versatile artificial metalloenzymes for enantioselective C–C bond‐forming reactions. Angew Chem Int Ed. 2016;55(48):14909–14911.
  • 21. Yu Y, Cui C, Liu X, Petrik ID, Wang J, Lu Y. A designed metalloenzyme achieving the catalytic rate of a native enzyme. J Am Chem Soc. 2015;137(36):11570–11573.
  • 22. Marvin JS, Hellinga HW. Conversion of a maltose receptor into a zinc biosensor by computational design. Proc Natl Acad Sci. 2001;98(9):4955–4960.
  • 23. Rittle J, Field MJ, Green MT, Tezcan FA. An efficient, step‐economical strategy for the design of functional metalloproteins. Nat Chem. 2019;11(5):434–441.
  • 24. Markel U, Sauer DF, Schiffels J, Okuda J, Schwaneberg U. Towards the evolution of artificial metalloenzymes—A protein engineer's perspective. Angew Chem Int Ed. 2019;58(14):4454–4464.
  • 25. Koebke KJ, Pecoraro VL. Noncoded amino acids in de novo metalloprotein design: Controlling coordination number and catalysis. Acc Chem Res. 2019;52(5):1160–1167.
  • 26. Mirts EN, Bhagi‐Damodaran A, Lu Y. Understanding and modulating metalloenzymes with unnatural amino acids, non‐native metal ions, and non‐native metallocofactors. Acc Chem Res. 2019;52:935–944.
  • 27. Akcapinar GB, Sezerman OU. Computational approaches for de novo design and redesign of metal‐binding sites on proteins. Biosci Rep. 2017;37(2):BSR20160179.
  • 28. Clarke ND, Yuan S‐M. Metal search: A computer program that helps design tetrahedral metal‐binding sites. Proteins. 1995;23(2):256–263.
  • 29. Hellinga HW, Richards FM. Construction of new ligand binding sites in proteins of known structure: I. computer‐aided modeling of sites with pre‐defined geometry. J Mol Biol. 1991;222(3):763–785.
  • 30. Pantazes RJ, Grisewood MJ, Li T, Gifford NP, Maranas CD. The iterative protein redesign and optimization (IPRO) suite of programs. J Comput Chem. 2015;36(4):251–263.
  • 31. Liu Y, Kuhlman B. RosettaDesign server for protein design. Nucleic Acids Res. 2006;34:W235–W238.
  • 32. Berman HM, Westbrook J, Feng Z, et al. The protein data bank. Nucleic Acids Res. 2000;28(1):235–242.
  • 33. Liu T, Altman RB. Prediction of calcium‐binding sites by combining loop‐modeling with machine learning. BMC Struct Biol. 2009;9:72.
  • 34. Nguyen DD, Cang Z, Wei G‐W. A review of mathematical representations of biomolecular data. Phys Chem Chem Phys. 2020;22(8):4343–4367.
  • 35. Yang KK, Wu Z, Arnold FH. Machine‐learning‐guided directed evolution for protein engineering. Nat Methods. 2019;16(8):687–694.
  • 36. Fufezan C, Zhang J, Gunner MR. Ligand preference and orientation in b‐ and c‐type heme‐binding proteins. Proteins. 2008;73(3):690–704.
  • 37. Falivene L, Cao Z, Petta A, et al. Towards the online computer‐aided design of catalytic pockets. Nat Chem. 2019;11(10):872–879.
  • 38. Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8; 2015.

