Protein Science: A Publication of the Protein Society. 2022 Dec;31(12):e4484. doi: 10.1002/pro.4484

GeoPacker: A novel deep learning framework for protein side‐chain modeling

Jiale Liu 1, Changsheng Zhang 2, Luhua Lai 1,2,3
PMCID: PMC9667900  PMID: 36309961

Abstract

Atomic interactions play essential roles in protein folding, structure stabilization, and function. Recent advances in deep learning-based methods have achieved impressive success not only in protein structure prediction but also in protein sequence design. However, highly efficient and accurate protein side-chain prediction methods that can resolve detailed atomic interactions are still lacking. In the present study, we developed a deep learning-based method, GeoPacker, that couples geometric deep learning with a ResNet for protein side-chain modeling. GeoPacker explicitly represents atomic interactions with rotational and translational invariance for extracting information about relative locations. GeoPacker outperformed the state-of-the-art energy function-based methods in side-chain structure prediction accuracy, and runs about 10 and 700 times faster than the deep learning-based methods DLPacker and OPUS-Rota4, respectively, with comparable prediction accuracy. The performance of GeoPacker does not depend on the secondary structure the residues belong to. GeoPacker gives highly accurate predictions for buried residues in the protein core as well as at protein–protein interfaces, making it a useful tool for protein structure modeling, protein design, and interaction design.

Keywords: deep learning, protein design, protein side‐chain packing, protein side‐chain structure prediction


Abbreviations

pdf: probabilistic distribution function
rSASA: relative solvent accessible surface area
RMSD: root mean square deviation

1. INTRODUCTION

Protein tertiary structure prediction 1 , 2 , 3 , 4 , 5 has achieved revolutionary improvements using deep learning methods and tremendous numbers of homologous sequences. Although protein structure is in principle determined by its primary sequence, the prediction task remains extremely challenging for orphan sequences 6 , 7 , 8 due to the immense conformational space and the lack of constraints from coevolutionary signals. Protein de novo design aims to produce sequences that are far from the existing natural sequence space yet compatible with pre-determined protein backbones. Thus, validating de novo designed proteins without homologous sequences by structure prediction is difficult. Most recently developed deep learning-based methods for protein sequence design 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 only recover the amino acid residue identities for a given protein backbone without constructing atom-level interactions, because of the coarse-grained models used. An alternative way to circumvent structure prediction for de novo designed sequences is to accurately model side-chain conformations at the atom level, as the main driving force for soluble protein folding is hydrophobic interaction in the inner core. In addition, protein side-chain conformation is closely linked to protein biological function. 18 Hence, developing fast and accurate protein side-chain modeling methods is important for protein structure prediction, protein engineering, and de novo protein design.

Traditional computational tools for protein side-chain modeling 19 , 20 , 21 , 22 consist of a rotamer library, a search algorithm, and an energy scoring function based on a force field or statistical potentials. Most of these tools achieve similar performance in predicting side-chain dihedral angles within a stringent tolerance of 20° deviation. 22 However, defining accurate energy functions and high-quality rotamer libraries remains difficult, which limits the accuracy of side-chain modeling methods. 23 , 24 , 25 , 26 In recent years, deep learning approaches have brought breakthroughs in protein structure prediction 2 as well as in other demanding scientific problems. 27 , 28 Deep learning-based methods have also been used to tackle the side-chain modeling problem. 29 , 30 , 31 , 32 , 33 For example, DLPacker, 31 a 3D convolutional neural network-based method that formulates side-chain prediction as an image-to-image transformation, improved accuracy by a large margin; however, it is sampling-based and depends on a discrete rotamer library, which makes it computationally demanding. OPUS-Rota4 32 assembled the DLPacker network architecture to predict distances, orientations, and an initial side-chain model, then used these predicted geometries as constraints to further improve the side-chain model. 3 OPUS-Rota4 also employs coevolutionary information of the target sequence for prediction. Despite its high accuracy in side-chain modeling, the requirement for many hierarchical input features makes OPUS-Rota4 much slower and harder to apply.

Protein structure modeling and de novo protein design demand both accurate and fast side‐chain modeling calculations. Here we developed a geometric deep learning coupled ResNet 34 model, GeoPacker, to directly predict side‐chain dihedral angles. GeoPacker is rotamer library‐free and rotational and translational invariant. When repacking side chains for both natural proteins and de novo designed proteins, GeoPacker outperforms the state‐of‐the‐art energy function‐based methods in accuracy and DLPacker in speed. GeoPacker provides a user‐friendly tool for large‐scale protein side‐chain modeling tasks.

2. RESULTS AND DISCUSSION

2.1. Model overview

We devised a simplified geometric deep learning coupled ResNet to build a multitask learning architecture in which tasks are related and mutually supervised, thereby improving side-chain dihedral angle prediction accuracy and robustness. Two auxiliary objectives are included: prediction of pairwise distances between distal side-chain heavy atoms (Table S1) 35 and prediction of the orientation, defined as the dihedral angle Cβ–Cγ···Cγ–Cβ between two residues (N–Cα for Gly, Cγ–Cβ for Ala, Cγ1 for Ile and Val, and Cγ2 for Thr). These two additional constraints are in principle sufficient to reconstruct and refine the full-atom model. Similar to protein distance map prediction, 3 each dihedral angle is split into 48 bins (7.5° per bin) from −180° to 180°. For a protein backbone X and a sequence Y, we formulate the problem as:

$$P\left(R \mid X, Y\right) = P\left(r_1, r_2, r_3, \ldots, r_n \mid Y, X\right) \overset{\text{independent}}{=} P\left(r_1 \mid Y, X\right) \cdot P\left(r_2 \mid Y, X\right) \cdots P\left(r_n \mid Y, X\right), \tag{1}$$

where $P\left(r_i \mid Y, X\right) = \prod_{j=1}^{4} P\left(\chi_{ij} \mid Y, X\right)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, 3, 4$. Here we assume that each residue rotamer is independent given the whole sequence, which accelerates inference. In reality, a specific residue rotamer is determined by its local microenvironment, but this approximation is alleviated, and becomes reasonable to some extent, with the addition of coarse-grained rotamer-free residue types. Kernel density estimation is used to approximate the probabilistic distribution function (pdf) of each bin from the curated dataset; the specific dihedral angle $\chi_{ij}$ is then sampled from the predicted discrete bin,

$$\chi_{ij} \sim \mathrm{pdf}\left(\mathrm{Bin}_{\mathrm{left}}, \mathrm{Bin}_{\mathrm{right}}\right) \tag{2}$$
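As a minimal sketch of this scheme (48 bins of 7.5° from −180° to 180°, then a KDE-style draw restricted to the predicted bin), the following assumes a simple Gaussian-kernel sampler; the function names and the bandwidth are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

N_BINS = 48
BIN_WIDTH = 360.0 / N_BINS  # 7.5 degrees per bin

def angle_to_bin(chi):
    """Map a dihedral angle in degrees ([-180, 180)) to one of 48 bins."""
    return int((chi + 180.0) // BIN_WIDTH) % N_BINS

def sample_chi_from_bin(bin_idx, training_angles, bandwidth=1.5, rng=None):
    """Draw a continuous dihedral from a KDE restricted to the predicted bin.

    A Gaussian-KDE sample is an observed training angle plus Gaussian noise
    (the bandwidth is an assumed hyperparameter); draws that fall outside
    the bin boundaries are rejected and retried.
    """
    rng = np.random.default_rng() if rng is None else rng
    lo = -180.0 + bin_idx * BIN_WIDTH
    hi = lo + BIN_WIDTH
    support = training_angles[(training_angles >= lo) & (training_angles < hi)]
    if support.size == 0:               # empty bin: fall back to the bin center
        return 0.5 * (lo + hi)
    while True:
        chi = rng.choice(support) + rng.normal(0.0, bandwidth)
        if lo <= chi < hi:
            return chi
```

The rejection loop keeps the sampled angle inside the predicted bin while preserving the smooth KDE shape within it.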

An overview of the algorithm is shown in Figure 1. The GNN module encodes the local geometric features of the protein, and the ResNet module extracts global constraint information from the pairwise Cα distance map.

FIGURE 1.

A deep learning framework for protein side-chain modeling. (a) Left: the input sequence; middle: the GNN module for local information extraction; right: the ResNet plus attention module for global constraints from the distance map. After the ResNet module, the two auxiliary objectives are each predicted through a 2D CNN and a fully connected layer. (b) The GNN module built on the protein backbone. Colors and aggregation updates represent the per-layer status of the residues and links, respectively. (c) Data processing workflow of the ResNet module and the self-attention module. (d) Geometric features for pairwise residues r i and r j , including backbone atoms C, N, O, Cα, and a pseudo Cβ.

2.2. Performance and comparison

The common metric for protein side-chain modeling is dihedral angle recovery within a stringent tolerance, that is, within 20° deviation for every χ i of a residue compared with the experimental structure. Protein side-chain recovery is defined as the fraction of side-chain conformations correctly predicted within this tolerance,

$$\text{protein side-chain recovery rate} = \frac{\sum_{i=1}^{L} I_i(r)}{L}, \tag{3}$$

where L is the length of the protein chain and $I_i(r) = \prod_{j=1}^{4} I_{ij}\left(r_j\right)$ is an indicator function for residue i. $I_{ij}$, the indicator for the j th side-chain dihedral angle of residue i, equals 1 only if that angle is predicted correctly, and 0 otherwise. In addition, the side-chain recovery for individual residue types is also used as a metric:

$$\text{residue-}k \text{ side-chain recovery rate} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{L_i} I_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{L_i} I\left(r_{ij} = \text{residue } k\right)}, \tag{4}$$

where k = 1, 2, …, 20, N is the total number of protein chains in the dataset, L i is the length of the i th protein chain, and r ij is the residue type of the j th residue in the i th protein chain. The indicator I ij equals 1 only if the j th residue in the i th protein is of type k and all of its side-chain dihedral angles are predicted correctly, and 0 otherwise.

Depending on side-chain length, we consider χ 1 for Cys, Pro, Ser, Thr, and Val; χ 1–2 for Asp, Asn, Ile, Leu, Phe, Trp, and Tyr; χ 1–3 for Met, Glu, and Gln; and χ 1–4 for Arg and Lys. It is worth pointing out that several side-chain groups (Phe, Tyr, Asp, and Glu) are symmetric under 180° rotation, and that all dihedral angles are periodic, so that −175° and 175° are actually only 10° apart.
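The tolerance check, with angular periodicity and a 180° symmetry applied to the terminal dihedral of Phe/Tyr/Asp/Glu, can be sketched as follows; the `SYMMETRIC` mapping and function names are illustrative assumptions rather than the paper's code:

```python
import numpy as np

# Assumed mapping: residue name -> index of its 180-degree-symmetric chi
# (chi2 for Phe/Tyr/Asp, chi3 for Glu).
SYMMETRIC = {"PHE": 2, "TYR": 2, "ASP": 2, "GLU": 3}

def chi_deviation(pred, true, period=360.0):
    """Smallest absolute angular difference under periodicity:
    -175 and 175 degrees are only 10 degrees apart."""
    d = abs(pred - true) % period
    return min(d, period - d)

def residue_recovered(res_name, pred_chis, true_chis, tol=20.0):
    """A residue counts as recovered only if every chi is within `tol`
    degrees (Eq. 3); symmetric terminal chis use a 180-degree period."""
    for j, (p, t) in enumerate(zip(pred_chis, true_chis), start=1):
        period = 180.0 if SYMMETRIC.get(res_name) == j else 360.0
        if chi_deviation(p, t, period) > tol:
            return False
    return True

def recovery_rate(records, tol=20.0):
    """Fraction of residues with all dihedrals correct; `records` holds
    (residue name, predicted chis, true chis) tuples."""
    hits = [residue_recovered(n, p, t, tol) for n, p, t in records]
    return sum(hits) / len(hits)
```

Under this criterion a Phe χ2 prediction of 90° matches a native value of −90° exactly, while the same pair for Leu would count as a 180° error.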

To avoid overfitting during training, the data were partitioned for cross-validation. Specifically, we randomly split 14,888 protein chains into a training set of 13,000 chains for fourfold cross-validation and an independent test set of 1,888 chains (TS1888). The average performance of the four models is evaluated on TS1888.

We first depict the distribution of per-protein side-chain recovery (Figure 2a); the overall side-chain recovery within the stringent 20° tolerance on TS1888 is 65.95%. We also compute the correlation (Figure 2b) between side-chain recovery and the root mean square deviation (RMSD) of side-chain heavy atoms from the native structure, excluding the backbone atoms (N, C, Cα, and O) and the pseudo Cβ; the correlation is −0.64. Intuitively, the higher the side-chain recovery, the smaller the RMSD, but the reverse is not true. For example, if only the last predicted dihedral angle of every residue deviated strongly from the native structure, the RMSD would still be small, while the side-chain recovery rate would be zero, since recovery requires all dihedral angles to be predicted correctly. A similar phenomenon was demonstrated previously. 26 Further analysis (Figure S2) showed that most wrong predictions are side chains on the surface, especially for χ 3–4. One potential reason is their sparse microenvironment compared with the crowded inner core region. In addition, solvent-exposed side chains are usually flexible and may take different conformations in different crystal structures, while side chains buried in the hydrophobic core are normally rigid. Rotamer recovery rates for exposed and buried side chains should therefore be compared separately. We define buried residues as those with relative solvent accessible surface area (rSASA) lower than 15% 36 measured by NACCESS 37 and exposed residues as those with rSASA larger than 15%. As shown in Figure 3a, buried side chains are modeled more precisely than exposed ones. We further checked whether side-chain modeling accuracy depends on secondary structure, as favorable side-chain rotamers are well known to be backbone dependent. 38 , 39 To this end, we reduced the 8-state DSSP 40 secondary structures to 3 states according to the following rules: α-helix (H) and 3–10 helix (I) to H; extended conformation (E) and isolated bridge (B) to E; all others to C. Prediction accuracies for residues in the three types of secondary structure did not differ much (Figure 3b). Side chains of the three aromatic residues Trp, Phe, and Tyr on extended structures (E) are modeled more accurately, while side chains of long charged residues are less accurate on loops.
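The 8-to-3 state reduction and the 15% rSASA burial threshold amount to two small lookup rules. A sketch following the rules as stated in the text (function names are assumptions):

```python
# Reduce DSSP's 8 secondary-structure codes to 3 states per the rule in the
# text: H and I -> helix (H), E and B -> strand (E), everything else -> coil (C).
TO_3STATE = {"H": "H", "I": "H", "E": "E", "B": "E"}

def reduce_ss(dssp_state):
    """Collapse one DSSP state letter to the 3-state alphabet H/E/C."""
    return TO_3STATE.get(dssp_state, "C")

def burial_class(rsasa, threshold=0.15):
    """Classify a residue by relative solvent accessible surface area:
    below 15% is treated as buried, otherwise exposed."""
    return "buried" if rsasa < threshold else "exposed"
```

For example, a DSSP turn (T) maps to coil, and a residue with rSASA 0.05 is counted in the buried recovery statistics.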

FIGURE 2.

(a) Distribution of protein side-chain χ 1–4 recovery within 20°; (b) correlation between side-chain recovery and the RMSD of side-chain heavy atoms relative to the experimental structure.

FIGURE 3.

Side-chain recovery rates of individual residue types with respect to rSASA (a) and different secondary structures (b), within the stringent 20° tolerance.

We then investigated the importance of different components of the network. We first removed the distance map and the corresponding 2D task predictions. Interestingly, the side-chain recovery rate changed little (64.80% vs. 64.55%), while the standard deviation increased (0.14% vs. 0.17%), indicating that single-task prediction fluctuates more. We further eliminated the distance prediction for distal heavy atoms and the orientation prediction; performance decreased in both cases, with the distance prediction contributing more. This analysis demonstrates the advantages of multitask learning for accuracy and robustness (Table 1).

TABLE 1.

Component selection of the model architecture and recovery of χ 1–4 within the stringent 20° tolerance

Component excluded       Recovery (%, cross validation)                 Recovery (%, test)
                         All            Buried         Exposed         All      Buried   Exposed
None                     64.80 ± 0.14   82.26 ± 0.27   56.04 ± 0.13    65.95    83.69    57.40
Distance map             64.55 ± 0.17   82.22 ± 0.34   55.79 ± 0.11    66.24    84.19    57.52
Distance prediction      63.32 ± 0.19   80.53 ± 0.32   54.81 ± 0.13    64.94    82.40    56.53
Orientation prediction   64.46 ± 0.16   82.13 ± 0.22   55.70 ± 0.13    66.06    83.94    57.44

We further compared GeoPacker to representative side-chain modeling methods on the independent test set TS1888 (Table 2). GeoPacker outperformed the state-of-the-art scoring-function-based methods, including Scwrl4 20 and FASPR, 22 and is comparable to DLPacker, 31 a representative deep learning-based method, for χ 1–2 recovery, while slightly worse for χ 3 and χ 4 recovery. Figure 4 shows the performance for the 20 residue types. GeoPacker works better for χ 1–2 of exposed charged or polar residues, particularly Arg and Lys, than when all four dihedral angles are considered, as χ 3 and χ 4 are more influenced by the flexible conformations of long side chains. GeoPacker performs slightly worse than the other methods for Met side-chain recovery, mainly due to prediction errors in the terminal dihedral angle χ 3. For Trp, GeoPacker gave a recovery rate similar to Scwrl4's, which is lower than those of FASPR and DLPacker, probably due to Trp's low frequency in nature and GeoPacker's large continuous rotamer exploration space, whereas the other methods are all based on discrete sampling. In addition, we retrained our model on another dataset, taken directly from trRosetta, 3 with lower (30%) pairwise sequence identity. As shown in Table S2, the side-chain recovery rates of all tested methods increased remarkably on the independent test set trRosettaTS1833 (see Appendix S1 for the data split). GeoPacker outperforms DLPacker in recovery of both χ 1–2 and χ 1–4. GeoPacker's performance on individual residues is also much improved (Figure S3), especially for the long charged residues Glu, Arg, and Lys. This might come from the high quality of the trRosetta dataset as well as its early release date (before May 2018). For a fair comparison, GeoPacker was trained on this dataset and tested alongside the other methods on the CASP14 targets, comprising 34 protein chains in total. GeoPacker outperforms Scwrl4 and FASPR by a large margin, and is also comparable with DLPacker and OPUS-Rota4 (Table S3).

TABLE 2.

Performance evaluation of protein side-chain packing on the independent test set TS1888

Method          Recovery of χ 1–4 (%, within 20°)     Recovery of χ 1–2 (%, within 20°)   Runtime (s)
                All      Buried   Exposed              All
Scwrl4 20       58.47    76.18    49.79                62.77                               2.82
FASPR 22        61.65    80.16    52.61                66.46                               0.18
DLPacker 31     66.00    86.27    56.19                71.62                               34.70
OPUS-Rota4 32   69.89    89.77    60.27                76.46                               2,206.12 a
GeoPacker       65.95    83.69    57.07                74.02                               3.29
a The corresponding job was run as 7 parallel jobs lasting 315.16 s. The total time for running a single job, as the other programs did, was estimated by multiplying 315.16 by 7.

FIGURE 4.

Side-chain dihedral angle recovery within the stringent 20° tolerance for individual residue types.

Given the prospect of promising applications, especially large-scale de novo protein sequence design, both accuracy and running speed are important. To the best of our knowledge, FASPR is currently the fastest tool, though with somewhat lower prediction accuracy. As the authors of DLPacker pointed out, DLPacker spends most of its running time sampling the rotamer that best matches the predicted side-chain atomic density, and OPUS-Rota4 suffers the same problem because it assembles the DLPacker network architecture. Using an algorithm that directly generates side-chain dihedral angles, GeoPacker is about 10 times faster than DLPacker and 700 times faster than OPUS-Rota4 with comparable prediction accuracy (Table 2). Because of the expensive time cost, only a subset of tasks, including TS1888 and CASP14, were tested with OPUS-Rota4.

3. DISCUSSION

Structures of protein side chains are important for biological function. Despite their admirable efficiency, traditional modeling methods are limited by inaccurate energy functions and low-quality rotamer libraries. Deep learning-based side-chain modeling approaches are independent of rotamer libraries and are thus free from the limitations of discrete rotamer sampling. We have developed a geometric deep learning coupled ResNet algorithm to simultaneously predict three mutually supervised tasks. Although only the predicted side-chain dihedral angles are used to construct the full-atom model and the detailed atomic interactions, the other tasks can serve as constraints for further side-chain conformational refinement by introducing a potential function, 41 a strategy many studies have leveraged to improve task performance. 3 , 32 For example, Xu et al. 32 improved protein side-chain packing by around 2% using this strategy. On the other hand, a fully rational prediction procedure would be autoregressive, since the conformation of one residue is coupled with those of other residues in its microenvironment. To speed up inference, we assumed residue conformations are independent while the coarse-grained rotamer-free residue types remain coupled, which may introduce inaccuracy that could be reduced in the future.

In addition, we focused more on the accurate prediction of buried residues, as solvent-exposed residues are more flexible and contribute less to protein folding and structural stability. Proteins that fold into stable structures form well-defined hydrophobic cores in which most buried polar atoms form satisfactory hydrogen bonds. 42 Given the prospect of using GeoPacker for de novo protein sequence design, we tested 10 recently released structures (Table S4) with novel designed backbone topologies or de novo sequences for natural structures. 13 , 43 , 44 The overall χ 1–4 recovery of all tested methods decreased by 10% compared with natural proteins, yet accuracy in the core region did not change significantly (Table 3). Thus GeoPacker is useful for large-scale protein sequence design, forming accurate atomic interactions in the hydrophobic core with promising precision.

TABLE 3.

Protein side‐chain recovery on novel structures or new sequences

Method      Recovery of χ 1–4 (%, within 20°)
            All      Buried   Exposed
Scwrl4      49.05    76.69    37.76
FASPR       54.20    82.95    42.75
DLPacker    56.38    86.70    44.27
GeoPacker   58.46    86.35    47.34

As introducing perturbations was shown to be useful in flexible-backbone protein design, 45 we examined how tolerant and generalizable GeoPacker is to backbone perturbations. We evaluated the influence using 379 structures comprising five non-native test sets with different perturbation levels, 22 , 46 whose average RMSDs (aRMSD) relative to the native backbone structures are 0.21, 0.57, 0.93, 1.48, and 1.88 Å, respectively. All methods could tolerate perturbations below 1 Å (Figure S4).
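The aRMSD values quoted above are backbone RMSDs after optimal superposition. A minimal sketch of the standard Kabsch-algorithm computation (our own illustrative version, not the authors' tooling):

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between two (n, 3) coordinate sets after optimal superposition.

    Both sets are centered, the optimal rotation is found from the SVD of
    the covariance matrix (with a reflection correction so the result is a
    proper rotation), and the RMSD is computed after rotating p onto q.
    """
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    u, s, vt = np.linalg.svd(p.T @ q)
    d = np.sign(np.linalg.det(u @ vt))      # avoid improper (mirror) rotations
    r = u @ np.diag([1.0, 1.0, d]) @ vt
    p_rot = p @ r
    return float(np.sqrt(np.mean(np.sum((p_rot - q) ** 2, axis=1))))
```

Applying this to a backbone and a rigidly rotated-plus-translated copy of itself returns an RMSD of essentially zero, as expected.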

We also tested whether GeoPacker can reconstruct side-chain conformations at protein–protein interfaces. Protein–protein interactions are mainly mediated by hydrophobic interactions and hydrogen bonds. To adapt our model, the graph representation is extended to the protein complex, with edges and their corresponding properties added between adjacent interface residues, and the distance map spanning the whole complex. We tested the performance of GeoPacker on six de novo designed cases (Table S4) reported in two studies. 47 , 48 The interface residue side-chain recovery is 68.57%, indicating the transferability of our model; interface residues are defined as those with any pair of heavy atoms from different protein chains within 5 Å. To reach state-of-the-art performance on the protein interface modeling problem, further fine-tuning of our transferred model on a protein–protein interaction interface dataset will be necessary.
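The 5 Å heavy-atom interface definition can be sketched as follows, assuming each chain is given as a list of per-residue heavy-atom coordinate arrays (an illustrative data layout, not the paper's):

```python
import numpy as np

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Find interface residues between two chains.

    A residue is on the interface if any of its heavy atoms lies within
    `cutoff` angstroms of a heavy atom in the other chain. Each chain is a
    list of (n_atoms_i, 3) arrays, one per residue. Returns sorted residue
    indices for each chain.
    """
    iface_a, iface_b = set(), set()
    for i, atoms_i in enumerate(chain_a):
        for j, atoms_j in enumerate(chain_b):
            # all pairwise atom-atom distances for this residue pair
            d = np.linalg.norm(atoms_i[:, None, :] - atoms_j[None, :, :], axis=-1)
            if d.min() < cutoff:
                iface_a.add(i)
                iface_b.add(j)
    return sorted(iface_a), sorted(iface_b)
```

The interface recovery statistic above would then be Eq. (3) restricted to the residues this function returns.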

4. CONCLUSION

In this study, we proposed a simplified geometric deep learning coupled ResNet approach, GeoPacker, for protein side-chain structure prediction, whose only inputs are geometric features of the protein backbone and seven reduced physicochemical properties for each residue type. Our multitask learning framework, comprising the orientation prediction of Cβ–Cγ···Cγ–Cβ, the distance prediction of pairwise distal side-chain heavy atoms, and the side-chain dihedral angle prediction, demonstrates that mutually supervised objectives increase the accuracy and robustness of protein side-chain modeling. GeoPacker outperforms state-of-the-art energy scoring function-based methods in precision and runs about 10 times faster than the deep learning-based method DLPacker with comparable performance. A series of tests, including de novo designed proteins and different backbone perturbations, further illustrates that GeoPacker is useful for protein structure modeling and de novo protein sequence and interface design, especially large-scale side-chain modeling, with admirable efficiency and precision. GeoPacker is available on GitHub at https://github.com/PKUliujl/GeoPacker.

5. MATERIALS AND METHODS

5.1. Data sets

We first removed redundant proteins with PISCES 49 using the following rules: (a) pairwise sequence identity <60%; (b) only X-ray crystallographic structures with resolution <2.0 Å; (c) R-factor <0.25; (d) chain length longer than 40 but shorter than 1,000 residues. Note that side-chain packing of longer proteins is still feasible with our pre-trained model; the length cutoff only reflects the finite memory of our machines. The final dataset contains 14,888 proteins.

5.2. Input features

On the one hand, protein side-chain packing is largely determined by local spatial interactions, so local features are important for determining the rotamers. Several features are used: (a) backbone dihedral angles (φ, ψ) with their sine and cosine values; (b) 8-state secondary structures defined by DSSP 40 ; (c) the orientations and distances of pairwise backbone atoms from distinct residues within a cutoff, including the orientations {C, N, O, Cα}–Cβ···Cβ–{Cα, C, N, O} and the corresponding distances, where the pseudo Cβ is built using standard bond lengths and angles (1.55 Å for the Cα–Cβ bond length, 110.5° for the plane angle C–Cα–Cβ, and 122.55° for the dihedral angle N–C–Cα–Cβ). 50 The cutoff is applied to the distance between the two residues' Cα atoms: if this distance exceeds the cutoff, none of the orientations and distances are computed for that residue pair. (d) Seven reduced physicochemical properties for each residue, 51 including steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix probability, and sheet probability. (e) A basis set of 19 rigid-body blocks. 52 To construct our graph model, these features are embedded as node features and edge features.
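The pseudo-Cβ placement from the stated internal coordinates can be sketched with the standard NeRF construction; the bond length, plane angle, and dihedral follow the text, while the dihedral sign convention and function names are assumptions:

```python
import numpy as np

def place_atom(a, b, c, bond, angle_deg, dihedral_deg):
    """Place atom d from reference atoms a-b-c via internal coordinates
    (standard NeRF construction): |c-d| = bond, angle(b, c, d) = angle_deg,
    dihedral(a, b, c, d) = dihedral_deg."""
    angle = np.radians(angle_deg)
    dihedral = np.radians(dihedral_deg)
    bc = c - b
    bc /= np.linalg.norm(bc)
    n = np.cross(b - a, bc)          # normal of the a-b-c plane
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    # displacement in the local frame, then mapped to global coordinates
    d_local = bond * np.array([-np.cos(angle),
                               np.sin(angle) * np.cos(dihedral),
                               np.sin(angle) * np.sin(dihedral)])
    return c + d_local[0] * bc + d_local[1] * m + d_local[2] * n

def pseudo_cbeta(n_atom, ca, c):
    """Pseudo-Cbeta from backbone N, CA, C using the parameters in the text:
    1.55 A bond, 110.5-deg plane angle C-CA-CB, 122.55-deg dihedral N-C-CA-CB."""
    return place_atom(n_atom, c, ca, 1.55, 110.5, 122.55)
```

By construction the placed atom reproduces the requested bond length and plane angle exactly, regardless of the input backbone geometry.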

On the other hand, the stability of a protein is determined by its overall structure; hence the distance map, as a global constraint, provides insight into global influences. Notably, this differs from the graph representation above: in the graph model, only spatially neighboring residues are linked and only those links carry distance statistics, whereas the distance map records the distance of every residue pair.

Because only relative pairwise information is used, our model is rotationally and translationally invariant, which is important for protein tertiary structure reconstruction and the corresponding properties.

5.3. Methods

In our GNN module, we first construct the graph. Each node represents a residue centered at its Cα atom, and edges are defined by the distance between the Cα atoms of two residues. In protein graph modeling, the rotamer of a central residue is mostly affected by its neighbors, so a spatial graph convolutional neural network that aggregates information from neighboring nodes is a natural model framework. We tried distinct cutoff values from 10 to 12 Å; interestingly, larger cutoffs did not improve performance but increased runtime due to denser graphs, so we finally set the cutoff to 10 Å. A possible interpretation is that larger cutoffs merge more confounding information into the graph, increasing the discrimination difficulty. After defining the edges, the orientations and distances between pairwise backbone atoms are attached to each edge as extra information. The updates of node features and edge features are:

$$x_i^{t+1} = w_i^{1} x_i^{1} + w_i^{2} x_i^{2} + \cdots + w_i^{t} x_i^{t} + \sum_{j \in N_i} \Phi\left(x_j^{t} \,\Vert\, e_{ij}^{t}\right) w_j^{t}, \tag{5}$$
$$e_{ij}^{t+1} = \Psi\left(e_{ij}^{t} \,\Vert\, \left(x_j^{t+1} - x_i^{t+1}\right)\right), \tag{6}$$

where $x_i^{t+1}$ denotes node i at iteration t + 1, $N_i$ is the set of neighbors of node i, the $w_i$ are learnable parameters reflecting the importance of residues, $\Phi$ and $\Psi$ are learnable functions, and $e_{ij}$ is the edge feature. Here $\Phi$ and $\Psi$ are linear layers with ELU activation. Notably, we add the weighted information of node $x_i$ from all previous layers when updating it at time t + 1, which increases the capacity of message passing. In addition, we concatenate node and edge information so that their correlation is not lost during node updates, and we add the information differences, which reflect the flow from node i to j, in the edge update.
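As a concrete illustration of the graph construction (10 Å Cα cutoff) and one update in the spirit of Eqs. (5) and (6), here is a NumPy sketch; the weight shapes, the plain linear-plus-ELU forms of Φ and Ψ, and all names are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def elu(x):
    # ELU activation, used here for the learnable functions Phi and Psi
    return np.where(x > 0, x, np.exp(np.clip(x, None, 0.0)) - 1.0)

def build_graph(ca_coords, cutoff=10.0):
    """Neighbor lists from pairwise CA distances (10 A cutoff in the text)."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    adj = (d < cutoff) & ~np.eye(len(ca_coords), dtype=bool)
    return {i: np.nonzero(adj[i])[0].tolist() for i in range(len(ca_coords))}

def gnn_step(history, edge_feats, neighbors, layer_w, node_w, W_phi, W_psi):
    """One update in the spirit of Eqs. (5)-(6): each node sums its own
    weighted states from all previous layers plus neighbor messages
    Phi(x_j concatenated with e_ij) scaled by the neighbor weight; each
    edge is then refreshed from its old feature and the node difference."""
    x_prev = history[-1]
    x_new = sum(w * h for w, h in zip(layer_w, history))   # self-history term
    for i, nbrs in neighbors.items():
        for j in nbrs:
            msg = elu(np.concatenate([x_prev[j], edge_feats[(i, j)]]) @ W_phi)
            x_new[i] = x_new[i] + node_w[j] * msg
    e_new = {(i, j): elu(np.concatenate([e, x_new[j] - x_new[i]]) @ W_psi)
             for (i, j), e in edge_feats.items()}
    return x_new, e_new
```

Stacking several such steps, with `history` growing by one state per layer, mirrors the multi-layer aggregation described above.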

In the ResNet and self-attention modules, the paired node information output by the GNN module is filtered by 3 × 3 kernels in the ResNet module along with the distance map. The number of convolutional layers is set to 20; in each layer, the inputs come from the outputs of all previous layers. The activation is ELU, and normalization is applied before each activation layer. After the ResNet module we add a self-attention module, because different residues contribute differently to the target residue. To this end, a linear layer and a softmax function are applied to the output of the ResNet module to obtain a weight matrix, that is, the self-attention matrix, which is then reshaped and multiplied by the ResNet output. To capture information robustly, we use 2-head self-attention, corresponding to the workflow depicted in Figure 1c. After concatenating the outputs of the attention operation, a mean-pooling layer is applied along the specified axis, transferring the two-dimensional information to one dimension for better merging with the node information from the graph model.

The connections between the modules are as follows. The information of any two nodes output by the GNN module is concatenated as pairwise information and fed into the ResNet module along with the distance map. In parallel, the node outputs of the GNN module pass through the attention heads (indicated by the number of arrows in Figure 1), and all outputs from the different modules are concatenated at the node level. Finally, an MLP module operates on this information to obtain the predicted side-chain dihedral angles.

As each task is nearly equally important, the loss weights of the multiple tasks were simply set so that the initial losses on several random samples were nearly equal, without extensive tuning:

$$\text{Total loss} = \text{loss}_1 + \text{loss}_2 / 3.9 + \text{loss}_3 / 4, \tag{7}$$

where loss1, loss2, and loss3 are the losses for side-chain dihedral angle prediction, distance prediction of distal heavy atoms, and orientation prediction, respectively.

Moreover, dropout is employed as a regularization technique against overfitting, with a 30% drop rate in the final fully connected layer. The optimizer is Adam 53 with a learning rate of 0.001, and the loss function is cross-entropy.

AUTHOR CONTRIBUTIONS

Jiale Liu: Conceptualization (equal); investigation (lead); methodology (lead); software (lead); validation (equal); visualization (lead); writing – original draft (lead); writing – review and editing (equal). Changsheng Zhang: Formal analysis (equal); funding acquisition (equal); project administration (equal); supervision (equal); validation (equal); writing – review and editing (equal). Luhua Lai: Conceptualization (equal); formal analysis (equal); funding acquisition (lead); project administration (lead); supervision (lead); validation (equal); writing – review and editing (lead).

CONFLICT OF INTEREST

The authors declare no conflicts of interest.

Supporting information

Appendix S1 Supporting Information.

Liu J, Zhang C, Lai L. GeoPacker: A novel deep learning framework for protein side‐chain modeling. Protein Science. 2022;31(12):e4484. 10.1002/pro.4484

Review Editor: Nir Ben‐Tal

Funding information National Natural Science Foundation of China, Grant/Award Numbers: 22033001, 21977007

DATA AVAILABILITY STATEMENT

The data, source code, and pre-trained model are available on GitHub at https://github.com/PKUliujl/GeoPacker.

REFERENCES

  • 1. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three‐track neural network. Science. 2021;373:871–876.
  • 2. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589.
  • 3. Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A. 2020;117:1496–1503.
  • 4. Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate De novo prediction of protein contact map by ultra‐deep learning model. PLoS Comput Biol. 2017;13:e1005324.
  • 5. Ju F, Zhu J, Shao B, et al. CopulaNet: Learning residue co‐evolution directly from multiple sequence alignment for protein structure prediction. Nat Commun. 2021;12:2535.
  • 6. Xu J, McPartlon M, Li J. Improved protein structure prediction by deep learning irrespective of co‐evolution information. Nat Mach Intell. 2021;3:601–609.
  • 7. Chowdhury R, Bouatta N, Biswas S, et al. Single‐sequence protein structure prediction using language models from deep learning. bioRxiv. 2021. 10.1101/2021.08.02.454840.
  • 8. Wang W, Peng Z, Yang J. Single‐sequence protein structure prediction using supervised transformer protein language models. bioRxiv. 2022. 10.1101/2022.01.15.476476.
  • 9. Wang J, Cao H, Zhang JZ, et al. Computational protein design with deep learning neural networks. Sci Rep. 2018;8:1–9.
  • 10. Chen S, Sun Z, Lin LH, et al. To improve protein sequence profile prediction through image captioning on pairwise residue distance map. J Chem Inf Model. 2020;60:391–399.
  • 11. Qi Y, Zhang JZ. DenseCPD: Improving the accuracy of neural‐network‐based computational protein sequence design with DenseNet. J Chem Inf Model. 2020;60:1245–1252.
  • 12. Zhang Y, Chen Y, Wang C, et al. ProDCoNN: Protein design using a convolutional neural network. Proteins. 2020;88:819–829.
  • 13. Anishchenko I, Pellock SJ, Chidyausiku TM, et al. De novo protein design by deep network hallucination. Nature. 2021;600:547–552.
  • 14. Ingraham J, Garg V, Barzilay R, et al. Generative models for graph‐based protein design. Adv Neural Inf Process Syst. 2019;32:15820–15831.
  • 15. Karimi M, Zhu SW, Cao Y, Shen Y. De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks. J Chem Inf Model. 2020;60:5667–5681.
  • 16. Strokach A, Becerra D, Corbi‐Verge C, Perez‐Riba A, Kim PM. Fast and flexible protein design using deep graph neural networks. Cell Syst. 2020;11:402–411.
  • 17. Jing B, Eismann S, Suriana P, et al. Learning from protein structure with geometric vector perceptrons. arXiv. 2020. 10.48550/arXiv.2009.01411.
  • 18. Miao ZC, Cao Y. Quantifying side‐chain conformational variations in protein structure. Sci Rep. 2016;6:1–10.
  • 19. Liang S, Zheng D, Zhang C, Standley DM. Fast and accurate prediction of protein side‐chain conformations. Bioinformatics. 2011;27:2913–2914.
  • 20. Krivov GG, Shapovalov MV, Dunbrack RL Jr. Improved prediction of protein side‐chain conformations with SCWRL4. Proteins. 2009;77:778–795.
  • 21. Alford RF, Leaver‐Fay A, Jeliazkov JR, et al. The Rosetta all‐atom energy function for macromolecular modeling and design. J Chem Theory Comput. 2017;13:3031–3048.
  • 22. Huang X, Pearce R, Zhang Y. FASPR: An open‐source tool for fast and accurate protein side‐chain packing. Bioinformatics. 2020;36:3758–3765.
  • 23. Liang SD, Grishin NV. Side‐chain modeling with an optimized scoring function. Protein Sci. 2002;11:322–331.
  • 24. Liang SD, Zhou YQ, Grishin N, Standley DM. Protein side chain modeling with orientation‐dependent atomic force fields derived by series expansions. J Comput Chem. 2011;32:1680–1686.
  • 25. Peterson RW, Dutton PL, Wand AJ. Improved side‐chain prediction accuracy using an ab initio potential energy function and a very large rotamer library. Protein Sci. 2004;13:735–751.
  • 26. Huang X, Pearce R, Zhang Y. Toward the accuracy and speed of protein side‐chain packing: A systematic study on rotamer libraries. J Chem Inf Model. 2019;60:410–420.
  • 27. Davies A, Velickovic P, Buesing L, et al. Advancing mathematics by guiding human intuition with AI. Nature. 2021;600:70–74.
  • 28. Hermann J, Schatzle Z, Noe F. Deep‐neural‐network solution of the electronic Schrodinger equation. Nat Chem. 2020;12:891–897.
  • 29. Nagata K, Randall A, Baldi P. SIDEpro: A novel machine learning approach for the fast and accurate prediction of side‐chain conformations. Proteins. 2012;80:142–153.
  • 30. Xu G, Wang Q, Ma J. OPUS‐Rota3: Improving protein side‐chain modeling by deep neural networks and ensemble methods. J Chem Inf Model. 2020;60:6691–6697.
  • 31. Misiura M, Shroff R, Thyer R, Kolomeisky AB. DLPacker: Deep learning for prediction of amino acid side chain conformations in proteins. Proteins. 2022;90:1278–1290.
  • 32. Xu G, Wang Q, Ma J. OPUS‐Rota4: A gradient‐based protein side‐chain modeling framework assisted by deep learning‐based predictors. Brief Bioinform. 2022;23:bbab529.
  • 33. McPartlon M, Xu J. AttnPacker: An end‐to‐end deep learning method for rotamer‐free protein side‐chain packing. bioRxiv. 2022. 10.1101/2022.03.11.483812.
  • 34. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016; p. 770–778.
  • 35. Hiranuma N, Park H, Baek M, Anishchenko I, Dauparas J, Baker D. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat Commun. 2021;12:1–11.
  • 36. Adamczak R, Porollo A, Meller J. Accurate prediction of solvent accessibility using neural networks‐based regression. Proteins: Struct Funct Genet. 2004;56:753–767.
  • 37. Hubbard SJ, Thornton JM. "NACCESS," Computer program. Vol 2. Department of Biochemistry and Molecular Biology, University College London, London, 1993; p. 1.
  • 38. Dunbrack RL Jr, Karplus M. Backbone‐dependent rotamer library for proteins application to side‐chain prediction. J Mol Biol. 1993;230:543–574.
  • 39. Dunbrack RL Jr, Cohen FE. Bayesian statistical analysis of protein side‐chain rotamer preferences. Protein Sci. 1997;6:1661–1681.
  • 40. Kabsch W, Sander C. Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers. 1983;22:2577–2637.
  • 41. Zhou HY, Zhou YQ. Distance‐scaled, finite ideal‐gas reference state improves structure‐derived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11:2714–2726.
  • 42. Bolon DN, Marcus JS, Ross SA, Mayo SL. Prudent modeling of core polar residues in computational protein design. J Mol Biol. 2003;329:611–622.
  • 43. Anand N, Eguchi R, Mathews II, et al. Protein sequence design with a learned potential. Nat Commun. 2022;13:746.
  • 44. Liu Y, Zhang L, Wang W, et al. Rotamer‐free protein sequence design based on deep learning and self‐consistency. Nat Comput Sci. 2022;2:451–462. 10.1038/s43588-022-00273-6.
  • 45. Saunders CT, Baker D. Recapitulation of protein family divergence using flexible backbone protein design. J Mol Biol. 2005;346:631–644.
  • 46. Xu G, Ma TQ, Du JQ, et al. OPUS‐Rota2: An improved fast and accurate side‐chain modeling method. J Chem Theory Comput. 2019;15:5154–5160.
  • 47. Cao L, Goreshnik I, Coventry B, et al. De novo design of picomolar SARS‐CoV‐2 miniprotein inhibitors. Science. 2020;370:426–431.
  • 48. Cao L, Coventry B, Goreshnik I, et al. Design of protein‐binding proteins from the target structure alone. Nature. 2022;605:551–560.
  • 49. Wang GL, Dunbrack RL. PISCES: A protein sequence culling server. Bioinformatics. 2003;19:1589–1591.
  • 50. Tien MZ, Sydykova DK, Meyer AG, Wilke CO. PeptideBuilder: A simple python library to generate model peptides. PeerJ. 2013;1:e80.
  • 51. Meiler J, Muller M, Zeidler A, et al. Generation and evaluation of dimension‐reduced amino acid parameter representations by artificial neural networks. J Mol Model. 2001;7:360–369.
  • 52. Lu MY, Dousis AD, Ma JP. OPUS‐PSP: An orientation‐dependent statistical all‐atom potential derived from side‐chain packing. J Mol Biol. 2008;376:288–301.
  • 53. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv. 2014. 10.48550/arXiv.1412.6980.


