Communications Biology. 2023 Aug 25;6:876. doi: 10.1038/s42003-023-05133-1

Integration of pre-trained protein language models into geometric deep learning networks

Fang Wu 1, Lirong Wu 1, Dragomir Radev 2, Jinbo Xu 3,4, Stan Z Li 1
PMCID: PMC10457366  PMID: 37626165

Abstract

Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained due to the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown burgeoning capabilities with scale in a broad range of applications. Several preceding studies consider combining these different protein modalities to promote the representation power of geometric neural networks but fail to present a comprehensive understanding of their benefits. In this work, we integrate the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks and evaluate a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction. Our findings show an overall improvement of 20% over baselines. Strong evidence indicates that the incorporation of protein language models’ knowledge enhances geometric networks’ capacity by a significant margin and can be generalized to complex tasks.

Subject terms: Computational models, Proteins


Integrating knowledge from protein language models into geometric networks improves performance in tasks such as protein-protein interface prediction, model quality assessment, rigid-body docking, and binding affinity prediction.

Introduction

Macromolecules (e.g., proteins, RNAs, or DNAs) are essential to biophysical processes. While they can be represented using lower-dimensional representations such as linear sequences (1D) or chemical bond graphs (2D), a more intrinsic and informative form is the three-dimensional geometry1. 3D shapes are critical to not only understanding the physical mechanisms of action but also answering a number of questions associated with drug discovery and molecular design2. Consequently, tremendous efforts in structural biology have been devoted to deriving insights from their conformations3–5.

With the rapid advances of deep learning (DL) techniques, it has been an attractive challenge to represent and reason about macromolecules’ structures in the 3D space. In particular, different sorts of 3D information, including bond lengths and dihedral angles, play an essential role. In order to encode them, a number of 3D geometric graph neural networks (GGNNs) or CNNs6–9 have been proposed, and simultaneously achieve several crucial properties of Euclidean geometry such as E(3) or SE(3) equivariance and symmetry. Notably, they are essential constituents of geometric deep learning (GDL), an umbrella term that generalizes networks to Euclidean or non-Euclidean domains10.

Meanwhile, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. The abundance of 1D amino acid sequences has spurred increasing interest in developing protein language models at the scale of evolution, such as the series of ESM11–13 and ProtTrans14. These protein language models can capture information about secondary and tertiary structures and can be generalized across a broad range of downstream applications. To be explicit, they have recently been demonstrated with strong capabilities in uncovering protein structures12, predicting the effect of sequence variation on function11, learning inverse folding15 and many other general purposes13.

With the fruitful progress in protein language models, more and more studies have considered enhancing GGNNs’ ability by leveraging the knowledge of those protein language models12,16,17. This is nontrivial because, compared to sequences, 3D structures are much harder to obtain and thus less prevalent. Consequently, learning from protein structures must make do with far less training data. For example, the SAbDab database18 has merely 3K antibody-antigen structures without duplicates. The SCOPe database19 has 226K annotated structures, and the SIFTS database20 comprises around 220K annotated enzyme structures. These numbers are orders of magnitude lower than the data set sizes that can inspire major breakthroughs in the deep learning community. In contrast, while the Protein Data Bank (PDB)21 possesses approximately 182K macromolecule structures, databases like Pfam22 and UniParc23 contain more than 47M and 250M protein sequences, respectively.

In addition to the data size, the benefit of protein sequence to structure learning also has solid evidence and theoretical support. Remarkably, the idea that biological function and structures are documented in the statistics of protein sequences selected through evolution has a long history24. The unobserved variables that decide a protein’s fitness, including structure, function, and stability, leave a record in the distribution of observed natural sequences25. Those protein language models use self-supervision to unlock the information encoded in protein sequence variations, which is also beneficial for GGNNs. Accordingly, in this paper, we comprehensively investigate the promotion of GGNNs’ capability with the knowledge learned by protein language models (see Fig. 1). The improvements come from two major lines. Firstly, GGNNs can benefit from the information that emerges in the learned representations of those protein language models on fundamental properties of proteins, including secondary structures, contacts, and biological activity. This kind of knowledge may be difficult for GGNNs to be aware of and learn in a specific downstream task. To confirm this claim, we conduct a toy experiment to demonstrate that conventional graph connectivity mechanisms prevent existing GGNNs from being cognizant of residues’ absolute and relative positions in the protein sequence. Secondly and more intuitively, protein language models serve as an alternative way of enriching GGNNs’ training data and allow GGNNs to be exposed to more different families of proteins, thereby greatly strengthening GGNNs’ generalization capability.

Fig. 1. Illustration of our framework to strengthen GGNNs with knowledge of protein language models.

The protein sequence is first forwarded into a pretrained protein language model to extract per-residue representations, which are then used as node features in 3D protein graphs for GGNNs.

We examine our hypothesis across a wide range of benchmarks, including model quality assessment, protein-protein interface prediction, protein-protein rigid-body docking, and ligand binding affinity prediction. Extensive experiments show that incorporating pretrained protein language models’ knowledge significantly improves GGNNs’ performance on various problems that require distinct domain knowledge. By utilizing the unprecedented view into the language of protein sequences provided by powerful protein language models, GGNNs promise to augment our understanding of a vast database of poorly understood protein structures. We hope this work sheds more light on how to bridge the gap between the thriving geometric deep learning and mature protein language models and better leverage different modalities of proteins.

Results and discussion

Our toy experiments illustrate that existing GGNNs are unaware of the positional order inside the protein sequences. Taking a step further, we show in this section that incorporating knowledge learned by large-scale protein language models can robustly enhance GGNNs’ capacity in a wide variety of downstream tasks.

Tasks and datasets

  • Model Quality Assessment (MQA) aims to select the best structural model of a protein from a large pool of candidate structures and is an essential step in structure prediction26. For a number of recently solved but unreleased structures, structure generation programs produce a large number of candidate structures. MQA approaches are evaluated by their capability of predicting the global distance test (GDT-TS score) of a candidate structure compared to the experimentally solved structure of that target. Its database is composed of all structural models submitted to the Critical Assessment of Structure Prediction (CASP)27 over the last 18 years. The data is split temporally by competition year. MQA is similar to the Protein Structure Ranking (PSR) task introduced by Townshend et al.2.

  • Protein-protein Rigid-body Docking (PPRD) computationally predicts the 3D structure of a protein-protein complex from the individual unbound structures. It assumes that no conformation change within the proteins happens during binding. We leverage Docking Benchmark 5.5 (DB5.5)28 as the database. It is a gold standard dataset in terms of data quality and contains 253 structures.

  • Protein-protein Interface (PPI) investigates whether two amino acids will contact when their respective proteins bind. It is an important problem in understanding how proteins interact with each other, e.g., antibody proteins recognize diseases by binding to antigens. We use the Database of Interacting Protein Structures (DIPS), a comprehensive dataset of protein complexes mined from the PDB29, and randomly select 15K samples for evaluation.

  • Ligand Binding Affinity (LBA) is an essential task for drug discovery applications. It predicts the strength of a candidate drug molecule’s interaction with a target protein. Specifically, we aim to forecast $pK = -\log_{10} K$, where $K$ is the binding affinity in Molar units. We use the PDBbind database30,31, a curated database containing protein-ligand complexes from the PDB and their corresponding binding strengths. The protein-ligand complexes are split such that no protein in the test dataset has more than 30% or 60% sequence identity with any protein in the training dataset.
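
As a small worked example of the LBA regression target, the sketch below converts a binding constant in molar units to pK, assuming the usual convention pK = -log10(K). The function name `pk_from_affinity` is hypothetical and is not taken from the paper's code.

```python
import math


def pk_from_affinity(k_molar: float) -> float:
    """Return pK = -log10(K) for a binding constant K given in molar units.

    Assumption: the negative-log convention, so stronger binders
    (smaller K) receive larger pK values.
    """
    if k_molar <= 0:
        raise ValueError("K must be a positive concentration in mol/L")
    return -math.log10(k_molar)


# A 1 nM binder (K = 1e-9 M) and a 1 uM binder (K = 1e-6 M).
nanomolar_pk = pk_from_affinity(1e-9)
micromolar_pk = pk_from_affinity(1e-6)
print(nanomolar_pk, micromolar_pk)
```

Under this convention a thousand-fold tighter binder differs by three pK units, which is why errors on this task are reported as RMSD in pK units rather than in concentrations.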

Experimental setup

We evaluate our proposed framework on several state-of-the-art geometric networks, using PyTorch32 and PyG33, on four standard protein benchmarks. For MQA, PPI, and LBA, we use GVP-GNN, EGNN, and Molformer as backbones. For PPRD, we utilize a deep learning model, EquiDock34, as the backbone. It approximates the binding pockets and obtains the docking poses using keypoint matching and alignment. For more experimental details, please refer to Supplementary Note 3.

Single-protein representation task

For MQA, we document First Rank Loss, Spearman correlation (RS), Pearson’s correlation (RP), and Kendall rank correlation (KR) in Table 1. Introducing protein language models brings a significant average increase of 32.63% and 55.71% in global and mean RS, of 34.66% and 58.75% in global and mean RP, and of 43.21% and 63.20% in global and mean KR, respectively. With the aid of language models, GVP-GNN achieves the best global RS, global RP, and global KR of 84.92%, 85.44%, and 67.98%, respectively.
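
The distinction between "global" and "mean" correlations can be illustrated with a short sketch. It assumes, following common practice for this benchmark rather than an explicit definition in this text, that the global score pools all candidate structures across targets, while the mean score is computed per target and then averaged. All function names and the toy records are hypothetical.

```python
from collections import defaultdict


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def global_and_mean_spearman(records):
    """records: iterable of (target_id, predicted_score, true_score)."""
    records = list(records)
    global_rs = spearman([p for _, p, _ in records], [t for _, _, t in records])
    by_target = defaultdict(list)
    for target, pred, true in records:
        by_target[target].append((pred, true))
    per_target = [
        spearman([p for p, _ in pairs], [t for _, t in pairs])
        for pairs in by_target.values()
    ]
    return global_rs, sum(per_target) / len(per_target)


# Toy data: two targets whose candidates occupy different score ranges.
toy_records = [
    ("T1", 0.9, 0.80), ("T1", 0.7, 0.85), ("T1", 0.5, 0.60),
    ("T2", 0.4, 0.30), ("T2", 0.3, 0.35), ("T2", 0.1, 0.20),
]
global_rs, mean_rs = global_and_mean_spearman(toy_records)
print(global_rs, mean_rs)
```

In this toy case the pooled score is higher than the per-target average because separating easy targets from hard ones already earns rank agreement. The same effect may explain why the global columns of Table 1 exceed the mean columns.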

Table 1.

Results on MQA.

| Model | PLM | First rank loss | Spearman (mean) | Spearman (global) | Pearson (mean) | Pearson (global) | Kendall (mean) | Kendall (global) |
|---|---|---|---|---|---|---|---|---|
| GVP-GNN | ✗ | 0.085 ± 0.002 | 0.4144 ± 0.010 | 0.6910 ± 0.008 | 0.5235 ± 0.013 | 0.6875 ± 0.006 | 0.2960 ± 0.010 | 0.4959 ± 0.004 |
| GVP-GNN | ✓ | 0.033 ± 0.001 | 0.6121 ± 0.017 | 0.8492 ± 0.015 | 0.7399 ± 0.017 | 0.8544 ± 0.009 | 0.4530 ± 0.008 | 0.6798 ± 0.014 |
| EGNN | ✗ | 0.054 ± 0.003 | 0.4249 ± 0.016 | 0.7341 ± 0.015 | 0.5315 ± 0.008 | 0.7336 ± 0.018 | 0.3004 ± 0.013 | 0.5344 ± 0.011 |
| EGNN | ✓ | 0.041 ± 0.001 | 0.5642 ± 0.013 | 0.8436 ± 0.012 | 0.6925 ± 0.006 | 0.8456 ± 0.015 | 0.4105 ± 0.014 | 0.6558 ± 0.006 |
| Molformer | ✗ | 0.149 ± 0.003 | 0.1238 ± 0.011 | 0.3921 ± 0.004 | 0.1969 ± 0.004 | 0.3901 ± 0.012 | 0.0841 ± 0.010 | 0.2696 ± 0.005 |
| Molformer | ✓ | 0.088 ± 0.002 | 0.2424 ± 0.015 | 0.6516 ± 0.009 | 0.3850 ± 0.011 | 0.6210 ± 0.014 | 0.1681 ± 0.012 | 0.4579 ± 0.007 |

The column of ’PLM’ indicates whether the protein language model is used. The First Rank Loss is the average difference between the true scores of the best model and the top-ranked model for each target. Results are reported with mean ± standard deviation over three repeated runs and the best performance is in bold.
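
The First Rank Loss defined in the caption above can be sketched as follows. This is a minimal illustration, assuming that "top-ranked" means the candidate with the highest predicted score and that higher true scores (e.g., GDT-TS) are better. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def first_rank_loss(records):
    """Average, over targets, of (best true score) - (true score of the
    candidate that the model ranks first).

    records: iterable of (target_id, predicted_score, true_score).
    """
    by_target = defaultdict(list)
    for target, pred, true in records:
        by_target[target].append((pred, true))
    losses = []
    for candidates in by_target.values():
        best_true = max(true for _, true in candidates)
        # True score of the candidate with the highest predicted score.
        top_ranked_true = max(candidates, key=lambda c: c[0])[1]
        losses.append(best_true - top_ranked_true)
    return sum(losses) / len(losses)


# Toy data: in each target the model's top pick is slightly worse than the best candidate.
frl_records = [
    ("T1", 0.9, 0.80), ("T1", 0.7, 0.85), ("T1", 0.5, 0.60),
    ("T2", 0.4, 0.30), ("T2", 0.3, 0.35), ("T2", 0.1, 0.20),
]
toy_loss = first_rank_loss(frl_records)
print(toy_loss)
```

A loss of zero means the model always places the truly best candidate first, so lower values are better, in contrast to the correlation columns.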

Apart from that, we provide a full comparison with existing approaches in Table 2, where the second best is underlined. We select RWplus35, ProQ3D36, VoroMQA37, SBROD38, 3DCNN2, 3DGNN2, 3DOCNN39, DimeNet40, GraphQA41, and GBPNet42 as the baselines. Even though GVP-GNN is not the best architecture on its own, once enhanced by the protein language model it outperforms existing methods on most metrics, including the state-of-the-art no-pretraining method of Aykent and Xia42 (i.e., GBPNet) and the state-of-the-art pretraining results of Jing et al.43.

Table 2.

Comparison of performance on MQA.

| Model | PLM | Spearman (mean) | Spearman (global) | Pearson (mean) | Pearson (global) | Kendall (mean) | Kendall (global) |
|---|---|---|---|---|---|---|---|
| RWplus35,a | | 0.167 | 0.056 | 0.192 | 0.033 | 0.137 | 0.011 |
| ProQ3D36,a | | 0.432 | 0.772 | 0.444 | 0.796 | 0.304 | 0.594 |
| VoroMQA37,a | | 0.419 | 0.651 | 0.412 | 0.651 | 0.291 | 0.505 |
| SBROD38,a | | 0.413 | 0.569 | 0.431 | 0.551 | 0.291 | 0.393 |
| 3DOCNN39,b | | 0.432 | 0.796 | 0.444 | 0.772 | 0.304 | 0.594 |
| DimeNet40,a | | 0.351 | 0.625 | 0.302 | 0.614 | 0.285 | 0.431 |
| 3DCNN2,b | | 0.431 ± 0.013 | 0.789 ± 0.017 | 0.557 ± 0.011 | 0.780 ± 0.016 | 0.308 ± 0.010 | 0.592 ± 0.016 |
| 3DGNN2,b | | 0.411 ± 0.006 | 0.750 ± 0.018 | 0.500 ± 0.012 | 0.747 ± 0.018 | 0.278 ± 0.005 | 0.547 ± 0.016 |
| GVP-GNN7,c | ✗ | 0.414 ± 0.010 | 0.691 ± 0.008 | 0.523 ± 0.013 | 0.687 ± 0.006 | 0.296 ± 0.010 | 0.495 ± 0.004 |
| GraphQA41,a | | 0.379 | 0.820 | 0.357 | 0.821 | 0.331 | 0.618 |
| GBPNet42,a | | 0.517 | 0.856 | 0.612 | 0.853 | 0.372 | 0.656 |
| GVP-GNN | ✓ | 0.612 ± 0.017 | 0.849 ± 0.015 | 0.739 ± 0.017 | 0.854 ± 0.009 | 0.453 ± 0.008 | 0.679 ± 0.014 |

Models are sorted by the year they are released. Results are reported with mean ± standard deviation over three repeated runs and the best and second best performance are bolded and underlined, respectively.

a These results are taken from ref. 42.

b These results are taken from ref. 2.

c These results are re-produced.

Protein-protein representation tasks

For PPRD, we report three measurements in Table 3: the complex root mean squared deviation (RMSD), the ligand RMSD, and the interface RMSD. The interface is determined with a distance threshold of 8 Å. It is noteworthy that, unlike the EquiDock paper, we do not apply the Kabsch algorithm to superimpose the receptor and the ligand. Instead, the receptor protein is fixed during evaluation. All three metrics decrease considerably, with improvements of 11.61%, 12.83%, and 31.01% in complex, ligand, and interface median RMSD, respectively. Notably, we also report the result of EquiDock that is first pretrained on DIPS and then fine-tuned on DB5.5. DIPS-pretrained EquiDock still performs worse than EquiDock equipped with pretrained language models. This suggests that structural pretraining may not benefit GGNNs more than pretrained protein language models do.

Table 3.

Performance of PPRD on DB5.5 Test Set.

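| Model | PLM | Complex RMSD (median) | Complex RMSD (mean) | Complex RMSD (std) | Ligand RMSD (median) | Ligand RMSD (mean) | Ligand RMSD (std) | Interface RMSD (median) | Interface RMSD (mean) | Interface RMSD (std) |
|---|---|---|---|---|---|---|---|---|---|---|
| EquiDock | ✗♣ | 16.88 | 17.11 | 5.33 | 40.35 | 37.97 | 12.94 | 16.19 | 37.97 | 4.47 |
| EquiDock | ✗♠ | 15.02 | 14.31 | 5.28 | 36.82 | 35.95 | 13.18 | 14.37 | 35.68 | 4.12 |
| EquiDock | ✓♣ | 14.92 | 13.14 | 4.59 | 35.17 | 33.48 | 14.34 | 11.17 | 33.48 | 4.38 |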

Models with ♣ are directly trained and tested on DB5.5, while EquiDock with ♠ is first pretrained on DIPS and fine-tuned on the DB5.5 training set. Results are reported with mean ± standard deviation over three repeated runs and the best performance is in bold.
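
The docking evaluation with a fixed receptor and no Kabsch superimposition can be sketched as below. This is an illustration under stated assumptions, not the evaluation code used for Table 3: it treats each structure as a list of 3D points (for example one point per residue), and it selects the interface from the ground-truth complex with a strict 8 Å cutoff. The helper names are hypothetical.

```python
import math


def rmsd(pred, true):
    """RMSD between two equally ordered coordinate lists, without superimposition."""
    if len(pred) != len(true) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, t) ** 2 for p, t in zip(pred, true))
    return math.sqrt(squared / len(pred))


def interface_indices(receptor, ligand, cutoff=8.0):
    """Indices of receptor and ligand points lying within `cutoff` of the other chain."""
    rec = [i for i, r in enumerate(receptor)
           if any(math.dist(r, l) < cutoff for l in ligand)]
    lig = [j for j, l in enumerate(ligand)
           if any(math.dist(r, l) < cutoff for r in receptor)]
    return rec, lig


# Toy complex: the receptor stays fixed, the predicted ligand is shifted by (3, 4, 0).
receptor = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
ligand_true = [(5.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
ligand_pred = [(8.0, 4.0, 0.0), (33.0, 4.0, 0.0)]

ligand_rmsd = rmsd(ligand_pred, ligand_true)
complex_rmsd = rmsd(receptor + ligand_pred, receptor + ligand_true)

rec_idx, lig_idx = interface_indices(receptor, ligand_true)
interface_rmsd = rmsd(
    [receptor[i] for i in rec_idx] + [ligand_pred[j] for j in lig_idx],
    [receptor[i] for i in rec_idx] + [ligand_true[j] for j in lig_idx],
)
print(complex_rmsd, ligand_rmsd, interface_rmsd)
```

Because the receptor contributes zero error when it is held fixed, the complex RMSD is always smaller than the ligand RMSD in this setting, which matches the ordering of the columns in Table 3.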

For PPI, we record AUROC as the metric in Fig. 2a. AUROC increases by 6.93%, 14.01%, and 22.62% for GVP-GNN, EGNN, and Molformer, respectively. It is worth noting that Molformer originally falls behind EGNN and GVP-GNN in this task. After injecting knowledge learned by protein language models, however, Molformer achieves competitive or even better performance than EGNN or GVP-GNN. This indicates that protein language models can help GGNNs realize their potential and greatly narrow the gap between different geometric deep learning architectures. These results are notable because, unlike MQA, PPRD and PPI study the geometric interactions between two proteins. Though existing protein language models are all trained on single protein sequences, our experiments show that the evolutionary information hidden in unpaired sequences can also be valuable for analyzing the multi-protein environment.
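
For readers unfamiliar with the PPI metric, the sketch below computes AUROC from its rank-based definition: the probability that a randomly chosen contacting residue pair receives a higher score than a randomly chosen non-contacting pair. It is a plain quadratic-time illustration with a hypothetical function name and toy data, not the evaluation code behind Fig. 2.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy residue pairs: 1 = in contact, 0 = not in contact.
pair_labels = [1, 0, 1, 0, 0]
pair_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auroc = auroc(pair_labels, pair_scores)
print(toy_auroc)
```

A value of 0.5 corresponds to random scoring and 1.0 to perfect separation, so the metric is insensitive to the heavy class imbalance between contacting and non-contacting pairs.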

Fig. 2. Some ablation studies.

a Results of PPI with and without PLMs. b Performance of GGNNs on MQA with ESM-2 at different scales.

Protein-molecules representation task

For LBA, we compare RMSD, RS, RP, and KR in Table 4. The incorporation of protein language models produces a remarkable average decline of 11.26% and 6.15% in RMSD for the 30% and 60% identity splits, an average increase of 51.09% and 9.52% in RP, an average increase of 66.60% and 8.90% in RS, and an average increase of 68.52% and 6.70% in KR. The improvements under the 30% sequence identity split are larger than those under the less restrictive 60% split. This confirms that protein language models benefit GGNNs more when the unseen samples belong to different protein domains. Moreover, in contrast to PPRD and PPI, LBA studies how proteins interact with small molecules. Our outcome demonstrates that rich protein representations encoded by protein language models can also contribute to the analysis of a protein’s interaction with non-protein drug-like molecules. The result of a different data split has been placed in Supplementary Table 1.

Table 4.

Results on LBA.

Sequence identity (30%)

| Model | PLM | RMSD | Pearson’s correlation | Spearman correlation | Kendall rank |
|---|---|---|---|---|---|
| GVP-GNN | ✗ | 1.6480 ± 0.014 | 0.2138 ± 0.013 | 0.1648 ± 0.009 | 0.1107 ± 0.012 |
| GVP-GNN | ✓ | 1.4556 ± 0.011 | 0.5373 ± 0.010 | 0.5078 ± 0.005 | 0.3495 ± 0.009 |
| EGNN | ✗ | 1.4929 ± 0.012 | 0.4891 ± 0.017 | 0.4725 ± 0.008 | 0.3291 ± 0.014 |
| EGNN | ✓ | 1.4033 ± 0.013 | 0.5655 ± 0.016 | 0.5448 ± 0.005 | 0.3790 ± 0.007 |
| Molformer | ✗ | 1.9107 ± 0.018 | 0.4618 ± 0.014 | 0.4104 ± 0.011 | 0.2812 ± 0.019 |
| Molformer | ✓ | 1.6028 ± 0.020 | 0.5351 ± 0.017 | 0.5372 ± 0.015 | 0.3758 ± 0.016 |

Sequence identity (60%)

| Model | PLM | RMSD | Pearson’s correlation | Spearman correlation | Kendall rank |
|---|---|---|---|---|---|
| GVP-GNN | ✗ | 1.5438 ± 0.015 | 0.6608 ± 0.012 | 0.6668 ± 0.0010 | 0.4797 ± 0.014 |
| GVP-GNN | ✓ | 1.5137 ± 0.019 | 0.6680 ± 0.010 | 0.6716 ± 0.008 | 0.4786 ± 0.012 |
| EGNN | ✗ | 1.5928 ± 0.020 | 0.6274 ± 0.013 | 0.6271 ± 0.017 | 0.4498 ± 0.014 |
| EGNN | ✓ | 1.5595 ± 0.022 | 0.6445 ± 0.015 | 0.6463 ± 0.019 | 0.4656 ± 0.019 |
| Molformer | ✗ | 1.8610 ± 0.018 | 0.5528 ± 0.016 | 0.5309 ± 0.015 | 0.3738 ± 0.017 |
| Molformer | ✓ | 1.5926 ± 0.024 | 0.6524 ± 0.018 | 0.6528 ± 0.016 | 0.4367 ± 0.011 |

Results are reported with mean ± standard deviation over three repeated runs and the best performance is in bold.
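
The average improvements quoted for LBA can be checked directly against Table 4. The sketch below assumes that the reported figure is the unweighted mean of the per-model relative changes, and it reads the Molformer baseline RMSD as 1.9107. Under those two assumptions it reproduces the 11.26% average RMSD decline for the 30% identity split.

```python
# (without PLM, with PLM) RMSD at 30% sequence identity, copied from Table 4.
rmsd_30 = {
    "GVP-GNN": (1.6480, 1.4556),
    "EGNN": (1.4929, 1.4033),
    "Molformer": (1.9107, 1.6028),
}


def relative_change(before, after):
    """Signed relative change in percent; negative means a decrease."""
    return 100.0 * (after - before) / before


per_model = {name: relative_change(b, a) for name, (b, a) in rmsd_30.items()}
average_decline = -sum(per_model.values()) / len(per_model)

for name, change in per_model.items():
    print(f"{name}: {change:.2f}%")
print(f"average decline: {average_decline:.2f}%")
```

The per-model values differ widely (roughly 6% for EGNN against roughly 16% for Molformer), so the average mainly reflects how much weaker the baseline architecture was to begin with.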

In addition, we compare thoroughly with existing approaches for LBA in Table 5, where the second best is underlined. We select a broad range of models including DeepAffinity44, Cormorant45, LSTM46, TAPE47, ProtTrans14, 3DCNN2, GNN2, MaSIF48, DGAT49, DGIN49, DGAT-GCN49, HoloProt50, and GBPNet42 as baselines. Even though EGNN is a mid-level architecture, it achieves the best RMSD and the best Pearson’s correlation when enhanced by protein language models, beating a group of strong baselines including HoloProt50 and GBPNet42.

Table 5.

Comparison of performance on LBA.

Ligand binding affinity (Sequence identity = 30%)

| Model | PLM | RMSD | Pearson’s correlation | Spearman correlation | Kendall rank |
|---|---|---|---|---|---|
| DeepAffinity44,a | | 1.893 ± 0.650 | 0.415 | 0.426 | – |
| Cormorant45,b | | 1.568 ± 0.012 | 0.389 | 0.408 | – |
| LSTM46,c | | 1.985 ± 0.006 | 0.165 ± 0.006 | 0.152 ± 0.024 | – |
| TAPE47,c | | 1.890 ± 0.035 | 0.338 ± 0.044 | 0.286 ± 0.124 | – |
| ProtTrans14,c | | 1.544 ± 0.015 | 0.438 ± 0.053 | 0.434 ± 0.058 | – |
| 3DCNN2,a | | 1.414 ± 0.021 | 0.550 | 0.553 | – |
| GNN2,a | | 1.570 ± 0.025 | 0.545 | 0.533 | – |
| MaSIF48,c | | 1.484 ± 0.018 | 0.467 ± 0.020 | 0.455 ± 0.014 | – |
| DGAT49,b | | 1.719 ± 0.047 | 0.464 | 0.472 | – |
| DGIN49,b | | 1.765 ± 0.076 | 0.426 | 0.432 | – |
| DGAT-GCN49,b | | 1.550 ± 0.017 | 0.498 | 0.496 | – |
| GVP-GNN7,d | ✗ | 1.648 ± 0.014 | 0.213 ± 0.013 | 0.164 ± 0.009 | 0.110 ± 0.012 |
| EGNN58,d | ✗ | 1.492 ± 0.012 | 0.489 ± 0.017 | 0.472 ± 0.008 | 0.329 ± 0.014 |
| HoloProt50,c | | 1.464 ± 0.006 | 0.509 ± 0.002 | 0.500 ± 0.005 | – |
| GBPNet42,b | | 1.405 ± 0.009 | 0.561 | 0.557 | – |
| EGNN | ✓ | 1.403 ± 0.013 | 0.565 ± 0.016 | 0.544 ± 0.005 | 0.379 ± 0.007 |

Models are sorted by the year they are released. Results are reported with mean ± standard deviation over three repeated runs and the best and second best performance are bolded and underlined, respectively.

a These results are taken from ref. 2.

b These results are taken from ref. 42.

c These results are copied from ref. 50.

d These results are re-produced.

Scale and type of protein language models

It has been observed that as the size of the language model increases, there are consistent improvements in tasks like structure prediction12. Here we conduct an ablation study to investigate the effect of protein language models’ sizes on GGNNs. Specifically, we explore ESM-2 models with 8M, 35M, 150M, 650M, and 3B parameters and plot the results in Fig. 2b. The results verify that scaling the protein language model is advantageous for GGNNs. Additional results can be found in Supplementary Note 4. We also compare the influence of different kinds of PLMs in Supplementary Table 2. Besides that, we investigate how PLMs’ effectiveness differs with and without MSA in Supplementary Table 3.

Limitations

Despite our successful confirmation that PLMs can promote geometric deep learning, several limitations and extensions of our framework are left open for future investigation. For instance, our 3D protein graphs are residue-level. We believe atom-level protein graphs can also benefit from our approach, but the size of the performance gain needs further exploration.

Conclusion

In this study, we investigate a problem that has long been ignored by existing geometric deep learning methods for proteins: how to employ the abundant protein sequence data for 3D geometric representation learning. To answer this question, we propose to leverage the knowledge learned by existing advanced pre-trained protein language models and use their amino acid representations as the initial features. We conduct a variety of experiments, such as protein-protein docking and model quality assessment, to demonstrate the efficacy of our approach. Our work provides a simple but effective mechanism to bridge the gap between 1D sequential models and 3D geometric neural networks, and we hope it sheds light on how to combine information encoded in different protein modalities.

Method

Sequence recovery analysis

Preliminary and motivations

It is commonly acknowledged that protein structures carry much more information than their corresponding amino acid sequences. For decades, predicting a protein’s structure from its amino acid sequence has been an open challenge for computational biologists. Though the advancement of AlphaFold (AF)51 and RoseTTAFold52 has made a huge step in alleviating the limitation brought by the number of available experimentally determined protein structures, neither AF nor its successors such as AlphaFold-Multimer53, IgFold54, and HelixFold55 is a panacea. Their predicted structures can be severely inaccurate when the protein is an orphan and lacks a multiple sequence alignment (MSA) as the template. Consequently, it is hard to conclude that protein sequences can be perfectly transformed to the structure modality by current tools and be used as extra training resources for GGNNs.

Moreover, we argue that even if conformation is a higher-dimensional representation, the prevailing learning paradigm may prevent GGNNs from capturing the knowledge that is uniquely preserved in protein sequences. Recall that GGNNs differ mainly in how they employ 3D geometries: their input features include distances56, angles40, torsions, and terms of other orders57. The position index hidden in protein sequences, however, is usually neglected when constructing 3D graphs for GGNNs. Therefore, in this section, we design a toy trial to examine whether GGNNs can succeed in recovering this kind of positional information.

Protein graph construction

Here the structure of a protein can be represented as an atom-level or residue-level graph $G = (V, E)$, where $V$ and $E = (e_{ij})$ correspond to the set of $N$ nodes and $M$ edges, respectively. Nodes have their 3D coordinates $x \in \mathbb{R}^{N \times 3}$ and the initial $\psi_h$-dimensional roto-translationally invariant features $h \in \mathbb{R}^{N \times \psi_h}$ (e.g., atom types and electronegativity, residue classes). Normally, there are three options to construct connectivity for molecules: r-ball graphs, fully-connected (FC) graphs, and K-nearest neighbors (KNN) graphs. In our setting, nodes are linked to their K = 10 nearest neighbors for KNN graphs, and edges include all atom pairs within a distance cutoff of 8 Å for r-ball graphs.
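
The three connectivity options can be sketched in a few lines of plain Python. This is an illustration only: the function names are hypothetical, edges are stored as directed index pairs, the cutoff is applied as a strict inequality, and the toy coordinates place residues on a straight line 3.8 Å apart, which is a simplification rather than a real structure. Note that none of the three constructions looks at the sequence index of a node.

```python
import math


def knn_edges(coords, k=10):
    """Directed edges (i, j) where j is one of the k nearest neighbors of i."""
    edges = []
    for i, ci in enumerate(coords):
        others = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        edges.extend((i, j) for _, j in others[:k])
    return edges


def r_ball_edges(coords, cutoff=8.0):
    """Directed edges between all distinct pairs closer than `cutoff` angstroms."""
    return [
        (i, j)
        for i, ci in enumerate(coords)
        for j, cj in enumerate(coords)
        if i != j and math.dist(ci, cj) < cutoff
    ]


def fully_connected_edges(n_nodes):
    """Directed edges between every ordered pair of distinct nodes."""
    return [(i, j) for i in range(n_nodes) for j in range(n_nodes) if i != j]


# Toy chain of six residues spaced 3.8 angstroms apart along the x axis.
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]

knn = knn_edges(toy_coords, k=2)
ball = r_ball_edges(toy_coords, cutoff=8.0)
full = fully_connected_edges(len(toy_coords))
print(len(knn), len(ball), len(full))
```

The edge count grows as N·K for KNN graphs but as N(N−1) for FC graphs, which is the redundancy that becomes costly for proteins with thousands of residues.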

Recovery from graphs to sequences

Since most prior studies choose to establish 3D protein graphs based on purely geometric information and ignore their sequential identities, it provokes the following position identity question:

Can existing GGNNs identify the sequential position order only from geometric structures of proteins?

To answer this question, we formulate two categories of toy tasks (see Fig. 3). The first one is absolute position recognition (APR), which is a classification task. Models are asked to directly predict the position index of each residue, ranging from 1 to N, where N is the number of residues in the protein. This task adopts accuracy as the metric and expects models to discriminate the absolute position of the amino acid within the whole protein sequence. We compute the distribution of the protein sequence lengths in Supplementary Fig. 1.

Fig. 3. Illustration of the sequence recovery problem.

a Protein residue graph construction. Here we draw graphs in 2D for better visualization but study 3D graphs for GGNNs. b Two sequence recovery tasks. The first requires GGNNs to predict the absolute position index for each residue in the protein sequence. The second aims to forecast the minimum distance of each amino acid to the two sides of the protein sequence.

In addition, we propose a second task named relative position estimation (RPE), which focuses on the relative position of each residue. Models are required to predict the minimum distance of each residue to the two ends of the given protein, and the root mean squared error (RMSE) is used as the metric. This task examines the capability of GGNNs to distinguish which segment the amino acid belongs to (i.e., the center section of the protein or the end of the protein).
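
The targets of the two toy tasks can be written down explicitly. The sketch below assumes 1-based position indices and measures the RPE target in number of residues, so that the two terminal residues receive a target of 0. Whether the original implementation counts from 0 or 1 is not stated here, and the function names are hypothetical.

```python
def apr_labels(n_residues):
    """Absolute position recognition: class label i for the i-th residue (1..N)."""
    return list(range(1, n_residues + 1))


def rpe_labels(n_residues):
    """Relative position estimation: distance from each residue to the nearer end."""
    return [min(i - 1, n_residues - i) for i in range(1, n_residues + 1)]


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return (squared / len(targets)) ** 0.5


apr = apr_labels(7)
rpe = rpe_labels(7)
# A constant predictor that always answers 0 for a 7-residue protein.
baseline_rmse = rmse([0] * 7, rpe)
print(apr, rpe, baseline_rmse)
```

Both targets depend only on the sequence index, so a model that receives purely geometric inputs has to infer the chain order from 3D coordinates alone.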

Experiments

Backbones

We adopt three technically distinct and broadly accepted architectures of GGNNs for empirical verification. To be specific, GVP-GNN7,43 extends standard dense layers to operate on collections of Euclidean vectors, performing both geometric and relational reasoning on efficient representations of macromolecules. EGNN58 is a translation, rotation, reflection, and permutation equivariant GNN without expensive spherical harmonics. Molformer9 employs the self-attention mechanism for 3D point clouds while guaranteeing SE(3)-equivariance.

Dataset

We exploit a small non-redundant subset of high-resolution structures from the PDB. To be specific, we use only X-ray structures with resolution < 3.0Å, and enforce a 60% sequence identity threshold. This results in a total of 2643, 330, and 330 PDB structures for the train, validation, and test sets, respectively. Experimental details, the summary of the database, and the description of these GGNNs are elaborated in Supplementary Notes 1 and 2.

Empirical results and analysis

Table 6 documents the overall results, where metrics are labeled with ↑/↓ if higher/lower is better, respectively. It can be found that all GGNNs fail to recognize either the absolute or the relative positional information encoded in the protein sequences, with an accuracy lower than 1% and an extremely high RMSE.

Table 6.

Results of two residue position identification tasks.

| Models | Graph type | APR: Accuracy (%) ↑ | RPE: RMSE ↓ |
|---|---|---|---|
| GVP-GNN | r-ball graph | 0.157 ± 0.002 | 392.38 ± 3.41 |
| GVP-GNN | KNN graph | 0.158 ± 0.003 | 392.38 ± 4.05 |
| EGNN | r-ball graph | 0.150 ± 0.005 | 412.70 ± 2.36 |
| EGNN | KNN graph | 0.131 ± 0.004 | 403.86 ± 1.77 |
| Molformer | FC graph | 0.148 ± 0.007 | 270.69 ± 4.53 |

Results are reported with mean ± standard deviation over three repeated runs and the best performance is in bold.

This phenomenon stems from the conventional ways of building graph connectivity, which usually exclude sequential information. To be specific, unlike common applications of GNNs such as citation networks59, social networks60, and knowledge graphs61, molecules do not have explicitly defined edges or adjacency. On the one hand, r-ball graphs utilize a cut-off distance, which is usually set as a hyperparameter, to determine the particle connections. But it is hard to guarantee that a cut-off properly includes all crucial node interactions for complicated and large molecules. On the other hand, FC graphs that consider all pairwise distances cause severe redundancy, dramatically increasing the computational complexity, especially when proteins consist of thousands of residues. Besides, GGNNs are easily confused by excessive noise, leading to unsatisfactory performance. As a remedy, KNN has become a more popular choice to establish graph connectivity for proteins34,62,63. However, none of these constructions takes the sequential information into account, so GGNNs are required to learn the original sequential order during training.

The lack of sequential information can yield several problems. To begin with, residues are unaware of their relative positions in the proteins. For instance, two residues can be close in the 3D space but distant in the sequence, which can prevent models from finding the correct backbone chain. Secondly, according to the characteristics of the message-passing (MP) mechanism, two residues in a protein with the same neighborhood are expected to share similar representations. Nevertheless, the roles of those two residues can differ significantly64 when they are located in different segments of the protein. Thus, GGNNs may be incapable of differentiating two residues with the same 1-hop local structures. This restriction has already been noted by several works6,65, but none of them investigates it rigorously and thoroughly. Admittedly, sequential order may only be necessary for certain tasks. But this toy experiment strongly indicates that the knowledge unique to amino acid sequences can be lost if GGNNs only learn from protein structures.

Integration of language models into geometric networks

As discussed before, learning about 3D structures cannot directly benefit from large amounts of sequential data. Consequently, the model sizes of GGNNs must stay limited, or else overfitting may occur66. By contrast, comparing the number of protein sequences in the UniProt database67 to the number of known structures in the PDB, there are over 1700 times more sequences than structures. More importantly, the availability of new protein sequence data continues to far outpace the availability of experimental protein structure data, only increasing the need for accurate protein modeling tools.

Therefore, we introduce a straightforward approach to assist GGNNs with pretrained protein language models. To this end, we feed amino acid sequences into those protein language models, where ESM-212 is adopted in our case, and extract the per-residue representations, denoted as $h' \in \mathbb{R}^{N \times \psi_{PLM}}$. Here $\psi_{PLM} = 1280$. Then $h'$ can be added or concatenated to the per-atom feature $h$. For residue-level graphs, $h'$ immediately replaces the original $h$ as the input node features.
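
The three ways of combining language-model embeddings with node features (replace, concatenate, add) can be sketched with plain Python lists. This is an illustration under stated assumptions: the embeddings are stand-ins of width 4 rather than 1280, the per-residue embedding is copied to every atom of that residue through a hypothetical `atom_to_residue` index, and element-wise addition requires equal widths (in practice a learned projection would be needed to match them). None of the names below come from the authors' code.

```python
def broadcast_to_atoms(residue_embeddings, atom_to_residue):
    """Copy each residue's embedding to all atoms that belong to that residue."""
    return [list(residue_embeddings[r]) for r in atom_to_residue]


def concat_features(h, h_extra):
    """Per-node concatenation; the output width is the sum of both widths."""
    if len(h) != len(h_extra):
        raise ValueError("both feature sets must describe the same nodes")
    return [list(a) + list(b) for a, b in zip(h, h_extra)]


def add_features(h, h_extra):
    """Element-wise sum; both feature sets must have the same shape."""
    if len(h) != len(h_extra) or any(len(a) != len(b) for a, b in zip(h, h_extra)):
        raise ValueError("feature shapes must match for addition")
    return [[x + y for x, y in zip(a, b)] for a, b in zip(h, h_extra)]


# Stand-in per-residue PLM embeddings for a two-residue protein (width 4, not 1280).
h_plm = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

# Residue-level graph: the embeddings simply become the node features.
residue_node_features = [list(row) for row in h_plm]

# Atom-level graph: three atoms with one-hot atom-type features of width 2.
atom_to_residue = [0, 0, 1]
h_atom = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
atom_node_features = concat_features(h_atom, broadcast_to_atoms(h_plm, atom_to_residue))
print(len(atom_node_features), len(atom_node_features[0]))
```

Concatenation keeps the original geometric features untouched at the cost of a wider input layer, whereas replacement is the simplest choice when the original residue features carry little beyond the amino acid type.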

Notably, the experimental structure and its original amino acid sequence are not always compatible. That is, structures stored in the PDB files are usually incomplete, and some stretches of residues are missing due to unavoidable practical issues68. They, therefore, do not perfectly match the corresponding sequences (i.e., FASTA sequence). There are two choices to address this mismatch. On the one hand, we can simply use the fragmentary sequence as the substitute for the complete amino acid sequence and forward it into the protein language models. On the other hand, we can leverage a dynamic programming algorithm provided by Biopython69 to implement pairwise sequence alignment and discard residues that do not exist in the PDB structures. We find empirically that the two options make no big difference, so we adopt the former for simplicity.
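
The second option, aligning the structure-derived sequence to the full sequence and keeping only embeddings of observed residues, can be sketched as follows. To stay dependency-free, the sketch swaps in `difflib.SequenceMatcher` from the Python standard library in place of the Biopython dynamic-programming alignment mentioned above, so it is a stand-in rather than the procedure the text describes. The sequences and the function name are made up for illustration.

```python
from difflib import SequenceMatcher


def observed_positions(full_seq, structure_seq):
    """Map each residue of structure_seq to its index in full_seq.

    Residues that cannot be matched are mapped to None.
    """
    matcher = SequenceMatcher(None, full_seq, structure_seq, autojunk=False)
    mapping = [None] * len(structure_seq)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.b + offset] = block.a + offset
    return mapping


# Hypothetical FASTA sequence and the fragment resolved in the structure,
# where the stretch "KQRQIS" is missing from the coordinates.
full_sequence = "MKTAYIAKQRQISFVKSHFSRQ"
structure_sequence = "MKTAYIAFVKSHFSRQ"

mapping = observed_positions(full_sequence, structure_sequence)

# Stand-in per-residue embeddings for the full sequence (one number per residue).
full_embeddings = [[float(i)] for i in range(len(full_sequence))]
node_embeddings = [full_embeddings[i] for i in mapping if i is not None]
print(mapping)
```

With this option the language model still sees the complete sequence context, and only the rows for residues present in the structure are passed on as node features.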

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

42003_2023_5133_MOESM2_ESM.pdf (84.7KB, pdf)

Description of Additional Supplementary Files

Supplementary Data (12.5KB, xlsx)
Reporting Summary (2.1MB, pdf)

Acknowledgements

This work is supported in part by the Institute of AI Industry Research at Tsinghua University and the Molecule Mind.

Author contributions

F.W. and J.X. led the research. F.W. contributed technical ideas. F.W. and Y.T. developed the proposed method. F.W., D.R., and Y.T. performed the analysis. J.X. and D.R. provided evaluation and suggestions. All authors contributed to the manuscript.

Peer review

Peer review information

Communications Biology thanks Jianzhao Gao, Arne Elofsson, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Yuedong Yang and Gene Chong.

Data availability

The data for model quality assessment, protein-protein interface prediction, and ligand affinity prediction are available at https://www.atom3d.ai/. The data for protein-protein rigid-body docking can be downloaded directly from the official EquiDock repository at https://github.com/octavian-ganea/equidock_public. Source data for figures can be found in Supplementary Data.

Code availability

The code repository is stored at https://github.com/smiles724/bottleneck. It is also deposited in ref. 70.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-023-05133-1.

References

  • 1. Xu, M. et al. Geodiff: a geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations (ICLR, 2022).
  • 2. Townshend, R. J. et al. Atom3d: tasks on molecules in three dimensions. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).
  • 3. Wu Z, et al. Moleculenet: a benchmark for molecular machine learning. Chem. Sci. 2018;9:513–530. doi: 10.1039/C7SC02664A.
  • 4. Lim J, et al. Predicting drug–target interaction using a novel graph neural network with 3d structure-embedded graph representation. J. Chem. Inf. Model. 2019;59:3981–3988. doi: 10.1021/acs.jcim.9b00387.
  • 5. Liu, Y., Yuan, H., Cai, L. & Ji, S. Deep learning of high-order interactions for protein interface prediction. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 679–687 (ACM, 2020).
  • 6. Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Advances in neural information processing systems 32 (NeurIPS, 2019).
  • 7. Jing, B., Eismann, S., Suriana, P., Townshend, R. J. & Dror, R. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411 (2020).
  • 8. Strokach A, Becerra D, Corbi-Verge C, Perez-Riba A, Kim PM. Fast and flexible protein design using deep graph neural networks. Cell Syst. 2020;11:402–411. doi: 10.1016/j.cels.2020.08.016.
  • 9. Wu, F. et al. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37 (2023).
  • 10. Atz K, Grisoni F, Schneider G. Geometric deep learning on molecular representations. Nat. Mach. Intell. 2021;3:1023–1032. doi: 10.1038/s42256-021-00418-8.
  • 11. Meier J, et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 2021;34:29287–29303.
  • 12. Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–1130. doi: 10.1126/science.ade2574.
  • 13. Rives A, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. 2021;118:e2016239118. doi: 10.1073/pnas.2016239118.
  • 14. Elnaggar, A. et al. Prottrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE. Trans. Pattern. Anal. Mach. Intell. 44, 7112–7127 (2021).
  • 15. Hsu, C. et al. Learning inverse folding from millions of predicted structures. In Proceedings of the 39th International Conference on Machine Learning. Vol. 162, 8946–8970 (PMLR, 2022).
  • 16. Boadu, F., Cao, H. & Cheng, J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Preprint at https://www.biorxiv.org/content/10.1101/2023.01.17.524477v1 (2023).
  • 17. Chen C, Chen X, Morehead A, Wu T, Cheng J. 3d-equivariant graph neural networks for protein model quality assessment. Bioinformatics. 2023;39:btad030. doi: 10.1093/bioinformatics/btad030.
  • 18. Dunbar J, et al. Sabdab: the structural antibody database. Nucleic Acids Res. 2014;42:D1140–D1146. doi: 10.1093/nar/gkt1043.
  • 19. Chandonia J-M, Fox NK, Brenner SE. Scope: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res. 2019;47:D475–D481. doi: 10.1093/nar/gky1134.
  • 20. Velankar S, et al. Sifts: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res. 2012;41:D483–D489. doi: 10.1093/nar/gks1258.
  • 21. Berman HM, et al. The protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 22. Mistry J, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021;49:D412–D419. doi: 10.1093/nar/gkaa913.
  • 23. Bairoch A, et al. The universal protein resource (uniprot). Nucleic Acids Res. 2005;33:D154–D159. doi: 10.1093/nar/gki070.
  • 24. Yanofsky C, Horn V, Thorpe D. Protein structure relationships revealed by mutational analysis. Science. 1964;146:1593–1594. doi: 10.1126/science.146.3651.1593.
  • 25. Göbel U, Sander C, Schneider R, Valencia A. Correlated mutations and residue contacts in proteins. Proteins. 1994;18:309–317. doi: 10.1002/prot.340180402.
  • 26. Cheng J, et al. Estimation of model accuracy in casp13. Proteins. 2019;87:1361–1377. doi: 10.1002/prot.25767.
  • 27. Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J. Critical assessment of methods of protein structure prediction (casp)-round xiii. Proteins. 2019;87:1011–1020. doi: 10.1002/prot.25823.
  • 28. Vreven T, et al. Updates to the integrated protein–protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J. Mol. Biol. 2015;427:3031–3041. doi: 10.1016/j.jmb.2015.07.016.
  • 29. Townshend, R., Bedi, R., Suriana, P. & Dror, R. End-to-end learning on 3d protein structure for interface prediction. In Advances in Neural Information Processing Systems 32 (NeurIPS, 2019).
  • 30. Wang R, Fang X, Lu Y, Wang S. The pdbbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J. Med. Chem. 2004;47:2977–2980. doi: 10.1021/jm030580l.
  • 31. Liu Z, et al. Pdb-wide collection of binding data: current status of the pdbbind database. Bioinformatics. 2015;31:405–412. doi: 10.1093/bioinformatics/btu626.
  • 32. Paszke, A. et al. Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS, 2019).
  • 33. Fey, M. & Lenssen, J. E. Fast graph representation learning with pytorch geometric. In Workshop of International Conference on Learning Representations (ICLR, 2019).
  • 34. Ganea, O.-E. et al. Independent se (3)-equivariant models for end-to-end rigid protein docking. In International Conference on Learning Representations (ICLR, 2022).
  • 35. Zhang J, Zhang Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PloS one. 2010;5:e15386. doi: 10.1371/journal.pone.0015386.
  • 36. Uziela K, Menéndez Hurtado D, Shu N, Wallner B, Elofsson A. Proq3d: improved model quality assessments using deep learning. Bioinformatics. 2017;33:1578–1580. doi: 10.1093/bioinformatics/btw819.
  • 37. Olechnovič K, Venclovas Č. Voromqa: Assessment of protein structure quality using interatomic contact areas. Proteins: Structure, Function, and Bioinformatics. 2017;85:1131–1145. doi: 10.1002/prot.25278.
  • 38. Karasikov M, Pagès G, Grudinin S. Smooth orientation-dependent scoring function for coarse-grained protein quality assessment. Bioinformatics. 2019;35:2801–2808. doi: 10.1093/bioinformatics/bty1037.
  • 39. Pagès G, Charmettant B, Grudinin S. Protein model quality assessment using 3d oriented convolutional neural networks. Bioinformatics. 2019;35:3313–3319. doi: 10.1093/bioinformatics/btz122.
  • 40. Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In International Conference on Learning Representations (ICLR, 2020).
  • 41. Eismann S, et al. Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes. Proteins. 2021;89:493–501. doi: 10.1002/prot.26033.
  • 42. Aykent, S. & Xia, T. Gbpnet: universal geometric representation learning on protein structures. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4–14 (ACM, 2022).
  • 43. Jing, B., Eismann, S., Soni, P. N. & Dror, R. O. Equivariant graph neural networks for 3d macromolecular structure. Preprint at https://arxiv.org/abs/2106.03843 (2021).
  • 44. Karimi M, Wu D, Wang Z, Shen Y. Deepaffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics. 2019;35:3329–3338. doi: 10.1093/bioinformatics/btz111.
  • 45. Anderson, B., Hy, T. S. & Kondor, R. Cormorant: covariant molecular neural networks. In Advances in neural information processing systems 32 (NeurIPS, 2019).
  • 46. Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. Preprint at https://arxiv.org/abs/1902.08661 (2019).
  • 47. Rao, R. et al. Evaluating protein transfer learning with tape. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
  • 48. Gainza P, et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods. 2020;17:184–192. doi: 10.1038/s41592-019-0666-6.
  • 49. Nguyen T, et al. Graphdta: Predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2021;37:1140–1147. doi: 10.1093/bioinformatics/btaa921.
  • 50. Somnath VR, Bunne C, Krause A. Multi-scale representation learning on proteins. Adv. Neural Inf. Process. Syst. 2021;34:25244–25255.
  • 51. Jumper J, et al. Highly accurate protein structure prediction with alphafold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
  • 52. Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
  • 53. Evans, R. et al. Protein complex prediction with alphafold-multimer. Preprint at https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2 (2022).
  • 54. Ruffolo JA, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Biophys. J. 2022;121:155a–156a. doi: 10.1016/j.bpj.2021.11.1942.
  • 55. Wang, G. et al. Helixfold: an efficient implementation of alphafold2 using paddlepaddle. Preprint at https://arxiv.org/abs/2207.05477 (2022).
  • 56. Schütt, K. et al. Schnet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in neural information processing systems 30 (NeurIPS, 2017).
  • 57. Liu, Y. et al. Spherical message passing for 3d molecular graphs. In International Conference on Learning Representations (ICLR, 2021).
  • 58. Satorras, V. G., Hoogeboom, E. & Welling, M. E (n) equivariant graph neural networks. In International conference on machine learning, 9323–9332 (PMLR, 2021).
  • 59. Sen P, et al. Collective classification in network data. AI Mag. 2008;29:93–93.
  • 60. Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Advances in neural information processing systems 30 (NeurIPS, 2017).
  • 61. Carlson, A. et al. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI conference on artificial intelligence (AAAI, 2010).
  • 62. Fout, A., Byrd, J., Shariat, B. & Ben-Hur, A. Protein interface prediction using graph convolutional networks. In Advances in neural information processing systems 30 (NeurIPS, 2017).
  • 63. Stärk, H., Ganea, O., Pattanaik, L., Barzilay, R. & Jaakkola, T. Equibind: geometric deep learning for drug binding structure prediction. In International Conference on Machine Learning, 20503–20521 (PMLR, 2022).
  • 64. Murphy, R., Srinivasan, B., Rao, V. & Ribeiro, B. Relational pooling for graph representations. In International Conference on Machine Learning, 4663–4673 (PMLR, 2019).
  • 65. Zhang, Z. et al. Protein representation learning by geometric structure pretraining. In International Conference on Learning Representations (ICLR, 2023).
  • 66. Hermosilla, P. & Ropinski, T. Contrastive representation learning for 3d protein structures. Preprint at https://arxiv.org/abs/2205.15675 (2022).
  • 67. Consortium U. Uniprot: a hub for protein information. Nucleic Acids Res. 2015;43:D204–D212. doi: 10.1093/nar/gku989.
  • 68. Djinovic-Carugo K, Carugo O. Missing strings of residues in protein crystal structures. Intrinsically Disord. Proteins. 2015;3:e1095697. doi: 10.1080/21690707.2015.1095697.
  • 69. Cock PJ, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.
  • 70. Wu, F. Code for Paper ‘Integration of pre-trained protein language models into geometric deep learning networks’. Zenodo 10.5281/zenodo.8022149 (2023).


Articles from Communications Biology are provided here courtesy of Nature Publishing Group
