Briefings in Bioinformatics. 2025 Nov 19;26(6):bbaf607. doi: 10.1093/bib/bbaf607

BridgeNet: a high-efficiency framework integrating sequence and structure for protein and enzyme function prediction

Yilin Ye 1, Hongliang Duan 2, Yuguang Mu 3, Lei Wu 4, Jingjing Guo 5
PMCID: PMC12629232  PMID: 41259416

Abstract

Understanding the relationship between protein sequences and structures is essential for accurate protein property prediction. We propose BridgeNet, a pre-trained deep learning framework that integrates sequence and structural information through a novel latent environment matrix, enabling seamless alignment of these two modalities. The model’s modular architecture—comprising sequence encoding, structural encoding, and a bridge module—effectively captures complementary features without requiring explicit structural inputs during inference. Extensive evaluations on tasks such as enzyme classification, Gene Ontology annotation, coenzyme specificity prediction, and peptide toxicity prediction demonstrate its superior performance over state-of-the-art models. BridgeNet provides a scalable and robust solution, advancing protein representation learning and enabling applications in computational biology and structural bioinformatics.

Keywords: protein representation learning, sequence-structure integration, deep learning in bioinformatics, protein property prediction

Introduction

Proteins are fundamental macromolecules that play essential roles in virtually all biological processes, including catalyzing metabolic reactions, transmitting cellular signals, and providing structural stability to cells and tissues [1–5]. Understanding the properties of proteins—such as their structure, function, stability, and interactions—is critical for advancing a wide range of scientific and industrial fields, including biotechnology, synthetic biology, and pharmaceutical development [6–9]. However, the vast diversity and complexity of protein sequences and structures in nature present significant challenges for large-scale characterization [10, 11]. High-throughput and accurate methods for protein property prediction are urgently needed to bridge this gap and accelerate advancements in these fields [12, 13].

Traditional experimental techniques, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (Cryo-EM), have been instrumental in elucidating protein structures and functions at atomic resolution [14–17]. These methods provide unparalleled insights into protein conformations and interactions, enabling researchers to study mechanistic details of biological processes. However, these experimental approaches are often labor-intensive, expensive, and highly dependent on sample quality and experimental conditions, which limit their scalability and applicability to the vast number of proteins that remain uncharacterized. Additionally, certain classes of proteins, such as intrinsically disordered proteins (IDPs) or transient complexes, pose inherent challenges to these techniques due to their dynamic nature [18–23].

To overcome the limitations of experimental methods, computational approaches have emerged as powerful tools for protein characterization. These methods leverage the growing availability of protein sequence and structure data to predict various protein properties, including structure, function, and interactions [24, 25]. Early computational techniques were largely reliant on physics-based modeling or sequence homology, such as molecular dynamics simulations and comparative modeling. While these approaches achieved some success, their performance was often constrained by computational cost, limited accuracy, and reliance on homologous templates. The advent of machine learning (ML) and deep learning (DL) has revolutionized the field, enabling more accurate and scalable predictions by learning complex patterns from large-scale biological datasets [26, 27].

Recent breakthroughs in ML and DL have led to the development of advanced sequence-based models, such as ESM (Evolutionary Scale Modeling), TAPE, ProtBERT, and ProteinGPT, which are capable of extracting meaningful representations from protein sequences [28–31]. These models capture evolutionary and biophysical patterns that underlie protein structure and function, significantly improving the accuracy of tasks such as fold classification, functional annotation, and stability prediction [32–34]. On the structure-based side, AlphaFold and its successors have achieved remarkable success in predicting three-dimensional protein structures with near-experimental accuracy, even in cases where no homologous templates are available [35]. These advancements have fundamentally transformed the field, opening new avenues for large-scale protein property prediction.

However, despite their success, these models often operate independently on sequence or structure data, which may limit their ability to fully capture the relationship between sequence-intrinsic properties and structure-environmental effects [36]. To address this limitation, recent approaches such as SABLE have begun to integrate both sequence and structural information, leveraging the complementary nature of these modalities to obtain more comprehensive protein representations [37]. Nevertheless, such strategies typically treat sequence and structure as separate, complementary sources, and may overlook the fact that they are actually different modalities of the same underlying biological entity. In principle, it is possible to pre-train models using both modalities to learn richer, multi-modal representations, while only relying on the more computationally efficient modality during deployment, thus balancing performance and computational cost [38–40].

In this study, we present BridgeNet, a novel pre-trained model that integrates both protein sequence and structure information during the pre-training process to generate robust and biologically meaningful protein representations. The model is grounded on the biological premise that a protein’s sequence inherently determines its intrinsic properties, while its structure reflects the interplay between these intrinsic properties and environmental factors. By leveraging structural information to guide the encoding process during pre-training, BridgeNet embeds a deep understanding of protein structure within its representations. Importantly, this design enables the model to perform downstream protein property prediction tasks without requiring additional structural information, while still retaining the benefits of structure-informed representations. Extensive experiments across multiple protein property prediction benchmarks demonstrate that BridgeNet significantly outperforms traditional methods and state-of-the-art models. This integrative approach highlights the power of embedding structural insights into sequence-based models, offering a scalable and efficient solution for advancing protein characterization.

Materials and methods

Datasets

Pre-training dataset

Our pre-training dataset is derived from UniRef50, a comprehensive and widely utilized collection of protein sequences sourced from the UniProt database [41–43]. UniRef50 clusters protein sequences that share 50% or greater identity into single representative entries, thereby effectively reducing redundancy while preserving the diversity of sequences across millions of proteins. This clustering approach ensures a balance between computational efficiency and biological relevance, making UniRef50 an ideal resource for pre-training models designed for tasks such as protein function prediction, structural analysis, and a wide range of bioinformatics applications [44–46]. By encompassing a broad spectrum of sequence diversity, this dataset enables the model to generalize effectively across a variety of protein families and functional classes.

For the pre-training process, the dataset was filtered to exclude sequences longer than 1,024 residues and deduplicated, resulting in a final dataset of 63,919,579 records. Additionally, we randomly selected 10% of the records and downloaded their structural data for structure-enhanced training. The dataset was partitioned into training, validation, and test sets in a 7:1:2 ratio, ensuring a systematic division that supports both model development and rigorous evaluation [47]. This partitioning strategy allows the model to leverage a large proportion of data for training, facilitating the learning of meaningful representations, while reserving sufficient data for validation to monitor overfitting and for testing to provide an unbiased assessment of model performance. This approach ensures that the pre-trained model is not only optimized during training but also capable of generalizing to unseen protein sequences [48, 49].
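The paper does not include the preprocessing script, but the protocol above is straightforward to reproduce. Below is a minimal Python sketch of the length filtering, deduplication, and 7:1:2 split; the FASTA file name and the fixed random seed are illustrative assumptions.

```python
import random

def load_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def build_splits(fasta_path, max_len=1024, seed=42):
    """Filter sequences longer than max_len, deduplicate, and split 7:1:2."""
    seen, records = set(), []
    for header, seq in load_fasta(fasta_path):
        if len(seq) > max_len or seq in seen:
            continue
        seen.add(seq)
        records.append((header, seq))
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

# "uniref50.fasta" is a placeholder for the downloaded UniRef50 release
train, val, test = build_splits("uniref50.fasta")
```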

Protein function prediction

The dataset curated by Quan et al. is specifically designed for two critical tasks in bioinformatics: Enzyme Commission (EC) number prediction and Gene Ontology (GO) term prediction, both of which are pivotal in understanding protein functions and their roles in biological processes [50, 51]. The EC number dataset focuses on predicting enzyme classifications based on the chemical reactions they catalyze, providing insights into enzyme functionality and aiding in the systematic annotation of protein roles [52]. In addition, the dataset supports enzyme reaction classification, which aims to assign proteins to specific enzymatic reaction classes according to the full four-level EC hierarchy, enabling detailed characterization of catalytic activities [50]. Detailed information about the datasets can be found in the ‘Protein Function Prediction’ section of the Supplementary Materials.

Coenzyme classification

The dataset for the Coenzyme Classification task was obtained through the work of Ye et al., which carefully curated a collection of non-redundant sequences categorized by their cofactor binding specificities [53]. Cofactors such as NAD+ (nicotinamide adenine dinucleotide) and NADP+ (nicotinamide adenine dinucleotide phosphate) play crucial roles in enzymatic reactions, serving as electron carriers and influencing metabolic pathways. NAD+ is primarily involved in catabolic processes, such as cellular respiration, where it facilitates energy production, whereas NADP+ is key in anabolic processes like biosynthesis and redox reactions. Understanding the binding specificities of enzymes to these cofactors is essential for deciphering metabolic pathways, optimizing enzyme functionality, and advancing applications in fields like bioremediation and sustainable chemical synthesis. Detailed information about the datasets can be found in the ‘Coenzyme Classification’ section of the Supplementary Materials.

Peptide toxicity prediction

Utilizing the methodology proposed by Ebrahimikondori et al., we performed a comparative analysis on a protein benchmark dataset to advance the field of protein toxicity prediction [54]. This task holds substantial importance in bioinformatics, toxicology, and drug discovery, as the accurate identification of toxic proteins is critical for mitigating adverse effects in pharmaceutical development, ensuring patient safety, and guiding the design of safer therapeutics. By addressing this challenge, researchers can also gain deeper insights into the molecular mechanisms underlying toxicity, which has implications for both fundamental biology and applied sciences. Detailed information about the datasets can be found in the ‘Peptide Toxicity Prediction’ section of the Supplementary Materials.

The framework of the BridgeNet model

Our study is grounded on the foundational premise that the sequence of a protein dictates its intrinsic properties, while its three-dimensional structure determines its functional capabilities. This relationship suggests that, within suitable latent projection spaces, representations of protein sequences can be transformed into representations of protein structures. This concept underlies the principles behind protein structure prediction tools such as AlphaFold.

In the context of protein representation, when structural and sequence representations can be interconverted within certain latent projection spaces, it becomes unnecessary to input the complete protein structure into the representation network. This approach significantly reduces the time and computational power required during inference. Building upon this understanding, we introduce a novel conceptual framework termed the ‘environment matrix’. We hypothesize that the transformation of protein sequence representations into structural representations is influenced by specific environmental matrices. In an ideal scenario, protein sequences and structures should correspond perfectly; however, environmental factors can lead the same protein to adopt multiple distinct structures. These matrices encompass various factors such as solution conditions (e.g., pH, ionic strength, and temperature), the involvement of molecular chaperones, and post-translational modifications like phosphorylation, methylation, and acetylation.

To address this complexity, we propose ‘BridgeNet’, a deep learning-based framework designed to integrate structural and sequence representations through distinct modules (Fig. 1). Within this framework, the latent environment matrix plays a crucial role in mapping identical sequence representations to diverse conformational states of proteins, thereby acknowledging the multifaceted nature of protein structure determination influenced by environmental conditions. This approach not only enhances our understanding of protein dynamics but also advances the predictive capabilities of computational models in structural biology.

Figure 1. The overall framework of BridgeNet. (A) The pretraining process of BridgeNet: during pretraining, structural information of proteins is utilized to guide the formation of the protein encoding. (B) The complete prediction framework integrating BridgeNet: protein sequences are encoded via BridgeNet to obtain sequence representations, which are subsequently processed through various downstream task modules to generate the corresponding prediction results.

In BridgeNet, the network is divided into three modules: the structure representation module, the sequence representation module, and the Bridge module. These modules are responsible for encoding the protein’s structure, encoding the protein’s sequence, and unifying the sequence and structural representations of proteins through a latent environment matrix, respectively.

Embedding module

In our network, the Sequence Representation Module is primarily responsible for processing protein sequence information, while the Structure Representation Module focuses on the characterization of protein structural information. For sequence encoding, we employ the BLOSUM-62 matrix, a widely used amino acid substitution scoring matrix derived from evolutionary information. Specifically, each amino acid residue in the protein sequence is converted into a fixed-length vector by extracting the corresponding row from the BLOSUM-62 matrix. By concatenating the encoding vectors of all residues in order, a sequence representation matrix is constructed, as shown in Fig. 2A. This matrix is subsequently flattened and fed into the Sequence Representation Module for further extraction of high-level sequence features.
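As a concrete illustration, the per-residue BLOSUM-62 encoding can be sketched with Biopython's bundled substitution matrix. Restricting each row to the 20 standard amino acids is our assumption; the paper does not specify which columns of the matrix are retained.

```python
import numpy as np
from Bio.Align import substitution_matrices  # Biopython

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues (assumed column order)
BLOSUM62 = substitution_matrices.load("BLOSUM62")

def blosum_encode(sequence: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) matrix: one BLOSUM-62 row per residue."""
    matrix = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for i, residue in enumerate(sequence):
        for j, other in enumerate(AMINO_ACIDS):
            # Substitution score of this residue against each standard amino acid
            matrix[i, j] = BLOSUM62[residue, other]
    return matrix

emb = blosum_encode("MKTAYIAK")
print(emb.shape)  # (8, 20)
```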

Figure 2. Workflow of the embedding module. (A) Illustration of the process by which protein sequences are encoded into BLOSUM-62 encoding vectors using the BLOSUM-62 matrix. (B) Construction of the protein structure graph by establishing edges between residues within a 5 Å distance threshold, resulting in the corresponding adjacency matrix.

For protein structural information, the processing procedure is relatively more complex. Specifically, we first retrieve the corresponding PDB ID based on the UniProt ID of the original data, and then download the PDB structure file of the target protein from the RCSB Protein Data Bank, a well-established resource for experimentally determined protein structures. Subsequently, we extract the primary sequence as well as the three-dimensional coordinates of the Cα atom of each residue in the specified chain. We then calculate the pairwise spatial distances between all residues and construct edges between those whose distances are less than or equal to the 5 Å threshold, thereby generating the protein structure adjacency matrix (i.e., the edge index of the graph structure). This adjacency matrix, together with the spatial coordinates of the nodes, is used as the input for the graph neural network, as shown in Fig. 2B.
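A minimal sketch of this graph construction using Biopython and NumPy is shown below; the PDB file name and chain identifier are placeholders, and the COO edge-index layout is an assumption chosen to match common graph neural network libraries.

```python
import numpy as np
from Bio.PDB import PDBParser  # Biopython

def build_structure_graph(pdb_path: str, chain_id: str, cutoff: float = 5.0):
    """Extract Cα coordinates for one chain and build the <= cutoff adjacency."""
    structure = PDBParser(QUIET=True).get_structure("protein", pdb_path)
    chain = structure[0][chain_id]
    coords = np.array([res["CA"].coord for res in chain if "CA" in res])
    # Pairwise Euclidean distances between all Cα atoms
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adjacency = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)
    # Edge index in COO format, as consumed by most graph neural network libraries
    edge_index = np.stack(np.nonzero(adjacency))
    return coords, edge_index

# "1abc.pdb" and chain "A" are hypothetical placeholders
coords, edge_index = build_structure_graph("1abc.pdb", "A")
```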

The sequence representation module

The Sequence Representation Module is a neural network architecture designed to encode and decode protein sequences through a sophisticated transformer-based framework. This network comprises several key components. The module begins with an input projector, which maps the amino acid dimension to the desired output dimension through a linear layer followed by a ReLU activation function to introduce non-linearity (Eq. 1). At the output stage, the output projector performs the inverse operation by mapping the output dimension back to the amino acid dimension through a linear layer, ensuring compatibility with sequence-level tasks.

$\mathbf{H}_0 = \mathrm{ReLU}(\mathbf{X}\,\mathbf{W}_{\mathrm{in}} + \mathbf{b}_{\mathrm{in}})$  (1)

To incorporate sequence order information, a positional encoding module is employed. This module adds position-specific information to the input sequence, enabling the model to capture the relative and absolute positional relationships between amino acids within the sequence.

As shown in Eq. 2, for each position index $pos$ and dimension index $i$:

$PE(pos,\,2i) = \sin\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$
$PE(pos,\,2i+1) = \cos\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$  (2)
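This is the standard sinusoidal encoding; a short PyTorch helper implementing Eq. 2 might look as follows (assuming, on our part, an even d_model):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding as in Eq. 2; d_model is assumed even."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```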

The core of the architecture consists of transformer encoder and decoder layers. The transformer encoder, composed of multiple stacked layers, processes the input sequence to extract high-level latent representations (Eq. 3). Each encoder layer includes multi-head self-attention mechanisms and feedforward neural networks. The multi-head attention is parameterized by the specified number of attention heads, while the feedforward network is expanded by a predefined factor (expand) to increase the model’s capacity for capturing complex sequence features. The transformer decoder mirrors the encoder structure, enabling the transformation of the encoded representations back into sequence space (Eq. 4).

$\mathbf{Z} = \mathrm{Encoder}(\mathbf{H}_0 + \mathbf{PE})$  (3)
$\hat{\mathbf{H}} = \mathrm{Decoder}(\mathbf{Z})$  (4)

This network architecture leverages the capabilities of transformers to model complex dependencies and interactions within protein sequences. By utilizing mechanisms such as multi-head attention and positional encoding, it provides a robust and flexible framework for protein representation learning, enabling the extraction of meaningful sequence features for downstream biological tasks (Eq. 6).

$\hat{\mathbf{X}} = \hat{\mathbf{H}}\,\mathbf{W}_{\mathrm{out}} + \mathbf{b}_{\mathrm{out}}$  (6)
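Putting Eqs. 1–6 together, a hedged PyTorch sketch of the sequence representation module is given below, reusing the positional_encoding helper from the previous sketch. All hyperparameter values (d_model, number of heads, the expand factor, layer counts) are illustrative assumptions, not the paper's reported settings.

```python
import torch
import torch.nn as nn

class SequenceRepresentationModule(nn.Module):
    """Sketch of the transformer encoder-decoder described above (Eqs. 1-6)."""

    def __init__(self, aa_dim=20, d_model=256, heads=8, expand=4, layers=4,
                 max_len=1024):
        super().__init__()
        self.input_proj = nn.Sequential(nn.Linear(aa_dim, d_model), nn.ReLU())  # Eq. 1
        self.register_buffer("pe", positional_encoding(max_len, d_model))       # Eq. 2
        enc_layer = nn.TransformerEncoderLayer(d_model, heads,
                                               dim_feedforward=expand * d_model,
                                               batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, heads,
                                               dim_feedforward=expand * d_model,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)  # Eq. 3
        self.decoder = nn.TransformerDecoder(dec_layer, layers)  # Eq. 4
        self.output_proj = nn.Linear(d_model, aa_dim)            # Eq. 6

    def forward(self, x):  # x: (batch, length, aa_dim) BLOSUM-62 encodings
        h = self.input_proj(x) + self.pe[: x.size(1)]
        z = self.encoder(h)                            # latent sequence representation
        recon = self.output_proj(self.decoder(z, z))   # reconstruction in sequence space
        return z, recon
```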

The structure representation module

The Structure Representation Module is a novel graph-based neural network architecture designed for encoding and decoding graph-structured data, with a primary focus on protein representations. This model leverages the power of graph convolutional networks (GCNs) to capture intricate structural relationships within graph data, which is particularly relevant in the context of protein–protein interaction networks or residue-level structural representations.

The encoding component of this architecture consists of two graph convolutional layers. The first layer applies a GCN-based transformation to project the input node features from their initial feature dimension to a hidden feature dimension. This transformation is followed by batch normalization, which ensures numerical stability during training, and a ReLU activation function to introduce non-linearity. The second graph convolutional layer takes the hidden representation and maps it to the output feature dimension, which serves as the encoded representation of the graph. Notably, no activation function is applied at this stage, as the encoded representation is directly utilized for downstream tasks that may require a linear feature space.

The decoding component reconstructs node-level features from the graph-level representation. It begins with a fully connected layer that projects the graph-level representation back to the hidden feature dimension. Batch normalization and a ReLU activation function are applied to refine the features and enable effective learning. Subsequently, two additional graph convolutional layers sequentially transform these features back to the original node feature dimension, thereby reconstructing the input graph structure.

To aggregate node-level features into a single graph-level representation, the architecture employs a global mean pooling operation. This mechanism computes the mean of all node features within a graph, producing a compact representation that is invariant to the number of nodes in the graph. Such a representation is particularly advantageous when dealing with graphs of varying sizes, as is common in biological datasets.
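A corresponding sketch of the structure representation module, written with PyTorch Geometric, is shown below. Treating the three Cα coordinates as the input node features and the specific hidden/output dimensions are assumptions; the paper specifies the layer types, normalization, and pooling but not these sizes.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool  # PyTorch Geometric

class StructureRepresentationModule(nn.Module):
    """Sketch of the GCN-based graph autoencoder described above."""

    def __init__(self, in_dim=3, hidden_dim=128, out_dim=256):
        super().__init__()
        # Encoder: two GCN layers; no activation after the second (linear latent space)
        self.enc1 = GCNConv(in_dim, hidden_dim)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.enc2 = GCNConv(hidden_dim, out_dim)
        # Decoder: fully connected layer, then two GCN layers back to node features
        self.fc = nn.Linear(out_dim, hidden_dim)
        self.bn2 = nn.BatchNorm1d(hidden_dim)
        self.dec1 = GCNConv(hidden_dim, hidden_dim)
        self.dec2 = GCNConv(hidden_dim, in_dim)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.bn1(self.enc1(x, edge_index)))
        z_nodes = self.enc2(h, edge_index)            # encoded node features
        z_graph = global_mean_pool(z_nodes, batch)    # graph-level representation
        # Broadcast the graph embedding back to the nodes for reconstruction
        h = torch.relu(self.bn2(self.fc(z_graph)))[batch]
        h = torch.relu(self.dec1(h, edge_index))
        recon = self.dec2(h, edge_index)
        return z_graph, recon
```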

The bridge module

At the core of the Bridge module is a linear transformation layer. This layer maps the input feature space to an output feature space of identical dimensionality and, notably, omits the bias term. This absence of a bias component preserves the scale and centrality of the input embeddings, which is particularly advantageous for maintaining the intrinsic properties of the features during their transformation. Such a design choice is critical when consistent alignment of feature distributions is required across different components of the network (Eq. 7).

$\mathbf{z}' = \mathbf{W}\,\mathbf{z}$  (7)

where $\mathbf{z}$ can represent either the structure representation or the sequence representation. Both $\mathbf{W}_{\mathrm{struct}\to\mathrm{seq}}$ and $\mathbf{W}_{\mathrm{seq}\to\mathrm{struct}}$ are transformation matrices of the same shape, which can be regarded as environment matrices. This transformation does not alter the dimensionality of $\mathbf{z}$; rather, the matrices operate on the latent space representations to convert a structure representation into a sequence representation, or vice versa. It is important to note that the networks employed for converting structure to sequence and sequence to structure are distinct, with independent parameters, although the vector dimensionality remains unchanged during the transformation.
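In code, the bridge reduces to two independent bias-free linear layers of identical shape, one per direction. A minimal PyTorch sketch, with the latent dimension as an assumed value:

```python
import torch.nn as nn

class BridgeModule(nn.Module):
    """Two independent bias-free linear maps of identical shape (Eq. 7)."""

    def __init__(self, dim=256):
        super().__init__()
        self.struct_to_seq = nn.Linear(dim, dim, bias=False)  # environment matrix W_struct->seq
        self.seq_to_struct = nn.Linear(dim, dim, bias=False)  # environment matrix W_seq->struct

    def forward(self, z_struct, z_seq):
        # Map each modality into the other's latent space without changing dimensionality
        return self.struct_to_seq(z_struct), self.seq_to_struct(z_seq)
```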

Based on this design, the module can be abstracted as an encoder-like structure, where BridgeNet transforms the input from one modality into another. Unlike conventional encoder-decoder frameworks, we do not utilize the hidden encoding representations generated by this structure. Instead, this module functions in conjunction with Eq. 8 as part of the training process, with the objective of enforcing the alignment between structure and sequence representations through this transformation mechanism.

The downstream task module integrates the sequence encoding module from the pre-trained model with an MLP (multi-layer perceptron) for joint training and generates the corresponding prediction output.
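A hedged sketch of such a downstream head is given below, building on the SequenceRepresentationModule sketch above; the mean-pooling over residues, the hidden width of the MLP, and the class count are assumptions, since the paper defers these details to the individual task modules.

```python
import torch.nn as nn

class DownstreamHead(nn.Module):
    """Pre-trained sequence encoder followed by an MLP classifier (jointly fine-tuned)."""

    def __init__(self, pretrained_encoder, d_model=256, num_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder  # sequence representation module from pretraining
        self.mlp = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                 nn.Linear(128, num_classes))

    def forward(self, x):
        z, _ = self.encoder(x)          # (batch, length, d_model)
        return self.mlp(z.mean(dim=1))  # mean-pool over residues, then classify

# Example wiring (both components are the sketches above, not the released model):
# model = DownstreamHead(SequenceRepresentationModule(), num_classes=2)
```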

Loss function design and optimization strategies

The training of BridgeNet is divided into two stages. In the first stage, the pretraining phase, the encoding network is simultaneously provided with protein structural and sequential information to facilitate network training. In the second stage, after the completion of pretraining, the encoding network undergoes further task-specific training by incorporating downstream tasks to accomplish various objectives. The overall workflow and loss calculation during the pretraining phase are illustrated in Fig. 3, and the details are provided in the Supplementary Materials.
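Although the exact loss formulations are given in the Supplementary Materials, Fig. 3 implies a three-term objective: sequence reconstruction, graph reconstruction, and a bridge alignment term. The sketch below illustrates one plausible form, under our assumption of mean-squared-error terms and equal weights; the true forms and weights may differ.

```python
import torch.nn.functional as F

def pretraining_loss(seq_recon, seq_target, graph_recon, graph_target,
                     z_seq, z_struct, bridge, w=(1.0, 1.0, 1.0)):
    """Hedged sketch of the three-term pretraining objective implied by Fig. 3."""
    seq_rec_loss = F.mse_loss(seq_recon, seq_target)        # sequence reconstruction
    graph_rec_loss = F.mse_loss(graph_recon, graph_target)  # structure reconstruction
    # Bridge loss: each modality, mapped through its environment matrix,
    # should match the other modality's representation
    s2q, q2s = bridge(z_struct, z_seq)
    bridge_loss = F.mse_loss(s2q, z_seq) + F.mse_loss(q2s, z_struct)
    return w[0] * seq_rec_loss + w[1] * graph_rec_loss + w[2] * bridge_loss
```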

Figure 3. Data flow and loss calculation of BridgeNet during the pretraining phase.

The specific number of training epochs was determined based on the convergence of the sequence, graph, and bridge losses across epochs, as shown in Fig. S1. Once pretraining was completed, the weights of the sequence representation module were extracted and combined with the various downstream task modules for joint training. As Fig. S1 shows, the Graph Rec Loss and Seq Rec Loss change only slightly: given the large dataset, both submodules stabilize quickly, with limited further improvement. In contrast, the significant decrease in the Bridge Loss highlights the enhanced alignment between structure and sequence. The slight increase in the Seq Rec Loss may result from the structural guidance introducing information that is challenging to recover, likely owing to homologous proteins. This plot corresponds to the loss trends on the validation set during cross-validation. Based on the Bridge Loss trend, pretraining was stopped at epoch 12.

The details of the model’s pretraining and training processes can be found in the ‘Computational Resources and Training Details’ section of the Supplementary Materials.

Results

Protein function prediction

BridgeNet achieves state-of-the-art performance across both enzyme-related and GO-related protein function prediction tasks (Table 1, Fig. 4A). Specifically, it attains an Fmax of 86.6% for EC number prediction, an accuracy of 89.3% for enzyme reaction classification, and Fmax scores of 48.8%, 66.7%, and 49.6% for GO-BP, GO-MF, and GO-CC, respectively. Compared with the second-best method, CDConv, BridgeNet improves EC number prediction by 4.6 percentage points (Fmax: 86.6% versus 82.0%) and enzyme reaction classification by 0.8 percentage points (accuracy: 89.3% versus 88.5%); for GO-MF, the improvement is 1.3 percentage points (66.7% versus 65.4%).
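For readers reproducing these numbers, Fmax is the protein-centric maximum F-measure used in CAFA-style EC/GO evaluation: precision and recall are averaged over proteins at each decision threshold, and the best resulting F1 is reported. A minimal NumPy sketch, assuming binary (protein × term) label and score matrices:

```python
import numpy as np

def fmax_score(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Protein-centric Fmax over decision thresholds (CAFA-style).
    y_true: binary (n_proteins, n_terms); y_prob: scores in [0, 1]."""
    best = 0.0
    has_true = y_true.sum(axis=1) > 0
    for t in np.arange(0.01, 1.0, 0.01):
        pred = y_prob >= t
        has_pred = pred.any(axis=1)
        if not has_pred.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        # Precision is averaged over proteins with at least one prediction,
        # recall over proteins with at least one true annotation
        precision = (tp[has_pred] / pred[has_pred].sum(axis=1)).mean()
        recall = (tp[has_true] / y_true[has_true].sum(axis=1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```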

Table 1.

Performance comparison of methods on protein function prediction (Fmax for the EC number and GO tasks; accuracy in % for enzyme reaction classification)

Method  EC number (Fmax)  Enzyme reaction (Acc, %)  GO-BP (Fmax)  GO-MF (Fmax)  GO-CC (Fmax)
ResNet [55] 0.605 24.1 0.280 0.405 0.304
LSTM [55] 0.425 11.0 0.225 0.321 0.283
Transformer [55] 0.238 26.6 0.264 0.211 0.405
GCN [56] 0.320 67.3 0.252 0.195 0.329
GAT [57] 0.368 55.6 0.284 0.317 0.385
GVP [58] 0.489 65.5 0.326 0.426 0.420
3DCNN [59] 0.077 72.2 0.240 0.147 0.305
GraphQA [60] 0.509 60.8 0.308 0.329 0.413
New IEConv [61] 0.735 87.2 0.374 0.544 0.444
GearNet [62] 0.810 85.3 0.400 0.281 0.430
CDConv [63] 0.820 88.5 0.453 0.654 0.479
BridgeNet 0.866 89.3 0.488 0.667 0.496

Figure 4. Performance of BridgeNet on protein function prediction tasks. (A) Comparative analysis of method performance across five protein function prediction tasks; all performance scores are normalized to the range [0, 1] for visual comparison across different tasks. (B) tSNE visualization of sequence and structure representations in the latent space for proteins.

In addition, we performed t-distributed stochastic neighbor embedding (tSNE) to visualize the dimensionality reduction of both structure representations and sequence representations, as shown in Fig. 4B. The results indicate that the distributions of structure and sequence representations exhibit a high degree of overlap in the latent space. Although some subtle differences can still be observed in specific local regions, the overall patterns of the two representations are largely consistent. These findings further confirm that the structure and sequence representations have achieved a high level of unification in the latent space, thus providing a solid foundation for subsequent downstream tasks.
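For reference, this kind of joint visualization can be reproduced with scikit-learn's t-SNE. The sketch below uses random placeholder arrays in place of the actual BridgeNet representations, and the perplexity value is an arbitrary choice.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders for the (n_proteins, d) latent representations of the two encoders
z_seq = np.random.randn(500, 256)
z_struct = np.random.randn(500, 256)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([z_seq, z_struct]))
plt.scatter(*emb[:500].T, s=5, label="sequence")
plt.scatter(*emb[500:].T, s=5, label="structure")
plt.legend()
plt.show()
```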

BridgeNet exhibits superior performance compared to the majority of existing methods, including traditional sequence-based approaches such as ResNet, LSTM, and Transformer, as well as advanced graph-based models like GCN, GAT, and GVP [64–67]. This indicates that BridgeNet effectively combines its design components to leverage both sequence and structural information for enhanced predictive performance.

Coenzyme classification

In addition, we conducted a comparative analysis of our proposed method, BridgeNet, against various state-of-the-art approaches in the coenzyme classification task. The results of this analysis are summarized in Table 2, which highlights the performance of BridgeNet across key evaluation metrics.

Table 2.

Performance comparison of methods for coenzyme classification

Method ACC AUC Precision Recall
LogisticRegression [68] 0.733 0.810 0.713 0.744
Rossmann-toolbox [69] 0.793 0.842 0.801 0.813
S2SNet-recommend [70] 0.780 0.821 0.782 0.722
S2SNet-complete [70] 0.819 0.844 0.808 0.793
INSIGHT-ESM2 [28] 0.946 0.983 0.944 0.949
ProteinBERT [71] 0.937 0.943 0.866 0.922
SABLE [37] 0.973 0.982 0.917 0.955
BridgeNet 0.970 0.989 0.924 0.963

The performance of BridgeNet was systematically compared with several established methods, including Logistic Regression, Rossmann-toolbox, S2SNet-recommend, S2SNet-complete, INSIGHT-ESM2, ProteinBERT, and SABLE, as summarized in Table 2 and Fig. 5A. Across most evaluation metrics, namely accuracy (ACC), area under the curve (AUC), and recall, BridgeNet demonstrated superior performance. Notably, BridgeNet achieved the highest AUC (0.989) and recall (0.963), and its accuracy (0.970) was nearly equivalent to that of SABLE (0.973), substantially surpassing both traditional algorithms, such as Logistic Regression (0.733 ACC, 0.810 AUC), and more advanced approaches, such as S2SNet-complete (0.819 ACC, 0.844 AUC).

Figure 5. Performance comparison of coenzyme classification methods. (A) Stacked metrics of various methods. (B) Time and memory efficiency of INSIGHT-ESM2 versus BridgeNet.

It is worth noting that while SABLE yielded slightly higher accuracy (0.973), BridgeNet outperformed it in terms of AUC and recall, and also achieved a high precision (0.924). Additionally, although INSIGHT-ESM2 attained the highest precision (0.944), this was accompanied by significantly greater computational requirements, including increased prediction time and memory usage (see Fig. 5B). In contrast, BridgeNet offers a more resource-efficient solution while maintaining leading overall performance, making it particularly advantageous for large-scale applications or resource-constrained environments. Taken together, these results highlight the practical effectiveness and superior comprehensive performance of BridgeNet compared to existing methods.

Peptide toxicity prediction

Based on the methodology established by Ebrahimikondori et al., we extended their approach to address the peptide toxicity prediction task [54]. The comparative results of this study are detailed in Table 3, which comprehensively evaluates the performance of various methods across key metrics.

Table 3.

Performance comparison of methods for peptide toxicity prediction

Method  F1-score (%)  MCC (%)  auROC (%)  auPRC (%)
BLAST [72] 80.0 80.1 - -
BLAST-score [72] 78.9 77.5 86.8 81.8
InterProScan [73] 34.7 40.2 - -
HmmSearch [74] 18.5 30.7 - -
ClanTox [75] 62.0 60.4 90.3 61.2
ToxinPred-RF [76] 66.7 63.8 94.8 71.6
ToxinPred-SVM [76] 67.7 64.8 93.8 71.2
Toxify (original) 71.5 69.0 93.0 74.3
Toxify (re-trained) 48.6 45.0 87.2 52.4
ToxDL [77] 80.9 79.3 98.9 91.3
ToxIBTL [78] 83.0 81.6 95.3 84.7
tAMPer (sequence-only) [54] 85.4 84.2 98.8 90.8
tAMPer (structure-only) [54] 63.6 60.8 94.0 67.9
tAMPer [54] 86.0 85.0 99.2 91.6
BridgeNet 92.9 90.0 98.2 96.6

The results presented in Fig. 6 underscore the superior predictive performance of BridgeNet, which consistently outperforms most existing methods across key evaluation metrics, including F1-score, Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (auROC), and area under the precision-recall curve (auPRC). Notably, BridgeNet achieves the highest F1-score of 92.9% and MCC of 90.0%, representing a significant improvement over alternative machine learning-based methods such as Toxify, ToxDL, and ToxIBTL. Furthermore, BridgeNet demonstrates highly competitive performance in auPRC (96.6%), underscoring its robustness in capturing subtle patterns of peptide toxicity and effectively balancing precision and recall.

Figure 6. Performance comparison between BridgeNet and other methods across four key metrics for peptide toxicity prediction.

Although BridgeNet achieves slightly lower auROC (98.2%) compared to the second-best performing method, tAMPer, which attains an auROC of 99.2%, it surpasses tAMPer in F1-score (92.9% versus 86.0%), MCC (90.0% versus 85.0%), and auPRC (96.6% versus 91.6%). This indicates that while tAMPer marginally excels in distinguishing toxic from non-toxic peptides based on auROC, BridgeNet demonstrates superior overall classification accuracy and precision-recall trade-offs, which are particularly critical in high-stakes predictive tasks such as toxicity prediction.

In contrast, traditional sequence-based approaches, such as BLAST and InterProScan, exhibit substantially inferior performance, with significantly lower F1-scores and MCC values. These results highlight the inherent limitations of non-machine learning (non-ML) methods in addressing the complexity of peptide toxicity prediction and emphasize the advantages of advanced ML-based algorithms.

Overall, the results establish BridgeNet as a new benchmark in peptide toxicity prediction, setting a higher performance standard among both ML- and non-ML-based methods. Its ability to achieve superior outcomes across multiple metrics demonstrates its potential as a powerful tool for addressing challenges in peptide analysis and related computational biology tasks. Future research may focus on further optimizing the model to improve its generalizability and applicability across diverse datasets and peptide families, as well as exploring strategies to enhance auROC performance without compromising other metrics.

Ablation study

To validate the effectiveness of the sequence-structure alignment mechanism, a series of ablation studies were conducted. Specifically, we compared the performance of the original BridgeNet framework with two simplified variants: one utilizing only structural encoding and the other using only sequence encoding. The ablation study demonstrates that the BridgeNet (Full Model) consistently outperforms its simplified variants, BridgeNet (Sequence) and BridgeNet (Structure), across all three subtasks: Protein Function Prediction, Coenzyme Classification, and Peptide Toxicity Prediction (Fig. S2). These results highlight the effectiveness of the sequence-structure alignment mechanism, which enables the Full Model to integrate complementary information from both sequence and structure data, yielding significant performance gains across diverse predictive tasks. More details for the impact of each module of BridgeNet on the performance of the aforementioned downstream tasks can be found in the ‘Ablation Study’ section of the Supplementary Materials.

Discussion

BridgeNet represents a significant advancement in protein representation learning by integrating sequence and structural information into a unified framework. This approach addresses a critical gap in existing methodologies, which often rely on either sequence- or structure-based data in isolation, thereby limiting their capacity to capture the intricate relationships between intrinsic protein properties and their functional manifestations. By bridging these two dimensions, BridgeNet provides a comprehensive and biologically meaningful representation of proteins, enabling robust performance across diverse bioinformatics tasks.

Our results demonstrate that BridgeNet achieves state-of-the-art or near state-of-the-art performance in multiple tasks, including protein function prediction, coenzyme classification, and peptide toxicity prediction. Notably, in the coenzyme classification task, while BridgeNet’s performance was slightly behind the highest-performing method, it offered significant computational advantages, such as reduced memory and time requirements. This balance between predictive accuracy and computational efficiency highlights BridgeNet’s suitability for large-scale applications and resource-constrained environments, making it a practical tool for a wide range of scientific and industrial use cases.

A key innovation of BridgeNet lies in its ability to incorporate environmental factors into its representations through the introduction of a latent environment matrix. This design enables the model to account for the dynamic nature of protein structures influenced by conditions such as pH, temperature, and post-translational modifications. This adaptability makes BridgeNet particularly valuable for studying complex systems, including intrinsically disordered proteins and conformationally flexible biomolecules, which are often challenging for traditional methods to characterize.

Looking ahead, there are several avenues for future development. First, extending BridgeNet to incorporate additional data modalities, such as protein–protein interactions, evolutionary information, or small-molecule binding data, could further enhance its predictive power and applicability. Second, improving the interpretability of the model’s learned representations could provide deeper insights into the underlying biological mechanisms, facilitating experimental validation and hypothesis generation. Techniques such as attention visualization or feature attribution analyses may help elucidate how sequence and structural features contribute to specific predictions.

Furthermore, the scalability of BridgeNet could be enhanced by leveraging recent advancements in protein structure prediction, such as AlphaFold, to expand its applicability to proteins with limited structural data. Fine-tuning the model on domain-specific datasets, such as those focused on antibody design or enzyme engineering, may also unlock new opportunities for specialized applications.

In summary, BridgeNet is a versatile and efficient tool that advances the integration of sequence and structural data for protein characterization. Its robust architecture and broad applicability position it as a valuable resource for the bioinformatics and structural biology communities. By addressing current limitations and exploring new directions, BridgeNet has the potential to drive further innovation in the computational study of proteins.

Conclusion

In this study, we introduced BridgeNet, a novel pre-trained model that integrates protein sequence and structure information to generate unified and biologically meaningful protein representations. By leveraging the intrinsic relationship between sequence and structure and minimizing their representational discrepancy, BridgeNet effectively captures both the inherent properties of proteins and the environmental influences on their structure. Comprehensive evaluations across diverse protein property prediction tasks, including peptide toxicity and enzyme function classification, demonstrate that BridgeNet outperforms existing state-of-the-art methods. These results highlight the effectiveness of integrating sequence and structure information for protein characterization. BridgeNet not only advances the field of protein representation learning but also provides a scalable and robust framework with broad applications in computational biology, structural bioinformatics, and beyond.

Key Points

  • State-of-the-Art Performance: Extensive experiments across diverse downstream tasks, including enzyme classification, Gene Ontology annotation, coenzyme specificity prediction, and peptide toxicity prediction, demonstrate that BridgeNet consistently outperforms existing state-of-the-art models, achieving superior accuracy and generalization capabilities.

  • Scalable and Robust Design: The modular architecture of BridgeNet—comprising sequence encoding, structural encoding, and a bridge module—facilitates efficient training and inference, eliminating the need for explicit structural inputs during downstream predictions while retaining structure-informed representations.

  • Broad Applicability: BridgeNet provides a scalable solution for protein characterization, with potential applications in computational biology, structural bioinformatics, drug discovery, and enzyme engineering, paving the way for advancements in both basic and applied sciences.

Supplementary Material

Supplementary_materials_bbaf607

Acknowledgements

This research is supported by the Science and Technology Development Fund of Macau (0004/2025/RIA1) and Macao Polytechnic University (RP/CAI-01/2023) with Submission Code fca.380c.3a45.b.

Contributor Information

Yilin Ye, Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 999078, China.

Hongliang Duan, Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 999078, China.

Yuguang Mu, School of Biological Sciences, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore.

Lei Wu, College of Mechanical and Electronic Engineering, China University of Petroleum (East China), 66 Changjiang West Road, Huangdao District, Qingdao, 266580, China.

Jingjing Guo, Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 999078, China.

Author contribution

J.G. conceived and designed the study. Y.Y. collected the data and developed the model. Y.Y., Y.M., H.D., L.W., and J.G. wrote and revised the manuscript.

Data availability

The implementation of the proposed method will be made publicly available on GitHub at the following repository: https://github.com/iRinYe/BridgeNet.

References

  • 1. Wang  X, Gan  M, Wang  Y. et al.  Comprehensive review on lipid metabolism and RNA methylation: Biological mechanisms, perspectives and challenges. Int J Biol Macromol  2024;270:132057. 10.1016/j.ijbiomac.2024.132057. [DOI] [PubMed] [Google Scholar]
  • 2. Sies  H, Mailloux  RJ, Jakob  U. Fundamentals of redox regulation in biology. Nat Rev Mol Cell Biol  2024;25:701–19. 10.1038/s41580-024-00730-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Cai  R, Shan  Y, Du  F. et al.  Injectable hydrogels as promising in situ therapeutic platform for cartilage tissue engineering. Int J Biol Macromol  2024;261:129537. 10.1016/j.ijbiomac.2024.129537. [DOI] [PubMed] [Google Scholar]
  • 4. Zacco  E, Broglia  L, Kurihara  M. et al.  RNA: The unsuspected conductor in the Orchestra of Macromolecular Crowding. Chem Rev  2024;124:4734–77. 10.1021/acs.chemrev.3c00575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Chang  W, Chen  L, Chen  K. The bioengineering application of hyaluronic acid in tissue regeneration and repair. Int J Biol Macromol  2024;270:132454. 10.1016/j.ijbiomac.2024.132454. [DOI] [PubMed] [Google Scholar]
  • 6. Akinsemolu  AA, Onyeaka  H, Odion  S. et al.  Exploring bacillus subtilis: Ecology, biotechnological applications, and future prospects. J Basic Microbiol  2024;64:2300614. 10.1002/jobm.202300614. [DOI] [PubMed] [Google Scholar]
  • 7. Ali  SS, Alsharbaty  MHM, Al-Tohamy  R. et al.  A review of the fungal polysaccharides as natural biopolymers: Current applications and future perspective. Int J Biol Macromol  2024;273:132986. 10.1016/j.ijbiomac.2024.132986. [DOI] [PubMed] [Google Scholar]
  • 8. Huo  D, Wang  X. A new era in healthcare: The integration of artificial intelligence and microbial. Med Nov Technol Devices  2024;23:100319. 10.1016/j.medntd.2024.100319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Lu  X, Qian  S, Wu  X. et al.  Research Progress of protein complex systems and their application in food: A review. Int J Biol Macromol  2024;265:130987. 10.1016/j.ijbiomac.2024.130987. [DOI] [PubMed] [Google Scholar]
  • 10. Townsend  DR, Towers  DM, Lavinder  JJ. et al.  Innovations and trends in antibody repertoire analysis. Curr Opin Biotechnol  2024;86:103082. 10.1016/j.copbio.2024.103082. [DOI] [PubMed] [Google Scholar]
  • 11. Eren  AM, Banfield  JF. Modern microbiology: Embracing complexity through integration across scales. Cell  2024;187:5151–70. 10.1016/j.cell.2024.08.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Abdelkader  GA, Kim  J-D. Advances in protein-ligand binding affinity prediction via deep learning: A comprehensive study of datasets, data preprocessing techniques, and model architectures. Curr Drug Targets  2024;25:1041–65. 10.2174/0113894501330963240905083020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Zhao  N, Wu  T, Wang  W. et al.  Review and comparative analysis of methods and advancements in predicting protein complex structure. Interdiscip Sci Comput Life Sci  2024;16:261–88. 10.1007/s12539-024-00626-x. [DOI] [PubMed] [Google Scholar]
  • 14. Kousar  F, Khan  MI, Khokhar  AM. et al.  Identification of the structure of a protein by x-ray crystallography and NMR spectroscopy. Res Med Sci Rev  2024;2:1621–33. [Google Scholar]
  • 15. Desani  GV. Structural biology techniques in drug discovery: From crystal structures to therapeutic applications. Multidisc J Mol Biol Biochem  2024;1:16–23. [Google Scholar]
  • 16. Zhang  X, Li  S, Zhang  K. Cryo-EM: A window into the dynamic world of RNA molecules. Curr Opin Struct Biol  2024;88:102916. 10.1016/j.sbi.2024.102916. [DOI] [PubMed] [Google Scholar]
  • 17. Chen  L, Ruan  X, Li  X. et al.  Molecular interactions in biological systems: Technological applications and innovations. Comput Mol Biol  2024;14:21. 10.5376/cmb.2024.14.0021. [DOI] [Google Scholar]
  • 18. Orand  T, Jensen  MR. Binding mechanisms of intrinsically disordered proteins: Insights from experimental studies and structural predictions. Curr Opin Struct Biol  2025;90:102958. 10.1016/j.sbi.2024.102958. [DOI] [PubMed] [Google Scholar]
  • 19. Giraldo-Castaño  MC, Littlejohn  KA, Avecilla  ARC. et al.  Programmability and biomedical utility of intrinsically-disordered protein polymers. Adv Drug Deliv Rev  2024;212:115418. 10.1016/j.addr.2024.115418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Gupta  MN, Uversky  VN. Protein structure–function continuum model: Emerging nexuses between specificity, evolution, and structure. Protein Sci  2024;33:e4968. 10.1002/pro.4968. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Garg  A, González-Foutel  NS, Gielnik  MB. et al.  Design of Functional Intrinsically Disordered Proteins. Protein Eng Des Sel  2024;37:gzae004. 10.1093/protein/gzae004. [DOI] [PubMed] [Google Scholar]
  • 22. Vemulapalli  S. Targeting intrinsically disordered proteins (IDPs) in drug discovery. In: Rudrapal  M, (ed.), Computational Methods for Rational Drug Design. John Wiley & Sons, Ltd, 2025, 493–517. [Google Scholar]
  • 23. Maiti  S, Singh  A, Maji  T. et al.  Experimental methods to study the structure and dynamics of intrinsically disordered regions in proteins. Curr Res Struct Biol  2024;7:100138. 10.1016/j.crstbi.2024.100138. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Zhang  Y, Li  S, Meng  K. et al.  Machine learning for sequence and structure-based protein–ligand interaction prediction. J Chem Inf Model  2024;64:1456–72. 10.1021/acs.jcim.3c01841. [DOI] [PubMed] [Google Scholar]
  • 25. Srivastava  G, Liu  M, Ni  X. et al.  Machine learning techniques to infer protein structure and function from sequences: A comprehensive review. In: Kloczkowski  A, Kurgan  L, Faraggi  E, (eds.), Prediction of Protein Secondary Structure. New York, NY: Springer US, 2025, 79–104. [DOI] [PubMed] [Google Scholar]
  • 26. Asif  S, Wenhui  Y, ur-Rehman  S. et al.  Advancements and prospects of machine learning in medical diagnostics: Unveiling the future of diagnostic precision. Arch Computat Methods Eng  2024;32:853–83. 10.1007/s11831-024-10148-w. [DOI] [Google Scholar]
  • 27. Rane  NL, Paramesha  M, Choudhary  SP. et al.  Machine learning and deep learning for big data analytics: A review of methods and applications. Partn Univ Int Innov J  2024;2:172–97. 10.5281/zenodo.12271006. [DOI] [Google Scholar]
  • 28. Xu  L. Deep learning for protein-protein contact prediction using evolutionary scale modeling (ESM) feature. In: Jin  H, Pan  Y, Lu  J, (eds.), Artificial Intelligence and Machine Learning. Singapore: Springer Nature, 2024, 98–111. [Google Scholar]
  • 29. Rao  R, Bhattacharya  N, Thomas  N. et al.  Evaluating protein transfer learning with TAPE. Adv Neural Inf Process Syst  2019;32:9689–701. [PMC free article] [PubMed] [Google Scholar]
  • 30. Zhang  Y, Zhu  G, Li  K. et al.  HLAB: Learning the BiLSTM features from the ProtBert-encoded proteins for the class I HLA-peptide binding prediction. Brief Bioinform  2022;23:bbac173. 10.1093/bib/bbac173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Xiao Y, Sun E, Jin Y. et al. ProteinGPT: Multimodal LLM for protein property prediction and structure understanding. arXiv 2024. 10.48550/arXiv.2408.11363. [DOI]
  • 32. Gladstone Sigamani  G, Vincent  PMDR. Multimodal neural network for enhanced protein stability prediction by integration of contact scores and spatial maps. Results Eng  2024;24:103440. 10.1016/j.rineng.2024.103440. [DOI] [Google Scholar]
  • 33. Listov  D, Goverde  CA, Correia  BE. et al.  Opportunities and challenges in design and optimization of protein function. Nat Rev Mol Cell Biol  2024;25:639–53. 10.1038/s41580-024-00718-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Rahimzadeh  F, Mohammad Khanli  L, Salehpoor  P. et al.  Unveiling the evolution of policies for enhancing protein structure predictions: A comprehensive analysis. Comput Biol Med  2024;179:108815. 10.1016/j.compbiomed.2024.108815. [DOI] [PubMed] [Google Scholar]
  • 35. OpenProteinSet: Training data for structural biology at scale. In: Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/0eb82171240776fe19da498bef3b1abe-Abstract-Datasets_and_Benchmarks.html (accessed 2025-01-19).
  • 36. Zhou  L, Tao  C, Shen  X. et al.  Unlocking the potential of enzyme engineering via rational computational design strategies. Biotechnol Adv  2024;73:108376. 10.1016/j.biotechadv.2024.108376. [DOI] [PubMed] [Google Scholar]
  • 37. Li  J, Chen  X, Huang  H. et al.  Sable: Bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Brief Bioinform  2025;26:bbaf120. 10.1093/bib/bbaf120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Zhang  C, Wang  Q, Li  Y. et al.  The historical evolution and significance of multiple sequence alignment in molecular structure and function prediction. Biomolecules  2024;14:1531. 10.3390/biom14121531. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Kumar  N, Srivastava  R. Deep learning in structural bioinformatics: Current applications and future perspectives. Brief Bioinform  2024;25:bbae042. 10.1093/bib/bbae042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Israr  J, Alam  S, Siddiqui  S. et al.  Advances in structural bioinformatics. In: Singh  V, Kumar  A, (eds.), Advances in Bioinformatics. Singapore: Springer Nature, 2024, 35–70. [Google Scholar]
  • 41. Suzek  BE, Huang  H, McGarvey  P. et al.  UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics  2007;23:1282–8. 10.1093/bioinformatics/btm098. [DOI] [PubMed] [Google Scholar]
  • 42. Strodthoff  N, Wagner  P, Wenzel  M. et al.  UDSMProt: Universal deep sequence models for protein classification. Bioinformatics  2020;36:2401–9. 10.1093/bioinformatics/btaa003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Dohan D, Gane A, Bileschi ML. et al. Improving protein function annotation via unsupervised pre-training: Robustness, efficiency, and insights. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD '21). New York: Association for Computing Machinery, 2021, 2782–91. [Google Scholar]
  • 44. The UniProt Consortium . UniProt: The universal protein knowledgebase in 2025. Nucleic Acids Res  2025;53:D609–17. 10.1093/nar/gkae1010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Lau  AM, Bordin  N, Kandathil  SM. et al.  Exploring structural diversity across the protein universe with the encyclopedia of domains. Science  2024;386:eadq4946. 10.1126/science.adq4946. [DOI] [PubMed] [Google Scholar]
  • 46. Barone  F, Russo  ET, Villegas Garcia  EN. et al.  Protein family annotation for the unified human gastrointestinal proteome by DPCfam clustering. Sci Data  2024;11:568. 10.1038/s41597-024-03131-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Chen  Y, Xu  Y, Liu  D. et al.  An end-to-end framework for the prediction of protein structure and fitness from single sequence. Nat Commun  2024;15:7400. 10.1038/s41467-024-51776-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Gao  X, Cao  C, He  C. et al.  Pre-training with a rational approach for antibody sequence representation. Front Immunol  2024;15:15. 10.3389/fimmu.2024.1468599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Xu  Y, Liu  D, Gong  H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat Comput Sci  2024;4:840–50. 10.1038/s43588-024-00716-2. [DOI] [PubMed] [Google Scholar]
  • 50. Quan R, Wang W, Ma F. et al. Clustering for protein representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • 51. Gligorijević  V, Renfrew  PD, Kosciolek  T. et al.  Structure-based protein function prediction using graph convolutional networks. Nat Commun  2021;12:168. 10.1038/s41467-021-23303-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Ghosh  S, Baltussen  MG, Ivanov  NM. et al.  Exploring emergent properties in enzymatic reaction networks: Design and control of dynamic functional systems. Chem Rev  2024;124:2553–82. 10.1021/acs.chemrev.3c00681. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Ye  Y, Jiang  H, Xu  R. et al.  The INSIGHT platform: Enhancing NAD(P)-dependent specificity prediction for Co-factor specificity engineering. Int J Biol Macromol  2024;278:135064. 10.1016/j.ijbiomac.2024.135064. [DOI] [PubMed] [Google Scholar]
  • 54. Ebrahimikondori  H, Sutherland  D, Yanai  A. et al.  Structure-aware deep learning model for peptide toxicity prediction. Protein Sci  2024;33:e5076. 10.1002/pro.5076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Rao  R, Bhattacharya  N, Thomas  N. et al.  Evaluating protein transfer learning with TAPE. Adv Neural Inf Process Syst  2019;32:9689–701. [PMC free article] [PubMed] [Google Scholar]
  • 56. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016. https://arxiv.org/abs/1609.02907 (accessed 2025-02-27).
  • 57. Veličković P, Cucurull G, Casanova A. et al. Graph attention networks. arXiv:1710.10903, 2017. https://arxiv.org/abs/1710.10903 (accessed 2025-02-27).
  • 58. Jing B, Eismann S, Suriana P. et al. Learning from protein structure with geometric vector perceptrons. arXiv:2009.01411, 2020. https://arxiv.org/abs/2009.01411 (accessed 2025-02-27).
  • 59. Derevyanko  G, Grudinin  S, Bengio  Y. et al.  Deep convolutional networks for quality assessment of protein folds. Bioinformatics.  2018;34:4046–53. 10.1093/bioinformatics/bty494. [DOI] [PubMed] [Google Scholar]
  • 60. Baldassarre  F, Hurtado  DM, Elofsson  A. et al.  GraphQA: Protein model quality assessment using graph convolutional networks. Bioinformatics.  2021;37:360–6. 10.1093/bioinformatics/btaa714. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Hermosilla P, Ropinski T. Contrastive representation learning for 3D protein structures. arXiv:2205.15675, 2022. https://arxiv.org/abs/2205.15675 (accessed 2025-02-27).
  • 62. Zhang Z, Xu M, Jamasb A. et al. Protein representation learning by geometric structure pretraining. arXiv:2203.06125, 2022. https://arxiv.org/abs/2203.06125 (accessed 2025-02-27).
  • 63. Fan H, Wang Z, Yang Y. et al. Continuous-discrete convolution for geometry-sequence modeling in proteins. In: International Conference on Learning Representations (ICLR), 2023.
  • 64. Song  Z, Zang  Z, Wang  Y. et al.  Set-CLIP: Exploring aligned semantic from low-alignment multimodal data through a distribution view arXiv. 2024. 10.48550/arXiv.2406.05766. [DOI]
  • 65. Hu  B, Tan  C, Xu  Y. et al.  ProtGO: Function-guided protein modeling for unified representation learning. 2024.
  • 66. Ye  Q, Li  J, Chen  X. et al.  Sable: Bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Res Square  2024. 10.21203/rs.3.rs-4647798/v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Hu  B, Tan  C, Wu  L. et al.  Advances of deep learning in protein science: A comprehensive survey arXiv. 2024. 10.48550/arXiv.2403.05314. [DOI]
  • 68. Logistic regression-guided identification of cofactor specificity-contributing residues in enzyme with sequence datasets partitioned by catalytic properties. ACS Synth Biol 2022;11:3973–85. 10.1021/acssynbio.2c00315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Rossmann-toolbox: A deep learning-based protocol for the prediction and design of cofactor specificity in Rossmann fold proteins. Brief Bioinform 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Liu  Y, Munteanu  CR, Kong  Z. et al.  Identification of coenzyme-binding proteins with machine learning algorithms. Comput Biol Chem  2019;79:185–92. 10.1016/j.compbiolchem.2019.01.014. [DOI] [PubMed] [Google Scholar]
  • 71. Brandes  N, Ofer  D, Peleg  Y. et al.  ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics  2022;38:2102–10. 10.1093/bioinformatics/btac020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Altschul  SF, Madden  TL, Schäffer  AA. et al.  Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res  1997;25:3389–402. 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Quevillon  E, Silventoinen  V, Pillai  S. et al.  InterProScan: Protein domains identifier. Nucleic Acids Res  2005;33:W116–20. 10.1093/nar/gki442. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Potter  SC, Luciani  A, Eddy  SR. et al.  HMMER web server: 2018 update. Nucleic Acids Res  2018;46:W200–4. 10.1093/nar/gky448. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. ClanTox: A classifier of short animal toxins. Nucleic Acids Res 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76. Gupta  S, Kapoor  P, Chaudhary  K. et al.  In Silico approach for predicting toxicity of peptides and proteins. PloS One  2013;8:e73957. 10.1371/journal.pone.0073957. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Pan  X, Zuallaert  J, Wang  X. et al.  ToxDL: Deep learning using primary structure and domain embeddings for assessing protein toxicity. Bioinformatics.  2020;36:5159–68. 10.1093/bioinformatics/btaa656. [DOI] [PubMed] [Google Scholar]
  • 78. Wei  L, Ye  X, Sakurai  T. et al.  ToxIBTL: Prediction of peptide toxicity based on information bottleneck and transfer learning. Bioinformatics.  2022;38:1514–24. 10.1093/bioinformatics/btac006. [DOI] [PubMed] [Google Scholar]
