Abstract
Motivation
Estimation of protein complex structure accuracy is essential for effective structural model selection in structural biology applications such as protein function analysis and drug design. Despite the success of structure prediction methods such as AlphaFold2 and AlphaFold3, selecting top-quality structural models from large model pools remains challenging.
Results
We present GATE, a novel method that uses graph transformers on pairwise model similarity graphs to predict the quality (accuracy) of complex structural models. By integrating single-model and multimodel quality features, GATE captures intrinsic model characteristics and intermodel geometric similarities to make robust predictions. On the dataset of the 15th Critical Assessment of Protein Structure Prediction (CASP15), GATE achieved the highest Pearson’s correlation (0.748) and the lowest ranking loss (0.1191) compared with existing methods. In the blind CASP16 experiment, GATE ranked fifth based on the sum of z-scores, with a Pearson’s correlation of 0.7076 (first), a Spearman’s correlation of 0.4514 (fourth), a ranking loss of 0.1221 (third), and an area under the curve score of 0.6680 (third) on per-target TM-score-based metrics. Additionally, GATE performed consistently on large in-house datasets generated by extensive AlphaFold-based sampling with MULTICOM4, confirming its robustness and practical applicability in real-world model selection scenarios.
Availability and implementation
GATE is available at https://github.com/BioinfoMachineLearning/GATE.
1 Introduction
Proteins are fundamental to biological processes, and their three-dimensional (3D) structures determine their functions and interactions with other molecules. Accurate knowledge of protein structures is crucial for biomedical research and technology development. However, experimental structure determination methods like X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy, while effective, are time-consuming and expensive. Consequently, computational protein structure prediction is essential for obtaining protein structures at a large scale.
Recent advances in deep learning have revolutionized protein structure prediction (Jumper et al. 2021, Evans et al. 2021, Liu et al. 2023b,c, Abramson et al. 2024). Methods such as AlphaFold2 (Jumper et al. 2021, Evans et al. 2021) and AlphaFold3 (Abramson et al. 2024) can predict high-accuracy protein structures for most single-chain proteins (monomers) and a significant portion of multichain proteins (multimers/complexes). However, estimating the quality of the predicted structural models without ground-truth structures and identifying the best candidates from a pool of models (decoys) remains a significant challenge.
AlphaFold itself assigns an estimated quality score, such as predicted local distance difference test (pLDDT) score or predicted template modelling (pTM) score, to each structural model that it generates. However, it still cannot always assign higher estimated quality scores to better structural models (Chen et al. 2023a, Edmunds et al. 2024). Therefore, there is a significant need to develop independent protein model accuracy estimation (or quality assessment) methods to predict the quality of predicted protein structural models. These methods are not only important for ranking and selecting predicted protein structures but also crucial for determining how to use them appropriately in specific applications such as protein function analysis and protein structure-based drug design.
Traditionally, the methods for estimating protein model accuracy (EMA) can be classified in different ways. Based on their output, they can be classified into global quality assessment methods, which assign a single, global score quantifying the overall accuracy of a model, and local quality assessment methods, which assign a quality score to each residue (or to specific residues such as interface residues in protein complexes). In this study, we focus on global quality assessment.
Based on their input, EMA methods can be divided into single-model and multimodel methods. Single-model methods evaluate each model based on its own intrinsic properties, using statistical energy functions or machine learning, without comparing it with other models. An example of a statistical energy-based method is Rosetta (Alford et al. 2017), which relies on physical principles and statistical potentials. These methods are usually fast but cannot reliably correlate model energy with model quality due to the high complexity of the protein energy landscape.
Machine learning–based single-model methods improve the accuracy of predicting model quality by integrating structural and sequence features of a model such as secondary structure and residue-residue contacts. Some early methods of using handcrafted structural features to estimate the accuracy of protein tertiary (monomer) structures include ProQ (Wallner and Elofsson 2003), ModelEvaluator (Wang et al. 2009), ModFOLD (McGuffin 2008), and QAcon (Cao et al. 2017).
More recently, deep learning–based methods, such as QDeep (Shuvo et al. 2020) and DISTEMA (Chen and Cheng 2022), were developed to learn features directly from raw structural data, such as residue-residue distance maps or 3D atom grids of protein tertiary structures, to estimate their quality. Particularly, some graph neural network (GNN)-based approaches, such as EnQA (Chen et al. 2023a) and GCPNet-EMA (Morehead et al. 2024), represent protein structures as graphs and integrate local and global structural information, outperforming traditional single-model approaches. Different from previous methods that predict the quality of either tertiary structures or quaternary structures, both EnQA and GCPNet-EMA can be applied to predict the quality of both tertiary and quaternary structures, even though they were trained to predict the quality of tertiary structures only. The reason is that the graph representation they employ can represent tertiary and quaternary structures well without any change.
Different from single-model EMA methods, multimodel EMA methods compare each model in a model pool with the other models and rely on its similarity to them to estimate model quality. Therefore, they are often called consensus methods. One simple such approach is to use the average pairwise similarity between a model and all other models as its quality score, which is used by DeepRank3 for estimating the quality of protein tertiary (monomer) structural models and by MULTICOM_qa (Roy et al. 2023) for estimating the quality of protein quaternary (complex/multimer) structural models. Despite their simplicity, on average, multimodel EMA methods still performed better than single-model EMA methods because in most situations many good models can be generated, leading to good consensus assessment (Cao et al. 2015, Cheng et al. 2019, Roy et al. 2023). However, multimodel EMA methods often fail for some hard targets when there are only a few good models in a large model pool or when there are many bad (or suboptimal) and similar models in the pool. In particular, a large portion of bad, similar models always leads to bad consensus evaluation. In contrast, the performance of single-model methods is much more robust to the composition of the model pool because they provide an independent assessment of the quality of each individual model without considering the similarity between models. Therefore, single-model EMA methods are not susceptible to the influence of many bad and similar models and still have a chance to pick rare, good models.
To combine the strengths of single-model and multimodel EMA methods and address their limitations, we introduce GATE, a novel approach for predicting global protein structure quality using graph transformers on pairwise model similarity graphs. In this study, we focus GATE on predicting the global quality of protein quaternary structures instead of tertiary structures because there are much fewer EMA methods for quaternary structures and predicting quaternary structures is a much more challenging problem than the largely solved tertiary structure prediction problem. By capturing the relationships between models using model similarity graphs as well as the quality features of individual models represented as nodes in the graphs, GATE is able to make robust quality predictions.
2 Materials and methods
The overall pipeline of GATE is illustrated in Fig. 1. The approach integrates pairwise similarity graph construction, structural similarity-based subgraph sampling, node-level quality prediction using a graph transformer, and the aggregation of quality scores across subgraphs to predict the quality of each structural model (called decoy) in a model pool.
Figure 1.
The workflow of GATE for predicting protein complex structure quality. The input consists of a set of protein complex structures (decoys) predicted from a protein sequence. (A) A pairwise similarity graph is constructed, where nodes represent individual decoys, and edges connect two structurally similar decoys. (B) Subgraphs are sampled based on structural similarity to make sure each group of similar models is equally represented in the subgraphs, preventing large groups from dominating small groups. (C) The sampled subgraphs are processed by a graph transformer to predict the quality score for each decoy. (D) Predicted quality scores from all sampled subgraphs are aggregated to produce the final quality score for each decoy.
2.1 Pairwise similarity graph construction
A pairwise similarity graph (Fig. 1A) is constructed to capture the structural relationships between decoys. Each node represents an individual decoy, and an edge connects two decoys if they are similar according to a structural similarity metric such as TM-score, DockQ score (Basu and Wallner 2016), QS-score, or CAD-score (Olechnovič et al. 2013). To ensure meaningful structural relationships, in this study, an edge is established between two nodes if their TM-score exceeds 0.5. Even though only TM-score, a global similarity metric, is used to determine whether there is an edge between two nodes, other complementary metrics, such as QS-score, which measures interchain interface accuracy, can also be used as edge features.
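As an illustration of this step, the following minimal Python sketch builds such a graph from a precomputed pairwise TM-score matrix. The matrix is assumed to come from a structure alignment tool such as USalign; the function and variable names are illustrative, not the exact GATE implementation.

```python
# Sketch: build a pairwise similarity graph from a TM-score matrix.
import networkx as nx
import numpy as np

def build_similarity_graph(tm: np.ndarray, threshold: float = 0.5) -> nx.Graph:
    """Nodes are decoys; an edge connects two decoys whose TM-score > threshold."""
    n = tm.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if tm[i, j] > threshold:
                # Complementary metrics (e.g. QS-score) could be attached
                # here as additional edge attributes.
                g.add_edge(i, j, tm_score=float(tm[i, j]))
    return g
```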
2.2 Structural similarity-based subgraph sampling
To prevent overpopulated, structurally similar low-quality models from dominating model quality assessment, which often happens with consensus EMA methods, subgraphs are sampled from the full pairwise similarity graph so that decoys of different structural folds are more equally represented. The sampling strategy is based on the structural similarity of the decoys in the model pool, quantified by the average pairwise similarity score (PSS; i.e. TM-score) (Fig. 1B).
If the average PSS is <0.8, K-means clustering is applied to group the decoys into clusters, with each decoy represented by the vector of its similarity scores to all models in the pool. The number of clusters is determined by the silhouette score. A subgraph is generated by randomly sampling an equal number of decoys from each cluster. Many subgraphs are sampled so that each decoy may be included in multiple subgraphs. If the average PSS is ≥0.8, which usually indicates easy cases, subgraphs are randomly sampled from the entire graph without decoy clustering.
Each subgraph contains up to 50 nodes to balance computational efficiency and the need for structural variation. The construction of the pairwise similarity graph and the subsequent subgraph sampling procedure are summarized in Algorithm 1, available as supplementary data at Bioinformatics Advances online.
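A hedged sketch of this sampling procedure is shown below, assuming the pairwise TM-score matrix `sim` as input; the candidate range for the number of clusters and the random-sampling details are illustrative assumptions rather than the exact settings of Algorithm 1.

```python
# Sketch: structural-similarity-based subgraph sampling.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sample_subgraph(sim: np.ndarray, max_nodes: int = 50, rng=None) -> list[int]:
    rng = rng or np.random.default_rng()
    n = sim.shape[0]
    if sim.mean() >= 0.8:  # easy case: sample uniformly from all decoys
        return list(rng.choice(n, size=min(max_nodes, n), replace=False))
    # Hard case: cluster decoys (each represented by its row of pairwise
    # similarity scores) and draw an equal number of decoys per cluster.
    best_k, best_score = 2, -1.0
    for k in range(2, min(10, n)):  # candidate k range is an assumption
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sim)
        s = silhouette_score(sim, labels)
        if s > best_score:
            best_k, best_score = k, s
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(sim)
    per_cluster = max_nodes // best_k
    chosen = []
    for c in range(best_k):
        members = np.where(labels == c)[0]
        take = min(per_cluster, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    return chosen
```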
2.3 Graph transformer
The subgraphs are processed using a graph transformer, which predicts the quality score for each decoy (Fig. 1C). The transformer employs multihead attention mechanisms to iteratively update node and edge embeddings, enabling it to capture both structural characteristics and contextual relationships.
2.3.1 Node feature embedding
Node feature embeddings represent the structural characteristics of each decoy in a graph. The features of each node are designed to capture both local and global structural information. Key features include:
Average PSSs between the decoy represented by a node and other decoys: including TM-score and QS-score, calculated at both the full-graph and subgraph levels. These scores provide a comprehensive view of the structural similarity between each decoy and the others, resembling traditional consensus scores.
Single-model quality scores: including single-model quality scores calculated by ICPS (Roy et al. 2023), EnQA, DProQA (Chen et al. 2023b), and VoroIF-GNN (Olechnovič and Venclovas 2023) for a decoy, enriching the feature space with the quality assessment for individual models. These scores are normalized by multiplying the raw quality score by the ratio between the length of the decoy and that of the native structure. This normalization penalizes shorter decoys, ensuring that the scores reflect the completeness and accuracy of the decoy relative to the native structure.
The raw features above are passed through a multilayer perceptron (MLP) with activation functions (e.g. LeakyReLU) to transform them into high-dimensional embeddings that better capture nonlinear relationships. These embeddings serve as the initial representations for each decoy and are iteratively updated in the subsequent layers of the graph transformer.
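The following PyTorch sketch illustrates both the length-ratio normalization and the MLP embedding described above; the layer sizes and the two-layer architecture are illustrative assumptions.

```python
# Sketch: length-ratio normalization and MLP node-feature embedding.
import torch
import torch.nn as nn

def normalize_score(raw: float, decoy_len: int, native_len: int) -> float:
    # Multiplying by the length ratio penalizes shorter, incomplete decoys.
    return raw * (decoy_len / native_len)

class NodeEmbedding(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_nodes, in_dim) raw node features -> (n_nodes, hidden_dim)
        return self.mlp(x)
```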
2.3.2 Edge feature embedding
Edge features encode pairwise relationships between decoys, initialized using the number of common interaction interfaces and similarity scores between two connected decoys, including TM-score and QS-score. These metrics capture global and interfacial similarity between connected nodes, respectively.
Like the node feature embedding, the raw edge features are mapped into a higher-dimensional space through an MLP. Each edge embedding is refined by the graph transformer through iterative updates, integrating information from the neighboring nodes. This process allows the edge embeddings to dynamically leverage the contextual relationships between decoys.
2.3.3 Graph transformer layer
The graph transformer layers (Fig. 1C) iteratively update node and edge embeddings through multihead attention and feed-forward neural networks. The attention mechanism uses the node embeddings to generate queries (Q), keys (K), and values (V), while the edge embeddings are incorporated as biases to modulate pairwise interactions between nodes. The attention scores are computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key.
To update the node embeddings, the model aggregates information from neighboring nodes using attention scores. For a node i, the updated embedding is calculated by:
$$h_i^{(l+1)} = \sigma\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(l)} h_j^{(l)}\right) \tag{2}$$

where $h_i^{(l)}$ is the embedding of node $i$ at layer $l$, $\mathcal{N}(i)$ represents its neighbors, and $\alpha_{ij}$ is the attention score between nodes $i$ and $j$. The activation function $\sigma$ introduces nonlinearity, enabling the model to capture complex structural relationships.
The edge embeddings are updated similarly by integrating information from their associated nodes. The edge embedding between nodes i and j at layer l is updated as:
$$e_{ij}^{(l+1)} = \sigma\!\left(W_e^{(l)}\left[h_i^{(l)} \,\|\, h_j^{(l)} \,\|\, e_{ij}^{(l)}\right]\right) \tag{3}$$

where $e_{ij}^{(l)}$ is the edge embedding between nodes $i$ and $j$ at layer $l$ and $\|$ denotes concatenation.
Residual connections and layer normalization are applied to both node and edge updates, ensuring stable training and efficient learning across layers.
By iteratively applying these updates, the graph transformer learns to capture intricate structural dependencies and relationships within the pairwise similarity graph, enabling it to effectively predict quality scores for the decoys.
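To make the layer concrete, below is a hedged PyTorch sketch of one graph transformer layer operating on a dense subgraph, following Equations (1)–(3): edge embeddings enter the attention as additive per-head biases, node updates use residual connections and layer normalization, and edges are updated from their endpoint nodes. The head count, the bias projection, and the concatenation-based edge update are assumptions consistent with the description above, not the exact GATE code.

```python
# Sketch: one graph transformer layer over a dense subgraph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTransformerLayer(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.d_k = n_heads, dim // n_heads
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.edge_bias = nn.Linear(dim, n_heads)    # edge embedding -> per-head bias
        self.edge_update = nn.Linear(3 * dim, dim)  # [h_i || h_j || e_ij] -> e_ij
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, h, e, adj):
        # h: (n, dim) nodes; e: (n, n, dim) edges; adj: (n, n) bool adjacency.
        n = h.size(0)
        adj = adj | torch.eye(n, dtype=torch.bool, device=h.device)  # keep self-attention
        q = self.q(h).view(n, self.n_heads, self.d_k)
        k = self.k(h).view(n, self.n_heads, self.d_k)
        v = self.v(h).view(n, self.n_heads, self.d_k)
        scores = torch.einsum("ihd,jhd->hij", q, k) / self.d_k ** 0.5  # Eq. (1)
        scores = scores + self.edge_bias(e).permute(2, 0, 1)           # edge bias
        scores = scores.masked_fill(~adj.unsqueeze(0), float("-inf"))
        attn = F.softmax(scores, dim=-1)                               # alpha_ij in Eq. (2)
        out = torch.einsum("hij,jhd->ihd", attn, v).reshape(n, -1)
        h = self.norm1(h + out)           # residual connection + layer norm
        h = self.norm2(h + self.ffn(h))
        # Edge update from the embeddings of the two endpoint nodes, cf. Eq. (3).
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = e + self.edge_update(torch.cat([hi, hj, e], dim=-1))
        return h, e
```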
2.3.4 Node-level prediction
The graph transformer predicts a global quality score for each decoy (node) in an input subgraph. The target global quality score is the true TM-score between a decoy and the native structure, calculated by USalign (Zhang et al. 2022).
2.3.5 Loss function
The graph transformer is trained using a combination of a mean squared error (MSE) loss and a pairwise loss:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{4}$$

where $y_i$ is the true quality score for decoy $i$, $\hat{y}_i$ is the predicted score, and $N$ is the number of decoys.
$$\mathcal{L}_{\mathrm{pair}} = \frac{1}{P}\sum_{(i,j)}\left(d_{ij} - \hat{d}_{ij}\right)^2 \tag{5}$$

where $P$ is the number of decoy pairs, $d_{ij} = y_i - y_j$ is the true difference between the quality scores of decoys $i$ and $j$, and $\hat{d}_{ij} = \hat{y}_i - \hat{y}_j$ is the predicted difference. This loss enables the model to learn the relative quality of two models such that a better model receives a higher score than a worse one.
The total loss is a weighted sum of the two losses above:
$$\mathcal{L}_{\mathrm{total}} = \lambda_1\,\mathcal{L}_{\mathrm{MSE}} + \lambda_2\,\mathcal{L}_{\mathrm{pair}} \tag{6}$$

where $\lambda_1$ and $\lambda_2$ control the importance of each component.
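A minimal PyTorch sketch of this combined loss, assuming `pred` and `true` are 1D tensors of predicted and true quality scores for the decoys in a subgraph (the vectorized pairwise term enumerates all ordered pairs, which differs from the exact pair set only by a constant factor):

```python
# Sketch: combined MSE + pairwise training loss.
import torch

def gate_loss(pred, true, lambda_mse=1.0, lambda_pair=1.0):
    mse = torch.mean((true - pred) ** 2)               # Eq. (4)
    # Pairwise loss over decoy pairs within a subgraph, Eq. (5):
    d_true = true.unsqueeze(0) - true.unsqueeze(1)     # y_i - y_j
    d_pred = pred.unsqueeze(0) - pred.unsqueeze(1)
    pair = torch.mean((d_true - d_pred) ** 2)
    return lambda_mse * mse + lambda_pair * pair       # Eq. (6)
```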
2.4 Aggregation of predicted quality scores across subgraphs
The final predicted quality score for each decoy is aggregated across subgraphs using either mean or median aggregation, as sketched below:
Mean aggregation: averaging predictions from all subgraphs containing the decoy.
Median aggregation: using the median of predictions, which is robust to outliers.
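A minimal sketch of this aggregation step, assuming `preds` maps each decoy index to the predictions collected from all subgraphs containing it:

```python
# Sketch: aggregate per-subgraph predictions into a final score per decoy.
import numpy as np

def aggregate(preds: dict[int, list[float]], method: str = "mean") -> dict[int, float]:
    agg = np.mean if method == "mean" else np.median  # median is robust to outliers
    return {decoy: float(agg(scores)) for decoy, scores in preds.items()}
```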
2.5 Training, validation, and test procedure
The graph transformer was first trained, validated, and tested on the CASP15 dataset and then blindly tested in the CASP16 experiment. The CASP15 dataset comprises 41 complex targets, with each target having an average of 270 decoys. The CASP15 dataset was divided into training, validation, and test sets by targets using a 10-fold cross-validation strategy to enable the robust evaluation of the transformer’s performance. Specifically, it was split into 10 subsets, each containing four to five targets. For each target, a fixed number of 2000 subgraphs were sampled from the full model similarity graph of all the decoys of the target, with each subgraph containing exactly 50 nodes. Each subset contains 8000–10 000 subgraphs. In each round of training, validation, and testing, 8 subsets were used for training (parameter optimization), 1 for validation (model selection), and 1 for testing. The test results over the 10 subsets were pooled together and evaluated as a whole.
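One way to implement such a target-level split is scikit-learn's GroupKFold, sketched below with toy data; this illustrates the grouping idea, not necessarily the authors' exact procedure.

```python
# Sketch: target-grouped 10-fold split so all subgraphs of a target
# stay in the same fold, preventing leakage across splits.
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 41 targets, 2000 subgraphs sampled per target.
targets = np.repeat([f"T{i}" for i in range(41)], 2000)
subgraph_ids = np.arange(len(targets))

gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(subgraph_ids, groups=targets):
    # A validation fold for model selection would be carved out of
    # train_idx in the same target-grouped manner.
    pass
```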
The training was performed using a stochastic gradient descent optimizer or AdamW optimizer, with the hyperparameters organized into the following groups:
Optimizer parameters, including the learning rate, weight decay, and batch size, which were tuned to ensure effective optimization and convergence during training.
Graph transformer parameters, including the number of attention heads, the dimensionality of node and edge embeddings, the number of graph transformer layers, and the dropout rate. These parameters define the architecture and control regularization of GATE.
The trained GATE models were selected according to their performance on the validation set in terms of the combined loss function considering both the MSE loss and the pairwise loss. The search space for the hyperparameters is presented in Table 1, available as supplementary data at Bioinformatics Advances online.
Three different kinds of GATE models below were trained, validated, and tested on the same 10-fold data split:
GATE-Basic: use the basic set of node and edge features described above.
GATE-GCP: add the single-model quality score calculated by GCPNet-EMA for each decoy as an extra node feature.
GATE-Advanced: further augment the node and edge feature sets using advanced interface quality scores, such as DockQ_ave (Studer et al. 2023), DockQ_wave (Studer et al. 2023), and CAD-scores. Specifically, the average interface scores between a decoy and other decoys are used as its extra node features, and the interface scores between the two nodes of an edge are used as its extra edge features.
The selected hyperparameters for different folds in the GATE models are summarized in Tables 2–4, available as supplementary data at Bioinformatics Advances online, respectively.
3 Results
3.1 Evaluation metrics
The following metrics were used to assess the performance of the EMA methods:
Pearson’s correlation (Corrp): Pearson’s correlation measures the linear relationship between the predicted and true quality scores. A higher Pearson’s correlation indicates a stronger alignment between the predicted and true values, reflecting a method’s accuracy in predicting overall structural quality.
Spearman’s correlation (Corrs): Spearman correlation assesses the rank-order relationship between the predicted and true quality scores. This metric evaluates how well the predicted scores preserve the correct ranking of decoys, independent of the exact score values.
Ranking loss: Ranking loss quantifies the ability of a method to correctly rank the best decoy in the pool. It is calculated as the difference between the true quality (e.g. TM-score) of the best decoy in the pool and the true quality of the decoy with the highest predicted quality. Lower ranking loss indicates better performance in identifying the top-quality structure.
Receiver operating characteristic (ROC) curve and area under the curve (AUC): The ROC curve illustrates the trade-off between the true-positive rate (sensitivity) and the false-positive rate at various classification thresholds. The threshold is set to the 75% quantile of the true TM-scores of the decoys, dividing them into positive and negative classes so that the metric focuses on identifying better-quality decoys. The AUC provides a single scalar value summarizing the ROC curve, with higher values indicating better discrimination. An AUC of 1.0 represents perfect classification, while an AUC of 0.5 indicates random performance. This metric is particularly useful for assessing a method’s robustness in ranking decoys and identifying top-quality structures.
Each metric above is calculated for the decoys of each target and then averaged over all the targets in a dataset.
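The following sketch computes these per-target metrics with scipy and scikit-learn, assuming `true` and `pred` are per-decoy TM-score arrays for one target; the 75% quantile threshold follows the AUC definition above.

```python
# Sketch: per-target evaluation metrics for an EMA method.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import roc_auc_score

def per_target_metrics(true: np.ndarray, pred: np.ndarray) -> dict:
    corr_p = pearsonr(true, pred)[0]
    corr_s = spearmanr(true, pred)[0]
    # Loss = best true quality minus true quality of the top-predicted decoy.
    ranking_loss = float(true.max() - true[np.argmax(pred)])
    # Positive class: decoys at or above the 75% quantile of true TM-scores.
    labels = (true >= np.quantile(true, 0.75)).astype(int)
    auc = roc_auc_score(labels, pred)  # assumes both classes are present
    return {"corr_p": corr_p, "corr_s": corr_s,
            "ranking_loss": ranking_loss, "auc": auc}
```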
3.2 Performance of GATE models on the CASP15 dataset
The performance of three GATE models and their ensemble (GATE-Ensemble) in comparison with several existing methods on the CASP15 dataset is reported in Table 1. In terms of TM-score, among the individual GATE models, GATE-Basic, using the basic features, achieved a Pearson’s correlation (Corrp) of 0.7447, a ranking loss of 0.1127, and an AUC of 0.7181. GATE-GCP, which included an additional GCPNet-EMA feature, showed a slight improvement in Spearman’s correlation (0.5788) over GATE-Basic (0.5722) but had similar Pearson’s correlation (0.7453) and slightly worse ranking loss (0.1186), indicating that the added feature had minimal impact on the performance.
Table 1.
Comparison of GATE model, GATE ablation variants, CASP15 EMA predictors and other methods in terms of Pearson’s correlation (Corrp), Spearman’s correlation (Corrs), ranking loss, and AUC based on TM-score and DockQ on the CASP15 complex structure dataset.a
| Method | TM-score: Corrp | TM-score: Corrs | TM-score: Ranking loss | TM-score: AUC | DockQ: Corrp | DockQ: Corrs | DockQ: Ranking loss | DockQ: AUC |
|---|---|---|---|---|---|---|---|---|
| CASP15 EMA predictors | ||||||||
| VoroMQA-select-2020 (Olechnovič and Venclovas 2023) | 0.3944b | 0.3692b | 0.1735b | 0.6663b | 0.4322b | 0.4044 | 0.2682 | 0.6741 |
| ModFOLDdock (Edmunds et al. 2023) | 0.5161b | 0.4356b | 0.1841 | 0.6721b | 0.5622 | 0.5185 | 0.2181 | 0.7022 |
| ModFOLDdockS (Edmunds et al. 2023) | 0.4717b | 0.3614b | 0.2199b | 0.6333b | 0.4068b | 0.4073 | 0.3119b | 0.6632 |
| MULTICOM_qa (Roy et al. 2023) | 0.6678b | 0.5260 | 0.1472 | 0.7059 | 0.5256 | 0.4668 | 0.2661 | 0.6748 |
| MULTICOM_egnn (Chen et al. 2023b) | 0.1437b | 0.1179b | 0.2611b | 0.5956b | 0.2158b | 0.2283b | 0.2943b | 0.6302 |
| VoroIF (Olechnovič and Venclovas 2023) | 0.4645b | 0.3069b | 0.1568b | 0.6472b | 0.5039 | 0.3455b | 0.2297 | 0.6447 |
| ModFOLDdockR (Edmunds et al. 2023) | 0.5333b | 0.4040b | 0.2160b | 0.6626b | 0.5357 | 0.4673 | 0.2623 | 0.6787 |
| Bhattacharya | 0.3803b | 0.3438b | 0.2220b | 0.6495b | 0.3581b | 0.3190b | 0.3475b | 0.6392b |
| MUFold2 | 0.5370b | 0.2662b | 0.2374b | 0.6168b | 0.3846b | 0.1839b | 0.3850b | 0.5913b |
| MUFold | 0.5435b | 0.2714b | 0.2267b | 0.6252b | 0.3856b | 0.1356b | 0.3457b | 0.5865b |
| ChaePred | 0.4706b | 0.3507b | 0.2311b | 0.6592b | 0.4381b | 0.3545b | 0.3565b | 0.6615 |
| Venclovas (Olechnovič and Venclovas 2023) | 0.4677b | 0.3828b | 0.1249 | 0.6756b | 0.5288 | 0.4506 | 0.1828 | 0.6890 |
| Other methods (normalized if applicable) | ||||||||
| PSS | 0.7292 | 0.5755 | 0.1406 | 0.7137 | 0.5118 | 0.4469 | 0.2648 | 0.6660 |
| AlphaFold plDDTnorm | 0.2578b | 0.2611b | 0.1793 | 0.6399b | 0.1710b | 0.1886b | 0.2615b | 0.6165b |
| DProQAnorm | 0.1598b | 0.1174b | 0.2555b | 0.5942b | 0.2109b | 0.2255b | 0.3162b | 0.6248 |
| VoroIF-GNN-scorenorm | 0.1972b | 0.0966b | 0.2092b | 0.5695b | 0.2283b | 0.1335b | 0.2935b | 0.5704b |
| Avg-VoroIF-GNN-res-pCADnorm | 0.1335b | −0.0027b | 0.1737 | 0.5525b | 0.1049b | −0.0030b | 0.2284 | 0.5522b |
| VoroMQA-dark globalnorm | 0.0253b | 0.0037b | 0.1265 | 0.5580b | −0.0670b | −0.0316b | 0.2191 | 0.5476b |
| GCPNet-EMAnorm | 0.3216b | 0.2696b | 0.2052b | 0.6379b | 0.1862b | 0.1803b | 0.2830b | 0.6198b |
| GATE models | ||||||||
| GATE-Basic | 0.7447 | 0.5722 | 0.1127 | 0.7181 | 0.5330 | 0.4345 | 0.2348b | 0.6703 |
| GATE-GCP | 0.7453 | 0.5788 | 0.1186 | 0.7191 | 0.5358 | 0.4389b | 0.2083 | 0.6715 |
| GATE-Advanced | 0.7224b | 0.5416b | 0.1018 | 0.6981b | 0.5142 | 0.4298 | 0.2112 | 0.6618 |
| GATE-Ensemble | 0.7480 | 0.5754 | 0.1191 | 0.7194 | 0.5353 | 0.4477 | 0.2140 | 0.6756 |
| GATE ablation variants | ||||||||
| GATE-Basic (w/o subgraph sampling) | 0.7169 | 0.5478b | 0.1266 | 0.7067 | 0.5063b | 0.4145b | 0.2620b | 0.6528b |
| GATE-GCP (w/o subgraph sampling) | 0.7503 | 0.5771 | 0.1363 | 0.7278 | 0.5253 | 0.4394 | 0.2545b | 0.6773 |
| GATE-Advanced (w/o subgraph sampling) | 0.7158b | 0.5403b | 0.1224 | 0.7043b | 0.4975b | 0.4286 | 0.2478b | 0.6616b |
| GATE-Basic (w/o pairwise loss) | 0.6881b | 0.5534 | 0.1329 | 0.7183 | 0.5226 | 0.4498 | 0.2451 | 0.6796 |
| GATE-GCP (w/o pairwise loss) | 0.6923b | 0.5392b | 0.1516b | 0.7051 | 0.4974b | 0.4062b | 0.2604b | 0.6582b |
| GATE-Advanced (w/o pairwise loss) | 0.6756b | 0.5176b | 0.1588b | 0.6961b | 0.4982 | 0.4170 | 0.2538b | 0.6617 |
| GATE-NoSingleEMA (w/o single-model EMA) | 0.6570b | 0.4832b | 0.1511b | 0.6927 | 0.4987b | 0.3967b | 0.2986b | 0.6681 |
The term norm indicates that the quality scores predicted by a method are normalized by the length of the predicted structure relative to the native structure. Only the performance of the normalization version of such a method is shown because their unnormalized outputs do not account for partial structures. Bold font denotes the best result, while the second best result is underlined. The table includes all CASP15 EMA predictors that provided predictions for all multimeric targets, along with representative methods that were used as input features for training GATE. The source data (e.g. TM-score and DockQ for the predicted structure) were obtained from the official CASP15 website: https://predictioncenter.org/casp15/results.cgi?tr_type=multimer.
The values marked with b are statistically significantly worse (P < .05) than those of the GATE-Ensemble baseline based on the one-sided Wilcoxon signed-rank test.
GATE-Advanced, with the addition of several interfacial quality features such as DockQ_ave and DockQ_wave, achieved the lowest ranking loss (0.1018), excelling in identifying top-quality decoys. However, its Pearson’s correlation (0.7224) was slightly lower than that of GATE-Basic and GATE-GCP, suggesting a tradeoff between the correlation and the ranking loss. This result highlights that while GATE-Advanced can select the top-1 model better (minimizing the ranking loss), its absolute quality predictions may be less precise.
GATE-Ensemble, which averages predictions from GATE-Basic, GATE-GCP, and GATE-Advanced, outperformed the three component methods in terms of Pearson’s correlation and AUC score but underperformed them in terms of the ranking loss. Specifically, it achieved the second highest Pearson’s correlation (0.7480), a balanced Spearman’s correlation (0.5754), a ranking loss of 0.1191, and the second highest AUC of 0.7194. Overall, ensembling the three component methods gained marginal improvement.
To assess the contributions of subgraph sampling, pairwise loss, and single-model EMA features, several ablation variants were evaluated. When subgraph sampling was disabled (i.e. using full pairwise similarity graphs to make prediction), performance generally deteriorated across metrics. For example, GATE-GCP (w/o subgraph sampling) achieved Pearson’s correlation of 0.7503 (slightly higher than GATE-GCP with subgraph sampling) but a notably worse ranking loss (0.1363 versus 0.1186). Similarly, disabling the pairwise loss during training led to substantial performance drops in both correlation and ranking loss across all models. For example, GATE-GCP (w/o pairwise loss) dropped to Pearson’s correlation = 0.6923 and ranking loss = 0.1516. Removing single-model EMA features (GATE-NoSingleEMA) caused the most substantial degradation (Pearson’s correlation = 0.6570, ranking loss = 0.1511), showcasing the importance of integrating both multimodel and single-model EMA features. Overall, the ablation studies demonstrate that each component of GATE, including subgraph sampling, pairwise loss, and single-model EMA features, contributes meaningfully to its strong performance.
When compared with the CASP15 EMA predictors, GATE-Ensemble generally outperformed most existing methods across multiple evaluation metrics. For example, the multimodel consensus method MULTICOM_qa achieved a Pearson’s correlation of 0.6678 and a ranking loss of 0.1472, both worse than those of GATE-Ensemble. Similarly, Venclovas reached a Pearson’s correlation of 0.4677 and a ranking loss of 0.1249, also performing worse than GATE-Ensemble.
To rigorously compare GATE-Ensemble against other methods, we performed one-sided Wilcoxon signed-rank tests for TM-score evaluations. As shown in Table 1, the metrics that are statistically significantly worse (P < .05) than GATE-Ensemble are marked with b. According to these tests, GATE-Ensemble achieved statistically significantly better Pearson’s correlation than nearly all single-model EMA methods and most CASP15 EMA predictors. For ranking loss, multiple methods, including consensus-based and single-model approaches, also performed statistically worse than GATE-Ensemble. The results demonstrate that integrating single-model quality scores and similarity between decoys in GATE substantially improves the estimation of the accuracy of decoys.
The consensus method based on the average PSS between decoys calculated by USalign also performed generally better than the single-model EMA methods. Its performance in terms of Pearson’s correlation, Spearman’s correlation, and AUC is comparable with that of the GATE models, but its ranking loss is substantially higher (worse) than theirs. For instance, the ranking loss of PSS is 0.1406, which is much higher than 0.1018 of GATE-Advanced, 0.1127 of GATE-Basic, 0.1186 of GATE-GCP, and 0.1191 of GATE-Ensemble.
A direct comparison of GATE-Ensemble and PSS shows that GATE-Ensemble outperformed PSS in terms of Pearson’s correlation, ranking loss, and AUC, while they performed almost the same in terms of Spearman’s correlation. The biggest differences between the two lie in Pearson’s correlation and ranking loss. The per-target Pearson’s correlation and ranking loss comparisons between the two methods are visualized in Fig. 1A and B, available as supplementary data at Bioinformatics Advances online, respectively. In terms of ranking loss, GATE-Ensemble performed much better on five targets (H1114, H1144, H1167, T1115o, and T1161o), which are highlighted in different colors in Fig. 1, available as supplementary data at Bioinformatics Advances online, and much worse than PSS on only one target (T1181o). Statistical tests using the one-sided Wilcoxon signed-rank test showed that the overall differences between GATE-Ensemble and PSS across all targets were not statistically significant (P > .05). Nevertheless, GATE-Ensemble demonstrates substantially improved ranking on several challenging targets, suggesting that while aggregate improvements are modest, GATE-Ensemble may provide practical advantages by improving robustness and reducing failure cases in difficult model selection scenarios.
A detailed analysis of T1181o (a homotrimer) revealed that the top-1 decoys selected by PSS and GATE-Ensemble are very similar (TM-score between them = 0.89), suggesting that their true quality scores should also be comparable. However, when aligning each decoy with the true structure, USalign successfully identifies an optimal chain mapping for the decoy selected by PSS, assigning it a high TM-score. In contrast, it fails to find an optimal chain mapping for the decoy selected by GATE-Ensemble, resulting in a low TM-score and, consequently, a high loss. This discrepancy indicates that the high loss observed for GATE-Ensemble on T1181o is not due to model selection but rather stems from a chain mapping issue within USalign. Although this issue only arises occasionally for certain targets, it necessitates manual inspection until the chain mapping algorithm in USalign is improved (Liu et al. 2023c).
The histograms of the true TM-scores of the decoys of the five targets on which GATE-Ensemble substantially outperformed PSS are plotted in Fig. 1C, available as supplementary data at Bioinformatics Advances online. One common pattern among the five targets is that the high-frequency common decoys in the histograms have mediocre scores.
For target H1114 ([NiFe]-hydrogenase, oxidoreductase, energy metabolism, stoichiometry: A4B8C8; skewness of the distribution of the TM-scores of the decoys = 0.831), GATE-Ensemble achieved a ranking loss of 0.0398 and a Pearson’s correlation of 0.722, outperforming PSS (ranking loss 0.2520, correlation 0.393). Similarly, for target H1144 (nanobody, A1B1; skewness = −1.176), GATE-Ensemble excelled with a ranking loss of 0.0017 and a correlation of 0.755, compared with PSS (ranking loss 0.3117, correlation 0.485).
For target H1167 (antibody–antigen complex, stoichiometry: A1B1C1; skewness = 0.024), representing a balanced decoy pool, GATE-Ensemble achieved a ranking loss of 0.0318 and a correlation of 0.702, far surpassing PSS (ranking loss 0.3095, correlation 0.335). Target T1115o (A16, skewness = 0.567) followed a similar trend, with GATE-Ensemble (ranking loss 0.1156, correlation 0.702) outperforming PSS (ranking loss 0.3668, correlation 0.465).
For the highly skewed target T1161o (dimeric DZBB fold, DNA-binding protein, stoichiometry: A2; skewness = 2.258), GATE-Ensemble achieved a perfect ranking loss of 0 and a correlation of 0.740, compared with PSS (ranking loss 0.4692, correlation 0.362), demonstrating its robustness in selecting good decoys in decoy pools dominated by low-quality decoys.
In summary, GATE-Ensemble consistently outperformed PSS in both ranking loss and Pearson’s correlation across all five targets because PSS selected common decoys of low or mediocre quality (i.e. the central tendency problem), while GATE-Ensemble is able to overcome the problem by using its graph transformer architecture and subgraph sampling to consider both the quality of individual decoys and the similarity between them. However, the per-target comparisons reveal that this performance advantage is not uniform across all targets. In certain cases (e.g. T1181o), PSS occasionally identifies top models with slightly better accuracy, likely due to factors such as chain mapping sensitivity or high structural diversity within the decoy pool, which may favor its selection strategy. These observations suggest that GATE-Ensemble and other methods capture complementary aspects of model quality. Therefore, integrating predictions from multiple EMA approaches may offer a promising direction to further improve accuracy by leveraging their respective strengths across diverse target characteristics.
Although GATE was trained on TM-score, we also evaluated its performance in terms of DockQ to assess interface-level prediction capability. GATE-Ensemble achieved a Pearson’s correlation of 0.5353, ranking loss of 0.2140, and AUC of 0.6756 based on DockQ. These results indicate that GATE-Ensemble remained competitive and outperformed most of the evaluated EMA methods, including all single-model methods such as AlphaFold plDDTnorm, DProQAnorm, EnQAnorm, VoroIF-GNN-scorenorm, Average-VoroIF-GNN-residue-pCAD-scorenorm, VoroMQA-dark global scorenorm, and GCPNet-EMAnorm. According to one-sided Wilcoxon signed-rank tests (Table 1), many of these methods showed statistically significantly worse performance (P < .05) than GATE-Ensemble across multiple DockQ-based metrics.
Among the CASP15 EMA predictors, GATE-Ensemble performed better than most approaches, including MUFold2, ChaePred, VoroMQA-select-2020, and MULTICOM_qa, in terms of Pearson’s correlation and AUC. However, Venclovas and ModFOLDdock series achieved competitive or slightly better results under DockQ-based evaluation. For example, ModFOLDdock reached a higher Pearson’s correlation (0.5622) and AUC (0.7022), while Venclovas achieved the lowest ranking loss (0.1828) among all methods. These results suggest that while GATE-Ensemble maintains strong overall performance across global and interface-based metrics, there may still be room for improvement in interface-specific assessments where some specialized methods such as ModFOLDdock excel.
3.3 Performance of GATE in the blind CASP16 experiment
During CASP16, GATE blindly participated in the EMA category under the predictor name MULTICOM_GATE. The 10 models for each of the three GATE variants (GATE-Basic, GATE-GCP, and GATE-Advanced) trained on the CASP15 dataset via 10-fold cross-validation, i.e. 30 models in total, were ensembled to make predictions: MULTICOM_GATE used the average output of the 30 models as its prediction. Moreover, to ensure stable predictions given the randomness in subgraph sampling, MULTICOM_GATE made predictions for the decoys of each target five times and averaged them as the final prediction.
Out of the 38 CASP16 complex targets evaluated, GATE models were used to make predictions for 36 targets, with H1217 and H1227 excluded due to computational resource limitations. The performance of MULTICOM_GATE in comparison with other CASP16 EMA methods on the 36 targets is illustrated in Fig. 2, available as supplementary data at Bioinformatics Advances online, and summarized in Table 2, from two perspectives: the sum of z-scores (Fig. 2, available as supplementary data at Bioinformatics Advances online) and per-target-average metrics (Table 2).
Table 2.
Average per-target evaluation metrics (Pearson’s correlation, Spearman’s correlation, ranking loss and AUC) of 23 CASP16 predictors in terms of TM-score and Oligo-GDT-TS.a
| Predictor name | TM-score: Corrp | TM-score: Corrs | TM-score: Ranking loss | TM-score: AUC | Oligo-GDT-TS: Corrp | Oligo-GDT-TS: Corrs | Oligo-GDT-TS: Ranking loss | Oligo-GDT-TS: AUC |
|---|---|---|---|---|---|---|---|---|
| MULTICOM_LLM | 0.6836 | 0.4808 | 0.1230 | 0.6685 | 0.6722 | 0.4656 | 0.1252 | 0.6603 |
| MULTICOM_GATE | 0.7076 | 0.4514 | 0.1221 | 0.6680 | 0.7235 | 0.4399 | 0.1328 | 0.6461 |
| AssemblyConsensus (Studer et al. 2023) | 0.6367 | 0.4661 | 0.1824 | 0.6584 | 0.7701 | 0.5163 | 0.1753 | 0.6702 |
| ModFOLDdock2 (McGuffin et al. 2025) | 0.6542 | 0.4640 | 0.1371 | 0.6859 | 0.6547 | 0.4143 | 0.1530 | 0.6588 |
| MULTICOM | 0.6156 | 0.4380 | 0.1207 | 0.6660 | 0.6413 | 0.4319 | 0.1368 | 0.6536 |
| MIEnsembles-Server | 0.6072 | 0.4498 | 0.1325 | 0.6670 | 0.6084 | 0.4091 | 0.1451 | 0.6671 |
| GuijunLab-QA (Liu et al. 2025a) | 0.6480 | 0.4149 | 0.1195 | 0.6328 | 0.6524 | 0.3972 | 0.1406 | 0.6377 |
| GuijunLab-Human (Liu et al. 2025a) | 0.6327 | 0.4148 | 0.1477 | 0.6368 | 0.6404 | 0.3976 | 0.1499 | 0.6483 |
| MULTICOM_human | 0.5897 | 0.4260 | 0.1518 | 0.6576 | 0.6149 | 0.4217 | 0.1498 | 0.6572 |
| GuijunLab-PAthreader (Liu et al. 2023a) | 0.5309 | 0.3744 | 0.1331 | 0.6237 | 0.6360 | 0.4353 | 0.1371 | 0.6382 |
| ModFOLDdock2R (McGuffin et al. 2025) | 0.5724 | 0.3867 | 0.1375 | 0.6518 | 0.6339 | 0.3724 | 0.1483 | 0.6355 |
| GuijunLab-Assembly | 0.5439 | 0.3280 | 0.1636 | 0.6191 | 0.5809 | 0.3135 | 0.1611 | 0.6182 |
| ChaePred | 0.4548 | 0.3971 | 0.1580 | 0.6534 | 0.4875 | 0.3673 | 0.1563 | 0.6331 |
| ModFOLDdock2S (McGuffin et al. 2025) | 0.5285 | 0.3116 | 0.1806 | 0.6084 | 0.5819 | 0.3335 | 0.1648 | 0.6129 |
| MQA_server | 0.4326 | 0.2913 | 0.1468 | 0.6120 | 0.5617 | 0.3708 | 0.1521 | 0.6323 |
| MQA_base | 0.4331 | 0.2897 | 0.1462 | 0.6085 | 0.5533 | 0.3597 | 0.1509 | 0.6281 |
| Guijunlab-Complex (Liu et al. 2025a) | 0.4889 | 0.3019 | 0.1792 | 0.6054 | 0.5693 | 0.3310 | 0.1772 | 0.6077 |
| AF_unmasked (Mirabello et al. 2024) | 0.4015 | 0.2731 | 0.1595 | 0.6052 | 0.4354 | 0.2875 | 0.1815 | 0.6113 |
| MQA | 0.4410 | 0.2425 | 0.2183 | 0.5858 | 0.4911 | 0.2631 | 0.2499 | 0.5874 |
| COAST | 0.3840 | 0.2297 | 0.2091 | 0.6072 | 0.4484 | 0.2678 | 0.2204 | 0.6078 |
| MULTICOM_AI | 0.3281 | 0.2623 | 0.1913 | 0.6057 | 0.3843 | 0.2834 | 0.1963 | 0.6111 |
| VifChartreuse (Olechnovič et al. 2025) | 0.2921 | 0.2777 | 0.1440 | 0.6149 | 0.2982 | 0.2469 | 0.1641 | 0.5956 |
| VifChartreuseJaune (Olechnovič et al. 2025) | 0.3421 | 0.1756 | 0.1630 | 0.5951 | 0.3300 | 0.1548 | 0.1915 | 0.5811 |
| PIEFold_human (Wu et al. 2020) | 0.1929 | 0.1451 | 0.2306 | 0.5497 | 0.2599 | 0.1759 | 0.2409 | 0.5541 |
The best performance for each metric is shown in bold, the second-best is underlined, and the third-best is underlined and italicized. The methods are ordered by the CASP16 Assessors’ score = Corrp + Corrs + (1 − ranking loss) + AUC based on TM-score and Oligo-GDT-TS. MULTICOM_GATE ranked among top three for most of the metrics. Source data (i.e. Pearson’s correlation, Spearman’s correlation, ranking loss, and AUC for each EMA predictor on each target) were obtained from the official CASP16 website: https://predictioncenter.org/casp16/results.cgi?tr_type=accuracy.
Figure 2, available as supplementary data at Bioinformatics Advances online, presents the weighted sum of the z-scores for each evaluation metric: Pearson’s correlation, Spearman’s correlation, ranking loss, and AUC, each computed in terms of both TM-score and the oligomer global distance test (Oligo-GDT-TS) score. The z-scores of a predictor for each target were calculated relative to the mean and standard deviation of all the CASP16 predictors. The per-target z-scores were then summed across all 36 targets to quantify the overall performance of each predictor. This approach, used in the official CASP16 EMA assessment, provides a comprehensive evaluation of each predictor’s relative performance across multiple complementary metrics.
MULTICOM_GATE achieved the highest sum of z-scores for Pearson’s correlation in terms of both TM-score and Oligo-GDT-TS, demonstrating its consistent ability to predict structural quality across all targets. It also ranked among the top methods in Spearman’s correlation under both measures, highlighting its robustness in ranking decoys accurately. Furthermore, MULTICOM_GATE performed competitively in ranking loss under both measures, reflecting its effectiveness in identifying the highest-quality decoys, and its strong AUC z-scores underscore its ability to distinguish between high- and low-quality decoys. According to the weighted sum of all the z-scores, MULTICOM_GATE ranked no. 5 among the 23 CASP16 predictors.
It is worth noting that the z-score amplifies the impact of a target on which a predictor performs very well while most other predictors perform poorly. Another standard way to evaluate the predictors is to simply compare their average scores across all the test targets, without emphasizing specific ones. Table 2 complements Fig. 2, available as supplementary data at Bioinformatics Advances online, by reporting the original per-target average performance metrics in terms of TM-score, which is what MULTICOM_GATE was trained to predict. These metrics include Pearson’s correlation (Corrp), Spearman’s correlation (Corrs), ranking loss, and AUC, averaged across the 36 targets. MULTICOM_GATE achieved the highest Corrp (0.7076), demonstrating its strong capability to predict structural quality based on TM-score. It also ranked fourth in Corrs (0.4514) and third in ranking loss (0.1221), reflecting its balanced performance in both ranking consistency and identifying top-quality decoys. Additionally, MULTICOM_GATE achieved a competitive AUC value (0.6680), ranking third. Overall, MULTICOM_GATE delivered a balanced, excellent performance, ranking among the top four in terms of every metric.
To illustrate how GATE identifies high-quality decoys, Fig. 3, available as supplementary data at Bioinformatics Advances online, presents the pairwise similarity graph for a hard CASP16 target, H1215, a heterodimer consisting of an mNeonGreen protein (antigen) bound to a nanobody.
The color of nodes in the graph corresponds to the true TM-scores of the decoys represented by them, ranging from 0.5 (yellow) to 1.0 (purple).
A low-quality but common decoy, H1215TS196_4 (true TM-score = 0.613, ranking loss = 0.377), was selected as the top-1 decoy by the average pairwise structural similarity (consensus) strategy. The decoy has an incorrect interface between its two subunits and has a high loss. In contrast, a high-quality decoy with a correct interface [H1215TS014_2 (true TM-score = 0.987, ranking loss = 0.003)], was identified by MULTICOM_GATE as top 1. The success of MULTICOM_GATE on this target is significant as many CASP16 EMA predictors failed on this target. This example demonstrates that GATE effectively integrates local and global structural features to prioritize high-quality decoys over common mediocre models selected by the consensus approach. By leveraging both pairwise similarity information and single-model quality scores, GATE accurately distinguishes the decoys with near-native fold, as indicated by the high TM-score of the selected decoy.
Together, the results above provide a holistic view of MULTICOM_GATE’s performance. The sum of z-scores reflects its relative strength across the targets, while the per-target-average metrics validate its absolute performance for TM-score prediction. These results demonstrate that MULTICOM_GATE effectively integrates local and global structural features to deliver robust and accurate predictions, solidifying its position as one of the top-performing methods in the blind CASP16 experiment.
In addition to the overall performance across all targets, we further analyzed whether MULTICOM_GATE performs differently on homo-oligomeric and hetero-oligomeric complexes. Based on TM-score, MULTICOM_GATE achieved similar Pearson’s correlations for both homo-oligomers (0.7053) and hetero-oligomers (0.7096), indicating comparable accuracy in predicting global model quality across complex types. The ranking loss was better for hetero-oligomers (0.1076) than for homo-oligomers (0.1384), while the AUC was also higher for hetero-oligomers (0.6780 versus 0.6569). Interestingly, when evaluated by Oligo-GDT-TS, MULTICOM_GATE exhibited better performance on homo-oligomers, achieving a higher Pearson’s correlation (0.7596 versus 0.6912), a slightly lower ranking loss (0.1320 versus 0.1335), and a comparable AUC (0.6456 versus 0.6465). These results suggest that while MULTICOM_GATE maintains strong and consistent performance across both homo- and hetero-oligomers, the evaluation of homo-oligomers may particularly benefit from the alignment algorithm used in Oligo-GDT-TS, which is less sensitive to chain mapping.
Beyond performance rankings, MULTICOM_GATE differs substantially from many of the competing CASP16 methods in its underlying algorithmic design. Several methods in Table 2 adopt consensus or ensemble-based approaches that rely on clustering or averaging predictions across models. For example, AssemblyConsensus (Studer et al. 2023) computes consensus scores from pairwise structure comparisons; this approach excels in correlation but is weaker at selecting the best decoy. ModFOLDdock2 and its variants (McGuffin et al. 2025) combine machine learning with structural feature extraction, interface scoring, and limited pairwise similarity information, delivering strong performance in TM-score-based metrics but weaker performance in Oligo-GDT-TS metrics. Several other methods are purely single-model EMA predictors, such as GuijunLab-QA (Liu et al. 2025a) and ChaePred, which leverage deep learning models trained to assess structure quality based solely on the input model itself. These methods are often efficient and suitable for real-time scoring, but their performance is still relatively worse than that of multimodel methods. In contrast, MULTICOM_GATE directly integrates both single-model and multimodel information through its graph-based subgraph sampling strategy, transforming pairwise similarity relationships into rich graph embeddings processed by transformer architectures. Its advantage was apparent in the blind CASP16 experiment.
3.4 Performance of GATE on in-house models in the CASP16 blind experiment
In addition to benchmarking GATE on the blind CASP16 EMA category, where decoy models were generated by many independent predictor groups, we further evaluated its performance in a more realistic use case by applying it to structural models generated by our in-house complex structure predictor during the CASP16 experiment. These models were produced by MULTICOM4 (Liu et al. 2025b), a protein complex prediction system that integrates AlphaFold-Multimer and AlphaFold3 (Abramson et al. 2024). MULTICOM4 was among the top five predictors in Phase 1 of CASP16 for multimeric structure prediction under known stoichiometry conditions.
Unlike the CASP16 EMA setting, where models from many independent predictors are available to construct a model similarity graph, real-world structure prediction workflows typically rely on large-scale model sampling carried out by one user. To mimic this scenario, we evaluated GATE on a decoy pool of 712 to 78 410 decoys (structural models) per target generated by MULTICOM4 for each of the 36 multimeric targets during CASP16. This pool includes a wide diversity of models sampled using different configurations of AlphaFold-Multimer and AlphaFold3, reflecting the contemporary practice of extensively sampling models with AlphaFold. Four targets (T1249v1o, T1249v2o, T1294v1o, and T1294v2o) that have the same sequence but different conformations were excluded. Given the 2-week prediction time constraint during CASP16, GATE was applied only to the top-5 decoys ranked by the confidence score under each input setting, resulting in 35 to 390 decoys per target.
To ensure prediction stability, GATE-Ensemble performed five independent predictions based on random subgraph sampling per target and averaged the results. As shown in Table 3, GATE-Ensemble achieved the highest Pearson’s correlation for both TM-score (0.4083) and Oligo-GDT-TS (0.3801), as well as the highest Spearman’s correlation for TM-score (0.2774) and Oligo-GDT-TS (0.2989), outperforming other baseline methods such as AlphaFold plDDT, GCPNet-EMA, and PSS. It also achieved the second-best ranking loss for TM-score (0.1327) and the third-best for Oligo-GDT-TS (0.1626), along with the second-best AUC for TM-score (0.6469) and the third-best for Oligo-GDT-TS (0.6475). These results demonstrate that GATE generalizes well even in a practical AlphaFold-based workflow without relying on external predictions.
Table 3.
Comparison of evaluation metrics (Pearson’s correlation, Spearman’s correlation, ranking loss, and AUC) for different EMA methods applied to in-house structural models generated by MULTICOM4 in the CASP16 blind experiment.a
| Predictor name | TM-score: Corrp | TM-score: Corrs | TM-score: Ranking loss | TM-score: AUC | Oligo-GDT-TS: Corrp | Oligo-GDT-TS: Corrs | Oligo-GDT-TS: Ranking loss | Oligo-GDT-TS: AUC |
|---|---|---|---|---|---|---|---|---|
| PSS | 0.3947 | 0.2523 | 0.1388 | 0.6384 | 0.3385 | 0.2495 | 0.1582 | 0.6282b |
| AlphaFold plDDTnorm | 0.3806 | 0.2731 | 0.1334 | 0.6557 | 0.3663 | 0.2557 | 0.1206 | 0.6587 |
| DProQAnorm | −0.0507b | 0.0112b | 0.1942b | 0.5689b | 0.0319b | 0.0709b | 0.2225 | 0.5874 |
| VoroIF-GNN-scorenorm | 0.0648b | 0.1157b | 0.1929b | 0.5995 | 0.1143b | 0.1704 | 0.2066 | 0.6222 |
| Avg-VoroIF-GNN-res-pCADnorm | 0.0729b | 0.1046b | 0.1669 | 0.5887b | 0.0744b | 0.1374b | 0.2044 | 0.6155 |
| VoroMQA-dark globalnorm | 0.0385b | 0.1443 | 0.1286 | 0.6094 | −0.0126b | 0.1456 | 0.1626 | 0.6220 |
| GCPNet-EMAnorm | 0.3597 | 0.2491 | 0.1345 | 0.6431 | 0.3555 | 0.2642 | 0.1691 | 0.6476 |
| GATE-Ensemble | 0.4083 | 0.2774 | 0.1327 | 0.6469 | 0.3801 | 0.2989 | 0.1626 | 0.6475 |
The evaluation was conducted using both TM-score and Oligo-GDT-TS. The best performance for each metric is shown in bold, and the second best is underlined. Source data for this table (i.e. TM-score and Oligo-GDT-TS of each in-house model) were computed by comparing the predicted decoys from MULTICOM4 to the corresponding experimental structures downloaded from the official CASP16 website: https://predictioncenter.org/casp16/.
The values marked with b are statistically significantly worse (P < .05) than those of the GATE-Ensemble baseline based on the one-sided Wilcoxon signed-rank test.
We also examined whether the observed performance trends hold for the in-house models. On the CASP16 in-house dataset, GATE showed markedly better performance on homo-oligomeric targets than on hetero-oligomeric ones. In terms of TM-score, Pearson’s correlation increased from 0.3405 for hetero-oligomers to 0.4852 for homo-oligomers, while Spearman’s correlation improved from 0.2656 to 0.2907, and AUC from 0.6400 to 0.6548. A similar pattern was observed for Oligo-GDT-TS: Pearson’s correlation improved significantly from 0.2479 (hetero) to 0.5300 (homo), Spearman’s correlation from 0.2213 to 0.3870, and AUC from 0.6304 to 0.6670. This analysis suggests that GATE captures structural signals more effectively in homo-oligomeric complexes with higher symmetry in the in-house prediction setting.
3.5 Limitation
While GATE demonstrates strong performance in our own test and the blind CASP16 experiment, it still has some limitations. First, generating PSSs for a large number of decoys of a large protein complex is a time-intensive process because the complex structure comparison tools used in this study, such as MMalign and USalign, are slow in comparing large multichain protein structures. This computational burden not only affects runtime but also limits the feasibility of deploying GATE as a real-time or high-throughput web server. In contrast to lightweight scoring methods that power interactive platforms [e.g. ModFOLDdock (Edmunds et al. 2023), DeepUMQA-X (Liu et al. 2025a)], GATE is currently restricted to local execution. A potential solution is to use faster approximate methods, such as FoldSeek (Van Kempen et al. 2024), to accelerate structural comparisons.
Second, subgraph sampling introduces some randomness, so running GATE multiple times may produce slightly different quality scores for a decoy. However, this effect can be minimized by extensive sampling of subgraphs and averaging the predictions. In our experiments, when 2000 subgraphs are sampled and their predictions are aggregated into a final score, the aggregation smooths out the variability, producing robust and consistent quality scores even for diverse or imbalanced decoy pools. Consequently, the randomness from sampling has a negligible impact on the overall performance of GATE. Moreover, the randomness also has a positive effect because it allows GATE to generate standard deviations for predicted scores.
Third, unlike some competing methods, GATE does not currently provide local interface residue scores, which can be valuable for fine-grained model assessment, interface evaluation, and downstream functional analyses. This limitation stems from GATE’s design, where each decoy model is represented as a single node within the similarity graph, focusing exclusively on global model quality estimation rather than residue-level assessment. To address this, future extensions of GATE may adopt a hierarchical or multistage framework that combines global model ranking with residue-level or interface-level confidence scoring, potentially by incorporating additional structural or sequence-derived features.
4 Conclusion
This study introduces a novel graph-based approach to improve protein complex structure quality assessment by leveraging pairwise similarity graphs and graph transformers to integrate multiple complementary features and the strengths of multimodel and single-model EMA methods. The approach was rigorously tested on the CASP15 dataset as well as in the blind CASP16 experiment, demonstrating robust and promising performance across multiple evaluation metrics. In particular, it maintains the advantage of multimodel consensus methods over single-model methods while overcoming the central-tendency failure of the consensus approach in some cases. In the future, we plan to speed up the construction of pairwise similarity graphs and explore more advanced graph neural network architectures and larger training datasets to better integrate structural and similarity features and further enhance the performance of the approach.
Contributor Information
Jian Liu, Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, United States.
Pawan Neupane, Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, United States.
Jianlin Cheng, Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, United States.
Author contributions
Jian Liu contributed to writing, software development, visualization, formal analysis, data curation, and validation. Pawan Neupane contributed to formal analysis. Jianlin Cheng was responsible for conceptualization, method design, investigation, funding acquisition, writing, validation, software, formal analysis, project administration, supervision, resources, and visualization.
Supplementary data
Supplementary data are available at Bioinformatics Advances online.
Conflict of interest
None declared.
Funding
This work was supported in part by the National Science Foundation [DBI2308699]; and the National Institutes of Health [R01GM093123].
Data availability
The protein targets and their native structures can be downloaded at https://predictioncenter.org/casp15/ and https://predictioncenter.org/casp16/. The CASP16 EMA results are also available at https://predictioncenter.org/casp16/results.cgi?tr_type=accuracy.
References
- Abramson J, Adler J, Dunger J et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024;630:493–500.
- Alford RF, Leaver-Fay A, Jeliazkov JR et al. The Rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput 2017;13:3031–48.
- Basu S, Wallner B. DockQ: a quality measure for protein-protein docking models. PLoS One 2016;11:e0161879.
- Cao R, Adhikari B, Bhattacharya D et al. QAcon: single model quality assessment using protein structural and contact information with machine learning techniques. Bioinformatics 2017;33:586–8.
- Cao R, Bhattacharya D, Adhikari B et al. Large-scale model quality assessment for improving protein tertiary structure prediction. Bioinformatics 2015;31:i116–23.
- Chen C, Chen X, Morehead A et al. 3D-equivariant graph neural networks for protein model quality assessment. Bioinformatics 2023a;39:btad030.
- Chen X, Cheng J. DISTEMA: distance map-based estimation of single protein model accuracy with attentive 2D convolutional neural network. BMC Bioinformatics 2022;23:141.
- Chen X, Morehead A, Liu J et al. A gated graph transformer for protein complex structure quality assessment and its performance in CASP15. Bioinformatics 2023b;39:i308–17.
- Cheng J, Choe M, Elofsson A et al. Estimation of model accuracy in CASP13. Proteins 2019;87:1361–77.
- Edmunds NS, Alharbi SM, Genc AG et al. Estimation of model accuracy in CASP15 using the ModFOLDdock server. Proteins 2023;91:1871–8.
- Edmunds NS, Genc AG, McGuffin LJ. Benchmarking of AlphaFold2 accuracy self-estimates as indicators of empirical model quality and ranking: a comparison with independent model quality assessment programmes. Bioinformatics 2024;40:btae491.
- Evans R, O’Neill M, Pritzel A et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021. 10.1101/2021.10.04.463034
- Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
- Liu D, Liu J, Wang H et al. DeepUMQA-X: comprehensive and insightful estimation of model accuracy for protein single-chain and complex. Nucleic Acids Res 2025a;53:W219–W227.
- Liu D, Zhang B, Liu J et al. Assessing protein model quality based on deep graph-coupled networks using protein language model. Brief Bioinform 2023a;25.
- Liu J, Guo Z, Wu T et al. Improving AlphaFold2-based protein tertiary structure prediction with MULTICOM in CASP15. Commun Chem 2023b;6:188.
- Liu J, Guo Z, Wu T et al. Enhancing AlphaFold-Multimer-based protein complex structure prediction with MULTICOM in CASP15. Commun Biol 2023c;6:1140.
- Liu J, Neupane P, Cheng J. Improving AlphaFold2- and AlphaFold3-based protein complex structure prediction with MULTICOM4 in CASP16. Proteins 2025b.
- McGuffin LJ. The ModFOLD server for the quality assessment of protein structural models. Bioinformatics 2008;24:586–7.
- McGuffin LJ, Alhaddad SN, Behzadi B et al. Prediction and quality assessment of protein quaternary structure models using the MultiFOLD2 and ModFOLDdock2 servers. Nucleic Acids Res 2025;53:W472–W477.
- Mirabello C, Wallner B, Nystedt B et al. Unmasking AlphaFold to integrate experiments and predictions in multimeric complexes. Nat Commun 2024;15:8724.
- Morehead A, Liu J, Cheng J. Protein structure accuracy estimation using geometry-complete perceptron networks. Protein Sci 2024;33:e4932.
- Olechnovič K, Banciul R, Dapkūnas J et al. FTDMP: a framework for protein–protein, protein–DNA, and protein–RNA docking and scoring. Proteins 2025.
- Olechnovič K, Kulberkytė E, Venclovas Č. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins 2013;81:149–62.
- Olechnovič K, Venclovas Č. VoroIF-GNN: Voronoi tessellation-derived protein–protein interface assessment using a graph neural network. Proteins 2023;91:1879–88.
- Roy RS, Liu J, Giri N et al. Combining pairwise structural similarity and deep learning interface contact prediction to estimate protein complex model accuracy in CASP15. Proteins 2023;91:1889–902.
- Shuvo MH, Bhattacharya S, Bhattacharya D. QDeep: distance-based protein model quality estimation by residue-level ensemble error classifications using stacked deep residual neural networks. Bioinformatics 2020;36:i285–i291.
- Studer G, Tauriello G, Schwede T. Assessment of the assessment—all about complexes. Proteins 2023;91:1850–60.
- Van Kempen M, Kim SS, Tumescheit C et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 2024;42:243–6.
- Wallner B, Elofsson A. Can correct protein models be identified? Protein Sci 2003;12:1073–86.
- Wang Z, Tegge AN, Cheng J. Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins 2009;75:638–47.
- Wu L-Y, Xia X-Y, Pan X-M. A novel score for highly accurate and efficient prediction of native protein structures. bioRxiv 2020. 10.1101/2020.04.23.056945
- Zhang C, Shine M, Pyle AM et al. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 2022;19:1109–15.