Abstract
RNA performs a variety of functions within cells and is implicated in various human diseases. Because druggable proteins are encoded by only a small portion of the genome, interest in developing drugs that target RNAs has grown considerably. Precise prediction of small-molecule binding sites across different classes of RNAs is therefore important. In this study, a lightweight deep learning program for predicting RNA-drug binding sites, called compound binding site prediction for RNA (CoBRA), is introduced. Our approach utilizes residue-level embeddings derived from a pre-trained RNA language model, without relying on any structural information. These embeddings encapsulate the contextual and statistical properties of each nucleotide and are used as input for a multi-layer perceptron classifier that performs binary classification of binding nucleotides. The model was trained using the TR60 and HARIBOSS datasets and tested on four independent benchmark sets. CoBRA shows a relative improvement of 22.1% in the Matthews correlation coefficient and a 45.6% increase in sensitivity compared to existing state-of-the-art RNA–ligand binding site prediction methods that utilize structural information. These results demonstrate that sequence-based language model embeddings, which do not require explicit coordinate or distance information, can match or outperform structure-based methods. This makes CoBRA a flexible tool for predicting binding sites across diverse RNA targets.
Keywords: RNA–small molecule binding site prediction, RNA language model, pre-trained embedding, deep learning, convolutional neural network
Graphical Abstract
Introduction
RNA plays diverse roles in gene regulation, translation, and structural organization within the cell. RNAs are also involved in various human diseases, including cancer, neurological disorders, cardiovascular dysfunction, and developmental abnormalities [1–3]. Given that only 1%–3% of the human genome encodes proteins [4] and that only ~10%–14% of these proteins are considered druggable, leaving the majority (80%–90%) undruggable [5], targeting mRNAs has recently gained attention as a potential therapeutic strategy. Additionally, there has been a growing focus on RNAs as small-molecule drug targets, owing to their capacity to form structured regions that bind ligands with high specificity [6, 7]. These RNA-ligand interactions present novel avenues for therapeutic intervention, and thus precise computational prediction of ligand-binding sites across diverse RNA categories is needed.
Various RNA-ligand binding site prediction methods relying on structural information, either directly or indirectly, have been developed. Rsite [8, 9] calculates Euclidean distances between nucleotides based on RNA tertiary structures and Hamming distances between the secondary structures to predict binding sites. Rbind [10] represents RNA tertiary structures as a network, where nucleotides are treated as nodes and non-covalent spatial interactions form edges. Binding site prediction is then performed based on network centrality metrics such as degree and closeness. RNAsite [11] is a machine learning-based method that extracts sequence and structure-derived features of nucleotides with a sliding window strategy. A random forest classifier is employed to predict binding status based on these features. ZHmolReSTasite (ZeSTa) [12] utilizes point clouds of solvent-excluded surfaces generated from RNA tertiary structures. These are then converted into normalized topographic images corresponding to individual nucleotides. The resulting representations are used as input features to a deep learning model that learns surface-based geometric features. The structure-based approaches have shown solid performance in RNA–ligand binding site prediction by incorporating explicit three-dimensional or secondary structural information into geometric or graph-based representations. However, their reliance on experimentally determined RNA structures restricts their applicability to RNAs without reliable structural data and limits large-scale use.
On the other hand, several prediction methods utilize RNA language models (LMs). RNABind [13] employs E(3)-equivariant graph neural networks to encode both sequence and structure information. Node features are enriched with embeddings from RNA language models, enabling the model to learn both geometric and contextual relationships for identifying ligand-binding nucleotides. It evaluated eight LMs, including ERNIE-RNA and RiNALMo, which achieved superior performance. RLSite [14] and GATRsite [15] combine an RNA LM with graph attention networks. The three-dimensional structure is represented as a graph in which nucleotides are nodes, incorporating both sequential features from language models and structural features from three-dimensional conformations. MVRBind [16] employs a multi-view feature extraction module that combines sequence information (one-hot encoding, MSA, and LM embeddings) with secondary and tertiary structure. The gathered information is integrated using a graph convolutional network. By performing multi-view graph message passing and feature fusion across these representations, the model captures hierarchical geometric dependencies to predict RNA–ligand binding sites. By combining molecular graphs and LMs, these methods achieved high performance.
In this study, we propose a lightweight deep learning-based model, called CoBRA (Compound Binding Site Prediction for RNA), that predicts RNA–ligand binding sites using an RNA LM without relying on RNA structural information. The model utilizes residue-level embeddings derived from pre-trained RNA language models, which implicitly encode the contextual and statistical properties of each nucleotide. These embeddings are then fed into a multi-layer perceptron (MLP) classifier for residue-level binary classification. The proposed framework is designed to operate without explicit structural features such as three-dimensional coordinates or distance metrics, and its modularity allows for flexible experimentation. Combinations of ten RNA LMs and six loss functions were systematically trained on the combined TR60 [11] and HARIBOSS [17] sets and then evaluated to identify optimal configurations for RNA-ligand binding site prediction. The final model was evaluated on four benchmark sets: TE18, RB9, JL10, and TL12 [11, 12]. On average across these sets, CoBRA achieved higher MCC and recall than state-of-the-art RNA-ligand binding site prediction programs.
Material and methods
Dataset preparation
The prediction of RNA–ligand binding sites is a binary classification task at the residue level. The model predicts the binding of each nucleotide (residue) in each RNA sequence to an organic small molecule or metal ion. Each residue is designated as binding or non-binding. A nucleotide is labelled as a binding site if the distance between any of its heavy atoms and any ligand heavy atoms is less than or equal to 4 Å, following the definition from previous research [12, 13, 16].
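The labelling rule above can be sketched in a few lines. This is a minimal illustration, assuming heavy-atom coordinates have already been parsed into per-residue lists; the function name `label_binding` and the toy coordinates are hypothetical and are not taken from the CoBRA code or from any PDB entry.

```python
import math

CUTOFF = 4.0  # distance threshold in angstroms, as stated in the text


def label_binding(residue_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return 1 for a residue with any heavy atom within `cutoff` of any
    ligand heavy atom, otherwise 0."""
    labels = []
    for atoms in residue_atoms:
        close = any(
            math.dist(a, b) <= cutoff for a in atoms for b in ligand_atoms
        )
        labels.append(1 if close else 0)
    return labels


# Toy coordinates (hypothetical): two residues and a one-atom ligand.
residues = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],    # nearest atom is 3.0 A from the ligand
    [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)],  # nearest atom is 6.0 A from the ligand
]
ligand = [(4.0, 0.0, 0.0)]
labels = label_binding(residues, ligand)
```

The comparison uses `<=` so that a contact at exactly 4 Å counts as binding, matching the "less than or equal to" wording of the definition.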
Six publicly available RNA–ligand complex structure datasets are used in this study: HARIBOSS, TR60, RB9, TL12, JL10, and TE18. Panei et al. [17] extracted RNA-small molecule complexes from the PDB to construct the HARIBOSS dataset (accessed February 2025), which comprises 862 structures. These complexes were then clustered based on RNA sequence and structural similarity. Su et al. [11] collected 712 RNA-small molecule crystal structures and used TM-scoreRNA to calculate their structural similarity, resulting in 78 representatives. These representatives were divided into two sets, TR60 and TE18, which were used for training and testing RNAsite, respectively. Gao et al. [12] generated the RB9, TL12, and JL10 datasets to assess their binding site prediction program, ZeSTa. RB9 was derived from RB19 by removing the ten structures that overlapped with TR60. JL10, drawn from RNA-small molecule structures deposited after January 2021, is characterized by junction loop structures, which exhibit high structural complexity. In contrast, TL12 structures lack such loops, resulting in low structural complexity. Among the six datasets, HARIBOSS and TR60 were merged to generate the training, validation, and internal test sets. Non-biological small molecules commonly used as crystallization additives or cryoprotectants, including water, sulfate, phosphate, glycerol, ethylene glycol, and polyethylene glycol (HOH, SO₄, PO₄, GOL, EDO, PEG), were excluded from the ligand set to prevent artifactual contacts. RNA chains longer than 161 nucleotides were subsequently removed, resulting in a final dataset of 432 unique RNA chains. The dataset was split using a pre-determined random seed into three subsets with a ratio of 8:1:1 for training, validation, and internal testing, respectively.
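A seeded 8:1:1 split of this kind might look like the sketch below. The seed value, the rounding rule (flooring the training and validation sizes and assigning the remainder to the test subset), and the `chain_*` identifiers are all assumptions made for illustration; the text does not specify them for the split.

```python
import random


def split_811(items, seed=42):
    """Shuffle with a fixed seed and split 8:1:1 (train, validation, test).

    Assumption: train/validation sizes are floored and the remainder
    goes to the test subset.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 432 RNA chains.
chains = [f"chain_{i}" for i in range(432)]
train, val, test = split_811(chains)
```

Because the generator is seeded locally with `random.Random(seed)`, repeated calls yield the same partition without touching the global random state.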
The HARIBOSS set was utilized to further evaluate the generalizability of the proposed model. Zhu et al. [13] provide the four structure split sets based on structural similarity with a TM-Score threshold of 0.5. Sixty models combining RNA LMs and loss functions were trained and tested using the same protocol as the combined sets of HARIBOSS and TR60.
The remaining four, RB9, JL10, TL12, and TE18, were reserved as test sets. To ensure independent evaluation, any overlapping sequences between the training and test sets were removed. The problem formulation and dataset design enable quantitative assessment of both prediction accuracy and generalizability under practically meaningful scenarios.
Model architecture
To predict whether each nucleotide in an RNA sequence is a binding site, a residue-level binary classification model was designed. Fig. 1 illustrates the model architecture of CoBRA. The input is a query RNA sequence, which is embedded using pre-trained RNA LMs. During training, these embeddings are kept frozen and are not updated. The prediction model is an MLP comprising five fully connected layers. Each hidden layer is followed sequentially by layer normalization, a ReLU activation function, and dropout with a probability of 0.1, promoting stable training and improved generalization. The final output layer produces a two-dimensional logit for each residue, corresponding to the binding and non-binding classes. These logits are converted into probabilities via a softmax function. All inputs are standardized to a maximum sequence length of 161 residues. Shorter sequences are padded with zeros in the input matrix X and with −1 in the target vector y. A masking strategy ensures that the padded regions do not contribute to the loss computation or affect model training.
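The architecture described above can be sketched in PyTorch as follows. This is an illustrative reconstruction rather than the authors' code: the class name `ResidueMLP` is hypothetical, the hidden sizes are taken from the Fig. 1 caption, normalization/activation/dropout are assumed to apply to the hidden layers only, and masking is implemented here with `ignore_index=-1` in the cross-entropy loss, which is one of several possible ways to exclude padded positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LEN = 161  # maximum sequence length stated in the text


class ResidueMLP(nn.Module):
    """Per-residue MLP: four hidden layers plus a two-logit output layer."""

    def __init__(self, embed_dim, hidden=(1024, 256, 128, 64), p_drop=0.1):
        super().__init__()
        layers, d_in = [], embed_dim
        for d_out in hidden:
            layers += [
                nn.Linear(d_in, d_out),
                nn.LayerNorm(d_out),
                nn.ReLU(),
                nn.Dropout(p_drop),
            ]
            d_in = d_out
        layers.append(nn.Linear(d_in, 2))  # logits for the two classes
        self.net = nn.Sequential(*layers)

    def forward(self, x):   # x: (batch, MAX_LEN, embed_dim)
        return self.net(x)  # (batch, MAX_LEN, 2)


torch.manual_seed(42)
model = ResidueMLP(embed_dim=1280)  # 1280 matches the RiNALMo embedding size

# Toy batch: 50 real residues per sequence, the rest is padding.
x = torch.zeros(4, MAX_LEN, 1280)
y = torch.full((4, MAX_LEN), -1, dtype=torch.long)  # -1 marks padding
x[:, :50] = torch.randn(4, 50, 1280)
y[:, :50] = torch.randint(0, 2, (4, 50))

logits = model(x)
# Positions labelled -1 are skipped, so padding does not affect the loss.
loss = F.cross_entropy(logits.reshape(-1, 2), y.reshape(-1), ignore_index=-1)
probs = logits.softmax(dim=-1)
```

Since the linear layers act on the last dimension only, each residue is classified independently from its own embedding; any sequence context comes from the frozen language model rather than from the classifier.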
Figure 1.
An overview of the model architecture. The model takes a sequence of 161 input embeddings, each zero-padded to a fixed embedding dimension. The sequence is processed by an MLP with hidden sizes of 1024, 256, 128, and 64. Each hidden layer includes layer normalization, ReLU activation, and dropout (P = 0.1). The final layer outputs two class logits per residue, followed by a softmax for binary classification.
The model was trained using the PyTorch [18] framework with the AdamW [19] optimizer and an initial learning rate of 5 × 10⁻⁴. Cosine annealing was used for learning rate scheduling. All experiments were run for 100 epochs with a batch size of 4, a dropout probability of 0.1, and the random seed fixed to 42 to ensure reproducibility.
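For reference, the learning rate produced by cosine annealing can be written out explicitly. The sketch below assumes a single cosine period spanning all 100 epochs and a minimum learning rate of zero; neither setting is stated in the text. In PyTorch the same curve is typically obtained with `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math

BASE_LR = 5e-4  # initial learning rate from the text
EPOCHS = 100    # number of training epochs from the text
MIN_LR = 0.0    # assumption: not specified in the text


def cosine_lr(epoch, base_lr=BASE_LR, total=EPOCHS, min_lr=MIN_LR):
    """Learning rate at a given epoch under one cosine annealing period."""
    cosine = 1.0 + math.cos(math.pi * epoch / total)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine


schedule = [cosine_lr(epoch) for epoch in range(EPOCHS + 1)]
```

Under these assumptions the rate starts at 5 × 10⁻⁴, falls to half that value at epoch 50, and reaches the minimum at epoch 100.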
RNA language models
Ten pre-trained RNA LMs were employed to pick the best model for residue-level binding site prediction. The models were pre-trained on diverse RNA types for various learning objectives, resulting in differences in embedding dimensionality and representational properties. Table 1 lists the RNA LMs, their pre-training targets, and the embedding size.
Table 1.
Description of RNA language models employed for CoBRA.
| Model name | Pre-training target | Embedding size |
|---|---|---|
| ERNIE-RNA | Non-coding RNA | 768 |
| RiNALMo | Non-coding RNA | 1280 |
| RNABERT | Non-coding RNA | 120 |
| RNA-FM | Non-coding RNA | 640 |
| RNA-MSM | Non-coding RNA | 768 |
| SpliceBERT | Precursor mRNA | 512 |
| SpliceBERT-510 | Precursor mRNA | 512 |
| SpliceBERT-H510 | Precursor mRNA | 512 |
| UTRLM-MRL | mRNA 5′UTR | 128 |
| UTRLM-TE_EL | mRNA 5′UTR | 128 |
The models can be categorized into two groups based on their pre-training sets. The first group comprises the models trained on non-coding RNAs (ncRNAs): ERNIE-RNA [20], RiNALMo [21], RNABERT [22], RNA-FM [23], and RNA-MSM [24]. This group captures generalizable structural patterns across various ncRNA families. ERNIE-RNA [20], comprising 12 Transformer blocks, is pre-trained via masked language modeling (MLM) on 20 million ncRNAs from RNAcentral with structural information. RiNALMo [21] is pre-trained on 36 million ncRNAs, also employing MLM, with 33 Transformer blocks. RNABERT [22] applies BERT pre-training to 762 K ncRNAs and also encodes RNA family and structural characteristics. RNA-FM [23] is built upon 12 bidirectional Transformer encoder blocks that produce an L × 640 embedding matrix for input length L, pre-trained on 23.7 million ncRNAs. RNA-MSM [24] follows the MSA Transformer architecture with ten blocks to learn two-dimensional co-evolutionary signals from homologous multiple sequence alignments of 3932 Rfam families.
The second group comprises the models pre-trained on pre-mRNA or mRNA untranslated regions (UTRs): SpliceBERT, SpliceBERT-510, SpliceBERT-H510 [25], UTRLM-MRL, and UTRLM-TE_EL [26]. This group specializes in modeling sequence features in post-transcriptional regulatory regions. SpliceBERT [25] is a BERT-based model with six Transformer encoder layers pre-trained on 2 million RNA sequences from 72 vertebrate species. Two variants of the model were also employed: SpliceBERT-510, an intermediate checkpoint, and SpliceBERT-H510, pre-trained exclusively on human data. UTR-LM [26] is a six-layer Transformer encoder pre-trained via semi-supervised masked nucleotide reconstruction, 5′ UTR secondary-structure prediction, and minimum-free-energy regression on 5′ UTRs from multiple species. Two task-specific variants, UTRLM-TE_EL (translation efficiency and mRNA expression-level prediction) and UTRLM-MRL (mean ribosome loading prediction), were used to generate embeddings.
All language models were used in their pre-trained form with frozen parameters; no fine-tuning or weight updates were performed during training. Each RNA sequence was transformed into a residue-level embedding sequence using the corresponding model, then padded to a fixed length of 161 residues before being fed into the MLP classifier. Embedding dimensionality varied across models, ranging from 120 to 1280. The predictive performance of different embedding types was compared to assess their impact on RNA–ligand binding site prediction. Additionally, a series of models combining two LMs by concatenating their embeddings was constructed and tested. Furthermore, one-hot encoding of bases was employed as a baseline.
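The padding and concatenation steps can be illustrated with NumPy. This sketch uses random arrays as stand-ins for real language model outputs, and it assumes that two embeddings are concatenated along the feature axis; the helper names `pad_embedding` and `pad_labels` are hypothetical.

```python
import numpy as np

MAX_LEN = 161  # fixed sequence length from the text


def pad_embedding(emb, max_len=MAX_LEN):
    """Zero-pad an (L, D) embedding to (max_len, D) and return a mask
    that is True at real residues and False at padding."""
    length, dim = emb.shape
    if length > max_len:
        raise ValueError("sequence is longer than max_len")
    padded = np.zeros((max_len, dim), dtype=emb.dtype)
    padded[:length] = emb
    mask = np.arange(max_len) < length
    return padded, mask


def pad_labels(labels, max_len=MAX_LEN, pad_value=-1):
    """Pad a per-residue label list with -1, the padding label in the text."""
    out = np.full(max_len, pad_value, dtype=np.int64)
    out[:len(labels)] = labels
    return out


rng = np.random.default_rng(42)
emb_a = rng.normal(size=(30, 768)).astype(np.float32)  # stand-in, ERNIE-RNA size
emb_b = rng.normal(size=(30, 640)).astype(np.float32)  # stand-in, RNA-FM size

# Assumption: concatenation joins the per-residue feature vectors.
combined = np.concatenate([emb_a, emb_b], axis=1)
x, mask = pad_embedding(combined)
y = pad_labels([0, 1] * 15)
```

With this layout the classifier input size for a concatenated model is simply the sum of the two embedding sizes (here 768 + 640 = 1408).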
Loss functions
Binding nucleotides (the positive class) make up only a small proportion of each RNA sequence, leading to a severe class imbalance problem. To address this issue and quantitatively evaluate how different training objectives affect performance, we employed six loss functions.
Each loss function represents a different optimization strategy. Binary cross-entropy (BCE) serves as the conventional baseline for binary classification, encouraging predicted probabilities to converge to the ground truth. Class-balanced focal loss [27] and Tversky loss [28] are designed to address the class imbalance. The class-balanced focal loss emphasizes hard-to-classify samples by assigning them higher weights. In contrast, the Tversky loss applies asymmetric weighting to false positives and false negatives. By giving different penalties, the loss function tries to improve model sensitivity for an imbalanced dataset. Dice loss [29] and Lovász hinge loss [30] focus on residue-level spatial alignment and structural consistency. Dice loss maximizes the overlap between predicted and true positive regions, whereas Lovász hinge loss optimizes the Intersection over Union directly.
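To make the relationship between the overlap-based losses concrete, the sketch below implements soft Tversky and Dice losses on predicted probabilities. The weights `alpha` (false positives) and `beta` (false negatives) and the smoothing term `eps` are illustrative assumptions, as the text does not report the values used; Dice loss is obtained as the symmetric special case with both weights equal to 0.5.

```python
def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """Soft Tversky loss for binary labels.

    alpha weights false positives and beta weights false negatives;
    the default values here are illustrative, not taken from the paper.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss, i.e. Tversky loss with alpha = beta = 0.5."""
    return tversky_loss(probs, targets, alpha=0.5, beta=0.5, eps=eps)


probs = [0.9, 0.2, 0.1, 0.6]  # predicted binding probabilities (toy values)
targets = [1, 1, 0, 0]        # ground-truth labels
loss_tversky = tversky_loss(probs, targets)
loss_dice = dice_loss(probs, targets)
```

Setting `beta` above `alpha` penalizes missed binding residues more heavily than false alarms, which is how the asymmetric weighting can raise sensitivity on an imbalanced dataset.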
Lastly, we introduce a composite loss that combines triplet center loss (TCL) [31] and class-balanced focal loss to simultaneously improve class separation in the embedding space and address the class imbalance. This design is inspired by Wang et al. [32], who demonstrated strong performance in protein-small molecule binding site prediction using a combination of TCL and class-balanced focal loss to enhance both feature discrimination and class imbalance handling. The total loss is defined as shown in Equation 1.
$$L_{\mathrm{total}} = L_{\mathrm{focal}} + \lambda L_{\mathrm{tcl}} \qquad (1)$$
$L_{\mathrm{focal}}$ and $L_{\mathrm{tcl}}$ are the class-balanced focal loss and the TCL, respectively, and $\lambda$ is a weight that balances the two losses. $L_{\mathrm{focal}}$ assigns higher weights to address class imbalance, as shown in Equation (2).
$$L_{\mathrm{focal}} = -\sum_{i} \frac{1}{E_n} \left(1 - p_i^t\right)^{\gamma} \log\left(p_i^t\right) \qquad (2)$$
$p_i^t$ denotes the predicted probability of the true class $t$, and $(1 - p_i^t)^{\gamma}$ serves as a modulating factor that down-weights well-classified samples and focuses the training on hard examples. The focusing parameter $\gamma$ controls the relative importance of easy and hard samples. The term $E_n$ represents the effective number of samples for each class and balances the impact of class frequency.
TCL learns a center vector for each class and encourages embeddings of the same class to cluster together while enforcing a margin-based separation between different classes (Equation 3).
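$$L_{\mathrm{tcl}} = \max\left(0,\; D(f, c_y) + m - D(f, c_{1-y})\right) \qquad (3)$$
Here, $f$ is the embedding vector of the input, $c_y$ and $c_{1-y}$ denote the center vectors of the true and opposite classes, respectively, $D(\cdot,\cdot)$ is the distance between two vectors in the embedding space, and $m$ is the margin. The hyperparameters were set as $\gamma = 3$, $\beta = 0.999$, $\lambda = 0.2$, and $m = 4$, following the optimal configuration reported by Wang et al. [32].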
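A small numerical sketch may help show how the two terms interact. It rests on several assumptions that the text does not confirm: the effective number follows the definition in the class-balanced loss paper [27], (1 − β^n)/(1 − β) for a class with n samples; distances are Euclidean; the TCL term is the one scaled by the weight; and losses are averaged over residues. Class centers are fixed here for illustration, whereas in training they would be learned.

```python
import math

# Hyperparameter values as listed in the text.
GAMMA, BETA, LAMBDA, MARGIN = 3.0, 0.999, 0.2, 4.0


def effective_number(n, beta=BETA):
    """Effective number of samples for a class of size n (assumed form)."""
    return (1.0 - beta ** n) / (1.0 - beta)


def cb_focal(p_true, labels, class_counts, gamma=GAMMA):
    """Class-balanced focal loss, averaged over residues (assumption)."""
    total = 0.0
    for p, y in zip(p_true, labels):
        weight = 1.0 / effective_number(class_counts[y])
        total += -weight * (1.0 - p) ** gamma * math.log(p)
    return total / len(labels)


def tcl(embeddings, labels, centers, margin=MARGIN):
    """Triplet center loss for two classes with Euclidean distance (assumption)."""
    total = 0.0
    for f, y in zip(embeddings, labels):
        d_own = math.dist(f, centers[y])
        d_other = math.dist(f, centers[1 - y])
        total += max(0.0, d_own + margin - d_other)
    return total / len(labels)


def composite_loss(p_true, embeddings, labels, centers, class_counts, lam=LAMBDA):
    """Focal term plus weighted TCL term (assumed placement of the weight)."""
    focal = cb_focal(p_true, labels, class_counts)
    return focal + lam * tcl(embeddings, labels, centers)


# Toy example: 2-D embeddings with fixed, hypothetical class centers.
centers = {0: (0.0, 0.0), 1: (10.0, 0.0)}
class_counts = {0: 90, 1: 10}
labels = [0, 1, 1]
p_true = [0.9, 0.8, 0.6]  # predicted probability of each true class
embeddings = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
total = composite_loss(p_true, embeddings, labels, centers, class_counts)
```

In this toy case the third residue sits midway between the two centers, so it is the only one that violates the margin and contributes to the TCL term.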
Evaluation metrics
All metrics are calculated from the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) in the confusion matrix. Precision is the ratio of TPs to all predicted positives (TP + FP), while recall is the ratio of TPs to all actual positives (TP + FN). The F1-score is the harmonic mean of precision and recall and evaluates the balance between the two metrics. The Matthews correlation coefficient (MCC), a metric robust to class imbalance, includes all elements of the confusion matrix and is calculated as follows:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (4)$$
The value of MCC can range from +1 (perfect prediction) to −1 (perfect mismatch).
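The threshold-based metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. Returning 0 when a denominator is zero is a convention chosen here for illustration, not something specified in the text, and the counts in the example are made up.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and MCC from confusion-matrix counts.

    Convention (assumption): a metric with a zero denominator is set to 0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy counts (hypothetical): 40 binding and 60 non-binding residues.
metrics = confusion_metrics(tp=30, fp=10, tn=50, fn=10)
```

For these counts precision, recall, and F1 are all 0.75, while the MCC is lower because it also accounts for the true negatives and both error types.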
The area under the receiver operating characteristic curve (AUROC) quantifies the model’s discrimination ability by measuring the area under the ROC curve (TP rate versus FP rate), and area under the precision–recall curve (AUPRC) measures the area under the precision–recall curve, summarizing the trade-off between precision and recall across all thresholds.
Laplacian-based curvature for RNA 3D structure analysis
To analyze and characterize the three-dimensional RNA structures quantitatively, the Laplacian-based curvature descriptor (LN) [12] is employed. This metric quantifies the curvature at each nucleotide by measuring the deviation of its C3′ atom coordinates from the relative positions of surrounding residues. It thereby captures local structural distortion and reflects how naturally a nucleotide is embedded within the overall structure.
For each nucleotide, a coordinate vector was defined by the C3′ atom position. A Gaussian kernel-based weighting function was then constructed from the pairwise Euclidean distance matrix of all residues, following Equation 5.
$$w_{ij} = \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{d^2}\right) \qquad (5)$$
Here, $d$ is a single scale parameter, set to the median of all pairwise distances between nucleotides in the structure, and $p_i$ and $p_j$ denote the Euclidean coordinates of the C3′ atoms of nucleotides $i$ and $j$. This choice of scale balances local and global shape sensitivity, avoiding excessive localization or over-smoothing.
Using the weight function, the Laplacian norm value $LN_i$ for nucleotide $i$ is defined as in Equation 6.
$$LN_i = \left\lVert p_i - \frac{\sum_{j \neq i} w_{ij}\, p_j}{\sum_{j \neq i} w_{ij}} \right\rVert \qquad (6)$$
This value reflects how far nucleotide i lies from its Gaussian-weighted center, capturing the relative positional characteristics within the RNA structure. Higher LN values indicate convex or protruding regions, while lower values suggest concave or densely packed areas. The computed LN values were utilized for analysis and structural visualization.
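The descriptor can be sketched in pure Python as follows. The exact kernel form is an assumption: the weight is taken as exp(−dist²/d²) with d the median pairwise distance, and each nucleotide is excluded from its own weighted center. The coordinates are toy values rather than real C3′ positions.

```python
import math
from itertools import combinations
from statistics import median


def laplacian_norms(coords):
    """Distance of each point from the Gaussian-weighted center of the others.

    Assumptions: weight = exp(-dist^2 / d^2), d = median pairwise distance,
    and the point itself is excluded from its own weighted center.
    """
    d = median(math.dist(a, b) for a, b in combinations(coords, 2))
    norms = []
    for i, p_i in enumerate(coords):
        weight_sum = 0.0
        center = [0.0, 0.0, 0.0]
        for j, p_j in enumerate(coords):
            if i == j:
                continue
            w = math.exp(-math.dist(p_i, p_j) ** 2 / d ** 2)
            weight_sum += w
            center = [c + w * x for c, x in zip(center, p_j)]
        center = [c / weight_sum for c in center]
        norms.append(math.dist(p_i, center))
    return norms


# Toy example: three collinear points standing in for C3' atoms.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ln_values = laplacian_norms(coords)
```

In this toy chain the middle point coincides with the weighted center of its neighbours and gets a value of zero, while the two end points, which protrude from the rest, receive larger and equal values.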
Evaluating computational cost in comparison with other machine learning models
To evaluate the computational cost, all sequences were embedded using the frozen RiNALMo model on an NVIDIA RTX A6000 GPU. The resulting 1280-dimensional embeddings were used to compare the cost of the MLP with that of logistic regression (LR) and random forest (RF) classifiers. The hyperparameter settings for LR and RF are provided in Supplementary Table S1.
Results
Overall performance comparison and model selection
A total of 60 models, combining 10 RNA LM embeddings with 6 loss functions, were evaluated and compared. Table 2 lists the ten models with the highest MCC and the best one-hot encoding model, benchmarked on the test set. Supplementary Table S2 shows the performance of the individual models.
Table 2.
Performance of the top 10 MCC models and a baseline using one-hot encoding on the test set.a
| Language model | Loss function | MCC | AUROC | AUPRC | Precision | Recall |
|---|---|---|---|---|---|---|
| ERNIE-RNA | TCL focal | 0.657 | 0.868 | 0.817 | 0.787 | 0.718 |
| RNA-FM | TCL focal | 0.641 | 0.876 | 0.806 | 0.755 | 0.732 |
| RiNALMo | BCE | 0.629 | 0.881 | 0.816 | 0.737 | 0.738 |
| ERNIE-RNA | Focal | 0.626 | 0.874 | 0.808 | 0.741 | 0.726 |
| ERNIE-RNA | BCE | 0.625 | 0.863 | 0.798 | 0.750 | 0.714 |
| RiNALMo | Lovasz hinge | 0.618 | 0.874 | 0.779 | 0.731 | 0.727 |
| RNA-FM | Focal | 0.615 | 0.887 | 0.816 | 0.737 | 0.714 |
| RNA-FM | BCE | 0.613 | 0.881 | 0.808 | 0.706 | 0.753 |
| SpliceBERT-510 | BCE | 0.613 | 0.853 | 0.785 | 0.740 | 0.705 |
| RiNALMo | TCL focal | 0.609 | 0.876 | 0.802 | 0.713 | 0.736 |
| One-hot | Tversky | 0.099 | 0.568 | 0.329 | 0.336 | 0.573 |
aThe highest value for each metric among the ten models is highlighted in bold.
All top 10 models outperformed the one-hot baseline (MCC: 0.099, AUROC: 0.568), indicating that the performance of CoBRA primarily originates from the contextual representations learned by RNA language models. The highest MCC was obtained by combining ERNIE-RNA with the TCL focal loss. This model also ranked second in AUPRC and precision among the 60 models considered. ERNIE-RNA, RNA-FM, and RiNALMo emerged as top contenders, each appearing three times among the top ten MCC models. All three LMs were pre-trained on ncRNAs. Conversely, the LMs pre-trained on pre-mRNA or mRNA UTR regions do not appear among the top 10, except for SpliceBERT-510. The lower performance of the mRNA-pretrained models might be due to their limited exposure to a variety of RNA structures, which could reduce their generalizability in binding site prediction compared with their ncRNA-pretrained counterparts. Nine of the top 10 MCC models use entropy-like loss functions (focal, TCL focal, or BCE). The only exception is the Lovasz hinge loss with RiNALMo, which ranks sixth in terms of MCC.
The distribution of MCC, AUPRC, and AUROC for all 60 models across the LMs and loss functions is shown in Fig. 2 and Supplementary Table S3. Consistent with the top 10 MCC models, RiNALMo, ERNIE-RNA, and RNA-FM, which were pre-trained on ncRNAs, exhibited the highest average MCC, AUROC, AUPRC, and precision values (Fig. 2A). The five LMs trained on mRNA-UTR regions demonstrated consistent performance, with MCC ranging from 0.292 to 0.346. Conversely, the performance of the LMs trained on ncRNAs varied considerably. ERNIE-RNA, RiNALMo, and RNA-FM, which were trained on more than 20 million sequences, outperformed the mRNA-UTR models, while RNABERT and RNA-MSM showed lower MCC values than the UTR models. In terms of MCC, the best model using RNABERT (0.331) performed worse than the worst model using RiNALMo (0.372).
Figure 2.
Distribution of MCC (left), AUPRC (middle), and AUROC (right) across all RNA LMs (A) and loss functions (B).
With regard to loss functions, models trained with BCE and focal loss showed consistently superior performance across the key evaluation metrics compared with the other loss functions. Conversely, models employing Dice loss and Lovasz hinge loss exhibited unstable convergence and suboptimal performance in most cases (Fig. 2B).
The LMs with the top three MCC values, ERNIE-RNA, RNA-FM, and RiNALMo, were subsequently selected for the concatenation experiment (Supplementary Table S4). The combination of ERNIE-RNA and RNA-FM yielded the highest MCC among the concatenated models. However, none of the concatenated models achieved an MCC higher than that of the best single-LM model. This implies that the embeddings might contain analogous information, rendering their combination ineffective.
From the single-embedding models of the main experiments, the five configurations with the highest MCC were selected and subsequently benchmarked on the external validation sets.
Structure-based split dataset evaluation
To assess the generalizability of the models, the sixty models were retrained and evaluated using the RNABind structure-based split dataset. This ensures strict non-redundancy between the training and the test sets at both levels. The results of the top five models are given in Supplementary Table S5. Compared to the methods that combine sequence information from LMs with structural information, CoBRA showed lower AUROC values, ranging from 0.605 to 0.657, while RNABind, RLBind, Ret, and MVRBind reported AUROC values ranging from 0.671 to 0.776. On the other hand, CoBRA and RNABind demonstrated similar performance in terms of AUPRC.
Comparison with other existing methods on benchmark sets
We benchmarked the top five MCC models on four test sets composed of RNA-compound crystal structures: RB9, JL10, TL12, and TE18. Table 3 presents the performance of the models on each set. The CoBRA models are designated CoBRA-M1, M2, and so forth, following their MCC ranks.
Table 3.
Comparison of CoBRA model performances on various test sets.a
| | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|
| Language model | ERNIE-RNA | RNA-FM | RiNALMo | ERNIE-RNA | ERNIE-RNA |
| Loss function | TCL focal | TCL focal | BCE | Focal | BCE |
| RB9 | | | | | |
| MCC | 0.585 | 0.593 | 0.557 | 0.538 | 0.559 |
| AUROC | 0.829 | 0.860 | 0.844 | 0.845 | 0.855 |
| Precision | 0.789 | 0.770 | 0.750 | 0.714 | 0.743 |
| Recall | 0.643 | 0.682 | 0.650 | 0.669 | 0.662 |
| JL10 | | | | | |
| MCC | 0.272 | 0.331 | 0.293 | 0.265 | 0.292 |
| AUROC | 0.721 | 0.747 | 0.739 | 0.713 | 0.720 |
| Precision | 0.590 | 0.656 | 0.582 | 0.563 | 0.656 |
| Recall | 0.463 | 0.453 | 0.531 | 0.516 | 0.375 |
| TL12 | | | | | |
| MCC | 0.575 | 0.521 | 0.546 | 0.520 | 0.499 |
| AUROC | 0.821 | 0.800 | 0.816 | 0.819 | 0.824 |
| Precision | 0.859 | 0.808 | 0.816 | 0.792 | 0.785 |
| Recall | 0.589 | 0.570 | 0.599 | 0.589 | 0.565 |
| TE18 | | | | | |
| MCC | 0.103 | 0.114 | 0.190 | 0.162 | 0.176 |
| AUROC | 0.580 | 0.594 | 0.647 | 0.634 | 0.626 |
| Precision | 0.504 | 0.473 | 0.562 | 0.525 | 0.537 |
| Recall | 0.253 | 0.465 | 0.357 | 0.390 | 0.394 |
| Average | | | | | |
| MCC | 0.384 | 0.390 | 0.397 | 0.371 | 0.382 |
| AUROC | 0.738 | 0.750 | 0.762 | 0.753 | 0.756 |
| Precision | 0.686 | 0.677 | 0.678 | 0.649 | 0.680 |
| Recall | 0.487 | 0.543 | 0.534 | 0.541 | 0.499 |
aThe top five MCC models were selected for validation. The highest value for each metric is highlighted in bold.
Although CoBRA-M1 (ERNIE-RNA combined with the TCL focal loss) showed the highest MCC, AUPRC, and precision among the five models on the internal test set, it was not the best performer on the four benchmark sets. Instead, CoBRA-M3, which uses RiNALMo as the RNA LM and BCE as the loss function, achieved the highest MCC and AUROC averaged over the four benchmark sets. For precision and recall averaged over the four sets, M1 and M2 performed best, respectively.
With regard to the MCC from individual benchmark sets, M2 has the highest value for JL10 and RB9, whereas M1 shows the best MCC for TL12. While M3 is only ranked top for TE18 among the five CoBRA models, it is ranked second for JL10, a set with a high structural complexity, and TL12, a set with a low structural complexity. This suggests that M3 has greater generalizability for RNA-compound binding site prediction problems than other models. Consequently, M3 is selected as a representative model for CoBRA, and the results are subjected to further analysis.
A comparison with other RNA-compound binding site prediction methods showed that CoBRA exhibited superior performance on all benchmark sets except TE18 (Table 4). On average, CoBRA showed improvements of 15.4%, 9.2%, and 33.8% in MCC, AUROC, and recall, respectively, compared with ZeSTa, a state-of-the-art structure-based program. On the datasets with relatively simple structures, RB9 and TL12, CoBRA outperformed the other methods in all metrics except RB9 precision and AUROC. Similarly, CoBRA achieved the best performance in all metrics on the highly complex structure set, JL10. In contrast, on the TE18 dataset the MCC of CoBRA was 41.9% lower than that of ZeSTa.
Table 4.
Comparison of performances with other RNA-ligand binding site prediction programs.a
| | Rsite | Rsite2 | RNAsite | RBind | RLBind | ZeSTa | RNABind | MVRBind | CoBRA |
|---|---|---|---|---|---|---|---|---|---|
| RB9 | | | | | | | | | |
| MCC | 0.159 | 0.072 | 0.426 | 0.278 | – | 0.398 | – | – | 0.557 |
| AUROC | 0.575 | 0.528 | 0.790 | 0.592 | – | 0.786 | 0.865 | – | 0.844 |
| Precision | 0.430 | 0.382 | 0.750 | 0.681 | – | 0.767 | – | – | 0.750 |
| Recall | 0.353 | 0.187 | 0.378 | 0.230 | – | 0.408 | – | – | 0.650 |
| JL10 | | | | | | | | | |
| MCC | 0.046 | 0.007 | – | 0.083 | – | 0.211 | – | – | 0.293 |
| AUROC | 0.477 | 0.504 | – | 0.532 | – | 0.592 | 0.592 | – | 0.739 |
| Precision | 0.295 | 0.338 | – | 0.433 | – | 0.549 | – | – | 0.582 |
| Recall | 0.194 | 0.131 | – | 0.142 | – | 0.296 | – | – | 0.531 |
| TL12 | | | | | | | | | |
| MCC | – | – | – | – | – | 0.440 | – | – | 0.546 |
| AUROC | – | – | – | – | – | 0.704 | 0.754 | – | 0.816 |
| Precision | – | – | – | – | – | 0.740 | – | – | 0.816 |
| Recall | – | – | – | – | – | 0.514 | – | – | 0.599 |
| TE18 | | | | | | | | | |
| MCC | 0.071 | 0.010 | 0.253 | 0.187 | 0.324 | 0.327 | – | 0.351 | 0.190 |
| AUROC | 0.590 | 0.474 | 0.776 | 0.559 | 0.720 | 0.709 | 0.737 | 0.745 | 0.647 |
| Precision | 0.449 | 0.370 | 0.675 | 0.655 | 0.681 | 0.729 | – | 0.645 | 0.562 |
| Recall | 0.288 | 0.214 | 0.263 | 0.173 | 0.345 | 0.379 | – | 0.342 | 0.357 |
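The average relative changes of CoBRA with respect to ZeSTa can be re-derived from the per-dataset values in Table 4, as in the short calculation below. The helper names are hypothetical, and the results may differ from the percentages quoted in the text by about 0.1 percentage point because the text appears to work from averages already rounded to three decimals.

```python
# Per-dataset values from Table 4, in the order RB9, JL10, TL12, TE18.
zesta = {
    "MCC": [0.398, 0.211, 0.440, 0.327],
    "AUROC": [0.786, 0.592, 0.704, 0.709],
    "Recall": [0.408, 0.296, 0.514, 0.379],
}
cobra = {
    "MCC": [0.557, 0.293, 0.546, 0.190],
    "AUROC": [0.844, 0.739, 0.816, 0.647],
    "Recall": [0.650, 0.531, 0.599, 0.357],
}


def mean(values):
    return sum(values) / len(values)


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


# Relative change of the four-set averages for each metric.
avg_gain = {
    metric: relative_change(mean(cobra[metric]), mean(zesta[metric]))
    for metric in zesta
}

# Relative change of the MCC on TE18 alone (last entry of each list).
te18_mcc_change = relative_change(cobra["MCC"][-1], zesta["MCC"][-1])
```

The TE18 value comes out negative, reflecting the drop in MCC on that set, while the three averaged metrics all show gains.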
One of the successful cases of CoBRA is illustrated in Fig. 3A: the structure of the pir-miRNA-300 apical loop fused to the ydaO riboswitch scaffold, complexed with c-di-AMP (PDB ID: 6WTR), from the JL10 dataset. The riboswitch regulates gene expression in bacteria by sensing concentrations of ATP and c-di-AMP. The RNA structure comprises two three-way junctions connected by two helices and a large conserved interior loop. The compound works as an off switch when binding to the RNA. The binding site of the molecule is located between the two pseudoknots, stabilizing the RNA structure through intermolecular stacking [33, 34]. On this structure, CoBRA substantially outperformed ZeSTa: the MCC, precision, and recall of CoBRA are 0.923, 0.946, and 0.946, respectively, while those of ZeSTa are 0.121, 0.455, and 0.142, respectively. The unusual characteristics of this binding site make it hard for structure-based methods to predict, whereas the sequence-based CoBRA performed well.
Figure 3.

Case studies of CoBRA. (A) A successfully predicted case: pri-miRNA-300 complexed with bis-(3′,5′)-cyclic-dimeric-adenosine-monophosphate (PDB ID: 6WTR). (B) RNA aptamer complexed with flavin mononucleotide (PDB ID: 1FMN). The right panel displays the distribution of Laplacian norm values, where blue indicates lower curvature (concave) and red indicates higher curvature (convex). (C) A wrongly predicted case: sisomicin bound to the bacterial ribosomal decoding site (PDB ID: 4F8U). The right panel shows the RNA homodimer, highlighting the inter-chain binding interface.
The computational cost of CoBRA was then compared with that of the LR and RF models. Generating the embedding with RiNALMo took 111 ms per sequence. With fixed embeddings, the MLP required 109.97 s for training and 0.222 s for testing, whereas LR required 44.34 s and 0.008 s, and RF required 74.79 s and 0.101 s.
Structural analysis of TE18 dataset
We then examined the results on TE18, the test set on which CoBRA performed worst. DSSR (Dissecting the Spatial Structure of RNA) [35] was used to assign secondary structures from the RNA crystal structures. The secondary structures were classified into eight categories: stem, canonical, isolated canonical, internal loop, hairpin loop, bulge, junction, and helix. The prediction performance of CoBRA was then analyzed by structural type.
CoBRA identified binding sites with an accuracy of at least 51% across all structural types. The highest accuracy, 92%, was observed at junction sites. In contrast, the accuracy for non-binding nucleotides varied with structure: it was higher in stems (68%) and junctions (67%) but lower in internal loops (47%) (Supplementary Fig. S1).
We also conducted a residue-level curvature analysis of the TE18 dataset by calculating the LN. As the LN value increases, the local geometry around the residue changes from concave to convex. The LN values of the binding and non-binding sites show distinct distributions (Supplementary Fig. S2), suggesting structural differences between the two classes. The binding-site nucleotides have LN values ranging from 1.5 to 20.7, with an average of 10.9, whereas the non-binding nucleotides have larger values, ranging from 2.3 to 23.0, with an average of 12.7. This suggests that RNA compound binding sites are relatively concave, a finding also reported by Su et al. [11].
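The per-structure breakdown described above can be reproduced with a simple grouping step, sketched below. This is an assumption-laden illustration rather than the authors' script: it interprets "accuracy for binding sites" as the fraction of true binding nucleotides predicted as binding within each structural type (and analogously for non-binding nucleotides), and the function name `per_type_accuracy` and the input layout are hypothetical.

```python
from collections import defaultdict


def per_type_accuracy(structure_types, y_true, y_pred):
    """Group nucleotides by structural type and score each class separately.

    structure_types: one label per nucleotide (e.g. "stem", "junction").
    y_true, y_pred:  0/1 binding labels and predictions, same length.
    Returns {type: {"binding_acc": ..., "nonbinding_acc": ...}}; a value is
    None when the type contains no nucleotide of that class.
    """
    counts = defaultdict(lambda: {"pos": 0, "pos_hit": 0, "neg": 0, "neg_hit": 0})
    for s_type, t, p in zip(structure_types, y_true, y_pred):
        c = counts[s_type]
        if t == 1:
            c["pos"] += 1
            c["pos_hit"] += int(p == 1)  # binding nucleotide correctly found
        else:
            c["neg"] += 1
            c["neg_hit"] += int(p == 0)  # non-binding nucleotide correctly rejected

    result = {}
    for s_type, c in counts.items():
        result[s_type] = {
            "binding_acc": c["pos_hit"] / c["pos"] if c["pos"] else None,
            "nonbinding_acc": c["neg_hit"] / c["neg"] if c["neg"] else None,
        }
    return result
```

Reporting the two classes separately, instead of a single accuracy per type, avoids the majority non-binding class masking poor detection of binding nucleotides.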
Although CoBRA does not take structural characteristics such as concavity as input features, a statistically significant relationship was observed between the model predictions and the LN value, with a point-biserial correlation of −0.171 and a P-value of 2.8e-5. This suggests that a model trained solely on RNA sequence embeddings can capture structural characteristics through inherent sequence patterns. We also observed a correlation of −0.247 between the LN value and the actual binding label, indicating that actual binding sites tend to be located in structurally concave regions. Correlation analyses with the TP/TN/FP/FN assignments from the CoBRA predictions show that TP and TN have relatively high correlation coefficients (TP: r = −0.188, P = 4.0e-6; TN: r = 0.260, P = 1.2e-10). No significant relationship was found for FP (r = −0.027, P = 0.505), and FN exhibited a modest negative correlation (r = −0.126, P = 2.0e-3), indicating that missed binding nucleotides become slightly less frequent as the LN value increases. A notable example is provided in Fig. 3B (PDB ID: 1FMN), which illustrates a substantial correlation between the CoBRA prediction and the LN values. For the flavin binding site of this RNA aptamer, defined as the nucleotides within 4 Å of the co-crystallized ligand, the correlation coefficient was −0.541 (P = 8.0e-4). The right panel displays the LN values of the nucleotides, indicated by the color change from blue (concave) to red (convex). CoBRA predicted this binding site with a precision of 0.636. However, the program missed three binding nucleotides with low LN values (average: 8.36) and erroneously predicted four nucleotides with high LN values (average: 9.89). This finding may reflect a limitation of using only RNA sequence information.
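For reference, the point-biserial correlation used above is the Pearson correlation between a binary variable and a continuous one. The following sketch shows the calculation under the assumption that predictions are coded as 1 (binding) and 0 (non-binding); the function name `point_biserial` is hypothetical, and the paper does not state which software was used for this analysis.

```python
import math


def point_biserial(binary, values):
    """Point-biserial correlation between 0/1 labels and continuous values.

    r = (mean_1 - mean_0) / sd * sqrt(n_1 * n_0 / n^2),
    where sd is the population standard deviation of all values.
    """
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    if not group1 or not group0:
        raise ValueError("both classes must be present")

    n = len(values)
    mean_all = sum(values) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("values must not be constant")

    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd * math.sqrt(len(group1) * len(group0) / n ** 2)
```

With this coding, a negative coefficient means that nucleotides labeled 1 have lower LN values on average, that is, they sit in more concave regions.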
One of the worst-predicted cases of CoBRA in the TE18 dataset is sisomicin complexed with the bacterial ribosomal decoding site (PDB ID: 4F8U, Fig. 3C). The crystal structure contains homodimeric RNA chains, and two sisomicin molecules bind at the dimer interface. For prediction, the RNA sequence of a single chain, chain B of the PDB structure, was used as the input. CoBRA successfully predicted one of the binding sites (the red region in the left panel). However, because the reference binding site is defined by the ligand that shares the chain ID of the input RNA sequence (the blue region), the accuracy for this sample was 0.0. Visual inspection of chains A and B together (right panel) revealed that the correctly located region had been counted as a false positive. This case demonstrates that the measured prediction accuracy may decrease when binding sites are located at the interface between the query sequence and other chains. This phenomenon has also been observed in protein-ligand binding site prediction using LMs [36]. Potential solutions include inferring structural information or accepting multiple chains as input.
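The effect of the chain-based labeling in this case can be illustrated with a toy distance-cutoff labeler. The sketch below assumes that a nucleotide is labeled as binding when any of its atoms lies within 4 Å of any ligand atom (the cutoff mentioned for the 1FMN example); the function name `binding_labels` and all coordinates are hypothetical and do not come from 4F8U.

```python
import math


def binding_labels(residue_atoms, ligand_atoms, cutoff=4.0):
    """Label each nucleotide 1 if any atom is within `cutoff` Å of a ligand atom.

    residue_atoms: list of nucleotides, each a list of (x, y, z) tuples.
    ligand_atoms:  flat list of (x, y, z) tuples for the ligands considered.
    """
    labels = []
    for atoms in residue_atoms:
        close = any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms)
        labels.append(int(close))
    return labels


# Toy chain B with three nucleotides (one pseudo-atom each, hypothetical).
chain_b = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
ligand_chain_b = [(0.0, 3.0, 0.0)]   # ligand carrying the chain B identifier
ligand_chain_a = [(20.0, 0.0, 3.5)]  # ligand assigned to chain A, at the interface

same_chain_only = binding_labels(chain_b, ligand_chain_b)
all_ligands = binding_labels(chain_b, ligand_chain_b + ligand_chain_a)
```

Here `same_chain_only` is `[1, 0, 0]` and `all_ligands` is `[1, 0, 1]`. A prediction of `[0, 0, 1]` would therefore score zero recall against the same-chain labels even though the third nucleotide does contact a ligand at the interface, which mirrors the situation described for 4F8U.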
Prediction results on metal binding sites
Metal ions such as Mg2+ and Na+ often bind diffusely across the RNA surface to neutralize the negatively charged phosphate backbone, without forming specific binding pockets. These ions contribute to RNA structural stability and folding, as previously noted by Draper et al. [37]. In contrast, organic molecules typically bind within well-defined pockets, which makes their binding sites easier to localize and predict. Predicting metal binding sites may therefore be more challenging than predicting organic molecule binding sites. In addition, because metal ion binding is governed largely by electrostatic and steric factors, an ion may associate with multiple structurally similar regions with comparable affinity, which complicates the prediction task, as reported for MetalionRNA [38].
The binding sites of the test sets were classified according to the type of binding molecule: metal ion or non-metallic compound. The accuracies of metal binding site prediction were 0.647, 0.531, 0.497, and 0.202 for RB9, JL10, TL12, and TE18, respectively. As anticipated, the non-metallic compound category demonstrated higher levels of accuracy, with the values obtained for RB9, JL10, TL12, and TE18 being 0.733, 0.739, 0.690, and 0.468, respectively. Detailed information is summarized in Supplementary Table S6. This discrepancy is likely attributable to the inherent nature of RNA–metal ion interactions. The incorporation of physico-chemical characteristics has the potential to enhance the efficacy of predicting metal binding sites.
Discussion
Given the recent emphasis on RNA as a promising therapeutic target, the prediction of compound binding sites could serve as a fundamental starting point. RNA-ligand binding site prediction models can be broadly divided into two approaches: structure-based and sequence-based. The structure-based approach leverages three-dimensional or secondary structural information to capture critical spatial and geometric features, whereas the sequence-based approach utilizes evolutionary and contextual patterns through sequence embeddings. Although structure-based models generally perform robustly when accurate structural information is available, their performance drops markedly on complex RNA architectures, such as junction loops. Conversely, CoBRA showed lower AUROC values than structure-aware models on the RNABind dataset, reflecting the limitation of sequence-only representations in capturing structural determinants of ligand binding.
The integration of ERNIE-RNA as an RNA LM and focal loss function has led to the development of a novel RNA-compound binding site program, called CoBRA. On various benchmark sets, the model demonstrated performance that was either superior to or comparable to contemporary state-of-the-art prediction methods. The program demonstrated its capacity for generalization across a range of external datasets, including RB9 and TL12, without requiring explicit structural inputs. This finding highlights the efficacy of sequence embeddings in capturing functional signals across a diverse array of RNA architectures.
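As background on the loss function mentioned above, the following sketch shows the standard binary focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). It is a plain-Python illustration and not the CoBRA training code; the defaults γ = 2 and α = 0.25 are the commonly used values from the original focal loss formulation and are assumptions here, since the hyperparameters used for CoBRA are not stated in this section.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over per-nucleotide predictions.

    probs:  predicted probabilities of the binding class, in [0, 1].
    labels: 0/1 ground-truth labels, same length as probs.
    gamma:  focusing parameter; gamma = 0 gives alpha-weighted cross-entropy.
    alpha:  weight for the positive class; (1 - alpha) weights the negatives.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # The (1 - p_t)^gamma factor shrinks the loss of well-classified
        # nucleotides so that hard examples dominate the gradient.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

Because binding nucleotides are much rarer than non-binding ones, down-weighting easy examples in this way is one common means of handling the class imbalance in per-residue classification.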
Despite these advances, structural analysis of the TE18 set, on which CoBRA performed suboptimally, revealed several limitations. The performance of the program depends on the RNA secondary structure type assigned by DSSR. It also tends to miss binding-site nucleotides located in concave regions, as indicated by low LN values. In addition, binding sites located at the interfaces of RNA dimers were often missed when only a single RNA chain was provided as input. Although the program performed comparably to state-of-the-art methods using only LM embeddings, the absence of structural information can cause binding-site predictions to fail. Future development of CoBRA will therefore focus on addressing these deficiencies.
Key Points
Compound binding site prediction for RNA (CoBRA) provides a lightweight deep-learning method for RNA-ligand binding site prediction.
A comprehensive benchmark combining RNA language models and loss functions was conducted to find the optimal combination for CoBRA.
CoBRA can effectively predict RNA-ligand binding sites with a superior performance to the other state-of-the-art methods.
Contributor Information
Wonkyeong Jang, Department of Biomedical Informatics, Korea University College of Medicine, 161 Jeongneung-ro, Seongbuk-gu, Seoul 02708, Republic of Korea.
Woong-Hee Shin, Department of Biomedical Informatics, Korea University College of Medicine, 161 Jeongneung-ro, Seongbuk-gu, Seoul 02708, Republic of Korea; Arontier, Co., 241 Gangnam-daero, Seocho-gu, Seoul 06735, Republic of Korea.
Author contributions
Woong-Hee Shin conceived the study. Wonkyeong Jang designed and implemented CoBRA and conducted the benchmark. Woong-Hee Shin and Wonkyeong Jang analyzed the results. Wonkyeong Jang composed the manuscript. Woong-Hee Shin revised and polished the article. All authors have read and approved the final version of the manuscript.
Conflict of interest: None declared.
Funding
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)—ICT Challenge and Advanced Network of HRD (ICAN) grant funded by the Korea government (Ministry of Science and ICT) [IITP-2025-RS-2022-00156439 and IITP-2025-RS-2024-00438263]; the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korea government (Ministry of Science and ICT) [RS-2025-02217289 and 2022M3E5F3081268 to WHS]; Korea University grant [K2517281 to WHS].
Data availability
All datasets and source code are available at https://github.com/kucm-lsbi/CoBRA.
References
- 1. Esteller M. Non-coding RNAs in human disease. Nat Rev Genet 2011;12:861–74. 10.1038/nrg3074
- 2. Chen G, Wang Z, Wang D. et al. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res 2013;41:D983–6. 10.1093/nar/gks1099
- 3. Lu M, Zhang Q, Deng M. et al. An analysis of human microRNA and disease associations. PloS One 2008;3:e3420. 10.1371/journal.pone.0003420
- 4. de Souza N. The ENCODE project. Nat Methods 2012;10:1046. 10.1038/nmeth.2238
- 5. Hopkins A, Groom C. The druggable genome. Nat Rev Drug Discov 2002;1:727–30. 10.1038/nrd892
- 6. Shao Y, Zhang QC. Targeting RNA structures in diseases with small molecules. Essays Biochem 2020;64:955–66. 10.1042/EBC20200011
- 7. Yu A-M, Choi YH, Tu M-J. RNA drugs and RNA targets for small molecules: principles, progress, and challenges. Pharmacol Rev 2020;72:862–98. 10.1124/pr.120.019554
- 8. Zeng P, Li J, Ma W. et al. Rsite: a computational method to identify the functional sites of noncoding RNAs. Sci Rep 2015;5:9179. 10.1038/srep09179
- 9. Zeng P, Cui Q. Rsite2: an efficient computational method to predict the functional sites of noncoding RNAs. Sci Rep 2016;6:19016. 10.1038/srep19016
- 10. Wang K, Jian Y, Wang H. et al. RBind: computational network method to predict RNA binding sites. Bioinformatics 2018;34:3131–6. 10.1093/bioinformatics/bty345
- 11. Su H, Peng Z, Yang J. Recognition of small molecule–RNA binding sites using RNA sequence and structure. Bioinformatics 2021;37:36–42. 10.1093/bioinformatics/btaa1092
- 12. Gao J, Liu H, Zhuo C. et al. Predicting small molecule binding nucleotides in RNA structures using RNA surface topography. J Chem Inf Model 2024;64:6979–92. 10.1021/acs.jcim.4c01264
- 13. Zhu W, Ding X, Shen H. et al. Identifying RNA-small molecule binding sites using geometric deep learning with language models. J Mol Biol 2025;437. 10.1016/j.jmb.2025.169010
- 14. Sun S, Yang J, Gao L. et al. RNA language model and graph attention network for RNA and small molecule binding sites prediction. Bioinformatics 2025;41. 10.1093/bioinformatics/btaf447
- 15. Sun C, Zhang L, Zhang L. et al. GATRsite: RNA–ligand binding site prediction using graph attention networks and pretrained RNA language models. J Chem Inf Model 2025;65:8448–61. 10.1021/acs.jcim.5c00605
- 16. Chen S, Huang Z, Wang Y. et al. MVRBind: multi-view learning for RNA-small molecule binding site prediction. Brief Bioinform 2025;26:bbaf489. 10.1093/bib/bbaf489
- 17. Panei FP, Torchet R, Menager H. et al. HARIBOSS: a curated database of RNA-small molecules structures to aid rational drug design. Bioinformatics 2022;38:4185–93. 10.1093/bioinformatics/btac483
- 18. Paszke A. Pytorch: an imperative style, high-performance deep learning library. arXiv 2019. 10.48550/arXiv.1912.01703
- 19. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv 2019. 10.48550/arXiv.1711.05101
- 20. Yin W, Zhang Z, He L. et al. ERNIE-RNA: an RNA language model with structure-enhanced representations. bioRxiv 2024. 10.1101/2024.03.17.585376
- 21. Penić RJ, Vlašić T, Huber RG. et al. RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks. Nat Commun 2025;16:5671. 10.1038/s41467-025-60872-5
- 22. Akiyama M, Sakakibara Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genomics Bioinf 2022;4:lqac012. 10.1093/nargab/lqac012
- 23. Chen J, Hu Z, Sun S. et al. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. arXiv 2022. 10.48550/arXiv.2204.00300
- 24. Zhang Y, Lang M, Jiang J. et al. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Res 2024;52:e3. 10.1093/nar/gkad1031
- 25. Chen K, Zhou Y, Ding M. et al. Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Brief Bioinform 2024;25:bbae163. 10.1093/bib/bbae163
- 26. Chu Y, Yu D, Li Y. et al. A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions. Nat Mach Intell 2024;6:449–60. 10.1038/s42256-024-00823-9
- 27. Cui Y, Jia M, Lin T-Y. et al. Class-balanced loss based on effective number of samples. arXiv 2019. 10.48550/arXiv.1901.05555
- 28. Salehi SSM, Erdogmus D, Gholipour A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. arXiv 2017. 10.48550/arXiv.1706.05721
- 29. Milletari F, Navab N, Ahmadi S-A. V-net: fully convolutional neural networks for volumetric medical image segmentation. arXiv 2016. 10.48550/arXiv.1606.04797
- 30. Berman M, Triki AR, Blaschko MB. The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. arXiv 2018. 10.48550/arXiv.1705.08790
- 31. He X, Zhou Y, Zhou Z. et al. Triplet-center loss for multi-view 3d object retrieval. arXiv 2018. 10.48550/arXiv.1803.06189
- 32. Wang J, Liu Y, Tian B. Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning. J Cheminform 2024;16:125. 10.1186/s13321-024-00920-2
- 33. Shoffner G, Peng Z, Guo F. Structures of microRNA-precursor apical junctions and loops reveal non-canonical base pairs important for processing. bioRxiv 2020. 10.1101/2020.05.05.078014
- 34. Gao A, Serganov A. Structural insights into recognition of c-di-AMP by the ydaO riboswitch. Nat Chem Biol 2014;10:787–92. 10.1038/nchembio.1607
- 35. Lu X-J. DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL. Nucleic Acids Res 2020;48:e74. 10.1093/nar/gkaa426
- 36. Chelur VR, Priyakumar UD. BiRDS - binding residue detection from protein sequences using deep ResNets. J Chem Inf Model 2022;62:1809–18. 10.1021/acs.jcim.1c00972
- 37. Draper DE, Grilley D, Soto AM. Ions and RNA folding. Annu Rev Biophys Biomol Struct 2005;34:221–43. 10.1146/annurev.biophys.34.040204.144511
- 38. Philips A, Milanowska K, Lach G. et al. MetalionRNA: computational predictor of metal-binding sites in RNA structures. Bioinformatics 2012;28:198–205. 10.1093/bioinformatics/btr636