PLOS One. 2025 May 13;20(5):e0320817. doi: 10.1371/journal.pone.0320817

iProtDNA-SMOTE: Enhancing protein-DNA binding sites prediction through imbalanced graph neural networks

Ruiyan Huang 1, Wangren Qiu 1, Xuan Xiao 1,2, Weizhong Lin 1,*
Editor: Syed Nisar Hussain Bukhari 3
PMCID: PMC12074593  PMID: 40359455

Abstract

Protein-DNA interactions play a crucial role in cellular biology and are essential for maintaining life processes and regulating cellular functions. We propose a method called iProtDNA-SMOTE, which combines imbalanced graph neural networks with pre-trained protein language models to predict DNA-binding residues. The approach addresses the class imbalance inherent in protein-DNA binding site prediction by applying imbalanced graph learning, thereby improving the model’s generalization and specificity. We trained the model on two datasets, TR646 and TR573, and conducted a series of experiments to evaluate its performance. The model achieved AUC values of 0.850, 0.896, and 0.858 on the independent test datasets TE46, TE129, and TE181, respectively. These results indicate that iProtDNA-SMOTE outperforms existing methods in terms of accuracy and generalization for predicting DNA binding sites, and the model has been validated on these independent test sets for its ability to predict protein-DNA binding sites reliably. For the convenience of the scientific community, the benchmark datasets and codes are publicly available at https://github.com/primrosehry/iProtDNA-SMOTE.

1. Introduction

Protein-DNA binding sites are the regions where transcription factors and other DNA-binding proteins recognize and bind DNA sequences [1], and protein-DNA interactions play key roles in life-sustaining processes and cellular functions such as gene expression regulation, DNA replication, and repair [2,3]. Recognizing these binding sites and annotating their functions are essential for revealing gene regulatory networks, identifying disease-related genes, and elucidating mechanisms of drug action [4]. The rapid development of high-throughput sequencing technologies has led to the identification of many protein sequences with unknown functions. However, identifying these binding sites experimentally is challenging due to the diversity and complexity of protein sequences [5], which impedes a comprehensive understanding of biological processes and the discovery of new drug targets and designs [6]. There is therefore a significant scientific and practical need for rapid and accurate computational methods for predicting protein-DNA interactions. These methods could provide deeper insights into the mechanisms of these interactions [7], aid in the discovery of novel drug targets, and inform targeted therapeutic strategies [8].

Identifying protein-DNA binding sites involves two main approaches: traditional experimental techniques and computational methods. Traditional experimental methods such as protein microarray analysis [9], ChIP-seq [10], X-ray crystallography [11], and cryo-EM [12] provide valuable data but are costly and complex. In contrast, computational methods process protein sequence data quickly, providing a theoretical foundation for experimental validation and mitigating the limitations of experimental approaches. These computational techniques represent proteins by features based on their sequence, structure, and physicochemical properties, including one-hot encoding of amino acids [13], PSSM matrices [14], and protein secondary structures [15].

As machine learning and deep learning have advanced, so too has the sophistication of predictive modeling in the field of protein-DNA interactions. Early methods like support vector machines (SVM) [16] and random forests (RF) [17] have been eclipsed by the current generation of deep neural networks [18], which have significantly improved the accuracy and efficiency of predictions. Convolutional networks and graph neural networks (GNN) [19] have been particularly influential in refining the prediction of protein-DNA binding sites. Notably, the convolutional networks used by Tayara et al. [20], the capsule networks employed by Nguyen et al. [21], and the Inception network utilized by Fang [22] have each yielded substantial improvements in predictive accuracy. Graph neural networks excel at processing graph-structured data, effectively integrating protein sequences, structures, and physicochemical properties to optimize model performance. Yuan et al. introduced the GraphSite model [23], which incorporates tertiary structure information predicted by AlphaFold2. This approach has showcased the potential of these advanced computational techniques in molecular interaction research.

In bioinformatics, predictive models for protein-DNA binding sites are often hampered by imbalanced data distributions, which can significantly degrade both generalization and prediction precision. To mitigate these issues, researchers have developed a range of methods, including resampling strategies and ensemble learning techniques. For example, Hu et al. [24] used random under-sampling to balance the positive and negative samples and then built an ensemble of support vector machine (SVM) classifiers combined via boosting. Gao et al. [25] applied multi-instance learning to predict protein-DNA interactions, while Zhu et al. [26] proposed a subsampling strategy based on the distance of samples from the SVM separating hyperplane, combined with the AdaBoost algorithm, to build a protein-DNA binding site predictor that effectively handles data imbalance.

To address imbalanced datasets within the deep learning paradigm, researchers have employed both data-level and algorithm-level approaches. For instance, GNN-CL combined graph neural networks with data interpolation techniques to synthesize new samples and enrich the dataset [27]. The ImGAGN model employed generative adversarial graph networks to generate synthetic minority class nodes, thereby optimizing model performance through adversarial processes [28]. The GraphSR model utilized pseudo-labeling techniques to enhance model generalization capabilities [29], while the QTIAH-GNN model introduced a multi-level label perception strategy, alongside parameterized similarity metrics and a specially designed loss function, to balance the predictive emphasis between majority and minority classes [30]. Furthermore, graph convolutional network variants specifically designed to handle imbalanced data have emerged, together with loss functions such as Focal Loss [31], which prioritizes learning from minority class instances. In the domain of imbalanced graph learning [32], researchers have proposed methods such as GraphSMOTE [33], GraphENS [34], GATSMOTE [35], and GraphSHA [36]. These methods employ a variety of strategies to strengthen the model’s recognition of minority class nodes and improve classification performance.

In this study, we introduce the iProtDNA-SMOTE model, a prediction framework that integrates the pre-trained protein language model ESM2 [37] with graph neural network architectures. The model is specifically designed to address the challenge of imbalanced data by leveraging the GraphSMOTE [33] method to enhance recognition of minority class nodes. Furthermore, iProtDNA-SMOTE utilizes GraphSAGE [38] and a multi-layer perceptron (MLP) [39] to extract and combine sequence-derived features, thereby achieving high-precision prediction of protein-DNA binding sites. Our empirical evaluation across several benchmark datasets substantiates the model’s superior predictive accuracy and its capacity for generalization in predicting DNA binding sites, highlighting its potential for application within the biomedical field.

2. Materials and methods

2.1 Benchmark datasets

In this study, we evaluated iProtDNA-SMOTE on five datasets that are well-established benchmarks in the field of protein-DNA binding site prediction. These datasets, designated TR646 [40], TE46 [40], TR573, TE129, and TE181, comprise training (TR) and testing (TE) sets.

The TR646 dataset comprises 646 DNA-binding protein chains, encompassing a total of 15,636 DNA-binding sites and 298,503 non-binding sites. The TE46 dataset consists of 46 distinct DNA-binding proteins, with 956 DNA-binding sites and 9,911 non-binding sites. Both datasets were introduced through research using the DBPred model, a deep learning approach focused on predicting protein-DNA binding sites from sequence data. Our study employed the TR646 dataset for training, allowing us to explore the intrinsic properties of DNA-binding proteins in depth. The TE46 test set, with a sequence similarity of no more than 30% to the training set, ensures the rigor and independence of our evaluation.

Further, the TR573 dataset consists of 573 DNA-binding protein chains, containing 14,479 DNA-binding residues and 145,404 non-binding residues. The TE129 dataset includes 129 independent DNA-binding proteins, contributing 2,240 DNA-binding residues and 35,275 non-binding residues. These datasets were introduced through research using the GraphBind model, a graph neural network designed to identify nucleic acid binding residues from structural data. The TE181 dataset, introduced through the GraphSite model, includes 181 DNA-binding protein chains with 3,208 DNA-binding residues and 72,050 non-binding residues. This model uses structural insights from AlphaFold2 to classify DNA-binding residues. In our study, the TR573 dataset served for model training, enhancing our understanding of the characteristics of DNA-binding proteins. To ensure the independence of the test sets, we required a sequence similarity of no more than 30% between proteins in the TE129 and TE181 datasets and those in the TR573 training set. Moreover, to strengthen the model’s generalization capacity, we employed GraphSMOTE during model training to refine its handling of imbalanced data.

For the evaluation, we utilized the same data preprocessing procedures as existing models to ensure fairness in assessment. Table 1 presents a statistical overview of the five datasets for reference. The TR646 and TE46 datasets were introduced by Patiyal et al. using the DBPred model, while the TR573 and TE129 datasets were presented by Xia et al. based on the GraphBind model. The TE181 dataset was introduced by Yuan et al. through the GraphSite model. We employed TR646 as the training set and its corresponding independent test set, TE46, for evaluation. The TE129 and TE181 test datasets were applied to assess the model trained on TR573. Through the application of these datasets, we comprehensively evaluated the iProtDNA-SMOTE model’s capability in predicting protein-DNA binding sites.

Table 1. Summary of benchmark protein–DNA binding datasets.

| Dataset type | Dataset | DNA-binding residues | Non-binding residues | % of binding residues |
| --- | --- | --- | --- | --- |
| Training | TR646 | 15636 | 298503 | 4.98 |
| Training | TR573 | 14479 | 145404 | 9.06 |
| Test | TE46 | 956 | 9911 | 8.87 |
| Test | TE129 | 2240 | 35275 | 5.97 |
| Test | TE181 | 3208 | 72050 | 4.26 |
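
To make the degree of imbalance concrete, the short sketch below computes how many non-binding residues there are per binding residue in each dataset. The counts are copied from Table 1; the function name `imbalance_ratio` is purely illustrative and not part of the published code.

```python
# Residue counts copied from Table 1: (binding, non-binding).
DATASETS = {
    "TR646": (15636, 298503),
    "TR573": (14479, 145404),
    "TE46": (956, 9911),
    "TE129": (2240, 35275),
    "TE181": (3208, 72050),
}


def imbalance_ratio(n_binding: int, n_non_binding: int) -> float:
    """Number of non-binding residues per binding residue."""
    return n_non_binding / n_binding


for name, (pos, neg) in DATASETS.items():
    print(f"{name}: {imbalance_ratio(pos, neg):.1f} non-binding residues per binding residue")
```

In TR646, for example, there are roughly 19 non-binding residues for every binding residue, which is the imbalance that the resampling step in Section 2.2 is meant to counteract.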

2.2 The framework of iProtDNA-SMOTE

iProtDNA-SMOTE is a protein-DNA binding site prediction method based on graph neural networks. As illustrated in Fig 1, the iProtDNA-SMOTE process is streamlined into four distinct yet interconnected steps.

Fig 1. The workflow of iProtDNA-SMOTE.


  1. Feature Embedding Extraction: The process begins with extracting feature embeddings using the sophisticated ESM2 large language model. This initial step is pivotal as it translates the raw protein sequences into a high-dimensional space where the underlying biological signals are more pronounced.

  2. Graph Model Construction: Following embedding extraction, a graph model of the protein sequence is constructed. This graph representation is crucial as it allows the model to consider the spatial relationships between amino acids, which is vital for understanding protein-DNA interactions.

  3. Handling Imbalanced Datasets: To counteract the common issue of imbalanced datasets, iProtDNA-SMOTE incorporates GraphSMOTE. This technique adeptly adjusts the dataset balance, ensuring that the model does not become biased towards the more frequent classes and enhancing its predictive power across all data points.

  4. Classification of Graph-Structured Data: Finally, GraphSAGE-MLP is used for the classification of the graph-structured data. This combination of GraphSAGE for neighborhood aggregation and MLP for non-linear classification ensures that the model can accurately predict protein-DNA binding sites.

Each of these steps is designed to work synergistically, providing a comprehensive and robust framework for predicting protein-DNA binding sites with high accuracy and reliability.

Procedure I: Feature Embedding Extraction. The amino acid sequence is input into the large language model ESM2, which generates high-dimensional embeddings of size L×2560, where L represents the sequence length. ESM2, a deep learning model based on the transformer architecture, is specifically designed for understanding and predicting the three-dimensional structure and functions of proteins [41]. Pre-trained on a database containing millions of natural protein sequences, ESM2 is commonly utilized for tasks such as protein structure prediction, functional annotation, and protein-ligand interaction analysis. The feature embeddings produced by ESM2 capture information from protein sequences, including chemical properties of amino acid residues, sequence patterns, and interactions between residues [42]. These embedding vectors not only encode information about individual residues but also integrate relationships between residues at different positions within the sequence through the transformer’s self-attention mechanism. This integration is crucial as it allows the model to represent the complex interactions within proteins, which are essential for predicting protein-DNA binding sites. In subsequent training steps, these embeddings, which are generated by the encoder of the ESM2 model and provide one feature vector per amino acid residue, serve as the node features of the graph network model, allowing the graph neural network to process and analyze the protein data for binding site prediction.

Procedure II: Graph Model Construction. Utilizing AlphaFold3, the latest generation protein structure prediction tool developed by DeepMind, we obtain predicted three-dimensional protein structures. AlphaFold3 represents a significant improvement and expansion over its predecessor, AlphaFold2, with an enhanced evoformer module and diffusion network [43]. Once the protein’s three-dimensional structure is acquired, a spatial distance threshold of 8 angstroms is defined. If the distance between the Cα atoms of two residues is less than this threshold, the residues are connected by an edge in the graph. This connection facilitates the aggregation of information from amino acids that are spatially close, even if they are distant in sequence. Each node in the graph represents an amino acid residue from the 3D structure, with node features derived from the ESM2 embeddings generated in Procedure I. Every node carries its binding or non-binding label, so the constructed graph integrates both the spatial structure and the sequence information of the protein.
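
The distance-threshold rule in Procedure II can be sketched in a few lines. This is a minimal illustration, assuming the Cα coordinates have already been extracted from the predicted structure; the function name `build_contact_edges` and the toy coordinates are hypothetical and not taken from the published code.

```python
import math
from itertools import combinations


def build_contact_edges(ca_coords, threshold=8.0):
    """Return undirected edges (i, j) with i < j between residues whose
    C-alpha atoms are closer than `threshold` angstroms."""
    edges = []
    for i, j in combinations(range(len(ca_coords)), 2):
        if math.dist(ca_coords[i], ca_coords[j]) < threshold:
            edges.append((i, j))
    return edges


# Toy C-alpha coordinates in angstroms (hypothetical values).
toy_ca = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
    (3.8, 5.0, 0.0),  # far from residue 0 in sequence, close in space
]
edges = build_contact_edges(toy_ca)
print(edges)
```

Residue 3 ends up isolated because it lies more than 8 angstroms from every other residue, while residue 4 is linked to residues 0, 1, and 2 despite its position in the sequence. The pairwise loop is quadratic in sequence length, which is acceptable for a sketch but would normally be replaced by a vectorized or spatial-index computation.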

Procedure III: Handling Imbalanced Datasets. The algorithm begins by identifying the nodes in the training dataset that belong to the minority class. For each of these minority class nodes, it calculates the similarity to other nodes in the graph to identify the nearest neighbors. Interpolation is then used to synthesize new nodes: the features of an original minority class node are blended with those of its neighbor, generating new samples that enrich the minority class representation within the dataset. Crucially, the GraphSMOTE algorithm [33] ensures that the new nodes are not merely isolated additions but are integrated into the graph in a manner that reflects realistic relationships and maintains the overall structural properties. This cycle of identifying, synthesizing, and integrating new nodes continues until the desired number of minority class samples is reached. Through this iterative process, the algorithm not only increases the number of samples in the underrepresented class but also preserves the inherent structure and relational dynamics of the graph.
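
The interpolation step of Procedure III can be illustrated with a simplified sketch. It assumes the nearest neighbor is chosen among nodes of the same minority class by Euclidean distance, as in SMOTE-style oversampling, and it works directly on raw feature vectors. It deliberately omits the part of GraphSMOTE that connects synthetic nodes to the graph, so it should be read as an illustration of the feature-blending idea only; `smote_interpolate` is a hypothetical name.

```python
import math
import random


def smote_interpolate(features, labels, minority_label=1, n_new=2, seed=0):
    """Create `n_new` synthetic minority feature vectors.

    Each synthetic vector lies on the line segment between a randomly
    chosen minority node and its nearest minority-class neighbor.
    Returns a list of (source_index, neighbor_index, new_vector).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority nodes to interpolate")

    synthetic = []
    for _ in range(n_new):
        src = rng.choice(minority)
        # Nearest same-class neighbor of the source node in feature space.
        nbr = min(
            (j for j in minority if j != src),
            key=lambda j: math.dist(features[src], features[j]),
        )
        delta = rng.random()  # interpolation weight in [0, 1)
        new_vec = [a + delta * (b - a) for a, b in zip(features[src], features[nbr])]
        synthetic.append((src, nbr, new_vec))
    return synthetic


# Toy example: three minority nodes (label 1) among five nodes.
toy_features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.0], [9.0, 9.0]]
toy_labels = [1, 0, 0, 1, 1]
for src, nbr, vec in smote_interpolate(toy_features, toy_labels, n_new=3, seed=1):
    print(f"node {src} + neighbor {nbr} -> {vec}")
```

Because every synthetic vector is a convex combination of two existing minority vectors, the new samples stay inside the region already occupied by the minority class rather than being arbitrary noise.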

Procedure IV: Classification of Graph-Structured Data. After the aforementioned steps, we proceed by applying GraphSAGE graph convolution operations to the graph structure data. GraphSAGE aggregates neighbor features for each node, effectively mapping them into a new feature space. This process is designed to capture the local structural information of the graph, providing a representation of the graph’s topology. The features are subsequently fed into a Multi-Layer Perceptron, which utilizes a series of linear layers with non-linear activation functions to learn complex mappings from features to class labels. This integration of GraphSAGE with an MLP forms the backbone of a comprehensive Graph Neural Network (GNN) model, which is adept at leveraging both the graph’s topological structure and the robust learning capabilities of the MLP. Ultimately, the model outputs probability predictions for each node, categorizing them into predefined classes with high accuracy.

2.3 Unsupervised protein language models

The architecture of ESM2_t36_3B_UR50D, as shown in Fig 2, accepts a queried amino acid sequence as input and outputs a high-dimensional embedding matrix [44]. This matrix is designed to capture complex biological features and patterns inherent in the sequence. ESM2_t36_3B_UR50D [45] is a variant of the ESM2 model, employing a 36-layer Transformer architecture as its core. It employs self-attention mechanisms that allow each amino acid residue to interact with and learn from others within the sequence, thus capturing their intricate relationships and dependencies. Self-attention is particularly well suited to capturing long-range interactions, which are essential for deciphering the three-dimensional structure and function of proteins.

Fig 2. The workflow of ESM2_t36_3B_UR50D.


The model incorporates multiple attention heads, each focusing on distinct features of the sequence, collectively generating a comprehensive feature representation. With approximately 3 billion parameters, ESM2_t36_3B_UR50D is trained using masked language modeling objectives to produce feature representations of protein sequences. It benefits from a curated pre-training dataset known as UR50, which comprises over 60 million protein sequences from the UniRef90 database, ensuring a diverse and representative sample of protein sequences.

In summary, the model generates a feature vector of size L×2560, where L represents the length of the input sequence, and 2560 is the dimensionality of the feature vector, thereby providing a rich, detailed representation of the protein sequence’s biological characteristics.

2.4 GraphSAGE-MLP network

The GraphSAGE-MLP network, as illustrated in Fig 1, includes a feature update module with SageConv layers and an MLP module with distinct input, hidden, and output layers.

In the feature aggregation phase, the SageConv layer enhances each node’s representation by incorporating features from surrounding nodes. Each node’s individual feature, denoted as hi, is combined with the aggregated features of its neighbors, ni, creating an expanded feature vector. This vector is then processed through a linear layer. To introduce nonlinearity, an activation function σ, such as ReLU, is applied to the output of the linear layer, yielding the final feature representation for each node. The mathematical computation is as follows:

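$$n_i = \sum_{j \in \mathcal{N}(i)} h_j \tag{1}$$
$$\tilde{h}_i = [h_i, n_i] \tag{2}$$
$$z_i = W \cdot \tilde{h}_i + b \tag{3}$$
$$h_i^{(l+1)} = \sigma(z_i) \tag{4}$$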

where ni represents the aggregation result of neighboring features of node i, N(i) is the set of neighboring nodes of node i, and hj denotes the feature vector of neighbor node j. W stands for the weight matrix, and b is the bias term, collectively defining the linear transformation applied to the concatenated vector [hi, ni], and σ is the nonlinear activation function.
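
The four steps of the feature update can be written out directly. The sketch below is a plain-Python illustration, assuming that the aggregation in Eq. (1) is a sum over neighbors and that σ is ReLU; the function name `sage_layer` and the toy weights are hypothetical, and a practical implementation would use a tensor library rather than nested lists.

```python
def sage_layer(h, neighbors, W, b):
    """One feature-update step following Eqs. (1)-(4).

    h         : list of node feature vectors (all of length d)
    neighbors : dict mapping node index -> list of neighbor indices
    W         : weight matrix as a list of rows, each of length 2 * d
    b         : bias vector with one entry per row of W
    """
    d = len(h[0])
    updated = []
    for i, h_i in enumerate(h):
        # Eq. (1): aggregate (here: sum) the neighbor features.
        n_i = [0.0] * d
        for j in neighbors.get(i, []):
            n_i = [acc + x for acc, x in zip(n_i, h[j])]
        # Eq. (2): concatenate the node's own features with the aggregate.
        expanded = list(h_i) + n_i
        # Eq. (3): linear transformation.
        z_i = [sum(w * x for w, x in zip(row, expanded)) + bias
               for row, bias in zip(W, b)]
        # Eq. (4): ReLU activation.
        updated.append([max(0.0, z) for z in z_i])
    return updated


# Toy graph: node 0 is connected to nodes 1 and 2.
h = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
W = [[1, 0, 1, 0], [0, 1, 0, -1]]
b = [0.0, -1.0]
print(sage_layer(h, neighbors, W, b))
```

Applying a single matrix W to the concatenated vector is equivalent to applying one weight matrix to the node's own features and another to the aggregated neighbor features, then adding the results.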

The MLP in our model is structured with an input layer, two hidden layers, and an output layer, each linked by nonlinear activation functions. The input layer takes the original features of a node and, through linear transformations, projects them into a higher-dimensional space in the first hidden layer. Here, each neuron calculates a weighted sum of its inputs and includes a bias term. The LeakyReLU function is then applied to introduce nonlinearity.

The output from the first hidden layer, denoted as a(1), is passed to the second hidden layer. It follows a similar process of linear transformation and LeakyReLU activation to further refine the feature representation. The final output layer receives these transformed features and, through a linear transformation, maps them to a space where each dimension represents a class in the classification task. Unlike the hidden layers, the output layer uses the softmax function to convert the output into a probability distribution across the different classes. The computation proceeds as follows:

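$$z^{(1)} = w^{(1)} x + b^{(1)} \tag{5}$$
$$a^{(1)} = \mathrm{LeakyReLU}(z^{(1)}) \tag{6}$$
$$z^{(2)} = w^{(2)} a^{(1)} + b^{(2)} \tag{7}$$
$$a^{(2)} = \mathrm{LeakyReLU}(z^{(2)}) \tag{8}$$
$$\hat{y} = \mathrm{softmax}(z^{\mathrm{out}}) = \frac{e^{z^{\mathrm{out}}}}{\sum_{l=1}^{K} e^{z_l}} \tag{9}$$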

where x represents the input features, w(1) and b(1) are the weight matrix and bias vector of the first hidden layer, and w(2) and b(2) are those of the second hidden layer. zout denotes the linear output of the output layer, zl is its l-th component, K is the total number of classes, and y^ represents the model’s predicted class probabilities.
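
The forward pass described by Eqs. (5)-(9) can be traced with a small example. This is a sketch under stated assumptions: the LeakyReLU negative slope of 0.01 and all layer sizes and weights are illustrative choices, not values reported in the text, and the helper names are hypothetical.

```python
import math


def linear(W, x, b):
    """Affine map z = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def leaky_relu(z, negative_slope=0.01):
    """Element-wise LeakyReLU; the slope value is an assumption."""
    return [v if v > 0 else negative_slope * v for v in z]


def softmax(z):
    """Numerically stable softmax over the entries of z."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, layers):
    """Two hidden layers with LeakyReLU followed by a softmax output layer."""
    (W1, b1), (W2, b2), (W_out, b_out) = layers
    a1 = leaky_relu(linear(W1, x, b1))   # Eqs. (5)-(6)
    a2 = leaky_relu(linear(W2, a1, b2))  # Eqs. (7)-(8)
    z_out = linear(W_out, a2, b_out)     # linear output layer
    return softmax(z_out)                # Eq. (9)


# Toy parameters (hypothetical): 3 inputs -> 2 hidden -> 2 hidden -> 2 classes.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
probs = mlp_forward([1.0, 2.0, 3.0], toy_layers)
print(probs)
```

For the binary binding/non-binding task, K = 2 and the two outputs sum to one, so the second entry can be read as the predicted binding probability of a residue.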

To tackle the issue of class imbalance in this binary classification task, we utilize the focal loss function. This function rescales the loss of each sample according to how confidently it is classified, down-weighting samples that are easily classified and up-weighting those that are difficult to classify correctly. By doing so, it encourages the model to focus more on the hard samples. The computation formula is as follows:

$$L = -\alpha (1 - p_t)^{\gamma} \log(p_t) \tag{10}$$

where L represents the value of the loss function, α is a weight parameter balancing positive and negative samples to adjust for class imbalance, γ is an exponent that adjusts the model’s focus on easy versus hard samples, reducing the weight of easily classified samples and increasing that of hard-to-classify samples, and pt denotes the model’s predicted probability for the true class of the sample (in the standard focal loss formulation, pt = p for a positive sample and pt = 1 − p for a negative sample, where p is the predicted probability of the positive class).
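
A per-sample version of the focal loss is short enough to write out. The sketch below follows the standard focal loss formulation, in which the weight is α for positive samples and 1 − α for negative samples; that class-dependent weighting and the default values α = 0.25 and γ = 2 are assumptions for illustration and are not values reported in the text.

```python
import math


def focal_loss(p_pos, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one sample.

    p_pos : predicted probability of the positive (binding) class
    y     : true label, 1 for binding and 0 for non-binding
    alpha : class-balancing weight (assumed default)
    gamma : focusing exponent (assumed default)
    """
    # Probability assigned to the true class, clipped to avoid log(0).
    p_t = p_pos if y == 1 else 1.0 - p_pos
    p_t = min(max(p_t, eps), 1.0)
    # Class-dependent weight: alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

With γ = 0 and α = 1 the expression reduces to the ordinary cross-entropy for a positive sample, which makes the role of the factor (1 − pt)^γ easy to see: it shrinks the loss of confidently classified samples and leaves hard samples almost untouched.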

2.5 Evaluation indices

In this study, we utilized six metrics—specificity (Spe), precision (Pre), recall (Rec), F1 score, Matthews correlation coefficient (MCC), and AUC—to evaluate the proposed method, ensuring consistency with previous research. These metrics are defined as follows:

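$$\mathrm{Spe} = \frac{TN}{TN + FP} \tag{11}$$
$$\mathrm{Pre} = \frac{TP}{TP + FP} \tag{12}$$
$$\mathrm{Rec} = \frac{TP}{TP + FN} \tag{13}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Pre} \times \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}} \tag{14}$$
$$\mathrm{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{15}$$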

Among these metrics, True Positives (TP) refers to the number of samples correctly predicted as positive by the model; False Positives (FP) are negative samples incorrectly predicted as positive; True Negatives (TN) are the number of samples correctly predicted as negative; and False Negatives (FN) are positive samples incorrectly predicted as negative by the model. Specifically, Specificity (Spe) measures the model’s ability to correctly identify negative samples; Precision (Pre) reflects the proportion of predicted positive samples that are actually positive; Recall (Rec), also known as Sensitivity, indicates the model’s ability to identify all positive samples. Additionally, the F1 score combines the performance of Precision and Recall, while the Matthews Correlation Coefficient (MCC) evaluates the overall performance of the model in handling predictions of both positive and negative classes, particularly suitable for evaluating imbalanced data. Given that this study addresses a binary classification problem with imbalanced classes, MCC is one of our primary evaluation metrics as it provides a comprehensive assessment of such scenarios. A high MCC score is achieved only when the model performs well across all four categories of the confusion matrix (TP, TN, FN, and FP).
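
The threshold-based metrics in Eqs. (11)-(15) can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name `binary_metrics` and the toy counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the text. AUC is not included because it requires the predicted scores rather than the counts.

```python
import math


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Specificity, precision, recall, F1, and MCC from confusion counts."""
    spe = _safe_div(tn, tn + fp)
    pre = _safe_div(tp, tp + fp)
    rec = _safe_div(tp, tp + fn)
    f1 = _safe_div(2 * pre * rec, pre + rec)
    mcc = _safe_div(
        tp * tn - fn * fp,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"Spe": spe, "Pre": pre, "Rec": rec, "F1": f1, "MCC": mcc}


# Toy imbalanced example (hypothetical counts): 60 positives, 940 negatives.
metrics = binary_metrics(tp=40, fp=10, tn=930, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the specificity is close to 0.99 simply because negatives dominate, whereas the MCC of about 0.71 reflects the 20 missed positives, which illustrates why MCC is the more informative summary on imbalanced data.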

3. Comparison with existing DNA-binding site predictors

3.1 Comparison of iProtDNA-SMOTE with other methods on TE46

To demonstrate the effectiveness of iProtDNA-SMOTE, we compared it against six state-of-the-art models for predicting DNA binding sites: DRNAPred [46], DNAPred, SVMnuc [47], NCBRPred [48], DBPred, and CLAPE-DB [49]. These comparisons were based on performance on the TE46 dataset. As detailed in Table 2, iProtDNA-SMOTE achieves the highest MCC score of all methods, surpassing CLAPE-DB, the next best model, by 1.7 percentage points. On the TE46 dataset, iProtDNA-SMOTE trained on TR646 achieves specificity (Spe) of 0.973, precision (Pre) of 0.583, F1 score of 0.447, and MCC of 0.418, which are absolute improvements of 13.8, 27.7, 1.3, and 1.7 percentage points, respectively, over CLAPE-DB. Its recall (Rec) of 0.363 is considerably lower than CLAPE-DB’s 0.747; this reflects iProtDNA-SMOTE’s emphasis on precision during prediction, which reduces false positives at the cost of missing more binding residues. Its AUC of 0.850 is close to, though slightly below, CLAPE-DB’s 0.871, indicating competitive overall predictive performance.

Table 2. Performance comparisons of iProtDNA-SMOTE and 6 competing predictors on TE46 under independent validation.

Method Spe Rec Pre F1 MCC AUC
DRNAPred 0.692 0.677 0.185 0.291 0.226 0.755
DNAPred 0.655 0.671 0.157 0.254 0.194 0.730
SVMnuc 0.666 0.668 0.154 0.250 0.192 0.715
NCBRPred 0.674 0.677 0.165 0.265 0.207 0.713
DBPred 0.784 0.708 0.243 0.362 0.320 0.794
CLAPE-DB 0.835 0.747 0.306 0.434 0.401 0.871
iProtDNA-SMOTE 0.973 0.363 0.583 0.447 0.418 0.850
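
The gains quoted for TE46 are absolute differences between the Table 2 entries, expressed in percentage points. The sketch below recomputes them from the CLAPE-DB and iProtDNA-SMOTE rows and also shows the corresponding relative change for comparison; the function name `gains` is illustrative.

```python
# Rows copied from Table 2.
TE46 = {
    "CLAPE-DB": {"Spe": 0.835, "Rec": 0.747, "Pre": 0.306,
                 "F1": 0.434, "MCC": 0.401, "AUC": 0.871},
    "iProtDNA-SMOTE": {"Spe": 0.973, "Rec": 0.363, "Pre": 0.583,
                       "F1": 0.447, "MCC": 0.418, "AUC": 0.850},
}


def gains(ours, baseline):
    """Per-metric (absolute gain in percentage points, relative gain in %)."""
    result = {}
    for metric, value in ours.items():
        absolute = (value - baseline[metric]) * 100
        relative = (value / baseline[metric] - 1) * 100
        result[metric] = (round(absolute, 1), round(relative, 1))
    return result


te46_gains = gains(TE46["iProtDNA-SMOTE"], TE46["CLAPE-DB"])
for metric, (absolute, relative) in te46_gains.items():
    print(f"{metric}: {absolute:+.1f} points, {relative:+.1f}% relative")
```

The output makes the trade-off explicit: specificity, precision, F1, and MCC rise, while recall and AUC fall relative to CLAPE-DB.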

3.2 Comparison of iProtDNA-SMOTE with other methods on TE129 and TE181

Table 3 summarises the performance comparison of various models, including DRNAPred, DNAPred, SVMnuc, NCBRPred, CLAPE-DB, and iProtDNA-SMOTE, on the independent validation dataset TE129. Among these models, iProtDNA-SMOTE achieves the highest MCC score. On the TE129 dataset, iProtDNA-SMOTE, trained on the TR573 dataset, achieves a specificity of 0.972, precision of 0.497, F1 score of 0.468, MCC of 0.437, and AUC of 0.896. These results represent improvements over CLAPE-DB of 1.7, 10.1, 4.1, 4.8, and 1.5 percentage points in specificity, precision, F1 score, MCC, and AUC, respectively.

Table 3. Performance comparisons of iProtDNA-SMOTE and 5 competing predictors on TE129 under independent validation.

Method Spe Rec Pre F1 MCC AUC
DRNAPred 0.937 0.233 0.190 0.210 0.155 0.693
DNAPred 0.954 0.396 0.353 0.373 0.332 0.845
SVMnuc 0.966 0.316 0.371 0.341 0.304 0.812
NCBRPred 0.969 0.312 0.392 0.347 0.313 0.823
CLAPE-DB 0.955 0.464 0.396 0.427 0.389 0.881
iProtDNA-SMOTE 0.972 0.442 0.497 0.468 0.437 0.896

Table 4 compares the performance of DNAPred, SVMnuc, NCBRPred, CLAPE-DB, and iProtDNA-SMOTE on the TE181 test dataset. iProtDNA-SMOTE achieves the highest MCC value among all methods. On the TE181 dataset, iProtDNA-SMOTE, trained on the TR573 dataset, achieves a specificity of 0.963, precision of 0.303, F1 score of 0.330, MCC of 0.299, and AUC of 0.858. These results represent improvements over CLAPE-DB of 3.2 percentage points in specificity, 9.1 in precision, 5.0 in the F1 score, 4.7 in MCC, and 3.4 in AUC.

Table 4. Performance comparisons of iProtDNA-SMOTE and 4 competing predictors on TE181 under independent validation.

Method Spe Rec Pre F1 MCC AUC
DNAPred 0.948 0.334 0.223 0.267 0.233 0.802
SVMnuc 0.960 0.289 0.242 0.263 0.229 0.803
NCBRPred 0.964 0.259 0.241 0.250 0.215 0.771
CLAPE-DB 0.931 0.413 0.212 0.280 0.252 0.824
iProtDNA-SMOTE 0.963 0.362 0.303 0.330 0.299 0.858

On both the TE129 and TE181 independent test sets, iProtDNA-SMOTE attains recall (0.442 and 0.362, respectively) that is lower than, but comparable to, that of CLAPE-DB (0.464 and 0.413). Together with its higher precision and specificity, this suggests that our model offers a balanced approach to predictions, retaining most of the recall while producing fewer false positives. The comparison is particularly informative given that CLAPE-DB incorporates contrastive learning and a pre-trained protein language model, and a pre-trained protein language model is also a key component of iProtDNA-SMOTE’s architecture. This comparison underscores the effectiveness of iProtDNA-SMOTE’s graph neural network integration and its strategies for tackling class imbalance.

4. Conclusions

We introduce iProtDNA-SMOTE, a novel deep learning-based method for predicting DNA binding sites from protein sequences. This approach integrates the pre-trained protein language model ESM2 with graph neural network technology. After thorough evaluation using five benchmark datasets for protein-DNA binding sites, iProtDNA-SMOTE has been shown to surpass existing state-of-the-art methods in predictive accuracy. Two key advancements contribute to the improvements of iProtDNA-SMOTE. Firstly, the ESM2 model captures intricate protein sequence features through high-dimensional feature embeddings. Secondly, our graph data augmentation strategy strengthens the model’s capability to identify minority class nodes, leading to enhanced predictive accuracy.

While iProtDNA-SMOTE has demonstrated impressive results, there are opportunities for further refinement. For instance, the current graph model may struggle with extremely long protein sequences, and integrating more sophisticated graph convolutional networks or attention mechanisms could offer improved solutions. Additionally, with the rapid development of protein structure prediction tools such as AlphaFold3 and ESM2, utilizing their predictions could potentially yield even greater accuracy in DNA binding site prediction. Relevant research in these areas is ongoing.

Supporting Information

S1 Tables. Supplementary Tables. (DOCX; pone.0320817.s001.docx, 16.6 KB)

S1 Dataset. iProtDNA-SMOTE benchmark datasets. (RAR; pone.0320817.s002.rar, 414.6 KB)

S1 Code. iProtDNA-SMOTE code. (RAR; pone.0320817.s003.rar, 6.7 KB)

S1 Weight. iProtDNA-SMOTE trained weights. (RAR; pone.0320817.s004.rar, 3.5 MB)

S1 Model. The graph model for dataset TE46. (RAR; pone.0320817.s005.rar, 99.5 MB)

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

References

  • 1. Oriol F, Alberto M, Joachim A-P, Patrick G, M BP, Ruben M-F, et al. Structure-based learning to predict and model protein-DNA interactions and transcription-factor co-operativity in cis-regulatory elements. NAR Genom Bioinform. 2024;6(2):lqae068. doi: 10.1093/nargab/lqae068
  • 2. Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet. 2010;11(11):751–60. doi: 10.1038/nrg2845
  • 3. Gallagher LA, Velazquez E, Peterson SB, Charity JC, Radey MC, Gebhardt MJ, et al. Genome-wide protein-DNA interaction site mapping in bacteria using a double-stranded DNA-specific cytosine deaminase. Nat Microbiol. 2022;7(6):844–55. doi: 10.1038/s41564-022-01133-9
  • 4. Lovering RC, Gaudet P, Acencio ML, Ignatchenko A, Jolma A, Fornes O, et al. A GO catalogue of human DNA-binding transcription factors. Biochim Biophys Acta Gene Regul Mech. 2021;1864(11–12):194765. doi: 10.1016/j.bbagrm.2021.194765
  • 5. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic level protein structure with a language model. 2022. doi: 10.1101/2022.07.20.500902
  • 6. Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein-DNA-binding sites in computational biology. Brief Funct Genomics. 2022;21(5):357–75. doi: 10.1093/bfgp/elac009
  • 7. Guan S, Zou Q, Wu H, Ding Y. Protein-DNA Binding Residues Prediction Using a Deep Learning Model With Hierarchical Feature Extraction. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(5):2619–28. doi: 10.1109/TCBB.2022.3190933
  • 8. Bai D, Ziadlou R, Vaijayanthi T, Karthikeyan S, Chinnathambi S, Parthasarathy A, et al. Nucleic acid-based small molecules as targeted transcription therapeutics for immunoregulation. Allergy. 2024;79(4):843–60. doi: 10.1111/all.15959
  • 9. Templin MF, Stoll D, Schrenk M, Traub PC, Vöhringer CF, Joos TO. Protein microarray technology. Drug Discov Today. 2002;7(15):815–22. doi: 10.1016/s1359-6446(00)01910-2
  • 10. Narlikar L, Jothi R. ChIP-Seq data analysis: identification of protein-DNA binding sites with SISSRs peak-finder. In: Next Generation Microarray Bioinformatics. Methods in Molecular Biology, vol 802. 2011/12/02 edn. Humana Press, 2012, p. 305–22. doi: 10.1007/978-1-61779-400-1_20
  • 11. Stella S, Molina R, Bertonatti C, Juillerrat A, Montoya G. Expression, purification, crystallization and preliminary X-ray diffraction analysis of the novel modular DNA-binding protein BurrH in its apo form and in complex with its target DNA. Acta Crystallogr F Struct Biol Commun. 2014;70(Pt 1):87–91. doi: 10.1107/S2053230X13033037
  • 12. Mishyna M, Volokh O, Danilova Y, Gerasimova N, Pechnikova E, Sokolova OS. Effects of radiation damage in studies of protein-DNA complexes by cryo-EM. Micron. 2017;96:57–64. doi: 10.1016/j.micron.2017.02.004
  • 13. Zhou J, Lu Q, Xu R, Gui L, Wang H. Prediction of DNA-binding residues from sequence information using convolutional neural network. IJDMB. 2017;17(2):132. doi: 10.1504/ijdmb.2017.084265
  • 14. Chen D, Zhang H, Chen Z, Xie B, Wang Y. Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins. Comput Math Methods Med. 2022;2022:5847242. doi: 10.1155/2022/5847242
  • 15. Wang W, Zhang Y, Liu D, Zhang H, Wang X, Zhou Y. Prediction of DNA-Binding Protein-Drug-Binding Sites Using Residue Interaction Networks and Sequence Feature. Front Bioeng Biotechnol. 2022;10:822392. doi: 10.3389/fbioe.2022.822392
  • 16. Park B, Im J, Tuvshinjargal N, Lee W, Han K. Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models. Comput Methods Programs Biomed. 2014;117(2):158–67. doi: 10.1016/j.cmpb.2014.07.009
  • 17. Wang L, Yang MQ, Yang JY. Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics. 2009;10 Suppl 1(Suppl 1):S1. doi: 10.1186/1471-2164-10-S1-S1
  • 18. Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. bioRxiv. 2023:2023.09.14.557719. doi: 10.1101/2023.09.14.557719
  • 19. Xia Y, Xia C-Q, Pan X, Shen H-B. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res. 2021;49(9):e51. doi: 10.1093/nar/gkab044
  • 20. Tayara H, Tahir M, Chong KT. iSS-CNN: Identifying splicing sites using convolution neural network. Chemometrics and Intelligent Laboratory Systems. 2019;188:63–9. doi: 10.1016/j.chemolab.2019.03.002
  • 21. Nguyen BP, Nguyen QH, Doan-Ngoc G-N, Nguyen-Vo T-H, Rahardja S. iProDNA-CapsNet: identifying protein-DNA binding residues using capsule neural networks. BMC Bioinformatics. 2019;20(Suppl 23):634. doi: 10.1186/s12859-019-3295-2
  • 22. Fang C, Shang Y, Xu D. MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction. Proteins. 2018;86(5):592–8. doi: 10.1002/prot.25487
  • 23. Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y. AlphaFold2-aware protein-DNA binding site prediction using graph transformer. Brief Bioinform. 2022;23(2):bbab564. doi: 10.1093/bib/bbab564
  • 24. Hu J, Li Y, Zhang M, Yang X, Shen H-B, Yu D-J. Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(6):1389–98. doi: 10.1109/TCBB.2016.2616469
  • 25. Gao Z, Ruan J. Computational modeling of in vivo and in vitro protein-DNA interactions by multiple instance learning. Bioinformatics. 2017;33(14):2097–105. doi: 10.1093/bioinformatics/btx115
  • 26. Zhu Y-H, Hu J, Song X-N, Yu D-J. DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines. J Chem Inf Model. 2019;59(6):3057–71. doi: 10.1021/acs.jcim.8b00749
  • 27. Li X, Fan Z, Huang F, Hu X, Deng Y, Wang L, et al. Graph Neural Network with curriculum learning for imbalanced node classification. Neurocomputing. 2024;574:127229. doi: 10.1016/j.neucom.2023.127229
  • 28. Qu L, Zhu H, Zheng R. Imgagn: Imbalanced network embedding via generative adversarial graph networks. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. n.d.:1390–8. doi: 10.48550/arXiv.2106.02817
  • 29. Zhou M, Gong Z. GraphSR: A Data Augmentation Algorithm for Imbalanced Node Classification. AAAI. 2023;37(4):4954–62. doi: 10.1609/aaai.v37i4.25622
  • 30. Liu Y, Gao Z, Liu X. QTIAH-GNN: Quantity and topology imbalance-aware heterogeneous graph neural network for bankruptcy prediction. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. n.d.:1572–82.
  • 31. Lin T-Y, Goyal P, Girshick R, Fu C, Rethage D. Focal loss for dense object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. n.d.:2980–8.
  • 32. Ma Y, Tian Y, Moniz N, Chawla N. Class-imbalanced learning on graphs: A survey. arXiv preprint. 2023. doi: 10.48550/arXiv.2304.04300
  • 33. Zhao T, Zhang X, Wang S. GraphSMOTE: Imbalanced Node Classification on Graphs with Graph Neural Networks. Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021:833–41. doi: 10.1145/3437963.3441720
  • 34. Park J, Song J, Yang E. Graphens: Neighbor-aware ego network synthesis for class-imbalanced node classification. International conference on learning representations. n.d.:34.
  • 35. Liu Y, Zhang Z, Liu Y, Zhu Y. GATSMOTE: Improving Imbalanced Node Classification on Graphs via Attention and Homophily. Mathematics. 2022;10(11):1799. doi: 10.3390/math10111799
  • 36. Li W-Z, Wang C-D, Xiong H, Lai J-H. GraphSHA: Synthesizing Harder Samples for Class-Imbalanced Node Classification. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023:1328–40. doi: 10.1145/3580305.3599374
  • 37. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. doi: 10.1073/pnas.2016239118
  • 38. Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. 2017.
  • 39. Srhar S, Arshad A, Raza A. Protien-DNA binding sites Prediction. In: 2021 International Conference on Innovative Computing (ICIC), Lahore, Pakistan, 2021. IEEE, pp 1–10. doi: 10.1109/ICIC53490.2021.9692990
  • 40. Patiyal S, Dhall A, Raghava GPS. A deep learning-based method for the prediction of DNA interacting residues in a protein. Brief Bioinform. 2022;23(5):bbac322. doi: 10.1093/bib/bbac322
  • 41. Zhang B, He L, Wang Q, et al. Mit Protein Transformer: Identification Mitochondrial Proteins with Transformer Model. Paper presented at the International Conference on Intelligent Computing. 2023.
  • 42. Valverde Sanchez C. Sequence-based deep learning techniques for protein-protein interaction prediction. 2023.
  • 43. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. doi: 10.1038/s41586-024-07487-w
  • 44. Zhu Y-H, Liu Z, Liu Y, Ji Z, Yu D-J. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction. Brief Bioinform. 2024;25(2):bbae040. doi: 10.1093/bib/bbae040
  • 45. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. doi: 10.1126/science.ade2574
  • 46. Yan J, Kurgan L. DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Res. 2017;45(10):e84. doi: 10.1093/nar/gkx059
  • 47. Su H, Liu M, Sun S, Peng Z, Yang J. Improving the prediction of protein-nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics. 2019;35(6):930–6. doi: 10.1093/bioinformatics/bty756
  • 48. Zhang J, Chen Q, Liu B. NCBRPred: predicting nucleic acid binding residues in proteins based on multilabel learning. Brief Bioinform. 2021;22(5):bbaa397. doi: 10.1093/bib/bbaa397
  • 49. Liu Y, Tian B. Protein-DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Brief Bioinform. 2023;25(1):bbad488. doi: 10.1093/bib/bbad488

Decision Letter 0

Syed Nisar Hussain Bukhari

5 Jan 2025

PONE-D-24-57420

iProtDNA-SMOTE: Enhancing Protein-DNA Binding Sites Prediction through Imbalanced Graph Neural Networks

PLOS ONE

Dear Dr. Lin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Feb 19 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Syed Nisar Hussain Bukhari

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following financial disclosure: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed. 

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. Your abstract cannot contain citations. Please only include citations in the body text of the manuscript, and ensure that they remain in ascending numerical order on first mention.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: N/A

Reviewer #4: No

Reviewer #5: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. The manuscript studies an important area in understanding the biological process and the cellular functions arising out of these interactions.

2. The manuscript is well written, the literature is thoroughly reviewed and the study has been organized in a systematic fashion.

3. The language of the manuscript is good, but needs a little proof reading to fix some language and grammatical errors.

4. The working of ESM2 and Graph SMOTE should have been elaborated within the manuscript, so that it becomes easy for the reader to understand the class balancing and the embeddings generated by ESM2. Although the raw data files on GitHub contain the sequences and encodings, the graph data and ESM2 embeddings are in binary format, which is difficult to inspect. It would be beneficial for this study to explain the output of ESM2 and the graph structure derived from such embeddings.

5. The authors are also advised to perform some downstream analysis for the novel predictions generated by their model if any to show its relevance in predicting biological functions associated with this DNA binding protein.

Reviewer #2: Considering the use of graph-based neural network structure, it is necessary to discuss and examine more research studies. Also, with further explanations about the innovation presented in the article, the strengths of the presented model can be strengthened.

Reviewer #3: The study is methodologically sound, innovative, and impactful. Addressing the identified weaknesses would further elevate its contributions to the field.

Recommendation: Accept with minor revisions.

Reviewer #4: I find the idea of using graph neural networks and SMOTE to predict protein-DNA binding sites quite intriguing. The experiments on the TR646, TE46, and TR573 datasets, and the comparisons to strong baselines like CLAPE-DB and DNAPred, show promising results with AUC values between 0.850 and 0.896. However, I think the current version needs some serious work before it's ready for a top-tier journal like PLOS ONE.

The first thing that struck me was the huge gap between the method section and the data visualization. The method section felt like a dense wall of text, making it hard to follow. More diagrams or figures to illustrate the model and the results would make it much easier to understand.

I was also disappointed by the lack of discussion about the model's limitations. The authors briefly mention potential issues with long sequences, but that's it. I'd really like to see a more in-depth analysis of things like computational cost, training time, and how well the model scales to larger datasets. This would give a more balanced perspective.

The writing style also felt a bit… robotic. It looks a bit too polished and maybe even a bit salesy. I think a simpler, more direct writing style would be much better.

From a technical standpoint, I was concerned about the lack of ablation studies. The model combines several components, like the ESM2 pre-trained model and GraphSMOTE. It would be really helpful to see how much each of these components actually contributes to the final performance.

Reproducibility is another key issue. The authors provide code and datasets, which is good, but they're missing crucial training details like learning rates, batch sizes, and the number of epochs. This makes it hard for other researchers to independently verify the results.

Finally, the paper doesn't fully address the impact of data imbalance. Even with GraphSMOTE, the recall on the TE46 dataset is quite low (0.363), suggesting that this remains a challenge. I think a deeper discussion on how imbalance affects performance, especially recall, is needed.

Overall, I think the approach has a lot of potential. But the paper needs some significant revisions to make it more readable, transparent, and convincing. I recommend restructuring the paper, simplifying the language, adding more visuals, and conducting more experiments to fully evaluate the model.

Reviewer #5: This paper introduces iProtDNA-SMOTE, a novel model for predicting protein-DNA binding sites. The proposed method addresses the significant class imbalance problem in such datasets by combining the Graph SMOTE algorithm (designed for class imbalance issues) with protein-DNA language models and Graph Neural Networks (GNNs). The model was trained and tested on five protein-DNA binding benchmarks from the literature and demonstrates superior performance compared to other existing models on the same benchmarks.

Given the large class imbalance between the number of residues that bind to DNA and those that do not, the use of the SMOTE algorithm to account for this imbalance is highly relevant. The authors tackle this problem by framing it within a graph-based framework, utilizing embeddings from the ESM model and constructing a graph based on pairwise distances computed from the AlphaFold 3 (AF3) protein structure. The authors then train the Graph Neural Network on datasets curated from prior publications. It is worth noting that these training and test datasets are themselves predictions of protein-DNA interactions derived from previous models (GraphBind, GraphPred, and DBPred). During training, the Graph SMOTE algorithm is employed to upsample examples from the minority class (DNA-binding residues).

Comments:

I find the overall approach of the paper compelling, and it is reasonable to assume that a SMOTE-type algorithm would be beneficial in addressing class imbalance. The results on their independent benchmarks appear promising compared to other models in the literature. Overall, this is an interesting and innovative approach to a biologically significant problem characterized by substantial class imbalance.

However, I would like the following questions addressed before publication:

1. Why are the three models—GraphBind, GraphPred, and DBPred—not included in the benchmarks? The paper does not explain their absence. Is it because their predictions on these benchmarks are already very high, given that the benchmarks (labels) are essentially derived from the predictions of these models? This needs to be clarified, and their performances should be reported, possibly in a supplementary table if necessary.

2. I observed that iProtDNA-SMOTE consistently achieves very high precision but often has the lowest recall across benchmarks. Could the authors address why this trade-off occurs systematically? Is it due to the problem setup of oversampling the minority class, which might make the model adept at identifying a specific type of positive example (protein-DNA binding) while missing others? Some insights or discussion on this issue are crucial. I recommend examining the worst mistakes in the false negatives (i.e., binding sites missed by the model) to better understand the underlying reasons for the low recall and potentially improve it.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes:  Nisar Iqbal Wani PhD

Reviewer #2: No

Reviewer #3: Yes:  Dr. Syed Mutahar Aaqib

Reviewer #4: No

Reviewer #5: Yes:  Abhimanyu Banerjee

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review Comments iPROTDNA-smote.docx

pone.0320817.s006.docx (13.1KB, docx)

Attachment

Submitted filename: comments.docx

pone.0320817.s007.docx (13.8KB, docx)

Attachment

Submitted filename: Suggestions for Improvement.docx

pone.0320817.s008.docx (13KB, docx)

PLoS One. 2025 May 13;20(5):e0320817. doi: 10.1371/journal.pone.0320817.r003

Author response to Decision Letter 1


30 Jan 2025

Dear Editor,

Thank you very much for your email of Jan-06-2025. We appreciate the time and effort that you and the reviewers dedicated to providing feedback on our manuscript, and we are grateful for the insightful and helpful comments. As suggested, the manuscript has been carefully revised according to these comments. Our point-by-point responses follow. For clarity, each response begins with "Reply".

Journal Requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Reply: We have carefully checked and ensured that our manuscript complies with the formatting requirements of PLOS ONE. We have referenced the templates provided by PLOS ONE and made necessary adjustments to the format of our manuscript.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

Reply: In accordance with PLOS ONE's guidelines for code sharing, we have made the author-generated code publicly available. The code is accessible at https://github.com/primrosehry/iProtDNA-SMOTE and includes detailed instructions for running it along with dependency information.

3. Thank you for stating the following financial disclosure: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

Reply: We have clearly stated the funding sources in the Funding Statement and added the following declaration: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” Additionally, we have removed all funding-related information from the Acknowledgments section to comply with the journal's requirements.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: This research was funded by the National Natural Science Foundation of China, 62162032 and 32260154, and Technology Projects of the Education Department of Jiangxi Province of China, GJJ2201040 and GJJ2201004.

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Reply: We have removed all funding-related information from the Acknowledgments section and ensured that all relevant declarations appear only in the Funding Statement. We have updated the Funding Statement to ensure its accuracy.

5. Your abstract cannot contain citations. Please only include citations in the body text of the manuscript, and ensure that they remain in ascending numerical order on first mention.

Reply: Our abstract does not contain any citations. We have ensured that citations appear only in the body of the manuscript and are numbered in ascending order upon their first mention.

Reviewer #1

1. The manuscript studies an important area in understanding the biological process and the cellular functions arising out of these interactions.

Reply: We appreciate your time and effort in evaluating our work. We have expanded the introduction to better emphasize the importance of understanding the biological processes and cellular functions resulting from these interactions. This includes referencing recent studies to highlight the significance of this field.

2. The manuscript is well written, the literature is thoroughly reviewed and the study has been organized in a systematic fashion.

Reply: We appreciate your positive feedback on the manuscript’s writing and organization. We have ensured that the structure remains clear and logical throughout the study.

3. The language of the manuscript is good, but needs a little proof reading to fix some language and grammatical errors.

Reply: This is a good suggestion. We have carefully proofread the manuscript to correct any language and grammatical errors. We have also enlisted the help of a professional language editor to ensure that the manuscript meets high standards of clarity and accuracy.

4. The working of ESM2 and Graph SMOTE should have been elaborated within the manuscript, so that it becomes easy for the reader to understand the class balancing and the embeddings generated by ESM2. Although the raw data files on GitHub contain the sequences and encodings, the graph data and ESM2 embeddings are in a binary format that is beyond comprehension. It would be beneficial for this study to explain the output of ESM2 and the graph structure derived from such embeddings.

Reply: Thank you for your valuable suggestion. We have addressed this by adding a detailed explanation of the workings of ESM2 and Graph SMOTE to the manuscript. In the section "Unsupervised Protein Language Model" of the Materials and Methods, we have elaborated on how ESM2 works and generates embeddings. Additionally, we have introduced a new section, "Construction of a Balanced Protein Graph," which provides a comprehensive explanation of how Graph SMOTE functions and how these embeddings are used to create graph structures. Furthermore, we have included examples of ESM2 embeddings (Fig. 2) and graph data (Fig. 3) in a more reader-friendly format.

5. The authors are also advised to perform some downstream analysis for the novel predictions generated by their model if any to show its relevance in predicting biological functions associated with this DNA binding protein.

Reply: We agree that such an analysis would be highly beneficial. However, we currently face limitations in experimental time and resources that prevent us from conducting the relevant downstream experiments at this stage.

To address this limitation, we have provided a detailed description of the model-building process and the reliability of the prediction results in the manuscript. Our model has been trained and validated on a large dataset of known protein-DNA interaction data, achieving high accuracy and strong generalization capabilities. This rigorous validation process ensures that our model serves as a reliable tool for predicting protein-DNA binding sites.

We believe that these revisions have significantly improved the manuscript and addressed your concerns. We are grateful for your suggestions and hope that the revised version meets your expectations.

Reviewer #2

Considering the use of graph-based neural network structure, it is necessary to discuss and examine more research studies. Also, with further explanations about the innovation presented in the article, the strengths of the presented model can be strengthened.

Reply: Thank you for your valuable feedback on our manuscript. We highly appreciate Reviewer#2’s suggestion.

1- Given the main structure of the paper, which is based on graph-based neural networks, more previous research needs to be studied.

Reply: Many thanks for the reviewer’s suggestion. We have expanded our literature review to include additional studies on graph-based neural networks. The expansion provides a more comprehensive overview of the field.

2- In the classification of Graph structured data section, graph convolution operations need to be discussed further.

Reply: This is a good point. We have added a detailed explanation of the graph convolution operations in the "GraphSAGE-MLP Network" section, clarifying their role in feature aggregation.

3- Also in the classification of Graph structured data section, more explanation should be provided about collecting neighbor features.

Reply: We think this is an excellent suggestion. In the "GraphSAGE-MLP Network" section, we have provided a more comprehensive explanation of how neighbor features are collected, with a focus on the message-passing mechanism and its implementation in the model.
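As an illustration of what collecting neighbor features means in a GraphSAGE-style layer, here is a minimal sketch of mean aggregation. It shows the aggregation step alone under simplifying assumptions: the function name, the toy graph, and the plain-list representation are hypothetical, and a real layer would follow the concatenation with a learned weight matrix and a nonlinearity.

```python
def mean_aggregate(features, adjacency):
    """For each node, concatenate its own features with the mean of its
    neighbours' features (aggregation only, no learned weights)."""
    out = []
    for node, own in enumerate(features):
        neighbours = adjacency.get(node, [])
        if neighbours:
            mean = [sum(features[n][d] for n in neighbours) / len(neighbours)
                    for d in range(len(own))]
        else:
            # Isolated node: no messages arrive, so use a zero vector.
            mean = [0.0] * len(own)
        out.append(own + mean)
    return out


# Toy graph: node 0 is connected to nodes 1 and 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = {0: [1, 2], 1: [0], 2: [0]}
hidden = mean_aggregate(features, adjacency)
```

Stacking several such layers lets each residue's representation absorb information from residues more than one edge away.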

4- In Table 1, the training and test datasets are different. Please explain why this is done.

Reply: Many thanks for the reviewer’s suggestion. In Table 1, we have used two groups of datasets for training and testing. The first comprises the training set TR646 and the independent test set TE46, while the other includes the training set TR573 and the independent test sets TE129 and TE181.

In the field of protein-DNA binding site prediction, these five classic datasets are widely used for model training and testing. The separation of the training set and the test set is essential to ensure the model's generalization ability. The training set is used to enable the model to learn the features of protein-DNA interactions, while the independent test set is used to evaluate the model's performance on unseen data. This separation helps to prevent overfitting and ensures the reliability and objectivity of the results.

Reviewer #3

The study is methodologically sound, innovative, and impactful. Addressing the identified weaknesses would further elevate its contributions to the field.

Recommendation: Accept with minor revisions.

Reply: We highly appreciate your positive comments and encouragement.

1. Sensitivity Analysis:

Include experiments to analyze the trade-offs between precision and recall for various datasets, particularly focusing on the biological implications of missing DNA-binding residues.

Reply: We appreciate the reviewer’s valuable suggestion. We have conducted a detailed analysis of the trade-offs between precision and recall for various datasets, particularly focusing on the biological implications of missing DNA-binding residues. This analysis is included in the "Results" and "Conclusions" sections of our manuscript. We have also outlined potential future research directions aimed at significantly improving recall while maintaining high precision, which will further enhance the overall performance of our model.
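The precision-recall trade-off discussed here can be illustrated with a small sketch that sweeps the decision threshold over per-residue scores. The scores and labels below are invented toy values, not results from the manuscript, and the function name is hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when residues scoring at or above
    the threshold are predicted as binding (label 1)."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy predicted binding probabilities and true labels (illustrative only).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 increases precision while recall falls, which is the pattern behind a model that produces few false positives but misses some binding residues.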

2. Efficiency Metrics:

Provide a comparison of computational time and resource utilization against competing methods to offer a holistic evaluation of the model’s practicality.

Reply: This is a good suggestion. In the "Conclusions" section, we have added a discussion on the computational resources used during our study to reduce computational time. We have included key training details such as dropout, alpha, gamma, learning rate, and epochs in section "Results" to enhance the reproducibility of our study. This information will help other researchers more accurately replicate our experimental results and compare computational efficiency.
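The response does not say what alpha and gamma denote. Assuming they are the weighting and focusing parameters of a focal loss, a common choice for imbalanced classification, the following sketch shows how they would act on a single residue. The default values 0.25 and 2.0 are illustrative only and are not the authors' reported settings.

```python
import math


def focal_loss(prob, label, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    prob  : predicted probability of the positive (binding) class
    label : 1 for binding, 0 for non-binding
    alpha : weight on the positive class (assumed meaning)
    gamma : focusing parameter that down-weights easy examples (assumed meaning)
    """
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident, correct prediction
hard = focal_loss(0.6, 1)  # less confident prediction on the same class
```

With gamma set to 0 the expression reduces to an alpha-weighted cross-entropy, so gamma alone controls how strongly well-classified residues are discounted.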

3. Future Directions:

Discuss potential integrations with advanced graph attention mechanisms or hybrid models to address current limitations in handling long protein sequences.

Consider incorporating more diverse datasets or synthetic benchmarks to evaluate robustness further.

Reply: We thank the reviewer for pointing out this issue. In the "Conclusions" section, we have added a discussion on the limitations of the model and proposed potential directions for future research to address these limitations. Specifically, we highlighted the need to further optimize the model's prediction strategy to improve recall while maintaining high precision. We plan to introduce more complex graph convolutional network architectures, integrate protein structure prediction tools, and adjust the model's prediction threshold. These improvements are expected to enhance the overall performance of the model.

4. Error Analysis:

A deeper error analysis to identify specific cases where the model underperforms (e.g., specific protein classes or sequence patterns) would provide actionable insights for further refinement.

Reply: We agree with the reviewer’s suggestion. In the final part of section "Results," we have added an analysis of the impact of GraphSMOTE on model performance, conducting a more in-depth examination of how data imbalance affects model performance, particularly recall. Although our model's recall (Rec) value is lower than that of CLAPE-DB, it outperforms CLAPE-DB in terms of precision (Pre) and other performance metrics. This reflects iProtDNA-SMOTE's emphasis on precision during the prediction process, effectively reducing false positives. In the "Conclusions" section, we have also added a discussion on the limitations of the model and proposed potential directions for future research to address these limitations.

Reviewer #4

I find the idea of using graph neural networks and SMOTE to predict protein-DNA binding sites quite intriguing. The experiments on the TR646, TE46, and TR573 datasets, and the comparisons to strong baselines like CLAPE-DB and DNAPred, show promising results with AUC values between 0.850 and 0.896. However, I think the current version needs some serious work before it's ready for a top-tier journal like PLOS ONE.

Reply: We deeply appreciate Reviewer#4’s overall positive feedback and constructive comments.

The first thing that struck me was the huge gap between the method section and the data visualization. The method section felt like a dense wall of text, making it hard to follow. More diagrams or figures to illustrate the model and the results would make it much easier to understand.

Reply: We completely agree with the reviewer's valuable suggestion. In response, we have added three new figures (Fig. 2, Fig. 3, and Fig. 4) to illustrate the key aspects of our model.

I was also disappointed by the lack of discussion about the model's limitations. The authors briefly mention potential issues with long sequences, but that's it. I'd really like to see a more in-depth analysis of things like computational cost, training time, and how well the model scales to larger datasets. This would give a more balanced perspective.

Reply: This is a good suggestion. In the "Conclusions" section, we have added a discussion on the limitations of the model and proposed potential directions for future research to address these limitations.

The writing style also felt a bit… robotic. It looks a bit too polished and maybe even a bit salesy. I think a simpler, more direct writing style would be much better.

Reply: We thank the reviewer for highlighting this issue. In response, we have simplified the language and made the writing more direct and accessible. We hope this improves the readability and clarity of our manuscript.

From a technical standpoint, I was concerned about the lack of ablation studies. The model combines several components, like the ESM2 pre-trained model and GraphSMOTE. It would be really helpful to see how much each of these components actually contributes to the final performa

Attachment

Submitted filename: renamed_51760.docx

pone.0320817.s011.docx (29.9KB, docx)

Decision Letter 1

Syed Nisar Hussain Bukhari

25 Feb 2025

iProtDNA-SMOTE: Enhancing Protein-DNA Binding Sites Prediction through Imbalanced Graph Neural Networks

PONE-D-24-57420R1

Dear Dr. Lin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the 'Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Syed Nisar Hussain Bukhari

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Attachment

Submitted filename: comments_declet_1.docx

pone.0320817.s010.docx (13.1KB, docx)

Acceptance letter

Syed Nisar Hussain Bukhari

PONE-D-24-57420R1

PLOS ONE

Dear Dr. Lin,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Syed Nisar Hussain Bukhari

Academic Editor

PLOS ONE

Associated Data


    Supplementary Materials

    S1 Tables. Supplementary Tables.

    (DOCX)

    pone.0320817.s001.docx (16.6KB, docx)
    S1 Dataset. iProtDNA-SMOTE benchmark datasets.

    (RAR)

    pone.0320817.s002.rar (414.6KB, rar)
    S1 Code. iProtDNA-SMOTE code.

    (RAR)

    pone.0320817.s003.rar (6.7KB, rar)
    S1 Weight. iProtDNA-SMOTE trained weights.

    (RAR)

    pone.0320817.s004.rar (3.5MB, rar)
    S1 Model. The graph model for dataset TE46.

    (RAR)

    pone.0320817.s005.rar (99.5MB, rar)
    Attachment

    Submitted filename: Review Comments iPROTDNA-smote.docx

    pone.0320817.s006.docx (13.1KB, docx)
    Attachment

    Submitted filename: comments.docx

    pone.0320817.s007.docx (13.8KB, docx)
    Attachment

    Submitted filename: Suggestions for Improvement.docx

    pone.0320817.s008.docx (13KB, docx)
    Attachment

    Submitted filename: renamed_51760.docx

    pone.0320817.s011.docx (29.9KB, docx)
    Attachment

    Submitted filename: comments_declet_1.docx

    pone.0320817.s010.docx (13.1KB, docx)

    Data Availability Statement

    All relevant data are within the paper and its Supporting Information files.


    Articles from PLOS One are provided here courtesy of PLOS
