PLOS Computational Biology. 2021 Aug 3;17(8):e1009247. doi: 10.1371/journal.pcbi.1009247

A novel artificial intelligence-based approach for identification of deoxynucleotide aptamers

Frances L Heredia 1, Abiel Roche-Lima 2, Elsie I Parés-Matos 1,*
Editor: Eleonora Alfinito 3
PMCID: PMC8362955  PMID: 34343165

Abstract

The selection of a DNA aptamer through the Systematic Evolution of Ligands by EXponential enrichment (SELEX) method involves multiple binding steps, in which a target and a library of randomized DNA sequences are mixed for selection of a single, nucleotide-specific molecule. Usually, 10 to 20 steps are required for SELEX to be completed. Throughout this process it is necessary to discriminate between true DNA aptamers and nonspecific DNA-binding sequences. Thus, a novel machine learning-based approach was developed to support and simplify the early steps of the SELEX process by helping to discriminate DNA aptamers from nonspecific DNA-binding sequences. An Artificial Intelligence (AI) approach to identify aptamers was implemented based on Natural Language Processing (NLP) and Machine Learning (ML). An NLP method (CountVectorizer) was used to extract information from the nucleotide sequences. Four ML algorithms (Logistic Regression, Decision Tree, Gaussian Naïve Bayes, Support Vector Machines) were trained using data from the NLP method along with sequence information. The best performing model was Support Vector Machines because it had the best ability to discriminate between positive and negative classes. For this model, an Accuracy (A) of 0.995, the fraction of samples that the model correctly classified, and an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.998, the degree to which a model is capable of distinguishing between classes, were observed. The developed AI approach is useful for identifying potential DNA aptamers and reducing the number of rounds in a SELEX selection. This new approach could be applied in the design of DNA libraries and result in a more efficient and faster process for choosing DNA aptamers during SELEX.

Author summary

In this manuscript, the authors explain the development and validation of a novel artificial intelligence approach to support and simplify the early steps of the SELEX process by helping to discriminate deoxynucleotide aptamers from nonspecific DNA-binding sequences. The approach was implemented based on Natural Language Processing and Machine Learning. CountVectorizer, a Natural Language Processing method, was used to extract information from nucleotide sequences. Four Machine Learning algorithms (Logistic Regression, Decision Tree, Gaussian Naïve Bayes, and Support Vector Machines) were trained using data from the Natural Language Processing method along with sequence information. Of these four trained algorithms, Support Vector Machines performed best and was selected as the final model because it had the best discriminatory metrics (i.e., Accuracy (A) = 0.995; AUROC (AU) = 0.998). In general, all models showed good metric results for predicting DNA aptamer sequences. The complexity of the Machine Learning model and the difficulty of interpreting it may hinder its application in standard practice. For this reason, a web app is already being developed to facilitate the interpretation and application of the obtained results.

Introduction

Aptamers are non-genomic but biologically active single-stranded nucleic acid molecules, typically ranging between 10 and 100 nucleotides [1]. These short sequences can be designed to bind, with high affinity and specificity, to a broad spectrum of molecular targets, ranging from ions and small organic molecules to macromolecules such as proteins, viruses, and entire cells [2–8]. Aptamers assume a variety of shapes due to their tendency to form helices and single-stranded loops [9–11]. They are extraordinarily versatile and bind targets with high selectivity and specificity. Applications of aptamers in the field of medicine include diagnostic devices, therapeutic drugs, antibody replacement, and drug delivery systems [12–21]. Aptamers are of high interest to the pharmaceutical industry due to substantially lower production costs, shelf lives of years, and, in many cases, high target specificity [22]. Moreover, since aptamers are chemically synthesized, they can provide a more reliable source of raw materials than antibodies, which are secreted by cells.

The selection of a DNA aptamer by Systematic Evolution of Ligands by EXponential enrichment (SELEX) consists of a binding step by mixing a target and a library containing vast patterns of randomized DNA sequences, each with a common fixed-sequence primer region [23]. A separation step is needed to isolate multiple target–DNA complexes from unbound DNA, followed by separation of the complexes employing filtration or chromatography techniques, and an amplification step by PCR [24]. The DNA sequences obtained are re-used as new DNA aptamer enriched pools, followed by another series of selection steps, called a "round". After repeated rounds, the DNA aptamers in the pools are sufficiently enriched and ready to be sequenced and evaluated as aptamers by way of a binding assay [25]. SELEX may require 10–20 rounds, leading to an overall procedure that is complex and time-consuming [26].

The SELEX technique has several limitations. For example, it requires 10 to 20 rounds to be completed, yet an increase in by-products can be found after seven or more rounds, and for protein and small-molecule targets a decrease in affinity was found to occur after 5–6 rounds [27–29]. Multiple rounds of SELEX significantly bias the types of sequences obtained [30]. Enrichment of nonspecifically binding oligonucleotides during this aptamer selection process is often observed [31]. Most published aptamers were manually selected, making the whole process of obtaining high-affinity, specific aptamers time-consuming [32]. Thus, a process requiring fewer rounds for aptamer selection is desirable.

Interest in the use of statistical methods in aptamer prediction approaches has grown lately. Computational techniques are simple, time-saving, cost-effective, and do not require specialized resources [33]. Computational methods for aptamer prediction fall into two major categories: prediction based on interaction and prediction based on structure. Interaction-based models take into account the physicochemical, energetic and conformational properties of the aptamers. These models, while they may not be very accurate, may shed some light on the mechanisms of interaction between aptamers and their targets, but they cannot be applied to the SELEX pipeline to reduce the number of steps [34]. Models based on structure folding tend to be more accurate, but their use is hampered by their dependency on the availability of homologous sequences [35].

For all these reasons, a novel Artificial Intelligence approach that includes Natural Language Processing and Machine Learning (ML) was developed to support aptamer–target interaction research with advanced computational tools. Our approach provides results that allow researchers to discriminate between aptamer and non-aptamer sequences for efficient SELEX data analysis. This approach can be used in the early rounds of SELEX to help distinguish between specific and nonspecific binding sequences. Moreover, the consensus aptamer sequence can be selected at the final round of SELEX. When more data are available, this approach could be applied to obtain more reliable predictive models that can eventually reduce the number of rounds of SELEX for consensus aptamer sequence identification.

Materials and methods

The workflow used in this paper (Fig 1) describes our Artificial Intelligence (AI) approach. It includes the Natural Language Processing (NLP) method for k-mer vectorization, along with feature selection, oversampling, machine learning training algorithms, and validation.

Fig 1. Overview of the AI approach used to obtain a model for the classification of a sequence as an aptamer.

It included the extraction of nucleotide sequences from the Nucleic Acid Database (NDB) and Aptagen. The sequences were converted into 6-mer vectors using the NLP modules. Out of the 5,123 vectors created, only the top 2.5% were selected for modeling, in the reduction of dimensionality module. Then the data was split into a training set (80% of the data, n = 4,099) and test set (20% of the data, n = 1,024). Because of data imbalance in the training set, the underrepresented samples were weighted highly. ML algorithms were trained to develop the models using the selected features. The developed models were tested using cross-validation and validated using the test sets. Fig 1 is also the Graphical Abstract.

Web scraping

The aptamers selected for this study were DNA aptamers with no modifications in either the bases or the backbone. The data for the DNA aptamers were web-scraped from the online database Aptagen [36] using a Python script. After downloading the data into a data frame, cleaning the data, and removing sequences with modified bases and duplicates, the data frame included 238 unique aptamer sequences. The data for the DNA sequences were web-scraped from the Nucleic Acid Database (NDB) [37] using a tailored Python script. Although NDB reports both the 5’ and the 3’ sequences, only the 5’ sequences were scraped. After downloading the data, a tailored script was used for data wrangling (i.e., cleaning and removing sequences with modified bases and duplicates). The final data frame included a total of 4,885 unique sequences. The code for the Python scripts is available on GitHub [38], and the raw data are available at Mendeley Data [39].

Feature engineering

NLP Method–CountVectorizer

NLP techniques were used to transform the nucleotide sequences into k-mer numerical vectors to be used as input to the machine learning training algorithms. The CountVectorizer function, included in the scikit-learn library [40], was applied using n-grams (n = 6) as a parameter. The CountVectorizer algorithm used in the analysis is an enumeration algorithm, which counts the total occurrences of all possible k-mers (or n-grams) of a given length ‘k’ (‘n’). K-mer counting involves counting the number of substrings of length k in a string S, or a set of strings, where k is a positive integer. For any length k, there are 4^k combinatorially possible k-mers. The ‘k’ (‘n’) value was set to 6 because a previous study indicated that 6-mers performed better than k-mers of other lengths in target-aptamer identification [41]. The set of these 6-mer vectors contained numerical information that described the nucleotide sequences. They were used as features to train the ML algorithms and obtain the predictive models.
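The counting step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' script: the function names `count_kmers` and `vectorize` are hypothetical, and the two input strings are only short fragments used for illustration. In scikit-learn, character-level counting of this kind would typically be requested with `CountVectorizer(analyzer="char", ngram_range=(6, 6))`, although the exact arguments used in the study are not stated.

```python
from collections import Counter

def count_kmers(sequence, k=6):
    """Count every overlapping substring of length k in one sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def vectorize(sequences, k=6):
    """Return (vocabulary, rows): one k-mer count vector per sequence."""
    counts = [count_kmers(s, k) for s in sequences]
    # The vocabulary holds only the k-mers actually observed, in sorted order.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[kmer] for kmer in vocabulary] for c in counts]
    return vocabulary, rows

# Two short illustrative fragments (not full aptamer sequences).
vocabulary, rows = vectorize(["GTGGGGGTG", "GGTTGGGGTG"])
print(len(vocabulary), "observed 6-mers out of", 4 ** 6, "possible")
```

A sequence of length L contributes L - 5 overlapping 6-mers, so each row of the resulting matrix sums to that value.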

Other computed features / variables

In addition to the 6-mer vectors, other features were also calculated for each nucleotide sequence. These features were sequence length, percentage of each base (Adenosine Percentage, Cytosine Percentage, Guanine Percentage, and Thymine Percentage), AT ratio, CG ratio, purine ratio, and pyrimidine ratio.
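A minimal sketch of how such composition features might be computed is given below. The text does not define the ratios, so the definitions used here are assumptions: AT and CG ratios are taken as the fraction of A+T and C+G bases, and purine and pyrimidine ratios as the fraction of A+G and C+T bases. The function name `sequence_features` and the dictionary keys are hypothetical.

```python
def sequence_features(sequence):
    """Compute simple composition features for one DNA sequence."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    # Fraction of each base in the sequence.
    frac = {base: seq.count(base) / n for base in "ACGT"}
    return {
        "length": n,
        "A_pct": 100 * frac["A"],
        "C_pct": 100 * frac["C"],
        "G_pct": 100 * frac["G"],
        "T_pct": 100 * frac["T"],
        # Assumed definitions: fractions of the total sequence length.
        "AT_ratio": frac["A"] + frac["T"],
        "CG_ratio": frac["C"] + frac["G"],
        "purine_ratio": frac["A"] + frac["G"],
        "pyrimidine_ratio": frac["C"] + frac["T"],
    }

features = sequence_features("GGTTGGGGTG")
print(features)
```

Under these assumed definitions the AT and CG ratios sum to one, as do the purine and pyrimidine ratios, whenever the sequence contains only the four unmodified bases.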

Dimensionality reduction

Only the top 2.5% most frequent features out of the 4,096 (4^6) originally generated were considered, which represented a total of 101 6-mer vectors. Before modeling, dimensionality reduction was performed on the remaining features by recursive feature elimination with logistic regression as the estimator [42]. The total number of features used to train the ML models was reduced to 33.
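The frequency filter in the first step can be sketched as follows. This is only an illustration under stated assumptions: frequency is taken as the total count of a k-mer over all sequences, the cutoff is computed as a fraction of the 4^k possible k-mers and truncated to an integer, and ties are broken by first appearance. How the study rounded the cutoff to reach 101 features is not stated (2.5% of 4,096 is 102.4). The function name `top_fraction_kmers` is hypothetical, and the recursive feature elimination step is not shown.

```python
from collections import Counter

def top_fraction_kmers(sequences, k=6, fraction=0.025):
    """Keep the most frequent k-mers, up to `fraction` of the 4**k possible ones."""
    totals = Counter()
    for seq in sequences:
        seq = seq.upper()
        totals.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumed rounding rule: truncate, but always keep at least one k-mer.
    n_keep = max(1, int(4 ** k * fraction))
    return [kmer for kmer, _ in totals.most_common(n_keep)]

# Toy example with 2-mers: 16 possible 2-mers, keep the top 1/16 of them.
toy = ["AAAA", "AAAT"]
kept = top_fraction_kmers(toy, k=2, fraction=0.0625)
print(kept)
```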

ML classification–modeling

The data were split randomly into training (80%) and testing (20%) sets, with stratified sampling by sequence type so that both sets kept a roughly equal proportion of DNA and aptamer sequences. One of the biggest challenges was the small number of aptamer sequences, which meant an imbalanced binary target variable (i.e., 238 aptamers and 4,885 DNA sequences). After testing undersampling and oversampling methods, such as SMOTE [43], weighting of the target classes was selected as the method to handle the imbalanced data. The parameter class_weight was set to ‘balanced’ in each ML training algorithm to avoid undermining the models’ predictive ability. Four of the most common supervised machine learning algorithms for classification were trained with our training set (i.e., Logistic Regression, Decision Tree, Gaussian Naïve Bayes and Support Vector Machines). A set of metrics [44] was chosen for model performance comparisons, including Accuracy, Specificity, Sensitivity and AUROC, which are defined as follows:

  • Accuracy is the fraction of samples that the model correctly classified and is defined as (TP+TN)/(TP+FP+FN+TN), where TP is True Positive, FP is False Positive, FN is False Negative, and TN is True Negative.

  • Specificity is the ratio of samples that the model correctly classified as negative classes to all the negative samples, and is defined as TN/(TN+FP).

  • Sensitivity represents the ratio of samples that the model correctly classified as positives classes to all the positive samples, and is defined as TP/(TP+FN).

  • The Area Under the Receiver Operating Characteristic curve (AUROC) summarizes the ROC curve, in which the true positive rate is plotted against the false positive rate; the area under this curve represents the degree to which a model is capable of distinguishing between classes [45].
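The four metrics defined above can be written out as a short sketch. The threshold metrics follow the formulas in the list directly. For AUROC, the sketch uses the standard rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half) rather than integrating a plotted curve. All counts, labels, and scores below are made-up toy values, not results from this study.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def specificity(tn, fp):
    return tn / (tn + fp)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy confusion-matrix counts and toy classifier scores.
print(accuracy(tp=8, fp=1, fn=2, tn=89))   # 97 of 100 samples correct
print(sensitivity(tp=8, fn=2), specificity(tn=89, fp=1))
print(auroc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.2, 0.1]))  # 5 of 6 pairs ordered correctly
```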

The confusion matrix is a tabular display of the samples by their actual and predicted class. Validations results using the testing set, as well as the confusion matrix for each model, were also computed and reported.
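The ‘balanced’ class weighting described earlier in this subsection follows a simple rule in scikit-learn: each class receives the weight n_samples / (n_classes × class count). The sketch below reproduces that rule in plain Python with the hypothetical helper `balanced_class_weights`. For illustration it uses the counts of the full dataset (238 aptamers and 4,885 DNA sequences); in practice the weights would be derived from the training split only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Illustration with the full-dataset counts reported in the text.
labels = ["aptamer"] * 238 + ["dna"] * 4885
weights = balanced_class_weights(labels)
print(weights)  # aptamers weigh roughly 20 times more than DNA sequences
```

With these weights, the total weight carried by each class is the same, so errors on the rare aptamer class are not drowned out by the much larger DNA class.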

Logistic Regression (LR)

For the Logistic Regression (LR) classifier [46], a Grid search was used to tune model hyperparameters, using a 5-fold cross-validation. The final classifier used an L2 penalty, a C value of 2, and an lbfgs solver.

Decision Tree (DT)

Decision Tree (DT) classifiers produce a comprehensible classification model that is highly accurate in many different cases, including balanced datasets [47]. Each node in the tree specifies a test on an attribute, and each branch descending from that node corresponds to one of the possible values for that attribute. Each leaf represents the class label associated with the instance. The final classifier used a square root function to determine the maximum number of features.

Gaussian Naïve Bayes (GNB)

Gaussian Naïve Bayes (GNB) classifiers are based on Bayes’ theorem [48]. This classifier assumes that the value of a particular feature is independent of the value of every other feature, given the class. Naïve Bayes classifiers were chosen for this study because they need only a small training sample to estimate the parameters needed for classification.
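The independence assumption can be made concrete with a minimal Gaussian Naïve Bayes sketch in plain Python. It is not the scikit-learn implementation used in the study: the function names `fit_gnb` and `predict_gnb`, the small variance floor, and the toy data are all assumptions made for illustration. Each class is described by a prior plus a mean and variance per feature, and the per-feature log-likelihoods are simply added because features are treated as independent given the class.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small floor (assumed value) keeps variances strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(prior, means, variances):
        log_lik = sum(-0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                      for x, m, var in zip(row, means, variances))
        return math.log(prior) + log_lik
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy two-feature data: class 0 and class 1 are well separated.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gnb(X, y)
print(predict_gnb(model, [1.1, 2.0]), predict_gnb(model, [3.1, 0.5]))
```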

Support Vector Machines (SVM)

Support Vector Machines (SVM) is a supervised machine learning algorithm that can be used for classification of high-dimensional data [49]. It uses a technique called the kernel trick: the data are implicitly transformed and, based on these transformations, the algorithm finds an optimal boundary (hyperplane) between the possible outputs, with the data points of each class falling on opposite sides of it. One benefit of SVM is that it captures more complex relationships between the data points, and it can accurately deal with complex non-linear boundaries. Its disadvantages are that training takes much longer and is computationally intensive, and that there is no probabilistic explanation for the classification. A grid search with 5-fold cross-validation was used to tune the model hyperparameters: C, which adds a penalty for each misclassified data point, and gamma, which controls the level of influence a single training point has on the model. The final classifier used a C value of 10 and a gamma of 0.01.
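The role of gamma can be illustrated with a short sketch. The text does not name the kernel, so the radial basis function (RBF) kernel used here is an assumption; it is a common choice when a gamma value is tuned. The function name `rbf_kernel` and the two points are hypothetical. The kernel value is the similarity that one training point contributes at another location, so a small gamma lets a point influence distant regions while a large gamma confines its influence to its immediate neighborhood.

```python
import math

def rbf_kernel(x, y, gamma=0.01):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

origin, point = [0.0, 0.0], [3.0, 4.0]   # squared distance is 25
print(rbf_kernel(origin, origin))            # identical points: similarity 1.0
print(rbf_kernel(origin, point, gamma=0.01)) # reported gamma: still fairly similar
print(rbf_kernel(origin, point, gamma=1.0))  # large gamma: similarity close to 0
```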

Validation and plots

All of the models were validated using 5-fold cross-validation with accuracy as the scoring metric. The cross-validation sets were generated from the initial dataset. The generated models and the confusion matrix were plotted for visual inspection. The symmetric correlation matrix was calculated and transformed into a heatmap to depict the relationship between all 6-mer sequences. Heatmaps were generated using the heatmap function in Seaborn [50]. Bar and scatter plots were generated using the plot functions in Matplotlib [51].
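The 5-fold procedure can be sketched in plain Python as follows. This is a generic illustration, not the authors' code: the helper names `k_fold_indices` and `cross_val_accuracy` are hypothetical, the folds are formed by a plain seeded shuffle (whether the study stratified its folds is not stated), and the toy "model" simply predicts the majority class.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Average validation accuracy over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / k

# Toy usage: a majority-class "model" on ten samples with a single label.
splits = list(k_fold_indices(10, k=5))
majority_fit = lambda X, y: max(set(y), key=y.count)
majority_predict = lambda model, x: model
score = cross_val_accuracy(majority_fit, majority_predict, list(range(10)), [0] * 10)
print(score)
```

Each sample is used for validation exactly once and for training in the remaining four folds.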

Characterization of the biological implications of the top 6-mers

DNA aptamer sequences and their structures were downloaded from Protein Data Bank (PDB) to understand the biological significance of the generated 6-mers. These sequences were used as input in a function that identifies the top 6-mers, according to Table 1. The structures where the 6-mers were found, were analyzed to understand their biological role.
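A function of this kind might look like the minimal sketch below. The name `find_kmers` and its return format are hypothetical, and the authors' actual function is not shown in the text. The 6-mers are the five listed in Table 4, written without the space the tables use for readability, and the two input strings are the overlapping fragments reported for the V7t1 and HD22 aptamers.

```python
def find_kmers(sequence, kmers):
    """Return {kmer: [0-based start positions]} for every overlapping match."""
    sequence = sequence.upper()
    hits = {}
    for kmer in kmers:
        positions = []
        start = sequence.find(kmer)
        while start != -1:
            positions.append(start)
            # Advance by one base so overlapping occurrences are also found.
            start = sequence.find(kmer, start + 1)
        if positions:
            hits[kmer] = positions
    return hits

TOP_6MERS = ["TGGTGG", "TGGGGG", "GGGGTG", "GGTTGG", "GGGGGG"]
v7t1_hits = find_kmers("GTGGGGGTG", TOP_6MERS)
hd22_hits = find_kmers("GGTTGGGGTG", TOP_6MERS)
print(v7t1_hits)
print(hd22_hits)
```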

Table 1. Comparison of existing aptamer predictive studies.

Algorithm | Aptamer Dataset | Non-aptamer Dataset | Some Features | Classifier | MCC
This Study | DNA aptamers (n = 238) | Protein binding DNA (n = 4885) | 6-mers for all sequences, Sequence features | Support Vector Machines | 0.896
[35] | DNA/RNA aptamers (n = 159); Small Molecule Targets (n = 20) | Randomly paired aptamers to Small Molecule targets | 1,2-mers for aptamers, Physical-chemical properties of targets | Nearest Neighbors | 0.670
[57] | DNA/RNA aptamers (n = 725); Protein Targets (n = 164) | Randomly paired aptamers to protein targets | 1,2-mers for aptamers, 1,2-mers for targets, Physical-chemical properties of targets | Random Forest | 0.461

Results and discussions

Explosive progress in high-throughput DNA sequencing has driven advances in analytical tools to identify base consensus motifs among subgroups of DNA sequences [52]. These sequence analysis tools can also be employed to identify patterns among non-genomic, yet functional, oligonucleotides called aptamers. Though aptamers are classified as non-genomic sequences, tools built for genomic sequence analysis can still be useful, as demonstrated in studies with RNA aptamers [53,54]. Our AI approach, which leverages NLP and ML techniques, is developed to classify and discriminate DNA aptamer sequences from genomic DNA sequences. For this study, the Aptagen and NDB databases are web-scraped to retrieve all sequences of published DNA aptamers and protein-binding DNA sequences, respectively.

Baseline characteristics

Of the 5,123 sequences retrieved, 4,885 are protein-binding DNA sequences from NDB, and 238 are DNA aptamer sequences from Aptagen. A group of features/variables is initially determined by the NLP method (i.e., 6-mer vectorization function), where a total of 33 6-mer features are used after dimensionality reduction. Other features/variables chosen for the DNA and aptamer sequences used in this study are shown in Table 2. The DNA sequences have an even distribution of the bases, while in the aptamer sequences the distribution of the bases is skewed towards the thymine and guanine residues. The AT and CG ratios are calculated for both types of sequences. In the DNA sequences, the AT and CG ratios are 0.5 on average. Meanwhile, in the aptamer sequences, the CG and AT ratios are 0.54 and 0.45, respectively, suggesting that aptamers are slightly more stable than DNA sequences. As mentioned above, 4,099 (80%) and 1,024 (20%) sequences are randomly assigned to the training and testing sets, respectively. After calculating the p-values, it is determined that the characteristics are similar between the training and testing sets for these variables/features.

Table 2. Characteristics of the other features/variables for the DNA and aptamer sequences (data are presented as Mean ± SD).

Variables | Overall n = 5123 | DNA Samples n = 4885 | Aptamer Samples n = 238 | P-value | Training Set n = 4099 | Testing Set n = 1024 | P-value
Adenosine Percentage | 24.1±10.4 | 24.3±10.4 | 20.8±8.4 | < 0.001 | 24.2±10.4 | 24.1±10.4 | 0.900
Cytosine Percentage | 24.7±10.3 | 24.9±10.4 | 21.5±8.4 | < 0.001 | 24.7±10.3 | 24.9±10.4 | 0.613
Guanine Percentage | 26.6±11.3 | 26.4±11.2 | 32.4±12.3 | < 0.001 | 26.7±11.3 | 26.7±11.2 | 0.950
Thymine Percentage | 24.4±10.5 | 24.4±10.6 | 24.7±8.5 | 0.600 | 24.4±10.5 | 24.3±10.5 | 0.727
AT Ratio | 0.48±0.14 | 0.49±0.15 | 0.40±0.10 | 0.001 | 0.49±0.15 | 0.48±0.15 | 0.739
CG Ratio | 0.51±0.15 | 0.51±0.15 | 0.50±0.10 | 0.007 | 0.51±0.15 | 0.52±0.14 | 0.691
Purine Percentage | 0.51±0.11 | 0.51±0.11 | 0.50±0.10 | < 0.001 | 0.51±0.11 | 0.52±0.15 | 0.956
Pyrimidine Percentage | 0.49±0.10 | 0.49±0.11 | 0.50±0.10 | < 0.001 | 0.49±0.11 | 0.51±0.11 | 0.885

A pair plot (Fig 2) provides a visual representation of the relationships between these other features/variables in the dataset. It is built from density plots and scatter plots. The density plots on the diagonal show the distribution of a single feature. At the bottom right corner, density plots for purine and pyrimidine distribution show a leptokurtic shape, with a mean around 50. For other features, aptamers show a leptokurtic distribution, while DNA features have a Gaussian distribution. Scatter plots above and below the diagonal show the relationship (or lack thereof) between the features of the two nucleotide types. These plots suggest that a single feature/variable is not enough to discriminate between DNA aptamers and DNA sequences without the use of a more sophisticated model.

Fig 2. Plot of the DNA vs. aptamer features.

Variables are plotted against themselves to show the distribution. The blue area depicts DNA sequences, while the orange area depicts aptamer sequences.

Feature/variable exploration

The t-distributed Stochastic Neighbor Embedding (t-SNE) is plotted to further explore the generated data. This dimensionality reduction algorithm is particularly well suited for the visualization of high-dimensional datasets [55]. The t-SNE algorithm calculates a similarity measure between pairs of data points in the high-dimensional space and in the low-dimensional space. It then tries to match these two similarity measures by optimizing a cost function. In this way, t-SNE maps the multi-dimensional data to a lower-dimensional space and helps reveal patterns in the data by exposing clusters based on the similarity of data points with multiple features. In Fig 3, DNA sequences are represented by the blue dots, while aptamer sequences are represented by the orange dots. The plot shows that DNA sequences can be clustered into a group, while DNA aptamers are more scattered and differ more from each other. Also, some aptamer samples fall into the DNA cluster. This result suggests that these samples are very similar to DNA samples and would therefore be challenging for a model to predict. Although some DNA samples fall outside of their cluster, they are far apart from the aptamer samples and could be predicted by a model. However, it is important to note that the distances between points are relative: because the algorithm is non-linear, the distances shown on the x- and y-axes have no direct interpretation.

Fig 3. t-SNE of the Dataframe colored by sequence type.

Blue dots depict genomic DNA sequences, while orange dots depict DNA aptamer sequences.

In Fig 4, the observed occurrences of the 6-mers are plotted. They are calculated and normalized for each of the oligonucleotides. It is evident from this figure that the distribution of the 6-mers varies greatly between DNA and aptamers and that 6-mers with high GT content are more frequent in aptamers. In Fig 5, 6-mer content is compared in an all-versus-all pairwise fashion to determine correlation coefficients (CC) for 5,100 comparisons in total (including comparisons between the same sequences). The CC values represent how likely two 6-mers are to be present within one sequence. In the heatmap, a darker, bluer color denotes a higher CC value, closer to 1, meaning that the two 6-mers occur together within one sequence more often. Lighter, whiter colors indicate CC values closer to 0, meaning that the 6-mer pair is unrelated.
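An all-versus-all comparison of this kind can be sketched as below. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the function names `pearson` and `correlation_matrix` as well as the toy count columns are hypothetical. Each column holds the count of one 6-mer across the same set of sequences.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long count columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Symmetric matrix of pairwise correlations, stored as nested dicts."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy counts of three 6-mers across four sequences (made-up values).
columns = {
    "GGTTGG": [1, 0, 2, 0],
    "TGGTGG": [1, 0, 2, 0],
    "GCACAG": [0, 1, 0, 1],
}
cc = correlation_matrix(columns)
print(cc["GGTTGG"]["TGGTGG"], cc["GGTTGG"]["GCACAG"])
```

The diagonal of such a matrix is always 1, which corresponds to the dark diagonal of the heatmap.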

Fig 4. Bar plot of selected 6-mers and their normalized distribution.

Top graph: distribution of chosen 6-mers in genomic DNA sequences. Bottom graph: distribution of chosen 6-mers in DNA aptamer sequences.

Fig 5. The heatmap of features correlation.

The blue diagonal represents a correlation factor equal to one. Blue color means a positive correlation, while white color means no correlation.

ML algorithm performances and final models

The training data set is used to train the ML algorithms. The computed metric values (i.e., Accuracy, Sensitivity, Specificity, and AUROC) can be seen in Table 3. For the model obtained with LR, the accuracy and AUROC are 96.3% and 0.988, respectively. Fig 6 shows the confusion matrix and plots for each ML algorithm; Fig 6A corresponds to LR. The DT algorithm is also used because it is recommended when the dataset is small or when the data are imbalanced [56]. As can be seen in Table 3, when the DT is trained, the accuracy of the obtained model increases to 99.0%, but the AUROC decreases to 0.918, showing that the model is more suitable for predicting DNA sequences than aptamer sequences (the confusion matrix and plot can be seen in Fig 6B). GNB is another ML algorithm that is trained with our data set (Fig 6C shows the plots and confusion matrix). The accuracy of the model obtained with GNB is lower than that of the previous models, as can be seen in Table 3. SVM is also trained to develop a model for aptamers. The results in Table 3 show that the final SVM model has the highest metric values for accuracy and AUROC (i.e., 99.2% and 0.998, respectively). This suggests that the SVM model discriminates best between DNA and aptamer sequences, as can be seen in Fig 6D. In general, all models show a high AUROC metric value for predicting aptamer sequences, as shown in Fig 7.

Table 3. Classifiers’ predictive performance in the testing set.

Classifier | Accuracy | Sensitivity | Specificity | AUROC
LR | 0.963 | 0.999 | 0.893 | 0.988
SVM | 0.992 | 0.997 | 0.878 | 0.998
DT | 0.990 | 0.998 | 0.805 | 0.918
GNB | 0.917 | 0.919 | 0.878 | 0.926

Fig 6. Scatter plots of each ML model.

DNA aptamer sequences are shown as orange dots and DNA sequences are shown as dark blue dots. The insert shows the confusion matrix of each model. (A) Logistic Regression, (B) Decision Tree Classifier, (C) Gaussian Naïve Bayes and (D) Support Vector Machines. The light gray area is the boundary for predicted DNA sequences and the dark gray area is the boundary for predicted DNA aptamer sequences.

Fig 7. Receiver-operating characteristic curve by machine learning model.

The closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.

Comparison with other reported ML algorithms

Other ML algorithms have been used to obtain models to predict aptamer sequences. One such approach uses sequences of DNA and RNA aptamers from the now defunct Aptamer Base Database along with the physical-chemical properties of the small-molecule targets they bind to. Using a Nearest Neighbors algorithm, the authors obtained a Matthews Correlation Coefficient (MCC) of 0.670, and the main predictive features in their study were related to the electrostatic and chemical descriptors of the target molecules and not the aptamers themselves [35]. The second reported ML approach also employed sequences of DNA and RNA aptamers from the Aptamer Base Database along with the protein targets they bind [57]. For this approach, the authors calculated the 1-mers and 2-mers for both the aptamer and protein sequences, as well as physical-chemical properties of the proteins. Using a Random Forest algorithm, they obtained a model with an MCC of 0.461 to predict the top aptamer related to the target. Both approaches use aptamer sequences randomly paired to the targets as the negative data to train on. To the best of our knowledge, our reported models are the only ones capable of discriminating aptamer sequences from non-aptamer sequences. In addition, our study identifies which sequence features make good candidates for aptamers. To compare our approach with the results from the other studies (i.e., [35] and [57]), we computed the MCC metric for our best model, that is, SVM. As can be seen in Table 1, we obtain the best MCC value.
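For reference, the MCC used in this comparison is computed from the four confusion-matrix counts with the standard formula (TP×TN − FP×FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below uses made-up counts, not the confusion matrix of the SVM model, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Toy counts for an imbalanced problem (10 positives, 90 negatives).
print(mcc(tp=8, fp=1, fn=2, tn=89))
print(mcc(tp=5, fp=0, fn=0, tn=5))   # perfect prediction
print(mcc(tp=0, fp=5, fn=5, tn=0))   # fully inverted prediction
```

Unlike accuracy, the MCC only approaches 1 when both classes are predicted well, which makes it a reasonable basis for comparing models trained on imbalanced data.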

Characterization of the biological implications of the top 6-mers

The feature ranking method is used to identify the optimal features required for high accuracy in the ML algorithms. Out of the 110 features that are used as input in the principal component analysis, the top 33 are 6-mers vectors. According to Table 4, the 6-mers with the highest relevance across all ML models were TGG TGG, TGG GGG, GGG GTG, GGT TGG, GCA CAG and GGG GGG. From the top 6-mers identified, five are found as structures in PDB (Fig 8) and have been involved in protein binding. Three of these five 6-mers are found in a hairpin motif, and two are found in a G-quadruplex arrangement.

Table 4. Top predictive features (6-mers), their reported structural function and contribution.

Predictive Sequence | Reported Structural Function | References
TGG TGG | A section of a G-quadruplex that interacts with Thrombin. | [58]
TGG GGG | A section of a hairpin loop. | [10,59]
GGG GTG | A section of a hairpin loop that interacts with Thrombin and VEGF. | [60]
GGT TGG | Complementary chain in the G-quadruplex; the two thymine (T) residues interact with Thrombin. | [58,60,61]
GGG GGG | A small section of the hairpin loop that interacts with HIV-1 reverse transcriptase. | [62]

Fig 8. Structures of elucidated Aptamers with their sequences.

The identified 6-mers are highlighted in red and blue, in the sequence and structure, the area where the 6-mers overlap is colored in purple. (A) NU172 Aptamer PDB:6GN7; (B) AMP Aptamer PDB: 1AW4; (C) V7t1 VEGF Aptamer PDB: 2M53; (D) HD22 Aptamer PDB: 4I7Y and (E) HIV-1 RT Aptamer PDB:5D36.

Aptamer predictive sequences

After the SVM model was generated, the sequences most closely related to the 6-mers (features) used to predict the DNA aptamer sequences were extracted. In the DNA sequences of all living organisms, 6-mers are short recurring elements. Within genomic DNA, due to their functional importance, these elements are both conserved and diverged across species, making these 6-mer patterns suitable for species identification. 6-mers may be part of the core segment of transcription factor binding sites or regulatory elements that participate in protein binding and gene regulation in different subregions of the genome [63]. Given the significance of 6-mers in genomic DNA, it could be assumed that 6-mers found in aptamers would also hold some biological or structural importance. To determine the biological importance of these 6-mers, the top 5 predictive 6-mers are compared to the elucidated aptamer structures from the PDB database (Table 4). Where these 6-mers occur in the elucidated aptamer structures is described in Table 5. It is important to highlight that while 6-mers may have functional or structural importance in these aptamers, it is premature to conclude that 6-mers always have biological significance. Very few DNA aptamers have been studied from a structural perspective; in particular, only 16 unique structures of DNA aptamers are deposited in the PDB.

Table 5. Identification of the top 6-mers on elucidated aptamer structures.

Aptamer name Description
NU172 The crystal structure of NU172 is shown in Fig 8A. This DNA aptamer was designed to bind Thrombin and has a high potency as an anticoagulant [58]. NU172 contains two of the top predictive 6-mers, GGT TGG and TGG TGG. This structure has a chair-like anti-parallel fold, where the 6-mer GGT TGG forms G-tetrad type I and II. The second 6-mer, TGG TGG, is part of a TGT loop that surrounds the G-tetrad. This TGT loop is highly flexible and different from the TGT loops found in other DNA aptamer sequences.
AMP The NMR structure of AMP is shown in Fig 8B. This DNA aptamer binds AMP as well as adenosine with an affinity shown to be 6 μM [10]. It has the 6-mer sequence TGG GGG, as one of two highly conserved guanine-rich regions. The TGG GGG sequence is part of two AMP-binding sites, which are located in the minor groove of the DNA helix. The TGG GGG also adapts a complex with the major groove centered about the adjacently bound AMP molecules.
V7t1 The NMR structure of V7t1 is shown in Fig 8C. This DNA aptamer binds to VEGF165 and VEGF121, the two most abundant VEGF isoforms, with KD values at a very low nanomolar concentration [64]. V7t1 comprises several G-rich regions and folds into a G-quadruplex. It contains the top predictive 6-mers GGG GTG and TGG GGG, which overlap in the sequence GTGGGGGTG. These nucleotides are numbered as G2–G10. Loop regions comprise a non-residue propeller-type loop between G6 and G7 and a T9-G10 D-shaped loop connecting outermost residues G8 and G11 within the same strand. The DNA backbone is in an extended conformation in the G6-G8 tract, which displaces G6 and G8 from what is considered their ideal stacking position. The DNA strand connects residues G8 and G11 of the outer G-quartets within the same column of the G-quadruplex core. G11 adopts an anti conformation about its glycosidic bond, and the G7-G8 and G11 segments can be considered parts of two DNA strands oriented in a parallel fashion.
HD22 The crystal structure of HD22 is shown in Fig 8D. This DNA aptamer also binds to Thrombin and exhibits a substantially higher negative charge density than other thrombin aptamers, thus strengthening its specificity for target recognition [61]. It is also a bimodular aptamer, comprising a double helix and a G-quadruplex. This aptamer has the top predictive 6-mers GGG GTG and GGT TGG, which overlap in a single sequence as GGTTGGGGTG. Its bases are numbered as G17–G25. This region of the aptamer is part of the duplex structure, and it is organized into a G-tetrad capped by Thy18-Thy19 on one side. The interaction between HD22-27 and Thrombin involves numerous residues, including nucleotides Thy18, Thy19, Gua20, Gua23, and Thy24 of the aptamer and segments 89–101 and 230–245 of Thrombin. Hydrophobic contacts, mainly involving loop residues Thy18 and Thy19, also contribute to the stability of the complex. A further anchorage is produced by Thy24, which bulges from the duplex region of the nucleotide into a protein pocket where it is mainly involved in polar contacts.
HIV-1 RT The crystal structure of HIV-1 RT is shown in Fig 8E. This DNA aptamer binds to HIV-1 reverse transcriptase with ultra-high affinity [62]. It has two repeats of the 6-mer GGG GGG, numbered as G27–G32, as part of the primer duplex strand. Conformational analysis of this aptamer suggests that its base pairs adopt a B-form geometry. Nucleotides 28–33 can interact with crucial amino acid residues located in the p66 finger as well as in the palm and thumb subdomains of HIV-1 reverse transcriptase.
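
The overlaps reported in Table 5 can be checked with a short script. The following is a minimal sketch, not the authors' code: it scans a sequence for each of the five top predictive 6-mers named in the table and reports their 0-based start positions. The function name `find_kmers` is hypothetical, and only the two overlapping fragments quoted in the table are used as inputs, not the full aptamer sequences.

```python
# Top predictive 6-mers as named in Table 5 (spaces removed).
TOP_6MERS = ["GGTTGG", "TGGTGG", "TGGGGG", "GGGGTG", "GGGGGG"]


def find_kmers(sequence, kmers=TOP_6MERS):
    """Return {kmer: [0-based start positions]} for every k-mer found.

    Overlapping occurrences are reported, because the scan advances one
    nucleotide at a time instead of jumping past each match.
    """
    sequence = sequence.upper().replace(" ", "")
    hits = {}
    for kmer in kmers:
        positions = []
        start = sequence.find(kmer)
        while start != -1:
            positions.append(start)
            start = sequence.find(kmer, start + 1)
        if positions:
            hits[kmer] = positions
    return hits


# Fragments quoted in Table 5 (illustrative inputs, not full aptamers).
v7t1_fragment = "GTGGGGGTG"
hd22_fragment = "GGTTGGGGTG"

print(find_kmers(v7t1_fragment))
print(find_kmers(hd22_fragment))
```

In the V7t1 fragment, TGGGGG starts at position 1 and GGGGTG at position 3; in the HD22 fragment, GGTTGG starts at position 0 and GGGGTG at position 4. In both cases the two 6-mers share nucleotides, which is the overlap described in the table.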

Conclusions

Aptamer screening via SELEX could be described as an obscure series of in vitro experiments that lacks the principles of binding-motif design in the screening library. In the absence of these principles, large sets of random sequences must first be scanned to recognize potential aptamers for a given target. In this work, an AI-based approach, built on NLP and ML, was developed to predict whether a given sequence is an aptamer. It uses an NLP method to convert DNA sequences into smaller numerical representations (i.e., features) and ML to obtain a predictive model that classifies a sequence as an aptamer or a genomic DNA sequence. Using the best model examined in this paper to predict aptamers is a promising way to improve SELEX protocols and accelerate aptamer development. This new approach may allow DNA sequences resulting from the first round of SELEX to be pre-selected as potential aptamers for the second round by eliminating non-specific binding sequences. This analytical step could reduce the number of SELEX rounds required to produce a good aptamer. Based on these comparative studies, new screening libraries could be developed that overrepresent the promising 6-mers found or intentionally exclude features that are not indicative of a DNA aptamer. This approach will allow us to identify aptamers faster and more precisely, so that more aptamers can be generated in the future.
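
The feature-extraction step summarized above can be illustrated in a few lines. The example below is a simplified, standard-library stand-in for the CountVectorizer step, not the authors' implementation: it splits each sequence into overlapping 6-mers and counts them against a shared vocabulary. The function names `kmers` and `count_matrix` and the two toy sequences are assumptions made for illustration.

```python
from collections import Counter


def kmers(sequence, k=6):
    """Split a DNA sequence into its overlapping k-mers ("words")."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_matrix(sequences, k=6):
    """Build a sorted k-mer vocabulary and one count vector per sequence."""
    counts = [Counter(kmers(seq, k)) for seq in sequences]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Toy inputs (hypothetical); real inputs would be the aptamer and
# genomic DNA sequences of the training set.
toy_sequences = ["GGTTGGTGTGGTTGG", "GTGGGGGTG"]
vocabulary, matrix = count_matrix(toy_sequences)

# Print each 6-mer with its count in every sequence.
for word, column in zip(vocabulary, zip(*matrix)):
    print(word, column)
```

Each row of `matrix` is the numerical representation of one sequence, and it is vectors of this kind that a classifier such as an SVM would be trained on. In scikit-learn, `CountVectorizer(analyzer="char", ngram_range=(6, 6))` produces comparable character 6-gram counts; whether the authors used exactly this configuration is an assumption.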

One limitation of this work is that the examined methods treated all aptamers as the same and did not take into consideration the different binding strengths of the aptamers or the different types of binding targets (i.e., proteins, small molecules, whole cells). As more aptamer data become available, new studies could take those differences into account. Also, this model has not been validated experimentally. Thus, future studies could include testing the model in a SELEX experiment to better assess its viability. Future work also includes improving the quality of the models by using more data as they become available. Finally, the complexity of ML models and the difficulty of interpreting them may hinder their adoption in standard practice. For this reason, a web app is already under development to facilitate the interpretation and application of the obtained results.

Acknowledgments

The authors thank Luis E. Vázquez-Quiñones, professor of the School of Sciences and Technology of the Universidad Metropolitana-Ana G. Méndez, for his comments and suggestions during the writing of this manuscript.

Abbreviations

AI: Artificial Intelligence
DT: Decision Tree
GNB: Gaussian Naïve Bayes
LR: Logistic Regression
ML: Machine Learning
NLP: Natural Language Processing
SVM: Support Vector Machines
SELEX: Systematic Evolution of Ligands by Exponential enrichment
t-SNE: t-distributed Stochastic Neighbor Embedding

Data Availability

The data is available as Heredia F. DNA/Aptamer dataset, Mendeley. 2020;1. doi: 10.17632/76jgjbgndr.1, and in GitHub at https://github.com/eipm-uprm/Aptamer-ML.

Funding Statement

Research reported in this publication was supported by RCMI grant U54 MD007600 (National Institute on Minority Health and Health Disparities) from the National Institutes of Health (https://www.nimhd.nih.gov/programs/extramural/research-centers/rcmi/rcmi-grants.html). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Ku TH, Zhang T, Luo H, Yen TM, Chen PW, Han Y, et al. Nucleic acid aptamers: An emerging tool for biotechnology and biomedical sensing. Sensors (Basel). 2015;15(7):16281–16313. doi: 10.3390/s150716281
  • 2. Miyagawa A, Okada Y, Okada T. Aptamer-Based Sensing of Small Organic Molecules by Measuring Levitation Coordinate of Single Microsphere in Combined Acoustic–Gravitational Field. ACS Omega. 2020;5(7):3542–3549. doi: 10.1021/acsomega.9b03860
  • 3. Ruscito A, DeRosa MC. Small-Molecule Binding Aptamers: Selection Strategies, Characterization, and Applications. Front Chem. 2016;4:14. doi: 10.3389/fchem.2016.00014
  • 4. Ng EWM, Shima DT, Calias P, Cunningham ET, Guyer DR, Adamis AP. Pegaptanib, a targeted anti-VEGF aptamer for ocular vascular disease. Nat Rev Drug Discov. 2006;5(2):123–132. doi: 10.1038/nrd1955
  • 5. Yeom G, Kang J, Jang H, Nam HY, Kim M-G, Park C-J. Development of DNA Aptamers against the Nucleocapsid Protein of Severe Fever with Thrombocytopenia Syndrome Virus for Diagnostic Application: Catalytic Signal Amplification using Replication Protein A-Conjugated Liposomes. Anal Chem. 2019;91(21):13772–13779. doi: 10.1021/acs.analchem.9b03210
  • 6. Zou X, Wu J, Gu J, Shen L, Mao L. Application of Aptamers in Virus Detection and Antiviral Therapy. Front Microbiol. 2019;10:1462. doi: 10.3389/fmicb.2019.01462
  • 7. Cerchia L, de Franciscis V. Targeting cancer cells with nucleic acid aptamers. Trends Biotechnol. 2010;28(10):517–525. doi: 10.1016/j.tibtech.2010.07.005
  • 8. Davis KA, Abrams B, Lin Y, Jayasena SD. Staining of cell surface human CD4 with 2′-F-pyrimidine-containing RNA aptamers for flow cytometry. Nucleic Acids Res. 1998;26(17):3915–3924. doi: 10.1093/nar/26.17.3915
  • 9. Lin CH, Patel DJ. Structural basis of DNA folding and recognition in an AMP-DNA aptamer complex: distinct architectures but common recognition motifs for DNA and RNA aptamers complexed to AMP. Chem Biol. 1997;4(11):817–832. doi: 10.1016/s1074-5521(97)90115-0
  • 10. Choi S-J, Ban C. Crystal structure of a DNA aptamer bound to PvLDH elucidates novel single-stranded DNA structural elements for folding and recognition. Sci Rep. 2016;6:34998. doi: 10.1038/srep34998
  • 11. Afanasyeva A, Nagao C, Mizuguchi K. Prediction of the secondary structure of short DNA aptamers. Biophys Physicobiol. 2019;16:287–294. doi: 10.2142/biophysico.16.0_287
  • 12. Zhu Q, Shibata T, Kabashima T, Kai M. Inhibition of HIV-1 protease expression in T cells owing to DNA aptamer-mediated specific delivery of siRNA. Eur J Med Chem. 2012;56:396–399. doi: 10.1016/j.ejmech.2012.07.045
  • 13. Kato K, Ikeda H, Miyakawa S, Futakawa S, Nonaka Y, Fujiwara M, et al. Structural basis for specific inhibition of Autotaxin by a DNA aptamer. Nat Struct Mol Biol. 2016;23(5):395–401. doi: 10.1038/nsmb.3200
  • 14. Forier C, Boschetti E, Ouhammouch M, Cibiel A, Ducongé F, Nogré M, et al. DNA aptamer affinity ligands for highly selective purification of human plasma-related proteins from multiple sources. J Chromatogr A. 2017;1489:39–50. doi: 10.1016/j.chroma.2017.01.031
  • 15. Lin X, Ivanov AP, Edel JB. Selective single molecule nanopore sensing of proteins using DNA aptamer-functionalised gold nanoparticles. Chem Sci. 2017;8(5):3905–3912. doi: 10.1039/c7sc00415j
  • 16. Jarczewska M, Rębiś J, Górski Ł, Malinowska E. Development of DNA aptamer-based sensor for electrochemical detection of C-reactive protein. Talanta. 2018;189:45–54. doi: 10.1016/j.talanta.2018.06.035
  • 17. Trausch JJ, Shank-Retzlaff M, Verch T. Replacing antibodies with modified DNA aptamers in vaccine potency assays. Vaccine. 2017;35(41):5495–5502. doi: 10.1016/j.vaccine.2017.04.003
  • 18. Garner MM, Revzin A. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res. 1981;9(13):3047–3060. doi: 10.1093/nar/9.13.3047
  • 19. Morita Y, Leslie M, Kameyama H, Volk D, Tanaka T. Aptamer Therapeutics in Cancer: Current and Future. Cancers (Basel). 2018;10(3):80. doi: 10.3390/cancers10030080
  • 20. Oktem HA, Bayramoglu G, Ozalp VC, Arica MY. Single-Step Purification of Recombinant Thermus aquaticus DNA Polymerase Using DNA-Aptamer Immobilized Novel Affinity Magnetic Beads. Biotechnol Prog. 2007;23(1):146–154. doi: 10.1021/bp0602505
  • 21. Catuogno S, Esposito CL, de Franciscis V. Aptamer-mediated targeted delivery of therapeutics: An update. Pharmaceuticals (Basel). 2016;9(4):69. doi: 10.3390/ph9040069
  • 22. Kedzierski S, Caltagirone T, Khoshnejad M. Synthetic Antibodies: The Emerging Field of Aptamers. Bioprocess J. 2012;11(4):46–49. doi: 10.12665/J114.KedzierskiCaltagirone
  • 23. Klug SJ, Famulok M. All you wanted to know about SELEX. Mol Biol Rep. 1994;20(2):97–107. doi: 10.1007/BF00996358
  • 24. Gopinath SCB. Methods developed for SELEX. Anal Bioanal Chem. 2007;387(1):171–182. doi: 10.1007/s00216-006-0826-2
  • 25. Rahimi F, Murakami K, Summers JL, Chen CHB, Bitan G. RNA aptamers generated against oligomeric Aβ40 recognize common amyloid aptatopes with low specificity but high sensitivity. PLoS One. 2009;4(11):e7694. doi: 10.1371/journal.pone.0007694
  • 26. Zhuo Z, Yu Y, Wang M, Li J, Zhang Z, Lui J, et al. Recent advances in SELEX technology and aptamer applications in biomedicine. Int J Mol Sci. 2017;18(10):2142. doi: 10.3390/ijms18102142
  • 27. Jing M, Bowser MT. Methods for Measuring Aptamer-Protein Equilibria: A Review. Anal Chim Acta. 2011;686(1–2):9–18. doi: 10.1016/j.aca.2010.10.032
  • 28. Tolle F, Wilke J, Wengel J, Mayer G. By-Product Formation in Repetitive PCR Amplification of DNA Libraries during SELEX. PLoS One. 2014;9(12):e114693. doi: 10.1371/journal.pone.0114693
  • 29. Zhou J, Rossi J. Aptamers as targeted therapeutics: current potential and challenges. Nat Rev Drug Discov. 2017;16(3):181–202. doi: 10.1038/nrd.2016.199
  • 30. Blind M, Blank M. Aptamer Selection Technology and Recent Advances. Mol Ther Nucleic Acids. 2015;4(1):e223. doi: 10.1038/mtna.2014.74
  • 31. Stoltenburg R, Reinemann C, Strehlitz B. SELEX—a (r)evolutionary method to generate high-affinity nucleic acid ligands. Biomol Eng. 2007;24(4):381–403. doi: 10.1016/j.bioeng.2007.06.001
  • 32. Gijs M, Penner G, Blackler GB, Impens NREN, Baatout S, Luxen A, et al. Improved aptamers for the diagnosis and potential treatment of HER2-positive cancer. Pharmaceuticals (Basel). 2016;9(2):29. doi: 10.3390/ph9020029
  • 33. Ahirwar R, Nahar S, Aggarwal S, Ramachandran S, Maiti S, Nahar P. In silico selection of an aptamer to estrogen receptor alpha using computational docking employing estrogen response elements as aptamer-alike molecules. Sci Rep. 2016;6:21285. doi: 10.1038/srep21285
  • 34. Wang S, Zhang YH, Lu J, Cui W, Hu J, Cai YD. Analysis and Identification of Aptamer-Compound Interactions with a Maximum Relevance Minimum Redundancy and Nearest Neighbor Algorithm. Biomed Res Int. 2016;2016:8351204. doi: 10.1155/2016/8351204
  • 35. Muppirala UK, Honavar VG, Dobbs D. Predicting RNA-Protein Interactions Using Only Sequence Information. BMC Bioinformatics. 2011;12:489. doi: 10.1186/1471-2105-12-489
  • 36. https://www.aptagen.com/apta-index/
  • 37. Coimbatore Narayanan B, Westbrook J, Ghosh S, Petrov AI, Sweeney B, Zirbel CL, et al. The Nucleic Acid Database: new features and capabilities. Nucleic Acids Res. 2014;42(Database issue):D114–D122. doi: 10.1093/nar/gkt980
  • 38. https://github.com/eipm-uprm/Aptamer-ML.git
  • 39. Heredia F. DNA/Aptamer dataset, Mendeley. 2020;1. doi: 10.17632/76jgjbgndr.1
  • 40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12(85):2825–2830.
  • 41. Song J, Zheng Y, Huang M, Wu L, Wang W, Zhu Z, et al. A Sequential Multidimensional Analysis Algorithm for Aptamer Identification based on Structure Analysis and Machine Learning. Anal Chem. 2020;92(4):3307–3314. doi: 10.1021/acs.analchem.9b05203
  • 42. Li F, Yang Y. Analysis of recursive feature elimination methods. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), pp 633–634. doi: 10.1145/1076034.1076164
  • 43. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. Int J Artif Intell Res. 2001;16(1):321–357. doi: 10.1613/jair.953
  • 44. Hossin M, Sulaiman MN. A Review on Evaluation Metrics for Data Classification Evaluations. Int J Data Min Knowl Manag Process (Online). 2015;5(2):01–11. doi: 10.5121/ijdkp.2015.5201
  • 45. Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med. 2013;4(2):627–635.
  • 46. Maalouf M, Trafalis TB. Robust weighted kernel logistic regression in imbalanced and rare events data. Comput Stat Data Anal. 2011;55(1):168–183. doi: 10.1016/j.csda.2010.06.014
  • 47. Gao J, Tan PN. Converting output scores from outlier detection algorithms into probability estimates. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), pp 18–22. doi: 10.1109/ICDM.2006.43
  • 48. Xu S. Bayesian Naïve Bayes classifiers to text classification. J Inf Sci. 2016;44:1–12. doi: 10.1177/0165551516677946
  • 49. Ghaddar B, Naoum-Sawaya J. High dimensional data classification and feature selection using support vector machines. Eur J Oper Res. 2018;265(3):993–1004. doi: 10.1016/j.ejor.2017.08.040
  • 50. Waskom M and the Seaborn development team. Seaborn: statistical data visualization. 2020. doi: 10.5281/zenodo.592845
  • 51. Hunter JD. Matplotlib: A 2D Graphics Environment. Comput Sci Eng. 2007;9(3):90–95. doi: 10.1109/MCSE.2007.55
  • 52. Das MK, Dai H-K. A survey of DNA motif finding algorithms. BMC Bioinformatics. 2007;8 Suppl 7(Suppl 7):S21. doi: 10.1186/1471-2105-8-S7-S21
  • 53. Caroli J, Taccioli C, De La Fuente A, Serafini P, Bicciato S. APTANI: a computational tool to select aptamers through sequence-structure motif analysis of HT-SELEX data. Bioinformatics. 2016;32:161–164. doi: 10.1093/bioinformatics/btv545
  • 54. Zimmermann B, Gesell T, Chen D, Lorenz C, Schroeder R. Monitoring Genomic Sequences during SELEX Using High-Throughput Sequencing: Neutral SELEX. PLoS One. 2010;5(2):e9169. doi: 10.1371/journal.pone.0009169
  • 55. van der Maaten L, Hinton G. Visualizing Data using t-SNE. J Mach Learn Res. 2008;9:2579–2605.
  • 56. Cieslak DA, Chawla NV. Learning decision trees for unbalanced data. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Berlin, Heidelberg: Springer; 2008. pp. 241–256. doi: 10.1007/978-3-540-87479-9_34
  • 57. Li B-Q, Zhang Y-C, Huang G-H, Cui W-R, Zhang N, Cai Y-D. Prediction of Aptamer-Target Interacting Pairs with Pseudo-Amino Acid Composition. PLoS One. 2014;9(1):e86729. doi: 10.1371/journal.pone.0086729
  • 58. Troisi R, Napolitano V, Spiridonova V, Krauss IR, Sica F. Several structural motifs cooperate in determining the highly effective anti-thrombin activity of NU172 aptamer. Nucleic Acids Res. 2018;46(22):12177–12185. doi: 10.1093/nar/gky990
  • 59. Marušič M, Veedu RN, Wengel J, Plavec J. G-rich VEGF aptamer with locked and unlocked nucleic acid modifications exhibits a unique G-quadruplex fold. Nucleic Acids Res. 2013;41(20):9524–9536. doi: 10.1093/nar/gkt697
  • 60. Krauss IR, Pica A, Merlino A, Mazzarella L, Sica F. Duplex-quadruplex motifs in a peculiar structural organization cooperatively contribute to thrombin binding of a DNA aptamer. Acta Crystallogr Sect D Biol Crystallogr. 2013;69(Pt12):2403–2411. doi: 10.1107/S0907444913022269
  • 61. Russo Krauss I, Spiridonova V, Pica A, Napolitano V, Sica F. Different duplex/quadruplex junctions determine the properties of anti-thrombin aptamers with mixed folding. Nucleic Acids Res. 2016;44(8):3969. doi: 10.1093/nar/gkw078
  • 62. Miller MT, Tuske S, Das K, DeStefano JJ, Arnold E. Structure of HIV-1 reverse transcriptase bound to a novel 38-mer hairpin template-primer DNA aptamer. Protein Sci. 2016;25(1):46–55. doi: 10.1002/pro.2776
  • 63. Alhamdoosh M, Wang D. Modelling the transcription factor DNA-binding affinity using genome-wide ChIP-based data. In: bioRxiv, Cold Spring Harbor Laboratory; 2016. pp. 061978. doi: 10.12688/f1000research.9005.3
  • 64. Nonaka Y, Sode K, Ikebukuro K. Screening and improvement of an anti-VEGF DNA aptamer. Molecules. 2010;15(1):215–225. doi: 10.3390/molecules15010215
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009247.r001

Decision Letter 0

Florian Markowetz, Eleonora Alfinito

27 Nov 2020

Dear Mrs. Pares-Matos,

Thank you very much for submitting your manuscript "A novel artificial intelligence-based approach for identification of deoxynucleotide aptamers" for consideration at PLOS Computational Biology.

Your manuscript has been reviewed by a team of expert scientists, who while appreciating the general idea, have detected many flaws concerning the quality of data and results, and also the accessibility of your manuscript, too technical for the standards of PLOS Computational Biology.

In conclusion, we strongly suggest that you move the manuscript to a more technical journal; alternatively, you are required to respond point by point to all the referee questions/comments, and to strongly revise the paper, submitting a new version that is more accessible to a broader audience.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Eleonora Alfinito, Ph.D.

Guest Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************

Dear Authors,

your manuscript has been reviewed by a team of expert scientists, who while appreciating the general idea, have detected many flaws concerning the quality of data and results, and also the accessibility of your manuscript, too technical for the standards of PLOS Computational Biology.

In conclusion, I strongly suggest that you move the manuscript to a more technical journal; otherwise, you are required to respond point by point to all the referee questions/comments, and to strongly revise the paper, submitting a new version that is more accessible to a broader audience.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: - abstract: you mention accuracy and auroc parameters, this is too technical and at this point these quantities are undefined and may be obscure to the general audience. Find a better way to present that information.

- line 111: provide a reference for the previous study

- line 139: provide a definition of Accuracy, Specificity, Selectivity and AUROC metric

- line 170: provide a definition of C, gamma

- line 175: provide a definition for the 'confusion matrix'

- line 286: table 1, too many digits in mean values taking into account the error.

- line 329: fig. 2: I cannot read the vertical axis of the plots even zooming in. Better reduce the number of shown cases to most representative ones.

- line 351: fig. 6: what is meaning of grey shaded area ?

Reviewer #2: The authors present a novel artificial intelligence approach for identifying deoxynucleotide aptamers from DNA sequences using a combination of natural language processing and machine learning. To train the model, the DNA sequence data were retrieved from the public databases NDB and Aptagen. An NLP method (i.e., CountVectorizer) was used to extract features from the nucleotide sequences. Four machine learning models, i.e., Logistic Regression, Decision Tree, Gaussian Naïve Bayes, and Support Vector Machines, were trained on the data, from which the authors identified Support Vector Machines as the best performing model. They further correlated some of the predictive features to 3D structures and demonstrated their important functional roles in biology. Overall, I found the paper interesting and easy to read, and it can have a potential impact on facilitating aptamer design. The paper is good, but several points need to be addressed to further improve its impact, its clarity to the audience, and its overall presentation.

1. I think the introduction should be improved. While the section provides excellent background on aptamer biology and SELEX, it does not mention any previous works using computational approaches for aptamer prediction, including existing 2D/3D structural approaches. A brief survey of the field on the computational side may be necessary.

2. In the training data, the DNA sequences (negative samples) were substantially shorter than the aptamer sequences. In the pairwise scatter plot in figure 2, even just the length alone is sufficient to separate aptamer from non-aptamer sequences. Clear decision boundaries can also be drawn for other features. I’m wondering if this alone gives good model performance. I think it would be more convincing if the authors could train/test on non-aptamer DNA sequences with similar sequence/length to the aptamer sequences, or perhaps use randomized sequences, to make sure the model is non-trivial and could be applied to harder/real-world problems.

3. Along the same line, I think it would be more impactful if the authors could also develop a machine learning model to predict positive aptamer sequences given a binding target, not simply just reducing the candidate pool.

4. In figure 4, the profile of DNA vs aptamer sequence shows a very different distribution. However, these features were pre-selected based on the frequency not by the machine learning models. Therefore, it is unclear to me how these features contribute to the model performance and their relevance to the developed models. I would recommend perhaps training a regularized linear model such as elastic net to see if some of these k-mers features could be picked up by the model.

5. How were the top 33 k-mer ranked/identified? The principal component analysis does not predict feature importance. Also, how were the top predictive features in table 4 determined? It is a bit confusing because the input to the model seems to be PCA components according to Figure 1.

6. What did the PCA analysis look like for the data? Although PCA was used for feature reduction, tSNE was shown instead.

7. In table 3, the authors compared their machine learning model performance to several other existing machine learning approaches for aptamer prediction. However, different training/testing datasets were used. While informative, I think it would be more relevant if the same dataset could be used for comparison.

8. What is the scatterplot in figure 6? The x axis shows aptamer, y axis shows DNA but each dot could be either aptamer or DNA.

9. The conclusion is quite brief. The authors could maybe put their approach into a broader context to show how it could be developed for aptamer design. Perhaps also compare/contrast it with existing approaches, such as those outlined in Table 3, to showcase the unique value/limitations of the approach.

10. There is no implementation/code to the model. Could the authors provide perhaps with the github page along with training/testing dataset?

11. Figure quality should be substantially improved and revised for better presentation. The font in figure 2 is quite small and is barely readable.

12. Line 202, “Error! Reference source not found”.

13. Line 253, AUROC are 96.3% and 0.98.

Reviewer #3: The paper proposes an approach to simplify SELEX early steps based on Natural Language Processing (NLP) and Machine Learning (ML). The method helps identify possible aptamers from those non-aptamer sequences. The performance of four ML algorithms, including Logistic Regression, Decision Tree, Gaussian Naïve Bayes, and Support Vector Machines, were analyzed and compared. The idea of an AI-based approach for identifying aptamers is interesting, and five 6-mers having high relevance with aptamers were identified by ML algorithms.

Comments and local corrections:

(1) Data sources. The work used 4,885 protein-binding DNA sequences and 238 aptamer sequences as datasets. These two types of data differ significantly in sequence length. Is it suitable to use them as a dataset, and what is the reason for this choice?

(2) In Table 1, there are two p-values; what is the difference between them? Why do they have such a big gap?

(3) In the given URL, we could not find the codes for the python scripts.

(4) Page 5, in the caption of Figure 1, we think the following data should be examined: the number of vectors 4080? and the number of the training set is 7,816? And the test set is 1,954?

(5) Page 6, line 111-112, the references for the previous study in the sentence “…as set to 6 because a previous study indicated that 6-mers performed better than k-mers of other …” were not cited.

(6) Page 10, line 202, Error! Reference source not found.

(7) The expression of some terms is inconsistent, and this may confuse the readers. For instance, DT or DTC for Decision Tree.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009247.r003

Decision Letter 1

Florian Markowetz, Eleonora Alfinito

2 Mar 2021

Dear Frances L. Heredia, Abiel Roche-Lima, Elsie I. Pares-Matos,

Thank you very much for submitting your manuscript "A novel artificial intelligence-based approach for identification of deoxynucleotide aptamers" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Eleonora Alfinito, Ph.D.

Guest Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have answered my questions in a satisfactory way. The revised draft may be published.

Reviewer #2: Overall, the revised manuscript has been improved from the previous version.

1. It is unclear to me how the recursive feature elimination step would automatically remove length as a feature, since sequence length is a very prominent feature in differentiating the DNA and aptamer classes (Figure 2). Similarly, it is confusing that length is still listed as a feature in Table 1. I would recommend removing the length feature from consideration before the feature elimination step and re-estimating the performance.

2. Similarly, was Figure 2 generated with length as input?

3. What is the number on the axis for figure 6?

4. The font is too small in Figures 5, 6 and 8.

5. In Figure 3, the axis labels should be tsne1 and tsne2.

6. Resolution should be improved for all figures.
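
The concern in comment 1 can be made concrete with a small diagnostic. The sketch below is only an illustration under stated assumptions, not the authors' pipeline and not recursive feature elimination itself: it measures how well a single threshold on sequence length separates two classes, and shows a helper that drops a named feature before any later selection step. The function names `length_only_accuracy` and `drop_feature`, the feature name `"length"`, and all numbers in the toy data are made up for this example.

```python
def length_only_accuracy(lengths, labels):
    """Best accuracy reachable by thresholding sequence length alone.

    labels: 1 for aptamer, 0 for other DNA-binding sequence.
    Both directions are tried: 'short means aptamer' and 'long means aptamer'.
    """
    n = len(labels)
    best = 0.0
    for t in sorted(set(lengths)):
        # Rule: predict aptamer when length <= t.
        hits = sum((length <= t) == bool(label)
                   for length, label in zip(lengths, labels))
        # The reversed rule gets exactly the complementary samples right.
        best = max(best, hits / n, (n - hits) / n)
    return best


def drop_feature(rows, name):
    """Return copies of the feature dicts without the named feature."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# Arbitrary toy values, chosen only so that length separates the classes.
toy_lengths = [30, 40, 35, 200, 180, 220]
toy_labels = [1, 1, 1, 0, 0, 0]
print(length_only_accuracy(toy_lengths, toy_labels))

toy_rows = [{"length": 30, "ACGTAC": 2}, {"length": 200, "ACGTAC": 0}]
print(drop_feature(toy_rows, "length"))
```

If a length-only rule already scores close to the full model on real data, that would support the reviewer's recommendation to exclude length before feature elimination and re-estimate performance.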

Reviewer #4: The manuscript has been properly revised as per the comments given by the reviewers. All the concerns were properly addressed and the necessary information has been added wherever it was required. I’m totally satisfied with the revisions made by the authors. No other or new problems were found in the manuscript.

Reviewer #5: The work is interesting in the sense of providing a new tool for discrimination of DNA aptamers using ML methods. It is well written and easy to follow. The authors have improved their work following the comments of the reviewers. I recommend its publication in its present form.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

Reviewer #5: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #4: Yes: Dr. Abilash

Reviewer #5: No

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009247.r005

Decision Letter 2

Florian Markowetz, Eleonora Alfinito

5 Jul 2021

Dear Mrs. Pares-Matos,

We are pleased to inform you that your manuscript 'A novel artificial intelligence-based approach for identification of deoxynucleotide aptamers' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Eleonora Alfinito, Ph.D.

Guest Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009247.r006

Acceptance letter

Florian Markowetz, Eleonora Alfinito

27 Jul 2021

PCOMPBIOL-D-20-01755R2

A novel artificial intelligence-based approach for identification of deoxynucleotide aptamers

Dear Dr Parés-Matos,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Andrea Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    The data is available as Heredia F. DNA/Aptamer dataset, Mendeley. 2020;1. doi: 10.17632/76jgjbgndr.1, and in GitHub at https://github.com/eipm-uprm/Aptamer-ML.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
