Abstract
Objective
Autism spectrum disorder (ASD) is a complex neurodevelopmental condition influenced by various genetic and environmental factors. Currently, there is no definitive clinical test, such as a blood analysis or brain scan, for early diagnosis. The objective of this study is to develop a computational model that predicts ASD driver genes in the early stages using genomic data, aiming to enhance early diagnosis and intervention.
Methods
This study utilized a benchmark genomic dataset, which was processed using feature extraction techniques to identify relevant genetic patterns. Several ensemble classification methods, including Extreme Gradient Boosting, Random Forest, Light Gradient Boosting Machine, ExtraTrees, and a stacked ensemble of classifiers, were applied to assess the predictive power of the genomic features. The Ensemble Model Predictor for Autism Spectrum Disorder (eNSMBL-PASD) model was rigorously validated using multiple performance metrics such as accuracy, sensitivity, specificity, and Mathew's correlation coefficient.
Results
The proposed model demonstrated superior performance across various validation techniques. The self-consistency test achieved 100% accuracy, while the independent set and cross-validation tests yielded 91% and 87% accuracy, respectively. These results highlight the model's robustness and reliability in predicting ASD-related genes.
Conclusion
The eNSMBL-PASD model provides a promising tool for the early detection of ASD by identifying genetic markers associated with the disorder. In the future, this model has the potential to assist healthcare professionals, particularly doctors and psychologists, in diagnosing and formulating treatment plans for ASD at its earliest stages.
Keywords: Autism spectrum disorder, machine learning, ensemble modeling, autism detection, Random Forest
Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by early-onset difficulties in communication and social interaction, together with persistent behavioral abnormalities.1–3 ASD is diagnosed in approximately 1 in 58 children in the USA. Both inherited and de novo4–6 genetic variations can contribute to its development. As a result, ASD is an active field of research, with numerous studies ongoing to understand its causes. Environmental, social, and genetic factors may all play a role in the neurological differences associated with ASD.7
The process of screening and diagnosing ASD is complex and often imprecise, and there is currently no definitive medical test available to diagnose ASD at an early stage. Nevertheless, multiple studies are presently being conducted to explore the viability of employing magnetic resonance imaging (MRI) of the brain, eye tracking, and speech analysis as possible means of identification.8–11 Electroencephalography has also been explored for the detection of ASD by analyzing brainwave patterns.12,13 The signs of ASD typically emerge between 12 and 18 months of age, though parents often may not recognize these signs initially. Most children do not receive a definitive diagnosis in early childhood, consequently missing out on essential support and interventions during a crucial developmental period.14,15 Global research indicates that ASD is four times more prevalent in males than females. This paper proposes a computational diagnostic model for the early identification of biomarkers associated with neurotransmission delays. These delays contribute to the cognitive development differences and neuropathological changes seen in ASD, particularly during early childhood.16,17 Several genetic variants and functional mutations can increase the risk of neurodevelopmental disorders. These include rare conditions such as Giedion syndrome, Schinzel-Giedion syndrome,18 and Opitz syndrome,19 as well as more common disorders such as ASD, Fragile X syndrome, Down’s syndrome,20 and epilepsy.21–23 Additionally, mutations can contribute to psychiatric conditions like schizophrenia and bipolar disorder.
De novo mutations may have evolved mechanisms that increase the likelihood of their variants contributing to genetic conditions. Inherited variants tend to have more severe effects on average than de novo mutations, making them a significant susceptibility biomarker for ASD.24,25 Research suggests that a combination of inherited and de novo mutations may play a significant role in neural developmental deficits.26,27 These mutations can introduce insertions or deletions (indels) within genomic sequences, potentially leading to molecular changes in genetic characteristics associated with the development of intellectual disabilities.28 Single nucleotide polymorphisms (SNPs) are common genetic variations abundant in nature and are widely used as biomarkers for cognitive disorders.29 An SNP refers to a single-nucleotide change in a DNA sequence occurring in at least 1% of the population. SNPs can exist in coding (exonic) and non-coding (intronic) regions, and SNPs within exons may be promising biomarkers for the genetic diagnosis of neurodevelopmental disorders. Autism-associated genes often involve SNPs, de novo mutations (arising spontaneously in an individual), or copy number variations (CNVs; duplications or deletions of genetic material).30–33 Such genes, contributing to the development of autism, are termed autism driver genes. Conversely, passenger mutations, which do not directly influence the development of autism, may also be present.
In previous studies, various machine learning algorithms such as support vector machine (SVM), decision tree, and Gaussian naive Bayes (NB) have been used for the identification of ASD.34–36 Most of these studies utilized supervised learning methods, in which the algorithm learns from labeled data. Heinsfeld's model37 achieved better classification accuracy (70%) than SVM and Random Forest (RF) algorithms, while the model in reference 38 achieved an accuracy of 90.39%. However, it is important to note that these studies used relatively small sample sizes and are based on imaging data, and further validation is needed to assess the generalizability of these findings. Previous research, such as CORR39, utilized neural network (NN) models trained on gene expression microarray data for ASD classification. Although it presented an optimized framework, it achieved a maximum accuracy of only 78.6%, limited by the dimensionality reduction techniques employed to handle the vast number of gene expression features. Similarly, Resnik40 applied machine learning classifiers on semantic similarity matrices of ASD and non-ASD genes, achieving an area under the curve (AUC) of 80%, but its classifier was constrained by the size and quality of available data. Additionally, Brainspan41 proposed an autoencoder-based approach for ASD risk gene prediction, focusing on long non-coding RNAs (lncRNAs), but reported accuracy levels of only 81%–82% depending on the model used (logistic regression (LR), SVM, or RF). Despite these efforts, the existing models often suffer from limited sample sizes, insufficient classifier diversity, and the inability to fully exploit the genetic sequence data available for ASD.
This study aims to identify autism-susceptibility genes from genetic sequences. Researchers are developing personalized ASD treatments focused on targeting the specific genes contributing to autistic behavior. These efforts involve modulating the expression or function of those genes. Consequently, there is a need for tools to efficiently and accurately pinpoint ASD driver genes. The ability to identify these genes would facilitate disease intervention and provide a valuable foundation for advancements in ASD therapies. To address this, the paper proposes the Ensemble Model Predictor for Autism Spectrum Disorder (eNSMBL-PASD). It is a computational model designed to predict ASD driver genes using genomic sequences. By leveraging ensemble methods, including Extreme Gradient Boosting (XGB), RF, Light Gradient Boosting Machine (LGBM), and Stacking of classifiers, the model aims to enhance early diagnosis of ASD through accurate gene identification. The novelty of our research lies in the development of eNSMBL-PASD, an ensemble-based computational model that integrates advanced feature extraction techniques and diverse ensemble classifiers to predict ASD driver genes. The following steps can be considered when planning a systematic and valuable sequence-based methodology: (1) determining or developing a substantial benchmark dataset that can serve as a standard for training and testing the prediction model; (2) using a viable numerical representation of the biological sequence samples that reflects their correspondence with the targets of interest; (3) developing a computational algorithm that predicts effectively; (4) performing validation tests that objectively evaluate the expected accuracy; and (5) building a framework upon a robust model that is publicly available. This work proposes a computational prediction model specifically trained on datasets containing both autism driver genes (successor genes) and non-driver genes (passenger genes). The computational model will be further tested and evaluated through different tests for its authentication and validation.
Materials and methods
This section (refer to Figure 1) provides a detailed description of the computational processes employed in this study. The fundamental steps encompass the selection of a benchmark dataset, sourced from ncbi.org. The autism-related genes are verified against the literature and downloaded. In data processing, the Cluster Database at High Identity with Tolerance (CD-HIT) is applied to obtain the final list of FASTA sequences. Subsequently, feature vectors are extracted to delineate the principal characteristics of the dataset by utilizing techniques such as the Position Relative Incidence Matrix (PRIM), Reverse Position Relative Incidence Matrix (RPRIM), Accumulative Absolute Position Incidence Vector (AAPIV), and Reverse Accumulative Absolute Position Incidence Vector (RAAPIV). The development phase entails training a robust prediction model using diverse classifiers, including RF, XGB, LGBM, Extra Tree, and Stacking. Finally, the validation process involves testing the model using various methodologies, including independent testing, self-consistency, and cross-validation using both five-fold and 10-fold approaches.
The experiments were conducted using the Python programming language, employing widely used libraries such as scikit-learn (which provides the Random Forest, Extra Trees, and Stacking implementations), XGBoost, LightGBM, and TensorFlow for model development, training, and evaluation. All experiments were executed using two platforms: Google Colab and Jupyter Notebook. The hardware for local execution included an Intel Core i7 processor and 16GB RAM, which handled the feature extraction and training of models.
Figure 1.
Steps towards the construction of a robust prediction model.
Benchmark dataset collection
A dataset typically serves as a benchmark in scientific research, consisting of carefully curated samples obtained through experimental procedures. The careful selection of these samples is crucial, requiring clarity and specificity to ensure they are suitable for testing and training purposes. The experimental findings are rigorously validated through various tests, including cross-validation and independent set tests. This methodological approach provides strong evidence supporting the results.
The primary aim is to develop a robust computational model. To achieve this, a dataset comprising autism-related genes was meticulously collected from the latest publicly available version of the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/).42 This dataset serves as the foundation for the subsequent computational processes. In Figure 2, the detailed flowchart of the entire process is shown.
Figure 2.
Flow chart of the process.
The fundamental steps began with selecting and mapping a benchmark dataset sourced from NCBI. Each gene was carefully reviewed against the available literature to ensure its relevance to ASD. The inclusion criteria focused on selecting autism driver genes that were either experimentally validated or strongly associated with ASD, while passenger mutations not linked to the disorder were excluded. Only genes with complete and sufficient sequence data were included to ensure accuracy and robustness in the analysis. This processed dataset then served as the foundation for the subsequent feature vector formulation, where relevant genetic patterns were extracted and used to train predictive models. Hyperparameter tuning was then applied, followed by the validation of the model.
Data processing
The benchmark dataset lists mutation cases in human genes. Most of these are passenger mutations, which are not responsible for causing autism, whereas 3144 autism driver genes are associated with causing the disorder. The gathered data are subsequently used to formulate a dataset that can serve as a benchmark for the problem described above. Within the current study, G is used to denote the benchmark dataset and is represented as
$$G = G^{+} \cup G^{-} \qquad (1)$$
Careful analysis was performed to formulate the database. The benchmark dataset consists of 3144 (positive) autism driver gene sequences (G+), along with 25,462 negative samples selected as negative gene sequences (G−). The G− samples were picked from a vast passenger gene collection, as shown in Figure 2. CD-HIT is a software suite used for clustering large collections of protein or nucleotide sequences. Our methodology involves refining and processing the data, followed by CD-HIT analysis (threshold 0.90) to identify optimal genomic sequences for feature extraction. This process results in a dataset of 9069 sequences, comprising 1413 autism driver genes and 7656 passenger genes.
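To make the data handling concrete, the following is a minimal sketch of how the CD-HIT-filtered driver and passenger sequences could be loaded into a labelled table. The file names and the use of Biopython and pandas are assumptions for illustration, not the authors' exact pipeline.

```python
from Bio import SeqIO   # Biopython
import pandas as pd

def load_fasta(path: str, label: int) -> pd.DataFrame:
    """Read every record of a FASTA file and attach a class label."""
    rows = [(rec.id, str(rec.seq).upper(), label) for rec in SeqIO.parse(path, "fasta")]
    return pd.DataFrame(rows, columns=["gene_id", "sequence", "label"])

# Hypothetical file names; label 1 = driver gene (G+), label 0 = passenger gene (G-).
drivers = load_fasta("driver_genes.fasta", 1)        # ~1413 sequences after CD-HIT
passengers = load_fasta("passenger_genes.fasta", 0)  # ~7656 sequences after CD-HIT
dataset = pd.concat([drivers, passengers], ignore_index=True)
print(dataset["label"].value_counts())
```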
When formulating the sample, we can articulate the sequence of DNA as
$$S = N_{1}\, N_{2}\, N_{3}\, \ldots\, N_{L} \qquad (2)$$

where the nucleotide at an arbitrary position p is indicated by

$$N_{p} \in \{A, C, G, T\}, \quad 1 \le p \le L \qquad (3)$$

and ∈ denotes set membership, signifying "member of."
Scientific advancements have opened doors to explore the remarkable potential of biotechnology. However, a pressing challenge lies in developing computational and predictive models that efficiently transform sequence data into well-defined, measurable representations. This process critically requires preserving the inherent groupings within the data, ensuring these groupings remain consistent with the original information they represent. These representations provide features that play an instrumental part in intelligent target analysis. Machine learning algorithms such as the support vector machine (SVM), RF, and XGB are well suited to evaluating the vector representations obtained from proteomic or genomic sequences, because they are designed to accept vector input directly. In a discrete model, the entire sequence-related data must be transformed into a vector of fixed size, while ensuring that no crucial information is lost.
This information determines the features of the given sequence. Pseudo Amino Acid Composition (PseAAC)43 was proposed to address the limitations of traditional protein data representation methods. Chou's PseAAC44 has been deployed in almost every computational proteomics setting45 and was incorporated into a computer application named "PseAAC-General"46 because of its importance and ubiquity in computational proteomics. The efficiency of PseAAC and its success in analyzing protein/peptide sequences indicate that it could yield similarly useful results if applied to DNA/RNA sequence analysis within computational genomics.47 Transformation of genomic data into a general and well-constructed numerical encoding is possible, with the general form R given in equation (4) as
$$R = \left[\, r_{1}\ \ r_{2}\ \ r_{3}\ \ \cdots\ \ r_{\Omega} \,\right]^{T} \qquad (4)$$
where ru (u = 1, 2, …, Ω) represents an arbitrary numerical coefficient. The useful data extracted from a gene sequence takes the form of the components of equation (4). The methodology used to extract these features is discussed below.
Feature vector formulation
Statistical moments
An empirical method is used for characterizing the measurements and components of equation (4). To obtain fixed-size descriptors from genomic data, statistical moments are applied. Each moment describes a unique piece of information that helps characterize the nature of the data, and the study of moments across various distributions remains an active area of research for mathematicians and analysts. The Hahn, central, and raw moments of the genomic data are included in the feature set and constitute a principal component of the predictor's feature vector. Moments that incorporate the scale and location of variance help discriminate among sequences that behave differently, while asymmetry-related moments and those characterizing the data source prove valuable in constructing classifiers for labeled datasets. Research demonstrates that the properties of proteomic and genomic sequences are influenced by the relative arrangement and composition of their constituent bases. Consequently, the most suitable computational and mathematical models for feature vector generation are those that are sensitive to the specific arrangement of nucleotide bases within genomic sequences; this positional sensitivity is essential for developing effective feature sets.48,49 Because Hahn moments require two-dimensional data, a two-dimensional notation S' is obtained from the genomic sequences. This S' of size k × k stores the same information as S, but in two-dimensional form, that is
$$k = \left\lceil \sqrt{L}\, \right\rceil$$

and

$$S' = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1k} \\ s_{21} & s_{22} & \cdots & s_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{k1} & s_{k2} & \cdots & s_{kk} \end{bmatrix} \qquad (5)$$
From the square matrix, statistical moments are computed for formulating fixed-size feature vectors50,51 and dimensionality reduction. As mentioned previously, Hahn, raw, and central moments are the three moments used in this work.
For computing the raw moments of order a + b, the operations performed are described in equation (6).
$$M_{ab} = \sum_{i=1}^{k} \sum_{j=1}^{k} i^{a}\, j^{b}\, s_{ij} \qquad (6)$$
Significant information is captured by the moments up to third-order polynomials, namely M00, M01, M02, M03, M10, M11, M12, M20, M21, and M30. Moreover, calculating the centroid (x̄, ȳ) is a prerequisite for computing the central moments; the centroid can be envisioned as the center of the data. The central moments are then calculated with respect to the centroid as follows:
$$\mu_{ab} = \sum_{i=1}^{k} \sum_{j=1}^{k} (i - \bar{x})^{a}\, (j - \bar{y})^{b}\, s_{ij}, \qquad \bar{x} = \frac{M_{10}}{M_{00}},\ \ \bar{y} = \frac{M_{01}}{M_{00}} \qquad (7)$$
For computing Hahn moments, the square matrix acts as discrete input. The advantage of Hahn moments is that they provide information about the symmetry of the data and are reversible, meaning that the original data can later be reconstructed from them. This reversibility affirms that the information of the actual sequences remains intact and is carried through to the diagnostic model via the corresponding feature vector.
Typically, the obtained Hahn coefficients are normalized as in equation (8).52,53
$$\hat{H}_{ab} = \sum_{j=0}^{k-1} \sum_{i=0}^{k-1} s_{ij}\, \tilde{h}_{a}^{(u,v)}(j, k)\, \tilde{h}_{b}^{(u,v)}(i, k), \qquad a, b = 0, 1, \ldots, k-1 \qquad (8)$$

where $\tilde{h}_{a}^{(u,v)}(\cdot, k)$ denotes the normalized Hahn polynomial of order a.
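As an illustration of the moment features described above, the sketch below maps a sequence onto a k × k grid as in equation (5) and computes raw and central moments up to order three following equations (6) and (7). The integer encoding of the bases and the zero padding are assumptions for illustration, and the Hahn moments of equation (8) are omitted for brevity.

```python
import numpy as np

# Integer encoding of the bases is an assumption for illustration only.
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def to_square(seq: str) -> np.ndarray:
    """Map a sequence onto a k x k grid (zero-padded), as in equation (5)."""
    vals = np.array([CODE.get(b, 0) for b in seq.upper()], dtype=float)
    k = int(np.ceil(np.sqrt(len(vals))))
    padded = np.zeros(k * k)
    padded[: len(vals)] = vals
    return padded.reshape(k, k)

def raw_and_central_moments(S: np.ndarray, max_order: int = 3):
    """Raw moments M_ab (equation (6)) and central moments mu_ab (equation (7))."""
    k = S.shape[0]
    i, j = np.meshgrid(np.arange(1, k + 1), np.arange(1, k + 1), indexing="ij")
    orders = [(a, b) for a in range(max_order + 1)
              for b in range(max_order + 1) if a + b <= max_order]
    raw = {(a, b): float((i**a * j**b * S).sum()) for a, b in orders}
    xbar, ybar = raw[(1, 0)] / raw[(0, 0)], raw[(0, 1)] / raw[(0, 0)]  # centroid
    central = {(a, b): float(((i - xbar)**a * (j - ybar)**b * S).sum()) for a, b in orders}
    return raw, central

raw, central = raw_and_central_moments(to_square("ATGCGTACGTTAGC"))
```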
Formation of PRIM
These specific computational predictors facilitate gene attribute prediction, supporting gene classification and uncovering essential characteristics. To examine the relative positioning of nucleotide bases, the PRIM54,55 is used. Within a sequence, NPRIM captures the arrangement of a single nucleotide (Np) at position “p” relative to other nucleotides (equation (9)).
$$N_{PRIM} = \begin{bmatrix} N_{1 \to 1} & N_{1 \to 2} & \cdots & N_{1 \to 4} \\ N_{2 \to 1} & N_{2 \to 2} & \cdots & N_{2 \to 4} \\ \vdots & \vdots & \ddots & \vdots \\ N_{4 \to 1} & N_{4 \to 2} & \cdots & N_{4 \to 4} \end{bmatrix} \qquad (9)$$
Within a sequence, the coefficient Ni→j represents how frequently an arbitrary nucleotide base (Ni) is positioned relative to any other random base (Nj). The presence of specific nucleotide base pairs (e.g., GG, TT, AG, AT, AA, CT) directly impacts the effectiveness of the feature extraction phase. A dinucleotide matrix, DPRIM, of size 16 × 16 was developed to analyze the relative frequencies of different base pairings, generating 256 coefficients. Similarly, another matrix, TPRIM (equation (10)), was formed to analyze tri-nucleotide base combinations (i.e., AAA, AAT, AAG, …, CGG, CGT, CGC), producing a total of 4096 coefficients. Central, Hahn, and raw moments were subsequently calculated for NPRIM, DPRIM, and TPRIM, generating coefficients up to the third order.
$$T_{PRIM} = \begin{bmatrix} N_{1 \to 1} & N_{1 \to 2} & \cdots & N_{1 \to 64} \\ N_{2 \to 1} & N_{2 \to 2} & \cdots & N_{2 \to 64} \\ \vdots & \vdots & \ddots & \vdots \\ N_{64 \to 1} & N_{64 \to 2} & \cdots & N_{64 \to 64} \end{bmatrix} \qquad (10)$$
Formation of RPRIM
Identifying the underlying patterns embedded within gene sequences is the ultimate goal of feature extraction. The gene sequences therefore need to be analyzed from different perspectives to draw out the information that corresponds to their behavior. Experimental findings indicate that analyzing a gene or protein's reversed sequence can yield valuable insights. Consequently, the RPRIM is constructed to leverage this principle. The resulting matrix is derived by first reversing the original sequence and then processing the reversed sequence, for which the PRIM is computed as described in references.48,49,56,57 RPRIM was calculated using a methodology similar to that of the PRIM matrices, incorporating mononucleotide, dinucleotide, and trinucleotide combinations. In RPRIM, an arbitrary element Rr→j holds information about the relative positioning of the rth base with respect to the jth nucleotide base. The representation of the RPRIM matrix is given by equation (11):
$$R_{PRIM} = \begin{bmatrix} R_{1 \to 1} & R_{1 \to 2} & \cdots & R_{1 \to n} \\ R_{2 \to 1} & R_{2 \to 2} & \cdots & R_{2 \to n} \\ \vdots & \vdots & \ddots & \vdots \\ R_{n \to 1} & R_{n \to 2} & \cdots & R_{n \to n} \end{bmatrix} \qquad (11)$$

where n is 4, 16, or 64 for the mononucleotide, dinucleotide, and trinucleotide variants, respectively.
The resulting matrix is again used for computing Hahn, raw, and central moments for uniformity.
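A sketch of a PRIM/RPRIM-style computation is given below for dinucleotides. Because the exact accumulation rule is not reproduced in this section, the rule used in the sketch (summing the positional offsets from every occurrence of one k-mer to every later occurrence of another) is an illustrative assumption; RPRIM simply repeats the computation on the reversed sequence.

```python
from itertools import product
import numpy as np

BASES = "ACGT"

def kmer_positions(seq: str, k: int):
    """Record the start positions of every k-mer over the ACGT alphabet."""
    pos = {"".join(p): [] for p in product(BASES, repeat=k)}
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in pos:
            pos[kmer].append(start)
    return pos

def prim(seq: str, k: int = 2) -> np.ndarray:
    """Assumed PRIM-style matrix: entry (i, j) sums the offsets from every
    occurrence of k-mer i to every later occurrence of k-mer j."""
    pos = kmer_positions(seq, k)
    kmers = sorted(pos)
    M = np.zeros((len(kmers), len(kmers)))
    for a, ki in enumerate(kmers):
        for b, kj in enumerate(kmers):
            M[a, b] = sum(q - p for p in pos[ki] for q in pos[kj] if q > p)
    return M

seq = "ATGCGTACGTTAGCATCG"
dprim = prim(seq, k=2)        # 16 x 16 dinucleotide matrix
rprim = prim(seq[::-1], k=2)  # RPRIM: the same computation on the reversed sequence
```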
Frequency vector (FV) determination
The composition and sequential arrangement of nucleotides within a chain influence the development of a feature set. As discussed, both the PRIM and RPRIM matrices help in extracting the sequence-related correlations of the nucleotide bases. The gene's composition-related information is summarized in the form of an FV, in which every element gives the number of times the corresponding nucleotide occurs within the gene sequence. The vector is represented as shown in equation (12):
$$FV = \left[\, \varepsilon_{1},\ \varepsilon_{2},\ \varepsilon_{3},\ \ldots,\ \varepsilon_{n} \,\right] \qquad (12)$$
The variable εi indicates the frequency of the ith nucleotide base within the sequence.
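As a concrete illustration, a minimal sketch of the FV computation follows, counting k-mer occurrences; the di- and trinucleotide extensions are shown only for illustration and mirror the granularity used by the other descriptors.

```python
from itertools import product
from collections import Counter

BASES = "ACGT"

def frequency_vector(seq: str, k: int = 1):
    """Counts of every k-mer over the ACGT alphabet, in lexicographic order."""
    counts = Counter(seq[p:p + k] for p in range(len(seq) - k + 1))
    return [counts.get("".join(km), 0) for km in product(BASES, repeat=k)]

seq = "ATGCGTACGTTAGC"
fv4 = frequency_vector(seq, 1)    # 4 mononucleotide counts
fv16 = frequency_vector(seq, 2)   # 16 dinucleotide counts (illustrative extension)
fv64 = frequency_vector(seq, 3)   # 64 trinucleotide counts (illustrative extension)
```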
Formation of AAPIV
This feature extraction step captures the compositional aspects of a gene. The FV indicates the occurrence count of each nucleotide base. Similarly, for each specific nucleotide base, the cumulative information related to its occurrence positions is given by the AAPIV. Here, three distinct AAPIVs, representing varying levels of granularity, were generated and named KAAPIV4 (equation (13)), KAAPIV16 (equation (14)), and KAAPIV64 (equation (15)). Each vector signifies a different resolution: KAAPIV4 holds data on the four mononucleotides, KAAPIV16 on the 16 dinucleotides, and KAAPIV64 on the 64 trinucleotides, represented as follows:
$$K_{AAPIV4} = \left[\, \mu_{1},\ \mu_{2},\ \mu_{3},\ \mu_{4} \,\right] \qquad (13)$$

$$K_{AAPIV16} = \left[\, \mu_{1},\ \mu_{2},\ \ldots,\ \mu_{16} \,\right] \qquad (14)$$

$$K_{AAPIV64} = \left[\, \mu_{1},\ \mu_{2},\ \ldots,\ \mu_{64} \,\right] \qquad (15)$$
For the ith component, every position at which the corresponding nucleotide (or k-mer) occurs is considered: the arbitrary element μi is the sum of all occurrence positions of the ith nucleotide.
RAAPIV generation
The reverse sequence provides a deeper perspective on the hidden patterns within the gene sequence. The RAAPIV analysis examined three types of nucleotide combinations: single nucleotides, dinucleotides, and trinucleotides. Each combination has a unique vector length: 4 for single nucleotides, 16 for dinucleotides, and 64 for trinucleotides. For the gene's reverse sequence, the computation of AAPIV is termed RAAPIV and is given as
$$RAAPIV_{4} = \left[\, \eta_{1},\ \eta_{2},\ \eta_{3},\ \eta_{4} \,\right] \qquad (16)$$

$$RAAPIV_{16} = \left[\, \eta_{1},\ \eta_{2},\ \ldots,\ \eta_{16} \,\right] \qquad (17)$$

$$RAAPIV_{64} = \left[\, \eta_{1},\ \eta_{2},\ \ldots,\ \eta_{64} \,\right] \qquad (18)$$
The arbitrary element ηi within the RAAPIV contains the sum of the occurrence positions of the ith nucleotide in the reverse sequence.
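The sketch below computes AAPIV and RAAPIV as described: for each k-mer, the positions of its occurrences are accumulated, and RAAPIV applies the same accumulation to the reversed sequence. The use of 1-based positions is an assumption.

```python
from itertools import product

BASES = "ACGT"

def aapiv(seq: str, k: int = 1):
    """For every k-mer, accumulate the (1-based) positions at which it occurs."""
    totals = {"".join(km): 0 for km in product(BASES, repeat=k)}
    for p in range(len(seq) - k + 1):
        kmer = seq[p:p + k]
        if kmer in totals:
            totals[kmer] += p + 1  # 1-based position (an assumption)
    return [totals[km] for km in sorted(totals)]

def raapiv(seq: str, k: int = 1):
    """RAAPIV: the same accumulation applied to the reversed sequence."""
    return aapiv(seq[::-1], k)

seq = "ATGCGTACGTTAGC"
aapiv4, raapiv4 = aapiv(seq, 1), raapiv(seq, 1)      # length-4 vectors
aapiv16, raapiv16 = aapiv(seq, 2), raapiv(seq, 2)    # length-16 vectors
aapiv64, raapiv64 = aapiv(seq, 3), raapiv(seq, 3)    # length-64 vectors
```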
Final feature vector formulation
A fixed-size representation is obtained from each primary sequence of equation (4). The large matrices PRIM, RPRIM, and S' are condensed by computing their Hahn, raw, and central moments, as in several related studies.58–62 These moments are then merged with RAAPIV, AAPIV, and FV into a single feature vector whose coefficients correspond to a sequence of arbitrary length. A comprehensive set of feature vectors is computed for all samples. The final feature vector contains 522 columns, determined through an iterative process of feature selection and optimization.
Training classifiers
An optimal feature selection approach utilizes iterative probing techniques to identify the most relevant features for model training. We began with a core set of features and incrementally expanded the feature space by systematically adding additional features based on their relevance and contribution to model performance.53 This refinement process also guided the selection of the top-performing classifiers (RF, XGB, LGBM, and Extra Tree) for further training and evaluation.
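A hedged sketch of this incremental (forward) feature-group selection is shown below; the group names, the baseline classifier, and the cross-validated accuracy criterion are placeholders rather than the exact optimization protocol used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_group_selection(groups: dict, y: np.ndarray, core: str):
    """Start from a core descriptor group and keep each additional group only if
    it improves mean cross-validated accuracy."""
    selected, X = [core], groups[core]
    best = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
    for name, block in groups.items():
        if name == core:
            continue
        candidate = np.hstack([X, block])
        score = cross_val_score(RandomForestClassifier(random_state=0),
                                candidate, y, cv=5).mean()
        if score > best:
            selected.append(name)
            X, best = candidate, score
    return selected, X, best

# Example call (feature blocks are placeholders):
# groups = {"moments": X_mom, "FV": X_fv, "AAPIV": X_aapiv, "RAAPIV": X_raapiv,
#           "PRIM": X_prim, "RPRIM": X_rprim}
# selected, X_final, acc = forward_group_selection(groups, y, core="moments")
```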
RF
Random decision forests, also known as RFs, are an ensemble learning method applied to classification, regression, and various other tasks. During training, an RF algorithm constructs multiple decision trees, and the final output is aggregated from the outputs of these individual trees.63,64 The classifier is illustrated in Figure 3.
Figure 3.
Random Forest Classifier.
XGB
XGB is a supervised machine learning framework built on decision trees and boosting. The model extracts patterns and features from the dataset, which are then used to train a predictor. Each decision tree consists of a combination of if/else (true/false) conditions and aims to ask the smallest number of questions needed to evaluate the probability of a correct decision. For classification and regression, XGB uses a decision-tree ensemble as RF does, but the trees are built and combined differently: the ensemble is trained iteratively, and the final prediction is taken as the weighted sum of all individual tree predictions.
Extra Tree
The Extra Tree Classifier is a machine learning algorithm designed for classification tasks. It is an extension of the RF algorithm and operates by constructing multiple decision trees during training. However, unlike RF, the Extra Tree Classifier introduces additional randomness by selecting random split points for features, making it less prone to overfitting. During the prediction phase, each decision tree in the ensemble independently classifies the input data, and the final classification is determined by aggregating the results through majority voting. This ensemble approach improves the model's accuracy and generalization capability, even with high-dimensional datasets. The algorithm is computationally efficient because of its random feature selection and split points, making it suitable for large datasets. It can handle numerical and categorical features and does not require feature scaling. Moreover, the Extra Tree Classifier can handle missing data, making it robust for real-world datasets.
LGBM
The LGBM Classifier is a powerful machine learning algorithm based on gradient boosting. It is known for its speed, efficiency, and ability to handle large datasets. Unlike many other tree-based algorithms, it builds trees leaf-wise and utilizes histogram-based techniques to improve training efficiency. It can handle numerical and categorical features, supports early stopping to prevent overfitting, and provides feature importance for better interpretability. With customizable evaluation metrics and class imbalance handling, LGBM is versatile for various classification tasks. Its memory efficiency and multi-threading support make it a popular choice for real-world applications in domains such as finance, healthcare, and marketing.
Stacking
Stacking is a machine learning technique that combines multiple base models into a single model by training a meta-model on their predictions.65–67 The core concept of Stacking involves using the predictions from multiple base models as input for a meta-model. This meta-model learns to combine the base model predictions in a way that optimizes overall system performance. The Stacking process typically involves several steps. First, the training data is split into several subsets. Then, each base model is trained on a different subset of the training data, and its predictions are computed on the remaining subset of the data. These predicted values are then combined into a new dataset and used as input features for the meta-model. Finally, the meta-model is trained on this new dataset and used to generate predictions on new data. This study employs a stacked ensemble consisting of XGBClassifier, ExtraTreesClassifier, LabelPropagation, BaggingClassifier, and LGBMClassifier. One of the advantages of Stacking is that it can capture more complex relationships in the data by combining the strengths of several models. Stacking can also reduce the risk of overfitting by using a separate dataset to train the meta-model.
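A minimal sketch of such a stacked ensemble, using the base learners named above with scikit-learn's StackingClassifier, is given below; the logistic-regression meta-learner and the internal five-fold scheme are assumptions, not the study's documented configuration.

```python
from sklearn.ensemble import StackingClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

seed = 42
base_learners = [
    ("xgb", XGBClassifier(random_state=seed)),
    ("et", ExtraTreesClassifier(random_state=seed)),
    ("lp", LabelPropagation()),
    ("bag", BaggingClassifier(random_state=seed)),
    ("lgbm", LGBMClassifier(objective="binary", random_state=seed)),
]
# The logistic-regression meta-learner and cv=5 are assumptions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
# stack.fit(X_train, y_train); y_pred = stack.predict(X_test)
```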
Table 1 outlines the hyperparameters used for optimizing each machine learning classifier, including RF, ExtraTrees (ET), XGB, LGBM, and the Stacking approach.
Table 1.
Hyperparameter of machine learning classifiers.
| Classifier | Hyperparameters |
|---|---|
| Random Forest | N estimators = 50; Max depth = 25; Oob_score = True; N jobs = −1; Warm start = True |
| ExtraTreesClassifier | N estimators = 100; Max Depth = None; Min Samples Split = 2 |
| XGBoost | N estimators = 400; Max Depth = 9; Learning Rate = 0.1 |
| Light Gradient | LGBMClassifier() |
| Stacking | XGBClassifier(); ExtraTreesClassifier(); LabelPropagation(); BaggingClassifier(); LGBMClassifier(objective='binary', random_state = seed) |
Note. XGBoost: Extreme Gradient Boosting.
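For reference, the Table 1 settings translate to the following instantiations in scikit-learn, XGBoost, and LightGBM parameter spellings; parameters not listed in the table are left at their library defaults.

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

rf = RandomForestClassifier(n_estimators=50, max_depth=25,
                            oob_score=True, n_jobs=-1, warm_start=True)
et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=2)
xgb = XGBClassifier(n_estimators=400, max_depth=9, learning_rate=0.1)
lgbm = LGBMClassifier()  # library defaults, as in Table 1
# The stacked ensemble of Table 1 is sketched in the Stacking subsection above.
```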
Validation of models
The assessment of prediction algorithm performance is critical for validating its efficacy. Researchers have developed a wide range of quantitative metrics to facilitate this assessment. These metrics are based on well-validated tests and experiments and are used for comparative analysis among the different models.
Accuracy metrics
Mostly, four interconnected metrics are used to evaluate the performance of a computational predictor. Accuracy (Acc) measures overall prediction accuracy. Sensitivity (Sn) reflects the accuracy in predicting positive samples, while specificity (Sp) quantifies the accuracy in predicting negative samples.68 Finally, Mathew's correlation coefficient (MCC) measures prediction quality when the numbers of positive and negative samples are imbalanced.
Chen, Feng, and Lin69 defined and formulated these metrics in the following, more readable and intuitive form:
$$Sn = 1 - \frac{N_{-}^{+}}{N^{+}} \qquad (19)$$

$$Sp = 1 - \frac{N_{+}^{-}}{N^{-}} \qquad (20)$$

$$Acc = 1 - \frac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}} \qquad (21)$$

$$MCC = \frac{1 - \left( \dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}} \right)\left( 1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}} \right)}} \qquad (22)$$
where $N^{+}$ represents the actual number of autism-suspected genes, $N_{-}^{+}$ represents the number of autism-suspected genes incorrectly predicted as passenger genes, $N^{-}$ indicates the number of passenger genes, and $N_{+}^{-}$ represents the passenger genes incorrectly predicted as autism-suspected genes. Keeping the above equations in mind, sensitivity is maximal when no positive sample is wrongly predicted as a passenger gene, i.e., $N_{-}^{+} = 0$. The same holds for specificity, which is maximal when $N_{+}^{-} = 0$. The overall performance of a predictor is analyzed through the Acc and MCC metrics. When MCC and Acc are equal to 1, the prediction model demonstrates perfect accuracy, indicating that all samples are correctly predicted, i.e., $N_{-}^{+} = N_{+}^{-} = 0$. A binary predictor distinguishes two classes, so random guessing yields 50% accuracy; the benchmark for a predictor to be effective and acceptable is therefore above 50%. For such a random predictor the MCC value is 0 and Acc is 0.5, corresponding to $N_{-}^{+} = N^{+}/2$ and $N_{+}^{-} = N^{-}/2$, which reflects that only half of the positive samples and half of the negative samples are predicted accurately. The validity of the study and experiment is examined and evaluated by these metrics, which could be further extended to multiclass predictors as well. The predictor's performance is evaluated using well-established tests, and the accuracy metrics are calculated through well-defined tests to quantify a model's significance and establish its reliability for predicting unseen data.
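A small helper that evaluates equations (19) to (22) directly from the four counts is sketched below; the argument names are illustrative, and the example counts at the end are invented purely to demonstrate the call.

```python
import math

def metrics(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Equations (19)-(22): n_pos = N+, n_neg = N-, pos_as_neg = driver genes
    predicted as passengers, neg_as_pos = passenger genes predicted as drivers."""
    sn = 1 - pos_as_neg / n_pos
    sp = 1 - neg_as_pos / n_neg
    acc = 1 - (pos_as_neg + neg_as_pos) / (n_pos + n_neg)
    mcc = (1 - (pos_as_neg / n_pos + neg_as_pos / n_neg)) / math.sqrt(
        (1 + (neg_as_pos - pos_as_neg) / n_pos) * (1 + (pos_as_neg - neg_as_pos) / n_neg))
    return acc, sn, sp, mcc

# Illustrative counts only (not the study's confusion matrix):
print(metrics(n_pos=1413, n_neg=7656, pos_as_neg=120, neg_as_pos=60))
```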
Self-consistency validation
The most basic evaluation of a predictor's accuracy is often performed through self-consistency validation. A dataset of feature vectors, comprising both positive and negative samples, is initially generated and subsequently used for further training. Validation is the next step after substantial training of the model, and self-consistency would be the first test for assessing its validity. This test indicates that the predictor is being tested using the same data on which it was trained.
Table 2 presents a comparative analysis of several classifiers, including RF, LGBM, XGB, ET, and Stacking methods, evaluated on a standard benchmark dataset.
Table 2.
Self-consistency experimental results of RF, LGBM, ExtraTree, XGB, and Stacking.
| Accuracy | Sensitivity/Recall | Specificity | MCC | Precision | F1-Score | |
|---|---|---|---|---|---|---|
| RF | 100.0 | 100.0 | 100.0 | 1.0 | 100.0 | 100.0 |
| LGBM | 94.97 | 99.96 | 67.94 | 0.79 | 94.41 | 97.11 |
| XGB | 99.76 | 100.0 | 98.44 | 0.990 | 99.71 | 99.85 |
| ExtraTree | 100.0 | 100.0 | 100.0 | 1.0 | 100.0 | 100.0 |
| Stacking | 100.0 | 100.0 | 100.0 | 1.0 | 100.0 | 100.0 |
Note. RF: Random Forest; LGBM: Light Gradient Boosting Machine; XGB: Extreme Gradient Boosting; MCC: Mathew's correlation coefficient.
Comparative analysis has revealed that the RF, ExtraTree, and Stacking models demonstrate remarkable performance compared to XGB and LGBM. Referring to Figure 4, the AUC suggests that the RF model achieves superior accuracy. These results indicate that the learned prediction rule is consistent with the initially proposed computational method.
Figure 4.
Self-consistency ROC curves for RF, ET, LGBM, Stacking, and XGB.
Note. ROC: Receiver-operating Characteristic Curve; RF: Random Forest; ET: ExtraTrees; LGBM: Light Gradient Boosting Machine; XGB: Extreme Gradient Boosting.
In our study, self-consistency validation uses the same dataset for both training and testing, which helps evaluate the model's performance on the entire dataset. However, to prevent over-fitting, we also employ independent testing and 5-fold and 10-fold cross-validation, where the dataset is split into 70% for training and 30% for testing, ensuring the model is evaluated on completely unseen data for robust validation.
Independent dataset test
Independent set tests are widely used to assess performance on previously unseen data. Typically, a dataset is divided into two unequal parts: a larger portion for training the predictor and a smaller portion for testing its accuracy. This process is repeated multiple times with varying dataset divisions to verify the predictor's accuracy. In this test, the extracted benchmark data samples are used as input. Normally, the division is based on a 70–30 ratio: 70% is used for training the predictor and 30% for testing. The whole process is repeated 10 times with different data partitions.
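A sketch of this protocol is shown below, assuming stratified 70/30 splits that are re-drawn with a different random seed on each repetition; the exact re-splitting scheme used in the study may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def repeated_independent_test(X, y, n_repeats=10):
    """Average test accuracy over repeated stratified 70/30 splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, stratify=y, random_state=seed)
        model = RandomForestClassifier(n_estimators=50, max_depth=25, random_state=seed)
        scores.append(accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te)))
    return float(np.mean(scores))
```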
Table 3 presents the average accuracy for each predictor. Consistent with the self-consistency results, RF and XGB outperform LGBM, ET, and Stacking in independent testing. The graph in Figure 5 shows that RF achieves an accuracy of 0.91 with a specificity of 0.99, indicating that it performs well in correctly classifying both positive and negative instances. XGB has the highest sensitivity (0.77) and MCC (0.70), indicating that it performs well in correctly identifying positive instances and has a strong overall correlation between observed and predicted classifications.
Table 3.
Independent test results of RF, LGBM, ExtraTree, XGB, and Stacking.
| Accuracy | Sensitivity/Recall | Specificity | MCC | Precision | |
|---|---|---|---|---|---|
| LGBM | 0.88 | 0.76 | 0.94 | 0.55 | 0.66 |
| RF | 0.91 | 0.73 | 0.99 | 0.65 | 0.97 |
| ExtraTree | 0.88 | 0.63 | 0.99 | 0.48 | 0.98 |
| XGB | 0.92 | 0.77 | 0.99 | 0.70 | 0.96 |
| Stacking | 0.86 | 0.10 | 1.0 | 0.29 | 1.0 |
Note. RF: Random Forest; LGBM: Light Gradient Boosting Machine; XGB: Extreme Gradient Boosting; MCC: Mathew's correlation coefficient.
Figure 5.
Independent test ROC for eNSMBL-PASD-RF, eNSMBL-PASD-XGB, eNSMBL-PASD-ET, eNSMBL-PASD-LGBM & eNSMBL-PASD Stacking.
Note. ROC: Receiver-operating Characteristic Curve; RF: Random Forest; XGB: Extreme Gradient Boosting; ET: ExtraTrees; LGBM: Light Gradient Boosting Machine.
Cross-validation
The self-consistency findings above compare the different models (RF, XGB, LGBM, Extra Tree, and Stacking) but do not evaluate performance on unknown data samples. While independent tests offer some insight into a predictor's performance on unknown data samples, they do not account for all possible cases; large portions of the data may never be tested under random partitioning. It is therefore important to perform more rigorous tests, such as cross-validation, to authenticate the predictor's correctness. An advantage of cross-validation is that it iterates over all available dataset samples. The dataset is divided into k disjoint folds and the test is repeated k times: in each iteration, one randomly assigned partition is held out for testing and the model is trained on the remaining k−1 partitions.
The predictor's accuracy is determined by calculating the average across the k repetitions of the test. This has proven to be an efficient approach when separate test data are not readily available. Through this strategy, no sample is left out of testing, so it provides a representative estimate of predictor accuracy on unseen data. All predictors are tested with 10-fold cross-validation, in which the dataset is divided into 10 disjoint folds. In each iteration, one fold is designated for testing while the remaining folds are used for training. This process repeats until every fold has served as the testing set. The average performance of each classifier (LGBM, XGB, RF, ET, and Stacking) across 10-fold cross-validation is summarized in Tables 4‒8, respectively (Table 4 refers to LGBM, Table 5 to XGB, Table 6 to RF, Table 7 to ET, and Table 8 to Stacking). The accuracy level is highest for RF, which corroborates the results of the previous tests.
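The k-fold protocol can be expressed compactly as below; the use of stratified folds and shuffling is an assumption, and n_splits can be set to 5 for the five-fold test reported later.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def kfold_accuracy(model, X, y, n_splits=10):
    """Mean accuracy over stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

# acc10 = kfold_accuracy(RandomForestClassifier(n_estimators=50, max_depth=25), X, y)
# acc5  = kfold_accuracy(RandomForestClassifier(n_estimators=50, max_depth=25), X, y, n_splits=5)
```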
Table 4.
Cross-validation ten-fold test for LGBM.
| F/1 | F/2 | F/3 | F/4 | F/5 | F/6 | F/7 | F/8 | F/9 | F/10 | Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.85 | 0.83 | 0.81 | 0.82 | 0.84 | 0.83 | 0.85 | 0.84 | 0.82 | 0.84 | 0.83 |
| Sensitivity/Recall | 0.51 | 0.46 | 0.47 | 0.44 | 0.48 | 0.41 | 0.48 | 0.39 | 0.37 | 0.43 | 0.44 |
| Specificity | 0.91 | 0.89 | 0.87 | 0.89 | 0.90 | 0.91 | 0.92 | 0.92 | 0.90 | 0.91 | 0.90 |
| MCC | 0.43 | 0.35 | 0.33 | 0.34 | 0.39 | 0.34 | 0.42 | 0.36 | 0.29 | 0.37 | 0.36 |
Note. LGBM: Light Gradient Boosting Machine; MCC: Mathew's correlation coefficient.
Table 8.
Cross-validation 10-fold test for Stacking.
| F/1 | F/2 | F/3 | F/4 | F/5 | F/6 | F/7 | F/8 | F/9 | F/10 | Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 | 0.85 | 0.86 | 0.86 | 0.85 |
| Sensitivity/Recall | 0.12 | 0.14 | 0.09 | 0.14 | 0.11 | 0.13 | 0.12 | 0.06 | 0.11 | 0.13 | 0.11 |
| Specificity | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 | 1.0 | 1.0 | 0.99 |
| MCC | 0.33 | 0.35 | 0.29 | 0.35 | 0.31 | 0.13 | 0.31 | 0.23 | 0.32 | 0.34 | 0.29 |
Note. MCC: Mathew's correlation coefficient.
Table 5.
Cross-validation 10-fold test for XGB.
| F/1 | F/2 | F/3 | F/4 | F/5 | F/6 | F/7 | F/8 | F/9 | F/10 | Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.87 | 0.86 | 0.86 | 0.86 | 0.87 | 0.85 | 0.86 | 0.85 | 0.85 | 0.86 | 0.85 |
| Sensitivity/Recall | 0.43 | 0.40 | 0.41 | 0.39 | 0.41 | 0.34 | 0.39 | 0.30 | 0.34 | 0.39 | 0.38 |
| Specificity | 0.95 | 0.95 | 0.95 | 0.94 | 0.95 | 0.94 | 0.95 | 0.95 | 0.95 | 0.95 | 0.94 |
| MCC | 0.46 | 0.43 | 0.42 | 0.40 | 0.45 | 0.35 | 0.42 | 0.33 | 0.36 | 0.42 | 0.40 |
Note. XGB: Extreme Gradient Boosting; MCC: Mathew's correlation coefficient.
Table 6.
Cross-validation 10-fold test for Random Forest.
| F/1 | F/2 | F/3 | F/4 | F/5 | F/6 | F/7 | F/8 | F/9 | F/10 | Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.89 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
| Sensitivity/Recall | 0.99 | 0.32 | 0.27 | 0.98 | 0.26 | 0.24 | 0.28 | 0.22 | 0.26 | 0.27 | 0.40 |
| Specificity | 0.36 | 0.98 | 0.98 | 0.29 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.85 |
| MCC | 0.52 | 0.47 | 0.39 | 0.45 | 0.39 | 0.39 | 0.41 | 0.38 | 0.40 | 0.41 | 0.42 |
Note. MCC: Mathew's correlation coefficient.
Table 7.
Cross-validation 10-fold test for Extra Tree.
| F/1 | F/2 | F/3 | F/4 | F/5 | F/6 | F/7 | F/8 | F/9 | F/10 | Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.88 | 0.87 | 0.86 | 0.87 | 0.86 | 0.87 | 0.87 | 0.86 | 0.86 | 0.87 | 0.86 |
| Sensitivity/Recall | 0.25 | 0.24 | 0.16 | 0.21 | 0.17 | 0.19 | 0.23 | 0.14 | 0.18 | 0.21 | 0.19 |
| Specificity | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.0 | 0.99 |
| MCC | 0.46 | 0.44 | 0.35 | 0.42 | 0.36 | 0.39 | 0.32 | 0.34 | 0.34 | 0.43 | 0.38 |
Note. MCC: Mathew's correlation coefficient.
Based on the averaged cross-validation results, RF demonstrates the highest accuracy at 87%, as depicted in Figure 6.
Figure 6.
Average accuracy chart for eNSMBL-PASD-XGB, eNSMBL-PASD-LGBM, eNSMBL-PASD-ExtraTree, eNSMBL-PASD-Stacking, eNSMBL-PASD-RF.
Note. XGB: Extreme Gradient Boosting; LGBM: Light Gradient Boosting Machine; RF: Random Forest.
In addition to the previously mentioned tests, a five-fold cross-validation was performed on the data to assess the predictor’s accuracy. Results are available in Table 9 for all predictors (LGBM, XGB, RF, Extra Tree, and Stacking).
Table 9.
Cross-validation 5-fold comparison (LGBM, XGB, RF, Stacking and Extra Tree).
| eNSMBL-PASD-LGBM | F/1 | F/2 | F/3 | F/4 | F/5 | Avg | |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.83 | 0.83 | 0.84 | 0.83 | 0.83 | 0.83 | |
| Sensitivity/Recall | 0.44 | 0.43 | 0.43 | 0.38 | 0.39 | 0.41 | |
| Specificity | 0.91 | 0.90 | 0.92 | 0.92 | 0.91 | 0.91 | |
| MCC | 0.36 | 0.35 | 0.37 | 0.33 | 0.32 | 0.34 | |
| eNSMBL-PASD-XGB | F/1 | F/2 | F/3 | F/4 | F/5 | Avg | |
| Accuracy | 0.86 | 0.86 | 0.86 | 0.86 | 0.85 | 0.85 | |
| Sensitivity/Recall | 0.40 | 0.40 | 0.37 | 0.36 | 0.35 | 0.37 | |
| Specificity | 0.94 | 0.94 | 0.95 | 0.95 | 0.94 | 0.94 | |
| MCC | 0.41 | 0.42 | 0.40 | 0.40 | 0.36 | 0.39 | |
| eNSMBL-PASD-RF | F/1 | F/2 | F/3 | F/4 | F/5 | Avg | |
| Accuracy | 0.88 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | |
| Sensitivity/Recall | 0.31 | 0.30 | 0.24 | 0.26 | 0.26 | 0.27 | |
| Specificity | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | |
| MCC | 0.47 | 0.42 | 0.36 | 0.41 | 0.39 | 0.41 | |
| eNSMBL-PASD-Extra-Tree | F/1 | F/2 | F/3 | F/4 | F/5 | Avg | |
| Accuracy | 0.87 | 0.87 | 0.86 | 0.86 | 0.87 | 0.86 | |
| Sensitivity/Recall | 0.20 | 0.18 | 0.18 | 0.16 | 0.21 | 0.18 | |
| Specificity | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | |
| MCC | 0.41 | 0.38 | 0.36 | 0.36 | 0.39 | 0.38 | |
| eNSMBL-PASD-Stacking | F/1 | F/2 | F/3 | F/4 | F/5 | Avg | |
| Accuracy | 0.86 | 0.86 | 0.86 | 0.85 | 0.86 | 0.85 | |
| Sensitivity/Recall | 0.12 | 0.11 | 0.08 | 0.08 | 0.13 | 0.10 | |
| Specificity | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 | 0.99 | |
| MCC | 0.32 | 0.31 | 0.34 | 0.26 | 0.33 | 0.31 |
Note. LGBM: Light Gradient Boosting Machine; XGB: Extreme Gradient Boosting; RF: Random Forest; MCC: Mathew's correlation coefficient.
Results and analysis
It is important to diagnose autism driver genes at an early stage for timely treatment and therapy, and bioinformatics tools of this kind have gained significant importance and demand. The proposed computational predictor is systematically designed by amassing benchmark datasets, applying feature extraction, and training robust machine learning algorithms. Finally, the prediction model is tested using multiple validation techniques. The test results revealed that eNSMBL-PASD-RF demonstrated the best performance, achieving 91%, 100%, 87%, and 87% in independent testing, self-consistency, five-fold, and 10-fold cross-validation, respectively. Similarly, eNSMBL-PASD-XGB demonstrated excellent performance, achieving 92%, 99%, 85%, and 85% accuracy in independent testing, self-consistency, five-fold, and 10-fold cross-validation, respectively. The eNSMBL-PASD-ET algorithm showed accuracies of 88%, 100%, 86%, and 86%, and eNSMBL-PASD-LGBM showed accuracies of 88%, 94%, 83%, and 83% in independent testing, self-consistency, five-fold, and 10-fold cross-validation, respectively. The eNSMBL-PASD-Stacking model showed accuracies of 86%, 100%, 85%, and 85% in independent testing, self-consistency, five-fold, and 10-fold cross-validation, respectively.
In comparison to prior research efforts (CORR39, Resnik40, and Brainspan41), our study demonstrates a notable advancement in gene classification. CORR39 used a dataset of 146 autism samples and 54,613 gene expression features and employed four feature selection methods to reduce the dimensionality of the gene expression features to 100 prominent genes. Resnik40 utilized 588 ASD candidate genes and 1189 non-ASD genes and used semantic similarity measures from the Gene Ontology (GO) to capture functional similarities between ASD and non-ASD genes. Brainspan41 used 604 positive ASD genes and 1594 non-ASD genes from developmental brain gene expression profiles and RNA transcript sequences and applied an autoencoder network for dimensionality reduction and representation learning of developmental brain gene expression data.
CORR39 used a NN model with 10-fold cross-validation, achieving 78.6% accuracy. The NN model was optimized with hyperparameters such as learning rate and number of layers. Resnik40 applied RF, SVM (linear and radial kernels), and NB classifiers. RF achieved an AUC of 0.80 with 500 trees, square root feature selection at each split, and equal class weights. Brainspan41 used LR, SVM, and RF. Hyperparameters included an autoencoder for dimensionality reduction and feature selection, resulting in a receiver-operating characteristic curve (ROC) AUC of 0.78 with k-mer frequency features for RNA transcript sequences.
We systematically gathered a substantial dataset of positive genes, obtaining 3144 entries from NCBI, whereas only restricted gene sets have been examined in previous studies. Diversity in classifier selection is a key strength of our work: we employed a comprehensive ensemble of RF, LGBM, XGB, ET, and the Stacking model. This diverse set of classifiers allowed for a comprehensive exploration of the gene classification landscape, surpassing the singular NN model used in reference 39 and the limited set of classifiers (RF, NB, linear-SVM, and radial-SVM) in reference 40. Our ensemble approach, in particular, achieved remarkable success, reaching the highest accuracy of 91% on an independent dataset and 87% in a 10-fold cross-validation setup, outperforming the 78.6% accuracy reported in reference 39, the 80% AUC in Resnik40, and the 81% (LR) and 82% (SVM) reported in reference 41 under different conditions.
Additional verification and comparison of our proposed model involved rigorous testing against existing models: CORR39, Resnik40, Brainspan41, and an autoencoder (as shown in Figure 7).
Figure 7.
Performance of the proposed predictor along with existing predictors.
Boundary and feature space visualization comparisons
The efficiency of RF representations is discussed in this section using boundary visualization and feature space visualization comparisons.
Figure 8 demonstrates how different classifiers use distinct decision boundaries to separate the positive and negative classes. The input data contained samples from both classes, and every classifier created its own classification regions. The ET and Stacking classifiers demonstrated superior class separation, resulting in the fewest misclassified samples. The first row of samples highlights the challenges posed to the classifiers by non-linearity. These samples are not linearly separable, which is why ensemble methods like RF, ET, and XGB perform well, while LGBM struggles. The second row illustrates samples that require computational models capable of handling more complex decision boundaries, which are well handled by the ensemble classifiers, while LGBM again performs poorly. In the third row, all the samples are depicted for each of the classifiers, and the simplicity of the dataset allows even LGBM to perform well, as expected.
Figure 8.
Boundary visualization for all classifiers.
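Decision-region plots of this kind can be produced with a sketch such as the following, which projects the feature matrix to two dimensions (PCA here, purely for illustration), fits a classifier on that projection, and colours a dense grid by its predictions; this is not the exact plotting code behind Figure 8, and it assumes numeric class labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def plot_decision_regions(X, y, model=None, resolution=300):
    """Fit a classifier on a 2D projection of X and colour a grid by its predictions."""
    X2 = PCA(n_components=2).fit_transform(X)
    clf = (model or RandomForestClassifier(random_state=0)).fit(X2, y)
    x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
    y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                         np.linspace(y_min, y_max, resolution))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3)          # shaded decision regions
    plt.scatter(X2[:, 0], X2[:, 1], c=y, s=8)    # samples coloured by class
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.show()
```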
The t-SNE algorithm converts both deep and hand-crafted feature representations into two-dimensional visualizations for analytical purposes. It minimizes the tendency of points to cluster at the center of the map by applying a variant of stochastic neighbor embedding to create 2D representations of complex, high-dimensional data.
The t-SNE visualization reveals a clearer boundary between the positive and negative samples with deep representations. This distinct separation suggests that a learning model could more easily classify these samples. In contrast, the hand-engineered representations exhibit significant overlap between the two classes, implying greater difficulty for a learning algorithm to achieve accurate separation. The t-SNE plots provided in Figure 9 represent a lower-dimensional mapping of the high-dimensional feature space (522 features). These plots are designed to capture local and global structures in the data. However, t-SNE visualizations do not guarantee clear decision boundaries, especially when dealing with high-dimensional biological datasets, such as gene expressions related to ASD. These types of data often involve complex, non-linear relationships that may not always result in easily separable clusters in a 2D space.
Figure 9.
Feature space visualization t-SNE on n_components 1, 2, 3, 4, and 5.
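A minimal t-SNE sketch along these lines is given below; the perplexity and other settings are library defaults rather than the values used for the published plots.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(X, y):
    """Embed the 522-dimensional feature vectors in 2D and colour points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=8)
    plt.xlabel("t-SNE 1")
    plt.ylabel("t-SNE 2")
    plt.show()
```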
Discussion
The success of RF as the standout classifier in our work is closely intertwined with the pivotal role of feature extraction techniques in distilling essential genetic patterns from the benchmark dataset. Before applying the classifiers, feature extraction techniques such as PRIM (PRIM4, PRIM16, PRIM64), RPRIM (RPRIM4, RPRIM16, RPRIM64), FV determination, AAPIV (AAPIV4, AAPIV16, AAPIV64) generation, and RAAPIV (RAAPIV4, RAAPIV16, RAAPIV64) were employed, with all of these matrices based on mononucleotide, dinucleotide, and trinucleotide combinations. These techniques played a crucial role in capturing nuanced relationships between genetic sequences and identifying key genetic markers associated with autism. The distilled feature vectors served as the foundation for the subsequent classification process. Among the classifiers evaluated, RF consistently demonstrated high accuracy across various test scenarios, with an average accuracy of 0.87 in both 10-fold and five-fold cross-validations and 0.91 in independent tests. Its robustness was evident in accurately classifying both positive and negative instances, as reflected in commendable sensitivity/recall and specificity scores. With a well-balanced MCC score of approximately 0.41, RF showcased strong overall performance in classification tasks. While XGB also emerged as a strong contender, particularly excelling in specificity with an average specificity of 0.94 in both cross-validation tests, its slightly lower sensitivity/recall and MCC scores compared to RF indicated a marginally less robust performance across metrics. In contrast, although LGBM showcased competitive accuracy and specificity, its comparatively lower sensitivity/recall and MCC scores suggested areas for potential improvement. Among the remaining classifiers, Extra Tree and Stacking exhibited mixed results, with lower sensitivity/recall and MCC scores despite Extra Tree's high specificity and Stacking's perfect specificity. These findings align with the primary research objective of identifying an optimal classifier for autism-related genetic markers, supporting the hypothesis that ensemble methods, particularly RF, would outperform other classifiers in terms of accuracy and robustness. The high accuracy and balanced sensitivity/specificity scores confirm the hypothesis that RF's ability to handle complex genetic data would result in superior performance compared to other models.
Compared to previous studies, such as CORR39, Resnik40, and Brainspan41, which utilized NNs, RFs, and SVMs with limited datasets, our study demonstrates a significant advancement in terms of data diversity and classifier selection. By assembling a comprehensive dataset of 3144 genes from NCBI, our model surpassed the accuracy benchmarks of earlier efforts, achieving up to 91% accuracy in independent tests and outperforming CORR39's 80% accuracy and Resnik40’s 81%–82% accuracy. Our approach uniquely incorporates a diverse ensemble of classifiers, including RF, XGB, LGBM, ET, and Stacking, allowing for a broader exploration of gene classification possibilities, unlike the single-model focus seen in Reference 39 and limited classifiers used in Resnik40.
As a result, the success of RF as the best-suited classifier underscores the importance of feature extraction techniques in facilitating accurate predictive models for various machine learning applications, particularly in genetic sequence analysis and autism research. Our study does, however, have certain limitations. Even though the ensemble strategy performed well, the MCC and sensitivity scores of some classifiers were still lower than those of others. Furthermore, even though our dataset of 3144 genes is larger than those of earlier research, further validation and expansion using a wider range of genetic data from different populations are necessary to increase the model's generalizability. Finally, since our model mainly relies on hand-crafted feature extraction methods, future studies could improve its performance by investigating different feature engineering strategies or deeper learning models.
Conclusion
Unlike other genetic disorders, autism cannot be diagnosed through blood tests, brain scans, or other medical examinations. Instead, doctors and psychologists rely on assessing the patient's history and observing their behavior to diagnose ASD, usually when the patient is 18 months old or older. As early investigation is a crucial issue, a predictor has been proposed that uses genetic sequences to diagnose autistic genetic disorders. The proposed machine learning model offers a more precise diagnostic tool for ASD, enhancing the accuracy of predictions and guiding tailored medical treatment plans effectively. Early identification through this method addresses the current diagnostic complexity, ensuring timely interventions crucial for optimizing developmental outcomes. In this work, different techniques have been applied to evaluate the strength of the predictors. In the benchmark dataset, autism-suspected genes and passenger genes are categorized depending on their characteristics. Our work employs a thorough preprocessing methodology for the benchmark dataset, involving feature extraction techniques using PRIM (PRIM4, PRIM16, PRIM64), RPRIM (RPRIM4, RPRIM16, RPRIM64), FV determination, AAPIV (AAPIV4, AAPIV16, AAPIV64) generation, and RAAPIV (RAAPIV4, RAAPIV16, RAAPIV64) based on single nucleotide, dinucleotide, and trinucleotide combinations. The eNSMBL-PASD model was trained on a diverse array of classification techniques, including ensemble methods such as XGB, RF, LGBM, and ET, and was subjected to a rigorous validation process for its authentication. The RF predictor showed excellent results, achieving ∼91% accuracy on independent datasets and 87% in cross-validation (five-fold and 10-fold) tests. The benchmark dataset was compiled from the latest records at ncbi.org. Rigorous validation tests demonstrated that the RF model achieves superior accuracy and suitability compared to LGBM, ExtraTree, XGB, and Stacking. The current research will help in the early diagnosis and treatment of autism by making predictions from the genomic profile.
In this work, a novel, intelligent computational model for early-stage autism diagnosis is introduced. The model demonstrates superior accuracy (∼91%) in the analysis of genomic sequences, offering the potential to significantly improve early diagnosis.
In addition to its use by therapists, doctors, and other professionals for early ASD diagnosis, the model can be further enhanced to support personalized treatment plans by identifying specific gene variants linked to different ASD symptoms. Future research could also explore integrating the model with real-time genetic data and expanding its application to other neurodevelopmental disorders, improving overall diagnostic accuracy and treatment outcomes.
Supplemental Material
Supplemental material, sj-csv-1-dhj-10.1177_20552076241313407 for eNSMBL-PASD: Spearheading early autism spectrum disorder detection through advanced genomic computational frameworks utilizing ensemble learning models by Ayesha Karim, Nashwan Alromema, Sharaf J Malebary, Faisal Binzagr, Amir Ahmed and Yaser Daanial Khan in DIGITAL HEALTH
Supplemental material, sj-fasta-2-dhj-10.1177_20552076241313407 for eNSMBL-PASD: Spearheading early autism spectrum disorder detection through advanced genomic computational frameworks utilizing ensemble learning models by Ayesha Karim, Nashwan Alromema, Sharaf J Malebary, Faisal Binzagr, Amir Ahmed and Yaser Daanial Khan in DIGITAL HEALTH
Supplemental material, sj-fasta-3-dhj-10.1177_20552076241313407 for eNSMBL-PASD: Spearheading early autism spectrum disorder detection through advanced genomic computational frameworks utilizing ensemble learning models by Ayesha Karim, Nashwan Alromema, Sharaf J Malebary, Faisal Binzagr, Amir Ahmed and Yaser Daanial Khan in DIGITAL HEALTH
Acknowledgements
This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, under grant no. GPIP: 1334-830-2024. The authors therefore acknowledge DSR with thanks for its technical and financial support.
Footnotes
Contributorship: AK, NA, and YDK conducted the literature review and conceived the study. AK, NA, SJM, YDK and AA were involved in protocol development, data collection, and data analysis. AK, NA, and YDK wrote the first draft of the manuscript. AK and FB worked on the finalized draft. All authors reviewed and edited the manuscript and approved the final version of the manuscript.
Declaration of conflicting interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval and patient consent: This study utilized publicly available genomic sequences from open-access repositories (NCBI). As no new data were collected from human participants, informed consent and ethics approval were not required.
Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah (grant number GPIP: 1334-830-2024).
Guarantor: Sharaf J. Malebary
ORCID iDs: Ayesha Karim https://orcid.org/0009-0008-5644-7676
Nashwan Alromema https://orcid.org/0000-0001-6208-2863
Supplemental material: The data used in the study is available at https://figshare.com/articles/dataset/_b_eNSMBLPASD_Spearheading_Early_Autism_Spectrum_Disorder_Detection_through_Advanced_Genomic_Computational_Frameworks_Utilizing_Ensemble_Learning_Models_b_/27423693
References
1. Hodges H, Fealko C, Soares N. Autism spectrum disorder: definition, epidemiology, causes, and clinical evaluation. Transl Pediatr 2020; 9: 55–65.
2. Elsabbagh M, Divan G, Koh YJ, et al. Global prevalence of autism and other pervasive developmental disorders. Autism Res 2012: 160–179.
3. Ousley O, Cermak T. Autism spectrum disorder: defining dimensions and subgroups. Curr Dev Disord Rep 2014; 1: 20–28.
4. Acuna-Hidalgo R, Veltman JA, Hoischen A. New insights into the generation and role of de novo mutations in health and disease. Genome Biol 2016: 1–19.
5. Almandil NB, Alkuroud DN, AbdulAzeez S, et al. Environmental and genetic factors in autism spectrum disorders: special emphasis on data from Arabian studies. Int J Environ Res Public Health 2019; 16: 658.
6. Rylaarsdam L, Guemez-Gamboa A. Genetic causes and modifiers of autism spectrum disorder. Front Cell Neurosci 2019; 13: 385.
7. Lozano R, Fullman N, Mumford JE, et al. Measuring universal health coverage based on an index of effective coverage of health services in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet 2020: 1250–1284.
8. Borowiak K, Schelinski S, von Kriegstein K. Recognizing visual speech: reduced responses in visual-movement regions, but not other speech regions in autism. Neuroimage Clin 2018; 20: 1078–1091.
9. Nair S, Keehn RJJ, Berkebile MM, et al. Local resting state functional connectivity in autism: site and cohort variability and the effect of eye status. Brain Imaging Behav 2018; 12: 168–179.
10. Moore A, Wozniak M, Yousef A, et al. The geometric preference subtype in ASD: identifying a consistent, early-emerging phenomenon through eye tracking. Mol Autism 2018: 1–13.
11. Wolf A, Ueda K. Contribution of eye-tracking to study cognitive impairments among clinical populations. Front Psychol 2021: 2080.
12. Bosl WJ, Tager-Flusberg H, Nelson CA. EEG analytics for early detection of autism spectrum disorder: a data-driven approach. Sci Rep 2018; 8: 6828.
13. Peketi S, Dhok SB. Machine learning enabled p300 classifier for autism spectrum disorder using adaptive signal decomposition. Brain Sci 2023; 13: 315.
14. Elder JH, Kreider CM, Brasher SN, et al. Clinical impact of early diagnosis of autism on the prognosis and parent–child relationships. Psychol Res Behav Manag 2017; 10: 283–292.
15. Honig MG, Dorian CC, Worthen JD, et al. Progressive long-term spatial memory loss following repeat concussive and subconcussive brain injury in mice, associated with dorsal hippocampal neuron loss, microglial phenotype shift, and vascular abnormalities. Eur J Neurosci 2020.
16. Hiremath CS, Sagar KJV, Yamini BK, et al. Emerging behavioral and neuroimaging biomarkers for early and accurate characterization of autism spectrum disorders: a systematic review. Transl Psychiatry 2021: 1–12.
17. Roberts TP, Bloy L, Liu S, et al. Magnetoencephalography studies of the envelope following response during amplitude-modulated sweeps: diminished phase synchrony in autism spectrum disorder. Front Hum Neurosci 2021: 15.
18. Hoischen A, van Bon BW, Gilissen C, et al. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nat Genet 2010; 42: 483–485.
19. Hoischen A, van Bon BW, Rodríguez-Santiago B, et al. De novo nonsense mutations in ASXL1 cause Bohring-Opitz syndrome. Nat Genet 2011; 43: 729–731.
20. Allen G. Aetiology of Down's syndrome inferred by Waardenburg in 1932. Nature 1974: 436–437.
21. Lindhurst MJ, Sapp JC, Teer JK, et al. A mosaic activating mutation in AKT1 associated with the Proteus syndrome. N Engl J Med 2011; 365: 611–619.
22. Girard SL, Gauthier J, Noreau A, et al. Increased exonic de novo mutation rate in individuals with schizophrenia. Nat Genet 2011: 860–863.
23. Xu B, Roos JL, Dexheimer P, et al. Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat Genet 2011; 43: 864–868.
24. Alonso-Gonzalez A, Rodriguez-Fontenla C, Carracedo A. De novo mutations (DNMs) in autism spectrum disorder (ASD): pathway and network analysis. Front Genet 2018; 9: 406.
25. Marco EJ, Aitken AB, Nair VP, et al. Burden of de novo mutations and inherited rare single nucleotide variants in children with sensory processing dysfunction. BMC Med Genomics 2018: 1–11.
26. Parenti I, Rabaneda LG, Schoen H, et al. Neurodevelopmental disorders: from genetics to functional pathways. Trends Neurosci 2020; 43: 608–621.
27. Goldmann JM, Veltman JA, Gilissen C. De novo mutations reflect development and aging of the human germline. Trends Genet 2019; 35: 828–839.
28. Vissers LE, de Ligt J, Gilissen C, et al. A de novo paradigm for mental retardation. Nat Genet 2010; 42: 1109–1112.
29. Najmabadi H, Hu H, Garshasbi M, et al. Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature 2011; 478: 57–63.
30. Jackson M, Marks L, May GH, et al. Correction: the genetic basis of disease. Essays Biochem 2020; 64: 681.
31. Whitman MC, Gioia D, Chan SA, et al.; Strabismus Genetics Research Consortium. Recurrent rare copy number variants increase risk for esotropia. Invest Ophthalmol Visual Sci 2020: 22.
32. Turner TN, Coe BP, Dickel DE, et al. Genomic patterns of de novo mutation in simplex autism. Cell 2017; 171: 710–722.e12.
33. Yuen RK, et al. Genome-wide characteristics of de novo mutations in autism. NPJ Genom Med 2016: 1–10.
34. Qureshi MS, Qureshi MB, Asghar J, et al. Prediction and analysis of autism spectrum disorder using machine learning techniques. J Healthc Eng 2023; 2023. [Retracted]
35. Parikshak NN, Luo R, Zhang A, et al. Integrative functional genomic analyses implicate specific molecular pathways and circuits in autism. Cell 2013; 155: 1008–1021.
36. Ziats MN, Rennert OM. Aberrant expression of long noncoding RNAs in autistic brain. J Mol Neurosci 2013; 49: 589–593.
37. Heinsfeld AS, Franco AR, Craddock RC, et al. Identification of autism spectrum disorder using deep learning and the ABIDE dataset. Neuroimage Clin 2018; 17: 16–23.
38. Kong Y, Gao J, Xu Y, et al. Classification of autism spectrum disorder by combining brain connectivity and deep neural network classifier. Neurocomputing 2019; 324: 63–68.
39. Roopa BS, Prasad RM. A selection of an optimal framework identifying the prominent autism risk gene biomarkers from gene expression data using neural network. SN Comput Sci 2021: 1–10.
40. Asif M, Martiniano HF, Vicente AM, et al. Identifying disease genes using machine learning and gene functional similarities, assessed through gene ontology. PLoS One 2018; 13: e0208626.
41. Wang J, Wang L. Prediction and prioritization of autism-associated long non-coding RNAs using gene expression and sequence features. BMC Bioinformatics 2020: 1–15.
42. National Center for Biotechnology Information (NCBI). Bethesda, MD: National Library of Medicine (US), National Center for Biotechnology Information; 1988. https://www.ncbi.nlm.nih.gov/ [cited 2017 Apr 6].
43. Xu Y, Ding J, Wu LY, et al. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 2013.
44. Cao DS, Xu QS, Liang YZ. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics 2013; 29: 960–962.
45. Malebary SJ, Khan YD. Identification of antimicrobial peptides using Chou's 5 step rule. Comput Mater Continua 2021: 3.
46. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct Funct Bioinf 2021: 246–255.
47. Feng P, Yang H, Ding H, et al. iDNA6mA-PseKNC: identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 2019; 111: 96–102.
48. Suleman MT, Alkhalifah T, Alturise F, et al. DHU-Pred: accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ 2022: 10.
49. Allehaibi K, Khan YD, Khan SA. ITAGPred: a two-level prediction model for identification of angiogenesis and tumor angiogenesis biomarkers. Appl Bionics Biomech 2021; 2021: 1–15.
50. Akbar S, Hayat M. iMethyl-STTNC: identification of N6-methyladenosine sites by extending the idea of SAAC into Chou's PseAAC to formulate RNA sequences. J Theor Biol 2018; 455: 205–211.
51. Ilyas S, Hussain W, Ashraf A, et al. iMethylK-PseAAC: improving accuracy of lysine methylation sites identification by incorporating statistical moments and position relative features into general PseAAC via Chou's 5-steps rule. Curr Genomics 2019; 20: 275–292.
52. Akmal MA, Hussain W, Rasool N, et al. Using Chou's 5-steps rule to predict O-linked serine glycosylation sites by blending position relative features and statistical moment. IEEE/ACM Trans Comput Biol Bioinf 2020.
53. Akmal MA, Rasool N, Khan YD. Prediction of N-linked glycosylation sites using position relative features and statistical moments. PLoS One 2017; 12: e0181966.
54. Awais M, Hussain W, Khan YD, et al. iPhosH-PseAAC: identify phosphohistidine sites in proteins by blending statistical moments and position relative features according to the Chou's 5-step rule and general pseudo amino acid composition. IEEE/ACM Trans Comput Biol Bioinform 2019: 596–610.
55. Khan SA, Khan YD, Ahmad S, et al. N-MyristoylG-PseAAC: sequence-based prediction of N-myristoyl glycine sites in proteins by integration of PseAAC and statistical moments. Lett Org Chem 2019: 226–234.
56. Hassan A, Alkhalifah T, Alturise F, et al. RCCC_Pred: a novel method for sequence-based identification of renal clear cell carcinoma genes through DNA mutations and a blend of feature. Diagnostics 2022; 12: 3036.
57. Perveen G, Alturise F, Alkhalifah T, et al. Hemolytic-Pred: a machine learning-based predictor for hemolytic proteins using position and composition-based features. Digital Health 2023: 9.
58. Alghamdi W, Attique M, Alzahrani E, et al. LBCEPred: a machine learning model to predict linear B-cell epitopes. Brief Bioinform 2022: 3.
59. Shah AA, Malik HAM, Mohammad A, et al. Machine learning techniques for identification of carcinogenic mutations, which cause breast adenocarcinoma. Sci Rep 2022; 12: 1.
60. Ahmed S, Arif M, Kabir M, et al. PredAoDP: accurate identification of antioxidant proteins by fusing different descriptors based on evolutionary information with support vector machine. Chemom Intell Lab Syst 2022: 228.
61. Almagrabi AO, Khan YD, Khan SA. iPhosD-PseAAC: identification of phosphoaspartate sites in proteins using statistical moments and PseAAC. Biocell 2021: 5.
62. Alzahrani E, Alghamdi W, Ullah MZ, et al. Identification of stress response proteins through fusion of machine learning models and statistical paradigms. Sci Rep 2021: 11.
63. Biau G, Scornet E. A random forest guided tour. Test 2016; 25: 197–227.
64. Taherzadeh G, Zhou Y, Liew AWC, et al. Structure-based prediction of protein–peptide binding regions using Random Forest. Bioinformatics 2018; 34: 477–484.
65. Suleman MT, Alturise F, Alkhalifah T, et al. iDHU-Ensem: identification of dihydrouridine sites through ensemble learning models. Digital Health 2023: 9.
66. Suleman MT, Khan YD. PseU-Pred: an ensemble model for accurate identification of pseudouridine site. Anal Biochem 2023: 676.
67. Alromema N, Taseer Suleman M, Malebary SJ, et al. Identification of 6-methyladenosine sites using novel feature encoding methods and ensemble models. Sci Rep 2024; 1.
68. Chen J, Liu H, Yang J, et al. Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 2007: 423–428.
69. Chen W, Feng PM, Lin H, et al. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res 2013; 41: e68.