Abstract
Autism spectrum disorder (ASD) is a neurodevelopmental disorder. ASD cannot be fully cured, but early-stage diagnosis followed by therapy and rehabilitation helps an autistic person live a quality life. Clinical diagnosis of ASD symptoms via questionnaires and screening tests such as the Autism Spectrum Quotient-10 (AQ-10) and the Quantitative Checklist for Autism in Toddlers (Q-CHAT) is expensive, inaccessible, and time-consuming. Machine learning (ML) techniques are beneficial for predicting ASD easily at the initial stage of diagnosis. The main aim of this work is to classify ASD and typically developing (TD) class data using ML classifiers. In our work, we have used ASD datasets covering all age groups (toddlers, children, adolescents, and adults) to classify ASD and TD cases. We implemented One-Hot encoding to translate categorical data into numerical data during preprocessing. We then used a kNN imputer with MinMaxScaler feature transformation to handle missing values and data normalization. ASD and TD class data are classified using support vector machine (SVM), k-nearest neighbor (KNN), random forest (RF), and artificial neural network (ANN) classifiers. RF gives the best performance, with an accuracy of 100% across different training and testing data splits for all four datasets and no over-fitting issue. We have also compared our results with already published work, including recent methods based on deep neural networks (DNN) and convolutional neural networks (CNN). Even against such complex architectures, our proposed low-complexity models provide the best results, whereas existing methods have shown accuracy up to 98% with log-loss up to 15%. Our proposed methodology demonstrates improved generalization for real-time ASD detection during clinical trials.
Keywords: Autism spectrum disorder, Random forest, K-nearest neighbor, Support vector machine, kNN imputer, Artificial neural network, Questionnaire mode of screening
Introduction
According to WHO statistics, 1 out of 160 children has an autism spectrum disorder (ASD) [1]. ASD is a neurodevelopmental condition that brings substantial medical expenses. Symptoms of ASD appear in the first two years of life. Lack of sociability, limited expression and imagination, communication difficulties, and repetitive activities are all signs of ASD [2, 3]. Various diagnostic tools are available to diagnose ASD, but these instrumental tests are time consuming and can only be performed on people with severe ASD symptoms.
Early diagnosis of ASD helps to make life easier for autistic children and their parents [4, 5]. Its increased prevalence has prompted physicians and scientists around the world to seek more efficient screening approaches. Medical domain experts have various quantitative checklists for ASD diagnosis, such as AQ-10, Q-CHAT, the autism diagnostic observation schedule (ADOS), the autism diagnostic interview-revised (ADI-R), the Indian scale for assessment of autism (ISAA), and the INCLEN diagnostic tool for autism spectrum disorder (INDT-ASD). In various previous studies, different ML and DL algorithms have been applied to self-collected datasets. Kosmicki et al. [6] used LR and radial-kernel SVM over ADOS for the clinical assessment of ASD. Bone et al. [7] collected data from 1264 ASD and 462 non-ASD patients based on SRS and ADI-R and classified them with SVM. Satu et al. [8] collected samples of children aged 16 to 30 months from parents in different places in Bangladesh to find valuable characteristics of TD and ASD. Jumaa et al. [9] collected data for ages 4 to 17 years and classified it with the help of a naive Bayes (NB) classifier. Mujeeb Rahman et al. [10] proposed a DNN-based model to identify patients with ASD. Thabtah et al. [11] applied a self-organizing map (SOM) to data collected from 2000 participants.
Based on the AQ-10 and Q-CHAT questionnaires, Allison et al. [12] created quantitative checklists to diagnose ASD for different age groups. Furthermore, Thabtah et al. [13] created the mobile-phone-based ASDTests application with the help of these quantitative checklists and collected data for toddlers (18–36 months) [14], children (4–11 years) [15], adolescents (12–15 years) [16], and adults (16 years and above) [17]. One or more of these datasets have been used over the years for ASD detection or classification. Kumar [18] looked at the prospect of automating the AQ-10-Adult [17] data using ANN and SVM, and Musa et al. [19] used the [14] and [15] datasets to identify ASD symptoms and classified them with SVM, DT, RF, and NB classifiers. The [15, 16, 17] datasets have been used in other works [20–22] to classify ASD from TD data, using SVM, a swarm intelligence technique, and CNN, respectively.
The AQ-10 datasets have scarcity (shortage of available data) and sparsity (missing values) issues, and because of this, overfitting is the most concerning problem to address. Over the past few years, it has been found that handling scarcity and sparsity has the greatest impact on the performance of ML/DL methods on these datasets. Most existing work used dimensionality reduction and feature selection techniques to overcome scarcity, while sparsity has been handled either by dropping missing datapoints or by replacing them with a statistic value. For the state-of-the-art results in classifying ASD and non-ASD data, Mohan [23] used LibSVM, IBk, and NB classifiers on the [16] dataset and achieved 93.26%, 92.3%, and 91.34% precision, respectively. Vakadkar et al. [24] classified the [14] dataset using SVM, RFC, NB, LR, and KNN and achieved the highest accuracy of up to 97.19% using LR. Omar et al. [25] collected 250 new records, combined them with the [15, 16], and [17] datasets, classified them using RF-ID3 and RF-CART, and obtained accuracies of 0.92, 0.97, and 0.93 on the corresponding datasets. Thabtah [26] introduced new induction-based rules (RML) to classify the [15, 16], and [17] datasets, reporting that Adaboost gives 89% accuracy for the [16] dataset, RML gives 95% accuracy for the [17] dataset, and C4.5 gives 91% accuracy for the [15] dataset. Other papers such as [27, 28] used several ML and DL models to classify all four datasets: [27] achieved an accuracy of 98% using SVM for [14], 97% using Adaboost for [15], 93.89% using glmboost for [16], and 98% using Adaboost for [17], while Mohanty et al. [28] achieved an accuracy of 84–89% for all four datasets using DNN.
Nonetheless, previous works have each utilized only one or a few of these datasets. The existing body of work offers insightful information about developments in ASD prediction through ML and DL architectures with a range of preprocessing techniques, including label encoding to convert categorical variables to integers and mean replacement for missing values. The research done on these datasets has demonstrated the usefulness of traditional ML models over DL models. Nonetheless, several shortcomings and research gaps still require attention: handling sparsity and scarcity in the datasets (missing-value handling and conversion of categorical to integer values), the overfitting problem, which has not been tackled yet, and, finally, building a more comprehensive and reliable model based on all four datasets to span a wide age range. To tackle these shortcomings, the primary contributions of this study are outlined below:
In this paper, all four datasets (toddler, child, adolescent, and adult) have been used to cover a wide age range, i.e., 1.5 years to 16 years and above.
In this paper, to handle the sparsity problem, a multivariate, iterative imputation technique (the kNN imputer) is used instead of univariate feature imputation, i.e., mean, median, or mode.
As the datasets are small and simple, we focused on conventional ML models such as SVM, RF, KNN, and ANN instead of complex deep learning models to classify and predict ASD.
There are various conditions under which overfitting can occur in a model, such as (a) overly simple classifiers, (b) different attributes in the training and testing data, (c) an unbalanced distribution of the training and testing data, and (d) imbalanced and noisy data. We tackle all these conditions at various stages and avoid or remove the overfitting issue.
Finally, we have analyzed which ML methods performed better on the respective datasets and improved the state-of-the-art accuracy.
The remainder of the paper is organized as follows: Sect. 2 gives details about the datasets, pre-processing methods, and ML classifiers. Section 3 covers a detailed discussion of the analytical performance parameters and overfitting analysis, and analyzes the results of all classifiers on the different datasets. Section 4 compares the performance of each model with state-of-the-art results, and finally Sect. 5 presents the conclusion.
Materials and methodology
Questionnaire datasets
We used the toddler dataset [14] with 18 attributes, and the child [15], adolescent [16], and adult [17] datasets with 21 attributes each. The toddler dataset has 1054 samples with no missing values, the child dataset has 292 samples with 90 missing values, the adolescent dataset has 104 samples with 12 missing values, and the adult dataset has 704 samples with 192 missing values. The detailed questionnaires of all four datasets and missing-value information are in the Appendix (Table 13). As per the different age groups, the four datasets have different questionnaires but the same target class attribute “class name”, in which “NO” represents a typically developing (TD) person and “YES” represents an autistic (ASD) person. Detailed questionnaires per age group are given in the Appendix (Tables 14 and 16).
Table 13.
Dataset features and missing values
| Name | Sample’s number | Features | Class | Missing values |
|---|---|---|---|---|
| Q-CHAT-10 toddler | 1054 | 18 | 2 | 0 |
| AQ-10 children | 292 | 21 | 2 | 90 |
| AQ-10 adolescent | 104 | 21 | 2 | 12 |
| AQ-10 adult | 704 | 21 | 2 | 192 |
Table 14.
Features description of toddler dataset
| Feature name | Data type | Toddler’s features description (18–36 months) |
|---|---|---|
| Qchat-score | Integer | Final score obtained on the basis of the Q-CHAT-10 algorithm |
| Who completed the test | String (self, caregiver, parent, health care professional, etc.) | Who attempted the questionnaire screening |
Table 16.
Common features description of the toddler, children, adolescent and adult data-sets
| Feature name (data type) | Toddler dataset features description (18–36 months) | Child dataset features description (4–11 years) | Adolescent dataset Features Description (12–15 years) | Adult dataset features description (16 and older) |
|---|---|---|---|---|
| Q1 (Int) | Does your child respond to his/her name? | When others do not, she/he/I frequently hear minor voices. | ||
| Q2 (Int) | Getting your child to look at you in the eyes is how simple? | She, he, or I tend to focus more on the big picture than the minutiae. | ||
| Q3 (Int) | Does your child point while expressing a desire for something? | She/he can readily follow the conversations of many distinct persons in a social environment. | Taking on many tasks at once comes naturally to me. | |
| Q4 (Int) | Does your kid show signs of sharing your interests? | She or he finds switching between tasks to be simple. | She/he/I can swiftly return to what she/he/I was doing if there is a break. | |
| Q5 (Int) | Does your child play pretend games? | She/he lacks the communication skills needed to maintain a conversation with his/her peers. | She/he regularly discovers that she/he lacks the ability to maintain a conversation. | Reading between the lines comes naturally to me. |
| Q6 (Int) | Does your child follow your gaze wherever you look? | She/he is skilled at small talk. | I am able to recognise when someone is growing bored while listening to me. | |
| Q7 (Int) | Does your child attempt to console you or another family member when they are clearly upset? | She/he has trouble figuring out a character’s intentions or sentiments when reading a novel. | She/he enjoyed playing pretend games or role-playing with other kids when they were younger. | I have a hard time figuring out the intentions or sentiments of the characters while I’m reading a novel. |
| Q8 (Int) | Would you say that mama, dada, etc. were your child’s first words? | She or he used to like playing pretend games with other kids while they were in preschool. | It is challenging for her/him to picture what it would be like to be someone else. | I enjoy gathering facts on different topics. |
| Q9 (Int) | Does your child make basic gestures? | She/he finds it simple to infer someone’s thoughts or feelings merely by looking at their face. | Social circumstances come naturally to her/him. | Just looking at someone’s face, I can usually tell what they are thinking or feeling. |
| Q10 (Int) | Does your kid seem to be staring at nothing for no apparent reason? | Making new friends is difficult for her/him. | I have a hard time figuring out what the people’s intentions are. | |
| Age (Int) | Kid’s age in months | Kid’s age in years | ||
| Gender (String (F, M)) | Female or Male | |||
| Ethnicity (String) | Text list of common racial and ethnic groups. | |||
| Jaundice (String (Yes, No)) | Whether the patient had jaundice at birth. | |||
| Family member with PDD/ ASD/ autism (String (Yes, No)) | Whether anybody in your immediate family is autistic or has PDD/ASD. | |||
| Class/ASD(String (Yes, No)) | Patient diagnosed with ASD or not | |||
Methodology
Data pre-processing
The utilized datasets had two main issues: sparsity (missing values) and scarcity (shortage of data). The two most common missing-value imputation approaches are (1) univariate feature imputation and (2) multivariate feature imputation [29]. In univariate imputation, a feature statistic (mean, median, or most frequent value) is used when the missing attributes are numeric. In multivariate imputation, missing values are restored using an iterative method such as multivariate imputation by chained equations (MICE), fully conditional specification (FCS), or kNN. Iterative imputation estimates missing values as a regression problem in which each variable is modeled as a function of the others, allowing a better approximation of missing values across all characteristics when the attributes are of categorical type. Data skewness and outlier data points are among the most critical considerations in deciding which imputation technique gives effective replacement values for the missing data.
Three attributes (ethnicity, age, and relation) have missing values, and all are of ordinal categorical type. Table 1 shows the data types of these attributes along with the number of missing values. Categorical data must be transformed into numerical data while retaining the missing values before applying any imputer. To keep track of missing-value data, we replace the NaN values with “missing” and then convert all object-type attributes into the int64 data type using the One-Hot encoder. The One-Hot encoder is well suited for these categorical values and creates dummy variables as per the different categories; a sketch of this encoding step is given after Table 1.
Table 1.
Details of missing values in the children, adolescent, and adult datasets
| Feature name (Dtype) | Children dataset (292) | adolescent dataset (104) | Adult dataset (704) |
|---|---|---|---|
| Ethnicity (Object) | 43 | 6 | 95 |
| Age (Int64) | 4 | 0 | 2 |
| Relation (Object) | 43 | 6 | 95 |
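The encoding step can be sketched in a few lines of pandas. The file name and column layout are hypothetical and the authors' exact script is not available, so this is an assumption-laden illustration rather than their implementation:

```python
# Illustrative sketch: keep NaNs as an explicit "missing" category, then one-hot
# encode the object-type columns into int64 dummy columns.
import pandas as pd

df = pd.read_csv("autism_adult.csv")                     # hypothetical file name

cat_cols = df.select_dtypes(include="object").columns    # e.g. ethnicity, relation, ...
df[cat_cols] = df[cat_cols].fillna("missing")            # retain missing values as their own category

df_encoded = pd.get_dummies(df, columns=cat_cols, dtype="int64")
print(df_encoded.shape)                                  # e.g. (704, 99) for the adult dataset (see Table 3)
```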
The kNN imputer is a distance-based imputation technique that requires the data to be normalized to prevent skewness [30]; otherwise, the kNN imputer will provide biased substitutes for the missing values due to the different data scales. To avoid biased replacements, we use a Min-Max scaler and normalize the data to the range max = 1 and min = 0, which can be expressed as:
$$A_{\text{scaled}}[i] = \frac{A[i] - A.\text{min}}{A.\text{max} - A.\text{min}} \qquad (1)$$
where A.max = Max(A[i]) and A.min = Min(A[i]). The kNN imputer uses the Euclidean distance between the K neighboring values present in a particular attribute and replaces the missing value with the mean value of the K nearest neighbors. We experimented with K = 3, 5, 7, and 9 to determine an appropriate value. We analyzed the results by building a pipeline that uses kNN imputation and a multi-layer perceptron (MLP) classifier and employs repeated stratified K-fold cross-validation to assess the accuracy of the model, storing the outcome for every setting and reporting the mean and standard deviation of the accuracy scores. On this basis we chose K = 5 for the kNN imputer, as shown in Table 2; a sketch of this selection pipeline is given after the table.
Table 2.
Mean and std accuracy of each dataset on k = 3, 5, 7, 9
| k/datasets | 3 | 5 | 7 | 9 |
|---|---|---|---|---|
| AQ-10 children | 0.895, 0.066 | 0.891, 0.073 | 0.887, 0.068 | 0.886, 0.067 |
| AQ-10 adolescent | 0.831, 0.094 | 0.834, 0.087 | 0.837, 0.088 | 0.835, 0.092 |
| AQ-10 adult | 0.951, 0.027 | 0.954, 0.025 | 0.952, 0.027 | 0.953, 0.024 |
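A minimal sketch of the k-selection procedure described above, assuming a scikit-learn implementation: the pipeline chains Min-Max scaling, kNN imputation, and an MLP classifier, and each candidate k is scored with repeated stratified k-fold cross-validation (the fold and repeat counts below are assumptions):

```python
# Sketch of the k-selection experiment (assumed scikit-learn implementation).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def score_k_values(X, y, k_values=(3, 5, 7, 9)):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    for k in k_values:
        pipe = Pipeline([
            ("scale", MinMaxScaler()),                  # normalize to [0, 1] before distance-based imputation
            ("impute", KNNImputer(n_neighbors=k)),      # replace NaNs with the mean of the k nearest neighbours
            ("clf", MLPClassifier(max_iter=1000)),
        ])
        scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=cv)
        print(f"k={k}: mean accuracy={scores.mean():.3f}, std={scores.std():.3f}")
```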
Table 3 shows the shape of each dataset before and after pre-processing. The preprocessed data is then randomly divided into training and testing data.
Table 3.
Dimensionality of datasets before and after pre-processing
| Dataset | Before pre-processing | After pre-processing |
|---|---|---|
| Q-CHAT-10 toddler | (1054, 18) | (1054, 31) |
| AQ-10 children | (292, 21) | (292, 84) |
| AQ-10 adolescent | (104, 21) | (104, 64) |
| AQ-10 adult | (704, 21) | (704, 99) |
The training data percentage in each split is 50%, 60%, 70%, 80%, or 90%. The detailed training–testing data sizes after splitting are in the appendix (Table 15); a sketch of the splitting procedure follows the table. The classification model is trained on the training data using the KNN, SVM, RF, and ANN algorithms, and the evaluation parameters for the classification approach are then computed on the remaining (testing) data to measure the success ratio. We use the attribute “class name” to assess the ground truth scores.
Table 15.
Training and testing data size with various splits
| Dataset | Split 1 | Split 2 | Split 3 | Split 4 | Split 5 | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Size of Train-data (50%) | Size of Test-data (50%) | Size of Train-data (60%) | Size of Test-data (40%) | Size of Train-data (70%) | Size of Test-data (30%) | Size of Train-data (80%) | Size of Test-data (20%) | Size of Train-data (90%) | Size of Test-data (10%) | |
| ASD adult | 352 | 352 | 422 | 282 | 492 | 212 | 563 | 141 | 633 | 71 |
| ASD children | 146 | 146 | 175 | 117 | 204 | 88 | 233 | 59 | 262 | 30 |
| ASD adolescent | 52 | 52 | 62 | 42 | 72 | 32 | 83 | 21 | 93 | 11 |
| ASD toddler | 527 | 527 | 632 | 422 | 737 | 317 | 843 | 211 | 948 | 106 |
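For illustration, the five splits can be produced with scikit-learn's train_test_split by varying the training fraction; whether the authors stratified the split or fixed a random seed is not stated, so those details are assumptions:

```python
# Sketch: generate the five training/testing splits (50%-90% training data) used in this work.
from sklearn.model_selection import train_test_split

def make_splits(X, y):
    splits = {}
    for train_frac in (0.5, 0.6, 0.7, 0.8, 0.9):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_frac, random_state=42)   # the random seed is an assumption
        splits[train_frac] = (X_tr, X_te, y_tr, y_te)
        print(f"{int(train_frac * 100)}% train: {len(X_tr)} train / {len(X_te)} test samples")
    return splits
```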
Model training
Four classifiers, SVM, KNN, RF, and ANN, were applied to the pre-processed data after kNN imputation. Figure 1 shows the flow of the proposed method for investigating ASD risk. The classifiers are briefly discussed below:
Fig. 1.

Flow chart of proposed methodology for ASD and TD classification
Random forests (RF)
The RF [31] approach uses several classification trees to classify the data samples. Each tree gives a classification result. The RF is a set of decision trees derived in the form of:
$$\{h(X, \Theta_k),\ k = 1, 2, \ldots, K\} \qquad (2)$$
where $\Theta_k$ denotes independently and identically distributed random feature samples. Each of the K trees produces a vote, and the majority vote decides the class of sample X. When a sufficient number of trees is derived, the overall generalization error is bounded by:
$$E \le \frac{\bar{c}\,(1 - m^{2})}{m^{2}} \qquad (3)$$
where $\bar{c}$ is the tree-to-tree correlation and m is a metric of the strength of the individual trees.
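A minimal scikit-learn sketch of such an ensemble; the 20-tree setting is the GridSearchCV result reported later in the Results section, while the random seed is an assumption:

```python
# Random forest sketch: an ensemble of decision trees whose majority vote decides the class.
from sklearn.ensemble import RandomForestClassifier

def fit_rf(X_train, y_train, X_test):
    rf = RandomForestClassifier(n_estimators=20, random_state=0)  # 20 trees, the best value found by GridSearchCV
    rf.fit(X_train, y_train)
    return rf.predict(X_test)                                     # class decided by majority vote over the trees
```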
K-nearest neighbour (K-NN)
KNN [32] is a well-known classifier used in various applications, including pattern recognition, data mining, and other applied sciences. The class assignment of an unclassified sample is decided by inspecting its similarity to the already classified data. Various distance measures can be used to find the nearest neighbors, such as: Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (4)$$
Manhattan distance:
$$d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert \qquad (5)$$
Minkowski distance:
$$d(x, y) = \left(\sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p}\right)^{1/p} \qquad (6)$$
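These distances can be computed directly or passed to scikit-learn's KNeighborsClassifier through its metric argument; a brief sketch, where the quoted settings (Euclidean distance, 7 neighbours) come from the GridSearchCV result in the Results section:

```python
# Distance metrics of Eqs. (4)-(6) and their use in scikit-learn's KNN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def minkowski(x, y, p):
    """Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

# Settings reported by GridSearchCV in the Results section.
knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean")
```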
Support vector machine (SVM)
SVM [33] is a classic machine learning approach for regression and classification problems. It is a well-known, fast, and precise classification algorithm. The goal of the SVM procedure is to find a hyperplane in n-dimensional space that separates the data samples with the maximum margin, i.e., the largest distance to the nearest samples of the two classes. The optimization problem associated with the SVM technique has the following form [33]:
$$\min_{w,\,b,\,\xi}\ \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to } y_i\!\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i,\ \xi_i \ge 0 \qquad (7)$$
where $(x_i, y_i)$ is an instance-label pair and $\phi$ is a mapping function. The function $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is called the kernel function. There are several forms of the kernel function:
Linear function:
$$K(x_i, x_j) = x_i^{T} x_j \qquad (8)$$
Polynomial function:
$$K(x_i, x_j) = \left(\gamma\, x_i^{T} x_j + r\right)^{d}, \quad \gamma > 0 \qquad (9)$$
Radial basis function:
$$K(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0 \qquad (10)$$
Sigmoid function:
$$K(x_i, x_j) = \tanh\!\left(\gamma\, x_i^{T} x_j + r\right) \qquad (11)$$
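The kernels above map directly onto scikit-learn's SVC kernel options; the RBF configuration shown below is the one selected by GridSearchCV in the Results section, while the other settings are illustrative defaults:

```python
# SVM sketch: the kernel choices of Eqs. (8)-(11) expressed through scikit-learn's SVC.
from sklearn.svm import SVC

svm_linear  = SVC(kernel="linear")
svm_poly    = SVC(kernel="poly", degree=3, gamma="scale")   # the degree here is illustrative
svm_rbf     = SVC(kernel="rbf", C=100, gamma=0.01)          # configuration chosen by GridSearchCV in this work
svm_sigmoid = SVC(kernel="sigmoid")
```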
Artificial neural network (ANN)
ANN is a basic neural network. It has several components that determine how the model learns from a given dataset, such as the input layer, hidden layers, output layer, number of neurons, activation function, number of epochs, optimizer, and weight initializer [34]. The number of neurons in each layer is decided on the basis of the input data. Possible hyper-parameter values that decide the performance of an ANN model for a particular dataset include neurons: 100, 500, 1000, 2000, 3000; activation function: sigmoid, tanh, softmax; optimizer: Adam, Adamax, SGD, RMSprop; and weight initializer: zero initializer, random initializer, Glorot initializer.
GridSearchCV helps to decide which hyper-parameters are most suitable for a particular dataset. We used an early-stopping technique to obtain the optimal number of epochs, after which the model's performance no longer changes; we set a maximum of 3000 epochs with an early-stopping patience of 20 epochs. The Glorot weight initialization method with the sigmoid activation function is used to decide the weights at each layer using the following equations:
$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \qquad (12)$$
$$W \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right] \qquad (13)$$
where $n_{\text{in}}$ and $n_{\text{out}}$ are the numbers of input and output units of the layer.
The SGD optimizer with a learning rate of 0.001 is used to reduce the cost function and achieve convergence.
After obtaining the optimal hyper-parameter values, we use 1 input layer, 1 dropout layer with 20% dropout probability for regularization and reducing overfitting, 2 hidden layers with the sigmoid activation function, and 1 output layer with a single neuron (for binary classification). The number of neurons at each layer depends on the data size: the input layer has 29, 82, 97, and 62 neurons, the first hidden layer has 14, 35, 48, and 30 neurons, and the second hidden layer has 7, 15, 24, and 15 neurons for the toddler, child, adult, and adolescent data, respectively.
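A hedged Keras sketch of this architecture for the toddler configuration (29-14-7-1 neurons); where details such as the exact layer ordering or fit arguments are not stated in the paper, they are assumptions:

```python
# Illustrative Keras reconstruction of the described ANN; an assumed sketch, not the authors' exact script.
from tensorflow import keras
from tensorflow.keras import layers

def build_ann(n_features, n_in=29, n_h1=14, n_h2=7):
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(n_in, activation="sigmoid", kernel_initializer="glorot_uniform"),   # "input" layer
        layers.Dropout(0.20),                                                            # 20% dropout for regularization
        layers.Dense(n_h1, activation="sigmoid", kernel_initializer="glorot_uniform"),   # first hidden layer
        layers.Dense(n_h2, activation="sigmoid", kernel_initializer="glorot_uniform"),   # second hidden layer
        layers.Dense(1, activation="sigmoid"),                                           # single output neuron (ASD/TD)
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1, epochs=3000, callbacks=[early_stop])
```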
The performance of each classifier was examined using a variety of evaluation metrics. Accuracy (Acc), confusion matrix (CM), sensitivity (Sn), precision (Pr), F1 score (F1s), log-loss (Lloss), and the AUC-ROC curve are used as performance parameters to analyze this work. All evaluation parameters shown in Table 4 were calculated using the TP, TN, FP, and FN values retrieved from the CM. After evaluating these metrics, we identified the classifiers that give the best results across all datasets; further, we analyzed the overfitting issue based on the training and testing accuracy values of all the models.
Table 4.
Details of evaluation parameters
| Performance parameter | Formula used |
|---|---|
| Acc | (TP + TN) / (TP + TN + FP + FN) |
| Sn | TP / (TP + FN) |
| Pr | TP / (TP + FP) |
| F1s | 2 · (Pr · Sn) / (Pr + Sn) |
| Lloss | −(1/N) Σ [y·log(p) + (1 − y)·log(1 − p)] |
Results
We utilize various Python packages and the GridSearchCV technique to select the optimal hyper-parameter values for the feature transformation and classification tasks. After applying GridSearchCV, we find that (distance = Euclidean, number of neighbors = 7), (kernel = RBF, C = 100, gamma = 0.01), and (number of trees = 20) are the best parameters for the KNN, SVM, and RF classifiers, respectively. Various percentages of data splitting were used in our work.
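As an illustration of this search, the SVM grid could be set up as follows; the candidate grids are assumptions, only the reported best values are from the paper:

```python
# Sketch of the GridSearchCV hyper-parameter search (illustrative grids).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"kernel": ["rbf", "linear"], "C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_)   # reported best: kernel=rbf, C=100, gamma=0.01
```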
Confusion matrix analysis: The effectiveness of the classification models for a specific set of testing data is assessed using a confusion matrix (CM). The CM contains the TP, TN, FP, and FN values.
Accuracy analysis: Accuracy (Acc) is the percentage of correct predictions on the data, obtained by dividing the number of correct predictions by the total number of predictions. The amount of over-fitting in a model is limited when there is minimal variation between test and training accuracy.
AUC-ROC analysis: The ROC curve is a probability curve plotting the true-positive rate against the false-positive rate. The model's performance is directly proportional to the AUC value: if the AUC is 1, the model can perfectly differentiate between the TD and ASD classes.
Sensitivity analysis: Sensitivity (Sn), also called recall, measures the proportion of actual ASD samples that the model correctly identifies.
Precision analysis: Precision (Pr) measures what percentage of the samples predicted as ASD are actually ASD.
F1 score analysis: The F1 score (F1s) combines the Pr and Sn values to measure the performance of a model. Its value ranges from 0 to 1; the closer the F1 score is to 1, the better the model's performance.
Log-loss analysis: The log-loss (Lloss) evaluates how closely the estimated probability corresponds to the actual value. The higher the log-loss, the more the predicted probability deviates from the actual value.
Overfitting analysis: When the difference between training and testing accuracy is high, the model is considered overfitted [14]. There are various common conditions under which overfitting occurs in a model, such as (a) the classifier is too simple to capture the characteristics of the attributes used in the training dataset, (b) the attributes of the training and testing datasets differ from each other, (c) the distribution of the training and testing datasets is non-uniform, and (d) the data is imbalanced and noisy. In this work, we hyper-tuned the classifiers for each dataset using GridSearchCV and then trained them with the best hyper-parameters; due to this tuning, the trained classifiers accurately captured the characteristics of the attributes. We checked the performance of all classifiers on different splits to ensure a uniform training and testing dataset distribution. To balance the data and remove noise (missing, redundant values), we used One-Hot encoding to convert categorical data to numerical data, normalized it with a MinMax scaler, and then replaced the missing values with the kNN imputer.
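The train/test comparison used for this overfitting check, together with the metrics listed above, can be computed with standard scikit-learn functions; a brief sketch (a generic helper, not the authors' code):

```python
# Sketch: compute the train/test metrics and the accuracy gap used for the overfitting check.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score, log_loss,
                             precision_score, recall_score, roc_auc_score)

def evaluate(model, X_train, y_train, X_test, y_test):
    train_acc = accuracy_score(y_train, model.predict(X_train))
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]     # for SVC, construct the model with probability=True
    print("train/test accuracy gap:", train_acc - accuracy_score(y_test, y_pred))
    print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print("sensitivity (recall):", recall_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("F1 score:", f1_score(y_test, y_pred))
    print("log-loss:", log_loss(y_test, y_prob))
    print("AUC:", roc_auc_score(y_test, y_prob))
```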
Performance analysis of toddler dataset
For the toddler dataset, Fig. 5a–e shows that RF gives FP and FN equal to zero for all splits. KNN predicts 182 TD as TD and 391 ASD as ASD, as shown in Fig. 5g for split 2. SVM predicts 7 TD as ASD and 20 ASD as TD, as shown in Fig. 5m for split 3. ANN predicts 2 TD as ASD and 3 ASD as TD, as shown in Fig. 5t for split 5.
Fig. 5.
a–t Confusion matrix of different classifiers on Toddler data set in different splits
As shown in Table 5, RF gives 100% accuracy on the toddler training and testing data for all splits. In contrast, KNN, SVM, and ANN give different training and testing accuracies in every split.
Table 5.
Toddler dataset performance analysis based on different parameters
| Splits | Classifiers | Train Acc | Test Acc | Train Sn | Test Sn | Train Pr | Test Pr | Train F1s | Test F1s |
|---|---|---|---|---|---|---|---|---|---|
| Split 1 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.965 | 0.92 | 0.954 | 0.893 | 0.994 | 0.994 | 0.973 | 0.941 | |
| SVM | 0.996 | 0.975 | 0.997 | 0.973 | 0.997 | 0.991 | 0.997 | 0.982 | |
| ANN (epoch = 1447) | 0.883 | 0.932 | 1 | 0.994 | 0.994 | 0.992 | 0.997 | 0.993 | |
| Split 2 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.95 | 0.905 | 0.932 | 0.872 | 0.992 | 0.992 | 0.961 | 0.928 | |
| SVM | 0.983 | 0.963 | 0.982 | 0.964 | 0.992 | 0.984 | 0.987 | 0.974 | |
| ANN (epoch = 49) | 0.935 | 0.991 | 1 | 0.995 | 0.992 | 0.991 | 0.996 | 0.993 | |
| Split 3 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.955 | 0.911 | 0.944 | 0.878 | 0.99 | 0.993 | 0.966 | 0.932 | |
| SVM | 0.987 | 0.963 | 0.99 | 0.96 | 0.99 | 0.985 | 0.99 | 0.973 | |
| ANN (epoch = 72) | 0.936 | 0.994 | 1 | 0.996 | 0.99 | 0.996 | 0.995 | 0.996 | |
| Split 4 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.928 | 0.906 | 0.902 | 0.873 | 0.992 | 0.99 | 0.945 | 0.928 | |
| SVM | 0.98 | 0.956 | 0.986 | 0.95 | 0.986 | 0.985 | 0.986 | 0.967 | |
| ANN (epoch = 21) | 0.933 | 0.995 | 1 | 0.996 | 0.994 | 1 | 1 | 0.995 | |
| Split 5 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.933 | 0.933 | 0.902 | 0.913 | 1 | 0.99 | 0.948 | 0.95 | |
| SVM | 0.99 | 0.942 | 1 | 0.955 | 0.986 | 0.96 | 0.993 | 0.957 | |
| ANN (epoch = 21) | 0.931 | 0.992 | 1 | 0.996 | 1 | 0.995 | 1 | 0.996 |
Fig. 2p–t shows the AUC values in the ROC curves for all 5 splits of the toddler dataset. The AUC values of RF and ANN are 1 irrespective of the split, whereas the AUC values of KNN and SVM vary across splits.
Fig. 2.
a–t AUC-ROC curve of different classifiers at all datasets in different splits
When we classify the toddler dataset using RF, the Sn, Pr, and F1s values remain constant at 1 in all splits, as shown in Table 5. However, the Sn, Pr, and F1s values of the KNN, SVM, and ANN models fluctuate. This shows that the input data size has an impact on the output class and that the actual class prediction values are influenced by the training and testing data sizes.
The log-loss for the toddler dataset using the RF classifier is approximately zero for all splits, as shown in Fig. 3a, and splits 1 and 5 have the lowest log-loss values for SVM and KNN. Fig. 4a–e shows the log-loss of the ANN classifier, and Fig. 5 shows the confusion matrices of the different classifiers on the toddler dataset in the different splits. The ANN training and testing log-loss remains nearly the same in all splits, whereas the training and testing accuracies change; the difference between training and testing accuracy lies in the range (0.049, 0.059), so there is a chance of overfitting in KNN, SVM, and ANN. The average training and testing accuracies for the toddler dataset are: training RF (100%) > SVM (98.72%) > KNN (94.62%) > ANN (92.36%), and testing RF (100%) > ANN (98.08%) > SVM (95.98%) > KNN (91.5%). RF gives 100% accuracy with no log-loss for the toddler dataset.
Fig. 3.
a–d Log-loss plot of different classifiers on all datasets in different splits
Fig. 4.
a–t Accuracy and log-loss of all datasets using ANN
Performance analysis of child dataset
For the child dataset, Fig. 6a–e shows that RF gives FP and FN values equal to zero for all splits. In contrast, the KNN, SVM, and ANN classifiers give different FP and FN values in different splits; for example, the confusion matrix in Fig. 6g has entries of 73, 9, 11, and 83.
Fig. 6.
a–t Confusion matrix of different classifiers on the child dataset in different splits
Table 6 shows that RF gives 100% training accuracy for the child dataset, with testing accuracy between 95% and 100%. The differences between training and testing accuracy using KNN in splits 1–5 are 0.041, 0.045, 0.001, 0.016, and 0.06, respectively. Similarly, the accuracy differences for SVM and ANN in splits 1–5 are (0.069, 0.005), (0.045, 0.072), (0.022, 0.089), (0.111, 0.08), and (0.163, 0.072), respectively.
Table 6.
Child dataset performance parameters analysis
| Splits | Classifiers | Train Acc | Test Acc | Train Sn | Test Sn | Train Pr | Test Pr | Train F1s | Test F1s |
|---|---|---|---|---|---|---|---|---|---|
| Split 1 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.931 | 0.89 | 0.893 | 0.84 | 0.951 | 0.94 | 0.921 | 0.887 | |
| SVM | 1 | 0.931 | 1 | 0.92 | 1 | 0.945 | 1 | 0.932 | |
| ANN (epoch = 1754) | 0.76 | 0.765 | 1 | 0.946 | 0.956 | 0.922 | 0.977 | 0.934 | |
| Split 2 | RF | 1 | 0.994 | 1 | 0.989 | 1 | 1 | 1 | 0.994 |
| KNN | 0.931 | 0.886 | 0.914 | 0.882 | 0.914 | 0.902 | 0.914 | 0.892 | |
| SVM | 0.982 | 0.937 | 1 | 0.936 | 0.959 | 0.946 | 0.979 | 0.941 | |
| ANN (epoch = 21) | 0.868 | 0.94 | 1 | 0.957 | 0.94 | 0.947 | 0.969 | 0.952 | |
| Split 3 | RF | 1 | 0.995 | 1 | 0.99 | 1 | 1 | 1 | 0.995 |
| KNN | 0.896 | 0.897 | 0.891 | 0.894 | 0.868 | 0.902 | 0.88 | 0.898 | |
| SVM | 0.965 | 0.943 | 1 | 0.942 | 0.925 | 0.951 | 0.961 | 0.946 | |
| ANN (epoch = 475) | 0.864 | 0.953 | 1 | 0.961 | 0.948 | 0.943 | 0.973 | 0.952 | |
| Split 4 | RF | 1 | 0.957 | 1 | 0.916 | 1 | 1 | 1 | 0.956 |
| KNN | 0.879 | 0.863 | 0.761 | 0.791 | 0.888 | 0.931 | 0.82 | 0.855 | |
| SVM | 0.982 | 0.871 | 0.952 | 0.758 | 1 | 0.989 | 0.975 | 0.858 | |
| ANN (epoch = 21) | 0.873 | 0.953 | 1 | 0.958 | 0.913 | 0.958 | 0.954 | 0.958 | |
| Split 5 | RF | 1 | 0.95 | 1 | 0.9 | 1 | 1 | 1 | 0.947 |
| KNN | 0.896 | 0.836 | 0.7 | 0.717 | 1 | 0.94 | 0.823 | 0.813 | |
| SVM | 0.896 | 0.733 | 0.7 | 0.465 | 1 | 1 | 0.823 | 0.635 | |
| ANN (epoch = 21) | 0.881 | 0.953 | 1 | 0.961 | 1 | 0.954 | 1 | 0.958 |
The AUC values of RF and ANN gradually increase with each split, whereas the AUC values of SVM and KNN fluctuate across splits, as shown in Fig. 2k–o.
For the child dataset classified with RF, the Sn, Pr, and F1s values for training and testing remain essentially constant across splits, whereas these values fluctuate for KNN, SVM, and ANN; for example, for the KNN classifier the Sn increases slightly from split 1 to split 2 while the Pr and F1s decrease, as shown in Table 6.
Fig. 3b shows the log-loss of RF, KNN, and SVM for the child dataset; the minimum and maximum log-loss values occur at (split 1, split 5) and (split 3, split 5), respectively. Fig. 4f–j shows the accuracy and log-loss at each split per epoch using ANN. The training and testing log-loss changes by only about 0.001, whereas the difference between training and testing accuracy changes gradually across splits.
Due to the difference between the training and testing accuracy of SVM, KNN, and ANN, there is an overfitting issue in these models. The average training and testing accuracies for the child dataset are: training RF (100%) > SVM (96.5%) > KNN (90.66%) > ANN (84.92%), and testing RF (97.92%) > ANN (91.28%) > SVM (88.30%) > KNN (87.44%). RF gives nearly 100% accuracy with negligible log-loss for the child dataset.
Performance analysis of adult dataset
The CM of RF, KNN, SVM, and ANN over the different splits of the adult dataset is shown in Fig. 7a–o. RF gives no incorrect prediction for the first 4 splits, but in the last split it shows 1 TD case predicted as ASD and 2 ASD cases predicted as TD. Similarly, the KNN, SVM, and ANN classifiers give different TP, TN, FP, and FN values in different splits; for example, Fig. 7m shows 361 TD cases predicted as ASD and 17 ASD cases predicted as TD.
Fig. 7.
a–t Confusion matrix of different classifiers on adult data set in different splits
For the adult dataset, Table 7 gives the Acc, Sn, Pr, and F1s values at the different splits. RF and SVM give 100% training accuracy at most splits, while KNN and ANN give different training and testing accuracies for different splits.
Table 7.
Adult dataset performance parameters analysis
| Splits | Classifiers | Train Acc | Test Acc | Train Sn | Test Sn | Train Pr | Test Pr | Train F1s | Test F1s |
|---|---|---|---|---|---|---|---|---|---|
| Split 1 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.971 | 0.951 | 0.917 | 0.891 | 0.978 | 0.921 | 0.946 | 0.906 | |
| SVM | 0.997 | 0.963 | 1 | 0.913 | 0.989 | 0.943 | 0.994 | 0.928 | |
| ANN (epoch = 1549) | 0.871 | 0.895 | 0.979 | 0.902 | 0.959 | 0.943 | 0.969 | 0.922 | |
| Split 2 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.978 | 0.95 | 0.939 | 0.887 | 0.987 | 0.913 | 0.962 | 0.9 | |
| SVM | 1 | 0.971 | 1 | 0.925 | 1 | 0.961 | 1 | 0.942 | |
| ANN (epoch = 21) | 0.933 | 0.963 | 0.975 | 0.925 | 0.963 | 0.942 | 0.969 | 0.933 | |
| Split 3 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.976 | 0.945 | 0.918 | 0.851 | 1 | 0.931 | 0.957 | 0.889 | |
| SVM | 0.995 | 0.957 | 0.983 | 0.867 | 1 | 0.965 | 0.991 | 0.913 | |
| ANN (epoch = 33) | 0.933 | 0.967 | 0.983 | 0.929 | 0.952 | 0.952 | 0.967 | 0.94 | |
| Split 4 | RF | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| KNN | 0.992 | 0.959 | 0.976 | 0.917 | 1 | 0.924 | 0.988 | 0.92 | |
| SVM | 1 | 0.957 | 1 | 0.876 | 1 | 0.955 | 1 | 0.914 | |
| ANN (epoch = 21) | 0.932 | 0.967 | 0.976 | 0.938 | 0.954 | 0.951 | 0.965 | 0.944 | |
| Split 5 | RF | 1 | 0.995 | 1 | 0.987 | 1 | 0.993 | 1 | 0.99 |
| KNN | 0.971 | 0.963 | 0.92 | 0.908 | 1 | 0.949 | 0.958 | 0.928 | |
| SVM | 1 | 0.949 | 1 | 0.92 | 1 | 0.888 | 1 | 0.904 | |
| ANN (epoch = 21) | 0.924 | 0.967 | 1 | 0.939 | 0.961 | 0.95 | 0.98 | 0.944 |
The ROC curves with AUC values in the different splits of the adult dataset are shown in Fig. 2f–j. RF has the highest AUC value, approximately 1.00 in all splits, followed by ANN (0.996), and so on.
When we use RF to classify the adult dataset, the Sn, Pr, and F1s values remain 1 or nearly 0.99 in all splits, as shown in Table 7, but when we use KNN, SVM, and ANN, the Sn, Pr, and F1s values increase and decrease as the splits change.
Fig. 3d shows the log-loss values at each split for RF, KNN, and SVM. SVM and KNN have their minimum log-loss values at split 2 and split 5, respectively, and their maximum values at split 5 and split 3. The log-loss value for RF is approximately 0.009 for all splits. The log-loss for ANN, shown in Fig. 4p–t for each split per epoch, lies in the range (0.08, 0.13). The average training and testing accuracies for the adult dataset are: training RF (100%) > SVM (99.84%) > KNN (97.76%) > ANN (91.86%), and testing RF (99.90%) > SVM (95.94%) > KNN (95.36%) > ANN (95.18%). RF gives 100% accuracy with minimum log-loss, so RF is the best classifier for the adult dataset across all splits with no overfitting issue.
Performance analysis of adolescent dataset
The adolescent dataset is classified using RF, KNN, SVM, and ANN. The CM of the adolescent dataset on the various splits using the different classifiers is shown in Fig. 8a–o. The CM describes the four prediction cases; for example, Fig. 8r shows 22 TD cases predicted as TD, 6 TD cases predicted as ASD, 1 ASD case predicted as TD, and 44 ASD cases predicted as ASD.
Fig. 8.
a–t Confusion matrix of different classifiers on adolescent dataset in different splits
RF and SVM give 100% training accuracy but 94% and 87% testing accuracy, respectively. KNN and ANN have 91% and 87% training accuracy and 83% and 84% testing accuracy, respectively, as shown in Table 8.
Table 8.
Adolescent dataset performance parameters analysis
| Splits | Classifiers | Train Acc | Test Acc | Train Sn | Test Sn | Train Pr | Test Pr | Train F1s | Test F1s |
|---|---|---|---|---|---|---|---|---|---|
| Split 1 | RF | 1 | 0.98 | 1 | 1 | 1 | 0.968 | 1 | 0.984 |
| KNN | 0.942 | 0.846 | 0.968 | 0.87 | 0.939 | 0.87 | 0.953 | 0.87 | |
| SVM | 1 | 0.934 | 1 | 0.854 | 0.976 | 0.915 | 1 | 0.895 | |
| ANN (epoch = 3000) | 0.683 | 0.665 | 1 | 1 | 0.914 | 0.794 | 0.955 | 0.885 | |
| Split 2 | RF | 1 | 0.936 | 1 | 0.975 | 1 | 0.928 | 1 | 0.951 |
| KNN | 0.951 | 0.857 | 1 | 0.9 | 0.92 | 0.878 | 0.958 | 0.888 | |
| SVM | 1 | 0.825 | 1 | 0.825 | 1 | 0.891 | 1 | 0.857 | |
| ANN (epoch = 41) | 0.903 | 0.863 | 1 | 1 | 0.92 | 0.816 | 0.958 | 0.898 | |
| Split 3 | RF | 1 | 0.972 | 1 | 0.977 | 1 | 0.977 | 1 | 0.977 |
| KNN | 0.935 | 0.849 | 0.944 | 0.888 | 0.944 | 0.869 | 0.944 | 0.879 | |
| SVM | 1 | 0.876 | 1 | 0.955 | 1 | 0.86 | 1 | 0.905 | |
| ANN (epoch = 1323) | 0.913 | 0.891 | 1 | 0.977 | 0.947 | 0.88 | 0.972 | 0.926 | |
| Split 4 | RF | 1 | 0.94 | 1 | 0.943 | 1 | 0.961 | 1 | 0.952 |
| KNN | 0.85 | 0.785 | 0.8 | 0.754 | 0.888 | 0.888 | 0.842 | 0.816 | |
| SVM | 1 | 0.869 | 1 | 0.867 | 1 | 0.92 | 1 | 0.893 | |
| ANN (epoch = 21) | 0.902 | 0.9 | 1 | 0.981 | 0.909 | 0.896 | 0.952 | 0.936 | |
| Split 5 | RF | 1 | 0.893 | 1 | 0.913 | 1 | 0.913 | 1 | 0.913 |
| KNN | 0.9 | 0.84 | 1 | 0.896 | 0.833 | 0.852 | 0.909 | 0.873 | |
| SVM | 1 | 0.872 | 1 | 0.913 | 1 | 0.883 | 1 | 0.898 | |
| ANN (epoch = 314) | 0.963 | 0.903 | 1 | 0.965 | 1 | 0.875 | 1 | 0.918 |
Fig. 2a–e shows the ROC curves with AUC values for the adolescent dataset. RF has the highest AUC value in splits 1 and 3, while the AUC values of KNN, SVM, and ANN vary across splits.
The Sn, Pr, and F1s values for the adolescent dataset derived from the various classifiers are listed in Table 8. The Sn and Pr values of KNN, SVM, and ANN fluctuate more across splits than those of RF.
The log-loss of RF, KNN, and SVM over the adolescent dataset is shown in Fig. 3c; RF has the minimum log-loss in all splits. The log-loss value of ANN reduces as the split size increases, as shown in Fig. 4k–o. ANN has the minimum difference between average training and testing accuracy of 0.02, followed by 0.05 (RF), 0.08 (KNN), and 0.12 (SVM), but RF has the maximum training and testing accuracies of 100% and 94% with the minimum log-loss. The average training and testing accuracies for the adolescent dataset are: training RF (100%) = SVM (100%) > KNN (91.56%) > ANN (87.82%), and testing RF (94.42%) > SVM (87.52%) > ANN (84.44%) > KNN (83.54%). RF gives 100% training accuracy and the highest testing accuracy with the minimum log-loss, so RF is the best classifier for the adolescent dataset across all splits with negligible overfitting.
Discussion
Other researchers [24–26] have worked on subsets of these four datasets for classification, but the overfitting problem still needs to be addressed. Earlier researchers applied several classifiers with feature transformation and selection techniques to all four datasets [27]. According to the authors, the child and toddler datasets had the best accuracy, at 98.77% and 97.10%, respectively, with log-losses of 3.01 and 9.62, and the adolescent and adult datasets had the highest accuracy at 93.89% and 98.36%, with log-losses of 15.81 and 5.64, respectively. They obtained decent accuracy but a large log-loss, showing that a model can still have a high error rate and overfitting. [28] worked on all four datasets using DNN with PCA and achieved an accuracy of 84–89%, which indicates that after reducing the dimension and using a complex model like DNN the prediction rate decreases. As per the comparison results in Tables 9, 10, 11, and 12, it is clear that more complex classifiers such as DNN do not necessarily give better performance.
Table 9.
Comparative study for the toddler dataset
Table 10.
Comparative study for the child dataset
Table 11.
Comparative study for the adolescent dataset
Table 12.
Comparative study for the adult dataset
In the proposed work, we consider all possible age-group datasets (toddler (18–36 months), child (4–11 years), adolescent (12–15 years), and adult (16 years and above)) and apply various preprocessing steps followed by RF, SVM, KNN, and ANN classifiers. We focused on preprocessing the data so that training would be stable and the overfitting issue minimal. Various performance metrics (confusion matrix, accuracy, AUC-ROC, sensitivity, precision, F1 score, and log-loss) and an overfitting analysis were used to assess each classifier. The classifiers separate the ASD and TD classes in each split with 100% (RF), 100% (RF), 100% (RF, SVM), 100% (RF) accuracy; 100% (RF, ANN), 98% (ANN), 94% (ANN), 99% (RF, ANN) AUC; 100% (RF), 100% (RF, ANN), 100% (RF, SVM, ANN), 100% (RF) sensitivity; 100% (RF), 100% (RF), 100% (RF), 100% (RF) precision; and 1 (RF), 1 (RF), 1 (RF, SVM), 1 (RF) F1 score, with RF giving approximately 0 log-loss, for the toddler, child, adolescent, and adult datasets, respectively. According to comparison Tables 9, 10, 11, and 12, RF is found to be the best classifier across the different training and testing data splits for all four datasets and has negligible chances of overfitting. With the kNN imputer at the preprocessing stage, there is no need for neural network models, feature selection, or dimensionality reduction; combining the kNN imputer with simple classifiers such as RF, SVM, and KNN gives more precise and accurate results with minimum resources.
Conclusion and future work
In this work, we have trained four ML classifiers (RF, SVM, KNN, and ANN) to classify ASD from TD using four diverse datasets (adult, adolescent, child, and toddler) from the UCI and Kaggle repositories. Our work differs in that it focuses on addressing prevalent challenges of current ASD prediction techniques: in real-time scenarios, questionnaires are often used to predict ASD, but they can suffer from overfitting, limited datasets, and inadequate coverage of different age groups.
We have looked at the causes of overfitting from preprocessing to training, which is a distinctive approach to model construction, and we have put strategic measures in place at every level to address these problems. For instance, we used the kNN imputer to handle missing data while addressing sparsity, particularly missing categorical attribute values. Furthermore, one-hot encoding proved useful in converting categorical attributes to numerical ones and guaranteeing the inclusion of distinct attributes for filling in missing values. To address an imbalanced distribution of data during model training, we used five different splits. To provide a strong check on the absence of overfitting, we carefully evaluated several performance metrics after training, including accuracy, log-loss, F1 score, precision, and sensitivity, for both the training and test datasets. We observed the accuracy on the training and testing datasets and found that the difference between them is minimal, which ensures that there is no overfitting issue. Our research demonstrates that early ASD diagnosis is feasible, even with limited datasets, opening up a viable path for precise real-time predictions.
Future studies can focus on hybrid models that combine several dataset types to create an ASD prediction tool, and new questionnaires can be created based on the Indian Scale for Assessment of Autism (ISAA) and the INCLEN Diagnostic Tool for Autism Spectrum Disorder (INDT-ASD) to gather more information. We were motivated by the need for a machine-learning-based, dependable, real-time ASD prediction model that addresses the issue of overfitting, deals with the constraints of limited datasets, and covers a broad age range, from 18 months to 16 years and older.
Acknowledgements
The authors are grateful to the Ministry of Education and Indian Institute of Information Technology, Allahabad for supplying the necessary materials required for completing this work.
Appendix A
Data availability
The datasets used in this paper are available in references [15–17], and [14] and Section Questionnaire Datasets.
Declarations
Conflict of interest
The authors have no competing interests relevant to the content of this work.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Vrijendra Singh and Anupam Agrawal have contributed equally to this work.
References
- 1.Autism: World Health Organization, Autism spectrum disorders. World Health Organization. Last checked on 2022; 26, 07, 2022
- 2.Heinsfeld AS, Franco AR, Craddock RC, Buchweitz A, Meneguzzi F. Identification of autism spectrum disorder using deep learning and the abide dataset. NeuroImage Clin. 2018;17:16–23. 10.1016/j.nicl.2017.08.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Azer SA, Bokhari RA, AlSaleh GS, Alabdulaaly MM, Ateeq KI, Guerrero APS, Azer S. Experience of parents of children with autism on youtube: are there educationally useful videos? Inform Health Soc Care. 2018;43(3):219–33. 10.1080/17538157.2018.1431238. [DOI] [PubMed] [Google Scholar]
- 4.Franz L, Adewumi K, Chambers N, Viljoen M, Baumgartner JN, De Vries PJ. Providing early detection and early intervention for autism spectrum disorder in south Africa: stakeholder perspectives from the western cape province. J Child Adolesc Mental Health. 2018;30(3):149–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Pagnozzi AM, Conti E, Calderoni S, Fripp J, Rose SE. A systematic review of structural mri biomarkers in autism spectrum disorder: a machine learning perspective. Int J Dev Neurosci. 2018;71:68–82. 10.1016/j.ijdevneu.2018.08.010. [DOI] [PubMed] [Google Scholar]
- 6.Kosmicki J, Sochat V, Duda M, Wall D. Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Transl Psychiatry. 2015;5(2):514–514. 10.1038/tp.2015.7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Bone D, Bishop SL, Black MP, Goodwin MS, Lord C, Narayanan SS. Use of machine learning to improve autism screening and diagnostic instruments: effectiveness, efficiency, and multi-instrument fusion. J Child Psychol Psychiatry. 2016;57(8):927–37. 10.1111/jcpp.12559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Satu MS, Sathi FF, Arifen MS, Ali MH, Moni MA. Early detection of autism by extracting features: a case study in bangladesh. In: 2019 International conference on robotics, electrical and signal processing techniques (ICREST), pp. 400–405 (2019). IEEE. 10.1109/ICREST.2019.8644357
- 9.Jumaa N, Salman A, Al-Hamdani D. The autism spectrum disorder diagnosis based on machine learning techniques. J Xian Univ Architect Technol. 2020;12:575–83. [Google Scholar]
- 10.Mujeeb Rahman K, Monica Subashini M. A deep neural network-based model for screening autism spectrum disorder using the quantitative checklist for autism in toddlers (qchat). J Autism Dev Disord. 2022;52(6):2732–46. 10.1007/s10803-021-05141-2. [DOI] [PubMed] [Google Scholar]
- 11.Thabtah F, Spencer R, Abdelhamid N, Kamalov F, Wentzel C, Ye Y, Dayara T. Autism screening: an unsupervised machine learning approach. Health Inform Sci Syst. 2022;10(1):26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Allison C, Auyeung B, Baron-Cohen S. Toward brief red flags for autism screening: the short autism spectrum quotient and the short quantitative checklist in 1,000 cases and 3,000 controls. J Am Acad Child Adolesc Psychiatry. 2012;51(2):202–12. 10.1016/j.jaac.2011.11.003. [DOI] [PubMed] [Google Scholar]
- 13.Thabtah F, Kamalov F, Rajab K. A new computational intelligence approach to detect autistic features for autism screening. Int J Med Inform. 2018;117:112–24. 10.1016/j.ijmedinf.2018.06.009. [DOI] [PubMed] [Google Scholar]
- 14.Thabtah F. Autism screening data for Toddlers. Kaggle. Last checked on 2018; 26, 07, 2022
- 15.Thabtah F. Autistic spectrum disorder screening data for children data set. University of California, Irvine, School of Information and Computer Sciences. Last checked on 2017; 26, 07, 2022
- 16.Thabtah FF. Autistic spectrum disorder screening data for adolescent data set. University of California, Irvine, School of Information and Computer Sciences. Last checked on 2017; 26, 07, 2022
- 17.Thabtah FF. Autism screening adult data set. University of California, Irvine, School of Information and Computer Sciences. Last checked on 2017; 26, 07, 2022
- 18.Kumar CJ, Das PR. The diagnosis of asd using multiple machine learning techniques. Int J Dev Disabil. 2021. 10.1080/20473869.2021.1933730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Musa RA, Manaa ME, Abdul-Majeed G. Predicting autism spectrum disorder (asd) for toddlers and children using data mining techniques. J Phys: Conf Ser. 2021;1804: 012089. 10.1088/1742-6596/1804/1/012089. [Google Scholar]
- 20.Erkan U, Thanh DN. Autism spectrum disorder detection with machine learning methods. Curr Psychiatry Res Rev Form: Curr Psychiatry Rev. 2019;15(4):297–308. 10.2174/2666082215666191111121115. [Google Scholar]
- 21.Vaishali R, Sasikala R. A machine learning based approach to classify autism with optimum behaviour sets. Int J Eng Technol. 2018;7(4):18. [Google Scholar]
- 22.Raj S, Masood S. Analysis and detection of autism spectrum disorder using machine learning techniques. Procedia Comput Sci. 2020;167:994–1004. 10.1016/j.procs.2020.03.399. [Google Scholar]
- 23.Mohan P, Paramasivam I. Feature reduction using svm-rfe technique to detect autism spectrum disorder. Evol Intel. 2021;14(2):989–97. 10.1007/s12065-020-00498-2. [Google Scholar]
- 24.Vakadkar K, Purkayastha D, Krishnan D. Detection of autism spectrum disorder in children using machine learning techniques. SN Comput Sci. 2021;2(5):1–9. 10.1007/s42979-021-00776-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Omar KS, Mondal P, Khan NS, Rizvi MRK, Islam MN. A machine learning approach to predict autism spectrum disorder. In: 2019 International conference on electrical, computer and communication engineering (ECCE), pp. 1–6 (2019). IEEE. 10.1109/ECACE.2019.8679454
- 26.Thabtah F, Peebles D. A new machine learning model based on induction of rules for autism detection. Health Inform J. 2020;26(1):264–86. 10.1177/1460458218824711. [DOI] [PubMed] [Google Scholar]
- 27.Akter T, Satu MS, Khan MI, Ali MH, Uddin S, Lio P, Quinn JM, Moni MA. Machine learning-based models for early stage detection of autism spectrum disorders. IEEE Access. 2019;7:166509–27. 10.1109/ACCESS.2019.2952609. [Google Scholar]
- 28.Mohanty AS, Parida P, Patra K. Identification of autism spectrum disorder using deep neural network. J Phys: Conf Ser. 2021;1921: 012006. 10.1088/1742-6596/1921/1/012006. [Google Scholar]
- 29.Biessmann F, Rukat T, Schmidt P, Naidu P, Schelter S, Taptunov A, Lange D, Salinas D. Datawig: missing value imputation for tables. J Mach Learn Res. 2019;20(175):1–6. [Google Scholar]
- 30.Zhang S. Nearest neighbor selection for iteratively knn imputation. J Syst Softw. 2012;85(11):2541–52. 10.1016/j.jss.2012.05.073. [Google Scholar]
- 31.Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40. 10.1007/BF00058655. [Google Scholar]
- 32.Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85. 10.1080/00031305.1992.10475879. [Google Scholar]
- 33.Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97. 10.1007/BF00994018. [Google Scholar]
- 34.Anggoro DA, Novitaningrum D. Comparison of accuracy level of support vector machine (svm) and artificial neural network (ann) algorithms in predicting diabetes mellitus disease. ICIC Express Lett. 2021;15(1):9–18. [Google Scholar]







