ABSTRACT
Autism is a neurodevelopmental condition affecting ~1% of the population. Recently, machine learning models have been trained to classify participants with autism using their neuroimaging features, though the performance of these models varies in the literature. Differences in experimental setup hamper the direct comparison of different machine‐learning approaches. In this paper, five of the most widely used and best‐performing machine learning models in the field were trained to classify participants with autism and typically developing (TD) participants, using functional connectivity matrices, structural volumetric measures, and phenotypic information from the Autism Brain Imaging Data Exchange (ABIDE) dataset. Their performance was compared under the same evaluation standard. The models implemented included: graph convolutional networks (GCN), edge‐variational graph convolutional networks (EV‐GCN), fully connected networks (FCN), autoencoder followed by a fully connected network (AE‐FCN) and support vector machine (SVM). Our results show that all models performed similarly, achieving a classification accuracy around 70%. Our results suggest that different inclusion criteria, data modalities, and evaluation pipelines rather than different machine learning models may explain variations in accuracy in the published literature. The highest accuracy in our framework was obtained when using ensemble models (p < 0.001), leading to an accuracy of 72.2% and AUC = 0.77 using GCN classifiers. However, an SVM classifier performed with an accuracy of 70.1% and AUC = 0.77, just marginally below GCN, and significant differences were not found when comparing different algorithms under the same testing conditions (p > 0.05). Furthermore, we also investigated the stability of features identified by the different machine learning models using the SmoothGrad interpretation method. The FCN model demonstrated the highest stability in selecting relevant features contributing to model decision making. The code is available at https://github.com/YilanDong19/Machine‐learning‐with‐ABIDE.
Keywords: ABIDE, ensemble methods, functional MRI, interpretation, machine learning, stability, structural MRI
Five machine‐learning models were trained to classify autism using ABIDE. Similar performance suggested accuracy variations in the literature may stem from inclusion criteria, modalities, and evaluation pipelines. SmoothGrad investigated the stability of features identified by models and showed that structural and functional features from the ventricles and temporal cortex contribute to autism identification.

1. Introduction
Autism is a developmental condition characterized by deficits in social communication, social reciprocity, repetitive and stereotyped behaviors and interests (del Barrio 2004), and atypical responses to sensory stimuli (Green et al. 2013). Around 1% of the general population is diagnosed with autism (World Health Organization 2023), and symptoms such as early language delay usually appear when children are 2–3 years old (Landa 2008; Stefanatos 2008). Autism's core symptoms and co‐occurring conditions such as depression and learning disabilities (Stewart et al. 2006) often lead to serious challenges in the physical and mental well‐being of the individuals with autism and their families. At present, the medical diagnosis of autism is largely based on behavioral observations and clinical interviews, and the underlying neural mechanisms of autism are still unclear (Yahata et al. 2016). The insights provided by human neuroimaging studies may contribute to the development of autism biomarkers (Ecker et al. 2015).
In recent years, with the rapid development of magnetic resonance imaging (MRI) technology, we have gradually gained a deeper understanding of atypical brain function and structure linked to neurodevelopmental conditions such as autism. For instance, several structural MRI studies have reported differences in total brain volume (Courchesne 2002) and brain asymmetry (Postema et al. 2019) in participants with autism. Recent studies have also shown alterations in cortical thickness (CT) (Moradi et al. 2017) and subcortical volume (SV) (Katuwal et al. 2015).
Resting‐state functional MRI (rs‐fMRI) is widely used to describe long‐range functional relationships between different brain regions in brain disorders (Du et al. 2018) and has been used to characterize atypical functional connectivity in autism. Widespread reductions in connectivity among children with autism have been reported, spanning unimodal, heteromodal, primary somatosensory, and limbic and paralimbic cortices, whereas those with autism have had increased connectivity between a small set of nodes primarily in subcortical regions (Di Martino et al. 2014). More recently, a pattern of hyperconnectivity in prefrontal and parietal cortices and hypoconnectivity in sensory‐motor regions has been suggested to be consistent and reproducible across cohorts (Holiga et al. 2019).
However, neuroimaging studies investigating autism often report heterogeneous or even contradictory findings. For example, the volumetric differences between autism and typically developing participants (TD) reported by Aylward et al. (2002) were not replicated by Courchesne (2002). Differing results can be partially explained by biological heterogeneity within the population, variations in the imaging protocols, scanning parameters, and the scanner manufacturers, as well as variations in preprocessing and model evaluation methods. To date, no effective biomarker has yet been developed to enable a more objective diagnosis and stratification of autism, including the identification of subgroups that may benefit from therapeutic interventions.
In addition to traditional statistical approaches, modern artificial intelligence techniques have recently been employed with the hope that they may help to uncover these differences. In the past 10 years, classic machine learning techniques have been used to mine a wealth of information in structural MRI (sMRI) and rs‐fMRI to conduct autism research. Table 1 lists several experiments that were conducted using statistical and classical machine learning approaches. For instance, Anderson et al. (2011) used Pearson's correlation coefficients to construct functional connectivity matrices from the time series of each pair of Regions of Interest (ROIs), then applied a two‐tailed t‐test to identify a subset of connections that were significantly different between the autistic and TD participants. They subsequently employed a linear classifier to classify individuals with autism from TD participants based on the selected connections, resulting in an accuracy 79%. Ecker et al. (2010) performed one of the first studies in which autism was predicted based on structural brain measures. Segmented gray matter (GM) and white matter (WM) images comprised an input into a support vector machine (SVM) classifier, resulting in an accuracy of 86%. Sabuncu and Konukoglu (2015) utilized the sMRI volumetric features from FreeSurfer software (gray matter volume, average thickness, etc.) to classify autistic participants, achieving a classification accuracy of 60% Abraham et al. (2017) proposed a pipeline based on multi‐subject dictionary learning (MSDL) atlas, tangent embedding, and support vector classifier (SVC) with L2 regularization, achieving a prediction accuracy of 67% prediction accuracy.
TABLE 1.
Summary of statistical and classical machine learning approaches predicting autism from MRI features.
| Authors | Approaches | Datasets | Evaluation methods | Best reported performance |
|---|---|---|---|---|
| Anderson et al. (2011) | Linear classifier | 40 autism, 40 TD (fMRI) | Leave‐one‐out | 79% |
| Ecker et al. (2010) | SVM | 22 autism, 22 TD (sMRI) | Leave‐one‐out | 86% |
| Sabuncu and Konukoglu (2015) | SVM | 325 autism, 325 TD (sMRI) | 5‐fold cross‐validation | 60% |
| Abraham et al. (2017) | SVM | 403 autism, 468 TD (ABIDE, fMRI) | 10‐fold cross‐validation | 67% |
| Nielsen et al. (2013) | Linear classifier | 477 autism, 517 TD (ABIDE, fMRI) | Leave‐one‐out | 60% |
However, other authors have found that the heterogeneity of data from multiple collection sites has a significant impact on the experimental results. For example, Nielsen et al. (2013) performed a similar experiment as Anderson et al. (2011) with a bigger sample size but obtained very different results. They utilized fMRI data from 964 participants (16 sites from ABIDE dataset) rather than 80 participants, which resulted in the classification accuracy decrease from 79% (Anderson et al. 2011) to 60% (Nielsen et al. 2013). Wolfers et al. (2019) summarized the results of 57 studies and observed a trend toward decreasing accuracy with increasing sample size in autism classification, suggesting that increasingly large sample sizes may result in low accuracies because of the intrinsic heterogeneity of autism, but may allow for the identification of more robust decision functions.
Deep learning models have achieved results comparable to human expert performance in many fields, such as speech, natural language processing, and computer vision (Alasasfeh et al. 2021; Ciregan et al. 2012). They are also widely applied in the medical imaging field, attempting to obtain more accurate experimental results than with classic machine learning approaches. Table 2 lists several experiments conducted to predict autism using deep learning approaches. Heinsfeld et al. (2018) used deep neural networks with two autoencoders, reaching 70% classification accuracy in distinguishing autistic and TD participants. With the supplement of structural MRI data, Rakić et al. (2020) achieved 85% accuracy by using an autoencoder followed by a fully connected network classifier (AE‐FCN) and Ensembles of Multiple Models and Architectures (EMMA), combining information from functional MRI and structural MRI. Parisot et al. (2018) developed a graph convolutional neural network (GCN) to incorporate phenotypic information. The best accuracy obtained by this model was 70.4%. Based on the original GCN model, Huang and Chung (2020) proposed an edge‐variational graph convolutional neural network (EV‐GCN) with a reported classification accuracy of 81%.
TABLE 2.
Summary of deep learning approaches predicting autism from MRI features.
| Authors | Approaches | Datasets | Evaluation methods | Best reported performance |
|---|---|---|---|---|
| Heinsfeld et al. (2018) | Stacked denoising autoencoders + FCN | 505 autism, 530 TD (ABIDE, fMRI) | 10‐fold cross‐validation | 70% |
| Rakić et al. (2020) | Stacked autoencoders + FCN | 368 autism, 449 TD (ABIDE, fMRI and sMRI) | 10‐fold cross‐validation | 85% |
| Parisot et al. (2018) | GCN | 403 autism, 468 TD (ABIDE, fMRI) | 10‐fold cross‐validation | 70.4% |
| Huang and Chung (2020) | EV‐GCN | 403 autism, 468 TD (ABIDE, fMRI) | 10‐fold cross‐validation | 81% |
The complexity of deep learning models renders machine decision‐making opaque and significantly reduces the trust of doctors and patients in artificial intelligence (Singh et al. 2020). To overcome this weakness, intensive research on improving the interpretability of machine learning models has emerged, and a plethora of interpretation methods have been proposed to help researchers understand their inner workings mechanism (Salahuddin et al. 2022). For example, Garg and Garg (2020) utilized the Gradient‐weighted Class Activation Mapping (Grad‐CAM) (Selvaraju et al. 2017) and SmoothGrad (Smilkov et al. 2017) to picture the saliency images of the CNN model when classifying cancer histopathological images into malignant and benign categories.
In this study, we aimed to evaluate the performance of five previously proposed machine and deep learning models for the classification of autism under a standardized setting. We compared the performance of such models using a consistent, repeatable framework, addressing inconsistencies in training datasets and evaluation frameworks reported in the literature. We proposed:
A standardized and comprehensive cross‐validated evaluation framework, which fitted the models to the training set, tuned parameters on the validation set, and evaluated the performance on the test set. The whole fitting process was performed multiple times, while rotating the test set so that the performance was evaluated on each sample exactly once. This framework both avoids any overfitting and gives a robust performance independent of the selection of the test set.
A selection of five of the most widely used or best‐performing machine learning models from the existing literature, namely SVM (Bharadwaj et al. 2021), FCN (Rumelhart et al. 2013), AE‐FCN (Rakić et al. 2020), GCN (Parisot et al. 2018) and EV‐GCN (Huang and Chung 2020), was used to classify autistic and TD participants using the ABIDE dataset. All classifiers were trained and evaluated using our standardized cross‐validated evaluation framework to allow fair comparison of their performance.
To provide a comprehensive evaluation, we trained our classifier with six different combinations of feature sets: (a) structural MRI (sMRI) features, (b) sMRI + non‐imaging features, (c) functional MRI (fMRI) features, (d) fMRI + non‐imaging features, (e) sMRI + fMRI features, (f) sMRI + fMRI + non‐imaging features
We compared the performance of the individual models, as well as an ensemble of the models trained on different subsets of the data.
We applied the SmoothGrad interpretation methods to FCN, AE‐FCN, GCN, and EV‐GCN to study model stability and understand what features contributed to model decision‐making.
2. Methodology
2.1. ABIDE Dataset
The Autism Brain Imaging Data Exchange (ABIDE) database has aggregated brain structure and functional imaging data from multiple research institutes around the world to accelerate the research on the neural mechanisms of autism. Up to now, the ABIDE project has formed two large data sets: ABIDE I and ABIDE II (Di Martino et al. 2014). In this work, we utilize extracted structural and functional features available for the ABIDE I dataset (https://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html).
2.1.1. Participants
A total of 870 individuals (403 autistic and 467 TD participants) from 20 different collection sites were used in this study (Table 3). Our dataset was based on that used by Abraham et al. (2017), which comprises all ABIDE I participants, except for the ones with incomplete brain coverage and scanner artefacts. Additionally, we excluded one participant due to a FreeSurfer preprocessing failure.
TABLE 3.
The 20 collection sites of ABIDE.
| Collection sites | N | Average age | Sex (male/female) | Autism/TD |
|---|---|---|---|---|
| PITT | 50 | 18.50 | 43/7 | 24/26 |
| OLIN | 28 | 17.04 | 23/5 | 14/14 |
| OHSU | 25 | 10.81 | 25/0 | 12/13 |
| NYU | 172 | 15.33 | 136/36 | 74/98 |
| SBL | 26 | 33.77 | 26/0 | 12/14 |
| SDSU | 27 | 14.36 | 21/6 | 8/19 |
| STANFORD | 25 | 9.99 | 18/7 | 12/13 |
| TRINITY | 44 | 17.03 | 44/0 | 19/25 |
| UCLA_2 | 21 | 12.47 | 19/2 | 11/10 |
| UM_1 | 86 | 13.77 | 61/25 | 34/52 |
| UM_2 | 34 | 16.01 | 32/2 | 13/21 |
| USM | 67 | 22.59 | 67/0 | 43/24 |
| YALE | 41 | 13.31 | 25/16 | 22/19 |
| CALTECH | 15 | 26.79 | 10/5 | 5/10 |
| CMU | 11 | 26.82 | 7/4 | 6/5 |
| KKI | 33 | 10.31 | 24/9 | 12/21 |
| LEUVEN_1 | 28 | 22.43 | 28/0 | 14/14 |
| LEUVEN_2 | 28 | 14.17 | 21/7 | 12/16 |
| MAX_MUN | 45 | 26.49 | 41/4 | 19/26 |
| UCLA_1 | 64 | 13.35 | 55/9 | 37/27 |
2.1.2. Non‐imaging Features
Each sample used in this paper had several phenotypic and fMRI image quality measures, including age, gender, collection site, full scale IQ (FIQ), the number of timepoints with motion outliers (NUM), percentage of timepoints with motion outliers (PEC), and quality control anatomical rate (RAT). We used the collection site information in the GCN model (Section 2.3.4) to generate the connections between samples in the population graph.
2.1.3. Functional MRI Features
We downloaded the preprocessed fMRI data from the ABIDE website directly (http://preprocessed‐connectomes‐project.org/abide/download.html). Briefly, the preprocessing steps include skull stripping, slice timing correction, motion correction, global mean intensity normalization, nuisance signal regression, and band‐pass filtering (0.01–0.1 Hz). The fMRI BOLD timeseries were averaged for each brain region of Cameron Craddock's 200 ROI (CC200) atlas (Craddock et al. 2012) and Pearson's correlation coefficients were calculated for each pair of ROIs to construct functional connectivity matrices for each participant.
2.1.4. Structural MRI Features
All the raw sMRI data were pre‐processed by FreeSurfer software with recon‐all command, which includes intensity normalization, skull stripping, registration of the volumes to a common space, segmentation, cortical, white matter, and subcortical parcellation.
Desikan–Killiany cortical atlas with 68 regions was selected as the atlas for cortical parcellation (Desikan et al. 2006), each region was assigned nine features: number of vertices, surface area, gray matter volume, average thickness, thickness standard deviation (SD), integrated rectified mean curvature, integrated rectified gaussian curvature, folding index, and intrinsic curvature index. We also used metrics from 115 noncortical regions (white matter, ventricles, and subcortical regions) (Fischl et al. 2002; Salat et al. 2009). Each noncortical region was assigned seven features: number of voxels, volume, normalized intensity mean, normalized intensity SD, normalized intensity minimum, normalized intensity maximum, and normalized intensity range. The selection criteria of structural features are introduced in Section 2.2.2. In Table 4, statistical outputs evaluated are summarized. Among these, we selected statistical outputs from the cortical parcellation, subcortical, and white matter as the structural features.
TABLE 4.
The FreeSurfer statistical outputs evaluated.
| FreeSurfer outputs | Measures |
|---|---|
| Statistical outputs from cortical surface extraction (Desikan et al. 2006) |
Number of vertices Surface area Gray matter volume Average thickness Thickness SD Integrated rectified mean Curvature Integrated rectified Gaussian Curvature Folding index Intrinsic curvature index |
| Statistical output from Freesurfer subcortical parcellation including CSF and Ventricular regions (Fischl et al. 2002) |
Number of voxels Volume Normalized intensity mean Normalized intensity SD Normalized intensity min Normalized intensity max Normalized intensity range |
| Statistical output from Freesurfer white matter parcellation (Salat et al. 2009) |
2.2. Input Features
Due to a large number of features, especially in functional connectivity matrices, feature selection is a necessary step before features are input to the model, ensuring that the model is not overfitted and generalizes to new data well. Inputting all connectivity features into the models would lead to overfitting, decreasing model performance on the test set. We performed feature selection on fMRI and sMRI features separately to create input for the fMRI‐only and sMRI‐only models. For the joint model, we concatenated the selected sMRI and fMRI features.
2.2.1. fMRI Features
Due to the symmetry of the functional connectivity matrices, we first extracted the upper triangle of the functional connectivity matrix for each sample. The selected features were standardized to remove the mean and scale to unit variance, making sure all the features had similar distributions. We then performed recursive feature elimination using a ridge classifier. This approach has been previously shown to yield better results than alternative dimensionality reduction methods like autoencoders and principal component analysis (PCA) (Parisot et al. 2018). This feature selection step was conducted within each fold of both cross‐validation and nested cross‐validation. Specifically, for the 5‐fold cross‐validation, feature selection was performed five times (one per fold). In nested cross‐validation, feature selection was repeated 25 times (once for each of the five inner loops within the five outer loops).
To determine an appropriate number of functional features for training the machine learning models, we evaluated the performance of the baseline kernel SVM classifier (gamma = “scale”) using the grid search of different numbers of selected input features (1000, 2000, …, 19,900) produced by recursive feature elimination. We found that selecting 4000 features with a regularization parameter of C = 1 resulted in optimal cross‐validated accuracy in the prediction of autism and TD classes. Therefore, we select 4000 functional connectivity features in each fold, flattened into a 1‐D array, as the fMRI input for the machine learning models.
2.2.2. sMRI Features
We selected statistical outputs from cortical surface extraction, subcortical, and white matter parcellations as structural features (Table 4). From the three cortical parcellation atlases provided by FreeSurfer, we chose the Desikan–Killiany Atlas, considering it is widely used in the literature (Mizuno et al. 2019; Zabihi et al. 2019).
To create the sMRI input features, we concatenated the cortical surface extraction output features for the Desikan–Killiany atlas and subcortical and white matter parcellation outputs. This resulted in 1417 structural features flattened into a 1‐D array. Due to the variation in feature distribution, standardization is applied to all the features to set the mean to zero and scale to unit variance. We then applied the SVM classifier using the grid search of different numbers of selected input features (100, 200, …, 1400, 1417) produced by RFE. We selected 800 structural features flattened into a 1‐D array as the optimal number of sMRI features to input in further machine learning models.
2.2.3. Non‐imaging Features
Considering each participant has only six non‐imaging measures: age, gender, FIQ, NUM, PEC, RAT, we simply concatenated them together without employing feature selection when preparing the input features for the following three scenarios: (a) sMRI + non‐imaging features, (b) fMRI + non‐imaging features, (c) sMRI + fMRI + non‐imaging features.
2.3. Machine Learning Models
In this paper, we investigated five machine learning models to train classifiers to predict Autism and TD classes: support vector machine (SVM) (Bharadwaj et al. 2021), fully connected network (FCN) (Rumelhart et al. 2013), autoencoder followed by the fully connected network (AE‐FCN) (Rakić et al. 2020), graph convolutional network (GCN) (Parisot et al. 2018), and edge‐variational graph convolutional network (EV‐GCN) (Huang and Chung 2020).
2.3.1. Kernel SVM
We chose a Kernel SVM model with a Gaussian kernel (‘rbf’) to create a benchmark classifier to compare with neural network‐based models, as shown in Figure 1. It is a fast and flexible non‐linear classifier with a tunable kernel size and regularization (Bharadwaj et al. 2021).
FIGURE 1.

A support vector machine (SVM) for two‐group classification problems. The kernel SVM is implemented in scikit‐learn. We select the Gaussian kernel with kernel size set to “scale.” The regularization parameter “C” is selected using the grid search algorithm through cross‐validation and is adapted separately for each fold.
2.3.2. FCN
The first deep neural network applied was a fully connected neural network (FCN) (Rumelhart et al. 2013). Its architecture is shown in Figure 2.
FIGURE 2.

The structure of FCN. Linear layers are fully connected with each other. The white arrow on the right is the softmax layer, its output will be used to calculate cross‐entropy loss with labels. The number of nodes in each layer: N for the input layer of FCN (where N matches the dimensionality of the input feature sets), 500 for hidden layer 1, 30 for hidden layer 2, 2 for the output layer.
The network consisted of three linear (fully connected) layers. To prevent this model from overfitting, the dropout layers (dropout rate = 0.5) were added after the first and second linear layers.
2.3.3. Autoencoder + FCN
Rakić et al. (2020) applied an autoencoder followed by the fully connected network (AE‐FCN) to the autism classification problem. This architecture aims to reduce the dimension of the input features by creating a bottleneck. The performance was further improved by using Ensembles of Multiple Models and Architectures (EMMA) (Kamnitsas et al. 2017).
We have implemented AE‐FCN based on the description provided by Rakić et al. (2020). We optimized the parameters of the network ourselves as they were not detailed in the original study. Rather than relying solely on a cross‐entropy loss function for autism classification, we utilized both the cross‐entropy loss function for classification and the mean squared error loss function to maintain consistency between the autoencoder's input and output. Additionally, Rakić's paper did not specify the number of nodes in each layer and activation functions; therefore, we determined the configuration by comparing the validation accuracy in our autism classification task, which aimed to enhance model performance. The structure of our optimized model is shown in Figure 3.
FIGURE 3.

The structure of AE‐FCN. The number of nodes in hidden layers 1 and 3 is the same, and hidden layers 2 and 5 have the same number of nodes. Mean squared error (MSE) loss is applied to train the autoencoder to produce output similar to input, while cross‐entropy loss is applied to give the correct prediction for each individual. The total loss is equal to MSE loss plus cross‐entropy loss. The number of nodes in each layer: N for the input layer of the autoencoder (where N matches the dimensionality of the input feature sets), 300 for hidden layer 1, 150 for hidden layer 2, 300 for hidden layer 3, N for the output layer of the autoencoder (where N again matches the dimensionality of the input layer of the FCN), 300 for hidden layer 4, 16 for hidden layer 5, and 2 for the output layer.
2.3.4. GCN
GCN was proposed for autism classification by Parisot et al. (2018). This network allows processing the samples based on their similarity, which can be calculated from imaging, phenotypic, and acquisition‐related information, to reduce the dimensionality of the feature space.
The structure of the GCN model used in this paper is shown in Figure 4. First, a population graph that reflects the similarity of individual samples needs to be calculated. The goal is to leverage the complementary non‐imaging information to calculate similarities between participants to create a graph structure and thus exploit the power of graph convolutions (Parisot et al. 2018). In our implementation, the subject‐to‐subject similarity, which is used as the weights of the edges of the graph, was calculated to indicate whether the participants were imaged in the same (weight = 1) or different (weight = 0) collection site. We experimented with other designs of the population graph that included other phenotypic and image quality information; however, we did not find any benefit in performance compared to only including site information.
FIGURE 4.

The structure of three layers GCN model. In the population graph, only the nodes with the same color have a connection between them (black line). Features of each sample are the white squares on the node. Cross‐entropy loss is applied at the end. The feature channels in each layer: N for the input layer of GCN (where N matches the dimensionality of the input feature sets), 128 for hidden layer 1, 128 for hidden layer 2, 2 for the output layer.
2.3.5. EV‐GCN
Edge‐variational graph convolutional network (Huang and Chung 2020) has been shown to significantly outperform GCN (Parisot et al. 2018) see Table 2. Figure 5 presents the structure of EV‐GCN. It consisted of a pairwise association encoder (PAE), an edge dropout layer (ED), four Chebyshev graph convolution layers, and one fusion block followed by two fully connected layers. The pairwise association encoder (PAE) generates an adaptive population graph; therefore, connections change during the training process. The fusion block fuses the hidden features at each depth to alleviate the over‐smoothing problem in deep GCN models.
FIGURE 5.

The structure of EV‐GCN (Huang and Chung 2020). The connections in the Adaptive Population Graph are constructed by the PAE block with only non‐image (phenotypic) information, and they will be changed in the training process by the backpropagation of the cross‐entropy loss function. The parameters in each GC layer are N for the input layer of EV‐GCN (where N matches the dimensionality of the input feature sets), 16 for hidden layer 1, 16 for hidden layer 2, 16 for hidden layer 3, and 16 for the output layer. Fusion block: Concatenate the four outputs from previous GC layers. MLP: multilayer perceptron, with parameters: 64 for the input layer (64 = 16 × 4), 256 for hidden layer 1, and 2 for the output layer. ED: edge dropout layer. GC: graph convolution layer.
We have adapted publicly available EV‐GCN code (https://github.com/SamitHuang/EV_GCN) and applied it to the ABIDE dataset in our evaluation framework (see Section 2.5). EV‐GCN constructs the adaptive population graph by inputting “gender” and “site” phenotypic information to PAE. In contrast to Huang and Chung's approach of taking only the upper triangular portion of the population graph as the graph input, we take a different approach in EV‐GCN by inputting the entire population graph.
2.4. Ensemble Methods
The ensemble methods combine multiple machine learning models to improve classification performance. We investigated two types of ensembles: max voting and Ensembles of Multiple Models and Architectures (EMMA).
In max voting, multiple classification models are used to make predictions for each data point, and the prediction of each model is considered a “vote.” The final predictions are the labels that obtained the majority of the votes (Dietterich 2000). We applied max voting to aggregate responses from the five models of the same architecture trained during cross‐validation (See Section 2.5).
The purpose of EMMA is to obtain robust performance by aggregating the predictions from models with different architectures (Kamnitsas et al. 2017). In our experiments, we used EMMA to combine the outputs of all five models considered in this paper (SVM, FCN, AE‐FCN, GCN, EV‐GCN) by majority voting.
In addition to the predictions using ensemble methods, we also measured the performance of the individual models, denoted as “no ensemble” in our results.
2.5. Evaluation Pipeline
The evaluation pipeline can have important effects on the measured performance of the machine learning models. We propose a robust evaluation framework based on two principles. Firstly, it is important to have a training set for model fitting, a validation set for hyper‐parameter tuning, and a test set for the performance measurement to avoid model overfitting and consequently artificially increasing the performance that cannot be repeated on unseen datasets. Second, the performance should be calculated using all samples in the dataset to avoid performance variation caused by variability in sample set selection on which the method is evaluated. We propose two different evaluation frameworks, one for the individual machine learning models and the second one for the ensembles. These frameworks are illustrated in Figure 6.
FIGURE 6.

Diagrams of our evaluation pipelines. We used the same five test set groups (a) for all three types of models (“no ensemble,” max‐voting, and Emma). The cross‐validation approach (b) was applied to individual models and EMMA, while nested cross‐validation (c) is designed to enable the max voting of five models of the same type.
2.5.1. Creation of Test Sets
Our cross‐validation approach started by splitting the whole dataset into five groups (Figure 6a). Each of these groups will act as the test set exactly once, while we train the models using the remaining four groups. Therefore, the training process will be performed five times. This way, performance can be evaluated on all the samples, while also keeping the model training and tuning completely independent of the test set.
While this is commonly done randomly, we instead opt for a fixed setup where we stratify the groups to have the same proportions of samples of Autism and TD labels and collection sites (Figure 6a). This robust setup ensures consistency in performance measurements across our experiments. We have opted for five folds because it provides a good compromise between robustness and the number of models that need to be trained. Additionally, it would be difficult to properly stratify the data into more folds according to all our variables due to the reduced sample size available from some collection sites.
2.5.2. Cross‐Validation
For evaluation of each of the five individual models listed in Section 2.3 (“no ensemble”), we used a 5‐fold cross‐validation approach presented in (Figure 6b). In this setup, the model was tuned five times. In each fold, we have one test set to measure performance, one validation set that was used to select the optimal hyper‐parameters, and the remaining three groups were used for training. The model was fitted to the training data multiple times with different hyper‐parameter values, and hyper‐parameters were selected according to the highest performance on the validation set in each fold. After the hyper‐parameters had been tuned, the performance of the selected model in each fold was evaluated on the test set. The final performance was calculated by averaging performance on the test sets over all five folds. We calculated the accuracy and area under the ROC curve (AUC) for each machine learning model and applied a two‐sample paired T‐test to compare the performance of different models, feature sets, and ensemble methods.
We optimized the parameters of each model individually to enhance their performance. For the SVM model, we selected “rbf” as the kernel and tuned the regularization parameter “C.” The FCN model was designed by us, with optimizations applied to key hyperparameters such as the number of layers, the nodes in each layer, the activation functions, and other architectural components. For the AE‐FCN model, we focused on optimizing the loss function, the nodes in each layer, and activation functions. The GCN and EV‐GCN models were adapted from Parisot et al. (2018) and Huang and Chung (2020), respectively. Additionally, we resolved an information leakage issue in the EV‐GCN model. For the deep learning models, we selected the number of iterations in each fold according to the performance on the validation set.
To predict the results using the EMMA algorithm, we aggregated the predictions from SVM, FCN, AE‐FCN, GCN, and EV‐GCN from all five folds and obtained the final prediction by majority voting.
2.5.3. Nested Cross‐Validation
In the case of the max voting ensemble technique, we applied a nested cross‐validation approach (Cawley and Talbot 2010) presented in Figure 6c. In each outer fold, we selected the test set as detailed in Section 2.5.1 and Figure 6a, and the remaining four groups were further divided into five new groups in the inner loop (using the same stratification strategy as before) for training and tuning of the model using 5‐fold cross‐validation. This resulted in training 25 different models for each architecture. We selected the hyper‐parameters (kernel size and regularization for SVM, iteration number for DL models) based on the best average performance during 5‐fold cross‐validation, separately within each outer fold. Once the hyper‐parameters were chosen for each outer fold, the five models trained during the inner cross‐validation loop with the selected hyper‐parameters voted on the predicted labels on the test set.
2.6. The SmoothGrad Interpretation Method
The gradients of machine learning models play an important role in interpretability analysis. Analyzing the gradients of the model's output with respect to input features allows estimation of the relative importance of different features (Linardatos et al. 2021). Several gradient‐based interpretation methods, including LRP, SmoothGrad, and Grad‐CAM (Montavon et al. 2017; Selvaraju et al. 2017; Smilkov et al. 2017), have been proposed previously. However, considering the application conditions of each method, we ultimately opted for SmoothGrad as the interpretability method in this paper for investigating model stability and explaining the models' decisions, given that SmoothGrad is applicable to various architectures of deep learning models, and its mechanism is straightforward, relying solely on the gradient of the model. Applying SmoothGrad to the machine learning model yields a saliency map that indicates the importance of each input feature in the model's decision‐making process.
Due to the sensitivity of raw gradients to minor perturbations (noise) in the input, which results in unstable interpretation outcomes, the visualization of raw gradients extracted from machine learning models typically exhibits a high degree of noise (Smilkov et al. 2017). To mitigate this sensitivity to noise, SmoothGrad introduces random noise into the model's inputs, creating multiple noisy inputs for each participant. The gradients from these noisy inputs are averaged to reduce the sensitivity to noise, yielding a smoother and more stable interpretation result (Smilkov et al. 2017). In this paper, we generated 10 noisy inputs for each participant to compute the average SmoothGrad saliency maps, resulting in stable interpretation results.
To obtain reliable important features from SmoothGrad, it is essential to consider the stability of the machine learning model. We expect the stable model to maintain consistent performance or output results when faced with data variations and assign high saliency values to features consistently. In the 5‐fold nested cross‐validation pipeline (Figure 6c ), we have 25 models (with the same model architecture) trained on different training sets; a stable machine learning model should provide us with consistent saliency maps for all these models. These consistent maps indicate the focus of the model on the same group of important features that contribute to the model's decisions, even across varying training sets. In order to compare model stability, we have developed a measure of saliency stability akin to a “signal‐to‐noise ratio” (SNR) (Figure 7) of saliency maps for each machine learning model (to compare between FCN, AE‐FCN, GCN, and EV‐GCN).
FIGURE 7.

Stability assessment diagram of machine learning models.
The detailed model stability evaluation pipeline is shown in Figure 7. For each test participant, we employ SmoothGrad to generate five saliency maps from its corresponding five models and subsequently compute the feature‐wise mean, standard deviation (std) and SNR from these five saliency maps. We then calculate the average SNR value (mean/std) for each participant and compare the average SNR histogram across different machine learning models. To minimize the bias caused by the model initialization, we repeat the 5‐fold nested cross‐validation pipeline 3 times for FCN, AE‐FCN, GCN, and EV‐GCN to calculate the average SNR maps. The most important features contributing to the model's decision were extracted from the saliency maps of the most stable model. In the stability experiment, we excluded the SVM model due to the incompatibility of SmoothGrad with this method.
3. Results
We have performed a comprehensive evaluation of machine learning models performances using the consistent evaluation framework presented in Section 2.5. In Section 3.1, we aimed to compare the performance of:
Five machine learning models detailed in Section 2.3 (SVM, FCN, AE‐FCN, GCN, EV‐GCN).
Three label fusion strategies are detailed in Section 2.4 (“no ensemble,” max‐voting, EMMA).
Six different feature sets: (a) structural MRI (sMRI) features, (b) sMRI + non‐imaging features, (c) functional MRI (fMRI) features, (d) fMRI + non‐imaging features, (e) sMRI + fMRI features, (f) sMRI + fMRI + non‐imaging features.
In Section 3.2, we assessed the stability (“SNR” values) of different machine learning models using SmoothGrad. Upon determining the most stable machine learning model, we extracted the most important features from the model's saliency map in Section 3.3.
3.1. Classification Task
The performances of all models in terms of prediction accuracy and area under the ROC curve (AUC) are presented in Table 5. We present average performance on the validation set and test set. Note that the performance is aggregated from all cross‐validation folds; therefore, each sample contributed to the final performance exactly once. The model performances are further visualized in Figure 8.
TABLE 5.
The experiment results of five machine learning models from (a) structural MRI (sMRI) features, (b) sMRI + non‐imaging features, (c) functional MRI (fMRI) features, (d) fMRI + non‐imaging features, (e) sMRI + fMRI features, (f) sMRI + fMRI + non‐imaging features.
| (a) sMRI | |||||||
|---|---|---|---|---|---|---|---|
| CC200 | No ensemble | Max voting | EMMA | ||||
| Validation | Test | AUC | Test | AUC | Test | AUC | |
| SVM | 64.40% | 61.50% | 0.65 | 63.30% | 0.674 | 60.8% | 0.649 |
| FCN | 65.60% | 62.60% | 0.646 | 64.40% | 0.682 | ||
| AE‐FCN | 66.00% | 62.10% | 0.637 | 63.60% | 0.667 | ||
| GCN | 64.00% | 58.70% | 0.625 | 63.20% | 0.695 | ||
| EV‐GCN | 65.70% | 59.10% | 0.639 | 62.10% | 0.665 | ||
| (b) sMRI + non‐imaging | |||||||
|---|---|---|---|---|---|---|---|
| sMRI | No ensemble | Max voting | EMMA | ||||
| Validation | Test | AUC | Test | AUC | Test | AUC | |
| SVM | 66.10% | 63.70% | 0.672 | 64.40% | 0.702 | 64.0% | 0.663 |
| FCN | 66.10% | 63.80% | 0.649 | 66.20% | 0.696 | ||
| AE‐FCN | 67.40% | 61.00% | 0.645 | 64.70% | 0.677 | ||
| GCN | 64.60% | 62.50% | 0.658 | 62.80% | 0.688 | ||
| EV‐GCN | 67.00% | 61.80% | 0.661 | 64.50% | 0.701 | ||
| (c) fMRI | |||||||
|---|---|---|---|---|---|---|---|
| sMRI | No ensemble | Max voting | EMMA | ||||
| Validation | Test | AUC | Test | AUC | Test | AUC | |
| SVM | 68.40% | 66.20% | 0.732 | 68.70% | 0.748 | 66.4% | 0.705 |
| FCN | 70.00% | 64.70% | 0.689 | 68.90% | 0.74 | ||
| AE‐FCN | 70.80% | 66.60% | 0.69 | 69.00% | 0.732 | ||
| GCN | 70.80% | 67.00% | 0.711 | 69.30% | 0.76 | ||
| EV‐GCN | 68.70% | 65.20% | 0.698 | 67.80% | 0.741 | ||
| (d) fMRI + non‐imaging | |||||||
|---|---|---|---|---|---|---|---|
| sMRI | No ensemble | Max voting | EMMA | ||||
| Validation | Test | AUC | Test | AUC | Test | AUC | |
| SVM | 68.60% | 66.20% | 0.733 | 68.60% | 0.75 | 65.4% | 0.693 |
| FCN | 70.60% | 65.20% | 0.709 | 69.00% | 0.737 | ||
| AE‐FCN | 70.70% | 64.50% | 0.674 | 69.70% | 0.733 | ||
| GCN | 71.30% | 64.70% | 0.716 | 70.30% | 0.764 | ||
| EV‐GCN | 70.20% | 63.80% | 0.691 | 67.70% | 0.736 | ||
| (e) sMRI + fMRI | |||||||
|---|---|---|---|---|---|---|---|
| sMRI | No ensemble | Max voting | EMMA | ||||
| Validation | Test | AUC | Test | AUC | Test | AUC | |
| SVM | 69.90% | 67.90% | 0.752 | 70.10% | 0.767 | 66.8% | 0.717 |
| FCN | 71.30% | 66.70% | 0.726 | 70.60% | 0.765 | ||
| AE‐FCN | 70.90% | 64.50% | 0.703 | 69.80% | 0.741 | ||
| GCN | 73.10% | 67.50% | 0.738 | 72.20% | 0.774 | ||
| EV‐GCN | 69.90% | 66.20% | 0.712 | 69.20% | 0.76 | ||
| (f) sMRI + fMRI + non‐imaging | |||||||
|---|---|---|---|---|---|---|---|
| sMRI | No ensemble | Max voting | EMMA | ||||
| Validation | Test | AUC | Test | AUC | Test | AUC | |
| SVM | 70.10% | 68.00% | 0.754 | 70.60% | 0.768 | 68.0% | 0.715 |
| FCN | 72.41% | 66.00% | 0.714 | 69.70% | 0.753 | ||
| AE‐FCN | 72.60% | 66.80% | 0.689 | 67.60% | 0.74 | ||
| GCN | 72.30% | 68.60% | 0.742 | 71.30% | 0.78 | ||
| EV‐GCN | 70.10% | 66.90% | 0.709 | 71.30% | 0.769 | ||
Note: When “no ensemble” is used, “validation” is the best validation accuracy, and “test” is the corresponding test accuracy. AUC measures the area under the ROC curve of the predicted probability of the test set. Bold values represent the highest classification accuracy and AUC values.
FIGURE 8.

(a) The accuracies and AUC of different feature sets. (b) The accuracies and AUC of different machine learning models. (c) The accuracies and AUC of different ensemble situations.
3.1.1. The Overall Performance of the Models
Results presented in Table 5 and Figure 8 show that the models achieved prediction accuracies on the test set between 58% and 72%, and AUC between 0.64 and 0.78. The highest accuracy was obtained using the GCN with max‐voting, trained on combined fMRI and sMRI features, achieving a test accuracy of 72.2% and an AUC of 0.77 (see Table 5e). However, the differences among models were not statistically significant (see Section 3.1.3).
3.1.2. Predictive Accuracy of Different Feature Sets
To compare the predictive ability of different feature sets, we present a boxplot of accuracies and AUC (Figure 8a) for all models. We can observe a clear trend, where models trained on sMRI features have the lowest performance with an average accuracy of 62% and AUC of 0.65, while models trained on fMRI features achieved an average accuracy of 67% and AUC of 0.72. This was statistically significant (T‐test, p = 1.7e−6). Combining fMRI and sMRI features further improves the performance (average accuracy 68% and AUC 0.74), though this was not statistically significant (p = 0.2). Adding non‐imaging features did not result in a significant improvement in any of the feature sets. This suggests that functional connectivity plays a more prominent role in the prediction of autism diagnosis than structural features, as previously reported in the literature (Traut et al. 2022).
3.1.3. Comparison of Five Machine Learning Models
The results presented in Figure 8b demonstrate that all tested models performed similarly to classify Autism and TD on the ABIDE dataset with p > 0.05 when comparing each pair of models with a T‐test, indicating no significant improvement. Additionally, a Chi‐squared test was conducted between the best performing model of each machine learning model, revealing no significant differences (p > 0.05). This suggests that different inclusion criteria, data modalities, and evaluation pipelines, rather than different machine learning models, cause the variation in accuracy in the published literature (Abraham et al. 2017; Heinsfeld et al. 2018; Huang and Chung 2020; Parisot et al. 2018; Rakić et al. 2020). High heterogeneity in terms of collection site, image quality, and biological heterogeneity of autism may limit model performance.
Results in Table 5 show that SVM, the only classical machine learning model tested as our baseline, achieved test accuracy of 70.6% when used with the sMRI + fMRI + non‐imaging feature set and max voting. The performance is similar to the best performing GCN, but the model is much faster and requires fewer computational resources than deep learning models. The performances of AE‐FCN and EV‐GCN are similar to those of other models but are lower than the accuracies obtained by Huang and Chung (2020) and Rakić et al. (2020), which are 81% and 85%, respectively. This difference in accuracy is most likely due to different inclusion criteria or evaluation strategies.
3.1.4. Ensembles
Ensemble test accuracy using max voting (Table 5 and Figure 8c) enables the models to achieve better performance on the unseen dataset (test set), resulting in an average accuracy improvement of 3.8%, and this was significant over all models (p = 0.0001). However, the results obtained with EMMA tended to outperform the average scores of the five models only slightly and, in some cases, even performed worse, with overall no significant difference with the individual models.
3.2. Characterization of Model Stability Using SmoothGrad Saliency Maps
The SNR value indicates whether a model assigns similar importance to the same features across various training sets. A high SNR value suggests that the model consistently relies on the same important features to classify participants into Autism or TD groups, regardless of the training set variability.
To compare the stability of FCN, AE‐FCN, GCN, and EV‐GCN models, we followed the diagram in Figure 7 to calculate the average SNR per participant of these four models and plotted the histogram in Figure 9. The results illustrate that the FCN model has the highest SNR values and therefore the most stable saliency maps across different training sets compared with AE‐FCN, GCN, and EV‐GCN. For this reason, we select the FCN model to extract important features from its average saliency map in Section 3.3. Considering the best classification performance of the FCN model is achieved on the sMRI + fMRI feature set, we will extract the most important features from this particular feature set.
FIGURE 9.

The SNR histograms of FCN, AE‐FCN, GCN, and EV‐GCN.
3.3. Important Features Extracted From SmoothGrad Saliency Maps
We identified the significant fMRI and sMRI features from the FCN average saliency map based on saliency intensity. To better visualize and compare feature importance, we correlated the fMRI and sMRI saliency values with the regions defined in the CC200 atlas and the Desikan‐Killiany atlas respectively, applied z‐score normalization, and further projected them onto a standardized brain template using the BrainNet Viewer software (Xia et al. 2013) (Figures 10 and 11). We applied the same procedure to AE‐FCN, GCN, and EV‐GCN models, visualizing the results using BrainNet Viewer. These visualizations are provided in Figures S1–S6. The results indicate that the saliency maps for FCN, AE‐FCN, GCN, and EV‐GCN were largely consistent, with significant overlap in the highlighted regions, though the relative importance of individual regions varied between the methods. For example, all models highlighted structural features from the ventricles and functional features from the temporal cortex. Additionally, among the top 100 most significant functional connections, all models exhibited more connections in the left hemisphere compared to the right.
FIGURE 10.

Important structural brain features that contribute to autism classification task. (a) Cortical + subcortical areas. (b) White matter areas.
FIGURE 11.

(a) The important brain regions obtained from fMRI features. The numbers indicate the corresponding brain regions in the CC200 atlas. The names of these brain regions are listed in Table S1. (b) Top 100 connections.
To assess the classification capability of the top selected features, we divided the entire population into four groups: true positive (TP, correctly classified as Autism), true negative (TN, correctly classified as TD), false positive (FP, incorrectly classified as Autism), and false negative (FN, incorrectly classified as TD). We then plotted box plots for each important feature's normalized value across these four groups to compare the features' values (Figure 12).
FIGURE 12.

The box plots of the most important structural and functional features. p value measures the difference between TP and TN groups. *p < 0.05, **p < 0.01, ***p < 0.001.
The top 100 most important structural volumetric features and brain connections are provided individually in Table S1 and Table 2.
3.3.1. Important sMRI Features
We associated the saliency values of cortical, subcortical and white matter features with the Desikan–Killiany atlas and employed z‐score normalization. Considering white matter areas are under cortical surface, we assigned the white matter saliency values to the corresponding cortical surface for better visualization and plotted cortical and subcortical areas (including ventricles) in Figure 10a, and “white matter areas” in Figure 10b. A z‐score below 0 indicates less importance to autism classification, whereas a score above 0 means greater importance to autism classification.
The regions with the most relevant structural features identified by FCN were 3rd Ventricle, left inferior lateral ventricles, right paracentral lobule, and the white matter of left and right isthmus cingulate.
The box plots of the three most important structural features are displayed in Figure 12a–c, all showing significant differences between the TP and TN groups. These are the volume of the 3rd Ventricle, the normalized minimum intensity of white matter of the left isthmus of the cingulate gyrus, and the volume of the left inferior lateral ventricle. Notably, the boxplot of the feature with the highest z‐score among all the structural features, the volume of the 3rd Ventricle, shows that correctly classified autistic participants (TP) exhibit larger 3rd Ventricle volume than the correctly classified TD participants (TN). The misclassified autistic and TD participants (FN and FP groups), respectively, display smaller or larger volume sizes of the 3rd Ventricle compared to TP and TN groups, probably contributing to the incorrect classification of these individuals.
3.3.2. Important fMRI Features
We assigned the saliency value of each functional connection to its corresponding pair of brain regions, summing up the values for each brain region (nodal saliency “strength”), and then applied z‐score normalization. The most important brain regions used for the classification of autistic individuals based on their functional connectivity, as well as specific functional brain connections used, are visualized in Figure 11.
The most significant brain regions selected by the FCN model are located mostly in temporal cortex areas, including the Left Middle, Superior, and Inferior Temporal Cortex, as well as the Right Middle, Superior, and Inferior Temporal Cortex. Additionally, the Left Fusiform and Right Hippocampus and Parahippocampal regions were also highlighted as significant. This suggests that the connections associated with these brain regions may play a more substantial role in the Autism and TD classification compared to other connections.
The box plots of the top three connections are shown in Figure 12d–f. Among them, two demonstrate a notable difference between TP and TN groups (p < 0.05), they are the connection between “Left Inferior Temporal gyrus” and “Right Hippocampus and Parahippocampal gyrus,” and the connection between “Left Middle Temporal gyrus” and “Left Superior Medial Frontal gyrus.”
4. Discussion
The purpose of this work was to evaluate the performance of five machine learning models classifying autistic individuals and TD participants under a standard setting and extracting the features that contribute the most to the classification task with the most stable machine learning model. All the tested models showed similar classification performance, which illustrates that different inclusion criteria, data modalities, and evaluation pipelines, rather than different machine learning models, cause variation in accuracy in published literature. Similarly, a recent study also found that different machine learning algorithms and parameter settings tend to achieve comparable performance in the same experiment (Winter et al. 2024). Although the classical machine learning model SVM did not achieve the highest accuracy, it did not show significantly lower performance when compared with other more complex models. Our results suggest that, in this task, traditional machine learning models can produce similarly robust results to deep neural networks while being more computationally efficient.
While the machine learning models we utilized have different parameterizations, they are all non‐linear classifiers looking for non‐linear decision boundaries in data. To assess whether our machine learning models bring performance improvement in comparison to a simple linear approach, we further included an SVM model with a “linear” kernel (Table S3). SVM with an “rbf” kernel outperforms SVM with a “linear” kernel, achieving about 1% higher average accuracy, which suggests that there may be non‐linearity in the data. However, this increased accuracy is not statistically significant, as confirmed by Chi‐square and two‐sample paired t‐tests (p > 0.05). We therefore cannot confirm that the non‐linearity of our models is essential.
The two evaluation pipelines utilized in this paper, 5‐fold cross‐validation and 5‐fold nested cross‐validation, were designed to provide robust and accurate results while also preventing the models from overfitting. Employing 5‐fold cross‐validation enables the evaluation of performance across all samples while ensuring that model training and tuning are independent of the test set (Cawley and Talbot 2010). The choice of five folds is a good compromise between robustness and the number of models that need to be trained. Additionally, it would be difficult to stratify the data into more folds due to the limited number of samples from some collection sites. We also note that performance calculated on the validation set (which we only show for all individual models with “no ensemble”) is always higher than on the test set. This shows the importance of reporting the performance on the test set to avoid unrealistically inflated performance measurements that would not repeat on unseen datasets. The application of max voting ensemble improved the performance of all five models compared to the individual models (‘no ensemble’). This shows that max voting can result in more robust models with stable performance on unseen datasets as previously suggested (Parisot et al. 2018; Rakić et al. 2020). While max voting consistently resulted in better performance, EMMA's contribution was weaker, and in some cases even had a negative effect.
Our classification results show that feature sets used for training the classifiers resulted in the most significant differences in model performance. We have observed that models trained with fMRI features (i.e., functional connectivity) exhibit better performance than the models trained with sMRI features. This suggests that functional connectivity plays a more prominent role in the prediction of autism diagnosis than structural features, as previously reported in the literature (Traut et al. 2022). When merging sMRI and fMRI features, the addition of sMRI information improved the performance only marginally. Moreover, adding non‐imaging features did not yield a significant improvement in any of the feature sets.
Comprehensive experiments presented in this paper demonstrate that variations in experimental conditions, such as data modality and cross‐validation technique, can lead to disparate outcomes, despite utilizing the same machine learning model. However, the differences in performance are moderate, with a classification accuracy of around 70%. In fact, our results are consistent with most studies in the published literature that utilize the 871 samples of the ABIDE fMRI dataset. This suggests that the limitations of this dataset, including the inherent biological heterogeneity of autism, or site‐dependent acquisition and processing characteristics, rather than the classification models, may be the primary cause of the limited classification performance. This highlights that fruitful directions of future work will likely focus on better comprehension and disentanglement of the heterogeneity of autism and exploration of robust, effective stratification biomarkers.
FCN model combined with SmoothGrad yields the most stable interpretation results in comparison to other machine learning models. The lower SNR values observed for saliency features in GCN, AE‐FCN, and EV‐GCN models may be attributed to several factors. First, the GCN model has the population graph as the second feature input, while FCN focuses on the functional and structural features only. Second, the AE‐FCN model performs the classification and the denoising tasks at the same time, which could result in lower saliency SNR values. Third, the EV‐GCN model extends the GCN by dynamically learning and modifying the connections within the population graph during training, which might further affect the model's interpretive stability.
Many studies have investigated atypical brain structural features in autism (Courchesne 2002; Katuwal et al. 2015; Moradi et al. 2017). In this study, we found the structural features in the ventricles contributed the most to the model decision‐making. Some observations from our results are consistent with the published literature; for example, we observed an increased volume of the 3rd Ventricle and Left and Right LateralVentricle in autistic participants (p = 1.3e−8), which aligns with findings from previous studies (Hardan et al. 2001; Palmen et al. 2005; Wolfe et al. 2015). The cortical features of the right paracentral lobule and the white matter near the isthmus cingulate gyrus also showed higher z‐scores than other areas. These areas may play an important role in emotion regulation, emotion processing, and cognitive control in autism (Chien et al. 2021; Hau et al. 2019). Temporal cortex areas have been commonly associated with auditory perception, language, and vision functions (Bonner and Price 2013; Hickok and Poeppel 2007; Zaehle et al. 2004). Alterations in functional connectivity within the temporal cortex may contribute to the auditory and language difficulties observed in individuals with autism (Rotschafer 2021; Xiao et al. 2023). It is worth noting that we did not identify a similar pattern of important brain regions for autism classification using functional and structural features, suggesting distinct patterns of regions with atypical structural morphology and functional connectivity associated with autistic phenotypes.
A potential limitation is the use of a baseline kernel SVM classifier (gamma = “scale”) to determine the number of functional (4000) and structural (800) features for training of the machine learning models, which may introduce performance bias toward the baseline SVM classifier. However, since feature selection was performed independently within each fold during cross‐validation and nested cross‐validation using ridge classifier and recursive feature elimination, the impact of this bias is expected to be minimal. The fold‐wise feature selection ensured robust model evaluation, reducing the likelihood of significant bias in the final results (Parvandeh et al. 2020; Varoquaux et al. 2017).
Another limitation of our study is that the samples used are solely from the ABIDE I dataset, without testing model performance on additional external datasets. While the ABIDE I dataset consists of data from 17 different sites and the use of separate validation and test sets prevents leakage of information into the test sets, our results do not provide information about the models' generalization capacity in the presence of domain shifts, such as data from new sites, unseen scanning protocols, or different demographic characteristics. In the future, similar experiments may be performed on other datasets such as ABIDE II and AIMS‐2‐TRIALS (Loth et al. 2017) to further assess their generalizability across various data sources. The investigation of longitudinal autism databases may enable the identification and development of early stratification biomarkers for autism, potentially leading to more effective personalized supportive strategies for those individuals that request them. However, the direct application of the trained GCN and EV‐GCN to out‐group samples requires additional methodological development. This is because the current implementation of GCN and EV‐GCN requires the population graph to include information from all sites, which limits their inference to a new dataset. For the other trained models (SVM, FCN and AE‐FCN), we would expect decreased autism classification accuracies if they are applied to out‐group samples, similarly as reported in the autism classification challenge organized by Traut et al. (2022), where most submissions exhibited similar performance on public (ABIDE) and private datasets (e.g., ABIDE combined with data from the Robert Debré Hospital), but all showed decreased performance on the external EU‐AIMS dataset.
Additionally, we mainly utilized functional connectivity and sMRI volumetric features. Other types of features, such as topological properties of functional connectivity matrices (Kazeminejad and Sotero 2019; Plitt et al. 2015), extra‐axial cerebrospinal fluid (Shen et al. 2017) and genomic copy number variations (Sherman et al. 2021; Velinov 2019) also hold significant potential for autism classification and stratification. Previous research has demonstrated the potential of functional data in predicting continuous autism traits. For instance, Lake et al. (2019) utilized functional connectomes to estimate Social Responsiveness Scale (SRS) scores for autism and attention‐deficit/hyperactivity disorder (ADHD). Similarly, Dryburgh et al. (2020) employed functional connectomic data to predict FIQ and verbal IQ (VIQ). Exploring the performance of different machine learning models on autism‐related features would provide valuable insights. Furthermore, we expect explainable AI will expand to other autism research areas. Specifically, important features extracted through SmoothGrad could be applied to predict autism traits (Lake et al. 2019; Plitta et al. 2015) and identify different autism subtypes (Easson et al. 2019; Tang et al. 2020).
5. Conclusion
In this paper, five machine learning models from the existing literature were trained to classify individuals with autism using the ABIDE I dataset. The results showed that all tested models had similar performance, indicating that variations in accuracy in published literature may be attributed to different inclusion criteria, data modalities, and evaluation pipelines rather than the models themselves. The highest classification accuracy was achieved by using ensemble models, yielding an accuracy of 72.2% and an AUC of 0.77 using GCN models, although the differences among models were not statistically significant. We conclude that the performance of the classifiers is likely limited by factors other than the model architecture, such as high heterogeneity in terms of age, collection site, and image quality, as well as the biological heterogeneity of autism in general. Furthermore, the SmoothGrad method was applied to FCN, which exhibited the highest SNR value and was selected as the most stable model. The results suggested that structural features from the ventricles and functional features from the temporal cortex made the most significant contributions to the algorithms identifying autistic participants.
Our work also allows us to test the performance of new algorithms using our framework (which is now fully available in GitHub: https://github.com/YilanDong19/Machine‐learning‐with‐ABIDE), and benchmark their performance against the most widely used algorithms for the prediction of Autism in the ABIDE dataset.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Data S1. Supporting Information.
Acknowledgments
This work was supported by King's‐China Scholarship Council (K‐CSC) [grant number 202008060109]. The authors acknowledge support in part from the Wellcome Engineering and Physical Sciences Research Council (EPSRC) Centre for Medical Engineering at Kings College London [WT 203148/Z/16/Z], and the NIHR Maudsley Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.
Funding: This work was supported by King's‐China Scholarship Council (K‐CSC), 202008060109 and Wellcome Engineering and Physical Sciences Research Council (EPSRC) Centre, WT 203148/Z/16/Z.
Data Availability Statement
The data that support the findings of this study are available in ABIDE at https://fcon_1000.projects.nitrc.org/indi/abide/. These data were derived from the following resources available in the public domain: ABIDE I, https://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html.
References
- Abraham, A. , Milham M. P., Di Martino A., et al. 2017. “Deriving Reproducible Biomarkers From Multi‐Site Resting‐State Data: An Autism‐Based Example.” NeuroImage 147: 736–745. 10.1016/j.neuroimage.2016.10.045. [DOI] [PubMed] [Google Scholar]
- Alasasfeh, H. O. , Alomari T., and Ibbini M. S.. 2021. “Deep Learning Approach for COVID‐19 Detection Based on X‐Ray Images.” In 18th IEEE International Multi‐Conference on Systems, Signals and Devices, SSD 2021, 601–606. IEEE. [Google Scholar]
- Anderson, J. S. , Nielsen J. A., Froehlich A. L., et al. 2011. “Functional Connectivity Magnetic Resonance Imaging Classification of Autism.” Brain 134, no. 12: 3739–3751. 10.1093/brain/awr263. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Aylward, E. H. , Minshew N. J., Field K., Sparks B. F., and Singh N.. 2002. “Effects of Age on Brain Volume and Head Circumference in Autism.” Neurology 59, no. 2: 175–183. 10.1212/WNL.59.2.175. [DOI] [PubMed] [Google Scholar]
- Bharadwaj, Prakash K. B., and Kanagachidambaresan G. R.. 2021. “Pattern Recognition and Machine Learning.”
- Bonner, M. F. , and Price A. R.. 2013. “Where Is the Anterior Temporal Lobe and What Does It Do?” Journal of Neuroscience 33, no. 10: 4213–4215. 10.1523/JNEUROSCI.0041-13.2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cawley, G. C. , and Talbot N. L. C.. 2010. “On Over‐Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.” Journal of Machine Learning Research 11: 2079–2107. [Google Scholar]
- Chien, Y. L. , Chen Y. C., and Gau S. S. F.. 2021. “Altered Cingulate Structures and the Associations With Social Awareness Deficits and CNTNAP2 Gene in Autism Spectrum Disorder.” NeuroImage: Clinical 31: 102729. 10.1016/j.nicl.2021.102729. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ciregan, D. , Meier U., and Schmidhuber J.. 2012. “Multi‐Column Deep Neural Networks for Image Classification.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 3642–3649. IEEE. [Google Scholar]
- Courchesne, E. 2002. “Abnormal Early Brain Development in Autism.” Molecular Psychiatry 7: 21–23. 10.1038/sj.mp.4001169. [DOI] [PubMed] [Google Scholar]
- Craddock, R. , Cameron G., James A., Holtzheimer P. E., Hu X. P., and Mayberg H. S.. 2012. “A Whole Brain FMRI Atlas Generated via Spatially Constrained Spectral Clustering.” Human Brain Mapping 33, no. 8: 1914–1928. 10.1002/hbm.21333. [DOI] [PMC free article] [PubMed] [Google Scholar]
- del Barrio, V. 2004. “Diagnostic and Statistical Manual of Mental Disorders.”
- Desikan, R. S. , Ségonne F., Fischl B., et al. 2006. “An Automated Labeling System for Subdividing the Human Cerebral Cortex on MRI Scans Into Gyral Based Regions of Interest.” NeuroImage 31, no. 3: 968–980. 10.1016/j.neuroimage.2006.01.021. [DOI] [PubMed] [Google Scholar]
- Dietterich, T. G. 2000. “Multiple Classifier Systems.” In Ensemble Methods in Machine Learning. Springer Berlin Heidelberg. [Google Scholar]
- Dryburgh, E. , McKenna S., and Rekik I.. 2020. “Predicting Full‐Scale and Verbal Intelligence Scores From Functional Connectomic Data in Individuals With Autism Spectrum Disorder.” Brain Imaging and Behavior 14, no. 5: 1769–1778. 10.1007/s11682-019-00111-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Du, Y. , Fu Z., and Calhoun V. D.. 2018. “Classification and Prediction of Brain Disorders Using Functional Connectivity: Promising but Challenging.” Frontiers in Neuroscience 12: 525. 10.3389/fnins.2018.00525. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Easson, A. K. , Fatima Z., and McIntosh A. R.. 2019. “Functional Connectivity Based Subtypes of Individuals With and Without Autism Spectrum Disorder.” Network Neuroscience 3, no. 2: 344–362. 10.1162/NETN. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ecker, C. , Bookheimer S. Y., and Murphy D. G. M.. 2015. “Neuroimaging in Autism Spectrum Disorder: Brain Structure and Function Across the Lifespan.” Lancet Neurology 14, no. 11: 1121–1134. [DOI] [PubMed] [Google Scholar]
- Ecker, C. , Rocha‐Rego V., Johnston P., et al. 2010. “Investigating the Predictive Value of Whole‐Brain Structural MR Scans in Autism: A Pattern Classification Approach.” NeuroImage 49, no. 1: 44–56. 10.1016/j.neuroimage.2009.08.024. [DOI] [PubMed] [Google Scholar]
- Fischl, B. , Salat D. H., Busa E., et al. 2002. “Whole Brain Segmentation: Automated Labeling of Neuroanatomical Structures in the Human Brain.” Neuron 33, no. 3: 341–355. 10.1016/s0896-6273(02)00569-x. [DOI] [PubMed] [Google Scholar]
- Garg, S. , and Garg S.. 2020. “Prediction of Lung and Colon Cancer Through Analysis of Histopathological Images by Utilizing Pre‐Trained CNN Models With Visualization of Class Activation and Saliency Maps.” In Proceedings of the 2020 3rd Artificial Intelligence and Cloud Computing Conference (AICCC '20), 38–45. Association for Computing Machinery. [Google Scholar]
- Green, S. A. , Rudie J. D., Colich N. L., et al. 2013. “Overreactive Brain Responses to Sensory Stimuli in Youth With Autism Spectrum Disorders.” Journal of the American Academy of Child and Adolescent Psychiatry 52, no. 11: 1158–1172. 10.1016/j.jaac.2013.08.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hardan, A. Y. , Minshew N. J., Mallikarjuhn M., and Keshavan M. S.. 2001. “Brain Volume in Autism.” Journal of Child Neurology 16, no. 6: 421–424. 10.1177/088307380101600607. [DOI] [PubMed] [Google Scholar]
- Hau, J. , Aljawad S., Baggett N., Fishman I., Carper R. A., and Müller R. A.. 2019. “The Cingulum and Cingulate U‐Fibers in Children and Adolescents With Autism Spectrum Disorders.” Human Brain Mapping 40, no. 11: 3153–3164. 10.1002/hbm.24586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Heinsfeld, A. S. , Franco A. R., Craddock R. C., Buchweitz A., and Meneguzzi F.. 2018. “Identification of Autism Spectrum Disorder Using Deep Learning and the ABIDE Dataset.” NeuroImage: Clinical 17: 16–23. 10.1016/j.nicl.2017.08.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hickok, G. , and Poeppel D.. 2007. “The Cortical Organization of Speech Understanding.” Nature 8 (May): 393–402. [DOI] [PubMed] [Google Scholar]
- Holiga, Š. , Hipp J. F., Chatham C. H., et al. 2019. “Patients With Autism Spectrum Disorders Display Reproducible Functional Connectivity Alterations.” Science Translational Medicine 11, no. 481: eaat9223. 10.1126/scitranslmed.aat9223. [DOI] [PubMed] [Google Scholar]
- Huang, Y. , and Chung A. C. S.. 2020. “Edge‐Variational Graph Convolutional Networks for Uncertainty‐Aware Disease Prediction.” In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference, 562–572. Springer. [Google Scholar]
- Kamnitsas, K. , Bai W., Ferrante E., et al. 2017. Ensembles of Multiple Models and Architectures for Robust Brain Tumour Segmentation.
- Katuwal, G. J. , Cahill N. D., Baum S. A., and Michael A. M.. 2015. “The Predictive Power of Structural MRI in Autism Diagnosis.” In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, 4270–4273. IEEE. 10.1109/EMBC.2015.7319338. [DOI] [PubMed] [Google Scholar]
- Kazeminejad, A. , and Sotero R. C.. 2019. “Topological Properties of Resting‐State FMRI Functional Networks Improve Machine Learning‐Based Autism Classification.” Frontiers in Neuroscience 13 (January): 1–10. 10.3389/fnins.2018.01018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lake, E. M. R. , Finn E. S., Noble S. M., et al. 2019. “The Functional Brain Organization of an Individual Allows Prediction of Measures of Social Abilities Transdiagnostically in Autism and Attention‐Deficit/Hyperactivity Disorder.” Biological Psychiatry 86, no. 4: 315–326. 10.1016/j.biopsych.2019.02.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Landa, R. J. 2008. “Diagnosis of Autism Spectrum Disorders in the First 3 Years of Life.” Nature Clinical Practice Neurology 4, no. 3: 138–147. 10.1038/ncpneuro0731. [DOI] [PubMed] [Google Scholar]
- Linardatos, P. , Papastefanopoulos V., and Kotsiantis S.. 2021. “Explainable ai: A Review of Machine Learning Interpretability Methods.” Entropy 23, no. 1: 1–45. 10.3390/e23010018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Loth, E. , Charman T., Mason L., et al. 2017. “The EU‐AIMS Longitudinal European Autism Project (LEAP): Design and Methodologies to Identify and Validate Stratification Biomarkers for Autism Spectrum Disorders.” Molecular Autism 8, no. 1: 24. 10.1186/s13229-017-0146-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Di Martino, A. , Yan C. G., Li Q., et al. 2014. “The Autism Brain Imaging Data Exchange: Towards a Large‐Scale Evaluation of the Intrinsic Brain Architecture in Autism.” Molecular Psychiatry 19, no. 6: 659–667. 10.1038/mp.2013.78. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mizuno, Y. , Kagitani‐Shimono K., Jung M., et al. 2019. “Structural Brain Abnormalities in Children and Adolescents With Comorbid Autism Spectrum Disorder and Attention‐Deficit/Hyperactivity Disorder.” Translational Psychiatry 9, no. 1: 332. 10.1038/s41398-019-0679-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Montavon, G. , Lapuschkin S., Binder A., Samek W., and Müller K. R.. 2017. “Explaining Nonlinear Classification Decisions With Deep Taylor Decomposition.” Pattern Recognition 65: 211–222. 10.1016/j.patcog.2016.11.008. [DOI] [Google Scholar]
- Moradi, E. , Khundrakpam B., Lewis J. D., Evans A. C., and Tohka J.. 2017. “Predicting Symptom Severity in Autism Spectrum Disorder Based on Cortical Thickness Measures in Agglomerative Data.” NeuroImage 144: 128–141. 10.1016/j.neuroimage.2016.09.049. [DOI] [PubMed] [Google Scholar]
- Nielsen, J. A. , Zielinski B. A., Fletcher P. T., et al. 2013. “Multisite Functional Connectivity MRI Classification of Autism: ABIDE Results.” Frontiers in Human Neuroscience 7: 599. 10.3389/fnhum.2013.00599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Palmen, S. J. M. C. , Hulshoff Pol H. E., Kemner C., et al. 2005. “Increased Gray‐Matter Volume in Medication‐Naive High‐Functioning Children With Autism Spectrum Disorder.” Psychological Medicine 35, no. 4: 561–570. 10.1017/S0033291704003496. [DOI] [PubMed] [Google Scholar]
- Parisot, S. , Ktena S. I., Ferrante E., et al. 2018. “Disease Prediction Using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer's Disease.” Medical Image Analysis 48: 117–130. 10.1016/j.media.2018.06.001. [DOI] [PubMed] [Google Scholar]
- Parvandeh, S. , Yeh H. W., Paulus M. P., and McKinney B. A.. 2020. “Consensus Features Nested Cross‐Validation.” Bioinformatics 36, no. 10: 3093–3098. 10.1093/bioinformatics/btaa046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Plitt, M. , Barnes K. A., and Martin A.. 2015. “Functional Connectivity Classification of Autism Identifies Highly Predictive Brain Features but Falls Short of Biomarker Standards.” NeuroImage: Clinical 7: 359–366. 10.1016/j.nicl.2014.12.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Plitta, M. , Barnes K. A., Wallace G. L., Kenworthy L., and Martin A.. 2015. “Resting‐State Functional Connectivity Predicts Longitudinal Change in Autistic Traits and Adaptive Functioning in Autism.” Proceedings of the National Academy of Sciences of the United States of America 112, no. 48: E6699–E6706. 10.1073/pnas.1510098112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Postema, M. C. , van Rooij D., Anagnostou E., et al. 2019. “Altered Structural Brain Asymmetry in Autism Spectrum Disorder in a Study of 54 Datasets.” Nature Communications 10, no. 1: 4958. 10.1038/s41467-019-13005-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rakić, M. , Cabezas M., Kushibar K., Oliver A., and Lladó X.. 2020. “Improving the Detection of Autism Spectrum Disorder by Combining Structural and Functional MRI Information.” NeuroImage: Clinical 25: 102181. 10.1016/j.nicl.2020.102181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rotschafer, S. E. 2021. “Auditory Discrimination in Autism Spectrum Disorder.” Frontiers in Neuroscience 15 (June): 1–13. 10.3389/fnins.2021.651209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rumelhart, D. E. , Hinton G. E., and Williams R. J.. 2013. “Learning Internal Representations by Error Propagation.” Readings in Cognitive Science: A Perspective From Psychology and Artificial Intelligence V: 399–421. 10.1016/B978-1-4832-1446-7.50035-2. [DOI] [Google Scholar]
- Sabuncu, M. R. , and Konukoglu E.. 2015. “Clinical Prediction From Structural Brain MRI Scans: A Large‐Scale Empirical Study.” Neuroinformatics 13, no. 1: 31–46. 10.1007/s12021-014-9238-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Salahuddin, Z. , Woodruff H. C., Chatterjee A., and Lambin P.. 2022. “Transparency of Deep Neural Networks for Medical Image Analysis: A Review of Interpretability Methods.” Computers in Biology and Medicine 140: 105111. 10.1016/j.compbiomed.2021.105111. [DOI] [PubMed] [Google Scholar]
- Salat, D. H. , Lee S. Y., van der Kouwe A. J., Greve D. N., Fischl B., and Rosas H. D.. 2009. “Age‐Associated Alterations in Cortical Gray and White Matter Signal Intensity and Gray to White Matter Contrast.” NeuroImage 48, no. 1: 21–28. 10.1016/j.neuroimage.2009.06.074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Selvaraju, R. R. , Cogswell M., Das A., Vedantam R., Parikh D., and Batra D.. 2017. “Grad‐CAM:Visual Explanations From Deep Networks via Gradient‐Based Localization.” In Proceedings of the IEEE International Conference on Computer Vision, 618–626. IEEE. [Google Scholar]
- Shen, M. D. , Kim S. H., and McKinstry R. C.. 2017. “Increased Extra‐Axial Cerebrospinal Fluid in High‐Risk Infants Who Later Develop Autism.” Biological Psychiatry 82, no. 3: 186–193. 10.1016/j.biopsych.2017.02.1095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sherman, M. A. , Rodin R. E., Genovese G., et al. 2021. “Large Mosaic Copy Number Variations Confer Autism Risk.” Nature Neuroscience 24, no. 2: 197–203. 10.1038/s41593-020-00766-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Singh, A. , Sengupta S., and Lakshminarayanan V.. 2020. “Explainable Deep Learning Models in Medical Image Analysis.” Journal of Imaging 6, no. 6: 1–19. 10.3390/JIMAGING6060052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Smilkov, D. , Thorat N., Kim B., Viégas F., and Wattenberg M.. 2017. “SmoothGrad: Removing Noise by Adding Noise.”
- Stefanatos, G. A. 2008. “Regression in Autistic Spectrum Disorders.” Neuropsychology Review 18, no. 4: 305–319. 10.1007/s11065-008-9073-y. [DOI] [PubMed] [Google Scholar]
- Stewart, M. E. , Barnard L., Pearson J., Hasan R., and O'Brien G.. 2006. “Presentation of Depression in Autism and Asperger Syndrome: A Review.” Autism 10, no. 1: 103–116. 10.1177/1362361306062013. [DOI] [PubMed] [Google Scholar]
- Tang, S. , Sun N., Floris D. L., Zhang X., Di Martino A., and Yeo B. T. T.. 2020. “Reconciling Dimensional and Categorical Models of Autism Heterogeneity: A Brain Connectomics and Behavioral Study.” Biological Psychiatry 87, no. 12: 1071–1082. 10.1016/j.biopsych.2019.11.009. [DOI] [PubMed] [Google Scholar]
- Traut, N. , Heuer K., Lemaître G., et al. 2022. “Insights From an Autism Imaging Biomarker Challenge: Promises and Threats to Biomarker Discovery.” NeuroImage 255: 119171. 10.1016/j.neuroimage.2022.119171. [DOI] [PubMed] [Google Scholar]
- Varoquaux, G. , Raamana P. R., Engemann D. A., Hoyos‐Idrobo A., Schwartz Y., and Thirion B.. 2017. “Assessing and Tuning Brain Decoders: Cross‐Validation, Caveats, and Guidelines.” NeuroImage 145: 166–179. 10.1016/j.neuroimage.2016.10.038. [DOI] [PubMed] [Google Scholar]
- Velinov, M. 2019. “Genomic Copy Number Variations in the Autism Clinic—Work in Progress.” Frontiers in Cellular Neuroscience 13: 1–6. 10.3389/fncel.2019.00057. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Winter, N. R. , Blanke J., Leenings R., et al. 2024. “A Systematic Evaluation of Machine Learning‐Based Biomarkers for Major Depressive Disorder.” JAMA Psychiatry 81, no. 4: 386–395. 10.1001/jamapsychiatry.2023.5083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wolfe, F. H. , Auzias G., Deruelle C., and Chaminade T.. 2015. “Focal Atrophy of the Hypothalamus Associated With Third Ventricle Enlargement in Autism Spectrum Disorder.” Neuroreport 26, no. 17: 1017–1022. 10.1097/WNR.0000000000000461. [DOI] [PubMed] [Google Scholar]
- Wolfers, T. , Floris D. L., Dinga R., et al. 2019. “From Pattern Classification to Stratification: Towards Conceptualizing the Heterogeneity of Autism Spectrum Disorder.” Neuroscience and Biobehavioral Reviews 104: 240–254. [DOI] [PubMed] [Google Scholar]
- World Health Organization . 2023. “Autism.” https://www.who.int/news‐room/fact‐sheets/detail/autism‐spectrum‐disorders.
- Xia, M. , Wang J., and He Y.. 2013. “BrainNet Viewer: A Network Visualization Tool for Human Brain Connectomics.” PLoS One 8, no. 7: e68910. 10.1371/journal.pone.0068910. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xiao, Y. , Wen T. H., Kupis L., et al. 2023. “Atypical Functional Connectivity of Temporal Cortex With Precuneus and Visual Regions May be an Early‐Age Signature of ASD.” Molecular Autism 14, no. 1: 1–11. 10.1186/s13229-023-00543-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yahata, N. , Morimoto J., and Hashimoto R.. 2016. “A Small Number of Abnormal Brain Connectionspredicts Adult Autism Spectrum Disorder.” Nature Communications 48, no. 7: 829–834. 10.1038/pj.2016.37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zabihi, M. , Oldehinkel M., Wolfers T., et al. 2019. “Dissecting the Heterogeneous Cortical Anatomy of Autism Spectrum Disorder Using Normative Models.” Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 4, no. 6: 567–578. 10.1016/j.bpsc.2018.11.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zaehle, T. , Wüstenberg T., Meyer M., and Jäncke L.. 2004. “Evidence for Rapid Auditory Perception as the Foundation of Speech Processing: A Sparse Temporal Sampling FMRI Study.” European Journal of Neuroscience 20, no. 9: 2447–2456. 10.1111/j.1460-9568.2004.03687.x. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data S1. Supporting Information.
Data Availability Statement
The data that support the findings of this study are available in ABIDE at https://fcon_1000.projects.nitrc.org/indi/abide/. These data were derived from the following resources available in the public domain: ABIDE I, https://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html.
