BioMedical Engineering OnLine. 2025 Oct 21;24:123. doi: 10.1186/s12938-025-01456-8

Research on drug addiction detection based on AR-TSNET with bimodal EEG–NIRS

Xiaowen Zhang 1, Xuelin Gu 2,#, Li Chen 2, Xueshan Cao 3, Chaojing Zhang 3, Xiaoou Li 1
PMCID: PMC12542443  PMID: 41121050

Abstract

Traditional drug addiction assessment relies primarily on psychological scales, self-reports from drug users, and subjective judgments from doctors, and lacks objective physiological indicators and quantitative evaluation. This study introduces a visual trigger paradigm designed to elicit drug cravings in individuals with substance addiction, with electroencephalography (EEG) and near-infrared spectroscopy (NIRS) used for data acquisition. The dataset comprises recordings from 20 healthy individuals and 36 individuals with drug addiction. A deep learning algorithm named AR-TSNET, which uses feature-level fusion, is proposed to classify the two groups. The network uses two modules, Tception and Sception: Tception extracts features from the EEG data, while Sception extracts features from the NIRS data. Different attention mechanisms are applied to each modality to match the characteristics of its data; they assign weights to features and reduce the interference of redundant features. Residual connections address the information loss caused by increased network depth, thereby enhancing the stability and robustness of the model. The classification accuracy achieved through k-fold cross-validation is 92.6%, and the confusion matrix and ROC curve support the model's performance. A comparison of single-modal and bimodal evaluation metrics confirms the superior performance of bimodal data, which carry more information. These results provide preliminary evidence that the proposed method is a promising approach for assessing the severity of drug addiction. The method shows high accuracy and potential for broader application in addiction research and clinical practice. Furthermore, its straightforward implementation and objective nature offer valuable insights into addiction severity while reducing reliance on subjective assessments.

Keywords: Methamphetamine, Deep learning, Electroencephalogram, Near-infrared spectroscopy, Feature-level fusion

Introduction

Drug addiction poses a major societal threat, affecting economic development and social stability. Existing severity assessments rely on scales, self-reports, and subjective judgments, and lack objective and quantitative measures. To address this limitation, we advocate leveraging physiological data to achieve a more precise and reliable assessment. Electroencephalography (EEG) records brain activity by detecting spontaneous neuronal potential fluctuations at the scalp, offering noninvasive measurement and high temporal resolution [1]. Near-infrared spectroscopy (NIRS) measures cortical blood flow and metabolism by monitoring oxyhemoglobin (HbO) and deoxyhemoglobin (HbR) levels, offering good spatial resolution [2]. By combining these two signals, both temporal and spatial changes in brain activity can be captured simultaneously.

In recent years, the rapid advancement of deep learning algorithms has inspired researchers to pioneer innovative approaches for processing physiological data. The integration of EEG and fNIRS has gained attention for combining complementary information, enhancing brain signal analysis accuracy and depth. Qi et al. [3] proposed a multimodal fusion graph convolutional network for emotion recognition, employing a multiview mechanism to address fusion limitations and achieving superior performance. Similarly, AL-Quraishi et al. [4] found that bimodal fusion significantly outperformed single-modal methods, achieving 93.01% classification accuracy in analyzing brain wave activity and blood flow responses during movement. Liu et al. [5] used decision-level EEG-fNIRS fusion to classify mental workload, achieving higher accuracy than single modalities. In clinical applications, Yi et al. [6] used EEG and fNIRS combined with an SVM classifier, achieving 92.7% classification accuracy for depression, while Guven et al. [7] demonstrated that combining features from both modalities improved the classification accuracy of children with ADHD, showing promise for objective diagnosis. Jain et al. [8] applied deep learning and reinforcement learning, introducing a weighted ensemble model that achieved a lung cancer diagnosis accuracy of 99.40%, underscoring the versatility of multimodal approaches in healthcare.

Other researchers have focused on optimizing multimodal methods for specific neurophysiological and cognitive applications. Chu et al. [9] optimized psychological load detection using 26 EEG channels and 2 fNIRS channels, achieving an accuracy of 78.25% with a random forest classifier. Lin et al. [10] explored cognitive-motor interference (CMI) and neurovascular coupling, finding that dual-task conditions reduced coupling, highlighting the importance of multimodal approaches in understanding complex brain interactions. Qiu et al. [11] combined multilevel progressive learning with a multimodal fusion framework, achieving accuracies of 96.74% and 98.42% for motor imagery (MI) and mental arithmetic (MA) tasks, respectively. Guo et al. [12] proposed a multimodal brain-computer interface classification method, significantly improving classification accuracy to 88.33%.

Rapid signal detection has also benefited from multimodal fusion. Khan et al. [13] combined EEG and fNIRS to detect early hemodynamic responses, identifying brain signal changes within 1.5 s and improving accuracy by 20% compared to traditional methods. Abtahi et al. [14] distinguished Parkinson’s disease patients from healthy individuals, achieving a combined accuracy of 93.4%, significantly outperforming single-modal methods. Li et al. [15] analyzed brain network changes in Alzheimer’s disease (AD), demonstrating that bimodal methods more effectively reveal abnormalities, offering more precise diagnostic insights. These studies highlight the transformative potential of EEG and NIRS fusion, advancing diagnostic accuracy, cognitive research, and clinical applications.

Previous multimodal imaging studies have demonstrated prefrontal hypofunction in substance users: methamphetamine-induced impulsivity correlates with reduced oxy-Hb during Stroop tasks [16], while in opioid users, left PFC cue-reactivity measured by fNIRS predicts treatment outcomes [17]. EEG studies similarly observe increased beta and gamma band activity in ACC/PFC regions linked to craving [18], and connectome metrics provide superior craving prediction.

On the basis of the aforementioned related works, combining EEG and NIRS signals yields better results than single modalities, offering more comprehensive information and significantly improving classification accuracy. Therefore, this study uses EEG and NIRS for feature-level fusion to distinguish between healthy individuals and individuals with drug addictions. We expect that bimodal fusion will provide higher sensitivity and accuracy in drug addiction severity classification.

Results

Classification results

This study used data from 40 participants, including 20 healthy individuals and 20 drug rehabilitation patients. A total of 52 electrodes were selected from the frontal, parietal, occipital, and temporal regions for the EEG data. For the NIRS data, 21 channels in the frontal area directly related to drug users were chosen. Each participant had 56 samples corresponding to drug-related image stimuli. Under sixfold cross-validation, the model achieved an average accuracy of 92.6% with a standard deviation of 5.56, demonstrating the good performance of the model in the fusion classification of EEG and NIRS data features.
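The cross-validation bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper `kfold_indices` is a hypothetical name, the text does not state whether folds were formed per sample or per participant (nor whether samples were shuffled), and the per-fold accuracies below are made-up placeholders used only to show how a mean and standard deviation are reported.

```python
import statistics

def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds

n_samples = 40 * 56                 # 40 participants x 56 samples each (from the text)
folds = kfold_indices(n_samples, 6)
fold_sizes = [len(f) for f in folds]

# HYPOTHETICAL per-fold accuracies (%), for illustration only; not the paper's data.
fold_acc = [95.0, 85.0, 98.0, 90.0, 96.0, 91.0]
mean_acc = statistics.mean(fold_acc)
std_acc = statistics.stdev(fold_acc)    # sample standard deviation across folds
print(fold_sizes, round(mean_acc, 1), round(std_acc, 2))
```

In practice the indices would be shuffled, and possibly grouped by participant, before splitting so that each fold contains both classes.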

After 400 epochs, the training and testing loss results are shown in Fig. 1. During training, the loss gradually decreased and approached zero, indicating that the model converged without overfitting. This confirms that the network structure is stable and effective. In the testing phase, the loss exhibited frequent oscillations, but the overall fluctuation range was between 0.4 and 0.8, without large-scale variations. These fluctuations likely result from the limited and imbalanced dataset and the high complexity of the model. Despite applying dropout, L2 regularization, and early stopping, the convergence remained partially unstable. This highlights the need for further optimization and supports the exploratory nature of this study.

Fig. 1. Loss curves for training and testing

In a binary classification problem, on the basis of predicted and actual labels of the samples, the samples can be categorized into four situations: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The resulting confusion matrix is shown in Fig. 2.

Fig. 2. Confusion matrix

The top-left corner (TP) represents the number of correctly classified drug rehabilitation samples, the top-right corner (FP) represents the number of healthy samples incorrectly classified as drug rehabilitation samples, the bottom-right corner (TN) represents the number of correctly classified healthy samples, and the bottom-left corner (FN) represents the number of drug rehabilitation samples incorrectly classified as healthy samples. In a confusion matrix, larger diagonal elements indicate more correct classifications, reflecting better model performance. The confusion matrix obtained in this study reflects this characteristic, demonstrating the superiority of the model.

The receiver operating characteristic (ROC) curve is also a graphical method used to evaluate the performance of a binary classification model. It shows the model’s performance at different thresholds by calculating the true positive rate (TPR) and the false positive rate (FPR), which reflect the accuracy of the classification model. The formulas for calculating the TPR and FPR are as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2}$$

The TPR represents the proportion of actual drug rehabilitation samples that are correctly classified as drug rehabilitation, whereas the FPR represents the proportion of actual healthy samples that are incorrectly classified as drug rehabilitation. The AUC (area under the curve) value is obtained by calculating the area under the ROC curve. The AUC value ranges from 0 to 1, and the closer the AUC is to 1, the better the model's performance. The ROC curve plotted in this study is shown in Fig. 3.
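As a minimal sketch of Eqs. (1) and (2), the following computes the TPR and FPR from confusion counts and then builds an ROC curve by sweeping a threshold over classifier scores, integrating it with the trapezoidal rule. The function names, labels, and scores are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) following Eqs. (1) and (2)."""
    return tp / (tp + fn), fp / (fp + tn)

def roc_auc(labels, scores):
    """Sweep thresholds over the scores and integrate the ROC curve."""
    points = [(0.0, 0.0)]                       # (FPR, TPR) pairs
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < thr and y == 1)
        tn = sum(1 for y, s in zip(labels, scores) if s < thr and y == 0)
        tpr, fpr = rates(tp, fp, tn, fn)
        points.append((fpr, tpr))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2        # trapezoid between adjacent points
    return auc

# HYPOTHETICAL labels (1 = drug rehabilitation, 0 = healthy) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

For this toy input, eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9.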

Fig. 3. ROC curve

From the ROC curve, it is evident that the curve for each fold tends toward the top-left corner, indicating the model's strong ability to distinguish between positive and negative classes and demonstrating excellent performance. However, a dip is observed in the middle of the ROC curve for the second fold, suggesting that there may be some overfitting in that fold. The average AUC value from the sixfold cross-validation is 0.903, which proves that the model has good classification ability.

The results above indicate that the deep learning network model designed in this study has a stable structure and excellent performance. It can accurately differentiate between healthy participants and drug rehabilitation participants. The model demonstrates good robustness and strong generalizability, with all results meeting expectations and being helpful in determining the severity of drug addiction.

Comparison of the single-modality and multimodality results

The main task of this study is feature-level fusion and classification of EEG and NIRS data. After completing the fusion of the multimodal data, we also analyzed the classification results for the single-modal data. Single-modal classification was performed via separate subnetworks for each data type. To fully compare the performance of the multimodal and single-modal models, precision, recall, and the F1-score were selected as evaluation metrics. The formulas for calculating the precision, recall, and F1-score are as follows:

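$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{F1\text{-}Score} = \frac{2TP}{2TP + FP + FN} \tag{5}$$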

Precision refers to the proportion of samples predicted as positive that are actually positive, which is the ratio of correctly classified drug rehabilitation samples to all samples classified as drug rehabilitation. Recall refers to the proportion of all actual positive samples that are correctly predicted as positive, which is the ratio of correctly classified drug rehabilitation samples to all actual drug rehabilitation samples. The F1-score is the harmonic mean of precision and recall, providing a comprehensive reflection of the model's performance. A comparison of the performance metrics between the multimodal and single-modal models is shown in Fig. 4.
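Eqs. (3) to (5) can be checked numerically. The short sketch below uses hypothetical confusion counts (not results from this study) and confirms that the count-based F1 formula in Eq. (5) equals the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score following Eqs. (3), (4), and (5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# HYPOTHETICAL counts for illustration only.
tp, fp, fn = 45, 5, 15
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# The harmonic mean of precision and recall should reproduce Eq. (5).
harmonic = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(harmonic, 3))
```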

Fig. 4. Comparative performance analysis of single and multi-modality methods

Figure 4 shows that the performance metrics for the single-modality models are all lower than those for the multimodality models. EEG data shows better accuracy compared to NIRS data, with some fluctuations in the accuracy of NIRS data. The precision of EEG data is as high as 0.85, indicating that the EEG subnetwork successfully classifies drug rehabilitation patients with high accuracy. For the NIRS data, the recall is 0.71, which is higher than both the precision and accuracy. The higher recall suggests that the model tends to misclassify healthy samples as drug rehabilitation samples, indicating that the NIRS subnetwork is well-suited for disease monitoring-type tasks. When analyzed with multimodal data, all the metrics increase. The F1-score increased to 0.89, indicating improved model performance with multimodal fusion for objective and accurate identification of drug-rehabilitated and healthy samples. However, fluctuations in accuracy and the F1-score indicate that the stability of the multimodal model is still insufficient, and further improvements in robustness are needed.

By comparing the performance metrics of single-modal and multimodal models, we find that the multimodal model outperforms the single-modal model in all aspects, further confirming the excellent performance of feature-level fusion of EEG and NIRS data.

Comparison with traditional machine learning methods

In this study, several traditional machine learning methods are compared with the approach proposed in this research. Support vector machine (SVM) seeks the optimal hyperplane to maximize the margin between classes [19]. The random forest (RF) method constructs decision trees and combines prediction results, randomly selecting features at each tree node split [20]. Linear discriminant analysis (LDA) maximizes the between-class scatter and minimizes the within-class scatter to find the best linear projection that separates different classes [21]. K-nearest neighbors (KNN) calculates the distance between the prediction point and training data points, selecting the K nearest neighbors for classification [22]. We extracted a set of handcrafted features from both EEG and NIRS signals, including mean, peak amplitude, and power spectral density (PSD) as representative time- and frequency-domain features. To enhance the descriptive power of the traditional feature extraction pipeline, we further incorporated two nonlinear features commonly used in neurophysiological research: sample entropy (SampEn), which quantifies the unpredictability and complexity of a time series—lower entropy values typically reflect reduced signal variability and have been associated with altered brain dynamics in individuals with substance use disorders; and Hjorth parameters, which include activity (signal variance), mobility (the mean frequency estimated as the ratio between the standard deviation of the first derivative and that of the original signal), and complexity (the degree of frequency modulation, indicating the irregularity of the signal). All features were computed for each EEG and NIRS channel and concatenated into a unified feature vector for use in subsequent classification.

The mean value is used to calculate the average of the signal over a specific time interval. The calculation formula is as follows:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i \tag{6}$$

Here, $\mu$ represents the mean value, $N$ denotes the total number of data points, and $X_i$ refers to the $i$-th data point.

The peak value is used to calculate the maximum and minimum values of the signal within a specific time interval. The calculation formula is as follows:

$$P_{\max} = \max_{t_i \in [t_1, t_2]} x(t_i) \tag{7}$$

$$P_{\min} = \min_{t_i \in [t_1, t_2]} x(t_i) \tag{8}$$

Here, $[t_1, t_2]$ represents the time range of the data segment, which corresponds to the same duration as the time window used in the previous preprocessing step for extracting data samples.
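A minimal sketch of the mean and peak features in Eqs. (6) to (8) follows. The helper name `window_features`, the toy signal, and the sampling rate are hypothetical; the window is treated as half-open (t1 inclusive, t2 exclusive), which is a simplification of the closed interval in the equations.

```python
def window_features(x, fs, t1, t2):
    """Mean, maximum, and minimum of the samples with t1 <= t < t2 (seconds)."""
    start, stop = int(round(t1 * fs)), int(round(t2 * fs))
    segment = x[start:stop]
    mu = sum(segment) / len(segment)        # Eq. (6)
    return mu, max(segment), min(segment)   # Eqs. (7) and (8)

# HYPOTHETICAL single-channel signal sampled at 4 Hz.
x = [0.0, 1.0, 4.0, -2.0, 3.0, 5.0, -1.0, 2.0]
fs = 4
mu, p_max, p_min = window_features(x, fs, 0.0, 1.0)
print(mu, p_max, p_min)
```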

The power spectral density (PSD) is used to compute the power distribution of the signal's frequency components, typically obtained using the fast Fourier transform (FFT). The calculation formula is as follows:

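$$X(f) = \int_{-\infty}^{+\infty} x(t)\, e^{-j 2\pi f t}\, dt \tag{9}$$

$$\mathrm{PSD}(f) = \frac{|X(f)|^2}{T} \tag{10}$$

Here, $X(f)$ represents the Fourier transform of the signal $x(t)$, $|X(f)|^2$ denotes the squared magnitude of the signal at frequency $f$, and $T$ is the duration of the analyzed signal segment.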
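For sampled data, Eqs. (9) and (10) reduce to a periodogram. The sketch below is an illustrative discretization using a direct discrete Fourier sum in plain Python; it is not the authors' implementation, and in practice an FFT routine (for example `numpy.fft.rfft` or `scipy.signal.welch`) would be used instead. The test signal and sampling rate are hypothetical.

```python
import cmath
import math

def periodogram(x, fs):
    """Return (freqs, psd) at the non-negative frequencies of x."""
    n = len(x)
    duration = n / fs                       # T in Eq. (10)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        # Discrete form of Eq. (9): sum of x[i] * exp(-j 2 pi f t) * dt, dt = 1/fs
        xf = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)) / fs
        freqs.append(k * fs / n)
        psd.append(abs(xf) ** 2 / duration)  # Eq. (10), two-sided scaling
    return freqs, psd

# HYPOTHETICAL test signal: a 10 Hz sine sampled at 64 Hz for 1 s.
fs = 64
x = [math.sin(2 * math.pi * 10 * i / fs) for i in range(fs)]
freqs, psd = periodogram(x, fs)
peak_freq = freqs[psd.index(max(psd))]
print(peak_freq)
```

The spectral peak falls at the frequency of the input sine, which is a quick sanity check on the scaling of the frequency axis.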

The sample entropy (SampEn) quantifies the unpredictability and complexity of a time series. Lower entropy values are often associated with reduced variability in brain activity, commonly observed in substance addiction [23]. The formula is given as:

$$\mathrm{SampEn}(m, r, N) = -\ln\frac{A}{B} \tag{11}$$

Here, m is the embedding dimension, r is the tolerance (usually a fraction of the signal’s standard deviation), N is the total number of data points, A is the number of vector pairs matching at dimension m+1, and B is the number of vector pairs matching at dimension m.
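A compact sketch of Eq. (11) is given below. It follows one common convention, which is an assumption here since the text does not specify one: the same N - m templates are used for both lengths, self-matches are excluded, and two templates match when their Chebyshev distance is at most r. The function name and test signals are hypothetical.

```python
import math

def sample_entropy(x, m, r):
    """Sample entropy -ln(A/B) of sequence x, following Eq. (11)."""
    n = len(x)

    def count_matches(length):
        # Count template pairs (i < j) whose Chebyshev distance is <= r.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if max(abs(x[i + k] - x[j + k]) for k in range(length)) <= r:
                    count += 1
        return count

    b = count_matches(m)        # matches at dimension m
    a = count_matches(m + 1)    # matches at dimension m + 1
    if a == 0 or b == 0:
        return float("inf")     # undefined when no matches are found
    return -math.log(a / b)

# HYPOTHETICAL signals: a perfectly regular sequence and one with no repeats.
regular = [0, 1] * 10
sampen_regular = sample_entropy(regular, m=2, r=0.1)
sampen_ramp = sample_entropy([0, 10, 20, 30, 40], m=2, r=0.1)
print(sampen_regular, sampen_ramp)
```

A perfectly periodic sequence yields an entropy of zero because every match at dimension m persists at dimension m + 1, which is consistent with the interpretation of lower values as reduced variability.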

In addition, we computed the Hjorth parameters, which describe the statistical and frequency-related characteristics of the signal. These include:

The Activity, defined as the variance of the signal:

$$\mathrm{Activity} = \mathrm{Var}\big(x(t)\big) \tag{12}$$

The Mobility, representing the mean frequency of the signal:

$$\mathrm{Mobility} = \sqrt{\frac{\mathrm{Var}\!\left(\frac{dx(t)}{dt}\right)}{\mathrm{Var}\big(x(t)\big)}} \tag{13}$$

The Complexity, describing the rate of change in frequency:

$$\mathrm{Complexity} = \frac{\mathrm{Mobility}\!\left(\frac{dx(t)}{dt}\right)}{\mathrm{Mobility}\big(x(t)\big)} \tag{14}$$
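The three Hjorth parameters in Eqs. (12) to (14) can be sketched as follows. The derivative is approximated by the first difference between consecutive samples, which is an assumption of this sketch (mobility is therefore expressed per sample rather than per second), and the test signal is hypothetical.

```python
import math

def variance(x):
    """Population variance of a sequence."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def diff(x):
    """First difference, used as a discrete stand-in for dx/dt."""
    return [b - a for a, b in zip(x, x[1:])]

def mobility(x):
    """Eq. (13): ratio of standard deviations of the derivative and the signal."""
    return math.sqrt(variance(diff(x)) / variance(x))

def hjorth(x):
    """Return (activity, mobility, complexity) following Eqs. (12) to (14)."""
    activity = variance(x)
    mob = mobility(x)
    complexity = mobility(diff(x)) / mob
    return activity, mob, complexity

# HYPOTHETICAL test signal: a 5 Hz sine sampled at 250 Hz for 2 s.
x = [math.sin(2 * math.pi * 5 * i / 250) for i in range(500)]
activity, mob, complexity = hjorth(x)
print(round(activity, 3), round(mob, 4), round(complexity, 3))
```

For a pure sine the complexity is close to 1, its minimum, and it grows as the waveform becomes less sinusoidal.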

All of the above features were extracted from each EEG and NIRS channel and concatenated into a unified feature vector for use in traditional machine learning classification. The specific performance evaluation metrics are presented in Table 1. Table 1 shows that the proposed method achieves the highest accuracy, precision, and F1 score. SVM demonstrates the best precision among machine learning methods, reflecting its strong classification performance and suitability for high-dimensional data. RF's accuracy and precision are moderate but suggest potential overfitting. LDA struggles with complex high-dimensional data, leading to lower accuracy and precision. KNN achieves the highest accuracy among machine learning methods, delivering excellent results despite longer computation time. AR-TSNET outperforms machine learning methods in accuracy, precision, and F1 score, though its low recall indicates a conservative decision threshold, potentially overlooking some true positives. The advantage of AR-TSNET is that it does not require additional feature extraction or other manual processing after data input into the model. In contrast, traditional machine learning methods rely on manual or automated feature extraction. Therefore, the proposed approach is more efficient and achieves high classification accuracy.

Table 1. Comparison of machine learning methods

Methods          Accuracy (%)   Precision (%)   Recall (%)   F1-score (%)
SVM              88.4           89.6            83.4         85.9
RF               79.7           78.2            81.7         79.2
LDA              78.8           74.8            82.8         74.3
KNN              89.4           88.7            84.3         86.1
Proposed study   92.6           90.9            82.0         89.7

Column maxima (shown in bold in the original table): accuracy 92.6, precision 90.9, and F1-score 89.7 for the proposed study; recall 84.3 for KNN.

Discussion

This study uses EEG and NIRS data for feature-level fusion and classifies healthy individuals and drug users. The approach of using deep learning networks for classification is an improvement over traditional psychological methods, and the final results validate the accuracy of the classification between healthy individuals and drug users.

Chu et al. [24] optimized EEG-fNIRS configurations through feature analysis, using 26-channel EEG and 2-channel fNIRS to achieve 76.25% accuracy in psychological load tasks, paving the way for applications in complex human–machine interactions. Liu et al. [25] classified psychological load levels during an n-back task, demonstrating that EEG-fNIRS fusion significantly outperforms single modalities. Cicalese et al. [26] achieved 79.31% accuracy in classifying healthy individuals, mild cognitive impairment, and Alzheimer's disease patients using LDA with mixed features, surpassing single-modal performance. Khan et al. [27] proposed MSVD-based fusion for brain-computer interfaces, integrating EEG and fNIRS with KNN and decision trees for effective classification. Ge et al. [28] studied neural activation during action observation, showing that EEG-fNIRS fusion improved task accuracy to 72.7%, offering insights into neural mechanisms and brain decoding applications. These studies collectively underscore the advantages of multimodal fusion for enhancing classification accuracy and advancing neuroscience applications.

In our previous research [29], the model achieved good accuracy but lacked stability and robustness, with long computation times and low efficiency. Therefore, in this study, we added Self-attention after Tception to enhance the model's ability to model important features. We also incorporated Channel-attention after Sception to improve the model's ability to capture both global and local information. This reduced the model's attention to redundant features, improving computational efficiency. The addition of ResNet prevented information loss, further enhancing the model's stability.

The deep learning network AR-TSNET we proposed showed a stable decrease in loss on the training set, but the loss for the testing set fluctuated significantly. This suggests that the model's generalization ability on unseen data is not strong. The confusion matrix indicates that the model has a good ability to distinguish between healthy individuals and drug users, and the ROC curve provides a comprehensive view of the model’s performance. However, the fluctuations in the AUC value suggest that overfitting may exist.

By comparing the performance metrics of the single-modal and multimodal models, we find that the fusion of multimodal data significantly outperforms single-modality models. The higher precision and lower recall of the multimodal model indicate that it can accurately classify healthy individuals and drug users. The increase in the F1-score for the multimodal model compared with the single-modal model further demonstrates the superiority of the multimodal model, indicating that our model is well-suited for addiction severity classification and is reasonable for objectively assessing drug addiction severity.

Nevertheless, this study should be regarded as exploratory, and the proposed method is not yet ready for clinical deployment. The relatively small and imbalanced sample size, single-center data collection, and demographic disparities between groups—such as variations in age and educational background—may constrain the generalizability of our findings. Although standard techniques such as regularization and early stopping were applied, the model's stability and robustness remain to be further improved. These limitations highlight the need for future validation with larger, more diverse, and multi-center datasets. Despite these constraints, the use of bimodal EEG and NIRS data—each offering complementary temporal and spatial resolution—demonstrates promise in improving classification accuracy and may provide a valuable physiological basis for objective assessment in addiction-related research and clinical decision-making.

Moreover, although this study focuses on multimodal EEG–NIRS fusion, we acknowledge the importance of benchmarking our Tception module against widely used EEG-specific deep learning models such as EEGNet, ShallowConvNet, DeepConvNet, and HTNet. These models serve as strong baselines in EEG classification tasks. In future work, we plan to integrate them as comparative baselines or replace the EEG branch of our framework to more thoroughly evaluate model performance and generalizability.

Conclusion

This study aimed to find an objective and accurate method for distinguishing drug addiction. Therefore, an experimental paradigm was designed to visually trigger drug cravings in drug users. EEG and NIRS data were collected from 20 healthy individuals and 36 drug users. An end-to-end deep learning network model AR-TSNET was constructed. The EEG subnetwork uses the Tception module to extract features, and the Self-attention module is used to assign weights to the features. In the NIRS subnetwork, the Sception module is used to extract features, whereas the Channel-attention module integrates global and local information, fully leveraging the advantages of each modality. The ResNet module is employed to prevent information loss and improve training efficiency. Finally, the features obtained from the subnetworks are fused and classified. This model demonstrates sufficient stability and can quickly and accurately generate classification results. The method proposed in this paper effectively classifies drug users and healthy individuals and provides support for subsequent clinical diagnosis and drug treatment.

Materials and methods

Participant Information

Twenty healthy male subjects, with an average age of 33.14 ± 2.67 years and an average education level of 14.27 ± 1.65 years, were recruited from Shanghai University of Medicine and Health Sciences. Thirty-six male subjects, all of whom were withdrawn from methamphetamine use, were recruited from the Qingdong Drug Rehabilitation Center in Shanghai, with an average age of 41.30 ± 4.64 years and an average education level of 8.16 ± 1.72 years. The average duration of drug use was 9.31 ± 5.69 years, and the average duration of drug abstinence was 12.81 ± 3.56 months.

The specific inclusion criteria for the study participants were as follows: (1) patients met the diagnostic criteria for substance use disorders in the DSM-5; (2) patients had less than six months of detoxification; (3) patients had at least a junior high school education; (4) patients were aged 25–55 years; (5) patients were willing to participate in the study and provided informed consent. The exclusion criteria were as follows: (1) severe cognitive impairment or inability to cooperate with the study's assessments and examinations; (2) severe physical disease; (3) severe psychotic symptoms; (4) participation in other psychological interventions or treatments; (5) history of abuse of other psychoactive substances (with the exception of nicotine). The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of Shanghai University of Medicine and Health Sciences (Approval No. 2021-SWYYKJ-04-230204197506121418).

EEG and NIRS synchronized integrated acquisition system

This study utilized the NeuSen U digital EEG system (Boruikang Technology, Changzhou, China), which follows the international 10/20 system, with 64 channels and a 2000 Hz sampling rate. The system features fully digital transmission, with each channel having an input impedance ≥ 120 MΩ. For NIRS, a NirSmart-3000DS system (Huichuang Medical Equipment, Danyang, China), consisting of an LED light source and an avalanche photodiode (APD) detector was used. The system operates at wavelengths of 730 nm and 850 nm with a sampling rate of 11 Hz, and uses 8 light source probes and 8 detection probes to form 42 effective channels. To ensure synchronized data acquisition, both EEG and NIRS electrodes were placed on the same cap, enabling simultaneous full-brain EEG and frontal NIRS signal recording. Figure 5 shows the electrode and detector placements, with light blue points representing EEG electrodes and blue/purple points representing NIRS detectors and receivers.

Fig. 5. Electrode arrangement

Methods

Data acquisition

The experimental paradigm, which was designed via E-prime software, consists of three stages. Figure 6 shows the overall process, Fig. 7 presents examples of drug-related and neutral stimuli, and Fig. 8 depicts the data acquisition scene.

Fig. 6. Experimental paradigm process diagram

Fig. 7. Examples of drug and neutral images, and drug use video

Fig. 8. Examples of data acquisition scenes

The first stage lasts 6 min, with 3 min of rest with the eyes closed and 3 min of rest with the eyes open, guided by audio instructions. The second stage lasts approximately 13 min, with 56 blocks of 10 s each. Each block begins with 4 images (2 drug-related and 2 neutral), each displayed for 0.6 s, followed by a 7.6-s rest with a landscape painting; a 4-s black screen with a white cross separates consecutive blocks. The third stage lasts 5 min, with 28 blocks of 10 s each. Each block presents a 2.4-s drug-use video, followed by a 7.6-s rest and a 4-s separation.
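The block timing of the paradigm can be checked with simple arithmetic. The sketch below rests on two assumptions that are our reading of the description rather than stated facts: each of the four images is shown for 0.6 s (so that stimulus plus rest fills the 10-s block), and the 4-s separator is counted in addition to each block. All variable names are hypothetical.

```python
# Values taken from the paradigm description.
image_s = 0.6          # display time per image (ASSUMED to be per image)
images_per_block = 4
video_s = 2.4          # drug-use video in stage 3
rest_s = 7.6
separator_s = 4.0      # black screen with white cross (ASSUMED to be extra time)

stage2_block_s = images_per_block * image_s + rest_s    # stimulus + rest
stage3_block_s = video_s + rest_s

# Stage durations in minutes, with and without the separator screens.
stage2_with_sep = 56 * (stage2_block_s + separator_s) / 60
stage2_without_sep = 56 * stage2_block_s / 60
stage3_with_sep = 28 * (stage3_block_s + separator_s) / 60
stage3_without_sep = 28 * stage3_block_s / 60

print(round(stage2_with_sep, 2), round(stage2_without_sep, 2))
print(round(stage3_with_sep, 2), round(stage3_without_sep, 2))
```

Under these assumptions, stage 2 comes to about 13.1 min including separators, consistent with the stated duration, whereas the stated 5 min for stage 3 is closer to the total without separators (about 4.7 min) than to the total with them (about 6.5 min).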

EEG and NIRS signals were simultaneously recorded while the participants viewed the paradigm in a quiet, dark environment, and wore the integrated EEG–NIRS cap, as shown in Fig. 8.

Data preprocessing

EEG data preprocessing

EEG data preprocessing was performed in MATLAB R2022b via the EEGLAB toolbox. First, 52 electrode channels between FZ and O2 were selected and localized according to the EEG electrode layout [30–32]. Owing to the significant difference in sampling rates, the original EEG data sampled at 1000 Hz were downsampled to 250 Hz to match the NIRS data [33]. To minimize the impact of environmental noise and physiological artifacts such as eye blinks and muscle activity, a band-pass Butterworth filter was applied to the EEG data [34]. The Butterworth filter is characterized by its maximally flat frequency response in the passband, and its squared magnitude response is defined as:

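$$|H(f)|^2 = \frac{1}{1 + \left(\frac{f}{f_c}\right)^{2n}} = \frac{1}{1 + \epsilon^2 \left(\frac{f}{f_p}\right)^{2n}} \tag{15}$$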

In this formula, $n$ is the order of the filter, $f_c$ is the cutoff frequency, $f_p$ is the passband edge frequency, and $\epsilon$ sets the attenuation at the passband edge, since $|H(f_p)|^2 = 1/(1+\epsilon^2)$. In this study, the band-pass range was set to 0.5–45 Hz. The lower cutoff at 0.5 Hz effectively removes slow drifts and movement artifacts, while the upper cutoff at 45 Hz eliminates high-frequency noise such as electromyographic (EMG) interference, without distorting the EEG signal of interest [35, 36]. The EEG data were segmented into 1-s samples on the basis of labeled markers, and independent component analysis (ICA) was applied to remove eye movement artifacts.
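To make Eq. (15) concrete, the sketch below evaluates the squared magnitude response for the upper band edge at 45 Hz. Eq. (15) describes the low-pass prototype, so this illustrates only one edge of the band-pass filter, and the filter order n = 4 is an assumed value because the order is not given in the text. An actual band-pass filter would normally be designed with a signal-processing library, for example `scipy.signal.butter` with `btype="bandpass"`.

```python
import math

def butterworth_gain_sq(f, fc, n):
    """Squared magnitude |H(f)|^2 of an n-th order low-pass Butterworth, Eq. (15)."""
    return 1.0 / (1.0 + (f / fc) ** (2 * n))

def gain_db(f, fc, n):
    """The same response expressed in decibels."""
    return 10 * math.log10(butterworth_gain_sq(f, fc, n))

fc = 45.0    # upper cutoff used in the study (Hz)
order = 4    # ASSUMED filter order, not stated in the text

for f in (10.0, 30.0, 45.0, 50.0, 100.0):
    print(f, round(gain_db(f, fc, order), 2))
```

The response is nearly flat well below the cutoff, drops to half power (about -3 dB) at the cutoff itself, and falls off more steeply as the order increases.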

NIRS data preprocessing

The preprocessing of the near-infrared data in this study was performed via the MNE library in Python. First, the raw near-infrared data represented as light intensity signals were converted into HbO and HbR values via the Lambert–Beer equation [37]. To ensure the reliability of data acquisition, NIRS channels covering bilateral prefrontal and frontal regions were selected, as these areas are closely associated with cognitive control and drug-related decision-making processes. Although NIRS signals are generally less susceptible to high-frequency motion artifacts compared to EEG, they are prone to physiological noise such as cardiac pulsations, respiration, and slow hemodynamic drifts. Therefore, a band-pass Butterworth filter with a frequency range of 0.01–0.3 Hz was applied to the HbO and HbR signals to remove such interferences [38]. This frequency range removes interference from heartbeats, respiration, and slow drift, while preserving features related to the degree of drug addiction. Additionally, baseline correction was performed on the near-infrared data to eliminate individual differences in the starting values [39]. Similar to EEG data, the near-infrared data were segmented into a series of 0.6-s samples on the basis of the labeled markers.

The preprocessing flowchart for the experimental data is shown in Fig. 9. After preprocessing, the EEG and near-infrared data were input into the AR-TSNET deep learning network for further processing.

Fig. 9. Data preprocessing workflow

Deep learning algorithm

AR-TSNET based on feature-level fusion

The deep learning network designed in this study is trained end to end, eliminating the need for manual feature extraction. The Tception and Self-attention modules process EEG data by capturing temporal features and enhancing key patterns, while the Sception, Channel-attention, and ResNet modules handle NIRS data, focusing on spatial features, signal channel relevance, and refined feature representation. Finally, the features are fused and classified. The AR-TSNET procedure is summarized in Algorithm 1, and its specific structure is depicted in Fig. 10.

Algorithm 1. AR-TSNET algorithm

EEG subnetwork

In this study, the Tception module was used to extract features from the EEG data. The Tception module includes multiscale 1D convolutions, ReLU activation, and average pooling. The multiscale 1D convolutions extract features from both short- and long-term windows using convolution kernels of different lengths. The formula for the 1D convolution is as follows:

$$Y_i = \sum_{j=0}^{k-1} X_{i \times s + j} \times K_j + b \tag{16}$$

Here, K represents the convolution kernel and k its length, s is the stride of the convolution, and b is the bias term (if present). The length of the convolution kernel is determined by a time window sequence α_i and the sampling rate f_S, where i is an integer starting from 1. Therefore, the kernel size at the i-th layer, k_i, can be defined as:

$$k_i = (1, \alpha_i \cdot f_S) \tag{17}$$

The above formula allows for the automatic adaptation of dynamic convolution kernels. Convolution across different time windows can simultaneously capture features at multiple time scales, reducing redundant features while preserving the multidimensional information of the original data. This enables the model to capture signal fluctuation characteristics over different time ranges, which is crucial for effectively modeling temporal dependencies (Fig. 10).
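The kernel-length rule and the strided 1D convolution can be illustrated with a short NumPy sketch. The window ratios in `alphas` and the averaging kernels are assumptions chosen for the example (the text does not list the values used), and the loop-based `conv1d` is written for readability rather than speed.

```python
import numpy as np


def kernel_lengths(alphas, fs):
    """Kernel length per scale: alpha_i * f_S, truncated to an integer."""
    return [int(a * fs) for a in alphas]


def conv1d(x, kernel, stride=1, bias=0.0):
    """Y_i = sum_j X[i*s + j] * K[j] + b, with no padding."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel) + bias
                     for i in range(n_out)])


fs = 250                                 # EEG sampling rate from the text
alphas = [0.5, 0.25, 0.125]              # assumed window ratios, not given in the text
lengths = kernel_lengths(alphas, fs)     # one kernel length per temporal scale

x = np.random.default_rng(0).standard_normal(fs)    # one synthetic 1-s EEG channel
features = [conv1d(x, np.ones(k) / k) for k in lengths]
```

Longer kernels produce shorter, smoother outputs that summarize slow fluctuations, while shorter kernels retain faster changes; concatenating the outputs gives a multiscale representation.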

Fig. 10. AR-TSNET deep learning network architecture

Although the Tception module has multiscale capabilities, convolution is a local operation with a limited receptive field. Therefore, incorporating a self-attention mechanism can more comprehensively model the temporal characteristics of EEG data. The self-attention module further enhances the model’s ability to capture temporal patterns in EEG data by adaptively adjusting feature weights, suppressing irrelevant or redundant information, and dynamically modeling global dependencies within the EEG data. The formula for the self-attention mechanism is as follows:

$$\text{Self-Attention} = \gamma \cdot V \cdot \mathrm{Softmax}(Q \cdot K^T) + x \tag{18}$$

Here, x represents the input data, and Q, K, and V refer to the query matrix, key matrix, and value matrix derived from the input data, respectively. γ is a learnable scalar parameter. In this way, the EEG subnetwork can effectively capture the global dependencies between data features and focus on the most relevant parts.
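A minimal NumPy sketch of this residual self-attention is given below. It assumes a (time steps, features) layout, so the attention matrix multiplies V from the left; the projection matrices `w_q`, `w_k`, `w_v`, their sizes, and the value of `gamma` are illustrative and not taken from the trained model.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v, gamma):
    """x: (time steps, features). Returns gamma * Softmax(Q K^T) V + x and the weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T, axis=-1)          # (time, time); each row sums to 1
    return gamma * (attn @ v) + x, attn


rng = np.random.default_rng(0)
x = rng.standard_normal((20, 4))              # 20 time steps, 4 feature maps
w_q, w_k = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
w_v = rng.standard_normal((4, 4))             # V keeps the input width so the sum with x is valid
out, attn = self_attention(x, w_q, w_k, w_v, gamma=0.1)
```

With `gamma=0` the module reduces to the identity, so a network that starts γ near zero can begin from the convolutional features alone and learn how much global context to mix in.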

NIRS subnetwork

In this study, the Sception module is used to extract features from the NIRS data. The Sception module includes 2D convolution, ReLU activation, and average pooling. The 2D convolution contains global, hemispheric, and local kernels, with different kernel sizes helping to capture features of various scales in the near-infrared signals and the relationships between different brain regions. The global kernel has a size of (1, C), where C is the number of NIRS channels, and is used to capture global spatial relationship patterns. Inspired by EEGNet [40], a hemispheric kernel of size (1, 0.5C) is used to explore the relationship patterns between symmetric brain regions. A local kernel of size (1, 0.25C) is used to find relationships between different subregions of the brain. This allows the model to analyze oxygen concentration changes in NIRS from both global and local perspectives.
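The three spatial kernel sizes can be sketched in NumPy as follows. The channel count `C = 8`, the uniform averaging weights, the stride equal to the kernel size, and a channel ordering in which each half corresponds to one hemisphere are all assumptions made for illustration.

```python
import numpy as np


def spatial_conv(x, weights):
    """Apply a kernel spanning len(weights) channels with stride equal to its size.

    x: (channels, time). Returns (channels // len(weights), time).
    """
    size = len(weights)
    n_out = x.shape[0] // size
    return np.stack([weights @ x[i * size:(i + 1) * size] for i in range(n_out)])


C, T = 8, 50                                             # hypothetical channel and sample counts
x = np.random.default_rng(0).standard_normal((C, T))
global_out = spatial_conv(x, np.ones(C) / C)             # kernel spanning C channels
hemi_out = spatial_conv(x, np.ones(C // 2) / (C // 2))   # kernel spanning 0.5C channels
local_out = spatial_conv(x, np.ones(C // 4) / (C // 4))  # kernel spanning 0.25C channels
```

Under these assumptions the global kernel yields one spatial summary, the hemispheric kernel two (one per half), and the local kernel four subregion summaries.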

The channel-attention module enhances the model's feature extraction capability by adjusting weights on the basis of channel relevance. The channel attention mechanism assigns higher weights to more important channels, increasing the model's sensitivity to meaningful signals. The formula for the channel attention mechanism is as follows:

$$\text{Channel-Attention} = \sigma\big(f(g_{avg}(x_c)) + f(g_{max}(x_c))\big) \cdot x \tag{19}$$

Here, x represents the input features, g_avg(x_c) refers to average pooling of channel c, g_max(x_c) refers to max pooling of channel c, f is the mapping applied to both pooled descriptors, and σ represents the sigmoid activation function. Channel attention effectively integrates both global and local information, enhances the model's understanding of spatial distribution features, and improves robustness by suppressing noise.
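The channel-attention weighting can be sketched in NumPy as shown below. The bottleneck mapping `f` is assumed here to be two bias-free linear maps with a ReLU in between, shared by both pooled descriptors; the channel count, reduction ratio, and random weights are illustrative only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """x: (channels, height, width). Returns reweighted features and channel weights."""

    def f(v):
        # Assumed shared bottleneck: reduce, ReLU, expand back to the channel count.
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg = x.mean(axis=(1, 2))                # global average pooling per channel
    mx = x.max(axis=(1, 2))                  # global max pooling per channel
    weights = sigmoid(f(avg) + f(mx))        # one weight in (0, 1) per channel
    return weights[:, None, None] * x, weights


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 56))          # hypothetical (channels, height, width)
w1 = rng.standard_normal((2, 4))             # reduction ratio 2: 4 -> 2
w2 = rng.standard_normal((4, 2))             # expansion: 2 -> 4
out, weights = channel_attention(x, w1, w2)
```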

After multiple layers of feature extraction and channel attention, the potential for gradient vanishing and information loss exists, which can be mitigated by adding a ResNet module. The ResNet module learns the difference between the input and output, allowing the network to focus on extracting incremental features rather than directly fitting complex input distributions. The formula for the residual module is as follows:

$$\text{ResNet} = \mathrm{ReLU}\big(\mathrm{BN}(W_2\,\mathrm{ReLU}(\mathrm{BN}(W_1 x))) + x\big) \tag{20}$$

Here, x represents the input features, W1 and W2 represent the convolution operations, and BN represents batch normalization. ResNet more accurately extracts changing features, and its residual connection facilitates optimization, reducing training complexity. The stacking of multiple ResNet modules allows the model to progressively extract spatial distribution features in the data, from low-level local information to high-level global patterns, ensuring that key information is not overlooked.
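A simplified NumPy sketch of the residual computation follows. To keep it short, W1 and W2 are modeled as plain matrix products instead of 2D convolutions, batch normalization has no learnable scale or shift, and the skip connection is added before the final ReLU; these are assumptions for illustration, not details confirmed by the text.

```python
import numpy as np


def batch_norm(h, eps=1e-5):
    """Standardize each feature over the batch axis (no learnable scale or shift)."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)


def relu(h):
    return np.maximum(h, 0.0)


def residual_block(x, w1, w2):
    """ReLU(BN(W2 ReLU(BN(W1 x))) + x), with W1 and W2 as matrix products."""
    h = relu(batch_norm(x @ w1))
    return relu(batch_norm(h @ w2) + x)


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 3))             # batch of 16 samples, 3 features
w1 = rng.standard_normal((3, 3))
w2 = rng.standard_normal((3, 3))
y = residual_block(x, w1, w2)
```

If the second mapping contributes nothing (for example, `w2` is all zeros), the block returns `relu(x)`: the skip path carries the input through unchanged, which is why the learned branch only has to model the increment.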

Feature-level fusion

Feature-level fusion combines features extracted from different modalities into a unified representation for classification, enhancing analysis accuracy and performance. In this study, EEG and NIRS data are processed independently to extract meaningful features, which are then transformed into compatible distributions with the same dimensions and format. This standardization enables seamless feature-level fusion, leveraging the strengths of both modalities for improved classification accuracy and robustness.

The EEG data are processed using the Tception and Self-attention modules to extract temporal features and enhance key patterns. The NIRS data are processed via the Sception, Channel-attention and ResNet modules to capture spatial features and refine signal representations. After feature extraction, batch normalization standardizes the data before it is flattened and concatenated to integrate information from both modalities. The fused features are passed into a fully connected layer to produce the final classification result and leverage the strengths of both EEG and NIRS data.
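The concatenation step can be sketched as follows, using the 1200-dimensional branch outputs and the 2400 → 4 → 1 head listed in Table 2. The ReLU between the two linear layers, the sigmoid output, and the random weights and inputs are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 24                                             # batch size from the training setup
eeg_feat = rng.standard_normal((batch, 1200))          # stand-in for the EEG branch output
nirs_feat = rng.standard_normal((batch, 1200))         # stand-in for the NIRS branch output

fused = np.concatenate([eeg_feat, nirs_feat], axis=1)  # feature-level fusion: (batch, 2400)

# Assumed classifier head: 2400 -> 4 -> 1 with a ReLU in between and a sigmoid output.
w1, b1 = rng.standard_normal((2400, 4)) * 0.01, np.zeros(4)
w2, b2 = rng.standard_normal((4, 1)) * 0.01, np.zeros(1)
hidden = np.maximum(fused @ w1 + b1, 0.0)
prob = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))       # value in (0, 1), as a BCE loss expects
```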

Model architecture and parameters

Table 2 presents the full architecture of the model, including each layer’s type, output shape, and the number of trainable parameters. The complete model consists of approximately 24.7 million parameters, with 10.5 million from the EEG path, 6.6 million from the NIRS path, and 7.6 million from the final classification layers.

Table 2. Model architecture and parameters

| Layer | Type | Kernel/params | Output shape | Trainable params |
|---|---|---|---|---|
| 1. EEG branch (Tception + Self-Attention) | | | | |
| Tception1 | Conv2D | (52, 4, 1 × 128) | (1, 4, 1, 388) | 26,624 |
| Tception2 | Conv2D | (52, 4, 1 × 64) | (1, 4, 1, 451) | 13,312 |
| Tception3-8 | Conv2D × 6 | …… | …… | ~80,000 |
| BatchNorm | BN2D | num_T = 4 | (1, 4, 1, 1819) | 8 |
| SelfAttention | Conv2D | 3 conv layers + gamma | (1, 4, 1, 1819) | ~100 |
| Flatten + FC_eeg | Linear | 8667 → 1200 | (1, 1200) | 10,400,400 |
| 2. NIRS branch (Sception + Channel Attention + Residual Block) | | | | |
| Sception1 | Conv2D | (42, 3, 1 × 100) | (1, 3, 4, 1) | 12,600 |
| Sception2-6 | Conv2D | …… | …… | ~50,000 |
| Channel Attention | Conv2D × 2 | 3 → 1.5 → 3 (ratio = 2) | (1, 3, 1, 1) | 21 |
| Residual Block | Conv2D × 2 | BN + ReLU × 2 | (1, 3, 8, 56) | 864 |
| Flatten + FC_nirs | Linear | 5502 → 1200 | (1, 1200) | 6,602,400 |
| 3. Fully connected classifier (fusion) | | | | |
| FC1-8 | Linear | 2400 → 4 | | 4,915,200 |
| Output | Linear | 4 → 1 | | 5 |
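Several of the counts in Table 2 can be reproduced as the product of the listed dimensions. The short check below does this for the rows whose dimensions are fully specified; reading these rows as bias-free weight counts is an interpretation for the example, not something the text states.

```python
from math import prod

# (layer name, dimensions taken from Table 2, count reported in Table 2)
rows = [
    ("Tception1", (52, 4, 128), 26_624),
    ("Tception2", (52, 4, 64), 13_312),
    ("FC_eeg", (8667, 1200), 10_400_400),
    ("Sception1", (42, 3, 100), 12_600),
    ("FC_nirs", (5502, 1200), 6_602_400),
]

# True where the product of the dimensions equals the reported count.
matches = {name: prod(dims) == reported for name, dims, reported in rows}
```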

Model training

This study was conducted on a Windows 10 operating system with an Intel Xeon Gold 5218R processor and an NVIDIA GeForce RTX 4090 GPU. The deep learning network model was implemented in PyTorch. K-fold cross-validation was employed, with k set to 6. K-fold cross-validation is a model evaluation method that randomly splits the dataset into k subsets. In each of the k iterations, one subset is held out as the test set and the remaining subsets are used as the training set. After the model is trained on each training set, it is evaluated on the corresponding test set, and the results from the k evaluations are averaged to estimate the overall performance of the model. This method assesses the model's generalization ability and helps guard against overfitting to a single train/test split. During training, early stopping was incorporated: if the model's accuracy no longer improved, training stopped early. The patience for early stopping was set to 200. Training used the Adam optimizer with the BCELoss loss function, a learning rate of 0.0001, 400 epochs, a dropout rate of 0.1, a batch size of 24, and an L2 regularization coefficient of 0.000001. Each training session lasted approximately 50–60 min.
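The early-stopping rule and the reported hyperparameters can be sketched as follows. The `EarlyStopping` class, the reading of patience as a count of consecutive non-improving epochs, the `config` key names (including mapping the L2 coefficient to a `weight_decay` entry), and the accuracy history are hypothetical; only the numeric settings come from the text.

```python
class EarlyStopping:
    """Stop when the monitored accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=200):
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, accuracy):
        """Record one epoch; return True when training should stop."""
        if accuracy > self.best:
            self.best = accuracy
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Settings reported in the text; the key names are illustrative.
config = {"optimizer": "Adam", "loss": "BCELoss", "lr": 1e-4, "epochs": 400,
          "dropout": 0.1, "batch_size": 24, "weight_decay": 1e-6, "patience": 200}

stopper = EarlyStopping(patience=3)          # small patience for illustration only
history = [0.70, 0.80, 0.80, 0.79, 0.78, 0.90]
stopped_at = next((epoch for epoch, acc in enumerate(history) if stopper.step(acc)), None)
```

In this toy history, training stops before the late improvement is seen, which illustrates why a generous patience is useful when validation accuracy fluctuates.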

Cross-validation strategy

We adopted sixfold cross-validation at the subject level, where all 56 subjects were randomly split into six folds. In each fold, approximately 46 subjects (2576 trials) were used for training and 10 subjects (560 trials) were used for testing. All 56 trials from each subject were kept entirely within either the training or test set to prevent data leakage and ensure independent evaluation. The classification performance was evaluated using accuracy, precision, recall, F1-score, and area under the ROC curve (AUC), averaged across all six folds with standard deviation.
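A subject-level split of this kind can be sketched in NumPy as below. The subject count, trials per subject, and number of folds follow the text; the random seed and the use of `np.array_split` to form the folds are assumptions, and no class stratification is attempted.

```python
import numpy as np


def subject_folds(n_subjects, k, seed=0):
    """Randomly partition subject indices into k disjoint test folds."""
    order = np.random.default_rng(seed).permutation(n_subjects)
    return np.array_split(order, k)


n_subjects, trials_per_subject, k = 56, 56, 6     # values from the text
folds = subject_folds(n_subjects, k)

splits = []
for test_subjects in folds:
    # Every trial of a subject follows that subject into train or test, never both.
    train_subjects = np.setdiff1d(np.arange(n_subjects), test_subjects)
    splits.append((train_subjects, test_subjects))

train_subjects, test_subjects = splits[0]
n_train_trials = len(train_subjects) * trials_per_subject
n_test_trials = len(test_subjects) * trials_per_subject
```

Because 56 is not divisible by 6, the folds hold either 9 or 10 test subjects; a fold with 10 test subjects gives the 2576 training and 560 test trials quoted above.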

Leave-one-subject-out (LOSO) cross-validation was not implemented due to the high computational cost and class imbalance, but we acknowledge its value and plan to incorporate it in future work.

Author contributions

Xiaowen Zhang: Conceptualization, Data acquisition, Investigation, Methodology, Project administration, Writing – original draft, Writing – review and editing. Xuelin Gu: Conceptualization, Investigation, Methodology, Project administration, Writing – review and editing. Li Chen: Conceptualization, Data acquisition, Investigation, Methodology, Project administration. Xueshan Cao: Data acquisition, Project administration, Supervision. Chaojing Zhang: Data acquisition, Project administration, Supervision. Xiaoou Li: Project administration, Resources, Supervision, Validation, Writing – review and editing.

Funding

This work is sponsored by the Shanghai Municipal Science and Technology Plan Project (No. 22010502400), the Shanghai University of Medicine & Health Sciences Mental Health Research Institute Foundation (No. YJYPI202402), the Pudong New Area Science and Technology Development Fund Livelihood Research Project for Public Institutions (No. PKJ2024-Y37) and the Shanghai Municipal Education Commission AI Program (No. SHJWAIJK241203).

Data availability

No datasets were generated or analysed during the current study.

Declarations

Ethics approval and consent to participate

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. We have obtained the ethical approval for the data of individuals with drug addiction from the Ethics Committee of Shanghai University of Medicine and Health Sciences (Approval No. 2021-SWYYKJ-04-230204197506121418).

Informed consent

Informed consent was obtained from all individual participants included in the study.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xuelin Gu: co-first author.

References

  • 1. Chen J, Yu K, Bi Y. Strategic integration: a cross-disciplinary review of the fNIRS-EEG dual-modality imaging system for delivering multimodal neuroimaging to applications. Brain Sci. 2024;14(10):1022. doi: 10.3390/brainsci14101022.
  • 2. Qiu L. EFMLNet: fusion model based on end-to-end mutual information learning for hybrid EEG-fNIRS brain–computer interface applications.
  • 3. Qi X, Wen Y, Zhang P. MFGCN: multimodal fusion graph convolutional network for speech emotion recognition. Neurocomputing. 2025;611:128646. doi: 10.1016/j.neucom.2024.128646.
  • 4. Al-Quraishi MS, Elamvazuthi I, Tang TB. Bimodal data fusion of simultaneous measurements of EEG and fNIRS during lower limb movements. Brain Sci. 2021;11(6):713. doi: 10.3390/brainsci11060713.
  • 5. Liu Y. Mental workload classification with concurrent electroencephalography and functional near-infrared spectroscopy.
  • 6. Yi L, Xie G, Li Z. Automatic depression diagnosis through hybrid EEG and near-infrared spectroscopy features using support vector machine. Front Neurosci. 2023;17:1205931. doi: 10.3389/fnins.2023.1205931.
  • 7. Güven A, Altinkaynak M, Dolu N. Combining functional near-infrared spectroscopy and EEG measurements for the diagnosis of attention-deficit hyperactivity disorder. Neural Comput Appl. 2020;32(12):8367–80. doi: 10.1007/s00521-019-04294-7.
  • 8. Jain R, Singh P, Kaur A. An ensemble reinforcement learning-assisted deep learning framework for enhanced lung cancer diagnosis. Swarm Evol Comput. 2024;91:101767. doi: 10.1016/j.swevo.2024.101767.
  • 9. Chu H, Cao Y, Jiang J. Optimized EEG–fNIRS based mental workload detection method for practical applications. (2021-07-24) [2024-12-20]. https://www.researchsquare.com/article/rs-683529/v1. doi: 10.21203/rs.3.rs-683529/v1.
  • 10. Lin J, Lu J, Shu Z. An EEG-fNIRS neurovascular coupling analysis method to investigate cognitive-motor interference. Comput Biol Med. 2023;160:106968. doi: 10.1016/j.compbiomed.2023.106968.
  • 11. Qiu L, Zhong Y, He Z. Improved classification performance of EEG-fNIRS multimodal brain-computer interface based on multi-domain features and multi-level progressive learning. Front Hum Neurosci. 2022;16:973959. doi: 10.3389/fnhum.2022.973959.
  • 12. Guo M, Feng L, Chen X. A novel strategy for differentiating motor imagination brain-computer interface tasks by fusing EEG and functional near-infrared spectroscopy signals. Biomed Signal Process Control. 2024;95:106448. doi: 10.1016/j.bspc.2024.106448.
  • 13. Khan MJ, Ghafoor U, Hong KS. Early detection of hemodynamic responses using EEG: a hybrid EEG-fNIRS study. Front Hum Neurosci. 2018;12:479. doi: 10.3389/fnhum.2018.00479.
  • 14. Abtahi M, Bahram Borgheai S, Jafari R. Merging fNIRS-EEG brain monitoring and body motion capture to distinguish Parkinson’s disease. IEEE Trans Neural Syst Rehabil Eng. 2020;28(6):1246–53. doi: 10.1109/TNSRE.2020.2987888.
  • 15. Li R, Nguyen T, Potter T. Dynamic cortical connectivity alterations associated with Alzheimer’s disease: an EEG and fNIRS integration study. NeuroImage: Clin. 2019;21:101622. doi: 10.1016/j.nicl.2018.101622.
  • 16. Yamamuro K, Kimoto S, Iida J. Reduced prefrontal cortex hemodynamic response in adults with methamphetamine induced psychosis: relevance for impulsivity. PLoS ONE. 2016;11(4):e0152373. doi: 10.1371/journal.pone.0152373.
  • 17. Huhn AS, Sweeney MM, Brooner RK. Prefrontal cortex response to drug cues, craving, and current depressive symptoms are associated with treatment outcomes in methadone-maintained patients. Neuropsychopharmacology. 2019;44(4):826–33. doi: 10.1038/s41386-018-0252-0.
  • 18. Huang Y, Mohan A, De Ridder D. The neural correlates of the unified percept of alcohol-related craving: a fMRI and EEG study. Sci Rep. 2018;8(1):923. doi: 10.1038/s41598-017-18471-y.
  • 19. Jakkula V. Tutorial on support vector machine (SVM).
  • 20. Biau G, Scornet E. A random forest guided tour. TEST. 2016;25(2):197–227. doi: 10.1007/s11749-016-0481-7.
  • 21. Xanthopoulos P, Pardalos PM, Trafalis TB. Robust data mining. New York: Springer; 2013. doi: 10.1007/978-1-4419-9878-1.
  • 22. Kasruddin Nasir AN, Ahmad MA, Najib MS. In ECCE2019: proceedings of the 5th international conference on electrical, control & computer engineering, Kuantan, Pahang, Malaysia, 29th July 2019. Singapore: Springer Singapore, 2020 [2025-01-14]. doi: 10.1007/978-981-15-2317-5.
  • 23. Zhang M, Ren J, Li N. Hypnosis efficacy on nicotine addiction: an analysis of EEG microstates and brain oscillation entropy. AIMS Neurosci. 2025;12(1):15–31. doi: 10.3934/Neuroscience.2025002.
  • 24. Chu H, Cao Y, Jiang J. Optimized electroencephalogram and functional near-infrared spectroscopy-based mental workload detection method for practical applications. BioMed Eng OnLine. 2022;21(1):9. doi: 10.1186/s12938-022-00980-1.
  • 25. Liu Y, Ayaz H, Shewokis PA. Multisubject “Learning” for mental workload classification using concurrent EEG, fNIRS, and physiological measures. Front Hum Neurosci. 2017;11:389. doi: 10.3389/fnhum.2017.00389.
  • 26. Cicalese PA, Li R, Ahmadi MB. An EEG-fNIRS hybridization technique in the four-class classification of Alzheimer’s disease. J Neurosci Methods. 2020;336:108618. doi: 10.1016/j.jneumeth.2020.108618.
  • 27. Khan MU, Hasan MAH. Hybrid EEG-fNIRS BCI fusion using multi-resolution singular value decomposition (MSVD). Front Hum Neurosci. 2020;14:599802. doi: 10.3389/fnhum.2020.599802.
  • 28. Ge S, Wang P, Liu H. Neural activity and decoding of action observation using combined EEG and fNIRS measurement. Front Hum Neurosci. 2019;13:357. doi: 10.3389/fnhum.2019.00357.
  • 29. Zhang X, Gu X, Li X. A study on the assessment of drug addiction level of bimodal Eeg–Nirs based on Tsnet. 2024.
  • 30. Alotaiby T, El-Samie FEA, Alshebeili SA. A review of channel selection algorithms for EEG signal processing. EURASIP J Adv Signal Process. 2015;2015(1):66. doi: 10.1186/s13634-015-0251-9.
  • 31. Minnerly C, Bressler SL, Shokry IM. Estimating mental health conditions of patients with opioid use disorder. J Addict. 2019;2019:1–15. doi: 10.1155/2019/8586153.
  • 32. Kim J. Frontal EEG response to alcohol craving elicited by individually tailored video cues. 2023.
  • 33. Hartmann KG, Schirrmeister RT, Ball T. EEG-GAN: generative adversarial networks for electroencephalograhic (EEG) brain signals. arXiv, 2018 [2024-02-03]. http://arxiv.org/abs/1806.01875.
  • 34. Gu X, Yang B, Gao S. Application of bi-modal signal in the classification and recognition of drug addiction degree based on machine learning. Math Biosci Eng. 2021;18(5):6926–40. doi: 10.3934/mbe.2021344.
  • 35. Chen T, Su H, Zhong N. Disrupted brain network dynamics and cognitive functions in methamphetamine use disorder: insights from EEG microstates. BMC Psychiatry. 2020;20(1):334. doi: 10.1186/s12888-020-02743-5.
  • 36. Jacobsen NSJ, Kristanto D, Welp S. Preprocessing choices for P3 analyses with mobile EEG: a systematic literature review and interactive exploration. Psychophysiology. 2025;62(1):e14743. doi: 10.1111/psyp.14743.
  • 37. Kocsis L, Herman P, Eke A. The modified Beer-Lambert law revisited. Phys Med Biol. 2006;51(5):N91. doi: 10.1088/0031-9155/51/5/N02.
  • 38. Dans PW, Foglia SD, Nelson AJ. Data processing in functional near-infrared spectroscopy (fNIRS) motor control research. Brain Sci. 2021;11(5):606. doi: 10.3390/brainsci11050606.
  • 39. Jiao Y, Li Z, Chen X. Preprocessing methods for near-infrared spectrum calibration. J Chemometrics. 2020;34(11):e3306. doi: 10.1002/cem.3306.
  • 40. Lawhern VJ, Solon AJ, Waytowich NR. EEGnet: a compact convolutional neural network for EEG-based brain–computer interfaces. J Neural Eng. 2018;15(5):056013. doi: 10.1088/1741-2552/aace8c.


Articles from BioMedical Engineering OnLine are provided here courtesy of BMC
