Abstract
Electroencephalography (EEG) recordings with visual stimuli require detailed coding to determine the periods during which the participant was attending to the stimuli. Here we propose to use a supervised machine learning model and off-the-shelf video cameras only. We extract computer vision-based features such as head pose, gaze, and face landmarks from the video of the participant, train the machine learning model (a multi-layer perceptron) on an initial dataset, and then adapt it with a small subset of data from a new participant. Using a sample of 23 autistic children with and without co-occurring ADHD (attention-deficit/hyperactivity disorder) aged 49–95 months, and training on an additional 2560 labeled frames (equivalent to 85.3 s of video) from a new participant, the median area under the receiver operating characteristic curve for inattention detection was 0.989 (IQR 0.984–0.993) and the median inter-rater reliability (Cohen's kappa) with a trained human annotator was 0.888. Agreement with human annotations for nine participants was in the 0.616–0.944 range. Our results demonstrate the feasibility of automatic tools for detecting inattention during EEG recordings, and their potential to reduce the subjectivity and time burden of human attention coding. The tool for model adaptation and visualization of the computer vision features is made publicly available to the research community.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-10511-2.
Keywords: EEG, Visual attention, Computer vision, Machine learning, Data processing automation
Subject terms: Cognitive neuroscience, Attention
Introduction
Electroencephalography (EEG) is a widely used method for studying brain-behavior relations. A typical EEG recording session includes visual and/or auditory tasks, which can be presented in an event-related potential (ERP) paradigm or during spontaneous EEG recording. Collecting data using visual tasks is significantly more challenging in children because of their limited ability to sustain attention to visual stimuli1,2. Sustaining attention during EEG tasks can be especially challenging for children with neurodevelopmental disorders, such as autism and ADHD (attention-deficit/hyperactivity disorder)3,4. A meta-analysis by Stets et al. (2012)5 reports that studies involving visual tasks in infants have significantly higher attrition rates than studies using auditory or combined visual and auditory tasks. While reported attrition rates vary across studies1,5,6, a general recommendation is to design tasks that are engaging for children, thereby facilitating the maintenance of visual attention6. To facilitate visual attention, children may be asked to provide a behavioral response (e.g., press a button)7,8, or an experimenter may gently redirect a child to the screen when noticing signs of disengagement3,8,9.
Removing segments of the data during which a participant did not look at the screen is often the first stage of EEG data processing in recordings with visual stimuli. Typically, researchers either code the participant's attention on-line, by pressing a button that sends a marker to the EEG recording whenever the participant is not attending to the stimulus7,10,11, or they record video of the participant's behavior synchronously with the EEG and mark periods of inattention post-hoc upon reviewing the video offline3,12. This is a burdensome manual process requiring significant time and effort. It is also highly subjective; for example, the annotator might only see the participant's face and must guess whether the participant's gaze is directed to the area inside or outside of the screen. That is, the target of the child's gaze is inferred from the angle of the eyes, because the recording camera is adjacent to the target screen and the screen itself is not captured in the video frame. For this reason, we expect human annotators coding inattention from EEG videos to vary more and to have lower inter-rater reliability than in traditional video annotation. For example, a previous non-EEG study measuring inattention in young children and infants required human raters to reach at least 90% frame-by-frame agreement before allowing them to work on the videos in the dataset, and agreement ranged from 89.57 to 98.51% across multiple datasets13.
Subjectivity during this first stage of data processing poses an obstacle for EEG studies, in particular multi-center ones, since reproducibility and consistency of EEG data quality in multi-center studies are critical14,15.
In addition to its value for data curation, information about inattention periods can be useful for creating clinical biomarkers. There is evidence of alterations in orienting to, disengagement from, and sustaining attention to relevant stimuli in autistic children16–19, which undoubtedly influences the amount of inattention during an EEG study. Although a typical EEG study excludes from analysis the time periods where the participant is not engaged with the visual stimulus11,20,21, inattentiveness during EEG recordings with social/nonsocial stimuli can itself be a measure that distinguishes autistic and neurotypical children, used alone or in conjunction with EEG power features3.
Inattention detection has been studied not just within the field of medicine, but also in other contexts such as autonomous driving safety and assisted driving tools. Within the field of autonomous driving, inattention has been operationalized as “insufficient or no attention, to activities critical for safe driving.” Inattention can be divided into five subtypes: restricted attention (due to physical obstructions or blinks), misprioritized attention (attending to a less important feature instead of a potential safety concern), neglected attention (not checking the blind spot), cursory attention (looking in the right direction but failing to process the information), and diverted attention (distraction by driving-related or non-driving-related tasks and events)22,23. Computer vision can detect restricted attention, misprioritized attention, and diverted attention, but not cursory attention. Thus, more research is needed to accurately measure and predict all types of attention automatically.
Modern computer vision techniques now enable accurate, real-time inattention detection, often by analyzing facial expressions, eye gaze, and head pose, eschewing the use of expensive additional sensors24. To date, conventional eye-tracking technologies have been most commonly used for detecting inattention. Simultaneously presenting a stimulus on the eye-tracker screen while recording both eye-tracking and EEG signals enables the detection of a participant's visual attention directed towards the screen25. For example, a study by Maguire et al. (2014)26 used an eye-tracker synchronized with EEG to present an "attention-getter" animation in an experiment with 6–8 year old children. They reported increased retention of EEG data compared to a condition in which children were asked to provide a behavioral response (button pressing) to facilitate attention. However, eye-tracking equipment can be expensive and requires calibration. Advances in inattention detection based solely on computer vision hold promise for substantially reducing the cost and effort of data preprocessing in lab experiments.
Here we propose a solution for monitoring attention during EEG acquisition based on computer vision analysis (CVA), which is scalable and less expensive than eye-tracking equipment, requiring only off-the-shelf cameras to objectively measure children's movement behavior. This is largely enabled by progress in face detection and in the estimation of facial landmarks, head pose, and gaze27–30. In non-EEG settings, these tools have been able to detect head turns in response to name31 and to capture patterns of gaze in a low-cost setting without additional calibration13,32. In the work of Qian et al. (2022)33, supervised machine learning in combination with CVA approaches was applied to detect blink and head-movement artifacts in a minimally constrained portable EEG setting.
A tool similar to the present CVA methodology, iCatcher13, is a publicly available supervised deep learning model trained to classify infants' gaze into three categories ('left', 'right', and 'away') based on facial appearance. iCatcher offers a low-cost, automated alternative to traditional eye-tracking systems and is particularly useful for studies involving young children. iCatcher works best on short to moderately long videos with a single person in the frame. Clear visibility of the face is important, as obstructions like hats or coverings can affect accuracy. While the development of toolboxes such as iCatcher has undoubtedly moved the field forward, several limitations make it less well-suited for videos recorded during EEG sessions. First, the EEG net/cap obstructs part of the face, possibly below and above the eyes depending on the system used. Second, when running EEG sessions with young children from neurodevelopmental populations, multiple research staff and/or parents are typically involved in helping the child sit still and complete the task, which undermines a tool like iCatcher that is optimized for videos in which only one human face is detectable.
In this work, we developed a combination of CVA and a supervised machine learning model to detect inattention periods during EEG recordings. The detection is computed from videos of the child's head and upper body captured synchronously with the EEG using simple off-the-shelf cameras. While these videos were recorded during EEG sessions, the EEG data itself is not used in the present analysis. We hypothesized that automatic CVA codes of eye gaze coordinates, head pose descriptors (pitch, yaw, and roll), and nose landmarks could reliably detect periods of visual distraction from the screen using a supervised machine learning model. The proposed method requires minimal involvement by human annotators to fine-tune the model to a new participant. In this process, a small number of frames from the new participant's video are labeled by a human, followed by an additional round of model training. Minor human involvement is critical, since head poses and facial expressions of children vary significantly in clinical populations, justifying the need and opportunity for tuning the pre-trained model to new participants. Recent work based on iCatcher provides evidence that the lowest agreement between human annotators and automatic models occurs for the label 'looking away from the screen'13. We therefore developed a graphical user interface (GUI) allowing users to label data for fine-tuning, visualize the video and the corresponding time series of CVA features, and post-process the model results. The post-processing stage provides an opportunity for additional quality control of the inattention periods proposed by the model.
The proposed approach reduces subjectivity by providing the CVA features for human reference in the labeling process, thus standardizing the information an individual uses in their labeling. It also significantly reduces coding time by decreasing the number of frames to be labeled manually. We therefore trained the model on an annotated dataset of children's videos synchronized with EEG recordings, and then fine-tuned it to a new child by labeling a limited number of randomly selected additional frames from the new video. To evaluate our approach, we used a dataset of 23 videos in a leave-one-subject-out cross-validation setting: on each cross-validation iteration, the model was trained on 22 videos and fine-tuned to the held-out video. Due to the length of the EEG sessions and the size of the resulting videos, a similar tool, iCatcher, was unable to handle the current dataset. We have shared online the GUI for video and CVA feature inspection, model retraining, and post-processing of predictions.
Methods
Participants
Participants were 23 children (16 males), ranging from 49 to 95 months of age, who were part of a study funded by the National Institutes of Health (NICHD 2P50HD093074, Dawson, PI). The ethnic and racial composition of the sample was as follows: White, 17; Black, 0; Asian, 2; other and mixed race, 4; Hispanic, 4. All 23 children met DSM-5 criteria for autism spectrum disorder (ASD) based on the Autism Diagnostic Observation Schedule, Second Edition34, administered by an experienced, research-reliable psychologist. Eleven of the 23 children were diagnosed with co-occurring attention-deficit/hyperactivity disorder (ADHD) based on a comprehensive clinical evaluation by a clinical psychologist with expertise in ADHD. Children had a mean Full-Scale IQ of 78.5 (SD = 25.5) based on the GCA Standard Score derived from the Differential Ability Scales, Second Edition35.
All caregivers/legal guardians of participants gave written, informed consent, and the study protocol was approved by the Duke University Health System Institutional Review Board (Protocol numbers Pro00085435 and Pro00085156). Informed consent was obtained from the subjects and/or their legal guardian(s) for publication of identifying information/images in an online open-access publication. Methods were carried out in accordance with institutional, State, and Federal guidelines and regulations. The procedures in these studies adhere to the tenets of the Declaration of Helsinki. Additionally, the caregiver of the participant whose video was used in the Supplementary Materials, and whose image appears blurred in the figures, provided consent to use the materials in publication. All other data presented have been anonymized.
Recording synchronized video and EEG
Continuous EEG and event-related potentials (ERPs) were recorded as part of an EEG study while the session was simultaneously video recorded. EEG sessions were ended early if the child could not comply with study procedures. Videos recorded during the EEG sessions were 00:04:15 to 00:31:09 in duration. One or two clinical research assistants were present in the room during the EEG recording to ensure the quality of the session and to gently redirect the participant's attention back to the screen if they were distracted. The child's face was recorded with a Basler ace acA1300-30uc camera placed below the screen and synchronized with the EEG. The camera resolution was 1296 × 966 pixels and the frame rate was 30 fps. To synchronize the camera and EEG, in-house software was used, based on the Basler pylon library and a Cedrus StimTracker hardware device that sets markers on the EEG recording. A diagram of the recording setup is shown in Fig. 1.
Fig. 1.
Recording setup. Video from the camera is recorded on Video Recording Computer, which sends a marker to the EEG Recording Computer via Cedrus Stimtracker every 100 frames. This allows for synchronization between the EEG and video recordings.
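The synchronization logic is simple: the video-recording computer emits a hardware marker into the EEG stream once every 100 frames, so the two recordings can be aligned offline. The sketch below illustrates the idea with OpenCV and a placeholder marker function; the actual in-house code uses the Basler pylon library and a Cedrus StimTracker, whose APIs are not shown here.

```python
import cv2

MARKER_INTERVAL = 100  # one sync marker is sent to the EEG every 100 frames

def send_marker_to_eeg():
    """Placeholder for the hardware trigger: in the actual setup a Cedrus
    StimTracker writes an event marker into the EEG recording."""
    pass

cap = cv2.VideoCapture(0)  # generic off-the-shelf camera; the study used a Basler camera via pylon
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("session.avi", cv2.VideoWriter_fourcc(*"MJPG"), 30.0, (width, height))

frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)
    if frame_idx % MARKER_INTERVAL == 0:
        send_marker_to_eeg()  # keeps the video and EEG clocks alignable offline
    frame_idx += 1

cap.release()
writer.release()
```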
Extracting CVA features
To extract the CVA features, we used in-house code involving three steps: (a) face detection and disambiguation, (b) extraction of landmarks and head pose angles, and (c) gaze estimation. The raw set of extracted features per frame included nose x (horizontal) and y (vertical) coordinates in the frame, gaze x and y coordinates in the presentation screen plane, and head pose angles (pitch, yaw, and roll).
Face detection and disambiguation
Code for face detection and disambiguation used the face_recognition Python library, which is based on the dlib C++ library36. Whenever the algorithm detected more than one face in the video (either because of face-detection ambiguity, where one face was detected twice, or because another person, e.g., a research assistant or parent, entered the frame), it displayed the frame with bounding boxes and prompted the user to select the correct participant's face.
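As a hedged illustration of this disambiguation step, the sketch below uses the face_recognition API to keep the detected face whose encoding is closest to a reference encoding of the participant. In the actual pipeline the choice is made interactively by the user; the function and variable names here are hypothetical.

```python
import face_recognition

def detect_participant_face(frame, participant_encoding):
    """Return the bounding box of the face most likely belonging to the participant.
    `frame` is an RGB image array; `participant_encoding` is assumed to come from a
    frame where the correct face was already confirmed (e.g., by the user)."""
    locations = face_recognition.face_locations(frame)  # (top, right, bottom, left) boxes
    if not locations:
        return None  # no face detected in this frame
    if len(locations) == 1:
        return locations[0]
    # More than one face (e.g., a research assistant entered the frame):
    # keep the face whose encoding is closest to the participant's reference encoding.
    encodings = face_recognition.face_encodings(frame, known_face_locations=locations)
    distances = face_recognition.face_distance(encodings, participant_encoding)
    return locations[int(distances.argmin())]
```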
Extraction of landmarks and head pose angles
After the faces were detected, an algorithm for facial landmark extraction based on the intraface software library30 was applied to the detected faces. As a result, facial landmark pixel coordinates, as well as pitch, yaw, and roll head pose angles were obtained.
Gaze estimation

The iTracker software28 was used for gaze estimation, providing gaze x and gaze y coordinates in the screen plane. Although iTracker was trained to predict gaze coordinates on a mobile device screen using the device's frontal camera, we used its output as a proxy for gaze coordinates in the presentation screen plane. The software package is modular, and this component can easily be replaced by others as preferred by the user.
Since the intraface library is no longer available to the general public, for the convenience of potential users we make publicly available an alternative processing pipeline, which consists of our original face detection and disambiguation code and code for landmark, head pose, and gaze extraction using the popular OpenFace software package27.
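For orientation, the sketch below shows how the per-frame CVA features could be assembled from an OpenFace output CSV. The column names follow the OpenFace 2.0 output format as we understand it (landmark 33 approximates the nose tip in the 68-point scheme); the shared feature-extraction repository defines the exact mapping actually used.

```python
import pandas as pd

# OpenFace's FeatureExtraction tool writes one CSV row per processed frame.
df = pd.read_csv("participant01_openface.csv")   # hypothetical output file name
df.columns = df.columns.str.strip()              # some OpenFace versions pad column names with spaces

cva = pd.DataFrame({
    "nose_x": df["x_33"],           # 2D landmark 33 ~ nose tip (68-point scheme)
    "nose_y": df["y_33"],
    "gaze_x": df["gaze_angle_x"],   # averaged gaze angles (radians)
    "gaze_y": df["gaze_angle_y"],
    "pitch":  df["pose_Rx"],        # head pose rotations (radians)
    "yaw":    df["pose_Ry"],
    "roll":   df["pose_Rz"],
})

# Frames where OpenFace failed to track the face are treated as missing data,
# mirroring the exclusion of undetected-face frames described below.
cva = cva[df["success"] == 1]
```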
Data attrition
Because the behavior of participants varied significantly during pauses between EEG/ERP recordings, inattention detection was restricted to the periods of actual stimulus presentation, excluding breaks between stimuli, and the training set for the machine learning (ML) model included only frames within those periods. Frames in which the face could not be detected (and hence no landmark or head pose information was available) were excluded from the analysis.
Data pre-processing
Since inattention could happen in any direction (e.g., when a participant looked to the right or left, or turned their head up or down), each feature for each participant was transformed into a positive ('plus') version, defined by Eq. (1), and a negative ('minus') version, defined by Eq. (2).

Eq. (1): 'plus' version of the feature (equation image not reproduced).

Eq. (2): 'minus' version of the feature (equation image not reproduced).
The final set of features for the analysis is reported in Table 1.
Table 1.
List of input features per frame for the machine learning model.
| Feature name | Feature description |
|---|---|
| noseXplus | Nose coordinates |
| noseXminus | Nose coordinates |
| noseYplus | Nose coordinates |
| noseYminus | Nose coordinates |
| gazeXplus | Gaze coordinates |
| gazeXminus | Gaze coordinates |
| gazeYplus | Gaze coordinates |
| gazeYminus | Gaze coordinates |
| yawplus | Head pose angles |
| yawminus | Head pose angles |
| pitchplus | Head pose angles |
| pitchminus | Head pose angles |
| rollplus | Head pose angles |
| rollminus | Head pose angles |
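As a hedged illustration of the 'plus'/'minus' transformation behind Table 1, the sketch below rectifies each feature around its per-participant median; the exact centering used in Eqs. (1)–(2) is defined in the published training code, so this is an assumption rather than the reference implementation.

```python
import numpy as np
import pandas as pd

def plus_minus_split(features: pd.DataFrame) -> pd.DataFrame:
    """Split each raw feature into non-negative 'plus' and 'minus' components so
    that deviations in either direction (left/right, up/down) are represented
    symmetrically. Assumption: deviations are measured from the per-participant
    median; see Eqs. (1)-(2) and the shared code for the exact definition."""
    out = {}
    for col in features.columns:
        centered = features[col] - features[col].median()
        out[f"{col}plus"] = np.maximum(centered, 0.0)    # e.g., gazeXplus
        out[f"{col}minus"] = np.maximum(-centered, 0.0)  # e.g., gazeXminus
    return pd.DataFrame(out)
```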
After pre-processing the features, the participant identifier was one-hot encoded and added to the feature list. This allowed learning a separate bias term in the first layer of the trained neural network, resembling the design of mixed models. The number of categories for one-hot encoding was one more than the number of participants, with the assumption that the identifier of the participant whose data was used for model fine-tuning and prediction was encoded in the last category.
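A minimal sketch of this encoding step is shown below: with 22 training participants plus one reserved category for the new (held-out) participant, the 23 one-hot columns together with the 14 features of Table 1 give the model input size of 37 used below. The helper name is hypothetical.

```python
import numpy as np

def add_participant_onehot(features: np.ndarray, participant_idx: int,
                           n_train_participants: int) -> np.ndarray:
    """Append a one-hot participant identifier to each frame's feature vector.
    The last column is reserved for the new participant whose data is used only
    for fine-tuning and prediction; this lets the first network layer learn a
    participant-specific bias, resembling a mixed-model design."""
    n_frames = features.shape[0]
    onehot = np.zeros((n_frames, n_train_participants + 1), dtype=features.dtype)
    onehot[:, participant_idx] = 1.0  # use index n_train_participants for a new participant
    return np.hstack([features, onehot])

# Example: 14 plus/minus features + (22 training participants + 1) = 37 model inputs.
```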
Data labeling
Data for all 23 participants were labeled by one of the co-authors using the ELAN v. 6.3 software. Nine (39%) participants were selected for independent annotation by another co-author. Neither annotator participated in the data analysis. Using the recorded video, annotators labeled frames as 'gaze off screen' if the participant looked away from the screen and/or as 'head turn' if the participant turned their head. For the purpose of inattention detection, a frame was labeled as 'inattention' if it was labeled as either a head turn or gaze off screen. Annotators included eye blinks within periods of inattention and ignored eye blinks when the participant was visually attending; eye blinks did not interrupt or break up inattention events. Agreement on inattention labels between independent annotators was assessed with Cohen's kappa37.
Training and evaluating machine learning model
Model inputs were the features extracted from each video frame. We utilized a multi-layer perceptron (MLP) model with an input size of 37 features, two hidden layers (layer dimensions of 512 and 14 were selected empirically following the information bottleneck principle38), and a temperature scaling layer for model calibration39,40. The target variable for model training was each video frame's inattention label, namely inattention-present or inattention-absent. The output was a one-hot encoding of the binary inattention signal (dimension = 2). Cross-entropy loss was used as the cost function, and the Adam optimizer was used for model training41. We used weighted sampling during training so that each batch contained approximately equal amounts of positive and negative samples (inattention and attention, respectively). Models were trained in the PyTorch framework42. Evaluation was done using the leave-one-subject-out cross-validation (LOSO CV) method. To evaluate model performance, we assessed average precision (AP, also known as area under the precision-recall curve), area under the ROC curve (AUC), and the maximal Cohen's kappa (MK) between the human annotator and the machine learning predictions per participant across different thresholds. Additionally, we evaluated the median Cohen's kappa across the entire distribution over the range of thresholds between 0 and 1. This allowed us to assess the threshold value needed to achieve the best agreement between the model and the human coder over the entire distribution, without adjusting the threshold for each individual participant.
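The following PyTorch sketch is consistent with the architecture described above (37 inputs, hidden layers of 512 and 14 units, a 2-dimensional output, and a temperature scaling parameter). Activation functions and other details not stated in the text are assumptions; the reference model is available in the shared training repository.

```python
import torch
import torch.nn as nn

class InattentionMLP(nn.Module):
    """MLP with two hidden layers (512 and 14 units) and a temperature scaling
    parameter for calibration. Activations/initialization are assumptions."""
    def __init__(self, n_inputs: int = 37):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 512), nn.ReLU(),
            nn.Linear(512, 14), nn.ReLU(),
            nn.Linear(14, 2),                 # one-hot attention/inattention output
        )
        self.temperature = nn.Parameter(torch.ones(1))  # typically calibrated after training

    def forward(self, x):
        return self.net(x) / self.temperature  # temperature-scaled logits

model = InattentionMLP()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# Weighted sampling balances inattention/attention frames within each batch;
# `labels` would be a 1-D LongTensor of 0/1 frame labels:
# class_weights = 1.0 / torch.bincount(labels).float()
# sampler = torch.utils.data.WeightedRandomSampler(class_weights[labels],
#                                                  num_samples=len(labels))
```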
Transfer learning: adjusting ML model to a new participant
Our adaptation approach involved selecting a batch of 128 frames (corresponding to 4.27 s of video) for labeling and then training for 20 epochs (full cycles over the entire labeled dataset) on the newly labeled data at each iteration of additional training. To evaluate the performance of this approach, we assessed the three metrics defined in the previous section, considering both sequential frame sampling (where frame features and labels are sampled into the batch sequentially from the beginning of the video, which resembles how a human would look through the dataset and label it) and random frame sampling. We additionally assessed the maximum of the median Cohen's kappa across the distribution, and computed the corresponding prediction threshold at iterations 5, 10, and 20, which correspond to 21.3, 42.6, and 85.3 additionally labeled seconds of data per participant. The exact algorithm was as follows (a code sketch of this loop appears after the list):
1. Set N = 128 (the batch size).
2. Create an empty dataset for labeled data.
3. Set Iteration = 0.
4. Predict the probability of each frame being positive (inattention).
5. If the approach is random sampling, randomly sample N frames from the participant's data into the batch.
6. If the approach is sequential sampling, sample the next N frames from the beginning of the participant's data into the batch.
7. Remove the frames included in the batch from the participant's data.
8. Add the batch to the labeled dataset (for training in the LOSO CV framework, we used the labels from the dataset for the participant the algorithm was being trained on).
9. Train for 20 epochs on the labeled dataset.
10. Compute AP, AUC, and MK.
11. Set Iteration += 1.
12. If Iteration == 50: stop.
13. Go to step 4.
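The sketch below mirrors this loop for both sampling variants. The callables `predict_proba`, `train_epochs`, and `evaluate` are hypothetical placeholders wrapping the model's prediction, the 20-epoch training step, and the AP/AUC/MK computation; within LOSO CV the labels come from the existing annotations of the held-out participant.

```python
import numpy as np

def adapt_to_participant(model, predict_proba, train_epochs, evaluate,
                         frames, labels, batch_size=128, n_iterations=50,
                         random_sampling=True):
    """Sketch of the per-participant adaptation loop described above."""
    pool = np.arange(len(frames))   # steps 1-3: unlabeled frame indices, empty labeled set
    labeled = []
    history = []
    for _ in range(n_iterations):                              # step 12: stop after 50 iterations
        probs = predict_proba(model, frames)                   # step 4: per-frame probabilities
        if random_sampling:                                    # step 5
            batch = np.random.choice(pool, size=min(batch_size, len(pool)), replace=False)
        else:                                                  # step 6: sequential from video start
            batch = pool[:batch_size]
        pool = np.setdiff1d(pool, batch)                       # step 7
        labeled.extend(batch.tolist())                         # step 8: labels from annotations
        train_epochs(model, frames[labeled], labels[labeled], epochs=20)  # step 9
        history.append(evaluate(model, frames, labels))        # step 10: AP, AUC, MK
    return history
```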
Agreement measurements between model and human and between two humans
We used Cohen's kappa as the metric for quality assessment of the human annotations. We selected nine participants and performed independent labeling by a second human annotator, then computed Cohen's kappa to measure agreement between the two human annotators. We additionally computed Cohen's kappa between the primary human annotator and the model prediction at the threshold corresponding to the maximal median kappa at iteration 20.
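A minimal sketch of these computations with scikit-learn is shown below; the per-frame label and probability arrays are hypothetical, and the threshold value is the one reported for iteration 20 of the random sampling approach (Table 3).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-frame arrays for one participant (0 = attention, 1 = inattention).
annotator1 = np.load("annotator1_labels.npy")
annotator2 = np.load("annotator2_labels.npy")
model_probs = np.load("model_inattention_probs.npy")

# Human-human agreement.
kappa_humans = cohen_kappa_score(annotator1, annotator2)

# Human-model agreement, thresholding the probabilities at the level that
# maximized the median kappa at iteration 20 (0.424 for random sampling).
kappa_model = cohen_kappa_score(annotator1, (model_probs >= 0.424).astype(int))
```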
Graphical user interface for visualizing and retraining the model
We created a web-based GUI which allows for visualizing the data, labeling the data frame-by-frame and re-training the model in the random sampling framework, and post-processing the data (see Fig. 2 for a screenshot, and the Supplementary Materials online for a video (Supplementary Video S1) showing how the tool works). The GUI is built on the open-source libraries 'plotly' (https://plotly.com/python/) and 'dash' (https://dash.plotly.com/).
Fig. 2.
A: Visualization of CVA features together with the video of the participant. B: Interface for labeling the frames. Image is included for educational purposes and the individual’s diagnostic cohort is not specified.
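To convey the flavor of the GUI, the sketch below shows a minimal plotly/dash app plotting one CVA feature over frames; the released tool (see Data availability) is considerably richer, and the input file and column names here are hypothetical.

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Hypothetical per-frame CVA feature table with 'frame' and 'yawplus' columns.
cva = pd.read_csv("participant01_features.csv")

app = Dash(__name__)
app.layout = html.Div([
    html.H3("CVA features"),
    dcc.Graph(figure=px.line(cva, x="frame", y="yawplus",
                             title="Head yaw ('plus' component) over frames")),
])

if __name__ == "__main__":
    app.run(debug=True)  # dash 2.x
```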
Results
Dataset statistics
The full dataset consisted of 566,043 frames (videos were 00:04:15 to 00:31:09 in duration). After excluding frames where the face or gaze were not detected, 535,539 frames were retained (5.38% of frames were invalid), with an average of 23,284 and a standard deviation of 6,193 frames per participant. Of all the frames, 79,629 were labeled as inattention (14.86% of the dataset).
Transfer learning results
The results of transfer learning are shown in Table 2 and Fig. 3. The sequential sampling approach performed substantially worse than the random sampling approach. Median AP, AUC, and MK were 0.855, 0.965, and 0.742, respectively, at the start of training (before any adaptation to the held-out participant). By iteration 20, median AP was 0.962, AUC 0.989, and MK 0.888 with the random sampling approach, compared to a median AP of 0.640, AUC of 0.862, and MK of 0.548 with the sequential sampling approach.
Table 2.
Average precision (AP), AUC, and maximal Cohen's kappa (MK) percentiles at different iterations for the two sampling/adaptation alternatives. The random sampling approach outperforms the sequential sampling approach on all three metrics at each listed iteration.
| Sampling approach | Iteration | AP (50%) | AP (25%) | AP (75%) | AUC (50%) | AUC (25%) | AUC (75%) | MK (50%) | MK (25%) | MK (75%) |
|---|---|---|---|---|---|---|---|---|---|---|
| No fine-tuning | 0 | 0.855 | 0.715 | 0.913 | 0.965 | 0.948 | 0.971 | 0.742 | 0.646 | 0.796 |
| Random sampling | 5 | 0.906 | 0.820 | 0.948 | 0.973 | 0.960 | 0.981 | 0.798 | 0.753 | 0.873 |
| Random sampling | 10 | 0.930 | 0.875 | 0.969 | 0.984 | 0.975 | 0.991 | 0.838 | 0.798 | 0.898 |
| Random sampling | 20 | 0.962 | 0.931 | 0.981 | 0.989 | 0.984 | 0.993 | 0.888 | 0.865 | 0.925 |
| Sequential sampling | 5 | 0.400 | 0.280 | 0.720 | 0.788 | 0.638 | 0.890 | 0.380 | 0.236 | 0.561 |
| Sequential sampling | 10 | 0.575 | 0.408 | 0.782 | 0.835 | 0.731 | 0.908 | 0.482 | 0.251 | 0.637 |
| Sequential sampling | 20 | 0.640 | 0.408 | 0.801 | 0.862 | 0.771 | 0.930 | 0.548 | 0.354 | 0.678 |
Fig. 3.
Average precision, maximal Cohen's kappa, and AUC at each iteration for the different sampling/adaptation methods. Lines show the median and shaded areas show the interquartile range at each iteration.
Cohen’s kappa analysis
Cohen's kappa at different prediction threshold levels for both sampling approaches (random and sequential) at iterations 5, 10, and 20 is shown in Fig. 4. Thresholds at the highest median kappa and the corresponding median kappa values are shown in Tables 2 and 3. The highest median kappa ranged between 0.792 and 0.888 for the random sampling approach, and between 0.223 and 0.426 for the sequential one. Figure 4 shows that the median Cohen's kappa remained relatively stable and high over the range of thresholds between 0.2 and 0.8, allowing a general threshold for the model predictions to be set in this range.
Fig. 4.
Median (thick line) and Interquartile Range (shaded area) of Cohen’s kappa at different threshold levels at iterations 5, 10, and 20.
Table 3.
Thresholds and Cohen's kappa levels at the highest median value of kappa for the two sampling approaches at iterations 5, 10, and 20.
| Sampling approach | Iteration | Threshold | Median Cohen's kappa |
|---|---|---|---|
| Random sampling | 5 | 0.310 | 0.792 |
| Random sampling | 10 | 0.484 | 0.838 |
| Random sampling | 20 | 0.424 | 0.888 |
| Sequential sampling | 5 | 0.004 | 0.223 |
| Sequential sampling | 10 | 0.008 | 0.296 |
| Sequential sampling | 20 | 0.020 | 0.426 |
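The shared threshold reported in Table 3 can be found by sweeping candidate thresholds and keeping the one that maximizes the median kappa across participants; a hedged sketch of that sweep is shown below (function and argument names are hypothetical).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def best_common_threshold(per_participant_probs, per_participant_labels,
                          thresholds=np.linspace(0.0, 1.0, 101)):
    """Sweep a shared decision threshold and return the value that maximizes the
    median Cohen's kappa across participants (cf. Table 3)."""
    median_kappas = []
    for t in thresholds:
        kappas = [cohen_kappa_score(y, (p >= t).astype(int))
                  for p, y in zip(per_participant_probs, per_participant_labels)]
        median_kappas.append(np.median(kappas))
    best = int(np.argmax(median_kappas))
    return thresholds[best], median_kappas[best]
```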
A second independent annotator labeled videos from nine participants, which in total accounted for 209,556 frames or 39% of the data set. It took the second annotator approximately 33 h to label nine videos, resulting in an average of 0.57 s spent per frame. Cohen’s kappa values between the two human annotators ranged between 0.548 and 0.859 (see Table 4).
Table 4.
Agreement level (Cohen’s kappa) between human annotators, and between the primary annotator and the models adapted by random sampling at iterations 5, 10 and 20.
| Participant | Agreement between annotators | Agreement (model, annotator 1) – iteration 5 | Agreement (model, annotator 1) – iteration 10 | Agreement (model, annotator 1) – iteration 20 |
|---|---|---|---|---|
| PT01 | 0.584 | 0.498 | 0.541 | 0.616 |
| PT02 | 0.748 | 0.736 | 0.768 | 0.753 |
| PT03 | 0.600 | 0.868 | 0.862 | 0.895 |
| PT04 | 0.593 | 0.705 | 0.766 | 0.828 |
| PT09 | 0.727 | 0.856 | 0.888 | 0.943 |
| PT10 | 0.548 | 0.665 | 0.721 | 0.785 |
| PT16 | 0.844 | 0.954 | 0.953 | 0.963 |
| PT18 | 0.859 | 0.8 | 0.78 | 0.789 |
| PT20 | 0.793 | 0.937 | 0.942 | 0.944 |
Agreement between the primary annotator and the model adapted by random sampling increased with each iteration of additional training and was in the ranges [0.498–0.954] at iteration 5, [0.541–0.953] at iteration 10, and [0.616–0.963] at iteration 20 (Table 4).
GUI for visualizing and preprocessing pipeline
We developed a web-based GUI which may be used for reviewing the CVA features of the video, additional labeling of frames and retraining the model, and post-processing of the data, including setting the model decision threshold and rejection of falsely detected inattention events. We make publicly available a pipeline for data pre-processing based on in-house code for face detection and OpenFace framework for head pose and gaze estimation27.
Discussion
In this work we proposed a method for detecting periods of inattention to visual stimuli during EEG recordings. The tool is based on CVA of videos of participants' movement behavior recorded synchronously with the EEG. We outlined a data processing pipeline, including face and facial landmark detection, head pose computation, and gaze estimation. We proposed an MLP model for predicting inattention from these CVA features, and random sampling as a means of fine-tuning the model for each participant. We made publicly available a GUI that allows for visualization of the CVA features, model fine-tuning, adjustment of the prediction threshold, and post-processing of the results. While we utilized a sample of children with neurodevelopmental conditions to test the tool, we expect it will also work well in other research populations, including neurotypical children and adults. Features included gaze coordinates, pitch, yaw, roll, and nose coordinates; the nose landmark was particularly useful for detecting inattention that included a head turn to the side.
The proposed random frame sampling approach for adapting the model to the participant outperformed the sequential sampling approach. For the non-fine-tuned model, the maximal Cohen's kappa was 0.742, placing the best potential agreement with the human rater in the 'substantial' range40. Compared to the initial non-fine-tuned model prediction, the model trained on an additional 2560 labeled frames (equivalent to labeling only about 85 s of the video) significantly improved performance, as indicated by all quality metrics. In contrast, sequential frame sampling performance decreased over the initial five iterations (see Fig. 3) and then gradually improved, but did not reach the performance of the random sampling approach. The reasons for this include the strong temporal correlation of the features, and hence the low variability in the new input data, as well as the rare occurrence of inattention (prevalence of 14.86%), which left many batches without positive labels.
In line with a previous study13, we found that agreement on inattention labeling between human coders was in the 'moderate' to 'substantial' range for seven participants and in the 'perfect' range for two participants43. Labeling inattention is a challenging task for humans, likely because annotators need to make a subjective judgement regarding the boundaries of the stimulus presentation screen. The provided GUI tool allows for visualization of the raw CVA features together with the participant's video, also enabling coders to label frames for the fine-tuning or post-processing stage. When the annotator needs to make a decision on an ambiguous frame, they can play the video to compare the frame in question with neighboring frames, which may help them better evaluate whether the participant was attending to the screen.
Our results show that the proposed approach makes annotation substantially more efficient. The existing human labeling workflow takes on average 220 min per video. Given that additional labeling takes about 0.57 s per frame, the need to label only about 2560 frames for high-quality labeling can significantly increase efficiency, reducing the effort to 24 min on average. The modularity of the tool we developed allows users to substitute any input/output-compatible CVA pipeline and machine learning model while keeping the same GUI. The initial model can be retrained as the amount of labeled data increases.
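The quoted time savings follow directly from the measured labeling rate and the average video length reported above (values rounded as in the text):

```latex
23{,}284\ \text{frames/video} \times 0.57\ \text{s/frame} \approx 13{,}272\ \text{s} \approx 220\ \text{min}
\quad\text{vs.}\quad
2{,}560\ \text{frames} \times 0.57\ \text{s/frame} \approx 1{,}459\ \text{s} \approx 24\ \text{min}.
```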
Using the same prediction model and tool for discarding inattention periods may facilitate multi-center studies by unifying the data pre-processing pipelines. Another way to facilitate multi-center studies is to perform pre-processing and labeling of the data in each center separately, and then share only the CVA features and annotations for training of the model with larger amounts of data. Such an approach helps to preserve the privacy of the data in each center, allowing centers to share only specific de-identified CVA features.
A limitation of the present study is the absence of a published model and of our original full pre-processing pipeline, owing to the removal of the intraface library30 from public access. We provide the code for an alternative pre-processing pipeline predicting the same features based on the publicly available OpenFace library, together with the model structure and interface needed for full integration into the GUI. Another limitation is that we calculated performance using annotations from a primary annotator who coded all the videos; while we were able to bolster confidence in the primary annotator's performance by recruiting a second annotator to re-code 39% of the videos, full confidence could only be achieved by having multiple annotators code all available videos. However, in pediatric psychology research, it has been suggested that having 10–25% of videos re-coded by a second annotator is sufficient, depending on how variable the coded behavior is44,45. A potential future direction is to work with the missing data caused by an inability to detect a face in the video. CVA could not detect the face in 5.38% of the frames in our dataset, likely due to extreme angles of the head with respect to the camera or to face occlusions. Future studies may attempt to associate these periods with attention/inattention to the screen using imputation/interpolation methods.
We presented a low-cost scalable approach to inattention detection during EEG recordings using CVA, and made a publicly available tool for visualization, model fine-tuning, and post-processing of the system’s results. We also made publicly available an example of the computer vision analysis pipeline which can be used in future studies. We showed that fine-tuning the model on small amounts of new data by labeling the data on a per-frame basis substantially increased the model performance. Our work demonstrated that computer vision analysis was feasible for detecting inattention in EEG studies. We hope that by providing a scalable method for assessing inattention during EEG experiments, EEG studies will be more reproducible, and the feasibility of studying early brain development in populations in which sustained attention during EEG experiments can be challenging will increase. Such populations include infants and children with and without neurodevelopmental conditions, among others.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
This research was supported by a grant from the National Institutes of Health (NIH; NICHD 2P50HD093074, Dawson, PI). We thank the NIH and the children that participated in the research studies and their families.
Author contributions
D.Yu.I., M.Di M., D.C., K.C., G.D. and G.S. contributed to the design of the work, data analysis and interpretation; S.M. and J.G. contributed to the data acquisition and labeling; D.Yu.I. and Z.C. contributed to the creation of the new software used in the work; D.Yu.I. and S.M. contributed to drafting the first version of the manuscript; all authors revised the final manuscript.
Data availability
Due to privacy concerns, participants’ videos cannot be shared. To enable the reproducibility of the results, the dataset with extracted CVA features that were used for model training, and code for initial model training and model fine-tuning, are made publicly available at https://github.com/dyisaev/eeg-cva-model-training. A pipeline based on OpenFace software for CVA feature extraction is made publicly available at https://github.com/dyisaev/eeg-cva-feature-extraction. A GUI interface for visualization, labeling, and post-processing, together with installation and usage instructions is available at https://github.com/dyisaev/eeg-cva-visualization-tool. Python 3.9.7 was used in the model training and data analysis. Versions of python packages are listed in the corresponding repositories.
Declarations
Competing interests
Dr. Dawson is on the Scientific Advisory Boards of the Nonverbal Learning Disability Project and Tris Pharma, Inc., and receives book royalties from Guilford Press and Springer Nature Press. Dr. Dawson has developed technology, data, and/or products that have been licensed to Cryocell, Inc., and Dawson and Duke University have benefited financially. Dr. Dawson has received funding from the Marcus Foundation, Cord Blood Association, the National Institutes of Health (NIH), and the Simons Foundation. Dr. Carpenter has had funding by the National Institutes of Health (NIH), the Department of Defense, and the Brain and Behavior Foundation. Dr. Carpenter is a standing member on the Programmatic Panel for the Department of Defense Congressionally Directed Medical Research Programs (CDMRP) Autism Research Program and has served as an ad hoc reviewer on NIH review panels; she has received reimbursement for her time on these panels. The remaining authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Dmitry Yu. Isaev and Samantha Major contributed equally to this work.
These authors jointly supervised this work: Geraldine Dawson and Guillermo Sapiro.
References
- 1.DeBoer, T., Scott, L. & Nelson, C. Methods for Acquiring and Analyzing Infant Event-Related Potentials in Infant EEG and Event-Related Potentials. 5–38 (Psychology, 2013).
- 2.Thierry, G. The use of event-related potentials in the study of early cognitive development. Infant Child. Dev.14 (1), 85–94. 10.1002/icd.353 (2005). [Google Scholar]
- 3.Isaev, D. Y. et al. Relative average look duration and its association with neurophysiological activity in young children with autism spectrum disorder. Sci. Rep.10 (1). 10.1038/s41598-020-57902-1 (2020). [DOI] [PMC free article] [PubMed]
- 4.Webb, S. J. et al. Guidelines and best practices for electrophysiological data collection, analysis and reporting in autism. J. Autism Dev. Disord.45 (2), 425–443. 10.1007/s10803-013-1916-6 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Stets, M., Stahl, D. & Reid, V. M. A meta-analysis investigating factors underlying attrition rates in infant ERP studies. Dev. Neuropsychol.37 (3), 226–252. 10.1080/87565641.2012.654867 (2012). [DOI] [PubMed] [Google Scholar]
- 6.Bell, M. A. & Cuevas, K. Using EEG to study cognitive development: issues and practices. J. Cogn. Dev.13 (3), 281–294. 10.1080/15248372.2012.691143 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Ellis, A. E. & Nelson, C. A. Category prototypicality judgments in adults and children: behavioral and electrophysiological correlates. Dev. Neuropsychol.15 (2), 193–211. 10.1080/87565649909540745 (1999). [Google Scholar]
- 8.Todd, R. M., Lewis, M. D., Meusel, L. A. & Zelazo, P. D. The time course of social-emotional processing in early childhood: ERP responses to facial affect and personal familiarity in a Go-Nogo task. Neuropsychologia46 (2), 595–613. 10.1016/j.neuropsychologia.2007.10.011 (2008). [DOI] [PubMed] [Google Scholar]
- 9.Murias, M. et al. Validation of eye-tracking measures of social attention as a potential biomarker for autism clinical trials. Autism Res.11 (1), 166–174. 10.1002/aur.1894 (2018). [DOI] [PubMed] [Google Scholar]
- 10.Dawson, G. et al. Early behavioral intervention is associated with normalized brain activity in young children with autism. J. Am. Acad. Child Adolesc. Psychiatry. 51 (11), 1150–1159. 10.1016/j.jaac.2012.08.018 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Orekhova, E. V., Stroganova, T. A., Posikera, I. N. & Elam, M. EEG theta rhythm in infants and preschool children. Clin. Neurophysiol.117 (5), 1047–1062. 10.1016/j.clinph.2005.12.027 (2006). [DOI] [PubMed] [Google Scholar]
- 12.Murias, M. et al. Electrophysiological biomarkers predict clinical improvement in an open-label trial assessing efficacy of autologous umbilical cord blood for treatment of autism. Stem Cells Transl. Med. 783–791. 10.1002/sctm.18-0090 (2018). [DOI] [PMC free article] [PubMed]
- 13.Shannon, E. Y. et al. icatcher+: robust and automated annotation of infants’ and young children’s gaze behavior from videos collected in laboratory, field, and online studies. Adv. Methods Practices Psychol. Sci.6 (2), 25152459221147250 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kaiser, A. et al. EEG data quality: Determinants and impact in a multicenter study of children, adolescents, and adults with attention-deficit/hyperactivity disorder (ADHD). Brain Sci.11 (2). 10.3390/brainsci11020214 (2021). [DOI] [PMC free article] [PubMed]
- 15.Webb, S. J. et al. Biomarker acquisition and quality control for multi-site studies: The autism biomarkers consortium for clinical trials [methods]. Front. Integr. Nuerosci.13. 10.3389/fnint.2019.00071 (2020). [DOI] [PMC free article] [PubMed]
- 16.Elsabbagh, M. et al. Disengagement of visual attention in infancy is associated with emerging autism in toddlerhood. Biol. Psychiatry. 74 (3), 189–194. 10.1016/j.biopsych.2012.11.030 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Keehn, B., Müller, R. A. & Townsend, J. Atypical attentional networks and the emergence of autism. Neurosci. Biobehav. Rev.37 (2), 164–183. 10.1016/j.neubiorev.2012.11.014 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.McPartland, J. C., Webb, S. J., Keehn, B. & Dawson, G. Patterns of visual attention to faces and objects in autism spectrum disorder. J. Autism Dev. Disord.41 (2), 148–157. 10.1007/s10803-010-1033-8 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Werner, E., Dawson, G., Osterling, J. & Dinno, N. Recognition of autism spectrum disorder before one year of age. J. Autism Dev. Disord.30 (2), 157–162 (2000). [DOI] [PubMed] [Google Scholar]
- 20.Orekhova, E. V. et al. EEG hyper-connectivity in high-risk infants is associated with later autism. J. Neurodevelopmental Disorders. 6 (1), 1–11. 10.1186/1866-1955-6-40 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Stroganova, T. A., Orekhova, E. V. & Posikera, I. N. Externally and internally controlled attention in infants: an EEG study. Int. J. Psychophysiol.30 (3), 339–351. 10.1016/S0167-8760(98)00026-9 (1998). [DOI] [PubMed] [Google Scholar]
- 22.Kotseruba, I. & Tsotsos, J. K. Attention for vision-based assistive and automated driving: A review of algorithms and datasets. IEEE Trans. Intell. Transport. Syst.23(11), 19907–19928 10.1109/TITS.2022.3186613 (2022).
- 23.Regan, M. A., Hallett, C. & Gordon, C. P. Driver distraction and driver inattention: Definition, relationship and taxonomy. Accid. Anal. Prevent. 43(5), 1771–1781 (2011). [DOI] [PubMed]
- 24.Li, W. et al. A survey on vision-based driver distraction analysis. J. Syst. Architect.121, 102319 (2021). [Google Scholar]
- 25.Ahtola, E., Stjerna, S., Stevenson, N. & Vanhatalo, S. Use of eye tracking improves the detection of evoked responses to complex visual stimuli during EEG in infants. Clin. Neurophysiol. Pract.2, 81–90. 10.1016/j.cnp.2017.03.002 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Maguire, M. J., Magnon, G. & Fitzhugh, A. E. Improving data retention in EEG research with children using child-centered eye tracking. J. Neurosci. Methods. 238, 78–81. 10.1016/j.jneumeth.2014.09.014 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Baltrusaitis, T., Zadeh, A., Lim, Y. C. & Morency, L. P. Openface 2.0: Facial behavior analysis toolkit. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 2018. 59–66. 10.1109/FG.2018.00019 (2018).
- 28.Krafka, K. Eye tracking for everyone. In IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Las Vegas, NV, USA, 2016. 2176–2184. 10.1109/CVPR.2016.239 (2016).
- 29.Lugaresi, C. et al. Mediapipe: A framework for Building perception pipelines. In Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR). 10.48550/arXiv.1906.08172 (2019).
- 30.Torre, F. D. et al. IntraFace. In 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 2015. 1–8. 10.1109/FG.2015.7163082 (2015).
- 31.Perochon, S. et al. A scalable computational approach to assessing response to name in toddlers with autism. J. Child Psychol. Psychiatry. 62 (9), 1120–1131. 10.1111/jcpp.13381 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Chang, Z. et al. Computational methods to measure patterns of gaze in toddlers with autism spectrum disorder. JAMA Pediatr.175 (8), 827–836. 10.1001/jamapediatrics.2021.0530 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Qian, X., Wang, M., Wang, X., Wang, Y. & Dai, W. Intelligent method for real-time portable EEG artifact annotation in semiconstrained environment based on computer vision. Comput. Intell. Neurosci.9590411. 10.1155/2022/9590411 (2022). [DOI] [PMC free article] [PubMed]
- 34.Gotham, K. et al. A replication of the autism diagnostic observation schedule (ADOS) revised algorithms. J. Am. Acad. Child. Adolesc. Psychiatry. 47 (6), 642–651. 10.1097/CHI.0b013e31816bffb7 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Elliott, C. D. Differential Ability Scales. 2nd Ed. (Harcourt Assessment, 2007).
- 36.King, D. E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res.10, 1755–1758 (2009). [Google Scholar]
- 37.Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas.20 (1), 37–46. 10.1177/001316446002000104 (1960). [Google Scholar]
- 38.Tishby, N. & Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW) (IEEE, 2015).
- 39.Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In 34th International Conference on Machine Learning, ICML 2017. Vol. 3. 2130–2143 (2017).
- 40.Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (Springer, 2001).
- 41.Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980 (2015).
- 42.Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Article 721. (Curran Associates Inc., 2019).
- 43.McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. (Zagreb). 22 (3), 276–282 (2012). [PMC free article] [PubMed] [Google Scholar]
- 44.Bey, A. L. et al. Automated video tracking of autistic children’s movement during Caregiver-Child interaction: an exploratory study. J. Autism Dev. Disord. 54 (10), 3706–3718. 10.1007/s10803-023-06107-2 (2024). Epub 2023 Aug 29. PMID: 37642871. [DOI] [PubMed] [Google Scholar]
- 45.Chorney, J. M., McMurtry, C. M., Chambers, C. T. & Bakeman, R. Developing and modifying behavioral coding schemes in pediatric psychology: A practical guide. J. Pediatr. Psychol.40 (1), 154–164. 10.1093/jpepsy/jsu099 (2015) (epub 2014 Nov 21). [DOI] [PMC free article] [PubMed]