Abstract
Among the causes of annual traffic accidents, driving fatigue is the main culprit. Consequently, it is of great practical significance to carry out research on driving fatigue detection and early-warning systems. However, two problems remain in the latest driving fatigue detection methods: first, a single type of information cannot precisely reflect the actual state of the driver in different fatigue phases; second, detection performs poorly, or even fails, under abnormal illumination. In this paper, multi-task cascaded convolutional networks (MTCNN) and infrared-based remote photoplethysmography (rPPG) are used to extract the driver’s facial and physiological information, the modality-specific fatigue information is deeply mined, and a multimodal feature fusion model is constructed to comprehensively analyze the driver’s fatigue trend. To address the low detection accuracy under abnormal illumination, the multimodal features extracted from visible-light images and infrared images are fused by a multi-loss reconstruction (MLR) module, and a driving fatigue detection module based on the Bi-LSTM model is established to exploit the temporal nature of fatigue. The experiments were validated under all-weather illumination scenarios and carried out on the NTHU-DDD, UTA-RLDD, and FAHD datasets. The results show that the multimodal driving fatigue detection model performs better than the single-modal model, improving accuracy by 8.1%. Under abnormal illumination such as strong and weak light, the accuracy of the method reaches 91.7% at the highest and 83.6% at the lowest, while under normal illumination it reaches 93.2%.
Keywords: Multimodal, Abnormal illumination, Driving fatigue, Remote photoplethysmography, Multi-loss reconstruction feature fusion
Subject terms: Computer science, Software
Introduction
Driving fatigue refers to situations in which a person’s driving ability declines as a result of continuously repeating simple, mechanical actions while driving a vehicle1. It primarily manifests as distraction, dozing off, diminished judgment, and delayed operation, and can even lead to traffic accidents such as rear-end collisions and deviation from the road, posing a serious threat to road safety. Therefore, it is of great significance to investigate the mechanism of driver fatigue, develop a driving fatigue detection system, accurately monitor the real-time status of drivers, and provide timely reminders, thereby promoting the development of urban transportation and ensuring people’s travel safety.
Existing driving fatigue detection methods broadly comprise subjective detection, unimodal feature detection, and multimodal feature detection2,3. Subjective detection relies primarily on questionnaires and subjective scales. Kaida et al.4 have long confirmed that the Karolinska Sleepiness Scale (KSS) can effectively assess subjective sleepiness. However, such methods suffer from substantial individual variation and are time-consuming, making it difficult to meet the need for real-time detection and prevention. Hu et al.5 used the KSS as a benchmark to test the accuracy of their fatigue detection system, giving the subjective detection method a new role.
Unimodal feature detection determines fatigue from features of a single modality, including vehicle characteristic features6,7, physiological features8–10, and visual features11–13. To improve the accuracy of driver fatigue detection, multiple features are often used in combination. Wakita et al.14 used vehicle speed, brake pedal, accelerator pedal, and distance to the vehicle ahead as inputs to a Gaussian mixture model (GMM) to identify fatigue. The accuracy of this method was 81% in a simulator and 73% in a real vehicle; therefore, detection accuracy in real environments still needs improvement. Yu et al.15 proposed a 3D deep convolutional neural network condition-adaptive learning framework for extracting spatio-temporal feature representations of the driver’s head, eyes, and mouth in different scenes and achieved better results; however, the model parameters were complex and real-time performance was only average. Qiu et al.16 designed a miniaturized multi-physiological-signal acquisition system that combines wavelet transforms and Kalman filtering on acquired EEG, ECG, and proximal skin temperature signals, and fatigue assessment experiments on 10 healthy subjects showed that the system was reliable and stable. However, the acquisition process is relatively invasive.
Compared with single features, methods combining multiple features of the same modality have improved accuracy significantly. However, the requirements for fatigue detection continue to increase. Therefore, there is an urgent need to study theoretical models of multimodal feature fusion and to design a detection system with high accuracy, robustness, portability, and real-time performance.
Most existing multimodal fusion methods are implemented through decision fusion or feature fusion, for example obtaining multimodal information such as the driver’s driving behavior, vehicle operation parameters, facial information, and physiological parameters, and fusing this information to analyze fatigue states17–19. These methods have improved accuracy and robustness, but most of them do not make good use of the higher-order interaction information between multimodal features. Real-life driving environments also contain several complicating factors that affect detection accuracy, including lighting changes, large deflections of the face, and partial occlusion. It is also worth noting that current fatigue detection methods are mostly applied under good daylight conditions, whereas in real driving environments drivers are often more susceptible to fatigue when driving at night (World Health Organization20). The road environment is dark at night, driving judgments become more difficult than during the day, and the driver must expend more energy; research on nighttime driving fatigue detection is therefore even more important than daytime detection. Daytime driving, meanwhile, suffers from facial overexposure caused by strong light. When the driver is in an abnormal lighting environment, such as strong or weak light, the accuracy of the fatigue detection model is affected to varying degrees by changes in light intensity. Shakeel et al.21 attempted to incorporate low-light driving images into model training; however, the results showed that low-light images reduced the accuracy of the model.
To address these challenges, this study designed an efficient method for detecting the driver’s fatigue state in all-weather illumination scenarios. First, facial and physiological feature extraction models were constructed based on visible-light and infrared images. Subsequently, a multimodal fusion framework was designed using the deep fatigue information of each modality, in which the MLR feature fusion module makes full use of the specificity and complementarity of the multimodal data. Then, a fatigue detection model based on the Bi-LSTM network was developed, taking into account the correlation between fatigue and time. Finally, the effectiveness of this study was verified experimentally. The main contributions of this study are summarized as follows:
A dual-channel feature extraction pipeline combining visible-light and infrared images extracts the driver’s facial fatigue characteristics more quickly and accurately. Compared with a single light-source image, it reduces the influence of abnormal lighting on fatigue detection, making the system more robust and able to detect driver fatigue in all conditions.
Facial and physiological data are used as the multimodal inputs, and the advantages of multimodal data are fully exploited through feature fusion to realize the comprehensive use of multi-source information and to achieve data enhancement and interference resistance. Compared with single-modal data, this provides sufficient data support for fatigue prediction.
Following the idea of multimodal data fusion, the MLR module was designed to guide the fusion of different modal data and maximize the use of effective information. This fusion framework not only exploits the high-order interactions between multimodal features but also avoids information redundancy, greatly improving the performance of the model.
The remainder of this study is organized as follows. “Introduction” introduces the current fatigue-driving detection method. “Related study” summarizes related literature. “Method” presents the fatigue-driving detection method with multimodal feature fusion. “Experiments” presents the dataset used, evaluation metrics, experimental setup, and the results. “Conclusion” summarizes the study and discusses future research directions.
Related study
Numerous studies have shown that two major factors-the driver’s face and physiology-directly affect the performance of fatigue-driving detection systems. However, facial features do not change much during early fatigue, leading to difficulties in detection, and the combination of physiological feature detection from a medical perspective can improve defects such as low accuracy and false detection; therefore, it is necessary to establish a theoretical model of driving fatigue detection with multimodal feature fusion.
The visual feature-based approach analyzes the fatigue state from the driver’s blink frequency, head offset, and mouth opening (Kumar et al.11; Yang et al.12; Reddy et al.13), and has the advantage of detecting fatigue features directly and without contact. With the great success of deep learning in target detection in recent years, CNN-based deep learning has driven the rapid development of computer vision tasks. CNNs are often used as the main method for spatial-domain feature extraction (Shih et al.22; Huynh et al.23; Yu et al.24; Park et al.25; Dwivedi et al.26) because they can automatically learn global face features without human intervention. However, changes in head posture may significantly change the appearance of the face, and masks or sunglasses can occlude part of it; the validity of the spatial features of the face is difficult to guarantee in such cases. To overcome the difficulty of detecting fatigue features when the captured face exhibits large pose deformation, Sun et al.27 first applied a CNN to face keypoint detection in 2013 and proposed a deep convolutional network (DCNN) cascade regression method. In 2016, Zhang et al.28 proposed a multi-task cascaded face detection framework, multi-task cascaded convolutional networks (MTCNN), to handle face detection and face keypoint localization simultaneously, making full use of the potential connection between the two tasks. In 2017, Kowalski et al.29 proposed a deep alignment network (DAN) that replaces the partial image input of previous cascaded neural networks with whole-image input, effectively overcoming the problems caused by head pose and initialization and obtaining better detection results; however, the computational effort is also increased.
Based on driver physiological state detection, fatigue can be determined by detecting the driver’s physiological indicators, such as EEG, ECG, EMG, and other characteristics8–10. Although their accuracy is high, traditional physiological parameter detection methods have problems such as expensive equipment and inconvenience of wearing them. However, with advancements in technology, the non-contact rPPG method30,31 has made the acquisition and feature extraction of physiological parameters more convenient. Conventional cameras can capture subtle color changes on the human skin surface due to heart rate (HR) fluctuations using the rPPG algorithm, and HR can be extracted according to the corresponding algorithm, enabling contactless monitoring; for example, according to the spatial subspace rotation (2SR) algorithm31, we can obtain the rPPG signal by calculating the skin pixel rotation of the face in adjacent frames in the RGB space. However, the traditional rPPG method is susceptible to the brightness and darkness of the surrounding ambient light, and situations such as driving fatigue mostly occur during nighttime driving. Furthermore, the motion of the car causes the camera to produce a small phase shift of the same PPG signal when progressively scanning different parts of the frame, which leads to an incorrect estimation of rPPG. Magdalena et al.32 and Mironenko et al.33 conducted detailed studies. Additionally, advancements in dual-modality RGB-NIR cameras34 have enabled accurate heart rate estimation from facial image sequences, enhancing the robustness and practicality of fatigue detection systems. Furthermore, recent work on remote photoplethysmography and motion tracking convolutional neural network with bidirectional long short-term memory35 has introduced a non-invasive fatigue detection method based on multi-modal fusion, further improving detection accuracy and reliability.
It should be noted that all the above methods rely on single-modal data, which limits the model’s ability to adapt to various scenarios and the reliability of its predictions. Each parameter has its advantages and disadvantages, so combining them is advisable36–42. Some methods combine multimodal data for fatigue detection, covering both the extraction of multimodal features and the fusion models. For example, Sun et al.36 fed contextual features, driver facial features, and vehicle behavior features into three independent multiclass support vector machines (MCSVMs) to fuse multimodal information, performing feature fusion and decision fusion successively through a two-layer fusion structure. Du et al.37 proposed a novel multimodal fusion recurrent neural network (MFRNN) fusing heart rate and eye opening and closing; the temporal information of the three features is fed into a new relational layer containing two levels of RNN to identify the relationships among them. Huang et al.38 designed a multigranularity deep convolutional model (RF-DCM) that adaptively recalibrates and fuses multigranularity features through a feature correction subnetwork and a feature fusion subnetwork, and exploited an LSTM network to explore the relationship between sequence frames. With continuous exploration by researchers, the combination of multimodal features has gradually evolved from simple splicing and merging to analyzing the mechanisms and correlations between features and using different modal characteristics for different task requirements. This is also the goal pursued by our multimodal fusion driving fatigue detection method.
Driving fatigue detection methods are limited in practical applications not only because a single type of information cannot accurately reflect the driver’s real state in different fatigue stages, but also because abnormal illumination is an important factor limiting detection performance. Low-light image enhancement (LLIE) is an important means of mitigating the adverse effects of lighting changes43–45. Its main purpose is to improve the visual perception of images taken in environments with insufficient light. Preprocessing the input image in this way can improve the robustness of the model to low-light environments to a certain extent, but the improvement is relatively limited. In 2017, Lore et al.46 first applied deep learning to low-light image enhancement and achieved a breakthrough in accuracy; enhancement methods based on deep learning have therefore attracted the attention of many scholars. Jiang et al.47 proposed an unsupervised enhancement model with a global-local discriminator, which not only enhances local low-light regions but also improves the global brightness of the image. However, its computational cost is relatively large and cannot meet the real-time requirements of driving fatigue detection. Consequently, some researchers have turned to infrared images. Yan et al.48 designed a fatigue detection system based on the fusion of color and infrared eye features. The system first collects images through color and infrared lenses, then detects the face in the collected images and segments and analyzes the eye region. Finally, a photosensitive device adaptively adjusts the weights of the color and infrared features to further reduce the influence of illumination on fatigue detection.
As noted above, facial features alone change little during early fatigue, and combining physiological feature detection from a medical perspective can compensate for the shortcomings of vision algorithms in the pre-fatigue period, such as low accuracy and false detection; a multimodal feature fusion model of driving fatigue detection is therefore necessary. Nowara et al.49 obtained reliable heart rate estimates from narrowband near-infrared video recordings, and combining physiological and facial features extracted from infrared and visible images has become an interesting topic. To improve the accuracy, robustness, and practicality of fatigue detection, we use infrared-based rPPG to achieve contactless heart rate monitoring, use the optimised MTCNN network to establish a facial feature extraction model, and fuse the two modal features to meet the needs of driving fatigue detection under different lighting.
Methods
The multimodal fusion driving fatigue detection model framework constructed in this study is shown in Fig. 1.
Figure 1.
The multimodal fusion driving fatigue detection model constructed in this study consists of the following parts: (1) Multimodal feature extraction. This part feeds the captured RGB and IR video data, after cleaning and enhancement, into the facial feature extraction and physiological feature extraction models, respectively, and obtains the hidden fatigue characterization of each modality. (2) Multimodal feature fusion. The deep information and the interaction information between the modalities are extracted independently by the deep information mining module to obtain more expressive fatigue features. Moreover, the MLR feature fusion module is designed to fuse the deep information of the modalities while avoiding redundant data and enhancing information expression. (3) Driving fatigue recognition model. The fused multimodal features are fed into the Bi-LSTM network to further investigate the temporal relationship between features and fatigue and achieve accurate and efficient detection of driving fatigue.
Ethical approval and consent to participate
All methods were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by the Institutional Review Board. Informed consent was obtained from all subjects and/or their legal guardian(s).
Multimodal feature extraction
The entire facial feature extraction network consists of the MTCNN, the classification networks, and the R-Drop algorithm. MTCNN consists of three cascaded networks (P-Net, R-Net, and O-Net) for face detection and keypoint localization. These three cascaded networks not only make detection faster and more accurate but also reduce the influence of different illuminations, and they can detect and locate faces and keypoints accurately under large deflection, tilt, and partial occlusion, which makes them well suited to monitoring drivers behind the wheel. However, the P-Net used for fast candidate-window generation in the MTCNN algorithm must build image pyramids and pass them through the network, which takes a long time, approximately 85% of the total computation time. To ensure the real-time performance of driving fatigue detection, we improved the MTCNN algorithm. MTCNN uses a sliding-window detection scheme built on image pyramids to detect faces of different sizes: for a fixed detection window, the larger the original image, the more pyramid levels are needed and the greater the time overhead. In the specific usage environment of a car, only one driver needs to be detected, so many scaling levels are unnecessary. Therefore, in this study, we reduced the number of image pyramid levels by shrinking the original image. Through several experiments, we found that when the original image size is 640 × 480, scaling the image to 0.5 times its original size, that is, 320 × 240, maintains a high accuracy rate while keeping the image features of the eye and mouth target regions clearly visible, giving the best results.
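As a minimal illustration of this speed optimization (not the exact implementation used in this study), the following sketch downscales each 640 × 480 frame to 320 × 240 before detection and maps the returned box and keypoints back to the original resolution; it assumes the interface of the pip-installable mtcnn package, and the helper name is hypothetical.

```python
import cv2
from mtcnn import MTCNN  # assumption: the pip-installable "mtcnn" package

detector = MTCNN()
SCALE = 0.5  # 640x480 -> 320x240, the ratio found to work well in this study


def detect_driver_face(frame_bgr):
    """Run MTCNN on a downscaled copy of the frame to shrink the image
    pyramid, then map the box/keypoints back to the original resolution."""
    small = cv2.resize(frame_bgr, None, fx=SCALE, fy=SCALE,
                       interpolation=cv2.INTER_AREA)
    small_rgb = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)
    results = detector.detect_faces(small_rgb)
    if not results:
        return None
    best = max(results, key=lambda r: r["confidence"])
    x, y, w, h = [int(v / SCALE) for v in best["box"]]
    keypoints = {name: (int(px / SCALE), int(py / SCALE))
                 for name, (px, py) in best["keypoints"].items()}
    return {"box": (x, y, w, h), "keypoints": keypoints}
```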
Given the RGB images, each RGB image is input to the MTCNN network, which outputs the face candidate box and five facial keypoints. The eye and mouth regions are segmented using the keypoint coordinates and input to two classification networks, each of which predicts the state of the eyes and mouth over a 60 s interval; the classified images are then processed with the PERCLOS criterion and prior knowledge to obtain the facial features, namely the blink frequency and yawn frequency.
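The window-level facial features can be computed along the following lines; this is a simplified sketch in which the per-frame eye/mouth states are assumed to come from the two classification networks, and the frame rate and yawn-duration threshold are illustrative values rather than the settings used in this study.

```python
import numpy as np

FPS = 30        # assumed frame rate
WINDOW_S = 60   # features are computed over a 60 s window


def facial_features(eye_closed, mouth_open):
    """eye_closed, mouth_open: per-frame boolean predictions over one window.
    Returns PERCLOS, blink frequency and yawn frequency."""
    eye_closed = np.asarray(eye_closed, dtype=bool)
    mouth_open = np.asarray(mouth_open, dtype=bool)

    # PERCLOS: proportion of frames in which the eyes are closed.
    perclos = eye_closed.mean()

    def runs(mask):
        """Maximal runs of consecutive True frames, as (start, end) pairs."""
        edges = np.flatnonzero(np.diff(mask.astype(int)))
        starts = np.r_[0, edges + 1]
        ends = np.r_[edges, len(mask) - 1]
        return [(s, e) for s, e in zip(starts, ends) if mask[s]]

    # One blink = one closed-eye run; one yawn = an open-mouth run longer
    # than ~2 s (prior-knowledge threshold, assumed here).
    blink_freq = len(runs(eye_closed)) * 60.0 / WINDOW_S          # per minute
    yawn_freq = len([r for r in runs(mouth_open)
                     if (r[1] - r[0] + 1) > 2 * FPS]) * 60.0 / WINDOW_S
    return perclos, blink_freq, yawn_freq
```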
Studies have shown that a person’s mental state can be reflected by the activity of the autonomic nervous system. HRV, the subtle rise and fall of the R–R interval from heartbeat to heartbeat, best reflects the activity state of the autonomic nervous system. The literature50 confirmed the association between HRV signals and driving fatigue, highlighting that the ratio of low to high frequencies (LF/HF) in the frequency-domain indices differs significantly between awake and fatigued states, reflecting the changes occurring in the autonomic nervous system during driving fatigue.
The physiological feature extraction model uses the rPPG method based on IR images. Given the IR images, each IR image is segmented into ROI regions, the heart rate variability (HRV) signal is obtained using the IR-rPPG method, and the LF/HF ratio of low to high frequencies is obtained after spectral analysis as the physiological feature.
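As a rough sketch of this spectral analysis step (assuming the IR-rPPG pipeline has already produced a series of inter-beat intervals, and using the conventional HRV frequency bands and an assumed 4 Hz resampling rate):

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch


def lf_hf_ratio(ibi_s, fs=4.0):
    """ibi_s: inter-beat intervals (seconds) derived from the rPPG signal.
    Returns the LF/HF ratio used as the physiological fatigue feature."""
    t = np.cumsum(ibi_s)                                  # beat times
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs)          # evenly resample
    ibi_uniform = interp1d(t, ibi_s, kind="cubic")(t_uniform)

    f, psd = welch(ibi_uniform - ibi_uniform.mean(), fs=fs,
                   nperseg=min(256, len(ibi_uniform)))
    lf = np.trapz(psd[(f >= 0.04) & (f < 0.15)], f[(f >= 0.04) & (f < 0.15)])
    hf = np.trapz(psd[(f >= 0.15) & (f < 0.40)], f[(f >= 0.15) & (f < 0.40)])
    return lf / hf
```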
When training the two classification networks, the dropout layer randomly discards units in the model, resulting in a mismatch between the training model and the prediction model, which affects prediction accuracy. The R-Drop method51 was therefore adopted: by adding a regularization term, it ensures that the model outputs remain essentially the same even though dropout discards different units each time. This reduces the mismatch between model training and prediction, increases the robustness of the model to dropout, and improves the final performance of the classification sub-networks. The training principle of the R-Drop method is as follows:
For each training step, data pairs $(x_i, y_i)$ are randomly sampled from the training data $D=\{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the number of training samples, and the input data are passed forward through the network twice. Because dropout discards different units in each pass, the two outputs yield the probability distributions $P_1(y_i \mid x_i)$ and $P_2(y_i \mid x_i)$. The negative log-likelihood loss is calculated for both passes,

$L_{NLL}^{i}=-\log P_1(y_i \mid x_i)-\log P_2(y_i \mid x_i)$  (1)

and the bidirectional KL-divergence loss between the two distributions of the same sample is minimized to regularize the model prediction:

$L_{KL}^{i}=\frac{1}{2}\left[D_{KL}\left(P_1(y_i \mid x_i) \,\|\, P_2(y_i \mid x_i)\right)+D_{KL}\left(P_2(y_i \mid x_i) \,\|\, P_1(y_i \mid x_i)\right)\right]$  (2)

The final training goal is to minimize the total loss of the data $(x_i, y_i)$:

$L^{i}=L_{NLL}^{i}+\alpha \cdot L_{KL}^{i}$  (3)

where $\alpha$ is the coefficient weight controlling $L_{KL}^{i}$.
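A compact sketch of this training objective, written against the Keras/TensorFlow stack used in this study (the value of α and the surrounding training loop are placeholders):

```python
import tensorflow as tf

ALPHA = 0.5  # weight of the KL term, an assumed value


def r_drop_loss(model, x, y):
    """Two stochastic forward passes of the same batch: the dropout masks
    differ, so p1 and p2 differ. The loss combines the NLL of both passes
    with a symmetric KL term pulling the two distributions together."""
    p1 = model(x, training=True)
    p2 = model(x, training=True)

    nll = (tf.keras.losses.sparse_categorical_crossentropy(y, p1)
           + tf.keras.losses.sparse_categorical_crossentropy(y, p2))
    kl = tf.keras.losses.kld(p1, p2) + tf.keras.losses.kld(p2, p1)

    return tf.reduce_mean(nll + ALPHA * 0.5 * kl)
```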
Multi-loss reconstruction feature fusion module
Handling similarities and differences between different types of data is a key issue in multimodal feature fusion tasks. Therefore, in this study, we propose the MLR feature fusion module that can preserve the specific information between modalities while making full use of the complementary information of each modality and avoid redundant information.
The proposed MLR feature fusion module comprises two main processes: deep information mining and feature fusion. First, a restricted Boltzmann machine (RBM) network52 was selected to automatically learn the combined information of features. The RBM network structure is shown in Fig. 2.
Figure 2.
RBM network structure diagram.
Given a multimodal feature sample, in which the blink frequency and yawn frequency are the facial features extracted from the RGB images and the LF/HF ratio is the physiological feature extracted from the IR images, two layers of RBMs are used to extract the hidden-neuron expression of each modality and, on top of that, the cross-modal expression between them. The previously extracted features are input into the visible layer v, the facial features and the physiological features respectively, and the first-layer RBMs extract the intermediate hidden-unit features. The posterior distribution probabilities of the hidden layer learned in the first-layer model (the extracted deep facial features and deep physiological features) are as follows:
$P\left(h_{j}^{F}=1 \mid v^{F}\right)=\sigma\left(b_{j}+\sum_{i} w_{ij} v_{i}^{F}\right)$  (4)

$P\left(h_{j}^{P}=1 \mid v^{P}\right)=\sigma\left(b_{j}+\sum_{i} w_{ij} v_{i}^{P}\right)$  (5)

where $\sigma(\cdot)$ is the sigmoid activation function, $b_{j}$ is the bias of hidden unit $j$, and $w_{ij}$ is the weight of the connection between neurons in the two layers; $v^{F}$ and $v^{P}$ denote the facial and physiological inputs of the visible layer, and $h^{F}$ and $h^{P}$ are the corresponding deep facial and deep physiological features.
The hidden-unit features learned in the first-layer RBMs are input into a second-layer RBM so that the hidden units learned in the second layer capture the higher-dimensional interrelationships between facial features and physiological features; such shared features can express the driving fatigue state more effectively. The posterior distribution probability (the extracted complementary features H) of the hidden layer learned in the second-layer model is as follows:
$P\left(H_{k}=1 \mid h^{F}, h^{P}\right)=\sigma\left(c_{k}+\sum_{j} u_{jk} h_{j}^{F}+\sum_{j} u_{jk}^{\prime} h_{j}^{P}\right)$  (6)

where $c_{k}$ is the bias of shared hidden unit $k$, and $u_{jk}$ and $u_{jk}^{\prime}$ are the connection weights of the second-layer RBM for the facial and physiological hidden units, respectively.
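For reference, a minimal NumPy sketch of the hidden-layer activations in Eqs. (4)-(6) is given below; the feature dimensions are toy values and the weights, which would normally be learned with contrastive divergence, are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def rbm_hidden_probs(v, W, b):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W_ij), cf. Eqs. (4)-(6)."""
    return sigmoid(b + v @ W)


# Toy dimensions: 2 facial features (blink/yawn frequency), 1 physiological
# feature (LF/HF), 8 hidden units per first-layer RBM, 16 shared hidden units.
v_face, v_phys = rng.random((1, 2)), rng.random((1, 1))
W_face, b_face = rng.normal(size=(2, 8)), np.zeros(8)
W_phys, b_phys = rng.normal(size=(1, 8)), np.zeros(8)
W_shared, b_shared = rng.normal(size=(16, 16)), np.zeros(16)

h_face = rbm_hidden_probs(v_face, W_face, b_face)   # deep facial features
h_phys = rbm_hidden_probs(v_phys, W_phys, b_phys)   # deep physiological features
H = rbm_hidden_probs(np.concatenate([h_face, h_phys], axis=-1),
                     W_shared, b_shared)            # complementary features H
```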
The hidden representations of each modality extracted by the first-layer RBMs are fed into the second-layer RBM and into three 1 × 1 convolutional layers, respectively. The former obtains the complementary features between modalities, and the latter are each followed by a ReLU layer to ensure the nonlinear capability of the network. For brevity, the combination of these convolutional and ReLU layers is referred to as a conv module. Subsequently, a multimodal feature aggregation module is introduced to efficiently aggregate the two levels of features and obtain more discriminative and robust aggregated features.
The deep features of each modality extracted in the first layer were input to the conv module and then to the aggregation module to obtain the aggregated features.
$f^{F}=\operatorname{conv}\left(h^{F}\right), \quad f^{P}=\operatorname{conv}\left(h^{P}\right)$  (7)

$f=\operatorname{Agg}\left(f^{F}, f^{P}\right)$  (8)
The overall performance may be affected by the overlap of information between the shared and specific features. First, if the shared features contain too much modality-specific information, feature delivery becomes inaccurate. Second, if the specific features are highly correlated with the shared features, they can provide only a small complement to the shared features, and redundant information in the specific features also reduces model sensitivity.
To alleviate these two problems, the complementary features H and the aggregated features f output by the second-layer model are constrained by loss functions53. Based on the complementary-modality loss L, the aggregation feature loss $L_{agg}$ is introduced to impose a cross-modal similarity constraint. This enhances the complementarity of these two features while ensuring a low correlation between the shared and specific features.
$L_{agg}=\sum_{j}\left[\rho_{1}+\max _{i} d\left(H_{j}, f_{i}^{+}\right)-\min _{k} d\left(H_{j}, f_{k}^{-}\right)\right]_{+}+\sum_{j}\left[\rho_{2}+\max _{i} d\left(f_{j}, H_{i}^{+}\right)-\min _{k} d\left(f_{j}, H_{k}^{-}\right)\right]_{+}$  (9)
where $\rho_{1}$ and $\rho_{2}$ are predefined bounds (margins), $d(\cdot,\cdot)$ is the feature distance, $[\cdot]_{+}=\max(\cdot, 0)$, and $i$ and $k$ index the cross-modal positive and negative samples corresponding to each anchor point $j$ within a mini-batch. This cross-modal constraint ensures that the most difficult cross-modal negative sample is pushed farther from the anchor than its corresponding cross-modal positive sample.
The two are then dot-multiplied (element-wise) to obtain the desired fused features:

$F=H \odot f$  (10)
The overall information loss is finally constrained by a reconstruction task that combines the global loss function54 to guide the shared and aggregated features to self-differentiate and complement each other.
It is well known that the deep features of each modality are more expressive. Deep facial features help detect fatigue in a simple and intuitive way, but early fatigue is easily misdetected and these features are susceptible to light. Deep physiological features capture the more subtle changes in the human body during fatigue. The two modal features are therefore complementary, and combining them helps detect fatigue more completely and accurately from multiple perspectives.
Driving fatigue detection model
Drivers are prone to fatigue during continuous, uninterrupted driving, and as time progresses they gradually show obvious fatigue characteristics. Because driving fatigue is a continuous process, fatigue states have obvious temporal characteristics, and their various feature responses also carry different temporal information.
LSTM is a special type of recurrent neural network that can learn the temporal information of features well owing to the uniqueness of its network structure, thus significantly improving the long-term dependency problem in other recurrent neural networks. The unit structure is shown in Fig. 3.
Figure 3.
Schematic diagram of LSTM cell structure.
In short, in the LSTM, irrelevant information can be forgotten through the forget gate while the internal cell state is updated continuously; this structure makes the LSTM well suited to modeling the relationships in sequential data.
For the prediction of driving fatigue, combining several inputs describing the driver’s preceding and following states yields more accurate results. The Bi-LSTM network, which includes both forward and backward LSTMs, was therefore chosen as the detection model. Considering the possibility of misclassifying normal blinks as fatigue-related eye closures in the input network, the output values of both the forward and backward computations are saved at each moment in the hidden layer, and the final result is obtained by combining the outputs of each moment in the bidirectional hidden layer. In this manner, the temporal information of driver fatigue can be fully exploited to improve the accuracy and reliability of driver fatigue state recognition.
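A minimal Keras sketch of this detection head is given below; the window length, feature dimension, and layer widths are illustrative choices rather than the exact configuration used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_fatigue_detector(seq_len=60, n_features=3):
    """Bi-LSTM over a window of fused multimodal feature vectors,
    followed by a binary awake/fatigue classifier."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_features)),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32)),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),   # P(fatigue)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_fatigue_detector()
model.summary()
```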
Experiments
This section includes an introduction to the experimental dataset, evaluation metrics, experimental environment and current commonly used models, as well as a final analysis of the experimental results.
Dataset
We used three datasets to experimentally validate the algorithm proposed in this study, including two publicly available datasets: the NTHU-DDD dataset55 and the UTA-RLDDD dataset56, and one homemade dataset: the FAHD dataset.
NTHU-DDD dataset: This dataset was collected by the Computer Vision Laboratory of National Tsing Hua University and recorded 36 adults of different races aged 18–40 years. Under day and night lighting conditions, the researchers recorded subjects with different hairstyles and outfits sitting in chairs and playing a normal driving game with a simulated steering wheel and pedals. They were instructed by the experimenter to perform eight actions in five scenarios, including normal driving, laughing, yawning, and snoozing, among others. The dataset contains 360 training videos and 20 test videos with a video resolution of 640 × 480 pixels at 30 frames per second. The sequences recorded from each participant were divided into sleepy and non-sleepy groups.
UTA-RLDD dataset: This dataset was created by the University of Texas at Arlington, with 60 healthy subjects participating in the data collection, 51 males and 9 females of different races and ages. The subjects freely recorded videos at home or at college, taking three videos with a cell phone or webcam (any model or type) in three different states of sleepiness, each lasting approximately 10 min. Because the videos were recorded by the participants themselves, they have different resolutions and qualities, with frame rates always below 30 fps, which is typical of consumer cameras. This dataset consists of 180 RGB videos with a total duration of approximately 30 h, where each video is labeled as alert, low vigilant, or drowsy.
Homemade driving fatigue dataset FAHD: Because there are few publicly available RGB and IR multimodal driving fatigue datasets, we collected our own driving dataset using the Oni S500 binocular IR camera, which can be flexibly installed on the rearview mirror or dashboard of a car. The binocular camera consists of a color RGB sensor and an IR sensor with an acquisition rate of 20 Hz, capturing RGB and IR images simultaneously at a resolution of 640 × 480 pixels. Although the positions of the RGB and IR cameras were offset, the difference was not significant.
The dataset contains real driving scenarios for various types of vehicles, such as private cars, cabs, and trucks, during the day and night. Fifty-six participants (30 males and 26 females), aged between 18 and 54 years, were asked to drive and perform common behavioral activities on different road sections. The entire driving time was approximately 2 h and included a variety of complex situations, such as light changes, driver head deflection, and partial occlusion, which ensured the diversity and randomness of the samples. After video acquisition was completed, the recordings were divided into 60 s short videos labeled as awake or fatigue, yielding 5000 RGB-IR videos of 1200 × 650 pixels, including 3720 videos of the awake state and 1280 videos of the fatigue state.
As shown in Table 1, each dataset has different subject populations, devices, experimental settings, and so on. Figure 4 shows sample images of each dataset.
Table 1.
Introduction of NTHU-DDD dataset, UTA-RLDD dataset and FAHD dataset.
| Dataset | Actions | Camera setting | Experimental settings | Period | Scenarios |
|---|---|---|---|---|---|
| NTHU-DDD | 8: Yawning, Blinking, Nodding, Shocked face, Burst out laughing, Talking, Laughing, Looking at side | 1 camera with active infrared illumination (IR) | Simulated settings | Day, Night | BareFace, Glasses, Sunglasses, Night-BareFace, Night-Glasses |
| UTA-RLDD | 3: Alert, Low vigilant, Drowsy | 1 cell phone or web camera (RGB) | Different real-life environments and backgrounds | Morning, Noon, Midnight | BareFace, Glasses, Sunglasses, Facial hair, Headset |
| FAHD | 9: Safe driving, Making a call, Talking, Laughing, Smoking, Drinking, Yawning, Nodding, Looking at side | 2 cameras (IR, RGB) | Naturalistic driving | Day, Night | BareFace, Glasses, Sunglasses, Mask, Hat |
Figure 4.
Sample images of the multi-source dataset.
Evaluation indicators
Because of the problem of unbalanced samples in the dataset, two evaluation metrics, accuracy and F1-score, were selected to evaluate the model performance in this study and reduce misjudgment. Accuracy is the proportion of correct predictions, and the F1-score is the harmonic mean of precision and recall. Precision is an evaluation metric for the prediction results, representing the percentage of true positives among all results predicted as positive. Recall is an evaluation metric for the original samples, representing the percentage of samples predicted as positive among those that are actually positive. Precision and recall reflect two different aspects of model performance, and the F1-score reflects both.
$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$  (11)

$\text{Precision}=\frac{TP}{TP+FP}$  (12)

$\text{Recall}=\frac{TP}{TP+FN}$  (13)

$F1\text{-}score=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}$  (14)
where P refers to the number of cases classified as positive, N refers to the number of cases classified as negative, TP is the number of cases correctly classified as P, TN is the number of cases correctly classified as N, FP is the number of cases incorrectly classified as P, and FN is the number of cases incorrectly classified as N.
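These metrics follow directly from the confusion-matrix counts; a minimal sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts
    (Eqs. (11)-(14))."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts: 85 true fatigue detections, 90 true awake,
# 10 false alarms, 5 missed fatigue cases.
print(classification_metrics(85, 90, 10, 5))
```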
Experimental environment
Experimental equipment and environment: Windows 11 64-bit operating system, Intel i7 2.30 GHz processor, 16 GB RAM, NVIDIA GeForce 4070 GPU, Python 3.7, and the Keras (TensorFlow 2.1) framework.
Dataset processing: To meet the experimental requirements, the datasets were processed uniformly. First, all videos in the three datasets were cropped into 60 s clips and then randomly divided into training and test sets at a ratio of 8:2. The consequences of misclassifying driver fatigue as sobriety are much more serious than those of misclassifying sobriety as fatigue. To avoid this situation and reduce prediction errors, we consider only the two cases of driver sobriety and fatigue, and a sample is classified as fatigue as long as the driver shows even slight signs of fatigue. The labels of the three datasets were uniformly set to awake and fatigue.
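The clip-level split can be sketched as follows (a simple random 8:2 split; the fixed seed is an assumption added for reproducibility):

```python
import random


def split_clips(clip_paths, train_ratio=0.8, seed=42):
    """Shuffle the 60 s clips and split them into training and test sets."""
    clips = list(clip_paths)
    random.Random(seed).shuffle(clips)
    cut = int(len(clips) * train_ratio)
    return clips[:cut], clips[cut:]
```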
Evaluation of feature extraction methods
(1) Comparison experiment of different facial feature extraction methods
In order to verify the effectiveness of the facial feature extraction method proposed in this study, we conducted comparative experiments with currently common and up-to-date methods, considering the average computational speed of each algorithm in processing different stages of the video, accuracy, and F1-scores. The experimental results are shown in Table 2.
Table 2.
Performance of different facial feature extraction methods.
| Method | Accuracy (%) | F1-score (%) | Speed: face detection (ms/f) | Speed: landmark localization (ms/f) | Speed: feature extraction (ms/f) |
|---|---|---|---|---|---|
| CNN | 90.7 | 89.6 | 34.32 | 12.44 | 45.81 |
| DCNN | 91.5 | 90.3 | 34.68 | 13.20 | 46.15 |
| MTCNN | 93.4 | 91.9 | 36.41 | 15.67 | 48.63 |
| DAN | 94.9 | 93.5 | 36.53 | 16.81 | 48.84 |
| Ours | 94.2 | 92.7 | 35.15 | 15.39 | 47.06 |
| FLAVA | 95.1 | 93.2 | 44.23 | 20.91 | 58.54 |
As shown in Table 2, the accuracy of our method reaches 94.2% and the F1-score reaches 92.7%, exceeding CNN, DCNN, and MTCNN and only slightly below DAN. The main reason is that our method adds the R-Drop algorithm on top of the three cascaded networks used to detect faces and keypoints, which minimizes the inconsistency between model training and testing and improves the classification performance of our model. The two-stage alignment and full-image regression of the DAN network better correct the deflected or tilted faces that occur during driving, so its accuracy is the best. However, such a complex network structure also increases the computation, and the three modules of the DAN algorithm incur a higher computational cost.
We also introduced the FLAVA57 model into our experiments, which significantly enhanced the system’s ability to detect driver fatigue. The FLAVA model achieved an accuracy of 95.1% and an F1-score of 93.2%, surpassing all other methods, including ours. This improvement is primarily due to FLAVA’s advanced multimodal feature fusion capabilities, which leverage a Transformer-based58 architecture. However, the computational cost of the FLAVA model is higher, with average speeds of 44.23 ms/f for face detection, 20.91 ms/f for landmark localization, and 58.54 ms/f for feature extraction.
While the FLAVA model performed strongly, the MLR method achieved comparable accuracy on our dataset with better detection speed and a much lower computational cost, which is crucial for real-time driver status monitoring systems.
Furthermore, we chose to use Bi-LSTM due to its superior ability to capture temporal features, understanding the state changes of a driver over different time points. Compared to Transformer models, Bi-LSTM is relatively simpler and more computationally efficient, making it suitable for real-time applications. Experimental validation also indicated that Bi-LSTM performed best on our dataset, providing higher accuracy and stability.
In summary, through comparative experiments of various deep learning methods, we validated the effectiveness of multimodal feature fusion methods in driver fatigue detection. By introducing the FLAVA model, we further explored the potential and applicability of Transformer-based methods in this task. Despite the new methods showing promising results in some aspects, traditional methods like MLR and Bi-LSTM still have significant advantages in terms of real-time performance and computational efficiency.
(2) Comparison experiments of different physiological feature extraction methods
In this section, we test the accuracy and stability of the proposed infrared rPPG-HR detection system. We randomly selected 10 of the 56 subjects and asked them to wear an ECG monitor; subjects 1–5 performed the experiment at 12:00 noon and subjects 6–10 at 23:00, so that we could accurately extract their HR and HRV, including the LF and HF data, and compare them with the detection results of the 2SR algorithm and the method in this study, as shown in Fig. 5.
Figure 5.
Comparison results of different HRV signal extraction methods.
Figure 5 compares the effectiveness of the 2SR algorithm and the IR rPPG-HR method by comparing them with the authoritative ECG monitoring results. The detection results of both methods are consistent with the ECG monitor during daytime, but the deviation of the 2SR algorithm increases significantly during nighttime because the lack of light at night prevents it from effectively extracting the complete skin pixels, whereas our IR rPPG-HR method compensates for the effects caused by the low-light environment at night. After eight rounds of experiments, we achieved an accuracy rate of more than 90% for both, indicating that the method in this study can stably and accurately extract physiological features that effectively reflect fatigue.
Evaluation of multi-feature fusion methods
(1) Comparison experiments of different modal feature inputs
To compare the classification performance of multimodal data fusion with that of unimodal data, this study compared the inputs of different modal features while keeping the experimental environment and subjects unchanged. The feature categories were divided into four groups: facial features, physiological features, mixed single-modal features, and mixed multimodal features, and the four feature types were classified into fatigue and wakefulness states according to the given labels. The experimental results are shown in Table 3.
Table 3.
Performance comparison of different feature type inputs.
| Category | Feature | Parameter | Accuracy (%) | F1-score (%) |
|---|---|---|---|---|
| Facial | Eyes | PERCLOS | 81.8 | 80.9 |
| Facial | Mouth | Yawn frequency | 80.5 | 78.6 |
| Physiology | ECG | HR | 82.4 | 81.1 |
| Physiology | rPPG | HRV | 81.2 | 80.5 |
| Single-mode | Eyes + Mouth | PERCLOS, Yawn frequency | 84.2 | 81.9 |
| Single-mode | ECG + rPPG | HR, HRV | 85.7 | 83.2 |
| Multi-modal | Eyes + Mouth + rPPG | PERCLOS, Yawn frequency, HRV | 89.9 | 86.1 |
According to Table 3, the classification accuracies of the method in this study were 81.8% and 80.5% for the visual feature group (Eyes, Mouth) and 82.4% and 81.2% for the physiological feature group (ECG, rPPG), respectively. In general, fatigue detection using physiological features is more accurate than using visual features, but here the rPPG-based accuracy is lower than the Eyes-based accuracy, mainly because we obtain the HRV signal visually, which introduces more noise; a heart rate signal obtained through electrodes, as in the ECG-based method, contains relatively little noise. The ECG-based accuracy reaches 82.4%, higher than that of the visual feature group (Eyes, Mouth) and rPPG, proving that ECG is indeed more effective. However, the difference is only 1.2%, which is acceptable given the convenience of non-contact detection, so the subsequent physiological features are obtained from rPPG.
The accuracy of the mixed unimodal feature group (Eyes + Mouth; ECG + rPPG) was 84.2% and 85.7%, and the F1-scores were 81.9% and 83.2%, respectively, which were higher than the accuracy of other single feature groups, indicating that multiple features can describe the fatigue state more effectively than single features. For the mixed multiclass feature group (Eyes + Mouth + rPPG), the accuracy was as high as 89.9%, and the F1-score reached 86.1%. According to the experimental results, it can be seen that the detection results based on multimodal data are superior to those based solely on unimodal data, thus confirming the effectiveness of multimodal information in the driving fatigue detection algorithm.
(2) Ablation experiment of multimodal feature fusion module
To evaluate the impact of the multi-loss reconstruction module in the proposed multimodal fusion method, we compared the module performance with the softmax loss, global loss, top-ranking loss, and multi-loss reconstruction. To better reflect the characteristics and differences among the losses, the experiments divided the datasets into three scenarios: daytime, nighttime, and full day-night coverage, and the training and test sets were randomly divided at a ratio of 8:2. When the light is bright during the day, facial features are more abundant, so facial features dominate and physiological features assist; at night, the light is weak and facial features are difficult to extract, so physiological features dominate and facial features assist; as time passes from day to night, the weights of the facial and physiological features keep changing. The experimental results are shown in Fig. 6.
Figure 6.
Ablation experimental results of the multimodal feature fusion module.
The first group in the experiment represents the module with only the basic softmax loss, the second group the module with only the top-ranking loss, the third group the module with only the global loss, and the fourth group our multi-loss reconstruction module that includes both the top-ranking and global losses. The softmax-loss module in the first group has the lowest accuracy because it directly concatenates and fuses the facial features with the physiological features.
The top-ranking loss module in the second group performs better than the first group in all three scenes, with the greatest improvement in the daytime and nighttime scenes. This is primarily because the top-ranking loss handles the specific features of each modality well and fuses both features. However, it does not consider the weighting of different features during fusion, which causes information redundancy and makes the fusion less effective than anticipated. The experiment also verifies that the accuracy of the top-ranking loss module improves by 2.37% and 2.51% when the features of one modality dominate in the daytime and nighttime scenes, but by only 1.41% when the weights of the modalities change continuously from day to night.
The global loss module in the third group also outperforms the first group, with accuracies improved by 1.66%, 1.15%, and 2.32%, respectively. However, it is inferior to the second group in both the day and night scenes, and its accuracy exceeds that of the second group only in the day-to-night scene. This is because the main role of the global loss is to guide self-discrimination and complementarity when fusing multimodal features, so the best fusion scheme can be selected in the face of weight changes, but its feature processing is not as effective as the top-ranking loss.
The fourth group, the multi-loss reconstruction module combining the global and top-ranking losses, achieves accuracies of 85.25%, 83.84%, and 84.65% in the three scenes. The ablation results therefore show that this method effectively improves the performance of the model and has practical application value.
Evaluation of driving fatigue detection methods
To make our experimental results more generalizable and reliable, the algorithms in this study were compared with the current mainstream driving fatigue detection algorithms on three datasets. After processing by the corresponding algorithms, the accuracy and F1-scores of each algorithm for detecting fatigue were compared, and the experimental results are shown in Table 4. Whether the MFRNN model37, VGG-FaceNet39, or the DBN40 is considered, these models require more computational resources than ours, have more complex network structures, and also have higher data requirements. In terms of the experimental results on multimodal problems, our model demonstrates superior performance.
Table 4.
Comparison with state-of-the-art fatigue testing methods.
| Methods | Year | Granularity extraction | NTHU-DDD ACC | NTHU-DDD F1 | UTA-RLDD ACC | UTA-RLDD F1 |
|---|---|---|---|---|---|---|
| du2020vision37 | 2020 | Heart + Eyes | 83.6 | 81.5 | 80.2 | 79.7 |
| huang2020rf38 | 2020 | Face + Eyes | – | 89.4 | 90.9 | 88.6 |
| dua2021deep39 | 2021 | Behavioral | 85.0 | – | 81.6 | 78.2 |
| kir2022behavior40 | 2021 | Eyes + Mouth | 30.0 | – | 78.3 | – |
| du2022multimodal41 | 2022 | Heart + Eyes | 82.9 | 81.6 | 86.4 | – |
| pandey2022novel42 | 2022 | Head + Face | 88.5 | 83.8 | 89.0 | 80.3 |
| Proposed Model | 2022 | Heart + Eyes | 91.7 | 90.4 | 93.2 | 89.5 |
The results in Table 4 show that the reliability and accuracy of our algorithm are the best in three different environments. The main reason is that when we designed the facial feature and physiological feature fusion network, we obtained richer deep fatigue information through the deep information mining module and combined it with the MLR feature fusion module to effectively utilize the deep and complementary information of multimodal features. We used the IR rPPG-HR method to extract physiological information at night, which improved the detection accuracy and enhanced the robustness at the same time.
Example demonstration of driving fatigue detection system
To demonstrate the operation of the driving fatigue detection system more clearly and intuitively, we intercepted some videos in different situations for training and testing. As shown in Fig. 7, we demonstrate the system’s detection and determination of the alertness and fatigue states of several drivers in a variety of complex lighting environments.
Figure 7.
Display of driving fatigue detection results in complex lighting situations.
Figure 7a–c shows the normal daytime driving with sufficient light; Fig. 7d–f shows the strong light condition, which comes from nighttime lights and daytime sunlight; and Fig. 7g–i shows the low light condition when driving at night. Among them, the drivers in Fig. 7a,c,d,h are wearing eyeglasses, sunglasses or masks, which obscure part of the key detection area. It can be seen that the system is able to delineate the area of interest more accurately in both normal and multiple complex lighting situations, and it can obtain accurate results for the driver’s fatigue status.
Conclusion
To solve key problems of traditional fatigue detection methods in all-weather illumination environments, such as low accuracy and poor robustness, this study proposes a driving fatigue detection model based on a multimodal feature fusion module. First, multiple fatigue features are extracted through the MTCNN network and the IR-based rPPG method to improve the detection accuracy of the model under different light environments such as low and high light. Second, RBMs are used to mine the deep features of the two modalities, and the MLR module is then applied to enhance the deep and complementary feature expressions, remove redundant information, and fully fuse the features from multiple perspectives to reflect driving fatigue more completely and accurately. Finally, the time-series nature of fatigue is exploited by feeding the fused features into a Bi-LSTM network for detection. We conducted extensive experiments and comparisons with several state-of-the-art methods on three datasets, showing that the proposed model has a good detection effect, strong robustness, and good practicality. Further research can focus on the deeper integration of driving fatigue detection with artificial intelligence technology, promoting intelligent road traffic and, in turn, driving the smart-transportation revolution and industrial upgrading.
Acknowledgements
We gratefully acknowledge all the members who participated in this work.
Author contributions
Conceptualization, S.H., Q.G. and K.X.; methodology, S.H. and Q.G.; software, S.H. and Q.G.; writing-original draft preparation, S.H. and Q.G.; writing-review and editing, S.H. and Q.G.; visualization, S.H., Q.G. and J.H.; supervision, K.X.; investigation, W.Z. and C.W.; funding acquisition, J.H.; project administration, Z.H. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 62373372 and 62272485).
Data availability
The original datasets are available from55,56. Our code will not be made publicly available, because we may use it to build a commercial application. The authors of this paper acknowledge that informed consent has been obtained from participants and/or legal guardians for the publication of identifiable information or images that may be present in the dataset used and in the manuscript, which could lead to the identification of study participants, in an online open-access publication. Furthermore, we agree and commit to adhere to the relevant regulations regarding the publication of identifiable information/images.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Siyang Hu and Qihuang Gao.
References
- 1. Klauer, C. et al. The impact of driver inattention on near-crash/crash risk: An analysis using the 100-car naturalistic driving study data. (2006).
- 2. Sikander, G. & Anwar, S. Driver fatigue detection systems: A review. IEEE Trans. Intell. Transp. Syst. 20, 2339–2352 (2018). 10.1109/TITS.2018.2868499
- 3. Němcová, A. et al. Multimodal features for detection of driver stress and fatigue. IEEE Trans. Intell. Transp. Syst. 22, 3214–3233 (2020). 10.1109/TITS.2020.2977762
- 4. Kaida, K. et al. Validation of the Karolinska sleepiness scale against performance and EEG variables. Clin. Neurophysiol. 117, 1574–1581 (2006). 10.1016/j.clinph.2006.03.011
- 5. Hu, S. & Zheng, G. Driver drowsiness detection with eyelid related parameters by support vector machine. Expert Syst. Appl. 36, 7651–7658 (2009). 10.1016/j.eswa.2008.09.030
- 6. Li, Z., Li, S. E., Li, R., Cheng, B. & Shi, J. Online detection of driver fatigue using steering wheel angles for real driving conditions. Sensors 17, 495 (2017). 10.3390/s17030495
- 7. McDonald, A. D., Schwarz, C., Lee, J. D. & Brown, T. L. Real-time detection of drowsiness related lane departures using steering wheel angle. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 56, 2201–2205 (Sage Publications, 2012).
- 8. Chaudhuri, A. & Routray, A. Driver fatigue detection through chaotic entropy analysis of cortical sources obtained from scalp EEG signals. IEEE Trans. Intell. Transp. Syst. 21, 185–198 (2019). 10.1109/TITS.2018.2890332
- 9. Du, G., Wang, Z., Li, C. & Liu, P. X. A TSK-type convolutional recurrent fuzzy network for predicting driving fatigue. IEEE Trans. Fuzzy Syst. 29, 2100–2111 (2020). 10.1109/TFUZZ.2020.2992856
- 10. Zheng, W.-L. et al. Vigilance estimation using a wearable EOG device in real driving environment. IEEE Trans. Intell. Transp. Syst. 21, 170–184 (2019). 10.1109/TITS.2018.2889962
- 11. Kumar, V., Pham, H., Pandey, P. K. & Goel, A. Driving to safety: Real-time danger spot and drowsiness monitoring system. Soft Comput. 25, 14479–14497 (2021). 10.1007/s00500-021-06381-1
- 12. Yang, H., Liu, L., Min, W., Yang, X. & Xiong, X. Driver yawning detection based on subtle facial action recognition. IEEE Trans. Multimed. 23, 572–583 (2020). 10.1109/TMM.2020.2985536
- 13. Reddy, B., Kim, Y.-H., Yun, S., Seo, C. & Jang, J. Real-time driver drowsiness detection for embedded system using model compression of deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 121–128 (2017).
- 14. Wakita, T. et al. Driver identification using driving behavior signals. IEICE Trans. Inf. Syst. 89, 1188–1194 (2006). 10.1093/ietisy/e89-d.3.1188
- 15. Yu, J., Park, S., Lee, S. & Jeon, M. Driver drowsiness detection using condition-adaptive representation learning framework. IEEE Trans. Intell. Transp. Syst. 20, 4206–4218 (2018). 10.1109/TITS.2018.2883823
- 16. Qiu, X., Tian, F., Shi, Q., Zhao, Q. & Hu, B. Designing and application of wearable fatigue detection system based on multimodal physiological signals. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 716–722 (IEEE, 2020).
- 17. Abbas, Q. HybridFatigue: A real-time driver drowsiness detection using hybrid features and transfer learning. Int. J. Adv. Comput. Sci. Appl. 11 (2020).
- 18. Bakker, B. et al. A multi-stage, multi-feature machine learning approach to detect driver sleepiness in naturalistic road driving conditions. IEEE Trans. Intell. Transp. Syst. 23, 4791–4800 (2021). 10.1109/TITS.2021.3090272
- 19. Ahlström, C. et al. Real-time adaptation of driving time and rest periods in automated long-haul trucking: Development of a system based on biomathematical modelling, fatigue and relaxation monitoring. IEEE Trans. Intell. Transp. Syst. 23, 4758–4766 (2021). 10.1109/TITS.2021.3102519
- 20. WHO. Global Status Report on Road Safety 2018 (World Health Organization, 2018).
- 21. Shakeel, M. F., Bajwa, N. A., Anwaar, A. M., Sohail, A. & Khan, A. Detecting driver drowsiness in real time through deep learning based object detection. In International Work-Conference on Artificial Neural Networks, 283–296 (Springer, 2019).
- 22. Shih, T.-H. & Hsu, C.-T. MSTN: Multistage spatial-temporal network for driver drowsiness detection. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part III 13, 146–153 (Springer, 2017).
- 23. Huynh, X.-P., Park, S.-M. & Kim, Y.-G. Detection of driver drowsiness using 3D deep neural network and semi-supervised gradient boosting machine. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part III 13, 134–145 (Springer, 2017).
- 24. Yu, J., Park, S., Lee, S. & Jeon, M. Representation learning, scene understanding, and feature fusion for drowsiness detection. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part III 13, 165–177 (Springer, 2017).
- 25. Park, S., Pan, F., Kang, S. & Yoo, C. D. Driver drowsiness detection system based on feature representation learning using various deep networks. In Asian Conference on Computer Vision, 154–164 (Springer, 2016).
- 26. Dwivedi, K., Biswaranjan, K. & Sethi, A. Drowsy driver detection using representation learning. In 2014 IEEE International Advance Computing Conference (IACC), 995–999 (IEEE, 2014).
- 27. Sun, Y., Wang, X. & Tang, X. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3476–3483 (2013).
- 28. Zhang, K., Zhang, Z., Li, Z. & Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23, 1499–1503 (2016). 10.1109/LSP.2016.2603342
- 29. Kowalski, M., Naruniec, J. & Trzcinski, T. Deep alignment network: A convolutional neural network for robust face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 88–97 (2017).
- 30. Rouast, P. V., Adam, M. T., Chiong, R., Cornforth, D. & Lux, E. Remote heart rate measurement using low-cost RGB face video: A technical literature review. Front. Comput. Sci. 12, 858–872 (2018). 10.1007/s11704-016-6243-6
- 31. Wang, W., Stuijk, S. & De Haan, G. A novel algorithm for remote photoplethysmography: Spatial subspace rotation. IEEE Trans. Biomed. Eng. 63, 1974–1984 (2015). 10.1109/TBME.2015.2508602
- 32. Magdalena Nowara, E., Marks, T. K., Mansour, H. & Veeraraghavan, A. SparsePPG: Towards driver monitoring using camera-based vital signs estimation in near-infrared. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 1272–1281 (2018).
- 33. Mironenko, Y., Kalinin, K., Kopeliovich, M. & Petrushan, M. Remote photoplethysmography: Rarely considered factors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 296–297 (2020).
- 34. Lie, W.-N., Le, D.-Q., Lai, C.-Y. & Fang, Y.-S. Heart rate estimation from facial image sequences of a dual-modality RGB-NIR camera. Sensors 23, 6079 (2023). 10.3390/s23136079
- 35. Kong, L., Xie, K., Niu, K., He, J. & Zhang, W. Remote photoplethysmography and motion tracking convolutional neural network with bidirectional long short-term memory: Non-invasive fatigue detection method based on multi-modal fusion. Sensors 24 (2024).
- 36. Sun, W., Zhang, X., Peeta, S., He, X. & Li, Y. A real-time fatigue driving recognition method incorporating contextual features and two fusion levels. IEEE Trans. Intell. Transp. Syst. 18, 3408–3420 (2017). 10.1109/TITS.2017.2690914
- 37. Du, G., Li, T., Li, C., Liu, P. X. & Li, D. Vision-based fatigue driving recognition method integrating heart rate and facial features. IEEE Trans. Intell. Transp. Syst. 22, 3089–3100 (2020). 10.1109/TITS.2020.2979527
- 38. Huang, R., Wang, Y., Li, Z., Lei, Z. & Xu, Y. RF-DCM: Multi-granularity deep convolutional model based on feature recalibration and fusion for driver fatigue detection. IEEE Trans. Intell. Transp. Syst. 23, 630–640 (2020). 10.1109/TITS.2020.3017513
- 39. Dua, M., Shakshi, Singla, R., Raj, S. & Jangra, A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput. Appl. 33, 3155–3168 (2021).
- 40. Kır Savaş, B. & Becerikli, Y. Behavior-based driver fatigue detection system with deep belief network. Neural Comput. Appl. 34, 14053–14065 (2022).
- 41. Du, G. et al. A multimodal fusion fatigue driving detection method based on heart rate and PERCLOS. IEEE Trans. Intell. Transp. Syst. 23, 21810–21820 (2022). 10.1109/TITS.2022.3176973
- 42. Pandey, N. N. & Muppalaneni, N. B. A novel drowsiness detection model using composite features of head, eye, and facial expression. Neural Comput. Appl. 34, 13883–13893 (2022). 10.1007/s00521-022-07209-1
- 43. Tian, X., Qiao, D. & Wu, C. Color image enhancement based on bi-histogram equalization. J. Xian Univ. Posts Telecommun. 20, 58–63 (2015).
- 44. Fu, X. et al. A probabilistic method for image enhancement with simultaneous illumination and reflectance estimation. IEEE Trans. Image Process. 24, 4965–4977 (2015). 10.1109/TIP.2015.2474701
- 45. Li, M., Liu, J., Yang, W., Sun, X. & Guo, Z. Structure-revealing low-light image enhancement via robust retinex model. IEEE Trans. Image Process. 27, 2828–2841 (2018). 10.1109/TIP.2018.2810539
- 46. Lore, K. G., Akintayo, A. & Sarkar, S. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recogn. 61, 650–662 (2017). 10.1016/j.patcog.2016.06.008
- 47. Jiang, Y. et al. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 30, 2340–2349 (2021). 10.1109/TIP.2021.3051462
- 48. Sun, Y., Yan, P., Li, Z., Zou, J. & Hong, D. Driver fatigue detection system based on colored and infrared eye features fusion. Comput. Mater. Continua 63, 1563–1574 (2020). 10.32604/cmc.2020.09763
- 49. Nowara, E. M., Marks, T. K., Mansour, H. & Veeraraghavan, A. Near-infrared imaging photoplethysmography during driving. IEEE Trans. Intell. Transp. Syst. 23, 3589–3600 (2020). 10.1109/TITS.2020.3038317
- 50. Chang, Y.-L., Feng, Y.-C. & Chen, O. T.-C. Real-time physiological and facial monitoring for safe driving. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 4849–4852 (IEEE, 2016).
- 51. Wu, L. et al. R-Drop: Regularized dropout for neural networks. Adv. Neural Inf. Process. Syst. 34, 10890–10905 (2021).
- 52. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006). 10.1126/science.1127647
- 53. Ye, M., Wang, Z., Lan, X. & Yuen, P. C. Visible thermal person re-identification via dual-constrained top-ranking. IJCAI 1, 2 (2018).
- 54. Kumar BG, V., Carneiro, G. & Reid, I. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5385–5394 (2016).
- 55. Weng, C.-H., Lai, Y.-H. & Lai, S.-H. Driver drowsiness detection via a hierarchical temporal deep belief network. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20–24, 2016, Revised Selected Papers, Part III 13, 117–133 (Springer, 2017).
- 56. Ghoddoosian, R., Galib, M. & Athitsos, V. A realistic dataset and baseline temporal model for early drowsiness detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019).
- 57. Singh, A. et al. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15638–15650 (2022).
- 58. Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (Association for Computational Linguistics, Online, 2020).