2022 Dec 10;81:104487. doi: 10.1016/j.bspc.2022.104487

Contactless blood oxygen estimation from face videos: A multi-model fusion method based on deep learning

Min Hu a, Xia Wu a, Xiaohua Wang a, Yan Xing b, Ning An a,c, Piao Shi a
PMCID: PMC9735266  PMID: 36530216

Abstract

Blood oxygen saturation (SpO2), a key indicator of respiratory function, has received increasing attention during the COVID-19 pandemic. Clinical results show that patients with COVID-19 are likely to have distinctly lower SpO2 before the onset of significant symptoms. To address the shortcomings of current methods for monitoring SpO2 from face videos, this paper proposes a novel multi-model fusion method based on deep learning for SpO2 estimation. The method includes a feature extraction network named Residuals and Coordinate Attention (RCA) and a multi-model fusion SpO2 estimation module. The RCA network uses a residual block cascade and a coordinate attention mechanism to focus on the correlation between feature channels and the location information of the feature space. The multi-model fusion module includes the Color Channel Model (CCM) and the Network-Based Model (NBM). To fully use the color feature information in face videos, an image generator is constructed in the CCM to calculate SpO2 by reconstructing the red and blue channel signals. In addition, to reduce the disturbance of other physiological signals, a novel two-part loss function is designed in the NBM. Given the complementarity of the features and models that CCM and NBM focus on, a Multi-Model Fusion Model (MMFM) is constructed. The experimental results on the PURE and VIPL-HR datasets show that all three models meet the clinical requirement (a mean absolute error within 2%) and demonstrate that multi-model fusion can fully exploit the SpO2 features of face videos and improve SpO2 estimation performance. Our research achievements will facilitate applications in remote medicine and home health.

Keywords: Estimation, Remote photo-plethysmography, Deep learning, Residual network, Coordinate attention, Multi-model fusion

1. Introduction

Blood oxygen saturation (SpO2) reflects the function of the cardiopulmonary and respiratory systems and is an essential physiological indicator for clinical monitoring [1]. Many lung diseases can be detected through abnormal SpO2 values, such as those caused by the SARS coronavirus in 2003, the Middle East respiratory syndrome (MERS) coronavirus in 2012, and COVID-19, which is spreading uncontrollably in the current period. The literature [2] stated that patients with pulmonary infections did not feel shortness of breath in the early stages, although their SpO2 levels were already decreasing. Therefore, monitoring SpO2 is essential to avoid disease deterioration. In addition, regarding blood vessel detection, anomalies and blood glucose disease can be determined from the data collected by a blood glucose meter [3].

Traditional contact pulse SpO2 meters [4] are widely used in routine and critical scenarios, but they are difficult to use in particular scenarios: they may injure burn patients, and they give inaccurate SpO2 readings for patients with hand and foot tremors. In addition, pulmonary diseases are in general highly contagious, and incomplete disinfection of healthcare workers and healthcare tools may lead to the secondary transmission of pathogens. In these scenarios, contactless measurement of physiological parameters becomes an effective way to estimate SpO2. Several scholars have already studied contactless monitoring of vital signs [5], for example, Remote Photo-plethysmography (rPPG), Image Photo-plethysmography (iPPG), contactless pulse SpO2 measurement, and others. To the best of our knowledge, only a few articles have studied contactless SpO2 estimation based on videos. Among them, most of the rPPG-based SpO2 estimation methods use traditional signal processing techniques, such as filtering [6], signal interpolation [7], and the fast Fourier transform [8].

With the rapid growth of deep learning techniques, an increasing number of scholars have applied them to biomedical scenarios and daily health monitoring and achieved desirable results, such as detecting COVID-19 from chest X-ray images [9] and estimating physiological indicators [10]. In particular, deep learning techniques for heart rate estimation have matured. The application of deep learning to contactless SpO2 estimation, however, is still at an early stage of development; the literature [11] provides a systematic review. In 2021, J. Mathew et al. proposed a study combining deep learning with contactless SpO2 estimation from hand videos on their self-built dataset and met the clinical requirement [12]. Apart from the literature [12], no other literature uses deep learning techniques for non-contact SpO2 estimation tasks. Considering the maturity and convenience of facial estimation, and inspired by the rPPG-based heart rate estimation literature [13] and the contactless blood oxygen estimation literature [14], this paper proposes a contactless SpO2 estimation method that combines deep learning with face videos. Experimental results show that the SpO2 values estimated by this paper's method are comparable to those obtained by a commercial pulse SpO2 meter, and the SpO2 measurement error is within an acceptable range according to the literature [15].

The main contributions of the work in this paper are as follows.

1) To fully extract SpO2 features from face videos by deep learning, a feature extraction network named Residuals and Coordinate Attention (RCA) is constructed. It combines a residual block cascade with a coordinate attention mechanism. Thus, the RCA network not only learns the high-dimensional features of the image, but also focuses on the correlation between feature channels and the location information in the feature space. In this way, the RCA network has a stronger feature representation capability.

2) To use the classical color channel ratio algorithm to calculate SpO2, a SpO2 estimation model for face videos named Color Channel Model (CCM) is constructed, which is based on an image generator. CCM first uses residual blocks and transposed convolution to reconstruct the feature maps learned by the RCA network into RGB images. Then, CCM calculates SpO2 using the red/blue color channel ratio method.

3) To help the network learn SpO2 features from several aspects, a SpO2 estimation model named Network-Based Model (NBM) is constructed. The NBM is based on a network and a two-part loss function. It is then fused with the CCM into a multi-model fusion SpO2 estimation model named Multi-Model Fusion Model (MMFM), whose loss function takes full advantage of the features that CCM and NBM focus on as well as the complementarity between them. The CCM, NBM and MMFM are evaluated in SpO2 estimation experiments on the public datasets (PURE [16] and VIPL-HR [17], [18]) and achieve accurate SpO2 estimation.

2. Related work

SpO2 is the oxygen concentration in the blood and represents the ratio of oxygenated hemoglobin to total hemoglobin, with a normal range of 95 % to 100 %. The medical device commonly used to measure arterial SpO2 is the pulse SpO2 meter, which is based on the use of photo-plethysmography (PPG) to assess changes in blood volume in the human tissue microcirculation [19]. However, the pulse SpO2 meter has many limitations, such as possible injury to burn patients and inaccurate SpO2 monitoring in patients with hand and foot tremors [20]. In addition, there is a risk of virus infection when using contact oximeter to check the physiological health of patients with infectious diseases. So researchers proposed the remote photo-plethysmography (rPPG) which is a contactless method for estimating pulse waves generated by the heart through peripheral blood perfusion measurements, using a video camera to capture video of body parts, and a series of processing of the video to obtain the desired physiological information, such as heart rate [21], respiration rate [22] and heart rate variability [23].

In recent years, many scholars have started to use videos to investigate contactless measurements of SpO2, but most of these studies have used conventional signal processing equipment and methods. The literature [14] used a CMOS camera to record PPG signals for SpO2 estimation, with the light source alternating between orange light at 611 nm and near-infrared light at 880 nm. SpO2 was derived by analyzing the pulsatile (AC) and non-pulsatile (DC) components of the tracked PPG signals. The literature [24] proposed a method that irradiates iris tissue with two light wavelengths (630 nm and 940 nm) and uses the reflectance images to assess SpO2 levels. The literature [25] developed a contactless skin SpO2 imaging system that uses reflected images of superficial skin tissue to create a map of SpO2 distribution across the measurement area to assess SpO2, heart rate (HR) and blood flow velocity (BFV). The literature [26] used two orthogonal vectors in RGB color space to extract the PPG signal and used a denoising algorithm based on the dual-tree complex wavelet transform to reduce illumination and motion artifacts.

With the continuous development of deep learning techniques, a growing number of studies use them for video-based estimation of physiological metrics, such as heart rate and respiration rate, with excellent results. The literature [17] proposed an end-to-end convolutional attention network to detect blood volume pulses in face videos, and then performed frequency analysis of the detected pulse signal to track heart rate and respiration rate. The literature [27] designed the DeepPhys model, a two-stream approach built on a skin reflection model, in which the appearance module provides attention to guide the learning of the motion module and thus recovers a more robust rPPG signal. The literature [13] divided the input videos into segments and applied a time-domain segment subnet to extract spatial and temporal information. The literature [28] used deep learning to construct temporal signals and used Action Units (AUs) for signal denoising to improve HR estimation.

However, the research on deep learning algorithms for SpO2 estimation by videos is still in its early stages. The literature [29] proposed a convolutional neural network architecture for contact SpO2 monitoring from smartphone cameras. Although they performed better than traditional ratio methods, they still had the drawbacks associated with contact measurements. In July 2021, J. Mathew et al. proposed a study combining deep learning with contactless SpO2 estimation in hands [12]. Experiments were conducted to collect 14 volunteers' hand videos to estimate SpO2, containing both normal breathing and breath-holding states, with a MAE of around 2 %, which aligns with clinical indications. However, the dataset collected in this literature is not public, and no relevant experiments have been performed on public datasets.

Inspired by the above articles, this paper proposes a contactless SpO2 estimation method based on deep learning with multi-model fusion, and conducts extensive experiments on the public datasets (PURE and VIPL-HR). PURE and VIPL-HR contain face videos and physiological signal recordings in varied scenes, such as stillness, brightness, darkness, head movement and the post-exercise state. In the PURE dataset, the ground truth SpO2 values were captured using a finger clip pulse oximeter (pulox CMS50E). In the VIPL-HR dataset, the ground truth SpO2 values were captured using the CONTEC CMS60C BVP sensor. More than one hundred volunteers were involved in the dataset construction work. The experiments verify that the RCA network can adequately extract SpO2 features from face videos, and that the fusion model achieves better SpO2 estimation. They also confirm that deep learning techniques can estimate SpO2 effectively and can be applied to health screening and remote health assessment.

3. Material and methods

The proposed multi-model fusion SpO2 estimation model based on deep learning is shown in Fig. 1. It consists of two parts: the RCA network and the multi-model SpO2 estimation module. The RCA network mainly consists of the residual network module and the coordinate attention module, and is used to extract physiological features from the facial image sequences. The multi-model SpO2 estimation module includes the Color Channel Model, the Network-Based Model, and the Multi-Model Fusion Model. SpO2 is calculated by feeding the features extracted by the RCA network into the SpO2 estimation module.

Fig. 1. Structure of SpO2 estimation network model based on deep learning.

3.1. RCA network

This section introduces an RCA network that extracts feature signals carrying physiological information from face videos, which are then used to calculate SpO2. Fig. 1 shows the structure of this RCA network. In terms of logical structure, the RCA network is similar to a conventional deep CNN [30]. Both mainly consist of the following types of layers: input layer, convolutional layer, activation function layer, pooling layer and fully connected layer. The difference is that the RCA network also contains residual blocks and an attention component. Residual blocks alleviate the vanishing gradient problem of conventional deep CNNs by using skip connections, and they also protect the integrity of the information. In addition, attention allows more representative feature information to be extracted, which results in higher computational accuracy.

The input image sequence is denoted as V, and each image has a size of 3 × 224 × 224. First, V is fed into a convolution layer with a kernel size of 7 × 7, denoted Conv7×7, for downsampling. Then, batch normalization (BN) and the ReLU activation function (δ) are applied to accelerate the convergence of the network, and finally, global average pooling (GAP) outputs F:

$$F = \mathrm{GAP}\big(\delta(\mathrm{BN}(\mathrm{Conv}_{7\times7}(V)))\big) \tag{1}$$

3.1.1. Residual module

To extract the high-dimensional features from F, F is fed into the Residual Block (RB) cascade, and the features F¯ are obtained. The residual module in this paper is constructed based on ideas from the literature [31]. It alleviates the vanishing gradient problem by using skip connections. In addition, the residual block protects the integrity of the information by passing the input directly to the output through a bypass, which reduces information loss and improves the learning capability of the network. Furthermore, to improve network performance and obtain a larger receptive field that helps capture the features to be learned and attended to, the network depth is increased by cascading two residual blocks. The structure is shown in Fig. 2.

Fig. 2. Residual block cascade structure.

The formula for a single residual module is shown in Equation (2). δ is the ReLU activation function. Conv3×3 is a convolution layer with a kernel size of 3 × 3, and Conv1×1 is a convolution layer with a kernel size of 1 × 1. ⊕ represents the element-wise sum.

$$\mathrm{RB}(F) = \delta\big(\mathrm{Conv}_{3\times3}(\delta(\mathrm{Conv}_{3\times3}(F))) \oplus \mathrm{Conv}_{1\times1}(F)\big) \tag{2}$$

The output of the cascade of the two residual blocks is $\bar{F}$, as in Equation (3):

$$\bar{F} = \mathrm{RB}_2(\mathrm{RB}_1(F)) \tag{3}$$
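To make the skip connection of Equations (2) and (3) concrete, the following is a deliberately reduced sketch on a 1D signal in plain Python. It is not the paper's network: the 3 × 3 convolutions are replaced by size-3 1D convolutions with zero padding, the 1 × 1 convolution by a single scalar weight, and all kernel values are hypothetical.

```python
def relu(xs):
    """Element-wise ReLU (the δ of Equation (2))."""
    return [max(0.0, x) for x in xs]


def conv1d_3(xs, kernel):
    """Same-length 1D cross-correlation with a size-3 kernel and zero padding.

    Toy stand-in for a 3x3 convolution layer.
    """
    padded = [0.0] + list(xs) + [0.0]
    return [sum(k * padded[i + d] for d, k in enumerate(kernel))
            for i in range(len(xs))]


def residual_block(xs, k1, k2, w_skip):
    """RB(F) = relu(conv(relu(conv(F))) + skip(F)), cf. Equation (2)."""
    main = conv1d_3(relu(conv1d_3(xs, k1)), k2)   # main branch
    skip = [w_skip * x for x in xs]               # bypass (1x1 conv stand-in)
    return relu([m + s for m, s in zip(main, skip)])


def cascade(xs, blocks):
    """Apply residual blocks one after another, cf. Equation (3)."""
    for k1, k2, w_skip in blocks:
        xs = residual_block(xs, k1, k2, w_skip)
    return xs
```

With all-zero kernels the main branch contributes nothing and the input still reaches the output through the bypass, which is the property the text refers to as protecting the integrity of the information.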

3.1.2. Coordinate attention module

Large-amplitude head movements and changes in lighting conditions during the recovery of the rPPG signal cycle can interfere with signal estimation, resulting in excessive fluctuations in feature information changes, which can interfere with the signal periodicity [13]. However, the coordinate attention module [32] can focus on the relationship between feature channels and the location information in the feature space. Therefore, we add it after extracting the acquired high-dimensional features to enhance the feature representation.

Coordinate attention embeds location information into channel attention by decomposing channel attention into two parallel 1D feature encoding processes, i.e., pooling, convolution and activation operations applied to the feature information along the X (horizontal) and Y (vertical) directions respectively. In this way, spatial coordinate information is integrated into the generated attention map, which yields a coordinate-aware attention map. In this paper, the coordinate attention idea is embedded into the RCA network, and the network structure is shown in Fig. 1. First, the high-dimensional features $\bar{F} = [\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_C]$ are used as input. Each channel is encoded along the horizontal and vertical directions using pooling layers of size (H, 1) and (1, W) respectively, which enables the attention module to capture the precise position information of the features. Thus, the output of the c-th channel at height h can be formulated as Equation (4), where $\sum$ stands for summation.

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} \bar{f}_c(h, i) \tag{4}$$

Similarly, the output of the c-th channel at width w can be written as Equation (5):

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} \bar{f}_c(j, w) \tag{5}$$

$\bar{f}_c$ represents the c-th channel of $\bar{F}$. The above transformations perform feature aggregation along two spatial directions to obtain a pair of direction-aware feature maps $[z^h, z^w]$. They also enable the attention module to capture long-range dependencies along one spatial direction and preserve precise location information along the other, which helps the network locate the target of interest more accurately.
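The two directional pooling steps of Equations (4) and (5) amount to row means and column means of one channel. A minimal plain-Python sketch, assuming the channel is given as a nested list of H rows and W columns (the function name is illustrative):

```python
def directional_pool(channel):
    """Return (z_h, z_w) for one H x W channel.

    z_h[h] is the mean over the width  (Equation (4)).
    z_w[w] is the mean over the height (Equation (5)).
    """
    height = len(channel)
    width = len(channel[0])
    z_h = [sum(row) / width for row in channel]
    z_w = [sum(channel[j][w] for j in range(height)) / height
           for w in range(width)]
    return z_h, z_w


z_h, z_w = directional_pool([[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0]])
```

Each output keeps one spatial axis intact, so `z_h` still knows which row a response came from and `z_w` which column.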

Then, the results of the two transformations are concatenated and transformed using the 1×1 convolutional transform function $F_1$, as shown in Equation (6).

$$f_c = \delta\big(F_1([z_c^h, z_c^w])\big) \tag{6}$$

Here, δ is the ReLU activation function. $f_c$ is the intermediate feature mapping that encodes the spatial information of the c-th channel in the horizontal and vertical directions. $f_c$ is then split along the spatial dimension into two separate tensors $f_c^h \in \mathbb{R}^{C/r \times H}$ and $f_c^w \in \mathbb{R}^{C/r \times W}$, where r is the channel reduction ratio of coordinate attention [32]. $f_c^h$ and $f_c^w$ are transformed into tensors with the same number of channels as the input by two 1×1 convolution transforms $F_h$ and $F_w$ respectively, giving $g_c^h$ and $g_c^w$ as shown in Equation (7).

$$g_c^h = \sigma\big(F_h(f_c^h)\big), \qquad g_c^w = \sigma\big(F_w(f_c^w)\big) \tag{7}$$

Here, σ is the Sigmoid activation function. Subsequently, $g_c^h$ and $g_c^w$ are expanded using the function named expand_as so that both have the same size as the input feature $\bar{F}$ and can serve as its attention weights. The output of the coordinate attention module is $F = [f_1, f_2, \ldots, f_C]$, in which the component of the c-th channel of F is:

$$f_c(i, j) = \bar{f}_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{8}$$
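The gating of Equations (7) and (8) can be sketched for a single channel as follows. This is an illustration only: the learned 1×1 convolutions $F_h$ and $F_w$ are not modelled, so the pre-activation values `a_h` and `a_w` are passed in directly as hypothetical inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_coordinate_attention(channel, a_h, a_w):
    """Reweight one H x W channel with row and column gates.

    a_h: H pre-sigmoid values (stand-in for F_h(f^h)), one per row.
    a_w: W pre-sigmoid values (stand-in for F_w(f^w)), one per column.
    Implements out(i, j) = channel(i, j) * g_h(i) * g_w(j), cf. Equation (8).
    """
    g_h = [sigmoid(a) for a in a_h]   # Equation (7), height gate
    g_w = [sigmoid(a) for a in a_w]   # Equation (7), width gate
    return [[channel[i][j] * g_h[i] * g_w[j] for j in range(len(g_w))]
            for i in range(len(g_h))]
```

Because each gate lies in (0, 1), a position is kept only when both its row gate and its column gate are large, which is how the module encodes location in addition to channel importance.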

3.2. Multi-Model fusion for SpO2 estimation

Existing research shows that the video-based color channel ratio method has achieved promising results, e.g., the literature [26]. To combine the advantages of the Color Channel Model and the Network-Based Model, and to fully exploit both the features each model focuses on and the complementarity between them, a multi-model fusion SpO2 estimation module is constructed in this paper, as shown in Fig. 1. The Color Channel Model reconstructs RGB images from feature maps with the image generator, and then calculates the SpO2 value by AC/DC ratio analysis of the red and blue color channels. The Network-Based Model is a simple deep learning network model that estimates SpO2 directly from the feature maps output by the RCA network.

3.2.1. Color channel model

The structure of the Color Channel Model is shown in Fig. 1. The module consists of the image generator and the SpO2 Estimator1. In order to calculate SpO2 using the signals from the red and blue color channels, the features F learned by the RCA network are first reconstructed into the RGB feature map feat by the image generator, and then fed into the SpO2 Estimator1 to obtain SpO2CCM. In addition, the loss functions Lossimg and LossCCM_S are constructed for the feature map feat and the estimation value SpO2CCM respectively to improve the SpO2 estimation accuracy.

The feature F is first fed into the image generator to reconstruct the RGB image. The structure of the image generator is shown in Fig. 3. F is first fed into the residual block. The result is then passed to a transposed convolution with a kernel of size [3,3], named ConvT3×3, for upsampling. Next, normalization and non-linear activation are applied to obtain I. Finally, I is fed into a convolution with a kernel of size [7,7], named Conv7×7, to obtain the RGB feature map feat, whose size is 3 * 224 * 224.

Fig. 3. Structure of Image Generator.

Then, the reconstructed feat is fed into SpO2 Estimator1. Here, the red and blue channel signals are converted into cardiovascular pulse wave signals that stand in for the two wavelengths (660 nm and 940 nm) used in the pulse SpO2 meter [4], so that SpO2 can be estimated with the AC/DC ratio method. The conversion is given in Equation (9):

$$\mathrm{SpO2}_{\mathrm{CCM}} = A - B\,\frac{AC_{\mathrm{RED}}/DC_{\mathrm{RED}}}{AC_{\mathrm{BLUE}}/DC_{\mathrm{BLUE}}} \tag{9}$$

Here, $AC_{\mathrm{RED}}$ and $AC_{\mathrm{BLUE}}$ represent the standard deviation of the red and blue channels respectively. $DC_{\mathrm{RED}}$ and $DC_{\mathrm{BLUE}}$ represent the mean of the red and blue channels respectively. The coefficients are fixed to A = 125 and B = 26 based on the empirical evaluation [4].
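Equation (9) can be checked numerically with a few lines of plain Python. The sketch below assumes each channel has already been reduced to a 1D trace (for example, one mean intensity per frame, which is an assumption rather than something the text specifies) and uses the population standard deviation; the function name is illustrative.

```python
from statistics import mean, pstdev


def spo2_ratio_of_ratios(red, blue, a=125.0, b=26.0):
    """SpO2 = A - B * (AC_red / DC_red) / (AC_blue / DC_blue), cf. Equation (9).

    AC is taken as the standard deviation and DC as the mean of each trace.
    Both traces need a non-zero mean, and the blue trace must not be constant.
    """
    ac_red, dc_red = pstdev(red), mean(red)
    ac_blue, dc_blue = pstdev(blue), mean(blue)
    ratio = (ac_red / dc_red) / (ac_blue / dc_blue)
    return a - b * ratio


# Hypothetical traces with identical relative pulsation in both channels.
red_trace = [100.0, 102.0, 100.0, 98.0]
blue_trace = [50.0, 51.0, 50.0, 49.0]
estimate = spo2_ratio_of_ratios(red_trace, blue_trace)
```

When the relative pulsation (AC/DC) is equal in both channels the ratio is 1 and the estimate is A − B = 99; a relatively stronger red pulsation lowers the estimate.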

In this section, loss functions are designed for feat and SpO2CCM respectively. In order to learn more SpO2-related information and preserve the original image features as much as possible during the RCA network learning and feature map reconstruction, the L1 loss function is used to evaluate the loss between the reconstructed RGB feature map feat and the original image img, which gives Lossimg. H, W, and C represent the image height, width, and number of channels respectively.

$$\mathrm{Loss}_{\mathrm{img}}(img, feat) = \frac{1}{H \times W \times C} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \left| img_{i,j,c} - feat_{i,j,c} \right| \tag{10}$$
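As a small worked example of Equation (10), the reconstruction loss is the mean absolute difference over all pixels and channels. The sketch assumes both images are nested lists indexed as [row][column][channel]; the layout and function name are illustrative.

```python
def l1_image_loss(img, feat):
    """Mean absolute error over every (i, j, c) entry, cf. Equation (10)."""
    total = 0.0
    count = 0
    for row_img, row_feat in zip(img, feat):
        for px_img, px_feat in zip(row_img, row_feat):
            for v_img, v_feat in zip(px_img, px_feat):
                total += abs(v_img - v_feat)
                count += 1
    return total / count


# A 1 x 2 RGB image and its hypothetical reconstruction.
original = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
reconstructed = [[[0.0, 0.0, 0.0], [0.0, 1.0, 3.0]]]
loss = l1_image_loss(original, reconstructed)   # (1 + 2) / 6
```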

In addition, to guide the Color Channel Model towards more accurate SpO2 estimates, the Smooth L1 loss function is used to evaluate the loss between SpO2CCM and the ground truth SpO2gt, which gives LossCCM_S:

$$\mathrm{Loss}_{\mathrm{CCM\_S}}(x, y) = \frac{1}{L} \sum_{i=1}^{L} \begin{cases} 0.5\,(y_i - x_i)^2, & \text{if } |y_i - x_i| < 1 \\ |y_i - x_i| - 0.5, & \text{otherwise} \end{cases} \tag{11}$$

Here, L represents the length of the input video frame sequence, and x and y represent the estimated value SpO2CCM and the ground truth SpO2gt respectively. When the absolute value of the difference between x and y is less than 1, the L2 loss is used. When the difference is larger, a shifted L1 loss is used. The Smooth L1 loss function has a constant gradient when the difference between x and y is large, which avoids the problem of large gradients destroying the training parameters in the L2 loss. When the difference is small, the gradient decreases dynamically, which avoids the convergence difficulty of the L1 loss.
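The piecewise definition in Equation (11) is short enough to write out directly. The following plain-Python sketch averages the per-frame Smooth L1 terms; the sample values are hypothetical.

```python
def smooth_l1(estimates, targets):
    """Mean Smooth L1 loss, cf. Equation (11).

    Quadratic (0.5 * d^2) when |d| < 1, linear (|d| - 0.5) otherwise.
    """
    total = 0.0
    for x, y in zip(estimates, targets):
        d = abs(y - x)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(estimates)


# One small error (0.5) and one large error (3.0):
# 0.5 * 0.25 = 0.125 and 3.0 - 0.5 = 2.5, so the mean is 1.3125.
loss = smooth_l1([97.0, 95.0], [97.5, 98.0])
```

The two branches meet at |d| = 1 with the same value (0.5) and the same slope, which is why the loss is called smooth.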

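With $\lambda_{\mathrm{img}}$ and $\lambda_{\mathrm{CCM\_S}}$ as balancing parameters that adjust the importance of Lossimg and LossCCM_S, the total loss of the Color Channel Model for estimating the SpO2 value is denoted LossCCM:

$$\mathrm{Loss}_{\mathrm{CCM}} = \lambda_{\mathrm{img}} \times \mathrm{Loss}_{\mathrm{img}} + \lambda_{\mathrm{CCM\_S}} \times \mathrm{Loss}_{\mathrm{CCM\_S}} \tag{12}$$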

3.2.2. Network-based model

The structure of the Network-Based Model is shown in Fig. 1. The features F learned by the RCA network are fed into the SpO2 Estimator2 to calculate the SpO2 value SpO2NBM. A two-part loss function, consisting of Losslabel and LossNBM_S, is constructed between the model output and the SpO2 ground truth SpO2gt monitored by the pulse oximeter to optimize the model, so that the model can focus on more SpO2 information and improve the accuracy of SpO2 estimation.

The structure of SpO2 Estimator2 is shown in Fig. 4. First, global average pooling is applied to the high-dimensional features F generated by the RCA network in place of a fully connected operation, which reduces the number of network parameters and the risk of overfitting. The result is then fed into a convolution named Conv1×1 with a kernel of size [1,1] to compress the number of channels, increase the nonlinearity of the network and obtain a feature map with 100 channels. It is then activated by the SoftMax function to obtain $F^{pre}$. Since SpO2 can take values from 1 % to 100 %, the sequential vector $S = [1, 2, \ldots, 100]$ is constructed and multiplied with $F^{pre}$ to obtain the estimated value SpO2NBM.

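$$F^{pre} = \mathrm{SoftMax}\big(\mathrm{Conv}_{1\times1}(\mathrm{GAP}(F))\big) \tag{13}$$

$$\mathrm{SpO2}_{\mathrm{NBM}} = \sum_{i=1}^{100} S_i \times F_i^{pre} \tag{14}$$

Fig. 4. Structure of SpO2 Estimator2.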
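The read-out of Equations (13) and (14) is an expectation over the 100 SpO2 levels: the SoftMax output is treated as a probability for each level and multiplied with S = [1, ..., 100]. A minimal plain-Python sketch, in which the 100 input scores stand in for the Conv1×1 output (hypothetical values, not the trained network):

```python
import math


def softmax(scores):
    """Numerically stable SoftMax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_spo2(scores):
    """Sum_i S_i * F_pre_i with S = 1..100, cf. Equations (13) and (14)."""
    if len(scores) != 100:
        raise ValueError("expected one score per SpO2 level from 1 to 100")
    probs = softmax(scores)
    return sum(level * p for level, p in zip(range(1, 101), probs))


# All mass on level 97 gives an estimate of 97; a flat distribution gives 50.5.
peaked = [-1000.0] * 100
peaked[96] = 0.0
estimate = expected_spo2(peaked)
```

Because the output is a weighted average rather than an arg-max, the estimate can take fractional values between the integer levels.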

Face videos contain not only SpO2-related signals but also other physiological signals. To encourage the network to learn SpO2 signals rather than other signals, a two-part loss function is designed for $F^{pre}$ and SpO2NBM in this section. First, the SpO2 ground truth SpO2gt is processed by the label distribution learning technique [33] to obtain a label matrix l of the same size as $F^{pre}$. The Kullback-Leibler divergence [34] (KLD) loss Losslabel between $F^{pre}$ and l is constructed so that the real label l directs $F^{pre}$ to focus on more SpO2 information. The rule for calculating Losslabel is shown in Equation (15).

$$\mathrm{Loss}_{\mathrm{label}}(l \,\|\, F^{pre}) = \sum_{i=1}^{T} \sum_{j=1}^{100} \left[ l_{ij} \log l_{ij} - l_{ij} \log F_{ij}^{pre} \right] \tag{15}$$
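The label-distribution step and Equation (15) can be sketched for a single frame as follows. The text does not specify the shape of the label distribution, so a discretised Gaussian centred on the ground truth with a hypothetical width `sigma` is assumed here; the paper's actual choice may differ.

```python
import math


def gaussian_label(spo2_gt, sigma=1.0):
    """Soft label over the levels 1..100, peaked at the ground truth.

    Assumption: Gaussian shape with width sigma (not given in the text).
    """
    weights = [math.exp(-((level - spo2_gt) ** 2) / (2.0 * sigma ** 2))
               for level in range(1, 101)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_loss(label, pred, eps=1e-12):
    """Sum_j l_j * (log l_j - log pred_j) for one frame, cf. Equation (15).

    Entries with l_j == 0 contribute nothing; eps guards log(0).
    """
    return sum(l * (math.log(l + eps) - math.log(p + eps))
               for l, p in zip(label, pred) if l > 0.0)


label = gaussian_label(97.0)
flat_prediction = [0.01] * 100
loss = kl_loss(label, flat_prediction)   # positive: prediction ignores the label
```

For a batch of T frames, Equation (15) simply sums this per-frame term over the frames.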

In addition, to guide the Network-Based Model towards more accurate SpO2 estimates, the Smooth L1 loss function is used to evaluate the loss between SpO2NBM and the ground truth SpO2gt, which gives LossNBM_S. It follows the same rules as Equation (11).

$$\mathrm{Loss}_{\mathrm{NBM\_S}}(x, y) = \frac{1}{L} \sum_{i=1}^{L} \begin{cases} 0.5\,(y_i - x_i)^2, & \text{if } |y_i - x_i| < 1 \\ |y_i - x_i| - 0.5, & \text{otherwise} \end{cases} \tag{16}$$

Here, x and y represent the estimated value SpO2NBM and the ground truth SpO2gt respectively.

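With $\lambda_{\mathrm{label}}$ and $\lambda_{\mathrm{NBM\_S}}$ as balancing parameters that adjust the importance of Losslabel and LossNBM_S, the total loss of the Network-Based Model for estimating the SpO2 value is denoted LossNBM:

$$\mathrm{Loss}_{\mathrm{NBM}} = \lambda_{\mathrm{label}} \times \mathrm{Loss}_{\mathrm{label}} + \lambda_{\mathrm{NBM\_S}} \times \mathrm{Loss}_{\mathrm{NBM\_S}} \tag{17}$$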

3.2.3. Multi-Model fusion model

To help the network learn SpO2 features from several aspects and to improve the accuracy of SpO2 estimation, the Color Channel Model is combined with the Network-Based Model to construct a Multi-Model Fusion Model (MMFM). Using the Smooth L1 loss function, a new loss named LossMMFM is constructed between the estimation values SpO2CCM and SpO2NBM of the two models. This allows the two models to guide each other and makes full use of the features each focuses on and the complementarity between them.

$$\mathrm{Loss}_{\mathrm{MMFM}}(x, y) = \frac{1}{L} \sum_{i=1}^{L} \begin{cases} 0.5\,(y_i - x_i)^2, & \text{if } |y_i - x_i| < 1 \\ |y_i - x_i| - 0.5, & \text{otherwise} \end{cases} \tag{18}$$

Here, x and y represent SpO2CCM and SpO2NBM respectively. It follows the same calculation rules as Equation (11) and Equation (16). The importance of LossNBM, LossCCM and LossMMFM is adjusted by the balancing parameters $\lambda_{\mathrm{NBM}}$, $\lambda_{\mathrm{CCM}}$ and $\lambda_{\mathrm{MMFM}}$, respectively.

$$\mathrm{Loss}_{\mathrm{total}} = \lambda_{\mathrm{NBM}} \times \mathrm{Loss}_{\mathrm{NBM}} + \lambda_{\mathrm{CCM}} \times \mathrm{Loss}_{\mathrm{CCM}} + \lambda_{\mathrm{MMFM}} \times \mathrm{Loss}_{\mathrm{MMFM}}$$
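How the agreement term of Equation (18) enters the total loss can be illustrated with a short plain-Python sketch. The balancing weights are hypothetical placeholders, since their values are not given in the text, and the per-model losses are passed in as already computed numbers.

```python
def smooth_l1(xs, ys):
    """Mean Smooth L1 between two sequences (same rule as Equation (11))."""
    total = 0.0
    for x, y in zip(xs, ys):
        d = abs(y - x)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(xs)


def total_loss(loss_nbm, loss_ccm, spo2_ccm, spo2_nbm,
               lam_nbm=1.0, lam_ccm=1.0, lam_mmfm=1.0):
    """Weighted sum of the two model losses and their agreement loss.

    loss_mmfm compares the two models' estimates with each other
    (Equation (18)), not with the ground truth.
    """
    loss_mmfm = smooth_l1(spo2_ccm, spo2_nbm)
    return lam_nbm * loss_nbm + lam_ccm * loss_ccm + lam_mmfm * loss_mmfm


# The models agree on the first frame and differ by 2 on the second,
# so the agreement loss is (0 + 1.5) / 2 = 0.75.
loss = total_loss(1.0, 2.0, [97.0, 96.0], [97.0, 98.0],
                  lam_nbm=0.5, lam_ccm=0.25, lam_mmfm=2.0)
```

The agreement term is zero only when both models produce the same estimate on every frame, which is what pulls the two branches towards each other during training.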

4. Implementation processes

This section introduces the implementation processes of the remote multi-model fusion SpO2 estimation model based on deep learning proposed in this paper. The model is divided into three parts: facial video preprocessing, the RCA network model training and SpO2 estimation.

4.1. Video preprocessing

In this paper, the PURE and VIPL-HR datasets are chosen for experiments. They contain face video sequences together with timestamp text and SpO2 value text corresponding to the video frames. All video sequences are processed by face detection, localization (Fig. 5) and normalization. The specific steps are as follows.

Fig. 5. Facial cutting process.

Step 1. Face detection. Firstly, the images are input to a multi-task convolutional neural network (MTCNN) [35] for face detection and localization.

Step 2. Key landmark localization. During the key landmark localization process, we pre-downloaded an open-source file containing the 81 landmarks [36] of the face. Subsequently, we take the file as an argument and call the method named shape_predictor of the Dlib library to achieve the localization of the key landmarks.

Step 3. The selection of regions of interest (ROI). The facial landmarks obtained in the previous stage are processed to locate the ROI: the maximum and minimum values of the 81 points along the x and y axes are combined to form the four corner coordinates of the ROI, which is then cropped.

Step 4. The facial image sequence normalization. All the facial images in this paper are normalized to 224 × 224 pixel RGB images.
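The ROI selection of Step 3 can be sketched in plain Python as follows. Landmark detection itself (MTCNN and Dlib) is not reproduced; the landmarks are assumed to be available as integer (x, y) pixel pairs, the image as a nested list of rows, and clamping the box to the image bounds is an added safeguard that the text does not mention.

```python
def roi_from_landmarks(landmarks, width, height):
    """Bounding box (left, top, right, bottom) spanned by the landmarks.

    The min/max of the x and y coordinates give the four ROI corners.
    The box is clamped to the image so that the crop is always valid.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    left = max(0, min(xs))
    top = max(0, min(ys))
    right = min(width - 1, max(xs))
    bottom = min(height - 1, max(ys))
    return left, top, right, bottom


def crop(image, box):
    """Cut the inclusive box out of an image stored as rows of pixels."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 6 x 6 image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(6)]
box = roi_from_landmarks([(2, 1), (4, 3), (3, 5)], width=6, height=6)
face = crop(image, box)
```

After cropping, Step 4 resizes every face image to 224 × 224 pixels; that resizing is left out of the sketch.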

4.2. RCA network model training

The processed facial image sequence $V \in \mathbb{R}^{C \times L \times H \times W}$ is fed into the RCA network for feature extraction, where C, L, H and W represent the number of channels, length, height and width of the input video respectively.

V is firstly subjected to 7×7 convolutional transform and pooling, and then high-dimensional features with strong representation ability are extracted by residual block cascade and coordinate attention module. The image information is learned and SpO2 values are estimated from them by three SpO2 estimation models respectively. The model convergence speed is accelerated by adjusting the optimizer, learning rate, loss function, and other hyperparameters to obtain the best results.

4.3. SpO2 estimation

The high-dimensional features are extracted from the RCA network, and the SpO2 values are estimated by Color Channel Model, Network-Based Model, and Multi-Model Fusion Model respectively. It should be mentioned that the estimated values of both models and the corresponding errors are obtained in the fusion estimation model. The estimated value with the smallest error is taken as the result of the fusion estimation model. The effect is compared with the other two models to select the best model.

In addition, to avoid frame redundancy, the frame number is selected according to the setting of batch_size. In other words, the median of SpO2 ground truth in T frames is used as SpO2gt. The estimated values are obtained by averaging the T frames of SpO2 values output from the three models. It also reduces the error caused by single-frame computation without considering time characteristics and noise.
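The window-level bookkeeping described above (median of the reference values, mean of the per-frame estimates, and keeping whichever model estimate is closer to the reference in the fusion model) can be sketched as follows. Function names and sample values are illustrative, and the selection step as described requires the reference value to be available.

```python
from statistics import mean, median


def window_target(gt_per_frame):
    """Reference SpO2 for a window of T frames: the median ground truth."""
    return median(gt_per_frame)


def window_estimate(estimates_per_frame):
    """Model output for a window of T frames: the mean per-frame estimate."""
    return mean(estimates_per_frame)


def fused_estimate(ccm_estimate, nbm_estimate, target):
    """Keep the estimate with the smaller absolute error to the target."""
    return min((ccm_estimate, nbm_estimate), key=lambda e: abs(e - target))


target = window_target([97, 97, 98, 96, 97])          # median of the window
ccm = window_estimate([95.0, 96.0])                   # mean of CCM frames
nbm = window_estimate([97.2, 97.6])                   # mean of NBM frames
fused = fused_estimate(ccm, nbm, target)              # closer of the two
```

Using the median for the reference makes the window label robust to a single outlying oximeter reading, while averaging the estimates smooths frame-level noise.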

5. Experimental results

5.1. Experimental datasets

Most of the datasets used in the existing SpO2 research literature were recorded by the authors themselves and are not public, and most of them contain few subjects and scenes. Our experiments are conducted on the public datasets PURE [16] and VIPL-HR [17], [18]. Both contain sufficient study subjects and varied scenarios to verify the robustness of the model. In addition, using public datasets facilitates method comparisons in subsequent studies. In the experiments, each dataset is split 6:4 into training and test sets, and the information about them is shown in Table 1.

Table 1. Information of PURE and VIPL-HR.

| Dataset | Num (subjects) | Scenes | Acquisition equipment | Frame rate |
| PURE | 10 | Steady; Speak; Slow translate; Fast translate; Small rotate; Medium rotate | eco274CVGE Camera | 30 fps |
| VIPL-HR | 107 | Steady; Head-move; Speak; Dark; Bright; Remote; After skipping | RealSense F200 color camera | 30 fps |

Fig. 6 (a) and (b) show histograms of the distribution of SpO2 values collected from the PURE and VIPL-HR datasets respectively. The range of SpO2 collected from the PURE dataset is 89–99 and the range from VIPL-HR is 86–99.

Fig. 6. Histogram of the distribution of SpO2 values.

5.2. Experimental settings

In this paper, the MTCNN [35] model is used to detect and locate the face region in the original video, the ROI is selected and cropped, and the face images are normalized to 224 × 224 pixels. We conduct experiments on the PURE dataset using the CCM model as a benchmark. According to Table 1, the PURE dataset is small. A small dataset is more likely to show unstable performance during training when parameters change, so experiments on the PURE dataset are more representative. Table 2 shows the experimental results for different image sizes. In all tables, the bold values indicate minimal errors.

Table 2.

Comparative experimentation of image sizes.

| Image size | PURE MAE (%) | PURE RMSE (%) |
| --- | --- | --- |
| 3 * 32 * 32 | 1.02 | 1.48 |
| 3 * 112 * 112 | 0.83 | 1.07 |
| 3 * 196 * 196 | 0.72 | 0.97 |
| 3 * 224 * 224 | 0.65 | 0.85 |
| 3 * 304 * 304 | 0.76 | 0.99 |
| 3 * 352 * 352 | 0.78 | 1.05 |
| 3 * 448 * 448 | 0.88 | 1.22 |

According to Table 2, as the pixel size increases, the images contain more texture and contextual information, which makes it easier to extract discriminative features. However, beyond a certain size the model complexity becomes too large, which tends to cause overfitting, so the performance decreases instead. The computational overhead also grows accordingly. Based on these experiments, the input image size is set to 3 * 224 * 224, which gives the lowest MAE and RMSE in Table 2.

Then, the image sequences are fed into the RCA network to extract features, and SpO2 values are calculated using each of the three methods. To select the optimizer and learning rate, comparative experiments are carried out: four optimizers (RMSProp, Adagrad, Adam, AdamW) are combined with different learning rates for model training and testing. Fig. 7 shows the results for the different learning rates and optimizers.

Fig. 7.

Comparative experimentation of different learning rates and optimizers.

From Fig. 7, when the RMSProp optimizer is used with a learning rate of 0.02 or 0.05, the model has difficulty converging. The same happens when the Adam or AdamW optimizer is used with a learning rate of 0.05. Overall, a learning rate of 0.01 is optimal. During training, the Adam optimizer, which adapts the step size of each parameter, is used to achieve the best results.

Overall, the experimental setup is determined through these comparative experiments. The Losstotal in Section 3.2.3 is used as the loss function. The Adam optimizer with a weight decay of 1e−4 is employed, and the learning rate is set to 0.01. The ReduceLROnPlateau scheduler with a patience of 20 and a factor of 0.1 is employed. The maximal epoch number and the early stopping counter are set to 100 and 20, respectively. The batch_size is set to 50 frames on the PURE dataset. Extensive experiments verify that this choice of parameters performs best among the settings tested. The converged model is then transferred to the VIPL-HR dataset, where some hyperparameters are fine-tuned. All network models are implemented in PyTorch, and the experiments are run on an NVIDIA GTX 1080 Ti graphics card.
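To make the stopping rule concrete, the following is a minimal sketch of early stopping with the two limits quoted above (100 epochs, counter of 20). The paper does not specify which quantity is monitored or how an improvement is defined. The sketch assumes a validation loss where any strict decrease counts as an improvement, and `val_losses` is a hypothetical stand-in for the values a real training loop would produce.

```python
MAX_EPOCHS = 100          # maximal epoch number (from the text)
EARLY_STOP_PATIENCE = 20  # early stopping counter (from the text)

def run_with_early_stopping(val_losses, max_epochs=MAX_EPOCHS,
                            patience=EARLY_STOP_PATIENCE):
    """Return (stop_epoch, best_epoch, best_loss); epochs are counted from 1.

    `val_losses` stands in for the per-epoch validation loss of a real
    training loop (hypothetical input, not data from the paper).
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epoch, best_epoch, best_loss

# Synthetic curve: improves for 30 epochs, then plateaus at a worse value.
losses = [1.0 / e for e in range(1, 31)] + [0.05] * 70
stop_epoch, best_epoch, best_loss = run_with_early_stopping(losses)
print(stop_epoch, best_epoch)  # 50 30
```

In a PyTorch setup such as the one described, the optimizer and scheduler would correspond to `torch.optim.Adam` and `torch.optim.lr_scheduler.ReduceLROnPlateau`. The loop above only illustrates the counter logic and is not the authors' training code.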

5.3. Evaluation metrics

To verify the validity of the model, the mean absolute error (MAE) and the root mean square error (RMSE) are used to report the experimental results. They are calculated as shown in Equations (20) and (21):

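$$\mathrm{MAE} = \frac{\sum_{i=1}^{L} \left| \mathrm{SpO2}_{pre}(i) - \mathrm{SpO2}_{gt}(i) \right|}{L} \tag{20}$$
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{L} \left( \mathrm{SpO2}_{pre}(i) - \mathrm{SpO2}_{gt}(i) \right)^{2}}{L}} \tag{21}$$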

Here, L represents the length of the input video in frames, SpO2pre represents the estimated value of SpO2, and SpO2gt represents the SpO2 ground truth.
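The two metrics can be written in a few lines of plain Python. This is a minimal sketch of Equations (20) and (21); the sample values are invented for illustration and are not taken from either dataset.

```python
import math

def mae(pred, gt):
    """Mean absolute error between estimated and ground-truth SpO2 (in %)."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root mean square error between estimated and ground-truth SpO2 (in %)."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

# Illustrative values only.
spo2_pre = [97.0, 96.0, 98.0, 95.0]
spo2_gt = [98.0, 96.0, 97.0, 97.0]
print(mae(spo2_pre, spo2_gt))   # 1.0
print(rmse(spo2_pre, spo2_gt))  # sqrt(1.5)
```

Because RMSE squares each error before averaging, a single large miss raises it more than it raises MAE, which is why both are reported.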

5.4. Experiment and discussion

Because none of the studies on contactless SpO2 estimation has published its datasets or algorithms, this paper uses a clinical indicator to measure the model's validity: if the MAE between the estimated SpO2 and the SpO2 ground truth is within 2 %, the estimate is considered valid and the model reliable [15].

5.4.1. Max-min coordinate face image cropping

In this paper, the maximum and minimum values of the x and y coordinates of the 81 facial key points are combined into the four corner coordinates of the ROI. Cropping the face with these coordinates preserves the most effective area of the face and reduces the effect of partial occlusion. To verify this, we compare full-face cropping (all_face) with maximum-minimum boundary cropping (m_face), using the Color Channel Model for verification. The results are shown in Table 3. SpO2 values are given in %, so MAE and RMSE are also in %; this is marked in all of the experimental tables.

Table 3.

Comparative experimentation of facial cutting methods.

| Crop type | PURE MAE (%) | PURE RMSE (%) | VIPL-HR MAE (%) | VIPL-HR RMSE (%) |
| --- | --- | --- | --- | --- |
| all_face | 0.86 | 1.18 | 1.15 | 1.61 |
| m_face | 0.65 | 0.85 | 1.04 | 1.48 |
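The maximum-minimum cropping described in this section amounts to taking the bounding box of the landmark set. The sketch below assumes integer pixel coordinates and inclusive bounds, and stores a frame as a list of pixel rows. The function names and the three toy landmarks are illustrative, whereas the paper uses 81 key points.

```python
def max_min_roi(landmarks):
    """Return (x_min, y_min, x_max, y_max) enclosing all (x, y) landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return min(xs), min(ys), max(xs), max(ys)

def crop(frame, roi):
    """Crop a frame stored as a list of pixel rows (frame[y][x]); bounds inclusive."""
    x_min, y_min, x_max, y_max = roi
    return [row[x_min:x_max + 1] for row in frame[y_min:y_max + 1]]

# Toy example: a 6-row by 8-column frame whose pixels store their own (x, y).
frame = [[(x, y) for x in range(8)] for y in range(6)]
landmarks = [(2, 1), (5, 4), (3, 2)]   # stand-in for the 81 key points
roi = max_min_roi(landmarks)
face = crop(frame, roi)
print(roi)                      # (2, 1, 5, 4)
print(len(face), len(face[0]))  # 4 4
```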

With the proposed cropping method, the SpO2 values are estimated for the PURE and VIPL-HR datasets. Fig. 8 and Fig. 9 compare the estimated SpO2 values with the reference values for the two datasets, respectively. The horizontal axis in both figures is labeled “Batch”: it is the index n of the batch of image sequences for which the SpO2 value is calculated. To reflect the true human SpO2 status closely, the estimated SpO2 values are retained to two decimal places, which gives rise to the small fluctuations visible in Fig. 8 and Fig. 9.

Fig. 8.

Experimental effect of PURE dataset.

Fig. 9.

Experimental effect of VIPL-HR dataset.

It is of interest to note the abrupt changes in Fig. 8 and Fig. 9 that the model failed to detect, such as the point around 80 in Fig. 8. Analysing the original datasets reveals the reason. During dataset construction, recording was interrupted between different scenes of the same subject. The subject's SpO2 status may change during such an interval, so the recorded values can change abruptly at scene boundaries. Because each dataset is fed into the model continuously as a whole for training and testing, the model does not immediately capture these sudden changes caused by scene switching, although such cases are relatively rare. In addition, the undetected abrupt changes are within 1 % and the SpO2 status quickly returns to its previous level. Since the permissible error in practice is 2 %, the estimation error caused by switching subjects and scenes is still acceptable. Regardless of the cause, when the magnitude of a change exceeds 1 %, our model is still able to capture it relatively quickly and estimate close SpO2 values, for example the drop around point 120 in Fig. 8 and the rise around point 160 in Fig. 9. This indicates that our model is effective and has application value.

Besides, since the normal range of SpO2 in the human body is 95 %–100 %, values below 95 % are considered low. The proportion of low SpO2 values is relatively small. The PURE dataset records SpO2 once per frame at a frame rate of 30 fps, and 4.9 % of its values are low. The VIPL-HR dataset records SpO2 once per second, and 5.8 % of its values are low. In clinical scenarios, however, low SpO2 samples need more attention. Therefore, in addition to the overall estimation errors for PURE and VIPL-HR, the low SpO2 samples (<95 %) are extracted and their errors are calculated against the values estimated by the Color Channel Model. The MAE is 1.28 % on PURE and 1.57 % on VIPL-HR. Even in the low SpO2 scenarios, the MAEs remain below 2 % and meet the clinical target, which demonstrates that our model is also valid for low SpO2 situations.
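The low-SpO2 evaluation can be sketched as a filter followed by the MAE. The sketch assumes that samples are selected by their ground-truth value (the text does not say whether the estimate or the reference is thresholded), and the numbers are invented for illustration.

```python
LOW_SPO2_THRESHOLD = 95.0  # values below 95 % are treated as low (from the text)

def low_spo2_report(pred, gt, threshold=LOW_SPO2_THRESHOLD):
    """Return (share_of_low_samples, mae_on_low_samples).

    Samples are selected by their ground-truth SpO2 (assumption).
    """
    low = [(p, g) for p, g in zip(pred, gt) if g < threshold]
    share = len(low) / len(gt)
    mae_low = sum(abs(p - g) for p, g in low) / len(low) if low else float("nan")
    return share, mae_low

# Illustrative values only.
gt = [98.0, 97.0, 94.0, 96.0, 93.0]
pred = [97.5, 97.0, 95.0, 96.5, 94.5]
share, mae_low = low_spo2_report(pred, gt)
print(share, mae_low)  # 0.4 1.25
```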

5.4.2. RCA network

In this section, ablation experiments are conducted on the RCA network and on each of its modules and parameters, and their effectiveness is verified on the PURE and VIPL-HR datasets using the Color Channel Model.

First, the batch_size setting affects the experimental results and their reasonableness. Each batch yields one SpO2 estimate, which is compared with the average of the SpO2 ground truth over the batch_size frames of that batch; the overall MAE and RMSE on the PURE and VIPL-HR test sets are calculated from these pairs. Owing to hardware limitations, too large a batch_size leaves insufficient memory for training. Too small a batch_size means that too little data is trained in a single batch, which may lead to large errors and unstable results and makes it difficult to extract global features. Therefore, we choose multiples of 10 in a suitable range for the ablation experiments. Besides, due to the experimental setup, the minimum value of batch_size is 15. Fig. 10 shows the comparison experiments for different batch_size values.

Fig. 10.

Plot of batch_size comparison experiments.

From Table 1 and Fig. 10, the PURE dataset is small and is significantly affected by batch_size. Training with a small, randomly selected batch introduces some uncontrollable factors, which cause fluctuations in the curve. Nevertheless, as batch_size increases, the results on PURE follow the expected trend, consistent with training on the larger dataset. After several experimental comparisons, we choose a batch_size of 50. At the datasets' frame rate of 30 fps, this corresponds to one estimate and error calculation roughly every 1.67 s, i.e. less than 2 s, which can be considered reasonable. Subsequent experiments use a batch size of 50 frames.
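The batch-wise evaluation can be sketched as follows: the per-frame ground truth is averaged over each batch of 50 frames, giving one reference value per estimate. The sketch assumes a per-frame ground truth (as in PURE) and drops an incomplete final batch. Both choices are assumptions, and the frame values are invented.

```python
FPS = 30          # frame rate of both datasets (Table 1)
BATCH_SIZE = 50   # frames per batch (from the text)

def batch_ground_truth(frame_gt, batch_size=BATCH_SIZE):
    """Average per-frame ground truth over consecutive full batches."""
    n_batches = len(frame_gt) // batch_size   # an incomplete last batch is dropped
    return [
        sum(frame_gt[b * batch_size:(b + 1) * batch_size]) / batch_size
        for b in range(n_batches)
    ]

# 120 frames: 60 frames at 98 %, then 60 frames at 96 % (illustrative values).
frame_gt = [98.0] * 60 + [96.0] * 60
batch_gt = batch_ground_truth(frame_gt)
print(len(batch_gt))               # 2
print(round(BATCH_SIZE / FPS, 2))  # 1.67 seconds of video per estimate
```

The second batch straddles the change from 98 % to 96 %, so its reference value is a blend of the two. This is one way an abrupt change can appear smoothed in batch-level plots.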

Afterward, to verify the effectiveness of the RCA network and its internal modules for extracting SpO2 features, the RCA network is replaced by the feature extraction modules of classical deep neural networks (ResNet-50 [31] and Inception v3 [37]). FLOPs and Params are used to measure the time complexity and memory complexity of the model, respectively. FLOPs refers to the number of floating point operations, i.e. the amount of computation, and can be used to measure the complexity of an algorithm or a model. Params refers to the number of parameters, which reflects the amount of memory used. Then, ablation experiments on the residual blocks (RB) and the coordinate attention (CA) in the RCA network are conducted. All of these experiments are based on the Color Channel Model. According to Table 4, the RCA network outperforms the other networks in feature extraction capability. In addition, the RCA network has lower time and space complexity, which means that it is more efficient and uses less memory, so the proposed model places low demands on the capture equipment and computing environment. Both RB and CA effectively improve the accuracy of SpO2 estimation and reduce MAE and RMSE: RB reduces information loss and improves the learning capability, while CA focuses on the feature channels and location information and enhances the feature representation.

Table 4.

Ablation experiments with RCA network.

| Model | FLOPs (GMac) | Params (M) | PURE MAE (%) | PURE RMSE (%) | VIPL-HR MAE (%) | VIPL-HR RMSE (%) |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 [31] | 5.12 | 24.05 | 1.02 | 1.49 | 1.50 | 1.77 |
| Inception v3 [37] | 3.67 | 22.68 | 1.04 | 1.47 | 1.14 | 1.51 |
| RCA w/o CA | 2.048 | 1.077 | 0.74 | 1.05 | 1.11 | 1.50 |
| RCA w/o RB | 0.94 | 0.11 | 1.02 | 1.44 | 1.15 | 1.53 |
| RCA | 2.05 | 1.08 | 0.65 | 0.85 | 1.04 | 1.48 |

Next, an experiment is conducted to select the number of cascaded residual blocks. According to Table 5, the best results are obtained when two residual blocks are cascaded, which extracts the most effective high-dimensional features. Besides, using two residual blocks does not unduly increase FLOPs and Params.

Table 5.

Cascade number selection experiment for cascade residual blocks.

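| RB num | FLOPs (GMac) | Params (M) | PURE MAE (%) | PURE RMSE (%) | VIPL-HR MAE (%) | VIPL-HR RMSE (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.4 | 0.26 | 0.79 | 1.02 | 1.12 | 1.54 |
| 2 | 2.05 | 1.08 | 0.65 | 0.85 | 1.04 | 1.48 |
| 3 | 2.69 | 4.36 | 0.89 | 1.15 | 1.10 | 1.50 |
| 4 | 3.33 | 17.47 | 1.02 | 1.44 | 1.15 | 1.52 |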

After determining the experimental setup and the structure of the RCA model, visualisations were used to show the facial regions that the RCA model learned and focused on.

First, Fig. 11 shows a set of face images with eyeglasses and hair coverings, together with their attention maps. The RCA network focuses more on the cheek and forehead regions, while paying less attention to the eyes, eyeglasses, hair, lips and chin.

Fig. 11.

Visualisation of scenes with eyeglasses.

In addition, Fig. 12 shows visualisations for scenes with facial rotation, movement and speaking. The RCA network also reduces its focus on teeth and beards. In non-stationary scenes, it still accurately focuses on the cheek and forehead regions.

Fig. 12.

Visualisation of non-stationary scenes.

In general, SpO2 needs to be estimated from bare skin. Not focusing on disruptions such as beards, teeth, hair, eyes and eyeglasses helps to improve the accuracy of the model in detecting SpO2.

Besides, to explore the effectiveness of deep learning in greater depth, a SpO2 estimation experiment with traditional color channel separation is designed for comparison. The red and blue channels of the original image sequence are separated and fed into the RCA network separately to extract the signal features. The standard deviation and mean of each channel are then calculated and substituted into Equation (9) to obtain the SpO2 values. Table 6 shows the experimental results, which demonstrate the effectiveness of deep learning in terms of SpO2 estimation accuracy.

Table 6.

Red and blue color channel method selection experiment.

| Channel method | PURE MAE (%) | PURE RMSE (%) | VIPL-HR MAE (%) | VIPL-HR RMSE (%) |
| --- | --- | --- | --- | --- |
| Traditional color model | 0.92 | 1.17 | 1.18 | 1.60 |
| CCM | 0.65 | 0.85 | 1.04 | 1.48 |
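For readers unfamiliar with the traditional color channel calculation, the sketch below shows how a standard deviation and a mean per channel can be turned into an SpO2 value. Equation (9) is defined earlier in the paper and is not reproduced in this section, so the sketch assumes the commonly used ratio-of-ratios form, SpO2 = A − B · (σ_red/μ_red)/(σ_blue/μ_blue). The constants `A` and `B`, the use of the population standard deviation and the signal values are all placeholders, not the authors' calibration.

```python
from statistics import mean, pstdev

A, B = 100.0, 5.0  # hypothetical calibration constants, not taken from the paper

def ratio_of_ratios_spo2(red, blue, a=A, b=B):
    """Estimate SpO2 (%) from red and blue channel signals.

    The standard deviation stands in for the pulsatile (AC) component and
    the mean for the steady (DC) component of each channel.
    """
    r = (pstdev(red) / mean(red)) / (pstdev(blue) / mean(blue))
    return a - b * r

# Illustrative channel signals (e.g. mean ROI intensity per frame).
red = [100.0, 102.0, 100.0, 98.0]
blue = [50.0, 52.0, 50.0, 48.0]
print(round(ratio_of_ratios_spo2(red, blue), 2))  # 97.5
```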

5.4.3. Comparative study of three SpO2 estimation models

After the experiments in the previous section, the RCA network structure is determined. The three SpO2 estimation models are then trained and tested separately, and the results are shown in Table 7.

Table 7.

SpO2 estimation model comparison experiment.

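| Estimation model | FLOPs (GMac) | Params (M) | PURE MAE (%) | PURE RMSE (%) | VIPL-HR MAE (%) | VIPL-HR RMSE (%) |
| --- | --- | --- | --- | --- | --- | --- |
| CCM | 2.05 | 1.08 | 0.65 | 0.85 | 1.04 | 1.48 |
| NBM | 1.00 | 0.70 | 0.86 | 1.04 | 1.10 | 1.50 |
| MMFM | 3.05 | 1.77 | 0.63 | 0.80 | 1.00 | 1.43 |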

As can be seen from the results, the MMFM has the lowest SpO2 estimation error, the conventional CCM is in the middle, and the NBM has the highest error, but all of them are within the 2 % clinical target. The MMFM effectively combines the advantages of the NBM and the CCM, exploiting the complementarity between the two models to achieve better estimation results. The complexity of the MMFM is the sum of the complexities of the CCM and the NBM.

Besides, to evaluate real-time performance and latency, timing experiments are carried out. Specifically, we measure the time required to load the pre-trained model and the time required to load one batch of 50 frames and feed it into the model to estimate SpO2. Separate experiments are conducted on the PURE and VIPL-HR datasets. Since the loading and estimation times are similar on both, their averages are reported. For the Color Channel Model, the loading time is about 1.1 s and the estimation time is about 0.5 s. For the Network-Based Model, the loading time is about 0.4 s and the estimation time is about 0.5 s. For the Multi-Model Fusion Model, the loading time is about 1.6 s and the estimation time is about 0.5 s. Overall, the CCM, NBM and MMFM all take around 0.5 s to estimate SpO2 once the pre-trained models are loaded, a delay that is acceptable in terms of timeliness.

In addition, ablation experiments on the loss functions are conducted to verify the effectiveness of Lossimg and Losslabel for CCM, NBM and MMFM respectively. The experimental results are shown in Table 8 , which can confirm that the reduced image loss and KLD loss help to improve the model estimation accuracy and reduce its error.

Table 8.

Loss function ablation experiments.

Lossimg Losslabel PURE
VIPL-HR
MAE (%) RMSE (%) MAE (%) RMSE (%)
CCM 0.81 1.05 1.15 1.55
0.65 0.85 1.04 1.48
NBM 0.89 1.13 1.19 1.56
0.86 1.04 1.10 1.50
MMFM 0.87 1.06 1.18 1.59
0.77 1.01 1.08 1.46
0.82 0.99 1.14 1.48
0.63 0.80 1.00 1.43

5.4.4. Sub-scene verification

To analyze whether the RCA model maintains good performance under head motion and different lighting conditions, the three SpO2 estimation models are individually trained and validated on several unstable scenes in the PURE and VIPL-HR datasets, to examine the robustness of the model. The experimental results are shown in Fig. 13 and Fig. 14.

Fig. 13.

Comparison of experimental results of the three models in PURE by scene.

Fig. 14.

Comparison of experimental results of the three models in VIPL-HR by scene.

According to Fig. 13, the estimation errors of MMFM are lower than those of NBM and CCM in different scenes of the PURE dataset. NBM outperforms CCM only in the steady scenes and performs worse than CCM in the rest of the unstable scenes.

According to Fig. 14, each of the three estimation models has advantages and disadvantages in different scenarios of the VIPL-HR dataset, but the differences between them are relatively small. As Table 1 shows, VIPL-HR is large compared to PURE. On the smaller PURE dataset, the results of the three models differ significantly and the Color Channel Model significantly outperforms the Network-Based Model. On the larger VIPL-HR dataset, the difference between the three models is tiny. From these experiments, this paper concludes that when the training samples are sufficiently numerous and varied, the Network-Based Model can learn from the features extracted by the RCA network to estimate SpO2 with accuracy comparable to that of the Color Channel Model. In addition, the MMFM draws on the advantages of both the CCM and the NBM to achieve a more accurate result.

5.4.5. Comparison with other models

We survey articles on contactless SpO2 estimation published over the years and select representative ones to compare with the MMFM. Since their code and datasets are not public, the MAEs reported in these articles are used for comparison, as shown in Table 9. Num denotes the number of volunteers involved in constructing each dataset, DL denotes deep learning, and Area denotes the area of the body that is filmed.

Table 9.

Comparison of different contactless SpO2 estimation methods.

Method Dataset Num MAE (%) DL Area
[25] 2016 Self-built 9 1.00 hands
[38] 2020 Self-built 9 0.85 face
[39] 2020 Self-built 21 0.83 face
[40] 2021 Self-built 25 1.7 face
[12] 2021 Self-built 14 1.81 hands
[41] 2022 BIDMC PPG [42] 52 1.45 PPG signal
*[38] 2020 PURE 10 1.51 face
VIPL-HR 107 2.32
*[39] 2020 PURE 10 1.30 face
VIPL-HR 107 2.02
Ours-MMFM PURE 10 0.63 face
VIPL-HR 107 1.00

The literature [38] presented a SpO2 monitoring method without physical contact with the patient using imaging photoplethysmography (iPPG). Firstly, the authors used a camera to capture videos and extracted the forehead area as an ROI. Then, they performed Eulerian video magnification (EVM) [43] on the facial videos to amplify the changes in skin color due to the heart cycle. Lastly, they used the red and blue channel ratio method to calculate SpO2 values.

The literature [39] presented a SpO2 monitoring method using a video camera. The authors used the face detector provided by the Dlib library to detect faces in the images. Then, they selected the forehead and the left and right cheeks as three ROIs. To track the ROIs in the videos, they used the Kanade-Lucas-Tomasi (KLT) [44] tracking method. The next step was the analysis of the RGB signals coming from the ROIs marked in each video frame of a short sequence. To reduce noise, the signals were cleaned in a preprocessing phase. Next, they used Power Spectral Density (PSD) through Welch’s method [45] to select the most informative ROI. Finally, they used the chosen ROI to calculate SpO2 values by the red and blue channel ratio method.

As the source code of the literature [38], [39] is not publicly available, we reproduce the models based on the details given there and run comparative experiments on the PURE and VIPL-HR datasets; these results are marked with * in Table 9. The comparison shows that the RCA network and the MMFM are more effective in extracting SpO2-related features from facial image sequences, and it also demonstrates the effectiveness of deep learning in the field of contactless SpO2 measurement.

In addition, we conduct an experimental comparison with the very recent literature [41], published in August 2022. It uses PPG signals and machine learning to estimate SpO2 and meets the clinical indicator. In general, however, the MMFM adapts better to the SpO2 estimation task, which further confirms the effectiveness of deep learning in the area of contactless SpO2 detection.

Since there is no existing deep learning method for SpO2 estimation from face videos, it is important to compare our model with a simple deep neural network trained with a regression loss. We therefore train classical network models (ResNet-18 [31] and VGG16 [46]) on the SpO2 estimation task with an L1 loss and compare the results. According to Table 10, the MMFM has a slightly higher time complexity than ResNet-18 but the lowest space complexity of the three models, and it has the lowest estimation error. Overall, the estimation model of this paper is better suited to the SpO2 estimation task.

Table 10.

Deep learning model comparison experiments.

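| Model | FLOPs (GMac) | Params (M) | PURE MAE (%) | PURE RMSE (%) | VIPL-HR MAE (%) | VIPL-HR RMSE (%) |
| --- | --- | --- | --- | --- | --- | --- |
| VGG16 [46] | 15.38 | 72.34 | 2.00 | 2.66 | 2.94 | 3.21 |
| ResNet-18 [31] | 1.82 | 11.18 | 1.16 | 1.37 | 1.98 | 2.23 |
| Ours-MMFM | 3.05 | 1.77 | 0.63 | 0.80 | 1.00 | 1.43 |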

6. Conclusion

In contactless SpO2 estimation, traditional methods require elaborate signal processing techniques to meet medical specifications. How to develop deep learning techniques for contactless SpO2 estimation with better accuracy is the focus of our research. A multi-model fusion method based on deep learning is proposed in this paper. Firstly, maximum-minimum corner cropping of faces is proposed, which preserves the most effective facial information and reduces the impact of facial occlusion. Next, we combine the residual network module and the coordinate attention module into the RCA network to obtain a high-dimensional feature map with stronger representation. Finally, the SpO2 values are calculated by three SpO2 estimation models (CCM, NBM and MMFM), and their advantages and disadvantages are compared. In the experiments on the PURE and VIPL-HR datasets, the MAEs of all three models are less than 2 %, meeting the clinical requirement. This confirms the effectiveness of deep learning for contactless SpO2 estimation from face videos. Additionally, the MMFM effectively combines the advantages of the CCM and the NBM, using the features each model focuses on and the complementarity between them to achieve more accurate estimation results.

In future research, we will focus on how to construct deep learning models that estimate SpO2 more accurately. For example, we can design our network models on top of large pre-trained models for transfer learning, and use recent Transformer-based variants instead of traditional convolution. Besides, we will focus on improving the ability to handle small, transient changes in SpO2 status for more valuable applications. Furthermore, minimizing the redundancy of videos and achieving real-time monitoring for practical use are also directions for our future research. We will continue to improve model accuracy and performance and apply the method to telemedicine and home health.

CRediT authorship contribution statement

Min Hu: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Xia Wu: Conceptualization, Methodology, Software, Writing – original draft, Investigation, Data curation, Formal analysis, Writing – review & editing, Visualization. Xiaohua Wang: Conceptualization, Validation, Writing – review & editing, Supervision. Yan Xing: Conceptualization, Investigation, Validation, Resources, Supervision. Ning An: Conceptualization, Writing – review & editing, Resources, Supervision. Piao Shi: Validation, Resources, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by National Natural Science Foundation of China under Grant 62176084, and Grant 62176083, and in part by the Fundamental Research Funds for the Central Universities of China under Grant PA2021GDSK0092 and PA2022GDSK0066.

Data availability

Data will be made available on request.

References

  • 1.O'Driscoll B.R., Howard L.S., Davison A.G. BTS guideline for emergency oxygen use in adult patients. Thorax. 2008;63(Suppl. 6):vi1–vi68. doi: 10.1136/thx.2008.102947.
  • 2.Starr N., Rebollo D., Asemu Y.M., Akalu L., Mohammed H.A., Menchamo M.W., Melese E., Bitew S., Wilson I., Tadesse M., Weiser T.G. Pulse oximetry in low-resource settings during the COVID-19 pandemic. Lancet Glob. Health. 2020;8(9):e1121–e1122. doi: 10.1016/S2214-109X(20)30287-4.
  • 3.Şekeri K., et al. Data collection from blood glucose meter and anomaly detection. Karaelmas Fen ve Mühendislik Dergisi. 2017;7(2):428–433. doi: 10.7212/ZKUFBD.V7I2.664.
  • 4.Kong L., Zhao Y., Dong L., Jian Y., Jin X., Li B., Feng Y., Liu M., Liu X., Wu H. Non-contact detection of oxygen saturation based on visible light imaging device using ambient light. Opt. Express. 2013;21(15):17464. doi: 10.1364/OE.21.017464.
  • 5.Verkruysse W., Bartula M., Bresch E., Rocque M., Meftah M., Kirenko I. Calibration of contactless pulse oximetry. Anesth. Analg. Jan. 2017;124(1):136–145. doi: 10.1213/ANE.0000000000001381.
  • 6.Duch W. Filter methods. Stud. Fuzziness Soft Comput. 2006;207:89–117. doi: 10.1007/978-3-540-35488-8_4.
  • 7.Schafer R.W., Rabiner L.R. A digital signal processing approach to interpolation. Proc. IEEE. 1973;61(6):692–702. doi: 10.1109/PROC.1973.9150.
  • 8.J.H. Davis, Fourier transforms, in: Applied and Numerical Harmonic Analysis, no. 9783319433691, 2016, pp. 425–566, doi: 10.1007/978-3-319-43370-7_7.
  • 9.Srivastava G., Chauhan A., Jangid M., Chaurasia S. CoviXNet: a novel and efficient deep learning model for detection of COVID-19 using chest X-Ray images. Biomed. Signal Process. Control. 2022;78 doi: 10.1016/j.bspc.2022.103848.
  • 10.Wang D., Hu Q., Yang C. Biometric recognition based on scalable end-to-end convolutional neural network using photoplethysmography: a comparative study. Comput. Biol. Med. 2022;147 doi: 10.1016/j.compbiomed.2022.105654.
  • 11.Gupta A., Ravelo-García A.G., Dias F.M. Availability and performance of face based non-contact methods for heart rate and oxygen saturation estimations: a systematic review. Comput. Methods Programs Biomed. 2022;219:106771. doi: 10.1016/j.cmpb.2022.106771.
  • 12.J. Mathew, X. Tian, M. Wu, C.-W. Wong, Remote blood oxygen estimation from videos using neural networks, arXiv e-prints, p. arXiv:2107.05087, 2021, [Online], Available: http://arxiv.org/abs/2107.05087.
  • 13.Hu M., Qian F., Guo D., Wang X., He L., Ren F. ETA-rPPGNet: effective time-domain attention network for remote heart rate measurement. IEEE Trans. Instrum. Meas. 2021;70:1–12.
  • 14.Shao D., Liu C., Tsow F., Yang Y., Du Z., Iriya R., Yu H., Tao N. Noncontact monitoring of blood oxygen saturation using camera and dual-wavelength imaging system. IEEE Trans. Biomed. Eng. 2016;63(6):1091–1098. doi: 10.1109/TBME.2015.2481896.
  • 15.Chan C., Inskip J.A., Kirkham A.R., Ansermino J.M., Dumont G., Li L.C., Ho K., Novak Lauscher H., Ryerson C.J., Hoens A.M., Chen T., Garde A., Road J.D., Camp P.G. A smartphone oximeter with a fingertip probe for use during exercise training: usability, validity and reliability in individuals with chronic lung disease and healthy controls. Physiotherapy (United Kingdom) 2019;105(3):297–306. doi: 10.1016/j.physio.2018.07.015.
  • 16.R. Stricker, S. Muller, H.M. Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: IEEE RO-MAN 2014 - 23rd IEEE International Symposium on Robot and Human Interactive Communication: Human-Robot Co-Existence: Adaptive Interfaces and Systems for Daily Life, Therapy, Assistance and Socially Engaging Interactions, Oct. 2014, pp. 1056–1062, doi: 10.1109/ROMAN.2014.6926392.
  • 17.Niu X., Shan S., Han H., Chen X. RhythmNet: end-to-end heart rate estimation from face via spatial-temporal representation. IEEE Trans. Image Process. 2020;29:2409–2423. doi: 10.1109/TIP.2019.2947204.
  • 18.X. Niu, H. Han, S. Shan, X. Chen, ‘VIPL-HR: a multi-modal database for pulse estimation from less-constrained face video’, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11365, LNCS, Dec. 2019, pp. 562–576, doi: 10.1007/978-3-030-20873-8_36.
  • 19.Mannheimer P.D. The light-tissue interaction of pulse oximetry. Anesth. Analg. 2007;105(Suppl. 6) doi: 10.1213/01.ane.0000269522.84942.54.
  • 20.Runciman W.B., Webb R.K., Barker L., Currie M. The Australian incident monitoring study. The pulse oximeter: applications and limitations–an analysis of 2000 incident reports. Anaesth. Intensive Care. 1993;21(5):543–550. doi: 10.1177/0310057X9302100509.
  • 21.de Haan G., Jeanne V. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng. 2013;60(10):2878–2886. doi: 10.1109/TBME.2013.2266196.
  • 22.Chen M., Zhu Q., Wu M., Wang Q. Modulation model of the photoplethysmography signal for vital sign extraction. IEEE J. Biomed. Health Inform. 2021;25(4):969–977. doi: 10.1109/JBHI.2020.3013811.
  • 23.Poh M.Z., McDuff D.J., Picard R.W. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. 2011;58(1):7–11. doi: 10.1109/TBME.2010.2086456.
  • 24.Y.H. Wang, C.J. Hung, C.H. Shen, S.J. Chen, A new oxygen saturation images of iris tissue, in: Proceedings of IEEE Sensors, 2010, pp. 1386–1389, doi: 10.1109/ICSENS.2010.5690526.
  • 25.Tsai H.Y., Huang K.C., Yeh J.A. No-contact oxygen saturation measuring technology for skin tissue and its application. IEEE Instrum. Meas. Mag. 2016;19(5):57–64. doi: 10.1109/MIM.2016.7579071.
  • 26.Bal U. Non-contact estimation of heart rate and oxygen saturation using ambient light. Biomed. Opt. Express. 2015;6(1):86. doi: 10.1364/boe.6.000086.
  • 27.W. Chen, D. McDuff, DeepPhys: video-based physiological measurement using convolutional attention networks, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11206, LNCS, 2018, pp. 356–373, doi: 10.1007/978-3-030-01216-8_22.
  • 28.Lokendra B., Puneet G. AND-rPPG: a novel denoising-rPPG network for improving remote heart rate estimation. Comput. Biol. Med. 2022;141 doi: 10.1016/j.compbiomed.2021.105146.
  • 29.Ding X., Nassehi D., Larson E.C. Measuring oxygen saturation with smartphone cameras using convolutional neural networks. IEEE J. Biomed. Health Inform. 2019;23(6):2603–2610. doi: 10.1109/JBHI.2018.2887209.
  • 30.Teuwen J., Moriakov N. Handbook of Medical Image Computing and Computer Assisted Intervention. Elsevier; 2019. Convolutional neural networks; pp. 481–501.
  • 31.K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Dec. 2016, Vol. 2016-Decem, pp. 770–778, doi: 10.1109/CVPR.2016.90.
  • 32.Q. Hou, D. Zhou, J. Feng, Coordinate attention for efficient mobile network design, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021, pp. 13708–13717, doi: 10.1109/CVPR46437.2021.01350.
  • 33.Gao B.-B., Xing C., Xie C.-W., Wu J., Geng X. Deep label distribution learning with label ambiguity. IEEE Trans. Image Process. 2017;26(6):2825–2838. doi: 10.1109/TIP.2017.2689998.
  • 34.J.M. Joyce, Kullback-Leibler Divergence, in: International Encyclopedia of Statistical Science, Springer, Berlin, Heidelberg, 2011, pp. 720–722, doi: 10.1007/978-3-642-04898-2_327.
  • 35.Zhang K., Zhang Z., Li Z., Qiao Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process Lett. Oct. 2016;23(10):1499–1503. doi: 10.1109/LSP.2016.2603342.
  • 36.V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874, doi: 10.1109/CVPR.2014.241.
  • 37.C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Dec. 2016, Vol. 2016-Decem, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
  • 38.A. de Fatima Galvao Rosa, R.C. Betini, Noncontact SpO2 measurement using Eulerian video magnification, IEEE Trans. Instrum. Meas. 69(5) (2020) 2120–2130, doi: 10.1109/TIM.2019.2920183.
  • 39.G. Casalino, G. Castellano, G. Zaza, A mHealth solution for contact-less self-monitoring of blood oxygen saturation, in: Proceedings - IEEE Symposium on Computers and Communications, Jul. 2020, Vol. 2020-July, doi: 10.1109/ISCC50000.2020.9219718.
  • 40.Moço A., Verkruysse W. Pulse oximetry based on photoplethysmography imaging with red and green light: calibratability and challenges. J. Clin. Monit. Comput. 2021;35(1):123–133. doi: 10.1007/s10877-019-00449-y.
  • 41.B. Koteska, H. Mitrova, A.M. Bogdanova, F. Lehocki, Machine learning based SpO2 prediction from PPG signal’s characteristics features, Aug. 2022, pp. 1–6, doi: 10.1109/memea54994.2022.9856498.
  • 42.Pimentel M.A.F., Johnson A.E.W., Charlton P.H., Birrenkott D., Watkinson P.J., Tarassenko L., Clifton D.A. Toward a robust estimation of respiratory rate from pulse oximeters. IEEE Trans. Biomed. Eng. 2017;64(8):1914–1923. doi: 10.1109/TBME.2016.2613124.
  • 43.Wu H.-Y., Rubinstein M., Shih E., Guttag J., Durand F., Freeman W. Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph. 2012;31(4):1–8.
  • 44.Tomasi C. Detection and tracking of point features technical report CMU-CS-91-132. Image Rochester NY. 1991;91(April):1–22.
  • 45.M.O. Solomon, PSD Computations Using Welch’s Method, Sandia National Laboratories, no. SAND91-1533, p. 64, Dec. 1991, doi: 10.2172/5688766.
  • 46.K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, Sep. 2015, doi: 10.48550/arxiv.1409.1556.

Associated Data

Data Availability Statement

Data will be made available on request.


Articles from Biomedical Signal Processing and Control are provided here courtesy of Elsevier
