Contactless blood oxygen estimation from face videos: A multi-model fusion method based on deep learning

Min Hu; Xia Wu; Xiaohua Wang; Yan Xing; Ning An; Piao Shi

doi:10.1016/j.bspc.2022.104487

. 2022 Dec 10;81:104487. doi: 10.1016/j.bspc.2022.104487

Min Hu ^a, Xia Wu ^a, Xiaohua Wang ^a,^⁎, Yan Xing ^b, Ning An ^a,^c, Piao Shi ^a

PMCID: PMC9735266 PMID: 36530216

Abstract

Blood oxygen ( ${SpO}_{2}$ ), a key indicator of respiratory function, has received increasing attention during the COVID-19 pandemic. Clinical results show that patients with COVID-19 are likely to have distinctly lower ${SpO}_{2}$ before the onset of significant symptoms. To address the shortcomings of current methods for monitoring ${SpO}_{2}$ from face videos, this paper proposes a novel multi-model fusion method based on deep learning for ${SpO}_{2}$ estimation. The method includes a feature extraction network named Residuals and Coordinate Attention (RCA) and a multi-model fusion ${SpO}_{2}$ estimation module. The RCA network uses a residual block cascade and a coordinate attention mechanism to focus on the correlation between feature channels and the location information in the feature space. The multi-model fusion module includes the Color Channel Model (CCM) and the Network-Based Model (NBM). To fully use the color feature information in face videos, an image generator is constructed in the CCM to calculate ${SpO}_{2}$ by reconstructing the red and blue channel signals. In addition, to reduce the disturbance of other physiological signals, a novel two-part loss function is designed in the NBM. Given the complementarity of the features and models that the CCM and NBM focus on, a Multi-Model Fusion Model (MMFM) is constructed. The experimental results on the PURE and VIPL-HR datasets show that all three models meet the clinical requirement (mean absolute error $⩽$ 2%) and demonstrate that multi-model fusion can fully exploit the ${SpO}_{2}$ features of face videos and improve ${SpO}_{2}$ estimation performance. These results will facilitate applications in remote medicine and home health.

Keywords: Estimation, Remote photo-plethysmography, Deep learning, Residual network, Coordinate attention, Multi-model fusion

1. Introduction

${SpO}_{2}$ reflects the function of the cardiopulmonary and respiratory systems and is an essential physiological indicator in clinical monitoring [1]. Many lung diseases can be detected from abnormal ${SpO}_{2}$ values, including those caused by the SARS coronavirus in 2003, the Middle East respiratory syndrome (MERS) coronavirus in 2012, and COVID-19, which is spreading widely at present. The literature [2] stated that patients with pulmonary infections did not feel shortness of breath in the early stages even though their ${SpO}_{2}$ levels were decreasing. Therefore, monitoring ${SpO}_{2}$ , an important physiological parameter, is essential to avoid disease deterioration. In addition, regarding blood vessel detection, anomalies and blood glucose disorders can be determined from the data collected by a blood glucose meter [3].

Traditional contact pulse ${SpO}_{2}$ meters [4] are widely used in routine and critical scenarios, but they are difficult to use in particular scenarios: they may injure burn patients, and they give inaccurate ${SpO}_{2}$ readings for patients with hand and foot tremors. In addition, pulmonary diseases are generally highly contagious, and incomplete disinfection of healthcare workers and healthcare tools may lead to the secondary transmission of pathogens. In these scenarios, contactless measurement of physiological parameters becomes an effective method for estimating ${SpO}_{2}$ . Several scholars have already studied contactless monitoring of vital signs [5], for example, Remote Photo-plethysmography (rPPG), Image Photo-plethysmography (iPPG), contactless pulse ${SpO}_{2}$ measurement, and others. To the best of our knowledge, only a few articles have studied contactless ${SpO}_{2}$ estimation methods based on videos. Among them, most of the rPPG-based ${SpO}_{2}$ estimation methods use traditional signal processing methods, such as filtering [6], signal interpolation [7], and fast Fourier transform [8].

However, with the rapid growth of deep learning techniques, an increasing number of scholars have applied them to biomedical scenarios and daily health monitoring and achieved desirable results, such as detecting COVID-19 by chest X-ray images [9] and physiological indicator estimation [10]. In particular, deep learning techniques regarding heart rate estimation have matured in physiological indicator estimation. However, the application of deep learning to contactless ${SpO}_{2}$ estimation is still at an early stage of development, and the literature [11] provided a systematic review of them. Last year, J. Mathew et al. proposed a study combining deep learning with contactless ${SpO}_{2}$ estimation in hands by their self-built dataset and achieved the clinical requirement [12]. Apart from the literature [12], no other literature uses deep learning techniques for non-contact ${SpO}_{2}$ estimation tasks. However, considering the maturity and convenience of facial estimation and inspired by the previous studies of rPPG-based heart rate estimation literature [13] and the contactless blood oxygen estimation literature [14], a contactless ${SpO}_{2}$ estimation method combined with deep learning based on face videos is proposed in this paper. Experimental results show that the ${SpO}_{2}$ estimation values of this paper's method are comparable to those values obtained by the commercial pulse ${SpO}_{2}$ meter, and the ${SpO}_{2}$ measurement error in this paper is within an acceptable range according to the literature [15].

The main contributions of the work in this paper are as follows.

1) To fully extract ${SpO}_{2}$ features from face videos by deep learning, a feature extraction network named Residuals and Coordinate Attention (RCA) is constructed. It combines a residual block cascade with a coordinate attention mechanism. Thus, the RCA network not only learns the high-dimensional features of the image, but also focuses on the correlation between feature channels and the location information in the feature space. In this way, the RCA network has a stronger feature representation capability.
2) To use the classical color channel ratio algorithm to calculate ${SpO}_{2}$ , a ${SpO}_{2}$ estimation model for face videos named Color Channel Model (CCM) is constructed, which is based on an image generator. The CCM first uses residual blocks and transposed convolution to reconstruct the feature maps learned by the RCA network into RGB images. Then, the CCM calculates ${SpO}_{2}$ using the red/blue color channel ratio method.
3) To help the network learn ${SpO}_{2}$ features from multiple aspects, a ${SpO}_{2}$ estimation model named Network-Based Model (NBM) is constructed. The NBM is based on a network and a two-part loss function. It is then fused with the CCM into a multi-model fusion ${SpO}_{2}$ estimation model named Multi-Model Fusion Model (MMFM), whose loss function takes full advantage of the features that the CCM and NBM focus on as well as the complementarity between them. The CCM, NBM and MMFM are evaluated in ${SpO}_{2}$ estimation experiments on the public datasets (PURE [16] and VIPL-HR [17], [18]) and achieve accurate ${SpO}_{2}$ estimation.

2. Related work

${SpO}_{2}$ is the oxygen concentration in the blood and represents the ratio of oxygenated hemoglobin to total hemoglobin, with a normal range of 95 % to 100 %. The medical device commonly used to measure arterial ${SpO}_{2}$ is the pulse ${SpO}_{2}$ meter, which is based on the use of photo-plethysmography (PPG) to assess changes in blood volume in the human tissue microcirculation [19]. However, the pulse ${SpO}_{2}$ meter has many limitations, such as possible injury to burn patients and inaccurate ${SpO}_{2}$ monitoring in patients with hand and foot tremors [20]. In addition, there is a risk of virus infection when using contact oximeter to check the physiological health of patients with infectious diseases. So researchers proposed the remote photo-plethysmography (rPPG) which is a contactless method for estimating pulse waves generated by the heart through peripheral blood perfusion measurements, using a video camera to capture video of body parts, and a series of processing of the video to obtain the desired physiological information, such as heart rate [21], respiration rate [22] and heart rate variability [23].

In recent years, many scholars have started to use videos to investigate contactless measurements of ${SpO}_{2}$ , but most of these studies have used conventional signal processing equipment and methods. The literature [14] used a CMOS camera to record PPG signals for ${SpO}_{2}$ estimation, with the light source alternating between orange light at 611 nm and near-infrared light at 880 nm. ${SpO}_{2}$ was derived from the pulsatile component(AC) and the nonpulsatile component(DC) component analysis of the tracked PPG signals. The literature [24] proposed a method using iris tissue irradiated by two light wavelengths (630 nm and 940 nm) for reflectance images to assess ${SpO}_{2}$ levels. The literature [25] developed a contactless skin ${SpO}_{2}$ imaging system that uses reflected images of superficial tissue skin to create a map of ${SpO}_{2}$ distribution across the measurement area to assess ${SpO}_{2}$ , heart rate(HR) and blood flow velocity(BFV). The literature [26] used two orthogonal vectors in RGB color space to extract the PPG signal and used a denoising algorithm based on a double-tree composite wavelet transform to reduce illumination and motion artifacts.

With the continuous development of deep learning techniques, there has been a proliferation of studies using them for video-based estimation of physiological metrics, such as heart rate and respiration rate, with excellent results. The literature [17] proposed an end-to-end convolutional attention network to detect blood volume pulses in face videos, and then performed frequency analysis of the detected pulse signal to track heart rate and respiration rate. The literature [27] designed the DeepPhys model, a two-stream approach based on a skin reflection model, which used the appearance module to provide attention that guides the learning of the motion module and thus recovered a more robust rPPG signal. The literature [13] divided the input videos into segments and applied a time-domain segment subnet to extract spatial and temporal information. The literature [28] used deep learning to construct temporal signals and used Action Units (AUs) for signal denoising to improve HR estimation.

However, the research on deep learning algorithms for ${SpO}_{2}$ estimation by videos is still in its early stages. The literature [29] proposed a convolutional neural network architecture for contact ${SpO}_{2}$ monitoring from smartphone cameras. Although they performed better than traditional ratio methods, they still had the drawbacks associated with contact measurements. In July 2021, J. Mathew et al. proposed a study combining deep learning with contactless ${SpO}_{2}$ estimation in hands [12]. Experiments were conducted to collect 14 volunteers' hand videos to estimate ${SpO}_{2}$ , containing both normal breathing and breath-holding states, with a MAE of around 2 %, which aligns with clinical indications. However, the dataset collected in this literature is not public, and no relevant experiments have been performed on public datasets.

Inspired by the above articles, this paper proposes a contactless ${SpO}_{2}$ estimation method based on deep learning with multi-model fusion, and conducts extensive experiments on the public datasets (PURE and VIPL-HR). PURE and VIPL-HR contain face videos and physiological signal recordings in a wide range of scenes, such as stillness, brightness, darkness, head movement and the post-exercise state. In the PURE dataset, the ground truth values of ${SpO}_{2}$ were captured using a finger clip pulse oximeter (pulox CMS50E). In the VIPL-HR dataset, the ground truth values of ${SpO}_{2}$ were captured using the CONTEC CMS60C BVP sensor. More than one hundred volunteers were involved in the dataset construction work. The experiments verify that the RCA network can adequately extract ${SpO}_{2}$ features from face videos and that the fusion model achieves better ${SpO}_{2}$ estimation. They also confirm that deep learning techniques can estimate ${SpO}_{2}$ effectively and can be applied to health screening and remote health assessment.

3. Material and methods

The proposed multi-model fusion ${SpO}_{2}$ estimation model based on deep learning is shown in Fig. 1. It consists of two parts: the RCA network and the multi-model ${SpO}_{2}$ estimation module. The RCA network mainly consists of the residual network module and the coordinate attention module and is used to extract physiological features from the facial image sequences. The multi-model ${SpO}_{2}$ estimation module includes the Color Channel Model, the Network-Based Model, and the Multi-Model Fusion Model. ${SpO}_{2}$ is calculated by feeding the features extracted by the RCA network into the ${SpO}_{2}$ estimation module.

3.1. RCA network

This section introduces an RCA network to extract feature signals with physiological information from face videos and use them to calculate the ${SpO}_{2}$ . Fig. 1 shows the structure of this RCA network. In terms of logical structure, RCA network and the conventional deep CNN [30] are similar. Both of them mainly consist of these types of layers: input layer, convolutional layer, activation function layer, pooling layer and fully connected layer. However, the difference is that the RCA network has residual blocks and attention composition. Residual blocks can alleviate the gradient disappearance problem in the conventional deep CNN by using jump connections. In addition, residual blocks can protect the integrity of the information. Besides, by using attention, more representative feature information can be extracted and result in a higher computational accuracy.

The input image sequence is denoted as $V$ , and each image has a size of 3 × 224 × 224. First, $V$ is fed into a convolution layer with a kernel size of $[7, 7]$ for downsampling, denoted ${Conv}_{7 \times 7}$ . Then, batch normalization (BN) and the ReLU activation function ( $δ$ ) are applied to accelerate the convergence of the network. Finally, global average pooling (GAP) outputs $F^{'}$ :

$$F^{'} = \mathrm{GAP}(δ(\mathrm{BN}(\mathrm{Conv}_{7 \times 7}(V)))) \tag{1}$$

3.1.1. Residual module

To extract the high-dimensional features from $F^{'}$ , $F^{'}$ is fed into the Residual Block (RB) cascade, and the features $\bar{F}$ are obtained. The residual module in this paper is constructed based on ideas from the literature [31]. It alleviates the gradient disappearance problem by using jump connections. In addition, the residual block protects the integrity of the information by passing the input information directly to the output in a bypass, reducing information loss and attrition and improving the learning capability of the network. Besides, to improve network performance and obtain a larger perceptual field to help capture the features that need to be learned and attended to, the network depth is increased by cascading two residual blocks. The structure is shown in Fig. 2 .

Fig. 2 — Residual block cascade structure.

The formula for a single residual module is shown in Equation (2). $δ$ is the ReLu activation function. ${Conv}_{3 \times 3}$ is a convolution layer with the kernel size of $[3, 3]$ , and ${Conv}_{1 \times 1}$ is a convolution layer with the kernel size of $[1, 1]$ . $\oplus$ represents the Element-wise Sum.

$$\mathrm{RB}(F^{'}) = δ(\mathrm{Conv}_{3 \times 3}(δ(\mathrm{Conv}_{3 \times 3}(F^{'}))) \oplus \mathrm{Conv}_{1 \times 1}(F^{'})) \tag{2}$$

The output of the cascade of the two residual blocks is $\bar{F}$ as Equation (3):

$$\bar{F} = \mathrm{RB}_{2}(\mathrm{RB}_{1}(F^{'})) \tag{3}$$
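The composition in Equations (2) and (3) can be illustrated with a minimal sketch. The learned convolutions are replaced here by hypothetical element-wise scaling functions (`scale`), and features are plain lists of floats, so this only shows how the main path, the shortcut path and the cascade fit together. It is not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def relu(x: Vector) -> Vector:
    """Element-wise ReLU, the activation written as delta in Equation (2)."""
    return [max(0.0, v) for v in x]


def residual_block(x: Vector, conv_a: Layer, conv_b: Layer, shortcut: Layer) -> Vector:
    """Compute relu(conv_b(relu(conv_a(x))) + shortcut(x)), mirroring Equation (2)."""
    main = conv_b(relu(conv_a(x)))      # two stacked "3x3" layers on the main path
    skip = shortcut(x)                  # "1x1" projection on the bypass path
    return relu([m + s for m, s in zip(main, skip)])


def scale(factor: float) -> Layer:
    """Hypothetical stand-in for a learned convolution: multiply by a constant."""
    return lambda x: [factor * v for v in x]


def cascade(x: Vector) -> Vector:
    """Apply two residual blocks in sequence, as in Equation (3)."""
    first = residual_block(x, scale(0.5), scale(2.0), scale(1.0))
    return residual_block(first, scale(0.5), scale(2.0), scale(1.0))
```

Because the shortcut passes the input to the output of each block, a positive input value is carried through both blocks even when the main path contributes little.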

3.1.2. Coordinate attention module

Large-amplitude head movements and changes in lighting conditions during the recovery of the rPPG signal cycle can interfere with signal estimation, resulting in excessive fluctuations in feature information changes, which can interfere with the signal periodicity [13]. However, the coordinate attention module [32] can focus on the relationship between feature channels and the location information in the feature space. Therefore, we add it after extracting the acquired high-dimensional features to enhance the feature representation.

Coordinate attention embeds location information into channel attention by decomposing channel attention into two parallel 1D feature encoding processes, i.e., pooling, convolution and activation operations applied to the feature information in the X (horizontal) and Y (vertical) directions respectively. This integrates spatial coordinate information into the generated attention map and yields a coordinate-aware attention map. In this paper, the coordinate attention idea is embedded into the RCA network, and the network structure is shown in Fig. 1. First, the high-dimensional features $\bar{F} = [{\bar{f}}_{1}, {\bar{f}}_{2}, \ldots, {\bar{f}}_{C}]$ are used as input. Each channel is encoded along the horizontal and vertical directions using pooling layers of size $(H, 1)$ and $(1, W)$ respectively, which enables the attention module to capture the precise position information of the features. Thus, the output of the c-th channel at height h can be formulated as Equation (4), where $\sum$ stands for summation.

$$z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} {\bar{f}}_{c}(h, i) \tag{4}$$

Similarly, the output of the c-th channel at width w can be written as Equation (5):

$$z_{c}^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} {\bar{f}}_{c}(j, w) \tag{5}$$
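The two directional pooling operations in Equations (4) and (5) amount to a row mean and a column mean of each channel. A minimal sketch, assuming a single channel stored as a nested list indexed as `channel[row][col]` (the name `directional_pool` is illustrative):

```python
from typing import List, Tuple

Channel = List[List[float]]  # one feature channel, indexed as channel[row][col]


def directional_pool(channel: Channel) -> Tuple[List[float], List[float]]:
    """Return (z_h, z_w): the mean of every row and the mean of every column."""
    height = len(channel)
    width = len(channel[0])
    # z_h[h]: average over the width for a fixed row h, as in Equation (4)
    z_h = [sum(row) / width for row in channel]
    # z_w[w]: average over the height for a fixed column w, as in Equation (5)
    z_w = [sum(channel[j][w] for j in range(height)) / height for w in range(width)]
    return z_h, z_w
```

For a 2 × 3 channel this yields a vector of length 2 and a vector of length 3, so each direction keeps its own positional index while the other direction is averaged out.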

${\bar{f}}_{c}$ represents the c-th channel of $\bar{F}$ . The above transformations perform feature aggregation along two spatial directions to obtain a pair of direction-aware feature maps $[z^{h}, z^{w}]$ . They also enable the attention module to capture long-term dependencies along one spatial direction and preserve precise location information along with the other, which helps the network locate the target of interest more accurately.

Then, results of the two transformations are concatenated and transform operations are performed on them using the $1 \times 1$ convolutional transform function $F_{1}$ as shown in Equation (6).

$$f_{c} = δ(F_{1}([z_{c}^{h}, z_{c}^{w}])) \tag{6}$$

Here, $δ$ is ReLu activation function. $f_{c}$ is the intermediate feature mapping result of encoding the spatial information of the c-th channel in the horizontal and vertical directions. Decompose $f_{c}$ into two separate tensors $f_{c}^{h} \in R^{C / r \times H}$ and $f_{c}^{w} \in R^{C / r \times W}$ along the spatial dimensions. $f_{c}^{h}$ and $f_{c}^{w}$ are transformed into tensors with the same number of channels by using two $1 \times 1$ convolution transforms $F_{h}$ and $F_{w}$ respectively to obtain $g_{c}^{h}$ and $g_{c}^{w}$ as shown in Equation (7).

$$\begin{aligned} g_{c}^{h} &= σ(F_{h}(f_{c}^{h})), \\ g_{c}^{w} &= σ(F_{w}(f_{c}^{w})). \end{aligned} \tag{7}$$

Here, $σ$ is the Sigmoid activation function. Subsequently, $g_{c}^{h}$ and $g_{c}^{w}$ are expanded using the function named expand_as so that both have the same size as the input feature $\bar{F}$ and can serve as the attention weights of $\bar{F}$ . The output of the coordinate attention module is $F^{″} = [f_{1}^{″}, f_{2}^{″}, \ldots, f_{C}^{″}]$ , in which the component of the c-th channel of $F^{″}$ is:

$$f_{c}^{″}(i, j) = {\bar{f}}_{c}(i, j) \times g_{c}^{h}(i) \times g_{c}^{w}(j) \tag{8}$$
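As a minimal sketch of this reweighting step, assume the pre-activation outputs of the two $1 \times 1$ convolutions are already available as plain lists (`row_logits` and `col_logits`, both hypothetical names). The Sigmoid gates are then applied to one input channel position by position:

```python
import math
from typing import List

Channel = List[List[float]]  # one feature channel, indexed as channel[row][col]


def sigmoid(v: float) -> float:
    """Logistic function used to turn a logit into a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def apply_coordinate_gates(channel: Channel,
                           row_logits: List[float],
                           col_logits: List[float]) -> Channel:
    """Scale channel[i][j] by the row gate g_h[i] and the column gate g_w[j]."""
    g_h = [sigmoid(v) for v in row_logits]  # one gate per row (height direction)
    g_w = [sigmoid(v) for v in col_logits]  # one gate per column (width direction)
    return [[channel[i][j] * g_h[i] * g_w[j] for j in range(len(g_w))]
            for i in range(len(g_h))]
```

Indexing the gates by row and column plays the role of the expand_as broadcasting described above: each gate vector is implicitly repeated along the other spatial axis.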

3.2. Multi-Model fusion for ${SpO}_{2}$ estimation

Existing research shows that the video-based color channel ratio method has achieved promising results, for example in the literature [26]. To combine the advantages of the Color Channel Model and the Network-Based Model, and to exploit both the features that each model focuses on and the complementarity between them, a multi-model fusion ${SpO}_{2}$ estimation module is constructed in this paper, as shown in Fig. 1. The Color Channel Model reconstructs RGB images from feature maps with the image generator and then calculates the ${SpO}_{2}$ value using the red and blue color channel AC/DC ratio analysis. The Network-Based Model is a simple network model based on deep learning techniques, in which ${SpO}_{2}$ is estimated directly from the feature maps output by the RCA network.

3.2.1. Color channel model

The structure of the Color Channel Model is shown in Fig. 1. The module consists of the image generator and the ${SpO}_{2}$ Estimator1. To calculate ${SpO}_{2}$ using the signals from the red and blue color channels, the features $F^{″}$ learned by the RCA network are first reconstructed into the RGB feature map $feat$ by the image generator and then fed into the ${SpO}_{2}$ Estimator1 to obtain ${SpO}_{2_{CCM}}$ . In addition, the loss functions $Loss_{img}$ and $Loss_{CCM_S}$ are constructed for the feature map $feat$ and the estimated value ${SpO}_{2_{CCM}}$ respectively to improve the ${SpO}_{2}$ estimation accuracy.

The feature $F^{″}$ is first fed into the image generator to reconstruct the RGB image. The structure of the image generator is shown in Fig. 3. Specifically, $F^{″}$ is first fed into the residual block. The result is then passed to a transposed convolution with a kernel size of $[3, 3]$ , named ${ConvT}_{3 \times 3}$ , for upsampling, followed by normalization and non-linear activation to obtain $I^{'}$ . Finally, $I^{'}$ is fed into a convolution with a kernel size of $[7, 7]$ , named ${Conv}_{7 \times 7}$ , to obtain the RGB feature map $feat$ , whose size is 3 × 224 × 224.

Then, the reconstructed $feat$ is fed into ${SpO}_{2}$ Estimator1. Here, the red and blue channel signals are converted into cardiovascular pulse wave signals that stand in for the two wavelengths (660 nm and 940 nm) used in the pulse ${SpO}_{2}$ meter [4]. ${SpO}_{2}$ is then estimated from the AC/DC ratio, as shown in Equation (9):

$${SpO}_{2_{CCM}} = A - B \frac{AC_{RED} / DC_{RED}}{AC_{BLUE} / DC_{BLUE}} \tag{9}$$

Here, $A C_{RED}$ and $A C_{BLUE}$ represent the standard deviation of the red and blue channels respectively. $D C_{RED}$ and $D C_{BLUE}$ represent the mean of the red and blue channels respectively, with fixed coefficients A = 125 and B = 26 based on the empirical evaluation [4].
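Equation (9) can be sketched directly from these definitions. The example below assumes that each channel has already been reduced to one value per frame (for instance the spatial mean of that channel, which is an assumption rather than a detail stated in the text) and uses the population standard deviation for the AC term.

```python
from statistics import mean, pstdev
from typing import Sequence

A_COEFF = 125.0  # fixed coefficient A from the text
B_COEFF = 26.0   # fixed coefficient B from the text


def spo2_ratio_of_ratios(red: Sequence[float], blue: Sequence[float]) -> float:
    """Estimate SpO2 as A - B * (AC_red / DC_red) / (AC_blue / DC_blue)."""
    ac_red, dc_red = pstdev(red), mean(red)      # AC: standard deviation, DC: mean
    ac_blue, dc_blue = pstdev(blue), mean(blue)
    if dc_red == 0 or dc_blue == 0 or ac_blue == 0:
        raise ValueError("channel traces must have non-zero mean and blue variation")
    ratio = (ac_red / dc_red) / (ac_blue / dc_blue)
    return A_COEFF - B_COEFF * ratio
```

When both channels have the same relative pulsatility, the ratio equals 1 and the estimate is A - B = 99.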

In this section, loss functions are designed for $feat$ and $Sp O_{2_{CCM}}$ respectively. In order to learn more ${SpO}_{2}$ -related information and preserve original image features as much as possible during the RCA network learning and feature map reconstruction, the L1Loss function is used to evaluate the loss between the reconstructed RGB feature map $feat$ and the original map $img$ to obtain $Los s_{img}$ . H, W, and C represent the image height, width, and number of channels respectively.

$$Loss_{img}(img, feat) = \frac{1}{H \times W \times C} \sum_{i = 1}^{H} \sum_{j = 1}^{W} \sum_{c = 1}^{C} |img_{i, j, c} - feat_{i, j, c}| \tag{10}$$
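Equation (10) is the mean absolute difference over all pixels and channels. A minimal sketch, assuming both images are nested lists indexed as `image[row][col][channel]` with identical shapes:

```python
from typing import List

Image = List[List[List[float]]]  # indexed as image[row][col][channel]


def l1_image_loss(img: Image, feat: Image) -> float:
    """Mean absolute error between the original and the reconstructed image."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(img, feat):
        for px_a, px_b in zip(row_a, row_b):
            for a, b in zip(px_a, px_b):
                total += abs(a - b)   # |img - feat| for one channel of one pixel
                count += 1            # ends up equal to H * W * C
    return total / count
```

The loss is zero only when the reconstruction matches the input at every pixel, which is what pushes the image generator to preserve the original image content.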

Besides, for guiding the Color Channel Model learning to obtain more accurate ${SpO}_{2}$ estimation values, the Smooth L1 loss function is used to evaluate the loss between $Sp O_{2_{CCM}}$ and the ground truth $SpO 2_{gt}$ to obtain $Los s_{C C M_S}$ :

$$Loss_{CCM_S}(x, y) = \frac{1}{L} \sum_{i = 1}^{L} \begin{cases} 0.5 \times {(y_{i} - x_{i})}^{2}, & \text{if } |y_{i} - x_{i}| < 1 \\ |y_{i} - x_{i}| - 0.5, & \text{otherwise} \end{cases} \tag{11}$$

Here, L represents the number of frames in the input video, and $x$ and $y$ represent the estimated value ${SpO}_{2_{CCM}}$ and the ground truth $SpO2_{gt}$ respectively. When the absolute difference between $x$ and $y$ is less than 1, the L2 loss is used. When the difference is larger, a shifted L1 loss is used. Because the Smooth L1 loss has a constant gradient when the difference between $x$ and $y$ is large, it avoids the problem of large gradients destroying the training parameters in the L2 loss. When the difference is small, the gradient decreases dynamically, which avoids the convergence difficulty of the L1 loss.
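The piecewise rule of Equation (11) can be written as a short function. This is a minimal sketch on plain Python sequences with the threshold fixed at 1, as in the equation. Deep learning frameworks provide their own implementations, which are not used here.

```python
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average Smooth L1 loss between predictions x and targets y."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(pred, target):
        diff = abs(y - x)
        if diff < 1.0:
            total += 0.5 * diff * diff   # quadratic branch for small errors
        else:
            total += diff - 0.5          # linear branch for large errors
    return total / len(pred)
```

An error of 0.5 percentage points contributes 0.125, while an error of 3 points contributes 2.5, so large errors grow linearly rather than quadratically.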

With $λ_{img}$ and $λ_{CCM_S}$ as balancing parameters that adjust the importance of $Loss_{img}$ and $Loss_{CCM_S}$ , the total loss of the Color Channel Model for ${SpO}_{2}$ estimation is denoted $Loss_{CCM}$ :

$$Loss_{CCM} = λ_{img} \times Loss_{img} + λ_{CCM_S} \times Loss_{CCM_S} \tag{12}$$

3.2.2. Network-based model

The structure of the Network-Based Model is shown in Fig. 1. $F^{″}$ learned by the RCA network is fed into the ${SpO}_{2}$ Estimator2 to calculate the ${SpO}_{2}$ value $Sp O_{2_{NBM}}$ . The two-part loss functions including $Los s_{label}$ and $Los s_{N B M_S}$ between $Sp O_{2_{NBM}}$ and the ${SpO}_{2}$ ground truth $SpO 2_{gt}$ monitored by the pulse oximeter are constructed to optimize the model, so that the model can focus on more ${SpO}_{2}$ information and improve the accuracy of ${SpO}_{2}$ estimation.

The structure of ${SpO}_{2}$ Estimator2 is shown in Fig. 4. First, global average pooling is applied to the high-dimensional features $F^{″}$ generated by the RCA network, replacing the fully connected operation to reduce the network parameters and overfitting. The result is then fed into a convolution named ${Conv}_{1 \times 1}$ with a kernel size of $[1, 1]$ to compress the number of channels, increase the nonlinearity of the network and obtain a feature map with 100 channels. This feature map is then activated by the $SoftMax$ function to obtain $F^{pre}$ . Since ${SpO}_{2}$ can take values from 1 % to 100 %, the sequential vector $S = [1, 2, \ldots, 100]$ is constructed and multiplied with $F^{pre}$ to obtain the estimated value ${SpO}_{2_{NBM}}$ .

$$F^{pre} = \mathrm{SoftMax}(\mathrm{Conv}_{1 \times 1}(\mathrm{GAP}(F^{″}))) \tag{13}$$

$${SpO}_{2_{NBM}} = \sum_{i = 1}^{100} S_{i} \times F_{i}^{pre} \tag{14}$$
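Equation (14) is the expected value of the levels 1 to 100 under the SoftMax distribution. A minimal sketch, assuming the 100 pre-activation channel outputs are given as a list named `logits` (a hypothetical name):

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable SoftMax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_spo2(logits: Sequence[float]) -> float:
    """Sum of level * probability over the levels S = [1, 2, ..., 100]."""
    if len(logits) != 100:
        raise ValueError("expected one logit per SpO2 level from 1 to 100")
    probs = softmax(logits)
    return sum(level * p for level, p in zip(range(1, 101), probs))
```

A uniform distribution gives 50.5, while a distribution concentrated on the 98th channel gives an estimate close to 98. The output is therefore continuous even though the levels are integers.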

Fig. 4 — Structure of ${SpO}_{2}$ Estimator2.

The face videos contain not only ${SpO}_{2}$ -related signals but also other physiological signals. To encourage the network to learn more ${SpO}_{2}$ signals than other signals, a two-part loss function is designed for $F^{pre}$ and ${SpO}_{2_{NBM}}$ in this section. First, the ${SpO}_{2}$ ground truth $SpO2_{gt}$ is processed by the label distribution learning technique [33] to obtain a matrix label $l$ of the same size as $F^{pre}$ . The Kullback-Leibler divergence [34] (KLD) loss $Loss_{label}$ between $F^{pre}$ and $l$ is constructed so that the real label $l$ directs $F^{pre}$ to focus on more ${SpO}_{2}$ information. The rule for calculating $Loss_{label}$ is shown in Equation (15).

$$Loss_{label}(l \| F^{pre}) = \sum_{i = 1}^{T} \sum_{j = 1}^{100} [l_{ij} \log l_{ij} - l_{ij} \log F_{ij}^{pre}] \tag{15}$$
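The inner sum of Equation (15) for a single frame can be sketched as follows. The text does not specify how the label distribution is built, so the Gaussian shape centered on the ground truth and the width `sigma` are assumptions made for illustration only.

```python
import math
from typing import List, Sequence

LEVELS = range(1, 101)  # SpO2 levels 1..100, matching the vector S


def gaussian_label(spo2_gt: float, sigma: float = 1.0) -> List[float]:
    """Assumed label distribution: a normalized Gaussian centered on the ground truth."""
    weights = [math.exp(-((k - spo2_gt) ** 2) / (2.0 * sigma ** 2)) for k in LEVELS]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(label: Sequence[float], pred: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Sum of l * (log l - log p); terms with l == 0 contribute nothing."""
    return sum(l * (math.log(l) - math.log(max(p, eps)))
               for l, p in zip(label, pred) if l > 0.0)
```

The divergence is zero when the predicted distribution equals the label distribution and grows as probability mass moves away from the ground truth level.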

In addition, to guide the Network-Based Model toward more accurate ${SpO}_{2}$ estimates, the Smooth L1 loss function is used to evaluate the loss between ${SpO}_{2_{NBM}}$ and the ground truth $SpO2_{gt}$ to obtain $Loss_{NBM_S}$ . It follows the same rules as Equation (11).

$$Loss_{NBM_S}(x, y) = \frac{1}{L} \sum_{i = 1}^{L} \begin{cases} 0.5 \times {(y_{i} - x_{i})}^{2}, & \text{if } |y_{i} - x_{i}| < 1 \\ |y_{i} - x_{i}| - 0.5, & \text{otherwise} \end{cases} \tag{16}$$

Here, $x$ and $y$ represent the estimated value $Sp O_{2_{NBM}}$ and the ground truth $SpO 2_{gt}$ respectively.

With $λ_{label}$ and $λ_{NBM_S}$ as balancing parameters that adjust the importance of $Loss_{label}$ and $Loss_{NBM_S}$ , the total loss of the Network-Based Model for ${SpO}_{2}$ estimation is denoted $Loss_{NBM}$ :

$$Loss_{NBM} = λ_{label} \times Loss_{label} + λ_{NBM_S} \times Loss_{NBM_S} \tag{17}$$

3.2.3. Multi-Model fusion model

To help the network learn ${SpO}_{2}$ features from multiple aspects and to improve the accuracy of ${SpO}_{2}$ estimation, the Color Channel Model is combined with the Network-Based Model to construct a Multi-Model Fusion Model (MMFM). Using the Smooth L1 loss function, a new loss named $Loss_{MMFM}$ is constructed between the estimated values ${SpO}_{2_{CCM}}$ and ${SpO}_{2_{NBM}}$ of the two models. This allows the two models to guide each other, making full use of the features each focuses on and the complementarity between them.

$$Loss_{MMFM}(x, y) = \frac{1}{L} \sum_{i = 1}^{L} \begin{cases} 0.5 \times {(y_{i} - x_{i})}^{2}, & \text{if } |y_{i} - x_{i}| < 1 \\ |y_{i} - x_{i}| - 0.5, & \text{otherwise} \end{cases} \tag{18}$$

Here, $x$ and $y$ represent ${SpO}_{2_{CCM}}$ and ${SpO}_{2_{NBM}}$ respectively. The loss follows the same calculation rules as Equation (11) and Equation (16). The importance of $Loss_{NBM}$ , $Loss_{CCM}$ and $Loss_{MMFM}$ is adjusted by the balancing parameters $λ_{NBM}$ , $λ_{CCM}$ and $λ_{MMFM}$ , respectively.

$$Loss_{total} = λ_{NBM} \times Loss_{NBM} + λ_{CCM} \times Loss_{CCM} + λ_{MMFM} \times Loss_{MMFM} \tag{19}$$

4. Implementation processes

This section introduces the implementation processes of the remote multi-model fusion ${SpO}_{2}$ estimation model based on deep learning proposed in this paper. The model is divided into three parts: facial video preprocessing, the RCA network model training and ${SpO}_{2}$ estimation.

4.1. Video preprocessing

In this paper, the PURE and VIPL-HR datasets are chosen for experiments, which contain face video sequences, timestamped text and ${SpO}_{2}$ -valued text corresponding to video frames. All video sequences need to be processed by face detection, localization (Fig. 5 ) and normalization. The specific steps are as follows.

Step 1. Face detection. Firstly, the images are input to a multi-task convolutional neural network (MTCNN) [35] for face detection and localization.

Step 2. Key landmark localization. During the key landmark localization process, we pre-downloaded an open-source file containing the 81 landmarks [36] of the face. Subsequently, we take the file as an argument and call the method named shape_predictor of the Dlib library to achieve the localization of the key landmarks.

Step 3. Selection of the region of interest (ROI). The facial landmarks obtained in the previous step are used to locate the ROI: the maximum and minimum x and y coordinates of the 81 points are combined to form the four corner coordinates of the ROI, which is then cropped.

Step 4. The facial image sequence normalization. All the facial images in this paper are normalized to 224 × 224 pixel RGB images.
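The ROI selection in Step 3 reduces to taking the bounding box of the landmark coordinates. A minimal sketch, assuming landmarks are `(x, y)` integer pixel pairs and the image is a nested list indexed as `image[y][x]`. Face detection, landmark prediction and resizing are outside this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]          # (x, y) pixel coordinates of one landmark
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def roi_from_landmarks(landmarks: Sequence[Point]) -> Box:
    """Bounding box spanned by the extreme x and y landmark coordinates."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return min(xs), min(ys), max(xs), max(ys)


def crop(image: List[list], box: Box) -> List[list]:
    """Cut the inclusive box out of an image stored as rows of pixels."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]
```

With 81 landmarks covering the forehead as well as the lower face, the resulting box encloses the whole facial skin region before normalization to 224 × 224 pixels.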

4.2. RCA network model training

The processed facial image sequence $V \in R^{C \times L \times H \times W}$ is fed into the RCA network for feature extraction, where $C$ , $L$ , $H$ and $W$ represent the number of channels, length, height and width of the input video respectively.

$V$ is firstly subjected to $7 \times 7$ convolutional transform and pooling, and then high-dimensional features with strong representation ability are extracted by residual block cascade and coordinate attention module. The image information is learned and ${SpO}_{2}$ values are estimated from them by three ${SpO}_{2}$ estimation models respectively. The model convergence speed is accelerated by adjusting the optimizer, learning rate, loss function, and other hyperparameters to obtain the best results.

4.3. SpO₂ estimation

The high-dimensional features are extracted from the RCA network, and the ${SpO}_{2}$ values are estimated by Color Channel Model, Network-Based Model, and Multi-Model Fusion Model respectively. It should be mentioned that the estimated values of both models and the corresponding errors are obtained in the fusion estimation model. The estimated value with the smallest error is taken as the result of the fusion estimation model. The effect is compared with the other two models to select the best model.

In addition, to avoid frame redundancy, the frame number is selected according to the setting of batch_size. In other words, the median of ${SpO}_{2}$ ground truth in T frames is used as $SpO 2_{gt}$ . The estimated values are obtained by averaging the T frames of ${SpO}_{2}$ values output from the three models. It also reduces the error caused by single-frame computation without considering time characteristics and noise.
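The per-batch aggregation described above can be sketched as follows: the reference value is the median of the ground truth over T frames, and the estimate is the mean of the per-frame outputs. The sketch assumes that a trailing incomplete batch is simply dropped, which the text does not state; `batch_targets` is a hypothetical helper name.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple


def batch_targets(
    gt_per_frame: Sequence[float],
    est_per_frame: Sequence[float],
    batch_size: int,
) -> List[Tuple[float, float]]:
    """Return one (ground_truth, estimate) pair per full batch of frames."""
    if len(gt_per_frame) != len(est_per_frame):
        raise ValueError("ground truth and estimates must have the same length")
    pairs = []
    for start in range(0, len(gt_per_frame) - batch_size + 1, batch_size):
        stop = start + batch_size
        gt = median(gt_per_frame[start:stop])   # median of the reference values
        est = mean(est_per_frame[start:stop])   # mean of the per-frame estimates
        pairs.append((gt, est))
    return pairs


# Toy data with T = 3; the last frame does not fill a batch and is ignored.
gt = [97, 97, 98, 96, 96, 95, 99]
est = [97.0, 98.0, 96.0, 96.0, 95.0, 97.0, 99.0]
pairs = batch_targets(gt, est, batch_size=3)
```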

5. Experimental results

5.1. Experimental datasets

Most of the datasets used in the existing ${SpO}_{2}$ research literature were taken by the authors themselves and are not public, and most of them had fewer subjects and scenes. Our experiments are conducted on the public datasets of PURE [16] and VIPL-HR [17], [18] for easier comparison. Both contain sufficient study subjects and rich scenarios to verify the robustness of the model. In addition, using the public dataset can facilitate method comparisons for subsequent studies. In the experiments, two datasets are divided 6:4 between the training and test datasets, and the information about them is shown in Table 1 .

Table 1.

Information of PURE and VIPL-HR.

Dataset	Num	Scenes	Acquisition equipment	Frame rate
PURE	10	• Steady • Speck • Slow translate • Fast translate • Small rotate • Medium rotate	• eco274CVGE Camera	30fps
VIPL-HR	107	• Steady • Head-move • Speck • Dark • Bright • Remote • After skipping	• RealSense F200 color camera	30fps


Fig. 6 (a) and (b) show histograms of the distribution of ${SpO}_{2}$ values collected from the PURE and VIPL-HR datasets respectively. The range of ${SpO}_{2}$ collected from the PURE dataset is 89–99 and the range from VIPL-HR is 86–99.

5.2. Experimental settings

In this paper, the MTCNN [35] model is used to detect and locate the face region in the original video; the ROI is then selected and cropped, and the face images are normalized to 224 × 224 pixels. We conduct experiments on the PURE dataset using the CCM model as a benchmark. According to Table 1, the PURE dataset is small. A small dataset is more likely to show unstable performance during training when parameters change. Therefore, experiments on the PURE dataset are more representative. Table 2 shows the experimental results for different image sizes. In all tables, the bold values indicate minimal errors.

Table 2.

Comparative experimentation of image sizes.

Dataset/Image size	PURE
Dataset/Image size	MAE (%)	RMSE (%)
3 * 32 * 32	1.02	1.48
3 * 112 * 112	0.83	1.07
3 * 196 * 196	0.72	0.97
3 * 224 * 224	0.65	0.85
3 * 304 * 304	0.76	0.99
3 * 352 * 352	0.78	1.05
3 * 448 * 448	0.88	1.22


According to Table 2, as the image size increases, the images contain more texture and contextual information, which helps to capture better and more discriminative features. However, beyond a certain size the model complexity becomes too large, which tends to cause overfitting, and the experimental performance decreases instead. The computational overhead also grows accordingly. Based on these experiments, the input image size is set to 3 * 224 * 224, which gives the lowest MAE (0.65 %) and RMSE (0.85 %) in Table 2.

Then, the image sequences are fed into RCA network model to extract features, and ${SpO}_{2}$ values are calculated using each of the three methods. In order to select the optimal optimizer and learning rate, rich ablation experiments are implemented. Four optimizers (RMSProp, Adagrad, Adam, AdamW) with different learning rates are used for model training and testing. Fig. 7 shows the comparative experimentation of different learning rates and optimizers.

From Fig. 7, when the RMSProp optimizer is used with a learning rate of 0.02 or 0.05, the model has difficulty converging. When the Adam or AdamW optimizer is used with a learning rate of 0.05, the model also has difficulty converging. Overall, a learning rate of 0.01 is optimal. During the training process, the Adam optimizer is used to continuously adjust the learning rate to achieve the best results.

In general, the optimal experimental setup is determined through extensive experiments. The $Loss_{total}$ in Section 3.2.3 is used as the loss function. The Adam optimizer with a weight decay of 1e−4 is employed. The learning rate is set to 0.01. The ReduceLROnPlateau scheduler with a patience of 20 and a factor of 0.1 is employed. The maximal epoch number and early stopping counter are set to 100 and 20, respectively. The batch_size is set to 50 frames on the PURE dataset. Then, the converged model is migrated to the VIPL-HR dataset, and some hyperparameters are fine-tuned. The network models are all implemented in PyTorch, and the graphics card used for the experiments is an NVIDIA GTX 1080 Ti.
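How the maximal epoch number (100) and the early stopping counter (20) interact can be illustrated with a framework-free sketch. It assumes that stopping is driven by a validation loss that must strictly improve; the monitored quantity is not stated in the text, and `run_training` and the synthetic loss curve are hypothetical.

```python
from typing import Iterable, Tuple

MAX_EPOCHS = 100  # maximal epoch number from the setup above
PATIENCE = 20     # early stopping counter from the setup above


def run_training(validation_losses: Iterable[float]) -> Tuple[int, float]:
    """Consume one validation loss per epoch; return (epochs_run, best_loss)."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if epoch > MAX_EPOCHS:
            break  # epoch budget exhausted
        epochs_run = epoch
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= PATIENCE:
                break  # no improvement for PATIENCE consecutive epochs
    return epochs_run, best_loss


# Synthetic loss curve: improves for 30 epochs, then plateaus at a worse value.
losses = [1.0 / e for e in range(1, 31)] + [0.5] * 200
epochs_run, best = run_training(losses)
```

With this synthetic curve the loop stops 20 epochs after the last improvement rather than running to the epoch limit.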

5.3. Evaluation metrics

In order to verify the validity of the model, the mean absolute error (MAE) and root mean square error (RMSE) are used to present the experimental results, and the MAE and RMSE are calculated as shown in Equation (20), (21):

$$MAE = \frac{\sum_{i=1}^{L} \left| SpO2_{pre}(i) - SpO2_{gt}(i) \right|}{L} \tag{20}$$

$$RMSE = \sqrt{\frac{\sum_{i=1}^{L} \left( SpO2_{pre}(i) - SpO2_{gt}(i) \right)^{2}}{L}} \tag{21}$$

Here, L represents the length of the input video frame, $SpO 2_{pre}$ represents the estimated value of ${SpO}_{2}$ , and $SpO 2_{gt}$ represents the ${SpO}_{2}$ ground truth.
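Equations (20) and (21) translate directly into code. The following is a minimal sketch using only the standard library; the function name `mae_rmse` and the sample values are illustrative and not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def mae_rmse(pred: Sequence[float], gt: Sequence[float]) -> Tuple[float, float]:
    """Return (MAE, RMSE) between estimated and ground-truth SpO2 values."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - g for p, g in zip(pred, gt)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Errors are -1, 0, +1, -2, so MAE = 4 / 4 and RMSE = sqrt(6 / 4).
mae, rmse = mae_rmse([97.0, 96.0, 98.0, 95.0], [98.0, 96.0, 97.0, 97.0])
```

Because RMSE squares each error before averaging, it is never smaller than MAE and penalizes the single 2 % miss more heavily.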

5.4. Experiment and discussion

As none of the studies on contactless estimation of ${SpO}_{2}$ have published their datasets and algorithms, this paper uses a clinical criterion to measure the model's validity. If the MAE between the estimated ${SpO}_{2}$ values of the model and the ${SpO}_{2}$ ground truth is within 2 %, the estimate is considered valid and the model reliable [15].

5.4.1. Max-min coordinate face image cropping

In this paper, the maximum and minimum values of the x and y coordinates of the 81 facial key points are combined into the four corner coordinates of the ROI. The face is cropped using these coordinates, which preserves the most effective area of the face and reduces the effect of partial occlusion. To verify this, we compare full-face cropping (all_face) with maximum-minimum boundary cropping (m_face), using the Color Channel Model for verification. The results are shown in Table 3. ${SpO}_{2}$ values are given in %; therefore, MAE and RMSE are also in %, and all experimental tables are marked accordingly.

Table 3.

Comparative experimentation of facial cutting methods.

Dataset/Crop Type	PURE		VIPL-HR
Dataset/Crop Type	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
all_face	0.86	1.18	1.15	1.61
m_face	0.65	0.85	1.04	1.48


The proposed image cropping method calculates the ${SpO}_{2}$ values for the PURE and VIPL-HR datasets respectively. Fig. 8 and Fig. 9 show the comparison of the ${SpO}_{2}$ estimation values and the reference values for the two datasets respectively. The horizontal label in Fig. 8 and Fig. 9 is “Batch”. It means the ${SpO}_{2}$ value calculated for the nth batch of the image sequences. In order to closely reflect the true human ${SpO}_{2}$ status, the estimated ${SpO}_{2}$ are retained to two decimal places. The retention of these two decimals gives rise to reasonable fluctuations in Fig. 8 and Fig. 9.

It is worth noting the abrupt changes in Fig. 8 and Fig. 9 that the model failed to detect, such as the point around 80 in Fig. 8. Analysing the original dataset reveals the reason. During the construction of the dataset, there are discontinuities and interruptions when shooting different scenes of the same subject. During these intervals, the ${SpO}_{2}$ status may change, so the data within the dataset may also change abruptly. As the datasets are fed into the model continuously as a whole for training and testing, the model does not immediately capture the sudden changes in ${SpO}_{2}$ values caused by scene switching, although such cases are relatively rare. In addition, the undetected abrupt changes are within 1 % and return to the pre-change ${SpO}_{2}$ status rapidly. In practical scenarios, the permissible error range is 2 %, which indicates that the estimation error due to the switching of subjects and scenes is still acceptable. Moreover, regardless of the cause, when the magnitude of a change exceeds 1 %, our model is still able to capture it relatively quickly and estimate close ${SpO}_{2}$ values, for example the drop in ${SpO}_{2}$ values around point 120 in Fig. 8 and the rise around point 160 in Fig. 9. This indicates that our model is effective and has practical value.

Besides, since the normal range of ${SpO}_{2}$ values in the human body is 95 %–100 %, values below 95 % are considered low ${SpO}_{2}$ values. The proportion of low ${SpO}_{2}$ values is relatively small. The PURE dataset captures ${SpO}_{2}$ values once per frame at a frame rate of 30 fps, and its percentage of low ${SpO}_{2}$ values is 4.9 %. In contrast, the VIPL-HR dataset collects ${SpO}_{2}$ values once every second, and its percentage of low ${SpO}_{2}$ values is 5.8 %. However, in clinical scenarios, low ${SpO}_{2}$ samples need to be given more attention. Therefore, in addition to the overall ${SpO}_{2}$ estimation errors for the PURE and VIPL-HR datasets, the low ${SpO}_{2}$ values (<95 %) are extracted separately and their errors are calculated against the ${SpO}_{2}$ values estimated by the Color Channel Model. The resulting MAE is 1.28 % on the PURE dataset and 1.57 % on the VIPL-HR dataset. Even in the low ${SpO}_{2}$ scenarios, the MAEs are still below 2 % and meet the clinical target, which demonstrates that our model is valid for low ${SpO}_{2}$ situations as well.
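The low-${SpO}_{2}$ evaluation amounts to filtering the test samples by their ground truth before computing the error. A minimal sketch follows, assuming per-sample lists of estimates and reference values; the helper names and the sample numbers are illustrative only.

```python
from typing import Sequence

LOW_SPO2_THRESHOLD = 95.0  # ground-truth values below this count as low SpO2


def low_spo2_share(gt: Sequence[float], threshold: float = LOW_SPO2_THRESHOLD) -> float:
    """Fraction of ground-truth samples that fall below the threshold."""
    return sum(1 for g in gt if g < threshold) / len(gt)


def low_spo2_mae(
    pred: Sequence[float],
    gt: Sequence[float],
    threshold: float = LOW_SPO2_THRESHOLD,
) -> float:
    """MAE restricted to samples whose ground truth is below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g < threshold]
    if not pairs:
        raise ValueError("no low-SpO2 samples in the input")
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


# Toy data: two of the five reference values (94 and 92) are below 95.
gt = [97, 94, 96, 92, 98]
pred = [96.5, 95.0, 96.0, 93.5, 97.0]
share = low_spo2_share(gt)
subset_mae = low_spo2_mae(pred, gt)
```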

5.4.2. RCA network

In this section, extensive ablation experiments are conducted on the RCA network model and on each module and parameter inside it. The effectiveness of the model and of each module is verified on the PURE and VIPL-HR datasets using the Color Channel Model.

First, the setting of batch_size affects the experimental results and their reasonableness. Each batch yields one ${SpO}_{2}$ estimate and the average of the corresponding batch_size ${SpO}_{2}$ ground-truth values, from which the overall MAE and RMSE on the test sets of PURE and VIPL-HR are calculated. Due to hardware limitations, too large a batch_size leaves insufficient memory for training. Too small a batch_size means that too little data is trained in a single batch, which may lead to large experimental errors and unstable results and makes it difficult to extract more global features. Therefore, we choose multiples of 10 in an appropriate range for the ablation experiments. Besides, due to the experimental setup, the minimum value of batch_size is 15. Fig. 10 shows the comparison experiments for different batch_size values.

From Table 1 and Fig. 10, the PURE dataset is small and it is significantly affected by batch_size. In the process of randomly selecting a small batch_size and training, some uncontrollable factors are generated, resulting in fluctuations in the curve. Nevertheless, as the batch_size increases, PURE as a whole also produces the effect within the expectation, which is consistent with the training effect of large dataset. Finally, we choose 50 as batch_size after several experimental comparisons. Combined with the frame rate of the dataset, it can be considered reasonable to do the estimation and error calculation every less than 2 s with this setting. Subsequent experiments were set to a batch size of 50 frames.
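The "less than 2 s" interval follows directly from the batch size of 50 frames and the 30 fps frame rate listed in Table 1:

$$t_{batch} = \frac{50\ \text{frames}}{30\ \text{frames/s}} \approx 1.67\ \text{s} < 2\ \text{s}$$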

Afterward, to verify the effectiveness of the RCA network and its internal modules for extracting ${SpO}_{2}$ features, the RCA network is replaced by the feature extraction modules of classical deep neural networks (ResNet-50 [31] and Inception v3 [37]). FLOPs and Params are used to measure the time complexity and memory complexity of the model, respectively. FLOPs refers to the number of floating-point operations, i.e. the amount of computation, and can be used to measure the complexity of an algorithm or a model. Params refers to the number of parameters, which reflects the amount of memory used. Then, ablation experiments for the residual blocks (RB) and the coordinate attention (CA) in the RCA are conducted. The above experiments are based on the Color Channel Model. According to Table 4, the RCA network outperforms the other networks in terms of feature extraction capability. In addition, the RCA network has lower time and space complexity than ResNet-50 and Inception v3, which means that it is more efficient and uses less memory. Our proposed model therefore does not place high demands on the capturing equipment or the computing environment. Besides, both RB and CA effectively improve the accuracy of ${SpO}_{2}$ estimation and reduce MAE and RMSE. RB reduces information attrition and improves the learning capability, while CA focuses on feature channels and location information and enhances the feature representation.
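To make the two complexity measures concrete, the sketch below counts parameters and multiply-accumulate operations (the GMac unit used in Table 4) for a single convolution layer. It is a generic textbook calculation, not the profiling tool used in the paper; the layer configuration (3 to 64 channels, stride 2, padding 3) is a hypothetical stem chosen for illustration and is not claimed to be the RCA network's actual first layer.

```python
from typing import Tuple


def conv2d_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_cost(
    in_channels: int,
    out_channels: int,
    kernel: int,
    input_size: int,
    stride: int = 1,
    padding: int = 0,
    bias: bool = True,
) -> Tuple[int, int]:
    """Return (params, macs) for a square-kernel 2D convolution on a square input."""
    weights = kernel * kernel * in_channels * out_channels
    params = weights + (out_channels if bias else 0)
    out_size = conv2d_output_size(input_size, kernel, stride, padding)
    macs = weights * out_size * out_size  # every weight is applied once per output position
    return params, macs


# Hypothetical 7x7 stem on a 224x224 RGB input (illustrative configuration only).
params, macs = conv2d_cost(3, 64, kernel=7, input_size=224, stride=2, padding=3, bias=False)
print(f"Params: {params / 1e6:.4f} M, MACs: {macs / 1e9:.3f} GMac")
```

Summing such per-layer counts over a whole network yields totals of the kind reported in the FLOPs and Params columns.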

Table 4.

Ablation experiments with RCA network.

Dataset/Model	FLOPs (GMac)	Params (M)	PURE		VIPL-HR
Dataset/Model	FLOPs (GMac)	Params (M)	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
ResNet-50 [31]	5.12	24.05	1.02	1.49	1.50	1.77
Inception v3 [37]	3.67	22.68	1.04	1.47	1.14	1.51
RCA w/o CA	2.048	1.077	0.74	1.05	1.11	1.50
RCA w/o RB	0.94	0.11	1.02	1.44	1.15	1.53
RCA	2.05	1.08	0.65	0.85	1.04	1.48


Next, the selection experiment of the number of cascades for the cascaded residual network is conducted. According to Table 5 , the best results were obtained when two residual blocks were used for cascading, extracting the most effective high-dimensional features. Besides, using two residual blocks does not unduly increase FLOPs and Params.

Table 5.

Cascade number selection experiment for cascade residual blocks.

Dataset/RB num	FLOPs (GMac)	Params (M)	PURE		VIPL-HR
Dataset/RB num	FLOPs (GMac)	Params (M)	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
1	1.4	0.26	0.79	1.02	1.12	1.54
2	2.05	1.08	0.65	0.85	1.04	1.48
3	2.69	4.36	0.89	1.15	1.10	1.50
4	3.33	17.47	1.02	1.44	1.15	1.52


After determining the experimental setup and the structure of the RCA model, visualisations were used to show the facial regions that the RCA model learned and focused on.

First, a set of images of faces with glasses and hair coverings and their attention maps are shown in Fig. 11. As can be seen, the RCA network focuses more on the cheek and forehead regions. Meanwhile, the model pays less attention to the eyes, eyeglasses, hair, lips and chin.

In addition, visualisations for scenes with facial rotation, movement and speaking are shown in Fig. 12. As can be seen, the RCA network also reduces its focus on teeth and beards. In non-stationary scenes, the RCA network still accurately focuses on the cheek and forehead regions.

In general, ${SpO}_{2}$ needs to be estimated from bare skin. Not focusing on disruptions such as beards, teeth, hair, eyes and eyeglasses helps to improve the accuracy of the model in detecting ${SpO}_{2}$ .

Besides, to explore the effectiveness of deep learning in greater depth, a ${SpO}_{2}$ estimation experiment with traditional color channel segmentation is designed for comparison. The red and blue channel of the original image sequence are separated and fed into the RCA network separately to extract the signal features. The corresponding standard deviation and mean values were calculated respectively, and the ${SpO}_{2}$ values were calculated by substituting into the ${SpO}_{2}$ calculation Equation (9). Table 6 shows the experimental results, which demonstrate the effectiveness of deep learning in terms of the ${SpO}_{2}$ estimation accuracy.
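The traditional channel-based calculation can be sketched as below. Equation (9) is not reproduced in this section, so the sketch assumes it has the common ratio-of-ratios form, ${SpO}_{2} = A - B \cdot R$ with $R$ the ratio of (standard deviation / mean) of the red signal to that of the blue signal. The constants `A_PLACEHOLDER` and `B_PLACEHOLDER` are arbitrary placeholders, not calibration values from the paper, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence

A_PLACEHOLDER = 100.0  # placeholder calibration constant, NOT from the paper
B_PLACEHOLDER = 5.0    # placeholder calibration constant, NOT from the paper


def ratio_of_ratios(red: Sequence[float], blue: Sequence[float]) -> float:
    """(std_red / mean_red) / (std_blue / mean_blue) for two channel signals."""
    return (pstdev(red) / mean(red)) / (pstdev(blue) / mean(blue))


def spo2_from_channels(
    red: Sequence[float],
    blue: Sequence[float],
    a: float = A_PLACEHOLDER,
    b: float = B_PLACEHOLDER,
) -> float:
    """Assumed linear calibration SpO2 = a - b * R; a and b must be fitted empirically."""
    return a - b * ratio_of_ratios(red, blue)


# Toy signals: same fluctuation amplitude, red mean twice the blue mean, so R = 0.5.
red_signal = [100, 102, 100, 98]
blue_signal = [50, 52, 50, 48]
r = ratio_of_ratios(red_signal, blue_signal)
spo2 = spo2_from_channels(red_signal, blue_signal)
```

Population standard deviation is used here; since the same estimator is applied to both channels, the choice between population and sample standard deviation cancels out in the ratio.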

Table 6.

Red and blue color channel method selection experiment.

Dataset/Channel Method	PURE		VIPL-HR
Dataset/Channel Method	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
Tradition Color Model	0.92	1.17	1.18	1.60
CCM	0.65	0.85	1.04	1.48


5.4.3. Comparative study of three ${SpO}_{2}$ estimation models

After the experiments in the previous section, the RCA network structure is determined. The three ${SpO}_{2}$ estimation models are then trained and tested separately; the results are shown in Table 7.

Table 7.

${SpO}_{2}$ estimation model comparison experiment.

Dataset/Estimate Model	FLOPs (GMac)	Params (M)	PURE		VIPL-HR
Dataset/Estimate Model	FLOPs (GMac)	Params (M)	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
CCM	2.05	1.08	0.65	0.85	1.04	1.48
NBM	1.00	0.70	0.86	1.04	1.10	1.50
MMFM	3.05	1.77	0.63	0.80	1.00	1.43


As can be seen from the results, the MMFM has the lowest error in ${SpO}_{2}$ estimation, the conventional CCM has an intermediate error, and the NBM has the highest error, but all of them are within the 2 % clinical target. The MMFM effectively combines the advantages of the NBM and the CCM, making full use of the characteristics of and complementarity between the two models to achieve better estimation results. The complexity of MMFM is the sum of the complexities of CCM and NBM.
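As a quick consistency check of the complexity statement against the entries of Table 7 (a worked sum, not an additional result):

$$FLOPs_{MMFM} = 2.05 + 1.00 = 3.05\ \text{GMac}, \qquad Params_{MMFM} \approx 1.08 + 0.70 = 1.78\ \text{M}$$

The FLOPs sum matches the reported 3.05 GMac, and the parameter sum differs from the reported 1.77 M by only 0.01 M, which is presumably a rounding effect of the two-decimal table entries.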

Besides, to evaluate the real-time performance and latency, further experiments are carried out. Specifically, we measure the time required to load the pre-trained model and the time required to load batch_size (50) frames and feed them into the models to estimate ${SpO}_{2}$. Separate experiments are conducted on the PURE and VIPL-HR datasets. Both have similar loading and estimating times, so their average is reported. For the Color Channel Model, the loading time is about 1.1 s and the estimating time is about 0.5 s. For the Network-Based Model, the loading time is about 0.4 s and the estimating time is about 0.5 s. For the Multi-Model Fusion Model, the loading time is about 1.6 s and the estimating time is about 0.5 s. Overall, the CCM, NBM and MMFM all take around 0.5 s to estimate ${SpO}_{2}$ after the pre-trained models are loaded. A delay of 0.5 s is acceptable in terms of timeliness.

In addition, ablation experiments on the loss functions are conducted to verify the effectiveness of $Loss_{img}$ and $Loss_{label}$ for CCM, NBM and MMFM respectively. The experimental results are shown in Table 8, which confirm that the reduced image loss and the KLD loss help to improve the model estimation accuracy and reduce its error.

Table 8.

Loss function ablation experiments.

Model	$Loss_{img}$	$Loss_{label}$	PURE		VIPL-HR
Model	$Loss_{img}$	$Loss_{label}$	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
CCM	✗	—	0.81	1.05	1.15	1.55
CCM	✓	—	0.65	0.85	1.04	1.48
NBM	—	✗	0.89	1.13	1.19	1.56
NBM	—	✓	0.86	1.04	1.10	1.50
MMFM	✗	✗	0.87	1.06	1.18	1.59
MMFM	✓	✗	0.77	1.01	1.08	1.46
MMFM	✗	✓	0.82	0.99	1.14	1.48
MMFM	✓	✓	0.63	0.80	1.00	1.43


5.4.4. Sub-scene verification

To analyze whether the RCA model still maintains better performance under head motion states and different lighting conditions, three ${SpO}_{2}$ estimation models will be individually trained and validated for several unstable scenes in the PURE and VIPL-HR datasets respectively to improve the robustness of the model. The experimental results are shown in Fig. 13 and Fig. 14 .

According to Fig. 13, the estimation errors of MMFM are lower than those of NBM and CCM in different scenes of the PURE dataset. NBM outperforms CCM only in the steady scenes and performs worse than CCM in the rest of the unstable scenes.

According to Fig. 14, each of the three estimation models has advantages and disadvantages in different scenarios in the VIPL-HR dataset, but all of them are relatively small in difference. Based on the presentation of the dataset in Table 1, it can be seen that VIPL-HR is large compared to PURE. In the smaller PURE dataset, the results of the three estimation models differ significantly and the Color Channel Model significantly outperforms the Network-Based Model. In the larger VIPL-HR dataset, the difference between the three models is tiny. Through these experiments, this paper concludes that when the experimental samples are large enough and the learning objects are large enough, the features extracted from the RCA network can gradually be detected by the network model with comparable accuracy to that of the Color Channel Model. In addition, MMFM takes into account the advantages of CCM and NBM to achieve a more accurate result.

5.4.5. Comparison with other models

We summarise articles on contactless ${SpO}_{2}$ estimation over the years. From these, representative articles are selected to compare the experimental results with MMFM. Since the code and dataset of them are not public, the MAEs in these articles are extracted for comparison as shown in Table 9 . The Num denotes the number of volunteers involved in the construction of these datasets. DL denotes Deep Learning. Area denotes the area of the body to be photographed.

Table 9.

Comparison of different contactless ${SpO}_{2}$ estimation methods.

Method	Dataset	Num	MAE (%)	DL	Area
[25] 2016	Self-built	9	1.00	✗	hands
[38] 2020	Self-built	9	0.85	✗	face
[39] 2020	Self-built	21	0.83	✗	face
[40] 2021	Self-built	25	1.7	✗	face
[12] 2021	Self-built	14	1.81	✓	hands
[41] 2022	BIDMC PPG [42]	52	1.45	✗	PPG signal
*[38] 2020	PURE	10	1.51	✗	face
*[38] 2020	VIPL-HR	107	2.32	✗	face
*[39] 2020	PURE	10	1.30	✗	face
*[39] 2020	VIPL-HR	107	2.02	✗	face
Ours-MMFM	PURE	10	0.63	✓	face
Ours-MMFM	VIPL-HR	107	1.00	✓	face


The literature [38] presented a ${SpO}_{2}$ monitoring method without physical contact with the patient using imaging photoplethysmography (iPPG). Firstly, the authors used a camera to capture videos and extracted the forehead area as an ROI. Then, they performed Eulerian video magnification (EVM) [43] on the facial videos to amplify the changes in skin color due to the heart cycle. Lastly, the red and blue channel ratio method is used to calculate ${SpO}_{2}$ values.

The literature [39] presented a ${SpO}_{2}$ monitoring method by using a video camera. The authors used the face detector provided by the Dlib library to detect faces in the images. Then, they select the forehead and left and right cheeks as three ROIs. To track the ROIs in the videos, they used the Kanade-Lucas-Tomasi (KLT) [44] tracking method. The next step was the analysis of the RGB signals coming from the ROIs marked in each video frame of a short sequence. To avoid noise, the signals are improved through a preprocessing phase. Next, they used Power Spectral Density (PSD) through Welch’s method [45] to select the most informative ROI. Finally, they used the chosen ROI to calculate ${SpO}_{2}$ values by the red and blue channel ratio method.

As the source code of the literature [38], [39] is not publicly available, we reproduce the model based on details from the literature and do comparative experiments on the PURE and VIPL-HR datasets. Therefore, we mark them with * in Table 9. By reproducing the models from the literature [38], [39] and conducting comparative experiments on the PURE and VIPL-HR datasets, the RCA network and the MMFM model are proven to be more effective in extracting ${SpO}_{2}$ -related feature from facial image sequences. Besides, it demonstrates the effectiveness of deep learning in the field of contactless ${SpO}_{2}$ measurement as well.

In addition, we conduct an experimental comparison with the method of [41], published in August 2022, which is very recent work. It utilizes PPG signals and machine learning technology to estimate ${SpO}_{2}$ and meets the clinical criterion. Nevertheless, MMFM attains a lower MAE in Table 9 (0.63 % on PURE and 1.00 % on VIPL-HR versus 1.45 %) and adapts better to the ${SpO}_{2}$ estimation task. Furthermore, this confirms the effectiveness of deep learning in the area of contactless ${SpO}_{2}$ detection.

Since there is no existing deep-learning-based method for ${SpO}_{2}$ estimation from face videos, it is important to compare our model with a simple deep neural network trained with a regression loss. Therefore, we run the ${SpO}_{2}$ estimation task on classical network models (ResNet-18 [31] and VGG16 [46]) with L1 loss and compare the results. According to Table 10, the MMFM has a slightly higher time complexity than ResNet-18, but the lowest space complexity of the three models. The MMFM also has the lowest estimation error. Overall, the estimation model of this paper is more applicable to the ${SpO}_{2}$ estimation task.

Table 10.

Deep learning model comparison experiments.

Dataset/Model	FLOPs (GMac)	Params (M)	PURE		VIPL-HR
Dataset/Model	FLOPs (GMac)	Params (M)	MAE (%)	RMSE (%)	MAE (%)	RMSE (%)
VGG16 [46]	15.38	72.34	2.00	2.66	2.94	3.21
ResNet-18 [31]	1.82	11.18	1.16	1.37	1.98	2.23
Ours-MMFM	3.05	1.77	0.63	0.80	1.00	1.43


6. Conclusion

Traditional contactless ${SpO}_{2}$ estimation methods require elaborate and detailed signal processing techniques to meet medical specifications. How to develop deep learning techniques for contactless ${SpO}_{2}$ estimation with better accuracy is the focus of our research. A multi-model fusion method based on deep learning is proposed in this paper. Firstly, maximum-minimum corner cropping of faces is proposed, which preserves the most effective information about faces and reduces the impact of facial occlusion. Next, we combine the residual network module and the coordinate attention module into the RCA network to obtain a high-dimensional feature map with stronger representation ability. Finally, the ${SpO}_{2}$ values are calculated by three ${SpO}_{2}$ estimation models (CCM, NBM and MMFM), and their advantages and disadvantages are compared. On the PURE and VIPL-HR datasets, the MAEs of all three models are below 2 %, so all of them meet the clinical requirement. This confirms the effectiveness of deep learning for contactless ${SpO}_{2}$ estimation from face videos. Additionally, the MMFM effectively combines the advantages of CCM and NBM. It fully uses the features of interest and the complementarity between the two models to achieve more accurate estimation results.

In future research, we will focus on how to construct deep learning models to obtain more accurate ${SpO}_{2}$ estimates. For example, we can design our network models based on large pre-trained models for transfer learning, and use recent Transformer-based variants instead of traditional convolution. Besides, we will focus on improving the ability to deal with small, transient changes in ${SpO}_{2}$ status for more valuable applications. Furthermore, how to minimize the redundancy of videos and how to achieve real-time monitoring with practical value are also directions for our future research. We will continue to improve model accuracy and performance and apply the method to telemedicine and home health.

CRediT authorship contribution statement

Min Hu: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Xia Wu: Conceptualization, Methodology, Software, Writing – original draft, Investigation, Data curation, Formal analysis, Writing – review & editing, Visualization. Xiaohua Wang: Conceptualization, Validation, Writing – review & editing, Supervision. Yan Xing: Conceptualization, Investigation, Validation, Resources, Supervision. Ning An: Conceptualization, Writing – review & editing, Resources, Supervision. Piao Shi: Validation, Resources, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by National Natural Science Foundation of China under Grant 62176084, and Grant 62176083, and in part by the Fundamental Research Funds for the Central Universities of China under Grant PA2021GDSK0092 and PA2022GDSK0066.

Data availability

Data will be made available on request.

References

1.O'Driscoll B.R., Howard L.S., Davison A.G. BTS guideline for emergency oxygen use in adult patients. Thorax. 2008;63(Suppl. 6):vi1–vi68. doi: 10.1136/thx.2008.102947. [DOI] [PubMed] [Google Scholar]
2.Starr N., Rebollo D., Asemu Y.M., Akalu L., Mohammed H.A., Menchamo M.W., Melese E., Bitew S., Wilson I., Tadesse M., Weiser T.G. Pulse oximetry in low-resource settings during the COVID-19 pandemic. Lancet Glob. Health. 2020;8(9):e1121–e1122. doi: 10.1016/S2214-109X(20)30287-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Şekeri K., et al. Data collection from blood glucose meter and anomaly detection. Karaelmas Fen ve Mühendislik Dergisi. 2017;7(2):428–433. doi: 10.7212/ZKUFBD.V7I2.664. [DOI] [Google Scholar]
4.Kong L., Zhao Y., Dong L., Jian Y., Jin X., Li B., Feng Y., Liu M., Liu X., Wu H. Non-contact detection of oxygen saturation based on visible light imaging device using ambient light. Opt. Express. 2013;21(15):17464. doi: 10.1364/OE.21.017464. [DOI] [PubMed] [Google Scholar]
5.Verkruysse W., Bartula M., Bresch E., Rocque M., Meftah M., Kirenko I. Calibration of contactless pulse oximetry. Anesth. Analg. Jan. 2017;124(1):136–145. doi: 10.1213/ANE.0000000000001381. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Duch W. Filter methods. Stud. Fuzziness Soft Comput. 2006;207:89–117. doi: 10.1007/978-3-540-35488-8_4. [DOI] [Google Scholar]
7.Schafer R.W., Rabiner L.R. A digital signal processing approach to interpolation. Proc. IEEE. 1973;61(6):692–702. doi: 10.1109/PROC.1973.9150. [DOI] [Google Scholar]
8.J.H. Davis, Fourier transforms, in: Applied and Numerical Harmonic Analysis, no. 9783319433691, 2016, pp. 425–566, doi: 10.1007/978-3-319-43370-7_7.
9. Srivastava G., Chauhan A., Jangid M., Chaurasia S. CoviXNet: a novel and efficient deep learning model for detection of COVID-19 using chest X-Ray images. Biomed. Signal Process. Control. 2022;78. doi: 10.1016/j.bspc.2022.103848.
10. Wang D., Hu Q., Yang C. Biometric recognition based on scalable end-to-end convolutional neural network using photoplethysmography: a comparative study. Comput. Biol. Med. 2022;147. doi: 10.1016/j.compbiomed.2022.105654.
11. Gupta A., Ravelo-García A.G., Dias F.M. Availability and performance of face based non-contact methods for heart rate and oxygen saturation estimations: a systematic review. Comput. Methods Programs Biomed. 2022;219:106771. doi: 10.1016/j.cmpb.2022.106771.
12. J. Mathew, X. Tian, M. Wu, C.-W. Wong, Remote blood oxygen estimation from videos using neural networks, arXiv e-prints, p. arXiv:2107.05087, 2021, [Online], Available: http://arxiv.org/abs/2107.05087.
13. Hu M., Qian F., Guo D., Wang X., He L., Ren F. ETA-rPPGNet: effective time-domain attention network for remote heart rate measurement. IEEE Trans. Instrum. Meas. 2021;70:1–12.
14. Shao D., Liu C., Tsow F., Yang Y., Du Z., Iriya R., Yu H., Tao N. Noncontact monitoring of blood oxygen saturation using camera and dual-wavelength imaging system. IEEE Trans. Biomed. Eng. 2016;63(6):1091–1098. doi: 10.1109/TBME.2015.2481896.
15. Chan C., Inskip J.A., Kirkham A.R., Ansermino J.M., Dumont G., Li L.C., Ho K., Novak Lauscher H., Ryerson C.J., Hoens A.M., Chen T., Garde A., Road J.D., Camp P.G. A smartphone oximeter with a fingertip probe for use during exercise training: usability, validity and reliability in individuals with chronic lung disease and healthy controls. Physiotherapy (United Kingdom) 2019;105(3):297–306. doi: 10.1016/j.physio.2018.07.015.
16. R. Stricker, S. Muller, H.M. Gross, Non-contact video-based pulse rate measurement on a mobile service robot, in: IEEE RO-MAN 2014 - 23rd IEEE International Symposium on Robot and Human Interactive Communication: Human-Robot Co-Existence: Adaptive Interfaces and Systems for Daily Life, Therapy, Assistance and Socially Engaging Interactions, Oct. 2014, pp. 1056–1062, doi: 10.1109/ROMAN.2014.6926392.
17. Niu X., Shan S., Han H., Chen X. RhythmNet: end-to-end heart rate estimation from face via spatial-temporal representation. IEEE Trans. Image Process. 2020;29:2409–2423. doi: 10.1109/TIP.2019.2947204.
18. X. Niu, H. Han, S. Shan, X. Chen, VIPL-HR: a multi-modal database for pulse estimation from less-constrained face video, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11365, LNCS, Dec. 2019, pp. 562–576, doi: 10.1007/978-3-030-20873-8_36.
19. Mannheimer P.D. The light-tissue interaction of pulse oximetry. Anesth. Analg. 2007;105(Suppl. 6). doi: 10.1213/01.ane.0000269522.84942.54.
20. Runciman W.B., Webb R.K., Barker L., Currie M. The Australian incident monitoring study. The pulse oximeter: applications and limitations–an analysis of 2000 incident reports. Anaesth. Intensive Care. 1993;21(5):543–550. doi: 10.1177/0310057X9302100509.
21. de Haan G., Jeanne V. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng. 2013;60(10):2878–2886. doi: 10.1109/TBME.2013.2266196.
22. Chen M., Zhu Q., Wu M., Wang Q. Modulation model of the photoplethysmography signal for vital sign extraction. IEEE J. Biomed. Health Inform. 2021;25(4):969–977. doi: 10.1109/JBHI.2020.3013811.
23. Poh M.Z., McDuff D.J., Picard R.W. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. 2011;58(1):7–11. doi: 10.1109/TBME.2010.2086456.
24. Y.H. Wang, C.J. Hung, C.H. Shen, S.J. Chen, A new oxygen saturation images of iris tissue, in: Proceedings of IEEE Sensors, 2010, pp. 1386–1389, doi: 10.1109/ICSENS.2010.5690526.
25. Tsai H.Y., Huang K.C., Yeh J.A. No-contact oxygen saturation measuring technology for skin tissue and its application. IEEE Instrum. Meas. Mag. 2016;19(5):57–64. doi: 10.1109/MIM.2016.7579071.
26. Bal U. Non-contact estimation of heart rate and oxygen saturation using ambient light. Biomed. Opt. Express. 2015;6(1):86. doi: 10.1364/boe.6.000086.
27. W. Chen, D. McDuff, DeepPhys: video-based physiological measurement using convolutional attention networks, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11206, LNCS, 2018, pp. 356–373, doi: 10.1007/978-3-030-01216-8_22.
28. Lokendra B., Puneet G. AND-rPPG: a novel denoising-rPPG network for improving remote heart rate estimation. Comput. Biol. Med. 2022;141. doi: 10.1016/j.compbiomed.2021.105146.
29. Ding X., Nassehi D., Larson E.C. Measuring oxygen saturation with smartphone cameras using convolutional neural networks. IEEE J. Biomed. Health Inform. 2019;23(6):2603–2610. doi: 10.1109/JBHI.2018.2887209.
30. Teuwen J., Moriakov N. Convolutional neural networks. In: Handbook of Medical Image Computing and Computer Assisted Intervention. Elsevier; 2019. pp. 481–501.
31. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Dec. 2016, Vol. 2016-Decem, pp. 770–778, doi: 10.1109/CVPR.2016.90.
32. Q. Hou, D. Zhou, J. Feng, Coordinate attention for efficient mobile network design, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021, pp. 13708–13717, doi: 10.1109/CVPR46437.2021.01350.
33. Gao B.-B., Xing C., Xie C.-W., Wu J., Geng X. Deep label distribution learning with label ambiguity. IEEE Trans. Image Process. 2017;26(6):2825–2838. doi: 10.1109/TIP.2017.2689998.
34. J.M. Joyce, Kullback-Leibler Divergence, in: International Encyclopedia of Statistical Science, Springer, Berlin, Heidelberg, 2011, pp. 720–722, doi: 10.1007/978-3-642-04898-2_327.
35. Zhang K., Zhang Z., Li Z., Qiao Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016;23(10):1499–1503. doi: 10.1109/LSP.2016.2603342.
36. V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874, doi: 10.1109/CVPR.2014.241.
37. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Dec. 2016, Vol. 2016-Decem, pp. 2818–2826, doi: 10.1109/CVPR.2016.308.
38. A. de Fatima Galvao Rosa, R.C. Betini, Noncontact SpO2 measurement using Eulerian video magnification, IEEE Trans. Instrum. Meas. 69(5) (2020) 2120–2130, doi: 10.1109/TIM.2019.2920183.
39. G. Casalino, G. Castellano, G. Zaza, A mHealth solution for contact-less self-monitoring of blood oxygen saturation, in: Proceedings - IEEE Symposium on Computers and Communications, Jul. 2020, Vol. 2020-July, doi: 10.1109/ISCC50000.2020.9219718.
40. Moço A., Verkruysse W. Pulse oximetry based on photoplethysmography imaging with red and green light: calibratability and challenges. J. Clin. Monit. Comput. 2021;35(1):123–133. doi: 10.1007/s10877-019-00449-y.
41. B. Koteska, H. Mitrova, A.M. Bogdanova, F. Lehocki, Machine learning based SpO2 prediction from PPG signal’s characteristics features, Aug. 2022, pp. 1–6, doi: 10.1109/memea54994.2022.9856498.
42. Pimentel M.A.F., Johnson A.E.W., Charlton P.H., Birrenkott D., Watkinson P.J., Tarassenko L., Clifton D.A. Toward a robust estimation of respiratory rate from pulse oximeters. IEEE Trans. Biomed. Eng. 2017;64(8):1914–1923. doi: 10.1109/TBME.2016.2613124.
43. Wu H.-Y., Rubinstein M., Shih E., Guttag J., Durand F., Freeman W. Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph. 2012;31(4):1–8.
44. Tomasi C. Detection and tracking of point features technical report CMU-CS-91-132. Image Rochester NY. 1991;91(April):1–22.
45. M.O. Solomon, PSD Computations Using Welch’s Method, Sandia National Laboratories, no. SAND91-1533, p. 64, Dec. 1991, doi: 10.2172/5688766.
46. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, Sep. 2015, doi: 10.48550/arxiv.1409.1556.

Data Availability Statement

Data will be made available on request.
