Abstract
Heart rate (HR) and respiratory rate (RR) are two critical physiological parameters that can be estimated from video recordings. However, the accuracy of remote estimation of HR and RR is affected by fluctuations in ambient illumination. To address this adverse effect, we propose a fore-background spatiotemporal (FBST) method for estimating HR and RR from videos captured by consumer-grade cameras. Initially, we identify the foreground regions of interest (ROIs) on the face and chest, as well as the background ROIs in non-body areas of the videos. Subsequently, we construct the foreground and background spatiotemporal maps based on the dichromatic reflectance model. We then introduce a lightweight network equipped with adaptive spatiotemporal layers to process the spatiotemporal maps and automatically generate a pulse feature map free from illumination perturbations. This feature map serves as input to a ResNet-18 network to estimate the physiological rhythm. Finally, we extract pulse signals and estimate HR and RR concurrently. Experiments conducted on three public datasets and one private dataset demonstrate the superiority of the proposed FBST method in terms of accuracy and computational efficiency. These findings provide novel insights into non-intrusive human physiological measurements using common devices.
1. Introduction
Heart rate (HR) and respiratory rate (RR) are crucial physiological parameters that reflect cardiorespiratory rhythms [1]. HR is typically measured using electrocardiography (ECG) or photoplethysmography (PPG), while RR is assessed through capnography or non-invasive respiratory belts [2]. However, contact sensors can cause skin irritation and complicate nonintrusive measurements in daily life. To overcome these challenges, video-based methods have been developed to detect subtle changes in skin color and body movements, allowing for the simultaneous estimation of HR and RR using consumer-grade cameras [3].
Since Verkruysse et al. [4] demonstrated the potential of physiological signal analysis from video recordings, researchers have proposed various methodologies to obtain precise physiological signals. Poh et al. [5] employed independent component analysis (ICA) to isolate remote pulse signal components, achieving stable HR measurement results. Wiede et al. [6] employed the principal component analysis (PCA) method to analyze trajectory points acquired with the KLT tracker, yielding relatively robust results in extracting RR. Subsequently, numerous advanced signal decomposition methods have been developed for extracting physiological signals and then estimating HR or RR. Liu et al. [7] utilized variational mode extraction (VME) to extract pulse signal information and ultimately obtain HR. Ghodratigohar et al. [8] employed complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) to select the optimal intrinsic mode function for respiration signal extraction. While these signal processing methods perform well in specific contexts, their dependence on manual parameter tuning can lead to instability when inappropriate parameters are chosen. This limitation reduces their applicability in various scenarios, especially under uncertain illumination conditions.
As deep learning becomes more prevalent and datasets more accessible, researchers are exploring its application to measuring physiological signals. Chen et al. [9] proposed DeepPhys, which uses an attention mechanism to merge appearance and motion information, demonstrating that deep neural networks outperform standard signal processing approaches. Liu et al. [10] proposed the TSCAN network, which improves running speed and prediction accuracy using a multi-task temporal shift convolutional attention network. Liu et al. [11] proposed EfficientPhys, which combines the frame-difference computation and image-normalization pre-processing steps of the preceding two approaches into a single end-to-end network. Some studies use face images directly as network input, requiring significant computational resources and making the process less transparent. Yu et al. [12] introduced the PhysNet network, the first deep network structure that directly recovers physiological signals from videos using 3D convolution. Although precision is enhanced, it places heavy demands on the operating equipment and has restricted real-time performance. Yan et al. [13] employed the spatiotemporal map as the input signal, encompassing both temporal and spatial information. This approach reduces the computational load but decreases the accuracy of HR estimation compared to previous methods. Deep learning methods offer improved stability under varying illumination conditions but often struggle to balance computational complexity and accuracy. Additionally, their lack of clear physiological interpretation limits their practical use in physiological signal analysis.
Model-based methods use the dichromatic reflection model to explain physiological signal variations in videos. Wang et al. [14] linearly combined the raw signals extracted from the RGB channels of video to eliminate disturbances caused by the environment, thereby obtaining physiological signals. Zheng et al. [15] proposed an absorption intensity heartbeat modulation averaged shifted histogram (AIHM-ASH) method for HR estimation, which enhances the signal-to-noise ratio (SNR) of pulse signals through the ASH algorithm. However, model-based methods are constrained by their underlying assumptions and demonstrate limited generalizability, particularly when illumination variations do not conform to these assumptions, leading to instability. Furthermore, in existing studies, researchers often discard background signals when defining regions of interest (ROIs), overlooking the shared ambient illumination between background and foreground areas. Li et al. [16] utilized the green channel value of the background as a reference to reduce illumination disturbance, but this approach is limited by the nonlinear interactions between pulse and foreground signals in complex scenarios. Additionally, heart rhythm and HR are typically extracted separately from RR, which hinders effective monitoring of overall physiological and psychological health.
Considering these limitations of previous work, we propose a novel fore-background spatiotemporal (FBST) method that mitigates illumination variations by modeling the spatiotemporal maps of the foreground and background in videos. We then design spatiotemporal (ST) layers that automatically process the spatiotemporal maps to derive a feature map free from illumination perturbations. In conjunction with the ST layer, we employ the lightweight backbone network ResNet-18 to estimate HR and RR in parallel. The main contributions of this study are threefold: 1) We built a fore-background spatiotemporal model, grounded in physiological principles, to eliminate ambient illumination variations by jointly modeling the spatiotemporal maps of the foreground and background in the videos. 2) We proposed a unified method to estimate cardiorespiratory rhythms in parallel from videos in different scenarios. 3) We achieved accurate HR and RR estimation using a lightweight network with low computational cost.
2. Method
This section details the proposed FBST method for estimating HR and RR, as illustrated in Fig. 1. First, we define ROIs for both the foreground and background to derive raw signals. Specifically, we identify ROIs in the face and chest, subdividing these regions to capture relevant cardiac and respiratory signals. We also select non-human background areas and use PCA to extract the primary illumination variations, minimizing interference from non-illumination disturbances. The specific procedures for deriving raw signals are described in Section 2.3 during the construction of the spatiotemporal maps. Second, we establish the fore-background model based on the dichromatic reflection model to eliminate illumination variations; the model inference is described in Sections 2.1 and 2.2, while the construction of the spatiotemporal maps is detailed in Section 2.3. Finally, we estimate HR and RR in parallel using a lightweight network with the designed ST layers, as described in Section 2.4, with the network framework detailed in Section 2.5.
Fig. 1.
The workflow of the fore-background spatiotemporal method for parallel estimation of HR and RR, including the modules for deriving raw signals, modeling spatiotemporal maps, and estimating HR and RR.
2.1. Dichromatic reflection model
Figure 2(a) depicts the dichromatic reflection model, which characterizes light reflected from an inhomogeneous object’s surface as a linear combination of diffuse and specular reflection [17]. In computer vision, each pixel in a recorded image sequence can be expressed as a time-varying function in the RGB channels:
$$\mathbf{C}_k(t) = I(t)\big(\mathbf{v}_s(t) + \mathbf{v}_d(t)\big) + \mathbf{v}_n(t) \tag{1}$$
where $\mathbf{C}_k(t)$ denotes the RGB channels of image pixel $k$; $I(t)$ denotes the luminance intensity level, which depends on the intensity of the light source itself and its position relative to the object and the camera; $\mathbf{v}_s(t)$ denotes the specular reflection caused by the interaction of light with the medium; $\mathbf{v}_d(t)$ denotes the diffuse reflection caused by the interaction of light with the colorant; and $\mathbf{v}_n(t)$ denotes the quantization noise of the camera sensor. Here, bold mathematical characters represent vectors or matrices. In general processing, camera quantization noise can be reduced by averaging. The specular and diffuse reflections are divided into constant and time-varying quantities, so the average of the pixels in a defined ROI can be written as:
$$\mathbf{C}(t) = I(t)\big(\mathbf{u}_s\,(s_0 + s(t)) + \mathbf{u}_d\,(d_0 + d(t))\big) \tag{2}$$
where $\mathbf{C}(t)$ denotes the average RGB channels of a defined ROI; $\mathbf{u}_s$ denotes the unit color vector of the light spectrum; $s_0$ and $s(t)$ denote the constant and time-varying parts of the specular reflection; $\mathbf{u}_d$ denotes the unit color vector of the material; and $d_0$ and $d(t)$ denote the constant and time-varying parts of the diffuse reflection. The stationary parts of the specular and diffuse reflections can be combined into a single stationary component, so Eq. (2) is rewritten as:
$$\mathbf{C}(t) = I(t)\big(\mathbf{u}_c\, c_0 + \mathbf{u}_s\, s(t) + \mathbf{u}_d\, d(t)\big) \tag{3}$$
where $\mathbf{u}_c$ denotes the unit color vector of the reflection and $c_0$ denotes the reflection strength.
Fig. 2.
Dichromatic reflection model. (a) Reflection and absorption models of light on inhomogeneous media. (b) Reflection and absorption models of light on face. (c) Reflection and absorption models of light on chest.
2.2. Fore-background model
As the body (foreground) and non-body (background) parts in the video can be regarded as optically inhomogeneous media, we establish the fore-background (FB) model through the functional relationships between them based on the dichromatic reflection model.
Face movement can be divided into rigid and non-rigid movement, and the influence of rigid movement can be largely eliminated by tracking the ROIs. Assuming that the non-rigid motion of the face does not produce large relative displacement with respect to the camera, the specular reflection of the face can be regarded as constant. The diffuse reflection, as shown in Fig. 2(b), is time-varying due to blood volume changes. Thus the raw signal of the face ROI can be expressed as:
$$\mathbf{C}_{face}(t) = I(t)\big(\mathbf{u}_{c_1}\, c_{0_1} + \mathbf{u}_d\, d(t)\big) \tag{4}$$
where $\mathbf{C}_{face}(t)$ denotes the RGB channels of the face ROI; $\mathbf{u}_{c_1}$ denotes the unit color vector of the reflection of the face and $c_{0_1}$ denotes the reflection strength of the face; $\mathbf{u}_d$ denotes the unit color vector of the skin diffuse reflection and $d(t)$ denotes its time-varying part.
Breathing causes the chest to rise and fall, so the specular reflection of the chest is time-varying, but the diffuse reflection produced by clothes can be regarded as constant. Figure 2(c) illustrates the light absorption and reflection in the chest. The raw signal of the chest ROI can be expressed as:
$$\mathbf{C}_{chest}(t) = I(t)\big(\mathbf{u}_{c_2}\, c_{0_2} + \mathbf{u}_s\, s(t)\big) \tag{5}$$
where $\mathbf{C}_{chest}(t)$ denotes the RGB channels of the chest ROI; $\mathbf{u}_{c_2}$ denotes the unit color vector of the reflection of the chest and $c_{0_2}$ denotes the reflection strength of the chest; $\mathbf{u}_s$ denotes the unit color vector of the chest specular reflection and $s(t)$ denotes its time-varying part.
As mentioned above, we use Eqs. (4) and (5) to describe the slight color changes in the face and chest, which are related to cardiopulmonary rhythm signals. Next, we unify the expression of the raw signal of the foreground based on Eqs. (4) and (5) as:
$$\mathbf{C}_f(t) = I(t)\big(\mathbf{u}_c\, c_0 + \mathbf{u}_p\, p(t)\big) \tag{6}$$
where $\mathbf{C}_f(t)$ denotes the RGB channels of the foreground ROI; $\mathbf{u}_c$ denotes the constant unit color vector of the foreground reflection and $c_0$ denotes the foreground reflection strength; $\mathbf{u}_p$ denotes the relative pulsatile strengths of the foreground in the RGB channels; and $p(t)$ denotes the cardiac-rhythm and respiratory physiological signals present in the facial and chest regions.
The dichromatic reflection model also applies to the background. We analyze the background in two scenarios: a simple scenario with varying illumination only, and a complex scenario with both varying illumination and disruptive motion. In the simple scenario, the diffuse and specular reflection parts can be treated as constants, so the raw signal of the background ROI can be expressed as:
$$\mathbf{C}_b(t) = I(t)\,\mathbf{u}_{c_b}\, c_{0_b} \tag{7}$$
where $\mathbf{C}_b(t)$ denotes the RGB channels of the defined background ROI, and $\mathbf{u}_{c_b}$ and $c_{0_b}$ denote the unit color vector and strength of the background reflection.
Assuming that the foreground and background recorded in the same video are subject to the same ambient illumination, we deduce the expression of the hidden physiological signal in the simple scenario by combining Eqs. (6) and (7):
$$\mathbf{P}(t) = \mathbf{u}_p\, p(t) = \big(\mathbf{u}_{c_b}\, c_{0_b}\big) \odot \big(\mathbf{C}_f(t) \oslash \mathbf{C}_b(t)\big) - \mathbf{u}_c\, c_0 \tag{8}$$
where $\mathbf{P}(t)$ denotes the physiological signal hidden in the RGB channels, and $\odot$ and $\oslash$ denote element-wise multiplication and division. Eq. (8) can be simplified as:
$$\mathbf{P}(t) = \mathbf{A} \odot \mathbf{C}(t) + \mathbf{B} \tag{9}$$
where $\mathbf{A}$ and $\mathbf{B}$ are constant vectors, and $\mathbf{C}(t) = \mathbf{C}_f(t) \oslash \mathbf{C}_b(t)$ denotes the color signals recorded by the camera, including foreground and background.
In complex scenarios, there may be interference factors such as irrelevant personnel movement and object position changes in the background, so the light reflected from the background changes over time. The specular and diffuse reflections generated by the interaction between illumination and background can no longer be regarded as constants, so the raw signal of the background ROI in the complex scenario can be expressed as:
$$\mathbf{C}_b(t) = I(t)\big(\mathbf{u}_{c_b}\, c_{0_b} + \mathbf{u}_{s_b}\, s_b(t) + \mathbf{u}_{d_b}\, d_b(t)\big) \tag{10}$$
where $\mathbf{u}_{s_b}$ and $s_b(t)$ denote the color vector and time-varying part of the specular reflection of the background, and $\mathbf{u}_{d_b}$ and $d_b(t)$ denote those of the diffuse reflection. Combining Eqs. (6) and (10), we get:
$$\mathbf{P}(t) = \mathbf{A}(t) \odot \mathbf{C}(t) + \mathbf{B}(t) \tag{11}$$
where $\mathbf{A}(t)$ and $\mathbf{B}(t)$ denote time-varying parameters. According to Eqs. (9) and (11), the physiological signal and the combined fore-background signal have distinct functional relationships in different scenarios: linear in simple scenarios without background disturbances and nonlinear in complex scenarios with background disturbances.
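To make the simple-scenario relation concrete, the following minimal NumPy sketch (our illustration, not the authors' code) cancels a shared illumination term by dividing the foreground trace by the background trace, as in Eq. (8). The stationary reflection vectors are unknown in practice; they stand in for what the ST layer learns in the equivalent linear map of Eq. (9).

```python
import numpy as np

def recover_pulse_simple(C_f, C_b, u_cb_c0b, u_c_c0):
    """Eq. (8) sketch: the shared illumination I(t) cancels channel-wise.
    C_f, C_b: (T, 3) foreground/background RGB traces."""
    I_est = C_b / u_cb_c0b          # per-channel illumination estimate, Eq. (7)
    return C_f / I_est - u_c_c0     # u_p * p(t): illumination-free pulse term

# synthetic check: a 1.2 Hz pulse survives a 0.3 Hz illumination drift
t = np.arange(300) / 30.0
I = 1 + 0.05 * np.sin(2 * np.pi * 0.3 * t)                  # illumination drift
u_b, u_c = np.array([0.6, 0.6, 0.5]), np.array([0.77, 0.51, 0.38])
p = 0.01 * np.sin(2 * np.pi * 1.2 * t)[:, None] * np.array([0.3, 0.77, 0.56])
C_f, C_b = I[:, None] * (u_c + p), I[:, None] * u_b          # Eqs. (6) and (7)
P = recover_pulse_simple(C_f, C_b, u_b, u_c)                 # ~= p, drift removed
```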
2.3. Spatiotemporal maps
Spatiotemporal map of foreground:
Most previous methods extract remote physiological signals by pixel averaging over the entire frame. However, both the pulse signal and the respiratory signal have varying intensities in different regions. In prior studies, the face has often been chosen as the ROI for estimating pulse signals [18,19], while the chest area is typically used for estimating respiratory signals [20,21]. Building on this, we extract the facial and chest regions as ROIs from videos using landmark points from the Seetaface library [22] and employ a spatiotemporal map to effectively characterize cardiorespiratory rhythms.
For HR estimation, we divide the whole face region into several square ROIs. Since the four corners of the face region contain many background elements, we remove them to obtain n face ROIs. For RR estimation, we divide the chest region into multiple ROIs and retain the n ROIs with the highest SNR, computed for each ROI as [23]:
$$\mathrm{SNR}_i = 10\log_{10}\!\left(\frac{\int_{f_L}^{f_H} |\hat{S}_i(f)|^2\, df}{\int_{0}^{f_s/2} |\hat{S}_i(f)|^2\, df - \int_{f_L}^{f_H} |\hat{S}_i(f)|^2\, df}\right) \tag{12}$$
where $i$ indexes the ROI, $\hat{S}_i(f)$ denotes the Fourier transform of the respiration signal of ROI $i$, $f_s$ is the sampling rate, and $f_L$ and $f_H$ are the lower and upper limits of the integral defined by the possible physiological range of the respiration rate. In this study, we set $f_L$ to 0.1 Hz and $f_H$ to 0.7 Hz, as this frequency range encompasses the majority of human respiratory rates, which typically range from 6 to 42 bpm [6].
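A direct implementation sketch of Eq. (12) for ROI selection (function and variable names are ours, not from the released code):

```python
import numpy as np

def roi_snr(sig, fps, f_lo=0.1, f_hi=0.7):
    """SNR of one ROI trace per Eq. (12): in-band spectral power over
    out-of-band power within the respiratory range (0.1-0.7 Hz)."""
    spec = np.abs(np.fft.rfft(sig - sig.mean())) ** 2
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return 10 * np.log10(spec[band].sum() / (spec[~band].sum() + 1e-12))

def select_rois(rois, fps, n=12):
    """Keep the n chest ROIs with the highest SNR (rois: list of (T,) traces)."""
    idx = np.argsort([roi_snr(r, fps) for r in rois])[::-1][:n]
    return sorted(idx.tolist())
```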
For a video clip with T frames and c color channels, the foreground spatiotemporal map can be obtained by placing the n raw signals of the face or chest ROIs into rows, that is:
$$\mathbf{M}_f = \big[\mathbf{C}_1,\ \mathbf{C}_2,\ \ldots,\ \mathbf{C}_n\big]^{\mathsf{T}} \in \mathbb{R}^{n \times T \times c},\quad \mathbf{C}_i = \big[\mathbf{C}_i(t_0),\ \ldots,\ \mathbf{C}_i(t_0+T-1)\big] \tag{13}$$
where $\mathbf{M}_f$ denotes the spatiotemporal map of the foreground, $\mathbf{C}_i$ denotes the RGB channels of the $i$-th face or chest ROI, and $t_0$ denotes the beginning frame of the video. Raw pulse and respiratory signals are normalized. In our experiments, the pulse signals are then band-pass filtered using a third-order Butterworth filter with edge frequencies of 0.667 to 3 Hz, and the respiratory signals with 0.1 to 0.7 Hz.
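A sketch of Eq. (13) together with the stated normalization and Butterworth filtering (assuming SciPy's zero-phase `filtfilt`; the paper does not specify the filtering implementation):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def foreground_map(roi_traces, fps, band=(0.667, 3.0)):
    """Build the n x T x 3 foreground spatiotemporal map of Eq. (13).
    roi_traces: (n, T, 3) spatially averaged RGB per ROI; use
    band=(0.1, 0.7) for the respiratory map."""
    m = np.asarray(roi_traces, dtype=np.float64)
    # per-ROI, per-channel normalization along time
    m = (m - m.mean(axis=1, keepdims=True)) / (m.std(axis=1, keepdims=True) + 1e-8)
    b, a = butter(3, [band[0], band[1]], btype="bandpass", fs=fps)
    return filtfilt(b, a, m, axis=1)   # zero-phase band-pass along time
```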
Spatiotemporal map of background:
The variation in the signal obtained from the background is mainly caused by the light source. We assume that the background is exposed to the same parallel light, but different regions of the background are subject to different disturbances. Camera noise can be reduced by pixel averaging, but disturbances still exist due to other factors, such as irrelevant personnel moving around. Therefore, we select m background ROIs to find the consistency of their illumination variations. Specifically, for a video clip with T frames and c color channels, we get:
$$\mathbf{M}_b' = \big[\mathbf{C}_1^b,\ \mathbf{C}_2^b,\ \ldots,\ \mathbf{C}_m^b\big]^{\mathsf{T}} \in \mathbb{R}^{m \times T \times c} \tag{14}$$
where $\mathbf{M}_b'$ denotes the raw spatiotemporal map of the background, $\mathbf{C}_j^b$ denotes the RGB channels of the $j$-th ROI selected from the background, and the map shares the same beginning frame $t_0$ as the foreground. We use PCA to find the main illumination variation component of the m ROIs. We then filter this component and duplicate it n times to generate the final spatiotemporal map of the background $\mathbf{M}_b \in \mathbb{R}^{n \times T \times c}$.
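A sketch of Eq. (14) and the PCA step, using scikit-learn's PCA; extracting one principal component per color channel is our assumption of how the "main illumination variation components" are obtained:

```python
import numpy as np
from sklearn.decomposition import PCA

def background_map(bg_traces, n, c=3):
    """Reduce m background ROI traces to their main illumination component
    per channel, then tile it n times so the background map matches the
    n x T x c foreground map."""
    bg = np.asarray(bg_traces, dtype=np.float64)        # (m, T, c)
    main = np.empty((bg.shape[1], c))
    for ch in range(c):
        # first principal component across the m ROIs of this channel
        main[:, ch] = PCA(n_components=1).fit_transform(bg[:, :, ch].T).ravel()
    return np.tile(main[None, :, :], (n, 1, 1))         # (n, T, c)
```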
2.4. Spatiotemporal layers
We stack the spatiotemporal maps of foreground and background as:
$$\mathbf{M} = \big[\mathbf{M}_f;\ \mathbf{M}_b\big] \tag{15}$$
where $\mathbf{M}$ denotes the stacked spatiotemporal map and $\mathbf{M} \in \mathbb{R}^{2n \times T \times c}$. We propose an adaptive layer, called the ST layer, to deal with the stacked spatiotemporal maps, as shown in Fig. 3. It can be expressed as:
$$\mathbf{F} = \mathrm{ST}(\mathbf{M}) \tag{16}$$
where $\mathbf{F}$ denotes the output feature map after the ST layer, free from the disturbance of illumination variations.
Fig. 3.
The architecture of the spatiotemporal layer. (a) Linear ST layer consisting of one or two fully connected layers. (b) Nonlinear ST layer consisting of a convolution layer and a ReLU activation function.
Linear ST Layer:
In the simple scenario, we assume there is no complicated disturbance in the background and propose an adaptive linear transformation using fully connected layers, described as:
$$\mathbf{F} = \mathbf{W}\,\mathbf{M} + \mathbf{b} \tag{17}$$
where $\mathbf{W}$ and $\mathbf{b}$ are the weight and bias of the linear ST layer, which are driven by the loss function of the estimated cardiopulmonary rhythm parameters.
Nonlinear ST Layer:
In the complex scenario, the transformation in Eq. (11) is highly nonlinear. Therefore, we use a two-dimensional convolution with a $1 \times 1$ kernel as the nonlinear transform layer and the Rectified Linear Unit (ReLU) as the nonlinear activation function, that is:
$$\mathbf{F} = \sigma\big(\mathbf{K} \ast \mathbf{M} + \mathbf{b}\big) \tag{18}$$
where $\mathbf{K}$ is the $1 \times 1$ convolution kernel, $\mathbf{b}$ is the bias, and $\sigma(\cdot)$ is the ReLU function. By using a $1 \times 1$ convolution kernel, the number of channels can be reduced without changing the spatial dimensions of the output, which is useful for reducing the computational complexity of the neural network and for compressing the representation of the input. Meanwhile, the ReLU function allows the network to learn more complex, nonlinear relationships in the input, with parameters driven by the loss function to evolve into values appropriate for estimating cardiopulmonary rhythm parameters.
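The text specifies the operations of Eqs. (17) and (18) but not the exact tensor layout; the following PyTorch sketch is one plausible arrangement (our assumption: the linear variants mix the 2n stacked ROI rows into n output rows, while the CONV variant stacks foreground and background along the channel axis so the 1 × 1 convolution halves the channel count):

```python
import torch
import torch.nn as nn

class LinearSTLayer(nn.Module):
    """Linear ST layer (Eq. (17)): one or two fully connected layers that mix
    the 2n stacked foreground/background rows into n disturbance-free rows."""
    def __init__(self, n: int, depth: int = 1):
        super().__init__()
        dims = [2 * n, n] if depth == 1 else [2 * n, 2 * n, n]
        self.fc = nn.Sequential(*[nn.Linear(dims[i], dims[i + 1])
                                  for i in range(len(dims) - 1)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, c, 2n, T) -> mix along the ROI axis -> (B, c, n, T)
        return self.fc(x.transpose(2, 3)).transpose(2, 3)

class NonlinearSTLayer(nn.Module):
    """Nonlinear ST layer (Eq. (18)): 1x1 convolution plus ReLU, halving the
    channel count without changing the n x T spatial dimensions."""
    def __init__(self, c: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 2c, n, T) with foreground and background channel-stacked
        return torch.relu(self.conv(x))

# shape check with n = 12 ROIs and T = 300 frames
f = LinearSTLayer(12, depth=2)(torch.randn(1, 3, 24, 300))   # -> (1, 3, 12, 300)
g = NonlinearSTLayer(3)(torch.randn(1, 6, 12, 300))          # -> (1, 3, 12, 300)
```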
2.5. Estimation network
We employ ResNet-18 as the backbone network for HR and RR estimation, which processes the illumination-perturbation-free feature map produced by the ST layer as an image input. We treat the estimation of HR and RR as a regression task, which learns to extract features directly from the input data, eliminating the need for feature engineering. This approach allows the entire model to be trained on preprocessed stacked spatiotemporal maps rather than relying on hand-crafted features. The pre-trained ResNet-18 model [24] is fine-tuned for the estimation task. The loss function measures the residual between the estimated HR/RR and the ground truth, and the negative Pearson correlation coefficient is utilized as the loss function for pulse signal estimation.
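A sketch of how the pieces could be wired together (our reading, not released code): torchvision's ResNet-18 with ImageNet weights, its final layer replaced by a two-unit regression head for HR and RR, plus a negative-Pearson loss matching the stated pulse-signal supervision.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FBSTEstimator(nn.Module):
    """ST layer followed by a fine-tuned ResNet-18 regressing [HR, RR]."""
    def __init__(self, st_layer: nn.Module):
        super().__init__()
        self.st = st_layer                       # e.g. NonlinearSTLayer above
        self.backbone = resnet18(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 2)

    def forward(self, stacked_map: torch.Tensor) -> torch.Tensor:
        feat = self.st(stacked_map)              # 3-channel feature map
        return self.backbone(feat)               # (B, 2): HR and RR in parallel

def neg_pearson_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation between estimated and reference pulse."""
    pred = pred - pred.mean(dim=-1, keepdim=True)
    gt = gt - gt.mean(dim=-1, keepdim=True)
    r = (pred * gt).sum(-1) / (pred.norm(dim=-1) * gt.norm(dim=-1) + 1e-8)
    return (1.0 - r).mean()
```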
3. Experiments
3.1. Public datasets
The proposed method is evaluated on three public datasets: UBFC-rPPG [25], PURE [26], and COHFACE [27].
UBFC-rPPG:
The dataset comprises videos of 42 subjects engaging in a time-sensitive mathematical game within a real-world environment. The videos were recorded using a Logitech C920 HD Pro camera at a frame rate of 30 fps and a resolution of 640 × 480 pixels, in an uncompressed 8-bit RGB format. Each video has a duration of 1 minute. Simultaneously with the videos, PPG pulse signals were collected using a contact medical CMS50E pulse oximeter. The videos feature a green curtain as the background. We used data from the first 30 subjects for training and the subsequent 12 subjects for intra-dataset testing.
PURE:
The dataset comprises 60 videos captured by an eco274CVGE camera with a resolution of 640 × 480 pixels and a frame rate of 30 fps. The videos showcase 10 subjects, each executing six distinct head-posture actions: steady, talking, slow translation, fast translation, small rotation, and fast rotation. A finger pulse oximeter recorded synchronized reference pulse wave signals at a frequency of 60 Hz. The background is an unobstructed area of realistic space. We used data from the first 8 subjects for training and the subsequent 2 subjects for intra-dataset testing.
COHFACE:
The dataset comprises videos of 40 individuals recorded with a Logitech C525 camera. The videos were recorded at a resolution of 640 × 480 pixels and 20 fps under varying lighting conditions, including both studio and natural lighting. Physiological data were synchronously sampled at a frequency of 256 Hz using a contact sensor. In addition, the videos are compressed in MPEG-4 Visual format and exhibit poor video signal quality. We used data from the first 30 subjects for training and the subsequent 10 subjects for intra-dataset testing.
3.2. Private dataset
Most public datasets provide ground truth only for pulse signals related to HR and lack respiratory signals. Therefore, we collected a private dataset that includes both pulse and respiratory ground truth signals for further evaluation. This study was conducted in accordance with the Declaration of Helsinki and approved by the Biomedical Research Ethics Committee of West China Hospital, Sichuan University. Figure 4 illustrates the experimental setup. The dataset comprises 28 videos recorded by a Sony HDR CX-450 camera at a frame rate of 50 fps and a resolution of 1920 × 1080 pixels. The dataset includes 18 subjects aged between 20 and 62 years. Ten subjects recorded videos during two time periods, before and after their daily work, while the remaining eight subjects recorded videos exclusively after their daily work. Each subject sat on a chair positioned ∼1 meter in front of the camera for 2 to 3 minutes under natural lighting conditions. The subjects exhibited uncontrolled movements in their natural state; head movements were within 20 pixels horizontally and 30 pixels vertically, while the average facial width was 280 pixels and the average facial height was 350 pixels. The contact respiratory and pulse signals were measured using a Biopac MP160 at a sampling frequency of 500 Hz. We used data from the first 21 videos for training and the subsequent 7 videos for intra-dataset testing.
Fig. 4.
Experimental setup of data acquisition for the private dataset.
3.3. Experimental setup
Settings:
To standardize the datasets, all remote physiological signals and their corresponding ground truth values are resampled to a frequency of 30 Hz. We take a 10-second window to process all videos and ground truth. Adjacent windows are defined by sliding forward with a step size of 0.5 seconds. The corresponding physiological parameters are calculated by the Fourier transform of the ground truth signals within each 10-second window. All experiments were conducted with the following settings: the pre-trained model was fine-tuned using the Adam optimizer with a learning rate of 1e-3 and a weight decay of 1e-4; the batch size was 32, and each experiment ran for a maximum of 30 epochs.
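For concreteness, a sketch of the windowing and label derivation under the settings above (function and variable names are ours):

```python
import numpy as np

FS = 30                                  # common resampling rate (Hz)
WIN, STEP = 10 * FS, int(0.5 * FS)       # 10 s windows, 0.5 s sliding step

def sliding_windows(x: np.ndarray) -> np.ndarray:
    """Split a (T, ...) signal into overlapping 10 s windows along time."""
    starts = range(0, len(x) - WIN + 1, STEP)
    return np.stack([x[s:s + WIN] for s in starts])

def rate_from_window(gt: np.ndarray, lo: float, hi: float) -> float:
    """Reference rate (bpm) from the dominant in-band FFT peak of a window,
    e.g. lo, hi = 0.667, 3.0 for HR and 0.1, 0.7 for RR."""
    spec = np.abs(np.fft.rfft(gt - gt.mean()))
    f = np.fft.rfftfreq(len(gt), d=1.0 / FS)
    band = (f >= lo) & (f <= hi)
    return 60.0 * f[band][np.argmax(spec[band])]
```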
To separate the foreground from the background, we employ the Seetaface algorithm [22] to automatically localize the facial and chest regions as the foreground. Face images are resized and divided into a 4 × 4 grid of 16 ROIs; removing the four corner cells yields the final 12 ROIs, as sketched below. Meanwhile, chest images are resized and divided into 16 ROIs by sliding 64 pixels horizontally and 32 pixels vertically, from which the 12 ROIs with the highest SNR are retained. Background signals are extracted from three rectangular background ROIs located in non-body areas, as shown in Fig. 5. For the high-resolution videos in the private dataset, the background ROIs are automatically defined as the areas surrounding the detected head in the first frame, as identified by the Seetaface algorithm. For the remaining datasets, background ROIs are manually selected based on pixel coordinates. In cases where the background illumination is complex or the region is large, the background area is chosen as close to the human body as possible to ensure accurate detection of the ambient lighting signals associated with the foreground.
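A minimal sketch of the face-grid construction described above; only the 4 × 4 division and corner removal are taken from the text, and the resized crop size is illustrative:

```python
import numpy as np

def face_grid_rois(face_img: np.ndarray, grid: int = 4) -> list:
    """Divide a face crop into a grid x grid lattice of square ROIs and drop
    the four corner cells, leaving 12 ROIs as in Section 3.3."""
    h, w = face_img.shape[:2]
    ch, cw = h // grid, w // grid
    corners = {(0, 0), (0, grid - 1), (grid - 1, 0), (grid - 1, grid - 1)}
    return [face_img[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            for r in range(grid) for c in range(grid) if (r, c) not in corners]

rois = face_grid_rois(np.zeros((128, 128, 3)))   # len(rois) == 12
```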
Fig. 5.
Definition of background ROIs.
The HR distribution in the UBFC-rPPG, PURE, and private datasets is imbalanced. To address this, we applied a data resampling strategy at the video level to balance the HR distribution in the training set. Specifically, we first computed HR values from the ground truth and then resampled entire video segments, along with their corresponding ground truth, based on predefined HR thresholds. This resampling ensures that both the input data and the reference labels remain consistent. In the UBFC-rPPG dataset, video segments with HR exceeding 90 bpm were upsampled by a factor of 1.5, while those below 90 bpm were downsampled by a factor of 2. In the PURE dataset, videos with HR above 80 bpm were upsampled by a factor of 1.5, whereas those below 80 bpm were downsampled by a factor of 2. For the private dataset, videos with HR exceeding 90 bpm were upsampled by a factor of 1.5, while those below this threshold were downsampled by a factor of 1.5. Meanwhile, the ground truth signals were resampled following the same strategy to maintain data integrity and prevent distributional inconsistencies. The results before and after resampling, shown in Fig. 6, demonstrate that this approach effectively mitigates HR distribution imbalances, ensuring a more representative dataset for neural network training. Since RR exhibits a naturally balanced distribution within a narrow range, resampling was not applied to RR estimation.
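A sketch of the video-level resampling (our illustration; upsampling by 1.5× is realized stochastically, and the thresholds and factors follow the per-dataset values above):

```python
import numpy as np

def balance_by_hr(segments, hrs, thresh=90.0, up=1.5, down=2.0, seed=0):
    """Repeat segments above the HR threshold ~1.5x on average and keep
    every second segment below it. segments/hrs are parallel lists."""
    rng = np.random.default_rng(seed)
    out = []
    for seg, hr in zip(segments, hrs):
        if hr > thresh:
            reps = int(up) + (rng.random() < (up - int(up)))  # 1 or 2 copies
            out.extend([seg] * reps)
        elif rng.random() < 1.0 / down:                       # keep 1 in 2
            out.append(seg)
    return out
```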
Fig. 6.
The ground truth HR and the resampling HR distributions of three datasets.
Metrics:
We employed three quality metrics to evaluate the performance of our model: the mean absolute error (MAE), the root mean square error (RMSE), and the Pearson correlation coefficient (R) between estimates and ground truth. The results of all compared methods are obtained either by replicating them with the rPPG-toolbox [28] or from the original results in the published papers. The input for all methods is a 10-second ROI sequence, including the face and chest areas determined by anchor points.
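For reference, the three metrics computed with NumPy (a straightforward sketch; names are ours):

```python
import numpy as np

def evaluate(est, gt):
    """MAE and RMSE in bpm, plus Pearson R, between estimated and reference rates."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    mae = np.mean(np.abs(est - gt))
    rmse = np.sqrt(np.mean((est - gt) ** 2))
    r = np.corrcoef(est, gt)[0, 1]
    return {"MAE": mae, "RMSE": rmse, "R": r}
```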
4. Results
4.1. Heart rate
Heart rate:
We first conduct within-dataset testing to estimate HR on the UBFC-rPPG, PURE, COHFACE, and private datasets, comparing our results with six classical learning methods and five recent deep learning methods. The results are presented in Table 1, demonstrating that classical methods generally perform worse than deep learning methods. This is due to the reliance of classical methods on specific theoretical assumptions, which can lead to reduced effectiveness when these assumptions are not met in real-world scenarios. For instance, the ICA method performs well in terms of error and correlation on the PURE dataset, but exhibits poor performance on the other three datasets. Furthermore, our FBST method outperforms the benchmark method PhysNet on the UBFC-rPPG dataset, achieving an RMSE of 2.79 versus PhysNet's 3.70, a 32.6% improvement. Additionally, the FBST method yields a Pearson correlation coefficient of 0.97. The improvement can be attributed to the elimination of illumination effects on the face through the ST layer. The results clearly indicate that the PhysNet and EfficientPhys methods exhibited poor performance on our private dataset. Despite conscientious attempts to fine-tune the hyperparameters based on the information provided in the original papers, it is plausible that certain vital settings were overlooked or suboptimally optimized. Such deviations may lead to disparities in the learned representations, ultimately affecting the overall results. Moreover, variations in the dataset might have significantly influenced the observed differences, given the relatively smaller scale of our private dataset compared to publicly available datasets. Nevertheless, our proposed method demonstrates promising performance on the private dataset, which can be attributed to the moderately sized ResNet-18 architecture with the ST layer and the inclusion of a residual structure. These characteristics allow the lightweight network architecture to achieve robust performance, even with a small-scale dataset.
Table 1. Intra-dataset testing of HR estimation results by our method and several state-of-the-art methods on four datasets.
| Method | UBFC-rPPG | | | PURE | | | COHFACE | | | Private | | |
| | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ICA [5] | 11.49 | 4.13 | 0.69 | 2.98 | 0.61 | 0.98 | 17.67 | 11.61 | 0.11 | 6.40 | 1.96 | 0.80 |
| CHROM [29] | 7.60 | 2.25 | 0.80 | 11.60 | 3.37 | 0.78 | 18.76 | 12.13 | 0.27 | 3.93 | 0.98 | 0.92 |
| POS [14] | 8.38 | 2.34 | 0.78 | 12.75 | 3.96 | 0.75 | 15.98 | 9.56 | 0.29 | 3.60 | 0.88 | 0.93 |
| GREEN [4] | 21.77 | 12.17 | 0.45 | 14.48 | 6.93 | 0.64 | 16.22 | 9.77 | 0.16 | 11.60 | 4.44 | 0.51 |
| LGI [31] | 19.87 | 10.20 | 0.56 | 5.66 | 1.44 | 0.94 | 15.37 | 9.30 | 0.16 | 6.51 | 1.57 | 0.79 |
| PBV [30] | 32.59 | 23.13 | 0.07 | 7.90 | 2.80 | 0.89 | 19.15 | 13.16 | 0.05 | 12.37 | 5.55 | 0.53 |
| AND-rPPG [32] | 4.75 | 3.15 | 0.92 | - | - | - | 8.06 | 6.81 | 0.63 | - | - | - |
| PhysNet [12] | 3.70 | 1.15 | 0.94 | 3.44 | 1.90 | 0.98 | 11.60 | 8.59 | 0.36 | 13.53 | 6.38 | 0.29 |
| TSCAN [10] | 5.41 | 1.34 | 0.90 | 3.71 | 2.23 | 0.98 | - | - | - | - | - | - |
| EfficientPhys [11] | 5.53 | 1.64 | 0.89 | 5.99 | 1.33 | 0.97 | 12.64 | 4.70 | 0.51 | 15.79 | 7.65 | 0.20 |
| FBST(ours) | 2.79 | 2.17 | 0.97 | 2.41 | 1.81 | 0.99 | 6.41 | 4.61 | 0.73 | 2.41 | 1.84 | 0.97 |
Bold is the best result
The COHFACE dataset is severely compressed, and video quality plays a vital role in remote HR estimation. Therefore, all methods are less effective on the COHFACE dataset, as can be seen from Table 1. Compared to other methods, our FBST method demonstrates a notable improvement on this dataset, achieving an MAE of 4.61 and a Pearson correlation coefficient exceeding 0.7. These results indicate an acceptable level of error for practical applications. Our method performs well on the COHFACE dataset because it addresses the dataset's primary challenge, the unbalanced impact of illumination, by incorporating an illumination adjustment layer. Additionally, the selected backbone network, ResNet-18, exhibits robustness across various scenarios.
Remote pulse signal:
We evaluated the network's capability to estimate remote pulse signals by comparing it with a naive method. In this comparison, the FBST method uses the best ST layer type for each dataset: 1FC for UBFC-rPPG, CONV for PURE, 2FC for COHFACE, and 2FC for the private dataset. The naive method extracts background signals using PCA and then subtracts them from the foreground signals in the green channel [16]. Figure 7 illustrates a 10-second segment of the estimated pulse signals alongside the ground truth for each dataset, highlighting the performance discrepancies between the two approaches. The results indicate that under stable and uniform lighting conditions, as in the UBFC-rPPG and private datasets, the naive method yields satisfactory results, suggesting that utilizing background signals as a reference for ambient light can effectively mitigate illumination variation disturbances in the foreground. However, in scenarios characterized by uneven illumination conditions or when the background comprises non-planar surfaces, as in the PURE and COHFACE datasets, the naive method fails to accurately capture the functional relationship between foreground and background signals. In contrast, our proposed approach employs a self-adaptive ST layer to optimally identify this relationship, significantly enhancing the elimination of illumination variation disturbances. The pulse signals measured by our approach show rhythmic patterns that align with the ground truth signals, enabling accurate heart rate estimation in the time domain. Additionally, we can analyze the derived pulse signals to extract detailed rhythmic information [12].
Fig. 7.
Examples of estimated pulse signal and ground truth on four datasets.
4.2. Respiratory rate
The accuracy of the proposed method in estimating RR was evaluated on both the COHFACE dataset and the private dataset and compared against two classical learning methods and three recent deep learning methods. The results are presented in Table 2. Our method demonstrates a 33% improvement in MAE on the COHFACE dataset compared to the Line method [33], while on the private dataset the MAE improves by 62% compared to the Line method. Using the Fourier transform of the respiratory signals to directly derive the ground truth RR may be constrained by the limited resolution of the spectrum, whereas the estimated RR is a continuous value. Moreover, due to its more restricted dynamic range compared to HR, RR exhibits less variability in the data, resulting in a lower Pearson correlation coefficient between the estimated RR and the ground truth for all methods. The majority of existing methods estimate RR on private datasets with durations of 30 seconds or 1 minute. Our method demonstrates, for the first time, the feasibility of estimating RR from a publicly available dataset using 10-second windows.
Table 2. Intra-dataset testing of RR estimation results by our method and several state-of-the-art methods on two datasets.
| Method | COHFACE | | | Private | | |
| | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ |
|---|---|---|---|---|---|---|
| EfficientPhys | 9.87 | 7.33 | 0.01 | 9.55 | 7.33 | 0.05 |
| TSCAN | 11.18 | 9.06 | 0.01 | 14.23 | 11.91 | 0.02 |
| PhysNet | 8.30 | 6.21 | 0.07 | 9.81 | 8.45 | −0.04 |
| R-Channel [34] | 6.52 | 3.72 | 0.33 | 6.62 | 4.33 | 0.40 |
| Line [33] | 5.11 | 2.96 | 0.09 | 8.04 | 6.27 | 0.21 |
| FBST(ours) | 3.62 | 2.10 | 0.36 | 5.27 | 3.86 | 0.32 |
Bold is the best result
4.3. Ablation study
Analysis of ST layer existence:
We conducted a comparative analysis to investigate the impact of ST layers on the estimation of HR and RR across the datasets, and the results are presented in Table 3. The results reveal that the ST layer improves HR and RR estimation by automatically deriving a disturbance-free pulse feature map from the foreground and background spatiotemporal maps, effectively removing the illumination variations. Additionally, the effectiveness of the ST layer in addressing illumination imbalance is evident from the HR estimation results on the COHFACE dataset, with a 3.7 bpm reduction in MAE and a 0.23 increase in the Pearson correlation coefficient.
Table 3. Results of different relationships of foreground and background spatiotemporal maps.
| | ST layer | PCA | UBFC-rPPG | | | PURE | | | COHFACE | | | Private | | |
| | | | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ | RMSE↓ | MAE↓ | R↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HR | ✗ | ✗ | 3.34 | 2.56 | 0.95 | 2.72 | 2.28 | 0.99 | 10.26 | 8.31 | 0.50 | 3.02 | 2.30 | 0.95 |
| | 1FC | ✓ | 2.79 | 2.17 | 0.97 | 2.56 | 1.83 | 0.99 | 7.55 | 5.49 | 0.58 | 2.88 | 2.06 | 0.96 |
| | 2FC | ✓ | 3.00 | 2.27 | 0.96 | 2.45 | 1.91 | 0.99 | 6.41 | 4.61 | 0.73 | 2.42 | 1.84 | 0.97 |
| | CONV | ✓ | 3.20 | 2.39 | 0.96 | 2.41 | 1.81 | 0.99 | 6.96 | 4.99 | 0.69 | 2.78 | 2.29 | 0.97 |
| | best | ✗ | 3.48 | 2.76 | 0.97 | 2.93 | 2.48 | 0.99 | 8.08 | 5.54 | 0.68 | 2.96 | 2.25 | 0.94 |
| RR | ✗ | ✗ | - | - | - | - | - | - | 3.87 | 2.29 | 0.29 | 6.77 | 5.22 | −0.19 |
| | 1FC | ✓ | - | - | - | - | - | - | 3.77 | 2.10 | 0.33 | 5.94 | 4.22 | 0.10 |
| | 2FC | ✓ | - | - | - | - | - | - | 3.62 | 2.10 | 0.36 | 5.27 | 3.86 | 0.32 |
| | CONV | ✓ | - | - | - | - | - | - | 3.65 | 2.16 | 0.35 | 6.01 | 4.37 | 0.19 |
| | best | ✗ | - | - | - | - | - | - | 3.84 | 2.39 | 0.19 | 6.88 | 5.44 | 0.27 |
* 1FC denotes one fully connected layer, 2FC denotes two fully connected layers, and CONV denotes a convolution layer with a ReLU function; bold is the best result; best refers to the configuration of the ST layer that achieves the lowest RMSE, the lowest MAE, and the highest R value (UBFC: 1FC, PURE: CONV, COHFACE: 2FC, Private: 2FC).
Analysis of ST layer type:
We assessed the impact of linear and nonlinear ST layers using different network configurations: one fully connected layer (1FC), two fully connected layers (2FC), and a nonlinear convolutional layer (CONV) with a ReLU function. Results are shown in Table 3. Different datasets exhibit diverse capabilities in mitigating illumination effects across the layer types. The UBFC-rPPG dataset performs best with the 1FC layer, the PURE dataset excels with the convolutional layer, and both the COHFACE dataset and the private dataset achieve the best HR and RR estimation with the 2FC layer. A preliminary conclusion is that a linear relationship is more effective in error mitigation when the background is flat and devoid of other interfering factors, whereas a nonlinear relationship is more adept at error elimination when the background is spatial or when other interfering factors are present. This is because, in a simple flat scene, the changes in the background signal primarily result from ambient light, which can be considered constant; in complex scenes involving spatial elements or personnel interference, the background signal is influenced not only by ambient light but also by the reflection of object light, which varies over time.
Analysis of PCA:
As shown in Table 3, our method performed better on all datasets when applying dimensionality reduction to the background signals through PCA, as opposed to using a single background signal directly. This is because a single background signal can be influenced by multiple non-illumination effects, such as shadowing caused by unrelated objects or individuals passing through the area of interest. Through PCA, the principal components of multiple background signals can be analyzed to better identify the illumination variation signal, effectively reducing the influence of other interference noise.
Analysis of data resampling:
To evaluate the effectiveness of the data resampling techniques applied to the UBFC, PURE, and private datasets in mitigating dataset bias for HR estimation, we input the original data into the network and computed the HR. The ST layer type selected for each dataset was the best performing option presented in Table 3, namely, UBFC used 1FC, PURE utilized CONV, and the private dataset adopted 2FC. The results are presented in the Table 4. Comparing the results in Tables 3 and 4, it is clear that the data resampling strategies significantly improved the accuracy of HR estimation. The MAE and RMSE for all three datasets decreased after data resampling, with PURE showing the largest decrease. Specifically, PURE’s MAE decreased by 0.6 bpm, and its RMSE decreased by 0.72 bpm.
Table 4. Intra-dataset testing of HR estimation without data resampling.
| Dataset | ST layer | RMSE↓ | MAE↓ | R↑ |
|---|---|---|---|---|
| UBFC-rPPG | 1FC | 3.39 | 2.46 | 0.87 |
| PURE | CONV | 3.13 | 2.41 | 0.98 |
| Private | 2FC | 3.28 | 2.18 | 0.87 |
* 1FC denotes one fully connected layer, 2FC denotes two fully connected layers, and CONV denotes a convolution layer with a ReLU function.
Analysis of cross-dataset testing:
In addition to conducting within-dataset testing on four datasets, we also performed cross-dataset evaluations for pulse signal estimation on the UBFC-rPPG dataset and RR estimation on the COHFACE dataset to demonstrate the generalizability of our models. Specifically, we trained a pulse signal estimation model on the PURE dataset and evaluated its performance on the UBFC-rPPG dataset. Similarly, we trained an RR estimation model on the private dataset and assessed its performance on the COHFACE dataset. The results for our proposed approach and the state-of-the-art methods are presented in Table 5. The results indicate that our proposed method exhibited robust performance on unseen domains, achieving the lowest RMSE and MAE in both HR and RR estimation. Compared to the second-best method, EfficientPhys, our approach reduced the RMSE of HR by 2.5 bpm and the RMSE of RR by 3.93 bpm, demonstrating the superior accuracy of our model.
Table 5. Cross-dataset testing of HR and RR estimation.
| Method | HR | | RR | |
| | RMSE↓ | MAE↓ | RMSE↓ | MAE↓ |
|---|---|---|---|---|
| EfficientPhys | 6.40 | 1.28 | 9.65 | 7.11 |
| TSCAN | 9.07 | 2.30 | 11.92 | 9.88 |
| PhysNet | - | 1.63 | 12.36 | 10.78 |
| FBST(ours) | 3.90 | 1.03 | 5.72 | 4.11 |
* HR results trained on the PURE dataset and tested on the UBFC-rPPG dataset
* RR results trained on the Private dataset and tested on the COHFACE dataset
* Bold is the best result
Analysis of shorter videos performance:
To assess the effectiveness of our method in real-time applications, we input spatiotemporal map data from videos of varying lengths into the ST layer, which consists of two fully connected layers. Specifically, we examine the impact of input lengths ranging from 4 to 10 seconds on HR and RR estimation using the private dataset. The results, presented in Fig. 8, show that as video duration decreases, the RMSE for both HR and RR increases: shorter segments contain less physiological information, reducing estimation accuracy. Both HR and RR estimation exhibit progressive variations as duration decreases, with RR showing more gradual changes, consistent with its typically lower frequency compared to HR. Despite these challenges, our model effectively predicts outcomes from shorter videos, demonstrating its potential for real-time monitoring of physiological and psychological health.
Fig. 8.
Different spatiotemporal map lengths for HR and RR estimation.
5. Discussion
With the increasing prevalence of deep learning, a growing number of models enhance their performance by incorporating additional network layers or modules. Nevertheless, these approaches necessitate substantial computational power and advanced hardware, leading to subpar real-time performance. To address this issue, we integrate a spatiotemporal layer into the ResNet-18 backbone network and employ a transfer learning strategy to alleviate computational demands. We compared the time and computing resources used by our FBST method with three deep learning techniques (PhysNet, TSCAN, and EfficientPhys) on a laptop equipped with an NVIDIA GeForce GTX 1650 GPU. The results are depicted in Fig. 9. Our FBST method requires the fewest floating-point operations (FLOPs) and exhibits a remarkably short running time, owing to its ability to take as input a single spatiotemporal map encompassing both temporal and spatial information, without the image sequences required by the other three methods. This reduction in parameter calculations at the input stage contributes to the efficiency of our approach.
Fig. 9.
Computational costs of compared methods on a laptop equipped with an NVIDIA GeForce GTX 1650 GPU.
Despite the proposed method’s advantages, there are several limitations to consider. We identify facial features and body poses to locate key points for the foreground and background. However, significant rotational movements of the subject may compromise the accuracy of facial or body pose detection, making it difficult to extract these signals automatically. Additionally, manual background extraction can introduce variability and potentially miss changes in ambient lighting. Therefore, future work should focus on developing more adaptive methods for the automatic localization of foreground and background.
We assume that the specular reflection of the face remains constant under typical conditions. However, during prolonged and significant head movements, this reflection should be considered time-varying. We adopted PCA to identify the predominant ambient illumination variations in the background. As illustrated in Fig. 10, although PCA amplifies the fluctuations of illumination variations, challenges still arise under complex illumination conditions. For example, if both the foreground and background experience extreme and uneven illumination changes, or if substantial unrelated ambient illumination is present in the background, the PCA-based selection of the main illumination variation corresponding to the foreground may be limited. In future studies, we aim to develop a more general model that can mitigate the effects of ambient illumination changes by considering a wider range of real-world scenarios, including dynamic environments and those with multiple light sources.
Fig. 10.
Example of green channel background signals before and after PCA
To validate the frequency ranges used in this paper, we confirmed that all subjects in the dataset maintained HR and RR within these ranges. This filtering operation effectively suppresses non-physiological disturbances in the signals, enhancing the robustness of our method. However, in larger datasets, it is important to select appropriate filtering limits based on actual conditions to further improve the robustness of HR and RR estimation.
In this study, we utilized a neural network module, the ST layer, to adaptively identify the optimal functional relationship between foreground and background signals. This approach involved iteratively updating parameters based on the loss function to achieve the best results. Different types of ST layers were chosen for various scenarios. In some cases, nonlinear models performed worse than their linear counterparts. This is due to backgrounds containing only variations in ambient illumination from fixed light sources, where the relationship between illumination in the background and the physiological signals in the foreground is linear. As a result, the use of nonlinear convolutional layers for parameter estimation may lead to greater deviations compared to linear models. While our method qualitatively analyzes the appropriate ST layer types for different scenarios, future research could focus on quantitatively adapting the selection of these layers.
6. Conclusions
In this study, we propose a novel FBST method for estimating HR and RR in parallel from videos. This method focuses on modeling the foreground and background spatiotemporal maps. We designed spatiotemporal layers to automatically optimize the model parameters, which demonstrate effectiveness across diverse illumination scenarios. Experimental results demonstrate that our method achieves impressive performance on the public UBFC-rPPG, PURE and COHFACE datasets, as well as our private dataset, when compared to classical and deep learning approaches. The proposed FBST method achieves lower computational cost when compared to other deep learning methods by utilizing a lightweight network, indicating its superior adaptability to real-world applications.
Acknowledgment
The authors would like to thank the volunteers who participated in the private dataset experiments.
Funding
National Natural Science Foundation of China (62271333); Sichuan Provincial Science and Technology Support Program (2022YFS0032).
Disclosures
The authors declare no conflicts of interest.
Data availability
The public datasets used in this paper are available from Ref. [25–27]. The self-collected dataset cannot be shared at this time due to privacy reasons, but can be made available on reasonable request.
References
- 1. Evans D., Hodgkinson B., Berry J., "Vital signs in hospital patients: a systematic review," Int. J. Nurs. Stud. 38(6), 643–650 (2001). 10.1016/S0020-7489(00)00119-X
- 2. Brüser C., Antink C. H., Wartzek T., et al., "Ambient and unobtrusive cardiorespiratory monitoring techniques," IEEE Rev. Biomed. Eng. 8, 30–43 (2015). 10.1109/RBME.2015.2414661
- 3. McDuff D., "Camera measurement of physiological vital signs," ACM Comput. Surv. 55(9), 1–40 (2023). 10.1145/3558518
- 4. Verkruysse W., Svaasand L. O., Nelson J. S., "Remote plethysmographic imaging using ambient light," Opt. Express 16(26), 21434–21445 (2008). 10.1364/OE.16.021434
- 5. Poh M. Z., McDuff D. J., Picard R. W., "Non-contact, automated cardiac pulse measurements using video imaging and blind source separation," Opt. Express 18(10), 10762–10774 (2010). 10.1364/OE.18.010762
- 6. Wiede C., Richter J., Manuel M., et al., "Remote respiration rate determination in video data - vital parameter extraction based on optical flow and principal component analysis," in VISIGRAPP (2017), pp. 326–333.
- 7. Liu B., Zheng X., Wu Y. I., "Remote heart rate estimation in intense interference scenarios: a white-box framework," IEEE Trans. Instrum. Meas. 73, 1–14 (2024). 10.1109/TIM.2024.3419088
- 8. Ghodratigohar M., Ghanadian H., Al Osman H., "A remote respiration rate measurement method for non-stationary subjects using CEEMDAN and machine learning," IEEE Sens. J. 20(3), 1400–1410 (2020). 10.1109/JSEN.2019.2946132
- 9. Chen W., McDuff D., "DeepPhys: video-based physiological measurement using convolutional attention networks," in Computer Vision (Springer International Publishing, 2018), pp. 356–373.
- 10. Liu X., Fromm J., Patel S., et al., "Multi-task temporal shift attention networks for on-device contactless vitals measurement," in 34th Conference on Neural Information Processing Systems (2020).
- 11. Liu X., Hill B., Jiang Z., et al., "EfficientPhys: enabling simple, fast and accurate camera-based cardiac measurement," in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (2023), pp. 4997–5006.
- 12. Yu Z., Li X., Zhao G., "Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks," in Computer Vision and Pattern Recognition (2019), pp. 4264–4271.
- 13. Yan W., Zhuang J., Chen Y., "MFF-Net: a lightweight multi-frequency network for measuring heart rhythm from facial videos," Sensors 24(24), 7937 (2024). 10.3390/s24247937
- 14. Wang W., den Brinker A. C., Stuijk S., et al., "Algorithmic principles of remote PPG," IEEE Trans. Biomed. Eng. 64(7), 1479–1491 (2017). 10.1109/TBME.2016.2609282
- 15. Zheng Y., Lin Z., Ding W., "Heart rate estimation from color video sequences with high SNR," Opt. Lett. 48(2), 379–382 (2023). 10.1364/OL.476117
- 16. Li X., Chen J., Zhao G., et al., "Remote heart rate measurement from face videos under realistic situations," in 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 4264–4271.
- 17. Tominaga S., Dichromatic Reflection Model (Springer US, 2014), pp. 191–193.
- 18. Kwon S., Kim J., Lee D., et al., "ROI analysis for remote photoplethysmography on facial video," in 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2015), pp. 4938–4941.
- 19. Gao H., Zhang C., Pei S., et al., "Region of interest analysis using Delaunay triangulation for facial video-based heart rate estimation," IEEE Trans. Instrum. Meas. 73, 1–12 (2024). 10.1109/TIM.2024.3363786
- 20. Brieva J., Moya-Albor E., Rivas-Scott O., et al., "Non-contact breathing rate monitoring system based on a Hermite video magnification technique," in 14th International Symposium on Medical Information Processing and Analysis, vol. 10975 (SPIE, 2018), p. 1097504.
- 21. Wang W., den Brinker A. C., "Algorithmic insights of camera-based respiratory motion extraction," Physiol. Meas. 43(7), 075004 (2022). 10.1088/1361-6579/ac5b49
- 22. Wu S., Kan M., He Z., "Funnel-structured cascade for multi-view face detection with alignment awareness," Neurocomputing 221, 138–145 (2017). 10.1016/j.neucom.2016.09.072
- 23. Borges P., Cardoso G., "Mapeamento de intensidade da pulsacao sanguinea por video," Rev. Bras. Fis. Med. 10(2), 34 (2016). 10.29384/rbfm.2016.v10.n2.p34-38
- 24. Deng J., Dong W., Socher R., et al., "ImageNet: a large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 248–255.
- 25. Bobbia S., Macwan R., Benezeth Y., "Unsupervised skin tissue segmentation for remote photoplethysmography," Pattern Recognit. Lett. 124, 82–90 (2019). 10.1016/j.patrec.2017.10.017
- 26. Stricker R., Müller S., Gross H.-M., "Non-contact video-based pulse rate measurement on a mobile service robot," in The 23rd IEEE International Symposium on Robot and Human Interactive Communication (2014), pp. 1056–1062. 10.1109/ROMAN.2014.6926392
- 27. Heusch G., Anjos A., Marcel S., "A reproducible study on remote heart rate measurement," arXiv (2017). 10.48550/arXiv.1709.00962
- 28. Liu X., Narayanswamy G., Paruchuri A., et al., "rPPG-Toolbox: deep remote PPG toolbox," GitHub (2023), https://github.com/ubicomplab/rPPG-Toolbox.
- 29. de Haan G., Jeanne V., "Robust pulse rate from chrominance-based rPPG," IEEE Trans. Biomed. Eng. 60(10), 2878–2886 (2013). 10.1109/TBME.2013.2266196
- 30. de Haan G., van Leest A., "Improved motion robustness of remote-PPG by using the blood volume pulse signature," Physiol. Meas. 35(9), 1913–1926 (2014). 10.1088/0967-3334/35/9/1913
- 31. Pilz C. S., Zaunseder S., Krajewski J., et al., "Local group invariance for heart rate estimation from face videos in the wild," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2018), pp. 1335–13358.
- 32. Lokendra B., Puneet G., "AND-rPPG: a novel denoising-rPPG network for improving remote heart rate estimation," Comput. Biol. Med. 141, 105146 (2022). 10.1016/j.compbiomed.2021.105146
- 33. Alnaggar M., Siam A. I., Handosa M., "Video-based real-time monitoring for heart rate and respiration rate," Expert Syst. Appl. 225, 120135 (2023). 10.1016/j.eswa.2023.120135
- 34. Schrumpf F., Mönch C., Bausch G., et al., "Exploiting weak head movements for camera-based respiration detection," in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2019), pp. 6059–6062.