Health Care Science
. 2026 Jan 15;5(1):74–84. doi: 10.1002/hcs2.70047

A Deep Neural Network Based on Two‐Stage Training for Estimating Heart Rate Variability From Camera Videos

Lan Lan 1, Jin Yin 2,3, Haohan Zhang 4, Hua Jiang 5, Rui Qin 6, Xia Zhao 7, Yu Zhang 8, Yilong Wang 9, Jiajun Qiu 2,3
PMCID: PMC12946708  PMID: 41767171

ABSTRACT

Background

Studies have shown that heart rate variability (HRV) is a predictor of the prognosis of cardiovascular diseases. Contact heartbeat monitoring equipment is widely used, especially in hospitals, because it detects physiological health indicators rapidly and accurately. However, long-term contact with equipment has many adverse effects. The purpose of this study was to improve the accuracy of HRV detection via noncontact equipment, thus enabling HRV to be assessed in various scenarios.

Methods

A novel deep learning approach was proposed for measuring heartbeats from camera videos. First, we performed facial segmentation and divided the face into 16 grid cells with different light balance scores. After the trend was removed with a Hamming window filter, a transformer-based neural network further filtered the signal. Finally, heart rate (HR) and HRV were estimated.

Results

We used 1 million synthetic data points for pretraining and a public dataset combined with a dataset that we constructed for task training. The final results were obtained on a test dataset that we constructed. The accuracy for HR with a low light balance score (0.867–0.983) was greater than that with a high score (0.667–0.750). Our method estimated HR more accurately than traditional filtering methods (0.167–0.417) and state-of-the-art neural network filtering methods (0.783–0.917). Compared with two other neural networks, our method achieved the lowest root mean square error for time-domain HRV and the highest correlation index score for frequency-domain HRV.

Conclusions

Light balance, large sample training, and two‐stage training can improve the accuracy of HRV estimation.

Keywords: camera, cardiovascular disease, deep learning, heart rate variability, pretraining


Heart Rate Variability Estimation from Camera Videos



Abbreviations

ECG: electrocardiogram
FFT: fast Fourier transform
HF: high frequency
HR: heart rate
HRV: heart rate variability
LF: low frequency
MSE: mean square error
RGB: red, green, and blue
RMSE: root mean square error
RMSSD: root mean square of successive differences
RNN: recurrent neural network
SD1: standard deviation perpendicular to the line of identity
SD2: standard deviation along the line of identity
SDNN: standard deviation of NN intervals
SDSD: standard deviation of the differences between successive intervals
SNR: signal-to-noise ratio
VIPL-HR: vision and intelligence processing lab-heart rate

1. Introduction

Heart rate variability (HRV) refers to the changes in the heartbeat cycle. It is generally obtained by analyzing the R-wave interval in an electrocardiogram (ECG). HRV reflects the degree of sinus arrhythmia in the heart and the balance of interactions between neurohumoral factors and the sinoatrial node. Previous studies have shown that HRV is a predictor of the prognosis of cardiovascular diseases, e.g., sudden cardiac death, coronary heart disease, hypertension and chronic heart failure, as well as chronic obstructive pulmonary disease, diabetes and other diseases [1], and can reflect various information, such as sleep and mental stress status [2, 3, 4]. Contact heartbeat monitoring equipment, such as ECG devices and fingertip pulse oximeters, is widely used, especially in hospitals, because it detects physiological health indicators rapidly and accurately. However, long-term contact with equipment has many adverse effects, e.g., the risk of infection and skin allergies. Additionally, ECG devices have many limitations, such as signal processing and accuracy issues [5].

Capillary blood flow is directly related to heartbeats, and hemoglobin in capillaries absorbs light in a certain frequency band [6]. With a photosensitive sensor, the characteristics of blood flow can be determined by signal extraction. Therefore, a camera can be used as a measurement carrier of a noncontact device with information regarding the human face as the measurement object. The heartbeat signal may be extracted by recording color images of the human face. However, compared with contact methods, noncontact methods are more susceptible to external environmental interference in the collection of physiological information, which can introduce more noise [7] and result in the loss of more important information, so their accuracy is much lower. Therefore, improving accuracy is a long‐term research direction and an important demand.

Related research (Supporting Information S1: Appendix S1) has focused on data enhancement and signal filtering [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. The quality of signal filtering directly affects the accuracy of the final result. Filtering methods fall into essentially three research directions: traditional analytical algorithms, such as the fast Fourier transform (FFT), wavelet decomposition, and empirical mode decomposition; neural network filtering; and hybrids of the two. Neural network filtering is the most important research direction: it matches analytical filtering in capability while offering greater design flexibility and more room to improve accuracy. The latest research has focused on three key areas. First, in end-to-end deep learning, work represented by Uni-rPPGNet [18] has adopted a lightweight vision transformer architecture to predict HRV directly. This approach avoids manual feature engineering but still faces limitations, such as a high demand for labeled data and insufficient adaptability to dynamic environments. Second, there have been breakthroughs in multimodal fusion techniques. For example, CardiacMamba [19] innovatively integrates radio frequency and red, green, and blue (RGB) video signals via a time-difference Mamba module, which significantly mitigates skin tone bias and motion interference but increases the complexity of the required system hardware. In addition, PulseGAN [20] enhances signal quality by using a generative adversarial network to synthesize high-fidelity pulse waveforms, reducing the AVNN error by 20.85%, whereas RADIANT [21] effectively suppresses noise interference in fatigue detection via a transformer-based signal embedding strategy. However, models developed on limited training data [22] lack generalizability when addressing diverse skin tones and motion states.

To improve the accuracy of HR and HRV estimation, this study summarizes and optimizes previous work to design a standard testing process for HR and HRV from camera videos. Our contributions include the following: (1) In terms of signal acquisition, we propose the use of the blue spectrum to evaluate the balance score of light irradiation on the face. The evaluation of this equilibrium has a significant effect on the stability of the experimental results. (2) Compared with traditional analytical methods, neural network filtering requires considerable training, and the quality and quantity of the data used in the training process are crucial and directly affect the accuracy and generalization ability. We designed a method for infinitely generating high‐quality pretraining data by using the superposition of signals, thereby fundamentally solving this problem. (3) The training process is innovative and adopts a two‐stage training method. Especially in the first stage, comparing the frequency information of the same signal with the time domain information improves the ability of the filtering network (transformer model [23]) to understand the essence of the signal, thereby improving the filtering accuracy.

2. Methods

2.1. Overall Framework

Our algorithm achieves HR and HRV estimation through three major steps: signal acquisition, signal filtering, and index calculation. The acquisition step uses a deep neural network to detect the face and locate the key areas of the face to extract facial blood spectrum information while balancing light uniformity. Next, the green channel is extracted by separating the RGB channels, a simple convolution filter (Hamming window) is used to remove the trend, and then a deep neural network is used for stronger filtering. Compared with the filters used in other similar studies [24, 25], deep neural network filtering can better preserve the original waveform [26]. From the original waveform, the HR and HRV can be calculated (Figure 1).

Figure 1.

Figure 1

Overall framework.

2.2. Signal Acquisition

Human body data were collected at 30 frames per second through camera videos. Facial skin regions were detected with the BiSeNetV2 network [27]. We followed the default parameters of the BiSeNetV2 network to perform facial segmentation on the original videos. We then extracted the skin area on the basis of the minimum and maximum horizontal and vertical coordinates of the segmented region and used a perspective transformation to warp each image to 512 × 512 pixels, which we divided into 16 regions with a 4 × 4 grid. We separated the RGB channels of the image and preserved the green and blue channels. Because hemoglobin easily absorbs light at green frequencies but not at blue frequencies, we used the green channel to calculate the HR and the blue channel to determine whether the light evenly illuminated the skin area. The pixels in the skin area of the green channel were averaged in each grid cell. The resulting sequence signal had a shape of 16 × 20 × 30 because collection was performed at 30 frames per second for 20 s for 16 sequence signals. The standard deviation of the pixels in the skin area of the blue channel was calculated in each grid cell, giving 20 × 30 standard deviations per cell. The average of these standard deviations was the light balance score for that grid cell: the lower the score, the more uniform the light. Finally, the signals from the 8 cells with the lowest scores were selected for subsequent analyses.
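The grid scoring described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the function names are ours, and the blue-channel frames are assumed to be already warped to the square face crop.

```python
import numpy as np

def light_balance_scores(frames_blue, grid=4):
    """Score light uniformity per grid cell from the blue channel.

    frames_blue: array of shape (T, H, W) -- blue-channel frames of the
    warped face crop (T = frames in the window).
    Returns one score per cell; lower = more uniform illumination.
    """
    T, H, W = frames_blue.shape
    gh, gw = H // grid, W // grid
    scores = np.empty((grid, grid))
    for r in range(grid):
        for c in range(grid):
            cell = frames_blue[:, r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            # Per-frame spatial std of blue pixels, averaged over frames.
            scores[r, c] = cell.reshape(T, -1).std(axis=1).mean()
    return scores.ravel()

def select_cells(scores, k=8):
    """Indices of the k cells with the lowest (most uniform) scores."""
    return np.argsort(scores)[:k]
```

Selecting the 8 lowest-scoring cells then mirrors the final step of the acquisition pipeline.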

2.3. Signal Filtering

2.3.1. Filtering Steps

Step 1: The 8 time series signals were preprocessed. First, a Hamming window (Equation 1) [28] was used to remove trend signals, which are usually caused by unstable factors during signal collection, such as the subject's inability to remain completely stationary. In addition, removing these signals concentrated the range of input values, which benefited the subsequent neural network filtering.

ω(k) = 0.54 − 0.46 cos(2πk/(N − 1)), 0 ≤ k ≤ N − 1, (1)

where ω(k) is the window function, k is the sample index, N is the window length, and cos(·) is the cosine function. We conducted experiments with window lengths of 7, 9, 11, and 13 and ultimately found 11 to be the most effective. The resulting Hamming window was [0.08, 0.17, 0.40, 0.68, 0.91, 1.00, 0.91, 0.68, 0.40, 0.17, 0.08]. We then used the Hamming window values as a convolution kernel and convolved them with the original signal; subtracting the smoothed result from the original signal yielded the filtered signal (Equation 2 and Figure 2).

Znew = Xsource − Xconv = {Zt−n+1, Zt−n+2, …, Zt}, (2)

where Z represents the signal, Xsource is the original signal, Xconv is the convolved (trend) signal, t denotes the current position, and n denotes the number of retained samples.

Figure 2.

Figure 2

Convolution filtering process of the facial spectral signal. (a) Filter window shape. (b) Original facial spectral signal. (c) Yellow represents the signal after convolution filtering. (d) Difference between the original signal and the signal after convolution filtering, including HR information.
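The detrending step can be sketched as follows: a minimal NumPy version, assuming an 11-point window and taking the difference in Equation 2 as the output (the function name is ours).

```python
import numpy as np

def hamming_detrend(signal, n=11):
    """Remove the slow trend with an n-point Hamming smoothing kernel.

    The normalized Hamming window acts as a low-pass smoother; subtracting
    its output from the original signal leaves the pulsatile component.
    """
    window = np.hamming(n)          # [0.08, 0.17, ..., 1.00, ..., 0.17, 0.08]
    kernel = window / window.sum()  # normalize so the trend keeps the scale
    trend = np.convolve(signal, kernel, mode="same")
    return signal - trend
```

Away from the edges, a constant or linear trend is removed exactly, which matches the intent of suppressing slow drift while preserving the heartbeat component.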

Step 2: We sliced each signal into 1 s segments of 30 frames and defined each segment as one token. A 20 s video with these 8 signals therefore yielded 8 × 20 tokens.

Step 3: We stacked the tokens with positional encoding, computed with the traditional relative position encoding method.

Step 4: The data shape was 8 × 20 × 30 (8 grid cells, 20 s of collection, and 30 frames per second), which was fed into the filtering network to obtain the filtered waveform.

Step 5: We calculated the power spectrum from the filtered waveform, which was subsequently used to calculate the HR.

Step 6: The peak value of the filtered waveform was extracted, and the R–R interval was calculated, which was used for subsequent HRV calculations.
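Steps 2 and 6 can be illustrated with a short sketch. The token shaping and the peak-based R-R extraction below are our own minimal reconstructions; the minimum peak distance is an assumption, not a reported value.

```python
import numpy as np
from scipy.signal import find_peaks

FPS = 30  # camera frame rate used throughout the paper

def to_tokens(signals):
    """Slice each cell signal into 1 s tokens: (8, 600) -> (8, 20, 30)."""
    cells, length = signals.shape
    return signals.reshape(cells, length // FPS, FPS)

def rr_intervals(wave, fps=FPS):
    """R-R intervals (ms) from peaks of the filtered waveform (Step 6).

    The minimum peak distance (~0.33 s, i.e. at most 180 bpm) is our choice.
    """
    peaks, _ = find_peaks(wave, distance=fps // 3)
    return np.diff(peaks) * 1000.0 / fps
```

For a clean 1.5 Hz waveform, the recovered intervals are a constant 666.7 ms, i.e., 90 bpm.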

2.3.2. Filter Network Structure

The filter network (Figure 3a) adopted an encoder-decoder structure, where the encoder encoded the input signal into a feature signal and the decoder decoded the feature signal into a filtered signal. The encoder and decoder adopted transformer structures consisting of 8 and 4 transformer layers, respectively. In each transformer substructure, we set four attention heads, used LayerNorm for normalization, and used the GELU activation function (Figure 3b). We slightly modified the multilayer perceptron layer by using two fully connected layers without nonlinear operations between them. The parameter structures were 30 × 256 and 256 × 30. The main purpose was to amplify the parameter count without introducing more nonlinearity.
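A rough PyTorch sketch of this encoder-decoder is given below. It is an illustration, not the authors' implementation: because PyTorch's multihead attention requires the embedding dimension to be divisible by the head count, the 30-dimensional tokens are projected to 32 dimensions and back, which is our assumption.

```python
import torch
import torch.nn as nn

class LinearMLP(nn.Module):
    """Two fully connected layers with no nonlinearity in between, matching
    the paper's modified MLP that amplifies parameters (d -> 256 -> d)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
    def forward(self, x):
        return self.fc2(self.fc1(x))

class Block(nn.Module):
    """One transformer layer: LayerNorm, 4-head self-attention, linear MLP."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = LinearMLP(dim)
    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class FilterNet(nn.Module):
    """8-layer encoder and 4-layer decoder over 1 s tokens."""
    def __init__(self, token_dim=30, dim=32):
        super().__init__()
        self.embed = nn.Linear(token_dim, dim)   # 30 -> 32 (our assumption)
        self.encoder = nn.Sequential(*[Block(dim) for _ in range(8)])
        self.decoder = nn.Sequential(*[Block(dim) for _ in range(4)])
        self.head = nn.Linear(dim, token_dim)    # back to 30-dim tokens
    def forward(self, tokens):                   # (batch, 160, 30): 8 cells x 20 s
        x = self.embed(tokens)
        return self.head(self.decoder(self.encoder(x)))
```

The network maps a (batch, 160, 30) token tensor to a filtered tensor of the same shape.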

Figure 3.

Figure 3

Filter network structure. (a) Filter network. (b) Transformer layer. MLP, multilayer perceptron.

Time-domain and frequency-domain signals can be fully aligned. The alignment method applies the FFT and, owing to conjugate symmetry, keeps only half of the magnitude spectrum and half of the phase spectrum; concatenating the two halves yields a representation with exactly the same length as the original signal. Therefore, the structural design of the frequency-domain encoder is completely consistent with that of the time-domain encoder.
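This same-length frequency representation can be sketched as follows (a minimal NumPy version; for even-length real signals the discarded half of the spectrum is redundant up to the Nyquist bin, and the function name is ours).

```python
import numpy as np

def freq_representation(x):
    """Length-preserving frequency view: first-half magnitudes plus
    first-half phases of the FFT. This is valid because a real signal's
    spectrum is conjugate-symmetric, so the second half is redundant."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    spec = np.fft.fft(x)
    half = N // 2
    return np.concatenate([np.abs(spec[:half]), np.angle(spec[:half])])
```

Because the output length equals the input length, the frequency-domain encoder can reuse the time-domain encoder's architecture unchanged.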

2.3.3. Training and Testing of Filtering

The entire filter network training process was divided into two stages: pretraining and task training (Figure 4).

Figure 4.

Figure 4

Filter network training. (a) Pretraining. (b) Task training.

2.3.3.1. Stage 1: Pretraining
  • a. Data Generation

For the pretraining dataset, fake data generated by signal superposition were adopted. In theory, any periodic signal can be formed by the superposition of signals of different frequencies. By generating signals of different frequencies and then overlaying them, a complex signal (Equation 3) can be formed.

H = Σ_{i=1}^{m} h_i, (3)

where H represents a complex signal, h represents a single‐frequency signal, m represents the number of single‐frequency signals, and i represents the serial number.

However, real-world signals are complex: they contain both periodic and nonperiodic components, which cannot be separated unless explicitly extracted. A nonperiodic signal is usually handled in one of two ways: treated as a segment of a signal with an infinite period or treated as a type of noise. We treat it as noise. Therefore, the formula becomes

X = Σ_{i=1}^{m} (h_i ± ε), (4)

where X represents a complex signal, h represents a single-frequency signal, ε represents noise, m represents the number of single-frequency signals, and i represents the serial number. The design of the noise is crucial, as it directly determines whether the model has sufficient generalization ability. We used six types of noise: simple noise, abnormal noise, amplitude noise, position noise, high-order noise, and irrational noise (Supporting Information S1: Appendix S2). These noise signals were simply superimposed without weighting.
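A minimal generator in this spirit might look like the following; the frequency ranges, amplitudes, and the single Gaussian noise term are our illustrative choices, not the paper's six noise families.

```python
import numpy as np

def synth_sample(rng, length=600, fps=30, m=5):
    """One synthetic training pair: noisy mixture -> clean 0.8-2 Hz target.

    The clean target keeps only components inside the HR band (0.8-2 Hz);
    the input adds out-of-band components and simple Gaussian noise.
    """
    t = np.arange(length) / fps
    clean = np.zeros(length)
    noisy = np.zeros(length)
    for _ in range(m):
        f = rng.uniform(0.1, 5.0)                # random component frequency
        a = rng.uniform(0.5, 2.0)                # random amplitude
        p = rng.uniform(0, 2 * np.pi)            # random phase
        comp = a * np.sin(2 * np.pi * f * t + p)
        noisy += comp
        if 0.8 <= f <= 2.0:                      # HR band -> part of target
            clean += comp
    noisy += rng.normal(0, 0.3, length)          # "simple noise" term
    return noisy, clean
```

Drawing pairs like this repeatedly is what makes the pretraining set effectively unlimited in size.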

  • b. Loss Function

Feature similarity loss. All signals have both frequency-domain and time-domain representations. We encoded the same signal in both the time domain and the frequency domain, and the encoded feature vectors should have maximal positive correlation. We used the following loss function (Equation 5) to complete self-supervised training on the signal.

Lr=cossim(ϵ(f(X)),ϑ(FFT(X))), (5)

where Lr denotes the regularization loss, cossim denotes the cosine similarity, ϵ converts a time-domain feature into a feature vector, f(X) represents the feature signal extracted by the neural network, X represents the original signal, ϑ maps the frequency-domain signal onto a feature vector, and FFT(X) represents the FFT of the original signal. Because the goal is maximal positive correlation, training minimizes the negative of this similarity. We applied the frequency-domain transformation only to signals between 0.8 and 2 Hz, as the HR lies within that range, which forces the network to focus on that frequency band.

Restoration loss. We wanted the network to smoothly restore mixed signals from 0.8 to 2 Hz without noise. Therefore, we calculated the mean square error (MSE) (Equation 6).

Lr=MSE(δ(f(X)),S), (6)

where δ represents the decoding network and S represents the target waveform.
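The two losses can be sketched as follows. These are NumPy versions for illustration; in training they would be computed with autograd-capable tensor operations, and 1 − cos is one common way to turn "maximize similarity" into a minimized loss.

```python
import numpy as np

def feature_similarity_loss(time_feat, freq_feat):
    """Equation 5 in spirit: push time- and frequency-domain feature
    vectors toward maximal positive correlation. Minimizing
    1 - cosine_similarity is equivalent to maximizing the similarity."""
    num = np.sum(time_feat * freq_feat, axis=-1)
    den = (np.linalg.norm(time_feat, axis=-1)
           * np.linalg.norm(freq_feat, axis=-1))
    return float(np.mean(1.0 - num / den))

def restoration_loss(decoded, target):
    """Equation 6: mean square error between the decoded waveform and the
    clean 0.8-2 Hz target waveform."""
    return float(np.mean((np.asarray(decoded) - np.asarray(target)) ** 2))
```

Identical feature vectors give a similarity loss of 0; anticorrelated vectors give 2, the maximum.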

We selected the AdamW optimizer. We set beta to (0.95, 0.90) and the weight_decay to 0.001. The learning rate was set to 0.0001, and the batch size was set to 2000.

2.3.3.2. Stage 2: Task Training

We used transfer learning training as corrective training, which required the selection of a mild optimizer and a lower learning rate. Therefore, we chose the stochastic gradient descent optimizer with a learning rate of 0.01. We did not set other parameters to avoid damaging the knowledge learned by the network during the pretraining stage.

Our method was compared with two state‐of‐the‐art models (EfficientPhys [15] and MTTS‐CAN [14]).

2.4. Index Calculation

2.4.1. HR

We first computed the autocorrelation of the signal and then applied the FFT to obtain the power spectrum. The frequency that corresponded to the maximum power was the current HR (Supporting Information S1: Figure S1). Because the correct signals were more concentrated across grid cells, selecting the estimates with the lowest variance ultimately yielded accurate values (Supporting Information S1: Figure S2). Each volunteer was tested three times, with a 2 h interval between tests and a duration of 15 s per test. While facing the camera to record facial information, the volunteers wore a medically certified fingertip pulse oximeter. After the oximeter readings became relatively stable, 15 s of data and facial video were recorded. We recorded and compared the HR displayed by the fingertip pulse oximeter three times. An estimated HR was judged correct if its absolute error fell within a tolerance of 1 bpm (range 1), 2 bpm (range 2), or 4 bpm (range 4). Each estimate was thus classified as correct or incorrect, and we calculated the accuracy of each method.
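The autocorrelation-plus-FFT estimate and the tolerance-based accuracy can be sketched as follows (a minimal NumPy reconstruction; the band limits follow the 0.8–2 Hz range stated earlier, and the function names are ours).

```python
import numpy as np

def hr_autocorr_fft(signal, fps=30):
    """HR (bpm): autocorrelate the signal, take the FFT power spectrum,
    and pick the dominant frequency inside the 0.8-2 Hz HR band."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # one-sided autocorr
    freqs = np.fft.rfftfreq(len(ac), d=1.0 / fps)
    power = np.abs(np.fft.rfft(ac))
    band = (freqs >= 0.8) & (freqs <= 2.0)
    return 60.0 * freqs[band][np.argmax(power[band])]

def accuracy(estimates, references, tolerance):
    """Fraction of estimates within +/- tolerance bpm of the reference
    (ranges 1, 2, and 4 in the paper)."""
    e = np.asarray(estimates, dtype=float)
    r = np.asarray(references, dtype=float)
    return float(np.mean(np.abs(e - r) <= tolerance))
```

Autocorrelation sharpens the periodic component before the spectral peak is read off, which is the rationale for the extra step.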

2.4.2. HRV

With the R-R interval of the waveform, we calculated the HRV-related indices, including time-domain and frequency-domain indices. Referring to the definition [29], we calculated the root mean square of successive differences (RMSSD), standard deviation of NN intervals (SDNN), Poincaré plot standard deviation perpendicular to the line of identity (SD1), Poincaré plot standard deviation along the line of identity (SD2), standard deviation of the differences between successive intervals (SDSD), SD1/SD2, low frequency (LF), high frequency (HF), and LF/HF. While collecting the volunteer videos, we also collected the HRV and psychological stress index using an HRV analyzer (model: HW6C), with the HRV value used as the ground truth. RMSSD and SDNN (two commonly used HRV indices) were used to compare the filtering effects of our filter network and two state-of-the-art models. The comparison was measured using the root mean square error (RMSE), with smaller values indicating better results. We also performed Pearson correlation analysis between six time-domain indices and the psychological stress index to verify the effectiveness of the HRV estimated by our method.
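The time-domain indices can be computed from R-R intervals as below; this is a minimal sketch in which the Poincaré SD1/SD2 are taken directly from the rotated point cloud, and the ddof conventions are our choices (toolkits differ slightly on these).

```python
import numpy as np

def hrv_time_domain(rr_ms):
    """Time-domain HRV indices from a series of R-R intervals in ms."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)
    sdnn = rr.std(ddof=1)                 # SD of the NN intervals
    rmssd = np.sqrt(np.mean(diff ** 2))   # root mean square of successive diffs
    sdsd = diff.std(ddof=1)               # SD of successive differences
    # Poincare plot (rr[i], rr[i+1]) rotated by 45 degrees: SD1 is the spread
    # perpendicular to the line of identity, SD2 the spread along it.
    sd1 = np.std(diff / np.sqrt(2.0))
    sd2 = np.std((rr[1:] + rr[:-1]) / np.sqrt(2.0))
    return {"SDNN": sdnn, "RMSSD": rmssd, "SDSD": sdsd,
            "SD1": sd1, "SD2": sd2, "SD1/SD2": sd1 / sd2}
```

Computing SD1/SD2 from the rotated coordinates avoids the negative values that the algebraic shortcut formulas can produce on short series.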

Owing to the different filtering algorithms used by various HRV analyzer manufacturers, there were significant differences in the absolute values of the HRV frequency-domain indices, which made the RMSE unsuitable for evaluation. Our evaluation method randomly shuffles the data and then compares each value with the previous one. If a value is larger than the previous value, it is recorded as "1"; otherwise, it is recorded as "0". Then, the number of positions at which the test string and the target string agree (the complement of the Hamming distance) is counted. The larger this agreement count is, the more consistent the trend changes are. Multiple shuffled evaluations were conducted, and the mean was taken as the trend-consistency index score. The higher the score is, the more accurate the HRV estimate.

All analysis procedures were performed with PyTorch 1.10 (Meta, Menlo Park, CA, USA).

2.5. Data

In the pretraining phase, we generated one million synthetic samples using the signal superposition technique to train the neural network filter. These data simulate combinations of various frequency signals and noise. We conducted transfer learning on both a public dataset and our own dataset. The pulse label extraction dataset is a subset of the Vision and Intelligence Processing Lab-Heart Rate (VIPL-HR) public dataset [30] and consists of 200 filtered raw videos with paired optical volume waveform data. Our dataset included 28 volunteers (aged 18–68 years; male-to-female ratio of 58:42), from whom 100 facial videos and synchronized photoplethysmographic data (obtained via medical-grade fingertip pulse oximetry) were collected. Task optimization training data: forty videos were combined with the pulse label extraction dataset for training to optimize task performance and improve accuracy. Test data: the remaining 60 videos (36 male and 24 female) were used for the final test and as the basis for evaluating the results.

Synergistic training mechanism among different datasets: training modern neural networks requires large‐scale datasets, yet the challenge of insufficient task data is often encountered in practical applications. This study draws on the phased progressive training method of large language models by adopting datasets with different characteristics in each training phase. Specifically, our training framework consists of three phases [31, 32]. In the pretraining phase, large‐scale training is conducted on synthetic data to develop the neural network's basic understanding of time series signals. This phase emphasizes data size rather than quality and aims to build a basic cognitive model. The second phase is the fine‐tuning phase, in which selected, high‐quality, real‐world datasets are used to adapt the network to the real‐world data distribution of the target task. This phase focuses on data quality rather than quantity to transition from general to specialized knowledge. The third phase is the optimization phase, in which independently collected, scenario‐specific data are used to fine‐tune the model. This effectively eliminates possible population and environmental biases (due to differences in collection conditions) from the data from the previous phase and ensures the model's applicability in real‐world scenarios.

3. Results

3.1. Ablation Experiments

3.1.1. Light Balancing

The filtered signal‐to‐noise ratios (SNRs) of the network with and without light balancing were compared. Supporting Information S1: Table S1 shows that the SNR was greater when light balancing was used, which indicated better performance. When face segmentation was performed, 16 grids were scored in terms of light balance. The 8 grids with the lowest light balance scores and the 8 grids with the highest light balance scores were selected for subsequent analysis. Table 1 shows that the accuracy of HR with a low light balance score was greater than that with a high light balance score.

Table 1.

HR estimation from different light balance scores.

Light balance score Range 1 Range 2 Range 4
Low 0.867 0.950 0.983
High 0.667 0.717 0.750

3.1.2. Hamming Window

We randomly used four sets of HR data that were measured with fingertip pulse oximeters to evaluate the performance of Hamming window filtering. Supporting Information S1: Table S2 shows that HR estimation with Hamming window filtering was more accurate than that without Hamming window filtering.

3.1.3. Pretraining

To evaluate the effect of incorporating frequency signals, we used pretraining data directly for the final training without conducting task training or incorporating noise data. Supporting Information S1: Table S3 shows that incorporating frequency signals into unsupervised training significantly improved the accuracy of HR estimation. To evaluate the effect of incorporating noise, we used the pretraining data directly for the final training without conducting task training or incorporating frequency signals for unsupervised learning. Supporting Information S1: Table S4 shows that each type of noise had a positive effect on the accuracy of HR estimation. Finally, the accuracy of HR estimation with pretraining was greater than that without pretraining (Supporting Information S1: Table S5).

3.1.4. Neural Networks

We compared transformer and recurrent neural network (RNN) structures in terms of the loss of data. Supporting Information S1: Figure S3 shows that the RNN did not easily converge when applied to the data in this study.

3.2. HR Test

All methods were tested on our dataset. Table 2 shows that the accuracy of HR estimation with neural network filtering was much higher than that without neural network filtering. The accuracy of the HR calculated through our filtering network was also higher than those of EfficientPhys and MTTS‐CAN. The larger the error range set is, the higher the accuracy of the estimated HR.

Table 2.

HR test results for several methods.

Methods Range 1 Range 2 Range 4
Without using neural networks for filtering
Gudi et al. [12] 0.167 0.200 0.217
Wavelet 0.300 0.367 0.417
Using neural networks for filtering
Our method 0.867 0.950 0.983
EfficientPhys 0.833 0.917 0.917
MTTS‐CAN 0.783 0.800 0.833

3.3. HRV Test

3.3.1. Time Domain

According to the time‐domain analysis of the HRV, the RMSE of the HRV estimated by our method was much smaller than the RMSEs of the HRVs estimated by the EfficientPhys and MTTS‐CAN algorithms (Table 3), which indicated that our method performed the best in the HRV time‐domain analysis.

Table 3.

HRV from the time‐domain test.

Method RMSE
RMSSD
Our method 8.2
EfficientPhys 14.2
MTTS‐CAN 19.3
SDNN
Our method 17.5
EfficientPhys 21.9
MTTS‐CAN 26.5

3.3.2. Frequency Domain

The frequency‐domain analysis of the HRV shows that the correlation index score estimated by our method is much higher than those estimated by the EfficientPhys and MTTS‐CAN algorithms (Table 4), which indicates that our method performs best in HRV frequency‐domain analysis.

Table 4.

HRV from the frequency‐domain test.

Method Correlation index score
LF
Our method 0.81
EfficientPhys 0.51
MTTS‐CAN 0.42
HF
Our method 0.76
EfficientPhys 0.46
MTTS‐CAN 0.48
LF/HF
Our method 0.78
EfficientPhys 0.47
MTTS‐CAN 0.43

3.3.3. Correlations With Psychological Stress

We conducted a correlation analysis between six time-domain indices and the psychological stress index. Figure 5 shows that SDNN, RMSSD, SD1, SD2, and SDSD clearly follow second-order (quadratic) relationships with the stress index, and their correlations are highly significant. If our estimates were invalid, the scatterplots would appear uncorrelated.

Figure 5.

Figure 5

Relationships between HRV indices and the physiological stress index. The ordinate of each scatter plot is an HRV index that we calculated, and the abscissa is the psychological stress index calculated by an HRV analyzer.

4. Discussion

In this study, we improved the accuracy of HR and HRV measurements using three refined rPPG techniques. The first technique uses light balance scores to select images for subsequent analysis. The second technique pretrains the model on a large amount of synthesized data. The third technique is a two‐stage training mode that is based on the fusion of frequency‐domain information and time‐domain information. These techniques have important applications in medicine and health.

A previous study [33] used partial facial images for HR estimation. The signals collected with this approach may not reflect the true HR of the human body, thus yielding lower or higher estimates. We exploited the fact that the skin does not easily absorb blue light and used the blue channel to assess lighting balance, which alleviated the light balance problem to a certain extent and improved the SNR. We segmented the entire face and divided it into 16 grid cells, calculating a light balance score for each cell. The lower the light balance score is, the more uniform the light. Table 1 shows that the HR estimation accuracies of the 8 grid cells with low light balance scores were higher than those of the 8 grid cells with high light balance scores, which verifies that more uniform light can improve the accuracy of HR estimation.

This study employed multiple signal processing techniques. A Hamming window was used to remove frequency signals that were not related to HR, which had a positive effect on the accuracy of HR estimation. In the pretraining stage, the addition of frequency signals and noise improved the performance of our model. Moreover, previous studies [34] used limited sample sizes because of factors such as video collection costs. This study used a synthesized dataset to generate 1 million samples for pretraining our model. Overall, pretraining can significantly improve the accuracy of HR estimation. A previous study [25] used public datasets such as the VIPL‐HR dataset, which has low data quality. In this study, we screened the VIPL‐HR dataset and selected 200 high‐quality videos for the second stage of training.

There are generally two types of temporal models in deep learning: recurrent structures, such as the RNN, GRU, and LSTM, and transformer structures. An RNN continuously compresses historical information and combines it with the current input to make predictions, which makes it sensitive mainly to current, local information. Because this compression discards part of the history, RNNs perform poorly on long sequences. A transformer compresses no historical information; instead, it dynamically relates positions through attention mechanisms, making it globally sensitive and better suited to long sequences. In this study, signals were collected continuously for 20 s at 30 frames per second, yielding sequences of 600 data points. Such long sequences favor the transformer structure. Additionally, frequency signals were also used to process the sequence. Time-domain information naturally forms a sequence, but frequency-domain information has no time dimension and can be treated only as multistate information, which again makes a transformer structure preferable. Our results show that the RNN did not easily converge on the data in this study. Tables 2, 3, 4 show that our method outperformed traditional signal processing methods and other neural network methods in estimating HR and HRV, which indicates that light balancing, large-sample training, and the two-stage training mode can improve the accuracy of HR and HRV estimation.

However, this study has several limitations. Although pretraining on synthetic data improved model performance, the differences between simulated and real physiological signals were not quantitatively assessed, which may limit the model's ability to generalize to real healthcare scenarios. With respect to experimental conditions, measurement accuracy was validated only in static environments, and measurement errors during strenuous exercise were not addressed, leaving a gap between the study's findings and the needs of clinical exercise load trials. Additionally, the lack of analysis of variability across populations of different ages and skin tones may affect health equity, and although reliance on publicly available datasets such as VIPL‐HR reduces costs, their limited sample diversity may mask potential bias.

Future studies can improve upon this approach in multiple ways. Technological optimizations could include developing more realistic signal synthesis methods on the basis of generative adversarial networks, establishing a deep learning compensation model for motion artifacts, and fusing multimodal physiological signals (e.g., respiratory waveforms) to improve the system's robustness. To expand clinical applications, cross‐center clinical validation studies are recommended, with special attention given to the applicability of these findings to specific populations (e.g., burn victims and dark‐skinned individuals). Additionally, the exploration of edge computing deployment options could promote the practical application of this technology in settings such as home care and telemedicine. These improvements will ultimately facilitate the translation of rPPG technology from laboratory research to evidence‐based medical practice, thereby providing more reliable technical support for universal health monitoring.

5. Conclusion

Our proposed method can be used for long‐term, noncontact assessment of cardiovascular function in clinical monitoring, and is particularly useful in scenarios such as neonatal intensive care and sleep apnea syndrome screening. In public health, this technology could enable large‐scale cardiovascular screening with a standard camera, providing an inexpensive solution for the early detection of chronic diseases. Additionally, the blue‐light balancing technique used in this study addresses the problem of ambient light interference in conventional rPPG systems, thereby enabling reliable home health monitoring.

Author Contributions

Lan Lan: writing − original draft (equal). Jin Yin: writing − original draft (equal). Haohan Zhang: data curation (supporting). Hua Jiang: investigation (supporting). Rui Qin: visualization (supporting). Xia Zhao: investigation (supporting). Yu Zhang: data curation (supporting). Yilong Wang: conceptualization (supporting). Jiajun Qiu: conceptualization (supporting).

Ethics Statement

This study was approved by the Hospital (Approval Number: KY2022‐189‐02). This study does not report personal information.

Consent

Each participant provided informed consent.

Conflicts of Interest

Rui Qin, Director of the Research and Development Department at Zhenshu Xiangcheng (Chengdu) Technology Co. Ltd., supported the figure preparation for this work and declares no conflicts of interest.

Supporting information

Figure S1: Spectrum diagram. a. Autocorrelation signal. b. Power spectrum signal. In the power spectrum signal, the signal with the highest peak is the HR signal.

Figure S2: Screening process for variance minimization.

Figure S3: Comparison of loss in the training transformer structure (red line) and RNN structure (blue line).

Table S1: Difference in the signal‐to‐noise ratio with and without light balancing for ten random groups of data.

Table S2: Performance of the Hamming window on four random groups of data.

Table S3: HR estimation with the incorporation of frequency signals.

Table S4: HR estimation with the incorporation of noise.

Table S5: HR estimation with and without pretraining.

HCS2-5-74-s001.docx (263.6KB, docx)

Acknowledgments

The authors have nothing to report.

Lan L., Yin J., Zhang H., et al., “A Deep Neural Network Based on Two‐Stage Training for Estimating Heart Rate Variability From Camera Videos,” Health Care Science 5 (2026): 74‐84. 10.1002/hcs2.70047.

Lan Lan and Jin Yin contributed equally to this work.

Contributor Information

Yilong Wang, Email: yilong528@aliyun.com.

Jiajun Qiu, Email: qiujiajun@wchscu.cn.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

References

  • 1. Javaid A. Q., Wiens A. D., Fesmire N. F., Weitnauer M. A., and Inan O. T., “Quantifying and Reducing Posture‐Dependent Distortion in Ballistocardiogram Measurements,” IEEE Journal of Biomedical and Health Informatics 19, no. 5 (2015): 1549–1556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Malik J., Lo Y. L., and Wu H., “Sleep‐Wake Classification via Quantifying Heart Rate Variability by Convolutional Neural Network,” Physiological Measurement 39, no. 8 (2018): 085004. [DOI] [PubMed] [Google Scholar]
  • 3. Kim H. G., Cheon E. J., Bai D. S., Lee Y. H., and Koo B. H., “Stress and Heart Rate Variability: A Meta‐Analysis and Review of the Literature,” Psychiatry Investigation 15, no. 3 (2018): 235–245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Lan L., Hu G., Li R., et al., “Machine Learning for Selecting Important Clinical Markers of Imaging Subgroups of Cerebral Small Vessel Disease Based on a Common Data Model,” Tsinghua Science and Technology 29, no. 5 (2024): 1495–1508. [Google Scholar]
  • 5. Khairuddin A. M., Ku Azir K. N. F., and Kan P. E., “Limitations and Future of Electrocardiography Devices: A Review and the Perspective From the Internet of Things.” 2017 International Conference on Research and Innovation in Information Systems (ICRIIS) (Langkawi, Malaysia: IEEE, 2017), 1–7. [Google Scholar]
  • 6. Boushel R., Langberg H., Olesen J., Gonzales‐Alonzo J., Bülow J., and Kjær M., “Monitoring Tissue Oxygen Availability With Near Infrared Spectroscopy (NIRS) in Health and Disease,” Scandinavian Journal of Medicine & Science in Sports 11, no. 4 (2001): 213–222. [DOI] [PubMed] [Google Scholar]
  • 7. Tripathi K., “The Novel Hierarchical Clustering Approach Using Self‐Organizing Map With Optimum Dimension Selection,” Health Care Science 3, no. 2 (2024): 88–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Chen X., Cheng J., Song R., Liu Y., Ward R., and Wang Z. J., “Video‐Based Heart Rate Measurement: Recent Advances and Future Prospects,” IEEE Transactions on Instrumentation and Measurement 68, no. 10 (2019): 3600–3615. [Google Scholar]
  • 9. Odinaev I., Prae‐Arporn K., Wong K. L., et al., “Camera‐Based Heart Rate Variability and Stress Measurement From Facial Videos.” 2022 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) (Arlington, VA, USA: IEEE, 2022). [Google Scholar]
  • 10. Martinez‐Delgado G. H., Correa‐Balan A. J., May‐Chan J. A., Parra‐Elizondo C. E., Guzman‐Rangel L. A., and Martinez‐Torteya A., “Measuring Heart Rate Variability Using Facial Video,” Sensors 22, no. 13 (2022): 4690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Ayesha A. H., Qiao D., and Zulkernine F., “Heart Rate Monitoring Using PPG With Smartphone Camera.” 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (Houston, TX, USA: IEEE, 2021), 2985–2991. [Google Scholar]
  • 12. Gudi A., Bittner M., and van Gemert J., “Real‐Time Webcam Heart‐Rate and Variability Estimation With Clean Ground Truth for Evaluation,” Applied Sciences 10, no. 23 (2020): 8630. [Google Scholar]
  • 13. Yu Z., Peng W., Li X., Hong X., and Zhao G., “Remote Heart Rate Measurement From Highly Compressed Facial Videos: An End‐to‐End Deep Learning Solution With Video Enhancement.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (Seoul, Korea (South): IEEE, 2019), 151–160. [Google Scholar]
  • 14. Liu X., Fromm J., Patel S., and McDuff D., “Multi‐Task Temporal Shift Attention Networks for On‐Device Contactless Vitals Measurement,” preprint (2020). [Google Scholar]
  • 15. Liu X., Hill B., Jiang Z., Patel S., and McDuff D., “EfficientPhys: Enabling Simple, Fast and Accurate Camera‐Based Cardiac Measurement.” 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (Waikoloa, HI, USA: IEEE, 2023), 4997–5006. [Google Scholar]
  • 16. Yu Z., Shen Y., Shi J., et al., “PhysFormer++: Facial Video‐Based Physiological Measurement With SlowFast Temporal Difference Transformer,” International Journal of Computer Vision 131, no. 6 (2023): 1307–1330. [Google Scholar]
  • 17. Paruchuri A., Liu X., Pan Y., Patel S., McDuff D., and Sengupta S., “Motion Matters: Neural Motion Transfer for Better Camera Physiological Sensing,” preprint (2023). [Google Scholar]
  • 18. Liu X., Zuo J., Ma X., and Kuang H., “Uni‐rPPGNet: Efficient and Lightweight Remote Heart Rate Variability Measurement.” 2024 IEEE 7th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) (IEEE, 2024), 19–23. [Google Scholar]
  • 19. Wu Z., Xie Y., Zhao B., et al., “CardiacMamba: A Multimodal RGB‐RF Fusion Framework With State Space Models for Remote Physiological Measurement,” preprint (2025). [Google Scholar]
  • 20. Song R., Chen H., Cheng J., Li C., Liu Y., and Chen X., “PulseGAN: Learning to Generate Realistic Pulse Waveforms in Remote Photoplethysmography,” IEEE Journal of Biomedical and Health Informatics 25, no. 5 (2021): 1373–1384. [DOI] [PubMed] [Google Scholar]
  • 21. Gupta A. K., Kumar R., Birla L., and Gupta P., “RADIANT: Better rPPG Estimation Using Signal Embeddings and Transformer” (IEEE, 2023), 4965–4975. [Google Scholar]
  • 22. Pirzada P., Wilde A., and Harris‐Birtill D., “Remote Photoplethysmography for Heart Rate and Blood Oxygenation Measurement: A Review,” IEEE Sensors Journal 24, no. 15 (2024): 23436–23453. [Google Scholar]
  • 23. Vaswani A., Shazeer N., Parmar N., et al., “Attention Is All You Need,” in Proceedings of NIPS'17 (Red Hook, NY, USA, 2017), 6000–6010. [Google Scholar]
  • 24. Kakouche I., Maali A., El Korso M. N., Mesloub A., and Azzaz M. S., “Non‐Contact Measurement of Respiration and Heart Rates Based on Subspace Methods and Iterative Notch Filter Using Uwb Impulse Radar,” Journal of Physics D: Applied Physics 55, no. 3 (2021): 035401. [Google Scholar]
  • 25. Chatterjee A. and Roy U. K., “Algorithm to Calculate Heart Rate & Comparison of Butterworth IIR and Savitzky‐Golay FIR Filter,” Journal of Computer Science & Systems Biology 11 (2018): 171–177. [Google Scholar]
  • 26. Kim J. W., Park S. M., and Choi S. W., “Real‐Time Photoplethysmographic Heart Rate Measurement Using Deep Neural Network Filters,” ETRI Journal 43, no. 5 (2021): 881–890. [Google Scholar]
  • 27. Yu C., Gao C., Wang J., Yu G., Shen C., and Sang N., “BiSeNet V2: Bilateral Network With Guided Aggregation for Real‐Time Semantic Segmentation,” International Journal of Computer Vision 129, no. 11 (2021): 3051–3068. [Google Scholar]
  • 28. NumPy Developers, “numpy.hamming,” NumPy documentation (2022). [Google Scholar]
  • 29. Catai A. M., Pastre C. M., Godoy M. F., Silva E., Takahashi A. C. M., and Vanderlei L. C. M., “Heart Rate Variability: Are You Using It Properly? Standardisation Checklist of Procedures,” Brazilian Journal of Physical Therapy 24, no. 2 (2020): 91–102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Niu X., Han H., Shan S., and Chen X., “VIPL‐HR: A Multi‐Modal Database for Pulse Estimation From Less‐Constrained Face Video,” eds. Li H., Jawahar C. V., Mori G., and Schindler K. (Switzerland: Springer International Publishing AG, 2019), 562–576. [Google Scholar]
  • 31. Shi C., Zhu Y., and Yang S., “Plain‐Det: A Plain Multi‐Dataset Object Detector” (Berlin, Heidelberg, 2024), 210–226. [Google Scholar]
  • 32. Sun K. and Dredze M., “Amuro and Char: Analyzing the Relationship Between Pre‐Training and Fine‐Tuning of Large Language Models,” arXiv preprint (2024), arXiv:2408.06663. [Google Scholar]
  • 33. Chen W. and McDuff D., “DeepPhys: Video‐Based Physiological Measurement Using Convolutional Attention Networks,” in Proceedings of the 15th European Conference, Munich, Germany, September 8–14, 2018, Part II (2018), 356–373. [Google Scholar]
  • 34. Niu X., Shan S., Han H., and Chen X., “RhythmNet: End‐to‐End Heart Rate Estimation From Face via Spatial‐Temporal Representation,” IEEE Transactions on Image Processing 29 (2020): 2409–2423. [DOI] [PubMed] [Google Scholar]


