Abstract
The erratic nature of cardiac rhythms can precipitate a multitude of pathologies. Consequently, the endeavor to achieve stabilization of the human heartbeat has garnered significant scholarly interest in recent years. In this context, an adaptive nonlinear disturbance compensator (ANDC) strategy has been meticulously developed to ensure the stabilization of cardiac activity. Moreover, a double deep reinforcement learning (DDRL) algorithm has been employed to adaptively calibrate the tunable coefficients of the ANDC controller. To facilitate this, as well as to replicate authentic environmental conditions, a dynamic model of the heart has been constructed utilizing the framework of the Markov Decision Process (MDP). The proposed methodology functions in a closed-loop configuration, wherein the ANDC controller guarantees both stability and disturbance mitigation, while the DDRL agent persistently refines control parameters in accordance with the observed state of the system. Two categories of input signals, namely normal signals and MDP-based stochastic signals, are administered to assess the system’s efficacy under both standard and uncertain conditions. Furthermore, the influence of pathological neural activity is emulated through the introduction of external signals characterized by eight discrete frequency components. Quantitative assessments employing metrics such as peak amplitude, signal energy, and zero-crossing rate are performed for each state of the cardiovascular model. The findings substantiate that the ANDC-DDRL strategy effectively stabilizes cardiac rhythms across diverse conditions, surpassing the performance of conventional baseline methods.
Keywords: Cardiovascular system, Heartbeat, Markov decision process (MDP), Adaptive nonlinear disturbance compensator (ANDC), Double deep reinforcement learning (DDRL)
Subject terms: Cardiology, Computational biology and bioinformatics, Engineering, Mathematics and computing
Introduction
The human heart rate exhibits variability rather than constancy. Indeed, heart rate variability (HRV) constitutes a fundamental element in the proficient operation of the cardiovascular system and acts as a significant marker in medical diagnostics. Fluctuations in cardiac rhythm, whether they are regular or irregular, can indicate both physiological and pathological states. For example, a moderate degree of variability is frequently linked to a well-functioning autonomic nervous system, whereas excessive or reduced variability may denote underlying cardiac conditions such as arrhythmias, heart failure, or autonomic dysfunction. Consequently, the analysis and regulation of heart rate dynamics have emerged as a primary objective in the fields of biomedical engineering and clinical research1–4.
Mathematical models assume a vital role across all natural sciences, encompassing domains from particle physics to medicine and biology. The precise mathematical modeling of the heart is indispensable for comprehending its dynamic behavior and for devising effective control strategies for pathological regulation5,6. By encapsulating key nonlinearities, time-dependent behaviors, and the interrelations among cardiovascular components, mathematical models empower researchers to simulate diverse physiological and pathological scenarios in a controlled and replicable manner. Furthermore, they establish a foundation for the creation and validation of sophisticated control algorithms that can be employed to regulate heart rhythms across various clinical contexts. The objective of these models is to convert a series of hypotheses (theory) into predictions of observable phenomena7,8.
Over recent decades, significant progress has been made in the mathematical modeling and regulation of cardiac dynamics. The initial pursuits were inspired by the remarkable effectiveness of the Hodgkin–Huxley model, which precisely characterized nerve action potentials9. However, transferring similar principles to cardiac physiology has proven considerably more complex, owing to the extended duration and multifaceted nature of cardiac action potentials. Researchers such as FitzHugh10, together with Van der Pol and Van der Mark11, laid the essential groundwork for modeling the nonlinear oscillatory behavior of cardiac pacemaker cells. Building on this foundational work, Grudzinski and Zebrowski12, as well as Gois and Savi, developed models that employed modified Van der Pol oscillators to replicate both normal and pathological electrocardiogram (ECG) patterns. These models elucidated the inherently nonlinear and at times chaotic characteristics of cardiac rhythms, encompassing phenomena such as period-doubling bifurcations and arrhythmogenic chaos.
Alongside these advancements in modeling, an array of control strategies has been proposed to mitigate aberrant cardiac responses. Glass et al. examined periodic stimulation of cardiac cell aggregates to analyze arrhythmic phenomena13, while other scholars have implemented feedback control methodologies to reduce alternans or chaotic rhythms within both biological tissues and simulated environments14. Conventional control methodologies, including proportional-integral-derivative (PID)15 and two-degree-of-freedom (2-DOF) PID controllers16, have been employed to stabilize cardiac rhythms under noisy or variable conditions. More recently, discretized feedback systems17 and nonlinear control strategies18 have been utilized to address specific pathological features within cardiac models. Despite the promising results of these methodologies, many remain critically reliant on accurate system modeling, which poses a considerable challenge due to the physiological variability and nonlinear dynamics intrinsic to the heart. As a result, the need for adaptive, model-flexible, and intelligent control solutions continues to grow in the pursuit of reliable heartbeat stabilization.
In light of the aforementioned challenges, the present study introduces an innovative adaptive intelligent control framework meticulously devised to stabilize the dynamic characteristics of human heartbeat across both normative and pathological conditions. The proposed methodology integrates an adaptive nonlinear disturbance compensator (ANDC) aimed at actively counteracting external disturbances while addressing system uncertainties, in conjunction with a double deep Q-Network (DDQN)-based reinforcement learning (RL) algorithm that autonomously calibrates the parameters of the controller in real-time. In contrast to traditional model-dependent control strategies, the proposed approach is fundamentally data-driven and possesses the capability to extract optimal parameter adjustments directly from the system’s behavioral dynamics, thereby enhancing resilience to nonlinearities and physiological variations.
Recent advancements in learning-based control methodologies augment classical paradigms from four distinct perspectives. First, supervised optimal control integrates trajectory imitation with reinforcement learning (RL) to tackle sparse or intricate reward shaping challenges in complex continuous systems19. Second, centralized deep RL controllers have been employed to orchestrate multi-objective trajectory tracking within distributed-drive electric vehicles, illustrating the efficacy of policy learning across interrelated lateral and longitudinal objectives20. Third, near real-time online RL investigates the comparative effects of synchronous versus asynchronous update schedules on latency and stability throughout the learning process21. Finally, the demands of safety-aware and partially observable industrial applications (such as blast furnace operations) incentivize the development of constraint-aware and offline RL strategies to mitigate unsafe exploration while managing unobservable states22. In contrast, our research specifically addresses the regulation of physiological rhythms: we employ a robust ANDC and utilize DDQN to dynamically adjust its observer and feedback gains in real time using a physiology-informed reward mechanism, thereby merging robustness with data-driven adaptability. In addition, to replicate authentic clinical scenarios, the cardiac model is articulated within a Markov decision process (MDP) framework, enabling dynamic responses to both deterministic and stochastic stimuli. The principal contributions of the current manuscript are delineated as follows:
i. The cardiovascular system is conceptualized within a Markov Decision Process (MDP) framework, thereby facilitating the evaluation of the controller under both deterministic (normal) and stochastic (pathological) input conditions, which enhances the realism of the simulations.
ii. An adaptive nonlinear control framework is developed for the stabilization of human heartbeat, which incorporates an adaptive nonlinear disturbance compensator (ANDC) to ensure robust performance in the presence of disturbances and model uncertainties.
iii. A double deep Q-Network (DDQN)-based reinforcement learning paradigm is integrated to autonomously and continuously adjust the parameters of the ANDC controller in real time, thereby improving adaptability to physiological variability.
iv. Quantitative assessments are conducted by applying external brain-like signals across eight distinct frequency levels to emulate neurological or pathological influences on cardiac function. Furthermore, performance metrics such as peak amplitude, energy variation, and zero-crossing rate are evaluated to ascertain the efficacy of the proposed controller in comparison to existing methodologies.
The structure of the remainder of the paper is organized as follows: Sect. 2 elaborates on the formulation of the dynamic human heartbeat model, encompassing the state-space representation and its integration within both normative and MDP-based contexts. Section 3 provides a thorough account of the proposed control strategy, detailing the formulation of the ANDC controller and the implementation of the DDQN algorithm. The simulation results derived from the execution of the proposed strategy within the Simulink environment are scrutinized in Sect. 4. Finally, Sect. 5 concludes the manuscript and delineates potential avenues for future research.
Formulation of human heart dynamics using state-space and MDP frameworks
To replicate the physiological phenomenon of the heartbeat, the heart is described through an electromechanical framework. The human heart model used here comprises the pressure variable PIN(t); the capacitances associated with the right atrium and left atrium, CBL and CCL, respectively; the capacitances of the right ventricle and left ventricle, CDL and CEL; the mutual inductance between the ventricle and atrium, LF; the inductances of the left and right ventricles, LE and LD; and the inductances of the right atrium and left atrium, LB and LC. The variable PIN(t) denotes the input to the system, whereas CBL is modeled in the circuit as a capacitor that functions as the principal pump; the same characterization applies to CCL. The pressure generated by the right atrium acts as an input to the right ventricle, mirroring the function of the left atrial chamber23,24. The dynamic model of the heart based on the electromechanical system is given as follows15,16:
[Equation (1)]
[Equation (2)]
[Equation (3)]
[Equation (4)]
According to the dynamic model, the human heart framework based on the electromechanical paradigm consists of four components (pacemaker, heartbeat, circulatory system, and sensor), which are represented by the transfer functions in Eqs. (1)–(4), respectively25. This model describes the cardiovascular system under normal operating conditions of the human heart. To emulate the human heart under various pathological conditions, a stochastic signal generator with diverse frequencies has been incorporated into the primary model. The output of the system, including this external signal, can be expressed as follows:
[Equation (5)]
where the output variable represents the heart rate of the cardiovascular system.
To enable comprehensive analysis and control design, the specified cardiovascular model, which encompasses transfer functions for the pacemaker, heartbeat, circulatory system, and sensor, is restructured into a state-space representation. This conversion is imperative due to the fact that state-space models provide direct insight into the internal dynamics (states) of each subsystem, which is vital for scrutinizing the phase behavior of the states and comprehending their temporal progression. Furthermore, the representation of the system in state-space form facilitates the incorporation of contemporary control and decision-making methodologies, such as Markov Decision Processes (MDPs). Modeling the system as an MDP environment necessitates a state-based framework wherein transitions and rewards can be delineated in relation to the internal state dynamics of the system26. Consequently, the transformation not only enhances the understanding of the physiological behavior of the cardiovascular system but also offers an appropriate mathematical architecture for the execution of intelligent control strategies. The detailed formulation of the state-space model has been elucidated in the subsequent sections.
Each constituent of the cardiovascular system is delineated by a transfer function and transmuted into its corresponding state-space representation to promote system analysis and control design. The overarching state-space model is articulated as follows:
[Equation (6)]
Thus:
[Equation (7)]
[Equation (8)]
[Equation (9)]
[Equation (10)]
The configuration of the devised state-space model is illustrated in Fig. 1. To replicate the normal human heartbeat, a pulse generator (amplitude = 1 mV, period = 1 s, pulse width = 1 s, phase delay = 0–1 unit) has been incorporated into the system. The outcomes of the system, encompassing the normal human heartbeat and the phase behavior of each state, are depicted in Fig. 2.
Fig. 1.
The framework of the heart dynamic model with two different inputs.
Fig. 2.
The results of the heart dynamic model, (a) normal human heart, (b) Phase system behavior.
Modeling the cardiovascular system as an MDP environment
In order to operationalize the cardiovascular model within a Markov Decision Process (MDP) framework, a discrete-time, discrete-action signal was employed as the input for the system, signifying the intended heartbeat reference. This signal is denoted as:
[Equation (11)]
where the action selected at each decision step corresponds to a discrete pulse level chosen from the MDP action set. The actions are applied as brief pulses every 2 s through randomized selection, thereby introducing variability and exploration into the control.
In the dynamic model, the signal of Eq. (11) acts as the external input driving the pacemaker block of Fig. 1. Thus, in the stochastic case, the deterministic pulse generator used in Sect. 2.1 is replaced by this action signal. This modification propagates into the state-space formulation (7)–(10), where the action signal appears as the exogenous input that influences the system states.
By integrating this MDP-generated signal into the input of the physiological model, the system navigates through an array of dynamic states in response to discrete decisions, thereby aligning the continuous-time system with a state-action framework suitable for reinforcement learning and policy optimization. This methodology facilitates the examination of state transitions, rewards, and agent performance in the regulation of cardiac dynamics through sophisticated decision-making.
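As an illustration of the input described above, the following is a minimal Python sketch of a randomly selected pulse train. It is not the authors' Simulink implementation: the function name `mdp_pulse_input`, the action levels, the pulse width, and the time step are all assumptions chosen for the example; only the 2 s decision interval and the randomized selection come from the text.

```python
import random


def mdp_pulse_input(duration_s, dt, action_levels, hold_s=2.0, pulse_s=0.2, seed=0):
    """Build a pulse train in which a new action level is drawn every `hold_s` seconds.

    The level is emitted as a brief pulse of width `pulse_s` (assumed value) and the
    signal is zero for the rest of the decision interval.
    """
    rng = random.Random(seed)
    n_steps = int(round(duration_s / dt))
    steps_per_hold = int(round(hold_s / dt))
    steps_per_pulse = int(round(pulse_s / dt))
    times, signal = [], []
    level = 0.0
    for k in range(n_steps):
        if k % steps_per_hold == 0:
            # New decision step: pick one discrete action at random.
            level = rng.choice(action_levels)
        in_pulse = (k % steps_per_hold) < steps_per_pulse
        times.append(k * dt)
        signal.append(level if in_pulse else 0.0)
    return times, signal


# Hypothetical action set with three pulse levels.
LEVELS = [0.5, 1.0, 1.5]
t, u = mdp_pulse_input(duration_s=10.0, dt=0.01, action_levels=LEVELS)
```

The resulting list `u` could be fed to a simulated plant as the exogenous input; a seeded generator is used here only so that runs are repeatable.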
The outcomes of the heart model within the MDP environment, encompassing the input signal, output heartbeat, and phase system of each state, have been illustrated in Fig. 3.
Fig. 3.
The results of the MDP heart model, (a) input/output signal, (b) Phase states behavior.
Formulation of the adaptive nonlinear disturbance compensator (ANDC) and double deep reinforcement learning (DDRL)
The adaptive nonlinear disturbance compensator (ANDC) paradigm affords numerous benefits when utilized within physiological models, such as the cardiovascular system. In contrast to traditional linear controllers, which may encounter difficulties in addressing the nonlinearities, uncertainties, and disturbances endemic to biological dynamics, the ANDC proficiently estimates and compensates for both internal and external disturbances in real time27. This characteristic renders it particularly effective for the modulation of heart rhythm under fluctuating conditions, including autonomic variability and pharmacological influences. In this manuscript, the ANDC is applied to stabilize and monitor desired heartbeat signals produced by a Markov Decision Process (MDP) agent. By utilizing nonlinear error feedback in conjunction with an extended observer framework, the controller guarantees robustness and adaptability, thereby enhancing performance even in the presence of model uncertainties and external perturbations. The configuration of the developed controller is illustrated in Fig. 4.
Fig. 4.
The details of the implemented controller.
Structure of the ANDC controller
The ANDC architecture comprises three principal components: a reference trajectory generator that produces smooth reference signals, a dynamic state and disturbance estimator responsible for disturbance assessment, and a nonlinear feedback-based control law. The input to the system is the reference signal generated by the MDP, which signifies the intended heart rate. The system model under consideration is articulated as follows:
[Equation (12)]
where the two state variables denote the actual heart rate and its rate of change, respectively. The remaining terms are the unknown system dynamics, the total disturbance, and the control input.
Reference trajectory generator (RTG)
The RTG produces a continuous tracking reference and its derivative from the target signal, which in the context of this investigation is provided by the MDP agent:
[Equation (13)]
Here, the quantities involved are the sampling interval, the tracking velocity parameter, the filter gain, and the discrete reference input from the MDP agent. The RTG uses a nonlinear tracking function that smooths the trajectory; its detailed definition is provided later in Eq. (17).
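A reference trajectory generator of this kind can be sketched in Python as follows. The sketch assumes that the nonlinear tracking function takes the time-optimal form commonly written as `fhan` in the active disturbance rejection control literature; the paper's own Eqs. (13) and (17) are not reproduced here, so the function names, the parameter values, and the use of a single step size for both the sampling interval and the filter gain are assumptions.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fhan(x1, x2, r, h):
    """Discrete time-optimal synthesis function (one common form, assumed here)."""
    d = r * h
    d0 = h * d
    y = x1 + h * x2
    a0 = math.sqrt(d * d + 8.0 * r * abs(y))
    if abs(y) > d0:
        a = x2 + 0.5 * (a0 - d) * sign(y)
    else:
        a = x2 + y / h
    if abs(a) > d:
        return -r * sign(a)
    return -r * a / d


def track(targets, r=50.0, h=0.01):
    """Return the smoothed reference v1 and its derivative v2 for each target sample."""
    v1, v2 = 0.0, 0.0
    out = []
    for v in targets:
        acc = fhan(v1 - v, v2, r, h)
        # Euler update of the two-state tracking filter.
        v1, v2 = v1 + h * v2, v2 + h * acc
        out.append((v1, v2))
    return out


# Step target held for 5 s at an assumed sampling interval of 0.01 s.
trajectory = track([1.0] * 500)
```

A larger `r` makes the smoothed reference follow the target faster, at the cost of a steeper derivative signal.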
State and disturbance observer
The state and disturbance observer estimates not only the two system states but also the total disturbance, which is treated as an extended state28:
[Equation (14)]
In this framework, the first two observer states are the estimates of the two system states, while the third is the estimated disturbance. The observer has three gains. Additionally, a nonlinear feedback function is applied to the state-estimation error; its definition is provided in Eq. (18).
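The observer structure described above can be illustrated with a short Python sketch. It assumes the extended state observer form that is standard in the disturbance-rejection literature, with a saturating power function usually written as `fal`; the exponents (0.5 and 0.25), the linear-zone width `delta`, the gains, and the toy double-integrator plant are assumptions for the example, not values taken from the paper.

```python
import math


def fal(e, alpha, delta):
    """Nonlinear gain: linear inside |e| <= delta, power law outside."""
    if abs(e) <= delta:
        return e / (delta ** (1.0 - alpha))
    return math.copysign(abs(e) ** alpha, e)


def eso_step(z, y, u, h, beta, b0, delta=0.01):
    """Advance the observer one step.

    z    : tuple (z1, z2, z3) = estimates of output, its rate, and total disturbance
    y    : measured output
    u    : applied control input
    beta : tuple of the three observer gains
    b0   : estimated control gain
    """
    z1, z2, z3 = z
    b1, b2, b3 = beta
    e = z1 - y
    z1_next = z1 + h * (z2 - b1 * e)
    z2_next = z2 + h * (z3 - b2 * fal(e, 0.5, delta) + b0 * u)
    z3_next = z3 + h * (-b3 * fal(e, 0.25, delta))
    return (z1_next, z2_next, z3_next)


# Toy run on a double integrator with a constant unknown disturbance (assumed plant).
h, b0 = 0.01, 1.0
x1, x2 = 0.0, 0.0
z = (0.0, 0.0, 0.0)
for _ in range(500):
    disturbance, u = 1.0, 0.0
    x1, x2 = x1 + h * x2, x2 + h * (disturbance + b0 * u)
    z = eso_step(z, x1, u, h, beta=(100.0, 300.0, 1000.0), b0=b0)
```

In such a run the third observer state `z[2]` plays the role of the disturbance estimate; how closely it follows the true disturbance depends on the three gains, which is the quantity the learning agent is later asked to tune.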
Nonlinear feedback law
The control input is derived from the nonlinear feedback associated with the estimated tracking errors:
[Equation (15)]
[Equation (16)]
Here, the law involves two nonlinear feedback gains, the estimated control gain, and the discrete-time tracking errors at the current step. This configuration ensures that the control signal compensates for the total disturbance and achieves robust output tracking. The two core nonlinear functions used in the ANDC are defined as29:
[Equation (17)]
[Equation (18)]
The function in Eq. (17) represents a time-optimal control mechanism that regulates the rate of convergence while simultaneously mitigating overshoot during the tracking process.
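To make the role of the feedback gains concrete, here is a minimal Python sketch of a nonlinear error-feedback law with disturbance compensation. It assumes the widely used form in which the two tracking errors pass through a power-law function and the disturbance estimate is subtracted before dividing by the estimated control gain; the exponents `alpha1` and `alpha2`, the width `delta`, and all names are assumptions and may differ from the paper's Eqs. (15) and (16).

```python
import math


def fal(e, alpha, delta):
    """Nonlinear gain: linear inside |e| <= delta, power law outside."""
    if abs(e) <= delta:
        return e / (delta ** (1.0 - alpha))
    return math.copysign(abs(e) ** alpha, e)


def control_law(v1, v2, z1, z2, z3, k1, k2, b0, alpha1=0.75, alpha2=1.5, delta=0.01):
    """Compute the control input from reference (v1, v2) and estimates (z1, z2, z3)."""
    e1 = v1 - z1  # tracking error on the output
    e2 = v2 - z2  # tracking error on the rate of change
    u0 = k1 * fal(e1, alpha1, delta) + k2 * fal(e2, alpha2, delta)
    # Subtract the estimated total disturbance, then scale by the control gain.
    return (u0 - z3) / b0


# With zero tracking error the law only cancels the estimated disturbance.
u_cancel = control_law(v1=1.0, v2=0.0, z1=1.0, z2=0.0, z3=2.0, k1=3.0, k2=1.0, b0=1.0)
# With a unit output error the first gain dominates.
u_track = control_law(v1=1.0, v2=0.0, z1=0.0, z2=0.0, z3=0.5, k1=3.0, k2=1.0, b0=2.0)
```

The two gains `k1` and `k2` together with the three observer gains form the five tunable quantities that the learning stage adjusts.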
Tuning of ANDC parameters using double deep Q-learning (DDQN)
In this study, we implement a double deep Q-Learning (DDQN) algorithm to autonomously adjust critical parameters of the nonlinear disturbance-rejection-based controller employed for the regulation of heartbeat dynamics. Given the nonlinear, uncertain, and time-varying characteristics inherent in biological systems, manual tuning of parameters is frequently suboptimal and impractical, particularly in real-time or personalized applications. Reinforcement learning (RL), specifically value-based methodologies such as DDQN, presents a compelling solution by facilitating data-driven30, closed-loop optimization of control parameters through iterative interactions with the environment. The process of parameter tuning via the DDQN algorithm is elucidated in Fig. 5.
Fig. 5.
The architecture for tuning the controller’s parameters with the DDQN algorithm.
Motivation and overview
The controller utilized in this investigation comprises two fundamental components necessitating parameter tuning:
i. State and Disturbance Observer: The three observer gains dictate the convergence rate of the observer and the precision of disturbance estimation.
ii. Nonlinear Feedback Law: The two feedback gain parameters determine the intensity of state error feedback, thereby exerting a direct influence on stability, responsiveness, and steady-state error.
The optimal configuration of these five parameters is likely to fluctuate based on individual patient conditions, input dynamics, and external disturbances. Consequently, we approach the tuning task as a sequential decision-making challenge and employ a DDQN agent to navigate the parameter space and achieve convergence towards high-performing configurations.
MDP formulation for parameter tuning
The process of tuning controllers is conceptualized as a discrete-time Markov Decision Process (MDP), delineated as follows31:
$$\mathcal{M} = \langle S, A, P, R, \gamma \rangle \qquad (19)$$
In this context, S denotes the state space, wherein each state vector encompasses the current output error of the system, the estimated disturbance, and the control signal. The symbol A indicates the action space, wherein each action vector signifies a prospective set of controller parameters (the three observer gains and the two feedback gains). The transition function P describes how the system evolves in accordance with the selected parameters and the nonlinear control law, with the subsequent state being contingent upon the system’s response to the applied control. R represents the reward function, which incentivizes actions that minimize tracking error, control effort, and disturbance estimation error. Lastly, γ is the discount factor that balances short-term and long-term performance during the learning process.
Remark 1
In the following, we use subscript notation (a step index on the state, action, and reward) to denote discrete-time variables in the reinforcement learning framework, while continuous-time signals of the cardiovascular system remain expressed as functions of time.
DDQN framework and learning
In order to mitigate the overestimation bias inherent in conventional Q-learning, we employ Double Deep Q-Networks (DDQN) utilizing two neural networks32:
i. Online Q-Network: This network approximates the state-action value function and is used for action selection.
ii. Target Q-Network: This network furnishes stable target values for training by being updated at a lower frequency.
At each iteration, the network is trained based on the Bellman target:
[Equation (20)]
Moreover, the loss function can be articulated as:
[Equation (21)]
According to Eq. (21), the parameters of the online network are updated by gradient descent, with the targets computed from the target network, while experience replay is employed to enhance sample efficiency and mitigate variance. In this context, the online and target Q-networks each comprise two hidden layers (64 and 32 neurons) with ReLU activations. The exploration process is managed using an ϵ-greedy policy, where the parameter ϵ decays progressively over time33. It should also be mentioned that we use fixed-period hard updates of the target network and batched replay; in practice, asynchronous update scheduling is a plausible extension to further reduce latency during online deployment.
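The target and loss computation can be sketched in plain Python as follows. The sketch uses the standard double-Q rule, in which the online network chooses the next action and the target network evaluates it, together with a mean squared error loss; this is assumed to match Eqs. (20) and (21), which are not reproduced here. Q-values are passed in as plain lists rather than produced by a neural network, and all names and numbers are illustrative.

```python
def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma):
    """Compute double-Q targets for a batch of transitions.

    q_online_next[i] and q_target_next[i] hold the Q-values of every action in the
    next state, as predicted by the online and target networks respectively.
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        if done:
            # Terminal transition: no bootstrapped term.
            targets.append(r)
            continue
        # Action selection with the online network ...
        best = max(range(len(q_on)), key=q_on.__getitem__)
        # ... action evaluation with the target network.
        targets.append(r + gamma * q_tg[best])
    return targets


def mse_loss(targets, q_taken):
    """Mean squared error between targets and the Q-values of the actions taken."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)


# Two made-up transitions with two actions each.
batch_targets = ddqn_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    q_online_next=[[0.1, 0.9], [0.3, 0.2]],
    q_target_next=[[0.5, 0.2], [1.0, 1.0]],
    gamma=0.9,
)
batch_loss = mse_loss(batch_targets, q_taken=[1.0, 0.5])
```

Note that in the first transition the target network's own maximum (0.5) is not used: the online network prefers the second action, so its target-network value (0.2) enters the target, which is how the overestimation bias is reduced.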
Reward function design
The formulation of the reward function is pivotal in directing the learning trajectory of the DDQN agent towards achieving optimal control efficacy. Within the domain of cardiovascular regulation, the fundamental control objectives encompass the minimization of the divergence between the actual and desired heartbeat (tracking error), the suppression of disturbances that may influence system dynamics, and the assurance that the control signal remains within physiologically safe and efficient bounds. Consequently, three essential performance metrics are integrated into the reward framework: the squared tracking error, the estimated magnitude of disturbances, and the squared control effort. These elements collectively encapsulate both the precision of regulation and the efficiency of the controller. In light of these considerations, the reward at each discrete time step
is articulated as:
[Equation (22)]
where the three terms are the discrepancy between the reference heartbeat and the output heartbeat, the estimated disturbance, and the control effort, and the three weighting factors modulate the significance of accuracy, disturbance rejection, and the smoothness of actuation. This reward architecture incentivizes the agent to identify controller parameters that guarantee precise tracking with minimal disturbance influence and control effort, thereby fostering robust and energy-efficient cardiovascular regulation.
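A reward of this shape can be written as a small Python function. The negative weighted sum below is one plausible reading of the description (squared tracking error, magnitude of the estimated disturbance, squared control effort); the exact expression of Eq. (22) and the weight values are not available here, so `w1`, `w2`, and `w3` are placeholders.

```python
def reward(error, disturbance_est, control, w1=1.0, w2=0.1, w3=0.01):
    """Negative weighted cost: larger errors, disturbances, or effort lower the reward."""
    return -(w1 * error ** 2 + w2 * abs(disturbance_est) + w3 * control ** 2)


# Perfect regulation yields the maximum reward of zero.
r_ideal = reward(error=0.0, disturbance_est=0.0, control=0.0)
# A made-up step with nonzero error, disturbance estimate, and control effort.
r_step = reward(error=2.0, disturbance_est=-3.0, control=10.0)
```

Raising `w3` relative to `w1` would push the agent toward gentler control signals at the expense of tracking accuracy, which is the trade-off the weighting factors are meant to expose.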
Learning workflow and integration
The incorporation of DDQN into the nonlinear controller tuning paradigm adheres to a methodical learning framework. Initially, the cardiovascular system, inclusive of the nonlinear controller, is initialized. In each episode, the DDQN agent selects a parameter vector, which signifies the observer and feedback gains. These parameters are subsequently implemented within the controller, which is used to regulate the system over a finite simulation duration. Throughout this period, the system’s output, the estimated disturbance, and the control signal are monitored to form the current state and to calculate the corresponding reward. The transition tuple (state, action, reward, next state) is stored in a replay memory for training. The Q-network is refined using mini-batches drawn from this memory to stabilize learning and to enhance generalization. An ϵ-greedy strategy is employed to balance exploration and exploitation during the training phase. This iterative process is repeated across multiple episodes, permitting the agent to progressively refine its policy and approach an optimal configuration of controller parameters that improves tracking performance, robustness, and disturbance mitigation. Upon convergence, the learned parameters are implemented for real-time control.
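The bookkeeping parts of this workflow (replay memory, ϵ-greedy selection, and ϵ decay) can be sketched in Python as below. The class and function names, the buffer capacity, and the decay schedule are assumptions for illustration; the paper's actual hyperparameters are those listed in Table 1.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random without replacement."""
        return self.rng.sample(list(self.data), batch_size)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Exponentially decaying exploration rate with a floor (assumed schedule)."""
    return max(eps_end, eps_start * decay ** step)


buffer = ReplayBuffer(capacity=3)
for k in range(5):
    buffer.push(state=k, action=0, reward=-float(k), next_state=k + 1)
mini_batch = buffer.sample(2)
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0, rng=random.Random(1))
```

In a full training loop each sampled mini-batch would be passed to the target and loss computation, and the target network would be overwritten with the online weights at a fixed period.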
Numerical results
In this section, to assess the efficacy of the proposed methodology aimed at stabilizing human heartbeat, the state-space dynamic model of the human heart with two inputs (normal and MDP) has been simulated within a Simulink environment. Subsequently, the tuning algorithm (DDQN) for adjusting the adaptive nonlinear disturbance compensator (ANDC) coefficients has been executed in a MATLAB environment. The configuration parameters of the DDQN used for tuning are listed in Table 1. A series of tests has been devised to evaluate the operational efficacy of the proposed strategy, which will be elaborated upon in the ensuing discussion.
Table 1.
The configuration of the DDQN hyperparameters.
| Hyperparameter | Value | Hyperparameter | Value |
|---|---|---|---|
| Learning Rate | | Batch Size | 128 |
| Discount Factor | | Target Network Update Frequency | 500 |
| Replay Buffer Size | | Optimizer | Adam |
Experiment 1) Functionality of the controller under normal input with pathological states and MDP input
In this section, to elucidate the effects of pathological states on human heartbeat, a random sine wave signal with an amplitude of 20 and varied frequencies corresponding to different waveforms (delta, theta, alpha, beta) has been applied to the heart model. The specific details of the wave frequencies have been cataloged in Table 2. The heartbeat output influenced by the brain waves has been portrayed in Fig. 6. To provide further insight into the external signal’s impact on the heart, the phase behavior of the states has been illustrated in Fig. 7.
Table 2.
Frequency ranges of the brain waves.
| Wave name | Frequency range |
|---|---|
| Delta | From 1.0 to 3.9 Hz |
| Theta | From 4.0 to 7.9 Hz |
| Alpha | From 8.0 to 13.9 Hz |
| Beta | From 14.0 to 100 Hz |
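The disturbance described in this experiment can be imitated with a short Python sketch that draws one frequency from a chosen band of Table 2 and builds a sine wave of amplitude 20. Drawing a single uniformly distributed frequency per run is an interpretation of "random sine wave", and the function name, seed handling, and time step are assumptions.

```python
import math
import random

# Band limits in Hz, taken from Table 2.
BANDS_HZ = {
    "delta": (1.0, 3.9),
    "theta": (4.0, 7.9),
    "alpha": (8.0, 13.9),
    "beta": (14.0, 100.0),
}


def band_sine(band, duration_s, dt, amplitude=20.0, seed=0):
    """Return (frequency, samples) for a sine whose frequency lies in the given band."""
    low, high = BANDS_HZ[band]
    rng = random.Random(seed)
    freq = rng.uniform(low, high)
    n = int(round(duration_s / dt))
    samples = [amplitude * math.sin(2.0 * math.pi * freq * k * dt) for k in range(n)]
    return freq, samples


# A 2 s alpha-band disturbance sampled at 1 kHz (the step must resolve the beta band too).
alpha_freq, alpha_wave = band_sine("alpha", duration_s=2.0, dt=0.001)
```

The sampling step should stay well below 5 ms if the upper end of the beta band is to be represented without aliasing.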
Fig. 6.
The outcomes of the human heartbeat under various brain waves.
Fig. 7.
The phase state trajectories of four states under various frequencies.
Figure 6 distinctly illustrates the impacts of pathological conditions on cardiac rhythm. Moreover, Fig. 7 delineates the phase-space trajectories of the system subjected to varying frequency stimuli, represented through the four state variables alongside their respective temporal derivatives. These trajectories reveal the system’s nonlinear dynamic characteristics and its progression within state space in response to diverse frequency components, emphasizing the influence of different wave bands on the dynamics of the system.
Fig. 8.
The output of the heart system with the developed strategy under different frequencies.
In the subsequent phase of assessing the efficacy of the controller in managing unstable signals across varying frequencies, the proposed methodology has been implemented in the cardiac system, with the results of the controller depicted in Fig. 8. Figure 8 demonstrates that the controller proficiently stabilizes the amplitude of the system’s response across all frequency bands. The resultant signals manifest a more orderly and consistent waveform pattern in comparison to the previously illustrated uncontrolled responses, signifying enhanced tracking capabilities and diminished oscillatory distortion. This observation substantiates the controller’s capacity to sustain stable system dynamics in the face of frequency-dependent perturbations, thereby affirming its resilience and adaptability under diverse conditions. To further elucidate the operational capabilities of the developed controller, the phase-space trajectories of each state under various brain wave frequencies have been illustrated in Fig. 9.
Fig. 9.
The phase state trajectories of the four states under different frequencies.
The phase space dynamics presented in Fig. 9 indicate that under regulated conditions, the system demonstrates markedly more organized and stable behavior relative to the uncontrolled scenario. Specifically, two of the state trajectories exhibit repetitive and periodic characteristics with attenuated chaotic oscillations, signifying a substantial level of stability and predictability imparted by the controller. It appears that the controller enforces a convergence behavior across all states, confining the system within bounded trajectories even amidst fluctuating frequency signals. This regularity within the phase space serves as a robust indicator of the controller’s efficacy in managing dynamic variations and the nonlinearities intrinsic to the modeled physiological system.
Furthermore, the observation of consistent patterns across varying frequencies highlights the controller’s versatility and effectiveness across multiple brain wave modalities. These findings substantiate that the control strategy not only stabilizes the cardiac signals but also facilitates smooth transitions among internal states, thereby ensuring the safe and reliable operation of the closed-loop system across a physiologically pertinent frequency spectrum.
Experiment 1.1) The functionality of the controller under the MDP environment
This subsection examines the efficacy of the proposed strategy when the system is subjected to an input signal originating from an MDP environment, as illustrated in Fig. 3a. In contrast to conventional frameworks where the controller acts directly within the MDP loop, the MDP in this context generates the external input pattern, a sequence of decision-based perturbations used to assess the controller’s robustness to dynamically evolving inputs. The behavior of the controller within the MDP environment, along with the discrepancy between the reference heartbeat and the actual output, is shown in Fig. 10.
Fig. 10.
The outcomes of the controlled heartbeat with MDP input signal.
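To illustrate the idea of an input pattern produced by a decision process rather than by a fixed waveform, the sketch below samples beat times from a small Markov chain (an MDP under a fixed policy reduces to such a chain). Everything in it is an assumption made for illustration: the mode names, transition probabilities, beat intervals, and the function `sample_beat_times` are hypothetical and do not reproduce the states, actions, or rewards of the MDP used in this study.

```python
import random

# Hypothetical rhythm modes and transition probabilities (illustrative only).
MODES = ["normal", "fast", "slow"]
TRANSITIONS = {
    "normal": [0.8, 0.1, 0.1],
    "fast": [0.3, 0.6, 0.1],
    "slow": [0.3, 0.1, 0.6],
}
# Hypothetical beat-to-beat interval in seconds for each mode.
INTERVAL = {"normal": 0.8, "fast": 0.5, "slow": 1.2}

def sample_beat_times(n_beats, seed=0):
    """Sample beat times whose spacing depends on a randomly evolving mode."""
    rng = random.Random(seed)
    mode = "normal"
    t = 0.0
    times, modes = [], []
    for _ in range(n_beats):
        t += INTERVAL[mode]
        times.append(t)
        modes.append(mode)
        # Draw the next mode from the row of the current mode.
        mode = rng.choices(MODES, weights=TRANSITIONS[mode])[0]
    return times, modes

times, modes = sample_beat_times(20)
for t_beat, m in zip(times[:5], modes[:5]):
    print(f"{t_beat:5.2f} s  {m}")
```

The resulting non-uniform beat spacing is the kind of irregular excitation against which a tracking controller can be tested.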
Figure 10 critically assesses the tracking precision and control efficacy of the proposed strategy in sustaining the desired cardiovascular rhythm under MDP-induced excitation. In this context, subfigure (a) presents a comparative analysis between the reference heartbeat signal and the controlled heartbeat response. The output of the controlled system closely mimics the shape and timing of the reference waveform, thereby demonstrating effective real-time adaptation notwithstanding the non-uniform external stimulation. The magnified representation accentuates the intricate tracking behavior, wherein the controlled output aligns meticulously with the reference, thereby indicating the controller’s high temporal resolution and minimal latency in response. Moreover, subfigure (b) illustrates the tracking error throughout the entire duration of observation. The error remains consistently minimal, with oscillations confined to a narrow amplitude range (approximately
), thereby signifying precise output regulation. This low steady-state and transient error substantiates the controller’s proficiency in mitigating disturbances and rectifying deviations effectively. Subsequently, the phase-space behavior of the model under MDP conditions is presented in Fig. 11 to further elucidate the capabilities of the proposed framework.
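The tracking error discussed above is the pointwise difference between the reference and the controlled output. A minimal sketch of how its peak and root-mean-square values could be computed is given below; the signals are synthetic stand-ins (a sinusoidal reference and an output with a small ripple of assumed amplitude 0.01), and the function names are hypothetical.

```python
import math

def tracking_error(reference, output):
    """Pointwise error e[k] = reference[k] - output[k]."""
    if len(reference) != len(output):
        raise ValueError("signals must have the same length")
    return [r - y for r, y in zip(reference, output)]

def error_summary(err):
    """Return (peak absolute error, root-mean-square error)."""
    peak = max(abs(e) for e in err)
    rms = math.sqrt(sum(e * e for e in err) / len(err))
    return peak, rms

dt = 0.001
t = [k * dt for k in range(2000)]
# Assumed stand-in signals: 1.2 Hz reference, output with a 25 Hz ripple.
reference = [math.sin(2 * math.pi * 1.2 * tk) for tk in t]
output = [r + 0.01 * math.sin(2 * math.pi * 25 * tk) for r, tk in zip(reference, t)]

peak, rms = error_summary(tracking_error(reference, output))
print(f"peak |e| = {peak:.4f}, RMS e = {rms:.4f}")
```

Reporting both values separates occasional transient excursions (peak) from the typical error level (RMS).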
Fig. 11.
The phase state behavior of four states with MDP input signal.
The findings illustrated in Fig. 11 corroborate that the controller maintains system stability and consistent dynamic regulation, even when faced with variable and potentially abrupt input pulses generated by the MDP. In comparison to the uncontrolled scenario, the system states now exhibit diminished chaotic behavior, reduced peak deviations, and enhanced trajectory coherence, all of which affirm the robustness and adaptability of the ANDC-DDQN controller in preserving physiological regulation amid complex, decision-driven excitations.
Experiment 2) Numerical analysis of the controller performance
In this segment, the proposed methodology is assessed using quantitative metrics. Peak amplitude, energy, and zero crossings have been computed for the four states under three conditions (absence of control, control with input from a normal heartbeat, and control with input from an MDP heartbeat). The numerical outcomes for the states under these conditions are presented in Table 3.
Table 3.
The quantitative outcomes of the States under various conditions.
| State | Frequencies | Peak (WC) | Peak (N) | Peak (M) | Energy (WC) | Energy (N) | Energy (M) | Zero-crossing (WC) | Zero-crossing (N) | Zero-crossing (M) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.5 | 6.326 | 1.5906 | 4.0583 | 9.976 | 140.68 | 75.828 | 50 | 184 | 75 |
| 1 | 3 | 7.594 | 1.5987 | 4.69635 | 14.619 | 143.014 | 79.3165 | 38 | 181 | 79 |
| 1 | 5 | 6.494 | 1.9925 | 4.34325 | 12.38 | 139.97 | 76.675 | 41 | 182 | 76 |
| 1 | 7 | 6.818 | 1.7222 | 4.3701 | 9.744 | 140.432 | 75.588 | 50 | 184 | 75 |
| 1 | 10 | 7.747 | 1.9289 | 4.93795 | 15.77 | 140.64 | 78.705 | 51 | 183 | 78 |
| 1 | 13.5 | 8.546 | 2.3223 | 5.53415 | 13.757 | 139.87 | 77.3135 | 47 | 182 | 77 |
| 1 | 50 | 5.6958 | 1.6292 | 3.7625 | 10.711 | 141.96 | 76.8355 | 51 | 181 | 76 |
| 1 | 85 | 8.345 | 1.9026 | 5.2238 | 18.653 | 141.52 | 80.5865 | 45 | 181 | 80 |
| 2 | 1.5 | 1.5906 | 0.8223 | 1.2601 | 6.74 | 0.1857 | 3.7737 | 96 | 207 | 174 |
| 2 | 3 | 1.5987 | 0.82 | 1.265 | 6.939 | 0.1856 | 3.9129 | 83 | 208 | 173 |
| 2 | 5 | 1.9925 | 0.8225 | 1.5415 | 6.419 | 0.1857 | 3.549 | 116 | 211 | 182 |
| 2 | 7 | 1.7222 | 0.8211 | 1.3518 | 6.468 | 0.1857 | 3.5833 | 109 | 213 | 182 |
| 2 | 10 | 1.9289 | 0.8223 | 1.4969 | 6.303 | 0.1858 | 3.467 | 109 | 226 | 192 |
| 2 | 13.5 | 2.322 | 0.8213 | 1.7717 | 7.529 | 0.1856 | 4.3259 | 97 | 218 | 183 |
| 2 | 50 | 1.6292 | 0.822 | 1.287 | 6.359 | 0.185 | 3.5068 | 102 | 207 | 176 |
| 2 | 85 | 1.902 | 0.8206 | 1.4775 | 8.022 | 0.1856 | 4.671 | 97 | 206 | 174 |
| 3 | 1.5 | 0.4676 | 0.0643 | 0.2466 | 0.5083 | 0.006 | 0.2576 | 47 | 48 | 46 |
| 3 | 3 | 0.4224 | 0.0642 | 0.2149 | 0.7122 | 0.006 | 0.4003 | 38 | 49 | 47 |
| 3 | 5 | 0.4439 | 0.0644 | 0.23 | 0.598 | 0.0061 | 0.3204 | 41 | 47 | 45 |
| 3 | 7 | 0.461 | 0.0644 | 0.242 | 0.48 | 0.0063 | 0.2378 | 50 | 46 | 46 |
| 3 | 10 | 0.5458 | 0.06416 | 0.3013 | 0.7282 | 0.0065 | 0.4116 | 47 | 45 | 45 |
| 3 | 13.5 | 0.6428 | 0.0643 | 0.3692 | 0.66 | 0.006 | 0.3638 | 45 | 70 | 65 |
| 3 | 50 | 0.3852 | 0.0642 | 0.1889 | 0.5178 | 0.0063 | 0.2643 | 39 | 49 | 47 |
| 3 | 85 | 0.5463 | 0.0642 | 0.3016 | 0.8781 | 0.0061 | 0.5165 | 47 | 43 | 44 |
| 4 | 1.5 | 0.8544 | 0.1428 | 0.5409 | 1.901 | 0.0263 | 0.0985 | 61 | 68 | 66 |
| 4 | 3 | 0.7185 | 0.1428 | 0.4457 | 1.892 | 0.0263 | 0.9322 | 59 | 64 | 63 |
| 4 | 5 | 0.9729 | 0.1426 | 0.6238 | 1.778 | 0.02637 | 0.9525 | 64 | 58 | 59 |
| 4 | 7 | 0.89153 | 0.1426 | 0.5668 | 1.73 | 0.0263 | 0.9188 | 72 | 64 | 65 |
| 4 | 10 | 0.9446 | 0.1428 | 0.604 | 1.785 | 0.02638 | 0.9574 | 69 | 74 | 73 |
| 4 | 13.5 | 1.347 | 0.1428 | 0.8857 | 2.098 | 0.02636 | 0.9765 | 63 | 71 | 69 |
| 4 | 50 | 0.8443 | 0.1427 | 0.5338 | 1.775 | 0.0263 | 0.9503 | 60 | 56 | 56 |
| 4 | 85 | 1.069 | 0.1427 | 0.6911 | 2.498 | 0.0263 | 0.9564 | 57 | 72 | 69 |
Table 3 offers a comprehensive quantitative evaluation of the dynamic behavior of the cardiovascular system under three unique operational scenarios: absence of control (WC), controlled with input from a normal heartbeat (N), and controlled with input from an MDP-based heartbeat (M). The analysis encompasses four principal state variables (
), which signify pacemaker function, circulatory dynamics, electrical conduction, and overall systemic output. Each state is scrutinized using three pivotal metrics: peak amplitude, indicative of the maximum deviation in the signal; energy, representing the aggregate power of the signal; and zero crossings, which denote the frequency and rhythm of oscillatory behavior.
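The three metrics described above can be computed from a sampled signal in a few lines. The sketch below is one plausible set of definitions, stated as assumptions: peak amplitude as the maximum absolute sample, energy as the discrete approximation of the integral of the squared signal, and zero crossings as the number of sign changes. The exact definitions used to produce Table 3 may differ (for example, energy may be computed without the sampling-interval factor).

```python
import math

def peak_amplitude(x):
    """Maximum absolute value of the signal."""
    return max(abs(v) for v in x)

def signal_energy(x, dt):
    """Discrete approximation of the integral of x(t)^2 over the record."""
    return sum(v * v for v in x) * dt

def zero_crossings(x):
    """Number of sign changes, ignoring samples that are exactly zero."""
    signs = [1 if v > 0 else -1 for v in x if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Illustrative test signal (an assumption, not model output):
# a 3 Hz sinusoid of amplitude 2, sampled for one second at 1 kHz.
dt = 0.001
x = [2.0 * math.sin(2 * math.pi * 3 * k * dt) for k in range(1000)]

peak = peak_amplitude(x)
energy = signal_energy(x, dt)
crossings = zero_crossings(x)
print(peak, energy, crossings)
```

Under these definitions the energy of a pure sinusoid over a whole number of periods depends on its amplitude and the record length, whereas the zero-crossing count reflects its frequency, so the two metrics capture different aspects of the response.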
In the absence of control (WC) condition, the system functions within an open-loop framework devoid of any regulatory mechanism. This configuration yields markedly elevated peak values, particularly in states
and
. These heightened amplitudes are indicative of unregulated and potentially pathological cardiac responses. The energy levels for
and
under the WC condition remain relatively low due to erratic yet low-frequency oscillations, although the peak amplitudes imply a degree of instability. The count of zero crossings in this scenario is similarly reduced, frequently remaining below 50, signifying a slower and more irregular heartbeat rhythm.
Conversely, in the N condition, wherein a normal input heartbeat is administered in conjunction with the proposed control mechanism, the system exhibits a significant enhancement. Peak values across all states are markedly diminished, and the energy levels, particularly for
and
, experience a substantial increase, often surpassing 140. This seemingly paradoxical rise in energy is attributed to higher-frequency yet low-amplitude oscillations, which are characteristic of a stable and responsive cardiovascular system. The number of zero crossings also increases markedly, frequently exceeding 180, reflecting a tightly regulated and rapidly responding physiological rhythm. Collectively, these metrics support the conclusion that the controller, when operating with a clean input signal, promotes high-frequency, low-amplitude oscillations that are emblematic of healthy heart dynamics.
It is imperative to note that the values corresponding to state
, which delineates the pacemaker dynamics of the cardiovascular system, are consistently elevated across all experimental conditions when juxtaposed with the other states. The pacemaker element, represented within
, serves as the principal initiator of electrical excitation within the cardiac structure. It produces the preliminary impulse that activates subsequent reactions within the conduction system and muscular contraction. Consequently, it inherently demonstrates elevated amplitude oscillations to proficiently initiate and disseminate the cardiac cycle.
In the M condition, wherein the input signal is generated by an MDP to emulate more intricate and uncertain cardiac rhythms, the controller continues to perform effectively, albeit with the anticipated degradation. The peak amplitudes across all states lie between those of WC and N, indicating moderated control. The energy values also lie between those observed in WC and N, affirming that while the controller cannot entirely suppress the variability induced by the MDP, it sustains bounded and stable dynamics. The number of zero crossings under M generally falls between those of N and WC, suggesting that the controller maintains rhythm regularity while accommodating the heightened complexity of the input signal.
Overall, it is observed from the numerical results that the M condition epitomizes a realistic and robust control paradigm, wherein the system is subjected to variable and stochastic input profiles. The findings illustrate that the proposed controller not only guarantees systemic stability but also adapts to uncertainties in input dynamics without compromising physiological plausibility. The values in the M column are logically positioned between those of WC and N, encapsulating a controlled yet flexible system response. These results highlight the adaptive capacity and efficacy of the designed control strategy in preserving cardiovascular stability under both nominal and perturbed conditions, rendering it particularly suitable for real-time application within closed-loop frameworks for heart rhythm regulation.
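The ordering of the WC, N, and M columns can be checked directly against the tabulated values. The sketch below does this for the peak amplitudes in the first block of Table 3 and also computes the relative peak reduction with respect to the uncontrolled case; the dictionary name and the helper `reduction` are hypothetical, and the numbers are copied from the table.

```python
# Peak amplitudes from the first block of Table 3, keyed by frequency.
# Each tuple is (WC, N, M).
PEAK_FIRST_BLOCK = {
    1.5: (6.326, 1.5906, 4.0583),
    3: (7.594, 1.5987, 4.69635),
    5: (6.494, 1.9925, 4.34325),
    7: (6.818, 1.7222, 4.3701),
    10: (7.747, 1.9289, 4.93795),
    13.5: (8.546, 2.3223, 5.53415),
    50: (5.6958, 1.6292, 3.7625),
    85: (8.345, 1.9026, 5.2238),
}

def reduction(wc, controlled):
    """Relative reduction of a controlled value with respect to WC."""
    return (wc - controlled) / wc

# True if, at every frequency, the M peak lies strictly between N and WC.
m_between = all(n < m < wc for wc, n, m in PEAK_FIRST_BLOCK.values())

n_reduction = {f: reduction(wc, n) for f, (wc, n, m) in PEAK_FIRST_BLOCK.items()}
m_reduction = {f: reduction(wc, m) for f, (wc, n, m) in PEAK_FIRST_BLOCK.items()}

print("M between N and WC at every frequency:", m_between)
for f in PEAK_FIRST_BLOCK:
    print(f"{f:>5}: N reduces peak by {n_reduction[f]:.1%}, M by {m_reduction[f]:.1%}")
```

The same check can be repeated for the energy and zero-crossing columns of each block to see where the ordering holds strictly and where it does not.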
Experiment 2.1)
In this section, the operational efficacy of the developed strategy has been systematically compared with three distinct controllers under two distinct conditions (normal input heartbeat and MDP input heartbeat). For this purpose, the proposed controller has been juxtaposed with a non-optimal active disturbance rejection controller (NP-ADRC), sliding mode control (SMC), and a PID controller, with respect to peak value, energy, and zero-crossings of the heartbeat output across eight different frequencies. The functional outcomes of these controllers have been illustrated in Fig. 12.
From Fig. 12, it is evident that, when subjected to normal heartbeat input, the ANDC-DDQN controller demonstrates markedly enhanced stabilization efficacy. The peak amplitudes across the frequency spectrum remain tightly constrained within the range of approximately
, underscoring the controller’s capacity to mitigate overshoot and uphold a consistent output. In contrast, the PID controller exhibits considerable fluctuations in peak values, with amplitudes escalating significantly, indicating a deficiency in its ability to limit signal excursions. Both NP-ADRC and SMC display moderate performance; however, they still manifest a degree of variability that is absent in ANDC-DDQN.
Fig. 12.
Quantitative Comparison of Controller Performance, (a) normal input heartbeat, (b) MDP input heartbeat.
With regard to energy consumption, the developed strategy demonstrates the most efficient and consistent profile, with energy values congregating around 0.185 across the entire frequency spectrum. This observation reflects minimal control effort and an optimally regulated signal. Conversely, the PID and SMC controllers reveal elevated and more variable energy consumption, implying a less efficient or excessively aggressive control mode. Although NP-ADRC exhibits slightly greater stability than PID and SMC, it nevertheless falls short of ANDC-DDQN in terms of energy uniformity.
The zero-crossing analysis further accentuates the stability of ANDC-DDQN. Its output exhibits a consistently rhythmic behavior, with zero-crossing counts closely clustered around 80, reflecting the regularity characteristic of a healthy heartbeat. In contrast, the PID controller generates significantly more oscillatory output, with crossings exceeding
in certain instances. SMC and NP-ADRC once again display intermediary performance but are unable to achieve the rhythmic consistency observed in the proposed methodology.
When exposed to the MDP-based heartbeat input, which introduces a more dynamic and irregular excitation, ANDC-DDQN continues to surpass the performance of the other controllers. Although the peak values are inherently elevated under this more intricate input, they consistently remain lower than those generated by NP-ADRC, SMC, and PID, all of which exhibit both higher peaks and increased variability. Moreover, ANDC-DDQN upholds a stable energy profile despite the heightened demands imposed by the MDP signal, while the other controllers exhibit more pronounced fluctuations. Notably, the proposed controller maintains regular oscillatory behavior under these challenging conditions, with zero-crossing counts remaining near 80, whereas PID and SMC demonstrate diminished rhythmic stability.
Conclusion and future research
In this investigation, a sophisticated adaptive closed-loop strategy has been formulated to achieve stabilization of the human cardiac rhythm. The dynamic model representing the cardiovascular system, which incorporates four essential subsystems, was simulated within a state-space framework to effectively capture intricate behaviors and phase space dynamics. Two distinct input scenarios were examined: (i) a conventional deterministic heartbeat input, and (ii) a stochastic input generated through a Markov Decision Process (MDP), engineered to replicate real-world variability and pathological conditions. To navigate the complex and uncertain dynamics inherent in the system, the proposed control architecture integrates an adaptive nonlinear disturbance compensator (ANDC) aimed at stabilization and disturbance rejection, alongside a double deep Q-Network (DDQN) agent that adaptively adjusts the parameters of the ANDC in real-time. This hybrid intelligent control framework enables the system to sustain stability and robustness, even in the presence of unknown inputs and stochastic disturbances. The efficacy of the controller was meticulously assessed under a variety of testing conditions, which included the application of external brain-like signals at eight distinct frequencies to simulate pathological influences. For each scenario, the behavior of individual state variables was scrutinized using quantitative metrics such as peak amplitude, signal energy, and zero-crossing rate. These metrics furnished a comprehensive insight into the system’s dynamic response and regulatory efficacy. Ultimately, the proposed ANDC-DDQN controller was compared against three alternative control strategies. The results indicated that the developed framework surpasses the alternatives in both normal and MDP-based environments, successfully stabilizing the heartbeat while maintaining physiological fidelity. 
These findings emphasize the potential of integrating adaptive nonlinear control with deep reinforcement learning to address advanced biomedical regulation tasks. Future work will incorporate explicit safety constraints and partial observability (e.g., action/heart-rate bounds, POMDP modeling) following safety-aware RL practice in industrial systems.
Author contributions
W.A. and E.A. conceptualized the study and developed the overall methodology. H.K. implemented the heart model and performed the simulations. Y.B. designed and analyzed the control architecture. M.A. contributed to the reinforcement learning framework and optimization strategy. A.M. prepared figures and tables, and assisted in result interpretation. W.A. and E.A. wrote the main manuscript text. All authors reviewed and approved the final manuscript.
Data availability
The paper does not present or use any data. All results can be reproduced from the provided algorithm and equations.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Walid Ayadi and Emad Alkhazraji contributed equally to this work.
References
- 1. Behnia, S., Ziaei, J., Ghiassi, M. & Yahyavi, M. Comprehensive chaotic description of heartbeat dynamics using scale index and Lyapunov exponent, Omega500, 1–5 (2013).
- 2. Cohen, M. A. & Taylor, J. A. Short-term cardiovascular oscillations in man: measuring and modelling the physiologies. J. Physiol. 542(3), 669–683 (2002).
- 3. Nouri, S. et al. NAMPT gene rs2058539 variant is a risk factor for nonalcoholic fatty liver disease. Revista Da Associação Médica Brasileira 70(7), e20230188 (2024).
- 4. Rappel, W. J. The physics of heart rhythm disorders. Phys. Rep. 978, 1–45 (2022).
- 5. Noble, D. Modelling the heart: Insights, failures and progress. Bioessays 24(12), 1155–1163 (2002).
- 6. Potse, M. Mathematical modeling and simulation of ventricular activation sequences: implications for cardiac resynchronization therapy. J. Cardiovasc. Transl. Res. 5, 146–158 (2012).
- 7. Doshi, D. & Burkhoff, D. Cardiovascular simulation of heart failure pathophysiology and therapeutics. J. Card. Fail. 22(4), 303–311 (2016).
- 8. Trayanova, N. A., Lyon, A., Shade, J. & Heijman, J. Computational modeling of cardiac electrophysiology and arrhythmogenesis: toward clinical translation. Physiol. Rev. 104(3), 1265–1333 (2024).
- 9. Hodgkin, A. L. & Huxley, A. F. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117(4), 500 (1952).
- 10. Fitzhugh, R. Thresholds and plateaus in the Hodgkin-Huxley nerve equations. J. Gen. Physiol. 43(5), 867–896 (1960).
- 11. Van Der Pol, B. & Van Der Mark, J. The heartbeat considered as a relaxation oscillation, and an electrical model of the heart. Lond. Edinb. Dublin Philos. Mag. J. Sci. 6(38), 763–775 (1928).
- 12. Żebrowski, J. et al. Nonlinear oscillator model reproducing various phenomena in the dynamics of the conduction system of the heart. Chaos: Interdisciplinary J. Nonlinear Science 17, 1 (2007).
- 13. Bub, G. & Glass, L. Bifurcations in a discontinuous circle map: a theory for a chaotic cardiac arrhythmia. Int. J. Bifurcat. Chaos 5(02), 359–371 (1995).
- 14. Ferreira, B. B., de Paula, A. S. & Savi, M. A. Chaos control applied to heart rhythm dynamics. Chaos Solitons Fractals 44(8), 587–599 (2011).
- 15. Aabid, M., Elakkary, A. & Sefiani, N. Stabilization of human heart using PID controller. J. Theor. Appl. Inform. Technol. 92, 1 (2016).
- 16. Jonathan, A., Eze, P., Njoku, D. & Chika, A. Mathematical Based Model for Stabilization of Human Heart Using Two-DOF PID Controller.
- 17. Dubljevic, S., Lin, S. F. & Christofides, P. D. Studies on feedback control of cardiac alternans. Comput. Chem. Eng. 32(9), 2086–2098 (2008).
- 18. Christini, D. J. et al. Nonlinear-dynamical arrhythmia control in humans. Proc. Natl. Acad. Sci. USA 98(10), 5827–5832 (2001).
- 19. Liu, Y., Liu, F. & Huang, R. Supervised optimal control in complex continuous systems with trajectory imitation and reinforcement learning. Sci. Rep. 15(1), 19479 (2025).
- 20. Zhang, Z., Zhang, J., Huang, C., Wu, Y. & Zhang, J. Deep reinforcement learning-based centralized control for trajectory tracking of distributed drive electric vehicles. Int. J. Mach. Learn. Cybern. 1–19 (2025).
- 21. Radac, M. B. & Chirla, D. P. Near real-time online reinforcement learning with synchronous or asynchronous updates. Sci. Rep. 15(1), 17158 (2025).
- 22. Jiang, K., Jiang, Z., Jiang, X., Xie, Y. & Gui, W. Reinforcement learning for blast furnace ironmaking operation with safety and partial observation considerations. IEEE Trans. Neural Netw. Learn. Syst. 35(3), 3077–3090 (2024).
- 23. Yu, Y. C. & Zhang, X. A simple cardiovascular system model for ventricular assist device development. in 2009 IEEE 35th Annual Northeast Bioengineering Conference 1–2 (IEEE, 2009).
- 24. Hassani, K., Navidbakhsh, M. & Rostami, M. Simulation of the cardiovascular system using equivalent electronic system. Biomedical Papers-Palacky Univ. Olomouc 150(1), 105 (2006).
- 25. Aabid, M., Elakkary, A. & Sefiani, N. PID parameters optimization using ant-colony algorithm for human heart control. in 23rd International Conference on Automation and Computing (ICAC) 1–6 (IEEE, 2017).
- 26. Scherer, W. T., Adams, S. & Beling, P. A. On the practical art of state definitions for Markov decision process construction. IEEE Access 6, 21115–21128 (2018).
- 27. Faraji, B., Gheisarnejad, M., Yalsavar, M. & Khooban, M. An adaptive ADRC control for Parkinson’s patients using machine learning. IEEE Sensors J. 21(6) (2021).
- 28. Beccari, G. et al. Preparing observations for ESO telescopes: A versatile approach. in Observatory Operations: Strategies, Processes, and Systems IX vol. 12186, 165–180 (SPIE, 2022).
- 29. Zhou, Z., Huang, R., Ou, X., Wang, W. & Lei, X. Design and implementation of four-rotor controller based on ADRC. in Proceedings of the 2nd International Conference on Information Technologies and Electrical Engineering 1–6 (2019).
- 30. Akbarzadeh, S. & Allahverdi, A. Analyzing Genetic Data with Machine Learning to Identify Disease Risk.
- 31. Levin, E., Pieraccini, R. & Eckert, W. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans. Speech Audio Process. 8(1), 11–23 (2000).
- 32. Dentamaro, V., Impedovo, D., Pirlo, G., Abbattista, G. & Gattulli, V. Double Deep Q Network with In-Parallel Experience Generator. in IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS) 1–8 (IEEE, 2020).
- 33. Zhang, S. et al. On the convergence and sample complexity analysis of deep Q-networks with ε-greedy exploration. Adv. Neural. Inf. Process. Syst. 36, 13064–13102 (2023).