Scientific Reports. 2025 Feb 18;15:5990. doi: 10.1038/s41598-025-89285-6

Multi-robot hierarchical safe reinforcement learning autonomous decision-making strategy based on uniformly ultimate boundedness constraints

Huihui Sun 1,2,3, Hui Jiang 4, Long Zhang 1,3, Changlin Wu 1,3, Sen Qian 2
PMCID: PMC11836298  PMID: 39966430

Abstract

Deep reinforcement learning has exhibited exceptional capabilities in a variety of sequential decision-making problems, providing a standardized learning paradigm for the development of intelligent multi-robot systems. Nevertheless, when confronted with dynamic and unstructured environments, the security of decision-making strategies encounters serious challenges. The absence of security will leave multi-robot systems susceptible to unknown risks and potential physical damage. To tackle the safety challenges in autonomous decision-making of multi-robot systems, this manuscript concentrates on a uniformly ultimately bounded constrained hierarchical safety reinforcement learning strategy (UBSRL). Initially, the approach innovatively proposes an event-triggered hierarchical safety reinforcement learning framework based on the constrained Markov decision process. The integrated framework achieves a harmonious advancement in both decision-making security and efficiency, facilitated by the seamless collaboration between the upper-tier evolutionary network and the lower-tier restoration network. Subsequently, by incorporating supplementary Lyapunov safety cost networks, a comprehensive strategy optimization mechanism that includes multiple safety cost constraints is devised, and the Lagrange multiplier principle is employed to address the challenge of identifying the optimal strategy. Finally, leveraging the principles of uniformly ultimate boundedness, the stability of the autonomous decision-making system is scrutinized. This analysis reveals that the action trajectories of multiple robots can be reverted to a safe space within a finite time frame from any perilous state, thereby theoretically substantiating the efficacy of the safety constraints embedded within the proposed strategy. Subsequent to exhaustive training and meticulous evaluation within a multitude of standardized scenarios, the outcomes indicate that the UBSRL strategy effectively restricts the safety indicators to remain below the threshold, markedly enhancing the stability and task completion rate of the motion strategy.

Keywords: Multi-robot, Reinforcement learning, Security constraint, Decision-making, Uniformly ultimate boundedness

Subject terms: Electrical and electronic engineering, Mechanical engineering, Information technology

Introduction

The rapid evolution of machine learning technology has seen deep reinforcement learning (DRL) become a cornerstone in the development of multi-robot systems, endowing them with the capacity for autonomous control, decision-making and advanced intelligent services13. It is broadly accepted that DRL constitutes a methodology within the machine learning domain, focused on discerning optimal strategies through recursive interactions with dynamic environments, employing a trial-and-error framework. Despite continuous endeavors to implement and enhance reinforcement learning (RL) methodologies, it is apparent that numerous current approaches do not sufficiently address the problems of risk management or decision-making safety46. Certain methodologies even incorporate random exploration into their processes, seeking to enhance approximations without due regard for the inherent risks. In these instances, the implementation of unsafe motion strategies can lead not only to the system entering hazardous conditions but also to potential damage to the robot’s physical structure79. Additionally, it has been noted by researchers that contemporary DRL motion control strategies are not well-suited to managing dangerous tasks that may be encountered during operation10,11.

Reinforcement learning autonomous decision-making aims to select the optimal strategy to maximize rewards. Through the exploration of diverse states, the robot ultimately discerns the sequence of states and actions that yield the highest reward12. While this learning approach can yield an optimal strategy, it may also lead the robot to diverge from the anticipated behavioral trajectory, potentially resulting in unsafe outcomes13,14. Consequently, incorporating safety considerations into the robot’s motion control process and implementing measures to restrict and circumvent these unsafe scenarios has emerged as a pivotal research focus within the field of deep reinforcement learning for robotics1518.

To tackle the challenge of inadequate decision-making security in the realm of deep reinforcement learning, a class of methods known as safe reinforcement learning has been introduced1922. This approach, while adhering to safety constraints, employs constrained Markov decision processes to model motion and aims to identify optimal strategies by maximizing expected returns2325. Instead of directly imposing constraints on the action space2628, it integrates multi-dimensional restrictions derived from action value updates, effectively partitioning the state-action space into viable and non-viable zones20,29. In practice, an agent’s operation is confined to state-action sequences that meet these constraints, ensuring only permissible behaviors are executed. This methodology safeguards against agents performing dangerous actions that could result in avoidable harm30. Within the robotics sector, especially in multi-robot industries that require real-time control, such as logistics distribution, intelligent transportation31, and smart warehousing, safe reinforcement learning systems have found extensive application3235.

Safety reinforcement learning autonomous decision-making methods are categorized into two primary groups: risk shielding and safety decision-making. The risk shielding safety reinforcement learning approach directly identifies perilous actions and excludes them from the viable action set, ensuring that the agent’s decisions consistently adhere to safety criteria. Hou et al. delved into the fundamental workings of action shielding in robots and introduced the Action-Level Masking method36. Angelopoulos et al. put forth a technique that incorporates human preferences into reinforcement learning by employing a shielding mechanism37. The shielding mechanism examines all possible actions in the current state and eliminates those that contravene safety protocols, thereby establishing a subset of safe actions. Safety decision-making approaches based on danger shielding demonstrate robust control over the exclusion of hazardous actions, enhancing the efficiency of reinforcement learning decisions while maintaining safety. Nonetheless, issues such as global action screening strategies and interpretable shielding mechanisms need to be explored in future studies.

In contrast to risk shielding, reinforcement learning methodologies that incorporate safety decision-making expand upon the shielding mechanism by introducing a suite of safety action options and dynamically calibrate the range of these actions. This strategy circumvents the limitation of singular action choices and bolsters the safety and resilience of the reinforcement learning system within unpredictable settings. Ji et al. introduced a multi-agent deep Q-network framework, grounded in safety decision-making, aimed at the secure management of adaptable robotic arms, which fine-tunes the action set boundaries to safeguard against unsafe actions38. Duan et al. integrated offline and online training methodologies and implemented a “safety shield” correction mechanism to adeptly rectify unsafe decisions39.

Reinforcement learning (RL) approaches that incorporate safety considerations have the potential to yield secure alternatives while circumventing unsafe actions, thereby significantly enhancing the robustness and reliability of RL systems. Nonetheless, the challenge persists in devising an optimal set of safety-constrained behaviors that are adaptable across various states and capable of dynamically adjusting to the evolving environmental conditions.

Although the concept of safe reinforcement learning has been introduced, its practical application still confronts several significant challenges, as elaborated in40,41. Primarily, agents integrated with safety constraints frequently prioritize safety excessively, compromising their ability to explore optimal strategies, which ultimately results in suboptimal decision-making efficiency within robotic systems4244. Secondly, Markov processes incorporating multiple constraints grapple with difficulties in model resolution and intricate computational procedures, impeding the realization of seamless end-to-end control40,41. Lastly, these constraints lack the capability to adaptively adjust in response to varying environmental conditions, especially in the realm of multi-robot systems, where ensuring the safety of each robot’s decision-making process poses an even greater challenge42.

To tackle potential security concerns that may emerge during the sequential decision-making process, a novel uniformly ultimately bounded constrained hierarchical safety reinforcement learning strategy (UBSRL) is introduced for multi-robot collaborative decision-making tasks.

The primary contributions of this manuscript can be articulated as follows:

  • Proposes an event-triggered hierarchical safety reinforcement learning framework based on a constrained Markov decision process, which achieves a harmonious advancement in both decision-making safety and efficiency, facilitated by the seamless collaboration between the upper-tier evolutionary networks and the lower-tier restoration networks.

  • By incorporating supplementary Lyapunov safety cost networks, a comprehensive strategy optimization mechanism that includes multiple safety cost constraints is devised, and the Lagrange multiplier method is employed to address the challenge of identifying the optimal strategy.

  • The stability of the autonomous decision-making system has been established through the application of uniformly ultimate boundedness theory. This analysis demonstrates that the action trajectories of multiple robots can be reverted to a safe space within a finite timeframe, regardless of their initial perilous state.

  • Subsequent to exhaustive training and meticulous evaluation within a multitude of standardized scenarios, the outcomes indicate that the UBSRL methodology effectively restricts the safety indicators to remain below the threshold, markedly enhancing the stability and security of the motion strategy.

The rest of this paper is organized as follows. Section 2 presents the work related to Markov modeling for safe reinforcement learning methods under constraints. Section 3 describes the design process of the UBSRL approach. Section 4 analyzes the stability of the system using the theory of uniformly ultimate boundedness. Then, we evaluate the performance of our method with the existing algorithms in Sect. 5. Section 6 concludes the paper.

Problem description

Markov decision making with security constraints

The conventional reinforcement learning approach identifies the optimal policy by leveraging the Markov decision process. In contrast, the secure reinforcement learning method, which incorporates safety constraints, determines the optimal policy through a restricted Markov decision process. The decision-making process of this approach is illustrated in Fig. 1.

Fig. 1. Markov process with safety constraints.

A Markov decision process constrained by safety considerations can be characterized by the tuple $\langle S, A, P, R, C, I \rangle$. $S$ is a state vector with finite dimensions, and the state vector for all agents can be represented as $S=\{s_{1},s_{2},\ldots,s_{I}\}$. $A$ is the action space of the robots, $A=\{a_{1},a_{2},\ldots,a_{I}\}$. $P$ is the state transition probability; $R$ stands for the immediate reward, and the reward value can be defined according to the robot task environment. $C$ denotes the safety constraint function, which is evaluated in a similar way to the reward function. $I$ represents the total number of robots.

As shown in Fig. 1, the interaction between the robot and the environment can be considered as a discrete-time sequence of a Markov process. Based on the initially perceived environment s0, the robot makes a corresponding decision action a0. After the action a0 is executed by the robot, the environment transitions to a new state s1 according to the state transition probability P. At this stage, the environment immediately assigns a reward r1 and a cost c1 to the robot. Subsequently, the robot executes an action a1 based on the current state s1, causing the environmental state to transition to s2 and again providing the robot with a reward r2 and a cost c2. This interactive process exhibits a trend of infinite continuation, thereby forming a Markov chain: s0, a0, s1, r1, a1, …, st−1, rt−1, at−1, st, rt, …. In accordance with the theoretical framework of the Markov chain, the safe reinforcement learning strategy will continuously undergo iterative optimization until convergence is achieved.
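The constrained interaction loop described above can be sketched in a few lines of Python. The env and policy objects are hypothetical stand-ins; the only assumption is that the environment returns an immediate cost alongside the reward at every step, as the constrained Markov process requires.

# Minimal sketch of the reward-plus-cost interaction loop (assumed interfaces).
def rollout(env, policy, max_steps=500):
    """Collect one episode of (s, a, r, c, s') transitions."""
    transitions = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                              # a_t ~ pi(. | s_t)
        next_state, reward, cost, done = env.step(action)   # r_{t+1}, c_{t+1}
        transitions.append((state, action, reward, cost, next_state))
        state = next_state
        if done:
            break
    return transitions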

Safe reinforcement learning

The Markov decision process with security constraints can be formulated as the process of solving for the optimal security policy $\pi^{*}$ while satisfying the security constraints as follows.

$\pi^{*}=\arg\max_{\pi\in\Pi_{C}}J(\pi)$  (1)

where, $\Pi_{C}$ denotes the set of security policies satisfying the security constraints, and $J(\pi)$ is the objective function.

The goal of the agent is to maximize the system reward. The reward can be written as the cumulative discounted expected value of all future rewards:

$J(\pi)=\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T}\gamma^{t}r_{t}\,\Big|\,s_{0}\right]$  (2)

where, T is the final time, t is the current time, and s0 is the initial state of all agents.

In order to account for the effect of the action vector on the state-value function, the target reward in the multi-agent system can be further represented by the action-value function $Q(s,a)$:

$Q^{\pi}(s,a)=\mathbb{E}_{\pi}\!\left[r+\gamma\,Q^{\pi}(s^{\prime},a^{\prime})\right]$  (3)

where, $s^{\prime}$ denotes the system state at the next moment, and $r$ is the immediate reward of the current robot.

In addition to action-value function Inline graphic, safe reinforcement learning methods require another evaluation function to evaluate the safety of policies. That is, the nonnegative safety constraint function Inline graphic. The constraint function is used to evaluate the security of the action.

Safe reinforcement learning mainly focuses on the safety constraint function $C$ under long-term cumulative expectations. For a given trajectory $\tau$, the cumulative discounted constraint function is as follows.

$J_{C}(\tau)=\sum_{t=0}^{T}\gamma^{t}c(s_{t},a_{t})$  (4)

where, $c(s_{t},a_{t})$ represents the immediate cost, and $\tau$ denotes the action-state trajectory.

Robot safety constraints can be broadly categorized into two main types: direct constraints on the action space and indirect constraints. The mechanism of direct constraints can be articulated as follows:

$a_{\min}\le a_{t}\le a_{\max}$  (5)

In which, $a_{\max}$ and $a_{\min}$ represent the upper and lower thresholds of robot actions.

In this scenario, the robot’s action space is frequently constrained, which inherently limits its flexibility. Consequently, our UBSRL approach eschews this restrictive mechanism, opting instead for an indirect constrained safe reinforcement learning strategy.

Safe reinforcement learning strategies impose constraints not directly on the robot’s action space, but rather on its cost function, representing an indirect constraint approach. This indirect constraint approach can be further categorized into two types: hard constraints and soft constraints.

When undertaking decision-making tasks that demand stringent security measures, the hard constraint method is typically implemented, ensuring that the single-step constraint is consistently enforced. Consequently, the set of security policies can be articulated as follows:

$\Pi_{C}=\left\{\pi \;:\; c(s_{t},a_{t})\le m,\ \forall t\right\}$  (6)

The rigid constraints of single-step decision-making impose overly stringent limitations on the range of actions available, typically used in model-based reinforcement learning methodologies for autonomous decision-making.

In contrast, most model-free reinforcement learning tasks favor soft constraints, which are governed by the cumulative expected discounted value of the cost function. Consequently, the set of safe policies $\Pi_{C}$ can be articulated as follows:

$\Pi_{C}=\left\{\pi \;:\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T}\gamma^{t}c(s_{t},a_{t})\right]\le m\right\}$  (7)

Consequently, the UBSRL strategy employs a more adaptable indirect constraint approach. The robot’s action space remains unrestricted; the objective is solely to guide the robot in steering clear of hazardous zones while encouraging exploration of safer areas.

In the robot decision-making process, if the action at each step satisfies the safety constraint condition, the current action complies with the safety constraint. Should the robot inadvertently enter a hazardous state, it maintains the capability to recover from the unsafe condition within a finite time frame. This soft constraint methodology has demonstrated impressive results in tasks such as multi-robot autonomous decision-making.

In multi-agent safe reinforcement learning, the goal of the safe Markov decision problem is to maximize the cumulative reward $r$ while ensuring that the expected safety constraint $J_{C}(\pi)$ is less than the constraint threshold $m$. The solution process of the optimal policy can be formulated as a constrained optimization problem as follows.

$\max_{\pi}\,J(\pi)\quad \text{s.t.}\quad J_{C}(\pi)\le m$  (8)

where, $m$ is the threshold of the safety constraint and $J_{C}(\pi)$ is the expected safety constraint: $J_{C}(\pi)=\mathbb{E}_{\tau\sim\pi}\!\left[J_{C}(\tau)\right]$.

In the autonomous decision-making process, the agent is awarded a reward and incurs a cost at each step. Our goal is to maximize the cumulative reward over time, while ensuring that the cost threshold is not breached. This presents a soft constrained optimization challenge. The optimal policy can be derived by optimizing the objective function that balances rewards with safety constraints.
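As a hedged illustration of this soft-constrained objective, the snippet below computes the discounted return and discounted cost of a single trajectory (for example, one produced by the rollout sketch above) and checks the threshold condition of Eq. (8); the helper names and the threshold m are placeholders, not part of the paper's implementation.

def discounted_sum(values, gamma=0.99):
    # Discounted cumulative sum used for both J(pi) (rewards) and J_C (costs).
    total, discount = 0.0, 1.0
    for v in values:
        total += discount * v
        discount *= gamma
    return total

def satisfies_soft_constraint(costs, m, gamma=0.99):
    """True if the trajectory's discounted cumulative cost stays below m."""
    return discounted_sum(costs, gamma) <= m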

Method design

Event-triggered layered safety policy

Decision safety is a crucial aspect to contemplate in multi-robot systems. Nevertheless, agents frequently disregard potential hazards in pursuit of high value, inadvertently leading multi-robot systems into perilous zones. To mitigate this issue, we introduce an event-triggered hierarchical security policy in this section. This policy encompasses two tiers: the evolution strategy and the security recovery strategy, as depicted in Fig. 2.

Fig. 2. Event-triggered hierarchical security policy.

The evolution strategy boasts swift learning and updating capabilities, enabling it to promptly converge on the optimal solution. Conversely, the safety policy emphasizes ensuring that the robot makes prudent decisions when safety constraints are breached, guiding it back to a secure state. Upon the driving event’s output exceeding a predetermined threshold, the safety trigger is activated, prompting the system to transition from the evolution strategy to the security recovery strategy. This transition ensures the safety of the robot’s action outputs. The event-triggered state of the robot is defined as follows:

Definition 1

If the robot undertakes the current action $a_{t}$ and subsequently collides, rolls over, or falls into a trap, such that the trigger event value is greater than or equal to the threshold value $m$, the robot will be deemed to have entered a dangerous space. Consequently, the subsequent action of the robot will be constrained, and the autonomous decision-making strategy will be switched to the safe recovery strategy.

Definition 2

If the robot executes the current action and the trigger event value fails to reach the trigger threshold $m$, the robot’s actions will remain unconstrained. Consequently, the robot will proceed with the evolution strategy, or transition from the previous safety recovery strategy back to the evolution strategy.

Then, separate experience buffers are constructed according to the policy in effect: the non-safe experience buffer $U_{D}$ and the safe experience buffer $U_{s}$.

When the conditions of Definition 1 are satisfied, the agent executes the safe recovery strategy, and the dangerous-state experience samples will be stored in the non-safe experience buffer $U_{D}$.

When the conditions of Definition 2 are satisfied, the agent executes the evolution strategy, and the safe-state experience samples will be stored in the safe experience buffer $U_{s}$.
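A minimal sketch of this event-triggered switching and buffer routing is given below; the trigger_value input, the two policy objects, and the buffer layout are assumptions used purely for illustration.

from collections import deque

U_s = deque(maxlen=10**6)   # safe experience buffer (capacity as in Table 1)
U_D = deque(maxlen=10**6)   # non-safe (dangerous) experience buffer

def select_policy_and_store(sample, trigger_value, m,
                            evolution_policy, recovery_policy):
    """Route the sample and choose the active policy for the next step."""
    if trigger_value >= m:        # Definition 1: dangerous state reached
        U_D.append(sample)
        return recovery_policy    # switch to the safe recovery strategy
    else:                         # Definition 2: threshold not reached
        U_s.append(sample)
        return evolution_policy   # keep (or return to) the evolution strategy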

  • (A) Evolution strategy.

In the realm of the event-triggered hierarchical security strategy, the evolutionary approach adopts an intricate multi-agent parallel structure, which is illustrated in Fig. 3. Each individual agent within this framework possesses a dedicated Actor-Critic network, enhancing its learning capabilities. Additionally, to mitigate the potential overestimation of Q values, these agents employ a dual Critic network. Notably, the agents leverage global information to augment their learning processes and utilize local information for making informed decisions. The cumulative joint strategy, emanating from the collective actions of all agents, can be succinctly articulated as:Inline graphic, and its policy parameters can be expressed as:Inline graphic.

Fig. 3. Multi-agent distributed network structure of the evolution strategy.

In the evolutionary strategy, each Actor agent is responsible for generating the robot action $a_{i}$, and the Critic network is responsible for evaluating the current action value $Q_{i}$. After each interaction with the environment, an experience sample will be produced and stored in the safe sample buffer $U_{s}$.

The cumulative expected reward for the i th agent can be used as an objective function of strategy optimization:

graphic file with name M47.gif 9

where, Inline graphicis the global observation vector, Inline graphicrepresents the observation of the agent.

Then, the objective function can be optimized through the method of solving the strategy gradient.

graphic file with name M50.gif 10

where, $\mu$ is the deterministic action strategy function, $\theta$ represents the network parameters of the deterministic action strategy function, and $Q$ denotes the state-action value function.

The evaluation network employs a centralized update mechanism and utilizes Temporal Difference (TD) error to refine the network’s updates. The loss function for the Critic network’s updates can be articulated as:

graphic file with name M54.gif 11

where, $y_{i}$ represents the target action value function of the i-th agent, $\phi_{i}$ denotes the current Critic value network parameters, and $\phi_{i}^{\prime}$ represents the target Critic network parameters.

The value of the action-value function $Q$ can be easily overestimated due to the approximation error inherent in the value function. When evaluating the strategy generated by the Actor using an overestimated action value function, this leads to an overestimation of the current action’s value. To address this issue, the evolutionary network incorporates an additional set of Critic networks. These multiple Critic networks evaluate the action value simultaneously and choose the smallest value for updating, thereby mitigating the impact of overestimation.

When calculating the Temporal Difference (TD) error, the updated target action value can be expressed as:

$y_{i}=r_{i}+\gamma\,\min_{j=1,2}Q_{i,j}^{\prime}\!\left(s^{\prime},a^{\prime}\right)$  (12)

Furthermore, as the deterministic strategy is prone to the approximation error of the function, the evolutionary strategy imposes regularization on the target strategy network. The target action is approximated within a narrow range by introducing a micro-noise that adheres to a normal distribution.

$a^{\prime}=\mu^{\prime}\!\left(s^{\prime}\right)+\epsilon,\qquad \epsilon\sim\operatorname{clip}\!\left(\mathcal{N}(0,\sigma),-b,\,b\right)$  (13)

where, $\epsilon$ is the micro-noise, and $b$ is the boundary value of the noise.

By smoothing the target action value over similar actions, regularization mitigates possible spurious spikes in the deterministic strategy gradient and avoids fragile behaviors in the policy network.
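The twin-critic target with clipped target-policy noise described by Eqs. (12) and (13) can be sketched as follows; the target_actor and target_critic modules, the noise scale sigma, and the bound b are assumed placeholders rather than the paper's exact implementation.

import torch

def td_target(reward, next_obs, target_actor, target_critic1, target_critic2,
              gamma=0.99, sigma=0.2, b=0.5):
    # y = r + gamma * min(Q1', Q2') evaluated at a noise-smoothed target action.
    with torch.no_grad():
        noise = (torch.randn_like(target_actor(next_obs)) * sigma).clamp(-b, b)
        next_action = target_actor(next_obs) + noise     # Eq. (13): smoothed action
        q1 = target_critic1(next_obs, next_action)
        q2 = target_critic2(next_obs, next_action)
        return reward + gamma * torch.min(q1, q2)        # Eq. (12): pessimistic target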

  • (B) Safe recovery strategy.

The safety of actions is a crucial consideration for multi-robot systems. Traditional evolutionary strategies incorporate only reward information regarding actions, yet they fail to account for safety considerations. This deficiency often leads to overly aggressive strategies that can place multi-robot systems in perilous situations. To address this issue, this section proposes the construction of a secure recovery network that incorporates safety constraints. As depicted in Fig. 4, the secure recovery network integrates a Lyapunov evaluation network into the multi-agent distributed Actor-Critic architecture. This addition is designed to evaluate the expected cost value of each robot action, thereby determining the associated risk level.

Fig. 4. Multi-agent network architecture with security constraints.

The Lyapunov evaluation function $L(s,a)$ is a conceptual function whose value depends on the corresponding action-state vector. The Lyapunov state function can be obtained from the Lyapunov action evaluation function:

$L(s)=\mathbb{E}_{a\sim\pi}\!\left[L(s,a)\right]$  (14)

The value of the Lyapunov evaluation function is calculated in a similar way to the action-state value $Q(s,a)$. The Bellman equation is still satisfied between the current value and the future value:

$L(s,a)=c(s,a)+\gamma\,\mathbb{E}\!\left[L^{\prime}(s^{\prime},a^{\prime})\right]$  (15)

where, $c(s,a)$ is the immediate non-negative security constraint function, and $L^{\prime}$ denotes the target Lyapunov evaluation function.

The target Lyapunov evaluation function also serves to reduce bias that may arise from the overestimation of evaluation values. Instead of iteratively updating the network parameters, the target network merely replicates a subset of weights recently updated by the current network, a process referred to as delayed updates.

Network architecture optimization

  • (A) Policy network optimization.

To update the policy network, the objective function $J(\theta)$ needs to be maximized. Based on the boundary security constraint, the objective function to maximize can be expressed as follows:

$\max_{\theta}\,J(\theta)\quad \text{s.t.}\quad \mathbb{E}\!\left[L(s,a)\right]\le m$  (16)

where, m is the safety constraint threshold.

To streamline the resolution of Eq. (16), the policy optimization subject to security constraints is recast as a multi-objective optimization problem, leveraging the Lagrange multiplier technique. When the agent is in a non-secure state, a new objective function is built for the policy network. The function includes the action value function $Q(s,a)$ and the security cost function $L(s,a)$. The specific objective function is as follows:

graphic file with name M73.gif 17

where, $\lambda$ is the Lagrange multiplier, $k$ is the safety factor parameter, and $U_{D}$ is the non-safe experience buffer.

The updating process of the policy network can be viewed as a process of minimizing the objective function of the policy network. The gradient descent method is used to optimize the policy network, and the optimization result of the policy network Actor is obtained as follows.

graphic file with name M76.gif 18

where, Inline graphic represents the current strategy.

Based on an appropriate step size, the policy parameter $\theta$ is updated along the gradient direction of the policy network, so as to ensure that the policy at the next time step is not worse than at the previous one, until convergence.

Then, the objective function $J(\lambda)$ for the Lagrange multiplier is constructed as follows.

graphic file with name M80.gif 19

The Lagrange multiplier $\lambda$ is continuously calibrated via gradient ascent in tandem with the network’s iterative updates, ensuring the robustness of the strategic framework.
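A hedged sketch of this Lagrangian trade-off is shown below: the Actor is pushed toward higher action values while a multiplier lambda, updated by gradient ascent, penalizes expected Lyapunov cost above the threshold m. The exact weighting of Eq. (17), including the safety factor k, is not reproduced here; actor, critic and lyapunov_net are assumed network modules.

import torch

log_lambda = torch.zeros(1, requires_grad=True)        # lambda = exp(log_lambda) >= 0
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-3)

def policy_and_lambda_step(actor, critic, lyapunov_net, actor_opt, obs, m):
    lam = log_lambda.exp()
    action = actor(obs)
    # Policy loss: maximize Q while penalizing the expected Lyapunov cost.
    policy_loss = (-critic(obs, action) + lam.detach() * lyapunov_net(obs, action)).mean()
    actor_opt.zero_grad()
    policy_loss.backward()
    actor_opt.step()
    # Multiplier update: gradient ascent on lambda * (E[L(s,a)] - m).
    constraint_gap = lyapunov_net(obs, actor(obs).detach()).mean() - m
    lambda_loss = -(log_lambda.exp() * constraint_gap.detach())
    lambda_opt.zero_grad()
    lambda_loss.backward()
    lambda_opt.step()
    return policy_loss.item(), lam.item()

When the constraint gap is positive (the expected Lyapunov cost exceeds m), lambda grows and the safety term dominates the policy loss; when the gap is negative, lambda decays and the reward term regains weight.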

  • (B) Critic network update optimization.

The Critic network refines its weights by minimizing the loss function via gradient descent. Consequently, a parameter is identified that maximizes the expected value function $Q(s,a)$.

The Critic network retrieves samples from the global experience replay buffer and optimizes them in accordance with the Temporal Difference (TD) error of the state-action function. Its loss function is expressed as follows:

$L(\phi)=\frac{1}{N_{s}}\sum_{j=1}^{N_{s}}\left(y_{j}-Q\!\left(s_{j},a_{j};\phi\right)\right)^{2}$  (20)

where, $N_{s}$ represents the number of samples drawn per batch from the safe memory buffer, and $y$ is the target value of the action value $Q(s,a)$.

  • (C) Lyapunov safe network optimization.

The Lyapunov safe network undergoes optimization through the application of the gradient descent method. The loss function for the Lyapunov network is articulated as follows:

$L_{\text{loss}}=\frac{1}{N_{D}}\sum_{(s,a)\in U_{D}}\left(L_{\text{target}}-L(s,a)\right)^{2}$  (21)

where, $U_{D}$ represents the set of dangerous states, $L(s,a)$ denotes the Lyapunov cost function, $L_{\text{target}}$ is the target value of the Lyapunov cost function, and $N_{D}$ indicates the number of samples drawn from the dangerous states per batch.

The target value of the Lyapunov cost function $L_{\text{target}}$ is calculated from the current cost $c(s,a)$ and the Lyapunov cost function of the next state $L^{\prime}(s^{\prime},a^{\prime})$:

$L_{\text{target}}=c(s,a)+\gamma\,L^{\prime}(s^{\prime},a^{\prime})$  (22)

The Critic network and the Lyapunov safe network undergo parallel training and synchronous updates. However, they differ in their sampling methods: the Critic network selects samples from the secure states subset, whereas the Lyapunov safe network selects from the non-secure states subset.
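Under these assumptions, the Lyapunov network update of Eqs. (21) and (22), drawing its batch from the non-safe buffer, might look as follows; the buffer layout (tensors per field), the target networks and the optimizer are illustrative assumptions.

import random
import torch

def lyapunov_update(lyapunov_net, target_lyapunov_net, target_actor, optimizer,
                    unsafe_buffer, batch_size=512, gamma=0.99):
    batch = random.sample(list(unsafe_buffer), batch_size)
    obs, act, cost, next_obs = map(torch.stack, zip(*batch))
    with torch.no_grad():
        next_act = target_actor(next_obs)
        l_next = target_lyapunov_net(next_obs, next_act).squeeze(-1)
        l_target = cost + gamma * l_next                      # Eq. (22)
    l_pred = lyapunov_net(obs, act).squeeze(-1)
    loss = ((l_target - l_pred) ** 2).mean()                  # Eq. (21)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()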

During training, the target network parameters for the Actor, Critic, and Lyapunov networks are not immediately updated. Instead, they experience delayed updates. The weight parameters are replicated from their respective current networks, and only a fraction of the change values are updated in each instance:

$\theta_{x}^{\prime}\leftarrow\tau_{x}\,\theta_{x}+\left(1-\tau_{x}\right)\theta_{x}^{\prime},\qquad x\in\{\text{Actor},\ \text{Critic},\ \text{Lyapunov}\}$  (23)

$\tau_{1}$, $\tau_{2}$, $\tau_{3}$ represent the soft update factors of the weight parameters for the Actor, Critic, and Lyapunov target networks, respectively. A smaller update factor delays the target network update and prevents the unstable bias caused by overestimation.
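In code, this delayed update is the usual soft (Polyak) averaging; a sketch with the Table 1 factor tau = 0.01 is shown below, and it would be called once per training step for each of the Actor, Critic, and Lyapunov target networks.

import torch

@torch.no_grad()
def soft_update(current_net, target_net, tau=0.01):
    # Copy only a small fraction tau of the current weights into the target.
    for param, target_param in zip(current_net.parameters(), target_net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)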

Uniformly ultimate boundedness stability analysis

System stability is a critical prerequisite for robot motion control. Classical control methods often utilize Lyapunov theory to demonstrate the stability of the control system. However, for data-driven reinforcement learning (RL) motion control systems, the traditional Lyapunov stability analysis method is not applicable. Among various stability analysis methods, uniformly ultimate boundedness (UUB) has been proven effective41. A UUB-stable autonomous decision-making strategy ensures that action vectors approach an equilibrium state within a limited number of steps. In this work, uniformly ultimate bounded (UUB) stability and the safe reinforcement learning constraints are integrated. Multi-robot systems must learn a reinforcement learning controller with a UUB guarantee to tackle autonomous decision-making tasks involving safety constraints. Each update of the controller meets the UUB criteria and incrementally enhances the cumulative reward; the Lyapunov cost function continuously diminishes, eventually dropping below a threshold. When faced with potential danger, the robot proactively moves towards a safe area, enabling it to recover from the hazard within a limited timeframe.

UUB stability definition for safe reinforcement learning

This section broadens the classical definition of uniformly ultimately bounded (UUB) stability to encompass a more general scenario. We reevaluate the concept of UUB stability within the context of safe reinforcement learning, specifically to tackle the challenges associated with multi-robot motion control tasks.

The definition of the UUB stability theorem is as follows:

Theorem 1

If a control system has positive constants b, m and Inline graphic, and there exists Inline graphic such that Inline graphic, Inline graphic, then this system is said to be uniformly ultimately bounded about Inline graphic.

Leveraging the stability assurance provided by UUB, the trajectory will enter and remain within the stability threshold range after a finite period of time. Subsequently, we will apply the uniformly ultimate boundedness method to verify the security of the decision-making strategy under the Lyapunov safety constraints. The stability of the Lyapunov safety constraint function is defined, based on Theorem 1, as follows41,42.

Theorem 2

If there exists a Lyapunov function Inline graphic and positive constants Inline graphic, such that:

graphic file with name M106.gif 24

Then, the motor control system is uniformly ultimately bounded with boundary Inline graphic.

Moreover, the expected value Inline graphic is bounded during the N steps, and also satisfies:Inline graphic, Inline graphic.

Remark 1

The Lyapunov cost function in Theorem 2 guarantees the stability of the control strategy. The Lyapunov cost function can be calculated from the Lyapunov action-value function $L(s,a)$:

graphic file with name M112.gif 25

In Eq. (24), Inline graphic means the average distribution of s in N steps.

graphic file with name M114.gif 26

where, N represents the maximum moment at which the probability of a hazardous sample in the sample set is greater than zero, and $s_{N}$ represents the state at the N-th time step.

The indicator function is used to determine whether the current state s belongs to the unsafe set:

graphic file with name M117.gif 27

where, Inline graphic

UUB stability proof for safe reinforcement learning

First, it is imperative to demonstrate conclusively that N is a finite constant. We proceed by contradiction39,41,42 and postulate that N is infinite, such that:

graphic file with name M119.gif 28

The first term on the left of the inequality in Eq. (24) can be rewritten as:

graphic file with name M120.gif 29

Since Inline graphic is finite, the following is obtained as N tends to infinity.

graphic file with name M122.gif 30

According to the known condition Inline graphic, Eq. (29) can be rearranged as:

graphic file with name M124.gif 31

Combining Eqs. (30) and (31), it can be obtained that:

graphic file with name M125.gif 32

The second term on the left of the inequality in Eq. (24) can be rewritten as:

graphic file with name M126.gif 33

By comparing Eqs. (32) and (33), it can be concluded that if both inequalities were satisfied, the result would contradict Eq. (24). Therefore, our assumption that N is infinite is false, and it can be concluded that there exists a finite N such that Inline graphic for any Inline graphic.

Next, we prove that the expectation of Inline graphic is bounded after N time steps. Since N has been proven to be finite, it can be obtained that:

graphic file with name M131.gif 34

Rearranging Eq. (34) gives:

graphic file with name M132.gif 35

After decomposing the left-hand term of inequality (35), there exists a moment of time such that:

graphic file with name M133.gif 36

By definition Inline graphic in Theorem 2, Eq. (36) can be expressed as:

graphic file with name M135.gif 37

The probabilistic choice of samples for each space in the defined sample space can be described as:

graphic file with name M136.gif 38

The expected value of the cost function Inline graphic can be expressed as:

graphic file with name M138.gif 39

Combining Eqs. (38) and (39), the expected value of the cost function Inline graphic can be further expressed as:

graphic file with name M140.gif 40

According to inequality (40), the expected value of the cost function Inline graphic is bounded within a finite number of steps N and satisfies Inline graphic, Inline graphic. Thus, the reinforcement learning motion control system has been demonstrated to be uniformly ultimately bounded (UUB) stable.

Upon transitioning into an unsafe condition, the robot possesses the capability to navigate out of peril within a specified and finite sequence of N steps, thereby ensuring that the cost function value descends beneath the established safety threshold.

Experiment

Initialization setting

In this section, we present the performance evaluation of the proposed algorithm across two distinct environments. The experimental setup encompasses a standardized scenario known as Safety-Gymnasium and a custom-built scenario. Safety-Gymnasium is a highly modular, readable, and customizable benchmark environment library, constructed upon the MuJoCo physics engine. It serves primarily as a platform for evaluating the performance of safe reinforcement learning algorithms. Conversely, the custom-built scenario is developed using CoppeliaSim, a professional robot physical simulator, and is designed to facilitate the training and testing of multi-robot cooperative safe reinforcement learning algorithms within unfamiliar settings. Throughout the experimentation, the network hyperparameters for the UBSRL algorithm, as introduced in this paper, are specified in Table 1.

Table 1.

Network parameters for the UBSRL algorithm.

Parameters Instruction Values
l_a Learning rate of actor 0.0001
l_c Learning rate of critic 0.0001
l_s Learning rate of Lyapunov 0.0001
Batch_size Sample size 512
γ Discount factor 0.99
τ1, τ2, τ3 Soft update factor 0.01
UD, Us Experience buffer capacity 10^6
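For convenience, the Table 1 hyperparameters can be collected into a single configuration object; the key names below are illustrative, and only the values come from the table.

UBSRL_CONFIG = {
    "lr_actor": 1e-4,         # l_a: learning rate of the Actor
    "lr_critic": 1e-4,        # l_c: learning rate of the Critic
    "lr_lyapunov": 1e-4,      # l_s: learning rate of the Lyapunov network
    "batch_size": 512,        # samples drawn per update
    "gamma": 0.99,            # discount factor
    "tau": 0.01,              # soft update factor (tau1 = tau2 = tau3)
    "buffer_capacity": 10**6, # capacity of U_D and U_s
}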

Experiments in standardized scenarios

The Safety Gymnasium43 encompasses three standardized scenarios: Circle, Goal and Button, as depicted in Fig. 5.

Fig. 5. Safety-Gymnasium safe reinforcement learning scenarios.

In the Circle scenario, the robot is tasked with navigating in a circular path around the central region while ensuring it remains within the safe range. The green zone represents this safe area, and the robot’s starting point is randomly determined within the circular zone for each training session. In the Goal scenario, the robot must navigate to a target position while avoiding numerous obstacles scattered throughout the scene. Upon reaching the target, the goal is randomly relocated to a new position, though the overall layout of the scene remains constant. The Button scenario presents multiple goals, and the robot’s objective is to reach each goal sequentially until it has visited them all.

In the standardized experimental scenarios, a comprehensive series of experiments were conducted using the proposed UBSRL method, as well as two other secure reinforcement learning methods, CPO and L-PPO. The evaluation criteria included episode reward values and round cost values. The training goal was to maximize the reward while ensuring the cost remained at a minimum. Within the bounds of the security constraints, the higher the reward value, the better the performance. The fluctuations in reward values and security cost values for each method across the three scenarios are depicted in Figs. 6 and 7.

Fig. 6. Cost values comparison in standard scenarios.

Fig. 7. Reward values comparison in standard scenarios.

Figure 6 depicts the evolution of cost values for the three methods under consideration. On the other hand, Fig. 7 depicts the variation in the reward curves of these methods. Observing the figures, it becomes evident that the proposed UBSRL method satisfies the safety constraints in all three scenarios. Furthermore, the safety cost value has diminished beneath the safety threshold, denoted by the dotted line, after 400 episodes of training. This outcome is attributed to the significant role played by the Lyapunov safety constraint network during the training process, signifying that the robot’s action decisions comply with safety performance requirements. Turning our attention to Fig. 7, we can discern that the proposed method’s episode reward value exhibits clear superiority over the two comparative methods. After 450 episodes of training, it tends to stabilize near the peak value, with the maximum reward being markedly higher than the other methods, and with minimal fluctuations. In conclusion, through lateral comparison, it becomes apparent that the two comparative methods, CPO and L-PPO, fail to confine stable actions within the safety threshold. Their strategies oscillate above this threshold, leading to lower reward values and unstable convergence.

To further elucidate the rationality and efficacy of the proposed network architecture, we endeavored to streamline the evaluation network of UBSRL by substituting the dual evaluation network with a singular one, and subsequently termed this refined approach Simplified-UBSRL. Building upon this, additional experiments were executed within the Safety-Gymnasium safety reinforcement learning environment, where we compared the algorithms’ loss function and task failure rate variations. The outcomes are depicted in Fig. 8 and Table 2.

Fig. 8. Comparison for the loss function.

Table 2.

Comparison of Task failure rates in different scenarios.

Circle Goal Button
UBSRL 1.5% 2.1% 3.6%
Simplified-UBSRL 3.5% 4.8% 5.5%

The research results indicate that in three distinct experimental scenarios, network simplification resulted in a substantial rise in the volatility of the loss function, hindering convergence. This could potentially compromise the robot’s capacity to make prompt and accurate decisions. A deeper analysis reveals that when contrasting the task failure rates across various scenarios, the probability of the robot colliding or encountering danger escalated notably, with the task failure rate surging by over 2%.

Consequently, it is concluded that network simplification did not effectively improve the robot’s motion control efficiency. The proposed method demonstrates a more stable security performance in the context of standardized secure reinforcement learning scenarios.

Experiments in multi-robot scenarios

In this section, we aim to evaluate the safety performance and the efficacy of cooperative decision-making among multiple robots. To this end, we have constructed several representative motion scenarios using the robot physics simulator CoppeliaSim, as depicted in Fig. 9.

Fig. 9. Safe reinforcement learning environments for multi-robot.

These scenarios encompass a range of robots maneuvering through both dynamic and static obstacles. Following extensive experimentation, we compare and analyze the security performance of the collaborative capabilities of our proposed UBSRL method against those of other reinforcement learning approaches. As depicted in Fig. 9, scenario 1 encompasses a multi-robot formation navigation undertaking, characterized by a circular domain designated as a safe area. Each robotic entity is tasked with maintaining a secure distance from its peers and traversing the circular perimeter in a uniform clockwise fashion.

The cumulative reward value for the entire multi-robot system can be articulated as the aggregate sum of the individual rewards attained by each constituent robot.

$R=\sum_{i=1}^{I}r_{i}$  (41)

The reward $r_{i}$ for a single robot can be specifically expressed as follows:

graphic file with name M146.gif 42

where, Inline graphicis the regional center position, Inline graphic is the current position of the ith robot, and Inline graphicrepresents the angle deviation.

The cost of a robot includes two parts: movement state cost and environmental cost, which can be specifically represented as:

graphic file with name M150.gif 43

where, Inline graphic represents the action space of the robot, Inline graphicis the movement state cost, Inline graphic denotes the environmental cost.
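As a purely illustrative sketch (not the paper's exact Eqs. (42) and (43)), a circle-formation reward for one robot might combine its deviation from the ring with its angular deviation, while the safety cost simply sums the movement-state and environmental terms; every weight and helper below is an assumption.

import math

def circle_reward(center, pos, angle_dev, radius=1.0, w_r=1.0, w_a=0.5):
    # Penalize leaving the ring of the given radius and deviating in heading.
    dist_err = abs(math.dist(center, pos) - radius)
    return -w_r * dist_err - w_a * abs(angle_dev)

def robot_cost(movement_cost, environment_cost):
    # Two-part safety cost in the spirit of Eq. (43).
    return movement_cost + environment_cost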

Scenario 2 involves a cooperative target search task for a team of robots. These robots must work together to gather the maximum number of green targets, all the while steering clear of red danger zones and obstacles, until they have achieved full coverage of all targets. A robot is rewarded for reaching the green zone and incurs a penalty upon entering the red zone. The reward that a robot earns can be articulated as follows.

graphic file with name M154.gif 44

where, Inline graphicis the immediate reward, Inline graphicdenotes the environment reward, and Inline graphicis the goal arrival reward.

The calculation method for robot safety costs is similar to scenario 1, covering both motion state costs and environmental costs.

graphic file with name M158.gif 45

In Scenario 3, several robots are required to navigate to the target area while maintaining a specific formation. This scenario is riddled with obstacles and hazardous zones, and the robots’ goal positions are subject to real-time changes. Throughout the training process, the unmanned robots’ rewards can be articulated as follows.

graphic file with name M159.gif 46

where,Inline graphic is the formation reward.

In order to ensure that the multi-robot systems maintain a certain formation, the formation reward Inline graphic is set the as:

graphic file with name M162.gif 47

where, Inline graphicare the adjustment coefficient, Inline graphicare the real-time distance of the formation and Inline graphicis the preset distance of the formation.

The safety cost includes three parts: motion state cost, environmental cost, and collision cost. The specific expressions are as follows:

graphic file with name M166.gif 48

where, Inline graphic denotes the movement state cost, Inline graphicsignifies the environmental cost, and Inline graphic represents the formation deviation cost. When the robot strays from the intended formation sequence, Inline graphic

Multi-robot autonomous decision-making experiments were conducted across three distinct scenarios, utilizing the UBSRL, CPO, and SAC methodologies respectively. Each episode was allotted a maximum of 500 training time steps. The evolution of the average episode reward and the average episode cost function across these scenarios is depicted in Figs. 10 and 11.

Fig. 10. Reward values in multi-robot scenarios.

Fig. 11. Cost values in multi-robot scenarios.

As illustrated in Fig. 10, the methodology presented in this paper demonstrates consistent stability within scenario 1. Up until the 3000th episode, the reward value surged rapidly, surpassing those of the other two methodologies. Following the 3000th episode, the reward value experienced a slight decline as the robot started to exit the zone of highest reward and opted for a more conservative strategy, influenced by the safety Lyapunov function’s actions. A similar pattern is observed with the CPO method, which can be attributed to the constraining impact of the cost function. The fluctuation of reward and cost values in scenarios 2 and 3 mirrors that of scenario 1. The UBSRL method achieves a state of convergence in reward value after 2000 to 3000 training episodes, outpacing the CPO security reinforcement learning method, yet marginally falling short of the SAC method. Despite the SAC algorithm achieving the highest reward value, it suffers from the lowest success rate due to insufficient safety constraints. Figure 11 reveals that the cost value of the UBSRL method proposed herein is lower than that of the CPO method, suggesting that UBSRL exhibits superior safety constraint performance compared to the CPO method.

In scenario 3, the robots must continuously adapt their postures to maintain a stable formation. Consequently, to evaluate the efficacy of the proposed UBSRL method in enabling autonomous formation of multi-robot systems, the alterations in the average speed and inter-robot distances were experimentally examined, as depicted in Fig. 12.

Fig. 12. Motion performance indicators of multi-robot scenarios.

Figure 12 illustrates the comparison of multi-robot formation states, including (a) the variation of formation velocity among multi-robot systems, and (b) the variation of formation distance among multi-robot systems. As depicted in Fig. 12, the velocity of the robots exhibits significant fluctuations in the initial stages, struggling to stabilize around the desired speed and distance. However, as training continues, these velocity fluctuations start to diminish. By approximately 4000 episodes, the robots’ velocity becomes essentially stable, maintaining the set speed. A similar pattern is observed in the distance variation.

The experimental statistics, which summarize the task success rates of the three algorithms, are presented in Fig. 12c. The success rate is a crucial metric for evaluating the effectiveness of a decision-making system. Figure 12c reveals that the proposed method achieves the highest task success rate, whereas the SAC method yields the lowest. Comparing across scenarios, scenario 3 has the lowest task success rate owing to its high task complexity and the combined effects of the safety constraints.

To conduct a more thorough comparison of the performance efficacy of the method introduced in this paper, we incorporated new safety reinforcement learning methods as benchmarks, including: CPO48, CVPO49, and STR50. Initially, we compiled the distribution of task failure rates across diverse scenarios, as illustrated in Table 3.

Table 3.

Failure rate comparison in different scenarios.

Scenario Failure rate Risk rate Timeout rate
1 2 3 1 2 3 1 2 3
UBSRL 0.5% 1.2% 1.5% 0.4% 0.5% 0.8% 0.1% 0.7% 0.7%
CPO 2.2% 3.0% 4.5% 1.8% 1.4% 2.8% 0.4% 1.8% 1.7%
CVPO 2.5% 3.2% 4.7% 1.9% 1.6% 3.2% 0.6% 1.6% 1.5%
STR 1.8% 2.6% 3.2% 1.4% 1.1% 2.1% 0.4% 1.5% 1.1%

The task failure rate comprises two components: the danger rate and the timeout rate. The danger rate pertains to the likelihood of the robot encountering obstacles or venturing into hazardous zones. The timeout rate signifies the probability of the robot failing to achieve the designated goal within the prescribed timeframe. The outcomes indicate that UBSRL exhibits the lowest failure rate, danger rate, and timeout rate across all scenarios, demonstrating the utmost stability in performance. In contrast, CPO and CVPO display relatively high failure rates, particularly in scenario 3, where CVPO attains the peak failure rate of 4.7%. The performance of STR lies between UBSRL and CPO/CVPO, with both its failure rate and danger rate being relatively low. However, in scenario 2, its timeout rate is marginally higher than that of UBSRL. These results verify that the proposed UBSRL performs better than the comparison algorithms.

In addition, the task success rates and average time of the robots were computed and are depicted in Figs. 13 and 14. The outcomes reveal that the task success rate of UBSRL was the highest across the three scenarios. Regarding the average time required per episode, scenario 1 is the simplest, resulting in the shortest task duration, whereas scenario 2 requires more exploration goals to be fulfilled and therefore takes the most time. Upon comparison with alternative methods, the proposed method exhibits the lowest average time consumption, indicating that the selected strategy is closer to the optimal one and that its performance surpasses that of the other methods.

Fig. 13. Comparison of success rates.

Fig. 14. Comparison of average time.


To ascertain the efficacy and resilience of UBSRL, supplementary experiments were executed within more intricate and dynamic scenarios. The experimental paradigm encompasses two types of dynamic multi-robot collaborative interaction scenarios. Figure 15a illustrates a multi-robot cooperative search scenario, which incorporates a variety of obstacles and pedestrians. The robots are tasked with identifying the closest target and navigating through the impediments to attain the objective. Should a robot encounter an obstacle or fail to reach the target within the stipulated timeframe, the mission is deemed unsuccessful. Figure 15b portrays a multi-robot cooperative pursuit scenario involving 3 pursuers and 1 evader. The pursuer robots are required to diligently encircle the evader while circumventing all dynamic and static obstacles.

Fig. 15. Multi-robot collaborative interaction scenarios.

Upon completing 10,000 episodes of training experiments within dynamic environments, a comparative analysis of reward and cost values was conducted for various safety-focused reinforcement learning strategies. The outcomes are depicted in Figs. 16 and 17.

Fig. 16. Comparison of reward values in different scenarios.

Fig. 17. Comparison of cost values in different scenarios.

As illustrated in Fig. 16, the initial phase witnessed a rapid learning efficiency across all algorithms, with a swift increase in reward values. This indicates that all methodologies swiftly adapt and learn effective strategies during the early stages of training. UBSRL exhibited a marginally superior trend at this juncture, demonstrating its efficacy in the early learning phase. UBSRL sustained high reward values in subsequent stages, showcasing commendable long-term performance and stability. STR also maintained a relatively consistent performance in later stages, albeit with reward values slightly below those of UBSRL. In contrast, the reward values for CPO and CVPO experienced significant fluctuations in the later stages and exhibited lower overall levels compared to UBSRL and STR, suggesting that both algorithms may be somewhat deficient in stability and efficiency when dealing with long-term learning tasks.

Figure 17 presents the cost curves of the four methods varying with the number of training episodes. During the nascent stages of training, the cost values for all methods are notably elevated, suggesting that the preliminary learning phase is characterized by exploratory behavior and the quest for effective strategy acquisition. The UBSRL method exhibits the most precipitous decline in cost during this initial phase, underscoring its superior efficiency in the early learning stages. Furthermore, UBSRL sustains the lowest cost metric in subsequent phases, signifying its robustness and efficiency in enduring learning processes. The STR algorithm also demonstrates a tendency towards stabilization in later stages, albeit at a marginally elevated cost level compared to UBSRL. Conversely, the CPO and CVPO algorithms experience more pronounced cost fluctuations in the later stages, and their overall cost levels are higher than those observed for UBSRL and STR. In the context of complex dynamic collaborative scenarios, the UBSRL algorithm demonstrates superior performance relative to the other three algorithms, particularly in the early and long-term learning phases.

Upon analyzing the task success rates, reward values, and cost values of the methodologies in a variety of scenarios, it becomes evident that the UBSRL method presented in this paper demonstrates superior security and stability performance within complex, unsafe environments. The UBSRL method showcases commendable adaptability across various scenarios.

Conclusion

This paper introduces a novel hierarchical safe reinforcement learning strategy, grounded in uniformly ultimately bounded constraints, that effectively addresses the issue of insufficient safety in autonomous decision-making processes. This strategy is constructed on the foundation of an event-triggered distributed deep deterministic policy gradient framework, and offers a safety guarantee for multi-robot systems. Furthermore, by integrating the Lyapunov security cost network, an optimization objective with multiple conditional constraints has been developed, and the calculation process has been streamlined through the application of the Lagrange multiplier technique. Ultimately, the stability of the proposed strategy is analyzed employing the uniformly ultimately bounded theory, ensuring that the state trajectories of robots can be restored to safe states within a finite time frame. Experimental results in multi-robot scenarios demonstrate that the uniformly ultimately bounded constrained hierarchical safety reinforcement learning strategy (UBSRL) outperforms existing safety reinforcement learning approaches in terms of task execution efficiency and decision security, providing a theoretical foundation for multi-robot collaborative decision-making systems.

In future work, we aim to extend the UBSRL method to flexible DRL network architectures built on dynamically evolving networks for more complex and variable task environments, and to investigate how to enhance the adaptability and robustness of multi-agent reinforcement learning autonomous decision-making strategies.

Acknowledgements

This research was funded by the National Key Research and Development Program of China (No. 2022YFB4702501), the Open Research Fund of the Anhui Province Key Laboratory of Machine Vision Inspection (KLMVI-2024-HIT-12), the Anhui Province Science and Technology Innovation Breakthrough Plan (No. 202423i08050056), the National Natural Science Foundation of China (No. 52175013) and the Fundamental Research Funds for the Central Universities (BFUKF202421).

Author contributions

H.S., H.J., L.Z., C.W. and S.Q. conceived and designed the research; H.S. and H.J. performed the experiments; L.Z. and C.W. analyzed the data; all authors interpreted the results; S.Q. helped with the methodology and prepared the figures; H.S. drafted the manuscript; all authors edited and revised the manuscript and agreed with its results and conclusions.

Data availability

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Hui Jiang, Email: aaa-jhui@163.com.

Long Zhang, Email: zhanglongcumt@outlook.com.

Changlin Wu, Email: wuchanglin@hnnu.edu.cn.

References

1. Li, X., Ren, J. & Li, Y. Multi-mode filter target tracking method for mobile robot using multi-agent reinforcement learning. Eng. Appl. Artif. Intell. 127, 107398 (2024).
2. Ryu, S. et al. Evaluation criterion of wheeled mobile robotic platforms on grounds: a survey. Int. J. Precis. Eng. Manuf. 25 (3), 675–686 (2024).
3. Li, T. et al. Applications of multi-agent reinforcement learning in future internet: a comprehensive survey. IEEE Commun. Surv. Tutor. 24 (2), 1240–1279 (2022).
4. Nath, A. et al. Multi-agent Q-learning Based Navigation in an Unknown Environment (Springer, 2022).
5. Li, M. et al. Disturbance rejection and high dynamic quadrotor control based on reinforcement learning and supervised learning. Neural Comput. Appl. 34 (13), 11141–11161 (2022).
6. Shakya, A. K., Pillai, G. & Chakrabarty, S. Reinforcement learning algorithms: a brief survey. Expert Syst. Appl. 11 (231), 120495 (2023).
7. Basso, R. et al. Dynamic stochastic electric vehicle routing with safe reinforcement learning. Transp. Res. E 157, 102496 (2022).
8. Konar, A., Baghi, B. H. & Dudek, G. Learning goal conditioned socially compliant navigation from demonstration using risk-based features. IEEE Rob. Autom. Lett. 6 (2), 651–658 (2021).
9. Abdulsaheb, J. A. & Kadhim, D. J. Classical and heuristic approaches for mobile robot path planning: a survey. Robotics 12 (4), 1–35 (2023).
10. Haarnoja, T. et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Sci. Rob. 9 (89), 8022 (2024).
11. Bai, Y., Lv, Y. & Zhang, J. Smart mobile robot fleet management based on hierarchical multi-agent deep Q network towards intelligent manufacturing. Eng. Appl. Artif. Intell. 124, 106534 (2023).
12. Thananjeyan, B. et al. Recovery RL: safe reinforcement learning with learned recovery zones. IEEE Rob. Autom. Lett. 6 (3), 4915–4922 (2021).
13. Shah, V. Next-generation space exploration: AI-enhanced autonomous navigation systems. J. Environ. Sci. Technol. 3 (1), 47–64 (2024).
14. Samsani, S. S. & Muhammad, M. S. Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning. IEEE Rob. Autom. Lett. 6 (3), 5223–5230 (2021).
15. Riley, J. et al. Utilising assured multi-agent reinforcement learning within safety-critical scenarios. Proc. Comput. Sci. 192, 1061–1070 (2021).
16. Rasheed, A. A. A., Abdullah, M. N. & Al-Araji, A. S. A review of multi-agent mobile robot systems applications. Int. J. Electr. Comput. Eng. 12 (4), 3517–3529 (2022).
17. Yasuda, Y. D. V., Martins, L. E. G. & Cappabianco, F. A. M. Autonomous visual navigation for mobile robots. ACM Comput. Surv. 53 (1), 1–34 (2021).
18. Oroojlooy, A. & Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 53 (11), 13677–13722 (2023).
19. Li, C. et al. Deep reinforcement learning in smart manufacturing: a review and prospects. CIRP J. Manuf. Sci. Technol. 40, 75–101 (2023).
20. Brunke, L. et al. Safe learning in robotics: from learning-based control to safe reinforcement learning. Annual Rev. Control Rob. Auton. Syst. 5, 411–444 (2022).
21. Chang, L. et al. Reinforcement based mobile robot path planning with improved dynamic window approach in unknown environment. Auton. Robots 45, 51–76 (2024).
22. Sreenivas, N. K. & Rao, S. Safe deployment of a reinforcement learning robot using self stabilization. Intell. Syst. Appl. 16, 200105 (2022).
23. Hsu, K. et al. Sim-to-Lab-to-Real: safe reinforcement learning with shielding and generalization guarantees. Artif. Intell. 314, 103811 (2023).
24. Zhang, L. et al. Safe reinforcement learning with stability guarantee for motion planning of autonomous vehicles. IEEE Trans. Neural Netw. Learn. Syst. 32 (12), 5435–5444 (2021).
25. Baheri, A. Safe reinforcement learning with mixture density network, with application to autonomous driving. Results Control Optim. 6, 100095 (2022).
26. Basso, R. et al. Dynamic stochastic electric vehicle routing with safe reinforcement learning. Transp. Res. E 157, 102496 (2022).
27. Han, D. et al. A survey on deep reinforcement learning algorithms for robotic manipulation. Sensors 23 (7), 3762 (2023).
28. Jiang, H. et al. Design and kinematic modeling of a passively-actively transformable mobile robot. Mech. Mach. Theory 142, 103591 (2024).
29. Hu, Y., Fu, J. & Wen, G. Safe reinforcement learning for model-reference trajectory tracking of uncertain autonomous vehicles with model-based acceleration. IEEE Trans. Intell. Veh. (2023).
30. Song, Z. et al. Self-adaptive obstacle crossing of an AntiBot from reconfiguration control and mechanical adaptation. J. Mech. Rob. 16 (2), 021002 (2024).
31. Mavrogiannis, C. I. & Knepper, R. A. Multi-agent path topology in support of socially competent navigation planning. Int. J. Robot. Res. 38 (2–3), 338–356 (2019).
32. Kim, H., So, K. K. F. & Wirtz, J. Service robots: applying social exchange theory to better understand human–robot interactions. Tour. Manag. 92, 104537 (2022).
33. Yang, Y. et al. Model-free safe reinforcement learning through neural barrier certificate. IEEE Rob. Autom. Lett. 8 (3), 1295–1302 (2023).
34. Liu, Y., Zhang, Q. & Zhao, D. Multi-task safe reinforcement learning for navigating intersections in dense traffic (2022).
35. Milani, S. et al. Explainable reinforcement learning: a survey and comparative review. ACM Comput. Surv. 56 (7), 1–36 (2024).
36. Hou, Y. et al. Exploring the use of invalid action masking in reinforcement learning: a comparative study of on-policy and off-policy algorithms in real-time strategy games. Appl. Sci. 13 (14), 8283 (2023).
37. Angelopoulos, A. Increasing transparency of reinforcement learning using shielding for human preferences and explanations. arXiv preprint arXiv:2311.16838 (2023).
38. Ji, G., Yan, J., Du, J. et al. Towards safe control of continuum manipulator using shielded multiagent reinforcement learning. IEEE Robot. Autom. Lett. 6 (4), 7461–7468 (2021).
39. Duan, Y. Mobile robot path planning algorithm based on improved A-star. J. Phys. Conf. Ser. 2021 (2023).
40. Zhang, L. & Li, Y. Mobile robot path planning algorithm based on improved A-star. J. Phys. Conf. Ser. (2021).
41. LaValle, S. M., Kuffner, J. J. & Donald, B. R. Rapidly-exploring random trees: progress and prospects. Algorithm. Comput. Robot. New Dir. 5, 293–308 (2024).
42. Rezaee, M. R., Hamid, N. A., Hussin, M. & Zukarnain, Z. A. Comprehensive review of drones collision avoidance schemes: challenges and open issues. IEEE Trans. Intell. Transp. Syst. 25 (5), 1–12 (2024).
43. Garmroodi, A. D., Nasiri, F. & Haghighat, F. Optimal dispatch of an energy hub with compressed air energy storage: a safe reinforcement learning approach. J. Energy Storage 57, 106147 (2023).
44. Kochdumper, N. et al. Provably safe reinforcement learning via action projection using reachability analysis and polynomial zonotopes. IEEE Open J. Control Syst. 2, 79–92 (2023).
45. Du, B. et al. Safe deep reinforcement learning-based adaptive control for USV interception mission. Ocean Eng. 246 (2022).
46. Han, M. et al. Reinforcement learning control of constrained dynamic systems with uniformly ultimate boundedness stability guarantee. Automatica 129, 109689 (2021).
47. Ji, J. et al. OmniSafe: an infrastructure for accelerating safe reinforcement learning research. arXiv preprint arXiv:2305.09304 (2023).
48. Achiam, J., Held, D., Tamar, A. & Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, 22–31 (PMLR, 2017).
49. Liu, Z. et al. Constrained variational policy optimization for safe reinforcement learning. In International Conference on Machine Learning, 13644–13668 (PMLR, 2022).
50. Mao, Y. et al. Supported trust region optimization for offline reinforcement learning. In International Conference on Machine Learning, 23829–23851 (PMLR, 2023).
