Integral Reinforcement-Learning-Based Optimal Containment Control for Partially Unknown Nonlinear Multiagent Systems

Qiuye Wu; Yongheng Wu; Yonghua Wang

doi:10.3390/e25020221

2023 Jan 23;25(2):221.

Editors: Qiang Zhang¹, Yifeng Zeng¹

PMCID: PMC9955993 PMID: 36832588

Abstract

This paper focuses on the optimal containment control problem for the nonlinear multiagent systems with partially unknown dynamics via an integral reinforcement learning algorithm. By employing integral reinforcement learning, the requirement of the drift dynamics is relaxed. The integral reinforcement learning method is proved to be equivalent to the model-based policy iteration, which guarantees the convergence of the proposed control algorithm. For each follower, the Hamilton–Jacobi–Bellman equation is solved by a single critic neural network with a modified updating law which guarantees the weight error dynamic to be asymptotically stable. Through using input–output data, the approximate optimal containment control protocol of each follower is obtained by applying the critic neural network. The closed-loop containment error system is guaranteed to be stable under the proposed optimal containment control scheme. Simulation results demonstrate the effectiveness of the presented control scheme.

Keywords: adaptive dynamic programming, integral reinforcement learning, containment control, multiagent systems, neural networks

1. Introduction

Distributed coordination control of multiagent systems (MASs) has drawn extensive interest due to its potential applications in agricultural irrigation [1], disaster rescue [2], microgrid scheduling [3], marine survey [4] and wireless communication [5]. Distributed coordination control aims to guarantee that all agents, which exchange local information by communicating with their neighbors, reach an agreement on some variables of interest [6]. Over the last decade, containment control has received increasing attention because of its remarkable performance in addressing secure control issues, such as hazardous material treatment [7] and fire rescue [8]. The goal of containment control is to drive the followers into the convex hull spanned by multiple leaders and to keep them there. Numerous interesting and significant results on containment control have been presented. Reference [9] developed a fuzzy-observer-based backstepping control to achieve the containment of MASs. An adaptive funnel containment control was proposed in [10], where the containment errors converged to an adjustable funnel boundary. In practical applications, containment control has been developed for autonomous surface vehicles [4], unmanned aerial vehicles [11] and spacecraft [12]. Notice that most of the aforementioned works have not considered achieving the control performance with minimum energy consumption.

It is well-known that the Riccati equation and the Hamilton–Jacobi–Bellman equation (HJBE) are solved to acquire the optimal control for linear and nonlinear systems [13], respectively. In other words, the Riccati equation is a particular case of the HJBE. As a classical optimization algorithm, dynamic programming (DP) [14] is regarded as an effective way to obtain the optimal solution of the HJBE. However, as the dimension of the state variables increases, the computation of the DP approach grows geometrically, which gives rise to the “curse of dimensionality”. With the success of AlphaGo, reinforcement learning (RL) has stimulated increasing enthusiasm from scholars to tackle the “curse of dimensionality” problem [15]. As a synonym of RL, adaptive DP (ADP) [16] solves the optimal control problem forward in time with the aid of neural network (NN)-based approximators. Moreover, ADP has been increasingly exploited for the optimal coordination control of MASs. Reference [17] established a cooperative policy iteration (PI) algorithm to solve the differential graphical games of linear MASs. In the nonlinear case, Reference [18] investigated the consensus problem via model-based PI with a generalized fuzzy hyperbolic critic structure. An event-triggered ADP-based optimal coordination control was proposed in [19], in which the communication load and the computation consumption were reduced. To tackle the optimal containment control (OCC) problem, a finite-time fault-tolerant control was proposed via model-based PI [20]. In the presence of state constraints, Reference [21] presented a proper barrier function to transform the state constraint problem into an unconstrained case, after which the event-triggered OCC protocols were obtained. In Reference [22], distributed RL was applied to handle an OCC problem with collision avoidance of nonholonomic mobile robots. When an accurate model of the plant is not available, system identification is usually employed. It should be pointed out that system identification struggles to respond in time to dynamic changes of the system, which brings inevitable identification errors.

Recently, the integral RL (IRL) method was adopted to relax the accurate model requirement of the plant by constructing the integral Bellman equation [23,24]. An actor–critic architecture was adopted to execute the IRL algorithm, in which an actor NN learned the optimal control strategy and a critic NN was devoted to approximating the optimal value function. In the presence of heterogeneous linear MASs (HLMASs), the IRL method was developed to handle the robust OCC problem [25]. An adaptive output-feedback method was developed for the containment control for HLMASs via the IRL algorithm [26]. In Reference [27], the off-policy IRL-based OCC scheme was presented for unknown HLMASs with active leaders. However, the OCC problem of the nonlinear MASs with partially unknown dynamics has rarely been investigated via the IRL method. Moreover, the actor–critic architecture requires constructing the actor NN, which makes the control structure more complex. It is crucial to develop an IRL-based OCC scheme by implementing a simplified control structure. In addition, most of the aforementioned OCC approaches ensure the weight estimation error of the critic NN is uniformly ultimately bounded (UUB) only, which may degrade the control performance. All the above concerns motivated our research.

Inspired by the aforementioned works, we developed an IRL-based OCC scheme with an asymptotically stable critic structure for partially unknown nonlinear MASs. The main contributions are as follows.

(1) Different from existing control schemes [9,20], an IRL method is introduced to construct the integral Bellman equation without system identification. Furthermore, IRL is proved to be equivalent to model-based PI, which guarantees the convergence of the developed control algorithm.
(2) The IRL-based OCC scheme is implemented by a critic-only architecture for nonlinear MASs with unknown drift dynamics, rather than by an actor–critic architecture for linear MASs [25,26,27]. Thus, the proposed scheme simplifies the control structure.
(3) In contrast to the existing OCC schemes [20,21,22], which guarantee the weight errors to be UUB, a modified weight-updating law is presented to tune the critic NN weights, whose weight error dynamic is asymptotically stable.

This paper is organized as follows. In Section 2, graph theory and its application to the containment of MASs are outlined. In Section 3, the IRL-based OCC scheme and its convergence proof are presented for nonlinear MASs. Then, the stability of the closed-loop containment error systems is analyzed in detail. In Section 4, two simulation examples demonstrate the effectiveness of the proposed scheme. In Section 5, concluding remarks are drawn.

2. Preliminaries and Problem Description

2.1. Graph Theory

For a network with N agents, the information interactions among agents are reflected by a weighted graph $G = (V, ε, A)$ with the nonempty finite set of nodes $V = \{υ_{1}, \dots, υ_{N}\}$ , the edge set $ε \subseteq V \times V$ and the nonnegative weighted adjacency matrix $A = [a_{i p}]$ . If node $υ_{i}$ links to node $υ_{p}$ , the edge $(υ_{i}, υ_{p}) \in ε$ is available with $a_{i p} > 0$ ; otherwise, $a_{i p} = 0$ . For a node $υ_{i}$ , the node $υ_{p}$ is called a neighbor of $υ_{i}$ when $(υ_{i}, υ_{p}) \in ε$ . In this way, $N_{i} = \{υ_{p} \in V : (υ_{p}, υ_{i}) \in ε\}$ represents the set of all neighbors of $υ_{i}$ . Denote the Laplacian matrix as $L = D - A = [l_{i p}]$ , where $D = \mathrm{diag} \{d_{11}, d_{22}, \dots, d_{N N}\}$ , $d_{i i} = \sum_{p = 1}^{N_{i}} a_{i p}$ and $l_{i p}$ satisfies

l_{i p} = \begin{cases} \sum_{q = 1, q \neq i}^{N_{i}} a_{i q}, & i = p, \\ - a_{i p}, & i \neq p . \end{cases}

It implies that each row sum of L equals zero. A sequence of edges $(υ_{1}, υ_{2}), (υ_{2}, υ_{3}), \dots$ with $υ_{i} \in V$ is defined as a directed path. A directed graph is strongly connected if, for arbitrary distinct $υ_{i}, υ_{p} \in V$ , there is a directed path from $υ_{i}$ to $υ_{p}$ , while the directed graph is said to contain a spanning tree if there exists a directed path from a root node to every other node of $G$ . This paper focuses on a strongly connected digraph with a spanning tree.
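The Laplacian construction can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-node adjacency matrix (the weights are not taken from the paper); it builds $L = D - A$ and confirms the zero row-sum property stated above.

```python
import numpy as np

def laplacian(adjacency: np.ndarray) -> np.ndarray:
    """Return L = D - A, where D holds the row sums of A on its diagonal."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

# Hypothetical 3-node digraph; entry a[i, p] is the weight a_ip of the text.
A = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5],
              [0.5, 0.0, 0.0]])
L = laplacian(A)
row_sums = L.sum(axis=1)  # every entry should vanish
```

The diagonal entry $l_{i i}$ collects the in-row weights and each off-diagonal entry is $- a_{i p}$ , so the row sums cancel by construction.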

2.2. Problem Description

Consider the leader–follower nonlinear MASs in the form of the graph $G$ with M leaders and N followers, where the node dynamic of the ith follower is modeled by

{\dot{x}}_{i} = f (x_{i} (t)) + g_{i} (x_{i} (t)) μ_{i} (t),

(1)

where $x_{i} \in R^{n}$ is the state vector for the ith follower, $μ_{i} \in R^{m}$ is the control input vector, $i = 1, 2, \dots, N$ , and the nonlinear functions $f (x_{i}) \in R^{n}$ and $g_{i} (x_{i}) \in R^{n \times m}$ represent the unknown drift dynamic and the control input matrix, respectively. Denote the global state vector as $x = {[x_{1}^{T}, x_{2}^{T}, \dots, x_{N}^{T}]}^{T} \in R^{N \times n}$ .

Assumption 1.

$f (x_{i})$ and $g_{i} (x_{i})$ are Lipschitz continuous on the compact set $Ω_{i}$ with $f (0) = 0$ and the system (1) is controllable.

Define the node dynamic of the jth leader as

{\dot{r}}_{j} = h_{j} (r_{j} (t)),

(2)

where $r_{j} \in R^{n}$ stands for the state vector of the jth leader, $j = 1, 2, \dots, M$ and $h_{j} (r_{j}) \in R^{n}$ satisfies Lipschitz continuity.

Definition 1

(Convex hull [8]). A set $C \subseteq R^{M \times n}$ is convex if for any $y_{1}, y_{2} \in C$ and $\forall ρ \in (0, 1)$ , $((1 - ρ) y_{1} + ρ y_{2}) \in C$ . The convex hull of a finite set $Y = \{y_{1}, y_{2}, \dots, y_{M}\}$ is the minimal convex set containing $Y$ , i.e., $Co (Y) = \{\sum_{j = 1}^{M} ρ_{j} y_{j} | y_{j} \in Y, ρ_{j} \in R, ρ_{j} \geq 0, \sum_{j = 1}^{M} ρ_{j} = 1\}$ .
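As an illustration of Definition 1, the sketch below tests whether a point of $R^{2}$ lies in the convex hull of three points by solving for the weights $ρ_{j}$ . The helper names `convex_weights` and `in_convex_hull` and the triangle are hypothetical, and the approach assumes exactly three non-collinear points so that the weights are unique.

```python
import numpy as np

def convex_weights(points: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve sum_j rho_j * y_j = y together with sum_j rho_j = 1 (three points in R^2)."""
    # Two coordinate equations plus the affine constraint: 3 equations, 3 unknowns.
    system = np.vstack([points.T, np.ones(3)])
    rhs = np.append(y, 1.0)
    return np.linalg.solve(system, rhs)

def in_convex_hull(points: np.ndarray, y: np.ndarray, tol: float = 1e-9) -> bool:
    """y is in Co(Y) exactly when all weights are nonnegative."""
    return bool(np.all(convex_weights(points, y) >= -tol))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # illustrative points only
inside = in_convex_hull(triangle, np.array([0.25, 0.25]))
outside = in_convex_hull(triangle, np.array([1.0, 1.0]))
```

For more than three leaders, or for higher dimensions, the weights are no longer unique and a linear-programming feasibility test would be needed instead.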

The containment control aims to find a set of distributed control protocols $μ = \{μ_{1}, μ_{2}, \dots, μ_{N}\}$ such that all followers converge to and stay in the convex hull formed by the leaders, i.e., $x_{i} (t) \to Co (Y)$ with $Y = \{r_{1}, r_{2}, \dots, r_{M}\}$ . For the ith follower, the local neighborhood containment error $e_{i}$ is formulated as

\begin{matrix} e_{i} = & \sum_{p \in N_{i}} a_{i p} (x_{i} - x_{p}) + \sum_{j = 1}^{M} b_{i j} (x_{i} - r_{j}) \\ = & d_{i i} x_{i} - \sum_{p \in N_{i}} a_{i p} x_{p} + \sum_{j = 1}^{M} b_{i j} (x_{i} - r_{j}), \end{matrix}

(3)

where $e_{i} \in R^{n}$ , $b_{i j} \geq 0$ represents the pinning gain. Define $B_{j} = diag [b_{1 j}, \dots, b_{i j}, \dots, b_{N j}] \in R^{N \times N}$ . In fact, the connection between the ith follower and the jth leader is available if and only if $b_{i j} > 0$ . Denote the communication graph as $G_{x} = (G, x)$ . The global containment error vector of $G_{x}$ is

e = (G \otimes I_{n}) x - ((B (I_{M} \otimes 1_{N})) \otimes I_{n}) \bar{r},

where $e = {[e_{1}^{T}, e_{2}^{T}, \dots, e_{N}^{T}]}^{T} \in R^{N \times n}$ , $\bar{r} = {[r_{1}^{T}, r_{2}^{T}, \dots, r_{M}^{T}]}^{T} \in R^{M \times n}$ , $G = L + B (1_{M} \otimes I_{N})$ , $I_{n}$ represents the n-dimensional identity matrix, $1_{M}$ stands for the M-dimensional column vector whose elements all equal 1 and $B = [B_{1}, B_{2}, \dots, B_{M}] \in R^{N \times N M}$ . Considering (1), (2) and (3), for the ith follower, the local neighborhood containment error dynamic is formulated as

\begin{matrix} {\dot{e}}_{i} = F_{i} + c_{i} g_{i} (x_{i}) μ_{i} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}, \end{matrix}

(4)

where $c_{i} = (d_{i i} + \sum_{j = 1}^{M} b_{i j})$ and $F_{i} = c_{i} f (x_{i}) - \sum_{p \in N_{i}} a_{i p} f (x_{p}) - \sum_{j = 1}^{M} b_{i j} h_{j} (r_{j})$ . For the ith follower, the local neighborhood containment error is dominated not only by local states and local control inputs, but also by the information from its neighbors and the leaders. In order to implement the synchronization of the partially unknown nonlinear MASs (i.e., $e_{i} \to 0$ ), an IRL-based OCC scheme is designed in the next subsection.
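To make the error definition (3) concrete, the sketch below evaluates $e_{i}$ follower by follower and compares the result with the stacked expression obtained by summing (3) over all followers with Kronecker products. The graph weights, pinning gains and states are assumed values chosen only for illustration.

```python
import numpy as np

N, M, n = 3, 2, 2                      # followers, leaders, state dimension (illustrative)
rng = np.random.default_rng(0)
A = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])        # follower adjacency a_ip (assumed values)
b = np.array([[0.5, 0.0],
              [0.0, 0.0],
              [0.0, 0.5]])             # pinning gains b_ij (assumed values)
x = rng.normal(size=(N, n))            # follower states x_i, one per row
r = rng.normal(size=(M, n))            # leader states r_j, one per row

def local_error(i: int) -> np.ndarray:
    """Containment error e_i of (3), built from neighbor and leader differences."""
    e = sum(A[i, p] * (x[i] - x[p]) for p in range(N))
    e = e + sum(b[i, j] * (x[i] - r[j]) for j in range(M))
    return e

e_local = np.stack([local_error(i) for i in range(N)])

# Stacked form: (L + sum_j B_j) acts on the followers, the b_ij matrix on the leaders.
L = np.diag(A.sum(axis=1)) - A
G = L + np.diag(b.sum(axis=1))
e_global = np.kron(G, np.eye(n)) @ x.reshape(-1) - np.kron(b, np.eye(n)) @ r.reshape(-1)
```

Each follower only needs its own state, its neighbors' states and the states of the leaders it is pinned to, which is what makes the protocol distributed.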

3. IRL-Based OCC Scheme

3.1. Optimal Containment Control

For the local neighborhood containment error dynamic (4), define the cost function as

J_{i} (e_{i} (0)) = \int_{0}^{\infty} P_{i} (e_{i} (ξ), μ_{i} (ξ), μ_{- i} (ξ)) d ξ,

(5)

where $P_{i} (e_{i}, μ_{i}, μ_{- i}) = e_{i}^{T} Q_{i} e_{i} + \sum_{p \in {N_{i}, i}} μ_{p}^{T} R_{i p} μ_{p}$ is the utility function, $μ_{- i} = \{μ_{p} | p \in N_{i}\}$ represents the set of local control protocols of the neighbors of node $υ_{i}$ , and $Q_{i} \in R^{n \times n}$ and $R_{i p} \in R^{m \times m}$ are positive definite matrices.

Definition 2

(Admissible control policies [17]). The feedback control policies $μ_{i} (e_{i}) (i \in I)$ are defined to be admissible with respect to (5) on a compact set $Ω_{i}$ , denoted by $μ_{i} (e_{i}) \in A (Ω_{i})$ , if $μ_{i} (e_{i})$ is continuous on $Ω_{i}$ with $μ_{i} (0) = 0$ , $μ_{i} (e_{i})$ stabilizes (4) on $Ω_{i}$ and $J_{i} (e_{i} (0))$ is finite $\forall e_{i} (0) \in Ω_{i}$ .

Definition 3

(Nash equilibrium [17]). An N-tuple admissible control policy $μ^{*} (e) = \{μ_{1}^{*} (e_{1}), μ_{2}^{*} (e_{2}), \dots, μ_{N}^{*} (e_{N})\}$ is said to constitute a Nash equilibrium solution in graph $G_{x}$ if the following N inequalities are satisfied

$\begin{matrix} J_{i} (e_{i}, μ_{i}^{*}, μ_{- i}^{*}) \leq J_{i} (e_{i}, μ_{i}, μ_{- i}^{*}), i = 1, 2, \dots, N, \end{matrix}$

where $μ_{- i}^{*} = \{μ_{1}^{*}, \dots, μ_{i - 1}^{*}, μ_{i + 1}^{*}, \dots, μ_{N}^{*}\}$ .

This paper aims to find an N-tuple optimal admissible control policy $μ^{*} (e)$ to minimize the cost function (5) for each follower such that the Nash equilibrium solution in $G_{x}$ (i.e., the OCC protocols) is obtained.

For arbitrary $μ_{i} (e_{i}) \in A (Ω_{i})$ of the ith follower, define the value function

C_{i} (e_{i} (t)) = \int_{t}^{\infty} P_{i} (e_{i} (ξ), μ_{i} (ξ), μ_{- i} (ξ)) d ξ .

(6)

When (6) is finite, then the Bellman equation is

\begin{matrix} 0 = e_{i}^{T} Q_{i} e_{i} + \sum_{p \in {N_{i}, i}} μ_{p}^{T} R_{i p} μ_{p} + \nabla C_{i}^{T} (e_{i}) (F_{i} + c_{i} g_{i} (x_{i}) μ_{i} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}), \end{matrix}

(7)

where $C_{i} (0) = 0$ and $\nabla C_{i} (e_{i}) = \partial C_{i} (e_{i}) / \partial e_{i}$ . For the ith follower, the local Hamiltonian is

\begin{matrix} H_{i} (e_{i}, μ_{i}, μ_{- i}, C_{i} (e_{i})) = & e_{i}^{T} Q_{i} e_{i} + \sum_{p \in {N_{i}, i}} μ_{p}^{T} R_{i p} μ_{p} \\ + \nabla C_{i}^{T} (e_{i}) (F_{i} + c_{i} g_{i} (x_{i}) μ_{i} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}) . \end{matrix}

Define the optimal value function as

C_{i}^{*} (e_{i}) = \min_{μ_{i} \in A (Ω_{i})} C_{i} (e_{i}) .

(8)

According to [13], the optimal value function $C_{i}^{*} (e_{i})$ satisfies the HJBE as follows

0 = \min_{μ_{i} \in A (Ω_{i})} H_{i} (e_{i}, μ_{i}, μ_{- i}, C_{i}^{*} (e_{i})) .

(9)

The local OCC protocol is

\begin{matrix} μ_{i}^{*} (e_{i}) = & \arg\min_{μ_{i} \in A (Ω_{i})} H_{i} (e_{i}, μ_{i}, μ_{- i}, C_{i}^{*} (e_{i})) \\ = & - \frac{1}{2} c_{i} R_{i i}^{- 1} g_{i}^{T} (x_{i}) \nabla C_{i}^{*} (e_{i}) . \end{matrix}

(10)
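The closed form in (10) can be recovered from the stationarity condition of the local Hamiltonian with respect to $μ_{i}$ . The step below is a sketch of that calculation, assuming $R_{i i}$ is symmetric (in addition to positive definite) and that the neighbors' protocols $μ_{- i}$ are held fixed:

```latex
\frac{\partial H_{i}}{\partial μ_{i}} = 2 R_{i i} μ_{i} + c_{i} g_{i}^{T} (x_{i}) \nabla C_{i}^{*} (e_{i}) = 0
\quad \Longrightarrow \quad
μ_{i}^{*} (e_{i}) = - \frac{1}{2} c_{i} R_{i i}^{- 1} g_{i}^{T} (x_{i}) \nabla C_{i}^{*} (e_{i}) .
```

Because $\partial^{2} H_{i} / \partial μ_{i}^{2} = 2 R_{i i}$ is positive definite, this stationary point is the minimizer.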

It should be mentioned that the analytical solution of the HJBE is intractable to obtain since $C_{i}^{*} (e_{i})$ is unknown. According to [15], the solution of the HJBE is successively approximated through a sequence of iterations with policy evaluation

\begin{matrix} 0 = & e_{i}^{T} Q_{i} e_{i} + \sum_{p \in {N_{i}, i}} μ_{p}^{(k - 1) T} R_{i p} μ_{p}^{(k - 1)} \\ + \nabla C_{i}^{(k) T} (e_{i}) (F_{i} + c_{i} g_{i} (x_{i}) μ_{i}^{(k - 1)} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}^{(k - 1)}), \end{matrix}

(11)

and policy improvement

\begin{matrix} μ_{i}^{(k)} = - \frac{1}{2} c_{i} R_{i i}^{- 1} g_{i}^{T} (x_{i}) \nabla C_{i}^{(k)} (e_{i}), \end{matrix}

(12)

where $(k)$ represents the kth iteration index with $k \in N^{+}$ .

From (11), we can see that the policy evaluation requires the accurate mathematical model of (1). However, the accurate mathematical model is always difficult to obtain in practice. To break this bottleneck, the IRL method is developed to relax the requirement of the accurate model in the policy evaluation.
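To illustrate how the iteration (11) and (12) proceeds, and why it needs the model, the following sketch runs model-based PI on a scalar linear system $\dot{x} = a x + b u$ with cost $q x^{2} + r u^{2}$ and value $C (x) = p x^{2}$ . This is a simplified single-agent stand-in chosen for illustration, not the multiagent setting of the paper, and all numerical values are assumptions.

```python
import math

def policy_iteration(a: float, b: float, q: float, r: float, k0: float, iters: int = 20):
    """Model-based PI for dx/dt = a*x + b*u with cost q*x^2 + r*u^2 and C(x) = p*x^2."""
    k = k0                                   # initial admissible gain: a - b*k0 < 0
    p = None
    for _ in range(iters):
        # Policy evaluation: 0 = q + r*k^2 + 2*p*(a - b*k), the scalar analogue of (11).
        p = (q + r * k * k) / (2.0 * (b * k - a))
        # Policy improvement: u = -(1/2) * r^{-1} * b * dC/dx = -(b*p/r) * x, as in (12).
        k = b * p / r
    return p, k

a, b, q, r = 1.0, 1.0, 5.0, 1.0              # illustrative values, not from the paper
p_pi, k_pi = policy_iteration(a, b, q, r, k0=3.0)
# Closed-form positive root of the scalar Riccati equation 2*a*p - (b*p)^2/r + q = 0.
p_star = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
```

Note that the evaluation step uses the drift coefficient `a` explicitly; this is the model dependence that the IRL method removes.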

3.2. Integral Reinforcement Learning

For $t_{τ} > 0$ , (6) can be rewritten as

\begin{matrix} C_{i} (e_{i} (t)) = \int_{t}^{t + t_{τ}} (e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) + \sum_{p \in {N_{i}, i}} μ_{p}^{T} (ξ) R_{i p} μ_{p} (ξ)) d ξ + C_{i} (e_{i} (t + t_{τ})) . \end{matrix}

(13)

Based on the integral Bellman Equation (13), $C_{i}^{*} (e_{i})$ and $μ_{i}^{*}$ satisfy

\begin{matrix} 0 = & \int_{t}^{t + t_{τ}} (e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) + \sum_{p \in {N_{i}, i}} μ_{p}^{* T} (ξ) R_{i p} μ_{p}^{*} (ξ)) d ξ \\ + C_{i}^{*} (e_{i} (t + t_{τ})) - C_{i}^{*} (e_{i} (t)) . \end{matrix}

(14)

Compared with (7), the integral Bellman equation (14) does not involve $F_{i}$ , so a policy evaluation based on it does not require the unknown drift dynamics $f (x_{i})$ in (1).
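The data-driven character of the integral Bellman equation can be seen on a scalar example. In the sketch below, which is a hypothetical single-agent illustration with assumed values, the value coefficient $p$ of $C (x) = p x^{2}$ is recovered from sampled states and inputs over one interval $[t, t + t_{τ}]$ . The drift coefficient `a_true` is used only to generate the samples and to compute a model-based reference for comparison.

```python
import numpy as np

a_true, b, q, r = 1.0, 1.0, 5.0, 1.0   # a_true only generates data; the evaluator never reads it
k = 3.0                                # admissible policy u = -k*x
t_tau = 0.1                            # reinforcement interval
steps = 2000                           # quadrature points per interval

def closed_loop_samples(x0: float) -> np.ndarray:
    """State samples on [t, t + t_tau]; stands in for measured input-output data."""
    t = np.linspace(0.0, t_tau, steps + 1)
    return x0 * np.exp((a_true - b * k) * t)

def irl_evaluate(x_traj: np.ndarray) -> float:
    """Solve the integral Bellman equation (13) for p with C(x) = p*x^2, using data only."""
    u_traj = -k * x_traj
    cost = q * x_traj**2 + r * u_traj**2
    dt = t_tau / steps
    integral = dt * (cost[:-1] + cost[1:]).sum() / 2.0   # trapezoidal rule
    # p*x(t)^2 = integral + p*x(t + t_tau)^2
    return integral / (x_traj[0]**2 - x_traj[-1]**2)

p_irl = irl_evaluate(closed_loop_samples(x0=1.0))
p_model = (q + r * k**2) / (2.0 * (b * k - a_true))      # model-based evaluation for comparison
```

Both evaluations agree up to quadrature error, which is the scalar counterpart of the equivalence established in Theorem 1.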

Theorem 1.

Let $C_{i}^{(k)} (e_{i}) \geq 0$ , $C_{i}^{(k)} (0) = 0$ and $μ_{i}^{(k)} \in A (Ω_{i})$ . $C_{i}^{(k)} (e_{i})$ is the solution of the integral Bellman equation

$\begin{matrix} 0 = & \int_{t}^{t + t_{τ}} e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) d ξ + \int_{t}^{t + t_{τ}} \sum_{p \in {N_{i}, i}} μ_{p}^{(k - 1) T} (ξ) R_{i p} μ_{p}^{(k - 1)} (ξ) d ξ \\ + C_{i}^{(k)} (e_{i} (t + t_{τ})) - C_{i}^{(k)} (e_{i} (t)), \end{matrix}$ (15)

if and only if $C_{i}^{(k)} (e_{i})$ is the only solution of (11).

Proof of Theorem 1.

Considering (11), the time derivative of $C_{i}^{(k)} (e_{i})$ corresponding to (4) is transformed as

$\begin{matrix} \frac{d C_{i}^{(k)} (e_{i})}{d t} = & \nabla C_{i}^{(k) T} (e_{i}) (F_{i} + c_{i} g_{i} (x_{i}) μ_{i}^{(k - 1)} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}^{(k - 1)}) \\ = & - e_{i}^{T} Q_{i} e_{i} - \sum_{p \in {N_{i}, i}} μ_{p}^{(k - 1) T} R_{i p} μ_{p}^{(k - 1)} . \end{matrix}$ (16)

Integrating both sides of (16) over $[t, t + t_{τ}]$ yields

$\begin{matrix} C_{i}^{(k)} (e_{i} (t + t_{τ})) - C_{i}^{(k)} (e_{i} (t)) = & - \int_{t}^{t + t_{τ}} e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) d ξ \\ - \int_{t}^{t + t_{τ}} \sum_{p \in {N_{i}, i}} μ_{p}^{(k - 1) T} (ξ) R_{i p} μ_{p}^{(k - 1)} (ξ) d ξ . \end{matrix}$ (17)

According to the derivation of (16) and (17), if $C_{i}^{(k)} (e_{i})$ is the solution of (11), $C_{i}^{(k)} (e_{i})$ satisfies the integral Bellman Equation (15). Next, we verify the uniqueness of the solution $C_{i}^{(k)} (e_{i})$ .

Suppose that $Υ_{i}^{(k)} (e_{i})$ is another solution of (11) with $Υ_{i}^{(k)} (0) = 0$ . Similar to the derivation of (16), we have

$\begin{matrix} \frac{d Υ_{i}^{(k)} (e_{i})}{d t} = - e_{i}^{T} Q_{i} e_{i} - \sum_{p \in {N_{i}, i}} μ_{p}^{(k - 1) T} R_{i p} μ_{p}^{(k - 1)} . \end{matrix}$ (18)

Subtracting (16) from (18) yields

$\frac{d}{d t} (Υ_{i}^{(k)} (e_{i}) - C_{i}^{(k)} (e_{i})) = 0 .$ (19)

Solving (19), we have $Υ_{i}^{(k)} (e_{i}) - C_{i}^{(k)} (e_{i}) = ς_{i}$ with $ς_{i} \in R$ a real constant. For $e_{i} = 0$ , we have $ς_{i} = Υ_{i}^{(k)} (0) - C_{i}^{(k)} (0) = 0$ . That is to say, $Υ_{i}^{(k)} (e_{i}) = C_{i}^{(k)} (e_{i})$ . One can derive that $C_{i}^{(k)} (e_{i})$ is the unique solution. In summary, $C_{i}^{(k)} (e_{i})$ is the unique solution of (15) if and only if $C_{i}^{(k)} (e_{i})$ is the only solution of (11). □

Theorem 1 reveals that the IRL algorithm with (15) and (12) is theoretically equivalent to the model-based PI algorithm, whose convergence analysis was provided in [15]. Hence, the IRL algorithm is guaranteed to be convergent.

Theorem 2.

Considering the nonlinear MAS with partially unknown dynamic as (1), the local neighborhood containment error dynamic as (4) and the optimal value function $C_{i}^{*} (e_{i})$ as (8), the closed-loop containment error system is guaranteed to be asymptotically stable under the local OCC protocol (10). Furthermore, the containment control is achieved with a set of the OCC protocols $\{μ_{1}^{*}, μ_{2}^{*}, \dots, μ_{N}^{*}\}$ if there is a spanning tree in the directed graph.

Proof of Theorem 2.

Select the Lyapunov function candidate as $C_{i}^{*} (e_{i})$ . Combining (7), (8) and (10) yields

$\begin{matrix} \nabla C_{i}^{* T} (e_{i}) F_{i} = & - \nabla C_{i}^{* T} (e_{i}) (c_{i} g_{i} (x_{i}) μ_{i}^{*} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}^{*}) \\ - e_{i}^{T} Q_{i} e_{i} - \sum_{p \in {N_{i}, i}} μ_{p}^{* T} R_{i p} μ_{p}^{*} . \end{matrix}$ (20)

Substituting (20) into the time derivative of $C_{i}^{*} (e_{i})$ yields

$\begin{matrix} {\dot{C}}_{i}^{*} (e_{i}) = & \nabla C_{i}^{* T} (e_{i}) (F_{i} + c_{i} g_{i} (x_{i}) μ_{i}^{*} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) μ_{p}^{*}) \\ = & - e_{i}^{T} Q_{i} e_{i} - \sum_{p \in {N_{i}, i}} μ_{p}^{* T} R_{i p} μ_{p}^{*} . \end{matrix}$

Therefore, ${\dot{C}}_{i}^{*} (e_{i}) \leq 0$ , with strict inequality for $e_{i} \neq 0$ since $Q_{i}$ is positive definite. One can conclude that the closed-loop containment error system (4) is asymptotically stable with the local OCC protocol (10). Since a spanning tree exists in the directed graph, the containment control of the nonlinear MAS with partially unknown dynamic can be achieved. □

3.3. Critic NN Implementation

Based on the Stone–Weierstrass approximation theorem, on the compact set $Ω_{i}$ , the optimal function $C_{i}^{*} (e_{i})$ and its partial gradient can be established by a critic NN as

\begin{matrix} C_{i}^{*} (e_{i}) = & ϕ_{i}^{* T} σ_{i} (e_{i}) + ω_{i} (e_{i}), \end{matrix}

(21)

\begin{matrix} \nabla C_{i}^{*} (e_{i}) = & \nabla σ_{i}^{T} (e_{i}) ϕ_{i}^{*} + \nabla ω_{i} (e_{i}), \end{matrix}

(22)

where $ϕ_{i}^{*} \in R^{l_{i}}$ represents the ideal weight, $σ_{i} (\cdot) \in R^{l_{i}}$ represents the activation function, $l_{i}$ represents the number of hidden neurons and $ω_{i} (e_{i})$ stands for the reconstruction error.

Since the ideal weight vector is unknown, the approximation of $C_{i}^{*} (e_{i})$ and $\nabla C_{i}^{*} (e_{i})$ are expressed as

\begin{matrix} {\hat{C}}_{i} (e_{i}) & = {\hat{ϕ}}_{i}^{T} σ_{i} (e_{i}), \\ \nabla {\hat{C}}_{i} (e_{i}) & = \nabla σ_{i}^{T} (e_{i}) {\hat{ϕ}}_{i}, \end{matrix}

(23)

where $\nabla σ_{i} (e_{i}) = \partial σ_{i} (e_{i}) / \partial e_{i}$ and ${\hat{ϕ}}_{i} \in R^{l_{i}}$ represents the estimation of $ϕ_{i}^{*}$ . Then, the local OCC protocol (10) can be approximated by

{\hat{μ}}_{i} (e_{i}) = - \frac{1}{2} c_{i} R_{i i}^{- 1} g_{i}^{T} (x_{i}) \nabla σ_{i}^{T} (e_{i}) {\hat{ϕ}}_{i} .

(24)

The approximate local Hamiltonian is

\begin{matrix} e_{c i} = & \int_{t}^{t + t_{τ}} (e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) + \sum_{p \in {N_{i}, i}} {\hat{μ}}_{p}^{T} (ξ) R_{i p} {\hat{μ}}_{p} (ξ)) d ξ \\ + {\hat{ϕ}}_{i}^{T} \underbrace{(σ_{i} (e_{i} (t + t_{τ})) - σ_{i} (e_{i} (t)))}_{θ_{i}} . \end{matrix}

(25)

Combining (14) and (21) with (25) yields

\begin{matrix} e_{c i} = & \int_{t}^{t + t_{τ}} (e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) + \sum_{p \in {N_{i}, i}} {\hat{μ}}_{p}^{T} (ξ) R_{i p} {\hat{μ}}_{p} (ξ)) d ξ \\ - \int_{t}^{t + t_{τ}} (e_{i}^{T} (ξ) Q_{i} e_{i} (ξ) + \sum_{p \in {N_{i}, i}} μ_{p}^{* T} (ξ) R_{i p} μ_{p}^{*} (ξ)) d ξ \\ + {\hat{ϕ}}_{i}^{T} θ_{i} - ϕ_{i}^{* T} θ_{i} - ω_{i} (e_{i} (t + t_{τ})) + ω_{i} (e_{i} (t)) \\ = & \int_{t}^{t + t_{τ}} \sum_{p \in {N_{i}, i}} {({\hat{μ}}_{p} (ξ) + μ_{p}^{*} (ξ))}^{T} R_{i p} ({\hat{μ}}_{p} (ξ) - μ_{p}^{*} (ξ)) d ξ \\ - {\tilde{ϕ}}_{i}^{T} θ_{i} - ω_{i} (e_{i} (t + t_{τ})) + ω_{i} (e_{i} (t)) \\ = & - {\tilde{ϕ}}_{i}^{T} θ_{i} + Φ_{i}, \end{matrix}

(26)

where ${\tilde{ϕ}}_{i} = ϕ_{i}^{*} - {\hat{ϕ}}_{i}$ represents the weight estimation error and $Φ_{i} = \int_{t}^{t + t_{τ}} \sum_{p \in {N_{i}, i}} {({\hat{μ}}_{p} (ξ) + μ_{p}^{*} (ξ))}^{T} R_{i p} ({\hat{μ}}_{p} (ξ) - μ_{p}^{*} (ξ)) d ξ - ω_{i} (e_{i} (t + t_{τ})) + ω_{i} (e_{i} (t))$ .

Assumption 2.

$Φ_{i}$ is bounded by $η_{i}$ , i.e., $∥ Φ_{i} ∥ \leq η_{i}$ with $η_{i} > 0$ .

In order to tune ${\hat{ϕ}}_{i}$ , the steepest descent algorithm is employed to minimize $E_{c i} = \frac{1}{2} e_{c i}^{2}$ . A modified updating law of ${\hat{ϕ}}_{i}$ is

{\dot{\hat{ϕ}}}_{i} = - l_{c i} \frac{θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} (e_{c i} - {\hat{η}}_{i})

(27)

where $l_{c i} > 0$ and ${\hat{η}}_{i}$ , the estimation of $η_{i}$ , can be updated by

{\dot{\hat{η}}}_{i} = l_{s i} \frac{{\tilde{ϕ}}_{i}^{T} θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}},

(28)

where $l_{s i} > 0$ is a design constant. Considering (26) and (27), the weight estimation error is updated by

{\dot{\tilde{ϕ}}}_{i} = - l_{c i} \frac{θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} ({\tilde{ϕ}}_{i}^{T} θ_{i} - Φ_{i} + {\hat{η}}_{i}) .

(29)
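A discrete-time reading of the weight update (27) may help when implementing it. The sketch below applies one explicit Euler step; the step size `dt`, the function name `critic_step` and all numbers are assumptions for illustration. The estimate ${\hat{η}}_{i}$ is passed in as a given value, and its own update (28) is not sketched here.

```python
import numpy as np

def critic_step(phi_hat, theta, rho, eta_hat, l_c, dt):
    """One explicit Euler step of the normalized gradient law (27).

    rho is the measured integral reinforcement, so e_c = rho + phi_hat^T theta as in (25).
    """
    e_c = rho + phi_hat @ theta
    norm = (1.0 + theta @ theta) ** 2
    phi_next = phi_hat - dt * l_c * theta / norm * (e_c - eta_hat)
    return phi_next, e_c

# Illustrative numbers only; none of them come from the paper's simulation.
phi_hat = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
theta = np.array([-0.2, 0.1, -0.3, 0.05, 0.02])   # sigma(e(t + t_tau)) - sigma(e(t))
rho = 0.8
phi_next, e_before = critic_step(phi_hat, theta, rho, eta_hat=0.0, l_c=0.1, dt=0.01)
e_after = rho + phi_next @ theta                  # residual re-evaluated on the same data
```

With ${\hat{η}}_{i} = 0$ the step moves the weights along $- θ_{i}$ scaled by the residual, so re-evaluating the residual on the same data gives a smaller magnitude; the normalization ${(1 + θ_{i}^{T} θ_{i})}^{2}$ keeps the step bounded when $θ_{i}$ is large.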

Theorem 3.

Considering the nonlinear MAS with partially unknown dynamic as (1), the local neighborhood containment error dynamic as (4) and the critic NN with the modified updating laws (27) and (28), then ${\tilde{ϕ}}_{i}$ is guaranteed to be asymptotically stable.

Proof of Theorem 3.

Define ${\tilde{η}}_{i} = η_{i} - {\hat{η}}_{i}$ . Choose the Lyapunov function candidate as

$Ξ_{c i} = \frac{1}{2 l_{c i}} {\tilde{ϕ}}_{i}^{T} {\tilde{ϕ}}_{i} + \frac{1}{2 l_{s i}} {\tilde{η}}_{i}^{2} .$ (30)

According to (28), ${\tilde{η}}_{i}$ is updated by

${\dot{\tilde{η}}}_{i} = - l_{s i} \frac{{\tilde{ϕ}}_{i}^{T} θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} .$ (31)

Considering (29) and (31), the time derivative of (30) is

$\begin{matrix} {\dot{Ξ}}_{c i} = & \frac{1}{l_{c i}} {\tilde{ϕ}}_{i}^{T} {\dot{\tilde{ϕ}}}_{i} + \frac{1}{l_{s i}} {\tilde{η}}_{i} {\dot{\tilde{η}}}_{i} \\ = & - \frac{{\tilde{ϕ}}_{i}^{T} θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} ({\tilde{ϕ}}_{i}^{T} θ_{i} - Φ_{i} + {\hat{η}}_{i}) - \frac{{\tilde{ϕ}}_{i}^{T} θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} {\tilde{η}}_{i} \\ = & - {\tilde{ϕ}}_{i}^{T} Ψ_{i} {\tilde{ϕ}}_{i} + \frac{{\tilde{ϕ}}_{i}^{T} θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} (Φ_{i} - {\hat{η}}_{i} - {\tilde{η}}_{i}), \end{matrix}$ (32)

where $Ψ_{i} = θ_{i} θ_{i}^{T} / {(1 + θ_{i}^{T} θ_{i})}^{2}$ . According to Assumption 2, (32) is derived as

$\begin{matrix} {\dot{Ξ}}_{c i} \leq & - λ_{min} (Ψ_{i}) {∥ {\tilde{ϕ}}_{i} ∥}^{2} + \frac{{\tilde{ϕ}}_{i}^{T} θ_{i}}{{(1 + θ_{i}^{T} θ_{i})}^{2}} (∥ Φ_{i} ∥ - η_{i}) \\ \leq & - λ_{min} (Ψ_{i}) {∥ {\tilde{ϕ}}_{i} ∥}^{2} . \end{matrix}$

It indicates ${\dot{Ξ}}_{c i} \leq 0$ . Therefore, one can conclude that ${\tilde{ϕ}}_{i}$ is ensured to be asymptotically stable. □

Under the framework of the critic-only architecture, the IRL-based OCC scheme is presented. For each follower, the local neighborhood containment error (3) is established by communicating with its neighbors and the leaders. The value function of each follower is approximated by the critic NN (23), whose weights are tuned by a modified weight updating law (27). Based on (1), (3) and (23), the local OCC protocol (24) is obtained. The structural diagram of the developed IRL-based OCC scheme is shown in Figure 1.

Figure 1. Structural diagram of the developed IRL-based OCC scheme.

Remark 1.

In the actor–critic architecture, the optimal value function and the optimal control policy are approximated by a critic NN and an actor NN, respectively. In the critic-only architecture, by contrast, the optimal value function is approximated by a critic NN and the optimal control policy is obtained directly by combining (10) and (22). Hence, the critic-only architecture keeps the same performance as the actor–critic one while using a single critic NN only, which implies that the control structure is simplified and the computation burden is reduced.

3.4. Stability Analysis

Assumption 3.

$ϕ_{i}^{*}$ , ${\tilde{ϕ}}_{i}$ , $\nabla σ_{i} (\cdot)$ , $\nabla ω_{i} (\cdot)$ and $g_{i} (\cdot)$ are norm-bounded, i.e.,

$\begin{matrix} ∥ ϕ_{i}^{*} ∥ \leq ϕ_{i M}, ∥ {\tilde{ϕ}}_{i} ∥ \leq {\bar{ϕ}}_{i M}, ∥ \nabla σ_{i} (\cdot) ∥ \leq {\bar{σ}}_{i M}, ∥ \nabla ω_{i} (\cdot) ∥ \leq {\bar{ω}}_{i M}, ∥ g_{i} (\cdot) ∥ \leq {\bar{g}}_{i M}, \end{matrix}$

where $ϕ_{i M}$ , ${\bar{ϕ}}_{i M}$ , ${\bar{σ}}_{i M}$ , ${\bar{ω}}_{i M}$ and ${\bar{g}}_{i M}$ are positive constants.

Theorem 4.

Considering the nonlinear MAS with partially unknown dynamics as (1), the local neighborhood containment error dynamic as (4), the optimal value function as (8) and the critic NN which is updated by (27) and (28), the local containment control protocol (24) can guarantee the closed-loop containment error system (4) to be UUB.

Proof of Theorem 4.

The Lyapunov function candidate is chosen as

$Ξ_{i} = C_{i}^{*} (e_{i}) .$ (33)

Considering (20), (21) and Assumption 3, the time derivative of (33) corresponding to (4) is

$\begin{matrix} {\dot{Ξ}}_{i} = & {\dot{C}}_{i}^{*} (e_{i}) \\ = & \nabla C_{i}^{* T} (e_{i}) (F_{i} + c_{i} g_{i} (x_{i}) {\hat{μ}}_{i} + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) {\hat{μ}}_{p}) \\ = & \nabla C_{i}^{* T} (e_{i}) (c_{i} g_{i} (x_{i}) ({\hat{μ}}_{i} - μ_{i}^{*}) + \sum_{p \in N_{i}} a_{i p} g_{p} (x_{p}) ({\hat{μ}}_{p} - μ_{p}^{*})) - e_{i}^{T} Q_{i} e_{i} - \sum_{p \in {N_{i}, i}} μ_{p}^{* T} R_{i p} μ_{p}^{*} \\ \leq & ∥\nabla C_{i}^{* T} (e_{i})∥ (∥c_{i} g_{i} (x_{i}) ({\hat{μ}}_{i} - μ_{i}^{*})∥ + \sum_{p \in N_{i}} ∥a_{i p} g_{p} (x_{p}) ({\hat{μ}}_{p} - μ_{p}^{*})∥) - λ_{min} (Q_{i}) {∥ e_{i} ∥}^{2} \\ \leq & ({\bar{σ}}_{i M} ϕ_{i M} + {\bar{ω}}_{i M}) (c_{i} {\bar{g}}_{i M} ∥ {\hat{μ}}_{i} - μ_{i}^{*} ∥ + \sum_{p \in N_{i}} a_{i p} {\bar{g}}_{p M} ∥{\hat{μ}}_{p} - μ_{p}^{*}∥) - λ_{min} (Q_{i}) {∥ e_{i} ∥}^{2} . \end{matrix}$ (34)

Notice that

$\begin{matrix} ∥ {\hat{μ}}_{i} - μ_{i}^{*} ∥ = & ∥ - \frac{1}{2} R_{i i}^{- 1} c_{i} g_{i}^{T} (x_{i}) \nabla σ_{i}^{T} (e_{i}) {\hat{ϕ}}_{i} + \frac{1}{2} R_{i i}^{- 1} c_{i} g_{i}^{T} (x_{i}) (\nabla σ_{i}^{T} (e_{i}) ϕ_{i}^{*} + \nabla ω_{i} (e_{i})) ∥ \\ = & ∥ \frac{1}{2} R_{i i}^{- 1} c_{i} g_{i}^{T} (x_{i}) (\nabla σ_{i}^{T} (e_{i}) {\tilde{ϕ}}_{i} + \nabla ω_{i} (e_{i})) ∥ \\ \leq & \frac{c_{i} {\bar{g}}_{i M}}{2 ∥ R_{i i} ∥} ({\bar{σ}}_{i M} {\bar{ϕ}}_{i M} + {\bar{ω}}_{i M}) . \end{matrix}$

Then, (34) becomes

$\begin{matrix} {\dot{Ξ}}_{i} \leq & ({\bar{σ}}_{i M} ϕ_{i M} + {\bar{ω}}_{i M}) (\frac{c_{i}^{2} {\bar{g}}_{i M}^{2}}{2 ∥ R_{i i} ∥} ({\bar{σ}}_{i M} {\bar{ϕ}}_{i M} + {\bar{ω}}_{i M}) + \sum_{p \in N_{i}} \frac{c_{p} a_{i p} {\bar{g}}_{p M}^{2}}{2 ∥ R_{p p} ∥} ({\bar{σ}}_{p M} {\bar{ϕ}}_{p M} + {\bar{ω}}_{p M})) \\ - λ_{min} (Q_{i}) {∥ e_{i} ∥}^{2} . \end{matrix}$ (35)

Let $Π_{i 1} = \frac{c_{i}^{2} {\bar{g}}_{i M}^{2}}{2 ∥ R_{i i} ∥} ({\bar{σ}}_{i M} {\bar{ϕ}}_{i M} + {\bar{ω}}_{i M}) + \sum_{p \in N_{i}} \frac{c_{p} a_{i p} {\bar{g}}_{p M}^{2}}{2 ∥ R_{p p} ∥} ({\bar{σ}}_{p M} {\bar{ϕ}}_{p M} + {\bar{ω}}_{p M})$ . Thus, (35) becomes

$\begin{matrix} {\dot{Ξ}}_{i} \leq & \underbrace{({\bar{σ}}_{i M} ϕ_{i M} + {\bar{ω}}_{i M}) Π_{i 1}}_{Π_{i 2}} - λ_{min} (Q_{i}) {∥ e_{i} ∥}^{2} \\ = & Π_{i 2} - λ_{min} (Q_{i}) {∥ e_{i} ∥}^{2} . \end{matrix}$

It shows ${\dot{Ξ}}_{i} < 0$ if $e_{i}$ lies outside the compact set

$Ω_{e_{i}} = \{e_{i} : ∥ e_{i} ∥ \leq \sqrt{\frac{Π_{i 2}}{λ_{min} (Q_{i})}}\} .$

Therefore, the closed-loop containment error system (4) is UUB under the local containment control protocol (24). □
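The size of the residual set in Theorem 4 is easy to evaluate once the constants are known. The sketch below computes the radius $\sqrt{Π_{i 2} / λ_{min} (Q_{i})}$ ; the value of $Π_{i 2}$ is a placeholder, since the bounds in Assumption 3 are not quantified in the text, and the function name `ultimate_bound` is hypothetical.

```python
import math

def ultimate_bound(pi_i2: float, lambda_min_q: float) -> float:
    """Radius sqrt(Pi_i2 / lambda_min(Q_i)) of the residual set in Theorem 4."""
    if pi_i2 < 0.0 or lambda_min_q <= 0.0:
        raise ValueError("need Pi_i2 >= 0 and lambda_min(Q_i) > 0")
    return math.sqrt(pi_i2 / lambda_min_q)

# Pi_i2 = 0.2 is a placeholder; Q_i = 5*I_2 (as in Example 1) gives lambda_min = 5.
radius_small_q = ultimate_bound(0.2, 1.0)
radius_large_q = ultimate_bound(0.2, 5.0)
```

For a fixed $Π_{i 2}$ , a larger $λ_{min} (Q_{i})$ shrinks the set, and the set collapses to the origin as the approximation errors bounded in Assumption 3 vanish.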

Remark 2.

By Assumption 1, the nonlinear functions $f (x)$ and $g_{i} (x)$ are Lipschitz continuous on a compact set $Ω_{i}$ containing the origin, with $f (0) = 0$ . The developed control scheme is therefore guaranteed to be effective only within $Ω_{i}$ ; if the system states leave this compact set, the scheme might be invalid. In Theorem 4, we analyzed the system stability within this compact set via the Lyapunov direct method, which means that the closed-loop system is stable in $Ω_{i}$ under the developed IRL-based OCC scheme.

4. Simulation Study

This section provides two simulation examples to support the developed IRL-based OCC scheme.

4.1. Example 1

Consider a six-node network consisting of three leader nodes and three follower nodes. The directed topology of the graph is displayed in Figure 2.

As displayed in Figure 2, nodes 1–3 stand for leaders 1–3 and nodes 4–6 represent followers 1–3. In (3), the edge weights and pinning gains were set to 0.5. The node dynamics of the jth leader are described as ${\dot{r}}_{j} = \bar{A} r_{j}$ , where $r_{j} = {[r_{j 1}, r_{j 2}]}^{T} \in R^{2}$ represents the state vector, $j = 1, 2, 3$ and

$\bar{A} = \left[\begin{matrix} 0.1 & - 1 \\ 1 & - 0.1 \end{matrix}\right] .$

For the ith follower, the node dynamic is formulated as ${\dot{x}}_{i} = \bar{A} x_{i} + {\bar{B}}_{i} μ_{i}$ , where $x_{i} = {[x_{i 1}, x_{i 2}]}^{T} \in R^{2}$ and $μ_{i} \in R$ with $i = 1, 2, 3$ , ${\bar{B}}_{1} = {[- 1.5, 1]}^{T}$ , ${\bar{B}}_{2} = {[- 1, 1]}^{T}$ and ${\bar{B}}_{3} = {[- 1, - 0.5]}^{T}$ . The local neighborhood containment error vector $e_{i} = {[e_{i 1}, e_{i 2}]}^{T} \in R^{2}$ is calculated by (3).
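As a quick check on these dynamics, note that $\bar{A}$ has zero trace and determinant 0.99, so its eigenvalues are purely imaginary and the uncontrolled leaders move on closed orbits rather than decaying or diverging. The sketch below encodes the leader and follower models with the values stated above; the function names `leader_rhs` and `follower_rhs` are illustrative and do not come from the paper.

```python
import numpy as np

# Shared system matrix and follower input vectors of Example 1.
A_BAR = np.array([[0.1, -1.0],
                  [1.0, -0.1]])
B_BAR = {
    1: np.array([-1.5, 1.0]),
    2: np.array([-1.0, 1.0]),
    3: np.array([-1.0, -0.5]),
}


def leader_rhs(r: np.ndarray) -> np.ndarray:
    """Right-hand side of the leader dynamics r_dot = A_bar r."""
    return A_BAR @ r


def follower_rhs(x: np.ndarray, mu: float, i: int) -> np.ndarray:
    """Right-hand side of follower i: x_dot = A_bar x + B_bar_i mu."""
    return A_BAR @ x + B_BAR[i] * mu


# Eigenvalues solve lambda^2 + 0.99 = 0, i.e. lambda = +/- j*sqrt(0.99).
eigvals = np.linalg.eigvals(A_BAR)
print(eigvals)
```

Since the leaders never settle, the followers must keep tracking a moving convex hull, which is why the containment errors rather than the follower states are regulated to zero.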

In the simulation, $C_{i} (e_{i})$ was approximated by a critic NN with a 2–5–1 structure. The activation function was chosen as $σ_{i} (e_{i}) = {[e_{i 1}^{2}, e_{i 1} e_{i 2}, e_{i 2}^{2}, e_{i 1}^{2} e_{i 2}, e_{i 2}^{2} e_{i 1}]}^{T}$ . The initial states of the nodes were set to $x_{1} (0) = {[0.50, - 1.00]}^{T}$ , $x_{2} (0) = {[1.00, - 0.50]}^{T}$ , $x_{3} (0) = {[0.80, - 0.30]}^{T}$ , $r_{1} (0) = {[0.62, 0.83]}^{T}$ , $r_{2} (0) = {[0.45, 0.40]}^{T}$ and $r_{3} (0) = {[0.30, 0.22]}^{T}$ . The remaining parameters were chosen as $Q_{i} = 5 I_{2}$ , $R_{i p} = R_{i i} = 1$ , $l_{c i} = 0.1$ and $l_{s i} = 0.1$ .
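To make the critic structure concrete, the sketch below implements the five-term activation vector, its Jacobian, and the approximate control ${\hat{μ}}_{i} = - \frac{1}{2} R_{i i}^{- 1} c_{i} g_{i}^{T} (x_{i}) \nabla σ_{i}^{T} (e_{i}) {\hat{ϕ}}_{i}$ that appears in the proof of Theorem 4. It is a minimal illustration, not the authors' code: the function names, the weight vector `phi_hat` and the value `c_i = 1.0` are assumptions chosen only to produce a checkable number.

```python
import numpy as np


def sigma(e: np.ndarray) -> np.ndarray:
    """Activation vector [e1^2, e1*e2, e2^2, e1^2*e2, e2^2*e1]."""
    e1, e2 = e
    return np.array([e1**2, e1 * e2, e2**2, e1**2 * e2, e2**2 * e1])


def grad_sigma(e: np.ndarray) -> np.ndarray:
    """Jacobian of sigma with respect to e, shape (5, 2)."""
    e1, e2 = e
    return np.array([
        [2.0 * e1, 0.0],
        [e2, e1],
        [0.0, 2.0 * e2],
        [2.0 * e1 * e2, e1**2],
        [e2**2, 2.0 * e1 * e2],
    ])


def critic_value(e: np.ndarray, phi_hat: np.ndarray) -> float:
    """Approximate cost: phi_hat^T sigma(e)."""
    return float(phi_hat @ sigma(e))


def approx_control(e, phi_hat, g, c_i, R_ii=1.0) -> float:
    """mu_hat = -1/2 * R_ii^{-1} * c_i * g^T * grad_sigma(e)^T * phi_hat."""
    grad_C = grad_sigma(e).T @ phi_hat  # approximate cost gradient, shape (2,)
    return float(-0.5 * c_i * (g @ grad_C) / R_ii)


# Hypothetical inputs: unit weights and c_i = 1; g is B_bar_1 from Example 1.
e = np.array([1.0, 2.0])
phi_hat = np.ones(5)
g_1 = np.array([-1.5, 1.0])
mu_hat = approx_control(e, phi_hat, g_1, c_i=1.0)
print(mu_hat)  # 4.0
```

Only the input gain $g_{i}$ enters the control law, which is consistent with the scheme not requiring the drift dynamics.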

The simulation results obtained with the developed IRL-based OCC protocols are shown in Figure 3, Figure 4 and Figure 5. The evolution of the local neighborhood containment errors of the three followers is shown in Figure 3, which indicates that the errors were regulated to zero under the developed control protocols. Thus, containment control of the MAS was achieved. Figure 4 and Figure 5 depict the state curves of the leaders and the followers, where all followers moved into and stayed within the region formed by the envelope curves. This implies that satisfactory containment performance was obtained. The state curves of the followers and the leaders are displayed as a 2-D phase plane plot in Figure 6, and the region enveloped by the three leaders $υ_{1}, υ_{2}$ and $υ_{3}$ is shown at three different instants ( $t = 16.0 s, 20.3 s$ and $25.0 s$ ). We can observe from Figure 6 that the followers converged to the convex hull.

Figure 3. Local neighborhood containment errors $e_{i}$ .

Figure 4. Performance of containment control ( $r_{j 1}$ and $x_{i 1}$ ).

Figure 5. Performance of containment control ( $r_{j 2}$ and $x_{i 2}$ ).
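Whether a follower lies in the convex hull of three leaders at a given instant can be checked numerically. With three non-collinear leaders in the plane, the hull is a triangle, and a point belongs to it exactly when its barycentric coordinates are all nonnegative. The sketch below applies this test to the initial states listed for Example 1; the function name `in_leader_hull` is illustrative, and the test is a generic geometric check rather than part of the paper's method.

```python
import numpy as np


def in_leader_hull(p, v1, v2, v3, tol=1e-9) -> bool:
    """Return True if point p lies in the triangle spanned by v1, v2, v3."""
    p, v1, v2, v3 = (np.asarray(a, dtype=float) for a in (p, v1, v2, v3))
    # Solve p = l1*v1 + l2*v2 + l3*v3 with l1 + l2 + l3 = 1.
    T = np.column_stack((v1 - v3, v2 - v3))
    if abs(np.linalg.det(T)) < tol:
        raise ValueError("leaders are collinear, so the hull is not a triangle")
    l1, l2 = np.linalg.solve(T, p - v3)
    lam = np.array([l1, l2, 1.0 - l1 - l2])
    return bool(np.all(lam >= -tol))


# Initial leader and follower states of Example 1.
r1, r2, r3 = [0.62, 0.83], [0.45, 0.40], [0.30, 0.22]
x1_0 = [0.50, -1.00]

centroid = np.mean([r1, r2, r3], axis=0)
print(in_leader_hull(centroid, r1, r2, r3))  # True
print(in_leader_hull(x1_0, r1, r2, r3))      # False: follower 1 starts outside
```

Evaluating the same test along simulated trajectories gives a direct numerical counterpart to the phase plane plot.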

4.2. Example 2

Consider the nonlinear MAS consisting of three single-link robot arms and three leader nodes. Each robot arm has a rigid link attached via a gear train to a direct current motor [28]. The directed topology among these robot arms is shown in Figure 2. All edge weights and pinning gains were set to 1.

The state trajectories of the leaders are given by $r_{1} = {[0.6 \sin (t), 0.6 \cos (t)]}^{T}$ , $r_{2} = {[0.4 \sin (t + \frac{π}{6}), 0.4 \cos (t + \frac{π}{6})]}^{T}$ and $r_{3} = {[0.2 \sin (t - \frac{π}{6}), 0.2 \cos (t - \frac{π}{6})]}^{T}$ . The single-link robot arm for each follower can be described as

$J {\ddot{z}}_{i} + \bar{B} {\dot{z}}_{i} + \bar{M} g l \sin (z_{i}) = u_{i},$ (36)

where $J = 9\,\mathrm{kg \cdot m^{2}}$ , $\bar{B} = 30.5$ , $\bar{M} = 1\,\mathrm{kg}$ , $l = 1\,\mathrm{m}$ , $g = 9.8\,\mathrm{m / s^{2}}$ and $i = 1, 2, 3$ . The symbols in model (36) are defined in Table 1.

Table 1. Notations of the single-link robot arm.

Symbol	Description
$z_{i}$	Link angle
${\dot{z}}_{i}$	Angular velocity of the link
$J$	Total rotational inertia of the link and motor
$\bar{B}$	Overall damping coefficient
$\bar{M}$	Total mass of the link
$l$	Distance from the joint axis to the mass center of the link
$u_{i}$	Command generator


Define $x_{i} = {[x_{i 1}, x_{i 2}]}^{T} = {[z_{i}, {\dot{z}}_{i}]}^{T} \in R^{2}$ and $μ_{i} = u_{i}$ . For the ith follower, the model (36) can be rewritten as

$\left[\begin{matrix} {\dot{x}}_{i 1} \\ {\dot{x}}_{i 2} \end{matrix}\right] = \left[\begin{matrix} x_{i 2} \\ - \frac{\bar{M} g l}{J} \sin (x_{i 1}) - \frac{\bar{B}}{J} x_{i 2} \end{matrix}\right] + \left[\begin{matrix} 0 \\ \frac{1}{J} \end{matrix}\right] μ_{i} .$ (37)
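The state-space form (37) maps directly onto a drift term and an input gain, which is convenient for simulation. The sketch below encodes both with the parameter values given for (36) and advances the state with a classical fourth-order Runge-Kutta step. The function names, the zero-order hold on the input and the step size are assumptions made for illustration; the paper does not state which integrator was used. The drift function is needed here only to simulate the plant, not by the IRL-based controller itself.

```python
import numpy as np

# Parameters of the single-link robot arm (36).
J, B_BAR, M_BAR, L_LINK, G = 9.0, 30.5, 1.0, 1.0, 9.8


def drift(x: np.ndarray) -> np.ndarray:
    """Drift term of (37): [x2, -(M g l / J) sin(x1) - (B / J) x2]."""
    return np.array([
        x[1],
        -(M_BAR * G * L_LINK / J) * np.sin(x[0]) - (B_BAR / J) * x[1],
    ])


def input_gain() -> np.ndarray:
    """Input gain of (37): [0, 1/J]."""
    return np.array([0.0, 1.0 / J])


def rhs(x: np.ndarray, mu: float) -> np.ndarray:
    """Full right-hand side of (37)."""
    return drift(x) + input_gain() * mu


def rk4_step(x: np.ndarray, mu: float, dt: float) -> np.ndarray:
    """One classical RK4 step with the input held constant over dt."""
    k1 = rhs(x, mu)
    k2 = rhs(x + 0.5 * dt * k1, mu)
    k3 = rhs(x + 0.5 * dt * k2, mu)
    k4 = rhs(x + dt * k3, mu)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


# Unforced response over 10 s from a hypothetical initial state: damping
# drives the arm back toward the hanging equilibrium.
x_start = np.array([0.8, 0.1])
x = x_start.copy()
for _ in range(1000):
    x = rk4_step(x, mu=0.0, dt=0.01)
print(x)
```

A useful sanity check is the static holding torque: at $x_{i 1} = π / 2$ and $x_{i 2} = 0$, the input $μ_{i} = \bar{M} g l$ cancels gravity and the right-hand side vanishes.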

As in Example 1 (Section 4.1), the local neighborhood containment error vector was given as $e_{i} = {[e_{i 1}, e_{i 2}]}^{T} \in R^{2}$ .

The critic NN structure and the activation functions were the same as in Example 1 (Section 4.1). The critic NN weights were initialized with random values within $(0, 36)$ , and the initial states and control parameters were chosen as $r_{1} (0) = {[0, 0.6]}^{T}$ , $r_{2} (0) = {[0.4 \sin (\frac{π}{6}), 0.4 \cos (\frac{π}{6})]}^{T}$ , $r_{3} (0) = {[0.2 \sin (- \frac{π}{6}), 0.2 \cos (- \frac{π}{6})]}^{T}$ , $x_{1} (0) = {[0.8, 0.1]}^{T}$ , $x_{2} (0) = {[0.6, 0.5]}^{T}$ , $x_{3} (0) = {[0.7, - 0.3]}^{T}$ , $Q_{i p} = 18 I_{n}$ , $R_{i p} = 5$ , $t_{τ} = 0.1 s$ , $l_{c i} = 0.1$ and $l_{s i} = 0.1$ .
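The leader initial conditions listed above are simply the leader trajectories of this example evaluated at $t = 0$. The short sketch below generates the three trajectories and confirms this, and also shows that each leader moves on a circle of fixed radius (0.6, 0.4 and 0.2). The function name `leader_states` is illustrative and not taken from the paper.

```python
import numpy as np

AMPLITUDES = np.array([0.6, 0.4, 0.2])
PHASES = np.array([0.0, np.pi / 6.0, -np.pi / 6.0])


def leader_states(t: float) -> np.ndarray:
    """Return a (3, 2) array whose row j is r_j(t) = a_j [sin(t + p_j), cos(t + p_j)]."""
    angle = t + PHASES
    return AMPLITUDES[:, None] * np.column_stack((np.sin(angle), np.cos(angle)))


r0 = leader_states(0.0)
print(np.round(r0, 4))

# The norm of each leader state is constant in time.
radii = np.linalg.norm(leader_states(3.7), axis=1)
print(radii)
```

Because the three circles are concentric with different radii and phases, the triangle spanned by the leaders rotates about the origin, and the followers must track this rotating region.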

Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 show the simulation results. As depicted in Figure 7, the local neighborhood containment errors converged to a small region around zero, which shows that containment control of the nonlinear MAS was achieved. Figure 8 and Figure 9 show that the state trajectories of the single-link robot arms (36) entered and stayed within the region enveloped by the leader nodes as time progressed, which indicates the satisfactory performance of the developed scheme. The evolution curves of all agents are illustrated as a 2-D phase plane plot in Figure 10. The convex hull formed by the leaders $υ_{1}, υ_{2}$ and $υ_{3}$ contains the followers at the time instants $t = 5.0 s, 10.0 s, 14.5 s$ and $26.0 s$ , which implies that the followers converged to the convex hull. Figure 11 plots the containment control inputs, which illustrate the regulation process of the containment error system.

Figure 7. Local neighborhood containment errors of the three followers.

Figure 11. Containment control inputs of the three followers.

5. Conclusions

This paper investigated the OCC problem of nonlinear MASs with partially unknown dynamics via the IRL method. Based on the IRL method, the integral Bellman equation was constructed to relax the requirement of knowing the drift dynamics. The convergence of the proposed control algorithm was guaranteed by analyzing the convergence of IRL. With the aid of the universal approximation capability of NNs, the solution of the HJBE was obtained by a critic NN with a modified weight-updating law, which guaranteed the asymptotic stability of the weight error dynamics. Using the Lyapunov stability theorem, we showed that the closed-loop containment error system was UUB. The simulation results of two examples illustrated the effectiveness of the proposed IRL-based OCC scheme. In the considered MASs, information was exchanged among the agents over a communication network, which is vulnerable to security issues such as attacks and packet dropouts. Our future work will focus on developing a distributed resilient containment control scheme for MASs subject to attacks and packet dropouts.

Acknowledgments

We appreciate all the authors for their contributions and the support of the foundation.

Author Contributions

Conceptualization, Q.W. and Y.W. (Yonghua Wang); methodology, Q.W.; software, Q.W.; investigation, Q.W.; writing—original draft preparation, Q.W.; writing—review and editing, Y.W. (Yongheng Wu); visualization, Y.W. (Yongheng Wu); supervision, Y.W. (Yonghua Wang); funding acquisition, Y.W. (Yonghua Wang). All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This research was supported by Open Research Fund of The State Key Laboratory for Management and Control of Complex Systems under grant no. 20220118.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

1. Jimenez A.F., Cardenas P.F., Jimenez F. Intelligent IoT-multiagent precision irrigation approach for improving water use efficiency in irrigation systems at farm and district scales. Comp. Electr. Agric. 2022;192:106635. doi: 10.1016/j.compag.2021.106635.
2. Vallejo D., Castro-Schez J., Glez-Morcillo C., Albusac J. Multi-agent architecture for information retrieval and intelligent monitoring by UAVs in known environments affected by catastrophes. Eng. Appl. Artif. Intell. 2020;87:103243. doi: 10.1016/j.engappai.2019.103243.
3. Liu Y., Wang Y., Li Y., Gooi H.B., Xin H. Multi-agent based optimal scheduling and trading for multi-microgrids integrated with urban transportation networks. IEEE Trans. Power. Syst. 2021;36:2197–2210. doi: 10.1109/TPWRS.2020.3040310.
4. Deng Q., Peng Y., Qu D., Han T., Zhan X. Neuro-adaptive containment control of unmanned surface vehicles with disturbance observer and collision-free. ISA Trans. 2022;129:150–156. doi: 10.1016/j.isatra.2022.01.004.
5. Hamani N., Jamont J.P., Occello M., Ben-Yelles C.B., Lagreze A., Koudil M. A multi-cooperative-based approach to manage communication in wireless instrumentation systems. IEEE Syst. J. 2018;12:2174–2185. doi: 10.1109/JSYST.2017.2721220.
6. Ren W., Beard R.W. Consensus seeking in multiagent systems under dynamically changing interaction topologies. IEEE Trans. Autom. Control. 2005;50:655–661. doi: 10.1109/TAC.2005.846556.
7. Luo K., Guan Z.H., Cai C.X., Zhang D.X., Lai Q., Xiao J.W. Coordination of nonholonomic mobile robots for diffusive threat defense. J. Frankl. Inst. 2019;356:4690–4715. doi: 10.1016/j.jfranklin.2019.03.014.
8. Yu Z., Liu Z., Zhang Y., Qu Y., Su C.Y. Distributed finite-time fault-tolerant containment control for multiple unmanned aerial vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2020;31:2077–2091. doi: 10.1109/TNNLS.2019.2927887.
9. Li Y., Qu F., Tong S. Observer-based fuzzy adaptive finite-time containment control of nonlinear multiagent systems with input delay. IEEE Trans. Cybern. 2021;51:126–137. doi: 10.1109/TCYB.2020.2970454.
10. Li Z., Xue H., Pan Y., Liang H. Distributed adaptive event-triggered containment control for multi-agent systems under a funnel function. Int. J. Robust Nonlinear Control. 2022. doi: 10.1002/rnc.6344.
11. Li Y., Liu M., Lian J., Guo Y. Collaborative optimal formation control for heterogeneous multi-agent systems. Entropy. 2022;24:1440. doi: 10.3390/e24101440.
12. Zhao L., Yu J., Shi P. Command filtered backstepping-based attitude containment control for spacecraft formation. IEEE Trans. Syst. Man Cybern. Syst. 2021;51:1278–1287. doi: 10.1109/TSMC.2019.2896614.
13. Liu D., Wei Q., Wang D., Yang X., Li H. Adaptive Dynamic Programming with Applications in Optimal Control. Springer; Cham, Switzerland: 2017.
14. Bellman R.E. Dynamic Programming. Princeton Univ. Press; Trenton, NJ, USA: 1957.
15. Abu-Khalaf M., Lewis F.L. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica. 2005;41:779–791. doi: 10.1016/j.automatica.2004.11.034.
16. Liu D., Xue S., Zhao B., Luo B., Wei Q. Adaptive dynamic programming for control: A survey and recent advances. IEEE Trans. Syst. Man. Cybern. Syst. 2021;51:142–160. doi: 10.1109/TSMC.2020.3042876.
17. Vamvoudakis K.G., Lewis F.L., Hudas G.R. Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality. Automatica. 2012;48:1598–1611. doi: 10.1016/j.automatica.2012.05.074.
18. Zhang H., Zhang J., Yang G., Luo Y. Leader-based optimal coordination control for the consensus problem of multiagent differential games via fuzzy adaptive dynamic programming. IEEE Trans. Fuzzy. Syst. 2015;23:152–163. doi: 10.1109/TFUZZ.2014.2310238.
19. Zhao W., Zhang H. Distributed optimal coordination control for nonlinear multi-agent systems using event-triggered adaptive dynamic programming method. ISA Trans. 2019;91:184–195. doi: 10.1016/j.isatra.2019.01.021.
20. Cui J., Pan Y., Xue H., Tan L. Simplified optimized finite-time containment control for a class of multi-agent systems with actuator faults. Nonlinear Dyn. 2022;109:2799–2816. doi: 10.1007/s11071-022-07586-1.
21. Xu J., Wang L., Liu Y., Xue H. Event-triggered optimal containment control for multi-agent systems subject to state constraints via reinforcement learning. Nonlinear Dyn. 2022;109:1651–1670. doi: 10.1007/s11071-022-07513-4.
22. Xiao W., Zhou Q., Liu Y., Li H., Lu R. Distributed reinforcement learning containment control for multiple nonholonomic mobile robots. IEEE Trans. Circuits Syst. I Reg. Papers. 2022;69:896–907. doi: 10.1109/TCSI.2021.3121809.
23. Chen C., Lewis F.L., Xie K., Xie S., Liu Y. Off-policy learning for adaptive optimal output synchronization of heterogeneous multi-agent systems. Automatica. 2020;119:109081. doi: 10.1016/j.automatica.2020.109081.
24. Yu D., Ge S.S., Li D., Wang P. Finite-horizon robust formation-containment control of multi-agent networks with unknown dynamics. Neurocomputing. 2021;458:403–415. doi: 10.1016/j.neucom.2021.01.063.
25. Zuo S., Song Y., Lewis F.L., Davoudi A. Optimal robust output containment of unknown heterogeneous multiagent system using off-policy reinforcement learning. IEEE Trans. Cybern. 2018;48:3197–3207. doi: 10.1109/TCYB.2017.2761878.
26. Mazouchi M., Naghibi-Sistani M.B., Hosseini Sani S.K., Tatari F., Modares H. Observer-based adaptive optimal output containment control problem of linear heterogeneous Multiagent systems with relative output measurements. Int. J. Adapt. Control Signal Process. 2019;33:262–284. doi: 10.1002/acs.2950.
27. Yang Y., Modares H., Wunsch D.C., Yin Y. Optimal containment control of unknown heterogeneous systems with active leaders. IEEE Trans. Control Syst. Technol. 2019;27:1228–1236. doi: 10.1109/TCST.2018.2794336.
28. Zhang H., Lewis F.L., Qu Z. Lyapunov, adaptive, and optimal design techniques for cooperative systems on directed communication graphs. IEEE Trans. Ind. Electron. 2012;59:3026–3041. doi: 10.1109/TIE.2011.2160140.
