Adaptive Optimal Control of Hybrid Electric Vehicle Power Battery via Policy Learning

Qinglin Zhu; Huanli Sun; Ziliang Zhao; Yixin Liu; Jun Zhao

doi:10.1155/2023/8288527

2023 May 29; 2023:8288527.

PMCID: PMC10241567 PMID: 37284055

Abstract

An online policy learning algorithm is used to solve the optimal control problem of the power battery state of charge (SOC) observer for the first time. The design of adaptive neural network (NN) optimal control is studied for the nonlinear power battery system based on a second-order resistance capacitance (RC) equivalent circuit model. First, the unknown uncertainties of the system are approximated by an NN, and a time-varying gain nonlinear state observer is designed to address the problem that the RC voltages and the SOC of the battery cannot be measured. Then, to realize the optimal control, a policy learning-based online algorithm is designed, in which only the critic NN is required and the actor NN widely used in most optimal control designs is removed. Finally, the effectiveness of the proposed optimal control method is verified by simulation.

1. Introduction

Nowadays, electric vehicles are developing at a high speed [1]. The power battery provides the high power required for vehicle start-stop, acceleration, deceleration, and other transient conditions, and controlling its charging and discharging power greatly improves the service life of fuel cells [1, 2]. Because the power battery is an important energy storage part of fuel-cell hybrid vehicles, research on it is of far-reaching significance. The state of charge (SOC) of the battery is one of the important parameters of the battery management system (BMS), but the SOC cannot be directly measured by the on-board sensors. Therefore, SOC estimation is a very important problem in both theory and application. Moreover, the power battery is a highly complex nonlinear system in its working state, which greatly increases the difficulty of estimation [3].

To meet the requirements of accurate, fast, and real-time estimation of the power battery SOC under different conditions, scholars have made considerable progress. In [4], the authors proposed an observer-based control method for a class of one-sided (unilateral) Lipschitz nonlinear systems with time-varying parameter uncertainties and norm-bounded disturbances. For the state-space equation of the equivalent circuit model, a power battery SOC estimation method based on a nonlinear observer is proposed in [5]. The authors in [6] introduced the second-order resistance capacitance (RC) model of the battery pack. Under the one-sided Lipschitz condition, a nonlinear observer based on the H∞ method is designed, but whether the optimal performance of the observer can be guaranteed remains to be verified. For the optimal control design of observers, an adaptive neural network backstepping recursive optimal control method was proposed for nonlinear strict-feedback systems with state constraints [7]. The neural network (NN) state identification is used to approximate the unknown nonlinear dynamics, and under the actor-critic structure, the virtual and actual optimal controllers are constructed through the backstepping recursive control algorithm. Because the adaptive laws of the actor-critic structure are generated from the square of the Bellman residual error by the gradient descent method, these methods are complex and difficult to implement. In this regard, the authors in [8] proposed an optimal control method based on reinforcement learning (RL) for a class of nonlinear strict-feedback systems with unknown dynamic functions. This method eliminates the persistent excitation assumption necessary for most RL-based adaptive optimal control methods. On this basis, the adaptive NN output-feedback optimal control problem for a class of strict-feedback nonlinear systems with unknown internal dynamics, input saturation, and state constraints is studied in [9]. In [10, 11], the authors proposed novel optimal control algorithms based on advanced AI techniques, which further promote the development of optimal control theory.

Inspired by the abovementioned research results, a nonlinear observer with time-varying gain is designed in this paper. Based on the one-sided (unilateral) Lipschitz condition, the nonlinear dynamic problem contained in the system output is solved. The internal unknown dynamic function is approximated by an NN to estimate the SOC and the RC voltages of the power battery. Then, based on the estimated system states, we develop a policy learning-based optimal control, under which the weight estimation error converges to zero. Finally, the simulation results show the effectiveness of the proposed method.

The innovations of this paper are summarized as follows:

(1) The optimal control method based on a critic NN is used to solve the optimal control problem of the power battery SOC observer for the first time.
(2) Only one critic NN is used to ensure the convergence of the NN weights; thus, the actor NN widely used in most optimal control designs [12–14] is removed.
(3) Unlike existing optimal control methods that assume a known state, the battery state in this paper is unknown, which leads to a more complex optimal control problem.

2. System Modeling

In this paper, we consider the second-order RC equivalent circuit model shown in Figure 1 [15], where U_oc is the open-circuit voltage (OCV) with respect to SOC, I_T represents the current, U_T denotes the terminal voltage, R₀ is the ohmic resistance, R₁ and R₂ are the electrochemical polarization resistance and the concentration polarization resistance, respectively, and C₁ and C₂ are the corresponding capacitances. U₁ and U₂ denote the voltages across the electrochemical polarization capacitor C₁ and the concentration polarization capacitor C₂, respectively.

Figure 1: The schematic diagram of the second-order RC model.

Then, based on Kirchhoff's voltage law, the state equation of the circuit in Figure 1 can be given as

$\begin{cases} \dot{U}_{1} = -\frac{1}{R_{1} C_{1}} U_{1} + \frac{1}{C_{1}} I_{T}, \\ \dot{U}_{2} = -\frac{1}{R_{2} C_{2}} U_{2} + \frac{1}{C_{2}} I_{T}, \\ \dot{\mathrm{SOC}} = -\frac{1}{Q_{n}} I_{T}, \end{cases}$ (1)

where Q_n is the nominal capacity of the battery.

Then, its output equation can be defined as

$U_{T} = U_{oc}(\mathrm{SOC}) - R_{0} I_{T} - U_{1} - U_{2},$ (2)

where 0 ≤ SOC ≤ 1, and U_oc(SOC) is a nonlinear, monotonically increasing function of SOC.

Based on (1) and (2), we can obtain the state-space equation as follows:

$\begin{cases} \dot{x} = A x + B u, \quad x(0) = x_{0}, \\ y = g(x) + C x - R_{0} u, \end{cases}$ (3)

where $x = [U_{1} \;\; U_{2} \;\; \mathrm{SOC}]^{T} \in ℝ^{3}$, y = U_T ∈ ℝ, u = I_T ∈ ℝ, g(x) = U_oc(SOC) ∈ ℝ, x₀ is the initial state, and

$A = \begin{bmatrix} -\frac{1}{R_{1} C_{1}} & 0 & 0 \\ 0 & -\frac{1}{R_{2} C_{2}} & 0 \\ 0 & 0 & 0 \end{bmatrix} \in ℝ^{3 \times 3}, \quad B = \begin{bmatrix} \frac{1}{C_{1}} & \frac{1}{C_{2}} & -\frac{1}{Q_{n}} \end{bmatrix}^{T} \in ℝ^{3}, \quad C = \begin{bmatrix} -1 & -1 & 0 \end{bmatrix} \in ℝ^{1 \times 3}.$ (4)
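As a minimal sketch of how (3) and (4) can be assembled numerically, the snippet below builds A, B, and C from the parameter values quoted in Section 4 and integrates the linear part of the model with a forward-Euler step. The conversion of Q_n from A·h to A·s, the function names, the step size, and the constant-current test input are assumptions made for illustration only.

```python
import numpy as np

# Parameter values quoted in Section 4 of this paper (SI units).
R0, R1, R2 = 10.822e-3, 3.103e-3, 2.611e-3   # ohm
C1, C2 = 8.4379e3, 91.401e3                  # farad
QN_AH = 45.0                                 # ampere-hours
# Assumption: convert A·h to A·s so that d(SOC)/dt is expressed per second.
QN = QN_AH * 3600.0


def battery_matrices():
    """Return (A, B, C) of the second-order RC model, cf. (3) and (4)."""
    A = np.diag([-1.0 / (R1 * C1), -1.0 / (R2 * C2), 0.0])
    B = np.array([[1.0 / C1], [1.0 / C2], [-1.0 / QN]])
    C = np.array([[-1.0, -1.0, 0.0]])
    return A, B, C


def simulate(x0, current, dt, steps):
    """Forward-Euler integration of x' = A x + B u for a constant current."""
    A, B, _ = battery_matrices()
    x = np.asarray(x0, dtype=float).reshape(3, 1)
    for _ in range(steps):
        x = x + dt * (A @ x + B * current)
    return x.ravel()


A, B, C = battery_matrices()
# Hypothetical test: discharge a full battery at 45 A for 360 s.
x_end = simulate([0.0, 0.0, 1.0], current=45.0, dt=1.0, steps=360)
```

Because the third row of A is zero, the SOC state is a pure integrator of the current, while U₁ and U₂ relax toward R₁I_T and R₂I_T with time constants R₁C₁ and R₂C₂.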

As the power battery is a highly complex nonlinear system in its working state, there are many unknown uncertainties such as ambient temperature, battery self-discharge, battery life, and cycle interval. Therefore, the state space expression (3) can be expressed as follows:

$\begin{cases} \dot{x} = A x + B u + d(x), \quad x(0) = x_{0}, \\ y = g(x) + C x - R_{0} u, \end{cases}$ (5)

where d(x) represents nonlinear characteristics.

Assumption 1.

In this paper, we assume that (A, B) is stabilizable and (A, C) is detectable. The nonlinear term d(x) is continuous and bounded.

Control objective: for the second-order RC equivalent model of the power battery, design an optimal controller, based on an adaptive observer and a policy learning algorithm, that guarantees that all signals of the closed-loop system are uniformly ultimately bounded (UUB).

According to the second-order RC model of the power battery, we can derive its state space (3) or (5); then, we should design the control law u for the derived state space equation. Thus, we will use the NN observer and the policy learning algorithm to design the control law u.

3. Optimal Control of Power Battery

3.1. Observer Design via NN

In this section, an observer is designed to estimate the battery voltage and SOC. To this end, we assume that the unknown term d(x) can be represented by an NN as

$d(x) = W_{1}^{T} σ(x) + ε(x),$ (6)

where W₁ ∈ ℝ^N are the ideal NN weights, σ(x): ℝⁿ ⟶ ℝ^N is the activation function, and ε(x) ∈ ℝ denotes the NN error.

Since the function d(x) is unknown but continuous, its estimate is

$\hat{d}(x) = \hat{W}_{1}^{T} σ(x),$ (7)

where $\hat{W}_{1}$ is the estimate of W₁.

Then, based on (5) and (7), the observer can be designed as

$\begin{cases} \dot{\hat{x}} = A \hat{x} + B u + \hat{W}_{1}^{T} σ(\hat{x}) + L \left[\frac{\partial g}{\partial x}\right]_{x = \hat{x}}^{T} (y - \hat{y}), \\ \hat{y} = C \hat{x} + g(\hat{x}) - R_{0} u, \end{cases}$ (8)

where $\hat{x}$ is the estimate of x, L = P⁻¹ ∈ ℝ^{3×3} is the observer gain matrix, P is a positive definite matrix, and $\hat{y}$ is the estimate of y.
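To make the structure of observer (8) concrete, the following sketch evaluates its right-hand side for one estimate. The linear OCV curve `ocv`, the gain `L`, the `tanh` activation, and the weight shape (N, 3), chosen so that the NN term lies in ℝ³, are hypothetical placeholders rather than values from this paper.

```python
import numpy as np


def ocv(soc):
    """Hypothetical open-circuit-voltage curve (monotonically increasing)."""
    return 3.2 + 0.9 * soc


def docv(soc):
    """Derivative of the hypothetical OCV curve with respect to SOC."""
    return 0.9


def observer_rhs(x_hat, u, y, A, B, C, R0, L, W1_hat, sigma):
    """Right-hand side of observer (8) for a state estimate x_hat in R^3."""
    y_hat = float(C @ x_hat) + ocv(x_hat[2]) - R0 * u
    dg_dx = np.array([0.0, 0.0, docv(x_hat[2])])   # gradient of g at x_hat
    nn_term = W1_hat.T @ sigma(x_hat)              # NN estimate of d(x)
    return A @ x_hat + B * u + nn_term + (L @ dg_dx) * (y - y_hat)


R0 = 10.822e-3
A = np.diag([-1.0 / (3.103e-3 * 8.4379e3), -1.0 / (2.611e-3 * 91.401e3), 0.0])
B = np.array([1.0 / 8.4379e3, 1.0 / 91.401e3, -1.0 / (45.0 * 3600.0)])
C = np.array([-1.0, -1.0, 0.0])
L = 0.05 * np.eye(3)          # hypothetical gain, not the value used in the paper
sigma = np.tanh               # hypothetical activation with N = 3
W1_hat = np.zeros((3, 3))     # untrained weights

x_true = np.array([0.1, 0.2, 0.8])
x_hat = np.array([0.1, 0.2, 0.7])   # SOC is underestimated
u = 10.0
y = float(C @ x_true) + ocv(x_true[2]) - R0 * u
dx_hat = observer_rhs(x_hat, u, y, A, B, C, R0, L, W1_hat, sigma)
```

In this example the measured voltage exceeds the predicted one, so the output-injection term pushes the SOC estimate upward, which is the intended corrective behavior.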

We define the observation error

$\tilde{x} = x - \hat{x}.$ (9)

Then, from (5) and (8), we can obtain the observation error dynamic equation as

$\dot{\tilde{x}} = \left[A - L \left(\frac{\partial g}{\partial x}\right)^{T} C\right] \tilde{x} - L \left(\frac{\partial g}{\partial x}\right)_{x = \hat{x}}^{T} \tilde{g} + W_{1}^{T} (σ(x) - σ(\hat{x})) + \tilde{W}_{1} σ(\hat{x}) + ε,$ (10)

where $\tilde{g} = g(x) - g(\hat{x}) = \left.\frac{\partial g}{\partial x}\right|_{x = ξ} (x - \hat{x})$, and $\tilde{W}_{1} = \hat{W}_{1} - W_{1}$ is the NN weight error.

Lemma 2.

For system (5) with the designed observer (8), let the NN weights $\hat{W}_{1}$ be updated by the adaptive law

$\dot{\hat{W}}_{1} = -σ(\hat{x}) \tilde{x}^{T} P.$ (11)

Then, the errors $\tilde{x}$ and $\tilde{W}_{1}$ are UUB.

Proof.

Consider the Lyapunov function

$V_{1} = \frac{1}{2} \tilde{x}^{T} P \tilde{x} + \frac{1}{2} \mathrm{tr}(\tilde{W}_{1}^{T} \tilde{W}_{1}).$ (12)

From [15], we have $\left[\partial g / \partial x\right]_{x = \hat{x}}^{T} = [0, 0, \dot{U}_{oc}(\widehat{\mathrm{SOC}})]$ with $α_{\min} \leq \dot{U}_{oc}(\widehat{\mathrm{SOC}}) \leq α_{\max}$, where α_min and α_max are the minimum and maximum values of the derivative $\dot{U}_{oc}$, respectively. Then, differentiating (12) with respect to time gives

$\begin{matrix} \dot{V}_{1} \leq \frac{1}{2} \tilde{x}^{T} [P A + A^{T} P - R M C - C^{T} (R M)^{T} - 2 Q] \tilde{x} \\ + \tilde{x}^{T} P W_{1}^{T} (σ(x) - σ(\hat{x})) + \tilde{x}^{T} P \tilde{W}_{1} σ(\hat{x}) + \tilde{x}^{T} P ε + \frac{1}{2} \mathrm{tr}(\dot{\tilde{W}}_{1}^{T} \tilde{W}_{1} + \tilde{W}_{1}^{T} \dot{\tilde{W}}_{1}), \end{matrix}$ (13)

where $M = [m_{1}, m_{2}, m_{3}]^{T} \in ℝ^{3}$.

According to the one-sided (unilateral) Lipschitz condition [9], the following inequalities can be obtained:

$\tilde{x}^{T} P ε \leq \frac{1}{2} ‖\tilde{x}‖^{2} + \frac{1}{2} ‖P‖^{2} \sum_{i = 1}^{3} ε_{i}^{*2},$ (14)

$\tilde{x}^{T} P W_{1}^{T} (σ(x) - σ(\hat{x})) \leq ‖\tilde{x}‖^{2} + ‖P‖^{2} ‖W_{1}‖^{2}.$ (15)

Substituting (14) and (15) into (13), and using tr(ab^T) = tr(b^Ta) = b^Ta, we have

$\begin{matrix} {\dot{V}}_{1} \leq \frac{1}{2} {\tilde{x}}^{T} [P A + A^{T} P - R M C - C^{T} {(R M)}^{T} - 2 Q] \tilde{x} + {‖\tilde{x}‖}^{2} + {‖P‖}^{2} {‖W_{1}‖}^{2} + \frac{1}{2} {‖\tilde{x}‖}^{2} \\ + \frac{1}{2} {‖P‖}^{2} \sum_{i = 1}^{3} ε_{i}^{* 2} + t r ({\tilde{W}}_{1}^{T} σ (\hat{x}) {\tilde{x}}^{T} P + {\tilde{W}}_{1}^{T} {\dot{\tilde{W}}}_{1}) . \end{matrix}$ (16)

Based on [8], let PA+A^TP − RMC − C^T(RM)^T − 2Q=−Ψ, where $Q = [\begin{matrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & α_{\min}^{2} \end{matrix}]$ ; thus, (16) can be further written as

$\begin{matrix} {\dot{V}}_{1} \leq - a_{0} {‖\tilde{x}‖}^{2} + \frac{1}{2} {‖P‖}^{2} {‖{\tilde{W}}_{1}‖}^{2} + D_{0}, \end{matrix}$ (17)

where $a_{0} = λ_{\min}(Ψ) - 3/2$ and $D_{0} = ‖P‖^{2} ‖W_{1}‖^{2} + \frac{1}{2} ‖P‖^{2} \sum_{i = 1}^{3} ε_{i}^{*2}$.

If $\hat{d}(x) ⟶ d(x)$, then the term $\frac{1}{2} ‖P‖^{2} ‖\tilde{W}_{1}‖^{2} + D_{0}$ in (17) can converge to zero. Moreover, by selecting an appropriate matrix Ψ, λ_min(Ψ) can be made relatively large. According to (17), the observation error then converges to a small neighborhood of the origin.

3.2. Optimal Control Design Based on the Observer

3.2.1. Online Policy Learning Algorithm

In this section, based on critic NN, we construct the policy learning law. Thus, system (8) can be rewritten as

$\dot{\hat{x}} = F(\hat{x}) + B u,$ (18)

where $F(\hat{x}) = A \hat{x} + \hat{W}_{1}^{T} σ(\hat{x}) + L \left[\partial g / \partial x\right]_{x = \hat{x}}^{T} (y - \hat{y})$, and L is the observer gain matrix defined in (8).

To realize the optimal control, we first define the cost function as

$V(\hat{x}, u) = \int_{0}^{\infty} r(\hat{x}, u) \, ds,$ (19)

where $r(\hat{x}, u) = \hat{x}^{T} Q_{s} \hat{x} + u^{T} R_{s} u$ is the utility function, and Q_s ∈ ℝ^{3×3} and R_s ∈ ℝ are weighting matrices of proper dimensions.

We define the Hamiltonian function of the optimal control problem and the optimal cost function as

$H(\hat{x}, u, \nabla V(\hat{x})) = r(\hat{x}, u) + (\nabla V(\hat{x}))^{T} (F(\hat{x}) + B u),$ (20)

$V^{*}(\hat{x}) = \min_{u} \int_{0}^{\infty} r(\hat{x}, u) \, ds.$ (21)

The optimal cost function $V^{*}(\hat{x})$ is the solution of the following HJB equation:

$0 = \min_{u} H(\hat{x}, u, \nabla V^{*}(\hat{x})).$ (22)

With $\nabla V^{*}(\hat{x}) = \partial V^{*}(\hat{x}) / \partial \hat{x}$, we can obtain the optimal control action as

$u^{*} = -\frac{1}{2} R_{s}^{-1} B^{T} \nabla V^{*}(\hat{x}),$ (23)

and the HJB equation in terms of $\nabla V^{*}(\hat{x})$ as

$0 = \hat{x}^{T} Q_{s} \hat{x} + (\nabla V^{*}(\hat{x}))^{T} F(\hat{x}) - \frac{1}{4} (\nabla V^{*}(\hat{x}))^{T} B R_{s}^{-1} B^{T} \nabla V^{*}(\hat{x}),$ (24)

with V^∗(0)=0.
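As a sanity check on (23) and (24), consider a special case assumed only for illustration (it is not the battery model): F(x̂) = A x̂ is linear and the quadratic candidate V*(x̂) = x̂ᵀ P_s x̂ is used, where P_s is a symmetric positive definite matrix introduced here for the example. Under these assumptions, the HJB equation reduces to the familiar algebraic Riccati equation:

```latex
\begin{aligned}
\nabla V^{*}(\hat{x}) &= 2 P_{s} \hat{x}, \qquad
u^{*} = -\tfrac{1}{2} R_{s}^{-1} B^{T} (2 P_{s} \hat{x}) = -R_{s}^{-1} B^{T} P_{s} \hat{x}, \\
0 &= \hat{x}^{T} Q_{s} \hat{x} + 2 \hat{x}^{T} P_{s} A \hat{x}
     - \hat{x}^{T} P_{s} B R_{s}^{-1} B^{T} P_{s} \hat{x} \\
  &= \hat{x}^{T} \left( Q_{s} + P_{s} A + A^{T} P_{s}
     - P_{s} B R_{s}^{-1} B^{T} P_{s} \right) \hat{x}
  \quad \text{for all } \hat{x} \\
\Longrightarrow \quad 0 &= A^{T} P_{s} + P_{s} A + Q_{s} - P_{s} B R_{s}^{-1} B^{T} P_{s}.
\end{aligned}
```

For the nonlinear F(x̂) of (18), no such closed form is available in general, which motivates the iterative procedure and the NN approximation that follow.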

To realize the policy learning, the following iteration procedure can be used:

(1) Select a small positive number τ. Set i = 0 and V⁽⁰⁾ = 0, and then give an initial admissible control u⁽⁰⁾.
(2) Using the control u⁽ⁱ⁾, solve
$0 = r(\hat{x}, u^{(i)}) + (\nabla V^{(i + 1)}(\hat{x}))^{T} (F(\hat{x}) + B u^{(i)}),$ (25)
with V⁽ⁱ⁺¹⁾(0) = 0.
(3) Update the control action using
$u^{(i + 1)} = -\frac{1}{2} R_{s}^{-1} B^{T} \nabla V^{(i + 1)}(\hat{x}).$ (26)
(4) If $‖V^{(i + 1)}(\hat{x}) - V^{(i)}(\hat{x})‖ \leq τ$, stop and apply the resulting control; otherwise, let i = i + 1 and go back to step (2).

This algorithm converges to the optimal control and the optimal cost function as i ⟶ ∞; for the convergence analysis, see [16, 17].
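The iteration in steps (1) to (4) can be sketched for the linear-quadratic special case, where V⁽ⁱ⁾ = xᵀP⁽ⁱ⁾x and u⁽ⁱ⁾ = −K⁽ⁱ⁾x, so that the evaluation step (25) becomes a Lyapunov equation and the improvement step (26) becomes K = R_s⁻¹BᵀP. The two-state system below is hypothetical and is not the battery model; the helper names and tolerances are likewise assumptions.

```python
import numpy as np


def solve_lyapunov(Ac, M):
    """Solve Ac^T P + P Ac + M = 0 for P via a Kronecker-product linear system."""
    n = Ac.shape[0]
    I = np.eye(n)
    # With row-major vec: vec(Ac^T P) = kron(Ac^T, I) vec(P),
    # and vec(P Ac) = kron(I, Ac^T) vec(P).
    K = np.kron(Ac.T, I) + np.kron(I, Ac.T)
    P = np.linalg.solve(K, -M.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)


def policy_iteration(A, B, Qs, Rs, K0, tol=1e-10, max_iter=50):
    """Alternate policy evaluation, cf. (25), and policy improvement, cf. (26)."""
    K = K0
    P_prev = np.zeros_like(A)
    for i in range(max_iter):
        Ac = A - B @ K                               # closed loop under u = -K x
        P = solve_lyapunov(Ac, Qs + K.T @ Rs @ K)    # policy evaluation
        K = np.linalg.solve(Rs, B.T @ P)             # policy improvement
        if np.linalg.norm(P - P_prev) <= tol:        # stopping rule of step (4)
            break
        P_prev = P
    return P, K, i + 1


# Hypothetical stable system, so that u = 0 is an admissible initial control.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Qs = np.eye(2)
Rs = np.array([[2.0]])
P, K, n_iter = policy_iteration(A, B, Qs, Rs, K0=np.zeros((1, 2)))
```

At convergence, P satisfies the Riccati equation associated with the chosen Q_s and R_s, which gives a simple way to check an implementation before moving to the NN-based version.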

3.2.2. NN Implementation

We assume that the cost function $V(\hat{x})$ is continuously differentiable. Then, we can use an NN to reconstruct $V(\hat{x})$ as

$V(\hat{x}) = W_{2}^{T} σ_{c}(\hat{x}) + ε_{c}(\hat{x}),$ (27)

where W₂ ∈ ℝ^N are the ideal NN weights, σ_c(x) ∈ ℝ^N is the activation function, and $ε_{c}(\hat{x}) \in ℝ$ denotes the NN error. Then,

$\nabla V(\hat{x}) = (\nabla σ_{c}(\hat{x}))^{T} W_{2} + \nabla ε_{c}(\hat{x}),$ (28)

where $\nabla σ_{c}(\hat{x}) = \partial σ_{c}(\hat{x}) / \partial \hat{x}$ and $\nabla ε_{c}(\hat{x}) = \partial ε_{c}(\hat{x}) / \partial \hat{x}$ are the gradients of the activation function and the NN error, respectively. According to (28), we can obtain the Lyapunov equation as

$0 = r(\hat{x}, u) + (W_{2}^{T} \nabla σ_{c}(\hat{x}) + (\nabla ε_{c}(\hat{x}))^{T}) \dot{\hat{x}}.$ (29)

Assumption 3 (see [12–14, 18]).

If the NN weight W₂, the NN error ε_c, the gradient ∇σ_c, and the derivative ∇ε_c are bounded, then we have ε_c ⟶ 0 and ∇ε_c ⟶ 0.

We define the estimation of (27) as

$\begin{matrix} \hat{V} (\hat{x}) = {\hat{W}}_{2}^{T} σ_{c} (\hat{x}) . \end{matrix}$ (30)

Then, we have

$\nabla \hat{V}(\hat{x}) = (\nabla σ_{c}(\hat{x}))^{T} \hat{W}_{2},$ (31)

with $\nabla \hat{V}(\hat{x}) = \partial \hat{V}(\hat{x}) / \partial \hat{x}$. Thus, the estimated Hamiltonian function can be given as

$H(\hat{x}, u, \hat{W}_{2}) = r(\hat{x}, u) + \hat{W}_{2}^{T} \nabla σ_{c}(\hat{x}) \dot{\hat{x}} = e_{c}.$ (32)

To minimize the error e_c in (32), we construct the objective function J = (1/2)e_c^T e_c, and then the gradient descent algorithm can be designed as

$\dot{\hat{W}}_{2} = -α_{1} \frac{\partial J}{\partial \hat{W}_{2}} = -α_{1} e_{c} \frac{\partial e_{c}}{\partial \hat{W}_{2}},$ (33)

with α₁ > 0 being the learning gain of the NN.

Based on (29), the Hamiltonian function can be rewritten as

$\begin{matrix} H (\hat{x}, u, W_{2}) = r (\hat{x}, u) + W_{2}^{T} \nabla σ_{c} (\hat{x}) \dot{\hat{x}} = e_{h}, \end{matrix}$ (34)

where $e_{h} = - {(\nabla ε_{c} (\hat{x}))}^{T} \dot{\hat{x}}$ is the residual error.

Define $ϕ = \nabla σ_{c}(\hat{x}) \dot{\hat{x}}$, assume that there is a positive constant ϕ_M such that ‖ϕ‖ ≤ ϕ_M, and denote the weight estimation error by $\tilde{W}_{2} = W_{2} - \hat{W}_{2}$. Then, based on (32) and (34), we have $e_{h} - e_{c} = \tilde{W}_{2}^{T} ϕ$, and the dynamics of the weight estimation error are

$\begin{matrix} {\dot{\tilde{W}}}_{2} = - {\dot{\hat{W}}}_{2} = α_{1} (e_{h} - {\tilde{W}}_{2}^{T} ϕ) ϕ . \end{matrix}$ (35)

The persistent excitation (PE) condition is required to tune the NN, guaranteeing ‖ϕ‖ ≥ ϕ_m with ϕ_m being the positive constant. To this end, a probing noise is inserted into the system to meet the PE.
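A discrete-time sketch of the critic update may help to see why excitation matters. It applies one Euler step of Ŵ₂' = −α₁ e_c ϕ with e_c = r + Ŵ₂ᵀϕ, which is the form implied by the error dynamics (35). The "ideal" weights, the random regressor that stands in for a probing signal, and the step size are hypothetical; the utility is generated so that the residual e_h is zero.

```python
import numpy as np


def critic_step(W_hat, phi, r, alpha, dt):
    """One Euler step of the gradient law W_hat' = -alpha * e_c * phi."""
    e_c = r + W_hat @ phi                      # Bellman residual, cf. (32)
    return W_hat - dt * alpha * e_c * phi, e_c


rng = np.random.default_rng(0)
W_true = np.array([0.5, -1.0, 2.0])            # hypothetical ideal weights
W_hat = np.zeros(3)
alpha, dt = 1.5, 0.01                          # hypothetical gain and step size

for _ in range(5000):
    phi = rng.standard_normal(3)               # rich regressor (stand-in for PE)
    r = -W_true @ phi                          # utility consistent with e_h = 0
    W_hat, e_c = critic_step(W_hat, phi, r, alpha, dt)
```

If ϕ were confined to a fixed direction instead, only the component of the weight error along that direction would shrink, which is the practical reason for injecting a probing noise.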

In this case, the optimal control action can be given as

$u^{*} = -\frac{1}{2} R_{s}^{-1} B^{T} ((\nabla σ_{c}(\hat{x}))^{T} W_{2} + \nabla ε_{c}(\hat{x})),$ (36)

and its estimation is

$\hat{u} = -\frac{1}{2} R_{s}^{-1} B^{T} (\nabla σ_{c}(\hat{x}))^{T} \hat{W}_{2}.$ (37)

Equation (37) shows that using the trained critic network, the control policy can be derived directly; thus, the actor NN is removed in this paper. The structural diagram of the algorithm is given in Figure 2.

Figure 2: The structural diagram of the algorithm.
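The control law (37) can be evaluated directly from the critic weights. The sketch below uses the quadratic regressor, the initial weight vector, and the values R_s = 2 and x̂(0) reported in Section 4; the function names and the conversion of Q_n to A·s in B are assumptions made for illustration.

```python
import numpy as np


def sigma_c(x):
    """Quadratic critic regressor [x1^2, x1x2, x1x3, x2^2, x2x3, x3^2]."""
    x1, x2, x3 = x
    return np.array([x1 * x1, x1 * x2, x1 * x3, x2 * x2, x2 * x3, x3 * x3])


def grad_sigma_c(x):
    """Jacobian of sigma_c with respect to x, shape (6, 3)."""
    x1, x2, x3 = x
    return np.array([
        [2 * x1, 0.0, 0.0],
        [x2, x1, 0.0],
        [x3, 0.0, x1],
        [0.0, 2 * x2, 0.0],
        [0.0, x3, x2],
        [0.0, 0.0, 2 * x3],
    ])


def control(x_hat, W2_hat, B, Rs):
    """Estimated control (37): u = -1/2 Rs^{-1} B^T (grad sigma_c)^T W2_hat."""
    grad_V = grad_sigma_c(x_hat).T @ W2_hat
    return -0.5 * (B @ grad_V) / Rs


W2_hat = np.array([0.3909, 0.5812, 1.0576, 0.1, 0.2, 1.0])
B = np.array([1.0 / 8.4379e3, 1.0 / 91.401e3, -1.0 / (45.0 * 3600.0)])
Rs = 2.0
x_hat = np.array([0.01, 0.0, 0.99])
u_hat = control(x_hat, W2_hat, B, Rs)
```

Since the regressor is quadratic, the critic output equals x̂ᵀP_c x̂ for a symmetric matrix P_c assembled from the six weights, so (37) is a linear state feedback in this particular parameterization.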

Lemma 4.

For system (18), if the critic NN weights are updated by the adaptive law (33), then the weight estimation error of the NN is UUB.

Proof.

Choose the Lyapunov function as $K(t) = (1 / α_{1}) \mathrm{tr}(\tilde{W}_{2}^{T} \tilde{W}_{2})$. The time derivative of the Lyapunov function along the trajectory of the error dynamics (35) is

$\begin{matrix} \dot{K} (t) = \frac{2}{α_{1}} t r ({\tilde{W}}_{2}^{T} {\dot{\tilde{W}}}_{2}) = \frac{2}{α_{1}} t r ({\tilde{W}}_{2}^{T} α_{1} (e_{h} - {\tilde{W}}_{2}^{T} ϕ) ϕ) . \end{matrix}$ (38)

After doing some basic manipulations, we have

$\dot{K}(t) \leq -(2 - α_{1}) ‖\tilde{W}_{2}^{T} ϕ‖^{2} + \frac{1}{α_{1}} e_{h}^{2}.$ (39)
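One way to fill in the manipulations leading from (38) to (39) is sketched below. It assumes that e_h is scalar and applies Young's inequality 2ab ≤ α₁a² + b²/α₁; the choice of the weight α₁ is our reading, selected because it reproduces the factor (2 − α₁) in (39) and the product α₁(2 − α₁) in (40), and it yields the coefficient 1/α₁ on e_h².

```latex
\begin{aligned}
\dot{K}(t) &= 2\,\mathrm{tr}\!\left(\tilde{W}_{2}^{T}\,(e_{h} - \tilde{W}_{2}^{T}\phi)\,\phi\right)
            = 2\,e_{h}\,\tilde{W}_{2}^{T}\phi - 2\,\|\tilde{W}_{2}^{T}\phi\|^{2} \\
           &\leq \alpha_{1}\,\|\tilde{W}_{2}^{T}\phi\|^{2} + \frac{1}{\alpha_{1}}\,e_{h}^{2}
              - 2\,\|\tilde{W}_{2}^{T}\phi\|^{2} \\
           &= -(2 - \alpha_{1})\,\|\tilde{W}_{2}^{T}\phi\|^{2} + \frac{1}{\alpha_{1}}\,e_{h}^{2}.
\end{aligned}
```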

Considering the Cauchy–Schwarz inequality and noticing the assumption ‖ϕ‖ ≤ ϕ_M, we can conclude that $\dot{K} (t) < 0$ as long as 1 < α₁ < 2 and

$\begin{matrix} ‖{\tilde{W}}_{2}‖ > \sqrt{\frac{e_{h}^{2}}{α_{1} (2 - α_{1}) ϕ_{M}^{2}}} . \end{matrix}$ (40)

According to the Lyapunov theory, the weight estimation error is UUB, and its norm is bounded as well.

Since the estimated weight $\hat{W}_{2}$ approximates the ideal weight W₂, the solution $\hat{V}$ can be extracted from the estimated vector $\hat{W}_{2}$ as given in (30). Thus, one can derive the actual control $\hat{u} = -\frac{1}{2} R_{s}^{-1} B^{T} (\nabla σ_{c}(\hat{x}))^{T} \hat{W}_{2}$ for system (18) based on $\hat{W}_{2}$. As a consequence of Lemma 4, we can conclude that $\hat{u}$ converges to the optimal control u^∗, i.e., $‖\hat{u} - u^{*}‖ ⟶ 0$, so that the stability of the control system is retained.

Remark 5.

In this paper, an NN-based observer is designed to estimate the unknown state (SOC) online; then, based on the estimated state, we develop a policy learning algorithm to solve the optimal control problem of the battery online. The proposed method differs from our previous work, such as [18], where the system states are assumed to be known, which limits the application of the optimal control algorithm in practice.

Remark 6.

To realize output-feedback control using policy learning, the PE condition is required in this paper. As shown in [14, 17], one way to guarantee the PE condition is to insert an exploration noise into the system for the first two seconds [17].

4. Simulation Results

For the second-order RC equivalent model of the power battery, the effectiveness of the optimal control method in this paper is verified by simulation in MATLAB. The values of resistance, capacitance, and battery capacity in the second-order RC equivalent model (5) are as follows: R₀ = 10.822 mΩ, R₁ = 3.103 mΩ, R₂ = 2.611 mΩ, C₁ = 8.4379 kF, C₂ = 91.401 kF, and Q_n = 45 A·h.

Let M = I; then, we can obtain P and L as

$P = \begin{bmatrix} 14.1250 & 0 & 19.6371 \\ 0 & 128.7451 & 178.9860 \\ 19.6371 & 178.9860 & 0 \end{bmatrix}, \quad L = \begin{bmatrix} 0.0638 & -0.007 & 0.005 \\ -0.007 & 0.0008 & 0.005 \\ 0.005 & 0.005 & -0.0036 \end{bmatrix}.$ (41)

The design parameter in learning law (33) is set to α₁ = 0.1, and the initial values are x₁(0) = 0.1, x₂(0) = 0.2, x₃(0) = 1, $\hat{x}_{1}(0) = 0.01$, $\hat{x}_{2}(0) = 0$, $\hat{x}_{3}(0) = 0.99$, and $\hat{W}_{2} = [0.3909, 0.5812, 1.0576, 0.1, 0.2, 1]$. The regressor of the critic NN is designed as σ(x) = [x₁², x₁x₂, x₁x₃, x₂², x₂x₃, x₃²]^T.

We aim at obtaining an optimal control policy that can stabilize system (18). To this end, we need to find a feedback control policy that minimizes the cost function

$V(\hat{x}, u) = \int_{0}^{\infty} (\hat{x}^{T} Q_{s} \hat{x} + u^{T} R_{s} u) \, ds,$ (42)

with Q_s=I and R_s=2I. We adopt the online policy iteration algorithm to tackle the optimal control problem, where a critic network is constructed to approximate the cost function. During the implementation process of the policy learning algorithm, we introduce the noise to meet the PE condition. The exponentially decreasing probing noise and sinusoidal signals with different frequencies are used. They are introduced into the control input and thus affect the system states.
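A probing signal of the kind described above can be sketched as an exponentially decaying sum of sinusoids that is switched off after a chosen time. The amplitude, decay rate, and frequencies below are hypothetical, since the paper does not report them; the default switch-off time of 1000 s mirrors the time after which the probing signal is turned off in the simulation.

```python
import math


def probing_noise(t, amplitude=0.5, decay=0.005,
                  freqs=(1.0, 3.0, 7.0, 11.0), t_off=1000.0):
    """Exponentially decaying sum of sinusoids; returns zero once t >= t_off."""
    if t >= t_off:
        return 0.0
    envelope = amplitude * math.exp(-decay * t)
    return envelope * sum(math.sin(w * t) for w in freqs)


# The probing term is added to the learned control during the learning phase.
u_learned = 0.0                      # placeholder for the control computed by (37)
u_applied = u_learned + probing_noise(12.5)
```

Using several incommensurate frequencies is a common way to keep the regressor from settling into a low-dimensional pattern while the critic weights are still being tuned.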

The evolution of the state trajectory is depicted in Figure 3, and this can be used to further design the optimal controller for the proposed system. Figure 4 shows the estimated weights, which have converged after 1000 s; the probing signal is then turned off. This convergence of the NN weights ensures the stability of the controlled system, as can be seen in Figure 5, which shows the trajectory of the controlled system under the designed optimal controller. The states converge to zero after the probing noise is turned off. Figure 6 shows the cost of the system, which is smooth, and this indicates that the designed controller is effective. The control action given in Figure 7 is bounded, which is consistent with Lemma 4.

To show the improved performance of the proposed single critic NN-based ADP for solving the derived optimal control problem, a critic-actor NN-based online learning method [19] is also used for comparison. Moreover, in this comparison, we add a robustness verification of the proposed method. To this end, we set the nonlinear term d(x) = 0.5 sin(x₁). The profiles of the critic NN and actor NN weights can be found in Figure 8, and the corresponding control performances are given in Figure 9. Comparing Figures 9(a) and 9(b), it is clear that the proposed single critic NN-based method achieves faster transient state convergence even in the presence of the nonlinear term.

Figure 9: System state x using (a) the proposed method and (b) the method proposed in [19, 20].

Generally, the modeling accuracy and control structure will influence the control performance of the closed-loop control systems. In this paper, the main factors affecting the control performance are the modeling uncertainties of the system and the convergence performance of critic NN weights. Moreover, better convergence of critic NN weights, i.e., faster convergence speed can help to achieve better control performance. In this respect, different choices of critic NN parameters and structure will affect the convergence of critic NN weights and the control performance. Hence, proper selection of NN parameters and structure, such as the initial value of weights, learning gain, and regressor structure, is helpful to further improve the control response.

5. Conclusion

For the second-order RC equivalent nonlinear system of the power battery, the unknown uncertainty of the system is approximated by an NN, and a time-varying gain nonlinear state observer is designed to solve the problem that the RC voltages and the state of charge (SOC) of the battery cannot be measured. Then, to realize the optimal control, a policy learning-based online algorithm is designed, in which only the critic NN is required and the actor NN widely used in most optimal control designs is removed. Finally, the effectiveness of the proposed optimal control method is verified by simulation.

Acknowledgments

This work was supported in part by Jilin Provincial Major Science and Technology Projects (Grant no.: 20210301020GX).

Data Availability

The data used to support the findings of this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors' Contributions

Qinglin Zhu and Jun Zhao conceptualized the study; Huanli Sun and Ziliang Zhao were responsible for methodology; Ziliang Zhao and Yixin Liu performed formal analysis; Qinglin Zhu wrote the original draft; Qinglin Zhu and Yixin Liu reviewed and edited the manuscript; and Huanli Sun and Ziliang Zhao were responsible for funding acquisition. All authors have read and agreed to the published version of the manuscript.

References

1. Klancar G., Blazic S. Optimal constant acceleration motion primitives. IEEE Transactions on Vehicular Technology. 2019;68(9):8502–8511. doi: 10.1109/tvt.2019.2927124.
2. Ehsani M., Gao Y., Gay S. E., Emadi A. Modern Electric, Hybrid Electric, and Fuel Cell Vehicles. Boca Raton, FL, USA: CRC Press; 2005.
3. He H., Xiong R., Zhang X., Sun F., Fan J. State-of-charge estimation of the lithium-ion battery using an adaptive extended Kalman filter based on an improved Thevenin model. IEEE Transactions on Vehicular Technology. 2011;60(4):1461–1469. doi: 10.1109/tvt.2011.2132812.
4. Cheng K. W. E., Divakar B. P., Wu H., Ding K., Ho H. F. Battery-management system (BMS) and SOC development for electrical vehicles. IEEE Transactions on Vehicular Technology. 2011;60(1):76–88. doi: 10.1109/tvt.2010.2089647.
5. Ahmad S., Rehan M., Hong K. S. Observer-based robust control of one-sided Lipschitz nonlinear systems. ISA Transactions. 2016;65:230–240. doi: 10.1016/j.isatra.2016.08.010.
6. Xia B., Chen C., Tian Y., Sun W., Xu Z., Zheng W. A novel method for state of charge estimation of lithium-ion batteries using a nonlinear observer. Journal of Power Sources. 2014;270:359–366. doi: 10.1016/j.jpowsour.2014.07.103.
7. Zhu Q., Xiong N., Yang M., Huang R., Hu D. State of charge estimation for lithium-ion battery based on nonlinear observer: an H∞ method. Energies. 2017;10(5):679. doi: 10.3390/en10050679.
8. Li X., Li Y. Neural networks optimized learning control of state constraints systems. Neurocomputing. 2021;453:512–523. doi: 10.1016/j.neucom.2020.10.034.
9. Roman R.-C., Precup R.-E., Hedrea E.-L., et al. Iterative feedback tuning algorithm for tower crane systems. Procedia Computer Science. 2022;199:157–165. doi: 10.1016/j.procs.2022.01.020.
10. Chen T., Babanin A., Muhannad A., Chapron B., Chen C. Modified evolved bat algorithm of fuzzy optimal control for complex nonlinear systems. Romanian Journal of Information Science and Technology. 2020;23(T):T28–T40.
11. Zamfirache I. A., Precup R.-E., Roman R.-C., Petriu E. M. Policy iteration reinforcement learning-based control using a grey wolf optimizer algorithm. Information Sciences. 2022;585:162–175. doi: 10.1016/j.ins.2021.11.051.
12. Abu-Khalaf M., Lewis F. L. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica. 2005;41(5):779–791.
13. Wang D., Liu D., Wei Q., Zhao D., Jin N. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica. 2012;48(8):1825–1832.
14. Vamvoudakis K. G., Lewis F. L. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica. 2010;46(5):878–888.
15. Zhang J., Li K., Li Y. Output feedback based simplified optimized backstepping control for strict-feedback systems with input and state constraints. IEEE/CAA Journal of Automatica Sinica. 2021;8(6):1119–1132.
16. Li Y., Pei X., Yi S. Adaptive neural network optimal control of hybrid electric vehicle power battery. Journal of Jilin University Engineering and Technology Edition. 2022;52(9):2063–2068.
17. Wang D., Liu D., Li H. Policy iteration algorithm for online design of robust control for a class of continuous-time nonlinear systems. IEEE Transactions on Automation Science and Engineering. 2014;11(2):627–632.
18. Zhao J., Lv Y. Output-feedback robust control of systems with uncertain dynamics via data-driven policy learning. International Journal of Robust and Nonlinear Control. 2022;32(18):9791–9807. doi: 10.1002/rnc.6374.
19. Zhang H., Cui L., Zhang X., Luo Y. Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Transactions on Neural Networks. 2011;22(12):2226–2236. doi: 10.1109/tnn.2011.2168538.
20. Zhao J., Na J., Gao G. Robust tracking control of uncertain nonlinear systems with adaptive dynamic programming. Neurocomputing. 2022;471:21–30. doi: 10.1016/j.neucom.2021.10.081.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data used to support the findings of this study are available upon request from the corresponding author.
