Entropy. 2023 Jan 21;25(2):208. doi: 10.3390/e25020208

Forward-Backward Sweep Method for the System of HJB-FP Equations in Memory-Limited Partially Observable Stochastic Control

Takehiro Tottori 1,*, Tetsuya J Kobayashi 1,2,3,4
Editors: Mohammad Reza Rahimi Tabar, Adrian-Mihail Stoica
PMCID: PMC9955073  PMID: 36832575

Abstract

Memory-limited partially observable stochastic control (ML-POSC) is the stochastic optimal control problem under incomplete information and memory limitation. To obtain the optimal control function of ML-POSC, a system of the forward Fokker–Planck (FP) equation and the backward Hamilton–Jacobi–Bellman (HJB) equation needs to be solved. In this work, we first show that the system of HJB-FP equations can be interpreted via Pontryagin’s minimum principle on the probability density function space. Based on this interpretation, we then propose the forward-backward sweep method (FBSM) for ML-POSC. FBSM is one of the most basic algorithms for Pontryagin’s minimum principle, which alternately computes the forward FP equation and the backward HJB equation in ML-POSC. Although the convergence of FBSM is generally not guaranteed in deterministic control and mean-field stochastic control, it is guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited to the optimal control function in ML-POSC.

Keywords: decision-making, optimal control, stochastic control, incomplete information, memory limitation, mean-field control

1. Introduction

In many practical applications of the stochastic optimal control theory, several constraints need to be considered. In the cases of small devices [1,2] and biological systems [3,4,5,6,7,8], for example, incomplete information and memory limitation become predominant because their sensors are extremely noisy and their memory resources are severely limited. To take into account one of these constraints, incomplete information, partially observable stochastic control (POSC) has been extensively studied in the stochastic optimal control theory [9,10,11,12,13]. However, because POSC cannot take into account the other constraint, memory limitation, it is not practical enough for designing memory-limited controllers for small devices and biological systems. To resolve this problem, memory-limited POSC (ML-POSC) has recently been proposed [14]. Because ML-POSC formulates noisy observation and limited memory explicitly, ML-POSC can take into account both incomplete information and memory limitation in the stochastic optimal control problem.

However, ML-POSC cannot be solved in a similar way as completely observable stochastic control (COSC), which is the most basic stochastic optimal control problem [15,16,17,18]. In COSC, the optimal control function depends only on the Hamilton–Jacobi–Bellman (HJB) equation, which is a time-backward partial differential equation given a terminal condition (Figure 1a) [15,16,17,18]. Therefore, the optimal control function of COSC can be obtained by solving the HJB equation backward in time from the terminal condition, which is called the value iteration method [19,20,21]. In contrast, the optimal control function of ML-POSC depends not only on the HJB equation but also on the Fokker–Planck (FP) equation, which is a time-forward partial differential equation given an initial condition (Figure 1b) [14]. Because the HJB equation and the FP equation interact with each other through the optimal control function in ML-POSC, the optimal control function of ML-POSC cannot be obtained by the value iteration method.

Figure 1.

Schematic diagram of the relationship between the backward dynamics, the optimal control function, and the forward dynamics in (a) COSC, (b) ML-POSC, (c) deterministic control, and (d) MFSC. w*, p*, λ*, and s* are the solutions of the HJB equation, the FP equation, the adjoint equation, and the state equation, respectively. u* is the optimal control function. The arrows indicate the dependence of variables. The variable at the head of an arrow depends on the variable at the tail of the arrow. (a) In COSC, because the optimal control function u* depends only on the HJB equation w*, it can be obtained by solving the HJB equation w* backward in time from the terminal condition, which is called the value iteration method. (b) In ML-POSC, because the optimal control function u* depends on the FP equation p* as well as the HJB equation w* (orange), it cannot be obtained by the value iteration method. In this paper, we propose FBSM for ML-POSC, which computes the HJB equation w* and the FP equation p* alternately. Because the coupling of the HJB equation w* and the FP equation p* is limited only to the optimal control function u*, the convergence of FBSM is guaranteed in ML-POSC. (c) In deterministic control, because the coupling of the adjoint equation λ* and the state equation s* is not limited to the optimal control function u* (green), the convergence of FBSM is not guaranteed. (d) In MFSC, because the coupling of the HJB equation w* and the FP equation p* is not limited to the optimal control function u* (green), the convergence of FBSM is not guaranteed.

To propose an algorithm to solve ML-POSC, we first show that the system of HJB-FP equations can be interpreted via Pontryagin’s minimum principle on the probability density function space. Pontryagin’s minimum principle is one of the most representative approaches to the deterministic optimal control problem, which converts it into the two-point boundary value problem of the forward state equation and the backward adjoint equation [22,23,24,25]. We formally show that the system of HJB-FP equations is an extension of the system of adjoint and state equations from the deterministic optimal control problem to the stochastic optimal control problem.

The system of HJB-FP equations also appears in the mean-field stochastic control (MFSC) [26,27,28]. Although the relationship between the system of HJB-FP equations and Pontryagin’s minimum principle has been briefly mentioned in MFSC [29,30,31], its details have not yet been investigated. In this work, we investigate it in more detail by deriving the system of HJB-FP equations in a similar way to Pontryagin’s minimum principle. We note that our derivations are formal, not analytical, and more mathematically rigorous proofs remain future challenges. However, our results are consistent with many conventional results and also provide a useful perspective in proposing an algorithm.

We then propose the forward-backward sweep method (FBSM) for ML-POSC. FBSM is an algorithm to compute the forward FP equation and the backward HJB equation alternately, which can be interpreted as an extension of the value iteration method. FBSM has been proposed in Pontryagin’s minimum principle of the deterministic optimal control problem, which computes the forward state equation and the backward adjoint equation alternately [32,33,34]. Because FBSM is easy to implement, it has been used in many applications [35,36]. However, the convergence of FBSM is not guaranteed in deterministic control except for special cases [37,38] because the coupling of adjoint and state equations is not limited to the optimal control function (Figure 1c). In contrast, we show that the convergence of FBSM is generally guaranteed in ML-POSC because the coupling of the HJB-FP equations is limited only to the optimal control function (Figure 1b).

FBSM is called the fixed-point iteration method in MFSC [39,40,41,42]. Although the fixed-point iteration method is the most basic algorithm to solve MFSC, its convergence is not guaranteed for the same reason as deterministic control (Figure 1d). Therefore, ML-POSC is a special and nice class of optimal control problems where FBSM or the fixed-point iteration method is guaranteed to converge.

This paper is organized as follows: In Section 2, we formulate ML-POSC. In Section 3, we derive the system of HJB-FP equations of ML-POSC from the viewpoint of Pontryagin’s minimum principle. In Section 4, we propose FBSM for ML-POSC and prove its convergence. In Section 5, we apply FBSM to the linear-quadratic-Gaussian (LQG) problem. In Section 6, we verify the convergence of FBSM by numerical experiments. In Section 7, we discuss our work. In Appendix A, we briefly review Pontryagin’s minimum principle of deterministic control. In Appendix B, we derive the system of HJB-FP equations of MFSC from the viewpoint of Pontryagin’s minimum principle. In Appendix C, we show the detailed derivations of our results.

2. Memory-Limited Partially Observable Stochastic Control

In this section, we briefly review the formulation of ML-POSC [14], which is the stochastic optimal control problem under incomplete information and memory limitation.

2.1. Problem Formulation

This subsection outlines the formulation of ML-POSC [14]. The state of the system $x_t \in \mathbb{R}^{d_x}$ at time $t \in [0,T]$ evolves by the following stochastic differential equation (SDE):

$$dx_t = b(t,x_t,u_t)\,dt + \sigma(t,x_t,u_t)\,d\omega_t, \tag{1}$$

where $x_0$ obeys $p_0(x_0)$, $u_t \in \mathbb{R}^{d_u}$ is the control, and $\omega_t \in \mathbb{R}^{d_\omega}$ is the standard Wiener process. In COSC [15,16,17,18], because the controller can completely observe the state $x_t$, it determines the control $u_t$ based on the state $x_t$ as $u_t = u(t,x_t)$. By contrast, in POSC [9,10,11,12,13] and ML-POSC [14], the controller cannot directly observe the state $x_t$ and instead obtains the observation $y_t \in \mathbb{R}^{d_y}$, which evolves by the following SDE:

$$dy_t = h(t,x_t)\,dt + \gamma(t)\,d\nu_t, \tag{2}$$

where $y_0$ obeys $p_0(y_0)$, and $\nu_t \in \mathbb{R}^{d_\nu}$ is the standard Wiener process. In POSC [9,10,11,12,13], because the controller can completely memorize the observation history $y_{0:t} := \{y_\tau \mid \tau \in [0,t]\}$, it determines the control $u_t$ based on the observation history $y_{0:t}$ as $u_t = u(t,y_{0:t})$. In ML-POSC [14], by contrast, because the controller cannot completely memorize the observation history $y_{0:t}$, it compresses the observation history $y_{0:t}$ into the finite-dimensional memory $z_t \in \mathbb{R}^{d_z}$, which evolves by the following SDE:

$$dz_t = c(t,z_t,v_t)\,dt + \kappa(t,z_t,v_t)\,dy_t + \eta(t,z_t,v_t)\,d\xi_t, \tag{3}$$

where $z_0$ obeys $p_0(z_0)$, $v_t \in \mathbb{R}^{d_v}$ is the control, and $\xi_t \in \mathbb{R}^{d_\xi}$ is the standard Wiener process. The memory dimension $d_z$ is determined by the available memory size of the controller. In addition, the memory noise $\xi_t$ represents the intrinsic stochasticity of the memory to be used. Therefore, unlike the conventional POSC, ML-POSC can explicitly take into account the memory size and noise of the controller. Furthermore, because the memory dynamics (3) depends on the memory control $v_t$, it can be optimized through the memory control $v_t$, which is expected to realize the optimal compression of the observation history $y_{0:t}$ into the limited memory $z_t$. In ML-POSC [14], the controller determines the state control $u_t$ and the memory control $v_t$ based on the memory $z_t$ as follows:

$$u_t = u(t,z_t), \quad v_t = v(t,z_t). \tag{4}$$

The objective function of ML-POSC is given by the following expected cumulative cost function:

$$J[u,v] := \mathbb{E}_{p(x_{0:T},y_{0:T},z_{0:T};u,v)}\left[\int_0^T f(t,x_t,u_t,v_t)\,dt + g(x_T)\right], \tag{5}$$

where $f$ is the cost function, $g$ is the terminal cost function, $p(x_{0:T},y_{0:T},z_{0:T};u,v)$ is the probability of $x_{0:T}$, $y_{0:T}$, and $z_{0:T}$ given $u$ and $v$ as parameters, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to the probability $p$. Because the cost function $f$ depends on the memory control $v_t$, ML-POSC can explicitly take into account the memory control cost, which is also impossible with the conventional POSC.

ML-POSC is the problem of finding the optimal state control function $u^*$ and the optimal memory control function $v^*$ that minimize the expected cumulative cost function $J[u,v]$ as follows:

$$u^*, v^* := \mathop{\mathrm{argmin}}_{u,v} J[u,v]. \tag{6}$$

ML-POSC first formulates the finite-dimensional and stochastic memory dynamics explicitly, then optimizes the memory control by considering the memory control cost. As a result, unlike the conventional POSC, ML-POSC is a practical framework for memory-limited controllers where the memory size, noise, and cost are imposed and non-negligible.
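To make the formulation above concrete, the following is a minimal Euler-Maruyama sketch of the SDEs (1)-(3) with memory-based controls (4), together with a Monte Carlo estimate of the objective (5). The scalar model used here (drift $b = -x + u$, coefficients $\sigma = \gamma = \kappa = 1$, $\eta = 0.1$, $c = v$, running cost $f = x^2 + u^2 + v^2$, terminal cost $g = 0$, standard Gaussian initial conditions) and the function name `simulate_ml_posc` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_ml_posc(u, v, T=1.0, dt=1e-2, n_paths=2000, seed=0):
    """Monte Carlo estimate of J[u, v] for a hypothetical scalar ML-POSC model."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    x = rng.standard_normal(n_paths)  # x_0 ~ p_0(x_0), assumed standard Gaussian
    z = rng.standard_normal(n_paths)  # z_0 ~ p_0(z_0), assumed standard Gaussian
    cost = np.zeros(n_paths)
    for i in range(n_steps):
        t = i * dt
        ut, vt = u(t, z), v(t, z)            # Eq. (4): controls see only the memory z_t
        cost += (x**2 + ut**2 + vt**2) * dt  # running cost f (assumed quadratic)
        d_omega, d_nu, d_xi = np.sqrt(dt) * rng.standard_normal((3, n_paths))
        dy = x * dt + d_nu                   # Eq. (2) with h = x, gamma = 1
        x = x + (-x + ut) * dt + d_omega     # Eq. (1) with b = -x + u, sigma = 1
        z = z + vt * dt + dy + 0.1 * d_xi    # Eq. (3) with c = v, kappa = 1, eta = 0.1
    return cost.mean()                       # terminal cost g = 0 (assumed)

def zero(t, z):
    return np.zeros_like(z)

J_no_control = simulate_ml_posc(zero, zero)
J_feedback = simulate_ml_posc(lambda t, z: -0.5 * z, zero)
print(J_no_control, J_feedback)
```

The two calls compare a zero control with an arbitrary linear memory feedback; neither is claimed to be optimal. The point is only that both $u$ and $v$ are functions of $(t, z_t)$ and never of $x_t$ or of the observation history.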

The previous work [14] has shown the validity and effectiveness of ML-POSC. In the LQG problem of conventional POSC, the observation history $y_{0:T}$ can be compressed into the Kalman filter without a loss of performance [10,18,43]. Because the Kalman filter is finite-dimensional, it can be interpreted as the finite-dimensional memory $z_t$ and discussed in terms of ML-POSC. The previous work [14] has proven that the optimal memory dynamics of ML-POSC become the Kalman filter in this problem, which indicates that ML-POSC is a consistent framework with the conventional POSC. Furthermore, the previous work [14] has demonstrated the effectiveness of ML-POSC in the LQG problem with memory limitation and in the non-LQG problem by numerical experiments.

2.2. Problem Reformulation

Although the formulation of ML-POSC in the previous subsection is intuitive, it is inconvenient for further mathematical investigations. To address this problem, we reformulate ML-POSC in this subsection. The formulation in this subsection is simpler and more general than that in the previous subsection.

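First, we define an extended state $s_t$ as follows:

$$s_t := \begin{pmatrix} x_t \\ z_t \end{pmatrix} \in \mathbb{R}^{d_s}, \tag{7}$$

where $d_s = d_x + d_z$. The extended state $s_t$ evolves by the following SDE:

$$ds_t = \tilde{b}(t,s_t,\tilde{u}_t)\,dt + \tilde{\sigma}(t,s_t,\tilde{u}_t)\,d\tilde{\omega}_t, \tag{8}$$

where $s_0$ obeys $p_0(s_0)$, $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ is the control, and $\tilde{\omega}_t \in \mathbb{R}^{d_{\tilde{\omega}}}$ is the standard Wiener process. ML-POSC determines the control $\tilde{u}_t \in \mathbb{R}^{d_{\tilde{u}}}$ based on the memory $z_t$ as follows:

$$\tilde{u}_t = \tilde{u}(t,z_t). \tag{9}$$

The extended state SDE (8) includes the previous SDEs (1)–(3) as a special case because they can be represented as follows:

$$ds_t = \begin{pmatrix} b(t,x_t,u_t) \\ c(t,z_t,v_t) + \kappa(t,z_t,v_t)h(t,x_t) \end{pmatrix} dt + \begin{pmatrix} \sigma(t,x_t,u_t) & O & O \\ O & \kappa(t,z_t,v_t)\gamma(t) & \eta(t,z_t,v_t) \end{pmatrix} \begin{pmatrix} d\omega_t \\ d\nu_t \\ d\xi_t \end{pmatrix}, \tag{10}$$

where $p_0(s_0) = p_0(x_0)p_0(z_0)$.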

The objective function of ML-POSC is given by the following expected cumulative cost function:

$$J[\tilde{u}] := \mathbb{E}_{p(s_{0:T};\tilde{u})}\left[\int_0^T \tilde{f}(t,s_t,\tilde{u}_t)\,dt + \tilde{g}(s_T)\right], \tag{11}$$

where $\tilde{f}$ is the cost function and $\tilde{g}$ is the terminal cost function. It is obvious that this objective function (11) is more general than that in the previous subsection (5).

ML-POSC is the problem of finding the optimal control function $\tilde{u}^*$ that minimizes the expected cumulative cost function $J[\tilde{u}]$ as follows:

$$\tilde{u}^* := \mathop{\mathrm{argmin}}_{\tilde{u}} J[\tilde{u}]. \tag{12}$$

In the following sections, we mainly consider the formulation of this subsection because it is simpler and more general than that in the previous subsection. Moreover, we omit the tilde $\tilde{\cdot}$ for simplicity of notation.

3. Pontryagin’s Minimum Principle

If the control $u_t$ were determined based on the extended state $s_t$ as $u_t = u(t,s_t)$, ML-POSC would be the same problem as COSC of the extended state, and its optimality conditions could be obtained in the conventional way [15,16,17,18]. In reality, however, because ML-POSC determines the control $u_t$ based only on the memory $z_t$ as $u_t = u(t,z_t)$, its optimality conditions cannot be obtained in the same way as in COSC. In the previous work [14], the optimality conditions of ML-POSC were obtained by employing a mathematical technique of MFSC [30,31].

In this section, we obtain the optimality conditions of ML-POSC by employing Pontryagin’s minimum principle [22,23,24,25] on the probability density function space (Figure 2 (bottom right)). The conventional approach in ML-POSC [14] and MFSC [30,31] can be interpreted as a conversion from Bellman’s dynamic programming principle (Figure 2 (top right)) to Pontryagin’s minimum principle (Figure 2 (bottom right)) on the probability density function space.

Figure 2.

The relationship between Bellman’s dynamic programming principle (top) and Pontryagin’s minimum principle (bottom) on the state space (left) and on the probability density function space (right). The left-hand side corresponds to deterministic control, which is briefly reviewed in Appendix A. The right-hand side corresponds to ML-POSC and MFSC, which are shown in Section 3 and Appendix B, respectively. The conventional approach in ML-POSC [14] and MFSC [30,31] can be interpreted as the conversion from Bellman’s dynamic programming principle (top right) to Pontryagin’s minimum principle (bottom right) on the probability density function space.

In Appendix A, we briefly review Pontryagin’s minimum principle in deterministic control (Figure 2 (left)). In this section, we obtain the optimality conditions of ML-POSC in a similar way as Appendix A (Figure 2 (right)). Furthermore, in Appendix B, we obtain the optimality conditions of MFSC in a similar way as Appendix A (Figure 2 (right)). MFSC is more general than ML-POSC except for the partial observability. In particular, the expected Hamiltonian is non-linear with respect to the probability density function in MFSC, while it is linear in ML-POSC.

Although our derivations are formal, not analytical, and more mathematically rigorous proofs remain future challenges, our results are consistent with the conventional results of COSC [15,16,17,18], ML-POSC [14], and MFSC [26,27,28,30,31], and also provide a useful perspective in proposing an algorithm.

3.1. Preliminary

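In this subsection, we show a useful result in obtaining Pontryagin’s minimum principle. Given arbitrary control functions $u$ and $u'$, $J[u] - J[u']$ can be calculated as follows:

$$J[u] - J[u'] = \int_0^T \left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right] - \mathbb{E}_{p(t,s)}\left[H(t,s,u',w')\right]\right) dt, \tag{13}$$

where $H$ is the Hamiltonian, which is defined as follows:

$$H(t,s,u,w) := f(t,s,u) + \mathcal{L}_u w(t,s). \tag{14}$$

$\mathcal{L}_u$ is the backward diffusion operator, which is defined as follows:

$$\mathcal{L}_u w(t,s) := \sum_{i=1}^{d_s} b_i(t,s,u)\frac{\partial w(t,s)}{\partial s_i} + \frac{1}{2}\sum_{i,j=1}^{d_s} D_{ij}(t,s,u)\frac{\partial^2 w(t,s)}{\partial s_i \partial s_j}, \tag{15}$$

where $D(t,s,u) := \sigma(t,s,u)\sigma^\top(t,s,u)$. $w'(t,s)$ is the solution of the following Hamilton–Jacobi–Bellman (HJB) equation driven by $u'$:

$$-\frac{\partial w'(t,s)}{\partial t} = H(t,s,u',w'), \tag{16}$$

where $w'(T,s) = g(s)$. $p(t,s)$ is the solution of the following Fokker–Planck (FP) equation driven by $u$:

$$\frac{\partial p(t,s)}{\partial t} = \mathcal{L}_u^\dagger p(t,s), \tag{17}$$

where $p(0,s) = p_0(s)$. $\mathcal{L}_u^\dagger$ is the forward diffusion operator, which is defined as follows:

$$\mathcal{L}_u^\dagger p(t,s) := -\sum_{i=1}^{d_s} \frac{\partial (b_i(t,s,u)p(t,s))}{\partial s_i} + \frac{1}{2}\sum_{i,j=1}^{d_s} \frac{\partial^2 (D_{ij}(t,s,u)p(t,s))}{\partial s_i \partial s_j}. \tag{18}$$

$\mathcal{L}_u^\dagger$ is the conjugate of $\mathcal{L}_u$ as follows:

$$\int w(t,s)\,\mathcal{L}_u^\dagger p(t,s)\,ds = \int p(t,s)\,\mathcal{L}_u w(t,s)\,ds. \tag{19}$$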

We derive Equation (13) in Appendix C.1.
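The conjugacy relation (19) between the backward and forward diffusion operators can be checked numerically in one dimension. The sketch below uses finite differences on a grid; the drift $b(s) = -s$, the constant diffusion $D = 1$, the test function $w(s) = s^2$, and the standard Gaussian density $p$ are illustrative assumptions chosen so that boundary terms vanish. For this choice both sides of (19) can be worked out by hand and equal $-1$.

```python
import numpy as np

# One-dimensional grid wide enough that p is negligible at the boundaries.
s = np.linspace(-10.0, 10.0, 4001)
ds = s[1] - s[0]

b = -s                                           # drift (assumed)
D = np.ones_like(s)                              # D = sigma * sigma^T (assumed constant)
w = s**2                                         # smooth test function (assumed)
p = np.exp(-0.5 * s**2) / np.sqrt(2.0 * np.pi)   # standard Gaussian density (assumed)

def ddx(f):
    """First derivative by central differences."""
    return np.gradient(f, ds)

# Backward operator, Eq. (15): L_u w = b w' + (1/2) D w''
L_w = b * ddx(w) + 0.5 * D * ddx(ddx(w))
# Forward operator, Eq. (18): L_u^dagger p = -(b p)' + (1/2) (D p)''
Ldag_p = -ddx(b * p) + 0.5 * ddx(ddx(D * p))

lhs = np.sum(w * Ldag_p) * ds   # integral of w * L_u^dagger p
rhs = np.sum(p * L_w) * ds      # integral of p * L_u w
print(lhs, rhs)
```

The agreement relies on $p$ decaying fast enough that integration by parts leaves no boundary contribution, which is the same condition implicitly used in (19).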

3.2. Necessary Condition

In this subsection, we show the necessary condition of the optimal control function of ML-POSC. It corresponds to Pontryagin’s minimum principle on the probability density function space (Figure 2 (bottom right)). If $u^*$ is the optimal control function of ML-POSC (12), then the following equation is satisfied:

$$u^*(t,z) = \mathop{\mathrm{argmin}}_{u} \mathbb{E}_{p_t^*(x \mid z)}\left[H(t,s,u,w^*)\right], \quad \text{a.s.}\ \forall t \in [0,T],\ \forall z \in \mathbb{R}^{d_z}, \tag{20}$$

where $w^*(t,s)$ is the solution of the following HJB equation driven by $u^*$:

$$-\frac{\partial w^*(t,s)}{\partial t} = H(t,s,u^*,w^*), \tag{21}$$

where $w^*(T,s) = g(s)$. $p_t^*(x \mid z) := p^*(t,s)/\int p^*(t,s)\,dx$ is the conditional probability density function of state $x$ given memory $z$, and $p^*(t,s)$ is the solution of the following FP equation driven by $u^*$:

$$\frac{\partial p^*(t,s)}{\partial t} = \mathcal{L}_{u^*}^\dagger p^*(t,s), \tag{22}$$

where $p^*(0,s) = p_0(s)$. We derive this result in Appendix C.2.

In deterministic control, Pontryagin’s minimum principle can be expressed by the derivatives of the Hamiltonian (Figure 2 (bottom left)). Similarly, the system of HJB-FP Equations (21) and (22) can be expressed by the variations of the expected Hamiltonian

$$\bar{H}(t,p,u,w) := \mathbb{E}_{p(s)}\left[H(t,s,u,w)\right] \tag{23}$$

as follows:

$$\frac{\partial p^*(t,s)}{\partial t} = \frac{\delta \bar{H}(t,p^*,u^*,w^*)}{\delta w}(s), \tag{24}$$
$$-\frac{\partial w^*(t,s)}{\partial t} = \frac{\delta \bar{H}(t,p^*,u^*,w^*)}{\delta p}(s), \tag{25}$$

where $p^*(0,s) = p_0(s)$ and $w^*(T,s) = g(s)$ (Figure 2 (bottom right)). Therefore, the system of HJB-FP equations can be interpreted via Pontryagin’s minimum principle on the probability density function space.
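The variational form of Equations (24) and (25) can be verified in two lines from the definitions (14), (19), and (23). The following worked step is our own expansion of the argument, holding the control function $u$ fixed while varying $p$ and $w$:

```latex
\begin{align*}
\bar{H}(t,p,u,w)
  &= \int p(s)\bigl[f(t,s,u) + \mathcal{L}_u w(s)\bigr]\,ds
   = \int p(s) f(t,s,u)\,ds + \int w(s)\,\mathcal{L}_u^{\dagger} p(s)\,ds, \\
\frac{\delta \bar{H}(t,p,u,w)}{\delta p}(s)
  &= f(t,s,u) + \mathcal{L}_u w(s) = H(t,s,u,w), \\
\frac{\delta \bar{H}(t,p,u,w)}{\delta w}(s)
  &= \mathcal{L}_u^{\dagger} p(s).
\end{align*}
```

Substituting $(p,u,w) = (p^*,u^*,w^*)$ into the last two lines recovers the HJB Equation (21) from (25) and the FP Equation (22) from (24). The first line also shows that the expected Hamiltonian is linear in $p$ for a fixed $u$ and $w$.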

3.3. Sufficient Condition

Pontryagin’s minimum principle (20) is only a necessary condition and generally not a sufficient condition. Pontryagin’s minimum principle (20) becomes a necessary and sufficient condition if the expected Hamiltonian $\bar{H}(t,p,u,w)$ is convex with respect to $p$ and $u$. We obtain this result in Appendix C.3.

3.4. Relationship with Bellman’s Dynamic Programming Principle

From Bellman’s dynamic programming principle on the probability density function space (Figure 2 (top right)) [14], the optimal control function of ML-POSC is given by the following equation:

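$$u^*(t,z,p) = \mathop{\mathrm{argmin}}_{u} \mathbb{E}_{p(x \mid z)}\left[H\left(t,s,u,\frac{\delta V^*(t,p)}{\delta p}(s)\right)\right], \tag{26}$$

where $V^*(t,p)$ is the value function on the probability density function space, which is the solution of the following Bellman equation:

$$-\frac{\partial V^*(t,p)}{\partial t} = \mathbb{E}_{p(s)}\left[H\left(t,s,u^*,\frac{\delta V^*(t,p)}{\delta p}(s)\right)\right], \tag{27}$$

where $V^*(T,p) = \mathbb{E}_{p(s)}[g(s)]$. More specifically, the optimal control function of ML-POSC is given by $u^*(t,z) = u^*(t,z,p^*)$, where $p^*$ is the solution of the FP Equation (22).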

Because the Bellman Equation (27) is a functional differential equation, it cannot be solved even numerically. To resolve this problem, the previous work [14] converted the Bellman Equation (27) into the HJB Equation (21) by defining

$$w^*(t,s) := \frac{\delta V^*(t,p^*)}{\delta p}(s), \tag{28}$$

where $p^*$ is the solution of the FP Equation (22). This approach can be interpreted as the conversion from Bellman’s dynamic programming principle (Figure 2 (top right)) to Pontryagin’s minimum principle (Figure 2 (bottom right)) on the probability density function space.

3.5. Relationship with Completely Observable Stochastic Control

In the COSC of the extended state, the control $u_t$ is determined based on the extended state $s_t$ as $u_t = u(t,s_t)$. Therefore, in the COSC of the extended state, Pontryagin’s minimum principle on the probability density function space is given by the following equation:

$$u^*(t,s) = \mathop{\mathrm{argmin}}_{u} H(t,s,u,w^*), \quad \text{a.s.}\ \forall t \in [0,T],\ \forall s \in \mathbb{R}^{d_s}, \tag{29}$$

where $w^*(t,s)$ is the solution of the HJB Equation (21). Because this proof is almost identical to that of Section 3.2, it is omitted in this paper.

While the optimal control function of ML-POSC (20) depends on the FP equation and the HJB equation, the optimal control function of COSC (29) depends only on the HJB equation. From this nice property of COSC, Equation (29) is not only a necessary condition but also a sufficient condition without assuming the convexity of the expected Hamiltonian. We derive this result in Appendix C.4.

This result is consistent with the conventional result of COSC [15,16,17,18]. Unlike ML-POSC and MFSC, COSC can be solved by Bellman’s dynamic programming principle on the state space. In COSC, Pontryagin’s minimum principle on the probability density function space is equivalent to Bellman’s dynamic programming principle on the state space. Because Bellman’s dynamic programming principle on the state space is a necessary and sufficient condition, Pontryagin’s minimum principle on the probability density function space may also become a necessary and sufficient condition.

4. Forward-Backward Sweep Method

In this section, we propose FBSM for ML-POSC and then prove its convergence by employing the interpretation of the system of HJB-FP equations by Pontryagin’s minimum principle introduced in the previous section.

4.1. Forward-Backward Sweep Method

In this subsection, we propose FBSM for ML-POSC, which is summarized in Algorithm 1. FBSM is an algorithm to compute the forward FP equation and the backward HJB equation alternately. More specifically, in the initial step of FBSM, we initialize the control function $u_{0:T-dt}^0$ and obtain $p_{0:T}^0$ by computing the FP equation forward in time from the initial condition. In the backward step, we obtain $w_{0:T}^1$ by computing the HJB equation backward in time from the terminal condition and simultaneously update the control function from $u_{0:T-dt}^0$ to $u_{0:T-dt}^1$ by minimizing the conditional expected Hamiltonian. In the forward step, we obtain $p_{0:T}^2$ by computing the FP equation forward in time from the initial condition and simultaneously update the control function from $u_{0:T-dt}^1$ to $u_{0:T-dt}^2$ by minimizing the conditional expected Hamiltonian. By iterating the backward and forward steps, the objective function of ML-POSC $J[u_{0:T-dt}^k]$ decreases monotonically and finally converges to a local minimum at which the control function $u_{0:T-dt}^k$ satisfies Pontryagin’s minimum principle.

Pontryagin’s minimum principle is only a necessary condition of the optimal control function, not a sufficient condition. Therefore, the control function obtained by FBSM is not necessarily the global optimum except in the case where the expected Hamiltonian is convex. Nevertheless, the control function obtained by FBSM is expected to be superior to most control functions because it is locally optimal.

FBSM has been used in deterministic control [32,34,35,38] and MFSC [39,40,41,42]. However, the convergence of FBSM for these problems is not guaranteed because the backward dynamics depend on the forward dynamics even without the optimal control function (Figure 1c,d). In contrast, the convergence of FBSM is guaranteed in ML-POSC because the backward HJB equation depends on the forward FP equation only through the optimal control function (Figure 1b). More specifically, in FBSM for ML-POSC, the objective function $J[u_{0:T-dt}^k]$ decreases monotonically, and the control function finally converges to one that satisfies Pontryagin’s minimum principle. In the following subsections, we prove this property of FBSM for ML-POSC.

Algorithm 1: Forward-Backward Sweep Method (FBSM)
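  • // Initial step //
  • $k \leftarrow 0$
  • $p_0^k(s) \leftarrow p_0(s)$
  • for $t = 0$ to $T - dt$ do
  •    Initialize $u_t^k(z)$
  •    $p_{t+dt}^k(s) \leftarrow p_t^k(s) + \mathcal{L}_{u_t^k}^\dagger p_t^k(s)\,dt$
  • end for
  • while $J[u_{0:T-dt}^k]$ does not converge do
  •    if $k$ is even then
  •      // Backward step //
  •      $w_T^{k+1}(s) \leftarrow g(s)$
  •      for $t = T - dt$ to $0$ do
  •         $u_t^{k+1}(z) \leftarrow \mathop{\mathrm{argmin}}_u \mathbb{E}_{p_t^k(x \mid z)}\left[H(t,s,u,w_{t+dt}^{k+1})\right]$
  •         $w_t^{k+1}(s) \leftarrow w_{t+dt}^{k+1}(s) + H(t,s,u_t^{k+1},w_{t+dt}^{k+1})\,dt$
  •      end for
  •    else
  •      // Forward step //
  •      $p_0^{k+1}(s) \leftarrow p_0(s)$
  •      for $t = 0$ to $T - dt$ do
  •         $u_t^{k+1}(z) \leftarrow \mathop{\mathrm{argmin}}_u \mathbb{E}_{p_t^{k+1}(x \mid z)}\left[H(t,s,u,w_{t+dt}^{k})\right]$
  •         $p_{t+dt}^{k+1}(s) \leftarrow p_t^{k+1}(s) + \mathcal{L}_{u_t^{k+1}}^\dagger p_t^{k+1}(s)\,dt$
  •      end for
  •    end if
  •    $k \leftarrow k + 1$
  • end while
  • return $u_{0:T-dt}^k$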

4.2. Preliminary

In this subsection, we show an important result in proving the convergence of FBSM for ML-POSC. We suppose that $u_{0:t-dt,t+dt:T-dt} := \{u_0,\ldots,u_{t-dt},u_{t+dt},\ldots,u_{T-dt}\}$ is given and only $u_t$ is optimized as follows:

$$u_t^* := \mathop{\mathrm{argmin}}_{u_t} J[u_{0:T-dt}]. \tag{30}$$

In ML-POSC, $u_t^*$ can be calculated as follows:

$$u_t^*(z) = \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(x \mid z)}\left[H(t,s,u_t,w_{t+dt})\right], \quad \text{a.s.}\ \forall z \in \mathbb{R}^{d_z}, \tag{31}$$

where $w_{t+dt}(s)$ is the solution of the following time-discretized HJB equation driven by $u_{t+dt:T-dt}$:

$$w_\tau(s) = w_{\tau+dt}(s) + H(\tau,s,u_\tau,w_{\tau+dt})\,dt, \quad \tau \in \{t+dt,\ldots,T-dt\}, \tag{32}$$

where $w_T(s) = g(s)$. $p_t(x \mid z) := p_t(s)/\int p_t(s)\,dx$ is the conditional probability density function of state $x$ given memory $z$, and $p_t(s)$ is the solution of the following time-discretized FP equation driven by $u_{0:t-dt}$:

$$p_{\tau+dt}(s) = p_\tau(s) + \mathcal{L}_{u_\tau}^\dagger p_\tau(s)\,dt, \quad \tau \in \{0,\ldots,t-dt\}, \tag{33}$$

where the initial condition is $p_0(s)$. Equation (31) is obtained in a similar way to Pontryagin’s minimum principle in Appendix C.5 and also by the time discretization method in Appendix C.6.

Importantly, $w_{t+dt}$ does not depend on $u_t$ in ML-POSC (Figure 3a), while $\lambda_{t+dt}$ and $w_{t+dt}$ depend on $u_t$ in deterministic control (Figure 3b) and MFSC (Figure 3c), respectively. Therefore, $u_t^*$ can be obtained without modifying $w_{t+dt}$ in ML-POSC, which is essentially different from deterministic control and MFSC. From this property, the convergence of FBSM is guaranteed in ML-POSC.

Figure 3.

Schematic diagram of the effect of updating the control function on the forward and backward dynamics in (a) ML-POSC, (b) deterministic control, and (c) MFSC. $w_{0:T}$, $p_{0:T}$, $\lambda_{0:T}$, and $s_{0:T}$ are the solutions of the HJB equation, the FP equation, the adjoint equation, and the state equation, respectively. $u_{0:T-dt}$ is a given control function. The arrows indicate the dependence of variables. The variable at the head of an arrow depends on the variable at the tail of the arrow. (a) In ML-POSC, while the update from $u_t$ to $u_t'$ (yellow) changes $w_{0:t}$ and $p_{t+dt:T}$ to $w_{0:t}'$ and $p_{t+dt:T}'$, respectively (red), it does not change $p_{0:t}$ and $w_{t+dt:T}$ (blue). From this property, the convergence of FBSM is guaranteed in ML-POSC. (b) In deterministic control, the update from $u_t$ to $u_t'$ (yellow) changes $\lambda_{t+dt:T}$ to $\lambda_{t+dt:T}'$ as well (red) because the adjoint equation depends on the state equation (green). Because FBSM does not take into account the change of $\lambda_{t+dt:T}$, the convergence of FBSM is not guaranteed in deterministic control. (c) In MFSC, the update from $u_t$ to $u_t'$ (yellow) changes $w_{t+dt:T}$ to $w_{t+dt:T}'$ as well (red) because the HJB equation depends on the FP equation (green). Because FBSM does not take into account the change of $w_{t+dt:T}$, the convergence of FBSM is not guaranteed in MFSC.

4.3. Monotonicity

In FBSM for ML-POSC, the objective function is monotonically non-increasing with respect to the update of the control function at each time step. More specifically,

$$J[u_{0:t-dt}^{k}, u_{t:T-dt}^{k+1}] \leq J[u_{0:t}^{k}, u_{t+dt:T-dt}^{k+1}] \tag{34}$$

is satisfied in the backward step, and

$$J[u_{0:t-dt}^{k+1}, u_{t:T-dt}^{k}] \geq J[u_{0:t}^{k+1}, u_{t+dt:T-dt}^{k}] \tag{35}$$

is satisfied in the forward step. We prove this result in Appendix C.7. Furthermore, in FBSM for ML-POSC, the objective function is monotonically non-increasing with respect to the update of the control function at each iteration step as follows:

$$J[u_{0:T-dt}^{k+1}] \leq J[u_{0:T-dt}^{k}]. \tag{36}$$

Equation (36) is obviously satisfied from Equations (34) and (35).
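The step from the per-time-step inequalities to the per-iteration inequality (36) can be spelled out as a telescoping chain. The expansion below is our own; it applies the backward-step inequality at $t = 0, dt, \ldots, T-dt$, and the forward-step inequality at the same times in the opposite reading direction.

```latex
% Backward step (k even): u_t is replaced by u_t^{k+1} for t = T-dt, ..., 0
\begin{align*}
J[u^{k+1}_{0:T-dt}]
  \;\leq\; J[u^{k}_{0},\, u^{k+1}_{dt:T-dt}]
  \;\leq\; \cdots
  \;\leq\; J[u^{k}_{0:T-2dt},\, u^{k+1}_{T-dt}]
  \;\leq\; J[u^{k}_{0:T-dt}].
\end{align*}
% Forward step (k odd): u_t is replaced by u_t^{k+1} for t = 0, ..., T-dt
\begin{align*}
J[u^{k}_{0:T-dt}]
  \;\geq\; J[u^{k+1}_{0},\, u^{k}_{dt:T-dt}]
  \;\geq\; \cdots
  \;\geq\; J[u^{k+1}_{0:T-2dt},\, u^{k}_{T-dt}]
  \;\geq\; J[u^{k+1}_{0:T-dt}].
\end{align*}
```

In both cases the two ends of the chain give $J[u^{k+1}_{0:T-dt}] \leq J[u^{k}_{0:T-dt}]$.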

4.4. Convergence to Pontryagin’s Minimum Principle

We assume that $J[u_{0:T-dt}]$ has a lower bound. From Equation (36), FBSM for ML-POSC is then guaranteed to converge to a local minimum. Furthermore, we assume that if the candidates for $u_t^{k+1}$ include $u_t^k$, then $u_t^{k+1}$ is set to $u_t^k$. Under these assumptions, FBSM for ML-POSC converges to a control function that satisfies Pontryagin’s minimum principle (20). More specifically, if $J[u_{0:T-dt}^{k+1}] = J[u_{0:T-dt}^{k}]$ holds, $u_{0:T-dt}^{k+1}$ satisfies Pontryagin’s minimum principle (20). We prove this result in Appendix C.8.

Therefore, unlike deterministic control and MFSC, in FBSM for ML-POSC, the objective function $J[u_{0:T-dt}^k]$ decreases monotonically and finally converges to a local minimum at which the control function $u_{0:T-dt}^k$ satisfies Pontryagin’s minimum principle (20).

5. Linear-Quadratic-Gaussian Problem

In this section, we apply FBSM to the LQG problem of ML-POSC [14]. In the LQG problem of ML-POSC, the system of HJB-FP equations is reduced from partial differential equations to ordinary differential equations.

5.1. Problem Formulation

In the LQG problem of ML-POSC, the extended state SDE (8) is given as follows [14]:

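$$ds_t = \left(A(t)s_t + B(t)u_t\right)dt + \sigma(t)\,d\omega_t, \tag{37}$$

where $s_0$ obeys the Gaussian distribution $p_0(s_0) := \mathcal{N}(s_0 \mid \mu_0, \Lambda_0)$, where $\mu_0$ is the mean vector and $\Lambda_0$ is the precision matrix. The objective function (11) is given as follows:

$$J[u] := \mathbb{E}_{p(s_{0:T};u)}\left[\int_0^T \left(s_t^\top Q(t)s_t + u_t^\top R(t)u_t\right)dt + s_T^\top P s_T\right], \tag{38}$$

where $Q(t) \succeq O$, $R(t) \succ O$, and $P \succeq O$. The LQG problem of ML-POSC is the problem of finding the optimal control function $u^*$ that minimizes the objective function $J[u]$ as follows:

$$u^* := \mathop{\mathrm{argmin}}_{u} J[u]. \tag{39}$$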

5.2. Pontryagin’s Minimum Principle

In the LQG problem of ML-POSC, Pontryagin’s minimum principle (20) can be calculated as follows [14]:

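$$u^*(t,z) = -R^{-1}B^\top\left(\Pi K(\Lambda)(s-\mu) + \Psi\mu\right), \quad \text{a.s.}\ \forall t \in [0,T],\ \forall z \in \mathbb{R}^{d_z}, \tag{40}$$

where $K(\Lambda)$ is defined as follows:

$$K(\Lambda) := \begin{pmatrix} O & -\Lambda_{xx}^{-1}\Lambda_{xz} \\ O & I \end{pmatrix}, \tag{41}$$

where $\mu(t)$ and $\Lambda(t)$ are the mean vector and the precision matrix of the extended state, respectively, which correspond to the solution of the FP Equation (22). We note that $\mathbb{E}_{p_t(x \mid z)}[s] = K(\Lambda)(s-\mu)+\mu$ is satisfied. $\mu(t)$ and $\Lambda(t)$ are the solutions of the following ordinary differential equations (ODEs):

$$\dot{\mu} = \left(A - BR^{-1}B^\top\Psi\right)\mu, \tag{42}$$
$$\dot{\Lambda} = -\left(A - BR^{-1}B^\top\Pi K(\Lambda)\right)^\top\Lambda - \Lambda\left(A - BR^{-1}B^\top\Pi K(\Lambda)\right) - \Lambda\sigma\sigma^\top\Lambda, \tag{43}$$

where $\mu(0) = \mu_0$ and $\Lambda(0) = \Lambda_0$. $\Psi(t)$ and $\Pi(t)$ are the control gain matrices of the deterministic and stochastic extended state, respectively, which correspond to the solution of the HJB Equation (21). $\Psi(t)$ and $\Pi(t)$ are the solutions of the following ODEs:

$$-\dot{\Psi} = Q + A^\top\Psi + \Psi A - \Psi BR^{-1}B^\top\Psi, \tag{44}$$
$$-\dot{\Pi} = Q + A^\top\Pi + \Pi A - \Pi BR^{-1}B^\top\Pi + (I-K(\Lambda))^\top\Pi BR^{-1}B^\top\Pi(I-K(\Lambda)), \tag{45}$$

where $\Psi(T) = \Pi(T) = P$. The ODE of $\Psi$ (44) is the Riccati equation [16,17,18], which also appears in the LQG problem of COSC. In contrast, the ODE of $\Pi$ (45) is the partially observable Riccati equation [14], which appears only in the LQG problem of ML-POSC. The above result is obtained in [14].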

The ODE of Ψ (44) can be solved backward in time from the terminal condition. Using Ψ, the ODE of μ (42) can be solved forward in time from the initial condition. In contrast, the ODEs of Π (45) and Λ (43) cannot be solved in a similar way as the ODEs of Ψ (44) and μ (42) because they interact with each other, which is a similar problem to the system of HJB-FP equations.
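The matrix $K(\Lambda)$ of Equation (41) encodes the Gaussian conditional expectation of the extended state given the memory, written in terms of the precision matrix. The following sketch checks that relation numerically against the more familiar covariance form $\mu_x + \Sigma_{xz}\Sigma_{zz}^{-1}(z - \mu_z)$. The helper name `K_of`, the dimensions ($d_x = 1$, $d_z = 2$), and the random test values are assumptions made for illustration.

```python
import numpy as np

def K_of(Lam, dx):
    """K(Lambda) of Eq. (41); dx is the dimension of x inside s = (x, z)."""
    ds = Lam.shape[0]
    K = np.zeros((ds, ds))
    K[:dx, dx:] = -np.linalg.solve(Lam[:dx, :dx], Lam[:dx, dx:])  # -Lam_xx^{-1} Lam_xz
    K[dx:, dx:] = np.eye(ds - dx)
    return K

rng = np.random.default_rng(0)
dx = 1
M = rng.standard_normal((3, 3))
Sigma = M @ M.T + 3.0 * np.eye(3)   # arbitrary positive-definite covariance
Lam = np.linalg.inv(Sigma)          # corresponding precision matrix
mu = rng.standard_normal(3)
s = rng.standard_normal(3)          # s = (x, z); only z enters the result

# Precision form: E[s | z] = K(Lambda) (s - mu) + mu
cond_mean = K_of(Lam, dx) @ (s - mu) + mu

# Covariance form of E[x | z] for a joint Gaussian
x_hat = mu[:dx] + Sigma[:dx, dx:] @ np.linalg.solve(Sigma[dx:, dx:], s[dx:] - mu[dx:])
print(cond_mean, x_hat)
```

Because the first block column of $K(\Lambda)$ is zero, the result does not depend on the $x$ component of `s`, which is consistent with the control (40) being a function of $z$ only.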

5.3. Forward-Backward Sweep Method

In the LQG problem of ML-POSC, FBSM is reduced from Algorithm 1 to Algorithm 2. F(Λ,Π) and G(Λ,Π) are defined by the right-hand sides of the ODEs of Λ (43) and Π (45), respectively, as follows:

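$$\begin{aligned}
F(\Lambda,\Pi) &:= -\left(A - BR^{-1}B^\top\Pi K(\Lambda)\right)^\top\Lambda - \Lambda\left(A - BR^{-1}B^\top\Pi K(\Lambda)\right) - \Lambda\sigma\sigma^\top\Lambda,\\
G(\Lambda,\Pi) &:= Q + A^\top\Pi + \Pi A - \Pi BR^{-1}B^\top\Pi + (I-K(\Lambda))^\top\Pi BR^{-1}B^\top\Pi(I-K(\Lambda)).
\end{aligned}$$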

This result is obtained in Appendix C.9. Importantly, in the LQG problem of ML-POSC, FBSM computes the ODEs of Λ (43) and Π (45) instead of the FP Equation (22) and the HJB Equation (21).

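Algorithm 2: Forward-Backward Sweep Method (FBSM) in the LQG problem
  • // Initial step //
  • $k \leftarrow 0$
  • $\Lambda_0^k \leftarrow \Lambda_0$
  • for $t = 0$ to $T - dt$ do
  •    Initialize $\Pi_{t+dt}^k$
  •    $\Lambda_{t+dt}^k \leftarrow \Lambda_t^k + F(\Lambda_t^k, \Pi_{t+dt}^k)\,dt$
  • end for
  • while $J[u^k_{0:T-dt}]$ does not converge do
  •    if $k$ is even then
  •      // Backward step //
  •      $\Pi_T^{k+1} \leftarrow P$
  •      for $t = T - dt$ to $0$ do
  •         $\Pi_t^{k+1} \leftarrow \Pi_{t+dt}^{k+1} + G(\Lambda_t^k, \Pi_{t+dt}^{k+1})\,dt$
  •      end for
  •    else
  •      // Forward step //
  •      $\Lambda_0^{k+1} \leftarrow \Lambda_0$
  •      for $t = 0$ to $T - dt$ do
  •         $\Lambda_{t+dt}^{k+1} \leftarrow \Lambda_t^{k+1} + F(\Lambda_t^{k+1}, \Pi_{t+dt}^k)\,dt$
  •      end for
  •    end if
  •    $k \leftarrow k + 1$
  • end while
  • return $u^k_{0:T-dt}$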
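The alternation in Algorithm 2 can be sketched in Python as below. This is a hedged illustration under several assumptions that are not in the original: constant matrices, explicit Euler steps, and a fixed number of sweeps `n_sweeps` in place of the convergence test on $J$. The names `K_of`, `F`, `G`, and `fbsm_lqg` are hypothetical. The example call uses the matrices of the LQG experiment in Section 6.1 with a shortened horizon.

```python
import numpy as np

def K_of(Lam, dx):
    """K(Lambda) from (41) for an extended state s = (x, z) with dim(x) = dx."""
    d = Lam.shape[0]
    K = np.zeros((d, d))
    K[:dx, dx:] = -np.linalg.solve(Lam[:dx, :dx], Lam[:dx, dx:])
    K[dx:, dx:] = np.eye(d - dx)
    return K

def F(Lam, Pi, A, S, sig, dx):
    """Right-hand side of the precision ODE (43); S = B R^{-1} B^T."""
    Acl = A - S @ Pi @ K_of(Lam, dx)
    return -Acl.T @ Lam - Lam @ Acl - Lam @ sig @ sig.T @ Lam

def G(Lam, Pi, A, S, Q, dx):
    """Right-hand side of the partially observable Riccati ODE (45)."""
    IK = np.eye(A.shape[0]) - K_of(Lam, dx)
    PSP = Pi @ S @ Pi
    return Q + A.T @ Pi + Pi @ A - PSP + IK.T @ PSP @ IK

def fbsm_lqg(A, B, Q, R, P, sig, Lam0, dx, T, n_steps, n_sweeps):
    """Alternate backward sweeps for Pi and forward sweeps for Lambda."""
    dt = T / n_steps
    d = A.shape[0]
    S = B @ np.linalg.inv(R) @ B.T

    def forward(Pi):
        Lam = np.empty((n_steps + 1, d, d))
        Lam[0] = Lam0                        # initial condition Lambda(0)
        for i in range(n_steps):
            Lam[i + 1] = Lam[i] + F(Lam[i], Pi[i + 1], A, S, sig, dx) * dt
        return Lam

    def backward(Lam):
        Pi = np.empty((n_steps + 1, d, d))
        Pi[-1] = P                           # terminal condition Pi(T) = P
        for i in range(n_steps - 1, -1, -1):
            Pi[i] = Pi[i + 1] + G(Lam[i], Pi[i + 1], A, S, Q, dx) * dt
        return Pi

    Pi = np.zeros((n_steps + 1, d, d))       # initial step: Pi^0(t) = O
    Lam = forward(Pi)
    for k in range(n_sweeps):                # fixed sweep count (assumption)
        if k % 2 == 0:
            Pi = backward(Lam)
        else:
            Lam = forward(Pi)
    return Pi, Lam

if __name__ == "__main__":
    A = np.array([[1.0, 0.0], [1.0, 0.0]])
    I2 = np.eye(2)
    Pi, Lam = fbsm_lqg(A, I2, np.diag([1.0, 0.0]), I2, np.zeros((2, 2)),
                       I2, I2, dx=1, T=1.0, n_steps=200, n_sweeps=10)
    print(Pi[0])
```

A production version would more likely monitor the objective, as Algorithm 2 does, and use a higher-order integrator; the sketch only shows how each sweep reuses the trajectory stored by the previous one.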

6. Numerical Experiments

In this section, we verify the convergence of FBSM in ML-POSC by performing numerical experiments on the LQG and non-LQG problems. The setting of the numerical experiments is the same as the previous work [14].

6.1. LQG Problem

In this subsection, we verify the convergence of FBSM for ML-POSC by conducting a numerical experiment on the LQG problem. We consider state $x_t \in \mathbb{R}$, observation $y_t \in \mathbb{R}$, and memory $z_t \in \mathbb{R}$, which evolve by the following SDEs:

$$dx_t = (x_t + u_t)\,dt + d\omega_t, \tag{46}$$
$$dy_t = x_t\,dt + d\nu_t, \tag{47}$$
$$dz_t = v_t\,dt + dy_t, \tag{48}$$

where $x_0$ and $z_0$ obey the standard Gaussian distributions, $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t = u(t,z_t) \in \mathbb{R}$ and $v_t = v(t,z_t) \in \mathbb{R}$ are the controls. The objective function to be minimized is given as follows:

$$J[u,v] := \mathbb{E}_{p(x_{0:10},y_{0:10},z_{0:10};u,v)}\left[\int_0^{10}\left(x_t^2 + u_t^2 + v_t^2\right)dt\right]. \tag{49}$$

Therefore, the objective of this problem is to minimize the state variance with small state and memory controls.

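This problem corresponds to the LQG problem, which is defined by (37) and (38). By defining $s_t := (x_t, z_t)^\top \in \mathbb{R}^2$, $\tilde u_t := (u_t, v_t)^\top \in \mathbb{R}^2$, and $\tilde\omega_t := (\omega_t, \nu_t)^\top \in \mathbb{R}^2$, the SDEs (46)–(48) can be rewritten as follows:

$$ds_t = \left(\begin{pmatrix} 1 & 0 \\ 1 & 0 \end{pmatrix}s_t + \tilde u_t\right)dt + d\tilde\omega_t, \tag{50}$$

which corresponds to (37). Furthermore, the objective function (49) can be rewritten as follows:

$$J[\tilde u] := \mathbb{E}_{p(s_{0:10};\tilde u)}\left[\int_0^{10}\left(s_t^\top\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}s_t + \tilde u_t^\top\tilde u_t\right)dt\right], \tag{51}$$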

which corresponds to (38).

We apply the FBSM of the LQG problem (Algorithm 2) to this problem. $\Pi^0(t)$ is initialized by $\Pi^0(t)=O$. To solve the ODEs of $\Pi^k(t)$ and $\Lambda^k(t)$, we use the fourth-order Runge–Kutta method. Figure 4 shows the control gain matrix $\Pi^k(t) \in \mathbb{R}^{2\times 2}$ and the precision matrix $\Lambda^k(t) \in \mathbb{R}^{2\times 2}$ obtained by FBSM. The color of each curve represents the iteration $k$. The darkest curve corresponds to the first iteration $k=0$, and the brightest curve corresponds to the last iteration $k=50$. Importantly, $\Pi^k(t)$ and $\Lambda^k(t)$ converge with respect to the iteration $k$.

Figure 4. The elements of the control gain matrix $\Pi^k(t) \in \mathbb{R}^{2\times 2}$ (a–c) and the precision matrix $\Lambda^k(t) \in \mathbb{R}^{2\times 2}$ (d–f) obtained by FBSM (Algorithm 2) in the numerical experiment of the LQG problem of ML-POSC. Because $\Pi^k_{zx}(t)=\Pi^k_{xz}(t)$ and $\Lambda^k_{zx}(t)=\Lambda^k_{xz}(t)$, $\Pi^k_{zx}(t)$ and $\Lambda^k_{zx}(t)$ are not visualized. The darkest curve corresponds to the first iteration $k=0$, and the brightest curve corresponds to the last iteration $k=50$. $\Pi^0(t)$ is initialized by $\Pi^0(t)=O$.

Figure 5a shows the objective function $J[u^k]$ with respect to iteration $k$. The objective function $J[u^k]$ decreases monotonically with respect to iteration $k$, which is consistent with Section 4.3. This monotonicity of FBSM is a nice property of ML-POSC that is not guaranteed in deterministic control and MFSC. The objective function $J[u^k]$ finally converges, and $u^k$ satisfies Pontryagin’s minimum principle from Section 4.4.

Figure 5. Performance of FBSM in the numerical experiment of the LQG problem of ML-POSC. (a) The objective function $J[u^k]$ with respect to the iteration $k$. (b–d) Stochastic simulation of state $x_t$ (b), memory $z_t$ (c), and the cumulative cost (d) for 100 samples. The expectation of the cumulative cost at $t=10$ corresponds to the objective function (49). Blue and orange curves correspond to the first iteration $k=0$ and the last iteration $k=50$, respectively.

Figure 5b–d compare the performance of the control function $u^k$ at the first iteration $k=0$ and the last iteration $k=50$ by performing a stochastic simulation. At the first iteration $k=0$, the distributions of state and memory are unstable, and the cumulative cost diverges. In contrast, at the last iteration $k=50$, the distributions of state and memory are stabilized and the cumulative cost is smaller. This result indicates that FBSM improves the performance in ML-POSC.
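The instability at $k=0$ can be reproduced with a short Euler–Maruyama simulation. The sketch below assumes that the $k=0$ controls vanish, $u=v=0$, which follows from (40) and (42) when $\Pi^0=O$ and the initial mean is zero. The function name `simulate_uncontrolled`, the step size, and the seed are illustrative choices, not taken from the original experiment.

```python
import numpy as np

def simulate_uncontrolled(n_samples=100, T=10.0, n_steps=1000, seed=0):
    """Euler-Maruyama paths of (46)-(48) with u = v = 0, plus the cumulative cost of (49)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.empty((n_samples, n_steps + 1))
    z = np.empty((n_samples, n_steps + 1))
    cost = np.zeros((n_samples, n_steps + 1))
    x[:, 0] = rng.standard_normal(n_samples)   # x_0 ~ N(0, 1)
    z[:, 0] = rng.standard_normal(n_samples)   # z_0 ~ N(0, 1)
    for i in range(n_steps):
        d_omega = np.sqrt(dt) * rng.standard_normal(n_samples)
        d_nu = np.sqrt(dt) * rng.standard_normal(n_samples)
        dy = x[:, i] * dt + d_nu                         # observation increment (47)
        x[:, i + 1] = x[:, i] + x[:, i] * dt + d_omega   # state (46) with u = 0
        z[:, i + 1] = z[:, i] + dy                       # memory (48) with v = 0
        cost[:, i + 1] = cost[:, i] + x[:, i] ** 2 * dt  # running cost with u = v = 0
    return x, z, cost

if __name__ == "__main__":
    x, z, cost = simulate_uncontrolled()
    print(cost[:, -1].mean())
```

Because the uncontrolled drift of $x_t$ in (46) is $+x_t$, the sample paths grow exponentially on average, which is the divergence of the cumulative cost described above.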

Although Figure 5b–d look similar to Figure 2d–f in the previous work [14], they are comparing different things. While Figure 5b–d demonstrate the performance improvement by the FBSM iteration, the previous work [14] compares the performance of the partially observable Riccati Equation (45) with that of the conventional Riccati Equation (44).

6.2. Non-LQG Problem

In this subsection, we verify the convergence of FBSM in ML-POSC by conducting a numerical experiment on the non-LQG problem. We consider state $x_t \in \mathbb{R}$, observation $y_t \in \mathbb{R}$, and memory $z_t \in \mathbb{R}$, which evolve by the following SDEs:

$$dx_t = u_t\,dt + d\omega_t, \tag{52}$$
$$dy_t = x_t\,dt + d\nu_t, \tag{53}$$
$$dz_t = dy_t, \tag{54}$$

where $x_0$ and $z_0$ obey the Gaussian distributions $p_0(x_0)=\mathcal{N}(x_0|0,0.01)$ and $p_0(z_0)=\mathcal{N}(z_0|0,0.01)$, respectively. $y_0$ is an arbitrary real number, $\omega_t \in \mathbb{R}$ and $\nu_t \in \mathbb{R}$ are independent standard Wiener processes, and $u_t=u(t,z_t) \in \mathbb{R}$ is the control. For the sake of simplicity, memory control is not considered. The objective function to be minimized is given as follows:

$$J[u] := \mathbb{E}_{p(x_{0:1},y_{0:1},z_{0:1};u)}\left[\int_0^1\left(Q(t,x_t) + u_t^2\right)dt + 10x_1^2\right], \tag{55}$$

where

$$Q(t,x) := \begin{cases} 1000 & (0.3 \le t \le 0.6,\ 0.1 \le |x| \le 2.0), \\ 0 & (\text{others}). \end{cases} \tag{56}$$

The cost function is high in $0.3 \le t \le 0.6$ and $0.1 \le |x| \le 2.0$, which represents the obstacles. In addition, the terminal cost function is the lowest at $x=0$, which represents the desirable goal. Therefore, the system should avoid the obstacles and reach the goal with a small control. Because the cost function is non-quadratic, it is a non-LQG problem.
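The obstacle and goal costs above translate directly into code. The following is a small sketch of (56) and the terminal term of (55); the function names `running_cost` and `terminal_cost` are hypothetical.

```python
def running_cost(t, x):
    """Obstacle cost Q(t, x) from (56): 1000 inside the obstacle region, 0 elsewhere."""
    in_time = 0.3 <= t <= 0.6
    in_space = 0.1 <= abs(x) <= 2.0
    return 1000.0 if (in_time and in_space) else 0.0

def terminal_cost(x):
    """Terminal cost 10 x^2 from (55), minimized at the goal x = 0."""
    return 10.0 * x ** 2

if __name__ == "__main__":
    print(running_cost(0.45, 0.5), running_cost(0.45, 0.05), terminal_cost(0.0))
```

The gap $|x| < 0.1$ is the only passage through the obstacles during $0.3 \le t \le 0.6$, so a trajectory must either stay near $x=0$ or pay the obstacle cost.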

We apply the FBSM (Algorithm 1) to this problem. $u^0(t,z)$ is initialized by $u^0(t,z)=0$. To solve the HJB equation and the FP equation, we use the finite-difference method. Figure 6 shows $w^k(t,s)$ and $p^k(t,s)$ obtained by FBSM at the first iteration $k=0$ and at the last iteration $k=50$. From Appendix C.6, $w^k(t,s)$ is given as follows:

$$w^k(t,s) = \mathbb{E}_{p(s_{t+dt:1}|s_t=s;u^k)}\left[\int_t^1\left(Q(\tau,x_\tau) + (u^k_\tau)^2\right)d\tau + 10x_1^2\right]. \tag{57}$$

Because $u^0(t,z)=0$, $w^0(t,s)$ reflects the cost function corresponding to the obstacles and the goal (Figure 6a–e). In contrast, because $u^{50}(t,z) \ne 0$, $w^{50}(t,s)$ becomes more complex (Figure 6f–j). In particular, while $w^0(t,s)$ does not depend on memory $z$, $w^{50}(t,s)$ depends on memory $z$, which indicates that the control function $u^{50}(t,z)$ is adjusted by the memory $z$. We note that $w^0(1,s)$ (Figure 6e) and $w^{50}(1,s)$ (Figure 6j) are the same because they are given by the terminal cost function as $w^0(1,s)=w^{50}(1,s)=10x^2$. Furthermore, while $p^0(t,s)$ is a unimodal distribution (Figure 6k–o), $p^{50}(t,s)$ is a bimodal distribution (Figure 6p–t), which can avoid the obstacles.

Figure 6. The solutions of the HJB equation $w^k(t,s)$ (a–j) and the FP equation $p^k(t,s)$ (k–t) at the first iteration $k=0$ (a–e,k–o) and at the last iteration $k=50$ (f–j,p–t) of FBSM (Algorithm 1) in the numerical experiment of the non-LQG problem of ML-POSC. $u^0(t,z)$ is initialized by $u^0(t,z)=0$.

Figure 7a shows the objective function $J[u^k]$ with respect to iteration $k$. The objective function $J[u^k]$ decreases monotonically with respect to iteration $k$, which is consistent with Section 4.3. This monotonicity of FBSM is a nice property of ML-POSC that is not guaranteed in deterministic control and MFSC. The objective function $J[u^k]$ finally converges, and $u^k$ satisfies Pontryagin’s minimum principle from Section 4.4.

Figure 7. Performance of FBSM in the numerical experiment of the non-LQG problem of ML-POSC. (a) The objective function $J[u^k]$ with respect to the iteration $k$. (b) Stochastic simulation of the state $x_t$ for 100 samples. The black rectangles and the cross represent the obstacles and the goal, respectively. Blue and orange curves correspond to the first iteration $k=0$ and the last iteration $k=50$, respectively. (c) The objective function (55), which is computed from 100 samples.

Figure 7b,c compare the performance of the control function $u^k$ at the first iteration $k=0$ and the last iteration $k=50$ by conducting the stochastic simulation. At the first iteration $k=0$, the obstacles cannot be avoided, which results in a higher objective function. In contrast, at the last iteration $k=50$, the obstacles can be avoided, which results in a lower objective function. This result indicates that FBSM improves the performance in ML-POSC.

Although Figure 7b,c look similar to Figure 3a,b in the previous work [14], they are comparing different things. While Figure 7b,c demonstrate the performance improvement by the FBSM iteration, the previous work [14] compares the performance of ML-POSC with the local LQG approximation of the conventional POSC.

7. Discussion

In this work, we first showed that the system of HJB-FP equations corresponds to Pontryagin’s minimum principle on the probability density function space. Although the relationship between the system of HJB-FP equations and Pontryagin’s minimum principle has been briefly mentioned in MFSC [29,30,31], its details have not yet been investigated. We addressed this problem by deriving the system of HJB-FP equations in a similar way to Pontryagin’s minimum principle. We then proposed FBSM for ML-POSC. Although the convergence of FBSM is generally not guaranteed in deterministic control [32,34,35,38] and MFSC [39,40,41,42], we proved its convergence in ML-POSC by noting that the update of the current control function does not affect the future HJB equation in ML-POSC. Therefore, ML-POSC is a special class of problems in which FBSM is guaranteed to converge.

Our derivation of Pontryagin’s minimum principle on the probability density function space is formal, not analytical. Therefore, more mathematically rigorous proofs should be pursued in future work. Nevertheless, because our results are consistent with the conventional results of COSC [15,16,17,18], ML-POSC [14], and MFSC [26,27,28,30,31], they would be reliable except for special cases. Furthermore, our results provide a unified perspective on FBSM in deterministic control [32,34,35,38] and the fixed-point iteration method in MFSC [39,40,41,42], which have been studied independently. It clarifies the different properties of ML-POSC from deterministic control and MFSC, which ensures the convergence of FBSM.

The regularized FBSM has recently been proposed in deterministic control, which is guaranteed to converge even in the general deterministic control [44,45]. Our work gives an intuitive reason why the regularized FBSM is guaranteed to converge. In the regularized FBSM, the Hamiltonian is regularized, which makes the update of the control function smaller. When the regularization is sufficiently strong, the effect of the current control function on the future backward dynamics would be negligible. Therefore, the regularized FBSM of deterministic control would be guaranteed to converge for a similar reason to the FBSM of ML-POSC. However, the convergence of the regularized FBSM is much slower because the stronger regularization makes the update of the control function smaller. The FBSM of ML-POSC does not suffer from such a problem because the future backward dynamics already do not depend on the current control function without regularization.

Our work gives a hint about a modification of the fixed-point iteration method to ensure convergence in MFSC. Although the fixed-point iteration method is the most basic algorithm in MFSC, its convergence is not guaranteed [39,40,41,42]. Our work showed that the fixed-point iteration method is equivalent to the FBSM on the probability density function space. Therefore, the idea of regularized FBSM may also be applied to the fixed-point iteration method. More specifically, the fixed-point iteration method may be guaranteed to converge by regularizing the expected Hamiltonian.

In FBSM, we solve the HJB equation and the FP equation using the finite-difference method. However, because the finite-difference method is prone to the curse of dimensionality, it is difficult to solve high-dimensional ML-POSC. To resolve this problem, two directions can be considered. One direction is the policy iteration method [21,46,47]. The policy iteration method is almost the same as FBSM; only the update of the control function differs. While FBSM updates the system of HJB-FP equations and the control function simultaneously, the policy iteration method updates them separately. In the policy iteration method, the system of HJB-FP equations becomes linear, which can be solved by the sampling method [48,49,50]. Because the sampling method is more tractable than the finite-difference method, the policy iteration method may allow high-dimensional ML-POSC to be solved. Furthermore, the policy iteration method has recently been studied in MFSC [51,52,53]. However, its convergence is not guaranteed except for special cases in MFSC. As with FBSM, the convergence of the policy iteration method may be guaranteed in ML-POSC.

The other direction is machine learning. Neural network-based algorithms have recently been proposed in MFSC, which can solve high-dimensional problems efficiently [54,55]. By extending these algorithms, high-dimensional ML-POSC may be solved efficiently. Furthermore, unlike MFSC, the coupling of the HJB-FP equations is limited only to the optimal control function in ML-POSC. By exploiting this nice property, more efficient algorithms may be devised for ML-POSC.

Abbreviations

The following abbreviations are used in this manuscript:

COSC Completely Observable Stochastic Control
POSC Partially Observable Stochastic Control
ML-POSC Memory-Limited Partially Observable Stochastic Control
MFSC Mean-Field Stochastic Control
FBSM Forward-Backward Sweep Method
HJB Hamilton-Jacobi-Bellman
FP Fokker-Planck
SDE Stochastic Differential Equation
ODE Ordinary Differential Equation
LQG Linear-Quadratic-Gaussian

Appendix A. Deterministic Control

In this section, we briefly review Pontryagin’s minimum principle in deterministic control [22,23,24,25].

Appendix A.1. Problem Formulation

In this subsection, we formulate deterministic control [22,23,24,25]. The state of the system $s_t \in \mathbb{R}^{d_s}$ at time $t \in [0,T]$ evolves according to the following ordinary differential equation (ODE):

$$\frac{ds_t}{dt} = b(t,s_t,u_t), \tag{A1}$$

where the initial state is $s_0$, and the control is $u_t=u(t) \in \mathbb{R}^{d_u}$. The objective function is given by the following cumulative cost function:

$$J[u] := \int_0^T f(t,s_t,u_t)\,dt + g(s_T), \tag{A2}$$

where $f$ is the cost function and $g$ is the terminal cost function. Deterministic control is the problem of finding the optimal control function $u^*$ that minimizes the cumulative cost function $J[u]$ as follows:

$$u^* := \operatorname*{argmin}_u J[u]. \tag{A3}$$

Appendix A.2. Preliminary

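In this subsection, we show a useful result in deriving Pontryagin’s minimum principle. Given arbitrary control functions $u$ and $u'$, $J[u]-J[u']$ can be calculated as follows [16]:

$$J[u]-J[u'] = \int_0^T\left\{H(t,s_t,u_t,\lambda'_t) - H(t,s'_t,u'_t,\lambda'_t) - \left(\frac{\partial H(t,s'_t,u'_t,\lambda'_t)}{\partial s}\right)^\top(s_t-s'_t)\right\}dt + g(s_T) - g(s'_T) - \left(\frac{\partial g(s'_T)}{\partial s}\right)^\top(s_T-s'_T), \tag{A4}$$

where $H$ is the Hamiltonian, which is defined as follows:

$$H(t,s,u,\lambda) := f(t,s,u) + \lambda^\top b(t,s,u). \tag{A5}$$

$\lambda'_t$ is the solution of the following adjoint equation driven by $u'$:

$$-\frac{d\lambda'_t}{dt} = \frac{\partial H(t,s'_t,u'_t,\lambda'_t)}{\partial s}, \tag{A6}$$

where $\lambda'_T = \partial g(s'_T)/\partial s$. $s_t$ and $s'_t$ are the solutions of the state Equation (A1) driven by $u$ and $u'$, respectively.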

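In the following, we derive Equation (A4). $J[u]-J[u']$ can be calculated as follows:

$$\begin{aligned}
J[u]-J[u'] &= \left[\int_0^T f(t,s_t,u_t)\,dt + g(s_T)\right] - \left[\int_0^T f(t,s'_t,u'_t)\,dt + g(s'_T)\right]\\
&= \left[\int_0^T\left(H(t,s_t,u_t,\lambda'_t) - (\lambda'_t)^\top b(t,s_t,u_t)\right)dt + g(s_T)\right] - \left[\int_0^T\left(H(t,s'_t,u'_t,\lambda'_t) - (\lambda'_t)^\top b(t,s'_t,u'_t)\right)dt + g(s'_T)\right]\\
&= \int_0^T\left(H(t,s_t,u_t,\lambda'_t) - H(t,s'_t,u'_t,\lambda'_t)\right)dt - \int_0^T(\lambda'_t)^\top\left(b(t,s_t,u_t) - b(t,s'_t,u'_t)\right)dt + g(s_T) - g(s'_T).
\end{aligned} \tag{A7}$$

From the state Equation (A1),

$$J[u]-J[u'] = \int_0^T\left(H(t,s_t,u_t,\lambda'_t) - H(t,s'_t,u'_t,\lambda'_t)\right)dt - \int_0^T(\lambda'_t)^\top\frac{d(s_t-s'_t)}{dt}\,dt + g(s_T) - g(s'_T). \tag{A8}$$

From the integration by parts and $s_0-s'_0=0$,

$$J[u]-J[u'] = \int_0^T\left(H(t,s_t,u_t,\lambda'_t) - H(t,s'_t,u'_t,\lambda'_t)\right)dt + \int_0^T\left(\frac{d\lambda'_t}{dt}\right)^\top(s_t-s'_t)\,dt + g(s_T) - g(s'_T) - (\lambda'_T)^\top(s_T-s'_T). \tag{A9}$$

From the adjoint Equation (A6), Equation (A4) is obtained.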
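Equation (A4) can be sanity-checked numerically on a toy problem. The sketch below assumes a scalar system with $b(t,s,u)=u$, $f(t,s,u)=s^2+u^2$, $g=0$, and constant controls; these choices and the function name `check_identity_a4` are illustrative and do not come from the original text.

```python
import numpy as np

def check_identity_a4(u, u_ref, s0=1.0, T=1.0, n=20000):
    """Compare both sides of (A4) for ds/dt = u, f = s^2 + u^2, g = 0 (assumed toy problem)."""
    dt = T / n
    t = np.arange(n) * dt                  # left endpoints of the time grid
    s = s0 + u * t                         # state driven by the constant control u
    s_ref = s0 + u_ref * t                 # state driven by the constant control u'
    # Adjoint driven by u': -dlam/dt = dH/ds = 2 s', lam(T) = dg/ds = 0,
    # so lam(t) is the integral of 2 s' over (t, T].
    c = np.cumsum(2.0 * s_ref) * dt
    lam = c[-1] - c

    def H(s_, u_):
        return s_ ** 2 + u_ ** 2 + lam * u_    # Hamiltonian (A5) with b = u

    # Left-hand side: J[u] - J[u'] computed directly from the costs.
    lhs = np.sum((s ** 2 + u ** 2) - (s_ref ** 2 + u_ref ** 2)) * dt
    # Right-hand side of (A4); the terminal terms vanish because g = 0.
    rhs = np.sum(H(s, u) - H(s_ref, u_ref) - 2.0 * s_ref * (s - s_ref)) * dt
    return lhs, rhs

if __name__ == "__main__":
    print(check_identity_a4(1.0, -0.5))
```

Both returned values approximate $J[u]-J[u']$, which equals $2.5$ for $u=1$, $u'=-0.5$, $s_0=1$, and $T=1$ in this toy problem.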

Appendix A.3. Necessary Condition

In this subsection, we show the necessary condition of the optimal control function of deterministic control. It corresponds to Pontryagin’s minimum principle on the state space (Figure 2 (bottom left)). If $u^*$ is the optimal control function of deterministic control (A3), then the following equation is satisfied [16]:

$$u^*(t) = \operatorname*{argmin}_u H(t,s^*_t,u,\lambda^*_t), \quad \forall t \in [0,T], \tag{A10}$$

where $\lambda^*_t$ is the solution of the following adjoint equation driven by $u^*$:

$$-\frac{d\lambda^*_t}{dt} = \frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}, \tag{A11}$$

where $\lambda^*_T = \partial g(s^*_T)/\partial s$. $s^*_t$ is the solution of the following state equation driven by $u^*$:

$$\frac{ds^*_t}{dt} = \frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial\lambda}, \tag{A12}$$

where $s^*_0=s_0$. Because $\partial H(t,s^*_t,u^*_t,\lambda^*_t)/\partial\lambda = b(t,s^*_t,u^*_t)$, Equation (A12) is consistent with Equation (A1).

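In the following, we show that Equation (A10) is the necessary condition of the optimal control function of deterministic control. We define the control function:

$$u^\varepsilon(t) := \begin{cases} u^*(t) & t \in [0,T] \setminus E_\varepsilon, \\ u(t) & t \in E_\varepsilon, \end{cases} \tag{A13}$$

where $E_\varepsilon := [t,t+\varepsilon] \subseteq [0,T]$, and $u:[0,T] \to \mathbb{R}^{d_u}$. From Equation (A4), $J[u^\varepsilon]-J[u^*]$ can be calculated as follows:

$$\begin{aligned}
J[u^\varepsilon]-J[u^*] &= \int_0^T\left\{H(t,s^\varepsilon_t,u^\varepsilon_t,\lambda^*_t) - H(t,s^*_t,u^*_t,\lambda^*_t) - \left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}\right)^\top(s^\varepsilon_t-s^*_t)\right\}dt + g(s^\varepsilon_T) - g(s^*_T) - \left(\frac{\partial g(s^*_T)}{\partial s}\right)^\top(s^\varepsilon_T-s^*_T)\\
&= \int_0^T\left\{H(t,s^\varepsilon_t,u^*_t,\lambda^*_t) - H(t,s^*_t,u^*_t,\lambda^*_t) - \left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}\right)^\top(s^\varepsilon_t-s^*_t)\right\}dt + g(s^\varepsilon_T) - g(s^*_T) - \left(\frac{\partial g(s^*_T)}{\partial s}\right)^\top(s^\varepsilon_T-s^*_T)\\
&\quad + \int_{E_\varepsilon}\left(H(t,s^\varepsilon_t,u_t,\lambda^*_t) - H(t,s^\varepsilon_t,u^*_t,\lambda^*_t)\right)dt.
\end{aligned} \tag{A14}$$

Letting $\varepsilon \to 0$,

$$\begin{aligned}
J[u^\varepsilon]-J[u^*] &= \int_0^T\left\{\left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}\right)^\top(s^\varepsilon_t-s^*_t) - \left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}\right)^\top(s^\varepsilon_t-s^*_t)\right\}dt\\
&\quad + \left(\frac{\partial g(s^*_T)}{\partial s}\right)^\top(s^\varepsilon_T-s^*_T) - \left(\frac{\partial g(s^*_T)}{\partial s}\right)^\top(s^\varepsilon_T-s^*_T) + \left(H(t,s^*_t,u_t,\lambda^*_t) - H(t,s^*_t,u^*_t,\lambda^*_t)\right)dt\\
&= \left(H(t,s^*_t,u_t,\lambda^*_t) - H(t,s^*_t,u^*_t,\lambda^*_t)\right)dt.
\end{aligned} \tag{A15}$$

Because $u^*$ is the optimal control function, the following inequality is satisfied:

$$0 \le J[u^\varepsilon]-J[u^*] = \left(H(t,s^*_t,u_t,\lambda^*_t) - H(t,s^*_t,u^*_t,\lambda^*_t)\right)dt. \tag{A16}$$

Therefore, Equation (A10) is the necessary condition of the optimal control function of deterministic control.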

Appendix A.4. Sufficient Condition

Pontryagin’s minimum principle (A10) is a necessary condition and generally not a sufficient condition. Pontryagin’s minimum principle (A10) becomes a necessary and sufficient condition if the Hamiltonian $H(t,s,u,\lambda)$ is convex with respect to $s$ and $u$ and the terminal cost function $g(s)$ is convex with respect to $s$.

In the following, we show this result. We define the arbitrary control function $u:[0,T] \to \mathbb{R}^{d_u}$. From Equation (A4), $J[u]-J[u^*]$ is given by the following equation:

$$J[u]-J[u^*] = \int_0^T\left\{H(t,s_t,u_t,\lambda^*_t) - H(t,s^*_t,u^*_t,\lambda^*_t) - \left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}\right)^\top(s_t-s^*_t)\right\}dt + g(s_T) - g(s^*_T) - \left(\frac{\partial g(s^*_T)}{\partial s}\right)^\top(s_T-s^*_T). \tag{A17}$$

Since $H(t,s,u,\lambda)$ is convex with respect to $s$ and $u$ and $g(s)$ is convex with respect to $s$, the following inequalities are satisfied:

$$H(t,s_t,u_t,\lambda^*_t) \ge H(t,s^*_t,u^*_t,\lambda^*_t) + \left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s}\right)^\top(s_t-s^*_t) + \left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial u}\right)^\top(u_t-u^*_t), \tag{A18}$$
$$g(s_T) \ge g(s^*_T) + \left(\frac{\partial g(s^*_T)}{\partial s}\right)^\top(s_T-s^*_T). \tag{A19}$$

Hence, the following inequality is satisfied:

$$J[u]-J[u^*] \ge \int_0^T\left(\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial u}\right)^\top(u_t-u^*_t)\,dt. \tag{A20}$$

Because $u^*$ satisfies (A10), the following stationary condition is satisfied:

$$\frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial u} = 0. \tag{A21}$$

Hence, the following inequality is satisfied:

$$J[u]-J[u^*] \ge 0. \tag{A22}$$

Therefore, Equation (A10) is the sufficient condition of the optimal control function of deterministic control if $H(t,s,u,\lambda)$ is convex with respect to $s$ and $u$ and $g(s)$ is convex with respect to $s$.

Appendix A.5. Relationship with Bellman’s Dynamic Programming Principle

From Bellman’s dynamic programming principle on the state space (Figure 2 (top left)) [16], the optimal control function of deterministic control is given by the following equation:

$$u^*(t,s) = \operatorname*{argmin}_u H\left(t,s,u,\frac{\partial w^*(t,s)}{\partial s}\right), \tag{A23}$$

where $w^*(t,s)$ is the value function on the state space, which is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation:

$$-\frac{\partial w^*(t,s)}{\partial t} = H\left(t,s,u^*,\frac{\partial w^*(t,s)}{\partial s}\right), \tag{A24}$$

where $w^*(T,s)=g(s)$. More specifically, the optimal control function of deterministic control is given by $u^*(t)=u^*(t,s^*_t)$, where $s^*_t$ is the solution of the state Equation (A12).

The HJB Equation (A24) can be converted into the adjoint Equation (A11) by defining

$$\lambda^*_t := \frac{\partial w^*(t,s^*_t)}{\partial s}, \tag{A25}$$

where $s^*_t$ is the solution of the state Equation (A12). This approach can be interpreted as the conversion from Bellman’s dynamic programming principle (Figure 2 (top left)) to Pontryagin’s minimum principle (Figure 2 (bottom left)) on the state space.

In the following, we obtain this result. First, we define

$$\Lambda^*(t,s) := \frac{\partial w^*(t,s)}{\partial s}. \tag{A26}$$

By differentiating the HJB Equation (A24) with respect to $s$, the following equation is obtained:

$$-\frac{\partial\Lambda^*(t,s)}{\partial t} = \frac{\partial H(t,s,u^*,\Lambda^*)}{\partial s} + \frac{\partial\Lambda^*(t,s)}{\partial s}\,b(t,s,u^*), \tag{A27}$$

where $\Lambda^*(T,s)=\partial g(s)/\partial s$. Then the derivative of $\lambda^*_t=\Lambda^*(t,s^*_t)$ with respect to $t$ can be calculated as follows:

$$\frac{d\lambda^*_t}{dt} = \frac{\partial\Lambda^*(t,s^*_t)}{\partial t} + \frac{\partial\Lambda^*(t,s^*_t)}{\partial s}\frac{ds^*_t}{dt}. \tag{A28}$$

By substituting Equation (A27) into Equation (A28), the following equation is obtained:

$$-\frac{d\lambda^*_t}{dt} = \frac{\partial H(t,s^*_t,u^*_t,\lambda^*_t)}{\partial s} - \frac{\partial\Lambda^*(t,s^*_t)}{\partial s}\underbrace{\left(\frac{ds^*_t}{dt} - b(t,s^*_t,u^*_t)\right)}_{(*)}. \tag{A29}$$

From the state Equation (A12), $(*)=0$ is satisfied. Therefore, $\lambda^*_t$ satisfies the adjoint Equation (A11).

Appendix B. Mean-Field Stochastic Control

In this section, we show that the system of HJB-FP equations in MFSC corresponds to Pontryagin’s minimum principle on the probability density function space. Although the relationship between the system of HJB-FP equations and Pontryagin’s minimum principle has been mentioned briefly in MFSC [29,30,31], its details have not yet been investigated. In this section, we address this problem by deriving the system of HJB-FP equations in a similar way to Appendix A. Although our derivations are formal, not analytical, our results are consistent with the conventional results of MFSC [26,27,28,30,31].

Appendix B.1. Problem Formulation

In this subsection, we formulate MFSC [26,27,28]. The state of the system $s_t \in \mathbb{R}^{d_s}$ at time $t \in [0,T]$ evolves by the following stochastic differential equation (SDE):

$$ds_t = b(t,s_t,p_t,u_t)\,dt + \sigma(t,s_t,p_t,u_t)\,d\omega_t, \tag{A30}$$

where $s_0$ obeys $p_0(s_0)$, $p_t(s):=p(t,s)$ is the probability density function of the state $s$, $u_t(s):=u(t,s) \in \mathbb{R}^{d_u}$ is the control, and $\omega_t \in \mathbb{R}^{d_\omega}$ is the standard Wiener process. The objective function is given by the following expected cumulative cost function:

$$J[u] := \mathbb{E}_{p(s_{0:T};u)}\left[\int_0^T f(t,s_t,p_t,u_t)\,dt + g(s_T,p_T)\right], \tag{A31}$$

where $f$ is the cost function, $g$ is the terminal cost function, $p(s_{0:T};u)$ is the probability of $s_{0:t}:=\{s_\tau \mid \tau \in [0,t]\}$ given $u$ as a parameter, and $\mathbb{E}_p[\cdot]$ is the expectation with respect to probability $p$. MFSC is the problem of finding the optimal control function $u^*$ that minimizes the expected cumulative cost function $J[u]$ as follows:

$$u^* := \operatorname*{argmin}_u J[u]. \tag{A32}$$

Appendix B.2. Preliminary

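In this subsection, we show a useful result in deriving Pontryagin’s minimum principle. Given arbitrary control functions $u$ and $u'$, $J[u]-J[u']$ can be calculated as follows:

$$J[u]-J[u'] = \int_0^T\left\{\bar H(t,p,u,w') - \bar H(t,p',u',w') - \int\frac{\delta\bar H(t,p',u',w')}{\delta p}(s)\left(p(t,s)-p'(t,s)\right)ds\right\}dt + \bar g(p) - \bar g(p') - \int\frac{\delta\bar g(p')}{\delta p}(s)\left(p(T,s)-p'(T,s)\right)ds, \tag{A33}$$

where $\bar H$ and $\bar g$ are the expected Hamiltonian and terminal cost function, respectively, which are defined as follows:

$$\bar H(t,p,u,w) := \mathbb{E}_{p(s)}\left[H(t,s,p,u,w)\right], \tag{A34}$$
$$\bar g(p) := \mathbb{E}_{p(s)}\left[g(s,p)\right]. \tag{A35}$$

$H$ is the Hamiltonian, which is defined as follows:

$$H(t,s,p,u,w) := f(t,s,p,u) + \mathcal{L}_u w(t,s). \tag{A36}$$

$\mathcal{L}_u$ is the backward diffusion operator, which is defined as follows:

$$\mathcal{L}_u w(t,s) := \sum_{i=1}^{d_s} b_i(t,s,p,u)\frac{\partial w(t,s)}{\partial s_i} + \frac{1}{2}\sum_{i,j=1}^{d_s} D_{ij}(t,s,p,u)\frac{\partial^2 w(t,s)}{\partial s_i\partial s_j}, \tag{A37}$$

where $D(t,s,p,u) := \sigma(t,s,p,u)\sigma^\top(t,s,p,u)$. $w'$ is the solution of the following Hamilton-Jacobi-Bellman (HJB) equation driven by $u'$:

$$-\frac{\partial w'(t,s)}{\partial t} = \frac{\delta\bar H(t,p',u',w')}{\delta p}(s), \tag{A38}$$

where $w'(T,s) = (\delta\bar g(p')/\delta p)(s)$. $p$ is the solution of the following Fokker-Planck (FP) equation driven by $u$:

$$\frac{\partial p(t,s)}{\partial t} = \mathcal{L}_u^\dagger p(t,s), \tag{A39}$$

where $p(0,s)=p_0(s)$. $p'$ is the solution of the FP Equation (A39) driven by $u'$. $\mathcal{L}_u^\dagger$ is the forward diffusion operator, which is defined as follows:

$$\mathcal{L}_u^\dagger p(t,s) := -\sum_{i=1}^{d_s}\frac{\partial\left(b_i(t,s,p,u)p(t,s)\right)}{\partial s_i} + \frac{1}{2}\sum_{i,j=1}^{d_s}\frac{\partial^2\left(D_{ij}(t,s,p,u)p(t,s)\right)}{\partial s_i\partial s_j}. \tag{A40}$$

$\mathcal{L}_u^\dagger$ is the conjugate of $\mathcal{L}_u$ as follows:

$$\int w(t,s)\,\mathcal{L}_u^\dagger p(t,s)\,ds = \int p(t,s)\,\mathcal{L}_u w(t,s)\,ds. \tag{A41}$$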
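The conjugate relation (A41) can be checked numerically in one dimension. The sketch below uses an assumed example that is not in the original: drift $b(s)=-s$, diffusion $D=1$, a standard Gaussian density $p$, and the test function $w(s)=s^2$, with both operators applied analytically and the integrals approximated by Riemann sums.

```python
import numpy as np

# Grid wide enough that the Gaussian tails are negligible at the boundaries.
s = np.linspace(-10.0, 10.0, 20001)
ds = s[1] - s[0]
p = np.exp(-0.5 * s ** 2) / np.sqrt(2.0 * np.pi)   # standard Gaussian density
w = s ** 2                                          # test function

# Assumed 1D example: drift b(s) = -s, diffusion D = 1.
# Backward operator (A37): L w = b w' + (1/2) D w'' = -2 s^2 + 1
Lw = (-s) * (2.0 * s) + 0.5 * 2.0
# Forward operator (A40): L^dagger p = -(b p)' + (1/2) (D p)''
#   -(b p)' = (s p)' = (1 - s^2) p   and   p'' = (s^2 - 1) p
Ldag_p = (1.0 - s ** 2) * p + 0.5 * (s ** 2 - 1.0) * p

lhs = np.sum(w * Ldag_p) * ds    # integral of w L^dagger p
rhs = np.sum(p * Lw) * ds        # integral of p L w
print(lhs, rhs)
```

For this example both integrals equal $-1$ analytically, since $\mathbb{E}[s^2]=1$ and $\mathbb{E}[s^4]=3$ under the standard Gaussian, so the two printed values should agree up to discretization error.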

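In the following, we derive Equation (A33). $J[u]-J[u']$ can be calculated as follows:

$$\begin{aligned}
J[u]-J[u'] &= \mathbb{E}_{p(s_{0:T})}\left[\int_0^T f(t,s_t,p_t,u_t)\,dt + g(s_T,p_T)\right] - \mathbb{E}_{p'(s_{0:T})}\left[\int_0^T f(t,s_t,p'_t,u'_t)\,dt + g(s_T,p'_T)\right]\\
&= \mathbb{E}_{p(s_{0:T})}\left[\int_0^T\left(H(t,s_t,p_t,u_t,w') - \mathcal{L}_{u_t}w'(t,s_t)\right)dt + g(s_T,p_T)\right] - \mathbb{E}_{p'(s_{0:T})}\left[\int_0^T\left(H(t,s_t,p'_t,u'_t,w') - \mathcal{L}_{u'_t}w'(t,s_t)\right)dt + g(s_T,p'_T)\right]\\
&= \int_0^T\left(\bar H(t,p,u,w') - \bar H(t,p',u',w')\right)dt - \int_0^T\left(\mathbb{E}_{p(t,s)}\left[\mathcal{L}_u w'(t,s)\right] - \mathbb{E}_{p'(t,s)}\left[\mathcal{L}_{u'}w'(t,s)\right]\right)dt + \bar g(p) - \bar g(p').
\end{aligned} \tag{A42}$$

Because $\mathcal{L}_{u_t}^\dagger$ and $\mathcal{L}_{u'_t}^\dagger$ are the conjugates of $\mathcal{L}_{u_t}$ and $\mathcal{L}_{u'_t}$, respectively,

$$J[u]-J[u'] = \int_0^T\left(\bar H(t,p,u,w') - \bar H(t,p',u',w')\right)dt - \int_0^T\!\!\int\left(\mathcal{L}_u^\dagger p(t,s) - \mathcal{L}_{u'}^\dagger p'(t,s)\right)w'(t,s)\,ds\,dt + \bar g(p) - \bar g(p'). \tag{A43}$$

From the FP Equation (A39),

$$J[u]-J[u'] = \int_0^T\left(\bar H(t,p,u,w') - \bar H(t,p',u',w')\right)dt - \int_0^T\!\!\int\frac{\partial\left(p(t,s)-p'(t,s)\right)}{\partial t}w'(t,s)\,ds\,dt + \bar g(p) - \bar g(p'). \tag{A44}$$

From the integration by parts and $p(0,s)-p'(0,s)=p_0(s)-p_0(s)=0$,

$$J[u]-J[u'] = \int_0^T\left(\bar H(t,p,u,w') - \bar H(t,p',u',w')\right)dt + \int_0^T\!\!\int\left(p(t,s)-p'(t,s)\right)\frac{\partial w'(t,s)}{\partial t}\,ds\,dt + \bar g(p) - \bar g(p') - \int\left(p(T,s)-p'(T,s)\right)w'(T,s)\,ds. \tag{A45}$$

From the HJB Equation (A38), Equation (A33) is obtained.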

Appendix B.3. Necessary Condition

In this subsection, we show the necessary condition of the optimal control function of MFSC. It corresponds to Pontryagin’s minimum principle on the probability density function space (Figure 2 (bottom right)). If $u^*$ is the optimal control function of MFSC (A32), then the following equation is satisfied:

$$u^*(t,s) = \operatorname*{argmin}_u H(t,s,p^*,u,w^*), \quad \text{a.s.}\ \forall t \in [0,T],\ \forall s \in \mathbb{R}^{d_s}, \tag{A46}$$

where $w^*$ is the solution of the following HJB equation driven by $u^*$:

$$-\frac{\partial w^*(t,s)}{\partial t} = \frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s), \tag{A47}$$

where $w^*(T,s) = (\delta\bar g(p^*)/\delta p)(s)$. $p^*$ is the solution of the following FP equation driven by $u^*$:

$$\frac{\partial p^*(t,s)}{\partial t} = \frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta w}(s), \tag{A48}$$

where $p^*(0,s)=p_0(s)$.

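In the following, we show that Equation (A46) is the necessary condition of the optimal control function of MFSC. We define the control function

$$u^\varepsilon(t,s) := \begin{cases} u^*(t,s) & (t,s) \in ([0,T]\times\mathbb{R}^{d_s}) \setminus (E_{\varepsilon_1}\times F_{\varepsilon_2}), \\ u(t,s) & (t,s) \in E_{\varepsilon_1}\times F_{\varepsilon_2}, \end{cases} \tag{A49}$$

where $E_{\varepsilon_1} := [t,t+\varepsilon_1] \subseteq [0,T]$, $F_{\varepsilon_2} := [s,s+\varepsilon_2] \subseteq \mathbb{R}^{d_s}$, and $u:[0,T]\times\mathbb{R}^{d_s} \to \mathbb{R}^{d_u}$. From Equation (A33), $J[u^\varepsilon]-J[u^*]$ can be calculated as follows:

$$\begin{aligned}
J[u^\varepsilon]-J[u^*] &= \int_0^T\left\{\bar H(t,p^\varepsilon,u^\varepsilon,w^*) - \bar H(t,p^*,u^*,w^*) - \int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p^\varepsilon(t,s)-p^*(t,s)\right)ds\right\}dt + \bar g(p^\varepsilon) - \bar g(p^*) - \int\frac{\delta\bar g(p^*)}{\delta p}(s)\left(p^\varepsilon(T,s)-p^*(T,s)\right)ds\\
&= \int_0^T\left\{\bar H(t,p^\varepsilon,u^*,w^*) - \bar H(t,p^*,u^*,w^*) - \int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p^\varepsilon(t,s)-p^*(t,s)\right)ds\right\}dt + \bar g(p^\varepsilon) - \bar g(p^*) - \int\frac{\delta\bar g(p^*)}{\delta p}(s)\left(p^\varepsilon(T,s)-p^*(T,s)\right)ds\\
&\quad + \int_{E_{\varepsilon_1}}\!\int_{F_{\varepsilon_2}}\left(H(t,s,p^\varepsilon,u,w^*) - H(t,s,p^\varepsilon,u^*,w^*)\right)p^\varepsilon(t,s)\,ds\,dt.
\end{aligned} \tag{A50}$$

Letting $\varepsilon_1 \to 0$ and $\varepsilon_2 \to 0$,

$$\begin{aligned}
J[u^\varepsilon]-J[u^*] &= \int_0^T\left\{\int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p^\varepsilon(t,s)-p^*(t,s)\right)ds - \int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p^\varepsilon(t,s)-p^*(t,s)\right)ds\right\}dt\\
&\quad + \int\frac{\delta\bar g(p^*)}{\delta p}(s)\left(p^\varepsilon(T,s)-p^*(T,s)\right)ds - \int\frac{\delta\bar g(p^*)}{\delta p}(s)\left(p^\varepsilon(T,s)-p^*(T,s)\right)ds\\
&\quad + \left(H(t,s,p^*,u,w^*) - H(t,s,p^*,u^*,w^*)\right)p^*(t,s)\,ds\,dt\\
&= \left(H(t,s,p^*,u,w^*) - H(t,s,p^*,u^*,w^*)\right)p^*(t,s)\,ds\,dt.
\end{aligned} \tag{A51}$$

Because $u^*$ is the optimal control function, the following inequality is satisfied:

$$0 \le J[u^\varepsilon]-J[u^*] = \left(H(t,s,p^*,u,w^*) - H(t,s,p^*,u^*,w^*)\right)p^*(t,s)\,ds\,dt. \tag{A52}$$

Therefore, Equation (A46) is the necessary condition of the optimal control function of MFSC.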

Appendix B.4. Sufficient Condition

Pontryagin’s minimum principle (A46) is a necessary condition and generally not a sufficient condition. Pontryagin’s minimum principle (A46) becomes a necessary and sufficient condition if the expected Hamiltonian $\bar H(t,p,u,w)$ is convex with respect to $p$ and $u$ and the expected terminal cost function $\bar g(p)$ is convex with respect to $p$.

In the following, we show this result. We define the arbitrary control function $u:[0,T]\times\mathbb{R}^{d_s} \to \mathbb{R}^{d_u}$. From Equation (A33), $J[u]-J[u^*]$ is given by the following equation:

$$J[u]-J[u^*] = \int_0^T\left\{\bar H(t,p,u,w^*) - \bar H(t,p^*,u^*,w^*) - \int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p(t,s)-p^*(t,s)\right)ds\right\}dt + \bar g(p) - \bar g(p^*) - \int\frac{\delta\bar g(p^*)}{\delta p}(s)\left(p(T,s)-p^*(T,s)\right)ds. \tag{A53}$$

Because $\bar H(t,p,u,w)$ is convex with respect to $p$ and $u$ and $\bar g(p)$ is convex with respect to $p$, the following inequalities are satisfied:

$$\bar H(t,p,u,w^*) \ge \bar H(t,p^*,u^*,w^*) + \int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p(t,s)-p^*(t,s)\right)ds + \int\left(\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta u}(s)\right)^\top\left(u(t,s)-u^*(t,s)\right)ds, \tag{A54}$$
$$\bar g(p) \ge \bar g(p^*) + \int\frac{\delta\bar g(p^*)}{\delta p}(s)\left(p(T,s)-p^*(T,s)\right)ds. \tag{A55}$$

Hence, the following inequality is satisfied:

$$J[u]-J[u^*] \ge \int_0^T\mathbb{E}_{p^*(t,s)}\left[\left(\frac{\partial H(t,s,p^*,u^*,w^*)}{\partial u}\right)^\top\left(u(t,s)-u^*(t,s)\right)\right]dt. \tag{A56}$$

Because $u^*$ satisfies Equation (A46), the following stationary condition is satisfied:

$$\frac{\partial H(t,s,p^*,u^*,w^*)}{\partial u} = 0. \tag{A57}$$

Hence, the following inequality is satisfied:

$$J[u]-J[u^*] \ge 0. \tag{A58}$$

Therefore, Equation (A46) is the sufficient condition of the optimal control function of MFSC if the expected Hamiltonian $\bar H(t,p,u,w)$ is convex with respect to $p$ and $u$ and the expected terminal cost function $\bar g(p)$ is convex with respect to $p$.

Appendix B.5. Relationship with Bellman’s Dynamic Programming Principle

From Bellman’s dynamic programming principle on the probability density function space (Figure 2 (top right)) [56,57,58], the optimal control function of MFSC is given by the following equation:

$$u^*(t,s,p) = \operatorname*{argmin}_u H\left(t,s,p,u,\frac{\delta V^*(t,p)}{\delta p}(s)\right), \tag{A59}$$

where $V^*(t,p)$ is the value function on the probability density function space, which is the solution of the following Bellman equation:

$$-\frac{\partial V^*(t,p)}{\partial t} = \mathbb{E}_{p(s)}\left[H\left(t,s,p,u^*,\frac{\delta V^*(t,p)}{\delta p}(s)\right)\right], \tag{A60}$$

where $V^*(T,p)=\mathbb{E}_{p(s)}[g(s)]$. More specifically, the optimal control function of MFSC is given by $u^*(t,s)=u^*(t,s,p^*)$, where $p^*$ is the solution of the FP Equation (A48).

Because the Bellman Equation (A60) is a functional differential equation, it cannot be solved even numerically. To resolve this problem, the previous works [30,31] converted the Bellman Equation (A60) into the HJB Equation (A47) by defining

$$w^*(t,s) := \frac{\delta V^*(t,p^*)}{\delta p}(s), \tag{A61}$$

where $p^*$ is the solution of FP Equation (A48). This approach can be interpreted as the conversion from Bellman’s dynamic programming principle (Figure 2 (top right)) to Pontryagin’s minimum principle (Figure 2 (bottom right)) on the probability density function space.

Appendix C. Derivation of Main Results

Appendix C.1. Derivation of Result in Section 3.1

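In this subsection, we derive Equation (13). $J[u]-J[u']$ can be calculated as follows:

$$\begin{aligned}
J[u]-J[u'] &= \mathbb{E}_{p(s_{0:T})}\left[\int_0^T f(t,s_t,u_t)\,dt + g(s_T)\right] - \mathbb{E}_{p'(s_{0:T})}\left[\int_0^T f(t,s_t,u'_t)\,dt + g(s_T)\right]\\
&= \mathbb{E}_{p(s_{0:T})}\left[\int_0^T\left(H(t,s_t,u_t,w') - \mathcal{L}_{u_t}w'(t,s_t)\right)dt + g(s_T)\right] - \mathbb{E}_{p'(s_{0:T})}\left[\int_0^T\left(H(t,s_t,u'_t,w') - \mathcal{L}_{u'_t}w'(t,s_t)\right)dt + g(s_T)\right]\\
&= \int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right] - \mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt - \int_0^T\left(\mathbb{E}_{p(t,s)}\left[\mathcal{L}_u w'(t,s)\right] - \mathbb{E}_{p'(t,s)}\left[\mathcal{L}_{u'}w'(t,s)\right]\right)dt + \mathbb{E}_{p(T,s)}\left[g(s)\right] - \mathbb{E}_{p'(T,s)}\left[g(s)\right].
\end{aligned} \tag{A62}$$

Because $\mathcal{L}_{u_t}^\dagger$ and $\mathcal{L}_{u'_t}^\dagger$ are the conjugates of $\mathcal{L}_{u_t}$ and $\mathcal{L}_{u'_t}$, respectively,

$$J[u]-J[u'] = \int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right] - \mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt - \int_0^T\!\!\int\left(\mathcal{L}_u^\dagger p(t,s) - \mathcal{L}_{u'}^\dagger p'(t,s)\right)w'(t,s)\,ds\,dt + \mathbb{E}_{p(T,s)}\left[g(s)\right] - \mathbb{E}_{p'(T,s)}\left[g(s)\right]. \tag{A63}$$

From the FP Equation (17),

$$J[u]-J[u'] = \int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right] - \mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt - \int_0^T\!\!\int\frac{\partial\left(p(t,s)-p'(t,s)\right)}{\partial t}w'(t,s)\,ds\,dt + \mathbb{E}_{p(T,s)}\left[g(s)\right] - \mathbb{E}_{p'(T,s)}\left[g(s)\right]. \tag{A64}$$

From the integration by parts and $p(0,s)-p'(0,s)=p_0(s)-p_0(s)=0$,

$$J[u]-J[u'] = \int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w')\right] - \mathbb{E}_{p'(t,s)}\left[H(t,s,u',w')\right]\right)dt + \int_0^T\!\!\int\left(p(t,s)-p'(t,s)\right)\frac{\partial w'(t,s)}{\partial t}\,ds\,dt + \mathbb{E}_{p(T,s)}\left[g(s)\right] - \mathbb{E}_{p'(T,s)}\left[g(s)\right] - \int\left(p(T,s)-p'(T,s)\right)w'(T,s)\,ds. \tag{A65}$$

From the HJB Equation (16), Equation (13) is obtained.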

Appendix C.2. Derivation of Result in Section 3.2

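In this subsection, we show that Equation (20) is the necessary condition of the optimal control function of ML-POSC. It corresponds to Pontryagin’s minimum principle on the probability density function space. We define the control function

$$u^\varepsilon(t,z) := \begin{cases} u^*(t,z) & (t,z) \in ([0,T]\times\mathbb{R}^{d_z}) \setminus (E_{\varepsilon_1}\times F_{\varepsilon_2}), \\ u(t,z) & (t,z) \in E_{\varepsilon_1}\times F_{\varepsilon_2}, \end{cases} \tag{A66}$$

where $E_{\varepsilon_1} := [t,t+\varepsilon_1] \subseteq [0,T]$, $F_{\varepsilon_2} := [z,z+\varepsilon_2] \subseteq \mathbb{R}^{d_z}$, and $u:[0,T]\times\mathbb{R}^{d_z} \to \mathbb{R}^{d_u}$. From Equation (13), $J[u^\varepsilon]-J[u^*]$ can be calculated as follows:

$$\begin{aligned}
J[u^\varepsilon]-J[u^*] &= \int_0^T\left(\mathbb{E}_{p^\varepsilon(t,s)}\left[H(t,s,u^\varepsilon,w^*)\right] - \mathbb{E}_{p^\varepsilon(t,s)}\left[H(t,s,u^*,w^*)\right]\right)dt\\
&= \int_{E_{\varepsilon_1}}\!\int_{F_{\varepsilon_2}}\left(\mathbb{E}_{p^\varepsilon_t(x|z)}\left[H(t,s,u,w^*)\right] - \mathbb{E}_{p^\varepsilon_t(x|z)}\left[H(t,s,u^*,w^*)\right]\right)p^\varepsilon_t(z)\,dz\,dt.
\end{aligned}$$

Letting $\varepsilon_1 \to 0$ and $\varepsilon_2 \to 0$,

$$J[u^\varepsilon]-J[u^*] = \left(\mathbb{E}_{p^*_t(x|z)}\left[H(t,s,u,w^*)\right] - \mathbb{E}_{p^*_t(x|z)}\left[H(t,s,u^*,w^*)\right]\right)p^*_t(z)\,dz\,dt.$$

Because $u^*$ is the optimal control function, the following inequality is satisfied:

$$0 \le J[u^\varepsilon]-J[u^*] = \left(\mathbb{E}_{p^*_t(x|z)}\left[H(t,s,u,w^*)\right] - \mathbb{E}_{p^*_t(x|z)}\left[H(t,s,u^*,w^*)\right]\right)p^*_t(z)\,dz\,dt.$$

Therefore, Equation (20) is the necessary condition of the optimal control function of ML-POSC.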

Appendix C.3. Derivation of Result in Section 3.3

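In this subsection, we show that Equation (20) is the sufficient condition of the optimal control function of ML-POSC if the expected Hamiltonian $\bar H(t,p,u,w)$ is convex with respect to $p$ and $u$. We define the arbitrary control function $u:[0,T]\times\mathbb{R}^{d_z} \to \mathbb{R}^{d_u}$. From Equation (13), $J[u]-J[u^*]$ is given by the following equation:

$$J[u]-J[u^*] = \int_0^T\left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right] - \mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right]\right)dt. \tag{A67}$$

Because $\bar H(t,p,u,w)$ is convex with respect to $p$ and $u$, the following inequality is satisfied:

$$\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right] = \bar H(t,p,u,w^*) \ge \bar H(t,p^*,u^*,w^*) + \int\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s)\left(p(t,s)-p^*(t,s)\right)ds + \int\left(\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta u}(z)\right)^\top\left(u(t,z)-u^*(t,z)\right)dz. \tag{A68}$$

Because

$$\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta p}(s) = \left.\frac{\delta}{\delta p(s)}\left(\int p(s)H(t,s,u^*,w^*)\,ds\right)\right|_{p=p^*} = H(t,s,u^*,w^*), \tag{A69}$$
$$\frac{\delta\bar H(t,p^*,u^*,w^*)}{\delta u}(z) = \left.\frac{\delta}{\delta u(z)}\left(\int p^*_t(z)\,\mathbb{E}_{p^*_t(x|z)}\left[H(t,s,u,w^*)\right]dz\right)\right|_{u=u^*} = p^*_t(z)\,\mathbb{E}_{p^*_t(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right], \tag{A70}$$

the above inequality can be calculated as follows:

$$\begin{aligned}
\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^*)\right] &\ge \int p^*(t,s)H(t,s,u^*,w^*)\,ds + \int H(t,s,u^*,w^*)\left(p(t,s)-p^*(t,s)\right)ds + \int p^*_t(z)\,\mathbb{E}_{p^*_t(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]^\top\left(u(t,z)-u^*(t,z)\right)dz\\
&= \mathbb{E}_{p(t,s)}\left[H(t,s,u^*,w^*)\right] + \mathbb{E}_{p^*_t(z)}\left[\mathbb{E}_{p^*_t(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]^\top\left(u(t,z)-u^*(t,z)\right)\right].
\end{aligned} \tag{A71}$$

Hence, the following inequality is satisfied:

$$J[u]-J[u^*] \ge \int_0^T\mathbb{E}_{p^*_t(z)}\left[\mathbb{E}_{p^*_t(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right]^\top\left(u(t,z)-u^*(t,z)\right)\right]dt. \tag{A72}$$

Because $u^*$ satisfies Equation (20), the following stationary condition is satisfied:

$$\mathbb{E}_{p^*_t(x|z)}\left[\frac{\partial H(t,s,u^*,w^*)}{\partial u}\right] = 0. \tag{A73}$$

Hence, the following inequality is satisfied:

$$J[u]-J[u^*] \ge 0. \tag{A74}$$

Therefore, Equation (20) is the sufficient condition of the optimal control function of ML-POSC if $\bar H(t,p,u,w)$ is convex with respect to $p$ and $u$.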

Appendix C.4. Derivation of Result in Section 3.5

In this subsection, we show that Equation (29) is the sufficient condition of the optimal control function of COSC without assuming the convexity of the expected Hamiltonian. We define the arbitrary control function $u:[0,T]\times\mathbb{R}^{d_s}\to\mathbb{R}^{d_u}$. From Equation (13), $J[u]-J[u^{*}]$ is given by the following equation:

$$J[u]-J[u^{*}] = \int_0^T \left(\mathbb{E}_{p(t,s)}\left[H(t,s,u,w^{*})\right] - \mathbb{E}_{p(t,s)}\left[H(t,s,u^{*},w^{*})\right]\right)dt. \tag{A75}$$

From Equation (29), the following inequality is satisfied:

$$J[u]-J[u^{*}] \ge \int_0^T \left(\mathbb{E}_{p(t,s)}\left[H(t,s,u^{*},w^{*})\right] - \mathbb{E}_{p(t,s)}\left[H(t,s,u^{*},w^{*})\right]\right)dt = 0. \tag{A76}$$

Therefore, Equation (29) is the sufficient condition of the optimal control function of COSC.

Appendix C.5. Derivation of Result in Section 4.2 in a Way Similar to Pontryagin’s Minimum Principle

In this subsection, we derive Equation (31) from Equation (30) in a way similar to Pontryagin’s minimum principle. From Equation (13), the following equality is satisfied:

$$\begin{aligned}J[u_{0:t-dt},u_t,u_{t+dt:T-dt}] - J[u_{0:t-dt},u_t^{*},u_{t+dt:T-dt}] &= \left(\mathbb{E}_{p_t(s)}\left[H(t,s,u_t,w_{t+dt})\right] - \mathbb{E}_{p_t(s)}\left[H(t,s,u_t^{*},w_{t+dt})\right]\right)dt\\ &= \mathbb{E}_{p_t(z)}\left[\mathbb{E}_{p_t(x|z)}\left[H(t,s,u_t,w_{t+dt})\right] - \mathbb{E}_{p_t(x|z)}\left[H(t,s,u_t^{*},w_{t+dt})\right]\right]dt.\end{aligned} \tag{A77}$$

Therefore, Equation (31) is equivalent to Equation (30).

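Appendix C.6. Derivation of Result in Section 4.2 by the Time-Discretized Method

In this subsection, we derive Equation (31) from Equation (30) by the time-discretized method. Equation (30) can be calculated as follows:

$$\begin{aligned}u_t^{*} &= \mathop{\mathrm{argmin}}_{u_t} J[u_{0:T-dt}]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p(s_{0:T};u_{0:T-dt})}\left[\int_0^T f(\tau,s_\tau,u_\tau)\,d\tau + g(s_T)\right]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p(s_{t:T};u_{0:T-dt})}\left[\int_t^T f(\tau,s_\tau,u_\tau)\,d\tau + g(s_T)\right]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p(s_{t:T};u_{0:T-dt})}\left[f(t,s_t,u_t)\,dt + \int_{t+dt}^T f(\tau,s_\tau,u_\tau)\,d\tau + g(s_T)\right]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt + \mathbb{E}_{p(s_{t+dt:T}|s_t;u_{t:T-dt})}\left[\int_{t+dt}^T f(\tau,s_\tau,u_\tau)\,d\tau + g(s_T)\right]\right]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt + \mathbb{E}_{p(s_{t+dt}|s_t;u_t)}\left[w_{t+dt}(s_{t+dt})\right]\right],\end{aligned} \tag{A78}$$

where $p_t(s)$ is the solution of the FP Equation (33) driven by $u_{0:t-dt}$, and $w_{t+dt}(s)$ is defined as follows:

$$w_{t+dt}(s) := \mathbb{E}_{p(s_{t+2dt:T}|s_{t+dt}=s;u_{t+dt:T-dt})}\left[\int_{t+dt}^T f(\tau,s_\tau,u_\tau)\,d\tau + g(s_T)\right]. \tag{A79}$$

From Itô’s lemma,

$$\begin{aligned}u_t^{*} &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt + w_{t+dt}(s_t) + \mathcal{L}_{u_t}w_{t+dt}(s_t)\,dt\right]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(s_t)}\left[f(t,s_t,u_t)\,dt + \mathcal{L}_{u_t}w_{t+dt}(s_t)\,dt\right]\\ &= \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(s)}\left[H(t,s,u_t,w_{t+dt})\right].\end{aligned} \tag{A80}$$

Because control $u_t$ is a function of memory $z$ in ML-POSC, the minimization by $u_t$ can be exchanged with the expectation by $p_t(z)$ as follows:

$$u_t^{*}(z) = \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p_t(x|z)}\left[H\left(t,s,u_t,w_{t+dt}\right)\right]. \tag{A81}$$

Therefore, Equation (31) is derived from Equation (30). Finally, we prove that $w_t(s)$ is the solution of the HJB Equation (32) driven by $u_{t+dt:T-dt}$. $w_t(s)$ can be calculated as follows:

$$\begin{aligned}w_t(s) &= \mathbb{E}_{p(s_{t+dt:T}|s_t=s;u_{t:T-dt})}\left[\int_t^T f(\tau,s_\tau,u_\tau)\,d\tau + g(s_T)\right]\\ &= f(t,s,u_t)\,dt + \mathbb{E}_{p(s_{t+dt}|s_t=s;u_t)}\left[w_{t+dt}(s_{t+dt})\right]\\ &= f(t,s,u_t)\,dt + w_{t+dt}(s) + \mathcal{L}_{u_t}w_{t+dt}(s)\,dt\\ &= w_{t+dt}(s) + H(t,s,u_t,w_{t+dt})\,dt,\end{aligned} \tag{A82}$$

where $w_T(s)=g(s)$. Therefore, $w_t(s)$ defined by Equation (A79) is the solution of the HJB Equation (32) driven by $u_{t+dt:T-dt}$.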

Appendix C.7. Derivation of Result in Section 4.3

In this subsection, we mainly derive the inequality of the forward step (35). The inequality of the backward step (34) can be derived in a similar way. In the forward step, $u^{k+1}_{0:t-dt}$ and $u^{k}_{t+dt:T-dt}$ are given, and $u^{k+1}_t$ is defined by

$$u^{k+1}_t(z) := \mathop{\mathrm{argmin}}_{u_t} \mathbb{E}_{p^{k+1}_t(x|z)}\left[H\left(t,s,u_t,w^{k}_{t+dt}\right)\right]. \tag{A83}$$

From the equivalence of Equations (30) and (31), the following equation is satisfied:

$$u^{k+1}_t = \mathop{\mathrm{argmin}}_{u_t} J[u^{k+1}_{0:t-dt},u_t,u^{k}_{t+dt:T-dt}]. \tag{A84}$$

Therefore, the inequality of the forward step (35) is satisfied.

Appendix C.8. Derivation of Result in Section 4.4

In this subsection, we show that FBSM for ML-POSC converges to Pontryagin’s minimum principle (20). More specifically, we prove that if $J[u^{k+1}_{0:T-dt}]=J[u^{k}_{0:T-dt}]$ holds, $u^{k+1}_{0:T-dt}$ satisfies Pontryagin’s minimum principle (20). We mainly consider the forward step; the backward step can be treated in a similar way. If $J[u^{k+1}_{0:T-dt}]=J[u^{k}_{0:T-dt}]$ holds, then $J[u^{k+1}_{0:t},u^{k}_{t+dt:T-dt}]=J[u^{k+1}_{0:t-dt},u^{k}_{t:T-dt}]$ holds from Equation (35). Because $J[u^{k+1}_{0},u^{k}_{dt:T-dt}]=J[u^{k}_{0:T-dt}]$ holds, $u^{k+1}_{0}=u^{k}_{0}$ holds. Then, because $J[u^{k}_{0},u^{k+1}_{dt},u^{k}_{2dt:T-dt}]=J[u^{k}_{0:T-dt}]$ holds, $u^{k+1}_{dt}=u^{k}_{dt}$ holds. Iterating this procedure from $t=0$ to $t=T-dt$, $u^{k+1}_{0:T-dt}=u^{k}_{0:T-dt}$ holds. Therefore, because the HJB equation and the FP equation depend on the same control function $u^{k+1}_{0:T-dt}=u^{k}_{0:T-dt}$, $u^{k+1}_{0:T-dt}$ satisfies Pontryagin’s minimum principle (20).
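The monotonicity argument behind the backward and forward steps can be illustrated on a discrete-time, finite-state analog. The sketch below is not the authors’ Algorithm 1: it assumes a hypothetical finite chain whose extended state index `s` encodes a pair $(x,z)$ through `z_of[s]`, a random transition tensor `P`, and random costs `f` and `g`. The control is restricted to depend on the memory index only, and each sweep picks, for every memory value, the action minimizing the density-weighted `q_values`, which stand in for the conditional expected Hamiltonian.

```python
import numpy as np

def q_values(P, f, w_next):
    """q[s, a] = f[s, a] + sum_j P[a, s, j] * w_next[j] (stand-in for H)."""
    return f + np.einsum("asj,j->sa", P, w_next)

def memory_argmin(q, p_t, z_of, nz):
    """For each memory value z, minimize sum over states with that z of p_t(s) q(s, a)."""
    pooled = np.zeros((nz, q.shape[1]))
    np.add.at(pooled, z_of, p_t[:, None] * q)
    return pooled.argmin(axis=1)

def propagate(u, p0, P, z_of):
    """Forward density driven by the memory-based policy u."""
    n = p0.shape[0]
    p = np.empty((u.shape[0] + 1, n))
    p[0] = p0
    for t in range(u.shape[0]):
        p[t + 1] = p[t] @ P[u[t][z_of], np.arange(n), :]
    return p

def cost(u, p0, P, f, g, z_of):
    """Expected cumulative cost of the memory-based policy u."""
    p = propagate(u, p0, P, z_of)
    n = p0.shape[0]
    running = sum(p[t] @ f[np.arange(n), u[t][z_of]] for t in range(u.shape[0]))
    return running + p[-1] @ g

def backward_sweep(p, P, f, g, z_of, nz):
    """Update u_t and w_t from t = T-1 down to 0 using a fixed density p."""
    T, n = p.shape[0] - 1, p.shape[1]
    u = np.zeros((T, nz), dtype=int)
    w = np.empty((T + 1, n))
    w[T] = g
    for t in reversed(range(T)):
        q = q_values(P, f, w[t + 1])
        u[t] = memory_argmin(q, p[t], z_of, nz)
        w[t] = q[np.arange(n), u[t][z_of]]
    return u, w

def forward_sweep(w, p0, P, f, z_of, nz):
    """Update u_t and p_{t+1} from t = 0 up to T-1 using a fixed value function w."""
    T, n = w.shape[0] - 1, w.shape[1]
    u = np.zeros((T, nz), dtype=int)
    p = np.empty((T + 1, n))
    p[0] = p0
    for t in range(T):
        q = q_values(P, f, w[t + 1])
        u[t] = memory_argmin(q, p[t], z_of, nz)
        p[t + 1] = p[t] @ P[u[t][z_of], np.arange(n), :]
    return u, p

def run_fbsm(seed=0, nx=3, nz=2, m=3, T=6, sweeps=6):
    """Alternate backward and forward sweeps and record the cost after each."""
    rng = np.random.default_rng(seed)
    n = nx * nz
    z_of = np.arange(n) % nz                 # hypothetical state-to-memory map
    P = rng.random((m, n, n))
    P /= P.sum(axis=2, keepdims=True)
    f, g = rng.random((n, m)), rng.random(n)
    p0 = np.full(n, 1.0 / n)
    u = rng.integers(0, m, size=(T, nz))     # arbitrary initial control
    p = propagate(u, p0, P, z_of)
    costs = [cost(u, p0, P, f, g, z_of)]
    for k in range(sweeps):
        if k % 2 == 0:
            u, w = backward_sweep(p, P, f, g, z_of, nz)
        else:
            u, p = forward_sweep(w, p0, P, f, z_of, nz)
        costs.append(cost(u, p0, P, f, g, z_of))
    return costs
```

The strict alternation matters in this sketch: the forward sweep reuses the value function produced by the preceding backward sweep, and the next backward sweep reuses the density produced by the forward sweep, so both always correspond to the current control. Under these assumptions the list returned by `run_fbsm` should be non-increasing up to rounding.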

Appendix C.9. Derivation of Result in Section 5.3

In this subsection, we show that FBSM is reduced from Algorithm 1 to Algorithm 2 in the LQG problem of ML-POSC.

We first consider the initial step. We assume that the control function is initialized by

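$$u^{0}(t,z) = -R^{-1}B^{\top}\left(\Pi^{0}K(\Lambda^{0})(s-\mu) + \Psi\mu\right), \tag{A85}$$

where $\Pi^{0}$ is arbitrary and $\Lambda^{0}$ is the solution of $\dot{\Lambda}^{0}=F(\Lambda^{0},\Pi^{0})$ given $\Lambda^{0}(0)=\Lambda_{0}$. When the control function is initialized by (A85), the solution of the FP equation is given by the Gaussian distribution $p^{0}_t(s):=\mathcal{N}(s|\mu,\Lambda^{0})$, where $\mu$ is the solution of (42) and $\Lambda^{0}$ is the solution of $\dot{\Lambda}^{0}=F(\Lambda^{0},\Pi^{0})$ given $\Lambda^{0}(0)=\Lambda_{0}$.

We then consider the backward step. When the solution of the FP equation is given by the Gaussian distribution $p^{k}_t(s):=\mathcal{N}(s|\mu,\Lambda^{k})$, the solution of the HJB equation is given by the quadratic function $w^{k+1}_t(s)=s^{\top}\Pi^{k+1}s+(\alpha^{k+1})^{\top}s+\beta^{k+1}$, where $\Pi^{k+1}$, $\alpha^{k+1}$, and $\beta^{k+1}$ are the solutions of the following ODEs:

$$-\dot{\Pi}^{k+1} = G(\Lambda^{k},\Pi^{k+1}), \tag{A86}$$

$$-\dot{\alpha}^{k+1} = \left(A-BR^{-1}B^{\top}\Pi^{k+1}\right)^{\top}\alpha^{k+1} - 2\left(I-K(\Lambda^{k})\right)^{\top}\Pi^{k+1}BR^{-1}B^{\top}\Pi^{k+1}\left(I-K(\Lambda^{k})\right)\mu, \tag{A87}$$

$$-\dot{\beta}^{k+1} = \mathrm{tr}\left(\Pi^{k+1}\sigma\sigma^{\top}\right) - \frac{1}{4}(\alpha^{k+1})^{\top}BR^{-1}B^{\top}\alpha^{k+1} + \mu^{\top}\left(I-K(\Lambda^{k})\right)^{\top}\Pi^{k+1}BR^{-1}B^{\top}\Pi^{k+1}\left(I-K(\Lambda^{k})\right)\mu, \tag{A88}$$

where $\Pi^{k+1}(T)=P$, $\alpha^{k+1}(T)=0$, and $\beta^{k+1}(T)=0$.

We finally consider the forward step. When the solution of the HJB equation is given by the quadratic function $w^{k}_t(s)=s^{\top}\Pi^{k}s+(\alpha^{k})^{\top}s+\beta^{k}$, the solution of the FP equation is given by the Gaussian distribution $p^{k+1}_t(s):=\mathcal{N}(s|\mu,\Lambda^{k+1})$, where $\mu$ is the solution of (42) and $\Lambda^{k+1}$ is the solution of $\dot{\Lambda}^{k+1}=F(\Lambda^{k+1},\Pi^{k})$ given $\Lambda^{k+1}(0)=\Lambda_{0}$. Therefore, FBSM is reduced from Algorithm 1 to Algorithm 2 in the LQG problem of ML-POSC. The details of these calculations are almost the same as those in [14].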
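The backward-step ODEs can be checked by substituting the quadratic ansatz into the HJB equation. The sketch below is only a consistency check under assumptions that this appendix does not restate: linear dynamics $ds=(As+Bu)\,dt+\sigma\,d\omega$, running cost $f=s^{\top}Qs+u^{\top}Ru$, symmetric $\Pi$, and a conditional mean of the form $\mathbb{E}_{p_t(x|z)}[s]=\mu+K(\Lambda)(s-\mu)$ as in [14]. Iteration superscripts are dropped, and $M$ and $\hat{s}$ are shorthand introduced here.

```latex
% Ansatz: w(t,s) = s^T \Pi s + \alpha^T s + \beta,   M := B R^{-1} B^T
\begin{aligned}
\nabla w &= 2\Pi s + \alpha, \qquad \nabla^{2} w = 2\Pi, \\
u(t,z) &= -\tfrac{1}{2} R^{-1} B^{\top}\, \mathbb{E}_{p_t(x|z)}\left[\nabla w\right]
        = -R^{-1} B^{\top}\left(\Pi \hat{s} + \tfrac{1}{2}\alpha\right),
        \qquad \hat{s} := \mu + K(\Lambda)(s-\mu), \\
-\frac{\partial w}{\partial t}
  &= s^{\top} Q s + u^{\top} R u + (As + Bu)^{\top} \nabla w
     + \mathrm{tr}\left(\Pi \sigma \sigma^{\top}\right) \\
  &= s^{\top}\left(Q + A^{\top}\Pi + \Pi A\right) s + \alpha^{\top} A s
     + \mathrm{tr}\left(\Pi \sigma \sigma^{\top}\right) \\
  &\quad + (s-\mu)^{\top} (I-K)^{\top} \Pi M \Pi (I-K) (s-\mu)
     - \left(\Pi s + \tfrac{1}{2}\alpha\right)^{\top} M \left(\Pi s + \tfrac{1}{2}\alpha\right).
\end{aligned}
```

Matching the terms linear in $s$ and the constant terms reproduces (A87) and (A88). Under the same assumptions, the quadratic terms give $Q+A^{\top}\Pi+\Pi A-\Pi M\Pi+(I-K)^{\top}\Pi M\Pi(I-K)$ for the right-hand side of (A86).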

Author Contributions

Conceptualization, Formal analysis, Funding acquisition, Writing—original draft, T.T. and T.J.K.; Software, Visualization, T.T. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

The first author received a JSPS Research Fellowship (Grant No. 21J20436). This work was supported by JSPS KAKENHI (Grant No. 19H05799) and JST CREST (Grant No. JPMJCR2011).

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1. Fox R., Tishby N. Minimum-information LQG control Part II: Retentive controllers; Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC); Las Vegas, NV, USA. 12–14 December 2016; pp. 5603–5609.
  • 2. Fox R., Tishby N. Minimum-information LQG control part I: Memoryless controllers; Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC); Las Vegas, NV, USA. 12–14 December 2016; pp. 5610–5616.
  • 3. Li W., Todorov E. An Iterative Optimal Control and Estimation Design for Nonlinear Stochastic System; Proceedings of the 45th IEEE Conference on Decision and Control; San Diego, CA, USA. 13–15 December 2006; pp. 3242–3247.
  • 4. Li W., Todorov E. Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system. Int. J. Control. 2007;80:1439–1453. doi: 10.1080/00207170701364913.
  • 5. Nakamura K., Kobayashi T.J. Connection between the Bacterial Chemotactic Network and Optimal Filtering. Phys. Rev. Lett. 2021;126:128102. doi: 10.1103/PhysRevLett.126.128102.
  • 6. Nakamura K., Kobayashi T.J. Optimal sensing and control of run-and-tumble chemotaxis. Phys. Rev. Res. 2022;4:013120. doi: 10.1103/PhysRevResearch.4.013120.
  • 7. Pezzotta A., Adorisio M., Celani A. Chemotaxis emerges as the optimal solution to cooperative search games. Phys. Rev. E. 2018;98:042401. doi: 10.1103/PhysRevE.98.042401.
  • 8. Borra F., Cencini M., Celani A. Optimal collision avoidance in swarms of active Brownian particles. J. Stat. Mech. Theory Exp. 2021;2021:083401. doi: 10.1088/1742-5468/ac12c6.
  • 9. Davis M.H.A., Varaiya P. Dynamic Programming Conditions for Partially Observable Stochastic Systems. SIAM J. Control. 1973;11:226–261. doi: 10.1137/0311020.
  • 10. Bensoussan A. Stochastic Control of Partially Observable Systems. Cambridge University Press; Cambridge, UK: 1992.
  • 11. Fabbri G., Gozzi F., Święch A. Probability Theory and Stochastic Modelling. Volume 82. Springer International Publishing; Cham, Switzerland: 2017. Stochastic Optimal Control in Infinite Dimension.
  • 12. Wang G., Wu Z., Xiong J. An Introduction to Optimal Control of FBSDE with Incomplete Information. Springer International Publishing; Cham, Switzerland: 2018. Springer Briefs in Mathematics.
  • 13. Bensoussan A., Yam S.C.P. Mean field approach to stochastic control with partial information. ESAIM Control. Optim. Calc. Var. 2021;27:89. doi: 10.1051/cocv/2021085.
  • 14. Tottori T., Kobayashi T.J. Memory-Limited Partially Observable Stochastic Control and Its Mean-Field Control Approach. Entropy. 2022;24:1599. doi: 10.3390/e24111599.
  • 15. Kushner H. Optimal stochastic control. IRE Trans. Autom. Control. 1962;7:120–122. doi: 10.1109/TAC.1962.1105490.
  • 16. Yong J., Zhou X.Y. Stochastic Controls. Springer; New York, NY, USA: 1999.
  • 17. Nisio M. Probability Theory and Stochastic Modelling. Volume 72. Springer; Tokyo, Japan: 2015. Stochastic Control Theory.
  • 18. Bensoussan A. Interdisciplinary Applied Mathematics. Volume 48. Springer International Publishing; Cham, Switzerland: 2018. Estimation and Control of Dynamical Systems.
  • 19. Kushner H.J., Dupuis P.G. Numerical Methods for Stochastic Control Problems in Continuous Time. Springer; New York, NY, USA: 1992.
  • 20. Fleming W.H., Soner H.M. Controlled Markov Processes and Viscosity Solutions. 2nd ed. Springer; New York, NY, USA: 2006. Number 25 in Applications of Mathematics.
  • 21. Puterman M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience; Hoboken, NJ, USA: 2014.
  • 22. Pontryagin L.S. Mathematical Theory of Optimal Processes. CRC Press; Boca Raton, FL, USA: 1987.
  • 23. Vinter R. Optimal Control. Birkhäuser Boston; Boston, MA, USA: 2010.
  • 24. Lewis F.L., Vrabie D., Syrmos V.L. Optimal Control. John Wiley & Sons; New York, NY, USA: 2012.
  • 25. Aschepkov L.T., Dolgy D.V., Kim T., Agarwal R.P. Optimal Control. Springer International Publishing; Cham, Switzerland: 2016.
  • 26. Bensoussan A., Frehse J., Yam P. Mean Field Games and Mean Field Type Control Theory. Springer; New York, NY, USA: 2013. Springer Briefs in Mathematics.
  • 27. Carmona R., Delarue F. Probabilistic Theory of Mean Field Games with Applications I. Springer Nature; Cham, Switzerland: 2018. Number Volume 83 in Probability Theory and Stochastic Modelling.
  • 28. Carmona R., Delarue F. Probabilistic Theory of Mean Field Games with Applications II. Springer International Publishing; Cham, Switzerland: 2018. Volume 84, Probability Theory and Stochastic Modelling.
  • 29. Carmona R., Delarue F. The Master Equation for Large Population Equilibriums. In: Crisan D., Hambly B., Zariphopoulou T., editors. Stochastic Analysis and Applications 2014. Volume 100. Springer International Publishing; Cham, Switzerland: 2014. pp. 77–128.
  • 30. Bensoussan A., Frehse J., Yam S.C.P. The Master equation in mean field theory. J. Math. Pures Appl. 2015;103:1441–1474. doi: 10.1016/j.matpur.2014.11.005.
  • 31. Bensoussan A., Frehse J., Yam S.C.P. On the interpretation of the Master Equation. Stoch. Process. Their Appl. 2017;127:2093–2137. doi: 10.1016/j.spa.2016.10.004.
  • 32. Krylov I., Chernous’ko F. On a method of successive approximations for the solution of problems of optimal control. USSR Comput. Math. Math. Phys. 1963;2:1371–1382. doi: 10.1016/0041-5553(63)90353-7.
  • 33. Mitter S.K. Successive approximation methods for the solution of optimal control problems. Automatica. 1966;3:135–149. doi: 10.1016/0005-1098(66)90009-4.
  • 34. Chernousko F.L., Lyubushin A.A. Method of successive approximations for solution of optimal control problems. Optim. Control. Appl. Methods. 1982;3:101–114. doi: 10.1002/oca.4660030201.
  • 35. Lenhart S., Workman J.T. Optimal Control Applied to Biological Models. Chapman and Hall/CRC; New York, NY, USA: 2007.
  • 36. Sharp J.A., Burrage K., Simpson M.J. Implementation and acceleration of optimal control for systems biology. J. R. Soc. Interface. 2021;18:20210241. doi: 10.1098/rsif.2021.0241.
  • 37. Hackbusch W. A numerical method for solving parabolic equations with opposite orientations. Computing. 1978;20:229–240. doi: 10.1007/BF02251947.
  • 38. McAsey M., Mou L., Han W. Convergence of the forward-backward sweep method in optimal control. Comput. Optim. Appl. 2012;53:207–226. doi: 10.1007/s10589-011-9454-7.
  • 39. Carlini E., Silva F.J. Semi-Lagrangian schemes for mean field game models; Proceedings of the 52nd IEEE Conference on Decision and Control; Firenze, Italy. 10–13 December 2013; pp. 3115–3120.
  • 40. Carlini E., Silva F.J. A Fully Discrete Semi-Lagrangian Scheme for a First Order Mean Field Game Problem. SIAM J. Numer. Anal. 2014;52:45–67. doi: 10.1137/120902987.
  • 41. Carlini E., Silva F.J. A semi-Lagrangian scheme for a degenerate second order mean field game system. Discret. Contin. Dyn. Syst. 2015;35:4269. doi: 10.3934/dcds.2015.35.4269.
  • 42. Lauriere M. Numerical Methods for Mean Field Games and Mean Field Type Control. arXiv. 2021; arXiv:2106.06231.
  • 43. Wonham W.M. On the Separation Theorem of Stochastic Control. SIAM J. Control. 1968;6:312–326. doi: 10.1137/0306023.
  • 44. Li Q., Chen L., Tai C., E W. Maximum Principle Based Algorithms for Deep Learning. J. Mach. Learn. Res. 2018;18:1–29.
  • 45. Liu X., Frank J. Symplectic Runge–Kutta discretization of a regularized forward–backward sweep iteration for optimal control problems. J. Comput. Appl. Math. 2021;383:113133. doi: 10.1016/j.cam.2020.113133.
  • 46. Bellman R. Dynamic Programming. Princeton University Press; Princeton, NJ, USA: 1957.
  • 47. Howard R.A. Dynamic Programming and Markov Processes. John Wiley; Oxford, UK: 1960.
  • 48. Kappen H.J. Linear Theory for Control of Nonlinear Stochastic Systems. Phys. Rev. Lett. 2005;95:200201. doi: 10.1103/PhysRevLett.95.200201.
  • 49. Kappen H.J. Path integrals and symmetry breaking for optimal control theory. J. Stat. Mech. Theory Exp. 2005;2005:P11011. doi: 10.1088/1742-5468/2005/11/P11011.
  • 50. Satoh S., Kappen H.J., Saeki M. An Iterative Method for Nonlinear Stochastic Optimal Control Based on Path Integrals. IEEE Trans. Autom. Control. 2017;62:262–276. doi: 10.1109/TAC.2016.2547979.
  • 51. Cacace S., Camilli F., Goffi A. A policy iteration method for Mean Field Games. arXiv. 2021; arXiv:2007.04818. doi: 10.1051/cocv/2021081.
  • 52. Laurière M., Song J., Tang Q. Policy iteration method for time-dependent Mean Field Games systems with non-separable Hamiltonians. arXiv. 2021; arXiv:2110.02552. doi: 10.1007/s00245-022-09925-5.
  • 53. Camilli F., Tang Q. Rates of convergence for the policy iteration method for Mean Field Games systems. arXiv. 2022; arXiv:2108.00755. doi: 10.1016/j.jmaa.2022.126138.
  • 54. Ruthotto L., Osher S.J., Li W., Nurbekyan L., Fung S.W. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci. USA. 2020;117:9183–9193. doi: 10.1073/pnas.1922204117.
  • 55. Lin A.T., Fung S.W., Li W., Nurbekyan L., Osher S.J. Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games. Proc. Natl. Acad. Sci. USA. 2021;118:e2024713118. doi: 10.1073/pnas.2024713118.
  • 56. Laurière M., Pironneau O. Dynamic programming for mean-field type control. C. R. Math. 2014;352:707–713. doi: 10.1016/j.crma.2014.07.008.
  • 57. Laurière M., Pironneau O. Dynamic programming for mean-field type control. J. Optim. Theory Appl. 2016;169:902–924. doi: 10.1007/s10957-015-0785-x.
  • 58. Pham H., Wei X. Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM Control. Optim. Calc. Var. 2018;24:437–461. doi: 10.1051/cocv/2017019.


Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
