Author manuscript; available in PMC: 2022 Dec 1.
Published in final edited form as: IEEE Trans Neural Netw Learn Syst. 2022 Nov 30;33(12):7523–7533. doi: 10.1109/TNNLS.2021.3085358

Reinforcement Learning Based Optimal Tracking Control Under Unmeasurable Disturbances With Application to HVAC Systems

Syed Ali Asad Rizvi 1, Amanda J Pertzborn 2, Zongli Lin 3
PMCID: PMC9703879  NIHMSID: NIHMS1849483  PMID: 34129505

Abstract

This paper presents the design of an optimal controller for solving tracking problems subject to unmeasurable disturbances and unknown system dynamics using reinforcement learning (RL). Many existing RL control methods take the disturbance into account by directly measuring it and manipulating it for exploration during the learning process, thereby preventing any disturbance-induced bias in the control estimates. However, in most practical scenarios, the disturbance is neither measurable nor manipulable. The main contribution of this article is the introduction of a combination of a bias compensation mechanism and the integral action in the Q-learning framework to remove the need to measure or manipulate the disturbance, while preventing disturbance-induced bias in the optimal control estimates. A bias compensated Q-learning scheme is presented that learns the disturbance-induced bias terms separately from the optimal control parameters and ensures the convergence of the control parameters to the optimal solution even in the presence of unmeasurable disturbances. Both state feedback and output feedback algorithms are developed based on policy iteration (PI) and value iteration (VI) that guarantee the convergence of the tracking error to zero. The feasibility of the design is validated on a practical optimal control application of a heating, ventilating, and air conditioning (HVAC) zone controller.

Keywords: Heating, ventilating, and air conditioning (HVAC) control; optimal tracking; Q-learning; reinforcement learning (RL)

I. Introduction

REINFORCEMENT learning (RL) is a class of artificial intelligence algorithms that has gained significant attention in the control community for its potential use in designing intelligent controllers that learn the optimal actions without needing prior knowledge of the system model. This model-free design is desirable because system models are generally hard to obtain and modeling uncertainties can significantly affect the closed-loop stability and the optimality of the controller. One of the important control applications of RL is in solving the optimal tracking problem, which involves designing a controller that forces the system to follow a prescribed reference signal. Tracking control finds application in diverse areas such as robotics, autonomous vehicles, aerospace, building controls, and multiagent systems [1]–[5]. The presence of external disturbances, however, makes the tracking problem more challenging.

One of the pioneering developments in RL-based optimal tracking control involves the idea of state augmentation [6]–[9]. Disturbance rejection capabilities have recently been incorporated in RL by adapting ideas from game theory [10]. The presence of parametric and nonparametric uncertainties [11] and extensions to nonlinear systems [12] have also been considered. Disturbance rejection controllers based on the $H_\infty$ design have been presented to solve the optimal tracking problem [13]–[15]. Extensions of these learning approaches employing policy iteration methods have also been presented recently [16]. In all the works discussed so far, the disturbance is treated as a decision maker with the disturbance signal being a measurable signal whose $\mathcal{L}_2$ norm is bounded. However, in a practical setting a disturbance is not an intelligent decision maker, and it cannot be measured or influenced by the controller. Consequently, ignoring the disturbance in the learning equation leads to control estimates that may become biased because the learning equation does not hold true as a result of the missing disturbance terms. An analysis of the bias terms arising in the closed-form value function as a result of non-disturbance sources has been carried out in [17], [18]. Different from the state augmentation approach, output regulation based on the internal model principle [19] has also recently been considered in the learning control literature [20], [21]. In these approaches the reference and the disturbance are assumed to be generated by an internal model.

To address the above difficulties, instead of directly measuring the disturbance, we introduce bias terms in the Q-function. The Q-learning algorithm is then designed to learn this modified Q-function, which includes the estimates of the bias incurred by the disturbance. Explicitly including the estimates of the bias terms prevents the crucial control parameters from being affected. To achieve disturbance rejection while tracking, we augment the dynamics with the integral of the tracking error. The Q-learning scheme learns the control parameters for this augmented system while also countering the disturbance induced bias to prevent the estimates from drifting away during the learning phase. As will be shown later in the article, although the integral action helps in rejecting the disturbance to ensure asymptotic tracking, it alone is insufficient to prevent the biasing effect of the disturbance. To relax the exploration condition, we employ off-policy learning in a way similar to [22], in which the behavioral policy that is used to generate system data does not follow the intermediate policies being learned.

The contributions of the present work are summarized as follows. In recent work [13], where the disturbance was assumed to be measurable during the data collection and learning phase, an off-policy RL technique was proposed to solve the optimal tracking problem subject to an $\mathcal{L}_2$ disturbance. This approach also required a discounted cost function. As acknowledged in [13], discounted cost functions may not guarantee closed-loop stability, and a system dependent bound on the discounting factor needs to be satisfied (see [23] for discrete-time problems). This work solves the tracking problem in the presence of external disturbances using an integral augmentation approach that does not require a discounted cost function. More importantly, compared to the robust off-policy techniques [13], [15], [24], [25], this work does not require the measurement of the disturbance, which does not need to be an $\mathcal{L}_2$ signal. In another recent work [26], the internal model principle is employed to generate the disturbance; the disturbance is, therefore, implicitly measurable. A separate identification process is needed in that approach to solve a set of regulator equations. This work attempts to address the above mentioned difficulties in solving the optimal tracking problem in the presence of unmeasurable disturbances.

The proposed scheme is demonstrated through zone control in a heating, ventilating, and air conditioning (HVAC) application. In this case study, the zone is a room in a commercial office building and the goal is to maintain the zone temperature at the desired set point. The zone temperature is affected by the weather, the airflow rate, the supply air temperature, the thermal mass of the building materials, and the internally generated thermal loads (from equipment, people, etc.). In this scenario the air-handling unit (AHU), which produces the supply air at a given temperature and airflow rate, is the actuator that manipulates the zone temperature. The optimal operation of the system is based on the balance between maintaining the zone at a specified temperature and the cost of the energy required to meet that need.

The remainder of this article is organized as follows: Section II provides a description of the problem. Section III presents the main theoretical development of this paper, where we introduce a bias compensation mechanism and integral action to create a modified Q-function. Then, the design of a Q-learning scheme is presented that learns this Q-function to solve the optimal tracking problem. In particular, we present four Q-learning algorithms based on policy iteration (PI) and value iteration (VI) using state feedback and output feedback. Section IV includes the application of the proposed scheme to the design of an HVAC zone controller. Some concluding remarks are made in Section V.

II. Problem Description

Consider a discrete-time linear time-invariant system in the state space form

$$x_{k+1} = A x_k + B u_k + D d_k, \qquad y_k = C x_k, \qquad (1)$$

where $x_k \in \mathbb{R}^n$ is the system state, $u_k \in \mathbb{R}^m$ is the control input, $d_k \in \mathbb{R}^q$ is the external disturbance, and $y_k \in \mathbb{R}^p$ is the system output. We define the tracking error as

$$e_k = y_k - r_k,$$

where $r_k \in \mathbb{R}^p$ is the reference trajectory. We assume that $m \ge p$. The control problem is to find the optimal control sequence $u_k^*$ with feedback gain $K^*$ that guarantees asymptotic output tracking, i.e., $\lim_{k \to \infty} e_k = 0$, while minimizing a quadratic cost function of the form

$$J = \sum_{i=0}^{\infty} \left( e_i^T Q_e e_i + \tilde u_i^T R \tilde u_i \right), \qquad (2)$$

where $(A, Q_e^{1/2} C)$ is observable, and $Q_e \ge 0$ and $R > 0$ are the cost weighting matrices that penalize the performance and control energy in terms of the tracking error and the relative control $\tilde u_k = u_k - u_{ss}$, respectively, with the subscript $ss$ indicating steady-state values.

III. Design Methodology

A. State Augmentation With Integral Action

In this section, we introduce the integral action to compensate for external disturbances and to guarantee tracking error convergence. To this end, we introduce a new state wk that accumulates the tracking error, which is the discrete-time equivalent of the integral action in the continuous-time setting. Based on this new state, we form the following augmented system:

$$x_{k+1} = A x_k + B u_k + D d_k, \qquad w_{k+1} = w_k + e_k,$$

which can be represented compactly in terms of the augmented state vector $X_k = [x_k^T \;\; w_k^T]^T$ as

$$X_{k+1} = \begin{bmatrix} A & 0 \\ C & I_p \end{bmatrix} X_k + \begin{bmatrix} B \\ 0 \end{bmatrix} u_k + \begin{bmatrix} D \\ 0 \end{bmatrix} d_k - \begin{bmatrix} 0 \\ I_p \end{bmatrix} r_k \triangleq \bar A X_k + \bar B u_k + \bar D d_k + \bar R r_k,$$
$$Y_k = \begin{bmatrix} C & 0 \\ 0 & I_p \end{bmatrix} X_k \triangleq \bar C X_k. \qquad (3)$$
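To make the block structure of (3) concrete, the augmented matrices can be assembled in a few lines of numpy. The system matrices below are arbitrary placeholders for illustration, not the HVAC model of Section IV:

```python
import numpy as np

def augment(A, B, C, D):
    """Build the integral-augmented matrices of (3):
    Abar = [[A, 0], [C, I]], Bbar = [B; 0], Dbar = [D; 0], Rbar = [0; -I]."""
    n = A.shape[0]
    p = C.shape[0]
    Abar = np.block([[A, np.zeros((n, p))],
                     [C, np.eye(p)]])
    Bbar = np.vstack([B, np.zeros((p, B.shape[1]))])
    Dbar = np.vstack([D, np.zeros((p, D.shape[1]))])
    Rbar = np.vstack([np.zeros((n, p)), -np.eye(p)])
    return Abar, Bbar, Dbar, Rbar

# Placeholder second-order system with one input, one output.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.1], [0.0]])
Abar, Bbar, Dbar, Rbar = augment(A, B, C, D)
```

The integrator row $[C \;\; I_p]$ is what accumulates the tracking error once $-I_p r_k$ is injected through $\bar R$.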

Assuming that the tracking problem is solvable with a steady state $X_{ss}$ and a steady-state control $u_{ss}$ that balance the effect of the disturbance and the reference by means of the integral action, we can obtain the following error dynamics:

$$\tilde X_{k+1} = \bar A \tilde X_k + \bar B \tilde u_k, \qquad \tilde Y_k = \bar C \tilde X_k, \qquad (4)$$

where $\tilde X_k = X_k - X_{ss}$. For this augmented system, we define the augmented cost function in terms of (2) as

$$J = \sum_{i=0}^{\infty} \left( \begin{bmatrix} \tilde x_i \\ \tilde w_i \end{bmatrix}^T \begin{bmatrix} C^T Q_e C & 0 \\ 0 & Q_w \end{bmatrix} \begin{bmatrix} \tilde x_i \\ \tilde w_i \end{bmatrix} + \tilde u_i^T R \tilde u_i \right) \triangleq \sum_{i=0}^{\infty} \left( \tilde X_i^T Q \tilde X_i + \tilde u_i^T R \tilde u_i \right). \qquad (5)$$
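The augmented weight $Q$ in (5) is block diagonal in the plant and integrator channels. A minimal sketch, where the output matrix is a placeholder and the weights reuse the $Q_e = 300$, $Q_w = 60$ values of Section IV:

```python
import numpy as np

def augmented_weight(C, Qe, Qw):
    """Form Q = blkdiag(C^T Qe C, Qw) as in (5)."""
    top = C.T @ Qe @ C
    n, p = top.shape[0], Qw.shape[0]
    return np.block([[top, np.zeros((n, p))],
                     [np.zeros((p, n)), Qw]])

# Placeholder output matrix for a 2-state plant with one output.
C = np.array([[1.0, 0.0]])
Q = augmented_weight(C, np.array([[300.0]]), np.array([[60.0]]))
```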

To design an optimal controller we first establish the controllability conditions for the augmented dynamics.

Lemma 1:

The augmented system (3) is controllable if the original system (1) is controllable and has no invariant zeros at z = 1, where z is the z-transform variable.

Proof:

By the Popov-Belevitch-Hautus (PBH) test, the augmented system is controllable if and only if $[\bar A - \lambda I_{n+p} \;\; \bar B]$ has full row rank $n + p$ for all $\lambda \in \mathbb{C}$. In view of the definitions of $\bar A$ and $\bar B$ in (3), the rank of $[\bar A - \lambda I_{n+p} \;\; \bar B]$ is evaluated as

$$\rho [\bar A - \lambda I_{n+p} \;\; \bar B] = \rho \begin{bmatrix} A - \lambda I_n & 0 & B \\ C & (1-\lambda) I_p & 0 \end{bmatrix}.$$

For $\lambda \ne 1$, we can eliminate the entries of $C$ using the columns of $(1-\lambda)I_p$, resulting in

$$\rho[\bar A - \lambda I_{n+p} \;\; \bar B] = \rho \begin{bmatrix} A - \lambda I_n & B & 0 \\ 0 & 0 & (1-\lambda) I_p \end{bmatrix} = \rho [A - \lambda I_n \;\; B] + p.$$

Because the original system (1) is controllable, we have $\rho[A - \lambda I_n \;\; B] = n$, and therefore,

$$\rho[\bar A - \lambda I_{n+p} \;\; \bar B] = n + p.$$

For λ = 1, we have

$$\rho[\bar A - \lambda I_{n+p} \;\; \bar B] = \rho \begin{bmatrix} A - \lambda I_n & B \\ C & 0 \end{bmatrix}.$$

Recall that the system (A, B, C) has a zero at z = 1 if and only if

$$\rho \begin{bmatrix} A - \lambda I_n & B \\ C & 0 \end{bmatrix} < n + \min\{p, m\} = n + p,$$

and, as a result, $\rho[\bar A - \lambda I_{n+p} \;\; \bar B] = n + p$ if the system $(A, B, C)$ has no invariant zeros at $z = 1$. This completes the proof. ■

Under the conditions of controllability of $(\bar A, \bar B)$ and observability of $(\bar A, Q^{1/2})$, where $(Q^{1/2})^T Q^{1/2} = Q$, there exists a unique optimal control given by

$$\tilde u_k^* = -(R + \bar B^T P^* \bar B)^{-1} \bar B^T P^* \bar A \tilde X_k \triangleq -K^* \tilde X_k, \qquad (6)$$

where P* is the unique positive definite solution to the algebraic Riccati equation (ARE) [27]

$$\bar A^T P \bar A - P + Q - \bar A^T P \bar B (R + \bar B^T P \bar B)^{-1} \bar B^T P \bar A = 0. \qquad (7)$$

Moreover, (6) and (7) suggest that the optimal feedback gain K* for the error dynamics (4) is identical to that of the augmented dynamics (3). As such, K* can be obtained independent of the disturbance, reference, and steady-state offsets, which are handled by the integral action, as will be seen.
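When the model is known, (7) can be solved offline by iterating the Riccati recursion to a fixed point. The following sketch does so for a placeholder augmented system (the matrices are illustrative, not the zone model; the weights reuse the values of Section IV):

```python
import numpy as np

def dare_gain(Abar, Bbar, Q, R, iters=20000):
    """Iterate P <- Abar^T P (Abar - Bbar K) + Q with
    K = (R + Bbar^T P Bbar)^{-1} Bbar^T P Abar; the fixed point
    solves the ARE (7) and K is the optimal gain of (6)."""
    P = np.eye(Abar.shape[0])
    for _ in range(iters):
        BtP = Bbar.T @ P
        K = np.linalg.solve(R + BtP @ Bbar, BtP @ Abar)
        P = Abar.T @ P @ (Abar - Bbar @ K) + Q
        P = (P + P.T) / 2           # keep P numerically symmetric
    return K, P

# Placeholder augmented system: 2 plant states + 1 integrator state.
Abar = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.0],
                 [1.0, 0.0, 1.0]])
Bbar = np.array([[0.0], [0.5], [0.0]])
Q = np.diag([300.0, 0.0, 60.0])     # blkdiag(C^T Qe C, Qw) for C = [1, 0]
R = np.array([[100.0]])
K, P = dare_gain(Abar, Bbar, Q, R)
```

At convergence, $\bar A - \bar B K$ is Schur stable and $P$ satisfies (7) to numerical precision.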

The design procedure discussed above is an offline approach that assumes the availability of a perfectly known model of the system. That is, the system dynamics matrices are available so that $K^*$ can be obtained by solving the ARE (7). In this work, we are interested in learning $K^*$ by employing the framework of RL. In particular, we present the design of a completely model-free Q-learning method that enables us to learn $K^*$ online. The existing RL control literature identifies a difficulty in applying RL control to the system dynamics (3) that stems from the presence of the extra term corresponding to the external disturbances [28], which are generally not available for measurement in an online setting. The disturbance, if not accounted for, biases the Q-learning estimates, causing them to be suboptimal and, more importantly, possibly rendering the closed-loop system unstable. Therefore, in the following, we present a Q-learning scheme that accounts for the biasing effect of the disturbances.

B. Bias Compensated Q-function

In this section, we first seek to derive a Q-function for the augmented system dynamics while accounting for the bias effect of the disturbances. For the design of an online algorithm, we consider the dynamics (3) rather than the error dynamics (4), which is for analysis only and involves the steady-state values that are not available a priori. Nevertheless, the resulting optimal feedback control matrix K* is the same for both (3) and (4), as mentioned in Section III-A. For a stabilizing control uk = −K Xk with policy K, the total cost incurred when starting from any state Xk is quadratic in the state as given by

$$V_K(X_k) = X_k^T P X_k, \qquad P = P^T > 0. \qquad (8)$$

The Q-function associated with K is [29]

$$Q_K(X_k, u_k) = X_k^T Q X_k + u_k^T R u_k + V_K(X_{k+1}), \qquad (9)$$

which is the sum of the one-step cost of taking an arbitrary action $u_k$ from the state $X_k$ at time $k$ and the total cost of using policy $K$ from time $k + 1$ onward. The reference $r_k$ and the disturbance $d_k$ are neither states nor decision makers and are, therefore, considered external signals that influence the dynamics. Substituting the dynamics (3) into (9), we have

$$\begin{aligned} Q_K(X_k, u_k) = {} & \begin{bmatrix} X_k \\ u_k \\ r_k \end{bmatrix}^T \begin{bmatrix} Q + \bar A^T P \bar A & \bar A^T P \bar B & \bar A^T P \bar R \\ \bar B^T P \bar A & R + \bar B^T P \bar B & \bar B^T P \bar R \\ \bar R^T P \bar A & \bar R^T P \bar B & \bar R^T P \bar R \end{bmatrix} \begin{bmatrix} X_k \\ u_k \\ r_k \end{bmatrix} \\ & + 2 X_k^T \bar A^T P \bar D d_k + 2 u_k^T \bar B^T P \bar D d_k + 2 r_k^T \bar R^T P \bar D d_k + d_k^T \bar D^T P \bar D d_k, \end{aligned} \qquad (10)$$

where the last four terms involving the unmeasurable signal $d_k$ result in an estimation bias. As $d_k$ is not known, we can lump it together with the unknown system dynamics matrices to write the Q-function more compactly as

$$Q_K(X_k, u_k) = \begin{bmatrix} X_k \\ u_k \\ r_k \\ c \end{bmatrix}^T \begin{bmatrix} Q + \bar A^T P \bar A & \bar A^T P \bar B & \bar A^T P \bar R & b_1 \\ \bar B^T P \bar A & R + \bar B^T P \bar B & \bar B^T P \bar R & b_2 \\ \bar R^T P \bar A & \bar R^T P \bar B & \bar R^T P \bar R & b_3 \\ b_1^T & b_2^T & b_3^T & b_4 \end{bmatrix} \begin{bmatrix} X_k \\ u_k \\ r_k \\ c \end{bmatrix} \triangleq \begin{bmatrix} X_k \\ u_k \\ r_k \\ c \end{bmatrix}^T \begin{bmatrix} H_{XX} & H_{Xu} & H_{Xr} & b_1 \\ H_{uX} & H_{uu} & H_{ur} & b_2 \\ H_{rX} & H_{ru} & H_{rr} & b_3 \\ b_1^T & b_2^T & b_3^T & b_4 \end{bmatrix} \begin{bmatrix} X_k \\ u_k \\ r_k \\ c \end{bmatrix} \triangleq z_k^T H z_k, \qquad (11)$$

where $c$ is an arbitrary bias scaling factor. It is worth pointing out that (11) is an extension of the LQR Q-function that incorporates the biasing effect of the disturbance, and the $b_i$'s and $c$ are dependent on the disturbance. The optimal Q-function $Q_{K^*}$ and its corresponding matrix $H^*$ are obtained from the above expression by substituting $P = P^*$. The optimal feedback gain can then be obtained as

$$K^* = (H_{uu}^*)^{-1} H_{uX}^*.$$

The result is the same as the feedback gain $K^*$ defined in (6). This suggests that learning the optimal Q-function amounts to learning the optimal feedback controller. In Section III-C, we will present iterative Q-learning algorithms that provide estimates of this optimal Q-function.
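The partitioning in (11) can be made concrete with a toy example that reads the gain off a stand-in $H$ matrix; the dimensions and the random symmetric positive definite matrix below are placeholders, not a learned Q-function:

```python
import numpy as np

# Toy dimensions: augmented state n+p = 3, input m = 1, reference p = 1,
# plus one slot for the bias scaling factor c, giving l = 6 as in (11).
nX, m = 3, 1
l = nX + m + 1 + 1

rng = np.random.default_rng(0)
M = rng.standard_normal((l, l))
H = M @ M.T + l * np.eye(l)        # symmetric positive definite stand-in

HuX = H[nX:nX + m, :nX]            # block coupling u_k with X_k
Huu = H[nX:nX + m, nX:nX + m]      # block quadratic in u_k
K = np.linalg.solve(Huu, HuX)      # K = (H_uu)^{-1} H_uX
```

Only the $H_{uu}$ and $H_{uX}$ blocks enter the gain, which is why the bias column $b_2$ must be estimated separately rather than allowed to contaminate them.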

C. Full State Feedback Q-learning Algorithms

In this section, we present a state feedback Q-learning scheme incorporating the integral action toward solving the optimal tracking problem. Before introducing the bias compensated algorithms, we first present an uncompensated Q-learning algorithm for the augmented system (3). Let $Q_K = z_k^T H z_k$, with $z_k = [X_k^T \;\; u_k^T \;\; r_k^T]^T$, be the uncompensated Q-function that does not fully take into account the effect of the disturbance. The Q-learning Bellman equation corresponding to this Q-function is obtained as [30]

$$Q_K(X_k, u_k) = X_k^T Q X_k + u_k^T R u_k + Q_K(X_{k+1}, -K X_{k+1}),$$

or equivalently

$$z_k^T H z_k = X_k^T Q X_k + u_k^T R u_k + z_{k+1}^T H z_{k+1}. \qquad (12)$$

Algorithm 0.

State Feedback Q-learning Policy Iteration Algorithm for Tracking Control

input: input-state data
output: H*
  1: initialize. Select a stabilizing initial policy $u_k^0 = -K^0 X_k + v_k$ with $v_k$ being an exploration signal. Set $j \leftarrow 0$.
  2: acquire data. Apply input $u_k^0$ to collect $L \ge l(l+1)/2$ datasets of $(X_k, u_k, r_k)$.
  3: repeat
  4:  policy evaluation. Determine the least-squares solution of
$$z_k^T H^j z_k = X_k^T Q X_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1}.$$
  5:  policy improvement. Determine an improved policy as
$$K^{j+1} = (H_{uu}^j)^{-1} H_{uX}^j.$$
  6:  $j \leftarrow j + 1$.
  7: until $\|K^j - K^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.

Algorithm 0, the uncompensated Q-learning algorithm, is based on this Q-learning equation and includes an integral feedback term to compensate for the steady-state tracking error resulting from the disturbance. However, as will be shown, the integral action alone will not prevent the disturbance from incurring bias in the Q-learning estimates during learning.

We now proceed to present the bias compensated Q-learning algorithms. For the compensated Q-function in (11), we have the following Q-learning Bellman equation

$$z_k^T H z_k = X_k^T Q X_k + u_k^T R u_k + z_{k+1}^T H z_{k+1}, \qquad (13)$$

or equivalently

$$\bar H^T \bar z_k = X_k^T Q X_k + u_k^T R u_k + \bar H^T \bar z_{k+1}, \qquad (14)$$

where

$$\bar H = \mathrm{vec}(H) \triangleq [h_{11} \;\; 2h_{12} \;\; \cdots \;\; 2h_{1l} \;\; h_{22} \;\; 2h_{23} \;\; \cdots \;\; 2h_{2l} \;\; \cdots \;\; h_{ll}]^T \in \mathbb{R}^{l(l+1)/2}, \qquad l = n + m + 2p + 1,$$
$$\bar z_k = [z_{k1}^2 \;\; z_{k1} z_{k2} \;\; \cdots \;\; z_{k1} z_{kl} \;\; z_{k2}^2 \;\; z_{k2} z_{k3} \;\; \cdots \;\; z_{k2} z_{kl} \;\; \cdots \;\; z_{kl}^2]^T,$$

with $z_{ki}$ being the components of $z_k$. Based on (14), both PI and VI algorithms are considered next to learn the Q-function and the optimal feedback controller. Algorithm 1 presents a PI Q-learning algorithm for the linear quadratic tracking problem. This is essentially a two-step procedure. In the policy evaluation step, we use the key equation (14) to solve for the unknown vector $\bar H$ in the least-squares sense by collecting $L \ge l(l+1)/2$ data samples of $(X_k, u_k, r_k)$ to form the data matrices $\Phi \in \mathbb{R}^{l(l+1)/2 \times L}$ and $\Upsilon \in \mathbb{R}^{L \times 1}$, defined by

$$\Phi = [\bar z_{k-L+1} - \bar z_{k-L+2} \;\;\; \bar z_{k-L+2} - \bar z_{k-L+3} \;\;\; \cdots \;\;\; \bar z_k - \bar z_{k+1}],$$
$$\Upsilon = \begin{bmatrix} X_{k-L+1}^T Q X_{k-L+1} + u_{k-L+1}^T R u_{k-L+1} \\ X_{k-L+2}^T Q X_{k-L+2} + u_{k-L+2}^T R u_{k-L+2} \\ \vdots \\ X_k^T Q X_k + u_k^T R u_k \end{bmatrix}.$$

Algorithm 1.

Bias Compensated State Feedback Q-learning Policy Iteration Algorithm for Tracking Control

input: input-state data
output: H*
  1: initialize. Select a stabilizing initial policy $u_k^0 = -K^0 X_k + v_k$ with $v_k$ being an exploration signal. Set $j \leftarrow 0$.
  2: acquire data. Apply input $u_k^0$ to collect $L \ge l(l+1)/2$ datasets of $(X_k, u_k, r_k)$.
  3: repeat
  4:  policy evaluation. Determine the least-squares solution of
$$(\bar H^j)^T (\bar z_k - \bar z_{k+1}) = X_k^T Q X_k + u_k^T R u_k.$$
  5:  policy improvement. Determine an improved policy as
$$K^{j+1} = (H_{uu}^j)^{-1} H_{uX}^j.$$
  6:  $j \leftarrow j + 1$.
  7: until $\|K^j - K^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.

Then, the least-squares solution of (14) is given by

$$\bar H^j = (\Phi \Phi^T)^{-1} \Phi \Upsilon, \qquad (15)$$

where $\bar H^j$ is the $j$th estimate of the unknown vector $\bar H$. Because $u_k = -K X_k$ is linearly dependent on $X_k$, (15) will not have a unique solution, which is needed for convergence to the optimal parameters. To overcome this issue, an excitation noise is added to $u_k$ to guarantee a unique solution to (15). Note, however, that the exploration noise is unable to excite the bias term $c$ in the vector $z_k$, and therefore, there will be a zero entry corresponding to the quadratic term in $c$ in the regression vector $\bar z_k - \bar z_{k+1}$. Nevertheless, because the bias scaling factor itself is arbitrarily selected, this issue can be tackled by separately adding an arbitrary offset to the corresponding entry of $\bar z_{k+1}$. The above measures ensure that the following condition can be satisfied:

$$\rho(\Phi) = l(l+1)/2. \qquad (16)$$

Clearly, the rank condition (16) is necessary to obtain the optimal solution, which is a unique solution to the least-squares problem (15). This rank condition is crucial to exploration in off-policy RL control algorithms [22]. Interested readers can refer to [31] for the stochastic version of this condition. Algorithm 1 requires a stabilizing (not necessarily optimal) policy at initialization. However, this requirement could be restrictive in certain applications, as such a policy may be difficult to obtain when the system dynamics are not known in advance or are nonlinear [32]. To address this difficulty, we can refer to a slightly different iterative technique, VI, which does not impose this restriction. A bias compensated Q-learning VI algorithm is presented in Algorithm 2.
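A sketch of the quadratic regression vector $\bar z_k$ used in (14)-(15), together with a numerical check of the rank condition (16). The dimensions are placeholders, and random vectors stand in for the excited closed-loop data that the algorithm would actually collect:

```python
import numpy as np

def zbar(z):
    """Quadratic basis [z1^2, z1*z2, ..., z1*zl, z2^2, ..., zl^2] of (14)."""
    l = z.shape[0]
    return np.concatenate([z[i] * z[i:] for i in range(l)])

rng = np.random.default_rng(1)
l = 4                               # placeholder dimension of z_k
nreg = l * (l + 1) // 2             # number of unknown entries of Hbar
# Columns of Phi are differences of basis vectors, as in the PI data matrix.
Phi = np.column_stack(
    [zbar(rng.standard_normal(l)) - zbar(rng.standard_normal(l))
     for _ in range(2 * nreg)])
```

With sufficiently exciting data, $\Phi$ attains full row rank $l(l+1)/2$ and the least-squares problem (15) has a unique solution.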

Algorithm 2.

Bias Compensated State Feedback Q-learning Value Iteration Algorithm for Tracking Control

input: input-state data
output: H*
  1: initialize. Select an arbitrary policy $u_k^0 = -K^0 X_k + v_k$ with $v_k$ being an exploration signal. Set $j \leftarrow 0$ and $H^0 \ge 0$.
  2: acquire data. Apply input $u_k^0$ to collect $L \ge l(l+1)/2$ datasets of $(X_k, u_k, r_k)$.
  3: repeat
  4:  value update. Determine the least-squares solution of
$$(\bar H^{j+1})^T \bar z_k = X_k^T Q X_k + u_k^T R u_k + (\bar H^j)^T \bar z_{k+1}.$$
  5:  policy improvement. Determine an improved policy as
$$K^{j+1} = (H_{uu}^{j+1})^{-1} H_{uX}^{j+1}.$$
  6:  $j \leftarrow j + 1$.
  7: until $\|K^j - K^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.

The data matrices $\Phi \in \mathbb{R}^{l(l+1)/2 \times L}$ and $\Upsilon \in \mathbb{R}^{L \times 1}$ for the case of VI are defined by

$$\Phi = [\bar z_k^1 \;\; \bar z_k^2 \;\; \cdots \;\; \bar z_k^L],$$
$$\Upsilon = \left[ X_k^{1T} Q X_k^1 + u_k^{1T} R u_k^1 + (\bar H^j)^T \bar z_{k+1}^1 \;\;\; \cdots \;\;\; X_k^{LT} Q X_k^L + u_k^{LT} R u_k^L + (\bar H^j)^T \bar z_{k+1}^L \right]^T,$$
where the superscripts $1, \ldots, L$ index the data samples.

Remark 1:

Algorithms 1 and 2 are extensions of the standard LQR Q-learning algorithms found in the literature [33]. They learn the disturbance-induced bias terms without measuring the disturbance directly, which enables these algorithms to prevent the biasing components from affecting the optimal control parameters.

D. Output Feedback Q-learning Algorithms

Section III-C presented designs that were based on the feedback of the state xk. However, in practice, only a subset of the state is measurable through the system output. Classical state estimation techniques that estimate xk through input-output measurements do not apply as the system dynamics are unknown. In our previous work [34], we designed a Q-learning scheme based on a parameterization of the system state in terms of a sequence of delayed measurements of input, output, and disturbance, as follows

$$x_k = M_y \bar y_{k-1,k-N} + M_u \bar u_{k-1,k-N} + M_d \bar d_{k-1,k-N}, \qquad (17)$$

where $N$ is an upper bound on the system's observability index. Interested readers can refer to [34] for the details of this parameterization.

We use the state parameterization (17) to describe the Q-function in (9). Based on (17), we obtain the parameterization of the augmented state $X_k$ as follows:

$$X_k = \begin{bmatrix} M_u & M_y & M_d & 0 \\ 0 & 0 & 0 & I_p \end{bmatrix} \begin{bmatrix} \bar u_k \\ \bar y_k \\ \bar d_k \\ w_k \end{bmatrix} \triangleq \begin{bmatrix} \bar M_u & \bar M_y & \bar M_d & \bar M_w \end{bmatrix} \begin{bmatrix} \bar u_k \\ \bar y_k \\ \bar d_k \\ w_k \end{bmatrix}. \qquad (18)$$
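The delayed data vectors $\bar u_{k-1,k-N}$ and $\bar y_{k-1,k-N}$ appearing in (17)-(18) are simply stacked recent samples. A minimal sketch with placeholder scalar logs:

```python
import numpy as np

def stack_history(log, k, N):
    """Return [s_{k-1}; s_{k-2}; ...; s_{k-N}] as one stacked vector."""
    return np.concatenate([log[k - j] for j in range(1, N + 1)])

# Placeholder scalar input/output logs (one sample per entry).
u_log = [np.array([float(t)]) for t in range(10)]
y_log = [np.array([2.0 * t]) for t in range(10)]
N, k = 3, 5
ubar = stack_history(u_log, k, N)   # stacks u_4, u_3, u_2
ybar = stack_history(y_log, k, N)   # stacks y_4, y_3, y_2
```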

Next, we derive the expressions for the output feedback Q-function Q𝒦 and the associated output feedback policy 𝒦.

Substitution of the augmented state vector with its parameterized form (18) into the state feedback Q-function (11) results in

$$Q_{\mathcal{K}} = \begin{bmatrix} \bar u_k \\ \bar y_k \\ w_k \\ u_k \\ r_k \\ c \end{bmatrix}^T \begin{bmatrix} \mathcal{H}_{\bar u \bar u} & \mathcal{H}_{\bar u \bar y} & \mathcal{H}_{\bar u w} & \mathcal{H}_{\bar u u} & \mathcal{H}_{\bar u r} & b_1 \\ \mathcal{H}_{\bar y \bar u} & \mathcal{H}_{\bar y \bar y} & \mathcal{H}_{\bar y w} & \mathcal{H}_{\bar y u} & \mathcal{H}_{\bar y r} & b_2 \\ \mathcal{H}_{w \bar u} & \mathcal{H}_{w \bar y} & \mathcal{H}_{w w} & \mathcal{H}_{w u} & \mathcal{H}_{w r} & b_3 \\ \mathcal{H}_{u \bar u} & \mathcal{H}_{u \bar y} & \mathcal{H}_{u w} & \mathcal{H}_{u u} & \mathcal{H}_{u r} & b_4 \\ \mathcal{H}_{r \bar u} & \mathcal{H}_{r \bar y} & \mathcal{H}_{r w} & \mathcal{H}_{r u} & \mathcal{H}_{r r} & b_5 \\ b_1^T & b_2^T & b_3^T & b_4^T & b_5^T & b_6 \end{bmatrix} \begin{bmatrix} \bar u_k \\ \bar y_k \\ w_k \\ u_k \\ r_k \\ c \end{bmatrix} \triangleq \zeta_k^T \mathcal{H} \zeta_k, \qquad (19)$$

where $\mathcal{H} = \mathcal{H}^T \in \mathbb{R}^{l \times l}$ with $l = mN + pN + m + 2p + 1$ and the submatrices are defined in an obvious way as in [34]. Note that the $b_i$'s and $c$ are again the disturbance dependent terms, as seen in the case of the state feedback Q-function (11). Notice that the delayed disturbance dependent term $\bar d_k$, introduced as a result of the state parameterization, has been lumped together with the biasing term $c$.

The optimal output feedback policy $\mathcal{K}^*$ is obtained by minimizing the optimal output feedback Q-function $Q_{\mathcal{K}^*}$ with respect to $u_k$. This results in

$$\mathcal{K}^* = (\mathcal{H}_{uu}^*)^{-1} \begin{bmatrix} \mathcal{H}_{u \bar u}^* & \mathcal{H}_{u \bar y}^* & \mathcal{H}_{u w}^* \end{bmatrix}.$$

Finally, we have the following feedback control law:

$$u_k = -(\mathcal{H}_{uu})^{-1} \left( \mathcal{H}_{u \bar u} \bar u_{k-1,k-N} + \mathcal{H}_{u \bar y} \bar y_{k-1,k-N} + \mathcal{H}_{u w} w_k \right) \triangleq -\mathcal{K} \begin{bmatrix} \bar u_{k-1,k-N}^T & \bar y_{k-1,k-N}^T & w_k^T \end{bmatrix}^T. \qquad (20)$$

It will be shown in the proof of Theorem 1 that the integral action $w_k$ is able to compensate for the unmeasurable disturbances $\bar d_k$ and $d_k$ in a way similar to the state feedback case.

Having formulated the output feedback Q-function, we need to develop a Q-learning scheme that can learn this function. In view of the output feedback Q-function (19), the output feedback Q-learning Bellman equation follows from (13) as

$$\bar{\mathcal{H}}^T \bar \zeta_k = Y_k^T Q_y Y_k + u_k^T R u_k + \bar{\mathcal{H}}^T \bar \zeta_{k+1}, \qquad (21)$$

where

$$\bar{\mathcal{H}} = \mathrm{vec}(\mathcal{H}) \in \mathbb{R}^{l(l+1)/2}, \qquad l = mN + pN + m + 2p + 1.$$

The regression vector $\bar \zeta_k \in \mathbb{R}^{l(l+1)/2}$ is defined as

$$\bar \zeta_k = [\zeta_{k1}^2 \;\; \zeta_{k1} \zeta_{k2} \;\; \cdots \;\; \zeta_{k1} \zeta_{kl} \;\; \zeta_{k2}^2 \;\; \zeta_{k2} \zeta_{k3} \;\; \cdots \;\; \zeta_{k2} \zeta_{kl} \;\; \cdots \;\; \zeta_{kl}^2]^T,$$

where $\zeta_{ki}$ are the components of $\zeta_k = [\zeta_{k1} \;\; \zeta_{k2} \;\; \cdots \;\; \zeta_{kl}]^T$.

In comparison with (14), the output feedback learning equation (21) involves more parameters, as the internal state information is not readily available, but rather is embedded in the sufficiently long sequence of input-output data $\bar u_{k-1,k-N}$ and $\bar y_{k-1,k-N}$.

Equation (21) is utilized in the output feedback PI and VI algorithms. The policy iteration Algorithm 3 is the output feedback counterpart of the policy iteration algorithm, Algorithm 1. The two key differences between these two algorithms can be seen in the policy evaluation and the policy update steps, where we observe that the learning and the control update equations do not involve the state information. Algorithm 4 operates in the same way as the state feedback VI Algorithm 2, but without requiring the measurement of the internal state. Furthermore, it also relaxes the condition of a stabilizing initial gain 𝒦0. It is worth noting at this point that both Algorithms 3 and 4 must also satisfy the rank condition (16), with more exploration compared to the state feedback algorithms because the number of unknown parameters is larger.

Algorithm 3.

Bias Compensated Output Feedback Q-learning Policy Iteration Algorithm for Tracking Control

input: input-output data
output: $\mathcal{H}^*$
  1: initialize. Select a stabilizing initial policy $u_k^0 = -\mathcal{K}^0 [\bar u_{k-1,k-N}^T \;\; \bar y_{k-1,k-N}^T \;\; w_k^T]^T + v_k$ with $v_k$ being an exploration signal. Set $j \leftarrow 0$.
  2: acquire data. Apply input $u_k^0$ to collect $L \ge l(l+1)/2$ datasets of $(\bar u_{k-1,k-N}, \bar y_{k-1,k-N}, w_k, u_k, r_k)$.
  3: repeat
  4:  policy evaluation. Determine the least-squares solution of
$$(\bar{\mathcal{H}}^j)^T (\bar \zeta_k - \bar \zeta_{k+1}) = Y_k^T Q_y Y_k + u_k^T R u_k.$$
  5:  policy improvement. Determine an improved policy as
$$\mathcal{K}^{j+1} = (\mathcal{H}_{uu}^j)^{-1} [\mathcal{H}_{u \bar u}^j \;\; \mathcal{H}_{u \bar y}^j \;\; \mathcal{H}_{u w}^j].$$
  6:  $j \leftarrow j + 1$.
  7: until $\|\mathcal{K}^j - \mathcal{K}^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.

Algorithm 4.

Bias Compensated Output Feedback Q-learning Value Iteration Algorithm for Tracking Control

input: input-output data
output: $\mathcal{H}^*$
  1: initialize. Select an arbitrary policy $u_k^0 = -\mathcal{K}^0 [\bar u_{k-1,k-N}^T \;\; \bar y_{k-1,k-N}^T \;\; w_k^T]^T + v_k$ with $v_k$ being an exploration signal. Set $j \leftarrow 0$ and $\mathcal{H}^0 \ge 0$.
  2: acquire data. Apply input $u_k^0$ to collect $L \ge l(l+1)/2$ datasets of $(\bar u_{k-1,k-N}, \bar y_{k-1,k-N}, w_k, u_k, r_k)$.
  3: repeat
  4:  value update. Determine the least-squares solution of
$$(\bar{\mathcal{H}}^{j+1})^T \bar \zeta_k = Y_k^T Q_y Y_k + u_k^T R u_k + (\bar{\mathcal{H}}^j)^T \bar \zeta_{k+1}.$$
  5:  policy improvement. Determine an improved policy as
$$\mathcal{K}^{j+1} = (\mathcal{H}_{uu}^{j+1})^{-1} [\mathcal{H}_{u \bar u}^{j+1} \;\; \mathcal{H}_{u \bar y}^{j+1} \;\; \mathcal{H}_{u w}^{j+1}].$$
  6:  $j \leftarrow j + 1$.
  7: until $\|\mathcal{K}^j - \mathcal{K}^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.

We now establish the convergence of the proposed scheme toward achieving optimal tracking in Theorem 1:

Theorem 1:

Assume that the controllability conditions in Lemma 1 hold and $(\bar A, Q^{1/2})$ (for state feedback) or $(\bar A, Q_y^{1/2} \bar C)$ (for output feedback) is observable. Then the proposed scheme generates a sequence of controls $\{u_k^j, j = 1, 2, 3, \ldots\}$ that converges to the optimal feedback controller under the rank condition (16), and the tracking error $e_k$ converges to zero if the disturbance and reference vary infrequently relative to the control dynamics.

Proof:

Consider first the state feedback algorithms, Algorithms 1 and 2. The bias compensated Q-function (11) satisfies the state feedback Q-learning equation (13), which forms the basis of Algorithms 1 and 2. This equation has a unique solution if the rank condition (16) holds. Given the controllability and observability assumptions on the pairs $(\bar A, \bar B)$ and $(\bar A, Q^{1/2})$, respectively, the PI and VI Q-learning algorithms, Algorithms 1 and 2, converge to the optimal feedback matrix $K^*$, as shown in [30], [35]. Under $K^*$, the closed-loop dynamics are given by

$$X_{k+1} = (\bar A - \bar B K^*) X_k + [\bar D \;\; \bar R] [d_k^T \;\; r_k^T]^T.$$

The disturbance and reference can be considered to be in a steady state if they vary infrequently relative to the dynamics. Because $\bar A - \bar B K^*$ is Schur stable, under these external steady-state inputs, $X_k$ reaches its steady state $X_{ss}$ and, therefore, $w_{k+1} = w_k$. Then, because $w_{k+1} = w_k + e_k$, it follows that the tracking error $e_k$ converges to zero asymptotically.
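The steady-state argument above can be checked numerically on a scalar example: under a stabilizing (not optimal) augmented gain, a constant unmeasured disturbance and a constant reference drive the tracking error to zero through the integral state. All numbers below are placeholders:

```python
# Scalar plant x_{k+1} = a x_k + b u_k + d0 with constant unmeasured
# disturbance d0, integral state w_{k+1} = w_k + (x_k - r), and a
# stabilizing augmented gain [k1, k2]; all values are illustrative.
a, b, d0, r = 0.9, 0.5, 0.3, 1.0
k1, k2 = 0.8, 0.2

x, w = 0.0, 0.0
for _ in range(400):
    u = -(k1 * x + k2 * w)       # u_k = -K X_k with X_k = [x_k, w_k]
    x_next = a * x + b * u + d0  # plant update
    w = w + (x - r)              # integral of the tracking error e_k
    x = x_next

e = x - r                        # tracking error after 400 steps
```

At the fixed point $w_{k+1} = w_k$ forces $x = r$, so the output settles exactly on the reference despite the disturbance never being measured.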

Consider next the output feedback algorithms, Algorithms 3 and 4. The bias compensated output feedback Q-function satisfies the output feedback Q-learning equation (21), which is employed in Algorithms 3 and 4. Because $(\bar A, \bar B)$ is controllable and $(\bar A, Q_y^{1/2} \bar C)$ is observable, the output feedback PI and VI algorithms converge to the optimal output feedback gain $\mathcal{K}^*$, as shown in our previous work [36], under the rank condition (16). The closed-loop dynamics under the output feedback control are

$$X_{k+1} = \bar A X_k - \bar B K \begin{bmatrix} (M_u \bar u_{k-1,k-N} + M_y \bar y_{k-1,k-N})^T & w_k^T \end{bmatrix}^T + [\bar D \;\; \bar R][d_k^T \;\; r_k^T]^T.$$

Adding and subtracting the missing disturbance sequence $M_d \bar d_{k-1,k-N}$ to obtain the state feedback form using the parameterization (17) results in

$$X_{k+1} = (\bar A - \bar B K) X_k + [\bar D \;\; \bar R][d_k^T \;\; r_k^T]^T + \bar B K_{1:n} M_d \bar d_{k-1,k-N}.$$

Then, similar to the state feedback case, we also have $\bar d_{k-1,k-N}$ in the steady state, which, in view of the fact that $\bar A - \bar B K$ is Schur stable, implies that $X_k$ will reach the steady state $X_{ss}$ and $w_{k+1} = w_k$. Therefore, the tracking error $e_k$ converges to zero asymptotically. ■

IV. Application to HVAC Zone Control

In this section, we apply the proposed scheme to design an HVAC controller for a zone in a commercial building. This is a multi-objective optimal control problem that requires accounting for both the zone comfort and the energy consumption. To formulate this problem into the presented mathematical framework, the zone comfort is associated with obtaining the desired thermal state (i.e., set point temperature) and the energy cost corresponds to the control energy utilized by the actuators. We consider the AHU as the actuator that supplies the zone with air of an appropriate temperature (i.e., supply air) to manipulate the zone temperature. The dynamic model of an HVAC zone used in this case study is adapted from [37]. The thermal dynamics of a building zone are given by the following set of differential equations:

$$\frac{dT_z}{dt} = \frac{f_{sa} \rho_a c_{pa}}{C_z}(T_{sa} - T_z) + \frac{2 U_{wew} A_{wew}}{C_z}(T_{wew} - T_z) + \frac{2 U_{wns} A_{wns}}{C_z}(T_{wns} - T_z) + \frac{K_o}{C_z}(T_o - T_z) + \frac{q}{C_z},$$
$$\frac{dT_{wew}}{dt} = \frac{U_{wew} A_{wew}}{C_{wew}}(T_z - T_{wew}) + \frac{U_{wew} A_{wew}}{C_{wew}}(T_o - T_{wew}),$$
$$\frac{dT_{wns}}{dt} = \frac{U_{wns} A_{wns}}{C_{wns}}(T_z - T_{wns}) + \frac{U_{wns} A_{wns}}{C_{wns}}(T_o - T_{wns}),$$

which are discretized with a sampling period of one minute to obtain a state space model of the form (1). The description of the quantities is given in the Nomenclature. The nominal parameters given in the Nomenclature are only used to compute the true values of the optimal parameters, against which our estimates are compared. In other words, the proposed control scheme itself does not require any knowledge of these parameters, and the optimal control parameters are learned online.
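The discretization step can be sketched generically: forward-Euler with the one-minute sampling period maps continuous-time linear thermal dynamics $\dot x = A_c x + B_c u$ into the form (1). The matrices below are illustrative placeholders, not the calibrated zone parameters from the Nomenclature:

```python
import numpy as np

def euler_discretize(Ac, Bc, dt):
    """Forward-Euler discretization: x_{k+1} = (I + dt*Ac) x_k + dt*Bc u_k."""
    return np.eye(Ac.shape[0]) + dt * Ac, dt * Bc

# Placeholder continuous-time thermal model with states (T_z, T_wew, T_wns);
# the entries are made-up rate constants per minute, not Nomenclature values.
Ac = np.array([[-0.05, 0.01, 0.01],
               [0.02, -0.04, 0.00],
               [0.02, 0.00, -0.04]])
Bc = np.array([[0.03], [0.00], [0.00]])  # supply air temperature input
A, B = euler_discretize(Ac, Bc, dt=1.0)  # one-minute sampling period
```

A zero-order-hold (matrix exponential) discretization would be the more accurate choice for a real zone model; forward-Euler is used here only to keep the sketch short.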

In this model the outside temperature and the heat gains from the occupants, lights, etc., are the disturbances, which are all assumed to be unmeasurable. Let the user-defined performance index be specified as $Q_e = 300$, $Q_w = 60$, and $R = 100$. Note that $Q_w \ne 0$ is needed for the observability of the pair $(\bar A, Q^{1/2})$ of the augmented system. The optimal feedback control gain for the augmented dynamics (3) can be found by solving the Riccati equation (7) and is given by

$$K^* = [1.6864 \;\; 0.1413 \;\; 0.1829 \;\; 0.5906].$$

Before presenting the results of the bias compensated Q-learning algorithms, we will first test the uncompensated Q-learning algorithm, Algorithm 0, to analyze the effect of the unmeasurable disturbances. The parameter estimates experience bias as a result of the external disturbances. The final estimate of the control matrix is

$$\hat K = [0.6861 \;\; 25.4644 \;\; 32.9402 \;\; 1.0840].$$

In this particular case, even though the algorithm was initialized with a stabilizing control matrix, the final control gain estimate is not only biased but also destabilizing. This can be seen from the eigenvalues of the resulting closed-loop dynamics matrix $\bar A - \bar B \hat K$, which are $0.3235$, $1.5391$, and $0.9695 \pm j0.0114$, with $\lambda = 1.5391$ being the unstable eigenvalue. For comparison with existing works, recall the robust off-policy algorithms in [24] and [13]. The off-policy method in [24] employs the knowledge of the input and disturbance matrices, whereas the method in [13] removes this requirement. However, both of these methods require access to the disturbance data during the learning phase (see Step 1 of Algorithm 2 in [13], [24]). The model-free algorithm in [13] is applied to our problem for comparison with our model-free method, first with the disturbance data available during the learning phase and then without. A discounting factor of $\alpha = 0.1$ and a disturbance attenuation level of $\gamma = 5$ were selected for the $H_\infty$ cost function described in [13]. The optimal control gain obtained by this method is

K = [44.7092  0.1066  0.1331  45.6562],

where the last element corresponds to the reference trajectory, Tzr. The learning algorithm in [13] was then applied, with disturbance signals To and q known during the learning phase. The final estimate of the control gain is

K̂ = [44.7095  0.1067  0.1329  45.6565]

which shows convergence to the optimal solution, consistent with the results in [13]. Next, we apply the learning algorithm from [13] without the measurement of the disturbance, although the same disturbances act on the system. In this case, the final estimate of the control gain is

K̂ = [72.1094  0.6641  2.7734  52.1094].

The lack of measurement of the disturbances has resulted in a bias in the estimates. Our algorithms address this limitation.
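The diagnosis above, that a learned gain is destabilizing, amounts to checking the spectral radius of the estimated closed-loop matrix. A minimal sketch of this check, with a hypothetical plant and gains (not the paper's system):

```python
import numpy as np

def is_stabilizing(A, B, K):
    """Return True if all eigenvalues of A - B K lie strictly inside the
    unit circle (discrete-time closed-loop stability)."""
    eigs = np.linalg.eigvals(A - B @ K)
    return bool(np.max(np.abs(eigs)) < 1.0)

# Hypothetical example: an unstable plant with one stabilizing and one
# destabilizing gain.
A = np.array([[1.2, 0.0],
              [0.1, 0.5]])
B = np.array([[1.0],
              [0.0]])
K_good = np.array([[0.5, 0.0]])   # A - B K_good has eigenvalues 0.7, 0.5
K_bad = np.array([[-0.1, 0.0]])   # A - B K_bad has eigenvalue 1.3 (unstable)
```

Applying such a check to each learned gain before deployment is a simple safeguard against the bias-induced instability observed with the uncompensated algorithms.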

We will now focus on our proposed bias compensated Q-learning algorithms. Consider first the policy iteration algorithm, Algorithm 1. We start the algorithm with a stabilizing initial control, which, in this example, is a simple proportional-integral controller with proportional gain 0.8432 and integral gain 0.2953. This corresponds to an initial policy K0 = [0.8432 0 0 0.2953]. To satisfy the rank condition (16), we add sinusoids of different frequencies and magnitudes to the feedback control policy for the supply air temperature Tsa. This initial control is applied during the first 30 minutes to collect online system data. Fig. 1 shows the evolution of the closed-loop response under Algorithm 1.

Fig. 1. Evolution of the closed-loop system under Algorithm 1.

During the first 30 minutes we see an exploratory response while the output is still trying to track the reference signal. This is a result of applying an already stabilizing policy that could provide suboptimal tracking in the presence of added exploratory signals. These 30 minutes of online data are then utilized to solve the Bellman equation in the policy iteration step and to update the control parameters in the subsequent iterations j = 1, 2, ….
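The policy evaluation and improvement steps described above follow the standard Q-learning pattern for LQ problems (in the spirit of [30]): fit a quadratic Q-function to the collected data by least squares on the Bellman equation, then extract the improved gain from its partitions. The sketch below illustrates one such loop on a hypothetical 2-state system; it omits the paper's bias-compensation terms and integral action, and random exploration noise plays the role of the added sinusoids.

```python
import numpy as np

# Standard LQ Q-learning policy iteration (cf. [30]) on a hypothetical
# 2-state system; the bias-compensation mechanism of Algorithm 1 is omitted.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Qc, Rc = np.eye(2), np.array([[1.0]])
n, m = 2, 1
p = n + m                      # dimension of z = [x; u]
iu, ju = np.triu_indices(p)

def phi(z):
    # Quadratic basis: unique entries of z z^T with off-diagonals doubled,
    # so that phi(z) @ theta = z^T H z for symmetric H.
    f = np.outer(z, z)[iu, ju]
    f[iu != ju] *= 2.0
    return f

def theta_to_H(theta):
    H = np.zeros((p, p))
    H[iu, ju] = theta
    H[ju, iu] = theta
    return H

rng = np.random.default_rng(0)
K = np.zeros((m, n))           # initial stabilizing policy (A is stable here)
for _ in range(10):            # policy iteration cycles
    x = np.array([1.0, -1.0])
    Phi, cost = [], []
    for _ in range(80):        # collect exploratory data under the policy
        u = -K @ x + 0.5 * rng.standard_normal(m)   # exploration noise
        xn = A @ x + B @ u
        zn = np.concatenate([xn, -K @ xn])
        Phi.append(phi(np.concatenate([x, u])) - phi(zn))
        cost.append(x @ Qc @ x + u @ Rc @ u)
        x = xn
    # Policy evaluation: least-squares solution of the Bellman equation.
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(cost), rcond=None)
    H = theta_to_H(theta)
    # Policy improvement: u = -Huu^{-1} Hux x.
    K = np.linalg.solve(H[n:, n:], H[n:, :n])
```

As the paper's Algorithm 0 results show, this uncompensated scheme would be biased by an unmeasured disturbance acting through the dynamics; the extended Q-function of the proposed method adds separate bias terms to absorb that effect.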

At the beginning of hour 2, a disturbance is introduced as a result of a decrease in the outside temperature, which the controller compensates almost seamlessly with the expected increase in the supply air temperature. At the same time, the desired temperature set point increases. As can be seen, the zone temperature responds to this change and converges to the new set point. Similarly, when the heat gain load changes because of occupancy, lighting, or other sources, the proposed scheme is able to track the desired reference trajectory for the remainder of the period. The proposed Q-learning scheme learns the optimal control parameters for the augmented system while compensating for the external disturbances that would otherwise cause Q-learning to diverge. The final estimate of the optimal control gain is

K̂ = [1.6865  0.1414  0.1831  0.5907],

which is close to the optimal value despite the presence of the unmeasurable disturbances. This is a result of the bias compensation mechanism introduced in the Q-function and is an advantage of the proposed scheme.

We now proceed to validate the proposed VI algorithm, Algorithm 2. We test this algorithm under the same conditions as Algorithm 1. Different from the PI algorithm, we initialize the VI algorithm with a zero feedback gain, that is, K0 = [0 0 0 0]. Clearly, this gain is nonstabilizing and we cannot expect tracking during the first 30 minutes of learning, as can be seen in the zone temperature response in Fig. 2. It is interesting to note that the post learning trajectories, after the first 30 minutes of learning, are the same for both the PI and VI algorithms. This is because both algorithms eventually converge to the optimal control parameters. The final estimate of the optimal control gain is

K̂ = [1.6862  0.1410  0.1822  0.5905].

Fig. 2. Evolution of the closed-loop system under Algorithm 2.

Note that more iterations are required for Algorithm 2 to converge to the optimal parameters because the search space, which is not limited by a stabilizing initial controller, is larger.
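This gap in iteration counts can be seen even in the model-based analogue: Riccati value iteration contracts linearly, while policy iteration (Hewer's method, the model-based counterpart of PI) converges quadratically from a stabilizing gain. A hypothetical comparison, not the HVAC model:

```python
import numpy as np

# Count iterations to convergence for Riccati value iteration versus
# policy (Hewer) iteration on a hypothetical 2-state system.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])

def gain(P):
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Value iteration: P_0 = 0, no stabilizing initial policy needed.
P, vi_iters = np.zeros((2, 2)), 0
while True:
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ gain(P)
    vi_iters += 1
    if np.max(np.abs(P_next - P)) < 1e-10:
        break
    P = P_next

# Policy iteration: exact policy evaluation via a discrete Lyapunov solve,
# starting from the stabilizing gain K = 0 (A itself is stable here).
K, pi_iters = np.zeros((1, 2)), 0
while True:
    F = A - B @ K
    S = Q + K.T @ R @ K
    # Solve the Lyapunov equation P = S + F^T P F via vectorization.
    vecP = np.linalg.solve(np.eye(4) - np.kron(F.T, F.T), S.reshape(-1))
    K_next = gain(vecP.reshape(2, 2))
    pi_iters += 1
    if np.max(np.abs(K_next - K)) < 1e-10:
        K = K_next
        break
    K = K_next
```

The trade-off mirrors the one in the text: VI dispenses with the stabilizing initial controller at the cost of more iterations.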

The results presented so far dealt with the full state feedback case, that is, the measurement of the internal state was required for both Algorithms 1 and 2 to learn the optimal control parameters. In the following, we present results for Algorithms 3 and 4 that do not impose this requirement. These algorithms are driven completely by the input-output data instead of requiring the internal state information. For the HVAC application, this means that we no longer need to install sensors on the walls to measure the wall temperature. Instead, only zone temperature measurements are required. This reflects a more realistic HVAC control system.
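The output feedback algorithms rest on re-expressing the state through a finite window of past inputs and outputs, as in the state parameterization (17). For an observable pair this reconstruction is exact once the window length reaches the observability index; the sketch below demonstrates the idea on a hypothetical 2-state, single-output system (not the HVAC model).

```python
import numpy as np

# State reconstruction from input-output data for an observable system:
# x_k = M_y @ y_window + M_u @ u_window, with window length N equal to
# the observability index. Hypothetical system, not the HVAC model.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
N = 2

# y_window = O x_{k-N} + T u_window, with O the observability matrix and
# T the input Toeplitz matrix.
O = np.vstack([C, C @ A])                 # [C; CA]
T = np.zeros((N, N))
T[1, 0] = (C @ B).item()                  # y_{k-1} picks up C B u_{k-2}
Ctrl = np.hstack([A @ B, B])              # x_k = A^N x_{k-N} + Ctrl u_window

M_y = A @ A @ np.linalg.inv(O)
M_u = Ctrl - M_y @ T

# Verify the reconstruction on a short simulated trajectory.
x = np.array([0.7, -0.3])
u_window, y_window = [], []
for u in (0.3, -0.5):
    y_window.append((C @ x).item())
    u_window.append(u)
    x = A @ x + B.flatten() * u
x_hat = M_y @ np.array(y_window) + M_u @ np.array(u_window)
```

Because the reconstruction is exact, the learning algorithms can substitute this input-output window for the internal state, which is what removes the need for wall temperature sensors in the HVAC setting.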

For the purpose of comparison, the user-defined cost matrices and the rest of the conditions for the output feedback algorithms are kept the same as for the state feedback algorithms. We first validate the output feedback policy iteration algorithm, Algorithm 3, utilizing a proportional-integral controller for initial tracking. The nominal optimal output feedback control parameters for the augmented dynamics (3) can be found by solving the Riccati equation (7) and the state parameterization (17) and are given by

𝒦 = [0.4230  4.5286  3.1156  19.2045  24.2879  8.1313  0.5906].

Algorithm 3 involves a longer learning phase because there are more unknown parameters to be determined. Specifically, we collected 50 datasets of the input-output data as compared to the 30 datasets for the state feedback algorithms. It can be seen in Fig. 3 that the output feedback algorithm regulates the zone temperature well, similar to the state feedback algorithm but without requiring wall temperature measurements. The final estimate of the output feedback optimal control gain using Algorithm 3 is

𝒦̂ = [0.4230  4.5232  3.1105  19.1830  24.2534  8.1178  0.5906]

which is close to the optimal output feedback parameters 𝒦.

Fig. 3. Evolution of the closed-loop system under Algorithm 3.

If stabilizing initial output feedback parameters are not known, then Algorithm 4 can be applied. In this case the output feedback control parameters are initialized to zero. The system response is shown in Fig. 4. In the absence of a stabilizing initial feedback law, the zone temperature is initially unable to track the desired reference temperature. In the post learning response, however, it tracks the reference signal quite well even in the presence of the unmeasurable disturbances arising from heat gain and outside climate variations. The final estimate of the output feedback optimal control gain using Algorithm 4 is

𝒦̂ = [0.4230  4.5350  3.1222  19.2303  24.3312  8.1484  0.5906]

which is also close to the optimal output feedback parameters 𝒦.

Fig. 4. Evolution of the closed-loop system under Algorithm 4.

V. Conclusions

This article presented a model-free solution to the optimal tracking problem involving unmeasurable disturbances based on the framework of reinforcement learning. A new Q-learning based scheme was proposed with a bias compensation mechanism to account for the effect of the disturbance on the learning estimates. An extended Q-function was employed that includes bias compensation terms to prevent the control parameters from drifting away in the presence of the disturbance. Both PI and VI algorithms based on state feedback and output feedback were presented to learn the optimal parameters and to guarantee convergence of the tracking error to zero. Finally, the proposed scheme was validated by designing an optimal set point tracking controller for a practical HVAC zone system in the presence of the unknown disturbances related to outside climate variations and the internal heat gains. In our future work, we will consider extending the design to develop a distributed control scheme for a more complex HVAC system.

Nomenclature

Awew

Area of East/West walls = 9 m²

Awns

Area of North/South walls = 12 m²

Cpa

Specific heat of air = 1.005 kJ/kg-C

Cwew

Thermal capacitance of East/West walls = 70 kJ/C

Cwns

Thermal capacitance of North/South walls = 60 kJ/C

Cz

Thermal capacitance of the zone = 60 kJ/C

fsa

Volume flow rate of the supply air = 0.192 m³/s

q

Heat gain from occupants, lights, doors in Watts (W)

To

Outside temperature in degrees Celsius (C)

Tsa

Supply air temperature in degrees Celsius (C)

Twew

Temperature of the East/West walls in degrees Celsius (C)

Twns

Temperature of the North/South walls in degrees Celsius (C)

Tz

Temperature of the zone in degrees Celsius (C)

Uwew

Heat transfer coefficient of East/West walls = 2 W/m²-C

Uwns

Heat transfer coefficient of North/South walls = 2 W/m²-C

Ko

Thermal transfer coefficient between the zone and the outside = 9 W/C

ρa

Density of air = 1.25 kg/m³

Biographies


Syed Ali Asad Rizvi received the B.E. degree in industrial electronics from the Institute of Industrial Electronics Engineering, NED University of Engineering and Technology, Karachi, Pakistan, in 2012, the M.S. degree in electrical engineering from the National University of Sciences and Technology, Islamabad, Pakistan, in 2014, and the Ph.D. degree in electrical engineering from the University of Virginia, Charlottesville, VA, USA, in 2020.

He is currently a Post-Doctoral Fellow with the National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA. His current research interests include artificial intelligence, reinforcement learning control, robust control, distributed learning and optimization, and their application in cyber-physical systems.


Amanda J. Pertzborn received the Ph.D. degree from the University of Wisconsin-Madison, Madison, WI, USA.

She is currently the PI of the Intelligent Building Agents Project with the Building Energy and Environment Division, National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA. Her research focuses on intelligent control of heating, ventilation, and air conditioning (HVAC) systems.


Zongli Lin (Fellow, IEEE) received the B.S. degree in mathematics and computer science from Xiamen University, Xiamen, China, in 1983, the Master of Engineering degree in automatic control from the Chinese Academy of Space Technology, Beijing, China, in 1989, and the Ph.D. degree in electrical and computer engineering from Washington State University, Pullman, WA, USA, in 1994.

He is currently the Ferman W. Perry Professor with the School of Engineering and Applied Science and a Professor of electrical and computer engineering with the University of Virginia, Charlottesville, VA, USA. His current research interests include nonlinear control, robust control, and control applications.

Dr. Lin is also a fellow of IFAC and the American Association for the Advancement of Science (AAAS). He was the Program Chair of the 2018 American Control Conference and the General Chair of the 13th and 16th International Symposium on Magnetic Bearings, held in 2012 and 2018, respectively. He was an Associate Editor of IEEE Transactions on Automatic Control from 2001 to 2003, IEEE/ASME Transactions on Mechatronics from 2006 to 2009, and IEEE Control Systems Magazine from 2005 to 2012. He was elected as a member of the Board of Governors of the IEEE Control Systems Society from 2008 to 2010 and 2019 to 2021 and chaired the IEEE Control Systems Society Technical Committee on Nonlinear Systems and Control from 2013 to 2015. He has served on the operating committees of several conferences. He also serves on the editorial boards of several journals and book series, including Automatica, Systems & Control Letters, and Birkhauser book series, Control Engineering.

Contributor Information

Syed Ali Asad Rizvi, Mechanical Systems and Controls Group, Energy and Environment Division, National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899 USA.

Amanda J. Pertzborn, Mechanical Systems and Controls Group, Energy and Environment Division, National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899 USA

Zongli Lin, Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904 USA.

References

  • [1].Hong Y, Hu J, and Gao L, “Tracking control for multi-agent consensus with an active leader and variable topology,” Automatica, vol. 42, no. 7, pp. 1177–1182, Jul. 2006. [Google Scholar]
  • [2].Liao F, Wang JL, and Yang G-H, “Reliable robust flight tracking control: An LMI approach,” IEEE Trans. Control Syst. Technol, vol. 10, no. 1, pp. 76–89, Aug. 2002. [Google Scholar]
  • [3].Lymperopoulos G and Ioannou P, “Distributed adaptive HVAC control for multi-zone buildings,” in Proc. IEEE 58th Conf. Decis. Control (CDC), Dec. 2019, pp. 8142–8147. [Google Scholar]
  • [4].Mu C, Ni Z, Sun C, and He H, “Air-breathing hypersonic vehicle tracking control based on adaptive dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst, vol. 28, no. 3, pp. 584–598, Mar. 2017. [DOI] [PubMed] [Google Scholar]
  • [5].Tao G, Adaptive Control Design and Analysis. Hoboken, NJ, USA: Wiley, 2003. [Google Scholar]
  • [6].Modares H and Lewis FL, “Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning,” IEEE Trans. Autom. Control, vol. 59, no. 11, pp. 3051–3056, Nov. 2014. [Google Scholar]
  • [7].Kiumarsi B, Lewis FL, Naghibi-Sistani M-B, and Karimpour A, “Optimal tracking control of unknown discrete-time linear systems using input-output measured data,” IEEE Trans. Cybern, vol. 45, no. 12, pp. 2770–2779, Dec. 2015. [DOI] [PubMed] [Google Scholar]
  • [8].Vamvoudakis KG, “Optimal trajectory output tracking control with a Q-learning algorithm,” in Proc. Amer. Control Conf. (ACC), Jul. 2016, pp. 5752–5757. [Google Scholar]
  • [9].Luo B, Liu D, Huang T, and Wang D, “Model-free optimal tracking control via critic-only Q-learning,” IEEE Trans. Neural Netw. Learn. Syst, vol. 27, no. 10, pp. 2134–2144, Oct. 2016. [DOI] [PubMed] [Google Scholar]
  • [10].Vamvoudakis KG, Modares H, Kiumarsi B, and Lewis FL, “Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online,” IEEE Control Syst, vol. 37, no. 1, pp. 33–52, Feb. 2017. [Google Scholar]
  • [11].Tutsoy O, Barkana DE, and Tugal H, “Design of a completely model free adaptive control in the presence of parametric, non-parametric uncertainties and random control signal delay,” ISA Trans, vol. 76, pp. 67–77, May 2018. [DOI] [PubMed] [Google Scholar]
  • [12].He S, Fang H, Zhang M, Liu F, Luan X, and Ding Z, “Online policy iterative-based H∞ optimization algorithm for a class of nonlinear systems,” Inf. Sci, vol. 495, pp. 1–13, Aug. 2019. [Google Scholar]
  • [13].Modares H, Lewis FL, and Jiang Z-P, “H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst, vol. 26, no. 10, pp. 2550–2562, Jun. 2015. [DOI] [PubMed] [Google Scholar]
  • [14].Liu Y, Wang Z, and Shi Z, “H∞ tracking control for linear discrete-time systems via reinforcement learning,” Int. J. Robust Nonlinear Control, vol. 30, no. 1, pp. 282–301, Jan. 2020. [Google Scholar]
  • [15].Peng Y, Chen Q, and Sun W, “Reinforcement Q-learning algorithm for H∞ tracking control of unknown discrete-time linear systems,” IEEE Trans. Syst., Man, Cybern. Syst, vol. 50, no. 11, pp. 4109–4122, Nov. 2020. [Google Scholar]
  • [16].Luo B, Yang Y, and Liu D, “Policy iteration Q-learning for data-based two-player zero-sum game of linear discrete-time systems,” IEEE Trans. Cybern., early access, Feb. 20, 2020, doi: 10.1109/TCYB.2020.2970969. [DOI] [PubMed] [Google Scholar]
  • [17].Tutsoy O and Brown M, “An analysis of value function learning with piecewise linear control,” J. Experim. Theor. Artif. Intell, vol. 28, no. 3, pp. 529–545, May 2016. [Google Scholar]
  • [18].Tutsoy O and Brown M, “Chaotic dynamics and convergence analysis of temporal difference algorithms with bang-bang control,” Optim. Control Appl. Methods, vol. 37, no. 1, pp. 108–126, Jan. 2016. [Google Scholar]
  • [19].Huang J, Nonlinear Output Regulation: Theory and Applications. Philadelphia, PA, USA: SIAM, 2004. [Google Scholar]
  • [20].Gao W and Jiang Z-P, “Adaptive dynamic programming and adaptive optimal output regulation of linear systems,” IEEE Trans. Autom. Control, vol. 61, no. 12, pp. 4164–4169, Dec. 2016. [Google Scholar]
  • [21].Chen C, Modares H, Xie K, Lewis FL, Wan Y, and Xie S, “Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics,” IEEE Trans. Autom. Control, vol. 64, no. 11, pp. 4423–4438, Nov. 2019. [Google Scholar]
  • [22].Jiang Y and Jiang Z-P, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, Oct. 2012. [Google Scholar]
  • [23].Postoyan R, Busoniu L, Nešić D, and Daafouz J, “Stability analysis of discrete-time infinite-horizon optimal control with discounted cost,” IEEE Trans. Autom. Control, vol. 62, no. 6, pp. 2736–2749, Jun. 2017. [Google Scholar]
  • [24].Luo B, Wu H-N, and Huang T, “Off-policy reinforcement learning for H∞ control design,” IEEE Trans. Cybern, vol. 45, no. 1, pp. 65–76, 2014. [DOI] [PubMed] [Google Scholar]
  • [25].Xiao Z, Li J, and Li P, “Output feedback H∞ control for linear discrete-time multi-player systems with multi-source disturbances using off-policy Q-learning,” IEEE Access, vol. 8, pp. 208938–208951, 2020. [Google Scholar]
  • [26].Jiang Y, Kiumarsi B, Fan J, Chai T, Li J, and Lewis FL, “Optimal output regulation of linear discrete-time systems with unknown dynamics using reinforcement learning,” IEEE Trans. Cybern, vol. 50, no. 7, pp. 3147–3156, Jul. 2020. [DOI] [PubMed] [Google Scholar]
  • [27].Lewis FL and Syrmos VL, Optimal Control. Hoboken, NJ, USA: Wiley, 1995. [Google Scholar]
  • [28].Kiumarsi B, Vamvoudakis KG, Modares H, and Lewis FL, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE Trans. Neural Netw. Learn. Syst, vol. 29, no. 6, pp. 2042–2062, Jun. 2018. [DOI] [PubMed] [Google Scholar]
  • [29].Sutton RS and Barto AG, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998. [Google Scholar]
  • [30].Bradtke SJ, Ydstie BE, and Barto AG, “Adaptive linear quadratic control using policy iteration,” in Proc. Amer. Control Conf. (ACC), 1994, pp. 3475–3479. [Google Scholar]
  • [31].He S, Zhang M, Fang H, Liu F, Luan X, and Ding Z, “Reinforcement learning and adaptive optimization of a class of Markov jump systems with completely unknown dynamic information,” Neural Comput. Appl, vol. 32, pp. 14311–14320, 2020. [Google Scholar]
  • [32].He S, Fang H, Zhang M, Liu F, and Ding Z, “Adaptive optimal control for a class of nonlinear systems: The online policy iteration approach,” IEEE Trans. Neural Netw. Learn. Syst, vol. 31, no. 2, pp. 549–558, Feb. 2020. [DOI] [PubMed] [Google Scholar]
  • [33].Lewis FL, Vrabie D, and Vamvoudakis KG, “Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers,” IEEE Control Syst, vol. 32, no. 6, pp. 76–105, Dec. 2012. [Google Scholar]
  • [34].Rizvi SAA and Lin Z, “Output feedback Q-learning for discrete-time linear zero-sum games with application to the H-infinity control,” Automatica, vol. 95, pp. 213–221, Sep. 2018. [Google Scholar]
  • [35].Landelius T, “Reinforcement learning and distributed local model synthesis,” Ph.D. dissertation, Dept. Elect. Eng., Linköping Univ. Electron. Press, Lidingö, Sweden, 1997. [Google Scholar]
  • [36].Rizvi SAA and Lin Z, “Output feedback Q-learning control for the discrete-time linear quadratic regulator problem,” IEEE Trans. Neural Netw. Learn. Syst, vol. 30, no. 5, pp. 1523–1536, May 2019. [DOI] [PubMed] [Google Scholar]
  • [37].Tashtoush B, Molhim M, and Al-Rousan M, “Dynamic model of an HVAC system for control analysis,” Energy, vol. 30, no. 10, pp. 1729–1745, Jul. 2005. [Google Scholar]
