Adaptive control for circulating cooling water system using deep reinforcement learning

Jin Xu; Han Li; Qingxin Zhang

doi:10.1371/journal.pone.0307767

. 2024 Jul 24;19(7):e0307767. doi: 10.1371/journal.pone.0307767

Adaptive control for circulating cooling water system using deep reinforcement learning

Jin Xu ^1,^#, Han Li ^1,^#, Qingxin Zhang ^1,^*,^#

Editor: Lalit Chandra Saikia²

PMCID: PMC11268623 PMID: 39047030

Abstract

Due to the complex internal working process of circulating cooling water systems, most traditional control methods struggle to achieve stable and precise control. Therefore, this paper presents a novel adaptive control structure for the Twin Delayed Deep Deterministic Policy Gradient algorithm, which is based on a reference trajectory model (TD3-RTM). The structure is based on the Markov decision process of the recirculating cooling water system. Initially, the TD3 algorithm is employed to construct a deep reinforcement learning agent. Subsequently, a state space is selected, and a dense reward function is designed, considering the multivariable characteristics of the recirculating cooling water system. The agent updates its network based on different reward values obtained through interactions with the system, thereby gradually aligning the action values with the optimal policy. The TD3-RTM method introduces a reference trajectory model to accelerate the convergence speed of the agent and reduce oscillations and instability in the control system. Subsequently, simulation experiments were conducted in MATLAB/Simulink. The results show that compared to PID, fuzzy PID, DDPG and TD3, the TD3-RTM method improved the transient time in the flow loop by 6.09s, 5.29s, 0.57s, and 0.77s, respectively, and the Integral of Absolute Error(IAE) indexes decreased by 710.54, 335.1, 135.97, and 89.96, respectively, and the transient time in the temperature loop improved by 25.84s, 13.65s, 15.05s, and 0.81s, and the IAE metrics were reduced by 143.9, 59.13, 31.79, and 1.77, respectively. In addition, the overshooting of the TD3-RTM method in the flow loop was reduced by 17.64, 7.79, and 1.29 per cent, respectively, in comparison with the PID, the fuzzy PID, and the TD3.

1 Introduction

In the production process of many industrial sectors, a large amount of waste heat will be generated. Currently, it is necessary to use cold water or other liquids to absorb the heat in time to ensure the regular operation of the production process. The cold water used in this process is called cooling water in industrial production. To save water resources and reduce energy costs, consider recycling industrial cooling water to form a circulating cooling water system. With the continuous development of modern industry, as an essential cooling method in industrial production, circulating cooling water system is widely used in various production processes, such as pharmaceutical, electric power, chemical, metallurgy, Marine engine, etc. Optimizing circulating cooling water system control can improve industrial production efficiency and reduce energy consumption and maintenance costs. Therefore, in controlling circulating cooling water systems, the study of achieving efficient control has become an important topic.

At present, the control methods employed in circulating cooling water systems are predominantly based on traditional PID control [1,2], fuzzy control [3–5] model predictive control (MPC) [6,7], intelligent optimization algorithms [8–10] and other traditional methods. For example, Xia et al. [11] proposed the use of a PID controller and a fuzzy PID controller as the control strategy for the temperature controller of the circulating cooling water system in a fuel cell engine. This approach resulted in a notable reduction in temperature fluctuations during the water temperature mixing process. Terzi et al. [12] proposed the use of a model predictive control algorithm as the control strategy for an industrial plant’s circulating cooling water system, which resulted in an improvement in control performance. Zhang et al. [13] developed an algorithmic coupling of an artificial neural network optimized by a genetic algorithm and a heat transfer model of the condenser and air-cooling heat exchanger. This was employed to optimize and control the mass flow of circulating cooling water in an indirect cooling system of thermal power units. The objective was to enhance the efficiency of the circulating cooling water system and to reduce costs. However, these methods all have certain limitations, which are difficult to adapt to the nonlinear dynamic characteristics of the system and the uncertainty in the operation process. To a significant extent, these methods depend on prior knowledge and necessitate the development of sophisticated system models and the adjustment of parameters. For instance, the PID control method necessitates manual system adjustment and is unable to accommodate the intricate dynamic alterations of the system. Although the fuzzy control method can effectively handle uncertainty, it often requires considerable expertise to design fuzzy rules and may struggle to achieve optimal control. Conversely, MPC can optimize control strategies based on predictive control, thereby enhancing control performance. However, MPC is characterized by high computational complexity and demands significant computing resources, which presents a challenge in the application of real-time, high-frequency control systems. Furthermore, MPC is susceptible to model accuracy and measurement precision, which may result in suboptimal performance in cases of unknown systems or model errors. The intelligent optimization algorithm exhibits strong adaptability but may be prone to becoming stuck in a suboptimal local solution, thus failing to ensure optimal control of the system.

In recent years, with the continuous development of artificial intelligence technology, artificial intelligence theories and technologies such as deep learning and reinforcement learning have been widely applied in many fields, such as games field [14,15], robot control field [16–18], building energy efficiency field [19], natural language processing field [20], and automatic driving field [21–23], and fault diagnosis field [24,25]. RL is a machine learning method to learn the optimal decision through trial and error. It is powerful nonlinear modeling and adaptive learning ability have brought new opportunities for controlling the circulating cooling water system. For example, Qiu et al. [26] proposed a model-free optimal control method based on reinforcement learning to control circulating cooling water systems in the architectural field, which makes it have broad application prospects in the architectural area where accurate system performance models are generally lacking. Wu et al. [27] proposed a PI controller based on reinforcement learning to control a steam compression refrigeration system with nonlinearity and coupling two inputs and two outputs, realizing adaptive control and improving control performance. Compared with traditional control methods, the reinforcement learning method can automatically learn the system’s dynamic characteristics and operation rules without manually adjusting the control parameters and has better adaptability and intelligence. In addition, the reinforcement learning method can also use multi-agent reinforcement learning to realize collaborative control among multiple circulating cooling water systems and further improve control efficiency and stability. For example, Fu et al. [28] proposed a multi-agent deep reinforcement learning method for building cooling water system control to optimize the load distribution, cooling tower fan frequency, and cooling pump frequency of different cooling water systems. Furthermore, industrial processes’ safety usually requires solving constrained optimal control (COC) problems. Zhang et al. [29] proposed a new safety-enhanced learning algorithm for COC problems of continuous-time nonlinear systems with unknown dynamics and perturbations. For the uncertainties in the bridge crane system, such as payload mass and unmodeled dynamics, without knowing the system model, a new model-free online reinforcement learning control method for the real-time position adjustment and anti-sway control problem of bridge cranes is proposed [30], which combines the advantages of adaptive and optimal control and exhibits satisfactory performance. These research results show that reinforcement learning methods have a broad application prospect in industrial process control.

In order to ascertain whether deep reinforcement learning methods offer certain advantages over traditional control methods in the recirculating cooling water system, and to address issues such as the inability of traditional control methods to achieve stable and precise control of the controlled system, this paper proposes the design of an adaptive control structure for the recirculating cooling water system with the objective of improving the system’s control performance. This paper makes the following contributions:

1) The design of an adaptive control structure based on the Twin Delayed Deep Deterministic Policy Gradient algorithm under a reference trajectory model (TD3-RTM) enables end-to-end control of the recirculating cooling water system at the simulation level.

2) The state space and reward function were designed to consider the multivariable characteristics of the recirculating cooling water system. A reference trajectory model was introduced to accelerate the convergence speed of the agent and reduce oscillations and instability in the control system.

3) The exploration of the potential application of deep reinforcement learning in the recirculating cooling water system, with the objective of providing references and insights for control problems in the industrial field.

The rest of this paper is organized as follows: Section 2 is the background, introducing the basics of deep reinforcement learning, the working principle of the circulating cooling water system, and the system model. Section 3 is the methods, which Outlines the design of adaptive control structure based on TD3-RTM. Section 4 is the experiment and analysis of results. Section 5 is the conclusion, which summarizes this study and puts forward the future research direction.

2 Background

The prerequisite for combining the control of a circulating cooling water system with reinforcement learning is establishing a Markov model of the circulating cooling water system. The working principle and model of the circulating cooling water system and Markov decision process (MDP) based on the circulating cooling water system are described below.

2.1 Circulating cooling water system

The circulating cooling water system comprises a temperature sensor, flowmeter, pressure sensor, heat exchanger, electric control valve, manual butterfly valve, check valve, frequency conversion pump, and other equipment. The schematic system diagram is shown in Fig 1.

The cold-water flow into the line through the electric regulating valve M₁. When the pressure sensor P₁ detects that the pressure in the main line of the system exceeds the safety value required by the system, the opening of the electric regulating valve M₂ increases and discharges part of the cold water for pressure relief. At the same time, M₂ is connected to the check valve to prevent backflow. Another amount of cold water enters the main line through M₁, is detected by the flow meter and temperature sensor T₂, and then enters the heat exchanger. As the heat exchange proceeds, some hot water is discharged through an electrically regulated valve M₄ connected to a check valve. Another part of the hot water is mixed with the cold water through an electrically controlled valve M₃. This cycle ensures that the cold water flowing into the heat exchanger has a constant temperature, thus ensuring the stability and safety of the system.

Flow and temperature are crucial control objectives in a circulating cooling water system. To simplify modeling and control complexity, this paper focuses on these two critical variables as the primary targets for controlling the circulating cooling water system. On the other hand, over the past decades, the successful application of single-variable control theory has demonstrated the convenience and effectiveness of using transfer functions to express and analyze control systems. Therefore, transfer function matrices are employed in this paper to describe and analyze circulating cooling water systems.

This paper represents the circulating cooling water system as a multivariate model with two inputs and outputs, as shown in Fig 2. The input variables are the opening of electric control valve M₁ and M₃. The output variables are the water flow and temperature into the heat exchanger. The linear transfer functions G₁₁, G₁₂, G₂₁, and G₂₂ represent the relationship between the input and output variables of the system, where the first number represents the output, and the second number represents the input. For example, G₂₁ represents the effect of valve opening M₁ on temperature.

The transfer functions G₁₁, G₁₂, G₂₁, and G₂₂, which represent the dynamic behavior of the system, need to be identified, and the best pairing variables need to be found for the controller design. Therefore, the best-paired variables are found by selecting different variable pairs to observe the regulation state of the system during the experiment and then collecting data on the input and output quantities of the system at a steady state. For the collected system data, the data is firstly pre-processed to remove the outliers and noise. Then the transfer function model G(S) of the circulating cooling water system [31] is obtained using the MATLAB system identification toolbox, as shown in Eq 1.

G (S) = [\begin{array}{l} \frac{0.7541 S + 0.002914}{S^{2} + 0.08358 S + 0.0002578} \frac{24.65 S + 0.02572}{S^{2} + 2.529 S + 0.003538} \\ \frac{4.721 e - 05 S - 3.809 e - 06}{S^{2} + 0.2304 S + 4.309 e - 14} \frac{0.354 S + 0.0006877}{S^{2} + 1.189 S + 0.002565} \end{array}]

(1)

2.2 Markov decision model of circulating cooling water system

The mathematical foundation and modeling tool of reinforcement learning is the MDP. An MDP usually comprises state space s, action space a, state transition function P, reward function r, and discount factor γ. At any time step t, the agent first observes the current state s_t of the environment and the current corresponding reward value r_t. Based on this state and reward information, the agent acts a_t and obtains the state s_t+1 and the reward r_t+1 from the environment for the next step. The interaction between the reinforcement learning agent and the environment under the control system is shown in Fig 3.

In control system terminology, the term "agent" refers to the designed controller; the "environment" includes the system outside the controller, which, in this paper, refers explicitly to the circulating cooling water system. The policy represents the optimal control behavior sought by the designer. As shown in Fig 3, the interaction process between the agent and the environment indicates that state s represents various features and parameters measured by sensors in the circulating cooling water system, such as flow and temperature. Action a illustrates the opening value of the electric regulating valve determined by the agent based on the current state of the circulating cooling water system. Reward r indicates the feedback obtained by the agent after taking specific actions in specific conditions. Rewards are used to evaluate the quality of the agent’s behavior and guide decision-making in different states. In the context of the circulating cooling water system, rewards can be used to measure the control effectiveness and performance of the system. In deep reinforcement learning, state transition function P is often unknown, so the agent needs to estimate the state transition probability through interaction with the environment and learn and optimize control strategies. The design of the control strategy based on deep reinforcement learning relies on the design of the state, action, reward function, and reinforcement learning algorithms. Section 3 will provide a detailed introduction to the design of the control strategy.

3 Methods

In this study, a deep reinforcement learning approach is used to design an adaptive controller for the circulating cooling water system. In deep reinforcement learning, the neural network is used as the value function or parameterized policy, while the gradient optimization method is used to optimize the loss. Here, the twin delayed deep deterministic policy gradient [32] (TD3) algorithm, which is an actor-critic framework to deal with continuous action space problems, is employed to optimize the control parameters in the circulating cooling water system.

The TD3 algorithm is a deep reinforcement learning algorithm based on the Actor-Critic framework based on the Deep Deterministic Policy Gradient [33] (DDPG) algorithm. Since the value network of DDPG tends to overestimate the action value function, the TD3 algorithm has made improvements in the following three aspects to address the shortcomings of the DDPG algorithm: with truncated double Q learning, the problem of overestimation of the critic network is alleviated; the robustness and smoothness of the algorithm are improved by adding noise that obeys a truncated normal distribution to the output action of the target policy network; make the policy network and the three target networks update less frequently than the value network, this method can reduce the variance of the approximate action-value function, and a better policy can be obtained. The TD3 algorithm was selected for controlling the circulating cooling water system in this study due to its capability to handle continuous action spaces, utilization of twin Q networks to mitigate overestimation bias, and implementation of delayed policy updates and soft updates to reduce function approximation errors, thereby delivering more stable and precise control. Furthermore, TD3’s deep neural networks are proficient in effectively modeling the complex nonlinear and multivariate characteristics of the system, facilitating real-time adaptation and optimized control, thereby enhancing system performance.

Choosing an appropriate deep reinforcement learning algorithm is only part of designing a controller. In contrast, the design of states, actions, and rewards in reinforcement learning are crucial in determining the agent’s learning capabilities, control performance, and adaptability to dynamic environments. Thoughtful and well-tailored designs of these elements are essential for successful and efficient learning in various applications. The following will explain the selection and design of states, actions, and reward functions.

3.1 Control strategy design

3.1.1 State

The state reflects essential information during the interaction between the agent and the environment, and the selection of the state space directly affects the agent’s decision-making, thereby influencing the overall control performance of the system. Therefore, the state should contain sufficient information to describe the current stage. In the circulating cooling water system, where the actuator exhibits nonlinear characteristics, and the process gain varies with different manipulated variables, the chosen state space in this study is as follows:

s = {[e F, e T, \int e F d t, \int e T d t, F, T, F s p, T s p, a_{1}, a_{2}]}^{T}

(2)

Where, eF = F_sp−F and eT = T_sp−T represent the control error values of the flow and temperature loops, respectively. ∫eFdt and ∫eTdt are the error integrals. F and T represent the historical output measurement values of flow and temperature, respectively. Fsp and Tsp are the setpoints for flow and temperature. a₁ and a₂ are the manipulated variable values for the flow and temperature loops, respectively, which are the action values output by the agent.

3.1.2 Action

Actions represent the actions taken by the agent in specific states, and the agent’s task is to choose appropriate actions in different states to maximize its long-term rewards. In reinforcement learning, actions are typically determined by the agent’s policy, and in control systems, they correspond to the manipulated variables applied to the system. In this study, the action values correspond to the opening values of the electric regulating valves in the circulating cooling water system, where a₁ represents the opening value of the electric regulating valve M₁, and a₂ represents the opening value of the electric regulating valve M₃. The range of action values is [0, 100], making the action space:

a = {[a_{1}, a_{2}]}^{T}

(3)

3.1.3 Reward

The reward function is a crucial concept in reinforcement learning, which is used to evaluate the performance of an agent in an environment. The reward function is typically a mapping from the state and action space to a real number, representing the desirability of an action taken by the agent in each state. In reinforcement learning, the objective of the agent is to maximize the accumulated reward by interacting with the environment. Therefore, the reward function can be viewed as the objective function of the reinforcement learning task. By adjusting its policy, the agent can attempt to maximize the reward function and learn how to take optimal actions in different states of the environment.

In some reinforcement learning tasks, the reward function is typically designed such that the agent receives a reward only when the output values satisfy the system requirements. This type of reward function is known as a sparse reward function. In simple environments like single-variable systems, using a sparse reward function can still yield good control results. However, in a multivariate system, transferring the state of the system environment to the target state becomes more complex and uncertain than that of a univariate system. Therefore, based on the characteristics of circulating cooling water systems, this paper designs a dense reward function. For the flow loop, the dense reward function is set as follows:

r_{1} = {\begin{cases} 100, | e F_{t} | \leq φ_{1} \\ 1 / e F_{t}, φ_{1} < | e F_{t} | \leq φ_{2} \\ - e F_{t}, | e F_{t} | > φ_{2} \end{cases}

(4)

Where eF_t represents the error value of the current moment of flow. φ₁ and φ₂ represent the thresholds of error values in different intervals of flow. When the error value satisfies the system goal requirements, give the agent a large reward value to encourage the current behavior. In this paper, φ₁ equals 0.1, and φ₂ equals 5.

Furthermore, the temperature loop reward function is designed in the same way as the flow loop. Therefore, the reward function of the temperature loop is defined as:

r_{2} = {\begin{cases} 100, | e T_{t} | \leq η_{1} \\ 1 / e T_{t}, η_{1} < | e T_{t} | \leq η_{2} \\ - e T_{t}, | e T_{t} | > η_{2} \end{cases}

(5)

Where eT_t represents the error value of the current moment of temperature. η₁ and η₂ represent the thresholds of error values in different intervals of temperature. In this paper, η₁ equals 0.1, and η₂ equals 2.

Finally, the reward function r_t based on the circulating cooling water system is defined as

r_{t} = r_{1} + r_{2}

(6)

3.2 Network structure and algorithm design

The network structure of the TD3 algorithm comprises four principal components: the Actor network, the Critic network, the Target Actor network, and the Target Critic network. The Actor network generates a policy for continuous actions based on the current state. The Critic network is responsible for estimating the Q-value for the current state and action pair. The Target Actor and Target Critic networks serve as target networks for the Actor and Critic networks, respectively. The Target Actor and Target Critic networks have the same structure as the Actor and Critic networks, respectively, and their parameters are updated through soft updates from the Actor and Critic networks. In this study, the Actor and Critic networks are implemented with three-layer neural networks, comprising 128 and 64 neurons in their respective hidden layers. The rectified linear unit (ReLU) function is employed as the activation function. Furthermore, as the control actuator in the circulating cooling water system is an electric regulating valve with a range of 0 to 100, the output of the Actor network is normalized to the range of [–1, 1] using the tanh function and then scaled using the scaling operation.

To enhance the exploration and learning capabilities of the agent, this paper introduces a reference trajectory model. This model guides the agent to converge more rapidly to the desired control policy during the learning process, thereby improving the control effectiveness and learning speed of reinforcement learning. The reference trajectory model utilized in this study is:

F_{r} (s) = \frac{1}{τ_{r} s + 1}

(7)

In addition, in practical applications, setpoints may experience sudden changes or instability, which can lead to unstable performance or oscillations in the control system. By introducing the reference trajectory model, the setpoint signal can be smoothed to make its changes more gradual and smoother, thereby helping to reduce oscillations and instability in the control system. In this paper, τ_r equals 0.2. The design of the control system is illustrated in Fig 4.

In this study, a controller for the circulating cooling water system is designed based on the TD3 algorithm. Its control strategy is shown in Algorithm 1.

Algorithm 1. TD3 algorithm in circulating cooling water control system.

Initialize replay buffer M, initialize critic network $Q_{θ_{1}}$ , $Q_{θ_{2}}$ , parameters θ₁, θ₂, initialize actor network π_ϕ parameter ϕ, initialize target network parameters ${θ_{1}}^{-} \leftarrow θ_{1}, {θ_{2}}^{-} \leftarrow θ_{2}, ϕ^{-} \leftarrow ϕ$ .

Repeat

Randomly initialize the flow and temperature setpoints within the range allowed by system.

select action $a_{t} \sim π_{ϕ} (s_{t}) + ε, ε \sim N (0, σ)$ , accept reward r_t and next state s_t+1.

store state transfer data (s_t, a_t, r_t, s_t+1) to M.

Sample mini batches of size B from M.

${a^{'}}_{t + 1} \leftarrow π_{ϕ^{-}} (s_{t + 1}) + ε, ε \sim clip (N (0, σ^{-}, - c, c), c > 0)$ $y = r_{t} + γ \min_{i = 1, 2} Q_{{θ_{i}}^{-}} (s_{t + 1}, {a^{'}}_{t + 1})$ Update value network.

$θ_{i} \leftarrow {argmin}_{θ_{i}} B^{- 1} \sum {(y - Q_{θ_{i}} (s_{t}, a_{t}))}^{2}$ if t mod d then

Update ϕ

$\nabla_{ϕ} J (ϕ) = B^{- 1} \sum \nabla_{a} Q_{θ_{1}} (s_{t}, a_{t}) |_{a_{t} = π_{ϕ} (s_{t})} \nabla_{ϕ} π_{ϕ} (s_{t})$ Update the target network, where ρ is the soft update factor.

$\begin{array}{l} {θ_{i}}^{-} \leftarrow ρ θ_{i} + (1 - ρ) {θ_{i}}^{-} \\ ϕ^{-} \leftarrow ρ ϕ + (1 - ρ) ϕ^{-} \end{array}$

end if

end for

4 Experiments and analysis of results

In the training process of TD3-RTM in this study, the total number of episodes is set to 2000, with a sampling time of 0.1 seconds and a maximum simulation duration of 20 seconds. To enhance the disturbance rejection control performance of the system, random step signals with amplitudes ranging from -5 to 5 are applied at the control ports of the flow and temperature loops at the 15th second. The reference step input signals for the flow (m^3/h) and temperature (°C) are set to [550, 650] and [20, 30], respectively, to achieve robustness to significant setpoint changes in the system. Since TD3-RTM is based on the TD3 algorithm, the primary hyperparameters used in the training process of the TD3 algorithm are shown in Table 1.

Table 1. Hyperparameter settings of the algorithm 1.

Hyperparameters	Values
Discount factor, γ	0.995
Mini-batch size	128
Replay buffer size	1e6
Critic learning rate	1e-3
Actor learning rate	5e-4
Target update frequency	10
Exploration model	Gaussian noise
Variance, σ	0.2
Variance decay rate	1e-5
Policy update frequency	2
Soft update factor, ρ	5e-3

Open in a new tab

All computations were carried out on a standard PC (Win11, AMD 4600H CPU@3.00GHz, 16GB) in MATLAB/Simulink R2022b. To validate the effectiveness of TD3-RTM, comparisons were made with the classical PID controller, fuzzy PID controller, DDPG algorithm, and TD3 algorithm. To be fair, the PID parameters for classical PID control and fuzzy PID control were obtained using the Ziegler-Nichols method. The neural network architecture, number of neurons, and learning rate used in different deep reinforcement learning algorithms were the same. Each task was run for 2000 episodes, and the experiments were repeated five times with different random seeds. The recorded results represent the average reward value for every 20 episodes. The learning curves are shown in Fig 5.

The results in Fig 6 indicate that TD3-RTM can converge to the desired control policy faster and with more stable convergence performance under different random initial states. Additionally, TD3-RTM achieves higher total rewards after 2000 episodes of learning.

4.1 Step response and disturbance rejection performance simulation experiment

To validate the control effectiveness of TD3-RTM in the circulating cooling water system, a 100-second simulation experiment was conducted with a flow of 600 m^3/h and a temperature of 25°C. At the 60-second and 80-second marks, disturbance signals with amplitudes of 5 and -5 were applied to the control ports of the flow and temperature loops. The control performance of different control methods is shown in Figs 6 and 7.

As shown in Fig 6, it can be observed that in the flow control loop, the deep reinforcement learning controller exhibits faster response speed and minor overshoot compared to the classical PID controller and fuzzy PID controller in the step response. TD3-RTM is less affected when faced with external disturbance signals, while the DDPG algorithm shows steady-state error and the TD3 algorithm exhibits oscillations. The oscillations in the TD3 algorithm cause continuous changes in the control signal, which can damage the actuators in the circulating cooling water system and lead to system instability. As shown in Fig 7, in the temperature control loop, the PID controller, fuzzy PID controller and DDPG algorithm have relatively slow response speeds, while the TD3 algorithm and TD3-RTM achieve good control performance. The deep reinforcement learning controller performs better when subjected to external disturbance signals. The performance parameters for different control methods are shown in Table 2.

Table 2. Comparison of controllers performance parameters.

Variables	Controllers	Rise Time (s)	Transient Time (s)	Overshoot (%)	IAE
Flow	PID	1.03	7.10	18.60	753.80
	Fuzzy-PID	0.57	6.30	8.75	378.36
	DDPG	0.30	1.58	0	179.23
	TD3	0.28	1.78	2.25	133.22
	TD3-RTM	0.47	1.01	0.96	43.26
Temperature	PID	16.54	29.81	0	170.35
	Fuzzy-PID	9.32	17.44	0	85.28
	DDPG	5.99	18.84	0	58.24
	TD3	2.97	4.60	0.21	28.22
	TD3-RTM	2.78	3.79	0	26.45

Open in a new tab

From Table 2, compared to PID, fuzzy PID, DDPG and TD3, the TD3-RTM method improved the transient time in the flow loop by 6.09s, 5.29s, 0.57s, and 0.77s, respectively, and the Integral of Absolute Error(IAE) indexes decreased by 710.54, 335.1, 135.97, and 89.96, respectively, and the transient time in the temperature loop improved by 25.84s, 13.65s, 15.05s, and 0.81s, and the IAE metrics were reduced by 143.9, 59.13, 31.79, and 1.77, respectively. In addition, the overshooting of the TD3-RTM method in the flow loop was reduced by 17.64, 7.79, and 1.29 per cent, respectively, in comparison with the PID, the fuzzy PID, and the TD3. Generally, for industrial energy-consuming scenarios such as the circulating cooling water system, a controller with lower IAE and shorter settling time can save more energy. Although the DDPG algorithm shows a shorter rise time and no overshoot in the flow control loop, it exhibits a longer rise time and settling time in the temperature control loop. Its IAE is the largest compared to TD3 and TD3-RTM. Overall, TD3-RTM demonstrates significant advantages in both control loops.

4.2 Tracking performance simulation experiment

To validate the tracking performance of TD3-RTM, this study designed different setpoints for both the flow control loop and the temperature control loop and conducted 300 seconds of simulation experiments. The control effects of different control methods are shown in Figs 8 and 9.

As shown in Fig 8, in the flow control loop, when the setpoint is changed, the DDPG algorithm becomes unstable, and both the PID controller and fuzzy PID controller exhibit significant overshoot and long settling times at different setpoints. On the other hand, the TD3 algorithm and TD3-RTM outperform other methods significantly. In Fig 9, although all controllers can track the given setpoints, their transient responses vary widely. The DDPG algorithm can reach the setpoint at the end of each simulation time, but its settling time is longer, and its performance is inferior to the PID and fuzzy PID controllers. However, TD3-RTM’s performance during setpoint changes is comparable to the TD3 algorithm, with the fastest response speed and good settling time. Overall, TD3-RTM performs well in both control loops and shows excellent potential.

5 Conclusion

This paper presents a novel adaptive control structure based on the Twin Delayed Deep Deterministic Policy Gradient algorithm under a reference trajectory model (TD3-RTM) for addressing complex control problems in recirculating cooling water systems. Initially, the TD3 algorithm is employed to construct a deep reinforcement learning agent, enabling it to select appropriate actions for each loop based on system state features. Additionally, the multivariable characteristics of the recirculating cooling water system necessitate the design of a dense reward function, which enables the agent to receive various rewards through interactions with the environment and update its network, thereby gradually approaching the optimal policy. Furthermore, the introduction of the reference trajectory model accelerates the convergence speed of the agent and reduces system oscillations and instability. Simulation results show that compared to PID, fuzzy PID, DDPG and TD3, the TD3-RTM method improved the transient time in the flow loop by 6.09s, 5.29s, 0.57s, and 0.77s, respectively, and the Integral of Absolute Error(IAE) indexes decreased by 710.54, 335.1, 135.97, and 89.96, respectively, and the transient time in the temperature loop improved by 25.84s, 13.65s, 15.05s, and 0.81s, and the IAE metrics were reduced by 143.9, 59.13, 31.79, and 1.77, respectively. In addition, the overshooting of the TD3-RTM method in the flow loop was reduced by 17.64, 7.79, and 1.29 per cent, respectively, in comparison with the PID, the fuzzy PID, and the TD3. To further enhance safety and system stability, regular monitoring of system performance and adjusting as necessary are encouraged in practical applications. Furthermore, the utilization of backup control strategies to address exceptional circumstances that may be beyond the scope of deep reinforcement learning algorithms ensures the system’s stability in extreme conditions.

This research validates the potential of deep reinforcement learning in the circulating cooling water system and offers novel solutions and insights for practical engineering control problems. Although the method proposed in this paper achieves good control performance in simulation experiments and shows advantages over both traditional control methods and other deep reinforcement learning methods, there are some potential limitations, such as applicability limitations, computational resource requirements, hyper-parameter sensitivity, adaptability to environmental variations, and the challenge of practical system validation. Further optimization and extension of the proposed control method can be explored for broader industrial applications, along with investigating other deep reinforcement learning algorithms for complex system control. This will contribute to advancing intelligent control technology in industrial automation, enhancing production efficiency and resource utilization.

Supporting information

S1 Appendix. Datas and codes from the experiments.

(ZIP)

pone.0307767.s001.zip^{(1.2MB, zip)}

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1.Kim K-H. Temperature Stabilization of the Klystron Cooling Water at the KOMAC. Journal of the Korean Physical Society. 2018;73(8):1157–62. doi: 10.3938/jkps.73.1157 [DOI] [Google Scholar]
2.Garciadealva Y, Best R, Gomez VH, Vargas A, Rivera W, Jimenez-Garcia JC. A Cascade Proportional Integral Derivative Control for a Plate-Heat-Exchanger-Based Solar Absorption Cooling System. Energies. 2021;14(13):20. doi: 10.3390/en14134058 WOS:000671123300001. [DOI] [Google Scholar]
3.Liu W-h, Xie Z. Design and Simulation Test of Advanced Secondary Cooling Control System of Continuous Casting Based on Fuzzy Self-Adaptive PID. Journal of Iron and Steel Research International. 2011;18(1):26–30. doi: 10.1016/S1006-706X(11)60006-X [DOI] [Google Scholar]
4.Liang YY, Wang DD, Chen JP, Shen YG, Du J. Temperature control for a vehicle climate chamber using chilled water system. Appl Therm Eng. 2016;106:117–24. doi: 10.1016/j.applthermaleng.2016.05.168 WOS:000381530600013. [DOI] [Google Scholar]
5.Jia Y, Zhang R, Lv X, Zhang T, Fan Z. Research on Temperature Control of Fuel-Cell Cooling System Based on Variable Domain Fuzzy PID. 2022;10(3):534. doi: 10.3390/pr10030534 [DOI] [Google Scholar]
6.Zhao Y, Pistikopoulos E. Dynamic modelling and parametric control for the polymer electrolyte membrane fuel cell system. Journal of Power Sources. 2013;232:270–8. 10.1016/j.jpowsour.2012.12.116. [DOI] [Google Scholar]
7.Muller CJ, Craig IK. Economic hybrid non-linear model predictive control of a dual circuit induced draft cooling water system. J Process Control. 2017;53:37–45. 10.1016/j.jprocont.2017.02.009. [DOI] [Google Scholar]
8.Dulce-Chamorro E, Martinez-de-Pison FJ. An advanced methodology to enhance energy efficiency in a hospital cooling-water system. Journal of Building Engineering. 2021;43:102839. 10.1016/j.jobe.2021.102839. [DOI] [Google Scholar]
9.Liang J, Li L, Li Y, Wang Y, Feng X. Operation optimization of existing industrial circulating water system considering variable frequency drive. Chemical Engineering Research and Design. 2022;186:387–97. 10.1016/j.cherd.2022.08.010. [DOI] [Google Scholar]
10.Niu D, Liu X, Tong YJIJoCIS. Operation Optimization of Circulating Cooling Water System Based on Adaptive Differential Evolution Algorithm. 2023;16(1):22. [Google Scholar]
11.Xia QA, Zhang T, Sun ZF, Gao Y. Design and optimization of thermal strategy to improve the thermal management of proton exchange membrane fuel cells. Appl Therm Eng. 2023;222:11. doi: 10.1016/j.applthermaleng.2022.119880 WOS:000914111300001. [DOI] [Google Scholar]
12.Terzi E, Cataldo A, Lorusso P, Scattolini R. Modelling and predictive control of a recirculating cooling water system for an industrial plant. J Process Control. 2018;68:205–17. doi: 10.1016/j.jprocont.2018.04.009 WOS:000442706100018. [DOI] [Google Scholar]
13.Zhang W, Ma L, Jia B, Zhang Z, Liu Y, Duan L. Optimization of the circulating cooling water mass flow in indirect dry cooling system of thermal power unit using artificial neural network based on genetic algorithm. Appl Therm Eng. 2023;223:120040. 10.1016/j.applthermaleng.2023.120040. [DOI] [Google Scholar]
14.Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature. 2017;550(7676):354–9. doi: 10.1038/nature24270 [DOI] [PubMed] [Google Scholar]
15.McNamara JM, Houston AI, Leimar O. Learning, exploitation and bias in games. PLOS ONE. 2021;16(2):e0246588. doi: 10.1371/journal.pone.0246588 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Hwangbo J, Lee J, Dosovitskiy A, Bellicoso D, Tsounis V, Koltun V, et al. Learning agile and dynamic motor skills for legged robots. 2019;4(26):eaau5872. doi: 10.1126/scirobotics.aau5872 [DOI] [PubMed] [Google Scholar]
17.Ejaz MM, Tang TB, Lu CK. Vision-Based Autonomous Navigation Approach for a Tracked Robot Using Deep Reinforcement Learning. IEEE Sensors Journal. 2021;21(2):2230–40. doi: 10.1109/JSEN.2020.3016299 [DOI] [Google Scholar]
18.Fernandez-Gauna B, Etxeberria-Agiriano I, Graña M. Learning Multirobot Hose Transportation and Deployment by Distributed Round-Robin Q-Learning. PLOS ONE. 2015;10(7):e0127129. doi: 10.1371/journal.pone.0127129 [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Fu Q, Han Z, Chen J, Lu Y, Wu H, Wang YJJoBE. Applications of reinforcement learning for building energy efficiency control: A review. 2022;50:104165. [Google Scholar]
20.Le-Khac PH, Healy G, Smeaton AF. Contrastive Representation Learning: A Framework and Review. IEEE Access. 2020;8:193907–34. doi: 10.1109/ACCESS.2020.3031549 [DOI] [Google Scholar]
21.Al-Qizwini M, Bulan O, Qi X, Mengistu Y, Mahesh S, Hwang J, et al. A Lightweight Simulation Framework for Learning Control Policies for Autonomous Vehicles in Real-World Traffic Condition. IEEE Sensors Journal. 2021;21(14):15762–74. doi: 10.1109/JSEN.2020.3036532 [DOI] [Google Scholar]
22.Gangopadhyay B, Soora H, Dasgupta P. Hierarchical Program-Triggered Reinforcement Learning Agents for Automated Driving. IEEE Transactions on Intelligent Transportation Systems. 2022;23(8):10902–11. doi: 10.1109/TITS.2021.3096998 [DOI] [Google Scholar]
23.Ashraf NM, Mostafa RR, Sakr RH, Rashad MZ. Optimizing hyperparameters of deep reinforcement learning for autonomous driving based on whale optimization algorithm. PLOS ONE. 2021;16(6):e0252754. doi: 10.1371/journal.pone.0252754 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Cao J, Ma J, Huang D, Yu P. Finding the optimal multilayer network structure through reinforcement learning in fault diagnosis. Measurement. 2022;188:110377. 10.1016/j.measurement.2021.110377. [DOI] [Google Scholar]
25.Wang R, Jiang H, Li X, Liu S. A reinforcement neural architecture search method for rolling bearing fault diagnosis. Measurement. 2020;154:107417. 10.1016/j.measurement.2019.107417. [DOI] [Google Scholar]
26.Qiu S, Li Z, Li Z, Li J, Long S, Li X. Model-free control method based on reinforcement learning for building cooling water systems: Validation by measured data-based simulation. Energy and Buildings. 2020;218:110055. 10.1016/j.enbuild.2020.110055. [DOI] [Google Scholar]
27.Wu Y, Xing L, Liu XK, Guo F, editors. A New Solution to the PID18 Challenge: Reinforcement-Learning-based PI Control. 2022 34th Chinese Control and Decision Conference (CCDC); 2022. 15–17 Aug. 2022. [Google Scholar]
28.Fu Q, Chen X, Ma S, Fang N, Xing B, Chen J. Optimal control method of HVAC based on multi-agent deep reinforcement learning. Energy and Buildings. 2022;270:112284. 10.1016/j.enbuild.2022.112284. [DOI] [Google Scholar]
29.Zhang H, Zhao C, Ding J. Robust safe reinforcement learning control of unknown continuous-time nonlinear systems with state constraints and disturbances. J Process Control. 2023;128:103028. 10.1016/j.jprocont.2023.103028. [DOI] [Google Scholar]
30.Zhang H, Zhao C, Ding J. Online reinforcement learning with passivity-based stabilizing term for real time overhead crane control without knowledge of the system model. Control Engineering Practice. 2022;127:105302. 10.1016/j.conengprac.2022.105302. [DOI] [Google Scholar]
31.Li T, Liu Y, Chen Z. Design of Gas Turbine Cooling System Based on Improved Jumping Spider Optimization Algorithm. 2022;10(10):909. doi: 10.3390/machines10100909 [DOI] [Google Scholar]
32.Fujimoto S, Hoof H, Meger D, editors. Addressing function approximation error in actor-critic methods. International conference on machine learning; 2018: PMLR. [Google Scholar]
33.Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al. Continuous control with deep reinforcement learning. 2015. [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0307767.r001

Decision Letter 0

Joint Chair Prof Dr Stelios Bekiros

19 Oct 2023

PONE-D-23-24165Adaptive control for circulating cooling water system using deep reinforcement learningPLOS ONE

Dear Dr. Zhang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 25 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Prof. Dr. Stelios Bekiros, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

Additional Editor Comments :

REVIEWER COMMENTS:

This paper presents a deep RL-based control of a circulating cooling water system. The topic has some practical significance, but the novelty of the paper is not enough. However, the decision can be reconsidered if the authors could carefully address all the concerns raised.

1. How does the proposed method ensure the stability of the system?

2. The authors mentioned many successful applications of RL to circulating cooling water system ([26-28]), what is the contribution of this manuscript compared to them? It is suggested that the motivation and contributions should be more emphasized.

3. Since there are many related methods that can also deal with optimal control of unknown systems, it is better to provide a more comprehensive literature review. Please note that the up-to-date of references will contribute to the up-to-date of your manuscript. The studies named: Robust safe reinforcement learning control of unknown continuous-time nonlinear systems with state constraints and disturbances, Journal of Process Control; Online reinforcement learning with passivity-based stabilizing term for real time overhead crane control without knowledge of the system model, Control Engineering Practice, can be used to explain the method in the study or to indicate the contribution in the "Introduction" section. I believe this would further strengthen the introduction and lend support to the methodology used in general.

4. Check the notation system throughout the text. For example, the differential operator in equation (1) and the state in MDP use the same character "s". The transfer function G and the state transition function P should be unified, the current expression is confusing. If a1 and M1 represent the same value, why do the authors use different notations?

5. The control error values in equation (2) are not defined. The error between what? It is suggested that the reference trajectory model be placed in a more appropriate location.

6. What is the difference between the proposed method and TD3?

7. Please improve the quality of all figures and the language.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper presents a deep RL-based control of a circulating cooling water system. The topic has some practical significance, but the novelty of the paper is not enough. However, the decision can be reconsidered if the authors could carefully address all the concerns raised.

1. How does the proposed method ensure the stability of the system?

5. The control error values in equation (2) are not defined. The error between what? It is suggested that the reference trajectory model be placed in a more appropriate location.

6. What is the difference between the proposed method and TD3?

7. Please improve the quality of all figures and the language.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jul 24;19(7):e0307767. doi: 10.1371/journal.pone.0307767.r002

Author response to Decision Letter 0

20 Nov 2023

Dear Editors and Reviewers:

Thanks for your comments concerning our manuscript entitled “Adaptive control for circulating cooling water system using deep reinforcement learning” (ID: PONE-D-23-24165). Your comments are really helpful for revising and improving our paper. We have studied those comments carefully and have made some corrections which we hope to meet with your approval. The main corrections in the paper and the response to the reviewers are as follows:

Comments 1: How does the proposed method ensure the stability of the system?

Response 1: Thank you for your valuable feedback on our submitted paper. We have carefully read your review comments, and in response to your concerns about system stability, we are willing to provide a detailed response.

In this paper, we employ the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm designed to improve training stability. By introducing a Twin Q network and a delayed update mechanism, we aim to reduce the variance during training to prevent excessive fluctuations in the system. The algorithm employs experience playback, which is one of the commonly used techniques in the field of deep reinforcement learning. Experience playback helps mitigate instability due to sample correlation and improves the system's robustness by reusing previous experience. In this paper, we rationally design the state space and the reward function of a multivariate system for circulating cooling water to help the deep learning agent perceive the system state more accurately and adjust it according to the reward signal. This initiative aims to prevent unstable behaviours during the training process. In addition, we introduce a reference trajectory model to accelerate convergence and reduce system oscillations during control. This optimization tool helps to make the control system approximate the optimal policy more smoothly and improve the overall stability.

We encourage regular system performance monitoring in practical applications to enhance safety and system stability further. It is crucial to make adjustments as needed to maintain system stability. Additionally, considering alternative control strategies may be prudent to address specific scenarios that deep reinforcement learning algorithms might not handle effectively. This ensures that the system remains stable even in extreme circumstances.

Based on this, we have made an addition in Section 5. Please refer to the red content in the first paragraph of section 5 on page 13, at line 311~314.

Comments 2: The authors mentioned many successful applications of RL to circulating cooling water system ([26-28]), what is the contribution of this manuscript compared to them? It is suggested that the motivation and contributions should be more emphasized.

Response 2: Thank you for your valuable suggestions on our paper. We understand and value your comments.

Regarding the motivation of the research in this paper, the circulating cooling water system is a complex system with nonlinear, time-lag and multivariate characteristics. Traditional control methods, such as PID controllers, fuzzy control, model predictive control, etc., are often difficult to cope with the complex dynamic characteristics of the system and the uncertainty in the operation process and thus have certain limitations. However, with the development of artificial intelligence technology, reinforcement learning, as a machine learning method based on trial-and-error learning, has powerful nonlinear modelling and adaptive learning capabilities. On the one hand, this paper wants to verify whether deep reinforcement learning has certain advantages over traditional control methods in circulating cooled water systems; on the other hand, although [26-28] have done some research in circulating cooled water-related systems, in general, the research in this field is not deep enough and comprehensive, based on which this paper adopts a different deep reinforcement learning method from [26-28]: the Twin Delayed Deep Deterministic Policy Gradient. The main contributions of this paper are as follows: 1) A deep reinforcement learning controller for circulating cooling water systems was designed based on the TD3 algorithm, achieving end-to-end control and enhancing system stability. 2) The circulating cooling water multivariable system's state space and reward function were reasonably designed. The convergence speed of the agent was accelerated, and the oscillations and instability of the control system were reduced by adding a reference trajectory model. 3) The controller design does not require a model or specialized knowledge about industrial processes. Random disturbance signals were introduced during simulation training to improve the system's adaptive capabilities. 4) The application of deep reinforcement learning in circulating cooling water systems was explored, providing reference and inspiration for control problems in other industrial domains.

Based on this, we have made an addition in Section 1. Please refer to the red content in the third paragraph of section 1 on page 3, at line 69~83.

Comments 3: Since there are many related methods that can also deal with optimal control of unknown systems, it is better to provide a more comprehensive literature review. Please note that the up-to-date of references will contribute to the up-to-date of your manuscript. The studies named: Robust safe reinforcement learning control of unknown continuous-time nonlinear systems with state constraints and disturbances, Journal of Process Control; Online reinforcement learning with passivity-based stabilizing term for real time overhead crane control without knowledge of the system model, Control Engineering Practice, can be used to explain the method in the study or to indicate the contribution in the "Introduction" section. I believe this would further strengthen the introduction and lend support to the methodology used in general.

Response 3: Thank you for your comments and suggestions. The literature you recommended is critical. We fully agree and have added this section to the manuscript.

Please refer to the red content in the third paragraph of section 1 on page 3, at line 61~68.

Comments 4: Check the notation system throughout the text. For example, the differential operator in equation (1) and the state in MDP use the same character "s". The transfer function G and the state transition function P should be unified, the current expression is confusing. If a1 and M1 represent the same value, why do the authors use different notations?

Response 4: Thank you for your comments and suggestions. We apologize for the lack of clarity in our previous presentation; your suggestion is essential.

For this reason, we use "S" to denote the differential operator in Equation (1) and "s" to denote the state in the MDP. The transfer function G is mainly used in control system theory to describe linear time-invariant systems' input and output relationship. In contrast, the state transfer function P is mainly used in the MDP framework to describe the state transfer process between an intelligent body and its environment. In addition, the values of a1 and M1 are indeed the same in this paper, and the reason why different symbols are used is that a1 denotes the value of the action in reinforcement learning, and M1 denotes the value of the valve opening in the circulating cooling water system. The action value a1 obtained after the training of the reinforcement learning algorithm is applied to the circulating cooling water system as a control quantity, and the realization of the control quantity is done through M1.

For the revision details of this question, please refer to the red content in section 2 on page 5, at line 127.

Comments 5: The control error values in equation (2) are not defined. The error between what? It is suggested that the reference trajectory model be placed in a more appropriate location.

Response 5: Thank you for pointing this out. For the revision details of this question, please refer to the red content in section 3 on page 7, at line 177.

The reason for placing the reference trajectory model after the setpoint is that, considering that the setpoint may have sudden changes or instability in practical applications, adding the reference trajectory model after it can smooth out the setpoint signal so that its changes are slower and smoother, thus helping to reduce the oscillations and instability of the control system. The fact that the reference trajectory model is placed after the setpoint does not affect the magnitude of the error value.

Comments 6: What is the difference between the proposed method and TD3?

Response 6: Thank you for pointing this out. We are willing to provide further explanation on the issue.

The control algorithm used in our proposed method is TD3. However, we have adjusted the system's control structure to adapt to the control problems in circulating cooling water systems. Since the setpoints in the circulating water system may have sudden changes or instability in practical applications, this may lead to unstable performance or oscillations in the control system. By introducing a reference trajectory model, the setpoint signal can be smoothed to make its changes slower and smoother, which helps to reduce the oscillation and instability of the control system. From the learning curves of different methods under the same task in Fig. 5, the method proposed in this paper obtains higher rewards faster and more stably due to the addition of the reference trajectory model. Through simulation experiments, the method proposed in this paper has a better performance and a more significant potential in a comprehensive view.

Comments 7: Please improve the quality of all figures and the language.

Response 7: Thank you for your review and valuable comments. We take your suggestions very seriously and have already started to improve the quality of all graphics and language in the paper. We will carefully review and cross-reference your guidance to ensure that all charts and graphs, as well as the presentation of the paper, are more precise, more accurate, and better aligned with academic requirements. We look forward to demonstrating a tangible effort to improve on your suggested improvements in the final version.

Thank you again for your guidance and review.

We are looking forward to hearing from you at your earliest convenience. Thanks for your attention and time.

Sincerely,

Qingxin Zhang (Corresponding author)

E-mail: zhy9712_sau@163.com

Shenyang Aerospace University

November 20, 2023

Attachment

Submitted filename: Response to Reviewers.docx

pone.0307767.s002.docx^{(99.7KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0307767.r003

Decision Letter 1

Lalit Chandra Saikia

19 Jan 2024

PONE-D-23-24165R1Adaptive control for circulating cooling water system using deep reinforcement learningPLOS ONE

Dear Dr. Zhang,

Please submit your revised manuscript by Mar 04 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Lalit Chandra Saikia, PhD

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

All the comments of reviewer must be addressed and necessary changes must be done in the revised manuscript.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have addressed most of my concerns and the paper is recommended for acceptance if possible.

Reviewer #2: (No Response)

Reviewer #3: Some potential drawbacks of the proposed deep reinforcement learning control method:

1. Complexity: Deep RL methods introduce significant complexity compared to traditional controllers.

2. Hyperparameters: Fine-tuning hyperparameters like discount factor, learning rate etc. requires expertise.

3. Sample efficiency: Large volumes of experience/data needed to learn optimal policy, may not be feasible in practice.

4. Brittleness: Policies could fail under distribution shifts or novel operating conditions not seen during training.

5. Non-stationary systems: No mechanism provided to continually learn as system dynamics change over time.

6. Interpretability: Learned policies are black-boxes, hard to analyze causes of behavior and ensure robustness.

7. Real system validation: Only simulated tests conducted, performance on real plant with noises/disturbances unknown.

8. Computational cost: Training deep RL agents is computationally expensive requiring specialized hardware.

9. Data requirements: Need sufficient coverage of state-action space in collected data to train policy.

10. Safety: No fail-safes described for scenarios where control deteriorates before retraining can occur.

11. Single objective: Only optimize for one control metric, may negatively impact other important factors.

12. Keywords section is missing.

13. Describe dataset features in more details and its total size and size of (train/test) as a table.

14. Flowchart and algorithm steps need to be inserted.

15. Time spent need to be measured in the experimental results.

16. Limitation Section need to be inserted.

17. All metrics need to be calculated in the experimental results as tables.

18. Address the accuracy/improvement percentages in the abstract and in the conclusion sections, as well as the significance of these results.

19. The architecture of the proposed model must be provided

20. The authors need to make a clear proofread to avoid grammatical mistakes and typo errors.

21. The authors need to add recent articles in related work and update them.

22. Add future work in last section (conclusion) (if any)

23. Enhance the clarity of the Figures by improving their resolution.

24. To improve the Related Work and Introduction sections authors are recommended to review this highly related research work paper:

a) Building an Effective and Accurate Associative Classifier Based on Support Vector Machine

b) A survey on improving pattern matching algorithms for biological sequences

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Tarek Abd El-Hafeez

**********

PLoS One. 2024 Jul 24;19(7):e0307767. doi: 10.1371/journal.pone.0307767.r004

Author response to Decision Letter 1

25 Feb 2024

Response to Reviewer Comments

Thank you very much for taking the time to review this manuscript. We will carefully consider and provide detailed answers to your questions. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the resubmitted files.

Comments: Some potential drawbacks of the proposed deep reinforcement learning control method: