Abstract
Soft sensors are mathematical models that describe the dependence of primary variables on secondary variables. Modern industrial process data increasingly exhibit nonlinear characteristics along with growing complexity and dynamics, which poses challenges for soft sensor modeling. To address these issues, a novel supervised attention-based bidirectional long short-term memory (SA-BiLSTM) network is proposed in this paper to model nonlinear industrial processes with dynamic features. In this SA-BiLSTM model, an attention mechanism is introduced to calculate the correlation between hidden features at each time step, thus avoiding the loss of important information. Furthermore, the approach combines historical quality information with a moving window through a supervised strategy on the quality variables. This not only extracts and exploits nonlinear dynamic latent information from the process and quality variables but also enhances the model's learning efficiency and overall prediction performance. Finally, two real industrial examples demonstrate the superiority of the proposed method over conventional methods.
I. Introduction
The most critical aspect of modern industrial processes (control, optimization, and monitoring) is the real-time detection of key quality variables, relevant parameters, and other information.1−4 Modern plants may detect critical quality variables by installing online equipment or using offline instrumentation.5 However, traditional offline equipment often leads to detection lags and high maintenance costs, and online instrumental analysis, although a proven method, is costly with maintenance costs that are difficult to control.4,6 In contrast, soft sensor technology has advantages that traditional offline devices and online instrumental analysis do not, such as low latency, low cost, and easy maintenance. By building mathematical models, soft sensors establish the relationship between easy-to-measure variables X (input) and difficult-to-measure variables Y (output).7 Therefore, soft sensors are receiving increasing attention in industrial applications such as fault diagnosis, data monitoring, and quality prediction.8−10
Soft sensors can be broadly classified into two types: model-driven and data-driven. Model-driven soft sensors usually focus on describing the ideal steady state of a process, which often does not match the actual industrial process. In contrast, data-driven soft sensors are built from data collected in the actual plant, which better describes the real process conditions.11 With the continuous development of distributed control systems, the influx of large amounts of real-time data has driven the development of data-driven soft sensors for various processes.12 The first generation of data-driven soft sensors includes principal component analysis (PCA),13 principal component regression (PCR),14 partial least-squares (PLS),15 the support vector machine (SVM),16 the artificial neural network (ANN),17 etc. Researchers have proposed new modeling strategies based on PCA and PLS to cope with different processes.18 For example, Yuan et al.19 proposed a probabilistic weighted principal component analysis and a spatial-temporal locally weighted partial least-squares method to solve nonlinear dimensionality-reduction and time-varying problems. SVMs and ANNs are used to model nonlinear processes. Among them, ANNs, which are composed of multiple layers of nonlinear hidden units and can effectively handle high-dimensional nonlinear data, are more widely used than SVMs. Shallow ANNs perform well, but prediction performance deteriorates as the network deepens because of vanishing or exploding gradients. To solve this problem, LeCun et al.20 proposed multihidden-layer neural networks that use deep learning to extract features, ushering in the wave of deep learning in industry. Depending on the type of learning (supervised or unsupervised deep learning), supervised fine-tuning and unsupervised pretraining have become the core techniques of deep learning.21,22
The ability of deep learning to extract complex latent features directly from data has led to its increasing popularity in soft sensing. The Deep Belief Network (DBN) and the Stacked Auto-Encoder (SAE) have been widely used in industrial processes as typical deep learning models; these two methods are also known as Deep Latent Variable Models (DLVMs).23 For example, Liu et al.24 proposed a DBN that extracts nonlinear features to predict the outlet oxygen content of combustion systems online. Wang et al.25 combined the SAE with support vector regression (SVR) to propose a deep network soft sensor model that estimates rotor deformation of air preheaters in thermal power plant boilers. Although some DBN- and SAE-based networks perform well on linear and nonlinear problems, several challenges still prevent their widespread use: the dynamic nonlinearity of the internal response mechanisms of instruments in industrial processes, the difficulty of extracting dynamic information from the processes, and the highly nonlinear temporal correlation of industrial data. To model such industrial data, the proposed model must capture dynamic information between process variables and attend to the nonlinear temporal correlation between samples. A temporal dynamic neural network, the recurrent neural network (RNN), is therefore used to extract dynamic information from time series data. However, vanishing and exploding gradients prevent the RNN from modeling long sequences. To solve this problem, the long short-term memory network (LSTM) was proposed by Hochreiter et al.,26 which uses a nonlinear gate structure to select which useful information to store and which useless information to forget. It has been verified that the LSTM deals better with the vanishing and exploding gradients caused by long-term dependence.27 As a result, the LSTM has gained widespread interest in natural language processing,28 speech recognition,29 and industrial process time series data.30 However, the input of the LSTM focuses only on the dynamic information in the process variables; introducing quality information into the cell structure to enhance the extraction of dynamic quality information is not considered, which biases the correlation between the hidden state and the dynamic quality information in the LSTM network. Quality information is therefore essential in soft sensor modeling, and introducing it into the cell structure as a supervisory feature can significantly improve the accuracy of quality variable prediction. For example, Yuan et al.31 proposed a supervised long short-term memory network (SLSTM) that allows the network to learn dynamic information related to the quality variables, thus increasing the accuracy of the model's predictions. Although many deep soft sensor models have introduced quality information in recent years, only recent quality information is considered, while the latent dynamic information between quality variables is often ignored; moreover, the number of neurons in the network structure is often large, causing overfitting and a significant increase in the inefficiency and instability of the model.
Therefore, a supervised bidirectional long short-term memory network for data-driven dynamic soft sensor modeling was proposed by Lui et al.,32 in which a bidirectional structure reduces the number of neurons and the quality information is combined with moving windows after expansion. However, the correlation of hidden states at different time steps is not considered: when the input sequence becomes long, the correlation between samples far apart in the window is difficult to detect, so some critical information is ignored.33 To solve this problem, this paper proposes a supervised attention-based bidirectional long short-term memory (SA-BiLSTM) network for soft sensor modeling. First, the ability of the model to capture dynamic information between quality variables is enhanced by introducing historical quality information over z time steps. Second, combined with the attention mechanism, different attention values are assigned to the hidden states generated by the bidirectional structure to calculate the correlation between time steps, and a weighted summation produces the context vector. Finally, the quality variables are predicted by a fully connected layer. In addition, L2 regularization is used to prevent overfitting. The model is applied to the ammonia synthesis process to predict residual CO and CO2 concentrations.
The remainder of this paper is structured as follows. The attention mechanism and the LSTM are reviewed in Section II. Section III details the soft sensor model based on the supervised attention-based BiLSTM network. In Section IV, the effectiveness of the proposed method is verified by two case studies on the CO2 absorption column and the methanation furnace in the ammonia synthesis process. Finally, conclusions are drawn in Section V.
II. Related Works
The Attention Mechanism
When traditional encoder-decoder models deal with long sequences, a fixed-length context vector loses a large amount of critical information regardless of the length of the input sequence, resulting in a significant decrease in model accuracy.34 To solve this problem, the similarity between the intermediate results of the input and the target information is calculated in the encoding stage, and the results are passed to the decoder to obtain the output sequence, so that a better understanding of which state information is more critical to the output sequence is obtained. This method is the attention mechanism, the essence of which is to focus limited attention on the key information, i.e., to calculate the similarity between the input and the target.35 In summary, the attention function calculates the similarity between a query vector and a set of keys in key-value pairs, normalizes the correlations to obtain the weights of the values corresponding to the keys, and then weights the values to obtain the final result.36
The similarity between the input and the target is represented by a score function s. Commonly used similarity measures are as follows:37
$$ s(q, k) = q^{\mathrm{T}} k \tag{1} $$

$$ s(q, k) = q^{\mathrm{T}} W k \tag{2} $$

$$ s(q, k) = f(q, k) \tag{3} $$
where q and k denote the query and key vectors, respectively, and f(•) denotes a neural network, such as a Multilayer Perceptron (MLP). Equations 1 and 2 are the dot product and matrix multiplication methods, respectively. The abstracted attention mechanism is shown in Figure 1.
Figure 1.
Structure of an attention mechanism.
The input sequence is {x1, x2, ..., xt}, and the corresponding attention weights are {a1, a2, ..., at}. As Figure 1 shows, a softmax function is applied partway through so that the attention weights sum to 1. The softmax is calculated as
$$ a_i = \frac{\exp(s_i)}{\sum_{j=1}^{n} \exp(s_j)} \tag{4} $$
where ai denotes the attention weight of the ith input and n is the total number of input samples.
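To make this concrete, the following minimal NumPy sketch (an illustration, not the paper's implementation) scores a query against a set of keys with the dot-product similarity of eq 1, normalizes the scores with the softmax of eq 4, and returns the attention-weighted sum of the values:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax (eq 4); the resulting weights sum to 1."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def attention(query, keys, values):
    """Dot-product attention: similarity (eq 1), softmax weights (eq 4),
    then a weighted sum of the values."""
    scores = keys @ query         # s_i = k_i . q for each key
    weights = softmax(scores)     # a_1, ..., a_t
    context = weights @ values    # weighted sum of the values
    return context, weights

# toy example: 4 inputs with 3-dimensional keys/values
rng = np.random.default_rng(0)
keys = values = rng.standard_normal((4, 3))
query = rng.standard_normal(3)
context, a = attention(query, keys, values)
print(a.sum())  # -> 1.0
```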
Long Short-Term Memory
Deep dynamic soft sensors were initially developed based on models such as the Autoencoder (AE) and the DBN.32 With the explosion of deep learning in recent years, RNNs, initially applied to natural language processing, were introduced to soft sensor modeling of industrial processes with temporal data because of their ability to memorize dynamic information. However, owing to the vanishing and exploding gradient problems of the RNN, there is increasing interest in its variant, the LSTM, which effectively alleviates these problems.27 The basic LSTM cell adds three nonlinear gates (i.e., input, forget, and output gates) to the RNN cell, which allows the cell to maintain its state better and control the information passing through it more efficiently, making it more robust to long-term dependence. In addition, the cell state ct is used for long-term memory of dynamic information, while the hidden state ht is responsible for short-term information and is passed to the next cell.
The structure of the LSTM cell is shown in Figure 2. The three gates are computed as

$$ i_t = \sigma\left(W_i x_t + R_i h_{t-1} + b_i\right) \tag{5} $$

$$ f_t = \sigma\left(W_f x_t + R_f h_{t-1} + b_f\right) \tag{6} $$

$$ o_t = \sigma\left(W_o x_t + R_o h_{t-1} + b_o\right) \tag{7} $$

where σ(•) represents the nonlinear activation function, W and R are input and recurrent weights, and b is a bias vector. In addition, tanh(•) is a common nonlinear activation function used to express the candidate cell state c̃t, i.e.

$$ \tilde{c}_t = \tanh\left(W_c x_t + R_c h_{t-1} + b_c\right) \tag{8} $$

The updates of the cell state and the hidden state are

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{9} $$

$$ h_t = o_t \odot \tanh(c_t) \tag{10} $$

where ⊙ denotes the Hadamard product of two vectors.
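For clarity, eqs 5–10 can be written directly as a single cell step; the NumPy sketch below is illustrative (the dictionary packing of the parameters is an assumption for readability, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM cell step (eqs 5-10). W, R, b are dicts holding the input
    weights, recurrent weights, and biases for the gates 'i', 'f', 'o'
    and the candidate state 'c'."""
    i = sigmoid(W['i'] @ x_t + R['i'] @ h_prev + b['i'])        # input gate, eq 5
    f = sigmoid(W['f'] @ x_t + R['f'] @ h_prev + b['f'])        # forget gate, eq 6
    o = sigmoid(W['o'] @ x_t + R['o'] @ h_prev + b['o'])        # output gate, eq 7
    c_tilde = np.tanh(W['c'] @ x_t + R['c'] @ h_prev + b['c'])  # candidate state, eq 8
    c_t = f * c_prev + i * c_tilde   # cell state update (Hadamard products), eq 9
    h_t = o * np.tanh(c_t)           # hidden state, eq 10
    return h_t, c_t
```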
Figure 2.
Structure of an LSTM cell.
The LSTM performs well in a number of applications with less complex processes and dynamics, and some early examples of LSTM-based soft sensors are summarized by Ke et al.38
III. The Proposed Supervised Attention-Based BiLSTM Network for Soft Sensing
In this section, a supervised attention-based bidirectional long short-term memory network is used for soft sensor modeling. It introduces temporal attention to uncover correlations between different time steps in the sequence; the resulting context vector is a weighted sum of the products of the attention weights and the corresponding hidden states. A supervised strategy that incorporates historical quality information allows the capture and exploitation of nonlinear dynamic latent information in the process and quality variables. A bidirectional structure is used to reduce the number of hidden units, which increases the model's accuracy and explores the underlying dynamic information more deeply in both the forward and backward directions.
SA-BiLSTM
Figure 3 shows the structure of the cells with quality information in the model. It is assumed that I and O are the dimensions of the input vector xt and the output vector yt, respectively, and that the SA-BiLSTM layer contains an SA-BiLSTM cell with quality information. The input gate it, forget gate ft, output gate ot, cell state ct, cell update state c̃t, and hidden state ht in the structure shown in Figure 3 are expressed as
$$ i_t = \sigma\left(W_i x_t + R_i h_{t-1} + \sum_{k=1}^{z} V_i^{(k)} y_{t-k} + b_i\right) \tag{11} $$

$$ f_t = \sigma\left(W_f x_t + R_f h_{t-1} + \sum_{k=1}^{z} V_f^{(k)} y_{t-k} + b_f\right) \tag{12} $$

$$ o_t = \sigma\left(W_o x_t + R_o h_{t-1} + \sum_{k=1}^{z} V_o^{(k)} y_{t-k} + b_o\right) \tag{13} $$

$$ \tilde{c}_t = \tanh\left(W_c x_t + R_c h_{t-1} + \sum_{k=1}^{z} V_c^{(k)} y_{t-k} + b_c\right) \tag{14} $$

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{15} $$

$$ h_t = o_t \odot \tanh(c_t) \tag{16} $$

$$ \hat{h}_t = \overrightarrow{h}_t + \overleftarrow{h}_t \tag{17} $$

Here, $W_*$ are input weights, $R_*$ are hidden state weights, $V_*^{(1)}$ to $V_*^{(z)}$ are output weights, $b_*$ are biases, and the subscript * stands for i, f, o, c. σ(•) and tanh(•) denote nonlinear activation functions, ⊙ is the Hadamard product operation, and yt−1 to yt−z in eqs 11–14 are the historical quality variables within a window of size z. As shown in the SA-BiLSTM bidirectional structure in Figure 4a, $\overrightarrow{h}_t$ is the forward hidden state and $\overleftarrow{h}_t$ is the backward hidden state; the two are summed at the element level to obtain the final output $\hat{h}_t$ (eq 17).
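The following sketch illustrates one supervised cell step under the parameterization written in eqs 11–16 above; the gate-wise sum over the historical quality window is inferred from the weight definitions and should be read as an assumption rather than the authors' released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sa_cell_step(x_t, y_hist, h_prev, c_prev, W, R, V, b):
    """One supervised cell step (eqs 11-16). y_hist is the window
    [y_{t-1}, ..., y_{t-z}] of O-dimensional historical quality vectors;
    V[g][k] is the output weight V_g^(k+1) for gate g in {'i','f','o','c'}."""
    def preact(g):
        s = W[g] @ x_t + R[g] @ h_prev + b[g]
        for k, y_k in enumerate(y_hist):   # supervised quality term:
            s = s + V[g][k] @ y_k          #   sum_k V_g^(k) y_{t-k}
        return s
    i = sigmoid(preact('i'))               # input gate, eq 11
    f = sigmoid(preact('f'))               # forget gate, eq 12
    o = sigmoid(preact('o'))               # output gate, eq 13
    c_tilde = np.tanh(preact('c'))         # candidate state, eq 14
    c_t = f * c_prev + i * c_tilde         # cell state update, eq 15
    h_t = o * np.tanh(c_t)                 # hidden state, eq 16
    return h_t, c_t

# eq 17: the bidirectional layer runs this cell forward and backward over
# the sequence and sums the two hidden states elementwise:
#   h_hat_t = h_forward_t + h_backward_t
```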
Figure 3.
Cell structure with historical quality information in the SA-BiLSTM model.
Figure 4.
Structure of the SA-BiLSTM bidirectional structure and attention layer. (a) The bidirectional structure. (b) The attention layer.
When the sequence is long, the correlation between data samples at each time step is difficult for the basic LSTM to find, and some necessary information is hard to retain, resulting in a severe decline in the prediction performance of the model. To solve this problem, an attention mechanism is introduced to find the correlation between different time steps by adaptively learning over the hidden states of the LSTM at all time steps. Because the attention mechanism has received much attention in recent years, various attention variants have emerged. In terms of models, attention is generally used with the convolutional neural network (CNN) or the RNN, i.e., CNN-Attention and LSTM-Attention, where the attention mechanism weights the hidden states of all time steps to find the more important ones.
Figure 4b illustrates the attention layer introduced after the SA-BiLSTM bidirectional structure, and the green box in Figure 5 shows the attention module within the overall model framework. The attention weights of the hidden states at time step t are calculated as follows:
$$ s_{tp} = V_a^{\mathrm{T}} \tanh\left(W_a \hat{h}_p\right), \quad p = 1, 2, \ldots, T \tag{18} $$

$$ \alpha_{tp} = \frac{\exp(s_{tp})}{\sum_{p=1}^{T} \exp(s_{tp})} \tag{19} $$
where T is the subwindow size, Va and Wa denote the attention learning parameters, and stp is the attention score of the pth hidden state within the subwindow at time step t. Through eq 19, the attention weights of all hidden states sum to 1. Then, the hidden states of all time steps are weighted and summed with the attention weights to obtain a state vector containing the key information
$$ C = AH = \sum_{p=1}^{T} \alpha_{tp} \hat{h}_p \tag{20} $$
where C is the final vector, i.e., the context vector, A is the attention weight matrix consisting of α at each time step, and H is the matrix consisting of the outputs of the hidden layers at different time steps. Ultimately, the quality prediction output is calculated as
$$ \hat{y}_t = F(C) \tag{21} $$
Here, F(•) denotes a fully connected layer. Through the above structure, the quality variable ŷt is predicted using the attention mechanism together with the fully connected layer.
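A minimal sketch of eqs 18–21, assuming the additive scoring form written above and a linear fully connected output layer:

```python
import numpy as np

def temporal_attention(H, W_a, V_a):
    """Attention over the subwindow of BiLSTM outputs (eqs 18-20).
    H is the (T, d) matrix of hidden states h_hat_1 .. h_hat_T."""
    s = np.tanh(H @ W_a.T) @ V_a    # alignment scores s_tp, eq 18
    e = np.exp(s - s.max())
    alpha = e / e.sum()             # softmax-normalized weights, eq 19
    C = alpha @ H                   # context vector C = AH, eq 20
    return C, alpha

def predict(H, W_a, V_a, W_fc, b_fc):
    """Quality prediction through the fully connected layer, eq 21."""
    C, _ = temporal_attention(H, W_a, V_a)
    return W_fc @ C + b_fc          # y_hat_t = F(C)
```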
Figure 5.
SA-BiLSTM-based soft sensor modeling framework.
Soft Sensor Development Based on SA-BiLSTM Model
Figure 5 shows the framework of the SA-BiLSTM-based soft sensor modeling process, which includes an input layer, a bidirectional structure layer, an attention layer, a fully connected layer, and an output layer. First, the industrial time series data are fed into the input layer and then transferred to the SA-BiLSTM cell structure with historical quality information. The individual gate structures in this cell select which information to remember and which to forget, maximizing the information related to the quality variables. The relevant information is then passed to the attention layer as the hidden state ht, which allows the SA-BiLSTM to further uncover the correlation between time steps and prevents forgetting critical information related to the current prediction when the sequence is excessively long. Finally, the hidden states weighted and summed by the attention weights enter the fully connected layer, where the corresponding weights and biases produce the prediction from the current hidden features. The whole SA-BiLSTM soft sensor model is trained with L2 regularization to prevent overfitting. The training loss function with an L2 regularization term is shown below:
$$ L = \frac{1}{T_{\mathrm{train}}} \sum_{t=1}^{T_{\mathrm{train}}} \left(y_t - \hat{y}_t\right)^2 + \lambda \sum_{j} w_j^2 \tag{22} $$
where yt is the actual value of the quality variable, ŷt is the predicted value, and λ denotes the regularization factor. The sum of the squares of wj can also be written as wTw, the square of the Euclidean norm (2-norm) of the parameter vector w.
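As a sketch (the paper does not state its software framework; PyTorch is assumed here), eq 22 is simply the mean squared error plus a λ-weighted sum of squared weights:

```python
import torch

def l2_regularized_loss(y_pred, y_true, weights, lam=0.001):
    """Training loss of eq 22: MSE plus lambda * sum_j w_j^2 (= lambda * w^T w).
    `weights` is any iterable of the model's weight tensors."""
    mse = torch.mean((y_true - y_pred) ** 2)
    l2 = sum(torch.sum(w ** 2) for w in weights)
    return mse + lam * l2
```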
In addition, the value of z in the SA-BiLSTM model is closely related to the dynamics of the industrial time series data and the complexity of the industrial process: the higher the complexity of the process, the more complex the dynamic distribution and the larger the z value used. The z value can be determined by a grid search, and the appropriate z value differs across industrial processes. To ensure the efficiency of the SA-BiLSTM soft sensor during training, mini-batch training is used to improve performance; i.e., the training data set is divided into small batches of the same size. For optimization, the Adam algorithm is used in the training phase because of its advantages over root-mean-square propagation (RMSProp) and momentum gradient descent in terms of memory requirements and computational effort. The mean squared error (MSE) is used as the loss function, which is calculated as
$$ \mathrm{MSE} = \frac{1}{T_{\mathrm{train}}} \sum_{t=1}^{T_{\mathrm{train}}} \left(y_t - \hat{y}_t\right)^2 \tag{23} $$
where Ttrain is the total number of samples in the training data set, and yt and ŷt are the actual and predicted values of the quality variables, respectively.
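A schematic mini-batch Adam training loop is sketched below; the toy model and random data are hypothetical stand-ins so the loop runs, and in practice they would be replaced by the SA-BiLSTM network and the plant data:

```python
import torch

# hypothetical stand-in model and data; in practice these would be the
# SA-BiLSTM network and the 3000-sample, 11-variable training set
model = torch.nn.Sequential(torch.nn.Linear(11, 65), torch.nn.Tanh(),
                            torch.nn.Linear(65, 1))
x = torch.randn(3000, 11)
y = torch.randn(3000, 1)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=64, shuffle=True)

# weight_decay applies the L2 penalty (lambda in eq 22) inside Adam
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
loss_fn = torch.nn.MSELoss()                 # eq 23

for epoch in range(200):                     # E = 200 epochs
    for xb, yb in loader:                    # mini-batch training
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                      # backpropagation gradients
        optimizer.step()                     # Adam parameter update
```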
Figure 6 illustrates a schematic of the SA-BiLSTM-based soft sensor modeling process. Algorithm 1 shows the detailed procedure of the SA-BiLSTM algorithm, using backpropagation for gradient calculation and the Adam algorithm for parameter updating.
Figure 6.
Flowchart for the SA-BiLSTM-based soft sensor model framework.
The evaluation phase uses the coefficient of determination R2 and the root-mean-square error (RMSE) to evaluate the accuracy of the SA-BiLSTM model. The smaller the RMSE, the more accurate the prediction results, so the model with the smallest RMSE is selected. The coefficient of determination R2 ranges from 0 to 1; the larger the R2, the closer the model's predictions are to the actual values, so it describes the prediction performance of the soft sensor model more intuitively. The two metrics are defined as follows:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{T_{\mathrm{test}}} \sum_{t=1}^{T_{\mathrm{test}}} \left(y_t - \hat{y}_t\right)^2} \tag{24} $$

$$ R^2 = 1 - \frac{\sum_{t=1}^{T_{\mathrm{test}}} \left(y_t - \hat{y}_t\right)^2}{\sum_{t=1}^{T_{\mathrm{test}}} \left(y_t - \bar{y}\right)^2} \tag{25} $$
where y̅ is the mean value of the quality information on the testing data set and Ttest is the total number of testing samples.
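Both metrics are straightforward to compute; a small NumPy sketch of eqs 24 and 25:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error (eq 24); smaller is better."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination (eq 25); closer to 1 is better."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# toy usage
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```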
IV. Results and Discussion
CO2 Absorption Column
To demonstrate the validity of the model, it was applied to an industrial ammonia synthesis process to predict the CO2 and CO contents. In the ammonia synthesis process, the CO2 absorption column is one of the critical devices: it determines the purity of the raw material for the final synthesis and is closely related to the quality of the product. The residual CO2 concentration at the outlet of the CO2 absorber is therefore an important indicator.39 Figure 7 shows the process flow diagram of the CO2 absorption column, in which the chemical reaction occurs as shown in the following equation:
$$ \mathrm{K_2CO_3 + CO_2 + H_2O \rightleftharpoons 2KHCO_3} \tag{26} $$
Figure 7.
Diagram of CO2 absorption column process.
In actual industrial processes, the detection of residual CO2 is often expensive, so the CO2 concentration is predicted by other auxiliary variables. Table 1 provides a description of the auxiliary variables selected for the process, which include temperature, pressure, and liquid level.
Table 1. Description of the Process and Quality Variables in the CO2 Absorption Column Example.
| Tags | Descriptions |
|---|---|
| U1 | Pressure of process gas into E3 |
| U2 | Liquid level of Separator 2 |
| U3 | Temperature of barren liquor at E1 exit |
| U4 | Flow rate of barren liquor to CO2 Absorption Column |
| U5 | Flow rate of half deficient liquor to CO2 Absorption Column |
| U6 | Temperature of process gas at exit of Separator 2 |
| U7 | Differential pressure of process gas at entrance of CO2 Absorption Column |
| U8 | Temperature of rich liquor at exit of CO2 Absorption Column |
| U9 | Liquid level of CO2 Absorption Column |
| U10 | High level alarming of Separator 1 |
| U11 | Pressure of process gas at the exit |
| Y | Concentration of residual CO2 in process gas |
The data set for this case was collected from the plant database; 4250 samples were collected and divided into a training data set of 3000 samples, with the remaining samples used as the testing data set. The software environment was as follows: OS, Windows 10 Home (64-bit); CPU, 11th Gen Intel Core i5-11400H (2.70 GHz); GPU, NVIDIA GeForce RTX 3050; RAM, 16.0 GB.
For the training of the SA-BiLSTM model, the hyperparameters of the experimental model used for quality prediction were first defined. The hyperparameters were selected by grid search, with learning rates ranging from 10−5 to 10−2, mini-batch sizes from 16 to 128 at an interval of 16, and the number of neurons from 64 to 256 at an interval of 10. The historical quality window size was searched from 1 to 5 by trial and error, and the subwindow size T from 1 to 10. The final SA-BiLSTM hyperparameters were hidden neurons H = 65, initial learning rate γ = 0.001, and mini-batch size B = 64. To ensure convergence, the number of epochs was set to E = 200; because the model tends to overfit as the number of iterations increases, the L2 regularization parameter was set to λ = 0.001 and the historical quality window to z = 5, after which training proceeds according to Algorithm 1.
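A sketch of such a grid search is given below; the `validation_rmse` helper is hypothetical (it stands in for training the model under one configuration and scoring it on held-out data, and the dummy score merely keeps the sketch runnable):

```python
import random
from itertools import product

# grid mirroring the search ranges described above
grid = {
    "lr":     [1e-5, 1e-4, 1e-3, 1e-2],
    "batch":  list(range(16, 129, 16)),
    "hidden": list(range(64, 257, 10)),
    "z":      list(range(1, 6)),
    "T":      list(range(1, 11)),
}

def validation_rmse(cfg):
    """Hypothetical stand-in: train the SA-BiLSTM with cfg and return
    its validation RMSE (a random value keeps this sketch runnable)."""
    return random.random()

candidates = [dict(zip(grid, vals)) for vals in product(*grid.values())]
best = min(candidates, key=validation_rmse)  # keep the smallest-RMSE config
print(best)
```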
The time overhead required to train the SA-BiLSTM-based soft sensor model was 39.51 s, while inference on the testing data set took 0.11 s. To verify the performance of the model, the SA-BiLSTM was compared with other dynamic soft sensor models based on deep learning (the LSTM, the Bidirectional Long Short-Term Memory network (BiLSTM), and the SLSTM). Figure 8 shows the prediction errors on the quality variables during the testing phase for the compared models. The error of the BiLSTM is slightly smaller than that of the LSTM, but the improvement is not outstanding, and a gap between both models and the actual output remains. The SLSTM introduces quality information and slightly improves performance over the former two, while the SA-BiLSTM is closer to zero error and less volatile than the other three models, and its predictions are closer to or overlap the actual output. Introducing the historical quality information and the attention mechanism is more advantageous than the LSTM, BiLSTM, and SLSTM in dealing with the nonlinear dynamic characteristics between process and quality variables and the dynamic characteristics within the historical quality information. Figure 9 compares the four methods overall: the predictions of the SA-BiLSTM-based soft sensor model are distributed more closely along the diagonal, further illustrating its better prediction performance.
Figure 8.
Testing phase errors of LSTM, BiLSTM, SLSTM, SA-BiLSTM.
Figure 9.
Predicted scatter plots of LSTM, BiLSTM, SLSTM, SA-BiLSTM.
Table 2 compares the prediction performance of the proposed method with the three other deep learning soft sensor modeling methods on the testing data set. From Table 2, the LSTM has the worst prediction performance: although it solves the RNN's vanishing gradient problem, it does not perform well on longer sequences, does not consider the correlations among hidden states at different time steps, and tends to miss important information. The BiLSTM improves slightly on the LSTM because longer-range dependencies are easier to capture, but it shares the same shortcoming. The SLSTM improves significantly on the first two because the introduction of quality information makes knowledge related to the quality variables easier to learn; however, the dynamic nature of the historical quality information is not considered. The prediction performance of the SA-BiLSTM is therefore vastly improved after introducing the supervised historical quality information strategy and discovering the correlations between different time steps through the attention mechanism.
Table 2. Prediction of CO2 Concentration by Four Methods on the Testing Data Set.
| Method | RMSEtest | R2test |
|---|---|---|
| LSTM | 0.0044 | 0.4804 |
| BiLSTM | 0.0035 | 0.6640 |
| SLSTM | 0.0025 | 0.8308 |
| SA-BiLSTM | 0.0005 | 0.9922 |
Figure 10 shows the predictions of the LSTM, BiLSTM, SLSTM, and SA-BiLSTM on the testing data set. From Figure 10, the LSTM and BiLSTM have the worst predictions: the prediction curves follow the trend of the actual curve, but the deviation is too large. Although the SLSTM improves the prediction, there is still room to reduce the deviation from the actual values. The SA-BiLSTM predicts the quality variables better than the above methods because the correlation of hidden states across time steps is considered, so important information for the current prediction is not easily ignored. Its prediction curve tracks the output curve well, with only a slight deviation around the 1000th sample.
Figure 10.
Prediction of residual CO2 concentration in CO2 absorption columns: (a) LSTM, (b) BiLSTM, (c) SLSTM, (d) SA-BiLSTM.
Methanation Furnace Unit for Ammonia Synthesis Process
The methanation furnace studied in this case is an essential branch of the ammonia synthesis process; its primary purpose is to minimize the CO and CO2 content in the process gas, while the decarbonization unit of the process produces hydrogen (one of the raw materials for ammonia synthesis). The primary function of the methanation furnace is to react hydrogen with the carbon present as CO and CO2 in the process gas to produce methane, which is finally transferred back for recovery.40 Figure 11 shows the process schematic of the methanation plant, and the reactions occurring in it are
$$ \mathrm{CO + 3H_2 \rightarrow CH_4 + H_2O}, \qquad \mathrm{CO_2 + 4H_2 \rightarrow CH_4 + 2H_2O} \tag{27} $$
Figure 11.
Schematic diagram of the methanation furnace process.
According to the purpose of the device mentioned above, we need to detect the content of CO and CO2 gas in the process gas as key quality variables. However, the measurement of CO and CO2 residual content in actual industrial processes is expensive and has an inevitable delay. Therefore, a real-time and cost-effective soft sensor model is needed to predict the CO and CO2 residual content (quality variables) based on easily measurable auxiliary variables. The inputs to the soft sensors are ten auxiliary variables (temperature, pressure, flow, etc.), and in Figure 11, the residual contents of CO and CO2 (quality variables) are marked in light blue, where the process variables are described in detail in ref (40).
The data set was taken from an actual database, from which 6000 samples were collected. The first 5000 samples were used as the training data set and the remaining 1000 samples as the testing data set. The initial hyperparameters were set before training the SA-BiLSTM network, after which the hyperparameters were tuned by grid search on the same device; the model was trained in the same way as in the previous section. The initial learning rate was γ = 0.001, the number of hidden neurons H = 50, and the maximum number of epochs E = 200; the mini-batch size was selected from {16, 32, 64, 128, 256} as B = 128, the historical quality window size from {1, 2, 3, 4, 5, 6} as z = 4, and the subwindow size as T = 6. To prevent overfitting, the L2 regularization parameter was λ = 0.001. The number of hidden neurons is relatively small because the number of auxiliary variables is close to that of the CO2 absorption column case.
To demonstrate the performance of the proposed method, soft sensor modeling of the methanation furnace data using the basic LSTM, the BiLSTM, and the SLSTM was performed for comparison with the SA-BiLSTM. Figure 12 shows the quality predictions of the LSTM, BiLSTM, SLSTM, and SA-BiLSTM on the methanation furnace testing data set. The figure shows that the prediction trends of the LSTM and BiLSTM are similar to the actual data but are too conservative to predict the quality variables accurately. The SLSTM improves significantly on the LSTM and BiLSTM and can effectively track the actual values; however, the overall error still needs to be reduced, probably because only the recent quality information is considered, which limits the improvement in prediction ability. In contrast, the SA-BiLSTM network introduces historical quality information and combines the attention mechanism to uncover more critical information, so it fits the quality variables well and is very effective. This indicates that the proposed method handles the dynamics and nonlinearity of the time series data well and is effective at capturing the latent information between the auxiliary and quality variables.
Figure 12.
Methanation furnace CO and CO2 residual gas content prediction: (a) LSTM, (b) BiLSTM, (c) SLSTM, (d) SA-BiLSTM.
Figure 13 shows the errors of the LSTM, BiLSTM, SLSTM, and SA-BiLSTM in predicting the quality variables on the testing data set. The inability to extract the vital information in the hidden states and the absence of quality information in the network structure lead to errors fluctuating between −0.4 and 0.5 for the LSTM and BiLSTM predictions, which are relatively poor. By contrast, the SLSTM narrows the oscillation interval to −0.1 to 0.2 after introducing the quality information, a slight improvement over the LSTM and BiLSTM. The soft sensor model based on the SA-BiLSTM, with the addition of historical quality information and the attention mechanism, improves the prediction significantly, overlapping the actual values with errors closer to 0; its training time is only 35.21 s, showing that it handles the temporal data efficiently.
Figure 13.
Prediction error plots for the LSTM, BiLSTM, SLSTM and SA-BiLSTM testing data set.
Table 3 gives the RMSE and R2 values of the proposed method and the other three deep learning soft sensor modeling methods on the testing data set. The SA-BiLSTM achieves an RMSE of 0.0298 and an R2 of 0.9535, a better prediction of the quality variables than the LSTM, BiLSTM, and SLSTM. This demonstrates that the proposed model is more effective at extracting and utilizing dynamic information; although the number of hidden neurons is reduced relative to the CO2 absorption column case, the prediction performance of the network remains stable. It also implies that the proposed method has outstanding prediction accuracy and efficiency compared with traditional deep learning-based soft sensor methods. From the scatter plots of the four models in Figure 14, the predictions of the SA-BiLSTM-based soft sensor model are distributed more closely along the diagonal, further proving the stability, efficiency, and accuracy of the proposed method.
Table 3. Prediction of the Methanation Furnace CO and CO2 Residual Content Using Four Methods on the Testing Data Set.
| Method | RMSEtest | R2test |
|---|---|---|
| LSTM | 0.1262 | 0.1710 |
| BiLSTM | 0.0928 | 0.5509 |
| SLSTM | 0.0605 | 0.8088 |
| SA-BiLSTM | 0.0298 | 0.9535 |
Figure 14.
Predicted scatter plots of LSTM, BiLSTM, SLSTM, and SA-BiLSTM on the testing data set.
V. Conclusion
This paper develops a supervised attention-based bidirectional long short-term memory network (SA-BiLSTM) for data-driven dynamic process soft sensor modeling. Traditional deep learning-based soft sensor modeling does not consider the importance of historical quality information or the relevance of hidden states at different time steps, so important information is neither captured nor utilized. Therefore, a novel attention-based BiLSTM with supervised historical quality information is proposed: it captures and exploits the latent dynamic information from the process variables and key quality variables in industrial time series data using a bidirectional structure together with historical quality information, and it introduces an attention mechanism to calculate the hidden state weights at different time steps and uncover the correlation of hidden states at each time step, improving the modeling accuracy. L2 regularization is used to improve the stability of the model while preventing overfitting. The SA-BiLSTM model is finally applied to two industrial cases, the CO2 absorption column and the methanation furnace unit of the ammonia synthesis process. The results demonstrate that the proposed method outperforms the basic LSTM, the BiLSTM, and the deep learning soft sensor model that focuses only on recent quality information for quality prediction.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 62203169, 61573137, and 62003300, in part by the Natural Science Foundation of Zhejiang Province under Grants LQ22F030009 and LQ23F030004, in part by the Natural Science Foundation of Huzhou under Grant 2021YZ11, and in part by the Huzhou Key Laboratory of Intelligent Sensing and Optimal Control for Industrial Systems under Grant 2022-17.
The authors declare no competing financial interest.
References
- Yuan X.; Ge Z.; Song Z. Locally weighted kernel principal component regression model for soft sensing of nonlinear time-variant processes. Ind. Eng. Chem. Res. 2014, 53, 13736–13749. 10.1021/ie4041252.
- Yao L.; Ge Z. Deep learning of semisupervised process data with hierarchical extreme learning machine and soft sensor application. IEEE Transactions on Industrial Electronics 2018, 65, 1490–1498. 10.1109/TIE.2017.2733448.
- Sun Q.; Ge Z. A survey on deep learning for data-driven soft sensors. IEEE Transactions on Industrial Informatics 2021, 17, 5853–5866. 10.1109/TII.2021.3053128.
- Yang Z.; Ge Z. Industrial virtual sensing for big process data based on parallelized nonlinear variational Bayesian factor regression. IEEE Transactions on Instrumentation and Measurement 2020, 69, 8128–8136. 10.1109/TIM.2020.2993980.
- Yuan X.; Ge Z.; Song Z.; Wang Y.; Yang C.; Zhang H. Soft sensor modeling of nonlinear industrial processes based on weighted probabilistic projection regression. IEEE Transactions on Instrumentation and Measurement 2017, 66, 837–845. 10.1109/TIM.2017.2658158.
- Souza F. A.; Araújo R.; Mendes J. Review of soft sensor methods for regression applications. Chemometrics and Intelligent Laboratory Systems 2016, 152, 69–79. 10.1016/j.chemolab.2015.12.011.
- Curreri F.; Patanè L.; Xibilia M. G. Soft sensor transferability: a survey. Applied Sciences 2021, 11, 7710. 10.3390/app11167710.
- Jiang Y.; Yin S.; Dong J.; Kaynak O. A review on soft sensors for monitoring, control, and optimization of industrial processes. IEEE Sensors Journal 2021, 21, 12868–12881. 10.1109/JSEN.2020.3033153.
- Qian J.; Song Z.; Yao Y.; Zhu Z.; Zhang X. A review on autoencoder based representation learning for fault detection and diagnosis in industrial processes. Chemometrics and Intelligent Laboratory Systems 2022, 231, 104711. 10.1016/j.chemolab.2022.104711.
- Mei W.; Liu Z.; Tang L.; Su Y. Test strategy optimization based on soft sensing and ensemble belief measurement. Sensors 2022, 22, 2138. 10.3390/s22062138.
- Yang Z.; Ge Z. Rethinking the value of just-in-time learning in the era of industrial big data. IEEE Transactions on Industrial Informatics 2022, 18, 976–985. 10.1109/TII.2021.3073645.
- Yang Z.; Ge Z. On paradigm of industrial big data analytics: From evolution to revolution. IEEE Transactions on Industrial Informatics 2022, 18, 8373–8388. 10.1109/TII.2022.3190394.
- Liu J.; Wang J.; Liu X.; Ma T.; Tang Z. MWRSPCA: online fault monitoring based on moving window recursive sparse principal component analysis. Journal of Intelligent Manufacturing 2022, 33, 1255–1271. 10.1007/s10845-020-01721-8.
- Memarian A.; Varanasi S. K.; Huang B. Mixture robust semi-supervised probabilistic principal component regression with missing input data. Chemometrics and Intelligent Laboratory Systems 2021, 214, 104315. 10.1016/j.chemolab.2021.104315.
- Liu J.; Sun D.; Chen J. Comparative study on wavelet functional partial least squares soft sensor for complex batch processes. Chem. Eng. Sci. 2022, 254, 117601. 10.1016/j.ces.2022.117601.
- Brusamarello B.; Da Silva J. C. C.; De Morais Sousa K.; Guarneri G. A. Bearing fault detection in three-phase induction motors using support vector machine and fiber Bragg grating. IEEE Sensors Journal 2022, 1. 10.1109/JSEN.2022.3167632.
- Wang G.; Jia Q.-S.; Zhou M.; Bi J.; Qiao J.; Abusorrah A. Artificial neural networks for water quality soft-sensing in wastewater treatment: a review. Artificial Intelligence Review 2022, 55, 565. 10.1007/s10462-021-10038-8.
- Yuan X.; Ye L.; Bao L.; Ge Z.; Song Z. Nonlinear feature extraction for soft sensor modeling based on weighted probabilistic PCA. Chemometrics and Intelligent Laboratory Systems 2015, 147, 167–175. 10.1016/j.chemolab.2015.08.014.
- Yuan X.; Zhou J.; Wang Y. A spatial-temporal LWPLS for adaptive soft sensor modeling and its application for an industrial hydrocracking process. Chemometrics and Intelligent Laboratory Systems 2020, 197, 103921. 10.1016/j.chemolab.2019.103921.
- LeCun Y.; Bengio Y.; Hinton G. Deep learning. Nature 2015, 521, 436–444. 10.1038/nature14539.
- Hinton G. E.; Osindero S.; Teh Y.-W. A fast learning algorithm for deep belief nets. Neural Computation 2006, 18, 1527–1554. 10.1162/neco.2006.18.7.1527.
- Hinton G. E.; Salakhutdinov R. R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. 10.1126/science.1127647.
- Kong X.; Jiang X.; Zhang B.; Yuan J.; Ge Z. Latent variable models in the era of industrial big data: Extension and beyond. Annual Reviews in Control 2022, 54, 167. 10.1016/j.arcontrol.2022.09.005.
- Liu Y.; Fan Y.; Chen J. Flame images for oxygen content prediction of combustion systems using DBN. Energy Fuels 2017, 31, 8776–8783. 10.1021/acs.energyfuels.7b00576.
- Wang X.; Liu H. Soft sensor based on stacked auto-encoder deep neural network for air preheater rotor deformation prediction. Advanced Engineering Informatics 2018, 36, 112–119. 10.1016/j.aei.2018.03.003.
- Hochreiter S.; Schmidhuber J. Long short-term memory. Neural Computation 1997, 9, 1735–1780. 10.1162/neco.1997.9.8.1735.
- Greff K.; Srivastava R. K.; Koutník J.; Steunebrink B. R.; Schmidhuber J. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 2017, 28, 2222–2232. 10.1109/TNNLS.2016.2582924.
- Xiao J.; Zhou Z. Research progress of RNN language model. 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA) 2020, 1285–1288.
- Oruh J.; Viriri S.; Adegun A. Long short-term memory recurrent neural network for automatic speech recognition. IEEE Access 2022, 10, 30069–30079. 10.1109/ACCESS.2022.3159339.
- Siami-Namini S.; Tavakoli N.; Namin A. S. The performance of LSTM and BiLSTM in forecasting time series. 2019 IEEE International Conference on Big Data (Big Data) 2019, 3285–3292.
- Yuan X.; Li L.; Wang Y. Nonlinear dynamic soft sensor modeling with supervised long short-term memory network. IEEE Transactions on Industrial Informatics 2020, 16, 3168–3176. 10.1109/TII.2019.2902129.
- Lui C. F.; Liu Y.; Xie M. A supervised bidirectional long short-term memory network for data-driven dynamic soft sensor modeling. IEEE Transactions on Instrumentation and Measurement 2022, 71, 1–13. 10.1109/TIM.2022.3152856.
- Shao W.; Ge Z.; Li H.; Song Z. Semisupervised dynamic soft sensing approaches based on recurrent neural network. Journal of Electronic Measurement and Instrumentation 2019, 33, 7–13.
- Yuan X.; Li L.; Wang Y.; Yang C.; Gui W. Deep learning for quality prediction of nonlinear dynamic processes with variable attention-based long short-term memory network. Canadian Journal of Chemical Engineering 2020, 98, 1377–1389. 10.1002/cjce.23665.
- Zhang H.; Qiao G.; Lu S.; Yao L.; Chen X. Attention-based feature fusion generative adversarial network for yarn-dyed fabric defect detection. Text. Res. J. 2022, 004051752211296. 10.1177/00405175221129654.
- Vaswani A.; Shazeer N.; Parmar N.; Uszkoreit J.; Jones L.; Gomez A. N.; Kaiser Ł.; Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30, 5998.
- Xu K.; Ba J.; Kiros R.; Cho K.; Courville A.; Salakhudinov R.; Zemel R.; Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning 2015, 2048–2057.
- Ke W.; Huang D.; Yang F.; Jiang Y. Soft sensor development and applications based on LSTM in deep neural networks. 2017 IEEE Symposium Series on Computational Intelligence (SSCI) 2017, 1–6. 10.1109/SSCI.2017.8280954.
- Yang Z.; Ge Z. Monitoring and prediction of big process data with deep latent variable models and parallel computing. Journal of Process Control 2020, 92, 19–34. 10.1016/j.jprocont.2020.05.010.
- Yang Z.; Yao L.; Ge Z. Streaming parallel variational Bayesian supervised factor analysis for adaptive soft sensor modeling with big process data. Journal of Process Control 2020, 85, 52–64. 10.1016/j.jprocont.2019.10.010.