A comparative analysis of linear regression, neural networks and random forest regression for predicting air ozone employing soft sensor models

Zheng Zhou; Cheng Qiu; Yufan Zhang

2023 Dec 16;13:22420. doi: 10.1038/s41598-023-49899-0

PMCID: PMC10725498 PMID: 38104205

Abstract

This study presents a comprehensive analysis of soft sensor modeling techniques for air ozone prediction. We compare the performance of three modeling techniques: linear regression (LR), neural networks (NN), and random forest regression (RFR). We also evaluate the impact of different variable sets on prediction performance. Our findings indicate that neural network models, particularly recurrent neural networks (RNN), outperform the other modeling techniques in terms of prediction accuracy. Variable set E demonstrates exceptional performance and achieves the highest average prediction accuracy across the soft sensor models. Comparing variable set E with sets A, B, C and D shows that the additional input feature PM₁₀ in the latter sets does not improve overall performance, potentially because of multicollinearity between the PM₁₀ and PM₂.₅ variables. The proposed methodology provides valuable insights into soft sensor modeling for air ozone prediction. Among the 72 sensors, sensor NN_R[Y]C outperforms all other evaluated sensors, with an R² of 0.8902, an RMSE of 24.91, and an MAE of 19.16. With a prediction accuracy of 81.44%, sensor NN_R[Y]C is reliable and suitable for various technological applications.

Subject terms: Environmental chemistry, Scientific data, Statistics

Introduction

Background and importance of air ozone prediction

Air pollution, including compounds such as ozone, has become a global concern due to its detrimental effects on human health and the environment^1,2. Ozone is a reactive gas formed through complex photochemical reactions involving precursor pollutants such as nitrogen oxides (NOₓ) and volatile organic compounds (VOCs)^3–5. Elevated ozone levels in the atmosphere can contribute to respiratory issues, cardiovascular diseases, and lung inflammation in humans. They can also harm plants, reduce crop yields, and disrupt ecosystems. Accurately predicting ozone concentrations in the air is crucial for effective air quality management and the development of appropriate mitigation strategies. By forecasting ozone levels, policymakers, environmental agencies, and health professionals can take timely measures to reduce exposure and mitigate the potential health and ecological risks associated with high ozone concentrations. This can include implementing emission controls, adjusting industrial activities, and raising awareness among vulnerable populations.

Soft sensor modeling for air ozone prediction and its significance

Soft sensor modeling, also known as virtual sensing or data-driven modeling, enables the estimation of specific physical or chemical parameters using available data and mathematical models^6–8. In the context of air ozone prediction, soft sensor modeling involves constructing models using relevant environmental variables such as meteorological data, pollutant concentrations and historical ozone measurements to predict ozone levels in real-time or for future periods. This approach allows for the development of virtual sensors that provide continuous estimates of ozone concentrations, even in cases where physical sensors are not present or practical to deploy^9,10. The significance of soft sensor modeling lies in its ability to overcome limitations associated with physical sensors, such as cost, maintenance, and limited coverage. Soft sensors offer a cost-effective and flexible alternative for ozone prediction, enabling widespread monitoring and forecasting of ozone concentrations. Furthermore, soft sensor models can be continuously updated and optimized using new data, providing accurate and up-to-date information for decision-makers in air quality management and public health.

Objectives of the study

The main objectives of this study are to compare and evaluate the performance of different soft sensor modeling techniques for air ozone prediction. Specifically, we compare the effectiveness of linear regression, neural networks and random forest regression in predicting ozone concentrations. These techniques were chosen due to their widespread usage and demonstrated capabilities in modeling complex relationships in environmental systems. Through this comparative analysis, we aim to identify the most suitable modeling technique for air ozone prediction based on criteria such as predictive accuracy, efficiency and interpretability. Additionally, we seek to explore the strengths and limitations of each modeling approach and provide insights into their practical applications in air quality management and decision-making.

Literature review

Overview of linear regression, neural networks and random forests regression

Air ozone prediction has been an important area of research due to the detrimental effects of ozone pollution on human health and the environment¹¹. In recent years, several studies have been conducted to develop and evaluate different methods for air ozone prediction. Here, we provide an overview of some key research findings and methodologies.

Linear regression

LR (Linear regression) is a popular and widely used modeling technique in statistics and machine learning. It aims to establish a linear relationship between the input variables and the target variable. The model assumes a linear combination of the input features to predict the continuous output variable. The coefficients of these input variables are estimated using various optimization algorithms, such as least squares. LR is simple to implement and interpret, making it a good choice for scenarios with linear relationships between variables. MLR (Multiple linear regression) is a form of LR that is suitable for this case. MLR provides equations linking a number of input variables (x_n) to a target variable (y) using Eq. (1)¹².

y = w_{0} + w_{1} x_{1} + \dots + w_{n} x_{n}

where w₀ is the intercept, w_i is the coefficient for x_i (i = 1, …, n) and n is the number of input variables. Out-of-sample accuracy can be improved by using regularization methods, which add a penalty term on the model coefficients and thereby restrict their freedom during learning.
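As an illustration of Eq. (1), the following minimal sketch estimates the intercept and coefficients by ordinary least squares using NumPy. The function names and the synthetic data are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np

def fit_mlr(X, y):
    """Estimate [w0, w1, ..., wn] of Eq. (1) by ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept w0.
    design = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights

def predict_mlr(weights, X):
    """Evaluate y = w0 + w1*x1 + ... + wn*xn for each row of X."""
    X = np.asarray(X, dtype=float)
    return weights[0] + X @ weights[1:]

# Synthetic data generated from y = 2 + 3*x1 - 1*x2 (illustrative only).
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
w_demo = fit_mlr(X_demo, y_demo)
```

Because the synthetic target is exactly linear in the inputs, the fitted weights recover the generating intercept and coefficients up to floating-point error.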

Nonlinear extension refers to the use of nonlinear feature functions to transform independent variables in linear regression, in order to capture nonlinear relationships in the data.

In LR, we assume that there is a linear relationship between the independent variables and the dependent variable. However, in real-world data, there may exist nonlinear relationships, where the relationship between the independent variables and the dependent variable cannot be accurately described by a simple linear model.

To address this issue, we can use nonlinear extension. This means applying some nonlinear functions to the independent variables to introduce nonlinear features in the model, in order to better fit the nonlinear relationships in the data.

For example, if there is a quadratic relationship between the independent variable x and the dependent variable y, we can square the independent variable x to obtain x² as a new independent variable, and then use both x and x² as input variables to build a linear regression model. This way, the model can capture the quadratic relationship between x and y. MLR with nonlinear extension (MLR-NE) provides equations linking a number of input variables (x_n) to a target variable (y) using Eq. (2).

y = w_{0} + w_{1} x_{1}^{2} + \dots + w_{n} x_{n}^{2}

In addition to using the square function, other nonlinear functions such as logarithmic, exponential and trigonometric functions can also be applied to transform the independent variables. This allows the model to adapt to more complex nonlinear relationships.

It is important to note that nonlinear extension can improve the fitting capability of the model and make it more suitable for nonlinear data. However, the resulting extended model may be more complex, less interpretable and have a risk of overfitting. Therefore, when performing nonlinear extension, a trade-off between the accuracy of model fitting and interpretability needs to be considered.
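A minimal sketch of the squared-feature form in Eq. (2) is given below: the inputs are squared and the resulting features are fitted by ordinary least squares. The function name `fit_mlr_ne` and the synthetic data are hypothetical and serve only to illustrate the transformation.

```python
import numpy as np

def fit_mlr_ne(X, y):
    """Fit Eq. (2): least squares on the squared inputs, returning [w0, w1, ..., wn]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The model stays linear in the weights; only the features are nonlinear.
    design = np.column_stack([np.ones(len(X)), X ** 2])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights

# Synthetic quadratic data: y = 1 + 0.5 * x^2 (illustrative only).
x_demo = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y_demo = 1.0 + 0.5 * x_demo[:, 0] ** 2
w_ne = fit_mlr_ne(x_demo, y_demo)
```

Other transformations mentioned above, such as logarithmic or trigonometric functions, could be substituted for the squaring step in the same way.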

Data-driven models, such as regression-based approaches, have been widely used for air ozone prediction. Linear regression (LR) is a statistical modeling technique used to establish a linear relationship between a dependent variable and one or more independent variables. In air ozone prediction, LR models can be employed to identify correlations between ozone levels and relevant factors, such as temperature, humidity, wind speed and pollutant concentrations. Researchers have utilized various variables, including meteorological parameters, pollutant concentrations and emission data, to develop accurate prediction models. For example, Wei Zhao employed multiple linear regression to predict ozone levels based on boundary layer height, humidity, wind direction, surface solar radiation, total cloud cover and sea level pressure in Hong Kong¹³.

Neural networks

BPNN (Backpropagation Neural Networks) and RNN (Recurrent Neural Networks) are two commonly used artificial neural networks, respectively suitable for regression tasks and sequential data processing.

BPNN utilizes the backpropagation algorithm to train the network by iteratively adjusting the weights and biases of the neurons to minimize the difference between the predicted and actual output, as shown in Fig. 1. This iterative process helps the model capture complex non-linear relationships between input and output variables, making it suitable for various regression problems.
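The iterative weight adjustment described above can be sketched with a single hidden layer trained by gradient descent on the mean squared error. This is a minimal illustration only: the tanh activation, layer size, learning rate and synthetic data are all assumptions and do not reproduce the networks used in the study.

```python
import numpy as np

def train_bpnn(X, y, n_hidden=4, lr=0.05, epochs=500, seed=0):
    """Train a one-hidden-layer regression network with backpropagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=n_hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        # Forward pass: hidden activations, then a linear output.
        hidden = np.tanh(X @ W1 + b1)
        pred = hidden @ W2 + b2
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error.
        grad_pred = 2.0 * err / len(y)
        grad_W2 = hidden.T @ grad_pred
        grad_b2 = grad_pred.sum()
        grad_hidden = np.outer(grad_pred, W2) * (1.0 - hidden ** 2)
        grad_W1 = X.T @ grad_hidden
        grad_b1 = grad_hidden.sum(axis=0)
        # Gradient descent update of weights and biases.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), losses

# Synthetic nonlinear target y = x^2 on [-1, 1] (illustrative only).
X_demo = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)
y_demo = X_demo[:, 0] ** 2
params, losses = train_bpnn(X_demo, y_demo)
```

The recorded `losses` list allows the training error to be inspected over the epochs.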

RNN is a type of neural network designed to process sequential data, such as time series or text data. Unlike BPNN, RNN has a feedback mechanism that allows information to be carried forward through time loops, as shown in Fig. 2. This recurrent structure enables RNN to capture temporal dependencies and contextual information within the data. In regression tasks, RNN can model the sequence of input variables and predict the corresponding continuous output. It is particularly useful for problems where past inputs have a significant impact on current predictions.
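The feedback mechanism can be illustrated with the forward pass of a simple (Elman-style) recurrent cell, in which the hidden state from one time step is fed into the next. The cell type, dimensions and random weights below are assumptions for illustration; the study's RNN architecture is not specified in this section.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h, w_hy, b_y):
    """Run a simple recurrent cell over a sequence and return one regression output."""
    hidden = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        # The previous hidden state carries information forward in time.
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
    # Linear read-out from the final hidden state.
    return float(w_hy @ hidden + b_y)

# Hypothetical sizes: 3 input variables, 4 hidden units, 30 time steps.
rng = np.random.default_rng(0)
n_features, n_hidden, n_steps = 3, 4, 30
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
w_hy = rng.normal(scale=0.1, size=n_hidden)
b_y = 0.0

sequence = rng.normal(size=(n_steps, n_features))
prediction = rnn_forward(sequence, W_xh, W_hh, b_h, w_hy, b_y)
```

In practice the weights would be learned by backpropagation through time rather than drawn at random as they are here.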

Machine learning techniques have gained popularity in air ozone prediction due to their ability to capture complex relationships in data. Neural networks are computational models inspired by the structure and functioning of biological neural networks. These models consist of interconnected nodes (neurons) organized in layers and are trained using optimization algorithms to learn complex patterns in the data. For air ozone prediction, neural networks can capture nonlinear relationships between predictor variables and ozone concentrations. Neural networks, including BPNN and RNN, have been utilized for ozone prediction. RNN possesses feedback connections that allow information to flow between different time steps, making it well suited to time series analysis and prediction, and it has shown particular promise in capturing temporal dependencies and patterns in ozone data^14,15. Wang Dongsheng et al. developed an RNN model to predict hourly ozone concentrations in air quality monitoring stations in the Yangtze River Delta, China¹⁶.

Random forest regression

RFR (random forest regression) is an ensemble learning technique that combines the power of decision trees and randomness. It constructs a multitude of decision trees using random subsets of the training data and randomly selected subsets of the input variables. Each decision tree makes independent predictions and the final prediction is obtained by averaging the predictions of all the trees, as shown in Fig. 3. RFR handles both linear and non-linear relationships, effectively captures complex interactions between input variables and is robust against overfitting. It is particularly suitable for high-dimensional data with categorical and numerical features and performs well even in the presence of outliers and missing values.
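The bootstrap-and-average idea can be sketched with a deliberately simplified ensemble. Each "tree" below is a single-split regression stump fitted on a bootstrap sample using one randomly chosen input variable, and the ensemble prediction is the mean over all stumps. This is a stand-in for illustration, not a full random forest: real implementations grow deeper trees and sample features at every split. All names and data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature):
    """Find the single threshold on one feature that minimizes squared error."""
    best = None
    values = sorted(set(row[feature] for row in rows))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # The feature is constant in this sample: predict the overall mean.
        overall = mean(targets)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, targets, n_trees=100, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random input variable
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all stumps."""
    return mean(left if row[f] <= thr else right for f, thr, left, right in forest)

# Synthetic data: the target depends only on the first feature (illustrative only).
rows_demo = [[float(i), float(i % 2)] for i in range(1, 9)]
targets_demo = [10.0 if i <= 4 else 50.0 for i in range(1, 9)]
forest_demo = fit_forest(rows_demo, targets_demo, n_trees=100)
low_pred = predict_forest(forest_demo, [1.0, 1.0])
high_pred = predict_forest(forest_demo, [8.0, 1.0])
```

Averaging over many randomized learners smooths out the variance of any individual one, which is the mechanism behind the robustness described above.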

Ensemble models, such as RFR (random forest regression) and gradient boosting, have also been applied for air ozone prediction^17,18. RFR is an ensemble learning method that combines multiple decision trees to make predictions. Each decision tree is built using a random subset of features and the final prediction is determined by aggregating the predictions from individual trees. RFR is known for its robustness, ability to handle high-dimensional data and resistance to overfitting¹⁹. For instance, Massimo Stafoggia et al.²¹ used RFR to predict daily ozone concentrations in Sweden, considering various meteorological variables such as air temperature, cloud coverage, barometric pressure and snow albedo²⁰.

Applications of methods in environmental prediction

LR, NN and RFR have been widely employed in various environmental prediction tasks beyond air ozone prediction.

Water quality prediction

These methods have found applications in areas such as water quality prediction. LR, NN and RFR have been used to predict water quality parameters, including dissolved oxygen levels, pH and nutrient concentrations^21–23.

Air pollutant concentration modeling

NN and RFR have been applied to forecast concentrations of air pollutants, such as particulate matter (PM) and nitrogen dioxide (NO₂)^24,25.

Environmental impact assessment

LR and NN have been applied for environmental impact assessment, such as global warming, human health, metal depletion, freshwater ecotoxicity, particulate matter formation and terrestrial acidification^26–28.

These examples highlight the versatility and effectiveness of these modeling techniques in addressing a range of environmental prediction tasks.

Performance in ozone prediction of prediction models

LR, NN and RFR are prediction models based on different principles and algorithms. LR predicts by fitting a linear relationship between input features and output variables. NN utilizes multi-layered neuron networks to establish nonlinear mapping relationships. RFR combines multiple decision tree models through ensemble learning to enhance prediction performance.

To accurately predict ozone concentrations and trends, various prediction methods have been employed. Table 1 compares the performance of commonly used prediction models in ozone prediction.

Table 1.

Methods used in ozone concentrations prediction.

References	Variables/inputs	Targets/outputs	Performance	Model
¹³	Boundary layer height, humidity, wind direction, solar radiation, total cloud cover, sea level pressure, temperature	Surface ozone in Hong Kong	R² 0.62	LR
²⁹	Temperature, NO₂, SO₂, O₃, PM₁₀	Future ozone concentration for next three days in Malaysia	R² 0.296996; RMSE 0.01853	LR
³⁰	Temperature, NO₂, NO, wind velocity, relative humidity	Ozone concentration of Northern Portugal	R² 0.7; RMSE 29.5 μg/m³	LR
³⁰	Temperature, NO₂, NO, wind velocity, relative humidity	Ozone concentration of Northern Portugal	R² 0.78; RMSE 25.64 μg/m³	BPNN
³¹	Meteorological parameters, NO₂	Ozone concentration of Nanjing	R² 0.84; RMSE 22.5	BPNN
³²	Precipitation, barometric pressure, relative humidity, sunshine duration, temperature, wind speed	Ozone concentration of Jinan	R² 0.8429; RMSE 21.9290	BPNN
¹⁵	Temperature, dew point, relative humidity, wind speed	Ozone concentration of Hangzhou	R² 0.91; RMSE 19.87	RNN
⁹	NOₓ, CO, PM₁₀/PM₂.₅, VOCs, wind speed, temperature, humidity, radiation	Hourly ozone concentration in Shanghai	R² 0.96; RMSE 7.71	RNN
¹⁸	Temperature, dew point, relative humidity, wind speed	Ozone concentration of Hangzhou	R² 0.85; RMSE 27.64	RFR
³³	Evaporation, temperature, relative humidity, day of year, sunshine duration	Daily ambient ozone levels across China	R² 0.69; RMSE 26	RFR

Model type	Model name	Description	Differences with the other similar model
Linear regression	R_ML	Multiple Linear Regression model with multiple independent variables, assuming a linear relationship between the dependent variable and the independent variables	Multiple Linear Regression model with nonlinear terms has the advantage of allowing for a more complex relationship between the dependent variable and the independent variables, which can improve the fitting capability of the model, especially for nonlinear data
Linear regression	R_MLNE	Multiple Linear Regression model with nonlinear terms, allowing for a more complex relationship between the dependent variable and the independent variables. Differs from R_ML in the inclusion of nonlinear terms
Neural network	NN_BP[X]	An artificial neural network model trained using the backpropagation algorithm, with the number of neurons in the hidden layer(s) equal to the number of input variables. Differs from NN_BP[2X] in the number of neurons in the hidden layer(s)	The main difference between these two models lies in their complexity and potential learning capability. NN_BP[2X] has a higher number of neurons, which increases the model's capacity to learn more complex relationships between the input and output variables. This can lead to better fitting results and more accurate predictions. A higher number of neurons also increases the risk of overfitting, as the model may become too complex and fit the noise in the training data
Neural network	NN_BP[2X]	An artificial neural network model similar to NN_BP[X] but with twice as many neurons in the hidden layer(s)
Recurrent neural network	NN_R[Y]	A neural network model that can process sequential data, using the first 30 time steps to make predictions. Differs from NN_R[0.5Y] in the number of time steps used for prediction	The main difference between these two models lies in the amount of historical information they consider when making predictions. NN_R[Y] takes into account a longer sequence of past data, which may provide more context and improve the model's ability to capture temporal patterns and trends. Using more time steps also increases the computational complexity of the model and may require more data to train effectively
Recurrent neural network	NN_R[0.5Y]	A neural network model similar to NN_R[Y] but uses the first 15 time steps for prediction
Random forest regression	RFR₁₀₀	An ensemble learning model that combines multiple decision trees for regression prediction, using 100 decision trees. Differs from RFR₂₀₀ in the number of decision trees used	The increased number of decision trees in RFR₂₀₀ generally leads to a more complex model, which can capture more subtle patterns in the data and potentially result in more accurate predictions. However, this comes at the cost of increased computational complexity and a higher risk of overfitting, particularly if the dataset is small
Random forest regression	RFR₂₀₀	An ensemble learning model similar to RFR₁₀₀ but uses 200 decision trees for regression prediction
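The difference between NN_R[Y] and NN_R[0.5Y] is the length of the input history (30 versus 15 time steps). One common way to prepare such inputs is a sliding window over the series, sketched below. The pairing of each window with the immediately following value is an assumption for illustration; the exact windowing used in the study is not described in this section.

```python
def make_windows(series, n_steps):
    """Pair each run of n_steps consecutive observations with the next value."""
    windows, targets = [], []
    for start in range(len(series) - n_steps):
        windows.append(series[start:start + n_steps])
        targets.append(series[start + n_steps])
    return windows, targets

# Hypothetical series of 40 observations (illustrative only).
series_demo = list(range(40))
windows_30, targets_30 = make_windows(series_demo, 30)  # NN_R[Y]-style history
windows_15, targets_15 = make_windows(series_demo, 15)  # NN_R[0.5Y]-style history
```

A longer window leaves fewer training pairs from the same series, which is one reason a longer history may require more data to train effectively.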

Criterion	Description	Practical application
R²	Referred to as the coefficient of determination, it is an indicator of the strength of the relationship between variables	Measures the strength of the relationship between predicted trend and actual trend
RMSE	Root Mean Square Error (RMSE) is another widely used statistical metric to evaluate the performance of a model. It measures the square root of the average of the squared differences between the predicted and actual values. Similar to MSE, a lower RMSE value indicates a higher level of accuracy in prediction	Measures the average accuracy of the predicted trend against the actual trend
MAE	Mean Absolute Error (MAE) is a commonly used statistical metric to assess the performance of a model. It calculates the average of the absolute differences between the predicted and actual values. MAE provides a measure of the average magnitude of the errors, disregarding their direction. Similar to MSE and RMSE, a lower MAE value indicates a higher level of accuracy in prediction	Measures the average accuracy of the predicted values compared to the actual values. Instead of focusing solely on the differences between predicted and actual values, MAE calculates the average magnitude of these differences. It provides a meaningful measure of the average prediction error, regardless of the direction of the errors
Variable utilization	Variable utilization refers to the number of input variables used by each soft sensor, ranging from 1 to 9	Represents the amount of data needed, which indirectly reflects the amount of pre-foundation work
Accuracy	When the measured ozone concentration is above the local standard value, a prediction counts as successful if the predicted value is also greater than the standard value. Accuracy (%) is equal to the number of successful predictions divided by the number of occurrences where the measured values exceeded the threshold	Indicates the accuracy at the threshold (standard value). This criterion reflects the particular concern with the ability to predict ozone concentrations that exceed the standard. As an example, when the measured value of air ozone concentration is 210 μg/m³, which exceeds the local ambient air quality standard (ozone, 160 μg/m³), a predicted value greater than 160 μg/m³ means that the exceedance has been predicted successfully; otherwise, the prediction has failed. The accuracy is calculated by dividing the number of successful predictions in the test set by the number of days in the test set on which the measured value exceeded the threshold
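The criteria above can be computed as in the following minimal sketch. R² is taken here as 1 minus the ratio of residual to total sum of squares, which is the usual definition of the coefficient of determination but is an assumption about the exact formula used in the study. The 160 μg/m³ threshold comes from the example above; the function names and sample values are hypothetical.

```python
from math import sqrt

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def exceedance_accuracy(actual, predicted, threshold=160.0):
    """Share (%) of measured exceedances whose prediction also exceeds the threshold."""
    exceed = [(a, p) for a, p in zip(actual, predicted) if a > threshold]
    if not exceed:
        return None  # undefined when no measured value exceeds the threshold
    hits = sum(1 for _, p in exceed if p > threshold)
    return 100.0 * hits / len(exceed)

# Hypothetical measured and predicted ozone concentrations in μg/m³.
actual_demo = [210.0, 150.0, 170.0, 180.0]
predicted_demo = [200.0, 155.0, 150.0, 165.0]
acc_demo = exceedance_accuracy(actual_demo, predicted_demo)
```

In this example three measured values exceed 160 μg/m³ and two of the corresponding predictions do as well, so the exceedance accuracy is two out of three.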

Criterion	Ranking attributes	Functions	Weight
Accuracy	The higher the Accuracy value, the better the performance and the higher the ranking value	The higher the Accuracy value, indicating a better prediction performance in terms of correctly identifying instances where the ozone concentration exceeds the local standard value. This criterion emphasizes the importance of accurately predicting ozone concentration exceedances, providing a measure of the model's ability to capture such events	10
R²	The higher the R² value, the better the performance and the higher the ranking value	The higher the R² value, indicating a stronger relationship between variables and a better fit of the model to the data. This practical application provides insight into the model's ability to capture variations in the data	4
RMSE	The lower the RMSE value, the better the performance and the higher the ranking value	This criterion evaluates the model's performance based on the magnitude of the errors, regardless of their direction, and penalizes large errors more heavily	3
MAE	The lower the MAE value, the better the performance and the higher the ranking value	MAE measures the average prediction error based on the magnitude of the errors, regardless of their direction. This criterion evaluates the model's ability to minimize the overall prediction error, offering insight into its predictive performance	2
Variable utilization	The lower the variable utilization value, the better the performance and the higher the ranking value	The variable utilization indirectly reflects the amount of pre-foundation work needed, such as data acquisition, feature engineering, and data cleaning. This attribute offers valuable information about the potential complexity and resources needed for the implementation of each soft sensor	1

Sensor	R²	RMSE	MAE	Variable utilization	Accuracy (%)
R_MLA	0.7271	29.94	24.4891	9	59.820
R_MLB	0.7262	29.99	24.1181	8	60.680
R_MLC	0.7296	29.81	23.7566	7	60.680
R_MLD	0.7297	29.8	23.7594	6	60.680
R_MLE	0.7152	30.59	25	5	62.390
R_MLF	0.7119	30.77	24.616	4	60.680
R_MLG	0.7211	30.27	24.4848	3	55.550
R_MLH	0.6851	32.17	26.3866	2	52.990
R_MLI	0.6928	31.77	26.2531	1	49.570
R_MLNEA	0.7642	27.83	22.7889	9	57.260
R_MLNEB	0.7318	29.69	23.802	8	60.680
R_MLNEC	0.7334	29.6	23.7532	7	61.530
R_MLNED	0.7337	29.58	23.7448	6	58.970
R_MLNEE	0.7254	30.04	24.3147	5	59.830
R_MLNEF	0.7278	29.91	24.2875	4	60.680
R_MLNEG	0.7247	30.07	24.0526	3	63.250
R_MLNEH	0.7156	30.57	24.7728	2	63.250
R_MLNEI	0.7228	30.18	24.5716	1	47.000

Sensor	R² ranking value	RMSE ranking value	MAE ranking value	Variable utilization ranking value	Accuracy ranking value	Weighted ranking value
R_MLA	7	7	5	1	4	100
R_MLB	6	6	7	2	5	108
R_MLC	8	8	9	3	5	127
R_MLD	9	9	8	4	5	133
R_MLE	4	4	4	5	9	131
R_MLF	3	3	3	6	5	83
R_MLG	5	5	6	7	3	84
R_MLH	1	1	1	8	2	37
R_MLI	2	2	2	9	1	37
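The weighted ranking values are consistent with a weighted sum of the per-criterion ranking values, using the weights 10 (Accuracy), 4 (R²), 3 (RMSE), 2 (MAE) and 1 (Variable utilization). The sketch below reproduces two rows of the R_ML ranking table under that reading; the dictionary keys and function name are hypothetical.

```python
# Criterion weights as listed in the weighting table.
WEIGHTS = {"accuracy": 10, "r2": 4, "rmse": 3, "mae": 2, "variable_utilization": 1}

def weighted_ranking(ranks):
    """Sum of ranking values multiplied by their criterion weights."""
    return sum(WEIGHTS[criterion] * ranks[criterion] for criterion in WEIGHTS)

# Ranking values copied from the R_MLA and R_MLD rows.
r_mla = {"r2": 7, "rmse": 7, "mae": 5, "variable_utilization": 1, "accuracy": 4}
r_mld = {"r2": 9, "rmse": 9, "mae": 8, "variable_utilization": 4, "accuracy": 5}

score_a = weighted_ranking(r_mla)  # 4*7 + 3*7 + 2*5 + 1*1 + 10*4
score_d = weighted_ranking(r_mld)  # 4*9 + 3*9 + 2*8 + 1*4 + 10*5
```

Because Accuracy carries a weight of 10, a sensor that identifies exceedances well can outrank one with slightly better R², RMSE and MAE.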

Sensor	R²	RMSE	MAE	Variable utilization	Accuracy (%)
NN_BP[X]A	0.87274	25.7923	20.189	9	76.289
NN_BP[X]B	0.8758	27.0951	20.8078	8	79.381
NN_BP[X]C	0.8728	26.9391	20.88	7	83.505
NN_BP[X]D	0.86533	24.7628	19.0255	6	79.381
NN_BP[X]E	0.8722	24.1676	18.8387	5	81.443
NN_BP[X]F	0.86969	25.774	19.9231	4	76.289
NN_BP[X]G	0.81766	25.9039	20.5584	3	71.134
NN_BP[X]H	0.76605	30.08	24.2408	2	70.103
NN_BP[X]I	0.7392	31.4552	25.1164	1	63.918
NN_BP[2X]A	0.8799	25.2692	19.8742	9	76.289
NN_BP[2X]B	0.88129	26.3355	20.7613	8	78.351
NN_BP[2X]C	0.84936	26.3093	20.8789	7	70.103
NN_BP[2X]D	0.87687	28.0203	19.916	6	79.381
NN_BP[2X]E	0.86211	25.6323	19.9623	5	84.536
NN_BP[2X]F	0.85464	28.1521	21.6367	4	77.32
NN_BP[2X]G	0.81043	25.4879	20.2439	3	70.103
NN_BP[2X]H	0.77477	31.7843	25.8355	2	70.103
NN_BP[2X]I	0.73017	34.3527	27.6884	1	59.794

Sensor	R²	RMSE	MAE	Variable utilization	Accuracy (%)
NN_R[Y]A	0.90	25.8317	20.0169	9	78.351
NN_R[Y]B	0.9001	24.2706	18.709	8	78.351
NN_R[Y]C	0.8902	24.9123	19.1583	7	81.443
NN_R[Y]D	0.8962	25.7434	20.013	6	80.412
NN_R[Y]E	0.8838	26.0779	19.6821	5	83.505
NN_R[Y]F	0.8530	27.5419	21.2768	4	80.412
NN_R[Y]G	0.8498	26.6071	21.2003	3	71.134
NN_R[Y]H	0.7928	30.5566	24.5275	2	70.103
NN_R[Y]I	0.7370	33.3163	26.7145	1	59.794
NN_R[0.5Y]A	0.867	26.6056	20.8383	9	78.351
NN_R[0.5Y]B	0.8888	25.1199	19.8288	8	80.412
NN_R[0.5Y]C	0.8808	26.4318	20.2491	7	80.412
NN_R[0.5Y]D	0.8566	25.3612	19.6895	6	78.351
NN_R[0.5Y]E	0.8802	27.0095	21.2078	5	82.474
NN_R[0.5Y]F	0.8677	26.2297	19.9844	4	79.381
NN_R[0.5Y]G	0.8436	26.1772	20.7957	3	70.103
NN_R[0.5Y]H	0.8268	30.5001	24.5668	2	68.041
NN_R[0.5Y]I	0.7323	32.0173	25.6736	1	65.979

Sensor	R²	RMSE	MAE	Variable utilization	Accuracy (%)
RFR_100A	0.8215	24.9195	19.6193	9	73.196
RFR_100B	0.8215	24.3742	19.1659	8	75.258
RFR_100C	0.8176	24.5416	19.2376	7	76.289
RFR_100D	0.8242	24.4918	19.2901	6	76.289
RFR_100E	0.808	25.0466	19.7699	5	78.351
RFR_100F	0.8023	25.2306	19.8128	4	81.443
RFR_100G	0.7584	28.7701	23.2987	3	69.072
RFR_100H	0.6768	32.6854	26.3009	2	67.010
RFR_100I	0.7003	33.7101	27.3665	1	63.918
RFR_200A	0.8254	24.889	19.5336	9	75.258
RFR_200B	0.8273	22.9189	19.6088	8	74.227
RFR_200C	0.823	24.3381	19.1967	7	78.351
RFR_200D	0.821	24.4174	19.3058	6	78.351
RFR_200E	0.8126	25.1275	19.7744	5	77.320
RFR_200F	0.8508	25.3318	19.9433	4	79.381
RFR_200G	0.7595	29.0843	23.5596	3	69.072
RFR_200H	0.6784	32.9689	26.4953	2	64.948
RFR_200I	0.6995	33.6597	27.3013	1	63.918

Variable set	R² ranking value (average)	RMSE ranking value (average)	MAE ranking value (average)	Variable utilization ranking value	Accuracy ranking value (average)	Weighted ranking value
A	9	8	6	1	4	113
B	8	9	8	2	5	127
C	6	6	7	3	7	129
D	7	7	9	4	6	131
E	5	5	5	5	9	140
F	4	4	4	6	8	122
G	3	3	3	7	3	64
H	2	2	2	8	2	46
I	1	1	1	9	1	28

Variable set	R²	RMSE	MAE	Variable utilization	Accuracy (%)
A	0.8322	26.38	20.92	9	71.85
B	0.8316	26.22	20.85	8	73.42
C	0.8246	26.61	20.89	7	74.04
D	0.8255	26.52	20.59	6	73.98
E	0.8199	26.71	21.01	5	76.23
F	0.8172	27.37	21.44	4	74.45
G	0.7856	27.80	22.27	3	67.43
H	0.7395	31.41	25.39	2	65.82
I	0.7193	32.56	26.34	1	59.24
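The ranking values for the variable sets can be derived from these averages by counting, for each set, how many other sets perform strictly worse on a criterion. The sketch below applies this rule to the Accuracy and RMSE columns. The handling of ties (tied entries share the lower ranking value) is inferred from the tied accuracies in the R_ML ranking table and should be read as an assumption.

```python
def ranking_values(values, higher_is_better=True):
    """Ranking value = 1 + number of entries that are strictly worse."""
    ranks = {}
    for name, value in values.items():
        if higher_is_better:
            worse = sum(1 for other in values.values() if other < value)
        else:
            worse = sum(1 for other in values.values() if other > value)
        ranks[name] = 1 + worse
    return ranks

# Average accuracy (%) and RMSE per variable set, copied from the table above.
accuracy = {"A": 71.85, "B": 73.42, "C": 74.04, "D": 73.98, "E": 76.23,
            "F": 74.45, "G": 67.43, "H": 65.82, "I": 59.24}
rmse_avg = {"A": 26.38, "B": 26.22, "C": 26.61, "D": 26.52, "E": 26.71,
            "F": 27.37, "G": 27.80, "H": 31.41, "I": 32.56}

accuracy_ranks = ranking_values(accuracy, higher_is_better=True)
rmse_ranks = ranking_values(rmse_avg, higher_is_better=False)
```

Under this rule variable set E receives the highest Accuracy ranking value and set B the highest RMSE ranking value, matching the variable set ranking table.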

Models	R²	RMSE	MAE	Variable utilization	Accuracy (%)
R_ML	0.7154	30.57	24.71	5	58.12
R_MLNE	0.7310	29.72	24.01	5	59.16
NN_BP[X]	0.8391	26.89	21.06	5	75.72
NN_BP[2X]	0.8355	27.93	21.87	5	74.00
NN_R[Y]	0.8559	27.21	21.26	5	75.95
NN_R[0.5Y]	0.8493	27.27	21.43	5	75.94
RFR₁₀₀	0.7812	27.09	21.54	5	73.43
RFR₂₀₀	0.7886	26.98	21.64	5	73.43
