MethodsX. 2023 Mar 24;10:102153. doi: 10.1016/j.mex.2023.102153

Water quality predictions through linear regression - A brute force algorithm approach

ACP Fernandes a, AR Fonseca b, FAL Pacheco c, LF Sanches Fernandes b
PMCID: PMC10106967  PMID: 37077896

Abstract

Linear regression is one of the oldest statistical modeling approaches. Still, it is a valuable tool, particularly when forecast models must be built from small sample sizes. When researchers use this method with numerous potential regressors, choosing a group of regressors that fulfills all the required assumptions can be challenging. In this sense, the authors developed an open-source Python script that automatically tests all combinations of regressors under a brute-force approach. The output displays the best linear regression models according to the thresholds set by the user for the required assumptions: statistical significance of the estimates, multicollinearity, error normality, and homoscedasticity. Further, the script allows the selection of linear regressions whose regression coefficients match the user's expectations. The script was tested with an environmental dataset to predict surface water quality parameters based on landscape metrics and contaminant loads. Among millions of possible combinations, less than 0.1% of the regressor combinations fulfilled the requirements. The resulting combinations were also tested in geographically weighted regression, with results similar to those of linear regression. Model performance was higher for pH and total nitrate and lower for total alkalinity and electrical conductivity.

  • A Python script was developed to find the best linear regressions within a dataset.

  • Output regressions are automatically selected based on regression coefficient expectations set by the user and the linear regression assumptions.

  • The algorithm was successfully validated through an environmental dataset.

Keywords: Python script, Water quality, Landscape metrics, Geographic information systems, Contaminant emissions

Method name: Automatic selection of robust linear regression models

Graphical abstract

Image, graphical abstract


Specifications table

Subject area: Environmental Science
More specific subject area: Water Quality
Name of your method: Automatic selection of robust linear regression models
Name and reference of original method: Stanton, J.M., 2001. Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors. J. Stat. Educ. 9. 10.1080/10691898.2001.11910537
Gang Su, X., 2009. Linear regression analysis: Theory and computing. World Scientific Publishing Co. Pte. Ltd. 10.1142/6986
Resource availability: Software:
-Python (https://www.Python.org/downloads/ or https://www.anaconda.com).
Software for dataset preparation:
-Microsoft Excel (https://www.microsoft.com/pt-pt/microsoft-365/excel);
-ArcGIS desktop (https://www.esri.com/);
-ZonalMetrics-Toolbox (https://www.arcgis.com/home/item.html?id=96c3ffc7439f4972a08f4edbc51d89be).
Data:
-Water quality (https://snirh.apambiente.pt/);
-Land use map (https://dados.gov.pt/pt/datasets/carta-de-uso-e-ocupacao-do-solo-2015/);
-Pollution sources (https://sniamb.apambiente.pt/content/geo-visualizador);

Method details

Introduction

To formally address water quality concerns, it is necessary to resort to models that expose the problem or test possible solutions [1,2]. Typically, models are classified as either physical or statistical [3]. In physical models, the interactions are determined by biological, physical, and chemical processes, so they are widely used for prediction purposes [4]. There is a plethora of possible water quality models, such as the Water Quality Analysis Simulation Program (WASP) [5] and the Hydrological Simulation Program-FORTRAN (HSPF) [6], but the most used in research is the Soil and Water Assessment Tool (SWAT) [7]. Statistical models have the advantage that they can be used to identify cause-effect interactions revealed by the dataset [8], but they can also be used for prediction purposes [9]. Different statistical methods can be used for prediction, ranging from regression to classification analysis [10], or even more complex methodologies based on machine learning [11]. Linear regression is widely used in different science fields [12], [13], [14], [15], with the advantage of being easily implemented and interpreted, and also because the concerns of meeting the required assumptions have been studied for a long time [16,17]. To create a linear regression model, the first step is to prepare a dataset with candidate variables. Then, different combinations of independent variables are tested in order to predict the dependent variable. In addition, statistical tests must be performed to check whether the linear regression assumptions are met for each tested combination of candidate variables. This process can be done under a trial-and-error approach, but that is not an efficient option, particularly when there are many candidate variables. Multiple regression tools ease this process by calculating the resulting linear regression for each possible combination of regressors.
Some tools can be used for that purpose in software such as ArcMap [18], Microsoft Excel toolboxes [19], or JMP [20]. Though these tools are helpful, they do not provide enough statistical tests to confirm whether the linear regression assumptions are met [21]. Stepwise regression is commonly used when datasets have many variables [22]: variables are added and removed based on their statistical significance through an iterative process that does not test all possible combinations [22]. Although this process reaches models whose regression coefficients are statistically significant and that have high determination coefficients [23], it is still necessary to confirm whether the linear regression assumptions are met. Stepwise algorithms are reasonably fast but do not guarantee that the output is the best model [24]. To find the best model within a dataset, all possible combinations must be tested, which can be done through brute force. Although the feasibility of brute force depends on computing speed and dataset size [25], it guarantees that the best model within the dataset is found, since it tests all possible combinations [26].

With the advent of geographic information systems (GIS), a spatial linear regression method was developed, called Geographically Weighted Regression (GWR) [27]. Instead of being fixed, the regression coefficients vary across space, which can achieve better predictions for spatial data than ordinary least-squares (OLS) linear regression [28]. To correctly build a GWR model, it is recommended first to test the OLS linear regression and confirm that its assumptions are met [29]. Since GWR is a spatial analysis, it is also necessary to test that the residual distribution is not spatially autocorrelated [30]. Then, the predictors used in the OLS model can be used to build a GWR [31,32].

The main purpose of the present study is to develop a Python script that tests all possible combinations of variables to predict a selected variable, while checking the linear regression assumptions, through an automatic algorithm built on open-source software. The second objective is to apply the script to an environmental dataset and predict surface water quality in all sub-catchments. The study was conducted in a Portuguese river basin (Fig. 1). The Ave River Basin (ARB), located in the northwest region of Portugal, has poor water quality, which has been a recurring concern for the past decades [33], [34], [35]. The authors' intention for the method application is to build robust linear regression models to predict water quality parameters in each catchment. Among different surface water parameters (SWP), one model was chosen for each parameter. Then, the chosen regressor combination was applied in GWR, to test whether it achieves better predictions than linear regression [36,37].

Fig. 1.


Study area - (A) Portugal, (B) portrayed Ave River Basin, and (C) Ave River Basin land uses.

Methodology

The authors developed a script, written in the open-source language Python, to find ordinary least squares linear models within a dataset that fulfill the linear regression assumptions. The script tests whether these assumptions are met, and it has a particular feature: the selection of models based on the expected signs of the regression coefficients. Models whose regression coefficients do not meet the operator's expectations are excluded from the output. For environmental purposes, this particularity is crucial for model selection. However, the script can be applied in any science field.

The user defines statistical thresholds, such as the maximum variance inflation factor and the probability of rejecting the null hypothesis for multiple tests. According to these input parameters, all the models that meet the assumptions are reported. Still, it is up to the user to choose one model among all those reported in the output.

This algorithm was applied to a water quality case study to find a group of regressors that can predict surface water parameters in Ave River Basin (ARB) sub-catchments and then apply a Geographically Weighted Regression (GWR) using the same predictors. The methodological workflow (Fig. 2) is subdivided into three steps: input preparation, operation of the Python script, and the application of the selected linear regression for the case study.

Fig. 2.


Methodological workflow. GWR: Geographically Weighted Regression.

Dataset preparation

To run the Python script, three input tables must be created in comma-separated values (CSV) format. The first table contains the dataset used to create the linear regression models, including the predicted variable and the possible predictors. The second table contains the samples for which each output model computes predictions; its columns are the regression variables and its rows are the samples to predict. The final table contains the expected effect of each candidate variable. With this table, the algorithm only selects linear regressions whose regressors have signs according to the established expectations.
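As an illustration, the three tables could be assembled and written with pandas as follows (the column values and file names here are hypothetical stand-ins; the case-study files in the supplementary material define their own headers and indices):

```python
import pandas as pd

# Table 1: dataset used to fit the models (predicted variable + candidate predictors)
dataset = pd.DataFrame(
    {"Sampling_site": ["S1", "S2", "S3"],
     "pH": [6.8, 7.1, 7.4],          # predicted variable
     "pz_AGR": [35.2, 18.7, 52.1],   # candidate predictor
     "pz_FOR": [40.1, 60.3, 22.8]}   # candidate predictor
).set_index("Sampling_site")

# Table 2: samples to predict (columns are the regression variables, rows are samples)
data_for_predictions = dataset.drop(columns="pH")

# Table 3: expected effect of each candidate variable (1, -1, or 0)
signs_table = pd.DataFrame(
    {"variable": ["pz_AGR", "pz_FOR"], "effect": [1, -1]}
).set_index("variable")

dataset.to_csv("DATASET.csv")
data_for_predictions.to_csv("Data_for_predictions.csv")
signs_table.to_csv("signs_table.csv")
```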

An environmental dataset was used to test the algorithm by predicting surface water quality parameters (SWP) based on pollution indicators: landscape metrics, point source pressures, diffuse emissions, and population density. Preparing the dataset required multiple databases. Measurements of surface water parameters for the selected river basin, the ARB, were downloaded from the National Hydrological Information System (SNIRH, in the Portuguese abbreviation) [38], including the average concentration of each SWP and the location of the sampling sites. The digital elevation model was downloaded from the European Environment Agency [39] and was used to delineate the watershed, drainage lines, the drainage area of each sampling site, and the sub-catchments (presented in Fig. 1). These procedures were performed in ArcMap [40] using ArcHydro functionalities [41]. The land use map of 2015 was downloaded from the Portuguese Territory Planning website [42] to calculate landscape metrics (LSM) for each drainage area and sub-catchment, using a Python toolbox developed by other authors [43]. In total, nine metrics were calculated; they are described in Table 1. These metrics were calculated for the generic land use classes agriculture (AGR), artificial surfaces (ART), forested areas (FOR), and water bodies (WAB), and two metrics, Shannon's diversity index (shdi) and edge length of land use (ed), were calculated for all land use classes simultaneously (ALL). The population density data was downloaded for each census subsection from the National Statistics Institute website [44]. The Portuguese Environmental Agency (APA) provided the point source and diffuse pressure datasets, which can be accessed in the National Environmental Information System [45]. For diffuse emissions, the dataset contains the total loads of nitrogen and phosphorus (N and P) from livestock, agriculture, and forest, in kg/year/km2. For point source pressures, the dataset contains the total loads of biological and chemical oxygen demand (BOD and COD), nitrogen, and phosphorus (N and P) from all effluent discharge sites, in kg/year.

Table 1.

Calculated landscape metrics. AGR- agriculture, ART- artificial surfaces, FOR- forested areas, and WAB- Water bodies.

[Table 1 is provided as an image (fx1.gif) in the original publication.]

A more detailed description of all regressors is available in a CSV file in the supplementary material, “Variables_with_signs.csv”. In total, 73 possible predictors were collected, and four transformations were applied (x2, x0.5, log10(x+1), and 1/(1+x)), so the number of “variables” increased to 365. Since this is a high number of predictors to test in a brute-force algorithm, only variables whose Pearson correlation with the predicted variable had the expected sign were selected. To further reduce the number of input variables, for each variable the transformation that resulted in the highest Pearson correlation coefficient with the dependent variable was chosen manually. Further, for point source and diffuse emissions, one variable was chosen for each, since these are strongly correlated, even after transformation. For each SWP, a CSV file was created containing the selected predictors and the selected SWP (the predicted variable). These CSV files are presented in the supplementary material, named according to the specific SWP: “Electrical_conductivity”, “pH”, “Total_alkalinity”, and “Total_nitrate”.
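The selection of the best transformation per variable was done manually in the paper; a small sketch of the underlying idea, using synthetic data and pandas' Pearson correlation, might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(rng.uniform(6, 9, 30), name="pH")        # synthetic predicted variable
x = pd.Series(rng.uniform(0, 100, 30), name="pz_AGR")  # synthetic predictor

# The original variable plus the four transformations used in the paper
candidates = {
    "x": x,
    "x^2": x ** 2,
    "x^0.5": x ** 0.5,
    "log10(x+1)": np.log10(x + 1),
    "1/(1+x)": 1 / (1 + x),
}

# Pearson correlation of each candidate with the dependent variable;
# keep the one with the highest absolute correlation
corrs = {name: y.corr(c) for name, c in candidates.items()}
best = max(corrs, key=lambda name: abs(corrs[name]))
```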

As previously mentioned, the algorithm only exports models in which the regression coefficient signs agree with the user's expectations. For the case study application, the file “Variables_with_signs.csv”, presented in the supplementary material, contains this information. In the column “effect”, one of three values must be set for each variable: 1, -1, or 0. For variables expected to have a positive regression coefficient, the “effect” should be 1. For variables expected to have a negative regression coefficient, the “effect” should be -1. If the user expects no particular effect for a variable, the attributed “effect” should be 0. In this sense, the algorithm exports models in which the regression coefficient of each variable has the expected sign.
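A minimal sketch of this sign filter (the coefficient values and variable names below are hypothetical):

```python
def sign_violations(coefficients, effects):
    """Count regressors whose fitted coefficient contradicts the expected
    effect; an effect of 0 means 'no expectation' and never counts."""
    return sum(
        1 for var, coef in coefficients.items()
        if effects.get(var, 0) != 0 and effects[var] * coef < 0
    )

# Hypothetical fitted coefficients and expectations from the signs table
coefficients = {"pz_AGR": 0.42, "pz_FOR": -0.13, "lpi_ART": 0.07}
effects = {"pz_AGR": 1, "pz_FOR": -1, "lpi_ART": 0}

# A model is exported to the main summary only when no sign is violated
ok = sign_violations(coefficients, effects) == 0
```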

The expected signs were assigned according to the expected effect on water quality (WQ); for instance, point source pressures do not increase water quality. For variables that theoretically increase contamination, a positive sign was expected for the respective regression coefficient. For metrics related to forest, for example pz_(FOR), the percentage of forested areas, a negative sign was attributed, since forests increase WQ and thus decrease the SWP. For some variables there is no expectation, so no expected effect was attributed. For example, for lpi_(ART), it is not known whether having most of the urban areas aggregated or dispersed improves or degrades water quality, so an expected effect of 0 was attributed.

The table with the entire dataset used to build the linear regression models is presented in “All_data_for_models.csv”, and a CSV file is available for each SWP containing all the variables used to build the respective models. The file named “Predictions_input.csv” contains all variables for each catchment and drainage area, allowing the prediction of the SWP in each catchment. These table files are inputs for the Python script, described in the following sections.

Python script

Linear regression

Linear regression (LR) is a statistical approach to modeling the linear relation between the dependent variable and one or more regressors. In the present approach, the ordinary least squares method was used, in which the linear regression is calculated to minimize the sum of squares of the differences between the measured and predicted values [46]. LR is one of the oldest statistical modeling approaches and has been widely used for around two hundred years [47,48]. The main purpose of the method is to forecast/predict the response variable in cases where it is easier to measure the independent variables than the dependent one [49]. For example, it is easier to estimate a tree's biomass by measuring its height and diameter than by measuring its weight [50].

Eq. (1) shows the linear regression model with m predictors. In this equation, y represents the dependent variable, xj is the independent variable (regressor), β0 is the constant/intercept, βj is the regression coefficient associated with each regressor, and ε is the random error term.

y = β0 + Σj=1..m (βj xj) + ε   (1)

When a linear regression model is created, multiple statistical criteria must be analyzed to assess the model's suitability for prediction [51]. The coefficient of determination (R2) is the proportion of the variation of the endogenous variable explained by the exogenous variables [52]. A high R2 does not directly mean that the regression is suitable for predictions, because the statistical significance of the R2 depends on the degrees of freedom. As more explanatory variables are added to a model, the R2 tends to increase, but the degrees of freedom decrease [53]. In this sense, the adjusted R2 is a coefficient that brings the same information as the R2 but is adjusted for the degrees of freedom [54]. Another way to assess the overall significance of a regression is through the F statistic. If the calculated F value is higher than the critical F for the selected confidence level (normally 95%), the null hypothesis, which states that none of the regressors is useful for explaining the dependent variable, is rejected [55].

When it is necessary to create a regression model, the first question that might emerge is, “which explanatory variables should be used?”. In trying to find a suitable model, it is common to test different combinations of exogenous variables through a trial-and-error method. Among the tested variables, some might not support the prediction of the endogenous variable [56]. This can be tested for each regression coefficient through the t statistic (the coefficient divided by its standard error) [57]. If the p-value is below the significance level, the null hypothesis, which states that the regression coefficient is equal to 0, is rejected [58]. In multiple linear regression, it is recommended to include regressors that are not linearly dependent (correlated) with each other [59]. When one or more regressors are strongly correlated, the variance of the coefficient estimates and predictions becomes sensitive to minor changes in the model, making the regression model unstable [60]. To measure the degree of multicollinearity, the variance inflation factor (VIF) of the regressors is commonly calculated [61]. This factor ranges from 1 to infinity (when there is a perfect correlation between regressors) and should be as low as possible. Still, there are different recommendations for the maximum acceptable VIF: 10 [62], 7.5 [63], or 5 [64].

There are three assumptions for the error terms in regression models: constant variance, normal distribution, and independence [58]. Different statistical tests can be performed to confirm whether a regression model meets these assumptions. For those tests, the null hypothesis states that the assumption is met. Consequently, the lower the p-value, the higher the probability that the assumption does not hold. Constant variance of the error terms, called homoscedasticity, means that the error term variance is equal for low and high predicted values [65]. If this assumption is not met, the regression error terms are heteroscedastic; consequently, observations with larger errors deviate from the fitted model, and predictions become biased [66]. Residuals should follow a normal distribution because the calculation of confidence intervals, and hence of variable significance, is based on normal distributions [67]. In linear regression, it is also expected that successive values of the error are independent over time [68]. If this assumption is not met, the error term is autocorrelated, which affects the standard errors of the estimates [69]. Verifying the presence of autocorrelation is mandatory when samples represent different time steps [70]. Another type of autocorrelation must be verified when samples represent different locations [71]. For regressions in which samples represent points in space, it is expected that the error terms are not spatially clustered but dispersed or random. If the error term is clustered, the data present spatial autocorrelation, biasing estimations in space [72]. Moran's I test is used to evaluate the spatial distribution of the error term [73].

Algorithm workflow

The designed Python script (available in the supplementary material) calculates all the possible linear regressions that fulfill the statistical assumptions for any dataset, according to the workflow presented in Fig. 3. Before using the algorithm, it is necessary to:

  1. Create 3 CSV files with the user's data. This can be tested with the files presented in the supplementary material for the case study application.
     a. CSV file with the dataset to create the linear regression models, including the dependent variable and all the possible regressors - “DATASET”.
     b. CSV file with the list of independent variables and the expected effect - “signs_table”.
     c. CSV file with a dataset to calculate the predictions of each output model - “Data_for_predictions”.
  2. Edit the script.
     a. Set the data paths for the input CSV files.
     b. Select the dependent variable - “Chosen_Y”.
     c. Set the minimum and maximum number of regressors for the output regression models - respectively, “nor_min” and “nor_max”.
     d. Set whether the intercept should be equal to 0 or defined by the regression model - “intercept_value”.
     e. Define the statistical thresholds.
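The edits of step 2 amount to a small configuration block at the top of the script; in this sketch the variable names follow the paper's description, while the paths and values are illustrative assumptions:

```python
# Step 2a: data paths (illustrative file names)
DATASET = "DATASET.csv"
signs_table = "signs_table.csv"
Data_for_predictions = "Data_for_predictions.csv"
output_table_path = "output_models.xlsx"

# Step 2b: dependent variable
Chosen_Y = "pH"

# Step 2c: minimum and maximum number of regressors
nor_min, nor_max = 1, 6

# Step 2d: True -> fitted intercept; False -> intercept forced to 0
intercept_value = True

# Step 2e: statistical thresholds
MAXIMUM_VIF = 5                       # maximum variance inflation factor
max_p_r_sq = 0.05                     # max p-value for the regression F-test
max_p_reg_coef = 0.05                 # max p-value for the coefficient t-tests
min_p_for_Heteroskedasticity = 0.05   # min p-value to keep the null (homoscedastic)
min_p_for_error_normality = 0.05      # min p-value to keep the null (normal errors)
```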
Fig. 3.


Algorithm workflow.

The first inputs are the data paths, where the user defines the paths of the three necessary CSV files. The first must contain the path of the dataset used to create the linear regression models, defined in the script as “DATASET”. The data file should be in CSV format, since it is converted into a data frame using the “pandas.read_csv” command. Furthermore, it is necessary to set the index of the data table, which for the present study is “Sampling_site”. The table with the expected signs is defined as “signs_table”. For this table, the index column should be set as “variable”, which is the column containing the list of variables. In the column named “effect”, the expected signs of the regression coefficients should be placed: “0” if there is no expectation, “1” if a positive sign is expected, or “-1” if a negative sign is expected. The “Data_for_predictions” variable in the algorithm is the path of a CSV file containing all the samples to predict; for the case study, the file is “Data_for_predictions.csv”. This CSV file must have the same independent variables as those given in the “DATASET” file. For this data frame, the index column must also be defined, which for the presented case study is “Sampling site”. The output table path is defined in “output_table_path”; this is the path of the output Excel file containing all the data of the output regression models.

After defining the data frames, other input parameters must be specified. The “Chosen_Y” is the predicted variable; all the other variables present in the dataset will be assumed to be regressors. The minimum and maximum numbers of predictors are the variables “nor_min” and “nor_max”, respectively. If the user requires the regressions to have an intercept equal to 0, “intercept_value” must be set to “False”; if the intercept is to be estimated by the regression model, it should be set to “True”. For the statistical thresholds, the maximum allowed variance inflation factor must be defined. The authors suggest selecting a maximum VIF of 5, since this is the strictest value found in the literature [64]. However, the user can choose higher or lower values in “MAXIMUM_VIF”. All the regressions with values higher than the selected one will be excluded from the output. The probability of rejecting the null hypothesis is set in 4 variables. The first is the maximum probability of rejecting the null hypothesis for the R2, set in the variable “max_p_r_sq”. The threshold for the regression coefficients is defined in “max_p_reg_coef”. For heteroscedasticity and error normality, the minimum p-values required to keep the null hypothesis are set in “min_p_for_Heteroskedasticity” and “min_p_for_error_normality”, respectively.

The last variables are the sample ranges for the Goldfeld-Quandt test. Since this test requires the calculation of two auxiliary regressions, the user sets the range of samples to use in each regression; it is advisable that each regression contains around one-third of the samples. In “lower_limit_A_GQ”, the position of the first sample of the first regression is set, and “upper_limit_A_GQ” is the position of the last sample of the first regression. In “lower_limit_B_GQ” and “upper_limit_B_GQ”, the positions of the first and last samples of the second regression are set, respectively. Since in the present study the regressions were calculated with 29 samples, the lower and upper limits of the first regression were set to 1 and 9 (9 samples), those of the second regression to 21 and 29 (also 9 samples), leaving samples 10 to 20 (11 samples) out of this analysis.

After setting all the inputs, the script is ready to run. First, the algorithm calculates all possible combinations of predictors, between the minimum and maximum numbers of predictors, using the function “combinations”, and stores them in a data frame, “combo_df”. This data frame feeds a cycle that selects each combination of predictors to be tested. If a combination fails any test, the algorithm moves on to the next combination of predictors, until no combination is left to be tested.
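The enumeration step relies on Python's itertools; a sketch with illustrative predictor names:

```python
from itertools import combinations

predictors = ["pz_AGR", "pz_FOR", "lpi_ART", "shdi_ALL"]  # illustrative names
nor_min, nor_max = 1, 3

# Every combination with nor_min to nor_max predictors, in the order
# the brute-force cycle would visit them
combos = [combo
          for size in range(nor_min, nor_max + 1)
          for combo in combinations(predictors, size)]
```

With 4 candidates this yields 4 + 6 + 4 = 14 combinations; with the case-study inputs (up to 48 candidates and 1 to 6 regressors) the count grows into the millions, which is why each combination is discarded as soon as it fails a test.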

The first test is for multicollinearity, assessed through the variance inflation factor of each regressor. If the VIF of any predictor is higher than the defined maximum, the algorithm proceeds to the statistical tests for the next set of predictors. Otherwise, it continues with the current combination, calculating the regression. If the p-value of the R2 is below the defined maximum, the algorithm continues and calculates the statistical significance of the predictors. If the maximum p-value among all predictors is lower than the defined threshold, the algorithm tests whether the error has a normal distribution through 5 tests: Jarque-Bera [74], Anderson-Darling [75], Shapiro-Wilk [76], Kolmogorov-Smirnov [77], and Omnibus [78]. Heteroscedasticity is verified through 4 tests: Breusch-Pagan [79], Harvey-Collier [80], Glejser [81], and Goldfeld-Quandt [82]. After all tests, the algorithm counts the predictors that do not have the expected signs and computes the predictions, until all possible combinations of predictors have been tested. At the end of the run, an Excel file is created containing 5 sheets: “Summary_1”, “Summary_2”, “Predicted_values”, “Specific_predictions”, and “Dataset”. The last sheet contains the used dataset. In “Specific_predictions”, the lines are the predictions of each output regression for each sample (one per column) defined in the “Data_for_predictions” table, while “Predicted_values” presents the predictions of each regression model for each sample contained in the dataset. Both sheets allow the comparison of measured vs. predicted values. In the summary sheets, each line is a regression and each column is a statistical regression output. The first 9 columns are the most important, containing the regression data and statistical tests. The first is the number of predictors used in the linear regression, “number_of_predictors”. The second is the maximum variance inflation factor found among the predictors, “1_vif”. The 3rd, 4th, and 5th columns relate to the explained variation: “2_r_sq” is the R2, “2_r_sq_adj” is the adjusted R2, and “2_p_value_of_R_squared” is the probability of accepting the null hypothesis of the regression F-test. The maximum p-value of the t-test among predictors is given in the column “3_p_regression_coef”. The columns “4_MINIMUM_P_VALUE” and “5_MINIMUM_P_VALUE” hold the minimum p-values of the error normality and heteroscedasticity tests, respectively. The column “6_sign_test” contains the number of predictors that do not have the expected sign. The following twelve columns contain the p-values of each error normality and heteroscedasticity test. The 22nd column contains the regression intercept, while the following columns contain the regression coefficients of each possible regressor. Although “Summary_1” and “Summary_2” have an identical structure, “Summary_1” contains only regressions whose signs agree with expectations and in which all heteroscedasticity and error normality tests meet the statistical thresholds, whereas “Summary_2” contains regressions that do not have an expected sign, or whose error normality or heteroscedasticity results meet the statistical requirements for at least one, but not all, of the tests in the group.

Case study application

The purpose of the Python algorithm is to find linear regressions within the chosen dataset that meet the linear regression assumptions according to the specified statistical thresholds. Among the possible hydrological years, the models were created using the hydrological year (HY) of 2014-2015, the one that contains the most sampling sites (29 in total). The algorithm was run for each SWP, and the combination of variables that resulted in the model with the highest adjusted R2 was selected. The chosen linear regressions were applied in the Geographically Weighted Regression (GWR) ArcMap tool. Before executing the GWR, the spatial autocorrelation of the standardized error terms was calculated using the “Global Moran's I” tool. If spatial autocorrelation were present, another combination would be selected; otherwise, the GWR was performed to predict the SWP in 248 catchments. Predictions of the SWP were obtained for all the sampling sites using both regression methods, and the measured values of the HYs between 2012 and 2017 were compared with the predictions to check the models' performance, using the percent bias (PBIAS) [83] and the mean absolute percentage error (MAPE) [84] (ratings are presented in Table 2).
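PBIAS and MAPE can be computed directly from the paired measured/predicted series; a sketch under the common convention where positive PBIAS indicates model underestimation [83] (the sample values are hypothetical):

```python
import numpy as np

def pbias(observed, simulated):
    """Percent bias: 100 * sum(obs - sim) / sum(obs)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 100.0 * np.sum(observed - simulated) / np.sum(observed)

def mape(observed, simulated):
    """Mean absolute percentage error: 100 * mean(|obs - sim| / |obs|)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 100.0 * np.mean(np.abs((observed - simulated) / observed))

measured = [7.0, 7.2, 6.9]    # hypothetical pH measurements
predicted = [6.8, 7.3, 7.0]   # hypothetical model predictions
```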

Table 2.

Rating of water quality prediction models, based on the Percent Bias (PBIAS) and Mean Absolute Percentage Error (MAPE).

[Table 2 is provided as an image (fx2.gif) in the original publication.]

Method validation

Case study outputs

The Python algorithm was applied 4 times, once for each of the selected SWP. The selected inputs were: a number of regressors ranging from 1 to 6, a maximum VIF of 5, a probability of rejecting the null hypothesis for all statistical tests of 0.05, and the first and last 9 samples for the Goldfeld-Quandt test. Table 3 briefly presents the outputs of the Python script. The algorithm took around 1.5 seconds for each set of 100 combinations of regressors, which means that for each SWP it required several hours to compute all the possible combinations for the selected inputs. The output combinations presented in Table 3 are the numbers of combinations that meet the selected statistical thresholds for all tests. In the output Excel file of each SWP, these combinations of variables and the resulting linear regressions are presented in the sheet “Summary_1”. Among all the output combinations, the model with the highest adjusted R2 was selected for the case study. “Summary_2” presents more regressions, since this summary includes all regressions that passed at least one test. Looking at the maximum adjusted R2 values, the results indicate that the models can explain pH and total nitrate in surface waters better than electrical conductivity (EC) and total alkalinity, which reached adjusted R2 values of 0.797 and 0.693, respectively.

Table 3.

Summary of calculated regressions for each SWP.

Endogenous variable | Number of possible regressors | Number of regressors | Combinations of regressors (Total) | Combinations of regressors (Output) | Adjusted R2 range
EC | 44 | 1–6 | 8,295,045 | 413 | 0.126–0.797
pH | 48 | 1–6 | 14,196,868 | 9,871 | 0.147–0.928
Total nitrate | 39 | 1–6 | 3,930,550 | 213 | 0.113–0.917
Total alkalinity | 36 | 1–6 | 2,391,495 | 82 | 0.216–0.693
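The brute-force enumeration behind the combination counts above can be sketched with `itertools.combinations`; here the combination number is simply the position of a regressor subset in the enumeration order (a sketch of the idea, not necessarily the authors' exact numbering scheme):

```python
from itertools import combinations

def enumerate_regressor_sets(candidates, max_regressors):
    """Yield (combination_number, regressor_subset) for all subsets of size 1..max_regressors."""
    number = 0
    for size in range(1, max_regressors + 1):
        for subset in combinations(candidates, size):
            number += 1
            yield number, subset

# With 3 candidate regressors and subsets of up to 2 variables,
# the enumeration yields 3 + 3 = 6 combinations:
sets = list(enumerate_regressor_sets(["a", "b", "c"], 2))
print(len(sets))  # 6
print(sets[-1])   # (6, ('b', 'c'))
```

Each enumerated subset would then be fitted by OLS and screened against the user's statistical thresholds before being written to the output summaries.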

The selected linear regressions are presented in Table 4, with the respective combination number. In the supplementary material, these results can be found in the individual Excel file of each SWP. The maximum VIF among all regressors is below 5, and the p-values of the regressions and of the regression coefficients are below 0.05. The minimum p-values found among all the homoscedasticity and error normality tests are above 0.05, since for these tests the null hypothesis must be accepted. For these four linear regressions, the spatial autocorrelation of the standardized error term was calculated through the Moran's I algorithm in ArcMap, as presented in Fig. 4. The spatial autocorrelation analysis computes a z-score that reports the degree of spatial autocorrelation: if the z-score is above 1.65, the residuals are clustered; if below -1.65, they are dispersed; and within this range, the distribution is spatially random [85]. All z-scores fall inside the random range, meaning that the residual pattern is random, which is satisfactory for the selected regressions.

Table 4.

Selected regression for each surface water parameter.

Endogenous variable | Combination number | No. of regressors | VIF (maximum) | R2 | Adjusted R2 | R2 (p-value) | Regression coefficients (maximum p-value) | Homoscedasticity tests (minimum p-value) | Error normality tests (minimum p-value)
EC | 1187093 | 5 | 1.616 | 0.812 | 0.771 | 1.18E-07 | 3.34E-02 | 1.07E-01 | 3.08E-01
pH | 543724 | 5 | 1.607 | 0.926 | 0.910 | 2.86E-12 | 4.89E-04 | 5.46E-01 | 7.21E-01
Total nitrate | 284906 | 5 | 4.129 | 0.932 | 0.917 | 1.23E-12 | 4.82E-02 | 3.82E-01 | 2.90E-01
Total alkalinity | 312277 | 5 | 1.618 | 0.713 | 0.650 | 1.26E-05 | 3.22E-02 | 1.05E-01 | 1.08E-01

Fig. 4.


Spatial autocorrelation of the regression models: (A) conductivity, (B) pH, (C) total nitrate, and (D) total alkalinity. Z-scores of Moran's I test are presented in the respective image.
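The z-score interpretation used in the text (clustered above 1.65, dispersed below -1.65, spatially random in between) can be expressed as a small helper; the function name and the example z-scores are illustrative:

```python
def residual_pattern(z_score, critical=1.65):
    """Classify a Global Moran's I z-score into the three patterns used in the text."""
    if z_score > critical:
        return "clustered"
    if z_score < -critical:
        return "dispersed"
    return "spatially random"

print(residual_pattern(0.42))   # spatially random
print(residual_pattern(2.10))   # clustered
print(residual_pattern(-2.00))  # dispersed
```

A "spatially random" classification for the standardized residuals is the desired outcome here, since clustered or dispersed residuals would violate the independence assumption of the regression.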

Although the Python algorithm was run with a maximum of 6 regressors, satisfactory regressions were found with only 5 exogenous variables. According to Table 5, all the regressors are statistically significant, including the intercept term. All VIF values are below 5, meaning that multicollinearity is absent from the models. Notably, the variables used in the regressions are predominantly landscape metrics. Only the selected regression for pH contains the diffuse emissions from agriculture and forested areas (AF_P), transformed as 1 divided by the diffuse emission plus 1. This does not mean that the other variables, such as point source emissions or population density, are not suitable predictors, since these are also related to water quality; some regressions demonstrated in the supplementary material contain variables that are not landscape metrics. Still, for these SWP, the predictors are mostly landscape metrics. Although the regressions are statistically robust, the models were also tested in predicting the respective surface water parameters during different HYs. For the LR and GWR models, the measured values were compared with the predicted values for the two HYs before and after 2014–2015. Appendix B shows the scatter plots of measured versus predicted values for all models and all hydrological years. Notably, for all SWP, the R2 value of GWR is slightly higher than that of LR, while the scatter dispersion of measured versus predicted values is visually identical. To formally address the differences between measured and predicted values, Table 6 shows the PBIAS and MAPE for GWR and LR. Comparing GWR with LR, the differences are not considerable. For conductivity, the MAPE shows noticeable differences between the GWR and LR models, but for the other SWP the differences are low, with a maximum of 2.5%.
The maximum difference is 5.84% for the MAPE of the conductivity model applied to the HY of 2012–2013, where the MAPE of LR is 38.52% and the MAPE of GWR is 32.68%. Thus, there is no substantial difference between using GWR or LR, since the results are practically the same. In general, the LR models provide less biased predictions, while GWR yields more accurate predictions, since its MAPE is generally lower. The ratings of PBIAS and MAPE are given in Table 7. Conductivity is the parameter with the worst predictions for almost all HYs, achieving only a reasonable rating for model efficiency, with the exception of the HY of 2015–2016, which has a low-performance rating. For total alkalinity, all predictions are rated reasonable. Total nitrate predictions are reasonable, and a “good” rating is achieved for the HYs of 2014–2015 and 2015–2016. For pH, the predictions achieve the highest rating for all HYs.

Table 5.

Regression coefficients for each SWP selected regression.

Endogenous variable | Regression exogenous variables | Coefficients | p-value | VIF
EC | shdi_(FOR)^2 | -5.97E+01 | 1.04E-03 | 1.62
EC | cce_(ART)/(WAB)^2 | 4.37E-01 | 3.38E-05 | 1.40
EC | cce_(WAB)/(AGR)^0.5 | 8.11E+01 | 8.29E-03 | 1.62
EC | edp_(FOR)^0.5 | -1.31E+02 | 3.34E-02 | 1.03
EC | cedd_(AGR)/(FOR)^2 | -1.57E+07 | 2.41E-03 | 1.38
EC | intercept | 3.15E+02 | 1.83E-07 | -
pH | 1/(1+AF_P) | -2.46E+01 | 1.03E-07 | 1.53
pH | cce_(ART)/(WAB) | 2.76E-02 | 8.53E-08 | 1.61
pH | edp_(FOR)^0.5 | -1.01E+00 | 4.14E-07 | 1.36
pH | cedp_(AGR)/(FOR)^2 | -3.99E+00 | 1.32E-04 | 1.47
pH | cedd_(ART)/(FOR)^0.5 | -6.49E+00 | 4.89E-04 | 1.16
pH | intercept | 8.48E+00 | 5.43E-24 | -
Total nitrate | pz_(AGR)^2 | 1.27E-02 | 6.45E-11 | 1.44
Total nitrate | cedd_(AGR)/(FOR) | -1.95E+03 | 9.32E-04 | 2.03
Total nitrate | 1/(1+cce_(FOR)/(ART)) | 7.80E+00 | 4.36E-04 | 4.13
Total nitrate | lpi_(AGR)^2 | -8.46E-03 | 1.57E-05 | 3.56
Total nitrate | cedd_(ART)/(FOR)^2 | 1.80E+06 | 4.23E-06 | 1.97
Total nitrate | intercept | 2.50E+00 | 4.82E-02 | -
Total alkalinity | ppc_(FOR)^2 | -5.05E-03 | 3.63E-04 | 1.20
Total alkalinity | 1/(1+lpi_(ART)) | 5.99E+01 | 4.80E-04 | 1.62
Total alkalinity | edp_(FOR)^0.5 | -3.79E+01 | 8.03E-04 | 1.10
Total alkalinity | cedd_(ART)/(FOR)^0.5 | -4.10E+02 | 2.13E-03 | 1.13
Total alkalinity | cedd_(FOR)/(WAB)^2 | -1.17E+09 | 3.22E-02 | 1.56
Total alkalinity | intercept | 5.76E+01 | 3.28E-07 | -

Table 6.

Performance of the regressions, comparing the predicted values with the measured values for 5 hydrological years, for the linear regression (LR) and the geographically weighted regression (GWR). The hydrological year 2014–2015 is marked with *, since the data of this HY was used to build the models. The colors of the values follow the ratings given in Table 2.


Table 7.

Overall rating of linear regression (LR) and geographically weighted regression (GWR) for each hydrological year. The color of the values is according to the given ratings in Table 2.


Fig. 5 shows the predictions of the SWPs for each of the 248 catchments in the ARB, in a raster color gradient. These predictions show how the concentrations vary along the river basin. The values were compared with the concentrations recommended for human consumption, the strictest water quality thresholds according to the Portuguese legislation (Decree-law no. 236/98). Under these thresholds, conductivity must be below 400 µS/cm, pH between 6.5 and 8.5, total nitrate below 25 mg/L, and total alkalinity below 30 mg/L; all catchments that do not meet the threshold for the respective SWP are marked with stripes. Fig. 5 (E) shows the number of studied SWP that do not meet the requirements in each sub-catchment. In general, the predicted conductivity is low, and only 5 sub-catchments exceed the threshold. However, the maximum predicted concentration is very high (4000 µS/cm), which is probably an unrealistic prediction, possibly because the conductivity model had the worst rating. For pH, multiple catchments have a low pH, predominantly in the upstream region, where forested areas dominate. Total nitrate locations with concentrations above the threshold are few and scattered. Total alkalinity results are above the recommended values in some catchments in the upstream area of the ARB; still, along the river, there is a sequence of sub-catchments with high concentrations.

Fig. 5.


Predictions of the surface water parameters (SWP): (A) electrical conductivity, (B) pH, (C) total nitrate, (D) total alkalinity, and (E) the number of SWP with concentrations above the threshold.

Python script

Applying the Python script to the presented case study demonstrated that the developed script is useful for finding linear regressions that satisfy the statistical thresholds. However, the authors emphasize that the algorithm does not find the single best linear regression but a set of best linear regressions. The choice of the best model depends on an overall appreciation of the explained variation of the dependent variable, the multicollinearity, and the residual assumptions, which the authors leave to the operators/researchers. Nevertheless, the authors recommend choosing the model with the lowest VIF values, the highest statistical significance of the regression coefficients, the lowest probability of non-normal or heteroscedastic errors, and the highest adjusted R-squared. Users should also follow best practices by applying suitable statistical thresholds [21]. Among the possible programming languages, the authors chose Python because it is open source and one of the most popular programming languages [86,87], which makes it accessible to the majority of users. For those who are not familiar with programming, Python has the advantage of being one of the easiest languages to learn and apply [88]. Moreover, no deep programming knowledge is needed to use the developed algorithm, since it is only necessary to create the CSV tables, edit the paths, and set the statistical thresholds in the first part of the script. Unlike languages such as Matlab, Python requires no paid license, making it accessible to any user. For homoscedasticity and error normality, well-established statistical tests were chosen because they are frequently used and known by the scientific community, which gives users more confidence in interpreting the results.
However, more advanced approaches are being developed [89], which can be added to the script by researchers who feel comfortable with Python programming. The presented script applies more than one test for homoscedasticity and error normality to bring more confidence to the results, since these assumptions might be confirmed by one test and not by another. Still, some tests might be preferred over others [90]. For example, when comparing normality tests, Thadewald and Büning (2007) [91] note that for distributions with short tails the Shapiro-Wilk test might be better than Jarque-Bera, and specific tests are valid for different sample sizes [92]. For such reasons, the script creates two outputs: “Summary_1” shows only the regressions that passed all tests, and “Summary_2” shows all the regression models that passed at least one test. This second output is important for users who disagree with the application of a particular test, whose regressions would otherwise be absent from “Summary_1” for failing that test. As mentioned in the introduction, other software packages calculate linear regression models among candidate variables. For example, in ArcMap, the “Exploratory Regression” tool [18] does a similar job, with the advantage of calculating the spatial autocorrelation for each output regression. Its disadvantages are the need for a paid license, the execution of only one error normality test (Jarque-Bera), and the absence of homoscedasticity tests. Spatial autocorrelation tests were not included in the presented Python script so that authors who do not have spatial data (in shapefile format) can still use it. For authors who need to calculate the spatial autocorrelation of the errors, Moran's I can be calculated separately in ArcMap [93] or QGIS [94], or the test can be included in the script using the “Moran” function available in the PySAL library [95].

The Python script application was essential to choose a group of regressors to create a GWR model for each SWP. For the present study, the authors gathered 73 possible regressors, and by applying four transformations (x^2, x^0.5, log10(x+1), and 1/(1+x)), the number of “candidate variables” increased to 365. Applying all 365 variables would require a very long computation time. For this reason, the authors chose to reduce the number of candidate variables by selecting, for each SWP, only the transformation with the strongest Pearson correlation and the expected sign. This is why the number of possible regressors differs between SWP (Table 3).
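This preliminary screening step can be sketched as follows: for each raw variable, the four transformations are computed and the one with the strongest Pearson correlation to the target is retained. The sketch below uses a hand-rolled Pearson correlation, omits the untransformed variable for brevity, and leaves out the expected-sign filter; the authors' actual selection may differ in such details:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The four transformations applied in the study:
TRANSFORMS = {
    "x^2": lambda v: v ** 2,
    "x^0.5": lambda v: v ** 0.5,
    "log10(x+1)": lambda v: math.log10(v + 1),
    "1/(1+x)": lambda v: 1.0 / (1.0 + v),
}

def best_transform(x, y):
    """Return the transformation name with the strongest |Pearson r| against y."""
    scores = {name: abs(pearson([f(v) for v in x], y)) for name, f in TRANSFORMS.items()}
    return max(scores, key=scores.get)

# Illustrative data where y is exactly the square of x:
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, 4.0, 9.0, 16.0]
print(best_transform(x, y))  # x^2
```

Keeping only one transformed version of each variable shrinks the candidate pool from 365 back to roughly the original 73 per SWP, which keeps the brute-force search tractable.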

For users who would like to apply the presented algorithm, if the number of candidate variables is low, the authors consider that no preliminary variable selection strategy is necessary. This is because the computing speed was, on average, 1.5 seconds for every 100 combinations, in Python (3.8.8), using Anaconda (4.10.1) and Spyder (4.2.5), on a desktop with Windows 10 (64-bit), 16 GB of RAM, and an Intel(R) Xeon(R) E5-2620 v3 processor (2.4 GHz). The number of combinations is given by Eq. (2), a factorial relation between the number of possible candidate regressors (n) and the number of regressors per model (j), summed up to the maximum number of regressors (m). Fig. 6 presents the predicted processing time as a function of the number of combinations. For the present study, the processing time among surface water parameters ranged from 10 to 60 hours. This is indeed a disadvantage when users want to test a high number of regressors with a high number of candidate variables. In the future, this might not persist as a troubling disadvantage, since quantum computing systems might become accessible to researchers [96]. Further, porting the code to a faster language, such as C++ [97], could reduce the script run time. However, the script was developed in Python to promote ease of access in exchange for a lower computing speed.
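Eq. (2) and the reported throughput (about 100 combinations per 1.5 s) can be reproduced directly; for EC (n = 44 candidate regressors, up to m = 6 per model) the sum matches the 8,295,045 combinations reported in Table 3, and the runtime estimate falls inside the 10–60 hour range quoted above:

```python
from math import comb

def total_combinations(n, m):
    """Number of regressor subsets of size 1..m drawn from n candidates (Eq. 2)."""
    return sum(comb(n, j) for j in range(1, m + 1))

def estimated_hours(n, m, seconds_per_100=1.5):
    """Rough runtime estimate at the throughput reported for the case study."""
    return total_combinations(n, m) / 100 * seconds_per_100 / 3600

print(total_combinations(44, 6))         # 8295045  (EC row of Table 3)
print(round(estimated_hours(44, 6), 1))  # 34.6 hours
```

The same function reproduces the other rows of Table 3 (e.g., n = 36, m = 6 gives 2,391,495 for total alkalinity), confirming that the search space grows combinatorially with the number of candidate regressors.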

\text{number of combinations} = \sum_{j=1}^{m} \frac{n!}{j!\,(n-j)!} \quad (2)

Fig. 6.


Algorithm processing time vs the number of combinations between regressors.

Conclusion

A Python script was developed to calculate the best linear regressions among all the possible combinations of regressors of the input dataset. The output regressions are based on statistical thresholds defined by the user. The tests selected for this algorithm are: significance of the regressors; multicollinearity, tested through the VIF; error normality, through five tests (Jarque-Bera, Anderson-Darling, Shapiro-Wilk, Kolmogorov-Smirnov, and Omnibus); and heteroscedasticity, through four tests (Breusch-Pagan, Harvey-Collier, Glejser, and Goldfeld-Quandt).

The script includes a novelty: the selection of regression models whose regression coefficients agree with the user's expectations. This detail is relevant for models that portray pollution, since pollution sources should appear in the models as drivers of increased contamination. In this sense, the script only reports models that fulfill this restriction, which is optionally set by the user.
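This expected-sign restriction can be sketched as a simple filter applied after each fit: the user declares the expected sign of each candidate regressor, and any model containing a coefficient with the wrong sign is discarded. The names below are illustrative, not the authors' exact implementation:

```python
def signs_match(coefficients, expected_signs):
    """Return True if every fitted coefficient agrees with the user's expected sign.

    coefficients: {regressor_name: fitted_value}
    expected_signs: {regressor_name: +1 or -1}; regressors absent from
    expected_signs are left unconstrained.
    """
    for name, value in coefficients.items():
        sign = expected_signs.get(name)
        if sign is not None and value * sign < 0:
            return False
    return True

# A pollution-source regressor is expected to increase contamination (+1),
# so a fitted negative coefficient causes the model to be discarded:
fit = {"point_source_load": -0.8, "forest_cover": -1.2}
print(signs_match(fit, {"point_source_load": +1}))  # False
```

In the brute-force loop, this check would run alongside the significance, multicollinearity, normality, and homoscedasticity tests before a model is written to the output.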

The algorithm was tested with an environmental dataset to find linear regressions that could predict water quality parameters. The dataset included landscape metrics, population density, diffuse emissions, and point source pressures among the possible regressors. One of the output models was selected for each surface water parameter and also tested in GWR. The GWR models were not very different from the LR models, but had higher R2 values and lower MAPE. The predictions were highly accurate for pH, reasonable to good for total nitrate, reasonable for total alkalinity, and reasonable to low for electrical conductivity. Among the regressors of the selected models, the majority were landscape metrics, showing that, for the case study, the landscape has a strong effect on water quality.

The developed brute-force algorithm proved effective: by testing all possible combinations, it guarantees that the best models are found. The method was shown to be appropriate for datasets with tens of variables, as in the case study. However, because it relies on brute force, the algorithm might not be feasible for large datasets with hundreds of variables. In this sense, further research is required to improve the method's computational speed.

CRediT authorship contribution statement

A.C. P Fernandes: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization. A. R Fonseca: Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing. F.A.L. Pacheco: Conceptualization, Methodology, Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition. L.F. Sanches Fernandes: Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

For the author integrated in the CERENA research centre financial support was provided by the  Base Funding—UIDB/04028/2020 and Programmatic Funding - UIDP/04028/2020  of the Research Center for Natural Resources and Environment—CERENA—funded by national funds through the FCT/MCTES (PIDDAC). For authors integrated in the CITAB research centre, this work was supported by National Funds by FCT – Portuguese Foundation for Science and Technology, under the project UIDB/04033/2020. The authors integrated in the CITAB research centre are also integrated in the Inov4Agro – Institute for Innovation, Capacity Building and Sustainability of Agri-food Production. The Inov4Agro is an Associate Laboratory composed of two R&D units (CITAB & GreenUPorto). For the author integrated in the CQVR, the research was additionally supported by National Funds by FCT – Portuguese Foundation for Science and Technology, under the project UIDB/QUI/00616/2020 and UIDP/00616/2020. Financial support was provided by the Portuguese Foundation for Science and Technology (FCT), Ministry of Science, Technology, and Higher Education (MCTES), European Social Fund (FSE) through NORTE 2020 (North Regional Operational Program 2014/2020) and European Union (EU) to António Fernandes (Grant: SFRH/BD/146151/2019). The authors would like to thank the Regional Hydrographic Administration of the North (Administração Regional Hidrográfica do Norte, in Portuguese) and the Portuguese Environment Agency (Agência Portuguesa do Ambiente, in Portuguese) for providing the point source and diffuse emission data for Ave River Basin.

Footnotes

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.mex.2023.102153.

Appendix A

Geographically Weighted Regressions Coefficients

Fig. A1

Fig. A1.


Geographically weighted regression coefficients, (A), (E), (I), (M), (Q), (U) for conductivity, (B), (F), (J), (N), (R), (V) for pH, (C), (G), (K), (O), (S), (X) for total nitrate and (D), (H), (L), (P), (T), (Y) for total alkalinity.

Appendix B- Comparison between the linear regression (LR) with geographically weighted regression (GWR) models

Figs. B1–B4

Fig. B1.

Comparison between the LR and GWR model predictions for conductivity.

Fig. B2.

Comparison between the LR and GWR model predictions for pH.

Fig. B3.

Comparison between the LR and GWR model predictions for total nitrates.

Fig. B4.

Comparison between the LR and GWR model predictions for total alkalinity.

Appendix B. Supplementary materials

Supplementary material, submitted along with the manuscript, contains 3 folders:

  • “input_files” - all the necessary tables for the algorithm, regarding the case study and each surface water parameter;

  • “output_files” - the resulting Excel files for each run of the script, regarding the case study application;

  • “Python_script” - the used script for each surface water parameter.

mmc1.zip (99.9MB, zip)

Data availability

  • Data is provided in the supplementary material

References

  • 1.Cho K.H., Pachepsky Y., Ligaray M., Kwon Y., Kim K.H. Data assimilation in surface water quality modeling: a review. Water Res. 2020;186 doi: 10.1016/J.WATRES.2020.116307. [DOI] [PubMed] [Google Scholar]
  • 2.Uddin M.G., Nash S., Olbert A.I. A review of water quality index models and their use for assessing surface water quality. Ecol. Indic. 2021;122 doi: 10.1016/J.ECOLIND.2020.107218. [DOI] [Google Scholar]
  • 3.Thakur A.K. Springer US, Boston, MA; 1991. Model: Mechanistic vs Empirical; pp. 41–51. (New Trends Pharmacokinet.). [DOI] [Google Scholar]
  • 4.Loucks D.P., van Beek E. Water Resource Systems Planning and Management. Springer International Publishing; 2017. Water quality modeling and prediction; pp. 417–467. [DOI] [Google Scholar]
  • 5.Wool T., Ambrose R.B., Martin J.L., Comer A. WASP 8: The next generation in the 50-year evolution of USEPA’s water quality model. Water (Switzerland) 2020;12 doi: 10.3390/W12051398. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Fonseca A.R., Santos J.A. Predicting hydrologic flows under climate change: the tâmega basin as an analog for the mediterranean region. Sci. Total Environ. 2019;668:1013–1024. doi: 10.1016/j.scitotenv.2019.01.435. [DOI] [PubMed] [Google Scholar]
  • 7.da S. Burigato Costa C.M., da Silva Marques L., Almeida A.K., Leite I.R., de Almeida I.K. Applicability of water quality models around the world – a review. Environ. Sci. Pollut. Res. 2019;26 doi: 10.1007/s11356-019-06637-2. [DOI] [PubMed] [Google Scholar]
  • 8.Pearl J. Causal inference in statistics: an overview. Stat. Surv. 2009;3 doi: 10.1214/09-SS057. [DOI] [Google Scholar]
  • 9.Avila R., Horn B., Moriarty E., Hodson R., Moltchanova E. Evaluating statistical model performance in water quality prediction. J. Environ. Manage. 2018;206:910–919. doi: 10.1016/J.JENVMAN.2017.11.049. [DOI] [PubMed] [Google Scholar]
  • 10.Mitchell M. Selecting the correct predictive modeling technique. Towar. Data Sci. 2019 https://towardsdatascience.com/selecting-the-correct-predictive-modeling-technique-ba459c370d59 (accessed August 10, 2021) [Google Scholar]
  • 11.Sagan V., Peterson K.T., Maimaitijiang M., Sidike P., Sloan J., Greeling B.A., Maalouf S., Adams C. Monitoring inland water quality using remote sensing: potential and limitations of spectral indices, bio-optical simulations, machine learning, and cloud computing. Earth Sci. Rev. 2020;205 doi: 10.1016/J.EARSCIREV.2020.103187. [DOI] [Google Scholar]
  • 12.Huang X., Wang H., Luo W., Xue S., Hayat F., Gao Z. Prediction of loquat soluble solids and titratable acid content using fruit mineral elements by artificial neural network and multiple linear regression. Sci. Hortic. (Amsterdam). 2021;278 doi: 10.1016/J.SCIENTA.2020.109873. [DOI] [Google Scholar]
  • 13.Ramasamy M., Nagan S., Senthil Kumar P. A case study of flood frequency analysis by intercomparison of graphical linear log-regression method and Gumbel's analytical method in the Vaigai river basin of Tamil Nadu, India. Chemosphere. 2022;286 doi: 10.1016/J.CHEMOSPHERE.2021.131571. [DOI] [PubMed] [Google Scholar]
  • 14.Correndo A.A., Hefley T.J., Holzworth D.P., Ciampitti I.A. Revisiting linear regression to test agreement in continuous predicted-observed datasets. Agric. Syst. 2021;192 doi: 10.1016/J.AGSY.2021.103194. [DOI] [Google Scholar]
  • 15.Maaouane M., Zouggar S., Krajačić G., Zahboune H. Modelling industry energy demand using multiple linear regression analysis based on consumed quantity of goods. Energy. 2021;225 doi: 10.1016/J.ENERGY.2021.120270. [DOI] [Google Scholar]
  • 16.Loftus S.C. Basic Statistical with R. Academic Press; 2022. Simple linear regression; pp. 227–247. [DOI] [Google Scholar]
  • 17.Allen R.G.D. The assumptions of linear regression. Economica. 1939;6 doi: 10.2307/2548931. [DOI] [Google Scholar]
  • 18.Esri, Exploratory Regression, ArcGIS Desktop. (2018). https://desktop.arcgis.com/en/arcmap/10.3/tools/spatial-statistics-toolbox/exploratory-regression.htm (accessed August 12, 2021).
  • 19.Braun M.T., Oswald F.L. Exploratory regression analysis: a tool for selecting models and determining predictor importance. Behav. Res. Methods. 2011;43 doi: 10.3758/s13428-010-0046-8. [DOI] [PubMed] [Google Scholar]
  • 20.Jones B., Sall J. JMP statistical discovery software. Wiley Interdiscip. Rev. Comput. Stat. 2011;3 doi: 10.1002/wics.162. [DOI] [Google Scholar]
  • 21.A. Kassambara, Linear Regression Assumptions and Diagnostics in R: Essentials, Articles - Regression Model Diagnostics. (2018). http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/ (accessed August 27, 2022).
  • 22.Wang K., Chen Z. Stepwise regression and all possible subsets regression in education. Electron. Int. J. Educ. Arts Sci. 2018;2:60–81. [Google Scholar]
  • 23.Rose S., McGuire T.G. Limitations of p-values and r-squared for stepwise regression building: a fairness demonstration in health policy risk adjustment. Am. Stat. 2019;73:152–156. doi: 10.1080/00031305.2018.1518269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Smith G. Step away from stepwise. J. Big Data. 2018;5 doi: 10.1186/s40537-018-0143-6. [DOI] [Google Scholar]
  • 25.Riyad W.A., Yee S.C.P., Thinakaran R.A.P., Salam Z.A.B.A. Comparative evaluation of numerous optimization algorithms for compiling travel salesman problem. J. Adv. Res. Dyn. Control Syst. 2020;12 doi: 10.5373/JARDCS/V12SP7/20202178. [DOI] [Google Scholar]
  • 26.Rashid J., Kanwal S., Kim J., Nisar M.W., Naseem U., Hussain A. Heart disease diagnosis using the brute force algorithm and machine learning techniques. Comput. Mater. Contin. 2022;72:3195–3211. doi: 10.32604/cmc.2022.026064. [DOI] [Google Scholar]
  • 27.Brunsdon C., Fotheringham A.S., Charlton M.E. Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr. Anal. 1996;28:281–298. doi: 10.1111/j.1538-4632.1996.tb00936.x. [DOI] [Google Scholar]
  • 28.Sheehan K.R., Strager M.P., Welsh S.A. Advantages of geographically weighted regression for modeling benthic substrate in two greater yellowstone ecosystem streams. Environ. Model. Assess. 2013;18 doi: 10.1007/s10666-012-9334-2. [DOI] [Google Scholar]
  • 29.M. Anwar, Geographic Weighted Regression on 911 phone calls, YouTube. (2012). https://www.youtube.com/watch?v=plfCMZhROeQ&t=2510s&ab_channel=MoulayAnwarSounny-Slitine (accessed August 11, 2021).
  • 30.Koh E.H., Lee E., Lee K.K. Application of geographically weighted regression models to predict spatial characteristics of nitrate contamination: implications for an effective groundwater management strategy. J. Environ. Manage. 2020;268 doi: 10.1016/J.JENVMAN.2020.110646. [DOI] [PubMed] [Google Scholar]
  • 31.Zhu C., Zhang X., Zhou M., He S., Gan M., Yang L., Wang K. Impacts of urbanization and landscape pattern on habitat quality using OLS and GWR models in Hangzhou, China. Ecol. Indic. 2020;117 doi: 10.1016/J.ECOLIND.2020.106654. [DOI] [Google Scholar]
  • 32.Kashki A., Karami M., Zandi R., Roki Z. Evaluation of the effect of geographical parameters on the formation of the land surface temperature by applying OLS and GWR, a case study Shiraz City, Iran. Urban Clim. 2021;37 doi: 10.1016/J.UCLIM.2021.100832. [DOI] [Google Scholar]
  • 33.Sousa J.C.G., Ribeiro A.R., Barbosa M.O., Ribeiro C., Tiritan M.E., Pereira M.F.R., Silva A.M.T. Monitoring of the 17 EU watch list contaminants of emerging concern in the ave and the sousa rivers. Sci. Total Environ. 2019 doi: 10.1016/j.scitotenv.2018.08.309. [DOI] [PubMed] [Google Scholar]
  • 34.Fonseca A., Boaventura R.A.R., Vilar V.J.P. Integrating water quality responses to best management practices in Portugal. Environ. Sci. Pollut. Res. 2018 doi: 10.1007/s11356-017-0610-1. [DOI] [PubMed] [Google Scholar]
  • 35.Fernandes A., Sanches Fernandes L.F., Moura J.P., Cortes R.M.V., Pacheco F.A.L. A structural equation model to predict macroinvertebrate-based ecological status in catchments influenced by anthropogenic pressures. Sci. Total Environ. 2019;681:242–257. doi: 10.1016/J.SCITOTENV.2019.05.117. [DOI] [PubMed] [Google Scholar]
  • 36.Permai S.D., Christina A., Santoso Gunawan A.A. Fiscal decentralization analysis that affect economic performance using geographically weighted regression (GWR) Proced. Comput. Sci. 2021;179:399–406. doi: 10.1016/J.PROCS.2021.01.022. [DOI] [Google Scholar]
  • 37.Robbert Legg T.B. Michigan, ArcUser; 2009. Applying Geographically Weighted Regression to a Real Estate Problem, An Example from Marquette. [Google Scholar]
  • 38.SNIRH, Sistema Nacional de Informação de Recursos Hídricos, (1997). https://snirh.apambiente.pt/ (accessed January 10, 2021).
  • 39.EEA, Data and maps — European environment agency, (2021). https://www.eea.europa.eu/data-and-maps (accessed December 12, 2018).
  • 40.ESRI ArcMap 10.1. Environ. Syst. Resour. Inst. 2012 [Google Scholar]
  • 41.ESRI, ArcHydro tools for ArcGIS 10 – Tutorial, (2012).
  • 42.DGT, Direcção geral do território, Carta de Uso e Ocupação do Solo. (2018). http://www.dgterritorio.pt/ (accessed April 12, 2020).
  • 43.Adamczyk J., Tiede D. ZonalMetrics - a python toolbox for zonal landscape structure analysis. Comput. Geosci. 2017;99:91–99. doi: 10.1016/J.CAGEO.2016.11.005. [DOI] [Google Scholar]
  • 44.INE, Statistics Portugal- Census 2011, (2014). https://censos.ine.pt/ (accessed January 3, 2021).
  • 45.SNIAMB, Sistema Nacional de Informação de Ambiente, (2016). https://sniamb.apambiente.pt/ (accessed December 2, 2020).
  • 46.Magdalinos T. Least squares and ivx limit theory in systems of predictive regressions with garch innovations. Econom. Theory. 2021 doi: 10.1017/S0266466621000086. [DOI] [Google Scholar]
  • 47.Stanton J.M. Galton, pearson, and the peas: a brief history of linear regression for statistics instructors. J. Stat. Educ. 2001;9 doi: 10.1080/10691898.2001.11910537. [DOI] [Google Scholar]
  • 48.Gang Su X. World Scientific Publishing Co. Pte. Ltd.; 2009. Linear regression analysis: Theory and computing. [DOI] [Google Scholar]
  • 49.Venkatesh Babu R., Ayyappan G., Kumaravel A. Comparison of linear regression and simple linear regression for critical temperature of semiconductor. Indian J. Comput. Sci. Eng. 2020;10:177–183. doi: 10.21817/indjcse/2019/v10i6/191006050. [DOI] [Google Scholar]
  • 50.Islam M.R., Azad M.S., Mollick A.S., Kamruzzaman M., Khan M.N.I. Allometric equations for estimating stem biomass of Artocarpus chaplasha Roxb. in Sylhet Hill forest of Bangladesh. Trees For. People. 2021;4 doi: 10.1016/j.tfp.2021.100084. [DOI] [Google Scholar]
  • 51.Park K., Rothfeder R., Petheram S., Buaku F., Ewing R., Greene W.H. Basic Quantitative Research Methods for Urban Planners. Taylor and Francis; 2020. Linear regression; pp. 220–269. [DOI] [Google Scholar]
  • 52.Pyrczak F., Oh D.M. Coefficient of determination. Mak. Sense Stat. 2019 doi: 10.4324/9781315179803-39. [DOI] [Google Scholar]
  • 53.Yin P., Fan X. Estimating R2 shrinkage in multiple regression: a comparison of different analytical methods. J. Exp. Educ. 2001;69:203–224. http://www.jstor.org/stable/20152659 [Google Scholar]
  • 54.Miles J., Squared R. Wiley StatsRef: Statistics Reference Online. 2014. Adjusted R squared. [DOI] [Google Scholar]
  • 55.Steinberger L. The relative effects of dimensionality and multiplicity of hypotheses on the f-test in linear regression. Electron. J. Stat. 2016;10 doi: 10.1214/16-EJS1186. [DOI] [Google Scholar]
  • 56.Maneejuk P., Yamaka W. Significance test for linear regression: how to test without p-values? J. Appl. Stat. 2021;48 doi: 10.1080/02664763.2020.1748180. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Derryberry D.W., Aho K., Edwards J., Peterson T. Model selection and regression t-statistics. Am. Stat. 2018;72 doi: 10.1080/00031305.2018.1459316. [DOI] [Google Scholar]
  • 58.Marques C.P., Fonseca T.de J.F., Duarte J.C. Sílabas & Desafios, Faro; 2018. Guia Prático de Avaliações Florestais: Inventário Florestal e Modelação Estatística. [Google Scholar]
  • 59.Casson R.J., Farmer L.D.M. Understanding and checking the assumptions of linear regression: a primer for medical researchers. Clin. Exp. Ophthalmol. 2014;42. doi: 10.1111/ceo.12358. [DOI] [PubMed] [Google Scholar]
  • 60.Katrutsa A., Strijov V. Comprehensive study of feature selection methods to solve multicollinearity problem according to evaluation criteria. Exp. Syst. Appl. 2017;76:1–11. doi: 10.1016/j.eswa.2017.01.048. [DOI] [Google Scholar]
  • 61.Ullah M.I., Aslam M., Altaf S., Ahmed M. Some new diagnostics of multicollinearity in linear regression model. Sains. Malays. 2019;48 doi: 10.17576/jsm-2019-4809-26. [DOI] [Google Scholar]
  • 62.Kutner M., Nachtsheim C., Neter J. 4th ed. McGraw-Hill Irwin; 2004. Applied Linear Regression Models. [Google Scholar]
  • 63.Javari M. Spatial monitoring and variability of daily rainfall in Iran. Int. J. Appl. Environ. Sci. 2017;12 [Google Scholar]
  • 64.Hair J.F., Hult G.T.M., Ringle C., Sarstedt M. A primer on partial least squares structural equation modeling. Sage Publ. Inc. 2014 doi: 10.1016/j.lrp.2013.01.002. [DOI] [Google Scholar]
  • 65.Malyarets L., Kovaleva K., Lebedeva I., Misiura I., Dorokhov O. The heteroskedasticity tests implementation for linear regression model using matlab. Inform. 2018;42 doi: 10.31449/inf.v42i4.1862. [DOI] [Google Scholar]
  • 66.Baum C.F., Lewbel A. Advice on using heteroskedasticity-based identification. Stata J. 2019;19 doi: 10.1177/1536867x19893614. [DOI] [Google Scholar]
  • 67.Wu S. Is normal distribution necessary in regression? how to track and fix it? Towar. Data Sci. 2020 https://towardsdatascience.com/is-normal-distribution-necessary-in-regression-how-to-track-and-fix-it-494105bc50dd (accessed July 23, 2021) [Google Scholar]
  • 68.Delgado M.A., Mora J. A nonparametric test for serial independence of regression errors. Biometrika. 2000;87 doi: 10.1093/biomet/87.1.228. [DOI] [Google Scholar]
  • 69.Mukherjee A.Kr., Laha M. Problem of autocorrelation in linear regression detection and remedies. Int. J. Multidiscip. Res. Mod. Educ. 2019;5:105–110. doi: 10.5281/ZENODO.2656268. [DOI] [Google Scholar]
  • 70.Zhao J., Liu S., Xiong X., Cai Z. Differentially private autocorrelation time-series data publishing based on sliding window. Secur. Commun. Netw. 2021 doi: 10.1155/2021/6665984. [DOI] [Google Scholar]
  • 71.Getis A. Reflections on spatial autocorrelation. Reg. Sci. Urban Econ. 2007;37 doi: 10.1016/j.regsciurbeco.2007.04.005. [DOI] [Google Scholar]
  • 72.Griffith D.A., Chun Y. Spatial autocorrelation and uncertainty associated with remotely-sensed data. Remote Sens. 2016;8 doi: 10.3390/rs8070535. [DOI] [Google Scholar]
  • 73.Li H., Calder C.A., Cressie N. Beyond Moran's I: Testing for spatial dependence based on the spatial autoregressive model. Geogr. Anal. 2007;39 doi: 10.1111/j.1538-4632.2007.00708.x. [DOI] [Google Scholar]
  • 74.Jarque C.M., Bera A.K. Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Econ. Lett. 1980;6:255–259. doi: 10.1016/0165-1765(80)90024-5. [DOI] [Google Scholar]
  • 75.Anderson T.W., Darling D.A. A test of goodness of fit. J. Am. Stat. Assoc. 1954;49 doi: 10.1080/01621459.1954.10501232. [DOI] [Google Scholar]
  • 76.Shapiro S.S., Wilk M.B. An analysis of variance test for normality (complete samples). Biometrika. 1965;52 doi: 10.2307/2333709. [DOI] [Google Scholar]
  • 77.Kolmogorov A.N. Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari. 1933;4:83–91. http://ci.nii.ac.jp/naid/10030673552/en/ (accessed July 22, 2021) [Google Scholar]
  • 78.D'Agostino R.B. An omnibus test of normality for moderate and large size samples. Biometrika. 1971;58 doi: 10.1093/biomet/58.2.341. [DOI] [Google Scholar]
  • 79.Breusch T.S., Pagan A.R. A simple test for heteroscedasticity and random coefficient variation. Econometrica. 1979;47:1287–1294. doi: 10.2307/1911963. [DOI] [Google Scholar]
  • 80.Harvey A.C., Collier P. Testing for functional misspecification in regression analysis. J. Econom. 1977;6:103–119. doi: 10.1016/0304-4076(77)90057-4. [DOI] [Google Scholar]
  • 81.Glejser H. A new test for heteroskedasticity. J. Am. Stat. Assoc. 1969;64 doi: 10.1080/01621459.1969.10500976. [DOI] [Google Scholar]
  • 82.Goldfeld S.M., Quandt R.E. Some tests for homoscedasticity. J. Am. Stat. Assoc. 1965;60 doi: 10.1080/01621459.1965.10480811. [DOI] [Google Scholar]
  • 83.de Salis H.H.C., da Costa A.M., Vianna J.H.M., Schuler M.A., Künne A., Fernandes L.F.S., Pacheco F.A.L. Hydrologic modeling for sustainable water resources management in urbanized karst areas. Int. J. Environ. Res. Public Health. 2019;16 doi: 10.3390/ijerph16142542. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Montaño Moreno J.J., Palmer Pol A., Sesé Abad A., Cajal Blasco B. Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema. 2013;25 doi: 10.7334/psicothema2013.23. [DOI] [PubMed] [Google Scholar]
  • 85.Davarpanah A., Babaie H.A., Dai D. Spatial autocorrelation of neogene-quaternary lava along the Snake River Plain, Idaho, USA. Earth Sci. Inf. 2018;11 doi: 10.1007/s12145-017-0315-5. [DOI] [Google Scholar]
  • 86.Codica Team. Top 8 most in-demand programming languages for 2021. Medium. 2021 https://medium.com/codica/top-8-most-in-demand-programming-languages-for-2021-50cd4c3a8c34 (accessed August 12, 2021) [Google Scholar]
  • 87.Feldman S. Chart: the most popular programming languages. Statista. 2019 https://www.statista.com/chart/16567/popular-programming-languages/ (accessed August 12, 2021) [Google Scholar]
  • 88.Malloy B.A., Power J.F. An empirical analysis of the transition from python 2 to python 3. Empir. Softw. Eng. 2019;24 doi: 10.1007/s10664-018-9637-2. [DOI] [Google Scholar]
  • 89.Cattaneo M.D., Jansson M., Newey W.K. Inference in linear regression models with many covariates and heteroscedasticity. J. Am. Stat. Assoc. 2018;113:1350–1361. doi: 10.1080/01621459.2017.1328360. [DOI] [Google Scholar]
  • 90.Rosopa P.J., Schaffer M.M., Schroeder A.N. Managing heteroscedasticity in general linear models. Psychol. Methods. 2013;18:335–351. doi: 10.1037/a0032553. [DOI] [PubMed] [Google Scholar]
  • 91.Thadewald T., Büning H. Jarque-bera test and its competitors for testing normality - a power comparison. J. Appl. Stat. 2007;34 doi: 10.1080/02664760600994539. [DOI] [Google Scholar]
  • 92.Fitrianto A., Chin L.Y. Assessing normality for data with different sample sizes using SAS, minitab and R. ARPN J. Eng. Appl. Sci. 2016;11:10845–10850. [Google Scholar]
  • 93.Esri How spatial autocorrelation (Global Moran's I) works. ArcGIS Deskt. 2018 https://desktop.arcgis.com/en/arcmap/10.3/tools/spatial-statistics-toolbox/h-how-spatial-autocorrelation-moran-s-i-spatial-st.htm (accessed August 12, 2021) [Google Scholar]
  • 94.Oxoli D., Prestifilippo G., Bertocchi D., Zurbarán M. Enabling spatial autocorrelation mapping in QGIS: the hotspot analysis plugin. Geoing. Ambient. Miner. 2017;151:45–50. [Google Scholar]
  • 95.Kang W. PySAL and spatial statistics libraries. Geogr. Inf. Sci. Technol. Body Knowl. 2020 doi: 10.22224/GISTBOK/2020.3.1. [DOI] [Google Scholar]
  • 96.Alexeev Y., Bacon D., Brown K.R., Calderbank R., Carr L.D., Chong F.T., DeMarco B., Englund D., Farhi E., Fefferman B., Gorshkov A.V., Houck A., Kim J., Kimmel S., Lange M., Lloyd S., Lukin M.D., Maslov D., Maunz P., Monroe C., Preskill J., Roetteler M., Savage M.J., Thompson J. Quantum computer systems for scientific discovery. PRX Quant. 2021;2 doi: 10.1103/prxquantum.2.017001. [DOI] [Google Scholar]
  • 97.Sethi A. Comparison of 10 programming languages. Medium. 2020 https://reubenrochesingh.medium.com/comparison-of-10-programming-languages-f43b0ac337a4 (accessed August 16, 2021) [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary material, submitted along with the manuscript, contains three folders:

  • “input_files” — all the tables required by the algorithm, for the case study and each surface water parameter;

  • “output_files” — the resulting Excel files for each run of the script in the case study application;

  • “Python_script” — the script used for each surface water parameter.

mmc1.zip (99.9MB, zip)

Data Availability Statement

  • Data are provided in the supplementary material.


Articles from MethodsX are provided here courtesy of Elsevier