2021 Nov 9;204:112348. doi: 10.1016/j.envres.2021.112348

Multivariate data driven prediction of COVID-19 dynamics: Towards new results with temperature, humidity and air quality data

Dunfrey P Aragão a, Emerson V Oliveira a, Arthur A Bezerra a, Davi H dos Santos a, Andouglas G da Silva Junior b, Igor G Pereira a, Prisco Piscitelli c, Alessandro Miani d, Cosimo Distante e, Jordan S Cuno f, Aura Conci f, Luiz MG Gonçalves a,
PMCID: PMC8577104  PMID: 34767822

Abstract

Since the start of the COVID-19 pandemic, many studies have investigated the correlation between climate variables such as air quality, humidity and temperature and the lethality of COVID-19 around the world. In this work we investigate the use of climate variables as additional features to train a data-driven multivariate forecast model that predicts the short-term expected number of COVID-19 deaths in Brazilian states and major cities. The main idea is that adding these climate features as inputs to the training of data-driven models improves predictive performance compared to equivalent single-input models. We use a Stacked LSTM as the network architecture for both the multivariate and the univariate model. We compare both approaches by training forecast models for the COVID-19 deaths time series of the city of São Paulo. In addition, we present a preliminary analysis based on K-means clustering of AQI curves. Its results enable transfer learning: when a new locality is added to the task, it can be forecast with a model based on the cluster whose AQI curves are most similar to its own. The experiments show that the best multivariate model is more skilled than the best standard data-driven univariate model that we could find, using as evaluation metrics the average fitting error, the average forecast error, and the profile of the accumulated deaths for the forecast. These results suggest that adding more useful features as input to a multivariate approach could further improve the quality of the prediction models.

Keywords: COVID-19 dynamics, Air quality and temperature, AI prediction, COVID-19 epidemiology, Time-series forecast, Multivariate forecast

1. Introduction

After the first case was reported at the end of 2019 (Guan et al., 2020), the SARS-CoV-2 virus spread rapidly around the world, causing a serious acute respiratory syndrome that is referred to simply as COVID-19. It is currently known that most human-to-human transmission of COVID-19 occurs through contact, either by touching a contaminated surface and then bringing the hands into contact with the eyes, nose, or mouth, and/or through contact with aerosol particles suspended in the air around infected people, which are expelled through coughing, sneezing and even talking (Who - coronavirus disease, 2020).

The number of new daily cases and deaths caused by COVID-19 increased in some countries at the end of 2020 and the beginning of 2021, with a significantly higher number of new daily deaths in some locations than in the first wave. This can be clearly observed in Fig. 1, which shows the rise in weekly deaths from COVID-19 in Europe, North America and Africa starting around October 1, 2020.

Fig. 1.

COVID-19 deaths in Europe, North America and Africa - Image extracted from https://ourworldindata.org/.

To forecast the behavior of the COVID-19 pandemic, approaches such as the SIR, SEIR and SIRASD epidemiological models and machine learning methods such as Long Short-Term Memory (LSTM) and Modified Auto-Encoder (MAE) have been compared in previous work (Pereira et al., 2020). Clustering the countries by some features before training a set of models, one for each cluster, seems to prevent mistakes when designing a prediction model, since each model is then based on the dynamics of similar countries. Moreover, this previous study shows that the MAE method is more efficient and produces better results for COVID-19 epidemiology than the LSTM method and other traditional methods, although it requires more features in order to get improved results. Other works also present model comparisons. One uses a chaotic marine predator algorithm (CMPA) to improve the performance of the adaptive neuro-fuzzy inference system (ANFIS) and applies the resulting COVID-19 prediction method to confirmed cases, using official World Health Organization (WHO) data sets for Russia and Brazil (Al-qaness et al., 2021). Another study (Friedman et al., 2021) compares 386 public global forecast models for COVID-19 mortality. The idea is to help authorities select the best forecast models available, since many public websites provide this information and the predictions can be significantly disparate. Most of the models that passed the selection criteria proposed by the authors used traditional univariate epidemiological models such as SIR and SEIRD.

The many efforts to predict the spread of the disease in different countries, using mathematical, statistical, and data-driven approaches, help to understand its epidemiological characteristics and the policies to be adopted to reduce the number of deaths and slow the spread of the COVID-19 pandemic. In South America, for example, the number of daily deaths in the second wave, as of March 20, 2021, was much higher than in the first wave. Moreover, there was an attempt to reduce the number of daily deaths in Asian countries, while Oceania (Who - coronavirus disease, 2020) has it partially controlled (Fig. 2).

Fig. 2.

Weekly confirmed COVID-19 deaths in populated continents - Image extracted from https://ourworldindata.org/.

One research direction concerns the effects of air pollution on COVID-19 cases and death toll, since air pollution is known to cause several disorders, including chronic respiratory diseases, stroke and cardiovascular problems. Since the beginning of the pandemic, several studies have investigated the hypothesis that air pollution could increase the death counts and transmissibility of COVID-19 (Travaglio et al., 2021). Polluted air may have two consequences: increasing the transmissibility of the virus (short-term) or causing health conditions that increase the lethality of COVID-19 (long-term) (Barcelo, 2020). Hence, the hypothesis that pollution, temperature, and humidity can be correlated with COVID-19 deaths should be considered, as examined next.

In the first phases of the current SARS-CoV-2 pandemic, multiple studies suggested that outdoor airborne transmission could play an important role during the COVID-19 outbreak in northern Italy (Conticini et al., 2020; Setti et al., 1136). The main hypothesis was that virus-laden aerosol could interact with atmospheric particles, forming clusters in which pre-existing particles act as carriers and enhance the persistence of the virus in the atmosphere (Carducci et al., 1107). This led to the suggestion that atmospheric particulate matter concentrations are a kind of proxy for tracking virus dispersion in the atmosphere. Up to now, there are no specific data on the interaction of SARS-CoV-2 with pre-existing particles. However, it is known that atmospheric aerosols can contain biological material (bacteria and viruses) under certain conditions (Verreault et al., 2008; Després et al., 2012) and that the interaction between viruses and atmospheric particles can influence (increase or decrease) their infectivity (Groulx et al., 2018). Other studies found that the concentration and size distribution of both virus-laden aerosol and pre-existing particles strongly influence the probability of interaction (Belosi et al., 2021). Hence, there is also work on filters for purifying the air in indoor environments, specifically for reducing aerosol particulate matter (PM) and volatile organic compounds (VOCs) (Fermo et al., 2021).

For the long-term scenario, we highlight three works (Travaglio et al., 2021; Comunian et al., 2020; Konstantinoudis et al., 2021) showing a positive correlation between air pollution (mostly nitrogen dioxide (NO2) and particulate matter of 2.5 μm (PM2.5)) and the number of deaths and cases. Conversely, other works show a short-term negative correlation for specific pollutants and regions, due to the improvement of air quality in mid-2020 as a result of mitigation measures early in the pandemic (Zangari et al., 2020; Naqvi et al., 2021; Rodríguez-Urrego and Rodríguez-Urrego, 2020). For the short-term scenario, a few studies try to relate air pollution to the spread of COVID-19, since it is known that viruses and bacteria can survive longer in the air when they attach to particulate matter, specifically particulate matter of 10 μm (PM10) and PM2.5 (Verreault et al., 2008; Després et al., 2012; Groulx et al., 2018; Belosi et al., 2021).

The relationship between temperature and both new daily cases and new COVID-19 deaths is also highlighted in some studies as another weather impact. The effects of meteorological parameters on the spread of COVID-19 have been discussed thoroughly, for example by analyzing the effects of temperature changes in 166 countries and their relationship with COVID-19 (Wu et al., 2020). In addition, studies on the effects of weather conditions indicate that these characteristics could affect the spread of COVID-19 in the local territory (Brassey et al., 2020). Other works (Ma et al., 1016; Wang et al., 2020; Chan et al., 2011; Diao et al., 1016) examine the relationship between temperature and humidity, which may significantly change COVID-19 transmission, as well as the most favorable temperature for viral transmission, the resistance of the virus on contaminated surfaces and the duration of its presence.

Nevertheless, a case study in Brazil shows that temperature and humidity are highly correlated with the spread of COVID-19 (Auler et al., 1016). It is noted that a tropical climate, as found in most Brazilian cities, may be favourable to COVID-19 transmission. Furthermore, a work based on Brazilian cities (Prata et al., 1016) explores the linear and nonlinear relationship between the average temperature and confirmed cases using a generalized additive model (GAM), which shows a negative linear relationship with the number of confirmed cases and a flattening of the curve at around 25.8 degrees Celsius. The impact of meteorological conditions across the 27 capitals of Brazilian provinces was also analyzed for the first month of the disease outbreak; the presented results take into account temperature, population density and flights as features for COVID-19 forecasts (Pequeno et al., 2020).

Hence, taking into consideration the results of the above works, the main contributions of this paper are as follows. First, we try to confirm that environmental variables play a role in the spread of COVID-19, verifying some evidence of a correlation with COVID-19 deaths in the most affected Brazilian cities. Further, we use environmental features such as AQI (Air Quality Index), temperature, and humidity as input to a multivariate predictive model in an attempt to generate better predictions. We compare the performance of the best data-driven univariate model and the best multivariate model that we could find over several training attempts. The idea of our current work is to use K-means clusters to identify similar AQI curves from different countries. This allows us to use a transfer-learning-based approach across each cluster of our model. In this work, we apply the LSTM method using data from the specific cluster to which the prediction location is assigned. For example, for São Paulo state, only data from cluster 3 is used, as will be seen in the paper. We found that this increased feature set, together with the clustering approach, can improve our previous data-driven approach and provide a more accurate forecast, which is no longer limited to daily deaths as in our previous work (Pereira et al., 2020).

2. Materials and methods

Different scenarios could be adopted regarding the level of geographic granularity (communities, cities, provinces, countries, etc.). A more granular approach yields better reliability and less variability for each location, at the cost of higher computational effort and greater data requirements.

On the other hand, a less granular approach, such as adopting country-level data, generalizes information through aggregate metrics such as the average, maximum and minimum. The main advantage of such a less granular approach is the richer data available in global datasets. However, we emphasize that, depending on a country's size, information such as air quality may not adequately represent the country's overall behavior.

Nonetheless, to demonstrate the feasibility and validity of this study, we conduct a first investigation using the less granular approach. The first step is to understand how country data behave and to cluster countries based on similarities. Methods like K-means are a tool for investigating the data dimensions and generating clusters of similar members. Following that, a more granular analysis should be carried out.

2.1. COVID-19 deaths dataset

From the global perspective, we use the COVID-19 dataset available at the World Health Organization (WHO) website (Who - coronavirus disease, 2020). Consistently with the WHO dataset, we use data collected by Wesley Cota (Cota, 1590) to carry out the analysis for Brazilian cities and states. Both datasets start on January 22, 2020.

The WHO dataset comprises 274 different localities, including countries and major cities. China, for example, is divided into 33 provinces, each counted as a locality. In these cases, we sum the number of cases and deaths of the provinces to obtain a single set of values for the whole country. We apply this process to the following countries: Australia, Canada, China, Denmark, France, the Netherlands, and the United Kingdom, resulting in a total of 192 localities.

2.2. Air quality index (AQI)

The AQI is an index that describes the status of air quality. It is a dimensionless value that takes into account six criteria pollutants: PM2.5, PM10, SO2, CO, NO2 and O3. The computation is performed based on the Equation (1).

$$AQI_P = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} \left(C_{mp} - C_{low}\right) + I_{low} \qquad (1)$$

where $AQI_P$ is the air quality index for each pollutant; $I_{high}$ and $I_{low}$ are the high and low index breakpoints; $C_{high}$ and $C_{low}$ are the high and low concentration breakpoints; and $C_{mp}$ is the measured concentration of the pollutant. Once the AQI for each pollutant has been computed, the maximum value is selected as the final index value. Mathematically,

$$AQI = \max\left(AQI_{p_1}, AQI_{p_2}, AQI_{p_3}, \ldots, AQI_{p_N}\right) \qquad (2)$$

It is important to point out that each country has its own definition of the AQI. This includes choosing only some air quality parameters (for example, only CO2 and NO2) and changing the safety-margin ranges of these parameters used in Equation (1). In our approach, we have used the average of the AQI parameters available for each country and date. This was necessary because countries have neither the same AQI parameters nor the same volume of data available.
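As a minimal sketch of Equations (1) and (2), the per-pollutant index can be computed by linear interpolation inside the matching breakpoint row, and the per-pollutant values can then be combined either by the maximum or by the average described above. The breakpoint table and all function names below are assumptions made for illustration; they are not taken from the datasets used in this work, and each country defines its own table.

```python
# Illustrative breakpoint rows for one pollutant: (C_low, C_high, I_low, I_high).
# ASSUMPTION: these values only mimic the shape of a PM2.5 table; they are
# not the table used by the authors.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]


def pollutant_aqi(c_mp, breakpoints):
    """Equation (1): interpolate linearly inside the matching breakpoint row."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= c_mp <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (c_mp - c_low) + i_low
    raise ValueError(f"concentration {c_mp} is not covered by the table")


def overall_aqi(per_pollutant):
    """Equation (2): the final index is the maximum per-pollutant index."""
    return max(per_pollutant.values())


def mean_aqi(per_pollutant):
    """Alternative described in the text: average of the available indexes."""
    return sum(per_pollutant.values()) / len(per_pollutant)


if __name__ == "__main__":
    indexes = {"PM2.5": pollutant_aqi(20.0, PM25_BREAKPOINTS), "NO2": 35.0}
    print(overall_aqi(indexes), mean_aqi(indexes))
```

Averaging keeps a value for countries that report only some pollutants, whereas the maximum is driven by the single worst pollutant on a given day.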

2.3. World Air Quality Index project

We use the dataset available from the World Air Quality Index project (WAQI) (World air quality index p, 2021), a set of air quality data for over 130 countries, covering 2000 major cities and updated three times a day since January 2020. The dataset contains minimum, maximum, median, and standard deviation values for each of the air pollutant types (CO, NO2, O3, SO2, PM10 and PM2.5), as well as meteorological data, including humidity and temperature.

Furthermore, the WAQI dataset contains mislabeled data, which means that several countries are missing data on various components; for some countries, only meteorological data is available. This is reflected in a biased AQI estimate for a significant percentage of countries over 2020, the period in which COVID-19 patterns were present.

Because of these gaps, the filtered data cover a reduced set of 95 countries and 615 cities. Besides, some countries keep data for multiple major cities. For example, in Brazil, data is kept for the cities of São José dos Campos, Vitória, and São Paulo. In these cases, we looked for the country's capital first and, if it was not available, we used the city with the largest population.

We also notice that the cities with daily air pollutant records in the WAQI dataset correspond to the countries' capitals or major urban centers. Furthermore, the numbers of confirmed COVID-19 cases in these large urban centers are the primary contributors to the country totals, with a strong correlation between the numbers in the capital or largest urban center and the country totals. As a result, we took this approach, taking into account registered numbers, curve trend behavior, seasonality, and residual variables.

2.4. Air quality dataset from satellite data

Satellites are one of the main global monitoring tools, capable of acquiring information on almost all of the Earth's surface. To analyze air quality, we have used the data generated by the SENTINEL-5 mission, which is part of the European Earth Observation Program (Copernicus) (Copernicus open access hu), under the management and coordination of the European Commission (EC). The mission is focused on air quality and provides the following main data products: cloud, HCHO, SO2, O3, aerosol layer height, NO2, CO and CH4.

SENTINEL-5 acquires data in the form of bands around the Earth, based on geographical location and an interval of days. The data are then processed as numerical matrices and, using cartographic databases, we restrict the air quality analysis to a region such as a country, or to smaller ones such as states and cities.

2.5. Air quality dataset from PM filters

As mentioned in Section 2.3, in addition to the WAQI (World air quality index p, 2021) there are other established datasets that make data from PM2.5 sensors available. These sensors measure exposure to air pollution as a daily average in micrograms per cubic meter. They can be used to obtain the amount of suspended particulate matter in a unit volume of air for particle sizes within 0.3–10 μm (PM1 to PM10). In Brazil, the WAQI platform shows that most of the installed sensors are in the south/southeast and midwest regions. The regions that make data available to these platforms also have their own platforms to share data with the population, as is the case for CETESB - State of São Paulo [33], JEAP - State of Rio de Janeiro [34], CELEPAR - State of Paraná [35] and IEMA - State of Espirito Santo [36].

With particulate matter sensors, it is possible to get real-time data from the regions one wants to monitor and to compare the sensor data with other air pollution visualization methods, such as satellite data. It is also possible to obtain data from closed environments, such as educational, industrial or shopping centers. For sensor installation in Brazil, there is a technical standard from the Ministry of the Environment [37], since a wrongly installed sensor can provide false data to anyone who uses them. In this work, in order to show our approach, we use a subset of these data that conforms to the São Paulo City regulations.

2.6. Feature engineering

Based on country data from the WAQI and WHO datasets, we decided to work with a similar number of countries. As a result, we only kept countries that had information on at least five pollutant features (out of a total of six pollutant features that were measured). Fig. 3 illustrates the workflow.

Fig. 3.

Feature engineering workflow process.

At the end of this feature and country selection, we organize 7 dataframes, each of which contains information on 63 countries. The data correspond to:

  • COVID-19 daily deaths Dataframe;

  • Dataframes for each feature measured of: CO, NO2, O3, SO2, PM10, and PM2.5.

Finally, the country feature dataframes are fed into the AQI calculation, which uses for each day the six-level scale specified by the US-EPA 2016 standard: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous. These levels represent scales of health consequences and cautionary statements. The COVID-19 daily deaths were used together with the AQI six-level dataframe in the forecast experiments.

The AQI is noteworthy because of its cautionary statements for active children and adults, as well as for people with respiratory disorders such as asthma. The levels are classified as follows:

  • Good - no cautionary statement. Range of values: 0 to 50;

  • Moderate - should limit prolonged outdoor exertion. Range of values: 51 to 100;

  • Unhealthy - should limit prolonged outdoor exertion. Values are divided into two ranges: 101 to 150 (mainly affecting sensitive people) and 151 to 200 (population in general);

  • Very Unhealthy - should avoid all outdoor exertion; everyone else, especially children, should limit outdoor exertion. Values are divided into two ranges: 201 to 250 (very dangerous mainly to sensitive people) and 251 to 300 (very dangerous to the population in general);

  • Hazardous - everyone should avoid all outdoor exertion. Range of values: 301 and higher.
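The ranges listed above can be turned into a simple lookup. The sketch below is only an illustration: the function name is hypothetical, and it makes the 101 to 150 sub-range explicit as "Unhealthy for Sensitive Groups", following the six-level naming mentioned earlier in this section.

```python
# Upper bound of each range, in increasing order. Values above the last
# bound are treated as "Hazardous".
AQI_LEVELS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_level(aqi):
    """Return the category label for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    for upper, label in AQI_LEVELS:
        if aqi <= upper:
            return label
    return "Hazardous"


if __name__ == "__main__":
    for value in (42, 100, 151, 250, 301):
        print(value, aqi_level(value))
```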

2.7. Clustering data

Natural clusters in the countries' AQI output were identified using the unsupervised K-means algorithm. With this tool, we can find countries with similar air pollutant patterns, which can be selected according to the place where the model will be applied. In this way, we can use only a specific group, reducing the amount of data and improving model generalization. The value of k sets the number of clusters into which the elements are grouped.

K-means is an unsupervised approach that takes a set of n observations and splits them into k groups, assigning each observation to the group with the nearest mean and thereby identifying related trends within each group [38]. The idea is to assign each point to its current nearest center, update the cluster centers by calculating the mean of their member points, and repeat the process until the convergence conditions are met [38]. The algorithm can be applied in a variety of ways, including grouping countries. However, since the method is unsupervised, it should be combined with an unregulated neural network for pattern extraction.
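The assign-and-update loop described above can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions (random initial centers drawn from the data, squared Euclidean distance, stop when the centers no longer change); it is not the code used in this work, and the names are hypothetical. Each point would be one country's curve, represented as a list of numbers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centers) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # ASSUMPTION: initial centers are data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest current center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its member points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # convergence: centers stopped moving
            break
        centers = new_centers
    return labels, centers


if __name__ == "__main__":
    curves = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    print(kmeans(curves, k=2))
```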

2.8. LSTM based Neural Networks

Feeding one or more outputs back into the structure is common in Recurrent Neural Networks (RNNs). However, when combined with the backpropagation algorithm, this feedback can cause exploding (“blow up”) or vanishing gradients. In the first case, the gradient explosion can result in oscillating weight values. In the second, the vanishing gradient may make it impracticable for the network to learn the required time dependency. The Long Short-Term Memory (LSTM) was proposed to handle these problems [39].

In contrast to a traditional RNN, the LSTM preserves information for a larger number of historical data entries while maintaining the importance of recent data. It does so through two internal states, the cell state and the hidden state, which carry the information forward to the following network cells.

An LSTM is an architecture of the RNN family that conveys both long- and short-term information. To accomplish such complex information retrieval from sequential sources, the LSTM is built on gates that control the flow of information among the cells. To map the output of the LSTM cells to the range of the predicted number of deaths, we employ a linear layer at the output of the model. The graph of this approach is shown in Fig. 4.
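To make the role of the gates and of the two internal states concrete, the sketch below implements one time step of a standard LSTM cell with a single unit and scalar input, followed by a linear output layer. It is a didactic simplification with hypothetical names and untrained parameters; the models in this work are stacked, multi-unit networks built with TensorFlow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        w, u, b = params[name]
        return w * x + u * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate new information
    c = f * c_prev + i * g                      # cell state: long-term memory
    h = o * math.tanh(c)                        # hidden state: short-term output
    return h, c


def forecast(sequence, params, w_out, b_out):
    """Run the cell over a sequence, then apply a linear output layer."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return w_out * h + b_out


if __name__ == "__main__":
    # ASSUMPTION: arbitrary illustrative parameters, not trained values.
    params = {name: (0.5, 0.1, 0.0) for name in ("forget", "input", "output", "candidate")}
    print(forecast([0.2, 0.4, 0.6], params, w_out=1.0, b_out=0.0))
```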

Fig. 4.

Univariate model graph of the proposed architecture.

The data is captured, processed, and analyzed using the Python 3.7 platform, Pandas 1.2 framework, and Matplotlib (release 3.4.0) for plotting statistical graphics. The LSTM neural network was implemented and tested using Tensorflow version 2.4.

3. Experiments and results

To validate our approach, a series of setups and experiments has been carried out. The first experimental part concerns data regularization and clustering, which prepare the data for the LSTM network. This clustered data is then used to train the LSTM, whose parameters and specifications are tuned. The next sections describe the experiments with clustering and with the LSTM, with their particularities. Finally, different feature configurations are evaluated in order to verify the correlation.

3.1. Clustering with K-means

To understand the behavior and reaction related to changes in air quality, our idea is to apply K-means clustering to the shapes of the AQI curves, that is, to their behavior over time in terms of the trend and seasonality of the time series. Therefore, before applying the K-means algorithm, each country's AQI is subjected to a 30-day moving average, which mitigates the effects of random values and short-term fluctuations on the AQI over the specified time frame. Afterwards, each curve was normalized, which gives clearly defined, comparable attributes and converts the data into a uniform format ranging from 0 to 1.
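The two preprocessing steps described above can be sketched as follows. The text does not say whether the 30-day window is trailing or centered, so a trailing window is assumed here, and positions without a full window are dropped; the function names are hypothetical.

```python
def moving_average(series, window=30):
    """Trailing moving average; returns len(series) - window + 1 values.

    ASSUMPTION: trailing window, with incomplete leading windows dropped.
    """
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    running = sum(series[:window])
    smoothed = [running / window]
    for t in range(window, len(series)):
        running += series[t] - series[t - window]  # slide the window by one day
        smoothed.append(running / window)
    return smoothed


def min_max(series):
    """Rescale a series to the range [0, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


if __name__ == "__main__":
    aqi = [40, 55, 60, 58, 70, 90, 85, 66]  # made-up daily AQI values
    print(min_max(moving_average(aqi, window=3)))
```

Normalizing after smoothing means that two countries with very different absolute AQI levels can still end up with nearly identical curves, which is what allows clustering by shape.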

Subsequently, we evaluate the algorithm with both Dynamic Time Warping (DTW) and Euclidean distance metrics. The Euclidean distance fits better once applied to the normalized data, showing that the clustering is not based on the absolute AQI values.
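The difference between the two metrics can be seen on a pair of curves with the same shape but shifted in time. The sketch below uses the textbook dynamic-programming form of DTW with an absolute-difference local cost and no window constraint; these choices and the function names are assumptions for illustration, not the configuration used in this work.

```python
import math


def euclidean(a, b):
    """Point-by-point Euclidean distance between two equal-length curves."""
    if len(a) != len(b):
        raise ValueError("curves must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Dynamic Time Warping distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    early_peak = [0, 1, 2, 1, 0, 0]
    late_peak = [0, 0, 1, 2, 1, 0]
    print(euclidean(early_peak, late_peak), dtw(early_peak, late_peak))
```

DTW treats the shifted peak as identical, while the Euclidean distance penalizes the shift, so the Euclidean metric also groups curves by when their rises and falls occur.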

For the K-means algorithm, we set k to 9 in this analysis and use it to divide the 63 countries according to the calculated Euclidean distance. Fig. 5 depicts the cluster numbers and the countries that make up each cluster.

Fig. 5.

K-means-based AQI for country clusters. The table is divided into several columns, one for each cluster, with the countries that make up each column mentioned below. The number of countries selected for each cluster is shown at the bottom.

Cluster 1 (Fig. 6a), cluster 6 (Fig. 6b), and cluster 7 (Fig. 6c) were chosen to illustrate the grouping derived from the behavior of the AQI curves, as shown in Fig. 6. As these comparisons show, there are several patterns in the trends and seasonality of air pollutants for each country over specific periods, and COVID-19 daily deaths provide important data on air quality.

Fig. 6.

Country groups created by the K-means algorithm - AQI curves for clusters 1, 6, and 7.

As mentioned, the clustering was based on the trend and seasonality of the curves. Because K-means produced clusters based on similar shapes, countries grouped in a specific cluster may not share the same expected AQI values. As shown in Fig. 7 through descriptive statistics, the generated clusters can contain countries with contrasting data distributions, in which the expected (mean) AQI values of different countries do not fall in the same interval.

Fig. 7.

The pattern revealed by the boxplot diagrams shows that all values of each cluster are concentrated in specific ranges that determine the AQI. The higher the AQI value, the worse the air quality and the greater the health risk.

Since K-means clusters the elements by assigning each point to its nearest centroid, the approach allows for the organic selection of data, as well as for labeling newly inserted data by referring them to the existing centroids. This helps to explain how residents react to the behavior of the local curve in a place with a specific AQI, given that they are already subjected to this air quality. Any other country that falls into one of these clusters can be handled through neural network transfer learning, which allows for a better grasp of the data analysis and understanding.

Lastly, the behavior of the data in Fig. 7 can be better explained with an exploratory data analysis method, histograms with Kernel Density Estimation (KDE). Note that a kernel is centered on every single data point, and the kernels are then summed to estimate the probability density function $f_X(x)$ that describes the randomness of the data. When the upper tail of the AQI distribution (the tail toward high values) is more pronounced than the lower tail (the tail toward low values), the distribution has a positive asymmetry, as shown in Fig. 8a. It has a negative asymmetry if the reverse is true (Fig. 8b). Otherwise, the distribution is symmetric, closest to the normal distribution curve (Fig. 8c). As a result, we are not necessarily dealing with outlier data, but rather with asymmetrically distributed data, also known as skewed data.
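The kernel-summing idea and the notion of asymmetry can be illustrated with a small sketch. A Gaussian kernel and a hand-picked bandwidth are assumed here, since the text does not specify either, and the skewness measure is the standard moment-based coefficient; all names and sample values are hypothetical.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density function built by summing one Gaussian kernel per point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


def skewness(data):
    """Moment-based skewness; by the usual convention it is positive when the
    tail toward high values is the longer one, and zero for symmetric data."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    aqi_sample = [30, 32, 35, 36, 40, 41, 45, 60, 95]  # made-up AQI values
    f = gaussian_kde(aqi_sample, bandwidth=8.0)        # ASSUMPTION: bandwidth chosen by hand
    print(f(35.0), skewness(aqi_sample))
```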

Fig. 8.

AQI histograms with Kernel Density Estimation for three different countries inserted in Cluster 1 - Jordan, Lithuania and Thailand.

3.2. Evaluating the prediction models for the city of São Paulo

Due to the small size of the training dataset, we decided to train the model with 90% of the data and to use the remaining 10% to evaluate its predictive skill. We use data from March 27, 2020 to June 03, 2021, which accounts for 435 days, or 62 standard weeks of data (Sunday to Saturday). The training dataset is composed of the first 56 weeks and the test dataset of the last 6 weeks. To alleviate trends and seasonality, we use as input the biweekly (14-day) moving average of the COVID-19 deaths and cases time series.

Future deaths are forecast on a weekly basis. The first week of the training dataset ($d_0, \ldots, d_6$) is used as the input of the model to predict the next one ($d_7, \ldots, d_{13}$). The input/output samples are organized using a sliding window with a step of 1, so the next input sample is ($d_1, \ldots, d_7$) and the expected output is ($d_8, \ldots, d_{14}$). This approach increases the number of training examples from 56 to 392 samples.
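A possible way to build these input/output pairs is sketched below. The function name and its parameters are hypothetical, and only complete windows are kept; how the windows at the end of the series are handled is an assumption, so the exact sample count can differ from the one reported above.

```python
def make_windows(series, n_in=7, n_out=7, step=1):
    """Build (input, target) pairs with a sliding window over a daily series.

    Each input holds n_in consecutive days and its target holds the n_out
    days that immediately follow. Incomplete windows are dropped.
    """
    samples = []
    last_start = len(series) - n_in - n_out
    for start in range(0, last_start + 1, step):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        samples.append((x, y))
    return samples


if __name__ == "__main__":
    days = list(range(21))  # stand-in for d_0 ... d_20
    pairs = make_windows(days)
    print(pairs[0])  # ([0..6], [7..13])
    print(pairs[1])  # ([1..7], [8..14])
```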

The training dataset has size n and the test dataset has size j. $W$ is the tensor of the total dataset (train + test) and $W_i$ is week i of the dataset. We used a walk-forward approach to validate the models after training. In this approach, the last week of the training dataset, $W_n$, is used as input of the forecast model to predict the next one ($\hat{W}_1$). We then use $W_{n+1}$ to predict $\hat{W}_2$, $W_{n+2}$ to predict $\hat{W}_3$, and so on. The output of this process is a tensor of predicted data $\hat{W}$ of size j. Therefore, it is possible to compare the forecast of the model against real data in the test dataset and estimate the forecast performance for multiple future weeks. The Root Mean Square Error (RMSE) between real data from the test dataset ($W_{n+1}, W_{n+2}, \ldots, W_{n+j}$) and the respective predicted weeks ($\hat{W}_1, \hat{W}_2, \ldots, \hat{W}_j$) is used as a performance metric for the models. A small average RMSE for the predicted weeks indicates good generalization, and a small RMSE for the whole dataset indicates a good fit. This evaluation approach was used in both the univariate and multivariate cases.
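The walk-forward procedure can be sketched with any function that maps one week to a forecast of the next. In the example below, a naive persistence model (repeat the last observed week) stands in for the trained LSTM; it is a hypothetical placeholder used only to keep the sketch self-contained, as are the function names.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def walk_forward(weeks, n_train, predict):
    """Evaluate one-week-ahead forecasts over the test weeks.

    weeks   : list of weeks, each a list of daily values (train followed by test)
    n_train : number of training weeks
    predict : callable mapping the real previous week to a forecast of the next
    Returns one RMSE per test week.
    """
    errors = []
    for i in range(n_train, len(weeks)):
        forecast = predict(weeks[i - 1])  # input is always real data, never a forecast
        errors.append(rmse(weeks[i], forecast))
    return errors


def persistence(week):
    """Hypothetical stand-in model: forecast next week as a copy of this week."""
    return list(week)


if __name__ == "__main__":
    data = [[1.0] * 7, [2.0] * 7, [4.0] * 7, [4.0] * 7]
    per_week = walk_forward(data, n_train=2, predict=persistence)
    print(per_week, sum(per_week) / len(per_week))
```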

Due to the stochastic nature of neural network weight optimization, it is good practice to train multiple versions of the same model, since models with similar training performance can produce significantly different forecasts. To increase the robustness of this process, we first performed a simple grid search on the hyperparameters of the network to find the most suitable training setup for each input configuration. The training configuration is a vector [$c_1, c_2, c_3, c_4, c_5, c_6$], where $c_1$ is the number of input days, $c_2$ is the number of units in each layer, $c_3$ is the number of epochs, $c_4$ is the batch size, $c_5$ is the dropout rate and $c_6$ indicates whether the input series are normalized. We tested 8 configurations, varying the number of units in each layer (50 and 100), the batch size (16 and 32) and whether the inputs are normalized (None and MinMax). Each set of hyperparameters is trained for only 10 trials of 100 epochs each, and the average performance of the configuration is stored and compared with that of the other configurations. This grid-search process was performed for each of the eight input configurations. After selecting the most suitable hyperparameters for each input configuration, we performed more in-depth training, with 500 epochs and 50 trials. The network architecture is as described in Section 2.8 and the evaluation of the models found is presented below.
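The eight configurations follow from combining the three varied hyperparameters, and the selection step averages an error over repeated trials. The sketch below assumes that the number of input days, the number of epochs and the dropout rate are held fixed during the search, and it uses a toy scoring function in place of actual training; the function names are hypothetical.

```python
from itertools import product


def build_grid(input_days=7, epochs=100, dropout=0.0):
    """Enumerate configurations [c1, c2, c3, c4, c5, c6].

    ASSUMPTION: c1, c3 and c5 are fixed; only units, batch size and
    normalization vary, which gives 2 x 2 x 2 = 8 configurations.
    """
    units = (50, 100)
    batch_sizes = (16, 32)
    scalers = (None, "MinMax")
    return [
        [input_days, u, epochs, b, dropout, s]
        for u, b, s in product(units, batch_sizes, scalers)
    ]


def select_best(grid, evaluate, trials=10):
    """Return the configuration with the lowest mean error over the trials.

    evaluate(config, trial) must return an error value, for example the
    average forecast RMSE of one training run.
    """
    def mean_error(config):
        return sum(evaluate(config, t) for t in range(trials)) / trials

    return min(grid, key=mean_error)


if __name__ == "__main__":
    def toy_evaluate(config, trial):
        # Placeholder score, NOT a real training run.
        return config[1] + config[3] + (0 if config[5] is None else 1)

    print(select_best(build_grid(), toy_evaluate))
```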

3.2.1. Univariate approach

For the univariate case we used the deaths time series (input) to predict the deaths time series (expected output). This case is relatively simple: the network only has problems fitting the curve if the time series has a high correlation with shifted versions of itself (auto-correlation). The best training configuration found in the grid search was [7, 50, 100, 16, 0.0, None]. The average RMSE for the whole dataset and for the test dataset is shown in the boxplots labeled D in Fig. 9. The forecast curve of the best model found among the 50 trained models is shown in the first two graphs of Fig. 10 (biweekly moving average and accumulated deaths).

Fig. 9.

Boxplots of the average weekly forecast performance for the four input feature combinations. Plot (a) shows the forecast error for the whole dataset (train + test), representing the fitting performance, and plot (b) shows the forecast error for the test dataset only, representing the generalization performance of the model. Stars are the mean values, the red line is the median and crosses are outliers. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

Fig. 10.

Forecast for the biweekly moving average of COVID-19 deaths and accumulated COVID-19 deaths for the city of São Paulo. These are the best models for each input configuration.

3.2.2. Multivariate approach

For the multivariate approach we performed three experiments, gradually adding more features to the input and observing the predictive skill of the models found. The three input configurations tested are: deaths and AQI (D + A); deaths, AQI, temperature and humidity (D + A + T + H); and deaths, cases, AQI, temperature and humidity (D + C + A + T + H). Fig. 9 shows the average RMSE for these cases, both for the whole dataset and for the test dataset, and the curves predicted by the models can be observed in Fig. 10 and Fig. 11.
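A minimal sketch of how the input tensors for these configurations could be assembled is given below. It assumes that the daily series are aligned arrays of equal length, that both the input and the output windows are 7 days long, and that consecutive windows are shifted by one week; the column names, the `make_windows` helper and the synthetic values are illustrative and are not taken from the project code.

```python
import numpy as np

# Input configurations from the text; the univariate case (D) is included
# for comparison. Column names are hypothetical.
FEATURE_SETS = {
    "D": ["deaths"],
    "D+A": ["deaths", "aqi"],
    "D+A+T+H": ["deaths", "aqi", "temperature", "humidity"],
    "D+C+A+T+H": ["deaths", "cases", "aqi", "temperature", "humidity"],
}

def make_windows(series, feature_names, input_days=7, output_days=7, step=7):
    """Build (samples, input_days, features) inputs and (samples, output_days)
    targets, where the target is always the deaths series."""
    data = np.column_stack(
        [np.asarray(series[name], dtype=float) for name in feature_names]
    )
    target = np.asarray(series["deaths"], dtype=float)
    last_start = len(target) - input_days - output_days
    X, y = [], []
    for start in range(0, last_start + 1, step):
        X.append(data[start:start + input_days])
        y.append(target[start + input_days:start + input_days + output_days])
    return np.asarray(X), np.asarray(y)

# Synthetic daily series (4 weeks), only to show the resulting shapes.
days = np.arange(28, dtype=float)
series = {
    "deaths": days,
    "cases": days * 10,
    "aqi": 40 + days % 5,
    "temperature": 20 + days % 3,
    "humidity": 60 + days % 4,
}
shapes = {name: make_windows(series, cols)[0].shape
          for name, cols in FEATURE_SETS.items()}
X_da, y_da = make_windows(series, FEATURE_SETS["D+A"])
```

Only the last dimension of the input changes between configurations, so the same network architecture can be reused with a different number of input features.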

Fig. 11.

Forecast for the biweekly moving average of COVID-19 deaths and accumulated COVID-19 deaths for the city of São Paulo. These are the best models for each input configuration.

3.3. Discussion about the results

The experiments provided some interesting results. First, regarding the evaluation, we decided to plot the RMSE both for the whole dataset (the entire curve in Fig. 10) and for the test dataset only (the red part of Fig. 10). The first RMSE represents the fitting performance of the model, namely, the capacity of the model to approximate the training data. The second RMSE represents the generalization performance of the model, i.e., how the model performs on data not seen during training. Another important piece of information for evaluating model performance is the accumulated deaths. We use an approach similar to the one that has been used to compare the performance of several forecast models (Friedman et al., 2021). The idea is to see what percentage of the real accumulated COVID-19 deaths the prediction model reaches after the forecast. We extended this approach by also showing the entire curve, to see whether the deaths accumulated by the forecast also follow the profile of the real accumulated curve (second plot of Fig. 10).
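The accumulated-deaths comparison can be sketched as follows. The text does not state whether the percentage is taken over the forecast window alone or over the whole accumulated curve, so the sketch assumes the latter and exposes the deaths accumulated before the forecast as a parameter (`deaths_before`); the function names and the numbers are illustrative only.

```python
import numpy as np

def accumulated_curves(real_daily, forecast_daily, deaths_before=0.0):
    """Accumulated real and forecast deaths over the forecast window."""
    real_curve = deaths_before + np.cumsum(np.asarray(real_daily, dtype=float))
    forecast_curve = deaths_before + np.cumsum(
        np.asarray(forecast_daily, dtype=float)
    )
    return real_curve, forecast_curve

def accumulated_gap_percent(real_daily, forecast_daily, deaths_before=0.0):
    """Signed distance, in percent, between the forecast and the real
    accumulated deaths at the end of the forecast window."""
    real_curve, forecast_curve = accumulated_curves(
        real_daily, forecast_daily, deaths_before
    )
    return float(100.0 * (forecast_curve[-1] - real_curve[-1]) / real_curve[-1])

# Made-up numbers: 930 deaths before the forecast, then one forecast week.
real_week = [10] * 7      # 70 real deaths in the week
forecast_week = [9] * 7   # 63 forecast deaths in the week
real_curve, forecast_curve = accumulated_curves(real_week, forecast_week, 930.0)
gap = accumulated_gap_percent(real_week, forecast_week, deaths_before=930.0)
```

A negative value means that the forecast accumulates fewer deaths than were actually observed; plotting `real_curve` against `forecast_curve` shows whether the profile is followed as well.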

The motivation for evaluating both the fitting error and the forecast error can be seen in the upper graph of Fig. 10. This graph shows the predicted curve of the best model for the univariate case. Despite having a good prediction error of only 8.42, this model was not able to converge to a good fitting solution. This can be partially explained by boxplot D of Fig. 9. Notice that this boxplot has a low standard deviation, which is an indication of the poor quality of the input space created by using only one feature. Since the search space is less diverse, the training is more effective in finding sub-optimal solutions with similar performance, and this partially explains why this case had the best overall expected forecast performance (Fig. 9b).

Next we provide an analysis of the multivariate cases. The boxplots presented in Fig. 9a clearly show that the average fitting performance of the models improves when more input features are added (lower mean RMSE). The case where deaths, air quality, temperature and humidity (D + A + T + H) are used as input has the best overall fitting performance. As can be seen in Fig. 9b, it also generates the model with the best forecast performance, even though on average it lost to the univariate case. A problem that we noticed is that the input space is so appropriate for the network architecture that it ends up generating models that overfit, i.e., models that fit the curve perfectly for the training dataset but perform poorly on data not seen during training (that is, they do not generalize). The forecast shown in the upper graph of Fig. 11 is that of the best overall model, which has a good trade-off between fitting and forecast performance. Even though its predictive error is a bit higher than that of some other models, this model also has a near-perfect overall accumulated COVID-19 deaths value, being only −0.06% away from the real accumulated value after the forecast. We can therefore say with confidence that this model is the most skilled one that we found, that is, it has superior performance in the 7-day forecast of the COVID-19 deaths time series of the city of São Paulo.

As a final remark, the network does not improve its performance when the time series of the number of COVID-19 cases is added to the mix (D + C + A + T + H). It takes much longer to converge and does not converge to better solutions than the configuration without the cases time series. Intuitively, using the cases time series as input should help the forecast, since more people with COVID-19 means that more people can die of COVID-19. However, this does not translate into better performance of the neural network function approximation. One explanation is that the cases time series has a high linear correlation with the deaths time series and, in terms of the neural network training, this does not add much information to the search space, despite making it more complex (with one more dimension). It is also possible that changing the architecture of the network could lead to better models in this case; this investigation is left for future work.
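The redundancy argument can be checked with a Pearson correlation coefficient between the two series, as in the sketch below. The data are synthetic and only illustrate the computation: deaths are generated as a fixed fraction of cases plus a small oscillation, which is an assumption made for the example and not a property measured on the São Paulo data.

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation coefficient between two series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic example (assumption): deaths roughly proportional to cases.
days = np.arange(60)
cases = 1000.0 + 50.0 * days
deaths = 0.02 * cases + np.sin(days)

r = pearson(cases, deaths)
```

A coefficient close to 1 indicates that the second series is almost a rescaled copy of the first, so adding it as an input increases the dimension of the search space while contributing little new information.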

One of the goals of our project is to have a common repository (under construction at http://ncovid.natalnet.br), which will host all information, including not only code but also our results in the form of graphs and web pages, and which will be made available to the community. A preview is available at www.natalnet.br/covid; it currently runs on data that predate this work, and we are in the process of updating it. In addition, the code and data used in this research can be found at https://github.com/Natalnet/ncovid-air-paper.

4. Conclusion

In this work we have verified the influence of environmental factors such as temperature, humidity and air quality on the forecast of the COVID-19 pandemic behavior. For that, we have used a data-driven approach based on a simple model using an RNN with LSTM cells. The model has been applied to COVID-19 data from Brazil comprising the daily deaths, the daily temperature and humidity, and an index for air quality.

From the experiments we could observe some interesting issues. First, as expected, we noticed that isolation and lockdown measures cause abrupt changes in the data that the temporal average approach can neither explain nor mitigate. An annual average may help with this issue, geographically speaking. Nevertheless, the results of the experiments with multiple feature configurations indicate that a data-driven approach can be used to strengthen the hypothesis of a correlation between air pollution and COVID-19 fatalities.

As noted, the number of groups used in K-means for the reported results is 9. We also tested other numbers of groups, but the results were less satisfactory than with 9, which gave the best performance. In future work, we intend to investigate and apply the results of K-means: after adding a new location and inserting it into a cluster, we can identify a previously trained model and use it in forecasting tasks.
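The intended reuse of trained models could look like the sketch below: the AQI curve of a new location is assigned to the nearest K-means centroid (Euclidean distance, the assignment rule of K-means), and the model associated with that cluster is selected. The centroids, the model registry and the AQI values are made up for illustration; in practice they would come from the clustering and training steps described in this paper.

```python
import numpy as np

def assign_cluster(new_curve, centroids):
    """Index of the centroid closest to the new AQI curve (Euclidean)."""
    new_curve = np.asarray(new_curve, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    distances = np.linalg.norm(centroids - new_curve, axis=1)
    return int(np.argmin(distances))

# Hypothetical centroids of three AQI-curve clusters (four time steps each).
centroids = np.array([
    [10.0, 10.0, 10.0, 10.0],
    [50.0, 55.0, 60.0, 65.0],
    [100.0, 90.0, 80.0, 70.0],
])

# Hypothetical registry mapping each cluster to a previously trained model.
models_by_cluster = {
    0: "model_cluster_0",
    1: "model_cluster_1",
    2: "model_cluster_2",
}

new_city_aqi = [48.0, 57.0, 61.0, 63.0]
cluster = assign_cluster(new_city_aqi, centroids)
chosen_model = models_by_cluster[cluster]
```

The selected model would then serve as the starting point for forecasting in the new location, for instance through transfer learning.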

This new set of features has been shown to have a strong influence on COVID-19 dynamics, so we conclude that it should be taken into consideration by any method that tries to understand the behavior of the pandemic, especially the AI-based ones developed in our previous research (Pereira et al., 2020). Our future work goes exactly in this direction, with the feature set improved by the current work. We will revisit the use of MAE, LSTM, and other deep learning approaches for predicting the long-term behavior of the COVID-19 pandemic.

Funding

This work has been developed with partial support of Coordination for the Improvement of Higher Education Personnel (CAPES-Brazil), grants number 001 and 88881.506890/2020–01, and by National Research Council (CNPq-Brazil) grant number 311640/2018–4.

Availability of data and material

Code and data used in this research can be found in https://github.com/Natalnet/ncovid-air-paper.

Credit authors statement

All authors have made substantial contributions to the conception or design of the work; the acquisition, analysis, or interpretation of data; the creation of new software used in the work; drafted the work or revised it critically for important intellectual content; and approved the version to be published, as detailed next. Dunfrey P. Aragão: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft, Writing - review and editing; Emerson V. Oliveira: Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing - original draft; Arthur A. Bezerra: Data curation, Investigation, Visualization, Writing - original draft; Davi H. Santos: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing - original draft, Writing - review and editing; Andouglas G. Silva-Júnior: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft; Igor G. Pereira: Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing - review and editing; Prisco Piscitelli: Conceptualization, Data curation, Funding acquisition, Methodology, Resources, Supervision, Writing - review and editing; Alessandro Miani: Conceptualization, Data curation, Funding acquisition, Methodology, Resources, Supervision, Writing - review and editing; Cosimo Distante: Conceptualization, Data curation, Funding acquisition, Methodology, Resources, Supervision, Writing - review and editing; Jordan S. Cuno: Data curation, Formal analysis, Investigation, Software, Validation, Visualization, Writing - original draft; Aura Conci: Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing - review and editing; Luiz M. G. Gonçalves: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review and editing. In addition, all authors agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Al-qaness M.A., Saba A.I., Elsheikh A.H., Elaziz M.A., Ibrahim R.A., Lu S., Hemedan A.A., Shanmugan S., Ewees A.A. Efficient artificial intelligence forecasting models for COVID-19 outbreak in Russia and Brazil. Process Saf. Environ. Protect. 2021;149:399–409. doi: 10.1016/j.psep.2020.11.007.
  2. Auler A.C., Cássaro F.A., da Silva V.O., Pires L.F. Evidence that high temperatures and intermediate relative humidity might favor the spread of COVID-19 in tropical climate: a case study for the most affected Brazilian cities. Sci. Total Environ. 729. doi: 10.1016/j.scitotenv.2020.139090.
  3. Barcelo D. An environmental and health perspective for covid-19 outbreak: meteorology and air quality influence, sewage epidemiology indicator, hospitals disinfection, drug therapies and recommendations. Journal of Environmental Chemical Engineering. 2020;8(4):104006. doi: 10.1016/j.jece.2020.104006.
  4. Belosi F., Conte M., Gianelle V., Santachiara G., Contini D. On the concentration of sars-cov-2 in outdoor air and the interaction with pre-existing atmospheric particles. Environ. Res. 2021;193:110603. doi: 10.1016/j.envres.2020.110603.
  5. Brassey J., Heneghan C., Mahtani K.R., Aronson J.K. Do weather conditions influence the transmission of the coronavirus (SARS-CoV-2) Tech. rep., The Centre for Evidence-Based Medicine. 2020. www.cebm.net/oxford-covid-19/
  6. Carducci A., Federigi I., Verani M. Covid-19 airborne transmission and its prevention: waiting for evidence or applying the precautionary principle? Atmosphere. 11(7). doi: 10.3390/atmos11070710. URL https://www.mdpi.com/2073-4433/11/7/710.
  7. Chan K.H., Peiris J.S., Lam S.Y., Poon L.L., Yuen K.Y., Seto W.H. The effects of temperature and relative humidity on the viability of the SARS coronavirus. Advances in Virology. 2011. doi: 10.1155/2011/734690.
  8. Comunian S., Dongo D., Milani C., Palestini P. Air pollution and covid-19: the role of particulate matter in the spread and increase of covid-19's morbidity and mortality. Int. J. Environ. Res. Publ. Health. 2020;17(12):4487. doi: 10.3390/ijerph17124487.
  9. Conticini E., Frediani B., Caro D. Can atmospheric pollution be considered a co-factor in extremely high level of sars-cov-2 lethality in northern Italy? Environ. Pollut. 2020;261:114465. doi: 10.1016/j.envpol.2020.114465. URL https://www.sciencedirect.com/science/article/pii/S0269749120320601.
  10. Copernicus Open Access Hub. https://scihub.copernicus.eu/.
  11. Cota W. Monitoring the number of COVID-19 cases and deaths in Brazil at municipal and federative units level. SciELO Preprints: 362. doi: 10.1590/scielopreprints.362. URL https://doi.org/10.1590/scielopreprints.362.
  12. Després V., Huffman J., Burrows S.M., Hoose C., Safatov A., Buryak G., Fröhlich-Nowoisky J., Elbert W., Andreae M., Pöschl U., Jaenicke R. Primary biological aerosol particles in the atmosphere: a review. Tellus B. 2012;64(1):15598. doi: 10.3402/tellusb.v64i0.15598. URL https://doi.org/10.3402/tellusb.v64i0.15598.
  13. Diao Y., Kodera S., Anzai D., Gomez-Tames J., Rashed E.A., Hirata A. Influence of population density, temperature, and absolute humidity on spread and decay duration of COVID-19: a comparative study of scenarios in China, England, Germany, and Japan. One Health. 12. doi: 10.1016/j.onehlt.2020.100203.
  14. Fermo P., Artíñano B., De Gennaro G., Pantaleo A.M., Parente A., Battaglia F., Colicino E., Di Tanna G., Goncalves da Silva Junior A., Pereira I.G., Garcia G.S., Garcia Goncalves L.M., Comite V., Miani A. Improving indoor air quality through an air purifier able to reduce aerosol particulate matter (pm) and volatile organic compounds (vocs): experimental results. Environ. Res. 2021;197:111131. doi: 10.1016/j.envres.2021.111131. URL https://www.sciencedirect.com/science/article/pii/S0013935121004254.
  15. Friedman J., Liu P., Troeger C.E., Carter A., Reiner R.C., Barber R.M., Collins J., Lim S.S., Pigott D.M., Vos T., et al. Predictive performance of international covid-19 mortality forecasting models. Nat. Commun. 2021;12(1):1–13. doi: 10.1038/s41467-021-22457-w.
  16. Groulx N., Urch B., Duchaine C., Mubareka S., Scott J.A. The pollution particulate concentrator (popcon): a platform to investigate the effects of particulate air pollutants on viral infectivity. Sci. Total Environ. 2018;628–629:1101–1107. doi: 10.1016/j.scitotenv.2018.02.118. URL https://www.sciencedirect.com/science/article/pii/S0048969718305084.
  17. Guan W.-j., Ni Z.-y., Hu Y., Liang W.-h., Ou C.-q., He J.-x., Liu L., Shan H., Lei C.-l., Hui D.S., Du B., Li L.-j., Zeng G., Yuen K.-Y., Chen R.-c., Tang C.-l., Wang T., Chen P.-y., Xiang J., Li S.-y., Wang J.-l., Liang Z.-j., Peng Y.-x., Wei L., Liu Y., Hu Y.-h., Peng P., Wang J.-m., Liu J.-y., Chen Z., Li G., Zheng Z.-j., Qiu S.-q., Luo J., Ye C.-j., Zhu S.-y., Zhong N.-s. Clinical characteristics of coronavirus disease 2019 in China. N. Engl. J. Med. 2020;382(18):1708–1720. doi: 10.1056/NEJMoa2002032.
  18. Konstantinoudis G., Padellini T., Bennett J., Davies B., Ezzati M., Blangiardo M. Long-term exposure to air-pollution and covid-19 mortality in england: a hierarchical spatial analysis. Environ. Int. 2021;146:106316. doi: 10.1016/j.envint.2020.106316.
  19. Ma Y., Zhao Y., Liu J., He X., Wang B., Fu S., Yan J., Niu J., Zhou J., Luo B. Effects of temperature variation and humidity on the death of COVID-19 in Wuhan, China. Sci. Total Environ. 724. doi: 10.1016/j.scitotenv.2020.138226.
  20. Naqvi H.R., Datta M., Mutreja G., Siddiqui M.A., Naqvi D.F., Naqvi A.R. Improved air quality and associated mortalities in India under covid-19 lockdown. Environ. Pollut. 2021;268:115691. doi: 10.1016/j.envpol.2020.115691.
  21. Pequeno P., Mendel B., Rosa C., Bosholn M., Souza J.L., Baccaro F., Barbosa R., Magnusson W. Air transportation, population density and temperature predict the spread of COVID-19 in Brazil. PeerJ. 2020(6). doi: 10.7717/peerj.9322.
  22. Pereira I.G., Guerin J.M., Silva Júnior A.G., Garcia G.S., Piscitelli P., Miani A., Distante C., Gonçalves L.M.G. Forecasting covid-19 dynamics in Brazil: a data driven approach. Int. J. Environ. Res. Publ. Health. 2020;17(14):5115. doi: 10.3390/ijerph17145115. URL https://www.mdpi.com/1660-4601/17/14/5115.
  23. Prata D.N., Rodrigues W., Bermejo P.H. Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil. Sci. Total Environ. 729. doi: 10.1016/j.scitotenv.2020.138862. URL https://pubmed.ncbi.nlm.nih.gov/32361443/.
  24. Rodríguez-Urrego D., Rodríguez-Urrego L. Air quality during the covid-19: PM2.5 analysis in the 50 most polluted capital cities in the world. Environ. Pollut. 2020:115042. doi: 10.1016/j.envpol.2020.115042.
  25. Setti L., Passarini F., De Gennaro G., Barbieri P., Licen S., Perrone M.G., Piazzalunga A., Borelli M., Palmisani J., Di Gilio A., Rizzo E., Colao A., Piscitelli P., Miani A. Potential role of particulate matter in the spreading of covid-19 in northern Italy: first observational study based on initial epidemic diffusion. BMJ Open. 10(9). doi: 10.1136/bmjopen-2020-039338. URL https://bmjopen.bmj.com/content/10/9/e039338.
  26. Travaglio M., Yu Y., Popovic R., Selley L., Leal N.S., Martins L.M. Links between air pollution and covid-19 in england. Environ. Pollut. 2021;268:115859. doi: 10.1016/j.envpol.2020.115859.
  27. Verreault D., Moineau S., Duchaine C. Methods for sampling of airborne viruses. Microbiol. Mol. Biol. Rev. 2008;72(3):413–444. doi: 10.1128/MMBR.00002-08. URL https://mmbr.asm.org/content/72/3/413.
  28. Wang M., Jiang A., Gong L., Lu L., Guo W., Li C., Zheng J., Li C., Yang B., Zeng J., Chen Y., Zheng K., Li H. Temperature Significantly Change Covid-19 Transmission in 429 Cities. medRxiv. doi: 10.1101/2020.02.22.20025791. URL https://www.medrxiv.org/content/early/2020/02/25/2020.02.22.20025791.
  29. WHO - Coronavirus Disease (COVID-19) Pandemic. https://www.who.int/emergencies/diseases/novel-coronavirus-2019 2020-12-25.
  30. World Air Quality Index Project. https://aqicn.org/data-platform/covid19/ 2021-03-10.
  31. Wu Y., Jing W., Liu J., Ma Q., Yuan J., Wang Y., Du M., Liu M. Effects of temperature and humidity on the daily new cases and new deaths of COVID-19 in 166 countries. Sci. Total Environ. 2020;729:139051. doi: 10.1016/j.scitotenv.2020.139051.
  32. Zangari S., Hill D.T., Charette A.T., Mirowsky J.E. Air quality changes in New York city during the covid-19 pandemic. Sci. Total Environ. 2020;742:140496. doi: 10.1016/j.scitotenv.2020.140496.


Articles from Environmental Research are provided here courtesy of Elsevier
