Skip to main content
Scientific Reports logoLink to Scientific Reports
. 2024 Mar 13;14:6142. doi: 10.1038/s41598-024-56267-z

Social and economic variables explain COVID-19 diffusion in European regions

Christian Cancedda 1, Alessio Cappellato 1, Luigi Maninchedda 2, Leonardo Meacci 2, Sofia Peracchi 3, Claudia Salerni 4, Elena Baralis 1, Flavio Giobergia 1,, Stefano Ceri 4
PMCID: PMC10937953  PMID: 38480771

Abstract

At the beginning of 2020, Italy was the country with the highest number of COVID-19 cases, not only in Europe, but also in the rest of the world, and Lombardy was the most heavily hit region of Italy. The objective of this research is to understand which variables have determined the prevalence of cases in Lombardy and in other highly-affected European regions. We consider the first and second waves of the COVID-19 pandemic, using a set of 22 variables related to economy, population, healthcare and education. Regions with a high prevalence of cases are extracted by means of binary classifiers, then the most relevant variables for the classification are determined, and the robustness of the analysis is assessed. Our results show that the most meaningful features to identify high-prevalence regions include high number of hours spent in work environments, high life expectancy, and low number of people leaving from education and neither employed nor educated or trained.

Subject terms: Scientific data, Risk factors

Introduction

At the end of 2019, the COVID-19 disease started spreading around the world. The first case of COVID-19 in Italy was confirmed in Lombardy on February 20, 2020. The first patient was admitted to a hospital in Codogno (Lombardy, Italy) with a “mild pneumonia resistant to therapy, no relevant travel history and no apparent exposure to diseased contacts”1. In the following week, the Codogno area saw a rapid increase of COVID-19 cases, as well as other areas in the southern part of Lombardy1. At the beginning of 2020, Italy was the nation with the highest number of cases among not only European countries, but the rest of the world2. Lombardy, with its 10 million inhabitants, was the most heavily hit region in Italy, with one third of all cases and half of all deaths2.

The main objective of this work is to identify the main variables that explain the spreading of the virus in the European regions. We collected variables belonging to six categories: education, demography, healthcare, mobility, primary sector, economy. Specific variables were selected on the basis of their availability on an adequate spatial scale. We characterized European regions at a sub-national granularity that has allowed to reach a good balance between the research goals and data availability.

The adopted methodology consists in using several machine learning algorithms to learn about the main associations between the collected variables and the risk of exposure of each European region (quantified by the density of COVID-19 cases registered during the first and second waves). Next, we analyze the machine learning models that have been trained to extract useful insights in terms of the importance that each model assigns to the various available features. Since we study both the first and the second COVID-19 waves, we additionally make comparisons between the role that the features played in both situations.

As proximity is key to viral diffusion3, several works study the spreading of COVID-19 in space and time. Focusing on Italy, De Angelis4 and Bontempi5 analyzed Lombardy data at the municipality level, while Coccia6 and Cartenì7 considered the whole Italian territory, either at the provincial level or at the regional one. Other studies addressed several European countries: Bontempi8 focused on Italy, Spain and France, while Kapitsinis9 included 9 European countries. For what concerns the temporal dimension, some works focused on the initial period of the first wave (between January and April 2020)46,9, other works concentrated on the second wave (between September and December 2020)7, and other works focused on discontinuous short intervals: for example, Bontempi8 focused on April 2020, November 2020 and January 2021, the peaks of COVID-19 diffusion in Europe. Such strategy of analysis of epidemiological phenomena has its roots in mathematical epidemiology1012, in which the spread of the disease under study is modeled by means of susceptible-infected-recovered (in short SIR)11,13 models at a chosen geographic granularity, separately for each time-period based on the employment of pharmacological or non-pharmacological containment measures. The standard SIR model and its variations of increasing complexity(e.g. SIRV, SEIR, SEIRS) have already been applied at different geographic scales, from region14, to country15,16 and continent level17,18 in order to develop a thorough understanding on the risk of transmission of COVID-19. To this end, risk estimation of COVID-19 spread has been studied from the perspective of time-varying SIR models19. Nonetheless, following studies showed that SIR is an ineffective modeling strategy20 for the COVID-19 epidemic, as such technique does not account for the contributing factors which have been shown to be greatly correlated with the spread of COVID-19, such as pollution, social norms and cultural context of each population20.

For what concerns the choice of the factors which may have contributed to COVID-19 diffusion, some studies focused on environmental and meteorological data5,21,22, other studies focused on mobility: Sannigrahi23 analysed how the volume of international commerce influenced the spread of COVID-19, while Cartenì7 analyzed how public transport influenced the spreading of COVID-19 in the second wave. Differently from these works, we consider a larger set of European regions and contributing factors.

Several works take into consideration a variety of contributing factors. The most common macro areas analysed regard demography, economy or the healthcare system2326, sometimes integrated with environmental6 or political factors27. Some studies argued that the strongest correlated factors with COVID-19 are socio-economic (population, poverty, income, GDP)8,25,28; other studies assessed a major correlation of environmental factors such as air quality5,9 or air pollution22. In other cases, spatial and geometrical properties (e.g. of cities) have been studied to understand their relationship with the spread of COVID-1929.

Some studies disagree, perhaps due to the difference in datasets and methods: for instance, Amdaoud25 argued that better hospitals and medical services were strongly correlated with lower mortality rates, but Lupu24 stated that they did not affect at all the number of deaths. Most of the studies agreed that factors such as life expectancy9,25,26,30 or population density22,31 are highly correlated both with mortality and number of COVID-19 cases. What is crucial to understand is how these factors correlate to other socio-economic and healthcare elements, in order to create a more encompassing model, that takes in consideration the various aspects of our society. Hence, this work has the objective to explore, from both a temporal and a geographical perspective, how these various socio-economic and healthcare elements have had an effect on the spread of COVID-19.

Materials

The dataset used for this study includes a subset of European regions, as defined by the NUTS2 classification. The NUTS (Nomenclature of Territorial Units for Statistics) is a hierarchical system for dividing the economic territory of the European Union (EU)32. The NUTS nomenclature identifies regions at three different levels (NUTS 1, 2, 3), moving from larger to smaller territorial units. In particular, the second NUTS level (NUTS2) identifies regions within countries (for example, in Italy NUTS2 identifies regions such as Lombardy or Tuscany). As a general representation, each NUTS2 region is identified by a code that concatenates the NUTS1 identifier and an incremental region identifier. For example, Italian regions Lombardy and Tuscany are identified by ITC4 and ITI1 respectively. In this case, ITC and ITI are the NUTS1 representations for north-west and central Italy. A full list of all NUTS levels is openly available32.

The regions considered for this study belong to a total of 20 European countries. The total population living in these regions is of 357 million people.

For each of the regions under study we collected a total of 22 variables from the Eurostat data repositories. Eurostat is the statistical office of the European Union33. The variables span six macro areas that have been found in literature to be relevant in relation to the spread of viruses: economy, education, population, healthcare, primary sector and mobility.

Although the data is made available by Eurostat, this Office does not generally engage in direct data collection. Instead, data is gathered in individual European countries by their respective national statistical authorities, adhering to standardized European statistical regulations and standards, overseen by Eurostat. National authorities are responsible for verifying and analyzing their data, which is subsequently transmitted to Eurostat. The organization performs validation and quality control checks on the received data. These data sets are systematically published in accordance with a predefined publication calendar34.

We additionally gathered information on the impact that the first and second COVID-19 waves had on the various regions, based on data gathered from the Joint Research Centre Repository35. The data has been collected at the sub-national level and includes the numbers of infections by COVID-19, collected directly from the National Authoritative sources (more specifically, on the National monitoring websites)36. Following37, we characterized the first wave as having occurred between February 20 and August 20, 2020, and the second wave between August 20, 2020 and February 20, 2021.

For each pandemic wave, we studied the density of cases and deaths per hundred thousands inhabitants. The number of deaths is strongly dependent on the preparedness of the local health systems to react to and manage the virus effects. Instead, the number of cases allows for a more precise characterisation of the diffusion environment in the regions under analysis. Based on the goal of trying to understand how viruses spread and what factors might influence future waves (or similar pandemic events), we consider the number of cases occurred as being a more interesting target.

The rest of this section presents a better characterization of the data used, with information about the data cleaning performed, the variables considered and the discretization that has been applied to the problem.

Data cleaning

After collecting the socio-economic variables and the COVID-19 outcomes, we applied an initial data cleaning step. From the pool of all available regions, Eurostat data were available only for 205 NUTS2 regions. Coronavirus data collected from the JRC repository were available only for 154 and 152 regions, for the first and second wave respectively. Thus, only regions for which both the covariates and coronavirus cases data were available could be considered. Furthermore, we note that the information on the population of a region is necessary to preprocess the data (e.g. compute the various densities). For this reason, 3 regions have been removed due to absence of this value. After applying all of these constraints, a total of 151 and 149 regions have been analyzed for the first and second wave.

Consequently, since Sweden did not adopt pandemic containment procedures, all its 8 regions were removed from the analysis: the dynamics that occurred there are distant from those of the other European regions, which are the main focus of this study.

Additional attention has been paid to 6 NUTS2 regions which had missing values for covariates which, even if filled with the aggregate national statistic (computed considering the regions in the dataset), would result in extreme outlier values (unlikely with respect to the sampled distribution of the respective missing feature) after normalization by their populations. Hence, the regions BG32, BG42 (in Bulgaria), ES63, ES64 (in Spain), FI20 (in Finland) and FRY5 (in France), which presented the above stated anomalous behaviors, have also been removed.

We thus obtained a dataset of 137 NUTS2 regions for the first COVID-19 wave, and 135 NUTS2 regions for the second one.

Variables

The 22 variables collected from Eurostat can be divided into six different categories, which, according to literature, are relevant in their relation to the spread of viruses: education, population, healthcare, mobility, primary sector and economy.

  • Education variables include information about early leavers from education and training, NEET rate (people Not engaged in Education, Employment or Training), students enrolled in education, students in tertiary education and participation rate in education and training.

  • Population variables regard life expectancy, crude death rate, population density and total deaths.

  • Healthcare variables considered for this study include those related to long-term care beds, hospital beds, hospital discharges for respiratory diseases and health personnel.

  • Mobility variables cover the volumes of vehicles and air passengers.

  • Primary sector variables consider data about farm labour force and utilized agricultural area.

  • Economic variables include unemployment rate, thousands of hours worked, GDP (Gross Domestic Product), compensation per employee and regional GVA (Gross Value Added – the total value of all goods and services produced subtracted of the value of goods and services used for intermediate consumption in their production).

For each variable, we select the latest available measurement before the outset of the pandemic, representing the pre-pandemic condition. When meaningful, the variables have been typically normalized per 100,000 inhabitants. A full description of each variable is available in Table 1.

Table 1.

Brief description for each of the 22 variables used for the study.

Category Variable Description Unit of measure
Education Early leavers from education and training Percentage of young people (between 18 and 24 years old) who left school, university or training classes after having started them, over the total number of people who joined schools, universities and training Percentage
Students enrolled in tertiary education Total number of students (independent from sex and age) enrolled in tertiary school education Absolute number
NEET rate Percentage of people from 15 to 24 who are Not engaged in Education, Employment or Training (NEET) over the total number of people of that age Percentage
Participation in education and training Percentage of people between 25 and 64 years old who, in the last 4 weeks, has participated in educational and training activities Percentage
Pupils and students enrolled Total number of students (independent from sex and age) enrolled in school Absolute number
Population Life expectancy Life expectancy for a person Years
Population density Density of population in each region People/km2
Crude death rate Mortality in relation to the total population, Expressed in deaths per 100,000 inhabitants Per 100,000 population
Deaths Number of deaths in each region Absolute number
Healthcare Hospital discharges for respiratory diseases Number of people who left the hospital after having suffered from respiratory diseases Absolute number
Long-term care beds Number of long term beds available in a region for every 100,000 inhabitants Per 100,000 population
Health personnel Health care staff, the “manpower” active in the health care sector (e.g. doctors, dentists, nurses) Absolute number
Available hospital beds Number of available hospital beds in a region, expressed as average per 100,000 inhabitants Per 100,000 population
Mobility Air passengers Number of passengers carried, in that region, per thousand of inhabitants Per 1,000 population
Stock of vehicles Total number of vehicles present in a region Absolute number
Primary sector Farm labour force

Regular labor force involved in farm work.

Expressed in Annual Work Units (AWU), where 1 AWU corresponds to the work performed in a year by one full-time person

AWU
Utilized agricultural area Share of the utilizied agricultural area (UAA) occupied by the main agricultural land uses (arable land, permanent grassland and land under permanent crops) Percentage
Economy Unemployment rate Unemployment rate in a region Percentage
Thousands of hours worked Sum of the hours worked by employees in a certain area Absolute number
Real growth rate of regional GVA GVA (Gross Value Added) is an indicator of the economic activity of a country or region. It reflects the total value of all goods and services produced minus the value of goods and services used for intermediate consumption in their production

Percentage change

w.r.t. previous period

Compensation of employees Sum of the compensations of all employees of a certain area Millions of euros
GDP Gross Domestic Product (GDP) at current market prices Millions of euros

The dataset collected in this way contains some missing values. Figure 1a and b show the distribution of missing variables for each region and the distribution of missing regions for each variable, respectively. The region with the majority of missing variables is “Northern and Eastern Finland”, with all but one feature missing. The feature with the highest number of missing values is instead “longterm care beds”, whose information is missing from 54 regions ( 39% of regions). The missing values have been filled by using each region’s national average for the selected feature.

Figure 1.

Figure 1

Analysis of missing values in the adopted dataset. (a) shows the distribution of the fraction of missing features for each region. (b) shows the distribution of the fraction of missing regions for each feature.

Risk classes of COVID-19 diffusion

We are interested in building classes of risk for the various regions, according to their risk of exposure in terms of COVID-19 cases. This helps define a clearer narrative for the general public: because of this, several countries have already adopted similar risk-based classes to provide different guidelines and enforce different policies (e.g. in Italy, yellow, orange and red zones have been defined based on each region’s risk).

We build risk classes according to each region’s risk of exposure to COVID-19 cases by performing a 1-dimensional k-means38 unsupervised clustering algorithm on the number of cases for each wave, with a varying number of clusters: we found that two clusters is an optimal choice, in terms of silhouette score39,40 and of numerosity of each risk class. To obtain a more robust choice of clusters, we used multiple centroids initialization and selected the cluster assignment with the lowest sum of square distances from the points to their cluster’s centroids. It should be noted that applying k-means to a 1D dataset has the same effect as defining k-1 thresholds to use to group the data points. We still used k-means instead of manually defining the thresholds both to obtain a meaningful value for k (from a k-vs-silhouette plot) and to automate the selection of the threshold value. The COVID-19 cases density and the resulting discretized risk classes are separately shown for the two waves in Fig. 2.

Figure 2.

Figure 2

ROC curves for the different predictive models with respect to the first (left) and second (right) waves. Curves close to the (0, 1) corner achieve better performance. (a) shows the ROC curve for the first wave: the random forest generally achieves better performance. (b) shows the ROC curve for the second wave: in this case, all models obtain similar results. All results have been obtained using leave-one-out validation. The positive class is considered to be the high risk one. The blue dotted line represents random guess.

Methods

The objective of this work is to identify and quantify the effect that various socio-economic variables have in determining the risk class associated with each European region. To achieve this goal we opted for a classification-based machine learning approach, using models that provide some kind of interpretation of the results obtained. The machine learning pipeline adopted is comprised of the following steps: minority class rebalancing, model training, tuning and evaluation; followed by a final interpretation of the results. The rest of this section presents each of these steps in additional detail.

All experiments have been performed using a machine running Ubuntu 20.04, equipped with an Intel Xeon X5650 (6 cores, 12 threads) @ 2.66 GHz and 32 GB of memory. The source code has been developed using Python 3.8 and the scikit-learn42 library for the main proposed models. A link to the source code repository is made available as a part of the Data availability statement.

Minority class rebalancing

The two classes (high and low risk) are not well-balanced for the two waves. For this reason, we use SMOTE43 (Synthetic Minority Over-sampling Technique), a commonly adopted technique to rebalance the problem. We have found the best performance in terms of downstream classification task for a number of neighbors for the interpolation N=6. This choice is found to be a stable one, since we also achieve similar performance for other similar values of N.

Model training and tuning

We trained various classifiers that can be inspected to extract meaningful considerations on the importance of the input features. Hence, we used a subset of interpretable methods from the statistical learning literature, namely: logistic regression (LR), Support Vector Machine (SVM)44 with a linear kernel and random forest45 (RF). Logistic regression allows to infer from the available data, the relationship that exists between the features of an input sample and the respective predicted class, based on the odds of membership to one class with respect to all others.

Random Forest is an ensemble learning method which achieves similar goals by means of majority vote prediction from multiple separately trained decision trees, each fit on a subset of features(features bagging).

The Support Vector Machine instead solves the classification problem by finding the hyper-plane that best separates the data points according to the respective classes. This fact is then used to predict the class associated with a test sample, based on the side of the found plane it is located.

We are mainly interested in studying the “high risk” class, i.e. the most alarming of the two classes. Because of this, we measure precision and recall of all classifiers obtained for the high risk class. To find a balanced model, we additionally consider the F1 score (i.e. the harmonic mean between precision and recall). We use this metric for the choice of hyperparameters and overall evaluation because both minimizing the false positives and false negatives is important for the purpose of our classification problem: a large number of false positives implies over-estimating the importance of features that are not indicative of high risk, whereas high false negatives would result in under-estimating the features that might instead be relevant for the high risk class.

To find the optimal hyperparameters we adopted a 5-fold cross validation technique, choosing for each model the configuration that maximizes the average F1 score on the high risk class, across all folds. The following are the hyperparameters that have been considered for each model.

For the logistic regression, we considered using either an L1 or an L2 regularization term (with regularization coefficient {0.1,0.5,1,5,10,20,50}, as well as no regularization at all. For the random forest we trained models using 100 trees and a maximum depth that ranged from 2 to 10, as well as not enforcing any depth regulation. For the SVM we adopted a linear kernel (to obtain weights that can be easily interpreted), with a regularization parameter {0.1,0.5,1,5,10,20,50}. We also tune the class weight assigned to the low and high risk categories; for this parameter, the following combinations are selected (notation is high risk weight : low risk weight): {1:1,4:1,8:1,32:1,#samples#classes·#sampleshigh_risk:#samples#classes·#sampleslow_risk}.

Model evaluation

The models are evaluated by computing the leave-one-out F1 score on the dataset of each wave. The leave-one-out validation strategy has not been used during the hyperparameter tuning phase because of its computational complexity. However, it provides a good estimate of the model’s performance on new data points.

As the tuning choices have been made on the dataset through k-fold cross validation, the overall performance metrics reported for leave-one-out are not independent of the performance obtained during validation. We are aware of this, but still opted for this evaluation technique due to the limited data availability – which would only allow for a small test set to be set aside.

Feature selection and interpretation

Three ML methods (logistic regression, linear SVM, random forests) have been used for feature selection. Each model has been trained with its best hyperparameter configuration and used to establish the relationships between the 22 variables and the risk class prediction. Each model has its means to quantify the relevance of the relationship between a variable and the risk classes, according to the available data. For both linear SVM and logistic regression, each of the parameters of the models provides a measure of the degree by which each feature skews the likelihood that a test sample should be classified with a given label or not.

Hence, the risk class prediction is the result of the weighted contribution of each variable of the test sample. In the case of Random Forest instead, a measure of absolute relevance, namely the feature importance, is attributed to each variable as a means to quantify its impact in the class prediction procedure. Although it is not directly employed for inference by the model (as is the case of the coefficient of LR and SVM), it allows to measure the influence of each variable in the overall majority vote class prediction. Next, we discuss in more detail the interpretation of each algorithm.

Logistic regression provides the means to both classify regions and estimate the influence of each feature on the odds of the risk class46 of any given NUTS2 region. The optimization objective defined below allows us to find the set of coefficients WRd which define the relevance of all features for the purpose of classification. Here, xiRd represents the set of features of the i-th sample of our dataset.

minWRd,bRCiNlog(1+e-yi(WTxi+b))+12||W||22

In the case of L1 penalty term, the 12||W||22 is substituted by the L1 norm ||W||1. For the logistic regression, the quantity WTxi+b represents the logarithm of the predicted odds ratio. As such, the coefficients in W have a direct relationship with the predicted probability of the positive class (high risk).

Similar considerations hold for linear SVM, which separates samples of the two risk classes by determining an hyperplane that maximizes the margin between two classes. Such separation frontier is defined by a set of coefficients obtained through an optimization routine: their values determine the relationship between each feature and the risk classes47. In the linear case, the optimization objective used to determine the optimal set of coefficients WRd is:

minWRd,bRCiNmax(0,1-yi(WTxi+b))+12||W||22

Both of these models provide an indication of the direct or inverse influence that the features have on the identification of the low or high risk class.

Finally, random forests can be used to extract the importance that each feature has for the classifier48. This feature importance can be computed for each decision tree that comprises the random forest: this information can then be aggregated to compute an overall feature importance.

At any node of a decision tree, we can identify the set of samples to be processed as D and the two splits produced as DL and DR, such that D=DLDR and DLDR=. For any of these sets of points we can compute an impurity value, which represents how “impure” they are with respect to the target class. We refer to this impurity score as ι(D). In this paper, we will discuss the impurity in terms of Gini index. The contribution of the split towards the overall separation of the classes in the dataset can be quantified as:

C(D)=|D|ι(D)-|DL|ι(DL)-|DR|ι(DR)

denoting the decrease in impurity achieved when passing from the parent to the children. Each impurity is weighted by the number of samples contained within the respective split. For each split, there is exactly one feature that is used for that split. As such, the contribution of the split can be assigned to the corresponding feature. It follows that for each feature f we can build a set Df of all nodes that use that feature for the split. The overall contribution of feature f (or its feature importance), for tree T is then computed as:

fiT(f)=DDfC(D)ϕFDDϕC(D)

Where F is the set of all available features. The term at the denominator is used to normalize the result so that the sum of all feature importances is 1. For a random forest, the overall feature importance of any feature is given by the mean feature importance for that feature across all the trees T. Once again, the feature importances obtained in this way are normalized so as to sum to 1.

fi(f)=TTfiT(f)ϕFTTfiT(ϕ)

Results

In this section we present the results achieved for the models built, in terms of both performance and notions learned.

Models performance

In Table 2 we include the performance for both waves in terms of F1 score, precision and recall on the “high risk” class. The first conclusion that can be drawn from this data is that all models have better predictive capabilities for the second wave – additionally all models are more consistent in terms of results achieved. Multiple reasons may be the cause of this: among them, we note that there has generally been a more accurate counting of COVID-19 infections during the second wave. This is the result of a general better preparedness w.r.t. the one shown by most countries for the first wave. The ROC (Receiver Operating Characteristic) curves for the best performing models are shown in Figure 3. Once again we observe a general consistency for the second wave, and an improved performance achieved by the random forest for the first wave.

Table 2.

Performance of different predictive models with respect to both waves.

First wave Second wave
F1 score Precision Recall F1 score Precision Recall
Logistic regression 0.5217 0.4444 0.6316 0.8659 0.8353 0.8987
SVM 0.4706 0.375 0.6316 0.859 0.8701 0.8481
Random forest 0.5957 0.5 0.7368 0.8447 0.8293 0.8608

All metrics are computed on the “high risk” class, using leave-one-out validation. In bold are the best performing models for each wave and metric.

Figure 3.

Figure 3

Coefficients learned by the linear SVM model for the first and second wave. Positive coefficients imply a positive weight of the respective feature toward the prediction of the “high risk” class, whereas negative coefficients characterize features more related to the “reduced risk” class. The magnitude of the coefficients is proportional to how impactful the respective feature is in the prediction. The coefficients are sorted by descending importance (in absolute value). (a) shows the coefficients learned by the linear SVM for the first wave, (b) shows the coefficients for the second wave.

Predictive variables

We are mainly interested in extracting the importance assigned by each classifier to the variables under study. Since the models are characterized by different learning mechanisms, we first try to assess whether the models are consistent with one another in terms of importance assigned to the various variables under study.

To quantify the consensus of the classifiers in terms of the relevance of each feature, we first compute the Pearson correlation among the importances learned by each model. The results are shown in Table 3. To compare the feature importance of random forests (positive, larger for more important features) with that of SVM and logistic regression (positive or negative, larger in absolute value for more important features), the absolute value of the weights of SVM and logistic regression have been used.

Table 3.

Correlation coefficient between each model’s feature importances.

First wave Second wave
LR SVM RF LR SVM RF
First wave LR 0.8917 ns 0.8423 0.5919 ns
SVM 0.8917 0.5065 0.5876 0.4938 ns
RF ns 0.5065 ns ns 0.4585
Second wave LR 0.8423 0.5876 ns 0.7834 ns
SVM 0.5919 0.4938 ns 0.7834 ns
RF ns ns 0.4585 ns ns

Feature importances for SVM and logistic regression are obtained as the absolute value of the learned weights. Cells containing “ns” represent situations when the correlation between models is not significant (p-value 0.05).

It can be observed that SVM and logistic regression make similar choices in terms of feature weighting, as denoted by large correlation values. An interesting fact is that, across the two waves, all models maintain a large positive correlation. This denotes that, although the two waves present many differences (as discussed next), all models make reasonably consistent choices in the selection of the relevance of the features.

From Table 3 we can infer that the SVM has large correlation with both the random forest and the logistic regression. Because of this, we further analyze the coefficients learned by this model. These weights are shown in Fig. 4, separately for the first and second waves.

Figure 4.

Figure 4

Feature distribution ordered from top to bottom by increasing p-value for an F-test applied on the data of the first (left) and second (right) waves. A lower p-value represents a more significant difference in means between the distributions of values between the high and reduced risk classes. A larger divergence may be leveraged by the classification models to identify differences between the two classes. All features have been standardized. (a) shows the distributions for the first wave, (b) shows the distribution for the second wave.

For a more expressive representation of the distribution of values, Fig. 5 shows the violin plots, separately for high and reduced risks, for both waves. The difference in distribution of values between the various features is the main drive of the magnitude of the weights learned. In other words, the more the distributions for a specific feature diverge, the larger the associated weight learned by the classification model will be.

Figure 5.

Figure 5

Side-by-side comparison of the continuous density of reported COVID-19 cases normalized per 100,000 inhabitants (left) and of the resulting discretization in binary severity classes (right), separately for first (top) and second (bottom) waves. Data obtained form the COVID-19 cases35. Borders have been defined using the NUTS2 GeoJSON definitions41. Python 3.8, GeoPandas 0.8 and Matplotlib 3.7 have been used to produce the images.

The first wave

Considering the order of decreasing relevance from top to bottom of the graph, it can be observed that the economic variables such as compensation of employees, GDP and number of hours worked are some of the features which mainly affect the odds of a NUTS2 region to be at high risk of exposure to COVID-19 spread.

The number of hours worked and the population density are also relevant predictors which increase the odds of a region to be at high risk. On the other hand, employee compensation is inversely correlated with the risk classes.

Next, information about healthcare and demographic features have a considerable influence for the first wave. In particular, a high life expectancy and a high population density can be seen to have a positive effect on the prediction of a high risk class. The number of long-term care beds is instead negatively correlated with it. Although no causality can be inferred, we can hypothesise that an aging population has been affected more substantially by the first COVID-19 wave, that more densely populated areas can result in more widely spread infections, whereas a strong healthcare system can be assumed to have been useful to prevent many cases from further spreading.

The second wave

Similarly to the first wave, we observe that economic and socio-demographic indicators provide the greatest contributions to the classification models, together with those related to education; all the others only marginally improve odds in favor of either the low or the high risk class.

More specifically, the GDP and the number of hours worked have a positive impact on the prediction of the high risk class, whereas employee compensation is once again inversely proportional to the odds of being at high risk.

It is peculiar that life expectancy becomes much less impactful, while the crude death rate, the number of enrolled students in primary and in tertiary education now become relevant for classification.

While in general the most predictive features are consistent across waves, there are some cases that can be observed where inconsistent behaviors can be observed: most notably the NEET rate and the number of students enrolled in primary education change from positively to negatively correlated. Furthermore, the number of students in tertiary education changes from uncorrelated to highly positively correlated in the second wave. We can in this case advance the hypothesis that this change in effect for education variables can be explained in terms of the different roles that education has played in the two waves: the first wave has been characterized by students being prevented from attending schools and universities, whereas the second wave has often seen students being allowed to resume in-presence activities.

Discussion

Both the first and second wave are characterized by a high similarity of the selected most relevant features. Socio-economic variables are ranked as highly relevant, with compensation of employees deemed to be the most influential factor with negative correlation to the high risk of the exposure to COVID-19 spread. The number of hours worked and GDP have also been determined as important, with positive proportional relationship to the increase in number of infections. This result is aligned to other studies that stated that the strongest correlation with the virus has to be associated to socio-economic metrics5, mainly in relation to the quality of life and the GDP25,28.

These factors are usually associated with modern developed economies, high ratio of urban population, the people’s inclusion in the service industry, high health system maturity and solid international migrant stocks. Thanks to this result it is possible to state that a high level of economic wellness is associated with a higher possibility to fight against the virus but, at the same time, it increases the possibilities for COVID-19 to spread, because the community is more dynamic and people interact more with each other. Indeed, the impact of the dynamicity of NUTS2 regions on the spread of COVID-19 can be observed by the change in relevance of some features that appear as highly relevant in one wave but not on the other.

Such distinctive characteristic is remarked by the increase in relevance of the number of hours worked, from fifth to second most relevant feature for risk classification during the second wave, as well as the major increase in importance of the number of students in primary education(represented by “students enrolled”) and tertiary education (“students tertiary education”). Both are only slightly correlated to the risk classes in the first wave, while instead they become negatively and positively correlated respectively, during the second wave. On the other hand life expectancy, population density and number of long-term care beds are shown as highly influential only for classification on the first wave.

The negative correlation shown by the latter is supported also by the documented struggle of the European healthcare systems in managing the number of COVID-19 cases during the first months of the pandemic, especially if patients were to manifest severe symptoms49. Not all the patients affected by the virus were automatically admitted to local hospitals, due to the shortage of available beds and rooms, and since from March to May 2020 the majority of European citizens were in full-lockdown50, ill people were more likely to infect the other people living with them, therefore leading to a growth in the number of cases. As already shown in previous works9,25,26, life expectancy, strongly correlated with demographic factors such as number of elders within a population or quality of life in a specific area, is strongly linked to the number of COVID-19 cases (as well as deaths caused by the virus).

Moreover, given the particular lockdown conditions imposed during the first wave of the epidemic, population density and the NEET rate showed as crucial factors, because the higher the number of citizens, the easier COVID-19 transmission becomes, in both lockdown and non-restricted contexts22,31, since nations’ ability to maintain safe distances between citizens is jeopardized51. It is necessary to note that although lockdowns were imposed as soon as the virus had been detected in Europe in late February, the high positive correlation of the thousand hours worked, hence time spent in office environments, can only be explained by the research of Ref.52,53 that provides evidence for the presence of COVID-19 in Europe as early as January of 2020 or the last months of 2019, as social dynamics had been limited soon after February 2020.

As demographics and healthcare related features partially characterize the first wave of COVID-19, socio-economic variables become predominant for the second wave. The major importance of socio-economic dynamics for the second wave is shown by features such as the thousand hours worked, compensation of employees, NEET rate, which proxy the characteristics of the working population and the ones associated to the demographics of primary and tertiary education. It can be observed that crude death rate becomes relevant, with negative proportionality relationship, during the second wave. The case of the NEET (Not in Education, Employment, or Training) rate is rather peculiar, as it changes from positively to negatively correlated, to the odds of high risk of COVID-19 spread from the first to the second wave. As this slice of the total population is not an active part of the workforce, a greater NEET rate is shown to be associated to lower risk of the spread of COVID-19 during the second wave, while it is instead positively correlated for the first wave, in which, to most of the workforce was imposed smart-working modality. Thus, COVID-19 infection was facilitated regardless of the employment status.

This might also suggest that countries have provided a more adequate response, from the healthcare standpoint, in managing the ongoing pandemic during the second wave. In spite of this, the partial return to on-site working modalities during the second period of the epidemic resulted in an increase of COVID-19 diffusion due to the increase in human-to-human interactions that characterizes the professional world.

Conclusions

This study analyzed the impact of various social and economic factors on the spread of COVID-19 across European regions during the first and second waves of the pandemic. Machine learning models were able to identify key variables that influenced the risk classification of regions into high or low prevalence of cases. For both waves, socio-economic variables such as hours worked, GDP, and unemployment proved most important. However, the influence of specific factors differed between waves. For the first wave, demographic traits like life expectancy and healthcare capacity also played a role. For the second wave, education levels and crude death rates became more predictive. This suggests countries adapted their responses over time. Overall, our results provide insight into how regional characteristics combined with containment measures shaped COVID-19 transmission dynamics across Europe.

There are several avenues for continuing and expanding this research. First, additional waves of data could be analyzed to see how predictive factors evolve as the pandemic progresses. Demographic and behavioral changes over time may influence transmission risks. Second, incorporating mobility and social contact network data could yield a more granular understanding of spread. Third, performing similar analyses at finer geographic scales, such as the municipality level, may reveal localized transmission patterns.

Acknowledgements

This article reports a multidisciplinary research project developed for the Covid@Lombardy project at Alta Scuola Politecnica. Authors are greatly indebted to Ilaria Capua, whose enthusiasm and curiosity has stimulated our research; she connected us to research fellows working in the E-ellow Submarine initiative, which applies her Circular Health Approach to COVID-19 research. In this context, we further acknowledge the contribution of Ciro Cattuto and Daniela Paolotti from ISI Torino and of Anna Ferrari from CERN.

Abbreviations

AUC:

Area under the curve

AWU:

Annual work units

COVID-19:

Coronavirus disease 2019

EU:

European Union

GDP:

Gross domestic product

GVA:

Gross value added

JRC:

Joint research centre

LR:

Logistic regression

NEET:

Not engaged in education, employment or training

NUTS:

Nomenclature of territorial units for statistics

NUTS2:

Nomenclature of territorial units for statistics, level 2

RF:

Random forest

ROC curve:

Receiver operating characteristic curve

SMOTE:

Synthetic minority over-sampling technique

SVM:

Support vector machine

UAA:

Utilized agricultural area

Author contributions

C.C., A.C., L.M., L.M., S.P. and C.S performed the experiments and interpreted the results. E.B., F.G. and S.C. designed and supervised the experiments. All authors wrote and reviewed the manuscript with equal contribution.

Data availibility

The source code and the data used for this study are openly available at https://github.com/Chris1nexus/covid-eu-analysis.

Competing interests

The authors declare no competing interests.

Footnotes

The original online version of this Article was revised: The original version of this Article contained errors in Affiliations 2, 3 and 4 respectively, where the city was incorrectly given as ‘Turin’. The correct Affiliations are listed in the Correction for this Article.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

3/28/2024

A Correction to this paper has been published: 10.1038/s41598-024-58218-0

References

  • 1.Cereda, D. et al. The early phase of the covid-19 outbreak in Lombardy, Otaly. Preprint at arXiv:2003.09320 (2020).
  • 2.Usuelli M. The Lombardy region of Italy launches the first investigative covid-19 commission. Lancet. 2020;396:e86–e87. doi: 10.1016/S0140-6736(20)32154-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.McLafferty S. Placing pandemics: Geographical dimensions of vulnerability and spread. Eurasian Geogr. Econ. 2010;51:143–161. doi: 10.2747/1539-7216.51.2.143. [DOI] [Google Scholar]
  • 4.De Angelis E, et al. Covid-19 incidence and mortality in Lombardy, Italy: An ecological study on the role of air pollution, meteorological factors, demographic and socioeconomic variables. Environ. Res. 2021;195:110777. doi: 10.1016/j.envres.2021.110777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Bontempi E. First data analysis about possible covid-19 virus airborne diffusion due to air particulate matter (pm): The case of Lombardy (Italy) Environ. Res. 2020;186:109639. doi: 10.1016/j.envres.2020.109639. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Coccia M. The relation between environmental, demographic, and geographical factors, and covid-19 diffusion: A case study. Forthcoming J. Adm. Soc. Sci. 2021;7:169–209. [Google Scholar]
  • 7.Cartenì A, Di Francesco L, Henke I, Marino TV, Falanga A. The role of public transport during the second covid-19 wave in Italy. Sustainability. 2021;13:11905. doi: 10.3390/su132111905. [DOI] [Google Scholar]
  • 8.Bontempi E, Coccia M, Vergalli S, Zanoletti A. Can commercial trade represent the main indicator of the covid-19 diffusion due to human-to-human interactions? a comparative analysis between Italy, France, and Spain. Environ. Res. 2021;201:111529. doi: 10.1016/j.envres.2021.111529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Kapitsinis N. The underlying factors of the covid-19 spatially uneven spread. initial evidence from regions in nine EU countries. Reg. Sci. Policy Pract. 2020;12:1027–1045. doi: 10.1111/rsp3.12340. [DOI] [Google Scholar]
  • 10.Ross R. An application of the theory of probabilities to the study of a priori pathometry—Part I. Proc. R. Soc. Lond. Ser. A. 1997;92:204–230. doi: 10.1098/rspa.1916.0007. [DOI] [Google Scholar]
  • 11.Ross R, Hudson HP. An application of the theory of probabilities to the study of a priori pathometry—Part II. Proc. R. Soc. Lond. Ser. A. 1997;93:212–225. doi: 10.1098/rspa.1917.0014. [DOI] [Google Scholar]
  • 12.Ross R, Hudson HP. An application of the theory of probabilities to the study of a priori pathometry—Part III. Proc. R. Soc. Lond. Ser. A. 1997;93:225–240. doi: 10.1098/rspa.1917.0015. [DOI] [Google Scholar]
  • 13.Kermack WO, McKendrick AG, Walker GT. A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. Ser. A. 1997;115:700–721. doi: 10.1098/rspa.1927.0118. [DOI] [Google Scholar]
  • 14.Hespanha JP, Chinchilla R, Costa RR, Erdal MK, Yang G. Forecasting COVID-19 cases based on a parameter-varying stochastic SIR model. Annu. Rev. Control. 2021;51:460–476. doi: 10.1016/j.arcontrol.2021.03.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Giordano G, et al. Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy. Nat. Med. 2020;26:855–860. doi: 10.1038/s41591-020-0883-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Cooper I, Mondal A, Antonopoulos CG. A SIR model assumption for the spread of COVID-19 in different communities. Chaos Solitons Fractals. 2020;139:110057. doi: 10.1016/j.chaos.2020.110057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Singh P, Gupta A. Generalized SIR (GSIR) epidemic model: An improved framework for the predictive monitoring of COVID-19 pandemic. ISA Trans. 2022;124:31–40. doi: 10.1016/j.isatra.2021.02.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Marinov TT, Marinova RS. Adaptive SIR model with vaccination: Simultaneous identification of rates and functions illustrated with COVID-19. Sci. Rep. 2022;12:15688. doi: 10.1038/s41598-022-20276-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Kiamari M, et al. COVID-19 risk estimation using a time-varying SIR-model. In: Kiamari M, et al., editors. Procedings of the 1st ACM SIGSPATIAL International Workshop on Modeling and Understanding the Spread of COVID-19. Association for Computing Machinery; 2020. pp. 36–42. [Google Scholar]
  • 20.Moein S, et al. Inefficiency of SIR models in forecasting COVID-19 epidemic: A case study of Isfahan. Sci. Rep. 2021;11:4725. doi: 10.1038/s41598-021-84055-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Ceylan Z. Insights into the relationship between weather parameters and covid-19 outbreak in Lombardy, Italy. Int. J. Healthc. Manag. 2021;14:255–263. doi: 10.1080/20479700.2020.1858394. [DOI] [Google Scholar]
  • 22.Aabed K, Lashin MM. An analytical study of the factors that influence covid-19 spread. Saudi J. Biol. Sci. 2021;28:1177–1195. doi: 10.1016/j.sjbs.2020.11.067. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Sannigrahi S, Pilla F, Basu B, Basu AS, Molter A. Examining the association between socio-demographic composition and covid-19 fatalities in the European region using spatial regression approach. Sustain. Cities Soc. 2020;62:102418. doi: 10.1016/j.scs.2020.102418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Lupu D, Maha LG, Viorică ED. Covid-19 incidence in Europe: Drivers and government interventions. Transylv. Rev. Admin. Sci. 2020;16:80–93. [Google Scholar]
  • 25.Amdaoud M, Arcuri G, Levratto N. Are regions equal in adversity? A spatial analysis of spread and dynamics of covid-19 in Europe. Eur. J. Health Econ. 2021;22(4):629–42. doi: 10.1007/s10198-021-01280-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Pardhan S, Drydakis N. Associating the change in new covid-19 cases to gdp per capita in 38 European countries in the first wave of the pandemic. Front. Public Health. 2020;8:1065. doi: 10.3389/fpubh.2020.582140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Hossain, M. A. Is the spread of covid-19 across countries influenced by environmental, economic and social factors? medRxiv (2020).
  • 28.Farseev, A., Chu-Farseeva, Y.-Y., Qi, Y. & Loo, D. B. Understanding economic and health factors impacting the spread of covid-19 disease. medRxiv (2020).
  • 29.Kwok CYT, et al. Spatial analysis of the impact of urban geometry and socio-demographic characteristics on covid-19, a study in Hong Kong. Sci. Total Environ. 2021;764:144455. doi: 10.1016/j.scitotenv.2020.144455. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Allel K, Tapia-Muñoz T, Morris W. Country-level factors associated with the early spread of covid-19 cases at 5, 10 and 15 days since the onset. Glob. Public Health. 2020;15:1589–1602. doi: 10.1080/17441692.2020.1814835. [DOI] [PubMed] [Google Scholar]
  • 31.Velavan, T. & Meyer, C. La epidemia de covid-19. Trop. Med. Int. Health25 (2020). [DOI] [PMC free article] [PubMed]
  • 32.Eurostat N. Nomenclature of Territorial Units for Statistics. Luxembourg; 1995. [Google Scholar]
  • 33.European Commission. Eurostat. https://ec.europa.eu/eurostat/web/main/data/database (2021).
  • 34.Commission, E. European statistics (eurostat) (2024).
  • 35.European Commission Joint Research Centre. Covid statistics (jrc-covid). https://data.jrc.ec.europa.eu/collection/id-00343 (2021).
  • 36.D, G., A, A. & C, P. Covid statistics.
  • 37.Eurostat. 2nd wave of covid-19 losing its force in february (2021).
  • 38.Forgy EW. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics. 1965;21:768–769. [Google Scholar]
  • 39.Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7. [DOI] [Google Scholar]
  • 40.Shahapure KR, Nicholas C. Cluster quality analysis using silhouette score. In: Shahapure KR, Nicholas C, editors. 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) IEEE; 2020. pp. 747–748. [Google Scholar]
  • 41.GISCO. Gisco statistical unit dataset. https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/nuts (2021).
  • 42.Pedregosa F, et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]
  • 43.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002;16:321–357. doi: 10.1613/jair.953. [DOI] [Google Scholar]
  • 44.Boser, B. E., Guyon, I. M. & Vapnik, V. N. A training algorithm for optimal margin classifiers. In: Proc. of the fifth annual workshop on Computational learning theory, 144–152 (1992).
  • 45.Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]
  • 46.Thrane C. How to present results from logistic regression analysis in hospitality and tourism research. Tour. Hosp. Res. 2005;5:295–305. doi: 10.1057/palgrave.thr.6040029. [DOI] [Google Scholar]
  • 47.Cortes C, Vapnik V. Support-vector networks. Mach. Learn. 1995;20:273–297. doi: 10.1007/BF00994018. [DOI] [Google Scholar]
  • 48.Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Louppe G, Wehenkel L, Sutera A, Geurts P, editors. Advances in Neural Information Processing Systems. Curran Associates Inc; 2013. [Google Scholar]
  • 49.Winkelmann J, et al. European countries’ responses in ensuring sufficient physical infrastructure and workforce capacity during the first covid-19 wave. Health Policy. 2021;126:722–722. doi: 10.1016/j.healthpol.2021.06.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Plümper T, Neumayer E. Lockdown policies and the dynamics of the first wave of the sars-cov-2 pandemic in Europe. J. Eur. Public Policy. 2020;29(3):321–41. doi: 10.1080/13501763.2020.1847170. [DOI] [Google Scholar]
  • 51.Ciotti M, et al. The covid-19 pandemic. Crit. Rev. Clin. Lab. Sci. 2020;57:365–388. doi: 10.1080/10408363.2020.1783198. [DOI] [PubMed] [Google Scholar]
  • 52.Cereda D, et al. The early phase of the COVID-19 epidemic in Lombardy, Italy. Epidemics. 2021;37:100528. doi: 10.1016/j.epidem.2021.100528. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Cerqua A, Di Stefano R. When did coronavirus arrive in Europe? Stat. Methods Appl. 2022;31:181–195. doi: 10.1007/s10260-021-00568-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The source code and the data used for this study are openly available at https://github.com/Chris1nexus/covid-eu-analysis.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group

RESOURCES