Case Studies in Chemical and Environmental Engineering. 2020 Dec 11;3:100067. doi: 10.1016/j.cscee.2020.100067

An analysis to identify the important variables for the spread of COVID-19 using numerical techniques and data science

Deepro F Pasha, Alex Lundeen, Dilruba Yeasmin, M Fayzul K Pasha
PMCID: PMC7834342  PMID: 38620894

Abstract

Considering system theory, the socio-economic variables that constitute a society should be able to capture a system response such as the number of weekly COVID-19 cases. This paper presents a numerical approach to answer two vital questions: which variables are more important, and how many variables are needed to capture the dynamics of the spread. Using the theory of least squares regression, two types of problems were set up and solved: multilinear regression (MLR) and a nonlinear power function, referred to as NLR in this study. Numerical techniques were applied to pre- and post-process the data and the vast number of outputs. A total of 43 socio-economic and meteorological variables from 31 counties in California, United States, resulted in about 37.4 million combinations for the analysis. Results show that variables related to total population, household income, occupation, and transportation are more important than the others. The frequency with which a variable attains a high correlation increases as more variables are combined with it. Similarly, correlation increases as the number of variables in a combination increases. Some 5-variable combinations can capture the dynamics of the spread with high accuracy, with correlation coefficients as high as 0.985.

Keywords: COVID-19, Numerical techniques, Correlation, Least square method

1. Introduction

If society is considered to be a system, its variables should capture the system's response and behaviors. Many variables under different categories, including demography, economy, culture, education, transportation, health, and weather, constitute a society. Some of these variables may be more important than others in capturing system response and behavior. Since coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, was identified as highly infectious [1,2], scientists and researchers have been working to understand the dynamics of the spread of the virus using predictive epidemic models [3,4] and system behaviors to aid in controlling the outbreaks [5].

According to the US Centers for Disease Control and Prevention [6], the primary method of COVID-19 spread is through respiratory droplets from human to human. Many researchers have looked at the different factors that might affect the transmission of the droplets [7]. These factors include demography [8,9], social connectedness and travel [10,11], economic, cultural, and financial conditions [12,13], and meteorological characteristics [14,15,16]. Numerous relationships were found between these variables and the spread of COVID-19 via a variety of methods. These relationships include higher COVID-19 susceptibility in older and more intergenerational populations than in younger populations [8], and in lower-temperature areas than in higher-temperature areas [14]. However, despite these predictions, the rate of infection continued to increase in summer and affected younger generations as well as the elderly [17], contradicting the rising average temperature hypothesis presented by many scientists.

While many factors possibly associated with COVID-19 cases have been identified individually, a study is required in which all the main socio-economic and meteorological variabilities are considered comprehensively. A systematic approach can include the social variabilities in a way that quantifies the impact of each variable, and of combinations of variables, on the spread of COVID-19. A comprehensive understanding of these combinations of variables and their magnitudes of influence on COVID-19 spread would greatly benefit decision-makers seeking to control the spread of COVID-19.

Data science and existing numerical techniques, including single linear regression (SLR), multilinear regression (MLR), and nonlinear regression (NLR), can be used to understand these relationships and their degrees of influence on COVID-19 spread. This paper outlines a numerical scheme to identify the important variables, individually and in combination, and their impacts on the spread of COVID-19 using SLR, MLR, and NLR together with well-known statistical assessment parameters. Note that MLR with one independent variable is referred to as SLR in this study.

2. Methodology

The methodology consists of four steps. The first step was data collection, followed by data normalization in the second step and correlation analysis using MLR and NLR in the third step. In the final step, all the results were combined to identify the important variables and their impacts on the spread of the virus. A literature search was conducted first to identify the variables postulated to drive the spread of the virus. Since the number of weekly new COVID-19 cases, which is the dependent variable, was available throughout the study area, a weekly time step was used as the temporal resolution. Hence, the meteorological variables were also converted to a weekly basis. Data from different counties were used to observe the impact of variables that are quasi-static in nature, such as household income. The county level was chosen as the spatial resolution because different counties exhibit different weather and climatic behaviors while being subject to similar state regulations.

To eliminate bias due to differences in magnitude, the data were normalized (standardized) using the following equation:

$$Z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \tag{1}$$

where $x_{ij}$ is any value of variable $j$, $\mu_j$ and $\sigma_j$ are the mean and standard deviation of variable $j$, and $Z_{ij}$ is the corresponding standardized value.
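
As an illustration of the standardization step, the following is a minimal Python sketch (the study itself used MATLAB). The function name `standardize`, the sample values, and the use of the population standard deviation are assumptions made for this example; the source does not state whether the population or the sample form was used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores (x - mean) / std for the values of one variable."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(x - mu) / sigma for x in values]

# Hypothetical values of one variable across five counties (illustrative only).
sample = [52000, 61000, 48000, 75000, 64000]
z_scores = standardize(sample)
```

After this step every variable has zero mean and unit standard deviation, so variables measured on very different scales (for example population counts and temperatures) become comparable.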

Assuming a linear relationship between the observed independent variables $X$ $(x_1, x_2, x_3, \ldots, x_n)$ and the observed dependent variable vector $Y$ ($y$ for a single occurrence), the mathematical equation can be written as [18]:

$$Y = XA + \varepsilon \tag{2}$$

$$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n + \varepsilon \tag{3}$$

where $A$ $(a_0, a_1, a_2, a_3, \ldots, a_n)$ is the regression coefficient vector and $\varepsilon$ is the error vector. As mentioned, if a single independent variable (i.e., only $x_1$) is used, the equation is referred to as SLR. If more than one independent variable is used, the equation is referred to as MLR in this paper. However, if the dependent variable is related nonlinearly (through a power equation) to the independent variable(s), the following equation, referred to as NLR in this paper, can be used:

$$y = a_0 \, x_1^{a_1} x_2^{a_2} x_3^{a_3} \cdots x_n^{a_n} + \varepsilon \tag{4}$$

A logarithmic transformation of original data can be used to determine the coefficients of Eq. (4). Using the theory of least squares regression, the following equation can be used to solve for the regression coefficients A [18].

$$A = (X^t X)^{-1} X^t Y \tag{5}$$

where $X$ is the matrix of observed values of the independent variables and $t$ denotes the matrix transpose.
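
The least-squares solution and the logarithmic transformation described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' MATLAB code: the helper names (`solve_linear_system`, `fit_mlr`, `fit_nlr`) and the synthetic data are hypothetical, a column of ones is assumed to carry the intercept $a_0$, and the power-law fit assumes strictly positive data so that logarithms exist.

```python
import math

def solve_linear_system(M, b):
    """Solve M a = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular normal matrix (collinear variables?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    a = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (aug[i][n] - s) / aug[i][i]
    return a

def fit_mlr(rows, y):
    """Return [a0, a1, ..., an] by solving the normal equations (X^t X) A = X^t Y."""
    X = [[1.0] + list(r) for r in rows]  # assumption: leading 1 carries a0
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    XtY = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, XtY)

def fit_nlr(rows, y):
    """Power-law fit via ln y = ln a0 + sum(a_i * ln x_i); needs positive data."""
    log_rows = [[math.log(v) for v in r] for r in rows]
    log_y = [math.log(v) for v in y]
    coeffs = fit_mlr(log_rows, log_y)
    return [math.exp(coeffs[0])] + coeffs[1:]

# Synthetic, noise-free data (illustrative only).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y_linear = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
y_power = [2 * x1 ** 0.5 * x2 ** 1.5 for x1, x2 in rows]
mlr_coeffs = fit_mlr(rows, y_linear)  # close to [1, 2, 3]
nlr_coeffs = fit_nlr(rows, y_power)   # close to [2, 0.5, 1.5]
```

Note that fitting in logarithmic space minimizes the squared error of $\ln y$ rather than of $y$, so with noisy data the recovered coefficients can differ from a direct nonlinear fit of the power equation.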

Two separate analyses, one for MLR and the other for NLR, were conducted to identify the important variables and their impacts in terms of frequency analysis and correlations. The following equation was used to calculate the correlation coefficient (CC):

$$CC = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} \tag{6}$$
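
The correlation coefficient can be computed directly from the sums that appear in its definition. The sketch below is a minimal Python illustration for any two equal-length series; the function name `correlation_coefficient` and the sample values are assumptions made for this example.

```python
import math

def correlation_coefficient(x, y):
    """Pearson correlation coefficient computed from raw sums."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Illustrative series only: a perfectly linear pair and its reverse.
cc_up = correlation_coefficient([1, 2, 3, 4], [2, 4, 6, 8])
cc_down = correlation_coefficient([1, 2, 3, 4], [8, 6, 4, 2])
```

A value near 1 or -1 indicates a nearly linear relationship, and the function raises `ZeroDivisionError` if either series is constant.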

3. Application

The State of California offers wide variability in weather, social, and economic data under similar state regulations. A total of 31 counties (Fig. 1) from different regions of the state were considered for the study. A total of 43 independent variables were selected under eight categories: weather, demography, education, household income, occupation, health, transportation, and recreation. The definition, symbol, and data source for each of these variables are presented in Table 1. A brute-force method was applied to enumerate all combinations of one (1) to five (5) independent variables, resulting in 1,099,295 (= 43C1 + 43C2 + 43C3 + 43C4 + 43C5) combinations each for MLR and NLR. These combinations were analyzed separately for each of 17 weeks (from March 18 to July 15). Therefore, the total number of combinations analyzed was 37,376,030 (= 2 × 17 × 1,099,295). Given the vast number of combinations, MATLAB R2018b [19] was used to develop the code for the computations.
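
The combination counts quoted above can be checked with a few lines of Python. This is a minimal sketch; the constant names are hypothetical, and the four-symbol subset is used only to show how a brute-force enumeration might walk through combinations.

```python
from itertools import combinations
from math import comb

N_VARIABLES = 43  # independent variables
MAX_SIZE = 5      # largest combination size considered
N_WEEKS = 17      # weeks analyzed
N_MODELS = 2      # MLR and NLR

# Combinations of 1 to 5 variables for one model and one week.
per_model_week = sum(comb(N_VARIABLES, k) for k in range(1, MAX_SIZE + 1))
total_runs = N_MODELS * N_WEEKS * per_model_week

print(per_model_week)  # 1099295
print(total_runs)      # 37376030

# Enumerating pairs from a small subset of the symbols in Table 1.
subset = ["TP", "VLI", "SH", "PT"]
pairs = list(combinations(subset, 2))
```

Each tuple in `pairs` would correspond to one regression to fit and score in a brute-force search.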

Fig. 1. Study area with population density, weekly average temperature, and number of COVID-19 cases for Week 16 (July 1–7, 2020).

Table 1. Data definition and source.

| Category | Symbol | Definition | Source |
|---|---|---|---|
| Weather | AT | Avg temp (°F) | California Irrigation Management Information System (CIMIS) |
| | MxT | Max temp (°F) | |
| | MnT | Min temp (°F) | |
| | Rh | Relative humidity | |
| | Sr | Solar radiation | |
| | Ws | Wind speed | |
| Demography | TP | Total population | US Census Bureau |
| | PD | Population density | |
| | A18 | Age less than 18 | |
| | A65 | Age over 65 | |
| | AO | Age others | |
| | H | Hispanic | |
| | WNH | White not Hispanic | |
| | W | White | |
| | B | African American | |
| | A | Asian | |
| | Oth | Others | |
| Education | HS | High school | US Census Bureau |
| | BS | Bachelor | |
| Household income | VLI | Very low | US Census Bureau |
| | LI | Low | |
| | LMI | Lower medium | |
| | MI | Medium | |
| | HI | High | |
| Occupation | TO | Total jobs | US Census Bureau |
| | MBSA | Management, business, science and arts | |
| | Serv | Service | |
| | SO | Sales and office | |
| | NRCM | Natural resource, construction, and maintenance | |
| | PTM | Production, transportation, material moving | |
| | SH | Service health | |
| Health | Dia | Diabetes | Center for Disease Control (CDC) |
| | Ob | Obesity | |
| Transport | CA | Car alone | US Census Bureau |
| | CC | Car carpooled | |
| | PT | Public transport | |
| | Wa | Walking | |
| | TC | Taxi cab | |
| | WH | Work from home | |
| Recreation | SAO | Sports arena outside | Google maps (https://maps.google.com) |
| | SAI | Sports arena inside | |
| | CH | Concert hall | |
| | Mu | Museum | |

Frequency analysis consists of identifying the number of weeks for which the correlation between a single independent variable and the dependent variable met a threshold value (CC ≥ 0.8 in this case). Out of the total of 43 variables, 19 were found to be highly correlated with the number of weekly cases (Fig. 2). The main categories of these variables are total population, household income, occupation, and transportation. Both MLR and NLR identify the same variables but with different magnitudes of correlation. For example, in the household income category, VLI (i.e., very low income) has the highest impact and HI (i.e., high income) has the lowest impact on the spread of the virus. Similarly, in the occupation category, SH (i.e., service in health) has the highest impact.
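
The single-variable frequency count described above amounts to counting, for each variable, the weeks in which its CC reaches the threshold. The Python sketch below illustrates this under stated assumptions: the function name `weeks_above_threshold` is hypothetical and the weekly CC values are invented for the example, not taken from the study.

```python
def weeks_above_threshold(weekly_cc, threshold=0.8):
    """Count, per variable, the weeks whose CC is at or above the threshold."""
    return {
        name: sum(1 for cc in values if cc >= threshold)
        for name, values in weekly_cc.items()
    }

# Made-up weekly CC values for two variables over four weeks (illustrative only).
weekly_cc = {
    "TP": [0.95, 0.91, 0.78, 0.88],
    "Rh": [0.10, 0.35, 0.82, 0.20],
}
frequency = weeks_above_threshold(weekly_cc)
print(frequency)  # {'TP': 3, 'Rh': 1}
```

With real data, each list would hold one CC per analyzed week, and the counts would give the bar heights of a frequency plot.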

Fig. 2. Frequency of independent variables affecting the number of cases with CC greater than 0.80 for MLR and NLR with a single independent variable.

Some variables may show better correlation when combined with other variables. Therefore, considering the presence of each variable in 2- to 5-variable combinations, the maximum possible number of occurrences of a single variable was calculated and used in the frequency analysis. Table 2 shows that an independent variable combined with other variable(s) can capture the dynamics better. As seen, the general trend of the impact is similar to that of the single-variable case (Fig. 2). However, variables with minimal or no impact in the single-variable case can have more impact when combined with other variables.
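
One plausible reading of the maximum possible number of occurrences is sketched below in Python. With 43 variables, a given variable appears in C(42, k-1) of the k-variable combinations; multiplying by the 17 weeks is an assumption of this example, as the source does not spell out the normalization. The function names are hypothetical.

```python
from math import comb

N_VARIABLES = 43
N_WEEKS = 17  # assumption: occurrences are pooled over all 17 weeks

def max_occurrences(k, n_variables=N_VARIABLES, n_weeks=N_WEEKS):
    """Most times one variable can appear in k-variable combinations."""
    return n_weeks * comb(n_variables - 1, k - 1)

def frequency_percent(hits, k):
    """Share of a variable's possible occurrences that met the CC threshold."""
    return 100.0 * hits / max_occurrences(k)

for k in range(2, 6):
    print(k, comb(N_VARIABLES - 1, k - 1), max_occurrences(k))
```

Under this reading, a value of 100% means that every combination containing the variable met the threshold in every week.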

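Table 2. Frequency of independent variables in multiple-variable combinations affecting the number of cases with CC greater than 0.80 for MLR and NLR.

| Independent variable | MLR 2 Var | MLR 3 Var | MLR 4 Var | MLR 5 Var | NLR 2 Var | NLR 3 Var | NLR 4 Var | NLR 5 Var |
|---|---|---|---|---|---|---|---|---|
| AT | 44% | 70% | 84% | 92% | 32% | 59% | 78% | 89% |
| MxT | 44% | 70% | 84% | 92% | 27% | 53% | 73% | 86% |
| MnT | 44% | 70% | 84% | 92% | 33% | 61% | 80% | 91% |
| Rh | 44% | 70% | 84% | 92% | 29% | 56% | 76% | 88% |
| Sr | 44% | 70% | 85% | 92% | 22% | 48% | 70% | 85% |
| Ws | 44% | 70% | 84% | 92% | 21% | 47% | 69% | 84% |
| TP | 100% | 100% | 100% | 100% | 83% | 92% | 96% | 99% |
| PD | 45% | 71% | 85% | 92% | 24% | 51% | 72% | 86% |
| A18 | 45% | 71% | 85% | 92% | 34% | 60% | 79% | 89% |
| A65 | 44% | 69% | 84% | 92% | 36% | 63% | 81% | 91% |
| AO | 45% | 70% | 85% | 92% | 22% | 48% | 70% | 85% |
| H | 45% | 70% | 85% | 92% | 39% | 65% | 82% | 92% |
| WNH | 43% | 69% | 84% | 92% | 39% | 66% | 83% | 92% |
| W | 44% | 70% | 85% | 92% | 21% | 48% | 70% | 85% |
| B | 43% | 69% | 84% | 92% | 25% | 51% | 72% | 86% |
| A | 45% | 71% | 85% | 92% | 25% | 52% | 74% | 87% |
| Oth | 43% | 70% | 84% | 92% | 32% | 62% | 82% | 92% |
| HS | 44% | 70% | 85% | 92% | 39% | 65% | 81% | 91% |
| BS | 45% | 71% | 85% | 92% | 36% | 61% | 79% | 90% |
| VLI | 100% | 100% | 100% | 100% | 77% | 86% | 92% | 95% |
| LI | 100% | 100% | 100% | 100% | 68% | 80% | 89% | 94% |
| LMI | 100% | 100% | 100% | 100% | 61% | 76% | 88% | 94% |
| MI | 100% | 100% | 100% | 100% | 73% | 84% | 91% | 95% |
| HI | 100% | 100% | 100% | 100% | 59% | 78% | 89% | 94% |
| TO | 100% | 100% | 100% | 100% | 72% | 85% | 92% | 96% |
| MBSA | 100% | 100% | 100% | 100% | 62% | 79% | 90% | 95% |
| Serv | 100% | 100% | 100% | 100% | 71% | 84% | 91% | 95% |
| SO | 100% | 100% | 100% | 100% | 66% | 79% | 88% | 94% |
| NRCM | 100% | 100% | 100% | 100% | 66% | 81% | 91% | 95% |
| PTM | 100% | 100% | 100% | 100% | 70% | 80% | 89% | 93% |
| SH | 100% | 100% | 100% | 100% | 80% | 90% | 96% | 98% |
| Dia | 44% | 70% | 84% | 92% | 31% | 57% | 75% | 87% |
| Ob | 44% | 70% | 84% | 92% | 26% | 52% | 72% | 85% |
| CA | 100% | 100% | 100% | 100% | 71% | 83% | 90% | 95% |
| CC | 100% | 100% | 100% | 100% | 64% | 78% | 88% | 94% |
| PT | 69% | 90% | 97% | 99% | 45% | 70% | 87% | 94% |
| Wa | 100% | 100% | 100% | 100% | 35% | 61% | 80% | 90% |
| TC | 100% | 100% | 100% | 100% | 30% | 55% | 75% | 87% |
| WH | 100% | 100% | 100% | 100% | 57% | 76% | 88% | 93% |
| SAO | 45% | 71% | 85% | 93% | 22% | 48% | 70% | 84% |
| SAI | 44% | 70% | 85% | 93% | 22% | 47% | 70% | 84% |
| CH | 44% | 70% | 84% | 92% | 23% | 49% | 70% | 84% |
| Mu | 44% | 71% | 85% | 93% | 26% | 54% | 75% | 88% |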

Given the vast number of combinations, the average CC for each combination size over all 17 weeks of results was used to assess the impact of the number of variables in a combination. The average CC represents the trend and average impact of the number of variables on the dynamics of the spread (Fig. 3). As seen, CC increases with the number of variables in a combination. This is expected, since one or a few variables may not be enough to represent the true dynamics of the COVID-19 spread. The slope of the curve is steeper at the beginning than toward the end, indicating smaller contributions from additional variables. As in Fig. 2 and Table 2, NLR is capable of assessing the impact of each combination separately, which makes its average CC smaller than that of MLR.

Fig. 3. Impact of the number of variables in a combination on predicting the number of cases.

Another analysis was conducted to observe how the average CC varies by week. The average CC varies from 0.50 to 0.96 for 1- to 5-variable combinations for both MLR and NLR. The average CC for a particular combination fluctuates more in NLR than in MLR. The highest and lowest CC values are found for Week 2 (March 25–31) and Week 11 (May 27–June 2), respectively.

One of the combinations with the highest CC values (TP, VLI, SH, PT, and CH) was used to observe the dynamics and predictability of the regression coefficients for Week 3 (Fig. 4). As seen, both MLR and NLR can generally capture the pattern of the observed cases for all the counties. While MLR slightly over- and under-predicts, NLR predicts better for lower numbers of cases.

Fig. 4. Observed and predicted weekly cases for all counties using the regression coefficients of one of the best 5-variable combinations for Week 3.

4. Conclusion

A numerical approach consisting of the application of MLR and NLR has been presented to observe the correlation between different socio-economic and meteorological variables and the weekly number of COVID-19 cases. Results show that 8 independent variables under the total population, household income, occupation, and transportation categories are highly correlated with COVID-19 cases. As the number of independent variables in a combination increases, the frequency of occurrence of high correlation coefficients also increases. Similarly, the average CC, calculated over the results of all weeks, increases with the number of independent variables in a combination. Some highly correlated combinations can capture and predict the system behavior with high accuracy. The correlation coefficients for these combinations can be as high as 0.985. These observations can be used to develop a predictive model.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Articles from Case Studies in Chemical and Environmental Engineering are provided here courtesy of Elsevier
