. 2023 Feb 27;88:103598. doi: 10.1016/j.ijdrr.2023.103598

Predicting the issuance of COVID-19 stay-at-home orders in Africa: Using machine learning to develop insight for health policy research

Jordan Mansell a,1, Carter Lee Rhea b,1, Gregg R Murray c,1
PMCID: PMC9968666  PMID: 36875319

Abstract

During the COVID-19 pandemic, many countries have issued stay-at-home orders (SAHOs) to reduce viral transmission. Because of their social and economic consequences, SAHOs are a politically risky decision for governments. Researchers typically attribute public health policymaking to five theoretically significant factors: political, scientific, social, economic, and external. However, a narrow focus on extant theory runs the risk of biasing findings and missing novel insights. This research employs machine learning to shift the focus from theory to data to generate hypotheses and insights “born from the data” and unconstrained by current knowledge. Beneficially, this approach can also confirm the extant theory. We apply machine learning in the form of a random forest classifier to a novel and multiple-domain data set of 88 variables to identify the most significant predictors of the issuance of a COVID-19-related SAHO in African countries (n = 54). Our data set includes a wide range of variables from sources such as the World Health Organization that cover the five principal theoretical factors and previously ignored domains. Generated using 1000 simulations, our model identifies a combination of theoretically significant and novel variables as the most important to the issuance of a SAHO and has a predictive accuracy using 10 variables of 78%, which represents a 56% increase in accuracy compared to simply predicting the modal outcome.

Keywords: Machine-learning, Stay-at-home orders, COVID-19, Health policy, Pandemic

1. Introduction

The COVID-19 pandemic has significantly impacted the lives of people globally. The rising number and lethality of cases resulted in countries implementing significant public health regulations, including government-issued stay-at-home orders (SAHOs). Intended to curb the transmission of the virus by limiting social contact between individuals, such orders are recognized by many as a highly effective public health tool but an extreme step because of their negative economic and social consequences. While SAHOs take many forms, generally speaking, they mandate that individuals remain in their residences except when going to and working an “essential” job or engaging in “essential” activities like acquiring groceries or medical care. Besides limiting fundamental freedoms of movement and association granted to citizens in many countries, they also severely constrain economic activity, including employment and wage-earning, and social activity, including school attendance and attending group-based events. As a result, the issuance of SAHOs by countries has not been uniform [[1], [2], [3]]. Understanding which factors prompted countries to issue SAHOs is significant to health policy research as this information can be used to improve global preparedness and response to future health crises. Unfortunately, health policy experts have had few opportunities to examine this question, and it remains unclear which factors made the most significant contribution.

In the following paper, we investigate which factors are most important for the issuance of SAHOs among the African countries (n = 54) between January 31 and June 15, 2020. We employ a novel and multi-domain data set along with a non-traditional estimation approach to develop a “more sophisticated, wider-scale, and finer-grained” understanding of the issuance of SAHOs than may be offered by traditional, theoretically motivated regression approaches ([4], p. 7). Further, this alternative estimation method can serve as a means to confirm extant theory (e.g., Ref. [5]), as robust effects should be detected regardless of the method used. The objective is to create insights for future hypothesis testing which are “born of the data.” To maintain objectivity, this paper briefly reviews but does not make theoretical arguments about the importance of specific factors.

The data set includes 88 heterogeneous variables that capture country-level information on conventional, theoretically informed factors related to political, scientific, social, economic, and external issues (e.g., Ref. [6]), as well as unconventional measures such as environment and geography for each country in our data set. To identify the most significant factors, we treat the question of which countries issue SAHOs as a dichotomous classification problem (issued SAHO versus did not issue SAHO) and utilize a random forest classifier [7]. This machine learning method is well suited to identifying, within a more extensive set of variables, the small subset that best explains an outcome, which in this case is whether or not a country issued a SAHO. Our result identifies a set of 10 variables that categorize which countries issued a SAHO with 78% accuracy, which represents a substantial increase in accuracy over simply predicting the modal outcome. Furthermore, we find that variables that capture the diffusion of SAHOs are the most important for predicting the issuance of SAHOs. Secondary factors include median age, cumulative cases lagged, the total number of COVID cases per doctor, and the percentage of urban population. In constructing our data set, we drew from well-known primary data sources such as the European Centre for Disease Prevention and Control, State Fragility Index, World Bank, and World Health Organization (WHO).

In the following sections, we discuss the benefits of our machine learning approach to modeling and the broad-spectrum multi-domain data set analyzed. Then, we detail the machine learning process used and the results the process generated as well as a comparative analysis using a traditional regression approach. Finally, we review the implications of the results in terms of both our understanding of this public health issue as well as social science data analysis and knowledge production.

1.1. Benefits of a machine learning approach

Our objective in this study is to identify the most significant predictors of the issuance of a COVID-19-related SAHO in the 54 African countries. From a data analytic perspective on the question of what motivates countries to issue a SAHO, there are several advantages to using a random forest classifier, a machine learning approach, over traditional regression approaches [8,9]. Compared to regression models, random forests are robust to outliers and the distributional properties of independent variables. Because they are not predicting the linear probability of an event, random forests are also less constrained by small sample sizes than regression approaches. The robustness of random forests to sample size and the properties of distributions are important for the current study, which is limited to a small number of cases (n = 54) and for which the independent variables show consistent deviations from normality or outliers. Additional advantages of the random forest approach are that it provides a direct measure of performance estimation (correct classifications), and it can be applied to a wide variety of information types (number and string variables) with minimal data formatting [10]. Lastly, in traditional modeling approaches, variable selection is often biased by researchers for a variety of theoretical, and at times, arbitrary reasons [8,9,11]. For example, statistical training and familiarity often lead social science researchers to favor maximum likelihood approaches to data analysis and to select variables displaying certain distributional characteristics (e.g., normality) [8,12]. Furthermore, common practices such as categorical recoding of variables may introduce biases or obscure relevant statistical patterns.

Principally concerned with the efficient categorization of outcomes, random forest classification minimizes these biases by not discriminating against variables a priori. Rather, the algorithm is solely concerned with identifying the variables and cut points that matter for outcome classification [13]. Beneficially, the random forest approach still allows for the application of theoretical arguments after the fact. Because of these advantageous properties, random forests are being utilized in economics to study problems of mass appraisal, evaluation, and market prediction [8,12,13], and in political science to study civil war onset [9,14], the effect of IMF policy on child poverty rates [15], nuclear proliferation [16], and migration [17].

1.2. Case Selection and data set

We test our model using data from the 54 African countries to evaluate the issuance of SAHOs in the first wave of the COVID-19 pandemic. Africa is among the most socially, politically, and economically complicated regions in the world [18]. The 54 countries together count more than 1000 languages and 3000 ethnicities. They operate with wide ranging institutional arrangements and political ideologies that have been shaped by deep-rooted tribal, ethnic, and colonial forces [19,20]. As a result of long-term corruption, administrative deficiencies, and armed conflict, many political institutions remain underdeveloped [21]. The northern region of the continent, which is comprised primarily of Arab states with substantial petroleum resources, enjoys greater economic self-sufficiency than the remaining countries of Sub-Saharan Africa, many of which have suffered a great deal of political and economic strife and upheaval [21]. In terms of public health, the continent is disproportionately affected by global disease and epidemics [22,23] such as tuberculosis, malaria, and HIV/AIDS. In 2019, the WHO indicated that African countries earned a mean score of 46 (out of 100) for pandemic preparation versus a global mean of 63 [24]. That said, individual country scores varied widely, with four countries scoring 25 or less and five scoring 70 or more.

While the decision to apply our model to African countries is arbitrary to some degree, there are three reasons that justify the African case. First, Africa is understudied in general, providing an opportunity to expand the existing health policy literature [25]. Second, the characteristics of African countries are highly heterogeneous [26]. For the purpose of utilizing a random forest classifier, sample heterogeneity is crucial for avoiding model overfit and ensuring that the algorithm is capable of correctly categorizing observations outside a given sample. Third, the heterogeneity of the African sample increases the likelihood that findings should generalize to other contexts [27]. In other words, variables that work in the African data set may be more likely to work more broadly.

Reliable data on Africa are difficult to collect. Many governments are authoritarian and avoid transparency [28]. There are limited resources for reliable economic and census data in general and university resources for political science-related research more specifically ([18], p. 2). Further, there are ongoing conflicts that often make it dangerous for researchers to conduct studies ([18], p. 2). With the challenges of data reliability in mind, data were collected from well-regarded sources including the World Bank, World Health Organization, United Nations, Database of Political Institutions, and the Sustainable Development Solutions Network. From an initial data collection of 227 independent variables from 10 data sets, we reduced the size of the input data to 105 variables by dropping all variables for which a country was missing a value. After dropping duplicate variables, the final data set contains 88 variables from six data sets. The most current year available was used for each measure, but nothing before 2015. Most measures were from 2018 or 2019. The details of the data-cleaning procedure are described below.
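The two reduction steps described above (dropping variables with any missing country value, then dropping duplicates) can be sketched as follows. This is a minimal illustration on a toy table, not the authors' cleaning script: it assumes pandas, the column names are hypothetical, and it treats "duplicate" as "identical column values", whereas the authors may have identified duplicate variables by meaning across sources.

```python
import pandas as pd

# Toy stand-in for the country-by-variable table (hypothetical names and values).
raw = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp": [1.0, 2.0, 3.0],
    "gdp_copy": [1.0, 2.0, 3.0],   # exact duplicate of "gdp"
    "credit": [None, 5.0, 6.0],    # missing for country A
})
features = raw.set_index("country")

# Step 1: drop every variable for which at least one country has no value.
complete = features.dropna(axis="columns")

# Step 2: drop variables whose values duplicate an earlier variable.
deduped = complete.loc[:, ~complete.T.duplicated()]

print(list(deduped.columns))
```

On the toy table, only `gdp` survives both steps.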

The dependent/outcome variable is sahoStatus, which is coded 1 when a country had issued a SAHO and 0 when it had not. Overall, 41 countries issued a SAHO, and 13 did not between January 31 and June 15, 2020 [29]. January 31 is the day after the WHO declared a public health emergency related to COVID-19, and June 15 is approximately 30 days after the last African country issued a SAHO during the first wave of the pandemic. We collected one observation for each country. Because the random forest algorithm performs more efficiently with a balance in outcome categories, we created a dataset with 27 observations of countries that had issued a SAHO during the evaluation period and 27 that had not. The 27 observations of countries that had issued a SAHO were randomly selected from the full list of 41 countries that issued a SAHO during the 136-day evaluation period. The recorded variable values for those countries were the observations for each of those countries on the day they issued their SAHO (e.g., Algeria issued a SAHO on March 23, so the recorded data for the observation on Algeria were the recorded variable values for Algeria on March 23). The 27 observations of countries that had not issued a SAHO were also randomly selected. The recorded variable values for those countries were observations on randomly selected days on which they had not issued a SAHO (e.g., April 12 was randomly selected for Ethiopia, and Ethiopia had not issued a SAHO on or before that day, so the recorded data for the observation on Ethiopia were the recorded variable values for Ethiopia on April 12). This variable is derived from the Oxford COVID-19 Government Response Tracker data [30], specifically variable C6. Linking the issuance of a SAHO with a specific day is necessary to evaluate the contribution of external factors such as Temporal Diffusion to the issuance of SAHOs.

Following [31], the large number of independent variables follows the best practices for a data-informed approach, such as machine learning, which aims to be as comprehensive as possible without sacrificing the quality or reliability of the information. This broad-based approach allows our analyses to be informed by the data rather than constrained by extant theory. We, therefore, sought to incorporate a wide selection of reliable metrics into our data set. When building the data set, the objective was to sample a large and heterogeneous set of reliable variables to avoid excluding factors that might be inadvertently overlooked or dismissed for arbitrary reasons such as past statistical training or personal preferences. In some cases, questions about the full theoretical value or the interpretability of variables from the data sets were tabled until after the analysis in order to sample the widest possible range of predictors. For simplicity, we divide these variables into ten categories according to variable type and the source database. These categories are (1) theoretical, (2) state fragility, (3) development indicators, (4) institutional, (5) WHO goals, (6) WHO regulations, (7) CIA Factbook, (8) United Nations, (9) Sustainable Development Solutions Network, and (10) miscellaneous. The following descriptions are based on the initial data set of 227 variables.

1.2.1. Theoretical

These variables were selected based on theoretical expectations identified by Ref. [29] as well as [32] in their evaluations of SAHOs issued in the first wave of the COVID-19 pandemic in Africa and in the Middle East and North Africa region. In these articles, the literature on public health policymaking (e.g., Ref. [33]) particularly informed the selection of these variables. They fall into five broad categories: political, scientific, social, economic, and external.

One approach to understanding health policymaking is to think of factors inside versus outside a country that affect its policy adoption (e.g., Ref. [34]). The inside or “internal” factors include political, social, and economic forces. Political factors include issues such as concentration of power and governance capacity. For example, in countries in which power is concentrated, such as authoritarian countries, the policymaking process is easier as there are fewer actors to satisfy [35]. Similarly, governments with greater governance capacity have greater access to administrative tools with which to make and implement policy [36]. Social factors include issues such as the role of government in the provision of health care and population vulnerability. When health care is perceived as a public good, such as in countries with a national health service, a country is more likely to enact more stringent policies [37]. Sociodemographic characteristics also matter as disease vulnerability varies by, for example, age, sex, education, and income [38]. Countries with a more vulnerable population are likely to enact more stringent policies to protect their vulnerable population. Economic factors include country wealth. Evidence suggests that wealthier countries are more likely to fund the provision of health care than poorer countries [39]. Science and health policies require specific expertise to assess [40], so policy researchers sometimes include scientific or medical forces among the internal factors [6]. Science or medical forces include the physical threat of the disease, a government's experience with serious diseases, and government preparedness for a disease. When a disease poses a serious threat to a population [41] and the government recognizes that threat [42], it is more likely to impose more stringent policies in response.

The outside or “external” factors capture how policies diffuse across jurisdictions. Policies could diffuse across states because they learn from each other, compete with or coerce each other, or exert normative pressure [43]. These open at least two avenues for diffusion: geography and time. Geographic diffusion occurs when policies diffuse between neighbors or physically proximate states [44]. Temporal diffusion often represents common patterns of adoption, such as the S-shaped innovation adoption curve which manifests as an accelerating and decelerating curve [45]. Ultimately, of 37 initial theoretical variables, 20 were retained in the final data set.

1.2.2. State fragility

These variables are composed of the components of the State Fragility Index (SFI) [46]. Unlike the political focus in Ref. [32] on regime type and administrative capacity, these variables include measures of government security; political, economic, and social effectiveness and legitimacy; as well as indicators of armed conflict, oil production and consumption, and regional effects. These are political in nature. Ultimately, of 14 initial SFI variables, 1 was retained in the final data set.

1.2.3. Development indicators

These variables are derived from the World Development Indicators (WDI) from the World Bank [47]. The World Bank provides hundreds of variables, so a selection process had to be determined. In this case, the World Bank categorizes its variables into “data themes,” which include poverty and inequality, people, environment, economy, states and markets, and global links [47]. Each of these categories “features” sets of indicators. For instance, the people category reports subcategories of featured variables representing population dynamics, education, labor, health, and gender [48]. At least one indicator was selected from each subcategory in each category, with two to three exceptions when the variable had been previously included in the data set. The variables are primarily economic and social in nature. Ultimately, of 43 initial WDI variables, eight were retained in the final data set.

1.2.4. Institutional

These were collected from the Database of Political Institutions [49], which is maintained by the Inter-American Development Bank. The DPI includes data on political institutions and elections such as measures of checks and balances, government tenure and stability, ideology and party affiliation, and legislative party fragmentation. Author judgment informed variable selection. These variables are political in nature. Ultimately, of 18 initial DPI variables, none were retained in the final data set.

1.2.5. WHO goals

Collected from the WHO, these variables were selected because they are the indicator variables used to measure progress toward the 17 Sustainable Development Goals (SDGs) adopted by world leaders in September 2015 when they “set out a vision for a world free of poverty, hunger, disease and want” [50]. In particular, the measures are part of SDG 3, which addresses “healthy lives and promoting well-being,” and at least one indicator was selected from each subcategory in SDG 3 except when no variables were available. The variables are primarily health-related. Ultimately, of 12 initial SDG variables, five were retained in the final data set.

1.2.6. WHO regulations

These were collected from the WHO regarding International Health Regulations (IHR) related to nations' capacity to “prevent, detect, assess, notify, and respond to public health risks and acute events” of international concern [24]. The data include the 13 capacities (e.g., legislation and financing, food safety, and risk communication) and a total average score. The variables are primarily related to nations’ public health risks and crises as well as their public health infrastructure. Ultimately, of 14 initial IHR variables, none were retained in the final data set.

1.2.7. Factbook

These variables were collected from the CIA World Factbook [51], which provides a multidomain collection of data types of interest to comparative politics and international relations scholars [52,53]. These variables capture a wide range of important domains relative to countries of the world including geography, people and society, government, the economy, energy, communications, transportation, and military and security. Ultimately, of 69 initial Factbook variables, 49 were retained in the final data set.

1.2.8. United Nations

The eighth set of variables was collected from the 2020 United Nations World Economic Situation and Prospects (WESP) report [54] (see the Statistical Annex). This set consists of eight measures that broadly characterize each country in terms of economic development and basic economic conditions. Beyond general categories of development, the data also include measures of national income, fuel exportation/importation, heavy indebtedness, and four identified African regions. As suggested by the name of the report from which they are drawn, the measures are primarily related to economic and development concerns. Ultimately, of eight initial WESP variables, five were retained in the final data set.

1.2.9. Sustainable Development Solutions Network (SDSN)

These were collected from the 2020 Sustainable Development Solutions Network report [55]. This set of five measures captures a diverse set of social issues regarding citizen perceptions of social support, freedom, generosity, corruption, and happiness. These data are derived from public opinion surveys in each country fielded by the Gallup Organization. As such, they are the only variables based on data collected at the individual level. Ultimately, of five initial SDSN variables, none were retained in the final data set.

1.2.10. Miscellaneous

A set of miscellaneous variables mostly related to countries’ credit ratings was collected from various sources. Ultimately, of eight initial miscellaneous variables, none were retained in the final data set.

2. Methodology

There are several methods to extract critical subsets of variables or “features” from a larger data set, including a standard covariance study, principal component analysis (PCA), and decision trees. To gain insight into the important features in our data set while also minimizing biases resulting from the data structure, we employ a Monte-Carlo implementation of the random forest algorithm, an extension of decision trees [7]. Decision trees recursively split the data into their classifications (determined a priori) by optimizing the information gain at each split [56]. The advantage of this approach, as compared to covariance or PCA models, is that it minimizes biases in variable selection resulting from the data structure (e.g., normality).
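To make "optimizing the information gain at each split" concrete, the sketch below computes the entropy of a set of binary labels and the information gain of a single threshold split. It is a toy illustration only: the median-age values and SAHO labels are invented, and the helper names `entropy` and `information_gain` are hypothetical rather than taken from the authors' code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold vs. > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Invented example: median age per country and whether a SAHO was issued.
median_age = [17, 18, 19, 25, 27, 30]
saho = [0, 0, 0, 1, 1, 1]

gain_clean = information_gain(median_age, saho, threshold=22)  # separates classes fully
gain_mixed = information_gain(median_age, saho, threshold=18)  # leaves one side mixed
print(gain_clean, gain_mixed)
```

A tree-growing algorithm evaluates many candidate thresholds per variable in this way and keeps the split with the largest gain; here the threshold of 22 is preferred because it leaves both child nodes pure.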

Several decision tree algorithms exist. We employ the standard CART (Classification And Regression Tree) algorithm implemented in the python package sklearn ([57]; the CART algorithm is the standard decision tree algorithm used throughout the literature). We test both splitting criteria: the Gini index and entropy. As expected, testing demonstrates that there is no statistically meaningful discrepancy in our results regardless of whether the Gini index or entropy is used. We report our results using entropy. To optimize the decision tree algorithm for predictive accuracy, we use the random forest algorithm, which utilizes power in numbers [7]. While decision trees represent a powerful class of algorithms, they suffer bias from their training set. To overcome this bias, the random forest leverages two disparate techniques: bootstrap aggregation (or bagging) and random feature (independent variable) selection [58]. Each decision tree in the random forest is trained on a subset of the initial training set chosen by random sampling with replacement from the initial set. Furthermore, each tree randomly selects without replacement a subset of features on which it calculates splits.
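The two randomization steps can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the sklearn internals: the variable names are placeholders, and the square-root rule for the size of the feature subset is an assumption, since the text does not state the subset size used.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

train_rows = list(range(40))                # indices of 40 training observations
features = [f"var_{i}" for i in range(88)]  # placeholder names for 88 variables

# Bagging: draw training rows WITH replacement, same size as the training set,
# so some rows appear several times and others not at all.
bootstrap_rows = rng.choices(train_rows, k=len(train_rows))

# Random feature selection: draw a feature subset WITHOUT replacement.
n_subset = int(math.sqrt(len(features)))    # assumed sqrt rule, gives 9 of 88
feature_subset = rng.sample(features, n_subset)

print(len(set(bootstrap_rows)), "distinct rows in the bootstrap sample")
print(feature_subset)
```

Because each tree sees a different resample and a different feature subset, the errors of individual trees are less correlated, which is what lets the averaged forest reduce the bias of any single tree toward its training set.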

Our implementation of the random forest uses 50 decision trees. We apply a maximum recursion level of 30 and allow for standard automatic pruning as implemented in sklearn in order to reduce overfitting [57]. We set aside 40 randomly selected countries (∼75% of the total sample) to serve as the training set, while the remaining 14 countries (∼25%) are used to test the algorithm. We note that the training and test sets do not change throughout this work. Variables described in the Case Selection and data set section above serve as inputs (independent variables) to train the tree while sahoStatus serves as the target (dependent) variable. To further reduce bias and overfitting, we split our feature selection algorithm into two parts: initial variable selection and final importance calculation. During the initial variable selection, we apply a Monte-Carlo implementation of the random forest algorithm with 1000 iterations using all 88 variables. At each iteration, we record the top 50 variables ranked on importance as calculated by the feature importance attribute implemented in sklearn. This attribute measures the mean decrease in impurity (i.e., the entropy or Gini index) on splits that are computed for a given variable; it is therefore a proxy for feature importance in the tree. After our initial 1000 runs, we retain the top 30 variables from our initial list of 88 based on their mean feature importance score across all runs. For the final calculation, we again run the Monte-Carlo implementation; however, we only use the top 30 variables from the initial implementation. We take the feature importance from this implementation to determine the final values. In order to demonstrate the methodology's predictive power, we report in Table 1 the weighted f1-score, a standard metric commonly used in machine learning, obtained when retaining 1, 5, 10, 15, 20, 25, and 30 variables. The f1-score is computed as $F_1 = 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$. In terms of the true positives (TP), false positives (FP), and false negatives (FN), the f1-score is $F_1 = \mathrm{TP} / (\mathrm{TP} + 0.5\,(\mathrm{FP} + \mathrm{FN}))$. Since the test sample is limited, we run 1000 additional instances of a forest for each variable configuration. We find that 10 variables are sufficient to obtain greater than 78% predictive f1-score. Increasing the number of variables does not meaningfully increase the predictive power. We note that the efficacy of this method is limited by the small number of training and test countries, but we also note that the random forest classifier is well-suited for small data sets [8,9].
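The two-stage procedure above can be sketched with scikit-learn as follows. This is an illustrative approximation on synthetic data, not the authors' code: the data come from `make_classification` rather than the 88-variable country data set, only 10 Monte-Carlo runs are used instead of 1000 to keep the sketch fast, the per-iteration "top 50" bookkeeping is omitted, and the helper names `mean_importance` and `mean_test_f1` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 54 observations, 88 variables, binary outcome.
X, y = make_classification(n_samples=54, n_features=88, n_informative=10,
                           random_state=0)
# Fixed 40/14 split, reused throughout as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=40, random_state=0, stratify=y)


def make_forest(seed):
    # 50 trees, depth capped at 30, entropy criterion, as described in the text.
    return RandomForestClassifier(n_estimators=50, max_depth=30,
                                  criterion="entropy", random_state=seed)


def mean_importance(X_tr, y_tr, n_runs):
    """Average impurity-based feature importance over n_runs forests."""
    runs = [make_forest(seed).fit(X_tr, y_tr).feature_importances_
            for seed in range(n_runs)]
    return np.mean(runs, axis=0)


N_RUNS = 10  # the text uses 1000; reduced here for speed

# Stage 1: rank all 88 variables and keep the 30 with the highest mean importance.
stage1 = mean_importance(X_train, y_train, N_RUNS)
top30 = np.argsort(stage1)[::-1][:30]

# Stage 2: rerun on the retained 30 variables to obtain the final ranking.
stage2 = mean_importance(X_train[:, top30], y_train, N_RUNS)
ranked = top30[np.argsort(stage2)[::-1]]


def mean_test_f1(k, n_runs):
    """Mean weighted f1-score on the test set using the k top-ranked variables."""
    cols = ranked[:k]
    scores = []
    for seed in range(n_runs):
        forest = make_forest(seed).fit(X_train[:, cols], y_train)
        scores.append(f1_score(y_test, forest.predict(X_test[:, cols]),
                               average="weighted"))
    return float(np.mean(scores))


for k in (1, 5, 10):
    print(k, round(mean_test_f1(k, N_RUNS), 2))
```

Averaging importances over many seeded forests is what makes the ranking stable with so few observations; a single forest on 40 training rows can reorder mid-ranked variables from one seed to the next.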

Table 1.

F1-Score of the Random Forest in Predicting a Stay-at-Home Order at Different Levels of Retained Variables.

| Number of variables in final training | Test f1-score (PRE^a) |
|---|---|
| 1 | 68% (36%) |
| 5 | 70% (40%) |
| 10 | 78% (56%) |
| 15 | 77% (54%) |
| 20 | 77% (54%) |
| 25 | 76% (52%) |
| 30 | 74% (48%) |

a Proportional reduction in error.
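The PRE values in Table 1 can be reproduced from the reported scores if the baseline of predicting the modal outcome is taken to be 50%, which is an assumption based on the balanced sample of 27 SAHO and 27 non-SAHO observations. The sketch below also treats the f1-score as the share of correct predictions, as the table appears to do; the function name `pre` is hypothetical.

```python
def pre(score, baseline=0.5):
    """Proportional reduction in error relative to a baseline score.

    Baseline error is (1 - baseline); model error is (1 - score).
    """
    return ((1 - baseline) - (1 - score)) / (1 - baseline)


# Test f1-scores from Table 1, keyed by number of retained variables.
table1 = {1: 0.68, 5: 0.70, 10: 0.78, 15: 0.77, 20: 0.77, 25: 0.76, 30: 0.74}

for n_vars, score in table1.items():
    print(n_vars, f"{score:.0%}", f"PRE = {round(pre(score) * 100)}%")
```

With a 50% baseline, the proportional reduction in error and the relative increase in accuracy coincide (0.78 / 0.50 = 1.56), which is consistent with the 56% figure being described both ways in the text.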

3. Results

Fig. 1 reports the individual feature importance of the top 30 variables in our data. See Supplemental Material A for full descriptions of abbreviated variable names reported in Fig. 1 and elsewhere. Feature importance is an average measure of how much a feature decreases the weighted impurity in a tree. A measure of the homogeneity of the labels at a node, impurity is the measure that decision trees try to minimize when splitting each node (e.g., the proportion of countries at a given GDP that issue a SAHO). We report that our methodology predicts the issuance of SAHOs using a subset of 10 of these variables with an f1-score of 0.78, which represents a substantial 56% increase in accuracy over simply predicting the modal outcome. See Table 1 for the comparative predictive accuracies when our model is constrained to different numbers of variables (i.e., 1, 5, 10, 15, 20, 25, and 30) and Table 2 for descriptions of the top 10 variables.

Fig. 1.


Feature importance vs feature name for the 30 most important variables.* The error bars represent the 95% confidence interval. * See Supplemental Material A for descriptions of abbreviated variable names. * See Supplemental Material B for descriptions of abbreviated variable names.

Table 2.

10 most important variables in order of importance.

| Feature importance | Variable/Feature | Definition (data set) | Zero-order (10-var model) correlation with DV | Related variables^a,b |
|---|---|---|---|---|
| 1 | tempDiffS | Temporal diffusion: generic S-shaped innovation curve. (theoretical) | - (−) | -tempDiff66 |
| 2 | prevAdopt | Percent of African countries previously issuing a SAHO. (theoretical) | + (+) | dayNum, -consIncome, geoDiff |
| 3 | MedianAge | Country median population age. (Factbook) | + (+) | ElectyRural; commBanks; ITCELSETSP2; ITNETUSERZS; agValWork; gniPerCap; primaryEduc; consIncome; gdpEmployed; ciaGDPPerCapitaPPP |
| 4 | cumCasesLag (cumCasesLag1P100 KC) | Daily cumulative COVID-19 cases per 100,000 population, lagged one day. (European Centre for Disease Prevention and Control) | + (−) | cumDeathsLag1P100 KC, deathsMD |
| 5 | Area | Sum of all land and water areas delimited by international boundaries and/or coastlines. (Factbook) | - (−) | Airports |
| 6 | envMort | Death rate from ambient and household air pollution (per 100 k population). (WHO - SDG) | - (−) | |
| 7 | casesMD | Daily cumulative COVID-19 cases in a country, lagged one day, divided by the number of medical doctors in that country. (European Centre for Disease Prevention and Control, WHO Global Health Observatory Data Workforce) | + (+) | |
| 8 | urbanPop | Urban population (% of total population). (WDI) | + (+) | pctINET, exports, consIncome, gdpEmployed, obesity, GDPPerCapLn |
| 9 | StartBIZ | Time required to start a business. (WDI) | - (−) | -liec |
| 10 | muslimPct | Percentage of a country's population adhering to Islam. (Factbook) | - (−) | |

a Pearson r > 0.5 or < -0.5; a negative sign in front indicates a negative correlation.

b See Supplemental Material A for descriptions of abbreviated variable names.

Table 2 includes the source of each variable. Overall, it indicates that the two most important variables are from the data set of theoretical variables. Of the remaining variables, three are from the CIA World Factbook, two are from the European Centre for Disease Prevention and Control, two are World Development Indicators, and one is from the WHO Sustainable Development Goals.

The first two variables (tempDiffS, prevAdopt) are theoretical measures of external effects in the form of policy diffusion, that is, the effects of policy adoptions in other states on the adoption of a policy in a target state [59]. Policy diffusion falls into the set of variables informed by extant theory. The variable tempDiffS is a measure of temporal diffusion as represented by a generic S-shaped innovation curve. In particular, this measure is fitted to the centre of the timeframe, day 65, with values decreasing to the centre point and increasing thereafter, and not to the actual hazard rate, which would render the measure empirically meaningless. This measure captures a learning process where early adopting states demonstrate the effects of an adopted policy. Next, non-adopting states watch the outcomes of the policy in the early adopting states and several subsequently adopt. Finally, several laggard states also adopt, resulting in an S-shaped cumulative distribution curve. The variable prevAdopt, on the other hand, is the percentage of African countries previously issuing a SAHO. It is likely best interpreted as a measure of geographic diffusion, in this case, regional diffusion. As such, adopting states are learning from, competing with, or coerced by the policy adoptions of geographically proximate states.
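As a rough illustration of the two diffusion ideas, the sketch below evaluates a generic logistic (S-shaped) cumulative adoption curve centred on day 65 and computes the share of countries that issued a SAHO before a given day. It does not reproduce the authors' tempDiffS or prevAdopt variables, whose exact construction is not given in the text: the logistic form, the `steepness` value, the function names, and the list of issue days are all assumptions made for illustration.

```python
from math import exp


def s_curve(day, midpoint=65, steepness=0.1):
    """Generic logistic S-curve: slow start, rapid middle, slow finish."""
    return 1 / (1 + exp(-steepness * (day - midpoint)))


def prev_adopt(day, issue_days, n_countries=54):
    """Percent of countries that had issued a SAHO strictly before `day`."""
    return 100 * sum(d < day for d in issue_days) / n_countries


# Invented issue days (day 1 = January 31) for a handful of adopting countries.
issue_days = [50, 52, 53, 53, 55, 58, 60, 61, 70]

for day in (1, 52, 65, 136):
    print(day, round(s_curve(day), 3), round(prev_adopt(day, issue_days), 1))
```

The first function depends only on calendar time, while the second depends on what other countries have actually done, which is why the two variables can carry distinct information even though both rise over the evaluation period.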

The third variable is a country's median age (ciaMedianAge). The role of age is widely recognized in extant public health policy research [37]. Regarding the current case, research suggests older people are more vulnerable to negative COVID-19 outcomes [60]. With that in mind, a government with older citizens may be more likely to issue a SAHO.

The fourth and seventh variables are lagged cumulative COVID-19 cases (cumCasesLag1P100 KC) and cases per medical doctor (casesMD). These medical variables likely reflect the strain placed on a country's health care system by the pandemic, which SAHOs are intended to reduce (i.e., flatten the curve), and are consistent with the extant theory.

The fifth variable is the land area or the sum of all land and water areas delimited by international boundaries and/or coastlines (Area). To our knowledge, this measure has been ignored by extant public health theory. The effect may be related to population density as countries with greater land area likely have lower population density and, therefore, may be less vulnerable to the spread of infectious disease [61]. This may make them less likely to take extreme steps to address a pandemic, but this is a factor open to further investigation.

The sixth variable is the death rate from ambient and household air pollution (envMort). This scientific factor may reflect the state of a country's healthcare system or the impact of pre-existing conditions or comorbidities on the severity of COVID-19 cases. Countries with higher deaths from environmental factors such as respiratory illnesses from pollution may have more developed healthcare systems and, as a result, may be less concerned about being overwhelmed by patients with aggressive infectious diseases. Countries with higher probabilities of death from noncommunicable diseases tend to have fewer deaths from communicable diseases (i.e., infectious diseases) and may be less concerned about taking steps to stop the spread of a highly virulent virus.

The eighth variable is urban population (urbanPop). Social arrangements matter, and research suggests that infectious diseases spread more quickly through urban and presumably more densely populated areas [61]. Indeed, Fig. 1 indicates population growth and population are the 11th, 12th, 24th, and 29th most important variables identified by the machine learning process.

The ninth variable is the time required to start a business (startBIZ). While this might be viewed as an effect of economic and/or bureaucratic capacity, it is not a conventional indicator of either. Table 2 indicates it is consequentially correlated with a measure of legislative and executive electoral competitiveness, but this is likely a random association. This analysis, then, suggests another factor to investigate.

The tenth variable is the percentage of the population that adheres to Islam (MuslimPct). One possibility is that high levels of ethnic or religious homogeneity may be related to the issuance of a SAHO. Indeed, the percentage of the population that adheres to Christianity (ChristPct) is also identified as the 17th most important variable; however, neither variable is significantly correlated with the issuance of a SAHO. The interpretation of this variable is not straightforward.

Overall, the results confirm a number of factors found in the extant theoretical public health literature and raise questions about others that may warrant further investigation. Public health scholars are unlikely to be surprised about the findings regarding policy diffusion (two measures), median age, cumulative cases, and cases per doctor (variables 1, 2, 3, 4, and 7, respectively). Some effects, such as environmental mortality and percentage of urban population (variables 6 and 8, respectively), may be more compelling to some public health scholars than others. Three variables, land area, time to start a business, and Muslim percent (variables 5, 9, and 10, respectively), do not fit neatly in extant public health policy research and may be viable candidates for further explanation.

3.1. Estimating directionality for future modeling and hypothesis testing

The machine learning analyses identified the most significant predictors of the issuance of a SAHO within the larger data set of predictors. A limitation of the random forest machine learning algorithm, though, is that it does not readily indicate whether the relationship between a predictor and the outcome variable is positive or negative. To estimate direction, we ran a series of zero-order and multivariate logistic regression models with robust confidence intervals on the probability of adopting a SAHO. We emphasize that linear regression does not approximate the features of a random forest (e.g., splitting data according to classifications that optimize the information gain at each disunion); it is therefore not appropriate to evaluate the outcomes of our machine learning model according to each variable's statistical significance in linear regression. Rather, the purpose of these regression models is to inform future research and hypothesis testing. We also note for readers that the predicted probability of several estimates exceeds 1. While predicted probabilities are typically between 0 and 1, with continuous variables values can exceed 1 when the change in slope of the tangent line between two values exceeds 1 (e.g., 3 and 4) [62].
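To make the contrast with regression concrete, the following minimal sketch shows how a single tree split can be scored by information gain. The data and function names are hypothetical, and the entropy-based criterion is taken from the description above rather than from the authors' code.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting cases at `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Hypothetical predictor values and SAHO outcomes (1 = order issued).
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]

gain_clean = information_gain(x, y, threshold=3)  # separates the two classes
gain_poor = information_gain(x, y, threshold=1)   # leaves one mixed branch
print(gain_clean, gain_poor)
```

The gain measures only how well a split separates the outcome classes. It carries no sign, which is why a separate model is needed to estimate direction.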

Of the top 10 variables, all except the percentage of urban population (urbanPop) showed significant skewness or kurtosis. Strong skewness and kurtosis often result in violations of the linear regression assumption of normality. Prior to their inclusion in the logistic model, then, a logistic transformation was performed on five variables (MedianAge, Area, envMort, StartBIZ, and MuslimPct) and hyperbolic sine transformations on three variables (prevAdopt, cumCasesLag1P100 KC, and casesMD). The implementation of a transformed versus non-transformed variable was determined by whether the transformation reduced the skewness of the variable. As a result, despite displaying significant kurtosis, one variable (tempDIFFS) was not transformed prior to its inclusion in a logistic regression model. Lastly, to improve comparability we standardized each of the variables before including them in the logistic model.
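The selection rule described above (keep a transformation only if it reduces skewness, then standardize) can be sketched as follows. The data are hypothetical, and reading "hyperbolic sine transformation" as the inverse hyperbolic sine (`math.asinh`) is an assumption. The helper names are illustrative only.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: the mean cubed z-score."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)

def standardize(xs):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def transform_if_helpful(xs, transform):
    """Keep the transformed values only when they lower absolute skewness."""
    candidate = [transform(x) for x in xs]
    if abs(skewness(candidate)) < abs(skewness(xs)):
        return candidate, True
    return xs, False

# Hypothetical right-skewed predictor (for example, cases per doctor).
raw = [0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 40.0, 250.0]
values, was_transformed = transform_if_helpful(raw, math.asinh)
z = standardize(values)
print(was_transformed, round(skewness(raw), 2), round(skewness(values), 2))
```

A variable such as tempDIFFS, for which no candidate transformation reduces skewness, would pass through this rule untransformed and simply be standardized.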

Looking at the summary of zero-order correlations in Table 3 (with the relationships reported symbolically, as + or -, in the fourth column of Table 2), five variables are positively correlated and five are negatively correlated with the issuance of a SAHO. The previous adoption in other countries, median age, cumulative cases lagged, cases per MD, and urbanPop are positively correlated with the issuance of a SAHO. The likely positive relationships for previous adoption, median age, cumulative cases lagged, cases per MD, and urban population seem consistent with extant literature as mentioned above. Temporal diffusion, geographic area, environmental mortality, length of time required to establish a business, and the percentage of a population adhering to Islam are negatively correlated with the issuance of a SAHO. While a negative relationship is likely within the extant literature for temporal diffusion and environmental mortality, it is not clear to us that the current literature offers a rationale for the other relationships.

Table 3.

Summary statistics logistic regression models. Independent zero-order correlations between the standardized variables identified by the random-forest as significant to the issuance of a stay-at-home order and the probability of issuing a stay-at-home order.

Variable Coef. Rob. Std. Err. Z P>|z| [95% Conf. Int] BIC
tempDIFFS −1.243 0.358 −3.47 .001*** −1.945 −0.540 68.15
prevAdopt 0.198 0.283 0.70 .484 −0.357 0.754 82.32
MedianAge 0.695 0.382 1.82 .069 −0.054 1.443 77.56
cumCaseLag 0.052 0.271 0.19 .848 −0.480 0.583 82.80
Area −0.368 0.322 −1.14 .253 −0.999 0.263 81.14
envMort −0.697 0.294 −2.37 .018** −1.274 −0.121 77.17
casesMD 0.151 0.227 0.66 .507 −0.294 0.595 82.57
urbanPop 0.452 0.266 1.70 .089 −0.069 0.973 80.28
StartBIZ −0.450 0.265 −1.70 .090 −0.969 0.070 80.30
muslimPct −0.185 0.128 −1.45 .148 −0.437 0.066 80.75
N = 54

Note: Summarizes the outcomes of multiple independent regression models. The full results are listed in Tables 2–11 in the online appendix. Variables are ordered according to their importance in the random forest. P < 0.100; **P < 0.050; ***P < 0.001 (two-tailed).

Table 4 summarizes the results of six logistic regression models that predict the issuance of SAHOs using the 10 most important variables as identified by the random forest and reported in Table 2. These models indicate the estimated unique effect of each variable after accounting for the effects of the other variables. Only in one case (cumulative cases lagged) do the estimated effects of the 10-variable logistic regression model diverge from the bivariate models reported in Table 3. The large majority of the effects are therefore robust to covariates. In particular, consistent with the results of the random forest, comparing the measures of model significance (Wald), explained variance (R2), and model fit (BIC) shows that temporal diffusion disproportionately contributes to the fit and significance of each model. Additionally, temporal diffusion is the only variable that consistently reaches significance at the 95% confidence level.

Table 4.

Summary Statistics Logistic Regression Models. Correlations between multiple standardized independent variables identified by the random forest as most important to the issuance of a stay-at-home order and the probability of issuing a stay-at-home order.

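| Variable | Top 1 Variable | Top 3 Variables | Top 5 Variables | Top 7 Variables | Top 10 Variables | Top 6–10 Variables |
| --- | --- | --- | --- | --- | --- | --- |
| tempDIFFS | −1.243* (0.358) | −1.169* (0.381) | −1.402* (0.553) | −1.556* (0.784) | −1.468* (0.691) | |
| prevAdopt | | 0.153 (0.402) | −0.123 (0.616) | 0.010 (0.783) | 0.312 (0.747) | |
| MedianAge | | 0.681 (0.440) | 0.053 (0.432) | 0.524 (0.508) | 0.608 (0.498) | |
| cumCasesLag | | | 0.358 (0.594) | −1.385 (1.106) | −1.410 (1.091) | |
| Area | | | −0.193 (0.455) | −0.436 (0.554) | −0.218 (0.705) | |
| envMort | | | | −0.679 (0.548) | −0.524 (0.539) | −0.617 (0.328) |
| casesMD | | | | 2.623 (1.896) | 2.140 (1.639) | −0.065 (0.251) |
| urbanPop | | | | | 0.329 (0.448) | 0.426 (0.314) |
| startBIZ | | | | | −0.908 (0.428) | −0.712* (0.310) |
| MuslimPct | | | | | −0.238 (0.170) | −0.197 (0.151) |
| Wald | 12.03 | 17.48 | 16.05 | 19.12 | 18.83 | 11.89 |
| R2 | 0.196 | 0.249 | 0.265 | 0.360 | 0.409 | 0.163 |
| AIC | 64.17 | 64.21 | 67.01 | 64.26 | 66.25 | 74.66 |
| BIC | 68.15 | 72.17 | 78.94 | 80.17 | 88.13 | 86.59 |

Cells report coefficients with robust standard errors in parentheses.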

P < 0.100; *P < 0.050 (two-tailed).
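As a consistency check on the fit statistics in Table 4, note that under the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, the two differ by k(ln(n) − 2), where k is the number of estimated parameters. The sketch below assumes the reported values follow these definitions and that k counts one intercept plus the predictors. With n = 54, the implied gaps match the reported ones to rounding.

```python
import math

N = 54  # number of countries

# (number of predictors, reported AIC, reported BIC) from Table 4
reported = [
    (1, 64.17, 68.15),
    (3, 64.21, 72.17),
    (5, 67.01, 78.94),
    (7, 64.26, 80.17),
    (10, 66.25, 88.13),
    (5, 74.66, 86.59),  # final column: five predictors (variables 6 to 10)
]

def bic_minus_aic(n_predictors, n_obs):
    """Gap implied by the standard definitions, counting the intercept in k."""
    k = n_predictors + 1
    return k * (math.log(n_obs) - 2)

gaps = [(round(bic - aic, 2), round(bic_minus_aic(p, N), 2))
        for p, aic, bic in reported]
for observed, implied in gaps:
    print(observed, implied)
```

Because the penalty gap grows with k, BIC rises steadily across the larger models even where AIC stays nearly flat, which is why the two criteria rank the models differently.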

3.2. Index variables

To examine further the cumulative effect of factors that fall into the same category, at the recommendation of our reviewers, we created a series of five index variables which we added to our 88-variable data set. We then reran the random forest on that augmented data set of 93 variables. The original variables which compose the indices are also retained as individual factors in this analysis. The goal of creating these variables is to evaluate whether the predictive accuracy of our random forest is increased when related factors are combined into a single predictive variable. Index variables are composed of multiple related indicators (e.g., median age, birth rate, and infant mortality). By averaging the properties of multiple indicators, index variables are commonly used in linear/frequentist models to correct for biases in the properties of estimators such as the skewness of their distributions. Because the predictive value of the variables in our random forests is determined according to their maximization of information gain, a parameter that is not biased by the properties of the variables’ distribution, our expectation is that the inclusion of index variables will not produce meaningful changes in model prediction.
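A minimal sketch of an averaged index follows, assuming each indicator is standardized before averaging and that indicators running in the opposite direction are sign-flipped first. The country values are hypothetical, and the construction actually used for the indices in this study may differ (for example, factor scores rather than simple means).

```python
import statistics

def zscores(xs):
    """Standardize one indicator to mean 0 and standard deviation 1."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def build_index(indicators):
    """Average the z-scores of several indicators, country by country."""
    standardized = [zscores(col) for col in indicators.values()]
    return [statistics.fmean(row) for row in zip(*standardized)]

# Hypothetical values for four countries; names echo the example in the text.
indicators = {
    "median_age": [18.0, 20.0, 25.0, 29.0],
    "birth_rate": [40.0, 36.0, 28.0, 20.0],
    "infant_mortality": [60.0, 55.0, 30.0, 15.0],
}
# Assumption: flip indicators that run opposite to median age so that all
# components point in the same direction before averaging.
for name in ("birth_rate", "infant_mortality"):
    indicators[name] = [-v for v in indicators[name]]

index = build_index(indicators)
print([round(v, 3) for v in index])
```

Averaging smooths out the idiosyncrasies of any single indicator's distribution, which matters for linear models but, as argued above, should matter less for a split-based classifier.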

We created two sets of index variables using the list of the top 30 variables identified by our random forest. First, we applied exploratory factor analysis to the list of 30 variables to identify related terms. Based on factor loadings >0.50 and eigenvalues >1.0, we created one index variable, DataFactor1. Next, we created four variables based on the categories of theoretical relatedness among the top 30 variables identified. These four theoretical categories are: Health Factors 1 and 2, SocioEconomic Factor, and Ethnicity Factor. A political category was not created because only one variable within the list of the top 30 was identified as political: femalesPARL, the proportion of seats held by women in the national parliament. An external factor was not created as no variables displayed a sufficiently high set of factor loadings. The full list and factor analyses of all variables initially associated with each theoretical factor are listed in Supplemental Materials C. All five indices are detailed in Table 5.

Table 5.

Descriptions of indexes (a).

| | DataFactor1 | Health1 | Health2 | SocialEconomic | EthnicityFactor |
| --- | --- | --- | --- | --- | --- |
| Index type | Exploratory | Theoretical | Theoretical | Theoretical | Theoretical |
| Variables | MedianAge | MedianAge | cumCasesLag1P100 KC | urbanPop | muslimPct |
| | BirthRate | BirthRate | casesMD | PctINET | christPct |
| | envMort | envMort | ciaArea | MedianAge | |
| | urbanPop | PopGrowRate | population (cia) | GDPPerCapitaPPP | |
| | PopGrowRate | mortalityUnder5yrs | population | | |
| | PctINET | matMort | | | |
| | mortalityUnder5yrs | InfantMortalityRate | | | |
| | matMort | GDPPerCapitaPPP | | | |
| | InfantMortalityRate | | | | |
| | gdpPerCapLn | | | | |
| | gdpPerCap | | | | |
| | GDPPerCapitaPPP | | | | |
| Eigenvalue* | 6.987 | 4.875 | 3.112 | 2.134 | 1.795 |

* Eigenvalue for each factor generated during the exploratory factor analysis.

a

See Supplemental Material A for descriptions of abbreviated variable names.

To create the theoretical index variables, we began with all variables which can be reasonably categorized into one of the four theoretical categories. We then excluded variables using a stepwise approach according to whether their factor loadings meet the 0.5 threshold. With each iteration, we omitted the variable with the smallest factor loading and then reran the model with the remaining variables. The process stopped when all variables had shared factor loadings at or above 0.50. Based on the result of the factor analysis, we created two health factor variables as these variables loaded onto separate factors. We also combined variables for the social and economic factors as these variables loaded onto a single factor. Each of the index variables is normally distributed with the exception of Health Factor 2, which displays a significant skew and kurtosis, and Ethnicity Factor, which displays significant kurtosis (see Supplemental Materials C).
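The stepwise exclusion loop can be sketched as below. The `proxy_loadings` function is only a stand-in so that the example runs with the standard library: it uses the absolute correlation between each variable and the average of all standardized variables, which is not a factor analysis. The variable names and data are hypothetical. In practice the loop would call a real factor-analysis routine at each iteration.

```python
import statistics

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def proxy_loadings(data):
    """Stand-in for factor loadings (NOT a real factor analysis): |r| between
    each variable and the average of all standardized variables."""
    z = {}
    for name, col in data.items():
        mu, sd = statistics.fmean(col), statistics.pstdev(col)
        z[name] = [(x - mu) / sd for x in col]
    composite = [statistics.fmean(row) for row in zip(*z.values())]
    return {name: abs(pearson(col, composite)) for name, col in z.items()}

def prune_by_loading(data, loadings_fn, threshold=0.5):
    """Drop the weakest variable and refit until all loadings >= threshold."""
    kept = dict(data)
    while len(kept) > 1:
        loadings = loadings_fn(kept)
        weakest = min(loadings, key=loadings.get)
        if loadings[weakest] >= threshold:
            break
        del kept[weakest]
    return list(kept)

# Hypothetical candidate variables: three related measures and one unrelated.
candidates = {
    "var_a": [1, 2, 3, 4, 5, 6],
    "var_b": [2, 4, 5, 8, 10, 12],
    "var_c": [1, 3, 2, 5, 4, 6],
    "unrelated": [1, -1, -1, 1, 1, -1],
}
retained = prune_by_loading(candidates, proxy_loadings)
print(retained)
```

Removing one variable per iteration, rather than all sub-threshold variables at once, matters because loadings are re-estimated after each removal and a borderline variable may clear the threshold once a weaker one is gone.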

3.3. Results

Fig. 2 depicts the result of the index variable random forest model. It shows that four of the five index variables are retained in the model: Ethnicity Factor, Health Factor 1, Data Factor, and Socio Economic Factor. Table 6 shows that, contrary to expectations, after adding the index variables the predictive accuracy of the 10-variable model increases to 87% from 78%. Furthermore, looking at the models where the random forest is pruned to different numbers of variables (e.g., 1, 5, 10, 15, 20, 25, 30), we see that the predictive accuracy of the index model is higher in five of the seven cutoffs (all but for one variable and 30 variables). Based on these results, we conclude that the inclusion of index variables can contribute to increases in model prediction.

Fig. 2.

Feature importance vs feature name for the 30 most important variables with index variables added.* The error bars represent the 95% confidence interval. * See Supplemental Material B and C for descriptions of abbreviated variable names.

Table 6.

F1-Score of the Random Forest in Predicting a Stay-at-Home Order at Different Levels of Retained Variables When Index Variables are Included.

Number of Variables in Final Training Test f1-score (PREa)
1 59% (18%)
5 84% (68%)
10 87% (74%)
15 80% (60%)
20 81% (62%)
25 78% (56%)
30 72% (44%)
a

Proportional reduction in error.
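The PRE values in Table 6 can be reproduced from the reported f1-scores with the formula given in footnote 3. This is a minimal sketch that assumes the same 0.50 modal-category baseline used in that footnote. The function name is illustrative.

```python
MODAL_SHARE = 0.50  # modal-category baseline from footnote 3

def pre(score, baseline=MODAL_SHARE):
    """Proportional reduction in error relative to the modal baseline."""
    return (score - baseline) / (1 - baseline)

# Test f1-scores reported in Table 6, keyed by number of retained variables.
f1_by_size = {1: 0.59, 5: 0.84, 10: 0.87, 15: 0.80, 20: 0.81, 25: 0.78, 30: 0.72}
pre_pct = {k: round(100 * pre(v)) for k, v in f1_by_size.items()}
print(pre_pct)
```

With a 0.50 baseline, each percentage point of f1-score above 50% translates into two points of PRE, so the 10-variable score of 87% corresponds to the 74% figure reported.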

Comparing the composition of the original and second model, we see that tempDIFFS remains the most important variable and prevAdopt the second most important. In total, 24 of the top 30 variables in the original model are retained by the second model (TempDIFFS, prevAdopt, Median Age, cumCasesLag1P100 KC, Area, envMort, CasesMD, UrbanPop, StartBIZ, MuslimPCT, BirthRate, PopGrowRate, femalePARL, inflation, cumCasesLag1, matMORT, ChristPCT, pctINET, InfantMortalityRate, ElectyFossileFuel, gdpPerCapLN, gdpPerCap, NetMigrationRate, and nonComDis).

Of the six variables in the original model not directly retained in the index model, two variables (MortalityUnder5years, and GDPPerCapitaPPP) are indirectly retained as part of the Data Factor 1 and Health Factor 1 indices. Furthermore, the four variables not retained are between the 24th and 30th variables indicating that they are among the least important contributors. Three of the top five variables in the index model also appear in the top five variables of the original model and, further, each of the top five variables in the original model is observed in the index model. The only major difference is the inclusion of the two indices Ethnicity Factor and Health Factor 1. This indicates a high level of stability between the factors that best predict SAHOs.

Apart from this change, the remaining differences between models are slight. For example, with the exception of the indices themselves, only two new variables, Population Density and Phones Fixed Lines, appear in the index model. Furthermore, Population Density is a substitution in kind, as it is related to two variables, ciaPopulation, and population, that are not retained in the index model.

Lastly, Table 7 summarizes the results of six logistic regression models involving the 10 most important variables identified by the random forest using the augmented data set that includes the five indices. Eight of the 10 variables in the index model are also in the top 10 of the original model. The two new variables are two of the created indices. Ethnicity Factor is identified as the third most important variable and HealthFactor 1 is the fifth. Noteworthy is that these indices include the two variables from the original model's top 10 that no longer appear individually, MuslimPct and envMort.

Table 7.

Summary Statistics for Logistic Regression Models that include the Index Variables. Correlations between multiple standardized independent variables identified by the random forest as most important to the issuance of a stay-at-home order and the probability of issuing a stay-at-home order.

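| Variable | Top 1 Variable | Top 3 Variables | Top 5 Variables | Top 7 Variables | Top 10 Variables | Top 6–10 Variables |
| --- | --- | --- | --- | --- | --- | --- |
| tempDIFFS | −1.243* (0.358) | −1.260* (0.387) | −1.315* (0.538) | −1.304* (0.540) | −1.497* (0.668) | |
| prevAdopt | | 0.163 (0.401) | 0.130 (0.739) | 0.184 (0.686) | 0.168 (0.708) | |
| EthnicityFactor1 | | −0.573 (0.373) | −0.297 (0.426) | −0.307 (0.428) | −0.037 (0.424) | |
| cumCasesLag | | | 0.244 (0.660) | 0.118 (0.602) | −1.287 (1.111) | |
| HealthFactor1 | | | −0.635 (0.361) | −0.722 (0.817) | −0.321 (0.998) | |
| MedianAge | | | | −0.137 (0.930) | 0.613 (0.962) | 0.619 (0.463) |
| Area | | | | −0.215 (0.424) | −0.196 (0.609) | −0.001 (0.407) |
| StartBiz | | | | | −0.646 (0.401) | −0.501 (0.305) |
| CasesMD | | | | | 1.991 (1.408) | 0.081 (0.284) |
| urbanPop | | | | | 0.257 (0.507) | 0.226 (0.368) |
| Wald | 12.03 | 14.26 | 18.51 | 17.74 | 18.97 | 8.61 |
| R2 | 0.196 | 0.232 | 0.284 | 0.287 | 0.369 | 0.115 |
| AIC | 64.17 | 65.51 | 65.62 | 69.36 | 69.26 | 78.23 |
| BIC | 68.15 | 73.46 | 77.55 | 85.27 | 91.14 | 90.17 |

Cells report coefficients with robust standard errors in parentheses.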

P < 0.100; *P < 0.050 (two-tailed).

In the top 10 index logistic model, neither index achieves a conventional level of statistical significance. Although the effects of the indices are statistically discernible, post-estimation model diagnostics suggest the inclusion of the new variables, including the indices, has a limited effect on model fit. For example, for the original 10-variable model detailed in Table 4 the BIC is 88.13 and R2 is 0.41, and for the second 10-variable model that includes the two indices the BIC is 91.14 and R2 is 0.37. Note that a smaller BIC indicates less model uncertainty. This suggests that the 10-variable model that includes the indices performs worse, but not substantially so: Ref. [63] describes a three-point difference in BIC as “positive” but not “strong” or “very strong” evidence that one model performs better than another. It is also worth noting that the effects of temporal diffusion are robust as it, again, significantly contributes to these regression models.

We note two considerations. First, given the small number of cases available in this study (n = 54), it is not the ideal candidate to evaluate the utility of index variables to the prediction of the issuance of SAHOs. This is because the properties of a single country on a single variable have a significant impact on overall model prediction. Second, we limited the construction of the index variables to the top 30 variables identified in our original random forest model. Future research should adopt a more comprehensive approach to develop indices across all available variables in the data set. To develop these comprehensive indices, in addition to standard social science approaches for data reduction such as correspondence or factor analysis, researchers should consider alternative approaches from machine learning such as variational autoencoding [64].

4. Discussion

This paper explores the application of a machine learning method, the random forest classifier, to the prediction of the issuance of SAHOs during the initial wave of COVID-19 in African countries. In using this approach, we made four contributions. First, we found that a data-first approach results in nontrivial support for extant public health theory. That is, the four most important variables identified by the machine learning classifier were originally identified in theory-based research (e.g., Ref. [29]). More broadly, we found that external forces, particularly temporal diffusion, made the strongest contribution to the prediction of SAHOs. Second, in addition to the support provided to extant theory, we identified potential new avenues of investigation regarding public health policy making, such as geographic area, business climate, and religious homogeneity. Third, we demonstrated the power of the random forest model as a predictive tool. The 10-variable model provided a 56% increase in predictive accuracy over predicting the modal outcome in the original model and a 74% increase in the index model. And, fourth, using machine learning, we showed how using methods other than standard regression can help identify important variables that may be omitted due to model and disciplinary assumptions, which is critical for problems such as this.

Regarding this last contribution, future research on health policy should seriously consider employing multiple modeling techniques to validate their findings. In particular, for research using similar variables on similarly sized samples, data properties outside the researcher's control may introduce bias into the analysis. In other words, the statistical significance of results obtained exclusively using a linear/frequentist approach may not be sufficient to produce reliable or generalizable results. Employing several statistical models that rely on separate analytic assumptions would significantly improve the rigor of future research by ensuring that results are not dependent upon the assumptions of any single statistical model. Because they are robust to sample and distributive data properties, which significantly limit linear/frequentist approaches, we believe that machine learning methods naturally complement more traditional linear/frequentist methods.

Besides providing a means to discover overlooked effects, an additional benefit of the multiple model approach is that it can provide a means to confirm the extant theory. This principle is demonstrated, in part, by Temporal Diffusion's feature importance in the random forest and significance in the logistic model.

A critical limitation of the random forest method is that it does not provide a measure of directionality for the independent variables. Variable directionality is often of principal interest to social scientists, especially in research involving public policy or policy interventions. This limitation of the random forest approach further highlights the benefits of multiple modeling approaches to research, as we demonstrated one method to derive directionality using conventional modeling.

A second limitation of our methodology is that we lost a large number of variables (n = 139) from our original set of 227 variables due to missing values. The decision to drop variables, as opposed to imputing missing data, is due to the effects of imputation on model reliability when a data set contains a small number of cases. Because they focus on minimizing entropy or correct categorization of cases, the results of random forest models are more susceptible to changes from the inclusion of imputed data than linear/frequentist models. This problem is compounded by the high levels of variation in independent variables displayed by the African countries. Methods such as mean imputation or regression imputation would introduce a robust set of new assumptions about data symmetry into the model, which is difficult to justify given the small number of highly heterogeneous cases. An alternative methodology would be to utilize a neural network to generate imputations. However, this machine-learning approach was outside the scope of the present paper. As a result, the current study does not provide an exhaustive evaluation of the contribution of all possible factors (e.g., social, political, economic) or variables to the issuance of SAHOs, as many potentially relevant factors are omitted from our analysis. As more complete data sets become available, researchers must evaluate the predictive value of the variables omitted from this analysis.

5. Conclusion

This paper takes a data-first approach to examining an extraordinary government response to an extraordinary global phenomenon. In the process, it demonstrates this approach can be used to generate new insights and confirm the extant theory. Further, it demonstrates the value of using multiple modeling methods to assess important relationships in public policy and public health policy. And it suggests that prediction and investigation can fruitfully coexist if not enhance each other.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

2

Precision is defined as the number of true positives divided by the sum of true positives and false positives. Recall is defined as the number of true positives divided by the sum of true positives and false negatives.

3

Some refer to this as the proportional reduction in error. It is calculated as (% model accuracy − % modal category)/(1 − % modal category). In this case, (0.78 − 0.50)/(1 − 0.50) = 0.56.

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ijdrr.2023.103598.

Data availability

Data will be made available on request.

References

  • 1.CDC . Department of Health and Human Services, CDC; Atlanta, GA: US: 2020. State, Territorial, and County COVID-19 Orders and Proclamations for Individuals to Stay Home.https://ephtracking.cdc.gov/DataExplorer/index.html?c=33&i=160&m=927 [Google Scholar]
  • 2.Medline A., Hayes L., Valdez K., Hayashi A., Vahedi F., Capell W., Klausner J.D. Evaluating the impact of stay-at-home orders on the time to reach the peak burden of Covid-19 cases and deaths: does timing matter? BMC Publ. Health. 2020;20(1):1–7. doi: 10.1186/s12889-020-09817-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Moreland A., Herlihy C., Tynan M.A., Sunshine G., McCord R.F., Hilton C.…Popoola A. Timing of state and territorial COVID-19 stay-at-home orders and changes in population movement—United States, March 1–May 31, 2020. MMWR (Morb. Mortal. Wkly. Rep.) 2020;69(35):1198. doi: 10.15585/mmwr.mm6935a2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Kitchin R. Big Data, new epistemologies and paradigm shifts. Big Data & Society. 2014;1(1) 2053951714528481. [Google Scholar]
  • 5.Scime A., Murray G.R. Ethical Data Mining Applications for Socio-Economic Development. IGI Global; 2013. Social science data analysis: the ethical imperative; pp. 131–147. [Google Scholar]
  • 6.Brownson R.C., Chriqui J.F., Stamatakis K.A. Understanding evidence-based public health policy. Am. J. Publ. Health. 2009;99(9):1576–1583. doi: 10.2105/AJPH.2008.156224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Brieman L., Friedman J.H., Olshen R.A., Stone C.J. Wadsworth Inc; 1984. Classification and Regression Trees; p. 67. [Google Scholar]
  • 8.Čeh M., Kilibarda M., Lisec A., Bajat B. Estimating the performance of random forest versus multiple regression for predicting prices of the apartments. ISPRS Int. J. Geo-Inf. 2018;7(5):168. [Google Scholar]
  • 9.Muchlinski D., Siroky D., He J., Kocher M. Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data. Polit. Anal. 2016;24(1):87–103. [Google Scholar]
  • 10.Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am. Statistician. 2009;63(4):308–319. [Google Scholar]
  • 11.Schrodt P.A. Beyond the linear frequentist orthodoxy. Polit. Anal. 2006;14(3):335–339. [Google Scholar]
  • 12.Hong J., Choi H., Kim W.S. A house price valuation based on the random forest approach: the mass appraisal of residential property in South Korea. Int. J. Strat. Property Manag. 2020;24(3):140–152. [Google Scholar]
  • 13.Antipov E.A., Pokryshevskaya E.B. Mass appraisal of residential apartments: an application of Random forest for valuation and a CART-based approach for model diagnostics. Expert Syst. Appl. 2012;39(2):1772–1778. [Google Scholar]
  • 14.Wang Y. Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data: a comment. Polit. Anal. 2019;27(1):107–110. [Google Scholar]
  • 15.Daoud A., Johansson F. 2019. Estimating Treatment Heterogeneity of International Monetary Fund Programs on Child Poverty with Generalized Random Forest. [Google Scholar]
  • 16.Suzuki A. Is more better or worse? New empirics on nuclear proliferation and interstate conflict by random forests. Research & Politics. 2015;2(2) 2053168015589625. [Google Scholar]
  • 17.Best K.B., Gilligan J.M., Baroud H., Carrico A.R., Donato K.M., Ackerly B.A., Mallick B. Random forest analysis of two household surveys can identify important predictors of migration in Bangladesh. Journal of Computational Social Science. 2021;4(1):77–100. [Google Scholar]
  • 18.Anderson D., Cheeseman N. In: Routledge Handbook of African Politics. Cheeseman N., Anderson D., Scheibler A., editors. Routledge; 2013. An introduction to African politics; pp. 1–8. [Google Scholar]
  • 19.Joireman S.F. Inherited legal systems and effective rule of law: Africa and the colonial legacy. J. Mod. Afr. Stud. 2001;39(4):571–596. [Google Scholar]
  • 20.Young C. Robert J. Berg and Jennifer Seymour Whitaker, Strategies For African Development: A Study For the Committee On African Development Strategies. University of California Press; 1986. Africa's colonial legacy; pp. 25–51. [Google Scholar]
  • 21.Tordoff W. fourth ed. 2002. Government and Politics in Africa. (Palgrave) [Google Scholar]
  • 22.Colizzi V., Sonela N. Why publishing the journal of public health in Africa. J. Publ. Health Afr. 2017;8:729. doi: 10.4081/jphia.2017.729. Pmid:28748065. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Mazibuko Z. Jacana Media; 2019. Epidemics and the Health of African Nations. [Google Scholar]
  • 24.World Health Organization . 2019. E-SPAR State Party Annual Report: IHR Score Per Capacity.https://extranet.who.int/e-spar#capacity-score [Cited 2021 January 14] Available from: [Google Scholar]
  • 25.Gros J.-G. Rowman & Littlefield; 2015. Healthcare Policy in Africa: Institutions and Politics from Colonialism to the Present. [DOI] [PubMed] [Google Scholar]
  • 26.Allen C. Understanding African politics. Rev. Afr. Polit. Econ. 1995;22(65):301–320. [Google Scholar]
  • 27.Okma K.G., Marmor T.R. Comparative studies and healthcare policy: learning and mislearning across borders. Clin. Med. 2013;13(5):487. doi: 10.7861/clinmedicine.13-5-487. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Ahram A.I., Goode J.P. Researching authoritarianism in the discipline of democracy. Soc. Sci. Q. 2016;97(4):834–849. [Google Scholar]
  • 29.Murray G.R., Rutland J. Prioritizing public health? Factors affecting the issuance of stay-at-home orders in response to COVID-19 in Africa. PLOS Global Public Health. 2022;2(1) doi: 10.1371/journal.pgph.0000112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Hale T., Angrist N., Goldszmidt R., Kira B., Petherick A., Phillips T., Webster S., Cameron-Blake E., Hallas L., Majumdar S., Tatlow H. A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nature Human Behaviour. 2021;5(4):529–538. doi: 10.1038/s41562-021-01079-8. [DOI] [PubMed]
  • 31.Wujek B., Hall P., Günes F. SAS Institute Inc; 2016. Best Practices for Machine Learning Applications. [Google Scholar]
  • 32.Murray G.R., Jilani-Hyler N. Identifying factors associated with the issuance of coronavirus-related stay-at-home orders in the Middle East and North Africa Region. World Med. Health Pol. 2021;13(3):477–502. doi: 10.1002/wmh3.444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Spasoff R.A. Oxford University Press; 1999. Epidemiologic Methods for Health Policy. [Google Scholar]
  • 34.Berry F.S., Berry W.D. State lottery adoptions as policy innovations: an event history analysis. Am. Polit. Sci. Rev. 1990;84(2):395–415. [Google Scholar]
  • 35.De Mesquita B.B., Smith A., Siverson R.M., Morrow J.D. MIT press; 2005. The Logic of Political Survival. [Google Scholar]
  • 36.Christensen T., Lægreid P., Rykkja L.H. Organizing for crisis management: building governance capacity and legitimacy. Publ. Adm. Rev. 2016;76(6):887–897. [Google Scholar]
  • 37.Blank R.H., Burau V., Kuhlmann E. fifth ed. 2018. Comparative Health Policy. [Google Scholar]
  • 38.Theme-Filha M.M., Szwarcwald C.L., Souza-Júnior P.R.B.D. Socio-demographic characteristics, treatment coverage, and self-rated health of individuals who reported six chronic diseases in Brazil, 2003. Cad. Saúde Pública. 2005;21:S43–S53. doi: 10.1590/s0102-311x2005000700006. [DOI] [PubMed] [Google Scholar]
  • 39.Musgrove P., Zeramdini R., Carrin G. Basic patterns in national health expenditure. Bull. World Health Organ. 2002;80:134–146. [PMC free article] [PubMed] [Google Scholar]
  • 40.Gormley W.T., Jr. Regulatory issue networks in a federal system. Polity. 1986;18(4):595–620. [Google Scholar]
  • 41.World Health Organization. World Health Organization; 2020. Key Planning Recommendations for Mass Gatherings in the Context of COVID-19: Interim Guidance. (29 May 2020) [Google Scholar]
  • 42.Meltsner A.J. Political feasibility and policy analysis. Publ. Adm. Rev. 1972;32(6):859–867. [Google Scholar]
  • 43.Dobbin F., Simmons B., Garrett G. The global diffusion of public policies: social construction, coercion, competition, or learning? Annu. Rev. Sociol. 2007;33:449–472. [Google Scholar]
  • 44.Chamberlain R., Haider-Markel D.P. “Lien on me”: state policy innovation in response to paper terrorism. Polit. Res. Q. 2005;58(3):449–460. [Google Scholar]
  • 45.Grübler A. Time for a change: on the patterns of diffusion of innovation. Daedalus. 1996;125(3):19–42. [Google Scholar]
  • 46.State Fragility Index 2018. Accessed at: http://www.systemicpeace.org/inscr/SFImatrix2018c.pdf
  • 47.World Bank. World Development Indicators. https://datatopics.worldbank.org/world-development-indicators/ (n.d.)
  • 48.World Bank. People. https://datatopics.worldbank.org/world-development-indicators/themes/people.html#featured-indicators_1 (n.d.a)
  • 49.Database of Political Institutions. 2017. Database of Political Institutions 2017. https://data.iadb.org/DataCatalog/Dataset#DataCatalogID=938i-s2bw [Google Scholar]
  • 50.World Health Organization. Monitoring health for the SDGs. https://www.who.int/data/gho/data/themes/sustainable-development-goals?lang=en (n.d.)
  • 51.CIA World Factbook 2020. Accessed at: https://www.cia.gov/the-world-factbook/
  • 52.Podiotis P. 2020. Sentiment Analysis of the CIA World Factbook. Available at: SSRN 3721400. [Google Scholar]
  • 53.Podiotis P. 2020. Towards International Relations Data Science: Mining the CIA World Factbook. arXiv preprint arXiv:2010.05640. [Google Scholar]
  • 54.United Nations. World Economic Situation and Prospects, 2020. 2020. Accessed at: https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/publication/WESP2020_FullReport_web.pdf. For more info see: https://www.un.org/development/desa/dpad/publication/world-economic-situation-and-prospects-2020/
  • 55.World Happiness Report 2020. Accessed at: https://worldhappiness.report/ed/2020/#read
  • 56.Quinlan J.R. Induction of decision trees. Mach. Learn. 1986;1(1):81–106. [Google Scholar]
  • 57.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O.…Duchesnay E. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]
  • 58.Breiman L. Bagging predictors. Mach. Learn. 1996;24(2):123–140. [Google Scholar]
  • 59.Gilardi F. In: Handbook of International Relations. second ed. Carlsnaes W., Risse T., Simmons B., editors. SAGE Publications; 2012. Transnational diffusion: norms, ideas, and policies; pp. 453–477. [Google Scholar]
  • 60.Yanez N.D., Weiss N.S., Romand J.A., Treggiari M.M. COVID-19 mortality risk for older men and women. BMC Publ. Health. 2020;20(1):1–7. doi: 10.1186/s12889-020-09826-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Kawashima K., Matsumoto T., Akashi H. Urban Resilience. Springer; Cham: 2016. Disease outbreaks: critical biological factors and control strategies; pp. 173–204. [Google Scholar]
  • 62.Boggess M., MacDonald K. STATA Corporation; 2004. Marginal Effects of Probabilities Greater than 1. [Google Scholar]
  • 63.Raftery A.E. Bayesian model selection in social research. Socio. Methodol. 1995;25:111–163. [Google Scholar]
  • 64.Mahmud M.S., Huang J.Z., Fu X. Variational autoencoder-based dimensionality reduction for high-dimensional small-sample data classification. Int. J. Comput. Intell. Appl. 2020;19(1) [Google Scholar]

Associated Data

Supplementary Materials

Multimedia component 1
mmc1.docx (177.8KB, docx)

Data Availability Statement

Data will be made available on request.


Articles from International Journal of Disaster Risk Reduction are provided here courtesy of Elsevier
