Estimating the propagation of both reported and undocumented COVID-19 cases in Spain: a panel data frontier approximation of epidemiological models

Inmaculada C Álvarez; Luis Orea; Alan Wall

doi:10.1007/s11123-023-00664-5

2023 Mar 1;59(3):259–279. doi: 10.1007/s11123-023-00664-5

PMCID: PMC9975832 PMID: 37143450

Abstract

We use a stochastic frontier analysis (SFA) approach to model the propagation of the COVID-19 epidemic across geographical areas. The proposed models permit reported and undocumented cases to be estimated, which is important as case counts are overwhelmingly believed to be undercounted. The models can be estimated using only epidemic-type data but are flexible enough to permit these reporting rates to vary across geographical cross-section units of observation. We provide an empirical application of our models to Spanish data corresponding to the initial months of the original outbreak of the virus in early 2020. We find remarkable rates of under-reporting that might explain why the Spanish Government took its time to implement strict mitigation strategies. We also provide insights into the effectiveness of the national and regional lockdown measures and the influence of socio-economic factors in the propagation of the virus.

Keywords: SIR models, Stochastic frontier analysis, Panel data, COVID-19, Spain

Introduction

The COVID-19 pandemic, which began in China in December 2019, spread worldwide in a short time. Faced with the threat of their public health systems being overwhelmed, several countries were forced to implement national lockdowns of the population, with Italy and Spain at the forefront as they were the most affected at the initial stage of the pandemic. In the specific case of Spain, this gave rise to heated debates, which would be repeated in other countries (notably the UK), over the timing and duration of the lockdown. There was fierce criticism from some opposition parties over the Spanish national government’s handling of the first wave of the pandemic. It is noteworthy that the institutional response in Spain to the following waves of COVID-19, up to and including the sixth wave which began in late 2021, has been delegated to regional governments, which are charged with implementing measures at local or regional level. A consequence of the regional nature of the new institutional response, however, is that much less attention may be paid to the propagation of the coronavirus across the Spanish provinces and regions.

The propagation of the COVID-19 epidemic and the effectiveness of institutional responses have given rise to a rapidly-evolving literature. Most existing empirical research has focused on the Chinese COVID-19 epidemic. Chinazzi et al. (2020) show that travel limitations had modest effects on containing the spread of the disease in Wuhan (unless they were complemented with additional public health interventions and behavioural changes). Fang et al. (2020), using a difference-in-differences estimator, find that the lockdown was highly effective in reducing the total infection cases outside the city. Regarding the relaxation of the control measures in China, Leung et al. (2020) find it would increase the cumulative number of coronavirus cases and conclude that it is necessary to monitor the increase in new cases due to the effects of relaxing control measures in order for policy makers to be able to readjust their decisions.

One of the first studies that aimed to examine the effectiveness of the control measures implemented in several European countries was carried out by Flaxman et al. (2020). They find that the Spanish lockdown averted about 67% of potential deaths by the 31st of March. Saez et al. (2020) and Orea and Álvarez (2022) also find that the Spanish national lockdown was effective in attenuating the propagation of the virus during the first wave of COVID-19 contagion. Orea and Álvarez (2022) conclude, however, that this control measure should be implemented at the very early stages of the epidemic because a rapid institutional response to the COVID-19 outbreak not only saves lives but also attenuates the economic impact of the Spanish coronavirus epidemic.

The effectiveness of institutional control measures while controlling for spatial propagation effects has been treated marginally in the literature, though there are notable exceptions. Thus, Gross et al. (2020) study the spatio-temporal propagation of COVID-19 in China and compare it to other countries. They conclude that early action may attenuate the disease, given the strong relation between population migration and the spreading of disease. Giuliani et al. (2020) also use data disaggregated by provinces to implement an epidemiological model explaining the propagation of COVID-19 across the Italian provinces. The origin of this spatial dimension of propagation is the high inter-provincial mobility of people, and they conclude that the control measures were more successful in those provinces with more effective enforcement. Dickson et al. (2020) find that in the northern Italian provinces the Government containment measures not only succeeded in drastically reducing the transmission of COVID-19 amongst individuals within these provinces, but also avoided contagions between neighbouring areas. Another exception is Gutiérrez et al. (2021), who show that part of the heterogeneity in the incidence of the disease found in Spain is due to differences in mobility flows across the Spanish territories. Orea and Álvarez (2022) find similar results using a simple but novel empirical strategy that loosely mimics the popular reproduction-based models used in the epidemiological literature.

Aside from spatial propagation effects, another important issue that has often been overlooked or not controlled for in this literature is the number of undocumented coronavirus cases. The relevance of this lies in the fact that the proportion of coronavirus infections not detected by the health system during the first wave of contagion of COVID-19 was likely much larger than the proportion of laboratory-confirmed coronavirus cases (see Flaxman et al. 2020), with the result that the official number of coronavirus cases likely falls short of the true number of cases, perhaps significantly so. As Korolev (2021) points out, if we do not take underreporting into account and estimate models from data on confirmed cases under the assumption that all cases are reported, our estimates might be seriously biased. In addition, underreporting may dampen public and political support for more stringent measures such as investments in medical equipment, mandatory masks or mandatory lockdowns. As the undocumented cases facilitate the rapid dissemination of coronavirus (see Li et al. 2020), the reported cases at the first stages of the coronavirus epidemic were likely unable to anticipate the fast development of the coronavirus epidemic in the following weeks.

To account simultaneously for geographical propagation of the virus, the prevalence of undocumented cases and the effectiveness of institutional control measures, in this paper we propose a stochastic frontier analysis (SFA) approach to estimating epidemic curves. The SFA approach can be used to control for the existence of undocumented coronavirus cases because these cases are not observed by the econometrician and the reported cases are always lower than the total number of COVID-19 infections. Therefore, the unobserved cases can be proxied using a one-sided random term in the same fashion as firms’ inefficiency in production economics. In this sense, our work can also be considered as contributing to the line of research initiated by Millimet and Parmeter (2021), which highlighted that the stochastic frontier framework can usefully be extended into the measurement error literature when researchers consider that outcomes are measured with asymmetric error. COVID cases certainly fall into this category.

The model we propose can be seen as an extension to a frontier setting of previous work by Orea and Álvarez (2022), who advocate using a third-order function of the so-called ‘epidemic time’ of the outbreak (i.e., the number of days since the onset date) to capture the typical S-shaped temporal pattern of the virus epidemic. Their non-frontier epidemic-time model can be viewed as a reduced-form model that simply aims to fit the observed epidemic curve of cumulative cases, and for this reason, it does not make assumptions about the incubation period or other critical parameters that determine the contagion of COVID-19. This appealing feature can obviously be applied to our stochastic frontier specification of the epidemic curve. For robustness analysis, and as an alternative to the epidemic-time specification of Orea and Álvarez (2022), we have also estimated a stochastic SIR-based frontier specification, inspired by the non-frontier econometric SIR model proposed by Chudik et al. (2020) which replaces the epidemic time variables with time-varying epidemiological regressors.

Our epidemic stochastic frontier analysis (ESFA) model has other attractive features. First, the stochastic frontier model can be estimated using epidemic-type data only, i.e., the rates of growth of coronavirus cases depend in our models on own and neighbours’ epidemic times, lagged cases of COVID-19, date of implementation of control measures, and so on. However, the model is flexible enough to include other covariates if deemed appropriate. Another advantage of our model is that it permits reporting rates to be estimated rather than assumed and is flexible enough to permit these reporting rates to vary across geographical cross-section units of observation. As such, our ESFA model can be thought of as complementary to existing epidemiological models, such as Chudik et al. (2020), which often assume common reporting rates across areas.1

As the volatility of the rates of growth of reported cases is typically much larger at the beginning of the epidemic than when the epidemic has advanced, our ESFA model must be estimated using time-varying heteroskedastic noise terms. To capture this feature, we propose a stochastic frontier specification which can be interpreted as a heteroskedastic version of the model introduced by Wang and Ho (2010), whose aim was to control for individual effects in a production economics setting. Therefore, our paper also makes a methodological contribution for practitioners aiming to estimate firms’ efficiency using the Wang and Ho (2010) approach.

As our epidemic-time model can be extended to include other covariates, in our empirical application we take advantage of this feature to incorporate a series of socio-economic and environmental variables to test their influence on the evolution of total and under-reported cases. We also carry out a series of robustness checks on our epidemic-time stochastic frontier model, including an analysis of the effects of changes to the distributional assumptions and of changes to the panel data set used, in particular the effect of dropping observations with zeroes in the variables.

Overall, the empirical strategy used in this paper can be said to rely on several different but related assumptions, which are supported by previous literature: i) the propagation of the virus across areas (Spanish provinces in our application) depends on people’s mobility (Giuliani et al. 2020); ii) this mobility can be modelled using spatial econometrics techniques (Eliasson et al. 2003; Orea and Álvarez 2022); iii) the undocumented cases represent a large proportion of total cases of infection (Flaxman et al. 2020); and iv) the proportion of undocumented cases through the epidemic development varies over time (Li et al. 2020).

A final assumption, which opens the way for a stochastic frontier-based approach, is that unobserved cases can be proxied using a one-sided random term in the same fashion as Millimet and Parmeter (2022). A comparison of our model with Millimet and Parmeter (2022) is instructive as our approaches have similarities and differences. As in Millimet and Parmeter (2022), our model permits underreporting to be modelled as a function of a set of covariates and the impact of non-pharmaceutical interventions (in our case, lockdown measures) on COVID-19 cases to be assessed. Spatial spillover effects are incorporated into both models, in the sense that the spread of the virus is modelled not only as a function of an area’s own cases but also as a function of cases in neighbouring areas. It is worth noting that whereas Millimet and Parmeter (2022) use country-level data, we use more disaggregated spatial units (i.e. provinces). As human mobility across our Spanish provinces is larger than across countries, the spatial propagation of the disease is likely more intense in our application than in their application to countries. Note also that whereas in Millimet and Parmeter (2022) these spillover effects are incorporated into the frontier, in our model they are incorporated into both the frontier and the one-sided error term capturing undocumented cases. Another contrast to Millimet and Parmeter (2022) is that our model, based as it is on Wang and Ho (2010), explicitly controls for individual (fixed) effects. A further contrast is the feature, mentioned above, that we explicitly model heteroscedasticity in the idiosyncratic noise term to capture differences in the temporal evolution of volatility of the growth rates of reported cases, and in particular the fact that volatility is much larger at the beginning of the epidemic. Millimet and Parmeter (2022) control for this volatility indirectly by using weekly data. When a sufficiently long data set is available, this is a perfectly valid solution. However, when the time series is relatively short, as in our case, aggregating daily data to weekly level will not be feasible. In these settings where daily data must be used, explicit modelling of heteroscedasticity of the noise term becomes a necessary and attractive feature. Finally, note that we follow Orea and Álvarez (2022) and use a simple epidemic-time model, whereas Millimet and Parmeter (2022) use a SIR-based model, where current coronavirus cases depend on a set of geographic, demographic and political characteristics of each country. For comparison purposes, however, we also estimate a SIR-based model that instead uses lagged values of coronavirus cases to explain current cases in the same fashion as Chudik et al. (2020). As lagged cases are also measured with error due to the existence of undocumented coronavirus cases, the epidemic-time model, which can be viewed as a reduced-form of a SIR-based model, is our preferred specification. In summary, we see our model as complementary to that of Millimet and Parmeter (2022), where one or the other may be more suitable depending on the data available and the assumptions the researcher is willing to make.

The paper proceeds as follows. Section 2 defines the three epidemic curves we use, namely the total epidemic curve, the reported cases epidemic curve, and the undocumented cases epidemic curve. In Section 3 we present the stochastic frontier representation of the epidemic curves. Distributional assumptions about the error terms and the maximum likelihood procedure for the general specification of the model are discussed. Section 4 presents our empirical application to Spanish provinces at the outset of the COVID-19 epidemic in the spring of 2020, where we estimate a basic version of our preferred frontier model, namely the epidemic-time model, with a spatial lag specification. In Section 5 we present a series of extensions and robustness checks to the basic model. Section 6 provides a discussion of the empirical results and Section 7 concludes and provides some pointers for future research.

Total and partial epidemic curves

In this section we define three epidemic curves that resemble the popular reproduction-based models used in the epidemiological literature, which often ignore the existence of undocumented coronavirus cases.

Consider a panel of i = 1,…, N provinces observed on t = 1,…, T days. t is the calendar time. Let E_i denote the onset date of the epidemic, namely the date on which province i reports its first coronavirus case. We then analyse the development of the epidemic in each province, i.e., the temporal evolution of coronavirus cases once each province reports its first coronavirus case. A key variable to carry out this analysis is the epidemic time, K_it = t-E_i, which denotes the number of days since the onset date. Next, let $Y_{i t}^{*}$ denote the cumulative number of both laboratory-confirmed (Y_it) and undocumented (U_it) coronavirus cases until day t in province i. Thus:

$$Y_{it}^{*} = Y_{it} + U_{it} \tag{1}$$

In Orea and Álvarez (2022), the epidemic curve of reported cases (Y_it) is represented by an autoregressive relationship:2

$$Y_{it} = \beta_{it} Y_{it-1} \tag{2}$$

where β_it can be interpreted as an autoregressive parameter (function) that depends on a set of covariates. We label this the epidemic curve.3 We have found in our application that Y_it is not a stationary variable, in which case estimating (2) might give spurious results. This issue vanishes if we use rates of growth of reported coronavirus cases. In order to get a simple empirical specification of (2), we take natural logarithms and first-differentiate the model. This yields the following expression:

$$\mathit{Rate}_{it} = \ln Y_{it} - \ln Y_{it-1} = \ln \beta_{it} \tag{3}$$

where lnβ_it simply measures the daily rate of growth of reported cases. Two alternative specifications (linear vs exponential) for lnβ_it are used in our empirical application. While the linear specification might yield negative rates of growth of cumulative cases, the so-called exponential specification imposes the theoretical restriction β_it ≥ 1. We expect rates of growth of coronavirus cases to vary with the epidemic time, K_it, because the traditional epidemic curve has an S-shaped form. If this is indeed the case, the epidemic curve β_it can be modelled empirically as a third-order function of the (logged) epidemic time variable, conditional on other control variables.4
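The two quantities defined so far, the epidemic time K_it and the growth rate in (3), can be computed directly from a cumulative case series. The following Python sketch is illustrative only: the function name `epidemic_time_and_rate` and the case counts are hypothetical, and days before the onset date (and the onset day itself, for the rate) are marked as missing with `None`.

```python
import math

def epidemic_time_and_rate(cum_cases):
    """Return (K, rate) for one province's cumulative reported cases.

    cum_cases[t] is the cumulative count on calendar day t.
    K_t = t - E, where E is the first day with a positive count.
    rate_t = ln Y_t - ln Y_{t-1}, defined from the day after onset.
    """
    onset = next((t for t, y in enumerate(cum_cases) if y > 0), None)
    if onset is None:
        raise ValueError("series contains no reported cases")
    k, rate = [], []
    for t, y in enumerate(cum_cases):
        if t < onset:
            k.append(None)      # epidemic has not started yet
            rate.append(None)
        else:
            k.append(t - onset)
            if t == onset:
                rate.append(None)   # Y_{t-1} = 0, log undefined
            else:
                rate.append(math.log(y) - math.log(cum_cases[t - 1]))
    return k, rate

cases = [0, 0, 1, 2, 4, 4, 6]   # made-up cumulative counts
k, rate = epidemic_time_and_rate(cases)
print(k)
print(rate)
```

Note that a day with no new cases produces a growth rate of exactly zero, which is the kind of observation the later discussion of convergence problems refers to.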

Similar autoregressive expressions can be written for undocumented and total coronavirus cases. That is, each variable measuring coronavirus cases has its own epidemic curve. While the epidemic curve of reported cases is given by (3), the epidemic curves of undocumented and total coronavirus cases can be written as follows:

$$U_{it} = \theta_{it} U_{it-1} \tag{4}$$

$$Y_{it}^{*} = \beta_{it}^{*} Y_{it-1}^{*} \tag{5}$$

Figure 1 illustrates our three hypothetical epidemic curves. By construction, we have assumed in this figure that $Y_{i t}^{*}$ is the sum of Y_it and U_it for each epidemic time K_it. Note that while the epidemic curve of reported cases has the traditional S-shaped form, the epidemic curve of undocumented cases is depicted using a log form from the beginning of the epidemic onwards. This allows the proportion of undocumented cases to decrease over time as in Li et al. (2020). The shape of the total epidemic curve is thus a combination of the shapes of the two partial epidemic curves.

Fig. 1 — Epidemic curve of total, reported and undocumented cases

We now examine this feature analytically. Taking into account (1), the autoregressive parameter $β_{i t}^{*}$ can be decomposed as follows:

$$\beta_{it}^{*} = \beta_{it} + (\theta_{it} - \beta_{it}) \frac{U_{it-1}}{Y_{it-1}^{*}} \tag{6}$$

This equation shows that the slope of the overall epidemic curve coincides with that of the epidemic curve of reported cases if both reported and undocumented cases have the same temporal patterns (i.e., θ_it = β_it).5 In order to link both epidemic curves, let u_it denote the log difference between total and reported coronavirus cases:

$$u_{it} = \ln Y_{it}^{*} - \ln Y_{it} \tag{7}$$

Given the above definition, the proportion of undocumented cases can be expressed as an increasing function of u_it because $U_{it}/Y_{it}^{*} = 1 - e^{-u_{it}}$. u_it can therefore be viewed as a relative measure of the undocumented cases in an epidemic outbreak: loosely speaking, we can interpret u_it as the “proportion of undocumented cases”. Equation (7) also allows us to link the reported and undocumented cases as follows:

$$U_{it} = Y_{it} (e^{u_{it}} - 1) \tag{8}$$

If we plug (8) into (4) in both consecutive periods and use (2), we get:

$$\beta_{it} = \theta_{it} \cdot \frac{e^{u_{it-1}} - 1}{e^{u_{it}} - 1} \tag{9}$$

This equation states that β_it = θ_it if, and only if, the log difference between total and reported coronavirus cases (u_it) is time invariant, i.e., when $\Delta u_{it} = u_{it} - u_{it-1} = 0$. Using (2) and (5) and the definition of u_it in (7), the previous decomposition in (6) collapses to:

$$\beta_{it} = \beta_{it}^{*} e^{-\Delta u_{it}} \tag{10}$$

This equation shows that the total epidemic curve coincides with the epidemic curve of reported cases when the proportion of undocumented cases does not change over time, i.e., when ∆u_it = 0. On the other hand, Eq. (10) suggests that the epidemic curve of reported cases (i.e. β_it) can be estimated using two approaches: i) from an econometric specification of Eq. (2) that does not provide any information about the relative importance of undocumented cases, as in Orea and Álvarez (2022);6 or ii) from a stochastic frontier specification of (10) that is able to estimate both the total epidemic curve ($\beta_{it}^{*}$) and the temporal changes in the proportion of undocumented cases (∆u_it).
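The identities in (6), (9) and (10), together with the share $1 - e^{-u_{it}}$, can be checked numerically. The Python sketch below uses made-up cumulative series for a single province; the function name `check_identities` and all numbers are hypothetical and serve only to confirm that the algebra holds.

```python
import math

def check_identities(reported, undocumented):
    """Evaluate Eqs. (6), (9) and (10) on given cumulative series."""
    total = [y + u for y, u in zip(reported, undocumented)]           # Eq. (1)
    u = [math.log(ys) - math.log(y) for ys, y in zip(total, reported)]  # Eq. (7)
    rows = []
    for t in range(1, len(reported)):
        beta = reported[t] / reported[t - 1]            # Eq. (2)
        theta = undocumented[t] / undocumented[t - 1]   # Eq. (4)
        beta_star = total[t] / total[t - 1]             # Eq. (5)
        rows.append({
            "beta": beta,
            "beta_star": beta_star,
            # Right-hand side of Eq. (6); should equal beta_star.
            "eq6": beta + (theta - beta) * undocumented[t - 1] / total[t - 1],
            # Right-hand side of Eq. (9); should equal beta.
            "eq9": theta * (math.exp(u[t - 1]) - 1) / (math.exp(u[t]) - 1),
            # Right-hand side of Eq. (10); should equal beta.
            "eq10": beta_star * math.exp(-(u[t] - u[t - 1])),
            # Share of undocumented cases, 1 - exp(-u_t), next to U_t / Y*_t.
            "share_from_u": 1 - math.exp(-u[t]),
            "share_direct": undocumented[t] / total[t],
        })
    return rows

reported = [10.0, 16.0, 24.0, 30.0]        # Y_t, illustrative only
undocumented = [40.0, 56.0, 66.0, 72.0]    # U_t, illustrative only
for row in check_identities(reported, undocumented):
    print({name: round(value, 4) for name, value in row.items()})
```

In this example the undocumented share falls over time, so ∆u_it is negative and the reported-cases curve grows faster than the total curve, as Eq. (10) implies.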

The latter empirical strategy is developed in detail in the next section. In a nutshell, this strategy involves estimating the epidemic curve of total cases using a stochastic frontier specification of the model where the undocumented cases are proxied using a one-sided random term in the same fashion as firms’ inefficiency in production economics. The two partial epidemic curves (i.e., the epidemic curves of reported and undocumented cases) can be obtained once the epidemic curve of total cases has been appropriately adjusted using the estimated proportions of undocumented cases that appear in (9) and (10).

Frontier specification of our epidemic curves

Frontier specification

This section discusses estimation of the epidemic curve of reported cases using a stochastic frontier model, an econometric specification widely used in production economics to measure firms’ efficiency. The stochastic frontier analysis approach can be used to control for the existence of undocumented coronavirus cases because these cases are not observed by the econometrician and the reported cases are always lower than the total number of COVID-19 infections. This is illustrated in Fig. 2, where we have simplified our previous figure by dropping the two partial epidemic curves. This figure shows that the total epidemic curve can be viewed as a function that envelops the observed number of coronavirus cases from above. The gap between Y^* and Y is the number of undocumented cases, which never takes negative values. The stochastic frontier analysis approach uses one-sided random terms to control for non-negative (or non-positive) unobserved variables, such as firm inefficiency in production economics.

Fig. 2 — Overall epidemic curve and undocumented cases

As lnβ_it = Rate_it by definition, the stochastic frontier model that is finally estimated can be obtained once we take natural logarithms in (10) and add a traditional noise term:

$$\mathit{Rate}_{it} = \ln \beta_{it}^{*}(\cdot) + v_{it} - \Delta u_{it} = \ln \beta_{it}^{*}(\cdot) + \varepsilon_{it} \tag{11}$$

where $\ln \beta_{it}^{*}(\cdot)$ is a function of a set of covariates determining the temporal evolution of total coronavirus cases. The idiosyncratic feature of our frontier specification of the model is the existence of two random terms. The first one is the traditional noise term (v_it) capturing random shocks, measurement or specification errors and other unobservable variables not correlated with the set of explanatory variables determining the rate of growth of coronavirus cases. The second random term is the difference of two one-sided random terms and captures changes over time in the proportion of undocumented cases (∆u_it). Note that the cumulative number of unreported cases obviously increases over time but that the proportion of undocumented cases, u_it, may either increase or decrease as the pandemic evolves so that ∆u_it may be either positive or negative. We show in Section 3.3 that this does not preclude the use of a stochastic frontier model if we impose some structure on the distribution of u_it.

Our empirical strategy thus relies on three assumptions: i) the epidemic nature of this disease can be best represented by a total epidemic curve, regardless of whether researchers observe all COVID-19 cases or not; ii) the unobserved cases can be proxied using a one-sided random term in the same fashion as firm inefficiency in production economics; and iii) the proportion of undocumented cases varies over time during the evolution of the epidemic.

Our epidemic frontier model in (11) looks similar to a (panel) stochastic production frontier model. It is common in this literature to estimate the following model in levels:

$$\ln Y_{it} = \alpha_{i} + f(X_{it}, \beta) + v_{it} - u_{it} \tag{12}$$

where the subscript i stands for firm, X_it is a vector of exogenous production drivers, β is a vector of technological parameters, v_it is a noise term capturing production shocks, and u_it is a non-negative random term capturing firm inefficiency. α_i is a firm-specific intercept aiming to capture characteristics that affect firms’ production but that are unobserved or omitted variables. In our setting, this captures time-invariant unobserved effects that affect the levels of COVID-19 cases and that Millimet and Parmeter (2022) controlled directly using a set of geographic and demographic characteristics of each country. Estimation of the model in (12) using the so-called True Fixed Effects (TFE) model introduced by Greene (2005)7 is not easy due to the incidental parameter problem.8 Wang and Ho (2010) solve this problem using temporal transformations of (12). If we take first differences in Eq. (12) to remove the time-invariant firm-specific effects, we get:

$$\Delta \ln Y_{it} = \Delta f(X_{it}, \beta) + v_{it}^{*} - \Delta u_{it} \tag{13}$$

where $v_{it}^{*} = \Delta v_{it}$ follows a (multivariate) normal distribution. The production frontier model in (13) is similar to our epidemic frontier model in (11). There are, however, two main differences. First, while ∆f (X_it, β) can be negative in a production economics setting, we need to impose the theoretical restriction $\ln \beta_{it}^{*} \geq 0$ due to the cumulative nature of Y_it. Second, while the production frontier function represents a “technology” (i.e. an unknown combination of production processes), our frontier represents an underlying epidemic process that involves both confirmed and undocumented cases.

Finally, it is worth highlighting that the aforementioned stochastic epidemic frontier model recently proposed by Millimet and Parmeter (2022) aims to explain new coronavirus cases. Our model, on the other hand, focuses on rates of growth of cumulative cases. Despite this difference, both approaches are similar if we take into account that $\mathit{Rate}_{it} \approx (Y_{it} - Y_{it-1})/Y_{it-1} = N_{it}/Y_{it-1}$, where N_it stands for new cases on day t in province i. For this reason, there are no clear advantages of using rates of growth instead of new cases. Moreover, Orea and Álvarez (2022) show that our parameter estimates can be interpreted as a semi-elasticity of the number of new cases with respect to an explanatory variable, in the same fashion as in count regression models. They point out, however, that the growth rate of cumulative cases is much less volatile than the number of new cases or its growth rate. Our empirical strategy therefore provides more accurate predictions than a count-type model. This is a feature of the model that is important in our application because we use predicted values to carry out our counterfactual analyses aimed at examining the effect of the Spanish lockdown.
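The approximation between the log growth rate and the ratio of new to lagged cumulative cases follows from ln(1 + x) ≈ x for small x. A minimal Python sketch with invented numbers (the function name `growth_rates` is hypothetical) shows how close the two measures are when daily growth is modest and how they diverge when it is not:

```python
import math

def growth_rates(prev_cum, new_cases):
    """Return (log growth rate, new cases over lagged cumulative cases)."""
    log_rate = math.log(prev_cum + new_cases) - math.log(prev_cum)
    simple_rate = new_cases / prev_cum
    return log_rate, simple_rate

# Illustrative values only: 2% and 50% daily growth.
for prev_cum, new_cases in [(1000, 20), (1000, 500)]:
    log_rate, simple_rate = growth_rates(prev_cum, new_cases)
    print(prev_cum, new_cases, round(log_rate, 4), round(simple_rate, 4))
```

The gap is largest in the first days of an outbreak, when cumulative counts are small and daily growth is high.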

In order to estimate the above model using ML, we are forced to choose a distribution for both the noise term (v_it) and the one-sided random term capturing the proportion of undocumented cases (u_it). In what follows, we discuss the distribution of v_it, the distribution of u_it, and the likelihood function.

Distribution of the noise term

We have added a noise term (v_it) in Eq. (11) in order to directly capture measurement errors in the rate of growth of coronavirus cases. As is customary in the stochastic frontier literature, we assume that the v_it’s are independent of the u_i’s. If we next assume that v_it is independently distributed over time and follows a normal distribution with zero mean, the noise vector v_i = (v_i1,…,v_iT) will follow a multivariate normal distribution with a diagonal covariance matrix. Using the notation from Wang and Ho (2010), the density function of the vector v_i is:

$$g(v_{i}) = (2\pi)^{-T/2} \, |\Pi|^{-1/2} \exp\left\{ -\frac{1}{2} v_{i}' \Pi^{-1} v_{i} \right\} \tag{14}$$

where Π is the variance-covariance matrix of v_i. We then assume that the noise vector v_i = (v_i1,…,v_iT) follows a multivariate normal distribution with a diagonal but heteroskedastic variance-covariance matrix because the volatility of the rates of growth of reported cases decreases throughout the epidemic development:

$$\Pi = \begin{pmatrix} \sigma_{v1}^{2} & 0 & \cdots & 0 \\ 0 & \sigma_{v2}^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{vT}^{2} \end{pmatrix} \tag{15}$$

This specification of the variance-covariance matrix of v_i differs from that used in Wang and Ho (2010) in two important aspects. On one hand, our noise term is heteroskedastic, whereas it follows a homoskedastic distribution in Wang and Ho (2010). On the other, while we assumed that v_it is not autocorrelated over time, the first-differences and within transformations carried out by Wang and Ho (2010) to remove time-invariant firm-specific effects introduce negative correlations between two consecutive (transformed) noise terms.

An autocorrelated specification can be obtained if we introduce the noise terms before computing the rates of growth of coronavirus cases, in the spirit of Chudik et al. (2020) and Millimet and Parmeter (2022). Let us rewrite (7) as follows:

$$\ln Y_{it}^{*} = \ln Y_{it} + u_{it} - v_{it} \tag{16}$$

where v_it is a two-sided error term that now captures non-systematic variations in total coronavirus cases (Millimet and Parmeter 2022). If we next take natural logarithms in (5) and replace $\ln Y_{it}^{*}$ and $\ln Y_{it-1}^{*}$ with (16) evaluated at t and t−1, we get:

$$\Delta \ln Y_{it} = \ln \beta_{it}^{*} + \tilde{v}_{it} - \Delta u_{it} \tag{17}$$

where $\tilde{v}_{it} = \Delta v_{it}$. The new noise term is no longer independently distributed over time. If we assume that v_it follows a heteroskedastic normal distribution, the noise vector $\tilde{v}_{i} = (\tilde{v}_{i1}, \dots, \tilde{v}_{iT})$ follows a multivariate normal distribution with the following variance-covariance matrix:

$$\Pi = \begin{pmatrix} \sigma_{v1}^{2} + \sigma_{v0}^{2} & -\sigma_{v1}^{2} & 0 & \cdots & 0 \\ -\sigma_{v1}^{2} & \sigma_{v2}^{2} + \sigma_{v1}^{2} & -\sigma_{v2}^{2} & \cdots & 0 \\ 0 & -\sigma_{v2}^{2} & \sigma_{v3}^{2} + \sigma_{v2}^{2} & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & -\sigma_{v(T-1)}^{2} \\ 0 & 0 & \cdots & -\sigma_{v(T-1)}^{2} & \sigma_{vT}^{2} + \sigma_{v(T-1)}^{2} \end{pmatrix} \tag{18}$$

If v_it is homoskedastic as in Wang and Ho (2010), we get the variance-covariance matrix of their first-differences transformed noise term (see their Eq. 12). It is an empirical question whether specification (15) or (18) of the noise term is better. However, it should be mentioned that estimation of a frontier epidemic model using (18) is more problematic if the panel dataset is not continuous and there are missing observations between t = 1 and t = T. This happens, for instance, if we drop the observations with zero rates of growth of coronavirus cases that led to convergence problems9 when maximizing the likelihood functions in most of our estimated models.10
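The tridiagonal structure of the first-differenced noise covariance can be illustrated with a minimal NumPy sketch. The function name and the numerical variances below are hypothetical and chosen only for illustration; the sketch assumes that the variances σ²_v0, …, σ²_vT are known and that v_it is independent over time.

```python
import numpy as np

def first_difference_noise_cov(sigma2):
    """Covariance of (Δv_1, ..., Δv_T) given sigma2 = [σ²_v0, ..., σ²_vT].

    Assumes v_t is independent over time, so only adjacent
    first differences are correlated.
    """
    s = np.asarray(sigma2, dtype=float)
    main = s[1:] + s[:-1]   # Var(Δv_t) = σ²_vt + σ²_v(t-1)
    off = -s[1:-1]          # Cov(Δv_t, Δv_{t+1}) = -σ²_vt
    return np.diag(main) + np.diag(off, k=1) + np.diag(off, k=-1)

# Made-up variances that shrink over epidemic time (T = 4)
sigma2 = [1.0, 0.8, 0.5, 0.3, 0.2]
Pi = first_difference_noise_cov(sigma2)
print(Pi)
```

With equal variances the diagonal becomes 2σ²_v and the off-diagonals −σ²_v, which is the homoskedastic case discussed in the text.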

Distribution of u_it

We now turn to the part of the likelihood function related to the proportion of undocumented cases. Estimating (11) is far from straightforward because the distribution of ∆u_it is generally not known if we assume that u_it is independently distributed across provinces and over time (see, for instance, Wang 2003, and Orea and Álvarez 2019). To deal with this issue, we follow Wang and Ho (2010) and assume that u_it possesses the so-called scaling property so that it can be multiplicatively decomposed into two components as follows:

$$u_{it} = h(z_{it}, τ) \cdot u_{i}$$

where h_it = h(z_it, τ) ≥ 0 is a deterministic (scaling) function, z_it is a set of undocumented-cases determinants (often labelled as contextual or z-variables), and u_i is a homoskedastic one-sided random variable. For notational ease, we assume hereafter that the panel dataset is balanced in the sense that we have not dropped observations along the epidemic development. The preceding implies that the first temporal difference of u_it in (11) can be rewritten as:

$$Δu_{it} = (h_{it} - h_{it-1}) \cdot u_{i} = Δh_{it} u_{i}$$

where ∆h_it can be positive or negative. Notice that if the scaling function h_it is not constant, the one-sided random variable is identified after the first-difference, and that the distribution of u_i is not affected by the first-differences transformation. This key aspect of their model enabled Wang and Ho (2010) to get a tractable likelihood function for their transformed model. The same applies to our stochastic frontier epidemic model. Consequently, as the density function of ε_i = (ε_i1,…, ε_iT) has a closed-form, Eq. (11) can be estimated by Maximum Likelihood (ML), provided that the scaling function h_it is not constant. As Wang and Ho (2010) point out, this condition requires that z_it contains at least one variable which changes values over time. Obviously, this happens if we include the epidemic time K_it = t−E_i as determinant of the proportion of undocumented cases.
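The identification condition can be seen in a small sketch. It assumes an exponential scaling function h = exp(z'τ), which is the form adopted later in the empirical application; the coefficient values and z-paths are hypothetical.

```python
import math

def scaling(z, tau):
    """Exponential scaling function h = exp(z'τ) for one observation."""
    return math.exp(sum(zk * tk for zk, tk in zip(z, tau)))

def delta_h(z_path, tau):
    """First differences Δh_t = h_t - h_{t-1} along one province's z-path."""
    h = [scaling(z, tau) for z in z_path]
    return [h[t] - h[t - 1] for t in range(1, len(h))]

tau = [-0.05, 0.3]                          # hypothetical coefficients
varying = [[k, 1.0] for k in range(1, 6)]   # first z is epidemic time, second is fixed
constant = [[1.0, 1.0]] * 5                 # no z-variable changes over time

print(delta_h(varying, tau))    # non-zero: u_i survives first-differencing
print(delta_h(constant, tau))   # all zeros: u_i is differenced away
```

When every z-variable is time-invariant, Δh_it is zero in every period and the one-sided term drops out of the differenced model, so its variance cannot be estimated.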

Our frontier model in (20) essentially mimics the one proposed by Wang and Ho (2010) to get a tractable likelihood function for their transformed model. It also looks like the specification introduced by Kumbhakar (1990) and Battese and Coelli (1992) except for the first-differencing transformation of the scaling function. In this sense, we have basically replaced $η_{it} = e^{-η(t-T)}$ in Battese and Coelli (1992, Eq. 2) with $η_{it} = Δh_{it} = e^{τ z_{it}} - e^{τ z_{it-1}}$, where z_it is a set of undocumented-cases determinants that include a time-trend variable (e.g., t or K_it).

Likelihood function

For simplicity, we will assume that u_i~N⁺(0, σ_u). We recall that the half-normal distribution of u_i is not affected by the first-differencing transformation of the idiosyncratic one-sided error term, so that ∆u_it = ∆h_it·u_i is distributed as a heteroskedastic half-normal. Wang and Ho (2010) showed that the aforementioned assumptions on v_it and u_it yield the following log-likelihood function for province i:

$$\begin{aligned} \ln L_{i} = {} & -\frac{N}{2} \ln(2π) - \frac{1}{2} \ln|Π| - \frac{1}{2} ε_{i}' Π^{-1} ε_{i} \\ & + \frac{1}{2}\left(\frac{μ_{*}^{2}}{σ_{*}^{2}} - \frac{μ^{2}}{σ_{u}^{2}}\right) + \ln\left[σ_{*} Φ\left(\frac{μ_{*}}{σ_{*}}\right)\right] - \ln\left[σ_{u} Φ\left(\frac{μ}{σ_{u}}\right)\right] \end{aligned}$$

where Φ is the standard normal cumulative distribution function, ε_i = (ε_i1,…, ε_iT), $ε_{it} = Δ\ln Y_{it} - \ln β_{it}^{*}(\cdot)$, and

$$μ_{*} = \frac{μ/σ_{u}^{2} - ε_{i}' Π^{-1} Δh_{i}}{Δh_{i}' Π^{-1} Δh_{i} + 1/σ_{u}^{2}}$$

$$σ_{*}^{2} = \frac{1}{Δh_{i}' Π^{-1} Δh_{i} + 1/σ_{u}^{2}}$$

where ∆h_i = (∆h_i1,…, ∆h_iT). Consistent parameter estimates can be obtained by numerically maximizing $\ln L = \sum_{i=1}^{N} \ln L_{i}$.
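As a rough illustration, the province-level log-likelihood could be coded along the following lines. This is only a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, Π is passed in as a known matrix, and the additive constant is computed from the number of observations in ε_i (it does not affect the maximiser). The numbers in the usage example are made up.

```python
import math
import numpy as np

def std_normal_cdf(x):
    """Standard normal CDF Φ(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def loglik_province(eps, dh, Pi, sigma_u, mu=0.0):
    """Log-likelihood contribution of one province (sketch).

    eps: residuals ε_i, dh: first-differenced scaling Δh_i,
    Pi: covariance of the noise vector, sigma_u: scale of u_i,
    mu: pre-truncation mean of u_i (0 gives the half-normal case).
    """
    eps = np.asarray(eps, dtype=float)
    dh = np.asarray(dh, dtype=float)
    Pi = np.asarray(Pi, dtype=float)
    n_obs = eps.size                       # assumption: constant uses len(ε_i)
    Pi_inv_eps = np.linalg.solve(Pi, eps)  # Π⁻¹ ε_i
    Pi_inv_dh = np.linalg.solve(Pi, dh)    # Π⁻¹ Δh_i
    _, logdet = np.linalg.slogdet(Pi)
    denom = dh @ Pi_inv_dh + 1.0 / sigma_u**2
    mu_star = (mu / sigma_u**2 - eps @ Pi_inv_dh) / denom
    sigma_star = math.sqrt(1.0 / denom)
    value = (
        -0.5 * n_obs * math.log(2.0 * math.pi)
        - 0.5 * logdet
        - 0.5 * (eps @ Pi_inv_eps)
        + 0.5 * (mu_star**2 / sigma_star**2 - mu**2 / sigma_u**2)
        + math.log(sigma_star * std_normal_cdf(mu_star / sigma_star))
        - math.log(sigma_u * std_normal_cdf(mu / sigma_u))
    )
    return float(value)

# Made-up inputs for a province with three observations
Pi = np.array([[1.8, -0.8, 0.0], [-0.8, 1.3, -0.5], [0.0, -0.5, 0.8]])
print(loglik_province([0.2, -0.1, 0.05], [-0.10, -0.08, -0.06], Pi, sigma_u=1.5))
```

Summing this quantity over provinces and passing the negative of the sum to a numerical optimiser would give ML estimates. In a full implementation Π, Δh_i and ε_i would themselves be functions of the parameters being estimated.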

Empirical illustration

Sample and data

We have used several sources to construct a dataset of coronavirus cases across Spain. As most control measures began on the days of March 13th and 14th, 2020, we analyse data on coronavirus cases for roughly three weeks before and three weeks after those dates. In particular, our data set covers the period between the onset of the epidemic in each province and the 4th of April.

The daily evolution of laboratory-confirmed COVID-19 cases in the Spanish mainland provinces was collected manually by the authors from the official press releases of the Spanish regional governments, the Ministry of Health and Wikipedia. These information sources had to be consulted to extend backwards in time the provincial data published by Datadista on GitHub under a free license. Datadista extracts its data from a variety of documents published by the Ministry of Health but only published data from March 13th onwards.11 For the 28th of March onwards we collected the data directly using RTVE Flourish.12 We used the regional online data released by the Ministry of Health13 and the province-level data released by the Spanish regional governments to correct typos and fill in missing information on coronavirus cases in some provinces.

Figure 3 shows the boxplots of the growth rates of cumulative reported cases by epidemic time, from which two features are evident. First, the rates of growth of reported cases are much larger at the beginning of the epidemic than when the epidemic has advanced. That is, our dependent variable tends to decrease over the epidemic time. Second, the volatility is much larger when K_it is small, and declines as K_it increases. This calls for a time-varying heteroskedastic specification of our symmetric error term.

Fig. 3 — Growth rates of cumulative cases

Parameter estimates

Table 1 shows the parameter estimates of several epidemic-time specifications of Eq. (11). That is, the four specifications in this table use a third-order function of lnK_it to capture the temporal evolution of coronavirus cases. As the likelihood function of these models has a closed form, they have all been estimated by ML. The first two specifications assume that the epidemic curve of total coronavirus cases (i.e., $\ln β_{it}^{*}$) is a linear function of a set of covariates, whereas the last two specifications assume that $\ln β_{it}^{*}$ is an exponential function in order to impose the theoretical restriction $β_{it}^{*} \geq 1$. The non-frontier models assume that ∆u_it = 0, thereby ignoring the one-sided random term that appears in Eq. (11), which is equivalent to assuming that the proportion of undocumented cases does not change over time. These non-frontier models therefore impose the strong assumption that the epidemic curves of both total and reported coronavirus cases coincide (see Eq. 10). The frontier models relax this assumption by adding the first difference of a one-sided error term that can be multiplicatively decomposed into an exponential scaling function (that is, we assume that the scaling function that appears in Eq. (12) is $h_{it} = e^{z_{it}'τ}$) and a homoskedastic half-normal random variable.

Table 1.

MLE: Epidemic-time (lnK_it) specification

	Linear				Exponential
	Non-frontier model		Frontier model		Non-frontier model		Frontier model
	Coef.	s.e.	Coef.	s.e.	Coef.	s.e.	Coef.	s.e.
Overall epidemic curve
Intercept	0.8083**	0.3734	0.8950	0.7857	4.1081***	1.1877	5.1474	3.2768
lnK_it	−0.3775	0.4130	−0.4959	0.8168	−7.0363***	1.5375	−8.8926**	3.8386
$\ln K_{it}^{2}$	0.1492	0.1518	0.1602	0.2869	3.2610***	0.6438	4.0997***	1.5123
$\ln K_{it}^{3}$	−0.0248	0.0183	−0.0227	0.0333	−0.5010***	0.0863	−0.6332***	0.1964
W_ilnK_t	0.0360*	0.0185	0.0830***	0.0201	0.0835***	0.0499	0.1278*	0.0724
D_t	−0.1815***	0.0304	−0.1376***	0.0295	−0.5964***	0.0875	−0.4977***	0.1306
W_ilnK_t∙D_t	−0.0466***	0.0187	−0.0782***	0.0175	−0.1964***	0.0541	−0.2879***	0.0592
Noise term (lnσ_v)
Intercept	0.7954***	0.1375	1.1092***	0.0924	0.8430***	0.1333	1.1247***	0.1111
lnK_it	−1.0215***	0.0485	−1.1673***	0.0322	−1.0411***	0.0468	−1.1714***	0.0390
Scaling function
K_t			−0.0383***	0.0112			−0.0437***	0.0145
W_ilnK_t			−0.1376***	0.0429			−0.0796	0.0501
K_t∙D_t			−0.0020*	0.0011			−0.0026*	0.0014
W_ilnK_t∙D_t			−0.0281**	0.0136			−0.0350*	0.0194
u-term (lnσ_u)
Intercept			1.2122***	0.5060			1.1527***	0.4280
Day of the week effects
Tuesday	0.0154	0.0108	0.0182	0.0210	0.1009	0.0722	0.1992	0.1478
Wednesday	0.0128	0.0106	0.0173	0.0180	0.0689	0.0720	0.1599	0.1281
Thursday	0.0130	0.0103	0.0177	0.0156	0.1083	0.0699	0.2243*	0.1220
Friday	0.0047	0.0103	0.0127	0.0185	0.0216	0.0780	0.1421	0.1288
Saturday	0.0020	0.0101	0.0062	0.0132	−0.0109	0.0734	−0.0124	0.1463
Sunday	0.0213	0.0114	0.0167	0.0201	0.1729**	0.0711	0.2571**	0.1432
Mean log LF	0.6572		1.6373		0.6648		1.6379
Pseudo R-sq	0.3442		0.3735		0.3291		0.3507
Mean RR			0.3490				0.4220
Obs.	1290		1290		1290		1290


***Significant at 1% level

**Significant at 5% level

*Significant at 10% level

All models include day-of-the-week effects that aim to capture reporting lags by regional and national governments. They also all include a dummy variable D_t that takes the value 1 from the 14th of March, 2020, the day marking the imposition of most of the coronavirus control measures by the Spanish government. The coefficient of this dummy variable allows us to test whether the Spanish lockdown and other public control measures implemented around the 14th of March were able to attenuate the spread of the virus within each province.14

Following the scant epidemiology literature that controls for spatial spillover effects, we use a spatial lag of X specification (SLX) to measure the propagation effect of mobility of people across provinces. In particular, we include W_i lnK_t as an epidemic frontier driver, where lnK_t is an N × 1 vector of epidemic times of the Spanish provinces, and W_i is a 1 × N spatial weight vector where the weights measure the degree of mobility (connectivity) between provinces. We follow Giuliani et al. (2020) and Gross et al. (2020) and use a contiguity or binary W_i vector, where the weights equal one for adjacent units and zero for non-bordering units.15 Therefore, we assume that $\ln β_{it}^{*}$ depends on the epidemic time of neighbouring provinces. We have selected the epidemic time to capture the potential propagation effects between provinces for two reasons. First, this variable is exogenous by construction. Second, Orea and Álvarez (2022) found that the SLX spatial specification captured all the spatial dependence in the dependent variable using a set of spatial autocorrelation tests on the model’s residuals.16
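The construction of the spatially lagged regressor can be sketched as follows. The four-province map and the epidemic times are hypothetical, and the weights are left as raw 0/1 contiguity indicators because the text does not mention row-normalisation.

```python
import math

def spatial_lag(W, x):
    """Return the vector W·x for a list-of-lists weight matrix W and a list x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Hypothetical map of four provinces in a line: 0-1, 1-2 and 2-3 share a border
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
K_t = [12, 8, 5, 3]                      # illustrative epidemic times on one day
lnK_t = [math.log(k) for k in K_t]

W_lnK = spatial_lag(W, lnK_t)            # one value of W_i lnK_t per province
for i, value in enumerate(W_lnK):
    print(i, round(value, 4))
```

Each entry sums the logged epidemic times of the bordering provinces, so a province surrounded by neighbours with more advanced outbreaks receives a larger value of the regressor.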

The specification of both random terms is also common to all models. On the one hand, all models have been estimated assuming that the logarithm of the standard deviation of v_it depends on the logarithm of K_it because the volatility of growth rates of reported cases decreases throughout the evolution of the epidemic. On the other hand, in order to capture temporal changes in u_it, we assume that the scaling function depends on two time-varying contextual variables: i) the epidemic time of each province (K_it), in the same fashion as Battese and Coelli (1992); and ii) the logged epidemic time of neighbouring provinces (W_ilnK_t), because we believe that the mobility of people across provinces might also have a significant effect on the proportion of undocumented cases. Finally, we interact our lockdown dummy variable D_t with both K_it and W_ilnK_t in order to examine whether the Spanish lockdown and other control measures (such as an increase in testing) reduced the proportion of undocumented cases.
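The heteroskedastic noise specification, ln σ_v = a + b·lnK_it, can be evaluated directly from the Table 1 estimates. The sketch below uses the intercept and lnK_it coefficient of the exponential frontier model; the function name is hypothetical.

```python
import math

def sigma_v(K, a, b):
    """Noise standard deviation implied by ln σ_v = a + b·ln K."""
    return math.exp(a + b * math.log(K))

# Noise-term estimates from Table 1 (exponential frontier model)
a, b = 1.1247, -1.1714

for K in (3, 10, 20, 40):
    print(K, round(sigma_v(K, a, b), 4))
```

Because the estimated slope is negative, the implied standard deviation falls as epidemic time advances, in line with the pattern in Fig. 3.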

The intercepts estimated in the linear models are close to unity, indicating that the initial rates of growth of coronavirus cases are relatively large. The exponential models yield much larger initial growth rates, a result that might explain why all the coefficients of the third-order function of lnK_it are statistically significant using this specification. In contrast, we do not find significant lnK_it coefficients using the linear specification, a result that implies that the rates of growth of coronavirus cases do not change during the epidemic. This result would appear to be incorrect, however, as Fig. 3 suggests that these rates of growth decrease rapidly in the early stages of the epidemic. This feature is better captured by the exponential model, as the large negative coefficient of lnK_it found using this specification indicates that these growth rates rapidly decreased a short time after the beginning of the epidemic. Moreover, the previous result, together with the positive and negative coefficients found respectively for $\ln K_{it}^{2}$ and $\ln K_{it}^{3}$, is consistent with the traditional S-shaped epidemic curves. For all these reasons, the exponential specifications of our epidemic curve are the preferred ones.
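The shape implied by the third-order polynomial can be inspected with a short sketch. It assumes that the exponential specification takes the form ln β* = exp(index), and it evaluates only the polynomial part of the index using the Table 1 exponential frontier estimates; the spatial, lockdown and day-of-the-week terms are set to zero, which is a simplification made purely for illustration.

```python
import math

# Table 1, exponential frontier model: intercept, lnK, lnK², lnK³
THETA = (5.1474, -8.8926, 4.0997, -0.6332)

def poly_index(K, theta=THETA):
    """Third-order polynomial in ln K (other covariates assumed zero)."""
    x = math.log(K)
    return theta[0] + theta[1] * x + theta[2] * x**2 + theta[3] * x**3

def ln_beta_star(K, theta=THETA):
    """Assumed exponential form: ln β* = exp(index), which is never negative."""
    return math.exp(poly_index(K, theta))

for K in (3, 5, 10, 20, 40):
    print(K, round(ln_beta_star(K), 4))
```

Under these assumptions the implied growth rate stays positive and declines monotonically over the sample range of epidemic times.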

Another key result of our empirical exercise is the positive and statistically significant coefficient found for the spatially-lagged variable, W_ilnK_t. This result provides evidence supporting the belief that people’s mobility did spread the virus across the country, as it indicates that the growth rate of COVID-19 cases in a province depends on the evolution of the pandemic in other provinces. Notice that we have interacted D_t with W_ilnK_t. This implies that the coefficient of W_ilnK_t actually measures propagation effects before the implementation of the Spanish lockdown. The coefficient of W_ilnK_t·D_t is negative and statistically significant, indicating that the lockdown was quite effective in preventing the propagation of the coronavirus between provinces. Another issue is whether the lockdown was effective in reducing the propagation of the virus within each province. This within-province impact of the Spanish lockdown can be examined using the estimated coefficient of D_t.17 We find a negative and statistically significant effect of the Spanish lockdown on the rates of growth of coronavirus cases, regardless of whether we use linear or exponential specifications for the epidemic curve.

In summary, these results allow us to conclude that the lockdown was effective in both preventing the propagation of the coronavirus between provinces and in attenuating the propagation of the virus within each province. We carried out a counterfactual exercise using the parameter estimates of our preferred model to simulate what the situation would have been on April 4th if the lockdown had not been implemented around March 14th. We found that the lockdown reduced the number of potential COVID-19 cases by 65.2%. Using a similar approach, Cho (2020) found that the cases of infection in Sweden would have been reduced by almost 75% had its policymakers followed stricter containment policies.

We now focus our attention on the distribution of both random terms. As expected, we find that the standard deviation of the noise term decreases with the logarithm of K_it. Regarding the one-sided random term, using the exponential frontier model we find that the average reporting rate (RR = Y/Y^*) is 42.2%. This rate changes over time as we find that the coefficients of the scaling function are negative, with most of them being statistically significant. We also find very different (under)reporting rates across the Spanish provinces, which is one of the contributions of the paper as the previous epidemiological literature often relies on common rates. For illustrative purposes, in Appendix B, the reported, unreported and total case estimations for each Spanish province on April 4th, 2020, are presented in Table 4 and the temporal evolution of reported and total cases by province are presented in Fig. 7.

Table 4.

Reported, undocumented and total cases by province (April 4th, 2020)

Region	Province	Reported (A)	Undocumented (B)	Total (C = A + B)	RR (D = A/C)
Andalucía	Almería	346	327	673	51.4
	Cádiz	846	236	1082	78.2
	Córdoba	974	394	1368	71.2
	Granada	1477	351	1828	80.8
	Huelva	279	65	344	81.1
	Jaén	914	651	1565	58.4
	Málaga	1863	717	2580	72.2
	Sevilla	1602	2185	3787	42.3
Aragón	Huesca	396	106	502	78.9
	Teruel	371	154	525	70.7
	Zaragoza	2409	1062	3471	69.4
Asturias	Asturias	1605	691	2296	69.9
Cantabria	Cantabria	1441	1500	2941	49.0
Castilla-La Mancha	Albacete	2653	3269	5922	44.8
	Ciudad Real	3854	3418	7272	53.0
	Cuenca	497	164	661	75.2
	Guadalajara	858	168	1026	83.6
	Toledo	2169	1994	4163	52.1
Castilla-León	Ávila	679	242	921	73.7
	Burgos	985	265	1250	78.8
	León	1261	1087	2348	53.7
	Palencia	472	198	670	70.4
	Salamanca	1659	2226	3885	42.7
	Segovia	1148	1659	2807	40.9
	Soria	803	262	1065	75.4
	Valladolid	1403	2652	4055	34.6
	Zamora	339	84	423	80.1
Cataluña	Barcelona	27484	34557	62041	44.3
	Girona	2072	1538	3610	57.4
	Lleida	1176	274	1450	81.1
	Tarragona	958	756	1714	55.9
Extremadura	Badajoz	672	526	1198	56.1
	Cáceres	1375	1337	2712	50.7
Galicia	A Coruña	2180	715	2895	75.3
	Lugo	561	344	905	62.0
	Ourense	921	460	1381	66.7
	Pontevedra	1519	528	2047	74.2
La Rioja	La Rioja	2592	516	3108	83.4
Madrid	Madrid	37584	27553	65137	57.7
Murcia	Murcia	1235	194	1429	86.4
Navarra	Navarra	3073	2859	5932	51.8
País Vasco	Álava	2639	544	3183	82.9
	Vizcaya	4489	2481	6970	64.4
	Guipúzcoa	1500	729	2229	67.3
Valencia	Alicante	2627	1042	3669	71.6
	Castellón	852	2056	2908	29.3
	Valencia	3701	2838	6539	56.6


Note: Reported rate (RR) in percentage
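The relationship between the columns of Table 4 can be reproduced with a few lines of code. The sketch below takes the reported and undocumented counts of five provinces from the table and recomputes the total, the reporting rate and the implied multiplication factor (total cases per reported case); the function name is hypothetical.

```python
def reporting_rate(reported, undocumented):
    """Share of total cases that were reported: RR = A / (A + B)."""
    return reported / (reported + undocumented)

# (reported, undocumented) counts taken from Table 4
provinces = {
    "Almería": (346, 327),
    "Sevilla": (1602, 2185),
    "Barcelona": (27484, 34557),
    "Madrid": (37584, 27553),
    "Castellón": (852, 2056),
}

for name, (a, b) in provinces.items():
    rr = reporting_rate(a, b)
    # multiplication factor = total cases per reported case = 1 / RR
    print(name, a + b, round(100 * rr, 1), round(1 / rr, 2))
```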

Fig. 7 — Temporal evolution of reported and total cases by province

Regarding the temporal path of reporting rates, Fig. 4 shows the province-specific reporting rates by epidemic time computed using our preferred exponential frontier model. Several comments are in order regarding this figure. First, we observe that all rates tend to increase throughout the evolution of the epidemic because we have found before that u_it tends to decline over time. Second, the sample mean varies from 25.3 to 52.5%. These averages reveal that the multiplication factor is on average close to 4 at the beginning of the epidemic and close to 2 at later stages.18 Third, the minimum RR values suggest that there are (many) provinces with very low reporting rates, and hence extremely large multiplication factors, especially at the very beginning of their epidemic episodes. In this sense, our estimated reporting rates are in line with Li et al. (2020), who also find very low reporting rates (14%) before the implementation of the Chinese travel restrictions.19
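The link between the one-sided term, the reporting rate and the multiplication factor can be sketched as follows. The sketch assumes RR = exp(−u_it), which follows from the relation between lnY* and lnY when the noise term is ignored, and it uses only the K coefficient of the scaling function from Table 1 (exponential frontier model). The value of u_i is hypothetical and the remaining scaling-function variables are set to zero for illustration.

```python
import math

TAU_K = -0.0437   # Table 1, exponential frontier model: K coefficient in the scaling function

def u_path(u_i, K_values, tau_k=TAU_K):
    """u_it = h_it · u_i with h_it = exp(tau_k · K) (other z-variables assumed zero)."""
    return [math.exp(tau_k * K) * u_i for K in K_values]

def reporting_rate(u):
    """RR = Y / Y* = exp(-u), ignoring the noise term."""
    return math.exp(-u)

K_values = [3, 10, 20, 30, 40]
u_i = 2.0   # hypothetical province-specific draw
for K, u in zip(K_values, u_path(u_i, K_values)):
    rr = reporting_rate(u)
    print(K, round(rr, 3), round(1 / rr, 2))   # RR and multiplication factor

# Multiplication factors implied by the sample means quoted in the text
print(round(1 / 0.253, 2), round(1 / 0.525, 2))
```

A negative coefficient on K shrinks u_it over epidemic time, so the reporting rate rises and the multiplication factor falls, which is the pattern described for Fig. 4.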

The geographical distribution of reporting rates across the Spanish provinces is shown in Fig. 5. As the reporting rates vary over time, we have depicted this map using the provincial reporting rates evaluated at the epidemic time 20. Figure 5 seems to suggest the existence of two groups of provinces, one with relatively large reporting rates and the other with relatively low reporting rates. It can be seen that most, but not all, of the provinces with small reporting rates are located in the regions of Castilla-León, Extremadura, and Valencia, and the two main epicentres in Spain (Madrid and Barcelona). The multiplication factors in these provinces (not shown) are on average close to 8. The largest reporting rates are found in coastal Andalucía and several provinces located in the Iberian and Pyrenees mountain ranges. Consequently, their multiplication factors are much smaller than those computed for the previously-mentioned provinces (close to 1.7 on average).

Robustness analyses

In this section we provide some extensions to the base model. We do not show the new parameter estimates for reasons of space but they can be found in Orea et al. (2021). Here, we simply summarize the main results of these robustness analyses.

Alternative specifications: SIR-based models

First, we compared our results with those obtained using a frontier specification inspired by the SIR theoretical epidemic model of Chudik et al. (2020), where we replace the third-order function of lnK_it with the first and second-order lagged values of lnY_it and their interaction. The derivation of a SIR-based frontier model can be found in Appendix A. There we show that some simplifying and strong assumptions need to be made in order to estimate a SIR-based model once undocumented cases are incorporated, so that its results are likely to be biased.
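The regressors of the SIR-based specification can be built from a series of cumulative reported cases as in the following sketch. The function name and the case counts are hypothetical; the sketch only constructs the dependent variable and the lagged terms and does not estimate the model.

```python
import math

def sir_regressors(cum_cases):
    """Rows of (Δ ln Y_t, ln Y_{t-1}, ln Y_{t-2}, ln Y_{t-1}·ln Y_{t-2}).

    The first two days are lost because two temporal lags are required.
    """
    lnY = [math.log(y) for y in cum_cases]
    rows = []
    for t in range(2, len(lnY)):
        growth = lnY[t] - lnY[t - 1]
        rows.append((growth, lnY[t - 1], lnY[t - 2], lnY[t - 1] * lnY[t - 2]))
    return rows

# Made-up cumulative reported cases for one province
cases = [2, 5, 9, 16, 30, 48]
for row in sir_regressors(cases):
    print(tuple(round(v, 4) for v in row))
```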

To examine the relative performance of the SIR-based and epidemic-time frontier specifications, we carried out several simulation exercises, which can be found in a previous version of this work published as a working paper (Orea et al. 2021). In summary, we found that in all cases the frontier specification performs better than a non-frontier model in terms of goodness-of-fit. Regarding the frontier specifications, we find that the SIR specification provides a better goodness of fit than the epidemic-time specification because lnY_it exhibits greater cross-sectional heterogeneity than lnK_it. However, the SIR specifications tend to significantly overestimate the (proportion of) unreported cases. Finally, both models perform particularly poorly when they do not take account of heteroskedasticity in the symmetric error term when the cross-section dimension of the panel dataset is small. When the cross-section dimension is increased, the estimates are much more accurate. In any case, it appears appropriate to model the symmetric error term as heteroskedastic.20

The parameter estimates from the linear and exponential specifications of the non-frontier and frontier models are reported in Table 2. As with the epidemic-time models, we find using the SIR-based models that the mobility of people across provinces did clearly spread the virus across the country. We find a larger coefficient for the interaction of W_ilnK_t with the lockdown dummy variable, indicating that the lockdown was even more effective in preventing the propagation of the coronavirus between provinces using the SIR specification. In contrast, the within-province impact of the Spanish lockdown is smaller than in the epidemic-time models. Overall, both the epidemic-time and SIR-based specifications suggest the existence of significant spatial spillovers and provide evidence that the Spanish lockdown was effective in reducing the propagation of COVID-19 both within and between provinces.

Table 2.

MLE: SIR specification

	Linear				Exponential
	Non-frontier model		Frontier model		Non-frontier model		Frontier model
	Coef.	s.e.	Coef.	s.e.	Coef.	s.e.	Coef.	s.e.
Overall epidemic curve
Intercept	0.2897***	0.0301	0.1842***	0.0385	−1.4418***	0.1058	−2.0487***	0.1978
lnY_t−1	0.1948***	0.0230	0.0370	0.0369	0.2666***	0.0679	−0.0901	0.1467
lnY_t−2	−0.2380***	0.0220	−0.1062***	0.0330	−0.5327***	0.0618	−0.3994***	0.1238
lnY_t−1∙lnY_t−2	0.0053***	0.0007	0.0020	0.0020	−0.0105**	0.0053	−0.0334**	0.0149
W_ilnK_t	0.0550***	0.0181	0.0972***	0.0218	0.1941***	0.0491	0.4492***	0.1442
D_t	−0.1121***	0.0296	−0.0574	0.0363	−0.3384***	0.0875	−0.0790	0.1639
W_ilnK_t∙D_t	−0.0591***	0.0182	−0.0863***	0.0187	−0.2344***	0.0524	−0.3795***	0.1157
Noise term (lnσ_v)
Intercept	0.8743***	0.1352	1.0180***	0.0926	0.6446***	0.1288	0.9607***	0.1537
lnK_t	−1.0710***	0.0476	−1.1553***	0.0326	−0.9807***	0.0453	−1.1295***	0.0554
Scaling function
K_t			−0.0091	0.0233			−0.0538***	0.0188
W_ilnK_t			−0.0185	0.0395			−0.0688*	0.0370
K_t∙D_t			0.0005	0.0012			0.0002	0.0018
W_ilnK_t∙D_t			0.0044	0.0103			0.0042	0.0284
u-term (lnσ_u)
Intercept			2.5189	2.1028			1.2793***	0.2587
Day of the week effects
Tuesday	0.0192**	0.0100	0.0193	0.0215	0.1316*	0.0712	0.1892	0.1300
Wednesday	0.0124	0.0098	0.0173	0.0149	0.1018	0.0718	0.1756*	0.1005
Thursday	0.0103	0.0096	0.0182	0.0144	0.1077	0.0714	0.2436**	0.1171
Friday	0.0006	0.0094	0.0125	0.0193	0.0061	0.0788	0.1842	0.1433
Saturday	−0.0064	0.0092	0.0042	0.0151	−0.0632	0.0718	−0.1121	0.1688
Sunday	0.0234**	0.0106	0.0177	0.0248	0.2262***	0.0695	0.2796**	0.1332
Mean log LF	0.7174		1.6843		0.6934		1.6817
Pseudo R-sq	0.3799		0.4531		0.3827		0.4344
Mean RR			0.0340				0.4060
Obs	1290		1290		1290		1290


***Significant at 1% level

**Significant at 5% level

*Significant at 10% level

Regarding the two random terms, in the SIR-based models we find a decreasing standard deviation for the noise term, as occurred with the epidemic-time models. The parameter estimates of the scaling function, on the other hand, differ notably from those obtained in the epidemic-time models when we use a linear specification of the model, but not when using an exponential specification. Moreover, whereas the linear SIR-based and epidemic-time models provide very different average reporting rates, the exponential specification of the SIR-based model provides quite similar average reporting rates to its epidemic-time equivalent. This occurs in spite of our finding in the simulation exercises that the SIR specifications tended to underestimate the reporting rates. The exponential form of the SIR-based frontier epidemic curve therefore tends to attenuate the bias in the estimation of the one-sided error term using the linear SIR-based specification. Regarding the temporal path of reporting rates, Fig. 6 shows the province-specific reporting rates by epidemic time computed using the exponential SIR-based model. As in our epidemic-time model, all rates tend to increase throughout the evolution of the epidemic.

Fig. 6 — Temporal evolution of reporting rates (SIR model)

Finally, it is worth highlighting the larger (mean) log likelihood and (pseudo) R-squared values of both linear and non-linear SIR-based models. This is an expected result because the temporal lags of reported cases explain a larger proportion of the current cross-sectional heterogeneity of reported cases than the simple polynomial of epidemic times.21 While the temporal path of reporting rates is robust to this issue, the provinces’ reporting rates might change if their “true” frontier is not properly captured using a polynomial of epidemic-time variables. We do not find a systematic positive or negative bias in reporting rates. We instead find that the differences in epidemic-time and SIR-based reporting rates have to do with differences in reported cases across provinces that have not been perfectly captured by the epidemic-time variables. Indeed, although the correlation is not strong, we find that the epidemic-time model tends to overestimate (underestimate) the SIR-based reporting rates of provinces with large (small) numbers of reported cases. In this sense, our reporting rates can be viewed as an upper (lower) bound of the true reporting rates in these provinces.

In summary, both epidemic-time and SIR-based models provide similar frontier and distributional results and confirm our hypothesis that the proportion of reported cases increases as the epidemic develops, in line with Li et al. (2020). It should be noted, however, that the number of undocumented cases is likely higher (lower) than that estimated using our epidemic-time model in the most (least) affected provinces.

Additional variables: socio-economic determinants

An appealing feature of both epidemic-time and SIR-based specifications is that they can be estimated using epidemic-type data only, i.e., the rates of growth of coronavirus cases depend in our models on own and neighbours’ epidemic times, lagged cases of COVID-19, date of implementation of control measures, etc. However, this does not preclude adding other covariates. The introduction of crucial socio-economic determinants not only provides an estimate of their potential impact but may also offer guidance for future policies aimed at preventing the emergence of epidemics.

To examine this issue, we estimated our preferred model (exponential epidemic-time specification) by adding, one at a time, a series of socio-economic variables to both the overall epidemic frontier and the proportion of under-reported cases through the scaling function.22 None of these variables had a significant effect on the proportion of under-reported cases, perhaps because the random u-term we are modelling here is time-invariant. Similarly, most of the demographic and weather variables do not have a significant frontier effect. However, we did find that the most-populated provinces have had more intensive coronavirus epidemics, most likely due to agglomeration of individuals and the fact that the use of public transport is more prevalent in these provinces. We also found that the COVID-19 epidemic was more intense in provinces with a relatively large share of workers in the service sector. In contrast, the epidemic was weaker in provinces with a relatively large share of workers in the agriculture sector. The risk of contagion in the service sector, where many jobs are indoors, is likely much larger than in the agricultural sector, where work is mainly outdoors.

Temporal windows

As most control measures began on March 14th, the data used in our empirical analysis on coronavirus cases corresponded to a temporal window defined between the onset of the epidemic in each province and the 4th of April (i.e., about three weeks before and three weeks after mid-March). The sample epidemic time ranges from K_it = 3 to K_it = 40 in this window, labelled hereafter as W0340. The first two days of the epidemic of each province are not used because we need two temporal lags to estimate the SIR-based models.

As mentioned above, zero rates of growth of coronavirus cases often appear at the beginning of outbreaks. We estimated our epidemic models dropping these observations, for two reasons. First, we found convergence problems when estimating the frontier specifications of our epidemic curves, even when we added a dummy variable à la Battese (1997) to identify the observations with zero rates of growth. The huge volatility of the dependent variable caused by the presence of zero rates of growth was not sufficiently captured by this dummy variable to achieve convergence of the maximization procedures. Second, when using non-frontier econometric techniques we found that only the initial temporal patterns were biased once we dropped observations with zero rates of growth of coronavirus cases (we use a third-order function of lnK_it).

In order to partially address this issue of dropping observations, we re-estimate our models using two additional alternative temporal windows. The epidemic time ranges from K_it = 7 to K_it = 44 in the second window (W0744 hereafter) and ranges from K_it = 10 to K_it = 47 in the third window (W1047 hereafter). As we move from the first through to the third window, there is a fall in the number of zero rates of growth dropped from the sample. Whereas in the first (original) window we dropped 134 observations with zero rates of growth of coronavirus cases, this figure falls by half in the second window (67 observations were dropped in W0744), and falls by half again in the third and final window (only 30 observations were dropped in W1047).
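The construction of the temporal windows can be illustrated with a small filter. The records below are made up for a single province and the function name is hypothetical; the sketch keeps observations whose epidemic time falls inside a window and counts the zero-growth rows that are dropped.

```python
def select_window(records, k_min, k_max, drop_zero_growth=True):
    """Keep observations with k_min <= K <= k_max.

    Returns the retained records and the number of zero-growth rows dropped.
    """
    in_window = [r for r in records if k_min <= r["K"] <= k_max]
    if not drop_zero_growth:
        return in_window, 0
    kept = [r for r in in_window if r["growth"] != 0.0]
    return kept, len(in_window) - len(kept)

# Made-up panel for one province: zero growth rates are common early on
records = [
    {"K": 1, "growth": 0.0}, {"K": 2, "growth": 0.69}, {"K": 3, "growth": 0.0},
    {"K": 4, "growth": 0.41}, {"K": 7, "growth": 0.0}, {"K": 8, "growth": 0.22},
    {"K": 10, "growth": 0.15}, {"K": 12, "growth": 0.09},
]

windows = {"W0340": (3, 40), "W0744": (7, 44), "W1047": (10, 47)}
for name, (k_min, k_max) in windows.items():
    kept, dropped = select_window(records, k_min, k_max)
    print(name, len(kept), dropped)
```

Starting the window later removes the early days in which zero growth rates are concentrated, at the cost of discarding pre-lockdown observations.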

While the panel datasets for each window are highly unbalanced due to the widely-differing epidemic onset dates across provinces, the second and third windows use more complete panel datasets. They do, however, reduce the number of pre-lockdown observations, which is problematic in that these are needed not only to measure the effectiveness of the Spanish lockdown to battle the COVID-19 pandemic but also to estimate spatial propagation effects across the Spanish provinces. As such, there are advantages and disadvantages to using windows that begin at later dates. To assess these trade-offs, we present the parameter estimates of the exponential epidemic-time specification for the three different temporal windows (W0340, W0744, W1047) in Table 3. Notice that the third window is estimated with a second-order function of lnK_it because the epidemic curve in this window is properly captured with the two first epidemic-time variables. As the volatility of the rates of growth of reported cases in the third window is relatively small, we have also estimated this specification with zero rates of growth of reported cases. Our results in Table 3 show that our parameter estimates are robust to this issue when both models, with and without zero rates of growth, converge. As the volatility of the rates of growth of reported cases is much larger in the earlier stages of the epidemic, the goodness-of-fit increases notably in the second and third windows. We find similar provincial reporting rates, with correlation coefficients close to 0.90 in all cases. The temporal patterns of these reporting rates are also similar, although the reporting rates are larger in the later windows. On the other hand, we do not find significant spatial propagation effects across provinces when we use the second and third windows because they include far fewer pre-lockdown observations, a result that is to be expected. As the national lockdown of the population basically halted the mobility of people across provinces, this effect can only be measured if there is a relatively large dispersion of epidemic developments across provinces before the implementation of the Spanish lockdown. Using the final window (W1047), we do not find a significant effect of the lockdown on the rates of growth of coronavirus cases. Again, this is to be expected because W1047 includes fewer of the pre-lockdown observations that are needed to identify a differential temporal pattern before and after the policy measure.

Table 3.

MLE: Epidemic-time (lnK_it) specification with different temporal windows

	W0340		W0744		W1047		W1047
	Coef.	s.e.	Coef.	s.e.	Coef.	s.e.	Coef.	s.e.
Overall epidemic curve
Intercept	5.1474	3.2768	31.3080	20.0315	−24.5151***	5.9564	−29.0482***	6.1906
lnK	−8.8926**	3.8386	−39.1261*	21.3358	16.4949***	3.9424	19.7441***	4.0948
lnK²	4.0997***	1.5123	15.4096**	7.5498	−3.0285***	0.6332	−3.5628***	0.6556
lnK³	−0.6332***	0.1964	−2.0246**	0.8826
WlnK	0.1278*	0.0724	0.1070	0.1116	−0.0299	0.1813	0.2430	0.1629
D	−0.4977***	0.1306	−0.4942**	0.2522	−0.3842	0.4481	−0.6655*	0.3471
WlnK·D	−0.2879***	0.0592	−0.3453***	0.1054	−0.2528	0.1961	−0.4942***	0.1612
Noise term
Intercept	1.1247***	0.1111	2.9305***	0.1539	4.1414***	0.2183	4.3246***	0.2278
lnK	−1.1714***	0.0390	−1.7956***	0.0474	−2.1849***	0.0661	−2.2389***	0.0698
Scaling function
K	−0.0437***	0.0145	−0.0655***	0.0079	−0.0711***	0.0062	−0.0634***	0.0058
WlnK	−0.0796	0.0501	0.0119	0.0433	0.0348	0.0484	0.0471	0.0462
K·D	−0.0026*	0.0014	−0.0021*	0.0012	−0.0020	0.0015	−0.0035**	0.0015
WlnK·D	−0.0350*	0.0194	−0.0298	0.0211	−0.0296	0.0251	−0.0441*	0.0242
Undocumented cases term
Intercept	1.1527***	0.4280	1.7139***	0.4982	1.8568***	0.4922	1.7275***	0.5083
Day of the week effects
Tuesday	0.1992	0.1478	0.3095	0.2062	0.3604	0.2611	0.3110	0.2754
Wednesday	0.1599	0.1281	0.2727	0.1731	0.3215*	0.1884	0.2928	0.1823
Thursday	0.2243*	0.1220	0.3945*	0.1530	0.4357***	0.1580	0.4263***	0.1688
Friday	0.1421	0.1288	0.2897*	0.1639	0.3035*	0.1768	0.3110	0.2045
Saturday	−0.0124	0.1463	0.1232	0.1813	0.1877	0.1841	0.1570	0.1978
Sunday	0.2571*	0.1432	0.3280	0.2162	0.3510	0.2526	0.3215	0.2588
Mean log LF	1.6379		1.9594		2.2119		2.1735
Obs	1290		1357		1394		1424
Pseudo R-sq	0.3507		0.4157		0.4575		0.3457
Epidemic time
Minimum	3		7		10		10
Maximum	40		44		47		47
Mean RR	0.422		0.438		0.480		0.470
Zero rates of growth	No		No		No		Yes

***Significant at 1% level

**Significant at 5% level

*Significant at 10% level
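As a reading aid for Table 3, the significance markers can be reproduced from the printed coefficients and standard errors. The sketch below assumes that the stars come from two-sided tests based on asymptotic standard normal critical values (1.645, 1.960 and 2.576); the paper does not state this explicitly, and the helper name `significance_stars` is hypothetical. Because the table reports rounded values, borderline entries may not be reproduced exactly.

```python
def significance_stars(coef, se):
    """Return the star code implied by |coef / se| under normal critical values."""
    z = abs(coef / se)
    if z >= 2.576:   # 1% level, two-sided
        return "***"
    if z >= 1.960:   # 5% level, two-sided
        return "**"
    if z >= 1.645:   # 10% level, two-sided
        return "*"
    return ""


# Entries copied from the "Overall epidemic curve" block of Table 3.
examples = {
    "D (W0340)": (-0.4977, 0.1306),
    "lnK (W0340)": (-8.8926, 3.8386),
    "WlnK (W0340)": (0.1278, 0.0724),
    "WlnK (W0744)": (0.1070, 0.1116),
}
for name, (coef, se) in examples.items():
    print(f"{name}: z = {coef / se:.2f} {significance_stars(coef, se)}")
```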

Variance-covariance matrix specification

A final robustness analysis has to do with the specification of the variance-covariance matrix of our noise term. We assume in this subsection that the noise term is autocorrelated over time, in the same fashion as Wang and Ho (2010). However, estimation of a frontier epidemic model using (14) is problematic if the panel dataset is not continuous and there are missing observations. When varying the temporal windows above, we did not find severe convergence issues when we estimated the model using all observations of the last (third) window. For this reason, this robustness analysis is performed using the epidemic times from K_it = 10 to K_it = 47.

Generally speaking, the frontier coefficients were robust to different specifications of the variance-covariance matrix of the noise term. Moreover, the diagonal variance-covariance matrix outperforms the alternative specification in terms of goodness-of-fit. On the other hand, regardless of whether we use a diagonal or an autocorrelated variance-covariance matrix, we find that the standard deviation of the noise term decreases over time. This feature is thus robust to the specification of the noise term as autoregressive. Finally, we find that most of the coefficients of the scaling function are negative using both specifications, indicating again that the proportion of undocumented (reported) cases decreases (increases) over time.

Discussion of empirical results

In all estimated models, we find very different reporting rates across the Spanish provinces. The large cross-sectional heterogeneity in reporting rates found in our empirical application is one of the contributions of the paper, as previous epidemiological literature has often assumed common rates (an exception to this is Millimet and Parmeter 2022, who also allow for differential reporting rates). For instance, the strength of the government mitigation policy is modelled in Chudik et al. (2020) in terms of the proportion of the population that is exposed to COVID-19. To estimate this proportion, they need to make an assumption regarding the reporting rate. In particular, they use the data from the Diamond Princess cruise ship reported by Moriarty et al. (2020) to calibrate this rate and assume that the average reporting rate is equal to 50% in all Chinese provinces. They find a very large exposure rate in Hubei province (the epicentre of the epidemic), where reducing this exposure required time due to the novelty of the virus. The estimated exposure rates in other provinces ranged between 9 and 87%, indicating that the Chinese control measures had very different effects in each province. This somewhat unexpected result might be caused by the common value used by these authors to calibrate the reporting rate. On average, most of our reporting rates range from 10 to 79%, a variation similar to that found for the exposure rates in Chudik et al. (2020). Therefore, it may be the case that their estimated variety of exposure rates is caused by the fact that their econometric model ignores systematic variations in reporting rates across provinces.

Most of our estimated models provide evidence supporting the belief that human mobility did spread the virus across the country before the implementation of the Spanish lockdown. Therefore, restricting people’s mobility (between or within provinces) seems to be a reasonable measure to attenuate the propagation of the coronavirus. In this sense, our results show that the lockdown was effective both in preventing the propagation of the coronavirus between provinces as well as in attenuating the propagation of the virus within each province. Hence, we find that the Spanish lockdown, together with other control measures, was an effective measure to battle COVID-19 in the absence of pharmaceutical measures (e.g., vaccines).

The average contraction in the rates of growth of coronavirus cases attributed to the lockdown is around 6.8 percentage points (from 18.2% with no lockdown to 11.4% with the lockdown). The largest reductions were found in provinces that are either close to the epicentres of the coronavirus or adjacent to provinces with more advanced epidemics. The reductions in the rates of growth of coronavirus cases attributed to the lockdown in these provinces are much larger than the average value. For instance, we find notable effects in Ávila, Segovia and Cuenca, which neighbour Madrid, the Spanish province hardest-hit by coronavirus. Large effects are also found in Tarragona and Lérida, which neighbour Barcelona, the second hardest-hit Spanish province. We also find large effects of the lockdown in Ciudad Real and Albacete, two adjacent provinces that are two local hotspots of the coronavirus in the centre of Spain. In southern Spain, we find large effects in Córdoba, which neighbours Málaga, the main epicentre of the coronavirus in this area. We also find important effects for sparsely-populated provinces such as León, Soria, Palencia, Burgos and Teruel. It is worth mentioning that the epidemic in many of these provinces began almost one week later than it did in neighbouring provinces. Therefore, while local and national lockdowns of the population are effective measures to battle COVID-19, they should be implemented at the very early stages of the epidemics.
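To give a sense of the magnitudes involved, the two average growth rates can be translated into implied doubling times. This is a back-of-the-envelope sketch under two assumptions that are not made in the paper: that the reported rates are daily log growth rates of cumulative cases (in line with Rate_it = Δ ln Y_it in Appendix A), and that growth is held constant at each rate. The helper name `doubling_time` is hypothetical.

```python
import math


def doubling_time(daily_log_growth):
    """Days needed for cumulative cases to double at a constant daily log growth rate."""
    return math.log(2.0) / daily_log_growth


no_lockdown = 0.182  # average rate of growth with no lockdown (from the text)
lockdown = 0.114     # average rate of growth with the lockdown (from the text)

contraction_pp = (no_lockdown - lockdown) * 100            # percentage points
relative_cut = (no_lockdown - lockdown) / no_lockdown      # share of the no-lockdown rate

print(f"contraction: {contraction_pp:.1f} pp ({relative_cut:.1%} of the no-lockdown rate)")
print(f"doubling time without lockdown: {doubling_time(no_lockdown):.1f} days")
print(f"doubling time with lockdown:    {doubling_time(lockdown):.1f} days")
```

Under these assumptions, the lockdown lengthens the implied doubling time from roughly four days to roughly six days.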

It is worth mentioning here that, although the lockdown was very strict, this control measure did not completely eliminate human mobility. Apart from movement within and across provinces due to the existence of essential work, there was also an exodus from the epicentres of the Spanish coronavirus crisis of people wishing to spend the lockdown in provinces with few or no reported cases of COVID-19.

We also extended our pure frontier epidemic models by including a set of socio-economic factors that might influence the evolution of the epidemic in each province. This information can be very useful for policy makers and health authorities planning the relaxation of a lockdown. We find that the most-populated provinces had more intensive coronavirus epidemics. More (less) intensive coronavirus epidemics were also found in provinces with a relatively large share of workers in the service (agricultural) sector. These results, together with the strong propagation effects estimated for provinces close to the main epicentre of the coronavirus in Spain, point to the suitability of carrying out a gradual, focused relaxation of the control measures. Thus, the relaxation of the lockdown should be slow in the most-populated provinces, in provinces with a higher share of the workforce in the service sector, and in the main epicentres of the coronavirus in Spain. Control measures could be lifted earlier in provinces mainly engaged in primary-sector production.²³

Conclusions and future research

This paper attempts to bridge the epidemiological modelling and production economics literatures by proposing stochastic frontier analysis as a useful tool with which the epidemic curves of COVID-19 can be estimated. We have proposed two different types of stochastic epidemic frontier specifications, one based on the econometric SIR specification of Chudik et al. (2020) and the other based on previous work by Orea and Álvarez (2022) which approximates the epidemic curves with functions of the epidemic times, i.e., the time since the onset of the pandemic. The most appealing feature of these models is that they can both be estimated using standard stochastic frontier techniques. One of the specifications of the model can be interpreted as a heteroskedastic version of the model introduced by Wang and Ho (2010). As such, the model we propose should prove useful for practitioners to control for individual effects in a production economics context under time-varying heteroskedasticity.

The models presented permit undocumented cases to be estimated, rather than assumed, and also allow spatial propagation of the virus across geographical areas to be modelled. A simulation exercise indicated that the epidemic-time model performed better, and in an empirical application to the case of the original outbreak of the pandemic in Spain we provide estimates from several different specifications of this model. The results from our models provided insights into the effectiveness of the national and regional lockdown measures and the influence of socio-economic factors in the propagation of the virus.

Our work can be extended in several directions. We have found convergence problems when the model included observations with zero rates of growth. These observations tended to generate huge volatility in the dependent variable. Our application of the SFA approach to examine a non-traditional issue seems to uncover a weakness of this approach when the target variable is highly volatile. Practitioners aiming to estimate firms’ efficiency should expect the appearance of convergence problems if firms’ production is highly volatile. As this situation is often observed (e.g., in agricultural economics), an interesting topic for future research would be to examine how to deal with this issue properly from a methodological perspective.

In the empirical application in this paper we availed of data at provincial level that allowed us to analyse the effectiveness of national and regional institutional responses at this level of disaggregation. However, several regions in Spain, including Andalusia, Asturias, the Basque Country, Cantabria, Catalonia, Madrid and Murcia have also provided data on coronavirus cases at municipal level. By adapting our empirical strategy to this more disaggregated data we will be able to evaluate the local control measures established by the regional governments during the second and successive waves of contagion of COVID-19.

Another extension would be to explore the possibility of different collectives within the population having different proportions of asymptomatic or undocumented cases. For example, data at provincial level by gender would allow us to examine whether the proportion of undocumented cases among women is larger or smaller than that among men. If this were the case, public health authorities should be particularly aware of gender-based channels of transmission of the virus in sectors of the economy where one gender or the other makes up a substantial majority of the workforce. These types of differences between collectives can be modelled with a system of epidemic spatial stochastic frontier equations, one for each collective. The copula-based maximum likelihood (ML) approach introduced by Lai and Huang (2013) is well-suited for such an analysis.

Finally, the relationship between reported and undocumented cases could be explored in greater depth. Li et al. (2020) have indicated that undocumented (asymptomatic) cases facilitate the dissemination of COVID-19. It is not clear how to explore this cross-group propagation effect using a frontier analysis approach because it tends to “reverse” the sign of the one-sided error term capturing the proportion of undocumented cases. A candidate is the latent class frontier model approach of Kumbhakar et al. (2007), as this model allows the sample to be split into two groups that differ in how the one-sided error term enters the model.

Appendix A

SIR-based specification of our epidemiological frontier model

We derive in this appendix an epidemiological model using the Susceptible-Infected-Recovered (SIR) specification of Chudik et al. (2020) and discuss how it can be expressed in a frontier setting. These authors derive the following second-order non-linear difference equation specification of the SIR model (see their Eq. 11):

$$\tilde{Y}_{it} = \frac{\tilde{Y}_{it-1}^{2}}{\tilde{Y}_{it-2}} + \theta \left[ \tilde{Y}_{it-1} \tilde{Y}_{it-2} (1 - \gamma) - \tilde{Y}_{it-1}^{2} \right] \tag{A1}$$

where ${\tilde{Y}}_{i t}$ denotes the true number of infected in province i at time t, θ is the effective transmission rate, and γ is the rate of recovery. If we divide both sides of (A1) by ${\tilde{Y}}_{i t - 1}$ and take logs, we get:

$$\Delta \ln \tilde{Y}_{it} = \ln \left[ \frac{\tilde{Y}_{it-1}}{\tilde{Y}_{it-2}} + \theta \left[ \tilde{Y}_{it-2} (1 - \gamma) - \tilde{Y}_{it-1} \right] \right] \tag{A2}$$

As can be seen, the true rate of growth of coronavirus cases depends on first- and second-order lagged values and their interaction. Ignoring other random errors, Chudik et al. (2020) assume that the ratio of confirmed to true cases at time t can be written as:

$$\frac{Y_{it}}{\tilde{Y}_{it}} = \pi_{it} = e^{-u_{it}}, \quad u_{it} \geq 0 \tag{A3}$$

so that

$$Y_{it} e^{u_{it}} = \tilde{Y}_{it}, \quad u_{it} \geq 0 \tag{A4}$$

where the one-sided term u_it in (A4) simply measures the gap between the true and confirmed number of cases, such that:

$$\ln \tilde{Y}_{it} = \ln Y_{it} + u_{it} \tag{A5}$$

If we use (A5) to replace the true number of cases on the left-hand side of (A2) with their “observed” counterparts, we get:

$$\begin{aligned} \mathrm{Rate}_{it} &= \ln f(q_{it}, \beta) - \Delta u_{it} \\ &= \ln \left[ \frac{\tilde{Y}_{it-1}}{\tilde{Y}_{it-2}} + \theta \left[ \tilde{Y}_{it-2} (1 - \gamma) - \tilde{Y}_{it-1} \right] \right] - \Delta u_{it} \end{aligned} \tag{A6}$$

Note that the term in brackets depends on the true, but unobserved, number of cases in periods t−1 and t−2. If we follow Chudik et al. (2020) and replace them with their “observed” counterparts, Eq. (A6) becomes:

$$\mathrm{Rate}_{it} = \ln \left[ \frac{Y_{it-1}}{Y_{it-2}} \cdot e^{\Delta u_{it-1}} + \theta \left[ Y_{it-2} e^{u_{it-2}} (1 - \gamma) - Y_{it-1} e^{u_{it-1}} \right] \right] - \Delta u_{it} \tag{A7}$$

Several comments are in order regarding this SIR-based frontier model. First, if we assume that the one-sided random term u_it is i.i.d. and follows, say, a half-normal distribution, the distribution of (A7) is not known and the model cannot be estimated using the standard stochastic frontier (SF) estimators. Second, we need to make some simplifying assumptions if we are to estimate (A7) using standard SF techniques. For instance, we might assume that the u-terms inside the brackets balance each other out and that lnf(q_it, β) is a linear specification of lnY_it−1, lnY_it−2 and lnY_it−1·lnY_it−2, as is customary in the production economics literature (squares of both lnY_it−1 and lnY_it−2 can also be included if a Translog specification is preferred). However, estimating such a model likely provides biased results, not only because we are ignoring the u-terms but also because the lagged values of reported cases might be correlated with the time-invariant part of the error term capturing the proportion of undocumented cases (u_i). This could occur if undocumented (asymptomatic) cases facilitate the dissemination of COVID-19 and thereby increase the reporting rates.
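The algebra in this appendix can be checked numerically. The sketch below iterates the SIR difference equation for the true cases, applies a reporting gap u_it to obtain reported cases, and verifies that the rate of growth of reported cases coincides with the expressions for Rate_it written in terms of true cases and in terms of reported cases. All numerical values (θ, γ, the initial conditions and the linear path of u_it) are arbitrary illustrative assumptions, not estimates from the paper, and the function name `simulate_true_cases` is hypothetical.

```python
import math


def simulate_true_cases(y0, y1, theta, gamma, steps):
    """Iterate the second-order SIR difference equation for the true case series."""
    y = [y0, y1]
    for _ in range(steps):
        prev, prev2 = y[-1], y[-2]
        y.append(prev ** 2 / prev2 + theta * (prev * prev2 * (1 - gamma) - prev ** 2))
    return y


theta, gamma = 0.5, 0.1  # illustrative transmission and recovery rates
true_cases = simulate_true_cases(0.001, 0.0012, theta, gamma, steps=15)

# Reporting gap u_t >= 0 that shrinks over time, so the reporting rate exp(-u_t) rises.
u = [0.8 - 0.02 * t for t in range(len(true_cases))]
reported = [y * math.exp(-gap) for y, gap in zip(true_cases, u)]

max_gap_true_form = 0.0      # discrepancy against the form using true cases
max_gap_reported_form = 0.0  # discrepancy against the form using reported cases
for t in range(2, len(true_cases)):
    rate = math.log(reported[t]) - math.log(reported[t - 1])
    du_t = u[t] - u[t - 1]
    du_lag = u[t - 1] - u[t - 2]
    ln_f = math.log(
        true_cases[t - 1] / true_cases[t - 2]
        + theta * (true_cases[t - 2] * (1 - gamma) - true_cases[t - 1])
    )
    rate_true_form = ln_f - du_t
    rate_reported_form = math.log(
        reported[t - 1] / reported[t - 2] * math.exp(du_lag)
        + theta * (reported[t - 2] * math.exp(u[t - 2]) * (1 - gamma)
                   - reported[t - 1] * math.exp(u[t - 1]))
    ) - du_t
    max_gap_true_form = max(max_gap_true_form, abs(rate - rate_true_form))
    max_gap_reported_form = max(max_gap_reported_form, abs(rate - rate_reported_form))

# With improving reporting, reported growth exceeds true growth by -(u_t - u_{t-1}).
last_bias = rate - ln_f
print(max_gap_true_form, max_gap_reported_form, last_bias)
```

In this toy setting the reported rate of growth overstates the true rate by 0.02 in every period, which is the per-period fall in the reporting gap; this is the sense in which ignoring the u-terms biases the estimated epidemic curve.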

Appendix B

Figure 7 and Table 4

Compliance with ethical standards

Conflict of interest

The authors declare no competing interests.

Footnotes

¹

For example, Chudik et al. (2020) use the data from the Diamond Princess cruise ship reported by Moriarty et al. (2020) to calibrate the proportion of the population exposed to COVID-19 and assume an average reporting rate in all Chinese provinces of 50%. They find large variations in exposure rates across Chinese provinces, ranging from 9 to 87%. The fact that their econometric model ignores systematic variations in reporting rates across provinces may well be causing this wide variety of exposure rates.

²

The model that describes the expected number of infections at time (day) t in Giuliani et al. (2020) is also allowed to depend on the number of infections reported at time t−1.

³

Notice that this specification resembles the popular reproduction-based models used in the epidemiological literature in the sense that our beta parameter plays the same role as the so-called “reproductive number of the infection” (R), a fundamental epidemiological quantity that represents the average number of infections per infected case over the course of their infection. The key aim of the coronavirus control measures is to reduce β_it. If β_it is equal to one, there are no new infections, and the pandemic has therefore been controlled. The same would happen if the reproductive number of the infection is equal to unity in an epidemiological model.

⁴

Orea and Álvarez (2022) show that the theoretical SIR and SEIR epidemiological models yield time-varying growth rates of cumulative cases, regardless of whether daily or longer temporal lags are used. Moreover, the SIR model and its variants produce S-shaped epidemic curves of cumulative cases that can be accurately predicted using a third-order function of lnK_it.

⁵

Of course, if there are no undocumented cases (U_it = U_it−1 = 0), then the actual curves themselves coincide.

⁶

This simple empirical strategy might provide biased results as it ignores the potential correlation with the undocumented cases, which constitute an omitted variable in this analysis.

⁷

This estimator treats α_i as fixed parameters. If they are treated instead as time-invariant random variables, we get the so-called True Random Effects (TRE) panel stochastic frontier model.

⁸

This problem appears when the number of parameters to be estimated increases with the number of cross-sectional observations in the data. In this situation, consistency of the parameter estimates is not guaranteed even if N → ∞.

⁹

Their inclusion makes the rates of growth of coronavirus cases extremely volatile, especially at the beginning of the epidemic outbreaks. This extremely high volatility is difficult to capture using the standard distributions for both the noise term (v_it) and the one-sided random term capturing the proportion of undocumented cases (u_it).

¹⁰

The number of zero rates of growth decreases notably if we use more recent temporal windows (i.e. not centred around the start of the lockdown) to carry out our empirical analysis. For this reason, we will try to deal with this issue in our empirical application by using a temporal window that begins one week later, at the expense of a fall in the number of pre-lockdown observations.

¹¹

See https://github.com/datadista/datasets/tree/master/COVID%2019.

¹²

See https://app.flourish.studio/visualisation/1451263/.

¹³

See https://covid19.isciii.es/.

¹⁴

It is worth mentioning that the third-order function of lnK_it captures the temporal pattern of the virus epidemic, conditional on D_t. In other words, the epidemic curve associated with this function can be interpreted as our “as if” scenario with no control measures.

¹⁵

Other spatial specifications based on human mobility across all the Spanish provinces were used in Orea and Álvarez (2022). They found very similar results due to 77% of the variation of the weights of the mobility-based W matrix being explained by the binary values of the weights of the contiguity W matrix.

¹⁶

Our spatial SLX specification does not distinguish between reported and undocumented propagation across provinces. A SAR specification with W_ilnY_t and W_iu_t allows us to deal with this issue. However, estimating this model is far from simple because the distribution of W_iu_t is generally not known if u_it is independently distributed across provinces, as assumed above. As estimating this model presents important methodological challenges, we leave an examination of this issue for future research.

¹⁷

As W_ilnK_t is measured in deviations with respect to the post-lockdown sample mean, the coefficient of D_t can be interpreted as an average effect.

¹⁸

If we use individual reporting rates to compute individual multiplication factors, we get values on the order of two or three digits, with a mean value of 8, which are consistent with the large attack rates (i.e. proportions of infected people) found for Spain by Flaxman et al. (2020) in their study using 11 European countries.

¹⁹

The fraction of all infections that were documented after the travel restrictions was estimated to be 65%, a slightly larger reporting rate than that found in our paper after the implementation of the Spanish lockdown.

²⁰

A separate but related matter has to do with the onset date of the pandemic used in our paper. Our epidemic time variable is defined as the number of days since the observed onset date of the pandemic, which relies on reported cases. In order to see whether in practice the gap between observed and true onset dates is an important issue, Orea and Álvarez (2022) simulated several scenarios with different observed onset dates due to underreporting. They found that the goodness-of-fit of the model only deteriorated when underreporting is extremely large and the gap between observed and true onset dates varies notably across provinces. In this case, however, a model with fixed effects retrieves the predictive capabilities of the model. Estimating such a model in a frontier setting will be a topic of future research.

²¹

We have carried out a Vuong test in order to examine whether the (non-nested) epidemic-time and SIR models are equivalent in terms of goodness-of-fit. Although the sign of the Vuong statistics tended to suggest that the SIR models provide a better goodness-of-fit, the results of these tests were not totally conclusive. Thus, we cannot reject that the non-frontier epidemic-time and SIR models are equivalent (the absolute value of the Vuong statistic was 1.23). For the frontier models, the Vuong statistic (2.46 in absolute value) suggests a similar performance at the 1% significance level, but a better fit of the frontier SIR model at the 5% significance level.

²²

The socio-economic environment is measured through the provincial GDP per capita and the shares of the services and agricultural sectors in total provincial employment. The demographic structure is measured using population size, population density, and three population age variables. As there is an active debate regarding the influence of the natural environment, we also included two weather variables (temperature and rainfall).

²³

As most tasks in the construction sector are carried out outdoors, this sector might also be restarted before other sectors.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Battese G. A note on the estimation of Cobb‐Douglas production functions when some explanatory variables have zero values. J Agric Econ. 1997;48(1‐3):250–252. doi: 10.1111/j.1477-9552.1997.tb01149.x.
Battese G, Coelli T. Frontier production functions, technical efficiency and panel data: with application to paddy farmers in India. J Product Anal. 1992;3:153–169. doi: 10.1007/BF00158774.
Chinazzi M, Davis JT, Ajelli M, Gioannini C, Litvinova M, Merler S, Pastore y Piontti A, Mu K, Rossi L, Sun K, Viboud C, Xiong X, Yu H, Halloran ME, Longini IM, Vespignani A. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science. 2020;368(6489):395–400. doi: 10.1126/science.aba9757.
Cho SW. Quantifying the impact of nonpharmaceutical interventions during the COVID-19 outbreak: the case of Sweden. Econom J. 2020;23(3):323–344. doi: 10.1093/ectj/utaa025.
Chudik A, Pesaran MH, Rebucci A (2020) Voluntary and mandatory social distancing: evidence on Covid-19 exposure rates from Chinese provinces and selected countries. NBER Working Paper 27039. http://www.nber.org/papers/w27039.
Dickson MM, Espa G, Giuliani D, Santi F, Savadori L. Assessing the effect of containment measures on the spatio-temporal dynamic of COVID-19 in Italy. Nonlinear Dynamics. 2020;101(3):1833–1846. doi: 10.1007/s11071-020-05853-7.
Eliasson K, Lindgren U, Westerlund O. Geographical labour mobility: migration or commuting. Reg Stud. 2003;37(8):827–837. doi: 10.1080/0034340032000128749.
Fang H, Wang L, Yang Y. Human mobility restrictions and the spread of the Novel Coronavirus (2019-nCoV) in China. J Public Econ. 2020;191:104272. doi: 10.1016/j.jpubeco.2020.104272.
Flaxman S, Mishra S, Gandy A, Unwin HJT, Mellan TA, Coupland H, Whittaker C, Zhu H, Berah T, Eaton JW, Monod M. Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature. 2020;584(7820):257–261. doi: 10.1038/s41586-020-2405-7.
Giuliani D, Dickson MM, Espa G, Santi F (2020) Modelling and predicting the spatio-temporal spread of Coronavirus disease 2019 (COVID-19) in Italy. Available at SSRN: https://ssrn.com/abstract=3559569. doi: 10.2139/ssrn.3559569.
Greene W. Reconsidering heterogeneity in panel data estimators of the stochastic frontier model. J Econom. 2005;126(2):269–303. doi: 10.1016/j.jeconom.2004.05.003.
Gross B, Zheng Z, Liu S, Chen X, Sela A, Li J, Li D, Havlin S (2020) Spatio-temporal propagation of COVID-19 pandemics. medRxiv preprint. doi: 10.1101/2020.03.23.20041517.
Gutiérrez MJ, Inguanzo B, Orbe S. Distributional impact of COVID-19: regional inequalities in cases and deaths in Spain during the first wave. Appl Econ. 2021;53(31):3636–3657. doi: 10.1080/00036846.2021.1884838.
Korolev I. Identification and estimation of the SEIRD epidemic model for COVID-19. J Econom. 2021;220(1):63–85. doi: 10.1016/j.jeconom.2020.07.038.
Kumbhakar SC. Production frontiers, panel data, and time-varying technical inefficiency. J Econom. 1990;46:201–211. doi: 10.1016/0304-4076(90)90055-X.
Kumbhakar SC, Orea L, Rodríguez-Álvarez A, Tsionas EG. Do we estimate an input or an output distance function? An application of the mixture approach to European railways. J Prod Anal. 2007;27(2):87–100. doi: 10.1007/s11123-006-0031-5.
Lai H-P, Huang CJ. Maximum likelihood estimation of seemingly unrelated stochastic frontier regressions. J Prod Anal. 2013;40(1):1–14. doi: 10.1007/s11123-012-0289-8.
Leung K, Wu JT, Liu D, Leung GM. First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment. Lancet. 2020;395:1382–1393. doi: 10.1016/S0140-6736(20)30746-7.
Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, Shaman J. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2). Science. 2020;368(6490):489–493. doi: 10.1126/science.abb3221.
Millimet DL, Parmeter CF. Accounting for skewed or one-sided measurement error in the dependent variable. Political Analysis. 2021;30(1):66–88. doi: 10.1017/pan.2020.45.
Millimet DL, Parmeter CF. COVID-19 severity: a new approach to quantifying global cases and deaths. J R Stat Soc Series A. 2022;185(3):1178–1215. doi: 10.1111/rssa.12826.
Moriarty L, Plucinski M, Marston B, et al. (2020) Public health responses to COVID-19 outbreaks on cruise ships - worldwide, February–March 2020. Morbidity and Mortality Weekly Report (MMWR) 69(Mar):347–352. doi: 10.15585/mmwr.mm6912e3.
Orea L, Álvarez IC. A new stochastic frontier model with cross-sectional effects in both noise and inefficiency terms. J Econom. 2019;213(2):556–577. doi: 10.1016/j.jeconom.2019.07.004.
Orea L, Álvarez I (2022) How effective has the Spanish lockdown been to battle COVID-19? A spatial analysis of the coronavirus propagation across provinces. Health Econ 31(1):154–173. doi: 10.1002/hec.4437.
Orea L, Álvarez I, Wall A (2021) Estimating the propagation of the COVID-19 virus with a stochastic frontier approximation of epidemiological models: a panel data econometric model with an application to Spain. Efficiency Series Paper, 01/2021, Oviedo Efficiency Group, University of Oviedo. http://www.unioviedo.es/oeg/ESP/esp_2021_01.pdf.
Saez M, Tobias A, Varga D, Barceló MA. Effectiveness of the measures to flatten the epidemic curve of COVID-19. The case of Spain. Sci Total Environ. 2020;727:138761. doi: 10.1016/j.scitotenv.2020.138761.
Wang HJ, Ho CW. Estimating fixed-effect panel stochastic frontier models by model transformation. J Econom. 2010;157(2):286–296. doi: 10.1016/j.jeconom.2009.12.006.
Wang H-J. A stochastic frontier analysis of financing constraints on investment: the case of financial liberalization in Taiwan. J Bus Econ Stat. 2003;21:406–419. doi: 10.1198/073500103288619016.

