Abstract
For many countries attempting to control the fast-rising number of coronavirus cases and deaths, the race is on to “flatten the curve,” since the spread of coronavirus disease 2019 (COVID-19) has taken on pandemic proportions. In the absence of significant control interventions, the curve could be steep, with the number of COVID-19 cases growing exponentially. In fact, this level of proliferation may already be happening, since the number of patients infected in Italy closely follows an exponential trend. Thus, we propose a test. When the numbers are taken from an exponential distribution, it has been demonstrated that they automatically follow Benford’s Law (BL). As a result, if the current control interventions are successful and we flatten the curve (i.e., we slow the rate below an exponential growth rate), then the number of infections or deaths will not obey BL. For this reason, BL may be useful for assessing the effects of the current control interventions and may be able to answer the question, “How flat is flat enough?” In this study, we used an epidemic growth model in the presence of interventions to describe the potential for a flattened curve, and then investigated whether the epidemic growth model followed BL for ten selected countries with a relatively high mortality rate. Among these countries, South Korea showed a particularly high degree of control intervention. Although all of the countries have aggressively fought the epidemic, our analysis shows that all countries except for Japan satisfied BL, indicating the growth rates of COVID-19 were close to an exponential trend. Based on the simulation table in this study, BL test shows that the data from Japan is incorrect.
Keywords: Benford’s law, Epidemic growth model, Outbreak, Intervention, Sequential updating method
1. Introduction
Coronavirus disease 2019 (COVID-19) was first identified in December 2019 and has since spread globally, resulting in the ongoing 2020 coronavirus pandemic. For many countries attempting to control the fast-rising number of coronavirus cases and deaths, the race is on to “flatten the curve,” with the “curve” being the projected number of people who will contract COVID-19 over a period of time. This “curve” can vary based on the growth (infection) rate and degree of intervention. Thus, “flattening the curve” requires intervention efforts, such as social distancing, to slow the growth rate of COVID-19 infections and give more time to hospitals and health systems to cope with the large number of infected patients.
In the absence of significant control interventions, the curve could be steep, with the number of COVID-19 cases growing exponentially. In fact, this level of proliferation may already be happening, since the number of patients infected in Italy closely follows an exponential trend according to Remuzzi and Remuzzi [1]. Thus, we propose a test. When the numbers are taken from an exponential distribution, it is demonstrated that they automatically follow Benford’s Law (BL) [2], [3]. As a result, we can infer that when the epidemic growth curve follows exponential distribution, the number of infections and deaths will obey BL. To state it differently, if the current control interventions are successful and we flatten the curve (i.e., we slow the rate below an exponential growth rate), then the number of infections or deaths will not obey BL. Therefore, BL may be useful for assessing the effects of the current control interventions and may be able to answer the question, “How flat is flat enough?”
BL was an empirically discovered pattern for the frequency distribution of first digits in many real-life datasets [4], [5].1 It states that in many naturally occurring collections of numbers, the leading digit is non-uniformly distributed in a predictable manner. In addition, the leading significant digit is likely to be small. For example, the number 1 occurs as the first digit about 30.1% of the time, while 9 occurs as the first digit about 4.5% of the time. In this situation, the number 1 appears more than six times more frequently than 9. Checking for the validity of BL in this dataset would be the best approach in a forensic analysis looking at potential manipulations of the number of cases [7], [8], since a distribution of first digits that deviates from the expected distribution may indicate fraud. Prior studies have shown that BL is also applicable to genome data [9], the half-lives of unstable nuclei [3], self-reported toxic emissions data [10], tax auditing [11], accounting [12], election data [13], [14], stock markets and final data [15], [16], [17], [18], [19], [20], regression coefficients [21], inflation data [7], World Wide Web [22], religions [23], [24], [25], birth data [26], river data [27], first letter words [28], elementary particle decay rates ([29], astrophysical measurements [30], and more.
In this study, our goal was to use an epidemic growth model in the presence of interventions to describe the potential for a flattened curve, to investigate whether the epidemic growth model followed BL, to test the model against empirical COVID-19 data on the number of deaths in multiple countries, and to discuss whether or not the model could be used to detect fraud in the reported number of deaths that have occurred in the presence of interventions.
This paper is organized as follow: Section 2 briefly introduce the BL; Section 3 describes the theoretical relationship between BL and epidemic growth model in the absence (and presence) of control intervention; Section 4 describes simulation for testing the theoretical development from Section 3; Section 5 applies BL and the epidemic growth model in the presence of intervention to COVID-19 data, and finally Section 6 concludes.
2. Benford’s law
BL is the observation that in many collections of numbers from real-life data or mathematical tables, the significant digits are not uniformly distributed; they are heavily skewed toward the smaller digits. More precisely, the significant digits in many data sets obey a very particular logarithmic distribution. The special case of the first significant digit is
where denotes the first significant digit. From a statistical standpoint, a Borel probability measure on is Benford if for all , where is the significant of a real number is its coefficient when it is expressed as a floating point. That is, the significant function is defined as follows: if is a non-zero real number, then , where is the unique number in with for some . Then, a random variable is Benford if its distribution on is Benford, i.e., if for all .
The useful result for this study is that if is a random variable uniformly distributed on , then the random variable is Benford. To show this, let us say the cumulative distribution function of a Benford random variable is for all . Then, by rewriting the cumulative distribution function as , we have , where . Thus, a Benford variable can be generated by , where .
To evaluate the degree of deviation between the observed and expected first digit distribution from BL, we considered the chi-squared test. For the corresponding , the statistic can be estimated as
where is the observed frequency in each bin in the observed data, and is the expected frequency based on Benford’s distribution. The chi-square statistic works as a measure of the gap between the realization observed in the data and that implied by the Benford distribution; the larger the chi-square statistic is, the stronger the deviation from the Benford distribution will be. In this case, with a 95% confidence level, is the critical value for the rejection of the null hypothesis; if the value of the is less than the critical value, then we accept the null hypothesis and conclude that the data fits the Benford distribution. Then, the null hypothesis () is that the observed distribution of the first significant digit in the case of interest is the same as expected on the basis of BL; the alternative hypothesis () is that the observed distribution of the first significant digit in the case of interest is not the same as expected on the basis of BL. Particularly in forensic analysis, if the null hypothesis can be rejected, the observed series does not satisfy BL and thus infers a possible manipulation of data.
3. A generalized epidemic growth model and benford’s law
The growth pattern of infectious disease outbreak assumes exponential growth dynamics based on the differential equation:
where is the cumulative number of cases, is time, and is a positive constant (growth rate), which varies as the exponent changes.
3.1. Growth model in the absence of control intervention:
When , we can solve with the separation of variables:
where . is considered the classical epidemic growth model; the cumulative number of cases, , grows according to the equation, where is the growth rate per unit of time, is time, and () is the number of cases at the start of the outbreak. The growth rate, , is related to , as derived from SIR (susceptible–infected–removed) models, where is the mean infectious period.
The function is Benford. The proof relies on Weyl’s equidistribution theorem, which states that if is irrational, then for large , the fractional parts of for are uniformly scattered over the unit interval ([31], [32]). More specifically, the function , can be written as , where is irrational. Thus, for the large range of the fractional parts of become uniformly distributed over the interval . As we mentioned in Section 2, if is a random variable uniformly distributed on , then the random variable is Benford. Therefore, the numbers taken from the function naturally follow BL. To visualize the relationship between and Benford. We generated 1000 observations from , where , , and . In Fig. 1, the simulated first-digit distribution of the growth model in the absence of control intervention is represented (gray) along with the corresponding probability mass for Benford’s distribution (black). Although digit 1 and digit 2 exhibit slightly higher proportion than BL, the simulated data from the growth model are quite close to following Benford distribution. We discuss the validity of BL in more detail with different simulation scenarios in Section 4.
3.2. Growth model in the presence of control intervention:
When , we have a function that does not grow as fast as exponential functions. Let us define , where is a positive integer. As we mentioned before, the larger the number we choose as , the closer we are to exponential growth, and therefore, the function naturally follows BL. Now, we have
which is a polynomial of degree . In terms of the epidemic growth model, describes the cumulative number of cases at time , is a positive growth rate, and is a deceleration of the growth parameter.
Similarly, in Section 3.1, the function can be written as , where is irrational. However, Weyl’s equidistribution theorem requires large in order to achieve the state (or aspect) of the fractional parts that are uniformly scattered over the unit interval, and thus, the function is Benford. If (i.e., ), this function describes the cumulative number increases linearly, and thus, is not Benford; however, if (i.e., ), this function describes the exponential growth, and thus, is Benford. In observational study, given the small value of , the growth may be not as great as exponential growth because no matter how large is, is still less than 1.
When is intermediate, values that lie between 0 and 1, the function describes polynomial growth patterns. For example, if (i.e., ), the function describes constant incidence over time, while the cumulative number of cases follows a quadratic polynomial; if (i.e., ), the function describes incidence grows quadratically, while the cumulative number of cases fits a cubic polynomial.
Thus, we believe that when is small we may able to “flatten the curve”. But, how small is small enough not to obey BL? To better understand the association between and Benford, we run simulations in the next section.
4. Simulation
In the previous section, we show that when the epidemic growth model can satisfy BL, epidemic growth in the presence (absence) of control intervention may dissatisfy (satisfy) Benford distribution. Nevertheless, we want to check this claim with more generality under uncertainty (e.g., randomly generated parameters, randomly generated initial values, and relatively small fixed samples). Thus, we run simulations of the growth model in order to evaluate the satisfaction of BL.
Simulation 1: The growth model in the absence of control intervention, . We generate a random variable from the growth model with , such as , where , , and . As mentioned before, after the calculation of the first-digit occurrence, we conduct the test to detect deviations from BL. Table 1 presents the results of generating 10,000 series of simulated data representing the epidemic growth in each different value of the growth rate. The length of each series is fixed at . We found that in each scenario, the number of detected Benford cases (i.e., ) are consistently greater than 95%, regardless of the growth rate (). To state it differently, under the uncertainty created when the deceleration of growth () is 1, most of the epidemic growth will follow BL. Notably, in the absence of control intervention, the growth model will likely satisfy BL, regardless of whether the growth rate was high or low; even a low growth rate may cause exponential growth in the absence of intervention.
Table 1.
Growth rate, | Deceleration of growth parameter, | Number of Benford cases |
---|---|---|
0.1 | 1 | 9,681 (96.81%) |
0.2 | 1 | 10,000 (100.00%) |
0.3 | 1 | 9,890 (98.90%) |
Simulation 2: The growth model in the presence of control intervention, . We generate a random variable from the growth model, such as , where , , , and . Table 2 presents the results of generating 10,000 series of simulated data representing the epidemic growth in each different value of the growth rate () and the deceleration of the growth parameter (). The length of each series is also fixed at . We found that when the epidemic growth rate is around 0.1, the number of cases in BL was hinged on the deceleration of growth parameter (). For example, when and , we found that only 752 cases out of 10,000 (7.52%) satisfied BL. In contrast, when and , we found that 9,281 cases out of 10,000 (92.81%) satisfied BL. Notably, only moderately strong intervention () may effectively decrease the growth when the growth rate is around 0.1, and therefore, the model will likely dissatisfy BL (e.g., when and , we found less than half of simulated cases (45.41%) satisfied BL). To state it differently, in this setting, we may able to “flatten the curve” if and .
Furthermore, when , in all of the scenarios, the number of detected Benford cases (i.e., ) is greater than 89%, regardless of the deceleration of the growth parameter .
Table 2.
Growth rate, | Deceleration of growth parameter, | Number of Benford cases |
---|---|---|
0.1 | 0.1 | 752 (7.52%) |
0.2 | 2,455 (24.55%) | |
0.3 | 4,541 (45.41%) | |
0.4 | 6,472 (64.72%) | |
0.5 | 7,873 (78.73%) | |
0.6 | 8,514 (85.14%) | |
0.7 | 9,015 (90.15%) | |
0.8 | 9,105 (91.05%) | |
0.9 | 9,281 (92.81%) | |
0.2 | 0.1 | 8,927 (89.27%) |
0.2 | 9,848 (98.48%) | |
0.3 | 9,984 (99.84%) | |
0.4 | 9,999 (99.99%) | |
0.5 | 10,000 (100.00%) | |
0.6 | 10,000 (100.00%) | |
0.7 | 10,000 (100.00%) | |
0.8 | 10,000 (100.00%) | |
0.9 | 10,000 (100.00%) | |
0.3 | 0.1 | 9,997 (99.97%) |
0.2 | 10,000 (100.00%) | |
0.3 | 10,000 (100.00%) | |
0.4 | 10,000 (100.00%) | |
0.5 | 10,000 (100.00%) | |
0.6 | 10,000 (100.00%) | |
0.7 | 10,000 (100.00%) | |
0.8 | 10,000 (100.00%) | |
0.9 | 10,000 (100.00%) |
5. COVID-19 application
5.1. Data
In this section, we apply the epidemic growth model to the number of deaths attributed to COVID-19. The World Health Organization declared the coronavirus outbreak to be a pandemic in early March 2020. Most countries affected by COVID-19 have tried to stop the spread of the virus through various means, such as social distancing and quarantines. We collected the number of deaths from the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE; https://coronavirus.jhu.edu/). The 2019 Novel Coronavirus Visual Dashboard provides daily data on the cumulative number of deaths since January 22, 2020. We selected 10 countries that showed a relatively high mortality rate. To apply the epidemic growth model to the growth patterns of the number of deaths, we followed the methodology used in other epidemiological studies. In particular, we focus on the analysis of the early ascending phase of COVID-19; we used the day when coronavirus-related deaths were first observed to the day when the number of deaths peaked in each country as the early ascending phase for this study [33], [34], [35], [36]. For example, in the case of the US, the first deaths were observed on February 29, which is 39 days after January 22. The peak in the number of deaths was observed on April 6, which is 76 days after January 22.
5.2. Estimation and confidence intervals
To jointly estimate the parameters and , we used the least-square curve fitting, as in prior studies [36], [37]. In particular, we fit the cumulative number of deaths to the equation . We then implemented a least-square fitting procedure in R using the built-in function lsqcurvefit in the pracma package. In order to increase the accuracy of the estimation, the parameters were updated with the mean square errors and prior knowledge of the parameters (i.e., the estimated parameters in the previous iteration). Our estimation method differs from existing studies in two respects. First, our proposed estimation approach sequentially updates the parameters by using the estimated parameters from the previous iteration; after setting the initial value of , the parameters (growth rate parameter) and (deceleration of growth parameter) were jointly estimated after 100 times iteration. Second, the proposed method incorporates the overall mean squared error (MSE) into the estimation procedure; once the iteration has ended, we can choose the estimated and , which gives the minimum value among 100 MSEs. In contrast to our proposed approach, which is based on many estimations (i.e., 100 iterations), to the best of our knowledge, existing studies rely on only one estimation (i.e., one iteration).
To illustrate how the proposed approach will work with COVID-19 data, we generated values from with , , , and (i.e., ). The initial values for and were given as 0. The initial value of c was given as the first observation of the generated sample. Fig. 2 shows the mean squared errors (MSEs) against the iteration number. The plot indicates that the proposed approach rapidly reaches a stationary distribution. When the number of iterations () exceeded 13, the MSE was about 3.86; when there was no sequential updating (), the MSE was 462.29. Thus, the MSE of the proposed approach was approximately 120 () times lower than the MSE of the approach that did not update the estimates (i.e., estimating the parameters based on ). Fig. 3 shows the estimated parameters when we employed our proposed approach. As the iteration proceeded, the estimated parameters became closer to the true parameters. The dashed line indicates the true parameters. For example, after iteration 10, we obtained estimated parameters of 0.59 and 0.49 for and , respectively. When , the estimated parameters for and were 2.11 and 0.18, respectively. Through this analysis, it was found that the method of the least-squares fitting of the curve was highly affected by the initial value of . Thus, we believe that the proposed approach is always desirable when is not known. For additional clarity, we have provided the source code that was used to test the proposed approach (in the R statistical environment) as an Appendix A alongside the full text of the manuscript. We hope to spark a discipline-wide discussion of the merits of advanced and flexible matching methods in a contemporary organizational setting.
Before estimating the parameters, we set the initial value for , , and c. The initial values for and were given as 0. For the US, the UK, and Japan, the initial value of c was given as 2. For China, Spain, Germany, and the Netherlands, the initial value of c was given as 3. For South Korea and Italy, the initial value of c was given as 4.2 After setting the initial value, the parameters and were jointly estimated after 100 times iteration. Once and were jointly estimated, the value of c was updated as the initial value of . Given the estimated , , and c, the mean squared error was calculated using the fitted growth model and the original data. After we obtained the mean squared error, the next iteration started, and and were updated as those from the previous iteration were. Once the iteration ended, we chose the estimated and , which could minimize the mean squared error.
We constructed a 95% confidence interval for the and estimates using the parametric bootstrapping method [36], [37], [38], [39]; based on the 1000 estimates (by fitting the least-squares method 1000 times), we calculated the variance (standard errors) of the estimated parameters, as prior studies have [36], [37], [38]. More precisely, the bootstrapping error was estimated by simulating 1000 realizations of the best-fit curve using the parametric bootstrap with a negative binomial error structure. Using the bootstrapping error, we then obtained the nominal 95% confidence intervals. In particular, we generated (in this study, ) sets of boosted cumulative numbers of cases (, where ) based on the observed data (, where ) in the following manner. For , simply let . For , was sampled from a negative binomial (NB) distribution with mean , which is the daily change in observed samples between day and . In each , the total number of bootstrapped values is . In this study, indicates the length of the data points. For example, if , , and , we generated samples from the NB distribution with mean . In this study, we used a NB instead of a Poisson distribution because of the overdispersion problem and chose the dispersion parameter range 0.001–0.9.3 Overdispersion refers to the presence of a greater variance of observed data than would be expected in a given parametric model. Notably, an overdispersion problem often occurred when fitting a Poisson distribution to the data. The Poisson distribution had only one parameter. Thus, the variance of the distribution was equal to the mean. The corresponding realization of the cumulative number of deaths due to COVID-19 was given by . The and were then estimated from each of the 1000 simulated epidemic growths. The empirical distribution of the estimated parameters was used to construct 95% confidence intervals. The estimated parameters and confidence intervals are reported in Table 3.
Table 3.
Johns Hopkins Coronavirus Resource Centera |
Growth model in the presence of control intervention |
BL test |
|||||||
---|---|---|---|---|---|---|---|---|---|
Country | The early ascending phase (Dates in 2020) |
T | Growth rate, (95% C.I) | Deceleration of growth parameter, (95% C.I) | -value | ||||
US | 02/29 04/06 | 38 | 1 | 0.364 (0.355, 0.371) |
0.921 (0.917, 0.924) |
1.388 | 0.990 | ||
China | 01/22 02/11 | 21 | 17 | 1.485 (1.453, 1.515) |
0.612 (0.609, 0.614) |
2.691 | 0.952 | ||
South Korea | 02/20 03/02 | 12 | 1 | 0.590 (0.521, 0.657) |
0.549 (0.535, 0.562) |
8.698 | 0.368 | ||
Japan | 02/13 04/04 | 52 | 1 | 0.140 (0.136, 0.144) |
0.773 (0.761, 0.783) |
25.166 | 0.001 | ||
UK | 03/05 04/04 | 30 | 1 | 0.369 (0.363, 0.374) |
0.929 (0.926, 0.931) |
5.674 | 0.684 | ||
Spain | 03/03 04/02 | 31 | 1 | 1.021 (1.005, 1.036) |
0.774 (0.772, 0.776) |
7.101 | 0.526 | ||
Italy | 02/21 03/26 | 35 | 1 | 0.722 (0.704, 0.739) |
0.802 (0.799, 0.805) |
3.499 | 0.899 | ||
Germany | 03/09 04/08 | 31 | 2 | 0.525 (0.515, 0.534) |
0.819 (0.815, 0.822) |
4.594 | 0.800 | ||
Netherlands | 03/06 04/07 | 33 | 1 | 0.672 (0.660, 0.682) |
0.761 (0.757, 0.764) |
5.288 | 0.726 | ||
Sweden | 03/11 03/27 | 17 | 1 | 0.271 (0.257, 0.286) |
0.999 (0.982, 1.018) |
7.375 | 0.497 |
Note: The early ascending phase is the period between the day when the first death was observed after January 22, 2020 and the day when the number of deaths peaked after January 22, 2020; we used the number of deaths observed in these periods to estimate and and conduct the BL test. T denotes the length of the data points. is reduced a chi-squared statistic.
Data source: https://coronavirus.jhu.edu/.
5.3. Results
It is possible to fit the data for the number of coronavirus deaths into the growth model in the presence of control interventions, as reported in Fig. 4. The predicted value of the number of deaths can be computed based on the estimated parameters ( and ) and is consistent with the observed number of deaths attributed to COVID-19 that were reported by the JHU CSSE. The consistency between the model prediction and the reported data is very close in all 10 countries.
Overall, our analysis revealed a diversity of profiles across the countries. Estimates of the deceleration of growth parameter ranged from 0.549 in South Korea, reflecting a high degree of intervention, to 0.999 in Sweden. Nevertheless, the epidemic growth in all countries satisfied BL with the exception of Japan.4 These findings are consistent with the simulation. Based on the simulation, we showed that when in all of the scenarios, the number of detected Benford cases was greater than 89%, regardless of . For example, the estimated deceleration of growth in South Korea was the lowest () among the 10 countries, but the estimated growth rate was , which was greater than 0.2. Thus, it followed BL.
The estimated growth rate and deceleration of growth in Japan were 0.140 and 0.773, respectively. Based on the simulation, this profile’s number of detected Benford cases was between 90.15% ( and ) and 91.05% ( and ), as shown in Table 2. However, the calculated statistic was 25.166 (<0.001), indicating the values from Japan were significantly different from the theoretical values of BL.
As a robustness check, we performed BL test on the data sets for the dates after the COVID-19 peak. Thus, the data set in each country contains from the date when the first death was observed after January 22, 2020, to the (fixed) date of June 18, 2020. Not surprisingly, the empirical findings from all countries were statistically significant at the 0.05 level, indicating that the epidemic growth in all countries does not satisfy BL (for more in details, see Appendix B).
6. Conclusion
The objective of this study was to (1) introduce an epidemic growth model that could capture the intervention (e.g., flattening the curve) efforts in different countries in order to better understand the growth rate of COVID-19 infections, (2) establish a link between this epidemic growth model and BL, and (3) propose a sequential updating scheme for parameter estimates.
We found that the predicted number of deaths from the model was very close to the observed number of COVID-19 deaths across all 10 countries. Mathematically, we also showed that epidemic growths without intervention are likely to satisfy BL, because epidemic growths naturally follow an exponential family distribution. Thus, BL was applicable to the epidemic growth model.
Furthermore, it is possible that when the degree of intervention is high, the growth of death or infection rates may not obey BL. This theory would mean that “flattening the curve” interventions would not only be able to slow the growth rate of the outbreak, but also change the characteristics of its nature so that the distribution of first digits followed BL. As a result, BL testing alone would not be sufficient to detect potential manipulations of the growth of the death rate. For this reason, it is important to interpret the model’s estimated parameters for the growth rate () and deceleration of growth (), because they can provide insight into how likely a given case satisfies BL based on the simulation.
Although all of the countries have aggressively fought the epidemic, our analysis shows that 9 out of 10 countries satisfied BL, indicating the growth rates of COVID-19 in these 9 countries were close to an exponential trend. This finding may be due to the fact that the estimated growth parameters for all were greater than 0.2. Notably, Sweden’s strategy for fighting COVID-19 depends on the development of herd immunity [40]. Herd immunity occurs when a large portion of the population becomes immune to the pandemic. Thus, Sweden has not imposed a lockdown [41], [42]. Based on the BL test on the data from Sweden, the calculated statistic was 7.375 (), indicating that the growth of the epidemic in Sweden has satisfied BL (see Table 3 in the manuscript). This finding means that all countries that used interventions (except for Japan) satisfied BL, indicating that the growth rates of COVID-19 were similar in countries that did not use significant interventions (e.g., Sweden).
However, Sweden has shown the lowest deceleration of growth among the 10 countries considered in this study. The estimated deceleration of growth in Sweden is 0.999 with a 95% confidence interval (0.982, 1.018). Since the 95% confidence interval contains a deceleration of growth parameter of 1.000 () regardless of the growth rate (), the growth pattern of COVID-19 in Sweden can be better described by the growth model in the absence of control interventions (Section 3.1).
In the case of Japan, we further investigated the inconsistency between the simulation test (where, given and , the number of detected Benford cases was greater than 90.15%, as shown in Table 2) and BL test (where, given the estimates, the values from Japan did not satisfy the Benford distribution with a -value=0.001, as shown in Table 3). We then conducted a BL test based on each of the boosted samples () that we used for constructing the 95% confidence intervals (see Section 5.2). Given and , the result from the bootstrapped sample showed that the number of detected Benford cases was 88.33%, which was close to the simulation shown in Table 2, even though the simulation scheme was different from the parametric bootstrapping method. Thus, we believe the data generating process in Japan is distinct from the other 9 countries in this study and does not obey BL. One of the possible reasons of the difference between Japan and the other 9 countries is that although JHU CSSE provides public access to the global cases and trends of COVID-19 and updates their data daily, they must rely on self-reported data from each country [43]. The problem with this type of data is that it is subject to intentional manipulation, thus diminishing its reliability or suitability for data analysis. Benford’s law has already attracted interest in antifraud analysis [44], [45]. For that reason, testing Benford’s law is particularly attractive for the detection of fraudulent self-reported COVID-19 data [44], [45]. Based on the empirical findings and the simulation Table 2, BL test shows that the data from Japan is incorrect. These inconsistent results (between the BL test and the simulation table) are important to note because they can discourage researchers from investigating any other self-reported data by Japan further in detail, such as by checking whether the hospitals are managing to cope with the number of infected patients admitted in critical care.
In this study, we also found that the method of the least-squares fitting of the curve was highly affected by the initial value of . Thus, we believe that the proposed approach in Section 5.2 is always desirable when is not known.
CRediT authorship contribution statement
Kang-Bok Lee: Conceptualization, Methodology, Software, Visualization, Writing - original draft, Supervision. Sumin Han: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Yeasung Jeong: Data curation, Formal analysis, Investigation, Resources.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
As shown in physics, physical constants obey BL as well [6].
Different initial values of needed to be considered for different countries for two reasons: (1) COVID-19 was identified at different times in different countries; (2) Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE; https://coronavirus.jhu.edu/), which provided the left-censored data, was not collecting data before January 22, 2020. Notably, COVID-19 was first identified in Wuhan, China in December 2019. Thus, with the exception of China, most countries reported first observations (the number of deaths by January 22, 2020) to JHU CSSE as one or two. For that reason, is 1 or 2 in most countries, except for China. Given the data from JHU CSSE, we had to estimate the initial value for China. The estimated value of for China was 17, which yielded the lowest prediction errors using least-squares fitting.
As the dispersion parameter gets larger, the NB turns into a Poisson distribution.
Following the calculation of the first-digit occurrence in each country, we compared the distribution with the theoretical values of BL.
Appendix A. Estimation for and
Appendix B.
See Table B.1.
Table B.1.
Johns Hopkins Coronavirus Resource Centera |
BL test |
||||
---|---|---|---|---|---|
Country | The entire phase (Dates in 2020) | T | -value | ||
US | 02/29 06/18 | 111 | 15.364 | 0.052 | |
China | 01/22 06/18 | 149 | 275.420 | < 0.001 | |
South Korea | 02/20 06/18 | 120 | 177.680 | < 0.001 | |
Japan | 02/13 06/18 | 127 | 69.904 | < 0.001 | |
UK | 03/05 06/18 | 106 | 51.742 | < 0.001 | |
Spain | 03/03 06/18 | 108 | 158.54 | < 0.001 | |
Italy | 02/21 06/18 | 119 | 97.019 | < 0.001 | |
Germany | 03/09 06/18 | 102 | 183.750 | < 0.001 | |
Netherlands | 03/06 06/18 | 105 | 135.14 | < 0.001 | |
Sweden | 03/11 06/18 | 100 | 46.633 | < 0.001 |
Note: The entire phase is the period from the date when the first death was observed after January 22, 2020, to the date of June 18, 2020. T denotes the length of the data points.
Data source: https://coronavirus.jhu.edu/.
References
- 1.Remuzzi A., Remuzzi G. COVID-19 and Italy: what next? Lancet. 2020 doi: 10.1016/S0140-6736(20)30627-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Princeton University Press; 2015. Benford’s Law. [Google Scholar]
- 3.Ni D., Ren Z. Benford’s law and half-lives of unstable nuclei. Eur. Phys. J. A. 2008;38(3):251–255. [Google Scholar]
- 4.Joannes-Boyau R., Bodin T., Scheffers A., Sambridge M., May S.M. Using Benford’s law to investigate Natural Hazard dataset homogeneity. Sci. Rep. 2015;5(1):1–8. doi: 10.1038/srep12046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Benford F. The law of anomalous numbers. Proc. Am. Philos. Soc. 1938;78:551–572. [Google Scholar]
- 6.Burke J., Kincanon E. Benford’s law and physical constants: the distribution of initial digits. Amer. J. Phys. 1991;59(10):952. [Google Scholar]
- 7.Miranda-Zanetti M., Delbianco F., Tohmé F. Tampering with inflation data: A Benford law-based analysis of national statistics in Argentina. Physica A. 2019;525:761–770. [Google Scholar]
- 8.Zhang J. 2020. Testing case number of coronavirus disease 2019 in China with newcomb-Benford law. arXiv preprint arXiv:2002.05695. [Google Scholar]
- 9.Hoyle D.C., Rattray M., Jupp R., Brass A. Making sense of microarray data distributions. Bioinformatics. 2002;18(4):576–584. doi: 10.1093/bioinformatics/18.4.576. [DOI] [PubMed] [Google Scholar]
- 10.de Marchi S., Hamilton J. Assessing the accuracy of self-reported data: An evaluation of the toxics release inventory. J. Risk Uncertain. 2006;32:57–76. [Google Scholar]
- 11.Nigrini M.J. Taxpayer compliance application of Benford’s law. J. Am. Taxation Assoc. 1996;18:72–92. [Google Scholar]
- 12.Durtschi C., Hillison W., Pacini C. The effective use of Benford’s law to assist in detecting fraud in accounting data. J. Forensic Account. 2004;5(1):17–34. [Google Scholar]
- 13.Gamermann D., Antunes F.L. Statistical analysis of Brazilian electoral campaigns via Benford’s law. Physica A. 2018;496:171–188. [Google Scholar]
- 14.Tam Cho W.K., Gaines B.J. Breaking the (Benford) law: Statistical fraud detection in campaign finance. Amer. Statist. 2007;61(3):218–223. [Google Scholar]
- 15.Ausloos M., Eskandary A., Kaur P., Dhesi G. Evidence for gross domestic product growth time delay dependence over foreign direct investment. a time-lag dependent correlation study. Physica A. 2019;527 [Google Scholar]
- 16.Cerqueti R., Fenga L., Ventura M. Does the US exercise contagion on Italy? A theoretical model and empirical evidence. Physica A. 2018;499:436–442. [Google Scholar]
- 17.Shi J., Ausloos M., Zhu T. Benford’s law first significant digit and distribution distances for testing the reliability of financial reports in developing countries. Physica A. 2018;492:878–888. [Google Scholar]
- 18.Bariviera A.F., Martín M.T., Plastino A., Vampa V. LIBOR Troubles: Anomalous movements detection based on maximum entropy. Physica A. 2016;449:401–407. [Google Scholar]
- 19.Clippe P., Ausloos M. Benford’s law and Theil transform of financial data. Physica A. 2012;391(24):6556–6567. [Google Scholar]
- 20.Hill T.P. The first digit phenomenon: A century-old observation about an unexpected pattern in many numerical tables applies to the stock market, census statistics and accounting data. Am. Sci. 1998;86(4):358–363. [Google Scholar]
- 21.Diekmann A. Not the first digit! using benford’s law to detect fraudulent scientific data. J. Appl. Statist. 2007;34(3):321–329. [Google Scholar]
- 22.Dorogovtsev S.N., Mendes J.F.F., Oliveira J.G. Frequency of occurrence of numbers in the World Wide Web. Physica A. 2006;360(2):548–556. [Google Scholar]
- 23.Mir T.A. The Benford law behavior of the religious activity data. Physica A. 2014;408:1–9. [Google Scholar]
- 24.Mir T.A. The law of the leading digits and the world religions. Physica A. 2012;391(3):792–798. [Google Scholar]
- 25.Ausloos M. Econophysics of a religious cult: the antoinists in Belgium [1920–2000] Physica A. 2012;391(11):3190–3197. [Google Scholar]
- 26.Ausloos M., Herteliu C., Ileanu B. Breakdown of Benford’s law for birth data. Physica A. 2015;419:736–745. [Google Scholar]
- 27.Ausloos M., Cerqueti R., Lupi C. Long-range properties and data validity for hydrogeological time series: The case of the Paglia river. Physica A. 2017;470:39–50. [Google Scholar]
- 28.Yan X., Yang S.G., Kim B.J., Minnhagen P. Benford’s law and first letter of words. Physica A. 2018;512:305–315. [Google Scholar]
- 29.Dantuluri A., Desai S. Do lepton branching fractions obey Benford’s law? Physica A. 2018;506:919–928. [Google Scholar]
- 30.Alexopoulos T., Leontsinis S. Benford’s law in astronomy. J. Astrophys. Astron. 2014;35(4):639–648. [Google Scholar]
- 31.Blum J.R., Mizel V.J. A generalized Weyl equidistribution theorem for operators, with applications. Trans. Amer. Math. Soc. 1972;165:291–307. [Google Scholar]
- 32.Kar A. Weyl’s equidistribution theorem. Resonance. 2003;8(5):30–37. [Google Scholar]
- 33.Ganyani T., Roosa K., Faes C., Hens N., Chowell G. Assessing the relationship between epidemic growth scaling and epidemic size: The 2014–16 Ebola epidemic in West Africa. Epidemiol. Infect. 2019:147. doi: 10.1017/S0950268818002819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Shanafelt D.W., Jones G., Lima M., Perrings C., Chowell G. Forecasting the 2001 foot-and-mouth disease epidemic in the UK. EcoHealth. 2018;15(2):338–347. doi: 10.1007/s10393-017-1293-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Chowell G., Viboud C., Simonsen L., Moghadas S.M. Characterizing the reproduction number of epidemics with early subexponential growth dynamics. J. R. Soc. Interface. 2016;13(123) doi: 10.1098/rsif.2016.0659. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Viboud C., Simonsen L., Chowell G. A generalized-growth model to characterize the early ascending phase of infectious disease outbreaks. Epidemics. 2016;15:27–37. doi: 10.1016/j.epidem.2016.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Chowell G., Nishiura H., Bettencourt L.M. Comparative estimation of the reproduction number for pandemic influenza from daily case notification data. J. R. Soc. Interface. 2007;4(12):155–166. doi: 10.1098/rsif.2006.0161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Chowell G., Ammon C.E., Hengartner N.W., Hyman J.M. Transmission dynamics of the great influenza pandemic of 1918 in Geneva, Switzerland: assessing the effects of hypothetical interventions. J. Theoret. Biol. 2006;241(2):193–204. doi: 10.1016/j.jtbi.2005.11.026. [DOI] [PubMed] [Google Scholar]
- 39.Efron B., Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1986:54–75. [Google Scholar]
- 40.Korhonen J., Granberg B. Sweden Backcasting. now?—Strategic planning for covid-19 mitigation in a liberal democracy. Sustainability. 2020;12(10):4138. [Google Scholar]
- 41.Moosa I.A. The effectiveness of social distancing in containing Covid-19. Appl. Econ. 2020:1–14. [Google Scholar]
- 42.Gibney E. Whose coronavirus strategy worked best? Scientists hunt most effective policies. Nature. 2020 doi: 10.1038/d41586-020-01248-1. [DOI] [PubMed] [Google Scholar]
- 43.Dong E., Du H., Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infect. Dis. 2020;20(5):533–534. doi: 10.1016/S1473-3099(20)30120-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Barabesi L., Cerasa A., Cerioli A., Perrotta D. Goodness-of-fit testing for the Newcomb-Benford law with application to the detection of customs fraud. J. Bus. Econom. Statist. 2018;36(2):346–358. [Google Scholar]
- 45.Hales D.N., Sridharan V., Radhakrishnan A., Chakravorty S.S., Siha S.M. Testing the accuracy of employee-reported data: An inexpensive alternative approach to traditional methods. European J. Oper. Res. 2008;189(3):583–593. [Google Scholar]