PLOS One. 2022 Jan 7;17(1):e0260836. doi: 10.1371/journal.pone.0260836

Improved log-Gaussian approximation for over-dispersed Poisson regression: Application to spatial analysis of COVID-19

Daisuke Murakami 1,*, Tomoko Matsui 2
Editor: Luca Citi 3
PMCID: PMC8741021  PMID: 34995283

Abstract

In the era of open data, Poisson and other count regression models are increasingly important. Still, conventional Poisson regression has remaining issues in terms of identifiability and computational efficiency. In particular, owing to an identification problem, Poisson regression can be unstable for small samples with many zeros. Given this, we develop a closed-form inference for over-dispersed Poisson regression, including Poisson additive mixed models. The approach is derived via a mode-based log-Gaussian approximation. The resulting method is fast, practical, and free from the identification problem. Monte Carlo experiments demonstrate that the estimation error of the proposed method is considerably smaller than that of the closed-form alternatives and as small as that of the usual Poisson regressions. For counts with many zeros, our approximation has better estimation accuracy than conventional Poisson regression. We obtained similar results in the case of Poisson additive mixed modeling considering spatial or group effects. The developed method was applied to analyze COVID-19 data in Japan. The results suggest that the influences of pedestrian density, age, and other factors on the number of cases change over periods.

Introduction

Currently, a wide variety of count data are collected through sensors and used for smart urban and regional management (see [1]). For example, in 2020–2021 when the coronavirus disease (COVID-19) spread globally, the daily number of people infected with coronavirus was monitored worldwide, and countermeasures were considered based on the observations [2].

Poisson and other regression models for count data have been used for analyzing the number of COVID-19 cases (e.g., [3, 4]) or other diseases (e.g., [5, 6]). These regression models have also been used in ecology (e.g., [7, 8]), criminology (e.g., [9, 10]), and other fields. Recently, Bayesian Poisson regression, which assumes Poisson distribution for the count data and Gaussian priors for latent variables describing spatial, group, and other effects, is widely used in applied studies.

Still, Poisson regression has remaining issues in terms of (a) computational efficiency and (b) identifiability. Regarding (a), owing to the lack of conjugacy between the Poisson and Gaussian distributions, approximate inference is necessary for the estimation. Unfortunately, the Markov chain Monte Carlo method can be slow for large samples. Faster approximations have been developed for count data regression in the context of additive modeling (e.g., [11, 12]), mixed effects modeling [13], and Gaussian processes (e.g., [14, 15]).

Regarding (b), the maximum likelihood estimates of the conventional Poisson regression are unidentifiable or identifiable only weakly for certain data configurations [16, 17], typically, for small samples with many zeros. As we will illustrate later, this property considerably worsens the accuracy of Poisson regression estimates in some cases.

Gaussian approximation is useful for improving (a) the computational efficiency and avoiding (b) the identification problem, which is attributed to the Poisson likelihood [18]. Refs [19–22] proposed closed-form Gaussian approximations for Poisson regression. These approaches are easy to implement, computationally efficient, and free from the identification problem. Given the current situation wherein a wide range of researchers and practitioners use count data, these practical approaches will become increasingly important. Unfortunately, these approximations have the following disadvantages:

  1. They have poor approximation accuracy for counts with many zeros as we will demonstrate later. A closed-form approach accurately describing such data is needed.

  2. An arbitrary parameter, which is used to avoid taking the logarithm of zero, must be determined a priori. The value is known to have substantial impact on the modeling result [23]. A closed-form approach without such an arbitrary parameter is needed.

Given that, we develop a log-Gaussian approximation for over-dispersed Poisson regression that is fast, practical, avoids the identification problem, and overcomes disadvantages 1 and 2 above.

Methods

Improved log-Gaussian approximation

Over-dispersed Poisson regression

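This study considers the following over-dispersed Poisson model for count variables $Y_i$, $i \in \{1, \ldots, N\}$:

$$Y_i \sim \mathrm{odPoisson}(\lambda_i, \sigma^2), \quad \lambda_i = z_i \exp(\mu_i), \tag{1}$$

where $E[Y_i] = \lambda_i$ and $\mathrm{Var}[Y_i] = \sigma^2 \lambda_i$. Here, $\lambda_i$ is a mean parameter, $\sigma^2$ is an over-dispersion parameter, and $z_i$ is a given offset variable.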

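Suppose that $\mu_i = x_i'\beta$, where $x_i$ is a column vector of $K$ explanatory variables and $\beta$ is a vector of regression coefficients. The coefficient estimator and its variance-covariance matrix are given as

$$\hat{\beta} = (X'\Lambda X)^{-1} X'\Lambda z, \tag{2}$$

$$\mathrm{Var}[\hat{\beta}] = \hat{\sigma}^2 (X'\Lambda X)^{-1}, \tag{3}$$

where $z = [z_1, \ldots, z_N]'$ with $z_i = \mu_i + \frac{Y_i - \lambda_i}{\lambda_i}$ (in Eqs (2) and (3), $z_i$ denotes this working response rather than the offset), $X = [x_1, \ldots, x_N]'$, and $\Lambda$ is a diagonal matrix whose $i$-th element equals $\lambda_i$. The coefficients are estimated by an iteratively re-weighted least squares (IRLS) method that alternately updates $\hat{\beta}$ and $\hat{\lambda}_i = z_i \exp(x_i'\hat{\beta})$ until convergence. Given $\hat{\lambda}_i$, the dispersion parameter is estimated as follows:

$$\hat{\sigma}^2 = \frac{1}{N-K} \sum_{i=1}^{N} \frac{(Y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i}. \tag{4}$$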

The resulting mean estimate $\hat{\lambda}_i$ for the over-dispersed Poisson model is known to be the same as that of the conventional Poisson regression, which assumes $\sigma^2 = 1$:

$$Y_i \sim \mathrm{Poisson}(\lambda_i), \quad \lambda_i = z_i \exp(\mu_i). \tag{5}$$

$\hat{\lambda}_i$ is the Poisson maximum likelihood estimator, which suffers from the identification problem as detailed in [17]. Note that the $\lambda_i$ parameter explains not only the mean but also the mode of $Y_i$; for integer-valued $\lambda_i$, $Y_i$ has two modes $\{\lambda_i - 1, \lambda_i\}$. Later, we will use the center of the two modes, $\mathrm{Mode}_c[Y_i] = \lambda_i - 0.5$.
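The IRLS scheme behind Eqs (2) and (4) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`solve`, `wls`, `irls_odpoisson`), the zero starting value for the coefficients, and the stopping rule are assumptions made here for illustration.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def wls(X, w, y):
    """Weighted least squares: solve (X'WX) b = X'Wy for b."""
    k = len(X[0])
    A = [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(k)]
         for a in range(k)]
    rhs = [sum(wi * xi[a] * yi for xi, wi, yi in zip(X, w, y)) for a in range(k)]
    return solve(A, rhs)

def irls_odpoisson(X, Y, offset=None, max_iter=100, tol=1e-10):
    """IRLS updates of Eq (2), then the dispersion estimate of Eq (4)."""
    n, k = len(X), len(X[0])
    off = offset if offset is not None else [1.0] * n
    beta = [0.0] * k  # starting value: an assumption of this sketch
    for _ in range(max_iter):
        mu = [sum(x * b for x, b in zip(xi, beta)) for xi in X]
        lam = [o * math.exp(m) for o, m in zip(off, mu)]
        # Working response z_i = mu_i + (Y_i - lambda_i) / lambda_i
        work = [m + (y - l) / l for m, y, l in zip(mu, Y, lam)]
        new_beta = wls(X, lam, work)  # weights are the current lambda_i
        step = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if step < tol:
            break
    mu = [sum(x * b for x, b in zip(xi, beta)) for xi in X]
    lam = [o * math.exp(m) for o, m in zip(off, mu)]
    sigma2 = sum((y - l) ** 2 / l for y, l in zip(Y, lam)) / (n - k)
    return beta, lam, sigma2
```

For an intercept-only model the fitted mean equals the sample mean of the counts, which gives a quick sanity check of the sketch.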

Log-Gaussian approximation for the Poisson regression

To overcome the identification problem, we consider approximating the mean estimator $\hat{\lambda}_i$ by using an estimator $\hat{\lambda}_i^+$ obtained from a log-Gaussian model, which is unaffected by the identification problem. The estimated $\hat{\lambda}_i^+$ is used to estimate $\hat{\beta}$, $\mathrm{Var}[\hat{\beta}]$, and $\hat{\sigma}^2$.

For the approximation, we need to identify a log-Gaussian model that accurately approximates the Poisson model Eq (5) around $\lambda_i$. Although mean-based log-Gaussian approximations for Poisson regression have been developed (e.g., [20]), the mean and mode of the two distributions behave somewhat differently; the mean and mode of a Poisson distribution are linearly proportional and grow in the same order (and, thus, $\lambda_i$ explains both mean and mode), while those of a log-Gaussian distribution are not linearly proportional, and the mean grows faster than the mode. Therefore, a mean-based approximation can have poor approximation accuracy around the mode, which is the distribution center. Considering the success of Laplace and other mode-based approximations in previous studies, it is reasonable to approximate the Poisson distribution accurately around the mode.

This study first develops a mode-based closed-form approximation. We will use the Poisson mode center $\mathrm{Mode}_c[Y_i]$. Because the mode center is non-negative only when $\lambda_i = E[Y_i] \geq 0.5$, we develop a mode-based approximation for $\lambda_i \geq 0.5$ and another approximation for $\lambda_i < 0.5$. After that, we combine the two approximations for estimating the $\hat{\lambda}_i^+$ parameter.

Approximation for λi≥0.5

We approximate Eq (5) by using the log-Gaussian variable $y_i$ defined as:

$$y_i + c \sim \mathrm{logN}\left(\mu_i^{(G)}, \frac{1}{\lambda_i + c}\right), \tag{6}$$

where $\mu_i^{(G)}$ represents the mean (log scale), and $c$ is a constant required to avoid taking the logarithm of zero. $\frac{1}{\lambda_i + c}$ is an approximate variance for a log-transformed Poisson random deviate.

We perform the approximation so that the mode $\mathrm{Mode}[y_i]$ of the log-Gaussian model equals the mode center $\mathrm{Mode}_c[Y_i]$ of the Poisson model. The following condition is obtained from the mode matching $\mathrm{Mode}_c[Y_i] = \mathrm{Mode}[y_i]$:

$$z_i \exp(\mu_i) - 0.5 = \exp\left(\mu_i^{(G)} - \frac{1}{\lambda_i + c}\right) - c. \tag{7}$$

Eq (7) suggests that $\mu_i$ and $\mu_i^{(G)}$ do not generally have a linear relationship. As an exception, they have the following linear relationship if $c = 0.5$:

$$\mu_i^{(G)} = \log(z_i) + \mu_i + \frac{1}{\lambda_i + 0.5}. \tag{8}$$

While existing studies have determined $c$ somewhat arbitrarily, $c = 0.5$ is found to be necessary for applying the linear approximation under our assumption.

Let us substitute $c = 0.5$ and Eq (8) into Eq (6). Then, we obtain the following log-Gaussian model approximating the Poisson model:

$$y_i + 0.5 \sim \mathrm{LogN}\left(\log(z_i) + \mu_i + \frac{1}{\lambda_i + 0.5}, \frac{1}{\lambda_i + 0.5}\right). \tag{9}$$

By rearranging Eq (9), we have the following model:

$$\log(y_i^*) \sim N\left(\mu_i, \frac{1}{\lambda_i + 0.5}\right), \tag{10}$$

where $y_i^* = \frac{y_i + 0.5}{z_i} \exp\left(-\frac{1}{\lambda_i + 0.5}\right)$. The log-Gaussian distribution approximates the Poisson distribution around the mode center.
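The step from the mode-matching condition in Eq (7) to the linear relationship in Eq (8) can be spelled out as follows. The derivation relies only on the standard fact that a log-normal variable with log-scale mean $m$ and log-scale variance $s^2$ has mode $\exp(m - s^2)$; it is a worked restatement of the argument above rather than additional material from the authors.

```latex
\begin{align*}
\mathrm{Mode}[y_i] &= \exp\!\Big(\mu_i^{(G)} - \tfrac{1}{\lambda_i + c}\Big) - c
  && \text{(mode of } y_i + c \text{, shifted by } c\text{)} \\
\mathrm{Mode}_c[Y_i] &= \lambda_i - 0.5 = z_i \exp(\mu_i) - 0.5
  && \text{(Poisson mode center)} \\
c = 0.5 \;\Rightarrow\; z_i \exp(\mu_i) &= \exp\!\Big(\mu_i^{(G)} - \tfrac{1}{\lambda_i + 0.5}\Big)
  && \text{(the constants } -0.5 \text{ and } -c \text{ cancel)} \\
\Rightarrow\; \mu_i^{(G)} &= \log(z_i) + \mu_i + \tfrac{1}{\lambda_i + 0.5}
  && \text{(take logarithms; this is Eq (8))}
\end{align*}
```

For any other value of $c$, the constants do not cancel and the logarithm cannot be separated into a term that is linear in $\mu_i$, which is why $c = 0.5$ is singled out.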

Approximation for λi<0.5

If $\lambda_i < 0.5$, the modes of the Poisson variable $Y_i$ and the log-Gaussian variable $y_i^*$ behave somewhat differently: the Poisson mode is always zero, while the mode of the log-Gaussian distribution gradually converges to zero as $\lambda_i$ (or $\mu_i$) declines. A mode-based approximation is not suitable in this case. In contrast, the means of the two distributions both converge to zero as $\lambda_i$ approaches zero (i.e., as $\mu_i$ declines). For $\lambda_i = E[Y_i] < 0.5$, we therefore rely on a mean-based approximation.

By taking the expectation of $y_i^*$ using Eq (10), we have the following relationship:

$$E[y_i^*] \exp\left(-\frac{0.5}{\lambda_i + 0.5}\right) = \exp(\mu_i). \tag{11}$$

Eq (11) implies that, when approximating the Poisson mean function $\mu_i$ (log scale) using $y_i^*$, $y_i^*$ should be rescaled by multiplying it by $\exp\left(-\frac{0.5}{\lambda_i + 0.5}\right)$. By applying the rescaling to $y_i^*$, Eq (10) is modified to approximate the mean of the Poisson distribution as follows:

$$\log(y_i^{**}) \sim N\left(\mu_i, \frac{1}{\lambda_i + 0.5}\right), \tag{12}$$

where $y_i^{**} = \frac{y_i + 0.5}{z_i} \exp\left(-\frac{1.5}{\lambda_i + 0.5}\right)$.
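The rescaling factor in Eqs (11) and (12) follows from the log-normal mean formula $E[\exp(W)] = \exp(m + s^2/2)$ for $W \sim N(m, s^2)$. A short worked version is given below; $s_i^2$ is shorthand introduced here for the variance term, and $\lambda_i$ in that term is treated as fixed.

```latex
\begin{align*}
\log(y_i^*) \sim N(\mu_i, s_i^2), \quad s_i^2 = \tfrac{1}{\lambda_i + 0.5}
  \;&\Rightarrow\; E[y_i^*] = \exp\!\big(\mu_i + \tfrac{1}{2} s_i^2\big)
     = \exp(\mu_i)\,\exp\!\Big(\tfrac{0.5}{\lambda_i + 0.5}\Big), \\
y_i^{**} = y_i^* \exp\!\Big(-\tfrac{0.5}{\lambda_i + 0.5}\Big)
  \;&\Rightarrow\; E[y_i^{**}] = \exp(\mu_i), \\
y_i^{**} &= \frac{y_i + 0.5}{z_i}\,
     \exp\!\Big(-\tfrac{1}{\lambda_i + 0.5}\Big)\exp\!\Big(-\tfrac{0.5}{\lambda_i + 0.5}\Big)
   = \frac{y_i + 0.5}{z_i}\,\exp\!\Big(-\tfrac{1.5}{\lambda_i + 0.5}\Big).
\end{align*}
```

The exponent thus changes from $1/(\lambda_i + 0.5)$ in the mode-based transform to $1.5/(\lambda_i + 0.5)$ in the mean-based one.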

Proposed approximation

Considering the advantage of the mode-based approximation explained in the “Log-Gaussian approximation for the Poisson regression” section, we use Eq (10) as long as λi≥0.5 while Eq (12) is used otherwise. Still, λi = E[Yi] is unknown a priori. Considering the property of count data that P(Yi<0.5) = P(Yi = 0), we approximate P(E[Yi]<0.5) using the ratio r of zero counts in {Y1,…,YN}. Given the approximation, Eqs (10) and (12) are applied with probabilities 1−r and r, respectively. By combining these equations using r, our proposed approximation is formulated as:

$$\log(y_i^+) \sim N\left(\mu_i, \frac{1}{\lambda_i + 0.5}\right), \tag{13}$$

where $y_i^+ = \frac{y_i + 0.5}{z_i} \exp\left(-\frac{1 + 0.5r}{\lambda_i + 0.5}\right)$, which yields Eq (10) if $r = 0$ and Eq (12) if $r = 1$. If all counts are non-zero, the mode-based approximation is applied for all the samples. As the share of zero counts increases, the mean-based approximation is emphasized.

For the unknown $\lambda_i$ in the variance term, we rely on a plug-in estimator $\hat{\lambda}_i = y_i$. The resulting approximation equation yields

$$\log(y_i^+) \sim N\left(\mu_i, \frac{1}{y_i + 0.5}\right). \tag{14}$$

This plug-in method ignores the uncertainty in the variance term. Consideration of the uncertainty will be an important future task.

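Given Eq (14), the estimator for $\mu_i = x_i'\beta$ is $\hat{\mu}_i^+ = x_i'\hat{\beta}^+$, where $\hat{\beta}^+ = (X'\Lambda_y X)^{-1} X'\Lambda_y \log(y^+)$, $\log(y^+) = [\log(y_1^+), \ldots, \log(y_N^+)]'$, and $\Lambda_y$ is a diagonal matrix whose $i$-th element equals $y_i + 0.5$. $\hat{\mu}_i^+$ approximates the $\mu_i$ parameter but is free from the identification problem. Thus, we use $\hat{\lambda}_i^+ = z_i \exp(\hat{\mu}_i^+)$ as an estimate of the Poisson mean $\lambda_i$. In other words, the estimated $\hat{\lambda}_i^+$ is substituted into Eqs (2)–(4) to estimate $\hat{\beta}$, $\mathrm{Var}[\hat{\beta}]$, and $\hat{\sigma}^2$.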
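The transformation implied by Eq (14) is simple enough to state as code. The sketch below computes the zero-count share r, the transformed response log(y+) with the plug-in estimator in place of the unknown mean, and the regression weights. The function name `log_gaussian_transform` and its argument names are hypothetical and chosen only for this illustration.

```python
import math

def log_gaussian_transform(y, offset=None):
    """Return (response, weight, r) for the weighted Gaussian regression.

    response_i = log((y_i + 0.5) / z_i) - (1 + 0.5 r) / (y_i + 0.5)
    weight_i   = y_i + 0.5   (inverse of the plug-in variance in Eq (14))
    r          = share of zero counts in y
    """
    n = len(y)
    z = offset if offset is not None else [1.0] * n
    r = sum(1 for v in y if v == 0) / n
    response = [math.log((v + 0.5) / zi) - (1.0 + 0.5 * r) / (v + 0.5)
                for v, zi in zip(y, z)]
    weight = [v + 0.5 for v in y]
    return response, weight, r
```

Zero counts are handled without any user-chosen constant: for a zero count the response is log(0.5 / z_i) - (1 + 0.5 r) / 0.5 and the weight is 0.5, so such observations stay in the regression but are down-weighted.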

Proposed approximation for Poisson mixed effects model

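Our approximation is readily extended to the (over-dispersed) Poisson mixed effects model (MEM) (e.g., [21]), which is formulated as

$$Y_i \sim \mathrm{odPoisson}(\lambda_i, \sigma^2), \quad \lambda_i = z_i \exp(\mu_i), \quad \mu_i = x_i'\beta, \quad \beta \sim N(0, \Sigma_\beta), \tag{15}$$

where $\Sigma_\beta$ is the variance-covariance matrix for $\beta$. We consider the following estimators for Eq (15):

$$\hat{\beta} = (X'\Lambda^+ X + \Sigma_\beta^{-1})^{-1} X'\Lambda^+ z^+, \tag{16}$$

$$\mathrm{Var}[\hat{\beta}] = \hat{\sigma}^2 (X'\Lambda^+ X + \Sigma_\beta^{-1})^{-1}, \tag{17}$$

$$\hat{\sigma}^2 = \frac{1}{N-L} \sum_{i=1}^{N} \frac{(Y_i - \hat{\lambda}_i^+)^2}{\hat{\lambda}_i^+}, \tag{18}$$

where $L = \mathrm{tr}[(X'\Lambda^+ X + \Sigma_\beta^{-1})^{-1} X'\Lambda^+ X]$ is the effective degrees of freedom, $z^+ = [z_1^+, \ldots, z_N^+]'$ with $z_i^+ = \hat{\mu}_i^+ + \frac{Y_i - \hat{\lambda}_i^+}{\hat{\lambda}_i^+}$, and $\Lambda^+$ is a diagonal matrix whose $i$-th element equals $\hat{\lambda}_i^+$. These estimators are identical to those of the conventional Poisson MEM if $z^+$ is replaced with $z$.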
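Eq (16) has the form of a penalized weighted least squares estimator, and the effective degrees of freedom L is the trace of the corresponding smoother-type matrix. The following sketch illustrates both computations for a generic weight vector and response; the names `penalized_wls` and `prec` (standing for the inverse of the prior covariance) are assumptions of this sketch, and the prior covariance is taken as known rather than estimated.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def penalized_wls(X, w, y, prec):
    """Return (beta, edf) with beta = (X'WX + prec)^{-1} X'Wy
    and edf = tr[(X'WX + prec)^{-1} X'WX]."""
    k = len(X[0])
    A = [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(k)]
         for a in range(k)]
    rhs = [sum(wi * xi[a] * yi for xi, wi, yi in zip(X, w, y)) for a in range(k)]
    P = [[A[a][b] + prec[a][b] for b in range(k)] for a in range(k)]
    beta = solve(P, rhs)
    # The trace is accumulated column by column: the j-th column of
    # (X'WX + prec)^{-1} X'WX solves P c = A[:, j]; its j-th entry is
    # the j-th diagonal element.
    edf = 0.0
    for j in range(k):
        col = solve(P, [A[a][j] for a in range(k)])
        edf += col[j]
    return beta, edf
```

With a zero precision matrix the estimator reduces to ordinary weighted least squares and the effective degrees of freedom equal the number of coefficients; a larger precision shrinks both.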

As before, we approximate the Poisson mean $\hat{\lambda}_i^+ = z_i \exp(\hat{\mu}_i^+)$ in Eqs (16)–(18) using the following model, which approximates Eq (15) around $\lambda_i$:

$$\log(y_i^+) \sim N\left(\mu_i, \frac{1}{y_i + 0.5}\right), \quad \mu_i = x_i'\beta, \quad \beta \sim N(0, \Sigma_\beta). \tag{19}$$

Once the Gaussian mixed effects model (Eq 19) is estimated, the approximate Poisson mean $\hat{\mu}_i^+ = x_i'\hat{\beta}^+$ (log scale) is obtained, where $\hat{\beta}^+ = (X'\Lambda_y X + \Sigma_\beta^{-1})^{-1} X'\Lambda_y \log(y^+)$ with $\log(y^+) = [\log(y_1^+), \ldots, \log(y_N^+)]'$.

In short, an over-dispersed Poisson regression with or without random coefficients is approximated by the following steps: (I) estimate $\hat{\mu}_i^+$ using a log-Gaussian model whose explained variable is $\log(y_i^+) = \log\left(\frac{y_i + 0.5}{z_i}\right) - \frac{1 + 0.5r}{y_i + 0.5}$ and whose sample weight is $y_i + 0.5$; (II) substitute the estimated $\hat{\lambda}_i^+ = z_i \exp(\hat{\mu}_i^+)$ into Eqs (2)–(4) for models without random coefficients or Eqs (16)–(18) for models with random effects. Later, we examine the approximation accuracy of our approach through Monte Carlo experiments.
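Steps (I) and (II) can be put together for the model without random coefficients as in the sketch below. It reflects one reading of the text: step (II) evaluates Eq (2) once with the weights and working response built from the step (I) estimate, and the dispersion is computed from the same estimate. The function names are hypothetical, and Eq (3) is omitted to keep the example short.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def wls(X, w, y):
    """Weighted least squares: solve (X'WX) b = X'Wy for b."""
    k = len(X[0])
    A = [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(k)]
         for a in range(k)]
    rhs = [sum(wi * xi[a] * yi for xi, wi, yi in zip(X, w, y)) for a in range(k)]
    return solve(A, rhs)

def approx_odpoisson(X, Y, offset=None):
    """Two-step closed-form approximation (no iteration)."""
    n, k = len(X), len(X[0])
    off = offset if offset is not None else [1.0] * n
    r = sum(1 for y in Y if y == 0) / n  # share of zero counts
    # Step (I): weighted Gaussian regression on the transformed counts.
    resp = [math.log((y + 0.5) / o) - (1.0 + 0.5 * r) / (y + 0.5)
            for y, o in zip(Y, off)]
    weight = [y + 0.5 for y in Y]
    beta_plus = wls(X, weight, resp)
    mu_plus = [sum(x * b for x, b in zip(xi, beta_plus)) for xi in X]
    lam_plus = [o * math.exp(m) for o, m in zip(off, mu_plus)]
    # Step (II): plug the approximate mean into Eqs (2) and (4).
    work = [m + (y - l) / l for m, y, l in zip(mu_plus, Y, lam_plus)]
    beta = wls(X, lam_plus, work)
    sigma2 = sum((y - l) ** 2 / l for y, l in zip(Y, lam_plus)) / (n - k)
    return beta, lam_plus, sigma2
```

Because both steps are single weighted least squares solves, the cost is that of two linear regressions, which is where the speed advantage over iterative fitting comes from.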

Property of the proposed approximation

Table 1 summarizes closed-form approximations for the Poisson regression models. These methods perform approximations through the estimation of a log-Gaussian model using the explained variables and the weights shown in the table. These practical methods will be useful for both researchers and practitioners who wish to avoid the identification problem. However, existing methods are accurate only for a moderate to large μi [22] and should not be used for counts with many zeros. In addition, the c parameter, which has a considerable impact on the analysis results, must be determined a priori (see the Introduction section). These drawbacks inhibit the practical use of these approximations.

Table 1. Closed-form approximations for Poisson regression.

c is a tuning parameter that must be determined a priori. zi = 1 is assumed. All the approximations employ Gaussian linear regression whose explained variables and weights are as shown in the table.

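| Method | Explained variable | Weight | Outline |
|---|---|---|---|
| Log-linear approx. [19] | $\log(y_i + c)$ | $y_i + c$ | $\hat{\beta}$ and $\mathrm{Var}[\hat{\beta}]$ are estimated from the Gaussian model |
| Taylor approx. [20]; Log-Gamma approx. [22] | $\log(y_i + c) - \frac{c}{y_i + c}$ | $y_i + c$ | As above |
| Our approximation | $\log(y_i + 0.5) - \frac{1 + 0.5r}{y_i + 0.5}$ | $y_i + 0.5$ | $\hat{\beta}$ and $\mathrm{Var}[\hat{\beta}]$ are estimated from an over-dispersed Poisson regression model whose mean function $\lambda_i$ is approximated using the log-Gaussian model |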

Major advantages of our approximation relative to these existing methods are as follows: (A) it does not have the tuning parameter (c); (B) because of the mode matching, the proposed method accurately approximates the mode of the Poisson distribution irrespective of μi; (C) Gaussian approximation is used only for estimating the Poisson mean while alternative methods use it for estimating both the Poisson mean and the regression coefficients. As we will show later, these advantages considerably improve the approximation accuracy for count data with many zeros.

Our mode-matching method is akin to the Laplace approximation, which is based on matching the mode of a Gaussian distribution to that of the target distribution. Considering studies demonstrating the accuracy of the Laplace approximation, our mode-based approach is expected to be accurate as well. Still, the Laplace approximation can have poor accuracy if the target distribution is far from a Gaussian distribution. Extension based on other approximation methods such as numerical quadrature is an important remaining task.

Results: Monte Carlo experiments

Case 1: Basic over-dispersed Poisson regression model

This section compares the estimation accuracy of the proposed approximation (Proposed) with standard Poisson regression (Poisson), over-dispersed Poisson regression (odPoisson), and negative binomial regression alternatives (NB). We also compare ours with the log-linear approximation of [19] (LogLinear) and the Taylor approximation of [20] (Taylor).

The simulated count data $y_i$ are generated from the over-dispersed Poisson regression with mean $\lambda_i$ and overdispersion parameter $\sigma^2$:

$$y_i \sim \mathrm{odPoisson}(\lambda_i, \sigma^2), \quad \lambda_i = \exp(\beta_0 + x_{i,1}\beta_1 + x_{i,2}\beta_2), \tag{20}$$

where xi,1 and xi,2 are generated from standard normal distributions N(0, 1), and {β1, β2} = {2.0, 0.5}. We refer to β1 as a strong and β2 as a weak coefficient. σ2 = 1 implies the standard Poisson regression without over-dispersion while σ2>1 means over-dispersion. The β0 parameter implicitly controls the ratio of zero counts; a smaller β0 value yields more zero counts.

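The over-dispersed Poisson distribution does not have a probability mass function [24]. The simulation data are sampled to satisfy $E[y_i] = \lambda_i$ and $\mathrm{Var}[y_i] = \sigma^2 \lambda_i$, which $y_i \sim \mathrm{odPoisson}(\lambda_i, \sigma^2)$ assumes, as follows:

  1. Calculate $\lambda_i = \exp(\beta_0 + x_{i,1}\beta_1 + x_{i,2}\beta_2)$

  2. Calculate $v_i = (\sigma^2 - 1)/\lambda_i$

  3. Sample $y_i \sim NB(\lambda_i, v_i)$, where $NB(\lambda_i, v_i)$ is a negative binomial distribution with expectation $\lambda_i$ and variance $\mathrm{Var}[y_i] = \lambda_i + v_i \lambda_i^2$

The sampled $y_i$ has the expectation $E[y_i] = \lambda_i$ and variance $\mathrm{Var}[y_i] = \lambda_i + v_i \lambda_i^2 = \lambda_i + (\sigma^2 - 1)\lambda_i = \sigma^2 \lambda_i$ that $\mathrm{odPoisson}(\lambda_i, \sigma^2)$ assumes. Thus, the sampled $y_i$ fulfills the assumption of the over-dispersed Poisson distribution.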
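The three sampling steps can be sketched with the Python standard library alone. The negative binomial draw is generated here as a gamma-Poisson mixture, which is a standard construction and not necessarily the routine the authors used; the function names `rpois` and `rodpois` are hypothetical.

```python
import random

def rpois(lam, rng):
    """Poisson draw: count unit-rate exponential arrivals before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def rodpois(lam, sigma2, rng):
    """Draw a count with mean lam and variance sigma2 * lam (sigma2 >= 1)."""
    if sigma2 <= 1.0:
        return rpois(lam, rng)  # v_i = 0: plain Poisson, no over-dispersion
    v = (sigma2 - 1.0) / lam    # step 2
    # Step 3: NB(lam, v) as a gamma-Poisson mixture. The gamma variable has
    # shape 1/v and scale v*lam, i.e. mean lam and variance v*lam^2, so the
    # mixed count has variance lam + v*lam^2 = sigma2 * lam.
    g = rng.gammavariate(1.0 / v, v * lam)
    return rpois(g, rng)
```

A usage example would be `rng = random.Random(1)` followed by `[rodpois(3.0, 5.0, rng) for _ in range(1000)]`, whose sample mean and variance should be close to 3 and 15. The arrival-counting Poisson sampler is simple but takes time proportional to the mean, so it is only suitable for illustration.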

The coefficient estimation accuracy is compared across models while varying β0∈{−2, −1, 0, 1, 1}, σ2∈{1, 5}, and N∈{50, 200}. In each case, the simulations were iterated 1000 times and the root mean squared error (RMSE) and the mean bias are evaluated:

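$$\mathrm{RMSE}[\beta_k] = \sqrt{\frac{1}{500} \sum_{iter=1}^{500} \left(\hat{\beta}_k^{(iter)} - \beta_k\right)^2}, \quad \mathrm{Bias}[\beta_k] = \frac{1}{500} \sum_{iter=1}^{500} \left(\hat{\beta}_k^{(iter)} - \beta_k\right), \tag{21}$$

where $\hat{\beta}_k^{(iter)}$ is the estimated $\beta_k$ in the $iter$-th iteration.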
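The two accuracy measures in Eq (21) amount to the root of the mean squared error and the mean error over the simulation replicates. A minimal sketch follows; the function name `rmse_and_bias` is hypothetical.

```python
import math

def rmse_and_bias(estimates, truth):
    """RMSE and mean bias of replicate estimates around the true coefficient."""
    errors = [est - truth for est in estimates]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n
    return rmse, bias
```

The two measures are complementary: estimates scattered symmetrically around the truth give a bias near zero but a positive RMSE, whereas a systematic shift shows up in both.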

The evaluated RMSE and bias values are plotted in Figs 1 and 2 for the case without overdispersion (σ2 = 1.0) and in Figs 3 and 4 for the case with overdispersion (σ2 = 5.0). LogLinear and Taylor tend to have large RMSEs and biases across cases, and the errors inflate if yi has many zero values (i.e., small β0). In contrast, the RMSE values for Proposed are as small as those of the Poisson and odPoisson specifications across cases. Poisson, odPoisson, and NB have large RMSE values for small over-dispersed samples (σ2 = 5.0) with many zero values (β0 = −2); this is attributable to the identification problem explained in the “Introduction” section. Proposed does not suffer from this problem and is therefore advantageous in terms of stability. The bias of the proposed method is small across cases. These results suggest that the proposed method estimates regression coefficients with reasonable accuracy.

Fig 1. RMSE of the regression coefficients in cases without overdispersion (σ2 = 1.0) (x−axis: β0, y-axis: RMSE).

Fig 2. Bias of the regression coefficients in cases without overdispersion (σ2 = 1.0) (x−axis: β0, y-axis: Bias).

Fig 3. RMSE of the regression coefficients in cases with overdispersion (σ2 = 5.0) (x−axis: β0, y-axis: RMSE).

Fig 4. Bias of the regression coefficients in cases with overdispersion (σ2 = 5.0) (x−axis: β0, y-axis: Bias).

Fig 5 shows the coefficient standard error (SE) estimates. While the SEs estimated from Proposed are similar to those from odPoisson, the former tend to be smaller than the latter when σ2 = 5.0. To examine whether our SE accurately estimates the uncertainty in the coefficient estimates, Fig 6 plots (SE)/(standard deviation of the estimated coefficient values). The value is close to 1.0 if the SEs accurately evaluate the uncertainty. Based on the figure, all the methods tend to underestimate the SE value. Still, the bias of Proposed is smaller than that of Poisson, NB, LogLinear, and Taylor, although larger than that of odPoisson. Reducing the bias in the SE estimates is an important task.

Fig 5. Means of the coefficient standard errors (N = 200) (x−axis: β0, y-axis: mean standard error).

Fig 6. Means of (estimated standard error)/(standard deviation of the estimated coefficient values) (N = 200; x−axis: β0, y-axis: mean standard error).

In S1 Appendix in S1 File, we perform another Monte Carlo experiment assuming six explanatory variables. The result is consistent with the results obtained in this section.

Case 2: Model with spatial effects

To verify the expandability of the proposed method, this section applies it to a spatial regression model, a class of models widely used to analyze spatial phenomena in the environment, the economy, and epidemics. We consider the following model:

$$y_i \sim \mathrm{odPoisson}(\lambda_i, \sigma^2), \quad \lambda_i = \exp(\beta_0 + x_{i,1}\beta_1 + x_{i,2}\beta_2 + s_i), \tag{22}$$

where $\{\beta_1, \beta_2\} = \{2, 0.5\}$ and $\sigma^2 = 5$, which means an overdispersion with variance $\mathrm{Var}[y_i] = 5\lambda_i$. $s_i$ is a process capturing a spatially dependent pattern of the data. It is modeled by a low rank Gaussian process whose spatial dependence decays exponentially with the Euclidean distance between the geometric centers of the two zones. Eq (22) is an over-dispersed Poisson mixed-effects model (MEM) that considers spatial dependence. The model is estimated by applying the maximum likelihood (ML) estimation for the Poisson MEM (Poisson), an over-dispersed Poisson MEM (odPoisson), the Taylor approximate Poisson MEM (Taylor), and our specification (Proposed). Taylor and Proposed fitted linear MEMs using the transformed explained variables and weight variables (see Table 1). All models were estimated using a restricted maximum likelihood method implemented in the R package mgcv [11].

We assumed β0∈{−2, −1, 0, 1, 1} and N∈{50, 200}. In each case, the models were estimated 500 times, and the estimation accuracies were compared. Figs 7 and 8 display the estimated RMSEs and biases, respectively. When N = 50, odPoisson took extremely large RMSEs due to the identification problem. Poisson and Taylor also had large RMSEs. In contrast, the proposed method tends to have smaller RMSE values. The proposed method may be a better choice for small samples. Even for N = 200, the RMSEs and biases of Proposed were as small as those of Poisson and odPoisson. The estimation accuracy of the proposed method was verified in the case of spatial regression.

Fig 7. RMSE of the regression coefficients (model with spatial effects) (x-axis: β0, y-axis: RMSE).

Fig 8. Bias of the regression coefficients (model with spatial effects) (x-axis: β0, y-axis: Bias).

Fig 9 compares the coefficient standard errors. The SEs obtained from Proposed are similar to those from odPoisson for a large β0, while Proposed has smaller SEs for small β0. Fig 10 plots (SE)/(standard deviation of the estimated coefficient values) when N = 200. Unlike odPoisson, whose SEs are severely underestimated for small β0, Proposed estimates SEs reasonably accurately across cases.

Fig 9. Means of the coefficient standard errors (N = 200) (x−axis: β0, y-axis: mean standard error).

Fig 10. Means of (estimated standard error)/(standard deviation of the estimated coefficient values) (N = 200; x−axis: β0, y-axis: RMSE).

Finally, Fig 11 compares the estimation accuracy for the spatially dependent process si. The RMSE values of Proposed are almost identical to those of odPoisson, suggesting the accuracy of our approximation.

Fig 11. RMSE of the estimated spatial effects (x-axis: β0, y-axis: RMSE).

We performed another Monte Carlo experiment assuming group effects, which estimates heterogeneity across groups, instead of the spatially dependent effects. As summarized in S2 Appendix in S1 File, the RMSEs and biases are as small as Poisson and odPoisson for N = 200 and smaller than the two methods for N = 50.

In short, the proposed method provides an accurate and stable approximation for an over-dispersed Poisson MEM.

Computation time comparison

Finally, computation time is compared while varying N∈{1,000, 10,000, 100,000, 300,000} under cases 1 and 2. β0 = 0 and σ2 = 1 are assumed in this section. We use R version 4.0.2 (https://cran.r-project.org/) installed on a Mac Pro (3.5 GHz, 6-Core Intel Xeon E5 processor with 64 GB memory). The gam function in the mgcv package is used for the model estimation.

Under case 1 (basic model), Poisson, odPoisson, and Proposed took 20.1, 116.0, and 1.34 seconds on average, respectively. In case 2 which estimated spatial effects, Proposed is again considerably faster than Poisson and odPoisson, especially for large samples as plotted in Fig 12. The computational efficiency of Proposed is confirmed.

Fig 12. Computation time comparison under case 2 (model with spatial effects).

Results: COVID-19 analysis

Outline

This section applies the developed approximation to an analysis of the COVID-19 (coronavirus disease 2019) pandemic. Since the first case was detected in Wuhan, China, in December 2019, the coronavirus has spread worldwide. As of February 1, 2021, the cumulative number of confirmed cases was 103.41 million, and the confirmed death toll was 2.25 million. To achieve effective infection control for not only COVID-19 but also future pandemics and endemics, it is important to investigate the influencing factors behind the disaster. The data on daily new cases analyzed in this section were provided by JX Press Corporation (https://jxpress.net/).

Fig 13 plots the number of daily cases in Japan between February 1, 2020, and January 29, 2021. The number peaked around April 2020, August 2020, and January 2021. Based on the time trend, we refer to February 1–May 31, 2020, as the first wave, June 1–September 30, 2020, as the second wave, and October 1, 2020–January 29, 2021, as the third wave. Fig 14 displays the spatial plots of the daily new cases by prefecture. This figure shows that the number of infected people tends to be large near Tokyo and Osaka, which are major urban areas.

Fig 13. Daily number of cases across Japan.

Fig 14. Number of cases by prefecture.

We performed a regression analysis exploring the factors influencing the increase/decrease in each wave. The explained variables were the numbers of daily cases by prefecture and by 10-year age group (-19, 20–29, …, 70–79, 80-). The sample sizes were 51,336, 50,508, and 50,094 for the three waves, respectively. Of these, 89.0% (45,696 samples), 77.1% (38,954 samples), and 49.7% (24,873 samples), respectively, were zeros.

For the COVID-19 data, we approximate the following over-dispersed Poisson additive mixed model using our approach:

$$y_i \sim \mathrm{odPoisson}(\lambda_i, \sigma^2), \quad \lambda_i = \exp\left(\beta_0 + x_i \beta_1 + \sum_{l=1}^{4} g_{i,l} + s_i\right), \tag{23}$$

where yi is the number of daily new cases. β0 and β1 are regression coefficients. The model is fitted for each of the three datasets completely separately. To scale the mean function according to the population, the offset variable zi is given by the prefectural population. The explanatory variable xi is the prefectural pedestrian density by day, which is relative to January 13, 2020 (source: Apple Mobility Trends: https://covid19.apple.com/mobility). The density is estimated based on the number of route searches by Apple map users. For further detail, see the source page. gi,l represents the l-th group-wise random effect. We consider the effects by week (gi,1), days of the week (gi,2), generation (gi,3), and prefecture (gi,4). In considering countermeasures, it is important to reveal not only patterns by prefectures but also across prefectures. To estimate this, we include a low rank Gaussian process si which smoothly varies depending on geographic coordinates; we use the geographic center of each prefecture for the modeling. The model was estimated using the mgcv package.

Results

Table 2 summarizes the estimated parameters. The estimated coefficients of pedestrian density are positive and statistically significant in the first and second waves. Self-restraint was estimated to reduce the number of cases in the early periods. Based on the estimated dispersion parameter (σ2), the number of cases was over-dispersed, and the tendency became stronger over time.

Table 2. Parameter estimates.

See Fig 15 for the fitting on the number of cases.

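| Parameter | First wave: Est. | First wave: t value | Second wave: Est. | Second wave: t value | Third wave: Est. | Third wave: t value |
|---|---|---|---|---|---|---|
| Const (β0) | −18.03 | −84.71 *** 1) | −17.91 | −84.13 *** | −14.64 | −68.79 *** |
| Pedestrian density (β1) | 0.31 | 4.26 *** | 1.30 | 18.01 *** | 0.09 | 1.29 |
| Dispersion parameter (σ2) | 1.29 | | 2.22 | | 3.33 | |

1) *** denotes statistical significance at the 1% level.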

Fig 15. Comparison of the observed and predicted number of cases.

Fig 16 plots the estimated group effects by week, day of the week, and generation. The estimated week-wise effects show that the increase in cases lasts longer in the third wave. Control of the infection spread might have become more difficult over the waves. Regarding the days of the week, Monday has the lowest value, while Thursday, Friday, and Saturday have higher values. The difference is attributable to operational reasons such as the closing of hospitals and PCR test sites. The estimated generation effects differ considerably across waves. In the first wave, people in the working generation (the 20s–50s) tend to be infected. Commuting and/or meeting in the office might trigger the infection. In the second wave, the 20s group has a stronger tendency to be infected than the elderly; therefore, more self-restraint is needed. In the third wave, not only the 20s but also the 30s–50s have high chances of being infected. Infection might have spread again across the working generation.

Fig 16. Estimated group effects (week, days of the week, generation).

Fig 17 plots the estimated prefecture-wise independent effects and spatially dependent effects. The former estimates local hotspots while the latter, global hotspots. The estimated prefecture-wise effects suggest that prefectures including major cities (Tokyo, Osaka, Fukuoka) and Hokkaido are local hotspots. More countermeasures might be required in these prefectures. On the other hand, based on the estimated spatially dependent effects, there is a global hotspot around Tokyo, and the influences grow over waves. Control of the infection spread from Tokyo might have been important to mitigate the third wave.

Fig 17. Estimated group effects by prefecture (top) and spatially dependent effects (bottom).

Discussion

This study develops a practical log-Gaussian approximation for Poisson regression models. Considering its simplicity, stability, and computational efficiency, it will be useful for researchers as well as practitioners.

Exploring the expandability of our approach is an important future task. For example, our approach might be useful for spatial and spatiotemporal interpolation of count data by combining it with Gaussian process models without additional computation and implementation costs. Our approach might also be useful for fast count data assimilation by combining it with a state-space model. Exploring such extensions will be an interesting research endeavor.

Supporting information

S1 File

(DOCX)

Data Availability

The R codes used in the “Results: Monte Carlo experiments” and “Results: COVID-19 analysis” sections are available from https://github.com/dmuraka/SimplePoissonApprox_MCsim. The COVID-19 data, which are used in the “Results: COVID-19 analysis” section, are owned by JX Press Corporation (https://jxpress.net/; Japanese-language website only) and cannot be shared publicly because they are proprietary. Anyone can purchase the data from the company, and the authors of this study had no special access privileges to the data.

Funding Statement

This work was supported by JSPS KAKENHI Grant Numbers 17H02046, JP18H01556, and 18H03628, and JST-Mirai Program Grant Number JP1124793, Japan; all grants were awarded to DM.

References

  • 1. Soomro K, Bhutta MNM, Khan Z, Tahir MA. Smart city big data analytics: An advanced review. Wiley Interdiscip Rev Data Min Knowl Discov. 2019;9(5):e1319.
  • 2. Viner RM, Russell S, Croker H, Packer J, Ward J, Stansfield C, et al. School closure and management practices during coronavirus outbreaks including COVID-19: a rapid systematic review. Lancet Child Adolesc Health. 2020;4(5):397–404. doi: 10.1016/S2352-4642(20)30095-X
  • 3. Oztig LI, Askin OE. Human mobility and coronavirus disease 2019 (COVID-19): a negative binomial regression analysis. Public Health. 2020;185:364–7. doi: 10.1016/j.puhe.2020.07.002
  • 4. Vokó Z, Pitter JG. The effect of social distance measures on COVID-19 epidemics in Europe: an interrupted time series analysis. GeroScience. 2020;42(4):1075–82. doi: 10.1007/s11357-020-00205-0
  • 5. Wakefield J. Disease mapping and spatial regression with count data. Biostatistics. 2007;8:158–83. doi: 10.1093/biostatistics/kxl008
  • 6. Lee D, Neocleous T. Bayesian quantile regression for count data with application to environmental epidemiology. J Roy Stat Soc C. 2010;59(5):905–20.
  • 7. Ver Hoef JM, Boveng PL. Quasi‐Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology. 2007;88(11):2766–72. doi: 10.1890/07-0043.1
  • 8. Lindén A, Mäntyniemi S. Using the negative binomial distribution to model overdispersion in ecological count data. Ecology. 2011;92(7):1414–21. doi: 10.1890/10-1831.1
  • 9. Osgood DW. Poisson-based regression analysis of aggregate crime rates. J Quant Criminol. 2000;16(1):21–43.
  • 10. Piza EL. Using Poisson and negative binomial regression models to measure the influence of risk on crime incident counts. Rutgers Center on Public Security. 2012.
  • 11. Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J Roy Stat Soc B. 2011;73(1):3–36.
  • 12. Rodríguez-Álvarez MX, Lee DJ, Kneib T, Durbán M, Eilers P. Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm. Stat Comput. 2015;25(5):941–57.
  • 13. Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS. New York: Springer; 2000.
  • 14. Diggle PJ, Tawn JA, Moyeed RA. Model‐based geostatistics. J Roy Stat Soc C. 1998;47(3):299–350.
  • 15. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J Roy Stat Soc B. 2009;71(2):319–92.
  • 16. Silva JS, Tenreyro S. On the existence of the maximum likelihood estimates in Poisson regression. Econ Lett. 2010;107(2):310–2.
  • 17. Correia S, Guimarães P, Zylkin T. Verifying the existence of maximum likelihood estimates for generalized linear models. ArXiv. 2019;1903.01633.
  • 18. Breslow NE. Extra‐Poisson variation in log‐linear models. J Roy Stat Soc C. 1984;33(1):38–44.
  • 19. El-Sayyad GM. Bayesian and classical analysis of Poisson regression. J Roy Stat Soc B. 1973;35(3):445–51.
  • 20. Chan AB, Dong D. Generalized Gaussian process models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2011;5995688:2681–8.
  • 21. Wood SN. Generalized Additive Models: An Introduction with R. Boca Raton: CRC Press; 2017.
  • 22. Chan AB, Vasconcelos N. Counting people with low-level features and Bayesian regression. IEEE Trans Image Process. 2011;21(4):2160–77. doi: 10.1109/TIP.2011.2172800
  • 23. Bellego C, Pape LD. Dealing with logs and zeros in regression models. SSRN: 3444996 [Preprint]. 2019. Available from: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3444996.
  • 24. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika. 1974;61(3):439–47.

Decision Letter 0

Luca Citi

26 May 2021

PONE-D-21-12939

Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19

PLOS ONE

Dear Dr. Murakami,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular:

  • Improve clarity of the manuscript

  • Clarify why the reported main motivation is improved efficiency and accuracy for Bayesian Poisson regression but Bayesian modelling does not appear anywhere in the paper

  • Clarify the meaning of "posterior variance" etc.

  • Explain and motivate mismatch between the model used to simulate data and the model used to analyse it

  • Clarify the reason for the “extra” 0.5/(Y+0.5) in (6)

  • Revise terminology (e.g. zero-inflated Poisson, posterior variance, etc)

  • Poisson regression with random effects/spatial effects is already very well developed, so approximations such as this one are not strictly necessary. Please demonstrate that the proposed solution is much more efficient.

Please submit your revised manuscript by Jul 10 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Luca Citi, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include the data sources used in the Data availability statement and Methods section. We note that the source of the COVID-19 data appears to be unclear in the Methods section. Please also indicate in the Data availability statement whether you are able to openly share the code used, and if so, where others can access this.

3. We note that Figure(s) 10 and 13 in your submission contain map images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a) You may seek permission from the original copyright holder of Figure(s) 10 and 13 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b)  If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments:

  • Please clarify the meaning of, e.g., "posterior variance". In a Bayesian setting "prior" and "posterior" usually refer to distributions of the parameters. So when talking about a Poisson distribution, the "posterior" would be a distribution of lambda (given one or more observations y), not of the Poisson RV "y" as apparently used in the paper. (See for example https://stats.stackexchange.com/a/26225).

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors need to use more than two explanatory variables, for example p = 4 and 8.

Reviewer #2: You propose an approximation for modeling Poisson or overdispersed (OD) Poisson data, evaluate some of its properties by simulation, and apply it to a prefecture-level data set on new COVID-19 cases. Modern computing has made it possible to fit generalized linear mixed models, including models with spatial and temporal correlation, so approximations such as yours are not strictly necessary. You suggest that your approximation is more computationally efficient. I agree this is likely, but you never demonstrate how much more efficient it is.

I found it difficult to follow some parts of your presentation because of word choice and other issues, perhaps related to translation into English. As a result, some of my concerns may simply reflect an unclear presentation. However, I have three major concerns about your approach.

I don’t follow where the “extra” 0.5/(Y+0.5) in (6) comes from (lines 92-93). If Z ~ logN(m, v), then by definition log Z ~ N(m, v). L 92 looks like you are computing E Z = exp(m + v/2), so the “extra” 0.5/(y+0.5) is v/2. But the expectation in L 92 is not the equivalent of E Z; it is the equivalent of E log Z, which equals m. Said another way, it looks like you are computing log E Z, which does not equal E log Z. You want E log Z to get to equation (6). The choice of constant in log(Y+c) does matter. One example, where X is transformed, is Ekwaru and Veugelers, 2018, Stat BioPharm Res, 10:26-29.
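The distinction drawn in this comment can be written out explicitly. The following is only a sketch: it assumes, as inferred from the comment itself rather than from the manuscript, that the variance of the approximating log-Gaussian is v = 1/(y + 0.5).

```latex
% Assumption (inferred from the reviewer's comment): Z ~ logN(m, v),
% i.e. \log Z \sim N(m, v), with v = 1/(y + 0.5).
\begin{align*}
\mathrm{E}[\log Z] &= m, \\
\log \mathrm{E}[Z] &= m + \frac{v}{2} = m + \frac{0.5}{y + 0.5}.
\end{align*}
% The two quantities differ by v/2, which under this reading is the "extra" term.
```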

I also could not follow or do not agree with your variance manipulations. This has two aspects.

1) Line 79 gives, according to your text, an approximation of the “posterior variance of a Poisson distribution”, citing El-Sayyad 1973. I did not read El-Sayyad and do not understand what you mean by “posterior”. Is 1/(y+c) an approximate variance for a Poisson random deviate or an approximate variance for a log-transformed Poisson random deviate?

2) There appears to be a mismatch between your OD variance model and the mechanism used to simulate OD Poisson data. It was not clear in the main text how you simulated from an OD Poisson distribution. I finally found the answer in Appendix S1, where you simulate Y ~ Pois(lambda), log lambda = mu + Z, Z ~ N(0, sigma^2). That generates observations from a log-normal Poisson mixture for which Var Y = mu + (exp(sigma^2)-1) mu^2. You give your OD variance model as Var log Y = sigma^2/(y+c) approximately. That is not the same variance model as the variance pattern of your simulated data. Applying the Jacobian of the log Y transformation to the log-normal variance, you get Var log Y approx. = 1/mu + (exp(sigma^2)-1). I presume El-Sayyad then uses a plug-in estimate of mu to estimate 1/mu as 1/(y+c). These two expressions are the same for Poisson data, i.e. when sigma^2 = 0, but not the same for OD Poisson data.
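The variance pattern described in this comment can be checked numerically. The sketch below is illustrative only and is not the authors' simulation code: the parameter values, the function names, and the Knuth-style Poisson sampler are all assumptions introduced here, and the mu in the variance formula is read as the marginal mean E[Y] = exp(mu + sigma^2/2).

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by multiplying uniforms (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_lognormal_poisson(mu, sigma2, n, seed=1):
    """Y ~ Pois(lambda), log lambda = mu + Z, Z ~ N(0, sigma2)."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    return [poisson_draw(rng, math.exp(mu + rng.gauss(0.0, sd))) for _ in range(n)]


def mixture_moments(mu, sigma2):
    """Marginal mean and variance of the log-normal Poisson mixture."""
    mean = math.exp(mu + sigma2 / 2.0)
    var = mean + (math.exp(sigma2) - 1.0) * mean ** 2
    return mean, var


def delta_var_log(mean, sigma2):
    """First-order (delta-method) variance of log Y, i.e. Var(Y) / mean^2."""
    return 1.0 / mean + (math.exp(sigma2) - 1.0)


# Hypothetical parameter values, chosen only for illustration.
mu, sigma2 = 1.0, 0.25
y = simulate_lognormal_poisson(mu, sigma2, n=50_000)
theory_mean, theory_var = mixture_moments(mu, sigma2)
sample_mean = statistics.fmean(y)
sample_var = statistics.variance(y)
print("mean:     simulated", round(sample_mean, 3), "theory", round(theory_mean, 3))
print("variance: simulated", round(sample_var, 3), "theory", round(theory_var, 3))
print("delta-method Var(log Y):", round(delta_var_log(theory_mean, sigma2), 3))
```

With sigma2 = 0 the delta-method expression reduces to 1/mean, the plain Poisson case; for sigma2 > 0 it carries an additive term that a multiplicative model of the form sigma^2/(y+c) does not reproduce.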

You emphasize the computational efficiency. Please provide data on that. Compare the time to fit a model using a log-Gaussian approximation to the time required to fit a Poisson or OD Poisson model. The OD Poisson model could be fit by adding an observation specific random effect (to generate a log-normal Poisson distribution).

Details:

Please give this a careful read for English usage and word choice. Three examples: line 33, where I presume you meant “collected” not “corrected”; the usual interpretation of “zero-inflated”; and L 226, where a “decade” is ten years. There are others throughout the manuscript.

L 45. I don’t understand the “lack of contiguity”. In the usual setup for Poisson count data with latent variables, the Poisson distribution is the data model conditional on the latent observation-specific mean. That latent variable has support on the non-negative half line, so a log-normal or gamma distribution for the latent variable is completely compatible. I believe you’re thinking about approximating a discrete Poisson distribution by a continuous distribution, which has nothing to do with Bayesian Poisson regression.

L 75: Do you mean “prior” here or are you really talking about random effects?

L 99, 108: “sample weight” has two contrasting uses: Variance = weight * constant and weighted SS = Sum weight * (y - \hat Y)^2. The weighted SS use is more frequent, because that’s how weights are specified in most (all?) software. You’re using the first definition. I suggest you explain carefully, avoiding the term weight, if you retain 1/(y+0.5). Or, define the weight as y+0.5.

L 101: If the “extra” 0.5/(y+0.5) in (6) is correct and your variance model is assumed correct, notice that adding overdispersion will change the “extra” term to 0.5 \sigma^2/(y+0.5). The weights won’t change because they are unaffected by a constant multiplier to the variance.

Table 1. It would be useful to add your overdispersed approximation to this table.

L 142. 500 simulated data sets seems rather small to provide a reasonably precise estimate of the rMSE, especially when it seems some estimates are “wild”. Please either provide an estimate of the Monte-Carlo variance to demonstrate that your results from 500 simulations are sufficiently precise or increase the number of simulations (1000, or substantially more if the distribution of estimates is very skewed).
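One way to report the Monte Carlo precision requested here is a delta-method standard error for the rMSE. The sketch below assumes that `errors` holds the per-replicate differences between an estimate and its true value; the function name and the approach are illustrative and are not taken from the manuscript.

```python
import math
import statistics


def rmse_with_mc_se(errors):
    """Return (rMSE, approximate Monte Carlo standard error) from per-replicate errors."""
    sq = [e * e for e in errors]
    mse = statistics.fmean(sq)
    rmse = math.sqrt(mse)
    # Standard error of the MSE across replicates.
    se_mse = statistics.stdev(sq) / math.sqrt(len(sq))
    # Delta method: d sqrt(x) / dx = 1 / (2 sqrt(x)).
    se_rmse = se_mse / (2.0 * rmse) if rmse > 0 else 0.0
    return rmse, se_rmse


# Hypothetical errors from a handful of replicates, for illustration only.
rmse, se = rmse_with_mc_se([0.10, -0.05, 0.30, -0.20, 0.02])
print("rMSE:", round(rmse, 4), "Monte Carlo SE:", round(se, 4))
```

A few extreme replicates inflate the spread of the squared errors, so the reported standard error grows in exactly the “wild” situation this comment describes.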

L 148: Please be careful about your terminology. Equation (9) is not generating zero-inflated Poisson data. It is generating data from a Poisson distribution with mean close to 0, so there is a high probability of observing 0. A zero-inflated Poisson model is a mixture model. One mixture component is a point mass at 0, with a mixture probability of p0. The second mixture component is a draw from a Poisson distribution, with mixture probability of 1-p0. Similar issue on L 227. The counts may be zero inflated, but reporting the %0’s doesn’t tell you that.
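The difference between the two data-generating mechanisms can be made concrete with their probability mass functions. This is a minimal sketch with hypothetical parameter values, chosen so that a low-mean Poisson and a zero-inflated Poisson share the same proportion of zeros.

```python
import math


def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, p0, lam):
    """Zero-inflated Poisson: point mass at 0 with prob. p0, Poisson(lam) with prob. 1 - p0."""
    base = (1.0 - p0) * poisson_pmf(k, lam)
    return p0 + base if k == 0 else base


# Hypothetical values: a plain Poisson with a small mean ...
lam_low = 0.2
p_zero = poisson_pmf(0, lam_low)
# ... and a ZIP whose mixing probability is solved to give the same share of zeros.
lam_zip = 3.0
p0 = (p_zero - math.exp(-lam_zip)) / (1.0 - math.exp(-lam_zip))

print("share of zeros, Poisson:", round(p_zero, 4))
print("share of zeros, ZIP:    ", round(zip_pmf(0, p0, lam_zip), 4))
print("mean, Poisson:", lam_low, " mean, ZIP:", round((1.0 - p0) * lam_zip, 4))
```

Both models produce the same percentage of zeros yet differ in their nonzero counts, which is why that percentage alone cannot establish zero inflation.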

L 181. Please cite one of the papers describing mgcv instead of linking to the CRAN page. See the information provided by citation('mgcv').

L 192. Small se’s are good only when they still correctly describe the uncertainty in the estimate. One way to check this is to compare the average estimated variance of a coefficient = average squared standard error to the observed variance of an estimate over simulations.
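The check suggested in this comment can be sketched as a ratio of the average squared standard error to the empirical variance of the estimates over simulations. The toy estimator below is a sample mean, used only because its standard error is known to be well calibrated; it is a stand-in and not the approximation evaluated in the manuscript, and all names are hypothetical.

```python
import math
import random
import statistics


def se_calibration_ratio(estimates, std_errors):
    """Mean squared SE divided by the empirical variance of the estimates (about 1 if calibrated)."""
    mean_sq_se = statistics.fmean([se * se for se in std_errors])
    return mean_sq_se / statistics.variance(estimates)


# Toy simulation: estimate a normal mean from n observations, repeated many times.
rng = random.Random(2021)
n, replicates = 30, 2000
estimates, std_errors = [], []
for _ in range(replicates):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    estimates.append(statistics.fmean(sample))
    std_errors.append(statistics.stdev(sample) / math.sqrt(n))

ratio = se_calibration_ratio(estimates, std_errors)
print("calibration ratio:", round(ratio, 3))
```

A ratio well below 1 would indicate standard errors that understate the true sampling variability, which is the situation the comment warns against.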

L 221/2, Figures 10 and 11. The figure numbering seems reversed from what is described in the text.

L 225: “generation” – please be consistent in your choice of names. I believe this is what was called wave in lines 216-217. Ahh (based on text later on): generation is age group.

L 225: Please emphasize that you are doing completely separate analyses for each wave of data. One could look at “determinants within each wave” by fitting a single model with a wave*determinant interaction.

L 225: determinants (plural) seems an overstatement. Your model only looked at pedestrian density.

L 226: This is spatio-temporal data. Please explain carefully what data were used in the analysis. The text says “every decade”. Was this one day’s data, the sum over a series of days within “the decade”, all the daily observations within “the decade”, or something else?

L 237 – 239: The spatial component of the data is per prefecture. Why include both prefecture and the spatial random effect in the model? Prefecture will capture the only spatial aspect of the data, so I don’t see where there is residual spatial dependence.

L 240: Why use the spmoran package instead of gamm() in mgcv? GAMM is the spatial method you evaluated in the simulation. If you need to use Moran Eigenvector Maps to model the spatial dependence, please explain the reason.

Reviewer #3: The authors have carried out some commendable work on approximating the Poisson likelihood using a mode-matching log-Gaussian approach. However, the main motivation, as stated in the abstract and introduction, is improved efficiency and accuracy for Bayesian Poisson regression, yet Bayesian modelling does not appear anywhere in the paper. Throughout the Results section (including two simulations and the Japanese COVID-19 data analysis), the proposed approximation is used only for frequentist modelling via maximum likelihood, and no Bayesian analysis is considered. Given that frequentist estimation and inference for Poisson regression with random effects/spatial effects is already very well developed via the mgcv and lme4 packages in R, the advancement offered by the proposed approximation is minimal.

Specific comments include:

Page 2 line 33: "corrected" = "collected"

Page 2 line 45: "contiguity" should be "conjugacy"?

Page 3 Equation (2): the observation y appears on both sides of expression (2). It is mentioned in the following paragraph that "1/(y+c) approximates the posterior variance of the Poisson distribution" but what is this the posterior variance of? Is it the variance of the posterior distribution of P(lambda|y)?

Page 4 Equation (8): the expression here is not limited to over-dispersion. Some discussion (or simulations) based on under-dispersed counts would be interesting in this paper.

Page 6 line 135: how is the over-dispersed Poisson simulated in your paper? Did you use the Neg-Bin, or Poisson-Gamma, or some other simulation mechanism?

Page 6 Equations (10): Would it be better to replace the RMSE by a relative RMSE, by dividing the squared difference by beta_k?

All the figures are very low resolution: I would recommend using a different image format for your figures.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 1

Luca Citi

6 Sep 2021

PONE-D-21-12939R1

Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19

PLOS ONE

Dear Dr. Murakami,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular, there are a number of important outstanding issues raised by reviewer 2 about the derivation that should be addressed in the next submission.

Please submit your revised manuscript by Oct 21 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

We look forward to receiving your revised manuscript.

Kind regards,

Luca Citi, PhD

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: No

**********

6. Review Comments to the Author

Reviewer #1: The authors have adequately addressed all the comments raised in the previous round of review.

Reviewer #2: My comments focus on the adequacy of your response to four issues raised in the first review. You have made an appropriate and major response to two issues, appear to misrepresent a third, and have ignored the fourth. Specifically:

Thank you for the major revision to lines 74 – 139. That makes it much easier to understand your derivation. Unfortunately, it also makes it possible to identify places where the derivation is not done carefully enough. Further comments on this are in the details.

Thank you also for including computing time information. That is helpful.

I remain very concerned about the mismatch between the distributions assumed in the analysis and those used to simulate data. You cite Gsteiger et al to support using a negative binomial distribution to simulate data from the quasi-poisson distribution (i.e. where variance = lambda_i sigma^2). Where does Gsteiger say this? The text below equation (10) in section 2.2 very clearly says variance = mu + k mu^2, the usual expression for negative binomial variance. As I said in the first review, these are two very different variance patterns.
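The two variance patterns contrasted in this comment can be placed side by side. This is a sketch using the symbols from the comment, with k denoting the negative binomial dispersion parameter.

```latex
% Quasi-Poisson: variance proportional to the mean.
% Negative binomial: variance quadratic in the mean.
\begin{align*}
\text{quasi-Poisson:} \quad & \mathrm{Var}(y_i) = \sigma^2 \lambda_i,
  & \mathrm{Var}(y_i)/\lambda_i &= \sigma^2, \\
\text{negative binomial:} \quad & \mathrm{Var}(y_i) = \mu_i + k \mu_i^2,
  & \mathrm{Var}(y_i)/\mu_i &= 1 + k \mu_i.
\end{align*}
```

The variance-to-mean ratio is constant under the quasi-Poisson model but increases with the mean under the negative binomial, so the two agree only over a narrow range of means.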

I repeat my previous concern about “too small” SEs. A small SE is good only when that SE is correctly estimated. You claim you can’t evaluate the performance of the SE estimator. I don’t understand why not. If the obstacle is numerical issues, doesn’t this raise concerns about the practical use of your approximation?
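One common way to check an SE estimator is to compare the average estimated SE with the empirical standard deviation of the estimates across simulated data sets. The sketch below is a toy illustration of that check, not the authors' method: it assumes an intercept-only model, uses a naive Poisson SE that ignores over-dispersion, and generates counts from a gamma-Poisson mixture with variance k * mu. All names and parameter values are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for small to moderate lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def overdispersed_sample(rng, mu, k, n):
    """n gamma-Poisson counts with mean mu and variance k * mu (requires k > 1)."""
    shape, scale = mu / (k - 1.0), k - 1.0
    return [poisson_draw(rng, rng.gammavariate(shape, scale)) for _ in range(n)]


rng = random.Random(1)
mu, k, n, replicates = 5.0, 3.0, 50, 400  # assumed values for illustration

estimates, naive_ses = [], []
for _ in range(replicates):
    y = overdispersed_sample(rng, mu, k, n)
    mean_hat = sum(y) / n
    estimates.append(mean_hat)
    naive_ses.append(math.sqrt(mean_hat / n))  # Poisson SE, ignores dispersion

grand_mean = sum(estimates) / replicates
empirical_sd = math.sqrt(
    sum((e - grand_mean) ** 2 for e in estimates) / (replicates - 1)
)
mean_naive_se = sum(naive_ses) / replicates

# A calibrated SE estimator gives a ratio near 1; a ratio well below 1
# means the reported SEs are too small.
se_ratio = mean_naive_se / empirical_sd
print(f"mean estimated SE = {mean_naive_se:.3f}, empirical SD = {empirical_sd:.3f}, "
      f"ratio = {se_ratio:.2f}")
```

In this toy setting the ratio should fall near 1/sqrt(k), which shows how a small SE can simply reflect an unmodelled variance component rather than a more precise estimator.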

Specific details:

Equation (2) doesn’t make probabilistic sense because y_i occurs in the specification of the distribution. Technically, you are giving the distribution of y_i conditional on y_i, which is meaningless. Similar issues occur through (6). A derivation could be done more carefully by using E y in the variance expression then substituting y_i as a “plug-in” estimator of E y_i at the end. That will clarify what is being assumed by the proposed approximation. Many do not like plug-in estimators, so you are free to find another way to derive your results that avoids a distribution of Y conditional on Y.

Please check the derivation of equation (7). You seem to be using Var y^* = exp(mu + v), not exp(mu+v/2). The issue of y_i occurring in the variance expression is even more important here.
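For reference, the standard moments of a log-normal variable are stated below. The symbols mu and v follow the comment above; how they map onto the manuscript's equation (7) is an assumption, since that equation is not reproduced here.

```latex
\log y^{*} \sim N(\mu, v)
\;\Longrightarrow\;
\mathrm{E}[y^{*}] = \exp\!\left(\mu + \tfrac{v}{2}\right),
\qquad
\mathrm{Var}[y^{*}] = \left(e^{v} - 1\right)\exp\!\left(2\mu + v\right).
```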

L 125. Don’t you apply (6) with probability 1-r and (8) with probability r, where r is the proportion of 0 counts?

L 128: lines 125-126 suggest a finite mixture of multiplicative constants, i.e. that from (6) with probability (1-r) and that from (8) with probability r. That does not lead to the definition of y^+ given here. This definition appears out of thin air. It does match (6) when r=0 and (8) when r=1, but that is neither a probabilistic nor a statistical justification for that definition.

L 135: what happened to r in the definition of the response variable? This is probably a typo because the analogous expression in table 1 includes r.

L 157: Yes, Laplace is often good but it doesn’t always work. There is a substantial literature on better approximations, e.g. numerical quadrature with more quadrature points.

L 203: Please be careful not to overstate your results. Estimates may be close on average, but that does not imply they are accurate. I am especially concerned about B0, which seems (Figure 2) just as biased for N=200 as it is for N=50. If you consider an even simpler situation with a constant mean, inspection of equation (6) shows that estimates of B0 are not unbiased and not consistent, because of the approximation.

L 210: What is sigma^2 here? I.e., what distribution is that a parameter of? If a negative binomial distribution is used to simulate data, I strongly suspect that data simulated with sigma^2 = 1 are still overdispersed, based on my second major comment about variance models.

L 315: overall, age-group is much clearer than generation. Thank you for changing that wording. Here is one lingering use of generation.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 2

Luca Citi

18 Nov 2021

Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19

PONE-D-21-12939R2

Dear Dr. Murakami,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Luca Citi, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have adequately addressed all the comments that were raised in a previous round of review.

Reviewer #2: Thank you again for a substantial revision that has improved the manuscript. You have addressed all my concerns.

Details:

Line 186 in revision 1: Thank you for the clarification. I had not seen this method of simulating counts with Var = k*mu before. That’s really nice.
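One construction that yields counts with Var = k*mu is a negative binomial (gamma-Poisson) draw whose size parameter is tied to the mean, size = mu/(k-1), so that mu + mu^2/size = k*mu. Whether this is exactly the procedure used at line 186 of the manuscript is an assumption; the sketch below only illustrates the identity, and the function names and the values mu = 5, k = 3 are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for small to moderate lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def overdispersed_count(rng, mu, k):
    """Gamma-Poisson draw with mean mu and variance k * mu (requires k > 1)."""
    shape = mu / (k - 1.0)  # negative binomial "size", tied to the mean
    scale = k - 1.0         # gamma mean = shape * scale = mu
    lam = rng.gammavariate(shape, scale)
    return poisson_draw(rng, lam)


rng = random.Random(0)
mu, k = 5.0, 3.0  # assumed values for illustration
draws = [overdispersed_count(rng, mu, k) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"mean = {sample_mean:.2f}, variance = {sample_var:.2f}, "
      f"variance / mean = {sample_var / sample_mean:.2f}")
```

Because the size parameter changes with mu, the variance-to-mean ratio stays at k for every observation, unlike a negative binomial with a fixed size, where the ratio grows with the mean.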

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Zakariya Y. Algamal

Reviewer #2: No

Acceptance letter

Luca Citi

23 Dec 2021

PONE-D-21-12939R2

Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19

Dear Dr. Murakami:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Luca Citi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)

    Attachment

    Submitted filename: response_v3.docx

    Attachment

    Submitted filename: draft_MM_v5_response.docx

    Data Availability Statement

    The R code used in the “Results: Monte Carlo experiments” and “Results: COVID-19 analysis” sections is available from https://github.com/dmuraka/SimplePoissonApprox_MCsim. The COVID-19 data, which are used in the “Results: COVID-19 analysis” section, are owned by JX Press Corporation (https://jxpress.net/; the website is available in Japanese only) and cannot be shared publicly because they are proprietary. Anyone can purchase the data from the company, and the authors of this study had no special access privileges to the data.


    Articles from PLoS ONE are provided here courtesy of PLOS
