Journal of Applied Statistics. 2020 Apr 3;48(5):786–803. doi: 10.1080/02664763.2020.1748581

Copula-based Markov zero-inflated count time series models with application

Mohammed Alqawba a,b, Norou Diawara b
PMCID: PMC9041618  PMID: 35707445

ABSTRACT

Count time series data with excess zeros are observed in several applied disciplines. When these zero-inflated counts are recorded sequentially, they may exhibit serial dependence. Ignoring the zero-inflation and the serial dependence might produce inaccurate results. In this paper, Markov zero-inflated count time series models based on a joint distribution on consecutive observations are proposed. The joint distribution function of the consecutive observations is constructed through copula functions. First- and second-order Markov chains are considered with the univariate margins of zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), or zero-inflated Conway–Maxwell–Poisson (ZICMP) distributions. Under the Markov models, bivariate copula functions such as the bivariate Gaussian, Frank, and Gumbel are chosen to construct a bivariate distribution of two consecutive observations. Moreover, the trivariate Gaussian and max-infinitely divisible copula functions are considered to build the joint distribution of three consecutive observations. Likelihood-based inference is performed and asymptotic properties are studied. To evaluate the estimation method and the asymptotic results, simulated examples are studied. The proposed class of models is applied to a sandstorm counts example. The results suggest that the proposed models have some advantages over some of the models in the literature for modeling zero-inflated count time series data.

KEYWORDS: Copula, integer-valued time series, Conway–Maxwell–Poisson, Markov process, negative binomial, Poisson, zero-inflation

1. Introduction

Zero-inflated count time series arise in several fields such as environmental sciences, public health, and economics. Examples include monthly counts of sandstorms in some areas, cases of rare diseases with low infection rates, and crimes such as arson. In these cases, the observed counts may include a considerable frequency of zeros. However, during certain seasons, the counts can take larger values. Additionally, these zero-inflated counts are usually autocorrelated when the data are collected over time. Overlooking the frequent occurrence of zeros and the serial correlation could lead to false inference. In many real-life time series examples, the series are not stationary and exhibit trend and seasonal features. Figure 1 shows an example of such a time series. It displays the monthly counts of strong sandstorms recorded by the AQI airport station in Eastern Province, Saudi Arabia. One can see the frequent occurrence of zeros in the distribution of the counts, serial dependence, a decreasing trend, and seasonality in the series. Standard time series models fail to account for such features. Motivated by these problems, we propose and develop a class of copula-based Markov time series models for zero-inflated counts in the presence of covariates.

Figure 1.

Time series plot of monthly count of sandstorms, bar-plot of distribution of sandstorm counts, autocorrelation function, and the circular plot.

Zero-inflated regression models for independent counts were first proposed by Lambert [16] via a generalized linear model (GLM), assuming the counts follow a zero-inflated Poisson (ZIP) distribution. Later on, other distributional assumptions were proposed, such as the zero-inflated negative binomial (ZINB) distribution in [23] and the zero-inflated Conway–Maxwell–Poisson (ZICMP) distribution in [24]. Recently, modeling serially dependent zero-inflated counts has gained more attention. The authors of [7] proposed a regression ZIP model to analyze air pollution-related emergency room visit counts. Under the integer-valued generalized autoregressive conditional heteroscedastic (INGARCH) class of models, Zhu [30] and Gonçalves et al. [4] proposed zero-inflated INGARCH models with different distributional assumptions. Yang et al. [28] proposed state-space models to handle a time series of zero-inflated counts with both ZIP and ZINB distributions. Assuming the sequence of zero-inflated counts followed a Markov process, Wang [27] introduced Poisson regression models to fit daily numbers of phone calls. Yang et al. [29] also fit Markov models to public health surveillance data for syphilis using a partial likelihood approach.

In this paper, copula-based models are proposed. Copulas are multivariate distributions with uniform margins on the unit interval. There is ample research on how they have been used to capture the correlation structures of continuous time series data (for example, see [6,9,13,21]). Due to computational complexity, there is less literature for count time series data than for continuous time series data. Joe [14] reviewed Markov models with copula-based transition probabilities for time series of counts. Masarotto and Varin [18] introduced marginal regression models for count time series data with the serial dependence being captured by a Gaussian copula. Jia et al. [11] and Lennon and Yuan [17] applied the same models but suggested different estimation methods. Alqawba et al. [2] extended the models proposed by Masarotto and Varin [18] to include a class of models that accommodates zero-inflation in the time series of counts.

Here, we extend the work done by Joe [14] on constructing the count time series models through copula-based joint distributions of consecutive observations by including a class of models that accounts for zero-inflation in the count time series. An important advantage of copula-based Markov models is that they help avoid some strict distributional assumptions on the marginals, such as the infinite divisibility condition [14]. The latter condition is not necessarily satisfied when we assume the counts follow a zero-inflated distribution. In addition, the copula-based models extend nicely to non-stationary processes through time-varying parameters in the univariate margins of the ZIP, ZINB, and ZICMP distributions proposed in this paper.

This paper is organized as follows. In Section 2, we provide background on the theories needed to construct the proposed class of Markov models. We briefly describe the zero-inflated regression models following the ZIP, ZINB, and ZICMP distributions, respectively. We then define copula functions. In Section 3, we present the proposed copula-based Markov models and discuss in detail the first- and second-order Markov models. In Section 4, we describe the statistical inference method applied via likelihood inference and provide asymptotic results of the maximum likelihood estimates. Section 5 presents simulation studies. Then, the proposed models are used to analyze the Sandstorm counts in Section 6. We end the paper with a summary in Section 7.

2. Background

A time series is a sequence of random variables, $\{Y_t\}$, taken over equally spaced discrete times $t = 1, \ldots, n$ for some fixed $n$. The observed values, $\{y_t\}$, are usually referred to as the realization of the stochastic process $\{Y_t\}$. An important feature of a time series is that consecutive observations are serially dependent. Hence, time series analysis is concerned with accounting for the serial dependence. A complete description of a time series process is the joint multivariate distribution of the observed values, which is

$$F(y_1, \ldots, y_n) = \Pr(Y_1 \le y_1, \ldots, Y_n \le y_n). \qquad (1)$$

However, constructing multivariate distributions is quite challenging, especially when the time series is discrete and exhibits unusual behavior, such as zero inflation. As mentioned in the introduction, the main objective of this paper is to build a multivariate distribution as given in (1) to describe the zero-inflated count time series through a class of copula-based Markov models. Next, we discuss appropriate margins for these zero-inflated counts and their multivariate distributions via copula theory.

2.1. Zero-inflated count regression models

In this section, we revisit the zero-inflated regression models under different distributional assumptions. First, we examine the ZIP regression model with independent counts [16]. Then we consider other distributional assumptions, such as the ZINB distribution [23] and the ZICMP distribution [24], which can accommodate over-dispersion and, when the latter is chosen, under-dispersion as well.

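Suppose $Y_t$ denotes a random count at time $t$ with probability mass function (pmf) $f_t$ and cumulative distribution function (cdf) $F_t$.

  1. ZIP: $\omega_t$ zero-inflation parameter, $\lambda_t$ intensity parameter, and
    $$f_t(y_t) = \omega_t I_{\{y_t = 0\}} + (1 - \omega_t)\, \frac{e^{-\lambda_t} \lambda_t^{y_t}}{y_t!},$$
    where $I_{\{y_t = 0\}}$ is the indicator function, $\omega_t \in [0, 1]$ and $\lambda_t > 0$. If $\omega_t \to 0$, the baseline Poisson distribution is obtained.
  2. ZINB: $\omega_t$ zero-inflation parameter, $\lambda_t$ intensity parameter, $\kappa_t$ dispersion parameter, and
    $$f_t(y_t) = \omega_t I_{\{y_t = 0\}} + (1 - \omega_t)\, \frac{\Gamma(\kappa_t + y_t)}{\Gamma(\kappa_t)\, y_t!} \left(\frac{\kappa_t}{\kappa_t + \lambda_t}\right)^{\kappa_t} \left(\frac{\lambda_t}{\kappa_t + \lambda_t}\right)^{y_t},$$
    where $I_{\{y_t = 0\}}$ is the indicator function, $\omega_t \in [0, 1]$, $\lambda_t > 0$ and $\kappa_t \ge 0$. If $\omega_t \to 0$, the baseline NB distribution is obtained.
  3. ZICMP: $\omega_t$ zero-inflation parameter, $\lambda_t$ intensity parameter, $\kappa_t$ dispersion parameter, and
    $$f_t(y_t) = \omega_t I_{\{y_t = 0\}} + (1 - \omega_t)\, \frac{\lambda_t^{y_t}}{(y_t!)^{\kappa_t}\, Z(\lambda_t, \kappa_t)},$$
    where $I_{\{y_t = 0\}}$ is the indicator function, $\omega_t \in [0, 1]$, $\lambda_t > 0$ and $\kappa_t \ge 0$. If $\omega_t \to 0$, the baseline CMP distribution is obtained, and if $\kappa_t = 1$, the ZIP distribution is obtained.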
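The three margins above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the CMP normalizing constant is taken to be the usual series $Z(\lambda, \kappa) = \sum_{j \ge 0} \lambda^j / (j!)^{\kappa}$ and is truncated at a fixed number of terms, and $\kappa > 0$ is assumed for the NB case.

```python
import math


def poisson_pmf(y, lam):
    """Poisson pmf, computed on the log scale for stability."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def nb_pmf(y, lam, kappa):
    """Negative binomial pmf with mean lam and dispersion kappa > 0."""
    log_p = (math.lgamma(kappa + y) - math.lgamma(kappa) - math.lgamma(y + 1)
             + kappa * math.log(kappa / (kappa + lam))
             + y * math.log(lam / (kappa + lam)))
    return math.exp(log_p)


def cmp_pmf(y, lam, kappa, terms=200):
    """CMP pmf; Z(lam, kappa) is approximated by a truncated series (assumption)."""
    def log_weight(j):
        return j * math.log(lam) - kappa * math.lgamma(j + 1)
    z = sum(math.exp(log_weight(j)) for j in range(terms))
    return math.exp(log_weight(y)) / z


def zero_inflated(base_pmf, y, omega, *params):
    """Mix a point mass at zero (weight omega) with a baseline count pmf."""
    return omega * (1.0 if y == 0 else 0.0) + (1.0 - omega) * base_pmf(y, *params)


zip_p0 = zero_inflated(poisson_pmf, 0, 0.3, 3.0)      # ZIP mass at zero
zinb_total = sum(zero_inflated(nb_pmf, y, 0.3, 3.0, 5.0) for y in range(200))
zicmp_p2 = zero_inflated(cmp_pmf, 2, 0.2, 3.0, 1.0)   # kappa = 1 reduces to ZIP
```

Working on the log scale with `math.lgamma` avoids overflow in the factorials; the truncation point for $Z$ would need to grow when $\kappa$ is close to zero.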

Following the GLM framework [19], we simultaneously fit two GLMs for the ZIP parameters and three GLMs for the ZINB and ZICMP parameters, namely: (1) the intensity parameter $\lambda_t$, with the logarithmic link function; (2) the zero-inflation parameter $\omega_t$, with the logit link function; and finally, in the case of the ZINB and ZICMP distributions, (3) the dispersion parameter $\kappa_t$, with the logarithmic link function just like the intensity parameter. That is, for $t = 1, \ldots, n$,

$$\log(\lambda_t) = \beta^\top x_t, \qquad (2)$$
$$\mathrm{logit}(\omega_t) = \log\frac{\omega_t}{1 - \omega_t} = \gamma^\top z_t, \qquad (3)$$

and

$$\log(\kappa_t) = \alpha^\top w_t, \qquad (4)$$

where $x_t = (x_{1t}, \ldots, x_{kt})^\top$, $z_t = (z_{1t}, \ldots, z_{lt})^\top$, and $w_t = (w_{1t}, \ldots, w_{mt})^\top$ are the associated covariates that affect the intensity parameter $\lambda_t$, the zero-inflation parameter $\omega_t$, and the dispersion parameter $\kappa_t$, respectively. In addition, $\beta = (\beta_1, \ldots, \beta_k)^\top$, $\gamma = (\gamma_1, \ldots, \gamma_l)^\top$, and $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ are the regression coefficients for the log-linear model given in (2), the logit model given in (3), and the log-linear model given in (4), respectively.
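The link functions (2)–(4) translate covariates into marginal parameters as in the following minimal sketch. The helper names and the coefficient values are hypothetical and chosen only for illustration; the covariate vector uses an intercept and one annual harmonic.

```python
import math


def linear_predictor(coefs, covariates):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(c * x for c, x in zip(coefs, covariates))


def marginal_parameters(beta, gamma, alpha, x_t, z_t, w_t):
    """Map covariates to (lambda_t, omega_t, kappa_t) through the three links."""
    lam = math.exp(linear_predictor(beta, x_t))                    # log link, Eq. (2)
    omega = 1.0 / (1.0 + math.exp(-linear_predictor(gamma, z_t)))  # logit link, Eq. (3)
    kappa = math.exp(linear_predictor(alpha, w_t))                 # log link, Eq. (4)
    return lam, omega, kappa


# Hypothetical coefficients: intercept plus one annual harmonic, constant dispersion.
t = 3
harmonics = [1.0, math.cos(2 * math.pi * t / 12), math.sin(2 * math.pi * t / 12)]
lam, omega, kappa = marginal_parameters(
    beta=[1.0, 0.5, -0.2], gamma=[0.0, 0.0, 0.0], alpha=[math.log(5.0)],
    x_t=harmonics, z_t=harmonics, w_t=[1.0])
```

The inverse links guarantee $\lambda_t > 0$, $\kappa_t > 0$ and $\omega_t \in (0, 1)$ for any real-valued coefficients, which is what makes unconstrained numerical optimization possible later on.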

2.2. Copula

The multivariate normal distribution is a natural extension of the univariate normal distribution to higher dimensions, and it plays a central role in statistical theory. However, there is not one but several multivariate extensions of univariate discrete distributions to higher dimensions. Copulas facilitate such extensions and the construction of multivariate distributions with given continuous or discrete marginals, and they can model various types of dependence. An extensive and detailed discussion of copulas is contained in [13]. A $d$-dimensional copula $C(a_1, a_2, \ldots, a_d): [0,1]^d \to [0,1]$ is simply a multivariate cdf with all $d$ univariate marginals uniform on the unit interval. A multivariate cdf for $Y = (Y_1, Y_2, \ldots, Y_d)$ is given as

$$F(y_1, y_2, \ldots, y_d) = C(F_1(y_1), F_2(y_2), \ldots, F_d(y_d)).$$

If all the margins are integer valued, as given in [13] on page 27, the multivariate probability mass function can be obtained as

$$f(y_1, y_2, \ldots, y_d) = P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_d = y_d) = \sum_{j_1=1}^{2} \sum_{j_2=1}^{2} \cdots \sum_{j_d=1}^{2} (-1)^{j_1 + j_2 + \cdots + j_d}\, C(u_{1 j_1}, u_{2 j_2}, \ldots, u_{d j_d}),$$

where $u_{t1} = F_t(y_t^-)$ and $u_{t2} = F_t(y_t)$. The term $F_t(y_t^-)$ is the left-hand limit of $F_t$ at $y_t$, which is equal to $F_t(y_t - 1)$ when $Y_t$ is an integer-valued random variable. A comprehensive list of copulas can be found in [13].
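The signed sum over the $2^d$ corners can be sketched as follows. The helper names are hypothetical, and the example assumes stationary ZIP margins with a Frank copula purely for illustration; any copula function with the same calling convention could be passed in.

```python
import itertools
import math


def zip_cdf(y, lam, omega):
    """ZIP cdf; returns 0 below the support so that F(-1) = 0."""
    if y < 0:
        return 0.0
    poisson_cdf = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                      for k in range(y + 1))
    return min(1.0, omega + (1.0 - omega) * poisson_cdf)


def frank_copula(u1, u2, delta):
    """Bivariate Frank copula, delta != 0."""
    ratio = math.expm1(-delta * u1) * math.expm1(-delta * u2) / math.expm1(-delta)
    return -math.log1p(ratio) / delta


def copula_pmf(copula, cdfs, ys):
    """Joint pmf of integer-valued margins as a signed sum over 2^d corners."""
    total = 0.0
    for js in itertools.product((1, 2), repeat=len(ys)):
        # j = 1 picks the left limit F(y - 1), j = 2 picks F(y).
        u = [cdf(y - 1) if j == 1 else cdf(y) for cdf, y, j in zip(cdfs, ys, js)]
        total += (-1) ** sum(js) * copula(*u)
    return total


def margin(y):
    return zip_cdf(y, 3.0, 0.3)


def frank(u1, u2):
    return frank_copula(u1, u2, 3.2)


grid_total = sum(copula_pmf(frank, [margin, margin], (a, b))
                 for a in range(40) for b in range(40))
```

Summing the resulting pmf over a large enough grid should return a value close to one, which is a convenient sanity check when a new copula family is plugged in.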

3. Copula-based Markov chain models

A general form of $p$th-order Markov models with copula-based transition probabilities, as defined in [13], is given as

$$Y_t = g(\epsilon_t; Y_{t-1}, \ldots, Y_{t-p}),$$

where $\{\epsilon_t\}$ is an i.i.d. continuous latent stochastic process and $g(\cdot)$ is assumed to be an increasing function in $\epsilon_t$ for $t = 1, \ldots, n$. Thus, the observed value $Y_t$ depends on the past only through $Y_{t-1}, \ldots, Y_{t-p}$. If the process $\{Y_t\}$ is continuous, then there exists a simple stochastic representation for the Markov model. However, for a discrete process, as is the case here, there is no simple stochastic representation for the model [14]. Next, we discuss the first-order Markov models and their immediate extensions to second-order Markov models.

3.1. First- order Markov models

Suppose $\{Y_t\}$ is a zero-inflated count time series following a first-order Markov chain with one of the distributions introduced in Section 2. Then, taking advantage of the chain rule of probability and the Markov property, the joint probability mass function of $Y_1, \ldots, Y_n$ is given as

$$\Pr(Y_1 = y_1, \ldots, Y_n = y_n) = \Pr(Y_1 = y_1) \prod_{t=2}^{n} \Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}). \qquad (5)$$

The transition probability, i.e. the conditional probability on the right-hand side of (5), depends on the joint pmf of $(Y_t, Y_{t-1})$ and can be found using the copula functions as shown in Section 2. That is, let

$$F_{12}(y_t, y_{t-1}) = C(F_t(y_t \mid X_t; \theta), F_{t-1}(y_{t-1} \mid X_{t-1}; \theta); \delta), \qquad (6)$$

where $C(\cdot\,; \delta)$ is a bivariate copula function with parameter vector $\delta$. The covariates $X_t = (x_t, z_t, w_t)$, for $t = 1, \ldots, n$, correspond to the intensity (mean) parameter $\lambda_t$, the zero-inflation parameter $\omega_t$, and the dispersion parameter $\kappa_t$ (if present), respectively. Notice that in some cases these parameters, or some of them, may be constant across time when the covariates are not significant and are dropped from the model. The parameter vector $\theta = (\beta, \gamma, \alpha)$ collects the unknown marginal regression coefficients. Hence, the transition probability is given as

$$\Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}) = \frac{\Pr(Y_t = y_t, Y_{t-1} = y_{t-1})}{f_{t-1}(y_{t-1} \mid X_{t-1}; \theta)}, \qquad (7)$$

where

$$\Pr(Y_t = y_t, Y_{t-1} = y_{t-1}) = F_{12}(y_t, y_{t-1}) - F_{12}(y_t^-, y_{t-1}) - F_{12}(y_t, y_{t-1}^-) + F_{12}(y_t^-, y_{t-1}^-),$$

and $y_t^- = y_t - 1$ since $Y_t$ is a discrete random variable for all $t$.
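As a minimal sketch of the transition probability (7), the code below evaluates the four-corner formula and divides by the marginal pmf of the previous count. The function names are hypothetical, and stationary ZIP margins with a Frank copula are assumed only to keep the example closed-form.

```python
import math


def zip_cdf(y, lam, omega):
    """ZIP cdf with F(y) = 0 for y < 0."""
    if y < 0:
        return 0.0
    poisson_cdf = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                      for k in range(y + 1))
    return min(1.0, omega + (1.0 - omega) * poisson_cdf)


def frank_copula(u1, u2, delta):
    """Bivariate Frank copula, delta != 0."""
    ratio = math.expm1(-delta * u1) * math.expm1(-delta * u2) / math.expm1(-delta)
    return -math.log1p(ratio) / delta


def joint_pmf(y_t, y_prev, cdf_t, cdf_prev, copula):
    """Pr(Y_t = y_t, Y_{t-1} = y_prev) from the four-corner rectangle formula."""
    return (copula(cdf_t(y_t), cdf_prev(y_prev))
            - copula(cdf_t(y_t - 1), cdf_prev(y_prev))
            - copula(cdf_t(y_t), cdf_prev(y_prev - 1))
            + copula(cdf_t(y_t - 1), cdf_prev(y_prev - 1)))


def transition_prob(y_t, y_prev, cdf_t, cdf_prev, copula):
    """Pr(Y_t = y_t | Y_{t-1} = y_prev) as in Eq. (7)."""
    f_prev = cdf_prev(y_prev) - cdf_prev(y_prev - 1)
    return joint_pmf(y_t, y_prev, cdf_t, cdf_prev, copula) / f_prev


def margin(y):
    return zip_cdf(y, 3.0, 0.3)


def frank(u1, u2):
    return frank_copula(u1, u2, 3.2)


row_given_0 = [transition_prob(y, 0, margin, margin, frank) for y in range(60)]
row_given_4 = [transition_prob(y, 4, margin, margin, frank) for y in range(60)]
```

Each row of transition probabilities should sum to approximately one, and with positive dependence a zero is more likely to follow a zero than the marginal zero probability would suggest.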

Several choices of the bivariate copula function C(.;δ) can be selected depending on the degree and the sign of the dependence and the tail behavior of the copula. Table 1 shows examples of copula choices considered in this paper. For example, if there is symmetry, the Gaussian copula is recommended. However, Gumbel or reflected Gumbel copulas perform better than Gaussian copula with the existence of tail dependence [14]. Moreover, the Frank and the Gaussian copulas are both reflection symmetric and allow for negative dependence (see [13] for more details).

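Table 1. Bivariate copula functions.

| Copula | Copula function $C(u_1, u_2; \delta)$ | Parameter range |
|---|---|---|
| Gaussian | $\Phi_\delta(\Phi^{-1}(u_1), \Phi^{-1}(u_2))$ | $\delta \in [-1, 1]$ |
| Frank | $-\frac{1}{\delta} \log\left(1 + \frac{(e^{-\delta u_1} - 1)(e^{-\delta u_2} - 1)}{e^{-\delta} - 1}\right)$ | $\delta \in \mathbb{R} \setminus \{0\}$ |
| Gumbel | $\exp\left[-\left((-\log u_1)^{\delta} + (-\log u_2)^{\delta}\right)^{1/\delta}\right]$ | $\delta \ge 1$ |
| Reflected Gumbel | $u_1 + u_2 - 1 + \exp\left[-\left((-\log \bar{u}_1)^{\delta} + (-\log \bar{u}_2)^{\delta}\right)^{1/\delta}\right]$, with $\bar{u}_j = 1 - u_j$ | $\delta \ge 1$ |
| Plackett | $\frac{[1 + (\delta - 1)(u_1 + u_2)] - \sqrt{[1 + (\delta - 1)(u_1 + u_2)]^2 - 4 u_1 u_2 \delta (\delta - 1)}}{2(\delta - 1)}$ | $\delta \ge 0$ |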
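The closed-form families in Table 1 can be coded directly, as in the sketch below. The function names are hypothetical; the Gaussian copula is left out because it requires a bivariate normal cdf, which the Python standard library does not provide. The Plackett case $\delta = 1$ is treated as the independence limit.

```python
import math


def neg_log(u):
    """-log(u), clipped at zero so that u = 1 maps to exactly 0."""
    return max(0.0, -math.log(u))


def frank(u1, u2, delta):
    """Frank copula, delta != 0 (negative delta gives negative dependence)."""
    ratio = math.expm1(-delta * u1) * math.expm1(-delta * u2) / math.expm1(-delta)
    return -math.log1p(ratio) / delta


def gumbel(u1, u2, delta):
    """Gumbel copula, delta >= 1."""
    if u1 <= 0.0 or u2 <= 0.0:
        return 0.0
    s = neg_log(u1) ** delta + neg_log(u2) ** delta
    return math.exp(-(s ** (1.0 / delta)))


def reflected_gumbel(u1, u2, delta):
    """Survival (reflected) Gumbel: Gumbel applied to 1 - u1 and 1 - u2."""
    return u1 + u2 - 1.0 + gumbel(1.0 - u1, 1.0 - u2, delta)


def plackett(u1, u2, delta):
    """Plackett copula; delta = 1 is handled as the independence limit."""
    if delta == 1.0:
        return u1 * u2
    s = 1.0 + (delta - 1.0) * (u1 + u2)
    root = math.sqrt(s * s - 4.0 * u1 * u2 * delta * (delta - 1.0))
    return (s - root) / (2.0 * (delta - 1.0))


families = [(frank, 3.2), (gumbel, 1.5), (reflected_gumbel, 1.5), (plackett, 2.0)]
values = {f.__name__: f(0.3, 0.6, d) for f, d in families}
```

A quick check for any implementation is that $C(u, 1) = u$ and $C(u, 0) = 0$, and that the value lies between the Fréchet bounds $\max(u_1 + u_2 - 1, 0)$ and $\min(u_1, u_2)$.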

Next, we discuss the immediate extension of the first-order Markov models, namely the second-order Markov models, where the zero-inflated count $Y_t$ depends on the past two counts.

3.2. Second order Markov models

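Suppose $\{Y_t\}$ is a zero-inflated count time series following a second-order Markov chain with one of the distributions introduced in Section 2. Now, the multivariate joint distribution of $Y_1, \ldots, Y_n$ is given by

$$\Pr(Y_1 = y_1, \ldots, Y_n = y_n) = \Pr(Y_1 = y_1, Y_2 = y_2) \times \prod_{t=3}^{n} \Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2}), \qquad (8)$$

where the conditional probability of $Y_2$ given $Y_1 = y_1$ can be evaluated using (7). However, for the second-order transition probabilities of $Y_t$ given $Y_{t-1} = y_{t-1}$ and $Y_{t-2} = y_{t-2}$, we need to fit an appropriate trivariate copula function for the joint distribution of $Y_t$, $Y_{t-1}$ and $Y_{t-2}$ for $t = 3, \ldots, n$.

The most popular choice is the trivariate Gaussian copula. Using the joint multivariate distribution with discrete margins given in (5) and the Gaussian copula function, the transition probabilities of $Y_t$ given $Y_{t-1} = y_{t-1}$ and $Y_{t-2} = y_{t-2}$ for $t = 3, \ldots, n$ are given by

$$\Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2}) = \frac{\Pr(Y_t = y_t, Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2})}{\Pr(Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2})},$$

where the joint distribution is given by

$$P(Y_t = y_t, Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2}) = \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} \sum_{j_3=0}^{1} (-1)^{j_1 + j_2 + j_3} F_{123}(y_t - j_1, y_{t-1} - j_2, y_{t-2} - j_3), \qquad (9)$$

where the function $F_{123}(\cdot)$ is given by

$$F_{123}(y_t, y_{t-1}, y_{t-2}) = \Phi_{R(\delta)}\left(\Phi^{-1}\{F_t(y_t \mid X_t; \theta)\}, \Phi^{-1}\{F_{t-1}(y_{t-1} \mid X_{t-1}; \theta)\}, \Phi^{-1}\{F_{t-2}(y_{t-2} \mid X_{t-2}; \theta)\}\right),$$

for $t = 3, \ldots, n$, where $\Phi_{R(\delta)}$ is the trivariate standard normal cdf and $R(\delta)$ is a $3 \times 3$ correlation matrix, with $\delta = (\delta_1, \delta_2)$ as a vector of the dependence structure parameters. The bivariate copula margins $F_{12}$ and $F_{23}$ of the trivariate copula function $F_{123}$ are assumed to be the same and, in this case, are given by the bivariate Gaussian copula.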

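An alternative way of evaluating the trivariate joint pmf, when the Gaussian copula function is chosen, is to integrate the trivariate normal density over a rectangle [20]. That is,

$$\Pr(Y_t = y_t, Y_{t-1} = y_{t-1}, Y_{t-2} = y_{t-2}) = \int_{D_t(y_t; \theta)} \int_{D_{t-1}(y_{t-1}; \theta)} \int_{D_{t-2}(y_{t-2}; \theta)} \varphi_{R(\delta)}(z_{t-2}, z_{t-1}, z_t)\, dz_{t-2}\, dz_{t-1}\, dz_t, \qquad (10)$$

where

$$D_t(y_t; \theta) = \left[\Phi^{-1}\{F_t(y_t^- \mid X_t; \theta)\},\ \Phi^{-1}\{F_t(y_t \mid X_t; \theta)\}\right]. \qquad (11)$$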
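To make the rectangle-probability idea concrete, the following sketch treats the simpler bivariate analogue of (10), which is the kind of term needed for two consecutive counts. It is not the approximation method of [20]: it conditions on the first normal score and applies Simpson's rule to the remaining one-dimensional integral. The helper names, the truncation of infinite endpoints at $\pm 8$, and the stationary ZIP margins are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def zip_cdf(y, lam, omega):
    """ZIP cdf with F(y) = 0 for y < 0."""
    if y < 0:
        return 0.0
    poisson_cdf = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                      for k in range(y + 1))
    return min(1.0, omega + (1.0 - omega) * poisson_cdf)


def normal_interval(y, lam, omega, bound=8.0):
    """D_t(y_t) of Eq. (11), with infinite endpoints truncated at +/- bound."""
    def score(p):
        if p <= 0.0:
            return -bound
        if p >= 1.0:
            return bound
        return max(-bound, min(bound, STD_NORMAL.inv_cdf(p)))
    return score(zip_cdf(y - 1, lam, omega)), score(zip_cdf(y, lam, omega))


def rectangle_prob(d1, d2, rho, steps=400):
    """Bivariate normal probability of d1 x d2 by Simpson's rule over z1."""
    (a1, b1), (a2, b2) = d1, d2
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(z):
        # Conditional on Z1 = z, Z2 is normal with mean rho * z and sd scale.
        inner = (STD_NORMAL.cdf((b2 - rho * z) / scale)
                 - STD_NORMAL.cdf((a2 - rho * z) / scale))
        return STD_NORMAL.pdf(z) * inner

    h = (b1 - a1) / steps
    total = integrand(a1) + integrand(b1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(a1 + i * h)
    return total * h / 3.0


d0 = normal_interval(0, 3.0, 0.3)
row_total = sum(rectangle_prob(d0, normal_interval(y, 3.0, 0.3), 0.5)
                for y in range(16))
```

Summing the rectangle probabilities over the second count should recover the marginal probability of the first count, which gives a simple accuracy check. The trivariate case in (10) needs a genuine multivariate normal routine, which is outside the standard library.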

Although the Gaussian copula function in (10) has no closed-form, there are several accurate deterministic approximations of the function when the dimension is low, such as the case here with the trivariate Gaussian or the bivariate Gaussian.

Another way of calculating the trivariate joint distribution, if a closed-form copula function is desired, is to employ the Laplace transform (LT) of a non-negative random variable through a max-infinitely divisible (max-id) copula, as introduced in [15]. It is stated there that when the pair copula functions are chosen as $C_{12} = C_{23} = H$, where $H$ is a permutation-symmetric max-id bivariate copula function, and $C_{13}$ is the independence copula function, the model is appropriate for generating a second-order Markov chain.

Hence, the function $F_{123}(\cdot)$ in (9) becomes the following trivariate max-id copula

$$F_{123}(y_t, y_{t-1}, y_{t-2}) = \psi\left(\sum_{j \in \{t, t-2\}} \left[-\log H\left(e^{-0.5 \psi^{-1}(F_j; \delta_1)}, e^{-0.5 \psi^{-1}(F_{t-1}; \delta_1)}; \delta_2\right) + \tfrac{1}{2} \psi^{-1}(F_j; \delta_1)\right]; \delta_1\right), \qquad (12)$$

where $F_j = F_j(y_j \mid X_j; \theta)$ for $j = t, t-1, t-2$ and $t = 3, \ldots, n$. The function $\psi(\cdot\,; \delta_1)$ is the LT with $\delta_1$ describing the minimal dependence, and $H(\cdot\,; \delta_2)$ is a permutation-symmetric max-id bivariate copula function with $\delta_2$ describing the stronger pairwise dependence. The bivariate margins of (12) are given by

$$F_{i2}(y_j, y_{t-1}) = \psi\left(-\log H\left(e^{-0.5 \psi^{-1}(F_j; \delta_1)}, e^{-0.5 \psi^{-1}(F_{t-1}; \delta_1)}; \delta_2\right) + \tfrac{1}{2} \psi^{-1}(F_j; \delta_1) + \tfrac{1}{2} \psi^{-1}(F_{t-1}; \delta_1); \delta_1\right), \qquad (13)$$

for $i = 1, 3$, $j = t, t-2$, and

$$F_{13}(y_t, y_{t-2}) = \psi\left(\psi^{-1}(F_t; \delta_1) + \psi^{-1}(F_{t-2}; \delta_1); \delta_1\right).$$

The use of the above trivariate max-id copula is suggested when there is stronger dependence for measurements at nearer time points [14]. It is also noted in [14] that in the case of large-value clustering (such as when the time series exhibits seasonality), a good choice is the bivariate Gumbel copula for $H(\cdot\,; \delta_2)$ and the positive stable Laplace transform for $\psi(\cdot\,; \delta_1)$, which makes the function $F_{123}(\cdot)$ in (12) a trivariate extreme value copula. The Gaussian copula and the max-id copula can be extended to fit Markov models of order greater than 2.
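A minimal sketch of (12) and (13) for the positive stable LT with a Gumbel $H$ is given below. It assumes the usual forms $\psi(s; \delta_1) = \exp(-s^{1/\delta_1})$ and $\psi^{-1}(u; \delta_1) = (-\log u)^{\delta_1}$ with $\delta_1 \ge 1$, and uses the fact that for the Gumbel copula $-\log H(e^{-x}, e^{-y}; \delta_2) = (x^{\delta_2} + y^{\delta_2})^{1/\delta_2}$. The function names are hypothetical and the arguments are copula-scale values in $(0, 1]$.

```python
import math


def ps_inverse(u, delta1):
    """Inverse positive stable LT: psi^{-1}(u) = (-log u)^delta1."""
    return max(0.0, -math.log(u)) ** delta1


def ps_transform(s, delta1):
    """Positive stable LT: psi(s) = exp(-s^(1/delta1))."""
    return math.exp(-(s ** (1.0 / delta1)))


def neg_log_gumbel(x, y, delta2):
    """-log H(e^-x, e^-y) for the bivariate Gumbel copula H."""
    return (x ** delta2 + y ** delta2) ** (1.0 / delta2)


def maxid_trivariate(u_t, u_mid, u_lag2, delta1, delta2):
    """Eq. (12): u_mid is the margin at t-1, shared by both H terms."""
    s_t, s_mid, s_lag2 = (ps_inverse(u, delta1) for u in (u_t, u_mid, u_lag2))
    inner = sum(neg_log_gumbel(0.5 * s, 0.5 * s_mid, delta2) + 0.5 * s
                for s in (s_t, s_lag2))
    return ps_transform(inner, delta1)


def maxid_bivariate(u_j, u_mid, delta1, delta2):
    """Eq. (13): margin for a neighbouring pair (j = t or t-2, and t-1)."""
    s_j, s_mid = ps_inverse(u_j, delta1), ps_inverse(u_mid, delta1)
    inner = neg_log_gumbel(0.5 * s_j, 0.5 * s_mid, delta2) + 0.5 * s_j + 0.5 * s_mid
    return ps_transform(inner, delta1)


c123 = maxid_trivariate(0.4, 0.7, 0.6, delta1=1.3, delta2=2.0)
c12 = maxid_trivariate(0.4, 0.7, 1.0, delta1=1.3, delta2=2.0)
c13 = maxid_trivariate(0.4, 1.0, 0.6, delta1=1.3, delta2=2.0)
```

Setting one argument to one should reproduce the corresponding bivariate margin: the neighbouring pairs follow (13), while the pair two steps apart reduces to the Archimedean form $\psi(\psi^{-1}(F_t) + \psi^{-1}(F_{t-2}))$.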

4. Statistical inference

4.1. Log-likelihood functions

The estimation method used for the parameter vector $\vartheta = (\theta, \delta)$ of the Markov models presented in this paper is maximum likelihood estimation (MLE). As stated previously, likelihood inference is easily applied when the chosen copula family has a simple form. In addition, likelihood inference gives us the advantage of performing hypothesis testing through likelihood ratio statistics and model selection using the log-likelihood function. Next, we give a detailed description of the likelihood functions of the two models presented earlier.

For the first-order Markov models, the likelihood function is given by (5), and the log-likelihood is given as

$$l(\vartheta; y) = \log f_1(y_1 \mid X_1; \theta) + \sum_{t=2}^{n} \log \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} (-1)^{j_1 + j_2} F_{12}(y_t - j_1, y_{t-1} - j_2) - \sum_{t=2}^{n} \log f_{t-1}(y_{t-1} \mid X_{t-1}; \theta), \qquad (14)$$

where $F_{12}(y_t, y_{t-1})$ is given by (6), and $\theta$ and $\delta$ are the parameter vectors of the marginals and the dependence structure, respectively. The log-likelihood function in (14) has a closed form if the copula family chosen to define $C(\cdot\,; \delta)$ has a closed form.
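The log-likelihood (14) can be sketched for the simplest case of stationary ZIP margins (no covariates) and a Frank copula. The function names and the short series of counts are made up for illustration; with covariates, the marginal parameters would be recomputed at each $t$ through the link functions.

```python
import math


def zip_cdf(y, lam, omega):
    """ZIP cdf with F(y) = 0 for y < 0."""
    if y < 0:
        return 0.0
    poisson_cdf = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                      for k in range(y + 1))
    return min(1.0, omega + (1.0 - omega) * poisson_cdf)


def frank_copula(u1, u2, delta):
    """Bivariate Frank copula, delta != 0."""
    ratio = math.expm1(-delta * u1) * math.expm1(-delta * u2) / math.expm1(-delta)
    return -math.log1p(ratio) / delta


def first_order_loglik(y, lam, omega, delta):
    """Log-likelihood (14) for stationary ZIP margins and a Frank copula."""
    def cdf(v):
        return zip_cdf(v, lam, omega)

    def pmf(v):
        return cdf(v) - cdf(v - 1)

    total = math.log(pmf(y[0]))
    for prev, cur in zip(y, y[1:]):
        # Signed sum over the four corners gives Pr(Y_t = cur, Y_{t-1} = prev).
        joint = sum((-1) ** (j1 + j2)
                    * frank_copula(cdf(cur - j1), cdf(prev - j2), delta)
                    for j1 in (0, 1) for j2 in (0, 1))
        total += math.log(joint) - math.log(pmf(prev))
    return total


counts = [0, 0, 2, 3, 0, 1, 4, 0, 0, 2]   # made-up series, for illustration only
ll_dependent = first_order_loglik(counts, 3.0, 0.3, 3.2)
ll_near_indep = first_order_loglik(counts, 3.0, 0.3, 1e-6)
```

As the Frank parameter approaches zero the copula approaches independence, so the value should approach the sum of the marginal log-pmfs. In practice this function would be handed to a numerical optimizer after reparametrizing $\lambda$ and $\omega$ on the log and logit scales.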

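For second-order Markov models with the Gaussian copula function, the log-likelihood function is given as

$$l(\vartheta; y) = \log \int_{D_1(y_1; \theta)} \int_{D_2(y_2; \theta)} \varphi_{R(\delta)}(z_2, z_1)\, dz_2\, dz_1 + \sum_{t=3}^{n} \left[\log \int_{D_t(y_t; \theta)} \int_{D_{t-1}(y_{t-1}; \theta)} \int_{D_{t-2}(y_{t-2}; \theta)} \varphi_{R(\delta)}(z_{t-2}, z_{t-1}, z_t)\, dz_{t-2}\, dz_{t-1}\, dz_t - \log \int_{D_{t-1}(y_{t-1}; \theta)} \int_{D_{t-2}(y_{t-2}; \theta)} \varphi_{R(\delta)}(z_{t-2}, z_{t-1})\, dz_{t-2}\, dz_{t-1}\right], \qquad (15)$$

where $D_t(y_t; \theta)$ is given by (11). The expression in (15) is not in closed form, and approximations are needed for the rectangle probabilities. However, for the trivariate max-id copula, the log-likelihood function can take a closed form and is given by

$$l(\vartheta; y) = \log \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} (-1)^{j_1 + j_2} F_{12}(y_1 - j_1, y_2 - j_2) + \sum_{t=3}^{n} \left[\log \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} \sum_{j_3=0}^{1} (-1)^{j_1 + j_2 + j_3} F_{123}(y_t - j_1, y_{t-1} - j_2, y_{t-2} - j_3) - \log \sum_{j_1=0}^{1} \sum_{j_2=0}^{1} (-1)^{j_1 + j_2} F_{12}(y_{t-1} - j_1, y_{t-2} - j_2)\right], \qquad (16)$$

where $F_{12}(y_{t-1}, y_{t-2})$ and $F_{123}(y_t, y_{t-1}, y_{t-2})$ are given by (13) and (12), respectively.

Hence, the maximum likelihood estimates of $\vartheta = (\theta, \delta)$ can be obtained by:

$$\hat{\vartheta} = \arg\max_{\vartheta}\, l(\vartheta; y).$$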

This optimization produces a Hessian matrix that yields the observed Fisher information matrix. To get the standard errors of the ML estimates of $\vartheta$, one can take the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. In the next section, asymptotic results are presented to justify using the inverse of the Fisher information matrix of the above log-likelihood functions, evaluated at the MLE of $\vartheta$, as an estimated covariance matrix of $\hat{\vartheta}$.

4.2. Asymptotic properties

To draw inference on Markov processes, Billingsley [3] gave important results that basically state that, under certain regularity conditions, the asymptotic likelihood theory and numerical maximum likelihood from the i.i.d. case can be extended to hold with dependent data following a Markov process. First, we consider the first-order Markov model with the corresponding log-likelihood given in (14). As listed in [12] on page 318, the regularity conditions needed in order to use the results from [3] for the first-order Markov process are verified. Assume that $\{Y_t\}$, for $t = 1, \ldots, n$, is a first-order Markov chain with state space $S$, and $\Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}; \vartheta)$ is a family of transition densities with respect to a counting measure and with column vector parameter $\vartheta$ of dimension $r$ in the parameter space $\Theta$. In addition, for asymptotic analysis, rewrite the log-likelihood function given in (14) as

$$l_n(\vartheta) = \sum_{t=2}^{n} p(\vartheta; y_{t-1}, y_t), \qquad (17)$$

where $p(\vartheta; y_{t-1}, y_t) = \log \Pr(Y_t = y_t \mid Y_{t-1} = y_{t-1}; \vartheta)$, for $t = 2, \ldots, n$. Note that the first probability, $\Pr(Y_1 = y_1; \theta)$, is omitted from the function since the first observation, $y_1$, is asymptotically insignificant.

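Now, given the regularity conditions in [12], the following asymptotic results from the i.i.d. case hold for our Markov processes. In particular, there exists a root $\hat{\vartheta}_n$ of $\partial l_n(\vartheta) / \partial \vartheta = 0$ such that

  1. The ML estimator $\hat{\vartheta}_n$ of $\vartheta = (\theta, \delta)$ converges in probability to the true value of $\vartheta$, say $\vartheta_0$. That is, $\hat{\vartheta}_n$ is a consistent estimator of $\vartheta$.

  2. The ML estimator $\hat{\vartheta}_n$ is asymptotically normal. That is, $n^{1/2}(\hat{\vartheta}_n - \vartheta_0) \to_d N_r(0, \Sigma^{-1}(\vartheta_0))$, where $\Sigma(\vartheta_0) = \mathrm{Var}(\vartheta_0)$.

  3. The log-likelihood ratios for hypotheses involving nested models for the parameter $\vartheta$ have an asymptotic chi-square distribution. That is, $2[\max_{\vartheta} l_n(\vartheta) - l_n(\vartheta_0)] \to_d \chi^2_r$.

Hence, as in the i.i.d. case, the numerical maximization of (17) yields the observed Fisher information matrix, which can be used to obtain an estimated covariance matrix of $\hat{\vartheta}$. That is, for large $n$, $n^{-1} \Sigma^{-1}(\hat{\vartheta}) \approx I_n^{-1}(\hat{\vartheta})$, where

$$I_n^{-1}(\hat{\vartheta}) = \left[-\left.\frac{\partial^2 l_n(\vartheta)}{\partial \vartheta\, \partial \vartheta^\top}\right|_{\hat{\vartheta}}\right]^{-1} = \left[-\left.\sum_{t=2}^{n} \frac{\partial^2 p(\vartheta; y_{t-1}, y_t)}{\partial \vartheta\, \partial \vartheta^\top}\right|_{\hat{\vartheta}}\right]^{-1}.$$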

Joe [12] argued that the theory in [3] still applies for higher-order Markov processes, assuming the order is known. He also stated that extension of the asymptotic theory to a case where the transition probabilities depend on covariates should be possible.

5. Simulated examples

To evaluate the performance of the proposed method and confirm the asymptotic results, a comprehensive simulation study was conducted. We carried out the simulation in the statistical software R [22]. Out of the several processes to choose from, we simulated first-order stationary Markov processes with joint distribution of consecutive observations following the bivariate Gaussian and Frank copulas. The marginal distributions were chosen to be the ZIP, ZINB, and ZICMP distributions. Since we assumed the process is stationary, we set the marginal distributions' parameters, θ, to be constant across time. For Gaussian copula, the marginal parameters were chosen as (1) ZIP with θ=(λ=3,ω=0.3), (2) ZINB with θ=(λ=3,ω=0.3,κ=5), and (3) ZICMP with θ=(λ=3,ω=0.2,κ=0.5). The dependence parameter for the bivariate Gaussian copula was chosen as δ=0.5 across all three models.
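The paper does not describe its simulation algorithm, so the following is only one possible sketch of how a stationary first-order chain could be generated: each new count is drawn by inverting the conditional cdf $\Pr(Y_t \le y \mid Y_{t-1} = y_{t-1})$, which needs just two copula evaluations. The function names, the Frank copula, the ZIP margin, and the support cut-off where the marginal cdf is numerically one are all assumptions.

```python
import math
import random


def zip_cdf(y, lam, omega):
    """ZIP cdf with F(y) = 0 for y < 0."""
    if y < 0:
        return 0.0
    poisson_cdf = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                      for k in range(y + 1))
    return min(1.0, omega + (1.0 - omega) * poisson_cdf)


def frank_copula(u1, u2, delta):
    """Bivariate Frank copula, delta != 0."""
    ratio = math.expm1(-delta * u1) * math.expm1(-delta * u2) / math.expm1(-delta)
    return -math.log1p(ratio) / delta


def simulate_chain(n, lam, omega, delta, rng, tol=1e-12):
    """Draw a stationary first-order chain by inverting conditional cdfs."""
    # Tabulate the marginal cdf up to the point where it is numerically one.
    cdf = [zip_cdf(0, lam, omega)]
    while cdf[-1] < 1.0 - tol:
        cdf.append(zip_cdf(len(cdf), lam, omega))
    top = len(cdf) - 1

    def F(y):
        return cdf[y] if y >= 0 else 0.0

    def draw(cond_cdf):
        u, y = rng.random(), 0
        while y < top and cond_cdf(y) < u:
            y += 1
        return y

    path = [draw(F)]                      # first count from the marginal
    for _ in range(n - 1):
        prev = path[-1]
        f_prev = F(prev) - F(prev - 1)
        # Pr(Y_t <= y | Y_{t-1} = prev) from two copula evaluations.
        path.append(draw(lambda y: (frank_copula(F(y), F(prev), delta)
                                    - frank_copula(F(y), F(prev - 1), delta)) / f_prev))
    return path


series = simulate_chain(5000, 3.0, 0.3, 3.2, random.Random(1))
zero_share = series.count(0) / len(series)
```

Because the transition law is built from a copula with identical margins, the simulated series keeps the ZIP marginal, so the share of zeros should be close to $\omega + (1 - \omega) e^{-\lambda}$.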

We generated 500 simulated datasets for each of the above models with the sample sizes $n = 100$, 200 and 500. The evaluation criterion was chosen to be the mean absolute deviation error (MADE), which is given by:

$$\frac{1}{m} \sum_{i=1}^{m} |\hat{\vartheta}_i - \vartheta|,$$

where m is the number of replications, i.e. m = 500.
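For a single scalar parameter, the MADE criterion amounts to the following short computation. The function name and the three replicate estimates are hypothetical and used only to show the arithmetic.

```python
def made(estimates, true_value):
    """Mean absolute deviation error of replicate estimates around the truth."""
    return sum(abs(est - true_value) for est in estimates) / len(estimates)


# Three made-up replicate estimates of lambda = 3.
example_made = made([2.9, 3.1, 3.0], 3.0)
```

In a simulation study this would be applied separately to each component of $\vartheta$ across the $m$ replications.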

The parameter estimates were obtained after constructing the log-likelihood function given in (14) for the ZIP, ZINB, and ZICMP distributions. A summary of the simulation results is shown in Table 2, which reports the count time series ZIP, ZINB, and ZICMP models with the joint distribution of consecutive observations following the bivariate Gaussian copula. The results indicate that the proposed estimation method produces reasonable estimates and relatively small MADEs. In addition, as the sample size increases, the parameter estimates appear to converge to the true parameter values. To assess the approximate normality of the estimates, Q–Q plots of the ML estimates for the 500 ZIP, ZINB, and ZICMP replicates of length $n = 500$ are shown in Figure 2. These plots agree with the asymptotic results given in Section 4.2.

Table 2. Mean of estimates, with MADEs in parentheses, for Markov zero-inflated models with Gaussian copula.

| Model | n | λ | ω | κ | δ |
|---|---|---|---|---|---|
| ZIP | 100 | 2.9992 (0.2773) | 0.2949 (0.0591) | | 0.4840 (0.0748) |
| | 200 | 3.0056 (0.1698) | 0.2961 (0.0402) | | 0.4903 (0.0542) |
| | 500 | 2.9980 (0.1051) | 0.2977 (0.0249) | | 0.4933 (0.0342) |
| ZINB | 100 | 3.0078 (0.3225) | 0.2961 (0.0661) | 4.9338 (1.6244) | 0.4809 (0.0801) |
| | 200 | 3.0109 (0.2252) | 0.2968 (0.0581) | 5.2743 (1.4477) | 0.4860 (0.0581) |
| | 500 | 2.9990 (0.1430) | 0.2980 (0.0290) | 4.9858 (1.0384) | 0.4913 (0.0362) |
| ZICMP | 100 | 3.4689 (0.6813) | 0.2008 (0.0488) | 0.5516 (0.0860) | 0.4771 (0.0747) |
| | 200 | 3.3383 (0.4619) | 0.2016 (0.0332) | 0.5404 (0.0611) | 0.4847 (0.0540) |
| | 500 | 3.2545 (0.2996) | 0.2031 (0.0210) | 0.5326 (0.0404) | 0.4885 (0.0336) |

Figure 2.

Q–Q plots of the ML estimates for the 500 ZIP, ZINB and ZICMP processes of length n = 500.

Next, we simulated a stationary Markov process with the dependence structure following the Frank copula. The marginal parameters were chosen as (1) ZIP with θ=(λ=3,ω=0.3), (2) ZINB with θ=(λ=4.1,ω=0.25,κ=5), and (3) ZICMP with θ=(λ=4.1,ω=0.25,κ=0.5). The dependence parameter for the bivariate Frank copula was chosen to be δ=3.2 across all three models. A summary of the simulation results is shown in Table 3, which reports the count time series ZIP, ZINB, and ZICMP models with the joint distribution of consecutive observations following the bivariate Frank copula. As with the Gaussian copula, the estimates are close to the true values and the MADEs decrease as the sample size grows.

Table 3. Mean of estimates, with MADEs in parentheses, for Markov zero-inflated models with Frank copula.

| Model | n | λ | ω | κ | δ |
|---|---|---|---|---|---|
| ZIP | 100 | 2.9897 (0.2513) | 0.2935 (0.0556) | | 3.1585 (0.5723) |
| | 200 | 3.0272 (0.1916) | 0.2902 (0.0388) | | 3.1809 (0.5316) |
| | 500 | 3.0409 (0.1051) | 0.2965 (0.0249) | | 3.2226 (0.3563) |
| ZINB | 100 | 4.1530 (1.1330) | 0.2555 (0.1434) | 0.6780 (0.2990) | 3.0515 (0.7628) |
| | 200 | 4.0739 (0.8759) | 0.2471 (0.1177) | 0.5779 (0.2390) | 3.1333 (0.6089) |
| | 500 | 4.1200 (0.5459) | 0.2528 (0.0768) | 0.5307 (0.1348) | 3.1394 (0.3645) |
| ZICMP | 100 | 4.3895 (0.5621) | 0.2493 (0.0509) | 0.5167 (0.0719) | 3.2445 (0.5745) |
| | 200 | 4.1753 (0.4619) | 0.2500 (0.0355) | 0.5082 (0.0490) | 3.2312 (0.4222) |
| | 500 | 4.1159 (0.3890) | 0.2486 (0.0203) | 0.5052 (0.0339) | 3.2431 (0.2785) |

6. Application to sandstorm counts

The data set used in this example consists of the monthly count of strong sandstorms recorded by the AQI airport station in Eastern Province, Saudi Arabia. These data are a subset of the data studied in [1]. The station happens to be located in one of the major dust-producing regions in the world [10]. A sandstorm is a weather event that results from strong wind releasing dust from the ground and transferring it over long distances [5]. Sandstorms can cause many environmental and human-related hazards. For example, sandstorms impact the air quality, disturb daily activities, and disrupt transportation. Hence, studying and accurately analyzing the behavior of these phenomena is important to successfully forecast such events.

The monthly counts studied here are characterized as strong sandstorms by the AQI airport station. Tao et al. [26] stated that a strong sandstorm reduces visibility to less than 500 m and has average wind speeds of 17.2 to 24.4 m/s. The counts of these events contain zero inflation. Several studies have addressed rare events such as strong sandstorms (e.g. see [25] and [8]). Here we apply the proposed Markov zero-inflated count time series models with copula-based transition probabilities.

The data set consists of 348 monthly counts of strong sandstorms, starting from January 1978 to December 2013. The main objective was to apply the proposed models and investigate if there were any significant seasonal and trend components. Additionally, we investigated if there were any other predictors that affected the frequency of sandstorms, such as the monthly counts of dust haze events, maximum wind speed, temperature, and relative humidity.

Figure 1 shows the sandstorm series plot, the autocorrelation function, a bar-plot of the distribution of sandstorm counts, and a circular plot of the monthly mean count of sandstorms. From the time series plot and the bar-plot, we could see that the distribution of the sandstorm counts had more zeros relative to a Poisson distribution with the same empirical mean. These zeros represented about 59% of the sample. A decreasing trend could also be observed from the time series plot. Additionally, seasonality was seen from the autocorrelation function and the circular plot. In fact, from the circular plot, we concluded that most sandstorms occurred during springtime, i.e. March, April, and May. Thus, trend and seasonal covariates were added to the models.

Hence, we fitted several models with different marginal distributions and dependence structures. The marginal distributions were chosen to be the ZIP, ZINB, and ZICMP distributions with the log-linear function of the intensity parameter given as

$$\log(\lambda_t) = \beta_0 + \beta_1 (t \times 10^{-3}) + \beta_2 x_{1t} + \beta_3 x_{2t} + \beta_4 x_{3t},$$

and the logit function for the zero-inflation parameter given as

$$\mathrm{logit}(\omega_t) = \gamma_0 + \gamma_1 z_{1t} + \gamma_2 z_{2t} + \gamma_3 z_{3t},$$

for $t = 1, \ldots, n$, where $x_{1t} = z_{1t} = \cos(2\pi t / 12)$, $x_{2t} = z_{2t} = \sin(2\pi t / 12)$, and $x_{3t} = z_{3t}$ is the monthly count of dust haze events. The log-function of the dispersion parameter (if present) was given by $\log(\kappa) = \alpha$, i.e. it was chosen to be constant across time. We considered both first-order and second-order dependence structures. In the first-order Markov models, we fitted the bivariate Gaussian, Frank, Gumbel, reflected Gumbel, and Plackett copula functions for the joint distribution of two consecutive observations. For the second-order Markov models, we fitted the trivariate Gaussian and max-id copula functions for the joint distribution of three consecutive observations. In the trivariate max-id copula function, we chose the Laplace transform function, $\psi(\cdot)$, to be either the positive stable Laplace transform (PSLT) or the log series Laplace transform (LSLT), with $H(\cdot\,; \delta)$ chosen to be either the bivariate Frank or Gumbel copula. Out of these models, we selected two models, each with different marginals, based on the model selection criteria: the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the root mean square prediction error (RMSPE).

Table 4 shows comparisons of the ZIP, ZINB, and ZICMP Markov models with the different dependence structures. Increasing the dependence order improves the models: the second-order Markov models outperformed the first-order Markov models in terms of the AIC, BIC, and RMSPE values. However, the second-order parameters of the trivariate Gaussian copula and the Laplace transform parameters in the trivariate max-id copula are not always significant and can be dropped if necessary. Additionally, for the same dependence structure, the Markov models with ZINB margins seem to fit the sandstorm data better than the models with ZIP and ZICMP margins. Within the ZIP and ZINB margins, the bivariate reflected Gumbel and Frank copula functions are chosen to model the dependence structures, while the trivariate Gaussian copula and the trivariate max-id copula with PSLT and bivariate Gumbel copula function are chosen for the ZICMP margin. Hence, we want to show how each dependence structure can be interpreted in terms of the autocorrelation. The ordinary Poisson and NB Markov models are also shown in the table as a comparison with the best models from the ZIP and ZINB Markov models. The results show that the zero-inflated models outperform the ordinary models.

Table 4. Comparisons of the ZIP, ZINB, and ZICMP models with different dependence structures.

Marginal Copula Order AICᵃ BICᵇ RMSPEᶜ
ZIP Gaussian 1 435.55 945.76 1.59
  Frank 1 434.11 942.88 1.58
  Gumbel 1 438.32 951.32 1.60
  ref.Gumbel 1 433.87 942.41 1.57
  Plackett 1 435.21 945.09 1.58
  Gaussian 2 433.85 950.22 1.56
  PSLT/Frank 2 431.25 948.87 1.56
  LSLT/Frank 2 433.77 953.90 1.55
  PSLT/Gumbel 2 418.46 923.28 1.62
ZINB Gaussian 1 425.86 938.08 1.59
  Frank 1 424.54 935.45 1.59
  Gumbel 1 428.20 942.77 1.60
  ref.Gumbel 1 424.15 934.67 1.58
  Plackett 1 425.5 937.38 1.59
  Gaussian 2 424.06 942.34 1.56
  PSLT/Frank 2 417.00 928.23 1.57
  LSLT/Frank 2 423.67 941.56 1.57
  PSLT/Gumbel 2 413.05 920.32 1.63
ZICMP Gaussian 1 477.10 1040.56 1.91
  Frank 1 475.06 1036.49 1.88
  Gumbel 1 473.49 1033.3 1.79
  ref.Gumbel 1 473.73 1033.84 1.87
  Plackett 1 469.06 1024.48 1.77
  Gaussian 2 449.97 994.16 1.72
  PSLT/Frank 2 457.44 1009.11 1.80
  LSLT/Frank 2 450.91 996.04 1.77
  PSLT/Gumbel 2 455.12 1004.47 1.83
Poisson Frank 1 495.90 1038.9 1.76
  ref.Gumbel 1 488.16 1023.43 1.75
NB Frank 1 446.28 947.52 2.32
  ref.Gumbel 1 445.69 946.35 2.29

a AIC =l(ϑˆ;y)r,

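b BIC $= -2\,l(\hat{\vartheta}; y) + r \log n$.

c RMSPE $= \left\{ \frac{1}{n-p} \sum_{t=1+p}^{n} \left[ y_t - \hat{E}(Y_t \mid Y_{t-1}=y_{t-1}, \ldots, Y_{t-p}=y_{t-p}; \hat{\vartheta}) \right]^2 \right\}^{1/2}$.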
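For concreteness, the RMSPE in footnote c can be computed from an observed series and its one-step conditional-mean predictions as in the following Python sketch. The function name `rmspe` and the toy numbers are assumptions for illustration only; they are not the sandstorm data or the fitted predictions.

```python
import math


def rmspe(y, y_hat, p):
    """Root mean square prediction error for a Markov model of order p.

    y     : observed counts y_1, ..., y_n (stored 0-based)
    y_hat : one-step predictions, y_hat[t] being the conditional mean for y[t]
    p     : Markov order; the first p observations have no prediction
    """
    n = len(y)
    if len(y_hat) != n or not 0 <= p < n:
        raise ValueError("y and y_hat must have equal length and 0 <= p < n")
    # 0-based indices p, ..., n-1 correspond to t = p+1, ..., n in the footnote.
    squared_errors = sum((y[t] - y_hat[t]) ** 2 for t in range(p, n))
    return math.sqrt(squared_errors / (n - p))


# Toy example with made-up numbers and a first-order model (p = 1).
toy_y = [0, 2, 0, 1, 4]
toy_pred = [9, 1, 1, 1, 2]  # the first entry is ignored because p = 1
toy_value = rmspe(toy_y, toy_pred, p=1)
```

Dividing by n−p rather than n reflects that only the last n−p observations have the p lagged values needed to form a prediction.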

Table 5 shows that the zero-inflated Markov models are capable of accounting for first-order dependence. However, only the models with ZICMP margins account for the second-order dependence. The autocorrelation coefficients are similar when the dependence structure is the same for the models with ZIP and ZINB margins. For the marginal parameters, θ, the estimates are quite similar between the ZIP and ZINB models and differ slightly from those of the ZICMP models. All models suggest a significant decreasing trend in the number of strong sandstorms, since β1<0. Seasonality is also significant at annual frequencies, since β2, β3, γ1, and γ2 are significantly different from zero. Finally, the effect of dust haze is significant, since both β4 and γ3 are significantly different from zero. To compare the Markov models in terms of dependence, we consider Kendall's tau. When the chosen copula function is the reflected Gumbel, Kendall's tau is given by $\tau_K = 1 - \delta_1^{-1}$, so for the ZIP margin it equals $\tau_K = 0.1908$, which is similar to the value for the ZINB margin, $\tau_K = 0.2156$. When the copula function is the Frank, Kendall's tau is given by $\tau_K = 1 - 4[1 - D_1(\delta_1)]/\delta_1$, where $D_1(\cdot)$ is the Debye function $D_1(x) = \frac{1}{x}\int_0^x \frac{t}{e^t - 1}\,dt$. Thus, for the ZIP margin it equals $\tau_K = 0.1664$, which is similar to the value for the ZINB margin, $\tau_K = 0.1870$. In both cases, the ZINB margin yields slightly stronger dependence than the ZIP margin.
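The Kendall's tau values quoted above can be checked numerically from the δ1 estimates in Table 5. The sketch below is only an illustration: the function names are hypothetical, the Debye integral is approximated with a composite Simpson rule rather than any particular library routine, and the Gaussian formula τ=(2/π)arcsin(ρ) is a standard result that is assumed here, not stated in the text.

```python
import math


def debye1(x, steps=2000):
    """D_1(x) = (1/x) * integral_0^x t / (e^t - 1) dt, by composite Simpson."""
    def integrand(t):
        # The integrand tends to 1 as t -> 0.
        return 1.0 if t == 0.0 else t / math.expm1(t)

    h = x / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0 / x


def tau_gumbel(delta):
    """Kendall's tau for the (reflected) Gumbel copula."""
    return 1.0 - 1.0 / delta


def tau_frank(delta):
    """Kendall's tau for the Frank copula."""
    return 1.0 - 4.0 * (1.0 - debye1(delta)) / delta


def tau_gaussian(rho):
    """Kendall's tau for the bivariate Gaussian copula (assumed standard formula)."""
    return 2.0 / math.pi * math.asin(rho)


# delta_1 estimates taken from Table 5.
for label, value in [
    ("ZIP  ref.Gumbel", tau_gumbel(1.2358)),
    ("ZIP  Frank", tau_frank(1.5326)),
    ("ZINB ref.Gumbel", tau_gumbel(1.2748)),
    ("ZINB Frank", tau_frank(1.7328)),
    ("ZICMP Gaussian", tau_gaussian(0.2779)),
]:
    print(f"{label}: tau_K = {value:.4f}")
```

Reflecting a copula does not change Kendall's tau, which is why the ordinary Gumbel expression applies to the reflected Gumbel fits as well.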

Table 5. Parameter estimates (standard errors) for the copula-based Markov models fit to the sandstorms count series.

  ZIP ZINB ZICMP
Parameter ref.Gumbel Frank ref.Gumbel Frank Gaussian PSLT/Gumbel
β0 0.9578(0.1190) 0.9705(0.1172) 0.9960(0.1564) 0.9391(0.1593) 0.8139(0.0577) 0.7701(0.1938)
β1 −4.2579(0.6098) −4.0715(0.6049) −4.9031(0.8052) −4.7335(0.8168) −2.3656(0.0302) −2.4943(0.6999)
β2 −0.2184(0.0890) −0.1789(0.0879) −0.1956(0.1233) −0.1836(0.1253) −0.0921(0.0107) −0.1248(0.0860)
β3 0.3722(0.0957) 0.3650(0.0950) 0.4371(0.1255) 0.4379(0.1283) 0.2203(0.0554) 0.2466(0.0827)
β4 0.0656(0.0088) 0.0638(0.0089) 0.0635(0.0121) 0.0636(0.0124) 0.0452(0.0024) 0.0426(0.0112)
γ0 0.6627(0.2717) 0.7264(0.2677) 0.5357(0.3077) 0.5926(0.3086) 1.1825(0.0481) 1.1292(0.2221)
γ1 0.6236(0.2561) 0.6615(0.2514) 0.6706(0.2991) 0.6607(0.3006) 0.6252(0.0717) 0.6266(0.2014)
γ2 −0.9051(0.2507) −0.8798(0.2427) −0.8819(0.2814) −0.7799(0.2799) −0.9827(0.0456) −0.9824(0.2070)
γ3 −0.1407(0.0444) −0.1498(0.0448) −0.1596(0.0538) −0.1664(0.0545) −0.1565(0.0181) −0.1467(0.0335)
α     1.4876(0.5709) 1.4303(0.5431) 0.8083(0.0186) 0.7540(0.1400)
δ1 1.2358(0.0765) 1.5326(0.4672) 1.2748(0.0896) 1.7328(0.5034) 0.2779(0.0444) 1.0242(0.0885)
δ2         0.1818(0.0238) 1.1922(0.1057)
τK 0.1908 0.1664 0.2156 0.1870 0.1793 0.1612

Figure 3 displays the predicted values, which are the conditional expectations of $Y_t$ given $Y_{t-1}$ for the ZIP and ZINB Markov models and of $Y_t$ given $Y_{t-1}$ and $Y_{t-2}$ for the ZICMP Markov models, for t=1,…,n. The ZIP and ZINB models perform better than the ZICMP models, especially over the first hundred observations, where non-zero counts are more frequent. With the ZIP and ZINB margins, the reflected Gumbel and Frank copulas give very similar predictions. With the ZICMP margin, however, the Gaussian copula performs better than the max-id copula.

Figure 3. Predicted values using the conditional expectations of the Markov models fit to the sandstorm count series. Dots represent the observed counts.
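To make the notion of a conditional-expectation prediction concrete, the following Python sketch computes E(Yt | Yt−1 = x) for a first-order Markov model with ZIP margins linked by a Frank copula. It is a simplified illustration under stated assumptions, not the fitted model: the margins are taken to be stationary (no covariates), the parameter names `lam` (Poisson mean), `omega` (zero-inflation probability), and `delta` (copula parameter) and all function names are hypothetical, the numerical values are made up, and the infinite sum is truncated at `y_max`. The conditional probability uses the usual rectangle formula for a copula with discrete margins.

```python
import math


def zip_cdf(y, lam, omega):
    """CDF of a zero-inflated Poisson: omega + (1 - omega) * PoissonCDF(y; lam)."""
    if y < 0:
        return 0.0
    term = math.exp(-lam)  # Poisson pmf at 0
    acc = term
    for k in range(1, y + 1):
        term *= lam / k  # Poisson pmf at k from pmf at k - 1
        acc += term
    return min(1.0, omega + (1.0 - omega) * acc)


def frank_copula(u, v, delta):
    """Bivariate Frank copula C(u, v; delta), delta != 0."""
    num = math.expm1(-delta * u) * math.expm1(-delta * v)
    return -math.log1p(num / math.expm1(-delta)) / delta


def conditional_pmf(y, x, lam, omega, delta):
    """P(Y_t = y | Y_{t-1} = x) via the rectangle formula on the copula."""
    def F(k):
        return zip_cdf(k, lam, omega)

    def C(a, b):
        return frank_copula(a, b, delta)

    joint = (C(F(y), F(x)) - C(F(y - 1), F(x))
             - C(F(y), F(x - 1)) + C(F(y - 1), F(x - 1)))
    marginal = F(x) - F(x - 1)  # P(Y_{t-1} = x)
    return joint / marginal


def conditional_mean(x, lam, omega, delta, y_max=100):
    """E(Y_t | Y_{t-1} = x), truncating the sum at y_max."""
    return sum(y * conditional_pmf(y, x, lam, omega, delta)
               for y in range(1, y_max + 1))


# Made-up parameter values, for illustration only.
pred_after_zero = conditional_mean(0, lam=3.0, omega=0.3, delta=1.5)
pred_after_five = conditional_mean(5, lam=3.0, omega=0.3, delta=1.5)
```

With a positive Frank parameter the dependence is positive, so the prediction following a large count exceeds the prediction following a zero; as `delta` approaches zero the prediction approaches the marginal mean (1 − `omega`) × `lam`.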

Finally, we compare our proposed class of zero-inflated Markov models with some zero-inflated count time series models from the literature. We first fit the zero-inflated integer-valued autoregressive (ZIINAR) models with ZIP and ZINB distributed counts presented in [29] to the sandstorm counts. We also fit the class of zero-inflated nonlinear state-space (ZINLSS) count time series models given in [2] with ZIP, ZINB, and ZICMP distributions. Table 6 shows the parameter estimates and their standard errors for the five models, alongside the AIC of each model. The estimates of the marginal parameters are quite similar to those of our Markov models. However, the serial dependence parameters from the ZIINAR models indicate stronger serial dependence. The standard errors of our Markov models' parameters are smaller than those of both methods in Table 6. The AIC values are also similar across the classes of models, but the Markov models provide slightly smaller values. An important advantage of the models proposed in this paper is the straightforward estimation method, which is based on the MLE. In contrast, Yang et al. [29] maximized the partial likelihood function to estimate the parameters of the ZIINAR class, which could lead to some loss of information. In addition, the estimation method in [2] uses sequential importance sampling, which requires intensive computation. Finally, the proposed class of Markov models provides a variety of dependence structures to choose from.

Table 6. Parameter estimates (standard errors) for the ZIINAR [29] and ZINLSS [2] models fit to the sandstorms count series.

  ZIINAR ZINLSS
Parameter ZIP ZINB ZIP ZINB ZICMP
β0 0.7314(0.1424) 0.6898(0.1813) 0.9977(0.1175) 0.9709(0.1570) 0.7978(0.1888)
β1 4.1779(0.5435) 4.6293(0.6839) 4.1493(0.6065) 4.7477(0.7976) 2.4523(0.5772)
β2 0.0647(0.0964) 0.0369(0.1321) 0.2004(0.0885) 0.1813(0.1243) 0.1089(0.0723)
β3 0.3722(0.0957) 0.4146(0.1236) 0.3461(0.0938) 0.4231(0.1239) 0.2093(0.0786)
β4 0.3295(0.0936) 0.0676(0.0119) 0.0627(0.0088) 0.0645(0.0123) 0.0435(0.0094)
γ0 0.6685(0.2647) 0.5157(0.3114) 0.7647(0.2622) 0.5656(0.3047) 0.6629(0.2119)
γ1 0.5891(0.2356) 0.6894(0.2923) 0.6163(0.2460) 0.6648(0.2925) 1.0047(0.2132)
γ2 0.8578(0.2338) 0.7940(0.2740) 0.8931(0.2401) 0.8363(0.2736) 0.1496(0.0344)
γ3 0.1542(0.0439) 0.1754(0.0566) 0.1489(0.0424) 0.1659(0.0524) 0.2466(0.1613)
α   1.6104   0.6400(0.2437) 1.1733(0.2230)
δ 0.4110(0.1344) 0.4182(0.1669) 0.2580(0.0623) 0.2503(0.0724) 0.2870(0.0780)
AIC 435.44 427.06 435.45 425.81 430.8

7. Summary

Many scientific investigations and natural phenomena produce data that consist of counts measured sequentially at different time points, and it is common to see a large number of zeros in such count time series. Applying ordinary Poisson and NB distributions to these series might not be appropriate because of the frequent occurrence of zeros. In this paper, we have extended the work of Joe [14] to include a class of models that accounts for zero inflation. We used marginal ZIP, ZINB, and ZICMP distributions to build Markov zero-inflated count time series models. The serial dependence was captured by constructing bivariate and trivariate joint distributions of consecutive observations through copula functions such as the Gaussian, Frank, and Gumbel copulas. Simulation studies were conducted to evaluate the maximum likelihood estimation method; they indicated that the parameter estimators are consistent and approximately normally distributed. The proposed Markov models were applied to the sandstorm counts, where they proved reliable for handling zero-inflated count time series data. A future direction is to consider the pair copula construction introduced in [20] to fit second-order Markov models.

Acknowledgments

The authors thank the editor, the anonymous AE, and referee for the constructive comments, which led to considerable improvement of the manuscript. The authors thank the Saudi General Authority of Meteorological & Environmental Protection for making the sandstorm data available.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Alqawba M., Diawara N., and Kim J.M., Copula directional dependence of discrete time series marginals, Comm. Stat. Simul. Comput. (2019). doi: 10.1080/03610918.2019.1630434.
  • 2. Alqawba M., Diawara N., and Rao Chaganty N., Zero-inflated count time series models using Gaussian copula, Sequen. Anal. Design Methods Appl. 38 (2019), pp. 342–357.
  • 3. Billingsley P., Statistical Inference for Markov Processes, Vol. 2, University of Chicago Press, Chicago, 1961.
  • 4. Gonçalves E., Mendes-Lopes N., and Silva F., Zero-inflated compound Poisson distributions in integer-valued GARCH models, Statistics 50 (2016), pp. 558–578. doi: 10.1080/02331888.2015.1114622
  • 5. Goudie A.S. and Middleton N.J., Desert Dust in the Global System, Springer Science & Business Media, Verlag Berlin Heidelberg, 2006.
  • 6. Guolo A. and Varin C., Beta regression for time series analysis of bounded data, with application to Canada Google® Flu Trends, Ann. Appl. Stat. 8 (2014), pp. 74–88. doi: 10.1214/13-AOAS684
  • 7. Hasan M.T., Sneddon G., and Ma R., Regression analysis of zero-inflated time-series counts: Application to air pollution related emergency room visit data, J. Appl. Stat. 39 (2012), pp. 467–476. doi: 10.1080/02664763.2011.595778
  • 8. Ho C.H. and Bhaduri M., On a novel approach to forecast sparse rare events: Applications to Parkfield earthquake prediction, Natl. Hazards 78 (2015), pp. 669–679. doi: 10.1007/s11069-015-1739-1
  • 9. Ibragimov R., Copula-based characterizations for higher order Markov processes, Econ. Theory. 25 (2009), pp. 819–846. doi: 10.1017/S0266466609090720
  • 10. Idso S.B., Dust storms, Sci. Am. 235 (1976), pp. 108–115. doi: 10.1038/scientificamerican1076-108
  • 11. Jia Y., Kechagias S., Livsey J., Lund R., and Pipiras V., Latent Gaussian count time series modeling, arXiv preprint arXiv:1811.00203 (2018).
  • 12. Joe H., Multivariate Models and Multivariate Dependence Concepts, Chapman and Hall/CRC, Boca Raton, 1997.
  • 13. Joe H., Dependence Modeling with Copulas, Chapman and Hall/CRC, Boca Raton, 2014.
  • 14. Joe H., Markov models for count time series, in Handbook of Discrete-Valued Time Series, Chapman and Hall/CRC, Boca Raton, 2016, pp. 49–70.
  • 15. Joe H. and Hu T., Multivariate distributions from mixtures of max-infinitely divisible distributions, J. Multivar. Anal. 57 (1996), pp. 240–265. doi: 10.1006/jmva.1996.0032
  • 16. Lambert D., Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics 34 (1992), pp. 1–14. doi: 10.2307/1269547
  • 17. Lennon H. and Yuan J., Estimation of a digitised Gaussian ARMA model by Monte Carlo expectation maximisation, Comput. Stat. Data. Anal. 133 (2019), pp. 277–284. doi: 10.1016/j.csda.2018.10.015
  • 18. Masarotto G. and Varin C., Gaussian copula marginal regression, Electron. J. Stat. 6 (2012), pp. 1517–1549. doi: 10.1214/12-EJS721
  • 19. Nelder J.A. and Wedderburn R.W.M., Generalized linear models, J. Roy. Stat. Soc. Ser. A (General) 135 (1972), pp. 370–384. Available at http://www.jstor.org/stable/2344614. doi: 10.2307/2344614
  • 20. Panagiotelis A., Czado C., and Joe H., Pair copula constructions for multivariate discrete data, J. Amer. Statist. Assoc. 107 (2012), pp. 1063–1072. doi: 10.1080/01621459.2012.682850
  • 21. Patton A.J., Copula-based models for financial time series, in Handbook of Financial Time Series, eds., T. Mikosch, J.P. Kreiss, R. Davis and T.G. Andersen. Springer, Berlin, 2009, pp. 767–785. doi: 10.1007/978-3-540-71297-8_34.
  • 22. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013.
  • 23. Ridout M., Hinde J., and Demétrio C.G., A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives, Biometrics 57 (2001), pp. 219–223. doi: 10.1111/j.0006-341X.2001.00219.x
  • 24. Sellers K.F. and Raim A., A flexible zero-inflated model to address data dispersion, Comput. Stat. Data. Anal. 99 (2016), pp. 68–80. doi: 10.1016/j.csda.2016.01.007
  • 25. Tan S., Bhaduri M., and Ho C.H., A statistical model for long-term forecasts of strong sand dust storms, J. Geosci. Environ. Protect. 2 (2014), pp. 16. doi: 10.4236/gep.2014.23003
  • 26. Tao G., Jingtao L., Xiao Y., Ling K., Yida F., and Yinghua H., Objective pattern discrimination model for dust storm forecasting, Meteorolog. Appl. A J. Forecast., Pract. Appl. Training Tech. Model. 9 (2002), pp. 55–62.
  • 27. Wang P., Markov zero-inflated Poisson regression models for a time series of counts with excess zeros, J. Appl. Stat. 28 (2001), pp. 623–632. doi: 10.1080/02664760120047951
  • 28. Yang M., Cavanaugh J.E., and Zamba G.K., State-space models for count time series with excess zeros, Stat. Modelling. 15 (2015), pp. 70–90. doi: 10.1177/1471082X14535530
  • 29. Yang M., Zamba G.K., and Cavanaugh J.E., Markov regression models for count time series with excess zeros: A partial likelihood approach, Stat. Methodol. 14 (2013), pp. 26–38. doi: 10.1016/j.stamet.2013.02.001
  • 30. Zhu F., Zero-inflated Poisson and negative binomial integer-valued GARCH models, J. Statist. Plann. Inference 142 (2012), pp. 826–839. doi: 10.1016/j.jspi.2011.10.002

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
