Abstract
We propose zero-inflated statistical models based on the generalized Hermite distribution for simultaneously modelling excess zeros, over/underdispersion, and multimodality. These new models are parsimonious yet remarkably flexible, allowing covariates to be introduced directly through the mean, dispersion, and zero-inflation parameters. To accommodate the interval inequality constraint for the dispersion parameter, we present a new link function for the covariate-dependent dispersion regression model. We derive score tests for zero inflation in both covariate-free and covariate-dependent models. Both the score test and the likelihood-ratio test are conducted to examine the validity of zero inflation. The score test provides a useful tool when computing the likelihood-ratio statistic proves to be difficult. We analyse several hotel booking cancellation datasets extracted from two recently published real datasets from a resort hotel and a city hotel. These extracted cancellation datasets reveal complex features of excess zeros, over/underdispersion, and multimodality simultaneously, making them difficult to analyse with existing approaches. The application of the proposed methods to the cancellation datasets illustrates the usefulness and flexibility of the models.
Keywords: Covariate-dependent dispersion, excess zeros, multimodality, over/underdispersion, zero-inflated regression models, zero and k-inflated generalized Hermite distribution
1. Introduction
1.1. Statistical models for count data
The Poisson distribution is arguably the most widely used distribution for modelling count data. The other widely used count distributions include the binomial and negative binomial distributions where the assumptions of the Poisson model are violated, for example, when the range of count values is limited or when overdispersion is present. However, it is very common in practice that count data present not only overdispersion, but also zero inflation or multimodality, cases where the assumptions for these popular discrete distributions do not hold. To deal with overdispersion and zero inflation, various generalizations of these discrete distributions have been considered and extensively studied by researchers. These studies include, for example, the zero-inflated Poisson regression by Lambert [17]; the zero-inflated Poisson models by van den Broek [21], Bohning et al. [4], Xie et al. [24], Jansakul and Hinde [13]; the zero-inflated Poisson mixed regression models by Xiang et al. [23] and Xie et al. [25]; the zero-inflated binomial regression by Hall [11]; the zero-inflated negative binomial models by Jansakul and Hinde [14]; and zero-inflation and overdispersion in generalized linear models by Deng and Paul [5]. Other generalizations of the Poisson distribution that are compound-Poisson or contagious distributions such as the Hermite distribution have also been extensively studied (see, for instance, Gurland [10]; Kemp and Kemp [15]). To deal with overdispersion and multimodality, Giles [7] proposed multinomially inflated Poisson, negative binomial, and Hermite distributions. Giles [8] introduced covariates to the Hermite distribution and demonstrated that the Hermite regression model for currency and banking crises outperforms the Poisson and negative binomial models. 
Although the multinomially inflated Hermite distribution is more general than the Hermite distribution, there are no formal statistical procedures to test which value is inflated or whether a given value is inflated. Furthermore, fitting such an overly complicated multinomially-inflated Hermite distribution is not a trivial task and may lead to identifiability problems. Since the Hermite distribution is already capable of capturing count-inflation in its own right, a more natural extension is the generalized Hermite (GH) distribution introduced by Gupta and Jain [9]. This flexible distribution can account for both overdispersion and multimodality. Nevertheless, it may not be able to deal with excess zeros, over/underdispersion, and multimodality simultaneously. In this paper, we propose zero-inflated statistical models based on the GH distribution to simultaneously account for excess zeros, over/underdispersion, and multimodality. We derive score statistics for testing zero inflation in both covariate-free and covariate-dependent GH models. Both the score test and the likelihood-ratio test are conducted to examine the validity of zero inflation. The score test provides a useful tool when computing the likelihood-ratio statistic proves to be difficult. We demonstrate the advantages of using the proposed methods to analyse the multiple features of the hotel cancellation datasets.
1.2. Challenges of booking cancellations in hotel industry
The hotel industry captures huge volumes of structured and unstructured data in real time, catering to millions of travellers every day (see [1] and references therein). Vast quantities of the data in the hotel industry are unstructured or semi-structured text data, for example, in the form of XML files or other markup languages. It is not only computationally difficult to extract quantitative information from the text-heavy data due to its sheer volume and complexity but also statistically challenging to analyse the information encoded in the text data. Although the text-heavy data contains rich information about consumer preferences, the more structured quantitative data characterizing customer behaviour and travel trends is rather limited. Such data remains under-analysed and statistically poorly understood. To demonstrate the statistical challenges present in these types of data, we analyse several hotel booking cancellation datasets extracted from two real datasets that were recently published by Antonio et al. [2]. These publicly available datasets provide extremely valuable hotel demand data for understanding and characterizing customer behaviour at a resort hotel and at a city hotel. The extracted cancellation datasets reveal complex features of excess zeros, over/underdispersion, and multimodality simultaneously, making them difficult to analyse with existing approaches.
Since cancellation of hotel reservations has a significant impact on revenue, forecasting cancellation rates is a key component of any hotel revenue management system. Hotel revenue management aims to maximize the revenue obtained from a limited supply of rooms by dynamically controlling the price and quantity offered. To deal with the risk of empty rooms resulting from cancellation, hotels routinely practice overbooking, that is, accepting more bookings than actual capacity based on the estimated number of cancellations. However, cancellation rate forecasting is required not only in determining the levels of overbooking but also in estimating net demand, as discussed by Morales and Wang [18].
The existing approaches in dealing with forecasting cancellation focus mainly on machine learning methods to mine potential predictors from Passenger Name Record, Property Management System, or other hotel data records for classifying each booking into the two classes of ‘cancelled’ and ‘not cancelled’ and for estimating two-class probability (see [3,18], and references therein). None of the existing methods examine the statistical features of the distributions of cancellations with respect to the number of days prior to check-in. As the cancellation deadline approaches, accurate statistical models of cancellation are essential for forecasting and decision-making in dynamic pricing and capacity allocation of the hotel revenue management. Hotel cancellation policies vary a lot per hotel chains, brands, and hotel types. Even the same hotel can have multiple cancellation policies that vary per rate, season, and distribution channel. One of the common policies for many hotels is to allow free cancellations (i.e. with no penalty) up to 24 hours before check-in time on the date that the reservation starts. A customer who cancels within 24 hours of check-in forfeits the price of the room for the first night of the reservation but can be refunded the amount paid for additional nights beyond the first. No-shows are customers who booked but never arrived for their stay at the hotel. These customers are also charged. Nevertheless, a no-show is actually not all that bad for the hotel. The customer pays for the room but the hotel incurs no room cleaning or other charges. Consequently, accurate prediction of cancellations is more important than predicting no-shows since a cancellation gives the hotel an opportunity to rebook the room if the cancellation happens early enough.
1.3. Description of hotel booking cancellation datasets
We extracted six hotel booking cancellation datasets and related variables from two real datasets published by Antonio et al. [2], which provide hotel demand data with 40,060 observations on 31 variables from a resort hotel (H1) and 79,330 observations on 31 variables from a city hotel (H2). Both datasets contain bookings due to arrive between the 1st of July 2015 and the 31st of August 2017.
Since the two datasets share the same structure (see [2]), we describe our data extraction procedure only for the resort hotel. First, we extracted the number of previous bookings that were cancelled by the customer prior to the current booking. There are two variables named ‘PreviousCancellations’ and ‘PreviousBookingsNotCanceled’ in the Antonio et al. dataset. Since a zero value for either variable may correspond to the case in which no customer profile was associated with the booking, as described by Antonio et al. [2], we require the sum of the two variables to be at least one to guarantee that a zero value of ‘PreviousCancellations’ is a genuine zero. The extracted dataset for this resort hotel is denoted by PrevCanH1. It contains 1820 zeros out of 2915 observations, so the percentage of zero values is approximately 62.44%. The number of previous cancellations (NumPrevCan) is clearly a proper discrete random variable, but it is not very interesting in practice as a response variable because previous cancellations had already happened before the current booking. Our purpose in analysing this count dataset is simply to show an application of our methods with a proper count response. By the same extraction procedure, we obtained the second dataset, denoted by PrevCanH2, for the city hotel.
Next, we consider extracting information on a more interesting response variable that is important for hotel revenue management. For each cancelled booking, we computed the time (in days) from cancellation to the scheduled check-in date, which is our response variable of interest. As a potential predictor, we also computed the length of stay (LOS) between the check-in date and the check-out date. It is worth pointing out that LOS is readily available for each booking regardless of whether the booking will be cancelled or not, whereas NumPrevCan is not always available, for example, for new customers. The construction of these two numerical variables led to our booking cancellation datasets for both hotels. From each cancellation dataset, we extracted only those bookings that were cancelled within 30 days of their scheduled check-in dates and with a stay of at least one night. The resulting dataset from the resort hotel is denoted by TCLOSH1. It contains 3898 cancellations. Its response variable has 320 zeros representing cancellations on their scheduled check-in dates. The cancellation dataset extracted from the city hotel is denoted by TCLOSH2. Finally, we consider extracting a cancellation dataset with two potential predictors for our response variable, the time from cancellation to the scheduled check-in date. The first predictor is LOS from TCLOSH1. The second predictor is the Average Daily Rate (ADR) as described by Antonio et al. [2]. Since a significant number of ADR values in the Antonio et al. datasets are recorded as zeros, we extracted only positive ADR values corresponding to those cancelled bookings from TCLOSH1. These extracted ADR values range from 4 to 437. We applied the log transformation so that the log ADR (LogADR) values are on the same scale as LOS. The resulting dataset is denoted by TCLOSLogADRH1. It contains 3854 values of its response variable with 309 zeros (8.02%). The two-predictor cancellation dataset extracted from the city hotel is denoted by TCLOSLogADRH2.
All of these datasets exhibit the feature of substantial numbers of zeros. Figure 1 obtained from TCLOSH1 shows that the rates of cancellation vary according to the time until arrival. They increase as we get close to the arrival day. The number of cancellations on the check-in date is excessive relative to the number of cancellations on other days. More importantly, Figure 1 exhibits, in the absence of covariates, the multimodality of the distribution of cancellations with respect to the number of days prior to check-in. It is worth pointing out that the means of the responses are not identical in the presence of covariate effects, in which case multimodality of the response should be assessed after removing the covariate effects. To assess the multimodality feature in the presence of covariates, at least graphically, we applied the proposed zero-inflated GH regression method to the datasets TCLOSH1 and TCLOSLogADRH1. More specifically, we model the mean, dispersion, and zero-inflation parameters as a function of LOS for TCLOSH1. We model both the mean and dispersion parameters as a function of LOS and LogADR for TCLOSLogADRH1. We performed our zero-inflated GH regression and obtained estimates of all regression coefficients associated with these models and standardized our responses by their estimated means and standard deviations. Since the standardized response variables are no longer integer-valued and they are real-valued and overlapping, we estimated their distributions by the nonparametric kernel density estimates using the default bandwidth and kernel function provided by the R density function. The kernel density estimates are displayed in Figure 2. The two major peaks in Figure 2 correspond to the two major modes in Figure 1 at the values of 0 and 11. These results show that even after removing the covariate effects, at least from the mean and variance of the response, the multimodality of the distribution of the response still persists for both datasets.
Figure 1.
Frequency distribution of cancellation at resort hotel.
Figure 2.
Kernel density estimates based on zero-inflated GH regression.
2. Methodology
2.1. Generalized Hermite distribution
The GH probability mass function (pmf) is
$$P(Y=y)=e^{-(\alpha+\beta)}\sum_{j=0}^{[y/m]}\frac{\alpha^{\,y-mj}\,\beta^{\,j}}{(y-mj)!\,j!},\qquad y=0,1,2,\ldots,$$
where α > 0, β > 0, m is a positive integer, and [x] denotes the integer part of a real number x. It reduces to the Hermite distribution when m = 2. For fixed values of α and β, Figure 3 displays the effect of the parameter m on the number of modes in the GH pmf. The larger the value of m, the more modes the pmf tends to have. The multiple modes of the GH pmf are the major feature that distinguishes it from other discrete distributions, and they provide greater flexibility for modelling count data.
Figure 3.
GH probability mass function.
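The effect of the order m on the modes can be explored numerically. The following is a minimal Python sketch, assuming the standard Gupta and Jain form of the GH pmf (a Poisson-type sum over j up to the integer part of y/m); the function name `gh_pmf` and the parameter values are illustrative choices, not taken from the paper.

```python
import math

def gh_pmf(y, alpha, beta, m):
    """P(Y = y) for a generalized Hermite law with alpha, beta > 0 and order m.

    Assumed form: exp(-(alpha+beta)) * sum_j alpha^(y-mj) beta^j / ((y-mj)! j!).
    """
    total = 0.0
    for j in range(y // m + 1):  # j runs up to the integer part of y/m
        total += (alpha ** (y - m * j) * beta ** j
                  / (math.factorial(y - m * j) * math.factorial(j)))
    return math.exp(-(alpha + beta)) * total

# Illustrative (hypothetical) parameter values.
alpha, beta, m = 0.5, 0.15, 6
probs = [gh_pmf(y, alpha, beta, m) for y in range(60)]
mean = sum(y * p for y, p in enumerate(probs))
```

With these values the probabilities decrease from y = 0 to y = 5 and then jump at y = m = 6, which is the kind of secondary mode the order parameter produces.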
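Puig [20] presents a useful reparametrization of the GH model in terms of the mean μ = α + mβ and the coefficient of dispersion d = σ²/μ, where σ² = α + m²β is the variance. The domain of these parameters is μ > 0 and 1 < d < m. When d = 1, the GH model reduces to the Poisson distribution. This representation allows the covariates to be easily incorporated in the GH regression model, and the GH pmf becomes
$$P(Y=0)=\exp\left[\mu\left(-1+\frac{d-1}{m}\right)\right], \tag{1}$$
$$P(Y=y)=P(Y=0)\,\frac{\mu^{y}(m-d)^{y}}{(m-1)^{y}}\sum_{j=0}^{[y/m]}\frac{\left[(d-1)(m-1)^{m-1}\right]^{j}}{m^{j}\,\mu^{(m-1)j}\,(m-d)^{mj}\,(y-mj)!\,j!},\qquad y=1,2,\ldots \tag{2}$$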
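The link between the two parametrizations can be checked by a short calculation. The derivation below is a sketch that assumes the usual GH moments, namely mean α + mβ and variance α + m²β, and solves for α and β in terms of μ and d.

```latex
\begin{aligned}
\mu &= \alpha + m\beta, \qquad \mu d = \alpha + m^{2}\beta \\
\Rightarrow\quad \beta &= \frac{\mu(d-1)}{m(m-1)}, \qquad
\alpha = \frac{\mu(m-d)}{m-1}, \\
\alpha+\beta &= \frac{\mu(m+1-d)}{m}
\;\Rightarrow\;
P(Y=0) = e^{-(\alpha+\beta)} = \exp\!\left[\mu\left(-1+\frac{d-1}{m}\right)\right].
\end{aligned}
```

Positivity of α and β is equivalent to 1 < d < m, which is the interval constraint on the dispersion index used in the regression models.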
2.2. Zero-inflated GH distribution
Although the GH distribution models the value of zeros via the probability Equation (1), it is sometimes not adequate to account for the excessive number of zeros observed in many applications such as the cancellation datasets to be analysed. We propose the zero-inflated GH distribution.
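$$P(Y=0)=\omega+(1-\omega)\,p_{0},\qquad P(Y=y)=(1-\omega)\,p_{y},\quad y=1,2,\ldots, \tag{3}$$
where $p_0$ and $p_y$ denote the GH probabilities given in Equations (1) and (2), respectively. This model can be viewed as a two-component Binomial-GH mixture with the mixing probability ω. The zero-inflated GH probability generating function (pgf) is
$$G_Y(s)=\omega+(1-\omega)\exp\left[\alpha(s-1)+\beta\left(s^{m}-1\right)\right],\qquad \alpha=\frac{\mu(m-d)}{m-1},\quad \beta=\frac{\mu(d-1)}{m(m-1)}.$$
Its mean is $E(Y)=(1-\omega)\mu$ and its variance is $\mathrm{Var}(Y)=(1-\omega)\mu(d+\omega\mu)$. Thus its overdispersion index is $\mathrm{Var}(Y)/E(Y)=d+\omega\mu$. This shows that the zero-inflation also increases the overdispersion. Its zero-inflated index is
$$ZI=1+\frac{\log P(Y=0)}{E(Y)}.$$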
A naive estimator of ZI is obtained by plugging the empirical estimates of P(Y = 0) and E(Y) from the data into the above equation for ZI. For example, , , and , where the response variables used for TCLOSH1 and TCLOSH2 are the time of cancellation to its check-in date. These index values suggest zero inflation. It is worth pointing out that simply having many zero response observations is not necessarily an indication of zero inflation; the question is whether there are too many zeros given the specified model. The percentage of zeros in PrevCanH1 is 62.44%, the percentage of zeros in TCLOSH1 is 8.21%, and the percentage of zeros in TCLOSH2 is 5.09%. However, in all of these cases, our score tests show that the number of zeros is too large for the GH distribution to fit the corresponding data well.
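The plug-in idea can be illustrated with the PrevCanH1 frequencies listed in Table 5. The Python sketch below assumes that the zero-inflation index has the form ZI = 1 + log P(Y = 0)/E(Y), which equals zero for a Poisson distribution; the paper's exact expression is not reproduced here, so the resulting value is indicative only.

```python
import math

# Actual counts for PrevCanH1 as listed in Table 5 (value -> frequency).
counts = {0: 1820, 1: 896, 2: 44, 3: 14, 4: 6, 5: 3,
          14: 14, 19: 19, 24: 48, 25: 25, 26: 26}

n = sum(counts.values())
mean = sum(y * f for y, f in counts.items()) / n
var = sum(f * (y - mean) ** 2 for y, f in counts.items()) / (n - 1)
p0_hat = counts[0] / n                      # empirical estimate of P(Y = 0)

dispersion_index = var / mean               # empirical variance-to-mean ratio
zi_index = 1.0 + math.log(p0_hat) / mean    # assumed definition of ZI
```

A positive `zi_index` indicates more zeros than a Poisson distribution with the same mean would produce; whether the zeros are excessive relative to the GH model is the question the score test addresses.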
Occasionally we observe count data with underdispersion. Since , the zero-inflated GH distribution can model underdispersion if ω takes a negative value. However, this would, in general, lead to a relative shortage of zeros and lose the meaning of mixing probability. An easy way to deal with underdispersion and excess zeros is to extend our zero-inflated GH model to include an additional inflated value k. This is a generalization of Giles' multinomially-inflated Hermite distribution to the GH distribution. But our objective is not to test inflation at the value of k as the GH distribution is already capable of capturing count-inflation in its own right. Our goal is to use the zero and k-inflated GH model to simultaneously model excess zeros, underdispersion, and multimodality. In particular, we propose to choose k empirically from the data to be the first mode greater than zero. Mathematically the zero and k-inflated GH distribution is specified by
(4) |
It follows from (4) that the mean and variance of zero and k-inflated GH pmf are
(5) |
In particular, if , we obtain the mean and variance of the zero-inflated GH distribution. If , equations in (5) reduce to the mean and variance of the GH distribution. These flexible mean and variance expressions allow the zero and k-inflated GH model to account for both overdispersion and underdispersion without requiring or to be negative. In fact, under the constraints that , and , our real data analysis demonstrates that the zero and k-inflated GH distribution can model both cases of overdispersion and underdispersion.
In most applications of the zero-inflated models, we observe count data with overdispersion rather than underdispersion. We focus on the zero-inflated GH model in this paper. To derive our score statistics for testing zero inflation in (3), it is convenient to express the mixing probability ω in terms of (see [21]). It follows from the recurrence relation derived by Gupta and Jain [9] for computing the GH probability that the zero-inflated GH probability can be computed recursively as
2.3. Maximum likelihood estimation of the zero-inflated GH model
Maximum likelihood (ML) estimation for the GH distribution has been studied by Puig [20] who derived a necessary and sufficient condition for the existence of the MLE of d and μ for a fixed value of m. In the domain of the parameters, the log-likelihood function is strictly concave and hence the MLE is unique. The MLE of μ is simply the sample mean. For the ML estimation of the zero-inflated GH model, we also consider the log-likelihood function for a fixed value of m unless otherwise stated. Given a random sample from the pmf (3), the log-likelihood function is
The MLEs of μ, d, and θ can be found by maximizing the log-likelihood function, provided that they belong to the interior of the domain of the parameters. In our data analysis, we first use the R package hermite developed by Morina et al. [19] to fit the GH model. The resulting estimates are then used as the initial values for fitting our zero-inflated GH model. We use the R package maxLik written by Henningsen and Toomet [12] to obtain the MLEs of μ, d, and θ. The Newton-Raphson algorithm converged in just a few iterations.
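To make the objective function concrete, the following Python sketch evaluates a zero-inflated GH log-likelihood on the simulated sample of Table 2 at the ZiGH estimates reported in Table 1. It is not the authors' R code and performs no optimization. It assumes the relation θ = ω/(1 − ω) between the zero-inflation parameter and the mixing probability, which is consistent with the predicted counts in Table 2; the function names are illustrative.

```python
import math

def gh_pmf(y, mu, d, m):
    """GH probability in the (mu, d) parametrization, via alpha and beta."""
    alpha = mu * (m - d) / (m - 1)
    beta = mu * (d - 1) / (m * (m - 1))
    s = sum(alpha ** (y - m * j) * beta ** j
            / (math.factorial(y - m * j) * math.factorial(j))
            for j in range(y // m + 1))
    return math.exp(-(alpha + beta)) * s

def zigh_loglik(counts, mu, d, theta, m):
    """Zero-inflated GH log-likelihood for grouped data (value -> frequency)."""
    omega = theta / (1.0 + theta)           # assumed link between theta and omega
    ll = 0.0
    for y, freq in counts.items():
        p = (1.0 - omega) * gh_pmf(y, mu, d, m)
        if y == 0:
            p += omega                      # extra point mass at zero
        ll += freq * math.log(p)
    return ll

# Simulated sample from Table 2 (value -> observed frequency).
table2 = {0: 578, 1: 210, 2: 76, 3: 18, 4: 2, 5: 1, 6: 45, 7: 35,
          8: 10, 9: 9, 10: 1, 12: 7, 13: 3, 14: 4, 15: 1}

ll = zigh_loglik(table2, mu=1.9272, d=3.9769, theta=0.4723, m=6)
```

Evaluated at the rounded estimates, the log-likelihood should agree closely with the ZiGH value of −1373.654 reported in Table 1. A general-purpose optimizer could be wrapped around `zigh_loglik` in the same way that maxLik is used in the paper.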
2.4. Score test for zero inflation
We propose a score statistic to test whether the number of zeros is too large for the GH distribution to adequately account for. We first consider the case of no covariates and then the case with covariates.
2.4.1. The case of no covariates
To derive the score statistic for testing , we need the score function and the expected information matrix , whose derivations are provided in the Online Supplement. Let and denote the MLEs of μ and d under . We obtain the score statistic as
(6) |
where
2.4.2. The case with covariates
Covariates can be incorporated into the zero-inflated GH model in various ways to build zero-inflated GH regression models. We model μ as a function of explanatory variables through a link function . More specifically, given a vector of covariates for the ith observation, we consider , where is a vector of regression coefficients to be estimated and is specified to be the logarithmic link function. This simple specification ensures that all the fitted values are positive. Since the dispersion index d is required to satisfy the constraint 1<d<m, any model of d has to respect this constraint. For each given m, we model d as a function of explanatory variables through a link function . In other words, given a vector of covariates, we consider , where is specified by
(7) |
It is easy to check that this choice of the link function h allows the dispersion index d to satisfy the constraint 1<d<m. To model the zero-inflated parameter θ, we first derive our score statistic for testing . Once we reject this hypothesis, we will model θ as a function of explanatory variables by using the logarithmic link function . Both elements of and elements of may include elements of . The order parameter m is either specified or regarded as an unknown constant to be estimated. Given the covariate vectors , , and , we assume to follow a zero-inflated GH distribution of order m. We provide the derivations of and in the Online Supplement based on the log-likelihood function given by
Let and be the ML estimators under . Let and be the corresponding estimates of and . Then we obtain the score statistic
(8) |
where is presented in the Online Supplement.
In some applications, it is reasonable to assume that the dispersion parameter does not depend on any covariate. We only need to model μ and the zero-inflation parameter θ as a function of explanatory variables. Let and be the ML estimators under . Then the score statistic is given by
(9) |
where is given in the Online Supplement.
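The interval constraint 1 < d < m used in this subsection can be enforced by any smooth, strictly increasing map from the real line onto (1, m). The Python sketch below uses a scaled logistic map purely as an illustration of such a link; it is an assumption and is not necessarily the link function given in Equation (7). The function names are hypothetical.

```python
import math

def dispersion_from_eta(eta, m):
    """Map a real-valued linear predictor eta to d in the open interval (1, m).

    Scaled logistic map, used here as an assumed example of a valid link.
    """
    return 1.0 + (m - 1.0) / (1.0 + math.exp(-eta))

def eta_from_dispersion(d, m):
    """Inverse map: the corresponding link h(d) = log((d - 1) / (m - d))."""
    return math.log((d - 1.0) / (m - d))

m = 6
values = [dispersion_from_eta(eta, m) for eta in (-5.0, 0.0, 3.0)]
```

Whatever the linear predictor returns, the fitted dispersion index stays strictly between 1 and m, so no constrained optimization is needed for the regression coefficients.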
3. Applications
3.1. Comparisons with some existing zero-inflated models
Given a random sample of count data exhibiting excess zeros, overdispersion, and multimodality simultaneously, it would be interesting to compare the performance of several existing zero-inflated models with our zero-inflated GH model. We consider the zero-inflated Poisson (ZiP) model, the zero-inflated negative binomial (ZiNB) model, the hurdle Poisson (HP) model, the hurdle negative binomial (HNB) model [26,27], and the MIP model [7]. The zero-inflated models are mathematically specified by
where, for the zero-inflated negative binomial model,
(10) |
whereas for the zero-inflated Poisson model,
(11) |
The hurdle model is given by
The HNB and HP models correspond to the specification of by (10) and (11), respectively. Finally we consider the MIP model proposed by Giles [7]. In order to decide what specific MIP model to fit, we need to determine what integer values are inflated. Since there is no formal statistical procedure available to test count-inflation for the MIP model, we resort to the empirical distribution of our count data to decide what values are potentially inflated. We generate a random sample of 1000 observations from the zero-inflated GH distribution with , and m = 6. The data is given in Table 2. Its empirical distribution seems to suggest possible count-inflation at the integer values of 0, 6, and possibly 12. We consider two specifications for the MIP model: zero and six inflated Poisson model (MIP2); zero, six, and twelve inflated Poisson model (MIP3). The MIP3 model is specified by
(12) |
The MIP2 model is specified via (12) by setting and changing the last restriction to . We fitted these models to the random sample, and the ML estimates of the model parameters are presented in Table 1 together with their asymptotic standard errors given in parentheses and the values of the log likelihood. We also employed Tukey's rootograms as described by Kleiber and Zeileis [16] to graphically assess the fit of these models. The rootogram compares observed and expected values graphically by plotting histogram-like rectangles for the observed frequencies and a curve for the fitted frequencies, all on a square-root scale (see [16] for the description and interpretation of rootograms, in particular, the hanging rootogram). As seen from the hanging rootograms in Figure 4, ZiP and HP perform similarly, and ZiNB and HNB also perform similarly. In order to fit the value of zero well, all of these distributions significantly under-estimated the value of one, and they also under-estimated other modes of the underlying distribution. It is worth noting from Table 1 that the log-likelihood values for ZiNB and HNB are larger than those for MIP2 and MIP3. However, none of these fitted models captures the multimodality of the underlying distribution as well as MIP2 and MIP3, as seen from Figure 5.
Table 2. Actual and predicted counts.
Y | Actual | Zero-inflated GH | GH | MIP2 | MIP3 |
---|---|---|---|---|---|
0 | 578 | 578 | 379 | 578 | 578 |
1 | 210 | 201 | 274 | 77 | 88 |
2 | 76 | 78 | 72 | 104 | 108 |
3 | 18 | 20 | 13 | 93 | 88 |
4 | 2 | 4 | 2 | 62 | 54 |
5 | 1 | 1 | 0 | 33 | 26 |
6 | 45 | 49 | 67 | 45 | 45 |
7 | 35 | 38 | 36 | 6 | 4 |
8 | 10 | 15 | 9 | 2 | 1 |
9 | 9 | 4 | 2 | 1 | 0 |
10 | 1 | 1 | 0 | 0 | 0 |
12 | 7 | 5 | 4 | 0 | 7 |
13 | 3 | 4 | 2 | 0 | 0 |
14 | 4 | 1 | 1 | 0 | 0 |
15 | 1 | 0 | 0 | 0 | 0 |
Table 1. MLE results from fitting GH, ZiGH, ZkiGH, ZiNB, HNB and MIP models.
MIP2 | MIP3 | ZiGH | GH | ZiNB | HNB | ZkiGH | |
---|---|---|---|---|---|---|---|
0.5492 | 0.5419 | 0.3245 | |||||
(0.017) | (0.017) | (0.037) | |||||
0.0301 | 0.0343 | 0.00006 | |||||
(0.007) | (0.007) | (0.0097) | |||||
0.0070 | |||||||
(0.003) | |||||||
ω | 0.00004 | 0.4220 | |||||
(0.001) | (0.016) | ||||||
μ | 2.6822 | 2.4454 | 1.9272 | 1.3090 | 1.3089 | 0.0001 | 1.9317 |
(0.098) | (0.095) | (0.139) | (0.072) | (0.081) | (0.006) | (0.141) | |
θ | 0.4723 | 0.3313 | 0.00001 | ||||
(0.076) | (0.025) | (0.001) | |||||
d | 3.9769 | 3.9790 | 3.9789 | ||||
(0.117) | (0.118) | (0.145) | |||||
logL | −1735.979 | −1692.839 | −1373.654 | −1398.334 | −1479.522 | −1470.443 | −1373.667 |
Figure 4.
Hanging rootograms for zero-inflated Poisson and hurdle models.
Figure 5.
Hanging rootograms for zero-inflated GH and MIP models.
Both MIP2 and MIP3 fitted the modes very well and MIP3 performed better than MIP2. However, they either underestimate or overestimate those values between the modes. Both MIP2 and MIP3 were not able to capture the shape between the modes of the true underlying distribution. We also fitted the zero and k-inflated GH (ZkiGH) model with k equal to 6 being the first mode greater than zero. Table 1 shows that parameter estimates and log likelihood from fitting zero and six-inflated GH model are almost the same as their corresponding estimates and log likelihood from fitting zero-inflated GH model that is nested in the zero and six-inflated GH model. In fact, the estimate of is statistically no different from zero. If we set to zero and use the remaining estimates of the zero and six-inflated GH model to compute the predicted counts, then these predicted counts are identical to those of fitting the zero-inflated GH model except for predicting counts of zero and one with 580 and 199, respectively. For this reason, we did not include those results associated with the zero and six-inflated GH model in Table 2, Figure 4, and Figure 5.
Among all these models, the best fitted model in terms of the log-likelihood values is clearly the zero-inflated GH model. This is, of course, expected as the data is generated from the zero-inflated GH distribution. Based on the ML estimates of the GH model, the score statistic as given in (6) is computed to be 50.2005. Its p-value from the χ² distribution with one degree of freedom is 1.388124e−12. This provides strong evidence for rejecting the null hypothesis of no zero inflation. The likelihood-ratio statistic is 49.36, which leads to the same conclusion.
Applications of zero-inflated GH models require the specification of the order m. The question is how to choose or estimate m. Since the order m is a discrete parameter, finding the ML estimate of m is a challenging task. One reasonable method of estimation is the profile likelihood method as given in [19]. For each value of m between two given positive integers, we maximize the log likelihood of the GH model with respect to the remaining model parameters and obtain the corresponding log-likelihood value. We then select the value of m that maximizes these log-likelihood values. Taking our simulated data as an example, we fitted the GH model for each value of m between 2 and 10 and obtained the corresponding log-likelihood values given in Table 3. It follows from Table 3 that the value of m that maximizes these log likelihoods is equal to 6. In our data analysis, we found that the optimal value of m is 6.
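The two test statistics for the simulated data can be checked with a few lines of arithmetic. The Python sketch below recomputes the likelihood-ratio statistic from the GH and ZiGH log-likelihoods in Table 1 and refers both statistics to a chi-square distribution with one degree of freedom, which reproduces the reported score-test p-value; using the same reference distribution for the likelihood-ratio statistic is an assumption made for illustration. The helper name `chi2_sf_1df` is hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Uses P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    return math.erfc(math.sqrt(x / 2.0))

loglik_gh = -1398.334      # GH log-likelihood from Table 1 (null model)
loglik_zigh = -1373.654    # ZiGH log-likelihood from Table 1

lr_stat = 2.0 * (loglik_zigh - loglik_gh)   # likelihood-ratio statistic
p_score = chi2_sf_1df(50.2005)              # score statistic reported in the text
p_lr = chi2_sf_1df(lr_stat)
```

The score statistic needs only the fit under the null GH model, whereas `lr_stat` also requires the maximized zero-inflated log-likelihood.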
Table 3. Order selection by profile likelihood method.
Order m | GH LogL |
---|---|
2 | −1774.13 |
3 | −1629.988 |
4 | −1614.611 |
5 | −1525.794 |
6 | −1398.334 |
7 | −1700.919 |
8 | −1914.574 |
9 | −1958.16 |
10 | −2011.087 |
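The order selection in Table 3 can be sketched as follows. This Python example is not the hermite package: it fixes μ at the sample mean, which Section 2.3 states is the MLE of μ for the GH model, and maximizes over d with a simple grid search inside (1, m) as a stand-in for a proper optimizer. The data are the simulated frequencies from Table 2, and the function names are illustrative.

```python
import math

# Simulated sample from Table 2 (value -> observed frequency).
counts = {0: 578, 1: 210, 2: 76, 3: 18, 4: 2, 5: 1, 6: 45, 7: 35,
          8: 10, 9: 9, 10: 1, 12: 7, 13: 3, 14: 4, 15: 1}

def gh_loglik(counts, mu, d, m):
    """GH log-likelihood in the (mu, d) parametrization for a fixed order m."""
    alpha = mu * (m - d) / (m - 1)
    beta = mu * (d - 1) / (m * (m - 1))
    ll = 0.0
    for y, freq in counts.items():
        s = sum(alpha ** (y - m * j) * beta ** j
                / (math.factorial(y - m * j) * math.factorial(j))
                for j in range(y // m + 1))
        ll += freq * (math.log(s) - (alpha + beta))
    return ll

def profile_loglik(counts, m, grid=2000):
    """Maximize over d on a grid strictly inside (1, m), mu = sample mean."""
    n = sum(counts.values())
    mu = sum(y * f for y, f in counts.items()) / n
    ds = (1.0 + (m - 1.0) * (i + 0.5) / grid for i in range(grid))
    return max(gh_loglik(counts, mu, d, m) for d in ds)

profile = {m: profile_loglik(counts, m) for m in range(2, 11)}
best_m = max(profile, key=profile.get)      # order with the largest profile value
```

The grid search is deliberately crude; values for orders whose maximum lies near the boundary of (1, m) may differ slightly from those in Table 3.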
3.2. Zero-inflated GH models for previous hotel booking cancellations
To demonstrate applications of zero-inflated GH models with a proper count response, we analyse the number of previous cancellations by fitting GH, zero-inflated GH, zero and k-inflated GH, and MIP models. We perform both score and likelihood-ratio tests for the zero-inflated GH model of PrevCanH1. To fit these GH-based models, we first need to choose the value of m. For each value of m between 2 and 10, we fitted the GH model to PrevCanH1 and obtained m = 6 by using the profile likelihood method as described previously. With m = 6 selected, we estimate all other model parameters by the ML method. These estimates are presented in Table 4. The empirical distribution of the dataset from Table 5 suggests that there are two modes that occur at 0 and 24, respectively. We therefore fitted the zero and 24-inflated GH model and the MIP2 model with inflated values at 0 and 24. These results are displayed in Table 4. In addition, we present the hanging rootograms for fitting the GH, zero-inflated GH, zero and 24-inflated GH, and MIP2 models in Figure 6. We also provide the actual and predicted counts for the number of previous cancellations at the resort hotel in Table 5. These results show that both the zero-inflated GH and zero and 24-inflated GH models perform better than the GH and MIP2 models. Table 5 shows that MIP2 significantly under-estimated the count at the value of 1 and GH significantly under-estimated the count at the value of 0. Figure 6 further confirms these under-estimation findings. The zero and 24-inflated GH model was able to capture the second mode, leading to a higher log-likelihood value, and hence outperforms the zero-inflated GH model.
Table 4. ML estimates of GH, ZiGH, ZkiGH and MIP2 models for PrevCanH1.
Parameters | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
μ | 1.3979 | 0.0467 | 29.95 | < 2e−16 |
d | 4.5410 | 0.0559 | 81.21 | < 2e−16 |
Order | 6 | |||
GH LogL | −4077.996 | |||
μ | 2.2291 | 0.1002 | 22.24 | < 2e−16 |
d | 4.5405 | 0.0558 | 81.31 | < 2e−16 |
θ | 0.5946 | 0.0537 | 11.07 | < 2e−16 |
Order | 6 | |||
ZiGH LogL | −3997.036 | |||
Inflation at 0 | 0.2930 | 0.0273 | 10.72 | < 2e−16 |
Inflation at 24 | 0.0164 | 0.0024 | 6.98 | 2.89e−12 |
μ | 1.4522 | 0.0765 | 18.99 | < 2e−16 |
d | 3.9661 | 0.0792 | 50.05 | < 2e−16 |
Order | 6 | |||
ZkiGH LogL | −3720.28 | |||
Inflation at 0 | 0.5949 | 0.0098 | 60.56 | < 2e−16 |
Inflation at 24 | 0.0165 | 0.0024 | 6.99 | 2.84e−12 |
μ | 2.5803 | 0.0538 | 47.96 | < 2e−16 |
MIP2 LogL | −6206.868 |
Table 5. Actual and predicted counts for PrevCanH1.
Y | Actual | ZiGH | GH | MIP2 | ZkiGH |
---|---|---|---|---|---|
0 | 1820 | 1820 | 1644 | 1820 | 1820 |
1 | 896 | 477 | 670 | 221 | 571 |
2 | 44 | 155 | 137 | 286 | 169 |
3 | 14 | 34 | 19 | 246 | 33 |
4 | 6 | 5 | 2 | 158 | 5 |
5 | 3 | 1 | 0 | 82 | 1 |
14 | 14 | 5 | 2 | 0 | 2 |
19 | 19 | 1 | 1 | 0 | 0 |
24 | 48 | 0 | 0 | 48 | 48 |
25 | 25 | 0 | 0 | 0 | 0 |
26 | 26 | 0 | 0 | 0 | 0 |
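The GH column of Table 5 can be reproduced approximately from the Table 4 estimates. The sketch below assumes the usual representation of the generalized Hermite variable as Y = X1 + m·X2 with independent Poisson components, parameterized by the mean μ and the dispersion index d = Var(Y)/E(Y); this parameterization is an assumption here, since the model definition sits outside this section. The function name `gh_pmf` and the sample size 2915 (the sum of the Actual column) are introduced for illustration only.

```python
import math


def gh_pmf(y, mu, d, m):
    """P(Y = y) for Y = X1 + m*X2, X1 ~ Poisson(a1), X2 ~ Poisson(am).

    Assumed parameterization: mu = a1 + m*am and d = (a1 + m^2*am) / mu,
    so the sketch needs 1 < d < m for both Poisson rates to be positive.
    """
    a_m = mu * (d - 1.0) / (m * (m - 1.0))
    a_1 = mu - m * a_m
    if a_1 <= 0.0 or a_m <= 0.0:
        raise ValueError("this sketch requires 1 < d < m")
    total = 0.0
    # Sum over the number j of 'jumps of size m' contributing to y.
    for j in range(y // m + 1):
        i = y - m * j
        log_term = (-a_1 - a_m
                    + i * math.log(a_1) - math.lgamma(i + 1)
                    + j * math.log(a_m) - math.lgamma(j + 1))
        total += math.exp(log_term)
    return total


# GH estimates for PrevCanH1 from Table 4; n_obs is the sum of the Actual column.
mu_hat, d_hat, order_m, n_obs = 1.3979, 4.5410, 6, 2915

expected = [n_obs * gh_pmf(y, mu_hat, d_hat, order_m) for y in range(6)]
for y, e in enumerate(expected):
    print(y, round(e, 1))
```

Under these assumptions the expected counts at 0, 1 and 2 land within one unit of the GH column (1644, 670, 137), which also shows directly how far the plain GH fit falls short of the 1820 observed zeros.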
Figure 6. Hanging rootograms for ZiGH, ZkiGH, GH and MIP2 models for PrevCanH1.
To test the zero-inflated GH model against the GH model, we computed the score statistic. Its p-value from the $\chi^2_1$ distribution is 2.83865e-38. This provides strong evidence for rejecting the null hypothesis $H_0$ of no zero inflation. The likelihood-ratio statistic for testing $H_0$ is 161.92, which leads to the same conclusion as the score test. We conclude that the GH distribution does not adequately account for the extra zeros in the number of previous cancellations at the resort hotel.
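The likelihood-ratio statistic quoted above follows directly from the GH and ZiGH log-likelihoods in Table 4. The sketch below recomputes it and attaches a tail probability, assuming the conventional chi-square reference distribution with one degree of freedom; the helper names are illustrative and not taken from the paper.

```python
import math


def lr_statistic(loglik_null, loglik_alt):
    """Likelihood-ratio statistic W = 2 * (logL_alternative - logL_null)."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_sf_1df(x):
    """Upper tail of a chi-square(1) variable: P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


# Log-likelihoods of the GH (null) and ZiGH (alternative) fits in Table 4.
w = lr_statistic(-4077.996, -3997.036)
p_value = chi2_sf_1df(w)
print(round(w, 2), p_value)
```

Only the standard library is used, so the same two helpers can be applied to any other GH/ZiGH pair of log-likelihoods reported in the tables.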
The main advantage of the score test over the likelihood-ratio test is that the score test only requires the restricted ML estimates under the null hypothesis, which are often computationally simpler to obtain than the unconstrained ML estimates, especially for complicated multidimensional likelihood functions. Computing the unconstrained ML estimates of zero-inflated GH models can be challenging, and maximization algorithms may not converge. This is the case with zero-inflated GH modelling for NumPrevCan of PrevCanH2 at the city hotel. PrevCanH2 contains 1,153 zeros out of 6,542 previous cancellations. Under $H_0$, we fitted the GH model to PrevCanH2 with m = 6 and obtained the restricted ML estimates $\hat{\mu}$ and $\hat{d}$. The value of the score statistic is 3744.805, which clearly rejects $H_0$. However, when we use $\hat{\mu}$ and $\hat{d}$ as starting values, the maxLik algorithm for finding the unconstrained ML estimates of the zero-inflated GH model parameters does not converge. We have not been able to find an initial value for θ that makes the algorithm converge. As a result, the likelihood-ratio test is not applicable to this dataset. Nevertheless, since our score test rejects $H_0$, we can fit a zero and k-inflated GH model to this dataset instead of the zero-inflated GH model. To fit the zero and k-inflated GH model, we observe from the empirical distribution of the dataset that the first mode occurs at the value of 1, so we choose k = 1 and fit the zero and one-inflated GH model. For comparison, we also fit MIP2 and MIP3 models. The empirical distribution suggests that there are two modes, at 1 and 11. The value of zero is not a mode, but it is inflated relative to the GH model according to our score test. Since we are interested in zero inflation, we include zero as one of the inflated values for both MIP models: MIP2 is fitted with inflated values at 0 and 1, and MIP3 with inflated values at 0, 1, and 11.
The results of fitting zero and one-inflated GH, MIP2, and MIP3 models are shown in Table 6. In terms of log likelihood, both zero and one-inflated GH and MIP3 performed significantly better than MIP2. It is worth noting that even without explicitly including the second mode at 11, the zero and one-inflated GH model provided the largest value of log likelihood demonstrating the flexibility of the GH-based models.
Table 6. ML estimates of ZkiGH, MIP2, and MIP3 models for PrevCanH2.
Parameters | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Inflation at 0 | 0.1748 | 0.0047 | 36.98 | < 2e−16 |
Inflation at 1 | 0.7835 | 0.0052 | 150.59 | < 2e−16 |
μ | 4.4003 | 0.2287 | 19.24 | < 2e−16 |
d | 2.4089 | 0.1630 | 14.78 | < 2e−16 |
Order | 6 | |||
ZkiGH LogL | −4544.049 | |||
Inflation at 0 | 0.1760 | 0.0047 | 37.33 | < 2e−16 |
Inflation at 1 | 0.7865 | 0.0051 | 154.42 | < 2e−16 |
μ | 4.8049 | 0.1505 | 31.93 | < 2e−16 |
MIP2 LogL | −4654.728 | |||
Inflation at 0 | 0.1752 | 0.0047 | 37.12 | < 2e−16 |
Inflation at 1 | 0.7844 | 0.0052 | 151.84 | < 2e−16 |
Inflation at 11 | 0.0053 | 0.0009 | 5.90 | 3.64e−09 |
μ | 3.5370 | 0.1565 | 22.60 | < 2e−16 |
MIP3 LogL | −4548.549 |
To further show the flexibility of zero and k-inflated GH model, we provide one more example of applying the ZkiGH model to the analysis of underdispersion. This is a subset of PrevCanH2 corresponding to those cancelled bookings from TCLOSH2. The mean of the dataset is 1.03151 and its variance is 0.26457 indicating underdispersion. Since the mode occurs at the value of 1, we fitted the zero-one inflated GH model and displayed the results in Table 7. Based on the fitted model, we obtained the estimated mean to be 1.03153 and the estimated variance equal to 0.22039.
Table 7. ML estimates of zero and one-inflated model for underdispersion of PrevCanH2.
Parameters | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Inflation at 0 | 0.0142 | 0.0040 | 3.56 | 0.000368 |
Inflation at 1 | 0.9596 | 0.0066 | 146.13 | < 2e−16 |
μ | 2.7424 | 0.5488 | 4.997 | 5.82e−07 |
d | 1.7732 | 0.4695 | 3.78 | 0.000159 |
Order | 6 | |||
ZkiGH LogL | −250.3878 |
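The estimated mean and variance quoted for the underdispersed subset can be checked against the Table 7 estimates. The sketch below assumes that the first two estimates in Table 7 are mixture weights placed on the point masses at 0 and 1, that the remaining two are the GH parameters μ and d, and that the GH component has mean μ and variance d·μ. These readings are assumptions inferred from the numbers, not definitions taken from this section, and the function name is hypothetical.

```python
def zki_gh_moments(p0, pk, k, mu, d):
    """Mean and variance of a mixture: mass p0 at 0, mass pk at k,
    and a GH component (mean mu, variance d*mu) with weight 1 - p0 - pk."""
    w = 1.0 - p0 - pk
    mean = pk * k + w * mu
    # E[Y^2] of the GH component is Var + mean^2 = d*mu + mu^2.
    second_moment = pk * k ** 2 + w * (d * mu + mu ** 2)
    return mean, second_moment - mean ** 2


# Estimates from Table 7 (zero and one-inflated GH model, k = 1).
mean, var = zki_gh_moments(p0=0.0142, pk=0.9596, k=1, mu=2.7424, d=1.7732)
print(round(mean, 4), round(var, 4))
```

With the rounded table entries this gives a mean of about 1.031 and a variance of about 0.220, in line with the fitted values 1.03153 and 0.22039 reported in the text; the variance is well below the mean even though the GH component itself has d > 1.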
Although zero and k-inflated GH and MIP models are quite flexible, ML estimation of these models in the presence of covariates can be very challenging because of many local maxima. Their maximization algorithms may not converge. Since the problem of how to choose suitable starting values for the maximization algorithm to converge remains open, we focus on the zero-inflated GH model when introducing covariates through the mean, dispersion, and zero-inflated parameters.
3.3. Zero-inflated GH regression models for hotel booking cancellations
The time from cancellation to the scheduled check-in date is an important response variable in hotel revenue management. Forecasting cancellations accurately and early enough before they occur is essential for decision-making in hotel dynamic pricing and capacity allocation. Time-to-event data are typically modelled by continuous random variables in survival analysis. However, complex features of our datasets such as excess zeros and multimodality pose serious challenges to the existing methods in survival analysis. Although there are hurdle models such as the two-part truncated normal or two-part lognormal model for mixed discrete-continuous outcomes, we are not aware of any parametric model of continuous random variables that is flexible enough to deal with excess zeros and multimodality simultaneously.
We first fitted the GH and zero-inflated GH models to the response variables of TCLOSH1 and TCLOSH2, which has 539 zeros out of 10,582 cancellations. The results are presented in Tables 8 and 9. It is worth noting that the point estimates for the two datasets are remarkably similar despite the different hotels and the striking difference in sample sizes. The estimates of the zero-inflation parameter for both hotels are highly significant, indicating that both datasets have excess zeros relative to the GH model. To test for zero inflation at the resort hotel, the likelihood-ratio statistic computed from the log-likelihoods in Table 8 is W = 2(14179.39 − 13585.58) = 1187.62, and both the likelihood-ratio test and the score test reject $H_0$. Similarly for the city hotel, the likelihood-ratio statistic is W = 1430.72, and both tests again reject $H_0$. Although the zero-inflated GH model fits the cancellation datasets better than the GH model, it is important to check whether any covariate can help explain the extra zeros in the datasets. One potentially useful predictor is the length of stay (LOS), as LOS is readily available for each booking even before a cancellation occurs. In fact, the influence of LOS on cancellation has already been recognized by Antonio et al. [1] and Falk and Vieru [6]. But it is not clear what the mathematical relationship is between LOS and the days of cancellation prior to check-in. In particular, the GH regression and the zero-inflated GH regression of cancellation days prior to check-in on LOS are not available in the literature.
Table 8. GH and ZiGH models for time of cancellation to check-in date at resort hotel.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
μ | 12.7214 | 0.1194 | 106.6 | < 2e−16 |
d | 4.7447 | 0.0227 | 208.9 | < 2e−16 |
Order (m) | 6 | |||
GH LogL | −14179.39 | |||
Coefficients | Estimate | Std. Error | z-value | p-value |
μ | 13.8184 | 0.1296 | 106.59 | < 2e−16 |
d | 4.4687 | 0.0282 | 158.23 | < 2e−16 |
Zero Inflation (θ) | 0.0862 | 0.0052 | 16.56 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −13585.58 |
Table 9. GH and ZiGH models for time of cancellation to check-in date at city hotel.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
μ | 12.6010 | 0.0736 | 171.1 | < 2e−16 |
d | 4.6886 | 0.0155 | 302.1 | < 2e−16 |
Order (m) | 6 | |||
GH LogL | −37920.42 | |||
Coefficients | Estimate | Std. Error | z-value | p-value |
μ | 13.2255 | 0.0759 | 174.27 | < 2e−16 |
d | 4.4842 | 0.0183 | 244.75 | < 2e−16 |
Zero Inflation (θ) | 0.0496 | 0.0024 | 20.91 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −37205.06 |
We performed the GH regression of the number of days prior to check-in on LOS for the dataset TCLOSH1. The estimated coefficients and standard errors are presented in Table 10. The p-values in Table 10 show that all of the regression coefficients are statistically highly significant. In particular, the estimated coefficient of LOS is $\hat{\beta}_1 = 0.0372$ and its p-value is less than 2e-16. The significant positive relationship between the length of stay and the cancellation time prior to check-in demonstrates that customers booking longer stays tend to cancel the reservation earlier prior to arrival, and consequently the predictor LOS can be used to predict the number of cancellation days prior to check-in. Is the effect of LOS with $\hat{\beta}_1 = 0.0372$ big? How should the magnitude of this estimate be interpreted in this highly nonlinear regression model? One possible interpretation is through the partial effect of a covariate $x_j$ on the conditional expectation, defined by $\partial E(Y \mid x)/\partial x_j$. We note that $E(Y) = \mu$ for the GH distribution. In our case, we have a single covariate x = LOS. It follows from the log-linear mean function $\mu(x) = \exp(\beta_0 + \beta_1 x)$ that $\partial E(Y \mid x)/\partial x = \beta_1 \exp(\beta_0 + \beta_1 x)$. This shows that the partial effect of LOS depends on the level of LOS. For this dataset, LOS ranges from 1 day to 14 days. Suppose that we fix the level of LOS at 6 days. Then a simple computation with the estimates in Table 10 shows that, when LOS increases by one day, the expected cancellation time increases by approximately $0.0372 \exp(2.3944 + 0.0372 \times 6) \approx 0.51$ days. This implies that a customer whose LOS is 7 days would on average cancel her booking slightly more than half a day earlier than a customer whose LOS is 6 days. Whether the effect of LOS is big depends on how valuable half a day is to a hotel. We expect the effect of LOS to be big for a resort hotel. This finding provides useful information for hotel revenue management. To see whether the predictor LOS adequately explains the extra zeros in the dataset, we computed both the likelihood-ratio statistic W and the score statistic given in (9) and obtained W = 1066.02. Both tests reject $H_0$.
We conclude that although the predictor LOS in the GH regression is significant, it does not completely explain the extra zeros in the dataset. The zero-inflated GH regression is then fitted and its results are given in Table 10. Since $E(Y)$ remains proportional to $\mu$ for the zero-inflated GH distribution when θ is constant, the aforementioned partial effect interpretation is applicable to this case. However, when we model the zero-inflation parameter θ as a function of LOS, the partial effect interpretation is no longer applicable, and it is difficult to interpret the effect of LOS in such a case.
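The partial effect of LOS in the GH regression can be evaluated numerically from the Table 10 estimates. The sketch below assumes a log link for the mean, μ(x) = exp(β0 + β1·x); the link is inferred from the magnitude of the estimates rather than quoted from the model definition, and the function names are illustrative.

```python
import math


def partial_effect(beta0, beta1, x):
    """Derivative of E(Y | x) = exp(beta0 + beta1 * x) with respect to x."""
    return beta1 * math.exp(beta0 + beta1 * x)


def unit_change(beta0, beta1, x):
    """Exact change in E(Y | x) when x increases from x to x + 1."""
    return math.exp(beta0 + beta1 * (x + 1)) - math.exp(beta0 + beta1 * x)


# GH regression estimates for the resort hotel (Table 10).
beta0_hat, beta1_hat = 2.3944, 0.0372

pe_at_6 = partial_effect(beta0_hat, beta1_hat, 6)
diff_6_to_7 = unit_change(beta0_hat, beta1_hat, 6)
print(round(pe_at_6, 3), round(diff_6_to_7, 3))
```

Both the derivative at LOS = 6 and the exact difference between LOS = 7 and LOS = 6 come out slightly above half a day, which matches the interpretation given in the text; evaluating the same functions at other LOS values shows how the effect grows with the level of LOS.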
Table 10. GH and ZiGH regression on LOS at resort hotel.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.3944 | 0.0140 | 171.65 | < 2e−16 |
LOS (mean) | 0.0372 | 0.0024 | 15.75 | < 2e−16 |
Dispersion Index (d) | 4.7241 | 0.0231 | 204.12 | < 2e−16 |
Order (m) | 6 | |||
GH LogL | −14074.2 | |||
Coefficients | Estimate | Std. Error | z-value | p-value |
Intercept (mean) | 2.5242 | 0.0143 | 176.28 | < 2e−16 |
LOS (mean) | 0.0249 | 0.0025 | 9.85 | < 2e−16 |
Dispersion Index (d) | 4.4617 | 0.0286 | 156.16 | < 2e−16 |
Zero Inflation (θ) | 0.0853 | 0.0052 | 16.38 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −13541.19 |
Next we allow the dispersion index d to depend on the covariate LOS via the link function in (7) and fit the GH regression on LOS. The results are given in Table 11. Again, all of the estimated regression coefficients are statistically significant at the 0.05 significance level. In particular, the p-value for the coefficient of LOS in the dispersion function is 0.0324, indicating that the predictor LOS has some effect on the dispersion, although the evidence is much weaker than for the other regression coefficients. To see whether the GH regression on LOS with covariate-dependent dispersion adequately accounts for the extra zeros in the dataset, we fitted the zero-inflated GH regression on LOS with covariate-dependent dispersion; its results are also presented in Table 11. It is worth noting that the estimated coefficient of LOS in the dispersion function is no longer significant (its p-value is 0.595) after introducing the zero-inflation parameter. In addition, the log-likelihood values in Table 11 are almost the same as the log-likelihood values in Table 10. Consequently, the likelihood-ratio test again rejects $H_0$. The score statistic as in (8) is 6165.432, leading to the same conclusion as the likelihood-ratio test. We conclude that incorporating the predictor LOS in both the mean and dispersion functions of the GH model does not completely explain the extra zeros in the dataset.
Table 11. GH and ZiGH regression on LOS at resort hotel with covariate-dependent dispersion.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.4097 | 0.0156 | 154.77 | < 2e−16 |
LOS (mean) | 0.0335 | 0.0029 | 11.61 | < 2e−16 |
Intercept (dispersion) | 1.9355 | 0.0378 | 51.14 | < 2e−16 |
LOS (dispersion) | −0.0163 | 0.0076 | −2.14 | 0.0324 |
Order (m) | 6 | |||
GH LogL | −14071.73 | |||
Intercept (mean) | 2.5210 | 0.0156 | 162.02 | < 2e−16 |
LOS (mean) | 0.0257 | 0.0029 | 8.78 | < 2e−16 |
Intercept (dispersion) | 1.6026 | 0.0406 | 39.49 | < 2e−16 |
LOS (dispersion) | 0.0041 | 0.0076 | 0.53 | 0.595 |
Zero Inflation (θ) | 0.0853 | 0.0052 | 16.38 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −13541.05 |
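The remark that the log-likelihoods in Tables 10 and 11 are almost the same can be made concrete with likelihood-ratio differences between the nested fits. The sketch below uses only the log-likelihoods reported in those two tables; the resulting statistics are computed here for illustration and are not values quoted in the text.

```python
def lr_statistic(loglik_restricted, loglik_full):
    """Likelihood-ratio statistic 2 * (logL_full - logL_restricted)."""
    return 2.0 * (loglik_full - loglik_restricted)


# Resort hotel log-likelihoods: constant dispersion (Table 10)
# versus LOS-dependent dispersion (Table 11).
gh_const_d, gh_los_d = -14074.2, -14071.73
zigh_const_d, zigh_los_d = -13541.19, -13541.05

# Gain from letting the dispersion depend on LOS, without and with zero inflation.
lr_dispersion_gh = lr_statistic(gh_const_d, gh_los_d)
lr_dispersion_zigh = lr_statistic(zigh_const_d, zigh_los_d)

# Gain from adding zero inflation when the dispersion already depends on LOS.
lr_zero_inflation = lr_statistic(gh_los_d, zigh_los_d)

print(round(lr_dispersion_gh, 2),
      round(lr_dispersion_zigh, 2),
      round(lr_zero_inflation, 2))
```

The dispersion terms change the log-likelihood by only a few units at most, whereas adding the zero-inflation parameter changes it by more than a thousand, which is the pattern the Wald p-values in Table 11 also suggest.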
After rejecting $H_0$, it may be argued that the zero-inflation parameter θ depends on the predictor LOS. To see if this is the case, we fitted the zero-inflated GH regression on the predictor LOS with LOS-dependent dispersion and LOS-dependent zero-inflation functions. The results are presented in Table 12. The p-values show that the coefficients of LOS in the regression and zero-inflation functions are both statistically highly significant, but the coefficient of LOS in the dispersion function is not significantly different from zero. These findings confirm that the predictor LOS indeed has a significant effect on the zero-inflation parameter θ. In particular, the negative sign of the estimated coefficient of LOS in the zero-inflation function suggests that the longer the length of stay, the less likely the cancellation is to occur on the scheduled check-in date. This is consistent with our previous finding that customers booking longer stays tend to cancel the reservation earlier prior to arrival.
Table 12. ZiGH regression on LOS at resort hotel with covariate-dependent dispersion and zero inflation.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.5293 | 0.0155 | 162.96 | < 2e−16 |
LOS (mean) | 0.0233 | 0.0030 | 7.89 | 2.93e−15 |
Intercept (dispersion) | 1.5756 | 0.0402 | 39.22 | < 2e−16 |
LOS (dispersion) | 0.0122 | 0.0075 | 1.61 | 0.107 |
Intercept (zero inflation) | −1.1346 | 0.1422 | −7.98 | 1.48e−15 |
LOS (zero inflation) | −0.4738 | 0.0614 | −7.71 | 1.21e−14 |
Order (m) | 6 | |||
LogL | −13479.95 |
We applied the same analysis to TCLOSH2. The results are given in Tables 13, 14, and 15. It is quite remarkable that the results from the city hotel are almost the same as those from the resort hotel. These findings further confirm that LOS is a key predictor for the time of cancellation prior to check-in regardless of resort hotel or city hotel.
Table 13. GH and ZiGH regression on LOS at city hotel.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.4868 | 0.0088 | 281.80 | < 2e−16 |
LOS (mean) | 0.0154 | 0.0021 | 7.29 | 3.05e−13 |
Dispersion Index (d) | 4.6895 | 0.0155 | 302.63 | < 2e−16 |
Order (m) | 6 | |||
GH LogL | −37896.07 | |||
Intercept (mean) | 2.5461 | 0.0091 | 280.19 | < 2e−16 |
LOS (mean) | 0.0117 | 0.0022 | 5.24 | 1.64e−07 |
Dispersion Index (d) | 4.4857 | 0.0185 | 243.05 | < 2e−16 |
Zero Inflation (θ) | 0.0494 | 0.0024 | 20.83 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −37192.2 |
Table 14. GH and ZiGH regression on LOS at city hotel with covariate-dependent dispersion.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.4940 | 0.0098 | 254.74 | < 2e−16 |
LOS (mean) | 0.0130 | 0.0025 | 5.18 | 2.23e−07 |
Intercept (dispersion) | 1.8718 | 0.0263 | 71.10 | < 2e−16 |
LOS (dispersion) | −0.0112 | 0.0067 | −1.68 | 0.0937 |
Order (m) | 6 | |||
GH LogL | −37894.62 | |||
Intercept (mean) | 2.5456 | 0.0097 | 261.45 | < 2e−16 |
LOS (mean) | 0.0119 | 0.0025 | 4.72 | 2.38e−06 |
Intercept (dispersion) | 1.6380 | 0.0285 | 57.52 | < 2e−16 |
LOS (dispersion) | 0.0009 | 0.0072 | 0.13 | 0.896 |
Zero Inflation (θ) | 0.0494 | 0.0024 | 20.83 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −37192.19 |
Table 15. ZiGH regression on LOS at city hotel with covariate-dependent dispersion and zero inflation.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.5542 | 0.0098 | 261.62 | < 2e−16 |
LOS (mean) | 0.0088 | 0.0026 | 3.433 | 0.0006 |
Intercept (dispersion) | 1.5775 | 0.0365 | 43.20 | < 2e−16 |
LOS (dispersion) | 0.0213 | 0.0105 | 2.03 | 0.0427 |
Intercept (zero inflation) | −2.2391 | 0.1259 | −17.78 | < 2e−16 |
LOS (zero inflation) | −0.3035 | 0.0525 | −5.78 | 7.36e−09 |
Order (m) | 6 | |||
LogL | −37179.64 |
Even in the presence of our second potential predictor LogADR, LOS remains a significant predictor for the time of cancellation prior to check-in at both the resort hotel and the city hotel. We fitted the GH and zero-inflated GH regressions to TCLOSLogADRH1 and TCLOSLogADRH2, whose response variable has 506 zeros out of 10,499 observations. We model both the regression function and the dispersion index as functions of the two potential predictors, LOS and LogADR. Tables 16 and 17 show that both LOS and LogADR are highly significant predictors in the regression functions. It is worth noting that, despite the presence of LogADR, the estimated LOS coefficients not only have the same positive sign but also have magnitudes remarkably similar to those given in Tables 11 and 14. The positive sign of the estimated LogADR coefficients suggests that, other things being equal, customers with higher average daily rates tend to cancel the reservation earlier prior to arrival at both hotels. In the dispersion function, the effects of LOS and LogADR differ between the city hotel and the resort hotel. For the city hotel, LogADR is the dominant and significant predictor in the dispersion function, whereas LOS has no statistically significant effect on the dispersion parameter. In contrast, LogADR in the dispersion function is not significant at the resort hotel, whereas the estimated LOS coefficient is significant for the GH regression but no longer significant for the zero-inflated GH regression. The negative sign of the estimated LogADR coefficients indicates that the higher the average daily rate, the less dispersed the response variable becomes at the city hotel, whereas the negative sign of the estimated LOS coefficient in the GH regression suggests that the longer the LOS, the less dispersed the response variable becomes at the resort hotel.
Tables 16 and 17 show that the likelihood-ratio statistics at the resort hotel and the city hotel are equal to 973.08 and 1264.66, respectively, leading to the rejection of $H_0$ for both hotels. The score statistics at the resort hotel and the city hotel are 1841.528 and 2795.88, respectively, leading to the same conclusion.
Table 16. GH and ZiGH regression on LOS and LogADR at resort hotel.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 1.6327 | 0.0738 | 22.133 | < 2e−16 |
LOS (mean) | 0.0317 | 0.0029 | 10.790 | < 2e−16 |
LogADR (mean) | 0.1749 | 0.0159 | 10.984 | < 2e−16 |
Intercept (dispersion) | 2.0540 | 0.1733 | 11.854 | < 2e−16 |
LOS (dispersion) | −0.0159 | 0.0073 | −2.161 | 0.0307 |
LogADR (dispersion) | −0.0370 | 0.0387 | −0.957 | 0.3384 |
Order (m) | 6 | |||
GH LogL | −13826.48 | |||
Intercept (mean) | 1.8779 | 0.0719 | 26.110 | < 2e−16 |
LOS (mean) | 0.0238 | 0.0030 | 7.986 | 1.4e−15 |
LogADR (mean) | 0.1443 | 0.0155 | 9.302 | < 2e−16 |
Intercept (dispersion) | 1.9122 | 0.1901 | 10.060 | < 2e−16 |
LOS (dispersion) | 0.0009 | 0.0073 | 0.117 | 0.9069 |
LogADR (dispersion) | −0.0713 | 0.0418 | −1.707 | 0.0878 |
Zero Inflation (θ) | 0.0813 | 0.0052 | 15.755 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −13339.94 |
Table 17. GH and ZiGH regression on LOS and LogADR at city hotel.
Coefficients | Estimate | Std. Error | z-value | p-value |
---|---|---|---|---|
Intercept (mean) | 2.1107 | 0.0737 | 28.626 | < 2e−16 |
LOS (mean) | 0.0125 | 0.0025 | 4.955 | 7.23e−07 |
LogADR (mean) | 0.0840 | 0.0157 | 5.348 | 8.88e−08 |
Intercept (dispersion) | 2.4604 | 0.1654 | 14.874 | < 2e−16 |
LOS (dispersion) | −0.0062 | 0.0068 | −0.904 | 0.36582 |
LogADR (dispersion) | −0.1343 | 0.0359 | −3.745 | 0.00018 |
Order (m) | 6 | |||
GH LogL | −37530.49 | |||
Intercept (mean) | 2.2986 | 0.0782 | 29.385 | < 2e−16 |
LOS (mean) | 0.0116 | 0.0025 | 4.580 | 4.64e−06 |
LogADR (mean) | 0.0538 | 0.0166 | 3.236 | 0.00121 |
Intercept (dispersion) | 2.3608 | 0.2074 | 11.384 | < 2e−16 |
LOS (dispersion) | 0.0032 | 0.0074 | 0.431 | 0.6662 |
LogADR (dispersion) | −0.1597 | 0.0448 | −3.562 | 0.00037 |
Zero Inflation (θ) | 0.0459 | 0.0023 | 19.911 | < 2e−16 |
Order (m) | 6 | |||
ZiGH LogL | −36898.16 |
Our data analysis has demonstrated that the cancellation predictors LOS and LogADR have remarkably similar patterns at the resort hotel and the city hotel. In the presence of other predictors, however, the significant cancellation predictors may not always be the same for each hotel. Our findings are consistent with the findings of Antonio et al. [1]. Although other variables such as market segment and distribution channel are available in the datasets provided by Antonio et al. [2], we did not include those variables in our zero-inflated GH regression models because of the difficulty of finding suitable initial values for the ML estimation algorithm to converge.
4. Conclusion and discussion
Motivated by the challenges of complex data in the hotel industry, we have developed zero-inflated statistical models based on the generalized Hermite distribution for the simultaneous modelling of excess zeros, over/underdispersion, and multimodality. These new models are parsimonious yet remarkably flexible, allowing the covariates to be introduced directly through the mean, dispersion, and zero-inflation parameters. A new link function for the covariate-dependent dispersion model is proposed to accommodate the interval inequality constraint for the dispersion parameter. We derive score statistics for testing zero inflation in both covariate-free and covariate-dependent models, which provide a useful alternative when computing the likelihood-ratio statistic proves to be difficult.
Applications of the zero-inflated GH regression to the hotel booking cancellation datasets enabled the identification of LOS and LogADR as significant predictors in the nonlinear regression with covariate-dependent dispersion for predicting the time from cancellation to the scheduled check-in date. Such regression relationships shed light on customers' cancellation behaviour: other things being equal, customers booking longer stays tend to cancel the reservation earlier prior to arrival at both hotels; likewise, customers with higher average daily rates tend to cancel the reservation earlier prior to their scheduled check-in dates. These findings provide useful information for accurate forecasting and decision-making in hotel revenue management. If a hotel were able to predict cancellations accurately enough, in particular when approaching arrival dates, it could factor them into its booking strategy. The zero-inflated GH regression provides such a model for cancellations. Given the remarkable similarity of cancellation patterns at the resort hotel and the city hotel, it would be interesting for future research to investigate whether the established regression relationship remains valid at other hotels, provided the relevant data are available.
Future research should also investigate how to select starting values, for example by bootstrap restarting [22], for high-dimensional zero-inflated GH and zero and k-inflated GH regression models with covariate-dependent dispersion so that the optimization algorithms converge. It would also be interesting to extend the proposed zero-inflated models to accommodate correlated responses.
Acknowledgements
The author would like to thank the Editor, the Associate Editor and the two reviewers for their constructive and detailed comments and suggestions that significantly improved the article.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Antonio N., Almeida A.D., and Nunes L., Big data in hotel revenue management: Exploring cancellation drivers to gain insights into booking cancellation behavior, Cornell Hospitality Quart. 60 (2019), pp. 298–319. doi: 10.1177/1938965519851466
- 2. Antonio N., Almeida A.D., and Nunes L., Hotel booking demand datasets, Data. Brief. 22 (2019), pp. 41–49. doi: 10.1016/j.dib.2018.11.126
- 3. Antonio N., Almeida A.D., and Nunes L., Predicting hotel booking cancellations to decrease uncertainty and increase revenue, Tourism Manage. Stud. 13 (2017), pp. 25–39. doi: 10.18089/tms.2017.13203
- 4. Bohning D., Dietz E., Schlattmann P., Mendonca L., and Kirchner U., The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology, J. Royal Stat. Soc. A 162 (1999), pp. 195–209. doi: 10.1111/1467-985X.00130
- 5. Deng D. and Paul S., Score tests for zero-inflation and over-dispersion in generalized linear models, Stat. Sin. 15 (2005), pp. 257–276.
- 6. Falk M. and Vieru M., Modelling the cancellation behaviour of hotel guests, Int. J. Contemp. Hospitality Manage. 30 (2018), pp. 3100–3116. doi: 10.1108/IJCHM-08-2017-0509
- 7. Giles D.E., Modeling inflated count data, in L. Oxley and D. Kulasiri, eds., Proceedings of the MODSIM 2007 International Congress on Modelling and Simulation, (2007), pp. 919–925.
- 8. Giles D.E., Hermite regression analysis of multi-modal count data, Econ. Bull. 30 (2010), pp. 2936–2945.
- 9. Gupta R. and Jain G., A generalized Hermite distribution and its properties, SIAM. J. Appl. Math. 27 (1974), pp. 359–363. doi: 10.1137/0127027
- 10. Gurland J., Some interrelations among compound and generalized distributions, Biometrika 44 (1957), pp. 265–268. doi: 10.1093/biomet/44.1-2.265
- 11. Hall D.B., Zero-inflated Poisson and binomial regression with random effects: A case study, Biometrics 56 (2000), pp. 1030–1039. doi: 10.1111/j.0006-341X.2000.01030.x
- 12. Henningsen A. and Toomet O., maxLik: A package for maximum likelihood estimation in R, Comput. Stat. 26 (2011), pp. 443–458. doi: 10.1007/s00180-010-0217-1
- 13. Jansakul N. and Hinde J., Score tests for zero-inflated Poisson models, Comput. Stat. Data Anal. 40 (2002), pp. 75–96. doi: 10.1016/S0167-9473(01)00104-9
- 14. Jansakul N. and Hinde J.P., Score tests for extra-zero models in zero-inflated negative binomial models, Commun. Stat. Simul. Comput. 38 (2009), pp. 92–108. doi: 10.1080/03610910802421632
- 15. Kemp C.D. and Kemp A.W., Some properties of the ‘hermite’ distribution, Biometrika 52 (1965), pp. 381–394.
- 16. Kleiber C. and Zeileis A., Visualizing count data regressions using rootograms, Am. Stat. 70 (2016), pp. 296–303. doi: 10.1080/00031305.2016.1173590
- 17. Lambert D., Zero-inflated Poisson regression, with an application to defects in manufacturing, Technometrics 34 (1992), pp. 1–14. doi: 10.2307/1269547
- 18. Morales D. and Wang J., Forecasting cancellation rates for service booking revenue management using data mining, Eur. J. Oper. Res. 202 (2009), pp. 554–562. doi: 10.1016/j.ejor.2009.06.006
- 19. Morina D., Higueras M., Puig P., and Oliveira M., Generalized Hermite distribution modelling with the R package Hermite, R. J. 7/2 (2015), pp. 263–274. doi: 10.32614/RJ-2015-035
- 20. Puig P., Characterizing additively closed discrete models by a property of their maximum likelihood estimators, with an application to generalized Hermite distributions, J. Am. Stat. Assoc. 98 (2003), pp. 687–692. doi: 10.1198/016214503000000594
- 21. van den Broek J., A score test for zero-inflation in a Poisson distribution, Biometrics 51 (1995), pp. 738–743. doi: 10.2307/2532959
- 22. Wood S.N., Minimizing model fitting objectives that contain spurious local minima by bootstrap restarting, Biometrics 57 (2001), pp. 240–244. doi: 10.1111/j.0006-341X.2001.00240.x
- 23. Xiang L., Lee A., Yau K., and McLachlan G., A score test for overdispersion in zero-inflated Poisson mixed regression model, Stat. Med. 26 (2007), pp. 1608–1622. doi: 10.1002/sim.2616
- 24. Xie M., He B., and Goh T., Zero-inflated Poisson model in statistical process control, Comput. Stat. Data Anal. 38 (2001), pp. 191–201. doi: 10.1016/S0167-9473(01)00033-0
- 25. Xie F., Wei B., and Lin J., Score tests for zero-inflated generalized Poisson mixed regression model, Comput. Stat. Data Anal. 53 (2009), pp. 3478–3489. doi: 10.1016/j.csda.2009.02.017
- 26. Zeileis A. and Kleiber C., countreg: Count data regression, R package, (2016). Available at http://R-Forge.R-project.org/projects/countreg/
- 27. Zeileis A., Kleiber C., and Jackman S., Regression models for count data in R, J. Stat. Softw. 27 (2008), pp. 1–25.