Lobachevskii Journal of Mathematics. 2022 Mar 28;42(13):3141–3149. doi: 10.1134/S1995080222010103

On the Partial-Geometric Distribution: Properties and Applications

Krisada Khruachalee, Winai Bodhisuwan, Andrei Volodin
PMCID: PMC8961498

Abstract

In this article we introduce a new two-parameter partial-geometric (PG) distribution that contains both the geometric and first success distributions as particular cases. Some probability and statistical properties of the proposed distribution are discussed, including the probability mass function, mean, variance, moment generating function, and probability generating function. We propose the method of maximum likelihood for estimating the model's parameters, and apply the PG distribution to two real datasets to illustrate its flexibility. We found that the PG distribution is more flexible than the geometric distribution in the sense that it can also be applied to under-dispersed data. For both datasets, the PG distribution also performs well under a goodness of fit test and several other model selection criteria. Thus, the PG distribution can be applied as an alternative model for the analysis of discrete data.

Keywords: geometric distribution, first success distribution, maximum likelihood estimation method

INTRODUCTION

There is much interest in developing more flexible probability distributions; many generalized classes of distributions have been developed and applied to describe various natural phenomena [1]. To explain a natural phenomenon, researchers construct a new generalized class of distributions and decide whether the underlying distribution should be regarded as discrete, continuous, or of mixed type. Discrete distributions are very useful in many applications, especially when the observed counts are non-negative integers. This happens when the number of occurrences of a discrete event is observed in a specific area or period of time [2]. Examples include the number of trips per month that a person takes, the number of children a couple has, and the number of Prussian army soldiers killed by horse kicks (a famous classical example related to the Poisson distribution); see [3–5]. In these situations, a continuous model is inappropriate for describing the count phenomenon. Accordingly, discrete models are as important as continuous models.

Many researchers are interested in finding suitable probability distributions for situations in which a number of trials or experiments must occur before a predetermined number of successes is reached. Examples include the number of bills that must be proposed to a legislature before 10 bills are passed and, more recently, the number of Thai citizens that must be tested for COVID-19 before 100 are confirmed to be infected.

In addition, the problem of comparing the probabilities of success in Bernoulli trials arises in medical and biological investigations, acceptance sampling in quality control, and modeling demand for a product. These real-life phenomena can be described by the geometric distribution. However, the geometric and first success distributions are sometimes treated as the same distribution, when they are actually different.

Because the confusion between the geometric and first success distributions plays an important role in this study, we introduce a new family of distributions called the partial-geometric (PG) distribution that contains both the geometric and first success distributions as particular cases. The idea of treating the geometric and first success distributions as members of one distribution family seems very natural, but we have not seen it in the literature. The number of studies that propose modifications of the geometric distribution for various purposes is so large that we do not discuss them here. The major advantage of the proposed PG distribution over previously modified geometric distributions is its flexibility in applications to real-life data.

The remainder of the paper is organized as follows. In Section 2, we discuss the difference between geometric and first success distributions. The PG distribution and some of its mathematical properties are defined in Section 3. Then, the maximum likelihood estimates of the PG distribution parameters are discussed in Section 3.3. Finally, some practical applications of the proposed distribution are illustrated by a goodness of fit with two real datasets in Section 4.

MATERIALS AND METHODS

Based on the theoretical interpretation of the Bernoulli experiment, we note a confusion between two simple and basic distributions, the geometric and the first success distributions. In some of the literature, these two distributions are considered the same. However, as mentioned in Gut (2009) [6], they are different and are defined in the following way.

Let $0 < p < 1$ and $q = 1 - p$. A random variable $X$ has a Geometric distribution with parameter $p$, denoted by $X \in Ge(p)$, if its probability mass function (pmf) is

$$P(X = k) = p q^{k}, \quad k = 0, 1, 2, \ldots$$

A random variable $X$ has a First Success distribution with parameter $p$, denoted by $X \in Fs(p)$, if its pmf is

$$P(X = k) = p q^{k-1}, \quad k = 1, 2, \ldots$$

We can interpret the geometric distribution as the number of failures in a sequence of Bernoulli trials before the first success, while the first success distribution is the number of trials needed to reach the first success.

Referring to the properties of the probability generating function (pgf), the pgf of the geometric distribution $X \in Ge(p)$ can be calculated in the following way:

$$g_X(t) = E\,t^{X} = \sum_{k=0}^{\infty} t^{k} p q^{k} = p \sum_{k=0}^{\infty} (qt)^{k} = \frac{p}{1 - qt},$$

where $|t| < 1/q$. In the same manner, the pgf of the first success distribution $X \in Fs(p)$ can be calculated as follows:

$$g_X(t) = E\,t^{X} = \sum_{k=1}^{\infty} t^{k} p q^{k-1} = pt \sum_{k=1}^{\infty} (qt)^{k-1} = \frac{pt}{1 - qt},$$

where again $|t| < 1/q$. Based on the pgfs of the geometric and first success distributions, we present Proposition 1, which illustrates the connection between these two distributions.

Proposition 1. If a random variable $X \in Ge(p)$, then the random variable $Y = X + 1$ follows the $Fs(p)$ distribution. Similarly, if a random variable $X \in Fs(p)$, then the random variable $Y = X - 1$ follows the $Ge(p)$ distribution.

Proof. If a random variable $X \in Ge(p)$, then, with $q = 1 - p$, the pgf of the random variable $Y = X + 1$ can be written

$$g_Y(t) = E\,t^{X+1} = t\,E\,t^{X} = t \cdot \frac{p}{1 - qt} = \frac{pt}{1 - qt},$$

which is the pgf of the $Fs(p)$ distribution. The second part of the Proposition can be shown in a similar way. $\square$
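The shift relation in Proposition 1 can also be checked numerically. The following is a minimal sketch, assuming the standard parametrization in which the geometric pmf is p*q^k on k = 0, 1, 2, ... and the first success pmf is p*q^(k-1) on k = 1, 2, ...; the function names are illustrative and do not come from the paper.

```python
import math


def geometric_pmf(k: int, p: float) -> float:
    """P(X = k) for the number of failures before the first success."""
    return p * (1.0 - p) ** k if k >= 0 else 0.0


def first_success_pmf(k: int, p: float) -> float:
    """P(X = k) for the number of trials up to and including the first success."""
    return p * (1.0 - p) ** (k - 1) if k >= 1 else 0.0


p = 0.3

# If X is geometric, Y = X + 1 should have the first success pmf,
# that is, P(Y = k) = P(X = k - 1) for every k >= 1.
shift_ok = all(
    math.isclose(geometric_pmf(k - 1, p), first_success_pmf(k, p))
    for k in range(1, 50)
)
print("shift relation holds:", shift_ok)
```

The only difference between the two functions is the support and the exponent, which is why libraries that offer a single "geometric" distribution need to be checked for which of the two conventions they follow.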

PARTIAL-GEOMETRIC DISTRIBUTION AND ITS MATHEMATICAL PROPERTIES

Partial-Geometric (PG) Distribution

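Adding an extra parameter $\theta$ leads us to propose the PG distribution. A random variable $X$ has a Partial-Geometric distribution with parameters $p$ and $\theta$, denoted by $X \in PG(p, \theta)$, if its pmf is

$$P(X = k) = \begin{cases} 1 - \dfrac{\theta q}{p}, & k = 0, \\ \theta q^{k}, & k = 1, 2, \ldots, \end{cases}$$

where $0 < p < 1$, $q = 1 - p$ and $0 \le \theta \le p/q$. Setting $\theta = p$ gives the geometric distribution $Ge(p)$, and setting $\theta = p/q$ gives the first success distribution $Fs(p)$.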
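To make the definition concrete, here is a minimal sketch of the PG pmf. It assumes the pmf has the form P(X = 0) = 1 - theta*q/p and P(X = k) = theta*q^k for k >= 1, with q = 1 - p and 0 <= theta <= p/q; this form and the symbol names p and theta are assumptions chosen to agree with the values in Table 1 and with the two special cases named in the text, and the function name is hypothetical.

```python
import math


def pg_pmf(k: int, p: float, theta: float) -> float:
    """Assumed PG pmf: 1 - theta*q/p at k = 0 and theta*q**k for k >= 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1.0 - p
    if theta < 0.0 or theta > p / q:
        raise ValueError("theta must lie in [0, p/q]")
    if k < 0:
        return 0.0
    if k == 0:
        # Clamp the tiny negative rounding error that can occur when theta == p/q.
        return max(0.0, 1.0 - theta * q / p)
    return theta * q ** k


p = 0.4
q = 1.0 - p

# theta = p should reproduce the geometric pmf p*q**k on k = 0, 1, 2, ...
geometric_case = all(
    math.isclose(pg_pmf(k, p, p), p * q ** k) for k in range(0, 30)
)

# theta = p/q should reproduce the first success pmf p*q**(k-1) on k = 1, 2, ...
first_success_case = pg_pmf(0, p, p / q) < 1e-12 and all(
    math.isclose(pg_pmf(k, p, p / q), p * q ** (k - 1)) for k in range(1, 30)
)

# The probabilities should sum to one for any admissible theta.
total_mass = sum(pg_pmf(k, p, 0.5) for k in range(0, 500))

print(geometric_case, first_success_case, total_mass)
```

Under this parametrization, theta only moves probability between zero and the positive integers, while q controls how fast the tail decays.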

In order to illustrate the shape of the PG distribution, Figs. 1 and 2 show pmf plots for various values of the parameters $p$ (0.25, 0.50 and 0.75) and $\theta$ ($p/(3q)$, $p/(2q)$, $2p/(3q)$ and $9p/(10q)$), where $q = 1 - p$. We found that the scale of the PG distribution changes with the parameter $p$, while its shape varies with the parameter $\theta$. The pmf decreases rapidly as the parameter $p$ increases. In addition, the PG distribution is clearly unimodal when $\theta$ increases towards $p/q$. According to Figs. 1 and 2, we conclude that the PG distribution is right skewed and unimodal.

Fig. 1. The pmf plots of the partial-geometric distribution for various combinations of $p$ (0.25 and 0.50) and $\theta$ ($p/(3q)$, $p/(2q)$, $2p/(3q)$, and $9p/(10q)$).

Fig. 2. The pmf plots of the partial-geometric distribution for various combinations of $p$ (0.50 and 0.75) and $\theta$ ($p/(3q)$, $p/(2q)$, $2p/(3q)$, and $9p/(10q)$).

To measure the dispersion of the partial-geometric (PG) distribution, the index of dispersion (ID), the ratio of the variance to the mean, is computed for the values of the parameters $p$ and $\theta$ used in Figs. 1 and 2. The value of ID indicates whether the distribution is over-dispersed (ID > 1) or under-dispersed (ID < 1) [7]. Table 1 illustrates that the PG distribution is more flexible than the geometric distribution in the sense that it can also be applied to under-dispersed data, whereas the geometric distribution is only suitable for over-dispersed data.

Table 1.

The mean, variance and index of dispersion (ID) values of the partial-geometric (PG) distribution for different values of $p$ and $\theta$

Figure $p$ $\theta$ Mean Variance ID
1(a) 0.25 0.1111 1.3333 7.5556 5.6667
1(b) 0.25 0.1667 2.0000 10.0000 5.0000
1(c) 0.25 0.2222 2.6667 11.5556 4.3333
1(d) 0.25 0.3000 3.6000 12.2400 3.4000
1(e) 0.50 0.3333 0.6667 1.5556 2.3333
1(f) 0.50 0.5000 1.0000 2.0000 2.0000
2(a) 0.50 0.6667 1.3333 2.2222 1.6667
2(b) 0.50 0.9000 1.8000 2.1600 1.2000
2(c) 0.75 1.0000 0.4444 0.5432 1.2222
2(d) 0.75 1.5000 0.6667 0.6667 1.0000
2(e) 0.75 2.0000 0.8889 0.6914 0.7778
2(f) 0.75 2.7000 1.2000 0.5600 0.4667
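The rows of Table 1 can be recomputed from closed-form moments. The sketch below assumes the PG pmf P(X = 0) = 1 - theta*q/p, P(X = k) = theta*q^k for k >= 1, which gives mean theta*q/p^2 and second moment theta*q*(1 + q)/p^3; the parameter values are those listed in the table (theta equal to 1/3, 1/2, 2/3 and 9/10 of p/q), and the function name is illustrative only.

```python
def pg_moments(p: float, theta: float) -> tuple[float, float, float]:
    """Return (mean, variance, index of dispersion) under the assumed PG pmf."""
    q = 1.0 - p
    mean = theta * q / p ** 2
    second_moment = theta * q * (1.0 + q) / p ** 3
    variance = second_moment - mean ** 2
    return mean, variance, variance / mean


# theta is expressed as a fraction of its upper bound p/q, as in Table 1.
fractions = [1.0 / 3.0, 1.0 / 2.0, 2.0 / 3.0, 0.9]

for p in (0.25, 0.50, 0.75):
    q = 1.0 - p
    for fraction in fractions:
        theta = fraction * p / q
        mean, variance, dispersion = pg_moments(p, theta)
        print(f"{p:.2f} {theta:.4f} {mean:.4f} {variance:.4f} {dispersion:.4f}")
```

With theta = p the function returns the geometric values q/p and q/p^2, so the index of dispersion is 1/p > 1; values of theta close to p/q with large p push the index below one, which is the under-dispersed regime shown in the last rows of the table.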

Probability Properties

Some probability properties of the PG distribution, namely the mean, variance, moment generating function (mgf), and pgf, are provided in this section.

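Theorem 1. Let $X \in PG(p, \theta)$, then the mean of $X$ is $E(X) = \theta q / p^{2}$, where $q = 1 - p$.

Proof. Since $P(X = k) = \theta q^{k}$ for $k = 1, 2, \ldots$, the expectation of the PG distribution can be obtained from

$$E(X) = \sum_{k=0}^{\infty} k\,P(X = k) = \theta \sum_{k=1}^{\infty} k q^{k} = \theta q \sum_{k=1}^{\infty} k q^{k-1}.$$

Since $\sum_{k=1}^{\infty} k q^{k-1} = 1/(1-q)^{2}$ is the term-by-term derivative of the geometric series, the expectation is $E(X) = \theta q/(1-q)^{2} = \theta q / p^{2}$. $\square$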

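Theorem 2. Let $X \in PG(p, \theta)$, then the variance of $X$ is

$$Var(X) = \frac{\theta q (1 + q)}{p^{3}} - \frac{\theta^{2} q^{2}}{p^{4}},$$

where $q = 1 - p$.

Proof. The variance of the PG distribution can be obtained from

$$Var(X) = E(X^{2}) - [E(X)]^{2}.$$

With the expectation definition, and since $P(X = k) = \theta q^{k}$ for $k = 1, 2, \ldots$,

$$E(X^{2}) = \sum_{k=0}^{\infty} k^{2} P(X = k) = \theta \sum_{k=1}^{\infty} k^{2} q^{k}.$$

Since $\sum_{k=1}^{\infty} k^{2} q^{k} = q(1+q)/(1-q)^{3}$ is a geometric power series, $E(X^{2})$ is equal to $\theta q (1 + q)/p^{3}$. Therefore, using $E(X) = \theta q / p^{2}$,

$$Var(X) = \frac{\theta q (1 + q)}{p^{3}} - \left(\frac{\theta q}{p^{2}}\right)^{2} = \frac{\theta q (1 + q)}{p^{3}} - \frac{\theta^{2} q^{2}}{p^{4}}.$$

$\square$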

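Theorem 3. Let $X \in PG(p, \theta)$, then the mgf of $X$ is $M_X(t) = 1 - \dfrac{\theta q}{p} + \dfrac{\theta q e^{t}}{1 - q e^{t}}$, where $q = 1 - p$ and $t < -\ln q$.

Proof. The mgf of the PG distribution can be obtained from

$$M_X(t) = E\,e^{tX} = \sum_{k=0}^{\infty} e^{tk} P(X = k) = 1 - \frac{\theta q}{p} + \theta \sum_{k=1}^{\infty} (q e^{t})^{k}.$$

Since $\sum_{k=1}^{\infty} (q e^{t})^{k} = q e^{t}/(1 - q e^{t})$ is a geometric series, the mgf equals $1 - \theta q/p + \theta q e^{t}/(1 - q e^{t})$, where $q e^{t} < 1$, that is, $t < -\ln q$. $\square$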

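Theorem 4. Let $X \in PG(p, \theta)$, then the pgf of $X$ is $g_X(t) = 1 - \dfrac{\theta q}{p} + \dfrac{\theta q t}{1 - q t}$, where $q = 1 - p$ and $|t| < 1/q$.

Proof. The pgf of the PG distribution can be obtained from

$$g_X(t) = E\,t^{X} = \sum_{k=0}^{\infty} t^{k} P(X = k) = 1 - \frac{\theta q}{p} + \theta \sum_{k=1}^{\infty} (q t)^{k}.$$

Since $\sum_{k=1}^{\infty} (q t)^{k} = q t/(1 - q t)$ is a geometric series, the pgf equals $1 - \theta q/p + \theta q t/(1 - q t)$, where $|t| < 1/q$. $\square$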
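A quick numerical check can confirm that a closed-form pgf agrees with direct summation and that its derivative at t = 1 returns the mean. The sketch below assumes the PG pmf P(X = 0) = 1 - theta*q/p, P(X = k) = theta*q^k for k >= 1, for which the pgf is 1 - theta*q/p + theta*q*t/(1 - q*t); both this form and the function names are assumptions made for illustration.

```python
import math


def pg_pmf(k: int, p: float, theta: float) -> float:
    """Assumed PG pmf with q = 1 - p."""
    q = 1.0 - p
    return 1.0 - theta * q / p if k == 0 else theta * q ** k


def pg_pgf(t: float, p: float, theta: float) -> float:
    """Closed-form pgf under the assumed pmf, valid for |t| < 1/q."""
    q = 1.0 - p
    return 1.0 - theta * q / p + theta * q * t / (1.0 - q * t)


def pg_mgf(t: float, p: float, theta: float) -> float:
    """The mgf is the pgf evaluated at exp(t), valid for t < -log(q)."""
    return pg_pgf(math.exp(t), p, theta)


p, theta = 0.5, 2.0 / 3.0  # parameter values of panel 2(a) in Table 1
t = 0.8

# Direct summation of t**k * P(X = k); the tail beyond k = 400 is negligible here.
pgf_by_sum = sum(t ** k * pg_pmf(k, p, theta) for k in range(0, 400))
pgf_closed = pg_pgf(t, p, theta)

# A central difference of the pgf at t = 1 approximates E(X).
h = 1e-6
mean_from_pgf = (pg_pgf(1.0 + h, p, theta) - pg_pgf(1.0 - h, p, theta)) / (2.0 * h)

print(pgf_by_sum, pgf_closed, mean_from_pgf)
```

For these parameter values the derivative-based mean should agree with the Table 1 entry for panel 2(a), up to the error of the finite difference.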

Parameter Estimation

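We consider maximum likelihood estimation (MLE), the most commonly used method for parameter estimation. Let $X_1, \ldots, X_n$ be an independent and identically distributed random sample of size $n$ from the partial-geometric distribution, $PG(p, \theta)$, and let $x_1, \ldots, x_n$ be the observed sample values. For $k = 0, 1, 2, \ldots$ denote the frequencies $n_k = \#\{i : x_i = k\}$; that is, $n_k$ is the count of observations that are equal to $k$. Note that

$$\sum_{k=0}^{\infty} n_k = n, \qquad \sum_{k=1}^{\infty} k\,n_k = \sum_{i=1}^{n} x_i = n\bar{x}.$$

With $q = 1 - p$, $P(X = 0) = 1 - \theta q/p$ and $P(X = k) = \theta q^{k}$ for $k \ge 1$, the likelihood function can be written as

$$L(p, \theta) = \prod_{i=1}^{n} P(X = x_i) = \prod_{k=0}^{\infty} [P(X = k)]^{n_k}$$
$$= \left(1 - \frac{\theta q}{p}\right)^{n_0} \prod_{k=1}^{\infty} (\theta q^{k})^{n_k}$$
$$= \left(1 - \frac{\theta q}{p}\right)^{n_0} \theta^{\,n - n_0}\, q^{\sum_{k \ge 1} k n_k}$$
$$= \left(1 - \frac{\theta q}{p}\right)^{n_0} \theta^{\,n - n_0}\, q^{\,n\bar{x}}.$$

The log-likelihood function can be written as

$$\ell(p, \theta) = n_0 \ln\left(1 - \frac{\theta q}{p}\right) + (n - n_0) \ln \theta + n\bar{x} \ln q$$
$$= n_0 \ln(p - \theta q) - n_0 \ln p + (n - n_0) \ln \theta + n\bar{x} \ln(1 - p).$$

By taking partial derivatives with respect to the parameters, we obtain

$$\frac{\partial \ell}{\partial p} = \frac{n_0\,\theta}{p\,(p - \theta q)} - \frac{n\bar{x}}{q}, \qquad \frac{\partial \ell}{\partial \theta} = \frac{n - n_0}{\theta} - \frac{n_0\,q}{p - \theta q}.$$

The maximum likelihood estimators are found by equating the partial derivatives to zero; that is,

$$\frac{n_0\,\theta}{p\,(p - \theta q)} - \frac{n\bar{x}}{q} = 0, \qquad \frac{n - n_0}{\theta} - \frac{n_0\,q}{p - \theta q} = 0.$$

The second equation gives $\theta = (n - n_0)\,p/(n q)$, so that $p - \theta q = n_0 p/n$. Substituting into the first equation, we obtain the maximum likelihood estimates $\hat{p}$ and $\hat{\theta}$ of the PG distribution as

$$\hat{p} = \frac{n - n_0}{n\bar{x}}, \qquad \hat{\theta} = \frac{(n - n_0)\,\hat{p}}{n\,(1 - \hat{p})}.$$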
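As an illustration of the estimation step, the sketch below computes closed-form estimates from a frequency table. It assumes the PG pmf P(X = 0) = 1 - theta*q/p, P(X = k) = theta*q^k for k >= 1, for which setting the score equations to zero gives p_hat = (n - n0)/sum(x) and theta_hat = (n - n0)*p_hat/(n*(1 - p_hat)); the function name is hypothetical, and the frequencies are the observed counts reported in Tables 2 and 3.

```python
def pg_mle(freq: dict[int, int]) -> tuple[float, float]:
    """Closed-form estimates (p_hat, theta_hat) under the assumed PG pmf."""
    n = sum(freq.values())                        # sample size
    n0 = freq.get(0, 0)                           # number of zero observations
    total = sum(k * c for k, c in freq.items())   # sum of all observations
    p_hat = (n - n0) / total
    theta_hat = (n - n0) * p_hat / (n * (1.0 - p_hat))
    return p_hat, theta_hat


# Observed frequencies: number of claims (Table 2) and hospital stays (Table 3).
claims = {0: 7840, 1: 1317, 2: 239, 3: 42, 4: 14, 5: 4, 6: 4, 7: 1}
stays = {0: 3541, 1: 599, 2: 176, 3: 48, 4: 20, 5: 12, 6: 5, 7: 1, 8: 4}

for name, freq in (("claims", claims), ("hospital stays", stays)):
    p_hat, theta_hat = pg_mle(freq)
    print(f"{name}: p_hat = {p_hat:.4f}, theta_hat = {theta_hat:.4f}")
```

A useful sanity check on these estimates is that the fitted probability of zero, 1 - theta_hat*(1 - p_hat)/p_hat, equals the observed proportion of zeros n0/n, and the fitted mean equals the sample mean.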

APPLICATIONS TO REAL LIFE DATA

In order to evaluate the performance of the PG distribution, we fit it to two real datasets and compare it with two competing distributions, the geometric and the Poisson. We do not consider the first success distribution as a competitor because the data contain zeros.

The first dataset is accident data giving the total number of claims for 9,461 automobile insurance policies [10]. The second dataset is the number of hospital stays of persons aged 66 and over, for which there are 4,406 observations. These data were acquired from the National Medical Expenditure Survey, conducted in 1987 and 1988 to provide a comprehensive picture of how Americans use and pay for health services [11].

The fit of each distribution to these datasets is evaluated with the Anderson-Darling (AD) goodness of fit test for discrete data [12], computed with the dgof package [13] in the R language. Other model selection criteria are also used to determine the best-fitting model: the minus log-likelihood (-LL), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). The results of fitting the different distributions to these datasets are recorded in Tables 2 and 3.

Table 2.

Estimated parameters for the number of claims of the automobile insurance policies

Number of claims | Observed frequency | Expected frequency (Geometric) | Expected frequency (Partial-Geometric) | Expected frequency (Poisson)
0 7840 Inline graphic Inline graphic Inline graphic
1 1317 Inline graphic Inline graphic Inline graphic
2 239 Inline graphic Inline graphic Inline graphic
3 42 Inline graphic Inline graphic Inline graphic
4 14 Inline graphic Inline graphic Inline graphic
5 4 Inline graphic Inline graphic Inline graphic
6 4 Inline graphic Inline graphic Inline graphic
7 1 Inline graphic Inline graphic Inline graphic
Estimates Inline graphic Inline graphic Inline graphic
               LL Inline graphic Inline graphic Inline graphic
               AIC Inline graphic Inline graphic Inline graphic
               BIC Inline graphic Inline graphic Inline graphic
               AD test Inline graphic Inline graphic Inline graphic
               p-value Inline graphic Inline graphic Inline graphic

Table 3.

Estimated parameters for the number of hospital stays

Number of hospital stays | Observed frequency | Expected frequency (Geometric) | Expected frequency (Partial-Geometric) | Expected frequency (Poisson)
0 3541 Inline graphic Inline graphic Inline graphic
1 599 Inline graphic Inline graphic Inline graphic
2 176 Inline graphic Inline graphic Inline graphic
3 48 Inline graphic Inline graphic Inline graphic
4 20 Inline graphic Inline graphic Inline graphic
5 12 Inline graphic Inline graphic Inline graphic
6 5 Inline graphic Inline graphic Inline graphic
7 1 Inline graphic Inline graphic Inline graphic
8 4 Inline graphic Inline graphic Inline graphic
Estimates Inline graphic Inline graphic Inline graphic
               LL Inline graphic Inline graphic Inline graphic
               AIC Inline graphic Inline graphic Inline graphic
               BIC Inline graphic Inline graphic Inline graphic
               AD test Inline graphic Inline graphic Inline graphic
               p-value Inline graphic Inline graphic Inline graphic

For both the number of claims and the number of hospital stays (Tables 2 and 3), the PG distribution yields the largest p-value for the discrete AD test among the three fitted distributions, indicating a good fit to the data. Moreover, the PG distribution provides the lowest values of -LL, AIC and BIC, and its expected frequencies are the closest to the observed frequencies. Therefore, the most appropriate distribution among these three for both datasets is the PG distribution, followed by the geometric and Poisson distributions, respectively.
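The model selection criteria can be illustrated on the observed claim frequencies of Table 2. The sketch below is not the authors' code: it assumes the PG pmf P(X = 0) = 1 - theta*q/p, P(X = k) = theta*q^k for k >= 1 with closed-form estimates, uses the usual maximum likelihood estimates for the geometric (p = n/(n + sum x)) and Poisson (lambda = sample mean) models, and takes AIC = 2m - 2*LL and BIC = m*ln(n) - 2*LL with m the number of parameters. All variable names are illustrative.

```python
import math

# Observed frequencies of the number of claims (Table 2).
claims = {0: 7840, 1: 1317, 2: 239, 3: 42, 4: 14, 5: 4, 6: 4, 7: 1}
n = sum(claims.values())
n0 = claims[0]
total = sum(k * c for k, c in claims.items())


def loglik(pmf) -> float:
    """Log-likelihood of the frequency table under a fitted pmf."""
    return sum(c * math.log(pmf(k)) for k, c in claims.items())


def aic(ll: float, n_params: int) -> float:
    return 2.0 * n_params - 2.0 * ll


def bic(ll: float, n_params: int) -> float:
    return n_params * math.log(n) - 2.0 * ll


# Geometric model on k = 0, 1, 2, ... (one parameter).
p_geo = n / (n + total)
ll_geo = loglik(lambda k: p_geo * (1.0 - p_geo) ** k)

# Partial-geometric model under the assumed pmf (two parameters).
p_pg = (n - n0) / total
q_pg = 1.0 - p_pg
theta_pg = (n - n0) * p_pg / (n * q_pg)
ll_pg = loglik(lambda k: 1.0 - theta_pg * q_pg / p_pg if k == 0 else theta_pg * q_pg ** k)

# Poisson model (one parameter).
lam = total / n
ll_pois = loglik(lambda k: math.exp(-lam) * lam ** k / math.factorial(k))

for name, ll, m in (("Geometric", ll_geo, 1), ("PG", ll_pg, 2), ("Poisson", ll_pois, 1)):
    print(f"{name}: -LL = {-ll:.2f}, AIC = {aic(ll, m):.2f}, BIC = {bic(ll, m):.2f}")
```

Because the geometric model is the special case theta = p of the assumed PG form, the PG log-likelihood can never be lower than the geometric one; AIC and BIC then indicate whether the gain justifies the extra parameter.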

Figure 3 plots the fitted frequencies of the geometric, partial-geometric, and Poisson distributions together with the observed frequencies for the total number of claims of the automobile insurance policies and the number of hospital stays. It shows that the partial-geometric (PG) distribution provides the best fit to these two datasets among the three distributions. Thus, we conclude that the PG distribution is more flexible than the geometric and Poisson distributions.

Fig. 3. The fitted frequency of the geometric, partial-geometric and Poisson distributions to real datasets.

DISCUSSION AND CONCLUSION

Confusion between the geometric and first success distributions led to the idea of developing a new family of distributions. By adding an extra parameter to an existing distribution in order to capture more of the variation in natural phenomena, we proposed the partial-geometric distribution, which contains both the geometric and first success distributions as particular cases. We found that the PG distribution is right skewed and unimodal. Moreover, it can also be applied to model under-dispersed data. We also derived some essential probability properties, namely the probability mass function, mean, variance, moment generating function, and probability generating function. The maximum likelihood method is employed to estimate the parameters of the PG distribution. In the applications to two real datasets, the PG distribution provides the highest p-value for the discrete AD test and the lowest values of -LL, AIC and BIC. Therefore, the PG distribution is useful as an alternative to other distributions for the analysis of discrete data.

ACKNOWLEDGMENTS

The authors would like to thank the Department of Statistics, Faculty of Science, Kasetsart University.

FUNDING

The research of the author listed last was partially supported by the subsidy allocated to Kazan Federal University for the state assignment in the sphere of scientific activities, project 1.13556.2019/13.1.

Footnotes

(Submitted by A. M. Elizarov)

Contributor Information

Krisada Khruachalee, Email: krisada.khr@ku.th.

Winai Bodhisuwan, Email: fsciwnb@ku.ac.th.

Andrei Volodin, Email: andrei@uregina.ca.

REFERENCES

  • 1. Alzaatreh A., Lee C., Famoye F. A new method for generating families of continuous distributions. METRON. 2013;71:63–79. doi: 10.1007/s40300-013-0007-y.
  • 2. Johnson N. L., Kotz S. Distributions in Statistics: Discrete Distributions. New York: Houghton Mifflin; 1969.
  • 3. Bortkiewicz L. V. Das Gesetz der kleinen Zahlen. Leipzig: G. Teubner; 1898.
  • 4. Winsor C. P. Quotations: Das Gesetz der kleinen Zahlen. Human Biology. 1947;19:154–161.
  • 5. Weaver W. Lady Luck: The Theory of Probability. London: Heinemann; 1964.
  • 6. Gut A. An Intermediate Course in Probability. New York: Springer Science; 2009.
  • 7. Chakraborty S., Chakravarty D. Discrete gamma distributions: Properties and parameter estimations. Commun. Stat. - Theory Method. 2012;41:3301–3324. doi: 10.1080/03610926.2011.563014.
  • 8. Nash J. C. On best practice optimization methods in R. J. Stat. Software. 2014;60:1–14. doi: 10.18637/jss.v060.i02.
  • 9. R Core Team, R: A Language and Environment for Statistical Computing (2020).
  • 10. Klugman S. A., Panjer H. H., Willmot G. E. Loss Models: From Data to Decisions. Hoboken, NJ: Wiley; 2012.
  • 11. Deb P., Trivedi P. K. Demand for medical care by the elderly: A finite mixture approach. J. Appl. Econometr. 1997;12:313–336. doi: 10.1002/(SICI)1099-1255(199705)12:3<313::AID-JAE440>3.0.CO;2-G.
  • 12. Choulakian V., Lockhart R. A., Stephens M. A. Cramer-von Mises statistics for discrete distributions. Canad. J. Stat. 1994;22:125–137. doi: 10.2307/3315828.
  • 13. Arnold T. B., Emerson J. W. Nonparametric goodness-of-fit tests for discrete null distributions. R J. 2011;3:34–39. doi: 10.32614/RJ-2011-016.

Articles from Lobachevskii Journal of Mathematics are provided here courtesy of Nature Publishing Group
