Abstract
Data sets originating from a wide range of research studies are composed of multiple variables that are correlated and of dissimilar types, primarily count, binary/ordinal, and continuous attributes. The present paper builds on previous work on multivariate data generation and develops a framework for generating multivariate mixed data with a pre-specified correlation matrix. The generated data consist of components that are marginally count, binary, ordinal, and continuous, where the count and continuous variables follow the generalized Poisson and normal distributions, respectively. The use of the generalized Poisson distribution provides a flexible mechanism that accommodates the under- and over-dispersed count variables generally encountered in practice. A step-by-step algorithm is provided and its performance is evaluated using simulated and real-data scenarios.
Keywords: Generalized Poisson, Multivariate ordinal, discretization
1. Introduction
In many research and application areas, data sets that include only one type of variable are highly unusual. Most data sets contain a combination of binary, ordinal, count, and continuous variables that are correlated with a complex dependence structure. Mixed data sets are therefore common in a wide range of scientific fields, including the health and behavioral sciences, social sciences, and economics. For example, the Interstitial Cystitis database [1] includes longitudinal data on pain in the pelvic or bladder area, urgency (pressure to urinate), urinary frequency, nocturnal void, and nocturia from a prevalent cohort of subjects with Interstitial Cystitis. For each individual, these outcomes were associated with covariates, such as the demographic and clinical characteristics of the patients, which were observed at different times. The pain and urgency scores are continuous variables, urinary frequency is a discrete count variable, nocturnal void is a three-level ordinal variable, and nocturia is a dichotomous variable. The primary aim of the study was to determine whether symptoms tend to co-fluctuate, indicating a single underlying aetiology, or vary independently, suggesting multiple mechanisms at work. Because mixed outcomes are ubiquitous, the development of flexible models and methods for the joint analysis of such data is an active area of research [2, 3]. A Monte Carlo simulation to assess the performance of such models and methods typically requires a mechanism to generate multivariate data that closely resemble real data sets consisting of mixed variable attributes.
Multivariate data generation has been addressed extensively in the statistical literature. Krummenauer [4], Cai and Kendall [5], Minhajuddin et al. [6], Shin and Pasupathy [7], Yahav and Shmueli [8], and Barbiero and Ferrari [9] have developed various methods for generating simulated data from the multivariate Poisson distribution. The Barbiero and Ferrari [9] approach was shown to be comparatively more efficient, user-friendly, and often more accurate than the prior methods. Biswas [10], Demirtas [11], and Ferrari and Barbiero [12] have described and evaluated procedures for multivariate ordinal data generation. Recently, a few notable developments have occurred in the area of mixed data generation. Ruscio and Kaczetow [13] proposed a robust iterative technique that can generate mixed data either mimicking the marginal distribution observed in the sample data or using pre-specified parameters. Demirtas and Doganay [14], Demirtas et al. [15], Demirtas and Yavuz [16], and Amatya and Demirtas [17] have advanced the field by developing and implementing methods for simulating correlated mixed data comprised of binary-normal, binary-nonnormal, ordinal-normal, and count-normal components, respectively.
This work is motivated by the need to simultaneously simulate data composed of variables with different attributes (continuous, count, and binary/ordinal types) that are often encountered jointly in many settings. The proposed mechanism is built upon a combination of a few random variate generation methods that involve simulation of multivariate count data that follow the generalized Poisson distribution, multivariate binary/ordinal data, multivariate continuous data that follow the normal distribution, and a mix of these distributions. The proposed method maintains the specified marginal characteristics of each variable as well as the linear association structure among them. The methods in Famoye [18], Ferrari and Barbiero [12], and Barbiero and Ferrari [9] are particularly relevant for the development of the current work. We describe the details of these methods in Sections 2–4. Furthermore, since binary is a special case of ordinal, in what follows ordinal is meant to include binary as well.
A remark on the use of the generalized Poisson distribution (GPD) in this article is in order. Count data are generally assumed to follow a Poisson distribution, which requires the mean and variance to be equal. However, real data rarely meet this restriction. Most often, the variance of count data is larger than the mean (over-dispersion); less often, it is smaller than the mean (under-dispersion) [19–21]. Such phenomena are common in longitudinal count data, arising either from the marginal behavior of these variables or from the correlation induced by the repeated nature of the count variables. For example, Thall and Vail [22] give a data set on two-week seizure counts for 59 epileptics. At each of four successive post-randomization clinic visits, the number of seizures occurring over the previous two weeks was reported. The mean (variance) of the number of seizures at the four clinic visits was 8.95 (33.92), 8.36 (26.29), 8.44 (35.69), and 7.34 (22.14). Clearly, the equal mean and variance assumption of the ordinary Poisson distribution is violated. Thus, an ordinary Poisson distribution is not appropriate for modeling such over- or under-dispersed count data. A better choice is the generalized Poisson distribution, which allows both over- and under-dispersion.
The article is organized as follows. In Section 2, we provide background information on the GPD and the computation of its quantile function. In Section 3, we describe the process of transforming a univariate normal random variable into a discrete random variable. In Section 4, we establish the connection between discrete and continuous variables by adapting the Barbiero and Ferrari [9] method to the GPD and deriving an expression for the correlation between discretized and normal variables. In Section 5, we outline the proposed methodology and algorithm to generate multivariate mixed data with count (GPD), ordinal, and continuous (normal) components. In Section 6, we present simulation studies for assessing the performance of the suggested method. In Section 7, we illustrate the utility of the method in mimicking the Interstitial Cystitis data. In Section 8, we conclude the paper with discussions, remarks, and future directions.
2. Generalized Poisson Distribution
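The generalized Poisson distribution is a two-parameter distribution that includes the ordinary Poisson distribution as a special case. Let X be a non-negative integer-valued random variable whose probability mass function Px (θ, λ) is defined as
$$P_x(\theta, \lambda) = \frac{\theta(\theta + \lambda x)^{x-1} e^{-\theta - \lambda x}}{x!}, \quad x = 0, 1, 2, \ldots \tag{1}$$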
and zero otherwise, where θ > 0, max(−1, −θ/q) ≤ λ < 1 and q ≥ 4 is the largest positive integer for which θ + λq > 0 when λ < 0. Then, X is said to follow GPD. The GPD reduces to the Poisson distribution when λ = 0 and possesses the dual characteristics of over- and under-dispersion depending on whether λ > 0 or λ < 0, respectively. The first four moments of GPD (as derived in Ch. 9 of [23]) are as follows:
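$$\mu = \frac{\theta}{1-\lambda}, \qquad \sigma^2 = \frac{\theta}{(1-\lambda)^3}, \qquad \text{skewness} = \frac{1+2\lambda}{\sqrt{\theta(1-\lambda)}}, \qquad \text{kurtosis} = 3 + \frac{1 + 8\lambda + 6\lambda^2}{\theta(1-\lambda)} \tag{2}$$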
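As a quick numerical check on the moment expressions, the following minimal sketch evaluates the mean, standard deviation, skewness, and kurtosis of the GPD. It assumes the standard closed forms for the GPD moments (mean θ/(1 − λ), variance θ/(1 − λ)³, and the usual skewness and kurtosis expressions); the function name `gpd_moments` is hypothetical and not part of any package.

```python
import math


def gpd_moments(theta, lam):
    """Return (mean, sd, skewness, kurtosis) of GPD(theta, lam).

    Assumes the standard closed-form GPD moments; hypothetical helper.
    """
    if theta <= 0 or lam >= 1:
        raise ValueError("requires theta > 0 and lam < 1")
    mean = theta / (1.0 - lam)
    sd = math.sqrt(theta / (1.0 - lam) ** 3)
    skewness = (1.0 + 2.0 * lam) / math.sqrt(theta * (1.0 - lam))
    kurtosis = 3.0 + (1.0 + 8.0 * lam + 6.0 * lam ** 2) / (theta * (1.0 - lam))
    return mean, sd, skewness, kurtosis


# theta = 1, lam = 0.1 is one of the simulation settings in Section 6
print(gpd_moments(1.0, 0.1))
```

For θ = 1 and λ = 0.1 this gives values that agree, to four decimals, with the true values 1.1111, 1.1712, 1.2649, and 5.0667 listed in Table 1.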
The cumulative distribution function Fx (θ, λ) and quantile function of GPD are defined as:
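$$F_x(\theta, \lambda) = P(X \le x) = \sum_{i=0}^{x} P_i(\theta, \lambda) \tag{3}$$
$$F^{-1}(q) = \min\{x : F_x(\theta, \lambda) \ge q\}, \quad 0 < q < 1 \tag{4}$$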
The expressions (3) and (4) can be evaluated efficiently by exploiting the following recurrence relation between GPD probabilities.
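$$P_x(\theta, \lambda) = \frac{\theta - \lambda + \lambda x}{x}\left(1 + \frac{\lambda}{\theta - \lambda + \lambda x}\right)^{x-1} e^{-\lambda}\, P_{x-1}(\theta, \lambda), \quad x \ge 1, \qquad P_0(\theta, \lambda) = e^{-\theta} \tag{5}$$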
We are not aware of a readily available software implementation for computing the GPD quantile function (SAS provides functions only for λ > 0). We modify the Famoye [18] method, developed for generating univariate generalized Poisson random variates, and adapt it for the computation of the quantile function in Equation 4 at a given probability q. The modified algorithm is as follows:
Set x = 0, w = e^(−λ), S = e^(−θ), and P = S
- While q > S, do
- x = x + 1
- C = θ − λ + λx
- P = wC(1 + λ/C)^(x−1) P/x
- S = S + P
Deliver x
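The steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `gpd_quantile` is hypothetical, and two guards are added as assumptions that are not part of the listed algorithm (a stop when θ + λx is no longer positive, which can happen for λ < 0, and a stop when the cumulative sum no longer changes in floating point).

```python
import math


def gpd_quantile(q, theta, lam):
    """Smallest x with F_x(theta, lam) >= q, via the GPD recurrence.

    Hypothetical helper following the algorithm above; the two
    early-exit guards are added assumptions.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    x = 0
    w = math.exp(-lam)
    p = math.exp(-theta)   # P_0
    s = p                  # running CDF value
    while q > s:
        nxt = x + 1
        if theta + lam * nxt <= 0.0:
            break          # support ends here (possible when lam < 0)
        c = theta - lam + lam * nxt
        p = w * c * (1.0 + lam / c) ** (nxt - 1) * p / nxt
        x = nxt
        if s + p == s:
            break          # no numerical progress; avoid an endless loop
        s += p
    return x


# With lam = 0 the GPD reduces to the ordinary Poisson distribution
print(gpd_quantile(0.5, 2.0, 0.0), gpd_quantile(0.95, 2.0, 0.0))
```

Setting λ = 0 offers a convenient check, since the result should then match the ordinary Poisson quantile.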
3. Discretization
A discrete random variable can be obtained by discretization of any continuous random variable. The simple and precise approach is to generate a univariate discrete random variable with known probability mass function through discretization of a continuous normal random variable. Let Z be a standard normal random variable. A discrete random variable X ∈ {x1, x2,…, xmax} having distribution function FX is constructed as follows:
$$X = x_i \quad \text{if } r_{i-1} < Z \le r_i, \qquad i = 1, 2, \ldots, \max, \tag{6}$$
where ri = Φ−1 (FX(xi)), with r0 = −∞ and rmax = +∞, Φ being the standard normal cumulative distribution function (CDF). The numerical value of xmax is immediate for an ordinal variable with k categories. However, for discrete count variables with infinite support, the largest value is not directly available. To circumvent this problem, Barbiero and Ferrari [9] suggested reducing the support of the discrete components by approximating xmax ≈ FX−1 (1 − ε), where ε is chosen to be as small as possible.
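The discretization step can be illustrated with a short sketch using only the Python standard library. The helper names `thresholds` and `discretize` are hypothetical. The cumulative probabilities passed in are assumed to lie strictly between 0 and 1, so the last category is handled implicitly by a threshold at +∞.

```python
from statistics import NormalDist


def thresholds(cum_probs):
    """Normal thresholds r_i = Phi^{-1}(F(x_i)) for cumulative
    probabilities strictly inside (0, 1). Hypothetical helper."""
    nd = NormalDist()
    return [nd.inv_cdf(p) for p in cum_probs]


def discretize(z, cuts, values):
    """Map a standard normal draw z to the discrete value whose
    interval (r_{i-1}, r_i] contains it; `values` has one more
    entry than `cuts`."""
    for cut, value in zip(cuts, values):
        if z <= cut:
            return value
    return values[-1]


# Three-category ordinal variable with cumulative probabilities (.35, .80)
cuts = thresholds([0.35, 0.80])
print([discretize(z, cuts, [1, 2, 3]) for z in (-1.0, 0.0, 1.5)])
```

For a count variable, the same two helpers could be applied to the truncated support 0, 1, …, xmax, with the cumulative probabilities taken from the GPD distribution function.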
4. Correlation among variables with different attributes
The proposed method is concerned with the simulation of data sets comprised of count (GPD), ordinal, and normal random variables. The principle behind the proposal is to start with a multivariate normal distribution and discretize some of its components, thereby transforming them into count and ordinal components [9, 12]. In so doing, the starting pairwise correlations are adjusted such that the final correlation matrix (after discretization) is close to the target correlation matrix. The details of this correlation adjustment process are given in Section 5. Three sets of correlations are of interest: correlations among discrete random variables (GPD-GPD, GPD-ordinal, ordinal-ordinal); correlations among continuous random variables (normal-normal); and correlations between discrete and continuous random variables (GPD-normal, ordinal-normal). Of the three sets, correlations among normal random variables do not pose a major challenge, as the method is built upon the concept of discretization of multivariate normal random variables.
4.1. Correlation for discrete random variables
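Suppose Z is a multivariate normal random vector with correlation matrix ΣZ whose discretized components form a new multidimensional discrete random variable X with correlation matrix ΣX. The elements of ΣX differ from those of ΣZ, but the transformation is tractable. Ferrari and Barbiero [12] established the link between the pairwise correlation of two normal components (δC) and that of the corresponding discretized components X1 and X2 (δD) through the following expression:
$$\delta_D = \frac{E(X_1 X_2) - E(X_1)E(X_2)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}}$$
where
$$E(X_1 X_2) = \sum_{x_1 \le k_1} \sum_{x_2 \le k_2} x_1 x_2 \left[\Phi_{Z_1,Z_2}(r_{x_1}, r_{x_2}) - \Phi_{Z_1,Z_2}(r_{x_1 - 1}, r_{x_2}) - \Phi_{Z_1,Z_2}(r_{x_1}, r_{x_2 - 1}) + \Phi_{Z_1,Z_2}(r_{x_1 - 1}, r_{x_2 - 1})\right] \tag{7}$$
the sums run over the supports of X1 and X2, rx = ΦZ−1 (FX(x)) is the normal threshold associated with the support point x of the corresponding variable (as in Section 3, with the threshold below the smallest support point equal to −∞), k1 and k2 are the largest finite integers in the supports of the discrete variables, and ΦZ and ΦZ1,Z2 are the univariate and bivariate standard normal CDFs, the latter with correlation δC. Ferrari and Barbiero [12] and Barbiero and Ferrari [9] used this relationship to construct correlated ordinal and correlated Poisson random variables, respectively.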
4.2. Correlation Between Discrete and Normal Continuous Random Variables
Let Z1 and Z2 be two components of a bivariate standard normal random vector with pairwise correlation coefficient δN. Suppose Z2 is discretized to form a discrete variable X2, following the procedure described in Section 3, according to a distribution function FX (as defined in Equation 3). The resulting pairwise correlation δND between Z1 and X2 can be calculated using the bivariate normal result E(Z1|r1 < Z2 < r2) = δN E(Z2|r1 < Z2 < r2) and the following property of a truncated normal distribution:
$$E(Z_2 \mid r_1 < Z_2 < r_2) = \frac{\phi(r_1) - \phi(r_2)}{\Phi(r_2) - \Phi(r_1)} \tag{8}$$
such that,
$$\delta_{ND} = \frac{\delta_N}{\sigma_{X_2}} \sum_{i=1}^{k} x_i \left[\phi(r_{i-1}) - \phi(r_i)\right] \tag{9}$$
where σX2 is the standard deviation of X2, ϕ and Φ are the standard normal probability density function (PDF) and CDF, respectively, and ri = Φ−1 (FX(xi)) with r0 = −∞. For a k-category ordinal variable, xk is the last category. For a count variable following the GPD, xk is the value of the largest count as estimated by xmax ≈ FX−1 (1 − ε) (Section 3).
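The discrete-normal relationship can be checked numerically with a small sketch. It assumes the form δND = (δN/σX) Σ xi [ϕ(ri−1) − ϕ(ri)], which follows from the two results quoted above, and it takes ϕ at the upper end of the last category to be zero (a threshold at +∞). The function name `discrete_normal_corr` is hypothetical, and all category probabilities are assumed to be strictly positive.

```python
import math
from statistics import NormalDist


def discrete_normal_corr(delta_n, values, probs):
    """Correlation between a standard normal Z1 and the discretized
    version of Z2, given corr(Z1, Z2) = delta_n. Hypothetical helper;
    `probs` are the (positive) category probabilities of `values`."""
    nd = NormalDist()
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    total = 0.0
    cum = 0.0
    phi_prev = 0.0                      # phi(r_0) with r_0 = -infinity
    last = len(values) - 1
    for i, (v, p) in enumerate(zip(values, probs)):
        cum += p
        # phi at the upper threshold; zero for the last category
        at_end = i == last or cum >= 1.0
        phi_cur = 0.0 if at_end else nd.pdf(nd.inv_cdf(cum))
        total += v * (phi_prev - phi_cur)
        phi_prev = phi_cur
    return delta_n * total / math.sqrt(var)


# Median split of Z2: the attenuation factor is sqrt(2 / pi)
print(discrete_normal_corr(1.0, [0, 1], [0.5, 0.5]))
```

Because the expression is linear in δN, the intermediate normal correlation needed for a given discrete-normal target can be obtained by a single division by this attenuation factor.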
In the next section, we provide a unified framework on the joint generation of count, ordinal, and normal variables. Caveats pertaining to the correlation bounds for discretized ordinal and count variables within and between variable types also apply to the proposed method. These bounds are conveniently approximated through the “generate, sort, and correlate” technique [24].
5. Algorithm for generating multivariate mixed data with generalized Poisson, ordinal, and normal components
Let C1, C2, …, Cj be a set of count variables following the GPD with corresponding parameter vectors (θ1, λ1), (θ2, λ2), …, (θj, λj); O1, O2, …, Om be a set of ordinal variables with proportion vectors p1, p2, …, pm; and N1, N2, …, Nh be a set of normal variables with Nl ∼ N(μl, σl²), where l = 1, 2, …, h. The target linear association among the three types of variables is specified in a (j + m + h) × (j + m + h) correlation matrix Σ. Without loss of generality, assume that the first j columns of the data matrix consist of count variables, followed by m columns of ordinal variables, and that the last h columns are normal continuous variables. Then, Σ is comprised of six components: ΣCC, ΣOO, ΣNN, ΣOC, ΣON, and ΣCN, where C, O, and N correspond to count, ordinal, and normal, respectively. In this setup, each Oi may have a different number of categories, or a subset of them may have the same number of categories with the same or different probability vectors. Suppose that discretization of the first j + m components of an n × (j + m + h) dimensional data matrix drawn from N(j + m + h) (0, Σ*) results in an n × (j + m + h) dimensional mixed data matrix that has the target correlation matrix Σ. The algorithm proceeds as follows.
1. Compute ximax = FCi−1 (1 − ε) for i = 1,…, j, where FCi is as defined in Equation 3 and its inverse is the quantile function in Equation 4.
2. Compute Σ*. To this end, set up the following loop to solve Equations 7 and 9 until the desired accuracy is achieved.
   (a) Begin by setting κ = 0 and Σ* (κ) = Σ. Let δ*st (κ) be the stth element of Σ* (κ) for s, t = 1,…, (j + m + h).
   (b) For each element δ*st (κ) in the count and ordinal blocks of Σ* (κ), calculate the corresponding post-discretization correlation δst (κ) using Equation 7, where s, t = 1,…, (j + m) and s > t.
   (c) Form a matrix ΣD (κ) by appropriately arranging the calculated correlations δst (κ).
   (d) For each element δ*st (κ) in the count-normal and ordinal-normal blocks of Σ* (κ), calculate the corresponding δst (κ) using Equation 9, where s = (j + m + 1),…, (j + m + h), t = 1,…, (j + m) and s > t.
   (e) Check if |δst (κ) − δst| < η for s = 1,…, (j + m + h), t = 1,…, (j + m) and s > t, where δst (κ) and δst are the stth elements of ΣD (κ) and Σ, respectively, and η is a small positive number (say 0.00001).
      - If yes, set Σ* = Σ* (κ). Go to step 3.
      - If not, calculate the new δ*st (κ + 1) as δ*st (κ + 1) = δ*st (κ) δst / δst (κ), and update Σ* (κ) to form Σ* (κ + 1) such that the δ*st (κ + 1) constitute its elements. Set κ = κ + 1 and repeat steps 2(b) to 2(e).
3. Draw n samples from N(j + m + h) (0, Σ*). These samples form the n × (j + m + h) intermediate data matrix Y*.
4. Discretize the first j columns of Y* according to FCi (θi, λi) and ximax.
5. Discretize the (j + 1)th to (j + m)th columns of Y* according to the proportion vectors p1, p2,…, pm.
6. Shift and scale the last h columns of Y* according to the target mean and standard deviation vectors.
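The correlation-adjustment loop in step 2 can be illustrated on a single pair of variables. The sketch below is only an illustration under stated assumptions. It uses a multiplicative update (new intermediate correlation = old intermediate correlation × target / achieved), which is the form used in Ferrari and Barbiero-type algorithms and is assumed here. In place of Equation 7 it uses the closed-form correlation of two median-dichotomized normal variables, (2/π) arcsin(δ), a standard result chosen so the example needs no bivariate normal CDF routine. The names `median_split_corr` and `calibrate` are hypothetical.

```python
import math


def median_split_corr(delta_c):
    """Correlation of two binary variables obtained by splitting each
    component of a bivariate normal (correlation delta_c) at its median."""
    return 2.0 * math.asin(delta_c) / math.pi


def calibrate(target, post_corr, tol=1e-5, max_iter=100):
    """Find the intermediate normal correlation whose post-discretization
    correlation matches `target`, using an assumed multiplicative update."""
    delta_star = target                      # kappa = 0: start at the target
    for _ in range(max_iter):
        achieved = post_corr(delta_star)
        if abs(achieved - target) < tol:
            return delta_star
        delta_star *= target / achieved      # assumed update rule
        # keep the value inside (-1, 1) so post_corr stays defined
        delta_star = max(-0.999999, min(0.999999, delta_star))
    raise RuntimeError("no convergence; target may violate correlation bounds")


# Target binary-binary correlation of 0.3
print(calibrate(0.3, median_split_corr))
```

In this special case the exact answer is sin(0.3π/2) ≈ 0.454, so the intermediate normal correlation must be noticeably larger than the target. A failure to converge is a useful signal that the target may lie outside the feasible correlation range.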
Some numerical issues may arise when implementing the algorithm. First, the target pairwise correlation between two discrete variables, and between a discrete and a continuous variable, must respect the correlation bounds. These bounds can be calculated using the “generate, sort, and correlate” technique [24]. Any violations need to be checked and addressed prior to the computation of Σ*. Second, the computed intermediate correlation matrix Σ* may not be positive definite. In such a situation, one can compute the nearest positive definite matrix by the methods proposed by Higham [25, 26] using the nearPD function in the Matrix package in R. Finally, for some parametric combinations of the GPD, the FX calculation may require comparatively longer computation time.
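The bound check can be approximated by simulation. The sketch below follows the idea of the “generate, sort, and correlate” technique as we understand it: draw a large sample from each marginal, sort both in the same direction to approximate the upper bound, and in opposite directions to approximate the lower bound. The names `pearson` and `gsc_bounds`, the sample size, and the seed are illustrative choices.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gsc_bounds(draw_a, draw_b, n=20000, seed=0):
    """Approximate (lower, upper) correlation bounds for two marginals.
    `draw_a` and `draw_b` take a random.Random and return one draw."""
    rng = random.Random(seed)
    a = sorted(draw_a(rng) for _ in range(n))
    b = sorted(draw_b(rng) for _ in range(n))
    upper = pearson(a, b)           # both sorted ascending
    lower = pearson(a, b[::-1])     # one ascending, one descending
    return lower, upper


# Binary variable with P(X = 1) = 0.35 paired with a standard normal
print(gsc_bounds(lambda r: 1 if r.random() < 0.35 else 0,
                 lambda r: r.gauss(0.0, 1.0)))
```

A target correlation outside the returned interval cannot be attained for that pair of marginals and would need to be adjusted before Σ* is computed.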
6. Method evaluation via simulation
The proposed method involves the simulation of correlated multivariate count, ordinal, and normal variables. It requires specification of the parameters governing the marginal distribution of each type of variable and of the linear association among all the variables in the form of a correlation matrix. For the purpose of this evaluation, we consider data sets that are composed of two GPD variables, two ordinal variables, and two normal variables, i.e., a total of six variables. The following parameter vectors are considered for the GPD variables: θ ∈ {(1, 2), (10, 20)} and λ ∈ {(.1, .2),(−.1, −.2),(.4, .6),(−.2, .4),(−.2, −.4),(.5, −.7)}. These parameter values represent moderately to highly over- and under-dispersed count variables. To avoid a proliferation of simulation scenarios, we have limited this simulation to a single set of parameter values for the ordinal and normal components. The cumulative marginal probabilities of the two ordinal components (one with three categories and another with four categories) were chosen to be p1 = (.35, .80) and p2 = (.32, .54, .81). This specification covers a reasonably broad range of probability values. The means and standard deviations of the normal components are set at (0, 1). Finally, the correlation matrices are specified randomly. The values of the pairwise correlations are checked for bound violation, symmetry, and positive definiteness, ensuring the validity of the randomly generated correlation matrix. The performance of the proposed method is evaluated for each combination of parameters using the criteria described in the next section.
6.1. Accuracy and precision
We report the relative bias, RB = 100 × |E(η̂) − η| / |η|, of the estimates over 1000 Monte Carlo replications, where η is the true value of a parameter and η̂ is the corresponding estimated value. Also, the root mean square error is estimated using RMSE(η̂) = √E[(η̂ − η)²], the standardized bias (SB) is estimated using SB = 100 × |E(η̂) − η| / SE(η̂), and the coverage rate (CR) is defined as the percentage of times that η is contained within a 95% confidence interval. RB and SB are measures of accuracy, whereas RMSE and CR are integrated measures of some combination of accuracy and precision. Typically, we desire RB < 5% and SB < 50% [27].
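The evaluation criteria can be computed from a vector of replicate estimates as in the sketch below. It assumes the common definitions RB = 100 |mean − η| / |η|, SB = 100 |mean − η| / SD of the estimates, and RMSE = square root of the mean squared error, with CR taken as the percentage of supplied intervals that contain η. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def evaluation_metrics(estimates, true_value):
    """Return (RB, SB, RMSE) for replicate estimates of one parameter,
    under the assumed definitions described in the lead-in."""
    avg = mean(estimates)
    bias = abs(avg - true_value)
    rb = 100.0 * bias / abs(true_value)          # relative bias, percent
    sb = 100.0 * bias / stdev(estimates)         # standardized bias, percent
    rmse = math.sqrt(mean((e - true_value) ** 2 for e in estimates))
    return rb, sb, rmse


def coverage_rate(intervals, true_value):
    """Percentage of (lower, upper) intervals that contain true_value."""
    hits = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    return 100.0 * hits / len(intervals)


print(evaluation_metrics([0.9, 1.1, 1.0, 1.2], 1.0))
print(coverage_rate([(0.8, 1.2), (1.1, 1.3)], 1.0))
```

In a simulation study, `estimates` would hold the 1000 replicate estimates of a given correlation, proportion, or moment, and `intervals` the corresponding 95% confidence intervals.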
6.2. Results
We generate 1000 sets of simulated data (n=200) for each combination of parameters. A subset of the results of the evaluation is shown in Table 1 due to space limitations. The full table is available at the second author’s website, http://demirtas.people.uic.edu/table1.pdf. The mean pairwise correlations are strikingly close to the target values for all combinations of parameter specifications. The values on all the evaluation criteria are well within the tolerable limits. The average cumulative proportions of the two ordinal variables are also very close to the specified target values (RB<1%, SB<8%, CR≈ 95%). Furthermore, RB<1.4% and SB<11% for the count variables indicate accurate replication of the specified values of the mean (γ1) and standard deviation (γ2) of the GPD, and CR≈ 95% suggests high precision. The most difficult parameters to recover accurately are the higher moments, skewness (γ3) and kurtosis (γ4), of the GPD variables, especially for low θ and high λ combinations. For the skewness parameters, RB<9% and SB<35% indicate acceptable results with somewhat low accuracy, whereas CR≈ 95% suggests acceptable precision. Similarly, for the kurtosis parameters, RB<19%, SB<42%, and CR is 94%–97%. Given that only the fourth moment of the GPD variables, under extreme conditions (small θ and large λ), produces values on the evaluation metrics that exceed the thresholds, in our opinion the proposed method does a remarkable job of generating mixed data with count, ordinal, and normal components, as measured by the concordance between the specified and empirically computed quantities on average.
Table 1.
Key descriptive statistics and evaluation quantities. The mean represents the average of estimates across 1000 simulation replicates.
| θ | λ | Parameter | True Value | Mean | RB | SB | RMSE | CR | Min. | 1st Qu. | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1, 2) | (.1, .2) | ρ | 0.1865 | 0.1864 | 0.0707 | 0.1814 | 0.0727 | 96.10 | −0.0383 | 0.1348 | 0.2366 | 0.4494 |
| | | | 0.1241 | 0.1242 | 0.1104 | 0.1962 | 0.0698 | 95.10 | −0.1069 | 0.0765 | 0.1695 | 0.3350 |
| | | | 0.1965 | 0.1919 | 2.3207 | 6.6481 | 0.0687 | 95.40 | −0.0762 | 0.1460 | 0.2381 | 0.4173 |
| | | | 0.3627 | 0.3611 | 0.4283 | 2.5224 | 0.0616 | 95.60 | 0.1457 | 0.3182 | 0.4043 | 0.5231 |
| | | | 0.2671 | 0.2678 | 0.2726 | 1.0978 | 0.0663 | 95.00 | 0.0375 | 0.2234 | 0.3129 | 0.4515 |
| | | | 0.2783 | 0.2762 | 0.7674 | 3.3110 | 0.0645 | 95.50 | 0.0646 | 0.2354 | 0.3207 | 0.4802 |
| | | | 0.1932 | 0.1892 | 2.0507 | 5.7763 | 0.0687 | 95.40 | −0.0598 | 0.1447 | 0.2362 | 0.3784 |
| | | | 0.3764 | 0.3733 | 0.8107 | 5.1045 | 0.0598 | 95.70 | 0.1704 | 0.3327 | 0.4148 | 0.5409 |
| | | | 0.2704 | 0.2718 | 0.5161 | 2.1930 | 0.0636 | 96.00 | 0.0791 | 0.2290 | 0.3168 | 0.4293 |
| | | | 0.2735 | 0.2687 | 1.7604 | 7.3842 | 0.0654 | 94.30 | 0.0629 | 0.2277 | 0.3131 | 0.5225 |
| | | | 0.1174 | 0.1206 | 2.7414 | 4.4714 | 0.0720 | 95.60 | −0.1149 | 0.0767 | 0.1684 | 0.3594 |
| | | | 0.1438 | 0.1437 | 0.0478 | 0.0985 | 0.0697 | 95.00 | −0.1084 | 0.0971 | 0.1915 | 0.3769 |
| | | | 0.3861 | 0.3851 | 0.2592 | 1.6800 | 0.0596 | 95.20 | 0.1573 | 0.3466 | 0.4267 | 0.5876 |
| | | | 0.3356 | 0.3332 | 0.7189 | 3.8945 | 0.0620 | 95.30 | 0.1255 | 0.2921 | 0.3736 | 0.5146 |
| | | | 0.3825 | 0.3805 | 0.5100 | 3.1678 | 0.0616 | 95.30 | 0.1558 | 0.3404 | 0.4223 | 0.5569 |
| | | γ1 | 1.1111 | 1.1081 | 0.2751 | 3.8193 | 0.0800 | 95.20 | 0.8600 | 1.0600 | 1.1600 | 1.3800 |
| | | | 2.5000 | 2.5003 | 0.0100 | 0.1893 | 0.1320 | 94.60 | 2.1150 | 2.4100 | 2.5850 | 2.9900 |
| | | γ2 | 1.1712 | 1.1675 | 0.3216 | 4.7146 | 0.0799 | 95.60 | 0.8931 | 1.1120 | 1.2240 | 1.4580 |
| | | | 1.9764 | 1.9728 | 0.1827 | 2.7420 | 0.1317 | 94.60 | 1.6180 | 1.8780 | 2.0590 | 2.4630 |
| | | γ3 | 1.2649 | 1.2150 | 3.9429 | 16.9577 | 0.2982 | 96.70 | 0.6300 | 1.0150 | 1.3670 | 3.8590 |
| | | | 1.1068 | 1.0685 | 3.4581 | 13.1912 | 0.2925 | 96.30 | 0.4125 | 0.8793 | 1.2200 | 3.6660 |
| | | γ4 | 5.0667 | 4.7724 | 5.8079 | 15.7250 | 1.8934 | 97.10 | 2.4800 | 3.7110 | 5.3270 | 31.7100 |
| | | | 4.7750 | 4.5638 | 4.4222 | 12.3918 | 1.7162 | 96.60 | 2.3540 | 3.5790 | 5.0690 | 29.9600 |
| | | p | 0.3500 | 0.3517 | 0.4857 | 5.0079 | 0.0339 | 0.95 | 0.2450 | 0.3300 | 0.3750 | 0.4600 |
| | | | 0.8000 | 0.8009 | 0.1175 | 3.3734 | 0.0279 | 0.94 | 0.7250 | 0.7800 | 0.8200 | 0.8900 |
| | | | 0.3200 | 0.3206 | 0.1891 | 1.8271 | 0.0331 | 0.95 | 0.2100 | 0.3000 | 0.3412 | 0.4300 |
| | | | 0.5400 | 0.5405 | 0.0870 | 1.3241 | 0.0355 | 0.95 | 0.4450 | 0.5150 | 0.5650 | 0.6450 |
| | | | 0.8100 | 0.8115 | 0.1895 | 5.5632 | 0.0276 | 0.95 | 0.7150 | 0.7950 | 0.8300 | 0.9000 |
| (10,20) | (−.2,−.4) | ρ | 0.3317 | 0.3350 | 1.0009 | 5.1817 | 0.0641 | 95.60 | 0.0957 | 0.2912 | 0.3784 | 0.6035 |
| | | | 0.2449 | 0.2464 | 0.6458 | 2.2962 | 0.0688 | 95.90 | 0.0429 | 0.1973 | 0.2950 | 0.4437 |
| | | | 0.2780 | 0.2801 | 0.7529 | 3.2329 | 0.0647 | 95.40 | 0.0861 | 0.2358 | 0.3262 | 0.5135 |
| | | | 0.1837 | 0.1846 | 0.4911 | 1.3465 | 0.0670 | 95.60 | −0.0017 | 0.1385 | 0.2282 | 0.3998 |
| | | | 0.2717 | 0.2748 | 1.1611 | 4.8461 | 0.0651 | 95.30 | 0.0560 | 0.2318 | 0.3186 | 0.4873 |
| | | | 0.1865 | 0.1879 | 0.7666 | 2.1196 | 0.0674 | 94.50 | −0.0125 | 0.1404 | 0.2332 | 0.3891 |
| | | | 0.2107 | 0.2122 | 0.7011 | 2.2212 | 0.0665 | 95.60 | −0.0142 | 0.1654 | 0.2583 | 0.4039 |
| | | | 0.3617 | 0.3646 | 0.7905 | 4.6359 | 0.0617 | 95.30 | 0.0705 | 0.3226 | 0.4057 | 0.5569 |
| | | | 0.2135 | 0.2137 | 0.1007 | 0.3102 | 0.0693 | 95.90 | −0.1018 | 0.1700 | 0.2608 | 0.4171 |
| | | | 0.1548 | 0.1540 | 0.5020 | 1.1363 | 0.0684 | 95.10 | −0.0769 | 0.1108 | 0.1996 | 0.3751 |
| | | | 0.3188 | 0.3201 | 0.3926 | 1.9556 | 0.0640 | 95.80 | 0.1381 | 0.2768 | 0.3670 | 0.5221 |
| | | | 0.3587 | 0.3579 | 0.2130 | 1.1931 | 0.0640 | 95.20 | 0.1297 | 0.3142 | 0.4016 | 0.5430 |
| | | | 0.2669 | 0.2645 | 0.8860 | 3.5797 | 0.0661 | 95.00 | 0.0444 | 0.2218 | 0.3089 | 0.4261 |
| | | | 0.2335 | 0.2331 | 0.1932 | 0.6729 | 0.0670 | 95.40 | 0.0332 | 0.1877 | 0.2808 | 0.4634 |
| | | | 0.2407 | 0.2426 | 0.7809 | 2.9008 | 0.0648 | 95.30 | 0.0498 | 0.1990 | 0.2878 | 0.4488 |
| | | | 0.1796 | 0.1844 | 2.6872 | 7.2200 | 0.0670 | 94.30 | −0.0184 | 0.1388 | 0.2281 | 0.4188 |
| | | | 0.3926 | 0.3935 | 0.2266 | 1.4939 | 0.0595 | 95.00 | 0.1644 | 0.3538 | 0.4330 | 0.5537 |
| | | γ1 | 8.3333 | 8.3287 | 0.0557 | 2.7920 | 0.1663 | 94.90 | 7.8450 | 8.2200 | 8.4450 | 8.8700 |
| | | | 14.2857 | 14.2825 | 0.0226 | 1.6308 | 0.1980 | 95.00 | 13.6900 | 14.1500 | 14.4200 | 14.9000 |
| | | γ2 | 2.4056 | 2.4053 | 0.0136 | 0.2764 | 0.1185 | 95.50 | 2.0670 | 2.3200 | 2.4880 | 2.8090 |
| | | | 2.6998 | 2.6938 | 0.2214 | 4.4824 | 0.1334 | 95.60 | 2.3110 | 2.5990 | 2.7920 | 3.0490 |
| | | γ3 | 0.1732 | 0.1670 | 3.5937 | 3.8583 | 0.1614 | 95.00 | −0.3318 | 0.0609 | 0.2699 | 0.8250 |
| | | | 0.0378 | 0.0344 | 9.0104 | 2.0669 | 0.1647 | 95.10 | −0.4314 | −0.0853 | 0.1426 | 0.6621 |
| | | γ4 | 2.9700 | 2.9260 | 1.4826 | 13.6121 | 0.3263 | 96.50 | 2.1920 | 2.7030 | 3.0970 | 5.1520 |
| | | | 2.9557 | 2.9333 | 0.7585 | 7.1579 | 0.3138 | 95.60 | 2.0900 | 2.7100 | 3.1080 | 4.1460 |
| | | p | 0.3500 | 0.3525 | 0.7200 | 7.5922 | 0.0332 | 95.60 | 0.2450 | 0.3300 | 0.3750 | 0.4500 |
| | | | 0.8000 | 0.8009 | 0.1081 | 3.0480 | 0.0284 | 95.90 | 0.7150 | 0.7800 | 0.8200 | 0.8850 |
| | | | 0.3200 | 0.3203 | 0.1000 | 0.9819 | 0.0326 | 94.60 | 0.2150 | 0.3000 | 0.3400 | 0.4350 |
| | | | 0.5400 | 0.5404 | 0.0722 | 1.0688 | 0.0365 | 95.00 | 0.4250 | 0.5150 | 0.5650 | 0.6550 |
| | | | 0.8100 | 0.8101 | 0.0099 | 0.2760 | 0.0290 | 94.70 | 0.7050 | 0.7900 | 0.8300 | 0.8900 |
For brevity, we do not report the results of the small-sample cases. As expected, the mean, RB, and SB did not change substantially. The CR was somewhat higher, but not to a degree of inefficiency. In addition, as one would expect, the RMSE was slightly larger.
7. Example: Interstitial Cystitis Data Base study
We consider the Interstitial Cystitis Data Base (ICDB) study to illustrate the performance of the proposed method in mimicking real data. We introduced these data in the Introduction. Briefly, the ICDB was established by the National Institute of Diabetes and Digestive and Kidney Diseases in 1993 to understand the natural history of interstitial cystitis (IC) and to assess the demographic and clinical characteristics of IC patients [28]. A number of articles have been published using this database since its establishment. Among other variables, the data set contains information on the number of non-nocturnal voids per day (VF_AWAKE), ordinal levels of urinary urgency over the last four weeks (PURG2C3), and the average maximum interval between voids per day (AVGMINT). For this illustration, we selected only the patients who had complete data on all three variables across the 3-, 6-, and 9-month visits. A total of 309 patients met this requirement. The final data excluded variables that are not under consideration and patients who did not meet the aforementioned requirement. The VF_AWAKE variable measured on three occasions constitutes three correlated count variables; the PURG2C3 variable measured on three occasions constitutes three correlated ordinal (three-level) variables; and AVGMINT, also measured on three occasions, constitutes three correlated normal variables. Together, they form a data set with nine correlated variables of dissimilar types.
We demonstrate the ability of the proposed method to duplicate this “model” data in two ways. First, we show the proximity of the four moments of VF_AWAKE, the cumulative proportions of PURG2C3, the mean and standard deviation of AVGMINT, and the correlation matrix estimated from the “model” data to those estimated from simulated data (mimicking the “model” data). From the “model” data, the θ and λ parameters for VF_AWAKE are obtained by fitting a univariate generalized Poisson distribution to the 3-, 6-, and 9-month assessment data. The appropriateness of marginal normal distributions for the AVGMINT variables is tested using the Anderson-Darling normality test (p-value>0.05). The correlation matrix and the values of the relevant parameters estimated from the “model” data are presented in Tables 2 and 3, respectively. Using these estimated quantities as true parameters, the process of generating a data matrix of size 309 × 9 is replicated 1000 times.
Table 2.
Correlation matrix of the interstitial cystitis data
| Variable | Month | VF_AWAKE (3) | VF_AWAKE (6) | VF_AWAKE (9) | PURG2C3 (3) | PURG2C3 (6) | PURG2C3 (9) | AVGMINT (3) | AVGMINT (6) | AVGMINT (9) |
|---|---|---|---|---|---|---|---|---|---|---|
| VF_AWAKE | 3 | 1 | 0.757 | 0.758 | 0.275 | 0.195 | 0.285 | −0.475 | −0.479 | −0.466 |
| | 6 | 0.757 | 1 | 0.806 | 0.246 | 0.274 | 0.312 | −0.5 | −0.53 | −0.506 |
| | 9 | 0.758 | 0.806 | 1 | 0.274 | 0.256 | 0.397 | −0.466 | −0.481 | −0.519 |
| PURG2C3 | 3 | 0.275 | 0.246 | 0.274 | 1 | 0.575 | 0.545 | −0.242 | −0.241 | −0.166 |
| | 6 | 0.195 | 0.274 | 0.256 | 0.575 | 1 | 0.599 | −0.255 | −0.273 | −0.201 |
| | 9 | 0.285 | 0.312 | 0.397 | 0.545 | 0.599 | 1 | −0.235 | −0.248 | −0.229 |
| AVGMINT | 3 | −0.475 | −0.5 | −0.466 | −0.242 | −0.255 | −0.235 | 1 | 0.716 | 0.729 |
| | 6 | −0.479 | −0.53 | −0.481 | −0.241 | −0.273 | −0.248 | 0.716 | 1 | 0.75 |
| | 9 | −0.466 | −0.506 | −0.519 | −0.166 | −0.201 | −0.229 | 0.729 | 0.75 | 1 |
Table 3.
Parameter estimates from interstitial cystitis data
| Variable | Parameter | 3 months | 6 months | 9 months |
|---|---|---|---|---|
| VF_AWAKE | θ | 7.31 | 7.16 | 7.03 |
| | λ | 0.34 | 0.34 | 0.38 |
| | γ1 | 11.03 | 10.85 | 11.28 |
| | γ2 | 16.64 | 16.44 | 18.09 |
| | γ3 | 0.76 | 0.77 | 0.84 |
| | γ4 | 3.90 | 3.94 | 4.11 |
| PURG2C3 | p1 | 0.37 | 0.40 | 0.41 |
| | p2 | 0.78 | 0.81 | 0.81 |
| AVGMINT | mean | 4.80 | 4.86 | 4.89 |
| | SD | 1.64 | 1.64 | 1.72 |
Note: The columns correspond to the three assessment occasions. SD denotes standard deviation; p1 and p2 denote the cumulative probabilities of the first and second categories of the three-level ordinal variable; and γ1, γ2, γ3, and γ4 denote the mean, SD, skewness, and kurtosis of the GPD.
The results of the comparison of these quantities between the “model” and simulated data are presented in Figures 1–4. Figure 1 shows three evaluation metrics. The maximum observed values (representing the worst-case scenario) on all three metrics fall well under the thresholds for all the parameters, including the kurtoses of the count variables. Figure 2 shows the distributions of the cumulative proportions of each category of the ordinal variable PURG2C3 assessed over the three time points. The x-axis displays the “true” values from the “model” data and the y-axis displays the estimated values from the simulated data sets. The average estimated cumulative proportions are in close proximity to the “true” values, with a relatively narrow interquartile range. Figure 3 shows the 36 estimated mean pairwise correlations along with their “true” values on the x-axis. It shows that the average estimated pairwise correlations over the simulated data are in close vicinity of the “true” values. The largest difference between the average and “true” pairwise correlations is less than 0.0022. Finally, Figure 4 presents the distributions of the four sample moments of the VF_AWAKE variables. The upper two graphs show the distributions of the sample means and standard deviations. The averages of these two statistics are close to the “true” values, with very few outliers. As expected, the distributions of the third and fourth moments contain more outliers. However, the averages are fairly close to their “true” values. Overall, the proposed method succeeds remarkably in duplicating the marginal distributions and dependence structure of the “model” data.
Figure 1.
Accuracy of the estimated parameters. RB: Relative bias, SB: Standardized Bias.
Figure 2.
Cumulative probabilities of three ordinal variables.
Figure 3.
Correlations among variables with different attributes, of count, categorical and continuous types.
Figure 4.
Four moments of GPD count variables.
Next, we fit a mixed-effect ordinal regression model with cumulative probit link and equally spaced thresholds. We show that the estimated values of the parameters of this regression model using the “model” data and simulated data are close to each other. The model under consideration is as follows:
| (10) |
where yij is an unobservable latent variable related to the ordinal variable PURG2C3, αc is a series of threshold values, and υ0i is a random effect assumed to have a N(0, συ0) distribution. Table 4 presents the estimated regression parameters for the “model” data in the “True Value” column and for the simulated data in the “Mean” column, with the evaluation metrics in the remaining columns. The averages of the parameter estimates from the 1000 simulated data sets are well within the tolerable limits. These results suggest that the proposed method captures and reconstructs the trends in the real data, and may be regarded as a useful tool for evaluating statistical models and methods.
Table 4.
Parameter estimates of cumulative logit ordinal mixed-effect regression.
| | True Value | Mean | RB | SB | RMSE | CR | Min. | 1st Qu. | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|---|---|---|
| α1 | −0.3928 | −0.3758 | 0.0170 | 4.3292 | 0.3495 | 0.95 | −1.6470 | −0.6098 | −0.1410 | 0.6547 |
| α2 − α1 | 2.0762 | 2.0891 | 0.0130 | 0.6257 | 0.1204 | 0.94 | 1.7740 | 2.0080 | 2.1580 | 2.5180 |
| AVGMINT | −0.2072 | −0.2099 | 0.0027 | 1.2908 | 0.0476 | 0.95 | −0.3864 | −0.2426 | −0.1782 | −0.0500 |
| VF_AWAKE | 0.1186 | 0.1227 | 0.0041 | 3.4609 | 0.0165 | 0.95 | 0.0785 | 0.1116 | 0.1333 | 0.1799 |
| MONTH | −0.0305 | −0.0332 | 0.0027 | 8.9748 | 0.0186 | 0.94 | −0.0894 | −0.0457 | −0.0211 | 0.0202 |
| συ0 | 1.6228 | 1.5615 | 0.0613 | 3.7775 | 0.2670 | 0.95 | 0.8955 | 1.3700 | 1.7130 | 2.6300 |
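Summary columns of this kind can be computed from replicated estimates along the following lines. This is a minimal sketch under commonly used definitions (raw bias, percentage bias, standardized bias, root-mean-square error, and coverage rate of nominal 95% Wald intervals). The function name, the dictionary keys, and the toy inputs are assumptions, and the paper's own metric definitions may differ in detail.

```python
import numpy as np

def evaluation_metrics(true_value, estimates, std_errors, z=1.96):
    """Summarize replicated estimates of one parameter (assumed definitions)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    mean_est = estimates.mean()
    raw_bias = abs(mean_est - true_value)
    # Wald interval for each replicate: estimate +/- z * standard error.
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return {
        "mean": mean_est,
        "raw_bias": raw_bias,
        "percent_bias": 100.0 * raw_bias / abs(true_value),
        "standardized_bias": 100.0 * raw_bias / estimates.std(ddof=1),
        "rmse": float(np.sqrt(np.mean((estimates - true_value) ** 2))),
        "coverage_rate": float(np.mean((lower <= true_value) & (true_value <= upper))),
    }

# Toy replicates (hypothetical): 1000 estimates scattered around a known value.
rng = np.random.default_rng(1)
true_value = 0.1186
estimates = rng.normal(loc=true_value, scale=0.0165, size=1000)
std_errors = np.full(1000, 0.0165)
metrics = evaluation_metrics(true_value, estimates, std_errors)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Applying the function to each row of estimates in turn would yield one line of such a table per parameter.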
8. Discussion
The approach to multivariate data generation with mixed variable types in this paper is motivated by the growing popularity of data analysis techniques for mixed outcomes. The method draws on ideas from previous work by Ferrari and Barbiero [12], Barbiero and Ferrari [9], Demirtas and Doganay [14], and Amatya and Demirtas [17]. The novelties of this paper are the use of the generalized Poisson distribution to address over- and under-dispersion in count data, the derivation of a connection (correlation) between normal variables and the count/ordinal variables obtained via discretization, and the unified framework through which data sets composed of dissimilar variable types are generated. The proposed framework can also generate multivariate data composed of only one of the variable types considered in this work. This is particularly useful for generating multivariate GPD variables, for which no method is currently available. For parameter values that are generally encountered in practice, the proposed method performs well and provides a robust mechanism for generating multivariate data with a combination of count, ordinal and normal variables. Future work will address modeling the continuous components of the mixed data via non-normal continuous distributions.
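To illustrate why the generalized Poisson distribution accommodates both kinds of dispersion, the following minimal sketch evaluates the standard GPD probability mass function described by Consul [23], P(X = x) = θ(θ + λx)^(x−1) exp(−θ − λx)/x!, whose mean is θ/(1 − λ) and whose variance is θ/(1 − λ)^3. The names `theta`, `lam`, `gpd_pmf`, and `gpd_mean_var` are this sketch's own notation, and the parameter values are hypothetical. For negative `lam` the support is truncated where θ + λx is no longer positive, so the sketch renormalizes the probabilities.

```python
import math

def gpd_pmf(x, theta, lam):
    """P(X = x) for the generalized Poisson distribution (theta > 0, lam < 1)."""
    base = theta + lam * x
    if base <= 0:  # outside the truncated support that arises when lam < 0
        return 0.0
    # Work on the log scale to avoid overflow for large x.
    log_p = (math.log(theta) + (x - 1) * math.log(base)
             - base - math.lgamma(x + 1))
    return math.exp(log_p)

def gpd_mean_var(theta, lam, x_max=500):
    """Numerical mean and variance from the pmf summed over 0..x_max."""
    probs = [gpd_pmf(x, theta, lam) for x in range(x_max + 1)]
    total = sum(probs)  # near 1; dividing absorbs truncation error
    mean = sum(x * p for x, p in enumerate(probs)) / total
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs)) / total
    return mean, var

over_mean, over_var = gpd_mean_var(theta=2.0, lam=0.3)    # lam > 0
pois_mean, pois_var = gpd_mean_var(theta=2.0, lam=0.0)    # ordinary Poisson
under_mean, under_var = gpd_mean_var(theta=5.0, lam=-0.2)  # lam < 0

print("lam =  0.3: mean", over_mean, "variance", over_var)
print("lam =  0.0: mean", pois_mean, "variance", pois_var)
print("lam = -0.2: mean", under_mean, "variance", under_var)
```

The variance exceeds the mean when `lam` is positive, equals it when `lam` is zero, and falls below it when `lam` is negative, which is the flexibility that a plain Poisson marginal lacks.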
References
- 1. Simon LJ, Landis JR, Erickson DR, Nyberg LM, Group IS, et al. The interstitial cystitis data base study: concepts and preliminary baseline descriptive statistics. Urology. 1997;49(5):64–75. doi: 10.1016/s0090-4295(99)80334-3.
- 2. Song P, Li M, Yuan Y. Joint regression analysis of correlated data using gaussian copulas. Biometrics. 2009;65(1):60–68. doi: 10.1111/j.1541-0420.2008.01058.x.
- 3. De Leon AR, Chough KC. Analysis of mixed data: Methods and applications. CRC Press; 2013.
- 4. Krummenauer F. Efficient simulation of multivariate binomial and Poisson distributions. Biometrical Journal. 1998;40(7):823–832.
- 5. Cai Y, Kendall W. Perfect simulation for correlated Poisson random variables conditioned to be positive. Statistics and Computing. 2002;12(3):229–243.
- 6. Minhajuddin TM, Harris IR, Schucany WR. Simulating multivariate distributions with specific correlations. Journal of Statistical Computation and Simulation. 2004;74(8):599–607.
- 7. Shin K, Pasupathy R. Proceedings of the 39th Conference on Winter Simulation: 40 Years! The Best is Yet to Come; WSC ’07. Piscataway, NJ, USA: IEEE Press; 2007. A method for fast generation of bivariate Poisson random vectors; pp. 472–479.
- 8. Yahav I, Shmueli G. On generating multivariate Poisson data in management science applications. Applied Stochastic Models in Business and Industry. 2012;28(1):91–102.
- 9. Barbiero A, Ferrari PA. Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry. 2014;31(5):669–680.
- 10. Biswas A. Generating correlated ordinal categorical random samples. Statistics and Probability Letters. 2004;70(1):25–235.
- 11. Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation. 2006;76(11):1017–1025.
- 12. Ferrari PA, Barbiero A. Simulating ordinal data. Multivariate Behavioral Research. 2012;47(4):566–589. doi: 10.1080/00273171.2012.692630.
- 13. Ruscio J, Kaczetow W. Simulating multivariate nonnormal data using an iterative algorithm. Multivariate Behavioral Research. 2008;43(3):355–381. doi: 10.1080/00273170802285693.
- 14. Demirtas H, Doganay B. Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics. 2012;22(3):323–236. doi: 10.1080/10543406.2010.521874.
- 15. Demirtas H, Hedeker D, Mermelstein JM. Simulation of massive public health data by power polynomials. Statistics in Medicine. 2012;31(27):3337–3346. doi: 10.1002/sim.5362.
- 16. Demirtas H, Yavuz Y. Concurrent generation of ordinal and normal data. Journal of Biopharmaceutical Statistics. 2015;25(4):635–650. doi: 10.1080/10543406.2014.920868.
- 17. Amatya A, Demirtas H. Simultaneous generation of multivariate data with Poisson and normal marginals. Journal of Statistical Computation and Simulation. 2015;85(15):3129–3139.
- 18. Famoye F. Generalized Poisson random variate generation. American Journal of Mathematical and Management Sciences. 1997;17(3–4):219–237.
- 19. Consul PC, Famoye F. Lagrangian probability distributions. Springer; 2006.
- 20. Sáez-Castillo AJ, Conde-Sánchez A. A hyper-Poisson regression model for overdispersed and underdispersed count data. Computational Statistics & Data Analysis. 2013;61:148–157.
- 21. Lynch HJ, Thorson JT, Shelton AO. Dealing with under- and over-dispersed count data in life history, spatial, and community ecology. Ecology. 2014;95(11):3173–3180.
- 22. Thall PF, Vail SC. Some covariance models for longitudinal count data with overdispersion. Biometrics. 1990;46(3):657–671.
- 23. Consul PC. Generalized Poisson distributions: Properties and applications. New York: Marcel Dekker, Inc.; 1989.
- 24. Demirtas H, Hedeker D. A practical way for computing approximate lower and upper correlation bounds. American Statistician. 2011;65(2):104–109.
- 25. Higham NJ. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications. 1988;103:103–118.
- 26. Higham NJ. Computing the nearest correlation matrix-a problem from finance. IMA Journal of Numerical Analysis. 2002;22(3):329–343.
- 27. Demirtas H. Simulation driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica. 2004;58(4):466–482.
- 28. Propert KJ, Schaeffer AJ, Brensinger CM, Kusek JW, Nyberg LM, Landis JR. A prospective study of interstitial cystitis: results of longitudinal followup of the interstitial cystitis data base cohort. The interstitial cystitis data base study group. The Journal of Urology. 2000;163(5):1434–1439. doi: 10.1016/s0022-5347(05)67637-9.




