Author manuscript; available in PMC: 2017 Apr 22.
Published in final edited form as: J Stat Comput Simul. 2016 Apr 22;86(18):3595–3607. doi: 10.1080/00949655.2016.1177530

Concurrent generation of multivariate mixed data with variables of dissimilar types

Anup Amatya a,*, Hakan Demirtas b
PMCID: PMC5117654  NIHMSID: NIHMS824605  PMID: 27885310

Abstract

Data sets originating from a wide range of research studies are composed of multiple variables that are correlated and of dissimilar types, primarily of count, binary/ordinal and continuous attributes. The present paper builds on previous work on multivariate data generation and develops a framework for generating multivariate mixed data with a pre-specified correlation matrix. The generated data consist of components that are marginally count, binary, ordinal and continuous, where the count and continuous variables follow the generalized Poisson and normal distributions, respectively. The use of the generalized Poisson distribution provides a flexible mechanism that allows for the under- and over-dispersed count variables generally encountered in practice. A step-by-step algorithm is provided and its performance is evaluated using simulated and real-data scenarios.

Keywords: Generalized Poisson, Multivariate ordinal, discretization

1. Introduction

In many research and application areas, data sets that include only one type of variable are highly unusual. Most data sets contain a combination of binary, ordinal, count, and continuous variables that are correlated with a complex dependence structure. That is, mixed data sets are common in a wide range of scientific fields, including the health and behavioral sciences, social sciences, and economics. For example, the Interstitial Cystitis database [1] includes longitudinal data on pain in the pelvic or bladder area, urgency (pressure to urinate), urinary frequency, nocturnal void, and nocturia from a prevalent cohort of subjects with Interstitial Cystitis. For each individual, these outcomes were associated with covariates, such as the demographic and clinical characteristics of the patients, which were observed at different times. The pain and urgency scores are continuous variables, urinary frequency is a discrete count variable, nocturnal void is a three-level ordinal variable, and nocturia is a dichotomous variable. The primary aim of the study was to determine whether symptoms tend to co-fluctuate, indicating a single underlying aetiology, or vary independently, suggesting multiple mechanisms at work. Due to the ubiquity of mixed outcomes, the development of flexible models and methods for joint analysis of such data is an active area of research [2, 3]. A Monte Carlo simulation to assess the performance of such models and methods typically requires a mechanism to generate multivariate data that closely resemble real data sets consisting of mixed variable attributes.

Multivariate data generation has been addressed extensively in the statistical literature. Krummenauer [4], Cai and Kendall [5], Minhajuddin et al. [6], Shin and Pasupathy [7], Yahav and Shmueli [8], and Barbiero and Ferrari [9] have developed various methods for generating simulated data from the multivariate Poisson distribution. The Barbiero and Ferrari [9] approach was shown to be comparatively more efficient, user-friendly, and often more accurate than the prior methods. Biswas [10], Demirtas [11], and Ferrari and Barbiero [12] have described and evaluated procedures for multivariate ordinal data generation. Recently, a few notable developments have occurred in the area of mixed data generation. Ruscio and Kaczetow [13] have proposed a robust iterative technique that can generate mixed data either by mimicking the marginal distributions observed in the sample data or by utilizing pre-specified parameters. Demirtas and Doganay [14], Demirtas et al. [15], Demirtas and Yavuz [16] and Amatya and Demirtas [17] have advanced the field by developing and implementing methods for simulating correlated mixed data comprised of binary-normal, binary-nonnormal, ordinal-normal, and count-normal components, respectively.

This work is motivated by the need to simulate simultaneously data composed of variables with different attributes (continuous, count and binary/ordinal types) that are often encountered jointly in many settings. The proposed mechanism is built upon a combination of a few random variate generation methods that involve simulation of multivariate count data that follow the generalized Poisson distribution, multivariate binary/ordinal data, multivariate continuous data that follow the normal distribution, and a mix of these distributions. The proposed method maintains the specified marginal characteristics of each variable as well as the linear association structure among them. The methods in Famoye [18], Ferrari and Barbiero [12], and Barbiero and Ferrari [9] are particularly relevant for the development of the current work. We describe the details of these methods in Sections 2–4. Furthermore, since binary is a special case of ordinal, in what follows ordinal is meant to include binary as well.

A remark on the use of the generalized Poisson distribution (GPD) in this article is in order. Generally, count data are assumed to follow a Poisson distribution, which requires mean-variance equality. However, real data rarely meet such a restriction. Most often, the variance of count data is larger than the mean (over-dispersion) and, less often, smaller than the mean (under-dispersion) [19–21]. Such a phenomenon is common in longitudinal count data, arising either from the marginal behavior of these variables or from the correlation induced by the repeated nature of the count variables. For example, Thall and Vail [22] give a data set on two-week seizure counts for 59 epileptics. At each of four successive post-randomization clinic visits, the number of seizures occurring over the previous two weeks was reported. The means (variances) of the number of seizures at the four clinic visits were 8.95 (33.92), 8.36 (26.29), 8.44 (35.69) and 7.34 (22.14). Clearly, the basic equal mean and variance assumption of the ordinary Poisson distribution is violated. Thus, an ordinary Poisson distribution is not appropriate for modeling such over- or under-dispersed count data. A better choice is the generalized Poisson distribution, which allows both over- and under-dispersion.

The article is organized as follows. In Section 2, we provide background information on GPD and the computation of its quantile function. In Section 3, we describe the process of transforming a univariate normal random variable into a discrete random variable. In Section 4, we establish the connection between discrete and continuous variables by adjusting the method of Barbiero and Ferrari [9] for GPD and deriving an expression for the correlation between discretized and normal variables. In Section 5, we outline the proposed methodology and algorithm to generate multivariate mixed data with count (GPD), ordinal, and continuous (normal) components. In Section 6, we present simulation studies for assessing the performance of the suggested method. In Section 7, we illustrate the utility of the method in mimicking the Interstitial Cystitis data. In Section 8, we conclude the paper with discussions, remarks, and future directions.

2. Generalized Poisson Distribution

The generalized Poisson distribution is a two-parameter distribution that includes the ordinary Poisson distribution as a special case. Let X be a nonnegative discrete random variable whose probability mass function $P_x(\theta,\lambda)$ is defined as

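$$
P_x(\theta,\lambda)=
\begin{cases}
\dfrac{\theta(\theta+\lambda x)^{x-1}}{x!}\,e^{-\theta-\lambda x} & \text{for } x=0,1,2,\ldots\\[4pt]
0 & \text{for } x>q \text{ when } \lambda<0
\end{cases}
\tag{1}
$$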

and zero otherwise, where θ > 0, max(−1, −θ/q) ≤ λ < 1 and q ≥ 4 is the largest positive integer for which θ + λq > 0 when λ < 0. Then, X is said to follow GPD. The GPD reduces to the Poisson distribution when λ = 0 and possesses the dual characteristics of over- and under-dispersion depending on whether λ > 0 or λ < 0, respectively. The first four moments of GPD (as derived in Ch. 9 of [23]) are as follows:

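$$
\gamma_1=\frac{\theta}{1-\lambda},\qquad
\gamma_2=\sqrt{\frac{\theta}{(1-\lambda)^3}},\qquad
\gamma_3=\frac{1+2\lambda}{\sqrt{\theta(1-\lambda)}},\qquad
\gamma_4=3+\frac{1+8\lambda+6\lambda^2}{\theta(1-\lambda)}.
\tag{2}
$$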
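As a quick numerical check, the summary quantities can be evaluated directly from θ and λ. The sketch below is illustrative only: the function name `gpd_summary` is hypothetical, and it assumes that γ2 is reported on the standard deviation scale and γ4 on the non-excess kurtosis scale, which is what reproduces the "True Value" entries of Table 1 (for example θ = 1, λ = 0.1).

```python
import math


def gpd_summary(theta, lam):
    """Return (mean, SD, skewness, kurtosis) of a generalized Poisson variable.

    Hypothetical helper; assumes theta > 0 and lam < 1.
    """
    if theta <= 0 or lam >= 1:
        raise ValueError("requires theta > 0 and lambda < 1")
    mean = theta / (1 - lam)
    sd = math.sqrt(theta / (1 - lam) ** 3)
    skewness = (1 + 2 * lam) / math.sqrt(theta * (1 - lam))
    # Non-excess kurtosis: 3 plus the excess term.
    kurtosis = 3 + (1 + 8 * lam + 6 * lam ** 2) / (theta * (1 - lam))
    return mean, sd, skewness, kurtosis


# Values for the first GPD variable of the first scenario in Table 1.
print(gpd_summary(1.0, 0.1))
```

Setting λ = 0 recovers the Poisson values (mean θ, standard deviation √θ).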

The cumulative distribution function $F_X(\theta,\lambda)$ and quantile function $F_X^{-1}(p\mid\theta,\lambda)$ of GPD are defined as:

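$$
F_X(\theta,\lambda)=P(X\le k\mid\theta,\lambda)=\sum_{x=0}^{k}P_x(\theta,\lambda)
\tag{3}
$$

$$
F_X^{-1}(p\mid\theta,\lambda)=\min\{x\in\mathbb{N}_0 : F_X(x)\ge p\},\quad p\in(0,1)\ \text{and}\ \mathbb{N}_0=\{0,1,2,\ldots\}.
\tag{4}
$$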

The expressions (3) and (4) can be evaluated efficiently by exploiting the following recurrence relation between GPD probabilities.

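$$
P_x(\theta,\lambda)=\frac{\theta-\lambda+\lambda x}{x}\left(1+\frac{\lambda}{\theta-\lambda+\lambda x}\right)^{x-1}e^{-\lambda}\,P_{x-1}(\theta,\lambda)\quad\text{for } x\ge 1,\ \text{and}\ P_0(\theta,\lambda)=e^{-\theta}.
\tag{5}
$$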

We are not aware of a readily available software implementation for computing $F_X^{-1}(q;\theta,\lambda)$ (SAS provides functions only for λ > 0). We modify the method of Famoye [18], which was developed for generating univariate generalized Poisson random variates, and adapt it for the computation of $F_X^{-1}$. The modified algorithm is as follows:

  1. Set x = 0, w = e^{−λ}, S = e^{−θ}, and P = S

  2. While q > S, do
    • x = x + 1
    • C = θ − λ + λx
    • P = wC(1 + λ/C)^{x−1}P/x
    • S = S + P
  3. Deliver x
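The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `gpd_quantile`, the `max_iter` safeguard, and the early exit at the end of the truncated support when λ < 0 are assumptions added here.

```python
import math


def gpd_quantile(q, theta, lam, max_iter=100_000):
    """Smallest x with F_X(x) >= q, using the recurrence of Equation 5.

    Hypothetical helper; assumes 0 < q < 1, theta > 0 and lam < 1.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if theta <= 0 or lam >= 1:
        raise ValueError("requires theta > 0 and lambda < 1")
    x = 0
    w = math.exp(-lam)
    prob = math.exp(-theta)   # P_0
    cum = prob                # running value of F_X(x)
    while q > cum and x < max_iter:
        # Assumption: for lam < 0 the support ends once theta + lam*x <= 0,
        # so stop there instead of looping on zero-probability values.
        if theta + lam * (x + 1) <= 0:
            break
        x += 1
        c = theta - lam + lam * x
        prob = w * c * (1 + lam / c) ** (x - 1) * prob / x
        cum += prob
    return x


# With lam = 0 the result is an ordinary Poisson quantile.
print(gpd_quantile(0.5, 1.0, 0.0))
```

For very large θ the starting value e^{−θ} underflows in double precision, so a production version would likely work on the log scale.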

3. Discretization

A discrete random variable can be obtained by discretization of any continuous random variable. A simple and precise approach is to generate a univariate discrete random variable with a known probability mass function through discretization of a continuous normal random variable. Let Z be a standard normal random variable. A discrete random variable X ∈ {x1, x2,…, xmax} having distribution function FX is constructed as follows:

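$$
X=
\begin{cases}
x_1 & \text{if } Z<r_1\\
x_2 & \text{if } r_1\le Z<r_2\\
\ \vdots & \\
x_{\max} & \text{if } r_{k-1}\le Z,
\end{cases}
\tag{6}
$$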

where $r_i=\Phi^{-1}(F_X(x_i))$, Φ being the standard normal cumulative distribution function (CDF). The numerical value of $x_{\max}$ is trivial for an ordinal variable with k categories. However, for discrete count variables with a non-finite support, the largest value is not directly available. To circumvent this problem, Barbiero and Ferrari [9] suggested reducing the support of the discrete components by approximating $x_{\max}=F_X^{-1}(1-\varepsilon)$, where ε is chosen to be as small as possible.
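The thresholding rule in Equation 6 can be illustrated with a short sketch. The function name `discretize` and its argument layout are assumptions made here: `cum_probs` holds the cumulative probabilities of every category except the last, so an ordinal variable with three categories is described by two numbers.

```python
import bisect
from statistics import NormalDist


def discretize(z_values, support, cum_probs):
    """Map standard normal draws onto `support` via r_i = Phi^{-1}(F(x_i)).

    Hypothetical helper; `cum_probs` must be increasing, lie strictly in (0, 1),
    and have one entry fewer than `support`.
    """
    if len(support) != len(cum_probs) + 1:
        raise ValueError("support needs one more entry than cum_probs")
    thresholds = [NormalDist().inv_cdf(p) for p in cum_probs]
    # bisect_right counts the thresholds that are <= z, which is the index
    # of the interval [r_{i-1}, r_i) containing z.
    return [support[bisect.bisect_right(thresholds, z)] for z in z_values]


# Three-category ordinal variable with cumulative probabilities (.35, .80).
print(discretize([-1.0, 0.0, 1.0], [1, 2, 3], [0.35, 0.80]))
```

For a count variable the same function applies once the support has been truncated at x_max and the cumulative probabilities up to x_max − 1 are supplied.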

4. Correlation among variables with different attributes

The proposed method is concerned with the simulation of data sets that are comprised of count (GPD), ordinal and normal random variables. The principle behind the proposal is to start with a multivariate normal distribution and discretize its components, thereby transforming them into count and ordinal components [9, 12]. In so doing, the starting pairwise correlations are adjusted such that the final correlation matrix (after discretization) is close to the target correlation matrix. The details of this correlation adjustment process are given in Section 5. Three sets of correlations are of interest: correlations among discrete random variables (GPD-GPD, GPD-ordinal, ordinal-ordinal); correlations among continuous random variables (normal-normal); and correlations between discrete and continuous random variables (GPD-normal, ordinal-normal). Of the three sets, correlations among normal random variables do not pose a major challenge, as the method is built upon the concept of discretization of multivariate normal random variables.

4.1. Correlation for discrete random variables

Suppose Z is a multivariate normal random vector with correlation matrix ΣZ whose discretized components form a new multidimensional discrete random variable X with correlation matrix ΣX. The elements of ΣX obviously differ from those of ΣZ, but the transformation is tractable. Ferrari and Barbiero [12] have established the link between the pairwise correlation among normal components (δC) and that of the corresponding discretized components (δD) through the following expression:

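$$
\delta_D=\frac{\sum_{l=1}^{k_1}\sum_{t=1}^{k_2}\bigl(l-E(X_1)\bigr)\bigl(t-E(X_2)\bigr)\,f_{Z_1Z_2}(l/k_1,t/k_2)}{\sqrt{\sum_{l=1}^{k_1}\bigl(l-E(X_1)\bigr)^2(1/k_1)\,\sum_{t=1}^{k_2}\bigl(t-E(X_2)\bigr)^2(1/k_2)}}
$$

where

$$
\begin{aligned}
f_{Z_1Z_2}(l/k_1,t/k_2)={}&\Phi_{Z_1,Z_2}\bigl(\Phi_Z^{-1}(l/k_1),\,\Phi_Z^{-1}(t/k_2)\bigr)-\Phi_{Z_1,Z_2}\bigl(\Phi_Z^{-1}(l/k_1),\,\Phi_Z^{-1}((t-1)/k_2)\bigr)\\
&-\Phi_{Z_1,Z_2}\bigl(\Phi_Z^{-1}((l-1)/k_1),\,\Phi_Z^{-1}(t/k_2)\bigr)+\Phi_{Z_1,Z_2}\bigl(\Phi_Z^{-1}((l-1)/k_1),\,\Phi_Z^{-1}((t-1)/k_2)\bigr),
\end{aligned}
\tag{7}
$$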

k1 and k2 are the largest finite integers in the supports of the discrete variables, and ΦZ and ΦZ1,Z2 are the univariate and bivariate standard normal CDFs, the latter with correlation δC. Ferrari and Barbiero [12] and Barbiero and Ferrari [9] used this relationship to construct correlated ordinal and correlated Poisson random variables, respectively.
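To make the mapping from δC to δD concrete, the sketch below evaluates the double sum numerically. Several choices are assumptions made for illustration and are not taken from the cited papers: the helper names `bvn_cdf` and `discretized_corr`, the use of Plackett's identity with Simpson's rule for the bivariate normal CDF, and the generalization from the equal-probability arguments l/k1 and t/k2 to arbitrary cumulative probability vectors (categories are scored 1, …, k, and each vector ends with 1.0).

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def bvn_cdf(h, k, rho, steps=200):
    """P(Z1 <= h, Z2 <= k) for a standard bivariate normal with |rho| < 1."""
    if h == -math.inf or k == -math.inf:
        return 0.0
    if h == math.inf:
        return 1.0 if k == math.inf else _STD.cdf(k)
    if k == math.inf:
        return _STD.cdf(h)

    def integrand(r):
        s = 1.0 - r * r
        return math.exp(-(h * h - 2 * r * h * k + k * k) / (2 * s)) / math.sqrt(s)

    # Plackett: Phi2(h,k;rho) = Phi(h)Phi(k) + (1/2pi) * int_0^rho integrand(r) dr,
    # with the integral approximated by Simpson's rule (steps must be even).
    width = rho / steps
    total = integrand(0.0) + integrand(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * width)
    return _STD.cdf(h) * _STD.cdf(k) + total * width / (6 * math.pi)


def _setup(cum):
    """Cut points, mean and variance for categories scored 1..k."""
    inner = list(cum[:-1])
    cuts = [-math.inf] + [_STD.inv_cdf(c) for c in inner] + [math.inf]
    probs = [b - a for a, b in zip([0.0] + inner, inner + [1.0])]
    mean = sum((i + 1) * p for i, p in enumerate(probs))
    var = sum((i + 1 - mean) ** 2 * p for i, p in enumerate(probs))
    return cuts, mean, var


def discretized_corr(cum1, cum2, rho):
    """Correlation after discretizing two normals with latent correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    r1, m1, v1 = _setup(cum1)
    r2, m2, v2 = _setup(cum2)
    cov = 0.0
    for l in range(1, len(cum1) + 1):
        for t in range(1, len(cum2) + 1):
            # Rectangle probability of cell (l, t), as in Equation 7.
            cell = (bvn_cdf(r1[l], r2[t], rho) - bvn_cdf(r1[l], r2[t - 1], rho)
                    - bvn_cdf(r1[l - 1], r2[t], rho)
                    + bvn_cdf(r1[l - 1], r2[t - 1], rho))
            cov += (l - m1) * (t - m2) * cell
    return cov / math.sqrt(v1 * v2)


# Two ordinal variables with the cumulative probabilities used in Section 6.
print(discretized_corr([0.35, 0.80, 1.0], [0.32, 0.54, 0.81, 1.0], 0.5))
```

With equal category probabilities (cumulative values l/k) the function reduces to the form displayed in Equation 7. The discretized correlation is attenuated relative to the latent one, which is why the intermediate correlations have to be inflated in Section 5.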

4.2. Correlation Between Discrete and Normal Continuous Random Variables

Let Z1 and Z2 be two components of a bivariate standard normal random vector with pairwise correlation coefficient δN. Suppose Z2 is discretized to form a discrete variable X2, following the procedure described in Section 3, according to a probability distribution function FX (as defined in Equation 3). The resulting pairwise correlation δND between Z1 and X2 can be calculated using the bivariate normal distribution result E(Z1|r1 < Z2 < r2) = δN E(Z2|r1 < Z2 < r2) and the following property of a truncated normal distribution:

$$
E(Z_2\mid r_1<Z_2<r_2)=\frac{\phi(r_1)-\phi(r_2)}{\Phi(r_2)-\Phi(r_1)},
\tag{8}
$$

such that,

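$$
\begin{aligned}
\delta_{ND}&=\frac{E\bigl[X_2E(Z_1\mid X_2)\bigr]-E(X_2)E(Z_1)}{\sqrt{E[X_2-E(X_2)]^2\,E[Z_1-E(Z_1)]^2}}\\[4pt]
&=\frac{\sum_{i=1}^{k}x_iP_X(X_2=x_i)\,E(Z_1\mid r_{i-1}<Z_2<r_i)}{\sqrt{\sum_{i=1}^{k}x_i^2P_X(X_2=x_i)-\left[\sum_{i=1}^{k}x_iP_X(X_2=x_i)\right]^2}}\\[4pt]
&=\delta_N\,\frac{\sum_{i=1}^{k}x_i\bigl[\phi(r_{i-1})-\phi(r_i)\bigr]}{\sqrt{\sum_{i=1}^{k}x_i^2\bigl[\Phi(r_i)-\Phi(r_{i-1})\bigr]-\left[\sum_{i=1}^{k}x_i\bigl\{\Phi(r_i)-\Phi(r_{i-1})\bigr\}\right]^2}},
\end{aligned}
\tag{9}
$$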

where ϕ and Φ are the standard normal probability density function (PDF) and CDF, respectively, and $r_i=\Phi^{-1}(F_X(x_i))$ with $r_0=-\infty$. For a k-category ordinal variable, $x_k$ is the last category. For a count variable following GPD, $x_k$ is the value of the largest count as estimated by $x_k=F_X^{-1}(1-\varepsilon)$.
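Because Equation 9 is linear in δN, the relationship is fully described by a single multiplicative factor. The sketch below computes that factor; the function name `discrete_normal_corr_factor` and the convention that `cum_probs` ends with 1.0 for the last category are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def discrete_normal_corr_factor(values, cum_probs):
    """Return delta_ND / delta_N for a discretized normal X2 (Equation 9).

    Hypothetical helper; `values` are x_1..x_k and `cum_probs` are
    F(x_1)..F(x_k), with the last entry equal to 1.
    """
    std = NormalDist()
    numerator = first = second = 0.0
    prev_pdf = prev_cdf = 0.0          # phi(r_0) = 0 and Phi(r_0) = 0 at r_0 = -inf
    for x, f in zip(values, cum_probs):
        pdf = std.pdf(std.inv_cdf(f)) if f < 1.0 else 0.0   # phi(+inf) = 0
        p = f - prev_cdf                                    # P(X2 = x_i)
        numerator += x * (prev_pdf - pdf)
        first += x * p
        second += x * x * p
        prev_pdf, prev_cdf = pdf, f
    return numerator / math.sqrt(second - first ** 2)


# A median split of a normal variable gives the factor sqrt(2/pi), about 0.798.
print(discrete_normal_corr_factor([0, 1], [0.5, 1.0]))
```

The intermediate correlation needed to reach a target δND is then the target divided by this factor, provided the result stays inside (−1, 1).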

In the next section, we provide a unified framework on the joint generation of count, ordinal, and normal variables. Caveats pertaining to the correlation bounds for discretized ordinal and count variables within and between variable types also apply to the proposed method. These bounds are conveniently approximated through the “generate, sort, and correlate” technique [24].

5. Algorithm for generating multivariate mixed data with generalized Poisson, ordinal, and normal components

Let C1, C2, …, Cj be a set of count variables following GPD with corresponding parameter vectors (θ1, λ1), (θ2, λ2), …, (θj, λj); let O1, O2, …, Om be a set of ordinal variables with proportion vectors p1, p2, …, pm; and let $W_l\sim N(\mu_l,\sigma_l^2)$, where l = 1, 2, …, h, be a set of normal variables. The target linear association among the three types of variables is specified in the (j + m + h) × (j + m + h) correlation matrix Σ. Without loss of generality, assume that the first j columns of the data matrix consist of count variables, followed by m columns of ordinal variables and, finally, h columns of normal continuous variables. Then, Σ is comprised of six components: ΣCC, ΣOO, ΣNN, ΣOC, ΣON and ΣCN, where C, O and N correspond to count, ordinal, and normal, respectively. In this setup, each Oi may have a different number of categories, or a subset of them may have the same number of categories with the same or different probability vectors. Suppose that discretization of the first j + m components of an n × (j + m + h) data matrix drawn from $N_{(j+m+h)}(0,\Sigma^*)$ results in an n × (j + m + h) mixed data matrix that has the target correlation matrix Σ. The algorithm then proceeds as follows.

  1. Compute $x_i^{\max}=F_{C_i}^{-1}(1-\varepsilon)$ for i = 1,…, j, where $F_{C_i}^{-1}$ is the quantile function defined in Equation 4.

  2. Compute Σ*. To this end, set up the following loop to solve Equations 7 and 9 until the desired accuracy is achieved.
    (a) Begin by setting κ = 0 and Σ*(κ) = Σ. Let $\delta^*_{N,st}(\kappa)$ be the (s,t)th element of Σ*(κ) for s, t = 1,…, (j + m + h).
    (b) For each element $\delta^*_{N,st}(\kappa)$ in $\Sigma^*_{CC}(\kappa)$, $\Sigma^*_{OO}(\kappa)$ and $\Sigma^*_{OC}(\kappa)$, calculate $\delta_{D,st}(\kappa)$ using Equation 7, where s, t = 1,…, (j + m) and s > t.
    (c) For each element $\delta^*_{N,st}(\kappa)$ in $\Sigma^*_{CN}(\kappa)$ and $\Sigma^*_{ON}(\kappa)$, calculate $\delta_{ND,st}(\kappa)$ using Equation 9, where s = (j + m + 1),…, (j + m + h) and t = 1,…, (j + m).
    (d) Form a matrix ΣD(κ) by appropriately arranging $\delta_{D,st}(\kappa)$ and $\delta_{ND,st}(\kappa)$.
    (e) Check whether |δst(κ) − δst| < η for s = 1,…, (j + m + h), t = 1,…, (j + m) and s > t, where δst(κ) and δst are the (s,t)th elements of ΣD(κ) and Σ, respectively, and η is a small positive number (say 0.00001).
      • If yes, set Σ* = Σ*(κ) and go to step 3.
      • If not, calculate the new $\delta^*_{N,st}$ as follows
        $$\delta^*_{N,st}(\kappa+1)=\delta^*_{N,st}(\kappa)\,\frac{\delta_{st}}{\delta_{st}(\kappa)},$$
      • and update Σ*(κ) to form Σ*(κ + 1) such that the $\delta^*_{N,st}(\kappa+1)$ constitute its elements. Set κ = κ + 1 and repeat steps 2(b) to 2(e).
  3. Draw n samples from $N_{(j+m+h)}(0,\Sigma^*)$. These samples form an n × (j + m + h) intermediate data matrix Y*.

  4. Discretize the first j columns of Y* according to $F_{C_i}(\theta_i,\lambda_i)$ and $x_i^{\max}$.

  5. Discretize the (j + 1)th to (j + m)th columns of Y* according to proportion vectors p1, p2,…, pm.

  6. Shift and scale the last h columns of Y* according to the target mean and standard deviation vectors.
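The following sketch walks through the steps above for the smallest mixed case, one GPD count variable and one normal variable, so that no matrix library is needed. It is an illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the intermediate correlation is obtained in closed form because Equation 9 is linear in δN (so no iteration is required for this pair), and the parameter values in the usage lines are arbitrary.

```python
import bisect
import math
import random
from statistics import NormalDist

STD = NormalDist()


def gpd_cum_probs(theta, lam, eps=1e-6):
    """F(0), F(1), ..., F(x_max) with x_max = F^{-1}(1 - eps); last entry set to 1."""
    prob = math.exp(-theta)
    cum = [prob]
    x = 0
    while cum[-1] < 1.0 - eps:
        if theta + lam * (x + 1) <= 0:   # end of truncated support when lam < 0
            break
        x += 1
        c = theta - lam + lam * x
        prob = math.exp(-lam) * c * (1 + lam / c) ** (x - 1) * prob / x
        cum.append(cum[-1] + prob)
    cum[-1] = 1.0                        # the last category absorbs the tail
    return cum


def corr_factor(values, cum):
    """delta_ND / delta_N from Equation 9."""
    num = first = second = 0.0
    prev_pdf = prev_cdf = 0.0
    for x, f in zip(values, cum):
        pdf = STD.pdf(STD.inv_cdf(f)) if f < 1.0 else 0.0
        p = f - prev_cdf
        num += x * (prev_pdf - pdf)
        first += x * p
        second += x * x * p
        prev_pdf, prev_cdf = pdf, f
    return num / math.sqrt(second - first ** 2)


def simulate_pair(n, theta, lam, mu, sigma, target, seed=0):
    """Draw n (count, normal) pairs whose correlation is close to `target`."""
    rng = random.Random(seed)
    cum = gpd_cum_probs(theta, lam)                    # step 1
    values = list(range(len(cum)))
    delta_star = target / corr_factor(values, cum)     # step 2 (closed form)
    if abs(delta_star) >= 1.0:
        raise ValueError("target correlation is outside the attainable range")
    thresholds = [STD.inv_cdf(f) for f in cum[:-1]]
    counts, normals = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)                       # step 3
        z2 = delta_star * z1 + math.sqrt(1 - delta_star ** 2) * rng.gauss(0.0, 1.0)
        counts.append(values[bisect.bisect_right(thresholds, z2)])   # step 4
        normals.append(mu + sigma * z1)                # step 6
    return counts, normals


counts, normals = simulate_pair(5000, theta=2.0, lam=0.2, mu=5.0, sigma=2.0, target=0.4)
print(sum(counts) / len(counts))   # should be near theta / (1 - lam) = 2.5
```

With more than one discrete variable the discrete-discrete entries depend nonlinearly on the latent correlation, which is where the iterative update of step 2 becomes necessary.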

Some numerical issues may arise when implementing the algorithm. First, the target pairwise correlations between two discrete variables and between discrete and continuous variables must respect the correlation bounds. These bounds can be calculated using the “generate, sort, and correlate” technique [24]. Such violations need to be checked and addressed prior to the computation of Σ*. Second, the computed intermediate correlation matrix Σ* may not be positive definite. In such a situation, one can compute the nearest positive definite matrix by the methods proposed by Higham [25, 26] using the nearPD function in the Matrix package in R. Finally, for some parametric combinations of GPD, the FX calculation may require a comparatively long computation time.
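The bound check mentioned first can be approximated along the following lines. This is a minimal sketch of the “generate, sort, and correlate” idea as we read it (independent draws from each marginal, sorted in the same direction for the upper bound and in opposite directions for the lower bound); the function names and the sampler interface are assumptions made here.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def gsc_bounds(draw1, draw2, n=100_000, seed=0):
    """Approximate (lower, upper) correlation bounds for two marginals.

    `draw1` and `draw2` are hypothetical callables taking a random.Random
    instance and returning one draw from the respective marginal.
    """
    rng = random.Random(seed)
    a = sorted(draw1(rng) for _ in range(n))
    b = sorted(draw2(rng) for _ in range(n))
    return pearson(a, b[::-1]), pearson(a, b)


# Bounds for a standard normal paired with a balanced binary variable.
lower, upper = gsc_bounds(lambda r: r.gauss(0.0, 1.0),
                          lambda r: 1 if r.random() < 0.5 else 0,
                          n=20_000)
print(lower, upper)
```

A target correlation outside the returned interval cannot be attained for that pair of marginals and should be revised before computing Σ*.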

6. Method evaluation via simulation

The proposed method involves simulation of correlated multivariate count, ordinal, and normal variables. It requires specification of the parameters governing the marginal distribution of each type of variable and of the linear association among all the variables in the form of a correlation matrix. For the purpose of this evaluation, we consider data sets composed of two GPD variables, two ordinal variables and two normal variables, i.e., a total of six variables. The following parameter vectors are considered for the GPD variables: θ ∈ {(1, 2), (10, 20)} and λ ∈ {(.1, .2),(−.1, −.2),(.4, .6),(−.2, .4),(−.2, −.4),(.5, −.7)}. These parameter values represent moderately to highly over- and under-dispersed count variables. To avoid a proliferation of simulation scenarios, we have limited this simulation to a single set of parameter values for the ordinal and normal components. The cumulative marginal probabilities of the two ordinal components (one with three categories and another with four categories) were chosen to be p1 = (.35, .80) and p2 = (.32, .54, .81). This specification covers a reasonably broad range of probability values. The means and standard deviations of the normal components are set at (0, 1). Finally, the correlation matrices are specified randomly. The pairwise correlations are checked for bound violations, symmetry, and positive definiteness, ensuring the validity of the randomly generated correlation matrix. The performance of the proposed method is evaluated for each combination of parameters using the criteria described in the next section.

6.1. Accuracy and precision

We report the relative bias (RB) of the estimates, $\mathrm{RB}=\frac{E(\hat\eta)-\eta}{\eta}\times 100\%$, over 1000 Monte Carlo replications, where η is the true value of a parameter and η̂ is the corresponding estimate. The root mean square error is estimated as $\mathrm{RMSE}(\hat\eta)=\sqrt{E(\hat\eta-\eta)^2}$, the standardized bias (SB) as $\mathrm{SB}=\frac{|E(\hat\eta)-\eta|}{SD(\hat\eta)}\times 100\%$, and the coverage rate (CR) is defined as the percentage of times that η is contained within a 95% confidence interval. RB and SB are measures of accuracy, whereas RMSE and CR are integrated measures of some combination of accuracy and precision. Typically, we desire RB < 5% and SB < 50% [27].
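For reference, the four criteria can be computed from a set of replicate estimates as sketched below. The function name `evaluate` and the representation of confidence intervals as (lower, upper) pairs are assumptions made for illustration; the numbers in the usage lines are made up and are not results from this study.

```python
import math
import statistics


def evaluate(estimates, intervals, true_value):
    """Return RB (%), SB (%), RMSE and CR (%) for replicate estimates.

    Hypothetical helper; `intervals` holds one (lower, upper) confidence
    interval per replicate.
    """
    mean_est = statistics.fmean(estimates)
    rb = (mean_est - true_value) / true_value * 100.0
    sb = abs(mean_est - true_value) / statistics.stdev(estimates) * 100.0
    rmse = math.sqrt(statistics.fmean([(e - true_value) ** 2 for e in estimates]))
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    cr = 100.0 * covered / len(intervals)
    return {"RB": rb, "SB": sb, "RMSE": rmse, "CR": cr}


# Toy input: four replicate estimates of a parameter whose true value is 1.
toy_estimates = [0.9, 1.0, 1.1, 1.2]
toy_intervals = [(0.8, 1.0), (0.9, 1.1), (1.05, 1.15), (1.0, 1.4)]
print(evaluate(toy_estimates, toy_intervals, 1.0))
```

Note that RB is undefined when the true value is zero, so parameters with a zero target need a different accuracy measure.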

6.2. Results

We generate 1000 sets of simulated data (n = 200) for each combination of parameters. Due to space limitations, only a subset of the results is shown in Table 1. The full table is available at the second author’s website, http://demirtas.people.uic.edu/table1.pdf. The mean pairwise correlations are strikingly close to the target values for all combinations of parameter specifications. The values on all the evaluation criteria are well within the tolerable limits. The average cumulative proportions of the two ordinal variables are also very close to the specified target values (RB<1%, SB<8%, CR≈ 95%). Furthermore, RB<1.4% and SB<11% for the count variables indicate accurate replication of the specified values of the mean (γ1) and standard deviation (γ2) of GPD, and CR≈ 95% suggests high precision. The most difficult parameters to recover accurately are the higher moments of the GPD variables, skewness (γ3) and kurtosis (γ4), especially for low θ and high λ combinations. For the skewness parameters, RB<9% and SB<35% indicate acceptable results with somewhat low accuracy, whereas CR≈ 95% suggests acceptable precision. Similarly, for the kurtosis parameters, RB<19%, SB<42%, and CR is 94%–97%. Given that only the fourth moment of the GPD variables produces values on the evaluation metrics that exceed the thresholds, and only under extreme conditions (small θ and large λ), in our opinion the proposed method does a remarkable job of generating mixed data with count, ordinal, and normal components, as measured by the concordance between the specified and empirically computed quantities on average.

Table 1.

Key descriptive statistics and evaluation quantities. The mean represents the average of estimates across 1000 simulation replicates.

θ λ Parameter True Value Mean RB SB RMSE CR Min. 1st Qu. 3rd Qu. Max.
(1, 2) (.1, .2) ρ 0.1865 0.1864 0.0707 0.1814 0.0727 96.10 −0.0383 0.1348 0.2366 0.4494
0.1241 0.1242 0.1104 0.1962 0.0698 95.10 −0.1069 0.0765 0.1695 0.3350
0.1965 0.1919 2.3207 6.6481 0.0687 95.40 −0.0762 0.1460 0.2381 0.4173
0.3627 0.3611 0.4283 2.5224 0.0616 95.60 0.1457 0.3182 0.4043 0.5231
0.2671 0.2678 0.2726 1.0978 0.0663 95.00 0.0375 0.2234 0.3129 0.4515
0.2783 0.2762 0.7674 3.3110 0.0645 95.50 0.0646 0.2354 0.3207 0.4802
0.1932 0.1892 2.0507 5.7763 0.0687 95.40 −0.0598 0.1447 0.2362 0.3784
0.3764 0.3733 0.8107 5.1045 0.0598 95.70 0.1704 0.3327 0.4148 0.5409
0.2704 0.2718 0.5161 2.1930 0.0636 96.00 0.0791 0.2290 0.3168 0.4293
0.2735 0.2687 1.7604 7.3842 0.0654 94.30 0.0629 0.2277 0.3131 0.5225
0.1174 0.1206 2.7414 4.4714 0.0720 95.60 −0.1149 0.0767 0.1684 0.3594
0.1438 0.1437 0.0478 0.0985 0.0697 95.00 −0.1084 0.0971 0.1915 0.3769
0.3861 0.3851 0.2592 1.6800 0.0596 95.20 0.1573 0.3466 0.4267 0.5876
0.3356 0.3332 0.7189 3.8945 0.0620 95.30 0.1255 0.2921 0.3736 0.5146
0.3825 0.3805 0.5100 3.1678 0.0616 95.30 0.1558 0.3404 0.4223 0.5569

γ1 1.1111 1.1081 0.2751 3.8193 0.0800 95.20 0.8600 1.0600 1.1600 1.3800
2.5000 2.5003 0.0100 0.1893 0.1320 94.60 2.1150 2.4100 2.5850 2.9900

(γ2)
1.1712 1.1675 0.3216 4.7146 0.0799 95.60 0.8931 1.1120 1.2240 1.4580
1.9764 1.9728 0.1827 2.7420 0.1317 94.60 1.6180 1.8780 2.0590 2.4630

γ3 1.2649 1.2150 3.9429 16.9577 0.2982 96.70 0.6300 1.0150 1.3670 3.8590
1.1068 1.0685 3.4581 13.1912 0.2925 96.30 0.4125 0.8793 1.2200 3.6660

γ4 5.0667 4.7724 5.8079 15.7250 1.8934 97.10 2.4800 3.7110 5.3270 31.7100
4.7750 4.5638 4.4222 12.3918 1.7162 96.60 2.3540 3.5790 5.0690 29.9600

p 0.3500 0.3517 0.4857 5.0079 0.0339 0.95 0.2450 0.3300 0.3750 0.4600
0.8000 0.8009 0.1175 3.3734 0.0279 0.94 0.7250 0.7800 0.8200 0.8900

0.3200 0.3206 0.1891 1.8271 0.0331 0.95 0.2100 0.3000 0.3412 0.4300
0.5400 0.5405 0.0870 1.3241 0.0355 0.95 0.4450 0.5150 0.5650 0.6450
0.8100 0.8115 0.1895 5.5632 0.0276 0.95 0.7150 0.7950 0.8300 0.9000
(10,20) (−.2,−.4) ρ 0.3317 0.3350 1.0009 5.1817 0.0641 95.60 0.0957 0.2912 0.3784 0.6035
0.2449 0.2464 0.6458 2.2962 0.0688 95.90 0.0429 0.1973 0.2950 0.4437
0.2780 0.2801 0.7529 3.2329 0.0647 95.40 0.0861 0.2358 0.3262 0.5135
0.1837 0.1846 0.4911 1.3465 0.0670 95.60 −0.0017 0.1385 0.2282 0.3998
0.2717 0.2748 1.1611 4.8461 0.0651 95.30 0.0560 0.2318 0.3186 0.4873
0.1865 0.1879 0.7666 2.1196 0.0674 94.50 −0.0125 0.1404 0.2332 0.3891
0.2107 0.2122 0.7011 2.2212 0.0665 95.60 −0.0142 0.1654 0.2583 0.4039
0.3617 0.3646 0.7905 4.6359 0.0617 95.30 0.0705 0.3226 0.4057 0.5569
0.2135 0.2137 0.1007 0.3102 0.0693 95.90 −0.1018 0.1700 0.2608 0.4171
0.1548 0.1540 0.5020 1.1363 0.0684 95.10 −0.0769 0.1108 0.1996 0.3751
0.3188 0.3201 0.3926 1.9556 0.0640 95.80 0.1381 0.2768 0.3670 0.5221
0.3587 0.3579 0.2130 1.1931 0.0640 95.20 0.1297 0.3142 0.4016 0.5430
0.2669 0.2645 0.8860 3.5797 0.0661 95.00 0.0444 0.2218 0.3089 0.4261
0.2335 0.2331 0.1932 0.6729 0.0670 95.40 0.0332 0.1877 0.2808 0.4634
0.2407 0.2426 0.7809 2.9008 0.0648 95.30 0.0498 0.1990 0.2878 0.4488
0.1796 0.1844 2.6872 7.2200 0.0670 94.30 −0.0184 0.1388 0.2281 0.4188
0.3926 0.3935 0.2266 1.4939 0.0595 95.00 0.1644 0.3538 0.4330 0.5537

γ1 8.3333 8.3287 0.0557 2.7920 0.1663 94.90 7.8450 8.2200 8.4450 8.8700
14.2857 14.2825 0.0226 1.6308 0.1980 95.00 13.6900 14.1500 14.4200 14.9000

(γ2)
2.4056 2.4053 0.0136 0.2764 0.1185 95.50 2.0670 2.3200 2.4880 2.8090
2.6998 2.6938 0.2214 4.4824 0.1334 95.60 2.3110 2.5990 2.7920 3.0490

γ3 0.1732 0.1670 3.5937 3.8583 0.1614 95.00 −0.3318 0.0609 0.2699 0.8250
0.0378 0.0344 9.0104 2.0669 0.1647 95.10 −0.4314 −0.0853 0.1426 0.6621

γ4 2.9700 2.9260 1.4826 13.6121 0.3263 96.50 2.1920 2.7030 3.0970 5.1520
2.9557 2.9333 0.7585 7.1579 0.3138 95.60 2.0900 2.7100 3.1080 4.1460

p 0.3500 0.3525 0.7200 7.5922 0.0332 95.60 0.2450 0.3300 0.3750 0.4500
0.8000 0.8009 0.1081 3.0480 0.0284 95.90 0.7150 0.7800 0.8200 0.8850
0.3200 0.3203 0.1000 0.9819 0.0326 94.60 0.2150 0.3000 0.3400 0.4350
0.5400 0.5404 0.0722 1.0688 0.0365 95.00 0.4250 0.5150 0.5650 0.6550
0.8100 0.8101 0.0099 0.2760 0.0290 94.70 0.7050 0.7900 0.8300 0.8900

For brevity, we do not report the results for small-sample cases; as expected, the mean, RB, and SB did not change substantially. The CR was somewhat higher, but not to a degree of inefficiency. In addition, as one would expect, the RMSE was also slightly larger.

7. Example: Interstitial Cystitis Data Base study

We consider the Interstitial Cystitis Data Base (ICDB) study to illustrate the performance of the proposed method in mimicking real data. We introduced these data in the Introduction section. Briefly, the ICDB was established by the National Institute of Diabetes and Digestive and Kidney Diseases in 1993 to understand the natural history of interstitial cystitis (IC) and to assess the demographic and clinical characteristics of IC patients [28]. A number of articles utilizing this database have been published since its establishment. Among other variables, the data set contains information on the number of non-nocturnal voids per day (VF_AWAKE), ordinal levels of urinary urgency over the last four weeks (PURG2C3), and the average maximum interval between voids per day (AVGMINT). For this illustration, we selected only the patients who had complete data on all three variables across the 3-, 6- and 9-month visits. A total of 309 patients met this requirement. The final data excluded variables that are not under consideration and patients who did not meet the aforementioned requirement. The VF_AWAKE variable measured on three occasions constitutes three correlated count variables; the PURG2C3 variable measured on three occasions constitutes three correlated ordinal (three-level) variables; and AVGMINT, also measured on three occasions, constitutes three correlated normal variables. Together, they form a data set with nine correlated variables of dissimilar types.

We demonstrate the ability of the proposed method to duplicate these “model” data in two ways. First, we show the proximity of the four moments of VF_AWAKE, the cumulative proportions of PURG2C3, the mean and standard deviation of AVGMINT, and the correlation matrix estimated from the “model” data and from the simulated data (mimicking the “model” data). From the “model” data, the θ and λ parameters for VF_AWAKE are obtained by fitting a univariate generalized Poisson distribution to the 3-, 6- and 9-month assessment data. The appropriateness of marginal normal distributions for the AVGMINT variables is tested using the Anderson-Darling normality test (p-value>0.05). The correlation matrix and the values of the relevant parameters estimated from the “model” data are presented in Tables 2 and 3, respectively. Using these estimated quantities as true parameters, the process of generating a data matrix of size 309 × 9 is replicated 1000 times.

Table 2.

Correlation matrix of the interstitial cystitis data

VF_AWAKE PURG2C3 AVGMINT

months 3 6 9 3 6 9 3 6 9
VF_AWAKE 3 1 0.757 0.758 0.275 0.195 0.285 −0.475 −0.479 −0.466
6 0.757 1 0.806 0.246 0.274 0.312 −0.5 −0.53 −0.506
9 0.758 0.806 1 0.274 0.256 0.397 −0.466 −0.481 −0.519

PURG2C3 3 0.275 0.246 0.274 1 0.575 0.545 −0.242 −0.241 −0.166
6 0.195 0.274 0.256 0.575 1 0.599 −0.255 −0.273 −0.201
9 0.285 0.312 0.397 0.545 0.599 1 −0.235 −0.248 −0.229

AVGMINT 3 −0.475 −0.5 −0.466 −0.242 −0.255 −0.235 1 0.716 0.729
6 −0.479 −0.53 −0.481 −0.241 −0.273 −0.248 0.716 1 0.75
9 −0.466 −0.506 −0.519 −0.166 −0.201 −0.229 0.729 0.75 1

Table 3.

Parameter estimates from interstitial cystitis data

Assessment occasions

3 months 6 months 9 months
VF_AWAKE θ 7.31 7.16 7.03
λ 0.34 0.34 0.38
γ1 11.03 10.85 11.28
(γ2)
16.64 16.44 18.09
γ3 0.76 0.77 0.84
γ4 3.90 3.94 4.11

PURG2C3 p1 0.37 0.40 0.41
p2 0.78 0.81 0.81

AVGMINT mean 4.80 4.86 4.89
SD 1.64 1.64 1.72

Note: SD denotes standard deviation; p1 and p2 denote the cumulative probabilities of the first and second categories of the three-level ordinal variable; and γ1, γ2, γ3 and γ4 denote the mean, SD, skewness, and kurtosis of GPD.

The results of the comparison of these quantities between the “model” and simulated data are presented in Figures 1–4. Figure 1 shows three evaluation metrics. The maximum observed values (representing the worst-case scenario) on all three metrics fall well under the thresholds for all the parameters, including the kurtoses of the count variables. Figure 2 shows the distributions of the cumulative proportions of each category of the ordinal variable PURG2C3 assessed over three time points. The x-axis displays the “true” values from the “model” data and the y-axis displays the estimated values from the simulated data sets. The average estimated cumulative proportions are in close proximity to the “true” values, with relatively narrow interquartile ranges. Figure 3 shows the 36 estimated mean pairwise correlations along with their “true” values on the x-axis. It shows that the average estimated pairwise correlations over the simulated data are in close vicinity of the “true” values. The largest difference between the average and “true” pairwise correlations is less than 0.0022. Finally, Figure 4 presents the distributions of the four sample moments of the VF_AWAKE variables. The upper two graphs show the distributions of the sample means and standard deviations. The averages of these two statistics are close to the “true” values, with very few outliers. As expected, the distributions of the third and fourth moments contain more outliers. However, the averages are fairly close to their “true” values. Overall, the proposed method succeeds remarkably in duplicating the marginal distributions and dependence structure of the “model” data.

Figure 1.

Accuracy of the estimated parameters. RB: Relative bias, SB: Standardized Bias.

Figure 4.

Four moments of GPD count variables.

Figure 2.

Cumulative probabilities of three ordinal variables.

Figure 3.

Correlations among variables with different attributes, of count, categorical and continuous types.

Next, we fit a mixed-effect ordinal regression model with cumulative probit link and equally spaced thresholds. We show that the estimated values of the parameters of this regression model using the “model” data and simulated data are close to each other. The model under consideration is as follows:

$$
\Phi^{-1}\{P(y_{ij}<c)\}=\alpha_c+\beta_0\,\mathrm{AVGMINT}+\beta_1\,\mathrm{VF\_AWAKE}+\beta_2\,\mathrm{MONTH}+\upsilon_{0i},
\tag{10}
$$

where yij is an unobservable latent variable related to the ordinal variable PURG2C3, αc is a series of threshold values, and υ0i is a random effect assumed to have an N(0, συ0) distribution. Table 4 presents the estimated regression parameters for the “model” data in the “True Value” column and for the simulated data in the “Mean” column, along with the evaluation metrics in the remaining columns. Clearly, the averages of the parameter estimates from the 1000 simulated data sets are well within the tolerable limits. These results suggest that the proposed method is useful in capturing and reconstructing real data trends, and may be regarded as a useful tool for evaluating statistical models and methods.

Table 4. Parameter estimates of the cumulative probit ordinal mixed-effect regression.

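| Parameter | True Value | Mean | RB | SB | RMSE | CR | Min. | 1st Qu. | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|---|---|---|
| $\alpha_1$ | −0.3928 | −0.3758 | 0.0170 | 4.3292 | 0.3495 | 0.95 | −1.6470 | −0.6098 | −0.1410 | 0.6547 |
| $\alpha_2 - \alpha_1$ | 2.0762 | 2.0891 | 0.0130 | 0.6257 | 0.1204 | 0.94 | 1.7740 | 2.0080 | 2.1580 | 2.5180 |
| AVGMINT | −0.2072 | −0.2099 | 0.0027 | 1.2908 | 0.0476 | 0.95 | −0.3864 | −0.2426 | −0.1782 | −0.0500 |
| VF_AWAKE | 0.1186 | 0.1227 | 0.0041 | 3.4609 | 0.0165 | 0.95 | 0.0785 | 0.1116 | 0.1333 | 0.1799 |
| MONTH | −0.0305 | −0.0332 | 0.0027 | 8.9748 | 0.0186 | 0.94 | −0.0894 | −0.0457 | −0.0211 | 0.0202 |
| $\sigma_{\upsilon 0}$ | 1.6228 | 1.5615 | 0.0613 | 3.7775 | 0.2670 | 0.95 | 0.8955 | 1.3700 | 1.7130 | 2.6300 |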
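Summary columns of the kind shown in Table 4 can be computed from a set of replicate estimates along the following lines. This is a minimal sketch under stated assumptions, not the authors' code: the function `summarize` and its field names are hypothetical, the coverage rate is read here as the coverage of nominal 95% Wald intervals, and the paper's own definitions of RB and SB are given in an earlier section and may differ from the generic absolute-bias and percent-bias forms used below. The replicate estimates are synthetic.

```python
import math
import random


def summarize(estimates, std_errors, true_value, z=1.96):
    """Summarize replicate estimates of one parameter against its true value."""
    n = len(estimates)
    mean = sum(estimates) / n
    abs_bias = abs(mean - true_value)                 # |mean - true|
    pct_bias = 100.0 * abs_bias / abs(true_value)     # bias as % of the true value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    # Share of replicates whose Wald interval contains the true value
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z * s <= true_value <= e + z * s
    )
    return {
        "mean": mean,
        "abs_bias": abs_bias,
        "pct_bias": pct_bias,
        "rmse": rmse,
        "coverage": covered / n,
    }


# Synthetic replicates (illustration only): 1000 estimates scattered around 0.12
rng = random.Random(1)
true_value = 0.12
estimates = [rng.gauss(true_value, 0.02) for _ in range(1000)]
std_errors = [0.02] * 1000
print(summarize(estimates, std_errors, true_value))
```

Applying `summarize` to each parameter in turn would yield one row per parameter, analogous in layout to Table 4.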

8. Discussion

The approach to multivariate data generation with mixed variable types in this paper is motivated by the emerging popularity of data analysis techniques for mixed outcomes. The method draws on ideas from previous work by Ferrari and Barbiero [12], Barbiero and Ferrari [9], Demirtas and Doganay [14], and Amatya and Demirtas [17]. The novelties of this paper are the use of the generalized Poisson distribution to address over- and under-dispersion in count data, the derivation of the correlation between normal and count/ordinal variables obtained via discretization, and the unified framework through which data sets composed of dissimilar variable types are generated. The proposed framework can also generate multivariate data composed of only one of the variable types considered in this work. This is particularly useful for generating multivariate GPD variables, for which no method is currently available. For parameter values that are generally encountered in practice, the proposed method performs well and provides a robust mechanism for generating multivariate data with a combination of count, ordinal and normal variables. Future work will address modeling the continuous components of the mixed data via non-normal continuous distributions.
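The role of discretization mentioned above can be illustrated with a small simulation. The sketch below is not the paper's algorithm; it only shows why a correction is needed. It draws a correlated bivariate normal pair, turns one component into a three-level ordinal variable by cutting at normal quantiles, and compares the correlation before and after discretization. The latent correlation, the cumulative proportions and all function names are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def make_cuts(cum_probs):
    """Thresholds at the standard normal quantiles of the cumulative proportions."""
    return [NormalDist().inv_cdf(p) for p in cum_probs]


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2016)
rho = 0.6                  # hypothetical correlation of the latent normal pair
cum_probs = [0.3, 0.7]     # hypothetical cumulative proportions, 3 categories
cuts = make_cuts(cum_probs)

z1 = [rng.gauss(0.0, 1.0) for _ in range(20000)]
z2 = [rho * a + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0) for a in z1]

# Ordinal category 0, 1 or 2 depending on which interval z1 falls in
ordinal = [bisect_right(cuts, a) for a in z1]

r_latent = pearson(z1, z2)      # correlation of the underlying normal pair
r_mixed = pearson(ordinal, z2)  # correlation after discretizing z1
print(r_latent, r_mixed)
```

The correlation after discretization is smaller in magnitude than the latent one, which is why the latent correlation has to be chosen larger than the target correlation for the ordinal-normal pair.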

References

1. Simon LJ, Landis JR, Erickson DR, Nyberg LM, Group IS, et al. The interstitial cystitis data base study: concepts and preliminary baseline descriptive statistics. Urology. 1997;49(5):64–75. doi: 10.1016/s0090-4295(99)80334-3.
2. Song P, Li M, Yuan Y. Joint regression analysis of correlated data using Gaussian copulas. Biometrics. 2009;65(1):60–68. doi: 10.1111/j.1541-0420.2008.01058.x.
3. De Leon AR, Chough KC. Analysis of mixed data: Methods and applications. CRC Press; 2013.
4. Krummenauer F. Efficient simulation of multivariate binomial and Poisson distributions. Biometrical Journal. 1998;40(7):823–832.
5. Cai Y, Kendall W. Perfect simulation for correlated Poisson random variables conditioned to be positive. Statistics and Computing. 2002;12(3):229–243.
6. Minhajuddin TM, Harris IR, Schucany WR. Simulating multivariate distributions with specific correlations. Journal of Statistical Computation and Simulation. 2004;74(8):599–607.
7. Shin K, Pasupathy R. A method for fast generation of bivariate Poisson random vectors. In: Proceedings of the 39th Conference on Winter Simulation: 40 Years! The Best is Yet to Come; WSC ’07. Piscataway, NJ, USA: IEEE Press; 2007. pp. 472–479.
8. Yahav I, Shmueli G. On generating multivariate Poisson data in management science applications. Applied Stochastic Models in Business and Industry. 2012;28(1):91–102.
9. Barbiero A, Ferrari PA. Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry. 2014;31(5):669–680.
10. Biswas A. Generating correlated ordinal categorical random samples. Statistics and Probability Letters. 2004;70(1):25–235.
11. Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation. 2006;76(11):1017–1025.
12. Ferrari PA, Barbiero A. Simulating ordinal data. Multivariate Behavioral Research. 2012;47(4):566–589. doi: 10.1080/00273171.2012.692630.
13. Ruscio J, Kaczetow W. Simulating multivariate nonnormal data using an iterative algorithm. Multivariate Behavioral Research. 2008;43(3):355–381. doi: 10.1080/00273170802285693.
14. Demirtas H, Doganay B. Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics. 2012;22(3):323–236. doi: 10.1080/10543406.2010.521874.
15. Demirtas H, Hedeker D, Mermelstein JM. Simulation of massive public health data by power polynomials. Statistics in Medicine. 2012;31(27):3337–3346. doi: 10.1002/sim.5362.
16. Demirtas H, Yavuz Y. Concurrent generation of ordinal and normal data. Journal of Biopharmaceutical Statistics. 2015;25(4):635–650. doi: 10.1080/10543406.2014.920868.
17. Amatya A, Demirtas H. Simultaneous generation of multivariate data with Poisson and normal marginals. Journal of Statistical Computation and Simulation. 2015;85(15):3129–3139.
18. Famoye F. Generalized Poisson random variate generation. American Journal of Mathematical and Management Sciences. 1997;17(3–4):219–237.
19. Consul PC, Famoye F. Lagrangian probability distributions. Springer; 2006.
20. Sáez-Castillo AJ, Conde-Sánchez A. A hyper-Poisson regression model for overdispersed and underdispersed count data. Computational Statistics & Data Analysis. 2013;61:148–157.
21. Lynch HJ, Thorson JT, Shelton AO. Dealing with under- and over-dispersed count data in life history, spatial, and community ecology. Ecology. 2014;95(11):3173–3180.
22. Thall PF, Vail SC. Some covariance models for longitudinal count data with overdispersion. Biometrics. 1990;46(3):657–671.
23. Consul PC. Generalized Poisson distributions: Properties and applications. New York: Marcel Dekker, Inc.; 1989.
24. Demirtas H, Hedeker D. A practical way for computing approximate lower and upper correlation bounds. American Statistician. 2011;65(2):104–109.
25. Higham NJ. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications. 1988;103:103–118.
26. Higham NJ. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis. 2002;22(3):329–343.
27. Demirtas H. Simulation driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica. 2004;58(4):466–482.
28. Propert KJ, Schaeffer AJ, Brensinger CM, Kusek JW, Nyberg LM, Landis JR. A prospective study of interstitial cystitis: results of longitudinal followup of the interstitial cystitis data base cohort. The interstitial cystitis data base study group. The Journal of Urology. 2000;163(5):1434–1439. doi: 10.1016/s0022-5347(05)67637-9.
