Concurrent generation of multivariate mixed data with variables of dissimilar types

Anup Amatya; Hakan Demirtas

doi:10.1080/00949655.2016.1177530

Author manuscript; available in PMC: 2017 Apr 22.

Published in final edited form as: J Stat Comput Simul. 2016 Apr 22;86(18):3595–3607. doi: 10.1080/00949655.2016.1177530


PMCID: PMC5117654 NIHMSID: NIHMS824605 PMID: 27885310

Abstract

Data sets originating from a wide range of research studies are composed of multiple variables that are correlated and of dissimilar types, primarily of count, binary/ordinal and continuous attributes. The present paper builds on previous work on multivariate data generation and develops a framework for generating multivariate mixed data with a pre-specified correlation matrix. The generated data consist of components that are marginally count, binary, ordinal and continuous, where the count and continuous variables follow the generalized Poisson and normal distributions, respectively. The use of the generalized Poisson distribution provides a flexible mechanism that allows for the under- and over-dispersed count variables generally encountered in practice. A step-by-step algorithm is provided and its performance is evaluated using simulated and real-data scenarios.

Keywords: Generalized Poisson, Multivariate ordinal, discretization

1. Introduction

In many research and application areas, data sets that include only one type of variable are highly unusual. Most data sets contain a combination of binary, ordinal, count, and continuous variables that are correlated with a complex dependence structure. That is, mixed data sets are common in a wide range of scientific fields, including the health and behavioral sciences, social sciences, and economics. For example, the Interstitial Cystitis database [1] includes longitudinal data on pain in the pelvic or bladder area, urgency (pressure to urinate), urinary frequency, nocturnal void, and nocturia from a prevalent cohort of subjects with Interstitial Cystitis. For each individual, these outcomes were associated with covariates, such as the demographic and clinical characteristics of the patients, which were observed at different times. The pain and urgency scores are continuous variables, urinary frequency is a discrete count variable, nocturnal void is a three-level ordinal variable, and nocturia is a dichotomous variable. The primary aim of the study was to determine whether symptoms tend to fluctuate together, indicating a single underlying aetiology, or vary independently, suggesting multiple mechanisms at work. Due to the ubiquity of mixed outcomes, the development of flexible models and methods for joint analysis of such data is an active area of research [2, 3]. A Monte Carlo simulation to assess the performance of such models and methods typically requires a mechanism to generate multivariate data that closely resemble real data sets consisting of mixed variable attributes.

Multivariate data generation has been addressed extensively in the statistical literature. Krummenauer [4], Cai and Kendall [5], Minhajuddin et al. [6], Shin and Pasupathy [7], Yahav and Shmueli [8], and Barbiero and Ferrari [9] have developed various methods for generating simulated data from the multivariate Poisson distribution. The Barbiero and Ferrari [9] approach was shown to be comparatively more efficient, user-friendly, and often more accurate than the prior methods. Biswas [10], Demirtas [11], and Ferrari and Barbiero [12] have described and evaluated procedures for multivariate ordinal data generation. Recently, a few notable developments have occurred in the area of mixed data generation. Ruscio and Kaczetow [13] have proposed a robust iterative technique that can generate mixed data either mimicking the marginal distribution observed in the sample data or utilizing pre-specified parameters. Demirtas and Doganay [14], Demirtas et al. [15], Demirtas and Yavuz [16] and Amatya and Demirtas [17] have advanced the field by developing and implementing methods for simulating correlated mixed data comprised of binary-normal, binary-nonnormal, ordinal-normal, and count-normal components, respectively.

This work is motivated by the need to simultaneously simulate data composed of variables with different attributes (continuous, count and binary/ordinal types) that are often encountered jointly in many settings. The proposed mechanism is built upon a combination of a few random variate generation methods that involve simulation of multivariate count data that follow the generalized Poisson distribution, multivariate binary/ordinal data, multivariate continuous data that follow the normal distribution, and a mix of these distributions. The proposed method maintains the specified marginal characteristics of each variable as well as the linear association structure among them. The methods in Famoye [18], Ferrari and Barbiero [12], and Barbiero and Ferrari [9] are particularly relevant for the development of the current work. We describe the details of these methods in Sections 2–4. Furthermore, since binary is a special case of ordinal, in what follows ordinal is meant to include binary as well.

A remark on the use of the generalized Poisson distribution (GPD) in this article is in order. Generally, count data are assumed to follow a Poisson distribution, which requires equality of the mean and variance. However, real data rarely meet such a restriction. Most often, the variance of count data is larger than the mean (over-dispersion) and, less often, smaller than the mean (under-dispersion) [19–21]. Such phenomena are common in longitudinal count data, arising either from the marginal behavior of these variables or from the correlation induced by the repeated nature of the count variables. For example, Thall and Vail [22] give a data set on two-week seizure counts for 59 epileptics. At each of four successive post-randomization clinic visits, the number of seizures occurring over the previous two weeks was reported. The mean (variance) of the number of seizures at each of the four clinic visits was 8.95 (33.92), 8.36 (26.29), 8.44 (35.69) and 7.34 (22.14). Clearly, the basic equal mean and variance assumption of the ordinary Poisson distribution is violated. Thus, an ordinary Poisson distribution is not appropriate for modeling such over- or under-dispersed count data. A better choice is the generalized Poisson distribution, which allows both over- and under-dispersion.

The article is organized as follows. In Section 2, we provide background information on the GPD and the computation of its quantile function. In Section 3, we describe the process of transforming a univariate normal random variable into a discrete random variable. In Section 4, we establish the connection between discrete and continuous variables by adjusting the Barbiero and Ferrari [9] method for the GPD and deriving an expression for the correlation between discretized and normal variables. In Section 5, we outline the proposed methodology and algorithm to generate multivariate mixed data with count (GPD), ordinal, and continuous (normal) components. In Section 6, we present simulation studies for assessing the performance of the suggested method. In Section 7, we illustrate the utility of the method in mimicking the Interstitial Cystitis data. In Section 8, we conclude the paper with discussions, remarks, and future directions.

2. Generalized Poisson Distribution

The generalized Poisson distribution is a two-parameter distribution that includes the ordinary Poisson distribution as a special case. Let X be a nonnegative discrete random variable whose probability mass function P_x (θ, λ) is defined as

P_{x} (θ, λ) = \begin{cases} \frac{θ {(θ + λ x)}^{x - 1}}{x!} e^{- θ - λ x} & for x = 0, 1, 2, \dots \\ 0 & for x > q when λ < 0 \end{cases}

(1)

and zero otherwise, where θ > 0, max(−1, −θ/q) ≤ λ < 1 and q ≥ 4 is the largest positive integer for which θ + λq > 0 when λ < 0. Then, X is said to follow the GPD. The GPD reduces to the Poisson distribution when λ = 0 and exhibits over- or under-dispersion depending on whether λ > 0 or λ < 0, respectively. The mean (γ₁), variance (γ₂), skewness (γ₃), and kurtosis (γ₄) of the GPD (as derived in Ch. 9 of [23]) are as follows:

γ_{1} = \frac{θ}{1 - λ}, γ_{2} = \frac{θ}{{(1 - λ)}^{3}}, γ_{3} = \frac{1 + 2 λ}{\sqrt{θ (1 - λ)}}, γ_{4} = 3 + \frac{1 + 8 λ + 6 λ^{2}}{θ (1 - λ)} .

(2)
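As a quick numerical check of these moment expressions, the following minimal Python sketch evaluates them for given θ and λ. The function names are hypothetical and not part of the paper. The kurtosis is computed as 3 plus the excess term, which is the scale that reproduces the true values listed in Table 1 (for example, 5.0667 for θ = 1, λ = 0.1). The second function inverts the mean and variance expressions and is offered only as an illustration of how the two parameters relate to the first two moments.

```python
import math


def gpd_moments(theta, lam):
    """Mean, standard deviation, skewness and kurtosis of GPD(theta, lam)."""
    mean = theta / (1.0 - lam)
    variance = theta / (1.0 - lam) ** 3
    skewness = (1.0 + 2.0 * lam) / math.sqrt(theta * (1.0 - lam))
    # Kurtosis on the scale where a normal distribution equals 3.
    kurtosis = 3.0 + (1.0 + 8.0 * lam + 6.0 * lam ** 2) / (theta * (1.0 - lam))
    return mean, math.sqrt(variance), skewness, kurtosis


def gpd_params_from_mean_var(mean, variance):
    """Invert the first two moments, using variance / mean = 1 / (1 - lam)^2."""
    lam = 1.0 - math.sqrt(mean / variance)
    theta = mean * (1.0 - lam)
    return theta, lam
```

For instance, `gpd_moments(1.0, 0.1)` can be compared against the first γ₁, √γ₂, γ₃ and γ₄ rows of Table 1.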

The cumulative distribution function F_x (θ, λ) and quantile function $F_{x}^{- 1} (p | θ, λ)$ of GPD are defined as:

F_{x} (θ, λ) = P (X \leq k | θ, λ) = \sum_{x = 0}^{k} P_{x} (θ, λ)

(3)

F_{x}^{- 1} (p | θ, λ) = \min \{x \in ℕ_{0} : F_{x} (x) \geq p\}, p \in (0, 1) and ℕ_{0} = \{0, 1, 2, \dots\} .

(4)

The expressions (3) and (4) can be evaluated efficiently by exploiting the following recurrence relation between GPD probabilities.

P_{x} (θ, λ) = \frac{θ - λ + λ x}{x} {(1 + \frac{λ}{θ - λ + λ x})}^{x - 1} e^{- λ} P_{x - 1} (θ, λ) for x \geq 1 and P_{0} (θ, λ) = e^{- θ} .

(5)

We are not aware of a readily available software implementation for computing $F_{X}^{- 1} (q; θ, λ)$, where q here denotes the target probability (SAS provides functions only for λ > 0). We therefore modify the method of Famoye [18], developed for generating univariate generalized Poisson random variates, and adapt it for the computation of $F_{X}^{- 1}$ . The modified algorithm is as follows:

Set x = 0, w = e^−λ, S = e^−θ, and P = S
while q > S, do
- x=x+1
- C = θ − λ + λx
- P = wC(1 + λ/C)^(x−1)P/x
- S = S + P
Deliver x
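The steps above translate almost line for line into code. The following Python sketch is one possible rendering, not the authors' implementation. The function name `gpd_quantile` and the early exit for λ < 0 (which returns the largest support point once θ + λx is no longer positive) are assumptions added for illustration.

```python
import math


def gpd_quantile(q, theta, lam):
    """Smallest x with F(x) >= q for GPD(theta, lam), via the recurrence (5)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    x = 0
    w = math.exp(-lam)
    p = math.exp(-theta)  # P_0
    s = p                 # running value of the CDF
    while q > s:
        x += 1
        # Assumption: for lam < 0 the support ends where theta + lam * x <= 0,
        # so the last admissible point is returned if q is not reached earlier.
        if theta + lam * x <= 0.0:
            return x - 1
        c = theta - lam + lam * x
        p = w * c * (1.0 + lam / c) ** (x - 1) * p / x
        s += p
    return x
```

With λ = 0 the function reduces to an ordinary Poisson quantile, which gives a simple way to sanity-check it against known values.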

3. Discretization

A discrete random variable can be obtained by discretizing a continuous random variable. A simple and precise approach is to generate a univariate discrete random variable with a known probability mass function through discretization of a standard normal random variable. Let Z be a standard normal random variable. A discrete random variable X ∈ {x₁, x₂,…, x_max} having distribution function F_X is constructed as follows:

X = \begin{cases} x_{1} & if Z < r_{1} \\ x_{2} & if r_{1} \leq Z < r_{2} \\ ⋮ \\ x_{max} & if r_{k - 1} \leq Z, \end{cases}

(6)

where r_i = Φ⁻¹ (F_x(x_i)), Φ being the standard normal cumulative distribution function (CDF). The value of x_max is immediate for an ordinal variable with k categories. However, for discrete count variables with infinite support, the largest value is not directly available. To circumvent this problem, Barbiero and Ferrari [9] suggested truncating the support of the discrete components by approximating $x_{max} = F_{x}^{- 1} (1 - ε)$, where ε is chosen to be as small as possible.
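The discretization rule can be illustrated with a short Python sketch that uses only the standard library. The helper names are hypothetical, and the cumulative probabilities (0.35, 0.80) are borrowed from the three-category ordinal variable used later in the simulation study, purely as an example.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def thresholds(cum_probs):
    """r_i = Phi^{-1}(F_X(x_i)) for cumulative probabilities strictly below 1."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(p) for p in cum_probs]


def discretize(z, cuts, support):
    """Map a standard normal draw z to the support point whose interval holds it."""
    # bisect_right counts the thresholds r_i with r_i <= z, matching Equation 6.
    return support[bisect_right(cuts, z)]


# Illustrative three-level ordinal variable with F_X = (0.35, 0.80, 1).
cuts = thresholds([0.35, 0.80])
rng = random.Random(1)
draws = [discretize(rng.gauss(0.0, 1.0), cuts, [1, 2, 3]) for _ in range(20000)]
share_first = draws.count(1) / len(draws)
```

The empirical share of the first category should land near 0.35, up to Monte Carlo error.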

4. Correlation among variables with different attributes

The proposed method is concerned with the simulation of data sets that are comprised of count (GPD), ordinal and normal random variables. The principle behind the proposal is to start with a multivariate normal distribution and discretize some of its components, thereby transforming them into count and ordinal components [9, 12]. In so doing, the starting pairwise correlations are adjusted such that the final correlation matrix (after discretization) is close to the target correlation matrix. The details of this correlation adjustment process are given in Section 5. Three sets of correlations are of interest: correlations among discrete random variables (GPD-GPD, GPD-ordinal, ordinal-ordinal); correlations among continuous random variables (normal-normal); and correlations between discrete and continuous random variables (GPD-normal, ordinal-normal). Of the three sets, correlations among normal random variables do not pose a major challenge, as the method is built upon the concept of discretization of multivariate normal random variables.

4.1. Correlation for discrete random variables

Suppose Z is a multivariate normal random vector with correlation matrix Σ_Z whose discretized components form a new multidimensional discrete random variable X with correlation matrix Σ_X. The elements of Σ_X differ from those of Σ_Z, but the transformation is tractable. Ferrari and Barbiero [12] have established the link between the pairwise correlation of the normal components (δ_C) and that of the corresponding discretized components (δ_D) through the following expression:

δ_{D} = \frac{\sum_{l = 1}^{k_{1}} \sum_{t = 1}^{k_{2}} (l - E (X_{1})) (t - E (X_{2})) f_{Z_{1} Z_{2}} (l / k_{1}, t / k_{2})}{\sqrt{\sum_{l = 1}^{k_{1}} {(l - E (X_{1}))}^{2} (1 / k_{1}) \sum_{t = 1}^{k_{2}} {(t - E (X_{2}))}^{2} (1 / k_{2})}}

where

f_{Z_{1} Z_{2}} (l / k_{1}, t / k_{2}) = Φ_{Z_{1}, Z_{2}} (Φ_{Z}^{- 1} (l / k_{1}), Φ_{Z}^{- 1} (t / k_{2})) - Φ_{Z_{1}, Z_{2}} (Φ_{Z}^{- 1} (l / k_{1}), Φ_{Z}^{- 1} ((t - 1) / k_{2})) - Φ_{Z_{1}, Z_{2}} (Φ_{Z}^{- 1} ((l - 1) / k_{1}), Φ_{Z}^{- 1} (t / k_{2})) + Φ_{Z_{1}, Z_{2}} (Φ_{Z}^{- 1} ((l - 1) / k_{1}), Φ_{Z}^{- 1} ((t - 1) / k_{2})),

(7)

k₁ and k₂ are the largest finite integers in the supports of the two discrete variables, and Φ_Z and Φ_Z₁,Z₂ are the univariate standard normal CDF and the bivariate standard normal CDF with correlation δ_C, respectively. Ferrari and Barbiero [12] and Barbiero and Ferrari [9] used this relationship to construct correlated ordinal and correlated Poisson random variables, respectively.
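To make the mapping from δ_C to δ_D concrete, the sketch below computes the correlation of two ordinal variables obtained by cutting a standard bivariate normal pair. It is a hedged illustration rather than the authors' code. It is written for arbitrary cumulative probabilities instead of the l/k₁ and t/k₂ arguments shown in Equation 7, it scores the categories 1, …, k, and it evaluates the bivariate normal CDF by numerically integrating the bivariate density over the correlation (Plackett's identity) with Simpson's rule. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def bvn_cdf(h, k, rho, steps=400):
    """P(Z1 <= h, Z2 <= k) for a standard bivariate normal with correlation rho."""
    if h == -math.inf or k == -math.inf:
        return 0.0
    if h == math.inf:
        return 1.0 if k == math.inf else STD.cdf(k)
    if k == math.inf:
        return STD.cdf(h)

    def density(r):
        # Bivariate normal density at (h, k) with correlation r.
        q = 1.0 - r * r
        expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * q)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(q))

    # Simpson's rule for the integral of the density over r in [0, rho].
    width = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    return STD.cdf(h) * STD.cdf(k) + total * width / 3.0


def discretized_corr(rho, cum1, cum2):
    """Correlation of two ordinal variables (scores 1..k) cut from the normal pair."""
    def cuts(cum):
        return [-math.inf] + [STD.inv_cdf(p) for p in cum] + [math.inf]

    def marginal(cum):
        probs = [b - a for a, b in zip([0.0] + list(cum), list(cum) + [1.0])]
        mean = sum((i + 1) * p for i, p in enumerate(probs))
        var = sum((i + 1 - mean) ** 2 * p for i, p in enumerate(probs))
        return mean, var

    r, s = cuts(cum1), cuts(cum2)
    m1, v1 = marginal(cum1)
    m2, v2 = marginal(cum2)
    cov = 0.0
    for l in range(1, len(r)):
        for t in range(1, len(s)):
            # Rectangle probability, as in the definition of f above.
            cell = (bvn_cdf(r[l], s[t], rho) - bvn_cdf(r[l], s[t - 1], rho)
                    - bvn_cdf(r[l - 1], s[t], rho) + bvn_cdf(r[l - 1], s[t - 1], rho))
            cov += (l - m1) * (t - m2) * cell
    return cov / math.sqrt(v1 * v2)
```

For two binary variables split at the median, the result can be checked against the known closed form (2/π) arcsin(δ_C).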

4.2. Correlation Between Discrete and Normal Continuous Random Variables

Let Z₁ and Z₂ be two components of a bivariate standard normal random vector with a pairwise correlation coefficient δ_N. Suppose Z₂ is discretized to form a discrete variable X₂, following the procedure described in Section 3, according to a probability distribution function F_X (as defined in Equation 3). The resulting pairwise correlation δ_{N D} between Z₁ and X₂ can be calculated using the bivariate normal distribution result E(Z₁|r₁ < Z₂ < r₂) = δ_N E(Z₂|r₁ < Z₂ < r₂) and the following property of a truncated normal distribution:

E (Z_{2} | r_{1} < Z_{2} < r_{2}) = \frac{ϕ (r_{1}) - ϕ (r_{2})}{Φ (r_{2}) - Φ (r_{1})},

(8)

such that,

δ_{N D} = \frac{E [X_{2} E (Z_{1} | X_{2})] - E (X_{2}) E (Z_{1})}{\sqrt{E {[X_{2} - E (X_{2})]}^{2} E {[Z_{1} - E (Z_{1})]}^{2}}} = \frac{\sum_{i = 1}^{k} x_{i} P_{X} (X_{2} = x_{i}) E (Z_{1} | r_{i - 1} < Z_{2} < r_{i})}{\sqrt{\sum_{i = 1}^{k} x_{i}^{2} P_{X} (X_{2} = x_{i}) - {[\sum_{i = 1}^{k} x_{i} P_{X} (X_{2} = x_{i})]}^{2}}} = δ_{N} \frac{\sum_{i = 1}^{k} x_{i} [ϕ (r_{i - 1}) - ϕ (r_{i})]}{\sqrt{\sum_{i = 1}^{k} x_{i}^{2} [Φ (r_{i}) - Φ (r_{i - 1})] - {[\sum_{i = 1}^{k} x_{i} \{Φ (r_{i}) - Φ (r_{i - 1})\}]}^{2}}},

(9)

where ϕ and Φ are the standard normal probability density function (PDF) and CDF, respectively, and r_i = Φ⁻¹ (F_x(x_i)) with r₀ = −∞. For a k-category ordinal variable, x_k is the last category. For a count variable following the GPD, x_k is the value of the largest count as estimated by $x_{k} = F_{x}^{- 1} (1 - ε)$ .
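Because δ_{N D} is proportional to δ_N, the adjustment for a discrete-normal pair reduces to computing one attenuation factor. The following Python sketch evaluates the closed form of Equation 9 for a standard normal Z₁ (unit variance). The function name and argument layout are assumptions made for illustration, and the last threshold is taken as +∞ so that the final support point absorbs any remaining tail mass.

```python
import math
from statistics import NormalDist


def normal_discrete_corr(delta_n, support, cum_probs):
    """Correlation between Z1 and the discretized X2, following Equation 9.

    support   -- values x_1 < ... < x_k
    cum_probs -- F_X(x_1), ..., F_X(x_{k-1}), all strictly below 1
    """
    std = NormalDist()
    cuts = [-math.inf] + [std.inv_cdf(p) for p in cum_probs] + [math.inf]
    cdf = [0.0] + list(cum_probs) + [1.0]

    def pdf(r):
        # The normal density vanishes at the infinite end points r_0 and r_k.
        return 0.0 if math.isinf(r) else std.pdf(r)

    numer = sum(x * (pdf(cuts[i]) - pdf(cuts[i + 1])) for i, x in enumerate(support))
    mean = sum(x * (cdf[i + 1] - cdf[i]) for i, x in enumerate(support))
    second = sum(x * x * (cdf[i + 1] - cdf[i]) for i, x in enumerate(support))
    return delta_n * numer / math.sqrt(second - mean ** 2)
```

Dividing a target discrete-normal correlation by `normal_discrete_corr(1.0, support, cum_probs)` gives the intermediate normal correlation directly, without iteration.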

In the next section, we provide a unified framework on the joint generation of count, ordinal, and normal variables. Caveats pertaining to the correlation bounds for discretized ordinal and count variables within and between variable types also apply to the proposed method. These bounds are conveniently approximated through the “generate, sort, and correlate” technique [24].
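The sorting idea behind these bounds is simple to sketch. In the hedged Python illustration below, two samples are drawn independently from the two marginal distributions; correlating them after sorting both in the same direction approximates the upper bound, and sorting them in opposite directions approximates the lower bound. The function names, sample size, seed, and the example pair (a binary variable with success probability 0.5 and a standard normal variable) are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gsc_bounds(draw_x, draw_y, n=50_000, seed=2016):
    """Approximate (lower, upper) correlation bounds by generate, sort, correlate."""
    rng = random.Random(seed)
    x = sorted(draw_x(rng) for _ in range(n))
    y = sorted(draw_y(rng) for _ in range(n))
    upper = pearson(x, y)        # both samples sorted in the same direction
    lower = pearson(x, y[::-1])  # samples sorted in opposite directions
    return lower, upper


# Illustrative pair: binary with P(X = 1) = 0.5 and a standard normal variable.
lower, upper = gsc_bounds(lambda rng: 1 if rng.random() < 0.5 else 0,
                          lambda rng: rng.gauss(0.0, 1.0))
```

For this particular pair the bounds should come out near ±0.80, the theoretical limit √(2/π); a target correlation outside such an interval cannot be attained.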

5. Algorithm for generating multivariate mixed data with generalized Poisson, ordinal, and normal components

Let C₁, C₂, …, C_j be a set of count variables following the GPD with corresponding parameter vectors (θ₁, λ₁), (θ₂, λ₂), …, (θ_j, λ_j); let O₁, O₂, …, O_m be a set of ordinal variables with proportion vectors p₁, p₂, …, p_m; and let $W_{l} ~ N (μ_{l}, σ_{l}^{2})$ , l = 1, 2, …, h, be a set of normal variables. The target linear association among the three types of variables is specified by the (j + m + h) × (j + m + h) correlation matrix Σ. Without loss of generality, assume that the first j columns of the data matrix consist of count variables, followed by m columns of ordinal variables, and the last h columns of normal continuous variables. Then, Σ is comprised of six components: Σ_CC, Σ_OO, Σ_NN, Σ_OC, Σ_ON and Σ_CN, where C, O and N correspond to count, ordinal, and normal, respectively. In this setup, each O_i may have a different number of categories, or a subset of them may have the same number of categories with the same or different probability vectors. Suppose that discretization of the first j + m components of an n × (j + m + h) data matrix drawn from N_{(j + m + h)} (0, Σ*) results in an n × (j + m + h) mixed data matrix that has the target correlation matrix Σ. The algorithm then proceeds as follows.

(1) Compute $x_{i_{max}} = F_{C_{i}}^{- 1} (1 - ε)$ for i = 1,…, j, where $F_{C_{i}}^{- 1}$ is the quantile function of C_i as defined in Equation 4.
(2) Compute Σ*. To this end, set up the following loop to solve Equations 7 and 9 until the desired accuracy is achieved.
  (a) Begin by setting κ = 0 and Σ* (κ) = Σ. Let $δ_{N}^{s t *}$ be the (s, t)th element of Σ* for s, t = 1,…, (j + m + h).
  (b) For each element $δ_{N}^{s t *} (κ)$ in $Σ_{C C}^{*} (κ)$, $Σ_{O O}^{*} (κ)$ and $Σ_{O C}^{*} (κ)$, calculate $δ_{D}^{s t} (κ)$ using Equation 7, where s, t = 1,…, (j + m) and s > t.
  (c) For each element $δ_{N}^{s t *} (κ)$ in $Σ_{C N}^{*} (κ)$ and $Σ_{O N}^{*} (κ)$, calculate $δ_{N D}^{s t} (κ)$ using Equation 9, where s = (j + m + 1),…, (j + m + h), t = 1,…, (j + m) and s > t.
  (d) Form a matrix Σ_D (κ) by appropriately arranging $δ_{D}^{s t} (κ)$ and $δ_{N D}^{s t} (κ)$.
  (e) Check whether |δ^st(κ) − δ^st| < η for s = 1,…, (j + m + h), t = 1,…, (j + m) and s > t, where δ^st (κ) and δ^st are the (s, t)th elements of Σ_D(κ) and Σ, respectively, and η is a small positive number (say 0.00001).
    - If yes, set Σ* = Σ* (κ) and go to (3).
    - If not, calculate a new $δ_{N}^{s t *}$ as follows:
      $δ_{N}^{s t *} (κ + 1) = δ_{N}^{s t *} (κ) \frac{δ^{s t}}{δ^{s t} (κ)},$
      and update Σ* (κ) to form Σ* (κ + 1) such that the $δ_{N}^{s t *} (κ + 1)$ constitute its elements. Set κ = κ + 1 and repeat (2)(b) to (2)(e).
(3) Draw n samples from N_{(j + m + h)} (0, Σ*). These samples form the n × (j + m + h) intermediate data matrix Y*.
(4) Discretize the first j columns of Y* according to F_{C_i} (θ_i, λ_i) and x_{i_max}.
(5) Discretize the (j + 1)^th to (j + m)^th columns of Y* according to the proportion vectors p₁, p₂,…, p_m.
(6) Center and scale the last h columns of Y* according to the target mean and standard deviation vectors.
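The iterative correction of a single intermediate correlation can be sketched in a few lines. The Python example below is an illustration under stated assumptions, not the authors' implementation: `achieved_corr` stands for whatever function maps an intermediate normal correlation to the post-discretization correlation (Equation 7 or 9), the clipping to (−0.999, 0.999) is an added safeguard, and the demonstration uses the closed form (2/π) arcsin(δ) that holds for two binary variables split at the median as a stand-in for Equation 7.

```python
import math


def solve_intermediate_corr(target, achieved_corr, eta=1e-5, max_iter=100):
    """Rescale the intermediate correlation until the achieved value hits target."""
    delta = target  # start from the target value (kappa = 0)
    for _ in range(max_iter):
        achieved = achieved_corr(delta)
        if abs(achieved - target) < eta:
            return delta
        # Multiplicative update: delta(kappa + 1) = delta(kappa) * target / achieved.
        delta = delta * target / achieved
        # Assumption: keep the value strictly inside (-1, 1).
        delta = max(-0.999, min(0.999, delta))
    raise RuntimeError("intermediate correlation did not converge")


def median_split_corr(delta):
    """Stand-in map: correlation of two median-split binaries from normal pair."""
    return 2.0 / math.pi * math.asin(delta)


intermediate = solve_intermediate_corr(1.0 / 3.0, median_split_corr)
```

In this example the target of 1/3 is reached with an intermediate normal correlation close to 0.5. In the full algorithm the same update is applied to every discrete-discrete and discrete-normal element of Σ*.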

Some numerical issues may arise when implementing the algorithm. First, the target pairwise correlations between two discrete variables and between discrete and continuous variables must respect the correlation bounds. These bounds can be calculated using the “generate, sort, and correlate” technique [24]. Violations of these bounds need to be checked and addressed prior to computing Σ*. Second, the computed intermediate correlation matrix Σ* may not be positive definite. In such a situation, one can compute the nearest positive definite matrix by the methods proposed by Higham [25, 26] using the nearPD function in the Matrix package in R. Finally, for some parametric combinations of the GPD, the F_X calculation may require comparatively longer computation time.
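Whether an intermediate matrix needs repair can be checked before sampling. The sketch below, a plain-Python illustration with a hypothetical function name, attempts a Cholesky factorization and reports failure as soon as a non-positive pivot appears. It only detects the problem; it does not compute the nearest positive definite matrix, for which the Higham method mentioned above would still be needed.

```python
import math


def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    return False  # not positive definite
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True
```

A matrix with pairwise entries that are individually valid correlations can still fail this check, which is why the test is applied to Σ* as a whole.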

6. Method evaluation via simulation

The proposed method involves simulation of correlated multivariate count, ordinal, and normal variables. It requires specification of the parameters governing the marginal distribution of each type of variable, and of the linear association among all the variables in the form of a correlation matrix. For the purpose of this evaluation, we consider data sets composed of two GPD variables, two ordinal variables and two normal variables, i.e., a total of six variables. The following parameter vectors are considered for the GPD variables: θ ∈ {(1, 2), (10, 20)} and λ ∈ {(.1, .2),(−.1, −.2),(.4, .6),(−.2, .4),(−.2, −.4),(.5, −.7)}. These parameter values represent moderately to highly over- and under-dispersed count variables. To avoid a proliferation of simulation scenarios, we have limited this simulation to a single set of parameter values for the ordinal and normal components. The cumulative marginal probabilities of the two ordinal components, one with three categories and the other with four categories, were chosen to be p₁ = (.35, .80) and p₂ = (.32, .54, .81). This specification covers a reasonably broad range of probability values. The mean and standard deviation of each normal component are set at 0 and 1, respectively. Finally, the correlation matrices are specified randomly. The pairwise correlations are checked for bound violations, and the matrix is checked for symmetry and positive definiteness, ensuring the validity of the randomly generated correlation matrix. Performance of the proposed method is evaluated for each combination of parameters using the criteria described in the next section.

6.1. Accuracy and precision

We report the relative bias (RB) of the estimates, $(\frac{E (\hat{η}) - η}{η} \times 100 %)$ , over 1000 Monte Carlo replications, where η is the true value of a parameter and η̂ is the corresponding estimated value. The root mean square error RMSE (η̂) is estimated using $\sqrt{E {(\hat{η} - η)}^{2}}$ , the standardized bias (SB) is estimated using $(\frac{| E (\hat{η}) - η |}{S D (\hat{η})} \times 100 %)$ , and the coverage rate (CR) is defined as the percentage of replications in which η is contained within a 95% confidence interval. RB and SB are measures of accuracy, whereas RMSE and CR are integrated measures that combine accuracy and precision. Typically, we desire RB < 5% and SB < 50% [27].
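These criteria are straightforward to compute from a vector of replicate estimates. The Python sketch below is a minimal illustration with a hypothetical function name and made-up inputs. It reports the relative bias in absolute value, which is an assumption of this sketch, and it takes the per-replicate 95% confidence intervals as given rather than constructing them.

```python
import math
from statistics import mean, stdev


def evaluation_metrics(estimates, truth, intervals=None):
    """RB (%), SB (%), RMSE and, if intervals are supplied, CR (%)."""
    avg = mean(estimates)
    result = {
        "mean": avg,
        # Absolute relative bias, in percent (assumption: sign is dropped).
        "RB": abs(avg - truth) / abs(truth) * 100.0,
        # Standardized bias: absolute bias relative to the SD of the estimates.
        "SB": abs(avg - truth) / stdev(estimates) * 100.0,
        "RMSE": math.sqrt(mean([(e - truth) ** 2 for e in estimates])),
    }
    if intervals is not None:
        hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
        result["CR"] = hits / len(intervals) * 100.0
    return result


# Made-up replicate estimates and intervals, for illustration only.
metrics = evaluation_metrics(
    [1.0, 1.2, 1.1, 1.1], 1.0,
    intervals=[(0.9, 1.1), (1.05, 1.3), (0.95, 1.2), (0.8, 1.0)],
)
```

In a simulation study the list of estimates would hold one value per Monte Carlo replicate, for example the 1000 sample correlations of a given pair.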

6.2. Results

We generate 1000 sets of simulated data (n = 200) for each combination of parameters. Because of space limitations, only a subset of the results is shown in Table 1. The full table is available at the second author’s website, http://demirtas.people.uic.edu/table1.pdf. The mean pairwise correlations are very close to the target values for all combinations of parameter specifications, and the values of all evaluation criteria are well within the tolerable limits. The average cumulative proportions of the two ordinal variables are also very close to the specified target values (RB<1%, SB<8%, CR≈95%). Furthermore, RB<1.4% and SB<11% for the count variables indicate accurate replication of the specified values of the mean (γ₁) and standard deviation $(\sqrt{γ_{2}})$ of the GPD, and CR≈95% suggests high precision. The most difficult parameters to recover accurately are the higher moments, skewness (γ₃) and kurtosis (γ₄), of the GPD variables, especially for combinations of low θ and high λ. For the skewness parameters, RB<9% and SB<35% indicate acceptable results with somewhat lower accuracy, whereas CR≈95% suggests acceptable precision. Similarly, for the kurtosis parameters, RB<19%, SB<42%, and CR ranges from 94% to 97%. Given that only the fourth moment of the GPD variables, under extreme conditions (small θ and large λ), produces values on the evaluation metrics that exceed the thresholds, in our opinion the proposed method does a remarkable job at generating mixed data with count, ordinal, and normal components, as measured by the concordance between the specified and empirically computed quantities on average.

Table 1.

Key descriptive statistics and evaluation quantities. The mean represents the average of estimates across 1000 simulation replicates.

					Evaluation Criteria
θ	λ	Quantity	True Value	Mean	RB	SB	RMSE	CR	Min.	1st Qu.	3rd Qu.	Max.
(1, 2)	(.1, .2)	Correlation	0.1865	0.1864	0.0707	0.1814	0.0727	96.10	−0.0383	0.1348	0.2366	0.4494
			0.1241	0.1242	0.1104	0.1962	0.0698	95.10	−0.1069	0.0765	0.1695	0.3350
			0.1965	0.1919	2.3207	6.6481	0.0687	95.40	−0.0762	0.1460	0.2381	0.4173
			0.3627	0.3611	0.4283	2.5224	0.0616	95.60	0.1457	0.3182	0.4043	0.5231
			0.2671	0.2678	0.2726	1.0978	0.0663	95.00	0.0375	0.2234	0.3129	0.4515
			0.2783	0.2762	0.7674	3.3110	0.0645	95.50	0.0646	0.2354	0.3207	0.4802
			0.1932	0.1892	2.0507	5.7763	0.0687	95.40	−0.0598	0.1447	0.2362	0.3784
			0.3764	0.3733	0.8107	5.1045	0.0598	95.70	0.1704	0.3327	0.4148	0.5409
			0.2704	0.2718	0.5161	2.1930	0.0636	96.00	0.0791	0.2290	0.3168	0.4293
			0.2735	0.2687	1.7604	7.3842	0.0654	94.30	0.0629	0.2277	0.3131	0.5225
			0.1174	0.1206	2.7414	4.4714	0.0720	95.60	−0.1149	0.0767	0.1684	0.3594
			0.1438	0.1437	0.0478	0.0985	0.0697	95.00	−0.1084	0.0971	0.1915	0.3769
			0.3861	0.3851	0.2592	1.6800	0.0596	95.20	0.1573	0.3466	0.4267	0.5876
			0.3356	0.3332	0.7189	3.8945	0.0620	95.30	0.1255	0.2921	0.3736	0.5146
			0.3825	0.3805	0.5100	3.1678	0.0616	95.30	0.1558	0.3404	0.4223	0.5569
		γ₁	1.1111	1.1081	0.2751	3.8193	0.0800	95.20	0.8600	1.0600	1.1600	1.3800
			2.5000	2.5003	0.0100	0.1893	0.1320	94.60	2.1150	2.4100	2.5850	2.9900
		√γ₂	1.1712	1.1675	0.3216	4.7146	0.0799	95.60	0.8931	1.1120	1.2240	1.4580
			1.9764	1.9728	0.1827	2.7420	0.1317	94.60	1.6180	1.8780	2.0590	2.4630
		γ₃	1.2649	1.2150	3.9429	16.9577	0.2982	96.70	0.6300	1.0150	1.3670	3.8590
			1.1068	1.0685	3.4581	13.1912	0.2925	96.30	0.4125	0.8793	1.2200	3.6660
		γ₄	5.0667	4.7724	5.8079	15.7250	1.8934	97.10	2.4800	3.7110	5.3270	31.7100
			4.7750	4.5638	4.4222	12.3918	1.7162	96.60	2.3540	3.5790	5.0690	29.9600
		Cumulative proportion	0.3500	0.3517	0.4857	5.0079	0.0339	0.95	0.2450	0.3300	0.3750	0.4600
			0.8000	0.8009	0.1175	3.3734	0.0279	0.94	0.7250	0.7800	0.8200	0.8900
			0.3200	0.3206	0.1891	1.8271	0.0331	0.95	0.2100	0.3000	0.3412	0.4300
			0.5400	0.5405	0.0870	1.3241	0.0355	0.95	0.4450	0.5150	0.5650	0.6450
			0.8100	0.8115	0.1895	5.5632	0.0276	0.95	0.7150	0.7950	0.8300	0.9000
(10, 20)	(−.2, −.4)	Correlation	0.3317	0.3350	1.0009	5.1817	0.0641	95.60	0.0957	0.2912	0.3784	0.6035
			0.2449	0.2464	0.6458	2.2962	0.0688	95.90	0.0429	0.1973	0.2950	0.4437
			0.2780	0.2801	0.7529	3.2329	0.0647	95.40	0.0861	0.2358	0.3262	0.5135
			0.1837	0.1846	0.4911	1.3465	0.0670	95.60	−0.0017	0.1385	0.2282	0.3998
			0.2717	0.2748	1.1611	4.8461	0.0651	95.30	0.0560	0.2318	0.3186	0.4873
			0.1865	0.1879	0.7666	2.1196	0.0674	94.50	−0.0125	0.1404	0.2332	0.3891
			0.2107	0.2122	0.7011	2.2212	0.0665	95.60	−0.0142	0.1654	0.2583	0.4039
			0.3617	0.3646	0.7905	4.6359	0.0617	95.30	0.0705	0.3226	0.4057	0.5569
			0.2135	0.2137	0.1007	0.3102	0.0693	95.90	−0.1018	0.1700	0.2608	0.4171
			0.1548	0.1540	0.5020	1.1363	0.0684	95.10	−0.0769	0.1108	0.1996	0.3751
			0.3188	0.3201	0.3926	1.9556	0.0640	95.80	0.1381	0.2768	0.3670	0.5221
			0.3587	0.3579	0.2130	1.1931	0.0640	95.20	0.1297	0.3142	0.4016	0.5430
			0.2669	0.2645	0.8860	3.5797	0.0661	95.00	0.0444	0.2218	0.3089	0.4261
			0.2335	0.2331	0.1932	0.6729	0.0670	95.40	0.0332	0.1877	0.2808	0.4634
			0.2407	0.2426	0.7809	2.9008	0.0648	95.30	0.0498	0.1990	0.2878	0.4488
			0.1796	0.1844	2.6872	7.2200	0.0670	94.30	−0.0184	0.1388	0.2281	0.4188
			0.3926	0.3935	0.2266	1.4939	0.0595	95.00	0.1644	0.3538	0.4330	0.5537
		γ₁	8.3333	8.3287	0.0557	2.7920	0.1663	94.90	7.8450	8.2200	8.4450	8.8700
			14.2857	14.2825	0.0226	1.6308	0.1980	95.00	13.6900	14.1500	14.4200	14.9000
		√γ₂	2.4056	2.4053	0.0136	0.2764	0.1185	95.50	2.0670	2.3200	2.4880	2.8090
			2.6998	2.6938	0.2214	4.4824	0.1334	95.60	2.3110	2.5990	2.7920	3.0490
		γ₃	0.1732	0.1670	3.5937	3.8583	0.1614	95.00	−0.3318	0.0609	0.2699	0.8250
			0.0378	0.0344	9.0104	2.0669	0.1647	95.10	−0.4314	−0.0853	0.1426	0.6621
		γ₄	2.9700	2.9260	1.4826	13.6121	0.3263	96.50	2.1920	2.7030	3.0970	5.1520
			2.9557	2.9333	0.7585	7.1579	0.3138	95.60	2.0900	2.7100	3.1080	4.1460
		Cumulative proportion	0.3500	0.3525	0.7200	7.5922	0.0332	95.60	0.2450	0.3300	0.3750	0.4500
			0.8000	0.8009	0.1081	3.0480	0.0284	95.90	0.7150	0.7800	0.8200	0.8850
			0.3200	0.3203	0.1000	0.9819	0.0326	94.60	0.2150	0.3000	0.3400	0.4350
			0.5400	0.5404	0.0722	1.0688	0.0365	95.00	0.4250	0.5150	0.5650	0.6550
			0.8100	0.8101	0.0099	0.2760	0.0290	94.70	0.7050	0.7900	0.8300	0.8900

For brevity, we do not report the results for small-sample cases; as expected, the mean, RB, and SB did not change substantially. The CR was somewhat higher, but not to a degree that indicates inefficiency. In addition, as one would expect, the RMSE was slightly larger.

7. Example: Interstitial Cystitis Data Base study

We consider the Interstitial Cystitis Data Base (ICDB) study to illustrate the performance of the proposed method in mimicking real data. We introduced these data in the Introduction. Briefly, the ICDB was established by the National Institute of Diabetes and Digestive and Kidney Diseases in 1993 to understand the natural history of interstitial cystitis (IC) and to assess the demographic and clinical characteristics of IC patients [28]. A number of articles have been published utilizing this database since its establishment. Among other variables, the data set contains information on the number of non-nocturnal voids per day (VF_AWAKE), ordinal levels of urinary urgency over the last four weeks (PURG2C3), and the average maximum interval between voids per day (AVGMINT). For this illustration, we selected only the patients who had complete data on all three variables across the 3-, 6- and 9-month visits. A total of 309 patients met this requirement. The final data excluded variables that are not under consideration and patients who did not meet the aforementioned requirement. The VF_AWAKE variable measured on three occasions constitutes three correlated count variables; the PURG2C3 variable measured on three occasions constitutes three correlated ordinal (three-level) variables; and AVGMINT, also measured on three occasions, constitutes three correlated normal variables. Together, they form a data set with nine correlated variables of dissimilar types.

We demonstrate the ability of the proposed method to duplicate these “model” data in two ways. First, we show the proximity of the four moments of VF_AWAKE, the cumulative proportions of PURG2C3, the mean and standard deviation of AVGMINT, and the correlation matrix estimated from the “model” data to those from the simulated data (mimicking the “model” data). From the “model” data, the θ and λ parameters for VF_AWAKE are obtained by fitting a univariate generalized Poisson distribution to the 3-, 6- and 9-month assessment data. The appropriateness of marginal normal distributions for the AVGMINT variables is tested using the Anderson-Darling normality test (p-value > 0.05). The correlation matrix and the values of the relevant parameters estimated from the “model” data are presented in Tables 2 and 3, respectively. Using these estimated quantities as true parameters, the process of generating a data matrix of size 309 × 9 is replicated 1000 times.

Table 2.

Correlation matrix of the interstitial cystitis data

		VF_AWAKE			PURG1C3			AVGMINT

	months	3	6	9	3	6	9	3	6	9
VF_AWAKE	3	1	0.757	0.758	0.275	0.195	0.285	−0.475	−0.479	−0.466
	6	0.757	1	0.806	0.246	0.274	0.312	−0.5	−0.53	−0.506
	9	0.758	0.806	1	0.274	0.256	0.397	−0.466	−0.481	−0.519

PURG1C3	3	0.275	0.246	0.274	1	0.575	0.545	−0.242	−0.241	−0.166
	6	0.195	0.274	0.256	0.575	1	0.599	−0.255	−0.273	−0.201
	9	0.285	0.312	0.397	0.545	0.599	1	−0.235	−0.248	−0.229

AVGMINT	3	−0.475	−0.5	−0.466	−0.242	−0.255	−0.235	1	0.716	0.729
	6	−0.479	−0.53	−0.481	−0.241	−0.273	−0.248	0.716	1	0.75
	9	−0.466	−0.506	−0.519	−0.166	−0.201	−0.229	0.729	0.75	1


Table 3.

Parameter estimates from interstitial cystitis data

Assessment occasions

3 months

6 months

9 months

VF_AWAKE

7.31

7.16

7.03

0.34

0.38

γ₁

11.03

10.85

11.28

(\sqrt{γ_{2}})

16.64

16.44

18.09

γ₃

0.76

0.77

0.84

γ₄

3.90

3.94

4.11

PURG2C3

p₁

0.37

0.40

0.41

p₂

0.78

0.81

AVGMINT

mean

4.80

4.86

4.89

1.64

1.72


Note: SD denotes standard deviation; p₁ and p₂ denote the cumulative probabilities of the first and second categories of the three-level ordinal variable; and γ₁, √γ₂, γ₃ and γ₄ denote the mean, SD, skewness, and kurtosis of the GPD.

The results of the comparison of these quantities between the “model” and simulated data are presented in Figures 1–4. Figure 1 shows three evaluation metrics. The maximum observed values (representing the worst-case scenario) on all three metrics fall well under the thresholds for all the parameters, including the kurtoses of the count variables. Figure 2 shows the distributions of the cumulative proportions of each category of the ordinal variable PURG2C3 assessed at three time points. The x-axis displays the “true” values from the “model” data and the y-axis displays the estimated values from the simulated data sets. The average estimated cumulative proportions are in close proximity to the “true” values, with a relatively narrow interquartile range. Figure 3 shows the 36 estimated mean pairwise correlations along with their “true” values on the x-axis. It shows that the average estimated pairwise correlations over the simulated data are in close vicinity of the “true” values. The largest difference between the average and “true” pairwise correlations is less than 0.0022. Finally, Figure 4 presents the distributions of the four sample moments of the VF_AWAKE variables. The upper two graphs show the distributions of the sample means and standard deviations. The averages of these two statistics are close to the “true” values, with very few outliers. As expected, the distributions of the third and fourth moments contain more outliers. However, the averages are fairly close to their “true” values. Overall, the proposed method succeeds in duplicating the marginal distributions and dependence structure of the “model” data.

Figure 1. Accuracy of the estimated parameters. RB: relative bias; SB: standardized bias.

Figure 2. Cumulative probabilities of three ordinal variables.

Figure 3. Correlations among variables with different attributes, of count, categorical and continuous types.
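The four sample moments summarized for the count variables (mean, SD, skewness, and kurtosis) can be computed per simulated data set and then averaged over replicates before being compared with the “true” values. The following is a minimal sketch of that bookkeeping only. The function names, the toy data, and the choice of moment estimators (divisor n for the central moments, non-excess kurtosis) are assumptions made for illustration and are not taken from the paper.

```python
import math

def sample_moments(values):
    """Return (mean, SD, skewness, kurtosis) of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n (an assumed convention).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2 * n / (n - 1))  # sample SD with divisor n - 1
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2           # non-excess: equals 3 for a normal
    return mean, sd, skewness, kurtosis

def average_moments(replicates):
    """Average each of the four moments over a list of simulated samples."""
    per_replicate = [sample_moments(r) for r in replicates]
    k = len(per_replicate)
    return tuple(sum(m[j] for m in per_replicate) / k for j in range(4))

# Hypothetical toy counts, not data from the study.
toy_replicates = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]
print(average_moments(toy_replicates))
```

In an actual evaluation, each element of `toy_replicates` would be one simulated count variable, and the averaged moments would be set against the corresponding “true” values.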

Next, we fit a mixed-effect ordinal regression model with a cumulative probit link and equally spaced thresholds. We show that the parameter estimates of this regression model obtained from the “model” data and from the simulated data are close to each other. The model under consideration is as follows:

$$\phi^{-1}\{P(y_{ij} < c)\} = \alpha_c + \beta_0\,\mathrm{AVGMINT} + \beta_1\,\mathrm{VF\_AWAKE} + \beta_2\,\mathrm{MONTH} + \upsilon_{0i}, \tag{10}$$

where y_ij is an unobservable latent variable related to the ordinal variable PURG2C3, the α_c are threshold values, and υ_{0i} is a random effect assumed to have an N(0, σ_{υ₀}) distribution. Table 4 presents the estimated regression parameters for the “model” data in the “True Value” column and for the simulated data in the “Mean” column, along with the evaluation metrics in the remaining columns. Clearly, the averages of the parameter estimates from the 1000 simulated data sets are well within the tolerable limits. These results suggest that the proposed method is useful in capturing and reconstructing the real data trends, and may be regarded as a useful tool for evaluating statistical models and methods.
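To make the link function concrete, the sketch below evaluates the category probabilities implied by Equation (10) as written, with the random effect fixed at zero. It uses the “True Value” entries of Table 4 and reads the second row of that table as the difference between the two thresholds, which is an assumption. The covariate values and the function names are hypothetical and serve only as an illustration.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_probabilities(thresholds, linear_predictor, random_effect=0.0):
    """Category probabilities implied by Equation (10) as written.

    thresholds: increasing alpha_c values, one per cut point.
    Returns one probability per ordinal category (len(thresholds) + 1).
    """
    cumulative = [std_normal_cdf(a + linear_predictor + random_effect)
                  for a in thresholds]
    cumulative.append(1.0)  # the last category closes the distribution
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# "True Value" column of Table 4.
alpha_1 = -0.3928
alpha_2 = alpha_1 + 2.0762  # assumes the second row is alpha_2 - alpha_1
beta_avgmint, beta_vf_awake, beta_month = -0.2072, 0.1186, -0.0305

# Hypothetical covariate values chosen only for illustration.
avgmint, vf_awake, month = 4.8, 10.0, 3.0
eta = beta_avgmint * avgmint + beta_vf_awake * vf_awake + beta_month * month

probs = category_probabilities([alpha_1, alpha_2], eta)
print([round(p, 4) for p in probs])
```

Passing a nonzero `random_effect` gives the subject-specific probabilities for a subject with that value of the random intercept.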

Table 4.

Parameter estimates of the cumulative probit ordinal mixed-effect regression.

	True Value	Mean	RB	SB	RMSE	CR	Min.	1st Qu.	3rd Qu.	Max.
α₁	−0.3928	−0.3758	0.0170	4.3292	0.3495	0.95	−1.6470	−0.6098	−0.1410	0.6547
α₂ − α₁	2.0762	2.0891	0.0130	0.6257	0.1204	0.94	1.7740	2.0080	2.1580	2.5180
AVGMINT	−0.2072	−0.2099	0.0027	1.2908	0.0476	0.95	−0.3864	−0.2426	−0.1782	−0.0500
VF_AWAKE	0.1186	0.1227	0.0041	3.4609	0.0165	0.95	0.0785	0.1116	0.1333	0.1799
MONTH	−0.0305	−0.0332	0.0027	8.9748	0.0186	0.94	−0.0894	−0.0457	−0.0211	0.0202
σ_{υ ₀}	1.6228	1.5615	0.0613	3.7775	0.2670	0.95	0.8955	1.3700	1.7130	2.6300
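As a quick arithmetic check on Table 4, the sketch below recomputes two quantities from the “True Value” and “Mean” columns: the absolute difference |Mean − True Value| and that difference expressed as a percentage of |True Value|. Up to rounding of the tabulated means, these agree with the RB and SB columns, respectively. This is an observation about the tabulated numbers, not a restatement of the metric definitions given in Section 6.1; the function names and ASCII parameter labels are illustrative.

```python
# (parameter, true value, mean of estimates, tabulated RB, tabulated SB)
table4 = [
    ("alpha_1",         -0.3928, -0.3758, 0.0170, 4.3292),
    ("alpha_2-alpha_1",  2.0762,  2.0891, 0.0130, 0.6257),
    ("AVGMINT",         -0.2072, -0.2099, 0.0027, 1.2908),
    ("VF_AWAKE",         0.1186,  0.1227, 0.0041, 3.4609),
    ("MONTH",           -0.0305, -0.0332, 0.0027, 8.9748),
    ("sigma_v0",         1.6228,  1.5615, 0.0613, 3.7775),
]

def absolute_difference(true_value, mean_estimate):
    """Absolute difference between the average estimate and the true value."""
    return abs(mean_estimate - true_value)

def percent_difference(true_value, mean_estimate):
    """Absolute difference as a percentage of the true value's magnitude."""
    return 100.0 * abs(mean_estimate - true_value) / abs(true_value)

for name, true_value, mean_estimate, rb, sb in table4:
    diff = absolute_difference(true_value, mean_estimate)
    pct = percent_difference(true_value, mean_estimate)
    print(f"{name:16s} diff={diff:.4f} (RB {rb:.4f})  pct={pct:.4f} (SB {sb:.4f})")
```

The small gaps in the percentage column presumably arise because the tabulated means are rounded to four decimal places.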


8. Discussion

The approach to multivariate data generation with mixed variable types in this paper is motivated by the emerging popularity of data analysis techniques for mixed outcomes. The method draws on ideas from previous work by Ferrari and Barbiero [12], Barbiero and Ferrari [9], Demirtas and Doganay [14], and Amatya and Demirtas [17]. The novelties of this paper are the use of the generalized Poisson distribution to accommodate over- and under-dispersion in count data, the derivation of the connection (correlation) between normal variables and the count/ordinal variables obtained via discretization, and the unified framework through which data sets composed of dissimilar variable types are generated. The proposed framework can also generate multivariate data composed of only one of the variable types considered in this work. This is particularly useful for generating multivariate GPD variables, for which no method is currently available. For parameter values that are generally encountered in practice, the proposed method performs well and provides a robust mechanism for generating multivariate data with a combination of count, ordinal and normal variables. Future work will address modeling the continuous components of the mixed data via non-normal continuous distributions.
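The discretization idea referred to above can be illustrated for a single three-level ordinal variable: a standard normal draw is cut at the normal quantiles of the target cumulative probabilities. The sketch below covers only this marginal step and omits the correlation adjustment that the proposed algorithm performs. The cumulative probabilities 0.37 and 0.78 appear among the tabulated p₁ and p₂ entries for PURG2C3; the seed, sample size, and function names are arbitrary assumptions.

```python
import random
from statistics import NormalDist

def thresholds_from_cumulative(cumulative_probs):
    """Normal quantiles at which a standard normal is cut into categories."""
    return [NormalDist().inv_cdf(p) for p in cumulative_probs]

def discretize(z, thresholds):
    """Map a standard normal draw to an ordinal category 1, 2, ..."""
    category = 1
    for t in thresholds:
        if z > t:
            category += 1
    return category

# Target cumulative probabilities for the first and second categories.
cumulative_probs = [0.37, 0.78]
cuts = thresholds_from_cumulative(cumulative_probs)

rng = random.Random(2016)  # arbitrary seed for reproducibility
draws = [discretize(rng.gauss(0.0, 1.0), cuts) for _ in range(20000)]

n = len(draws)
empirical_p1 = sum(d <= 1 for d in draws) / n
empirical_p2 = sum(d <= 2 for d in draws) / n
print(round(empirical_p1, 3), round(empirical_p2, 3))
```

The empirical cumulative proportions should lie close to the targets; generating several correlated variables additionally requires the intermediate normal correlation matrix described in the algorithm.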

References

1. Simon LJ, Landis JR, Erickson DR, Nyberg LM, Group IS, et al. The interstitial cystitis data base study: concepts and preliminary baseline descriptive statistics. Urology. 1997;49(5):64–75. doi: 10.1016/s0090-4295(99)80334-3.
2. Song P, Li M, Yuan Y. Joint regression analysis of correlated data using gaussian copulas. Biometrics. 2009;65(1):60–68. doi: 10.1111/j.1541-0420.2008.01058.x.
3. De Leon AR, Chough KC. Analysis of mixed data: Methods and applications. CRC Press; 2013.
4. Krummenauer F. Efficient simulation of multivariate binomial and Poisson distributions. Biometrical Journal. 1998;40(7):823–832.
5. Cai Y, Kendall W. Perfect simulation for correlated Poisson random variables conditioned to be positive. Statistics and Computing. 2002;12(3):229–243.
6. Minhajuddin TM, Harris IR, Schucany WR. Simulating multivariate distributions with specific correlations. Journal of Statistical Computation and Simulation. 2004;74(8):599–607.
7. Shin K, Pasupathy R. Proceedings of the 39th Conference on Winter Simulation: 40 Years! The Best is Yet to Come; WSC ’07. Piscataway, NJ, USA: IEEE Press; 2007. A method for fast generation of bivariate Poisson random vectors; pp. 472–479.
8. Yahav I, Shmueli G. On generating multivariate Poisson data in management science applications. Applied Stochastic Models in Business and Industry. 2012;28(1):91–102.
9. Barbiero A, Ferrari PA. Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry. 2014;31(5):669–680.
10. Biswas A. Generating correlated ordinal categorical random samples. Statistics and Probability Letters. 2004;70(1):25–235.
11. Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. Journal of Statistical Computation and Simulation. 2006;76(11):1017–1025.
12. Ferrari PA, Barbiero A. Simulating ordinal data. Multivariate Behavioral Research. 2012;47(4):566–589. doi: 10.1080/00273171.2012.692630.
13. Ruscio J, Kaczetow W. Simulating multivariate nonnormal data using an iterative algorithm. Multivariate Behavioral Research. 2008;43(3):355–381. doi: 10.1080/00273170802285693.
14. Demirtas H, Doganay B. Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics. 2012;22(3):323–236. doi: 10.1080/10543406.2010.521874.
15. Demirtas H, Hedeker D, Mermelstein JM. Simulation of massive public health data by power polynomials. Statistics in Medicine. 2012;31(27):3337–3346. doi: 10.1002/sim.5362.
16. Demirtas H, Yavuz Y. Concurrent generation of ordinal and normal data. Journal of Biopharmaceutical Statistics. 2015;25(4):635–650. doi: 10.1080/10543406.2014.920868.
17. Amatya A, Demirtas H. Simultaneous generation of multivariate data with Poisson and normal marginals. Journal of Statistical Computation and Simulation. 2015;85(15):3129–3139.
18. Famoye F. Generalized Poisson random variate generation. American Journal of Mathematical and Management Sciences. 1997;17(3–4):219–237.
19. Consul PC, Famoye F. Lagrangian probability distributions. Springer; 2006.
20. Sáez-Castillo AJ, Conde-Sánchez A. A hyper-Poisson regression model for overdispersed and underdispersed count data. Computational Statistics & Data Analysis. 2013;61:148–157.
21. Lynch HJ, Thorson JT, Shelton AO. Dealing with under- and over-dispersed count data in life history, spatial, and community ecology. Ecology. 2014;95(11):3173–3180.
22. Thall PF, Vail SC. Some covariance models for longitudinal count data with overdispersion. Biometrics. 1990;46(3):657–671.
23. Consul PC. Generalized Poisson distributions: Properties and applications. New York: Marcel Dekker, Inc.; 1989.
24. Demirtas H, Hedeker D. A practical way for computing approximate lower and upper correlation bounds. American Statistician. 2011;65(2):104–109.
25. Higham NJ. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications. 1988;103:103–118.
26. Higham NJ. Computing the nearest correlation matrix-a problem from finance. IMA Journal of Numerical Analysis. 2002;22(3):329–343.
27. Demirtas H. Simulation driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica. 2004;58(4):466–482.
28. Propert KJ, Schaeffer AJ, Brensinger CM, Kusek JW, Nyberg LM, Landis JR. A prospective study of interstitial cystitis: results of longitudinal followup of the interstitial cystitis data base cohort. The interstitial cystitis data base study group. The Journal of Urology. 2000;163(5):1434–1439. doi: 10.1016/s0022-5347(05)67637-9.

