Journal of Applied Statistics. 2020 Mar 24;48(5):765–785. doi: 10.1080/02664763.2020.1745765

General location multivariate latent variable models for mixed correlated bounded continuous, ordinal, and nominal responses with non-ignorable missing data

Elham Tabrizi 1, Ehsan Bahrami Samani 1, Mojtaba Ganjali 1
PMCID: PMC9042174  PMID: 35707447

Abstract

Using a multivariate latent variable approach, this article proposes new general models for analyzing correlated bounded continuous and categorical (nominal and/or ordinal) responses with and without non-ignorable missing values. First, we discuss regression methods for jointly analyzing continuous, nominal, and ordinal responses, motivated by the analysis of data from studies of toxicity development. Second, using the beta and Dirichlet distributions, we extend the models so that bounded continuous responses take the place of the continuous responses. The joint distribution of the bounded continuous, nominal, and ordinal variables is decomposed into a marginal multinomial distribution for the nominal variable and a conditional multivariate joint distribution for the bounded continuous and ordinal variables given the nominal variable. We estimate the regression parameters under the new general location models using the maximum-likelihood method. A sensitivity analysis is also performed to study the influence of small perturbations of the parameters of the missing mechanisms of the model on the maximal normal curvature. The proposed models are applied to two data sets: the BMI, steatosis, and osteoporosis data and the Tehran household expenditure budgets data.

Keywords: Beta regression, conditional grouped continuous model, general mixed data model, latent variable, the maximal normal curvature

2010 Mathematics Subject Classifications: 62J12, 62J05

1. Introduction

Percentages, proportions, and fractions are examples of variables supported on the standard unit interval. Examples of proportions include the proportion of household income spent on electronic devices, the proportion of homicides involving firearms, and the proportion of crude oil converted to gasoline after distillation. Various models have been proposed to analyze such data. One of them is the beta regression model introduced by Kieschnick and McCullough [18]; Ferrari and Cribari-Neto [15] then broadened the use of the beta regression model by changing the parameterization of the beta distribution so that it is indexed by mean and dispersion parameters. This model has been studied by many researchers, such as Ferrari and Pinheiro [16], Anholeto et al. [3], Barreto-Souza and Simas [4], and Tabrizi et al. [28,29]. There are many practical situations in which the dataset contains more than one response supported on the standard unit interval. Multivariate beta regression models that use a copula function to construct the joint distribution of the responses were proposed by De Souza and Da Silva Moura [14]. Sometimes, responses lying in a bounded interval sum to a constant (these are called compositional responses). For example, consider a demographic study in which the proportions of the population with specific religions are the response variables. Let $Y_1$, $Y_2$, $Y_3$, and $Y_4$ denote the proportions of people living in a town who are Christians, Muslims, Jews, and followers of the remaining religions, respectively. It is clear that $\sum_{p=1}^{4}Y_p=1$. Two other examples of compositional data are the sediment composition of a lake, in which samples are taken and classified by weight into sand, silt, and clay, and the proportions of household income spent on housing and fuel, foodstuffs, health-care, and remaining expenditures. The Dirichlet regression models studied by Maier [20] can be used to analyze compositional data.

Multivariate data containing mixtures of continuous and discrete responses are common. In particular, correlated continuous and categorical responses are commonly collected in medical studies. For example, consider data from a medical study in which the correlated responses are the ordinal responses steatosis and osteoporosis of the spine and the continuous response body mass index, with the possibility of non-ignorable missing values. Sometimes the continuous response also has bounded support. As an example, again consider data from a medical study in which the bounded continuous responses are the proportions of four serum protein components in blood samples (Albumin, PreAlbumin, Globulin A, and Globulin B) that affect a particular disease, and the ordinal responses are again steatosis and osteoporosis of the spine. Separate analysis of each response has shortcomings: it gives biased parameter estimates and misleading inference [11]. A way out of this problem is to use methods that allow joint modelling of the mixed data while accounting for non-ignorable missing mechanisms. The joint distribution of the mixed responses can be specified in two different ways: (1) specifying the marginal distribution of the discrete variables and the conditional distribution of the continuous variables given the discrete variables, or (2) specifying the marginal distribution of the continuous variables and the conditional distribution of the discrete variables given the continuous variables [13]. Under the first approach, one method for joint modelling of continuous and nominal variables is the general location model (GLOM). Olkin and Tate [22] described a general location model based on a multinomial distribution for the nominal response and a multivariate Gaussian model for the continuous response conditional on the discrete response.
At each combination of levels of the nominal variables, the continuous variables are assumed to have a multivariate normal distribution with a constant covariance matrix and a changeable mean, without considering any covariate effects on the responses simultaneously. Note that the GLOM does not accommodate dependence between ordinal and nominal responses; it is suitable for continuous and nominal responses. Also, no type of GLOM in the literature allows the continuous response to have bounded support. In contrast, based on the second approach, Cox [8] described a model in which the marginal distribution of the continuous response is Gaussian and is multiplied by a logistic representation of the conditional distribution of the binary response given the continuous outcome. More recently, Cox and Wermuth [9] compared a number of different models based on these two factorizations of the joint distribution. Another method models the continuous and discrete responses simultaneously by using correlated errors to take into account the correlation between the responses [17]. To model two correlated continuous and ordinal responses simultaneously, one method is to use the concept of a latent variable. Here it is supposed that the ordinal response is obtained by partitioning the space of an unobservable continuous variable, called a latent variable, into non-overlapping intervals [24]. The conditional grouped continuous model (CGCM) uses the latent variable and applies the second approach to model these responses: it takes the multivariate normal distribution as the distribution of the latent variables and as the conditional distribution of the latent variables given the continuous responses [2,26].
Finally, to model three correlated continuous, ordinal, and nominal responses simultaneously, De Leon and Carriere [12] combined the GLOM and the CGCM to obtain a model called the general mixed data model (GMDM). The joint distribution of the responses in the GMDM is expressed as the product of the joint distribution of the nominal and continuous responses (GLOM) and the conditional distribution of the ordinal responses given the nominal and continuous responses (CGCM). Paleti et al. [23] extended the GMDM to include mixed continuous, nominal, ordinal, and count variables. Mirkamali and Ganjali [21] extended the GLOM using a joint model for analyzing zero-inflated outcomes and skew continuous outcomes. Amiri et al. [1] proposed a general location model with a factor analyzer covariance matrix structure.

For data with missing values, traditional methods, which may give biased and inconsistent estimates, are not suitable because they ignore the missing data mechanism. Rubin [27] and Little and Rubin [19] defined a typology of incomplete-data models and made important distinctions between the various types of missing mechanism. They classified the missing data mechanism into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The mechanism is MCAR if the probability of the observed missingness indicator depends neither on the observed responses nor on the missing responses; MAR if, given the observed responses, it does not depend on the missing responses; and MNAR if, given the observed responses, it depends on the missing responses. For likelihood-based inference, MCAR and MAR can be regarded as ignorable. MCAR is ignorable for both sampling-based and likelihood-based inference. Under MNAR, the mechanism is non-ignorable. The missing data mechanism is therefore the key factor in selecting an approach to missing values. There are different strategies for dealing with missing values; in this paper, we use a well-known technique in which a dummy variable indicating whether a variable is missing is defined.

For the general location model with missing data, Belin et al. [5] extended the general location model under the MAR assumption in a health study. Peng et al. [25] proposed an extension of the general location model for causal inference with non-compliance and missing data on the outcomes; they also considered the effect of covariates. In the above research, a univariate approach for the discrete and continuous responses with a MAR mechanism is assumed. Cui et al. [10] considered other methods, such as the Bayesian approach. They proposed a novel Bayesian Gaussian copula factor approach that is proven to be consistent when the missing data are ignorable (MCAR). This approach addresses the problem of learning the parameters of latent variable models from mixed continuous and ordinal data with missing values. Chen and Tang [6] considered a non-parametric approach, proposing a non-parametric regression with a mixture of continuous and discrete explanatory variables and a missing response.

In this paper, we extend the model of De Leon and Carriere [12], together with the multivariate latent variable approach, to bounded continuous, continuous, ordinal, and nominal responses with non-ignorable missing data. We also propose a general location model for univariate and multivariate bounded responses using the beta and Dirichlet distributions. This model has not yet been considered by other researchers. The joint distribution of the bounded continuous, nominal, and ordinal responses is decomposed into a marginal multinomial distribution for the nominal variables and a conditional joint distribution for the bounded continuous and ordinal responses given the nominal responses. At each combination of levels of the nominal responses, the bounded continuous variables with missing values follow beta and Dirichlet distributions with a changeable mean, taking covariate effects on the responses into account simultaneously. We also apply the random effects and latent variable approaches to take into account the correlation between responses.

In Section 2, we first review a GMDM for mixed correlated nominal, continuous, and ordinal responses and extend it to a new model that allows non-ignorable missing values in the continuous responses. Second, we extend the general location models to mixed correlated nominal, bounded continuous, and ordinal data with and without non-ignorable missing responses. Section 3 is devoted to a brief discussion of the identifiability issues associated with the proposed joint models. In Section 4, simulation studies are conducted to assess the performance of the proposed models. In Section 5, two datasets illustrate the application of the proposed models: data on body mass index, osteoporosis of the spine, waist, job status, and type of accommodation of patients in Taleghani hospital of Tehran, and data on household expenditure budgets, the education level of the household head, and toddler-at-home status in Tehran. In Section 6, a sensitivity analysis is performed to study the influence of a small perturbation of the parameters of the missing mechanism on the maximal normal curvature. Finally, the paper ends with some remarks in the last section.

2. Models and likelihoods

Let $U_i=(U_{i1},\ldots,U_{iD})'$ denote the $D\times 1$ vector of nominal variables, where the $d$th component of $U_i$, $U_{id}$, has $J_d$ possible states with $J_d\in\{1,2,3,\ldots\}$, so that $u_{id}=1,2,\ldots,j_d,\ldots,J_d$. The vector $U_i$ thus defines a contingency table with $S=\prod_{d=1}^{D}J_d$ cells, and $j=(j_1,\ldots,j_D)$ represents a specific cell of this table. Let $Y_{ij}=(Y_{i1j},\ldots,Y_{iPj})'$ denote the continuous or bounded continuous response vector of the $i$th individual in cell $j$, where $Y_{ipj}$ is the $p$th component of $Y_{ij}$ for $i=1,2,\ldots,n_j$ and $p=1,\ldots,P$. Also, let $Z_{ij}=(Z_{i1j},\ldots,Z_{iQj})'$ denote the $Q\times 1$ ordinal response vector defined as

$$Z_{iqj}=\begin{cases}1, & Y^*_{iqj}\le\theta_{1q},\\ c+1, & \theta_{cq}<Y^*_{iqj}\le\theta_{(c+1)q},\quad c=1,2,\ldots,C-2,\\ C, & Y^*_{iqj}>\theta_{(C-1)q},\end{cases}$$

where $i=1,2,\ldots,n_j$, $q=1,\ldots,Q$, $\theta_{1q},\theta_{2q},\ldots,\theta_{(C-1)q}$ are the cut-point parameters, and $Y^*_{ij}=(Y^*_{i1j},\ldots,Y^*_{iQj})'$ denotes the underlying latent variables for the ordinal responses. Suppose that these three types of response are recorded for $n$ individuals and that the frequency of cell $j$ is denoted by $n_j$. Then $\sum_{j\in J}n_j=n$, where $J=\{(j_1,\ldots,j_D)\mid j_d\in\{1,2,\ldots,J_d\},\ d=1,2,\ldots,D\}$ is the set of all possible states for $j$. Moreover, the expected probability of falling into cell $j$ is denoted by $\pi_j$, with $\sum_{j\in J}\pi_j=1$. In all models in this paper, as in the general location model, the $U_i$ are modelled by a multinomial distribution, where $\pi_j=\Pr(U_i=j)$ are the state probabilities. Together, the nominal variables form a contingency table. As an example, Figure C1 shows the demographic characteristics of the modelling for $D=2$, $J_1=J_2=2$, $S=4$, $J=\{(1,1),(1,2),(2,1),(2,2)\}$, $P=1$, $Q=2$, $Y_{ij}=Y_{i1j}$, $Z_{ij}=(Z_{i1j},Z_{i2j})'$, and $n_{(1,1)}+n_{(1,2)}+n_{(2,1)}+n_{(2,2)}=n$.
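The cell structure described above can be sketched in a few lines of Python. This is only an illustration: the function name `enumerate_cells` is hypothetical, and the example uses the two-variable, two-state configuration of the example in the text.

```python
from itertools import product
from math import prod

def enumerate_cells(levels):
    """Return every cell j = (j_1, ..., j_D) of the contingency table,
    given the number of states J_d of each nominal variable."""
    return list(product(*(range(1, J_d + 1) for J_d in levels)))

levels = [2, 2]                  # D = 2 nominal variables with J_1 = J_2 = 2
cells = enumerate_cells(levels)  # the index set J
S = prod(levels)                 # number of cells, the product of the J_d
print(cells)                     # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(S)                         # 4
```

Each individual falls into exactly one of these cells, so the cell counts add up to the sample size.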

2.1. GMDM for mixed continuous, ordinal, and nominal data

In this subsection, suppose that the support of the distribution of the continuous responses is not bounded.

2.1.1. Complete data model

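In the GMDM, the joint distribution of $[U_i,Z_{ij},Y_{ij}]$ (the joint distribution for the $j$th cell of the contingency table) can be factorized as

$$[U_i=j,Z_{ij},Y_{ij}]=[Z_{ij},Y_{ij}\mid U_i=j]\,[U_i=j]=[Z_{ij}\mid Y_{ij},U_i=j]\,[Y_{ij}\mid U_i=j]\,[U_i=j],$$

where $[U_i]$, $[Z_{ij}\mid Y_{ij},U_i]$, and $[Y_{ij}\mid U_i]$ denote the marginal distribution of $U_i$, the conditional distribution of $Z_{ij}$ given $Y_{ij}$ and $U_i$, and the conditional distribution of $Y_{ij}$ given $U_i$, respectively. The GMDM takes the form:

$$\begin{aligned}Y_{ipj}\mid(U_i=j)&=\gamma_{pj}'G_{ij}+\varepsilon^{(1)}_{ipj}, & p&=1,\ldots,P,\\ Y^*_{iqj}\mid(U_i=j)&=\beta_{qj}'H_{iqj}+\varepsilon^{(2)}_{iqj}, & q&=1,\ldots,Q,\end{aligned}\tag{1}$$

where $\gamma=\{\gamma_{pj},\ p=1,\ldots,P,\ j\in J\}$, $\beta=\{\beta_{qj},\ q=1,\ldots,Q,\ j\in J\}$, $\pi=(\pi_j,\ j\in J)$, and $G_{ij}$ and $H_{iqj}$ are the vectors of explanatory variables for $Y_{ipj}$ and $Y^*_{iqj}$, respectively, for the $i$th individual in cell $j$. Also, $(\varepsilon^{(1)}_{ij},\varepsilon^{(2)}_{ij})$, where $\varepsilon^{(1)}_{ij}=(\varepsilon^{(1)}_{i1j},\ldots,\varepsilon^{(1)}_{iPj})'$, $\varepsilon^{(2)}_{ij}=(\varepsilon^{(2)}_{i1j},\ldots,\varepsilon^{(2)}_{iQj})'$, $\Sigma_{11}=\mathrm{Var}(\varepsilon^{(1)}_{ij})$, $\Sigma_{22}=\mathrm{Var}(\varepsilon^{(2)}_{ij})$, and $\Sigma_{12}=\Sigma_{21}'=\mathrm{Cov}(\varepsilon^{(1)}_{ij},\varepsilon^{(2)}_{ij})$, follows a $(P+Q)$-variate normal distribution with zero mean and covariance matrix

$$\Sigma=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{pmatrix},$$

such that $\Sigma$ is the covariance matrix of the two error vectors (or of the two sets of responses). In this way, the dependence between the continuous responses and the continuous latent variables is modelled simultaneously.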

The likelihood for this model, given in the Appendix (Subsection 1.1 of the supplementary materials), shows the simplification obtained by assuming normality for the errors in the system of Equations (1).
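The supplementary derivation is not reproduced here, but the simplification presumably rests on the standard conditional-normal result. As a sketch, write $m^{(1)}_{ij}$ and $m^{(2)}_{ij}$ (symbols introduced only for this illustration) for the mean vectors of $Y_{ij}$ and of the latent vector $Y^*_{ij}$ in cell $j$. Then:

```latex
Y^*_{ij} \mid (Y_{ij}=y,\ U_i=j) \sim
N_Q\!\left( m^{(2)}_{ij} + \Sigma_{21}\Sigma_{11}^{-1}\bigl(y - m^{(1)}_{ij}\bigr),\;
            \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \right)
```

Under this result, the contribution of one individual in cell $j$ would be the product of $\pi_j$, a $P$-variate normal density for $y$, and the probability that this conditional normal vector falls in the rectangle determined by the cut points of the observed ordinal categories.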

2.1.2. Incomplete data model

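Typically, when missing data occur in an outcome, let $R_{y_{ij}}=(R_{y_{i1j}},\ldots,R_{y_{iPj}})'$ be the missingness indicator vector related to $Y_{ij}$, where $R_{y_{ipj}}$ is defined as

$$R_{y_{ipj}}=\begin{cases}1, & \text{if } R^*_{y_{ipj}}>0,\\ 0, & \text{otherwise.}\end{cases}$$

Also, suppose that $R_{z_{ij}}=(R_{z_{i1j}},\ldots,R_{z_{iQj}})'$ is the missingness indicator vector related to $Z_{ij}$, where $R_{z_{iqj}}$ is defined as

$$R_{z_{iqj}}=\begin{cases}1, & \text{if } R^*_{z_{iqj}}>0,\\ 0, & \text{otherwise.}\end{cases}$$

In the above definitions, $R^*_{y_{ipj}}$ and $R^*_{z_{iqj}}$ denote the underlying latent variables of the non-response mechanism for the continuous and ordinal variables, respectively. The GMDM takes the form:

$$\begin{aligned}Y_{ipj}\mid(U_i=j)&=\gamma_{pj}'G_{ij}+\varepsilon^{(1)}_{ipj}, & p&=1,\ldots,P,\\ Y^*_{iqj}\mid(U_i=j)&=\beta_{qj}'H_{iqj}+\varepsilon^{(2)}_{iqj}, & q&=1,\ldots,Q,\\ R^*_{y_{ipj}}\mid(U_i=j)&=\alpha_{pj}'M_{ipj}+\varepsilon^{(3)}_{ipj}, & p&=1,\ldots,P,\\ R^*_{z_{iqj}}\mid(U_i=j)&=\eta_{qj}'N_{iqj}+\varepsilon^{(4)}_{iqj}, & q&=1,\ldots,Q,\end{aligned}\tag{2}$$

where $\alpha=\{\alpha_{pj},\ p=1,\ldots,P,\ j\in J\}$, $\eta=\{\eta_{qj},\ q=1,\ldots,Q,\ j\in J\}$, and $M_{ipj}$ and $N_{iqj}$ are the vectors of explanatory variables for the $i$th individual in cell $j$. Let

$$(\varepsilon^{(1)}_{ij},\varepsilon^{(2)}_{ij},\varepsilon^{(3)}_{ij},\varepsilon^{(4)}_{ij})\overset{iid}{\sim}\mathrm{MVN}(0,\Sigma_\varepsilon),$$

where $\varepsilon^{(3)}_{ij}=(\varepsilon^{(3)}_{i1j},\ldots,\varepsilon^{(3)}_{iPj})'$, $\varepsilon^{(4)}_{ij}=(\varepsilon^{(4)}_{i1j},\ldots,\varepsilon^{(4)}_{iQj})'$, and

$$\Sigma_\varepsilon=\begin{pmatrix}\Sigma_{11}&\Sigma_{12}&\Sigma_{13}&\Sigma_{14}\\ \Sigma_{21}&\Sigma_{22}&\Sigma_{23}&\Sigma_{24}\\ \Sigma_{31}&\Sigma_{32}&\Sigma_{33}&\Sigma_{34}\\ \Sigma_{41}&\Sigma_{42}&\Sigma_{43}&\Sigma_{44}\end{pmatrix},$$

where $\Sigma_{uu}=\mathrm{Var}(\varepsilon^{(u)}_{ij})$ for $u=1,2,3,4$, $\Sigma_{uv}=\mathrm{Cov}(\varepsilon^{(u)}_{ij},\varepsilon^{(v)}_{ij})$ for $u<v$, $u,v=1,2,3,4$, and $\Sigma_{uv}=\Sigma_{vu}'$. Note that if one of the matrices $\Sigma_{13},\Sigma_{14},\Sigma_{23},\Sigma_{24}$ is not zero, then the missing data mechanism of the response is not missing completely at random.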
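The effect of a non-zero covariance between a response error and the latent non-response variable can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a single response with no covariates, unit variances, and a correlation `rho` between the response and the latent variable behind the indicator, and it simply compares the mean of the response in the subgroup with indicator equal to 1 against the overall mean of zero.

```python
import math
import random

def subgroup_mean(rho, n=20000, seed=1):
    """Mean of Y among units with R = 1, where R = 1 if the latent R* > 0
    and (Y, R*) is standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        r_star = rho * y + math.sqrt(1.0 - rho ** 2) * e
        if r_star > 0:
            total += y
            count += 1
    return total / count

print(round(subgroup_mean(0.0), 2))   # close to 0: the indicator carries no information about Y
print(round(subgroup_mean(0.5), 2))   # close to 0.5 * sqrt(2 / pi), about 0.40
```

With zero correlation the subgroup looks like the whole population; with a non-zero correlation the subgroup mean is shifted, which is why the mechanism can no longer be treated as completely at random.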

The likelihood for this model, given in the Appendix (Subsection 1.2 of the supplementary materials), shows the simplification obtained by assuming normality for the errors in the system of Equations (2).

2.2. GLOM for mixed bounded continuous, ordinal, and nominal data

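Here, we extend the general location model to multivariate bounded continuous responses using the Dirichlet distribution. In this subsection, assume that $Y_{ipj}$ is bounded continuous for $i=1,\ldots,n_j$, $p=1,\ldots,P$, and $j\in J$, for example a recorded percentage or proportion, so that $0<Y_{ipj}<1$. Note that if $a<Y_{ipj}<b$, where $a$ and $b$ ($a,b\in\mathbb{R}$) are known real numbers, we can use the transformation $\tilde{Y}_{ipj}=(Y_{ipj}-a)/(b-a)$ (so that $0<\tilde{Y}_{ipj}<1$) and then model $\tilde{Y}_{ipj}$ by the method introduced in this article. Let $Y_{ipj}\sim\mathrm{Beta}(\phi\mu_{ipj},\phi(1-\mu_{ipj}))$, such that

$$f(y_{ipj};\mu_{ipj},\phi)=\frac{\Gamma(\phi)}{\Gamma(\phi\mu_{ipj})\,\Gamma(\phi(1-\mu_{ipj}))}\,y_{ipj}^{\phi\mu_{ipj}-1}(1-y_{ipj})^{(1-\mu_{ipj})\phi-1},\quad y_{ipj}\in(0,1),\tag{3}$$

where $0<\mu_{ipj}<1$, $\phi>0$, and $\Gamma(\cdot)$ is the gamma function. Under this parameterization of the beta distribution, $E(Y_{ipj})=\mu_{ipj}$ and $\mathrm{Var}(Y_{ipj})=\mu_{ipj}(1-\mu_{ipj})/(1+\phi)$. Also, let the joint distribution of $Y_{ij}$ be the Dirichlet distribution ($Y_{ij}\sim\mathcal{D}(\mu_{ij},\phi)$) of order $P+1$, such that

$$f(y_{ij};\mu_{ij},\phi)=\frac{\Gamma(\phi)}{\Gamma\bigl(\phi(1-\sum_{p=1}^{P}\mu_{ipj})\bigr)\prod_{p=1}^{P}\Gamma(\phi\mu_{ipj})}\times\Bigl(1-\sum_{p=1}^{P}y_{ipj}\Bigr)^{\phi(1-\sum_{p=1}^{P}\mu_{ipj})-1}\prod_{p=1}^{P}y_{ipj}^{\phi\mu_{ipj}-1},\tag{4}$$

with $y_{ipj}\in(0,1)$, $\sum_{p=1}^{P}y_{ipj}<1$, $\mu_{ij}=(\mu_{i1j},\ldots,\mu_{iPj})'$, and $\mathrm{Cov}(y_{ipj},y_{ip'j})=-\mu_{ipj}\mu_{ip'j}/(1+\phi)$ for $p\ne p'$.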
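As a numerical check of the mean-precision parameterization, the following sketch codes the two log-densities with the standard library only. The function names are hypothetical, and the Dirichlet density is written with the usual exponent of "shape minus one" on the remainder term, which is what makes it reduce to the beta density when $P=1$.

```python
import math

def beta_logpdf(y, mu, phi):
    """Log-density of Beta(phi * mu, phi * (1 - mu)) at y in (0, 1)."""
    a, b = phi * mu, phi * (1.0 - mu)
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def dirichlet_logpdf(y, mu, phi):
    """Log-density of the order-(P + 1) Dirichlet with means mu_1..mu_P and
    remainder mean 1 - sum(mu), evaluated at y_1..y_P with sum(y) < 1."""
    mu_rest, y_rest = 1.0 - sum(mu), 1.0 - sum(y)
    out = math.lgamma(phi) - math.lgamma(phi * mu_rest)
    out += (phi * mu_rest - 1.0) * math.log(y_rest)
    for y_p, mu_p in zip(y, mu):
        out += -math.lgamma(phi * mu_p) + (phi * mu_p - 1.0) * math.log(y_p)
    return out

def beta_variance(mu, phi):
    """Variance from the classical Beta(a, b) formula, with a = phi * mu."""
    a, b = phi * mu, phi * (1.0 - mu)
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

mu, phi = 0.3, 2.0
print(abs(beta_variance(mu, phi) - mu * (1 - mu) / (1 + phi)) < 1e-12)              # True
print(abs(dirichlet_logpdf([0.4], [mu], phi) - beta_logpdf(0.4, mu, phi)) < 1e-12)  # True
```

The first check confirms the variance formula stated in the text; the second confirms that the beta model is the one-component special case of the Dirichlet model.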

Note 2.1

Dirichlet regression models are commonly used to analyze a set of bounded continuous responses that sum to a constant. In the above notation, this regression is usually used when $\sum_{p=1}^{P+1}Y_{ipj}=1$, where $Y_{i(P+1)j}=1-\sum_{p=1}^{P}Y_{ipj}$. In this situation, the joint distribution of $(Y_{ij},Y_{i(P+1)j})$ is as follows:

$$f(y_{ij},y_{i(P+1)j};\mu_{ij},\phi)=\frac{\Gamma(\phi)}{\prod_{p=1}^{P+1}\Gamma(\phi\mu_{ipj})}\prod_{p=1}^{P+1}y_{ipj}^{\phi\mu_{ipj}-1},\quad y_{ipj}\in(0,1),\quad\sum_{p=1}^{P+1}y_{ipj}=1.$$

It is suitable for modelling the compositional data.

2.2.1. Complete data model

An extended general location model for correlated (0,1)-supported, ordinal, and nominal responses takes the form:

$$\begin{aligned}\log\Bigl(\frac{\mu_{ipj}}{1-\sum_{p'=1}^{P}\mu_{ip'j}}\Bigr)\Bigm|(U_i=j,b_{ij})&=\gamma_{pj}'G_{ipj}+b_{ij}, & p&=1,\ldots,P,\\ \mathrm{logit}(\Pr(Z_{iqj}\le c))\mid(U_i=j,b_{ij})&=\beta_{qjc}+\beta_{qj}'H_{iqj}+b_{ij}, & q&=1,\ldots,Q,\end{aligned}\tag{5}$$

for $c=1,\ldots,C-1$, with $C-1$ strictly increasing intercept parameters $\beta_{qj1}<\beta_{qj2}<\cdots<\beta_{qj(C-1)}$ and the link function $\mathrm{logit}(a)=\log(a/(1-a))$. The shared random effect $b_{ij}$ is assumed to be distributed in the population as $N(0,\sigma_b^2)$ given $U_i=j$. Note that for $(i,j)\ne(i',j')$, $b_{ij}$ and $b_{i'j'}$ are independent. Moreover, $\beta_{qj}$ does not contain an intercept. It is noteworthy that under this model

$$\mu_{ipj}\mid(U_i=j,b_{ij})=\frac{e^{\gamma_{pj}'G_{ipj}+b_{ij}}}{1+\sum_{p'=1}^{P}e^{\gamma_{p'j}'G_{ip'j}+b_{ij}}},\quad p=1,\ldots,P.$$

So, model (5) guarantees that $0<\mu_{ipj}<1$ for $p=1,\ldots,P$ and $\sum_{p=1}^{P}\mu_{ipj}<1$. The joint distribution of $[U_i=j,Z_{ij},Y_{ij}]$ can be factorized as

$$[U_i=j,Z_{ij},Y_{ij}]=[Z_{ij},Y_{ij}\mid U_i=j]\,[U_i=j].$$

Note that $Y_{ij}\mid U_i=j,b_{ij}\sim\mathcal{D}(\mu_{ij},\phi)$ and that $Y_{ij}$ and $Z_{ij}$ are independent given $b_{ij}$. The parameter $\sigma_b$ is related to the dependence between the (0,1)-supported and ordinal responses. The likelihood of model (5) is presented in the Appendix (Subsection 1.3 of the supplementary materials).
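The two links of the complete data model can be checked numerically. The sketch below is an illustration with hypothetical function names: it maps already-computed linear predictors (covariate part plus random effect) to the Dirichlet means, and it turns cumulative-logit intercepts into ordinal category probabilities.

```python
import math

def dirichlet_means(eta):
    """Map linear predictors eta_p to means mu_p through the inverse of the
    baseline-category link: mu_p = exp(eta_p) / (1 + sum_k exp(eta_k))."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [math.exp(e) / denom for e in eta]

def ordinal_probs(intercepts, lin_pred):
    """Category probabilities from logit Pr(Z <= c) = intercept_c + lin_pred,
    for C - 1 strictly increasing intercepts."""
    cum = [1.0 / (1.0 + math.exp(-(a + lin_pred))) for a in intercepts] + [1.0]
    return [cum[0]] + [cum[c] - cum[c - 1] for c in range(1, len(cum))]

mu = dirichlet_means([0.2, -0.5])
rest = 1.0 - sum(mu)
print(all(0 < m < 1 for m in mu) and rest > 0)            # True
print(abs(math.log(mu[0] / rest) - 0.2) < 1e-12)          # True: the link is recovered
probs = ordinal_probs([1.0, 2.0], 0.3)
print(abs(sum(probs) - 1.0) < 1e-12 and min(probs) > 0)   # True
```

The strictly increasing intercepts are what keep every category probability positive, which is the reason for the ordering constraint on the intercept parameters.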

2.2.2. Incomplete data model

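Again, let $R_{y_{ipj}}$ and $R_{z_{iqj}}$ be the observed missingness indicators related to $Y_{ipj}$ and $Z_{iqj}$. To handle the missing mechanism, the latent variables $R^*_{y_{ipj}}$ and $R^*_{z_{iqj}}$ could be applied, but here we prefer another approach. Let $R_{y_{ipj}}$ and $R_{z_{iqj}}$ follow Bernoulli distributions. The incomplete joint model takes the form:

$$\begin{aligned}\log\Bigl(\frac{\mu_{ipj}}{1-\sum_{p'=1}^{P}\mu_{ip'j}}\Bigr)\Bigm|(U_i=j,b_{ij})&=\gamma_{pj}'G_{ipj}+b_{ij}, & p&=1,\ldots,P,\\ \mathrm{logit}(\Pr(Z_{iqj}\le c))\mid(U_i=j,b_{ij})&=\beta_{qjc}+\beta_{qj}'H_{iqj}+b_{ij}, & q&=1,\ldots,Q,\\ \mathrm{logit}(\Pr(R_{y_{ipj}}=1))\mid(U_i=j,y_{ipj})&=\alpha_{pj}'M_{ipj}+\omega_{pj}y_{ipj},\\ \mathrm{logit}(\Pr(R_{z_{iqj}}=1))\mid(U_i=j,z_{iqj})&=\eta_{qj}'N_{iqj}+\psi_{qj}z_{iqj},\end{aligned}\tag{6}$$

for $c=1,\ldots,C-1$, where statistical significance of $\omega_{pj}$ and $\psi_{qj}$ implies that the missingness mechanism is MNAR.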
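The role of the coefficient on the response in the missingness equation can be seen in a minimal sketch. The function name and the numeric values are illustrative assumptions, and `alpha_m` stands for the already-computed covariate part of the linear predictor.

```python
import math

def prob_indicator_one(alpha_m, omega, y):
    """Pr(R = 1 | y) under logit Pr(R = 1) = alpha_m + omega * y."""
    return 1.0 / (1.0 + math.exp(-(alpha_m + omega * y)))

# omega = 0: the probability is the same for every value of the response.
print(prob_indicator_one(0.5, 0.0, 0.1) == prob_indicator_one(0.5, 0.0, 0.9))   # True
# omega != 0: the probability changes with the (possibly unobserved) response.
print(prob_indicator_one(0.5, 2.0, 0.1) < prob_indicator_one(0.5, 2.0, 0.9))    # True
```

A non-zero coefficient ties the indicator to the value that may itself be missing, which is the MNAR situation described above.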

The likelihood for this model is given in the Appendix (Subsection 1.4 of the supplementary materials). The likelihoods of models (1) and (2) can be maximized with the function ‘nlminb’ in R. This function uses the PORT optimization routines documented in ‘ http://netlib.bell-labs.com/cm/cs/cstr/153.pdf’. The function uses a sequential quadratic programming (SQP) method to minimize the requested function. Many problems in statistics amount to finding the parameter values that maximize the likelihood function subject to constraints on the parameters. For example, in models (5) and (6) we know that $\beta_{qj1}<\beta_{qj2}<\cdots<\beta_{qj(C-1)}$. Such constraints add a further layer to the problem, which makes it much harder to solve. Many other R packages are available for solving optimization problems. The ‘Rsolnp’ package can be applied to solve general non-linear optimization problems using an augmented Lagrange multiplier method with an SQP interior algorithm, which is why we choose its ‘solnp’ function to maximize the likelihoods of models (5) and (6). Moreover, the observed Hessian matrix may be obtained from the ‘nlminb’ function or provided by the function ‘fdHess’.
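The paper handles the ordering of the intercepts through the inequality constraints of ‘solnp’. A different device, which is not the authors' method and is shown only to make the constraint concrete, is to reparameterize the intercepts so that any unconstrained vector maps to a strictly increasing one. The function names below are hypothetical.

```python
import math

def ordered_from_free(free):
    """Map unconstrained reals (d_1, ..., d_{C-1}) to strictly increasing
    intercepts: b_1 = d_1 and b_c = b_{c-1} + exp(d_c) for c >= 2."""
    out = [free[0]]
    for d in free[1:]:
        out.append(out[-1] + math.exp(d))
    return out

def free_from_ordered(ordered):
    """Inverse map; requires a strictly increasing input."""
    return [ordered[0]] + [math.log(b - a) for a, b in zip(ordered, ordered[1:])]

print(ordered_from_free([1.0, 0.0]))    # [1.0, 2.0]
print(free_from_ordered([1.0, 2.0]))    # [1.0, 0.0]
```

With such a map, an unconstrained optimizer could search over the free parameters while the likelihood is always evaluated at ordered intercepts.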

3. Identifiability conditions

Model identifiability is discussed here. All proofs are given in the Appendix (Section 2 of the supplementary materials).

Definition 3.1

The model is identifiable if, for any two different values $\theta_1\ne\theta_2$ in $\Theta$, the corresponding probability distributions $\Pr(\theta_1)$ and $\Pr(\theta_2)$ are different.

We first review three theorems that can be proven using the same argument as in [30]. We apply them to prove the other theorems presented in this section. In the following, we call the model for all $(Y_i,Z_i,U_i)$ or all $(Y_i,Z_i,U_i,R_{y_i},R_{z_i})$ (the response variables of all individuals) the joint model, and the model for the response variables of the $i$th individual the individual model, where $Y_i=(Y_{ij};\ j\in J)$ and $Z_i=(Z_{ij};\ j\in J)$. In the following theorems, we use the definition of non-identifiability of a covariance matrix discussed by Wang [31].

Theorem 3.2

The joint model is identifiable if and only if at least one individual model is identifiable.

Applying Theorem 3.2 to model (1) yields that $(\pi_j;\ j\in J)$, $(\gamma,\Sigma_{11})$, $(\beta,\Sigma_{22},\theta_{1q},\ldots,\theta_{(C-1)q},\ldots,\theta_{(C-1)Q})$, and $\Sigma_{12}$ are identifiable if they are identifiable in the separate models for $U_i$, $Y_i\mid U_i$, $Z_i\mid U_i$, and $Z_i\mid Y_i,U_i$, respectively. Also, in model (5), $(\pi_j;\ j\in J)$, $(\gamma_j,\phi)$, and $(\beta_{0qj},\beta_{qj})$, where $\beta_{0qj}=(\beta_{qj1},\beta_{qj2},\ldots,\beta_{qj(C-1)})$, are identifiable if they are identifiable in the separate models for $U_i$, $Y_{ij}\mid U_i=j$, and $Z_{iqj}\mid U_i=j$, respectively. Moreover, $\sigma_b$ is identifiable if it is identifiable in at least one of the separate models for $Y_i\mid U_i$ and $Z_i\mid Y_i,U_i$.

Proposition 3.3

Model (1) is identifiable under the following conditions:

  1. The parameter vector $\gamma$ can include intercept parameters, but $\beta$ should not include any intercept.

  2. $\Sigma_{22}$ is restricted to be a correlation matrix $R_{22}$ in which all diagonal elements equal one.

  3. All $n_j\times\mathrm{length}(G_{ij})$ design matrices $G_j=[G_{1j},\ldots,G_{n_jj}]'$ have full rank.

  4. $H_{iqj}$ contains at least one continuous covariate.
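A small numerical illustration shows why restrictions of this kind are needed for a latent-variable ordinal model. The sketch is an assumption-laden toy, not the proof in the supplementary materials: it uses one latent variable, one covariate value, and hypothetical function names, and it shows two changes of parameters that leave the probability of the first category unchanged.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_first_category(theta, intercept, beta, h, sigma):
    """Pr(Z = 1) = Pr(Y* <= theta) for Y* = intercept + beta * h + error,
    with error ~ N(0, sigma^2)."""
    return normal_cdf((theta - intercept - beta * h) / sigma)

p1 = prob_first_category(theta=-1.0, intercept=0.0, beta=0.5, h=2.0, sigma=1.0)
# Same model with cut point, slope and error scale all multiplied by 3.
p2 = prob_first_category(theta=-3.0, intercept=0.0, beta=1.5, h=2.0, sigma=3.0)
# Same model with 0.5 added to both the cut point and the intercept.
p3 = prob_first_category(theta=-0.5, intercept=0.5, beta=0.5, h=2.0, sigma=1.0)
print(abs(p1 - p2) < 1e-12, abs(p1 - p3) < 1e-12)   # True True
```

Fixing the latent variances at one and excluding an intercept from the latent equation remove exactly these two sources of indeterminacy.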

Proposition 3.4

The necessary conditions for identifiability of model (2) are:

  1. The parameter vectors $\gamma$, $\alpha$, and $\eta$ can include intercept parameters, but $\beta$ should not include any intercept.

  2. $\Sigma_{22}$, $\Sigma_{33}$, and $\Sigma_{44}$ are restricted to be correlation matrices $R_{22}$, $R_{33}$, and $R_{44}$ in which all diagonal elements equal one.

  3. All $n_j\times\mathrm{length}(G_{ij})$ design matrices $G_j=[G_{1j},\ldots,G_{n_jj}]'$ have full rank.

Proposition 3.5

$(\pi,\gamma,\phi)$ in model (5) is identifiable under the following conditions:

  1. $\gamma_{pj}=\gamma_j$ for $p=1,\ldots,P$.

  2. Let $\gamma=(\gamma_j,\ j\in J)$. The parameter vector $\gamma$ should not include any intercept.

  3. There are at least two values of $p$ ($p$ and $p'$) such that all the components of the covariate vectors $G_{ipj}$ and $G_{ip'j}$ take all values in $S_{G_p}$ and $S_{G_{p'}}$, where $S_{G_p},S_{G_{p'}}\subseteq\mathbb{R}$ each contain at least one interval and zero. For the other covariates, it is enough that their $S_G$ contains zero.

For example, $S_{G_p}$ and $S_{G_{p'}}$ can be $\mathbb{R}$, $(-a,b)$, $(-a,b]\cup(-\infty,-c)$, $[-a,+\infty)$, or $(a,b)\cup\{0,1,2\}$, where $a,b,c\in\mathbb{R}^+$.

4. Simulation study

Here, three different simulation studies are considered to assess the performance of models (1) and (5) for complete data and model (2) for incomplete data. Note that the optimization algorithms require initial values for the parameters. Because the true parameter values are known in the simulation studies, one may use values close to them as initial values when evaluating the performance of the models under study. In practice, however, these values are unknown, and the final estimates might be affected by the initial values. We should therefore suggest at least one appropriate method for specifying the initial points. As a comparison, we estimate the parameters of model (5) in the second simulation study under two different sets of initial values, the real parameter values and our suggested initial points, and then compare the results. Tables and figures are relegated to the Appendix (Section 3 of the supplementary materials).

4.1. Simulation study 1: GMDM for mixed continuous, ordinal, and nominal responses

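In the first study, let $D=2$, $J_1=J_2=2$, $P=1$, $Q=2$, and $C=3$. So, we consider the case of two nominal variables $U_1$ and $U_2$, each with two states. The variables $U_1$ and $U_2$ are generated from a multinomial distribution with $\pi=(0.25,0.25,0.25,0.25)$. Also, the variables $Y_{1j}$, $Y^*_{1j}$, and $Y^*_{2j}$ are generated from a multivariate normal distribution with zero mean ($\gamma_j=\beta_{1j}=\beta_{2j}=0$) and covariance matrix

$$\Sigma=\begin{pmatrix}1&0.5&0.5\\ &1&0.5\\ &&1\end{pmatrix}.$$

First, we want to study model (1) with the covariates $G_{ij}$, $H_{i1j}$, and $H_{i2j}$ designed to take values in $\mathbb{R}$. So, we generate them from normal distributions such that

$$G_{ij}\overset{iid}{\sim}N(0,1),\qquad\begin{pmatrix}H_{i1j}\\ H_{i2j}\end{pmatrix}\overset{iid}{\sim}N_2\left(0,\begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}\right).$$

Let $\theta_{11}=\theta_{12}=-1$ and $\theta_{21}=\theta_{22}=1$. The ordinal variables $Z_{1j}$ and $Z_{2j}$ with three levels are defined as

$$Z_{i1j}=\begin{cases}1, & Y^*_{i1j}\le\theta_{11},\\ 2, & \theta_{11}<Y^*_{i1j}\le\theta_{21},\\ 3, & Y^*_{i1j}>\theta_{21},\end{cases}$$

and

$$Z_{i2j}=\begin{cases}1, & Y^*_{i2j}\le\theta_{12},\\ 2, & \theta_{12}<Y^*_{i2j}\le\theta_{22},\\ 3, & Y^*_{i2j}>\theta_{22}.\end{cases}$$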
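The discretization of a latent value into three ordered levels can be written compactly with a binary search. This is a sketch with a hypothetical function name, and the cut points used in the example are illustrative assumptions.

```python
from bisect import bisect_left

def to_ordinal(latent, cut_points):
    """Return the category 1..C of a latent value: category 1 if the value is
    at or below the first cut point, category c + 1 if it lies in
    (cut_points[c - 1], cut_points[c]], and category C above the last one."""
    return bisect_left(cut_points, latent) + 1

cuts = [-1.0, 1.0]   # illustrative cut points for C = 3 (an assumption of this sketch)
print([to_ordinal(v, cuts) for v in (-2.0, -1.0, 0.0, 1.0, 1.5)])   # [1, 1, 2, 2, 3]
```

Values equal to a cut point fall in the lower category, matching the "less than or equal" convention of the definitions above.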

Three sample sizes, $n=1000$, $5000$, and $10{,}000$, are used for this model, with 1000 Monte Carlo replications. In the first simulation study, for $n=1000$ we have $n_{11}=250$, $n_{12}=301$, $n_{21}=332$, and $n_{22}=117$; for $n=5000$ we have $n_{11}=1340$, $n_{12}=1421$, $n_{21}=527$, and $n_{22}=1712$; and for $n=10{,}000$ we have $n_{11}=2504$, $n_{12}=2415$, $n_{21}=2540$, and $n_{22}=2541$. Figure 1 in Section 3 of the supplementary materials shows the demographic characteristics of the model for the first simulation study. So, the following simple model is the target model of the first simulation study:

$$\begin{aligned}Y_{i1j}\mid(U_i=j)&=\mu_{y_{i1j}}+\epsilon^{(1)}_{i1j},\\ Y^*_{i1j}\mid(U_i=j)&=\mu_{y^*_{i1j}}+\epsilon^{(2)}_{i1j},\\ Y^*_{i2j}\mid(U_i=j)&=\mu_{y^*_{i2j}}+\epsilon^{(2)}_{i2j},\end{aligned}\tag{7}$$

where $\mu_{y_{i1j}}=\gamma_jG_{ij}$, $\mu_{y^*_{i1j}}=\beta_{1j}H_{i1j}$, and $\mu_{y^*_{i2j}}=\beta_{2j}H_{i2j}$. The main results are presented in Table 1 in Section 3 of the supplementary materials. The table contains the average estimated parameters over all simulations, and they are close to the true values. So, model (7) produces consistent estimates of the regression parameters.
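One way to generate the cell memberships and the correlated errors for a design of this kind is sketched below, using only the Python standard library. It is not the authors' simulation code: the hand-written Cholesky factorization, the seed, the function names, and the 0-based cell labels are all assumptions of the sketch.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1):
            s = sum(low[i][m] * low[k][m] for m in range(k))
            if i == k:
                low[i][k] = math.sqrt(a[i][i] - s)
            else:
                low[i][k] = (a[i][k] - s) / low[k][k]
    return low

def simulate(n, sigma, pi, seed=0):
    """Draw n cell labels with probabilities pi and n error vectors from N(0, sigma)."""
    rng = random.Random(seed)
    low = cholesky(sigma)
    dim = len(sigma)
    cells = rng.choices(range(len(pi)), weights=pi, k=n)
    errors = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        errors.append([sum(low[i][m] * z[m] for m in range(i + 1)) for i in range(dim)])
    return cells, errors

sigma = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
cells, errors = simulate(20000, sigma, [0.25] * 4)
# The errors have zero mean, so the average product estimates the covariance.
cov01 = sum(e[0] * e[1] for e in errors) / len(errors)
print(round(cov01, 2))   # should be near 0.5
```

The second and third error components would then be discretized at the cut points to give the two ordinal responses.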

In the second study, a missing data mechanism related to the continuous response, $Y_{1j}$, was added to the model. Consider four continuous variables $Y_{1j}$, $R^*_{y_{1j}}$, $Y^*_{1j}$, and $Y^*_{2j}$. The variables are generated from a multivariate normal distribution with zero mean ($\alpha_j=0$ and $M_{ij}\sim N(0,1)$) and covariance matrix

$$\Sigma=\begin{pmatrix}1&0.5&0.5&0.5\\ &1&0.5&0.5\\ &&1&0.5\\ &&&1\end{pmatrix}.$$

$R_{y_{i1j}}$ is defined as follows:

$$R_{y_{i1j}}=\begin{cases}1, & \text{if } R^*_{y_{i1j}}>0,\\ 0, & \text{otherwise.}\end{cases}$$

Figure 2 in Section 3 of the supplementary materials shows the demographic characteristics of the model for the second simulation study and the rates of missingness in $Y_{1j}$. In this simulation study, we analyze the following model:

$$\begin{aligned}Y_{i1j}\mid(U_i=j)&=\mu_{y_{i1j}}+\epsilon^{(1)}_{i1j},\\ Y^*_{i1j}\mid(U_i=j)&=\mu_{y^*_{i1j}}+\epsilon^{(2)}_{i1j},\\ Y^*_{i2j}\mid(U_i=j)&=\mu_{y^*_{i2j}}+\epsilon^{(2)}_{i2j},\\ R^*_{y_{i1j}}\mid(U_i=j)&=\mu_{R^*_{y_{i1j}}}+\epsilon^{(3)}_{i1j},\end{aligned}\tag{8}$$

where $\mu_{R^*_{y_{i1j}}}=\alpha_jM_{ij}$. Estimation results are given in Table 2 in Section 3 of the supplementary materials. These results show that the parameter estimates are close to the true values of the parameters. Here, the true values of the parameters have been used as the initial values.

4.2. Simulation study 2: GLOM for mixed bounded continuous, ordinal, and nominal responses

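Consider model (5) and let $P=2$, $Q=1$, $D=2$, $J_1=J_2=2$, and $C=3$. Three effective sample sizes, 100, 500, and 1000, are used in this subsection. The outcome model is as follows:

$$\begin{aligned}\log\Bigl(\frac{\mu_{i1j}}{1-\mu_{i1j}-\mu_{i2j}}\Bigr)\Bigm|(U_i=j,b_{ij})&=\gamma_jG_{i1j}+b_{ij},\\ \log\Bigl(\frac{\mu_{i2j}}{1-\mu_{i1j}-\mu_{i2j}}\Bigr)\Bigm|(U_i=j,b_{ij})&=\gamma_jG_{i2j}+b_{ij},\\ \mathrm{logit}(\Pr(Z_{i1j}\le c))\mid(U_i=j,b_{ij})&=\beta_{1jc}+\beta_{1j}H_{i1j}+b_{ij},\quad c=1,2,\end{aligned}\tag{9}$$

where $j\in\{(1,1),(1,2),(2,1),(2,2)\}$. We also choose $\gamma_{(1,1)}=\gamma_{(1,2)}=\gamma_{(2,1)}=\gamma_{(2,2)}=1$, $\phi=2$, $\sigma_b=0.5$, $\pi_{(1,1)}=\pi_{(1,2)}=\pi_{(2,1)}=0.25$, $\beta_{1(1,1)1}=\beta_{1(1,2)1}=\beta_{1(2,1)1}=\beta_{1(2,2)1}=1$, $\beta_{1(1,1)2}=\beta_{1(1,2)2}=\beta_{1(2,1)2}=\beta_{1(2,2)2}=2$, and $\beta_{1(1,1)}=\beta_{1(1,2)}=\beta_{1(2,1)}=\beta_{1(2,2)}=1$ to perform this simulation study. The covariates $G_{ipj}$ and $H_{ij}$ for $p=1,2$ are simulated independently as follows:

$$H_{ij}\overset{iid}{\sim}N(0,1),\qquad\begin{pmatrix}G_{i1j}\\ G_{i2j}\end{pmatrix}\overset{iid}{\sim}N_2\left(0,\begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}\right).$$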
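Given covariate values, a bounded response vector for one individual could be drawn as in the sketch below. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the covariate values in the example are arbitrary, and the Dirichlet draw relies on the standard construction from independent gamma variates. The default parameter values are the ones listed above.

```python
import math
import random

def draw_response(g1, g2, gamma=1.0, phi=2.0, sigma_b=0.5, rng=random):
    """Draw (Y1, Y2, 1 - Y1 - Y2) for one individual: a shared random effect b,
    means from the baseline-category link, then a Dirichlet draw obtained by
    normalizing independent gamma variates with shapes phi * mu."""
    b = rng.gauss(0.0, sigma_b)                 # sigma_b is a standard deviation
    e1, e2 = math.exp(gamma * g1 + b), math.exp(gamma * g2 + b)
    denom = 1.0 + e1 + e2
    mu = [e1 / denom, e2 / denom, 1.0 / denom]
    draws = [rng.gammavariate(phi * m, 1.0) for m in mu]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(2020)
y = draw_response(g1=0.3, g2=-0.4, rng=rng)
print(abs(sum(y) - 1.0) < 1e-12, all(0.0 <= v <= 1.0 for v in y))
```

The first two components play the role of the two bounded responses, and the third is the remainder that makes the vector compositional.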

The ‘solnp’ function of the ‘Rsolnp’ package in R is used to maximize the log-likelihood function with respect to the parameters under the inequality constraints $\beta_{1(1,1)1}<\beta_{1(1,1)2}$, $\beta_{1(1,2)1}<\beta_{1(1,2)2}$, $\beta_{1(2,1)1}<\beta_{1(2,1)2}$, and $\beta_{1(2,2)1}<\beta_{1(2,2)2}$. Also, the ‘fdHess’ function is used to obtain the observed information matrix. The results are summarized in Table 3 in Section 3 of the supplementary materials. For $n=100$, there are two columns of estimates: the values in the first column (Est.) are obtained when the initial points are close to the real values, and the values in the second column (Est.0) are obtained when the initial points are chosen as follows. We fit eight separate beta regression models for $Y_{i1(1,1)}$, $Y_{i1(1,2)}$, $Y_{i1(2,1)}$, $Y_{i1(2,2)}$, $Y_{i2(1,1)}$, $Y_{i2(1,2)}$, $Y_{i2(2,1)}$, and $Y_{i2(2,2)}$. These give estimates of the $\gamma_j$s; the mean of the two estimates of each $\gamma_j$, from the models for $Y_{i1j}$ and $Y_{i2j}$, is our initial value. We also use a Dirichlet regression for $Y_1$, $Y_2$, and $1-Y_1-Y_2$ (with only an intercept in the model) to find an estimate of $\phi$. $\pi_j$ is estimated by $n_j/n$. The initial values for $\beta$ are found by fitting four separate cumulative logistic regression models for $Z_{i1j}$. Finally, in the above twelve models, we use a random sample generated from the standard normal distribution, together with $G_{ipj}$ and $H_{iqj}$, as the covariates, which gives some initial estimates of $\sigma_b$; the mean of all estimates of $\sigma_b$ from the twelve models is used here. The parameter estimates are evidently close to the true values, and the larger the value of $n$, the better the estimates and the smaller the standard errors. For $n=100$, comparing the Est. and Est.0 columns shows that the estimates obtained from the real parameter values as initial points and those obtained from our suggested initial values match to the first four decimal places.
Figure 3 in Section 3 of the supplementary materials presents the mean squared errors (MSE) as a function of $n$. As the number of subjects increases, the MSEs of the estimated parameters decrease.

5. Application

This section presents two applications of the models proposed in Section 2. First, we apply model (5) to the Tehran household expenditure budgets dataset described in the following subsection. Second, we apply model (2) to the medical dataset described in Subsection 5.2.

5.1. Application 1: Tehran household expenditure budgets data

The household expenditure data available to us come from a study of 37,962 Tehran households. This dataset, collected by the Statistical Centre of Iran in 2017, is available on the Statistical Centre of Iran website. For simplicity, we have selected a random sample of 1000 households from this set. For each household, data are available on the number of persons ( Gi1j), the total household income, the education level of the head of household ( Zij), whether there is a toddler at home ( Ui), the area of the house in square meters ( Gi2j), the smoking or drinking status of the head of household ( Gi3j), and the proportions of income spent on monthly expenditures in several commodity/service groups ( Yij), such as foodstuffs ( Yi1j), housing and fuel ( Yi2j), health-care ( Yi3j), and so on. A brief description of the data is presented in Table 1. Also, Table 4 of Section 3 of the supplementary materials shows the contingency table of the nominal and ordinal responses (U and Z). According to this table, the most frequent family type in the dataset is the family with no toddler whose household head has a primary education level.

Table 1. Descriptive statistics for the household expenditure budgets data.

| Variable name | Notation | Level | Level notation | Mean or percentage |
|---|---|---|---|---|
| Education | Z1 | Illiterate | 0 | 0.188 |
| | | Primary | 1 | 0.524 |
| | | Diploma | 2 | 0.170 |
| | | Higher educated | 3 | 0.118 |
| Toddler status | U1 | Having a toddler | 1 | 0.288 |
| | | No toddler | 0 | 0.712 |
| Monthly income | | | | 1916004 Toman |
| Foodstuffs expenditure | Y1 | | | 0.308 |
| Housing and fuel expenditure | Y2 | | | 0.228 |
| Health-care expenditure | Y3 | | | 0.064 |
| Number of members | G1 | | | 4 |
| Area of house | G2 | | | 100.425 |
| Smoking or drinking status | G3 | Yes | 1 | 0.243 |
| | | No | 0 | 0.757 |

The box plots of the monthly income and of the monthly expenditures on foodstuffs, housing and fuel, and health-care are presented in Figure 4 of Section 3 of the supplementary materials. This figure shows that the largest proportion of income is spent on foodstuffs. Also, some observations appear to be outliers based on these box plots. An outlier is an observation that appears to deviate markedly from the other observations in the sample. An outlier may indicate bad data: for example, the data may have been coded incorrectly or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible). Outliers may also be due to random variation or may indicate something scientifically interesting. In any event, we should not simply delete an outlying observation before a thorough investigation. If the data contain significant outliers, robust statistical techniques may be needed. Here, we do not think that the large income values have been coded incorrectly, and we do not remove any observation from our analysis. A careful treatment of outlier detection and robust regression for our model is left for future work; this paper does not focus on these methods. For the joint analysis, we consider the following general location model for the responses Y1, Y2, Y3, Z, and U:

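$$
\begin{aligned}
\log\left(\frac{\mu_{i1j}}{1-\mu_{i1j}-\mu_{i2j}-\mu_{i3j}}\right)\Big|\,(U_i=j,b_{ij}) &= \gamma_{1j}G_{i1j}+b_{ij},\\
\log\left(\frac{\mu_{i2j}}{1-\mu_{i1j}-\mu_{i2j}-\mu_{i3j}}\right)\Big|\,(U_i=j,b_{ij}) &= \gamma_{2j}G_{i2j}+b_{ij},\\
\log\left(\frac{\mu_{i3j}}{1-\mu_{i1j}-\mu_{i2j}-\mu_{i3j}}\right)\Big|\,(U_i=j,b_{ij}) &= \gamma_{3j}G_{i3j}+b_{ij},\\
\operatorname{logit}\left(\Pr(Z_{ij}\le c)\right)\big|\,(U_i=j,b_{ij}) &= \beta_{jc}+\beta_{j}G_{i3j}+b_{ij},\quad c=0,1,2,
\end{aligned}
\tag{10}
$$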

where $j\in\{0,1\}$ and $b_{ij}\sim N(0,\sigma_b^2)$. To specify the initial values of the parameters, which are needed by the optimization algorithm, we fit three separate beta regression models for Y1, Y2, and Y3, which give estimates of the γpj's. Also, we use a Dirichlet regression for Y1, Y2, Y3, and 1−Y1−Y2−Y3 (with only an intercept in the model) to find an initial estimate of φ. π1 is estimated by the proportion of families having a toddler, and the initial values for β are found by fitting a separate cumulative logistic regression model for Z. The initial value of σb is the mean of the estimates of σb obtained from the four models for Y1, Y2, Y3, and Z.
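To make the link functions in model (10) concrete, the following sketch inverts them: given the three linear predictors it recovers the mean proportions, and given cutpoints and a linear shift it recovers the cumulative and category probabilities of the ordinal response. It is a minimal illustration in Python with NumPy; the function names and all numerical inputs are hypothetical and are not the fitted values of the paper.

```python
import numpy as np

def dirichlet_means(eta):
    """Invert log(mu_p / (1 - sum_q mu_q)) = eta_p for the non-reference components."""
    expo = np.exp(np.asarray(eta, dtype=float))
    return expo / (1.0 + expo.sum())

def cumulative_probs(cutpoints, shift):
    """Pr(Z <= c) under logit(Pr(Z <= c)) = beta_c + shift, for each cutpoint beta_c."""
    z = np.asarray(cutpoints, dtype=float) + shift
    return 1.0 / (1.0 + np.exp(-z))

def category_probs(cutpoints, shift):
    """Probabilities of the ordered categories (one more category than cutpoints)."""
    cum = np.append(cumulative_probs(cutpoints, shift), 1.0)
    return np.diff(cum, prepend=0.0)

# Hypothetical linear predictors gamma_p * G_p + b for one household.
eta = np.array([0.4, -0.1, -1.5])
mu = dirichlet_means(eta)
print("mean proportions:", mu.round(3), "remainder:", round(1.0 - mu.sum(), 3))

# Hypothetical increasing cutpoints beta_c (c = 0, 1, 2) and shift beta * G + b.
print("category probabilities:", category_probs([-2.0, 0.5, 2.0], shift=0.3).round(3))
```

The cutpoints must be increasing for the category probabilities to be non-negative, which is the role played by the order constraints on the β cutpoints in the estimation.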

5.1.1. Results

Results of using model (10) are given in Table 2. From Table 2, the response variables Yij and Zij are dependent, as indicated by the significance of the parameter σb. As expected, the expenditure on foodstuffs is affected by the number of members in a family. Also, the area of the house has a significant effect on the expenditure on housing and fuel. Although the effect of smoking and drinking status on the level of education is negligible in families with no toddler, this status affects the expenditure on health-care services and the level of education in families with at least one toddler at home.

Table 2. Results of using model (10) (parameter estimates highlighted in bold are significant at the 5% level).

| Parameter | γ1(1) | γ1(2) | γ2(1) | γ2(2) | γ3(1) | γ3(2) | φ | σb | π1 |
|---|---|---|---|---|---|---|---|---|---|
| Est. | 0.211 | 0.215 | 0.006 | 0.005 | −0.584 | −0.705 | 8.38 | 1.192 | 0.288 |
| S.E. | 0.011 | 0.01 | 0.001 | 0.00032 | 0.122 | 0.075 | 0.32 | 0.048 | 0.014 |

| Parameter | β(1)0 | β(1)1 | β(1)2 | β(1) | β(2)0 | β(2)1 | β(2)2 | β(2) |
|---|---|---|---|---|---|---|---|---|
| Est. | −2.905 | 0.994 | 2.343 | 1.238 | −0.564 | 1.98 | 3.144 | −0.005 |
| S.E. | 0.313 | 0.155 | 0.184 | 0.324 | 0.106 | 0.105 | 0.139 | 0.1 |

5.2. Application 2: body mass index, steatosis, and osteoporosis data

The medical dataset is obtained from an observational study on women in the Taleghani hospital of Tehran, Iran. These data record the status of osteoporosis of the spine as an ordinal response and BMI and waist as continuous responses for 163 patients. BMI is defined on a continuous scale. The nominal variables that affect these variables are job status (employee or homemaker) and type of accommodation (apartment or private). Osteoporosis, which literally means ‘porous bone’, is a disease in which the density and quality of bone are reduced. As the bones become more porous and fragile, the risk of fracture is greatly increased. The loss of bone occurs ‘silently’ and progressively. Often there are no symptoms until the first fracture occurs. The most common fractures associated with osteoporosis occur at the hip, spine, and wrist. The incidence of these fractures, particularly at the hip and spine, increases with age in both women and men. Osteoporosis is most often seen in postmenopausal women, particularly light-skinned, small-framed women with a family history of osteoporosis. The loss of calcium from bones is the major effect of aging on the skeletal system. The body mass index (BMI), or Quetelet index, is defined as the individual's body weight divided by the square of his or her height; the formula universally used in medicine, BMI = weight (kg) / (height (m))², produces a measure in units of kg/m². Steatosis is fatty infiltration of the liver. When inflammation is associated with the fatty change, the term steatohepatitis is used. Steatosis is often but not exclusively an early histological feature of alcoholic liver disease (alcohol-related fatty liver) leading to alcohol-related steatohepatitis. The non-alcohol-related cases are known as non-alcoholic fatty liver disease (NAFLD) and non-alcoholic steatohepatitis (NASH). Osteoporosis and steatosis define our two ordinal variables ( Y1 and Y2), each with three levels:

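$$
Y_1=\begin{cases}
1, & \text{the person does not have osteoporosis},\\
2, & \text{the person has mild osteoporosis},\\
3, & \text{the person has severe osteoporosis},
\end{cases}
$$

and

$$
Y_2=\begin{cases}
1, & \text{the person does not have steatosis},\\
2, & \text{the person has mild steatosis},\\
3, & \text{the person has severe steatosis}.
\end{cases}
$$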

An interesting question is whether there are associations among osteoporosis, body mass index (BMI), and steatosis while simultaneously accounting for covariate effects on the responses. The explanatory variables that affect these variables are: (1) the amount of total body calcium (Ca), (2) job status (Job; employee or housekeeper), (3) type of accommodation (house or apartment), and (4) the systolic blood pressure (SBP), which is defined as the peak pressure in the arteries and occurs near the beginning of the cardiac cycle. The normal systolic pressure in adult humans is near but below 120 mmHg. Descriptive statistics (mean and standard deviation for the continuous response and frequency or percentage for the ordinal responses) are given in Table 3. Figure 5 of Section 3 of the supplementary materials shows the demographic characteristics for the BMI, steatosis, and osteoporosis data.

Table 3. Descriptive statistics for medical data.

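| Variable | Level | No. | Mean | S.E. | Percentage |
|---|---|---|---|---|---|
| BMI | | 143 | 29.357 | 10.806 | |
| Missing data | | 20 | | | |
| Osteoporosis of the spine | None | 59 | | | 0.362 |
| | Mild | 65 | | | 0.399 |
| | Severe | 39 | | | 0.239 |
| Steatosis | None | 39 | | | 0.240 |
| | Mild | 68 | | | 0.421 |
| | Severe | 53 | | | 0.339 |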

Table 3 shows a lower percentage for severe osteoporosis than for the none and mild levels. It also gives the frequency and percentage of the different levels of steatosis: mild steatosis is the most frequent level (42.1%), followed by severe steatosis (33.9%). The mean BMI reported in Table 3 (29.357) is high for the sample in hand. In our application, 20 of the 163 BMI values (about 12.3%) are missing. The model is

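$$
\begin{aligned}
\mathrm{BMI}_j \mid Ta, Job &= \gamma_{0j}+\gamma_{11j}\,\mathrm{SBP}_j+\gamma_{12j}\,\mathrm{Ca}_j+\varepsilon_{1j},\\
R_{\mathrm{BMI}j} \mid Ta, Job &= \gamma_{21j}\,\mathrm{SBP}_j+\gamma_{22j}\,\mathrm{Ca}_j+\varepsilon_{2j},\\
Y_{1j} \mid Ta, Job &= \gamma_{31j}\,\mathrm{SBP}_j+\gamma_{32j}\,\mathrm{Ca}_j+\varepsilon_{3j},\\
Y_{2j} \mid Ta, Job &= \gamma_{41j}\,\mathrm{SBP}_j+\gamma_{42j}\,\mathrm{Ca}_j+\varepsilon_{4j},
\end{aligned}
$$

where $J=(J_1,J_2)$, $J_d=1,2$. The covariance matrix of the vector of errors $(\varepsilon_{1j},\varepsilon_{2j},\varepsilon_{3j},\varepsilon_{4j})$ for this model is

$$
\Sigma=\begin{pmatrix}
\sigma^{2} & \sigma\rho_{12} & \sigma\rho_{13} & \sigma\rho_{14}\\
\sigma\rho_{12} & 1 & \rho_{23} & \rho_{24}\\
\sigma\rho_{13} & \rho_{23} & 1 & \rho_{34}\\
\sigma\rho_{14} & \rho_{24} & \rho_{34} & 1
\end{pmatrix}.
$$

Here, a multivariate normal distribution is assumed for the four errors, and the correlations between them, $\rho_{12}$, $\rho_{13}$, $\rho_{14}$, $\rho_{23}$, $\rho_{24}$, and $\rho_{34}$, are parameters that should also be estimated.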
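As an illustration of this covariance structure, the sketch below assembles the 4×4 matrix from one variance and six correlations and checks that it is a valid (positive definite) covariance matrix. It is written in Python with NumPy; the helper name `build_covariance` is hypothetical, and the numerical values are the point estimates reported in Table 4, used here only as example inputs.

```python
import numpy as np

def build_covariance(sigma2, rho):
    """Covariance of (eps1, eps2, eps3, eps4): Var(eps1) = sigma2, unit variance otherwise.

    rho maps index pairs (a, b), 1-based with a < b, to the correlation rho_ab.
    """
    corr = np.eye(4)
    for (a, b), value in rho.items():
        corr[a - 1, b - 1] = corr[b - 1, a - 1] = value
    scale = np.diag([np.sqrt(sigma2), 1.0, 1.0, 1.0])   # only eps1 has a free scale
    return scale @ corr @ scale

# Point estimates copied from Table 4 (illustrative inputs only).
rho_hat = {(1, 2): 0.315, (1, 3): -0.212, (1, 4): 0.559,
           (2, 3): 0.109, (2, 4): 0.299, (3, 4): 0.019}
cov = build_covariance(12.643, rho_hat)

print(np.round(cov, 3))
print("positive definite:", bool(np.all(np.linalg.eigvalsh(cov) > 0)))
```

A check of this kind is useful inside an optimizer, because an unconstrained update of the six correlations can leave the set of valid correlation matrices.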

5.2.1. Results

Results of using model (2) are given in Table 4. Model (2) shows a significant effect of SBP on BMI, a significant effect of SBP on the missing indicator for BMI, a significant effect of SBP on the probability of a low level of steatosis, and a significant effect of Ca on the probability of a low level of osteoporosis of the spine. From these effects, we can infer that people who live in an apartment have a higher BMI than people who live in a house, and that the greater the amount of calcium in the patient's body, the higher the probability of a low level of osteoporosis of the spine.

Table 4. Results of using model (2), (parameter estimates highlighted in bold are significant at 5% level).

| Parameter | Est. | S.E. | Parameter | Est. | S.E. |
|---|---|---|---|---|---|
| BMI | | | | | |
| γ0(1,1) | 21.386 | 4.218 | γ0(1,2) | 22.423 | 4.528 |
| γ0(2,1) | 23.678 | 5.321 | γ0(2,2) | 19.789 | 4.860 |
| γ11(1,1) | 0.037 | 0.024 | γ11(1,2) | 0.043 | 0.027 |
| γ11(2,1) | 0.035 | 0.028 | γ11(2,2) | 0.041 | 0.028 |
| γ12(1,1) | −0.252 | 0.514 | γ12(1,2) | −0.276 | 0.443 |
| γ12(2,1) | −0.243 | 0.421 | γ12(2,2) | −0.287 | 0.423 |
| RBMI | | | | | |
| γ21(1,1) | 0.152 | 0.111 | γ21(1,2) | 0.169 | 0.123 |
| γ21(2,1) | 0.163 | 0.167 | γ21(2,2) | 0.148 | 0.101 |
| γ22(1,1) | −0.001 | 0.005 | γ22(1,2) | −0.006 | 0.008 |
| γ22(2,1) | −0.007 | 0.009 | γ22(2,2) | −0.004 | 0.006 |
| Y1 | | | | | |
| γ31(1,1) | 0.213 | 0.123 | γ31(1,2) | 0.312 | 0.143 |
| γ31(2,1) | 0.067 | 0.059 | γ31(2,2) | 0.078 | 0.032 |
| γ32(1,1) | 0.044 | 0.020 | γ32(1,2) | 0.054 | 0.025 |
| γ32(2,1) | 0.018 | 0.154 | γ32(2,2) | 0.012 | 0.0149 |
| Y2 | | | | | |
| γ41(1,1) | 0.010 | 0.173 | γ41(1,2) | 0.012 | 0.154 |
| γ41(2,1) | 0.056 | 0.023 | γ41(2,2) | 0.061 | 0.024 |
| γ42(1,1) | 0.012 | 0.169 | γ42(1,2) | 0.013 | 0.174 |
| γ42(2,1) | 0.012 | 0.169 | γ42(2,2) | 0.013 | 0.174 |
| π(1,1) | 0.031 | 0.013 | π(1,2) | 0.025 | 0.012 |
| π(2,1) | 0.030 | 0.013 | π(2,2) | 0.031 | 0.013 |
| σ2 | 12.643 | 0.251 | ρ12 | 0.315 | 0.038 |
| ρ13 | −0.212 | 0.086 | η2 | −1.787 | 1.295 |
| ρ24 | 0.299 | 0.223 | η1 | −2.643 | 1.304 |
| ρ14 | 0.559 | 0.113 | ξ1 | 1.205 | 1.475 |
| ρ23 | 0.109 | 0.089 | ξ2 | 2.283 | 1.470 |
| ρ34 | 0.019 | 0.011 | | | |

For model (2), the correlation parameters ρ12, ρ13, and ρ14 are strongly significant. They show a positive correlation between BMI and the missing indicator for BMI ( ρ^12=0.315), a negative correlation between BMI and osteoporosis of the spine ( ρ^13=−0.212), and a positive correlation between BMI and steatosis ( ρ^14=0.559). From these results, we can conclude that the missing indicator for BMI is related to BMI but is not related to the two ordinal responses. This leads to an NMAR mechanism for BMI, that is, a correlation between the error terms of BMI and RBMI.
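As a rough check on which correlation parameters are distinguishable from zero, one can compare each estimate with its standard error. The sketch below computes the ratio estimate/S.E. for the six correlations in Table 4 and flags those exceeding 1.96 in absolute value. The use of a two-sided Wald z-test with the normal 5% critical value is an assumption of this example; the paper does not state which test underlies the significance markings in the table.

```python
# Estimates and standard errors of the correlation parameters, copied from Table 4.
table4_correlations = {
    "rho12": (0.315, 0.038),
    "rho13": (-0.212, 0.086),
    "rho14": (0.559, 0.113),
    "rho23": (0.109, 0.089),
    "rho24": (0.299, 0.223),
    "rho34": (0.019, 0.011),
}

CRITICAL_VALUE = 1.96  # two-sided 5% level under a normal approximation (assumed)

def wald_ratio(estimate, std_error):
    """Ratio of an estimate to its standard error."""
    return estimate / std_error

flagged = []
for name, (estimate, std_error) in table4_correlations.items():
    z = wald_ratio(estimate, std_error)
    if abs(z) > CRITICAL_VALUE:
        flagged.append(name)
    print(f"{name}: z = {z:.2f}")

print("flagged at the 5% level:", flagged)
```

Under this assumed rule, only the correlations involving the BMI error term exceed the threshold.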

6. Sensitivity analysis

We use sensitivity analysis to study how the model output varies with changes in the model inputs. Cook [7] presented a general method for assessing the local influence of minor perturbations of a statistical model.

Generally, one introduces perturbations into the model through the $q\times 1$ vector $\omega$, which is restricted to some open subset $\Omega$ of $\mathbb{R}^q$, and $\theta$ is a $p\times 1$ vector of unknown parameters. Cook [7] has shown that the normal curvature $C_l$ of the lifted line in the direction $l$ can be easily calculated by

$$
C_l = 2\left|\,l^{\top}\Delta^{\top}(\ddot{L})^{-1}\Delta\, l\,\right|, \tag{11}
$$

where

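$$
\Delta_i=\left.\frac{\partial^{2} l_i(\theta\mid\omega_i)}{\partial\omega_i\,\partial\theta}\right|_{\theta=\hat{\theta},\,\omega_i=0}
$$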

and $\Delta$ is the $p\times n$ matrix with $\Delta_i$ as its $i$th column, while $\ddot{L}$ denotes the $p\times p$ matrix of second-order derivatives of $l(\theta\mid\omega_0)$ with respect to $\theta$, for some $\omega_0$ in $\Omega$, also evaluated at $\theta=\hat{\theta}$. Clearly, $C_l$ can be calculated for any direction $l$. One evident choice is the vector $l_i$ containing one in the $i$th position and zero elsewhere, corresponding to the perturbation of the $i$th weight only.
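The curvature in (11) is a quadratic form, so it is straightforward to evaluate once Δ and L̈ are available. The following sketch, in Python with NumPy, computes the curvature for a given direction and the maximal curvature with its direction via an eigendecomposition. The matrices are random placeholders rather than quantities from the fitted model, and the function names are hypothetical.

```python
import numpy as np

def normal_curvature(delta, hessian, direction):
    """C_l = 2 |l' Delta' Lddot^{-1} Delta l| for the unit vector l along `direction`."""
    l = np.asarray(direction, dtype=float)
    l = l / np.linalg.norm(l)
    v = delta @ l                                   # Delta l, a p-vector
    return 2.0 * abs(v @ np.linalg.solve(hessian, v))

def maximal_curvature(delta, hessian):
    """Largest curvature over all unit directions, and a direction attaining it."""
    f = delta.T @ np.linalg.solve(hessian, delta)   # Delta' Lddot^{-1} Delta
    f = 0.5 * (f + f.T)                             # symmetrize against rounding error
    eigvals, eigvecs = np.linalg.eigh(f)
    k = int(np.argmax(np.abs(eigvals)))
    return 2.0 * abs(eigvals[k]), eigvecs[:, k]

# Placeholder inputs: p = 3 parameters, n = 5 perturbation weights.
rng = np.random.default_rng(1)
p, n = 3, 5
delta = rng.standard_normal((p, n))
a = rng.standard_normal((p, p))
hessian = -(a @ a.T + np.eye(p))                    # negative definite, as at a maximum

c_max, l_max = maximal_curvature(delta, hessian)
c_single = [normal_curvature(delta, hessian, np.eye(n)[i]) for i in range(n)]
print("curvature per single weight:", np.round(c_single, 3))
print("maximal curvature:", round(c_max, 3))
```

Each single-weight curvature is bounded by the maximal curvature, which is why the maximal direction is the natural summary of local sensitivity.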

To find the condition for MAR, let $W_j=(Z_j,Y_{1j},Y_{2j})=(W_{\mathrm{obs},j},W_{\mathrm{mis},j})$ and $R_j=(R_{zj},R_{Y_1 j})$, where $W_{\mathrm{obs},j}$ is the vector of latent variables related to the observed part of $W_j$ and $W_{\mathrm{mis},j}$ is the vector of latent variables related to the missing part of $W_j$. According to our joint model, the vector of responses along with the missing indicators, $(W_j,R_j)=(W_{\mathrm{obs},j},W_{\mathrm{mis},j},R_j)$, has a multivariate normal distribution with the following covariance structure:

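$$
\Sigma=\begin{pmatrix}
\Sigma_{o,o} & \Sigma_{o,m} & \Sigma_{o,R}\\
\Sigma_{m,o} & \Sigma_{m,m} & \Sigma_{m,R}\\
\Sigma_{R,o} & \Sigma_{R,m} & \Sigma_{R,R}
\end{pmatrix},
$$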

where

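$$
\begin{aligned}
\Sigma_{o,o}&=\operatorname{cov}(W_{\mathrm{obs},j},W_{\mathrm{obs},j}), &
\Sigma_{m,m}&=\operatorname{cov}(W_{\mathrm{mis},j},W_{\mathrm{mis},j}), &
\Sigma_{o,m}&=\operatorname{cov}(W_{\mathrm{obs},j},W_{\mathrm{mis},j}),\\
\Sigma_{o,R}&=\operatorname{cov}(W_{\mathrm{obs},j},R_j), &
\Sigma_{R,R}&=\operatorname{cov}(R_j,R_j).
\end{aligned}
$$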

The joint density function can also be partitioned as

$$
f(W_j,R_j)=f(W_{\mathrm{mis},j},R_j\mid W_{\mathrm{obs},j})\,f(W_{\mathrm{obs},j}),
$$

where f(Wmisj,Rj|Wobsj) and f(Wobsj) have, respectively, a conditional and a marginal normal distribution. According to the missing mechanism definitions, to have an MAR mechanism the covariance matrix of the above mentioned conditional normal distribution, i.e.

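$$
\begin{aligned}
\Sigma_{m,R\mid o}&=\operatorname{cov}(W_{\mathrm{mis},j},R_j\mid W_{\mathrm{obs},j})\\
&=\begin{pmatrix}\Sigma_{m,m} & \Sigma_{m,R}\\ \Sigma_{R,m} & \Sigma_{R,R}\end{pmatrix}
-\begin{pmatrix}\Sigma_{m,o}\\ \Sigma_{R,o}\end{pmatrix}\Sigma_{o,o}^{-1}\begin{pmatrix}\Sigma_{m,o}\\ \Sigma_{R,o}\end{pmatrix}^{\top}\\
&=\begin{pmatrix}
\Sigma_{m,m}-\Sigma_{m,o}\Sigma_{o,o}^{-1}\Sigma_{m,o}^{\top} & \Sigma_{m,R}-\Sigma_{m,o}\Sigma_{o,o}^{-1}\Sigma_{R,o}^{\top}\\
\Sigma_{R,m}-\Sigma_{R,o}\Sigma_{o,o}^{-1}\Sigma_{m,o}^{\top} & \Sigma_{R,R}-\Sigma_{R,o}\Sigma_{o,o}^{-1}\Sigma_{R,o}^{\top}
\end{pmatrix},
\end{aligned}
$$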

should satisfy the following constraint:

$$
\Sigma_{m,R}-\Sigma_{m,o}\Sigma_{o,o}^{-1}\Sigma_{R,o}^{\top}=0. \tag{12}
$$

For our application (see Section 5.2), we have missing values only for our continuous variable, so that $W_{\mathrm{obs}}=(Y_1,Y_2)$ and $W_{\mathrm{mis}}=Z$ ($Y_1$ and $Y_2$ are the ordinal responses and $Z$ is the continuous response). For the missing mechanism we only need to define $R=R_z$, as we do not have any missing values for our ordinal responses, and

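$$
\Sigma_{o,o}=\begin{pmatrix}1 & \rho_{13}\\ \rho_{13} & 1\end{pmatrix},\quad
\Sigma_{m,R}=\sigma\rho_{12},\quad
\Sigma_{R,R}=1,\quad
\Sigma_{m,m}=\sigma^{2},\quad
\Sigma_{m,o}=(\sigma\rho_{13},\sigma\rho_{14}),\quad
\Sigma_{R,o}=(\rho_{23},\rho_{24}),
$$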

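so that the above constraint reduces to

$$
\omega_{\mathrm{MAR}}=\sigma\rho_{12}-\left[\frac{1}{1-\rho_{13}^{2}}\left(\sigma\rho_{13}\rho_{23}-\sigma\rho_{14}\rho_{13}\rho_{23}-\sigma\rho_{13}^{2}\rho_{24}+\sigma\rho_{14}\rho_{24}\right)\right]=0.
$$

The quantity $\omega_{\mathrm{MAR}}$, the left-hand side of this constraint, is used as the weight that defines the perturbation of the MAR model.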
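The left-hand side of constraint (12) can be evaluated numerically from the covariance blocks. The sketch below, in Python with NumPy, does this for a scalar missing variable and two observed variables. The function name is hypothetical; the inputs are the point estimates from Table 4, with Σo,o taking ρ13 as its off-diagonal entry as written in the text and with ΣR,o taken as (ρ23, ρ24), that is, with unit variance for the missing indicator and the ordinal responses. Both choices are assumptions of this example, and the resulting number is illustrative only.

```python
import numpy as np

def mar_constraint(s_mR, s_mo, s_oo, s_Ro):
    """Left-hand side of (12): S_mR - S_mo S_oo^{-1} S_Ro' (zero under MAR)."""
    s_mR = np.atleast_2d(np.asarray(s_mR, dtype=float))
    s_mo = np.atleast_2d(np.asarray(s_mo, dtype=float))
    s_Ro = np.atleast_2d(np.asarray(s_Ro, dtype=float))
    s_oo = np.asarray(s_oo, dtype=float)
    return s_mR - s_mo @ np.linalg.solve(s_oo, s_Ro.T)

# Point estimates from Table 4, used as example inputs.
sigma = np.sqrt(12.643)
rho12, rho13, rho14, rho23, rho24 = 0.315, -0.212, 0.559, 0.109, 0.299

omega_mar = mar_constraint(
    s_mR=sigma * rho12,
    s_mo=[sigma * rho13, sigma * rho14],
    s_oo=[[1.0, rho13], [rho13, 1.0]],
    s_Ro=[rho23, rho24],
)[0, 0]
print("omega_MAR:", round(float(omega_mar), 4))
```

A value far from zero indicates a departure from the MAR condition; a value near zero is what the constraint requires.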

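Also, to assess the condition for MAR together with the correlated responses, let

$$
\omega=(\omega_{\mathrm{MAR}},\rho_{12},\rho_{13},\rho_{14},\rho_{23},\rho_{24},\rho_{34}),
$$

and let $\omega_0=(0,0,0,0,0,0,0)$, so that $l(\theta\mid\omega_0)$ is the log-likelihood function that corresponds to the MAR and correlated model.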

This reflects the influence of the condition for MAR of the responses and of the correlated responses. The corresponding local influence measure, denoted by $C_l$, then becomes $C_l=2|\Delta_l^{\top}\ddot{L}^{-1}\Delta_l|$. Another important direction is the direction $l_{\max}$ of maximal normal curvature $C_{\max}$. It shows how to perturb the condition for MAR of the responses to obtain the largest local change. $C_{\max}$ is the largest eigenvalue of $\Delta^{\top}\ddot{L}^{-1}\Delta$ and $l_{\max}$ is the corresponding eigenvector. To carry out the sensitivity analysis, we compute $C_{\max}$. The obtained curvature, $C_{\max}=14.133$, computed from (3), indicates extreme local sensitivity.

7. Discussion

In this article, we proposed a general location multivariate latent variable model for mixed nominal, ordinal, and continuous responses with and without non-ignorable missing values. This procedure should be used when specifying general location models for all the variables with non-ignorable missing values is difficult. Extending our methodology to allow for clustering in the data is ongoing research on our part. We are also exploring generalizations of the model to the multivariate case of more than one discrete and more than one continuous outcome, including the incorporation of random effects. The challenge here lies in defining models that allow for different levels of association among outcomes, as in longitudinal studies, and that always guarantee proper joint distributions.

Supplementary Material

Supplementary_Material.pdf

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Amiri L., Khazaei M., and Ganjali M., The grouped continuous model for multivariate ordered categorical variables and covariate adjustment, Adv. Data. Anal. Classif. 11 (2017), pp. 593–609. doi: 10.1007/s11634-016-0258-6
  • 2. Anderson J.A. and Pemberton J.D., The grouped continuous model for multivariate ordered categorical variables and covariate adjustment, Biometrics 41 (1985), pp. 875–885. doi: 10.2307/2530960
  • 3. Anholeto T., Sandoval M.C., and Botter D.A., Adjusted Pearson residuals in beta regression models, J. Stat. Comput. Simul. 84 (2014), pp. 999–1014. doi: 10.1080/00949655.2012.736993
  • 4. Barreto-Souza W. and Simas A.B., Improving estimation for beta regression models via em-algorithm and related diagnostic tools, J. Stat. Comput. Simul. 87 (2017), pp. 2847–2867. doi: 10.1080/00949655.2017.1350679
  • 5. Belin T., Hu M.Y., Young A.S., and Grusky O., Performance of a general location model with an ignorable missing-data assumption in a multivariate mental health services study, Stat. Med. 18 (1999), pp. 3123–3135.
  • 6. Chen S.X. and Tang C.Y., Nonparametric regression with discrete covariate and missing values, Stat. Interface. 4 (2011), pp. 463–474. doi: 10.4310/SII.2011.v4.n4.a5
  • 7. Cook R.D., Assessment of local influence, J. R. Stat. Soc. Ser. B 48 (1986), pp. 133–169.
  • 8. Cox D.R., The analysis of multivariate binary data, J. Appl. Stat. 21 (1972), pp. 113–126. doi: 10.2307/2346482
  • 9. Cox D.R. and Wermuth N., Response models for mixed binary and quantities variables, Biometrika 79 (1992), pp. 441–461. doi: 10.1093/biomet/79.3.441
  • 10. Cui R., Bucur I.G., Groot P., and Heskes T., A novel Bayesian approach for latent variable modeling from mixed data with missing values, Stat. Comput. 29 (2019), pp. 1–17. doi: 10.1007/s11222-017-9790-2
  • 11. De Leon A.R. and Carriere K.C., The one-sample location hypothesis for mixed bivariate data, Comm. Statist. Theory Methods 29 (2000), pp. 2573–2581. doi: 10.1080/03610920008832623
  • 12. De Leon A.R. and Carriere K.C., General mixed data model: extension of general location and grouped continuous models, Can. J. Stat. 35 (2007), pp. 533–548. doi: 10.1002/cjs.5550350405
  • 13. De Leon A.R. and Chough K.C., Analysis of Mixed Data: Methods and Applications, CRC Press, New York, 2013.
  • 14. De Souza D.F. and Da Silva Moura F.A., Multivariate beta regression with application in small area estimation, J. Off. Stat. 32 (2016), pp. 747–768. doi: 10.1515/jos-2016-0038
  • 15. Ferrari S.L.P. and Cribari-Neto F., Beta regression for modeling rates and proportions, J. Appl. Stat. 31 (2004), pp. 799–815. doi: 10.1080/0266476042000214501
  • 16. Ferrari S.L. and Pinheiro E.C., Improved likelihood inference in beta regression, J. Stat. Comput. Simul. 81 (2011), pp. 431–443. doi: 10.1080/00949650903389993
  • 17. Heckman J., Dummy endogenous variable in a simultaneous equation system, Econometrica 6 (1978), pp. 931–959. doi: 10.2307/1909757
  • 18. Kieschnick R. and McCullough B.D., Regression analysis of variates observed on (0, 1): percentages, proportions and fractions, Stat. Modell. 3 (2003), pp. 193–213. doi: 10.1191/1471082X03st053oa
  • 19. Little R.J. and Rubin D., Statistical Analysis with Missing Data, 2nd ed., Wiley, New York, 2002, p. 14.
  • 20. Maier M.J., DirichletReg: Dirichlet regression for compositional data in R, Research Report Series / Department of Statistics and Mathematics, 125. WU Vienna University of Economics and Business, Vienna, 2014.
  • 21. Mirkamali S.J. and Ganjali M., A general location model with zero-inflated counts and skew normal outcomes, J. Appl. Stat. 44 (2017), pp. 2716–2728. doi: 10.1080/02664763.2016.1261813
  • 22. Olkin I. and Tate R.F., Multivariate correlation models with mixed discrete and continuous variables, Ann. Math. Statist. 32 (1961), pp. 743–453. doi: 10.1214/aoms/1177705052
  • 23. Paleti R., Bhat C.R., and Pendyala R.M., Integrated model of residential location, work location, vehicle ownership, and commute tour characteristics, Transp. Res. Rec. 2382 (2013), pp. 162–172. doi: 10.3141/2382-18
  • 24. Pearson K., Mathematical Contribution to the Theory of Evolution. Xiii. on the Theory of Contingency and Its Relation to Association and Normal Correlation, Biometrics, Series I. Drapers Co. Research Memoirs, Dulau and Co., London, 1904.
  • 25. Peng Y.H., Little R.J.A., and Raghunathan T.E., An extended general location model for causal inferences from data subject to noncompliance and missing values, Biometrics 60 (2004), pp. 598–607. doi: 10.1111/j.0006-341X.2004.00208.x
  • 26. Poon W.Y. and Lee S.Y., Maximum likelihood estimation of multivariate polyserial and polychoric correlation coefficients, Psychometrika 52 (1987), pp. 409–430. doi: 10.1007/BF02294364
  • 27. Rubin D.B., Inference and missing data, Biometrika 63 (1976), pp. 581–590. doi: 10.1093/biomet/63.3.581
  • 28. Tabrizi E., Samani E.B., and Ganjali M., Analysis of mixed correlated bivariate zero-inflated count and (k,l)-inflated beta responses with application to social network datasets, Commun. Stat. Theory Methods 48 (2018), pp. 1651–1681. doi: 10.1080/03610926.2018.1435815
  • 29. Tabrizi E., Samani E.B., and Ganjali M., Joint modeling of mixed zero-inflated count and (k,l)-inflated beta longitudinal responses with nonignorable missing values for social network analysis. Multivariate Behavioral Research J Appl. Statistics 2020, submitted for publication.
  • 30. Tabrizi E., Samani E.B., and Ganjali M., A note on the identifiability of latent variable models for mixed longitudinal data. Stat. Probab. Lett. 2020, submitted for publication.
  • 31. Wang W., Identifiability of linear mixed effects models, Electron. J. Stat. 7 (2013), pp. 244–263. doi: 10.1214/13-EJS770

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
