Abstract
Using a multivariate latent variable approach, this article proposes new general models for analyzing correlated bounded continuous and categorical (nominal and/or ordinal) responses with and without non-ignorable missing values. First, we discuss regression methods for jointly analyzing continuous, nominal, and ordinal responses, motivated by data from studies of toxicity development. Second, using the beta and Dirichlet distributions, we extend the models so that bounded continuous responses take the place of the continuous responses. The joint distribution of the bounded continuous, nominal, and ordinal variables is decomposed into a marginal multinomial distribution for the nominal variable and a conditional multivariate joint distribution for the bounded continuous and ordinal variables given the nominal variable. We estimate the regression parameters under the new general location models by maximum likelihood. A sensitivity analysis is also performed to study the influence of small perturbations of the parameters of the missingness mechanisms on the maximal normal curvature. The proposed models are applied to two data sets: the BMI, steatosis, and osteoporosis data and the Tehran household expenditure budgets data.
Keywords: Beta regression, conditional grouped continuous model, general mixed data model, latent variable, the maximal normal curvature
2010 Mathematics Subject Classifications: 62J12, 62J05
1. Introduction
Percentages, proportions, and fractions are examples of variables supported on the standard unit interval. Examples of proportions include the proportion of household income spent on electronic devices, the proportion of homicides involving firearms, and the proportion of crude oil converted to gasoline after distillation. Various models have been proposed to analyze such data. One of them is the beta regression model introduced by Kieschnick and McCullough [18]; Ferrari and Cribari-Neto [15] then broadened its use by reparameterizing the beta distribution in terms of mean and dispersion parameters. This model has been studied by many researchers, such as Ferrari and Pinheiro [16], Anholeto et al. [3], Barreto-Souza and Simas [4], and Tabrizi et al. [28,29]. There are many practical situations in which the dataset contains more than one response supported on the standard unit interval. Multivariate beta regression models that use a copula function to construct the joint distribution of the responses were proposed by De Souza and Da Silva Moura [14]. Sometimes the responses lying in a bounded interval sum to a constant (compositional responses). For example, consider a demographic study in which the response variables are the proportions of the population following specific religions. Let Y1, Y2, Y3, and Y4 denote the proportions of the people living in a town who are Christians, Muslims, Jews, and followers of the remaining religions, respectively. It is clear that Y1 + Y2 + Y3 + Y4 = 1. Two other examples of compositional data are the sediment composition of a lake, in which samples are classified by weight into sand, silt, and clay, and the proportions of household income spent on housing and fuel, foodstuffs, health-care, and remaining expenditures. Dirichlet regression models, studied by Maier [20], can be used to analyze compositional data.
Multivariate data containing mixtures of continuous and discrete responses are common. In particular, correlated continuous and categorical responses are commonly collected in medical studies. For example, consider data from a medical study where the correlated responses are the ordinal responses of steatosis and osteoporosis of the spine and the continuous response of body mass index, with the possibility of non-ignorable missing values. Sometimes the continuous response also has bounded support. As an example, consider again data from a medical study in which the bounded continuous responses are the proportions of four serum protein components in blood samples (Albumin, PreAlbumin, Globulin A, and Globulin B), which affect a particular disease, and the ordinal responses are again steatosis and osteoporosis of the spine. Analyzing each response separately has shortcomings: it can give biased parameter estimates and misleading inference [11]. A way out of this problem is to use methods that model the mixed data jointly while accounting for non-ignorable missingness mechanisms. The joint distribution of the mixed responses can be specified in two different ways: (1) specifying the marginal distribution of the discrete variables and the conditional distribution of the continuous variables, given the discrete variables, or (2) specifying the marginal distribution of the continuous variables and the conditional distribution of the discrete variables, given the continuous variables [13]. Under the first approach, one method for jointly modelling continuous and nominal variables is the general location model (GLOM). Olkin and Tate [22] described a general location model based on a multinomial distribution for the nominal response and a multivariate Gaussian model for the continuous response conditional on the discrete response.
In each cell formed by the levels of the nominal variables, the continuous variables are assumed to have a multivariate normal distribution with a constant covariance matrix and a cell-specific mean, without accounting for covariate effects on the responses. Note that the GLOM is suited to continuous and nominal responses and does not accommodate dependence between ordinal and nominal responses. Also, no version of the GLOM in the literature allows a continuous response with bounded support. In contrast, under the second approach, Cox [8] described a model in which the marginal distribution of the continuous response is Gaussian and is multiplied by a logistic representation of the conditional distribution of the binary response given the continuous outcome. More recently, Cox and Wermuth [9] compared a number of different models based on these two factorizations of the joint distribution. Another method models the continuous and discrete responses simultaneously, using correlated errors to account for the correlation between the responses [17]. To model correlated continuous and ordinal responses simultaneously, one method uses the concept of a latent variable: the ordinal response is assumed to arise by partitioning the range of an unobservable continuous variable, called a latent variable, into non-overlapping intervals [24]. The conditional grouped continuous model (CGCM) uses latent variables and applies the second approach: it takes the multivariate normal distribution both as the distribution of the latent variables and as the conditional distribution of the latent variables given the continuous responses [2,26].
Finally, to model correlated continuous, ordinal, and nominal responses simultaneously, De Leon and Carriere [12] combined the GLOM and the CGCM into a model called the general mixed data model (GMDM). The joint distribution of the responses in the GMDM is expressed as the product of the joint distribution of the nominal and continuous responses (GLOM) and the conditional distribution of the ordinal responses given the nominal and continuous responses (CGCM). Paleti et al. [23] extended the GMDM to include mixed continuous, nominal, ordinal, and count variables. Mirkamali and Ganjali [21] extended the GLOM using a joint model for analyzing zero-inflated outcomes and skew continuous outcomes. Amiri et al. [1] proposed a general location model with a factor analyzer covariance matrix structure.
For data with missing values, traditional methods are unsuitable because they ignore the missing data mechanism and may therefore give biased and inconsistent estimates. Rubin [27] and Little and Rubin [19] defined a typology of incomplete-data models and drew important distinctions between the types of missingness mechanism. They classified the missing data mechanism into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The mechanism is MCAR if the probability of the observed missingness indicator depends neither on the observed responses nor on the missing responses; MAR if, given the observed responses, it does not depend on the missing responses; and MNAR if, given the observed responses, it depends on the missing responses. For likelihood-based inference, MCAR and MAR can be regarded as ignorable; MCAR is ignorable for both sampling-based and likelihood-based inference. Under MNAR, the mechanism is non-ignorable. The missing data mechanism is therefore the key factor in choosing an approach to missing values. There are different strategies for dealing with missing values. In this paper, we use a well-known technique in which a dummy variable indicates whether a variable is missing.
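The distinction between the three mechanisms can be illustrated with a small simulation. The sketch below is not part of the proposed models: the logistic form of the missingness probability, the coefficients, and the names `missing_prob` and `complete_case_mean` are all assumptions chosen for illustration. It generates a dummy missingness indicator for a response whose true mean is zero and reports the mean of the values that remain observed.

```python
import math
import random


def missing_prob(y_obs, y_mis, mechanism):
    """Probability that y_mis is missing (hypothetical logistic mechanism)."""
    if mechanism == "MCAR":
        eta = -1.0                  # depends on neither response
    elif mechanism == "MAR":
        eta = -1.0 + 1.5 * y_obs    # depends only on the always-observed response
    elif mechanism == "MNAR":
        eta = -1.0 + 1.5 * y_mis    # depends on the possibly missing value itself
    else:
        raise ValueError(mechanism)
    return 1.0 / (1.0 + math.exp(-eta))


def complete_case_mean(mechanism, n=20000, seed=1):
    """Mean of the observed values of a response whose true mean is 0."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        y_obs = rng.gauss(0.0, 1.0)
        y_mis = 0.6 * y_obs + rng.gauss(0.0, 0.8)   # correlated with y_obs
        # Dummy variable: r = 1 if y_mis is missing, 0 if it is observed.
        r = 1 if rng.random() < missing_prob(y_obs, y_mis, mechanism) else 0
        if r == 0:
            kept.append(y_mis)
    return sum(kept) / len(kept)


for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(complete_case_mean(mech), 3))
```

In this toy setting the complete-case mean stays near zero only under MCAR. Under MAR the bias can be removed by conditioning on the observed response, whereas under MNAR the mechanism itself has to be modelled.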
For the general location model with missing data, Belin et al. [5] extended the model under the MAR assumption in a health study. Peng et al. [25] proposed an extension of the general location model for causal inference with non-compliance and missing data on the outcomes. They also considered the effect of covariates. In the above research, a univariate approach to the discrete and continuous responses with a MAR mechanism is assumed. Cui et al. [10] considered other methods, such as the Bayesian approach. They proposed a novel Bayesian Gaussian copula factor approach that is proven to be consistent when the data are missing completely at random (MCAR). This approach addresses the problem of learning the parameters of latent variable models from mixed continuous and ordinal data with missing values. Chen and Tang [6] considered a non-parametric approach, proposing a non-parametric regression with a mixture of continuous and discrete explanatory variables and a missing response.
In this paper, we extend the model of De Leon and Carriere [12], along with the multivariate latent variable approach, to bounded continuous, continuous, ordinal, and nominal responses with non-ignorable missing data. We also propose a general location model for univariate and multivariate bounded responses using the beta and Dirichlet distributions. This model has not yet been considered by other researchers. The joint distribution of the bounded continuous, nominal, and ordinal responses is decomposed into a marginal multinomial distribution for the nominal variables and a conditional joint distribution for the bounded continuous and ordinal responses given the nominal responses. In each cell formed by the levels of the nominal responses, the bounded continuous variables with missing values follow beta and Dirichlet distributions with a cell-specific mean, and covariate effects on the responses are taken into account simultaneously. We also apply random effects and latent variables to account for the correlation between responses.
In Section 2, we first review the GMDM for mixed correlated nominal, continuous, and ordinal responses and then extend it to a new model that allows non-ignorable missing values in the continuous responses. Second, we extend the general location models to mixed correlated nominal, bounded continuous, and ordinal data with and without non-ignorable missing responses. Section 3 is devoted to a brief discussion of the identifiability issues associated with the proposed joint models. In Section 4, simulation studies are conducted to assess the performance of the proposed models. In Section 5, two datasets illustrate the application of the proposed models: data on body mass index, osteoporosis of the spine, waist, job status, and type of accommodation for patients of the Taleghani hospital of Tehran, and data on household expenditure budgets, the education level of the household head, and toddler-at-home status in Tehran. In Section 6, a sensitivity analysis is performed to study the influence of a small perturbation of the parameters of the missingness mechanism on the maximal normal curvature. Finally, the paper ends with some remarks in the last section.
2. Models and likelihoods
Let denote the vector of nominal variables, where the dth component of , , has possible states such that and therefore . So the vector defines a contingency table with cells, and represents a specific cell of this table. Suppose denotes the continuous or bounded continuous response vector of the ith individual in cell j, where is the pth component of the response vector for and . Also, let denote the ordinal response vector defined as
where , , are the cut-point parameters, and denote the underlying latent variables for ordinal responses. Suppose that the mentioned three responses are recorded for n individuals and the cell frequency for each cell j is denoted by . So, it is clear that , where is the set of all possible states for j. Moreover, the expected probability of falling into cell j is denoted by , such that . In all models in this paper, like the general location model, are modelled by a multinomial distribution where is the vector of state probabilities. The nominal variables make together a contingency table. As an example, Figure C1 shows the demographic characteristics of the modelling for D = 2, , S = 4, , P = 1, Q = 2, , , and .
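The latent-variable construction of an ordinal response can be sketched in a few lines. This is a minimal illustration, assuming the common convention that category c is observed when the latent value lies in the interval (θ_{c−1}, θ_c]; the function name `to_ordinal` and the numerical cut-points are hypothetical.

```python
import bisect


def to_ordinal(latent, cut_points):
    """Map a latent value to a category 1..C using C-1 increasing cut-points.

    Assumed convention: category c is observed when
    cut_points[c-2] < latent <= cut_points[c-1].
    """
    if any(a >= b for a, b in zip(cut_points, cut_points[1:])):
        raise ValueError("cut-points must be strictly increasing")
    # bisect_left counts the cut-points strictly below the latent value.
    return bisect.bisect_left(cut_points, latent) + 1


cuts = [-0.5, 0.5]   # two cut-points give C = 3 ordered categories
print([to_ordinal(z, cuts) for z in (-1.2, 0.0, 0.5, 1.7)])
```

Because only the interval containing the latent value is observed, the location and scale of the latent variable are not determined by the data. This is why identifiability restrictions on intercepts and variances are needed.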
2.1. GMDM for mixed continuous, ordinal, and nominal data
In this subsection, suppose that the support of the distribution of the continuous responses is not bounded.
2.1.1. Complete data model
In GMDM, the joint distribution of (the joint distribution of the jth cell of the contingency table) can be factorized as
where , and denote the marginal distribution of , the conditional distribution of given and , and the conditional distribution of given , respectively. The GMDM takes the form:
(1) |
where , , , also and are the vector of the explanatory variables with regard to and , respectively, for the ith individual in the cell j. Also, , where , , , and is -variate normal distribution with zero mean and covariance matrix
such that Σ shows the covariance matrix between two errors (or two responses). So, the dependence between continuous responses and continuous latent variables has been modelled simultaneously.
The likelihood for this model, given in the Appendix (Subsection 1.1 of the supplementary materials), shows the simplification obtained by assuming normally distributed errors in the system of Equations (1).
2.1.2. Incomplete data model
When missing data occur in an outcome, let be the missingness indicator vector related to , where is defined as
Also, suppose that is the missingness indicator vector related to where is defined as
In the above definitions, and denote the underlying latent variables of the non-response mechanism for the continuous and ordinal variables, respectively. The GMDM takes the form:
(2) |
where , , and are the vector of the explanatory variables for the ith individual in the cell j. Let
where , , and
where , for u = 1, 2, 3, 4 and , u<v, u, v = 1, 2, 3, 4 and . Note that if one of the matrices is not zero, then the missing data mechanism of the response is not missing completely at random.
The likelihood for this model, given in the Appendix (Subsection 1.2 of the supplementary materials), shows the simplification obtained by assuming normally distributed errors in the system of Equations (2).
2.2. GLOM for mixed bounded continuous, ordinal, and nominal data
Here, we want to extend the general location model for multivariate bounded continuous responses using the Dirichlet distribution. In this subsection, assume that is bounded continuous for , , and such that it is a recorded percentage or proportion. So, . Note that if such that a and b ( ) are the known real numbers, we can use the transformation ( ) and then model the by the method introduced in this article. Let , such that
(3) |
where , , and is the gamma function. By the above parameterization of the beta distribution, and . Also, let the joint distribution of be the Dirichlet distribution ( ) of order P + 1, such that
(4) |
, , , and for .
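As a numerical companion to the beta parameterization, the sketch below assumes the mean and precision form of Ferrari and Cribari-Neto, in which the two shape parameters are μφ and (1 − μ)φ. The function names and the example values are hypothetical. It also includes the rescaling of a response bounded on a known interval (a, b) to the unit interval.

```python
import math


def beta_logpdf(y, mu, phi):
    """Log-density of a beta variable with mean mu in (0, 1) and precision phi > 0.

    Assumed parameterization: shape1 = mu * phi, shape2 = (1 - mu) * phi.
    """
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def beta_mean_var(mu, phi):
    """Mean and variance implied by the mean and precision parameterization."""
    return mu, mu * (1.0 - mu) / (1.0 + phi)


def rescale(y, a, b):
    """Map a response bounded on (a, b) onto the unit interval."""
    return (y - a) / (b - a)


print(beta_mean_var(0.3, 10.0))
print(round(math.exp(beta_logpdf(rescale(25.0, 0.0, 100.0), 0.3, 10.0)), 4))
```

A larger φ gives a smaller variance for the same mean, which is why φ is usually read as a precision parameter.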
Note 2.1
Dirichlet regression models are commonly used to analyze a set of bounded continuous responses that sum to a constant. Based on the above notation, this regression is usually used when , where . In such a situation, the joint distribution of is as follows:
It is suitable for modelling the compositional data.
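The compositional constraint can be checked by simulation. The sketch below assumes a mean and precision form of the Dirichlet distribution in which the concentration parameters are φ times the component means. The helper names `dirichlet_alphas` and `sample_dirichlet` and the example means are hypothetical. Draws are obtained by normalizing independent gamma variates, a standard construction.

```python
import random


def dirichlet_alphas(means, phi):
    """Concentration parameters phi * mu_k for component means summing to one."""
    if abs(sum(means) - 1.0) > 1e-9:
        raise ValueError("component means must sum to one")
    return [phi * m for m in means]


def sample_dirichlet(alphas, rng):
    """One Dirichlet draw obtained by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
alphas = dirichlet_alphas([0.5, 0.3, 0.2], phi=20.0)
sample = sample_dirichlet(alphas, rng)
print([round(s, 3) for s in sample], round(sum(sample), 6))
```

Each component of such a draw is marginally beta distributed with mean μ_k and precision φ, which links the Dirichlet model to the univariate beta model.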
2.2.1. Complete data model
An extended general location model for correlated (0, 1)-supported, ordinal, and nominal responses takes the form:
(5) |
For , with C−1 strictly increasing model intercept parameters and the link function . Also, the shared random effect is represented by assumed to be distributed in the population as given . Note that for , and are independent. Moreover, does not contain an intercept. It is noteworthy that under this model
So, model (5) guarantees that for and . The joint distribution of can be factorized as
Note that and the and given are independent. The is a parameter related to the dependency between -supported and ordinal responses. The likelihood of model (5) is presented in the Appendix (the Subsection 1.3 of the supplementary materials).
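To make the ordinal part of such a model concrete, the sketch below computes category probabilities from C−1 strictly increasing intercepts. It assumes a logit link and the sign convention P(Z ≤ c) = logistic(θ_c − η), where η stands for a linear predictor that may include a shared random effect. The link, the sign convention, and the function name are illustrative assumptions.

```python
import math


def cumulative_logit_probs(thetas, eta):
    """Category probabilities under P(Z <= c) = logistic(theta_c - eta).

    thetas: C-1 strictly increasing intercepts (cut-points).
    eta: linear predictor, for example covariate effects plus a random effect.
    """
    cumulative = [1.0 / (1.0 + math.exp(-(t - eta))) for t in thetas] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)   # P(Z = c) = P(Z <= c) - P(Z <= c-1)
        previous = value
    return probs


print([round(p, 4) for p in cumulative_logit_probs([-1.0, 1.0], eta=0.0)])
```

Strictly increasing intercepts are what keep every category probability positive, so the ordering constraint has to be respected during estimation.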
2.2.2. Incomplete data model
Again, let and be the observed missingness indicators related to and . To manage the missing mechanism issue, the latent variables and can be applied, but now we prefer to use another approach. Let and be distributed according to Bernoulli distribution. The incomplete joint model takes the form:
(6) |
For , where the statistical significance of and implies that the missingness mechanism is MNAR.
The likelihood for this model is given in the Appendix (Subsection 1.4 of the supplementary materials). The likelihoods of models (1) and (2) can be maximized with the function ‘nlminb’ in the software R. This function uses the PORT optimization routines described in ‘http://netlib.bell-labs.com/cm/cs/cstr/153.pdf’ and handles unconstrained and box-constrained problems. Many problems in statistics, however, require finding the parameter values that maximize a likelihood function subject to constraints on the parameters. For example, in models (5) and (6) we know that . Such constraints make the problem harder to solve. Many more packages are available for solving optimization problems in R. The ‘Rsolnp’ package solves general non-linear optimization problems using an augmented Lagrange multiplier method with a sequential quadratic programming (SQP) interior algorithm. For this reason, we choose its ‘solnp’ function to maximize the likelihoods of models (5) and (6). Moreover, the observed Hessian matrix can be obtained from the optimizer output where available or computed numerically with the function ‘fdHess’.
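An alternative to a constrained optimizer, shown here only as an illustration and not as the method used in this paper, is to reparameterize so that the constraints hold automatically. The sketch assumes two constraints of the kind that arise in such models: a positive precision parameter and strictly increasing intercepts. The function name and the parameter layout are hypothetical.

```python
import math


def to_constrained(u):
    """Map unconstrained reals to (phi, thetas).

    u[0]  -> phi = exp(u[0]) > 0
    u[1]  -> first intercept, unrestricted
    u[2:] -> log-increments, so each intercept exceeds the previous one
    """
    phi = math.exp(u[0])
    thetas = [u[1]]
    for log_step in u[2:]:
        thetas.append(thetas[-1] + math.exp(log_step))
    return phi, thetas


phi, thetas = to_constrained([0.7, -1.0, 0.3, -2.0])
print(round(phi, 4), [round(t, 4) for t in thetas])
```

Any unconstrained optimizer can then search over `u`. Standard errors on the original scale require a further step such as the delta method, which is one reason a direct constrained optimizer may be preferred.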
3. Identifiability conditions
Model identifiability is discussed here. All proofs are given in the Appendix (Section 2 of the supplementary materials).
Definition 3.1
The model is identifiable if for any two different values in Θ, the corresponding probability distributions and are different.
Now, first, we review three theorems which can be proven using the same argument as applied in [30]. We apply them to prove the other theorems presented in this section. In the following, we call the model of all s or s (response variables of all individuals) the joint model and the model of the ith response variables the individual model, where and . In the following theorems, we use the non-identifiability definition of a covariance matrix discussed by Wang [31].
Theorem 3.2
The joint model is identifiable if and only if at least one individual model is identifiable.
Applying Theorem 3.2 to model (1) yields that , , , and are identifiable if they are identifiable in the separate models for , , , and responses, respectively. Also, in model (5), , , and are identifiable if they are identifiable in the separate models for , , and responses, respectively. Moreover, is identifiable if it is identifiable in at least one of the separate models for and responses.
Proposition 3.3
Model (1) is identifiable under the following conditions:
The parameter vector γ can include the intercept parameters but β should not include any intercept.
is restricted to be a correlation matrix in which all diagonal elements equal one.
All design matrices have full ranks.
contains at least one continuous covariate.
Proposition 3.4
The necessary conditions for identifiability of model (2) are:
The parameter vector γ, α, and η can include the intercept parameters but β should not include any intercept.
and are restricted to be the correlation matrices and in which all diagonal elements equal one.
All design matrices have full ranks.
Proposition 3.5
in model (5) is identifiable under the following conditions:
for .
Let γ= ( ). The parameter vector γ should not include any intercept.
There are at least two values for p (p and ) such that all the components in the covariate vectors and take all values in and where contains at least one interval and zero. For other covariates, it is enough that their contain zero.
For example, and can be , , , , or , where .
4. Simulation study
Here, three different simulation studies are considered to assess the performance of models (1) and (5) for complete data and model (2) for incomplete data. Note that the optimization algorithms require initial values for the parameters. Since the true parameter values are known in the simulation studies, one may use values close to them as initial values when evaluating the performance of the models under study. However, these values are unknown in practice, and the final estimation results might be affected by the initial values. Therefore, we suggest at least one appropriate method for specifying initial points. As a comparison, we estimate the parameters of model (5) in the second simulation study under two different sets of initial values, the true parameter values and our suggested initial points, and compare the results. Tables and figures are relegated to the Appendix (Section 3 of the supplementary materials).
4.1. Simulation study 1: GMDM for mixed continuous, ordinal, and nominal responses
In the first study, let D = 2, , P = 1, Q = 2, and C = 3. So, we consider the case of two nominal variables and each with two states. The variables and are generated from a multinomial distribution with . Also, the variables , , and are generated from a multivariate normal distribution with zero mean ( ) and covariance matrix
First, we study model (1) with the covariates , , and designed to take values in . So, we generate them from normal distributions such that
Let and . The ordinal variables and with three levels are defined as
and
Three sample sizes, n = 1000, 5000, and 10,000, are used for this model, with 1000 Monte Carlo replications. In the first simulation study, for n = 1000, we have , , , and , for n = 5000, we have , , , and , and for n = 10,000, we have , , , and . Figure 1 of Section 3 of the supplementary materials shows the demographic characteristics of the model for the first simulation study. So, the following simple model is the target model of the first simulation study
(7) |
where , , and . The main results are presented in Table 1 of Section 3 of the supplementary materials. The table contains the parameter estimates averaged over all simulations, which are close to the true values, suggesting that model (7) yields consistent estimates of the regression parameters.
In the second study, a missing data mechanism related to the continuous response, , was added to the model. Consider four continuous variables , , , and . The variables are generated from a multivariate normal distribution with zero mean ( and ) and covariance matrix
is defined as follows:
Figure 2 of the Section 3 of the supplementary materials shows the demographic characteristics of the model for the second simulation study and the rates of missingness in . In this simulation study, we analyze the following model:
(8) |
where . Estimation results are given in Table 2 of the Section 3 of the supplementary materials. These results show that the parameter estimates are close to the true values of the parameters. Here, the true values of parameters have been used as the initial values.
4.2. Simulation study 2: GLOM for mixed bounded continuous, ordinal, and nominal responses
Consider the model (5) and let P = 2, Q = 1, D = 2, , and C = 3. Three effective sample sizes 100, 500, and 1000 are used in this subsection. The outcome model is as follows:
(9) |
where . We also choose , , , , , , and to perform this simulation study. The covariates and for p = 1, 2 are simulated independently as follows:
The ‘solnp’ function of the ‘Rsolnp’ package in the software ‘R’ is used to maximize the log-likelihood function with respect to the parameters under the inequality constraints , , , and . Also, the ‘fdHess’ function is used to obtain the observed information matrix. The results are summarized in Table 3 of Section 3 of the supplementary materials. For n = 100, there are two columns of estimates: the values in the first column (Est.) are obtained when the initial points are close to the true values, and the values in the second column (Est.0) are obtained when the initial points are chosen as follows. We fit eight separate beta regression models for , , , , , , , and . This gives us some estimates for s. The means of two obtained estimates for each from two models for , and are our target initial values. Also, we use a Dirichlet regression for , , and (only intercept in the model) to find an estimate for φ. is estimated by . The initial values for β are found by fitting four separate cumulative logistic regression models for . Finally, in the above twelve models, we use a random sample generated from the standard normal distribution, , and as the covariates. So, in these models, we obtain some initial estimates for . The mean of all obtained estimates for from the twelve models is considered here. The parameter estimates are close to the true values, and the larger the value of n, the better the estimates and the smaller the standard errors. For n = 100, the estimates in the Est. column (true parameter values as initial points) and those in the Est.0 column (our suggested initial values) match up to the first four decimal places. Figure 3 of Section 3 of the supplementary materials presents the mean squared errors (MSE) as a function of n. As the number of subjects increases, the MSEs of the estimated parameters decrease.
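The way bias and MSE are summarized over Monte Carlo replications can be sketched with a deliberately simple stand-in. Instead of the joint model, the toy example below estimates the mean of a single beta response by the sample mean. The true values μ = 0.3 and φ = 10, the number of replications, and the function names are assumptions made only for illustration.

```python
import random
import statistics


def simulate_beta(n, rng, mu=0.3, phi=10.0):
    """Draw n beta responses with mean mu and precision phi."""
    return [rng.betavariate(mu * phi, (1.0 - mu) * phi) for _ in range(n)]


def monte_carlo(n, reps, true_mu=0.3, seed=0):
    """Bias and MSE of the sample mean over `reps` simulated data sets."""
    rng = random.Random(seed)
    estimates = [statistics.fmean(simulate_beta(n, rng, mu=true_mu))
                 for _ in range(reps)]
    bias = statistics.fmean(estimates) - true_mu
    mse = statistics.fmean([(e - true_mu) ** 2 for e in estimates])
    return bias, mse


for n in (100, 500, 1000):
    bias, mse = monte_carlo(n, reps=200)
    print(n, round(bias, 4), f"{mse:.2e}")
```

The same loop applies to the full model once the data generator and the estimator are replaced by the joint simulation and the maximum likelihood fit.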
5. Application
This section contains two applications of the models proposed in Section 2. First, we use model (5) for the Tehran household expenditure budgets dataset described in the following subsection. Second, we apply model (2) to the medical dataset described in Subsection 5.2.
5.1. Application 1: Tehran household expenditure budgets data
The household expenditure data available to us come from a study of 37,962 Tehran households. This dataset, collected by the Statistical Centre of Iran in 2017, is available on the Statistical Centre of Iran website. From this set, we have for simplicity selected a random sample of 1000 households. For each household, data are available on the number of persons ( ), the total household income, the education level of the head of household ( ), toddler-at-home status ( ), the area of the house in square meters ( ), the smoking or drinking status of the head of household ( ), and the proportion of income spent on monthly expenditures in some commodity/service groups ( ) such as foodstuffs ( ), housing and fuel ( ), health-care ( ), and so on. A brief description of the data is presented in Table 1. Also, Table 4 of Section 3 of the supplementary materials shows the contingency table of the nominal and ordinal responses (U and Z). According to this table, the most frequent type of family in the dataset is a family with no toddler whose household head has primary education.
Table 1. Descriptive statistics for the household expenditure budgets data.
Variable name | Notation | Level | Level notation | Mean or proportion |
---|---|---|---|---|
Education | Z | Illiterate | 0 | 0.188 |
 | | Primary | 1 | 0.524 |
 | | Diploma | 2 | 0.170 |
 | | Higher educated | 3 | 0.118 |
Toddler status | U | Having a toddler | 1 | 0.288 |
 | | No toddler | 0 | 0.712 |
Monthly income | | | | 1916004 Toman |
Foodstuffs expenditure | | | | 0.308 |
Housing and fuel expenditure | | | | 0.228 |
Health-care expenditure | | | | 0.064 |
Number of members | | | | 4 |
Area of house | | | | 100.425 |
Smoking or drinking status | | Yes | 1 | 0.243 |
 | | No | 0 | 0.757 |
The box plots of the monthly income and of the monthly expenditures on foodstuffs, housing and fuel, and health-care are presented in Figure 4 of Section 3 of the supplementary materials. This figure shows that the largest proportion of income is spent on foodstuffs. Some observations also appear to be outliers based on these box plots. An outlier is an observation that appears to deviate markedly from the other observations in the sample. An outlier may indicate bad data: for example, the data may have been coded incorrectly, or an experiment may not have been run correctly. If it can be determined that an outlying point is in fact erroneous, the outlying value should be deleted from the analysis (or corrected if possible). Outliers may also be due to random variation or may indicate something scientifically interesting. In any event, an outlying observation should not simply be deleted before a thorough investigation. If the data contain significant outliers, robust statistical techniques may be needed. Here, we do not think that the large income values have been coded incorrectly, and we do not remove any observation from our analysis. A careful treatment of outlier detection in our regression model, possibly followed by a robust regression method, is left for future work; in this paper, we do not focus on outlier detection or robust regression methods. For the joint analysis, we consider the following general location model for the , , , Z, and U responses:
(10) |
where and . To specify the initial values of the parameters needed by the optimization algorithms, we fit three separate beta regression models for , , and . This gives us some estimates for s. Also, we use a Dirichlet regression for , , , and (considering only an intercept in the model) to find an estimate for φ. is estimated by the proportion of families having a toddler, and finally the initial values for β are found by fitting a separate cumulative logistic regression model for Z. The mean of all obtained estimates for from the four models for , , , and Z is considered here.
5.1.1. Results
Results of using model (10) are given in Table 2. From Table 2, the response variables and are dependent according to the significance of the parameter . As expected, the expenditure on foodstuffs is affected by the number of members in a family. Also, the area of the house has a significant effect on the expenditure on housing and fuel. Although the effect of smoking and drinking status on the level of education is negligible in families with no toddler, it affects the expenditure on health-care services and the level of education in families having at least one toddler at home.
Table 2. Results of using model (10) (parameter estimates highlighted in bold are significant at the 5% level).
Parameter | φ | ||||||||
---|---|---|---|---|---|---|---|---|---|
Est. | 0.211 | 0.215 | 0.006 | 0.005 | −0.584 | −0.705 | 8.38 | 1.192 | 0.288 |
S.E. | 0.011 | 0.01 | 0.001 | 0.00032 | 0.122 | 0.075 | 0.32 | 0.048 | 0.014 |
Parameter | |||||||||
Est. | −2.905 | 0.994 | 2.343 | 1.238 | −0.564 | 1.98 | 3.144 | −0.005 | |
S.E. | 0.313 | 0.155 | 0.184 | 0.324 | 0.106 | 0.105 | 0.139 | 0.1 |
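The source does not state which test underlies the 5% significance statements. One common choice, assumed here purely for illustration, is a two-sided Wald z test that compares each estimate with its standard error. The helper name `wald_test` is hypothetical; the estimate and standard error pairs are copied from Table 2, whose parameter labels are not reproduced here.

```python
import math


def wald_test(estimate, std_error):
    """Two-sided Wald z test of H0: parameter = 0.

    Returns (z, p_value), where p_value = 2 * (1 - Phi(|z|)),
    computed as erfc(|z| / sqrt(2)).
    """
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# (estimate, standard error) pairs taken from Table 2.
pairs = [(0.211, 0.011), (-0.584, 0.122), (-0.005, 0.1)]
for est, se in pairs:
    z, p = wald_test(est, se)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"est={est:>7}  se={se:<6}  z={z:8.3f}  p={p:.4f}  {verdict} at 5%")
```

A test against zero is only meaningful for regression and correlation parameters; for a precision parameter such as φ, which is positive by construction, the comparison would need a different null value.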
5.2. Application 2: body mass index, steatosis, and osteoporosis data
The medical dataset is obtained from an observational study on women in the Taleghani hospital of Tehran, Iran. These data record the status of osteoporosis of the spine as an ordinal response and BMI and waist as continuous responses for 163 patients. BMI is defined on a continuous scale. The nominal variables that affect these responses are job status (employee or homemaker) and type of accommodation (apartment or private house). Osteoporosis, which literally means ‘porous bone’, is a disease in which the density and quality of bone are reduced. As the bones become more porous and fragile, the risk of fracture is greatly increased. The loss of bone occurs ‘silently’ and progressively, and often there are no symptoms until the first fracture occurs. The most common fractures associated with osteoporosis occur at the hip, spine, and wrist. The incidence of these fractures, particularly at the hip and spine, increases with age in both women and men. Osteoporosis is most often seen in postmenopausal women, particularly light-skinned, small-framed women with a family history of osteoporosis. The loss of calcium from bones is the major effect of aging on the skeletal system. The body mass index, or Quetelet index, is defined as the individual's body weight divided by the square of his or her height. The formula universally used in medicine produces a measure in units of kg/m². Steatosis is fatty infiltration of the liver. When inflammation is associated with the fatty change, the term steatohepatitis is used. Steatosis is often but not exclusively an early histological feature of alcoholic liver disease (alcohol-related fatty liver) leading to alcohol-related steatohepatitis. The non-alcohol-related cases are known as non-alcoholic fatty liver disease (NAFLD) and non-alcoholic steatohepatitis (NASH). Osteoporosis and steatosis define our two ordinal variables ( and ) with three levels as
and
An interesting question is whether there are associations between osteoporosis, body mass index (BMI), and steatosis when covariate effects on the responses are taken into account simultaneously. The explanatory variables that affect these responses are: (1) the amount of total body calcium (Ca), (2) job status (Job, employee or housekeeper), (3) type of accommodation (house or apartment), and (4) the systolic blood pressure (SBP), which is defined as the peak pressure in the arteries and occurs near the beginning of the cardiac cycle. The normal systolic pressure in adult humans is close to but below 120 mmHg. Descriptive statistics (mean and standard deviation for the continuous response and frequency or percentage for the ordinal responses) are given in Table 3. Figure 5 in Section 3 of the supplementary materials shows the demographic characteristics of the BMI, steatosis, and osteoporosis data.
Table 3. Descriptive statistics for medical data.
Variable | Level | No. | Mean | S.D. | Proportion |
---|---|---|---|---|---|
BMI | Observed | 143 | 29.357 | 10.806 | |
BMI | Missing | 20 | | | |
Osteoporosis of the spine | None | 59 | | | 0.362 |
Osteoporosis of the spine | Mild | 65 | | | 0.399 |
Osteoporosis of the spine | Severe | 39 | | | 0.239 |
Steatosis | None | 39 | | | 0.240 |
Steatosis | Mild | 68 | | | 0.421 |
Steatosis | Severe | 53 | | | 0.339 |
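The osteoporosis proportions reported in Table 3 and the share of missing BMI values can be checked directly from the counts. The short sketch below recomputes them from the figures in the table; the variable names are illustrative.

```python
# Counts for osteoporosis of the spine, as reported in Table 3.
osteoporosis_counts = {"None": 59, "Mild": 65, "Severe": 39}
n_patients = sum(osteoporosis_counts.values())

# Proportion of patients at each level, rounded as in the table.
osteoporosis_props = {
    level: round(count / n_patients, 3)
    for level, count in osteoporosis_counts.items()
}

# Observed and missing BMI values, as reported in Table 3.
bmi_observed, bmi_missing = 143, 20
missing_share = bmi_missing / (bmi_observed + bmi_missing)

print(n_patients, osteoporosis_props)
print(f"missing BMI share: {100 * missing_share:.1f}%")
```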
Table 3 shows that severe osteoporosis is less frequent than the none and mild levels. It also gives the frequency and percentage of the different levels of steatosis. As can be seen, more than 50 percent of individuals have severe steatosis. The mean BMI is 28.266, which is high for the sample at hand. In our application, 20 of the 163 BMI values (about 12.3%) are missing. The model is
where . The covariance matrix of the vector of errors , for this model is
Here, a multivariate normal distribution is assumed for the four errors, with the correlations between them named , , , , , . These parameters must also be estimated.
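Because the six correlations are free parameters, any candidate set of values must yield a valid (positive definite) correlation matrix for the four errors. The sketch below, which is not part of the authors' estimation procedure, builds a 4 × 4 correlation matrix from six hypothetical correlations and checks positive definiteness with a Cholesky factorization. The function names and numerical values are assumptions made for illustration.

```python
import math


def correlation_matrix(r12, r13, r14, r23, r24, r34):
    """Assemble the symmetric 4 x 4 correlation matrix of four error terms."""
    return [
        [1.0, r12, r13, r14],
        [r12, 1.0, r23, r24],
        [r13, r23, 1.0, r34],
        [r14, r24, r34, 1.0],
    ]


def is_positive_definite(m):
    """Attempt a Cholesky factorization; it succeeds only for positive definite m."""
    n = len(m)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot means m is not positive definite
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True


# Hypothetical correlation values, not estimates from the paper.
valid = correlation_matrix(0.3, -0.2, 0.1, 0.25, 0.0, 0.15)
invalid = correlation_matrix(0.9, 0.9, 0.0, -0.9, 0.0, 0.0)
print(is_positive_definite(valid), is_positive_definite(invalid))
```

A check of this kind is useful inside an optimizer, where an unconstrained step can otherwise move the correlations outside the admissible region.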
5.2.1. Results
Results of using model (2) are given in Table 4. Model (2) shows a significant effect of SBP on BMI, a significant effect of SBP on the missing indicator for BMI, a significant effect of SBP on the probability of a low value of steatosis, and a significant effect of Ca on the probability of a low value of osteoporosis of the spine. From these effects, we can infer that people who live in an apartment have a higher BMI than people who live in a house, and that the greater the amount of calcium in the body of the patient, the higher the probability of a low value of osteoporosis of the spine.
Table 4. Results of using model (2) (parameter estimates highlighted in bold are significant at the 5% level).
Parameter | Est. | S.E. | Parameter | Est. | S.E. |
---|---|---|---|---|---|
BMI | |||||
21.386 | 4.218 | 22.423 | 4.528 | ||
23.678 | 5.321 | 19.789 | 4.860 | ||
0.037 | 0.024 | 0.043 | 0.027 | ||
0.035 | 0.028 | 0.041 | 0.028 | ||
−0.252 | 0.514 | −0.276 | 0.443 | ||
−0.243 | 0.421 | −0.287 | 0.423 | ||
0.152 | 0.111 | 0.169 | 0.123 | ||
0.163 | 0.167 | 0.148 | 0.101 | ||
−0.001 | 0.005 | −0.006 | 0.008 | ||
−0.007 | 0.009 | −0.004 | 0.006 | ||
0.213 | 0.123 | 0.312 | 0.143 | ||
0.067 | 0.059 | 0.078 | 0.032 | ||
0.044 | 0.020 | 0.054 | 0.025 | ||
0.018 | 0.154 | 0.012 | 0.0149 | ||
0.010 | 0.173 | 0.012 | 0.154 | ||
0.056 | 0.023 | 0.061 | 0.024 | ||
0.012 | 0.169 | 0.013 | 0.174 | ||
0.012 | 0.169 | 0.013 | 0.174 | ||
0.031 | 0.013 | 0.025 | 0.012 | ||
0.030 | 0.013 | 0.031 | 0.013 | ||
12.643 | 0.251 | 0.315 | 0.038 | ||
−0.212 | 0.086 | −1.787 | 1.295 | ||
0.299 | 0.223 | −2.643 | 1.304 | ||
0.559 | 0.113 | 1.205 | 1.475 | ||
0.109 | 0.089 | 2.283 | 1.470 | ||
0.019 | 0.011 |
For model (2), the correlation parameters , , , and are strongly significant. They show a positive correlation between BMI and the missing indicator for BMI ( ), a negative correlation between BMI and osteoporosis of the spine ( ), and a positive correlation between BMI and steatosis ( ). From these results, we conclude that the missing indicator for BMI is related to BMI but is not related to the two ordinal responses. This implies an NMAR mechanism for BMI, that is, a correlation between the error terms of BMI and .
6. Sensitivity analysis
We used sensitivity analysis to study how the model output varies with changes in the model inputs. Cook [7] presented a general method for assessing the local influence of minor perturbations of a statistical model.
Generally, one introduces perturbations into the model through the vector ω, which is restricted to some open subset Ω of , and θ is a vector of unknown parameters. Cook [7] has shown that the normal curvature of the lifted line in the direction l can be easily calculated by
(11) |
where
and define Δ as the matrix with as its ith column, and let denote the matrix of second-order derivatives of , for some in Ω, with respect to θ, also evaluated at . Clearly, can be calculated for any direction l. One natural choice is the vector containing one in the ith position and zero elsewhere, which corresponds to perturbing the ith weight only.
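For orientation, the standard statement of Cook's result can be written as below. The symbols ℓ(θ | ω) for the perturbed log-likelihood, ω₀ for the no-perturbation point, θ̂ for the maximum-likelihood estimate, and L̈ for the Hessian are notational assumptions and may differ from those used in Equation (11).

```latex
\[
C_l = 2\,\bigl|\, l^{\top} \Delta^{\top} \ddot{L}^{-1} \Delta\, l \,\bigr|,
\qquad \lVert l \rVert = 1,
\]
\[
\Delta = \left.\frac{\partial^{2} \ell(\theta \mid \omega)}
                    {\partial \theta\, \partial \omega^{\top}}
         \right|_{\theta=\hat{\theta},\ \omega=\omega_{0}},
\qquad
\ddot{L} = \left.\frac{\partial^{2} \ell(\theta \mid \omega_{0})}
                      {\partial \theta\, \partial \theta^{\top}}
           \right|_{\theta=\hat{\theta}} .
\]
% Perturbing the i-th component only (l equal to the i-th unit vector) gives
\[
C_i = 2\,\bigl|\, \Delta_i^{\top} \ddot{L}^{-1} \Delta_i \,\bigr| ,
\]
% where \Delta_i is the i-th column of \Delta.
```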
To find the condition for MAR, let , and where is the vector of latent variables related to the observed part of , and is the vector of latent variables related to the missing part of . According to our joint model, the vector of responses along with the missing indicators has a multivariate normal distribution with the following covariance structure:
where
The joint density function can also be partitioned as
where and have, respectively, a conditional and a marginal normal distribution. According to the missing mechanism definitions, to have an MAR mechanism the covariance matrix of the above mentioned conditional normal distribution, i.e.
should satisfy the following constraint:
(12) |
For our application (see the application section), we have missing values only for our continuous variable, and we may have and ( and are ordinal responses and Z is the continuous response). For the missing mechanism we only need to define , as we do not have any missing values for our ordinal responses, and
so that the above constraint will be reduced to,
The
as weight defines the perturbation of the MAR model.
Also, to find the condition for MAR and correlated responses, let
and to be the log-likelihood function that corresponds to MAR and correlated model.
This reflects the influence of the condition for MAR of the responses and of the correlated responses. The corresponding local influence measure, denoted by , then becomes . Another important direction is the direction of maximal normal curvature . It shows how to perturb the condition for MAR of the responses to obtain the largest local changes in . is the largest eigenvalue of and is the corresponding eigenvector. For the sensitivity analysis we find . This is confirmed by the curvature computed from (3), which indicates extreme local sensitivity.
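The curvature quantities described above can be sketched numerically as follows. The example assumes Cook's standard form, C_l = 2|lᵀΔᵀL̈⁻¹Δl| for a unit-length direction l, and obtains the maximal curvature by power iteration on the matrix −2ΔᵀL̈⁻¹Δ. The matrices `delta` and `hessian` are toy inputs rather than quantities from the fitted model, and all function names are hypothetical.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def mat_mul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def curvature_matrix(delta, hessian):
    """Return F = -2 * Delta^T * Hessian^{-1} * Delta.

    delta: p x n matrix whose i-th column is Delta_i.
    hessian: p x p matrix of second derivatives (negative definite at the MLE),
    so F is positive semi-definite.
    """
    core = mat_mul(mat_mul(transpose(delta), inverse(hessian)), delta)
    return [[-2.0 * v for v in row] for row in core]


def normal_curvature(f, direction):
    """C_l = |l^T F l| after scaling the direction to unit length."""
    norm = math.sqrt(sum(v * v for v in direction))
    unit = [v / norm for v in direction]
    f_unit = [sum(a * b for a, b in zip(row, unit)) for row in f]
    return abs(sum(a * b for a, b in zip(unit, f_unit)))


def max_curvature(f, iterations=500):
    """Approximate C_max and l_max by power iteration from a uniform start.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    n = len(f)
    unit = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        f_unit = [sum(a * b for a, b in zip(row, unit)) for row in f]
        norm = math.sqrt(sum(v * v for v in f_unit))
        if norm == 0.0:
            break
        unit = [v / norm for v in f_unit]
    return normal_curvature(f, unit), unit


# Toy inputs: two parameters and two perturbation components.
hessian = [[-2.0, 0.0], [0.0, -1.0]]
delta = [[1.0, 0.0], [0.0, 1.0]]
f = curvature_matrix(delta, hessian)
c_single = [normal_curvature(f, [float(i == j) for j in range(len(f))])
            for i in range(len(f))]
c_max, l_max = max_curvature(f)
print(c_single, c_max, l_max)
```

In this toy case the second perturbation component dominates, so the direction of maximal curvature aligns with it. With real data, Δ and L̈ would come from the fitted log-likelihood, and a library eigen-solver would usually replace the hand-written power iteration.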
7. Discussion
In this article, we proposed a general location multivariate latent variable model for mixed nominal, ordinal, and continuous responses with and without non-ignorable missing values. This procedure should be used when specifying general location models for all the variables with non-ignorable missing values is difficult. Extending our methodology to allow for clustering in the data is the subject of ongoing research. We are also exploring generalizations of the model to the multivariate case of more than one discrete and more than one continuous outcome, including the incorporation of random effects. The challenge here lies in defining models that allow for different levels of association among outcomes, as in longitudinal studies, and that always guarantee proper joint distributions.
Supplementary Material
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
1. Amiri L., Khazaei M., and Ganjali M., The grouped continuous model for multivariate ordered categorical variables and covariate adjustment, Adv. Data. Anal. Classif. 11 (2017), pp. 593–609. doi: 10.1007/s11634-016-0258-6
2. Anderson J.A. and Pemberton J.D., The grouped continuous model for multivariate ordered categorical variables and covariate adjustment, Biometrics 41 (1985), pp. 875–885. doi: 10.2307/2530960
3. Anholeto T., Sandoval M.C., and Botter D.A., Adjusted Pearson residuals in beta regression models, J. Stat. Comput. Simul. 84 (2014), pp. 999–1014. doi: 10.1080/00949655.2012.736993
4. Barreto-Souza W. and Simas A.B., Improving estimation for beta regression models via EM-algorithm and related diagnostic tools, J. Stat. Comput. Simul. 87 (2017), pp. 2847–2867. doi: 10.1080/00949655.2017.1350679
5. Belin T., Hu M.Y., Young A.S., and Grusky O., Performance of a general location model with an ignorable missing-data assumption in a multivariate mental health services study, Stat. Med. 18 (1999), pp. 3123–3135.
6. Chen S.X. and Tang C.Y., Nonparametric regression with discrete covariate and missing values, Stat. Interface. 4 (2011), pp. 463–474. doi: 10.4310/SII.2011.v4.n4.a5
7. Cook R.D., Assessment of local influence, J. R. Stat. Soc. Ser. B 48 (1986), pp. 133–169.
8. Cox D.R., The analysis of multivariate binary data, J. Appl. Stat. 21 (1972), pp. 113–126. doi: 10.2307/2346482
9. Cox D.R. and Wermuth N., Response models for mixed binary and quantitative variables, Biometrika 79 (1992), pp. 441–461. doi: 10.1093/biomet/79.3.441
10. Cui R., Bucur I.G., Groot P., and Heskes T., A novel Bayesian approach for latent variable modeling from mixed data with missing values, Stat. Comput. 29 (2019), pp. 1–17. doi: 10.1007/s11222-017-9790-2
11. De Leon A.R. and Carriere K.C., The one-sample location hypothesis for mixed bivariate data, Comm. Statist. Theory Methods 29 (2000), pp. 2573–2581. doi: 10.1080/03610920008832623
12. De Leon A.R. and Carriere K.C., General mixed data model: extension of general location and grouped continuous models, Can. J. Stat. 35 (2007), pp. 533–548. doi: 10.1002/cjs.5550350405
13. De Leon A.R. and Chough K.C., Analysis of Mixed Data: Methods and Applications, CRC Press, New York, 2013.
14. De Souza D.F. and Da Silva Moura F.A., Multivariate beta regression with application in small area estimation, J. Off. Stat. 32 (2016), pp. 747–768. doi: 10.1515/jos-2016-0038
15. Ferrari S.L.P. and Cribari-Neto F., Beta regression for modeling rates and proportions, J. Appl. Stat. 31 (2004), pp. 799–815. doi: 10.1080/0266476042000214501
16. Ferrari S.L. and Pinheiro E.C., Improved likelihood inference in beta regression, J. Stat. Comput. Simul. 81 (2011), pp. 431–443. doi: 10.1080/00949650903389993
17. Heckman J., Dummy endogenous variable in a simultaneous equation system, Econometrica 46 (1978), pp. 931–959. doi: 10.2307/1909757
18. Kieschnick R. and McCullough B.D., Regression analysis of variates observed on (0, 1): percentages, proportions and fractions, Stat. Modell. 3 (2003), pp. 193–213. doi: 10.1191/1471082X03st053oa
19. Little R.J. and Rubin D., Statistical Analysis with Missing Data, 2nd ed., Wiley, New York, 2002, p. 14.
20. Maier M.J., DirichletReg: Dirichlet regression for compositional data in R, Research Report Series / Department of Statistics and Mathematics, 125. WU Vienna University of Economics and Business, Vienna, 2014.
21. Mirkamali S.J. and Ganjali M., A general location model with zero-inflated counts and skew normal outcomes, J. Appl. Stat. 44 (2017), pp. 2716–2728. doi: 10.1080/02664763.2016.1261813
22. Olkin I. and Tate R.F., Multivariate correlation models with mixed discrete and continuous variables, Ann. Math. Statist. 32 (1961), pp. 448–465. doi: 10.1214/aoms/1177705052
23. Paleti R., Bhat C.R., and Pendyala R.M., Integrated model of residential location, work location, vehicle ownership, and commute tour characteristics, Transp. Res. Rec. 2382 (2013), pp. 162–172. doi: 10.3141/2382-18
24. Pearson K., Mathematical Contribution to the Theory of Evolution. XIII. On the Theory of Contingency and Its Relation to Association and Normal Correlation, Biometric Series I, Drapers' Company Research Memoirs, Dulau and Co., London, 1904.
25. Peng Y.H., Little R.J.A., and Raghunathan T.E., An extended general location model for causal inferences from data subject to noncompliance and missing values, Biometrics 60 (2004), pp. 598–607. doi: 10.1111/j.0006-341X.2004.00208.x
26. Poon W.Y. and Lee S.Y., Maximum likelihood estimation of multivariate polyserial and polychoric correlation coefficients, Psychometrika 52 (1987), pp. 409–430. doi: 10.1007/BF02294364
27. Rubin D.B., Inference and missing data, Biometrika 63 (1976), pp. 581–590. doi: 10.1093/biomet/63.3.581
28. Tabrizi E., Samani E.B., and Ganjali M., Analysis of mixed correlated bivariate zero-inflated count and (k,l)-inflated beta responses with application to social network datasets, Commun. Stat. Theory Methods 48 (2018), pp. 1651–1681. doi: 10.1080/03610926.2018.1435815
29. Tabrizi E., Samani E.B., and Ganjali M., Joint modeling of mixed zero-inflated count and (k,l)-inflated beta longitudinal responses with nonignorable missing values for social network analysis. Multivariate Behavioral Research J Appl. Statistics 2020, submitted for publication.
30. Tabrizi E., Samani E.B., and Ganjali M., A note on the identifiability of latent variable models for mixed longitudinal data. Stat. Probab. Lett. 2020, submitted for publication.
31. Wang W., Identifiability of linear mixed effects models, Electron. J. Stat. 7 (2013), pp. 244–263. doi: 10.1214/13-EJS770