Abstract
Semi-continuous data, also known as zero-inflated continuous data, have a substantial portion of responses equal to a single value (typically 0) and a continuous, right-skewed distribution among the remaining positive values. For jointly modeling multivariate clustered semi-continuous responses, the covariate effects in the positive parts can be constrained to be proportional to the covariate effects in the logistic part, yielding a multivariate two-part fixed effects model. It is shown, both theoretically and experimentally, that the proportionally constrained model is more efficient than the unconstrained model in terms of parameter estimation, and that it provides a deeper understanding of the data structure when the proportionality structure holds. A robust variance estimation method is also introduced and tested under various cases of model mis-specification. The proposed model is applied to data from a randomized controlled trial evaluating potential preventive effects of meditation or exercise on duration and severity of acute respiratory infection illness. The new analysis infers that meditation not only has highly significant effects on reduction of acute respiratory infection severity and duration, but also has significant effects on preventing acute respiratory infection, which was not previously reported in the literature.
Keywords: Zero-inflated data, proportionality, clinical trials, robust estimator, generalized linear model
1. Introduction
A common problem encountered in scientific data is the presence of many zeroes, known as zero-inflation. Examples can be found in medical research, insurance, finance, economics, and many other scientific and quantitative studies.
Continuous data with a substantial proportion of zeros (generally more than 50% zero observations), known as “semi-continuous” data, have also been discussed in the literature. Because a semi-continuous variable is quite different from one that has been left-censored or truncated, a natural idea is to view the semi-continuous responses as the result of two processes: one determines whether the response is zero, and the other determines the actual level if it is non-zero. For convenience, we refer to the data arising from these two processes as the “logistic part” and the “positive part” of the original data, respectively. Under this framework, a common approach is to consider a model which consists of a Bernoulli distribution and a log-normal (or gamma) distribution. This is often referred to as a “two-part model”, and has appeared in econometric analyses for many years. For example, Lu et al.1 and Hall and Zhang2 proposed a marginal version of the two-part model, utilizing generalized estimating equations (GEE) for estimation and inference; further discussion can be found in Su et al.3 and Su et al.4 Although such a two-part model is called “marginal,” it is in fact conditional, in the sense that the second component is conditioned on the outcome being greater than zero. Smith et al.5 developed marginalized models that are marginal over the zero and positive components. The other approach is based on two-part mixed models with correlated random effects in both parts of the model, where the objective is to investigate the effects of covariates at the subject-specific level (conditional effects). For example, Olsen and Schafer6 and Tooze et al.7 extended the two-part model to account for correlation through the introduction of random effects. The choice of approach depends on the objectives of the study and the intended use of the results.
In contrast, some studies have shown that data with far fewer zeros can also be considered semi-continuous and are more appropriately fitted with two-part models; examples can be found in the literature.8–10 In light of this, we examine how the performance of our proposed model depends on the number of zero observations in Section 3.3.
In this research, we consider a multivariate two-part fixed effects model to deal with multiple semi-continuous responses with common subsets of zero (and non-zero) observations. Our motivating data are from a randomized controlled trial. The Meditation or Exercise for Preventing Acute Respiratory Infection (MEPARI) trial11 was designed to evaluate potential preventive effects of meditation or exercise on incidence, duration and severity of acute respiratory infection (ARI) illness. In the trial, the primary response variables were ARI severity and ARI duration. ARI severity was measured by area-under-curve global severity, using the Wisconsin Upper Respiratory Symptom Survey (WURSS-24) summed across days of illness.12 ARI illness duration was measured in days and hours, from the first symptom noticed until the last time the participant felt ill. The two responses were zero (or non-zero) simultaneously, and about 56% of the observations were zeros, corresponding to participants who did not report ARI symptoms during the trial.
Modeling the two semi-continuous responses using separate generalized linear models can cause problems of inefficiency. In most clinical trial data, the sample size is limited, as in the MEPARI data, where there are 149 participants, only 66 of whom report ARI symptoms. If one assumes regression coefficients for each part are independent in the model, and fits a conventional generalized linear model for the logistic part and two positive parts separately, the model performs poorly because of too many parameters and the preponderance of zeros. To improve the efficiency, a natural idea is to consider a model with proportional constraints, where the regression coefficients in the two positive parts are assumed to be proportional to the regression coefficients in the logistic part on the link scales. Determination of proportionality in the sub-models may provide a deeper understanding of the ARI development. The logit model describes the development from zero to nonzero, and the linear models (for the two positive parts) describe the development above zero. If proportionality holds, then the same mechanism, corresponding to a set of covariates, determines development in the three sub-models. In practice, the validity of the constraints imposed by the constrained model needs to be assessed; this can be checked via model testing of an unconstrained model versus a constrained model.
The idea of incorporating proportionality in two-part models has been discussed in the literature for years. Published proportionality studies include the zero-inflated Poisson regression model in Lambert,13 the logit-(log) gamma two-part model in Moulton et al.,14 and the logit-linear two-part model in Han and Kronmal.15 In recent years, semi-parametric two-part models with proportional constraints have been proposed.16,17 These studies demonstrate that determining the proportionality structure can provide insights into the biological mechanisms underlying (for example) disease development.
In this paper, a new multivariate two-part fixed effects model with proportional constraints is proposed. Due to the specific structure of the ARI data, the existing methods which deal with only one semi-continuous response are not applicable. Hence, a multivariate model is needed. As discussed above, the traditional method of modelling the two parts separately leads to statistical inefficiency, whereas introducing the proportional constraints into the model may help overcome this problem. Applying the proposed model, a more comprehensive analysis of the MEPARI trial is conducted, which provides a deeper understanding of the development of ARI illness. We not only generalize the univariate to the multivariate case, but also include detailed discussions on variable selection problems, the hypothesis testing procedure, and structured hypothesis testing problems arising with the multivariate model. To deal with model mis-specification, we also introduce a robust variance estimator and show the results under different cases of model mis-specification.
Besides applications to our ARI illness study, the proposed model has a wide applicability in many research areas. In medical research, it can be used to model the severity and duration of other acute/chronic diseases such as asthma attacks, nasal allergies, migraine headaches, and herpes outbreaks, to name a few. The idea of the constrained model can also be applied to other scientific and industrial areas: for example, medical cost data, where the length of stay and hospital charges are the two semi-continuous responses; for insurance companies, where the claim size and duration are the responses.
The rest of this article is organized as follows. In Section 2, we introduce a multivariate two-part fixed effects model with proportional constraints incorporating multiple correlated semi-continuous responses. To fit the MEPARI data structure, we illustrate the model with the bivariate case. We derive the MLEs, provide a closed form expression of the Fisher information matrix of the MLEs, and quantify the efficiency of the constrained model. In Section 3, simulation results demonstrate the good performance of the proposed procedure. In Section 4, we illustrate the application of the proposed model to ARI illness data from the MEPARI trial. In addition, an iterative procedure is proposed to assist with model selection. More discussions are provided in Section 5. The formulas for computing the observed Fisher information matrix, and full proofs of the theorems are given in the Supplemental Materials.
2. The model
In this section, we first formulate the multivariate two-part fixed effects model. For notational simplicity, we let the dimension D = 2 to fit the ARI data structure; however, the model extends readily to D > 2. Note that the two responses need to be zero (or non-zero) simultaneously. If the two responses do not have common zeros, one may need to consider different model settings, because the data generating processes may not be coupled in that case.
2.1. Model with proportional constraints
We use notation similar to that in Olsen and Schafer.6 Let Yi,d denote the outcome measure of the ith individual, for i = 1, 2, …, n, and d = 1, 2, corresponding to the two types of semi-continuous responses. Note that Yi,1 and Yi,2 are either both zero or both non-zero. The responses can be recorded as three variables

$$U_i = \begin{cases} 1, & \text{if } Y_{i,1} \neq 0 \text{ and } Y_{i,2} \neq 0, \\ 0, & \text{if } Y_{i,1} = Y_{i,2} = 0, \end{cases} \tag{1}$$

and

$$V_{i,d} = \begin{cases} g_d(Y_{i,d}), & \text{if } Y_{i,d} \neq 0, \\ \text{irrelevant}, & \text{if } Y_{i,d} = 0, \end{cases} \qquad d = 1, 2, \tag{2}$$

where gd (d = 1, 2) are monotone increasing functions (e.g. log) that make Vi,d approximately Gaussian. In practice, the gd functions are unknown, and can be determined by Box–Cox transformations.
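These responses can be modeled by three fixed effects models, one for the logit probability of Ui = 1 and two for the conditional mean responses E(Vi,d|Ui = 1). The logit model is

$$\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = a_0 + x_i^T\beta, \tag{3}$$

where pi = P(Ui = 1), xi is the covariate vector, a0 denotes the intercept term, and β is an unknown coefficient vector including treatment effects. Often in practice, the covariates for the observed response components are the same, and then the fixed effects models for the two continuous responses (given Ui = 1) can be formulated similarly as

$$V_{i,1} = a_1 + x_i^T\delta + \epsilon_{i,1}, \qquad V_{i,2} = a_2 + x_i^T\xi + \epsilon_{i,2}, \tag{4}$$

where xi is a known p × 1 covariate vector, β, δ, ξ are unknown p × 1 parameter vectors, and ϵi,d are random errors, assumed here to be normally distributed

$$\begin{pmatrix} \epsilon_{i,1} \\ \epsilon_{i,2} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma \right), \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. \tag{5}$$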
We further assume −1 < ρ < 1 to exclude the degenerate case. If β, δ and ξ are independent parameters, the model is an unconstrained two-part model (denoted as model “BC0”); it is equivalent to first fitting a logit model to yi,d = 0 versus yi,d > 0 and then fitting linear regressions to the positive responses {yi,d > 0}. There are two shortcomings with the unconstrained model. First, the logistic part may not provide efficient estimates of the parameters β with limited sample size, as in the real data example, where there are only 149 observations in total, yielding many insignificant covariate effects. Second, if the zero-inflation process is coupled with the process generating the non-zero data, as in the ARI illness data, we may expect some relationship between β and δ (and ξ). In our real data, the two types of responses are ARI severity and ARI duration, which equal zero simultaneously, implying that those participants did not have ARI illness. Among participants with ARI illness, those with severe symptoms generally had longer durations. Thus, considering the two types of semi-continuous responses together in the model will give a better understanding of the general pattern of ARI illness. Using the idea of a “single index” μi, where $\mu_i = x_i^T\beta$, an important question is whether there exists a common “single index” for all three parts, i.e. whether there is a latent process that affects the logistic part as well as the two positive parts in similar ways. In particular, we define the bivariate two-part fixed effects model with proportional constraints as model “BC2” by
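$$\operatorname{logit}(p_i) = a_0 + x_i^T\beta, \qquad V_{i,1} = a_1 + b_1 x_i^T\beta + \epsilon_{i,1}, \qquad V_{i,2} = a_2 + b_2 x_i^T\beta + \epsilon_{i,2}. \tag{6}$$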
The parameters to be estimated, collected in the vector θ, are (a0, β, a1, b1, a2, b2) together with the variance and correlation parameters of the errors. The parameter space is defined by a0 ∈ R, β ∈ Rp; for d = 1, 2, ad ∈ R, bd ∈ R if ||β|| ≠ 0, with bd = 0 otherwise; the error variances are positive; and ρ ∈ (−1, 1). Here ||β|| denotes the Euclidean norm of β. Alternatively, ad (d = 0, 1, 2) in equation (6) can be replaced by $w_d^T\alpha_d$, where wd is a vector of covariates and αd is a coefficient vector, either of which can differ across the three parts. This alternative is referred to as the partially constrained model, because not all the coefficients are assumed to be constrained. All theoretical derivations carry over to the alternative model (more discussion is given in Section 5). Without loss of generality, for expository purposes, we use the simplified notation ad in the subsequent theoretical derivations.
Throughout this article, i = {1, 2, …, n} denotes the n independent observations yi, where yi = (yi,1, yi,2)T. Define d0 = {i : yi = 0} and d+ = {i : yi > 0}, let n0 and n1 be the cardinalities of d0 and d+, respectively, and let dall = d0 ∪ d+. For simplicity, we also assume the observations i = 1, …, n0 are in d0 and i = n0 + 1, …, n are in d+. Let , , . The design matrix (J X) is assumed to have full column rank p + 1, where J is a n-dimensional vector of 1’s.
2.2. Maximum likelihood estimation
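For illustration, we use the log function for gd (d = 1, 2), so Vi,d = log(Yi,d) for Yi,d > 0. Up to an additive constant that does not depend on the parameters, the log-likelihood function for model BC2 is

$$\ell(\theta) = \sum_{i \in d_{all}} \left[ U_i\eta_i - \log\left(1 + e^{\eta_i}\right) \right] - \sum_{i \in d_+} \left[ \log(2\pi) + \log\sigma_1 + \log\sigma_2 + \tfrac{1}{2}\log(1 - \rho^2) + \frac{1}{2(1 - \rho^2)} \left( \frac{r_{i,1}^2}{\sigma_1^2} - \frac{2\rho\, r_{i,1} r_{i,2}}{\sigma_1\sigma_2} + \frac{r_{i,2}^2}{\sigma_2^2} \right) \right], \tag{7}$$

where $\eta_i = a_0 + x_i^T\beta$, $r_{i,d} = V_{i,d} - a_d - b_d x_i^T\beta$, and $\sigma_1^2$, $\sigma_2^2$ and ρ are the variances and correlation of (ϵi,1, ϵi,2). Letting

$$\hat\theta = \arg\max_{\theta}\, \ell(\theta), \tag{8}$$

the MLE can be derived by solving the score equation (details are given in the Supplemental Material Appendix A).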
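As a minimal sketch of how the log-likelihood of model BC2 could be evaluated numerically, the function below combines the Bernoulli (logit) contribution of every observation with the bivariate normal contribution of the positive observations, using gd = log. It is an illustration and not the authors' implementation. The function name `bc2_loglik`, the layout of `theta` and the data format are assumptions made for this example, and the Jacobian term of the log transformation is dropped because it does not depend on the parameters.

```python
import math

def bc2_loglik(theta, data):
    """Log-likelihood of model BC2 with V = log(Y), up to a constant in theta.

    theta = (a0, beta, a1, b1, a2, b2, s1, s2, rho), where beta is a sequence
    and s1, s2 are error standard deviations (hypothetical layout).
    data  = iterable of (x, y1, y2), with y1 and y2 both zero or both positive.
    """
    a0, beta, a1, b1, a2, b2, s1, s2, rho = theta
    ll = 0.0
    for x, y1, y2 in data:
        if (y1 == 0) != (y2 == 0):
            raise ValueError("y1 and y2 must be zero or positive together")
        mu = sum(bk * xk for bk, xk in zip(beta, x))   # single index x'beta
        eta = a0 + mu                                  # logit of P(U = 1)
        # Numerically stable log(1 + exp(eta))
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        if y1 == 0:
            ll -= log1pexp                             # log P(U = 0)
            continue
        ll += eta - log1pexp                           # log P(U = 1)
        # Standardized residuals of the two constrained positive parts
        r1 = (math.log(y1) - a1 - b1 * mu) / s1
        r2 = (math.log(y2) - a2 - b2 * mu) / s2
        quad = (r1 * r1 - 2.0 * rho * r1 * r2 + r2 * r2) / (1.0 - rho * rho)
        ll += (-math.log(2.0 * math.pi) - math.log(s1) - math.log(s2)
               - 0.5 * math.log(1.0 - rho * rho) - 0.5 * quad)
    return ll
```

In practice the negative of this function would be passed to a general-purpose optimizer, or maximized by Newton–Raphson using the score and Hessian.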
2.3. Asymptotic properties
Before showing the asymptotic properties of the MLE, we first compute the expected Fisher information. Define the expected Fisher information with respect to (WRT) θ as IBC2F, where θ collects (a0, β, a1, b1, a2, b2) and the variance and correlation parameters of the errors. Straightforward but tedious computations show that

$$I_{BC2F} = \begin{pmatrix} I_{BC2} & 0 \\ 0 & I_{\Sigma} \end{pmatrix}, \tag{9}$$

where IΣ is the Fisher information matrix WRT the variance and correlation parameters of the errors, and IBC2 is the Fisher information matrix WRT (a0, β, a1, b1, a2, b2). More details are given in the Supplemental Material Appendices B and C.
To establish the asymptotic properties of the MLE, we need the assumptions (A1) to (A4) (details given in Supplemental Material Appendix G). These assumptions guarantee the identifiability of parameter θ, boundedness of the expected likelihood, differentiability and uniform integrability of the log-likelihood function, and existence of the asymptotic covariance matrix. The following Theorem 1 provides the asymptotic normality of the MLE, which establishes the theoretical framework of hypothesis testing in the following sections. The proof is given in the Supplemental Material Appendix D.
Theorem 1.
Under assumptions (A1) to (A4), the MLE $\hat\theta$ is a consistent estimator of θ, and we have

$$\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N\left(0,\; I_{BC2F}^{-1}\right).$$
We compare the efficiency of the MLEs under the three models BC2 (with constraints), BC0 (without constraints) and UC1 (with constraints, univariate case). Here, UC1 has the same form as equation (6) for the logit part and Vi,1, while Vi,2 is unconstrained as in equation (4). This univariate case reflects a proportionality structure that holds only for the logistic part and one positive response, Vi,1. We use the Loewner ordering of symmetric matrices.18
Definition 1.
(Loewner ordering) Let A and B be two symmetric matrices of order n. We say that A ≥ B if A − B is positive semi-definite (PSD). Similarly, we say that A > B if A − B is positive definite (PD).
The following Theorem 2 shows that, when the proportionality structure holds, model BC2 is asymptotically at least as efficient as models UC1 and BC0, and is therefore more powerful for detecting whether specific covariate effects are significant. The proof is given in the Supplemental Material Appendix E.
Theorem 2.
In Loewner ordering,
(i) the asymptotic covariance matrix of the MLE of β, $\hat\beta$, is no larger in BC2 than in BC0. The result also holds for $\hat\delta$ and $\hat\xi$;
(ii) the asymptotic covariance matrix of the MLE of β, $\hat\beta$, is no larger in BC2 than in UC1. The result also holds for $\hat\delta$.
2.4. Model mis-specification
In certain cases, model specifications (5) and (6) may be violated. Huber19 and White20 introduced the “sandwich variance estimator” under certain regularity conditions in which the point estimators are consistent. We assume that the independent random vectors Zi have a distribution with (Radon–Nikodym) density h(·), and that the members of the parametric family of distribution functions all have densities f(z, θ).
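Under the mis-specified case, we actually maximize the quasi-log-likelihood function, which has the same form as equation (7). The quasi-MLE $\hat\theta$ is defined as in equation (8). When the appropriate inverses exist, we further define the following quantities

$$A_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2 \log f(Z_i, \theta)}{\partial\theta\,\partial\theta^T}, \qquad B_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial \log f(Z_i, \theta)}{\partial\theta}\, \frac{\partial \log f(Z_i, \theta)}{\partial\theta^T}, \qquad C_n(\theta) = A_n(\theta)^{-1} B_n(\theta) A_n(\theta)^{-1}. \tag{10}$$

Their expected values under the “true” model are

$$A(\theta) = E_h\left[ \frac{\partial^2 \log f(Z_i, \theta)}{\partial\theta\,\partial\theta^T} \right], \qquad B(\theta) = E_h\left[ \frac{\partial \log f(Z_i, \theta)}{\partial\theta}\, \frac{\partial \log f(Z_i, \theta)}{\partial\theta^T} \right], \qquad C(\theta) = A(\theta)^{-1} B(\theta) A(\theta)^{-1}. \tag{11}$$

We need to modify assumptions (A3) and (A4) to establish the asymptotic properties of the “sandwich estimator”. Basically, we need to assume that the mean model for Ui and Vi,d | Ui (in equations (1) and (2)) is correctly specified. The modified assumptions (A3’) and (A4’), given in Supplemental Material Appendix G, provide weaker regularity conditions for the quasi-MLE when the model is mis-specified as in equations (5) and (6).
Following the theoretical framework in White,20 we have the following Theorem 3.
Theorem 3.
Under assumptions (A1), (A2), (A3’) and (A4’), the quasi-MLE $\hat\theta$ is a consistent estimator of θ0, and we have

$$\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\left(0,\; C(\theta_0)\right).$$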
In Section 3, we illustrate the efficiency of the proportionally constrained model, and also compare the results obtained with the traditional MLE variance estimator and with the sandwich estimator when the model is mis-specified.
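To make the “sandwich” construction concrete, the following toy sketch computes the robust variance for a model with a single scalar parameter, where the matrices reduce to numbers: the average Hessian is the “bread” and the average squared score is the “meat”. This is not the BC2 estimator itself. The function names and the scalar setting are assumptions chosen to keep the example short.

```python
def sandwich_variance(scores, hessians):
    """Robust variance of a scalar quasi-MLE from per-observation
    scores and second derivatives, both evaluated at the estimate."""
    n = len(scores)
    a_n = sum(hessians) / n                  # average Hessian ("bread")
    b_n = sum(s * s for s in scores) / n     # average squared score ("meat")
    c_n = b_n / (a_n * a_n)                  # A^{-1} B A^{-1} in the scalar case
    return c_n / n

def model_based_variance(hessians):
    """Usual inverse-information variance, valid if the model is correct."""
    return -1.0 / sum(hessians)

# Working model N(theta, 1): score_i = z_i - theta_hat, Hessian_i = -1.
z = [0.0, 2.0, 4.0, 6.0]
theta_hat = sum(z) / len(z)
scores = [zi - theta_hat for zi in z]
hessians = [-1.0] * len(z)
robust = sandwich_variance(scores, hessians)
naive = model_based_variance(hessians)
```

Here the working model fixes the variance at 1, while the data are more dispersed, so the robust variance exceeds the model-based one.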
2.5. Hypothesis tests
Under the setting of the two-part fixed effects model with proportional constraints, one may be interested in the following three typical hypotheses:
(1) H01 : βs = 0, where βs is a subset of β.
(2) H02 : b1 = 0 and/or H02′ : b2 = 0.
(3) H03 : δ = b1β and ξ = b2β for some non-zero b1 and b2.
Hypothesis (1) assesses whether specific covariate effects are significant. Hypothesis (2) checks whether the proportionality coefficients b1 and b2 are zero. Hypothesis (3) checks whether the proportionality assumption holds. Note that for hypothesis (2), if b1 or b2 is not significant at a given level α, there are two possibilities: either the corresponding positive part is homogeneous, or the proportionality structure may not hold. In either case, one needs to consider carefully whether the unconstrained or partially constrained model is more appropriate for the data.
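For testing H01, the Wald test statistic can be constructed from the asymptotic results in Theorem 1 as follows. Note that βs is a sub-vector of θ. Let $\hat I(\hat\theta) = -H(\hat\theta)$ be the observed information matrix, where $\hat\theta$ is the MLE of the parameters and H(θ) is the Hessian matrix of the log-likelihood, so that $n^{-1}\hat I(\hat\theta)$ is a consistent estimator of IBC2F. We can construct the Wald test statistic by

$$W = (L\hat\theta)^T \left[ L\, \hat I(\hat\theta)^{-1} L^T \right]^{-1} (L\hat\theta) \xrightarrow{d} \chi^2_k \quad \text{under } H_{01},$$

where L is the matrix such that Lθ = 0 under H01, and k equals the number of parameters in the hypothesis H01.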
For testing H02, note that even if b1 = b2 = 0, the expected information matrix IBC2F in equation (9) is still PD; thus one can perform the Wald test for either the joint hypothesis b1 = b2 = 0 or a single hypothesis b1 = 0 or b2 = 0. Note that b1 = 0 and/or b2 = 0 implies that model BC2 (given by equation (6)) may not be suitable for modeling the positive parts of the problem of interest. One may need to consider different model settings, or a model with proportional constraints for only one positive part.
For testing H03, note that the model with δ = b1β and ξ = b2β is nested within the model where δ, ξ and β are completely free. To assess proportionality, we can perform a likelihood ratio test, with the test statistic given by

$$\Lambda = -2\left[ \ell(\hat\theta_{BC2}) - \ell(\hat\theta_{BC0}) \right],$$

where $\hat\theta_{BC2}$ is the MLE under the constrained model BC2 and $\hat\theta_{BC0}$ is the MLE under the unconstrained model BC0.
When the model is mis-specified, as discussed in Section 2.4, the above hypotheses can be tested by replacing the asymptotic variance with the corresponding sandwich estimator of variance.
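The likelihood ratio test of H03 can be sketched as follows. This is a hedged illustration rather than the authors' code. The function names are hypothetical, and the degrees of freedom 2(p − 1) are an assumption obtained by counting parameters here (BC0 has 3p + 3 regression parameters and BC2 has p + 5), not a value stated in the text. Because this count is always even, the chi-square tail probability has a closed form and no external library is needed.

```python
import math

def chi2_sf_even(x, df):
    """Upper tail P(X > x) of a chi-square variable with even df:
    exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def proportionality_lrt(loglik_bc2, loglik_bc0, p):
    """LRT of the constrained model BC2 against the unconstrained BC0.

    loglik_bc2, loglik_bc0: maximized log-likelihoods of the two models.
    p: number of covariates (length of beta); assumed p >= 2.
    """
    stat = -2.0 * (loglik_bc2 - loglik_bc0)
    df = 2 * (p - 1)          # assumption: difference in parameter counts
    return stat, df, chi2_sf_even(max(stat, 0.0), df)

stat, df, p_value = proportionality_lrt(-100.0, -97.0, p=3)
```

A large P-value means the proportional constraints are not rejected, which supports using model BC2.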
2.6. Model selection
Although many intermediate models are needed in order to determine the proportionality structure, we are most interested in the final models, i.e. models with proportionality properly determined, and with the corresponding coefficients in all three parts properly specified. In our study, as one can see from Table 4 in the real data analysis, there are multiple covariates and multiple candidate models with proportionality included. It is not surprising that some of the covariates are not significant, and that we may need to remove those covariates to obtain a best fitting final model. We do not attempt to address general model selection issues here, but rather focus on a particular aspect of selection arising from the proportionality structure of the BC2 model.
Table 4.
Description of real data.
| Variable | Description |
|---|---|
| ARI severity | Measured by area under curve (AUC) |
| ARI duration | Measured by days |
| ID | Participant ID, 149 in total |
| Randomization | Group indicator, 1 = Exercise, 2 = Meditation, 3 = Control |
| Age | Age of patients, from 50 to 75 |
| Gender | 0 = Male, 1 = Female |
| BMI | Body mass index |
| LOT | Optimism score, overall optimism score with high scores representing greater optimism |
| MAAS | Mindfulness score, higher scores indicate greater mindfulness |
| PHQ-9 | Depression score, higher scores indicate greater depression |
| PSS-10 | Perceived stress score, higher scores represent greater perceived stress |
| Ryff | Social support score, higher scores reflect positive relations with others |
| PANAS-Positive | Positive emotion score, higher scores indicate stronger positive emotion |
| PANAS-Negative | Negative emotion score, higher scores indicate stronger negative affect |
| SF12-Physical | Physical health score, higher scores indicate better physical functioning |
| SF12-Mental | Mental health score, higher scores indicate better mental functioning |
| STAI Y-1 | Anxiety score (current state), high scores on their respective scales mean more state anxiety |
| STAI Y-2 | Anxiety score (general state), high scores on their respective scales mean more trait anxiety |
For variable selection, one can always perform backward elimination based on the derived Wald test statistic. However, in practice, with limited sample size, we find that in the constrained model BC2, testing the null hypothesis H0 : βk = 0 alone is not very efficient, where βk is a fixed effect coefficient. Note that in the proposed model, βk appears not only in the logistic part, but also in both log-normal parts through δk = b1βk and ξk = b2βk. Hence, instead of testing H0 : βk = 0 only, we can test H0 : b1βk = 0 and H0 : b2βk = 0 simultaneously. If at least one of them is rejected, then we reject the null hypothesis H0 : βk = 0. Equivalently, if we observe at least one small P-value in the two log-normal parts, we treat βk as non-zero and keep it in the model. From the previous asymptotic results and Slutsky’s theorem, given that b1 and b2 are non-zero, the estimators of b1 and b2 converge in probability to the true values, so the multiple testing problem above is asymptotically equivalent to testing H0 : βk = 0 only. We also performed a simulation study (results not reported) to assess the efficiency of testing the three hypotheses simultaneously, and found that the power is at least twice that of testing H0 : βk = 0 only. Based on this idea, to fully determine the proportionality structure as well as the significant covariates, we propose the following step-wise approach. At each step, if the constrained model is not preferred over the unconstrained model (using the likelihood ratio test), the procedure stops and returns the current constrained model.
Step 1. Fit model BC2, the bivariate two-part fixed effects model with proportional constraints, using all covariates.
Step 2. For the coefficient βk of each covariate xk (k = 1, …, r), compute the P-value for the hypothesis H0 : βk = 0, denoted as P1, …, Pr. For simplicity, we assume the coefficients are ordered according to their P-values, so that P1 < P2 < ⋯ < Pr.
Step 3. Compute the corresponding P-values for H0 : δk = 0 and H0 : ξk = 0, where δk = b1βk and ξk = b2βk, and their SEs are computed using the delta method.
Step 4. Check the largest P-value Pr, for a given level of significance α. If the two corresponding P-values for H0 : δr = 0 and H0 : ξr = 0 are both greater than α, remove the corresponding covariate xr from the model. If not, check the second largest P-value Pr−1 and determine whether the corresponding covariate xr−1 can be removed, and so on.
Step 5. If a covariate was removed in Step 4, refit the model using the current set of covariates, and repeat Steps 2 to 4. Otherwise, if no covariate can be removed in Step 4, stop and take the current model as the final model.
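The step-wise procedure above can be sketched as a short loop. This is one possible reading of the steps and not the authors' code. The callback `fit_pvalues` is a hypothetical stand-in for fitting model BC2 and returning, for each covariate, the P-values of the tests for βk, δk and ξk. A covariate is treated as removable only when all three P-values exceed α, which is an interpretation of Step 4. The likelihood ratio check against model BC0 at each step is omitted for brevity.

```python
def backward_select(covariates, fit_pvalues, alpha=0.05):
    """Backward elimination using the three P-values per covariate.

    fit_pvalues(covs) -> {covariate: (p_beta, p_delta, p_xi)} is a
    hypothetical callback that refits the model on `covs`.
    """
    current = list(covariates)
    while current:
        pvals = fit_pvalues(current)              # Steps 1-3: (re)fit and test
        # Step 4: visit covariates from the largest beta P-value downwards
        ordered = sorted(current, key=lambda c: pvals[c][0], reverse=True)
        for cov in ordered:
            if all(p > alpha for p in pvals[cov]):
                current.remove(cov)               # drop one covariate, then refit
                break
        else:
            break                                 # Step 5: nothing removable
    return current

# Usage with fixed, made-up P-values (illustrative only, not trial results).
fake = {"age": (0.01, 0.02, 0.03),
        "bmi": (0.60, 0.50, 0.70),
        "sex": (0.40, 0.03, 0.50)}
selected = backward_select(["age", "bmi", "sex"],
                           lambda covs: {c: fake[c] for c in covs})
```

In this toy run, "sex" is kept although its βk test alone is not significant, because one of its log-normal parts has a small P-value.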
In practice, if the removal of covariates causes the unconstrained model BC0 to outperform BC2, one should carefully check the reason for the failure of model BC2. For example, is it because the two positive responses (after transformation) do not have a similar proportionality structure? Or because the proportionality structure holds for only one response? Or because the normality assumption is violated? One may also wish to see whether model UC1 is a candidate for the data, if the proportionality structure does not hold for both positive parts simultaneously. This may be indicated, for example, when the fully constrained model BC2 gives a non-significant b2, or estimated coefficients with opposite signs for some covariates, which would contradict intuition. When applying the proposed methodology, one should compare models BC2 and UC1 using the likelihood ratio test, and test whether the proportionality coefficient b1 or b2 is zero, to verify that the assumptions of model BC2 hold. One should also check the meaning of the estimated coefficients to determine whether model BC2, UC1 or even the unconstrained model BC0 is more appropriate for the data.
3. Simulation studies
3.1. Correctly specified case
In this section, we perform a simulation study to investigate efficiency gains in fitting the proposed bivariate two-part model with proportional constraints (denoted by BC2) instead of fitting the bivariate two-part model without constraints (denoted by BC0), or a univariate two-part model with constraints (denoted by UC1). We used a similar data structure as in our real data example, where the number of treatment groups t = 3, with each group containing 50 observations, and the sample size n = 150. To estimate the treatment effects, we added two indicator variables ti,1 and ti,2 to the first two groups, respectively, and the third group was treated as the referent. So ti,1 = 1 for i = 1, …, 50, and ti,1 = 0 otherwise; ti,2 = 1 for i = 51, …, 100, and ti,2 = 0, otherwise.
For the proposed model with proportional constraints, i.e. model BC2, we first generated the common “single index” $\mu_i = \beta_{01}t_{i,1} + \beta_{02}t_{i,2} + x_i^T\beta$, where β = (β1, β2, β3)T was the vector of fixed effect coefficients, and β0t (t = 1, 2) was the group difference between group t and group 3. The covariate vector was xi = (xik)1×3, where the xik (k = 1, 2, 3) were independent and identically distributed (i.i.d.) as N(0, 1).
For each individual i, we generated the binary response Ui from a Bernoulli experiment with success probability pi given by $\operatorname{logit}(p_i) = a_0 + \mu_i$. Next, given the coefficients ad and bd, the latent positive responses were generated by $V_{i,d} = a_d + b_d\mu_i + \epsilon_{i,d}$, where ϵi,d (d = 1, 2) were drawn from the bivariate normal distribution in equation (5). The observed responses were $Y_{i,d} = U_i\exp(V_{i,d})$. The parameters were set to β = (1, 0.5, 0)T, β01 = −0.5, β02 = 0.5, a1 = a2 = 0, b1 = 0.6, b2 = 1, and variance parameters $(\sigma_1^2, \sigma_2^2, \rho) = (1, 1, 0.7)$. A total of 1000 samples were generated using the settings of the proposed model BC2 and all three models were fitted. Note that the univariate model UC1 was fitted for Yi,1 only, because the simulation settings for Yi,2 were similar and are not reported separately. In the following simulations, intercept terms are intentionally omitted from displays.
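The data-generating steps described above can be sketched in Python as follows. This is an illustrative sketch and not the authors' simulation code. The function name `simulate_bc2` is hypothetical. The logistic intercept a0 = 0 is an assumption, since the text does not state its value, and so is the form of the single index (treatment effects plus the covariate effects). The correlated errors are obtained from two independent standard normal draws.

```python
import math
import random

def simulate_bc2(n_per_group=50, beta=(1.0, 0.5, 0.0), beta0=(-0.5, 0.5),
                 a0=0.0, a=(0.0, 0.0), b=(0.6, 1.0),
                 sigma=(1.0, 1.0), rho=0.7, seed=0):
    """Draw one sample from model BC2 (a0 = 0 is an assumed value)."""
    rng = random.Random(seed)
    data = []
    for group in range(3):
        # Dummy variables for the first two groups; the third is the referent.
        t = [1.0 if group == g else 0.0 for g in range(2)]
        for _ in range(n_per_group):
            x = [rng.gauss(0.0, 1.0) for _ in beta]
            # Common single index shared by the logistic and positive parts
            mu = (sum(c * ti for c, ti in zip(beta0, t))
                  + sum(bk * xk for bk, xk in zip(beta, x)))
            p = 1.0 / (1.0 + math.exp(-(a0 + mu)))
            u = 1 if rng.random() < p else 0
            # Bivariate normal errors with correlation rho
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            e1 = sigma[0] * z1
            e2 = sigma[1] * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
            v = (a[0] + b[0] * mu + e1, a[1] + b[1] * mu + e2)
            # Both responses are zero together, or both positive (log link)
            y = (math.exp(v[0]), math.exp(v[1])) if u else (0.0, 0.0)
            data.append({"t": t, "x": x, "u": u, "y": y})
    return data

sample = simulate_bc2(seed=1)
nonzero_share = sum(row["u"] for row in sample) / len(sample)
```

Repeating the call with different seeds gives the replicated samples to which the three models would be fitted.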
Table 1 summarizes the behavior of the estimates for the parameters in the logistic part and the log-normal part of Yi,1. It reports the average parameter estimates, their standard deviations (SD) and average standard errors (SE). In Table 1, when fitting models BC2 and UC1 for each sample, the estimates and SEs of the δ parameters are computed using the delta method for $\hat\delta_k = \hat b_1\hat\beta_k$. For nearly all the parameters, the average SE is close to the SD of the estimates, indicating that the standard errors from the Newton–Raphson algorithm are effective measures of variation. We also computed the coverage probabilities using the three models (not reported here); the probabilities of covering the true values were all very close to the nominal level (within ±2% of the 95% nominal level). To compare the efficiency of the three models, consider first the parameters in the logistic part (Table 1). As expected, noticeable efficiency gains occur for the proposed model BC2; for instance, the average standard error estimate for β2 increases from 0.127 to 0.188 between BC2 and BC0, and from 0.127 to 0.150 between BC2 and UC1. This substantiates the theoretical results in Theorem 2, i.e. the bivariate two-part model with proportional constraints is the most efficient. The same phenomenon appears for the parameters in the log-normal part of Yi,1 (Table 1); for instance, the average standard error estimate for δ3 increases from 0.057 to 0.110 between BC2 and BC0, and from 0.057 to 0.079 between BC2 and UC1. For the variance parameters, the average estimates and average SEs are similar for all three models. This is not surprising, since the MLE of the variance parameters and the MLE of the other parameters are asymptotically independent, as shown previously.
Table 1.
Results from the simulation study: average estimates, standard deviations (SD) of estimates, and average standard errors (SE), logistic part and log-normal Part of Y1.
| Non-zero % | True | Model BC2 |
Model BC0 |
Model UC1 |
||||||
|---|---|---|---|---|---|---|---|---|---|---|
| 52% |
52% |
52% |
||||||||
| Average | Average | Average | Average | Average | Average | |||||
| Estimates | SD | SE | Estimates | SD | SE | Estimates | SD | SE | ||
| β1 | 1.000 | 1.038 | 0.217 | 0.208 | 1.066 | 0.244 | 0.235 | 1.047 | 0.229 | 0.219 |
| β2 | 0.500 | 0.516 | 0.128 | 0.127 | 0.530 | 0.187 | 0.188 | 0.520 | 0.150 | 0.150 |
| β3 | 0.000 | −0.002 | 0.106 | 0.100 | 0.000 | 0.204 | 0.195 | −0.001 | 0.145 | 0.138 |
| β01 | −0.500 | −0.514 | 0.295 | 0.276 | −0.523 | 0.480 | 0.481 | −0.512 | 0.375 | 0.362 |
| β02 | 0.500 | 0.535 | 0.276 | 0.255 | 0.520 | 0.500 | 0.474 | 0.532 | 0.368 | 0.341 |
| b1 | 0.600 | 0.602 | 0.173 | 0.165 | 0.603 | 0.177 | 0.166 | |||
| b2 | 1.000 | 1.005 | 0.243 | 0.230 | ||||||
| log-normal Part of Y1 | ||||||||||
| δ1 | 0.600 | 0.600 | 0.123 | 0.121 | 0.595 | 0.140 | 0.138 | 0.605 | 0.127 | 0.127 |
| δ2 | 0.300 | 0.298 | 0.073 | 0.071 | 0.295 | 0.104 | 0.100 | 0.300 | 0.085 | 0.082 |
| δ3 | 0.000 | −0.001 | 0.062 | 0.057 | 0.000 | 0.119 | 0.110 | 0.000 | 0.084 | 0.079 |
| δ01 | −0.300 | −0.298 | 0.170 | 0.160 | −0.292 | 0.317 | 0.303 | −0.297 | 0.222 | 0.211 |
| δ02 | 0.300 | 0.307 | 0.151 | 0.145 | 0.312 | 0.277 | 0.260 | 0.305 | 0.207 | 0.193 |
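The efficiency comparison in Table 1 can be made concrete by forming ratios of empirical variances. The following is a minimal sketch, assuming that the squared ratio of the reported SDs is an acceptable summary of relative efficiency; the names `table1_sd` and `relative_efficiency` are illustrative and do not come from the paper.

```python
# SDs of the logistic-part estimates transcribed from Table 1,
# ordered as (BC2, BC0, UC1).
table1_sd = {
    "beta1": (0.217, 0.244, 0.229),
    "beta2": (0.128, 0.187, 0.150),
    "beta3": (0.106, 0.204, 0.145),
}


def relative_efficiency(sd_constrained, sd_other):
    """Ratio of empirical variances; values below 1 favor the constrained model."""
    return (sd_constrained / sd_other) ** 2


efficiency = {
    name: {
        "BC2_vs_BC0": relative_efficiency(bc2, bc0),
        "BC2_vs_UC1": relative_efficiency(bc2, uc1),
    }
    for name, (bc2, bc0, uc1) in table1_sd.items()
}

for name, ratios in efficiency.items():
    print(name, {label: round(value, 3) for label, value in ratios.items()})
```

Every ratio is below 1, and the ratios are smallest for β3, whose true value is 0.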
3.2. Mis-specified case
As mentioned in the foregoing section on methodology, when the model is mis-specified, the traditional variance estimator of the MLE (Theorem 1) is not consistent and the sandwich variance estimator (Theorem 3) is preferred. We simulated two mis-specified cases and report the results in Tables 2 and 3, respectively. For mis-specified case 1, we assumed unequal variances for the responses, which remained independent. We followed the same parameter settings as in Table 1, but the regression errors ϵi,d (d = 1, 2) were replaced by , where ui ~ Unif(0.3, 3) and the ui were independent. In mis-specified case 2, we assumed that the responses had intra-class correlation, similar to a random effects model structure. Specifically, we followed the same definitions and parameter settings as in the previous simulation. To allow for the cluster effect, we used the subscript ij instead of i, and we had , where i = 1, 2, …, 30, j = 1, 2, …, 5, and the γi were independent and followed N(0, 1).
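To illustrate why a sandwich variance estimator is preferred under unequal variances, the sketch below uses a deliberately simplified setting: ordinary least squares through the origin with heteroscedastic errors. This is an assumption made purely for illustration and is not the estimator of Theorem 3 for the two-part model; the function name `slope_and_ses`, the seed, and the data-generating values are hypothetical.

```python
import random


def slope_and_ses(x, y):
    """OLS slope through the origin with model-based and sandwich (HC0) SEs."""
    sxx = sum(xi * xi for xi in x)
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - slope * xi for xi, yi in zip(x, y)]
    # Model-based SE assumes one common error variance.
    sigma2 = sum(e * e for e in resid) / (len(x) - 1)
    model_se = (sigma2 / sxx) ** 0.5
    # Sandwich SE: "bread" 1/sxx on both sides of the "meat" sum((x*e)^2).
    meat = sum((xi * e) ** 2 for xi, e in zip(x, resid))
    sandwich_se = meat ** 0.5 / sxx
    return slope, model_se, sandwich_se


rng = random.Random(2024)  # hypothetical seed
x = [rng.gauss(0, 1) for _ in range(5000)]
# Error standard deviation grows with |x|, so the equal-variance model is wrong.
y = [0.5 * xi + abs(xi) * rng.gauss(0, 1) for xi in x]

slope, model_se, sandwich_se = slope_and_ses(x, y)
print(round(slope, 3), round(model_se, 4), round(sandwich_se, 4))
```

In this toy setting the model-based SE understates the variability of the slope, while the sandwich SE does not rely on the equal-variance assumption.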
Table 2.
Mis-specified Case 1, unequal variance.
| Parameter | True | BC2 average estimate | BC2 SD | BC2 average SE | BC0 average estimate | BC0 SD | BC0 average SE |
|---|---|---|---|---|---|---|---|
| Non-zero % | | 52% | | | 52% | | |
| Logistic Part | |||||||
| β1 | 1.000 | 1.045 | 0.233 | 0.220 (0.224) | 1.066 | 0.244 | 0.235 |
| β2 | 0.500 | 0.518 | 0.152 | 0.153 (0.157) | 0.530 | 0.187 | 0.188 |
| β3 | 0.000 | −0.002 | 0.156 | 0.143 (0.148) | 0.000 | 0.204 | 0.195 |
| β01 | −0.500 | −0.518 | 0.391 | 0.371 (0.375) | −0.523 | 0.480 | 0.481 |
| β02 | 0.500 | 0.533 | 0.377 | 0.351 (0.357) | 0.520 | 0.500 | 0.474 |
| Proportional constraints | |||||||
| b1 | 0.600 | 0.601 | 0.256 | 0.241 (0.253) | |||
| b2 | 1.000 | 1.006 | 0.312 | 0.290 (0.305) | |||
| Log-normal Part of Y1 | |||||||
| δ1 | 0.600 | 0.600 | 0.223 | 0.216 (0.225) | 0.590 | 0.256 | 0.252 |
| δ2 | 0.300 | 0.296 | 0.121 | 0.122 (0.123) | 0.289 | 0.181 | 0.182 |
| δ3 | 0.000 | 0.000 | 0.095 | 0.086 (0.089) | −0.002 | 0.222 | 0.201 |
| δ01 | −0.300 | −0.296 | 0.251 | 0.242 (0.245) | −0.295 | 0.556 | 0.551 |
| δ02 | 0.300 | 0.308 | 0.236 | 0.225 (0.232) | 0.316 | 0.477 | 0.473 |
Note: Results from the simulation study: average estimates, standard deviations (SD) of estimates, and average standard errors (SE) for the logistic part and log-normal part of Y1. Sandwich estimator results for BC2 are in parentheses. ARI: acute respiratory infection.
Table 3.
Mis-specified Case 2, adding a random effect term to μi.
| Parameter | True | BC2 average estimate | BC2 SD | BC2 average SE | BC0 average estimate | BC0 SD | BC0 average SE |
|---|---|---|---|---|---|---|---|
| Non-zero % | | 52% | | | 52% | | |
| Logistic Part | |||||||
| β1 | 1.000 | 0.893 | 0.212 | 0.202 (0.206) | 0.921 | 0.234 | 0.224 |
| β2 | 0.500 | 0.442 | 0.146 | 0.133 (0.137) | 0.448 | 0.187 | 0.181 |
| β3 | 0.000 | 0.007 | 0.117 | 0.120 (0.123) | 0.012 | 0.195 | 0.190 |
| β01 | −0.500 | −0.435 | 0.500 | 0.323 (0.330) | −0.437 | 0.611 | 0.471 |
| β02 | 0.500 | 0.482 | 0.485 | 0.305 (0.316) | 0.467 | 0.623 | 0.464 |
| Proportional constraints | |||||||
| b1 | 0.600 | 0.617 | 0.214 | 0.201 (0.210) | |||
| b2 | 1.000 | 1.028 | 0.304 | 0.285 (0.297) | |||
| Log-normal Part of Y1 | |||||||
| δ1 | 0.600 | 0.523 | 0.142 | 0.134 (0.137) | 0.512 | 0.162 | 0.154 |
| δ2 | 0.300 | 0.259 | 0.092 | 0.082 (0.083) | 0.260 | 0.125 | 0.113 |
| δ3 | 0.000 | 0.004 | 0.068 | 0.070 (0.072) | 0.006 | 0.133 | 0.126 |
| δ01 | −0.300 | −0.259 | 0.311 | 0.196 (0.202) | −0.267 | 0.433 | 0.343 |
| δ02 | 0.300 | 0.279 | 0.289 | 0.181 (0.187) | 0.277 | 0.394 | 0.297 |
Note: Results from the simulation study: average estimates, standard deviations (SD) of estimates, and average standard errors (SE) for the logistic part and log-normal part of Y1. Sandwich estimator results for BC2 are in parentheses.
From Table 2, under mis-specified case 1, where the variances are unequal, the sandwich estimator gives results very close to those of the traditional variance estimator under the constrained model (BC2) in both the logistic and log-normal parts, and both yield smaller SEs than the unconstrained model (BC0). For example, under model BC2 the traditional MLE variance estimator and the sandwich estimator give average SEs for β3 of 0.143 and 0.148, respectively, whereas under the unconstrained model the value is 0.195, about 30% larger than either. Table 3 leads to similar findings: under mis-specified case 2, where there are intra-class correlations, the sandwich estimator again gives results very close to those of the traditional variance estimator under BC2, and both remain more efficient than BC0. For example, under model BC2 the two estimators give average SEs for β3 of 0.120 and 0.123, respectively, whereas under the unconstrained model the value is 0.190, about 50% larger than either. The results in Tables 2 and 3 imply that, under these mis-specified cases, the traditional MLE still preserves most of its power and provides variance estimates very close to those of the sandwich estimator. Table 3 also shows that the average SEs and the SDs are not close for β01 and β02 (or for δ01 and δ02). This is not surprising given the random effects structure. Each group contains 10 clusters, so 10 values are generated from N(0, 1) within each group. Because we only estimate a “combined effect” for each group, in some extreme cases (less than 1% of all cases) the sum of the 10 generated cluster effects is extremely large or small; the MLE then gives very large (or very small) estimates of β01 and β02, and the SD values exceed the average SEs.
3.3. Other scenarios
We also tried different parameter settings in which the log-odds were very large or very small (not reported here), corresponding to many or few non-zero observations. To generate these scenarios, we simply increased or decreased the value of a0. We found that, when the log-odds were far from 0 (in either direction), the average estimated standard error of the MLE of a0 tended to increase because the sample size was insufficient for the logistic part. We also noticed that when the log-odds increased, i.e. when there were more non-zero observations, all the average SEs of the MLEs tended to decrease, and when the log-odds decreased, i.e. when there were more zero observations, all the average SEs of the MLEs tended to increase. This was to be expected because the number of positive observations depends on the log-odds. However, in those cases, especially when the log-odds were very large or very small, the average estimated values of the proportional constraints b1 and b2 were far from the true values. We checked the detailed results and found that the algorithm failed to converge in about 1% of the 1000 replicates and yielded extremely large or small estimates of b1 and b2. This was also due to insufficient sample size; when we increased the sample size, the phenomenon disappeared. We therefore suggest that, when the log-odds are very large or very small, the results be checked carefully and different starting values be tried to make sure that the algorithm has converged.
It was also of interest to check the estimation efficiency when R-squared values for positive parts were either large or small, which corresponded to the cases that the variances of random errors were either small or large. We found that when R-squared value was large for both positive parts, i.e. variances of error terms were small, the average SEs of all MLEs tended to decrease, and vice versa. It was reasonable because when the variances of errors were small, the estimated standard errors became smaller, and covariate effects became more significant.
From Table 1, it seems that the efficiency gains are smaller for larger parameter values. For example, the average SEs for β1 and δ1 (Table 1) are closer across the three models than the average SEs for β3 and δ3. To investigate further, we considered the previous simulation again and computed the asymptotic relative efficiency (ARE) of the MLEs of the parameters β1 and δ1 = b1β1 from model BC2 with respect to their MLEs from model UC1. We also computed the ARE of the MLEs from model BC2 with respect to the MLEs from model BC0. We found that, when the proportionality structure holds, the proposed model BC2 is more powerful than models UC1 and BC0 in determining the significance of the covariate effects. For example, in Table 9 of Supplemental Material Appendix F, when ρ = 0.7 and β1 = 0 (so that δ1 = 0), the ARE of the MLE of β1 from model BC2 with respect to the MLE from model BC0 is 0.27 (logistic part), and the value is 0.17 for δ1 = b1β1 (log-normal part). This shows that, when the proportionality structure holds and the sample size tends to infinity, BC2 needs less than one-third of the observations to reach the same power for testing whether a coefficient is zero. Detailed simulation results are provided in Supplemental Material Appendix F.
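The "less than one-third of the observations" reading of these ARE values can be sketched with simple arithmetic. The snippet assumes that the ARE is a ratio of asymptotic variances, so that it scales the required sample size directly; the reference sample size `n_reference` is a hypothetical value chosen only for illustration.

```python
def equivalent_sample_size(n_reference, are):
    """Observations the constrained model needs to match the precision that
    the reference model reaches with n_reference observations (assumes the
    ARE is a ratio of asymptotic variances)."""
    return are * n_reference


n_reference = 1000  # hypothetical sample size for the unconstrained model BC0
n_logistic = equivalent_sample_size(n_reference, 0.27)   # beta1, logistic part
n_lognormal = equivalent_sample_size(n_reference, 0.17)  # delta1, log-normal part

print(round(n_logistic), round(n_lognormal))
```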
4. Real data example
4.1. Description of data
Acute respiratory infection (ARI) illness, which includes influenza, flu-like syndromes, common cold, upper respiratory tract infection, or upper respiratory infection, is an acute infection, generally viral in etiology, involving structures of the upper airways.21 Our motivating example for this methodological development came from a randomized controlled trial11 called the MEPARI trial (Meditation or Exercise for Preventing Acute Respiratory Infection). It was designed to evaluate potential preventive effects of meditation or exercise on duration, and severity of ARI illness, and to study the relationship between ARI severity (or duration) and certain covariates. The trial included several periods; to illustrate our method, all the reported ARI illness episodes (or no ARI) were considered. In this data, the primary response variables were ARI severity and ARI duration. The three treatment groups were “Exercise,” “Meditation” and “Control,” which consisted of 47, 51 and 51 individuals (149 participants completed the trial), respectively. The two semi-continuous responses contained 83 zeros (about 56%), indicating that the corresponding participants did not report any ARI illness. Among those 66 participants (about 44%) with ARI, during the whole trial, 42 had one episode, 22 had two episodes, and 2 had three episodes, respectively. ARI severity and ARI duration were computed based on sum of all episodes. Table 4 shows the description of the responses and covariates in the data. Our primary interests were to test the treatment effects (group effects), and also to explore potential relationships between self-reported psychosocial measures (LOT to STAI Y-2 in Table 4) and ARI outcomes. Thus, the proposed bivariate two-part fixed effects model can be applied to the MEPARI data. The distributions of the two responses are shown in Figure 1 (top part). 
One can see that the ranges of the original data are large, and that both response distributions are highly skewed to the right. To help reduce skewness, we applied Box–Cox transformations to the positive parts of the two responses. From Figure 1 (middle part), after applying the proposed method to the log-transformed data, Vi,d = log(Yi,d) if Yi,d > 0, the residual histograms were closer to normal density curves. We also checked the residual versus fitted plot (bottom part of Figure 1) and did not find any violation of the homogeneous variance assumption.
Figure 1.
Distributions of ARI severity and duration (without zeros), top part shows original data, middle part shows distributions of residuals after applying the proposed method to log-transformed data, bottom part shows the residual versus fitted value plot for the final model on log-transformed data. ARI: acute respiratory infection.
Let Yi,1 denote the measure of ARI severity and Yi,2 denote the measure of ARI duration, xi be the covariate vector (values at baseline are used). To identify the three treatment effects, two indicator variables were added to show if a patient was from “Exercise” or “Meditation” group. The “Control” group received zero value for the two indicator variables. The coefficients of these two groups were defined as βE–C and βM–C. They had direct interpretations, i.e. they represented treatment differences between the Exercise (or Meditation) group and Control groups.
Note that some participants reported two or even three episodes of ARI. To model the differences, it was natural to add two dummy variables ei2 and ei3 in the positive parts. Specifically, for k = 2, 3, eik = 1 if the participant had k episodes of ARI and eik = 0 otherwise. With the settings above, for d = 1, 2, the model with proportional constraints was given by equation (6), we have
where β is the vector of coefficients for covariates including the treatment effects, and the coefficients ϕk,d (k = 2, 3) represent the increments of responses due to the second or third episode of the disease.
4.2. Results and analysis
To determine the final model, we first considered the full model with all the covariates, including Age to STAI Y-2 (in Table 4) as well as the two group differences, and then applied the previous iterative procedure to do the model selection. In the selection procedure, the group differences βE–C and βM–C were always kept in the model because they were our primary interest. Because the sample size was limited, to be conservative, α = 0.1 was used as significance level when we did the model selection. For the final model with proportional constraints, we present the estimates of regression coefficients in Table 5. For comparison, we also include the results for the bivariate model without constraints in Table 6. For the constrained model and unconstrained models, the −2log-likelihood values were 410.4 and 405.3, respectively. To test the proportionality, if we performed a likelihood ratio test between the constrained and unconstrained models, the test statistic had a value of 5.1, and followed a Chi-squared distribution with 10 degrees of freedom. Direct computation showed that the P-value was 0.884, indicating that the model with proportional constraints (i.e. a model with fewer number of parameters) should not be rejected. We also considered two intermediate models, where the first one only had proportionality constraint b1 for ARI severity, but unconstrained for ARI duration; the second one was vice versa. The likelihood ratio tests between the fully constrained model and two intermediate models had P-values 0.638 and 0.578, respectively, indicating that it was preferred to consider the proportionality constraints for both severity and duration parts. It is also interesting that the correlation coefficient ρ in Table 6 is noticeably high (0.802), which indicates that the two positive parts are highly correlated after log-transformation. This provides another reason that we may connect the two responses using the proposed model with proportional constraints.
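The likelihood ratio test above can be reproduced from the two reported −2 log-likelihood values. The sketch below uses the closed-form upper-tail probability of a Chi-squared distribution with an even number of degrees of freedom, which is sufficient here because the test has 10 degrees of freedom; the helper name `chi2_sf_even_df` is illustrative and not taken from the paper.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a Chi-squared variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0, ..., df/2 - 1.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


neg2loglik_constrained = 410.4
neg2loglik_unconstrained = 405.3
lrt_stat = neg2loglik_constrained - neg2loglik_unconstrained
p_value = chi2_sf_even_df(lrt_stat, df=10)

print(round(lrt_stat, 1), round(p_value, 3))
```

A large P-value here means that the data give no evidence against the proportional constraints.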
Table 5.
Results for ARI severity and ARI duration, with proportional constraints b1 and b2.
| Covariate | Logistic part: Est | Logistic part: S.E. | Logistic part: P-value | ARI severity: Est | ARI severity: S.E. | ARI severity: P-value | ARI duration: Est | ARI duration: S.E. | ARI duration: P-value |
|---|---|---|---|---|---|---|---|---|---|
| Non-zero % | 44.2% | | | 44.2% | | | 44.2% | | |
| Intercept | −1.518 | 1.155 (1.148) | 0.189 (0.186) | 3.744 | 1.412 (1.535) | 0.008 (0.008) | 1.400 | 0.606 (0.700) | 0.021 (0.015) |
| MAAS | 0.225 | 0.153 (0.178) | 0.143 (0.205) | 0.295 | 0.182 (0.222) | 0.105 (0.185) | 0.113 | 0.078 (0.096) | 0.145 (0.236) |
| BMI | 0.020 | 0.015 (0.017) | 0.168 (0.233) | 0.027 | 0.016 (0.020) | 0.092 (0.188) | 0.010 | 0.007 (0.008) | 0.136 (0.223) |
| SF12-Physical | −0.021 | 0.011 (0.013) | 0.070 (0.097) | −0.027 | 0.011 (0.013) | 0.017 (0.045) | −0.010 | 0.005 (0.005) | 0.025 (0.035) |
| STAI Y-2 | 0.028 | 0.016 (0.017) | 0.082 (0.091) | 0.037 | 0.017 (0.019) | 0.025 (0.044) | 0.014 | 0.008 (0.009) | 0.067 (0.132) |
| Exercise - Control | −0.270 | 0.243 (0.299) | 0.267 (0.367) | −0.354 | 0.251 (0.261) | 0.160 (0.175) | −0.136 | 0.107 (0.123) | 0.206 (0.270) |
| Meditation - Control | −0.444 | 0.265 (0.295) | 0.093 (0.133) | −0.582 | 0.241 (0.279) | 0.016 (0.037) | −0.223 | 0.116 (0.137) | 0.055 (0.104) |
| Two episodes | | | | 0.928 | 0.241 (0.220) | 0.000 (0.000) | 0.874 | 0.144 (0.136) | 0.000 (0.000) |
| Three episodes | | | | 1.136 | 0.714 (0.373) | 0.112 (0.002) | 0.950 | 0.415 (0.264) | 0.022 (0.000) |
| b1 | | | | 1.311 | 0.648 (0.817) | 0.043 (0.109) | | | |
| b2 | | | | | | | 0.503 | 0.281 (0.352) | 0.073 (0.153) |
| σ1 | | | | 0.889 | 0.078 (0.086) | 0.000 (0.000) | | | |
| σ2 | | | | | | | 0.541 | 0.047 (0.042) | 0.000 (0.000) |
| ρ | | | | 0.797 | 0.046 (0.049) | 0.000 (0.000) | 0.797 | 0.046 (0.049) | 0.000 (0.000) |
Note: Results based on the sandwich variance estimator are given in parentheses. ARI: acute respiratory infection.
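The proportional structure can be read directly off Table 5: for each covariate, the coefficient in a positive part should equal the logistic coefficient multiplied by b1 (severity) or b2 (duration), up to rounding of the reported values. The check below is a minimal sketch using the rounded table entries; the variable names are illustrative.

```python
# Estimates transcribed from Table 5 (rounded to three decimals in the source).
b1, b2 = 1.311, 0.503
table5 = {
    # covariate: (logistic part, ARI severity, ARI duration)
    "MAAS": (0.225, 0.295, 0.113),
    "BMI": (0.020, 0.027, 0.010),
    "SF12-Physical": (-0.021, -0.027, -0.010),
    "STAI Y-2": (0.028, 0.037, 0.014),
    "Exercise - Control": (-0.270, -0.354, -0.136),
    "Meditation - Control": (-0.444, -0.582, -0.223),
}

max_gap = 0.0
for name, (logit_coef, severity_coef, duration_coef) in table5.items():
    implied_severity = b1 * logit_coef  # constraint for the severity part
    implied_duration = b2 * logit_coef  # constraint for the duration part
    max_gap = max(
        max_gap,
        abs(implied_severity - severity_coef),
        abs(implied_duration - duration_coef),
    )
    print(f"{name:22s} {implied_severity:8.3f} {severity_coef:8.3f} "
          f"{implied_duration:8.3f} {duration_coef:8.3f}")
```

The intercepts and the episode indicators are left out because they are not subject to the proportional constraints.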
Table 6.
Results for ARI Severity and ARI duration, without constraints.
| Covariate | Logistic part: Est | Logistic part: S.E. | Logistic part: P-value | ARI severity: Est | ARI severity: S.E. | ARI severity: P-value | ARI duration: Est | ARI duration: S.E. | ARI duration: P-value |
|---|---|---|---|---|---|---|---|---|---|
| Non-zero % | 44.2% | | | 44.2% | | | 44.2% | | |
| Intercept | −0.999 | 2.351 | 0.671 | 3.170 | 1.525 | 0.038 | 0.565 | 0.919 | 0.539 |
| MAAS | 0.091 | 0.314 | 0.770 | 0.362 | 0.199 | 0.069 | 0.162 | 0.120 | 0.178 |
| BMI | 0.027 | 0.029 | 0.341 | 0.026 | 0.017 | 0.134 | 0.013 | 0.010 | 0.230 |
| SF12-Physical | −0.015 | 0.019 | 0.414 | −0.024 | 0.012 | 0.052 | −0.002 | 0.007 | 0.753 |
| STAI Y-2 | 0.022 | 0.027 | 0.413 | 0.040 | 0.018 | 0.029 | 0.019 | 0.011 | 0.083 |
| Exercise - Control | −0.730 | 0.419 | 0.082 | −0.239 | 0.282 | 0.397 | −0.163 | 0.170 | 0.339 |
| Meditation - Control | −0.561 | 0.409 | 0.170 | −0.598 | 0.263 | 0.023 | −0.293 | 0.159 | 0.065 |
| Two episodes | | | | 0.906 | 0.243 | 0.000 | 0.870 | 0.146 | 0.000 |
| Three episodes | | | | 1.120 | 0.722 | 0.121 | 1.018 | 0.435 | 0.019 |
| σ1 | | | | 0.886 | 0.077 | 0.000 | | | |
| σ2 | | | | | | | 0.534 | 0.046 | 0.000 |
| ρ | | | | 0.802 | 0.044 | 0.000 | 0.802 | 0.044 | 0.000 |
ARI: acute respiratory infection.
One thing to note from Table 5 is that, although we provide P-values, they are actually computed based on results of asymptotic Wald-tests, i.e. based on standard normal distributions. When the sample size is limited, as in the ARI data, the P-values from columns “ARI Severity” and “ARI Durations” do not reflect the true P-values of testing hypotheses H0 : b1βk = 0 and H0 : b2βk = 0, respectively, because the latter ones are also true if βk = 0. We list the “P-values” here to give an idea about the significance of coefficients. As a limitation of this study, we do not provide theoretical justifications of the distributions of MLEs when the sample size is limited. We also do not discuss much about the cases when either b1 (b2) is zero, or the whole covariate vector β = 0, because under those scenarios, the proportionality structures do not hold for all three parts or not well defined, and some parts degenerate to homogeneous cases. One needs to carefully determine whether a separate modelling method (i.e. unconstrained model) is more appropriate.
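The Wald P-values in Table 5 follow from each estimate and its standard error under a standard normal reference distribution. A minimal sketch is given below, with the function name `wald_p_value` chosen for illustration; small differences from the table are expected because the inputs are rounded.

```python
import math


def wald_p_value(estimate, se):
    """Two-sided P-value of a Wald test with a standard normal reference."""
    z = estimate / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Estimates and model-based SEs transcribed from Table 5.
p_intercept = wald_p_value(-1.518, 1.155)   # logistic intercept, reported 0.189
p_meditation = wald_p_value(-0.444, 0.265)  # Meditation - Control, reported 0.093
p_b1 = wald_p_value(1.311, 0.648)           # b1, reported 0.043

print(round(p_intercept, 3), round(p_meditation, 3), round(p_b1, 3))
```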
Comparing the constrained model and the unconstrained model, we notice that the constrained model is more efficient; for each coefficient, the estimated SE is much smaller using the constrained model, and most of the covariates are significant in all three parts under significance level 0.1. However, from Table 6, only the log-normal part of ARI severity has some significant covariate effects; for the logistic part and the log-normal part of ARI duration, most covariates are not significant. Notice that from Table 5, for the positive parts, if there is at least one significant covariate effect, the overall covariate effects (measured by b1 and b2) for positive parts are significant. Due to the special structure of the model with proportional constraints, there is no need for a covariate effect to be significant in all three parts, i.e. if the covariate is significant in at least one part, the covariate effect is regarded as significant.
From Table 5, we first notice that the P-values for b1 and b2 are small (0.043 and 0.073, respectively), indicating that the overall effects of the covariates are significant at the 0.1 level. The positive signs of b1 and b2 imply that a covariate that increases (decreases) the probability of getting ARI also increases (decreases) the severity and duration once ARI occurs, which is the expected direction. We also find that the P-values for the second episode are close to 0, and those for the third episode are 0.112 and 0.022, respectively, indicating that the increments of ARI severity (duration) due to more than one episode should not be ignored. For the treatment effects, the treatment difference “Meditation-Control” is negative and significant in all three parts, with P-values 0.093, 0.016 and 0.055, respectively. This implies that participants in the Meditation group have an overall lower probability of getting ARI illness than those in the Exercise and Control groups, and that ARI severity and duration are also lower for Meditation participants who do get ARI. Note that the estimates are computed on the log-transformed scale. To interpret the results in Table 5, with the other covariates held fixed, Meditation helps to decrease the probability of getting ARI and to reduce the severity and duration once ARI occurs: the estimate of −0.444 for the difference between Meditation and Control (logistic part) means that the odds of getting ARI in the Meditation group are 36% lower than the odds in the Control group (exp(−0.444) = 0.641), and the estimate of −0.582 (ARI severity part) means that the overall severity in the Meditation group is about 44% lower than that in the Control group (exp(−0.582) = 0.559).
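The conversion from a coefficient on the log scale to a percentage change, as used in the interpretation above, is a one-line computation. The helper name `percent_change` is illustrative.

```python
import math


def percent_change(coef):
    """Percentage change implied by a coefficient on the log (or log-odds) scale."""
    return (math.exp(coef) - 1.0) * 100.0


odds_change = percent_change(-0.444)      # Meditation - Control, logistic part
severity_change = percent_change(-0.582)  # Meditation - Control, ARI severity part

print(round(odds_change, 1), round(severity_change, 1))
```

Negative values indicate a reduction relative to the Control group.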
For the two log-normal parts, the covariate effects obtained by exponentiating the estimated regression coefficients are valid measures of the multiplicative impact on the mean response per unit difference in a covariate only when the underlying “homogeneous variance” assumption holds. Because the log-transformed response (log(Yi,d)) is used in the model, we need to verify the homogeneous variance assumption (as shown in the bottom part of Figure 1) to make sure that the variance is not correlated with the covariates, so that the exponentiated coefficients can be interpreted in this way. If this assumption is violated for some data, one may need to consider another model structure, for example, using the original values of Yi,d in equation (6) and applying a log link to the mean response E(Y) (then the positive part of the model becomes ). Alternatively, interpretation of exponentiated coefficients can be avoided. More discussion is given in Section 5. For the treatment difference between Exercise and Control, although the three estimates are negative (−0.270, −0.354 and −0.136, respectively), they are not significant at the 0.1 level. Under the traditional separate modelling method (Table 6), the Meditation effect is significant only for the two log-normal parts (P-values 0.023 and 0.065), while for the logistic part it is not significant (P-value 0.170), which would imply that Meditation can help to reduce ARI symptoms but may not help to prevent ARI. In the final model, the covariates BMI and STAI Y-2 have positive estimates (Table 5), indicating that higher BMI and more anxiety may increase the probability of getting ARI or increase the severity (and duration) for those with ARI.
Although the effect of BMI is not significant in all three parts, we keep BMI in our model because many studies suggest that people with higher BMI values suffer more from ARI illnesses.22,23 We also notice that the covariate SF12-Physical has a negative estimate, indicating that participants with better physical functioning have more resistance to ARI illness. It is surprising to find that the covariate MAAS has a positive estimate, which means that participants with greater mindfulness may have a higher probability of getting ARI or a higher severity (and duration) of the illness. This phenomenon is also found with the unconstrained model (Table 6). However, it does not contradict our previous analysis, for several possible reasons. First, the MAAS effect has large P-values in all three parts, indicating that the effect is not significant. Second, the data are longitudinal and the covariates are measured at different time points; we use the baseline values in the model, which correspond to participants’ mindfulness status prior to any treatment. During the trial, these self-reported psychosocial measures could be affected greatly by the treatments and many other factors. Third, from Table 6, the P-value of MAAS is 0.069 in the ARI severity part. Such a P-value points toward significance, so MAAS is kept in the constrained model. With a larger sample size, the effect of MAAS can be tested further.
We also report the results using the sandwich variance estimator for reference in Table 5 (shown in parentheses). The SEs of the coefficients are close, with differences in the 10–15% range for nearly all the SEs. For example, STAI Y-2 has SEs of 0.016, 0.017 and 0.008 in the logistic, ARI severity and ARI duration parts, respectively, while the values are 0.017, 0.019 and 0.009 when the sandwich variance estimator is used. For the treatment effect “Meditation-Control,” the SEs are 0.265, 0.241 and 0.116, respectively, compared with 0.295, 0.279 and 0.137 under the sandwich variance estimator. The P-values are listed for reference, and most of them are slightly larger when the sandwich method is used. Due to the limited sample size, the P-values are less informative. In light of the simulation studies, the traditional MLEs provide reasonable results and accurate SEs under different mis-specified cases. We conclude that, in the ARI data analysis, there is no material change to the SEs when the sandwich estimator is used. The proportionally constrained model provides a better way to describe the whole data structure and to perform further inference.
It is also of interest to perform the basic unadjusted analysis, which includes only the treatment effects and does not adjust for any other covariates. The results for the constrained and unconstrained models are given in Table 7. We also performed likelihood ratio tests to check the overall treatment effect. Notice that when both treatment effects were zero, the situation was “non-standard”. For the constrained model, when testing that both treatment effects were zero (H0 : βE-C = βM-C = 0), the test statistic had a value of 8.8 and followed a Chi-squared distribution with 4 degrees of freedom; the corresponding P-value was 0.066. For the unconstrained model, the test statistic had a value of 11.5 and followed a Chi-squared distribution with 6 degrees of freedom; the corresponding P-value was 0.074. It is interesting that, in the basic unadjusted case, the constrained and unconstrained models lead to similar conclusions on the overall treatment effect, which is not significant at the 0.05 level but is significant at the 0.1 level. Considering the limited sample size, these results suggest that the overall treatment effect cannot be ignored (at the 0.1 level).
Table 7.
Results for ARI severity and ARI duration, and comparison of constrained and unconstrained models.
| Covariate | Logistic part: Est | Logistic part: S.E. | Logistic part: P-value | ARI severity: Est | ARI severity: S.E. | ARI severity: P-value | ARI duration: Est | ARI duration: S.E. | ARI duration: P-value |
|---|---|---|---|---|---|---|---|---|---|
| Non-zero % | 44.2% | | | 44.2% | | | 44.2% | | |
| Proportional constraints | |||||||||
| Intercept | 0.015 | 0.316 | 0.963 | 6.255 | 0.195 | 0.000 | 2.612 | 0.124 | 0.000 |
| Exercise - Control | −0.222 | 0.365 | 0.542 | −0.373 | 0.329 | 0.258 | −0.203 | 0.185 | 0.274 |
| Meditation - Control | −0.515 | 0.486 | 0.288 | −0.864 | 0.328 | 0.008 | −0.470 | 0.207 | 0.023 |
| b1 | | | | 1.677 | 1.797 | 0.351 | | | |
| b2 | | | | | | | 0.913 | 0.998 | 0.360 |
| Without constraints | |||||||||
| Intercept | 0.197 | 0.281 | 0.485 | 6.197 | 0.204 | 0.000 | 2.581 | 0.131 | 0.000 |
| Exercise - Control | −0.765 | 0.414 | 0.065 | −0.146 | 0.332 | 0.660 | −0.082 | 0.213 | 0.699 |
| Meditation - Control | −0.553 | 0.400 | 0.167 | −0.865 | 0.312 | 0.006 | −0.471 | 0.200 | 0.019 |
Note: Case of basic unadjusted analysis. ARI: acute respiratory infection.
From Table 7, we also find that, although the estimates of b1 and b2 are not significant, the constrained model is still preferred to the unconstrained model. The test statistic of the likelihood ratio test between the two models had a value of 2.7 and followed a Chi-squared distribution with 2 degrees of freedom; the corresponding P-value was 0.259, which means that the model with proportional constraints was not rejected. The basic unadjusted results also show that the effect of Meditation is highly significant for both positive parts, indicating that, even in the simplest case, Meditation still reduces ARI severity and duration. For the logistic part, this result is not significant: when the sample size is limited and there are no other covariates, the treatment effects act like group intercepts, so the estimated standard errors are large and the estimates are not efficient.
5. Discussion
In the analysis of data from clinical trials, one problem is that, with limited sample size, covariates may not be statistically significant when tested individually; however, they may have overall covariate effects on the outcome measures and thus cannot be ignored. We refer to this overall effect as a common “single index,” a latent process that affects the logistic part as well as the two positive parts in similar ways. Another issue with multiple covariates is that researchers may find it difficult to determine which ones have significant effects. To assess the overall effects and find the covariates with significant effects, we propose a multivariate two-part fixed effects model in this research. For two semi-continuous responses with common zero observations, we show that when proportionality holds for the logistic and two positive parts, implementing such proportional constraints decreases the estimated SE of the MLE and thus increases efficiency. We also propose a stepwise algorithm for model selection in the presence of proportional constraints and introduce a robust variance estimation method to address different mis-specified cases. The proposed model is applied to the ARI illness data from the MEPARI trial. Using the bivariate two-part fixed effects model, our results suggest that the treatment “Meditation” decreases the probability of getting ARI and decreases ARI severity (duration) relative to the treatments “Exercise” and “Control”. They also suggest that the covariates SF12-Physical and STAI Y-2 have significant effects (at the 0.1 level) on the probability of getting ARI and on ARI severity (duration).
As mentioned, the multivariate two-part fixed effects model with proportional constraints can easily be extended to more than two responses. To demonstrate the extension, suppose we have D responses (D > 2) that take zero (or non-zero) values simultaneously. We can model these responses by D + 1 fixed effects models, one for the logit probability of Ui = 1 and D for the conditional mean responses E(Vi,d|Ui = 1). As in the bivariate case, for each response Yi,d we define Vi,d as in equation (2). The multivariate two-part fixed effects model with proportional constraints is then given by
where pi = P(Ui = 1), and ϵ = (ϵi,1, …, ϵi,D) is the vector of random errors, assumed to follow a non-degenerate multivariate normal distribution with mean 0 and covariance matrix Σ. It can be shown that the model with constraints is the most efficient; the proofs are similar to those of Theorems 1 and 2 under similar assumptions. We note that Vi,d in equation (2) is assumed to be Gaussian. This assumption is used to derive the asymptotic relative efficiency of the MLEs of the constrained model BC2; without the normality assumption, there is no explicit form for the covariance matrix. For other link functions and/or distributions from exponential families, the constrained model can still be adopted if the mean models for Ui and Vi,d|Ui (in equations (1) and (2)) and the other assumptions in Theorem 3 are satisfied. When the model is mis-specified, the simulation study provides empirical evidence that the constrained model is still more efficient than the unconstrained model.
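For orientation, one way such a D-response model could be written is sketched below. The notation is an assumption pieced together from the symbols used in this section (intercepts a0, ..., aD, proportionality parameters bd, and the common predictor μi); in particular, the linear form of μi in the covariates and the use of an identity link for Vi,d are assumed here and should be checked against equation (6).

```latex
% Hedged sketch: symbols follow this section, the exact form is an assumption.
\begin{aligned}
\operatorname{logit}(p_i) &= a_0 + \mu_i, \qquad \mu_i = x_i^{\top}\beta,\\
V_{i,d} \mid (U_i = 1) &= a_d + b_d\,\mu_i + \epsilon_{i,d}, \qquad d = 1,\dots,D,\\
(\epsilon_{i,1},\dots,\epsilon_{i,D})^{\top} &\sim N_D(0,\Sigma).
\end{aligned}
```

Under this assumed form, all D+1 linear predictors share the single index μi. With p covariates, the (D+1)p slope parameters of an unconstrained model are replaced by p + D parameters (β and b1, ..., bD), which is the source of the efficiency gain when proportionality holds.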
Another possible extension is similar to model (1) in Han and Kronmal,15 in which one set of covariates w has a coefficient vector α that is not affected by the proportionality constraint, while the coefficients of the other set of covariates x are subject to it. We call this structure a partially constrained model. It is already covered by the univariate case of our proposed model, as one can see from using , instead of a0, a1 in our univariate model. For the bivariate and multivariate cases, the notation and proofs are similar. In the partially constrained case, the proposed model provides accurate estimates and smaller SEs than the traditional unconstrained model.
As a limitation of this study, we do not discuss in detail the case in which the proportionality structure does not hold for all three parts as assumed in equation (6). For example, it is possible that only one log-normal part and the logistic part include the common predictor μi, or that only the two log-normal parts include μi, or even that none of the three parts share such a μi. Under these scenarios, the proposed model in (6) is mis-specified, and the estimates are biased. This can be mitigated by fitting a simpler proportionally constrained model for each scenario. K-fold cross validation can be applied to check whether the estimates are consistent, and we always recommend comparing the results with those of the unconstrained model. When the constrained and unconstrained models give several estimates with opposite signs, or the proportionality parameters bd (d=1, 2) have estimates that are hard to interpret (e.g. estimates that are small compared with their SEs, or signs that run against intuition), it is better to fit a model with fewer assumptions, such as the unconstrained model or a partially constrained model.
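The comparison described above can be sketched as a small diagnostic. This is a minimal illustration, not the authors' procedure: the function names, the covariate names, and all numbers are hypothetical, and it assumes that the slope implied by the constrained model for positive part d is bd times the corresponding coefficient in μi. The default threshold 1.645 is the two-sided normal critical value at the 0.1 level.

```python
from typing import Dict, List


def implied_slopes(beta: Dict[str, float], b_d: float) -> Dict[str, float]:
    """Slopes implied for positive part d, assuming slope_j = b_d * beta_j."""
    return {name: b_d * coef for name, coef in beta.items()}


def sign_disagreements(constrained: Dict[str, float],
                       unconstrained: Dict[str, float]) -> List[str]:
    """Covariates whose two estimates have strictly opposite signs."""
    return [name for name, est in constrained.items()
            if name in unconstrained and est * unconstrained[name] < 0]


def weak_constraints(b_hat: Dict[str, float], b_se: Dict[str, float],
                     z_min: float = 1.645) -> List[str]:
    """Proportionality parameters with |estimate / SE| below z_min."""
    return [name for name, est in b_hat.items()
            if abs(est / b_se[name]) < z_min]


# Hypothetical estimates, made up for illustration only.
beta_hat = {"age": 0.20, "bmi": -0.10}          # coefficients in mu_i
unconstrained_part1 = {"age": 0.25, "bmi": 0.05}  # free slopes, positive part 1
b_hat = {"b1": 1.5, "b2": 0.2}
b_se = {"b1": 0.5, "b2": 0.4}

flagged_signs = sign_disagreements(implied_slopes(beta_hat, b_hat["b1"]),
                                   unconstrained_part1)
flagged_b = weak_constraints(b_hat, b_se)
print(flagged_signs)  # ['bmi']
print(flagged_b)      # ['b2']
```

Any covariate or proportionality parameter flagged this way is a prompt to refit with fewer assumptions, not a formal test.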
Another limitation is that the proposed model is not the optimal choice for a panel data structure (typically handled with a random effects model). As one can see from Table 3, when there is a random effect term, the proposed fixed effects model does not provide satisfactory results. Even with the sandwich estimator, the estimates still have large SEs because of insufficient sample size. In this case, a random effects model may be more appropriate, and the proportionality structure can be incorporated into it to obtain more efficient estimates.
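To make the role of the sandwich estimator concrete, the sketch below computes the Huber-White form A^{-1} B A^{-1} for a deliberately simple working model, yi ~ N(θ, 1), with an optional cluster-level version that sums scores within clusters. It assumes the robust method referred to here has this standard form (an assumption based on references 19 and 20); the function name and the toy data are hypothetical and unrelated to the MEPARI analysis.

```python
from collections import defaultdict
from typing import Hashable, Optional, Sequence


def sandwich_variance_mean(y: Sequence[float],
                           clusters: Optional[Sequence[Hashable]] = None) -> float:
    """Sandwich variance of the MLE of theta under the working model N(theta, 1).

    A is minus the summed second derivative of the log-likelihood (here n),
    B is the sum of squared score totals, taken per observation or per cluster.
    """
    n = len(y)
    theta_hat = sum(y) / n                      # MLE of theta
    scores = [yi - theta_hat for yi in y]       # per-observation score at theta_hat
    a = float(n)                                # each second derivative equals -1
    if clusters is None:
        clusters = range(n)                     # every observation is its own cluster
    totals = defaultdict(float)
    for score, cluster in zip(scores, clusters):
        totals[cluster] += score                # sum scores within each cluster
    b = sum(total * total for total in totals.values())
    return b / (a * a)                          # A^{-1} B A^{-1} for a scalar parameter


# Toy data: two clusters whose members share a common level.
y = [0.0, 0.0, 4.0, 4.0]
model_based = 1.0 / len(y)                      # variance if N(theta, 1) were correct
robust = sandwich_variance_mean(y)
cluster_robust = sandwich_variance_mean(y, clusters=["a", "a", "b", "b"])
print(model_based, robust, cluster_robust)      # 0.25 1.0 2.0
```

In this toy example the cluster-level variance is twice the observation-level one, which illustrates why a shared random effect inflates SEs even after a robust correction is applied.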
Acknowledgements
The authors thank the editor, the associate editor, and the referees for their valuable suggestions.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The data for this paper came from the MEPARI2 trial, which was supported by the National Institutes of Health, National Center for Complementary and Integrative Health (NCCIH; R01AT006970). While this paper was written, Bruce Barrett was supported by a mid-career research and mentoring grant from NCCIH (grant no. K24AT006543). Rathouz’s effort on this project was supported by NIH (grant no. R01HL094786-05A1).
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
- 1. Lu SE, Lin Y and Shih WCJ. Analyzing excessive no changes in clinical trials with clustered data. Biometrics 2004; 60: 257–267.
- 2. Hall DB and Zhang Z. Marginal models for zero inflated clustered data. Stat Model 2004; 4: 161–180.
- 3. Su L, Tom BD and Farewell VT. Bias in 2-part mixed models for longitudinal semicontinuous data. Biostatistics 2009; 10: 374–389.
- 4. Su L, Tom BD and Farewell VT. A likelihood-based two-part marginal model for longitudinal semicontinuous data. Stat Meth Med Res 2015; 24: 194–205.
- 5. Smith VA, Preisser JS, Neelon B, et al. A marginalized two-part model for semicontinuous data. Stat Med 2014; 33: 4891–4903.
- 6. Olsen MK and Schafer JL. A two-part random-effects model for semicontinuous longitudinal data. J Am Stat Assoc 2001; 96: 730–745.
- 7. Tooze JA, Grunwald GK and Jones RH. Analysis of repeated measures data with clumping at zero. Stat Meth Med Res 2002; 11: 341–355.
- 8. Duan N, Manning WG, Morris CN, et al. A comparison of alternative models for the demand for medical care. J Business Econom Stat 1983; 1: 115–126.
- 9. Madden CW, Mackay BP, Skillman SM, et al. Risk adjusting capitation: applications in employed and disabled populations. Health Care Manage Sci 2000; 3: 101–109.
- 10. Smith VA, Neelon B, Maciejewski ML, et al. Two parts are better than one: modeling marginal means of semicontinuous data. Health Service Outcome Res Methodol 2017; 17: 198–218.
- 11. Barrett B, Hayney MS, Muller D, et al. Meditation or exercise for preventing acute respiratory infection: a randomized controlled trial. Ann Fam Med 2012; 10: 337–346.
- 12. Barrett B, Brown R, Mundt M, et al. The Wisconsin upper respiratory symptom survey is responsive, reliable, and valid. J Clin Epidemiol 2005; 58: 609–617.
- 13. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992; 34: 1–14.
- 14. Moulton LH, Curriero FC and Barroso PF. Mixture models for quantitative HIV RNA data. Stat Meth Med Res 2002; 11: 317–325.
- 15. Han C and Kronmal R. Two-part models for analysis of Agatston scores with possible proportionality constraints. Commun Stat Theory Meth 2006; 35: 99–111.
- 16. Liu A, Kronmal R, Zhou X, et al. Determination of proportionality in two-part models and analysis of multi-ethnic study of atherosclerosis (MESA). Stat Its Interf 4: 475.
- 17. Liu H, Ma S, Kronmal R, et al. Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA). Ann Appl Stat 2012; 6: 1236.
- 18. Pukelsheim F. Optimal design of experiments. vol. 50. SIAM, 1993.
- 19. Huber PJ. The behavior of maximum likelihood estimates under nonstandard conditions. 1967.
- 20. White H. Maximum likelihood estimation of misspecified models. Econometrica: J Econometric Soc 1982; 50: 1–25.
- 21. Fendrick AM, Monto AS, Nightengale B, et al. The economic burden of non–influenza-related viral respiratory tract infection in the United States. Arch Intern Med 2003; 163: 487–494.
- 22. Jedrychowski W, Maugeri U, Flak E, et al. Predisposition to acute respiratory infections among overweight preadolescent children: an epidemiologic study in Poland. Public Health 1998; 112: 189–195.
- 23. Campitelli M, Rosella L and Kwong J. The association between obesity and outpatient visits for acute respiratory infections in Ontario, Canada. Int J Obesity 2014; 38: 113–119.