Skip to main content
Entropy logoLink to Entropy
. 2023 May 28;25(6):863. doi: 10.3390/e25060863

Tweedie Compound Poisson Models with Covariate-Dependent Random Effects for Multilevel Semicontinuous Data

Renjun Ma 1,*, Md Dedarul Islam 1, M Tariqul Hasan 1, Bent Jørgensen 2
Editor: Yuehua Wu
PMCID: PMC10297622  PMID: 37372207

Abstract

Multilevel semicontinuous data occur frequently in medical, environmental, insurance and financial studies. Such data are often measured with covariates at different levels; however, these data have traditionally been modelled with covariate-independent random effects. Ignoring dependence of cluster-specific random effects and cluster-specific covariates in these traditional approaches may lead to ecological fallacy and result in misleading results. In this paper, we propose Tweedie compound Poisson model with covariate-dependent random effects to analyze multilevel semicontinuous data where covariates at different levels are incorporated at relevant levels. The estimation of our models has been developed based on the orthodox best linear unbiased predictor of random effect. Explicit expressions of random effects predictors facilitate computation and interpretation of our models. Our approach is illustrated through the analysis of the basic symptoms inventory study data where 409 adolescents from 269 families were observed at varying number of times from 1 to 17 times. The performance of the proposed methodology was also examined through the simulation studies.

Keywords: best linear unbiased predictors, clustered data, random effects, repeated data, two-part models, zero-inflated data

MSC: 62J05, 62J12, 62P10

1. Introduction

Random effects models have been widely used in the analysis of multilevel data in various fields of applications. Random effects have traditionally been assumed to be covariate-independent in these models; however, such assumption may lead to ecological fallacy [1] in the interpretation of cluster or sub-cluster level covariates. The ecological fallacy occurs if the misleading inference is made about covariate effects when group level covariates are associated with individual observations directly. To account for correlation between cluster-specific random effects and cluster-specific covariates, models with covariate-dependent random effects have been introduced to handle clustered binary data [2], zero-inflated count data [3], discrete-survival data [4] and survival data [5,6]. Modeling of multilevel semicontinuous data has recently become an area of active research [7,8,9]; however, the cluster-specific and sub-cluster-specific random effects are assumed covariate-independent so far in the literature. Multilevel semicontinuous data with cluster-specific covariates are frequently observed in medical, environmental, economic and insurance studies. An example of three-level semicontinuous data with cluster or sub-cluster level covariates is the basic symptoms inventory (BSI) study data [10] where 409 adolescents from 269 independent HIV infected parents were observed at varying number of times from 1 to 17 times. The purpose of this study was to evaluate the effectiveness of an intervention. The intervention was designed based on social learning theory that focused primarily on skill building to reduce the risk from dangerous sexual practices and substance abuse [10]. At baseline, patients and adolescents were randomly assigned to the intervention (132 patients and 203 youth) or standard care (137 patients and 206 youth). The effect of intervention on adolescents was measured by the Global severity index (GSI). GSI is a composite index that based on 53 psychiatric indices which are an indicator of abnormal mental behavior [11]. Response variable of GSI scores is right-skewed with excessive number of exact zeros. Furthermore, parental, child-specific and visit-specific covariates were observed for these 409 adolescents; therefore, accounting for the covariates at their relevant levels would be more appropriate.

In this paper, we introduce a Tweedie compound Poisson model with covariate-dependent cluster-specific and sub-cluster-specific random effects for three-level semicontinuous data. Unlike two-part models [8,9], our model characterizes zero and positive components in an integral way instead of separately. In addition, the skewness can be flexibly modeled by the power index parameter of Tweedie family of compound Poisson models. Furthermore, our approach consolidates conditional and marginal modeling. The estimation of our model has been developed based on the orthodox best linear unbiased predictors (BLUP) of random effects given the data as an extension of [12]. Explicit expressions derived for BLUP predictors of random effects predictors in this paper facilitate computation and interpretation of our models. To the best of our knowledge, this is the first time a compound Poisson model is developed with covariate-dependent random effects. Tweedie compound Poisson models are often used to characterize insurance claims, rainfalls, pollutants and health data in practice.

The rest of the paper is organized as follows. After introducing our compound Poisson mixed model with covariate-dependent random effects, we discuss prediction of random effects and model estimation in Section 2 and Section 3, respectively. Our approach is illustrated through the analyses of the basic symptoms inventory data in Section 4. The simulations and conclusion are presented in Section 5 and Section 6.

2. Tweedie Compound Poisson Models with Covariate-Dependent Random Effects

2.1. The Model

The zero-inflated continuous data are usually referred to as semicontinuous data. More specifically, we only consider nonnegative semicontinuous data in this paper. Let Yijk represent the semicontinuous responses recorded from the kth (k=1,2,,nij) observation of the jth (j=1,2,,Ji) sub-cluster of the ith (i=1,2,,I) independent cluster. Then the response vector can be expressed as Y=(Y111,,Y11n11,,YIJ1,,YIJnIJ). We consider the cluster level random effect Ui for the response of the ith cluster and the sub-cluster level random effect Vij for the response from the jth sub-cluster of the ith cluster. Let W=(U,V) denote the vector of the random effects where U=(U1,,Ui,,UI) and V=(V1,,Vj,,VJI) with Vj=(Vj1,,Vjj,,VjJI) respectively. Our model is based on the following three assumptions:

Assumption 1.

Cluster level random effects U1,,Ui,,UI are independently distributed with gamma distribution with mean μi and variance σ2. That is E(Ui)=μi and Var(Ui)=σ2, where μi=expZiβ(1) with Zi be the cluster level covariate vector of the ith cluster.

Assumption 2.

Given the cluster level random effects U=(U1,,Ui,,UI), the sub-cluster level random effects V11,,V1J1,,VJ1,,VJJI are conditionally independent following gamma distribution with mean μijUi and variance τ2Ui1, where μij=expZijβ(2) with Zij represents sub-cluster level covariates.

Assumption 3.

Given the cluster and sub-cluster level random effect W=(U,V), the responses Yijk follow compound Poisson distribution which can be expressed as

Yijk|Wcpyijk;ρ2exp1ρ2yijkμijk1p1pμijk2p2pwhere1<p<2, (1)

where the explicit expression for cp(yijk;ρ2) are given by [13] but are immaterial in our derivation in the following sections. This Tweedie family is also known as the power-variance family since we have E(Yijk|W)=μijkVij and var(Yijk|W)=ρ2Vij1pμijkqVitp=ρ2μijkpVij. In (1), μijk=expZijkβ(3), where Zijk represents observation level covariates.

The index parameter is restricted to the range 1<p<2 for the semicontinuous response since only this sub-family of Tweedie distributions has a support of the response Yijk on left-closed interval [0,+). In addition, gamma distributed random effects have been traditionally used in the multiplicative models for multilevel data.

2.2. Moment Structure

The main focus of this section is to present the moment structures of the compound Poisson mixed model with covariate-dependent random effects discussed in the previous subsection. The moments of the model can be obtained after some algebraic calculations by methods of conditioning on random effects. The unconditional mean and covariance are presented here to facilitate the parameter estimation. As Ui independently follows gamma distribution with mean μi and variance σ2, the moment structure of Ui can be expressed as

E(Ui)=μiandCov[Us,Ui]=δ(s,i)σ2μi2,

where δ(s,i)=1, if s=i and 0 otherwise. Following Assumption 2, the conditional mean and variance of Vij|U=u can be expressed as E(Vij|U)=μijUi and Var(Vij|U)=τ2μij2Ui. This implies that the unconditional mean of Vij is

E(Vij)=μiμij (2)

and the covariance of Vst and Vij is

Cov[Vst,Vij]=δ(s,i)σ2μi2μijμit+δ(t,j)τ2μiμij2. (3)

After some extensive algebra, Cov[Us,Yijk] and Cov[Vst,Yijk] can be expressed as

Cov[Us,Yijk]=δ(s,i)σ2μi2μijμijk

and

Cov[Vst,Yijk]=δ(s,i)σ2μi2μijμit+δ(t,j)τ2μiμij2μijk,

respectively. Following Assumption 3, the conditional mean and variance of E[Yijk|W] can be simplified as E[Yijk|W]=μijkVij and Var[Yijk|W]=ρ2μijkpVij, respectively. Thus the unconditional mean of the response variable Yijk can be calculated by conditioning on the vector of random effects W, which can be expressed as

E[Yijk]=μiμijμijk=exp(Xijkβ), (4)

where Xijk=(Zi,Zij,Zijk). Similarly the unconditional covariance Yijk and Ystl can be simplified as

Cov[Ystl,Yijk]=δ(s,i)σ2μi2μijμitμijkμitl+δ(t,j)[τ2μiμij2μijkμijl+δ(l,k)ρ2μiμijμijkp]. (5)

The moments structures described above will be used to develop the estimating equations for the regression and random effect parameters in next section.

In the literature, marginal and conditional inferences refer to the transformed linearity in regression parameters under appropriate link function for marginal and conditional means, respectively [14,15,16]. Our approach consolidates both conditional and marginal modeling interpretations under one model since both the conditional mean and the marginal mean of our model are clearly linear in regression parameters under the log-link.

3. Estimation of Parameters

In this section, we discuss the prediction of random effects and estimation of regression and random effects parameters.

3.1. Best Linear Unbiased Predictors of Random Effects

As in [12], the orthodox BLUP predictors of cluster level random effects Ui can be expressed in matrix form as follows

U^i=E(Ui)+Cov(Ui,Yi)Var(Yi)1YiE(Yi). (6)

Furthermore, the explicit expression of BLUP for cluster level random effects Ui are given by

U^i=μi+σ2μij=1Jik=1nijwijμijk1pYijk1+σ2μij=1Jik=1nijwijμijμijk2p, (7)

where wij=1/(ρ2+τ2μijk=1nijμijk2p). Similarly the BLUP predictors of the sub-cluster level random effects Vij can be calculated as

V^ij=E(Vij)+cov(Vij,Yi)var(Yi)1YiE(Yi), (8)

which can be simplified as

V^ij=μijU^i+τ2μijwijk=1nijμijk1p(YijkμijμijkU^i)=ρ2wijμijU^i+τ2μijwijk=1nijμijk1pYijk. (9)

The BLUP predictors for cluster and sub-cluster random effects given in Equations (7) and (9) are positive with explicit linear combinations of semicontinuous responses. These BLUP predictors provide the best prediction of the random effects among the class of all linear functions of the observed responses.

3.2. Estimation of Regression Parameters

In this section, we first consider the estimation for the regression parameters in the case of known random effect parameters. To do that, following [12], we differentiate the partially observed ‘joint’ log-likelihood of the Tweedie mixed model for the data and random effects with respect to β=(β(1),β(2),β(3)) and yield the partially observed ‘joint’ score function. Then we replace the random effects with their BLUP predictors to obtain an unbiased estimating function for the regression parameters β. To be specific, the score function for cluster level regression parameters can be achieved by the following score function

β(1)=i=1Iziμi1σ2(Uiμi),

which can be modified after including the BLUP predictors of the cluster level random effect as

ψ(1)β=i=1IZiμi1βσ2U^i(β)μi(β)=i=1Iψi(1)β, (10)

where the ith component corresponds to the ith independent cluster. Similarly the sub-cluster level and observation level score functions can simplified as

ψ(2)β=i=1Ij=1JiZijμij1βτ2V^ij(β)U^i(β)μij(β)=i=1Iψi(2)β, (11)

and

ψ(3)β=i=1Ij=1Jik=1nijZijkμijk1pβρ2YijkV^ij(β)μijk(β)=i=1Iψi(3)β, (12)

Without loss of generality, we assume that the matrix of the covariates X=(Zi,Zij,Zijk) is of full rank. Let the global estimating function be defined by

ψβ=ψ(1)β,ψ(2)β,ψ(3)β. (13)

As in [12,17], a global matrix expression can be obtained for this ψ(β). Furthermore, a simple relationship among sensitivity matrix S(β)=Eβψ(β)/β, variability matrix V(β)=Eβψ(β)ψ(β) and the Godambe Information matrix J(β)=S(β)V(β)1S(β) can be obtained. Since the joint density for (Y,U) forms an exponential family in regression parameter β, the proof in [17] applies to ψ(β) as long as the score functions are linear in both U and Y. However, this linearity is clear since the score functions are in fact the right hand side of (10)–(12) with U^(β) being replaced by U. A component-wise proof can also be found in [18]. Thus we have the following results:

Theorem 1.

For Tweedie compound Poisson models with covariate-dependent random effects, the predicted global score function ψ(β), sensitivity matrix S(β) and variability matrix V(β) can be expressed as follows:

  • 1.

    ψ(β)=XdiagE(Y)Var1(Y)YE(Y),

  • 2.

    V(β)=Varψ(β)=XdiagE(Y)Var1(Y)diagE(Y)X,

  • 3.

    J(β)=S(β)=V(β).

Following [12], under mild regularity conditions, the solution of the estimating equation ψ(β)=0 is consistent and asymptotically normal with asymptotic mean β and asymptotic variance given by the inverse of the sensitivity matrix S(β)=Eβψ(β)/β as the subjects are assumed to be independent. In addition, this estimating function ψ(β)=0 is optimal in the sense that it attains the minimum asymptotic covariance for the estimator of β within the class of all linear functions of Y [12]. As in [12], this estimation equation ψ(β)=0 can be solved iteratively using the scoring algorithm, where the value of β is updated following

β*=βS1(β)ψ(β).

As the asymptotic variance matrix of regression parameter vector β^ is given by the inverse of the negative sensitivity matrix S1(β), the asymptotic variance for each component of regression parameter vector β^ is given by the corresponding diagonal element of its asymptotic variance matrix, S1(β); therefore, the estimated standard errors of β^ are obtained as the square root of diagonal elements of its asymptotic variance matrix, S1(β).

3.3. Estimation of Random Effects Parameters

The dispersion parameters are now assumed to be unknown. The unknown dispersion parameter is estimated using the adjusted Pearson estimator [12]. The adjusted Pearson estimate has some advantages; one of them is that the Pearson estimator is adjusted by bias correction. We may thus estimate σ2 by the following adjusted Pearson estimator:

σ^2=1Ii=1I(U^iμi)2μi2+1Ii=1Id(i)μi2, (14)

where the explicit expression for d(i) can be expressed as

d(i)=E(U^iUi)2=σ2μi21+σ2μij=1Jik=1nijwijμijμijk2p.

The expression of τ2 is written as

τ^2=1Ii=1I1Jij=1Ji(V^ijμijU^i)2μiμij2+1Ii=1I1Jij=1Jid(i)μij2+d(ij)2ρ2d(i)wijμij2μiμij2, (15)

where d(ij)=E(V^ijVij)2=ρ2wijτ2μiμij2+ρ2d(i)wijμij2. The expression for ρ2 is written as

ρ^2=1Ii=1I1Jij=1Ji1nijk=1nij(YijkV^ijμijk)2μiμijμijkp+1Ii=1I1Jij=1Ji1nijk=1nijd(ij)μijk2pμiμij. (16)

As the clusters are assumed to be independent, standard estimating function theory can be applied to demonstrate the consistency and asymptotic normality for both regression and random effects parameters of our models as the number of clusters tends to infinity. In fact, this proof has actually been shown in the Chapter 5 of [18]. In addition, our estimation algorithm iterates among updating regression parameters through the scoring method, predicting random effects via BLUP predictor by (6) and (9), updating dispersion parameters through the moment estimators given in (14)–(16).

4. Analysis of Basic Symptoms Inventory Study Data

In this section, we apply our proposed covariate dependent semicontinuous model to the BSI study data.

4.1. Model Specification

In the BSI study, 409 adolescents from 269 independent HIV infected parents were followed over time between 1995 and 2002. Adolescents were observed at varying number of times from 1 to 17 times. Therefore, the data are of a hierarchical structure with observations nested within adolescents and adolescents further nested by their parents. The response of interest is the measurement of global severity index (GSI) scores with covariates at observation, adolescent and parent levels. A complete list of these variables with descriptions is given in Table 1.

Table 1.

Variable description for the BSI data.

Variable Type Name Description
Response     GSI BSI global severity index.
Explanatory Cluster level Parent level
    Treatment Intervention or not.
    Gender.Par 1 if Parent is female or 0 otherwise.
    Age.Par Parent’s baseline age.
Sub-cluster level Adolescent level
    Age.Adol Adolescent’s baseline age.
    Race.Adol 1 if Adolescent is Hispanic or 0 otherwise.
    Gender.Adol 1 if Adolescent is female or 0 otherwise.
Observation level
    Months Number of months adolescent in the study.
    Spring Spring season.
    Summer Summer season.

The BSI data were analyzed in [19] as longitudinal normal data ignoring parental level; however, there is a large proportion of exact zeros in the GSI response as shown in Figure 1; therefore, a three level model for semicontinuous data is more appropriate.

Figure 1.

Figure 1

Histogram for the Global severity index (GSI) scores of Basic symptoms inventory (BSI) Data. The bold bar on the left of the histogram indicates proportion of exact zeros.

Let Yijk represent the GSI for observation k from the adolescent j of the ith parent. The random effects for parent i and adolescent j of parent i are denoted by Ui and Vij, respectively. The three-level Tweedie models with covariate-dependent random effects (TMCDRE) is then specified as follows.

Yijk|V=vTwp(μijkvij,ρ2vij1p), (17)

with μijk=exp(zijkβ(3)). In (17), zijk represents observation level covariates and Vij|U*=u* follows

Vij|U*=u*Gamma(μijui,τ2ui1), (18)

where μij=exp(zijβ(2)) with zij represents sub-cluster level covariates. In (18), cluster level random effects Ui are assumed to be independent with

UiGamma(μi,σ2), (19)

where μi=exp(ziβ(1)) with zi represents cluster level covariates.

4.2. Analysis Results

First, we obtained a maximum likelihood estimate p^ from Tweedie’s compound Poisson model using the R package “tweedie” Version 2.07 (Dunn, 2010) in the presence of the previous described covariates. The estimated index parameters p was 1.55. For p=1.55, we applied our approach to estimate regression and random effect parameters based on the model specified above. The regression and random effect parameters are presented in Table 2. To compare our proposed model with Tweedie compound Poisson models with covariate-independent random effects, we also presented analysis results based on the conventional Tweedie mixed model (CTMM) in Table 2.

Table 2.

Parameter estimates for the BSI data based on CTMM and TMCDRE.

Levels Covariates TMM TMCDRE
Estimates St. Errors p-Value Estimates St. Errors p-Value
Observation Intercept −1.1574 0.4449 0.0093 −1.2114 0.4515 0.0074
Months −0.0473 0.0072 0.0000 −0.0503 0.0071 0.0000
Spring 0.0866 0.0325 0.0061 0.0880 0.0316 0.0045
Summer −0.0204 0.0338 0.5353 −0.0173 0.0326 0.5892
Adolescent Age.Ad 0.0444 0.0221 0.0455 0.0403 0.0221 0.0719
Race.Ad 0.1226 0.0908 0.1802 0.1148 0.0908 0.2077
Gender.Ad 0.4017 0.0863 0.0001 0.4058 0.0863 0.0000
Parent Treatment 0.0149 0.0916 0.8729 0.0110 0.0918 0.9045
Gender.Pr 0.2305 0.1240 0.0643 0.2310 0.1326 0.0819
Age.Pr −0.0199 0.0092 0.0308 −0.0165 0.0095 0.0854
σ2 0.1007 0.1050
τ2 0.5921 0.3844
ρ2 0.5316 0.6794

Our results in Table 2 show that the effects of covariates at observation level on the GSI score did not change materially in terms of direction, significance and magnitude. Second, all covariates at adolescent level had positive effects on the GSI score with race factor clearly insignificant and gender factor highly significant. The interesting phenomenon is that effect of baseline age on the GSI score changed from significant to insignificant at the significance level of 0.05 when all covariates were considered at their relevant levels. Third, again the effects of covariates at parent level on the GSI score did not change materially in terms of direction, significance and magnitude. Our results demonstrate that analysis results may change materially when all covariates were associated at their relevant levels. We have not observed any dramatic change in this analysis. This might partly be explained by the nature of involved covariates in this study since no covariates on health status were available. Another interesting phenomenon is that the variation at the adolescent level was clearly shifted to that at the parent level when all covariates were associated at their relevant levels. A finding of practical importance is that the intervention was not effective whether covariates were associated at their relevant levels in the analysis or not.

5. Simulation Studies

In this section, we compare the performance of the proposed models (TMCDRE) with the conventional Tweedie mixed model (CTMM) through simulation studies to assess the effects of associating all covariates were considered at their relevant levels. To replicate data in realistic conditions, we simulated data based on analysis results given in Table 2 with respective arrangements of covariates under the TMCDRE and CTMM. The estimated regression parameters (β^) and random effect parameters (σ2, τ2, ρ2) under the TMCDRE and CTMM in Table 2 were taken as true values for the respective models. The simulation study under the TMCDRE and CTMM were done separately in Section 5.1 and Section 5.2.

5.1. Simulating Data from the Tweedie Compound Poisson Model with Covariate-Dependent Random Effects

In this section, we first generated clustered semicontinuous data from the TMCDRE model in Table 2. We then analysed the simulaed based on the TMCDRE and CTMM models. Our objective here was to compare the performance of TMCDRE and CTMM when the true data were generated from TMCDRE. The data generation procedure is given below.

  • We first generate 269 samples (u1,,u269) using Gamma(μi,σ2) where,
    μi=exp(β7Treatment+β8Parent Gender+β9Parent age)
  • In the second step, we generate ni samples (vi1,,vini and ni varies from 1 to 5) for each ui using Gamma(μijui,τ2ui1), where
    μij=exp(β4Base age+β8Hispanic+β7Gender).
  • Finally, we generate nij samples (Yij1,,Yijnij and nij varies from 1 to 17) for each vij using vijTwp(μijk,ρ2vij), where
    μijk=exp(β0+β1True month+β2Spring+β3Summer).

We carried out 400 simulation runs using the procedure discussed above. Estimated bias, simulated standard errors (Sim SE) and estimated standard errors (Est SE) of the regression and random effect parameters using TMCDRE and CTMM techniques are presented in Table 3. Parameter estimates of cluster level covariates are less biased for TMCDRE than for CTMM. Parameter estimate of the sub-cluster level continuous covariate Adolescent base age is less biased for TMCDRE than for CTMM. TMCDRE produced less biased estimates for observation level binary covariates Spring and Summer than CTMM. All dispersion parameters are overestimated by the TMCDRE, whereas only σ2 and τ2 are overestimated by CTMM. The simulation study shows that ρ2 is underestimated by the CTMM approach. Dispersion parameters are more biased for CTMM than for TMCDRE.

Table 3.

Summary statistics for 400 simulations from Tweedie model with covariate-dependent random effects.

Levels Covariates True Value CTMM TMCDRE
Bias Sim. SE a Est. SE b Bias Sim. SE Est. SE
Observation Intercept (β0) −1.2114 −0.0146 0.4463 0.4401 −0.0238 0.4354 0.4439
Months (β1) −0.0503 −0.0004 0.0075 0.0081 0.0006 0.0073 0.0086
Spring (β2) 0.0880 −0.0015 0.0303 0.0317 −0.0008 0.0296 0.0309
Summer (β3) −0.0173 −0.0004 0.0318 0.0312 0.0002 0.0317 0.0312
Sub-cluster Age.Ad (β4) 0.0403 0.0012 0.0234 0.0218 0.0010 0.0233 0.0219
Race.Ad (β5) 0.1148 0.0007 0.0906 0.0899 0.0009 0.0897 0.0895
Gender.Ad (β6) 0.4058 0.0062 0.0887 0.0852 0.0062 0.0871 0.0845
Cluster Treatment (β7) 0.0110 0.0081 0.0920 0.0909 0.0076 0.0906 0.0906
Gender.Pr (β8) 0.2310 0.0093 0.1247 0.1229 0.0090 0.1232 0.1254
Age.Pr (β9) −0.0165 −0.0007 0.0100 0.0091 −0.0004 0.0098 0.0094
σ2 0.1050 0.0018 0.0561 0.0034 0.0551
τ2 0.3844 0.1871 0.1000 0.0117 0.1170
ρ2 0.6794 −0.1471 0.0718 0.0270 0.1201

a: Sim. SE, standard error of estimates over 400 simulations. b: Est. SE, average of 400 estimated standard errors.

Table 3 shows the Sim SE and Est SE. Clearly, differences between Sim SE and Est SE of parameter estimates of observation level covariates (True month, Spring and Summer) are smaller for TMCDRE technique than those from the CTMM technique. Differences between Sim SE and Est SE for regression parameter estimates of all sub-cluster level covariates (Adolescents base age, Hispanic, Gender) and cluster level covariates (Treatment and Parent base age) are less for the TMCDRE than for CTMM. The only exception is with the covariate Gender.Pr. As expected, TMCDRE performed better than for CTMM when data were simulted from the TMCDRE model.

5.2. Simulating Data from the Tweedie Compound Poisson Model with Covariate-Independent Random Effects

In this section, we first generate clustered semicontinuous data using the conventional Tweedie Mixed Model (CTMM) method. Then, we analyze these simulated data by using TMCDRE and CTMM methods. Our objective is to compare TMCDRE and CTMM’s performances when the true data were generated based on CTMM. The data generation technique is discussed below:

  • We will generate 269 samples (u1,,u269) using Gamma(1,σ2).

  • In the second step, we will generate ni samples (vi1,,vini and ni varies from 1 to 5) for each ui using Gamma(ui,τ2ui1).

  • Finally, we will generate nij samples (Yij1,,Yijnij and nij varies from 1 to 17) for each vij using vijTwp(μijk,ρ2vij), where
    μijk=exp(βTX).

The fixed effect and random effect parameter estimates in Table 2 using CTMM technique (β^=(1.15,0.047,0.086,0.020,0.044,0.122,0.401,0.0149,0.230,0.019), σ2=0.10, τ2=0.59, and ρ2=0.53) are used as true parameter values. Similar to Section 5.1, we carried out 400 simulation runs using the procedure discussed above. Estimated bias, simulated stadard errors (Sim SE) and estimated standard errors (Est SE) of the regression and random effect parameters using TMCDRE and CTMM techniques presented in Table 4.

Table 4.

Summary statistics for 400 simulations from Conventional Tweedie mixed model.

Levels Covariates True Value CTMM TMCDRE
Bias Sim. SE a Est. SE b Bias Sim. SE Est. SE
Observation Intercept (β0) −1.1574 −0.0053 0.4170 0.4373 −0.0001 0.4193 0.4455
Months (β1) −0.0473 −0.0001 0.0071 0.0071 0.0001 0.0072 0.0070
Spring (β2) 0.0866 −0.0002 0.0332 0.0318 −0.0004 0.0331 0.0320
Summer (β3) −0.0204 −0.0003 0.0342 0.0326 −0.0004 0.0342 0.0324
Sub-cluster Age.Ad (β4) 0.0444 −0.0014 0.0205 0.0217 −0.0016 0.0205 0.0219
Race.Ad (β5) 0.1226 −0.0029 0.0836 0.0893 −0.0031 0.0848 0.0897
Gender.Ad (β6) 0.4017 0.0020 0.0833 0.0846 0.0028 0.0840 0.0845
Cluster Treatment (β7) 0.0149 0.0010 0.0892 0.0902 0.0014 0.0891 0.0908
Gender.Pr (β8) 0.2305 −0.0052 0.0091 0.0090 −0.0047 0.0091 0.0094
Age.Pr (β9) −0.0199 −0.0003 0.1240 0.1221 −0.0005 0.1232 0.1262
σ2 0.1007 0.0036 0.0544 0.0104 0.0552
τ2 0.5921 −0.0294 0.1029 −0.2436 0.1397
ρ2 0.5316 0.0030 0.0305 −0.1576 0.1001

a: Sim. SE, standard error of estimates over 400 simulations. b: Est. SE, average of 400 estimated standard errors.

Our simulated results indicate that although data are generated from CTMM, parameter estimates from TMCDRE for observation level continuous covariate True month is less biased than the corresponding parameter estimate from CTMM. TMCDRE parameter estimate for the cluster level binary covariate Parent gender is also less biased than the corresponding parameter estimate from CTMM. CTMM gives less biased estimates for all other covariates than the TMCDRE. Dispersion parameters σ2 and ρ2 are overestimated by both conventional and covariate-dependent approach, whereas τ2 is underestimated by CTMM and TMCDRE. Dispersion parameters are more biased for TMCDRE than those from CTMM are.

Table 4 also indicates that the differences between Sim SE and Est SE for parameter estimate of observation level binary covariate (Spring) are smaller for TMCDRE than the CTMM. One important feature of TMCDRE is that covariates are separated in their corresponding level. The simulation study shows that the differences between Sim SE and Est SE of parameter estimates corresponding to binary covariates (Hispanic, Gender) in sub-cluster level are smaller for TMCDRE than for CTMM. Differences between Sim SE and Est SE for all other covariate are less for CTMM than TMCDRE. In other words, when data were generated from the CTMM, CTMM performed better than TMCDRE in general; however, TMCDRE still produced better estimates for some covariates than the CTMM.

6. Conclusions

In this paper, we have introduced a three-level Tweedie compound Poisson model with covariate-dependent random effects for semicontinuous data. As random effects at different levels are likely to be associated with the covariates at relevant levels, this new model enables us to account for such association in our analysis. Our analysis of BSI data demonstrate that analysis results may change materially when all covariates were associated at their relevant levels. Our simulation study has shown that Tweedie compound Poisson model with covariate-dependent random effects performs better than Tweedie compound Poisson model with covariate-independent random effects when random effects at different levels are associated with the covariates at relevant levels. Although we also compared performances between models with covariate-dependent and covariate-independent random effects when random effects at different levels are not associated with the covariates at relevant levels, such no association situation is far less likely in practice. Therefore, Tweedie compound Poisson model with covariate-dependent random effects is more appropriate for analysis of multilevel semicontinuous data with covariates at different levels.

Tweedie compound Poisson model with covariate-dependent random effects becomes far more complex than Tweedie compound Poisson model with covariate-independent random effects; however, we still obtained explicit expression for random effects predictors at different levels. These explicit expressions may facilitate interpretation of clustering effects at cluster and sub-cluster levels.

Models with covariate-dependent random effects have been introduced to handle multilevel binary, zero-inflated and survival data in the literature. Our model for multilevel semicontinuous data has enriched this class of models. Furthermore, Tweedie models with covariate-dependent random effects have also been developed in [18,20] for multilevel continuous and count data.

Acknowledgments

This paper is dedicated to the memory of Bent Jørgensen. The authors thank the editor and four anonymous referees for their helpful comments that greatly improved the presentation of these results.

Author Contributions

Conceptualization, B.J. and R.M.; methodology, R.M.; software, M.D.I. and M.T.H.; formal analysis, M.D.I., R.M. and M.T.H.; data curation, M.D.I., R.M. and M.T.H.; writing—original draft preparation, M.T.H., R.M. and M.D.I.; writing—review and editing, R.M. and M.T.H. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from https://robweiss.faculty.biostat.ucla.edu/book-data-sets (accessed on 14 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This research was funded by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2020-04751 and RGPIN-2017-04246).

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Freedman D.A. Ecological inference and the ecological fallacy. Int. Encycl. Soc. Behav. Sci. 1999;6:4027–4030. [Google Scholar]
  • 2.Wang Z., Louis T.A. Marginalized binary mixed-effects models with covariate-dependent random effects and likelihood inference. Biometrics. 2004;60:884–891. doi: 10.1111/j.0006-341X.2004.00243.x. [DOI] [PubMed] [Google Scholar]
  • 3.Wong K.Y., Lam K.F. Modeling zero-inflated count data using a covariate-dependent random effect model. Stat. Med. 2013;32:1283–1293. doi: 10.1002/sim.5626. [DOI] [PubMed] [Google Scholar]
  • 4.Scheike T.H., Ekstrøm C.T. A discrete survival model with random effects and covariate-dependent selection. Appl. Stoch. Model. Data Anal. 1998;14:153–163. doi: 10.1002/(SICI)1099-0747(199806)14:2&#x0003c;153::AID-ASM343&#x0003e;3.0.CO;2-9. [DOI] [Google Scholar]
  • 5.Liu D., Kalbfleisch J.D., Schaubel D.E. A positive stable frailty model for clustered failure time data with covariate-dependent frailty. Biometrics. 2011;67:8–17. doi: 10.1111/j.1541-0420.2010.01444.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Zhou H., Hanson T., Jara A., Zhang J. Modelling county level breast cancer survival data using a covariate-adjusted frailty proportional hazards model. Ann. Appl. Stat. 2015;9:43. doi: 10.1214/14-AOAS793. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Zhang Y. Likelihood-based and Bayesian methods for Tweedie compound Poisson linear mixed models. Stat. Comput. 2013;23:743–757. doi: 10.1007/s11222-012-9343-7. [DOI] [Google Scholar]
  • 8.Su L., Tom B.D., Farewell V.T. A likelihood-based two-part marginal model for longitudinal semicontinuous data. Stat. Methods Med. Res. 2015;24:194–205. doi: 10.1177/0962280211414620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Liu L., Strawderman R.L., Johnson B.A., O’Quigley J.M. Analyzing repeated measures semi-continuous data, with application to an alcohol dependence study. Stat. Methods Med. Res. 2016;25:133–152. doi: 10.1177/0962280212443324. [DOI] [PubMed] [Google Scholar]
  • 10.Bursch B., Lester P., Jiang L. Rotheram-Borus, M.J.; Weiss, R. Psychosocial predictors of somatic symptoms in adolescents of parents with HIV: A six-year longitudinal study. AIDS Care. 2008;20:667–676. doi: 10.1080/09540120701687042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.McGillicuddy N.B., Rychtarik R.G., Morsheimer E.T., Burke-Storer M.R. Agreement Between Parent and Adolescent Reports of Adolescent Substance Use. J. Child Adolesc. Subst. Abuse. 2007;16:59–78. doi: 10.1300/J029v16n04_04. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Ma R., Jørgensen B. Nested generalized linear mixed models: An orthodox best linear unbiased predictor approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 2007;69:625–641. doi: 10.1111/j.1467-9868.2007.00603.x. [DOI] [Google Scholar]
  • 13.Duan X., Ma R., Zhang X. Forecasting occurrence and quantity of monthly precipitation simultaneously while accounting for complex serial correlation. Int. J. Climatol. 2022;42:9494–9509. doi: 10.1002/joc.7839. [DOI] [Google Scholar]
  • 14.Lee Y., Nelder J.A. Hierarchical generalized linear models (with discussion) J. R. Stat. Soc. Ser. B Stat. Methodol. 1996;58:619–678. doi: 10.1111/j.1467-9876.2006.00538.x. [DOI] [Google Scholar]
  • 15.Lee Y., Nelder J.A. Conditional and marginal models: Another view. Stat. Sci. 2004;19:219–238. doi: 10.1214/088342304000000305. [DOI] [Google Scholar]
  • 16.Ma R., Yan G., Hasan M.T. Tweedie family of generalized linear models with distribution-free random effects for skewed longitudinal data. Stat. Med. 2018;37:3519–3532. doi: 10.1002/sim.7841. [DOI] [PubMed] [Google Scholar]
  • 17.Ma R., Krewski D., Burnett R.T. Random effects Cox models: A Poisson modelling approach. Biometrika. 2003;90:157–169. doi: 10.1093/biomet/90.1.157. [DOI] [Google Scholar]
  • 18.Ma R. Ph.D. Thesis. University of British Columbia; Vancouver, BC, Canada: 1999. An Orthodox BLUP Approach to Generalized Linear Mixed Models. [Google Scholar]
  • 19.Weiss R. Modeling Longitudinal Data. Springer; New York, NY, USA: 2005. [Google Scholar]
  • 20.Islam M.D. Master’s Thesis. University of New Brunswick; Fredericton, NB, Canada: 2013. Analysis of Clustered Data Using Tweedie Models with Covariate-Dependent Random Effects. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data are available from https://robweiss.faculty.biostat.ucla.edu/book-data-sets (accessed on 14 March 2023).


Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES