Author manuscript; available in PMC: 2013 Jun 12.
Published in final edited form as: Stat Interface. 2011 Oct 1;4(4):475–487. doi: 10.4310/sii.2011.v4.n4.a6

Determination of proportionality in two-part models and analysis of Multi-Ethnic Study of Atherosclerosis (MESA)

Anna Liu 1, Richard Kronmal 2, Xiaohua Zhou 3, Shuangge Ma 4
PMCID: PMC3680156  NIHMSID: NIHMS452150  PMID: 23772262

Abstract

In MESA (Multi-Ethnic Study of Atherosclerosis), it is of interest to model the development and progression of CAC (coronary artery calcium). With about half of the CAC scores equal to zero and the rest continuously distributed, semiparametric two-part models are needed. Our main interest lies in determining the (partial) proportionality between the two covariate effects in two-part models. Such an investigation can provide important information on the mechanisms underlying CAC development. We propose a novel approach, which consists of penalized maximum likelihood estimation and a step-wise hypothesis testing procedure to determine proportionality. Simulation shows satisfactory performance of the proposed approach. Analysis of MESA suggests that proportionality holds for all covariates except LDL and HDL.

Keywords and phrases: Two-part models, Proportionality, Semiparametric estimation

1. INTRODUCTION

The MESA (Multi-Ethnic Study of Atherosclerosis) is an ongoing study of the prevalence, risk factors, and progression of subclinical cardiovascular disease in a multi-ethnic cohort (Bild et al. 2002). It provides a valuable opportunity to study the development and progression of CAC (coronary artery calcium), which is an important risk factor for various coronary heart diseases. In MESA, the CAC is measured with the Agatston score, which is the amount of calcium at each lesion scaled by an attenuation factor and summed over all lesions. The histogram in Figure 1 shows that the CAC has a mixture distribution, with about half of the CAC scores equal to zero and the rest continuously distributed.

Figure 1. Analysis of MESA: histogram of log(1 + CAC).

Data with characteristics similar to those of CAC have been referred to as “zero-inflated data”. Existing methods for analyzing such data include the marginal likelihood method, quasi-likelihood method (McCulloch and Searle 2001), penalized quasi-likelihood method (Yau and Lee 2001), non-parametric maximum likelihood method (Min and Agresti 2005), Bayesian method (Ghosh et al. 2006), penalized likelihood method (Ma 2009) and others. Among available models, two-part models have attracted extensive attention. Two-part models have a long history in the economic, statistical, and biomedical literature. Unlike alternatives such as the promotion models (Thompson and Chhikara 2003), two-part models do not assume specific data generating mechanisms. On a special note, two-part models have been suggested as the default models for describing the CAC in MESA (http://mesa-nhlbi.org/).

In two-part models, there are two covariate effects. The focus of this study is on the determination of proportionality between them. Denote X = (X1, X2, X3) as the covariate. Motivated by Figure 1, we consider Y = log(1 + CAC) and the following two-part model. In the first part, assume

$$\varphi^{-1}\bigl(\Pr(Y>0\mid X)\bigr)=h(X), \tag{1}$$

where $\varphi$ is the link function, $\varphi^{-1}$ is the inverse of $\varphi$, and $h(X)$ is the unknown covariate effect. In the second part of the model, assume

$$\text{for } Y>0:\quad Y\mid X=h^{*}(X)+\varepsilon, \tag{2}$$

where $h^{*}(X)$ is the unknown covariate effect and $\varepsilon$ is the random error.

With models (1) and (2), the two covariate effects are proportional if $h^{*}(X)=\tau h(X)$ with $\tau\neq 0$. In our study, a biologically meaningful result demands $\tau>0$. In our data analysis, such a result is naturally obtained without any constraint. When full proportionality does not hold, there can be multiple scenarios. Consider for example additive covariate effects with $h(X)=h_1(X_1)+h_2(X_2)+h_3(X_3)$. Partial proportionality holds if $h^{*}(X)=\tau\bigl(h_2(X_2)+h_3(X_3)\bigr)+\bigl(\tau h_1(X_1)+\tilde h_1(X_1)\bigr)$ with $\tilde h_1(X_1)\neq 0$ and $\tau\neq 0$. That is, proportionality of covariate effects holds for $X_2$ and $X_3$ but not $X_1$. Other partial proportionality scenarios can be defined in a similar manner.

Determination of proportionality may provide a deeper understanding of CAC development. Models (1) and (2) describe the development of CAC in different ranges, with model (1) describing the development from zero to nonzero and model (2) describing the development above zero. If full proportionality holds, then the same mechanism – which corresponds to h(X) – determines development in both ranges. In contrast, under the partial proportionality described above, there may be two mechanisms. The first corresponds to h2(X2) + h3(X3), which remains the same in both ranges of CAC values. In contrast, the mechanism corresponding to X1 differs between the two ranges. We note that models (1) and (2) have different link functions and are on different scales. However, when investigating covariate effects, we are more interested in contributions of covariates relative to each other. Thus, it is meaningful to compare h(X) against h*(X).

Published proportionality studies include the zero-inflated Poisson regression model in Lambert (1992) and Albert et al. (1997), logit-(log) gamma two-part model in Moulton et al. (2002), and logit-linear two-part model in Han and Kronmal (2006). These studies show that determining proportionality structure can provide insights into the biological mechanisms underlying (for example) disease development. In addition, compared with models without proportionality constraints, models with fully or partially proportional covariate effects have fewer unknown parameters and can be more accurately estimated.

The aforementioned proportionality studies have assumed parametric covariate effects. For the CAC in MESA, McClelland et al. (2006) and our analysis suggest that semi-parametric models may be needed. With semiparametric two-part models, we conjecture that determination of proportionality with respect to parametric covariate effects can be achieved using the hypothesis testing approach in Han and Kronmal (2006), although such a possibility has not been investigated. On the other hand, determination of proportionality with respect to nonparametric covariate effects has not been studied.

In this article, we investigate the determination of proportionality of covariate effects in semiparametric two-part models. This study extends the published literature in four respects. First, it goes beyond existing proportionality studies by adopting flexible semiparametric models. Second, the hypothesis testing approach for determining proportionality covers semiparametric models and uses a stepwise procedure that can accommodate multiple covariate effects. Third, it goes beyond published two-part model studies by investigating different models and, more importantly, by developing an effective approach for determining proportionality. Last, it provides a comprehensive analysis of CAC, which may help advance our understanding of the development of coronary heart diseases.

The rest of the article is organized as follows. We introduce the data and model settings in Section 2. We describe the proposed method in Section 3. We consider a penalized maximum likelihood approach for estimation and a hypothesis testing approach for determination of proportionality. We conduct simulation in Section 4 and analyze the MESA data in Section 5. The article concludes with discussion in Section 6.

2. DATA AND MODEL

Let Y = log(1+CAC). Without loss of generality, denote X = (X1, X2, X3)′ and Z = (Z1, Z2, Z3)′ as covariates. In the first part of the two-part model, assume that

$$\varphi^{-1}\bigl(\Pr(Y>0\mid X,Z)\bigr)=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+f_1(Z_1)+f_2(Z_2)+f_3(Z_3)=\beta'\tilde X+f(Z), \tag{3}$$

where $\varphi$ is the link function and $\varphi^{-1}$ is its inverse, $\beta=(\beta_0,\beta_1,\beta_2,\beta_3)'$, $\tilde X=(1,X')'$, and $f(Z)=f_1(Z_1)+f_2(Z_2)+f_3(Z_3)$. In the second part of the model, assume that for $Y>0$:

$$Y\mid X,Z=\tau\bigl(\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+f_1(Z_1)+f_2(Z_2)+f_3(Z_3)\bigr)+\alpha_0+\alpha_2X_2+\alpha_3X_3+g_1(Z_1)+g_2(Z_2)+g_3(Z_3)+\varepsilon=\tau\bigl(\beta'\tilde X+f(Z)\bigr)+\alpha'\check X+g(Z)+\varepsilon, \tag{4}$$

where $\alpha=(\alpha_0,\alpha_2,\alpha_3)'$, $\check X=(1,X_2,X_3)'$, and $g(Z)=g_1(Z_1)+g_2(Z_2)+g_3(Z_3)$. For identifiability, we assume that for the “anchor” covariate $X_1$, $\tau\beta_1\neq 0$; in addition, $Pf_i=Pg_i=0$, where $P$ denotes expectation. Motivated by Figure 1, we assume $\varepsilon\sim N(0,\sigma^2)$.

In (3) and (4), $\alpha$, $\beta$, $\tau$, and $\sigma$ are the unknown parametric parameters, and $f$ and $g$ are the unknown nonparametric covariate effects. Motivated by McClelland et al. (2006), we assume that $f$ and $g$ are smooth functions.
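To make the role of the extra terms in the second part concrete, the following short derivation may help. It is only a rearrangement of the two model equations above; the shorthand $\eta_1$ for the linear predictor of the first part is used here for convenience.

```latex
\begin{align*}
\eta_1 &= \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+f_1(Z_1)+f_2(Z_2)+f_3(Z_3),\\
Y \mid X,Z,\,Y>0 &= \tau\eta_1
  + \underbrace{\alpha_0+\alpha_2X_2+\alpha_3X_3+g_1(Z_1)+g_2(Z_2)+g_3(Z_3)}_{\text{departure from proportionality}}
  + \varepsilon,\\
\alpha_2=\alpha_3=0,\ g_1=g_2=g_3=0
  &\;\Longrightarrow\; Y \mid X,Z,\,Y>0 = \tau\eta_1+\alpha_0+\varepsilon .
\end{align*}
```

Read this way, $\alpha_i=0$ corresponds to proportionality for $X_i$ and $g_i\equiv 0$ to proportionality for $Z_i$. The anchor $X_1$ carries no separate coefficient in the second part, which is what fixes the scale of $\tau$.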

3. PENALIZED ESTIMATION AND DETERMINATION OF PROPORTIONALITY

We propose a penalized estimation approach and use penalized splines for nonparametric covariate effects. The equivalence between penalized spline models and mixed models has been well established (Speed 1991; Wang 1998; Wand 2003). We take advantage of this equivalence and transform the hypothesis testing on proportionality to one on fixed parameters and variance components in the corresponding mixed models. Inference is then made through the marginal likelihood of the semi-continuous data. Hypothesis testing on variance components or smoothing parameters, or more generally on nonparametric functions in semi-parametric regression, has been investigated (Hardle et al. 1998; Zhang and Lin 2003; Claeskens 2004; Liu et al. 2005; Crainiceanu et al. 2005; Fan and Jiang 2007; Jose Lombardia and Sperlich 2008; Kauermann et al. 2009). We choose the likelihood ratio based test, which has been shown to be more powerful in the literature. The parametric bootstrap is used to obtain approximated null distributions.

3.1 Penalized estimation

For an observation with covariate (X, Z) and response Y, the log-likelihood function is

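$$l(\alpha,\beta,\tau,\sigma,f,g\mid X,Z)=I(Y>0)\Bigl\{-\tfrac12\log(2\pi)-\tfrac12\log(\sigma^2)-\frac{\bigl(Y-\tau(\beta'\tilde X+f(Z))-\alpha'\check X-g(Z)\bigr)^2}{2\sigma^2}\Bigr\}+I(Y>0)\log\varphi\bigl(\beta'\tilde X+f(Z)\bigr)+I(Y=0)\log\bigl(1-\varphi(\beta'\tilde X+f(Z))\bigr), \tag{5}$$

with $\tilde X=(1,X')'$ and $\check X=(1,X_2,X_3)'$ as in (3) and (4). In this study, we set $\varphi$ as the logit link function. Assume $n$ iid observations. With smooth $f$ and $g$, we consider the penalized maximum likelihood estimate (PMLE)

$$(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g)=\arg\max\bigl\{P_n l-\lambda_f^2J^2(f)-\lambda_g^2J^2(g)\bigr\}. \tag{6}$$

Here $P_n$ is the empirical measure, $\lambda_f$ and $\lambda_g$ are the tuning parameters, $J^2(f)=\sum_{i=1}^{3}J^2(f_i)=\sum_{i=1}^{3}\int\bigl(f_i^{(s)}\bigr)^2\,dZ_i$ is the penalty on smoothness (with $J^2(g)$ defined analogously), and $f_i^{(s)}$ is the $s$th derivative of $f_i$. In this study, we set $s=2$.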

3.2 Estimation with thin plate splines

Under the assumptions described in the Appendix, we restrict $\hat f$ and $\hat g$ to be spline functions. In general, the penalty in (6) does not necessarily lead to a thin plate spline solution. However, when the regression function is one-dimensional, the thin plate penalty (equation (4.48) on page 135 of Gu (2002)) is the same as the integrated squared second derivative penalty, as demonstrated in Example 4.1 of Gu (2002). In our study, the functions $f_i$ and $g_i$ are one-dimensional and we use $s=2$. Thus, we use thin plate splines with $K$ knots to estimate the nonparametric covariate effects. The penalized splines we use include smoothing splines as a special case when the knots are the design points. For the development of asymptotic properties, the full basis function space (with knots at the design points) is needed. In computation, we follow common practice and take the number of knots to be smaller than the number of design points. As a limitation of this study, we do not provide theoretical justification for the validity of this approach. Of note, even though quite a few studies have used a smaller number of knots, only Kim and Gu (2004) provide a rigorous development.

For a generic function m(x), its thin plate spline representation is

$$m(x)=d_0+d_1x+\sum_{k=1}^{K}c_k\,|x-p_k|^3, \tag{7}$$

where $d_0$, $d_1$ and the $c_k$ are the unknown regression coefficients and the $p_k$ are the fixed knots.
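The representation above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `tps_basis` and `tps_eval` and the example knots are assumptions made here, and only the standard library is used.

```python
def tps_basis(x, knots):
    """Return the two pieces of the one-dimensional thin plate spline basis at x.

    linear part: (1, x)
    cubic part:  (|x - p_1|^3, ..., |x - p_K|^3)
    """
    linear = [1.0, x]
    cubic = [abs(x - p) ** 3 for p in knots]
    return linear, cubic


def tps_eval(x, d0, d1, c, knots):
    """Evaluate m(x) = d0 + d1*x + sum_k c_k * |x - p_k|^3."""
    if len(c) != len(knots):
        raise ValueError("need one coefficient per knot")
    linear, cubic = tps_basis(x, knots)
    return d0 * linear[0] + d1 * linear[1] + sum(ck * bk for ck, bk in zip(c, cubic))


# Hypothetical example: three knots on [0, 1] and arbitrary coefficients.
example_knots = [0.0, 0.5, 1.0]
example_value = tps_eval(0.25, d0=1.0, d1=2.0, c=[1.0, 0.0, 0.0], knots=example_knots)
```

Stacking the `linear` and `cubic` rows over all design points gives the two design matrices that multiply the unpenalized and penalized coefficients, respectively.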

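For $i=1,2,3$, at the design points, we have $f_i(Z_i)=T_id_{fi}+\Sigma_ic_{fi}$ and $g_i(Z_i)=T_id_{gi}+\Sigma_ic_{gi}$, where $T_i=(1,Z_i)$, $\Sigma_i=(|Z_i-p_{i1}|^3,\ldots,|Z_i-p_{iK}|^3)$, the $p_{ik}$ are the knots, and $d_{fi}=(d_{0fi},d_{1fi})'$, $d_{gi}=(d_{0gi},d_{1gi})'$, $c_{fi}=(c_{1fi},\ldots,c_{Kfi})'$, $c_{gi}=(c_{1gi},\ldots,c_{Kgi})'$ are the regression coefficients. Denote $\theta=(\alpha_0,\alpha_2,\alpha_3,\beta_0,\beta_1,\beta_2,\beta_3,\tau,\sigma,d_{f1},d_{f2},d_{f3},d_{g1},d_{g2},d_{g3})$ and $b=(c_{f1},c_{f2},c_{f3},c_{g1},c_{g2},c_{g3})$. Once the knots are chosen following Wahba (1990), penalization on the smoothness is equivalent to penalization on the coefficient $b$. To allow further flexibility, instead of using unified $\lambda_f$ and $\lambda_g$ for all components of $f$ and $g$, we can use different $\lambda_{fi}$ and $\lambda_{gi}$ for $i=1,2,3$. With these notations, the penalized log-likelihood function defined in (6) can be rewritten as

$$P_n\,l(Y\mid\theta,b,\sigma^2)-\sum_{i=1}^{3}\lambda_{fi}^2\,c_{fi}'D_ic_{fi}-\sum_{i=1}^{3}\lambda_{gi}^2\,c_{gi}'D_ic_{gi}, \tag{8}$$

with

$$l(Y\mid\theta,b,\sigma^2)=I(Y>0)\,\eta_1-\log\bigl(1+\exp(\eta_1)\bigr)-I(Y>0)\Bigl(\tfrac12\log(2\pi)+\tfrac12\log\sigma^2+\frac{(Y-\eta_2)^2}{2\sigma^2}\Bigr),$$

$$\eta_1=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\sum_{i=1}^{3}\bigl(T_id_{fi}+\Sigma_ic_{fi}\bigr),\qquad \eta_2=\tau\eta_1+\alpha_0+\alpha_2X_2+\alpha_3X_3+\sum_{i=1}^{3}\bigl(T_id_{gi}+\Sigma_ic_{gi}\bigr).$$

$D_i$ is a $K\times K$ matrix with its $k$th row equal to $(|p_{ik}-p_{i1}|^3,\ldots,|p_{ik}-p_{iK}|^3)$.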

The objective function defined in (8) is concave in both θ and b, and can be maximized using the Newton-Raphson approach.
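As a rough illustration of how the penalized objective could be evaluated, the sketch below computes the per-observation log-likelihood from given linear predictors and subtracts the quadratic penalties. It is not the authors' code. It assumes the logit link, assumes the smoothing parameters enter squared, and uses function names (`two_part_loglik`, `penalized_objective`, and so on) chosen here for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def two_part_loglik(y, eta1, eta2, sigma):
    """Log-likelihood of one observation in the logit-normal two-part model."""
    ll = -softplus(eta1)  # equals log(1 - p) with p = exp(eta1) / (1 + exp(eta1))
    if y > 0:
        ll += eta1  # turns log(1 - p) into log(p)
        ll -= 0.5 * math.log(2 * math.pi) + math.log(sigma)
        ll -= (y - eta2) ** 2 / (2 * sigma ** 2)
    return ll


def cubic_distance_matrix(knots):
    """K x K matrix whose (k, j) entry is |p_k - p_j|^3."""
    return [[abs(pk - pj) ** 3 for pj in knots] for pk in knots]


def quad_form(c, D):
    """Compute c' D c for a coefficient list c and a square matrix D."""
    n = len(c)
    return sum(c[j] * D[j][k] * c[k] for j in range(n) for k in range(n))


def penalized_objective(ys, eta1s, eta2s, sigma, penalties):
    """Mean log-likelihood minus the sum of lambda^2 * c' D c.

    penalties: iterable of (lam, c, D) triples, one per spline component.
    """
    total = sum(two_part_loglik(y, e1, e2, sigma)
                for y, e1, e2 in zip(ys, eta1s, eta2s))
    penalty = sum(lam ** 2 * quad_form(c, D) for lam, c, D in penalties)
    return total / len(ys) - penalty
```

In practice `eta1s` and `eta2s` would be assembled from the parametric terms and the spline design matrices; they are passed in directly here to keep the sketch short.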

3.2.1 Tuning parameter selection

We use a Generalized Maximum Likelihood (GML) smoothing parameter selection approach, which is built on a close connection between the penalized log-likelihood (8) and log-likelihood of a mixed model. Although the connection between penalized smoothing and mixed models has been previously observed (Speed 1991; Wang 1998; Wand 2003), we may be the first to explore this connection with semi-continuous data. The main challenge is that the likelihood function of such mixed models involves high dimensional integration, and the standard practice of using the Laplace approximation leads to biased estimates of the variance components.

First we note that the penalized log-likelihood in (8) is equivalent to the log joint likelihood of the response Y and the following random effects:

$$c_{fi}\sim N\bigl(0,D_i^{+}/\lambda_{fi}^2\bigr),\qquad c_{gi}\sim N\bigl(0,D_i^{+}/\lambda_{gi}^2\bigr),\qquad i=1,2,3, \tag{9}$$

where $D_i^{+}$ is the Moore-Penrose inverse of $D_i$ (Graybill, 2001). The equivalence is due to $c_{hi}'D_ic_{hi}=c_{hi}'(D_i^{+})^{+}c_{hi}$ for $h=f,g$. When the distribution of $Y$ belongs to the exponential family, the log joint likelihood is exactly the penalized quasi-likelihood (PQL) in Breslow and Clayton (1993), which also discusses singular variance matrices of random effects and recommends the use of the Moore-Penrose inverse.

If we assume a flat prior on $\theta$, then the GML criterion estimates the smoothing parameters and $\sigma^2$ from the marginal density of $Y$, which is

$$L(Y\mid\lambda_{f1},\lambda_{f2},\lambda_{f3},\lambda_{g1},\lambda_{g2},\lambda_{g3},\sigma^2)=\int\exp\Bigl(P_n\,l(Y\mid\theta,b,\sigma^2)-\sum_{i=1}^{3}l(c_{fi})-\sum_{i=1}^{3}l(c_{gi})\Bigr)\,d\theta\,dc_{f1}\cdots dc_{g3}, \tag{10}$$

where $l(c_{fi})$ and $l(c_{gi})$ are the log-likelihood functions of the normal distributions in (9).

If $l(Y\mid\theta,b,\sigma^2)$ were a normal likelihood, the GML criterion would give the REML estimates of the tuning parameters, which are the inverse of the variance components in a mixed-effects model with the $c_{fi}$ and $c_{gi}$ as the random effects. Under this mixed-effects model framework, alternatively, we can use a full marginal likelihood (ML) approach, which allows us to estimate the fixed effect $\theta$ together with the variance components. Here, the full marginal likelihood of $Y$ is

$$L(Y\mid\theta,\lambda_{f1},\lambda_{f2},\lambda_{f3},\lambda_{g1},\lambda_{g2},\lambda_{g3},\sigma^2)=\int\exp\Bigl(P_n\,l(Y\mid\theta,b,\sigma^2)-\sum_{i=1}^{3}l(c_{fi})-\sum_{i=1}^{3}l(c_{gi})\Bigr)\,dc_{f1}\cdots dc_{g3}. \tag{11}$$

The REML and ML approaches are asymptotically equivalent, with the former more efficient for estimating variance components and the latter more convenient for inference. In this study, since estimation and testing of both fixed effects and tuning parameters are of interest, we adopt the ML approach and carry out the multivariate integration in (11) using the spherical-radial quadrature algorithm (Monahan and Genz 1997).

For a generic multivariate integral $\int f(u)\,du$ of dimension $d$, the spherical-radial quadrature algorithm involves two steps. First, the integrand $f(u)$ is transformed into an approximately spherically symmetric function $f^{*}(x)$ through $f^{*}(x)=|B|^{-1}f(\hat u+B^{-1}x)$, where $\hat u$ and $H=B'B$ are the mode and the Hessian matrix of the integrand. For the integration in (11), the integrand is a concave function with closed-form gradient and Hessian, so the Newton-Raphson algorithm can find the mode $\hat u$ rather quickly. After the transformation, a change of variable is performed so that the multivariate integration is in terms of a scalar radius and a vector of length $d$. The second step involves evaluating the transformed integrand at predefined radial and spherical quadrature points. With the 7-point Gauss-Kronrod rule for the radius and the simplex rule of Monahan and Genz (1997), this step needs $7(d+1)$ integrand evaluations to obtain the integral approximation. As argued by Clarkson and Zhan (2002), since our purpose is to obtain the maximum likelihood estimates, we do not need to approximate the likelihood with very high accuracy. Similar to Clarkson and Zhan (2002), we find that one application of the simplex rule (as opposed to multiple applications with rotations) is sufficient. We conduct the integration (11) on a typical desktop PC and find that it takes about 0.2 seconds with sample size 1,000 and $d=60$ (i.e., 10 knots for each nonparametric function). We use the nlm function in R (which uses a Newton-type algorithm) for optimization of (11) and find that it takes about 6 minutes in the same setting. For estimation or inference that only involves the smoothing parameters, optimizing (10) is computationally more efficient than optimizing (11), since (10) is a function of much lower dimensionality, although the integration dimension is higher.
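The first (standardization) step can be sketched as follows. This covers only the change of variables, not the radial and spherical quadrature rules, and it is not the authors' implementation. It assumes that $H$ is the negative Hessian of the log-integrand at the mode (so that $H$ is positive definite) and that $B$ is the upper-triangular Cholesky factor with $B'B=H$; the function names are chosen here for illustration.

```python
import math


def cholesky_upper(H):
    """Return upper-triangular B with B'B = H for symmetric positive definite H."""
    d = len(H)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(H[i][i] - s)
            else:
                L[i][j] = (H[i][j] - s) / L[j][j]
    # B is the transpose of the lower factor L, so B'B = L L' = H.
    return [[L[j][i] for j in range(d)] for i in range(d)]


def solve_upper(B, x):
    """Solve B u = x by back substitution (B upper triangular)."""
    d = len(x)
    u = [0.0] * d
    for i in reversed(range(d)):
        s = sum(B[i][k] * u[k] for k in range(i + 1, d))
        u[i] = (x[i] - s) / B[i][i]
    return u


def standardize(f, mode, H):
    """Return f_star(x) = |B|^{-1} f(mode + B^{-1} x), where B'B = H."""
    B = cholesky_upper(H)
    det_B = math.prod(B[i][i] for i in range(len(B)))

    def f_star(x):
        step = solve_upper(B, x)
        u = [m + s for m, s in zip(mode, step)]
        return f(u) / det_B

    return f_star
```

For an exactly Gaussian-shaped integrand, `f_star` reduces to a multiple of `exp(-x'x / 2)`, which is spherically symmetric; for other integrands the symmetry holds only approximately, which is why quadrature over the radius and the sphere is still required.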

3.3 Determination of proportionality

Determination of proportionality with respect to Xi is equivalent to testing H0 : αi = 0 vs H1 : αi ≠ 0, i = 2, 3. With Zi, determination of proportionality amounts to testing H0 : dgi = 0, λgi = ∞ vs H1 : dgi ≠ 0 or λgi ≠ ∞, i = 1, 2, 3.

Motivated by studies on simple linear models (Wahba 1990) and generalized linear models (Liu et al. 2005) as well as Guo (2002) and Crainiceanu et al. (2005), for both parametric and nonparametric covariate effects, we propose using the following likelihood ratio test statistic based on the ML defined in (11):

$$T_{ML}=\frac{\sup_{H_0}L(Y\mid\theta,\lambda_{f1},\lambda_{f2},\lambda_{f3},\lambda_{g1},\lambda_{g2},\lambda_{g3},\sigma^2)}{\sup_{H_0\cup H_1}L(Y\mid\theta,\lambda_{f1},\lambda_{f2},\lambda_{f3},\lambda_{g1},\lambda_{g2},\lambda_{g3},\sigma^2)}. \tag{12}$$

In our study, there are multiple covariates and multiple scenarios of partial proportionality. To fully determine the proportionality structure, we use a step-wise approach. Denote A, AP, and AN as the index sets of all covariates, covariates with proportional effects, and covariates with non-proportional effects, respectively. Denote |AP| as the cardinality of AP.

  1. Initialize AP = A;

  2. For a ∈ AP, fit an intermediate model with covariates in AP − {a} having proportional effects and covariates in AN ∪ {a} having non-proportional effects. Compute the p-value for proportionality using the bootstrap approach described below.

  3. Repeat Step 2 over all a ∈ AP and compare the |AP| p-values so obtained. Denote a* as the index of the covariate with the smallest p-value. If the smallest p-value is not significant, stop. Otherwise, update AP = AP − {a*} and AN = AN ∪ {a*}.

  4. If |AP| = 0, stop. Otherwise, repeat Steps 2 and 3.
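The control flow of the step-wise procedure above can be sketched in Python as follows. Model fitting and the bootstrap are abstracted into a user-supplied callback `p_value_fn`, which is a hypothetical placeholder, and the significance threshold `alpha=0.05` is an assumption made for this sketch.

```python
def stepwise_proportionality(covariates, p_value_fn, alpha=0.05):
    """Split covariates into proportional and non-proportional sets.

    p_value_fn(a, proportional, non_proportional) must return the p-value for
    proportionality of covariate `a`, computed from a model in which
    `proportional` covariates are constrained and `non_proportional` are not.
    """
    proportional = list(covariates)  # Step 1: start with every effect proportional
    non_proportional = []
    while proportional:              # Step 4: stop once nothing is left to test
        # Step 2, repeated over every candidate a (Step 3).
        p_values = {}
        for a in proportional:
            keep = [c for c in proportional if c != a]
            p_values[a] = p_value_fn(a, keep, non_proportional + [a])
        a_star = min(p_values, key=p_values.get)
        if p_values[a_star] >= alpha:
            break                    # smallest p-value not significant
        proportional.remove(a_star)
        non_proportional.append(a_star)
    return proportional, non_proportional


# Toy usage with made-up p-values that ignore the current model.
toy_p = {"A": 0.001, "B": 0.02, "C": 0.5}
toy_result = stepwise_proportionality(["A", "B", "C"], lambda a, keep, free: toy_p[a])
```

In a real analysis the p-values change from one iteration to the next because the intermediate model changes; the fixed toy values are used only to exercise the loop.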

This approach starts with all covariate effects being proportional. In Step 2, we determine the significance of proportionality of each covariate effect. In Step 3, the proportionality constraint on one covariate effect is released. Iteration is terminated once AP cannot be further reduced. Motivated by Liu et al. (2005), we propose the following bootstrap approach to compute the significance of proportionality.

  1. Fit the null and full models;

  2. Generate random errors from the normal distribution with mean zero and variance $\hat\sigma^2$ estimated from the full model;

  3. Under the null, compute the probability of Y > 0 from model (3) and generate the binary I(Y > 0). For those with Y > 0, generate the continuous Y values, each equal to the null model evaluated at the design point plus a normal random error;

  4. With the generated responses, estimate the null and full models again. Compute the statistic $T_{ML}$;

  5. Repeat Steps 2 to 4 B (e.g. 500) times. An empirical p-value can then be computed.

A byproduct of the above procedure is the bootstrap confidence intervals for both the parametric and nonparametric parameters, which can serve as the basis for inference.
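A minimal Python sketch of the bootstrap loop is given below. Refitting the null and full models is abstracted into a hypothetical callback `statistic_fn`, and the function names are chosen here for illustration. The sketch also assumes that small values of the likelihood ratio statistic count as evidence against the null, so the empirical p-value is taken as the fraction of bootstrap statistics at or below the observed one; the text does not spell out this convention.

```python
import random


def simulate_null_responses(prob_positive, mean_positive, sigma, rng):
    """Generate one bootstrap response vector under the null model.

    prob_positive[i]: fitted Pr(Y_i > 0) from the first part of the null model.
    mean_positive[i]: fitted mean of Y_i given Y_i > 0 from the null model.
    sigma:            error standard deviation (estimated from the full model).
    """
    y = []
    for p, m in zip(prob_positive, mean_positive):
        if rng.random() < p:
            y.append(m + rng.gauss(0.0, sigma))  # continuous part
        else:
            y.append(0.0)                        # zero part
    return y


def bootstrap_p_value(t_obs, prob_positive, mean_positive, sigma,
                      statistic_fn, n_boot=500, seed=0):
    """Empirical p-value: share of bootstrap statistics <= the observed one."""
    rng = random.Random(seed)
    at_or_below = 0
    for _ in range(n_boot):
        y_star = simulate_null_responses(prob_positive, mean_positive, sigma, rng)
        if statistic_fn(y_star) <= t_obs:
            at_or_below += 1
    return at_or_below / n_boot
```

Because the replicates are independent, the loop body can be distributed across workers by giving each worker its own seed.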

The likelihood ratio test and the bootstrap procedure can be computationally expensive. To calculate the likelihood ratio test statistic, we need to fit the null and full models. When fitting the full model, we suggest setting initial values as the estimates based on the null model, which may speed up the computation. The bootstrap procedure is highly parallel, which makes it computationally affordable.

3.4 Asymptotic properties

Although many intermediate models are needed in order to determine the proportionality structure, we are most interested in the final models, i.e., models with proportionality properly determined. For those models, we establish asymptotic properties of the PMLE. Sufficient conditions are provided in the Appendix. Denote the true value of $(\alpha,\beta,\tau,\sigma,f,g)$ as $(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)$. Define

$$d^2\bigl((\alpha,\beta,\tau,\sigma,f,g),(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\bigr)=\|\alpha-\alpha_T\|^2+\|\beta-\beta_T\|^2+(\tau-\tau_T)^2+(\sigma-\sigma_T)^2+\int(f-f_T)^2\,dP_Z+\int(g-g_T)^2\,dP_Z,$$

with $P_Z$ denoting the distribution function of $Z$.

Lemma 1

Under assumptions A1–A4 provided in the Appendix,

$$d\bigl((\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g),(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\bigr)=O_p\bigl(n^{-s/(2s+1)}\bigr);\qquad J(\hat f),\,J(\hat g)=O_p(1).$$

The estimates of nonparametric covariate effects are consistent and have the optimal convergence rate. Lemma 1 also establishes that $J(\hat f),J(\hat g)=O_p(1)$. That is, $\hat f$ and $\hat g$ have the “right” order of smoothness. The $L_2$ consistency, together with the smoothness and compactness conditions described in the Appendix, can lead to the uniform consistency of $\hat f$ and $\hat g$, i.e., $\sup|\hat f-f_T|=o_P(1)$ and $\sup|\hat g-g_T|=o_P(1)$. For the estimates of parametric parameters, we have the following results.

Lemma 2

With assumptions and $\Sigma$ specified in the Appendix,

$$\sqrt{n}\,\bigl\{(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma)-(\alpha_T,\beta_T,\tau_T,\sigma_T)\bigr\}\to_D N(0,\Sigma).$$

Despite the slow convergence rate of $\hat f$ and $\hat g$, the estimates of parametric parameters are still $\sqrt{n}$-consistent and asymptotically normally distributed.

4. SIMULATION

In simulation, we generate data from

$$\Pr(Y>0\mid X,Z)=\operatorname{logit}(\eta_1),\quad\text{and for } Y>0,\quad Y\mid X,Z=\eta_2+\varepsilon, \tag{13}$$

where $\eta_1=-4+5X_1-2.5X_2+1.5X_3+8\sin(6Z_1)+7Z_2-20(Z_2-0.5)^2$, $\tau=0.2$, and $\sigma=0.5$. We assume that $X_1=0$ or $1$ with probability 0.5; $X_2=1,2,3,$ or $4$ with probability 0.25; $X_3\sim N(0,1)$; $Z_1$ is equally spaced between 0 and 1; and $Z_2\sim\mathrm{Unif}[0,1]$. We set the sample size $n=1{,}000$. We define the “difference function” as $\eta_2-\tau\eta_1$. Determination of proportionality then amounts to testing if components of the difference function are equal to zero. As shown in Table 1, ten difference functions are considered. For a clear view, we omit the intercepts in Table 1, which are needed to satisfy the identifiability assumption of $Pf_i=Pg_i=0$. In simulation, $X_1$ is chosen as the anchor.
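The data-generating mechanism described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' simulation code: the probability of a nonzero response is taken to be the inverse-logit of η1 (consistent with the term η1 − log(1 + exp(η1)) in the log-likelihood), the identifiability intercepts are omitted, and the function names and record layout are chosen here.

```python
import math
import random


def expit(x):
    """Numerically stable inverse-logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def simulate_two_part(n, diff_fn, tau=0.2, sigma=0.5, seed=0):
    """Simulate n records; diff_fn(x2, x3, z1, z2) is the difference function."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        x1 = rng.choice([0, 1])
        x2 = rng.choice([1, 2, 3, 4])
        x3 = rng.gauss(0.0, 1.0)
        z1 = i / (n - 1)      # equally spaced on [0, 1]; requires n >= 2
        z2 = rng.random()     # Unif[0, 1)
        eta1 = (-4 + 5 * x1 - 2.5 * x2 + 1.5 * x3
                + 8 * math.sin(6 * z1) + 7 * z2 - 20 * (z2 - 0.5) ** 2)
        eta2 = tau * eta1 + diff_fn(x2, x3, z1, z2)
        positive = rng.random() < expit(eta1)
        y = eta2 + rng.gauss(0.0, sigma) if positive else 0.0
        records.append({"x1": x1, "x2": x2, "x3": x3, "z1": z1, "z2": z2,
                        "eta1": eta1, "eta2": eta2, "positive": positive, "y": y})
    return records


# Example: the difference function 0.33*X3 + Z2 from the simulation design.
sample = simulate_two_part(1000, lambda x2, x3, z1, z2: 0.33 * x3 + z2)
```

Passing `lambda x2, x3, z1, z2: 0.0` as `diff_fn` gives the fully proportional case.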

Table 1. Simulation study: power of testing non-proportionality with various difference functions. Columns X2, X3, Z1, and Z2 give the power for the corresponding covariate.

| Difference function | X2 | X3 | Z1 | Z2 |
|---|---|---|---|---|
| 0 | 0.040 | 0.046 | 0.043 | 0.078 |
| 0.33X3 | 0.045 | 1 | 0.021 | 0.054 |
| 5Z1 + Z1^2 + 0.9Z2 | 0.051 | 0.064 | 0.635 | 0.806 |
| 0.8X2 + 5Z1 + 2Z1^2 + 0.8Z2 | 0.900 | 0.046 | 0.820 | 0.620 |
| 0.33X3 + 5Z1 + 10Z1^2 + Z2 | 0.076 | 0.980 | 1 | 0.920 |
| 0.33X3 + 0.5Z2 | 0.062 | 1 | 0.033 | 0.400 |
| 0.33X3 + Z2 | 0.079 | 1 | 0.048 | 0.950 |
| 0.3X2 + 0.33X3 | 0.220 | 1 | 0.042 | 0.051 |
| 0.5X2 + 0.33X3 | 0.560 | 1 | 0.035 | 0.050 |
| 0.05X3 + Z2 | 0.045 | 0.160 | 0.038 | 0.960 |
| 0.1X3 + Z2 | 0.066 | 0.800 | 0.032 | 0.990 |

We first investigate the determination of proportionality. In Table 1, we present the power of detecting non-proportionality computed based on 1,000 replicates. In general, the proposed approach correctly identifies the proportionality structure. When proportionality holds for a specific covariate, the rejection rate is usually close to 0.05, the nominal significance level. In contrast, when proportionality does not hold, the proposed approach identifies the non-proportionality with a high probability. Consider, for example, difference function 0.1X3 + Z2. The non-proportionality with respect to X3 and Z2 is identified with probabilities 0.80 and 0.99, respectively. The error rates of mistakenly identifying non-proportionality with respect to X2 and Z1 are 0.066 and 0.032, respectively. In addition, when the regression coefficients in difference functions increase, the power increases. Consider for example difference functions 0.33X3 + 0.5Z2 and 0.33X3 + Z2. When the regression coefficient of Z2 increases from 0.5 to 1, the power for Z2 increases from 0.40 to 0.95.
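One way to judge how close an estimated rejection rate is to the nominal level is the binomial Monte Carlo standard error, sqrt(p(1 − p)/R) for R replicates. The sketch below applies this standard calculation, which is not part of the original study, to the first row of Table 1 (difference function 0); the function name is chosen here.

```python
import math


def monte_carlo_se(rate, n_replicates):
    """Binomial standard error of a rejection rate estimated from replicates."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)


nominal = 0.05
se_nominal = monte_carlo_se(nominal, 1000)

# Rejection rates under full proportionality (first row of Table 1).
null_row = {"X2": 0.040, "X3": 0.046, "Z1": 0.043, "Z2": 0.078}

# Distance from the nominal level in Monte Carlo standard error units.
z_scores = {name: (rate - nominal) / se_nominal for name, rate in null_row.items()}
```

Entries of `z_scores` within about two units of zero are compatible with the nominal level at this number of replicates; larger values point to some size distortion for that covariate.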

For the final models, we evaluate performance of the penalized estimation and bootstrap inference. We show a representative example of the estimation results in Figure 2, where data are generated with difference function 0.33X3 + 5Z1 + 10Z1^2 + Z2. For the covariates with nonparametric effects, the mean estimates fit the unknown true functions well. The 95% confidence intervals provide satisfactory coverage. As expected, the confidence intervals become wider near the boundaries, where there are fewer observations. Note that, for identifiability, we have assumed $Pf_i=Pg_i=0$. The intercepts, which are omitted in Table 1, have been added back in Figure 2. We have examined estimation results for parametric parameters and found negligible biases, satisfactory convergence rates, marginal distributions close to normal, and satisfactory bootstrap coverage.

Figure 2. Simulation with difference function 0.33X3 + 5Z1 + 10Z1^2 + Z2: estimation and inference results for nonparametric covariate effects. Solid black line: true covariate effect; Red dashed line: mean estimates; Blue dash-dotted lines: mean 95% confidence intervals.

5. ANALYSIS OF MESA

The MESA is a population based, multi-center study of subclinical cardiovascular diseases. The study cohort consists of 6,814 subjects with age ranging from 45 to 84 at the baseline. Subjects with missing measurements are removed, leading to a sample size of 6,658 for downstream analysis. The CAC has a mixture distribution, with about half of the CAC scores equal to zero and the rest continuously distributed. We adopt the two-part model. In the first part, we assume the logit link function. In the second part, we study log(1 + CAC), which has a distribution close to normal.

Following McClelland et al. (2006), we consider the following predictors: gender (female is used as the reference group), race (Caucasian, African-American, Chinese, and Hispanic; Caucasian is used as the reference group), former smoker (binary indicator), current smoker (binary indicator), diabetes (binary indicator), SBP (systolic blood pressure), DBP (diastolic blood pressure), age, BMI (body mass index), LDL cholesterol, and HDL cholesterol. Among the 13 covariates, 7 are binary, which naturally correspond to parametric covariate effects. In addition, our preliminary analysis suggests linear effects for SBP and DBP. Thus, in the semiparametric models, there are 9 parametric covariate effects and 4 nonparametric ones. Following Han and Kronmal (2006), X3 is selected as the anchor.

We use the step-wise approach to determine proportionality. In the first step, we find that the proportionality of LDL effect has the smallest p-value (< 0.001). Thus we release the proportionality constraint on LDL. In the second step, we find that the HDL effect has the smallest p-value (0.012). We then fit a model with the proportionality constraints on LDL and HDL released. In the third step, for covariates other than LDL and HDL, we find that releasing the proportionality constraints leads to insignificant p-values. We thus conclude that proportionality holds for all covariates except LDL and HDL.

For the final model with proportionality constraints on all covariates except LDL and HDL, we present the estimates of parametric regression coefficients in Table 2 and estimates of nonparametric covariate effects in Figure 3. We find that the following risk factors are significantly associated with a higher level of CAC: being male, being Caucasian, being a smoker (both former and current), having diabetes, and having a higher level of SBP. Those findings are consistent with the literature.

Table 2. Analysis of MESA. Parametric regression coefficients in the full model (with no proportionality constraint) and the final model (with proportionality properly determined). Estimates (bootstrap standard errors) in the logistic (η1) and linear (η2) models.

| Predictor | Full model η1 | Full model η2 | Final model η1 | Final model η2 |
|---|---|---|---|---|
| Gender: Male (X1) | 0.945 (0.092) | 0.618 (0.099) | 0.960 (0.078) | 0.651 (0.053) |
| Race: Chinese (X2) | −0.119 (0.070) | −0.285 (0.081) | −0.211 (0.078) | −0.143 (0.053) |
| Race: African-American (X3) | −0.787 (0.071) | −0.398 (0.085) | −0.727 (0.063) | −0.493 (0.047) |
| Race: Hispanic (X4) | −0.628 (0.074) | −0.358 (0.073) | −0.594 (0.063) | −0.402 (0.045) |
| Former smoker (X5) | 0.370 (0.072) | 0.213 (0.071) | 0.354 (0.052) | 0.240 (0.036) |
| Current smoker (X6) | 0.609 (0.094) | 0.328 (0.096) | 0.573 (0.078) | 0.388 (0.052) |
| Diabetes (X7) | 0.243 (0.070) | 0.275 (0.068) | 0.299 (0.055) | 0.203 (0.038) |
| SBP (X8) | 0.009 (0.002) | 0.004 (0.002) | 0.008 (0.002) | 0.005 (0.001) |
| DBP (X9) | −0.0034 (0.004) | 0.0032 (0.004) | −0.0009 (0.004) | −0.0006 (0.002) |

τ (final model): 0.678 (0.037)
σ: 1.677 (0.021) in the full model; 1.680 (0.021) in the final model
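As a quick arithmetic check on the final model in Table 2, each η2 coefficient of a covariate with a proportional effect should equal τ times its η1 coefficient, up to rounding. The Python sketch below uses values copied from the table; the tolerance of 0.001 is an assumption made here to allow for three-decimal rounding.

```python
tau = 0.678

# (eta1 estimate, eta2 estimate) in the final model, copied from Table 2.
final_model = {
    "Gender: Male": (0.960, 0.651),
    "Race: Chinese": (-0.211, -0.143),
    "Race: African-American": (-0.727, -0.493),
    "Race: Hispanic": (-0.594, -0.402),
    "Former smoker": (0.354, 0.240),
    "Current smoker": (0.573, 0.388),
    "Diabetes": (0.299, 0.203),
    "SBP": (0.008, 0.005),
    "DBP": (-0.0009, -0.0006),
}

# Absolute gap between tau * eta1 and the reported eta2 for each predictor.
gaps = {name: abs(tau * eta1 - eta2) for name, (eta1, eta2) in final_model.items()}
max_gap = max(gaps.values())
all_consistent = max_gap < 0.001
```

The same check does not apply to the full model, which has no proportionality constraint and therefore no τ.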

Figure 3. Analysis of MESA. Estimated nonparametric covariate effects in the final model with proportionality properly determined. Solid black line: estimate; Red dashed line: mean estimate from bootstrap samples; Blue dash-dotted lines: 95% confidence intervals.

For Age and BMI, the nonparametric covariate effects are proportional (Figure 3). It is interesting that their effects are almost linear, which suggests that it may be possible to further simplify the model by assuming parametric Age and BMI effects. The bootstrap confidence intervals suggest that both the Age and BMI effects are significant. Increases in Age and BMI are associated with a higher level of CAC, which is consistent with findings in the literature. For LDL and HDL, proportionality does not hold. The shapes of covariate effects are significantly different in the two parts of the model. For HDL, the covariate effects have a “U” shape. In the literature, nonparametric modeling of HDL has not been well investigated. This study is among the first to find this interesting relationship between HDL and CAC. Implications of this finding need to be pursued in future biomedical studies. For LDL, it is interesting that the covariate effects are close to linear. An increase in LDL is associated with a higher probability of nonzero CAC, which is consistent with findings in the literature. The bootstrap confidence intervals suggest the significance of the LDL effect. For nonzero CAC values, the LDL effect is negligible.

To complement the above analysis, we also fit the full model with no proportionality constraint. Estimation results are shown in Table 2 and Figure 4. Comparing the full and final models, we find that estimates in the two models are reasonably close. This is expected since estimates under both models are asymptotically consistent. An important finding is that in general, estimates in the final model have smaller variances. In Table 2, all bootstrap standard errors (except for that of X2 in η1) in the full model are larger than or equal to their counterparts in the final model. The improved efficiency is consistent with studies on parametric models in Han and Kronmal (2006) and others.

Figure 4. Analysis of MESA. Estimated nonparametric covariate effects in the full model with no proportionality constraint. Solid black line: estimate; Red dashed line: mean estimate from bootstrap samples; Blue dash-dotted lines: 95% confidence intervals.

6. CONCLUSION

In this article, we study the semiparametric two-part modeling of the CAC in MESA. We use a penalized maximum likelihood approach for estimation and a step-wise hypothesis testing approach for determination of proportionality. Our numerical and theoretical studies show that the proposed method can properly identify the proportionality structure, and the estimation results are satisfactory.

We conduct a detailed analysis of the CAC in MESA. By adopting the flexible semiparametric two-part model, this study can provide a deeper understanding of the development of CAC. Specifically, this study is among the first to find the “U” shape for the effects of HDL in both parts of the model and the different shapes of the LDL effects. Assuming parametric models, Han and Kronmal (2006) conclude that the effects of HDL, LDL, diabetes, and race-Chinese are not proportional. In contrast, with the semiparametric model, we only conclude non-proportionality for HDL and LDL. Our analysis disproves the hypothesis that the change from a zero to a positive Agatston score and the change from a lower to a higher Agatston score share the same biological process. Instead, we find that risk factors affect the CAC level via at least two different mechanisms, with cholesterol having a different mechanism from the other risk factors.

In our models, to be consistent with previous studies such as Han and Kronmal (2006) and McClelland et al. (2006), we assume additive covariate effects. We note that it is possible to extend the proposed method to accommodate interactions and to conduct analysis with transformed covariates (for example, the ratio HDL/LDL). The proposed model and method have no “built-in” robustness, and we suspect that the performance of the proposed method can be unsatisfactory under model misspecification.

The proposed tuning parameter selection method has been motivated by several published studies. Our numerical studies show that the tuning parameters so selected have satisfactory performance. In the theoretical investigation, we provide the asymptotic rate for the tuning parameters. However, as in many other studies, it is not completely clear whether the tuning parameters selected using the proposed approach match the asymptotics.

The proposed method requires an anchor covariate. An anchor is also needed in many other studies that impose an identifiability constraint. In theory, as long as the corresponding covariate effect is nonzero, it does not matter which covariate is selected as the anchor; in practice, we propose selecting a covariate with a “strong” effect, which can be parametric or nonparametric. We chose a parametric covariate effect partly to follow Han and Kronmal (2006) and partly to simplify the computation. In assumption A1 (Appendix), we assume that the true value of τ is bounded away from zero. In addition, we expect the anchor variable to have a strong effect. In practical data analysis, if the estimated product of τ and the effect of the anchor covariate is close to zero, it should raise an alarm: either there should be no constraint or the choice of anchor is improper. Since it is not our focus, we refer to publications such as Ma and Huang (2007) and references therein for more detailed discussions on the anchor.

Acknowledgments

We thank the editor, associate editor, and two referees for careful review and insightful comments. We thank the investigators, the staff, and the participants of MESA for their valuable contributions. A full list of participating MESA investigators and institutions can be found at http://www.mesa-nhlbi.org. This study has been supported by H98230-09-1-0044 from NSA (Liu), N01-HC95159 from NHLBI (Kronmal) and DMS 0805984 from NSF (Zhou and Ma).

APPENDIX

We provide proofs of Lemmas 1 and 2. First, we make the following assumptions.

  • (A1)

    X and Z are component-wise bounded. $(\alpha_T, \beta_T, \tau_T, \sigma_T)$ is an interior point of a compact set. $\tau_T$ is bounded away from 0.

  • (A2)

    Component-wise, $f_T$ and $g_T$ belong to the Sobolev space indexed by the order of derivative $s$.

  • (A3)

    $P\big(l(\alpha, \beta, \tau, \sigma, f, g) - l(\alpha_T, \beta_T, \tau_T, \sigma_T, f_T, g_T)\big) \le -K_1 d^2\big((\alpha, \beta, \tau, \sigma, f, g), (\alpha_T, \beta_T, \tau_T, \sigma_T, f_T, g_T)\big)$ with a fixed constant $K_1 > 0$.

  • (A4)

    $\lambda_f, \lambda_g = O_p(n^{-s/(2s+1)})$.

For most practical data, the compactness assumption A1 can be satisfied. We make this assumption for theoretical convenience and allow the actual bounds to remain unknown. We assume smooth nonparametric covariate effects in A2. Usually, s = 2. We assume that the maximizer of the likelihood function is “well-separated” in A3. This assumption can be satisfied under the compactness assumptions A1 and A2 and the differentiability of likelihood function.

Proof of Lemma 1

Definition (Bracketing number)

Let $(\mathcal{H}, \|\cdot\|)$ be a subset of a normed space of real functions $h$ on some set. Given two functions $h_1$ and $h_2$, the bracket $[h_1, h_2]$ is the set of all functions $h$ with $h_1 \le h \le h_2$. An $\varepsilon$-bracket is a bracket $[h_1, h_2]$ with $\|h_2 - h_1\| \le \varepsilon$. The bracketing number $N_{[\,]}(\varepsilon, \mathcal{H}, \|\cdot\|)$ is the minimum number of $\varepsilon$-brackets needed to cover $\mathcal{H}$. The entropy with bracketing is the logarithm of the bracketing number.

van de Geer (2000) proves that, for the functional class

$$\mathcal{H} = \Big\{ h: [0,1] \to [0,1],\ \int \big(h^{(s)}(x)\big)^2\,dx < 1 \Big\},$$

$\log N_{[\,]}(\varepsilon, \mathcal{H}, L_2(P)) \le K_2 \varepsilon^{-1/s}$, for fixed $K_2$ and $s$ and all $\varepsilon$.

Under the boundedness assumptions A1 and A2 and the differentiability of the log-likelihood function, we have

$$\log N_{[\,]}\big(\varepsilon, l(\alpha,\beta,\tau,\sigma,f,g), L_2(P)\big) \le K_3 \varepsilon^{-1/s}, \tag{14}$$

for a fixed constant $K_3$.

Examination of the log-likelihood suggests that if $\hat\alpha, \hat\beta, \hat\tau, \hat\sigma \to \infty$, then $P_n l \to -\infty$. Thus, we are able to focus on the set of bounded $\hat\alpha, \hat\beta, \hat\tau, \hat\sigma$, although the actual bound remains unknown. As we optimize in the Sobolev space (indexed by the order of derivative $s$), $\hat f$ and $\hat g$ are smoothing splines. The proof follows Theorem 1.3.1 of Wahba (1990; p. 11).

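From the definition of PMLE, we have

$$P_n l(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g) - \lambda_f^2 J^2(\hat f) - \lambda_g^2 J^2(\hat g) \ge P_n l(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T) - \lambda_f^2 J^2(f_T) - \lambda_g^2 J^2(g_T).$$

From the properties of likelihood function, we have

$$P\, l(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g) \le P\, l(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T).$$

Combining the above two equations, we get

$$\lambda_f^2 J^2(\hat f) + \lambda_g^2 J^2(\hat g) + P\big(l(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T) - l(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g)\big) \le \lambda_f^2 J^2(f_T) + \lambda_g^2 J^2(g_T) + (P_n - P)\big(l(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g) - l(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\big). \tag{15}$$

In addition, the entropy result in (14) implies that

$$(P_n - P)\big(l(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T) - l(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g)\big) = o_P(n^{-1/2})\big(1 + J(f_T) + J(g_T) + J(\hat f) + J(\hat g)\big). \tag{16}$$

Combining equations (15) and (16) with assumption A4, we have

$$\lambda_f J(\hat f) = o_P(1) \quad\text{and}\quad \lambda_g J(\hat g) = o_P(1). \tag{17}$$

Under assumption A3, equations (15) and (16) imply that

$$K_1 d^2\big((\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T), (\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g)\big) \le o_P(1) + o_P(n^{-1/2})\big(1 + J(f_T) + J(g_T) + J(\hat f) + J(\hat g)\big).$$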

This equation and equation (17) lead to the consistency of PMLE. To prove the rate of convergence, we use the following result.

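van de Geer (2000) considers a uniformly bounded class of functions $\Gamma$, with $\sup_{\gamma \in \Gamma} |\gamma - \gamma_0| < \infty$ and a fixed $\gamma_0 \in \Gamma$, and $\log N_{[\,]}(\varepsilon, \Gamma, P) \le K_4 \varepsilon^{-b}$ for all $\varepsilon > 0$, where $b \in (0, 2)$ and $K_4$ is a fixed constant. Then for $\delta_n = n^{-1/(2+b)}$,

$$\sup_{\gamma \in \Gamma} \frac{(P_n - P)(\gamma - \gamma_0)}{\|\gamma - \gamma_0\|_2^{1-b/2} \vee \sqrt{n}\,\delta_n^2} = O_p(n^{-1/2}), \tag{18}$$

where $x \vee y = \max(x, y)$.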
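To see how the general result (18) specializes to the Sobolev setting, one can take b = 1/s, matching the exponent in the entropy bound (14), and compute the resulting powers of n exactly. The short Python check below does this with rational arithmetic. The function name `rate_exponents` is illustrative, and the choice b = 1/s is the assumption that links (14) to (18).

```python
from fractions import Fraction

def rate_exponents(s):
    """Exponents of n implied by (18) when the entropy exponent is b = 1/s.

    Returns (e_delta, e_floor), where delta_n = n**e_delta and
    sqrt(n) * delta_n**2 = n**e_floor.
    """
    b = Fraction(1, s)
    e_delta = -1 / (2 + b)                   # delta_n = n^{-1/(2+b)}
    e_floor = Fraction(1, 2) + 2 * e_delta   # sqrt(n) * delta_n^2
    return e_delta, e_floor

for s in (1, 2, 3):
    e_delta, e_floor = rate_exponents(s)
    # Closed forms: -s/(2s+1) and (1-2s)/(2(2s+1))
    assert e_delta == Fraction(-s, 2 * s + 1)
    assert e_floor == Fraction(1 - 2 * s, 2 * (2 * s + 1))
    print(s, e_delta, e_floor)
```

For the usual choice s = 2 this gives exponents -2/5 and -3/10, so the second argument of the maximum in (18) is of order n^{-3/10} and the rate δ_n is n^{-2/5}.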

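Under the compactness assumptions A1 and A2 and considering the differentiability of the log-likelihood function, we have

$$K_1 d^2\big((\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g), (\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\big) \le P\big(l(\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T) - l(\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g)\big) \le K_5 d^2\big((\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g), (\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\big), \tag{19}$$

where $K_5$ is a fixed constant. Combining equations (18), (19), and (15), we have

$$\lambda_f^2 J^2(\hat f) + \lambda_g^2 J^2(\hat g) + K_1 d^2\big((\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g), (\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\big) \le \lambda_f^2 J^2(f_T) + \lambda_g^2 J^2(g_T) + O_P(n^{-1/2})\big(1 + J(f_T) + J(\hat f) + J(g_T) + J(\hat g)\big) \times \Big\{ d^{\,1 - 1/(2s)}\big((\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g), (\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\big) \vee n^{\frac{1-2s}{2(2s+1)}} \Big\}. \tag{20}$$

Note that all three terms on the left-hand side are nonnegative. Compare each term with the right-hand side. Simple calculations give that

$$J(\hat f) = O_P(1) \quad\text{and}\quad J(\hat g) = O_P(1), \qquad d\big((\hat\alpha,\hat\beta,\hat\tau,\hat\sigma,\hat f,\hat g), (\alpha_T,\beta_T,\tau_T,\sigma_T,f_T,g_T)\big) = O_P(n^{-s/(2s+1)}).$$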
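The calculation behind the stated rate can be outlined as follows. This is a heuristic sketch rather than a complete argument: it writes $d$ for the distance between the estimate and the truth, treats $J(f_T)$, $J(g_T)$, $J(\hat f)$, and $J(\hat g)$ as bounded in probability, and assumes that the penalty terms on the right-hand side of (20) are at most of order $n^{-2s/(2s+1)}$. Under these assumptions, (20) reduces to comparing $d^2$ with each branch of the maximum.

```latex
% Case 1: the maximum in (20) is attained by d^{1 - 1/(2s)}
K_1 d^2 \lesssim n^{-1/2}\, d^{\,1-\frac{1}{2s}}
\;\Longrightarrow\;
d^{\,1+\frac{1}{2s}} \lesssim n^{-1/2}
\;\Longrightarrow\;
d \lesssim n^{-\frac{1}{2}\cdot\frac{2s}{2s+1}} = n^{-\frac{s}{2s+1}} .

% Case 2: the maximum is attained by n^{(1-2s)/(2(2s+1))}
K_1 d^2 \lesssim n^{-\frac{1}{2}}\, n^{\frac{1-2s}{2(2s+1)}}
= n^{-\frac{2s}{2s+1}}
\;\Longrightarrow\;
d \lesssim n^{-\frac{s}{2s+1}} .
```

Both cases lead to the same order, which is why the two arguments of the maximum need not be distinguished in the final statement.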

Proof of Lemma 2

To prove the $\sqrt{n}$ consistency and asymptotic normality, we apply Theorem 1 in Ma and Kosorok (2005). Application of this theorem requires the following conditions to hold: (a) consistency and rate of convergence, which has been established in Lemma 1; (b) finite asymptotic variance, which is shown below; (c) stochastic equicontinuity, which can be established using the entropy result and the consistency result; and (d) smoothness of the model, which holds with the differentiability of the likelihood function.

Thus, to prove Lemma 2, we only need to establish the non-singularity of the information matrix. Denote $\dot{l}_\alpha, \dot{l}_\beta, \dot{l}_\tau, \dot{l}_\sigma$ as the partial derivatives of the log-likelihood function with respect to $\alpha, \beta, \tau, \sigma$, respectively. For $t_f, t_g \approx 0$, consider $f_t = f + t_f \xi_f$ and $g_t = g + t_g \xi_g$, such that $f_t, g_t$ still satisfy assumption A2, and consider the space generated by $\xi_f$ and $\xi_g$. The score operators for $f$ and $g$ are $\dot{l}_f[\xi_f] = \lim_{t_f \to 0} \frac{l(\alpha,\beta,\tau,\sigma,f_t,g) - l(\alpha,\beta,\tau,\sigma,f,g)}{t_f}$ and $\dot{l}_g[\xi_g] = \lim_{t_g \to 0} \frac{l(\alpha,\beta,\tau,\sigma,f,g_t) - l(\alpha,\beta,\tau,\sigma,f,g)}{t_g}$. Denote $\dot{l}_1 = (\dot{l}_\alpha, \dot{l}_\beta, \dot{l}_\tau, \dot{l}_\sigma)'$ as the score function for the parametric parameters and $\dot{l}_{f,g}[\xi_f, \xi_g] = (\dot{l}_f[\xi_f], \dot{l}_g[\xi_g])$ as the score operator for the nonparametric parameters.

Project $\dot{l}_1$ onto the space generated by $\dot{l}_{f,g}[\xi_f, \xi_g] = (\dot{l}_f[\xi_f], \dot{l}_g[\xi_g])$. The efficient score for (α, β, τ) is U=i1-if,g[P(i1)if,gZP(if,gZ)]. We further assume

  • (A5)

    $P(UU')$ is component-wise bounded and positive definite.

Then $\Sigma = P^{-1}(UU')$ is the asymptotic variance matrix.
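In practice the asymptotic variance would have to be estimated. One plug-in possibility, sketched below in plain Python, replaces P by the empirical average over observations of the outer product of estimated efficient score vectors and inverts the result. This is an illustration only: the standard errors reported in the data analysis come from the bootstrap, and the helper names (`mean_outer`, `invert`) and the toy score vectors are hypothetical.

```python
def mean_outer(scores):
    """Empirical average of outer products u u' over a list of score vectors."""
    n, k = len(scores), len(scores[0])
    return [[sum(u[i] * u[j] for u in scores) / n for j in range(k)]
            for i in range(k)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

# Toy efficient-score vectors for two parametric coefficients (made up).
scores = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [0.0, -2.0]]
sigma_hat = invert(mean_outer(scores))
print(sigma_hat)
```

A singular or near-singular averaged outer product would signal that the positive-definiteness condition in (A5) is doubtful for the data at hand.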

Contributor Information

Anna Liu, Department of Mathematics and Statistics, University of Massachusetts.

Richard Kronmal, Department of Biostatistics, University of Washington.

Xiaohua Zhou, Department of Biostatistics, University of Washington, Biostatistics Unit, HSR&D Center of Excellence, Veterans Affairs Puget Sound Health Care System.

Shuangge Ma, Email: shuangge.ma@yale.edu, School of Public Health, Yale University.

References

  1. Albert PS, Follmann DA, Barnhart HX. A generalized estimating equation approach for modeling random length binary vector data. Biometrics. 1997;53:1116–1124.
  2. Bild DE, Bluemke DA, Burke GL, Detrano R, et al. Multi-ethnic study of atherosclerosis: objectives and design. American Journal of Epidemiology. 2002;156:871–881. doi: 10.1093/aje/kwf113.
  3. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association. 1993;88:9–25.
  4. Claeskens G. Restricted likelihood ratio lack-of-fit tests using mixed spline models. Journal of the Royal Statistical Society, B. 2004;66:909–926.
  5. Clarkson DB, Zhan Y. Using spherical-radial quadrature to fit generalized linear mixed effects models. Journal of Computational and Graphical Statistics. 2002;11:639–659.
  6. Crainiceanu C, Ruppert D, Claeskens G, Wand M. Exact likelihood ratio tests for penalised splines. Biometrika. 2005;92:91–103.
  7. Fan J, Jiang J. Nonparametric inference with generalized likelihood ratio tests. TEST. 2007;16:409–444.
  8. Ghosh SK, Mukhopadhyay P, Lu JC. Bayesian analysis of zero-inflated regression models. Journal of Statistical Planning and Inference. 2006;136:1360–1375.
  9. Graybill FA. Matrices With Applications in Statistics. Duxbury Press; 2001.
  10. Gu C. Smoothing Spline ANOVA Models. Springer; 2002.
  11. Guo W. Inference in smoothing spline analysis of variance. Journal of the Royal Statistical Society, B. 2002;64:887–898.
  12. Han C, Kronmal RA. Two-part models for analysis of Agatston scores with possible proportionality constraints. Communications in Statistics–Theory and Methods. 2006;35:99–111.
  13. Hardle W, Mammen E, Muller M. Testing parametric versus semiparametric modeling in generalized linear models. Journal of the American Statistical Association. 1998;93:1461–1474.
  14. Jose Lombardia M, Sperlich S. Semiparametric inference in generalized mixed effects models. Journal of the Royal Statistical Society, B. 2008;70:913–930.
  15. Kauermann G, Claeskens G, Opsomer JD. Bootstrapping for penalized spline regression. Journal of Computational and Graphical Statistics. 2009;18:126–146.
  16. Kim YJ, Gu C. Smoothing spline Gaussian regression: More scalable computation via efficient approximation. Journal of the Royal Statistical Society, B. 2004;66:337–356.
  17. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
  18. Liu A, Meiring W, Wang Y. Testing generalized linear models using smoothing spline methods. Statistica Sinica. 2005;15:235–256.
  19. Ma S. Cure model with current status data. Statistica Sinica. 2009;19:233–249.
  20. Ma S, Huang J. Combining multiple markers for classification using ROC. Biometrics. 2007;63:751–757. doi: 10.1111/j.1541-0420.2006.00731.x.
  21. Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis. 2005;96:190–217.
  22. McClelland RL, Chung H, Detrano R, Post W, Kronmal RA. Distribution of coronary artery calcium by race, gender, and age. Results from the Multi-Ethnic Study of Atherosclerosis (MESA). Circulation. 2006;113:30–37. doi: 10.1161/CIRCULATIONAHA.105.580696.
  23. McCulloch CE, Searle SR. Generalized, linear, and mixed models. New York, Chichester: John Wiley & Sons; 2001.
  24. Min Y, Agresti A. Random effect models for repeated measures of zero-inflated count data. Statistical Modeling. 2005;5:1–19.
  25. Monahan J, Genz A. Spherical-radial integration rules for a Bayesian computation. Journal of the American Statistical Association. 1997;92:664–674.
  26. Moulton LH, Curriero FC, Barroso PF. Mixture models for quantitative HIV RNA data. Statistical Methods in Medical Research. 2002;11:317–325. doi: 10.1191/0962280202sm292ra.
  27. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; 2003.
  28. Speed T. Comment: That BLUP is a good thing: the estimation of random effects by G.K. Robinson. Statistical Science. 1991;6:42.
  29. Thompson LA, Chhikara RS. A Bayesian cure rate model for repeated measurements and interval censoring. Proceedings of JSM; 2003.
  30. van de Geer S. Empirical Processes in M-Estimation. Cambridge Series in Statistical and Probabilistic Mathematics; 2000.
  31. Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM; 1990.
  32. Wand MP. Smoothing and mixed models. Computational Statistics. 2003;18:223–249.
  33. Wang Y. Mixed effects smoothing spline analysis of variance. Journal of the Royal Statistical Society, B. 1998;60:159–174.
  34. Wood SN. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC; 2006.
  35. Yau KK, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 2001;20:2907–2920. doi: 10.1002/sim.860.
  36. Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74. doi: 10.1093/biostatistics/4.1.57.
