Author manuscript; available in PMC: 2013 Jun 1.
Published in final edited form as: Biometrics. 2011 Nov 9;68(2):429–436. doi: 10.1111/j.1541-0420.2011.01683.x

Semiparametric Frailty Models for Clustered Failure Time Data

Zhangsheng Yu 1, Xihong Lin 2, Wanzhu Tu 1,3
PMCID: PMC3600835  NIHMSID: NIHMS444845  PMID: 22070739

Summary

We consider frailty models with additive semiparametric covariate effects for clustered failure time data. We propose a doubly penalized partial likelihood (DPPL) procedure to estimate the nonparametric functions using smoothing splines. We show that the DPPL estimators can be obtained by fitting an augmented working frailty model with parametric covariate effects, in which the nonparametric functions are estimated as linear combinations of fixed and random effects and the smoothing parameters are estimated as extra variance components. This approach allows us to conveniently estimate all model components within a unified frailty model framework. We evaluate the finite sample performance of the proposed method via a simulation study, and apply the method to analyze data from a study of sexually transmitted infections (STI).

Keywords: Doubly penalized partial likelihood, smoothing spline, Gaussian frailty, sexually transmitted disease, Smoothing parameter, Variance components

1. Introduction

Frailty models are commonly used to model correlated failure time data. There is an extensive literature on parametric regression for covariate effects in frailty models (Therneau, Grambsch, and Pankratz 2003). For example, McGilchrist and Aisbett (1991) proposed a penalized partial likelihood approach for parameter estimation in a Gaussian frailty model setting. Following Breslow and Clayton (1993), Ripatti and Palmgren (2000) used the Laplace approximation for the integrated likelihood of Gaussian frailty models. Murphy (1995) and Parner (1998) studied the asymptotic properties of Gamma frailty models. To accommodate complicated covariate effects and their unknown functional forms, we propose frailty models with some covariates modeled nonparametrically and some covariates modeled parametrically. To our knowledge, little work has been done on frailty models with semiparametric covariate effects for correlated failure time data.

This research is motivated by a study of sexually transmitted infections (STI). Evidence-based recommendations about the beginning age and frequency of STI screening require an accurate quantification of time from sexual debut to first infection (Meyers et al. 2008). In adolescent women, concurrent infections with multiple organisms are not uncommon, with Chlamydia trachomatis, Neisseria gonorrhoeae and Trichomonas vaginalis being the most frequently detected pathogens. Once sexually active, young women are at risk of infection via sexual behavior. Evaluation of risk associated with STI acquisition is important as it provides clinically useful markers for more targeted STI screening. Since concurrent and repeated infections with multiple organisms within the same subject are correlated, it is desirable to use frailty models to examine factors related to the timing of STI following the onset of sexual activity.

The number of sex partners is often thought to be indicative of STI risk. However, the assumption of a linear effect for the number of partners is questionable. Intuitively, while a smaller number of partners typically indicates lower risk, a very large number of partners may not present proportionally higher risk because of increased sexual experience and prophylactic activities. Indeed, a closer examination of the fourth order polynomial effects for the lifetime number of sex partners revealed that the coefficients of all polynomial terms were significant at level 0.05. Hence it is desirable to model the effect of the number of sex partners nonparametrically while using the frailty to account for within-subject correlation.

For independent survival data, spline and kernel methods have been used in nonparametric regression. For example, O’Sullivan (1988) estimated the nonparametric covariate functions by smoothing splines. Hastie and Tibshirani (1990) proposed an alternative algorithm for fitting smoothing splines. Gray (1992, 1994) argued for the use of the penalized spline method. Local kernel methods were proposed by Tibshirani and Hastie (1987). Fan et al. (1997) studied the local kernel method and related asymptotic properties. More recently, Duchateau et al. (2004) studied frailty models with a single nonparametric covariate function using a smoothing spline. Cai et al. (2007, 2008) and Yu and Lin (2008) discussed a marginal likelihood estimation associated with local kernel methods for correlated survival data.

Du and Ma (2010) proposed a fully nonparametric hazard model with frailty, $\log h(t, u; b) = \eta(t, u) + z^T b$, where $\eta(t, u)$ is a fully nonparametric function of time and covariates, u is a vector of covariates and b is a frailty. Our proposed semiparametric model takes the form $\lambda(t; x_i, w_i, z_i, b) = \lambda_0(t)\exp\{\sum_{j=1}^{p_1}\theta_j(x_{ij}) + w_i^T\gamma + z_i^T b\}$. It differs from Du and Ma’s model in the following two aspects. (i) Modeling structure: our model disentangles the baseline hazard and covariates in the same spirit as the Cox model, and models covariate effects using a semiparametric additive function which allows for both parametric and nonparametric covariate effects, with extensions to multiple covariates. Although Du and Ma’s nonparametric model takes a general functional space, models defined in a more restrictive functional space, such as semiparametric additive frailty models, are often much easier to interpret in biomedical applications. Better estimation efficiency associated with the parametric component also increases its appeal compared to the fully nonparametric approach. Further, multi-dimensional nonparametric functions are often difficult to fit. (ii) Parameter estimation: our approach connects the semiparametric spline Cox model with a frailty model, which is parallel to the connection between mixed models and spline models, as recognized by Green (1984) and Lin and Zhang (1999). As a result, the nonparametric functions can be easily estimated as linear combinations of fixed and random effects. The smoothing parameters can be easily estimated as extra variance components which are naturally embedded in the frailty model with a parametric covariate function. In contrast to our work, Du and Ma (2010) do not build a connection between the frailty model and the spline Cox model and do not take advantage of the variance components method for estimating the smoothing parameter. Instead, they used a cross-validation method for estimating the smoothing parameter, which is computationally more intensive in general.

This article is structured as follows. In Section 2, we present our modeling structure. In Section 3, we discuss the DPPL method for estimation and inference of the semiparametric functions. In Section 4, we describe the estimation of variance components and smoothing parameters. In Section 5, we conduct a simulation study to evaluate the finite sample performance of the method. In Section 6, we illustrate the use of our model by analyzing data from an STI epidemiologic study. We conclude the paper in Section 7 with a few remarks about the proposed method.

2. Semiparametric Frailty Models

To assess the linear and potentially nonlinear effects of covariates in a frailty model setting, we introduce a general frailty model (1) with semiparametric additive covariate functions for clustered, hierarchical, and spatial data. We refer to this model as the semiparametric frailty model for simplicity.

For the ith observation, let Ti denote the underlying failure time, and Ci denote the censoring time for i = 1, ⋯ , n, where n is the total number of observations. We write the observed time as ti = min(Ti, Ci). Let δi be a censoring indicator. The covariate vector xi has p1 components and wi has p2 components. The covariates zi are assumed to be associated with random effects b = {b1, ⋯ , bq}T. Conditioning on the random effects b, we assume that the survival times Ti are independent, and that the underlying failure time Ti is independent of the censoring time Ci. We further assume that the censoring times are independent of the random effects b.

We present in (1) a semiparametric frailty model with a general correlation structure. The model can be used to model clustered, hierarchical and spatial survival data:

$$\lambda(t; x_i, w_i, z_i, b) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{p_1}\theta_j(x_{ij}) + w_i^T\gamma + z_i^T b\Big\}, \qquad (1)$$

where λ0(t) is an unspecified baseline hazard, θj(xj) is a nonparametric covariate function for the covariate xj, w is the covariate vector whose effect is modeled linearly, random effects b ~ MVN(0, D(ν)), with a full rank covariance matrix D(ν), and ν is a vector of variance components. Examples of the choices of covariance structures for clustered, hierarchical and spatial survival data can be found, e.g., in Breslow and Clayton (1993).

We develop below an estimation and inference procedure for the general semiparametric frailty model (1), of which the model for clustered survival data is a special case. The integrated likelihood of (θ, γ, ν) for the observed data under the general additive frailty model (1) is

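$$\begin{aligned}
L\{\lambda_0(t),\theta,\gamma,\nu\} &= \int \prod_{i=1}^{n} \lambda(t_i; x_i, w_i, z_i, b)^{\delta_i} \exp\{-\Lambda(t_i; x_i, w_i, z_i, b)\}\, f\{b, D(\nu)\}\, db \\
&= \frac{1}{|D(\nu)|^{1/2}} \int \prod_{i=1}^{n} \Big\{\lambda_0(t_i)\, e^{\sum_{j=1}^{p_1}\theta_j(x_{ij}) + w_i^T\gamma + z_i^T b}\Big\}^{\delta_i} \\
&\qquad \times \exp\Big\{-\Lambda_0(t_i)\, e^{\sum_{j=1}^{p_1}\theta_j(x_{ij}) + w_i^T\gamma + z_i^T b}\Big\}\, e^{-\frac{1}{2} b^T D(\nu)^{-1} b}\, db, \qquad (2)
\end{aligned}$$

where b ~ MVN(0, D(ν)). Since the θj(·) are nonparametric functions, we discuss in the next section their estimation using smoothing splines.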

3. Estimation of the Semiparametric Functions Using Smoothing Splines

3.1 Smoothing Spline Estimation

In this section, we discuss estimation and inference of model (1) using smoothing splines. We assume the nonparametric covariate functions θj(·) to be smooth and twice-differentiable. Smoothing spline estimation of the θj(·) can proceed by maximizing the penalized integrated loglikelihood

$$\ell\{\lambda_0(t),\theta,\gamma,\nu\} - \sum_{j=1}^{p_1}\frac{1}{2\tau_j}\int \{\theta_j^{(2)}(x)\}^2\, dx, \qquad (3)$$

where $\ell(\cdot) = \log L(\cdot)$ and L(·) is defined in (2), and τ = (τ1, ⋯ , τp1) is a vector of smoothing parameters controlling the goodness of fit and the smoothness of the curves.

Let $\theta_j = \{\theta_j(x^0_{1j}), \ldots, \theta_j(x^0_{r_j j})\}^T$, where $x^0_{kj}$ $(k = 1, \ldots, r_j)$ are the ordered rj distinct covariate values for the jth covariate xj. We assume $\sum_{k=1}^{r_j}\theta_j(x^0_{kj}) = 0$ to ensure identifiability, similar to Lin and Zhang (1999). Let Nj be an n by rj indicator matrix such that {θj(x1j), ⋯ , θj (xnj)}T = Njθj. The conditional hazard of the ith observation given the random effects b can then be rewritten as

$$\lambda(t; x_i, w_i, z_i, b) = \lambda_0(t) \times \exp\{N_{i1}^T\theta_1 + \cdots + N_{ip_1}^T\theta_{p_1} + w_i^T\gamma + z_i^T b\}, \qquad (4)$$

where $N_{ij}^T$ is the ith row of the indicator matrix Nj. Using the results of O’Sullivan (1988) and noting that equation (3) is a continuous function of the θj(·), one can show that, for given values of λ0(t), τ and ν, the maximizer of the integrated penalized loglikelihood (3) is a vector of natural cubic smoothing spline estimators of the θj(·), and (3) can be equivalently written as

$$\ell\{\lambda_0(t),\theta,\gamma,\nu\} - \sum_{j=1}^{p_1}\frac{1}{2\tau_j}\theta_j^T K_j \theta_j, \qquad (5)$$

where Kj is the smoothing spline penalty matrix of the covariate xj. See Section 2.2 of Green and Silverman (1994) for details of the proof that the solution is a smoothing spline.
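The two design ingredients used in (4) and (5), the indicator matrix Nj and the penalty matrix Kj, can be sketched in a few lines. The sketch below is illustrative only: the function names and the toy covariate vector are assumptions, and the penalty is built with the standard band-matrix form K = Q R^{-1} Q^T for natural cubic splines described in Green and Silverman (1994), which may differ in detail from the authors' own implementation.

```python
import numpy as np

def incidence_matrix(x):
    """Ordered distinct values x0 and the n-by-r indicator matrix N, so that
    N @ theta maps function values at x0 back to the n observations."""
    x = np.asarray(x, dtype=float)
    x0, inverse = np.unique(x, return_inverse=True)
    N = np.zeros((x.size, x0.size))
    N[np.arange(x.size), inverse.ravel()] = 1.0
    return x0, N

def spline_penalty(x0):
    """Natural cubic spline penalty K = Q R^{-1} Q^T for ordered distinct knots
    (requires at least three knots)."""
    t = np.asarray(x0, dtype=float)
    r, h = t.size, np.diff(t)                 # h[k] = t[k+1] - t[k]
    Q = np.zeros((r, r - 2))
    R = np.zeros((r - 2, r - 2))
    for j in range(r - 2):                    # column j belongs to interior knot j+1
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j + 1 < r - 2:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    return Q @ np.linalg.solve(R, Q.T)

# Hypothetical covariate values, used only for illustration.
x = [2.0, 0.0, 5.0, 2.0, 1.0, 5.0, 3.0]
x0, N = incidence_matrix(x)
K = spline_penalty(x0)
```

Constant and linear functions of the knots lie in the null space of K, which is why the linear part of θj is left unpenalized and K has rank rj − 2.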

Since the integrated likelihood (2) does not have a closed form expression, following Breslow and Clayton (1993) and Ripatti and Palmgren (2000), we write (2) as ∫ exp{−S(b)}db, and apply the Laplace approximation to (2). Using calculations similar to expression (3) and profiling out the baseline hazard in a similar way to Appendix B of Ripatti and Palmgren (2000), one can show that, given (τ, ν), smoothing spline estimators of the θj(·) and the parametric regression coefficient γ can be obtained by jointly maximizing the following doubly penalized partial likelihood (DPPL) with respect to θ, γ, b:

$$\ell_D(\theta_1,\ldots,\theta_{p_1},\gamma,b;\nu,\tau) = \sum_{i=1}^{n}\delta_i\Big[\Big\{\sum_{j=1}^{p_1}N_{ij}^T\theta_j + w_i^T\gamma + z_i^T b\Big\} - \log\Big\{\sum_{l\in R(t_i)} e^{\sum_{j=1}^{p_1}N_{lj}^T\theta_j + w_l^T\gamma + z_l^T b}\Big\}\Big] - \frac{1}{2}b^T D^{-1} b - \sum_{j=1}^{p_1}\frac{1}{2\tau_j}\theta_j^T K_j\theta_j. \qquad (6)$$

Given ν, τ, the DPPL is a partial likelihood with quadratic penalty terms which is concave and has a unique maximizer (up to a constant) as a function of the θj and γ. Given $\hat\theta_j$ estimated at the design points, the smoothing spline estimators of θj(x) at any point x can be obtained by interpolating the splines using these estimated function values (see Section 2.4.1 of Green and Silverman 1994).
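To make the structure of (6) concrete, the following sketch evaluates a DPPL-type objective for a given linear predictor. It is a minimal illustration, not the authors' code: the function names are hypothetical, the caller is assumed to have assembled the linear predictor eta (the sum of the spline, parametric and frailty terms) beforehand, and tied event times are handled in the simple Breslow fashion by including all subjects with time at least t_i in the risk set.

```python
import numpy as np

def log_partial_likelihood(eta, time, status):
    """Sum over events of eta_i - log(sum of exp(eta_l) over the risk set R(t_i))."""
    eta, time, status = (np.asarray(v, dtype=float) for v in (eta, time, status))
    total = 0.0
    for i in np.flatnonzero(status == 1):
        at_risk = time >= time[i]              # R(t_i): still under observation at t_i
        total += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return total

def dppl(eta, time, status, b, D, thetas, Ks, taus):
    """Partial likelihood minus the frailty penalty and the spline penalties, as in (6)."""
    b = np.asarray(b, dtype=float)
    frailty_pen = 0.5 * b @ np.linalg.solve(D, b)
    spline_pen = sum(th @ K @ th / (2.0 * tau)
                     for th, K, tau in zip(thetas, Ks, taus))
    return log_partial_likelihood(eta, time, status) - frailty_pen - spline_pen

# Toy check with three subjects and a null linear predictor.
value = log_partial_likelihood([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1, 1, 0])
```

With a null linear predictor the two events contribute −log 3 and −log 2, so the toy value equals −log 6; the penalties then only shift the objective downward.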

3.2 The Frailty Model Representation

In this subsection, we discuss the connection between the DPPL (6) for model (1) and the penalized likelihood of traditional frailty models with parametric covariate functions. In a similar spirit to Lin and Zhang (1999), one can show that there is a one-to-one transformation between θj and $(\beta_j, a_j^T)^T$, i.e., $\theta_j = x_j^0\beta_j + B_j a_j$. Here $x_j^0 = (x^0_{1j} - \bar{x}^0_j, \ldots, x^0_{r_j j} - \bar{x}^0_j)^T$, $\bar{x}^0_j = \sum_{i=1}^{r_j} x^0_{ij}/r_j$, $B_j = L_j(L_j^T L_j)^{-1}$ and $L_j L_j^T = K_j$ (see Green 1987 for details). One can therefore easily show that this new parameterization yields $\theta_j^T K_j \theta_j = a_j^T a_j$.
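The identity θjTKjθj = ajTaj can be checked numerically. The sketch below is one possible construction under stated assumptions: the factor L is obtained from a Cholesky factor of the band matrix R in K = Q R^{-1} Q^T (any full-rank L with L L^T = K would serve), and the knots and coefficient values are arbitrary illustrative numbers.

```python
import numpy as np

def reparameterize(x0):
    """Return centered knots x0c, B = L (L^T L)^{-1} and K for ordered distinct x0."""
    t = np.asarray(x0, dtype=float)
    r, h = t.size, np.diff(t)
    Q, R = np.zeros((r, r - 2)), np.zeros((r - 2, r - 2))
    for j in range(r - 2):
        Q[j:j + 3, j] = [1 / h[j], -1 / h[j] - 1 / h[j + 1], 1 / h[j + 1]]
        R[j, j] = (h[j] + h[j + 1]) / 3
        if j + 1 < r - 2:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6
    C = np.linalg.cholesky(R)            # R = C C^T
    L = Q @ np.linalg.inv(C).T           # so L L^T = Q R^{-1} Q^T = K
    B = L @ np.linalg.inv(L.T @ L)
    return t - t.mean(), B, L @ L.T

x0c, B, K = reparameterize([0.0, 1.0, 2.0, 4.0, 7.0])
beta, a = 0.3, np.array([0.5, -1.0, 0.2])    # arbitrary illustrative values
theta = x0c * beta + B @ a                   # theta_j = x_j^0 beta_j + B_j a_j
penalty_theta = theta @ K @ theta            # equals a @ a up to rounding
```

Because the centered knots are annihilated by K and L^T B is the identity, the quadratic penalty depends on a alone, which is what allows a to be treated as a vector of independent random effects.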

Applying this transformation to (6), we see that the DPPL is equal to

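$$\ell_D(\beta,\gamma,a,b;\nu,\tau) = \sum_{i=1}^{n}\delta_i\Big[\sum_{j=1}^{p_1}N_{ij}^T(x_j^0\beta_j + B_j a_j) + w_i^T\gamma + z_i^T b - \log\Big\{\sum_{l\in R(t_i)} e^{\sum_{j=1}^{p_1}N_{lj}^T(x_j^0\beta_j + B_j a_j) + w_l^T\gamma + z_l^T b}\Big\}\Big] - \frac{1}{2}b^T D^{-1} b - \frac{1}{2}a^T\Psi^{-1}a, \qquad (7)$$

where $a = (a_1^T, \ldots, a_p^T)^T$, β = (β1, ⋯ , βp)T, and Ψ = diag(τ1Ir1×r1, ⋯ , τpIrp×rp).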

A comparison of (7) with equation (4) in Ripatti and Palmgren (2000) shows that the two penalized likelihoods take the same form. Hence the DPPL smoothing spline estimators of {θ1(·), ⋯, θp(·)} and the regression coefficients γ of the semiparametric frailty model (1) can be obtained by fitting the following augmented working frailty model with parametric covariate effects using the penalized likelihood approach of Ripatti and Palmgren (2000),

$$\lambda_i(t; x_i, w_i, a_i, z_i, b) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{p_1}N_{ij}^T x_j^0\beta_j + w_i^T\gamma + \sum_{j=1}^{p_1}N_{ij}^T B_j a_j + z_i^T b\Big\}, \qquad (8)$$

where β and γ are vectors of fixed effects, and a and b are independent random effects distributed as MVN(0, Ψ) and MVN(0, D(ν)), respectively.

After obtaining the fixed and random effect estimators $\hat\gamma$, $\hat\beta_j$, and $\hat{a}_j$ by maximizing (7), the DPPL smoothing spline estimators of θj can be calculated as $\hat\theta_j = x_j^0\hat\beta_j + B_j\hat{a}_j$, which is a linear combination of fixed and random effect estimators. For inference, we need to estimate the variances of $\hat\theta_j$ and $\hat\gamma$. Ripatti and Palmgren (2000) proposed using the inverse of the negative second partial derivative of the penalized partial likelihood to estimate the covariance matrix of the coefficient estimators. Their simulation results showed that the estimators work well. To make inference about $\hat\theta_j$, we take a similar approach. The covariance of $\hat\theta_j$ can then be estimated by the transformation $\widehat{\mathrm{COV}}(\hat\theta_j) = (x_j^0, B_j)\,\widehat{\mathrm{COV}}(\hat\beta_j, \hat{a}_j^T)\,(x_j^0, B_j)^T$.
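The back-transformation from the working-model estimates to the spline estimate and its pointwise standard errors is a single matrix product. A minimal sketch follows; the function name and all numerical inputs are made up for illustration, and the covariance matrix of the stacked vector (β̂j, âjT)T is assumed to be supplied by the fitted working frailty model.

```python
import numpy as np

def theta_hat_and_se(x0c, B, beta_hat, a_hat, cov_beta_a):
    """Spline estimate at the design points, its pointwise SEs and covariance.
    cov_beta_a is the estimated covariance of the stacked vector (beta_j, a_j)."""
    T = np.column_stack([x0c, B])                       # the matrix (x_j^0, B_j)
    theta_hat = T @ np.concatenate([[beta_hat], a_hat])
    cov_theta = T @ cov_beta_a @ T.T
    se = np.sqrt(np.clip(np.diag(cov_theta), 0.0, None))
    return theta_hat, se, cov_theta

# Illustrative numbers only (three design points, one spline random effect).
x0c = np.array([-1.0, 0.0, 1.0])
B = np.array([[1.0], [-2.0], [1.0]])
theta_hat, se, _ = theta_hat_and_se(x0c, B, 0.5, np.array([0.1]),
                                    np.diag([0.04, 0.01]))
lower, upper = theta_hat - 1.96 * se, theta_hat + 1.96 * se   # pointwise 95% limits
```

The pointwise limits here are the usual normal-approximation intervals; they are pointwise rather than simultaneous bands.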

3.3 Frailty Models with Stratified Baseline Hazards

In the context of the STI data example, infections with different organisms tend to have distinct population prevalence rates and natural histories. We therefore extend model (1) to accommodate stratified baseline hazards as follows:

$$\lambda(t; x_i, w_i, z_i, b) = \lambda_{0j}(t) \times \exp\{\theta_1(x_{i1}) + \cdots + \theta_{p_1}(x_{ip_1}) + w_i^T\gamma + z_{ij}^T b\}, \qquad (9)$$

where λ0j(t) is the baseline hazard for the jth stratum (organism), j = 1, ⋯ , J. One can write a DPPL similar to (6) with stratum-specific risk sets Rij = {all observations for the jth organism at risk at time tij}. Estimation and inference for model (9) then follow a scheme similar to that for a common baseline hazard presented in Section 3.2.
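The only change that stratification makes to the DPPL is in how the risk sets are formed. The sketch below shows one simple way to build the stratum-specific risk sets; the function name, the toy data and the stratum labels (CT, NG, TV, standing for the three organisms) are hypothetical.

```python
def stratified_risk_sets(time, status, stratum):
    """Map each event index i to the indices still at risk at time[i]
    within the same stratum (ties are kept in the risk set)."""
    risk = {}
    for i, (t_i, d_i, s_i) in enumerate(zip(time, status, stratum)):
        if d_i == 1:
            risk[i] = [l for l, (t_l, s_l) in enumerate(zip(time, stratum))
                       if s_l == s_i and t_l >= t_i]
    return risk

# Toy data: two records per organism.
time = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0]
status = [1, 0, 1, 1, 0, 1]
stratum = ["CT", "CT", "NG", "NG", "TV", "TV"]
risk_sets = stratified_risk_sets(time, status, stratum)
```

Each event is compared only with records from its own stratum, so the baseline hazard of every organism cancels separately from the partial likelihood.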

4. Inference on Smoothing Parameters and Variance Components

We assume in Section 3 that the smoothing parameters τ and the variance components ν are known. In practice, they need to be estimated from the data. Motivated by the working parametric frailty model (8), we propose to treat the smoothing parameter τj(j = 1, ⋯ , p) as extra variance components. We estimate τ and ν simultaneously as variance components by maximizing the profile likelihood (10) in the working parametric frailty model (8) as follows:

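$$l_D\{\beta(\tau,\nu),\tau,\nu\} = -\frac{1}{2}\log|\tilde{D}(\tau,\nu)| - \frac{1}{2}\log|S''(a,b)| - \frac{1}{2}(a,b)\,\tilde{D}(\tau,\nu)^{-1}(a,b)^T, \qquad (10)$$

where $\tilde{D} = \begin{pmatrix}\Psi & 0\\ 0 & D(\nu)\end{pmatrix}$. Following Ripatti and Palmgren (2000), the estimating equation for the variance components can be derived and simplified from the profile likelihood as follows:

$$-\frac{1}{2}\Big[\mathrm{tr}\Big(\tilde{D}^{-1}\frac{\partial\tilde{D}}{\partial(\tau,\nu)}\Big) + \mathrm{tr}\Big(S''(\hat{a},\hat{b})^{-1}\frac{\partial\tilde{D}^{-1}}{\partial(\tau,\nu)}\Big) - (\hat{a},\hat{b})\,\tilde{D}^{-1}\frac{\partial\tilde{D}}{\partial(\tau,\nu)}\tilde{D}^{-1}(\hat{a},\hat{b})^T\Big] = 0.$$

These equations are in the same spirit as equation (8) of Ripatti and Palmgren (2000) and equation (6.21) of Duchateau et al. (2008), noting that $\partial\tilde{D}^{-1}/\partial(\tau,\nu) = -\tilde{D}^{-1}\{\partial\tilde{D}/\partial(\tau,\nu)\}\tilde{D}^{-1}$.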

The estimating equations can be simplified as:

$$\hat\tau_j = \frac{\hat{a}_j^T\hat{a}_j + \mathrm{tr}\{[S''(\hat{a},\hat{b})_{a_j a_j}]^{-1}\}}{r_j - 2} \qquad (11)$$

and

$$\hat\nu = \frac{\hat{b}^T\hat{b} + \mathrm{tr}\{[S''(\hat{a},\hat{b})_{bb}]^{-1}\}}{n_c}, \qquad (12)$$

where $\hat{a}_j$ is the maximizer of (7); $S''(\hat{a},\hat{b})_{a_j a_j}$ is the (rj − 2) × (rj − 2) diagonal block submatrix of $S''(\hat{a},\hat{b})$ corresponding to aj; tr{·} is the trace of a matrix; and nc is the number of clusters. Here $S''(\hat{a},\hat{b})$ is calculated by plugging in the corresponding estimators. The estimated values $\hat{a}$, $\hat{b}$ and D are all functions of τ, ν. The estimates of τ, ν are obtained by iterating equations (11) and (12). The general expression of S″(·) is given in Appendix I. The baseline hazard is estimated using Breslow’s estimator. An alternative to S″ is $\tilde{S}''$, the version that profiles out the baseline hazard function.
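One pass of the updates (11) and (12) can be sketched as below. This is an illustrative sketch under several assumptions: the function name is hypothetical, S″ is assumed to be ordered as (a1, ⋯, ap, b), the diagonal blocks are inverted literally as the formulas are written, and the numerical inputs are made up. In an actual fit, â, b̂ and S″ would be recomputed by re-maximizing (7) after every update until τ and ν stabilize.

```python
import numpy as np

def update_variance_components(a_blocks, b_hat, S2, n_clusters):
    """One pass of the fixed-point updates (11) and (12).
    S2 plays the role of S''(a_hat, b_hat), ordered as (a_1, ..., a_p, b)."""
    taus, start = [], 0
    for a_j in a_blocks:
        stop = start + a_j.size                      # a_j has r_j - 2 elements
        block = S2[start:stop, start:stop]
        taus.append((a_j @ a_j + np.trace(np.linalg.inv(block))) / a_j.size)
        start = stop
    block_bb = S2[start:, start:]
    nu = (b_hat @ b_hat + np.trace(np.linalg.inv(block_bb))) / n_clusters
    return np.array(taus), nu

# Made-up inputs: one smooth term with two spline random effects, three clusters.
a_blocks = [np.array([1.0, 2.0])]
b_hat = np.array([0.5, -0.5, 1.0])
S2 = np.diag([2.0, 4.0, 1.0, 2.0, 4.0])
tau_new, nu_new = update_variance_components(a_blocks, b_hat, S2, n_clusters=3)
```

In this toy case the update gives τ = (5 + 0.75)/2 and ν = (1.5 + 1.75)/3, which shows how the trace term inflates the naive sum-of-squares estimate.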

The variance of ν^ for a shared frailty model can be estimated by using

$$\mathrm{var}(\hat\nu) = 2\hat\nu^2\Big[n_c + \frac{1}{\hat\nu^2}\mathrm{tr}\{[S''(\hat{a},\hat{b})_{bb}]^{-1}[S''(\hat{a},\hat{b})_{bb}]^{-1}\} - \frac{2}{\hat\nu}\mathrm{tr}\{[S''(\hat{a},\hat{b})_{bb}]^{-1}\}\Big]^{-1}.$$

Note that this variance estimator is derived from the Fisher information matrix for τ, ν and has already accounted for the variability due to the estimation of τ. Appendix II summarizes the estimation procedure for all the model components. Sample code for implementing the procedure, along with a sample data set, has been posted on www.biostat.iupui.edu/yuz/Code. The code can be adapted to conveniently fit the semiparametric additive models for data organized in the manner described in the introduction.txt file and the main function file dpplest.R.
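The variance formula for the frailty variance is straightforward to evaluate once the b-block of S″ is available. The following sketch simply transcribes the formula; the function name and the inputs (four clusters, a diagonal curvature block) are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def var_nu_hat(nu_hat, S2_bb, n_clusters):
    """Plug-in variance of nu_hat for a shared frailty model.
    S2_bb is the block of S''(a_hat, b_hat) corresponding to b."""
    inv_bb = np.linalg.inv(S2_bb)
    info = (n_clusters
            + np.trace(inv_bb @ inv_bb) / nu_hat ** 2
            - 2.0 * np.trace(inv_bb) / nu_hat)
    return 2.0 * nu_hat ** 2 / info

# Hypothetical inputs: four clusters with a diagonal curvature block.
se_nu = np.sqrt(var_nu_hat(0.5, 4.0 * np.eye(4), 4))
```

When the data carry no information about b, the b-block of S″ reduces to the identity divided by ν̂, the bracketed term goes to zero and the variance becomes unbounded, which is a useful sanity check on an implementation.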

5. Simulation Studies

In this section, we evaluate finite sample performance of the proposed method through a simulation study. We consider the following hazard model

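$$\lambda(t_{ij}; x_1, x_2, w_1, b_i) = \exp\{\theta_1(x_1) + \theta_2(x_2) + w_1\gamma + b_i\}, \qquad (13)$$

where

$$\theta_1(x) = \{2\times\mathrm{beta}(x/10, 8, 8) + \mathrm{beta}(x/10, 5, 5)\}/9,$$
$$\theta_2(x) = \{6\times\mathrm{beta}(x/10, 30, 17) + 4\times\mathrm{beta}(x/10, 3, 11)\}/40,$$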

and beta(·) is the density function of the BETA distribution, and the bi are independent random effects following N(0, ν).

The true value of the regression coefficient γ was set to 0.5. Covariate w1 was generated as a binary random variable with equal probability of being 0 or 1; x1 was generated as a cluster-level covariate taking values from 100 equally spaced knots in [0, 10], xij1 = (i mod 100)/10; and x2 was generated as

$$x_{2ij} = \frac{\mathrm{trun}\{(i+5)/6\}}{10} + \frac{10}{5}\times(j-1),$$

where trun{·} denotes the truncation operator. The variance component was set as ν = 0.25, 0.5. Censoring times followed an exponential distribution with rate 0.4, and the maximum follow-up time was set to 5. The censoring percentage was about 18%. One hundred data sets were generated. In each data set, the number of clusters was 120 with 5 observations per cluster. For each simulated data set, we applied the DPPL method to estimate {θ1(x1), θ2(x2), γ, ν}.
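The data-generating mechanism described above can be sketched as follows. This is a reading of the design under stated assumptions rather than the authors' simulation code: the scalings of θ1 and θ2 (division by 9 and 40) and the form of x2 are taken from the formulas as reconstructed here, failure times are drawn as exponential because model (13) has a constant hazard given the covariates and frailty, and the function names and seed are arbitrary.

```python
import math
import random

def beta_pdf(u, a, b):
    """Density of the Beta(a, b) distribution, zero outside (0, 1)."""
    if u <= 0.0 or u >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u))

def theta1(x):
    return (2 * beta_pdf(x / 10, 8, 8) + beta_pdf(x / 10, 5, 5)) / 9

def theta2(x):
    return (6 * beta_pdf(x / 10, 30, 17) + 4 * beta_pdf(x / 10, 3, 11)) / 40

def simulate(n_clusters=120, cluster_size=5, gamma=0.5, nu=0.5, seed=1):
    """Rows of (cluster, x1, x2, w1, observed time, event indicator)."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n_clusters + 1):
        b_i = rng.gauss(0.0, math.sqrt(nu))          # cluster-level frailty
        x1 = (i % 100) / 10
        for j in range(1, cluster_size + 1):
            x2 = math.trunc((i + 5) / 6) / 10 + (10 / 5) * (j - 1)
            w1 = rng.randint(0, 1)
            rate = math.exp(theta1(x1) + theta2(x2) + gamma * w1 + b_i)
            t_fail = rng.expovariate(rate)           # constant hazard given b_i
            t_cens = min(rng.expovariate(0.4), 5.0)  # censoring, capped at 5
            rows.append((i, x1, x2, w1, min(t_fail, t_cens), int(t_fail <= t_cens)))
    return rows

data = simulate()
censored_fraction = 1 - sum(row[5] for row in data) / len(data)
```

The censored fraction from a single simulated data set will vary around the level reported in the text and need not match it exactly.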

Table 1 gives the averages of the estimates of the regression coefficient γ and the variance component ν. For both settings of ν = 0.25 and ν = 0.5, the estimates of γ are very close to the true values. The estimated standard errors of $\hat\gamma$ are close to their empirical counterparts. The estimates of the variance component ν are also very close to the true value. Their SE estimates are slightly smaller than the empirical ones.

Table 1. Estimates of the regression coefficient and variance component in the simulation studies based on 100 runs.

Setting    Parm.  True   Average  Est. SE  Emp. SE  95% CP
ν = 0.25   γ      0.5    0.5081   0.1016   0.0923   98%
           ν      0.25   0.2445   0.0648   0.0757   90%
ν = 0.5    γ      0.5    0.5057   0.1052   0.1003   97%
           ν      0.5    0.4884   0.0973   0.1129   93%

Parm: parameter

CP: coverage probability

Average: average of the estimates

Est. SE: Estimated SE

Emp. SE: Monte-Carlo estimated SE

Fig. 1 depicts the performance of the DPPL spline estimates of the two smooth functions θ1(x1) and θ2(x2) when the variance component ν = 0.5. Fig. 1(a) and (d) show that the averages of the DPPL smoothing spline estimates of θ1(x1) and θ2(x2) are close to the true values. There are some biases in the estimated curves where the curvatures of the functions are high, e.g., at the second peak of θ2(x2). Fig. 1(b) and (e) show that the estimated SEs of $\hat\theta_1(x_1)$ and $\hat\theta_2(x_2)$ are close to their empirical counterparts except in the regions where the curvatures of the functions are high. Fig. 1(c) and (f) show the coverage probabilities of the pointwise estimated 95% confidence intervals of $\hat\theta_1(x_1)$ and $\hat\theta_2(x_2)$. They are close to the nominal value of 95% except at the peaks. The results are similar when ν = 0.25 and are not shown here. Our simulation study shows that the proposed DPPL method works well in finite samples in estimating the nonparametric functions and the variance components.

Figure 1. Simulation results for the nonparametric covariate functions θ1(x) and θ2(x): (a) average of the DPPL spline estimates of θ1(x), dotted, and true θ1(x), solid; (b) standard errors of the estimate of θ1(x): estimated, dotted, and empirical, solid; (c) empirical coverage probability for θ1(x): the mean coverage probability is 0.901; (d) average of the DPPL spline estimates of θ2(x), dotted, and true θ2(x), solid; (e) standard errors of the estimate of θ2(x): estimated, dotted, and empirical, solid; (f) empirical coverage probability for θ2(x): the mean coverage probability is 0.875.

We also performed simulations with a larger censoring percentage and a smaller number of events. In the simulation with about 66% censoring (200 events, results not shown), the biases of the parametric regression coefficients and the nonparametric function are only slightly larger than with 18% censoring. The empirical coverage probability for the parametric regression coefficient is close to 95%. The average coverage probability of the nonparametric function is 86.7%, owing to the somewhat larger bias at the peak of the curve resulting from the smaller number of events. Performance improves when the sample size increases.

6. Application to the Sexually Transmitted Infection Data

The primary mode of transmission of STI is sexual contact. In the United States, much of the disease burden associated with STI is on women and young people. For example, it is estimated that subjects between 15 and 24 years of age account for about half of new STI cases each year, although they only represent a quarter of the population (Weinstock, Berman, and Cates 2004). STI screening is motivated by the disproportionate morbidity among adolescent women, including pelvic inflammatory disease, ectopic pregnancy, tubal infertility, preterm birth, and increased susceptibility to human immunodeficiency virus infection (CDC 2007; Cates and Wasserheit 1991; Paavonen et al. 2008; and Fleming et al. 1999). Despite the consensus on the need of screening, potentially useful behavioral markers for selected screening have not been adequately assessed because of limited epidemiological data (USPSTF 2007). Among the behavioral markers, the most relevant to the screening practice is the number of sexual partners. An improved understanding of the effect of the number of partners will help us to identify individuals who are at increased STI risk for selective screening.

A recent study provides an opportunity to examine the effect of sexual partners on the time between sexual debut and the first STI. Briefly, young women between ages 14 and 17 years, attending one of three adolescent medicine clinics were eligible for enrollment regardless of prior sexual activity and STI diagnosis. At enrollment, participants were interviewed for their lifetime and recent (past two months) sexual behaviors, as well as the age of sexual debut (first sex) and the number of sex partners. Cervical and vaginal specimens were collected by a research nurse practitioner for testing of C trachomatis, N gonorrhoeae and T vaginalis. Infected participants were treated at the visit or shortly thereafter. The same procedures were repeated every three months. Information about infections prior to study enrollment or outside of the study venues was extracted from participants’ electronic medical records. Detailed study protocol was described elsewhere (Tu et al. 2009). If a participant was sexually active at enrollment, age of first sex was ascertained from enrollment interview; for those who became sexually active during the course of follow-up, age of first sex was determined from quarterly interviews.

There were 387 adolescent women enrolled in the study. The enrolled subjects were followed for up to 8.2 years. On average, study participants reported 3 partners in their lifetime at the time of enrollment (median = 2; range 0-28). We focused here on the effects of age of sexual debut, the number of unprotected sexual intercourse events during the last two months, and the baseline lifetime number of sexual partners reported at enrollment. The outcome of interest is time from first coitus to the first infection with each of the three organisms C. trachomatis, N. gonorrhoeae and T. vaginalis. Among the 387 women, 26% were censored for C. trachomatis infection, 51% were censored for N. gonorrhoeae infection, and 47% were censored for T. vaginalis infection by the end of follow-up.

Times from first coitus to the initial STI with each organism within the same individual are correlated due to the common sexual behavior and physiological environment within the same subject. To model the correlation of the three types of infections of the same subject, we used a stratified frailty model with a common random intercept in (9) by assuming zij = 1 and that the bi are independent and identically distributed as a normal distribution with mean zero and an unknown variance ν. Our preliminary study shows that the lifetime number of partners has a polynomial effect up to 4th order. To accommodate such a nonlinear effect, we introduced a nonparametric component for the lifetime number of sexual partners. Parametric effects were used for age at first intercourse, ethnicity (white), and number of unprotected sex events in the past two months.

Specifically, we considered the following model:

λij(t)=λ0j(t)exp{θ(N_PARTNERS)+γ1×RACE+γ2×AGE+γ3×N_SEX+bi}, (14)

where N_PARTNERS is the lifetime number of partners ascertained at enrollment, AGE is the self-reported age at first sex, N_SEX is the number of unprotected sex events in the last two months, θ(·) is an unknown smooth function and bi is the subject-specific random effect following N(0, ν). Preliminary analysis shows a similar effect of race, age, and N_SEX on the risk of STI with different organisms. We therefore assume the covariate effects are the same for the three types of infections for model simplicity.

As demonstrated in Fig. 2, while there was a generally positive association between the number of partners and STI acquisition, the risk was highest for subjects with four to six partners, which appeared to be the range in which the effect of the number of partners became significantly different from that of having no sex partners. This intriguing observation, though not previously reported in the literature, is perhaps not entirely surprising. For example, one could speculate that adolescent women with relatively few (fewer than four) partners had lower risk because they are usually younger and are seeing younger male partners, who are unlikely to be sources of infection. On the other hand, women who had a larger number of sexual partners (more than seven) are likely to be more mature and more cognizant of the STI risk, and thus to use more prophylactics. This may explain the apparent lack of a linear increase in STI risk in women with a larger number of sexual partners. Additionally, we could not rule out the possibility that these women are more experienced in partner selection.

As suggested by the associate editor, we also analyzed the data using an approach similar to that of Du and Ma (2010), using the function sshzd in the R package gss. Initially, we tried the model log λi(t, N_PARTNERS, AGE, N_SEX) = g(t, N_PARTNERS, AGE, N_SEX) + bi with a fully nonparametric regression function g(·) and a random effect bi, as in Du and Ma, but we had numerical difficulties fitting this model. Hence, in an effort to provide the best possible comparison between Du and Ma’s fully nonparametric approach and our semiparametric additive model approach, we fit a fully nonparametric model of the form log λi(t, N_PARTNERS, AGE, N_SEX) = g(t, N_PARTNERS, AGE, N_SEX) without the random effect bi. Both analyses point to a generally positive association between the number of partners and infection risk, and both models show an attenuation of infection risk when the number of partners is greater than five (plot not shown). Therefore, the more parsimonious semiparametric model has adequately captured the trend demonstrated by the fully nonparametric model.

Figure 2. Smoothing spline estimate of the effect of the lifetime number of partners using model (14) in the sexually transmitted infection study.

The estimated effects of the covariates are summarized in Table 2. The STI risk for subjects with a higher number of unprotected sex events in the past two months was higher than for those with fewer unprotected sex events, although the effect did not reach statistical significance (p-value=0.89). White adolescents tended to have a lower STI risk than black teens (hazard ratio 0.436; p-value<0.001). Teens with a later sexual debut had higher infection rates than those with an earlier sexual debut (p-value<0.001). Again, this seemingly paradoxical observation may reflect the STI risk level presented by the male partners. Although the current study is unable to fully discern the reasons behind these observations due to the lack of male partner data, the findings nonetheless highlight the complexity of the STI risk in adolescents and the need for a more careful examination of the behavioral markers for STI screening.

Table 2. Regression coefficient estimates from fitting the semiparametric frailty model (time to sexually transmitted infection study).

Covariate   Race (white)   Age       N_Sex
Estimate    −0.831         0.326     0.0008
SE          (0.240)        (0.047)   (0.0057)
p-value     <0.001         <0.001    0.89

Age: Age at first sex.
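The entries of Table 2 can be related to one another with a short calculation. The sketch below assumes that the reported p-values are two-sided Wald tests based on the normal approximation, which the text does not state explicitly; the function name is hypothetical, and the estimates and standard errors are copied from Table 2.

```python
import math

def wald_summary(estimate, se):
    """Hazard ratio, Wald z statistic and two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))       # equals 2 * (1 - Phi(|z|))
    return {"hazard_ratio": math.exp(estimate), "z": z, "p": p}

# Estimates and standard errors as reported in Table 2.
table2 = {"Race (white)": (-0.831, 0.240),
          "Age": (0.326, 0.047),
          "N_Sex": (0.0008, 0.0057)}
summary = {name: wald_summary(*vals) for name, vals in table2.items()}
```

Under this assumption the hazard ratio for race is exp(−0.831), about 0.436, and the p-value for N_Sex is about 0.89, in line with the values quoted in the text.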

7. Discussion

In this paper, we proposed a semiparametric frailty model for examination of semiparametric covariate effects in correlated survival data. Nonparametric functions are estimated using smoothing splines. Since the observed likelihood involves integration with no closed form, the Laplace method is used to approximate the likelihood. We developed the DPPL method to estimate all of the model components, including the semiparametric covariate effects, the variance components, and the smoothing parameters, within a unified framework using an augmented working frailty model with parametric covariate effects. With the proposed approach, the smoothing spline estimators of the nonparametric functions are obtained as linear combinations of fixed effects and random effects, and the smoothing parameters are treated as extra variance components.

Alternatively, one may use Gaussian quadrature or MCMC-type methods to integrate out the cluster-level random effects to reduce bias in some parameters. However, these calculations are computationally intensive. In addition, one still has to estimate the smoothing parameter. If REML is used to estimate the smoothing parameter, one also needs to deal with the high dimensional integration resulting from the random effects associated with the smoothing spline estimator, and these numerical integration methods for the REML likelihood are not feasible. Additionally, inference for model parameters using numerical integration remains challenging. A key advantage of the DPPL is that estimation and inference of all the model components are easily obtained within a unified parametric frailty model framework. Future research is needed to pursue these numerical integration methods and compare their performance with the DPPL method. We will also consider bias correction similar to that used by Lin and Breslow (1996) to improve the performance of the DPPL method.

For the STI application, our analysis showed a nonlinear effect of the number of partners on STI acquisition in adolescent women. This finding paints a more nuanced picture of the partner effect: when the cumulative lifetime number of partners is relatively small, i.e., fewer than five, infection risk clearly increases with the number of partners; but when the number of partners is relatively large, infection risk no longer increases proportionally with the number of partners, possibly due to increased prophylactic behaviors and immunological maturity in young women with more sexual experience (Ethier and Orr, 2007). As discussed in previous sections, while such a finding is not scientifically unexpected, the complexity of the partner effect points to the need for an algorithm that more effectively differentiates STI risk for screening purposes. In the search for meaningful STI screening indicators, the proposed semiparametric frailty model offers a statistical tool with the modeling flexibility needed to assess the effects of potential screening variables.

Acknowledgement

This work is supported by National Institutes of Health grants R01 HD42404 (Tu), R01 HL095086 (Tu, Yu), R37 CA76404 (Lin, Yu), and P01 CA134294 (Lin, Yu).

Appendix I: The Derivatives of the Integrand of the Integrated Likelihood of the General Frailty Model

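Consider the general frailty model $\lambda_i(t) = \lambda_0(t)\exp\{x_i^T\beta + w_i^T\gamma + \tilde{z}_i^T\tilde{b}\}$, where $\beta$ and $\gamma$ are vectors of regression coefficients, $\tilde{z}_i$ is the covariate vector associated with the random effects $\tilde{b}$, and $\tilde{b}$ is a vector of random effects following $N(0, \tilde{D})$, which may include $b$ and $a$ for the additive frailty model (2). Write the integrated likelihood, up to a factor that does not depend on $\tilde{b}$, as $\int \exp\{-S(\tilde{b})\}\,d\tilde{b}$, where

$$S(\tilde{b}) = \sum_{i=1}^{n}\left[-\delta_i\left\{\log\lambda_0(u_i) + x_i^T\beta + w_i^T\gamma + \tilde{z}_i^T\tilde{b}\right\} + \Lambda_0(u_i)\,e^{x_i^T\beta + w_i^T\gamma + \tilde{z}_i^T\tilde{b}}\right] + \frac{1}{2}\tilde{b}^T\tilde{D}^{-1}\tilde{b},$$

$$S'(\tilde{b}) = \sum_{i=1}^{n}\left[-\delta_i\tilde{z}_i + \Lambda_0(u_i)\,e^{x_i^T\beta + w_i^T\gamma + \tilde{z}_i^T\tilde{b}}\,\tilde{z}_i\right] + \tilde{D}^{-1}\tilde{b},$$

$$S''(\tilde{b}) = \sum_{i=1}^{n}\Lambda_0(u_i)\,e^{x_i^T\beta + w_i^T\gamma + \tilde{z}_i^T\tilde{b}}\,\tilde{z}_i\tilde{z}_i^T + \tilde{D}^{-1}.$$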
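As a numerical illustration of these derivatives, the following Python sketch evaluates S, its gradient, and its Hessian, reading S as the negative log of the integrand. It then finds the mode by Newton-Raphson and forms the Laplace approximation to the log of the integral. The function names, the toy data, and the choice of treating the baseline quantities log λ0(u_i) and Λ0(u_i) as known inputs are assumptions made for illustration only; they are not part of the original derivation.

```python
import numpy as np

def S(b, delta, log_lam0, Lam0, eta_fixed, Z, D_inv):
    """Negative log integrand; eta_fixed stands for x_i^T beta + w_i^T gamma."""
    eta = eta_fixed + Z @ b
    return np.sum(-delta * (log_lam0 + eta) + Lam0 * np.exp(eta)) + 0.5 * b @ D_inv @ b

def S1(b, delta, Lam0, eta_fixed, Z, D_inv):
    """First derivative: sum_i {Lam0_i exp(eta_i) - delta_i} z_i + D^{-1} b."""
    mu = Lam0 * np.exp(eta_fixed + Z @ b)
    return Z.T @ (mu - delta) + D_inv @ b

def S2(b, Lam0, eta_fixed, Z, D_inv):
    """Second derivative: sum_i Lam0_i exp(eta_i) z_i z_i^T + D^{-1}."""
    mu = Lam0 * np.exp(eta_fixed + Z @ b)
    return (Z * mu[:, None]).T @ Z + D_inv

# Hypothetical toy inputs: 12 subjects in 3 clusters, baseline terms taken as known.
rng = np.random.default_rng(0)
n, q = 12, 3
Z = np.eye(q)[rng.integers(0, q, size=n)]         # cluster indicator rows z_i
delta = rng.integers(0, 2, size=n).astype(float)  # event indicators
Lam0 = rng.uniform(0.2, 1.5, size=n)              # Lambda_0(u_i), assumed known
log_lam0 = np.zeros(n)                            # log lambda_0(u_i), assumed known
eta_fixed = rng.normal(0.0, 0.3, size=n)          # fixed-effect linear predictor
D_inv = np.eye(q) / 0.5                           # frailty variance 0.5

# Newton-Raphson for the mode of exp{-S(b)}.
b_hat = np.zeros(q)
for _ in range(50):
    step = np.linalg.solve(S2(b_hat, Lam0, eta_fixed, Z, D_inv),
                           S1(b_hat, delta, Lam0, eta_fixed, Z, D_inv))
    b_hat = b_hat - step
    if np.max(np.abs(step)) < 1e-10:
        break

# Laplace approximation to log of the integral of exp{-S(b)} over b:
# -S(b_hat) + (q/2) log(2 pi) - (1/2) log|S''(b_hat)|.
_, logdet = np.linalg.slogdet(S2(b_hat, Lam0, eta_fixed, Z, D_inv))
log_laplace = (-S(b_hat, delta, log_lam0, Lam0, eta_fixed, Z, D_inv)
               + 0.5 * q * np.log(2 * np.pi) - 0.5 * logdet)
```

Because the second derivative is a sum of positive semidefinite terms plus the positive definite matrix D^{-1}, S is strictly convex in the random effects, which is why a plain Newton iteration is usually adequate in small examples like this one.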

Appendix II: Algorithm for estimation

  1. Generate the new covariates xj0, Bj, and indicator matrix Nj for fitting the augmented working frailty model (8).

  2. For fixed variance component ν and smoothing parameters τ, maximize (7) with respect to βj, aj, γ, and b using a Newton-Raphson algorithm.

  3. Given the estimates of βj, aj, γ, and b obtained from step (2), calculate the smoothing parameters and variance components using equation (12,12).

  4. Iterate between steps (2) and (3) until convergence.
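The alternating structure of steps (2) to (4) can be sketched in Python on a deliberately simplified model. This is not the DPPL procedure itself: the sketch assumes a known unit baseline hazard (so Λ0(u) = u) instead of the partial likelihood, a single linear covariate with no spline terms (so step (1) is omitted), and one Gaussian frailty per cluster. The variance update used here, θ = {bᵀb + tr((H⁻¹)_bb)}/q, is a common REML-type fixed-point update that stands in for the paper's updating equations. All function names and the simulated data are hypothetical.

```python
import numpy as np

def objective(par, X, u, delta, P):
    """Negative penalized log-likelihood with Lambda_0(u) = u (assumed baseline)."""
    eta = X @ par
    return np.sum(-delta * eta + u * np.exp(eta)) + 0.5 * par @ P @ par

def fit_inner(x, Z, u, delta, theta, beta, b, n_iter=50, tol=1e-10):
    """Step (2): Newton-Raphson with step halving for (beta, b), theta held fixed."""
    q = Z.shape[1]
    X = np.column_stack([x, Z])                       # design for (beta, b)
    P = np.diag(np.r_[0.0, np.full(q, 1.0 / theta)])  # penalty acts on b only
    par = np.r_[beta, b]
    for _ in range(n_iter):
        mu = u * np.exp(X @ par)
        grad = X.T @ (mu - delta) + P @ par
        H = (X * mu[:, None]).T @ X + P
        step = np.linalg.solve(H, grad)
        t, f_old = 1.0, objective(par, X, u, delta, P)
        while objective(par - t * step, X, u, delta, P) > f_old + 1e-12 and t > 1e-8:
            t *= 0.5                                  # halve until no increase
        par = par - t * step
        if np.max(np.abs(t * step)) < tol:
            break
    mu = u * np.exp(X @ par)
    H = (X * mu[:, None]).T @ X + P
    return par[0], par[1:], H

def fit_outer(x, Z, u, delta, theta=1.0, n_outer=200, tol=1e-8):
    """Steps (2)-(4): alternate the inner fit with a variance-component update."""
    q = Z.shape[1]
    beta, b = 0.0, np.zeros(q)
    for _ in range(n_outer):
        beta, b, H = fit_inner(x, Z, u, delta, theta, beta, b)
        H_inv_bb = np.linalg.inv(H)[1:, 1:]
        theta_new = (b @ b + np.trace(H_inv_bb)) / q  # step (3), REML-type stand-in
        converged = abs(theta_new - theta) < tol
        theta = theta_new
        if converged:
            break
    return beta, b, theta

# Simulated clustered data (hypothetical): 20 clusters of size 10.
rng = np.random.default_rng(1)
q, m = 20, 10
cluster = np.repeat(np.arange(q), m)
Z = np.eye(q)[cluster]
x = rng.normal(size=q * m)
b_true = rng.normal(scale=np.sqrt(0.5), size=q)
t_event = rng.exponential(1.0 / np.exp(0.5 * x + b_true[cluster]))
t_cens = rng.exponential(2.0, size=q * m)
u = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)

beta_hat, b_hat, theta_hat = fit_outer(x, Z, u, delta)
```

In the augmented working model of the paper, the spline coefficients aj would enter the inner step as additional penalized random effects, and each smoothing parameter would be updated in the outer step in the same way as a variance component.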

References

  1. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J. Am. Statist. Assoc. 1993;88:9–25.
  2. Cai J, Fan J, Zhou H, Zhou Y. Marginal hazard models with varying coefficient for multivariate failure time data. Ann. Statist. 2007;35:324–354.
  3. Cai J, Fan J, Jiang J, Zhou H. Partially linear hazard regression model with varying coefficient for multivariate survival data. J. R. Statist. Soc. B. 2008;70:141–158.
  4. Cates W Jr, Wasserheit JN. Genital chlamydial infections: epidemiology and reproductive sequelae. Am J Obstet Gynecol. 1991;164:1771–1781. doi: 10.1016/0002-9378(91)90559-a.
  5. CDC. Trends in reportable sexually transmitted diseases in the United States, 2006. US Dept of Health and Human Services, Centers for Disease Control and Prevention; Atlanta, GA: 2007. p. 12.
  6. Du P, Ma S. Frailty model with spline estimated nonparametric hazard function. Statistica Sinica. 2010;20:561–580.
  7. Duchateau L, Janssen P. Penalized partial likelihood for frailties and smoothing splines in time to first insemination models for dairy cows. Biometrics. 2004;60:608–614. doi: 10.1111/j.0006-341X.2004.00209.x.
  8. Duchateau L, Janssen P. The frailty model. Springer; 2008.
  9. Ethier KA, Orr DP. Behavioral interventions for prevention and control of STDs among adolescents. In: Aral SO, Douglas JM, Lipshutz JA, editors. Behavioral Interventions for Prevention and Control of Sexually Transmitted Diseases. Springer US; New York, NY: 2007.
  10. Fan J, Gijbels I, King M. Local likelihood and local partial likelihood in hazard regression. Ann. Statist. 1997;25:1661–1690.
  11. Fleming DT, Wasserheit JN. From epidemiological synergy to public health policy and practice: the contribution of other sexually transmitted diseases to sexual transmission of HIV infection. Sex Transm. Infect. 1999;75:3–17. doi: 10.1136/sti.75.1.3.
  12. Gray RJ. Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. J. Am. Statist. Assoc. 1992;87:942–951.
  13. Gray RJ. Spline-based tests in survival analysis. Biometrics. 1994;50:640–652.
  14. Green PJ. Penalized likelihood for general semi-parametric regression models. Intern. Statist. Rev. 1987;55:245–260.
  15. Green PJ, Silverman BW. Nonparametric regression and generalized linear models. Chapman and Hall; London: 1994.
  16. Hastie T, Tibshirani R. Exploring the nature of covariate effects in the proportional hazards model. Biometrics. 1990;46:1005–1016.
  17. Lin X, Zhang D. Inference in generalized additive mixed model by using smoothing splines. J. R. Statist. Soc. B. 1999;61:381–400.
  18. Lin X, Breslow NE. Bias correction in generalized linear mixed models with multiple components of dispersion. J. Am. Statist. Assoc. 1996;91:1007–1016.
  19. McGilchrist CA, Aisbett CW. Regression with frailty in survival analysis. Biometrics. 1991;47:461–466.
  20. Meyers D, Wolff T, Gregory K, Marion L, Moyer V, Nelson H, Petitti D, Sawaya GF. US Preventive Service Task Force (USPSTF) recommendations for STI screening. Am Fam Physician. 2008;77:819–824.
  21. Murphy S. Asymptotic theory for the frailty model. Ann. Statist. 1995;23:182–198.
  22. O’Sullivan F. Nonparametric estimation of relative risk using splines and cross-validation. SIAM J. Sci. Stat. Comput. 1988;9:531–542.
  23. Paavonen J, Westrom L, Eschenbach D. Pelvic inflammatory disease. In: Holmes KK, Sparling PF, Stamm WE, Piot P, et al., editors. Sexually Transmitted Diseases. 4th ed. McGraw-Hill; New York, NY: 2008. pp. 1017–1050.
  24. Parner E. Asymptotic theory for the correlated gamma-frailty model. Ann. Statist. 1998;26:183–214.
  25. Ripatti S, Palmgren J. Estimation of multivariate frailty models using penalized partial likelihood. Biometrics. 2000;56:1016–1022. doi: 10.1111/j.0006-341x.2000.01016.x.
  26. Therneau TM, Grambsch PM, Pankratz VS. Penalized survival models and frailty. J. Comp. Graph. Statist. 2003;12:156–175.
  27. Tibshirani R, Hastie T. Local likelihood estimation. J. Am. Statist. Assoc. 1987;82:559–567.
  28. Tu W, Batteiger BE, Wiehe S, Ofner S, Van Der Pol B, Katz BP, Orr DP, Fortenberry JD. Time from first intercourse to first sexually transmitted infection diagnoses among adolescent women. Archives of Pediatrics and Adolescent Medicine. 2009;163(12):1106–1111. doi: 10.1001/archpediatrics.2009.203.
  29. Weinstock H, Berman S, Cates W Jr. Sexually transmitted diseases among American youth: Incidence and prevalence estimates, 2000. Perspectives on Sexual and Reproductive Health. 2004;36:6–10. doi: 10.1363/psrh.36.6.04.
  30. US Preventive Services Task Force. Screening for chlamydial infection: US Preventive Services Task Force recommendation statement. Ann Intern Med. 2007;147(2):128–134. doi: 10.7326/0003-4819-147-2-200707170-00172.
  31. Yu Z, Lin X. Nonparametric regression using local kernel estimating equations for correlated failure time data. Biometrika. 2008;95:123–137.
