Author manuscript; available in PMC: 2026 Apr 5.
Published in final edited form as: Stat Med. 2021 Feb 15;40(10):2400–2412. doi: 10.1002/sim.8910

Marginal analysis of current status data with informative cluster size using a class of semiparametric transformation cure models

Kwok Fai Lam 1,2, Chun Yin Lee 3, Kin Yau Wong 3, Dipankar Bandyopadhyay 4
PMCID: PMC13050021  NIHMSID: NIHMS2155815  PMID: 33586218

Abstract

This research is motivated by a periodontal disease dataset that possesses certain special features. The dataset consists of clustered current status time-to-event observations with large and varying cluster sizes, where the cluster size is associated with the disease outcome. Also, heavy censoring is present in the data even with long follow-up time, suggesting the presence of a cured subpopulation. In this paper, we propose a computationally efficient marginal approach, namely the cluster-weighted generalized estimating equation approach, to analyze the data based on a class of semiparametric transformation cure models. The parametric and nonparametric components of the model are estimated using a Bernstein-polynomial based sieve maximum pseudo-likelihood approach. The asymptotic properties of the proposed estimators are studied. Simulation studies are conducted to evaluate the performance of the proposed estimators in scenarios with different degrees of informative clustering and within-cluster dependence. The proposed method is applied to the motivating periodontal disease data for illustration.

Keywords: cure model, current status data, estimating equations, informative cluster size, survival analysis

1 |. INTRODUCTION

This research is motivated by a cross-sectional periodontal disease (PD) study of Gullah-speaking African-American diabetics, or the GAAD study,1 where the periodontal health condition of each study participant was examined once by recording the tooth site-level surrogate clinical attachment level, or CAL.2 A CAL ≥ 3 mm is regarded as moderate to severe incidence of PD, although the (latent) time of incidence of the PD status, both at the site-level, and tooth-level (when averaging CAL values corresponding to the sites of a tooth) remains unknown. The central objective of this study is to identify the risk factors associated with the latent time to PD incidence at the tooth-level.

The GAAD dataset has several interesting characteristics, which cannot be handled easily using traditional methods. First, the response variable of interest, that is, the age of PD incidence $T$ for each tooth, cannot be observed directly, and is subject to case-1 interval censoring,3 with either $\{T \le Y\}$ or $\{T > Y\}$ being observed, where $Y$ denotes the age at inspection. This type of data is commonly called current status data.4 Second, the disease outcomes of the teeth within a subject are clustered in nature, and are highly correlated as they share the same oral health condition. Third, the cluster size $N$, which is the number of (available) teeth of a subject at examination, varies substantially from 3 to 28. It is postulated that $N$ is associated with the disease outcome, since PD is a leading cause of tooth loss. To motivate this further, we group the data into three strata, namely (a) $N \in [1,10)$; (b) $N \in [10,20)$; and (c) $N \in [20,\infty)$, and plot the nonparametric Turnbull estimates of the survival curves for current status data in Figure 1, ignoring the effect of clustering. It is evident that the survival probabilities increase with cluster size, which points to the informative cluster size (ICS) paradigm.5 Fourth, the survival curves exhibit a plateau behavior at high probability values in their right tails, suggesting the existence of teeth that are non-susceptible to PD.

FIGURE 1. Turnbull estimates of the survival functions for the periodontal disease data, stratified by cluster size

In standard survival analysis, all subjects are assumed to experience the event of interest eventually, such that a non-susceptible status can never be observed in practice. However, in studies of breast cancer6,7 and relapse-free survival of melanoma patients,8,9 this assumption can be violated, where not all individuals in the population are susceptible to the event of interest (eg, death due to the specific cancer), leading to longer-term follow-up and high censoring rates. Here, it may be more appropriate to postulate the existence of long-term survivors, or a cure fraction in the population. The mixture cure (MC) model introduced by Berkson and Gage10 is presumably the most popular cure-rate model. Here, a subject in the population is simply classified as either cured or noncured using a binary random variable $U$, and a logistic link function is often used to model the association between the cure probability and a set of covariates. In these models, the failure time distribution of the uncured subjects may also be modelled by another set of covariates. Based on right-censored data, various estimation methods have been proposed6,7,11–13 for the MC model. For current status data, Lam and Xue14 studied a semiparametric MC regression model. Despite its popularity, the MC setup has several limitations, including absence of the proportional hazards (PH) structure, and poor interpretation for describing the underlying biological process,8 at least in cancer studies. Moreover, this type of mixture model cannot be extended naturally to accommodate clustered data. In contrast, the promotion time cure (PTC) model15,16 overcomes the above drawbacks. An appealing feature of this model is that both the cure probability and failure time distribution can be modelled by just one set of covariates, retaining the PH assumption. Later, Zeng et al17 proposed a class of transformation cure models by extending the PTC model.

When analyzing clustered interval-censored data, the dependence of the within-cluster observations, generally accommodated by a frailty- or copula-induced model, cannot be ignored; otherwise, loss of estimation efficiency is inevitable. There are two commonly used estimation approaches in the literature. The first one is the so-called direct inference, where the regression and dependence parameters are estimated simultaneously in the joint likelihood. In that vein, Zhou et al18 studied the semiparametric transformation models for bivariate interval-censored failure time data based on the gamma frailty. Sun and Ding19 considered the semi-parametric transformation models for bivariate general interval-censored survival data based on the two-parameter Archimedean copula model. Lee et al20 studied a class of semi-parametric partly linear frailty transformation models for clustered interval-censored data. All the aforementioned papers adopted the sieve maximum likelihood estimation (sieve-MLE) approach where the estimators are shown to be consistent, asymptotically normal and efficient, but their methods do not accommodate data with a cure fraction. Moreover, they only focused on the bivariate case or small cluster size settings as the direct inference procedures are computationally demanding, or even infeasible when the number of event types or the cluster sizes are large. Analysis of clustered interval-censored data with a cure fraction based on a semi-parametric frailty-Cox model was considered in Lam and Wong.21 Therein, the estimation method is able to accommodate data with moderate cluster sizes, but would be quite computationally demanding when the cluster sizes are moderately large (say ≥10). The other approach is the marginal approach, which only focuses on the estimation of the regression parameters to determine whether a covariate has a significant effect on the responses. 
Here, similar to generalized estimating equations (GEE), the parameter estimation is performed via the maximization of a pseudo-likelihood function under a working independence assumption, leading to a computationally efficient approach in handling clustered data, even with large cluster sizes. An appealing property of the GEE-type direction is that the estimator of the regression parameter is consistent when the marginal survival model is correctly specified (in the absence of ICS), irrespective of the underlying dependence structure within a cluster. Based on the GEE and Clayton’s copula, Kor et al22 and Niu and Peng23 studied Cox-type models for interval-censored data, and cure-rate right censored data, respectively. Furthermore, the cluster-weighted GEE approach24 can be extended to deal with scenarios of ICS with the survival outcome. Under right-censoring, cluster-weighted Weibull and Cox PH models were proposed,5,25 that are shown to be equivalent to the within-cluster resampling (WCR) approach.26 For interval-censored data with ICS, comparison of the estimating equation and WCR approaches under a Cox setup27,28 was conducted, and a linear transformation approach (including Cox PH model and a proportional odds model) was proposed.29 However, these models do not accommodate a cure fraction. In this paper, we propose a marginal transformation PTC model17 powered by Bernstein polynomial (BP)30 based sieve maximum pseudo-likelihood (PL) approach to derive inference in clustered current status data with ICS and a cure fraction.

The rest of the paper is structured as follows. In Section 2, we present a class of semi-parametric transformation cure models for the marginal distribution of the failure times. The estimation and inference procedures are also discussed. In Section 3, we establish the theoretical properties of the proposed estimators. In Section 4, we study the finite-sample performance of the proposed estimator under a variety of synthetic data-generating scenarios. Application of the proposed model and method to the GAAD dataset is demonstrated in Section 5. Finally, some concluding remarks are provided in Section 6. Proofs of the theoretical results from Section 3 are relegated to Appendix A1.

2 |. MODEL SPECIFICATION AND METHODS

We consider a random sample of $n$ clusters (subjects), consisting of cluster subunits (teeth). Let $N_i$ denote the (random) size of the $i$th cluster, $i=1,\ldots,n$, $j=1,\ldots,N_i$, and let $T_{ij}$ denote the event time of interest, say time to achieving moderate to severe PD (at the tooth-level). Also, let $X_{ij}=(1,X_{ij1},\ldots,X_{ijp})^T$ denote the vector of covariates for the $j$th unit in the $i$th cluster. We leave the association structure of $(T_{i1},\ldots,T_{iN_i},N_i)$ unspecified. Conditional on $X_{ij}$, the marginal survival function of $T_{ij}$ has the form

$$S(t\mid X_{ij})=G\{F(t)\exp(\beta^T X_{ij});\rho\}, \tag{1}$$

where $\beta=(\beta_0,\beta_1,\ldots,\beta_p)^T$ is a vector of unknown regression parameters, $F$ is an unspecified distribution function, and $G(\cdot;\rho)$ is a prespecified transformation function indexed by a parameter $\rho$. Note that the cure proportion is $S(\infty\mid X_{ij})=G\{\exp(\beta^T X_{ij});\rho\}$, which is generally nonzero. In this paper, we consider the class of Box–Cox transformations for $G$, where

$$G(x;\rho)=\begin{cases}\exp[-\rho^{-1}\{(1+x)^{\rho}-1\}], & \rho>0;\\ (1+x)^{-1}, & \rho=0.\end{cases}$$

This model generalizes the PTC model8 by the introduction of a transformation function. In the PTC model, each event time of interest $T_{ij}$ is thought of as the minimum of $U$ independent latent event times, $\tilde T_1,\ldots,\tilde T_U$, where $U$ is a Poisson random variable with mean $\exp(\beta^T X_{ij})$, and each $\tilde T_k$ $(k=1,\ldots,U)$ follows the cumulative distribution function $F$; $T_{ij}$ is set to be $\infty$ if $U=0$, and that observation is considered cured. With the transformation, the latent variable $U$ is allowed to follow a large class of distributions. In particular, with $\rho=1$, $U$ follows the Poisson distribution with mean $\exp(\beta^T X_{ij})$, and with $\rho=0$, $U$ follows the geometric distribution with mean $1+\exp(-\beta^T X_{ij})$; these two choices of $\rho$ result in the PH model and the proportional odds (PO) model, respectively.
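The transformation class and the implied cure proportion can be sketched numerically. The snippet below is a minimal illustration rather than the authors' implementation; the function names `G`, `marginal_survival`, and `cure_probability` are hypothetical, and the value of F(t) is passed in directly instead of being estimated.

```python
import math


def G(x, rho):
    """Box-Cox type transformation G(x; rho) for x >= 0 and rho >= 0."""
    if rho < 0:
        raise ValueError("rho must be nonnegative")
    if rho == 0:
        # Proportional odds (PO) case.
        return 1.0 / (1.0 + x)
    # rho = 1 gives exp(-x), the proportional hazards (PH / PTC) case.
    return math.exp(-((1.0 + x) ** rho - 1.0) / rho)


def marginal_survival(F_t, lin_pred, rho):
    """Model (1): S(t | X) = G(F(t) * exp(beta^T X); rho), with F_t = F(t)."""
    return G(F_t * math.exp(lin_pred), rho)


def cure_probability(lin_pred, rho):
    """Value of the marginal survival function once F(t) has reached 1."""
    return marginal_survival(1.0, lin_pred, rho)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, cure_probability(0.5, rho))
```

As rho decreases towards 0, the rho > 0 branch approaches the PO form, so the two branches agree in the limit.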

Suppose that the event times are not exactly observed, but are subject to case-1 interval censoring. Let $Y_{ij}$ be the time of inspection, and $\Delta_{ij}\equiv I(T_{ij}\le Y_{ij})$ be the event indicator (of PD incidence), for the $j$th tooth in the $i$th cluster. The inspection times, calculated as the difference between the subject’s clinic visit time and the time of (adult) tooth eruption, may vary with the tooth-types. We assume that the event times and the censoring/inspection times are independent, conditional on the cluster size and covariates. The observed data consist of $\mathcal{O}_i=\{N_i,(Y_{ij},\Delta_{ij},X_{ij})_{j=1,\ldots,N_i}\}$ for $i=1,\ldots,n$. We propose a marginal approach for estimating the unknown parameters $\theta=(\beta,F)$ in model (1). Analogous to Williamson et al,25 we adopt a working independence-within-cluster assumption and maximize the following PL function

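$$L_n(\theta\mid\mathcal{O})=\prod_{i=1}^{n}\prod_{j=1}^{N_i}L_{ij}(\theta\mid\mathcal{O}_{ij})^{w_{ij}}=\prod_{i=1}^{n}\prod_{j=1}^{N_i}\left[\{1-S(Y_{ij}\mid X_{ij})\}^{\Delta_{ij}}S(Y_{ij}\mid X_{ij})^{1-\Delta_{ij}}\right]^{w_{ij}}, \tag{2}$$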

where Lij denotes the likelihood contribution of the j th observation from the i th cluster, and wij is the weighting of the corresponding likelihood contribution. The specification of wij’s is given in Section 2.2.
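To make the structure of the PL function (2) concrete, the following is a minimal sketch of its logarithm, assuming the marginal survival function is supplied as a callable. The name `log_pseudo_likelihood` and the data layout (a list of clusters, each a list of `(y, delta, x)` tuples) are illustrative assumptions; the two weighting options correspond to the unit weights and the inverse-cluster-size weights described in Section 2.2.

```python
import math


def log_pseudo_likelihood(clusters, survival, cluster_weighted=True):
    """Log of the PL function (2) under working independence.

    clusters: list of clusters; each cluster is a list of (y, delta, x) tuples,
              where delta = 1 if the event occurred by inspection time y.
    survival: callable (y, x) -> S(y | x), assumed to lie strictly in (0, 1).
    cluster_weighted: if True use w_ij = 1 / N_i, otherwise w_ij = 1.
    """
    total = 0.0
    for cluster in clusters:
        weight = 1.0 / len(cluster) if cluster_weighted else 1.0
        for y, delta, x in cluster:
            s = survival(y, x)
            total += weight * (
                delta * math.log(1.0 - s) + (1 - delta) * math.log(s)
            )
    return total
```

With inverse-cluster-size weights every cluster contributes a total weight of one, whereas with unit weights a cluster of size N_i contributes N_i terms.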

2.1 |. Sieve maximum pseudo likelihood estimation

Maximization of the PL function (2) is not straightforward, due to the presence of the nonparametric function $F$. Although the distribution function $F$ has infinite dimension, one can still estimate the function using monotone step functions, or polynomial splines.31,32 This essentially reduces the dimensionality in estimating $F$ so that the PL function in (2) can be evaluated numerically. In this paper, we adopt a BP to approximate the distribution function $F$. The BP approach enjoys several merits from the perspective of implementation; it often requires only a few parameters for a decent approximation, and is free from prespecification of the interior knots as opposed to, for example, B-splines,33 thereby rendering significant computational efficiency. Also, we can show that the optimization based on the BP can be easily reduced to an unconstrained nonlinear problem via reparameterization.

Following Lam et al,34 we impose a zero-tail constraint $F(t)=1$ for $t\ge\tau$, where $\tau$ is called a cure threshold. In practice, one can set the cure threshold to be $\max_{i,j}\Delta_{ij}Y_{ij}$, given that the time length of the study is sufficiently long. Let $\mathcal{A}\subset\mathbb{R}^{p+1}$ be a bounded parameter space of the regression parameters $\beta$, and let $\mathcal{F}$ be the collection of all nonnegative, nondecreasing functions over the interval $[0,\tau]$ that are bounded above by 1. Then, the BP-based sieve for the approximation of $F$ can be defined as

$$\mathcal{F}_n=\{K_m(t;\psi):0\le\psi_0\le\psi_1\le\cdots\le\psi_m \text{ and } K_m(\tau;\psi)=1\},$$

where

$$K_m(t;\psi)=\sum_{j=0}^{m}\psi_j\binom{m}{j}\left(\frac{t}{\tau}\right)^{j}\left(\frac{\tau-t}{\tau}\right)^{m-j},$$

is called a BP of degree $m$, and $\psi$ is an $(m+1)$-dimensional vector of coefficients of the basis polynomials. We choose $m$ to be an integer which grows at a rate of $O(n^{\nu})$ for $0<\nu<1$, and the function could approximate35 arbitrarily closely any smooth true function as $n\to\infty$. Moreover, the monotonicity constraints on $\psi$ ensure that the estimated function is nondecreasing, and $K_m(\cdot;\psi)$ is a differentiable function on $[0,\tau]$ that starts from the origin.

Therefore, the parameter estimates for $\theta$ can be obtained by maximizing the PL function $L_n(\theta)$ over the sieve space $\mathcal{A}\times\mathcal{F}_n$. Monotonicity of $F$ and $F(\tau)=1$ can be easily imposed by the reparameterization $\psi_q=\sum_{i=0}^{q}e^{\phi_i}/\sum_{j=0}^{m}e^{\phi_j}$, where $\phi_j\in(-\infty,\infty)$ for $j=0,\ldots,m$ and $q=0,\ldots,m$. Hence, the optimization can be done via typical unconstrained methods such as the Newton-Raphson or Nelder-Mead simplex algorithm. In the implementation of the proposed method, the quasi-Newton method of Broyden-Fletcher-Goldfarb-Shanno36 algorithm is adopted for solving this unconstrained nonlinear optimization problem, which is readily available in standard statistical software, such as R.
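The reparameterization and the Bernstein polynomial itself are simple to evaluate. The sketch below is an illustration under stated assumptions: the names `psi_from_phi` and `bernstein` are hypothetical, only the mapping from unconstrained values to coefficients and the polynomial evaluation are shown, and the optimizer is left out.

```python
import math


def psi_from_phi(phi):
    """Map unconstrained phi_0..phi_m to coefficients psi_0 <= ... <= psi_m = 1.

    psi_q = sum_{i<=q} exp(phi_i) / sum_{j<=m} exp(phi_j).
    """
    shift = max(phi)  # subtracting the maximum avoids overflow in exp
    cumulative, running = [], 0.0
    for value in phi:
        running += math.exp(value - shift)
        cumulative.append(running)
    return [c / cumulative[-1] for c in cumulative]


def bernstein(t, psi, tau):
    """Evaluate K_m(t; psi) on [0, tau], where m = len(psi) - 1."""
    m = len(psi) - 1
    u = t / tau
    return sum(
        psi[j] * math.comb(m, j) * u ** j * (1.0 - u) ** (m - j)
        for j in range(m + 1)
    )


if __name__ == "__main__":
    psi = psi_from_phi([0.3, -1.2, 2.0, 0.0])
    print([round(bernstein(t, psi, 4.0), 4) for t in (0.0, 1.0, 2.0, 3.0, 4.0)])
```

Because every exponential is positive, the cumulative sums are increasing and the last coefficient equals one, so any real-valued input vector produces a nondecreasing polynomial that equals 1 at the cure threshold.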

2.2 |. Statistical inference for β

For a given $m$, the sieve maximum PL estimator $\hat\theta_n\equiv(\hat\beta_n,\hat F_n)$ can be obtained by solving the weighted score equation

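$$\sum_{i=1}^{n}U_i(\theta\mid\mathcal{O}_i)\equiv\sum_{i=1}^{n}\sum_{j=1}^{N_i}w_{ij}\frac{\partial\log L_{ij}(\theta\mid\mathcal{O}_{ij})}{\partial\theta}=0.$$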

This framework reduces to a traditional unit-based analysis under $w_{ij}=1$ $(i=1,\ldots,n;\ j=1,\ldots,N_i)$ and is referred to as the GEE method in subsequent sections. To adjust for ICS, we adopt $w_{ij}=N_i^{-1}$ for $j=1,\ldots,N_i$, and the resulting method is referred to as the cluster-weighted GEE (CWGEE) method. It is well-known that the inference of $\beta$ cannot be simply performed based on the usual Fisher information matrix because the cluster-specific heterogeneity would be completely ignored. Instead, we propose to use a robust sandwich estimator37 given by

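$$\Sigma_n(\hat\theta_n\mid\mathcal{O})=H_n(\hat\theta_n\mid\mathcal{O})^{-1}\left\{\sum_{i=1}^{n}U_i(\hat\theta_n\mid\mathcal{O}_i)U_i(\hat\theta_n\mid\mathcal{O}_i)^T\right\}H_n(\hat\theta_n\mid\mathcal{O})^{-1}, \tag{3}$$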

where

$$H_n(\theta\mid\mathcal{O})=-\sum_{i=1}^{n}\frac{\partial U_i(\theta\mid\mathcal{O}_i)}{\partial\theta}.$$

The variance of β^n can be approximated by the corresponding elements of Σn, and the performance of this variance estimator and the corresponding robust confidence interval estimation will be evaluated in Section 4.
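The sandwich form in (3) can be assembled directly once the per-cluster scores and the matrix H_n are available. The following is a minimal sketch that assumes both have already been computed elsewhere (for example by numerical differentiation); the helper names `mat_inverse`, `mat_mul`, and `sandwich_covariance` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def mat_inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sandwich_covariance(cluster_scores, hessian):
    """Equation (3): H^{-1} (sum_i U_i U_i^T) H^{-1}.

    cluster_scores: list of per-cluster score vectors U_i at the estimate.
    hessian: the matrix H_n (minus the derivative of the summed score).
    """
    d = len(hessian)
    meat = [[0.0] * d for _ in range(d)]
    for u in cluster_scores:
        for r in range(d):
            for c in range(d):
                meat[r][c] += u[r] * u[c]
    h_inv = mat_inverse(hessian)
    return mat_mul(mat_mul(h_inv, meat), h_inv)
```

The outer products are summed over clusters rather than over individual teeth, which is what lets the estimator account for within-cluster dependence.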

3 |. THEORETICAL RESULTS

Let $\{N,(T_j,Y_j,\Delta_j,X_j)_{j=1,\ldots,N}\}$ denote the data for a generic cluster. Let $\beta^*$ and $F^*$ be the true parameter values. The conditions required for the derivation of the asymptotic properties of the proposed estimators are listed below, with some conditions involving a generic positive constant $M$.

(C1) Conditional on $N$, the distributions of $(Y_j,T_j,X_j)$ for $j=1,\ldots,N$ are identical.

(C2) The true parameter value $\beta^*$ belongs to the interior of the parameter space $\mathcal{A}$. The cure threshold $\tau$ satisfies $F^*(\tau)=F^*(\infty)=1$, and the function $F^*$ is strictly increasing, with bounded $r$th derivative on $(0,\tau)$ for some $r\ge1$. The transformation function $G$ is strictly decreasing, with bounded second derivative on $(0,\infty)$, and $G(0)=1$.

(C3) The support of $Y_1$ is $[\eta,\tau]$ for some $\eta\in(0,\tau)$, and $P(Y_1=\tau\mid X_1)$ is bounded away from 0 almost surely.

(C4) With probability 1, $\|X_1\|<M$. Also, the smallest eigenvalue of $E(X_1X_1^T\mid Y_1)$ is bounded away from 0 almost surely. In addition, there exists $\kappa\in(0,1)$, such that $\mathrm{Var}(a^T\tilde X_1\mid Y_1)\ge\kappa a^TE(\tilde X_1\tilde X_1^T\mid Y_1)a$ almost surely for all $a\in\mathbb{R}^p$, where $\tilde X_1$ consists of the last $p$ components of $X_1$.

(C5) The censoring time Y1 is continuous on (η,τ), and its density function is twice-differentiable on (η,τ) with bounded second derivative.

Condition (C1) requires that conditional on the cluster size, the survival and censoring time distributions are identical across observations of the same cluster. Condition (C2) requires that the cure threshold τ is correctly specified, such that observations surviving beyond τ must be cured. Condition (C3) requires that the follow-up is long enough to cover the cure threshold. This condition is necessary for the identification of the cure proportion. Condition (C4) is imposed for the identifiability of the model parameters. Generally, to satisfy condition (C4), one should avoid selecting covariates that are strongly correlated with each other or with the inspection time. Condition (C5) ensures that the least-favorable direction for F exists, and can be approximated by the BP functions at a fast enough rate.

Let

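$$\ell(\beta,F)=\frac{1}{N}\sum_{j=1}^{N}\left[\Delta_j\log\{1-G(F(Y_j)e^{X_j^T\beta})\}+(1-\Delta_j)\log G(F(Y_j)e^{X_j^T\beta})\right],$$

$$\dot\ell_\beta(\beta,F)=\frac{1}{N}\sum_{j=1}^{N}\left\{-\frac{\Delta_j}{1-G(F(Y_j)e^{X_j^T\beta})}+\frac{1-\Delta_j}{G(F(Y_j)e^{X_j^T\beta})}\right\}G'(F(Y_j)e^{X_j^T\beta})F(Y_j)e^{X_j^T\beta}X_j,$$

$$\dot\ell_F(\beta,F)[h]=\frac{1}{N}\sum_{j=1}^{N}\left\{-\frac{\Delta_j}{1-G(F(Y_j)e^{X_j^T\beta})}+\frac{1-\Delta_j}{G(F(Y_j)e^{X_j^T\beta})}\right\}G'(F(Y_j)e^{X_j^T\beta})e^{X_j^T\beta}h(Y_j).$$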

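Let $\ell^{(1)}(\beta,F)$, $\dot\ell_\beta^{(1)}(\beta,F)$, and $\dot\ell_F^{(1)}(\beta,F)[h]$ be the first terms of the corresponding summations above. For a $d$-dimensional vector of functions $h=(h_1,\ldots,h_d)^T$, we let $\dot\ell_F(\beta,F)[h]=(\dot\ell_F(\beta,F)[h_1],\ldots,\dot\ell_F(\beta,F)[h_d])^T$ and $\dot\ell_F^{(1)}(\beta,F)[h]=(\dot\ell_F^{(1)}(\beta,F)[h_1],\ldots,\dot\ell_F^{(1)}(\beta,F)[h_d])^T$. We can show that under conditions (C1)–(C4), the CWGEE method is consistent, such that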

$$\|\hat\beta_n-\beta^*\|+\left\{\int_\eta^\tau(\hat F_n-F^*)^2(t)\,dt\right\}^{1/2}\xrightarrow{\text{a.s.}}0.$$

Also, under the same conditions, the rate of convergence of the estimators is given by

$$\|\hat\beta_n-\beta^*\|+\left\{\int_\eta^\tau(\hat F_n-F^*)^2(t)\,dt\right\}^{1/2}=O_p\left(n^{-\min\{(1-\nu)/2,\,\nu r/2\}}\right),$$

where $r$ is given in condition (C2), and $\nu$ is such that $m=O(n^{\nu})$. If we further assume that $r\ge2$, $\nu\in(1/4,1/2)$, and condition (C5) holds, then

$$\sqrt{n}(\hat\beta_n-\beta^*)\xrightarrow{d}N(0,\tilde I),$$

where

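$$\tilde I=\left\{P(\dot\ell_\beta^{(1)}-\dot\ell_F^{(1)}[\tilde h])^{\otimes2}\right\}^{-1}\left\{P(\dot\ell_\beta-\dot\ell_F[\tilde h])^{\otimes2}\right\}\left\{P(\dot\ell_\beta^{(1)}-\dot\ell_F^{(1)}[\tilde h])^{\otimes2}\right\}^{-1},$$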

$P$ denotes the true probability measure, $a^{\otimes2}=aa^T$ for any vector $a$, the score statistics are evaluated at the true parameter values, and $\tilde h$ is the least-favorable direction for $F$ in the univariate model, such that $P\{\dot\ell_F^{(1)}[\tilde h]\dot\ell_F^{(1)}[h]\}=P\{\dot\ell_\beta^{(1)}\dot\ell_F^{(1)}[h]\}$ for all $h\in L_2(P)$.

The proof of estimation consistency is outlined in Appendix A1. The proofs of the rate of convergence and the asymptotic normality of the estimators follow the arguments in Zhang et al27 and Zhou et al18 and are omitted.
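As a short worked illustration of the convergence rate stated in this section (a sketch that only rearranges the exponent $\min\{(1-\nu)/2,\nu r/2\}$ and claims no additional result), the two terms inside the minimum can be balanced in $\nu$:

```latex
\frac{1-\nu}{2}=\frac{\nu r}{2}
\;\Longrightarrow\;
\nu=\frac{1}{1+r},
\qquad
n^{-\min\{(1-\nu)/2,\;\nu r/2\}}\Big|_{\nu=1/(1+r)}=n^{-r/\{2(1+r)\}}.
```

For example, with $r=2$ the balancing choice is $\nu=1/3$, which lies inside the interval $(1/4,1/2)$ required for the asymptotic normality result, and the corresponding rate is $n^{-1/3}$.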

4 |. SIMULATION STUDIES

We conduct a simulation study to assess the finite-sample performance of the parameter estimates from our proposed method. We generate two covariates, namely $X_1$ from a Bernoulli distribution with probability 0.5 and $X_2$ from a standard normal distribution. In the simulation, we let $X_{ij}=(1,X_{ij1},X_{ij2})^T$. To mimic a study where the cluster size can be large and informative, we considered the Gumbel copula model in order to generate survival times from each cluster. First, a cluster-specific latent variable $\xi_i$ is generated according to a positive stable (PS) distribution38 with parameter $\gamma$ having the Laplace transform $\Phi_\xi(s)=\exp(-s^{\gamma})$ for $i=1,2,\ldots,n$. The ICS is generated based on the sampled value of $\xi_i$. Let the maximum cluster size be $\tilde n$. Specifically, the cluster size $N_i$ is randomly drawn from a binomial distribution with parameters $(\tilde n, 0.75)$ if $\xi_i$ is less than the median of its respective PS distribution, and from a binomial distribution with parameters $(\tilde n, 0.25)$ otherwise. Clusters with sizes 0, 1 and $\tilde n$ are discarded. Samples would be regenerated until we have $n$ clusters. Given $N_i>1$, the correlated failure times for the $i$th cluster can be generated based on the Gumbel copula39 model with the joint survival function given by

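$$S(t_1,t_2,\ldots,t_{N_i}\mid X_{ij},N_i)=\exp\left[-\left\{\sum_{j=1}^{N_i}\left(-\log S(t_j\mid X_{ij})\right)^{1/\gamma}\right\}^{\gamma}\right],$$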

where $\gamma\in(0,1]$ characterizes the degree of association among the failure times of the uncured patients in each cluster, and $S(\cdot\mid X_{ij})$ is given by (1). The parameter $\gamma$ is set to be 0.2, 0.5, and 0.8 with respective Kendall’s tau of 0.8, 0.5, and 0.2. Note that any simulated observation with marginal survival rate $S(t_{ij}\mid X_{ij})<G\{\exp(\beta^TX_{ij});\rho\}$ would be regarded as cured. Also, a random monitoring time $Y_{ij}$ is generated by $\min\{\mathrm{Uniform}(0,8),4\}$, such that the observation is either left- or right-censored. In the above setup, a small value of cluster-specific $\xi_i$ is associated with a large cluster size, long survival times, and a high chance of being cured. Such a setting is chosen to mimic the GAAD dataset as an illustration, where the patients with more sampled teeth generally have better oral health, and are less prone to the event.
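The data-generating steps above can be sketched for a single cluster as follows. This is an illustration under several assumptions rather than the authors' code: the positive stable draw uses Kanter's representation, the copula draw uses the standard frailty (Marshall-Olkin) construction, the median of the PS distribution is approximated by Monte Carlo, and $F$ is the truncated exponential form used in the simulation settings. All function names are hypothetical.

```python
import math
import random


def G(x, rho):
    """Transformation G(x; rho); rho = 0 is the proportional odds case."""
    if rho == 0:
        return 1.0 / (1.0 + x)
    return math.exp(-((1.0 + x) ** rho - 1.0) / rho)


def G_inverse(u, rho):
    """Solve G(x; rho) = u for x, with 0 < u <= 1."""
    if rho == 0:
        return 1.0 / u - 1.0
    return (1.0 - rho * math.log(u)) ** (1.0 / rho) - 1.0


def positive_stable(gamma, rng):
    """Draw xi with Laplace transform exp(-s**gamma), 0 < gamma <= 1."""
    if gamma == 1.0:
        return 1.0
    theta = rng.uniform(0.0, math.pi)
    w = rng.expovariate(1.0)
    a = math.sin(gamma * theta) / math.sin(theta) ** (1.0 / gamma)
    b = (math.sin((1.0 - gamma) * theta) / w) ** ((1.0 - gamma) / gamma)
    return a * b


def estimate_median(gamma, rng, draws=20001):
    """Monte Carlo approximation to the median of the PS distribution."""
    sample = sorted(positive_stable(gamma, rng) for _ in range(draws))
    return sample[draws // 2]


def simulate_cluster(beta, rho, gamma, n_max, xi_median, tau, rng):
    """Return a list of (y, delta, x) current status records for one cluster."""
    while True:  # redraw until the cluster size is not 0, 1 or n_max
        xi = positive_stable(gamma, rng)
        p = 0.75 if xi < xi_median else 0.25
        size = sum(rng.random() < p for _ in range(n_max))
        if 1 < size < n_max:
            break
    cluster = []
    for _ in range(size):
        x = (1.0, float(rng.random() < 0.5), rng.gauss(0.0, 1.0))
        risk = math.exp(sum(b * v for b, v in zip(beta, x)))
        # Marginal survival probability at the event time, shared frailty xi.
        u = math.exp(-((rng.expovariate(1.0) / xi) ** gamma))
        if u < G(risk, rho):
            t = math.inf  # below the cure proportion: treated as cured
        else:
            f_t = min(G_inverse(u, rho) / risk, 1.0)
            t = -math.log(1.0 - f_t * (1.0 - math.exp(-tau)))
        y = min(rng.uniform(0.0, 8.0), 4.0)
        cluster.append((y, int(t <= y), x))
    return cluster


if __name__ == "__main__":
    rng = random.Random(1)
    med = estimate_median(0.5, rng)
    print(len(simulate_cluster((0.5, -1.0, 1.0), 1.0, 0.5, 40, med, 4.0, rng)))
```

A small frailty draw pushes the marginal survival probabilities towards zero, which corresponds to long event times and a higher chance of falling below the cure proportion, in line with the description above.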

We consider $(n,\tilde n)=(100,40)$, $(200,20)$, and $(500,100)$ and set the parameter vector $(\beta_0,\beta_1,\beta_2)$ to be (0.5,−1,1), (−0.5, 1,−1), and (−0.5, 1,−1), respectively. The distribution function $F(t)$ is set to be $\{1-\exp(-\min(t,\tau))\}/\{1-\exp(-\tau)\}$ with the cure threshold $\tau=4$. We set the transformation parameter $\rho=0,0.5,1$. In the simulated samples, the cure probability in the population fluctuates around 50%, and the right-censoring rate fluctuates around 60%. The data are analyzed by maximizing $L_n(\theta)$ over the sieve space based on the weighted and nonweighted models, that is, $w_{ij}=N_i^{-1}$ and $w_{ij}=1$ $(i=1,\ldots,n;\ j=1,\ldots,N_i)$, respectively. The degree of the BP is set to be $m=3$ for both the weighted and non-weighted methods. For each scenario, we consider 1000 replicates.

Tables 1 and 2 summarize the results based on $(n,\tilde n)=(100,40)$ and $(200,20)$, and Table S1 summarizes the results based on $(n,\tilde n)=(500,100)$. The bias, empirical SD (ESD), and average estimated SE (ESE) of the estimated coefficients are reported. The empirical coverage (EC) is computed for the 95% confidence intervals of the regression parameters, constructed based on the asymptotic normality of the estimators. In all settings with large or small cluster sizes, the ESE matches the ESD closely, which confirms that the robust variance-covariance matrix in (3) is valid for both the weighted and non-weighted methods. Also, in the case with ICS, the estimator based on the GEE approach yields large bias in most cases, while the empirical bias based on CWGEE is small and negligible in all cases. This pattern is consistent throughout the change in the number of clusters $n$, and the change in parameter values over the scenarios. In particular, one can see that the intercept $\beta_0$ is severely underestimated by the GEE approach, resulting in an overestimation of the cure proportion. This phenomenon can be explained by the fact that we generate the data with positive association between cluster size and survival time, and that the GEE implicitly assigns larger weights to larger clusters. In addition, the dependence parameter $\gamma$ in the copula model may actually affect the resulting performance of the GEE approach, but not the CWGEE. As expected, noticeable improvement in the estimation can be seen from the GEE when the within-cluster dependence gets weaker (ie, a large value of $\gamma$). Moreover, due to the bias of the point estimator, the empirical coverage of the GEE-based confidence intervals generally falls well below the nominal level, whereas the CWGEE estimator provides an empirical coverage probability that closely resembles the nominal level of 95%.

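TABLE 1. Simulation results corresponding to the estimated regression parameters for n=100 and ñ=40 (maximum cluster size), with varying γ and ρ

| n | γ | ρ | Regression parameter | CWGEE Bias | CWGEE ESD | CWGEE ESE | CWGEE EC | GEE Bias | GEE ESD | GEE ESE | GEE EC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.2 | 0 | β0 | 0.040 | 0.192 | 0.194 | 0.95 | −0.711 | 0.180 | 0.179 | 0.04 |
| | | | β1 | −0.024 | 0.153 | 0.152 | 0.95 | −0.142 | 0.152 | 0.152 | 0.86 |
| | | | β2 | 0.028 | 0.114 | 0.111 | 0.94 | 0.145 | 0.122 | 0.120 | 0.80 |
| | | 0.5 | β0 | 0.029 | 0.137 | 0.144 | 0.95 | −0.484 | 0.143 | 0.145 | 0.09 |
| | | | β1 | −0.018 | 0.125 | 0.124 | 0.95 | −0.257 | 0.137 | 0.137 | 0.56 |
| | | | β2 | 0.018 | 0.095 | 0.095 | 0.96 | 0.253 | 0.112 | 0.114 | 0.40 |
| | | 1 | β0 | 0.035 | 0.106 | 0.113 | 0.96 | −0.315 | 0.116 | 0.120 | 0.25 |
| | | | β1 | −0.025 | 0.113 | 0.112 | 0.94 | −0.309 | 0.128 | 0.128 | 0.32 |
| | | | β2 | 0.025 | 0.091 | 0.093 | 0.94 | 0.304 | 0.109 | 0.111 | 0.19 |
| 100 | 0.5 | 0 | β0 | 0.032 | 0.163 | 0.169 | 0.95 | −0.580 | 0.143 | 0.142 | 0.03 |
| | | | β1 | −0.016 | 0.140 | 0.138 | 0.95 | −0.077 | 0.125 | 0.123 | 0.90 |
| | | | β2 | 0.019 | 0.095 | 0.093 | 0.93 | 0.079 | 0.086 | 0.086 | 0.85 |
| | | 0.5 | β0 | 0.027 | 0.117 | 0.126 | 0.95 | −0.400 | 0.110 | 0.114 | 0.06 |
| | | | β1 | −0.015 | 0.116 | 0.112 | 0.93 | −0.169 | 0.110 | 0.109 | 0.67 |
| | | | β2 | 0.018 | 0.085 | 0.080 | 0.94 | 0.170 | 0.087 | 0.081 | 0.47 |
| | | 1 | β0 | 0.029 | 0.090 | 0.103 | 0.94 | −0.271 | 0.091 | 0.093 | 0.18 |
| | | | β1 | −0.023 | 0.099 | 0.101 | 0.96 | −0.224 | 0.101 | 0.102 | 0.40 |
| | | | β2 | 0.023 | 0.080 | 0.079 | 0.94 | 0.220 | 0.082 | 0.081 | 0.23 |
| 100 | 0.8 | 0 | β0 | 0.037 | 0.123 | 0.135 | 0.95 | −0.314 | 0.100 | 0.099 | 0.13 |
| | | | β1 | −0.018 | 0.124 | 0.128 | 0.95 | −0.018 | 0.108 | 0.108 | 0.94 |
| | | | β2 | 0.014 | 0.078 | 0.078 | 0.96 | 0.021 | 0.065 | 0.066 | 0.94 |
| | | 0.5 | β0 | 0.033 | 0.088 | 0.100 | 0.95 | −0.222 | 0.079 | 0.078 | 0.19 |
| | | | β1 | −0.017 | 0.100 | 0.103 | 0.96 | −0.069 | 0.090 | 0.091 | 0.88 |
| | | | β2 | 0.016 | 0.066 | 0.067 | 0.95 | 0.070 | 0.058 | 0.058 | 0.77 |
| | | 1 | β0 | 0.032 | 0.071 | 0.078 | 0.94 | −0.158 | 0.066 | 0.065 | 0.32 |
| | | | β1 | −0.020 | 0.094 | 0.091 | 0.93 | −0.099 | 0.085 | 0.082 | 0.78 |
| | | | β2 | 0.016 | 0.064 | 0.064 | 0.94 | 0.096 | 0.056 | 0.056 | 0.60 |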

Abbreviations: EC, empirical coverage with 95% nominal level; ESD, empirical SD; ESE, estimated standard error.

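TABLE 2. Simulation results corresponding to the estimated regression parameters for n=200 and ñ=20 (maximum cluster size), with varying γ and ρ

| n | γ | ρ | Regression parameter | CWGEE Bias | CWGEE ESD | CWGEE ESE | CWGEE EC | GEE Bias | GEE ESD | GEE ESE | GEE EC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.2 | 0 | β0 | 0.004 | 0.158 | 0.161 | 0.95 | −0.852 | 0.165 | 0.163 | 0.01 |
| | | | β1 | 0.016 | 0.149 | 0.141 | 0.94 | 0.132 | 0.142 | 0.133 | 0.83 |
| | | | β2 | −0.019 | 0.094 | 0.094 | 0.95 | −0.135 | 0.096 | 0.097 | 0.74 |
| | | 0.5 | β0 | 0.003 | 0.128 | 0.132 | 0.95 | −0.733 | 0.139 | 0.139 | 0.01 |
| | | | β1 | 0.018 | 0.112 | 0.113 | 0.96 | 0.251 | 0.115 | 0.117 | 0.42 |
| | | | β2 | −0.022 | 0.079 | 0.080 | 0.95 | −0.250 | 0.089 | 0.090 | 0.20 |
| | | 1 | β0 | −0.002 | 0.108 | 0.110 | 0.94 | −0.625 | 0.120 | 0.121 | 0.00 |
| | | | β1 | 0.025 | 0.102 | 0.099 | 0.94 | 0.305 | 0.112 | 0.109 | 0.21 |
| | | | β2 | −0.024 | 0.073 | 0.076 | 0.96 | −0.302 | 0.088 | 0.089 | 0.06 |
| 200 | 0.5 | 0 | β0 | 0.001 | 0.144 | 0.148 | 0.95 | −0.659 | 0.131 | 0.132 | 0.01 |
| | | | β1 | 0.021 | 0.135 | 0.134 | 0.94 | 0.078 | 0.115 | 0.116 | 0.89 |
| | | | β2 | −0.014 | 0.086 | 0.085 | 0.94 | −0.074 | 0.078 | 0.076 | 0.84 |
| | | 0.5 | β0 | 0.007 | 0.116 | 0.121 | 0.95 | −0.565 | 0.113 | 0.114 | 0.00 |
| | | | β1 | 0.017 | 0.108 | 0.107 | 0.94 | 0.168 | 0.104 | 0.102 | 0.62 |
| | | | β2 | −0.016 | 0.072 | 0.071 | 0.94 | −0.166 | 0.071 | 0.070 | 0.36 |
| | | 1 | β0 | −0.003 | 0.096 | 0.101 | 0.96 | −0.494 | 0.098 | 0.100 | 0.00 |
| | | | β1 | 0.026 | 0.093 | 0.093 | 0.95 | 0.221 | 0.093 | 0.093 | 0.33 |
| | | | β2 | −0.020 | 0.067 | 0.068 | 0.95 | −0.215 | 0.068 | 0.069 | 0.12 |
| 200 | 0.8 | 0 | β0 | 0.016 | 0.114 | 0.128 | 0.96 | −0.335 | 0.091 | 0.095 | 0.06 |
| | | | β1 | 0.014 | 0.127 | 0.129 | 0.95 | 0.019 | 0.106 | 0.107 | 0.95 |
| | | | β2 | −0.019 | 0.076 | 0.077 | 0.95 | −0.024 | 0.063 | 0.064 | 0.93 |
| | | 0.5 | β0 | 0.018 | 0.094 | 0.101 | 0.96 | −0.289 | 0.082 | 0.083 | 0.06 |
| | | | β1 | 0.007 | 0.102 | 0.102 | 0.96 | 0.060 | 0.093 | 0.090 | 0.89 |
| | | | β2 | −0.011 | 0.063 | 0.064 | 0.96 | −0.066 | 0.056 | 0.056 | 0.79 |
| | | 1 | β0 | 0.006 | 0.084 | 0.084 | 0.94 | −0.258 | 0.074 | 0.073 | 0.06 |
| | | | β1 | 0.018 | 0.091 | 0.089 | 0.94 | 0.097 | 0.082 | 0.080 | 0.78 |
| | | | β2 | −0.019 | 0.059 | 0.061 | 0.93 | −0.098 | 0.052 | 0.054 | 0.56 |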

Abbreviations: EC, empirical coverage with 95% nominal level; ESD, empirical SD; ESE, estimated SE.

5 |. APPLICATION: GAAD DATA

In this section, we illustrate our approach via application to the GAAD dataset mentioned in Section 1. The word “Gullah” represents unique cultural and linguistic patterns of the African-Americans living on the sea islands of South Carolina.40 The GAAD study was primarily aimed to explore the relationship between PD and diabetes (determined by HbA1c, or “glycosylated hemoglobin”) in this population. The dataset has n=288 subjects, where 170 subjects have at least one tooth identified with PD incidence. Note, although the time of clinic visit for all available teeth in a subject are the same, the actual inspection time of adult permanent teeth varies with tooth-types,41 which often does not get recorded. In lieu of exact eruption times, we use the approximate permanent dentition times of U.S. adults published by the American Dental Association, and available at https://www.mouthhealthy.org/en/az-topics/e/eruption-charts. With only 913 out of 5461 teeth recording Δ=1, we also observe heavy censoring, which can be attributed to teeth that are nonsusceptible to PD. In addition, Figure 1 presented in Section 1 suggests that the cluster size is informative.

The subject-level covariates under consideration are gender (male/female), smoking status (smoker/non-smoker), glycemic level, or HbA1c (controlled/uncontrolled), and body mass index (BMI), while the only tooth-level covariate is the jaw indicator, that is, location of the tooth in upper/lower jaw. About 26% of the subjects are smokers. The mean age is 55 years, with a range from 26 to 87 years. Female subjects seem to be predominant (about 73%) in our data, which is not uncommon among Gullah subjects.40 About 74% of subjects are obese (BMI ≥30), and 64% are with uncontrolled HbA1c. In our analysis, we categorize BMI variable into three groups, namely normal (<25 kg/m2), overweight (25–30 kg/m2) and obese (≥30 kg/m2).

First, our proposed survival model is fitted to the GAAD dataset, based on the CWGEE approach. For the implementation of the proposed method, one has to choose the degree of the BP, and the transformation parameter value. We perform a two-dimensional grid search over $(m,\rho)$, and select the best model based on the Akaike information criterion (AIC).42 The AIC is given by $-\log\hat L_n+2(m+1+p)$, where $\hat L_n$ is the PL evaluated at the sieve-MLE. Although one can select $\rho$ with the lowest AIC, a model will not have much practical value unless it has a good interpretation. We search $m$ from 1 to 6 and $\rho$ over $\{0,0.1,\ldots,1\}$. The cluster-weighted method with $(m,\rho)=(1,1)$ achieves the smallest AIC, suggesting that the PTC model gives a better fit than the PO cure model; the estimated covariate effects are similar for different choices of $m$. Then, the same set of $(m,\rho)$ is applied to the GEE approach for comparison. The estimates for the regression parameters, robust standard errors and corresponding 95% confidence intervals (CIs) are presented in Table 3. We observe that the estimated regression parameters from both approaches, although similar in terms of direction, are very different in magnitude. The estimates for the intercept $\beta_0$ (and the corresponding cure probabilities) from the two approaches are also quite different; specifically, the estimated cure probabilities for zero covariates from the CWGEE and GEE are $\exp\{-\exp(-0.886)\}=0.662$ and $\exp\{-\exp(-1.288)\}=0.759$, respectively. This finding echoes the simulation results, where the intercept is severely underestimated under the GEE approach. This can be explained by the informative clustering nature of the data. The GEE approach suggests that smoking, HbA1c, gender and jaw variables are significant factors associated with the risk of developing PD. In contrast, the CWGEE approach suggests that only HbA1c and gender variables are significant factors. Under the PTC model, the constant hazard ratio assumption is preserved. 
The hazard ratio for gender (female against male) is exp(-0.760)=0.468, and the hazard ratio for HbA1c (uncontrolled against controlled) is exp(0.409)=1.505; both ratios are different from 1 at 5% level of significance. This suggests that females or subjects with controlled HbA1c are less susceptible to PD and tend to have longer time to disease onset, if susceptible.
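The grid search over (m,ρ) described above can be sketched as follows. This is only an illustrative outline, not the authors' implementation: `fit_log_pl` is a hypothetical user-supplied function that returns the maximized log pseudo-likelihood for a given (m,ρ), `toy_fit` is a made-up stand-in used only to make the example run, and the value p=7 is an assumption for illustration. The AIC follows the expression as written in the text.

```python
from itertools import product


def aic(log_pl, m, p):
    # AIC as written in the text: -log(L_n) + 2(m + 1 + p),
    # where log_pl is the maximized log pseudo-likelihood.
    return -log_pl + 2 * (m + 1 + p)


def select_by_aic(fit_log_pl, m_grid, rho_grid, p):
    """Return (best_aic, m, rho) over the two-dimensional grid.

    fit_log_pl(m, rho) is a hypothetical callable that fits the model
    and returns the maximized log pseudo-likelihood.
    """
    best = None
    for m, rho in product(m_grid, rho_grid):
        score = aic(fit_log_pl(m, rho), m, p)
        if best is None or score < best[0]:
            best = (score, m, rho)
    return best


# Grid used in the text: m from 1 to 6, rho over {0, 0.1, ..., 1}.
m_grid = range(1, 7)
rho_grid = [round(0.1 * k, 1) for k in range(11)]


def toy_fit(m, rho):
    # Made-up stand-in for a real model fit (assumption, not real output).
    return -100.0 + 0.5 * m + 3.0 * rho


best = select_by_aic(toy_fit, m_grid, rho_grid, p=7)
```

In practice one would also inspect how the estimated covariate effects change across neighbouring grid points, since the text notes that interpretability matters in addition to the AIC value.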

TABLE 3.

GAAD data analysis: Estimates for the regression parameters, estimated standard errors (ESE), and 95% confidence intervals (CI), based on the cluster-weighted generalized estimating equations (CWGEE) and generalized estimating equations (GEE) methods

| (m,ρ) | Covariates | CWGEE estimates | CWGEE ESE | CWGEE 95% CI | GEE estimates | GEE ESE | GEE 95% CI |
|---|---|---|---|---|---|---|---|
| (1, 1) | Intercept | −0.886 | 0.352 | (−1.576, −0.195) | −1.288 | 0.322 | (−1.920, −0.657) |
| | Smoker (smoker = 1) | 0.393 | 0.203 | (−0.003, 0.792) | 0.574 | 0.212 | (0.159, 0.989) |
| | HbA1c (uncontrolled = 1) | 0.409 | 0.198 | (0.022, 0.797) | 0.508 | 0.201 | (0.114, 0.901) |
| | Gender (female = 1) | −0.760 | 0.198 | (−1.148, −0.372) | −0.773 | 0.219 | (−1.203, −0.343) |
| | Jaw (upper jaw = 1) | 0.082 | 0.108 | (−0.130, 0.294) | 0.284 | 0.085 | (0.117, 0.451) |
| | BMI (25–30) | 0.330 | 0.325 | (−0.307, 0.967) | 0.176 | 0.349 | (−0.509, 0.861) |
| | BMI (≥30) | −0.091 | 0.291 | (−0.662, 0.480) | −0.051 | 0.324 | (−0.687, 0.584) |
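As a check on the arithmetic, the cure probabilities and hazard ratios quoted in the text can be recomputed from the Table 3 estimates. The sketch below uses the cure probability exp{−exp(β0)} and the hazard ratio exp(β) as stated in the text; exponentiating the Wald interval endpoints to obtain an interval on the hazard-ratio scale is a standard device added here for illustration, and the function names are hypothetical.

```python
import math


def cure_probability(beta0):
    # Cure probability at zero covariates, exp{-exp(beta0)}, as used in the text.
    return math.exp(-math.exp(beta0))


def hazard_ratio_ci(estimate, se, z=1.96):
    # Hazard ratio exp(estimate) with a Wald interval transformed to the
    # hazard-ratio scale (illustrative addition, not taken from the paper).
    return (
        math.exp(estimate),
        math.exp(estimate - z * se),
        math.exp(estimate + z * se),
    )


# Intercepts from Table 3.
cure_cwgee = cure_probability(-0.886)
cure_gee = cure_probability(-1.288)

# CWGEE estimates and ESEs for gender and HbA1c from Table 3.
hr_gender = hazard_ratio_ci(-0.760, 0.198)
hr_hba1c = hazard_ratio_ci(0.409, 0.198)

print(f"cure probability: CWGEE {cure_cwgee:.3f}, GEE {cure_gee:.3f}")
print(f"HR gender {hr_gender[0]:.3f}, HR HbA1c {hr_hba1c[0]:.3f}")
```

Both transformed intervals exclude 1, which agrees with the statement that the two hazard ratios differ from 1 at the 5% level.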

We use a cross-validation procedure, delineated in Appendix A2, to compare the predictive performance of the CWGEE and GEE approaches. In the analysis, the averaged log-likelihood values in the testing sets are −46.8 and −47.3 for CWGEE and GEE, respectively. Also, CWGEE yields larger log-likelihood in 73% of the replicates. The results suggest that CWGEE produces more accurate outcome prediction by accommodating ICS.

6 |. DISCUSSION

In this paper, we consider a class of semiparametric transformation cure models as a generalization of the PTC model for marginal analysis of clustered current status data. Under ICS, the traditional GEE yields biased estimation, even if the marginal survival model is correctly specified. As a remedy for ICS, we propose to use a CWGEE approach that weights a cluster by the inverse of the cluster size. Nonetheless, GEE and CWGEE are both valid when the cluster size is non-informative.25 We consider a sieve-MLE approach, and approximate the unspecified distribution function F by a BP. Constraints on F such as monotonicity and F(τ)=1 can be imposed easily by a reparameterization of the BP coefficients. The proposed estimators are shown to be consistent and asymptotically normal.
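The two ingredients summarized above, a constrained Bernstein polynomial for F and inverse-cluster-size weighting, can be illustrated with a minimal sketch. Several choices here are assumptions rather than the authors' implementation: the cumulative-softmax reparameterization is just one possible way to enforce monotonicity and F(τ)=1, only the PTC form S(t|X)=exp{−exp(Xᵀβ)F(t)} is coded (not the general transformation class), and all function names are hypothetical.

```python
import math


def bp_coefficients(gamma):
    # Map unconstrained gamma_0..gamma_m to coefficients that are
    # nondecreasing and end at 1 (cumulative softmax; an assumed choice).
    mx = max(gamma)
    w = [math.exp(g - mx) for g in gamma]
    total = sum(w)
    coefs, acc = [], 0.0
    for wk in w:
        acc += wk / total
        coefs.append(acc)
    coefs[-1] = 1.0  # guard against floating-point rounding
    return coefs


def bernstein_cdf(t, tau, gamma):
    # F(t) = sum_k phi_k * C(m, k) * u^k * (1 - u)^(m - k), with u = t / tau.
    m = len(gamma) - 1
    u = min(max(t / tau, 0.0), 1.0)
    phi = bp_coefficients(gamma)
    return sum(
        phi[k] * math.comb(m, k) * u**k * (1.0 - u) ** (m - k)
        for k in range(m + 1)
    )


def ptc_survival(t, x, beta, tau, gamma):
    # PTC survival function S(t | x) = exp{-exp(x'beta) F(t)}.
    lin = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(-math.exp(lin) * bernstein_cdf(t, tau, gamma))


def weighted_log_pl(clusters, beta, tau, gamma, cluster_weighted=True):
    # Working-independence log pseudo-likelihood for current status data.
    # Each cluster is a list of (y, delta, x); with cluster_weighted=True
    # every observation is weighted by 1 / (cluster size).
    total = 0.0
    for cluster in clusters:
        w = 1.0 / len(cluster) if cluster_weighted else 1.0
        for y, delta, x in cluster:
            s = ptc_survival(y, x, beta, tau, gamma)
            total += w * (delta * math.log(1.0 - s) + (1.0 - delta) * math.log(s))
    return total


# Toy data (made up): one cluster of size 2 and one of size 1.
toy_clusters = [
    [(1.0, 1, [1.0]), (2.0, 0, [1.0])],
    [(1.5, 1, [1.0])],
]
toy_value = weighted_log_pl(toy_clusters, beta=[-0.5], tau=5.0, gamma=[0.0, 0.3])
```

Setting `cluster_weighted=False` gives the unweighted objective, so the two approaches can be compared on the same data by toggling a single argument.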

We illustrate our proposed method on a dataset recording current status time to event of PD incidence, where initial data exploration suggests the presence of ICS. Given that the cluster sizes in the dataset are potentially large (many subjects with >20 teeth), estimation methods that maximize the joint likelihood under frailty models can be computationally intensive. In contrast, the marginal approach proposed here provides a computationally efficient method for parameter estimation. It is speculated that informative clustering is present, as the prevalence of the disease decreases with the number of teeth. This is plausibly why some parameter estimates based on GEE differ substantially from those based on CWGEE.

In this paper, we treat the transformation parameter ρ as prespecified and select it based on an information criterion. An alternative method for the selection/estimation of ρ is to regard it as an unknown parameter and estimate it along with other parameters using the maximum (pseudo) likelihood method. Nevertheless, as remarked by Zeng et al17 under a similar cure transformation model, the transformation parameter cannot be reliably estimated under sample sizes smaller than 1500. Similar situations have been encountered in our setup, and the estimation of ρ is not numerically stable when n is moderately small as the pseudo-likelihood function of ρ would be almost flat.

There are a number of future directions to consider, stemming from our current work. First, we currently allow the cluster size and the response to be associated but do not explicitly model the association structure. Alternatively, it is worthwhile to consider a joint modeling approach that regresses both the cluster size and survival times on the same or different sets of covariates. In such a model, the association between the cluster size and current-status survival outcomes can be modelled via a shared frailty term in both regression models. One advantage of a joint-modeling approach is that, under a correctly specified model and regularity conditions, the regression parameters can be estimated with optimal statistical efficiency. Second, we may develop a statistical test to detect the presence of ICS within current-status (or more general interval-censored) scenarios with a cured proportion. Third, the present work assumes that the inspection time Y is noninformative, which may be invalid in practice. One can consider extending the work to accommodate dependent censoring for clustered time-to-event data, using a copula or a frailty approach to model the association between T and Y.43,44

Supplementary Material

Table S1

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

ACKNOWLEDGEMENTS

The authors thank the Center for Oral Health Research at the Medical University of South Carolina for providing the motivating dataset, and the context of this work. They also thank the anonymous associate editor and two reviewers, whose constructive comments led to a significantly improved presentation. The research of K. F. Lam was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 17305819). Bandyopadhyay’s research was partially supported by the NIH (Awards # R01DE024984 and P30CA016059).

APPENDIX A

A.1. Proof of estimation consistency

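By theorem 1.6.2 of Lorentz,35 there exists a Bernstein polynomial $\tilde{F}_n$ such that $\|\tilde{F}_n - F^*\| = O(n^{-r\nu/2})$. Let $\mathbb{P}_n$ denote the empirical measure. By the definition of $\hat{\beta}_n$ and $\hat{F}_n$,

$$\mathbb{P}_n \ell(\hat{\beta}_n, \hat{F}_n) \ge \mathbb{P}_n \ell(\beta^*, \tilde{F}_n).$$

Therefore,

$$\left|(\mathbb{P}_n - P)\ell(\hat{\beta}_n, \hat{F}_n)\right| + \left|(\mathbb{P}_n - P)\ell(\beta^*, F^*)\right| + \left|\mathbb{P}_n\{\ell(\beta^*, \tilde{F}_n) - \ell(\beta^*, F^*)\}\right| \ge P\ell(\beta^*, F^*) - P\ell(\hat{\beta}_n, \hat{F}_n). \tag{B1}$$

Following the proof of lemma 2 of Zhou et al,18 we can show that the first two terms on the left-hand side of (B1) converge to 0 almost surely. By the mean-value theorem, the third term on the left-hand side of (B1) converges uniformly to 0. Note that for any fixed function f,

$$E\left\{\frac{1}{N}\sum_{j=1}^{N} f(Y_j, \Delta_j, X_j)\right\} = E\left[\frac{1}{N}\sum_{j=1}^{N} E\{f(Y_j, \Delta_j, X_j) \mid N\}\right] = E\{f(Y_1, \Delta_1, X_1)\},$$

where the second equality follows from condition (C1). Therefore, the right-hand side of (B1) is equal to $P^{(1)}\ell(\beta^*, F^*) - P^{(1)}\ell(\hat{\beta}_n, \hat{F}_n)$. This term is the Kullback-Leibler divergence for the observed data of the first subject of a cluster, so that by Wong and Shen45 (p346), it is bounded below by the Hellinger distance:

$$\left\|Q(\Delta_1; \hat{\xi}_n)^{1/2} - Q(\Delta_1; \xi^*)^{1/2}\right\|_{L_2(P)}^2 = \left\|\left.\frac{(-1)^{\Delta_1} G'(\xi)}{2\,Q(\Delta_1; \xi)^{1/2}}\right|_{\xi = \tilde{\xi}_n} \left\{\hat{F}_n(Y_1) e^{X_1^T \hat{\beta}_n} - F^*(Y_1) e^{X_1^T \beta^*}\right\}\right\|_{L_2(P)}^2,$$

where $\hat{\xi}_n = \hat{F}_n(Y_1) e^{X_1^T \hat{\beta}_n}$, $\xi^* = F^*(Y_1) e^{X_1^T \beta^*}$, $\tilde{\xi}_n$ is some value between $\hat{\xi}_n$ and $\xi^*$, and

$$Q(\Delta; \xi) = \{1 - G(\xi)\}^{\Delta} G(\xi)^{1-\Delta}.$$

By condition (C2), $G'(\xi)$ is negative and uniformly bounded away from 0, and $Q(\Delta_1; \xi)$ is clearly uniformly bounded above by 1. Therefore, the right-hand side above is, up to a scaling factor, bounded below by

$$\left\|\hat{F}_n(Y_1) e^{X_1^T \hat{\beta}_n} - F^*(Y_1) e^{X_1^T \beta^*}\right\|_{L_2(P)}^2 \ge E\left\{\left(e^{X_1^T \hat{\beta}_n} - e^{X_1^T \beta^*}\right)^2 \,\middle|\, Y_1 = \tau\right\} P(Y_1 = \tau).$$

By conditions (C3) and (C4) and the mean-value theorem, the right-hand side above is, up to a scaling factor, bounded below by $\|\hat{\beta}_n - \beta^*\|^2$. Therefore, (B1) implies that $\|\hat{\beta}_n - \beta^*\|^2 \lesssim o_p(1)$, where the right-hand side goes to 0 almost surely. Similarly, we conclude that $\|\hat{F}_n(Y_1) - F^*(Y_1)\|_{L_2(P)}^2 \to_{a.s.} 0$, and thus the desired result follows from condition (C3).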

A.2. Cross-validation procedure

We use the following cross-validation procedure to compare the predictive performance of CWGEE and GEE on the GAAD data:

  1. Randomly split the GAAD data (n=288) into a training set and a testing set, with clusters as sampling units and a sample size ratio of 2:1 (ie, 192 and 96 clusters for the training and testing sets, respectively).

  2. Perform CWGEE and GEE on the training set with (m,ρ)=(1,1), and denote the resulting sieve-MLEs as $\hat{\theta}_{\mathrm{CWGEE}}$ and $\hat{\theta}_{\mathrm{GEE}}$, respectively.

  3. Generate B=1000 random samples from the testing set, where each random sample consists of a randomly selected observation from each cluster. For $b = 1, \ldots, B$, let $\{(Y_i^{(b)}, \Delta_i^{(b)}, X_i^{(b)})\}_{i=1,\ldots,n^{(b)}}$ be the bth random sample with $n^{(b)} = 96$, and let
    $$\ell^{(b)}(\theta) \equiv \sum_{i=1}^{n^{(b)}} \left[\Delta_i^{(b)} \log\{1 - S(Y_i^{(b)} \mid \theta, X_i^{(b)})\} + (1 - \Delta_i^{(b)}) \log S(Y_i^{(b)} \mid \theta, X_i^{(b)})\right]$$
    be the log-likelihood value evaluated at θ.

  4. Compute the mean log-likelihood values $B^{-1}\sum_{b=1}^{B} \ell^{(b)}(\hat{\theta}_{\mathrm{CWGEE}})$ and $B^{-1}\sum_{b=1}^{B} \ell^{(b)}(\hat{\theta}_{\mathrm{GEE}})$.

  5. Repeat steps 1 to 4 100 times, and report the mean log-likelihood values for both methods averaged over the 100 replicates.
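The resampling part of the procedure above (steps 1, 3 and 4) can be sketched as follows. This is a hedged outline only: the model-fitting step 2 is not shown, `surv` is a hypothetical callable standing in for the fitted survival function S(y | θ̂, x), and the toy clusters and constant survival value are made up to keep the example runnable.

```python
import math
import random


def split_clusters(clusters, rng):
    # Step 1: split with clusters as sampling units in a 2:1 ratio.
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    n_train = (2 * len(shuffled)) // 3
    return shuffled[:n_train], shuffled[n_train:]


def log_lik(sample, surv):
    # Current-status log-likelihood of one resampled data set;
    # each observation is a tuple (y, delta, x).
    return sum(
        d * math.log(1.0 - surv(y, x)) + (1 - d) * math.log(surv(y, x))
        for y, d, x in sample
    )


def mean_test_log_lik(test_clusters, surv, B=1000, rng=None):
    # Steps 3 and 4: draw one observation per test cluster, B times,
    # and average the resulting log-likelihood values.
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(B):
        sample = [rng.choice(cluster) for cluster in test_clusters]
        total += log_lik(sample, surv)
    return total / B


# Toy illustration with made-up data: 288 clusters of varying size.
rng = random.Random(2021)
toy = [
    [(rng.uniform(0.5, 5.0), rng.randint(0, 1), None) for _ in range(rng.randint(1, 5))]
    for _ in range(288)
]
train, test = split_clusters(toy, rng)
score = mean_test_log_lik(test, surv=lambda y, x: 0.7, B=200, rng=rng)
```

In a full comparison, `surv` would be replaced by the survival functions implied by the CWGEE and GEE fits on the training clusters, and the whole procedure would be wrapped in an outer loop over the 100 replicates of step 5.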

DATA AVAILABILITY STATEMENT

The synthetic data used to support our findings in Section 4 are available from the corresponding author upon reasonable request. The application data are not publicly available due to privacy or ethical restrictions.

REFERENCES

  • 1. Fernandes JK, Wiegand RE, Salinas CF, et al. Periodontal disease status in Gullah African Americans with type 2 diabetes living in South Carolina. J Periodontol. 2009;80(7):1062–1068.
  • 2. Darby ML, Walsh M. Dental Hygiene Theory and Practice. 3rd ed. St. Louis, Missouri: Elsevier Health Sciences; 2010.
  • 3. Huang J, Wellner JA. Asymptotic normality of the NPMLE of linear functionals for interval censored data, case 1. Stat Neerl. 1995;49(2):153–163.
  • 4. Klein JP, Van Houwelingen HC, Ibrahim JG, Scheike TH. Handbook of Survival Analysis. Boca Raton, FL: CRC Press; 2016.
  • 5. Cong XJ, Yin G, Shen Y. Marginal analysis of correlated failure time data with informative cluster sizes. Biometrics. 2007;63(3):663–672.
  • 6. Peng Y, Dear KBG. A nonparametric mixture model for cure rate estimation. Biometrics. 2000;56(1):237–243.
  • 7. Lam KF, Fong DYT, Tang OY. Estimating the proportion of cured patients in a censored sample. Stat Med. 2005;24(12):1865–1879.
  • 8. Chen MH, Ibrahim JG, Sinha D. A new Bayesian model for survival data with a surviving fraction. J Am Stat Assoc. 1999;94(447):909–919.
  • 9. Ibrahim JG, Chen MH, Sinha D. Bayesian semiparametric models for survival data with a cure fraction. Biometrics. 2001;57(2):383–388.
  • 10. Berkson J, Gage RP. Survival curve for cancer patients following treatment. J Am Stat Assoc. 1952;47(259):501–515.
  • 11. Farewell VT. The use of mixture models for the analysis of survival data with long-term survivors. Biometrics. 1982;38:1041–1046.
  • 12. Kuk AYC, Chen CH. A mixture model combining logistic regression with proportional hazards regression. Biometrika. 1992;79(3):531–541.
  • 13. Sy JP, Taylor JMG. Estimation in a Cox proportional hazards cure model. Biometrics. 2000;56(1):227–236.
  • 14. Lam KF, Xue H. A semiparametric regression cure model with current status data. Biometrika. 2005;92(3):573–586.
  • 15. Tsodikov AD, Yakovlev AY, Asselain B. Stochastic Models of Tumor Latency and Their Biostatistical Applications. Vol 1. Singapore: World Scientific; 1996.
  • 16. Tsodikov A. A proportional hazards model taking account of long-term survivors. Biometrics. 1998;54:1508–1516.
  • 17. Zeng D, Yin G, Ibrahim JG. Semiparametric transformation models for survival data with a cure fraction. J Am Stat Assoc. 2006;101(474):670–684.
  • 18. Zhou Q, Hu T, Sun J. A sieve semiparametric maximum likelihood approach for regression analysis of bivariate interval-censored failure time data. J Am Stat Assoc. 2017;112(518):664–672.
  • 19. Sun T, Ding Y. Copula-based semiparametric regression method for bivariate data under general interval censoring. Biostatistics. 2019. doi:10.1093/biostatistics/kxz032.
  • 20. Lee CY, Wong KY, Lam KF, Xu J. Analysis of clustered interval-censored data using a class of semiparametric partly linear frailty transformation models. Biometrics. 2020. doi:10.1111/biom.13399.
  • 21. Lam KF, Wong KY. Semiparametric analysis of clustered interval-censored survival data with a cure fraction. Comput Stat Data Anal. 2014;79:165–174.
  • 22. Kor CT, Cheng KF, Chen YH. A method for analyzing clustered interval-censored data based on Cox's model. Stat Med. 2013;32(5):822–832.
  • 23. Niu Y, Peng Y. Marginal regression analysis of clustered failure time data with a cure fraction. J Multivar Anal. 2014;123:129–142.
  • 24. Williamson JM, Datta S, Satten GA. Marginal analyses of clustered data when cluster size is informative. Biometrics. 2003;59(1):36–42.
  • 25. Williamson JM, Kim HY, Manatunga A, Addiss DG. Modeling survival data with informative cluster size. Stat Med. 2008;27(4):543–555.
  • 26. Hoffman EB, Sen PK, Weinberg CR. Within-cluster resampling. Biometrika. 2001;88(4):1121–1134.
  • 27. Zhang X, Sun J. Regression analysis of clustered interval-censored failure time data with informative cluster size. Comput Stat Data Anal. 2010;54(7):1817–1823.
  • 28. Zhang X, Sun J. Semiparametric regression analysis of clustered interval-censored failure time data with informative cluster size. Int J Biostat. 2013;9(2):205–214.
  • 29. Zhao H, Ma C, Li J, Sun J. Regression analysis of clustered interval-censored failure time data with linear transformation models in the presence of informative cluster size. J Nonparametr Stat. 2018;30(3):703–715.
  • 30. Farouki RT. The Bernstein polynomial basis: a centennial retrospective. Comput Aid Geometr Des. 2012;29(6):379–419.
  • 31. Rossini AJ, Tsiatis AA. A semiparametric proportional odds regression model for the analysis of current status data. J Am Stat Assoc. 1996;91(434):713–721.
  • 32. Zhang Y, Hua L, Huang J. A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand J Stat. 2010;37(2):338–354.
  • 33. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Stat Sci. 1996;11(2):89–121.
  • 34. Lam KF, Wong KY, Zhou F. A semiparametric cure model for interval-censored data. Biom J. 2013;55(5):771–788.
  • 35. Lorentz GG. Bernstein Polynomials. New York, NY: Chelsea Publishing Company; 1986.
  • 36. Dai YH. Convergence properties of the BFGS algorithm. SIAM J Optim. 2002;13(3):693–701.
  • 37. Royall RM. Model robust confidence intervals using maximum likelihood estimators. Int Stat Rev/Revue Internationale de Statistique. 1986;54(2):221–226.
  • 38. Hougaard P. Survival models for heterogeneous populations derived from stable distributions. Biometrika. 1986;73(2):387–396.
  • 39. Emura T, Chen YH. Analysis of Survival Data with Dependent Censoring: Copula-Based Approaches. New York, NY: Springer; 2018.
  • 40. Johnson Spruill I, Hammond P, Davis B, McGee Z, Louden D. Health of Gullah families in South Carolina with type 2 diabetes. Diabetes Educ. 2009;35(1):117–123.
  • 41. Chaitanya P, Reddy JS, Suhasini K, Chandrika IH, Praveen D. Time and eruption sequence of permanent teeth in Hyderabad children: a descriptive cross-sectional study. Int J Clin Pediatr Dent. 2018;11(4):330–337.
  • 42. Burnham KP, Anderson DR. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Berlin, Germany: Springer Science & Business Media; 2003.
  • 43. Ma L, Hu T, Sun J. Sieve maximum likelihood regression analysis of dependent current status data. Biometrika. 2015;102(3):731–738.
  • 44. Liu Y, Hu T, Sun J. Regression analysis of current status data in the presence of a cured subgroup and dependent censoring. Lifetime Data Anal. 2017;23(4):626–650.
  • 45. Wong WH, Shen X. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann Stat. 1995;23(2):339–362.
