Published in final edited form as: Lifetime Data Anal. 2022 Jun 26;28(3):428–491. doi: 10.1007/s10985-022-09557-5

Semi-supervised Approach to Event Time Annotation Using Longitudinal Electronic Health Records

Liang Liang 1, Jue Hou 2, Hajime Uno 3, Kelly Cho 4, Yanyuan Ma 5, Tianxi Cai 6

Abstract

Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes, such as the time to cancer progression, is not readily available in these databases, and the true clinical event times typically cannot be approximated well from simple extracts of billing or procedure codes. Meanwhile, annotating event times manually is prohibitively time- and resource-intensive. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In step I, we employ a functional principal component analysis (FPCA) approach to estimate the underlying intensity functions based on the observed point processes from the unlabeled patients. In step II, we fit a penalized proportional odds model to the event time outcomes with the features derived in step I in the labeled data, where the non-parametric baseline function is approximated using B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach relative to existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.

Keywords: censoring, electronic health records, functional principal component analysis, point process, proportional odds model, semi-supervised learning

1. Introduction

While clinical trials and traditional cohort studies remain critical sources of data for clinical research, they have limitations, including the generalizability of the study findings and the limited ability to test broader hypotheses. In recent years, real world clinical data derived from disease registries, insurance claims, and electronic health record (EHR) systems have been increasingly used for precision medicine research. These real world data (RWD) open opportunities for developing accurate personalized risk prediction models, which can be easily incorporated into clinical practice and ultimately realize the promise of precision medicine. Efficiently deriving prediction models for the risk of developing future clinical events using RWD, however, faces practical and methodological challenges. Precise event time information, such as the time to cancer recurrence, is not readily available in RWD such as EHR and claims data. Simple proxies to the event time based on the encounter time of first diagnosis or procedure codes may poorly approximate the true event time (Uno et al., 2018). On the other hand, annotating event times manually via chart review is prohibitively time- and resource-intensive.

Learning the onset status of clinical events has been thoroughly investigated in the past decade using a large-scale medical encounter data set that lacks precise onset status together with a small training set with gold standard labels on the true onset status. The solutions can be classified by their training sample into supervised approaches that use only the labeled data, unsupervised approaches that use no labels, and semi-supervised approaches that combine information from labeled and unlabeled data. Semi-supervised methods (Yu et al., 2015, 2016; Zhang et al., 2019) usually use the unlabeled data for feature screening and selection before the final training on the labeled data. Growing efforts have been made in recent years to predict the onset time of clinical events under a similar setting with partially labeled event times. The existing literature on phenotyping of event times consists mostly of supervised approaches that only use labeled data for training. Several supervised algorithms exist for predicting the cancer recurrence time by extracting features from the encounter pattern of relevant codes. For example, Chubak et al. (2015) proposed a rule-based algorithm that classifies the recurrence status, R ∈ {+, −}, based on a decision tree, and assigns the recurrence time for those with predicted R = + as the earliest encounter time of one or more specific codes. Hassett et al. (2015) proposed two-step algorithms where a logistic regression is used to classify R in step I, and the recurrence time for those with R = + is then estimated as a weighted average of the times at which the counts of several prespecified codes peaked. Instead of the peak time, Uno et al. (2018) focus on the time at which an empirically estimated encounter intensity function has the sharpest change, referred to as the change point throughout the paper. The recurrence time is approximated as a weighted average of the change point times associated with a few selected codes. Despite their reasonable empirical performance, these ad hoc procedures have several major limitations. First, only a very small number of codes are selected according to domain knowledge. Second, intensity functions estimated via finite differences may yield substantial variability in the resulting peak or change-point times due to the sparsity of encounter data. One exception is a recent semi-supervised approach by Ahuja et al. (2021), who first aggregated the predicting features and then predicted the event time from the longitudinal trajectories of the aggregated features using a hidden Markov model. Our approach differs from Ahuja et al. (2021) in the order in which feature aggregation and temporal trajectories are addressed: we first extract the characteristics from the longitudinal trajectories of the predicting features, and then aggregate the extracted characteristics to predict the event time by fitting a survival model.

In this paper, we frame the question of annotating event times with longitudinal encounter records as the statistical problem of predicting an event time $T$ using baseline covariates $U$ as well as features derived from a $p$-variate point process, $\boldsymbol{\mathcal{N}} = (\mathcal{N}_1, \ldots, \mathcal{N}_p)$. Specifically, with a small labeled set $\mathcal{L}$ containing observations on $\{T, U, \boldsymbol{\mathcal{N}}\}$ and a large unlabeled set $\mathcal{U}$ containing observations on $U$ and $\boldsymbol{\mathcal{N}}$ only, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) procedure. In Step I, we use the large unlabeled data to estimate the underlying subject-specific intensity functions associated with $\boldsymbol{\mathcal{N}}$ and derive summaries of the intensity functions, denoted by $\hat{W}$, as features for predicting $T$. In Step II, we predict $T$ using $\hat{Z} = (U^\top, \hat{W}^\top)^\top$ by fitting a penalized proportional odds (PO) model in which the non-parametric baseline function is approximated via B-splines. MATA is semi-supervised in that the unlabeled and labeled data are used in Steps I and II, respectively. Estimating individualized intensity functions is challenging in the current setting because the encounter data are often sparse and the shapes of the intensity functions can vary greatly across subjects. As such, traditional multiplicative intensity models (Lawless, 1987; Dean and Balshaw, 1997; Nielsen and Dean, 2005) fail to provide accurate approximations. To overcome these difficulties, we employ the non-parametric functional principal component analysis (FPCA) method of Wu et al. (2013) to estimate the subject-specific intensity functions using the large unlabeled set $\mathcal{U}$. We demonstrate that when the size of $\mathcal{U}$ is sufficiently large relative to the size of $\mathcal{L}$, the approximation error $\hat{W} - W$ is negligible compared to the estimation error from fitting the spline model in $\mathcal{L}$.

Even though the idea of employing a spline-based approach is straightforward and intuitive, our method differs from classical B-spline work in that the B-splines are used on the outcome model, i.e., the failure time, rather than in the preprocessing of the predictors. Special attention is devoted to accommodating this fact. We establish novel consistency results and asymptotic convergence rates for the proposed estimator, for both the parametric and the nonparametric parts. There is some existing literature adopting a spline-based approach in a context similar to ours, including Shen (1998); Royston and Parmar (2002); Zhang et al. (2010); Younes and Lachin (1997). However, Royston and Parmar (2002) and Younes and Lachin (1997) did not address the asymptotic properties of their estimators at all, while Shen (1998) and Zhang et al. (2010) employed a sieve maximum likelihood approach that includes splines as a special case but provided theoretical justification only for the asymptotics of the parametric part.

One great advantage of the proposed MATA approach is the easy implementation of classical variable selection algorithms such as the LASSO. In comparison, Chubak et al. (2015); Hassett et al. (2015); Uno et al. (2018) exhaust all possible combinations of selected encounters and select the optimal one under certain criteria, which incurs substantial computational cost. No variable selection method has been developed for classical estimating-equation-based estimators, e.g., Cheng et al. (1995, 1997). Moreover, compared to the non-parametric maximum likelihood estimator (NPMLE), e.g., Zeng et al. (2005), which approximates the non-parametric function by a right-continuous step function with jumps only at the observed failure times, our approach is computationally more efficient and stable.

The rest of the paper is organized as follows. In Section 2, we introduce the proposed MATA approach and the prediction accuracy evaluation measures. The asymptotic properties of the proposed estimator are discussed in Section 3. In Section 4, we conduct various simulation studies to explore the performance of our approach with small labeled samples. In Section 5, we apply our approach to a lung cancer data set. Section 6 contains a short discussion. Technical details and proofs are provided in the Supplementary Material.

2. Semi-supervised MATA

Let $T$ denote the continuous event time of interest, which is observable up to $(X, \Delta)$ in $\mathcal{L}$, where $X = \min(T, C)$, $\Delta = I(T \le C)$, and $C$ is the follow-up time. Let $\boldsymbol{\mathcal{N}} = (\mathcal{N}^{[1]}, \ldots, \mathcal{N}^{[q]})$ denote the $q$-variate point processes and $U$ denote the baseline covariates observable in both $\mathcal{L}$ and $\mathcal{U}$, where $\mathcal{N}^{[j]}$ is a point process associated with the $j$th clinical code whose occurrence times are $\{t_1^{[j]}, t_2^{[j]}, \ldots\}$, with $\mathcal{N}^{[j]}(\mathcal{A}) = \sum_s I(t_s^{[j]} \in \mathcal{A})$ for any set $\mathcal{A}$ in the Borel $\sigma$-algebra of the positive half of the real line, and $I(\cdot)$ is the indicator function. If $T$ denotes the true event time of heart failure, examples of $\mathcal{N}^{[j]}$ include the longitudinal encounter processes of diagnostic codes for heart failure and NLP mentions of heart failure in clinical notes. The local intensity function for $\mathcal{N}^{[j]}$ is $\lambda^{[j]}(t) = E\{d\mathcal{N}^{[j]}(t)\}/dt$, $t \ge 0$. Here we assume $\lambda^{[j]}(t)$ is integrable, i.e., $\tau^{[j]} = \int_0^\infty \lambda^{[j]}(u)\,du < \infty$, for $j = 1, \ldots, q$. The corresponding random density trajectory is $f^{[j]}(t) = \lambda^{[j]}(t)/\tau^{[j]}$, $t \ge 0$. Equivalently, $\lambda^{[j]}(t) = \tau^{[j]} f^{[j]}(t) = E\{\mathcal{N}^{[j]}[0, \infty)\}\, f^{[j]}(t)$, i.e., the intensity function $\lambda^{[j]}(t)$ is the product of the density trajectory $f^{[j]}(t)$ and the expected number of lifetime encounters.

The encounter times of the point processes are also only observable up to the end of follow-up $C$. Let $M^{[j]} = \mathcal{N}^{[j]}([0, C])$ denote the total number of occurrences for $\mathcal{N}^{[j]}$ up to $C$, and let $\mathcal{H}_t$ denote the history of $\boldsymbol{\mathcal{N}}$ up to $t$ along with the baseline covariate vector $U$. Thus, the observed data consist of

$$\text{Labeled data: } \mathcal{L} = \{(X_i, \Delta_i, C_i, \mathcal{H}_{iC_i}) : i = 1, \ldots, n\}, \qquad \text{Unlabeled data: } \mathcal{U} = \{(C_i, \mathcal{H}_{iC_i}) : i = n+1, \ldots, n+N\},$$

where $i$ indexes the subject and we assume that $N \gg n$.

2.1. Models

Our proposed MATA procedure involves two models, one for the point processes and one for the survival function of $T$. We connect the two models by including the underlying intensity functions for the point processes as part of the covariates for the survival time.

Point Process Model

The intensity function $\lambda^{[j]}(t)$ for $t \ge 0$ is treated as a realization of a non-negative valued stochastic intensity process $\Lambda^{[j]}(t)$. Conditional on $\Lambda^{[j]} = \lambda^{[j]}$, the number of observed medical encounters is assumed to follow a non-homogeneous Poisson process with local intensity function $\lambda^{[j]}(t)$ satisfying $E\{\mathcal{N}^{[j]}(a, b) \mid \Lambda^{[j]} = \lambda^{[j]}\} = \int_a^b \lambda^{[j]}(u)\,du$, where $0 \le a \le b < \infty$. Thus, $\tau^{[j]} = E\{\mathcal{N}^{[j]}[0, \infty) \mid \Lambda^{[j]} = \lambda^{[j]}\}$. Define the truncated random density

$$f_C^{[j]}(t) = f^{[j]}(t) \Big/ \int_0^C f^{[j]}(s)\,ds = \lambda^{[j]}(t) \Big/ \int_0^C \lambda^{[j]}(s)\,ds, \quad t \in [0, C];$$

and its scaled version

$$f_{C,\text{scaled}}^{[j]}(t) = C f_C^{[j]}(Ct) = \lambda^{[j]}(Ct) \Big/ \int_0^1 \lambda^{[j]}(Cs)\,ds, \quad t \in [0, 1].$$

As we only observe the point process $\mathcal{N}^{[j]}$ up to $C$, our goal is to estimate the truncated density function $f_C^{[j]}(t)$, or equivalently the scaled density function $f_{C,\text{scaled}}^{[j]}(t)$, rather than the density function $f^{[j]}(t)$. Note that the scaling is done to meet the uniform-endpoint requirement of the FPCA approach of Wu et al. (2013).

Event Time Model

We next relate features derived from the intensity functions to the distribution of $T$. Define $W^{[j]} = \mathcal{F}\{f^{[j]}\}$, where $\mathcal{F}$ is a known functional. For example, if the density $f^{[j]}(x)$ is exponential with rate $\theta^{-1}$, then we may set $\mathcal{F}\{f^{[j]}\} = \int x f^{[j]}(x)\,dx = \theta$. Other potential features include the peaks or change points of the intensity functions $\lambda^{[j]}$. Here the peak is defined as the time at which the intensity (or density) curve reaches its maximum, while the change point is defined as the time of the largest increase in the intensity (or density) curve. Due to censoring, we instead have $W_C^{[j]} = \mathcal{F}\{f_C^{[j]}\}$. For features like the peak and the change point, $W^{[j]}$ and $W_C^{[j]}$ are identical whenever they occur before the censoring time $C$. We assume that $T$ given $Z = (U^\top, W^\top)^\top$ follows a PO model (Klein and Moeschberger, 2006)

$$F(t \mid Z) \equiv \mathrm{pr}(T \le t \mid Z) = \frac{\exp(\beta^\top Z)\,\alpha(t)}{1 + \exp(\beta^\top Z)\,\alpha(t)}, \quad \text{with } \alpha(t) = \int_0^t \exp\{m(s)\}\,ds, \tag{1}$$

where $W = (W^{[1]}, \ldots, W^{[q]})^\top$, $\beta$ is the unknown effect vector of the derived features $Z$, and $m(t)$ is an unknown smooth function of $t$. This formulation ensures that $\alpha(t) = \int_0^t \exp\{m(s)\}\,ds$ is positive and increasing for $t \in (0, \infty)$.

2.2. Estimation

To derive a prediction rule for $T$ based on the proposed MATA procedure, we first estimate the truncated density functions $f_C^{[j]}(t)$ from the longitudinal encounter data via the FPCA method of Wu et al. (2013), applied to the unlabeled data $\mathcal{U}$, to obtain estimates of the derived features $W$, denoted by $\hat{W}$. We then estimate $\alpha(t)$ and $\beta$ using the synthetic observations in the labeled set $\{(X_i, \Delta_i, \hat{Z}_i), i = 1, \ldots, n\}$ via penalized estimation with regression splines.

2.2.1. Step I: Estimation of $f^{[j]}$

We estimate the mean $f_{\mu,\text{scaled}}^{[j]}(t)$ and covariance $G_{\text{scaled}}^{[j]}(t, s)$ of the scaled density function $f_{C,\text{scaled}}^{[j]}(t)$ following the FPCA approach of Wu et al. (2013). Using the estimators $\hat{f}_{\mu,\text{scaled}}^{[j]}(t)$ and $\hat{G}_{\text{scaled}}^{[j]}(t, s)$, we predict the scaled density function by $\hat{f}_{iK,\text{scaled}}^{[j]}(t)$, with truncation at zero to ensure the nonnegativity of the density function. The index $K$ in the subscript is the number of basis functions, selected according to the proportion of variation explained. We obtain the truncated density function by

$$\hat{f}_{iK}^{[j]}(t) = \hat{f}_{iK,\text{scaled}}^{[j]}(t/C_i) \Big/ \int_0^{C_i} \hat{f}_{iK,\text{scaled}}^{[j]}(s/C_i)\,ds.$$

For the $i$-th patient and its $j$-th point process $\mathcal{N}_i^{[j]}$, we observe only one realization of its number of encounters on $[0, C_i]$, i.e., $M_i^{[j]} = \mathcal{N}_i^{[j]}([0, C_i])$. We approximate the expected number of encounters by the observed count, and estimate $\lambda_i^{[j]}(t)$ as $\hat{\lambda}_i^{[j]}(t) = M_i^{[j]} \hat{f}_{iK}^{[j]}(t)$ for $t \in [0, C_i]$. We further estimate the derived feature $W_i^{[j]}$ by $\hat{W}_i^{[j]} = \mathcal{F}\{\hat{f}_{iK}^{[j]}\}$. The detailed forms of these estimators are given in Appendix C. We also establish the rate of convergence for the estimated loadings of the functional PCA, which can subsequently be used as additional derived features.
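To make Step I concrete, the following is a minimal numpy sketch of FPCA-style de-noising of the subject-specific scaled densities. It is not the exact estimator of Wu et al. (2013): for illustration it replaces their kernel-based mean and covariance estimators with crude per-subject histogram densities on [0, 1], applies an eigen-decomposition of the pooled empirical covariance, truncates to $K$ components by the proportion of variation explained, and then truncates at zero and renormalizes as described above. All function and variable names are illustrative.

```python
import numpy as np

def fpca_density_estimates(event_times, followups, n_grid=50, pve=0.9):
    """Crude FPCA-style de-noising of subject-specific scaled densities.

    event_times : list of arrays, encounter times per subject
    followups   : array of follow-up times C_i
    Returns (fitted densities on a grid over [0, 1], FPC scores, K).
    """
    edges = np.linspace(0.0, 1.0, n_grid + 1)        # histogram bin edges on [0, 1]
    raw = np.zeros((len(event_times), n_grid))
    for i, (t, c) in enumerate(zip(event_times, followups)):
        s = t[t <= c] / c                            # scale encounter times to [0, 1]
        counts, _ = np.histogram(s, bins=edges)
        if counts.sum() > 0:
            raw[i] = counts / counts.sum() * n_grid  # crude density estimate
    mu = raw.mean(axis=0)                            # estimated mean density
    centered = raw - mu
    cov = centered.T @ centered / len(raw)           # pooled empirical covariance
    evals, evecs = np.linalg.eigh(cov)
    evals = np.clip(evals[::-1], 0, None)            # descending, nonneg eigenvalues
    evecs = evecs[:, ::-1]
    K = int(np.searchsorted(np.cumsum(evals) / evals.sum(), pve)) + 1
    scores = centered @ evecs[:, :K]                 # subject-specific FPC scores
    fitted = mu + scores @ evecs[:, :K].T            # K-component reconstruction
    fitted = np.clip(fitted, 0, None)                # truncate at zero ...
    fitted /= fitted.mean(axis=1, keepdims=True)     # ... and renormalize to a density
    return fitted, scores, K
```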

Incorporating the large unlabeled data in the semi-supervised learning facilitates the extraction of characteristics from noisy and complex longitudinal trajectories. Using unlabeled data for feature preprocessing has a long tradition in semi-supervised phenotyping (Yu et al., 2015, 2016; Zhang et al., 2019). While the existing literature focused on the selection of simple features, we consider the de-noising of complex features. Moreover, the large unlabeled data in Step I greatly reduces the uncertainty from this step, to the point that it becomes negligible in Step II.

2.2.2. Step II. PO Model Estimation with B-spline Approximation to m(·)

To estimate $m(t)$ and $\beta$ in the PO model (1), we approximate $m(t)$ via B-splines of order $r$ (degree $r-1$) as follows. Divide the support of the censoring time $C$, denoted $[0, \zeta]$, into $(R_n + 1)$ subintervals $\{(\xi_p, \xi_{p+1}),\ p = r, r+1, \ldots, R_n + r\}$, where $(\xi_p)_{p=r+1}^{R_n+r}$ is a sequence of interior knots with
$$\xi_1 = \cdots = \xi_r = 0 < \xi_{r+1} < \cdots < \xi_{R_n+r} < \zeta = \xi_{R_n+r+1} = \cdots = \xi_{R_n+2r}.$$
Let the basis functions be $B_r(t) = \{B_{r,1}(t), \ldots, B_{r,P_n}(t)\}^\top$, where the number of B-spline basis functions is $P_n = R_n + r$. Then $m(t)$ can be approximated by

$$m(t; \gamma) = \gamma^\top B_r(t) = \sum_{p=1}^{P_n} B_{r,p}(t)\,\gamma_p,$$

where $\gamma = (\gamma_1, \ldots, \gamma_{P_n})^\top$ is the vector of coefficients for the spline basis functions $B_r(t)$.
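As a concrete illustration of this basis construction, below is a small self-contained sketch that builds the knot vector with $r$ repeated boundary knots at 0 and $\zeta$ and evaluates all $P_n = R_n + r$ basis functions by the Cox-de Boor recursion; the function name and interface are ours, not from the paper's software.

```python
import numpy as np

def bspline_basis(x, interior_knots, zeta, order=4):
    """Evaluate the order-r B-spline basis B_r(t) on [0, zeta].

    Returns an array of shape (len(x), R_n + order), R_n = len(interior_knots).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    knots = np.concatenate([np.zeros(order), np.sort(interior_knots),
                            np.full(order, zeta)])
    # order-1 (indicator) splines; close the right-most interval at zeta
    B = np.zeros((x.size, knots.size - 1))
    for p in range(knots.size - 1):
        if knots[p] < knots[p + 1]:                  # skip zero-width intervals
            upper = (x <= knots[p + 1]) if knots[p + 1] == zeta else (x < knots[p + 1])
            B[:, p] = (knots[p] <= x) & upper
    for k in range(2, order + 1):                    # Cox-de Boor recursion on the order
        newB = np.zeros((x.size, knots.size - k))
        for p in range(knots.size - k):
            term = np.zeros(x.size)
            if knots[p + k - 1] > knots[p]:
                term += (x - knots[p]) / (knots[p + k - 1] - knots[p]) * B[:, p]
            if knots[p + k] > knots[p + 1]:
                term += (knots[p + k] - x) / (knots[p + k] - knots[p + 1]) * B[:, p + 1]
            newB[:, p] = term
        B = newB
    return B
```

With interior knots placed at the deciles of the observed $X$, as in Section 4, `bspline_basis(X, np.percentile(X, range(10, 100, 10)), zeta)` returns the design matrix $\{B_{r,p}(X_i)\}$.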

With the B-spline formulation and the features $W_i^{[j]} = \mathcal{F}\{f_i^{[j]}\}$ estimated as $\hat{W}_i^{[j]} = \mathcal{F}\{\hat{f}_{iK}^{[j]}\}$, we can estimate $m(\cdot)$ and $\beta$ by maximizing an estimated likelihood. Specifically, let

$$\ell_n(\beta, \gamma) = \sum_{i=1}^n \log\{\tilde{H}_i(\beta, \gamma)\} = \sum_{i=1}^n \left[ \Delta_i\{B_r(X_i)^\top \gamma + \hat{Z}_i^\top \beta\} - (1 + \Delta_i) \log\left\{1 + \exp(\hat{Z}_i^\top \beta) \int_0^{X_i} \exp\{\gamma^\top B_r(t)\}\,dt \right\} \right], \tag{2}$$

where $\hat{Z}_i = (U_i^\top, \hat{W}_i^{[1]}, \ldots, \hat{W}_i^{[q]})^\top$ and

$$\tilde{H}_i(\beta, \gamma) = \exp\left[\Delta_i\{B_r(X_i)^\top \gamma + \hat{Z}_i^\top \beta\}\right] \left[1 + \exp(\hat{Z}_i^\top \beta) \int_0^{X_i} \exp\{B_r(t)^\top \gamma\}\,dt\right]^{-(1+\Delta_i)}.$$

Then we may estimate β by maximizing the approximated profile likelihood

$$\hat{\beta}_{\text{MLE}} = \arg\max_\beta\, \ell_n(\beta, \hat{\gamma}(\beta)), \tag{3}$$

where $\hat{\gamma}(\beta) = \arg\max_\gamma \ell_n(\beta, \gamma)$ and MLE stands for maximum likelihood estimator. Subsequently, we may estimate $m(t)$ as

$$\hat{m}_{\text{MLE}}(t) = \hat{\gamma}_{\text{MLE}}^\top B_r(t), \quad \text{where } \hat{\gamma}_{\text{MLE}} = \hat{\gamma}(\hat{\beta}_{\text{MLE}}). \tag{4}$$
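For intuition, a minimal sketch of evaluating and maximizing the log-likelihood (2) is given below. It uses trapezoidal quadrature for $\int_0^{X_i} \exp\{\gamma^\top B_r(t)\}\,dt$ and a generic quasi-Newton maximizer over $(\gamma^\top, \beta^\top)^\top$ jointly, rather than the profile form (3) or the three-stage algorithm described later; `bspline_basis` refers to the sketch above, and the interface is illustrative.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import minimize

def neg_loglik(theta, X, Delta, Zhat, basis, Pn, n_quad=50):
    """Negative log-likelihood (2); theta stacks gamma (length Pn), then beta."""
    gamma, beta = theta[:Pn], theta[Pn:]
    lin = Zhat @ beta
    ll = 0.0
    for i in range(len(X)):
        tgrid = np.linspace(0.0, X[i], n_quad)
        # alpha(X_i) = int_0^{X_i} exp{gamma' B_r(t)} dt, by trapezoidal quadrature
        alpha_Xi = trapezoid(np.exp(basis(tgrid) @ gamma), tgrid)
        m_Xi = basis(np.array([X[i]]))[0] @ gamma
        ll += Delta[i] * (m_Xi + lin[i]) \
              - (1 + Delta[i]) * np.log1p(np.exp(lin[i]) * alpha_Xi)
    return -ll

# Joint maximization over (gamma, beta), e.g.:
# fit = minimize(neg_loglik, np.zeros(Pn + Zhat.shape[1]),
#                args=(X, Delta, Zhat, basis, Pn), method="BFGS")
```

Because $\ell_n$ is concave, a stationary point found this way is the global maximizer; the three-stage algorithm below is preferred in practice because it avoids re-evaluating the integrals at every iteration.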

The log-likelihood $\ell_n$ is concave, with negative definite Hessian

$$\ddot{\ell}_n = -\sum_{i=1}^n (1 + \Delta_i) \int_0^{X_i} \frac{\exp\{\hat{Z}_i^\top \beta + B_r(t)^\top \gamma\}}{1 + \int_0^{X_i} \exp\{\hat{Z}_i^\top \beta + B_r(u)^\top \gamma\}\,du}\, \{\hat{Q}(t) - \bar{\hat{Q}}_i\}^{\otimes 2}\,dt,$$

where $\hat{Q}(t) = (\hat{Z}_i^\top, B_r(t)^\top)^\top$ and $\bar{\hat{Q}}_i = \int_0^{X_i} \frac{\exp\{\hat{Z}_i^\top \beta + B_r(t)^\top \gamma\}}{1 + \int_0^{X_i} \exp\{\hat{Z}_i^\top \beta + B_r(u)^\top \gamma\}\,du}\, \hat{Q}(t)\,dt$.

Standard convex optimization methods can be used to solve for $\hat{\beta}_{\text{MLE}}$ and $\hat{\gamma}_{\text{MLE}}$. The integrals therein can be evaluated analytically through the calculation of indefinite integrals of the form

$$\int t^b \exp\left(\sum_{j=0}^{k} a_j t^j\right) dt, \quad 0 \le b \le k,$$

which is fairly straightforward for piecewise-constant (k = 0) or piecewise-linear (k = 1) B-splines. The integrals must be evaluated at every iteration, so we reduce the computational cost via a three-stage algorithm for the optimization problem:

  1. Compute an initial estimator for β using the U-statistic approach of Cheng et al. (1995) and an inverse-probability-of-censoring-weighted (IPCW) initial estimator for the baseline function α.

  2. Update β and α from an alternative B-spline approximation
    $\alpha(t) = \exp\{\tilde{\gamma}^\top \int_0^t B_r(s)\,ds\}$.
    Under this alternative approximation, the integral only needs to be evaluated once at initialization. We update β and $\tilde{\gamma}$ iteratively using a pseudo logistic regression trick and Newton's method.
  3. Derive the initial estimator for γ from the initial α(t), and solve for the final β and γ. Likewise, we update β and γ iteratively using a pseudo logistic regression trick and Newton's method.

We provide the details of the algorithm in Appendix F. The program is available at https://github.com/celehs/SPT.grpadalasso.
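As a concrete example of the analytic integration above: for piecewise-linear (k = 1) B-splines, $\gamma^\top B_r(t)$ is linear on each knot interval, so $\int_0^{X_i} \exp\{\gamma^\top B_r(t)\}\,dt$ decomposes into terms of the form $\int_{t_1}^{t_2} e^{a + bt}\,dt = \{e^{a + b t_2} - e^{a + b t_1}\}/b$. A hedged sketch follows; the interval-wise slopes and intercepts would be derived from $\gamma$ and the knots, and the names are ours.

```python
import numpy as np

def integral_exp_linear_spline(x_upper, knots, slopes, intercepts):
    """Closed-form evaluation of int_0^{x_upper} exp{g(t)} dt, where g is
    piecewise linear: g(t) = intercepts[p] + slopes[p] * t on [knots[p], knots[p+1]).

    On each interval, int exp(a + b t) dt = {exp(a + b t2) - exp(a + b t1)} / b
    (or (t2 - t1) exp(a) when b = 0), so no numerical quadrature is needed.
    """
    total = 0.0
    for p in range(len(knots) - 1):
        t1, t2 = knots[p], min(knots[p + 1], x_upper)
        if t2 <= t1:
            break
        a, b = intercepts[p], slopes[p]
        if abs(b) < 1e-12:
            total += (t2 - t1) * np.exp(a)
        else:
            total += (np.exp(a + b * t2) - np.exp(a + b * t1)) / b
        if t2 >= x_upper:
            break
    return total
```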

2.2.3. Feature Selection

When the dimension of Z is not small, the MLE from (3) and (4),

$$\hat{\theta}_{\text{MLE}} = (\hat{\gamma}_{\text{MLE}}^\top, \hat{\beta}_{\text{MLE}}^\top)^\top, \tag{5}$$

may suffer from high variability. On the other hand, it is highly likely that only a small number of codes are truly predictive of the event time. To overcome this challenge, one may employ a standard penalized regression approach by imposing a penalty on the model complexity. To efficiently carry out penalized estimation under the B-spline PO model, we borrow the least squares approximation (LSA) strategy proposed by Wang and Leng (2007) and update $(\gamma^\top, \beta^\top)^\top$ from the initial MLE (5):

$$\hat{\theta}_{\text{glasso}} = \arg\min_\theta\, (\theta - \hat{\theta}_{\text{MLE}})^\top \{-n^{-1} \ddot{\ell}_n(\hat{\theta}_{\text{MLE}})\} (\theta - \hat{\theta}_{\text{MLE}}) + \lambda \sum_g \frac{\|\beta_{[g]}\|_2}{\|\hat{\beta}_{\text{MLE},[g]}\|_2}, \tag{6}$$

where $\theta = (\gamma^\top, \beta^\top)^\top$, $\hat{\theta}_{\text{glasso}} = (\hat{\gamma}_{\text{glasso}}^\top, \hat{\beta}_{\text{glasso}}^\top)^\top$, $\beta_{[g]}$ is the subvector of $\beta$ corresponding to the features in group $g$, $[g]$ denotes the indices associated with group $g$, and $\|\cdot\|_2$ is the $L_2$ norm. All features derived from a given code can be joined into one group; the adaptive group lasso (glasso) penalty (Wang and Leng, 2008) employed in (6) then enables the removal of all features related to that code. The tuning parameter λ can be chosen by standard methods, including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or cross-validation.
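A minimal sketch of solving (6) by proximal gradient descent with group soft-thresholding follows; here `H` stands for the positive-definite curvature matrix $-n^{-1}\ddot{\ell}_n(\hat{\theta}_{\text{MLE}})$, only the β coordinates appear in `groups` so that γ is left unpenalized, and this plain solver is our illustration rather than the implementation in the SPT.grpadalasso package.

```python
import numpy as np

def lsa_group_lasso(theta_mle, H, groups, lam, n_iter=500):
    """Proximal-gradient solve of the LSA objective (6).

    theta_mle : initial MLE (gamma first, then beta)
    H         : positive-definite curvature matrix at the MLE
    groups    : list of index arrays into theta (beta coordinates only)
    lam       : tuning parameter lambda
    """
    weights = [lam / np.linalg.norm(theta_mle[g]) for g in groups]  # adaptive weights
    step = 0.5 / np.linalg.eigvalsh(H).max()    # safe step from the Lipschitz constant
    theta = theta_mle.copy()
    for _ in range(n_iter):
        grad = 2.0 * H @ (theta - theta_mle)    # gradient of the quadratic part
        theta = theta - step * grad
        for g, w in zip(groups, weights):       # group soft-thresholding (prox step)
            nrm = np.linalg.norm(theta[g])
            theta[g] = 0.0 if nrm <= step * w else (1.0 - step * w / nrm) * theta[g]
    return theta
```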

With $\hat{\theta}_{\text{glasso}}$ we may obtain a glasso estimator of $m(t)$ as

$$\hat{m}_{\text{glasso}}(t) = \hat{\gamma}_{\text{glasso}}^\top B_r(t).$$

For any patient with derived features $\hat{Z}$, the probability of having had the event by time $t$ can be predicted as

$$\hat{F}(t \mid \hat{Z}) = \frac{e^{\hat{\beta}^\top \hat{Z}} \int_0^t e^{\hat{m}_{\text{glasso}}(s)}\,ds}{1 + e^{\hat{\beta}^\top \hat{Z}} \int_0^t e^{\hat{m}_{\text{glasso}}(s)}\,ds}.$$

2.3. Evaluation of Prediction Performance

Based on $\hat{\pi}_t = \hat{F}(t \mid \hat{Z})$, one may derive subject-specific prediction rules for the event status and/or time. For example, one may predict the binary event status $D_t = I(T \le t)$ using $\hat{\pi}_t$, and $\Delta$ using $\hat{\pi}_C$. One may also predict $X = \int_0^C I(T \ge t)\,dt$ based on $\hat{X}_u = C(1 - \hat{\Delta}_u) + \hat{\Delta}_u \hat{X}$, for some $u$ chosen to attain a desired sensitivity or specificity level for classifying $\Delta$ based on $\hat{\Delta}_u = I(\hat{\pi}_C \ge u)$, where $\hat{X} = \int_0^C \{1 - \hat{F}(t \mid \hat{Z})\}\,dt$.
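In code, the annotation rule for a single patient might look like the sketch below, where `F_hat` is any callable implementing $\hat{F}(t \mid \hat{Z})$ for that patient (e.g., built from $\hat{\beta}$ and $\hat{m}_{\text{glasso}}$); this is an illustration of the rule just described, not part of the released software.

```python
import numpy as np
from scipy.integrate import quad

def predict_event(F_hat, C, u):
    """Annotate (X, Delta) for one patient from the fitted PO model.

    F_hat : callable t -> estimated F(t | Z_hat) for this patient (assumed given)
    C     : follow-up time;  u : threshold on pi_C = F_hat(C)
    """
    delta_u = float(F_hat(C) >= u)                   # predicted event status
    # X_hat = int_0^C {1 - F_hat(t)} dt, the predicted restricted event time
    x_hat, _ = quad(lambda t: 1.0 - F_hat(t), 0.0, C)
    return C * (1.0 - delta_u) + delta_u * x_hat, delta_u
```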

To summarize the overall performance of $(\hat{X}_u, \hat{\Delta}_u)$ in predicting $(X, \Delta)$, we may consider Kendall's-τ type rank correlation summary measures, e.g.,

$$\mathcal{C}_u = P(\hat{X}_{ui} \le \hat{X}_{uj} \mid X_i \le X_j), \qquad \mathcal{C}_u^+ = P(\hat{\Delta}_{ui} = 1, \hat{X}_{ui} \le \hat{X}_{uj} \mid \Delta_i = 1, X_i \le X_j).$$

To account for calibration, we propose to examine the absolute prediction error (APE) measure via

$$\text{APE}_u = E \int_0^{C_i} \left| I(\hat{X}_{ui} \ge t) - I(X_i \ge t) \right| dt = E \int_0^{C_i} \left| I(\hat{T}_{ui} \le t)\hat{\Delta}_{ui} - I(T_i \le t)\Delta_i \right| dt = E\left|\hat{X}_{ui} - X_i\right|.$$

$\text{APE}_u$ is an important summary measure of the quality of annotating $(X, \Delta)$, since most survival estimation procedures rely on the at-risk process $I(X_i \ge t)$ and the counting process $I(T_i \le t)\Delta_i$. The cut-off value $u$ can also be selected to minimize $\text{APE}_u$.

These accuracy measures can be estimated empirically by

$$\widehat{\text{APE}}_u = n^{-1} \sum_{i=1}^n |\hat{X}_{ui} - X_i|, \qquad \hat{\mathcal{C}}_u = \frac{\sum_{i<j} I(\hat{X}_{ui} \le \hat{X}_{uj}, X_i \le X_j)}{\sum_{i<j} I(X_i \le X_j)}, \qquad \hat{\mathcal{C}}_u^+ = \frac{\sum_{i<j} \hat{\Delta}_{ui} \Delta_i I(\hat{X}_{ui} \le \hat{X}_{uj}, X_i \le X_j)}{\sum_{i<j} \Delta_i I(X_i \le X_j)}.$$

Since $\hat{X}_u$ and $\hat{\Delta}_u$ are estimated using the same training data, such plug-in accuracy estimates may suffer from overfitting bias, especially when $n$ is not large. In such cases, standard cross-validation procedures can be used for bias correction.
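The plug-in estimates above can be computed directly from the annotated and true outcome vectors; a short sketch (vectorized over all pairs $i < j$, assuming no ties for simplicity):

```python
import numpy as np

def accuracy_measures(x_hat, delta_hat, x, delta):
    """Empirical APE, C_u, and C_u^+ from Section 2.3 (plug-in versions)."""
    ape = np.mean(np.abs(x_hat - x))
    i, j = np.triu_indices(len(x), k=1)          # index pairs with i < j
    conc = (x_hat[i] <= x_hat[j]) & (x[i] <= x[j])
    c_u = conc.sum() / (x[i] <= x[j]).sum()
    c_u_plus = (delta_hat[i] * delta[i] * conc).sum() / (delta[i] * (x[i] <= x[j])).sum()
    return ape, c_u, c_u_plus
```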

3. Asymptotic Results

The asymptotic distribution of the proposed MATA estimator is given in the theorems below, with proofs provided in the Appendix. We assume the following regularity conditions.

  • (C1)

    The density function $f_C(t)$ of the censoring time $C$ is bounded and bounded away from 0 on $[0, \zeta)$, and satisfies a Lipschitz condition of order 1 on $[0, \zeta)$. Additionally, $S_C(\zeta^-) = \lim_{\Delta \to 0^+} S_C(\zeta - \Delta) > 0$.

  • (C2)

    $m(\cdot) \in C^{(q)}([0, \zeta])$ for $q \ge 2$, and the spline order satisfies $r \ge q$.

  • (C3)
    There exists $0 < c < \infty$ such that the distances $h_p = \xi_{p+1} - \xi_p$ between neighboring knots satisfy
    $$\max_{r \le p \le R_n + r} |h_{p+1} - h_p| = o(R_n^{-1}), \qquad \max_{r \le p \le R_n + r} h_p \Big/ \min_{r \le p \le R_n + r} h_p \le c.$$
    Furthermore, the number of knots satisfies $R_n \to \infty$ as $n \to \infty$, with $R_n^2/n \to 0$ and $R_n \gg n^{1/(2q)}$.
  • (C4)

    The function $m(t)$ is bounded on $[0, \zeta]$. The pdf of the covariate $Z$ is bounded and has a compact support.

  • (C5)

    The estimated features have a fast convergence rate: $\sup_{i=1,\ldots,n} \|\hat{W}_i - W_i\| = o_p(n^{-1/2})$.

Here condition C1 implies that $S_C(t)$, the survival function of $C$, is discontinuous at $\zeta$. In practice, most studies have a preselected ending time $\zeta$ at which all patients who have not yet experienced failure are censored; this automatically leads to the discontinuity of $S_C(t)$ at $\zeta$. For studies that instead follow patients until the last patient is censored or fails, the performance of the estimated survival curve near the tail can be highly uncertain; a straightforward remedy is to manually censor all patients at, at most, the last failure time $\zeta$, which again yields a discontinuity at this point. Conditions C2 and C3 are standard smoothness and knot conditions in B-spline approximation. Condition C4 implies that both $S_T(t)$, the survival function of $T$, and $f_T(t)$, the density function of $T$, are bounded away from 0 on $[0, \zeta]$. Hence, $[0, \zeta]$ is strictly contained in the support of the failure time $T$, i.e., $[0, \zeta] \subsetneq \text{support}(T)$.

In the statements of our theory, we use $\beta_0$ to denote the underlying true parameter of model (1). Here and throughout the text, $a^{\otimes 2} \equiv a a^\top$ for any matrix or vector $a$. We first establish the consistency and asymptotic normality of the MLE (3).

Lemma 1 Under Conditions C1-C5, when β equals the truth $\beta_0$ or equals a $\sqrt{n}$-consistent estimator of $\beta_0$, the baseline function estimator

$$\hat{m}(t, \beta) = \hat{\gamma}(\beta)^\top B_r(t), \quad \text{with } \hat{\gamma}(\beta) = \arg\max_\gamma\, \ell_n(\beta, \gamma),$$

satisfies $|\hat{m}(u, \beta) - m(u)| = O_p\{(n h_p)^{-1/2} + h_p^q\} = O_p\{(n h_p)^{-1/2}\}$ uniformly in $u \in [0, \zeta]$, and, as $n \to \infty$, $\hat{\sigma}^{-1}(u, \beta_0)\{\hat{m}(u, \beta) - m(u)\} \to \text{Normal}(0, 1)$ in distribution.

Lemma 2 Under Conditions C1-C5, $\|\hat{\beta}_{\text{MLE}} - \beta_0\|_2 = O_p(n^{-1/2})$, and

$$n^{1/2}(\hat{\beta}_{\text{MLE}} - \beta_0) \to \text{Normal}\{0, V_{\text{MLE}}\}, \qquad V_{\text{MLE}} = A^{-1} \Sigma A^{-1},$$

where

$$A = E\{S_{\beta\beta,i}(\beta_0, m)\} + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, V_n(\beta_0)^{-1} E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top; \qquad \Sigma = E\Big(\big[S_{\beta,i}(\beta_0, m) + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, V_n(\beta_0)^{-1} S_{\gamma,i}(\beta_0, m)\big]^{\otimes 2}\Big),$$

and the definitions of $S_{\beta,i}$, $S_{\gamma,i}$, $S_{\beta\beta,i}$, $S_{\gamma\gamma,i}$, and $S_{\beta\gamma,i}$ are given in Appendix D.

With the consistent and asymptotically normal $\hat{\theta}_{\text{MLE}}$, we achieve the oracle property through the adaptive group LASSO with least squares approximation:

Theorem 1 Let $\mathcal{S}$ be the index set of the coefficients in the nonzero groups of $\beta_0$. Define the sub-matrices $A_{\mathcal{S},\mathcal{S}}$ and $\Sigma_{\mathcal{S},\mathcal{S}}$ with rows and columns in $\mathcal{S}$, and the sub-vectors $\hat{\beta}_{\text{glasso},\mathcal{S}}$ and $\beta_{0,\mathcal{S}}$ with elements in $\mathcal{S}$. Under Conditions C1-C5 with $n^{-1}\lambda \to 0$ and $n^{-1/2}\lambda \to \infty$, we have $\sup_{j \in \mathcal{S}^c} |\hat{\beta}_{\text{glasso},j}| = o_p(n^{-1/2})$, and

$$n^{1/2}(\hat{\beta}_{\text{glasso},\mathcal{S}} - \beta_{0,\mathcal{S}}) \to \text{Normal}\{0, V_{\text{glasso}}\}, \qquad V_{\text{glasso}} = A_{\mathcal{S},\mathcal{S}}^{-1} \Sigma_{\mathcal{S},\mathcal{S}} A_{\mathcal{S},\mathcal{S}}^{-1}.$$

Thanks to the large unlabeled data used in Step I, the uncertainty from the complex feature extraction $\hat{W}$ does not affect the asymptotic analysis of the Step II estimators. Consequently, we obtain standard regular estimators for β and m. Confidence intervals for functionals of β and m, including F(t | Z), can be obtained through the bootstrap or perturbation resampling (Efron, 1979; Jin et al., 2001).

Corollary 1 Under Conditions C1-C5, the estimation error of the predicted incidence rate satisfies

$$\hat{F}(t \mid \hat{Z}) - F(t \mid Z) = O_p(n^{-1/2} + h^q),$$

and the error is asymptotically normally distributed.

4. Simulation

We have conducted extensive simulations to evaluate the performance of our proposed MATA procedure and to compare it to existing methods, including (a) the nonparametric MLE (NPMLE) approach of Zeng et al. (2005) using the same set of derived features; (b) the tree-based method of Chubak et al. (2015), which first uses a decision tree to classify patients as having experienced the failure event or not, and then, among the patients determined to have events, assigns the event time as the earliest arrival time across all groups of medical encounters used in the decision tree; and (c) the two-step procedure of Hassett et al. (2015) and Uno et al. (2018), which first fits a logistic regression with group lasso to classify the patients and select significant groups of encounters, and then assigns the failure time for patients experiencing the event as the weighted average of the peak times of the significant encounters, with an adjustment to correct the systematic bias. Throughout, we fix the total sample size at n + N = 4000 and consider n ∈ {200, 400}.

For simplicity, we only consider the case where all patients are enrolled at the same time, as we can always shift each patient's follow-up period to a preselected origin. The censoring time of the $i$-th patient, $C_i$, is simulated from the mixture distribution $0.909\,\text{Uniform}[0, \zeta) + 0.091\,\delta_\zeta$, where $\delta_\zeta$ is the Dirac measure at $\zeta$ and $\zeta = 20$, for $i = 1, \ldots, n+N$. Intuitively, this imitates a long-term clinical study that follows patients for up to 20 years, where 90.9% of the patients leave the study at a uniform rate before it terminates and 9.1% stay until the end. We simulate the number of encounters and the encounter arrival times using the expression $\lambda_i^{[j]}(t) = \tau_i^{[j]} f_i^{[j]}(t)$ for $t \ge 0$. We consider two families of density functions $\{f_i^{[j]}(t)\}$ for the point processes: Gaussian and Gamma. In addition, for each density shape, we consider both the case where the density functions are independent across the $q = 10$ counting processes of the medical encounters and the case where they are correlated. Details on the data-generating mechanisms for the point processes are given in Appendix A of the Supplementary Materials. We then set $\tau_i^{[j]} = m_i^{[j]} + 5$ with $m_i = (m_i^{[1]}, \ldots, m_i^{[q]})^\top$ and $m_i^{[j]} = F_j^{-1}\{\Phi(\iota_{ij})\}$, where $\Phi$ is the cumulative distribution function (CDF) of the standard normal and $F_j$ is the CDF of a Gamma distribution with shape $k_{2j}$ and scale $\theta_{2j}$, $\text{Gamma}(k_{2j}, \theta_{2j})$. We let

$$k_2 = (k_{21}, \ldots, k_{2q}) = (0.6, 0.48, 0.36, 1.2, 0.6, 0.9, 0.54, 1.26, 0.45, 0.468), \qquad \theta_2 = (\theta_{21}, \ldots, \theta_{2q}) = (10, 6, 20, 4, 8, 9, 6.5, 5, 16, 14),$$

and generate

$$\iota_i = (\iota_{i1}, \ldots, \iota_{iq})^\top \sim \text{MNormal}(0, \Sigma_\iota).$$

We consider two choices of $\Sigma_\iota$: $\Sigma_\iota = I_q$ and $\Sigma_\iota = \{0.5^{|m - \ell|}\}_{1 \le m, \ell \le q}$. We further simulate encounter times

$$t_{i1}^{[j]}, \ldots, t_{i\tilde{M}_i^{[j]}}^{[j]} \sim f_i^{[j]}, \quad \text{with } \tilde{M}_i^{[j]} \sim \text{Poisson}(m_i^{[j]}) + 5,$$

but only keep those that fall into the interval $[0, \zeta]$. The final number of arrival times is thus reduced to $M_i^{[j]} = \#\{k : 0 \le t_{ik}^{[j]} \le \zeta\}$, and we relabel the retained arrival times as $t_{i1}^{[j]}, \ldots, t_{iM_i^{[j]}}^{[j]}$. A simple calculation shows $E(M_i^{[j]} \mid \tau_i^{[j]}) = \tau_i^{[j]}\,\mathrm{pr}(\omega \le \zeta)$, where $\omega \sim f_i^{[j]}$.
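A condensed sketch of this encounter-generation step for the Gaussian setting is given below; the specific $\mu$, $\sigma$, and $m$ values passed in are placeholders, with the actual distributions for these parameters specified in Appendix A1.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
zeta, n_tot = 20.0, 4000

# Censoring: 90.9% uniform dropout on [0, zeta), 9.1% administratively censored at zeta
C = np.where(rng.random(n_tot) < 0.909, rng.uniform(0, zeta, n_tot), zeta)

def simulate_encounters(mu, sigma, m, C_i):
    """Arrival times for one subject/code: Gaussian-shaped density truncated at 0,
    Poisson(m) + 5 draws, keeping only times in [0, zeta] and then in [0, C_i]."""
    n_events = rng.poisson(m) + 5
    a = (0.0 - mu) / sigma                       # standardized left truncation point
    t = truncnorm.rvs(a, np.inf, loc=mu, scale=sigma,
                      size=n_events, random_state=rng)
    t = t[t <= zeta]                             # retained arrival times
    return np.sort(t[t <= C_i])                  # encounters observable before C_i

# e.g. one code for one subject, with illustrative parameter values
times = simulate_encounters(mu=8.0, sigma=2.0, m=12.0, C_i=C[0])
```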

The event time $T_i$ is simulated from the PO model (1), where the true features are the log of the peak time and the logit of the ratio between the change-point time and the peak time of the intensity functions $\lambda_i^{[j]}(t)$ for $j = 1, \ldots, q$. Intuitively, an early peak time may result in early disease onset, and a peak time close to the change-point time may imply a quick exacerbation of the disease status. We set the nonparametric function $\log \alpha(t) = 3\log(t) + \alpha_c$ and vary $\alpha_c$ to control the censoring rate. We further set $\beta = (\beta_1^\top, \ldots, \beta_q^\top)^\top$ with $\beta_1 = (4, 3)^\top$ and $\beta_2 = \cdots = \beta_q = 0$. Consequently, only the first group of medical encounters affects the recurrence time. The estimated features are set to be the estimated peak time as well as the logit of the estimated ratio between the change-point time and the peak time. We also include the logarithm of the first encounter arrival time, a feature directly observable from the medical encounter data.
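Because $F(t \mid Z)$ in (1) is an explicit monotone function of $\alpha(t)$, event times can be drawn by inverse-CDF sampling: setting $F(T \mid Z) = V$ with $V \sim \text{Uniform}(0, 1)$ gives $\alpha(T) = V / \{(1 - V)\exp(\beta^\top Z)\}$, and with $\log \alpha(t) = 3\log(t) + \alpha_c$ this inverts in closed form. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_event_times(Z, beta, alpha_c):
    """Draw T from PO model (1) with log alpha(t) = 3 log(t) + alpha_c.

    Inverting F(T | Z) = V with V ~ Uniform(0, 1) yields
    alpha(T) = V / {(1 - V) exp(beta'Z)}, hence a closed-form cube root.
    """
    v = rng.random(Z.shape[0])
    alpha_T = v / ((1.0 - v) * np.exp(Z @ beta))   # required value of alpha(T)
    return (alpha_T * np.exp(-alpha_c)) ** (1.0 / 3.0)
```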

We summarize results over 400 replications for each configuration. Within each simulated dataset, we extract the features for both the labeled and unlabeled data of total size n + N, whereas the PO model (1) is fitted only on the labeled training data of size n. The interior knots for the B-splines in our approach are chosen to be the (10a)-th percentiles of the observed survival time X with a = 1, 2, …, 9, for both the Gaussian and Gamma cases. For the tree approach and the two-step logistic approach, denoted "Logi", we use the same input feature space as the Ẑ for the PO model (1) of MATA. To evaluate the performance of the different procedures, we simulate a validation data set of size nv = 5000 in each run and evaluate the out-of-sample prediction performance through the accuracy measures discussed in Section 2.3.

4.1. Results for the Gaussian Intensity Setting

We present the results for the Gaussian intensity setting here; parallel results for the Gamma intensity setting are given in Appendix A. The estimated probabilities of having zero and ≤ 3 arrival times under all settings, from a simulation with sample size 500,000, are given in Table A4 in Appendix A. As a benchmark, we also present results from fitting the PO model with the true feature sets. For the true feature sets, we report in Table 1 the bias and standard error (se) of the non-zero coefficients, i.e., $\beta_1 = (\beta_{11}, \beta_{12})^\top$, from MATA and NPMLE. In general, we find that the MATA procedure performs well with small sample sizes regardless of the censoring rate, the correlation structure between groups of encounters, and the family of the intensity curves. MATA generally yields both smaller bias and smaller standard error than the NPMLE. In the extreme case where n = 200 and the censoring rate reaches 70%, leaving an effective sample size of 60, both estimators deteriorate. However, the resulting 95% confidence interval of MATA covers the truth, as the absolute bias is less than 1.96 times the standard error. In contrast, the NPMLE has a smaller standard error in this extreme case, but its absolute bias is more than twice its standard error. These results are consistent with Theorem 2.

Table 1.

Displayed are the bias and standard error of the estimates of $\beta_1 = (\beta_{11}, \beta_{12})^\top$ fitted with the true features, from 400 simulations, each with N + n = 4,000 and n = 200 or 400. Two methods, MATA and NPMLE, are contrasted. Panels from top to bottom are Gaussian intensities with the subject-specific follow-up duration under 30% and 70% censoring rates, as discussed in Section A1. Results under independent groups of encounters are shown on the left, and results for correlated ones are shown on the right.

Indp Corr
β11 β12 β11 β12
Bias se Bias se Bias se Bias se
Gaussian, 30% censoring rate

n = 200 MATA −0.060 0.404 −0.072 0.282 −0.053 0.418 −0.083 0.292
NPMLE −0.355 0.692 −0.216 0.550 −0.359 0.716 −0.232 0.495
n = 400 MATA 0.017 0.271 −0.020 0.183 0.030 0.258 −0.027 0.177
NPMLE −0.036 0.373 0.000 0.259 −0.028 0.385 −0.001 0.242

Gaussian, 70% censoring rate

n = 200 MATA −0.408 0.893 −0.305 0.582 −0.352 0.776 −0.277 0.532
NPMLE 1.449 0.562 1.172 0.382 1.448 0.574 1.167 0.361
n = 400 MATA −0.082 0.440 −0.081 0.279 −0.088 0.403 −0.095 0.280
NPMLE 1.698 0.270 1.338 0.161 1.708 0.247 1.341 0.140

For both the true and estimated feature sets, we computed the out-of-sample accuracy measures discussed in Section 2.3 on a validation data set. These accuracy measures, i.e., the Kendall's-τ type rank correlation measures $\mathcal{C}_u$, $\mathcal{C}_u^+$ and the absolute prediction error $\text{APE}_u$, depend on $u$, which is easy to control for MATA and NPMLE but not for Tree and Logi. We therefore minimize the cross-validation error for the Tree approach and the misclassification rate for the Logi approach at their first step, i.e., classifying the censoring status Δ. For MATA and NPMLE, we calculate these accuracy measures at $u = 0.02\ell$ for $\ell = 0, 1, \ldots, 50$ and pick the $u$ with minimum $\text{APE}_u$. We then compare these measures at the selected $u$ with the Tree and Logi methods in Tables 2 and 3.

Table 2.

Kendall’s-τ type rank correlation summary measures (𝒞 and 𝒞+), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gaussian intensities over 400 simulations each with n + N = 4, 000 and n = 200 or 400. The PO model is fitted with the true features. The upper two panels display the result under independent intensities with 30% and 70% censoring rate, respectively; the lower two panels display the result under correlated intensities with 30% and 70% censoring rate, respectively.

n = 200 n = 400
MATA NPMLE Tree Logi MATA NPMLE Tree Logi
Gaussian, 30%, independent counting processes, true features

𝒞 0.901 (0.002) 0.894 (0.003) 0.791 (0.022) 0.719 (0.042) 0.901 (0.002) 0.898 (0.002) 0.791 (0.018) 0.718 (0.038)
𝒞+ 0.868 (0.006) 0.864 (0.006) 0.742 (0.027) 0.718 (0.029) 0.868 (0.005) 0.867 (0.005) 0.747 (0.018) 0.725 (0.022)
APE 0.971 (0.032) 1.049 (0.050) 1.978 (0.368) 3.373 (0.947) 0.962 (0.027) 1.001 (0.031) 1.884 (0.125) 3.406 (0.917)

Gaussian, 70%, independent counting processes, true features

𝒞 0.932 (0.004) 0.921 (0.005) 0.867 (0.014) 0.859 (0.023) 0.933 (0.002) 0.926 (0.003) 0.872 (0.010) 0.859 (0.021)
𝒞+ 0.782 (0.021) 0.757 (0.023) 0.705 (0.064) 0.682 (0.086) 0.789 (0.014) 0.761 (0.015) 0.716 (0.051) 0.699 (0.067)
APE 0.772 (0.049) 0.899 (0.057) 1.479 (0.162) 1.590 (0.268) 0.751 (0.030) 0.840 (0.037) 1.413 (0.121) 1.576 (0.234)

Gaussian, 30%, correlated counting processes, true features

𝒞 0.900 (0.003) 0.894 (0.003) 0.792 (0.023) 0.720 (0.042) 0.901 (0.002) 0.898 (0.002) 0.791 (0.018) 0.724 (0.039)
𝒞+ 0.866 (0.006) 0.862 (0.007) 0.741 (0.029) 0.719 (0.027) 0.867 (0.005) 0.865 (0.005) 0.748 (0.017) 0.727 (0.022)
APE 0.972 (0.035) 1.050 (0.046) 1.994 (0.434) 3.383 (0.962) 0.963 (0.026) 1.002 (0.031) 1.871 (0.114) 3.278 (0.976)

Gaussian, 70%, correlated counting processes, true features

𝒞 0.932 (0.004) 0.921 (0.005) 0.868 (0.015) 0.859 (0.025) 0.933 (0.002) 0.926 (0.003) 0.873 (0.011) 0.861 (0.022)
𝒞+ 0.782 (0.020) 0.758 (0.022) 0.698 (0.066) 0.679 (0.084) 0.789 (0.013) 0.762 (0.016) 0.720 (0.050) 0.701 (0.068)
APE 0.771 (0.051) 0.896 (0.057) 1.473 (0.175) 1.599 (0.280) 0.750 (0.029) 0.838 (0.038) 1.395 (0.121) 1.548 (0.247)

Table 3.

Estimated features, Gaussian. Kendall’s-τ type rank correlation summary measures (𝒞 and 𝒞+), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gaussian intensities over 400 simulations each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the estimated features derived from FPCA approach in Section 2.2.1. The upper two panels display the result under independent intensities with 30% and 70% censoring rate, respectively; the lower two panels display the result under correlated intensities with 30% and 70% censoring rate, respectively.

n = 200 n = 400
MATA NPMLE Tree Logi MATA NPMLE Tree Logi
Gaussian, 30%, independent counting processes, estimated features

𝒞 0.781 (0.020) 0.768 (0.011) 0.790 (0.019) 0.740 (0.036) 0.791 (0.008) 0.784 (0.006) 0.788 (0.022) 0.744 (0.027)
𝒞+ 0.701 (0.028) 0.667 (0.017) 0.690 (0.019) 0.654 (0.036) 0.675 (0.027) 0.680 (0.010) 0.692 (0.016) 0.653 (0.028)
APE 2.148 (0.332) 2.184 (0.132) 1.913 (0.139) 2.409 (0.540) 1.960 (0.236) 2.019 (0.074) 1.904 (0.125) 2.272 (0.315)

Gaussian, 70%, independent counting processes, estimated features

𝒞 0.839 (0.018) 0.833 (0.010) 0.802 (0.020) 0.827 (0.015) 0.853 (0.008) 0.845 (0.006) 0.806 (0.019) 0.832 (0.013)
𝒞+ 0.397 (0.111) 0.413 (0.040) 0.536 (0.107) 0.500 (0.107) 0.468 (0.042) 0.447 (0.025) 0.555 (0.072) 0.511 (0.090)
APE 1.868 (0.253) 1.938 (0.122) 2.276 (0.248) 1.992 (0.216) 1.685 (0.106) 1.775 (0.071) 2.202 (0.223) 1.907 (0.177)

Gaussian, 30%, correlated counting processes, estimated features

𝒞 0.781 (0.019) 0.768 (0.011) 0.789 (0.021) 0.743 (0.032) 0.791 (0.010) 0.783 (0.006) 0.791 (0.016) 0.747 (0.022)
𝒞+ 0.701 (0.028) 0.669 (0.016) 0.690 (0.019) 0.656 (0.036) 0.672 (0.030) 0.680 (0.011) 0.693 (0.014) 0.654 (0.027)
APE 2.142 (0.323) 2.180 (0.122) 1.915 (0.145) 2.352 (0.418) 1.958 (0.257) 2.018 (0.072) 1.886 (0.099) 2.243 (0.194)

Gaussian, 70%, correlated counting processes, estimated features

𝒞 0.838 (0.018) 0.832 (0.009) 0.802 (0.021) 0.827 (0.014) 0.852 (0.007) 0.846 (0.005) 0.804 (0.017) 0.832 (0.012)
𝒞+ 0.393 (0.109) 0.430 (0.042) 0.533 (0.119) 0.500 (0.110) 0.468 (0.041) 0.450 (0.026) 0.564 (0.070) 0.519 (0.083)
APE 1.879 (0.248) 1.944 (0.120) 2.289 (0.253) 1.987 (0.211) 1.688 (0.102) 1.772 (0.071) 2.225 (0.193) 1.898 (0.168)

The performance of the MATA estimator fitted with the true features largely dominates that of NPMLE, Tree, and Logi, with higher $\mathcal{C}$, $\mathcal{C}^+$ and lower APE in all cases under Gaussian intensities. When fitted with the estimated features, there is no clear winner among the four methods when the labeled data size is n = 200; however, when the labeled data size increases to n = 400, MATA generally outperforms the other three approaches in terms of APE.

5. Example

We applied our MATA algorithm to the extraction of cancer recurrence times for the Veterans Affairs Central Cancer Registry (VACCR). We obtained from the VACCR 36,705 patients diagnosed with stage I to III lung cancer before 2018 and followed through 2019, among whom 3,752 diagnosed in 2000-2018 with cancer stage information had annotated dates for cancer recurrence. Through the research platform of the Office of Research & Development at the Department of Veterans Affairs, the cancer registry data were linked to the EHR of the Veterans Affairs healthcare system, containing diagnoses, procedures, medications, lab tests, and medical notes.

The gold-standard recurrence status was collected through manual abstraction and the tumor registry for the VACCR data. In addition, baseline covariates including age at diagnosis, gender, and cancer stage were extracted. Due to the predominance of male patients in the VACCR (97.7% among the 3,752), we excluded gender from the subsequent analysis. We randomly selected 1,000 patients as training data and used the remaining 2,752 as validation data. To assess how performance changes with the size of the training data, we also considered smaller training sets of n = 200 and n = 400 sub-sampled from the n = 1000 set. We drew 400 bootstrap samples from the 1,000 training samples for each n to quantify the variability of the analysis. We selected the month as the time unit and focused on recurrence within 2 years. Patients without recurrence before 24 months from the diagnosis date were censored at 24 months; the censoring rate was 39%. The diagnosis, procedure, and medication codes and the mentions in medical notes associated with the following nine events were collected: lung cancer; chemotherapy; computerized tomography (CT) scan; radiotherapy; secondary malignant neoplasm; palliative or hospice care; recurrence; medications for systemic therapies (including cytotoxic therapies, targeted therapies, and immunotherapies); and biopsy or excision. See Table A6 for a detailed summary of the sparsity of the nine groups of medical encounters.

For each of the nine selected events except hospice care, we estimate the subject-specific intensity function on the training and unlabeled data sets by applying the FPCA approach described in Section 2.2.1, and then use the resulting basis functions to project the intensity functions for the validation set. The peak and change-point times of the estimated intensity functions are then extracted as features. In addition, the first code arrival time, the first FPCA score, and the total number of diagnosis and procedure codes are added as features for each event. All of these features except the FPCA score are log-transformed to reduce skewness. Radiotherapy, medication for systemic therapies, and biopsy/excision have zero-code rates of 77.8%, 70.7%, and 96.4%, respectively. Consequently, the estimated peak and largest-increase times for these events are identical to the associated first occurrence time for most patients; thus, only the first occurrence time and the total number of diagnosis and procedure codes are considered for these events. Finally, to overcome the potential collinearity of the extracted features from the same group (i.e., event), we further run a principal component analysis on each group of features and keep the first few principal components whose proportion of variation explained exceeds 90%.

As in the simulations, we fit the decision tree to minimize the cross-validation error for the Tree approach and fit the logistic regression model for Logi. For MATA, NPMLE, and Logi, we take a fine grid of false positive rates (FPR) for classifying Δ and compute all other accuracy measures in Section 2.3 at each value of the FPR. We then report the results at the FPR matching that of the Tree approach.

The prediction accuracy is summarized in Table 4. For the measures regarding the timing of recurrence, our MATA estimator dominates the other three approaches, with larger $\mathcal{C}$, $\mathcal{C}^+$ and smaller APE.

Table 4.

Mean and bootstrap standard deviation of Kendall’s-τ type rank correlation summary measures (𝒞 and 𝒞+), and absolute prediction error (APE) for the medical encounter data analyzed in Section 5 under the four approaches, i.e., MATA, NPMLE, Tree, and Logi, over 400 bootstrap simulations.

n = 1000
MATA NPMLE Tree Logi

𝒞 0.810 (0.010) 0.809 (0.009) 0.755 (0.032) 0.762 (0.015)
𝒞+ 0.688 (0.016) 0.690 (0.017) 0.646 (0.049) 0.650 (0.010)
APE 3.326 (0.051) 3.399 (0.074) 5.693 (0.960) 4.317 (0.152)

n = 400
MATA NPMLE Tree Logi

𝒞 0.796 (0.013) 0.795 (0.013) 0.748 (0.033) 0.762 (0.059)
𝒞+ 0.663 (0.026) 0.662 (0.027) 0.644 (0.044) 0.646 (0.044)
APE 3.576 (0.094) 3.615 (0.129) 5.880 (1.159) 4.653 (0.577)

n = 200
MATA NPMLE Tree Logi

𝒞 0.791 (0.025) 0.784 (0.024) 0.748 (0.039) 0.859 (0.127)
𝒞+ 0.624 (0.057) 0.624 (0.043) 0.640 (0.044) 0.681 (0.090)
APE 3.842 (0.317) 3.945 (0.286) 6.094 (1.418) 5.736 (1.080)

Through its variable selection feature, MATA excluded stage II cancer from the n = 1000 analysis and, additionally, stage III cancer, age at diagnosis, and medication for systemic therapies from the n = 200 and n = 400 analyses. The selection is consistent with the NPMLE results of the n = 1000 analysis, as the excluded features coincide with the feature groups with no p-value < 0.05. Additional details on feature selection are given in Appendix B.

6. Discussion

We proposed the MATA method to automatically extract patients' longitudinal characteristics from medical encounter data, and to further build a risk prediction model for the occurrence time of clinical events using the extracted features. The approach integrates both labeled and unlabeled data to obtain the longitudinal features, thereby tackling the sparsity of the medical encounter data. In addition, the FPCA approach largely preserves the flexibility of the resulting subject-specific intensity functions. In practice, the intensity functions often differ in shape between female and male patients, or between young and elderly patients; therefore, a multiplicative intensity model that attributes the heterogeneity among patients solely to subject-specific random effects may not be adequate. The fitted risk prediction model is chosen to be the proportional odds model, with the nonparametric function approximated by B-splines under a transformation that ensures its monotonicity. The resulting estimator of the parametric part is shown to be root-n consistent, and the estimator of the non-parametric function is consistent, under the correctly specified model. Although the proportional odds model is adopted here, our proofs can easily be extended to other semiparametric transformation models, such as the proportional hazards model. In the presence of large feature sets, we propose to use the group lasso with LSA for feature selection. The finite-sample performance of our approach is studied under various settings.

Here, the FPCA is applied to each group of medical encounters separately for feature extraction. However, different groups are potentially related, and separate estimation may fail to capture such relationships. A potential future direction is to use multivariate FPCA to directly address the covariation among different groups. Although various multivariate FPCA approaches exist, none of them handles the situation arising with medical encounters, where the encounter arrival times, rather than the underlying intensity functions, are observed. Much effort is needed to develop multivariate FPCA methodology and theory applicable in this setting.

The spline model works well with only a few knots. The small-sample performance of our estimator is studied in various simulations, with the prediction accuracy examined via C-statistics (Uno et al., 2011) and the Brier score on simulated validation data sets.

The adoption of the PO model is for simplicity of illustration; the theory for our estimator can be easily generalized to arbitrary linear transformation models. As the goal of this paper is to annotate event times within the observation window, the medical encounter data involved may extend beyond the actual event time.

Appendices

In Appendix A, we present additional simulation studies with Gamma intensities, as well as extra information on the simulation settings. In Appendix B, we offer additional details on the data example of lung cancer recurrence with VACCR data. In Appendix C, we provide the theoretical properties for the derived features. In Appendix D, we provide the theoretical properties for the MATA estimator based on the proportional odds model. In Appendix F, we provide the detailed algorithm for optimization of the log-likelihood ln.

Appendix A. Additional Simulation Details

A1. Simulation Settings for the Gaussian Intensities

We first simulate Gaussian-shaped densities, i.e., $f_i^{[j]}$ is the density function of $\text{Normal}(\mu_{ij}, \sigma_{ij}^2)$ truncated at 0.

Set $\mu_{ij} = F_j^{-1}\{\Phi(\nu_{ij})\}$, where $F_j$ is the CDF of $\text{Gamma}(k_{1j}, \theta_{1j})$ with $k_{1j} \sim \text{Uniform}(3, 6)$ and $\theta_{1j} \sim \text{Uniform}(2, 3)$ for $j = 1, \ldots, q$, and $\nu_i = (\nu_{i1}, \ldots, \nu_{iq})^\top \sim \text{MNormal}(0, \Sigma_\nu)$, the multivariate normal distribution with mean 0 and covariance $\Sigma_\nu$. For simplicity, we set $\Sigma_\nu = \Sigma_\iota$. We further set $\mu_{ij}$ to one if it is less than one.

Simulate $\sigma_{ij} \sim \text{Uniform}(0.5, s_j)$ with $s_j = \min\{0.9\,\mu_{ij}, F_j^{-1}(0.5)\}$, where $F_j$ is the CDF of $\text{Gamma}(k_{1j}, \theta_{1j})$. The way we simulate $\mu_{ij}$ and $\sigma_{ij}$ guarantees that the largest change in the intensity function occurs only after patients enter the study, i.e., $\mu_{ij} - \sigma_{ij} > 0$, as expected in practice. Moreover, the simulated $\sigma_{ij}$ is controlled not only by the value of $\mu_{ij}$ but also by the median of $\text{Gamma}(k_{1j}, \theta_{1j})$; thus $\sigma_{ij}$ will not become too extreme even with a large peak time $\mu_{ij}$. In other words, the largest change in the intensity function, at $\mu_{ij} - \sigma_{ij}$, is more likely to occur near the peak time $\mu_{ij}$ than far before it.

Finally, we set $\alpha_c$, the constant in the nonparametric function $\alpha(t)$, to 7.5 and 1.1 to obtain approximately 30% and 70% censoring rates, respectively.

A2. Simulation Settings for the Gamma Intensity functions

We also consider Gamma-shaped densities, i.e., $f_i^{[j]}(t)$ is the density function of $\text{Gamma}(k_{ij}, \theta_{ij})$ truncated at 0. Set $k_{ij} = F_j^{-1}\{\Phi(\nu_{ij})\}$, where $F_j$ is the CDF of $\text{Uniform}(k_{\ell,j}, k_{u,j})$ with $k_{\ell,j} \sim \text{Uniform}(2, 4)$ and $k_{u,j} \sim \text{Uniform}(4, 6)$, and $\nu_i = (\nu_{i1}, \ldots, \nu_{iq})^\top \sim \text{MNormal}(0, \Sigma_\nu)$. Generate $\theta_{ij}$ from $\text{Gamma}(a_j, b_j)$ truncated at its third quartile, with $a_j \sim \text{Uniform}(3, 6)$ and $b_j \sim \text{Uniform}(2, 4)$. We set $\alpha_c = 6.8$ and $1.9$ to obtain the approximate 30% and 70% censoring rates.

Table A1.

Displayed are the bias and standard error of the estimates of $\beta_1 = (\beta_{11}, \beta_{12})^\top$ fitted with the true features, from 400 simulations, each with N + n = 4,000 and n = 200 or 400. Two methods, MATA and NPMLE, are contrasted. Panels from top to bottom are Gamma intensities with the subject-specific follow-up duration under 30% and 70% censoring rates, as discussed in Section A2. Results under independent groups of encounters are shown on the left, and results for correlated ones are shown on the right.

Indp Corr
β11 β12 β11 β12
Bias se Bias se Bias se Bias se
Gamma, 30% censoring rate

n = 200 MATA −0.086 0.443 −0.084 0.410 −0.091 0.447 −0.126 0.436
NPMLE −0.480 0.581 −0.301 0.510 −0.482 0.549 −0.349 0.541
n = 400 MATA −0.006 0.296 −0.032 0.271 0.019 0.283 −0.067 0.266
NPMLE −0.223 0.374 −0.138 0.315 −0.190 0.346 −0.167 0.340

Gamma, 70% censoring rate

n = 200 MATA −0.383 0.743 −0.299 0.676 −0.400 0.783 −0.339 0.666
NPMLE 0.336 0.731 0.284 0.592 0.325 0.866 0.267 0.670
n = 400 MATA −0.109 0.399 −0.070 0.328 −0.074 0.410 −0.112 0.344
NPMLE 0.708 0.429 0.551 0.330 0.744 0.385 0.533 0.352

A3. Results for Gamma Intensity setting

For the true feature sets, we report the bias and standard error (se) of the non-zero coefficients, i.e., $\beta_1 = (\beta_{11}, \beta_{12})^\top$, from MATA and NPMLE in Table A1. As in the Gaussian intensity settings, we find that the MATA procedure performs well with small sample sizes regardless of the censoring rate, the correlation structure between groups of encounters, and the family of the intensity curves. MATA generally yields both smaller bias and smaller standard error than the NPMLE. In the extreme case where n = 200 and the censoring rate reaches 70%, both estimators deteriorate. However, the resulting 95% confidence interval of MATA covers the truth, as the absolute bias is less than 1.96 times the standard error. In contrast, the NPMLE tends to be numerically unstable: its estimation bias in the n = 400 setting is larger than its own standard error and than the bias in the n = 200 setting. These results are consistent with Theorem 2.

For both the true and estimated feature sets, we computed the out-of-sample accuracy measures discussed in Section 2.3 on a validation data set, selecting the threshold $u$ exactly as in Section 4.1 (minimizing the cross-validation error for Tree and the misclassification rate for Logi at their first step, and minimizing $\text{APE}_u$ over $u = 0.02\ell$, $\ell = 0, 1, \ldots, 50$, for MATA and NPMLE). We then compare these measures at the selected $u$ with the Tree and Logi methods in Tables A2 and A3.

Similar to the Gaussian intensity setting, the performance of the MATA estimator fitted with the true features largely dominates that of NPMLE, Tree, and Logi, with higher $\mathcal{C}$, $\mathcal{C}^+$ and lower APE in all cases except when the encounters are simulated from independent Gamma counting processes with a 30% censoring rate. In this exceptional case, our MATA estimator has only a very minor advantage in $\mathcal{C}^+$ compared to NPMLE, and is still better in terms of $\mathcal{C}$ and APE. When fitted with the estimated features, there is no clear winner among the four methods when the labeled data size is n = 200; however, when the labeled data size increases to n = 400, MATA generally outperforms the other three approaches in terms of APE.

Table A2.

True features, Gamma. Kendall’s-τ type rank correlation summary measures (𝒞 and 𝒞+), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gamma intensities over 400 simulations each with n + N = 4, 000 and n = 200 or 400. The PO model is fitted with the true features. The upper two panels display the result under independent intensities with 30% and 70% censoring rate, respectively; the lower two panels display the result under correlated intensities with 30% and 70% censoring rate, respectively.

n = 200 n = 400
MATA NPMLE Tree Logi MATA NPMLE Tree Logi
Gamma, 30%, independent counting processes, true features

𝒞 0.872 (0.003) 0.864 (0.004) 0.720 (0.040) 0.678 (0.074) 0.873 (0.003) 0.869 (0.003) 0.731 (0.023) 0.683 (0.071)
𝒞+ 0.814 (0.008) 0.812 (0.008) 0.617 (0.047) 0.658 (0.048) 0.814 (0.006) 0.815 (0.007) 0.636 (0.026) 0.667 (0.043)
APE 1.318 (0.038) 1.410 (0.048) 3.826 (1.011) 4.937 (1.844) 1.307 (0.031) 1.350 (0.036) 3.434 (0.578) 4.784 (1.835)

Gamma, 70%, independent counting processes, true features

𝒞 0.922 (0.004) 0.914 (0.004) 0.876 (0.010) 0.837 (0.042) 0.924 (0.002) 0.919 (0.003) 0.880 (0.008) 0.841 (0.040)
𝒞+ 0.735 (0.021) 0.701 (0.024) 0.618 (0.078) 0.604 (0.106) 0.743 (0.014) 0.723 (0.017) 0.630 (0.053) 0.627 (0.085)
APE 0.892 (0.049) 0.995 (0.054) 1.436 (0.125) 1.986 (0.539) 0.869 (0.030) 0.930 (0.037) 1.388 (0.090) 1.917 (0.518)

Gamma, 30%, correlated counting processes, true features

𝒞 0.872 (0.003) 0.864 (0.004) 0.720 (0.040) 0.685 (0.072) 0.873 (0.002) 0.869 (0.003) 0.731 (0.026) 0.684 (0.068)
𝒞+ 0.814 (0.008) 0.813 (0.009) 0.617 (0.047) 0.662 (0.049) 0.816 (0.006) 0.813 (0.007) 0.635 (0.031) 0.668 (0.041)
APE 1.320 (0.041) 1.408 (0.049) 3.819 (1.022) 4.826 (1.842) 1.307 (0.030) 1.350 (0.034) 3.500 (0.689) 4.808 (1.810)

Gamma, 70%, correlated counting processes, true features

𝒞 0.921 (0.005) 0.913 (0.004) 0.875 (0.011) 0.841 (0.044) 0.924 (0.003) 0.919 (0.003) 0.879 (0.007) 0.848 (0.038)
𝒞+ 0.733 (0.025) 0.702 (0.024) 0.616 (0.072) 0.612 (0.099) 0.742 (0.015) 0.724 (0.016) 0.628 (0.053) 0.622 (0.086)
APE 0.900 (0.056) 0.996 (0.053) 1.447 (0.126) 1.933 (0.572) 0.872 (0.032) 0.929 (0.036) 1.390 (0.086) 1.833 (0.488)
Table A3.

Estimated features, Gamma. Kendall’s-τ type rank correlation summary measures (𝒞 and 𝒞+), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gamma intensities over 400 simulations each with n + N = 4, 000 and n = 200 or 400. The PO model is fitted with the estimated features derived from FPCA approach in Section 2.2.1. The upper two panels display the result under independent intensities with 30% and 70% censoring rate, respectively; the lower two panels display the result under correlated intensities with 30% and 70% censoring rate, respectively.

n = 200 n = 400
MATA NPMLE Tree Logi MATA NPMLE Tree Logi
Gamma, 30%, independent counting processes, estimated features

𝒞 0.728 (0.027) 0.720 (0.010) 0.650 (0.043) 0.668 (0.059) 0.749 (0.011) 0.737 (0.007) 0.667 (0.045) 0.659 (0.056)
𝒞+ 0.555 (0.037) 0.578 (0.015) 0.456 (0.074) 0.570 (0.078) 0.573 (0.018) 0.589 (0.010) 0.480 (0.072) 0.558 (0.063)
APE 2.732 (0.544) 2.698 (0.121) 4.164 (1.191) 5.018 (1.659) 2.443 (0.218) 2.528 (0.074) 3.840 (1.095) 4.858 (1.330)

Gamma, 70%, independent counting processes, estimated features

𝒞 0.833 (0.016) 0.827 (0.011) 0.829 (0.017) 0.822 (0.019) 0.849 (0.010) 0.841 (0.006) 0.831 (0.018) 0.829 (0.014)
𝒞+ 0.277 (0.115) 0.325 (0.043) 0.485 (0.139) 0.425 (0.119) 0.371 (0.058) 0.374 (0.027) 0.519 (0.070) 0.450 (0.089)
APE 2.002 (0.223) 2.040 (0.136) 2.006 (0.202) 2.086 (0.254) 1.766 (0.141) 1.866 (0.077) 1.964 (0.195) 1.973 (0.166)

Gamma, 30%, correlated counting processes, estimated features

𝒞 0.731 (0.024) 0.720 (0.011) 0.656 (0.046) 0.670 (0.053) 0.749 (0.010) 0.737 (0.007) 0.672 (0.045) 0.665 (0.057)
𝒞+ 0.568 (0.038) 0.577 (0.016) 0.453 (0.076) 0.560 (0.073) 0.579 (0.020) 0.588 (0.011) 0.485 (0.072) 0.561 (0.063)
APE 2.681 (0.494) 2.691 (0.127) 4.133 (1.176) 4.770 (1.522) 2.451 (0.244) 2.522 (0.073) 3.778 (1.076) 4.874 (1.354)

Gamma, 70%, correlated counting processes, estimated features

𝒞 0.833 (0.016) 0.826 (0.011) 0.828 (0.018) 0.822 (0.017) 0.849 (0.009) 0.840 (0.006) 0.831 (0.017) 0.829 (0.012)
𝒞+ 0.283 (0.107) 0.322 (0.044) 0.484 (0.136) 0.421 (0.124) 0.366 (0.060) 0.388 (0.028) 0.515 (0.076) 0.442 (0.094)
APE 1.996 (0.214) 2.053 (0.135) 2.017 (0.213) 2.088 (0.236) 1.766 (0.128) 1.868 (0.075) 1.963 (0.180) 1.975 (0.156)
Table A4.

Estimated probabilities of having zero or ≤ 3 encounter arrival times under each counting process 𝒩[j], j = 1, …, 10, from a simulation with sample size 500,000.

𝒩[1] 𝒩[2] 𝒩[3] 𝒩[4] 𝒩[5] 𝒩[6] 𝒩[7] 𝒩[8] 𝒩[9] 𝒩[10]
Probability of zero encounters

Indp Gaussian 0.508 0.802 0.793 0.767 0.878 0.758 0.474 0.594 0.818 0.755
Corr Gaussian 0.508 0.802 0.792 0.766 0.879 0.758 0.474 0.595 0.817 0.755
Indp Gamma 0.761 0.805 0.716 0.786 0.750 0.944 0.712 0.810 0.939 0.750
Corr Gamma 0.761 0.806 0.715 0.786 0.749 0.943 0.713 0.812 0.938 0.750

Probability of ≤ 3 encounters

Indp Gaussian 0.720 0.945 0.926 0.930 0.971 0.918 0.680 0.778 0.938 0.902
Corr Gaussian 0.721 0.946 0.926 0.930 0.972 0.918 0.681 0.778 0.938 0.902
Indp Gamma 0.938 0.962 0.913 0.939 0.933 0.995 0.935 0.970 0.995 0.931
Corr Gamma 0.938 0.962 0.913 0.939 0.933 0.995 0.936 0.970 0.995 0.931
Table A5.

Average model sizes selected by MATA with AIC and BIC tuning, using true features (Tr Ft) or estimated features (Est Ft), under independent (Indp) and correlated (Corr) intensities with 30% and 70% censoring rates.

30% Indp 70% Indp 30% Corr 70% Corr
Tr Ft Est Ft Tr Ft Est Ft Tr Ft Est Ft Tr Ft Est Ft
Gaussian

n = 200 AIC 13.24 15.57 13.74 17.09 13.30 15.91 13.75 17.41
BIC 13.07 15.45 13.45 16.87 13.09 15.83 13.39 17.27
n = 400 AIC 13.15 14.40 13.30 14.57 13.18 14.39 13.31 14.77
BIC 13.00 14.05 13.01 14.17 13.00 14.12 13.00 14.22

Gamma

n = 200 AIC 13.38 18.37 13.88 19.35 13.41 18.04 13.94 19.57
BIC 13.23 18.22 13.65 19.01 13.29 17.88 13.72 19.21
n = 400 AIC 13.24 15.02 13.20 15.09 13.21 15.04 13.34 14.97
BIC 13.00 14.80 13.01 14.78 13.01 14.72 13.02 14.72
Supplementary Results on Simulations

We show the sparsity of the simulated data in Table A4 and the average model sizes selected by MATA in Table A5.

Table A6.

Sparsity of the nine groups of medical encounter data analyzed in Section 5.

Feature Zero ≤ 3 times
Lung Cancer 0.014 0.087
Chemotherapy 0.567 0.736
CT Scan 0.127 0.363
Radiotherapy 0.778 0.912
Secondary Malignant Neoplasm 0.554 0.856
Palliative or Hospice Care 0.576 0.888
Recurrence 0.279 0.723
Medication 0.707 0.824
Biopsy or Excision 0.964 1.000

Appendix B. Additional Details on Data Example

We show the sparsity of the features in Table A6. Radiotherapy, medication for systemic therapies, and biopsy/excision have zero-code rates of 77.8%, 70.7%, and 96.4%, respectively. Consequently, the estimated peak and largest-increase times of these features are identical to the associated first occurrence time for most patients. Thus, only the first occurrence time and the total number of diagnosis and procedure codes are considered for these features.

We show the MATA and NPMLE coefficients for n = 1000, 400, and 200 in Tables A7–A9. As in Section 4, our MATA estimator has smaller bootstrap standard errors than the NPMLE. For the analysis with n = 1000, both MATA and NPMLE show a significant impact of the first arrival time and peak time of the lung cancer code, the first arrival time and first FPCA score of the chemotherapy code, the first arrival time of the radiotherapy code, the total number of secondary malignant neoplasm codes, the peak and change point times of palliative or hospice care in medical notes, the first FPCA score and total number of recurrence mentions in medical notes, and the first arrival time of biopsy or excision. MATA additionally finds the change point time of the lung cancer code to be strongly associated with high risk of lung cancer recurrence. Furthermore, MATA excludes stage II cancer, which coincides with the large p-values on those four groups of encounters under NPMLE. For the analyses with n = 200 and n = 400, MATA excludes cancer stage, age at diagnosis, and medication for systemic therapies, which coincides with the groups without any significant feature in the n = 1000 NPMLE analysis.

Table A7.

Analysis with n = 1000. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval") over 400 bootstraps for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All features regarding time are on the log scale. The results for the proposed MATA estimator are given in the left panel and those of NPMLE in the right panel.

MATA NPMLE
Group Feature est boot.se pval est boot.se pval
Stage II 0.075 0.144 0.604
Stage III 0.168 0.168 0.319 0.160 0.181 0.379
Age 0.013 0.008 0.111 0.013 0.007 0.069

Lung Cancer 1stCode −0.277 0.116 0.017 −0.294 0.116 0.011
Pk 0.213 0.084 0.012 0.213 0.089 0.016
ChP 0.135 0.065 0.040 0.131 0.068 0.054
1stScore −0.091 0.183 0.619 −0.028 0.204 0.891
logN 0.072 0.108 0.502 0.070 0.121 0.561

Chemotherapy 1stCode −0.140 0.065 0.032 −0.146 0.067 0.029
Pk −0.162 0.106 0.127 −0.169 0.111 0.127
ChP 0.019 0.067 0.773 0.019 0.073 0.799
1stScore 0.652 0.180 < 0.001 0.678 0.188 < 0.001
logN 0.073 0.092 0.424 0.076 0.103 0.463

CT scan 1stCode 0.020 0.076 0.789 0.017 0.093 0.858
Pk 0.104 0.093 0.262 0.115 0.103 0.266
ChP 0.046 0.043 0.286 0.047 0.048 0.329
1stScore −0.244 0.132 0.065 −0.266 0.131 0.042
logN −0.019 0.096 0.847 −0.034 0.112 0.763

Radiotherapy 1stCode −0.327 0.157 0.037 −0.382 0.163 0.019
logN −0.057 0.056 0.311 −0.068 0.062 0.275

Secondary Malignant Neoplasm 1stCode 0.013 0.127 0.921 −0.008 0.141 0.954
Pk −0.135 0.113 0.230 −0.130 0.126 0.299
ChP −0.067 0.049 0.168 −0.067 0.054 0.217
1stScore −0.197 0.122 0.105 −0.205 0.128 0.109
logN 0.333 0.077 < 0.001 0.335 0.079 < 0.001

Palliative or Hospice Care 1stCode −0.055 0.085 0.517 −0.066 0.089 0.457
Pk −0.942 0.187 < 0.001 −1.009 0.205 < 0.001
ChP −0.704 0.140 < 0.001 −0.753 0.153 < 0.001
1stScore 0.068 0.095 0.470 0.070 0.098 0.471
logN 0.017 0.061 0.785 0.002 0.064 0.979

Recurrence 1stCode 0.121 0.081 0.138 0.122 0.084 0.147
Pk −0.105 0.093 0.259 −0.099 0.097 0.310
ChP −0.046 0.058 0.426 −0.042 0.060 0.479
1stScore −0.281 0.119 0.018 −0.288 0.122 0.018
logN 0.234 0.076 0.002 0.255 0.075 < 0.001

Medication 1stCode 0.173 0.118 0.143 0.185 0.113 0.104
logN 0.062 0.071 0.384 0.071 0.081 0.380

Biopsy 1stCode −0.865 0.411 0.035 −0.968 0.399 0.015
logN −0.423 0.502 0.399 −0.478 0.523 0.360

Table A8.

Analysis with n = 400. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval") over 400 bootstraps for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All features regarding time are on the log scale. The results for the proposed MATA estimator are given in the left panel and those of NPMLE in the right panel.

MATA NPMLE
Group Feature est boot.se pval est boot.se pval
Stage II 0.067 0.254 0.790
Stage III 0.189 0.349 0.587
Age 0.014 0.012 0.264

Lung Cancer 1stCode −0.232 0.178 0.192 −0.311 0.189 0.101
Pk 0.191 0.133 0.150 0.232 0.144 0.108
ChP 0.115 0.106 0.279 0.133 0.117 0.258
1stScore −0.098 0.266 0.712 −0.075 0.332 0.821
logN 0.078 0.163 0.633 0.074 0.203 0.715

Chemotherapy 1stCode −0.120 0.109 0.270 −0.150 0.126 0.232
Pk −0.140 0.176 0.428 −0.181 0.209 0.387
ChP 0.001 0.096 0.991 0.004 0.122 0.975
1stScore 0.607 0.288 0.035 0.719 0.311 0.021
logN 0.064 0.139 0.643 0.064 0.174 0.714

CT scan 1stCode 0.017 0.121 0.886 0.014 0.160 0.933
Pk 0.068 0.143 0.634 0.110 0.179 0.538
ChP 0.038 0.071 0.589 0.050 0.089 0.571
1stScore −0.207 0.204 0.310 −0.291 0.222 0.190
logN −0.019 0.151 0.897 −0.037 0.188 0.844

Radiotherapy 1stCode −0.229 0.221 0.301 −0.345 0.248 0.165
logN −0.027 0.086 0.749 −0.058 0.109 0.595

Secondary Malignant Neoplasm 1stCode −0.019 0.172 0.913 −0.035 0.234 0.881
Pk −0.125 0.163 0.444 −0.119 0.211 0.575
ChP −0.065 0.072 0.366 −0.063 0.092 0.490
1stScore −0.207 0.182 0.257 −0.224 0.219 0.307
logN 0.302 0.128 0.018 0.343 0.134 0.011

Palliative or Hospice Care 1stCode −0.076 0.137 0.580 −0.091 0.160 0.567
Pk −0.845 0.248 < 0.001 −0.936 0.276 < 0.001
ChP −0.631 0.185 < 0.001 −0.699 0.206 < 0.001
1stScore 0.054 0.126 0.670 0.067 0.143 0.641
logN 0.040 0.092 0.663 0.015 0.105 0.889

Recurrence 1stCode 0.089 0.116 0.443 0.125 0.134 0.351
Pk −0.114 0.139 0.412 −0.103 0.161 0.521
ChP −0.055 0.085 0.519 −0.046 0.099 0.642
1stScore −0.229 0.176 0.193 −0.280 0.197 0.154
logN 0.199 0.122 0.104 0.266 0.124 0.033

Medication 1stCode 0.201 0.188 0.284
logN 0.061 0.155 0.693

Biopsy 1stCode −0.814 0.689 0.238 −1.127 0.734 0.125
logN −0.363 0.811 0.654 −0.559 0.989 0.572

Table A9.

Analysis with n = 200. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval") over 400 bootstraps for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All features regarding time are on the log scale. The results for the proposed MATA estimator are given in the left panel and those of NPMLE in the right panel.

MATA NPMLE
Group Feature est boot.se pval est boot.se pval
Stage II 0.102 0.393 0.795
Stage III 0.161 0.549 0.769
Age 0.014 0.019 0.465

Lung Cancer 1stCode −0.223 0.266 0.401 −0.369 0.318 0.246
Pk 0.188 0.190 0.323 0.270 0.220 0.220
ChP 0.112 0.148 0.451 0.160 0.180 0.375
1stScore −0.102 0.390 0.793 −0.080 0.553 0.885
logN 0.072 0.244 0.767 0.065 0.325 0.843

Chemotherapy 1stCode −0.103 0.143 0.471 −0.170 0.212 0.423
Pk −0.119 0.218 0.585 −0.206 0.331 0.534
ChP 0.006 0.160 0.972 0.027 0.243 0.913
1stScore 0.530 0.409 0.195 0.764 0.516 0.139
logN 0.056 0.184 0.759 0.064 0.282 0.822

CT scan 1stCode 0.007 0.165 0.965 0.008 0.252 0.976
Pk 0.068 0.196 0.730 0.116 0.276 0.674
ChP 0.037 0.109 0.730 0.055 0.143 0.700
1stScore −0.188 0.292 0.520 −0.321 0.366 0.380
logN −0.016 0.229 0.944 −0.047 0.314 0.881

Radiotherapy 1stCode −0.207 0.314 0.509 −0.359 0.405 0.376
logN −0.029 0.114 0.798 −0.059 0.163 0.718

Secondary Malignant Neoplasm 1stCode −0.036 0.273 0.896 −0.056 0.415 0.893
Pk −0.095 0.232 0.683 −0.118 0.349 0.735
ChP −0.051 0.101 0.609 −0.065 0.148 0.660
1stScore −0.161 0.248 0.516 −0.197 0.348 0.571
logN 0.258 0.185 0.162 0.338 0.207 0.102

Palliative or Hospice Care 1stCode −0.090 0.197 0.647 −0.102 0.267 0.703
Pk −0.726 0.384 0.059 −0.928 0.446 0.037
ChP −0.542 0.287 0.059 −0.692 0.334 0.038
1stScore 0.020 0.179 0.912 0.034 0.268 0.899
logN 0.041 0.124 0.740 0.024 0.173 0.890

Recurrence 1stCode 0.070 0.181 0.697 0.131 0.247 0.598
Pk −0.094 0.183 0.608 −0.105 0.240 0.661
ChP −0.042 0.112 0.705 −0.044 0.148 0.767
1stScore −0.237 0.264 0.369 −0.332 0.338 0.326
logN 0.180 0.177 0.309 0.284 0.193 0.141

Medication 1stCode 0.174 0.321 0.589
logN 0.059 0.230 0.798

Biopsy 1stCode −0.741 0.890 0.405 −1.223 1.155 0.289
logN −0.467 1.551 0.763 −0.876 2.311 0.705

Appendix C. Convergence Rate of Derived Features

Instead of deriving asymptotic properties for the truncated density $f_{C_i}$, i.e., the random density $f_i$ truncated on $[0, C_i]$, we focus on the scaled density $f_{C_i,\text{scaled}}$, which is $f_{C_i}$ rescaled to $[0, 1]$. As we assume the censoring time $C_i$ has finite support $[0, \varepsilon]$ with $\varepsilon < \infty$, $f_{C_i,\text{scaled}}$ and $f_{C_i}$ share the same asymptotic properties.

Let $f_{\mu,\text{scaled}}^{[j]}(t) = E\{f_{C,\text{scaled}}^{[j]}(t)\}$ and $G_{\text{scaled}}^{[j]}(t,s) = \text{cov}\{f_{C,\text{scaled}}^{[j]}(t), f_{C,\text{scaled}}^{[j]}(s)\}$. The Karhunen–Loève theorem (Stark and Woods, 1986) states

$$f_{C,\text{scaled}}^{[j]}(t) = f_{\mu,\text{scaled}}^{[j]}(t) + \sum_{k=1}^{\infty} \zeta_{k,\text{scaled}}^{[j]} \phi_{k,\text{scaled}}^{[j]}(t), \quad \text{for } t \in [0,1],$$

where $\{\phi_{k,\text{scaled}}^{[j]}(t)\}$ are the orthonormal eigenfunctions of $G_{\text{scaled}}^{[j]}(t,s)$, $\{\zeta_{k,\text{scaled}}^{[j]}\}$ are pairwise uncorrelated random variables with mean 0 and variance $\lambda_{k,\text{scaled}}^{[j]}$, and $\{\lambda_{k,\text{scaled}}^{[j]}\}$ are the eigenvalues of $G_{\text{scaled}}^{[j]}(t,s)$.

For the $i$-th patient, conditional on $f_{C_i}^{[j]}(t)$ and $M_i^{[j]} = \mathcal{N}^{[j]}([0, C_i])$, the observed event times $t_{i1}^{[j]}, \dots, t_{iM_i^{[j]}}^{[j]}$ are assumed to be generated as an i.i.d. sample $t_{i\ell}^{[j]} \overset{\text{i.i.d.}}{\sim} f_{C_i}^{[j]}(t)$. Equivalently, the scaled observed event times $t_{i1}^{[j]}/C_i, \dots, t_{iM_i^{[j]}}^{[j]}/C_i \overset{\text{i.i.d.}}{\sim} f_{C_i,\text{scaled}}^{[j]}(t)$. Following Wu et al. (2013), we estimate $f_{\mu,\text{scaled}}^{[j]}(t)$ and $G_{\text{scaled}}^{[j]}(t,s)$, the mean and covariance functions of the scaled density $f_{C,\text{scaled}}^{[j]}(t)$, as

$$\hat f_{\mu,\text{scaled}}^{[j]}(t) = (M_+^{[j]})^{-1} \sum_{i=1}^{n+N} \sum_{\ell=1}^{M_i^{[j]}} \kappa_{\mu h_\mu^{[j]}}(t - t_{i\ell}^{[j]}/C_i); \qquad \hat G_{\text{scaled}}^{[j]}(t,s) = \hat g_{\text{scaled}}^{[j]}(t,s) - \hat f_{\mu,\text{scaled}}^{[j]}(t)\, \hat f_{\mu,\text{scaled}}^{[j]}(s),$$

for $t, s \in [0,1]$, where

$$\hat g_{\text{scaled}}^{[j]}(t,s) = (M_{++}^{[j]})^{-1} \sum_{i=1}^{n+N} I(M_i^{[j]} \ge 2) \sum_{1 \le \ell \ne k \le M_i^{[j]}} \kappa_{G h_g^{[j]}}(t - t_{i\ell}^{[j]}/C_i,\; s - t_{ik}^{[j]}/C_i).$$

Here $M_+^{[j]} = \sum_{i=1}^{n+N} M_i^{[j]}$ is the total number of encounters, and $M_{++}^{[j]} = \sum_{i=1,\, M_i^{[j]} \ge 2}^{n+N} M_i^{[j]}(M_i^{[j]} - 1)$ is the total number of pairs. $\kappa_\mu$ and $\kappa_G$ are symmetric univariate and bivariate probability density functions, respectively, with $\kappa_{\mu h}(x) = \kappa_\mu(x/h)/h$ and $\kappa_{G h}(x_1, x_2) = \kappa_G(x_1/h, x_2/h)/h^2$; $h_\mu^{[j]}$ and $h_g^{[j]}$ are bandwidth parameters.
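To make the construction concrete, the following is a minimal numpy sketch of the pooled kernel estimators $\hat f_{\mu,\text{scaled}}$ and $\hat G_{\text{scaled}}$ above, assuming Gaussian kernels for $\kappa_\mu$ and a product Gaussian for $\kappa_G$ and user-supplied bandwidths; the function and variable names are ours, not part of the original implementation.

```python
import numpy as np

def _gauss(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def mean_cov_scaled(times, censor, grid, h_mu, h_g):
    """Pooled kernel estimates of f_mu,scaled and G_scaled on a grid.

    times  -- list (over the n+N patients) of arrays of encounter times
    censor -- censoring times C_i, one per patient
    grid   -- evaluation points in [0, 1]
    """
    g = np.asarray(grid, float)
    f_mu = np.zeros(g.size)
    g_raw = np.zeros((g.size, g.size))
    m_plus = m_pp = 0
    for t_i, c_i in zip(times, censor):
        s = np.asarray(t_i, float) / c_i           # scaled times t_il / C_i
        m_i = s.size
        if m_i == 0:
            continue
        m_plus += m_i
        k_mu = _gauss((g[:, None] - s[None, :]) / h_mu) / h_mu
        f_mu += k_mu.sum(axis=1)                   # sum of kernel bumps
        if m_i >= 2:                               # pairs with l != k only
            m_pp += m_i * (m_i - 1)
            k_g = _gauss((g[:, None] - s[None, :]) / h_g) / h_g
            tot = k_g.sum(axis=1)
            # sum over l != k  =  (sum over all pairs) - (diagonal l = k)
            g_raw += np.outer(tot, tot) - k_g @ k_g.T
    f_mu /= m_plus                                 # assumes m_plus, m_pp > 0
    g_raw /= m_pp
    return f_mu, g_raw - np.outer(f_mu, f_mu)      # (f_mu_hat, G_hat)
```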

The estimates of the eigenfunctions and eigenvalues, denoted by $\hat\phi_{k,\text{scaled}}^{[j]}(x)$ and $\hat\lambda_{k,\text{scaled}}^{[j]}$ respectively, are solutions to

$$\int_0^1 \hat G_{\text{scaled}}^{[j]}(s,t)\, \hat\phi_{k,\text{scaled}}^{[j]}(s)\, ds = \hat\lambda_{k,\text{scaled}}^{[j]}\, \hat\phi_{k,\text{scaled}}^{[j]}(t),$$

with constraints $\int_0^1 \hat\phi_{k,\text{scaled}}^{[j]}(s)^2\, ds = 1$ and $\int_0^1 \hat\phi_{k,\text{scaled}}^{[j]}(s)\, \hat\phi_{\ell,\text{scaled}}^{[j]}(s)\, ds = 0$ for $\ell \ne k$. One can obtain the estimated eigenfunctions $\hat\phi_{k,\text{scaled}}^{[j]}(x)$ and eigenvalues $\hat\lambda_{k,\text{scaled}}^{[j]}$ by numerical spectral decomposition of a properly discretized version of the smooth covariance function $\hat G_{\text{scaled}}^{[j]}(t,s)$ (Rice and Silverman, 1991; Capra and Müller, 1997). Subsequently, we estimate

$$\zeta_{ik,\text{scaled}}^{[j]} = \int \{f_{C_i,\text{scaled}}^{[j]}(t) - f_{\mu,\text{scaled}}^{[j]}(t)\}\, \phi_{k,\text{scaled}}^{[j]}(t)\, dt$$

by

$$\hat\zeta_{ik,\text{scaled}}^{[j]} = \frac{1}{M_i^{[j]}} \sum_{\ell=1}^{M_i^{[j]}} \hat\phi_{k,\text{scaled}}^{[j]}(t_{i\ell}^{[j]}/C_i) - \int \hat f_{\mu,\text{scaled}}^{[j]}(t)\, \hat\phi_{k,\text{scaled}}^{[j]}(t)\, dt.$$

Let $\tilde\zeta_{ik,\text{scaled}}^{[j]} = (M_i^{[j]})^{-1} \sum_{\ell=1}^{M_i^{[j]}} \phi_{k,\text{scaled}}^{[j]}(t_{i\ell}^{[j]}/C_i) - \int f_{\mu,\text{scaled}}^{[j]}(t)\, \phi_{k,\text{scaled}}^{[j]}(t)\, dt$ be the population counterpart of $\hat\zeta_{ik,\text{scaled}}^{[j]}$ constructed with the true eigenfunctions. We show in Lemma A3 that $\max_i |\hat\zeta_{ik,\text{scaled}} - \tilde\zeta_{ik,\text{scaled}}|$ goes to zero for any $k$ as long as $N h_\mu^2 \to \infty$ and $N h_g^4 \to \infty$.

We then estimate the scaled density $f_{C_i,\text{scaled}}^{[j]}(t)$ as

$$\hat f_{iK,\text{scaled}}^{[j]}(t) = \max\Big\{0,\; \hat f_{\mu,\text{scaled}}^{[j]}(t) + \sum_{k=1}^{K^{[j]}} \hat\zeta_{ik,\text{scaled}}^{[j]}\, \hat\phi_{k,\text{scaled}}^{[j]}(t)\Big\},$$

and the truncated density $f_{C_i}^{[j]}(t)$ as

$$\hat f_{iK}^{[j]}(t) = \hat f_{iK,\text{scaled}}^{[j]}(t/C_i) \Big/ \int_0^{C_i} \hat f_{iK,\text{scaled}}^{[j]}(t/C_i)\, dt.$$

For the $i$-th patient and its $j$-th point process $\mathcal{N}_i^{[j]}$, we only observe one realization of its expected number of encounters on $[0, C_i]$, i.e., $M_i = \mathcal{N}_i^{[j]}([0, C_i])$. Following Wu et al. (2013), we approximate the expected number of encounters with the observed number, and estimate $\lambda_i(t)$ as $\hat\lambda_i^{[j]}(t) = M_i \hat f_{iK}^{[j]}(t)$ for $t \in [0, C_i]$. We further estimate the derived feature $W_i^{[j]}$ as $\hat W_i^{[j]} = \hat f_{iK}^{[j]}$.
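As an illustration of how the spectral decomposition, the scores, and the individual densities and intensities fit together, here is a hedged numpy sketch on an equally spaced grid, continuing the `mean_cov_scaled` sketch above. The discretization, the truncation at zero, and the renormalization follow the displays above, but the helper names and the simple Riemann quadrature are our own assumptions.

```python
import numpy as np

def fpca_features(f_mu, G_hat, grid, times, censor, n_comp):
    """Discretized eigendecomposition of G_hat plus per-patient FPC
    scores, reconstructed densities, and intensities."""
    g = np.asarray(grid, float)
    w = g[1] - g[0]                                # quadrature weight
    evals, evecs = np.linalg.eigh(G_hat * w)       # discretized operator
    order = np.argsort(evals)[::-1][:n_comp]
    lam = evals[order]                             # leading eigenvalues
    phi = evecs[:, order] / np.sqrt(w)             # so that int phi_k^2 = 1
    out = []
    for t_i, c_i in zip(times, censor):
        s = np.asarray(t_i, float) / c_i
        m_i = s.size
        if m_i == 0:
            zeta = np.zeros(n_comp)
        else:
            # zeta_hat: mean of phi_k at scaled times minus int f_mu phi_k
            phi_at_s = np.array([np.interp(s, g, phi[:, k])
                                 for k in range(n_comp)])
            zeta = phi_at_s.mean(axis=1) - (f_mu * phi.T).sum(axis=1) * w
        dens = np.maximum(0.0, f_mu + phi @ zeta)  # truncate at zero
        dens /= max(dens.sum() * w, 1e-12)         # renormalize on [0, 1]
        # intensity at original times t = g * c_i:  M_i * f_iK(t)
        lam_i = m_i * dens / c_i
        out.append({"scores": zeta, "density": dens, "intensity": lam_i})
    return lam, phi, out
```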

For notational simplicity in the proof, we drop the superscript $[j]$, the index for the $j$-th counting process, $j = 1, \dots, q$, throughout the rest of the appendix.

Derivative of the Mean and Covariance Functions

The nonparametric estimators of the mean and covariance functions of the scaled densities are

$$\hat f_{\mu,\text{scaled}}(t) = (M_+)^{-1} \sum_{i=1}^{n+N} \sum_{\ell=1}^{M_i} \kappa_{\mu h_\mu}(t - t_{i\ell}/C_i); \qquad \hat G_{\text{scaled}}(t,s) = \hat g_{\text{scaled}}(t,s) - \hat f_{\mu,\text{scaled}}(t)\, \hat f_{\mu,\text{scaled}}(s),$$

for $t, s \in [0,1]$, where

$$\hat g_{\text{scaled}}(t,s) = (M_{++})^{-1} \sum_{i=1}^{n+N} I(M_i \ge 2) \sum_{1 \le \ell \ne k \le M_i} \kappa_{G h_g}(t - t_{i\ell}/C_i,\; s - t_{ik}/C_i).$$

Here $M_+ = \sum_{i=1}^{n+N} M_i$ is the total number of encounters, and $M_{++} = \sum_{i=1,\, M_i \ge 2}^{n+N} M_i(M_i - 1)$ is the total number of pairs. $\kappa_\mu$ and $\kappa_G$ are symmetric univariate and bivariate probability density functions, respectively, with $\kappa_{\mu h}(x) = \kappa_\mu(x/h)/h$ and $\kappa_{G h}(x_1, x_2) = \kappa_G(x_1/h, x_2/h)/h^2$; $h_\mu$ and $h_g$ are bandwidth parameters.

Their derivatives are

$$\hat f_{\mu,\text{scaled}}'(t) = \frac{1}{M_+ h_\mu^2} \sum_{i=1}^{n+N} \sum_{\ell=1}^{M_i} \kappa_1'\Big(\frac{t - t_{i\ell}/C_i}{h_\mu}\Big), \qquad \hat G_{\text{scaled}}^{(0,1)}(t,s) = \hat g_{\text{scaled}}^{(0,1)}(t,s) - \hat f_{\mu,\text{scaled}}(t)\, \hat f_{\mu,\text{scaled}}'(s), \qquad \hat G_{\text{scaled}}^{(1,0)}(t,s) = \hat g_{\text{scaled}}^{(1,0)}(t,s) - \hat f_{\mu,\text{scaled}}'(t)\, \hat f_{\mu,\text{scaled}}(s),$$

with

$$\hat g_{\text{scaled}}^{(\nu,u)}(t,s) = \frac{1}{M_{++} h_g^3} \sum_{i=1,\, M_i \ge 2}^{n+N} \sum_{\ell=1}^{M_i} \sum_{k=1,\, k \ne \ell}^{M_i} \kappa_2^{(\nu,u)}\Big(\frac{t - t_{i\ell}/C_i}{h_g},\; \frac{s - t_{ik}/C_i}{h_g}\Big),$$

for $(\nu, u) = (0, 1)$ and $(\nu, u) = (1, 0)$, where for an arbitrary bivariate function $h$, $h^{(\nu,u)}(x,y) = \partial^{\nu+u} h(x,y)/\partial x^\nu \partial y^u$.

Assume the following regularity conditions hold.

  • (A1) The scaled random densities $f_{C_i,\text{scaled}}$, their mean density $f_{\mu,\text{scaled}}$, the covariance function $g_{\text{scaled}}$ and the eigenfunctions $\phi_{k,\text{scaled}}(x)$ are thrice continuously differentiable.

  • (A2) $f_{C_i,\text{scaled}}$, $f_{\mu,\text{scaled}}$ and their first three derivatives are bounded, where the bounds hold uniformly across the set of random densities.

  • (A3) $\kappa_1(\cdot)$ and $\kappa_2(\cdot,\cdot)$ are symmetric univariate and bivariate density functions satisfying
    $$\int u\, \kappa_1(u)\, du = \iint u\, \kappa_2(u,v)\, du\, dv = \iint v\, \kappa_2(u,v)\, du\, dv = 0,$$
    $$\int u^2 \kappa_1(u)\, du < \infty, \quad \iint u^2 \kappa_2(u,v)\, du\, dv < \infty, \quad \iint v^2 \kappa_2(u,v)\, du\, dv < \infty.$$

  • (A4) Denote the Fourier transformations $\chi_1(t) = \int \exp(-iut)\, \kappa_1(u)\, du$ and $\chi_2(s,t) = \iint \exp(-ius - ivt)\, \kappa_2(u,v)\, du\, dv$. Then $\int |\chi_1(u)|\, du < \infty$, $\int |u\, \chi_1(u)|\, du < \infty$, $\iint |\chi_2(u,v)|\, du\, dv < \infty$, $\iint |u\, \chi_2(u,v)|\, du\, dv < \infty$ and $\iint |v\, \chi_2(u,v)|\, du\, dv < \infty$.

  • (A5) The numbers of observations $M_i$ for the $j$-th trajectory of the $i$-th subject are i.i.d. random variables, independent of the densities $f_i$, and satisfy
    $$E(N/M_+) < \infty, \qquad E(N/M_{++}) < \infty.$$

  • (A6) $h_\mu \to 0$, $h_g \to 0$, $N h_\mu^4 \to \infty$, $N h_g^6 \to \infty$ as $N \to \infty$.

  • (A7) $M_i$, $i = 1, \dots, n+N$, are i.i.d. positive random variables generated from a truncated Poisson distribution with rate $\tau_N$, such that $\text{pr}(M_i = 0) = 0$ and $\text{pr}(M_i = k) = \tau_N^k \exp(-\tau_N)/[k!\{1 - \exp(-\tau_N)\}]$ for $k \ge 1$.

  • (A8) $\omega_i = E(M_i \mid C_i) = E(\mathcal{N}_i[0, C_i] \mid C_i)$ and $f_{C_i,\text{scaled}}$, $i = 1, \dots, n+N$, are independent. $E\, \omega_i^{-1/2} = O(\alpha_N)$, where $\alpha_N \to 0$ as $N \to \infty$, for $j = 1, \dots, q$.

  • (A9) The number of eigenfunctions and functional principal components $K_i$ is a random variable with $K_i \overset{d}{=} K$, and for any $\epsilon > 0$ there exists $K_\epsilon < \infty$ such that $\text{pr}(K > K_\epsilon) < \epsilon$, for $j = 1, \dots, q$.

Lemma A1 Under the regularity conditions A1–A6,

$$\sup_x |\hat f_{\mu,\text{scaled}}(x) - f_{\mu,\text{scaled}}(x)| = O_p\Big(h_\mu^2 + \frac{1}{\sqrt{N} h_\mu}\Big), \tag{A.1}$$
$$\sup_x |\hat f_{\mu,\text{scaled}}'(x) - f_{\mu,\text{scaled}}'(x)| = O_p\Big(h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big), \tag{A.2}$$
$$\sup_{x,y} |\hat g_{\text{scaled}}(x,y) - g_{\text{scaled}}(x,y)| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^2}\Big), \tag{A.3}$$
$$\sup_{x,y} |\hat g_{\text{scaled}}^{(1,0)}(x,y) - g_{\text{scaled}}^{(1,0)}(x,y)| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^3}\Big), \tag{A.4}$$
$$\sup_{x,y} |\hat G_{\text{scaled}}(x,y) - G_{\text{scaled}}(x,y)| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^2} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu}\Big), \tag{A.5}$$
$$\sup_{x,y} |\hat G_{\text{scaled}}^{(1,0)}(x,y) - G_{\text{scaled}}^{(1,0)}(x,y)| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big). \tag{A.6}$$

Proof The proofs for the mean density and the covariance function can be found in Wu et al. (2013). Here we only prove the result for the derivative of the mean density function; the proof for the derivative of the covariance function is similar.

Under conditions A1 and A2, we have

$$E\{\hat f_{\mu,\text{scaled}}'(x)\} = E\Big[\frac{1}{M_+ h_\mu^2} \sum_{i=1}^{n+N} M_i\, E\Big\{\kappa_1'\Big(\frac{x - t_{i\ell}/C_i}{h_\mu}\Big) \,\Big|\, M_i, f_{C_i,\text{scaled}}\Big\}\Big] = E\Big[\frac{1}{M_+} \sum_{i=1}^{n+N} M_i\, E\Big\{f_{C_i,\text{scaled}}'(x) + \frac{1}{2} f_{C_i,\text{scaled}}'''(x)\, \sigma_{\kappa_1}^2 h_\mu^2 + o(h_\mu^2) \,\Big|\, M_i\Big\}\Big] = f_{\mu,\text{scaled}}'(x) + O(h_\mu^2).$$

Hence, $\sup_x |E\hat f_{\mu,\text{scaled}}'(x) - f_{\mu,\text{scaled}}'(x)| = O(h_\mu^2)$.

With the inverse Fourier transformation $\kappa_1(t) = (2\pi)^{-1} \int \exp(-iut)\, \chi_1(u)\, du$, we have

$$\kappa_1'(t) = -(2\pi)^{-1} \int i u \exp(-iut)\, \chi_1(u)\, du.$$

We further insert this expression into $\hat f_{\mu,\text{scaled}}'$,

$$\hat f_{\mu,\text{scaled}}'(t) = \frac{1}{M_+ h_\mu^2} \sum_{k=1}^{n+N} \sum_{\ell=1}^{M_k} \kappa_1'\Big(\frac{t - t_{k\ell}/C_k}{h_\mu}\Big) = -(2\pi)^{-1} \int i\, \varsigma(u)\, u \exp(-iut)\, \chi_1(u h_\mu)\, du,$$

where

$$\varsigma(u) = \frac{1}{M_+} \sum_{k=1}^{n+N} \sum_{\ell=1}^{M_k} \exp(i u\, t_{k\ell}/C_k).$$

Therefore,

$$|\hat f_{\mu,\text{scaled}}'(t) - E\hat f_{\mu,\text{scaled}}'(t)| = \Big|(2\pi)^{-1} \int i \{\varsigma(u) - E\varsigma(u)\}\, u \exp(-iut)\, \chi_1(u h_\mu)\, du\Big| \le (2\pi)^{-1} \int |\varsigma(u) - E\varsigma(u)|\, |u\, \chi_1(u h_\mu)|\, du.$$

Note that the right-hand side of the above inequality is free of $t$. Thus,

$$\sup_t |\hat f_{\mu,\text{scaled}}'(t) - E\hat f_{\mu,\text{scaled}}'(t)| \le (2\pi)^{-1} \int |\varsigma(u) - E\varsigma(u)|\, |u\, \chi_1(u h_\mu)|\, du.$$

As an intermediate result of the proof of Theorem 1 in Wu et al. (2013), we have

$$\text{var}\{\varsigma(u)\} \le \frac{1}{n+N} \Big\{1 + 2 E\Big(\frac{n+N}{M_+}\Big)\Big\}.$$

This further leads to

$$E\Big\{\sup_t |\hat f_{\mu,\text{scaled}}'(t) - E\hat f_{\mu,\text{scaled}}'(t)|\Big\} \le (2\pi)^{-1} \int E\{|\varsigma(u) - E\varsigma(u)|\}\, |u\, \chi_1(u h_\mu)|\, du \le (2\pi)^{-1} \int [\text{var}\{\varsigma(u)\}]^{1/2}\, |u\, \chi_1(u h_\mu)|\, du \le (2\pi)^{-1} \Big[\frac{1}{n+N} \Big\{1 + 2 E\Big(\frac{n+N}{M_+}\Big)\Big\}\Big]^{1/2} \int |u\, \chi_1(u h_\mu)|\, du = O\Big(\frac{1}{\sqrt{N} h_\mu^2}\Big).$$

Thus, $\sup_t |\hat f_{\mu,\text{scaled}}'(t) - E\hat f_{\mu,\text{scaled}}'(t)| = O_p\big(\frac{1}{\sqrt{N} h_\mu^2}\big)$. Furthermore,

$$\sup_t |\hat f_{\mu,\text{scaled}}'(t) - f_{\mu,\text{scaled}}'(t)| \le \sup_t |\hat f_{\mu,\text{scaled}}'(t) - E\hat f_{\mu,\text{scaled}}'(t)| + \sup_t |E\hat f_{\mu,\text{scaled}}'(t) - f_{\mu,\text{scaled}}'(t)| = O_p\Big(h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big).$$

Derivative of the Eigenfunctions

Lemma A2 Under the regularity conditions A1–A6,

$$|\hat\lambda_{k,\text{scaled}} - \lambda_{k,\text{scaled}}| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^2} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu}\Big), \tag{A.7}$$
$$\sup_x |\hat\phi_{k,\text{scaled}}(x) - \phi_{k,\text{scaled}}(x)| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^2} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu}\Big), \tag{A.8}$$
$$\sup_x |\hat\phi_{k,\text{scaled}}'(x) - \phi_{k,\text{scaled}}'(x)| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big). \tag{A.9}$$

Proof The first two equations are direct results of Theorem 2 in Yao et al. (2005). Note that

$$\hat\lambda_{k,\text{scaled}}\, \hat\phi_{k,\text{scaled}}'(x) = \int \hat G_{\text{scaled}}^{(1,0)}(x,y)\, \hat\phi_{k,\text{scaled}}(y)\, dy, \qquad \lambda_{k,\text{scaled}}\, \phi_{k,\text{scaled}}'(x) = \int G_{\text{scaled}}^{(1,0)}(x,y)\, \phi_{k,\text{scaled}}(y)\, dy,$$

where $G_{\text{scaled}}^{(1,0)}(x,y) = \partial G_{\text{scaled}}(x,y)/\partial x$. Thus,

$$|\hat\lambda_{k,\text{scaled}}\, \hat\phi_{k,\text{scaled}}'(x) - \lambda_{k,\text{scaled}}\, \phi_{k,\text{scaled}}'(x)| \le \int |\hat G_{\text{scaled}}^{(1,0)}(x,y) - G_{\text{scaled}}^{(1,0)}(x,y)|\, |\hat\phi_{k,\text{scaled}}(y)|\, dy + \int |G_{\text{scaled}}^{(1,0)}(x,y)|\, |\hat\phi_{k,\text{scaled}}(y) - \phi_{k,\text{scaled}}(y)|\, dy \le \Big\{\int |\hat G_{\text{scaled}}^{(1,0)}(x,y) - G_{\text{scaled}}^{(1,0)}(x,y)|^2\, dy\Big\}^{1/2} + \Big\{\int |G_{\text{scaled}}^{(1,0)}(x,y)|^2\, dy\Big\}^{1/2} \Big\{\int |\hat\phi_{k,\text{scaled}}(y) - \phi_{k,\text{scaled}}(y)|^2\, dy\Big\}^{1/2}.$$

Without loss of generality, assume $\lambda_{k,\text{scaled}} > 0$; then

$$\sup_x \big|(\hat\lambda_{k,\text{scaled}}/\lambda_{k,\text{scaled}})\, \hat\phi_{k,\text{scaled}}'(x) - \phi_{k,\text{scaled}}'(x)\big| = O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big).$$

Then (A.9) follows by applying (A.7).

Derivative of the Estimated Density Functions

Lemma A3 Under regularity conditions A1–A9, for any $\epsilon > 0$ there exists an event $A_\epsilon$ with $\text{pr}(A_\epsilon) \ge 1 - \epsilon$ such that on $A_\epsilon$ it holds that

$$|\hat\zeta_{ik,\text{scaled}} - \zeta_{ik,\text{scaled}}| = O_p\Big(\alpha_N + \frac{1}{\sqrt{N} h_g^2} + \frac{1}{\sqrt{N} h_\mu}\Big), \tag{A.10}$$
$$\sup_x |\hat f_{C_i,\text{scaled}}(x) - f_{C_i,\text{scaled}}(x)| = O_p\Big(\alpha_N + \frac{1}{\sqrt{N} h_g^2} + \frac{1}{\sqrt{N} h_\mu}\Big), \tag{A.11}$$
$$\sup_x |\hat f_{C_i,\text{scaled}}'(x) - f_{C_i,\text{scaled}}'(x)| = O_p\Big(\alpha_N + h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big). \tag{A.12}$$

Proof The existence of $A_\epsilon$ for (A.10)–(A.11) is guaranteed by Theorem 3 in Wu et al. (2013). We follow their definition of $A_\epsilon$, i.e., $A_\epsilon^c = \{K > K_\epsilon\} \cup \{M_i = 1,\; i = 1, \dots, n+N\}$, and prove (A.12).

Note that

$$|\hat f_{C_i,\text{scaled}}'(x) - f_{C_i,\text{scaled}}'(x)| \le |\hat f_{C_i,\text{scaled}}'(x) - f_{C_i,\text{scaled},K}'(x)| + |f_{C_i,\text{scaled},K}'(x) - f_{C_i,\text{scaled}}'(x)|.$$

We have

$$\sup_x E|f_{C_i,\text{scaled},K}'(x) - f_{C_i,\text{scaled}}'(x)|^2 = \sup_x E\Big|\sum_{k=K+1}^{\infty} \zeta_{ik,\text{scaled}}\, \phi_{k,\text{scaled}}'(x)\Big|^2 = \sup_x \sum_{k=K+1}^{\infty} \lambda_{k,\text{scaled}}\, |\phi_{k,\text{scaled}}'(x)|^2 \to 0,$$

as $K \to \infty$. Hence, $|f_{C_i,\text{scaled},K}'(x) - f_{C_i,\text{scaled}}'(x)| = o_p(1)$.

Furthermore, on $A_\epsilon$,

$$\sup_x |\hat f_{C_i,\text{scaled}}'(x) - f_{C_i,\text{scaled},K}'(x)| \le \sup_x |\hat f_{\mu,\text{scaled}}'(x) - f_{\mu,\text{scaled}}'(x)| + \sum_{k=1}^{K} \sup_x |\hat\zeta_{ik,\text{scaled}} - \zeta_{ik,\text{scaled}}|\, |\hat\phi_{k,\text{scaled}}'(x)| + \sum_{k=1}^{K} \sup_x |\zeta_{ik,\text{scaled}}|\, |\hat\phi_{k,\text{scaled}}'(x) - \phi_{k,\text{scaled}}'(x)| = O_p\Big(h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big) + O_p\Big(\alpha_N + \frac{1}{\sqrt{N} h_g^2} + \frac{1}{\sqrt{N} h_\mu}\Big) + O_p\Big(h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big) = O_p\Big(\alpha_N + h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big).$$

Therefore (A.12) holds.

Peaks and Change Points

Assume $f_{C_i,\text{scaled}}$ is locally unimodal: $f_{C_i,\text{scaled}}'(x) = 0$ has a unique solution, denoted by $x_{i0}$, in a neighbourhood of $x_{i0}$, denoted by $\mathcal{B}(x_{i0}) = (x_{i0} - \Delta_{x_{i0}},\, x_{i0} + \Delta_{x_{i0}})$. Further assume $|f_{C_i,\text{scaled}}''|$ is bounded away from 0 on $\mathcal{B}(x_{i0})$ and that the bound holds uniformly across $i = 1, \dots, n+N$. Let $\hat x_{i0}$ be the solution of $\hat f_{C_i,\text{scaled}}'(x) = 0$ that is closest to $x_{i0}$. Then

$$0 = \hat f_{C_i,\text{scaled}}'(\hat x_{i0}) = f_{C_i,\text{scaled}}'(\hat x_{i0}) + O_p\Big(\alpha_N + h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big) = f_{C_i,\text{scaled}}''(x_{i0}^*)(\hat x_{i0} - x_{i0}) + O_p\Big(\alpha_N + h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\Big),$$

where $x_{i0}^*$ is an intermediate value between $x_{i0}$ and $\hat x_{i0}$, and the last step uses $f_{C_i,\text{scaled}}'(x_{i0}) = 0$.

Thus, $|\hat x_{i0} - x_{i0}| = O_p\big(\alpha_N + h_g^2 + \frac{1}{\sqrt{N} h_g^3} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^2}\big)$. This further implies that $\hat x_{i0}$ is the only solution of $\hat f_{C_i,\text{scaled}}' = 0$ in $\mathcal{B}(x_{i0})$. In other words, there is a one-to-one correspondence between the estimated peak and the true peak, and the estimated peak converges to the true peak uniformly.

The derivation for the change point is similar, and here we only list the order of the absolute difference between the estimated change point $\hat y_{i0}$ and the true change point $y_{i0}$:

$$|\hat y_{i0} - y_{i0}| = O_p\Big(\alpha_N + h_g^2 + \frac{1}{\sqrt{N} h_g^4} + h_\mu^2 + \frac{1}{\sqrt{N} h_\mu^3}\Big).$$

Remark A1 For the peak and change point, the approximation error decays faster than $n^{-1/2}$ when the unlabeled data expand with $\alpha_N \ll n^{-1/2}$ in follow-up duration and $N \gg n^3$ in sample size. In that case, we may choose $(n/N)^{1/8} \ll h_g \ll n^{-1/4}$ and $(n/N)^{1/6} \ll h_\mu \ll n^{-1/4}$ so that Assumption (C5) is satisfied.
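In practice, the peak and change point of an estimated density can be read off a grid. The following is a minimal numpy sketch, assuming $\hat f_{C_i,\text{scaled}}$ has been evaluated on a grid (e.g., by the FPCA sketch in the previous section), with finite differences standing in for the smoothed derivative estimators; the helper name is ours.

```python
import numpy as np

def peak_and_change_point(density, grid):
    """Grid-based peak (argmax of f_hat, where f_hat' = 0 at an interior
    maximum) and change point (argmax of f_hat', the steepest increase)
    of an estimated encounter density on [0, 1]."""
    g = np.asarray(grid, float)
    f = np.asarray(density, float)
    df = np.gradient(f, g)            # finite-difference derivative
    peak = g[np.argmax(f)]
    change = g[np.argmax(df)]
    return peak, change               # scaled; multiply by C_i for time
```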

Appendix D. B-spline Approximation and Profile-likelihood Estimation

Some Definitions on Vector and Matrix Norms

For any vector $a = (a_1, \dots, a_s)^\top \in \mathbb{R}^s$, denote the norm $\|a\|_r = (|a_1|^r + \dots + |a_s|^r)^{1/r}$, $1 \le r \le \infty$. For positive numbers $a_n$ and $b_n$, $n > 1$, let $a_n \asymp b_n$ denote $\lim_{n\to\infty} a_n/b_n = c$, where $c$ is some nonzero constant. Denote the space of $q$th-order smooth functions as $C^{(q)}([0,\varepsilon]) = \{\phi : \phi^{(q)} \in C[0,\varepsilon]\}$. For any $s \times s$ symmetric matrix $A$, denote its $L_q$ norm as $\|A\|_q = \max_{v \in \mathbb{R}^s, v \ne 0} \|Av\|_q \|v\|_q^{-1}$. Let $\|A\|_\infty = \max_{1 \le i \le s} \sum_{j=1}^{s} |a_{ij}|$. For a vector $a$, let $\|a\|_\infty = \max_{1 \le i \le s} |a_i|$.

Some Definitions on Scores and Hessian Matrices

Define

$$S_{\gamma,i}(\beta,\gamma) = \frac{\partial \log \tilde H_i(\beta,\gamma)}{\partial \gamma} = \Delta_i B_r(X_i) - (1+\Delta_i)\, \frac{\exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)\, du}{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du},$$
$$S_{\gamma,i}(\beta,m) = \Delta_i B_r(X_i) - (1+\Delta_i)\, \frac{\exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, B_r(u)\, du}{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du},$$
$$S_{\beta,i}(\beta,\gamma) = \frac{\partial \log \tilde H_i(\beta,\gamma)}{\partial \beta} = \Delta_i Z_i - (1+\Delta_i)\, \frac{Z_i \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du}{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du},$$
$$S_{\beta,i}(\beta,m) = \Delta_i Z_i - (1+\Delta_i)\, \frac{Z_i \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du}{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du}.$$

Further, with $a^{\otimes 2} = a a^\top$, define

$$S_{\beta\beta,i}(\beta,\gamma) \equiv \frac{\partial S_{\beta,i}(\beta,\gamma)}{\partial \beta^\top} = -(1+\Delta_i)\, \frac{Z_i^{\otimes 2} \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du]^2}, \qquad S_{\beta\beta,i}(\beta,m) \equiv -(1+\Delta_i)\, \frac{Z_i^{\otimes 2} \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du]^2},$$
$$S_{\gamma\gamma,i}(\beta,\gamma) \equiv \frac{\partial S_{\gamma,i}(\beta,\gamma)}{\partial \gamma^\top} = -(1+\Delta_i)\, \frac{\exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)^{\otimes 2}\, du}{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du} + (1+\Delta_i)\, \frac{\exp(2 Z_i^\top\beta) \big[\int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)\, du\big]^{\otimes 2}}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du]^2},$$
$$S_{\gamma\gamma,i}(\beta,m) \equiv -(1+\Delta_i)\, \frac{\exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, B_r(u)^{\otimes 2}\, du}{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du} + (1+\Delta_i)\, \frac{\exp(2 Z_i^\top\beta) \big[\int_0^{X_i} \exp\{m(u)\}\, B_r(u)\, du\big]^{\otimes 2}}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du]^2},$$
$$S_{\beta\gamma,i}(\beta,\gamma) \equiv \frac{\partial S_{\beta,i}(\beta,\gamma)}{\partial \gamma^\top} = -(1+\Delta_i)\, \frac{Z_i \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)^\top\, du}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du]^2}, \qquad S_{\beta\gamma,i}(\beta,m) \equiv -(1+\Delta_i)\, \frac{Z_i \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, B_r(u)^\top\, du}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{m(u)\}\, du]^2}.$$

Note that

$$\frac{\partial l_n(\beta,\gamma)}{\partial \gamma} = \sum_{i=1}^{n} S_{\gamma,i}(\beta,\gamma), \qquad \frac{\partial l_n(\beta,\gamma)}{\partial \beta} = \sum_{i=1}^{n} S_{\beta,i}(\beta,\gamma).$$

For $u \in [0, \varepsilon]$, define

$$\hat\sigma^2(u,\beta) = B_r(u)^\top \{V_n(\beta_0)\}^{-1} \Big\{n^{-2} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)^{\otimes 2}\Big\} \{V_n(\beta_0)\}^{-1} B_r(u), \tag{A.13}$$

where

$$V_n(\beta) = -E\{S_{\gamma\gamma,i}(\beta, m)\}.$$
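For concreteness, the profile objective that these scores differentiate, $l_n(\beta,\gamma)$ with baseline $m(u)$ approximated by $B_r(u)^\top\gamma$, can be evaluated numerically as follows. This is a sketch assuming a trapezoid-rule quadrature and scipy's `BSpline` for the basis $B_r(\cdot)$; all names are ours, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(u, knots, degree=3):
    """Evaluate all B-spline basis functions B_r(.) at the points u."""
    t = np.r_[[knots[0]] * degree, knots, [knots[-1]] * degree]
    p = len(t) - degree - 1
    return np.array([BSpline(t, np.eye(p)[j], degree)(u)
                     for j in range(p)]).T          # shape (len(u), p)

def po_loglik(beta, gamma, X, delta, Z, knots, n_quad=64):
    """l_n(beta, gamma) for the proportional odds model with
    m(u) ~ B_r(u)' gamma; the integral int_0^X e^{m(u)} du is
    approximated by the trapezoid rule."""
    ll = 0.0
    for x_i, d_i, z_i in zip(X, delta, Z):
        u = np.linspace(0.0, x_i, n_quad)
        B = bspline_design(u, knots)                 # n_quad x P_n
        integral = np.trapz(np.exp(B @ gamma), u)
        eta = z_i @ beta
        b_x = bspline_design(np.array([x_i]), knots)[0]
        ll += d_i * (b_x @ gamma + eta) \
            - (1 + d_i) * np.log1p(np.exp(eta) * integral)
    return ll
```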

Approximation Error from $\hat W$

We first assess the approximation error incurred by using the estimated features $\hat W$ in $l_n$. Once we establish the identifiability of $l_n$ in the proof of Lemma 1, the approximation of the losses translates to the approximation of their minimizers.

Lemma A4 Let

$$l_n^*(\beta,\gamma) = \sum_{i=1}^{n} \Big[\Delta_i \{B_r(X_i)^\top\gamma + Z_i^\top\beta\} - (1+\Delta_i) \log\Big\{1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{\gamma^\top B_r(t)\}\, dt\Big\}\Big],$$

with $Z_i = (U_i^\top, W_i^{[1]}, \dots, W_i^{[q]})^\top$, be the loss with the true features from the intensity functions. Let $\Omega$ be a sufficiently large compact neighborhood of

$$\theta_0 = (\beta_0^\top, \gamma_0^\top)^\top = \arg\min_\theta E\{n^{-1} l_n^*(\beta,\gamma)\}.$$

We have

$$\sup_{\theta \in \Omega} \frac{1}{n} |l_n(\beta,\gamma) - l_n^*(\beta,\gamma)| \lesssim \sup_{i=1,\dots,n} \|\hat W_i - W_i\|_\infty. \tag{A.14}$$

Proof By the mean value theorem, we may express the difference as

$$\frac{1}{n} \{l_n(\beta,\gamma) - l_n^*(\beta,\gamma)\} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \Delta_i\, \beta_1^\top (W_i - \hat W_i)}_{T_1} - \underbrace{\frac{1}{n} \sum_{i=1}^{n} (1+\Delta_i)\, \frac{\exp(\tilde Z_i^\top\beta) \int_0^{X_i} \exp\{\gamma^\top B_r(t)\}\, dt}{1 + \exp(\tilde Z_i^\top\beta) \int_0^{X_i} \exp\{\gamma^\top B_r(t)\}\, dt}\, \beta_1^\top (W_i - \hat W_i)}_{T_2} \tag{A.15}$$

for some $\tilde Z_i$ between $\hat Z_i$ and $Z_i$, where $\beta_1$ denotes the sub-vector of $\beta$ corresponding to the features. Since $\Delta_i$ is binary and $\beta$ is bounded in the compact set $\Omega$, we have

$$|T_1| \le \|\beta\|_\infty \sup_{i=1,\dots,n} \|\hat W_i - W_i\|_\infty \lesssim \sup_{i=1,\dots,n} \|\hat W_i - W_i\|_\infty. \tag{A.16}$$

For $T_2$, we apply the bounds for $\Delta_i$ and $\beta$ along with the bound $e^x/(1+e^x) \in [0,1]$,

$$|T_2| \le \|\beta\|_\infty \sup_{i=1,\dots,n} \|\hat W_i - W_i\|_\infty \lesssim \sup_{i=1,\dots,n} \|\hat W_i - W_i\|_\infty. \tag{A.17}$$

Thus, we obtain (A.14) by applying (A.16) and (A.17) to (A.15).

In the following, we establish the consistency and asymptotic normality of our procedure.

Proof of Lemma 1

Proof By Lemma A4, the loss with estimated features deviates from the loss with true features by at most $\sup_{i=1,\dots,n} \|\hat W_i - W_i\|_\infty$. Under Assumption (C5), this error decays faster than the $n^{-1/2}$ order. Thus, if either loss produces an estimator identifying the true parameter at the $n^{1/2}$ rate, the two losses produce asymptotically equivalent consistent estimators. We focus on the analysis of the loss with true features in the following.

For $m \in C^q[0,\varepsilon]$, there exists $\gamma_0 \in \mathbb{R}^{P_n}$ such that

$$\sup_{u \in [0,\varepsilon]} |m(u) - \tilde m(u)| = O(h^q), \tag{A.18}$$

where $\tilde m(u) = B_r(u)^\top \gamma_0$ (de Boor, 2001). In the following, we prove the results for the nonparametric estimator $\hat m(u, \beta)$ in Theorem 1 when $\beta = \beta_0$. The results then also hold when $\beta$ is a $\sqrt{n}$-consistent estimator of $\beta_0$, since the nonparametric convergence rate in Theorem 1 is slower than $n^{-1/2}$. Define the distance between neighboring knots as $h_p = \xi_{p+1} - \xi_p$, $r \le p \le R_n + r$, and $h = \max_{r \le p \le R_n + r} h_p$. Let $\rho_n = n^{-1/2} h^{-1} + h^{q-1/2}$. We will show that for any given $\epsilon > 0$ and $n$ sufficiently large, there exists a large constant $C > 0$ such that

$$\text{pr}\Big\{\sup_{\|\tau\|_2 = C} l_n(\beta_0, \gamma_0 + \rho_n \tau) < l_n(\beta_0, \gamma_0)\Big\} \ge 1 - 6\epsilon. \tag{A.19}$$

This implies that for $n$ sufficiently large, with probability at least $1 - 6\epsilon$, there exists a local maximum for (2) in the ball $\{\gamma_0 + \rho_n \tau : \|\tau\|_2 \le C\}$. Hence, there exists a local maximizer such that $\|\hat\gamma(\beta_0) - \gamma_0\|_2 = O_p(\rho_n)$. Note that

$$\frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top} = \sum_{i=1}^{n} S_{\gamma\gamma,i}(\beta_0, \gamma)$$

and

$$S_{\gamma\gamma,i}(\beta,\gamma) = -(1+\Delta_i)\, \frac{\exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)^{\otimes 2}\, du}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du]^2} - (1+\Delta_i)\, \frac{\exp(2Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)^{\otimes 2}\, du \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du]^2} + (1+\Delta_i)\, \frac{\exp(2Z_i^\top\beta) \big[\int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, B_r(u)\, du\big]^{\otimes 2}}{[1 + \exp(Z_i^\top\beta) \int_0^{X_i} \exp\{B_r(u)^\top\gamma\}\, du]^2}.$$

The first term above is negative-definite, and the sum of the last two terms is also negative semi-definite by the Cauchy–Schwarz inequality; hence $S_{\gamma\gamma,i}(\beta_0, \gamma)$ is negative-definite. Thus, $l_n(\beta_0, \gamma)$ is a concave function of $\gamma$, so the local maximizer is the global maximizer of (2), which establishes the convergence of $\hat\gamma(\beta_0)$ to $\gamma_0$.

By Taylor expansion, we have

$$l_n(\beta_0, \gamma_0 + \rho_n \tau) - l_n(\beta_0, \gamma_0) = \frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma^\top}\, \rho_n \tau - \Big\{-\frac{1}{2}\, \rho_n \tau^\top \frac{\partial^2 l_n(\beta_0, \gamma^*)}{\partial \gamma \partial \gamma^\top}\, \rho_n \tau\Big\}, \tag{A.20}$$

where $\gamma^* = \rho\gamma + (1-\rho)\gamma_0$ for some $\rho \in (0,1)$. Moreover,

$$\Big|\frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma^\top}\, \rho_n \tau\Big| \le \rho_n \Big\|\frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma}\Big\|_2 \|\tau\|_2 = C \rho_n \Big\|\frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma}\Big\|_2 \le C \rho_n (\|T_{n1}\|_2 + \|T_{n2}\|_2),$$

where

$$T_{n1} = \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m), \qquad T_{n2} = \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, \gamma_0) - \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m).$$

Recall that $S_C(\cdot)$ and $f_C(\cdot)$ are the censoring survival and density functions, respectively. In the following, all integrals are over $[0, \varepsilon]$ unless otherwise specified. We have

$$E\{S_{\gamma,i}(\beta_0, m) \mid Z_i\} = E\Big[\Delta_i B_r(X_i) - (1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)} B_r(u)\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du}\Big]$$
$$= \int_0^\varepsilon \Big[B_r(X) - \frac{2 \exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)} B_r(u)\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)}\, du}\Big] \frac{\exp\{m(X) + Z_i^\top\beta_0\}\, S_C(X \mid Z_i)}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)}\, du]^2}\, dX - \int_0^\varepsilon \frac{\exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)} B_r(u)\, du}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)}\, du]^2}\, f_C(X \mid Z_i)\, dX - \frac{\exp(Z_i^\top\beta_0) \int_0^{\varepsilon} e^{m(u)} B_r(u)\, du}{[1 + \exp(Z_i^\top\beta_0) \int_0^{\varepsilon} e^{m(u)}\, du]^2}\, S_C(\varepsilon \mid Z_i)$$
$$= \exp(Z_i^\top\beta_0) \Big[\int_0^\varepsilon S_C(X \mid Z_i)\, \frac{\partial}{\partial X} \Big\{\frac{\int_0^{X} e^{m(u)} B_r(u)\, du}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)}\, du]^2}\Big\}\, dX - \int_0^\varepsilon \frac{\int_0^{X} e^{m(u)} B_r(u)\, du}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X} e^{m(u)}\, du]^2}\, f_C(X \mid Z_i)\, dX\Big] - \frac{\exp(Z_i^\top\beta_0) \int_0^{\varepsilon} e^{m(u)} B_r(u)\, du}{[1 + \exp(Z_i^\top\beta_0) \int_0^{\varepsilon} e^{m(u)}\, du]^2}\, S_C(\varepsilon \mid Z_i) = 0,$$

where the last equality follows by integration by parts.

Thus, $E(T_{n1}) = 0$. Further, for each basis vector $e_p$,

$$E[\{e_p^\top S_{\gamma,i}(\beta_0, m)\}^2 \mid Z_i] = E\Big(\Big[\Delta_i B_{r,p}(X_i) - (1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)} B_{r,p}(u)\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du}\Big]^2 \,\Big|\, Z_i\Big) \le C_1 \Big(2 \int B_{r,p}(X)^2\, dX + 10 \exp(2 Z_i^\top\beta_0) \int e^{2m(u)}\, du \int B_{r,p}^2(u)\, du\Big) \le C_1' h,$$

for some constants $0 < C_1, C_1' < \infty$ by Condition (C4), where the squared terms are bounded using $(a - b)^2 \le 2a^2 + 2b^2$ and the Cauchy–Schwarz inequality. Thus, $E(n^{-1} \|T_{n1}\|_2^2) \le P_n n^{-1} C_1' h$. By Condition (C3), we have $h \asymp P_n^{-1}$, so $E(n^{-1} \|T_{n1}\|_2^2) \le C_1 n^{-1}$ for some constant $0 < C_1 < \infty$. Then for any $\epsilon > 0$, by Chebyshev's inequality, we have $\text{pr}(n^{-1} \|T_{n1}\|_2^2 \ge n^{-1} C_1 \epsilon^{-1}) \le \epsilon$, or equivalently,

$$\text{pr}\big(\|T_{n1}\|_2 \ge \sqrt{n C_1 \epsilon^{-1}}\big) \le \epsilon. \tag{A.21}$$

Moreover, by (A.18), we have $\sup_u |B_r(u)^\top \gamma_0 - m(u)| = O(h^q)$. Denote

$$T_{ip} = e_p^\top \{S_{\gamma,i}(\beta_0, \gamma_0) - S_{\gamma,i}(\beta_0, m)\} = (1+\Delta_i) \Big[\frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)} B_{r,p}(u)\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du} - \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma_0} B_{r,p}(u)\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma_0}\, du}\Big].$$

Expanding the difference of ratios into three terms, each involving a factor $e^{m(u)} - e^{B_r(u)^\top\gamma_0}$, and bounding each term, we obtain

$$|T_{ip}| \le 2 \exp(Z_i^\top\beta_0) \int_0^{X_i} |e^{m(u)} - e^{B_r(u)^\top\gamma_0}|\, B_{r,p}(u)\, du + 2 \exp(2Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)} B_{r,p}(u)\, du \int_0^{X_i} |e^{B_r(u)^\top\gamma_0} - e^{m(u)}|\, du + 2 \exp(2Z_i^\top\beta_0) \int_0^{X_i} |e^{m(u)} - e^{B_r(u)^\top\gamma_0}|\, B_{r,p}(u)\, du \int_0^{X_i} e^{m(u)}\, du \le C_2 h^{q+1}$$

for a constant $0 < C_2 < \infty$ under Condition (C4). Therefore, $E(\|T_{n2}\|_2) \le \{P_n (C_2 h^{q+1} n)^2\}^{1/2} = P_n^{1/2} C_2 n h^{q+1} \asymp C_2' n h^{q+1/2}$ for a constant $0 < C_2' < \infty$, and $E(\|T_{n2}\|_2^2) \le P_n (C_2 h^{q+1} n)^2 \asymp (C_2' n h^{q+1/2})^2$. Again by Chebyshev's inequality, for $1/4 > \epsilon > 0$, we have

$$\text{pr}\big(\|T_{n2}\|_2 \ge \epsilon^{-1/2} C_2' n h^{q+1/2}\big) \le \text{pr}\big\{\big|\|T_{n2}\|_2 - E(\|T_{n2}\|_2)\big| \ge \epsilon^{-1/2} C_2' n h^{q+1/2}/2\big\} + \text{pr}\big\{E(\|T_{n2}\|_2) \ge \epsilon^{-1/2} C_2' n h^{q+1/2}/2\big\} \le \text{pr}\big(\big|\|T_{n2}\|_2 - E(\|T_{n2}\|_2)\big| \ge \epsilon^{-1/2} \{\text{var}(\|T_{n2}\|_2)\}^{1/2}/2\big) \le 4\epsilon. \tag{A.22}$$

Combining (A.21) and (A.22), with probability at least $1 - 5\epsilon$,

$$\big|\{\partial l_n(\beta_0, \gamma_0)/\partial \gamma^\top\}\, \rho_n \tau\big| \le C \rho_n (\|T_{n1}\|_2 + \|T_{n2}\|_2) \le C \rho_n \big(\sqrt{C_1 \epsilon^{-1}}\, n^{1/2} + \epsilon^{-1/2} C_2 n h^{q+1/2}\big). \tag{A.23}$$

Moreover, Lemma A5 implies there exists a constant $0 < C_3 < \infty$ such that

$$-\frac{1}{2}\, \tau^\top \frac{\partial^2 l_n(\beta_0, \gamma^*)}{\partial \gamma \partial \gamma^\top}\, \tau \ge n C_3 C^2 h$$

for $n$ sufficiently large, with probability approaching 1. Thus, for any $\epsilon > 0$, with probability at least $1 - \epsilon$,

$$2^{-1} (\rho_n \tau)^\top \Big\{-\frac{\partial^2 l_n(\beta_0, \gamma^*)}{\partial \gamma \partial \gamma^\top}\Big\} (\rho_n \tau) \ge \rho_n^2 C_3 C^2 n h. \tag{A.24}$$

Therefore, by (A.20), (A.23) and (A.24), for $n$ sufficiently large, with probability at least $1 - 6\epsilon$,

$$l_n(\beta_0, \gamma_0 + \rho_n \tau) - l_n(\beta_0, \gamma_0) \le C \rho_n \big(\sqrt{C_1 \epsilon^{-1}}\, n^{1/2} + \epsilon^{-1/2} C_2 n h^{q+1/2}\big) - \rho_n^2 C_3 C^2 n h = C \rho_n h \big(\sqrt{C_1 \epsilon^{-1}}\, n^{1/2} h^{-1} + \epsilon^{-1/2} C_2 n h^{q-1/2} - C C_3 n \rho_n\big) = C \rho_n h \big(\sqrt{C_1 \epsilon^{-1}}\, n^{1/2} h^{-1} + \epsilon^{-1/2} C_2 n h^{q-1/2} - C C_3 n^{1/2} h^{-1} - C C_3 n h^{q-1/2}\big) < 0,$$

when $C > \max\big(C_3^{-1} \sqrt{C_1 \epsilon^{-1}},\; \epsilon^{-1/2} C_3^{-1} C_2\big)$. This shows (A.19). Hence, we have $\|\hat\gamma(\beta_0) - \gamma_0\|_2 = O_p(\rho_n) = O_p(n^{-1/2} h^{-1} + h^{q-1/2}) = o_p(1)$ under Condition (C3).

It is easily seen that $E\{\|S_{\gamma,i}(\beta_0, m)\|_\infty^d\} \le C_4^d h$ for a constant $1 < C_4 < \infty$ and any $d \ge 1$. By Bernstein's inequality, under Condition (C3), we have

$$\Big\|n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)\Big\|_\infty = O_p\big[h + \{h \log(n)\}^{1/2} n^{-1/2}\big] = O_p(h).$$

Also, it is easy to check that

$$\Big\|n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m) - n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, \gamma_0)\Big\|_\infty = O_p(h^{q+1}).$$

Thus, combining with Lemmas A7–A8, we have

$$\Big|B_r(u)^\top \Big[\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} \Big\{n^{-1} \frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma}\Big\} - V_n(\beta_0)^{-1}\, n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)\Big]\Big| \le r \Big\{\Big\|\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1}\Big\|_\infty \Big\|n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, \gamma_0) - n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)\Big\|_\infty + \Big\|\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} - V_n(\beta_0)^{-1}\Big\|_\infty \Big\|n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)\Big\|_\infty\Big\} = O_p(h^{-1}) O_p(h^{q+1}) + O_p(h^{q-1} + n^{-1/2} h^{-1}) O_p(h) = O_p(h^q + n^{-1/2}), \tag{A.25}$$

where the inequality above uses the fact that for any $u$, only $r$ elements of $B_r(u)$ are non-zero.

Let $\hat e = V_n(\beta_0)^{-1}\, n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)$ and $Z = (Z_1, \dots, Z_n)$. By the Central Limit Theorem,

$$\big[B_r(u)^\top \text{var}(\hat e \mid Z)\, B_r(u)\big]^{-1/2} B_r(u)^\top \hat e \to \text{Normal}(0, 1),$$

where $\text{var}(\hat e \mid Z) = \{V_n(\beta_0)\}^{-1} \{n^{-2} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)^{\otimes 2}\} \{V_n(\beta_0)\}^{-1}$ and $B_r(u)^\top \text{var}(\hat e \mid Z)\, B_r(u) = \hat\sigma^2(u, \beta_0)$. With Lemmas A7 and A9, we get $c_5 (nh)^{-1} \|B_r(u)\|_2^2 \le B_r(u)^\top \text{var}(\hat e \mid Z)\, B_r(u) \le C_5 (nh)^{-1} \|B_r(u)\|_2^2$ for some constants $0 < c_5, C_5 < \infty$. So there exist constants $0 < c_\sigma \le C_\sigma < \infty$ such that, with probability approaching 1 and for $n$ large enough,

$$c_\sigma (nh)^{-1/2} \le \inf_{u \in [0,\varepsilon]} \hat\sigma(u, \beta_0) \le \sup_{u \in [0,\varepsilon]} \hat\sigma(u, \beta_0) \le C_\sigma (nh)^{-1/2}. \tag{A.26}$$

Thus $B_r(u)^\top \hat e = O_p\{(nh)^{-1/2}\}$ uniformly in $u \in [0, \varepsilon]$, and hence

$$B_r(u)^\top \Big\{-\frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} \Big\{\frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma}\Big\} = O_p\{(nh)^{-1/2} + h^q + n^{-1/2}\} = O_p(h^q + n^{-1/2} h^{-1/2}),$$

uniformly in $u \in [0, \varepsilon]$ as well.

By Taylor expansion,

$$B_r(u)^\top \{\hat\gamma(\beta_0) - \gamma_0\} = B_r(u)^\top \Big\{-\frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} \Big\{\frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma}\Big\} \{1 + o_p(1)\} = B_r(u)^\top \Big\{-\frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} \Big\{\frac{\partial l_n(\beta_0, \gamma_0)}{\partial \gamma}\Big\} + o_p(h^q + n^{-1/2} h^{-1/2}). \tag{A.27}$$

Thus, by (A.25), (A.26), (A.27) and Condition (C3),

$$\sup_{u \in [0,\varepsilon]} \big|\hat\sigma(u, \beta_0)^{-1} \big[B_r(u)^\top \{\hat\gamma(\beta_0) - \gamma_0\} - B_r(u)^\top \hat e\big]\big| = O_p\{(nh)^{1/2}\} \{O_p(h^q + n^{-1/2}) + o_p(h^q + n^{-1/2} h^{-1/2})\} = O_p(n^{1/2} h^{q+1/2} + h^{1/2}) + o_p(1) = o_p(1).$$

Therefore, by Slutsky's theorem, $\hat\sigma^{-1}(u, \beta_0)\{\hat m(u, \beta_0) - \tilde m(u)\} \to \text{Normal}(0, 1)$ and $\hat m(u, \beta_0) - \tilde m(u) = O_p\{(nh)^{-1/2}\}$ uniformly in $u \in [0, \varepsilon]$. Since $\sup_{u \in [0,\varepsilon]} |m(u) - \tilde m(u)| = O(h^q)$, we have $|\hat m(u, \beta_0) - m(u)| = O_p\{(nh)^{-1/2} + h^q\}$ uniformly in $u \in [0, \varepsilon]$. By Slutsky's theorem and Condition (C3), we have

$$\hat\sigma^{-1}(u, \beta_0)\{\hat m(u, \beta_0) - m(u)\} \to \text{Normal}(0, 1).$$

Proof of Lemma 2

Proof Because $S_{\beta\beta,i}(\beta,\gamma)$ is negative definite and $E\{S_{\beta,i}(\beta_0, m)\} = 0$, a similar but simpler derivation than that for Theorem 1 shows the consistency of the maximizer $\hat\beta$.

Because $\sum_{i=1}^{n} S_{\gamma,i}\{\beta, \hat\gamma(\beta)\} = 0$ at any $\beta$, we have

$$0 = \sum_{i=1}^{n} \frac{\partial S_{\gamma,i}\{\beta, \hat\gamma(\beta)\}}{\partial \beta^\top} + \sum_{i=1}^{n} S_{\gamma\gamma,i}\{\beta, \hat\gamma(\beta)\}\, \frac{\partial \hat\gamma(\beta)}{\partial \beta^\top} = \sum_{i=1}^{n} S_{\beta\gamma,i}\{\beta, \hat\gamma(\beta)\}^\top + \sum_{i=1}^{n} S_{\gamma\gamma,i}\{\beta, \hat\gamma(\beta)\}\, \frac{\partial \hat\gamma(\beta)}{\partial \beta^\top},$$

so

$$\frac{\partial \hat\gamma(\beta_0)}{\partial \beta^\top} = -\Big[n^{-1} \sum_{i=1}^{n} S_{\gamma\gamma,i}\{\beta_0, \hat\gamma(\beta_0)\}\Big]^{-1} n^{-1} \sum_{i=1}^{n} S_{\beta\gamma,i}\{\beta_0, \hat\gamma(\beta_0)\}^\top = V_n(\beta_0)^{-1} E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top + r_1, \tag{A.28}$$

where $r_1$ is a residual term of smaller order than $V_n(\beta_0)^{-1} E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top$ componentwise. Note that $S_{\beta\gamma,i}(\beta,\gamma) = O_p(h)$ uniformly elementwise. Hence,

$$\|S_{\beta\gamma,i}(\beta_0, m)\|_2 = \|S_{\beta\gamma,i}(\beta_0, m)^\top\|_2 = O_p(h^{1/2}), \qquad \|S_{\beta\gamma,i}(\beta_0, m)^\top\|_\infty = O_p(h), \qquad \|S_{\beta\gamma,i}(\beta_0, m)\|_\infty = O_p(1).$$

Subsequently, we have

$$\|V_n(\beta_0)^{-1} E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top\|_2 \le \|V_n(\beta_0)^{-1}\|_2\, \|E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top\|_2 = O_p(h^{-1}) O_p(h^{1/2}) = O_p(h^{-1/2}),$$

and

$$\|V_n(\beta_0)^{-1} E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top\|_\infty \le \|V_n(\beta_0)^{-1}\|_\infty\, \|E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top\|_\infty = O_p(h^{-1}) O_p(h) = O_p(1).$$

Here we use the facts that $\|V_n(\beta_0)^{-1}\|_2 = O_p(h^{-1})$ and $\|V_n(\beta_0)^{-1}\|_\infty = O_p(h^{-1})$, where the former is a direct corollary of Lemma A5 and the latter is shown in Lemma A7. Therefore, $\|r_1\|_2 = o_p(h^{-1/2})$ and $\|r_1\|_\infty = o_p(1)$.

By Taylor expansion, for $\beta^* = \rho\beta_0 + (1-\rho)\hat\beta$, $0 < \rho < 1$,

$$0 = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}\{\hat\beta, \hat\gamma(\hat\beta)\} = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}\{\beta_0, \hat\gamma(\beta_0)\} + n^{-1} \sum_{i=1}^{n} S_{\beta\beta,i}\{\beta^*, \hat\gamma(\beta^*)\}\, n^{1/2}(\hat\beta - \beta_0) + n^{-1} \sum_{i=1}^{n} \Big[S_{\beta\gamma,i}\{\beta^*, \hat\gamma(\beta^*)\}\, \frac{\partial \hat\gamma(\beta^*)}{\partial \beta^\top}\Big]\, n^{1/2}(\hat\beta - \beta_0) = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}\{\beta_0, \hat\gamma(\beta_0)\} + \big[E\{S_{\beta\beta,i}(\beta_0, m)\} + o_p(1)\big]\, n^{1/2}(\hat\beta - \beta_0) + \Big[E\{S_{\beta\gamma,i}(\beta_0, m)\}\, \frac{\partial \hat\gamma(\beta_0)}{\partial \beta^\top} + r_2\Big]\, n^{1/2}(\hat\beta - \beta_0), \tag{A.29}$$

where $r_2$ is a residual term of smaller order than $E\{S_{\beta\gamma,i}(\beta_0, m)\}\, \partial\hat\gamma(\beta_0)/\partial\beta^\top$ componentwise. We claim that the residual term $r_2$ satisfies $\|r_2\|_2 = o_p(1)$ and $\|r_2\|_\infty = o_p(1)$. This is because

$$\Big\|E\{S_{\beta\gamma,i}(\beta_0, m)\}\, \frac{\partial \hat\gamma(\beta_0)}{\partial \beta^\top}\Big\|_2 \le \|E\{S_{\beta\gamma,i}(\beta_0, m)\}\|_2\, \Big\|\frac{\partial \hat\gamma(\beta_0)}{\partial \beta^\top}\Big\|_2 = O_p(h^{1/2}) O_p(h^{-1/2}) = O_p(1),$$

and

$$\Big\|E\{S_{\beta\gamma,i}(\beta_0, m)\}\, \frac{\partial \hat\gamma(\beta_0)}{\partial \beta^\top}\Big\|_\infty \le \|E\{S_{\beta\gamma,i}(\beta_0, m)\}\|_\infty\, \Big\|\frac{\partial \hat\gamma(\beta_0)}{\partial \beta^\top}\Big\|_\infty = O_p(1) O_p(1) = O_p(1),$$

which leads to the claimed order of the residual $r_2$ in (A.29).

We further use a Taylor expansion to write

$$n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}\{\beta_0, \hat\gamma(\beta_0)\} = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}(\beta_0, \gamma_0) + n^{-1/2} \sum_{i=1}^{n} S_{\beta\gamma,i}(\beta_0, \gamma^*)\, \{\hat\gamma(\beta_0) - \gamma_0\} = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}(\beta_0, \gamma_0) + \big(E\{S_{\beta\gamma,i}(\beta_0, m)\} + r\big)\, n^{1/2} \{\hat\gamma(\beta_0) - \gamma_0\} + o_p(1) = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}(\beta_0, \gamma_0) + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, n^{1/2} \{\hat\gamma(\beta_0) - \gamma_0\} + o_p(1),$$

where $\gamma^* = \rho\gamma_0 + (1-\rho)\hat\gamma$, $0 < \rho < 1$, and the residual term $r$ in the second-to-last equality satisfies $\|r\|_\infty = O_p(n^{-1/2})$ and $\|r\|_2 = O_p(n^{-1/2} h^{-1/2})$.

Plugging this and (A.28) into (A.29), and recalling that

$$A = E\{S_{\beta\beta,i}(\beta_0, m)\} - E\{S_{\beta\gamma,i}(\beta_0, m)\}\, \big[E\{S_{\gamma\gamma,i}(\beta_0, m)\}\big]^{-1} E\{S_{\beta\gamma,i}(\beta_0, m)\}^\top,$$

we get

$$n^{1/2} \{A + o_p(1)\} (\hat\beta - \beta_0) = n^{-1/2} \sum_{i=1}^{n} S_{\beta,i}(\beta_0, \gamma_0) + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, n^{1/2} \{\hat\gamma(\beta_0) - \gamma_0\} + o_p(1) = n^{-1/2} \sum_{i=1}^{n} \big[S_{\beta,i}(\beta_0, \gamma_0) + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, V_n(\beta_0)^{-1} S_{\gamma,i}(\beta_0, \gamma_0)\big] + o_p(1).$$

It is straightforward to check that

$$E\{S_{\beta,i}(\beta_0, m)\} = E\Big[\Delta_i Z_i - (1+\Delta_i)\, \frac{Z_i \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du}\Big] = 0,$$

by the same integration-by-parts argument used to show $E\{S_{\gamma,i}(\beta_0, m)\} = 0$, with $Z_i \int_0^{X} e^{m(u)}\, du$ in place of $\int_0^{X} e^{m(u)} B_r(u)\, du$. Thus,

$$E\big[S_{\beta,i}(\beta_0, \gamma_0) + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, V_n(\beta_0)^{-1} S_{\gamma,i}(\beta_0, \gamma_0)\big] = E\big[S_{\beta,i}(\beta_0, \gamma_0) - S_{\beta,i}(\beta_0, m) + E\{S_{\beta\gamma,i}(\beta_0, m)\}\, V_n(\beta_0)^{-1} \{S_{\gamma,i}(\beta_0, \gamma_0) - S_{\gamma,i}(\beta_0, m)\}\big] = O(h^{q+1}) + O_p\big(\|E\{S_{\beta\gamma,i}(\beta_0, m)\}\|_\infty\, \|V_n(\beta_0)^{-1}\|_\infty\, \|E\{S_{\gamma,i}(\beta_0, \gamma_0) - S_{\gamma,i}(\beta_0, m)\}\|_\infty\big) = O(h^{q+1}) + O_p(1) O_p(h^{-1}) O_p(h^{q+1}) = O(h^q).$$

By the Central Limit Theorem, $n^{1/2}(\hat\beta - \beta_0) \to \text{Normal}\{0, A^{-1} \Sigma (A^{-1})^\top\}$, where $\Sigma$ is given in Theorem 2.

Proof of Theorem 1

Proof We prove the theorem in two steps. First, we derive the asymptotic distribution of the solution $\tilde\theta$ obtained by restricting $\theta$ to the oracle group selection set $\mathcal{S}$. Then we verify that $\tilde\theta$ satisfies the optimality condition of the original problem (6). Without loss of generality, we rearrange the order of the covariates by moving the nonzero groups to the front, so that with simpler notation $\mathcal{S} = \{1, \dots, \text{card}(\mathcal{S})\}$. We denote the Hessian and its limit by

$$\hat H = -n^{-1} \nabla^2 \ell_n(\hat\theta_{\text{MLE}}), \qquad H = -E\begin{pmatrix} S_{\beta\beta,i} & S_{\beta\gamma,i} \\ S_{\beta\gamma,i}^\top & S_{\gamma\gamma,i} \end{pmatrix},$$

and use the sub-matrix notations $A_{\mathcal{S},\cdot}$ for selecting rows, $A_{\cdot,\mathcal{S}}$ for selecting columns, and $A_{\mathcal{S},\mathcal{S}}$ for selecting rows and columns in $\mathcal{S} \cup \{p+1, \dots, p+P_n\}$. We denote the variance of the score by $V = E\{(S_{\beta,i}^\top, S_{\gamma,i}^\top)^\top (S_{\beta,i}^\top, S_{\gamma,i}^\top)\}$.

Define the oracle selection subspace $R_{\mathcal{S}} = \{\theta \in \mathbb{R}^{p+P_n} : \theta_j = 0 \text{ for } j \le p,\, j \notin \mathcal{S}\}$ and the estimator under oracle selection

$$\tilde\theta = \arg\min_{\theta \in R_{\mathcal{S}}}\; (\theta - \hat\theta_{\text{MLE}})^\top \hat H\, (\theta - \hat\theta_{\text{MLE}}) + \lambda \sum_g \frac{\|\beta_{[g]}\|_2}{\|\hat\beta_{\text{MLE},[g]}\|_2}. \tag{A.30}$$

Since $\mathcal{S}$ contains only groups with nonzero coefficients in $\beta_0$, and $\hat\beta_{\text{MLE}}$ is consistent for $\beta_0$ by Lemma 2, the denominators $\|\hat\beta_{\text{MLE},[g]}\|_2$ in the penalty terms of (A.30) are bounded away from zero. Then, choosing $\lambda = o(n^{-1/2})$, we have the solution

$$\tilde\theta_{\mathcal{S}} = \hat H_{\mathcal{S},\mathcal{S}}^{-1} \hat H_{\mathcal{S},\cdot}\, \hat\theta_{\text{MLE}} + o_p(n^{-1/2}).$$

Using the identity

$$\hat H_{\mathcal{S},\mathcal{S}}^{-1} \hat H_{\mathcal{S},\cdot}\, \theta_0 = \hat H_{\mathcal{S},\mathcal{S}}^{-1} \hat H_{\mathcal{S},\mathcal{S}}\, \theta_{0,\mathcal{S}} = \theta_{0,\mathcal{S}},$$

we may derive the estimation error of $\tilde\theta$ as

$$\sqrt{n}\,(\tilde\theta_{\mathcal{S}} - \theta_{0,\mathcal{S}}) = \sqrt{n}\, \hat H_{\mathcal{S},\mathcal{S}}^{-1} \hat H_{\mathcal{S},\cdot}\, (\hat\theta_{\text{MLE}} - \theta_0) + \underbrace{\hat H_{\mathcal{S},\mathcal{S}}^{-1} \hat H_{\mathcal{S},\cdot}\, \theta_0 - \theta_{0,\mathcal{S}}}_{=0} + O_p(\sqrt{n}\lambda) = \hat H_{\mathcal{S},\mathcal{S}}^{-1} \hat H_{\mathcal{S},\cdot}\, \sqrt{n}\,(\hat\theta_{\text{MLE}} - \theta_0) + o_p(1). \tag{A.31}$$

In the proof of Lemma 2, we have established, for $h^q \ll n^{-1/2}$, the asymptotic normality of $\hat\theta_{\text{MLE}}$ and the consistency of the Hessian:

$$\sqrt{n}\,(\hat\theta_{\text{MLE}} - \theta_0) \to \text{Normal}(0, H^{-1} V H^{-1}), \qquad \|\hat H - H\| = O_p(n^{-1/2}). \tag{A.32}$$

Applying (A.32) to (A.31), we obtain

$$\sqrt{n}\,(\tilde\theta_{\mathcal{S}} - \theta_{0,\mathcal{S}}) \to \text{Normal}\big(0,\; H_{\mathcal{S},\mathcal{S}}^{-1} H_{\mathcal{S},\cdot} H^{-1} V H^{-1} H_{\cdot,\mathcal{S}} H_{\mathcal{S},\mathcal{S}}^{-1}\big) = \text{Normal}\big(0,\; H_{\mathcal{S},\mathcal{S}}^{-1} V_{\mathcal{S},\mathcal{S}} H_{\mathcal{S},\mathcal{S}}^{-1}\big).$$

Profiling out the $\gamma$ components as in Lemma 2, we have

$$\sqrt{n}\,(\tilde\beta_{\mathcal{S}} - \beta_{0,\mathcal{S}}) \to \text{Normal}\big(0,\; A_{\mathcal{S},\mathcal{S}}^{-1} \Sigma_{\mathcal{S},\mathcal{S}} A_{\mathcal{S},\mathcal{S}}^{-1}\big). \tag{A.33}$$

The optimality conditions for the original problem (6) are

$$\text{if } \|\beta_{[g]}\|_2 > 0: \quad 2\hat H_{[g],\cdot}\, \theta - 2\hat H_{[g],\cdot}\, \hat\theta_{\text{MLE}} + \frac{\lambda\, \beta_{[g]}}{\|\hat\beta_{\text{MLE},[g]}\|_2\, \|\beta_{[g]}\|_2} = 0;$$
$$\text{if } \|\beta_{[g]}\|_2 = 0: \quad \big\|2\hat H_{[g],\cdot}\, \theta - 2\hat H_{[g],\cdot}\, \hat\theta_{\text{MLE}}\big\|_2 \le \frac{\lambda}{\|\hat\beta_{\text{MLE},[g]}\|_2};$$
$$\text{for } j > p: \quad 2\hat H_{j,\cdot}\, \theta - 2\hat H_{j,\cdot}\, \hat\theta_{\text{MLE}} = 0. \tag{A.34}$$

The oracle selection estimator $\tilde\theta$ must satisfy the conditions in (A.34) for positions in $R_{\mathcal{S}}$, by the same set of optimality conditions for (A.30). We only need to verify that $\tilde\theta$ also satisfies the conditions in (A.34) for $g \in \mathcal{S}^c = \{1, \dots, p\} \setminus \mathcal{S}$. By Lemma 2 and the definition of $\mathcal{S}$, we have

$$\|\hat\theta_{\text{MLE},[g]}\|_2 = O_p(n^{-1/2}), \quad \text{for } g : \beta_{0,[g]} = 0. \tag{A.35}$$

For $\lambda \gg n^{-1}$, the penalty factor for a zero group $g$ satisfies

$$\frac{\lambda}{\|\hat\theta_{\text{MLE},[g]}\|_2} \gg n^{-1/2}, \quad \text{for } g : \beta_{0,[g]} = 0. \tag{A.36}$$

By the definition of $\tilde\theta \in R_{\mathcal{S}}$, the $\mathcal{S}^c$ components of $\tilde\theta$ are all zero,

$$\tilde\theta_{[g]} = 0, \quad \text{for } g : \beta_{0,[g]} = 0. \tag{A.37}$$

Combining (A.35)–(A.37), we establish that the optimality conditions in (A.34) hold asymptotically,

$$\big\|2\hat H_{[g],\cdot}\, \tilde\theta - 2\hat H_{[g],\cdot}\, \hat\theta_{\text{MLE}}\big\|_2 = O_p(n^{-1/2}) \ll \frac{\lambda}{\|\hat\theta_{\text{MLE},[g]}\|_2}$$

for $g : \beta_{0,[g]} = 0$, i.e., all groups in $\mathcal{S}^c$. Therefore, we conclude that $\hat\beta_{\text{glasso}} = \tilde\beta$ with probability tending to one. The asymptotic distribution of $\tilde\beta$ in (A.33) is thus the asymptotic distribution of $\hat\beta_{\text{glasso}}$.
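A quadratic objective of the form (A.30) can be prototyped with a proximal-gradient (group soft-thresholding) loop. The sketch below is our own illustration of such an adaptive group-lasso step under the assumption that $\hat H$ and $\hat\theta_{\text{MLE}}$ are available; it is not the authors' implementation.

```python
import numpy as np

def lsa_group_lasso(theta_mle, H, groups, lam, n_iter=500):
    """Minimise (theta - theta_mle)' H (theta - theta_mle)
    + lam * sum_g ||beta_[g]||_2 / ||beta_mle_[g]||_2 by proximal
    gradient.  `groups` maps a group name to the index array of its
    penalised beta entries; spline coefficients stay unpenalised."""
    theta = theta_mle.copy()
    step = 0.5 / np.linalg.eigvalsh(H).max()     # 1 / Lipschitz constant
    weights = {g: lam / max(np.linalg.norm(theta_mle[idx]), 1e-12)
               for g, idx in groups.items()}     # adaptive weights
    for _ in range(n_iter):
        theta = theta - step * 2.0 * H @ (theta - theta_mle)
        for g, idx in groups.items():            # group soft-thresholding
            norm = np.linalg.norm(theta[idx])
            theta[idx] *= max(0.0, 1.0 - step * weights[g] / max(norm, 1e-12))
    return theta
```

Groups whose shrinkage factor reaches zero are excluded from the fitted model, mirroring the oracle selection argument above.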

Proof of Corollary 1

Proof Using the delta method, it is seen that

$$\hat F(t \mid \hat Z) - F(t \mid Z) \asymp F(t \mid Z)\{1 - F(t \mid Z)\} \int_0^t e^{\beta_0^\top Z + m(u)} \big\{B_r(u)^\top \hat\gamma_{\text{glasso}} - m(u) + (\hat\beta_{\text{glasso}} - \beta_0)^\top Z + \beta_0^\top(\hat Z - Z)\big\}\, du \asymp F(t \mid Z)\{1 - F(t \mid Z)\} \int_0^t e^{\beta_0^\top Z + m(u)} \big\{B_r(u)^\top (\hat\gamma_{\text{glasso}} - \gamma_0) + (\hat\beta_{\text{glasso}} - \beta_0)^\top Z\big\}\, du + O_p(h^q) + O_p(\|\hat Z - Z\|).$$

Applying the $\sqrt{n}$ asymptotic normality of $\hat\gamma_{\text{glasso}}$ and $\hat\beta_{\text{glasso}}$ from Lemma 1 and Theorem 1, along with Assumption (C5), we conclude that

$$\hat F(t \mid \hat Z) - F(t \mid Z) \asymp n^{-1/2} + h^q,$$

which is $\sqrt{n}$ asymptotically normal when $h < n^{-1/(2q)}$.
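Under the fitted model, the predicted distribution function in Corollary 1 can be evaluated directly. A small self-contained sketch, assuming the estimated baseline $\hat m(u)$ has already been evaluated on a grid (for instance as a B-spline design matrix times $\hat\gamma_{\text{glasso}}$); the helper name is ours.

```python
import numpy as np

def po_cdf(t_grid, m_grid, z, beta_hat):
    """F_hat(t | Z) = e^{Z'beta} A(t) / {1 + e^{Z'beta} A(t)}, where
    A(t) = int_0^t exp{m_hat(u)} du, via a cumulative trapezoid rule."""
    e_m = np.exp(np.asarray(m_grid, float))
    t = np.asarray(t_grid, float)
    # cumulative integral A(t) along the grid, A(0) = 0
    A = np.concatenate([[0.0],
                        np.cumsum((e_m[1:] + e_m[:-1]) / 2.0 * np.diff(t))])
    odds = np.exp(z @ beta_hat) * A
    return odds / (1.0 + odds)          # predicted F_hat(t | Z) on t_grid
```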

Appendix E. Matrix Norms

Lemma A5 There exist constants $0 < c < C < \infty$ such that, for $n$ sufficiently large, with probability approaching 1,

$$c h < \Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\|_2 < C h, \qquad c h < \Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\|_\infty < C h, \qquad c h < \|V_n(\beta_0)\|_2 < C h, \qquad c h < \|V_n(\beta_0)\|_\infty < C h,$$

where $\gamma$ is an arbitrary vector in $\mathbb{R}^{P_n}$ with $\|\gamma - \gamma_0\|_2 = o_p(1)$. Furthermore, for arbitrary $a \in \mathbb{R}^{P_n}$,

$$c h \|a\|_2^2 < a^\top \Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\} a < C h \|a\|_2^2, \qquad c h \|a\|_2^2 < a^\top V_n(\beta_0)\, a < C h \|a\|_2^2.$$

Proof We only prove the result for $-\partial^2 l_n(\beta_0, \gamma)/\partial\gamma\partial\gamma^\top$; the proof for $V_n(\beta_0)$ can be obtained similarly. We have

$$-n^{-1} a^\top \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\, a = n^{-1} \sum_{i=1}^{n} a^\top \Big((1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, B_r(u)^{\otimes 2}\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, du} - (1+\Delta_i)\, \frac{\exp(2Z_i^\top\beta_0) \big[\int_0^{X_i} e^{B_r(u)^\top\gamma}\, B_r(u)\, du\big]^{\otimes 2}}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, du]^2}\Big) a \ge n^{-1} \sum_{i=1}^{n} a^\top \Big((1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, B_r(u)^{\otimes 2}\, du}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, du]^2}\Big) a \ge c_1 n^{-1} \sum_{i=1}^{n} a^\top \Big\{(1+\Delta_i) \int_0^{X_i} B_r(u)^{\otimes 2}\, du\Big\} a,$$

which, with probability approaching 1, is bounded below by a constant multiple of its expectation,

$$c_1 a^\top \Big\{\int_0^\varepsilon \int_0^x B_r(u)^{\otimes 2}\, du\, f_C(x) S_T(x)\, dx + \int_0^\varepsilon B_r(u)^{\otimes 2}\, du\; S_C(\varepsilon) S_T(\varepsilon)\Big\} a \ge S_T(\varepsilon) S_C(\varepsilon)\, c_1\, a^\top \Big\{\int_0^\varepsilon B_r(u)^{\otimes 2}\, du\Big\} a \ge c_1' h \|a\|_2^2, \tag{A.38}$$

for positive constants $0 < c_1, c_1' < \infty$ by Conditions (C1) and (C4).

Following a similar argument, we further obtain

$$a^\top \Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\} a \le n^{-1} \sum_{i=1}^{n} a^\top \Big[(1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, B_r(u)^{\otimes 2}\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{B_r(u)^\top\gamma}\, du}\Big] a \le C_1 n^{-1} \sum_{i=1}^{n} a^\top \int_0^{X_i} B_r(u)^{\otimes 2}\, du\; a \le C_1 a^\top \int_0^\varepsilon B_r(u)^{\otimes 2}\, du\; a \le C_1' h \|a\|_2^2, \tag{A.39}$$

for some constants $0 < C_1, C_1' < \infty$, because $\int_0^\varepsilon B_r(u)^{\otimes 2}\, du$ is an $r$-banded matrix whose diagonal and $j$th off-diagonal elements are of order $O(h)$ uniformly, for $j = 1, \dots, r-1$, and 0 elsewhere.

Combining (A.38) and (A.39), we have

$$c_1 h < \Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\|_2 < C_1 h.$$

Next, we investigate the order of $\|n^{-1} \{\partial^2 l_n(\beta_0, \gamma)/\partial\gamma\partial\gamma^\top\}\|_\infty$. We have

$$\Big\|n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\|_\infty = \max_{1 \le j \le P_n} \sum_{k=1}^{P_n} \Big|\Big[n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big]_{jk}\Big| \ge \sum_{k=1}^{P_n} \Big|\Big[n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big]_{1k}\Big| \ge c_2' \sum_{k=1}^{P_n} \Big|n^{-1} \sum_{i=1}^{n} \int_0^{X_i} B_{r,1}(u) B_{r,k}(u)\, du\Big| = c_2' n^{-1} \sum_{i=1}^{n} \int_0^{X_i} \sum_{k=1}^{r} B_{r,1}(u) B_{r,k}(u)\, du \ge c_2 E \int_0^{X_i} \sum_{k=1}^{r} B_{r,1}(u) B_{r,k}(u)\, du \asymp c_2 h$$

with probability approaching 1 as $n \to \infty$, where $0 < c_2, c_2' < \infty$ are constants. Here, for an arbitrary matrix $A$, we use $A_{jk}$ to denote its element in the $j$th row and $k$th column. In the above inequalities, we use the fact that the B-spline basis functions are all non-negative and each is non-zero on no more than $r$ consecutive intervals formed by its knots.

On the other hand,

$$\Big\|n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma)}{\partial \gamma \partial \gamma^\top}\Big\|_\infty \le C_2' n^{-1} \sum_{i=1}^{n} \max_{1 \le j \le P_n} \sum_{k=1}^{P_n} \Big[\Big|\Big\{\int_0^{X_i} B_r(u)^{\otimes 2}\, du\Big\}_{jk}\Big| + \int_0^{X_i} B_{r,j}(u)\, du \int_0^{X_i} B_{r,k}(u)\, du\Big] \le C_2' n^{-1} \sum_{i=1}^{n} \Big[\max_{1 \le j \le P_n} \sum_{k=\max(1, j-r+1)}^{\min(j+r-1, P_n)} \int_0^\varepsilon B_{r,j}(u) B_{r,k}(u)\, du + \max_{1 \le j \le P_n} \sum_{k=1}^{P_n} \int_0^\varepsilon B_{r,j}(u)\, du \int_0^\varepsilon B_{r,k}(u)\, du\Big] \le C_2 h.$$

Hence, $c_2 h \le \|n^{-1} \{\partial^2 l_n(\beta_0, \gamma)/\partial\gamma\partial\gamma^\top\}\|_\infty \le C_2 h$.

Therefore, Lemma A5 holds for c = min(c1, c2) and C = max(C1, C2).

Lemma A6

$$\Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top} - V_n(\beta_0)\Big\|_2 = O_p(h^{q+1} + n^{-1/2} h), \qquad \Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top} - V_n(\beta_0)\Big\|_\infty = O_p(h^{q+1} + n^{-1/2} h).$$

Proof Similarly to the previous derivations,

$$-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top} - V_n(\beta_0) = -n^{-1} \sum_{i=1}^{n} S_{\gamma\gamma,i}(\beta_0, \gamma_0) - V_n(\beta_0) = n^{-1} \sum_{i=1}^{n} \{S_{\gamma\gamma,i}(\beta_0, m) - S_{\gamma\gamma,i}(\beta_0, \gamma_0)\} + \Big[-n^{-1} \sum_{i=1}^{n} S_{\gamma\gamma,i}(\beta_0, m) - V_n(\beta_0)\Big] = O_p(h^{q+1} + n^{-1/2} h), \tag{A.40}$$

where the first term is $O_p(h^{q+1})$ uniformly elementwise by (A.18), and the second term is $O_p(n^{-1/2} h)$ on the band by the Central Limit Theorem, since the matrices above are banded to the first order. Specifically, $-n^{-1} \partial^2 l_n(\beta_0, \gamma_0)/\partial\gamma\partial\gamma^\top - V_n(\beta_0)$ has diagonal and $j$th off-diagonal elements of order $O_p(h^{q+1} + n^{-1/2} h)$ for $j = 1, \dots, r-1$, and all other elements of order $O_p(h^{q+2} + n^{-1/2} h^2)$. Further,

$$\Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top} - V_n(\beta_0)\Big\|_2 = O_p(h^{q+1} + n^{-1/2} h),$$

again because the matrices are banded to the first order. In fact, for an arbitrary vector $a \in \mathbb{R}^{P_n}$, writing $D = -n^{-1} \partial^2 l_n(\beta_0, \gamma_0)/\partial\gamma\partial\gamma^\top - V_n(\beta_0)$,

$$|a^\top D\, a| \le \sum_{j,k} |a_j|\, |D_{jk}|\, |a_k| = \sum_{|j-k| \le 2r-1} |a_j|\, |D_{jk}|\, |a_k| + \sum_{|j-k| > 2r-1} |a_j|\, |D_{jk}|\, |a_k| \le C(h^{q+1} + n^{-1/2} h) \sum_{|j-k| \le 2r-1} (a_j^2 + a_k^2)/2 + C_3(h^{q+2} + n^{-1/2} h^2) \sum_{|j-k| > 2r-1} (a_j^2 + a_k^2)/2 \le C(h^{q+1} + n^{-1/2} h)(2r + h P_n) \|a\|_2^2 \le C'(h^{q+1} + n^{-1/2} h) \|a\|_2^2,$$

where $0 < C, C_3, C' < \infty$ are constants.

Lemma A7 There exist constants $0 < c, C < \infty$ such that, for $n$ sufficiently large, with probability approaching 1,

$$c h^{-1/2} < \Big\|\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1}\Big\|_\infty < C h^{-1}, \qquad c h^{-1/2} < \|V_n(\beta_0)^{-1}\|_\infty < C h^{-1}.$$

Proof We have $V_n(\beta_0) = h V_0(\beta_0) - h^2 V_1(\beta_0)$, where

$$V_0(\beta_0) = h^{-1} E\Big[(1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, B_r(u)^{\otimes 2}\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du}\Big]$$

is a banded matrix with each nonzero element of order $O(1)$ uniformly, and

$$V_1(\beta_0) = h^{-2} E\Big\{(1+\Delta_i)\, \frac{\exp(2Z_i^\top\beta_0) \big[\int_0^{X_i} e^{m(u)}\, B_r(u)\, du\big]^{\otimes 2}}{[1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du]^2}\Big\}$$

is a matrix with all elements of order $O(1)$ uniformly. It is easily seen that $V_0(\beta)$ is positive definite and $V_1(\beta)$ is positive semi-definite.

According to Demko (1977) and Theorem 4.3 in Chapter 13 of DeVore and Lorentz (1993), we have

$$\|V_0(\beta_0)^{-1}\|_\infty \le C,$$

for some constant $0 < C < \infty$. Furthermore, there exist constants $0 < C < \infty$ and $0 < \lambda < 1$ such that $|\{V_0(\beta)^{-1}\}_{jk}| \le C \lambda^{|j-k|}$ for $j, k = 1, \dots, P_n$.

We want to show that

$$\big\|\{I - h V_0(\beta_0)^{-1} V_1(\beta_0)\}^{-1}\big\|_\infty$$

is bounded. As a result,

$$\|V_n(\beta_0)^{-1}\|_\infty = h^{-1} \big\|\{I - h V_0(\beta_0)^{-1} V_1(\beta_0)\}^{-1} V_0(\beta_0)^{-1}\big\|_\infty \le h^{-1} \big\|\{I - h V_0(\beta_0)^{-1} V_1(\beta_0)\}^{-1}\big\|_\infty \|V_0(\beta_0)^{-1}\|_\infty = O_p(h^{-1}).$$

Denote $W = -V_0(\beta_0)^{-1} V_1(\beta_0)$, so that $I - h V_0(\beta_0)^{-1} V_1(\beta_0) = I + h W$. There exists a constant $0 < \kappa < \infty$ such that $|\{V_1(\beta_0)\}_{jk}| < \kappa$ for $j, k = 1, \dots, P_n$. Hence,

$$|W_{jk}| = \Big|\sum_{\ell=1}^{P_n} \{V_0(\beta_0)^{-1}\}_{j\ell}\, \{V_1(\beta_0)\}_{\ell k}\Big| \le \sum_{\ell=1}^{P_n} C \lambda^{|j-\ell|} \kappa \le 2 C \kappa (1-\lambda)^{-1} \le \kappa_1,$$

where $\kappa_1 = \max\{1,\, 2 C \kappa (1-\lambda)^{-1}\} \ge 1$.

Let $P_n h \le \kappa_2$, where $1 \le \kappa_2 < \infty$ is a constant. A similar derivation as before shows that there exist constants $0 < \tilde c < \tilde C < \infty$ such that for arbitrary $a \in \mathbb{R}^{P_n}$, $\tilde c \|a\|_2^2 < a^\top V_0(\beta_0)\, a < \tilde C \|a\|_2^2$ and $\tilde c \|a\|_2^2 < a^\top \{V_0(\beta_0) - h V_1(\beta_0)\}\, a < \tilde C \|a\|_2^2$. Hence,

$$\|(I + h W)^{-1}\|_2 = \big\|\{V_0(\beta_0) - h V_1(\beta_0)\}^{-1} V_0(\beta_0)\big\|_2 \le \big\|\{V_0(\beta_0) - h V_1(\beta_0)\}^{-1}\big\|_2 \|V_0(\beta_0)\|_2 \le \kappa_3 \le \tilde C/\tilde c,$$

where $1 \le \kappa_3 < \infty$.

In the following, we use induction to show that

$$a_{P_n} \equiv |\det(I_{P_n} + h W_{P_n})| \le (1 + h \kappa_4)^{P_n - 1}, \qquad b_{P_n} \equiv |\det(J_{P_n} + h W_{P_n})| \le (\kappa_1 + 2\kappa_1^2 \kappa_2 \kappa_3)\, h\, (1 + h \kappa_4)^{P_n - 2},$$

where $J_{P_n} = (J_{ij})_{1 \le i,j \le P_n}$ with $J_{ij} = 1$ if $j - i = 1$ and $J_{ij} = 0$ otherwise, and $\kappa_4 = 4 \kappa_1^2 \kappa_2 (1 + \kappa_1 \kappa_2 \kappa_3)$.

When $P_n = 2$,

$$a_2 = |\det(I_2 + h W_2)| = |(1 + h W_{11})(1 + h W_{22}) - h^2 W_{12} W_{21}| \le (1 + h \kappa_1)^2 + h^2 \kappa_1^2 = 1 + 2 h \kappa_1 + 2 h^2 \kappa_1^2 \le 1 + 4 h \kappa_1^2 \le 1 + h \kappa_4.$$

Similarly, we have

$$b_2 = |\det(J_2 + h W_2)| \le h^2 \kappa_1^2 + h (1 + h \kappa_1) \kappa_1 \le (\kappa_1 + 2 \kappa_1^2)\, h \le (\kappa_1 + 2 \kappa_1^2 \kappa_2 \kappa_3)\, h.$$

Assume the result holds for $2, \dots, P_n - 1$. For $P_n$, denote $W_{P_n, -P_n} = (W_{P_n 1}, \dots, W_{P_n (P_n-1)})$ and $W_{-P_n, P_n} = (W_{1 P_n}, \dots, W_{(P_n-1) P_n})^\top$; we have

$$a_{P_n} = |\det(I_{P_n} + h W_{P_n})| = |\det(I_{P_n-1} + h W_{P_n-1})|\, \big|1 + h W_{P_n P_n} - h^2 W_{P_n, -P_n} (I_{P_n-1} + h W_{P_n-1})^{-1} W_{-P_n, P_n}\big| \le a_{P_n-1} \big\{1 + h \kappa_1 + h^2 (P_n - 1) \kappa_1^2 \|(I_{P_n-1} + h W_{P_n-1})^{-1}\|_2\big\} \le a_{P_n-1} (1 + h \kappa_1 + h \kappa_1^2 \kappa_2 \kappa_3) \le (1 + h \kappa_4)^{P_n - 1},$$

and

$$b_{P_n} = |\det(J_{P_n} + h W_{P_n})| = \big|h W_{P_n 1} - h^2 W_{P_n, -1} (I_{P_n-1} + h W_{P_n-1})^{-1} W_{-1, 1}\big|\, |\det(I_{P_n-1} + h W_{P_n-1})| \le \big\{h \kappa_1 + h^2 \kappa_1^2 (P_n - 1) \|(I_{P_n-1} + h W_{P_n-1})^{-1}\|_2\big\}\, a_{P_n-1} \le h (\kappa_1 + \kappa_1^2 \kappa_2 \kappa_3)\, a_{P_n-1} \le h (\kappa_1 + 2 \kappa_1^2 \kappa_2 \kappa_3)(1 + h \kappa_4)^{P_n - 2},$$

where $W_{P_n, -1} = (W_{P_n 2}, \dots, W_{P_n P_n})$ and $W_{-1, 1} = (W_{11}, \dots, W_{(P_n-1) 1})^\top$.

Therefore,

$$\|(I_{P_n} + h W_{P_n})^{-1}\|_\infty = \max_j \sum_k \big|\{(I_{P_n} + h W_{P_n})^{-1}\}_{jk}\big| \le \frac{a_{P_n-1} + b_{P_n-1} (P_n - 1)}{|\det(I_{P_n} + h W_{P_n})|} \le \frac{(1 + h \kappa_4)^{P_n} + (\kappa_1 + 2\kappa_1^2 \kappa_2 \kappa_3)\, \kappa_2\, (1 + h \kappa_4)^{P_n - 2}}{|\det(I_{P_n} + h W_{P_n})|} \le \frac{2 \kappa_4 (1 + h \kappa_4)^{\kappa_2 \kappa_4/(h \kappa_4)}}{|\det(I_{P_n} + h W_{P_n})|},$$

where the numerator converges to $2 \kappa_4 \exp(\kappa_2 \kappa_4)$ as $h \to 0$, or equivalently, as $P_n \to \infty$. Here, in the first equality above, we use the fact that, up to sign, the $(j,k)$th element of $(I_{P_n} + h W_{P_n})^{-1}$ is the determinant of $I_{P_n} + h W_{P_n}$ with its $j$th column and $k$th row removed, divided by the determinant of $I_{P_n} + h W_{P_n}$ itself. Specifically, when $j = k$, the absolute value of the $(j,k)$th element is $|\det(I_{P_n-1} + h W_{P_n-1})|/|\det(I_{P_n} + h W_{P_n})| = a_{P_n-1}/|\det(I_{P_n} + h W_{P_n})|$; when $j \ne k$, with certain column operations, we obtain $|\det(J_{P_n-1} + h W_{P_n-1})|/|\det(I_{P_n} + h W_{P_n})| = b_{P_n-1}/|\det(I_{P_n} + h W_{P_n})|$.

It remains to show that there exists $\kappa_5 > 0$ such that

$$a_{P_n} = |\det(I_{P_n} + h W_{P_n})| \ge \kappa_5,$$

for $P_n$ sufficiently large. This can be seen from

$$a_{P_n} = |\det(I_{P_n-1} + h W_{P_n-1})|\, \big|1 + h W_{P_n P_n} - h^2 W_{P_n, -P_n} (I_{P_n-1} + h W_{P_n-1})^{-1} W_{-P_n, P_n}\big| \ge \{1 - h \kappa_1 - h^2 (P_n - 1) \kappa_1^2 \kappa_3\}\, a_{P_n-1} \ge (1 - h \kappa_1 - h \kappa_1^2 \kappa_2 \kappa_3)\, a_{P_n-1} \ge (1 - h \kappa_4)\, a_{P_n-1} \ge (1 - h \kappa_4)^{P_n - 2}\, a_2 \ge (1 - h \kappa_4)^{\kappa_2 \kappa_4/(h \kappa_4)} \to \exp(-\kappa_2 \kappa_4).$$

Thus the result holds with $\kappa_5 = \exp(-\kappa_2 \kappa_4)$.

Therefore, we have $\|V_n(\beta_0)^{-1}\|_\infty \le C h^{-1}$ for some constant $0 < C < \infty$.

On the other hand, we have

$$\|V_n(\beta_0)^{-1}\|_\infty \ge P_n^{-1/2} \|V_n(\beta_0)^{-1}\|_2 \ge c h^{-1/2},$$

for some constant $0 < c < \infty$ (Horn and Johnson, 1990; Golub and Van Loan, 1996).

The proof for $\{-n^{-1} \partial^2 l_n(\beta_0, \gamma_0)/\partial\gamma\partial\gamma^\top\}^{-1}$ is similar and hence omitted.

Lemma A8

$$\Big\|\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} - V_n(\beta_0)^{-1}\Big\|_\infty = O_p(h^{q-1} + n^{-1/2} h^{-1}).$$

Proof According to Lemmas A6–A7, we have

$$\Big\|\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1} - V_n(\beta_0)^{-1}\Big\|_\infty = \Big\|V_n(\beta_0)^{-1} \Big[V_n(\beta_0) - \Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}\Big] \Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1}\Big\|_\infty \le \Big\|\Big\{-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top}\Big\}^{-1}\Big\|_\infty \|V_n(\beta_0)^{-1}\|_\infty \Big\|-n^{-1} \frac{\partial^2 l_n(\beta_0, \gamma_0)}{\partial \gamma \partial \gamma^\top} - V_n(\beta_0)\Big\|_\infty = O_p(h^{-2}) O_p(h^{q+1} + n^{-1/2} h) = O_p(h^{q-1} + n^{-1/2} h^{-1}).$$

Lemma A9 There exist constants $0 < c < C < \infty$ such that, for $n$ sufficiently large, with probability approaching 1, for arbitrary $a \in \mathbb{R}^{P_n}$,

$$c h \|a\|_2^2 < a^\top \Big\{n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)^{\otimes 2}\Big\} a < C h \|a\|_2^2.$$

Proof We have

$$n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)^{\otimes 2} = n^{-1} \sum_{i=1}^{n} \Big[\Delta_i B_r(X_i) - (1+\Delta_i)\, \frac{\exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, B_r(u)\, du}{1 + \exp(Z_i^\top\beta_0) \int_0^{X_i} e^{m(u)}\, du}\Big]^{\otimes 2} \preceq n^{-1} \sum_{i=1}^{n} C' \Big[B_r(X_i)^{\otimes 2} + \Big\{\int_0^\varepsilon B_r(u)\, du\Big\}^{\otimes 2}\Big],$$

for some constant $0 < C' < \infty$. A similar derivation as in Lemma A5 then leads to

$$c h \|a\|_2^2 \le a^\top \Big\{n^{-1} \sum_{i=1}^{n} S_{\gamma,i}(\beta_0, m)^{\otimes 2}\Big\} a \le C h \|a\|_2^2,$$

for constants $0 < c < C < \infty$.

Appendix F. Algorithm for optimization

Here we provide the detailed algorithm for optimizing $l_n$.

  1. Obtain the initial estimator $\hat\beta_{\text{init}}$ from the U-statistic equation (Cheng et al., 1995),
    $$\sum_{i=1}^{n} \sum_{j=1}^{n} (\hat Z_i - \hat Z_j) \Big[\frac{\Delta_j I(X_i \ge X_j)}{\hat G^2(X_j)} - \xi\{(\hat Z_i - \hat Z_j)^\top \beta\}\Big] = 0,$$
    where $\hat G$ is the Kaplan–Meier or empirical distribution estimator for the censoring time distribution, and $\xi(s) = \{e^s(s-1) + 1\}/(e^s - 1)^2$. We solve the equation by the classical Newton's method. Then, we calculate the initial estimator for the baseline function $\hat\alpha_{\text{init}}$ by solving
    $$\sum_{i=1}^{n} \Big[\frac{\Delta_i I(X_i \le t)}{\hat G(X_i)} - \frac{\exp\{\hat\beta_{\text{init}}^\top Z_i\}\, \alpha(t)}{1 + \exp\{\hat\beta_{\text{init}}^\top Z_i\}\, \alpha(t)}\Big] = 0.$$
  2. Update $\hat\beta_{\text{update}}$ and $\hat\alpha_{\text{update}}$ from $\hat\beta_{\text{init}}$ and $\hat\alpha_{\text{init}}$ with the alternative B-spline approximation
    $$\alpha(t) = \exp\Big\{\tilde\gamma^\top \int_0^t B_r(s)\, ds\Big\}.$$
    Setting the initial values $\hat\beta^{[0]} = \hat\beta_{\text{init}}$ and $\hat\alpha^{[0]} = \hat\alpha_{\text{init}}$, we perform the iterative algorithm (see the sketch after this list):
    1. Calculate
      $$\hat\pi_i^{[k]} = \frac{\exp(\hat\beta^{[k-1]\top} \hat Z_i)\, \hat\alpha^{[k-1]}(X_i)}{1 + \exp(\hat\beta^{[k-1]\top} \hat Z_i)\, \hat\alpha^{[k-1]}(X_i)}$$
      and update $\alpha$ by the Breslow-type estimator
      $$\hat{\tilde\gamma}_p^{[k]} = \frac{\sum_{i=1}^{n} \Delta_i B_{r,p}(X_i)}{\sum_{i=1}^{n} \big[(1+\Delta_i)\, \hat\pi_i^{[k]} - \Delta_i\big] \int_0^{X_i} B_{r,p}(t)\, dt}, \qquad \hat\alpha^{[k]}(t) = \exp\Big\{\hat{\tilde\gamma}^{[k]\top} \int_0^t B_r(s)\, ds\Big\}.$$
    2. Update $\beta$ by the pseudo logistic regression:
      • If $\Delta_i = 0$, observation $i$ contributes one entry to the pseudo data, response 0 with covariates $\hat Z_i$ and offset $\log\{\hat\alpha^{[k]}(X_i)\}$.
      • If $\Delta_i = 1$, observation $i$ contributes two entries to the pseudo data, responses 0 and 1, both with covariates $\hat Z_i$ and offset $\log\{\hat\alpha^{[k]}(X_i)\}$.
      The solution is $\hat\beta^{[k]}$.

    During this step, we compute the integrals $\int_0^{X_i} B_{r,p}(t)\, dt$ once at the initialization step and reuse them throughout. The parameters at convergence are $\hat\beta_{\text{update}}$, $\hat\alpha_{\text{update}}$ and $\hat{\tilde\gamma}_{\text{update}}$.

  3. Obtain the final MLE estimators $\hat\beta_{\text{MLE}}$ and $\hat\gamma_{\text{MLE}}$. We use $\hat\beta^{[0]} = \hat\beta_{\text{update}}$ as the initial value for $\beta$ and calculate the initial value for $\gamma$ from the linear regression
    $$\hat\gamma^{[0]} = \arg\min_{\gamma \in \mathbb{R}^{P_n}} \sum_{i=1}^{n} \big\{\log \hat\alpha_{\text{update}}'(X_i) - \gamma^\top B_r(X_i)\big\}^2 = \arg\min_{\gamma \in \mathbb{R}^{P_n}} \sum_{i=1}^{n} \big\{\log \hat\alpha_{\text{update}}(X_i) + \log\big(\hat{\tilde\gamma}_{\text{update}}^\top B_r(X_i)\big) - \gamma^\top B_r(X_i)\big\}^2.$$
    We perform the iterative algorithm:
    1. Update $\gamma$ by the one-step Newton's method
      $$\hat\gamma^{[k]} = \hat\gamma^{[k-1]} - \Big\{\frac{\partial^2}{\partial \gamma \partial \gamma^\top}\, l_n(\hat\beta^{[k-1]}, \hat\gamma^{[k-1]})\Big\}^{-1} \frac{\partial}{\partial \gamma}\, l_n(\hat\beta^{[k-1]}, \hat\gamma^{[k-1]}).$$
    2. Update $\beta$ by the pseudo logistic regression as in Step 2.
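The pseudo logistic regression in Steps 2–3 is an ordinary logistic regression with a fixed offset. Below is a self-contained numpy sketch of the $\beta$-update: the pseudo-data construction follows the bullets in Step 2, while the Newton solver, the ridge jitter, and all names are our own choices, assuming $\log \hat\alpha^{[k]}(X_i)$ has already been evaluated.

```python
import numpy as np

def logistic_with_offset(y, X, offset, n_iter=25):
    """Newton's method for logistic regression with a fixed offset."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        # small ridge jitter keeps the Hessian invertible
        beta += np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return beta

def pseudo_beta_update(Z, delta, log_alpha_at_X):
    """Beta-update of Step 2: censored subjects contribute one y=0
    entry; events contribute y=0 and y=1 entries; offset log alpha(X_i)."""
    rows, ys, offs = [], [], []
    for z_i, d_i, la in zip(Z, delta, log_alpha_at_X):
        rows.append(z_i); ys.append(0); offs.append(la)
        if d_i == 1:
            rows.append(z_i); ys.append(1); offs.append(la)
    return logistic_with_offset(np.array(ys), np.array(rows), np.array(offs))
```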

Contributor Information

Liang Liang, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.

Jue Hou, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.

Hajime Uno, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.

Kelly Cho, Massachusetts Veterans Epidemiology Research and Information Center, US Department of Veteran Affairs, Boston, MA, USA; Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA.

Yanyuan Ma, Department of Statistics, Penn State University, University Park, PA, USA.

Tianxi Cai, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

References

  1. Ahuja Y, Hong C, Xia Z, Cai T (2021) SAMGEP: A novel method for prediction of phenotype event times using the electronic health record. medRxiv, DOI 10.1101/2021.03.07.21253096
  2. de Boor C (2001) A Practical Guide to Splines. Springer, New York
  3. Capra WB, Müller HG (1997) An accelerated-time model for response curves. Journal of the American Statistical Association 92:72–83
  4. Cheng S, Wei L, Ying Z (1995) Analysis of transformation models with censored data. Biometrika 82:835–845
  5. Cheng S, Wei L, Ying Z (1997) Predicting survival probabilities with semi-parametric transformation models. Journal of the American Statistical Association 92:227–235
  6. Chubak J, Onega T, Zhu W, Buist DS, Hubbard RA (2015) An electronic health record-based algorithm to ascertain the date of second breast cancer events. Medical Care
  7. Dean C, Balshaw R (1997) Efficiency lost by analyzing counts rather than event times in Poisson and overdispersed Poisson regression models. Journal of the American Statistical Association 92:1387–1398
  8. Demko S (1977) Inverses of band matrices and local convergence of spline projections. SIAM Journal on Numerical Analysis 14:616–619
  9. DeVore RA, Lorentz GG (1993) Constructive Approximation, vol 303. Springer Science & Business Media
  10. Efron B (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics 7(1):1–26, DOI 10.1214/aos/1176344552
  11. Golub GH, Van Loan CF (1996) Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD, USA
  12. Hassett MJ, Uno H, Cronin AM, Carroll NM, Hornbrook MC, Ritzwoller D (2015) Detecting lung and colorectal cancer recurrence using structured clinical/administrative data to enable outcomes research and population health management. Medical Care
  13. Horn RA, Johnson CR (1990) Matrix Analysis. Cambridge University Press
  14. Jin Z, Ying Z, Wei LJ (2001) A simple resampling method by perturbing the minimand. Biometrika 88(2):381–390
  15. Klein JP, Moeschberger ML (2006) Survival Analysis: Techniques for Censored and Truncated Data. Springer Science & Business Media
  16. Lawless JF (1987) Regression methods for Poisson process data. Journal of the American Statistical Association 82:808–815
  17. Nielsen J, Dean C (2005) Regression splines in the quasi-likelihood analysis of recurrent event data. Journal of Statistical Planning and Inference 134:521–535
  18. Rice JA, Silverman BW (1991) Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological) 53:233–243
  19. Royston P, Parmar MK (2002) Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21:2175–2197
  20. Shen X (1998) Proportional odds regression and sieve maximum likelihood estimation. Biometrika 85:165–177
  21. Stark H, Woods JW (1986) Probability, Random Processes, and Estimation Theory for Engineers. Prentice-Hall, Inc.
  22. Uno H, Cai T, Pencina MJ, D'Agostino RB, Wei L (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 30:1105–1117
  23. Uno H, Ritzwoller DP, Cronin AM, Carroll NM, Hornbrook MC, Hassett MJ (2018) Determining the time of cancer recurrence using claims or electronic medical record data. JCO Clinical Cancer Informatics 2:1–10
  24. Wang H, Leng C (2007) Unified lasso estimation by least squares approximation. Journal of the American Statistical Association 102(479):1039–1048
  25. Wang H, Leng C (2008) A note on adaptive group lasso. Computational Statistics & Data Analysis 52(12):5277–5286
  26. Wu S, Müller HG, Zhang Z (2013) Functional data analysis for point processes with rare events. Statistica Sinica, pp 1–23
  27. Yao F, Müller HG, Wang JL (2005) Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100:577–590
  28. Younes N, Lachin J (1997) Link-based models for survival data with interval and continuous time censoring. Biometrics, pp 1199–1211
  29. Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, et al. (2015) Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association 22(5):993–1000, DOI 10.1093/jamia/ocv034
  30. Yu S, Chakrabortty A, Liao KP, Cai T, Ananthakrishnan AN, Gainer VS, et al. (2016) Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association 24(e1):e143–e149, DOI 10.1093/jamia/ocw135
  31. Zeng D, Lin D, Yin G (2005) Maximum likelihood estimation for the proportional odds model with random effects. Journal of the American Statistical Association 100:470–483
  32. Zhang Y, Hua L, Huang J (2010) A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scandinavian Journal of Statistics 37:338–354
  33. Zhang Y, Cai T, Yu S, Cho K, Hong C, Sun J, et al. (2019) High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nature Protocols 14(12):3426–3444, DOI 10.1038/s41596-019-0227-6
