Abstract
Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes, such as time to cancer progression, is not readily available in these databases. The true clinical event times typically cannot be approximated well from simple extracts of billing or procedure codes, while annotating event times manually is time and resource prohibitive. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In Step I, we employ a functional principal component analysis (FPCA) approach to estimate the underlying intensity functions based on the observed point processes of the unlabeled patients. In Step II, we fit a penalized proportional odds model to the event time outcomes in the labeled data using features derived in Step I, where the non-parametric baseline function is approximated by B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach over existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.
Keywords: censoring, electronic health records, functional principal component analysis, point process, proportional odds model, semi-supervised learning
1. Introduction
While clinical trials and traditional cohort studies remain critical sources of data for clinical research, they have limitations, including limited generalizability of study findings and a limited ability to test broader hypotheses. In recent years, real world clinical data derived from disease registries, insurance claims, and electronic health record (EHR) systems have been increasingly used for precision medicine research. These real world data (RWD) open opportunities for developing accurate personalized risk prediction models, which can be easily incorporated into clinical practice and ultimately realize the promise of precision medicine. Efficiently deriving prediction models for the risk of developing future clinical events using RWD, however, faces practical and methodological challenges. Precise event time information, such as time to cancer recurrence, is not readily available in RWD such as EHR and claims data. Simple proxies to the event time based on the encounter time of the first diagnosis or procedure codes may poorly approximate the true event time (Uno et al., 2018). On the other hand, annotating event times manually via chart review is time and resource prohibitive.
Learning the onset status of clinical events from a large-scale medical encounter data set that lacks precise onset status, together with a small training set carrying gold standard labels on the true onset status, has been thoroughly investigated in the past decade. The solutions can be classified by their training sample into supervised approaches that use only the labeled data, unsupervised approaches that use no labels, and semi-supervised approaches that combine information from labeled and unlabeled data. Semi-supervised methods (Yu et al., 2015, 2016; Zhang et al., 2019) usually use the unlabeled data for feature screening and selection before the final training on the labeled data. Growing efforts have been made in recent years to predict the onset time of clinical events under a similar setting with partially labeled event times. The existing literature on phenotyping of event times consists mostly of supervised approaches that only use labeled data for training. Several supervised algorithms exist for predicting cancer recurrence time by extracting features from the encounter patterns of relevant codes. For example, Chubak et al. (2015) proposed a rule based algorithm that classifies the recurrence status, R ∈ {+, −}, with a decision tree, and assigns the recurrence time for those with predicted R = + as the earliest encounter time of one or more specific codes. Hassett et al. (2015) proposed two-step algorithms in which a logistic regression is used to classify R in step I, and the recurrence time for those with R = + is then estimated as a weighted average of the times at which the counts of several prespecified codes peak. Instead of the peak time, Uno et al. (2018) focused on the time at which an empirically estimated encounter intensity function has the sharpest change, referred to as the change point throughout this paper. The recurrence time is approximated as a weighted average of the change point times associated with a few selected codes. Despite their reasonable empirical performance, these ad hoc procedures have several major limitations. First, only a very small number of codes are selected according to domain knowledge. Second, intensity functions estimated based on finite differences may yield substantial variability in the resulting peak or change point times due to the sparsity of encounter data. One exception is a recent semi-supervised approach by Ahuja et al. (2021), which first aggregates the predicting features and then predicts the event time from the longitudinal trajectories of the aggregated features using a hidden Markov model. Our approach differs from Ahuja et al. (2021) in the order in which feature aggregation and temporal trajectories are addressed: we first extract characteristics from the longitudinal trajectories of the predicting features, and then aggregate the extracted characteristics to predict the event time by fitting a survival model.
In this paper, we frame the question of annotating event times with longitudinal encounter records as a statistical problem of predicting an event time T using baseline covariates U as well as features derived from a p-variate point process. Specifically, with a small labeled set containing observations on the event time and a large unlabeled set containing observations on U and the point process only, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) procedure. In Step I, we use the large unlabeled data to estimate the underlying subject specific intensity functions of the point processes and to derive summaries of the intensity functions as features for predicting T. In Step II, we predict T from these features by fitting a penalized proportional odds (PO) model, approximating the non-parametric baseline function via B-splines. MATA is semi-supervised in that unlabeled data and labeled data are used in Steps I and II, respectively. Estimating individualized intensity functions is a challenging task in the current setting because the encounter data are often sparse and the shapes of the intensity functions can vary greatly across subjects. As such, traditional multiplicative intensity models (Lawless, 1987; Dean and Balshaw, 1997; Nielsen and Dean, 2005) fail to provide accurate approximations. To overcome these difficulties, we employ a non-parametric FPCA method by Wu et al. (2013) to estimate the subject specific intensity functions using the large unlabeled set. We demonstrate that when the size of the unlabeled set is sufficiently large relative to the size of the labeled set, the approximation error of the estimated features is negligible compared to the estimation error from fitting the spline model on the labeled set.
Even though the idea of employing a spline-based approach is straightforward and intuitive, our method differs from classical B-spline work in that B-splines are used in the outcome model, i.e., on the failure time, rather than in the preprocessing of the predictors. Special care is required to accommodate this fact. We establish novel consistency results and asymptotic convergence rates for the proposed estimator, for both the parametric and nonparametric parts. There is some existing literature adopting a spline-based approach in a similar context to ours, including Shen (1998); Royston and Parmar (2002); Zhang et al. (2010); Younes and Lachin (1997). However, Royston and Parmar (2002) and Younes and Lachin (1997) did not address the asymptotic properties of their estimators; Shen (1998) and Zhang et al. (2010) employed a sieve maximum likelihood approach, which includes splines as a special case, but only provided theoretical justification for the asymptotics of the parametric part.
One great advantage of the proposed MATA approach is that classical variable selection algorithms such as the LASSO can be easily incorporated. In comparison, Chubak et al. (2015); Hassett et al. (2015); Uno et al. (2018) exhaust all possible combinations of selected encounters and select the optimal one under certain criteria, which incurs substantial computational cost. No variable selection method has been developed for classical estimating equation based estimators, e.g., Cheng et al. (1995, 1997). Moreover, compared to the non-parametric maximum likelihood estimator (NPMLE), e.g., Zeng et al. (2005), which approximates the non-parametric function by a right-continuous step function with jumps only at observed failure times, our approach is computationally more efficient and stable.
The rest of the paper is organized as follows. In Section 2, we introduce the proposed MATA approach and the prediction accuracy evaluation measures. The asymptotic properties of the proposed estimators are discussed in Section 3. In Section 4, we conduct various simulation studies to explore the performance of our approach with small labeled samples. In Section 5, we apply our approach to a lung cancer data set. Section 6 contains a short discussion. Technical details and proofs are provided in the Supplementary Material.
2. Semi-supervised MATA
Let T denote the continuous event time of interest, which is observable up to (X, Δ) in the labeled set, where X = T ∧ C, Δ = I(T ≤ C), and C is the follow-up time. Let N denote the q-variate point process and U denote the baseline covariates observable in both the labeled and unlabeled sets, where the jth component of N is a point process associated with the jth clinical code, counting its occurrence times over any set in the Borel σ-algebra of the positive half of the real line, and I(·) is the indicator function. If T denotes the true event time of heart failure, examples of the component processes include the longitudinal encounter process of the diagnostic code for heart failure and NLP mentions of heart failure in clinical notes. Each component has a local intensity function, which we assume to be integrable, for j = 1, ⋯, q. The corresponding random density trajectory is the intensity normalized by its integral; equivalently, the intensity function is the product of the density trajectory and the expected number of lifetime encounters.
The encounter times of the point processes are also only observable up to the end of follow up C and we let denote the total number of occurrences for up to C and denote the history of up to t along with the baseline covariate vector U. Thus, the observed data consist of
where i indexes the subject and we assume that .
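For concreteness, this notation can be written as follows, where the symbols (N^[j] for the counting process of the jth code, t_k^[j] for its encounter times, λ^[j] for its intensity, and f^[j] for its density trajectory) are illustrative names consistent with the verbal definitions above rather than verbatim displays:

```latex
% Illustrative notation (assumed names, consistent with the text):
\[
N^{[j]}(B) = \sum_{k \ge 1} I\bigl(t^{[j]}_k \in B\bigr)
\quad \text{for Borel sets } B \subset (0,\infty), \qquad j = 1, \dots, q,
\]
\[
f^{[j]}(t) = \frac{\lambda^{[j]}(t)}{\int_0^\infty \lambda^{[j]}(s)\,ds},
\qquad \text{equivalently} \qquad
\lambda^{[j]}(t) = f^{[j]}(t)\int_0^\infty \lambda^{[j]}(s)\,ds .
\]
```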
2.1. Models
Our proposed MATA procedure involves two models, one for the point processes and another for the survival function of T. We connect the two models by including features of the underlying intensity functions of the point processes as part of the covariates for the survival time.
Point Process Model
The intensity function for each point process is treated as a realization of a non-negative valued stochastic intensity process. Conditional on this process, the observed medical encounters are assumed to form a non-homogeneous Poisson process with the given local intensity function. Define the truncated random density
and its scaled version
As we only observe the point process up to C, our goal is to estimate the truncated density function, or equivalently the scaled density function, rather than the full density function. Note that the scaling is done to meet the uniform endpoint requirement of the FPCA approach by Wu et al. (2013).
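For illustration, one consistent way to write the two definitions, assuming the scaled density is rescaled to the unit interval to meet the uniform endpoint requirement, is:

```latex
% Truncated density on [0, C] and its scaled version on [0, 1]
% (a sketch under the stated assumption, not the verbatim display):
\[
f^{[j]}_C(t) = \frac{\lambda^{[j]}(t)}{\int_0^C \lambda^{[j]}(s)\,ds},
\quad t \in [0, C],
\qquad\qquad
\tilde f^{[j]}(u) = C\, f^{[j]}_C(uC), \quad u \in [0, 1].
\]
```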
Event Time Model
We next relate features derived from the intensity functions to the distribution of T. Define W as a vector of known functionals of the intensity functions. For example, if the local intensity function follows the exponential distribution with rate θ−1, then we may set the functional accordingly. Other potential features include peaks or change points of the intensity functions. Here the peak is defined as the time at which the intensity (or density) curve reaches its maximum, while the change point is defined as the time of the largest increase in the intensity (or density) curve. Due to censoring, we instead observe truncated versions of these features; for features like the peak and change point, the truncated and untruncated versions are identical whenever they are reached before the censoring time C. We assume that T follows a PO model (Klein and Moeschberger, 2006)
| (1) |
where is the unknown effect vector of the derived features Z, and m(t) is an unknown smooth function of t. This formulation ensures that is positive and increasing for .
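For illustration, one PO model form consistent with this description, in which a positive and increasing baseline G is built from the unconstrained function m, is the following sketch (the exact parameterization of model (1) may differ):

```latex
% A PO model with a monotone baseline built from m(t); G(t) is
% positive and increasing for t > 0, matching the description above.
\[
P(T \le t \mid Z) = \frac{G(t)\, e^{\beta^\top Z}}{1 + G(t)\, e^{\beta^\top Z}},
\qquad
G(t) = \int_0^t e^{m(s)}\, ds .
\]
```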
2.2. Estimation
To derive a prediction rule for T based on the proposed MATA procedure, we first estimate the truncated density functions from the longitudinal encounter data using the FPCA method proposed by Wu et al. (2013), obtaining estimates of the derived features W from the unlabeled data. We then estimate α(t) and β via penalized estimation with regression splines, using the resulting synthetic observations in the labeled set.
2.2.1. Step I: Estimation of f[j]
We estimate the mean and variance of the scaled density function according to the FPCA approach by Wu et al. (2013). Using the estimators and , we predict the scaled density function by with truncation at zero to ensure nonnegativity of the density function. The index K in the subscript is the number of basis functions selected according to the proportion of variation explained. We obtain the truncated density function by
For the i-th patient and the j-th point process, we observe only one realization of its expected number of encounters on [0, Ci]. We approximate the expected number of encounters with the observed encounters and estimate the density accordingly. We further estimate the derived features by plugging in the estimated densities. Detailed forms of these estimators are given in Appendix C. We also establish the rate of convergence for the estimated loadings of the functional PCA, which can subsequently be used as potential derived features.
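As a simple illustration of the feature extraction in this step, the following Python sketch computes the peak and change point times from an intensity curve evaluated on a grid; it is a simplified stand-in for the estimators detailed in Appendix C, and all names are hypothetical.

```python
import numpy as np

def peak_and_change_point(grid, intensity):
    """Return the peak time (argmax of the curve) and the change point
    time (time of the largest increase) of an intensity curve evaluated
    on an equally spaced grid. A simplified stand-in for the estimators
    in Appendix C, which operate on the FPCA-smoothed density."""
    peak_time = grid[np.argmax(intensity)]
    change_time = grid[np.argmax(np.diff(intensity)) + 1]
    return peak_time, change_time

# Toy usage: a Gaussian-shaped intensity peaking at t = 8
grid = np.linspace(0, 20, 401)
intensity = 5 * np.exp(-0.5 * ((grid - 8) / 2) ** 2)
print(peak_and_change_point(grid, intensity))
# peak near t = 8; the largest increase occurs shortly before the peak
```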
Incorporating the large unlabeled data into the semi-supervised learning facilitates the extraction of characteristics from noisy and complex longitudinal trajectories. Using unlabeled data for feature preprocessing has a longstanding tradition in semi-supervised phenotyping (Yu et al., 2015, 2016; Zhang et al., 2019). While the existing literature focuses on the selection of simple features, we consider the de-noising of complex features. Moreover, the large unlabeled data in Step I greatly reduce the uncertainty from that step, to the point that it becomes negligible in Step II.
2.2.2. Step II: PO Model Estimation with B-spline Approximation to m(·)
To estimate m(t) and β in the PO model (1), we approximate m(t) via B-splines of order r (degree r − 1), as follows. Divide the support of the censoring time C, denoted [0, ℰ], into (Rn + 1) subintervals using a sequence of interior knots. Let the basis functions be Br(t) = {Br,1(t), …, Br,Pn(t)}⊤, where the number of B-spline basis functions is Pn = Rn + r. Then m(t) can be approximated by
where is the vector of coefficients for the spline basis functions .
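For concreteness, the following minimal Python sketch constructs such a basis with interior knots placed at the deciles of the observed times, the choice used in Section 4; the helper `bspline_basis` is illustrative, not part of our software.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, interior_knots, lower, upper, order=4):
    """Evaluate all B-spline basis functions of the given order
    (order = degree + 1; order=4 gives cubic splines) at the points x.
    Returns an array of shape (len(x), n_basis) with
    n_basis = len(interior_knots) + order, i.e. P_n = R_n + r."""
    degree = order - 1
    # Full knot sequence: boundary knots repeated `order` times.
    knots = np.concatenate([np.repeat(lower, order),
                            np.asarray(interior_knots, dtype=float),
                            np.repeat(upper, order)])
    n_basis = len(knots) - order
    basis = np.empty((len(x), n_basis))
    for p in range(n_basis):
        coef = np.zeros(n_basis)
        coef[p] = 1.0                 # isolate the p-th basis function
        basis[:, p] = BSpline(knots, coef, degree, extrapolate=False)(x)
    return np.nan_to_num(basis)       # zero outside [lower, upper]

# Interior knots at the deciles of the observed times X, as in Section 4
X = np.random.default_rng(0).uniform(0, 20, size=400)
interior = np.percentile(X, np.arange(10, 100, 10))
B = bspline_basis(np.linspace(0, 20, 5), interior, lower=0.0, upper=20.0)
print(B.shape)  # (5, 13): 9 interior knots + order 4
```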
With the B-spline formulation and features estimated as , we can estimate m(·) and by maximizing an estimated likelihood. Specifically, let
| (2) |
where ,
Then we may estimate by maximizing the approximated profile likelihood
| (3) |
where and MLE stands for maximum likelihood estimator. Subsequently, we may estimate m(t) as
| (4) |
The log-likelihood ln is concave with negative definite Hessian
where
Standard convex optimization methods can be used to solve for the estimators. The integrals therein can be evaluated analytically through the calculation of indefinite integrals of the following forms:
which is fairly straightforward for piecewise constant (k = 0) or piecewise linear (k = 1) B-splines. The evaluation of the integrals is needed in every iteration, so we reduce the computational cost with a three-stage algorithm for the optimization problem:
- Stage 1: Compute an initial estimator for β using the U-statistic approach of Cheng et al. (1995) and an inverse-probability-of-censoring-weighted (IPCW) initial estimator for the baseline.
- Stage 2: Update β and the spline coefficients under an alternative B-spline approximation. Under this approximation, the integral only needs to be evaluated once at initiation; β and the spline coefficients are updated iteratively using a pseudo logistic regression trick and Newton's method.
- Stage 3: Derive the initial estimator for the original approximation from the Stage 2 estimates, and solve for the final β and spline coefficients, again updating iteratively with the pseudo logistic regression trick and Newton's method.
We provide the details of the algorithm in Appendix F. The program is available at https://github.com/celehs/SPT.grpadalasso.
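To make the optimization target concrete, the following Python sketch evaluates and maximizes a PO log-likelihood of the form sketched in Section 2.1, taking m piecewise constant (k = 0) so that the integral G(t) = ∫ e^{m(s)} ds has a closed form; this is an illustration under those assumptions with toy data, not the three-stage algorithm of Appendix F.

```python
import numpy as np
from scipy.optimize import minimize

def G_piecewise_const(t, gamma, knots):
    """G(t) = int_0^t exp(m(s)) ds for piecewise-constant m with pieces
    defined by knots (0 = knots[0] < ... < knots[-1] = E)."""
    lo, hi = knots[:-1], knots[1:]
    overlap = np.clip(np.atleast_1d(t)[:, None], lo, hi) - lo
    return overlap @ np.exp(gamma)

def neg_loglik(params, X, Delta, Z, knots):
    """Negative log-likelihood under the sketched PO form
    F(t|Z) = G(t)e^{b'Z} / (1 + G(t)e^{b'Z}): events contribute the
    density, censored subjects the survival function."""
    q = Z.shape[1]
    beta, gamma = params[:q], params[q:]
    eta = Z @ beta
    G = G_piecewise_const(X, gamma, knots)
    piece = np.clip(np.searchsorted(knots, X, side="right") - 1,
                    0, len(gamma) - 1)      # piece containing each X
    log_f = gamma[piece] + eta - 2 * np.log1p(G * np.exp(eta))
    log_S = -np.log1p(G * np.exp(eta))
    return -np.sum(np.where(Delta == 1, log_f, log_S))

# Toy usage with hypothetical data (X, Delta, Z) and equally spaced knots
rng = np.random.default_rng(1)
n, q = 200, 3
Z = rng.normal(size=(n, q))
X = rng.uniform(0.1, 20, size=n)
Delta = rng.integers(0, 2, size=n)
knots = np.linspace(0, 20, 11)
res = minimize(neg_loglik, x0=np.zeros(q + len(knots) - 1),
               args=(X, Delta, Z, knots), method="BFGS")
print(res.x[:q])  # estimated beta under the toy data
```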
2.2.3. Feature Selection
When the dimension of Z is not small, the MLE from (3) and (4),
| (5) |
may suffer from high variability. On the other hand, it is highly likely that only a small number of codes are truly predictive of the event time. To overcome this challenge, one may employ a standard penalized regression approach by imposing a penalty on the model complexity. To efficiently carry out penalized estimation under the B-spline PO model, we borrow the least squares approximation (LSA) strategy proposed by Wang and Leng (2007) and update the initial MLE (5) via
| (6) |
where the subvector of β indexed by [g] corresponds to the features in group g, and the penalty involves its L2 norm. All features associated with a specific code can be joined to create a group. The adaptive group lasso (glasso) penalty (Wang and Leng, 2008) employed in (6) enables the removal of all features related to a code. The tuning parameter λ can be chosen by standard methods including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or cross-validation.
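As an illustration of how a criterion like (6) can be solved, the following Python sketch minimizes an LSA-type quadratic plus an adaptive group lasso penalty by proximal gradient descent with group-wise soft-thresholding; the quadratic weight matrix V and the adaptive weights 1/||β̂[g]|| are standard LSA ingredients assumed here for illustration.

```python
import numpy as np

def lsa_adaptive_group_lasso(beta_init, V, groups, lam,
                             n_iter=500, tol=1e-8):
    """Minimize (b - beta_init)' V (b - beta_init)
               + lam * sum_g ||b_[g]||_2 / ||beta_init_[g]||_2
    by proximal gradient descent (group soft-thresholding).
    Assumes every group of beta_init is nonzero, so the adaptive
    weights are finite. A sketch, not the paper's implementation."""
    step = 0.5 / np.linalg.eigvalsh(V).max()   # 1/L for the quadratic
    beta = beta_init.copy()
    for _ in range(n_iter):
        z = beta - step * 2 * V @ (beta - beta_init)  # gradient step
        new = np.zeros_like(beta)
        for g in groups:                  # g: index array of one group
            w = 1.0 / np.linalg.norm(beta_init[g])
            nz = np.linalg.norm(z[g])
            shrink = max(0.0, 1.0 - step * lam * w / nz) if nz > 0 else 0.0
            new[g] = shrink * z[g]
        if np.linalg.norm(new - beta) < tol:
            return new
        beta = new
    return beta

# Toy usage: two groups of features, the second truly null
beta_hat = np.array([1.2, -0.8, 0.05, -0.03])
V = 50 * np.eye(4)                        # ~ n x information matrix
groups = [np.array([0, 1]), np.array([2, 3])]
print(lsa_adaptive_group_lasso(beta_hat, V, groups, lam=2.0))
# the second (weak) group is shrunk exactly to zero
```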
With we may obtain a glasso estimator for m(t) as
For any patient with derived feature , his/her probability of having an event by t can be predicted as
2.3. Evaluation of Prediction Performance
Based on , one may derive subject specific prediction rules for the event status and/or time. For example, one may predict the binary event status using and using . One may also predict based on for some u chosen to satisfy a desired sensitivity or specificity level of classifying Δ based on , where .
To summarize the overall performance of in predicting (X, Δ), we may consider the Kendall’s-τ type rank correlation summary measures, e.g.,
To account for calibration, we propose to examine the absolute prediction error (APE) measure via
APEu is an important summary measure of the quality of annotating (X, Δ), since most survival estimation procedures rely on the at-risk function and the counting process. The cut-off value u can also be selected to minimize APEu.
These accuracy measures can be estimated empirically by
Since the prediction rule and the accuracy measures are estimated using the same training data, such plug-in accuracy estimates may suffer from overfitting bias, especially when n is not large. In such cases, standard cross-validation procedures can be used for bias correction.
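As a simplified illustration of these measures, the following Python sketch computes a Kendall's-τ type rank correlation over comparable pairs and an APE-type discrepancy between the annotated and observed follow-up times; both are illustrative stand-ins for the exact definitions above rather than our implemented estimators.

```python
import numpy as np

def kendall_tau_censored(X, Delta, T_hat):
    """Kendall's-tau type rank correlation between predicted times T_hat
    and observed (X, Delta), restricted to comparable pairs: a pair is
    comparable when the smaller observed time corresponds to an event."""
    conc = disc = 0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = (i, j) if X[i] < X[j] else (j, i)
            if X[a] == X[b] or Delta[a] == 0:
                continue                  # pair not comparable
            if T_hat[a] < T_hat[b]:
                conc += 1
            elif T_hat[a] > T_hat[b]:
                disc += 1
    return (conc - disc) / max(conc + disc, 1)

def ape(X, C, T_hat):
    """APE-type criterion: mean absolute difference between the
    annotated follow-up time min(T_hat, C) and the observed X, i.e.
    the integrated discrepancy of the implied at-risk processes.
    (An assumed form, for illustration only.)"""
    return np.mean(np.abs(np.minimum(T_hat, C) - X))
```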
3. Asymptotic Results
The asymptotic distributions of the proposed MATA estimators are given in Theorems 1 and 2 below, with proofs provided in the Appendix. We assume the following regularity conditions.
-
(C1)
The density function fC(t) of the random variable C is bounded and bounded away from 0 on [0, ℰ) and satisfies the Lipschitz condition of order 1 on [0, ℰ). Additionally, .
-
(C2)
for , and the spline order satisfies .
-
(C3)
There exists 0 < c < ∞, such that the distances between neighboring knots satisfy
Furthermore, the number of knots satisfies , as and .
-
(C4)
The function m(t) is bounded on [0, ℰ]. The pdf of the covariate Z is bounded and has a compact support.
-
(C5)
The estimated features have a fast convergence rate .
Here condition C1 assumes that the survival function of C is discontinuous at ℰ. In practice, most studies have a preselected ending time ℰ at which all patients who have not experienced failure are censored. This automatically leads to the discontinuity at ℰ. Moreover, for studies that keep tracking patients until the last patient is censored or experiences failure, the performance of the estimated survival curve near the tail can be highly uncertain. A straightforward solution is to manually censor all patients at no later than the last failure time ℰ, which results in the discontinuity at this point. Conditions C2 and C3 are standard smoothness and knot conditions in B-spline approximation. Condition C4 implies that both ST(t), the survival function of T, and fT(t), the density function of T, are bounded away from 0 on [0, ℰ]. Hence, [0, ℰ] is strictly contained in the support of the failure time T.
In the statements of our theory, we use β0 to indicate the underlying true parameter of model (1). Here and throughout the text, for any matrix or vector a. We first establish the consistency and asymptotic normality of the MLE estimator (3).
Lemma 1 Under the Conditions C1-C5, when β equals the truth β0 or equals a -consistent estimator of β0, then the baseline function estimator from
satisfies uniformly in and as in distribution.
Lemma 2 Under the Conditions C1-C5, , and
where
and the definitions of and are given in Appendix D.
With the consistent and asymptotically normal initial estimator, we achieve the oracle property through the adaptive group LASSO with least squares approximation:
Theorem 1 Let 𝒮 be the index set for coefficients in nonzero groups of β0. Define the sub-matrices and sub-vectors with rows and columns in 𝒮 and with elements in 𝒮. Under Conditions C1-C5 with , and
Thanks to the large unlabeled data in Step I, the uncertainty from the complex feature extraction does not affect the asymptotic analysis of the Step II estimators. Consequently, we obtain standard regular estimators for β and m. Confidence intervals for functionals of β and m, including F(t; Z), can be obtained through the bootstrap or perturbation resampling (Efron, 1979; Jin et al., 2001).
Corollary 1 Under the Conditions C1-C5, the estimation error for incidence rate satisfies
and the error is asymptotically normally distributed.
4. Simulation
We have conducted extensive simulations to evaluate the performance of our proposed MATA procedure and to compare it with existing methods, including (a) the nonparametric MLE (NPMLE) approach by Zeng et al. (2005) using the same set of derived features; (b) the tree-based method by Chubak et al. (2015), which first uses a decision tree to classify patients as having experienced the failure event or not, and then, among the patients determined to have events, assigns the event time as the earliest arrival time among all groups of medical encounters used in the decision tree; and (c) the two-step procedure by Hassett et al. (2015) and Uno et al. (2018), which first fits a logistic regression with group lasso to classify the patients and select significant groups of encounters, and then assigns the failure time for patients experiencing the event as the weighted average of the peak times of the significant encounters, with an adjustment to correct systematic bias. Throughout, we fix the total sample size at n + N = 4000 and consider n ∈ {200, 400}.
For simplicity, we only consider the case where all patients are enrolled at the same time, as we can always shift each patient's follow-up period to a preselected origin. The censoring time of the i-th patient, Ci, is simulated from the mixture distribution 0.909·Uniform[0, ℰ) + 0.091·δℰ, where δℰ is the point mass (Dirac measure) at ℰ and ℰ = 20, for i = 1, …, n + N. Intuitively, this imitates a long-term clinical study that tracks patients for up to 20 years, where 90.9% of the patients leave the study at a uniform rate before the study terminates and 9.1% of the patients stay in the study until the end. We simulate the number of encounters and encounter arrival times using the expression for . We consider two sets of density functions for the point processes: Gaussian and Gamma. In addition, for each density shape, we consider both the case where the density functions are independent across the q = 10 counting processes of the medical encounters and the case where the densities are correlated. Details on the data generation mechanisms for the point processes are given in Appendix A of the Supplementary Materials. We then set with , where is the cumulative distribution function (CDF) of the standard normal and Fj is the CDF of a Gamma distribution with shape and scale , . We let
and generate
We consider two choices of and . We further simulate encounter times
but only keep the ones that fall into the interval [0, ℰ]. The final number of arrival times are thus reduced to and we relabel the retained arrival times as . Simple calculation shows , where .
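The censoring mechanism above is fully specified and can be simulated directly, as in the following Python sketch; the encounter-generation step shown here is a simplified placeholder for the Gaussian and Gamma intensity specifications detailed in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(3)
E, n_total = 20.0, 4000

# Mixture censoring: w.p. 0.909 uniform on [0, E); w.p. 0.091 exactly E
at_end = rng.random(n_total) < 0.091
C = np.where(at_end, E, rng.uniform(0, E, size=n_total))

def simulate_encounters(mu_time, sd_time, mean_count):
    """Placeholder encounter generator: Poisson counts with Gaussian
    arrival times, keeping only arrivals in [0, E]; the Appendix A
    intensity specifications replace this in the actual simulation."""
    counts = rng.poisson(mean_count, size=n_total)
    times = [rng.normal(mu_time, sd_time, size=k) for k in counts]
    return [t[(t >= 0) & (t <= E)] for t in times]

encounters = simulate_encounters(mu_time=8.0, sd_time=2.0, mean_count=3)
print(np.mean(C == E))                              # ~ 0.091
print(np.mean([len(t) == 0 for t in encounters]))   # sparsity check
```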
The event time Ti is simulated from the PO model in (1), where the true features are set to be the log of the peak time and the logit of the ratio between the change point time and the peak time of the intensity functions for j = 1, …, q. Intuitively, an early peak time may result in early disease onset, and a change point time close to the peak time may imply a quick exacerbation of the disease status. We set the nonparametric function and vary αc to control the censoring rate. We further set with and . Consequently, only the first group of medical encounters affects the recurrence time. The estimated features are set to be the estimated peak time as well as the logit of the estimated ratio between the change point time and the peak time. We also include the logarithm of the first encounter arrival time, a feature directly observable from the medical encounter data.
We summarize results over 400 replications for each configuration. Within each simulated dataset, we extract the features for both labeled and unlabeled data of total size n + N, whereas the PO model (1) is fitted only on the labeled training data of size n. The interior knots for the B-splines in our approach are chosen as the (10a)-th percentiles of the observed survival time X for a = 1, 2, …, 9, for both the Gaussian and Gamma cases. For the tree approach and the two-step logistic approach, denoted by "Logi", we use the same input feature space as that for the PO model (1) of MATA. To evaluate the performance of the different procedures, we simulate validation data of size nv = 5000 in each simulation run to evaluate the out-of-sample prediction performance through the accuracy measures discussed in Section 2.3.
4.1. Results for the Gaussian Intensity Setting
We present the results for the Gaussian intensity setting here; the parallel results for the Gamma intensity setting are given in Appendix A. The estimated probabilities of having zero and ≤ 3 arrival times under all settings, from a simulation with sample size 500,000, are given in Table A4 in Appendix A. As a benchmark, we also present results from fitting the PO model with the true feature sets. For the true feature sets, we report in Table 1 the bias and standard error (se) of the non-zero coefficients from MATA and NPMLE. In general, we find that the MATA procedure performs well with small sample sizes regardless of the censoring rate, the correlation structure between groups of encounters, and the family of the intensity curves. MATA generally yields both smaller bias and smaller standard error than the NPMLE. In the extreme case when n = 200 and the censoring rate reaches 70%, leading to an effective sample size of 60, both estimators deteriorate. However, the resulting 95% confidence interval of MATA covers the truth, as the absolute bias is less than 1.96 times the standard error. In contrast, the NPMLE has a smaller standard error in this extreme case, but its absolute bias is more than twice its standard error. These results are consistent with Theorem 2.
Table 1.
Displayed are the bias and standard error of the estimates fitted with the true features from 400 simulations, each with N + n = 4,000 and n = 200 or 400. Two methods, MATA and NPMLE, are contrasted. Panels from top to bottom are Gaussian intensities with subject-specific follow-up duration under 30% and 70% censoring rates, as discussed in Section A1. The results under independent groups of encounters are shown on the left, whereas the results for the correlated case are shown on the right.
| Indp | Corr | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Bias | se | Bias | se | Bias | se | Bias | se |
| Gaussian, 30% censoring rate | |||||||||
| n = 200 | MATA | −0.060 | 0.404 | −0.072 | 0.282 | −0.053 | 0.418 | −0.083 | 0.292 |
| NPMLE | −0.355 | 0.692 | −0.216 | 0.550 | −0.359 | 0.716 | −0.232 | 0.495 | |
| n = 400 | MATA | 0.017 | 0.271 | −0.020 | 0.183 | 0.030 | 0.258 | −0.027 | 0.177 |
| NPMLE | −0.036 | 0.373 | 0.000 | 0.259 | −0.028 | 0.385 | −0.001 | 0.242 | |
| Gaussian, 70% censoring rate | |||||||||
| n = 200 | MATA | −0.408 | 0.893 | −0.305 | 0.582 | −0.352 | 0.776 | −0.277 | 0.532 |
| NPMLE | 1.449 | 0.562 | 1.172 | 0.382 | 1.448 | 0.574 | 1.167 | 0.361 | |
| n = 400 | MATA | −0.082 | 0.440 | −0.081 | 0.279 | −0.088 | 0.403 | −0.095 | 0.280 |
| NPMLE | 1.698 | 0.270 | 1.338 | 0.161 | 1.708 | 0.247 | 1.341 | 0.140 | |
For both the true and estimated feature sets, we computed the out-of-sample accuracy measures discussed in Section 2.3 on a validation data set. All other accuracy measures, i.e., the Kendall's-τ type rank correlation summary measures and the absolute prediction error APEu, depend on u, which is easy to control for MATA and NPMLE but not for Tree and Logi. We therefore minimize the cross-validation error for the Tree approach and minimize the misclassification rate for the Logi approach at their first step, i.e., classifying the censoring status Δ. For MATA and NPMLE, we calculate these accuracy measures at u = 0.02ℓ for ℓ = 0, 1, …, 50 and pick the u with the minimum APEu. We then compare these measures at the selected u with the Tree and Logi methods in Tables 2 and 3.
Table 2.
Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gaussian intensities over 400 simulations, each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the true features. The upper two panels display the results under independent intensities with 30% and 70% censoring rates, respectively; the lower two panels display the results under correlated intensities with 30% and 70% censoring rates, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gaussian, 30%, independent counting processes, true features | ||||||||
| 0.901 (0.002) | 0.894 (0.003) | 0.791 (0.022) | 0.719 (0.042) | 0.901 (0.002) | 0.898 (0.002) | 0.791 (0.018) | 0.718 (0.038) | |
| 0.868 (0.006) | 0.864 (0.006) | 0.742 (0.027) | 0.718 (0.029) | 0.868 (0.005) | 0.867 (0.005) | 0.747 (0.018) | 0.725 (0.022) | |
| APE | 0.971 (0.032) | 1.049 (0.050) | 1.978 (0.368) | 3.373 (0.947) | 0.962 (0.027) | 1.001 (0.031) | 1.884 (0.125) | 3.406 (0.917) |
| Gaussian, 70%, independent counting processes, true features | ||||||||
| 0.932 (0.004) | 0.921 (0.005) | 0.867 (0.014) | 0.859 (0.023) | 0.933 (0.002) | 0.926 (0.003) | 0.872 (0.010) | 0.859 (0.021) | |
| 0.782 (0.021) | 0.757 (0.023) | 0.705 (0.064) | 0.682 (0.086) | 0.789 (0.014) | 0.761 (0.015) | 0.716 (0.051) | 0.699 (0.067) | |
| APE | 0.772 (0.049) | 0.899 (0.057) | 1.479 (0.162) | 1.590 (0.268) | 0.751 (0.030) | 0.840 (0.037) | 1.413 (0.121) | 1.576 (0.234) |
| Gaussian, 30%, correlated counting processes, true features | ||||||||
| 0.900 (0.003) | 0.894 (0.003) | 0.792 (0.023) | 0.720 (0.042) | 0.901 (0.002) | 0.898 (0.002) | 0.791 (0.018) | 0.724 (0.039) | |
| 0.866 (0.006) | 0.862 (0.007) | 0.741 (0.029) | 0.719 (0.027) | 0.867 (0.005) | 0.865 (0.005) | 0.748 (0.017) | 0.727 (0.022) | |
| APE | 0.972 (0.035) | 1.050 (0.046) | 1.994 (0.434) | 3.383 (0.962) | 0.963 (0.026) | 1.002 (0.031) | 1.871 (0.114) | 3.278 (0.976) |
| Gaussian, 70%, correlated counting processes, true features | ||||||||
| 0.932 (0.004) | 0.921 (0.005) | 0.868 (0.015) | 0.859 (0.025) | 0.933 (0.002) | 0.926 (0.003) | 0.873 (0.011) | 0.861 (0.022) | |
| 0.782 (0.020) | 0.758 (0.022) | 0.698 (0.066) | 0.679 (0.084) | 0.789 (0.013) | 0.762 (0.016) | 0.720 (0.050) | 0.701 (0.068) | |
| APE | 0.771 (0.051) | 0.896 (0.057) | 1.473 (0.175) | 1.599 (0.280) | 0.750 (0.029) | 0.838 (0.038) | 1.395 (0.121) | 1.548 (0.247) |
Table 3.
Estimated features, Gaussian. Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gaussian intensities over 400 simulations each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the estimated features derived from FPCA approach in Section 2.2.1. The upper two panels display the result under independent intensities with 30% and 70% censoring rate, respectively; the lower two panels display the result under correlated intensities with 30% and 70% censoring rate, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gaussian, 30%, independent counting processes, estimated features | ||||||||
| 0.781 (0.020) | 0.768 (0.011) | 0.790 (0.019) | 0.740 (0.036) | 0.791 (0.008) | 0.784 (0.006) | 0.788 (0.022) | 0.744 (0.027) | |
| 0.701 (0.028) | 0.667 (0.017) | 0.690 (0.019) | 0.654 (0.036) | 0.675 (0.027) | 0.680 (0.010) | 0.692 (0.016) | 0.653 (0.028) | |
| APE | 2.148 (0.332) | 2.184 (0.132) | 1.913 (0.139) | 2.409 (0.540) | 1.960 (0.236) | 2.019 (0.074) | 1.904 (0.125) | 2.272 (0.315) |
| Gaussian, 70%, independent counting processes, estimated features | ||||||||
| 0.839 (0.018) | 0.833 (0.010) | 0.802 (0.020) | 0.827 (0.015) | 0.853 (0.008) | 0.845 (0.006) | 0.806 (0.019) | 0.832 (0.013) | |
| 0.397 (0.111) | 0.413 (0.040) | 0.536 (0.107) | 0.500 (0.107) | 0.468 (0.042) | 0.447 (0.025) | 0.555 (0.072) | 0.511 (0.090) | |
| APE | 1.868 (0.253) | 1.938 (0.122) | 2.276 (0.248) | 1.992 (0.216) | 1.685 (0.106) | 1.775 (0.071) | 2.202 (0.223) | 1.907 (0.177) |
| Gaussian, 30%, correlated counting processes, estimated features | ||||||||
| 0.781 (0.019) | 0.768 (0.011) | 0.789 (0.021) | 0.743 (0.032) | 0.791 (0.010) | 0.783 (0.006) | 0.791 (0.016) | 0.747 (0.022) | |
| 0.701 (0.028) | 0.669 (0.016) | 0.690 (0.019) | 0.656 (0.036) | 0.672 (0.030) | 0.680 (0.011) | 0.693 (0.014) | 0.654 (0.027) | |
| APE | 2.142 (0.323) | 2.180 (0.122) | 1.915 (0.145) | 2.352 (0.418) | 1.958 (0.257) | 2.018 (0.072) | 1.886 (0.099) | 2.243 (0.194) |
| Gaussian, 70%, correlated counting processes, estimated features | ||||||||
| 0.838 (0.018) | 0.832 (0.009) | 0.802 (0.021) | 0.827 (0.014) | 0.852 (0.007) | 0.846 (0.005) | 0.804 (0.017) | 0.832 (0.012) | |
| 0.393 (0.109) | 0.430 (0.042) | 0.533 (0.119) | 0.500 (0.110) | 0.468 (0.041) | 0.450 (0.026) | 0.564 (0.070) | 0.519 (0.083) | |
| APE | 1.879 (0.248) | 1.944 (0.120) | 2.289 (0.253) | 1.987 (0.211) | 1.688 (0.102) | 1.772 (0.071) | 2.225 (0.193) | 1.898 (0.168) |
The performance of the MATA estimator fitted with the true features largely dominates that of NPMLE, Tree, and Logi, with higher rank correlations and lower APE in all cases under Gaussian intensities. When fitted with the estimated features, there is no clear winner among the four methods when the labeled data size is n = 200; however, when the labeled data size increases to n = 400, MATA generally outperforms the other three approaches in terms of APE.
5. Example
We applied our MATA algorithm to the extraction of cancer recurrence times for the Veterans Affairs Central Cancer Registry (VACCR). We obtained from the VACCR 36,705 patients diagnosed with stage I to III lung cancer before 2018 and followed through 2019, among whom 3,752 patients diagnosed in 2000-2018 with cancer stage information had annotated dates of cancer recurrence. Through the research platform under the Office of Research & Development at the Department of Veterans Affairs, the cancer registry data were linked to the EHR of the Veterans Affairs healthcare system, containing diagnoses, procedures, medications, lab tests, and medical notes.
The gold-standard recurrence status was collected through manual abstraction and the tumor registry for the VACCR data. In addition, baseline covariates, including age at diagnosis, gender, and cancer stage, were extracted. Due to the predominance of male patients in the VACCR (97.7% among the 3,752), we excluded gender from the subsequent analysis. We randomly selected 1000 patients as training data and used the remaining 2752 as validation data. To assess how performance changes with the size of the training data, we also considered smaller training sets of n = 200 and n = 400, sub-sampled from the n = 1000 set. We drew 400 bootstrap samples from the 1000 training samples for each n to quantify the variability of the analysis. We selected the month as the time unit and focused on recurrence within 2 years. Patients without recurrence before 24 months from the diagnosis date were censored at 24 months; the censoring rate was 39%. The diagnosis, procedure, and medication codes and the mentions in medical notes associated with the following nine events were collected: lung cancer, chemotherapy, computerized tomography (CT) scan, radiotherapy, secondary malignant neoplasm, palliative or hospice care, recurrence, medications for systematic therapies (including cytotoxic therapies, targeted therapies, and immunotherapies), and biopsy or excision. See Table A6 for a detailed summary of the sparsity of the nine groups of medical encounters.
For each of the nine selected events except palliative or hospice care, we estimate the subject-specific intensity function on the training and unlabeled data sets by applying the FPCA approach described in Section 2.2.1, and then use the resulting basis functions to project the intensity functions for the validation set. The peak and change point times of the estimated intensity functions are then extracted as features. In addition, the first code arrival time, the first FPCA score, and the total number of diagnosis and procedure codes are added as features for each event. All these features except the FPCA score are log-transformed to reduce skewness. Radiotherapy, medication for systematic therapies, and biopsy/excision have zero-code rates of 77.8%, 70.7%, and 96.4%, respectively. Consequently, the estimated peak and largest-increase times for these groups are identical to the associated first occurrence time for most patients; thus, only the first occurrence time and the total number of diagnosis and procedure codes are considered for these groups. Finally, to overcome potential collinearity of the extracted features within the same group (i.e., event), we further run principal component analysis on each group of features and keep the first few principal components whose cumulative proportion of variation explained exceeds 90%.
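The per-group PCA reduction can be expressed compactly, as in the following Python sketch, which keeps the smallest number of components whose cumulative proportion of variation explained reaches 90%; the feature matrix here is hypothetical. Note that scikit-learn's `PCA(n_components=0.9)` implements the same selection rule directly.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_group(features, threshold=0.90):
    """Project one group's (log-transformed) features onto the smallest
    number of principal components whose cumulative proportion of
    variance explained reaches `threshold`."""
    pca = PCA().fit(features)
    k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_),
                            threshold) + 1)
    return PCA(n_components=k).fit_transform(features)

# Hypothetical group: peak time, change point time, first arrival time,
# first FPCA score, log total code count for one event (e.g., CT scan)
rng = np.random.default_rng(4)
group = rng.normal(size=(500, 5))
print(reduce_group(group).shape)   # (500, k) with k <= 5
```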
As in the simulation, we fit the decision tree to minimize the cross-validation error for the Tree approach, and we fit the logistic regression model for the Logi approach. For MATA, NPMLE, and Logi, we take a fine grid on the false positive rate (FPR) for Δ and compute all other accuracy measures in Section 2.3 at each value of the FPR. We then pick the result that matches the FPR of the Tree approach.
The prediction accuracy is summarized in Table 4. For the measurements regarding the timing of recurrence, our MATA estimator dominates the other three approaches with larger and yet smaller APE.
Table 4.
Mean and bootstrap standard deviation of Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) for the medical encounter data analyzed in Section 5 under the four approaches, i.e., MATA, NPMLE, Tree, and Logi, over 400 bootstrap simulations.
| n = 1000 | ||||
|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi |
| 0.810 (0.010) | 0.809 (0.009) | 0.755 (0.032) | 0.762 (0.015) | |
| 0.688 (0.016) | 0.690 (0.017) | 0.646 (0.049) | 0.650 (0.010) | |
| APE | 3.326 (0.051) | 3.399 (0.074) | 5.693 (0.960) | 4.317 (0.152) |
| n = 400 | ||||
| | MATA | NPMLE | Tree | Logi |
| 0.796 (0.013) | 0.795 (0.013) | 0.748 (0.033) | 0.762 (0.059) | |
| 0.663 (0.026) | 0.662 (0.027) | 0.644 (0.044) | 0.646 (0.044) | |
| APE | 3.576 (0.094) | 3.615 (0.129) | 5.880 (1.159) | 4.653 (0.577) |
| n = 200 | ||||
| | MATA | NPMLE | Tree | Logi |
| 0.791 (0.025) | 0.784 (0.024) | 0.748 (0.039) | 0.859 (0.127) | |
| 0.624 (0.057) | 0.624 (0.043) | 0.640 (0.044) | 0.681 (0.090) | |
| APE | 3.842 (0.317) | 3.945 (0.286) | 6.094 (1.418) | 5.736 (1.080) |
Through its variable selection feature, MATA excluded stage II cancer from the n = 1000 analysis and, additionally, stage III cancer, age at diagnosis, and medication for systematic therapies from the n = 200 and n = 400 analyses. The selection is consistent with the NPMLE result of the n = 1000 analysis, as the excluded features coincide with the feature groups with no p-value < 0.05. Additional details on feature selection are given in Appendix B.
6. Discussion
We proposed the MATA method to automatically extract patients' longitudinal characteristics from medical encounter data and to build a risk prediction model for the occurrence times of clinical events from the extracted features. The approach integrates both labeled and unlabeled data to obtain the longitudinal features, thus tackling the sparsity of the medical encounter data. In addition, the FPCA approach largely preserves the flexibility of the resulting subject-specific intensity functions. In practice, intensity functions often differ in shape between female and male patients, or between young and elderly patients; therefore, a multiplicative intensity model, which assumes that the heterogeneity among patients results only from subject-specific random effects, may not be adequate. The fitted risk prediction model is chosen to be the proportional odds model, with the nonparametric function approximated by B-splines under a transformation that ensures its monotonicity. The resulting estimator of the parametric part is shown to be root-n consistent, and the estimator of the non-parametric function is consistent, under the correctly specified model. Though the proportional odds model is adopted here, our proofs can easily be extended to other semiparametric transformation models such as the proportional hazards model. In the presence of large feature sets, we propose to use the group lasso with LSA for feature selection. The finite sample performance of our approach is studied under various settings.
Here, the FPCA is employed on each group of medical encounters separately for feature extraction. However, different groups are potentially connected, and separate estimation may fail to capture such relationships. A potential future direction is to use multivariate FPCA to directly address the covariation among different groups. Though various multivariate FPCA approaches exist, none of them can handle the case of medical encounters, where the encounter arrival times, rather than the underlying intensity functions, are observed. Much effort is needed to develop applicable multivariate FPCA methodology and theory in this setting.
The spline model works very well with only a few knots. Small-sample performance of our estimator is studied in various simulations, with the prediction accuracy examined by C-statistics (Uno et al., 2011) and Brier Score on simulated validation data sets.
The adoption of the PO model is for simplicity of illustration; the theory for our estimator can easily be generalized to arbitrary linear transformation models. As the goal of this paper is to annotate event times within the observational window, the medical encounter data involved may extend past the actual event time.
Appendices
In Appendix A, we present additional simulation studies with Gamma intensities, as well as extra information on the simulation settings. In Appendix B, we offer additional details on the data example of lung cancer recurrence with VACCR data. In Appendix C, we provide the theoretical properties for the derived features. In Appendix D, we provide the theoretical properties for the MATA estimator based on the proportional odds model. In Appendix F, we provide the detailed algorithm for optimization of the log-likelihood ln.
Appendix A. Additional Simulation Details
A1. Simulation Settings for the Gaussian Intensities
We first simulate Gaussian shape density, i.e., is the density function of truncated at 0.
Set to be , Fj is the CDF of , with and for j = 1, …, q, and , i.e., the multivariate normal distribution with mean 0 and variance . For simplicity, we set . We further set it to be one if it is less than one.
Simulate with , where Fj is the CDF of . The way we simulate and guarantees that the largest change in the intensity functions only occurs after patients enter the study, as expected in practice. Besides, the simulated σij is controlled not only by the value of μij but also by the median of . Thus σij will not become too extreme even with a large peak time. In other words, the corresponding largest change in the intensity function is more likely to occur near the peak time than much earlier than it.
Finally, we set αc, the constant in the nonparametric function α(t), to be 7.5 and 1.1 to obtain an approximately 30% and 70% censoring rate.
A2. Simulation Settings for the Gamma Intensity functions
We also consider gamma shape density, i.e., is the density function of , truncated at 0. We let be the density function of , truncated at 0. Set , where Fj is the CDF of , with , and , and . Generate from truncated at its third quartile with , and . We set and 1.9 to obtain the approximate 30% and 70% censoring rates.
Table A1.
Displayed are the bias and standard error of the estimates fitted with the true features from 400 simulations, each with N + n = 4,000 and n = 200 or 400. Two methods, MATA and NPMLE, are contrasted. Panels from top to bottom are Gamma intensities with subject-specific follow-up duration under 30% and 70% censoring rates, as discussed in Section A2. The results under independent groups of encounters are shown on the left, whereas the results for the correlated case are shown on the right.
| Indp | Corr | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Bias | se | Bias | se | Bias | se | Bias | se |
| Gamma, 30% censoring rate | |||||||||
| n = 200 | MATA | −0.086 | 0.443 | −0.084 | 0.410 | −0.091 | 0.447 | −0.126 | 0.436 |
| NPMLE | −0.480 | 0.581 | −0.301 | 0.510 | −0.482 | 0.549 | −0.349 | 0.541 | |
| n = 400 | MATA | −0.006 | 0.296 | −0.032 | 0.271 | 0.019 | 0.283 | −0.067 | 0.266 |
| NPMLE | −0.223 | 0.374 | −0.138 | 0.315 | −0.190 | 0.346 | −0.167 | 0.340 | |
| Gamma, 70% censoring rate | |||||||||
| n = 200 | MATA | −0.383 | 0.743 | −0.299 | 0.676 | −0.400 | 0.783 | −0.339 | 0.666 |
| NPMLE | 0.336 | 0.731 | 0.284 | 0.592 | 0.325 | 0.866 | 0.267 | 0.670 | |
| n = 400 | MATA | −0.109 | 0.399 | −0.070 | 0.328 | −0.074 | 0.410 | −0.112 | 0.344 |
| NPMLE | 0.708 | 0.429 | 0.551 | 0.330 | 0.744 | 0.385 | 0.533 | 0.352 | |
A3. Results for Gamma Intensity setting
For the true feature sets, we report the bias and standard error (se) of the non-zero coefficients from MATA and NPMLE in Table A1. Similar to the Gaussian intensity settings, we find that the MATA procedure performs well with small sample sizes regardless of the censoring rate, the correlation structure between groups of encounters, and the family of the intensity curves. MATA generally yields both smaller bias and smaller standard error than the NPMLE. In the extreme case when n = 200 and the censoring rate reaches 70%, both estimators deteriorate. However, the resulting 95% confidence interval of MATA covers the truth, as the absolute bias is less than 1.96 times the standard error. In contrast, the NPMLE tends to be numerically unstable: its estimation bias in the n = 400 setting is larger than its own standard error and larger than its bias in the n = 200 setting. These results are consistent with Theorem 2.
For both the true and estimated feature sets, we computed the out-of-sample accuracy measures discussed in Section 2.3 on a validation data set. All other accuracy measures, i.e., the Kendall's-τ type rank correlation summary measures and the absolute prediction error APEu, depend on u, which is easy to control for MATA and NPMLE but not for Tree and Logi. We therefore minimize the cross-validation error for the Tree approach and minimize the misclassification rate for the Logi approach at their first step, i.e., classifying the censoring status Δ. For MATA and NPMLE, we calculate these accuracy measures at u = 0.02ℓ for ℓ = 0, 1, …, 50 and pick the u with the minimum APEu. We then compare these measures at the selected u with the Tree and Logi methods in Tables A2 and A3.
Similar to the Gaussian intensity setting, the performance of the MATA estimator fitted with the true features largely dominates that of NPMLE, Tree, and Logi, with higher rank correlations and lower APE in all cases except when the encounters are simulated from independent Gamma counting processes with a 30% censoring rate. In this exceptional case, our MATA estimator has only a very minor advantage over NPMLE in one rank correlation measure, and remains better in terms of the other measure and APE. When fitted with the estimated features, there is no clear winner among the four methods when the labeled data size is n = 200; however, when the labeled data size increases to n = 400, MATA generally outperforms the other three approaches in terms of APE.
Table A2.
True features, Gamma. Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gamma intensities over 400 simulations, each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the true features. The upper two panels display the results under independent intensities with 30% and 70% censoring rates, respectively; the lower two panels display the results under correlated intensities with 30% and 70% censoring rates, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gamma, 30%, independent counting processes, true features | ||||||||
| 0.872 (0.003) | 0.864 (0.004) | 0.720 (0.040) | 0.678 (0.074) | 0.873 (0.003) | 0.869 (0.003) | 0.731 (0.023) | 0.683 (0.071) | |
| 0.814 (0.008) | 0.812 (0.008) | 0.617 (0.047) | 0.658 (0.048) | 0.814 (0.006) | 0.815 (0.007) | 0.636 (0.026) | 0.667 (0.043) | |
| APE | 1.318 (0.038) | 1.410 (0.048) | 3.826 (1.011) | 4.937 (1.844) | 1.307 (0.031) | 1.350 (0.036) | 3.434 (0.578) | 4.784 (1.835) |
| Gamma, 70%, independent counting processes, true features | ||||||||
| 0.922 (0.004) | 0.914 (0.004) | 0.876 (0.010) | 0.837 (0.042) | 0.924 (0.002) | 0.919 (0.003) | 0.880 (0.008) | 0.841 (0.040) | |
| 0.735 (0.021) | 0.701 (0.024) | 0.618 (0.078) | 0.604 (0.106) | 0.743 (0.014) | 0.723 (0.017) | 0.630 (0.053) | 0.627 (0.085) | |
| APE | 0.892 (0.049) | 0.995 (0.054) | 1.436 (0.125) | 1.986 (0.539) | 0.869 (0.030) | 0.930 (0.037) | 1.388 (0.090) | 1.917 (0.518) |
| Gamma, 30%, correlated counting processes, true features | ||||||||
| 0.872 (0.003) | 0.864 (0.004) | 0.720 (0.040) | 0.685 (0.072) | 0.873 (0.002) | 0.869 (0.003) | 0.731 (0.026) | 0.684 (0.068) | |
| 0.814 (0.008) | 0.813 (0.009) | 0.617 (0.047) | 0.662 (0.049) | 0.816 (0.006) | 0.813 (0.007) | 0.635 (0.031) | 0.668 (0.041) | |
| APE | 1.320 (0.041) | 1.408 (0.049) | 3.819 (1.022) | 4.826 (1.842) | 1.307 (0.030) | 1.350 (0.034) | 3.500 (0.689) | 4.808 (1.810) |
| Gamma, 70%, correlated counting processes, true features | ||||||||
| 0.921 (0.005) | 0.913 (0.004) | 0.875 (0.011) | 0.841 (0.044) | 0.924 (0.003) | 0.919 (0.003) | 0.879 (0.007) | 0.848 (0.038) | |
| 0.733 (0.025) | 0.702 (0.024) | 0.616 (0.072) | 0.612 (0.099) | 0.742 (0.015) | 0.724 (0.016) | 0.628 (0.053) | 0.622 (0.086) | |
| APE | 0.900 (0.056) | 0.996 (0.053) | 1.447 (0.126) | 1.933 (0.572) | 0.872 (0.032) | 0.929 (0.036) | 1.390 (0.086) | 1.833 (0.488) |
Table A3.
Estimated features, Gamma. Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gamma intensities over 400 simulations, each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the estimated features derived from the FPCA approach in Section 2.2.1. The upper two panels display the results under independent intensities with 30% and 70% censoring rates, respectively; the lower two panels display the results under correlated intensities with 30% and 70% censoring rates, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gamma, 30%, independent counting processes, estimated features | ||||||||
| 0.728 (0.027) | 0.720 (0.010) | 0.650 (0.043) | 0.668 (0.059) | 0.749 (0.011) | 0.737 (0.007) | 0.667 (0.045) | 0.659 (0.056) | |
| 0.555 (0.037) | 0.578 (0.015) | 0.456 (0.074) | 0.570 (0.078) | 0.573 (0.018) | 0.589 (0.010) | 0.480 (0.072) | 0.558 (0.063) | |
| APE | 2.732 (0.544) | 2.698 (0.121) | 4.164 (1.191) | 5.018 (1.659) | 2.443 (0.218) | 2.528 (0.074) | 3.840 (1.095) | 4.858 (1.330) |
| Gamma, 70%, independent counting processes, estimated features | ||||||||
| 0.833 (0.016) | 0.827 (0.011) | 0.829 (0.017) | 0.822 (0.019) | 0.849 (0.010) | 0.841 (0.006) | 0.831 (0.018) | 0.829 (0.014) | |
| 0.277 (0.115) | 0.325 (0.043) | 0.485 (0.139) | 0.425 (0.119) | 0.371 (0.058) | 0.374 (0.027) | 0.519 (0.070) | 0.450 (0.089) | |
| APE | 2.002 (0.223) | 2.040 (0.136) | 2.006 (0.202) | 2.086 (0.254) | 1.766 (0.141) | 1.866 (0.077) | 1.964 (0.195) | 1.973 (0.166) |
| Gamma, 30%, correlated counting processes, estimated features | ||||||||
| 0.731 (0.024) | 0.720 (0.011) | 0.656 (0.046) | 0.670 (0.053) | 0.749 (0.010) | 0.737 (0.007) | 0.672 (0.045) | 0.665 (0.057) | |
| 0.568 (0.038) | 0.577 (0.016) | 0.453 (0.076) | 0.560 (0.073) | 0.579 (0.020) | 0.588 (0.011) | 0.485 (0.072) | 0.561 (0.063) | |
| APE | 2.681 (0.494) | 2.691 (0.127) | 4.133 (1.176) | 4.770 (1.522) | 2.451 (0.244) | 2.522 (0.073) | 3.778 (1.076) | 4.874 (1.354) |
| Gamma, 70%, correlated counting processes, estimated features | ||||||||
| 0.833 (0.016) | 0.826 (0.011) | 0.828 (0.018) | 0.822 (0.017) | 0.849 (0.009) | 0.840 (0.006) | 0.831 (0.017) | 0.829 (0.012) | |
| 0.283 (0.107) | 0.322 (0.044) | 0.484 (0.136) | 0.421 (0.124) | 0.366 (0.060) | 0.388 (0.028) | 0.515 (0.076) | 0.442 (0.094) | |
| APE | 1.996 (0.214) | 2.053 (0.135) | 2.017 (0.213) | 2.088 (0.236) | 1.766 (0.128) | 1.868 (0.075) | 1.963 (0.180) | 1.975 (0.156) |
Table A4.
Estimated probability of having zero or ≤ 3 encounter arrival times under each counting process for j = 1, …, 10 from a simulation with sample size 500,000.
| Probability of zero encounters | ||||||||||
| Indp Gaussian | 0.508 | 0.802 | 0.793 | 0.767 | 0.878 | 0.758 | 0.474 | 0.594 | 0.818 | 0.755 |
| Corr Gaussian | 0.508 | 0.802 | 0.792 | 0.766 | 0.879 | 0.758 | 0.474 | 0.595 | 0.817 | 0.755 |
| Indp Gamma | 0.761 | 0.805 | 0.716 | 0.786 | 0.750 | 0.944 | 0.712 | 0.810 | 0.939 | 0.750 |
| Corr Gamma | 0.761 | 0.806 | 0.715 | 0.786 | 0.749 | 0.943 | 0.713 | 0.812 | 0.938 | 0.750 |
| Probability of ≤ 3 encounters | ||||||||||
| Indp Gaussian | 0.720 | 0.945 | 0.926 | 0.930 | 0.971 | 0.918 | 0.680 | 0.778 | 0.938 | 0.902 |
| Corr Gaussian | 0.721 | 0.946 | 0.926 | 0.930 | 0.972 | 0.918 | 0.681 | 0.778 | 0.938 | 0.902 |
| Indp Gamma | 0.938 | 0.962 | 0.913 | 0.939 | 0.933 | 0.995 | 0.935 | 0.970 | 0.995 | 0.931 |
| Corr Gamma | 0.938 | 0.962 | 0.913 | 0.939 | 0.933 | 0.995 | 0.936 | 0.970 | 0.995 | 0.931 |
Table A5.
Average model sizes selected by MATA.
| 30% Indp | 70% Indp | 30% Corr | 70% Corr | ||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Tr Ft | Est Ft | Tr Ft | Est Ft | Tr Ft | Est Ft | Tr Ft | Est Ft |
| Gaussian | |||||||||
| n = 200 | AIC | 13.24 | 15.57 | 13.74 | 17.09 | 13.30 | 15.91 | 13.75 | 17.41 |
| BIC | 13.07 | 15.45 | 13.45 | 16.87 | 13.09 | 15.83 | 13.39 | 17.27 | |
| n = 400 | AIC | 13.15 | 14.40 | 13.30 | 14.57 | 13.18 | 14.39 | 13.31 | 14.77 |
| BIC | 13.00 | 14.05 | 13.01 | 14.17 | 13.00 | 14.12 | 13.00 | 14.22 | |
| Gamma | |||||||||
| n = 200 | AIC | 13.38 | 18.37 | 13.88 | 19.35 | 13.41 | 18.04 | 13.94 | 19.57 |
| BIC | 13.23 | 18.22 | 13.65 | 19.01 | 13.29 | 17.88 | 13.72 | 19.21 | |
| n = 400 | AIC | 13.24 | 15.02 | 13.20 | 15.09 | 13.21 | 15.04 | 13.34 | 14.97 |
| BIC | 13.00 | 14.80 | 13.01 | 14.78 | 13.01 | 14.72 | 13.02 | 14.72 | |
Supplementary Results on Simulations
We show the sparsity of the simulated data in Table A4 and the average model sizes selected by MATA in Table A5.
Table A6.
Sparsity of the nine groups of medical encounter data analyzed in Section 5.
| Feature | Zero | ≤ 3 times |
|---|---|---|
| Lung Cancer | 0.014 | 0.087 |
| Chemotherapy | 0.567 | 0.736 |
| CT Scan | 0.127 | 0.363 |
| Radiotherapy | 0.778 | 0.912 |
| Secondary Malignant Neoplasm | 0.554 | 0.856 |
| Palliative or Hospice Care | 0.576 | 0.888 |
| Recurrence | 0.279 | 0.723 |
| Medication | 0.707 | 0.824 |
| Biopsy or Excision | 0.964 | 1.000 |
Appendix B. Additional Details on Data Example
We show the sparsity of the features in Table A6. Radiotherapy, medication for systemic therapies, and biopsy/excision have zero-code rates of 77.8%, 70.7%, and 96.4%, respectively. Consequently, the estimated peak and largest-increase times of these features are identical to the associated first occurrence time for most patients. Thus, only the first occurrence time and the total number of diagnosis and procedure codes are considered for these features, as illustrated in the sketch below.
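For such sparse code groups, the feature extraction reduces to the first occurrence time and the total code count, both log-transformed per the analysis. The sketch below assumes encounter times recorded in days; the handling of patients without any code is an illustrative convention, not the paper's.

```python
import numpy as np

def sparse_code_features(times_in_days):
    """For very sparse code groups, extract only the first occurrence time
    and the total code count (both log-transformed, per the analysis)."""
    t = np.asarray(times_in_days, dtype=float)
    if t.size == 0:
        # Convention for patients without any code is illustrative only.
        return {"first_code": np.nan, "logN": 0.0}
    return {"first_code": np.log(t.min()), "logN": np.log(t.size)}
```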
We show the MATA and NPMLE coefficients for n = 1000, 400, and 200 in Tables A7–A9. As in Section 4, our MATA estimator has smaller bootstrap standard errors than the NPMLE. For the analysis with n = 1000, both MATA and NPMLE showed a significant impact of the first arrival time and peak time of the lung cancer code; the first arrival time and first FPCA score of the chemotherapy code; the first arrival time of the radiotherapy code; the total number of secondary malignant neoplasm codes; the peak and change point times of palliative or hospice care in medical notes; the first FPCA score and total number of recurrence mentions in medical notes; and the first arrival time of biopsy or excision. MATA additionally finds the change point time of the lung cancer code to be strongly associated with a high risk of lung cancer recurrence. Furthermore, MATA excludes the stage II cancer indicator, which coincides with its large p-value under NPMLE. For the analyses with n = 200 and n = 400, MATA excludes cancer stage, age at diagnosis, and medication for systemic therapies, which coincides with the groups lacking any significant feature in the n = 1000 NPMLE analysis.
Table A7.
Analysis with n = 1000. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval"), based on 400 bootstrap samples, for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All time-related features are log-transformed. The results for the proposed MATA estimator are given in the left panel and those of the NPMLE in the right panel.
| Group | Feature | MATA | | | NPMLE | | |
|---|---|---|---|---|---|---|---|
| | | mean | boot.se | pval | mean | boot.se | pval |
| Stage II | | – | – | – | 0.075 | 0.144 | 0.604 |
| Stage III | | 0.168 | 0.168 | 0.319 | 0.160 | 0.181 | 0.379 |
| Age | | 0.013 | 0.008 | 0.111 | 0.013 | 0.007 | 0.069 |
| Lung Cancer | 1stCode | −0.277 | 0.116 | 0.017 | −0.294 | 0.116 | 0.011 |
| | Pk | 0.213 | 0.084 | 0.012 | 0.213 | 0.089 | 0.016 |
| | ChP | 0.135 | 0.065 | 0.040 | 0.131 | 0.068 | 0.054 |
| | 1stScore | −0.091 | 0.183 | 0.619 | −0.028 | 0.204 | 0.891 |
| | logN | 0.072 | 0.108 | 0.502 | 0.070 | 0.121 | 0.561 |
| Chemotherapy | 1stCode | −0.140 | 0.065 | 0.032 | −0.146 | 0.067 | 0.029 |
| | Pk | −0.162 | 0.106 | 0.127 | −0.169 | 0.111 | 0.127 |
| | ChP | 0.019 | 0.067 | 0.773 | 0.019 | 0.073 | 0.799 |
| | 1stScore | 0.652 | 0.180 | < 0.001 | 0.678 | 0.188 | < 0.001 |
| | logN | 0.073 | 0.092 | 0.424 | 0.076 | 0.103 | 0.463 |
| CT scan | 1stCode | 0.020 | 0.076 | 0.789 | 0.017 | 0.093 | 0.858 |
| | Pk | 0.104 | 0.093 | 0.262 | 0.115 | 0.103 | 0.266 |
| | ChP | 0.046 | 0.043 | 0.286 | 0.047 | 0.048 | 0.329 |
| | 1stScore | −0.244 | 0.132 | 0.065 | −0.266 | 0.131 | 0.042 |
| | logN | −0.019 | 0.096 | 0.847 | −0.034 | 0.112 | 0.763 |
| Radiotherapy | 1stCode | −0.327 | 0.157 | 0.037 | −0.382 | 0.163 | 0.019 |
| | logN | −0.057 | 0.056 | 0.311 | −0.068 | 0.062 | 0.275 |
| Secondary Malignant Neoplasm | 1stCode | 0.013 | 0.127 | 0.921 | −0.008 | 0.141 | 0.954 |
| | Pk | −0.135 | 0.113 | 0.230 | −0.130 | 0.126 | 0.299 |
| | ChP | −0.067 | 0.049 | 0.168 | −0.067 | 0.054 | 0.217 |
| | 1stScore | −0.197 | 0.122 | 0.105 | −0.205 | 0.128 | 0.109 |
| | logN | 0.333 | 0.077 | < 0.001 | 0.335 | 0.079 | < 0.001 |
| Palliative or Hospice Care | 1stCode | −0.055 | 0.085 | 0.517 | −0.066 | 0.089 | 0.457 |
| | Pk | −0.942 | 0.187 | < 0.001 | −1.009 | 0.205 | < 0.001 |
| | ChP | −0.704 | 0.140 | < 0.001 | −0.753 | 0.153 | < 0.001 |
| | 1stScore | 0.068 | 0.095 | 0.470 | 0.070 | 0.098 | 0.471 |
| | logN | 0.017 | 0.061 | 0.785 | 0.002 | 0.064 | 0.979 |
| Recurrence | 1stCode | 0.121 | 0.081 | 0.138 | 0.122 | 0.084 | 0.147 |
| | Pk | −0.105 | 0.093 | 0.259 | −0.099 | 0.097 | 0.310 |
| | ChP | −0.046 | 0.058 | 0.426 | −0.042 | 0.060 | 0.479 |
| | 1stScore | −0.281 | 0.119 | 0.018 | −0.288 | 0.122 | 0.018 |
| | logN | 0.234 | 0.076 | 0.002 | 0.255 | 0.075 | < 0.001 |
| Medication | 1stCode | 0.173 | 0.118 | 0.143 | 0.185 | 0.113 | 0.104 |
| | logN | 0.062 | 0.071 | 0.384 | 0.071 | 0.081 | 0.380 |
| Biopsy | 1stCode | −0.865 | 0.411 | 0.035 | −0.968 | 0.399 | 0.015 |
| | logN | −0.423 | 0.502 | 0.399 | −0.478 | 0.523 | 0.360 |
Table A8.
Analysis with n = 400. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval"), based on 400 bootstrap samples, for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All time-related features are log-transformed. The results for the proposed MATA estimator are given in the left panel and those of the NPMLE in the right panel.
| Group | Feature | MATA | | | NPMLE | | |
|---|---|---|---|---|---|---|---|
| | | mean | boot.se | pval | mean | boot.se | pval |
| Stage II | | – | – | – | 0.067 | 0.254 | 0.790 |
| Stage III | | – | – | – | 0.189 | 0.349 | 0.587 |
| Age | | – | – | – | 0.014 | 0.012 | 0.264 |
| Lung Cancer | 1stCode | −0.232 | 0.178 | 0.192 | −0.311 | 0.189 | 0.101 |
| | Pk | 0.191 | 0.133 | 0.150 | 0.232 | 0.144 | 0.108 |
| | ChP | 0.115 | 0.106 | 0.279 | 0.133 | 0.117 | 0.258 |
| | 1stScore | −0.098 | 0.266 | 0.712 | −0.075 | 0.332 | 0.821 |
| | logN | 0.078 | 0.163 | 0.633 | 0.074 | 0.203 | 0.715 |
| Chemotherapy | 1stCode | −0.120 | 0.109 | 0.270 | −0.150 | 0.126 | 0.232 |
| | Pk | −0.140 | 0.176 | 0.428 | −0.181 | 0.209 | 0.387 |
| | ChP | 0.001 | 0.096 | 0.991 | 0.004 | 0.122 | 0.975 |
| | 1stScore | 0.607 | 0.288 | 0.035 | 0.719 | 0.311 | 0.021 |
| | logN | 0.064 | 0.139 | 0.643 | 0.064 | 0.174 | 0.714 |
| CT scan | 1stCode | 0.017 | 0.121 | 0.886 | 0.014 | 0.160 | 0.933 |
| | Pk | 0.068 | 0.143 | 0.634 | 0.110 | 0.179 | 0.538 |
| | ChP | 0.038 | 0.071 | 0.589 | 0.050 | 0.089 | 0.571 |
| | 1stScore | −0.207 | 0.204 | 0.310 | −0.291 | 0.222 | 0.190 |
| | logN | −0.019 | 0.151 | 0.897 | −0.037 | 0.188 | 0.844 |
| Radiotherapy | 1stCode | −0.229 | 0.221 | 0.301 | −0.345 | 0.248 | 0.165 |
| | logN | −0.027 | 0.086 | 0.749 | −0.058 | 0.109 | 0.595 |
| Secondary Malignant Neoplasm | 1stCode | −0.019 | 0.172 | 0.913 | −0.035 | 0.234 | 0.881 |
| | Pk | −0.125 | 0.163 | 0.444 | −0.119 | 0.211 | 0.575 |
| | ChP | −0.065 | 0.072 | 0.366 | −0.063 | 0.092 | 0.490 |
| | 1stScore | −0.207 | 0.182 | 0.257 | −0.224 | 0.219 | 0.307 |
| | logN | 0.302 | 0.128 | 0.018 | 0.343 | 0.134 | 0.011 |
| Palliative or Hospice Care | 1stCode | −0.076 | 0.137 | 0.580 | −0.091 | 0.160 | 0.567 |
| | Pk | −0.845 | 0.248 | < 0.001 | −0.936 | 0.276 | < 0.001 |
| | ChP | −0.631 | 0.185 | < 0.001 | −0.699 | 0.206 | < 0.001 |
| | 1stScore | 0.054 | 0.126 | 0.670 | 0.067 | 0.143 | 0.641 |
| | logN | 0.040 | 0.092 | 0.663 | 0.015 | 0.105 | 0.889 |
| Recurrence | 1stCode | 0.089 | 0.116 | 0.443 | 0.125 | 0.134 | 0.351 |
| | Pk | −0.114 | 0.139 | 0.412 | −0.103 | 0.161 | 0.521 |
| | ChP | −0.055 | 0.085 | 0.519 | −0.046 | 0.099 | 0.642 |
| | 1stScore | −0.229 | 0.176 | 0.193 | −0.280 | 0.197 | 0.154 |
| | logN | 0.199 | 0.122 | 0.104 | 0.266 | 0.124 | 0.033 |
| Medication | 1stCode | – | – | – | 0.201 | 0.188 | 0.284 |
| | logN | – | – | – | 0.061 | 0.155 | 0.693 |
| Biopsy | 1stCode | −0.814 | 0.689 | 0.238 | −1.127 | 0.734 | 0.125 |
| | logN | −0.363 | 0.811 | 0.654 | −0.559 | 0.989 | 0.572 |
Table A9.
Analysis with n = 200. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval"), based on 400 bootstrap samples, for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All time-related features are log-transformed. The results for the proposed MATA estimator are given in the left panel and those of the NPMLE in the right panel.
| Group | Feature | MATA | | | NPMLE | | |
|---|---|---|---|---|---|---|---|
| | | mean | boot.se | pval | mean | boot.se | pval |
| Stage II | | – | – | – | 0.102 | 0.393 | 0.795 |
| Stage III | | – | – | – | 0.161 | 0.549 | 0.769 |
| Age | | – | – | – | 0.014 | 0.019 | 0.465 |
| Lung Cancer | 1stCode | −0.223 | 0.266 | 0.401 | −0.369 | 0.318 | 0.246 |
| | Pk | 0.188 | 0.190 | 0.323 | 0.270 | 0.220 | 0.220 |
| | ChP | 0.112 | 0.148 | 0.451 | 0.160 | 0.180 | 0.375 |
| | 1stScore | −0.102 | 0.390 | 0.793 | −0.080 | 0.553 | 0.885 |
| | logN | 0.072 | 0.244 | 0.767 | 0.065 | 0.325 | 0.843 |
| Chemotherapy | 1stCode | −0.103 | 0.143 | 0.471 | −0.170 | 0.212 | 0.423 |
| | Pk | −0.119 | 0.218 | 0.585 | −0.206 | 0.331 | 0.534 |
| | ChP | 0.006 | 0.160 | 0.972 | 0.027 | 0.243 | 0.913 |
| | 1stScore | 0.530 | 0.409 | 0.195 | 0.764 | 0.516 | 0.139 |
| | logN | 0.056 | 0.184 | 0.759 | 0.064 | 0.282 | 0.822 |
| CT scan | 1stCode | 0.007 | 0.165 | 0.965 | 0.008 | 0.252 | 0.976 |
| | Pk | 0.068 | 0.196 | 0.730 | 0.116 | 0.276 | 0.674 |
| | ChP | 0.037 | 0.109 | 0.730 | 0.055 | 0.143 | 0.700 |
| | 1stScore | −0.188 | 0.292 | 0.520 | −0.321 | 0.366 | 0.380 |
| | logN | −0.016 | 0.229 | 0.944 | −0.047 | 0.314 | 0.881 |
| Radiotherapy | 1stCode | −0.207 | 0.314 | 0.509 | −0.359 | 0.405 | 0.376 |
| | logN | −0.029 | 0.114 | 0.798 | −0.059 | 0.163 | 0.718 |
| Secondary Malignant Neoplasm | 1stCode | −0.036 | 0.273 | 0.896 | −0.056 | 0.415 | 0.893 |
| | Pk | −0.095 | 0.232 | 0.683 | −0.118 | 0.349 | 0.735 |
| | ChP | −0.051 | 0.101 | 0.609 | −0.065 | 0.148 | 0.660 |
| | 1stScore | −0.161 | 0.248 | 0.516 | −0.197 | 0.348 | 0.571 |
| | logN | 0.258 | 0.185 | 0.162 | 0.338 | 0.207 | 0.102 |
| Palliative or Hospice Care | 1stCode | −0.090 | 0.197 | 0.647 | −0.102 | 0.267 | 0.703 |
| | Pk | −0.726 | 0.384 | 0.059 | −0.928 | 0.446 | 0.037 |
| | ChP | −0.542 | 0.287 | 0.059 | −0.692 | 0.334 | 0.038 |
| | 1stScore | 0.020 | 0.179 | 0.912 | 0.034 | 0.268 | 0.899 |
| | logN | 0.041 | 0.124 | 0.740 | 0.024 | 0.173 | 0.890 |
| Recurrence | 1stCode | 0.070 | 0.181 | 0.697 | 0.131 | 0.247 | 0.598 |
| | Pk | −0.094 | 0.183 | 0.608 | −0.105 | 0.240 | 0.661 |
| | ChP | −0.042 | 0.112 | 0.705 | −0.044 | 0.148 | 0.767 |
| | 1stScore | −0.237 | 0.264 | 0.369 | −0.332 | 0.338 | 0.326 |
| | logN | 0.180 | 0.177 | 0.309 | 0.284 | 0.193 | 0.141 |
| Medication | 1stCode | – | – | – | 0.174 | 0.321 | 0.589 |
| | logN | – | – | – | 0.059 | 0.230 | 0.798 |
| Biopsy | 1stCode | −0.741 | 0.890 | 0.405 | −1.223 | 1.155 | 0.289 |
| | logN | −0.467 | 1.551 | 0.763 | −0.876 | 2.311 | 0.705 |
Appendix C. Convergence Rate of Derived Features
Instead of deriving asymptotic properties for the truncated density fCi, i.e., the random density fi truncated on [0, Ci], we focus on the scaled density, which is rescaled to have support [0, 1]. As the censoring time Ci is assumed to have finite support [0, ℰ], the truncated and scaled densities share the same asymptotic properties.
Let $\mu^{\mathrm{scaled}}(t) = E\{f^{\mathrm{scaled}}_i(t)\}$ and $g^{\mathrm{scaled}}(s, t) = \mathrm{cov}\{f^{\mathrm{scaled}}_i(s), f^{\mathrm{scaled}}_i(t)\}$. The Karhunen–Loève theorem (Stark and Woods, 1986) states
$$f^{\mathrm{scaled}}_i(t) = \mu^{\mathrm{scaled}}(t) + \sum_{k=1}^{\infty} \xi_{ik}\, \phi_k(t),$$
where $\phi_k$ are the orthonormal eigenfunctions of $g^{\mathrm{scaled}}$, $\xi_{ik}$ are pair-wise uncorrelated random variables with mean 0 and variance $\lambda_k$, and $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $g^{\mathrm{scaled}}$.
For the i-th patient, conditional on the censoring time, the number of encounters, and the underlying density, the observed event times are assumed to be generated as an i.i.d. sample from the truncated density; equivalently, the scaled observed event times form an i.i.d. sample from the scaled density. Following Wu et al. (2013), we estimate $\mu^{\mathrm{scaled}}$ and $g^{\mathrm{scaled}}$, the mean and covariance functions of the scaled density, by kernel smoothing for t, s ∈ [0, 1]: the mean estimator pools all scaled encounter times across patients, normalized by the total number of encounters, and the covariance estimator pools all within-patient pairs of distinct scaled encounter times, normalized by the total number of such pairs. The univariate and bivariate smoothing kernels are symmetric probability density functions with associated bandwidth parameters.
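To make the pooled kernel estimators concrete, the following is a minimal sketch under simplifying assumptions (Gaussian kernels, one bandwidth per estimator, and encounter times already scaled to [0, 1]); the function and variable names are illustrative rather than the paper's implementation.

```python
import numpy as np

def gauss(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def estimate_mean_cov(scaled_times, grid, bw_mu=0.05, bw_g=0.10):
    """Pooled kernel estimates of the mean and covariance functions of the
    scaled encounter-time density, in the spirit of Wu et al. (2013).

    scaled_times : list of 1-d arrays, per-patient encounter times in [0, 1]
    grid         : 1-d array of evaluation points in [0, 1]
    """
    # Mean: pooled kernel density over all scaled encounter times,
    # normalized by the total number of encounters.
    all_t = np.concatenate([t for t in scaled_times if len(t) > 0])
    mu = gauss((grid[:, None] - all_t[None, :]) / bw_mu).mean(axis=1) / bw_mu

    # Raw second moment: pooled bivariate kernel over within-patient pairs
    # of distinct encounter times, normalized by the number of such pairs.
    raw = np.zeros((grid.size, grid.size))
    n_pairs = 0
    for t in scaled_times:
        if len(t) < 2:
            continue
        k = gauss((grid[:, None] - t[None, :]) / bw_g) / bw_g  # grid x events
        kt = k.sum(axis=1)
        raw += np.outer(kt, kt) - k @ k.T       # sum over pairs with l != l'
        n_pairs += len(t) * (len(t) - 1)
    cov = raw / n_pairs - np.outer(mu, mu)      # center by the estimated mean
    return mu, cov
```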
The estimates of the eigenfunctions and eigenvalues, denoted by $\hat\phi_k$ and $\hat\lambda_k$ respectively, solve the eigen-equations of the estimated covariance subject to orthonormality constraints. One can obtain the estimated eigenfunctions and eigenvalues by numerical spectral decomposition of a properly discretized version of the smooth covariance estimate (Rice and Silverman, 1991; Capra and Müller, 1997). Subsequently, we estimate the functional principal component scores by numerical integration of the centered observations against the estimated eigenfunctions.
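Continuing the previous sketch (reusing `gauss`, `numpy`, and the grid conventions), a minimal version of the spectral decomposition and score estimation might look as follows; the crude per-patient kernel density used in the score integral is a stand-in for the paper's estimator, and all names are illustrative.

```python
def fpca_from_cov(cov, grid, n_comp=3):
    """Eigenpairs via spectral decomposition of the discretized covariance
    surface (Rice and Silverman, 1991): cov(t_j, t_k) * dt approximates the
    integral operator on a uniform grid."""
    dt = grid[1] - grid[0]
    vals, vecs = np.linalg.eigh(cov * dt)
    order = np.argsort(vals)[::-1][:n_comp]      # leading eigenpairs
    lam = vals[order]
    phi = vecs[:, order] / np.sqrt(dt)           # L2([0, 1]) normalization
    return lam, phi

def fpca_scores(scaled_times, mu, phi, grid, bw=0.10):
    """Scores by numerical integration of the centered per-patient density
    estimate against the estimated eigenfunctions."""
    dt = grid[1] - grid[0]
    scores = np.zeros((len(scaled_times), phi.shape[1]))
    for i, t in enumerate(scaled_times):
        if len(t) == 0:
            continue                              # leave scores at zero
        fi = gauss((grid[:, None] - t[None, :]) / bw).mean(axis=1) / bw
        scores[i] = dt * phi.T @ (fi - mu)
    return scores
```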
Let the population counterpart be defined analogously with the true eigenfunctions. We show in Lemma A3 that the difference between the two vanishes at any k, provided the bandwidths and the truncation level satisfy the stated rate conditions.
We then estimate the scaled density by the truncated Karhunen–Loève expansion with the estimated mean function, eigenfunctions, and scores, and rescale it back to [0, Ci] to estimate the truncated density. For the i-th patient and its j-th point process, we only observe one realization of its expected number of encounters on [0, Ci]. Following Wu et al. (2013), we approximate the expected number of encounters by the observed number of encounters, and estimate the intensity function as the estimated truncated density multiplied by the observed count. We further estimate the derived features by applying the corresponding functionals to the estimated intensity.
For notational simplicity, we drop the superscript [j], the index for the j-th counting process, for j = 1, …, q throughout the appendix.
Derivative of the Mean and Covariance Functions
The nonparametric estimators of the mean and covariance functions of the scaled densities are those defined above, with the same symmetric univariate and bivariate kernels and the same bandwidth parameters.
Their derivatives are obtained by replacing the kernels with the corresponding kernel derivatives, for (ν, u) = (0, 1) and (1, 0), where (ν, u) indexes the partial derivatives of an arbitrary bivariate function.
Assume the following regularity conditions hold.

- (A1) The scaled random densities, their mean density $\mu^{\mathrm{scaled}}$, the covariance function $g^{\mathrm{scaled}}$, and the eigenfunctions are thrice continuously differentiable.
- (A2) The scaled densities and their first three derivatives are bounded, where the bounds hold uniformly across the set of random densities.
- (A3) The univariate and bivariate kernels are symmetric density functions satisfying the standard moment conditions in Wu et al. (2013).
- (A4) The Fourier transformations of the univariate and bivariate kernels satisfy the integrability conditions in Wu et al. (2013).
- (A5) The numbers of observations Mi for the j-th trajectory of the i-th subject are i.i.d. random variables that are independent of the densities fi and satisfy the moment conditions in Wu et al. (2013).
- (A6) The bandwidth parameters vanish at the rates required in Wu et al. (2013) as N → ∞.
- (A7) The Mi are i.i.d. positive random variables generated from a truncated-Poisson distribution with rate τN, such that pr(Mi = 0) = 0 and the required moment bounds hold.
- (A8) The corresponding random elements are mutually independent, and τN → ∞ as N → ∞, for j = 1, …, q.
- (A9) The number of eigenfunctions and functional principal components Ki is a random variable such that, for any tolerance, there exists a finite truncation level controlling the tail probability uniformly, for j = 1, …, q.
Lemma A1 Under the regularity conditions A1–A6, the following rates hold for the estimated mean density, the estimated covariance function, and their first derivatives:
| (A.1) |
| (A.2) |
| (A.3) |
| (A.4) |
| (A.5) |
| (A.6) |
Proof The proofs for the mean density and the covariance function can be found in Wu et al. (2013). Here we only give the proof for the derivative of the mean density function. The proof for the derivative of the covariance function is similar.
Under conditions A1 and A2, the bias of the derivative estimator admits a uniform bound; hence the bias term is of the stated order. Applying the inverse Fourier transformation of the kernel, we insert the resulting representation into the estimator and bound the remainder; the right-hand side of the resulting inequality is free of t, so the bound holds uniformly in t. As an intermediate result of the proof of Theorem 1 in Wu et al. (2013), the stochastic term satisfies the corresponding rate, which further leads to the uniform rate for the derivative of the mean density. Thus the claimed rate holds, and the remaining statements follow in the same way.
Derivative of the Eigenfunctions
Lemma A2 Under the regularity conditions A1–A6, the following rates hold for the estimated eigenvalues, eigenfunctions, and eigenfunction derivatives:
| (A.7) |
| (A.8) |
| (A.9) |
Proof The first two equations are a direct result of Theorem 2 in Yao et al. (2005). For (A.9), note that the derivative of the estimated eigenfunction can be expressed through the derivative of the estimated covariance and the eigenfunction itself. Thus, without loss of generality fixing the sign normalization of the eigenfunctions, the stated rate follows from Lemma A1 and the first two results.
Derivative of the Estimated Density Functions
Lemma A3 Under regularity conditions A1–A9, for any δ > 0, there exists an event with probability at least 1 − δ on which the following bounds hold:
| (A.10) |
| (A.11) |
| (A.12) |
Proof The existence of such an event for (A.10)–(A.11) is guaranteed by Theorem 3 in Wu et al. (2013). We follow their definition of the event and prove (A.12). Note that the error of the estimated density derivative decomposes into the truncation error and the estimation errors of the mean, eigenfunction, and score components. By Lemmas A1 and A2, each term vanishes as N → ∞, so the combined error is of the stated order. Furthermore, on the event, the bounds hold uniformly. Therefore (A.12) holds.
Peaks and Change Points
Assume the density is locally unimodal: its derivative has a unique zero, denoted by xi0, in a neighbourhood of xi0. Further assume the second derivative is bounded away from 0 in this neighbourhood, with the bound holding uniformly across subjects. Let the estimated peak be the solution of the estimated derivative equation closest to xi0. Then, by a Taylor expansion,
where the expansion is evaluated at an intermediate value between xi0 and the estimated peak.
Thus, the estimated peak converges to xi0. This further implies that the estimated peak is the only solution of the estimated derivative equation in the neighbourhood. In other words, there is a one-to-one correspondence between the estimated peak and the true peak, and the estimated peak converges to the true peak uniformly.
The derivation for the change point is similar; we only state the order of the absolute difference between the estimated change point and the true change point yi0.
Remark A1 For the peak and the change point, the approximation error decays faster than the required rate when the unlabeled data expand both in follow-up duration and in sample size. In that case, we may choose the smoothing and truncation parameters so that Assumption (C5) is satisfied.
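As an illustration of how the peak and change point can be read off an intensity estimate evaluated on a grid, consider the finite-difference sketch below. It is illustrative only: the paper differentiates the smoothed FPCA expansion analytically rather than numerically, and the function name is hypothetical.

```python
import numpy as np

def peak_and_change_point(intensity, grid):
    """Read the peak (argmax of the intensity) and the change point
    (argmax of the first derivative, i.e., the time of sharpest increase)
    off an intensity estimate evaluated on a grid."""
    deriv = np.gradient(intensity, grid)
    return grid[np.argmax(intensity)], grid[np.argmax(deriv)]
```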
Appendix D. B-spline Approximation and Profile-likelihood Estimation
Some Definitions on Vector and Matrix Norms
For any vector a, denote its Lq norm by ∥a∥q. For positive sequences an and bn, write an ≍ bn if an = c·bn asymptotically, where c is some nonzero constant. Denote the space of qth-order smooth functions by C(q). For any s × s symmetric matrix A, denote its Lq norm by ∥A∥q.
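Since the extracted symbols are incomplete, the standard conventions these definitions correspond to can be restated as follows; the specific choices (in particular, taking the matrix Lq norm as the induced operator norm) are assumptions.

```latex
\|a\|_q = \Bigl(\sum_{j=1}^{s} |a_j|^q\Bigr)^{1/q}, \qquad
a_n \asymp b_n \iff a_n = c\, b_n\,(1 + o(1)),\ c \neq 0, \qquad
\|A\|_q = \sup_{\|x\|_q = 1} \|A x\|_q .
```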
Some Definitions on Scores and Hessian Matrices
Define the score vectors and Hessian matrices associated with the log-likelihood ln, together with their population counterparts. For the penalized objective, define the quantity in (A.13).
Approximation Error from the Estimated Features
We first assess the approximation error from using the estimated features in ln. Once we establish the identifiability of ln in the proof of Lemma 1, the approximation of the losses translates to the approximation of their minimizers.
Lemma A4 Let the loss with the true features from the intensity functions be defined analogously to ln, and let Ω be a sufficiently large compact neighborhood of the true parameter. We have
| (A.14) |
Proof By the mean value theorem, we may express the difference of the two losses as in (A.15), evaluated at intermediate values between the estimated features and the true features Zi. Since the event indicator is binary and the parameter is bounded in the compact set Ω, we have (A.16). For T2, we apply the bounds for the feature estimation errors along with the boundedness of the link function, yielding (A.17). Thus, we obtain (A.14) by applying (A.16) and (A.17) to (A.15).
In the following, we establish the consistency and asymptotic normality of our procedure.
Proof of Lemma 1
Proof By Lemma A4, the loss with estimated features deviates from the loss with true features by at most the rate in (A.14). Under Assumption (C5), this error decays faster than the estimation rate. Thus, if either loss produces an estimator identifying the true parameter at that rate, both losses produce asymptotically equivalent consistent estimators. We focus on the analysis of the loss with true features in the following.
For the smooth baseline function $\alpha_0 \in C^{(q)}$, there exists a spline coefficient vector $\gamma^{*}$ such that
$\sup_{t \in [0, \mathcal{E}]} \lvert B_r(t)^{\top}\gamma^{*} - \alpha_0(t) \rvert = O(h^{q})$, (A.18)
where h is the distance between neighboring knots (de Boor, 2001). In the following, we prove the results for the nonparametric estimator in Theorem 1 at the true β. The results then also hold when β is replaced by a root-n-consistent estimator, since the nonparametric convergence rate in Theorem 1 is slower than root-n. We will show that for any given ϵ > 0, for n sufficiently large, there exists a large constant such that
| (A.19) |
This implies that for n sufficiently large, with probability at least 1 − 6ϵ, there exists a local maximum for (2) in the stated ball, and hence a local maximizer within the claimed rate. Note that the Hessian of the objective decomposes into three terms: the first term is negative-definite, and the last two terms are also negative-definite by the Cauchy–Schwarz inequality; hence the Hessian is negative-definite. Thus, the objective is a concave function, so the local maximizer is the global maximizer of (2), which establishes the claimed convergence.
By Taylor expansion, we have
| (A.20) |
where for some . Moreover,
where
Recall that SC(·) and fC(·) are the survival and density functions of the censoring time, respectively; we then have the corresponding expression. In the following, all integrals are calculated on [0, ℰ], unless otherwise specified.
Thus, the first moment is of the stated order. Further, the second moment is bounded, for some constant, by Condition (C4). By Condition (C3), the variance bound then holds for some constant 0 < C1 < ∞. Then for any ϵ > 0, Chebyshev's inequality gives, equivalently,
| (A.21) |
Moreover, by (A.18), the spline approximation bias is controlled. Denote the corresponding centered sum; its summands are bounded, for a constant, under Condition (C4). Therefore, its variance is bounded for a constant 0 < C2 < ∞. Again by Chebyshev's inequality, we have
| (A.22) |
Combining (A.21) and (A.22), with probability at least 1 − 5ϵ,
| (A.23) |
Moreover, Lemma A5 implies there exists a constant 0 < C3 < ∞ such that the corresponding bound holds for n sufficiently large, with probability approaching 1. Thus, for any ϵ > 0, with probability at least 1 − ϵ,
| (A.24) |
Therefore, by (A.20), (A.23) and (A.24), for n sufficiently large, with probability at least 1 − 6ϵ,
when the constant is chosen sufficiently large. This shows (A.19). Hence, the claimed rate holds under Condition (C3).
It is easily seen that, for a constant and any fixed argument, by Bernstein's inequality under Condition (C3), we have
Also, it is easy to check that
Thus, combining with Lemmas A7–A8, we have
| (A.25) |
where the inequality above uses the fact that, for an arbitrary u, only r elements of Br(u) are non-zero.
Let the relevant normalized quantities be defined accordingly. By the Central Limit Theorem,
where the limiting mean and variance are as stated. With Lemmas A7 and A9, we obtain the corresponding bounds for some constants. So there exist constants such that, with probability approaching 1 and for n large enough,
| (A.26) |
Thus the bound holds uniformly in u ∈ [0, ℰ], and hence
uniformly in u ∈ [0, ℰ] as well.
By Taylor expansion,
| (A.27) |
Thus by (A.25), (A.26), (A.27) and Condition (C3),
Therefore, by Slutsky's theorem, the convergence holds uniformly. Consequently, the corresponding expansion holds uniformly as well. By Slutsky's theorem and Condition (C3), we have
Proof of Lemma 2
Proof Because the Hessian is negative definite, a similar but simpler derivation than that for Theorem 1 can be used to show the consistency of the maximizer.
Because the score vanishes at the maximizer, we have
so
| (A.28) |
where r1 is a residual term of smaller order componentwise. Note that the Hessian estimate converges uniformly elementwise. Hence,
Subsequently, we have
and
Here we use two facts: the former is a direct corollary of Lemma A5 and the latter is shown in Lemma A8. Therefore, the two stated rates follow.
By Taylor expansion,
| (A.29) |
where r2 is a residual term of smaller order componentwise. We claim that the residual term r2 satisfies the two stated bounds. This is because
and
which leads to the claimed order of the residual r2 in (A.29).
We further use a Taylor expansion to write
where the residual term r in the second-to-last equality satisfies the analogous bounds.
Plugging this and (A.28) into (A.29), and recalling that
we get
It is straightforward to check that
and we already have the corresponding bound. Thus,
By the Central Limit Theorem, asymptotic normality follows, where Σ is given in Theorem 2.
Proof of Theorem 1
Proof We prove the theorem in two steps. First, we derive the asymptotic distribution of the solution by restricting θ to the oracle group selection set 𝒮. Then we validate that this solution satisfies the optimality conditions of the original problem (6). Without loss of generality, we rearrange the order of the covariates by moving the nonzero groups to the front, which simplifies the notation. We denote the Hessian and its limit
and the sub-matrix notation: subscripts select rows, columns, or both rows and columns in 𝒮. We denote the variance of the score accordingly.
Define the oracle selection subspace and the estimator under oracle selection
| (A.30) |
Since 𝒮 contains only groups with nonzero coefficients, and the initial estimator is consistent by Lemma 2, the denominators in the penalty terms in (A.30) are bounded away from zero. Then, with an appropriate choice of the tuning parameter, we have the solution as
Using the identity
we may derive the estimation error of as
| (A.31) |
In the proof of Lemma 2, we have established the asymptotic normality of the score and the consistency of the Hessian:
| (A.32) |
Applying (A.32) to (A.31), we obtain
Profiling out γ components as in Lemma 2, we have
| (A.33) |
The optimality condition for original problem (6) is
| (A.34) |
The oracle selection estimator must satisfy the conditions in (A.34) for positions in R𝒮 by the same set of optimality conditions for (A.30). We only need to verify that it also satisfies the conditions in (A.34) for positions in 𝒮c. By Lemma 2 and the definition of 𝒮, we have
| (A.35) |
For groups in 𝒮c, the penalty factor for a zero group g is
| (A.36) |
By definition, the 𝒮c components of the oracle estimator are all zero,
| (A.37) |
Combining (A.35)–(A.37), we establish that the optimality conditions in (A.34) hold asymptotically for g : β0,[g] = 0, i.e., all elements in 𝒮c. Therefore, we conclude that the oracle estimator solves the original problem with large probability. The asymptotic distribution in (A.33) is thus the asymptotic distribution of the proposed estimator.
Proof of Corollary 1
Proof Using the delta method, it is seen that the transformed quantity admits a linear expansion. Applying the asymptotic normality established above along with Assumption (C5), we conclude that the resulting estimator is asymptotically normal.
Appendix E. Matrix Norms
Lemma A5 There exist constants 0 < c < C < ∞ such that, for n sufficiently large, with probability approaching 1,
where the vector is arbitrary with unit norm. Furthermore, for an arbitrary a ∈ ℝPn,
Proof We only prove the first result; the second can be obtained similarly. We have
| (A.38) |
for positive constants by conditions (C1) and (C4).
Following a similar proof, we can further obtain
| (A.39) |
for some constant , because is an r-banded matrix with diagonal and jth off-diagonal elements of order O(h) uniformly elementwise, for j = 1, …, r − 1, and 0 elsewhere.
Combining (A.38) and (A.39), we have
Next, we investigate the order of . We have
with probability 1 as n → ∞, where the leading factors are constants. Here, for an arbitrary matrix A, we use Ajk to denote its element in the jth row and the kth column. In the above inequalities, we use the fact that B-spline basis functions are all non-negative and each is non-zero on no more than r consecutive intervals formed by the knots.
On the other hand,
Hence, .
Therefore, Lemma A5 holds for c = min(c1, c2) and C = max(C1, C2).
Lemma A6
Proof Similarly to the previous derivations,
| (A.40) |
where the second term in the last equality is obtained using the Central Limit Theorem together with the fact that the matrices above are banded to the first order. Specifically, the matrix has diagonal and jth off-diagonal elements of the stated order for j = 1, …, r − 1, and all other elements of smaller order. Further,
again because the matrices are banded to the first order. In fact, for an arbitrary vector a ∈ ℝPn,
where are constants.
Lemma A7 There exist constants 0 < c, C < ∞ such that, for n sufficiently large, with probability approaching 1,
Proof We have , where
is a banded matrix with each nonzero element of order O(1) uniformly and
is a matrix with all elements of order O(1) uniformly. It is easily seen that the first matrix is positive definite, and the second is positive semi-definite.
According to Demko (1977) and Theorem 4.3 in Chapter 13 of DeVore and Lorentz (1993), we have
for some constant. Furthermore, there exist constants and 0 < λ < 1 such that the elementwise bound holds for j, k = 1, …, Pn.
We want to show that
is bounded. As a result,
Denote the inverse matrix. There exists a constant 0 < κ < ∞ such that the elementwise bound holds for j, k = 1, …, Pn. Hence,
where .
A similar derivation as before shows that there exists some constant such that the bound holds for arbitrary indices. Hence,
where .
In the following, we will use induction to show that
where with Jij = 1 if j − i = 1 and Jij = 0 otherwise. Here .
When Pn = 2,
Similarly, we have
Assume the result holds for 2, …, Pn − 1. Then for Pn, with the corresponding notation, we have
and
where and .
Therefore,
where the numerator converges to 2κ4exp(κ2κ4) as h → 0, or equivalently, Pn → ∞. Here, in the first equation above, we use the fact that the (j, k)th element of the inverse matrix is the determinant of the matrix without its jth column and kth row, divided by the determinant of the matrix itself. Specifically, when j = k, the absolute value of that (j, k)th element follows directly; when j ≠ k, with certain column operations, we obtain the stated bound.
Now it remains to show that there exists , such that
for Pn sufficiently large. This can be seen from
Thus the result holds.
Therefore, we have for some constant .
On the other hand, we have
for some constant 0 < c < ∞ (Horn and Johnson, 1990; Golub and Van Loan, 1996).
The proof for the second bound is similar and hence omitted.
Lemma A8
Proof According to Lemmas A6–A7, we have
Lemma A9 There exist constants 0 < c < C < ∞ such that, for n sufficiently large, with probability approaching 1, for arbitrary vectors,
Proof We have
for some constant 0 < C′ < ∞. A similar derivation leads to
for constants 0 < c < C < ∞.
Appendix F. Algorithm for optimization
Here we provide the detailed algorithm for optimization of ln.
1. Obtain the initial estimator of β from the U-statistic estimating equation (Cheng et al., 1995), in which the Kaplan–Meier or empirical distribution estimator is used for the censoring time distribution. We solve the equation by the classical Newton's method. Then, we calculate the initial estimator of the baseline function by solving the corresponding estimating equation.
2. Update β and the baseline function from their initial estimators with the alternative B-spline approximation. Setting the initial values, we perform the iterative algorithm:
- Calculate the required integrals and update α by the Breslow-type estimator.
- Update β by the pseudo logistic regression (see the sketch after this algorithm). If Δi = 0, observation i contributes one entry to the pseudo data, with its offset; if Δi = 1, observation i contributes two entries to the pseudo data, both with offsets. The fitted coefficient gives the updated β.
During this step, we compute the integrals once at initiation and reuse them throughout. The parameters at convergence are taken as the step-2 estimates.
3. Obtain the final MLE estimators. We use the step-2 estimate as the initial value for β and calculate the initial value for γ from a linear regression. We perform the iterative algorithm:
- Update γ by the one-step Newton's method.
- Update β by the pseudo logistic regression as in Step 2.
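To illustrate the pseudo-logistic-regression update, the sketch below fits a logistic regression with fixed offsets using statsmodels. It assumes the pseudo data (one row per censored subject, two rows per event, with offsets carrying the current baseline values) have already been constructed as described above; the helper name is illustrative, not the paper's code.

```python
import statsmodels.api as sm

def update_beta(pseudo_y, pseudo_Z, pseudo_offset):
    """One beta-update: logistic regression of the binary pseudo-responses
    on the features, with the current baseline values held fixed in the
    offset term."""
    fit = sm.GLM(pseudo_y, pseudo_Z,
                 family=sm.families.Binomial(),
                 offset=pseudo_offset).fit()
    return fit.params
```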
Contributor Information
Liang Liang, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Jue Hou, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Hajime Uno, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.
Kelly Cho, Massachusetts Veterans Epidemiology Research and Information Center, US Department of Veteran Affairs, Boston, MA, USA; Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA.
Yanyuan Ma, Department of Statistics, Penn State University, University Park, PA, USA.
Tianxi Cai, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
References
- Ahuja Y, Hong C, Xia Z, Cai T (2021) SAMGEP: A novel method for prediction of phenotype event times using the electronic health record. medRxiv, DOI 10.1101/2021.03.07.21253096
- de Boor C (2001) A Practical Guide to Splines. Springer, New York
- Capra WB, Müller HG (1997) An accelerated-time model for response curves. Journal of the American Statistical Association 92:72–83
- Cheng S, Wei L, Ying Z (1995) Analysis of transformation models with censored data. Biometrika 82:835–845
- Cheng S, Wei L, Ying Z (1997) Predicting survival probabilities with semi-parametric transformation models. Journal of the American Statistical Association 92:227–235
- Chubak J, Onega T, Zhu W, Buist DS, Hubbard RA (2015) An electronic health record-based algorithm to ascertain the date of second breast cancer events. Medical Care
- Dean C, Balshaw R (1997) Efficiency lost by analyzing counts rather than event times in Poisson and overdispersed Poisson regression models. Journal of the American Statistical Association 92:1387–1398
- Demko S (1977) Inverses of band matrices and local convergence of spline projections. SIAM Journal on Numerical Analysis 14:616–619
- DeVore RA, Lorentz GG (1993) Constructive Approximation, vol 303. Springer Science & Business Media
- Efron B (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics 7(1):1–26, DOI 10.1214/aos/1176344552
- Golub GH, Van Loan CF (1996) Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD
- Hassett MJ, Uno H, Cronin AM, Carroll NM, Hornbrook MC, Ritzwoller D (2015) Detecting lung and colorectal cancer recurrence using structured clinical/administrative data to enable outcomes research and population health management. Medical Care
- Horn RA, Johnson CR (1990) Matrix Analysis. Cambridge University Press
- Jin Z, Ying Z, Wei LJ (2001) A simple resampling method by perturbing the minimand. Biometrika 88(2):381–390
- Klein JP, Moeschberger ML (2006) Survival Analysis: Techniques for Censored and Truncated Data. Springer Science & Business Media
- Lawless JF (1987) Regression methods for Poisson process data. Journal of the American Statistical Association 82:808–815
- Nielsen J, Dean C (2005) Regression splines in the quasi-likelihood analysis of recurrent event data. Journal of Statistical Planning and Inference 134:521–535
- Rice JA, Silverman BW (1991) Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological) 53:233–243
- Royston P, Parmar MK (2002) Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21:2175–2197
- Shen X (1998) Proportional odds regression and sieve maximum likelihood estimation. Biometrika 85:165–177
- Stark H, Woods JW (1986) Probability, Random Processes, and Estimation Theory for Engineers. Prentice-Hall
- Uno H, Cai T, Pencina MJ, D'Agostino RB, Wei L (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 30:1105–1117
- Uno H, Ritzwoller DP, Cronin AM, Carroll NM, Hornbrook MC, Hassett MJ (2018) Determining the time of cancer recurrence using claims or electronic medical record data. JCO Clinical Cancer Informatics 2:1–10
- Wang H, Leng C (2007) Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102(479):1039–1048
- Wang H, Leng C (2008) A note on adaptive group lasso. Computational Statistics & Data Analysis 52(12):5277–5286
- Wu S, Müller HG, Zhang Z (2013) Functional data analysis for point processes with rare events. Statistica Sinica, pp 1–23
- Yao F, Müller HG, Wang JL (2005) Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100:577–590
- Younes N, Lachin J (1997) Link-based models for survival data with interval and continuous time censoring. Biometrics, pp 1199–1211
- Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, et al. (2015) Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association 22(5):993–1000, DOI 10.1093/jamia/ocv034
- Yu S, Chakrabortty A, Liao KP, Cai T, Ananthakrishnan AN, Gainer VS, et al. (2016) Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association 24(e1):e143–e149, DOI 10.1093/jamia/ocw135
- Zeng D, Lin D, Yin G (2005) Maximum likelihood estimation for the proportional odds model with random effects. Journal of the American Statistical Association 100:470–483
- Zhang Y, Hua L, Huang J (2010) A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scandinavian Journal of Statistics 37:338–354
- Zhang Y, Cai T, Yu S, Cho K, Hong C, Sun J, et al. (2019) High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nature Protocols 14(12):3426–3444, DOI 10.1038/s41596-019-0227-6
