Abstract
Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes, such as time to cancer progression, is not readily available in these databases. The true clinical event times typically cannot be approximated well from simple extracts of billing or procedure codes, while annotating event times manually is time and resource prohibitive. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In Step I, we employ a functional principal component analysis (FPCA) approach to estimate the underlying intensity functions based on the observed point processes of the unlabeled patients. In Step II, we fit a penalized proportional odds model to the event time outcomes in the labeled data using features derived in Step I, where the non-parametric baseline function is approximated by B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach over existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.
Keywords: censoring, electronic health records, functional principal component analysis, point process, proportional odds model, semi-supervised learning
1. Introduction
While clinical trials and traditional cohort studies remain critical sources of data for clinical research, they have limitations, including limited generalizability of study findings and a limited ability to test broader hypotheses. In recent years, real world clinical data derived from disease registries, insurance claims, and electronic health record (EHR) systems have been increasingly used for precision medicine research. These real world data (RWD) open opportunities for developing accurate personalized risk prediction models, which can be easily incorporated into clinical practice and ultimately realize the promise of precision medicine. Efficiently deriving prediction models for the risk of developing future clinical events using RWD, however, faces practical and methodological challenges. Precise event time information, such as time to cancer recurrence, is not readily available in RWD such as EHR and claims data. Simple proxies to the event time based on the encounter time of the first diagnosis or procedure codes may poorly approximate the true event time (Uno et al., 2018). On the other hand, annotating event times manually via chart review is time and resource prohibitive.
Learning the onset status of clinical events from a large-scale medical encounter data set that lacks precise onset status, together with a small training set carrying gold standard labels on the true onset status, has been thoroughly investigated in the past decade. The solutions can be classified by their training sample into supervised approaches that use only the labeled data, unsupervised approaches that use no labels, and semi-supervised approaches that combine information from labeled and unlabeled data. Semi-supervised methods (Yu et al., 2015, 2016; Zhang et al., 2019) usually use the unlabeled data for feature screening and selection before the final training on the labeled data. Growing efforts have been made in recent years to predict the onset time of clinical events under a similar setting with partially labeled event times. The existing literature on phenotyping of event times consists mostly of supervised approaches that only use labeled data for training. Several supervised algorithms exist for predicting cancer recurrence time by extracting features from the encounter patterns of relevant codes. For example, Chubak et al. (2015) proposed a rule based algorithm that classifies the recurrence status, R ∈ {+, −}, with a decision tree, and assigns the recurrence time for those with predicted R = + as the earliest encounter time of one or more specific codes. Hassett et al. (2015) proposed two-step algorithms in which a logistic regression is used to classify R in step I, and the recurrence time for those with R = + is then estimated as a weighted average of the times at which the counts of several prespecified codes peak. Instead of the peak time, Uno et al. (2018) focused on the time at which an empirically estimated encounter intensity function has the sharpest change, referred to as the change point throughout this paper. The recurrence time is approximated as a weighted average of the change point times associated with a few selected codes. Despite their reasonable empirical performance, these ad hoc procedures have several major limitations. First, only a very small number of codes are selected according to domain knowledge. Second, intensity functions estimated based on finite differences may yield substantial variability in the resulting peak or change point times due to the sparsity of encounter data. One exception is a recent semi-supervised approach by Ahuja et al. (2021), which first aggregates the predicting features and then predicts the event time from the longitudinal trajectories of the aggregated features using a hidden Markov model. Our approach differs from Ahuja et al. (2021) in the order in which feature aggregation and temporal trajectories are addressed: we first extract characteristics from the longitudinal trajectories of the predicting features, and then aggregate the extracted characteristics to predict the event time by fitting a survival model.
In this paper, we frame the question of annotating event times with longitudinal encounter records as a statistical problem of predicting an event time T using baseline covariates U as well as features derived from a p-variate point process. Specifically, with a small labeled set containing observations on the event time and a large unlabeled set containing observations on U and the point process only, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) procedure. In Step I, we use the large unlabeled data to estimate the underlying subject specific intensity functions of the point processes and to derive summaries of the intensity functions as features for predicting T. In Step II, we predict T from these features by fitting a penalized proportional odds (PO) model, approximating the non-parametric baseline function via B-splines. MATA is semi-supervised in that unlabeled data and labeled data are used in Steps I and II, respectively. Estimating individualized intensity functions is a challenging task in the current setting because the encounter data are often sparse and the shapes of the intensity functions can vary greatly across subjects. As such, traditional multiplicative intensity models (Lawless, 1987; Dean and Balshaw, 1997; Nielsen and Dean, 2005) fail to provide accurate approximations. To overcome these difficulties, we employ a non-parametric FPCA method by Wu et al. (2013) to estimate the subject specific intensity functions using the large unlabeled set. We demonstrate that when the size of the unlabeled set is sufficiently large relative to the size of the labeled set, the approximation error of the estimated features is negligible compared to the estimation error from fitting the spline model on the labeled set.
Even though the idea of employing a spline-based approach is straightforward and intuitive, our method differs from classical B-spline work in that B-splines are used in the outcome model, i.e., on the failure time, rather than in the preprocessing of the predictors. Special care is required to accommodate this fact. We establish novel consistency results and asymptotic convergence rates for the proposed estimator, for both the parametric and nonparametric parts. There is some existing literature adopting a spline-based approach in a similar context to ours, including Shen (1998); Royston and Parmar (2002); Zhang et al. (2010); Younes and Lachin (1997). However, Royston and Parmar (2002) and Younes and Lachin (1997) did not address the asymptotic properties of their estimators; Shen (1998) and Zhang et al. (2010) employed a sieve maximum likelihood approach, which includes splines as a special case, but only provided theoretical justification for the asymptotics of the parametric part.
One great advantage of the proposed MATA approach is that classical variable selection algorithms such as the LASSO can be easily incorporated. In comparison, Chubak et al. (2015); Hassett et al. (2015); Uno et al. (2018) exhaust all possible combinations of selected encounters and select the optimal one under certain criteria, which incurs substantial computational cost. No variable selection method has been developed for classical estimating equation based estimators, e.g., Cheng et al. (1995, 1997). Moreover, compared to the non-parametric maximum likelihood estimator (NPMLE), e.g., Zeng et al. (2005), which approximates the non-parametric function by a right-continuous step function with jumps only at observed failure times, our approach is computationally more efficient and stable.
The rest of the paper is organized as follows. In Section 2, we introduce the proposed MATA approach and the prediction accuracy evaluation measures. The asymptotic properties of the proposed estimators are discussed in Section 3. In Section 4, we conduct various simulation studies to explore the performance of our approach with small labeled samples. In Section 5, we apply our approach to a lung cancer data set. Section 6 contains a short discussion. Technical details and proofs are provided in the Supplementary Material.
2. Semi-supervised MATA
Let T denote the continuous event time of interest, which is observable up to (X, Δ) in the labeled set, where X = T ∧ C, Δ = I(T ≤ C), and C is the follow-up time. Let N denote the q-variate point process and U denote the baseline covariates observable in both the labeled and unlabeled sets, where the jth component of N is a point process associated with the jth clinical code, counting its occurrence times over any set in the Borel σ-algebra of the positive half of the real line, and I(·) is the indicator function. If T denotes the true event time of heart failure, examples of the component processes include the longitudinal encounter process of the diagnostic code for heart failure and NLP mentions of heart failure in clinical notes. Each component has a local intensity function, which we assume to be integrable, for j = 1, ⋯, q. The corresponding random density trajectory is the intensity normalized by its integral; equivalently, the intensity function is the product of the density trajectory and the expected number of lifetime encounters.
The encounter times of the point processes are also only observable up to the end of follow up C and we let denote the total number of occurrences for up to C and denote the history of up to t along with the baseline covariate vector U. Thus, the observed data consist of
where i indexes the subject and we assume that .
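For concreteness, this notation can be written as follows, where the symbols (N^[j] for the counting process of the jth code, t_k^[j] for its encounter times, λ^[j] for its intensity, and f^[j] for its density trajectory) are illustrative names consistent with the verbal definitions above rather than verbatim displays:

```latex
% Illustrative notation (assumed names, consistent with the text):
\[
N^{[j]}(B) = \sum_{k \ge 1} I\bigl(t^{[j]}_k \in B\bigr)
\quad \text{for Borel sets } B \subset (0,\infty), \qquad j = 1, \dots, q,
\]
\[
f^{[j]}(t) = \frac{\lambda^{[j]}(t)}{\int_0^\infty \lambda^{[j]}(s)\,ds},
\qquad \text{equivalently} \qquad
\lambda^{[j]}(t) = f^{[j]}(t)\int_0^\infty \lambda^{[j]}(s)\,ds .
\]
```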
2.1. Models
Our proposed MATA procedure involves two models, one for the point processes and another for the survival function of T. We connect the two models by including features of the underlying intensity functions of the point processes as part of the covariates for the survival time.
Point Process Model
The intensity function for each point process is treated as a realization of a non-negative valued stochastic intensity process. Conditional on this process, the observed medical encounters are assumed to form a non-homogeneous Poisson process with the given local intensity function. Define the truncated random density
and its scaled version
As we only observe the point process up to C, our goal is to estimate the truncated density function, or equivalently the scaled density function, rather than the full density function. Note that the scaling is done to meet the uniform endpoint requirement of the FPCA approach by Wu et al. (2013).
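For illustration, one consistent way to write the two definitions, assuming the scaled density is rescaled to the unit interval to meet the uniform endpoint requirement, is:

```latex
% Truncated density on [0, C] and its scaled version on [0, 1]
% (a sketch under the stated assumption, not the verbatim display):
\[
f^{[j]}_C(t) = \frac{\lambda^{[j]}(t)}{\int_0^C \lambda^{[j]}(s)\,ds},
\quad t \in [0, C],
\qquad\qquad
\tilde f^{[j]}(u) = C\, f^{[j]}_C(uC), \quad u \in [0, 1].
\]
```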
Event Time Model
We next relate features derived from the intensity functions to the distribution of T. Define W as a vector of known functionals of the intensity functions. For example, if the local intensity function follows the exponential distribution with rate θ−1, then we may set the functional accordingly. Other potential features include peaks or change points of the intensity functions. Here the peak is defined as the time at which the intensity (or density) curve reaches its maximum, while the change point is defined as the time of the largest increase in the intensity (or density) curve. Due to censoring, we instead observe truncated versions of these features; for features like the peak and change point, the truncated and untruncated versions are identical whenever they are reached before the censoring time C. We assume that T follows a PO model (Klein and Moeschberger, 2006)
| (1) |
where is the unknown effect vector of the derived features Z, and m(t) is an unknown smooth function of t. This formulation ensures that is positive and increasing for .
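For illustration, one PO model form consistent with this description, in which a positive and increasing baseline G is built from the unconstrained function m, is the following sketch (the exact parameterization of model (1) may differ):

```latex
% A PO model with a monotone baseline built from m(t); G(t) is
% positive and increasing for t > 0, matching the description above.
\[
P(T \le t \mid Z) = \frac{G(t)\, e^{\beta^\top Z}}{1 + G(t)\, e^{\beta^\top Z}},
\qquad
G(t) = \int_0^t e^{m(s)}\, ds .
\]
```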
2.2. Estimation
To derive a prediction rule for T based on the proposed MATA procedure, we first estimate the truncated density functions from the longitudinal encounter data using the FPCA method proposed by Wu et al. (2013), obtaining estimates of the derived features W from the unlabeled data. We then estimate α(t) and β via penalized estimation with regression splines, using the resulting synthetic observations in the labeled set.
2.2.1. Step I: Estimation of f[j]
We estimate the mean and variance of the scaled density function according to the FPCA approach by Wu et al. (2013). Using the estimators and , we predict the scaled density function by with truncation at zero to ensure nonnegativity of the density function. The index K in the subscript is the number of basis functions selected according to the proportion of variation explained. We obtain the truncated density function by
For the i-th patient and the j-th point process, we observe only one realization of its expected number of encounters on [0, Ci]. We approximate the expected number of encounters with the observed encounters and estimate the density accordingly. We further estimate the derived features by plugging in the estimated densities. Detailed forms of these estimators are given in Appendix C. We also establish the rate of convergence for the estimated loadings of the functional PCA, which can subsequently be used as potential derived features.
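As a simple illustration of the feature extraction in this step, the following Python sketch computes the peak and change point times from an intensity curve evaluated on a grid; it is a simplified stand-in for the estimators detailed in Appendix C, and all names are hypothetical.

```python
import numpy as np

def peak_and_change_point(grid, intensity):
    """Return the peak time (argmax of the curve) and the change point
    time (time of the largest increase) of an intensity curve evaluated
    on an equally spaced grid. A simplified stand-in for the estimators
    in Appendix C, which operate on the FPCA-smoothed density."""
    peak_time = grid[np.argmax(intensity)]
    change_time = grid[np.argmax(np.diff(intensity)) + 1]
    return peak_time, change_time

# Toy usage: a Gaussian-shaped intensity peaking at t = 8
grid = np.linspace(0, 20, 401)
intensity = 5 * np.exp(-0.5 * ((grid - 8) / 2) ** 2)
print(peak_and_change_point(grid, intensity))
# peak near t = 8; the largest increase occurs shortly before the peak
```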
Incorporating the large unlabeled data into the semi-supervised learning facilitates the extraction of characteristics from noisy and complex longitudinal trajectories. Using unlabeled data for feature preprocessing has a longstanding tradition in semi-supervised phenotyping (Yu et al., 2015, 2016; Zhang et al., 2019). While the existing literature focuses on the selection of simple features, we consider the de-noising of complex features. Moreover, the large unlabeled data in Step I greatly reduce the uncertainty from that step, to the point that it becomes negligible in Step II.
2.2.2. Step II: PO Model Estimation with B-spline Approximation to m(·)
To estimate m(t) and β in the PO model (1), we approximate m(t) via B-splines of order r (degree r − 1), as follows. Divide the support of the censoring time C, denoted [0, ℰ], into (Rn + 1) subintervals using a sequence of interior knots. Let the basis functions be Br(t) = {Br,1(t), …, Br,Pn(t)}⊤, where the number of B-spline basis functions is Pn = Rn + r. Then m(t) can be approximated by
where is the vector of coefficients for the spline basis functions .
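For concreteness, the following minimal Python sketch constructs such a basis with interior knots placed at the deciles of the observed times, the choice used in Section 4; the helper `bspline_basis` is illustrative, not part of our software.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, interior_knots, lower, upper, order=4):
    """Evaluate all B-spline basis functions of the given order
    (order = degree + 1; order=4 gives cubic splines) at the points x.
    Returns an array of shape (len(x), n_basis) with
    n_basis = len(interior_knots) + order, i.e. P_n = R_n + r."""
    degree = order - 1
    # Full knot sequence: boundary knots repeated `order` times.
    knots = np.concatenate([np.repeat(lower, order),
                            np.asarray(interior_knots, dtype=float),
                            np.repeat(upper, order)])
    n_basis = len(knots) - order
    basis = np.empty((len(x), n_basis))
    for p in range(n_basis):
        coef = np.zeros(n_basis)
        coef[p] = 1.0                 # isolate the p-th basis function
        basis[:, p] = BSpline(knots, coef, degree, extrapolate=False)(x)
    return np.nan_to_num(basis)       # zero outside [lower, upper]

# Interior knots at the deciles of the observed times X, as in Section 4
X = np.random.default_rng(0).uniform(0, 20, size=400)
interior = np.percentile(X, np.arange(10, 100, 10))
B = bspline_basis(np.linspace(0, 20, 5), interior, lower=0.0, upper=20.0)
print(B.shape)  # (5, 13): 9 interior knots + order 4
```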
With the B-spline formulation and features estimated as , we can estimate m(·) and by maximizing an estimated likelihood. Specifically, let
| (2) |
where ,
Then we may estimate by maximizing the approximated profile likelihood
| (3) |
where and MLE stands for maximum likelihood estimator. Subsequently, we may estimate m(t) as
| (4) |
The log-likelihood ln is concave with negative definite Hessian
where
Standard convex optimization methods can be used to solve for the estimators. The integrals therein can be evaluated analytically through the calculation of indefinite integrals of the following forms:
which is fairly straightforward for piecewise constant (k = 0) or piecewise linear (k = 1) B-splines. The evaluation of the integrals is needed in every iteration, so we reduce the computational cost with a three-stage algorithm for the optimization problem:
- Stage 1: Compute an initial estimator for β using the U-statistic approach of Cheng et al. (1995) and an inverse-probability-of-censoring-weighted (IPCW) initial estimator for the baseline.
- Stage 2: Update β and the spline coefficients under an alternative B-spline approximation. Under this approximation, the integral only needs to be evaluated once at initiation; β and the spline coefficients are updated iteratively using a pseudo logistic regression trick and Newton's method.
- Stage 3: Derive the initial estimator for the original approximation from the Stage 2 estimates, and solve for the final β and spline coefficients, again updating iteratively with the pseudo logistic regression trick and Newton's method.
We provide the details of the algorithm in Appendix F. The program is available at https://github.com/celehs/SPT.grpadalasso.
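To make the optimization target concrete, the following Python sketch evaluates and maximizes a PO log-likelihood of the form sketched in Section 2.1, taking m piecewise constant (k = 0) so that the integral G(t) = ∫ e^{m(s)} ds has a closed form; this is an illustration under those assumptions with toy data, not the three-stage algorithm of Appendix F.

```python
import numpy as np
from scipy.optimize import minimize

def G_piecewise_const(t, gamma, knots):
    """G(t) = int_0^t exp(m(s)) ds for piecewise-constant m with pieces
    defined by knots (0 = knots[0] < ... < knots[-1] = E)."""
    lo, hi = knots[:-1], knots[1:]
    overlap = np.clip(np.atleast_1d(t)[:, None], lo, hi) - lo
    return overlap @ np.exp(gamma)

def neg_loglik(params, X, Delta, Z, knots):
    """Negative log-likelihood under the sketched PO form
    F(t|Z) = G(t)e^{b'Z} / (1 + G(t)e^{b'Z}): events contribute the
    density, censored subjects the survival function."""
    q = Z.shape[1]
    beta, gamma = params[:q], params[q:]
    eta = Z @ beta
    G = G_piecewise_const(X, gamma, knots)
    piece = np.clip(np.searchsorted(knots, X, side="right") - 1,
                    0, len(gamma) - 1)      # piece containing each X
    log_f = gamma[piece] + eta - 2 * np.log1p(G * np.exp(eta))
    log_S = -np.log1p(G * np.exp(eta))
    return -np.sum(np.where(Delta == 1, log_f, log_S))

# Toy usage with hypothetical data (X, Delta, Z) and equally spaced knots
rng = np.random.default_rng(1)
n, q = 200, 3
Z = rng.normal(size=(n, q))
X = rng.uniform(0.1, 20, size=n)
Delta = rng.integers(0, 2, size=n)
knots = np.linspace(0, 20, 11)
res = minimize(neg_loglik, x0=np.zeros(q + len(knots) - 1),
               args=(X, Delta, Z, knots), method="BFGS")
print(res.x[:q])  # estimated beta under the toy data
```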
2.2.3. Feature Selection
When the dimension of Z is not small, the MLE from (3) and (4),
| (5) |
may suffer from high variability. On the other hand, it is highly likely that only a small number of codes are truly predictive of the event time. To overcome this challenge, one may employ a standard penalized regression approach by imposing a penalty on the model complexity. To efficiently carry out penalized estimation under the B-spline PO model, we borrow the least squares approximation (LSA) strategy proposed by Wang and Leng (2007) and update the initial MLE (5) via
| (6) |
where the subvector of β indexed by [g] corresponds to the features in group g, and the penalty involves its L2 norm. All features associated with a specific code can be joined to create a group. The adaptive group lasso (glasso) penalty (Wang and Leng, 2008) employed in (6) enables the removal of all features related to a code. The tuning parameter λ can be chosen by standard methods including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or cross-validation.
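As an illustration of how a criterion like (6) can be solved, the following Python sketch minimizes an LSA-type quadratic plus an adaptive group lasso penalty by proximal gradient descent with group-wise soft-thresholding; the quadratic weight matrix V and the adaptive weights 1/||β̂[g]|| are standard LSA ingredients assumed here for illustration.

```python
import numpy as np

def lsa_adaptive_group_lasso(beta_init, V, groups, lam,
                             n_iter=500, tol=1e-8):
    """Minimize (b - beta_init)' V (b - beta_init)
               + lam * sum_g ||b_[g]||_2 / ||beta_init_[g]||_2
    by proximal gradient descent (group soft-thresholding).
    Assumes every group of beta_init is nonzero, so the adaptive
    weights are finite. A sketch, not the paper's implementation."""
    step = 0.5 / np.linalg.eigvalsh(V).max()   # 1/L for the quadratic
    beta = beta_init.copy()
    for _ in range(n_iter):
        z = beta - step * 2 * V @ (beta - beta_init)  # gradient step
        new = np.zeros_like(beta)
        for g in groups:                  # g: index array of one group
            w = 1.0 / np.linalg.norm(beta_init[g])
            nz = np.linalg.norm(z[g])
            shrink = max(0.0, 1.0 - step * lam * w / nz) if nz > 0 else 0.0
            new[g] = shrink * z[g]
        if np.linalg.norm(new - beta) < tol:
            return new
        beta = new
    return beta

# Toy usage: two groups of features, the second truly null
beta_hat = np.array([1.2, -0.8, 0.05, -0.03])
V = 50 * np.eye(4)                        # ~ n x information matrix
groups = [np.array([0, 1]), np.array([2, 3])]
print(lsa_adaptive_group_lasso(beta_hat, V, groups, lam=2.0))
# the second (weak) group is shrunk exactly to zero
```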
With we may obtain a glasso estimator for m(t) as
For any patient with derived feature , his/her probability of having an event by t can be predicted as
2.3. Evaluation of Prediction Performance
Based on , one may derive subject specific prediction rules for the event status and/or time. For example, one may predict the binary event status using and using . One may also predict based on for some u chosen to satisfy a desired sensitivity or specificity level of classifying Δ based on , where .
To summarize the overall performance of in predicting (X, Δ), we may consider the Kendall’s-τ type rank correlation summary measures, e.g.,
To account for calibration, we propose to examine the absolute prediction error (APE) measure via
APEu is an important summary measure of the quality of annotating (X, Δ), since most survival estimation procedures rely on the at-risk function and the counting process. The cut-off value u can also be selected to minimize APEu.
These accuracy measures can be estimated empirically by
Since the prediction rule and the accuracy measures are estimated using the same training data, such plug-in accuracy estimates may suffer from overfitting bias, especially when n is not large. In such cases, standard cross-validation procedures can be used for bias correction.
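As a simplified illustration of these measures, the following Python sketch computes a Kendall's-τ type rank correlation over comparable pairs and an APE-type discrepancy between the annotated and observed follow-up times; both are illustrative stand-ins for the exact definitions above rather than our implemented estimators.

```python
import numpy as np

def kendall_tau_censored(X, Delta, T_hat):
    """Kendall's-tau type rank correlation between predicted times T_hat
    and observed (X, Delta), restricted to comparable pairs: a pair is
    comparable when the smaller observed time corresponds to an event."""
    conc = disc = 0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = (i, j) if X[i] < X[j] else (j, i)
            if X[a] == X[b] or Delta[a] == 0:
                continue                  # pair not comparable
            if T_hat[a] < T_hat[b]:
                conc += 1
            elif T_hat[a] > T_hat[b]:
                disc += 1
    return (conc - disc) / max(conc + disc, 1)

def ape(X, C, T_hat):
    """APE-type criterion: mean absolute difference between the
    annotated follow-up time min(T_hat, C) and the observed X, i.e.
    the integrated discrepancy of the implied at-risk processes.
    (An assumed form, for illustration only.)"""
    return np.mean(np.abs(np.minimum(T_hat, C) - X))
```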
3. Asymptotic Results
The asymptotic distributions of the proposed MATA estimators are given in Theorems 1 and 2 below, with proofs provided in the Appendix. We assume the following regularity conditions.
-
(C1)
The density function fC(t) of the random variable C is bounded and bounded away from 0 on [0, ℰ) and satisfies the Lipschitz condition of order 1 on [0, ℰ). Additionally, .
-
(C2)
for , and the spline order satisfies .
-
(C3)
There exists 0 < c < ∞, such that the distances between neighboring knots satisfy
Furthermore, the number of knots satisfies , as and .
-
(C4)
The function m(t) is bounded on [0, ℰ]. The pdf of the covariate Z is bounded and has a compact support.
-
(C5)
The estimated features have a fast convergence rate .
Here condition C1 assumes that the survival function of C is discontinuous at ℰ. In practice, most studies have a preselected ending time ℰ at which all patients who have not experienced failure are censored. This automatically leads to the discontinuity at ℰ. Moreover, for studies that keep tracking patients until the last patient is censored or experiences failure, the performance of the estimated survival curve near the tail can be highly uncertain. A straightforward solution is to manually censor all patients at no later than the last failure time ℰ, which results in the discontinuity at this point. Conditions C2 and C3 are standard smoothness and knot conditions in B-spline approximation. Condition C4 implies that both ST(t), the survival function of T, and fT(t), the density function of T, are bounded away from 0 on [0, ℰ]. Hence, [0, ℰ] is strictly contained in the support of the failure time T.
In the statements of our theory, we use β0 to indicate the underlying true parameter of model (1). Here and throughout the text, for any matrix or vector a. We first establish the consistency and asymptotic normality of the MLE estimator (3).
Lemma 1 Under the Conditions C1-C5, when β equals the truth β0 or equals a -consistent estimator of β0, then the baseline function estimator from
satisfies uniformly in and as in distribution.
Lemma 2 Under the Conditions C1-C5, , and
where
and the definitions of and are given in Appendix D.
With the consistent and asymptotically normal initial estimator, we achieve the oracle property through the adaptive group LASSO with least squares approximation:
Theorem 1 Let 𝒮 be the index set for coefficients in nonzero groups of β0. Define the sub-matrices and sub-vectors with rows and columns in 𝒮 and with elements in 𝒮. Under Conditions C1-C5 with , and
Thanks to the large unlabeled data in Step I, the uncertainty from the complex feature extraction does not affect the asymptotic analysis of the Step II estimators. Consequently, we obtain standard regular estimators for β and m. Confidence intervals for functionals of β and m, including F(t; Z), can be obtained through the bootstrap or perturbation resampling (Efron, 1979; Jin et al., 2001).
Corollary 1 Under the Conditions C1-C5, the estimation error for incidence rate satisfies
and the error is asymptotically normally distributed.
4. Simulation
We have conducted extensive simulations to evaluate the performance of our proposed MATA procedure and to compare it with existing methods, including (a) the nonparametric MLE (NPMLE) approach by Zeng et al. (2005) using the same set of derived features; (b) the tree-based method by Chubak et al. (2015), which first uses a decision tree to classify patients as having experienced the failure event or not, and then, among the patients determined to have events, assigns the event time as the earliest arrival time among all groups of medical encounters used in the decision tree; and (c) the two-step procedure by Hassett et al. (2015) and Uno et al. (2018), which first fits a logistic regression with group lasso to classify the patients and select significant groups of encounters, and then assigns the failure time for patients experiencing the event as the weighted average of the peak times of the significant encounters, with an adjustment to correct systematic bias. Throughout, we fix the total sample size at n + N = 4000 and consider n ∈ {200, 400}.
For simplicity, we only consider the case where all patients are enrolled at the same time, as we can always shift each patient's follow-up period to a preselected origin. The censoring time of the i-th patient, Ci, is simulated from the mixture distribution 0.909·Uniform[0, ℰ) + 0.091·δℰ, where δℰ is the point mass (Dirac measure) at ℰ and ℰ = 20, for i = 1, …, n + N. Intuitively, this imitates a long-term clinical study that tracks patients for up to 20 years, where 90.9% of the patients leave the study at a uniform rate before the study terminates and 9.1% of the patients stay in the study until the end. We simulate the number of encounters and encounter arrival times using the expression for . We consider two sets of density functions for the point processes: Gaussian and Gamma. In addition, for each density shape, we consider both the case where the density functions are independent across the q = 10 counting processes of the medical encounters and the case where the densities are correlated. Details on the data generation mechanisms for the point processes are given in Appendix A of the Supplementary Materials. We then set with , where is the cumulative distribution function (CDF) of the standard normal and Fj is the CDF of a Gamma distribution with shape and scale , . We let
and generate
We consider two choices of and . We further simulate encounter times
but only keep the ones that fall into the interval [0, ℰ]. The final number of arrival times are thus reduced to and we relabel the retained arrival times as . Simple calculation shows , where .
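The censoring mechanism above is fully specified and can be simulated directly, as in the following Python sketch; the encounter-generation step shown here is a simplified placeholder for the Gaussian and Gamma intensity specifications detailed in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(3)
E, n_total = 20.0, 4000

# Mixture censoring: w.p. 0.909 uniform on [0, E); w.p. 0.091 exactly E
at_end = rng.random(n_total) < 0.091
C = np.where(at_end, E, rng.uniform(0, E, size=n_total))

def simulate_encounters(mu_time, sd_time, mean_count):
    """Placeholder encounter generator: Poisson counts with Gaussian
    arrival times, keeping only arrivals in [0, E]; the Appendix A
    intensity specifications replace this in the actual simulation."""
    counts = rng.poisson(mean_count, size=n_total)
    times = [rng.normal(mu_time, sd_time, size=k) for k in counts]
    return [t[(t >= 0) & (t <= E)] for t in times]

encounters = simulate_encounters(mu_time=8.0, sd_time=2.0, mean_count=3)
print(np.mean(C == E))                              # ~ 0.091
print(np.mean([len(t) == 0 for t in encounters]))   # sparsity check
```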
The event time Ti is simulated from the PO model in (1), where the true features are set to be the log of the peak time and the logit of the ratio between the change point time and the peak time of the intensity functions for j = 1, …, q. Intuitively, an early peak time may result in early disease onset, and a change point time close to the peak time may imply a quick exacerbation of the disease status. We set the nonparametric function and vary αc to control the censoring rate. We further set with and . Consequently, only the first group of medical encounters affects the recurrence time. The estimated features are set to be the estimated peak time as well as the logit of the estimated ratio between the change point time and the peak time. We also include the logarithm of the first encounter arrival time, a feature directly observable from the medical encounter data.
We summarize results over 400 replications for each configuration. Within each simulated dataset, we extract the features for both labeled and unlabeled data of total size n + N, whereas the PO model (1) is fitted only on the labeled training data of size n. The interior knots for the B-splines in our approach are chosen as the (10a)-th percentiles of the observed survival time X for a = 1, 2, …, 9, for both the Gaussian and Gamma cases. For the tree approach and the two-step logistic approach, denoted by "Logi", we use the same input feature space as that for the PO model (1) of MATA. To evaluate the performance of the different procedures, we simulate validation data of size nv = 5000 in each simulation run to evaluate the out-of-sample prediction performance through the accuracy measures discussed in Section 2.3.
4.1. Results for the Gaussian Intensity Setting
We present the results for the Gaussian intensity setting here; the parallel results for the Gamma intensity setting are given in Appendix A. The estimated probabilities of having zero and ≤ 3 arrival times under all settings, from a simulation with sample size 500,000, are given in Table A4 in Appendix A. As a benchmark, we also present results from fitting the PO model with the true feature sets. For the true feature sets, we report in Table 1 the bias and standard error (se) of the non-zero coefficients from MATA and NPMLE. In general, we find that the MATA procedure performs well with small sample sizes regardless of the censoring rate, the correlation structure between groups of encounters, and the family of the intensity curves. MATA generally yields both smaller bias and smaller standard error than the NPMLE. In the extreme case when n = 200 and the censoring rate reaches 70%, leading to an effective sample size of 60, both estimators deteriorate. However, the resulting 95% confidence interval of MATA covers the truth, as the absolute bias is less than 1.96 times the standard error. In contrast, the NPMLE has a smaller standard error in this extreme case, but its absolute bias is more than twice its standard error. These results are consistent with Theorem 2.
Table 1.
Displayed are the bias and standard error of the estimates fitted with the true features from 400 simulations, each with N + n = 4,000 and n = 200 or 400. Two methods, MATA and NPMLE, are contrasted. Panels from top to bottom are Gaussian intensities with subject-specific follow-up duration under 30% and 70% censoring rates, as discussed in Section A1. The results under independent groups of encounters are shown on the left, whereas the results for the correlated case are shown on the right.
| Indp | Corr | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Bias | se | Bias | se | Bias | se | Bias | se |
| Gaussian, 30% censoring rate | |||||||||
| n = 200 | MATA | −0.060 | 0.404 | −0.072 | 0.282 | −0.053 | 0.418 | −0.083 | 0.292 |
| NPMLE | −0.355 | 0.692 | −0.216 | 0.550 | −0.359 | 0.716 | −0.232 | 0.495 | |
| n = 400 | MATA | 0.017 | 0.271 | −0.020 | 0.183 | 0.030 | 0.258 | −0.027 | 0.177 |
| NPMLE | −0.036 | 0.373 | 0.000 | 0.259 | −0.028 | 0.385 | −0.001 | 0.242 | |
| Gaussian, 70% censoring rate | |||||||||
| n = 200 | MATA | −0.408 | 0.893 | −0.305 | 0.582 | −0.352 | 0.776 | −0.277 | 0.532 |
| NPMLE | 1.449 | 0.562 | 1.172 | 0.382 | 1.448 | 0.574 | 1.167 | 0.361 | |
| n = 400 | MATA | −0.082 | 0.440 | −0.081 | 0.279 | −0.088 | 0.403 | −0.095 | 0.280 |
| NPMLE | 1.698 | 0.270 | 1.338 | 0.161 | 1.708 | 0.247 | 1.341 | 0.140 | |
For both the true and estimated feature sets, we computed the out-of-sample accuracy measures discussed in Section 2.3 on a validation data set. All other accuracy measures, i.e., the Kendall's-τ type rank correlation summary measures and the absolute prediction error APEu, depend on u, which is easy to control for MATA and NPMLE but not for Tree and Logi. We therefore minimize the cross-validation error for the Tree approach and minimize the misclassification rate for the Logi approach at their first step, i.e., classifying the censoring status Δ. For MATA and NPMLE, we calculate these accuracy measures at u = 0.02ℓ for ℓ = 0, 1, …, 50 and pick the u with the minimum APEu. We then compare these measures at the selected u with the Tree and Logi methods in Tables 2 and 3.
Table 2.
Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gaussian intensities over 400 simulations, each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the true features. The upper two panels display the results under independent intensities with 30% and 70% censoring rates, respectively; the lower two panels display the results under correlated intensities with 30% and 70% censoring rates, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gaussian, 30%, independent counting processes, true features | ||||||||
| 0.901 (0.002) | 0.894 (0.003) | 0.791 (0.022) | 0.719 (0.042) | 0.901 (0.002) | 0.898 (0.002) | 0.791 (0.018) | 0.718 (0.038) | |
| 0.868 (0.006) | 0.864 (0.006) | 0.742 (0.027) | 0.718 (0.029) | 0.868 (0.005) | 0.867 (0.005) | 0.747 (0.018) | 0.725 (0.022) | |
| APE | 0.971 (0.032) | 1.049 (0.050) | 1.978 (0.368) | 3.373 (0.947) | 0.962 (0.027) | 1.001 (0.031) | 1.884 (0.125) | 3.406 (0.917) |
| Gaussian, 70%, independent counting processes, true features | ||||||||
| 0.932 (0.004) | 0.921 (0.005) | 0.867 (0.014) | 0.859 (0.023) | 0.933 (0.002) | 0.926 (0.003) | 0.872 (0.010) | 0.859 (0.021) | |
| 0.782 (0.021) | 0.757 (0.023) | 0.705 (0.064) | 0.682 (0.086) | 0.789 (0.014) | 0.761 (0.015) | 0.716 (0.051) | 0.699 (0.067) | |
| APE | 0.772 (0.049) | 0.899 (0.057) | 1.479 (0.162) | 1.590 (0.268) | 0.751 (0.030) | 0.840 (0.037) | 1.413 (0.121) | 1.576 (0.234) |
| Gaussian, 30%, correlated counting processes, true features | ||||||||
| 0.900 (0.003) | 0.894 (0.003) | 0.792 (0.023) | 0.720 (0.042) | 0.901 (0.002) | 0.898 (0.002) | 0.791 (0.018) | 0.724 (0.039) | |
| 0.866 (0.006) | 0.862 (0.007) | 0.741 (0.029) | 0.719 (0.027) | 0.867 (0.005) | 0.865 (0.005) | 0.748 (0.017) | 0.727 (0.022) | |
| APE | 0.972 (0.035) | 1.050 (0.046) | 1.994 (0.434) | 3.383 (0.962) | 0.963 (0.026) | 1.002 (0.031) | 1.871 (0.114) | 3.278 (0.976) |
| Gaussian, 70%, correlated counting processes, true features | ||||||||
| 0.932 (0.004) | 0.921 (0.005) | 0.868 (0.015) | 0.859 (0.025) | 0.933 (0.002) | 0.926 (0.003) | 0.873 (0.011) | 0.861 (0.022) | |
| 0.782 (0.020) | 0.758 (0.022) | 0.698 (0.066) | 0.679 (0.084) | 0.789 (0.013) | 0.762 (0.016) | 0.720 (0.050) | 0.701 (0.068) | |
| APE | 0.771 (0.051) | 0.896 (0.057) | 1.473 (0.175) | 1.599 (0.280) | 0.750 (0.029) | 0.838 (0.038) | 1.395 (0.121) | 1.548 (0.247) |
Table 3.
Estimated features, Gaussian. Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gaussian intensities over 400 simulations each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the estimated features derived from FPCA approach in Section 2.2.1. The upper two panels display the result under independent intensities with 30% and 70% censoring rate, respectively; the lower two panels display the result under correlated intensities with 30% and 70% censoring rate, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gaussian, 30%, independent counting processes, estimated features | ||||||||
| 0.781 (0.020) | 0.768 (0.011) | 0.790 (0.019) | 0.740 (0.036) | 0.791 (0.008) | 0.784 (0.006) | 0.788 (0.022) | 0.744 (0.027) | |
| 0.701 (0.028) | 0.667 (0.017) | 0.690 (0.019) | 0.654 (0.036) | 0.675 (0.027) | 0.680 (0.010) | 0.692 (0.016) | 0.653 (0.028) | |
| APE | 2.148 (0.332) | 2.184 (0.132) | 1.913 (0.139) | 2.409 (0.540) | 1.960 (0.236) | 2.019 (0.074) | 1.904 (0.125) | 2.272 (0.315) |
| Gaussian, 70%, independent counting processes, estimated features | ||||||||
| 0.839 (0.018) | 0.833 (0.010) | 0.802 (0.020) | 0.827 (0.015) | 0.853 (0.008) | 0.845 (0.006) | 0.806 (0.019) | 0.832 (0.013) | |
| 0.397 (0.111) | 0.413 (0.040) | 0.536 (0.107) | 0.500 (0.107) | 0.468 (0.042) | 0.447 (0.025) | 0.555 (0.072) | 0.511 (0.090) | |
| APE | 1.868 (0.253) | 1.938 (0.122) | 2.276 (0.248) | 1.992 (0.216) | 1.685 (0.106) | 1.775 (0.071) | 2.202 (0.223) | 1.907 (0.177) |
| Gaussian, 30%, correlated counting processes, estimated features | ||||||||
| 0.781 (0.019) | 0.768 (0.011) | 0.789 (0.021) | 0.743 (0.032) | 0.791 (0.010) | 0.783 (0.006) | 0.791 (0.016) | 0.747 (0.022) | |
| 0.701 (0.028) | 0.669 (0.016) | 0.690 (0.019) | 0.656 (0.036) | 0.672 (0.030) | 0.680 (0.011) | 0.693 (0.014) | 0.654 (0.027) | |
| APE | 2.142 (0.323) | 2.180 (0.122) | 1.915 (0.145) | 2.352 (0.418) | 1.958 (0.257) | 2.018 (0.072) | 1.886 (0.099) | 2.243 (0.194) |
| Gaussian, 70%, correlated counting processes, estimated features | ||||||||
| 0.838 (0.018) | 0.832 (0.009) | 0.802 (0.021) | 0.827 (0.014) | 0.852 (0.007) | 0.846 (0.005) | 0.804 (0.017) | 0.832 (0.012) | |
| 0.393 (0.109) | 0.430 (0.042) | 0.533 (0.119) | 0.500 (0.110) | 0.468 (0.041) | 0.450 (0.026) | 0.564 (0.070) | 0.519 (0.083) | |
| APE | 1.879 (0.248) | 1.944 (0.120) | 2.289 (0.253) | 1.987 (0.211) | 1.688 (0.102) | 1.772 (0.071) | 2.225 (0.193) | 1.898 (0.168) |
The performance of the MATA estimator fitted with the true features largely dominates that of NPMLE, Tree, and Logi, with higher rank correlations and lower APE in all cases under Gaussian intensities. When fitted with the estimated features, there is no clear winner among the four methods when the labeled data size is n = 200; however, when the labeled data size increases to n = 400, MATA generally outperforms the other three approaches in terms of APE.
5. Example
We applied our MATA algorithm to the extraction of cancer recurrence times for the Veterans Affairs Central Cancer Registry (VACCR). We obtained from the VACCR 36,705 patients diagnosed with stage I to III lung cancer before 2018 and followed through 2019, among whom 3,752 patients diagnosed in 2000-2018 with cancer stage information had annotated dates of cancer recurrence. Through the research platform under the Office of Research & Development at the Department of Veterans Affairs, the cancer registry data were linked to the EHR of the Veterans Affairs healthcare system, containing diagnoses, procedures, medications, lab tests, and medical notes.
The gold-standard recurrence status was collected through manual abstraction and the tumor registry for the VACCR data. In addition, baseline covariates, including age at diagnosis, gender, and cancer stage, were extracted. Due to the predominance of male patients in the VACCR (97.7% among the 3,752), we excluded gender from the subsequent analysis. We randomly selected 1000 patients as training data and used the remaining 2752 as validation data. To assess how performance changes with the size of the training data, we also considered smaller training sets of n = 200 and n = 400, sub-sampled from the n = 1000 set. We drew 400 bootstrap samples from the 1000 training samples for each n to quantify the variability of the analysis. We selected the month as the time unit and focused on recurrence within 2 years. Patients without recurrence before 24 months from the diagnosis date were censored at 24 months; the censoring rate was 39%. The diagnosis, procedure, and medication codes and the mentions in medical notes associated with the following nine events were collected: lung cancer, chemotherapy, computerized tomography (CT) scan, radiotherapy, secondary malignant neoplasm, palliative or hospice care, recurrence, medications for systematic therapies (including cytotoxic therapies, targeted therapies, and immunotherapies), and biopsy or excision. See Table A6 for a detailed summary of the sparsity of the nine groups of medical encounters.
For each of the nine selected events except palliative or hospice care, we estimate the subject-specific intensity function on the training and unlabeled data sets by applying the FPCA approach described in Section 2.2.1, and then use the resulting basis functions to project the intensity functions for the validation set. The peak and change point times of the estimated intensity functions are then extracted as features. In addition, the first code arrival time, the first FPCA score, and the total number of diagnosis and procedure codes are added as features for each event. All these features except the FPCA score are log-transformed to reduce skewness. Radiotherapy, medication for systematic therapies, and biopsy/excision have zero-code rates of 77.8%, 70.7%, and 96.4%, respectively. Consequently, the estimated peak and largest-increase times for these groups are identical to the associated first occurrence time for most patients; thus, only the first occurrence time and the total number of diagnosis and procedure codes are considered for these groups. Finally, to overcome potential collinearity of the extracted features within the same group (i.e., event), we further run principal component analysis on each group of features and keep the first few principal components whose cumulative proportion of variation explained exceeds 90%.
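The per-group PCA reduction can be expressed compactly, as in the following Python sketch, which keeps the smallest number of components whose cumulative proportion of variation explained reaches 90%; the feature matrix here is hypothetical. Note that scikit-learn's `PCA(n_components=0.9)` implements the same selection rule directly.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_group(features, threshold=0.90):
    """Project one group's (log-transformed) features onto the smallest
    number of principal components whose cumulative proportion of
    variance explained reaches `threshold`."""
    pca = PCA().fit(features)
    k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_),
                            threshold) + 1)
    return PCA(n_components=k).fit_transform(features)

# Hypothetical group: peak time, change point time, first arrival time,
# first FPCA score, log total code count for one event (e.g., CT scan)
rng = np.random.default_rng(4)
group = rng.normal(size=(500, 5))
print(reduce_group(group).shape)   # (500, k) with k <= 5
```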
As in the simulation, we fit the decision tree to minimize the cross-validation error for the Tree approach, and we fit the logistic regression model for the Logi approach. For MATA, NPMLE, and Logi, we take a fine grid on the false positive rate (FPR) for Δ and compute all other accuracy measures in Section 2.3 at each value of the FPR. We then pick the result that matches the FPR of the Tree approach.
The prediction accuracy is summarized in Table 4. For the measurements regarding the timing of recurrence, our MATA estimator dominates the other three approaches with larger and yet smaller APE.
Table 4.
Mean and bootstrap standard deviation of Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) for the medical encounter data analyzed in Section 5 under the four approaches, i.e., MATA, NPMLE, Tree, and Logi, over 400 bootstrap simulations.
| n = 1000 | ||||
|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi |
| 0.810 (0.010) | 0.809 (0.009) | 0.755 (0.032) | 0.762 (0.015) | |
| 0.688 (0.016) | 0.690 (0.017) | 0.646 (0.049) | 0.650 (0.010) | |
| APE | 3.326 (0.051) | 3.399 (0.074) | 5.693 (0.960) | 4.317 (0.152) |
| n = 400 | ||||
| | MATA | NPMLE | Tree | Logi |
| 0.796 (0.013) | 0.795 (0.013) | 0.748 (0.033) | 0.762 (0.059) | |
| 0.663 (0.026) | 0.662 (0.027) | 0.644 (0.044) | 0.646 (0.044) | |
| APE | 3.576 (0.094) | 3.615 (0.129) | 5.880 (1.159) | 4.653 (0.577) |
| n = 200 | ||||
| | MATA | NPMLE | Tree | Logi |
| 0.791 (0.025) | 0.784 (0.024) | 0.748 (0.039) | 0.859 (0.127) | |
| 0.624 (0.057) | 0.624 (0.043) | 0.640 (0.044) | 0.681 (0.090) | |
| APE | 3.842 (0.317) | 3.945 (0.286) | 6.094 (1.418) | 5.736 (1.080) |
Through its variable selection feature, MATA excluded stage II cancer from the n = 1000 analysis and, additionally, stage III cancer, age at diagnosis, and medication for systematic therapies from the n = 200 and n = 400 analyses. The selection is consistent with the NPMLE result of the n = 1000 analysis, as the excluded features coincide with the feature groups with no p-value < 0.05. Additional details on feature selection are given in Appendix B.
6. Discussion
We proposed the MATA method to automatically extract patients' longitudinal characteristics from medical encounter data and to build a risk prediction model for the occurrence times of clinical events from the extracted features. The approach integrates both labeled and unlabeled data to obtain the longitudinal features, thus tackling the sparsity of the medical encounter data. In addition, the FPCA approach largely preserves the flexibility of the resulting subject-specific intensity functions. In practice, intensity functions often differ in shape between female and male patients, or between young and elderly patients; therefore, a multiplicative intensity model, which assumes that the heterogeneity among patients results only from subject-specific random effects, may not be adequate. The fitted risk prediction model is chosen to be the proportional odds model, with the nonparametric function approximated by B-splines under a transformation that ensures its monotonicity. The resulting estimator of the parametric part is shown to be root-n consistent, and the estimator of the non-parametric function is consistent, under the correctly specified model. Though the proportional odds model is adopted here, our proofs can easily be extended to other semiparametric transformation models such as the proportional hazards model. In the presence of large feature sets, we propose to use the group lasso with LSA for feature selection. The finite sample performance of our approach is studied under various settings.
Here, the FPCA is employed on each group of medical encounters separately for feature extraction. However, different groups are potentially connected, and separate estimation may fail to capture such relationships. A potential future direction is to use multivariate FPCA to directly address the covariation among different groups. Though various multivariate FPCA approaches exist, none of them can handle the case of medical encounters, where the encounter arrival times, rather than the underlying intensity functions, are observed. Much effort is needed to develop applicable multivariate FPCA methodology and theory in this setting.
The spline model works very well with only a few knots. Small-sample performance of our estimator is studied in various simulations, with the prediction accuracy examined by C-statistics (Uno et al., 2011) and Brier Score on simulated validation data sets.
The adoption of the PO model is for simplicity of illustration; the theory for our estimator can easily be generalized to arbitrary linear transformation models. As the goal of this paper is to annotate event times within the observational window, the medical encounter data involved may extend past the actual event time.
Appendices
In Appendix A, we present additional simulation studies with Gamma intensities, as well as extra information on the simulation settings. In Appendix B, we offer additional details on the data example of lung cancer recurrence with VACCR data. In Appendix C, we provide the theoretical properties for the derived features. In Appendix D, we provide the theoretical properties for the MATA estimator based on the proportional odds model. In Appendix F, we provide the detailed algorithm for optimization of the log-likelihood ln.
Appendix A. Additional Simulation Details
A1. Simulation Settings for the Gaussian Intensities
We first simulate Gaussian shape density, i.e., is the density function of truncated at 0.
Set to be , Fj is the CDF of , with and for j = 1, …, q, and , i.e., the multivariate normal distribution with mean 0 and variance . For simplicity, we set . We further set it to be one if it is less than one.
Simulate with , where Fj is the CDF of . The way we simulate and guarantees that the largest change in the intensity functions only occurs after patients enter the study, as expected in practice. Besides, the simulated σij is controlled not only by the value of μij but also by the median of . Thus σij will not become too extreme even with a large peak time. In other words, the corresponding largest change in the intensity function is more likely to occur near the peak time than much earlier than it.
Finally, we set αc, the constant in the nonparametric function α(t), to be 7.5 and 1.1 to obtain an approximately 30% and 70% censoring rate.
A2. Simulation Settings for the Gamma Intensity functions
We also consider gamma shape density, i.e., is the density function of , truncated at 0. We let be the density function of , truncated at 0. Set , where Fj is the CDF of , with , and , and . Generate from truncated at its third quartile with , and . We set and 1.9 to obtain the approximate 30% and 70% censoring rates.
Table A1.
Displayed are the bias and standard error of the estimates fitted with the true features from 400 simulations, each with N + n = 4,000 and n = 200 or 400. Two methods, MATA and NPMLE, are contrasted. Panels from top to bottom are Gamma intensities with subject-specific follow-up duration under 30% and 70% censoring rates, as discussed in Section A2. The results under independent groups of encounters are shown on the left, whereas the results for the correlated case are shown on the right.
| Indp | Corr | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Bias | se | Bias | se | Bias | se | Bias | se |
| Gamma, 30% censoring rate | |||||||||
| n = 200 | MATA | −0.086 | 0.443 | −0.084 | 0.410 | −0.091 | 0.447 | −0.126 | 0.436 |
| NPMLE | −0.480 | 0.581 | −0.301 | 0.510 | −0.482 | 0.549 | −0.349 | 0.541 | |
| n = 400 | MATA | −0.006 | 0.296 | −0.032 | 0.271 | 0.019 | 0.283 | −0.067 | 0.266 |
| NPMLE | −0.223 | 0.374 | −0.138 | 0.315 | −0.190 | 0.346 | −0.167 | 0.340 | |
| Gamma, 70% censoring rate | |||||||||
| n = 200 | MATA | −0.383 | 0.743 | −0.299 | 0.676 | −0.400 | 0.783 | −0.339 | 0.666 |
| NPMLE | 0.336 | 0.731 | 0.284 | 0.592 | 0.325 | 0.866 | 0.267 | 0.670 | |
| n = 400 | MATA | −0.109 | 0.399 | −0.070 | 0.328 | −0.074 | 0.410 | −0.112 | 0.344 |
| NPMLE | 0.708 | 0.429 | 0.551 | 0.330 | 0.744 | 0.385 | 0.533 | 0.352 | |
A3. Results for Gamma Intensity setting
For the true feature sets, we report the bias and standard error (se) of the non-zero coefficients from MATA and NPMLE in Table A1. Similar to the Gaussian intensity settings, we find that the MATA procedure performs well with small sample sizes regardless of the censoring rate, the correlation structure between groups of encounters, and the family of the intensity curves. MATA generally yields both smaller bias and smaller standard error than the NPMLE. In the extreme case when n = 200 and the censoring rate reaches 70%, both estimators deteriorate. However, the resulting 95% confidence interval of MATA covers the truth, as the absolute bias is less than 1.96 times the standard error. In contrast, the NPMLE tends to be numerically unstable: its estimation bias in the n = 400 setting is larger than its own standard error and larger than its bias in the n = 200 setting. These results are consistent with Theorem 2.
For both the true and estimated feature sets, we computed the out-of-sample accuracy measures discussed in Section 2.3 on a validation data set. All other accuracy measures, i.e., the Kendall's-τ type rank correlation summary measures and the absolute prediction error APEu, depend on u, which is easy to control for MATA and NPMLE but not for Tree and Logi. We therefore minimize the cross-validation error for the Tree approach and minimize the misclassification rate for the Logi approach at their first step, i.e., classifying the censoring status Δ. For MATA and NPMLE, we calculate these accuracy measures at u = 0.02ℓ for ℓ = 0, 1, …, 50 and pick the u with the minimum APEu. We then compare these measures at the selected u with the Tree and Logi methods in Tables A2 and A3.
Similar to the Gaussian intensity setting, the performance of the MATA estimator fitted with the true features largely dominates that of NPMLE, Tree, and Logi, with higher rank correlations and lower APE in all cases except when the encounters are simulated from independent Gamma counting processes with a 30% censoring rate. In this exceptional case, our MATA estimator has only a very minor advantage over NPMLE in one rank correlation measure, and remains better in terms of the other measure and APE. When fitted with the estimated features, there is no clear winner among the four methods when the labeled data size is n = 200; however, when the labeled data size increases to n = 400, MATA generally outperforms the other three approaches in terms of APE.
Table A2.
True features, Gamma. Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gamma intensities over 400 simulations, each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the true features. The upper two panels display the results under independent intensities with 30% and 70% censoring rates, respectively; the lower two panels display the results under correlated intensities with 30% and 70% censoring rates, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gamma, 30%, independent counting processes, true features | ||||||||
| 0.872 (0.003) | 0.864 (0.004) | 0.720 (0.040) | 0.678 (0.074) | 0.873 (0.003) | 0.869 (0.003) | 0.731 (0.023) | 0.683 (0.071) | |
| 0.814 (0.008) | 0.812 (0.008) | 0.617 (0.047) | 0.658 (0.048) | 0.814 (0.006) | 0.815 (0.007) | 0.636 (0.026) | 0.667 (0.043) | |
| APE | 1.318 (0.038) | 1.410 (0.048) | 3.826 (1.011) | 4.937 (1.844) | 1.307 (0.031) | 1.350 (0.036) | 3.434 (0.578) | 4.784 (1.835) |
| Gamma, 70%, independent counting processes, true features | ||||||||
| 0.922 (0.004) | 0.914 (0.004) | 0.876 (0.010) | 0.837 (0.042) | 0.924 (0.002) | 0.919 (0.003) | 0.880 (0.008) | 0.841 (0.040) | |
| 0.735 (0.021) | 0.701 (0.024) | 0.618 (0.078) | 0.604 (0.106) | 0.743 (0.014) | 0.723 (0.017) | 0.630 (0.053) | 0.627 (0.085) | |
| APE | 0.892 (0.049) | 0.995 (0.054) | 1.436 (0.125) | 1.986 (0.539) | 0.869 (0.030) | 0.930 (0.037) | 1.388 (0.090) | 1.917 (0.518) |
| Gamma, 30%, correlated counting processes, true features | ||||||||
| 0.872 (0.003) | 0.864 (0.004) | 0.720 (0.040) | 0.685 (0.072) | 0.873 (0.002) | 0.869 (0.003) | 0.731 (0.026) | 0.684 (0.068) | |
| 0.814 (0.008) | 0.813 (0.009) | 0.617 (0.047) | 0.662 (0.049) | 0.816 (0.006) | 0.813 (0.007) | 0.635 (0.031) | 0.668 (0.041) | |
| APE | 1.320 (0.041) | 1.408 (0.049) | 3.819 (1.022) | 4.826 (1.842) | 1.307 (0.030) | 1.350 (0.034) | 3.500 (0.689) | 4.808 (1.810) |
| Gamma, 70%, correlated counting processes, true features | ||||||||
| 0.921 (0.005) | 0.913 (0.004) | 0.875 (0.011) | 0.841 (0.044) | 0.924 (0.003) | 0.919 (0.003) | 0.879 (0.007) | 0.848 (0.038) | |
| 0.733 (0.025) | 0.702 (0.024) | 0.616 (0.072) | 0.612 (0.099) | 0.742 (0.015) | 0.724 (0.016) | 0.628 (0.053) | 0.622 (0.086) | |
| APE | 0.900 (0.056) | 0.996 (0.053) | 1.447 (0.126) | 1.933 (0.572) | 0.872 (0.032) | 0.929 (0.036) | 1.390 (0.086) | 1.833 (0.488) |
Table A3.
Estimated features, Gamma. Kendall’s-τ type rank correlation summary measures ( and ), and absolute prediction error (APE) are computed from four methods, MATA, NPMLE, Tree, and Logi, under q = 10 Gamma intensities over 400 simulations, each with n + N = 4,000 and n = 200 or 400. The PO model is fitted with the estimated features derived from the FPCA approach in Section 2.2.1. The upper two panels display the results under independent intensities with 30% and 70% censoring rates, respectively; the lower two panels display the results under correlated intensities with 30% and 70% censoring rates, respectively.
| n = 200 | n = 400 | |||||||
|---|---|---|---|---|---|---|---|---|
| | MATA | NPMLE | Tree | Logi | MATA | NPMLE | Tree | Logi |
| Gamma, 30%, independent counting processes, estimated features | ||||||||
| 0.728 (0.027) | 0.720 (0.010) | 0.650 (0.043) | 0.668 (0.059) | 0.749 (0.011) | 0.737 (0.007) | 0.667 (0.045) | 0.659 (0.056) | |
| 0.555 (0.037) | 0.578 (0.015) | 0.456 (0.074) | 0.570 (0.078) | 0.573 (0.018) | 0.589 (0.010) | 0.480 (0.072) | 0.558 (0.063) | |
| APE | 2.732 (0.544) | 2.698 (0.121) | 4.164 (1.191) | 5.018 (1.659) | 2.443 (0.218) | 2.528 (0.074) | 3.840 (1.095) | 4.858 (1.330) |
| Gamma, 70%, independent counting processes, estimated features | ||||||||
| 0.833 (0.016) | 0.827 (0.011) | 0.829 (0.017) | 0.822 (0.019) | 0.849 (0.010) | 0.841 (0.006) | 0.831 (0.018) | 0.829 (0.014) | |
| 0.277 (0.115) | 0.325 (0.043) | 0.485 (0.139) | 0.425 (0.119) | 0.371 (0.058) | 0.374 (0.027) | 0.519 (0.070) | 0.450 (0.089) | |
| APE | 2.002 (0.223) | 2.040 (0.136) | 2.006 (0.202) | 2.086 (0.254) | 1.766 (0.141) | 1.866 (0.077) | 1.964 (0.195) | 1.973 (0.166) |
| Gamma, 30%, correlated counting processes, estimated features | ||||||||
| 0.731 (0.024) | 0.720 (0.011) | 0.656 (0.046) | 0.670 (0.053) | 0.749 (0.010) | 0.737 (0.007) | 0.672 (0.045) | 0.665 (0.057) | |
| 0.568 (0.038) | 0.577 (0.016) | 0.453 (0.076) | 0.560 (0.073) | 0.579 (0.020) | 0.588 (0.011) | 0.485 (0.072) | 0.561 (0.063) | |
| APE | 2.681 (0.494) | 2.691 (0.127) | 4.133 (1.176) | 4.770 (1.522) | 2.451 (0.244) | 2.522 (0.073) | 3.778 (1.076) | 4.874 (1.354) |
| Gamma, 70%, correlated counting processes, estimated features | ||||||||
| 0.833 (0.016) | 0.826 (0.011) | 0.828 (0.018) | 0.822 (0.017) | 0.849 (0.009) | 0.840 (0.006) | 0.831 (0.017) | 0.829 (0.012) | |
| 0.283 (0.107) | 0.322 (0.044) | 0.484 (0.136) | 0.421 (0.124) | 0.366 (0.060) | 0.388 (0.028) | 0.515 (0.076) | 0.442 (0.094) | |
| APE | 1.996 (0.214) | 2.053 (0.135) | 2.017 (0.213) | 2.088 (0.236) | 1.766 (0.128) | 1.868 (0.075) | 1.963 (0.180) | 1.975 (0.156) |
Table A4.
Estimated probability of having zero or ≤ 3 encounter arrival times under each counting process for j = 1, …, 10 from a simulation with sample size 500,000.
| Probability of zero encounters | ||||||||||
| Indp Gaussian | 0.508 | 0.802 | 0.793 | 0.767 | 0.878 | 0.758 | 0.474 | 0.594 | 0.818 | 0.755 |
| Corr Gaussian | 0.508 | 0.802 | 0.792 | 0.766 | 0.879 | 0.758 | 0.474 | 0.595 | 0.817 | 0.755 |
| Indp Gamma | 0.761 | 0.805 | 0.716 | 0.786 | 0.750 | 0.944 | 0.712 | 0.810 | 0.939 | 0.750 |
| Corr Gamma | 0.761 | 0.806 | 0.715 | 0.786 | 0.749 | 0.943 | 0.713 | 0.812 | 0.938 | 0.750 |
| Probability of ≤ 3 encounters | ||||||||||
| Indp Gaussian | 0.720 | 0.945 | 0.926 | 0.930 | 0.971 | 0.918 | 0.680 | 0.778 | 0.938 | 0.902 |
| Corr Gaussian | 0.721 | 0.946 | 0.926 | 0.930 | 0.972 | 0.918 | 0.681 | 0.778 | 0.938 | 0.902 |
| Indp Gamma | 0.938 | 0.962 | 0.913 | 0.939 | 0.933 | 0.995 | 0.935 | 0.970 | 0.995 | 0.931 |
| Corr Gamma | 0.938 | 0.962 | 0.913 | 0.939 | 0.933 | 0.995 | 0.936 | 0.970 | 0.995 | 0.931 |
Table A5.
Average model sizes selected by MATA.
| 30% Indp | 70% Indp | 30% Corr | 70% Corr | ||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Tr Ft | Est Ft | Tr Ft | Est Ft | Tr Ft | Est Ft | Tr Ft | Est Ft |
| Gaussian | |||||||||
| n = 200 | AIC | 13.24 | 15.57 | 13.74 | 17.09 | 13.30 | 15.91 | 13.75 | 17.41 |
| BIC | 13.07 | 15.45 | 13.45 | 16.87 | 13.09 | 15.83 | 13.39 | 17.27 | |
| n = 400 | AIC | 13.15 | 14.40 | 13.30 | 14.57 | 13.18 | 14.39 | 13.31 | 14.77 |
| BIC | 13.00 | 14.05 | 13.01 | 14.17 | 13.00 | 14.12 | 13.00 | 14.22 | |
| Gamma | |||||||||
| n = 200 | AIC | 13.38 | 18.37 | 13.88 | 19.35 | 13.41 | 18.04 | 13.94 | 19.57 |
| BIC | 13.23 | 18.22 | 13.65 | 19.01 | 13.29 | 17.88 | 13.72 | 19.21 | |
| n = 400 | AIC | 13.24 | 15.02 | 13.20 | 15.09 | 13.21 | 15.04 | 13.34 | 14.97 |
| BIC | 13.00 | 14.80 | 13.01 | 14.78 | 13.01 | 14.72 | 13.02 | 14.72 | |
Supplementary Results on Simulations
We show the sparsity of the simulated data in Table A4 and the average model sizes selected by MATA in Table A5.
Table A6.
Sparsity of the nine groups of medical encounter data analyzed in Section 5.
| Feature | Zero | ≤ 3 times |
|---|---|---|
| Lung Cancer | 0.014 | 0.087 |
| Chemotherapy | 0.567 | 0.736 |
| CT Scan | 0.127 | 0.363 |
| Radiotherapy | 0.778 | 0.912 |
| Secondary Malignant Neoplasm | 0.554 | 0.856 |
| Palliative or Hospice Care | 0.576 | 0.888 |
| Recurrence | 0.279 | 0.723 |
| Medication | 0.707 | 0.824 |
| Biopsy or Excision | 0.964 | 1.000 |
Appendix B. Additional Details on Data Example
We show the sparsity of the features in Table A6. Radiotherapy, medication for systemic therapies, and biopsy/excision have zero-code rates of 77.8%, 70.7%, and 96.4%, respectively. Consequently, the estimated peak and largest-increase times of these features are identical to the associated first occurrence time for most patients. Thus, only the first occurrence time and the total number of diagnosis and procedure codes are considered for these features, as illustrated in the sketch below.
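For such sparse code groups, the feature extraction reduces to the first occurrence time and the total code count, both log-transformed per the analysis. The sketch below assumes encounter times recorded in days; the handling of patients without any code is an illustrative convention, not the paper's.

```python
import numpy as np

def sparse_code_features(times_in_days):
    """For very sparse code groups, extract only the first occurrence time
    and the total code count (both log-transformed, per the analysis)."""
    t = np.asarray(times_in_days, dtype=float)
    if t.size == 0:
        # Convention for patients without any code is illustrative only.
        return {"first_code": np.nan, "logN": 0.0}
    return {"first_code": np.log(t.min()), "logN": np.log(t.size)}
```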
We show the MATA and NPMLE coefficients for n = 1000, 400, and 200 in Tables A7–A9. As in Section 4, our MATA estimator has smaller bootstrap standard errors than the NPMLE. For the analysis with n = 1000, both MATA and NPMLE showed a significant impact of the first arrival time and peak time of the lung cancer code; the first arrival time and first FPCA score of the chemotherapy code; the first arrival time of the radiotherapy code; the total number of secondary malignant neoplasm codes; the peak and change point times of palliative or hospice care in medical notes; the first FPCA score and total number of recurrence mentions in medical notes; and the first arrival time of biopsy or excision. MATA additionally finds the change point time of the lung cancer code to be strongly associated with a high risk of lung cancer recurrence. Furthermore, MATA excludes the stage II cancer indicator, which coincides with its large p-value under NPMLE. For the analyses with n = 200 and n = 400, MATA excludes cancer stage, age at diagnosis, and medication for systemic therapies, which coincides with the groups lacking any significant feature in the n = 1000 NPMLE analysis.
Table A7.
Analysis with n = 1000. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval"), based on 400 bootstrap samples, for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All time-related features are log-transformed. The results for the proposed MATA estimator are given in the left panel and those of the NPMLE in the right panel.
| Group | Feature | MATA | | | NPMLE | | |
|---|---|---|---|---|---|---|---|
| | | mean | boot.se | pval | mean | boot.se | pval |
| Stage II | | – | – | – | 0.075 | 0.144 | 0.604 |
| Stage III | | 0.168 | 0.168 | 0.319 | 0.160 | 0.181 | 0.379 |
| Age | | 0.013 | 0.008 | 0.111 | 0.013 | 0.007 | 0.069 |
| Lung Cancer | 1stCode | −0.277 | 0.116 | 0.017 | −0.294 | 0.116 | 0.011 |
| | Pk | 0.213 | 0.084 | 0.012 | 0.213 | 0.089 | 0.016 |
| | ChP | 0.135 | 0.065 | 0.040 | 0.131 | 0.068 | 0.054 |
| | 1stScore | −0.091 | 0.183 | 0.619 | −0.028 | 0.204 | 0.891 |
| | logN | 0.072 | 0.108 | 0.502 | 0.070 | 0.121 | 0.561 |
| Chemotherapy | 1stCode | −0.140 | 0.065 | 0.032 | −0.146 | 0.067 | 0.029 |
| | Pk | −0.162 | 0.106 | 0.127 | −0.169 | 0.111 | 0.127 |
| | ChP | 0.019 | 0.067 | 0.773 | 0.019 | 0.073 | 0.799 |
| | 1stScore | 0.652 | 0.180 | < 0.001 | 0.678 | 0.188 | < 0.001 |
| | logN | 0.073 | 0.092 | 0.424 | 0.076 | 0.103 | 0.463 |
| CT scan | 1stCode | 0.020 | 0.076 | 0.789 | 0.017 | 0.093 | 0.858 |
| | Pk | 0.104 | 0.093 | 0.262 | 0.115 | 0.103 | 0.266 |
| | ChP | 0.046 | 0.043 | 0.286 | 0.047 | 0.048 | 0.329 |
| | 1stScore | −0.244 | 0.132 | 0.065 | −0.266 | 0.131 | 0.042 |
| | logN | −0.019 | 0.096 | 0.847 | −0.034 | 0.112 | 0.763 |
| Radiotherapy | 1stCode | −0.327 | 0.157 | 0.037 | −0.382 | 0.163 | 0.019 |
| | logN | −0.057 | 0.056 | 0.311 | −0.068 | 0.062 | 0.275 |
| Secondary Malignant Neoplasm | 1stCode | 0.013 | 0.127 | 0.921 | −0.008 | 0.141 | 0.954 |
| | Pk | −0.135 | 0.113 | 0.230 | −0.130 | 0.126 | 0.299 |
| | ChP | −0.067 | 0.049 | 0.168 | −0.067 | 0.054 | 0.217 |
| | 1stScore | −0.197 | 0.122 | 0.105 | −0.205 | 0.128 | 0.109 |
| | logN | 0.333 | 0.077 | < 0.001 | 0.335 | 0.079 | < 0.001 |
| Palliative or Hospice Care | 1stCode | −0.055 | 0.085 | 0.517 | −0.066 | 0.089 | 0.457 |
| | Pk | −0.942 | 0.187 | < 0.001 | −1.009 | 0.205 | < 0.001 |
| | ChP | −0.704 | 0.140 | < 0.001 | −0.753 | 0.153 | < 0.001 |
| | 1stScore | 0.068 | 0.095 | 0.470 | 0.070 | 0.098 | 0.471 |
| | logN | 0.017 | 0.061 | 0.785 | 0.002 | 0.064 | 0.979 |
| Recurrence | 1stCode | 0.121 | 0.081 | 0.138 | 0.122 | 0.084 | 0.147 |
| | Pk | −0.105 | 0.093 | 0.259 | −0.099 | 0.097 | 0.310 |
| | ChP | −0.046 | 0.058 | 0.426 | −0.042 | 0.060 | 0.479 |
| | 1stScore | −0.281 | 0.119 | 0.018 | −0.288 | 0.122 | 0.018 |
| | logN | 0.234 | 0.076 | 0.002 | 0.255 | 0.075 | < 0.001 |
| Medication | 1stCode | 0.173 | 0.118 | 0.143 | 0.185 | 0.113 | 0.104 |
| | logN | 0.062 | 0.071 | 0.384 | 0.071 | 0.081 | 0.380 |
| Biopsy | 1stCode | −0.865 | 0.411 | 0.035 | −0.968 | 0.399 | 0.015 |
| | logN | −0.423 | 0.502 | 0.399 | −0.478 | 0.523 | 0.360 |
Table A8.
Analysis with n = 400. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval"), based on 400 bootstrap samples, for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All time-related features are log-transformed. The results for the proposed MATA estimator are given in the left panel and those of the NPMLE in the right panel.
| Group | Feature | MATA | | | NPMLE | | |
|---|---|---|---|---|---|---|---|
| | | mean | boot.se | pval | mean | boot.se | pval |
| Stage II | | – | – | – | 0.067 | 0.254 | 0.790 |
| Stage III | | – | – | – | 0.189 | 0.349 | 0.587 |
| Age | | – | – | – | 0.014 | 0.012 | 0.264 |
| Lung Cancer | 1stCode | −0.232 | 0.178 | 0.192 | −0.311 | 0.189 | 0.101 |
| | Pk | 0.191 | 0.133 | 0.150 | 0.232 | 0.144 | 0.108 |
| | ChP | 0.115 | 0.106 | 0.279 | 0.133 | 0.117 | 0.258 |
| | 1stScore | −0.098 | 0.266 | 0.712 | −0.075 | 0.332 | 0.821 |
| | logN | 0.078 | 0.163 | 0.633 | 0.074 | 0.203 | 0.715 |
| Chemotherapy | 1stCode | −0.120 | 0.109 | 0.270 | −0.150 | 0.126 | 0.232 |
| | Pk | −0.140 | 0.176 | 0.428 | −0.181 | 0.209 | 0.387 |
| | ChP | 0.001 | 0.096 | 0.991 | 0.004 | 0.122 | 0.975 |
| | 1stScore | 0.607 | 0.288 | 0.035 | 0.719 | 0.311 | 0.021 |
| | logN | 0.064 | 0.139 | 0.643 | 0.064 | 0.174 | 0.714 |
| CT scan | 1stCode | 0.017 | 0.121 | 0.886 | 0.014 | 0.160 | 0.933 |
| | Pk | 0.068 | 0.143 | 0.634 | 0.110 | 0.179 | 0.538 |
| | ChP | 0.038 | 0.071 | 0.589 | 0.050 | 0.089 | 0.571 |
| | 1stScore | −0.207 | 0.204 | 0.310 | −0.291 | 0.222 | 0.190 |
| | logN | −0.019 | 0.151 | 0.897 | −0.037 | 0.188 | 0.844 |
| Radiotherapy | 1stCode | −0.229 | 0.221 | 0.301 | −0.345 | 0.248 | 0.165 |
| | logN | −0.027 | 0.086 | 0.749 | −0.058 | 0.109 | 0.595 |
| Secondary Malignant Neoplasm | 1stCode | −0.019 | 0.172 | 0.913 | −0.035 | 0.234 | 0.881 |
| | Pk | −0.125 | 0.163 | 0.444 | −0.119 | 0.211 | 0.575 |
| | ChP | −0.065 | 0.072 | 0.366 | −0.063 | 0.092 | 0.490 |
| | 1stScore | −0.207 | 0.182 | 0.257 | −0.224 | 0.219 | 0.307 |
| | logN | 0.302 | 0.128 | 0.018 | 0.343 | 0.134 | 0.011 |
| Palliative or Hospice Care | 1stCode | −0.076 | 0.137 | 0.580 | −0.091 | 0.160 | 0.567 |
| | Pk | −0.845 | 0.248 | < 0.001 | −0.936 | 0.276 | < 0.001 |
| | ChP | −0.631 | 0.185 | < 0.001 | −0.699 | 0.206 | < 0.001 |
| | 1stScore | 0.054 | 0.126 | 0.670 | 0.067 | 0.143 | 0.641 |
| | logN | 0.040 | 0.092 | 0.663 | 0.015 | 0.105 | 0.889 |
| Recurrence | 1stCode | 0.089 | 0.116 | 0.443 | 0.125 | 0.134 | 0.351 |
| | Pk | −0.114 | 0.139 | 0.412 | −0.103 | 0.161 | 0.521 |
| | ChP | −0.055 | 0.085 | 0.519 | −0.046 | 0.099 | 0.642 |
| | 1stScore | −0.229 | 0.176 | 0.193 | −0.280 | 0.197 | 0.154 |
| | logN | 0.199 | 0.122 | 0.104 | 0.266 | 0.124 | 0.033 |
| Medication | 1stCode | – | – | – | 0.201 | 0.188 | 0.284 |
| | logN | – | – | – | 0.061 | 0.155 | 0.693 |
| Biopsy | 1stCode | −0.814 | 0.689 | 0.238 | −1.127 | 0.734 | 0.125 |
| | logN | −0.363 | 0.811 | 0.654 | −0.559 | 0.989 | 0.572 |
Table A9.
Analysis with n = 200. Estimated coefficient ("est"), bootstrap standard error ("boot.se"), and p-value ("pval"), based on 400 bootstrap samples, for the extracted feature sets, including first code time (1stCode), peak time (Pk), change point time (ChP), first FPC score (1stScore), and log of total number of codes (logN), from the nine groups of medical encounter data in Section 5. For each group, its group p-value ("group pval") is calculated via a chi-square test. All time-related features are log-transformed. The results for the proposed MATA estimator are given in the left panel and those of the NPMLE in the right panel.
| Group | Feature | MATA | | | NPMLE | | |
|---|---|---|---|---|---|---|---|
| | | mean | boot.se | pval | mean | boot.se | pval |
| Stage II | | – | – | – | 0.102 | 0.393 | 0.795 |
| Stage III | | – | – | – | 0.161 | 0.549 | 0.769 |
| Age | | – | – | – | 0.014 | 0.019 | 0.465 |
| Lung Cancer | 1stCode | −0.223 | 0.266 | 0.401 | −0.369 | 0.318 | 0.246 |
| | Pk | 0.188 | 0.190 | 0.323 | 0.270 | 0.220 | 0.220 |
| | ChP | 0.112 | 0.148 | 0.451 | 0.160 | 0.180 | 0.375 |
| | 1stScore | −0.102 | 0.390 | 0.793 | −0.080 | 0.553 | 0.885 |
| | logN | 0.072 | 0.244 | 0.767 | 0.065 | 0.325 | 0.843 |
| Chemotherapy | 1stCode | −0.103 | 0.143 | 0.471 | −0.170 | 0.212 | 0.423 |
| | Pk | −0.119 | 0.218 | 0.585 | −0.206 | 0.331 | 0.534 |
| | ChP | 0.006 | 0.160 | 0.972 | 0.027 | 0.243 | 0.913 |
| | 1stScore | 0.530 | 0.409 | 0.195 | 0.764 | 0.516 | 0.139 |
| | logN | 0.056 | 0.184 | 0.759 | 0.064 | 0.282 | 0.822 |
| CT scan | 1stCode | 0.007 | 0.165 | 0.965 | 0.008 | 0.252 | 0.976 |
| | Pk | 0.068 | 0.196 | 0.730 | 0.116 | 0.276 | 0.674 |
| | ChP | 0.037 | 0.109 | 0.730 | 0.055 | 0.143 | 0.700 |
| | 1stScore | −0.188 | 0.292 | 0.520 | −0.321 | 0.366 | 0.380 |
| | logN | −0.016 | 0.229 | 0.944 | −0.047 | 0.314 | 0.881 |
| Radiotherapy | 1stCode | −0.207 | 0.314 | 0.509 | −0.359 | 0.405 | 0.376 |
| | logN | −0.029 | 0.114 | 0.798 | −0.059 | 0.163 | 0.718 |
| Secondary Malignant Neoplasm | 1stCode | −0.036 | 0.273 | 0.896 | −0.056 | 0.415 | 0.893 |
| | Pk | −0.095 | 0.232 | 0.683 | −0.118 | 0.349 | 0.735 |
| | ChP | −0.051 | 0.101 | 0.609 | −0.065 | 0.148 | 0.660 |
| | 1stScore | −0.161 | 0.248 | 0.516 | −0.197 | 0.348 | 0.571 |
| | logN | 0.258 | 0.185 | 0.162 | 0.338 | 0.207 | 0.102 |
| Palliative or Hospice Care | 1stCode | −0.090 | 0.197 | 0.647 | −0.102 | 0.267 | 0.703 |
| | Pk | −0.726 | 0.384 | 0.059 | −0.928 | 0.446 | 0.037 |
| | ChP | −0.542 | 0.287 | 0.059 | −0.692 | 0.334 | 0.038 |
| | 1stScore | 0.020 | 0.179 | 0.912 | 0.034 | 0.268 | 0.899 |
| | logN | 0.041 | 0.124 | 0.740 | 0.024 | 0.173 | 0.890 |
| Recurrence | 1stCode | 0.070 | 0.181 | 0.697 | 0.131 | 0.247 | 0.598 |
| | Pk | −0.094 | 0.183 | 0.608 | −0.105 | 0.240 | 0.661 |
| | ChP | −0.042 | 0.112 | 0.705 | −0.044 | 0.148 | 0.767 |
| | 1stScore | −0.237 | 0.264 | 0.369 | −0.332 | 0.338 | 0.326 |
| | logN | 0.180 | 0.177 | 0.309 | 0.284 | 0.193 | 0.141 |
| Medication | 1stCode | – | – | – | 0.174 | 0.321 | 0.589 |
| | logN | – | – | – | 0.059 | 0.230 | 0.798 |
| Biopsy | 1stCode | −0.741 | 0.890 | 0.405 | −1.223 | 1.155 | 0.289 |
| | logN | −0.467 | 1.551 | 0.763 | −0.876 | 2.311 | 0.705 |
Appendix C. Convergence Rate of Derived Features
Instead of deriving asymptotic properties for the truncated density fCi, i.e., the random density fi truncated on [0, Ci], we focus on the scaled density, which is rescaled to have support [0, 1]. As the censoring time Ci is assumed to have finite support [0, ℰ], the truncated and scaled densities share the same asymptotic properties.
Let $\mu^{\mathrm{scaled}}(t) = E\{f^{\mathrm{scaled}}_i(t)\}$ and $g^{\mathrm{scaled}}(s, t) = \mathrm{cov}\{f^{\mathrm{scaled}}_i(s), f^{\mathrm{scaled}}_i(t)\}$. The Karhunen–Loève theorem (Stark and Woods, 1986) states
$$f^{\mathrm{scaled}}_i(t) = \mu^{\mathrm{scaled}}(t) + \sum_{k=1}^{\infty} \xi_{ik}\, \phi_k(t),$$
where $\phi_k$ are the orthonormal eigenfunctions of $g^{\mathrm{scaled}}$, $\xi_{ik}$ are pair-wise uncorrelated random variables with mean 0 and variance $\lambda_k$, and $\lambda_1 \ge \lambda_2 \ge \cdots$ are the eigenvalues of $g^{\mathrm{scaled}}$.
For the i-th patient, conditional on the censoring time, the number of encounters, and the underlying density, the observed event times are assumed to be generated as an i.i.d. sample from the truncated density; equivalently, the scaled observed event times form an i.i.d. sample from the scaled density. Following Wu et al. (2013), we estimate $\mu^{\mathrm{scaled}}$ and $g^{\mathrm{scaled}}$, the mean and covariance functions of the scaled density, by kernel smoothing for t, s ∈ [0, 1]: the mean estimator pools all scaled encounter times across patients, normalized by the total number of encounters, and the covariance estimator pools all within-patient pairs of distinct scaled encounter times, normalized by the total number of such pairs. The univariate and bivariate smoothing kernels are symmetric probability density functions with associated bandwidth parameters.
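To make the pooled kernel estimators concrete, the following is a minimal sketch under simplifying assumptions (Gaussian kernels, one bandwidth per estimator, and encounter times already scaled to [0, 1]); the function and variable names are illustrative rather than the paper's implementation.

```python
import numpy as np

def gauss(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def estimate_mean_cov(scaled_times, grid, bw_mu=0.05, bw_g=0.10):
    """Pooled kernel estimates of the mean and covariance functions of the
    scaled encounter-time density, in the spirit of Wu et al. (2013).

    scaled_times : list of 1-d arrays, per-patient encounter times in [0, 1]
    grid         : 1-d array of evaluation points in [0, 1]
    """
    # Mean: pooled kernel density over all scaled encounter times,
    # normalized by the total number of encounters.
    all_t = np.concatenate([t for t in scaled_times if len(t) > 0])
    mu = gauss((grid[:, None] - all_t[None, :]) / bw_mu).mean(axis=1) / bw_mu

    # Raw second moment: pooled bivariate kernel over within-patient pairs
    # of distinct encounter times, normalized by the number of such pairs.
    raw = np.zeros((grid.size, grid.size))
    n_pairs = 0
    for t in scaled_times:
        if len(t) < 2:
            continue
        k = gauss((grid[:, None] - t[None, :]) / bw_g) / bw_g  # grid x events
        kt = k.sum(axis=1)
        raw += np.outer(kt, kt) - k @ k.T       # sum over pairs with l != l'
        n_pairs += len(t) * (len(t) - 1)
    cov = raw / n_pairs - np.outer(mu, mu)      # center by the estimated mean
    return mu, cov
```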
The estimates of the eigenfunctions and eigenvalues, denoted by $\hat\phi_k$ and $\hat\lambda_k$ respectively, solve the eigen-equations of the estimated covariance subject to orthonormality constraints. One can obtain the estimated eigenfunctions and eigenvalues by numerical spectral decomposition of a properly discretized version of the smooth covariance estimate (Rice and Silverman, 1991; Capra and Müller, 1997). Subsequently, we estimate the functional principal component scores by numerical integration of the centered observations against the estimated eigenfunctions.
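Continuing the previous sketch (reusing `gauss`, `numpy`, and the grid conventions), a minimal version of the spectral decomposition and score estimation might look as follows; the crude per-patient kernel density used in the score integral is a stand-in for the paper's estimator, and all names are illustrative.

```python
def fpca_from_cov(cov, grid, n_comp=3):
    """Eigenpairs via spectral decomposition of the discretized covariance
    surface (Rice and Silverman, 1991): cov(t_j, t_k) * dt approximates the
    integral operator on a uniform grid."""
    dt = grid[1] - grid[0]
    vals, vecs = np.linalg.eigh(cov * dt)
    order = np.argsort(vals)[::-1][:n_comp]      # leading eigenpairs
    lam = vals[order]
    phi = vecs[:, order] / np.sqrt(dt)           # L2([0, 1]) normalization
    return lam, phi

def fpca_scores(scaled_times, mu, phi, grid, bw=0.10):
    """Scores by numerical integration of the centered per-patient density
    estimate against the estimated eigenfunctions."""
    dt = grid[1] - grid[0]
    scores = np.zeros((len(scaled_times), phi.shape[1]))
    for i, t in enumerate(scaled_times):
        if len(t) == 0:
            continue                              # leave scores at zero
        fi = gauss((grid[:, None] - t[None, :]) / bw).mean(axis=1) / bw
        scores[i] = dt * phi.T @ (fi - mu)
    return scores
```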
Let the population counterpart be defined analogously with the true eigenfunctions. We show in Lemma A3 that the difference between the two vanishes at any k, provided the bandwidths and the truncation level satisfy the stated rate conditions.
We then estimate the scaled density by the truncated Karhunen–Loève expansion with the estimated mean function, eigenfunctions, and scores, and rescale it back to [0, Ci] to estimate the truncated density. For the i-th patient and its j-th point process, we only observe one realization of its expected number of encounters on [0, Ci]. Following Wu et al. (2013), we approximate the expected number of encounters by the observed number of encounters, and estimate the intensity function as the estimated truncated density multiplied by the observed count. We further estimate the derived features by applying the corresponding functionals to the estimated intensity.
For notational simplicity, we drop the superscript [j], the index for the j-th counting process, for j = 1, …, q throughout the appendix.
Derivative of the Mean and Covariance Functions
The nonparametric estimators of the mean and covariance functions of the scaled densities are those defined above, with the same symmetric univariate and bivariate kernels and the same bandwidth parameters.
Their derivatives are obtained by replacing the kernels with the corresponding kernel derivatives, for (ν, u) = (0, 1) and (1, 0), where (ν, u) indexes the partial derivatives of an arbitrary bivariate function.
Assume the following regularity conditions hold.

- (A1) The scaled random densities, their mean density $\mu^{\mathrm{scaled}}$, the covariance function $g^{\mathrm{scaled}}$, and the eigenfunctions are thrice continuously differentiable.
- (A2) The scaled densities and their first three derivatives are bounded, where the bounds hold uniformly across the set of random densities.
- (A3) The univariate and bivariate kernels are symmetric density functions satisfying the standard moment conditions in Wu et al. (2013).
- (A4) The Fourier transformations of the univariate and bivariate kernels satisfy the integrability conditions in Wu et al. (2013).
- (A5) The numbers of observations Mi for the j-th trajectory of the i-th subject are i.i.d. random variables that are independent of the densities fi and satisfy the moment conditions in Wu et al. (2013).
- (A6) The bandwidth parameters vanish at the rates required in Wu et al. (2013) as N → ∞.
- (A7) The Mi are i.i.d. positive random variables generated from a truncated-Poisson distribution with rate τN, such that pr(Mi = 0) = 0 and the required moment bounds hold.
- (A8) The corresponding random elements are mutually independent, and τN → ∞ as N → ∞, for j = 1, …, q.
- (A9) The number of eigenfunctions and functional principal components Ki is a random variable such that, for any tolerance, there exists a finite truncation level controlling the tail probability uniformly, for j = 1, …, q.
Lemma A1 Under the regularity conditions A1–A6, the following rates hold for the estimated mean density, the estimated covariance function, and their first derivatives:
| (A.1) |
| (A.2) |
| (A.3) |
| (A.4) |
| (A.5) |
| (A.6) |
Proof The proofs for the mean density and the covariance function can be found in Wu et al. (2013). Here we only give the proof for the derivative of the mean density function. The proof for the derivative of the covariance function is similar.
Under conditions A1 and A2, the bias of the derivative estimator admits a uniform bound; hence the bias term is of the stated order. Applying the inverse Fourier transformation of the kernel, we insert the resulting representation into the estimator and bound the remainder; the right-hand side of the resulting inequality is free of t, so the bound holds uniformly in t. As an intermediate result of the proof of Theorem 1 in Wu et al. (2013), the stochastic term satisfies the corresponding rate, which further leads to the uniform rate for the derivative of the mean density. Thus the claimed rate holds, and the remaining statements follow in the same way.
Derivative of the Eigenfunctions
Lemma A2 Under the regularity conditions A1–A6, the following rates hold for the estimated eigenvalues, eigenfunctions, and eigenfunction derivatives:
| (A.7) |
| (A.8) |
| (A.9) |
Proof The first two equations are a direct result of Theorem 2 in Yao et al. (2005). For (A.9), note that the derivative of the estimated eigenfunction can be expressed through the derivative of the estimated covariance and the eigenfunction itself. Thus, without loss of generality fixing the sign normalization of the eigenfunctions, the stated rate follows from Lemma A1 and the first two results.
Derivative of the Estimated Density Functions
Lemma A3 Under regularity conditions A1–A9, for any δ > 0, there exists an event with probability at least 1 − δ on which the following bounds hold:
| (A.10) |
| (A.11) |
| (A.12) |
Proof The existence of such an event for (A.10)–(A.11) is guaranteed by Theorem 3 in Wu et al. (2013). We follow their definition of the event and prove (A.12). Note that the error of the estimated density derivative decomposes into the truncation error and the estimation errors of the mean, eigenfunction, and score components. By Lemmas A1 and A2, each term vanishes as N → ∞, so the combined error is of the stated order. Furthermore, on the event, the bounds hold uniformly. Therefore (A.12) holds.
Peaks and Change Points
Assume the density is locally unimodal: its derivative has a unique zero, denoted by xi0, in a neighbourhood of xi0. Further assume the second derivative is bounded away from 0 in this neighbourhood, with the bound holding uniformly across subjects. Let the estimated peak be the solution of the estimated derivative equation closest to xi0. Then, by a Taylor expansion,
where the expansion is evaluated at an intermediate value between xi0 and the estimated peak.
Thus, the estimated peak converges to xi0. This further implies that the estimated peak is the only solution of the estimated derivative equation in the neighbourhood. In other words, there is a one-to-one correspondence between the estimated peak and the true peak, and the estimated peak converges to the true peak uniformly.
The derivation for the change point is similar; we only state the order of the absolute difference between the estimated change point and the true change point yi0.
Remark A1 For the peak and the change point, the approximation error decays faster than the required rate when the unlabeled data expand both in follow-up duration and in sample size. In that case, we may choose the smoothing and truncation parameters so that Assumption (C5) is satisfied.
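As an illustration of how the peak and change point can be read off an intensity estimate evaluated on a grid, consider the finite-difference sketch below. It is illustrative only: the paper differentiates the smoothed FPCA expansion analytically rather than numerically, and the function name is hypothetical.

```python
import numpy as np

def peak_and_change_point(intensity, grid):
    """Read the peak (argmax of the intensity) and the change point
    (argmax of the first derivative, i.e., the time of sharpest increase)
    off an intensity estimate evaluated on a grid."""
    deriv = np.gradient(intensity, grid)
    return grid[np.argmax(intensity)], grid[np.argmax(deriv)]
```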
Appendix D. B-spline Approximation and Profile-likelihood Estimation
Some Definitions on Vector and Matrix Norms
For any vector a, denote its Lq norm by ∥a∥q. For positive sequences an and bn, write an ≍ bn if an = c·bn asymptotically, where c is some nonzero constant. Denote the space of qth-order smooth functions by C(q). For any s × s symmetric matrix A, denote its Lq norm by ∥A∥q.
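Since the extracted symbols are incomplete, the standard conventions these definitions correspond to can be restated as follows; the specific choices (in particular, taking the matrix Lq norm as the induced operator norm) are assumptions.

```latex
\|a\|_q = \Bigl(\sum_{j=1}^{s} |a_j|^q\Bigr)^{1/q}, \qquad
a_n \asymp b_n \iff a_n = c\, b_n\,(1 + o(1)),\ c \neq 0, \qquad
\|A\|_q = \sup_{\|x\|_q = 1} \|A x\|_q .
```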
Some Definitions on Scores and Hessian Matrices
Define the score vectors and Hessian matrices associated with the log-likelihood ln, together with their population counterparts. For the penalized objective, define the quantity in (A.13).
Approximation Error from the Estimated Features
We first assess the approximation error from using the estimated features in ln. Once we establish the identifiability of ln in the proof of Lemma 1, the approximation of the losses translates to the approximation of their minimizers.
Lemma A4 Let the loss with the true features from the intensity functions be defined analogously to ln, and let Ω be a sufficiently large compact neighborhood of the true parameter. We have
| (A.14) |
Proof By the mean value theorem, we may express the difference of the two losses as in (A.15), evaluated at intermediate values between the estimated features and the true features Zi. Since the event indicator is binary and the parameter is bounded in the compact set Ω, we have (A.16). For T2, we apply the bounds for the feature estimation errors along with the boundedness of the link function, yielding (A.17). Thus, we obtain (A.14) by applying (A.16) and (A.17) to (A.15).
In the following, we establish the consistency and asymptotic normality of our procedure.
Proof of Lemma 1
Proof By Lemma A4, the loss with estimated features deviates from the loss with true features by at most the rate in (A.14). Under Assumption (C5), this error decays faster than the estimation rate. Thus, if either loss produces an estimator identifying the true parameter at that rate, both losses produce asymptotically equivalent consistent estimators. We focus on the analysis of the loss with true features in the following.
For the smooth baseline function $\alpha_0 \in C^{(q)}$, there exists a spline coefficient vector $\gamma^{*}$ such that
$\sup_{t \in [0, \mathcal{E}]} \lvert B_r(t)^{\top}\gamma^{*} - \alpha_0(t) \rvert = O(h^{q})$, (A.18)
where h is the distance between neighboring knots (de Boor, 2001). In the following, we prove the results for the nonparametric estimator in Theorem 1 at the true β. The results then also hold when β is replaced by a root-n-consistent estimator, since the nonparametric convergence rate in Theorem 1 is slower than root-n. We will show that for any given ϵ > 0, for n sufficiently large, there exists a large constant such that
| (A.19) |
This implies that for n sufficiently large, with probability at least 1 − 6ϵ, there exists a local maximum for (2) in the stated ball, and hence a local maximizer within the claimed rate. Note that the Hessian of the objective decomposes into three terms: the first term is negative-definite, and the last two terms are also negative-definite by the Cauchy–Schwarz inequality; hence the Hessian is negative-definite. Thus, the objective is a concave function, so the local maximizer is the global maximizer of (2), which establishes the claimed convergence.
By Taylor expansion, we have
| (A.20) |
where for some . Moreover,
where
Recall that SC(·) and fC(·) are the survival and density functions of the censoring time, respectively; we then have the corresponding expression. In the following, all integrals are calculated on [0, ℰ], unless otherwise specified.
Thus, the first moment is of the stated order. Further, the second moment is bounded, for some constant, by Condition (C4). By Condition (C3), the variance bound then holds for some constant 0 < C1 < ∞. Then for any ϵ > 0, Chebyshev's inequality gives, equivalently,
| (A.21) |
Moreover, by (A.18), the spline approximation bias is controlled. Denote the corresponding centered sum; its summands are bounded, for a constant, under Condition (C4). Therefore, its variance is bounded for a constant 0 < C2 < ∞. Again by Chebyshev's inequality, we have
| (A.22) |
Combining (A.21) and (A.22), with probability at least 1 − 5ϵ,
| (A.23) |
Moreover, Lemma A5 implies there exists a constant 0 < C3 < ∞ such that the corresponding bound holds for n sufficiently large, with probability approaching 1. Thus, for any ϵ > 0, with probability at least 1 − ϵ,
| (A.24) |
Therefore, by (A.20), (A.23) and (A.24), for n sufficiently large, with probability at least 1 − 6ϵ,
when the constant is chosen sufficiently large. This shows (A.19). Hence, the claimed rate holds under Condition (C3).
It is easily seen that, for a constant and any fixed argument, by Bernstein's inequality under Condition (C3), we have
Also, it is easy to check that
Thus, combining with Lemmas A7–A8, we have
| (A.25) |
where the inequality above uses the fact that, for an arbitrary u, only r elements of Br(u) are non-zero.
Let the relevant normalized quantities be defined accordingly. By the Central Limit Theorem,
where the limiting mean and variance are as stated. With Lemmas A7 and A9, we obtain the corresponding bounds for some constants. So there exist constants such that, with probability approaching 1 and for n large enough,
| (A.26) |
Thus the bound holds uniformly in u ∈ [0, ℰ], and hence
uniformly in u ∈ [0, ℰ] as well.
By Taylor expansion,
| (A.27) |
Thus by (A.25), (A.26), (A.27) and Condition (C3),
Therefore, by Slutsky's theorem, the convergence holds uniformly. Consequently, the corresponding expansion holds uniformly as well. By Slutsky's theorem and Condition (C3), we have
Proof of Lemma 2
Proof Because the Hessian is negative definite, a similar but simpler derivation than that for Theorem 1 can be used to show the consistency of the maximizer.
Because the score vanishes at the maximizer, we have
so
| (A.28) |
where r1 is a residual term of smaller order componentwise. Note that the Hessian estimate converges uniformly elementwise. Hence,
Subsequently, we have
and
Here we use two facts: the former is a direct corollary of Lemma A5 and the latter is shown in Lemma A8. Therefore, the two stated rates follow.
By Taylor expansion,
| (A.29) |
where r2 is a residual term of smaller order componentwise. We claim that the residual term r2 satisfies the two stated bounds. This is because
and
which leads to the claimed order of the residual r2 in (A.29).
We further use a Taylor expansion to write
where the residual term r in the second-to-last equality satisfies the analogous bounds.
Plugging this and (A.28) into (A.29), and recalling that
we get
It is straightforward to check that
and we already have the corresponding bound. Thus,
By the Central Limit Theorem, asymptotic normality follows, where Σ is given in Theorem 2.
Proof of Theorem 1
Proof We prove the theorem in two steps. First, we derive the asymptotic distribution of the solution by restricting θ to the oracle group selection set 𝒮. Then we validate that this solution satisfies the optimality conditions of the original problem (6). Without loss of generality, we rearrange the order of the covariates by moving the nonzero groups to the front, which simplifies the notation. We denote the Hessian and its limit
and the sub-matrix notation: subscripts select rows, columns, or both rows and columns in 𝒮. We denote the variance of the score accordingly.
Define the oracle selection subspace and the estimator under oracle selection
| (A.30) |
Since 𝒮 contains only groups with nonzero coefficients, and the initial estimator is consistent by Lemma 2, the denominators in the penalty terms in (A.30) are bounded away from zero. Then, with an appropriate choice of the tuning parameter, we have the solution as
Using the identity
we may derive the estimation error of as
| (A.31) |
In the proof of Lemma 2, we have established the asymptotic normality of the score and the consistency of the Hessian:
| (A.32) |
Applying (A.32) to (A.31), we obtain
Profiling out γ components as in Lemma 2, we have
| (A.33) |
The optimality condition for original problem (6) is
| (A.34) |
The oracle selection estimator must satisfy the conditions in (A.34) for positions in R𝒮 by the same set of optimality conditions for (A.30). We only need to verify that it also satisfies the conditions in (A.34) for positions in 𝒮c. By Lemma 2 and the definition of 𝒮, we have
| (A.35) |
For groups in 𝒮c, the penalty factor for a zero group g is
| (A.36) |
By definition, the 𝒮c components of the oracle estimator are all zero,
| (A.37) |
Combining (A.35)–(A.37), we establish that the optimality conditions in (A.34) hold asymptotically for g : β0,[g] = 0, i.e., all elements in 𝒮c. Therefore, we conclude that the oracle estimator solves the original problem with large probability. The asymptotic distribution in (A.33) is thus the asymptotic distribution of the proposed estimator.
Proof of Corollary 1
Proof Using the delta method, it is seen that the transformed quantity admits a linear expansion. Applying the asymptotic normality established above along with Assumption (C5), we conclude that the resulting estimator is asymptotically normal.
Appendix E. Matrix Norms
Lemma A5 There exist constants 0 < c < C < ∞ such that, for n sufficiently large, with probability approaching 1,
where the vector is arbitrary with unit norm. Furthermore, for an arbitrary a ∈ ℝPn,
Proof We only prove the first result; the second can be obtained similarly. We have
| (A.38) |
for positive constants by conditions (C1) and (C4).
Following a similar proof, we can further obtain
| (A.39) |
for some constant , because is an r-banded matrix with diagonal and jth off-diagonal elements of order O(h) uniformly elementwise, for j = 1, …, r − 1, and 0 elsewhere.
Combining (A.38) and (A.39), we have
Next, we investigate the order of . We have
with probability 1 as n → ∞, where the leading factors are constants. Here, for an arbitrary matrix A, we use Ajk to denote its element in the jth row and the kth column. In the above inequalities, we use the fact that B-spline basis functions are all non-negative and each is non-zero on no more than r consecutive intervals formed by the knots.
On the other hand,
Hence, .
Therefore, Lemma A5 holds for c = min(c1, c2) and C = max(C1, C2).
Lemma A6
Proof Similarly to the previous derivations,
| (A.40) |
where the second term in the last equality is obtained using the Central Limit Theorem together with the fact that the matrices above are banded to the first order. Specifically, the matrix has diagonal and jth off-diagonal elements of the stated order for j = 1, …, r − 1, and all other elements of smaller order. Further,
again because the matrices are banded to the first order. In fact, for an arbitrary vector a ∈ ℝPn,
where are constants.
Lemma A7 There exist constants 0 < c, C < ∞ such that, for n sufficiently large, with probability approaching 1,
Proof We have , where
is a banded matrix with each nonzero element of order O(1) uniformly and
is a matrix with all elements of order O(1) uniformly. It is easily seen that the first matrix is positive definite, and the second is positive semi-definite.
According to Demko (1977) and Theorem 4.3 in Chapter 13 of DeVore and Lorentz (1993), we have
for some constant. Furthermore, there exist constants and 0 < λ < 1 such that the elementwise bound holds for j, k = 1, …, Pn.
We want to show that
is bounded. As a result,
Denote the inverse matrix. There exists a constant 0 < κ < ∞ such that the elementwise bound holds for j, k = 1, …, Pn. Hence,
where .
A similar derivation as before shows that there exists some constant such that the bound holds for arbitrary indices. Hence,
where .
In the following, we will use induction to show that
where with Jij = 1 if j − i = 1 and Jij = 0 otherwise. Here .
When Pn = 2,
Similarly, we have
Assume the result holds for 2, …, Pn − 1. Then for Pn, with the corresponding notation, we have
and
where and .
Therefore,
where the numerator converges to 2κ4exp(κ2κ4) as h → 0, or equivalently, Pn → ∞. Here, in the first equation above, we use the fact that the (j, k)th element of the inverse matrix is the determinant of the matrix without its jth column and kth row, divided by the determinant of the matrix itself. Specifically, when j = k, the absolute value of that (j, k)th element follows directly; when j ≠ k, with certain column operations, we obtain the stated bound.
Now it remains to show that there exists , such that
for Pn sufficiently large. This can be seen from
Thus the result holds.
Therefore, we have for some constant .
On the other hand, we have
for some constant 0 < c < ∞ (Horn and Johnson, 1990; Golub and Van Loan, 1996).
The proof for the second bound is similar and hence omitted.
Lemma A8
Proof According to Lemmas A6–A7, we have
Lemma A9 There exist constants 0 < c < C < ∞ such that, for n sufficiently large, with probability approaching 1, for arbitrary vectors,
Proof We have
for some constant 0 < C′ < ∞. A similar derivation leads to
for constants 0 < c < C < ∞.
Appendix F. Algorithm for optimization
Here we provide the detailed algorithm for optimization of ln.
1. Obtain the initial estimator of β from the U-statistic estimating equation (Cheng et al., 1995), in which the Kaplan–Meier or empirical distribution estimator is used for the censoring time distribution. We solve the equation by the classical Newton's method. Then, we calculate the initial estimator of the baseline function by solving the corresponding estimating equation.
2. Update β and the baseline function from their initial estimators with the alternative B-spline approximation. Setting the initial values, we perform the iterative algorithm:
- Calculate the required integrals and update α by the Breslow-type estimator.
- Update β by the pseudo logistic regression (see the sketch after this algorithm). If Δi = 0, observation i contributes one entry to the pseudo data, with its offset; if Δi = 1, observation i contributes two entries to the pseudo data, both with offsets. The fitted coefficient gives the updated β.
During this step, we compute the integrals once at initiation and reuse them throughout. The parameters at convergence are taken as the step-2 estimates.
3. Obtain the final MLE estimators. We use the step-2 estimate as the initial value for β and calculate the initial value for γ from a linear regression. We perform the iterative algorithm:
- Update γ by the one-step Newton's method.
- Update β by the pseudo logistic regression as in Step 2.
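To illustrate the pseudo-logistic-regression update, the sketch below fits a logistic regression with fixed offsets using statsmodels. It assumes the pseudo data (one row per censored subject, two rows per event, with offsets carrying the current baseline values) have already been constructed as described above; the helper name is illustrative, not the paper's code.

```python
import statsmodels.api as sm

def update_beta(pseudo_y, pseudo_Z, pseudo_offset):
    """One beta-update: logistic regression of the binary pseudo-responses
    on the features, with the current baseline values held fixed in the
    offset term."""
    fit = sm.GLM(pseudo_y, pseudo_Z,
                 family=sm.families.Binomial(),
                 offset=pseudo_offset).fit()
    return fit.params
```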
Contributor Information
Liang Liang, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Jue Hou, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Hajime Uno, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.
Kelly Cho, Massachusetts Veterans Epidemiology Research and Information Center, US Department of Veteran Affairs, Boston, MA, USA; Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA.
Yanyuan Ma, Department of Statistics, Penn State University, University Park, PA, USA.
Tianxi Cai, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
References
- Ahuja Y, Hong C, Xia Z, Cai T (2021) SAMGEP: A novel method for prediction of phenotype event times using the electronic health record. medRxiv, DOI 10.1101/2021.03.07.21253096
- de Boor C (2001) A Practical Guide to Splines. Springer, New York
- Capra WB, Müller HG (1997) An accelerated-time model for response curves. Journal of the American Statistical Association 92:72–83
- Cheng S, Wei L, Ying Z (1995) Analysis of transformation models with censored data. Biometrika 82:835–845
- Cheng S, Wei L, Ying Z (1997) Predicting survival probabilities with semi-parametric transformation models. Journal of the American Statistical Association 92:227–235
- Chubak J, Onega T, Zhu W, Buist DS, Hubbard RA (2015) An electronic health record-based algorithm to ascertain the date of second breast cancer events. Medical Care
- Dean C, Balshaw R (1997) Efficiency lost by analyzing counts rather than event times in Poisson and overdispersed Poisson regression models. Journal of the American Statistical Association 92:1387–1398
- Demko S (1977) Inverses of band matrices and local convergence of spline projections. SIAM Journal on Numerical Analysis 14:616–619
- DeVore RA, Lorentz GG (1993) Constructive Approximation, vol 303. Springer Science & Business Media
- Efron B (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics 7(1):1–26, DOI 10.1214/aos/1176344552
- Golub GH, Van Loan CF (1996) Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD
- Hassett MJ, Uno H, Cronin AM, Carroll NM, Hornbrook MC, Ritzwoller D (2015) Detecting lung and colorectal cancer recurrence using structured clinical/administrative data to enable outcomes research and population health management. Medical Care
- Horn RA, Johnson CR (1990) Matrix Analysis. Cambridge University Press
- Jin Z, Ying Z, Wei LJ (2001) A simple resampling method by perturbing the minimand. Biometrika 88(2):381–390
- Klein JP, Moeschberger ML (2006) Survival Analysis: Techniques for Censored and Truncated Data. Springer Science & Business Media
- Lawless JF (1987) Regression methods for Poisson process data. Journal of the American Statistical Association 82:808–815
- Nielsen J, Dean C (2005) Regression splines in the quasi-likelihood analysis of recurrent event data. Journal of Statistical Planning and Inference 134:521–535
- Rice JA, Silverman BW (1991) Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological) 53:233–243
- Royston P, Parmar MK (2002) Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21:2175–2197
- Shen X (1998) Proportional odds regression and sieve maximum likelihood estimation. Biometrika 85:165–177
- Stark H, Woods JW (1986) Probability, Random Processes, and Estimation Theory for Engineers. Prentice-Hall
- Uno H, Cai T, Pencina MJ, D'Agostino RB, Wei L (2011) On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 30:1105–1117
- Uno H, Ritzwoller DP, Cronin AM, Carroll NM, Hornbrook MC, Hassett MJ (2018) Determining the time of cancer recurrence using claims or electronic medical record data. JCO Clinical Cancer Informatics 2:1–10
- Wang H, Leng C (2007) Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102(479):1039–1048
- Wang H, Leng C (2008) A note on adaptive group lasso. Computational Statistics & Data Analysis 52(12):5277–5286
- Wu S, Müller HG, Zhang Z (2013) Functional data analysis for point processes with rare events. Statistica Sinica, pp 1–23
- Yao F, Müller HG, Wang JL (2005) Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100:577–590
- Younes N, Lachin J (1997) Link-based models for survival data with interval and continuous time censoring. Biometrics, pp 1199–1211
- Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, et al. (2015) Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association 22(5):993–1000, DOI 10.1093/jamia/ocv034
- Yu S, Chakrabortty A, Liao KP, Cai T, Ananthakrishnan AN, Gainer VS, et al. (2016) Surrogate-assisted feature extraction for high-throughput phenotyping. Journal of the American Medical Informatics Association 24(e1):e143–e149, DOI 10.1093/jamia/ocw135
- Zeng D, Lin D, Yin G (2005) Maximum likelihood estimation for the proportional odds model with random effects. Journal of the American Statistical Association 100:470–483
- Zhang Y, Hua L, Huang J (2010) A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scandinavian Journal of Statistics 37:338–354
- Zhang Y, Cai T, Yu S, Cho K, Hong C, Sun J, et al. (2019) High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nature Protocols 14(12):3426–3444, DOI 10.1038/s41596-019-0227-6
