Abstract
Patients’ longitudinal patterns of biomarker change are crucial indicators of disease progression. In this research, we apply functional principal component analysis (FPCA) techniques to extract these changing patterns and use them as predictors in landmark analysis models for dynamic prediction. The time-varying effects of risk factors along a sequence of landmark times are smoothed by a supermodel to borrow information from neighboring time intervals. This results in more stable estimates and a clearer demonstration of the time-varying effects. Compared with the traditional landmark analysis, simulation studies show that our proposed approach results in lower prediction error and higher AUC values, indicating a better ability to discriminate between subjects with different risk levels. We apply our method to data from the Framingham Heart Study, using longitudinal total cholesterol levels to predict future coronary heart disease (CHD) risk profiles. Our approach not only captures the overall trend of biomarker-related risk profiles, but also reveals distinct risk patterns that are not available from the traditional landmark analyses. Our results show that high cholesterol levels at younger ages are more harmful than those at older ages, demonstrating the importance of analyzing the age-dependent effects of total cholesterol on CHD risk.
Keywords: Landmark analysis, Functional principal components analysis, Longitudinal study, Survival analysis
1. Introduction
In many clinical settings, changing disease severity and rate of disease progression are accompanied by dynamic changes in longitudinal biomarker measurements. Therefore, it is important to use the longitudinal biomarker data collected during each patient’s follow-up visits to make real-time predictions of future disease events, a task known as dynamic prediction. Such real-time predictive information is critical for physicians and patients, who can then actively monitor the disease and possibly plan early treatment and prevention regimens. As a result, using longitudinal data to construct real-time prediction models has become prevalent in the medical literature. The traditional statistical approach for dynamic prediction is to develop a joint model that imposes parametric models on the longitudinal data; the problem is that such models may fit poorly for some real data. FPCA is a more flexible and robust alternative that requires no distributional assumptions. In this research, we evaluate simple and easy-to-interpret models, landmark analysis coupled with FPCA, and apply them to the dynamic prediction of cardiovascular disease (CVD).
CVD is the leading cause of morbidity and mortality among men and women in the United States [31]. It is a complex condition comprising a number of diseases related to atherosclerosis, including CHD. Factors used to assess the risk of CHD include the lipid profile, blood pressure, obesity, diabetes, and smoking. The incidence and prevalence of CHD increase exponentially with age, so selecting older people as the target population for prevention and therapy seems straightforward, even though the resulting reductions in CHD risk might be small. In fact, there is an ongoing debate about whether targeting classical risk factors in adults older than 70 is beneficial [26, 28]. Considerable evidence has shown that higher levels of total cholesterol (TC) are associated with increased risk of CHD in younger adults (40 to 69 years). However, the importance of hypercholesterolemia as a risk factor for CVD in the older age group remains controversial. Thus it is unknown whether TC or other risk factors are important targets for effective CHD prevention in elderly individuals.
Longitudinal biomarkers are often used to monitor patient health status in clinical practice. For example, the lipid profile is a good indicator of CVD risk. It is recommended that adults aged 20 and older have their serum cholesterol measured at least every 5 years. This entails a blood test called lipoprotein profiling. A TC value of less than 200 mg/dL is desirable, while a measure of 200–239 mg/dL is borderline high and 240 mg/dL or above is considered high. In order to use such information longitudinally in clinical practice, both landmark analysis and FPCA modeling are important techniques to handle the longitudinal effect of time-varying variables, such as TC values.
Survival models with time-dependent covariates describe the joint relationship between the follow-up time T of the occurrence of an event and covariates Z(t) by modeling the hazard h(t|Z(t)). The use of time-varying covariates typically assumes that Z(t) is available at all possible time points. In practice, however, we can only collect the covariate history at discrete measurement times and observe a patient’s vital status discretely. Estimating a patient’s prognosis from these observed data would require knowledge of the future values of Z(t), which involves integrating over the conditional distribution of the future covariate process given the history of Z(t). Traditionally, the analysis is therefore done by specifying a joint model for the failure event and the history of the time-dependent covariates. Such an approach yields a predictive model by conditioning on the history of time-dependent covariates and other fixed covariates to perform dynamic prediction [27, 39]. However, it usually requires parametric assumptions about the distribution and trajectory of the longitudinal biomarkers, which may not hold in real data. Moreover, joint models involve intensive computation and complicated modeling, making them inconvenient in practice.
In this article, we use landmark analysis models for dynamic prediction. Compared with the joint modeling approach, the landmark model is simpler to implement and provides easier clinical interpretation; as a result, it has been used extensively in medical research. First introduced by Anderson [1], landmark analysis dynamically adjusts predictive models for survival data as follow-up data accumulate. With this approach, simple proportional hazards models can capture the evolving trend of risk over time when modeling data with time-dependent covariates.
A traditional multivariate framework models data in the form of random vectors, whereas a functional data framework focuses on data in the form of curves, shapes, and images viewed as realizations of a stochastic process. Such data are intrinsically infinite dimensional, but measured discretely. Multivariate principal component analysis (PCA) is a useful tool for dimension reduction; FPCA extends this concept by reducing random trajectories to a set of functional principal component (FPC) scores for functional data analysis. Besides dimension reduction, FPCA extracts the dominant features of the random curves’ variation around their mean trend. The principal components of FPCA are curves indexed by continuous time, and can naturally be seen as the major modes of variation of the data. Early work on FPCA established its asymptotic properties [3, 9], and further development of the theory and applications was achieved by Boente [5]. Yao et al. proposed estimating the principal components via conditional expectation (PACE) for the analysis not only of regular grid data, but also of highly irregular or sparse data [41, 42]. More recent developments involve multivariate FPCA, which differs from one-dimensional FPCA in that it uses a multivariate functional Karhunen–Loève representation of the longitudinal data [8, 16, 19].
In the current study, we use landmark analysis alone or landmark analysis combined with FPCA to perform the survival model estimation. We find that the two approaches yield similar conclusions for estimating the risk of CHD associated with TC levels. However, the landmark-only model provides only the overall trend, which is a marginal TC effect. In contrast, the landmark model with FPCA can decompose this overall trend into different orthogonal patterns, from which we can interpret the overall effect in detail or perform further prediction analyses in the future. We identify three different risk profiles corresponding to three distinct CHD risk levels, which not only explain the overall trend effect, but also represent three scenarios we might encounter in clinical practice. An important finding is that higher TC was associated with greater risk of CHD in the younger adult age group (below about 60–65 years), whereas at older ages (> 65 years), TC levels were no longer associated with CHD risk. This provides additional evidence for the debate as to whether it is beneficial to treat older patients in order to lower TC levels. Our results are consistent with recent work by two groups [15, 28]. In addition, we use FPCA-based simulated data to evaluate the two models’ prediction performance. We show that the FPCA-based landmark analysis consistently has lower Brier scores (smaller prediction error) and higher values of the area under the receiver operating characteristic curve (AUC), indicating better discrimination ability.
The rest of this paper is organized as follows. Section 2 presents details of the statistical models that we addressed, landmark analysis and FPCA models. Section 3 presents simulation studies. Section 4 provides the results of the real data analysis. Section 5 concludes with a discussion.
2. Methods
2.1. Dataset Description
The Framingham Heart Study (FHS) is a family-based, ongoing prospective cohort study that investigates CVD-related risk factors across three generations of residents of the town of Framingham, Massachusetts. For the present study, we use the offspring cohort dataset, which contains longitudinal phenotype measurements, including TC, low density lipoprotein cholesterol (LDL-C), high density lipoprotein cholesterol (HDL-C), total triglycerides (TG), glucose, blood pressure, and body mass index, collected every five years over seven consecutive visits per participant. The age range represented in the data is 17 to 85 years, covering most of each subject’s adult life. We downloaded the datasets from the National Center for Biotechnology Information dbGaP, study accession numbers phs000007.v30.p11 and phs000342.v18.p11 [7].
To build the survival models, we used unrelated individuals and selected the age interval of 25 to 75 years from the offspring cohort. The total sample size is 1669: 110 (6.59%) participants experienced CHD events, and 1559 (93.41%) were censored.
2.2. Review of Landmark Analysis
In medical research, we often encounter covariates that change over time. The FHS dataset includes several such time-varying covariates, also called longitudinal biomarkers, including TC. Time-varying covariates allow researchers not only to explore associations, but also to pursue potential causal inference. Unfortunately, they also bring technical difficulties; for example, the choice of covariate form can easily introduce bias into the analysis. To this end, rather than building comprehensive joint models, we build landmark models for the time-varying covariates by constructing stacked datasets that contain all the relevant information. We can then use the constructed datasets to compute predictive probabilities: at a given moment, we select the individuals still at risk and make predictions of their future risk using only the information available at that moment. This process of making a data selection at a moment in time is called landmarking.
More specifically, we construct a series of datasets for five landmark ages, for example L = 30, 40, 50, 60, and 70 years, denoted as time points L. At each landmark time point, we fit a simple Cox model and use it to obtain a prediction of the participant’s overall survival probability. For each participant enrolled and followed up in the study before time point L, we use the last-value-carried-forward (LVCF) method to define the biomarker value at the landmark time, denoted by Z̃i(L), for i = 1, 2, · · · , n.
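The landmarking-plus-LVCF construction above can be sketched in a few lines of pandas. This is only an illustrative sketch, not the software used in this paper; the column names (`id`, `age`, `tc`, `event_age`, `event`) are hypothetical stand-ins for a long-format biomarker dataset.

```python
import pandas as pd

def make_landmark_dataset(long_df, landmark_age, horizon):
    """Build one landmark dataset: subjects still at risk at `landmark_age`,
    with the last biomarker value carried forward (LVCF)."""
    # keep subjects whose follow-up extends beyond the landmark age
    at_risk = long_df[long_df["event_age"] > landmark_age]
    # measurements taken at or before the landmark age
    history = at_risk[at_risk["age"] <= landmark_age]
    # LVCF: the most recent measurement per subject
    lvcf = history.sort_values("age").groupby("id").last().reset_index()
    lm = lvcf.rename(columns={"tc": "tc_lvcf"})
    lm["landmark"] = landmark_age
    # residual time since the landmark, administratively censored at the horizon
    lm["time"] = (lm["event_age"] - landmark_age).clip(upper=horizon)
    lm["status"] = ((lm["event"] == 1)
                    & (lm["event_age"] - landmark_age <= horizon)).astype(int)
    return lm[["id", "landmark", "tc_lvcf", "time", "status"]]
```

A simple Cox model can then be fitted to each such dataset, one per landmark age.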
Let Ti and Ci denote the event time and censoring time, respectively, both measured on the scale of each individual’s age. We assume Ci is independent of the biomarker measurement times and the event time Ti. Rather than observing Ti for all participants, we observe only Xi = min(Ti, Ci) and δi = I(Ti ≤ Ci). For participant i, at each landmark time L, we assume a Cox proportional hazards model that specifies λLi(s), the hazard function for Ti − L at time Ti − L = s, as follows,
λLi(s) = λL0(s) exp{γ⊤Wi + β(L) Z̃i(L)},  0 ≤ s ≤ thor,  (1)
where λL0(s) is a baseline hazard function, Wi denotes the baseline covariates, such as gender and smoking status, and γ is their coefficient vector. Z̃i(L) is a summary statistic of {Zi(t), 0 ≤ t ≤ L}, and thor is a horizon time to be specified by the data analyst. For example, Z̃i(L) could be the current biomarker value, i.e., Z̃i(L) = Zi(L); it could also be a recent changing slope of the biomarker. After separately fitting a regular Cox model for each landmark time point L, we stack these five datasets into one large dataset and fit it with a stratified Cox model, with each landmark dataset as a stratum, which we term a super model for convenience [36]. We expect the coefficients β(L) of the super model to depend on L through a smooth function. For example, we can impose smoothness in a simple way, such as a piecewise constant form, or we can bring more structure into the analysis by making the regression parameters β(L) a parametric function of L [36, 43]. This approach has several advantages. First, the super model can be fitted with standard software on the large, stacked dataset. Also, a simple Cox model can approximate a complex survival model with several time-varying effects together; such an approximation usually works well if the time-dependent effect does not vary too much over time [43]. In contrast, fitting the Cox models for each landmark time interval separately would ignore the overlaps between the landmark datasets. Without loss of generality, a polynomial form is adopted for the β(L) function [11, 32], and we specify a super model as follows:
λi(s | L) = λ0(s | L) exp{γ⊤Wi + β(L) Z̃i(L)},  (2)
where λ0(s | L) is the stratum-specific baseline hazard and β(L) is a fractional polynomial in L, for example a linear combination of log(L) and power terms of L. It would be inappropriate to test the statistical significance of the biomarker coefficients with the naive standard errors from such a super model fit; correct standard errors can be obtained with sandwich estimators that account for the “clustering” effect of the repeated data [23].
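Mechanically, fitting the super model amounts to stacking the per-landmark datasets and encoding β(L) through interactions of the biomarker with basis functions of L; the stratified Cox fit itself can then be done with any survival package, using the landmark label as the stratum variable. A sketch with hypothetical column names, assuming for illustration the basis β(L) = b0 + b1·log(L) + b2·L:

```python
import numpy as np
import pandas as pd

def stack_for_supermodel(lm_datasets):
    """Stack per-landmark datasets and add interaction columns so that the
    biomarker effect beta(L) = b0 + b1*log(L) + b2*L becomes one main
    effect plus two interaction terms in a single stratified Cox model."""
    stacked = pd.concat(lm_datasets, ignore_index=True)
    L = stacked["landmark"].astype(float)
    # one column per basis function of L, multiplied into the biomarker
    stacked["tc_x_logL"] = stacked["tc_lvcf"] * np.log(L)
    stacked["tc_x_L"] = stacked["tc_lvcf"] * L
    # 'landmark' is then passed as the stratum variable to the Cox fitter
    return stacked
```

The coefficients of `tc_lvcf`, `tc_x_logL`, and `tc_x_L` from the stratified fit recover b0, b1, and b2 of the smooth effect.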
2.3. Proposed Landmark Analysis with FPCs as Predictors
There are various ad hoc ways to use historical longitudinal biomarker values, such as using the current biomarker value or, when this is not available, the LVCF value; but this is hardly a systematic approach. In many situations, the features of the historical biomarker trajectory (its changing patterns) are more important for predicting a future event than the current value or a recent change in magnitude. Since we expect the changing patterns of TC to be important predictors of CHD risk, we use FPCA methods to capture them. First, FPCA is performed on the whole available dataset; then the FPC scores are obtained by integration over each separate landmark time interval; lastly, a Cox regression analysis following landmark principles uses these integrally obtained scores as predictors in the survival models.
To apply FPCA to our longitudinal data (see the review of the FPCA method in the Appendix), in brief, we use M eigenfunctions to project the infinite-dimensional space of covariate random trajectories onto a finite-dimensional one: a very good approximation can be obtained without losing too much variation. Through this decomposition, each eigenfunction ρk(t), where k = 1, · · · , M and t ∈ [0, τ], may be viewed as a pattern of deviation from the population mean over time, and the score γik describes, as a weight, how strongly the data from the ith subject follow the changing pattern ρk(t). Therefore, we can naturally use these low-dimensional FPC scores γik in place of the original data as new predictor variables in the survival analysis, to model the relationship between the survival time and the trajectory patterns. This improves our estimation, prediction, and interpretation in many ways.
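For intuition, the decomposition can be illustrated on densely observed curves sharing a common grid: eigendecompose the sample covariance and rescale so the discrete eigenvectors become L2-orthonormal eigenfunctions. This is only a dense-grid sketch; the PACE method used for sparse, irregular data is more involved.

```python
import numpy as np

def fpca_dense(Y, t):
    """FPCA for curves on a dense common grid.
    Y: (n_subjects, n_times) array; t: (n_times,) uniform grid.
    Returns mean, eigenvalues, eigenfunctions (columns), FPC scores."""
    dt = t[1] - t[0]                      # grid spacing (assumed uniform)
    mu = Y.mean(axis=0)                   # estimated mean function
    Yc = Y - mu                           # centered curves
    C = (Yc.T @ Yc) / (len(Y) - 1)        # sample covariance surface
    evals, evecs = np.linalg.eigh(C)      # eigh returns ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    # rescale discrete eigenvectors into L2-orthonormal eigenfunctions
    rho = evecs / np.sqrt(dt)
    lam = evals * dt
    # FPC scores: numerical integral of <centered curve, rho_k>
    scores = Yc @ rho * dt
    return mu, lam, rho, scores
```

For curves generated from a small number of components, the trailing eigenvalues are numerically zero and the leading scores reconstruct the centered curves.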
Given the mean estimate μ̂(t) and the chosen eigenfunction estimates ρ̂k(t), the FPC score estimates γ̂ik(L) are obtained by integrating all available information over each landmark time interval [0, L]; they capture the different trajectory patterns of different subjects and approximate the original data structure as closely as possible. These estimates are thus still functions of the landmark time L. The formula is
γ̂ik(L) = ∫0^L {Zi(t) − μ̂(t)} ρ̂k(t) dt,  (3)
where μ̂(t) and ρ̂k(t) are estimated from the whole dataset. Since Zi(t) denotes the longitudinal biomarker trajectory for subject i at time t ≥ 0, the substituted data can in general be a multivariate p-dimensional vector; here we use one longitudinal marker as an example. We then use the scores γ̂ik(L) as predictors, and the fitted survival model is as follows:
λLi(s) = λL0(s) exp{γ⊤Wi + β1 γ̂i1(L) + · · · + βM γ̂iM(L)},  (4)
where β1, · · · , βM are the regression coefficients for the M estimated FPC scores γ̂i1(L), · · · , γ̂iM(L). We choose M so that the overall variation explained is at least 95%; see the Appendix for details on how to choose M. After fitting the above model (4) for each L separately, we fit a super model, as below, to obtain smooth estimates of the regression coefficients via fractional polynomial functions,
λi(s | L) = λ0(s | L) exp{γ⊤Wi + β1(L) γ̂i1(L) + · · · + βM(L) γ̂iM(L)},  (5)
As before, each βk(L) is a fractional polynomial in L. The organization of the datasets is the same as previously described for model (2) in Section 2.2. Various smoothness constraints, such as splines and fractional polynomials, can be placed on βk(L). Binder et al. compared these two approaches by simulation and concluded that fractional polynomials better recover simpler functions, whereas splines better recover more complex functions [4]. Since βk(L) is believed to be a simple smooth function, fractional polynomials are preferable to splines in our application.
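Numerically, the scores in equation (3) are just integrals of the centered trajectory against each eigenfunction, truncated at the landmark age. A sketch for one densely observed subject (the trapezoidal rule is written out by hand to stay version-agnostic; inputs are arrays on a common grid):

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal rule for samples y on grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def landmark_fpc_scores(z, t, mu, rhos, L):
    """gamma_hat_ik(L) = integral_0^L {z(t) - mu(t)} rho_k(t) dt,
    approximated on the observed grid restricted to t <= L."""
    mask = t <= L
    centered = z[mask] - mu[mask]
    return np.array([_trapezoid(centered * rho[mask], t[mask]) for rho in rhos])
```

Truncating the integral at successive landmark ages yields the accumulating, landmark-dependent scores used as covariates in models (4) and (5).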
3. Simulation
In order to evaluate which approach provides better performance, the proposed landmark (LM) analysis coupled with FPCA or the landmark-only approach, we performed simulations to study their statistical prediction properties.
3.1. Simulation Setting
We performed a simulation study mimicking the FHS as a real data example (see the detailed analysis of the FHS data in the next section). Specifically, we adopted strategies similar to those of previous work by Wei’s group [7, 37]. First, we simulated a time-varying biomarker for subject i by using the first three TC FPCA components (γi1, γi2, γi3) in the following model: Yi(s) = μ(s) + γi1ρ1(s) + γi2ρ2(s) + γi3ρ3(s) + ε, where ρ1(s), ρ2(s), and ρ3(s) are the three estimated eigenfunctions, i = 1, 2, · · · , n, and s ∈ [25, τi] with τi ∈ [35, 75]. The recorded exam ages follow independent normal distributions with means [34.5, 42.7, 47.1, 50.3, 54.0, 58.1, 60.8] and standard deviations [9.0, 9.0, 9.1, 9.0, 9.0, 8.9, 8.9] for the seven visits, and ε ~ N(0, 0.1). The three FPC scores are independently simulated from normal distributions: γi1 ~ N(0, 32.3²), γi2 ~ N(0, 26.5²), and γi3 ~ N(0, 15.3²). The time to CHD is then generated according to the hazard λi(s) = λ0(s) exp(β1(s)γi1 + β2(s)γi2 + β3(s)γi3). We adopted the coefficient values from estimating the effects of TC on CHD risk using the FPCs as covariates in the Cox analysis of the FHS data.
We consider a common baseline hazard functional form λ0(s) = λs^(λ−1) exp(η), a Weibull form with λ = −1.5 and η = −7. We adjusted these parameter values so that the distribution of time to CHD closely resembles that of the real FHS data. We also modified the three estimated effects of the TC FPC scores based on the original estimates, setting (β1, β2, β3) = (−0.27, 0.30, −0.48), with sample size n = 1669 and Q = 500 replications. Without loss of generality, random censoring times are generated independently from a uniform distribution, and we set approximately 25% of the event times to be right-censored.
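The event-time generation step can be implemented by inverse-transform sampling: for a baseline hazard λ0(s) = λ s^(λ−1) exp(η), the cumulative baseline hazard is Λ0(s) = s^λ exp(η), and solving Λ0(T) exp(lp) = −log U for U ~ Uniform(0, 1) gives T in closed form. This inversion requires a positive shape parameter, so the sketch below uses illustrative positive values for the shape and for the score standard deviations, not the exact settings above.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cox_weibull(n, shape, eta, betas, score_sds):
    """Simulate (time, event, scores) from a Cox model with Weibull-type
    baseline hazard lambda0(s) = shape * s**(shape-1) * exp(eta)."""
    # FPC scores: independent mean-zero normals, one column per component
    scores = rng.normal(0.0, score_sds, size=(n, len(score_sds)))
    lp = scores @ np.asarray(betas)                     # linear predictor
    u = rng.uniform(size=n)
    # invert Lambda0(T) * exp(lp) = -log(U)
    T = (-np.log(u) / (np.exp(eta) * np.exp(lp))) ** (1.0 / shape)
    # independent uniform censoring
    C = rng.uniform(0.0, np.quantile(T, 0.9), size=n)
    return np.minimum(T, C), (T <= C).astype(int), scores
```

In the study itself, the uniform censoring bound is tuned so that roughly 25% of event times are right-censored.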
Compared with landmark-only models, LM coupled with FPCA may provide better estimation and prediction performance for functional data. To assess this, we conduct the above simulation and focus on predictive performance [17]. The Brier score is used to assess calibration by the mean squared difference between the true and predicted survival probabilities under the two models. To incorporate censoring, a weighted version of the Brier score is calculated using inverse probability of censoring weights, according to the following formula [25]:

BS(L, t) = (1/nL) Σ_{i ∈ R(L)} [ Ŝi(t | Zi(L))² I(TiL ≤ t, δi = 1)/Ĝ(TiL) + {1 − Ŝi(t | Zi(L))}² I(TiL > t)/Ĝ(t) ],
where nL denotes the size of the risk set R(L) at landmark time L; TiL = Ti − L; Ĝ(·) is the survival function of the censoring time, used for the inverse weights and estimated by the Kaplan–Meier method with CiL = Ci − L; and Ŝi(t | Zi(L)) is the predicted survival probability for individual i with biomarker variable Zi(L) at prediction time t, estimated by equation (1) in the landmark-only model or equation (4) in the LM coupled with FPCA model. As an example, for LM coupled with FPCA, Ŝi(t | Zi(L)) = exp{−Λ̂L0(t) exp(β̂1 γ̂i1(L) + · · · + β̂M γ̂iM(L))}, where Λ̂L0(t) = ∫0^t λ̂L0(u) du and λ̂L0(u) is the baseline hazard function. We used the R package “pec” to compute it [25].
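Once Ĝ has been estimated (e.g., by Kaplan–Meier applied to the censoring indicator), the weighted Brier score reduces to a short computation; the “pec” package does this internally. A minimal numpy sketch of the same idea, taking Ĝ as a user-supplied vectorized function that must be positive at the observed times:

```python
import numpy as np

def ipcw_brier(time, event, surv_pred, t, G):
    """Inverse-probability-of-censoring-weighted Brier score at horizon t.
    time, event: follow-up from the landmark and event indicator;
    surv_pred: predicted P(T > t) per subject; G: censoring survivor fn."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    s = np.asarray(surv_pred, dtype=float)
    # observed event before t: true outcome 0, weight 1/G(time)
    had_event = (time <= t) & (event == 1)
    # still under observation at t: true outcome 1, weight 1/G(t)
    at_risk = time > t
    contrib = np.where(had_event, s**2 / G(time), 0.0)
    contrib = contrib + np.where(at_risk, (1.0 - s)**2 / G(t), 0.0)
    # subjects censored before t contribute zero weight
    return float(contrib.mean())
```

With no censoring (G ≡ 1), this reduces to the ordinary mean squared difference between outcome indicators and predicted survival probabilities.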
To assess the discrimination ability of the two models, we compute the AUC values. At prediction time t, the time-dependent AUC can be written as [39, 40]:

AUC(L, t) = P{Ŝi(t | Zi(L)) < Ŝj(t | Zj(L)) | TiL ≤ t < TjL},
where Ŝi(t | Zi(L)) is the predicted survival probability for individual i with biomarker variable Zi(L) at prediction time t. At each evaluated time point, the AUC summarizes the sensitivity and specificity along the ROC curve and gives the power to discriminate between subjects with different risk levels, as defined by the above formula. As with the Brier score, a weighting method is applied to incorporate the censoring information, as in calculating the concordance index for survival models [33]. For simplicity, a nonparametric approach is used to compute the AUC values, analogous to computing Wilcoxon statistics for survival data [34]. We used the R package “tdROC” to perform the calculations [22].
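The Wilcoxon-type estimator compares case/control pairs at horizon t: cases have an observed event by t, controls are still event-free at t, and the AUC is the proportion of pairs in which the case is assigned the higher predicted risk (i.e., the lower predicted survival). A sketch that, for simplicity, omits the censoring weights applied in the actual calculation:

```python
import numpy as np

def td_auc(time, event, surv_pred, t):
    """Unweighted Wilcoxon-type estimate of time-dependent AUC at t:
    P(risk_case > risk_control), with ties counted as 1/2."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    s = np.asarray(surv_pred, dtype=float)
    cases = (time <= t) & (event == 1)     # observed event by t
    controls = time > t                     # still event-free at t
    sc, sn = s[cases], s[controls]
    if sc.size == 0 or sn.size == 0:
        return float("nan")
    # a pair is concordant when the case has the smaller predicted survival
    diff = sc[:, None] - sn[None, :]
    wins = (diff < 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (sc.size * sn.size))
```

An AUC of 0.5 corresponds to no discrimination; values near 1 indicate that events are consistently assigned higher predicted risk.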
3.2. Simulation Results
Compared with the landmark-only approach, the major difference of our proposed LM coupled with FPCA is that we apply a more comprehensive and statistically principled method to extract predictive features from the longitudinal biomarker data. The landmark-only option implements the method of LVCF. However, LM coupled with FPCA first uses the whole dataset to compute the FPC scores, then uses all available accumulative information up to the landmark time point for prediction. In order to assess such differences, we implemented the FPCA-based method to generate functional data, and then studied the empirical performance of the two approaches. Table 1 summarizes the predictive performance results.
Table 1.
Evaluation of predictive performance: Brier scores and AUC values
| L (age: years) | thor | LM-only Brier (1e-2) | LM-FPCA Brier (1e-2) | LM-only AUC (1e-1) | LM-FPCA AUC (1e-1) |
|---|---|---|---|---|---|
| 35 | 1 | 5.565 | 0.353 | 5.028 | 6.718 | 
| 35 | 3 | 5.846 | 0.424 | 5.033 | 6.778 | 
| 35 | 5 | 6.177 | 0.491 | 5.037 | 6.865 | 
| 50 | 1 | 5.565 | 0.353 | 5.030 | 6.725 | 
| 50 | 3 | 5.846 | 0.424 | 5.035 | 6.782 | 
| 50 | 5 | 6.177 | 0.490 | 5.039 | 6.868 | 
| 65 | 1 | 5.565 | 0.353 | 5.024 | 6.720 | 
| 65 | 3 | 5.846 | 0.424 | 5.029 | 6.778 | 
| 65 | 5 | 6.177 | 0.491 | 5.033 | 6.865 | 
| 75 | 1 | 5.565 | 0.353 | 5.034 | 6.715 | 
| 75 | 3 | 5.846 | 0.424 | 5.039 | 6.777 | 
| 75 | 5 | 6.177 | 0.490 | 5.043 | 6.863 | 
Prediction landmark times were set at L = 35, 50, 65, and 75 years of exam age, and three prediction horizon times were investigated, thor = t − L = 1, 3, and 5 years. Table 1 shows that, compared with the landmark-only model, the LM coupled with FPCA model produces much smaller prediction error, on average tenfold lower in Brier score, across all settings. At the same time, the AUC values of the proposed model are about 30% higher than those of the landmark-only model in all settings. This indicates that our proposed functional approach performs much better than the LVCF-based method in terms of both calibration and discrimination. One can infer from these simulation studies that, for functional data, especially on sparse and irregular grids or with irregularity caused by frequent missingness, LVCF-based methods misspecify the model and can yield biased coefficient estimates and inaccurate predictions. Functional-approach-based models can account for such irregularity, minimize its adverse effects, retain model flexibility, and achieve better performance.
4. Application to the Framingham Heart Study
4.1. Lowess Curves Analysis for TC
It has been reported that major risk factors such as high blood pressure, cigarette smoking, family history of early heart disease, older age, and obesity (BMI 30 or higher) affect lipid profiles. These are also risk factors for CHD events, so they are highly correlated during disease development. In general, cholesterol is a fat-like substance in the blood that the body needs to build cell membranes. However, high blood cholesterol is one of the major risk factors for heart disease. Too much cholesterol in the blood can cause plaque to form and deposit in vessel walls, hardening the arteries and narrowing or totally blocking blood flow to the heart, which eventually can lead to a heart attack.
In this study, we collected the TC levels and exam ages of participants over seven visits from the FHS dataset, and separated the participants into two groups: those who experienced CHD (CHD event occurred) and those who did not (non-CHD). We then assessed the time-varying information by constructing three lowess smoothing plots (Figure 1). For the non-CHD group, the lowess plot shows that the TC smoothing curve starts below 200 mg/dL; with increasing exam age, TC slowly increases until after age 55, at which point it decreases a bit and gradually flattens out (Figure 1A). For the CHD group, the lowess curve shows that the TC level starts at about 200 mg/dL, increases gradually to a peak at age 50, slowly decreases thereafter, and falls below 200 mg/dL after age 60 (Figure 1B). We also plotted the lowess smoothing curve of TC levels against the time remaining until the CHD event occurred (Figure 1C). This curve starts from a TC level slightly below 200 mg/dL at 25 years before the CHD event, slowly increases to a peak (above 220 mg/dL) at 9 years before the event, and then gradually decreases back to 200 mg/dL at the time closest to the event.
Figure 1.

Plots of TC values against the participants’ ages at the exam time, showing the changing trend of the time-varying covariate TC with exam ages for participants who experienced a CHD event and those who did not (non-CHD). (A) TC lowess curve for participants experiencing non-CHD; (B) TC lowess curve for CHD participants; (C) Lowess curve of TC on the time windows close to and before CHD occurred for participants who experienced CHD only.
In general, these three plots show that the TC level in the CHD group is higher than in the non-CHD group during earlier adult ages, while at later ages the TC levels become similar between the two groups. We also see that the overall trend of the TC level increases with age until about 10 years before the CHD event, after which the TC level is maintained at a high level.
4.2. Landmark Analysis
We performed the landmark analysis on the longitudinal biomarker TC. First, separate datasets were constructed, stratified on the landmark time points. Then, we fitted the landmark Cox model of equation (1) to each landmark dataset separately, so the effect of TC is fixed within each landmark time interval. We plotted the coefficients against the landmark time points with their 95% confidence intervals (CIs) in Figure 2A. The plot shows that at the earliest ages the coefficient is above zero and increases over time, reaching a peak at age 40. This indicates that high TC is associated with greater risk of CHD during earlier adult ages. After age 40, the curve decreases with age until 55, at which point the coefficient CIs include zero, meaning that TC is no longer associated with CHD occurrence. After age 55, the plot shows a second increasing trend, for unknown reasons, until age 60, and then decreases to below zero (95% CIs include zero). The overall trend implies that for younger adults (below 40), higher TC levels correlate with greater risk of CHD; with increasing age, the CHD risk associated with TC level decreases and then vanishes at older ages (after around age 55).
Figure 2.

Plots of TC coefficients against participants’ ages at the time of the exam, after adjusting for the baseline covariates gender and smoking status, showing the changing trend of the time-varying TC effect with exam age for the landmark models. (A) Sliding landmark effects with pointwise 95% confidence intervals for the survival analysis of CHD against TC; (B) smooth curve of the landmark-model coefficient against time from the fitted fractional polynomial.
At the same time, we conducted the super model analysis described by equation (2) in Section 2.2. By stacking the individual landmark datasets together, we can estimate coefficients that depend on the landmark time L. The regression coefficients in such a model can be treated as weighted average effects of the time-varying covariates over the landmark time. We brought a smooth structure into the analysis by modeling the regression parameters as the fractional polynomial specified in equation (2). It turns out that among the terms in β(L), only the log(L) term is statistically significant. Moreover, since the youngest age of the selected subjects is 25 years, we used a correspondingly shifted version of L throughout our analyses, so the fitted smooth function reduces to an intercept plus a logarithmic term. This smooth curve is plotted in Figure 2B. We see patterns similar to the unsmoothed plot (Figure 2A): at younger adult ages, higher TC is associated with greater CHD risk; with increasing age, this risk becomes smaller and vanishes at older ages.
4.3. Results of Landmark Analysis with FPCs as Predictors
Since FPCA is applied to capture the underlying TC trajectory patterns via the PACE approach, we checked the density of the data time points with a design plot to see whether the longitudinal TC values in the FHS data are suitable for PACE. Details of the design plot of TC (Figure 6) can be found in the Appendix.
We performed FPCA on the TC trajectories using all available FHS data and obtained the eigenfunctions, which describe the patterns of deviation from the population mean over time. We selected the first three eigenfunctions; their corresponding FPC scores explain 72.0%, 17.5%, and 10.0% of the variation, respectively, for a total of 99.5%. As demonstrated in Figure 3, the first eigenfunction shows an almost constant reduction from the mean values over the age interval 25–60 years and starts increasing a bit over 60–75 years. The second eigenfunction decreases slowly over time from zero to a negative value over ages 25–45 years, then increases to a positive peak close to age 65 years. The third has a bell shape, increasing slowly from 25 to 55 years of age and then decreasing.
Figure 3.

Plots of the three TC eigen functions against the participants' ages for the FPCA analysis, without the mean function. The first three eigen functions explain 72.0%, 17.5%, and 10.0% of the variation, respectively.
Since FPCA can capture the individual changing patterns of TC from the longitudinal data by using mutually orthogonal eigen functions, we obtained the FPC scores by integrating over seven time windows separately: ages 25–40, 25–45, 25–50, 25–55, 25–60, 25–65, and 25–70. This yields accumulative scores, which represent the cumulative biomarker history over each time interval. The average standard deviations of the scores over the seven age intervals are 0.4488, 0.13022, and 0.1185 for FPCs I, II, and III, respectively. These FPC scores act as our new covariates in the landmark Cox survival analysis, as specified in equation (4). We then plotted each regression coefficient against the landmark time points with its 95% CI in Figures 4A, 4C, and 4E. To give a general picture of the functional principal components' effects, we briefly summarize the three FPCs' patterns and their corresponding effects in Table 2; the detailed interpretation is as follows. As shown in Figure 4A, the regression coefficients for the first FPC score start at a low (negative) value at age 40, increase to around zero at age 55, continue to a peak at age 65, and then decrease to near zero by age 70. That is, subjects with positive first FPC scores during ages 40 to 55 have lower CHD risks than subjects with negative scores during these ages. Since the first FPC score is the inner product of the centered TC and the first eigen function, a positive first FPC score corresponds to a below-average TC value. Therefore, the negative regression coefficients for ages 40 to 55 in Figure 4A indicate that a below-average TC level has protective effects during the younger adult ages.
On the other hand, the positive regression coefficients after age 55 in Figure 4A indicate that, with increasing age, the protective effect of below-average TC values vanishes. Because the first eigen function and its FPC scores explain 72.0% of the total variation, this is the main conclusion of our analysis.
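The accumulative FPC scores used above can be sketched as numerical integrals of the centered trajectory against each eigen function over expanding age windows. Everything below (the age grid, mean curve, eigen functions, and the subject's trajectory) is a synthetic stand-in, not the FHS estimates.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal-rule integral of y over x, along the last axis."""
    return np.sum((y[..., 1:] + y[..., :-1]) / 2.0 * np.diff(x), axis=-1)

t = np.linspace(25.0, 70.0, 181)          # age grid
mu = 180.0 + 0.8 * (t - 25.0)             # hypothetical mean TC curve
# Three hypothetical eigen functions, normalized to unit L2 norm on [25, 70].
rho = np.vstack([np.ones_like(t),
                 np.cos(np.pi * (t - 25.0) / 45.0),
                 np.sin(np.pi * (t - 25.0) / 45.0)])
rho /= np.sqrt(trapezoid(rho**2, t))[:, None]

z_i = 200.0 + 0.5 * (t - 25.0)            # one subject's TC trajectory

def accumulative_scores(z, L):
    """FPC scores of the centered trajectory, integrated over ages [25, L]."""
    keep = t <= L
    return np.array([trapezoid((z - mu)[keep] * r[keep], t[keep]) for r in rho])

# Expanding windows 25-40, 25-45, ..., 25-70, as in the analysis above.
windows = [40, 45, 50, 55, 60, 65, 70]
scores = np.vstack([accumulative_scores(z_i, L) for L in windows])
```

Each row of `scores` then serves as the covariate vector for the landmark Cox model at the corresponding landmark age.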
Figure 4.

(A, C, E): Sliding landmark effects with pointwise 95% CIs for the coefficients of TC FPCA components I (β1(L)), II (β2(L)), and III (β3(L)), respectively, against exam ages, after adjusting for the baseline covariates gender and smoking status in the Cox model for CHD risk. Each data point and 95% CI are multiplied by the standard deviation of the corresponding FPC scores I, II, and III, respectively. (B, D, F): Smooth curves of the landmark-model coefficients for FPCA components I, II, and III, respectively, against time, using the smooth function specified in equation (5).
Table 2.
Summary for patterns and effects of three FPCs
| FPCs | Patterns | Effects | 
|---|---|---|
| I | Constantly below average | Protective at early ages; no longer protective at late ages | 
| II | Decreasing early, increasing late | Harmful at early ages; risk-neutral at late ages | 
| III | Increasing until about age 55, then decreasing | Little effect; similar to the average level | 
As shown in Figure 4C, the regression coefficient for the second FPC score starts with a significantly positive value at age 40 and decreases to, and stays at, non-significantly negative values after around age 55. Figure 3 shows that the second eigen function has an initially decreasing and later increasing pattern. Following a similar argument as above, we conclude from these two plots that patients whose TC values follow such a pattern might have higher CHD risks at early ages, but not at old ages. Figure 4E shows that the regression coefficient for the third FPC score remains around zero throughout all ages. This implies that a TC profile of the shape indicated by the third eigen function in Figure 3 does not have much effect on CHD risk. We also imposed a smooth function on the three regression coefficients against the landmark time points, as specified in equation (5), and plotted the three smooth curves in Figures 4B, 4D, and 4F. The smoothed curves show similar patterns.
From the above analysis of FPC scores, we were able not only to capture the mean longitudinal biomarker profiles well, but also to characterize the variability of individual temporal patterns, leading to a parsimonious representation of longitudinal profile variability. Figure 5 shows the observed longitudinal TC curves of the individuals with the top 5% and bottom 5% of each of the three FPC scores. Although the trajectories varied among subjects, the variation increased somewhat from FPC I to III. We could also see that more CHD cases occurred among the subjects in the top 5% of FPC I scores (12%) than among those in the bottom 5% (6%). A consistent pattern across all three FPCs was that TC levels starting at a higher level were associated with more CHD events (higher event percentages), whereas higher TC levels occurring at older ages were not associated with higher CHD risk (e.g., as seen from Figure 5A versus 5B).
Figure 5.

Individual TC longitudinal profiles selected from the top 5% and bottom 5% of FPC scores. (A, B): profiles selected from the bottom 5% (A) and top 5% (B) of FPC I scores; the numbers (percentages) of CHD events (shown as red lines) are 10 (12%) among the top 5% subjects and 5 (6%) among the bottom 5%. (C, D): profiles selected from the bottom 5% (C) and top 5% (D) of FPC II scores; the CHD event counts are 11 (13%) in the top 5% and 2 (2%) in the bottom 5%. (E, F): profiles selected from the bottom 5% (E) and top 5% (F) of FPC III scores; the CHD event counts are 29 (34%) in the top 5% and 5 (6%) in the bottom 5%.
In conclusion, the three time-varying-coefficient plots exhibit three different risk profiles (Figure 4). The overall trend is consistent with the landmark analysis using the original data (Figure 2), and our conclusions are also consistent with the results of previous studies, which we discuss in the next section.
We also compared the prediction performance of the two models in Sections 4.2 and 4.3. AUC values were calculated at the landmark times L = 35, 50, 65, and 75 years of exam age, with a horizon time of 5 years. The FPCA approach yields higher AUCs (0.6908, 0.6907, 0.6904, 0.6900) than the LVCF method (0.5905, 0.5884, 0.5891, 0.5917), meaning the FPCA-based landmark model is better able to distinguish between high-risk and low-risk subjects. These results corroborate the simulation studies in Section 3.
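A landmark-and-horizon AUC of this kind can be sketched as a Mann-Whitney comparison of risk scores between "cases" (events within the horizon) and "controls" (event-free past the horizon) among subjects still at risk at the landmark. This simplified version ignores censoring; a full analysis would use inverse-probability-of-censoring weights, as in time-dependent ROC methods such as the tdROC package cited in the references.

```python
import numpy as np

def landmark_auc(risk, event_time, L, w):
    """Empirical AUC at landmark L with horizon w, ignoring censoring.

    Among subjects event-free at L, a case has an event in (L, L + w] and a
    control survives past L + w; AUC is the probability that a randomly chosen
    case has a higher risk score than a randomly chosen control.
    """
    at_risk = event_time > L
    case = at_risk & (event_time <= L + w)
    ctrl = at_risk & (event_time > L + w)
    r_case, r_ctrl = risk[case], risk[ctrl]
    # All case/control pairs; ties in the risk score count as 1/2.
    diff = r_case[:, None] - r_ctrl[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(r_case) * len(r_ctrl))
```

For example, with perfectly ordered risk scores the function returns 1.0, and with completely tied scores it returns 0.5.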
5. Discussion and conclusion
Traditional Cox survival analysis fits the proportional hazards model using baseline biomarkers as covariates, measured at a single time point for each individual. This study used FPCA as a robust way to capture the trajectory patterns of longitudinal biomarker data; this summary information can then be used as predictors in the traditional Cox survival analysis to conduct real-time dynamic prediction of the time to a specific event of interest. We adapted the landmark analysis using two different strategies. In the first, we partitioned the dataset at the landmark times and used the LVCF to "impute" current biomarker values. In the second, we first performed FPCA on the whole dataset, then projected the obtained eigen functions onto each separate time interval to compute FPC scores, which we used as predictors in the survival models. Comparing the two, we found the second to be more stable and accurate, though the final conclusions are similar. This makes sense, as the FPCA approach borrows information across subjects to estimate the FPC scores on the landmark time intervals. With more data, prediction accuracy increases and becomes more stable, which is consistent with the dynamic prediction process: with prior knowledge of the whole course of biomarker development, we can obtain more stable and precise predictions. We therefore focused on the second method in this study.
Numerous observational studies and clinical trials have shown mixed findings regarding the association between TC and CHD in older populations. Our findings contrast with some previously published reports. For example, Aronow et al. [2] performed a prospective study that followed 708 elderly patients for 41 months to assess risk factors for coronary heart disease, including serum TC level; serum TC was significantly associated with CHD events in both univariate and multivariate analyses. Rubin and colleagues showed similar results in a large cohort of 2746 white men aged 60 to 79 years [30]. A recent study showed an increased mortality risk for men with very high HDL-C, but an inverse relationship between HDL-C and CHD events for women [38]. On the other hand, our findings are corroborated by a large body of studies [10, 20, 28]. Han et al. [15] recently reported on statin use and CHD risk among older people (> 65 or > 75 years old): the statin group had significantly lower cholesterol levels over the 6-year follow-up, but no reduced risk of CHD, which is consistent with our analysis of the FHS data. The debate surrounding this issue involves not only public health but also health economics. As the proportion of the population that is elderly increases worldwide, and particularly within the United States, the cost of providing daily statin therapy to older adults must be justified by a clear benefit; without one, such therapy should be avoided. Our findings demonstrate the importance of analyzing the age-dependent effects of TC on CHD risk.
In this article, we applied FPCA to longitudinal continuous covariates. However, FPCA can also be used for longitudinal binary or count data. Two approaches exist for such data: covariance-decomposition-based and probabilistic FPCA. The covariance-decomposition approach is the standard and popular way to identify modes of variation in continuous functional data. Hall et al. applied FPCA to binary functional data by positing a smooth latent Gaussian process and decomposing the variance-covariance matrix of the demeaned functional observations [14]. However, when this approach is extended to exponential-family data, as demonstrated by Gertheiss et al., it has an inherent bias due to its reliance on a marginal rather than a conditional mean estimate [12]. Probabilistic FPCA is a more appealing, generalized alternative to the covariance-smoothing approach for longitudinal categorical variables. This framework is formulated as a likelihood-based model from a Bayesian perspective, which easily accommodates sparse or irregular data. At the same time, the probabilistic framework avoids the bias inherent in the covariance-decomposition approach: the likelihood approach typically uses a link function to relate the expected value of the observed data to a smooth latent process, and all parameters are estimated simultaneously rather than sequentially. Goldsmith et al. extended probabilistic FPCA via a fully Bayesian approach [13], while Van der Linde applied generalized FPCA to binary and count data through a variational Bayesian framework based on a Taylor expansion of the log-likelihood [35]. In principle, all these techniques can be adapted to our landmark-FPCA framework in future work.
6. Acknowledgments
The research of Bin Shi was supported by the NIH/NIGMS through grant 5T32 GM074902 and by the National Science Foundation through grant DMS 1612965. The research of Peng Wei was partially supported by U.S. National Institutes of Health grants R01HL116720 and R01CA169122. The research of Xuelin Huang was supported by U.S. National Institutes of Health grants U54 CA096300, U01 CA152958, and 5P50 CA100632, and by the Dr. Mien-Chie Hung and Mrs. Kinglan Hung Endowed Professorship. The authors declare that there are no conflicts of interest. The authors thank the reviewers for their helpful and constructive comments. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or the NHLBI.
Appendices
A. Functional principal component analysis
To perform FPCA on the longitudinal biomarkers from the FHS dataset, we assume that all biomarkers change smoothly and continuously over time. We then model individuals' biomarker trajectories as independent realizations of a stochastic process Z(t). We treat the underlying Z(t) as a real-valued, square-integrable random function belonging to the separable Hilbert space L2[0, τ], where t ∈ [0, τ] and τ is the maximum follow-up time, so that [0, τ] is a compact interval in [0, ∞). For the observed data, let Yij be the observed biomarker value at the jth time point tij ∈ [0, τ] for the ith subject, j = 1, 2, · · · , ni, where ni is the number of observation time points for subject i, and i = 1, 2, · · · , n. Thus Zi(t) is the underlying biomarker trajectory of subject i; it is often not directly observable due to measurement error and must be reconstructed from the noisy observations Yij. Therefore, we model each longitudinal biomarker as follows:
Yij = Zi(tij) + εij,  (6)
where εij are independent measurement error terms with E(εij) = 0 and V ar(εij) = σ2.
Given Zi(t) defined as above, we suppose there exist the mean function E[Zi(t)] = μ(t) and the covariance function E[(Zi(t) − μ(t)) (Zi(v) − μ(v))] = G(t, v), for t, v ∈ [0, τ]. As noted, FPCA relies on an expansion of the data in terms of the eigen functions of the covariance function G(t, v). Mercer's theorem [6] guarantees that a symmetric, non-negative definite covariance function can be decomposed in an orthonormal basis, G(t, v) = Σ_{k=1}^{∞} λk ρk(t) ρk(v), with eigen functions ρk(t), 0 ≤ t ≤ τ, and eigen values λ1 ≥ λ2 ≥ · · · ≥ 0, for k = 1, 2, · · · . The eigen functions satisfy the orthonormality condition:
∫_0^τ ρk(t) ρl(t) dt = δkl, where δkl = 1 if k = l, and 0 otherwise.  (7)
Correspondingly, the individual biomarker temporal profile Zi(t) can be expanded via the Karhunen-Loève representation [24] in terms of this orthonormal basis, as follows:
Zi(t) = μ(t) + Σ_{k=1}^{∞} γik ρk(t),  (8)
where γik = ∫_0^τ (Zi(t) − μ(t)) ρk(t) dt is the kth FPCA score; the scores γik are uncorrelated random variables with mean 0 and variance λk.
To achieve dimension reduction, each sample function can be approximated sufficiently well, without losing much variation, by the first M components, which changes the index notation from k = 1, 2, · · · , ∞ to m = 1, 2, · · · , M. Then the model we consider is
Yij = μ(tij) + Σ_{m=1}^{M} γim ρm(tij) + εij.  (9)
To choose M, the number of eigen functions needed to provide a reasonable approximation to the infinite-dimensional process, we may base the choice on the fraction of variance explained, e.g., requiring at least 95%, as suggested by [21]. Alternatively, M can be selected by cross-validation based on minimized prediction error [29], or by other criteria such as the Akaike or Bayesian information criterion. In practice, the choice of basis is guided by the shape of the curves: the basis should be chosen so that the expansion approximates the curves as closely as possible. Common choices for the function basis include polynomials, B-splines, wavelets, and the Fourier basis.
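The variance-explained rule for choosing M can be sketched as follows, assuming densely observed curves so that the covariance can be estimated on a grid and eigendecomposed directly (the simulated model, its eigen values, and the 95% threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)

# Simulate n curves from a 3-component Karhunen-Loeve expansion plus noise,
# using a hypothetical orthonormal sine basis and eigen values (4, 1, 0.5).
basis = np.vstack([np.sqrt(2) * np.sin((k + 1) * np.pi * t) for k in range(3)])
lam_true = np.array([4.0, 1.0, 0.5])
n = 300
scores_true = rng.normal(size=(n, 3)) * np.sqrt(lam_true)
Z = scores_true @ basis + 0.05 * rng.normal(size=(n, len(t)))

Zc = Z - Z.mean(axis=0)               # center at the sample mean function
G = (Zc.T @ Zc) / n                   # sample covariance G(t_j, t_h) on the grid
lam, vecs = np.linalg.eigh(G)
lam, vecs = lam[::-1], vecs[:, ::-1]  # sort eigen values in descending order

# Smallest M whose leading eigen values explain at least 95% of the variance.
frac = np.cumsum(lam) / lam.sum()
M = int(np.searchsorted(frac, 0.95) + 1)
```

With these settings the rule recovers the true number of components, M = 3, mirroring the simulation result reported in Appendix C.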
The above FPCA framework for functional data is a flexible method for capturing the trajectories of longitudinal biomarkers. The underlying random curves can be reconstructed as linear combinations of the orthonormal basis defined by the eigen functions of the covariance, with the random coefficients γik acting as weights for each basis function. This is the so-called data projection, or transformation, from the original correlated scale Zi(t) to the new orthogonal coordinate system. However, to estimate such a process, the time grid of the longitudinal data needs to be dense and regular. In reality, functional data are usually a collection of noise-corrupted observations, and measurements can be sparse, error-prone, and taken at irregular time points. To deal with this incongruence, Yao [42] proposed PACE to obtain smooth estimators. In this study, we use the same strategy to estimate the FPCA model components, assuming the mean, covariance, and eigen functions are smooth. The mean function μ(t) is estimated by a one-dimensional local linear kernel smoother, obtained by weighted least squares. The covariance G(t, v), for t, v ∈ [0, τ], is estimated by fitting a two-dimensional kernel smoother to all pairwise products of the centered observations with j ≠ h within each subject i. The estimates of the eigen functions ρk(t) and eigen values λk are then obtained by decomposing the smoothed covariance estimate through the following eigen equation:
∫_0^τ G(t, v) ρk(v) dv = λk ρk(t).  (10)
Traditionally, FPC scores γik have been estimated by numerical integration, which works well when the measurement grid for each subject is sufficiently dense. However, when data are sparsely and irregularly measured, or contaminated with measurement error, PACE provides a better way to estimate them. Briefly, PACE carries out FPCA under a normality assumption; this approach is necessary when the number of measurements per subject is very small, such as one or two observations per subject. The goal is to obtain a good approximation to the following integral:

γik = ∫_0^τ (Zi(t) − μ(t)) ρk(t) dt.  (11)
The approach is as follows. The model components μ(t), ρk(t), and λk are estimated using the entire set of observed data {Yij, i = 1, · · · , n, j = 1, · · · , ni}, thereby borrowing strength from all subjects. As suggested by Yao [42], assuming γik and εij to be jointly Gaussian, the random effect γik is predicted by its conditional expectation E[γik | Yi]. Specifically, let Yi = (Yi1, · · · , YiNi)^T, μi = (μ(ti1), · · · , μ(tiNi))^T, and ρik = (ρk(ti1), · · · , ρk(tiNi))^T; then the best linear prediction of the FPCA score is
γik = λk ρik^T ΣYi^{-1} (Yi − μi),  (12)
where ΣYi = cov(Yi, Yi). That is to say, the (j, h) entry of the Ni × Ni matrix ΣYi is G(tij, tih) + σ2 δjh, with δjh = 1 if j = h, and 0 otherwise. Under the Gaussian assumption, this method provides the best linear unbiased prediction. Moreover, the estimator is robust, irrespective of whether the Gaussian assumption holds.
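Equation (12) can be sketched for a single subject in a toy one-component model. The mean function, eigen function, and variance parameters below are assumptions for illustration, not estimates from data:

```python
import numpy as np

# Toy one-component model on [0, 1]: assumed eigen value, noise variance,
# mean function, and first eigen function (unit L2 norm on [0, 1]).
lam1, sigma2 = 2.0, 0.25
mu = lambda t: 0.5 * t
rho1 = lambda t: np.sqrt(2) * np.sin(np.pi * t)

def pace_score(y_i, t_i):
    """PACE prediction gamma_hat = lam1 * rho_i^T Sigma_Yi^{-1} (y_i - mu_i)."""
    t_i = np.asarray(t_i, dtype=float)
    y_i = np.asarray(y_i, dtype=float)
    rho_i = rho1(t_i)                     # eigen function at the observed times
    # Sigma_Yi has (j, h) entry G(t_ij, t_ih) + sigma2 * delta_jh, where here
    # G(t, v) = lam1 * rho1(t) * rho1(v).
    Sigma = lam1 * np.outer(rho_i, rho_i) + sigma2 * np.eye(len(t_i))
    return lam1 * rho_i @ np.linalg.solve(Sigma, y_i - mu(t_i))

# Example: one subject observed at four times, generated with true score 1.5.
t_obs = np.array([0.2, 0.4, 0.6, 0.8])
gamma_hat = pace_score(mu(t_obs) + 1.5 * rho1(t_obs), t_obs)
```

Note the shrinkage behavior of the conditional expectation: even with noise-free data, the predicted score is pulled slightly toward zero by the measurement-error variance sigma2, so `gamma_hat` falls a little below the true score of 1.5.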
B. Design plot
Figure 6.

Scatter plot of all pairs (Tij, Tik) of observation times of TC values in the FHS data, where i = 1, · · · , n and j, k = 1, · · · , ni.
From the design plot of TC (Figure 6), we can see that the data points are sufficiently dense to cover the surface of the variance-covariance matrix. Because the maximum follow-up time was 30 years, we have max|Tij − Tik| = 30; therefore, the design plot is blank in the region |y − x| > 30.
C. Selection of M in FPCA
As shown in Table 3, as M varies from 1 to 6, the explained variance increases from 68.6% to 99.9%. We noted that requiring at least 95% of variance explained selected the true number, M = 3, in all 500 simulation replications. The Brier scores were stable for M ≤ 3, then increased dramatically for M > 3, whereas the AUC values were quite similar across all M. Thus the Brier score is more sensitive than the AUC to the choice of M. This is because the Brier score measures the mean squared difference between the true and predicted survival probabilities (the calibration error), while the AUC, a rank-based measure, is less sensitive to such perturbations.
Table 3.
Selection of M in FPCA and its effect on predictive performance in terms of Brier score and AUC
| M | Variation Explained (%) | Brier Score of LM-FPCA (1e-2) | AUC of LM-FPCA (1e-1) | 
|---|---|---|---|
| 1 | 68.60 | 0.479 | 6.379 | 
| 2 | 90.40 | 0.491 | 6.704 | 
| 3 | 96.00 | 0.489 | 6.856 | 
| 4 | 99.80 | 1.057 | 6.769 | 
| 5 | 99.85 | 1.781 | 6.536 | 
| 6 | 99.90 | 5.981 | 6.538 | 
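The Brier score referenced in Table 3 can be sketched as the mean squared difference between the observed event-free indicator at the horizon and the predicted survival probability. This simplified version omits the inverse-probability-of-censoring weights that a full implementation for censored data would need:

```python
import numpy as np

def brier_score(surv_prob, event_time, horizon):
    """Mean squared calibration error at the horizon, ignoring censoring.

    surv_prob is the predicted probability of surviving past the horizon;
    the observed outcome is 1 if the subject is event-free at the horizon.
    """
    survived = (event_time > horizon).astype(float)
    return np.mean((survived - surv_prob) ** 2)
```

A perfectly calibrated prediction gives a score of 0, while an uninformative constant prediction of 0.5 gives 0.25, which illustrates why this squared-error measure reacts to miscalibration that a rank-based AUC would not detect.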
References
- 1.Anderson JR, Cain KC, Gelber RD. Analysis of survival by tumor response. J Clin Oncol. 1983;1(11):710–719. [DOI] [PubMed] [Google Scholar]
- 2.Aronow WS, Herzig AH, Etienne F, D’Alba P, Ronquillo J. 41-month follow-up of risk factors correlated with new coronary events in 708 elderly patients. Journal of the American Geriatrics Society. 1989;37(6):501–506. [DOI] [PubMed] [Google Scholar]
- 3.Besse P, Ramsay JO. Principal components analysis of sampled functions. Psychometrika. 1986;51(2):285–311. [Google Scholar]
- 4.Binder H and Sauerbrei W and Royston P. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Statistics in Medicine. 2013;32(13):2262–2277. [DOI] [PubMed] [Google Scholar]
- 5.Boente G, Fraiman R. Kernel-based functional principal components. Statistics & probability letters. 2000;48(4):335–345. [Google Scholar]
- 6.Bosq D. Linear processes in function spaces. New York: Springer; 2000. [Google Scholar]
- 7.Cao Y, Rajan SS, Wei P. Mendelian randomization analysis of a time-varying exposure for binary disease outcomes using functional data analysis methods. Genetic epidemiology. 2016;40(8):744–755. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Chiou JM, Chen YT, Yang YF. Multivariate functional principal component analysis: A normalization approach. Statistica Sinica. 2014:1571–1596. [Google Scholar]
- 9.Dauxois J, Pousse A, Romain Y. Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis. 1982;12(1):136–154. [Google Scholar]
- 10.De Ruijter W, Westendorp RG, Assendelft WJ, den Elzen WP, de Craen AJ, le Cessie S, Gussekloo J. Use of Framingham risk score and new biomarkers to predict cardiovascular mortality in older people: population based observational cohort study. BMJ. 2009;338:1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Duong H, Volding D. Modelling continuous risk variables: Introduction to fractional polynomial regression. Vietnam Journal of Science. 2014;11:1–5 [Google Scholar]
- 12.Gertheiss J, Goldsmith J, Staicu AM. A note on modeling sparse exponential-family functional response curves. Computational statistics and data analysis. 2017;105:46–52 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Goldsmith J, Zipunnikov V, Schrack J. Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics. 2015;71(2):344–353 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hall P, Müller HG, Yao F. Modelling sparse generalized longitudinal observations with latent Gaussian processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(4):703–707. [Google Scholar]
- 15.Han BH, Sutin D, Williamson JD, Davis BR, Piller LB, Pervin H, Pressel SL, Blaum CS. Effect of statin treatment vs usual care on primary cardiovascular prevention among older adults: the ALLHAT-LLT randomized clinical trial. JAMA internal medicine. 2017;177(7):955–965. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Happ C, Greven S. Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association. 2018;113(522):649–659. [Google Scholar]
- 17.Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005;61(1):92–105. [DOI] [PubMed] [Google Scholar]
- 18.Jacobson TA, Ito MK, Maki KC, Orringer CE, Bays HE, Jones PH, McKenney JM, Grundy SM, Gill EA, Wild RA, Wilson DP. National lipid association recommendations for patient-centered management of dyslipidemia: part 1—full report. Journal of clinical lipidology. 2015;9(2):129–169. [DOI] [PubMed] [Google Scholar]
- 19.Jacques J, Preda C. Model-based clustering for multivariate functional data. Computational Statistics & Data Analysis. 2014;71:92–106. [Google Scholar]
- 20.Krumholz HM, Seeman TE, Merrill SS, de Leon CF, Vaccarino V, Silverman DI, Tsukahara R, Ostfeld AM, Berkman LF. Lack of association between cholesterol and coronary heart disease mortality and morbidity and all-cause mortality in persons older than 70 years. JAMA. 1994;272(17):1335–1340. [PubMed] [Google Scholar]
- 21.Li H, Staudenmayer J, Carroll RJ. Hierarchical functional data with mixed continuous and binary measurements. Biometrics. 2014;70:802–811. [DOI] [PubMed] [Google Scholar]
- 22.Li L, Wu C. tdROC: Nonparametric Estimation of Time-Dependent ROC Curve from Right Censored Survival Data. R package version 1.0. [Google Scholar]
- 23.Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American statistical Association. 1989;84(408):1074–1078. [Google Scholar]
- 24.Mas A. Weak convergence for the covariance operators of a Hilbertian linear process. Stochastic Processes and their Applications. 2002;99(1):117–135. [Google Scholar]
- 25.Mogensen UB, Ishwaran H, Gerds TA. Evaluating random forests for survival analysis using prediction error curves. Journal of statistical software. 2012;50(11):1–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Petersen LK, Christensen K, Kragstrup J. Lipid-lowering treatment to the end? A review of observational studies and RCTs on cholesterol and mortality in 80+-year olds. Age and ageing. 2010;39(6):674–680. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Portela A, Esteller M. Epigenetic modifications and human disease. Nature biotechnology. 2010;28(10):1057–1068. [DOI] [PubMed] [Google Scholar]
- 28.Redberg RF, Katz MH. Statins for primary prevention: the debate is intense, but the data are weak. JAMA. 2016;316(19):1979–1981. [DOI] [PubMed] [Google Scholar]
- 29.Rice JA, Silverman BW. Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological). 1991;53(1):233–243. [Google Scholar]
- 30.Rubin SM, Sidney S, Black DM, Browner WS, Hulley SB, Cummings SR. High blood cholesterol in elderly men and the excess risk for coronary heart disease. Annals of internal medicine. 1990;113(12):916–920. [DOI] [PubMed] [Google Scholar]
- 31.Sanchis-Gomar F, Perez-Quilis C, Leischik R, Lucia A. Epidemiology of coronary heart disease and acute coronary syndrome. Annals of translational medicine. 2016;4(13):1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Sauerbrei W, Meier-Hirmer C, Benner A, Royston P. Multivariable regression model building by using fractional polynomials: description of SAS, STATA and R programs. Computational Statistics & Data Analysis. 2006;50(12):3464–3485. [Google Scholar]
- 33.Stare J, Perme MP, Henderson R. A measure of explained variation for event history data. Biometrics. 2011;67(3):750–759. [DOI] [PubMed] [Google Scholar]
- 34.Tang LL, Balakrishnan N. A random-sum Wilcoxon statistic and its application to analysis of ROC and LROC data. Journal of statistical planning and inference. 2011;141(1):335–344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Van der Linde A. Variational Bayesian functional PCA. Computational Statistics & Data Analysis. 2008;53(2):517–533. [Google Scholar]
- 36.Van Houwelingen HC. Dynamic prediction by landmarking in event history analysis. Scandinavian Journal of Statistics. 2007;34(1):70–85. [Google Scholar]
- 37.Wei P, Tang H, Li D. Functional logistic regression approach to detecting gene by longitudinal environmental exposure interaction in a case-control study. Genetic epidemiology. 2014;38(7):638–651. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Wilkins JT, Ning H, Stone NJ, Criqui MH, Zhao L, Greenland P, Lloyd-Jones DM. Coronary heart disease risks associated with high levels of HDL cholesterol. Journal of the American Heart Association. 2014;3(2):1–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Yan F, Lin X, Huang X. Dynamic prediction of disease progression for leukemia patients by functional principal component analysis of longitudinal expression levels of an oncogene. The Annals of Applied Statistics. 2017;11(3):1649–1670. [Google Scholar]
- 40.Yan F, Lin X, Li R, Huang X. Functional principal components analysis on moving time windows of longitudinal data: dynamic prediction of times to event. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2018;67(4):961–978. [Google Scholar]
- 41.Yao F, Müller HG, Clifford AJ, Dueker SR, Follett J, Lin Y, Buchholz BA, Vogel JS. Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate. Biometrics. 2003;59(3):676–685. [DOI] [PubMed] [Google Scholar]
- 42.Yao F, Müller HG, Wang JL. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association. 2005;100(470):577–590. [Google Scholar]
- 43.Zheng Y, Heagerty PJ. Partly conditional survival models for longitudinal data. Biometrics. 2005;61(2):379–391. [DOI] [PubMed] [Google Scholar]
