Author manuscript; available in PMC: 2019 Jun 30.
Published in final edited form as: Stat Med. 2019 Jan 30;38(12):2184–2205. doi: 10.1002/sim.8100

High-dimensional longitudinal classification with the multinomial fused lasso

Samrachana Adhikari 1, Fabrizio Lecci 2, James T Becker 3, Brian W Junker 2, Lewis H Kuller 4, Oscar L Lopez 3, Ryan J Tibshirani 2
PMCID: PMC6599683  NIHMSID: NIHMS1037245  PMID: 30701586

Abstract

We study regularized estimation in high-dimensional longitudinal classification problems, using the lasso and fused lasso regularizers. The constructed coefficient estimates are piecewise constant across the time dimension in the longitudinal problem, with adaptively selected change points (break points). We present an efficient algorithm for computing such estimates, based on proximal gradient descent. We apply our proposed technique to a longitudinal data set on Alzheimer’s disease from the Cardiovascular Health Study Cognition Study. Using data analysis and a simulation study, we motivate and demonstrate several practical considerations such as the selection of tuning parameters and the assessment of model stability. While race, gender, vascular and heart disease, lack of caregivers, and deterioration of learning and memory are all important predictors of dementia, we also find that these risk factors become more relevant in the later stages of life.

Keywords: Alzheimer’s disease, cardiovascular health study, cognition study, longitudinal observational data, regularization

1 |. INTRODUCTION

The prevalence of Alzheimer’s disease (AD) increases at an alarming rate beyond the age of 65. After 90 years of age, the incidence of AD increases dramatically, from 12.7% per year in the 90–94 age group to 21.2% per year in the 95–99 age group and to 40.7% per year for those older than 100 years.13 The possibility of clinical intervention to control the dramatic progression of AD in patients at risk hinges on identifying risk factors that can predict the time of onset of the disease. Many previous studies have shown the importance of a range of risk factors in predicting the time of onset of clinical expression of AD, either as dementia or its prodromal syndrome, mild cognitive impairment (MCI). For example, the risk of dementia is associated with the presence of the APOE*4 allele, male sex, lower education, and a family history of dementia.2,4,5 Medical risk factors include the presence of systemic hypertension, diabetes mellitus, and cardiovascular or cerebrovascular disease.68 Lifestyle risk factors include physical and cognitive activity, and diet.911 The complex relationships between age and the risk factors produce highly variable natural histories from normal cognition to the clinical expression of AD, thus making the prediction of the time of onset of AD challenging.1216

Our work is particularly motivated by the analysis of a large longitudinal data set provided by the Cardiovascular Health Study Cognition Study (CHS-CS). Over the past 24 years, the CHS-CS recorded multiple demographic, metabolic, cardiovascular, and neuroimaging risk factors for AD, as well as detailed cognitive assessments for people of ages 65 to 110 years old.1214 A wide range of statistical approaches has been considered in the previous studies of the CHS-CS data for predicting the time of onset of AD, including exploratory statistical summaries, hypothesis tests, survival analyses, logistic regression models, and latent trajectory models. However, none of these methods can directly accommodate a large number of predictors that can potentially exceed the number of observations. A small number of variables was often chosen a priori to match the requirements of a particular model, neglecting the full potential of the CHS-CS data, which consists of thousands of variables.

In this paper, we examine data from 924 individuals in the Pittsburgh section of the CHS-CS, to identify important risk factors for the prediction of cognitive status at t + 10 years of age, given predictor measurements at t years of age, for t = 65, 66, … , 98. For each age, the outcome variable assigns an individual to one of three cognitive statuses: normal, dementia, or death. While we are mainly interested in the prediction of dementia, this task is complicated by the fact that risk factors for dementia are also known to be risk factors for death.17 To account for this competing risk of death, we include death as a separate category in the prediction model. We further develop a simulation study to demonstrate that the approach we introduce can accommodate an array of predictors of arbitrary dimension, using regularization to maintain a well-defined predictive model, and avoids overfitting relative to competing prediction models.

More generally, we study longitudinal classification problems in which the number of predictors can exceed the number of observations. The setup: we observe n individuals across discrete time points t = 1, … , T. At each time point, we record p predictor variables per individual, and an outcome that places each individual into one of K classes. The goal is to construct a model that predicts the outcome of an individual at time t + Δ, given his or her predictor measurements at time t. Since we allow for the possibility that p > n, regularization must be employed in order for such a predictive model (eg, based on maximum likelihood) to be well defined. Borrowing from the extensive literature on high-dimensional regression, we consider two well-known regularizers, each of which also has a natural place in high-dimensional longitudinal analysis for many scientific problems of interest. The first is the lasso regularizer, which encourages overall sparsity in the active (contributing) predictors at each time point; the second is the fused lasso regularizer, which encourages a notion of persistence or contiguity in the sets of active predictors across time points. Justification for this second penalty is based on the scientific intuition that predictors that are clinically important should have similar effects at successive ages.

There has been recent development in using high-dimensional data analysis tools for understanding the progression of AD.1821 These statistical tools are built to utilize special features of data, such as information on brain biomarkers from magnetic resonance images (MRI), for prediction. In fact, the fused lasso has been applied to the study of AD in the work of Xin et al.22 However, Xin et al22 consider a very different prediction problem than ours, based on using static MRI for prediction, and they do not consider the time-varying setup. These studies make use of either cross-sectional data22 or longitudinal data collected for less than 5 years,1821 both much shorter than the CHS-CS, and do not consider the competing risk of death in their analysis. The novelty of our analysis lies in the fact that we utilize high-dimensional longitudinal data collected over 24 years for predicting the progression of AD while also accounting for the competing risk of death.

The rest of this paper is organized as follows. In Section 2, we introduce the multinomial fused lasso model and discuss numerous approaches for selecting the tuning parameters λ1, λ2 ≥ 0 that govern the strength of the lasso and fused lasso penalties, along with the stability of estimated coefficients, and related concepts. We present the analysis of the CHS-CS data set in Section 3 and a simulation study in Section 4. In Section 5, we conclude with some final comments and lay out ideas for future work. In Appendix A, we describe a proximal gradient descent algorithm for efficiently computing a solution to the multinomial fused lasso model, while in Appendix B, we provide details about the implementation of the algorithm for our application.

2 |. THE MULTINOMIAL FUSED LASSO MODEL

Given the number of parameters involved in our general longitudinal setup, it will be helpful to be clear about notation (see Table 1). The matrix Y stores future outcome values, where the element Yit records the outcome of the ith individual at time t + Δ, and Δ ≥ 0 determines the time lag of the prediction. In the following, we will generally use the “·” symbol to denote partial indexing; examples are Xi·t, the vector of p predictors for individual i at time t, and β·tk, the vector of p multinomial coefficients at time t and for class k. While the number of individuals n is assumed to be fixed over time, in Appendix B, we introduce an extension of the basic setup in which the number of individuals can vary across time points, with nt denoting the number of individuals at each time point t = 1, … , T.

TABLE 1.

Notation used throughout this paper

Parameter Meaning
i = 1, … , n Index for individuals
j = 1, … , p Index for predictors
t = 1, … , T Index for time points
k = 1, … , K Index for outcomes
Y n × T matrix of (future) outcomes
X n × p × T array of predictors
β0 T × (K − 1) matrix of intercepts
β p × T × (K − 1) array of coefficients

At each time point t = 1, … , T, we use a separate multinomial logit model for the outcome at time t + Δ

$$
\begin{aligned}
\log\frac{\mathbb{P}(Y_{it}=1 \mid X_{i\cdot t}=x)}{\mathbb{P}(Y_{it}=K \mid X_{i\cdot t}=x)} &= \beta_{0t1} + \beta_{\cdot t1}^{T}x \\
\log\frac{\mathbb{P}(Y_{it}=2 \mid X_{i\cdot t}=x)}{\mathbb{P}(Y_{it}=K \mid X_{i\cdot t}=x)} &= \beta_{0t2} + \beta_{\cdot t2}^{T}x \\
&\;\;\vdots \\
\log\frac{\mathbb{P}(Y_{it}=K-1 \mid X_{i\cdot t}=x)}{\mathbb{P}(Y_{it}=K \mid X_{i\cdot t}=x)} &= \beta_{0t(K-1)} + \beta_{\cdot t(K-1)}^{T}x.
\end{aligned} \tag{1}
$$
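
To make (1) concrete, the following is a minimal sketch (our illustration, not code from the released longfused package) of how the class probabilities implied by the model can be computed at a single time point t, taking class K as the base class.

```python
import numpy as np

def class_probabilities(x, beta0_t, beta_t):
    """x: (p,) predictors; beta0_t: (K-1,) intercepts at time t;
    beta_t: (p, K-1) coefficients at time t. Returns (K,) class probabilities."""
    eta = beta0_t + x @ beta_t   # log odds of classes 1..K-1 against base class K
    eta = np.append(eta, 0.0)    # implicit zero linear predictor for the base class
    eta -= eta.max()             # stabilize the softmax against overflow
    probs = np.exp(eta) / np.exp(eta).sum()
    return probs                 # [P(Y=1|x), ..., P(Y=K-1|x), P(Y=K|x)]
```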

The coefficients are determined by maximizing a penalized log likelihood criterion

$$
(\hat\beta_0, \hat\beta) \in \underset{\beta_0,\,\beta}{\operatorname{argmax}}\ \ell(\beta_0,\beta) - \lambda_1 P_1(\beta) - \lambda_2 P_2(\beta), \tag{2}
$$

where $\ell(\beta_0, \beta)$ is the multinomial log likelihood

$$
\ell(\beta_0,\beta) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log \mathbb{P}(Y_{it} \mid X_{i\cdot t}),
$$

P1 is the lasso penalty23

$$
P_1(\beta) = \sum_{j=1}^{p} \sum_{t=1}^{T} \sum_{k=1}^{K-1} |\beta_{jtk}|,
$$

and P2 is a version of the fused lasso penalty24 applied across time points

$$
P_2(\beta) = \sum_{j=1}^{p} \sum_{t=1}^{T-1} \sum_{k=1}^{K-1} |\beta_{jtk} - \beta_{j(t+1)k}|.
$$

(The element-of notation "∈" in (2) emphasizes the fact that the maximizing coefficients $(\hat\beta_0,\hat\beta)$ need not be unique, since the log likelihood $\ell(\beta_0, \beta)$ need not be strictly concave; eg, this is the case when p > n.)

In broad terms, the lasso and fused lasso penalties encourage sparsity and persistence, respectively, in the estimated coefficients $\hat\beta$. A larger value of the tuning parameter λ1 ≥ 0 generally corresponds to fewer nonzero entries in $\hat\beta$; a larger value of the tuning parameter λ2 ≥ 0 generally corresponds to fewer change points in the piecewise constant coefficient trajectories $\hat\beta_{j\cdot k}$, across t = 1, … , T. We note that the form of the log likelihood $\ell(\beta_0, \beta)$ specified above assumes independence between the outcomes across time points, which is a rather naive assumption given the longitudinal nature of our problem setup. However, this naivety is partly compensated by the role of the fused lasso penalty, which ties together the multinomial models across time points, as we show later in the simulation study.
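
As a quick illustration of the two penalty terms (again a sketch of ours, not the package code), both P1 and P2 are simple to evaluate for a coefficient array of dimension p × T × (K − 1).

```python
import numpy as np

def penalties(beta):
    """beta: array of shape (p, T, K-1) of multinomial coefficients."""
    p1 = np.abs(beta).sum()                    # lasso penalty: sum_{j,t,k} |beta_jtk|
    p2 = np.abs(np.diff(beta, axis=1)).sum()   # fused lasso penalty over adjacent time points
    return p1, p2

# The penalized criterion in (2) is then  loglik - lam1 * p1 - lam2 * p2.
```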

It helps to see an example. We consider a simple longitudinal problem with n = 50 individuals, T = 15 time points, and K = 2 classes. At each time point, we sampled p = 30 predictors independently from a standard normal distribution. The true (unobserved) coefficient matrix β is now 30 × 15; we set βj· = 0 for j = 1, … , 27, and set the three remaining coefficient trajectories to be piecewise constant across t = 1, … , 15, as shown in the left panel of Figure 1. In other words, the assumption here is that only 3 of the 30 variables are relevant for predicting the outcome, and these variables have piecewise constant effects over time. We generated a matrix of binary outcomes Y according to the multinomial model (1) and computed the multinomial fused lasso estimates β^0, β^ in (2). The right panel of Figure 1 displays these estimates (all but the intercept β^0) across t = 1, … , 15, for a favorable choice of tuning parameters λ1 = 2.5, λ2 = 12.5; the middle plot shows the unregularized (maximum likelihood) estimates corresponding to λ1 = λ2 = 0.

FIGURE 1. A simple example with n = 50, T = 15, K = 2, and p = 30. The left panel displays the true coefficient trajectories across time points t = 1, … , 15 (only 3 of the 30 are nonzero); the middle panel shows the (unregularized) maximum likelihood estimates; the right panel shows the regularized estimates from (1), with λ1 = 2.5 and λ2 = 12.5 [Colour figure can be viewed at wileyonlinelibrary.com]

Each plot in Figure 1 has a y-axis that has been scaled to suit its own dynamic range. We can see that the multinomial fused lasso estimates, with an appropriate amount of regularization, pick up the underlying trend in the true coefficients, though the overall magnitude of the coefficients is shrunken toward zero (an expected consequence of the ℓ1 penalties). In comparison, the unregularized multinomial estimates are wild and do not convey the proper structure. From the perspective of prediction error, the multinomial fused lasso estimates offer a clear advantage as well: over 30 repetitions from the same simulation setup, we used both the regularized coefficient estimates (with λ1 = 2.5 and λ2 = 12.5) and the unregularized estimates to predict the outcomes on an i.i.d. test set, with 0.5 as a prediction threshold. The average prediction error using the regularized estimates was 0.114 (with a standard error of 0.014), while the average prediction error from the unregularized estimates was 0.243 (with a standard error of 0.022).

2.1 |. Implementation

We develop an algorithm based on proximal gradient descent for computing solutions of the fused lasso regularized multinomial regression problem in Equation (2). While a number of other algorithmic approaches are possible, such as implementations of the alternating direction method of multipliers,25 we settle on the proximal gradient method because of its simplicity and because of the extremely efficient, direct proximal mapping associated with the fused lasso regularizer. We review proximal gradient descent in generality in Appendix A. In Appendix B, we provide details about the implementation of the algorithm for our problem and discuss a number of practical considerations like the choice of step size, stopping criteria, and a rescaling of the loss function that makes it roughly independent of the effective sample size, which might vary across time.

An efficient C++ implementation of the proximal gradient descent algorithm for the fused lasso regularized multinomial regression problem, with an easy interface to R, is available from the Github page: https://github.com/SamAdhikari/longfused.

2.2 |. Model selection and evaluation

The selection of tuning parameters λ1, λ2 is clearly an important issue, as they are not known a priori. We discuss various methods for automatic tuning parameter selection in the multinomial fused lasso model (2). In particular, we consider the following methods for model selection: cross-validation, cross-validation under the one-standard-error rule, AIC, BIC, and, finally, AIC and BIC using misclassification loss (in place of the usual negative log likelihood). The cross-validation in our longitudinal setting is performed by dividing the individuals 1, … , n into folds, and, per its typical usage, selecting the tuning parameter pair λ1, λ2 (over, say, a grid of possible values) that minimizes the cross-validation misclassification loss. The one-standard-error rule, on the other hand, picks the simplest estimate that achieves a cross-validation misclassification loss within one standard error of the minimum. Here, “simplest” is interpreted to mean the estimate with the fewest number of nonzero component blocks. AIC and BIC scores are computed for a candidate λ1, λ2 pair by

$$
\mathrm{AIC}(\lambda_1,\lambda_2) = 2\,\mathrm{loss}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big) + 2\,\mathrm{df}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big),
$$
$$
\mathrm{BIC}(\lambda_1,\lambda_2) = 2\,\mathrm{loss}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big) + \log(N_{\mathrm{tot}})\,\mathrm{df}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big),
$$

and in each case, the tuning parameter pair is chosen (again, say, over a grid of possible values) to minimize the score. In the above, $(\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}$ denotes the multinomial fused lasso estimate (2) at the tuning parameter pair λ1, λ2, and $N_{\mathrm{tot}}$ denotes the total number of observations in the longitudinal study, $N_{\mathrm{tot}} = nT$ (or $N_{\mathrm{tot}} = \sum_{t=1}^{T} n_t$ in the missing data setting). Also, $\mathrm{df}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big)$ denotes the degrees of freedom of the estimate, and we employ the approximation

$$
\mathrm{df}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big) \approx \#\{\text{nonzero blocks in } (\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\},
$$

borrowing from known results in the Gaussian likelihood case.26,27 Finally, $\mathrm{loss}\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big)$ denotes a loss function evaluated at the estimate, which we take to be either the negative multinomial log likelihood $-\ell\big((\hat\beta_0,\hat\beta)_{\lambda_1,\lambda_2}\big)$, as is typical in AIC and BIC,28 or the misclassification loss, to put it on closer footing with cross-validation. Note that both loss functions are computed in-sample, ie, over the training samples, and hence AIC and BIC are computationally much cheaper than cross-validation. The six model selection methods will be compared using a K-fold nested cross-validation scheme, where the best tuning parameters λ1 and λ2 are selected within the training data according to the model selection criteria discussed here.
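
The scores above are cheap to evaluate once an estimate is available. The sketch below is our own illustration (not the released code); the block-counting routine implements the degrees of freedom approximation just stated (intercepts are omitted here for brevity), and loss_value is whichever in-sample loss one chooses to plug in.

```python
import numpy as np

def nonzero_blocks(beta, tol=1e-8):
    """Count nonzero blocks in beta of shape (p, T, K-1): for each predictor j and
    class k, count the maximal constant nonzero segments across t = 1, ..., T."""
    p, T, Km1 = beta.shape
    blocks = 0
    for j in range(p):
        for k in range(Km1):
            prev = 0.0
            for t in range(T):
                v = beta[j, t, k]
                # A new block starts at a nonzero value that differs from the previous one.
                if abs(v) > tol and (t == 0 or abs(v - prev) > tol):
                    blocks += 1
                prev = v
    return blocks

def aic_bic(loss_value, beta, n_total):
    df = nonzero_blocks(beta)
    aic = 2.0 * loss_value + 2.0 * df
    bic = 2.0 * loss_value + np.log(n_total) * df
    return aic, bic
```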

For each observation in the test data, we predict the outcome based on the predicted probabilities for the K different classes from the multinomial logit model in (1): the class with the largest predicted probability is taken as the predicted outcome. We compare the prediction performance of the different model selection methods using four evaluation metrics: misclassification error rate, true positive rate, false positive rate, and positive predictive value. The misclassification error rate is the proportion of observations assigned to a wrong outcome class. The true positive rate for class k is the ratio of the number of outcomes correctly predicted as class k to the number of outcomes observed in that class. The false positive rate for class k is the ratio of the number of outcomes incorrectly predicted as class k to the number of observed outcomes that are not in that class. Finally, the positive predictive value for class k is the ratio of the number of outcomes correctly predicted as class k to the total number of outcomes predicted in that class. Ideally, we seek a selection method that maximizes the true positive rate and the positive predictive value while minimizing the misclassification error and the false positive rate.
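
The four evaluation metrics can be computed from the observed and predicted class labels as in the following sketch (our own, for illustration).

```python
import numpy as np

def evaluation_metrics(y_true, y_pred, k):
    """y_true, y_pred: integer class labels; k: class of interest (eg, dementia)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    misclass = np.mean(y_true != y_pred)
    tp = np.sum((y_pred == k) & (y_true == k))
    fp = np.sum((y_pred == k) & (y_true != k))
    tpr = tp / max(np.sum(y_true == k), 1)   # true positive rate for class k
    fpr = fp / max(np.sum(y_true != k), 1)   # false positive rate for class k
    ppv = tp / max(np.sum(y_pred == k), 1)   # positive predictive value for class k
    return misclass, tpr, fpr, ppv
```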

2.3 |. Measures of stability

Examining the stability of variables in a fitted model, subject to small perturbations of the data set, is one way to assess variable importance. Applications of stability, in this spirit, have recently gained popularity in the literature, across a variety of settings such as clustering (eg, the work of Lange et al29), regression (eg, the work of Meinshausen and Bühlmann30), and graphical models (eg, the work of Liu et al31). In this paper, we propose a very simple stability-based measure of variable importance, based on the definition of variable importance for trees and additive tree expansions.28,32 We fit the multinomial fused lasso estimate (2) on the data set Xi··, Yi·, for i = i1, … , im, a subsample of the total individuals 1, … , n, and repeat this process R times. Let β^(r) denote the coefficients from the rth subsampled data set, for r = 1, … , R. Then, we define the importance of variable j for class k as

$$
I_{jk} = \frac{1}{RT} \sum_{r=1}^{R} \sum_{t=1}^{T} \big|\hat\beta^{(r)}_{jtk}\big|, \tag{3}
$$

for each j = 1, … , p and k = 1, … , K − 1, which is the average absolute magnitude of the coefficients for the jth variable and kth class, across all time points, and subsampled data sets. Therefore, a larger value of Ijk indicates a higher variable importance, as measured by stability (not only across subsampled data sets r, but actually across time points t, as well). Relative importances can be computed by scaling the highest variable importance to be 100, and adjusting the other values accordingly; for simplicity, we typically consider relative variable importances in favor of absolute ones, because the original scale has no real meaning.
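
For fixed tuning parameters, the importance measure (3) amounts to averaging absolute coefficients over subsamples and time points; the sketch below (ours, not the authors' code) assumes the R fitted coefficient arrays are already available, while the next paragraph describes the more careful variant in which the tuning parameters are reselected within each subsample.

```python
import numpy as np

def variable_importance(beta_hats):
    """beta_hats: list of R arrays, each of shape (p, T, K-1), fitted on subsamples."""
    R = len(beta_hats)
    T = beta_hats[0].shape[1]
    # Average absolute coefficient over subsamples and time points: shape (p, K-1).
    imp = sum(np.abs(b).sum(axis=1) for b in beta_hats) / (R * T)
    # Relative importances, scaling the largest value per class to 100
    # (assumes at least one nonzero importance per class).
    rel = 100.0 * imp / imp.max(axis=0, keepdims=True)
    return imp, rel
```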

There is some subtlety in the role of the tuning parameters λ1, λ2 used to fit the coefficients β^(r) on each subsampled data set r = 1, … , R. Note that the importance measure (3) reflects the importance of a variable in the context of a fitting procedure that, given data samples, produces estimated coefficients. The simplest approach would be to consider the fitting procedure defined by the multinomial fused lasso problem (2) at a fixed pair of tuning parameter values λ1, λ2. But, in practice, it is seldom true that appropriate tuning parameter values are known ahead of time, and one typically employs a method like cross-validation to select parameter values. Hence, in this case, to determine variable importances in the final coefficient estimates, we would take care to define our fitting procedure in (3) to be the one that, given data samples, performs cross-validation on these data samples to determine the best choice of λ1, λ2, and then uses this choice to fit coefficient estimates. In other words, for each subsampled data set r = 1, … , R in (3), we would perform cross-validation to determine tuning parameter values and then compute β^(r) as the multinomial fused lasso solution at these chosen parameter values. This is more computationally demanding, but it is a more accurate reflection of variable importance in the final model output by the multinomial fused lasso under cross-validation for model selection.

2.4 |. Alternative approach

We are mainly interested in the prediction of dementia; this task is complicated by the fact that risk factors for dementia are also known to be risk factors for death,17 and so to account for this, we include the death category in the multinomial classification model. An alternate approach would be to use a Cox proportional hazards model,33 where the event of interest is the onset of dementia, and censorship corresponds to death.

Traditionally, the Cox model is not fit with time-varying predictors or time-varying coefficients, but it can be naturally extended to the setting considered in this work, even using the same regularization schemes. Instead of the multinomial model (1), we would model the hazard function as

$$
h(t+\Delta \mid X_{i\cdot t}=x) = h_0(t+\Delta)\exp\big(x^{T}\beta_{\cdot t}\big), \tag{4}
$$

where $\beta \in \mathbb{R}^{p\times T}$ is a set of coefficients over time, and h0 is some baseline hazard function (that does not depend on predictor measurements). Note that the hazard model (4) relates the instantaneous rate of failure (onset of dementia) at time t + Δ to the predictor measurements at time t. This is as in the multinomial model (1), which relates the outcomes at time t + Δ (dementia or death) to predictor measurements at time t. The coefficients in (4) would be determined by maximizing the partial log likelihood with the analogous lasso and fused lasso penalties on β, as in the above multinomial setting (2).

The partial likelihood approach can be viewed as a sequence of conditional log odds models,34,35 and therefore, one might expect the (penalized) Cox regression model described here to perform similarly to the (penalized) multinomial regression model pursued in this paper. In fact, the computational routine described in Appendix B would apply to the Cox model with only very minor modifications (that concern the gradient computations). A rigorous comparison of the two approaches is beyond the scope of the current manuscript but is an interesting topic for future development.

3 |. ALZHEIMER’S DISEASE DATA ANALYSIS

3.1 |. Data

We use data from the n = 924 individuals in the Pittsburgh section of the CHS-CS, recorded between 1990 and 2012. Each individual underwent clinical and cognitive assessments at multiple ages, all falling in the range 65, … , 108. The matrix of (future) outcomes Y has dimension n × 34: for i = 1, … , 924 and t = 65, … , 98, the outcome Yit stores the cognitive status at age t + 10 and can assume one of the following values:

$$
Y_{it} = \begin{cases} 1, & \text{if normal} \\ 2, & \text{if MCI/dementia} \\ 3, & \text{if dead}, \end{cases}
$$

where 11% of the overall cases are in outcome class normal, 21% in class dementia, and 68% in class dead. MCI is included in the same class as dementia, as they are both instances of cognitive impairment. Hence, the proposed multinomial model predicts the onset of MCI/dementia, in the presence of a separate death category. This is done to implicitly adjust for the confounding effect of death, as some risk factors for dementia are also known to be risk factors for death.17

The number nt of outcomes observed at age t varies across time, for two reasons: first, different subjects entered the study at different ages, and second, once a subject dies at age t0, we exclude them from the models formed at all ages t > t0 for predicting the outcomes of individuals at age t + 10. Figure 2 shows the proportion of the three outcome categories at each age, along with the total number of observed outcomes at each age. The maximum number of outcomes is 604 at age 88, whereas the minimum is 7 at age 108, with 10 504 total outcomes measured in at least one time point. We resort to the strategy described in Appendix B.2.3 and use the scaled loss in (B8) to compensate for the varying sample sizes.

FIGURE 2. Proportion of observed outcomes in each category over age, starting at 75. The total number of observations at each age is shown on the top axis of the plot. While the proportion of dead patients increases over time, that of normal patients decreases strictly. Interestingly, the proportion of patients with dementia increases until around age 90 and then starts decreasing [Colour figure can be viewed at wileyonlinelibrary.com]

The array of predictors X is composed of time-varying variables that were recorded at least twice during the CHS-CS study, and time-invariant variables, such as gender and race. Predictors collected in the Pittsburgh chapter of the CHS-CS are described in detail in the works of Lopez et al.12,14 A complication in the data set is the substantial amount of missingness in the array of predictors. We impute missing values using a uniform rule for all possible causes of missingness. A missing value at age t is imputed by taking the closest past measurement from the same individual, if present. If all the past values are missing, the global median from people of age t is used. The only exception is the case of time-invariant predictors, whose missing values are imputed by either future or past values, as available.
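
The imputation rule for a single time-varying predictor can be sketched as follows (our illustration of the rule just described; array names and layout are assumptions, and the age-specific medians are taken from the originally observed values).

```python
import numpy as np

def impute_predictor(x):
    """x: (n, T) array for one time-varying predictor, with np.nan for missing values."""
    x = x.copy()
    n, T = x.shape
    medians = np.nanmedian(x, axis=0)   # age-specific medians from observed values
    # Carry the closest past measurement forward within each individual.
    for i in range(n):
        for t in range(1, T):
            if np.isnan(x[i, t]) and not np.isnan(x[i, t - 1]):
                x[i, t] = x[i, t - 1]
    # Fall back to the age-specific median when no past value exists.
    for t in range(T):
        col = x[:, t]
        col[np.isnan(col)] = medians[t]
    return x
```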

Categorical variables with m possible outcomes are converted to m − 1 binary variables, and all the predictors are standardized to have zero mean and unit standard deviation. This is a standard procedure in regularization, as the lasso and fused lasso penalties put constraints on the size of the coefficients associated with each variable.36 To be precise, imputation of missing values and standardization of the predictors are performed within each of the folds used in the cross-validation method for the choice of the tuning parameters λ1 and λ2 (discussed in the following) and then again for the full data set in the final estimation procedure that uses the selected tuning parameters. The final array of predictors X has dimension nt × 1050 × 34, where 1050 is the number of variables recorded over the 34-year age range and nt is the number of observed outcomes at age t.

Additional information about the CHS-CS data can be found on the website https://chs-nhlbi.org/. The data can be accessed by submitting a data request form to the National Heart, Lung, and Blood Institute.

3.2 |. Selection of tuning parameters λ1 and λ2

We compare the different model selection methods on a subset of the CHS-CS data set with 140 predictors and 600 randomly selected individuals, as an illustration. The individuals are randomly split into five folds. We use 4/5 of the data set to perform model selection and subsequent model fitting with the six techniques described above: cross-validation, cross-validation with the one-standard-error rule, and AIC and BIC under negative log likelihood and misclassification losses. To be perfectly clear, the model selection techniques work entirely within this given 4/5 of the data set, so that, eg, cross-validation further divides this data set into folds. In fact, we used fourfold cross-validation to keep this division simple. The remaining 1/5 of the data set is then used for evaluation of the estimates coming from each of the six methods, and this entire process is repeated, leaving out each fold in turn as the evaluation set. We record several measures on each evaluation set: the misclassification rate, the true positive rate in identifying the dementia class, the true positive rate in identifying the dementia and death classes combined, and the degrees of freedom (number of nonzero blocks in the estimate). Figure 3 displays the means and standard errors of these four measures, for each of the six model selection methods.

FIGURE 3. Comparison of different methods for selection of tuning parameters λ1, λ2 on the Cardiovascular Health Study Cognition Study data set. The x-axis in each plot parameterizes the six different methods considered, which are from left to right: AIC and BIC under negative log likelihood loss (AICLD and BICLD), AIC and BIC under misclassification loss (AICMC and BICMC), cross-validation (CV), and cross-validation with the one-standard-error rule (CV1se). The upper left plot shows (out-of-sample) misclassification rate associated with the estimates selected by each method, averaged over 5 iterations. The segments denote ±1 standard errors around the mean. The red dotted line is the average misclassification rate associated with the naive estimator that predicts all individuals as dead (the majority class). The upper right and bottom left plots show different measures of evaluation (again, computed out-of-sample): the true positive rate in identifying the dementia class, respectively, the true positive rate in identifying the dementia and death classes combined. Finally, the bottom right plot shows degrees of freedom (number of nonzero blocks) of the estimates selected by each method [Colour figure can be viewed at wileyonlinelibrary.com]

We further investigated the out-of-sample mean true positive rate and mean positive predictive value in identifying dementia at different ages for the six different model selection methods considered, as shown in Figure 4.

FIGURE 4. Mean true positive rate (TPR) and mean positive predictive value (PPV) for correctly predicting dementia over age, for each of the methods considered for selecting the tuning parameters λ1 and λ2. Cross-validation under the usual rule gives the best out of sample prediction in terms of maximizing the TPR and PPV for predicting dementia, with best prediction in terms of these evaluation metrics for predicting dementia between ages 80 and 90 using predictors observed between 70 and 80 years of age [Colour figure can be viewed at wileyonlinelibrary.com]

Cross-validation and cross-validation with the one-standard-error rule both seem to represent a favorable balance between the different evaluation measures. The cross-validation methods provide a misclassification rate significantly better than that of the null model, which predicts according to the majority class (death); they yield two of the three highest true positive rates in identifying the dementia class; and they perform well in terms of identifying the dementia and death classes combined (as do all methods: note that all true positive rates here are about 0.75 or higher). Cross-validation under the usual rule also had the best performance in terms of true positive rate and positive predictive value over age, with the highest values when using predictors between ages 70 and 80 to predict dementia between ages 80 and 90.

We settled on cross-validation under the usual rule, rather than the one-standard-error rule, because the former achieves the highest true positive rate in identifying the dementia class, which was our primary concern in the CHS-CS data analysis. By design, cross-validation with the one-standard-error rule delivers a simpler estimate in terms of degrees of freedom (196 for the one-standard-error rule versus 388 for the usual rule), though both cross-validation models are highly regularized in absolute terms (eg, the fully saturated model would have thousands of nonzero blocks).

3.3 |. Model and algorithm specification

In the AD application, the multinomial model in (1) is determined by two equations, as there are three possible outcomes (normal, MCI/dementia, death); the outcome “normal” is taken as the base class. We will refer to the two equations (and the corresponding sets of coefficients) as the “dementia vs normal” and “death vs normal” equations, respectively.

We use the proximal gradient descent algorithm described in Appendix B to estimate the coefficients that maximize the penalized log likelihood criterion in (2). The initializations (β0(0),β(0)) are set to be zero matrices, the maximum number of iterations is S = 80, the initial step size before backtracking is τ0 = 20, the backtracking shrinkage parameter is γ = 0.6 and the tolerance of the first stopping criterion (relative difference in function values) is ϵ = 0.001. We select the tuning parameters by a fourfold cross-validation procedure that minimizes the misclassification error. The selected parameters are λ1 = 0.019 and λ2 = 0.072, which yield an average prediction error of 0.312 (standard error 0.009).

The relative variable importances for the CHS-CS data set are computed using four subsampled data sets, each one containing 75% of the total number of individuals. The tuning parameter values have been selected by cross-validation. The variable importances were defined to incorporate this selection step into the fitting procedure, as explained in Section 2.3.

3.4 |. Results

Out of the 1050 coefficients associated with the predictors described above, 148 are estimated to be nonzero for at least one time point in the 34-year age range. More precisely, for at least one age, 57 coefficients are nonzero in the “dementia versus normal” equation of the predictive multinomial logit model, and 124 are nonzero in the “death versus normal” equation.

For interpreting the results, we focus on the 15 most important predictors in the 34-year age range, separately for the two equations modeling “dementia versus normal” and “death versus normal,” respectively. The meaning of these predictors and the coding used for the categorical variables are reported in Table 2. The measure of importance is described in detail in Section 2.3 and is, in fact, a measure of stability of the estimated coefficients across four subsets of the data (the four training sets used in cross-validation).

TABLE 2.

The 15 most important variables in the two separate equations of the multinomial logit model

Dementia vs Normal
Variable Meaning (and coding for categorical variables, before scaling)
Race white Race: “White” 1, else 0
Vitamin C Taken vitamin C in the last 2 weeks? (number of days)
Learn new How is the person at learning new things wrt 10 years ago? “A bit worse” 1, else 0
Estrogen If you are not currently taking estrogen, have you taken it in the past? “Yes” 1, “No” 0
Fearful How often felt fearful during last week? “Most of the time” 1, else 0
Wake early Do you usually wake up far too early? “Yes” 1, “No” 0
Female Gender: “Female” 1, “Male” 0
Diuretics Medication: thiazide diuretics w/o K-sparing. “Yes” 1, “No” 0
Race other Race: “Other (no white, no black)” 1, else 0
Orthosis Do you use a lower extremity orthosis? “Yes” 1, “No” 0
Pulse 60-second heart rate
Gripping What causes difficulty in gripping? “Pain in arm/hand” 1, else 0
Find help If sick, could easily find someone to help? “Probably False” 1, else 0
Digits correct Digit-symbol substitution task: number of symbols correctly coded
Trust There is at least one person whose advice you really trust. “Probably true” 1, else 0
Death vs Normal
Variable Meaning (and coding for categorical variables, before scaling)
Digits correct Digit-symbol substitution task: number of symbols correctly coded
Chair stand Repeated chair stands: number of seconds
Female Gender: “Female” 1, “Male” 0
Cardiac injury Cardiac injury score
Chest pain Ever had pain in chest when walking uphill/hurry? “No” 1, else 0
Cigarettes Number of cigarettes smoked per day
Digitalis Digitalis medicines prescribed? “Yes” 1, “No” 0
Smoker Current smoke status: “Never smoked” 1, else 0
Health Would you say, in general, your health is … ? “Fair” 1, else 0
Exercise If gained/lost weight, was exercise a major factor? “Yes” 1, “No” 0
#Medications Number of medications taken
Diabetes ADA diabetic status? “New diabetes” 1, else 0
Any smoker Does anyone living with you smoke cigarettes regularly? “Yes” 1, “No” 0
Bld pressure Blood pressure variable: left ankle-arm index
Walking Do you have difficulty walking one-half a mile? “Yes” 1, else 0

In Figure 5, we show the plot of relative importance and the corresponding coefficients of the important predictors for the two equations. The plots on the left show the relative importance of the 15 variables with respect to the most important one, whose importance was scaled to be 100. The plots on the right show, separately for the two equations, the longitudinal estimated coefficients for the 15 most important variables, using the data and algorithm specification described above. The nonzero coefficients that are not displayed in Figure 5 are less important (according to our measure of stability) and, for the vast majority, their absolute values are less than 0.1.

FIGURE 5. Cardiovascular Health Study Cognition Study data analysis. Left: relative importance plots for the 15 most important variables in the “dementia vs normal” and “death vs normal” equations of the multinomial logit model. The x-axis represents the measure of importance on a scale of 0–100. Right: corresponding estimated coefficients for predictors recorded between ages 65 and 100. The order of the legends follows the order of the maximum/minimum values of the estimated coefficient trajectories. Note that some coefficients are estimated to be very close to 0 and the corresponding trajectories are hidden by other coefficients [Colour figure can be viewed at wileyonlinelibrary.com]

We now proceed to interpret the results, keeping in mind that, ultimately, we are estimating the coefficients of a multinomial logit model and that the outcome variable is recorded 10 years in the future with respect to the predictors. For example, an increase in the value of a predictor with positive estimated coefficient in the top right plot of Figure 5 is associated with an increase of the (10 years future) odds of dementia with respect to a normal cognitive status. In what follows, to facilitate the exposition of results, our statements are less formal.

Inspecting the “dementia vs normal” plot, we see that, in general, being Caucasian (Race white) is an important predictor of a decrease in the odds of dementia, while, after the age of 85, fear (Fearful), lack of available caretakers (Find help), and deterioration of learning skills (Learn new) are important predictors of increasing odds of dementia. The variables Diuretics (a particular diuretic) and Wake early (early wake-ups) have positive coefficients for the age ranges 65, … , 78 and 77, … , 91, respectively, and hence, if active, they predict an increasing risk of dementia. The “death vs normal” plot reveals the importance of several variables in the age range 65, … , 85: a longer time to rise from sitting in a chair (Chair stand), more cigarettes (Cigarettes), and a higher cardiac injury score (Cardiac injury) are important predictors of an increasing risk of death. Other variables in the same age range, with analogous interpretations but lower importance, are Diabetes (“new diabetes” diagnosis), Health (“fair” health status), Digitalis (use of Digitalis), and Walking (difficulty in walking). By contrast, in the same age range, good performance on the digit-symbol substitution task (Digits correct) predicts a decrease in the odds of death. Finally, regardless of age, being around nonsmokers (Any smoker) or being a woman (Female) are good predictors of a decrease in the odds of death. Note that caution is needed when interpreting the changes in log odds over time, which could be due to a selection bias, as discussed in the work of Hernán.37 Of course, this bias would have a more severe impact on a causal interpretation, which is not the focus of our work.

Figure 6 shows the intercept coefficients $\hat\beta_{01}$ and $\hat\beta_{02}$, which, we recall, are not penalized in the log likelihood criterion in (2). The intercepts account for time-varying risk that is not explained by the predictors. In particular, the coefficients $\hat\beta_{02}$ increase over time, suggesting that an increasing amount of the risk of death can be attributed to a subject’s age alone, independent of the predictor measurements. However, as we explain later, the risk factors for dementia become more relevant in later years, suggesting that age alone does not explain the risk of dementia after the age of 85.

FIGURE 6. Cardiovascular Health Study Cognition Study data analysis. Estimated intercept coefficients in the two separate equations of the multinomial logit model

The results of the proposed multinomial fused lasso methodology applied to the CHS data are broadly consistent with what is known about risk and protective factors for dementia in the elderly.38 Race, gender, vascular and heart disease, lack of available caregivers, and deterioration of learning and memory are all associated with an increased risk of dementia. The results, however, provide critical new insights into the natural progression of MCI/dementia. First, the relative importance of the risk factors changes over time. As shown in Figure 5, with the exception of race, risk factors for dementia become more relevant after the age of 85. This is critical, as there is increasing evidence39 for a change in the risk profile for the expression of clinical dementia among the oldest-old. Second, the independent prediction of death and the associated risk/protection factors highlight the close connection between risk of death and risk of dementia. That is, performance on a simple, timed test of psychomotor speed (digit symbol substitution task) is a very powerful predictor of death within 10 years, as is a measure of physical strength/frailty (time to arise from a chair). Other variables, including gender, diabetes, walking, and exercise, are all predictors of death, but are known, from other analyses in the CHS and other studies, to be linked to the risk of dementia. The importance of these risk/protective factors for death is attenuated (with the exception of gender) after age 85, likely reflecting survivor bias. Taken together, these results add to the growing body of evidence of the critical importance of accounting for mortality in the analysis of risk for dementia, especially among the oldest old.39

We further observe that the variables with high positive or negative coefficients for most ages in the plotted trajectories typically also have among the highest relative importances. Another interesting observation concerns categorical predictors, which (recall) have been converted into binary predictors over multiple levels: often only some levels of a categorical predictor are active in the plotted trajectories.

For our analysis, we chose a 10-year time window for risk prediction. The clinical motivation for this time window comes from evidence in the literature that the “biological” processes of neurodegeneration begin at least 10 years (if not earlier) before the “clinical” expression as MCI or dementia.8,40 Among individuals of age 65–75 who are cognitively normal, this is a scientifically and clinically reasonable time window to use. However, had we similar data from individuals as young as 45–50 years old, then we might wish to choose time windows of 20 years or longer. It could also be argued that a shorter time window might be more scientifically and clinically relevant among the oldest-old individuals, for whom survival times of 10 years become increasingly less likely. Since we are starting at around 65 years old and using a uniform prediction window for all ages in our analysis, a 10-year window is a reasonable compromise to examine prediction in a clinically meaningful way. However, the methodology introduced in this paper is in no way tied to this one prediction window.

4 |. SIMULATION STUDY

We compare the prediction performance of the proposed model with that of other competing models using simulated data. Data are simulated to emulate the distributions of the observed covariates and outcome, for individuals with no missing outcome, in the CHS-CS data. We randomly chose 300 predictors, both time varying and time invariant, from the observed data and simulated a matrix of predictors $X^{\mathrm{sim}}_t$ at each time point t = 1, … , 34. Continuous predictors, which are all positive, are simulated from a truncated normal distribution, such that, for each continuous covariate j and time t,

$$
X^{\mathrm{sim}}_{j,t} \sim \mathrm{TruncatedNormal}\big(\mu_{j,t}, \sigma^2_{j,t}, a_{j,t}, b_{j,t}\big).
$$

The mean $\mu_{j,t}$ and the variance $\sigma^2_{j,t}$ are estimated by the empirical mean and variance of the observed data for covariate j at time t. The lower limit $a_{j,t}$ and the upper limit $b_{j,t}$ are estimated by the minimum and maximum of the observed data. Binary predictors are simulated from a Bernoulli distribution, such that, for each binary covariate l and time t,

$$
X^{\mathrm{sim}}_{l,t} \sim \mathrm{Bernoulli}(p_{l,t}),
$$

where $p_{l,t}$ is the observed proportion of events for covariate l at time t. Categorical covariates with more than two categories are simulated from a multinomial distribution, where the probability of each category is again estimated by the corresponding proportion in the observed data. The outcome for patient i at time t, denoted $Y^{\mathrm{sim}}_{it}$, is then generated using the multinomial logistic model in Equation (1) with three categories. The array of true coefficients β is sparse and piecewise constant over time, with dimension 300 × 34 × 2. The true degrees of freedom (number of nonzero blocks) of β is 94, with 20 relevant predictors, ie, predictors with a nonzero coefficient for at least one time point. In the simulated data, 24% of the patients are in class dementia, 58% are in class death, and 18% are in class normal. The proportions of the different outcome categories in the simulated data are comparable to those in the observed data.
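
As a sketch of the simulation scheme for a single covariate and time point (our own code, assuming a recent scipy; scipy's truncated normal is parameterized on the standardized scale), the continuous and binary cases look as follows.

```python
import numpy as np
from scipy.stats import truncnorm

def simulate_continuous(obs, n_sim, rng):
    """obs: (n_obs,) observed values of one continuous covariate at one time point."""
    mu, sd = obs.mean(), obs.std()
    a, b = obs.min(), obs.max()
    # scipy expects the truncation limits on the standard-normal scale.
    lo, hi = (a - mu) / sd, (b - mu) / sd
    return truncnorm.rvs(lo, hi, loc=mu, scale=sd, size=n_sim, random_state=rng)

def simulate_binary(obs, n_sim, rng):
    """obs: (n_obs,) observed 0/1 values of one binary covariate at one time point."""
    return rng.binomial(1, obs.mean(), size=n_sim)

rng = np.random.default_rng(0)
```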

In addition to the proposed multinomial logistic regression with lasso and fused lasso penalties, we consider multinomial logistic regression without any penalty, multinomial logistic regression with only the lasso penalty, and two binomial logistic regressions with lasso and fused lasso penalties, for comparison. In the first binomial regression model, we ignore patients who died at time t + Δ in the training data. This logistic regression is fitted assuming that patients can only be in the normal or dementia categories. In the second binomial regression, patients in the outcome categories normal and dead are pooled to form a single category, and the model is fitted to predict whether a patient will be in class dementia at time t + Δ versus no dementia. The two binomial regression models do not account for the competing risk of death.

The prediction performance of the five different models is compared using nested cross-validation. The internal fourfold cross-validation is used for the selection of tuning parameters that minimize the cross-validation error, whereas the external fivefold cross-validation is used to compare out-of-sample prediction for each model at the best tuning parameters. Fitting multinomial logistic regression without any penalty is equivalent to fitting the proposed model with λ1 = λ2 = 0. Similarly, for multinomial logistic regression with only the lasso penalty, we fix λ2 at zero and use cross-validation to select λ1.

Misclassification error, true positive rate, false positive rate, and estimated degrees of freedom are used to compare the fits from the different prediction models. Since correct prediction of dementia is the primary interest of the analysis, we focus on the true positive rate and false positive rate for predicting dementia only. For comparability among models with varying numbers of outcome categories (two categories in the binomial models and three in the multinomial models), we standardize the degrees of freedom by the number of outcome categories K in the fitted model. Finally, recall that the cases in class dead were not included in the first binomial logistic regression model for fitting normal versus dementia. However, to reflect the real-world scenario as closely as possible, we do not impose this restriction when predicting on the test data. Instead, we use the predictors for the patients in all three outcome categories to predict a normal or dementia outcome and use the predicted outcome to compute the evaluation metrics. This helps us identify whether and how often patients are incorrectly classified as dementia when the category dead is not used during training. In Figure 7, we show the evaluation metrics for each model computed using fivefold cross-validation.

FIGURE 7. Simulation: comparison of different prediction models on the simulated data set. The x-axis in each plot represents the five different models considered, which are, from left to right: multinomial logistic regression without penalty (GLM), multinomial regression with only the lasso penalty (multi-lasso), multinomial logistic regression with lasso and fused lasso penalties (multi-fused), binomial logistic regression with lasso and fused lasso penalties in which people who died are excluded from training (bi-fused I), and, finally, binomial logistic regression with lasso and fused lasso penalties in which normal and dead patients are grouped into one category (bi-fused II). The upper left plot shows the (out-of-sample) misclassification rate associated with the estimates selected by each method, averaged over 5 iterations. The segments denote ±1 standard errors around the mean. The upper right and bottom left plots show different measures of evaluation (again, computed out-of-sample): the true positive rate (TPR) in identifying the dementia class, and the false positive rate (FPR) in identifying the dementia class. Finally, the bottom right plot shows the degrees of freedom (number of nonzero blocks) of the estimates selected by each method, standardized by the number of outcome categories in the fitted model, for comparability

We also compute importance measures for each covariate, following the procedure discussed in the data analysis (Section 3.2), and we compare covariates with a high relative measure of importance to the truly important covariates. For each model, we rank the predictors by their relative importance and select the top 20 important predictors. We then compute the proportion of the true top 20 important predictors that were selected among the top 20 important predictors in the fitted model. The proportions for the five models and the two outcome categories are displayed in Table 3.

TABLE 3.

Proportion of the true top 20 important predictors that were selected as top 20 important predictors in the fitted model, for dementia versus normal and dead versus normal, for each of the five prediction models

Method % Dementia % Death
Multinomial logistic regression
No regularization 65 50
Lasso 70 75
Lasso and fused lasso 75 75
Binomial logistic regression
(Dementia vs normal) Lasso and fused lasso 80 0
(Dementia vs normal or death) Lasso and fused lasso 65 0

Our simulation results show that the proposed multinomial logistic regression with lasso and fused lasso penalties performs better than the competing models in a longitudinal setting when the dimension of the predictor space is large. The proposed model has a relatively high true positive rate for predicting dementia with a very low false positive rate and relatively small degrees of freedom. Our simulation results further justify treating death as a competing outcome, rather than either completely ignoring it (which resulted in a high false positive rate) or merging it with the normal category (which resulted in a low true positive rate). Finally, the covariates with true nonzero coefficients were also the ones with high relative importance in our proposed model, compared with the other competing models.

5 |. DISCUSSION AND FUTURE WORK

In this work, we proposed a multinomial model for high-dimensional longitudinal classification tasks. Our proposal operates under the assumption that a small number of predictors contribute more or less persistent effects across time. The multinomial model is fit under lasso and fused lasso regularizations, which address the assumptions of sparsity and persistence, respectively, and lead to piecewise constant estimated coefficient profiles. We described a highly efficient computational algorithm for this model based on proximal gradient descent, demonstrated the applicability of the model on an Alzheimer’s data set taken from the CHS-CS, and discussed practically important issues such as stability measures for the estimates and tuning parameter selection. Our proposed model is relevant to many applications in medical research beyond the one considered in this paper, since the use of high-dimensional data sources, such as electronic health records, is gaining popularity in those applications.

A number of extensions of the basic model are well within reach. For example, placing a group lasso penalty on the coefficients associated with each level of a binary expansion for a categorical variable may be useful for encouraging sparsity in a group sense (ie, over all levels of a categorical variable at once). As another example, more complex trends than piecewise constant ones may be fit by replacing the fused lasso penalty with a trend filtering penalty,41,42 which would lead to piecewise polynomial trends of any chosen order k. The appropriateness of such a penalty would depend on the scientific application; the use of a fused lasso penalty assumes that the effect of a given variable is mostly constant across time, with possible change points; the use of a quadratic trend filtering penalty (polynomial order k = 2) allows the effect to vary more smoothly across time.

More difficult and open-ended extensions concern statistical inference for the fitted longitudinal classification models. For example, the construction of confidence intervals (or bands) for selected coefficients (or coefficient profiles) would be an extremely useful tool for the practitioner and would offer more concrete and rigorous interpretations than the stability measures described in Section 2.3. Unfortunately, this is quite a difficult problem, even for simpler regularization schemes (such as a pure lasso penalty) and simpler observation models (such as linear regression). But, recent inferential developments for related high-dimensional estimation tasks4348 shed a positive light on this future endeavor.

ACKNOWLEDGEMENTS

This research was supported through contracts HHSN268201200036C, HHSN268200800007C, HHSN268201800001C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, and N01HC85086 and through grants U01HL080295 and U01HL130114 from the National Heart, Lung, and Blood Institute, with additional contribution from the National Institute of Neurological Disorders and Stroke. Additional support was provided through grant R01AG023629 from the National Institute on Aging. A full list of principal CHS investigators and institutions can be found at CHS-NHLBI.org.

Funding information

National Heart, Lung, and Blood Institute, Grant/Award Number: HHSN268201200036C, HHSN268200800007C, HHSN268201800001C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, N01HC85086, U01HL080295, and U01HL130114; National Institute of Neurological Disorders and Stroke; National Institute on Aging, Grant/Award Number: R01AG023629

APPENDIX A. A PROXIMAL GRADIENT DESCENT APPROACH

Suppose that $g:\mathbb{R}^d \to \mathbb{R}$ is convex and differentiable, $h:\mathbb{R}^d \to \mathbb{R}$ is convex, and we are interested in computing a solution

$$
x^{\star} \in \underset{x \in \mathbb{R}^d}{\operatorname{argmin}}\ g(x) + h(x).
$$

If h were assumed differentiable, then the criterion f(x) = g(x) + h(x) is convex and differentiable, and repeating the simple gradient descent steps

$$
x^{+} = x - \tau \nabla f(x) \tag{A1}
$$

suffices to minimize f, for an appropriate choice of step size τ. (In the above, we write x+ to denote the gradient descent update from the current iterate x.) If h is not differentiable, then gradient descent obviously does not apply, but as long as h is “simple” (to be made precise shortly), we can apply a variant of gradient descent that shares many of its properties, called proximal gradient descent. Proximal gradient descent is often also called composite or generalized gradient descent, and in this routine, we repeat the steps

x^{+} = \mathrm{prox}_{h,\tau}\big(x - \tau \nabla g(x)\big) \qquad (A2)

until convergence, where $\mathrm{prox}_{h,\tau}:\mathbb{R}^d \to \mathbb{R}^d$ is the proximal mapping associated with $h$ (and $\tau$),

\mathrm{prox}_{h,\tau}(x) = \operatorname*{argmin}_{z \in \mathbb{R}^d} \; \frac{1}{2\tau}\|x - z\|_2^2 + h(z). \qquad (A3)

(Strict convexity of the above criterion ensures that it has a unique minimizer, so that the proximal mapping is well defined.) Provided that h is simple, by which we mean that its proximal map (A3) is explicitly computable, the proximal gradient descent steps (A2) are straightforward and resemble the classical gradient descent analogues (A1): we simply take a gradient step in the direction governed by the smooth part g and then apply the proximal map of h. A slightly more formal perspective argues that the updates (A2) are the result of minimizing h plus a quadratic expansion of g around the current iterate x.
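To make the generic recipe concrete, the following is a minimal Python sketch of the iteration (A2). It uses a pure lasso penalty h(x) = λ‖x‖1, whose proximal map is elementwise soft-thresholding, as a simple stand-in for the penalty used in our problem; the function names, the quadratic loss, and the data in the usage example are illustrative assumptions only and are not part of the method described in this paper.

    import numpy as np

    def soft_threshold(x, t):
        # proximal map of t * ||.||_1: shrink each entry toward zero by t
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def proximal_gradient(grad_g, prox_h, x0, tau, n_iter=500):
        # repeat the proximal gradient step (A2): x <- prox_{h,tau}(x - tau * grad g(x))
        x = x0.copy()
        for _ in range(n_iter):
            x = prox_h(x - tau * grad_g(x), tau)
        return x

    # illustrative usage: lasso regression, g(x) = 0.5 * ||A x - b||_2^2, h(x) = lam * ||x||_1
    rng = np.random.default_rng(0)
    A, b, lam = rng.normal(size=(50, 10)), rng.normal(size=50), 0.5
    grad_g = lambda x: A.T @ (A @ x - b)
    prox_h = lambda v, tau: soft_threshold(v, tau * lam)
    tau = 1.0 / np.linalg.norm(A, 2) ** 2          # fixed step size tau <= 1/L
    x_hat = proximal_gradient(grad_g, prox_h, np.zeros(10), tau)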

Proximal gradient descent has become a very popular tool for optimization problems in statistics and machine learning, where, typically, g represents a smooth loss function and h a nonsmooth regularizer. This trend is somewhat recent, even though the study of proximal mappings has a long history in the optimization community (eg, see the review by Parikh and Boyd49). In terms of convergence properties, proximal gradient descent enjoys essentially the same convergence rates as gradient descent under the analogous assumptions and is amenable to acceleration techniques just like gradient descent (eg, the works of Nesterov50 and Beck and Teboulle51). Of course, for proximal gradient descent to be applicable in practice, one must be able to exactly (or at least approximately) compute the proximal map of h in (A3); fortunately, this is possible for many common regularizers h encountered in statistics. In our case, the proximal mapping reduces to solving a problem of the form

\hat{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^m} \; \frac{1}{2}\|x - \theta\|_2^2 + \lambda_1 \sum_{i=1}^{m}|\theta_i| + \lambda_2 \sum_{i=1}^{m-1}|\theta_i - \theta_{i+1}|. \qquad (A4)

This is often called the fused lasso signal approximator (FLSA) problem, and extremely fast, linear-time algorithms exist to compute its solution. In particular, we rely on an elegant dynamic programming approach proposed by Johnson.52
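As a point of reference, the following is a small sketch of the FLSA prox (A4) written with the generic convex solver cvxpy; this tooling choice is an assumption of the example and not what our implementation uses in practice, since the dynamic programming algorithm of Johnson52 solves the same problem in linear time.

    import numpy as np
    import cvxpy as cp

    def flsa_prox(x, lam1, lam2):
        # solve (A4): (1/2)||x - theta||_2^2 + lam1 * sum_i |theta_i| + lam2 * sum_i |theta_i - theta_{i+1}|
        theta = cp.Variable(len(x))
        obj = cp.Minimize(0.5 * cp.sum_squares(x - theta)
                          + lam1 * cp.norm1(theta)
                          + lam2 * cp.norm1(cp.diff(theta)))
        cp.Problem(obj).solve()
        return theta.value

    # a noisy piecewise constant signal is shrunk toward a sparse, piecewise constant fit
    x = np.concatenate([np.zeros(20), 2 * np.ones(20)]) + 0.3 * np.random.randn(40)
    theta_hat = flsa_prox(x, lam1=0.1, lam2=1.0)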

APPENDIX B. PROXIMAL GRADIENT DESCENT ALGORITHM

We describe the implementation of the proximal gradient descent algorithm for our problem, along with a number of practical considerations such as the choice of step size and the stopping criterion.

B.1 |. Application to the multinomial fused lasso problem

The problem in (2) fits into the desired form for proximal gradient descent, with g the multinomial regression loss (ie, negative multinomial regression log likelihood) and h the lasso plus fused lasso penalties. Formally, we can rewrite (2) as

(\hat{\beta}_0, \hat{\beta}) \in \operatorname*{argmin}_{\beta_0, \beta} \; g(\beta_0, \beta) + h(\beta_0, \beta), \qquad (B1)

where g is the convex, smooth function

g(\beta_0, \beta) = \sum_{t=1}^{T} \sum_{i=1}^{n} \left\{ -\sum_{k=1}^{K-1} I(Y_{it}=k)\big(\beta_{0tk} + X_{it\cdot}\,\beta_{\cdot tk}\big) + \log\Big(1 + \sum_{h=1}^{K-1} \exp\big(\beta_{0th} + X_{it\cdot}\,\beta_{\cdot th}\big)\Big) \right\},

and h is the convex, nonsmooth function

h(\beta_0, \beta) = \lambda_1 \sum_{j=1}^{p}\sum_{t=1}^{T}\sum_{k=1}^{K-1} |\beta_{jtk}| + \lambda_2 \sum_{j=1}^{p}\sum_{t=1}^{T-1}\sum_{k=1}^{K-1} |\beta_{jtk} - \beta_{j(t+1)k}|.

Here, we consider fixed values λ1, λ2 ≥ 0. As described previously, each of these tuning parameters controls the strength of its respective penalty term and hence the properties of the computed estimate $(\hat{\beta}_0, \hat{\beta})$; we discuss the selection of λ1 and λ2 in Section 3.2. We note that the intercept coefficients $\beta_0$ are not penalized.

To compute the proximal gradient updates, as given in (A2), we must consider two quantities: the gradient of g and the proximal map of h. First, we discuss the gradient. As $\beta_0 \in \mathbb{R}^{T \times (K-1)}$ and $\beta \in \mathbb{R}^{p \times T \times (K-1)}$, we may consider the gradient as having dimension $\nabla g(\beta_0, \beta) \in \mathbb{R}^{(p+1) \times T \times (K-1)}$. We will index this as $[\nabla g(\beta_0, \beta)]_{jtk}$ for j = 0, … , p, t = 1, … , T, k = 1, … , K − 1; hence, $[\nabla g(\beta_0, \beta)]_{0tk}$ gives the partial derivative of g with respect to $\beta_{0tk}$, and $[\nabla g(\beta_0, \beta)]_{jtk}$ the partial derivative with respect to $\beta_{jtk}$, for j = 1, … , p. For generic t, k, we have

[\nabla g(\beta_0, \beta)]_{0tk} = \sum_{i=1}^{n} \left( -I(Y_{it}=k) + \frac{\exp\big(\beta_{0tk} + X_{it\cdot}\,\beta_{\cdot tk}\big)}{1 + \sum_{h=1}^{K-1} \exp\big(\beta_{0th} + X_{it\cdot}\,\beta_{\cdot th}\big)} \right), \qquad (B2)

and for j ≥ 1,

[\nabla g(\beta_0, \beta)]_{jtk} = \sum_{i=1}^{n} \left( -I(Y_{it}=k)\,X_{ijt} + \frac{X_{ijt}\,\exp\big(\beta_{0tk} + X_{it\cdot}\,\beta_{\cdot tk}\big)}{1 + \sum_{h=1}^{K-1} \exp\big(\beta_{0th} + X_{it\cdot}\,\beta_{\cdot th}\big)} \right). \qquad (B3)

It is evident that computation of the gradient requires O(npTK) operations.
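As a sanity check on (B2) and (B3), the following is a vectorized Python/numpy sketch of the gradient computation; the array layout (X of shape n × p × T, Y coded 1, … , K with class K as the reference) is an assumption made for the example and is not dictated by the paper.

    import numpy as np

    def multinomial_gradient(X, Y, beta0, beta):
        # Gradient of the (unpenalized) multinomial loss g, as in (B2)-(B3).
        # X: (n, p, T) predictors; Y: (n, T) outcomes coded 1, ..., K (K = reference class);
        # beta0: (T, K-1) intercepts; beta: (p, T, K-1) coefficients.
        n, p, T = X.shape
        Km1 = beta0.shape[1]
        # linear predictors eta[i, t, k] = beta0[t, k] + sum_j X[i, j, t] * beta[j, t, k]
        eta = beta0[None, :, :] + np.einsum('ijt,jtk->itk', X, beta)
        prob = np.exp(eta)
        prob /= 1.0 + prob.sum(axis=2, keepdims=True)     # P(Y_it = k), k = 1, ..., K-1
        ind = np.stack([(Y == k + 1).astype(float) for k in range(Km1)], axis=2)  # I(Y_it = k)
        resid = prob - ind                                 # (n, T, K-1)
        grad0 = resid.sum(axis=0)                          # (T, K-1), matches (B2)
        grad = np.einsum('ijt,itk->jtk', X, resid)         # (p, T, K-1), matches (B3)
        return grad0, grad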

Now, we discuss the proximal operator. Since the intercept coefficients $\beta_0 \in \mathbb{R}^{T \times (K-1)}$ are left unpenalized, the proximal map over $\beta_0$ just reduces to the identity, and the intercept terms undergo the updates

\beta_{0tk}^{+} = \beta_{0tk} - \tau\,[\nabla g(\beta_0, \beta)]_{0tk}, \qquad t = 1, \dots, T, \; k = 1, \dots, K-1.

Hence, we consider the proximal map over β alone. At an arbitrary input $x \in \mathbb{R}^{p \times T \times (K-1)}$, this is

\operatorname*{argmin}_{z \in \mathbb{R}^{p \times T \times (K-1)}} \; \frac{1}{2\tau} \sum_{j=1}^{p}\sum_{t=1}^{T}\sum_{k=1}^{K-1} (x_{jtk} - z_{jtk})^2 + \lambda_1 \sum_{j=1}^{p}\sum_{t=1}^{T}\sum_{k=1}^{K-1} |z_{jtk}| + \lambda_2 \sum_{j=1}^{p}\sum_{t=1}^{T-1}\sum_{k=1}^{K-1} |z_{jtk} - z_{j(t+1)k}|,

which we can see decouples into p(K − 1) separate minimizations, one for each predictor j = 1, …, p and class k = 1, …, K − 1. In other words, the coefficients β undergo the updates

\beta_{j\cdot k}^{+} = \operatorname*{argmin}_{\theta \in \mathbb{R}^{T}} \; \frac{1}{2} \sum_{t=1}^{T} \big( \beta_{jtk} - \tau\,[\nabla g(\beta_0, \beta)]_{jtk} - \theta_t \big)^2 + \tau\lambda_1 \sum_{t=1}^{T} |\theta_t| + \tau\lambda_2 \sum_{t=1}^{T-1} |\theta_t - \theta_{t+1}|, \qquad j = 1, \dots, p, \; k = 1, \dots, K-1, \qquad (B4)

each minimization being an FLSA problem,24 ie, of the form (A4). There are many computational approaches that may be applied to such a problem structure; we employ a specialized, highly efficient algorithm by Johnson52 that is based on dynamic programming. This algorithm requires O(T) operations for each of the problems in (B4), making the total cost of the update O(pTK) operations. Note that this is actually dwarfed by the cost of computing the gradient $\nabla g(\beta_0, \beta)$ in the first place, and therefore, the total complexity of a single iteration of our proposed proximal gradient descent algorithm is O(npTK).
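Putting the pieces together, the following is a hedged sketch of one full update (the unpenalized intercept gradient step plus the p(K − 1) FLSA problems in (B4)), reusing the multinomial_gradient and flsa_prox sketches above; an efficient implementation would instead call the linear-time dynamic programming routine of Johnson52 for the inner prox.

    import numpy as np

    def prox_grad_step(X, Y, beta0, beta, lam1, lam2, tau):
        # one proximal gradient update for the multinomial fused lasso, as in Appendix B.1
        grad0, grad = multinomial_gradient(X, Y, beta0, beta)   # sketch above
        beta0_new = beta0 - tau * grad0                         # intercepts are unpenalized
        beta_new = np.empty_like(beta)
        p, T, Km1 = beta.shape
        for j in range(p):
            for k in range(Km1):
                # each coefficient profile beta[j, :, k] is updated by an FLSA problem (B4)
                v = beta[j, :, k] - tau * grad[j, :, k]
                beta_new[j, :, k] = flsa_prox(v, tau * lam1, tau * lam2)   # sketch above
        return beta0_new, beta_new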

B.2 |. Practical considerations

We discuss several practical issues that arise in applying the proximal gradient descent algorithm.

B.2.1 |. Backtracking line search

Returning to the generic perspective for proximal gradient descent as described in Appendix A, we rewrite the proximal gradient descent update in (A2) as

x^{+} = x - \tau G_{\tau}(x), \qquad (B5)

where $G_{\tau}(x)$ is called the generalized gradient and is defined as

G_{\tau}(x) = \frac{x - \mathrm{prox}_{h,\tau}\big(x - \tau \nabla g(x)\big)}{\tau}.

The update is rewritten in this way so that it more closely resembles the usual gradient update in (A1). We can see that, analogous to the gradient descent case, the choice of parameter τ > 0 in each iteration of proximal gradient descent determines the magnitude of the update in the direction of the generalized gradient Gτ(x). Classical analysis shows that, if ∇g is Lipschitz with constant L > 0, then proximal gradient descent converges with any fixed choice of step size τ ≤ 1∕L across all iterations. In most practical situations, however, the Lipschitz constant L of ∇g is not known or easily computable, and we rely on an adaptive scheme for choosing an appropriate step size at each iteration; backtracking line search is one such scheme, which is straightforward to implement in practice and guarantees convergence of the algorithm under the same Lipschitz assumption on ∇g (but importantly, without having to know its Lipschitz constant L). Given a shrinkage factor 0 < γ < 1, the backtracking line search routine at a given iteration of proximal gradient descent starts with τ = τ0 (a large initial guess for the step size), and while

g\big(x - \tau G_{\tau}(x)\big) > g(x) - \tau \nabla g(x)^{T} G_{\tau}(x) + \frac{\tau}{2} \|G_{\tau}(x)\|_2^2, \qquad (B6)

it shrinks the step size by letting τ = γτ. Once the exit criterion is achieved (ie, the above is no longer satisfied), the proximal gradient descent algorithm then uses the current value of τ to take an update step, as in (B5) (or (A2)).
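For concreteness, the following is a minimal Python sketch of the backtracking rule (B6) in the generic notation of Appendix A; the callables g, grad_g, and prox are placeholders standing in for the smooth loss, its gradient, and the proximal map of h, and are assumptions of the example.

    import numpy as np

    def backtracking_step_size(x, g, grad_g, prox, tau0=1.0, gamma=0.5):
        # shrink tau by gamma until the sufficient-decrease condition (the negation of (B6)) holds
        tau, gx, grad = tau0, g(x), grad_g(x)
        while True:
            G = (x - prox(x - tau * grad, tau)) / tau          # generalized gradient G_tau(x)
            if g(x - tau * G) <= gx - tau * np.sum(grad * G) + (tau / 2.0) * np.sum(G ** 2):
                return tau
            tau *= gamma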

In the case of the multinomial fused lasso problem, the generalized gradient is of dimension $G_{\tau}(\beta_0, \beta) \in \mathbb{R}^{(p+1) \times T \times (K-1)}$, where

[G_{\tau}(\beta_0, \beta)]_{0\cdot\cdot} = [\nabla g(\beta_0, \beta)]_{0\cdot\cdot},

and

[G_{\tau}(\beta_0, \beta)]_{j\cdot k} = \frac{\beta_{j\cdot k} - \mathrm{prox}_{\mathrm{FLSA},\tau}\big(\beta_{j\cdot k} - \tau\,[\nabla g(\beta_0, \beta)]_{j\cdot k}\big)}{\tau}, \qquad j = 1, \dots, p, \; k = 1, \dots, K-1.

Here, $\mathrm{prox}_{\mathrm{FLSA},\tau}\big(\beta_{j\cdot k} - \tau\,[\nabla g(\beta_0, \beta)]_{j\cdot k}\big)$ is the proximal map defined by the FLSA, evaluated at $\beta_{j\cdot k} - \tau\,[\nabla g(\beta_0, \beta)]_{j\cdot k}$, ie, the right-hand side in (B4). Backtracking line search now applies just as described above.

B.2.2 |. Stopping criteria

The simplest implementation of proximal gradient descent would run the algorithm for a fixed, large number of steps S. A more refined approach would check a stopping criterion at the end of each step and terminate if such a criterion is met. Given a tolerance level ϵ > 0, two common stopping criteria are then based on the relative difference in function values, as in

Stopping criterion 1: terminate if $C_1 = \dfrac{|f(\beta_0^{+}, \beta^{+}) - f(\beta_0, \beta)|}{f(\beta_0, \beta)} \le \epsilon$,

and the relative difference in iterates, as in

Stopping criterion 2: terminate if $C_2 = \dfrac{\|(\beta_0^{+}, \beta^{+}) - (\beta_0, \beta)\|_2}{\|(\beta_0, \beta)\|_2} \le \epsilon$.

The second stopping criterion is generally more stringent and may be hard to meet in large problems, given a small tolerance ϵ.

For the sake of completeness, we outline the full proximal gradient descent procedure in the notation of the multinomial fused lasso problem, with backtracking line search and the first stopping criterion, in Algorithms 1 and 2.

Algorithm 1. Proximal gradient descent for the multinomial fused lasso
INPUT: predictors X, outcomes Y, tuning parameter values λ1, λ2, initial coefficient guesses (β0^(0), β^(0)), maximum number of iterations S, initial step size before backtracking τ0, backtracking shrinkage parameter γ, tolerance ε
OUTPUT: approximate solution (β̂0, β̂)
1: s = 1, C = ∞
2: while (s ≤ S and C > ε) do
3:   Find τ_s using backtracking line search, Algorithm 2 (INPUT: β0^(s−1), β^(s−1), τ0, γ)
4:   Update the intercepts: β0^(s) = β0^(s−1) − τ_s [∇g(β0^(s−1), β^(s−1))]_{0··}
5:   for j = 1, … , p do
6:     for k = 1, … , K − 1 do
7:       Update β_{j·k}^(s) = prox_{FLSA, τ_s}(β_{j·k}^(s−1) − τ_s [∇g(β0^(s−1), β^(s−1))]_{j·k})
8:     end for
9:   end for
10:   Compute C = |f(β0^(s), β^(s)) − f(β0^(s−1), β^(s−1))| / f(β0^(s−1), β^(s−1))
11:   Increment s = s + 1
12: end while
13: Set β̂0 = β0^(s−1), β̂ = β^(s−1)
14: return (β̂0, β̂)
Algorithm 2. Backtracking line search for the multinomial fused lasso
INPUT: β0, β, τ0, γ
OUTPUT: τ
1: τ = τ0
2: while (true) do
3:   Compute [G_τ(β0, β)]_{0··} = [∇g(β0, β)]_{0··}
4:   for j = 1, … , p do
5:     for k = 1, … , K − 1 do
6:       Compute [G_τ(β0, β)]_{j·k} = (β_{j·k} − prox_{FLSA, τ}(β_{j·k} − τ [∇g(β0, β)]_{j·k})) / τ
7:     end for
8:   end for
9:   if g((β0, β) − τ G_τ(β0, β)) ≤ g(β0, β) − τ ∇g(β0, β)^T G_τ(β0, β) + (τ/2) ‖G_τ(β0, β)‖_2^2 then
10:     Break
11:   else
12:     Shrink τ = γτ
13:   end if
14: end while
15: return τ
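The following is a compact Python rendering of Algorithm 1, reusing the prox_grad_step sketch above; for brevity, it uses a fixed step size rather than the backtracking of Algorithm 2, the objective evaluation is the penalized criterion in (B1), and the loop stops on criterion 1. The array conventions (X of shape n × p × T, Y coded 1, … , K) are the same illustrative assumptions as before.

    import numpy as np

    def objective(X, Y, beta0, beta, lam1, lam2):
        # f = g + h: negative multinomial log-likelihood plus lasso and fused lasso penalties
        eta = beta0[None, :, :] + np.einsum('ijt,jtk->itk', X, beta)
        Km1 = beta0.shape[1]
        ind = np.stack([(Y == k + 1) for k in range(Km1)], axis=2)
        g = np.sum(-np.sum(ind * eta, axis=2) + np.log1p(np.exp(eta).sum(axis=2)))
        h = lam1 * np.abs(beta).sum() + lam2 * np.abs(np.diff(beta, axis=1)).sum()
        return float(g + h)

    def fit_multinomial_fused_lasso(X, Y, lam1, lam2, tau=1e-3, max_iter=1000, eps=1e-6):
        n, p, T = X.shape
        Km1 = int(Y.max()) - 1                     # assumes classes 1, ..., K all appear in Y
        beta0, beta = np.zeros((T, Km1)), np.zeros((p, T, Km1))
        f_old = objective(X, Y, beta0, beta, lam1, lam2)
        for _ in range(max_iter):
            beta0, beta = prox_grad_step(X, Y, beta0, beta, lam1, lam2, tau)  # sketch above
            f_new = objective(X, Y, beta0, beta, lam1, lam2)
            if abs(f_new - f_old) / f_old <= eps:  # stopping criterion 1
                break
            f_old = f_new
        return beta0, beta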
B.2.3 |. Missing individuals

Often, in practice, some individuals are not present at some time points in the longitudinal study, meaning that their outcome values, their predictor measurements, or both are missing over a subset of t = 1, … , T. Let $I_t$ denote the set of completely observed individuals (ie, with both predictor measurements and outcomes observed) at time t, and let $n_t = |I_t|$. The simplest strategy to accommodate such missingness would be to compute the loss function g only over the observed individuals, so that

g(\beta_0, \beta) = \sum_{t=1}^{T} \sum_{i \in I_t} \left\{ -\sum_{k=1}^{K-1} I(Y_{it}=k)\big(\beta_{0tk} + X_{it\cdot}\,\beta_{\cdot tk}\big) + \log\Big(1 + \sum_{h=1}^{K-1} \exp\big(\beta_{0th} + X_{it\cdot}\,\beta_{\cdot th}\big)\Big) \right\}.

An issue arises when the effective sample size $n_t$ is quite variable across time points t: in this case, the penalty terms can have quite different effects on the coefficients $\beta_{\cdot t \cdot}$ at one time t versus another. That is, the coefficients $\beta_{\cdot t \cdot}$ at a time t in which $n_t$ is small experience a relatively small loss term

\sum_{i \in I_t} \left\{ -\sum_{k=1}^{K-1} I(Y_{it}=k)\big(\beta_{0tk} + X_{it\cdot}\,\beta_{\cdot tk}\big) + \log\Big(1 + \sum_{h=1}^{K-1} \exp\big(\beta_{0th} + X_{it\cdot}\,\beta_{\cdot th}\big)\Big) \right\}, \qquad (B7)

simply because there are fewer terms in the above sum compared to a time with a larger effective sample size; however, the penalty term

\lambda_1 \sum_{j=1}^{p}\sum_{k=1}^{K-1} |\beta_{jtk}| + \lambda_2 \sum_{j=1}^{p}\sum_{k=1}^{K-1} |\beta_{jtk} - \beta_{j(t+1)k}|

remains comparable across all time points, regardless of sample size. A fix would be to divide the loss term in (B7) by $n_t$ to make it (roughly) independent of the effective sample size, so that the total loss becomes

g(\beta_0, \beta) = \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i \in I_t} \left\{ -\sum_{k=1}^{K-1} I(Y_{it}=k)\big(\beta_{0tk} + X_{it\cdot}\,\beta_{\cdot tk}\big) + \log\Big(1 + \sum_{h=1}^{K-1} \exp\big(\beta_{0th} + X_{it\cdot}\,\beta_{\cdot th}\big)\Big) \right\}. \qquad (B8)

This modification indeed ends up being important for the Alzheimer’s analysis that we present in Section 3, since this study has a number of individuals in the tens at some time points and in the hundreds for others. The proximal gradient descent algorithm described in this section extends to cover the loss in (B8) with only trivial modifications.
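A small sketch of the rescaled loss (B8) follows, assuming a boolean mask observed of shape n × T that marks the completely observed individual-time pairs; this mask, and the convention that unobserved entries of X are stored as zeros, are assumptions of the example rather than features of the CHS-CS data.

    import numpy as np

    def g_loss_missing(X, Y, beta0, beta, observed):
        # loss (B8): per-time average over the completely observed individuals I_t
        eta = beta0[None, :, :] + np.einsum('ijt,jtk->itk', X, beta)
        Km1 = beta0.shape[1]
        ind = np.stack([(Y == k + 1) for k in range(Km1)], axis=2)
        per_it = -np.sum(ind * eta, axis=2) + np.log1p(np.exp(eta).sum(axis=2))  # (n, T)
        n_t = observed.sum(axis=0).clip(min=1)     # effective sample size n_t at each time
        return float(((per_it * observed) / n_t[None, :]).sum())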

Footnotes

DISCLAIMER

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

REFERENCES

1. Evans DA, Funkenstein HH, Albert MS, et al. Prevalence of Alzheimer's disease in a community population of older persons. JAMA. 1989;262(18):2551–2556.
2. Fitzpatrick AL, Kuller LH, Ives DG, et al. Incidence and prevalence of dementia in the cardiovascular health study. J Am Geriatrics Soc. 2004;52(2):195–204.
3. Corrada M, Brookmeyer R, Paganini-Hill A, Berlau D, Kawas CH. Dementia incidence continues to increase with age in the oldest old: the 90+ study. Ann Neurol. 2010;67(1):114–121.
4. Tang MX, Maestre G, Tsai WY, et al. Relative risk of Alzheimer disease and age-at-onset distributions, based on APOE genotypes among elderly African Americans, Caucasians, and Hispanics in New York City. Am J Human Genetics. 1996;58(3):574–584.
5. Launer LJ, Andersen K, Dewey ME, et al. Rates and risk factors for dementia and Alzheimer's disease: results from EURODEM pooled analyses. Neurology. 1999;52(1):78–84.
6. Kuller LH, Lopez OL, Newman A, et al. Risk factors for dementia in the cardiovascular health cognition study. Neuroepidemiology. 2003;22(1):13–22.
7. Irie F, Fitzpatrick AL, Lopez OL, et al. Type 2 diabetes (T2D), genetic susceptibility and the incidence of dementia in the cardiovascular health study. Neurology. 2005;64(6):A316.
8. Skoog I, Nilsson L, Persson G, et al. 15-year longitudinal study of blood pressure and dementia. The Lancet. 1996;347(9009):1141–1145.
9. Verghese J, Lipton RB, Katz MJ, et al. Leisure activities and the risk of dementia in the elderly. New England J Med. 2003;348(25):2508–2516.
10. Erickson KI, Raji CA, Lopez OL, et al. Physical activity predicts gray matter volume in late adulthood: the cardiovascular health study. Neurology. 2010;75(16):1415–1422.
11. Scarmeas N, Stern Y, Tang M-X, Mayeux R, Luchsinger JA. Mediterranean diet and risk for Alzheimer's disease. Ann Neurol. 2006;59(6):912–921.
12. Lopez OL, Jagust WJ, DeKosky ST, et al. Prevalence and classification of mild cognitive impairment in the cardiovascular health study cognition study: part 1. Arch Neurol. 2003;60(10):1385–1389.
13. Saxton J, Lopez OL, Ratcliff G, et al. Preclinical Alzheimer disease: neuropsychological test performance 1.5 to 8 years prior to onset. Neurology. 2004;63(12):2341–2347.
14. Lopez OL, Kuller LH, Becker JT, et al. Incidence of dementia in mild cognitive impairment in the cardiovascular health study cognition study. Arch Neurol. 2007;64(3):416–420.
15. Sweet RA, Seltman H, Emanuel JE, et al. Effect of Alzheimer's disease risk genes on trajectories of cognitive function in the cardiovascular health study. Am J Psychiatr. 2012;169(9):954–962.
16. Lecci F. An analysis of development of dementia through the extended trajectory grade of membership model. In: Handbook of Mixed Membership Models and Their Applications. Boca Raton, FL: Chapman & Hall; 2014.
17. Rosvall L, Rizzuto D, Wang H-X, Winblad B, Graff C, Fratiglioni L. APOE-related mortality: effect of dementia, cardiovascular disease and gender. Neurobiol Aging. 2009;30(10):1545–1551.
18. Zhang D, Shen D, Alzheimer's Disease Neuroimaging Initiative. Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PloS One. 2012;7(3):e33182.
19. Zhou J, Liu J, Narayan VA, Ye J, Alzheimer's Disease Neuroimaging Initiative. Modeling disease progression via multi-task learning. NeuroImage. 2013;78:233–248.
20. Weiner MW, Veitch DP, Aisen PS, et al. The Alzheimer's disease neuroimaging initiative: a review of papers published since its inception. Alzheimer's Dement. 2013;9(5):e111–e194.
21. Huang L, Jin Y, Gao Y, Thung K-H, Shen D, Alzheimer's Disease Neuroimaging Initiative. Longitudinal clinical score prediction in Alzheimer's disease with soft-split sparse regression based random forest. Neurobiol Aging. 2016;46:180–191. http://www.sciencedirect.com/science/article/pii/S0197458016301282
22. Xin B, Kawahara Y, Wang Y, Gao W. Efficient generalized fused lasso and its application to the diagnosis of Alzheimer's disease. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence; 2014; Quebec City, Canada.
23. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B Methodol. 1996;58(1):267–288.
24. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J Royal Stat Soc Ser B Stat Methodol. 2005;67(1):91–108.
25. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3(1):1–122.
26. Tibshirani RJ, Taylor J. The solution path of the generalized lasso. Ann Stat. 2011;39(3):1335–1371.
27. Tibshirani RJ, Taylor J. Degrees of freedom in lasso problems. Ann Stat. 2012;40(2):1198–1232.
28. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York, NY: Springer; 2008.
29. Lange T, Roth V, Braun M, Buhmann J. Stability-based validation of clustering solutions. Neural Computation. 2004;16:1299–1323.
30. Meinshausen N, Bühlmann P. Stability selection. J Royal Stat Soc Ser B Stat Methodol. 2010;72(4):417–473.
31. Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (StARS) for high-dimensional graphical models. Paper presented at: Neural Information Processing Systems 23 (NIPS 2010); 2010.
32. Breiman L, Friedman J, Stone C, Olshen R. Classification and Regression Trees. Boca Raton, FL: Chapman & Hall/CRC Press; 1984.
33. Cox D. Regression models and life-tables. J Royal Stat Soc Ser B Methodol. 1972;34(2):187–220.
34. Efron B. The efficiency of Cox's likelihood function for censored data. J Am Stat Assoc. 1977;72(359):557–565.
35. Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. Hoboken, NJ: John Wiley & Sons, Inc; 2002. Wiley Series in Probability and Statistics.
36. Tibshirani R. The lasso method for variable selection in the Cox model. Statist Med. 1997;16(4):385–395.
37. Hernán MA. The hazards of hazard ratios. Epidemiology. 2010;21(1):13.
38. Lopez OL, Becker JT, Kuller LH. Patterns of compensation and vulnerability in normal subjects at risk of Alzheimer's disease. J Alzheimer's Dis. 2013;33:S427–S438.
39. Kuller L, Chang Y, Becker J, Lopez O. Does Alzheimer's disease over 80 years old have a different etiology? Alzheimer's Dement. 2011;7(4):S596–S597.
40. Kivipelto M, Helkala E-L, Laakso MP, et al. Midlife vascular risk factors and Alzheimer's disease in later life: longitudinal, population based study. BMJ. 2001;322(7300):1447–1451.
41. Kim S-J, Koh K, Boyd S, Gorinevsky D. ℓ1 trend filtering. SIAM Review. 2009;51(2):339–360.
42. Tibshirani RJ. Adaptive piecewise polynomial estimation via trend filtering. Ann Stat. 2014;42(1):285–323.
43. Zhang C-H, Zhang S. Confidence intervals for low-dimensional parameters with high-dimensional data. 2011. arXiv:1110.2563.
44. Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. 2013. arXiv:1306.3171.
45. van de Geer S, Bühlmann P, Ritov Y. On asymptotically optimal confidence regions and tests for high-dimensional models. 2013. arXiv:1303.0518.
46. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the lasso. Ann Stat. 2014;42(2):413–468.
47. Lee J, Sun D, Sun Y, Taylor J. Exact post-selection inference with the lasso. 2013. arXiv:1311.6238.
48. Taylor J, Lockhart R, Tibshirani RJ, Tibshirani R. Exact post-selection inference for forward stepwise and least angle regression. 2014. arXiv:1401.3889.
49. Parikh N, Boyd S. Proximal algorithms. Found Trends Optim. 2013;1(3):123–231.
50. Nesterov Y. Gradient Methods for Minimizing Composite Objective Function. CORE Discussion Paper; 2007.
51. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci. 2009;2(1):183–202.
52. Johnson NA. A dynamic programming algorithm for the fused lasso and L0-segmentation. J Comput Graph Stat. 2013;22(2):246–260.
