Abstract
Recent studies have collected high-dimensional data longitudinally. Examples include brain images collected during different scanning sessions and time-course gene expression data. Because of the additional information learned from the temporal changes of the selected features, such longitudinal high-dimensional data, when incorporated with appropriate statistical learning techniques, are able to more accurately predict disease status or responses to a therapeutic treatment. In this article, we review recently proposed statistical learning methods dealing with longitudinal high-dimensional data.
Keywords: High-dimensionality, Multiple time points, Prediction, Support vector machines, Shrinkage, Temporal effects
1 Introduction
Current biomedical technology enables the collection of high-dimensional data longitudinally to gain understanding of genomic, proteomic, and in vivo neural processing properties over time. The temporal changes in these biological profiles may provide insight into disease diagnosis, progression, or recovery. Depending on the types of outcomes and clinical needs, the goals of longitudinal high-dimensional data analysis include clustering, classification, survival analysis, multilevel regression, and time series modeling.
By “longitudinal data,” we refer to two types of data collection: (1) high-dimensional profiles are collected at multiple time points during the study, but the response variable is only collected at the end of the study as a final outcome; and (2) both the high-dimensional predictor variables and the response variable are collected at multiple time points during the study. The desired methodology for high-dimensional longitudinal data would take advantage of the additional data to determine temporal trends of features and incorporate the temporal effects into learning methods and models that allow for repeated measurements. Recent research has developed several strategies to analyze high-dimensional longitudinal data using different statistical learning techniques, including support vector machines, non-parametric Bayesian methods, and shrinkage methods, for different purposes. To address different objectives in the context of different data structures, we review several recent methods for high-dimensional longitudinal data. Across these models, the key challenges are determining how to extract features in high-dimensional space and how to incorporate the temporal effects for more accurate prediction.
In this paper, we review a set of methods for high-dimensional longitudinal data, with a focus on the longitudinal support vector classifier and penalized linear mixed effects models. We begin with the basic concepts of each method and then describe how recent high-dimensional longitudinal data analysis methods extend the original models. We also review the computational strategies and algorithm implementations for these methods.
2 Methods
In this section, we will review several current statistical methods for use with both types of longitudinal high-dimensional data.
2.1 Longitudinal Support Vector Classifier (LSVC)
We first review statistical methods for the first type of longitudinal high-dimensional data. The support vector classifier (SVC) is a robust and effective machine learning method that has been widely used for high-dimensional data analysis (Mitchell et al., 2004; Vapnik, 1996). The SVC has also been applied to spatial-temporal high-dimensional data: Mourao-Miranda et al. (2007) first used singular value decomposition to obtain a linear combination of spatial and temporal effects and then applied the resulting component as input to an SVC. Recently, Chen and Bowman (2011) developed an SVC-based method for high-dimensional data measured at multiple time points. Suppose longitudinal high-dimensional data are collected from N subjects at T measurement time points, and let p denote the dimensionality of the data. The expanded feature matrix for the longitudinal high-dimensional data is then TN by p. Given features xi,t collected for subject i at time t, the goal is to classify each individual x̃i = {xi,1, xi,2, …, xi,T}′ into a group yi ∈ {−1, 1}, where the outcome is collected only at the end of the study.
Linear trends of change are characterized as xs = xi,1 + β1xi,2 + β2xi,3 + … + βT−1xi,T, where β = (1, β1, β2, …, βT−1)′ is an unknown parameter vector. Such trend information is highly desirable for improving classification accuracy but is usually not available. Thus, a key challenge in building a classifier for longitudinal high-dimensional data is jointly estimating the separating hyperplane parameters α and the temporal trend parameters β. Chen and Bowman (2011) proposed a novel longitudinal support vector classifier (LSVC) that solves this problem using quadratic programming.
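For illustration, the short Python sketch below (with hypothetical array names and values) forms the trend-combined features for all subjects at once, assuming the trend weights β are given.

```python
import numpy as np

def combine_features(X, beta):
    """Collapse longitudinal features into one trend-weighted vector per subject.

    X    : array of shape (N, T, p) with features for N subjects at T time points.
    beta : array of shape (T,) of trend weights, with beta[0] fixed at 1.
    Returns an (N, p) array whose i-th row is x_i,1 + beta[1]*x_i,2 + ... + beta[T-1]*x_i,T.
    """
    # sum beta[t] * X[i, t, :] over t for each subject i
    return np.einsum("t,itp->ip", beta, X)

# toy usage: 5 subjects, 3 time points, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3, 4))
beta = np.array([1.0, 0.5, -0.2])
Xs = combine_features(X, beta)   # shape (5, 4)
```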
The LSVC extends the conventional support vector classifier (SVC) by augmenting the cross-sectional high-dimensional feature space to a longitudinal high-dimensional feature space. The method constructs an objective function that incorporates both the temporal trend parameters and the separating hyperplane parameters. The authors first define the augmented Gram matrix as
where
X̃m = [X̃t=1, X̃t=2, …, X̃t=T] represents the p × TN longitudinal high-dimensional features, with components X̃t=k = (y1x1,t=k, y2x2,t=k, …, yNxN,t=k) being the data from the N subjects, each with p features, at time point k. The corresponding βm is a TN × N matrix. Similar to the conventional SVC, the objective function of the LSVC maximizes the margin:
(2.1)
where wnv is the estimate of the separating hyperplane parameters obtained by assuming that the temporal trend parameters are known, so that the longitudinal high-dimensional features are combined as xi = xi,1 + β1xi,2 + β2xi,3 + … + βT−1xi,T.
After incorporating the temporal trend parameters, the Lagrange (Wolfe) dual function becomes:
(2.2)
for i = 1, …, N and t = 1, …, T − 1.
Provided with αm, the separating hyperplane parameter becomes
where αm,i = (αm(i), αm(i + N), …, αm(i + (T − 1)N)).
Given wnv, the intercept term b can then be computed, where βm can be estimated based on αm. Finally, the separating hyperplane can be used to classify each subject by
(2.3)
Model Estimation
To estimate the α and β vectors, the authors suggest reparameterizing the first part of the dual objective function as:
where
and the leading block is the N × N submatrix in the top left corner of the matrix Gm corresponding to the baseline data, i.e., the Gram matrix of the conventional SVC.
The objective function has been proven to be convex, and an iterative quadratic programming (QP) procedure is developed for optimization: (1) start with initial values of β and use QP to optimize the objective function to obtain α; (2) with the updated α from step 1, apply QP again to estimate β; and (3) repeat the two steps until convergence. Convergence of the iterative algorithm is guaranteed because a unique solution exists.
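For intuition, here is a minimal Python sketch of this alternating idea. It is not the authors' exact QP formulation: the α-step is replaced by fitting an off-the-shelf linear SVC on the trend-combined features, and the β-step updates the free trend weights by minimizing the hinge loss with the hyperplane held fixed. Function names and settings are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import LinearSVC

def fit_lsvc_sketch(X, y, C=1.0, n_iter=10):
    """Alternate between the hyperplane (w, b) and the trend weights beta.

    X : (N, T, p) array of longitudinal features; y : (N,) labels in {-1, +1}.
    """
    N, T, p = X.shape
    beta = np.ones(T)                                  # beta[0] is fixed at 1

    for _ in range(n_iter):
        # step 1: with beta fixed, combine the features and fit a linear SVC
        Xs = np.einsum("t,itp->ip", beta, X)
        svc = LinearSVC(C=C).fit(Xs, y)
        w, b = svc.coef_.ravel(), svc.intercept_[0]

        # step 2: with (w, b) fixed, update the free trend weights by
        # minimizing the hinge loss of the combined features
        def hinge(beta_free):
            bt = np.concatenate(([1.0], beta_free))
            margins = y * (np.einsum("t,itp->ip", bt, X) @ w + b)
            return np.maximum(0.0, 1.0 - margins).sum()

        beta_free = minimize(hinge, beta[1:], method="Nelder-Mead").x
        beta = np.concatenate(([1.0], beta_free))

    return w, b, beta
```

A new subject would then be classified by the sign of wᵀxi + b computed from its own trend-combined features.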
Nonlinear Kernel Functions
The authors also provide solutions for nonlinear kernels. The Gram matrix of a nonlinear kernel is
where 〈βK(·, x̃i,t), K(·, x̃i′,t)〉 = βK(x̃i,t, x̃i′,t), and K(·, x̃i,t) indicates the reproducing kernel map of x̃i,t (Wahba, 1990). The separating hyperplane with a nonlinear kernel becomes
where b is the intercept term. Therefore, the nonlinear kernel does not increase the complexity of estimating β. In addition, the authors discussed variable selection based on each predictor’s effect on the objective function, which is a “wrapper” method (Guyon and Elisseeff, 2003; Hastie et al., 2004).
To demonstrate the use and potential advantages of the LSVC, the authors apply the method to a simulation study and to a data example from the Alzheimer’s Disease Neuroimaging Initiative. The results show that, by leveraging the additional longitudinal information, the LSVC achieves higher accuracy than methods using only cross-sectional data and than methods that combine longitudinal data by naively expanding the feature space.
2.2 Penalized Linear Mixed Effects Models
Linear mixed effects models can be used in the analysis of clustered or longitudinal data. These models estimate the relationship between the dependent variable and the fixed and random effects of independent variables by considering both means and covariances. With improvements in data collection and storage technology, a large number of independent variables are available and can be included in the model. Inference and prediction for such a model become increasingly complex, and can be infeasible, as the number of predictors grows. One challenge is how to choose significant predictors while excluding variables that have no true effects on the outcome. An example is the Trial of Activity for Adolescent Girls (TAAG) study, which determined the effectiveness of a school- and community-based intervention on the physical activity of girls from six middle schools in Maryland (Young et al., 2013). A large group of girls was followed for four years, from 2006 to 2009, and asked to complete a survey with hundreds of questions at multiple time points to measure changes in physical activity. A linear mixed effects model can be fitted to take into account the clustering effects (six middle schools) and temporal effects (four years) as well as fixed effects such as race, socio-economic status, and other questions in the survey.
To construct a linear mixed effects model, consider the ith subject in a longitudinal study with n subjects, each observed at mi time points, for a total of N = Σi mi observations. The linear mixed effects model can be written as
yij = xijTβ + zijTbi + εij,  j = 1, …, mi,
where yij is the response variable at the jth time point, xij is the vector of p fixed-effect covariates, zij is the vector of q random-effect covariates at the jth time point, β is the p-dimensional parameter vector of fixed effects, bi is the q-dimensional vector of random effects, and εij is i.i.d. random error from N(0, σ2). The random effects bi are i.i.d. multivariate normal, MVN(0, σ2Π), where Π is the covariance matrix, and are independent of εij. In matrix notation, the model can be simplified as
Yi = Xiβ + Zibi + εi, i = 1, …, n, (2.4)
where Yi = (yi1, …, yimi)T is the response vector, Xi = (xi1, …, ximi)T is the mi × p design matrix of fixed effects, Zi = (zi1, …, zimi)T is the mi × q design matrix of random effects, and εi = (εi1, …, εimi)T is an i.i.d. random error vector following N(0, σ2Imi).
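To make the notation concrete, the following sketch simulates data from model (2.4) with hypothetical dimensions and parameter values: each subject’s response is Xiβ plus a subject-specific deviation Zibi plus independent noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q, sigma = 30, 4, 6, 2, 0.5               # subjects, time points, fixed/random effects, error sd
beta = np.array([1.0, -0.5, 0.0, 0.0, 2.0, 0.0])   # sparse fixed effects
Pi = np.array([[1.0, 0.3],
               [0.3, 0.5]])                        # random-effects covariance (up to sigma^2)

Y, Xs, Zs = [], [], []
for i in range(n):
    Xi = rng.normal(size=(m, p))                       # fixed-effects design for subject i
    Zi = np.column_stack([np.ones(m), np.arange(m)])   # random intercept and slope
    bi = rng.multivariate_normal(np.zeros(q), sigma**2 * Pi)
    eps = rng.normal(scale=sigma, size=m)
    Y.append(Xi @ beta + Zi @ bi + eps)                # Yi = Xi beta + Zi bi + eps_i
    Xs.append(Xi)
    Zs.append(Zi)
```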
Previous methods for penalized estimation of fixed effects include Efron et al. (2004), Zou and Hastie (2005), and Bondell and Reich (2008). Previous methods for selection of random effects include Stram and Lee (1994), Lin (1997), Hall and Praestgaard (2001), and Chen (2003). In this review we focus on three penalized linear mixed effects models, which, unlike many previous approaches, perform selection of both the fixed effects and the random effects. The first model, developed by Bondell et al. (2010), maximizes a penalized joint likelihood. The second, introduced by Fan and Li (2012), uses a proxy matrix in maximizing penalized profile likelihoods for the fixed and random effects separately. The third, by Li et al. (2012), maximizes the likelihood with separate penalization for the fixed and random effects. A side-by-side comparison of these models is summarized in Table 1.
Table 1.
Comparison of three penalized linear mixed effects models

|  | 1. Joint Penalization | 2. Independent Selection | 3. Double Penalization |
| --- | --- | --- | --- |
| Authors | Bondell et al. (2010) | Fan and Li (2012) | Li et al. (2012) |
| Regularizations | 1 | 1 | 2 |
| Objective Functions | ℓ(β, Π, σ2) − λP(β, b) | Fixed: ℓ(β) − λ1P1(β); Random: ℓ(b, σ2) − λ2P2(b) | ℓ(β, Π, σ2) − λ1P1(β) − λ2P2(b) |
| Penalties | 1 total (ALASSO) | 1 fixed (SCAD), 1 random (SCAD) | 1 fixed (ALASSO), 1 random (L-2 norm) |
| Covariance Structure | Modified Cholesky decomposition of the covariance matrix | Proxy matrix substituted for the unknown covariance matrix | Cholesky decomposition of the covariance matrix |
| Algorithms | EM algorithm | LARS / elastic net | New efficient algorithm with two quadratic components |
| High-Dimensional Data | EM algorithm is not efficient for a large number of predictors | p and q can diverge to ∞, but the dimension of the fixed effects must first be reduced while ignoring the random effects | p and q can diverge to ∞ |
Model 1 (Bondell et al., 2010)
Method 1 estimates fixed effects, random effects, and the covariance structure of the selected random effects simultaneously in a model with one penalty function. Equation (2.4) can be reparameterized using a modified Cholesky decomposition to factorize the covariance matrix of the random effects, Π. Through this factorization, Π = DΓΓTD, where Γ is a q × q lower triangular matrix with 1’s on the diagonal and whose (l, r)th element is given by γlr and D = diag(d1, d2, …, dq) is a diagonal matrix. After this reparameterization, the linear mixed effects model (2.4) becomes:
The covariance matrix of bi is now expressed in terms of the vector d = (d1, d2, …, dq)T and the free (below-diagonal) elements of Γ, denoted by the vector γ = (γlr : r = 1, …, q; l = r + 1, …, q)T. Setting any dl = 0 sets the corresponding lth row and column of the covariance matrix Π to 0 and therefore removes the lth random effect from the model.
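The effect of this reparameterization is easy to check numerically. In the hypothetical example below, Π is rebuilt as DΓΓTD; setting d2 = 0 zeroes out the second row and column of Π, which removes the second random effect.

```python
import numpy as np

d = np.array([1.2, 0.0, 0.7])            # d[1] = 0 drops the 2nd random effect
Gamma = np.array([[1.0,  0.0, 0.0],
                  [0.4,  1.0, 0.0],
                  [0.1, -0.3, 1.0]])     # lower triangular with 1's on the diagonal
D = np.diag(d)

Pi = D @ Gamma @ Gamma.T @ D             # covariance matrix of the random effects
print(Pi)                                # 2nd row and column are exactly zero
```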
After reparameterizing the model and treating b as given, maximizing the log-likelihood function is equivalent to minimizing the conditional expectation of ||y − ZD̃Γ̃b − Xβ||2, where D̃ = Im ⊗ D and Γ̃ = Im ⊗ Γ. Rearranging terms and adding the adaptive least absolute shrinkage and selection operator (ALASSO; Zou, 2006) penalty, the goal is to minimize the quadratic problem:
(2.5)
where β̂j and d̂k are ordinary least squares estimates and 1q is a column vector of ones of length q. The problem (2.5) can be solved using the EM algorithm developed by Laird and Ware (1982) and Laird, Lange, and Stram (1987).
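The joint problem (2.5) requires the EM algorithm cited above, but the ALASSO component by itself can be illustrated with a standard reformulation: an adaptive lasso is an ordinary lasso applied to columns rescaled by the magnitudes of initial (here, ordinary least squares) estimates. The sketch below uses scikit-learn's Lasso and only shows how the adaptive weights enter; it is not the joint fixed/random selection of Bondell et al. (2010).

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam):
    """ALASSO via column rescaling: penalize |beta_j| / |beta_ols_j|."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # initial OLS estimates
    w = np.abs(beta_ols)                                # adaptive weights
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X * w, y)
    return fit.coef_ * w                                # back to the original scale

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)
print(adaptive_lasso(X, y, lam=0.05))                   # nonzero mostly at positions 0 and 3
```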
For high-dimensional data, it would be necessary to reduce the dimension of the data before using the method. This could be accomplished by applying existing penalized variable selection methods to the fixed effects while ignoring the random effects, and vice versa. The resulting reduced model could then be passed to this joint penalty problem for further simultaneous selection of fixed and random effects. While the method is effective at selecting fixed and random effects, the EM algorithm that it uses is not efficient and may not be feasible when the number of predictors is very large, due to its slow convergence rate and computational burden.
Model 2 (Fan and Li, 2012)
This method selects important fixed and random effects independently in two separate models. Proxy matrices are used to account for the unknown variance-covariance structure of the random effects during the selections. Stacking Xi, bi, yi, and εi and setting Z = diag(Z1, …, Zn) with corresponding Π̃ = diag(Π, …, Π), the linear mixed effects model in (2.4) can be rewritten as
For the fixed effects parameter β, it is necessary to minimize the penalized likelihood equation
(2.6)
where Pz = (I + σ−2ZΠ̃ZT)−1 and the penalty function Pλ(|βj|) is the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001) penalty with tuning parameter λ. It is important to note that the problem Q(β) depends on the unknown parameters Π̃ and σ2. To overcome this obstacle, a proxy matrix P̃z = (I + ZM̃ZT)−1, where M̃ = (log n)I, is substituted into (2.6) for Pz. Since this regularization function is quadratic, it can be solved through existing methods for penalized least squares, such as the LARS algorithm (Efron et al., 2004).
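A small numerical sketch of this fixed-effects step follows, with the SCAD penalty replaced by a lasso penalty for simplicity. Because P̃z is symmetric positive definite, the quadratic term (y − Xβ)TP̃z(y − Xβ) equals ||R(y − Xβ)||2 for any factor P̃z = RTR, so a standard penalized least-squares routine can be run on the whitened data. All dimensions and values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_obs, p, n = 120, 8, 30                    # total observations, fixed effects, subjects
X = rng.normal(size=(n_obs, p))
Z = rng.normal(size=(n_obs, 6))             # stacked random-effects design
y = X[:, :2] @ np.array([1.5, -1.0]) + rng.normal(size=n_obs)

# proxy: (log n) I replaces the unknown sigma^{-2} * Pi-tilde
Pz = np.linalg.inv(np.eye(n_obs) + Z @ (np.log(n) * np.eye(Z.shape[1])) @ Z.T)

# factor Pz = R^T R and run penalized least squares on the whitened data
R = np.linalg.cholesky(Pz).T                # Pz = L L^T, take R = L^T
fit = Lasso(alpha=0.05, fit_intercept=False).fit(R @ X, R @ y)
print(fit.coef_)                            # lasso here stands in for the SCAD penalty
```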
The random effects are selected by deriving the restricted posterior distribution of the random effects through a Bayesian argument and then penalizing the resulting restricted posterior mode. The regularization problem to be minimized is
(2.7)
where Px = I − X(XTX)−1XT, the penalty function Pλ(|bk|) is again the SCAD penalty with tuning parameter λ, and Π̃+ is the Moore–Penrose generalized inverse of Π̃. Again, Π̃ and σ2 are unknown, so the proxy matrix M̃ = diag(M, …, M), with M = (log n)I, is substituted for σ−2Π̃, and the regularization problem in (2.7) becomes:
This problem is similar to the penalized quadratic objective of the adaptive elastic net (Zou and Zhang, 2009), so it can be solved by modifying that algorithm.
For high-dimensional data where N ≤ p, the dimension of fixed effects must be lowered to below the sample size before using the above methods. This can be done by first using penalized least squares methods on the fixed effects while ignoring random effects. Using these selected fixed effects, the random effects can be estimated using the regularization problem (2.7). Next, using these selected random effects from the second step, the fixed effects regularization problem (2.6) can be used to select from the remaining fixed effects. The second and third steps can be repeated, as needed, to further reduce the dimensionality of the data.
Model 3 (Li et al., 2012)
The final method selects and estimates fixed effects, random effects, and the covariance structure of the selected random effects simultaneously in a linear mixed effects model using two penalty functions. Using model (2.4), when N > p, a modified log-likelihood incorporating restricted maximum likelihood (REML) is
(2.8)
In high-dimensional settings where N ≤ p, the restricted term in (2.8) becomes singular, so the following full log-likelihood must be used instead:
(2.9)
Maximizing this likelihood yields the parameter estimates. Adding penalty functions, the regularization problem to maximize is
(2.10)
where ℓn(θ) is (2.8) or (2.9) depending on whether N > p or N ≤ p, respectively. The first penalty function, λ1P1(β), is for the fixed effects and is an ALASSO penalty. For the random effects, a Cholesky decomposition is performed such that Π = LLT, where L is a lower triangular matrix with nonnegative diagonal elements. If any diagonal element Lkk = 0, then the corresponding random effect bk is also 0 and is removed from the model. The second penalty function in (2.10) is set to be an L-2 norm penalty (Yuan and Lin, 2006) with an adaptive weight added, so that the regularization problem is
(2.11)
where β̂j and ||L̂(k)|| are ordinary least squares estimates. An estimate of the variance σ2 (Lindstrom and Bates, 1988) can be substituted into the model, allowing (2.11) to be maximized in terms of only L and β.
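The role of the Cholesky factor in the random-effects penalty can be seen in a small hypothetical example: factoring Π as LLT, the row norms ||L(k)|| are the quantities being penalized, a zero row norm corresponds to a dropped random effect, and reciprocals of initial row-norm estimates serve as the adaptive weights.

```python
import numpy as np

Pi = np.array([[1.0, 0.2, 0.0],
               [0.2, 0.5, 0.0],
               [0.0, 0.0, 0.0]])         # 3rd random effect has zero variance

# Pi is only positive semi-definite, so factor the nonzero 2x2 block and
# pad the dropped row/column of L with zeros
L = np.zeros((3, 3))
L[:2, :2] = np.linalg.cholesky(Pi[:2, :2])

row_norms = np.linalg.norm(L, axis=1)    # ||L_(k)|| for k = 1, 2, 3
weights = np.where(row_norms > 0, 1.0 / np.maximum(row_norms, 1e-12), np.inf)
print(row_norms)                         # third entry is 0: b_3 is removed
print(weights)                           # adaptive weights entering the penalty
```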
The problem (2.11) can be solved by a new algorithm that iteratively updates two quadratic optimization functions, one for the random effects and one for the fixed effects. This algorithm has proven to be more efficient than the EM algorithm, which cannot handle large numbers of predictors. When the maximum likelihood is used, the resulting estimator is proven to be consistent for high-dimensional data, where p and q can diverge at an exponential rate with the sample size n.
3 Summary
In this article, we described two types of longitudinal high-dimensional data that researchers often encounter in current biomedical research and reviewed several recently developed statistical methods for these two types of data. First, we introduced a kernel-based classification method for the first type of longitudinal high-dimensional data and the corresponding computational strategy for parameter estimation. Second, we reviewed three mixed effects shrinkage models for the other type of longitudinal high-dimensional data. In the review, we compared the model setups, computational strategies, and the advantages and shortcomings of the methods.
Acknowledgments
Wu’s research is supported in part by NSF Grant CCF-0926181.
References
- Chen S, Bowman FD. A Novel Support Vector Classifier for Longitudinal High Dimensional Data. Statistical Analysis and Data Mining. 2011;4(6):604–611. doi: 10.1002/sam.10141.
- Mitchell TM, Hutchinson R, Niculescu RS, Pereira F, Wang X, Just M, Newman S. Learning to decode cognitive states from brain images. Machine Learning. 2004;57:145–175.
- Vapnik V. The Nature of Statistical Learning Theory. New York: Springer; 1996.
- Mourao-Miranda J, Friston KJ, Brammer M. Dynamic discrimination analysis: a spatial-temporal SVM. NeuroImage. 2007;36(1):88. doi: 10.1016/j.neuroimage.2007.02.020.
- Guyon I, Elisseeff A. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003;3:1157–1182.
- Wahba G. Spline Models for Observational Data. Philadelphia, PA: SIAM; 1990.
- Hastie T, Rosset S, Tibshirani R, Zhu J. The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research. 2004;5:1391–1415.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
- Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101(476):1418–1429.
- Bondell HD, Krishna A, Ghosh SK. Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics. 2010;66:1069–1077. doi: 10.1111/j.1541-0420.2010.01391.x.
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
- Laird NM, Lange N, Stram D. Maximum likelihood computations with repeated measures: Application of the EM algorithm. Journal of the American Statistical Association. 1987;82:97–105.
- Fan Y, Li R. Variable selection in linear mixed effects models. Annals of Statistics. 2012;40:2043–2068. doi: 10.1214/12-AOS1028.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least Angle Regression. The Annals of Statistics. 2004;32(2):407–451.
- Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics. 2009;37:1733–1751. doi: 10.1214/08-AOS625.
- Li Y, Wang S, Song PX-K, Wang N, Zhu J. Doubly Regularized Estimation and Selection in Linear Mixed-Effects Models for High-Dimensional Longitudinal Data. Unpublished manuscript. doi: 10.4310/SII.2018.v11.n4.a15.
- Lindstrom MJ, Bates DM. Newton-Raphson and EM algorithms for linear mixed effects models for repeated measures data. Journal of the American Statistical Association. 1988;83:1014–1022.
- Yuan M, Lin Y. Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B. 2006;68:49–67.
- Young DR, Saksvig BI, Wu TT, Zook K, Li X, Champaloux S, Grieser M, Lee S, Treuth M. Multilevel Predictors of Physical Activity for Early, Mid, and Late Adolescent Girls. Journal of Physical Activity & Health. In press. doi: 10.1123/jpah.2012-0192.
- Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008;64(1):115–123. doi: 10.1111/j.1541-0420.2007.00843.x.
- Chen Z. Random effects selection in linear mixed models. Biometrics. 2003;59:762–769. doi: 10.1111/j.0006-341x.2003.00089.x.
- Hall DB, Praestgaard JT. Order-restricted score tests for homogeneity in generalised linear and nonlinear models. Biometrika. 2001;88(3):739–751.
- Lin X. Variance component testing in generalised linear models with random effects. Biometrika. 1997;84(2):309–326.
- Stram DO, Lee JW. Variance components testing in the longitudinal mixed effects model. Biometrics. 1994;50(4):1171–1177.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67(2):301–320.