Abstract
Recent studies have collected high-dimensional data longitudinally. Examples include brain images collected during different scanning sessions and time-course gene expression data. Because of the additional information learned from the temporal changes of the selected features, such longitudinal high-dimensional data, when incorporated with appropriate statistical learning techniques, are able to more accurately predict disease status or responses to a therapeutic treatment. In this article, we review recently proposed statistical learning methods dealing with longitudinal high-dimensional data.
Keywords: High-dimensionality, Multiple time points, Prediction, Support vector machines, Shrinkage, Temporal effects
1 Introduction
Current biomedical technology enables the collection of high-dimensional data longitudinally to gain understanding of genomic, proteomic, and in vivo neural processing properties over time. The temporal changes in these biological profiles may provide insight into disease diagnosis, progression, or recovery. Depending on the types of outcomes and clinical needs, the goals of longitudinal high-dimensional data analysis include clustering, classification, survival analysis, multilevel regression, and time series modeling.
By “longitudinal data,” we refer to two types of data collection: (1) high-dimensional profiles are collected at multiple time points during the study, but the response variable is only collected at the end of the study as a final outcome; and (2) both the high-dimensional predictor variables and the response variable are collected at multiple time points during the study. The desired methodology for high-dimensional longitudinal data would take advantage of the additional data to determine temporal trends of features and incorporate the temporal effects into learning methods and models that allow for repeated measurements. Recent research has developed several strategies to analyze high-dimensional longitudinal data using different statistical learning techniques, including support vector machines, non-parametric Bayesian methods, and shrinkage methods, for different purposes. To address different objectives in the context of different data structures, we review several recent methods for high-dimensional longitudinal data. Across these models, the key challenges are determining how to extract features in high-dimensional space and how to incorporate the temporal effects for more accurate prediction.
In this paper, we review a set of methods for high-dimensional longitudinal data, with a focus on the longitudinal support vector classifier and penalized linear mixed effects models. We begin with the basic concepts of each method and then describe how recent high-dimensional longitudinal data analysis methods extend the original models. We also review the computational strategies and algorithm implementations for these methods.
2 Methods
In this section, we will review several current statistical methods for use with both types of longitudinal high-dimensional data.
2.1 Longitudinal Support Vector Classifier (LSVC)
We first review statistical methods for the first type of longitudinal high-dimensional data. The support vector classifier (SVC) is a robust and effective machine learning method that has been widely used for high-dimensional data analysis (Mitchell et al., 2004; Vapnik, 1996). The SVC has also been applied to spatial-temporal high-dimensional data: Mourao-Miranda et al. (2007) first used singular value decomposition to obtain a linear combination of spatial and temporal effects and then applied the resulting component as input to an SVC. Recently, Chen and Bowman (2011) developed an SVC-based method for high-dimensional data measured at multiple time points. Suppose longitudinal high-dimensional data are collected from N subjects at T measurement time points, and let p denote the dimensionality of the data. The expanded feature matrix for the longitudinal high-dimensional data is then TN by p. Given features xi,t collected for subject i at time t, the goal is to classify each individual x̃i = {xi,1, xi,2, …, xi,T}′ into a group yi ∈ {−1, 1}, where the outcome is collected only at the end of the study.
Linear trends of change are characterized as xs = xi,1 + β1xi,2 + β2xi,3 + … + βT−1xi,T, where β = (1, β1, β2, …, βT−1)′ is an unknown parameter vector. Such trend information is highly desirable for improving classification accuracy but is usually not available. Thus, a key challenge in building a classifier for longitudinal high-dimensional data is jointly estimating the separating hyperplane parameters α and the temporal trend parameters β. Chen and Bowman (2011) proposed a novel longitudinal support vector classifier (LSVC) that solves this problem using quadratic programming.
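For illustration, the short Python sketch below (with hypothetical array names and values) forms the trend-combined features for all subjects at once, assuming the trend weights β are given.

```python
import numpy as np

def combine_features(X, beta):
    """Collapse longitudinal features into one trend-weighted vector per subject.

    X    : array of shape (N, T, p) with features for N subjects at T time points.
    beta : array of shape (T,) of trend weights, with beta[0] fixed at 1.
    Returns an (N, p) array whose i-th row is x_i,1 + beta[1]*x_i,2 + ... + beta[T-1]*x_i,T.
    """
    # sum beta[t] * X[i, t, :] over t for each subject i
    return np.einsum("t,itp->ip", beta, X)

# toy usage: 5 subjects, 3 time points, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3, 4))
beta = np.array([1.0, 0.5, -0.2])
Xs = combine_features(X, beta)   # shape (5, 4)
```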
The LSVC extends the conventional support vector classifier (SVC) by augmenting the cross-sectional high-dimensional feature space to a longitudinal high-dimensional feature space. The method constructs an objective function that incorporates both the temporal trend parameters and the separating hyperplane parameters. The authors first define the augmented Gram matrix as
where
X̃m = [X̃t=1, X̃t=2, …, X̃t=T] represents the p × TN longitudinal high-dimensional features, with components X̃t=k = (y1x1,t=k, y2x2,t=k, …, yNxN,t=k) being the data from the N subjects, each with p features, at time point k. The corresponding βm is a TN × N matrix. Similar to the conventional SVC, the objective function of the LSVC maximizes the margin:
(2.1)
where wnv is the estimate of the separating hyperplane parameters obtained by assuming that the temporal trend parameters are known, so that the longitudinal high-dimensional features are combined as xi = xi,1 + β1xi,2 + β2xi,3 + … + βT−1xi,T.
After incorporating the temporal trend parameters, the Lagrange (Wolfe) dual function becomes:
(2.2)
for i = 1, …, N and t = 1, …, T − 1.
Provided with αm, the separating hyperplane parameter becomes
where αm,i = (αm(i), αm(i + N), …, αm(i + (T − 1)N)).
Given wnv, the intercept term b can then be computed, where βm can be estimated based on αm. Finally, the separating hyperplane can be used to classify each subject by
(2.3)
Model Estimation
To estimate the α and β vectors, the authors suggest reparameterizing the first part of the dual objective function as:
where
and the leading block is the N × N submatrix in the top left corner of the matrix Gm corresponding to the baseline data, i.e., the Gram matrix of the conventional SVC.
The objective function has been proven to be convex, and an iterative quadratic programming (QP) procedure is developed for optimization: (1) start with initial values of β and use QP to optimize the objective function to obtain α; (2) with the updated α from step 1, apply QP again to estimate β; and (3) repeat the two steps until convergence. Convergence of the iterative algorithm is guaranteed because a unique solution exists.
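For intuition, here is a minimal Python sketch of this alternating idea. It is not the authors' exact QP formulation: the α-step is replaced by fitting an off-the-shelf linear SVC on the trend-combined features, and the β-step updates the free trend weights by minimizing the hinge loss with the hyperplane held fixed. Function names and settings are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import LinearSVC

def fit_lsvc_sketch(X, y, C=1.0, n_iter=10):
    """Alternate between the hyperplane (w, b) and the trend weights beta.

    X : (N, T, p) array of longitudinal features; y : (N,) labels in {-1, +1}.
    """
    N, T, p = X.shape
    beta = np.ones(T)                                  # beta[0] is fixed at 1

    for _ in range(n_iter):
        # step 1: with beta fixed, combine the features and fit a linear SVC
        Xs = np.einsum("t,itp->ip", beta, X)
        svc = LinearSVC(C=C).fit(Xs, y)
        w, b = svc.coef_.ravel(), svc.intercept_[0]

        # step 2: with (w, b) fixed, update the free trend weights by
        # minimizing the hinge loss of the combined features
        def hinge(beta_free):
            bt = np.concatenate(([1.0], beta_free))
            margins = y * (np.einsum("t,itp->ip", bt, X) @ w + b)
            return np.maximum(0.0, 1.0 - margins).sum()

        beta_free = minimize(hinge, beta[1:], method="Nelder-Mead").x
        beta = np.concatenate(([1.0], beta_free))

    return w, b, beta
```

A new subject would then be classified by the sign of wᵀxi + b computed from its own trend-combined features.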
Nonlinear Kernel Functions
The authors also provide solutions for nonlinear kernels. The Gram matrix of a nonlinear kernel is
where 〈βK(·, x̃i,t), K(·, x̃i′,t)〉 = βK(x̃i,t, x̃i′,t), and K(·, x̃i,t) indicates the reproducing kernel map of x̃i,t (Wahba, 1990). The separating hyperplane with a nonlinear kernel becomes
where b is the intercept term. Therefore, the nonlinear kernel does not increase the complexity of estimating β. In addition, the authors discussed variable selection based on each predictor’s effect on the objective function, which is a “wrapper” method (Guyon and Elisseeff, 2003; Hastie et al., 2004).
To demonstrate the use and potential advantages of the LSVC, the authors apply the method to a simulation study and to a data example from the Alzheimer’s Disease Neuroimaging Initiative. The results show that, by leveraging the additional longitudinal information, the LSVC achieves higher accuracy than methods using only cross-sectional data and than methods that combine longitudinal data by naively expanding the feature space.
2.2 Penalized Linear Mixed Effects Models
Linear mixed effects models can be used in the analysis of clustered or longitudinal data. These models estimate the relationship between the dependent variable and the fixed and random effects of independent variables by considering both means and covariances. With improvements in data collection and storage technology, a large number of independent variables are available and can be included in the model. Inference and prediction for such a model become increasingly complex, and can be infeasible, as the number of predictors grows. One challenge is how to choose significant predictors while excluding variables that have no true effects on the outcome. An example is the Trial of Activity for Adolescent Girls (TAAG) study, which determined the effectiveness of a school- and community-based intervention on the physical activity of girls from six middle schools in Maryland (Young et al., 2013). A large group of girls was followed for four years, from 2006 to 2009, and asked to complete a survey with hundreds of questions at multiple time points to measure changes in physical activity. A linear mixed effects model can be fitted to take into account the clustering effects (six middle schools) and temporal effects (four years) as well as fixed effects such as race, socio-economic status, and other questions in the survey.
To construct a linear mixed effects model, consider the ith subject in a longitudinal study with n subjects, each observed at mi time points, for a total of N = Σi mi observations. The linear mixed effects model can be written as
yij = xijTβ + zijTbi + εij,  j = 1, …, mi,
where yij is the response variable at the jth time point, xij is the vector of p fixed-effect covariates, zij is the vector of q random-effect covariates at the jth time point, β is the p-dimensional parameter vector of fixed effects, bi is the q-dimensional vector of random effects, and εij is i.i.d. random error from N(0, σ2). The random effects bi are i.i.d. multivariate normal, MVN(0, σ2Π), where Π is the covariance matrix, and are independent of εij. In matrix notation, the model can be simplified as
Yi = Xiβ + Zibi + εi, i = 1, …, n, (2.4)
where Yi = (yi1, …, yimi)T is the response vector, Xi = (xi1, …, ximi)T is the mi × p design matrix of fixed effects, Zi = (zi1, …, zimi)T is the mi × q design matrix of random effects, and εi = (εi1, …, εimi)T is an i.i.d. random error vector following N(0, σ2Imi).
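To make the notation concrete, the following sketch simulates data from model (2.4) with hypothetical dimensions and parameter values: each subject’s response is Xiβ plus a subject-specific deviation Zibi plus independent noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q, sigma = 30, 4, 6, 2, 0.5               # subjects, time points, fixed/random effects, error sd
beta = np.array([1.0, -0.5, 0.0, 0.0, 2.0, 0.0])   # sparse fixed effects
Pi = np.array([[1.0, 0.3],
               [0.3, 0.5]])                        # random-effects covariance (up to sigma^2)

Y, Xs, Zs = [], [], []
for i in range(n):
    Xi = rng.normal(size=(m, p))                       # fixed-effects design for subject i
    Zi = np.column_stack([np.ones(m), np.arange(m)])   # random intercept and slope
    bi = rng.multivariate_normal(np.zeros(q), sigma**2 * Pi)
    eps = rng.normal(scale=sigma, size=m)
    Y.append(Xi @ beta + Zi @ bi + eps)                # Yi = Xi beta + Zi bi + eps_i
    Xs.append(Xi)
    Zs.append(Zi)
```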
Previous methods for penalized estimation of fixed effects include Efron et al. (2004), Zou and Hastie (2005), and Bondell and Reich (2008). Previous methods for selection of random effects include Stram and Lee (1994), Lin (1997), Hall and Praestgaard (2001), and Chen (2003). In this review we focus on three penalized linear mixed effects models, which, unlike many previous approaches, perform selection of both the fixed effects and the random effects. The first model, developed by Bondell et al. (2010), maximizes a penalized joint likelihood. The second, introduced by Fan and Li (2012), uses a proxy matrix in maximizing penalized profile likelihoods for the fixed and random effects separately. The third, by Li et al. (2012), maximizes the likelihood with separate penalization for the fixed and random effects. A side-by-side comparison of these models is summarized in Table 1.
Table 1.
Comparison of three penalized linear mixed effects models

|  | 1. Joint Penalization | 2. Independent Selection | 3. Double Penalization |
| --- | --- | --- | --- |
| Authors | Bondell et al. (2010) | Fan and Li (2012) | Li et al. (2012) |
| Regularizations | 1 | 1 | 2 |
| Objective Functions | ℓ(β, Π, σ2) − λP(β, b) | Fixed: ℓ(β) − λ1P1(β); Random: ℓ(b, σ2) − λ2P2(b) | ℓ(β, Π, σ2) − λ1P1(β) − λ2P2(b) |
| Penalties | 1 total (ALASSO) | 1 fixed (SCAD), 1 random (SCAD) | 1 fixed (ALASSO), 1 random (L-2 norm) |
| Covariance Structure | Modified Cholesky decomposition of the covariance matrix | Proxy matrix substituted for the unknown covariance matrix | Cholesky decomposition of the covariance matrix |
| Algorithms | EM algorithm | LARS / elastic net | New efficient algorithm with two quadratic components |
| High-Dimensional Data | EM algorithm is not efficient for a large number of predictors | p and q can diverge to ∞, but the dimension of the fixed effects must first be reduced while ignoring the random effects | p and q can diverge to ∞ |
Model 1 (Bondell et al., 2010)
Method 1 estimates fixed effects, random effects, and the covariance structure of the selected random effects simultaneously in a model with one penalty function. Equation (2.4) can be reparameterized using a modified Cholesky decomposition to factorize the covariance matrix of the random effects, Π. Through this factorization, Π = DΓΓTD, where Γ is a q × q lower triangular matrix with 1’s on the diagonal and whose (l, r)th element is given by γlr and D = diag(d1, d2, …, dq) is a diagonal matrix. After this reparameterization, the linear mixed effects model (2.4) becomes:
The covariance matrix of bi is now expressed in terms of the vector d = (d1, d2, …, dq)T and the free (below-diagonal) elements of Γ, denoted by the vector γ = (γlr : r = 1, …, q; l = r + 1, …, q)T. Setting any dl = 0 sets the corresponding lth row and column of the covariance matrix Π to 0 and therefore removes the lth random effect from the model.
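The effect of this reparameterization is easy to check numerically. In the hypothetical example below, Π is rebuilt as DΓΓTD; setting d2 = 0 zeroes out the second row and column of Π, which removes the second random effect.

```python
import numpy as np

d = np.array([1.2, 0.0, 0.7])            # d[1] = 0 drops the 2nd random effect
Gamma = np.array([[1.0,  0.0, 0.0],
                  [0.4,  1.0, 0.0],
                  [0.1, -0.3, 1.0]])     # lower triangular with 1's on the diagonal
D = np.diag(d)

Pi = D @ Gamma @ Gamma.T @ D             # covariance matrix of the random effects
print(Pi)                                # 2nd row and column are exactly zero
```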
After reparameterizing the model and treating b as given, maximizing the log-likelihood function is equivalent to minimizing the conditional expectation of ||y − ZD̃Γ̃b − Xβ||2, where D̃ = Im ⊗ D and Γ̃ = Im ⊗ Γ. Rearranging terms and adding the adaptive least absolute shrinkage and selection operator (ALASSO; Zou, 2006) penalty, the goal is to minimize the quadratic problem:
(2.5)
where β̂j and d̂k are ordinary least squares estimates and 1q is a column vector of ones of length q. The problem (2.5) can be solved using the EM algorithm developed by Laird and Ware (1982) and Laird, Lange, and Stram (1987).
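The joint problem (2.5) requires the EM algorithm cited above, but the ALASSO component by itself can be illustrated with a standard reformulation: an adaptive lasso is an ordinary lasso applied to columns rescaled by the magnitudes of initial (here, ordinary least squares) estimates. The sketch below uses scikit-learn's Lasso and only shows how the adaptive weights enter; it is not the joint fixed/random selection of Bondell et al. (2010).

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam):
    """ALASSO via column rescaling: penalize |beta_j| / |beta_ols_j|."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # initial OLS estimates
    w = np.abs(beta_ols)                                # adaptive weights
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X * w, y)
    return fit.coef_ * w                                # back to the original scale

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)
print(adaptive_lasso(X, y, lam=0.05))                   # nonzero mostly at positions 0 and 3
```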
For high-dimensional data, it would be necessary to reduce the dimension of the data before using the method. This could be accomplished by applying existing penalized variable selection methods to the fixed effects while ignoring the random effects, and vice versa. The resulting reduced model could then be passed to this joint penalty problem for further simultaneous selection of fixed and random effects. While the method is effective at selecting fixed and random effects, the EM algorithm that it uses is not efficient and may not be feasible when the number of predictors is very large, due to its slow convergence rate and computational burden.
Model 2 (Fan and Li, 2012)
This method selects important fixed and random effects independently in two separate models. Proxy matrices are used to account for the unknown variance-covariance structure of the random effects during the selections. Stacking Xi, bi, yi, and εi and setting Z = diag(Z1, …, Zn) with corresponding Π̃ = diag(Π, …, Π), the linear mixed effects model in (2.4) can be rewritten as
For the fixed effects parameter β, it is necessary to minimize the penalized likelihood equation
(2.6)
where Pz = (I + σ−2ZΠ̃ZT)−1 and the penalty function Pλ(|βj|) is the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001) penalty with tuning parameter λ. It is important to note that the problem Q(β) depends on the unknown parameters Π̃ and σ2. To overcome this obstacle, a proxy matrix P̃z = (I + ZM̃ZT)−1, where M̃ = (log n)I, is substituted into (2.6) for Pz. Since this regularization function is quadratic, it can be solved through existing methods for penalized least squares, such as the LARS algorithm (Efron et al., 2004).
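A small numerical sketch of this fixed-effects step follows, with the SCAD penalty replaced by a lasso penalty for simplicity. Because P̃z is symmetric positive definite, the quadratic term (y − Xβ)TP̃z(y − Xβ) equals ||R(y − Xβ)||2 for any factor P̃z = RTR, so a standard penalized least-squares routine can be run on the whitened data. All dimensions and values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_obs, p, n = 120, 8, 30                    # total observations, fixed effects, subjects
X = rng.normal(size=(n_obs, p))
Z = rng.normal(size=(n_obs, 6))             # stacked random-effects design
y = X[:, :2] @ np.array([1.5, -1.0]) + rng.normal(size=n_obs)

# proxy: (log n) I replaces the unknown sigma^{-2} * Pi-tilde
Pz = np.linalg.inv(np.eye(n_obs) + Z @ (np.log(n) * np.eye(Z.shape[1])) @ Z.T)

# factor Pz = R^T R and run penalized least squares on the whitened data
R = np.linalg.cholesky(Pz).T                # Pz = L L^T, take R = L^T
fit = Lasso(alpha=0.05, fit_intercept=False).fit(R @ X, R @ y)
print(fit.coef_)                            # lasso here stands in for the SCAD penalty
```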
The random effects are selected by deriving the restricted posterior distribution of the random effects through a Bayesian argument and then penalizing the resulting restricted posterior mode. The regularization problem to be minimized is
(2.7)
where Px = I − X(XTX)−1XT, the penalty function Pλ(|bk|) is again the SCAD penalty with tuning parameter λ, and Π̃+ is the Moore–Penrose generalized inverse of Π̃. Again, Π̃ and σ2 are unknown, so the proxy matrix M̃ = diag(M, …, M), with M = (log n)I, is substituted for σ−2Π̃, and the regularization problem in (2.7) becomes:
This problem is similar to the penalized quadratic objective of the adaptive elastic net (Zou and Zhang, 2009), so it can be solved by modifying that algorithm.
For high-dimensional data where N ≤ p, the dimension of fixed effects must be lowered to below the sample size before using the above methods. This can be done by first using penalized least squares methods on the fixed effects while ignoring random effects. Using these selected fixed effects, the random effects can be estimated using the regularization problem (2.7). Next, using these selected random effects from the second step, the fixed effects regularization problem (2.6) can be used to select from the remaining fixed effects. The second and third steps can be repeated, as needed, to further reduce the dimensionality of the data.
Model 3 (Li et al., 2012)
The final method selects and estimates fixed effects, random effects, and the covariance structure of the selected random effects simultaneously in a linear mixed effects model using two penalty functions. Using model (2.4), when N > p, a modified log-likelihood incorporating restricted maximum likelihood (REML) is
(2.8)
In high-dimensional settings where N ≤ p, the restricted term in (2.8) becomes singular, so the following full log-likelihood must be used instead:
(2.9)
Maximizing this likelihood yields the parameter estimates. Adding penalty functions, the regularization problem to maximize is
(2.10)
where ℓn(θ) is (2.8) or (2.9) depending on whether N > p or N ≤ p, respectively. The first penalty function, λ1P1(β), is for the fixed effects and is an ALASSO penalty. For the random effects, a Cholesky decomposition is performed such that Π = LLT, where L is a lower triangular matrix with nonnegative diagonal elements. If any diagonal element Lkk = 0, then the corresponding random effect bk is also 0 and is removed from the model. The second penalty function in (2.10) is set to be an L-2 norm penalty (Yuan and Lin, 2006) with an adaptive weight added, so that the regularization problem is
(2.11)
where β̂j and ||L̂(k)|| are ordinary least squares estimates. An estimate of the variance σ2 (Lindstrom and Bates, 1988) can be substituted into the model, allowing (2.11) to be maximized in terms of only L and β.
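The role of the Cholesky factor in the random-effects penalty can be seen in a small hypothetical example: factoring Π as LLT, the row norms ||L(k)|| are the quantities being penalized, a zero row norm corresponds to a dropped random effect, and reciprocals of initial row-norm estimates serve as the adaptive weights.

```python
import numpy as np

Pi = np.array([[1.0, 0.2, 0.0],
               [0.2, 0.5, 0.0],
               [0.0, 0.0, 0.0]])         # 3rd random effect has zero variance

# Pi is only positive semi-definite, so factor the nonzero 2x2 block and
# pad the dropped row/column of L with zeros
L = np.zeros((3, 3))
L[:2, :2] = np.linalg.cholesky(Pi[:2, :2])

row_norms = np.linalg.norm(L, axis=1)    # ||L_(k)|| for k = 1, 2, 3
weights = np.where(row_norms > 0, 1.0 / np.maximum(row_norms, 1e-12), np.inf)
print(row_norms)                         # third entry is 0: b_3 is removed
print(weights)                           # adaptive weights entering the penalty
```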
The problem (2.11) can be solved by a new algorithm that iteratively updates two quadratic optimization functions, one for the random effects and one for the fixed effects. This algorithm has proven to be more efficient than the EM algorithm, which cannot handle large numbers of predictors. When the maximum likelihood is used, the resulting estimator is proven to be consistent for high-dimensional data, where p and q can diverge at an exponential rate with the sample size n.
3 Summary
In this article, we described two types of longitudinal high-dimensional data that researchers often encounter in current biomedical research and reviewed several recently developed statistical methods for these two types of data. First, we introduced a kernel-based classification method for the first type of longitudinal high-dimensional data and the corresponding computational strategy for parameter estimation. Second, we reviewed three mixed effects shrinkage models for the other type of longitudinal high-dimensional data. In the review, we compared the model setups, computational strategies, and the advantages and shortcomings of the methods.
Acknowledgments
Wu’s research is supported in part by NSF Grant CCF-0926181.
References
- Chen S, Bowman FD. A Novel Support Vector Classifier for Longitudinal High Dimensional Data. Statistical Analysis and Data Mining. 2011;4(6):604–611. doi: 10.1002/sam.10141.
- Mitchell TM, Hutchinson R, Niculescu RS, Pereira F, Wang X, Just M, Newman S. Learning to decode cognitive states from brain images. Machine Learning. 2004;57:145–175.
- Vapnik V. The Nature of Statistical Learning Theory. New York: Springer; 1996.
- Mourao-Miranda J, Friston KJ, Brammer M. Dynamic discrimination analysis: a spatial-temporal SVM. NeuroImage. 2007;36(1):88. doi: 10.1016/j.neuroimage.2007.02.020.
- Guyon I, Elisseeff A. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003;3:1157–1182.
- Wahba G. Spline Models for Observational Data. Philadelphia, PA: SIAM; 1990.
- Hastie T, Rosset S, Tibshirani R, Zhu J. The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research. 2004;5:1391–1415.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
- Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101(476):1418–1429.
- Bondell HD, Krishna A, Ghosh SK. Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics. 2010;66:1069–1077. doi: 10.1111/j.1541-0420.2010.01391.x.
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
- Laird NM, Lange N, Stram D. Maximum likelihood computations with repeated measures: Application of the EM algorithm. Journal of the American Statistical Association. 1987;82:97–105.
- Fan Y, Li R. Variable selection in linear mixed effects models. Annals of Statistics. 2012;40:2043–2068. doi: 10.1214/12-AOS1028.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least Angle Regression. The Annals of Statistics. 2004;32(2):407–451.
- Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics. 2009;37:1733–1751. doi: 10.1214/08-AOS625.
- Li Y, Wang S, Song PX-K, Wang N, Zhu J. Doubly Regularized Estimation and Selection in Linear Mixed-Effects Models for High-Dimensional Longitudinal Data. Unpublished manuscript. doi: 10.4310/SII.2018.v11.n4.a15.
- Lindstrom MJ, Bates DM. Newton-Raphson and EM algorithms for linear mixed effects models for repeated measures data. Journal of the American Statistical Association. 1988;83:1014–1022.
- Yuan M, Lin Y. Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B. 2006;68:49–67.
- Young DR, Saksvig BI, Wu TT, Zook K, Li X, Champaloux S, Grieser M, Lee S, Treuth M. Multilevel Predictors of Physical Activity for Early, Mid, and Late Adolescent Girls. Journal of Physical Activity & Health. In press. doi: 10.1123/jpah.2012-0192.
- Bondell HD, Reich BJ. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics. 2008;64(1):115–123. doi: 10.1111/j.1541-0420.2007.00843.x.
- Chen Z. Random effects selection in linear mixed models. Biometrics. 2003;59:762–769. doi: 10.1111/j.0006-341x.2003.00089.x.
- Hall DB, Praestgaard JT. Order-restricted score tests for homogeneity in generalised linear and nonlinear models. Biometrika. 2001;88(3):739–751.
- Lin X. Variance component testing in generalised linear models with random effects. Biometrika. 1997;84(2):309–326.
- Stram DO, Lee JW. Variance components testing in the longitudinal mixed effects model. Biometrics. 1994;50(4):1171–1177.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67(2):301–320.