Abstract
Classification studies are widely applied, e.g. in biomedical research, to assign objects/patients to predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with minimal classification error. Especially in gene expression experiments, many variables (genes) are typically measured for only few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower dimensional space; the resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called 'power parameter' that is chosen to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA that optimizes this power parameter towards the final aim, namely a minimal classification error. We compare this new extension with the original PPLS-DA and with ordinary PLS-DA using simulated and experimental datasets. For the investigated data sets with weak linear dependency between features/variables, neither PPLS-DA nor the extension improves on PLS-DA. For simulated data with very weak linear dependency and a low proportion of differentially expressed genes, PPLS-DA again shows no improvement over PLS-DA, but our extension shows a lower prediction error. In contrast, for the data set with strong between-feature collinearity, a low proportion of differentially expressed genes and a large total number of genes, the prediction error of PPLS-DA and of the extension is clearly lower than for PLS-DA. Moreover, we compare these prediction results with those of support vector machines with linear kernel and of linear discriminant analysis.
Introduction
In discrimination studies, one often has to handle data sets with a high number of features but only few samples. Especially for gene expression experiments, where thousands of genes are measured on comparatively few samples, dimension reduction is advantageous as a pre-processing step before the final classification step takes place. Many feature extraction methods for dimension reduction exist. One such method is powered partial least squares discriminant analysis (PPLS-DA), a specialized version of the well-known partial least squares discriminant analysis (PLS-DA), which was first introduced in chemometrics by Wold et al. [1], who used PLS regression [2] for classification purposes. Here the response variable of the linear model is given in the form of indicator variables. Barker & Rayens [3] and Nocairi et al. [4] were the first to formulate PLS-DA rigorously. The aim is to reduce the dimension (number of features) by a coordinate transformation to a lower dimensional space. PPLS-DA was introduced by Liland and Indahl [5] to improve the calculation of the loading weights for a better separation of the groups: analogously to powered partial least squares (PPLS) [6], a power parameter is introduced and chosen such that the correlation between the derived components and the group memberships is maximized, analogous to Fisher's canonical discriminant analysis (FCDA). The optimization criterion in PPLS-DA is therewith not directly aimed at prediction, and therefore the original algorithm does not necessarily yield the best components for class prediction.
Earlier studies by Telaar et al. [7] showed similar prediction errors (PEs) for PLS-DA and PPLS-DA for most of the analysed data sets, and even lower error rates than other classification methods such as random forest and support vector machines. Therefore, we try to optimize the power parameter of PPLS-DA towards class prediction on a training set, to see whether the prediction result for a test set can be improved. The power parameter and the number of components are determined according to the lowest prediction error of a linear discriminant analysis (LDA) applied to the PPLS-DA components, using a cross-validation approach. Furthermore, we compare the results of this extension and of PPLS-DA with ordinary PLS-DA with respect to the prediction error (PE) for simulated data sets and five publicly available experimental data sets. Finally, the PPLS-DA results for the simulated and experimental data sets are compared to those of a support vector machine (SVM) with linear kernel and of an LDA with ten features selected according to the t-test.
Materials and Methods
Description of a Classification Problem
An n × p data matrix X with n objects and p features, together with a vector y = (y_1, ..., y_n)' of group memberships, forms the basis of a classification problem. Here y_i is the group label of object i, which for example can be given in the form of discrete values or symbols. The k-th column of the matrix X is denoted by x_k. We assume that each sample belongs to a unique group g ∈ {1, ..., G}, and that each group g has a sample size of n_g and a prior probability of group membership of π_g. In total G different groups exist and n_1 + ... + n_G = n. In this article, we restrict ourselves to only two different groups (G = 2). Our results can be extended to more than two groups.
The group information can also be given in the form of a dummy-coded n × G matrix Y as follows: the entry Y_ig equals 1 if sample i belongs to group g, otherwise the entry equals 0, and each row of Y sums to 1. The goal is to determine a function which assigns to each object a unique group (out of the G groups) with the greatest possible accuracy [8]. In the following we assume that X is column-wise centered. Therewith the empirical total sum of squares and cross-product matrix (proportional to the empirical covariance matrix) is equal to T = X'X, and we let B denote the empirical between-group sum of squares and cross-product matrix, which can be formalized as B = Σ_{g=1}^{G} n_g m_g m_g', with m_g the mean vector of group g (the overall mean is zero because X is centered) [8].
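As a minimal illustration in this notation (the variable names and the toy data are ours), B and T can be computed for a two-group example as follows:

```r
## Sketch: total (T) and between-group (B) sum of squares and cross-product
## matrices for a column-centered data matrix X and group labels y.
set.seed(1)
n <- 20; p <- 5
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- rep(c(1, 2), each = n / 2)

Tmat <- crossprod(X)                                 # T = X'X
B <- matrix(0, p, p)
for (g in unique(y)) {
  n_g <- sum(y == g)
  m_g <- colMeans(X[y == g, , drop = FALSE])         # group mean vector
  B   <- B + n_g * tcrossprod(m_g)                   # n_g * m_g m_g'
}
```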
Introduction of PLS-DA and PPLS
PLS-DA
Before we start with the detailed description of PPLS-DA, we give a brief introduction to the roots of this method, PLS-DA and PPLS. Let w be a vector of loading weights with unit length, ||w|| = 1, and let m_g denote the vector of feature means in group g, g = 1, ..., G. Nocairi et al. [4] showed that, in the context of classification, the covariance between the score vector Xw and the dummy-coded group membership is maximized by the dominant eigenvector of the between-group matrix B. Therefore this eigenvector should be used for the determination of the loading weights vector, which was also recommended by Barker & Rayens [3]. An enhanced version of PLS-DA with inclusion of prior probabilities π_g in the estimation of B is proposed by Indahl et al. [8]. In this version, the importance of each group no longer depends on the empirical group proportions; therewith a direct opportunity is given to put more weight on particular groups in the calculation of the loading weights. Moreover, X is condensed to a G × p matrix whose rows are built from the (prior-weighted) group means m_g. Therewith a p × p eigenvalue problem (p = number of features) is reduced to a G × G eigenvalue problem (here 2 = number of groups), because only the between-group covariance structure of the group-mean matrix is considered. Afterwards the resulting eigenvector needs to be back-transformed to a loading weights vector in the original feature space.
PPLS
Now we turn back to PLS regression, to explain how the power parameter is introduced there. The loading weights vector of PLS regression maximizes the covariance between the score vector Xw and the response y, and its k-th entry can be written as

w_k = K · cov(x_k, y) = K · corr(x_k, y) · sd(x_k),

with a scaling constant K (the constant factor sd(y) is absorbed into K), where cov, corr and sd denote covariance, correlation and standard deviation, respectively. Hence, the influence of the correlation part and of the standard deviation part on the loading weights is balanced. Dominating X-variance which is irrelevant for prediction does not lead to optimal models; therefore Indahl [6] proposes PPLS, which allows the user to control the importance of the correlation part and the standard deviation part by a power parameter γ ∈ [0,1] as follows:

w_k(γ) = K(γ) · sign(corr(x_k, y)) · |corr(x_k, y)|^(γ/(1−γ)) · sd(x_k)^((1−γ)/γ),   γ ∈ (0,1).

Here sign(·) denotes the sign of the correlation, and K(γ) scales the loading weights vector to unit length. The power parameter is determined such that the correlation between the resulting score vector Xw(γ) and y is maximal. The endpoints γ = 0 and γ = 1 are included in the maximization as special cases: for γ = 0 the correlation is calculated with a loading weights vector whose only non-zero entry corresponds to the feature with the largest standard deviation, and for γ = 1 with a loading weights vector whose only non-zero entry corresponds to the feature with the largest absolute correlation to y.
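A small sketch of these candidate loading weights for the two-group case follows; the exponents γ/(1−γ) and (1−γ)/γ and the single-feature handling of the endpoints reflect our reading of the criterion described above.

```r
## Sketch: powered loading weights w(gamma) for a centered data matrix X
## and a (dummy-coded) response ytilde.
powered_weights <- function(X, ytilde, gamma) {
  s   <- apply(X, 2, sd)                     # standard deviation part
  rho <- as.vector(cor(X, ytilde))           # correlation part
  if (gamma == 0) {                          # special case: feature with largest sd
    w <- as.numeric(seq_along(s) == which.max(s))
  } else if (gamma == 1) {                   # special case: feature with largest |corr|
    k <- which.max(abs(rho))
    w <- numeric(length(rho)); w[k] <- sign(rho[k])
  } else {                                   # note: exponents grow large near 0 and 1
    w <- sign(rho) * abs(rho)^(gamma / (1 - gamma)) * s^((1 - gamma) / gamma)
  }
  w / sqrt(sum(w^2))                         # scale to unit length
}
```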
PPLS-DA: Optimization of the Power Parameter According to Correlation
PPLS-DA is designed to deliver components which are optimal for discriminating the classes coded in Y. This optimality can be understood in terms of a correlation approach: Nocairi et al. showed in [4] that the correlation is determined by the so-called Rayleigh quotient

R²(Xw, y) = (w' B w) / (w' T w).   (1)

Here the correlation is measured by the squared coefficient of correlation R² between the score vector Xw and the group membership. Maximization of the correlation is therewith equivalent to maximization of the Rayleigh quotient,

max_{w ≠ 0} (w' B w) / (w' T w).   (2)

The well-known solution of the maximization problem (2) is the dominant eigenvector of T⁻¹B. This is exactly the approach of Fisher's canonical discriminant analysis (FCDA) for determining the vector of loadings.
For PPLS-DA, Liland and Indahl [5] combine the approaches of FCDA and PPLS and further include, as in [8], prior probabilities for each group in the calculation of the between-group and total matrices. The data matrix X is transformed with V(γ) to the n × G matrix Z(γ) = X·V(γ), where V(γ) contains the possible candidate loading weight vectors as columns. Analogously to the PPLS weights above, the k-th entry of the candidate vector for group g is

v_kg(γ) = K_g(γ) · sign(ρ_kg) · |ρ_kg|^(γ/(1−γ)) · σ_k^((1−γ)/γ),   γ ∈ (0,1),   (3)

with ρ_kg = corr(x_k, y_g) the correlation between feature k and the g-th (prior-weighted) column of the dummy matrix Y, σ_k = sd(x_k), and K_g(γ) a constant scaling each candidate vector to unit length.

The power parameter γ enables the focus on features which have a high correlation to the group membership or on features which have a high standard deviation. For the transformed matrix Z(γ), the between-group sum of squares and cross-product matrix B_Z(γ) including the prior probabilities and the total variance matrix T_Z(γ) are calculated analogously to B and T above, with the samples weighted by a diagonal matrix D [5]. D is an n × n diagonal matrix whose entry for sample i is the ratio of the prior probability of the group to which sample i belongs and the corresponding group size times the number of samples, D_ii = π_{g(i)} / (n_{g(i)} · n).
The maximization problem of PPLS-DA is as follows:

max_{γ ∈ [0,1]}  max_{a ≠ 0}  (a' B_Z(γ) a) / (a' T_Z(γ) a).   (4)

For γ = 0 the feature(s) with the highest standard deviation and for γ = 1 the feature(s) with the highest correlation to the group membership vector are checked separately. We denote the solution of the optimization problem (4) by γ_cor. To avoid singular matrices and to obtain a numerically more stable solution, Liland and Indahl [5] replace the maximization problem (4) by the maximization of cca(Z(γ), Y) over γ, where cca denotes the canonical correlation. In [9] the authors showed that the two procedures lead to equal results.
For each component, γ_cor is therefore determined by maximization of the canonical correlation:

γ_cor = argmax_{γ ∈ [0,1]} cca(Z(γ), Y).   (5)
In the PPLS-DA algorithm of the R package pls, the R function optimize is used to search for the maximum. For this purpose, the whole training set is used to find γ_cor. Since X is condensed into the n × G data matrix Z(γ), the maximization problem depends only on the number of groups and the number of samples; it does not depend on the usually much higher number of features. The final loading weight vector is then w = V(γ_cor)·a, with a the weight vector obtained from the canonical correlation analysis. For a detailed description of PPLS-DA we refer to [5].
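To make the use of cppls in this study concrete, the following sketch fits PPLS-DA with the power parameter searched in [0,1] (to our understanding, this is what the lower and upper arguments of cppls control) and applies an LDA to the extracted components; apart from cppls and lda, all function and variable names are our own.

```r
## Sketch: PPLS-DA via pls::cppls, followed by LDA on the components.
library(pls)
library(MASS)

fit_pplsda <- function(Xtrain, ytrain, ncomp = 5, lower = 0, upper = 1) {
  Ydummy <- model.matrix(~ factor(ytrain) - 1)       # dummy-coded group membership
  cppls(Ydummy ~ Xtrain, ncomp = ncomp, lower = lower, upper = upper)
}

## Project new samples onto the fitted components (training-mean centering).
pplsda_scores <- function(fit, Xnew, ncomp) {
  sc <- sweep(Xnew, 2, fit$Xmeans) %*% fit$projection
  sc[, 1:ncomp, drop = FALSE]
}

## Prediction error of an LDA built on the first ncomp components.
pplsda_lda_pe <- function(fit, ytr, Xte, yte, ncomp) {
  ld   <- lda(unclass(fit$scores)[, 1:ncomp, drop = FALSE], grouping = factor(ytr))
  pred <- predict(ld, pplsda_scores(fit, Xte, ncomp))$class
  mean(as.character(pred) != as.character(yte))
}
```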
In our experience with PPLS-DA, we have observed that optimizing γ according to (5) does not always lead to the lowest prediction error achievable by other choices of γ.
PPLS-DA: Optimizing the Power Parameter with Respect to Prediction
To improve prediction errors in classification tasks using PPLS-DA, we aim to optimize the power parameter γ with respect to the prediction error of an LDA. For this, we propose a cross-validation approach to avoid overfitting. In a first step, we separate the data into a training and a test set which are disjoint; we call these sets the outer training set and the outer test set. Only the outer training set is used to optimize the γ value, and the outer test set is used to evaluate the resulting classification function. In a next step we split the outer training set randomly into d different inner training and inner test sets. Because unbalanced data influence the estimated classifier, we down-sample the majority group to obtain equal numbers of objects in both groups for the outer training set. Here, the proportion of 0.7 of the smallest group size determines the size of the outer training set for each group. The remaining objects build the outer test set. For the optimization step, we take into account equidistant fixed γ values in [0,1] with step size s, resulting in a sequence of 1/s + 1 values for the optimization, γ ∈ {0, s, 2s, ..., 1}. For example, for a choice of s = 0.1, we consider the 11 values 0, 0.1, ..., 1.
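The balanced outer split and the γ grid can be sketched as follows (our own helper functions, following the proportions described above):

```r
## Sketch: balanced outer training/test split and the gamma grid.
make_outer_split <- function(y, prop = 0.7) {
  n_per_group <- floor(prop * min(table(y)))          # 0.7 of the smallest group size
  train_idx <- unlist(lapply(split(seq_along(y), y),
                             function(idx) sample(idx, n_per_group)))
  list(train = train_idx, test = setdiff(seq_along(y), train_idx))
}

gamma_grid <- function(s = 0.1) seq(0, 1, by = s)      # e.g. 11 values for s = 0.1
```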
We calculate the prediction error (PE), the proportion of wrongly classified samples of a test set, as the measure of classification performance. Figure 1 gives a rough overview of our proposed extension. In this paper, all cross-validation procedures consist of random splits of the corresponding data sets into the proportions 0.7 (training set) and 0.3 (test set); a cross-validation with 10 repeats, for example, repeats this sampling 10 times. Utilizing the statistical software R, we use the function cppls (of the R-package pls) for PPLS-DA and the function lda (of the R-package MASS) for LDA. Furthermore, we use the default setting for the priors in the lda function, i.e. the group proportions, which are equal in our case.
We propose an extension for optimizing both γ and the number of components to be used as input for the LDA. The optimization of the γ value depends on the parameters s (the step size of the γ grid) and d (the number of inner cross-validation splits).
In our extension, all candidate numbers of components are taken into account when determining the γ value that minimizes the PE. The number of components and the power parameter are optimized in one procedure; therewith all components share the same γ value.
In the optimization, for each fixed γ value of the grid, the optimal number of components of PPLS-DA is determined as follows (see Figure 2): for the d different inner test sets, the PE for one up to five components C_j, j = 1, ..., 5, is calculated, all using the same γ, resulting in a d × 5 matrix of PE values. Then the average inner PE is calculated for each number of components. Therewith we select, for each γ, the smallest mean PE over the d inner test sets together with the corresponding optimal number of components. Over the γ grid, we then search for the minimal mean PE, leading to the selected power parameter, denoted γ_pred, with its optimal number of components. For these components and the power parameter γ_pred, we calculate the corresponding loading weight vectors on the complete outer training set, and we finally determine the PE on the outer test set.
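A condensed sketch of this inner cross-validation, reusing the helper functions from the sketches above, is given below; cppls is called with lower = upper = γ, which to our understanding holds the power parameter fixed at the grid value.

```r
## Sketch: joint choice of gamma (grid value) and number of components by inner CV.
optimize_gamma <- function(X, y, gammas = gamma_grid(0.1), max_comp = 5, d = 50) {
  mean_pe <- matrix(NA, length(gammas), max_comp,
                    dimnames = list(paste0("gamma=", gammas), paste0("C", 1:max_comp)))
  for (gi in seq_along(gammas)) {
    pe <- matrix(NA, d, max_comp)                      # PE per inner split and component count
    for (r in 1:d) {
      sp  <- make_outer_split(y, prop = 0.7)           # balanced inner train/test split
      fit <- fit_pplsda(X[sp$train, ], y[sp$train], ncomp = max_comp,
                        lower = gammas[gi], upper = gammas[gi])   # gamma held fixed
      for (nc in 1:max_comp)
        pe[r, nc] <- pplsda_lda_pe(fit, y[sp$train],
                                   X[sp$test, ], y[sp$test], ncomp = nc)
    }
    mean_pe[gi, ] <- colMeans(pe)                      # average inner PE
  }
  best <- which(mean_pe == min(mean_pe), arr.ind = TRUE)[1, ]
  list(gamma = unname(gammas[best["row"]]), ncomp = unname(best["col"]),
       mean_pe = mean_pe)
}
```

The selected pair (γ_pred, number of components) would then be used to refit PPLS-DA and the LDA on the complete outer training set before evaluating the PE on the outer test set.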
R Functions Used
(P)PLS-DA implementations
For PPLS-DA, we used the R-function cppls of the R-package pls. We implemented R code for PLS-DA based on [8]. The optimal number of components was determined by a cross-validation on the outer training set for PLS-DA, for PPLS-DA with γ_cor and for our described extension of PPLS-DA. For this step, we restricted the maximal number of components to five. The segments of the cross-validation are randomly chosen as proportions of 0.7 and 0.3 of the data set, and we repeat this procedure 10 times.
Further Classification Methods
SVM
For comparison, we consider the classification method SVM with a linear kernel, which searches for a linear hyperplane separating the data. For this purpose the R package e1071 [10] is applied, and the cost parameter of the linear kernel is tuned with the R-function tune.svm using a 10-fold cross-validation. The tuning interval is chosen according to the suggestion of Dettling & Buehlmann [11].
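A sketch of this SVM comparison is given below; the cost grid is a placeholder of our own, not the interval from [11] actually used in the study, and tune.svm/tune.control are used as we understand them from e1071.

```r
## Sketch: linear-kernel SVM with cost tuned by 10-fold cross-validation (e1071).
library(e1071)
svm_pe <- function(Xtr, ytr, Xte, yte, cost_grid = 10^(-2:2)) {   # placeholder grid
  tuned <- tune.svm(x = Xtr, y = factor(ytr), kernel = "linear",
                    cost = cost_grid, tunecontrol = tune.control(cross = 10))
  pred <- predict(tuned$best.model, Xte)
  mean(as.character(pred) != as.character(yte))
}
```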
t-LDA
Additionally, an LDA is performed using ten features which are selected on the outer training set according to a ranking by the lowest p-values of the t-test. For the t-test we use the R-package stats.
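A sketch of this filter-then-classify step (our own illustration) could look as follows:

```r
## Sketch: t-test filtering (10 genes with lowest p-values) followed by LDA (t-LDA).
library(MASS)
tlda_pe <- function(Xtr, ytr, Xte, yte, n_genes = 10) {
  ytr   <- factor(ytr)
  pvals <- apply(Xtr, 2, function(x) t.test(x ~ ytr)$p.value)   # stats::t.test per gene
  keep  <- order(pvals)[1:n_genes]                              # ranking by lowest p-value
  fit   <- lda(Xtr[, keep, drop = FALSE], grouping = ytr)
  pred  <- predict(fit, Xte[, keep, drop = FALSE])$class
  mean(as.character(pred) != as.character(yte))
}
```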
The same segments of the outer training and outer test set are used across all tested methods for fair comparison.
Data
We investigate simulated data and five publicly available experimental data sets. After preprocessing (as described for the experimental data sets below), all experimental data are on the log scale of gene expression, as are the simulated data.
For a detailed description of the covariance structure of our data, we use two measures analogous to Sæbø et al. [12]: the condition index, first used in [13], and the absolute value of the covariances between the principal components of X and the response vector, as used in [12]. The condition index κ_k = sqrt(λ_1/λ_k) is used as a measure of variable dependence, with λ_k being the k-th eigenvalue of X'X; the eigenvalues are assumed to be ordered, λ_1 ≥ λ_2 ≥ ... The increase of the first five condition indexes (κ_1, ..., κ_5) reflects the collinearity of the features: a rapid increase means that the features have a strong linear dependence, a weak increase implies a weak dependence. If we now consider the principal components, as in [12], the relevance of the k-th component is measured by the absolute value of the covariance between the k-th principal component X·e_k and the class vector ỹ. Here ỹ_i equals 1 if sample i belongs to the first group, otherwise ỹ_i equals −1. The eigenvector belonging to the k-th largest eigenvalue is denoted by e_k. Helland and Almøy [14] infer that data sets whose relevant components have small eigenvalues are difficult to predict. The condition index is plotted for the first five eigenvalues (scaled to the first eigenvalue) in Figure 3. Figure 4 shows the 50 largest scaled eigenvalues and the corresponding scaled covariances between the principal components and ỹ for all experimental data sets and one simulated data set (case 3).
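These two descriptive measures can be computed along the following lines (a sketch in our notation; taking the condition index as the singular-value ratio of Belsley et al. [13], i.e. the square root of the eigenvalue ratio, is our reading):

```r
## Sketch: condition indexes of X'X and absolute covariances between the
## principal components and the +/-1 class vector.
covariance_structure <- function(X, y, k = 5) {
  X  <- scale(X, center = TRUE, scale = FALSE)
  sv <- svd(X, nu = k, nv = 0)                      # singular values d: lambda_k = d_k^2
  lambda <- sv$d[1:k]^2
  kappa  <- sqrt(lambda[1] / lambda)                # condition indexes (kappa_1 = 1)
  ytilde <- ifelse(y == sort(unique(y))[1], 1, -1)  # +1 / -1 class coding
  pc     <- sv$u %*% diag(sv$d[1:k], k)             # principal component scores
  covs   <- abs(apply(pc, 2, function(tk) cov(tk, ytilde)))
  list(condition_index = kappa, abs_cov = covs)
}
```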
Simulated data sets
Gene expression data are simulated as normally distributed data, mimicking the log scale of microarray intensities after normalization. Here σ_B² denotes the biological variance, which we chose equal to 0.04, and σ_T² represents the technical variance, which we chose in different proportions of the biological variance. Studying a two-class classification objective, we simulate 60 samples per class for the whole data set and 1000 genes, partitioned into an informative and a non-informative part for the classification. For the test set, 30 samples per class are randomly chosen; the remaining 30 samples per class constitute the training set. The non-informative part of the data matrix, which shows no differences between the two classes, consists of normally distributed random variables with a common mean, biological variance σ_B² and the different cases of technical variance σ_T². The informative part contains ten differentially expressed genes (DEGs) with a mean class difference Δμ [7]. We take 5 cases into account. For cases 1, 2 and 3, Δμ is chosen for each DEG according to the uniform distribution on the interval [0.1, 0.5]. Case 1 has a technical variance of zero, case 2 of one-quarter of the biological variance (σ_T² = σ_B²/4), and case 3 is simulated with a technical variance of the same size as the biological variance (σ_T² = σ_B²). The ten DEGs of case 4 have a mean class difference of Δμ = 0.2 and σ_T² = σ_B². The simulated data of case 5 also have this high noise level (σ_T² = σ_B²), but a larger mean class difference of Δμ = 0.5.
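A sketch of this design for case 3 follows (our own illustration; the baseline expression level mu0 and the additive decomposition into separate biological and technical noise terms are assumptions, and the training/test split is performed afterwards as described above):

```r
## Sketch: simulated two-class data, case 3 (sigma_T^2 = sigma_B^2 = 0.04).
simulate_case3 <- function(n_per_class = 60, p = 1000, n_deg = 10,
                           sigma2_bio = 0.04, mu0 = 7) {   # mu0: placeholder baseline
  sigma2_tech <- sigma2_bio                                # case 3: equal variances
  n  <- 2 * n_per_class
  mu <- matrix(mu0, n, p)
  delta <- runif(n_deg, 0.1, 0.5)                          # mean class differences (DEGs)
  mu[1:n_per_class, 1:n_deg] <- mu0 + rep(delta, each = n_per_class)
  X <- mu + matrix(rnorm(n * p, sd = sqrt(sigma2_bio)),  n, p) +   # biological noise
            matrix(rnorm(n * p, sd = sqrt(sigma2_tech)), n, p)     # technical noise
  list(X = X, y = rep(c(1, 2), each = n_per_class), deg = 1:n_deg)
}
```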
We illustrate the data structure of the simulated data using case 3 as an example. The condition indexes are 1.00, 1.01, 1.04, 1.05, 1.06. This increase is the weakest among all data sets considered (Figure 3); therewith the genes are only weakly linearly dependent, which is also shown by Figure 4F. This is similar for all simulated data cases. Furthermore, the proportion of DEGs is very low (1%) for all cases.
Experimental data sets
Additionally, we considered five publicly available experimental microarray gene expression data sets, which are summarized in Table 1 together with information about the group sizes, the number of genes, the number and proportion of differentially expressed genes and the original publication. For the determination of the number of differentially expressed genes (#DEG) we use a t-test (from the R-package stats) and an FDR correction [15] (R-package qvalue), and we count all genes with a q-value below 0.05. In the following the five data sets are described:
Table 1. Overview of the experimental data sets.
name | number of samples | number of genes | #DEG | #DEG in % | original publication
Leukemia | 47/25 | 3571 | 1445 | 40.46 | [19] |
Lymphoma | 58/19 | 7129 | 1739 | 24.39 | [20] |
Breast cancer | 44/34 | 4997 | 54 | 1.08 | [17] |
Prostate 1 | 50/52 | 6033 | 2393 | 35.26 | [21] |
Prostate 2 | 41/62 | 42129 | 595 | 1.40 | [22] |
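The #DEG counts reported in Table 1 can be obtained along the following lines (a sketch; assumes the Bioconductor package qvalue):

```r
## Sketch: number and proportion of DEGs via t-test p-values and q-values < 0.05.
library(qvalue)   # Bioconductor
count_deg <- function(X, y) {
  y     <- factor(y)
  pvals <- apply(X, 2, function(x) t.test(x ~ y)$p.value)
  qvals <- qvalue(pvals)$qvalues                  # FDR correction [15]
  c(n_deg = sum(qvals < 0.05), percent = 100 * mean(qvals < 0.05))
}
```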
The Leukemia data were downloaded from the Whitehead Institute website. We merged the original training and test sets to obtain a higher sample size and sampled from the merged data to obtain the new training and test proportions of 0.7 and 0.3. The R code for data preprocessing from http://svitsrv25.epfl.ch/R-doc/library/multtest/doc/golub.R is used, which follows [16]. The data set consists of two groups, 25 patients with acute myeloid leukemia and 47 patients with acute lymphoblastic leukemia, and 3571 genes.
The condition indexes show a weak increase for this data set (1.00, 1.31, 1.49, 1.537, 1.83). This and the plot of the eigenvalues (Figure 4A) lead to the assumption of a weak linear dependency between the genes. The more relevant components have the largest eigenvalues (Figure 4A); therefore we can expect good prediction performance for this data set. This data set has the highest proportion of DEGs (40.46%, Table 1).
The Lymphoma data set was downloaded from the website http://www.broadinstitute.org/mpr/lymphoma/. The data are GC-RMA normalized. Two groups are considered, 58 patients with diffuse large B-cell lymphomas and 19 patients with follicular lymphoma. Only genes with a non-zero variance are used in our analysis, which leads to 7129 genes.
The between-variable dependencies are comparable to those of the Leukemia data set (condition indexes: 1.00, 1.10, 1.40, 1.50, 1.84). The covariance structure (Figure 4B) is also comparable to that of the Leukemia data set, and the total number of DEGs is slightly higher than for the Leukemia data set, but the proportion of the total number of genes is clearly lower (24.4%, Table 1).
The Breast Cancer data set consists of normalized and filtered data, downloaded from http://homes.dsi.unimi.it/~valenti/DATA/MICROARRAY-DATA/R-code/Do-Veer-data.R. The normalization was performed according to [17]. In this data set, only the two groups with the highest sample sizes are included: 34 patients with distant metastases within 5 years and 44 patients without distant metastases after at least 5 years. The total number of genes is 4997.
We found again a weak increase of the condition indexes for the first five eigenvalues (1.00, 1.42, 1.77, 1.90 and 1.91), but slightly faster than for the Leukemia and Lymphoma data sets (Figure 3). The eigenvalue plot (Figure 4C) also illustrates a weak linear dependence between the features. The proportion of DEGs is the lowest of all experimental data sets (1.08%, Table 1).
The Prostate 1 data set contains 52 tumor and 50 non-tumor cases and was downloaded from http://stat.ethz.ch/~dettling/bagboost.html. The preprocessing is described in [18] and the final data set contains 6033 genes.
This data set shows a rapid increase of the condition index from κ_1 to κ_5 (1.00, 2.96, 3.24, 5.046, 5.397), describing a strong linear dependency of the genes (Figure 3). This property is also indicated by the plot of the eigenvalues (Figure 4D). This data set also has a high proportion of DEGs (35.26%, Table 1).
We downloaded the Prostate 2 data set, which was already normalized, from http://bioinformatics.mdanderson.org/TailRank/. A description of the normalization can be found at http://bioinformatics.mdanderson.org/TailRank/tolstoy-new.pdf. In this data set, only the two groups with 41 patients with normal prostate tissue and 62 patients with primary tumors are included.
The condition index shows a rapid increase (1.00, 1.71, 2.10, 2.56, 2.93) over the first five eigenvalues (Figure 3), but more moderate than for the Prostate 1 data set. Figure 4E shows that also relevant components have small eigenvalues, which indicates low prediction performance. The proportion of DEGs is very low (1.4%, Table 1) and similar to that of the Breast Cancer data set, but the total number of genes is the largest (42129) of all experimental data sets.
Results
Results for the simulated data are based on 100 repeated simulations, and for the experimental data 100 different outer training and outer test sets are sampled. We present mean PE values of the outer test sets and the corresponding 95% confidence intervals. At first, we describe and compare the results of PPLS-DA using γ_pred with those of PPLS-DA using γ_cor, followed by the comparison with PLS-DA, PPLS-DA using γ = 0.5, t-LDA and SVM.
We calculate confidence intervals for the PE as follows: let p denote the vector of the 100 estimated prediction errors. The upper bound of the 95% confidence interval is calculated as the mean of p plus the half-width of a normal-approximation interval based on the standard error of p (the standard deviation of p divided by the square root of 100); the lower bound is calculated likewise with a minus sign. If these confidence intervals overlap, we report no significant differences; if they are disjoint, the corresponding PEs are reported as significantly different.
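A sketch of this interval computation (the normal-approximation multiplier of roughly 1.96 is our assumption, as the study's exact formula is not restated here):

```r
## Sketch: normal-approximation 95% confidence interval for the mean PE.
pe_ci <- function(pe, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)                # approx. 1.96 for 95% (assumption)
  se <- sd(pe) / sqrt(length(pe))                 # standard error of the mean PE
  c(mean = mean(pe), lower = mean(pe) - z * se, upper = mean(pe) + z * se)
}
```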
Results for the Simulated Data
PE results
At first we study the dependency of the optimization on the step size s and on the number d of inner cross-validation splits. We calculated the mean PE results for different choices of s and d. Because the corresponding confidence intervals overlap for all investigated parameter choices (Figure S1), we show all further results for d = 50 and s = 0.1.
Figure 5 illustrates the mean PE results and the corresponding 95% confidence intervals for the simulated data, and Table 2 summarizes the average number of components used for PPLS-DA using γ_cor, γ_pred and γ = 0.5 and for PLS-DA.
Table 2. The mean number of components used for the simulated data, for d = 50 and s = 0.1.
case | σ_T² | Δμ | PPLS-DA, γ_cor | PPLS-DA, γ_pred | PPLS-DA, γ = 0.5 | PLS-DA
1 | 0 | [0.1, 0.5] | 2.6 | 1.9 | 2.7 | 2.7
2 | σ_B²/4 | [0.1, 0.5] | 2.8 | 2.0 | 2.7 | 2.9
3 | σ_B² | [0.1, 0.5] | 2.9 | 2.0 | 2.8 | 2.9
4 | σ_B² | 0.2 | 2.7 | 2.5 | 2.7 | 2.7
5 | σ_B² | 0.5 | 2.4 | 1.7 | 2.5 | 2.5
For all considered cases, PPLS-DA using γ_pred shows a significantly smaller PE than PPLS-DA using γ_cor. Especially for DEGs with Δμ drawn from [0.1, 0.5], the PE of PPLS-DA with γ_pred is only one-tenth of the PE of PPLS-DA with γ_cor in the case without noise, one-fifth for a minor noise level (σ_T² = σ_B²/4) and still one-third for the high noise level with σ_T² = σ_B².
Considering the frequency distributions of the γ values (Figure S2 shows the corresponding histograms for case 3), the value with the highest frequency for the optimal γ value γ_pred determined in the cross-validation approach is 0.8. This is in contrast to the γ_cor values, for which 0.5 has the highest frequency (Figures S2A and S2B, Supplementary Materials).
Different power parameters have large effects on the loading weights, as can be seen in Figure 6, e.g. for the first component for γ_pred and γ_cor with d = 50 and s = 0.1 for case 3. The first 10 genes, which are simulated as differentially expressed, receive the highest absolute loading weight values for all methods. For γ_pred, the loading weights of the informative genes are larger in absolute value, and in particular the non-informative genes receive loading weight values near zero in comparison to the loading weights induced by γ_cor.
Comparing the above findings with the PEs of PLS-DA and of PPLS-DA with γ = 0.5, PLS-DA shows an equal PE to PPLS-DA using γ_cor or γ = 0.5 for all cases of the simulated data. Therewith, PPLS-DA using γ_pred also shows a significantly lower PE than PLS-DA. The number of components used for PLS-DA is equal to the corresponding number of components used for PPLS-DA using γ_cor or γ = 0.5. Hence, overall, PPLS-DA using γ_pred uses on average the lowest number of components.
Now we consider the results of the two reference methods, SVM and t-LDA. SVM shows the largest PE for all simulated data cases. The method t-LDA does not show a significantly different PE compared to PPLS-DA using γ_pred.
Results for the Experimental Data Sets
As for the simulated data, we choose d = 50 and s = 0.1.
In Figure 7 the mean PE results and the corresponding 95% confidence intervals are shown for all five experimental data sets and all considered methods. For PPLS-DA using γ_pred, γ_cor and γ = 0.5 and for PLS-DA, the average numbers of components are shown in Table 3.
Table 3. The mean number of components used for the experimental data sets.
data set | PPLS-DA, γ_pred | PPLS-DA, γ_cor | PPLS-DA, γ = 0.5 | PLS-DA
Leukemia | 2.6 | 2.8 | 2.3 | 2.0 |
Lymphoma | 2.9 | 3.1 | 2.8 | 1.8 |
Breast cancer | 2.4 | 2.3 | 2.4 | 2.7 |
Prostate 1 | 2.5 | 3.4 | 4.0 | 4.1 |
Prostate 2 | 2.9 | 3.7 | 4.3 | 4.2 |
Leukemia data set: For this data set, we found no significant difference in the PE of PPLS-DA using γ_pred compared to PPLS-DA using γ_cor (Figure 7A), and both methods use similar numbers of components. The modal value of γ_cor is 0.5, whereas the γ_pred values show two accumulations, one around 0.3 and the other around 0.8 (Figure S3, Supplementary Materials). Comparing the above PE results to PLS-DA, PLS-DA shows an equal PE compared to all four extensions of PPLS-DA. The PE of PPLS-DA using γ_cor is significantly larger than the PE of PPLS-DA with γ = 0.5 and of PLS-DA. PLS-DA also uses on average the lowest number of components (2.0).
For PPLS-DA using γ_pred we find a significantly lower PE than for t-LDA, but for SVM the PE is similar to that of PPLS-DA using γ_pred.
Lymphoma data set: The PE of PPLS-DA using γ_pred is significantly lower than the PE of PPLS-DA using γ_cor (Figure 7B). PPLS-DA using γ_cor leads on average to 3.1 components and PPLS-DA using γ_pred to 2.9 components.
Considering PLS-DA, we find a significantly lower PE than for PPLS-DA using γ_cor; the PE of PLS-DA is equal to the PEs of PPLS-DA using γ_pred and γ = 0.5. PLS-DA uses the smallest average number of components (1.8); for PPLS-DA using γ_pred and γ = 0.5 we find similar average numbers of components, 2.9 and 2.8.
The method t-LDA shows a significantly higher PE in comparison to PPLS-DA using γ_pred, and SVM an equal PE to PPLS-DA using γ_pred.
Breast cancer data set: For this data set, PPLS-DA using γ_pred and PPLS-DA using γ_cor do not show significantly different PEs (Figure 7C). The numbers of components used are also similar for these two methods, with 2.4 for PPLS-DA using γ_pred and 2.3 for PPLS-DA using γ_cor.
Also, PLS-DA shows an equal PE and a slightly higher number of components (2.7) compared to PPLS-DA using γ_pred, γ_cor or γ = 0.5.
Compared to PPLS-DA using γ_pred, t-LDA and SVM show similar PEs.
Prostate 1 data set: PPLS-DA using γ_pred shows an equal PE to PPLS-DA using γ_cor (Figure 7D). PPLS-DA using γ_cor leads on average to 3.4 components, while PPLS-DA using γ_pred uses 2.5 components.
Investigating PLS-DA, its PE is equal to the PE of PPLS-DA using γ_cor or γ_pred. PPLS-DA using γ = 0.5 and PLS-DA use the largest average numbers of components, 4.0 and 4.1 (see Figure S4).
Also t-LDA and SVM show equal PEs to the other methods.
Prostate 2 data set: The PE of PPLS-DA using γ_pred is equal to the PE of PPLS-DA with γ_cor (Figure 7E). Moreover, PPLS-DA using γ_pred uses 2.9 components and PPLS-DA using γ_cor uses 3.7 components.
PPLS-DA using γ = 0.5 and PLS-DA show equal PEs, which are significantly higher than for PPLS-DA using γ_pred or γ_cor. PPLS-DA with γ = 0.5 and PLS-DA also use more components than the other versions of PPLS-DA.
Both t-LDA and SVM show a significantly larger PE than PPLS-DA using γ_pred or PPLS-DA using γ_cor, but a significantly lower PE than PLS-DA and PPLS-DA using γ = 0.5.
Discussion
The focus of our study is the introduction of an extension of the method PPLS-DA for better classification in high-dimensional data sets, as for example gene expression data sets in biomedicine. The optimization criterion for the power parameter in the ordinary PPLS-DA is the canonical correlation, which does not need to be optimal for prediction. Our extension of PPLS-DA introduces an optimization of γ with respect to prediction, using an inner cross-validation approach. We carry along comparisons to t-LDA and SVM, to relate our proposed method to these standard classification methods.
Comparison between PPLS-DA using γ_pred and γ_cor
The PEs of the outer test sets for PPLS-DA were improved, or showed at least equal values, by the optimization of the γ value with respect to the prediction with LDA (γ_pred), in comparison to PPLS-DA using γ_cor, for all simulated data and all experimental data sets.
Simulated data
Comparing the histograms of the γ values found by PPLS-DA using γ_pred and PPLS-DA with γ_cor, the reason for the lower PE can be traced back to the down-weighting of the non-informative features for the simulated data (see Figure S2 and Figure 6). The loading weights of these features are near or equal to zero. The influence of the features which are not informative for the discrimination (and can be interpreted as noise) is reduced, because their impact on the calculation of the components is lower for PPLS-DA using γ_pred than for PPLS-DA with γ_cor. Values of γ near one lead to a preference for features which show a high correlation to the dummy response (in the simulation study these are the differentially expressed genes). Changing the optimization criterion from correlation towards prediction also leads to a lower average number of components for the simulated data.
Experimental data
For the experimental data, the γ values determined by the canonical correlation (γ_cor) are larger than the values detected by our proposed extension. Even though our analyses of simulated data suggested lower PE values for larger choices of γ, for the Leukemia data set and the Lymphoma data set also PPLS-DA using γ = 0.5 shows a significantly lower PE than PPLS-DA using γ_cor. Note that the truly informative genes of the experimental data are not known, and the proportion of differentially expressed genes is most likely much larger than for the simulated data; therefore the comparison of the results for simulated and experimental data is not straightforward. Moreover, experimental data are usually noisier than simulated data. The part of reality we have not been able to model in the simulated data might be a sort of noise and data structure that we cannot improve on, regardless of our choice of γ. If we could remove this part of the noise from the real data, the relative improvements might be just as good as with the simulated data.
Summarizing the findings for the simulated data, PPLS-DA using γ_pred shows significantly lower PEs than PPLS-DA with γ_cor. For the experimental data, the PEs of the extension are also significantly lower than, or equal to, those of PPLS-DA with γ_cor.
Additionally, we had considered three further versions of determining the power parameter towards prediction. These versions differ, for example, in the number of components considered for the optimization of the power parameter; one version optimizes an individual power parameter for each component. The results are similar to those of the extension presented here (data not shown).
Comparison between PPLS-DA using γ_cor and PLS-DA
The development of PPLS-DA followed the development of powered partial least squares as a natural extension of the power methodology to handle discrete responses. Several factors motivated this advancement of PLS-DA. First, the application of powers enables focusing on fewer explanatory variables in the loading weights, smoothing over some of the noise in the remaining variables. Second, the focus can be shifted between the correlation and standard deviation parts of the loading weights, which is even more important for discrete responses. Finally, the maximization criterion is moved from the between-group variation matrix (B) to the product of the inverse of the within-group variation matrix and the between-group variation matrix. This has the effect of moving from covariance maximization to correlation maximization: instead of just searching for the space having the highest variation between the groups, we also minimize the variation inside the groups, increasing the likelihood of a good group separation.
In our study, PPLS-DA with γ = 0.5 (applying no power parameter) and PLS-DA always show equal PEs for the simulated and the experimental data sets. Hence, for this case the different optimization criteria show no great differences with respect to the PEs of the outer test sets. Including the power parameter, the PE of PPLS-DA using γ_cor is equal to the PE of PLS-DA for all simulated data. Also, the average number of components used by PPLS-DA with γ_cor is lower than or equal to that of PLS-DA.
For two of the five experimental data sets (Leukemia and Lymphoma), the PE of PPLS-DA using γ_cor is significantly higher than the PE of PLS-DA. For these data sets, the proportions of differentially expressed genes are large (24.4% and 40.5%) and the genes are only weakly linearly dependent (considering the condition indexes). The PE of PPLS-DA using γ_cor is significantly lower than for PLS-DA for the Prostate 2 data set, and the number of components used is also on average lower. This data set contains only a low proportion of 1.4% differentially expressed genes, and the total number of genes is very high (42129). Moreover, for this data set the genes show a stronger linear dependency (rapid increase of the condition index) than for the Leukemia or the Lymphoma data set.
Summarizing, a weak increase of the condition indexes indicates no improvement in the PE when using PPLS-DA with γ_cor instead of PLS-DA. Concerning the percentage of DEGs, the PE of PPLS-DA using γ_cor is equal to the PE of PLS-DA only for a small percentage of DEGs (Breast Cancer data and case 3 of the simulated data). For a weak increase of the condition indexes and a high percentage of DEGs, the PE for PPLS-DA using γ_cor is even larger than for PLS-DA (Leukemia and Lymphoma data sets).
For a rapid increase of the condition indexes and a large proportion of DEGs, using γ_cor instead of PLS-DA does not improve the PE (Prostate 1). On the contrary, for a rapid increase of the condition indexes and a small percentage of DEGs, the PE can be improved by employing PPLS-DA using γ_cor instead of PLS-DA (Prostate 2).
Comparison between PLS-DA and PPLS-DA using γ_pred
For the simulated data, PLS-DA shows an equal PE compared to PPLS-DA using γ_cor or γ = 0.5, whereas the PE of PPLS-DA using γ_pred is significantly lower than the PE of PLS-DA. For the experimental data, the PEs of PLS-DA and PPLS-DA using γ_pred show no significant differences for four of the five data sets (Leukemia, Lymphoma, Breast Cancer and Prostate 1). For the Prostate 2 data set, the PE of PPLS-DA using γ_pred is clearly lower than for PLS-DA.
We conclude, first, that equal PEs for PLS-DA and PPLS-DA using γ_pred are caused by a weak between-feature dependency, independent of the proportion of DEGs. Second, for a data set with strong collinearity between the features and a low number of DEGs, PPLS-DA using γ_pred in contrast shows a clearly lower PE than PLS-DA.
Comparison between PPLS-DA using γ = 0.5 and PLS-DA
The PEs for PLS-DA and PPLS-DA using γ = 0.5 are indistinguishable for all simulated and all experimental data sets investigated. Maximization of the covariance or maximization of the correlation without the power parameter thus results in equal PEs of the outer test sets.
Comparison between PPLS-DA using γ_pred and t-LDA and SVM
The more classic and widely available classification methods t-LDA and SVM were also run and compared to PPLS-DA using γ_pred on all simulated and experimental data sets. For the simulated data, t-LDA performs as well as PPLS-DA using γ_pred. For the experimental data sets, SVM draws level with our proposed approach except for the Prostate 2 data set. In this comparison, PPLS-DA using γ_pred shows a comparatively stable, good performance. There may, however, exist further statistical learning methods which outperform the methods presented in this study; t-LDA and SVM serve to demonstrate the comparability of PLS-related methods with other commonly chosen approaches for such classification problems. The focus of our study, however, was on further developing PPLS-DA, a multivariate method which has already proven its large potential for classification problems involving orders of magnitude more features than samples, as is the case for OMICs data sets.
Conclusions and Outlook
It is conceivable to use the results of an initial PPLS-DA cross-validation series optimizing γ to judge whether running the extended version would be rewarding. Data sets with a high proportion of differentially expressed genes and a weak linear dependency (like the Leukemia and the Lymphoma data sets) most probably show good prediction results for PLS-DA; here we found no gain in using PPLS-DA with powers (γ_cor or γ_pred). On the contrary, for a rapid increase of the condition index, a low proportion of differentially expressed genes and a large total number of genes, using PPLS-DA with γ_cor clearly improves the prediction error compared to PLS-DA. In cases where PPLS-DA using γ_cor gives no advantage over PLS-DA, using the extension of PPLS-DA (optimizing the power parameter towards prediction) can be advantageous. Analysing the eigenvalue structure and the number of differentially expressed genes first can therefore be useful for deciding which method to use. One aspect of future work is to validate our conclusions on additional experimental data sets as well as with further simulations implementing a more complex covariance structure.
Supporting Information
Acknowledgments
The authors thank K. Schlettwein for comments and discussion of implementation possibilities, and H. Rudolf for critical comments and normalization of the Lymphoma data set.
Funding Statement
The study was supported by the Max Planck Institute for Infection Biology, Berlin, and the German Federal Ministry of Education and Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Wold S, Albano C, Dunn II W, Esbensen K, Hellberg S, et al. (1983) Pattern recognition: finding and using regularities in multivariate data. Applied Science Publishers, London. 147–188.
- 2. Wold S, Martens H, Wold H (1983) The multivariate calibration problem in chemistry solved by the PLS method. In: Ruhe A, Kågström B, editors. Proc. Conf. Matrix Pencils, March 1982. Lecture Notes in Mathematics, Springer. 286–293.
- 3. Barker M, Rayens W (2003) Partial least squares for discrimination. Journal of Chemometrics 17: 166–173.
- 4. Nocairi H, Qannari EM, Vigneau E, Bertrand D (2005) Discrimination on latent components with respect to patterns. Application to multicollinear data. Computational Statistics & Data Analysis 48: 139–147.
- 5. Liland KH, Indahl UG (2009) Powered partial least squares discriminant analysis. Journal of Chemometrics 23: 7–18.
- 6. Indahl UG (2005) A twist to partial least squares regression. Journal of Chemometrics 19: 32–44.
- 7. Telaar A, Repsilber D, Nürnberg G (2011) Biomarker discovery: classification using pooled samples – a simulation study. Computational Statistics.
- 8. Indahl UG, Martens H, Næs T (2007) From dummy regression to prior probabilities in PLS-DA. Journal of Chemometrics 21: 529–536.
- 9. Indahl UG, Liland KH, Næs T (2009) Canonical partial least squares – a unified PLS approach to classification and regression problems. Journal of Chemometrics 23: 495–504.
- 10. Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A (2009) e1071: Misc functions of the Department of Statistics (e1071). TU Wien.
- 11. Dettling M (2004) BagBoosting for tumor classification with gene expression data. Bioinformatics 20: 3583–3593.
- 12. Sæbø S, Almøy T, Aarøe J, Aastveit AH (2008) ST-PLS: a multi-directional nearest shrunken centroid type classifier via PLS. Journal of Chemometrics 20: 54–62.
- 13. Belsley DA, Kuh E, Welsch RE (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.
- 14. Helland IS, Almøy T (1994) Comparison of prediction methods when only a few components are relevant. Journal of the American Statistical Association 89: 583–591.
- 15. Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100: 9440–9445.
- 16. Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97: 77–87.
- 17. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.
- 18. Dettling M, Buehlmann P (2003) Boosting for tumor classification with gene expression data. Bioinformatics 19: 1061–1069.
- 19. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–537.
- 20. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8: 68–74.
- 21. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1: 203–209.
- 22. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, et al. (2004) Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci USA 101: 811–816.