Abstract
Principal component analysis (PCA) is a data analysis method that can deal with large volumes of data. Owing to the complexity and volume of the data generated by today's advanced technologies in genomics, proteomics, and metabolomics, PCA has become predominant in the medical sciences. Despite its popularity, PCA leaves much to be desired in terms of accuracy and may not be suitable for certain medical applications, such as diagnostics, where accuracy is paramount. In this study, we introduced a new PCA method, one that is carefully supervised by receiver operating characteristic (ROC) curve analysis. In order to assess its performance with respect to its ability to render an accurate differential diagnosis, and to compare its performance with that of standard PCA, we studied the striatal metabolomic profile of R6/2 Huntington disease (HD) transgenic mice, as well as that of wild type (WT) mice, using high field in vivo proton nuclear magnetic resonance (NMR) spectroscopy (9.4-Tesla). We tested both the standard PCA and our ROC-supervised PCA (using in each case both the covariance and the correlation matrix), 1) with the original R6/2 HD mice and WT mice, 2) with unknown mice, whose status had been determined via genotyping, and 3) with the ability to separate the original R6/2 mice into the two age subgroups (8 and 12 wks old). Only our ROC-supervised PCA (both with the covariance and the correlation matrix) passed all tests with a total accuracy of 100%, thus providing evidence that it may be used for diagnostic purposes.
Keywords: Diagnostic methods, principal component analysis, receiver operating characteristic (ROC) curve analysis, metabolomics, nuclear magnetic resonance spectroscopy, Huntington disease
Introduction
The concept of principal component analysis (PCA) was introduced by Pearson [1] and was developed by Hotelling [2-5]. Since then, PCA has been used in many research areas, including natural sciences, medical sciences, and behavioral and social sciences.
PCA is a multivariate data analysis/mining technique that seeks to transform, in a linear way, M correlated sets of P independent variables (IVs) into K uncorrelated sets of P IVs (where K<<M). The goal of PCA, in other words, is to reduce significantly the dimensionality of the original IVs (P) so that 1) the amount of the original variance accounted for by the number of the retained sets (K) of P IVs is maximized and 2) the K retained sets of P IVs are uncorrelated with each other. The fact that PCA is designed to replace a large number of sets of IVs with just a few (usually two or three) sets of those original IVs, with the condition that the few retained sets are not correlated, and also with the condition that those few retained sets capture the largest possible amount of the information (variance) contained in the original sets of IVs, has a significant and deterministic impact on both the applicability and performance of PCA.
Since the intended function of PCA is data dimensionality reduction, many have noted the advantages and disadvantages of PCA in that regard [3, 6-9]. Very little has been said, however, about PCA in connection with classification accuracy, a critical prerequisite for diagnostics. In this study, we investigated PCA specifically with respect to classification accuracy, assessed its performance, and offered explanations about its evidenced weaknesses based on specific examples from our study (see section 3 of Supplementary Material). Moreover, and more importantly, in order to increase its classification accuracy and render it suitable for diagnostic applications, we introduced a new PCA method, one that is carefully supervised by receiver operating characteristic (ROC) curve analysis. Just as we did in the case of standard PCA, we used our nuclear magnetic resonance (NMR) spectroscopy study of Huntington disease (HD) in mice to assess the performance of the ROC-supervised PCA; and we compared the results with those of the standard PCA.
Brief Description of ROC-supervised PCA: 1) All of the variables of the original dataset are assessed in terms of their discriminating power between the target and the reference group (ROC AUC); 2) Those variables with an AUC > θ1 (recommended θ1 = 0.75) are used in the 1st PCA setting; 3) The classification results of the 1st PCA setting with respect to the original subjects according to the equation of the first principal component (PC1) are recorded, and both the sum and the mean value of the squared residuals of every original subject as predicted by PC1 (Q1) are calculated; 4) Those variables with an AUC > θ2 (recommended θ2 = 0.80) are used in the 2nd PCA setting; 5) The classification results of the 2nd PCA setting with respect to the original subjects according to the equation of the first principal component (PC1) are recorded, and both the sum and the mean value of the squared residuals Q1 are calculated; 6) The previous two steps are repeated k times with increasing AUC values until the kth PCA setting, wherein only those original variables with an AUC > θk are used, yields a) the most accurate classification results with respect to the original subjects and b) the smallest mean value and sum value of all Q1 squared residuals. This kth PCA setting constitutes the diagnostic model; 7) The diagnostic model is tested with unknown subjects.
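The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the MATLAB program used in the study. The data are synthetic, the function names (`auc`, `fit_pc1`, `evaluate`) are hypothetical, power iteration stands in for a full eigendecomposition, and the orientation of PC1 (reference group on the positive side) and the tie-breaking order (accuracy first, then the Q1 sum) are assumptions.

```python
import random
import statistics

def auc(target, reference):
    """Share of (target, reference) pairs ranked correctly; ties count 0.5.
    Folded so that 0.5 means chance and 1.0 means perfect separation."""
    wins = sum((r > t) + 0.5 * (r == t) for t in target for r in reference)
    a = wins / (len(target) * len(reference))
    return max(a, 1.0 - a)

def fit_pc1(rows):
    """Column means and unit-length PC1 weights of the covariance matrix.
    Power iteration stands in here for a full eigendecomposition."""
    n, p = len(rows), len(rows[0])
    means = [statistics.fmean(col) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    w = [1.0 / (k + 1) for k in range(p)]  # arbitrary non-zero start vector
    for _ in range(500):
        v = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in v) ** 0.5
        w = [x / norm for x in v]
    return means, w

def evaluate(rows, is_reference):
    """Accuracy of the PC1 sign rule plus the sum and mean of Q1."""
    means, w = fit_pc1(rows)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    scores = [sum(wj * xj for wj, xj in zip(w, row)) for row in centered]
    # Assumption: orient PC1 so the reference group has the positive mean score.
    ref_mean = statistics.fmean(s for s, ref in zip(scores, is_reference) if ref)
    if ref_mean < 0:
        scores = [-s for s in scores]
        w = [-x for x in w]
    hits = sum((s > 0) == ref for s, ref in zip(scores, is_reference))
    # Q1: squared residual of each subject after reconstruction from PC1 alone.
    q1 = [sum((xj - s * wj) ** 2 for xj, wj in zip(row, w))
          for row, s in zip(centered, scores)]
    return {"accuracy": hits / len(rows), "q1_sum": sum(q1),
            "q1_mean": statistics.fmean(q1)}

# Synthetic data: 17 reference and 13 target subjects, 4 variables.
rng = random.Random(0)
shifts = [5.0, 3.0, 1.0, 0.0]  # synthetic group offset per variable
is_reference = [True] * 17 + [False] * 13
data = [[(s if ref else 0.0) + rng.uniform(-1, 1) for s in shifts]
        for ref in is_reference]

# Step 1: AUC of every variable.
aucs = []
for j in range(len(shifts)):
    ref_vals = [row[j] for row, ref in zip(data, is_reference) if ref]
    tgt_vals = [row[j] for row, ref in zip(data, is_reference) if not ref]
    aucs.append(auc(tgt_vals, ref_vals))

# Steps 2-6: one PCA setting per AUC cut-off.
settings = []
for theta in (0.70, 0.80, 0.90, 0.95):
    keep = [j for j, a in enumerate(aucs) if a > theta]
    if not keep:
        continue
    result = evaluate([[row[j] for j in keep] for row in data], is_reference)
    settings.append({"theta": theta, "variables": keep, **result})

# Most accurate setting first, then the smallest Q1 sum.
best = min(settings, key=lambda s: (-s["accuracy"], s["q1_sum"]))
print(best)
```

The selected setting would then be frozen (means and PC1 weights) and applied to subjects that played no part in its development, as in step 7.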
Materials & methods
R6/2 transgenic mice
Animal experiments described in this study were performed in accordance with the procedures approved by the University of Minnesota Institutional Animal Care and Use Committee. The R6/2 mice were originally purchased from the Jackson Laboratory (Bar Harbor, ME, USA) and bred by crossing transgenic males and wild type (WT) females at 5 weeks of age. Offspring were genotyped according to established procedures [10] and those of the Jackson Laboratory.
Animal preparation
In preparation for in vivo 1H NMR (proton nuclear magnetic resonance) scanning, all animals were anesthetized and kept under anesthesia for the duration of the scanning procedure. A gas mixture (O2 : N2O = 1:1) containing 1.25-2.0% isoflurane was used for anesthesia and flowed through the cylindrical chamber in which the spontaneously breathing animals were placed. The chamber temperature was maintained at 30 °C by circulating warm water over the outside surface of the chamber. The 1H NMR scanning of each animal required approximately 1 hr.
In Vivo 1H NMR spectroscopy
1H NMR scans were conducted with a 9.4 T/31 cm magnet (Magnex Scientific, Abingdon, UK). The magnet was equipped with an 11 cm gradient coil insert (300 mT/m, 500 μs) and strong custom-designed second order shim coils (Magnex Scientific, Abingdon, UK) [11]. The volume of interest (VOI) was selected based on multi-slice RARE images. The VOI was centered in the left striatum at the level of the anterior commissure. The size of the VOI, which varied from 7-12 μL, was adjusted to fit the anatomical structure of the left striatum, as well as to exclude the lateral ventricle and, thus, to minimize partial volume effects (inclusion of a tissue other than the target tissue). The striatum was selected as the area of interest because it consists to a large extent of the medium spiny projection neurons, which are GABA-ergic, and which, more importantly, constitute the initial and preferential target of HD. It is in the medium spiny projection neurons of the striatum where HD first manifests itself. At the end stage, following extensive neuronal cell loss in the striatum, the disease evinces itself in other brain areas, such as the cerebral cortex, globus pallidus, substantia nigra, thalamus, cerebellum, nucleus accumbens, and white matter [12].
Thirty mice (17 WT and 13 R6/2) were scanned according to the aforementioned procedure. Of the 17 WT mice, 8 were 8 wks old and 9 were 12 wks old; whereas of the 13 R6/2 mice, 7 were 8 wks old and 6 were 12 wks old. Those 30 mice were used in the development of both the standard and the ROC-supervised PCA diagnostic biomarker models (DBMs). In addition, 31 unknown mice (11 R6/2 and 20 WT) were also scanned according to the aforementioned procedure and were used to test and validate all PCA DBMs. All of the 31 unknown mice were extraneous to the development of the PCA DBMs, and their status had been ascertained via genotyping.
Spectral analysis resulted in the identification and individual quantification of 15 metabolites. By combining the obtained individual absolute concentrations of creatine (Cr) and phosphocreatine (PCr), we created the Cr+PCr and PCr/Cr metabolites (variables) in order to obtain information about the total striatal creatine (free and phosphorylated), as well as about the ratio of those two metabolites. In the case of glycerophosphorylcholine (GPC) and phosphorylcholine (PC), we were not able to separate those two and obtain individual concentrations. We were able, however, to obtain the absolute concentration of the sum of GPC and PC, which represents the total striatal phosphorylated choline. All of the 15 striatal metabolites we were able to identify and quantify individually as a result of the high magnetic field spectrometer we used (9.4 Tesla), as well as the two metabolites (variables) we created, are shown in Table 1.
Table 1.
Names & abbreviations of all metabolites detected and measured in the study
No. | Metabolite Symbol | Metabolite Name |
---|---|---|
1 | Cr | creatine |
2 | PCr | phosphocreatine |
3 | Cr+PCr | creatine + phosphocreatine |
4 | PCr/Cr | phosphocreatine / creatine |
5 | GABA | γ-aminobutyric acid |
6 | Glc | glucose |
7 | Gln | glutamine |
8 | Glu | glutamate |
9 | GSH | glutathione |
10 | GPC+PC | glycerophosphorylcholine + phosphorylcholine |
11 | Lac | lactate |
12 | MM | macromolecules |
13 | mIns | myo-Inositol |
14 | NAA | N-acetylaspartate |
15 | NAAG | N-acetylaspartylglutamate |
16 | PE | phosphorylethanolamine |
17 | Tau | taurine |
Since both of our animal groups (WT & R6/2) comprised two age subgroups (8-wk old & 12-wk old mice), the time dependent variable was collapsed, so the developed models for diagnostic biomarkers (DBMs) would be applicable from 8-12 weeks of age - a most important time period in the progression of the disease in R6/2 mice, as well as a significant portion of the observed lifespan of the R6/2 mice. The development of all diagnostic biomarker models (DBMs), therefore, was based on the data of the aforementioned 13 R6/2 mice [seven at 8 wks of age & six at 12 wks of age] and 17 WT mice [eight at 8 wks of age & nine at 12 wks of age]. For more details on animal methods, as well as on spectra obtainment and processing, please see our previous study [13].
Statistical software
For our study, we used the statistical software by NCSS 2007, Kaysville, Utah, USA.
Computer programs
Computer programs were written using MATLAB R2009b by The MathWorks, Inc., Natick, MA, USA.
Diagnostic biomarker models
General Description: We used the data (concentrations of 17 metabolites) of our 30 original mice to develop diagnostic biomarker models (DBMs) for both the standard and the ROC supervised PCA methods. The DBMs comprised computer programs, which, based on the equation of the first principal component (PC1) (1.1) of the respective PCA method, could render a differential diagnosis of an unknown mouse (WT or R6/2). More specifically, the equation of the first principal component is given by
$$PC1_N = W_{11}X_{1N} + W_{12}X_{2N} + \cdots + W_{1P}X_{PN} \tag{1.1}$$
X1N, X2N, …, XPN are the P variables (in our case, P=17 metabolite concentrations) of subject N; W11, W12, … W1P are the weights of the P variables with respect to PC1, which can be calculated from the eigenvector of PC1; and PC1N is the score of subject N with respect to the first principal component (PC1). The first principal component (PC1) is the most important of all principal components for the following two reasons: 1) it contains most of the information (variance) of the original variables and 2) it has the highest potential in terms of classification accuracy with respect to the target and the reference group (see results in Tables S1-S10 in the Supplementary Material). Therefore, we can use equation (1.1) to make a diagnosis of an unknown mouse by calculating its score with respect to the first principal component. Based on whether the score is positive or negative, the unknown mouse can be diagnosed as either WT or R6/2 respectively. More details on (1.1) and other PCA equations, as well as the basic theory of PCA, can be found in section 1 of Supplementary Material.
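As an illustration of how equation (1.1) and the sign rule could be applied to a single subject, the sketch below uses hypothetical weights and inputs for P = 3 variables; none of the numbers come from the study. It assumes that the inputs are already expressed relative to the training means, so that zero is a meaningful cut-off, and the function names are invented for this example.

```python
def pc1_score(weights, concentrations):
    """PC1_N = W11*X1N + W12*X2N + ... + W1P*XPN, as in equation (1.1)."""
    if len(weights) != len(concentrations):
        raise ValueError("need exactly one weight per variable")
    return sum(w * x for w, x in zip(weights, concentrations))

def diagnose(score):
    """Sign rule from the text: positive -> WT, negative -> R6/2.
    A score of exactly zero is not covered by the text; it falls to R6/2 here by choice."""
    return "WT" if score > 0 else "R6/2"

# Hypothetical PC1 weights and mean-centered inputs (assumptions, not study values).
weights = [0.6, 0.5, -0.2]
mouse_a = [1.2, 0.8, 0.1]
mouse_b = [-0.9, -1.1, 0.3]

for name, mouse in (("mouse_a", mouse_a), ("mouse_b", mouse_b)):
    score = pc1_score(weights, mouse)
    print(name, round(score, 2), diagnose(score))
```

Because the sign of an eigenvector is arbitrary, a model of this kind also has to fix which side of zero corresponds to which group when the weights are derived.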
We subjected both PCA DBMs (standard and ROC-supervised) to the following three tests:
Test 1: Identification of our original 30 mice, which were intrinsic to the development of all DBMs. This is a necessary first test in that a DBM has to demonstrate that it has the prerequisite discriminating accuracy to classify correctly the original 30 mice, which were used in the development of that DBM. It is by no means a foregone conclusion that a DBM can pass this test with 100% accuracy.
Test 2: Identification of 31 unknown mice, which were extraneous to the development of all DBMs. This is the validation test, and as such, it is by far the most important test. A DBM is asked to identify/diagnose 31 unknown mice. These 31 mice were new and different from the 30 original mice used in the development of that DBM. The status of these 31 unknown mice had been determined by genotyping, which is the gold standard in HD.
Test 3: Classification of our 13 original R6/2 mice into their two age groups: 8 wk-old and 12 wk-old. Seven of those R6/2 mice were scanned at the age of 8 weeks and six of them were scanned at the age of 12 weeks. This is a test designed to assess the sensitivity of a DBM with respect to the progression of the disease. Those R6/2 mice that were scanned at the age of 12 weeks were more impaired than those R6/2 mice that were scanned when they were 8 weeks old. A DBM should have the required sensitivity to discriminate between those two groups of R6/2 mice.
For both the standard and the ROC-supervised PCA in connection with the first test, we entered our data (subjects) in the following order: rows #1-17 were the WT mice and rows #18-30 were the R6/2 mice. For both the standard and the ROC-supervised PCA in connection with the third test, we entered our data (subjects) in the following order: rows #1-7 were the 8-wk old R6/2, whereas rows #8-13 were the 12-wk old R6/2 mice.
PCA with Covariance Matrix: For both the standard and the ROC-supervised PCA with the covariance matrix, we chose the following settings: Matrix Type: We chose the Covariance Matrix; Factor Selection - Method: We chose Percent of Eigenvalues; Factor Selection - Value: We selected 100; Factor Rotation: We chose none.
PCA with Correlation Matrix: Except for the Matrix Type, for the correlation matrix PCAs (both the standard and the ROC-supervised one), we chose the same settings as those for the PCAs with the covariance matrix (listed immediately above). In section 2 of the Supplementary Material, there is an account of the differences between PCA with covariance matrix and PCA with correlation matrix.
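One general way to see the difference between the two matrix types is that the correlation matrix is the covariance matrix of the standardized (z-scored) variables, so a PCA on it gives every variable unit variance before components are extracted. The sketch below demonstrates this with hypothetical two-variable data (not measurements from the study); the helper names are invented for the example.

```python
import statistics

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [statistics.fmean(col) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its sample SD."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical concentrations: column 0 varies far more than column 1.
data = [[10.0, 1.0], [14.0, 1.2], [6.0, 0.9], [12.0, 1.1], [8.0, 0.8]]

cov = covariance_matrix(data)                 # dominated by column 0
corr = covariance_matrix(standardize(data))   # unit diagonal, i.e. correlations

print("covariance :", cov)
print("correlation:", corr)
```

In the covariance matrix the high-variance column carries almost all of the total variance and would therefore dominate PC1, whereas in the correlation matrix both columns enter on an equal footing.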
ROC curve analysis
ROC curve analysis is a theory of probabilities. It studies two probabilities, namely, sensitivity and (1-specificity), in order to determine a third probability, namely, the area under the curve (AUC). The ROC AUC probability is basically an assessment of the discriminating power of a given variable with respect to the two groups involved. If the AUC of a given variable is equal to 1.00, then according to that variable, the two groups involved can be separated with 100% accuracy. A variable with perfect discrimination between the two groups has an AUC = 1.00, whereas a variable with the poorest discrimination between the two groups has an AUC = 0.50 (chance probability). For a more detailed account on the properties, methodology, and applications of ROC curve analysis, please refer to our previous study [14].
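To make the relationship between sensitivity, (1-specificity), and the AUC concrete, the following sketch builds an empirical ROC curve from two small groups of hypothetical values and integrates it with the trapezoid rule. It assumes that lower values indicate the target group; the function names and data are illustrative only.

```python
def roc_points(target, reference):
    """(1 - specificity, sensitivity) pairs, one per candidate cut-off.
    Assumption: a subject is called 'target' when its value is <= the cut-off."""
    cutoffs = sorted(set(target) | set(reference))
    points = [(0.0, 0.0)]
    for c in cutoffs:
        sensitivity = sum(t <= c for t in target) / len(target)
        false_positive_rate = sum(r <= c for r in reference) / len(reference)
        points.append((false_positive_rate, sensitivity))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical values for three scenarios.
perfect = trapezoid_auc(roc_points([1, 2, 3], [4, 5, 6]))   # no overlap
partial = trapezoid_auc(roc_points([1, 2, 4], [3, 5, 6]))   # one overlap
chance = trapezoid_auc(roc_points([1, 2], [1, 2]))          # identical groups

print(perfect, partial, chance)
```

The three scenarios correspond to the cases described above: complete separation, partial overlap, and no discriminating power at all.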
Since ROC curve analysis allows us to assess our variables in terms of discriminating power with respect to our two groups (WT vs. R6/2), we used the results of ROC curve analysis (Table 2) not only to supervise PCA but also to determine the best possible setting of the ROC-supervised PCA. To be more specific, first we entered only those IVs (metabolite concentrations) that had an AUC > 0.70 (70%), then only those that had an AUC > 0.80, then only those with an AUC > 0.90, and finally only those with an AUC > 0.95. In order to assess the different settings of the ROC-supervised PCA, we used the following criteria: 1) classification results and 2) the residuals (both the sum of the Q1 values of all subjects and the mean Q1 value of all subjects of a particular setting). QP is the sum of squared residuals when a subject is predicted using the first P principal components [4]. Since we are interested in the first principal component, Q1 is the residual of our interest. The smaller the sum of all the residuals of all subjects (sum of all Q1 values of all subjects) and the smaller the mean value of the residuals of all subjects (mean value of the Q1 values of all subjects), the better the setting.
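For orientation, one common way to write the Q1 residual of subject N is given below. It assumes mean-centered variables and a unit-length PC1 eigenvector, neither of which is stated explicitly in the text, and it uses the notation of equation (1.1); S denotes the number of subjects, a symbol introduced here for illustration.

```latex
\hat{X}_{jN} = W_{1j}\,PC1_N, \qquad
Q_{1,N} = \sum_{j=1}^{P} \left( X_{jN} - \hat{X}_{jN} \right)^{2}, \qquad
Q_{1}^{\mathrm{sum}} = \sum_{N=1}^{S} Q_{1,N}, \qquad
Q_{1}^{\mathrm{mean}} = \frac{1}{S}\, Q_{1}^{\mathrm{sum}}
```

Under these assumptions, a small Q1 means that the subject is reconstructed almost entirely by PC1 alone.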
Table 2.
Rank of all metabolites based on their discriminating power (AUC) from ROC curve analysis
Metabolite (8-12 wks) | AUC | AUC Rank |
---|---|---|
Cr+PCr | 1.00000 | 1 |
Gln | 0.98897 | 2 |
Cr | 0.98832 | 3 |
NAA | 0.98198 | 4 |
GSH | 0.94052 | 5 |
GPC+PC | 0.90301 | 6 |
mIns | 0.89978 | 7 |
PCr | 0.87023 | 8 |
PE | 0.83667 | 9 |
Tau | 0.72888 | 10 |
NAAG | 0.69632 | 11 |
Glc | 0.58495 | 12 |
Glu | 0.58179 | 13 |
PCr/Cr | 0.53852 | 14 |
GABA | 0.52209 | 15 |
Lac | 0.52187 | 16 |
MM | 0.50067 | 17 |
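The variable sets implied by each AUC cut-off can be read off Table 2 directly; the short sketch below does so programmatically. The AUC values are copied from Table 2, while the function name `retained` is hypothetical.

```python
# AUC values copied from Table 2.
auc_table = {
    "Cr+PCr": 1.00000, "Gln": 0.98897, "Cr": 0.98832, "NAA": 0.98198,
    "GSH": 0.94052, "GPC+PC": 0.90301, "mIns": 0.89978, "PCr": 0.87023,
    "PE": 0.83667, "Tau": 0.72888, "NAAG": 0.69632, "Glc": 0.58495,
    "Glu": 0.58179, "PCr/Cr": 0.53852, "GABA": 0.52209, "Lac": 0.52187,
    "MM": 0.50067,
}

def retained(threshold):
    """Variables whose AUC is strictly greater than the threshold."""
    return [name for name, auc in auc_table.items() if auc > threshold]

for theta in (0.70, 0.80, 0.90, 0.95):
    names = retained(theta)
    print(f"AUC > {theta:.2f}: {len(names)} variables -> {', '.join(names)}")
```

With a strict cut-off of 0.90, mIns (AUC = 0.89978) falls just below the line; it is retained only if the AUC values are first rounded to two decimals, which may be how the seven-variable setting reported in the Results (AUC ≥ 90%) was formed.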
Results
Standard PCA with covariance matrix
Test 1: Identification of the original 30 mice (WT vs. R6/2): We ran the standard PCA with the covariance matrix and unsupervised, i.e. with all of our 17 IVs (metabolites). Rows # 1-17 were the WT mice, and rows #18-30 were the R6/2 mice. Table S1 in the Supplementary Material shows the scores of the 30 original mice with respect to the first test according to the first six principal components (factors) (PC1 - PC6). None of the 17 principal components correctly identified all of the 30 mice. As one can see from Table S1, the first principal component (PC1) misidentified 5 mice (#19-22 & #24), which are R6/2, and which should have negative factor scores. The results, therefore, according to PC1 are: 17/17 WT mice (100% correct) & 8/13 R6/2 mice (61.54% correct) → with a total accuracy of 25/30 original mice (83.33% correct). In this case, sensitivity = 0.615 and (1-specificity) = 0.
The positive Likelihood Ratio [(+)LR] is: (+)LR = (sensitivity)/(1-specificity) = 0.615/0 → ∞. The negative Likelihood Ratio [(-)LR] is: (-)LR = (1-sensitivity)/(specificity) = 0.385/1 = 0.385.
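The same arithmetic can be reproduced from the classification counts. In the sketch below, R6/2 is treated as the positive (target) class, consistent with the sensitivity quoted above; the function name is hypothetical, and a division by zero is reported as infinity.

```python
import math

def likelihood_ratios(tp, fn, tn, fp):
    """Sensitivity, specificity, (+)LR and (-)LR from a 2x2 classification table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = 1.0 - specificity
    lr_pos = math.inf if false_positive_rate == 0 else sensitivity / false_positive_rate
    lr_neg = math.inf if specificity == 0 else (1.0 - sensitivity) / specificity
    return sensitivity, specificity, lr_pos, lr_neg

# Test 1, standard PCA (covariance matrix), PC1:
# 8 of 13 R6/2 mice and 17 of 17 WT mice were classified correctly.
sens, spec, lr_pos, lr_neg = likelihood_ratios(tp=8, fn=5, tn=17, fp=0)
print(round(sens, 3), spec, lr_pos, round(lr_neg, 3))
```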
As can be seen in Figure 1, there is no separation between the WT mice (#1-17) and the R6/2 mice (#18-30) either with respect to PC1 or PC2. The general results of all PCA runs, including those of this run, appear in Table 3. The second principal component (PC2) misidentified 6 mice: #26 & #28-30, which are R6/2, and which should have negative factor scores, as well as mice #2-3, which are WT, and which should have positive factor scores (Table S1). Not surprisingly, the rest of the factors (3-17), which collectively account for only ∼ 20% of the original variance (Table S2 in the Supplementary Material), did not show any meaningful results with respect to the identification of the 30 mice. The individual significance of the 17 IVs for each of the first five principal components can be seen in Table 4. More specifically, the eigenvectors of the first 5 principal components (factors) of standard PCA (using the covariance matrix) are shown. Since the magnitude of the absolute value of the weights of the variables within each eigenvector is directly proportional to the significance of the variables for each principal component, one can see the magnitude of significance of each variable for each of the first five principal components. Focusing on PC1 (Factor 1), one can see that Tau has by far the greatest weight (0.6756), and it is, therefore, the most significant variable for PC1, which, in turn, is the most significant of all principal components since it alone accounts for 57.51% of the original variance (Table S2 in the Supplementary Material). That means that the equation of PC1 has been heavily influenced by Tau. As can be seen in Table 2, Tau, according to ROC curve analysis, has an AUC = 0.7289, which means that in this case, Tau as a bio-marker cannot be used for diagnostic purposes. 
Cr+PCr, on the other hand, is the perfect bio-marker (AUC = 1.0000) (Table 2), and it is upon this variable (Cr + PCr) that the equation of PC1 should have been predominantly based. In Section 3 in the Supplementary Material, there is a more detailed account and discussion of the two aforementioned metabolites in connection with the basic principle of operation of PCA.
Figure 1.
Standard PCA (covariance matrix) - Test 1. Scores of the 30 original mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run unsupervised (all 17 IVs were used) using the covariance matrix. As can be seen, there is no separation between the two groups [WT (#1-17) & R6/2 (#18-30)] either with respect to the first principal component or with respect to the second one.
Table 3.
General results of all PCA runs with respect to our three tests
Test / Group | Covariance Matrix: Standard (17 Variables) | Covariance Matrix: ROC-Supervised (4 Variables) | Correlation Matrix: Standard (17 Variables) | Correlation Matrix: ROC-Supervised (4 Variables) |
---|---|---|---|---|
Test 1: ID of original 30 mice (% correct) | | | | |
17 WT | 17/17 (100%) | 17/17 (100%) | 17/17 (100%) | 17/17 (100%) |
13 R6/2 | 8/13 (61.54%) | 13/13 (100%) | 13/13 (100%) | 13/13 (100%) |
Total | 25/30 (83.33%) | 30/30 (100%) | 30/30 (100%) | 30/30 (100%) |
(+) Likelihood Ratio | 0.615/0 → ∞ | 1/0 → ∞ | 1/0 → ∞ | 1/0 → ∞ |
(-) Likelihood Ratio | 0.385 | 0/1 = 0 | 0/1 = 0 | 0/1 = 0 |
Test 2: ID of 31 unknown mice | | | | |
20 WT | 20/20 (100%) | 20/20 (100%) | 20/20 (100%) | 20/20 (100%) |
11 R6/2 | 7/11 (63.64%) | 11/11 (100%) | 8/11 (72.73%) | 11/11 (100%) |
Total | 27/31 (87.10%) | 31/31 (100%) | 28/31 (90.32%) | 31/31 (100%) |
(+) Likelihood Ratio | 0.636/0 → ∞ | 1/0 → ∞ | 0.727/0 → ∞ | 1/0 → ∞ |
(-) Likelihood Ratio | 0.364 | 0/1 = 0 | 0.273 | 0/1 = 0 |
Test 3: ID of original 13 R6/2 mice | | (R6/2)-ROC-Supervised (2 Variables) | | (R6/2)-ROC-Supervised (2 Variables) |
7 R6/2 8 wks old | 6/7 (85.71%) | 7/7 (100%) | 7/7 (100%) | 7/7 (100%) |
6 R6/2 12 wks old | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) | 6/6 (100%) |
Total | 12/13 (92.31%) | 13/13 (100%) | 13/13 (100%) | 13/13 (100%) |
(+) Likelihood Ratio | 6.998 | 1/0 → ∞ | 1/0 → ∞ | 1/0 → ∞ |
(-) Likelihood Ratio | 0/0.857 = 0 | 0/1 = 0 | 0/1 = 0 | 0/1 = 0 |
Table 4.
The eigenvectors of the first 5 principal components (factors) of standard PCA using the covariance matrix
Variable | Factor 1 | Factor 2 | Factor 3 | Factor 4 | Factor 5 |
---|---|---|---|---|---|
Cr | -0.140814 | -0.404184 | -0.252721 | 0.084649 | 0.003137 |
Gln | -0.421211 | -0.291823 | -0.023370 | 0.211430 | 0.324034 |
NAA | 0.186569 | 0.208104 | 0.245636 | 0.080688 | 0.375307 |
Cr+PCr | -0.345292 | -0.395072 | -0.198341 | 0.044874 | 0.025746 |
PCr | -0.204489 | 0.009019 | 0.054307 | -0.039749 | 0.022740 |
Glc | 0.073144 | -0.274866 | 0.010358 | -0.918090 | 0.057282 |
Glu | -0.158447 | 0.169657 | 0.142106 | -0.020456 | 0.629347 |
GSH | -0.051790 | -0.059521 | -0.063733 | -0.032056 | 0.017220 |
mIns | -0.130164 | -0.219076 | 0.048181 | -0.045341 | 0.260795 |
Lac | 0.213520 | -0.517945 | 0.777880 | 0.163626 | -0.129195 |
PE | 0.088775 | 0.067588 | -0.021296 | -0.093113 | 0.246811 |
Tau | -0.675590 | 0.347693 | 0.427139 | -0.205028 | -0.274482 |
GPC+PC | -0.213405 | -0.010141 | -0.018493 | 0.085112 | -0.177403 |
MM | 0.003147 | -0.011569 | 0.038310 | -0.030734 | -0.020392 |
GABA | -0.021660 | 0.014303 | 0.123482 | -0.061108 | 0.316906 |
NAAG | -0.024488 | -0.020983 | 0.015875 | -0.043573 | -0.004801 |
PCr/Cr | -0.017543 | 0.039048 | 0.031111 | -0.018325 | -0.000205 |
The magnitude of the absolute value of the weights of the variables within each eigenvector (column) is directly proportional to the significance of the variables for each factor. As can be seen from the absolute value of the weights of Factor 1 (PC1), Tau has by far the greatest weight (0.675590), and it is, therefore, the most significant variable for Factor 1, which is the most significant of all factors as by itself it accounts for 57.51% of the original variance. That means that the equation of Factor 1 has been heavily influenced by Tau. The next most significant variables for Factor 1 are Gln, Cr+PCr, Lac, GPC+PC, PCr, etc., in descending order of significance after Tau. All 17 IVs were used in this run.
It is elucidating to observe that the order of significance of the 17 IVs according to the eigenvector of PC1 is as follows: 1) Tau, 2) Gln, 3) Cr+PCr, 4) Lac, 5) GPC+PC, 6) PCr, 7) NAA, 8) Glu, 9) Cr, 10) mIns, 11) PE, 12) Glc, 13) GSH, 14) NAAG, 15) GABA, 16) PCr/Cr, and 17) MM. This order of significance of the 17 IVs is markedly different from that yielded by ROC curve analysis (Table 2). Besides the problem with the ranking of Tau, Lac, which has an AUC = 0.5219 (Table 2), which in essence means that the diagnostic (discriminating) power of Lac is at the chance level (AUC = 0.50), is ranked by standard PCA (covariance matrix) in the top four most significant metabolites. On the other hand, Cr, which has an AUC = 0.9883, which is considered excellent (> 0.95), is ranked by standard PCA as number 9 (out of 17).
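The PC1 ordering quoted above follows directly from the Factor 1 column of Table 4 when the weights are sorted by absolute value. The sketch below reproduces it and sets it against the AUC ranking of Table 2; the weights and the AUC order are copied from those tables, and the variable names in the code are hypothetical.

```python
# Factor 1 (PC1) weights copied from Table 4.
pc1_weights = {
    "Cr": -0.140814, "Gln": -0.421211, "NAA": 0.186569, "Cr+PCr": -0.345292,
    "PCr": -0.204489, "Glc": 0.073144, "Glu": -0.158447, "GSH": -0.051790,
    "mIns": -0.130164, "Lac": 0.213520, "PE": 0.088775, "Tau": -0.675590,
    "GPC+PC": -0.213405, "MM": 0.003147, "GABA": -0.021660,
    "NAAG": -0.024488, "PCr/Cr": -0.017543,
}

# AUC ranking copied from Table 2 (best discriminator first).
auc_rank = ["Cr+PCr", "Gln", "Cr", "NAA", "GSH", "GPC+PC", "mIns", "PCr", "PE",
            "Tau", "NAAG", "Glc", "Glu", "PCr/Cr", "GABA", "Lac", "MM"]

# Rank variables by the absolute value of their PC1 weight.
pca_rank = sorted(pc1_weights, key=lambda name: abs(pc1_weights[name]), reverse=True)

for name in ("Tau", "Lac", "Cr", "Cr+PCr"):
    print(f"{name}: PC1 rank {pca_rank.index(name) + 1}, "
          f"AUC rank {auc_rank.index(name) + 1}")
```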
Table S2 in the Supplementary Material shows the eigenvalues of the eigenvectors of all 17 principal components (factors) of standard PCA using the covariance matrix. As can be seen, the first principal component (PC1) accounts for 57.51% of the original variance of the data; the second one (PC2) accounts for 22.29%; the third one (PC3) for 8.35%, etc. It is worth noting that the first four principal components collectively account for 91.53% of the original variance. Another observation that is worth mentioning is that since an eigenvalue represents the variance of a principal component, it can be seen that PC1 has the largest variance (9.6213) of all principal components.
Test 2: Identification of the 31 unknown mice (WT vs. R6/2) - Validation Test: We subjected the standard PCA (covariance matrix) to the second test (identification of 31 unknown mice). Based on the equation of the first principal component we derived from the standard PCA (covariance matrix) in the previous Section (3.1.1), we wrote a computer program that, following the input of the 17 metabolite concentrations of an unknown mouse, would render a differential diagnosis as to whether that unknown mouse was a WT or an R6/2 mouse. As we mentioned previously, we had 31 unknown mice (11 R6/2 and 20 WT), which were extraneous to all of the DBMs, and the status of which had been determined via genotyping. Standard PCA with the covariance matrix correctly determined the status of 27/31 unknown mice [20/20 WT mice (100% correct) and 7/11 R6/2 mice (63.64% correct), with a total accuracy of 27/31 unknown mice (87.10% correct)]. Therefore, for the second test, the standard PCA (covariance matrix) exhibited a sensitivity = 0.636 and a (1-specificity) = 0 [(+)LR = 0.636/0 → ∞ and (-)LR = 0.364]. Detailed results of all PCA runs (standard and ROC-supervised) with respect to the second test are shown in Table 5. The general results of all PCA runs, including those of this run, appear in Table 3.
Table 5.
Detailed results of all PCA runs with respect to the second test, i.e. the identification of the 31 unknown mice. Mice in rows #1-20 are WT, whereas mice in rows #21-31 are R6/2
Unknown Subject | PC1 Score, Covariance Matrix: Standard | PC1 Score, Covariance Matrix: ROC-Supervised | PC1 Score, Correlation Matrix: Standard | PC1 Score, Correlation Matrix: ROC-Supervised |
---|---|---|---|---|
1 | 0.13167 | 0.61446 | 0.40658 | 0.78912 |
2 | 0.59647 | 0.32887 | 0.62741 | 0.34816 |
3 | 0.61650 | 0.58706 | 0.67466 | 0.68506 |
4 | 0.27976 | 0.64139 | 0.48024 | 0.84950 |
5 | 0.26657 | 0.93238 | 0.56482 | 1.20830 |
6 | 0.64921 | 0.99697 | 0.72184 | 1.21220 |
7 | 0.58748 | 0.82755 | 0.68411 | 0.97334 |
8 | 0.45766 | 0.89058 | 0.67364 | 1.03990 |
9 | 0.43063 | 0.94261 | 0.71080 | 1.09190 |
10 | 0.76189 | 1.08260 | 1.08250 | 1.24480 |
11 | 0.40067 | 1.03740 | 0.74022 | 1.21150 |
12 | 0.71320 | 1.24290 | 1.08390 | 1.49750 |
13 | 0.44182 | 1.16940 | 0.77265 | 1.37920 |
14 | 0.40866 | 1.15300 | 0.79191 | 1.28990 |
15 | 0.48321 | 0.77448 | 0.69302 | 0.97603 |
16 | 0.28809 | 0.76857 | 0.55896 | 0.92809 |
17 | 0.81518 | 1.01340 | 1.02390 | 1.19510 |
18 | 0.25073 | 0.54776 | 0.59821 | 0.55057 |
19 | 0.23476 | 1.01100 | 0.62662 | 1.23000 |
20 | 0.73047 | 0.95328 | 0.97916 | 1.13380 |
21 | -0.89293 | -1.15940 | -0.78696 | -1.43120 |
22 | -0.91048 | -1.02050 | -1.14860 | -1.41420 |
23 | -1.78630 | -1.77620 | -1.76590 | -2.02740 |
24 | -1.03060 | -1.68840 | -1.29770 | -2.03200 |
25 | -1.24270 | -1.79300 | -1.43730 | -2.14990 |
26 | 0.33614 | -0.94422 | -0.04884 | -1.33740 |
27 | -1.11610 | -1.51380 | -1.50130 | -1.82760 |
28 | 0.61648 | -0.01065 | 0.36699 | -0.10788 |
29 | 0.11753 | -0.30609 | 0.03627 | -0.47616 |
30 | 0.88638 | -0.31285 | 0.41756 | -0.51304 |
31 | -0.13088 | -0.27708 | -0.27506 | -0.37472 |
Only the ROC-supervised PCA (both with the covariance and the correlation matrix) diagnosed/identified correctly all of the 31 unknown mice. All of the WT mice (rows # 1-20) have positive PC1 scores, whereas all of the R6/2 mice (rows # 21-31) have negative PC1 scores.
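The Test 2 tallies can be recomputed from the PC1 scores in Table 5 by applying the sign rule (positive for WT, negative for R6/2). The scores below are copied from Table 5; the helper name `correct_counts` is hypothetical.

```python
# PC1 scores from Table 5, columns in table order:
# (cov standard, cov ROC-supervised, corr standard, corr ROC-supervised)
wt_scores = [  # rows 1-20
    (0.13167, 0.61446, 0.40658, 0.78912), (0.59647, 0.32887, 0.62741, 0.34816),
    (0.61650, 0.58706, 0.67466, 0.68506), (0.27976, 0.64139, 0.48024, 0.84950),
    (0.26657, 0.93238, 0.56482, 1.20830), (0.64921, 0.99697, 0.72184, 1.21220),
    (0.58748, 0.82755, 0.68411, 0.97334), (0.45766, 0.89058, 0.67364, 1.03990),
    (0.43063, 0.94261, 0.71080, 1.09190), (0.76189, 1.08260, 1.08250, 1.24480),
    (0.40067, 1.03740, 0.74022, 1.21150), (0.71320, 1.24290, 1.08390, 1.49750),
    (0.44182, 1.16940, 0.77265, 1.37920), (0.40866, 1.15300, 0.79191, 1.28990),
    (0.48321, 0.77448, 0.69302, 0.97603), (0.28809, 0.76857, 0.55896, 0.92809),
    (0.81518, 1.01340, 1.02390, 1.19510), (0.25073, 0.54776, 0.59821, 0.55057),
    (0.23476, 1.01100, 0.62662, 1.23000), (0.73047, 0.95328, 0.97916, 1.13380),
]
r62_scores = [  # rows 21-31
    (-0.89293, -1.15940, -0.78696, -1.43120), (-0.91048, -1.02050, -1.14860, -1.41420),
    (-1.78630, -1.77620, -1.76590, -2.02740), (-1.03060, -1.68840, -1.29770, -2.03200),
    (-1.24270, -1.79300, -1.43730, -2.14990), (0.33614, -0.94422, -0.04884, -1.33740),
    (-1.11610, -1.51380, -1.50130, -1.82760), (0.61648, -0.01065, 0.36699, -0.10788),
    (0.11753, -0.30609, 0.03627, -0.47616), (0.88638, -0.31285, 0.41756, -0.51304),
    (-0.13088, -0.27708, -0.27506, -0.37472),
]
labels = ("cov standard", "cov ROC-supervised", "corr standard", "corr ROC-supervised")

def correct_counts(rows, expect_positive):
    """Per model, the number of rows whose PC1 score has the expected sign."""
    return [sum((row[k] > 0) == expect_positive for row in rows)
            for k in range(len(labels))]

wt_correct = correct_counts(wt_scores, expect_positive=True)
r62_correct = correct_counts(r62_scores, expect_positive=False)

for label, wt, r62 in zip(labels, wt_correct, r62_correct):
    print(f"{label}: WT {wt}/20, R6/2 {r62}/11, total {wt + r62}/31")
```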
Test 3: Identification of the 13 original R6/2 mice (8-wk old vs. 12-wk old): Next, we subjected the standard PCA (covariance matrix) to our third test. More specifically, we wanted to know whether standard PCA (covariance matrix) was sensitive enough to detect the metabolomic differences caused by the progression of Huntington disease (HD) between our two R6/2 subgroups, i.e. between the 8-wk old R6/2 and the 12-wk old R6/2 mice. Physiologically, we know that the progression of HD will effect alterations in the metabolite concentrations of the cells in the striatum area of the brain. A mathematical model, therefore, should be sensitive enough to detect those alterations in the time span of four weeks. We entered all of the 17 IVs and our 13 original R6/2 mice [(7) 8-wk old & (6) 12-wk old] in the following manner: rows #1-7 the seven 8-wk old ones and rows #8-13 the six 12-wk old ones. Standard PCA (covariance matrix) correctly identified and classified 12/13 of our original R6/2 mice into their respective two subgroups [6/7 8 wk-old R6/2 mice (85.71% correct) & 6/6 12 wk-old R6/2 mice (100% correct) → with a total accuracy of 12/13 original R6/2 mice (92.31% correct)]. In this case, the sensitivity = 1 and the (1-specificity) = 0.143 [(+)LR = 1/0.143 = 6.998 and (-)LR = 0/0.857 = 0]. As can be seen in Figure 2, there is no accurate separation of our two R6/2 groups either with respect to PC1 or PC2. The general results of all PCA runs, including those of this run, appear in Table 3.
Figure 2.
Standard PCA (covariance matrix) - Test 3. Scores of the 13 original R6/2 mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run unsupervised (all 17 IVs were used) using the covariance matrix. As can be seen, there is no accurate separation between the two groups [8 wk-old R6/2 (#1-7) and 12 wk-old R6/2 (#8-13)] either with respect to the first principal component or with respect to the second one.
ROC-supervised PCA with covariance matrix
Test 1: Identification of the original 30 mice (WT vs. R6/2): Our goal was to find the best ROC-supervised PCA setting for the 30 original mice and use that setting to develop a ROC-supervised DBM. Using the covariance matrix, we ran the PCA with the top 10 IVs (AUC > 70%), top 9 IVs (AUC > 80%), top 7 IVs (AUC ≥ 90%), and top 4 IVs (AUC > 95%) according to ROC curve analysis (Table 2). Of those runs, the last three correctly identified all of the 30 original mice [17/17 WT mice (100% correct) & 13/13 R6/2 mice (100% correct) → with a total accuracy of 30/30 original mice (100% correct)]. The run with the top 9 IVs (AUC > 80%) yielded a sum of all Q1 residuals equal to 111.82, and a mean Q1 residual value of 3.73. The corresponding values of the run with the top 7 IVs (AUC ≥ 90%) were: 75.17 and 2.51. The run with the top 4 IVs (AUC > 95%) yielded the following values respectively: 23.11 and 0.77. Clearly, the run with the top 4 IVs (AUC > 95%) was the best ROC-supervised PCA setting (covariance matrix) for the 30 original mice, and it was upon the equation of the first principal component of this setting that the ROC-supervised PCA DBM (covariance matrix) was based. As was mentioned, the ROC-supervised PCA DBM (covariance matrix) correctly identified all of our 30 original mice [17/17 WT mice (100% correct) & 13/13 R6/2 mice (100% correct) → with a total accuracy of 30/30 original mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. Figure 3 depicts those results. As can be seen, our two groups were successfully separated (correctly identified) by the first principal component: all of the WT mice have positive scores, whereas all of the R6/2 mice have negative scores. Once again, Table 3 depicts the general results of all of the PCA runs.
Figure 3.
ROC-supervised PCA (covariance matrix) - Test 1. Scores of the 30 original mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run using the covariance matrix and was supervised by the ROC curve analysis [the top four most significant IVs (Cr+PCr, Gln, Cr, and NAA) (AUC > 95%) as determined by the ROC curve analysis were used]. As can be seen, there is a separation between the two groups [WT (#1-17) & R6/2 (#18-30)] only with respect to the first principal component: all of the WT mice have positive scores, whereas all of the R6/2 mice have negative scores.
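The general idea of the procedure used in this test (screen the IVs by their AUC, run PCA on the retained IVs only, and classify by the sign of the PC1 score) can be illustrated with a short sketch. This is not the implementation used in the study: the data below are hypothetical toy values, the variable names (`var_a`, `var_b`, `noise`) and helper functions are ours, the AUC is estimated with the Mann-Whitney statistic, and PC1 of the covariance matrix is obtained by simple power iteration so that the example needs no external libraries.

```python
def auc(pos, neg):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    a = wins / (len(pos) * len(neg))
    return max(a, 1.0 - a)  # discriminating power regardless of direction

def first_pc(rows, iters=500):
    """PC1 weights and scores from the covariance matrix (rows = samples)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting vector; assumed not orthogonal to PC1
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical toy data (NOT the study's measurements): rows = mice.
names = ["var_a", "var_b", "noise"]
wt = [[5.0, 3.0, 1.0], [5.2, 3.1, 9.0], [4.9, 2.9, 2.0], [5.1, 3.2, 8.0]]
hd = [[7.0, 2.0, 1.5], [7.3, 2.2, 8.5], [6.9, 1.9, 2.5], [7.1, 2.1, 7.5]]

# Step 1: screen each variable by its AUC and keep those above the cutoff.
aucs = {name: auc([r[j] for r in hd], [r[j] for r in wt])
        for j, name in enumerate(names)}
keep = [j for j, name in enumerate(names) if aucs[name] > 0.95]

# Step 2: run PCA on the retained variables only; score each mouse on PC1.
rows = [[r[j] for j in keep] for r in wt + hd]
weights, scores = first_pc(rows)

# Step 3: classify by the sign of the PC1 score. The sign of a principal
# component is arbitrary, so only the fact that the two groups fall on
# opposite sides of zero matters.
wt_scores, hd_scores = scores[:len(wt)], scores[len(wt):]
separated = (max(wt_scores) < 0 < min(hd_scores)
             or max(hd_scores) < 0 < min(wt_scores))
print(aucs, keep, separated)
```

In this toy example the high-variance `noise` variable, which would dominate an unsupervised covariance-matrix PCA, is removed by the AUC screen before PC1 is computed. Dividing each centered column by its standard deviation before the covariance step would give the correlation-matrix variant.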
Test 2: Identification of the 31 unknown mice (WT vs. R6/2) - Validation Test: We subjected the ROC-supervised PCA DBM (covariance matrix) [using only the top 4 IVs (AUC > 98%) according to ROC curve analysis] to the second test. It correctly determined the status of all of the 31 unknown mice [20/20 WT mice (100% correct) and 11/11 R6/2 mice (100% correct), with a total accuracy of 31/31 unknown mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. Those results in detail, along with the results of all PCA runs with respect to the second test, are shown in Table 5. The general results of all PCA runs, including those of this run, appear in Table 3.
Test 3: Identification of the 13 original R6/2 mice (8-wk old vs. 12-wk old): Subjecting the best ROC-supervised PCA setting to the third test was the next task. The third test concerns itself exclusively with the R6/2 mice; more specifically, it assesses the ability of a given model to discriminate between the two R6/2 groups: the 8-wk old vs. the 12-wk old. The ROC curve analysis with which we supervised PCA in the first and second test, and the results of which appear in Table 2, was designed to assess the ability of all 17 IVs to discriminate between the WT and the R6/2 mice. Clearly, as far as the third test was concerned, we had to perform another ROC curve analysis, one that would deal exclusively with the 13 original R6/2 mice, and one that would assess all of the 17 IVs in terms of their ability to discriminate between the 8-wk old and the 12-wk old R6/2 mice. The top 5 most significant IVs (metabolites) in the discrimination between the two R6/2 groups according to their AUC value as determined by the R6/2 ROC curve analysis are: 1) TTau (AUC = 0.9752) [Tau transformed in order to meet normality criteria], 2) GPC+PC (AUC = 0.9517), 3) Glu (AUC = 0.9460), 4) Lac (AUC = 0.9446), and 5) Gln (AUC = 0.9432). The best R6/2 ROC-supervised PCA setting for the 13 original R6/2 mice both in terms of classification accuracy and residuals was the one that employed only the top two most significant IVs (AUC > 95%), i.e. TTau and GPC+PC; and it is this setting that we used for the third test. This R6/2-ROC-supervised PCA (covariance matrix) correctly identified and classified all of our original R6/2 mice into their respective two subgroups [7/7 8 wk-old R6/2 mice (100% correct) & 6/6 12 wk-old R6/2 mice (100% correct) → with a total accuracy of 13/13 original R6/2 mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. Figure 4 depicts those results. As can be seen, our two R6/2 groups were successfully separated (correctly identified) by the first principal component: all of the 8 wk-old R6/2 mice have positive scores, whereas all of the 12 wk-old R6/2 mice have negative scores. Those results are also shown, along with the general results of all PCA runs, in Table 3.
Figure 4.
R6/2-ROC-supervised PCA (covariance matrix) - Test 3. Scores of the 13 original R6/2 mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run using the covariance matrix and was supervised by the R6/2-ROC curve analysis [the top two most significant IVs (TTau and GPC+PC) (AUC > 95%) as determined by the R6/2-ROC curve analysis were used]. As can be seen, there is a separation between the two groups [8 wk-old R6/2 (#1-7) and 12 wk-old R6/2 (#8-13)] only with respect to the first principal component: all of the 8 wk-old R6/2 mice have positive scores, whereas all of the 12 wk-old R6/2 mice have negative scores.
Standard PCA with correlation matrix
Test 1: Identification of the original 30 mice (WT vs. R6/2): We next ran standard PCA (all 17 IVs) with the correlation matrix. As can be seen from Table S6 in the Supplementary Material, this PCA run was more successful than the standard PCA with the covariance matrix. More specifically, the first principal component (PC1) correctly identified all of our 30 original mice: all WT mice have positive PC1 scores, whereas all R6/2 mice have negative PC1 scores [17/17 WT mice (100% correct) & 13/13 R6/2 mice (100% correct) → with a total accuracy of 30/30 original mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. None of the remaining principal components (PC2 - PC17) correctly identified the 30 original mice. Figure 5 illustrates the scores of our 30 original mice with respect to the first and second principal components (PC1 and PC2) of the standard PCA with the correlation matrix. One can see from Figure 5 that there is a separation of the two groups (WT & R6/2) only with respect to PC1. Table S7 in the Supplementary Material shows the corresponding eigenvectors of the first five principal components; and as can be seen from there, Gln has the greatest weight (0.3577), and it is, therefore, the most significant variable for Factor 1. Observing the absolute value of the weights of the variables, one can see that, in a descending order of significance, the most significant variables for Factor 1 are: 1) Gln, 2) GPC+PC, 3) Cr+PCr, 4) PCr, 5) NAA, 6) Tau, 7) GSH, 8) mIns, 9) PE, 10) Cr, 11) Glu, 12) NAAG, 13) PCr/Cr, 14) Lac, 15) GABA, 16) Glc, and 17) MM. With respect to how closely the order of significance of the variables agrees with that of ROC curve analysis, this constitutes a considerable improvement of the standard PCA with the correlation matrix over the standard PCA with the covariance matrix.
Focusing on the top 10 most important IVs, one can see that they are the same as those identified by ROC curve analysis (Table 2). That is, however, the whole extent of the commonality between the two methods. The order of significance of the top ten IVs according to the standard PCA (correlation matrix) is markedly different from that of ROC curve analysis. The most notable differences in that order are the following: 1) Cr+PCr, which has a perfect AUC (1.0000), is placed third and not first; 2) Cr, which has an AUC = 0.9883, almost the same as Gln, is placed tenth and not third; 3) PCr, which has an AUC = 0.8702, is placed fourth (instead of eighth) and ahead of NAA, which has an AUC = 0.9820; 4) Tau, which has the lowest AUC of all ten metabolites (AUC = 0.7289) is placed sixth (instead of tenth) and ahead of NAA.
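The size of those discrepancies can be made explicit by comparing, for each metabolite, its rank on PC1 with its rank according to ROC curve analysis. The sketch below uses only the ranks stated in the text above; the ROC ranks of the remaining metabolites are given in Table 2 and are deliberately not guessed here. The variable names are ours.

```python
# Order of the top ten IVs on PC1 of the standard PCA (correlation matrix),
# as listed in the text.
pca_order = ["Gln", "GPC+PC", "Cr+PCr", "PCr", "NAA",
             "Tau", "GSH", "mIns", "PE", "Cr"]

# ROC ranks stated explicitly in the text (first, third, eighth, tenth).
roc_rank = {"Cr+PCr": 1, "Cr": 3, "PCr": 8, "Tau": 10}

pca_rank = {name: i + 1 for i, name in enumerate(pca_order)}
# Positive shift: the metabolite is ranked lower by PCA than by ROC analysis.
shift = {name: pca_rank[name] - roc_rank[name] for name in roc_rank}

for name, delta in sorted(shift.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: ROC rank {roc_rank[name]}, "
          f"PCA rank {pca_rank[name]}, shift {delta:+d}")
```

The largest displacement is that of Cr, which drops seven places, while PCr and Tau are each promoted by four places.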
Figure 5.
Standard PCA (correlation matrix) - Test 1. Scores of the 30 original mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run unsupervised (all 17 IVs were used) using the correlation matrix. As can be seen, there is a separation between the two groups [WT (#1-17) & R6/2 (#18-30)] only with respect to the first principal component: all of the WT mice have positive scores, whereas all of the R6/2 mice have negative scores.
Test 2: Identification of the 31 unknown mice (WT vs. R6/2) - Validation Test: We subjected the standard PCA (all 17 IVs) with the correlation matrix to the second and most stringent test, namely the identification of the 31 unknown mice. It correctly identified 28/31 unknown mice [20/20 WT mice (100% correct) and 8/11 R6/2 mice (72.73% correct), with a total accuracy of 28/31 unknown mice (90.32% correct)] [sensitivity = 0.727; (1-specificity) = 0; (+)LR = 0.727/0 → ∞; (-)LR = 0.273/1 = 0.273]. Those results in detail, along with the results of all PCA runs with respect to the second test, are shown in Table 5. The general results of all PCA runs, including those of this run, appear in Table 3.
Test 3: Identification of the 13 original R6/2 mice (8-wk old vs. 12-wk old): Standard PCA (correlation matrix) correctly identified and classified all of our original 13 R6/2 mice into their respective two subgroups [7/7 8 wk-old R6/2 mice (100% correct) & 6/6 12 wk-old R6/2 mice (100% correct) → with a total accuracy of 13/13 original R6/2 mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. As can be seen from Figure 6, there is a separation of the two R6/2 groups with respect to PC1: all of the 8 wk-old R6/2 mice have positive scores, whereas all of the 12 wk-old R6/2 mice have negative scores.
Figure 6.
Standard PCA (correlation matrix) - Test 3. Scores of the 13 original R6/2 mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run unsupervised (all 17 IVs were used) using the correlation matrix. As can be seen, there is a separation between the two groups [8 wk-old R6/2 (#1-7) and 12 wk-old R6/2 (#8-13)] only with respect to the first principal component: all of the 8 wk-old R6/2 mice have positive scores, whereas all of the 12 wk-old R6/2 mice have negative scores.
The results of this run, along with the general results of all PCA runs, appear in Table 3.
ROC-supervised PCA with correlation matrix
Test 1: Identification of the original 30 mice (WT vs. R6/2): Just as we did in the case of the ROC-supervised PCA (covariance matrix), we ran the ROC-supervised PCA (correlation matrix) with the top 10 IVs (AUC > 70%), top 9 IVs (AUC > 80%), top 7 IVs (AUC ≥ 90%), and top 4 IVs (AUC > 95%) according to ROC curve analysis (Table 2). All four of those runs correctly identified all of the 30 original mice [17/17 WT mice (100% correct) & 13/13 R6/2 mice (100% correct) → with a total accuracy of 30/30 original mice (100% correct)]. According to the residuals, the run with the top 10 IVs (AUC > 70%) yielded a sum of all Q1 residuals equal to 101.42, and a mean Q1 residual value of 3.38. The respective values of the run with the top 9 IVs (AUC > 80%) were: 85.17 and 2.84. The respective values of the run with the top 7 IVs (AUC ≥ 90%) were: 54.72 and 1.82; and those of the run with the top 4 IVs (AUC > 95%) were: 18.02 and 0.60 respectively. Clearly, here, too, the run with the top 4 IVs (AUC > 95%) was the best ROC-supervised PCA setting (correlation matrix) for the 30 original mice, and it was upon the equation of the first principal component of this setting that the ROC-supervised PCA DBM (correlation matrix) was based. As was mentioned, the ROC-supervised PCA DBM (correlation matrix) correctly identified all of our 30 original mice [17/17 WT mice (100% correct) & 13/13 R6/2 mice (100% correct) → with a total accuracy of 30/30 original mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. Figure 7 illustrates those results. As can be seen, our two groups (WT & R6/2) were successfully separated by the first principal component: all of the WT mice have positive scores, whereas all of the R6/2 mice have negative scores. The general results of all PCA runs, including those of this run, appear in Table 3.
Figure 7.
ROC-supervised PCA (correlation matrix) - Test 1. Scores of the 30 original mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run using the correlation matrix and was supervised by the ROC curve analysis [the top four most significant IVs (Cr+PCr, Gln, Cr, and NAA) (AUC > 95%) as determined by the ROC curve analysis were used]. As can be seen, there is a separation between the two groups [WT (#1-17) & R6/2 (#18-30)] only with respect to the first principal component: all of the WT mice have positive scores, whereas all of the R6/2 mice have negative scores.
Since both the standard PCA (correlation matrix) and the ROC-supervised PCA (correlation matrix) passed the first test, i.e. correctly identified all of the 30 original mice, we compared their respective residuals. In the case of the former, the sum of all Q1 residuals was 291.11 and the mean Q1 residual value was 9.70. In the case of the latter [top 4 IVs (AUC > 95%)], the respective values, as already reported above, were: 18.02 and 0.60. Evidently, there is a vast difference in classification performance, as reflected in the residuals, between the standard PCA (correlation matrix) and the ROC-supervised PCA (correlation matrix), even though both passed the first test.
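For readers unfamiliar with this kind of residual, the sketch below shows one common way such a quantity can be computed. It assumes that a Q1 residual is the squared distance between a (centered) sample and its reconstruction from PC1 alone; that reading, the toy data, and the function names are our assumptions and are not taken from the software used in the study. For brevity the example is restricted to two variables, for which the PC1 direction of the covariance matrix has a closed form.

```python
import math

def pc1_2d(points):
    """Unit PC1 direction (covariance matrix) for 2-D data, plus centered data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the major axis
    return (math.cos(theta), math.sin(theta)), centered

def q1_residuals(points):
    """Squared reconstruction error of each sample when only PC1 is kept."""
    (vx, vy), centered = pc1_2d(points)
    residuals = []
    for x, y in centered:
        t = x * vx + y * vy  # PC1 score of this sample
        residuals.append((x - t * vx) ** 2 + (y - t * vy) ** 2)
    return residuals

# Hypothetical toy data, not the study's measurements.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
q1 = q1_residuals(data)
print("sum of Q1:", sum(q1), "mean Q1:", sum(q1) / len(q1))
```

Under this reading, a small sum (or mean) of Q1 residuals indicates that PC1 alone reproduces the retained variables closely, which is the sense in which the runs are compared in the text.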
Test 2: Identification of the 31 unknown mice (WT vs. R6/2) - Validation Test: We subjected the ROC-supervised PCA (correlation matrix) [using only the top 4 IVs (AUC > 98%) according to ROC curve analysis] to the second test. It correctly determined the status of all of the 31 unknown mice [20/20 WT mice (100% correct) and 11/11 R6/2 mice (100% correct), with a total accuracy of 31/31 unknown mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+) LR = 1/0 → ∞; (-)LR = 0/1 = 0]. Those results in detail, along with the results of all PCA runs with respect to the second test, are shown in Table 5. The general results of all PCA runs, including those of this run, appear in Table 3.
Test 3: Identification of the 13 original R6/2 mice (8-wk old vs. 12-wk old): Just as was the case with the R6/2-ROC-supervised PCA (covariance matrix), the best R6/2-ROC-supervised PCA (correlation matrix) setting for the 13 original R6/2 mice both in terms of classification accuracy and residuals was the one that employed only the top two most significant IVs (AUC > 95%), i.e. TTau and GPC+PC; and it is this setting that we used for the third test. This R6/2-ROC-supervised PCA (correlation matrix) correctly identified and classified all of the 13 original R6/2 mice into their respective two subgroups [7/7 8 wk-old R6/2 mice (100% correct) & 6/6 12 wk-old R6/2 mice (100% correct) → with a total accuracy of 13/13 original R6/2 mice (100% correct)] [sensitivity = 1; (1-specificity) = 0; (+)LR = 1/0 → ∞; (-)LR = 0/1 = 0]. Figure 8 depicts those results. As can be seen, our two R6/2 groups were successfully separated (correctly identified) by the first principal component: all of the 8 wk-old R6/2 mice have positive scores, whereas all of the 12 wk-old R6/2 mice have negative scores. These results are also shown, along with the general results of all PCA runs, in Table 3. The numerical results of all PCA runs not presented here can be found in the Supplementary Material.
Figure 8.
R6/2-ROC-supervised PCA (correlation matrix) - Test 3. Scores of the 13 original R6/2 mice according to the first principal component (PC1 Score) plotted against the scores of the same mice according to the second principal component (PC2 Score). The PCA was run using the correlation matrix and was supervised by the R6/2-ROC curve analysis [the top two most significant IVs (TTau and GPC+PC) (AUC > 95%) as determined by the R6/2-ROC curve analysis were used]. As can be seen, there is a separation between the two groups [8 wk-old R6/2 (#1-7) and 12 wk-old R6/2 (#8-13)] only with respect to the first principal component: all of the 8 wk-old R6/2 mice have positive scores, whereas all of the 12 wk-old R6/2 mice have negative scores.
Since both the standard PCA (correlation matrix) and the R6/2-ROC-supervised PCA (correlation matrix) passed the third test, i.e. correctly identified all of the 13 original R6/2 mice, we compared their respective residuals. In the case of the former, the sum of all Q1 residuals was 118.43 and the mean Q1 residual value was 9.11. In the case of the latter, the respective values were: 1.44 and 0.11. Therefore, in the case of the third test, as well, there was a vast difference in classification and predictive performance between the standard PCA (correlation matrix) and the R6/2-ROC-supervised PCA (correlation matrix), even though both passed the third test.
ROC-supervised PCA with covariance matrix vs. ROC-supervised PCA with correlation matrix
Finally, given that both ROC-supervised PCAs (covariance and correlation matrix) passed all three tests with 100% accuracy, we wanted to know if there were any differences in the classification and predictive performance of those two methods.
In connection with the first test, the ROC-supervised PCA (covariance matrix) yielded the following: Sum of all Q1 residuals = 23.11 and Mean value of all Q1 residuals = 0.77. The ROC-supervised PCA (correlation matrix) yielded respectively: 18.02 and 0.60. This suggests that, all things being equal, the ROC-supervised PCA with the correlation matrix performs better than the ROC-supervised PCA with the covariance matrix in terms of classification and predictive capabilities.
In connection with the third test, the R6/2-ROC-supervised PCA (covariance matrix) yielded the following: Sum of all Q1 residuals = 11.27 and Mean value of all Q1 residuals = 0.87. The R6/2-ROC-supervised PCA (correlation matrix) yielded respectively: 1.44 and 0.11. In the case of the third test, also, the ROC-supervised PCA with the correlation matrix turned out to be more robust than the ROC-supervised PCA with the covariance matrix in terms of classification capabilities.
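The mean values quoted above are simply the reported sums divided by the number of mice in each test (30 in the first test, 13 in the third). The short check below reproduces that arithmetic from the figures given in the text and also expresses the covariance-to-correlation comparison as a ratio; the variable names are ours.

```python
# (sum of Q1 residuals, number of mice, reported mean), as given in the text.
reported = {
    ("Test 1", "covariance"): (23.11, 30, 0.77),
    ("Test 1", "correlation"): (18.02, 30, 0.60),
    ("Test 3", "covariance"): (11.27, 13, 0.87),
    ("Test 3", "correlation"): (1.44, 13, 0.11),
}

for (test, matrix), (total, n, mean) in reported.items():
    print(f"{test}, {matrix}: {total} / {n} = {total / n:.2f} (reported {mean:.2f})")

# Ratio of the covariance-matrix sum to the correlation-matrix sum per test.
ratios = {
    test: reported[(test, "covariance")][0] / reported[(test, "correlation")][0]
    for test in ("Test 1", "Test 3")
}
print({test: round(r, 2) for test, r in ratios.items()})
```

The ratio is modest in the first test and considerably larger in the third, which is in line with the direction of the comparison drawn in the text.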
That the ROC-supervised PCA with the correlation matrix has better classification capability than the ROC-supervised PCA with the covariance matrix is further supported by the following theoretical observations. In the case of the ROC-supervised PCA with the covariance matrix, according to the absolute value of the weights of the variables within the eigenvector of the first principal component (PC1), the rank of significance of the 4 IVs, including the absolute value of their respective weights, is: 1) Gln [0.6277], 2) Cr+PCr [0.5971], 3) Cr [0.3763], and 4) NAA [0.3284]. According to the ROC curve analysis, the rank of significance of those 4 IVs according to their respective AUC value is: 1) Cr+PCr, 2) Gln, 3) Cr, and 4) NAA (Table 2). As we mentioned earlier, and as can also be seen from Table 2, Cr+PCr is the only perfect biomarker (AUC = 1.0000), and it is upon it that PC1 should be based. Similarly, in the case of the ROC-supervised PCA with the correlation matrix, the rank of significance of the 4 IVs, including the absolute value of their respective weights, is: 1) Cr+PCr [0.5303], 2) Gln [0.4993], 3) NAA [0.4852], 4) Cr [0.4838]. This shows that the PC1 of the ROC-supervised PCA with the correlation matrix was based predominantly on the Cr+PCr variable, which has a perfect discriminating power (AUC = 1.0000). That further indicates that the ROC-supervised PCA with the correlation matrix has a better classification and predictive capability than the ROC-supervised PCA with the covariance matrix.
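The comparison above can be reproduced from the reported weights alone. The sketch below ranks the four IVs by the absolute value of their PC1 weights under each matrix, checks which IV carries the largest weight, and verifies that each reported eigenvector has (to rounding) unit length; the variable names are ours.

```python
import math

roc_order = ["Cr+PCr", "Gln", "Cr", "NAA"]  # by AUC, as stated in the text
pc1_weights = {
    "covariance": {"Gln": 0.6277, "Cr+PCr": 0.5971, "Cr": 0.3763, "NAA": 0.3284},
    "correlation": {"Cr+PCr": 0.5303, "Gln": 0.4993, "NAA": 0.4852, "Cr": 0.4838},
}

top, norms = {}, {}
for matrix, weights in pc1_weights.items():
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    top[matrix] = ranked[0]
    norms[matrix] = math.sqrt(sum(w * w for w in weights.values()))
    print(f"{matrix}: order {ranked}, norm {norms[matrix]:.4f}, "
          f"top IV matches ROC: {ranked[0] == roc_order[0]}")
```

Only with the correlation matrix does the IV with the largest PC1 weight coincide with the IV that ROC curve analysis ranks first (Cr+PCr).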
Discussion
The results of our study (Table 3) demonstrate that our ROC-supervised PCA may be employed for the diagnosis of diseases. More specifically, both ROC-supervised PCA with the covariance matrix and ROC-supervised PCA with the correlation matrix passed all three stringent tests with 100% accuracy, exhibiting, thus, high diagnostic accuracy, and providing evidence that they may be used for diagnostic purposes.
The fact that both of those methods yielded results that were 100% accurate notwithstanding, as was demonstrated in the previous section, the ROC-supervised PCA with the correlation matrix exhibited a better classification and predictive capability than the ROC-supervised PCA with the covariance matrix.
Standard PCA, on the other hand, be it with the covariance or the correlation matrix, did not pass all of our three tests (Table 3), and that provides evidence against its employment in diagnostic applications. More specifically, standard PCA (covariance matrix) failed all three of our tests, thus proving itself unsuitable for diagnostic applications; whereas standard PCA (correlation matrix) passed the first and the third test but failed the second test (the validation test, i.e. the most difficult of the three tests), thus demonstrating that it lacks the high degree of accuracy required for the diagnosis of diseases. The primary objective of the standard PCA algorithm is to reduce the dimensionality of the data by seeking to maximize the amount of the original variance (information) in the direction of the variable(s) with the largest variance. Unfortunately, the largest variance, the largest amount of information, is not always synonymous with the most significant information. Commenting on this issue, Mather [7] pointed out that “A major problem in PCA is the distinction between important and unimportant dimensions of variability.” The results of our study clearly support this contention. Therefore, employing standard PCA for diagnostic or other purposes requiring a high degree of accuracy may not constitute a wise choice. Today, PCA has found its way into the mainstream of biomedical research. Owing to significant advances in technology, such as the ability to gather information about large numbers of metabolites, genes, or proteins, vast amounts of data can be generated. Confronted by such a plethora of data, researchers have little choice but to resort to data analysis/mining methods, such as PCA. For many reasons, including ease of use, PCA, in one form or another, has become popular. Many researchers routinely entrust their data to PCA and predicate their study conclusions on the results yielded by it [15-19].
As we have shown, our ROC-supervised PCA, especially with the correlation matrix, possesses the high degree of classification and predictive accuracy that is prerequisite in the diagnosis of diseases. We should also point out here that insofar as accuracy and performance are concerned, according to the results of our previous studies, our ROC-supervised PCA provides a competitive alternative to other more complex multivariate methods [14], as well as other data analysis methods [20]. There is a limitation, however, that underlies its applicability. Owing to the fact that the outcome of the dependent variable in ROC curve analysis is dichotomous (only two outcomes are possible, i.e. WT or R6/2 in our case), and since our PCA is supervised by ROC curve analysis, it follows that our ROC-supervised PCA can be applied only to those diseases wherein there are only two groups (or two classifications). This, however, in actuality, may not be as restrictive as it sounds for the following two reasons. As it turns out, in most of the disease states, researchers are, at least initially, interested in differences between the state of a given disease and the normal state. Secondly, if in a given disease a researcher is indeed interested in three groups, let us say, normal, pre-symptomatic, and pathological, then three different ROC curve analyses (one between normal and pre-symptomatic, one between pre-symptomatic and pathological, and one between normal and pathological) may be performed and used with our ROC-supervised PCA. However, if the number of groups (or classifications) is greater than three, then this approach may be unrealistic.
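The growth in the number of required analyses can be made concrete with a small enumeration. The sketch below lists the pairwise ROC curve analyses for the three-group example given above and counts how many would be needed for larger numbers of groups, k(k-1)/2 for k groups; the group labels are those used in the text, and the rest is illustrative.

```python
from itertools import combinations
from math import comb

groups = ["normal", "pre-symptomatic", "pathological"]
pairs = list(combinations(groups, 2))
for a, b in pairs:
    print(f"ROC curve analysis: {a} vs. {b}")

# Number of pairwise ROC curve analyses needed for k groups: k(k-1)/2.
counts = {k: comb(k, 2) for k in range(2, 7)}
for k, n_pairs in counts.items():
    print(f"{k} groups -> {n_pairs} pairwise analyses")
```

Three groups require three analyses, whereas four groups already require six and five groups ten, each with its own variable screening and its own PCA run.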
Furthermore, we should point out that owing to the fact that the spectrum of diseases and disorders is very wide and variegated, the degree of accuracy will vary in accordance with the specific conditions of the particular disease and with the desired type of diagnostic model. For instance, if one is interested in colorectal cancer (CRC), and if, furthermore, one is interested in the differential diagnosis between normal subjects and patients with stage II CRC because the majority of the CRC patients when first diagnosed present with stage II, then the degree of accuracy of ROC-supervised PCA will be higher since the contrast between the normal and the specific diseased state is relatively large. In the case of those diseases where the patient population is not as homogeneous in terms of severity, extent of impairment, progression, symptomatology, etc., finding a suitable reference point for a diagnostic model will undoubtedly be more challenging, and the performance of ROC-supervised PCA in that case will be dependent on the careful selection of the variables, as well as on larger sample sizes.
We should also point out here that although the idea of seeking to improve the standard PCA is not new, our PCA method is, as far as we know, novel and fundamentally different from those proposed by others. For example, Bair et al. [22] proposed a supervised PCA method based on standard regression and applied to survival analysis. Nguyen and Rocke [23] and Li and Gui [24] proposed various PLS (partial least squares) methods also in connection with survival predictions. Our ROC-supervised PCA was developed specifically for diagnostic purposes, and it is predicated on the screening and selection of data variables according to their discriminating accuracy between the target and the reference group. Moreover, unlike in the case of other proposed supervised PCA methods, our ROC-supervised PCA considers only the first principal component, which, as we have shown above, captures most of the variance of the original variables and has a higher potential in terms of classification accuracy than any other principal component.
In conclusion, in the present study, we assessed the performance and brought to light the weaknesses of standard PCA in connection with classification accuracy and diagnostics; we introduced our ROC-supervised PCA that was developed to address specifically the weaknesses of standard PCA in that area; we assessed the classification and predictive accuracy of our ROC-supervised PCA and compared it to that of the standard PCA; and we provided evidence that supports the use of our ROC-supervised PCA for diagnostic purposes.
Acknowledgments
We would like to thank C. Dirk Keene and Ivan Tkac for helping us with the acquisition of spectra and Janet M. Dubinsky for providing us with the spectral data of 20 unknown mice. This study was funded by the National Institutes of Health (NIH) - Grant numbers: T32 DA007097 and RO3 NS060059.
Supplementary material
References
- 1. Pearson K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901;2:559–572.
- 2. Hotelling H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology. 1933;24:417–441.
- 3. Dunteman GH. Principal Components Analysis. Newbury Park, CA: Sage University Paper series on Quantitative Applications in the Social Sciences, No 07-069; 1989.
- 4. Jackson JE. A User's Guide to Principal Components. New York, NY: John Wiley & Sons; 1991.
- 5. Jolliffe IT. Principal Component Analysis. New York, NY: Springer-Verlag; 2002.
- 6. McArdle JJ. Principles versus Principals of Structural Factor Analysis. Multivariate Behavioral Research. 1990;25:81–87. doi: 10.1207/s15327906mbr2501_10.
- 7. Mather PM. Computational Methods of Multivariate Analysis in Physical Geography. London: John Wiley & Sons; 1976.
- 8. Costello AB, Osborne JW. Best Practices in Exploratory Factor Analysis: Four Recommendations for Getting the Most From Your Analysis. Practical Assessment Research & Evaluation. 2005;10:7.
- 9. Velicer WF, Jackson DN. Component Analysis versus Common Factor Analysis: Some Issues in Selecting an Appropriate Procedure. Multivariate Behavioral Research. 1990;25:1–28. doi: 10.1207/s15327906mbr2501_1.
- 10. Mangiarini L, Sathasivam K, Seller M, Cozens B, Harper A, Hetherington LM, Trottier Y, Lehrach H, Davies SW, Bates GP. Exon 1 of the HD gene with an expanded CAG repeat is sufficient to cause a progressive neurological phenotype in transgenic mice. Cell. 1996;87:493–506. doi: 10.1016/s0092-8674(00)81369-0.
- 11. Tkac I, Henry PG, Andersen P, Keene CD, Low WC, Gruetter R. Highly resolved in vivo 1H NMR spectroscopy of the mouse brain at 9.4 T. Magn Reson Med. 2004;52:478–484. doi: 10.1002/mrm.20184.
- 12. Browne SE, Beal MF. The Energetics of Huntington's Disease. Neurochemical Research. 2004;29:531–546. doi: 10.1023/b:nere.0000014824.04728.dd.
- 13. Tkac I, Dubinsky JM, Keene CD, Gruetter R, Low WC. Neurochemical changes in Huntington R6/2 mouse striatum detected by in vivo 1H NMR spectroscopy. Journal of Neurochemistry. 2007;100:1397–406. doi: 10.1111/j.1471-4159.2006.04323.x.
- 14. Nikas JB, Keene CD, Low WC. Comparison of Analytical Mathematical Approaches for Identifying Key Nuclear Magnetic Resonance Spectroscopy Biomarkers in the Diagnosis and Assessment of Clinical Change of Diseases. Journal of Comparative Neurology. 2010;518:4091–4112. doi: 10.1002/cne.22365.
- 15. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. doi: 10.1038/ng1847.
- 16. Gelson W, Hoare M, Unitt E, Palmer C, Gibbs P, Coleman N, Davies S, Alexander GJM. Heterogeneous Inflammatory Changes in Liver Graft Recipients With Normal Biochemistry. Transplantation. 2010;89:739–748. doi: 10.1097/TP.0b013e3181c96b32.
- 17. Hillegass JM, Shukla A, Macpherson MB, Bond JP, Steele C, Mossman BT. Utilization of gene profiling and proteomics to determine mineral pathogenicity in a human mesothelial cell line (LP9/TERT-1). J Toxicol Environ Health A. 2010;73:423–436. doi: 10.1080/15287390903486568.
- 18. Rohrbeck A, Borlak J. Cancer Genomics Identifies Regulatory Gene Networks Associated with the Transition from Dysplasia to Advanced Lung Adenocarcinomas Induced by c-Raf-1. PLoS ONE. 2009;4(10):e7315. doi: 10.1371/journal.pone.0007315.
- 19. Massad LS, Evans CT, Wilson TE, Goderre JL, Hessol NA, Henry D, Colie C, Strickler HD, Levine AM, Watts DH, Weber KM. Knowledge of cervical cancer prevention and human papillomavirus among women with HIV. Gynecologic Oncology. 2010;117:70–76. doi: 10.1016/j.ygyno.2009.12.030.
- 20. Nikas JB, Low WC. Clustering Analyses for the Diagnosis of Huntington and Other Diseases (Abstracts for the 17th Annual Meeting of the American Society for Neural Therapy and Repair). Cell Transplantation. 2010;19:355.
- 21. Hintze JL. NCSS 2007 Manual. Kaysville, Utah: NCSS; 2007.
- 22. Bair E, Hastie T, Paul D, Tibshirani R. Prediction by Supervised Principal Components. Journal of the American Statistical Association. 2006;101:119–137.
- 23. Nguyen D, Rocke D. Partial Least Squares Proportional Hazard Regression for Application to DNA Microarrays. Bioinformatics. 2002;18:1625–1632. doi: 10.1093/bioinformatics/18.12.1625.
- 24. Li H, Gui J. Partial Cox Regression Analysis for High-Dimensional Microarray Gene Expression Data. Bioinformatics. 2004;20(Suppl 1):i208–i215. doi: 10.1093/bioinformatics/bth900.