Abstract
Integrative analysis of multiple data types can take advantage of their complementary information and therefore may provide higher power to identify potential biomarkers that would be missed using individual data analysis. However, because diverse data modalities differ in nature, data integration is challenging. Here we address the data integration problem by developing a generalized sparse model (GSM) that uses weighting factors to integrate multi-modality data for biomarker selection. As an example, we applied the GSM to a joint analysis of two types of schizophrenia data sets: 759075 SNPs and 153594 functional magnetic resonance imaging (fMRI) voxels in 208 subjects (92 cases/116 controls). To solve this small-sample-large-variable problem, we developed a novel sparse representation based variable selection (SRVS) algorithm, with the primary aim of identifying biomarkers associated with schizophrenia. To validate the effectiveness of the selected variables, we performed multivariate classification followed by ten-fold cross validation. We compared our proposed SRVS algorithm with an earlier sparse model based variable selection algorithm for integrated analysis. In addition, we compared with traditional statistical methods for univariate data analysis (Chi-squared test for SNP data and ANOVA for fMRI data). Results showed that our proposed SRVS method can identify novel biomarkers with stronger capability in distinguishing schizophrenia patients from healthy controls. Moreover, better classification ratios were achieved using biomarkers from both types of data, suggesting the importance of integrative analysis.
Keywords: Sparse representations, SNP, fMRI, variable selection, schizophrenia
I. INTRODUCTION
Schizophrenia has been hypothesized to arise from a number of genetic factors and environmental effects. To date, many studies have investigated the role of critical genes or single nucleotide polymorphisms (SNPs) associated with schizophrenia. Many genes of great significance have been identified as potential causal genetic markers for schizophrenia, such as G72/G30 on chromosome 13q, DISC1, GRIK3, EFNA5, AKAP5 and CACNG2 (Badner and Gershon, 2002; Callicott et al., 2005; Sutrala et al., 2007). Besides genetic studies, functional magnetic resonance imaging (fMRI) is another widely used tool for the study of schizophrenia, as it can identify both structural and functional abnormalities in the brain regions of schizophrenia patients (Meda et al., 2008; Szycik et al., 2009). Therefore, the identification of biomarkers from a joint analysis of fMRI and SNP data is of tremendous importance for disease diagnosis and treatment (Liu et al., 2009; Lin et al., 2011).
In this paper we propose a generalized sparse model (GSM) to integrate multi-modality data (e.g., SNP and fMRI data) for biomarker selection. Sparse representation, particularly compressive sensing, has received great attention in recent years (Gribonval and Nielsen, 2003; Tropp et al., 2003; Donoho and Elad, 2003; Kidron et al., 2007; Tang et al., 2012; Cao et al., 2012a; Cao et al., 2012b). For example, Kidron et al. used sparsity-based cross-modal localization to find sound-related regions in video (Kidron et al., 2007). We recently developed sparse representation based classification algorithms for sub-typing of leukemia from gene expression data (Tang et al., 2012), for chromosome image segmentation (Cao et al., 2012a) and for integrative analysis of gene copy number variation and gene expression data (Cao et al., 2012b).
The GSM can be solved by many existing algorithms, such as the Homotopy method (Donoho and Tsaig, 2008), the orthogonal matching pursuit (OMP) algorithm (Davis et al., 1997; Tropp, 2004; Cai and Wang, 2011), the single best replacement (SBR) algorithm (Soussen et al., 2011), and the FOCUSS method (Cotter et al., 2005). However, in compressive sensing theory, exact recovery of an s-sparse signal typically requires a large number of samples (Davenport et al., 2011). Here an s-sparse signal refers to a vector having at most s nonzero entries; the entries with large amplitudes correspond to the variables to be selected (Cai and Wang, 2011). When the number of variables n greatly exceeds the number of samples m, exact signal recovery becomes difficult (Hsu et al., 2009; Davenport et al., 2011). One of the most commonly used conditions for exact signal recovery is the restricted isometry property (RIP) (Davenport et al., 2011). However, whether a measurement matrix satisfies the RIP is hard to verify in practice. An alternative is to use the coherence of the matrix X (Donoho, 2004; Candes and Tao, 2006), which is required to be small. Moreover, even when a matrix X satisfies the signal recovery condition, the number of signals to be recovered or variables to be selected using these traditional sparse representation methods will generally be equal to or less than the number of samples (Li et al., 2009). To address this problem, Li et al. proposed a sparse representation based variable selection method, aiming to achieve a sparse solution for the GSM when the sample number is large (e.g., larger than the number of variables to be selected; Li et al., 2009).
In this work, as in many other practical cases, the number of samples (92 cases/116 controls) is far smaller than the number of variables (i.e., 759075 SNPs and 153594 fMRI voxels). As a consequence, the small coherence condition on the data matrix is hard to satisfy (Hsu et al., 2009), and directly applying existing compressive sensing methods may fail (Blankertz et al., 2011; Parra et al., 2005; Zien et al., 2009). To overcome the difficulty caused by this small-m-large-n problem, we propose a novel sparse representation based variable selection (SRVS) algorithm, which can select significant variables regardless of the coherence condition of the measurement matrix. Moreover, the proposed SRVS algorithm is proven to have a multi-resolution property, selecting variables at different significance levels. Instead of solving the GSM directly, the SRVS algorithm solves sub-matrix based Lp norm minimization problems and assembles them into a sparse solution of the GSM. In a preliminary work (Cao et al., 2012c), we studied the orthogonal matching pursuit (OMP) based SRVS algorithm. Our preliminary results showed that, even with a small number of samples, SRVS is capable of identifying a number of biomarkers for schizophrenia, leading to improved identification accuracy.
Here, we extend that work by applying the proposed SRVS algorithm to the GSM with a more general penalization term (Lp norm, 0 ≤ p ≤ 1), aiming to identify more effective joint biomarkers for schizophrenia. Specifically, we tested and compared three models with p = 0, 0.5 and 1. For the Lp (0 ≤ p ≤ 1) based model, we proved that the proposed SRVS method can identify significant variables at different significance levels and recover signals with high probability regardless of the coherence of the measurement matrix. We also showed the convergence and effectiveness of the proposed SRVS algorithm. We then applied SRVS to the GSM integrating 759075 SNPs and 153594 fMRI voxels in 208 subjects (92 cases and 116 controls) to identify biomarkers for schizophrenia. To test the predictive power of the selected biomarkers, we used the selected variables to distinguish schizophrenia patients from healthy controls with a 10-fold cross-validation. We evaluated the three models with different penalization terms (i.e., Lp norm with p = 0, 0.5 and 1) and compared them with the biomarker selection approach proposed by Li et al. (Li et al., 2009) for integrated analysis. In addition, we compared our method with traditional statistical methods for uni-type data analysis (i.e., Chi-squared test for SNP data and ANOVA for fMRI data).
II. Materials and Methods
The proposed variable selection approach includes three steps, as shown in Fig. 1: 1.) Data combination: the GSM is used to combine the two types of data. 2.) Variable selection: the SRVS algorithm is used to solve the sparse linear system in the GSM. 3.) Validation of the selected variables: a multivariate classification approach is employed to test the effectiveness of the selected variables, and cross validation is used to select the optimal parameters of the GSM.
Fig. 1.
The flowchart of our proposed variable selection for integrative analysis of two types of data
2.1 A sparse model for data combination
The sparse representation of a signal can be modeled as
y = Xδ + ε    (1)
where y ∈ Rm×1 is the observation vector; X ∈ Rm×n represents the measurement matrix; and ε ∈ Rm×1 is the measurement error or noise. The goal of sparse representation is to recover the unknown sparse vector δ ∈ Rn×1 from y and X, and the non-zero entries of δ correspond to selected variables/columns in X.
To represent multi-modality data (e.g., SNP data and fMRI data from the same group of subjects), we propose the generalized sparse model (GSM) in Eq. (2).
y = [α1X1, α2X2] [δ1; δ2] + ε = Xδ + ε    (2)
where y ∈ Rm×1 is the observation vector (phenotypes of the subjects; e.g., 1 for disease case, 0 for healthy control); X1 ∈ Rm×n1 and X2 ∈ Rm×n2 are the measurements of the two data types (e.g., numerical SNP values (0, 1, 2) and fMRI voxel values) for m samples, with n1 (or n2) features per sample; each column is normalized to have unit L2 norm; X = [α1X1, α2X2] ∈ Rm×n; α1 + α2 = 1 and α1, α2 > 0 are the weight factors for the two types of data; and ε ∈ Rm×1 is the measurement error. The problem of variable selection then becomes identifying the unknown sparse vector δ = [δ1; δ2] ∈ Rn×1 from y and X, where δ1 ∈ Rn1×1, δ2 ∈ Rn2×1 and n = n1 + n2. The optimal weighting factors α1 and α2 can be determined by cross validation, i.e., as the weighting factors that generate the best classification ratio (CR).
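To make the construction of X concrete, the following minimal Python sketch (illustrative only; build_gsm_matrix is not part of the released toolbox) normalizes each column of the two modality matrices to unit L2 norm and stacks them with the weight factors α1 and α2 = 1 − α1:

```python
import numpy as np

def build_gsm_matrix(X1, X2, alpha1):
    """Assemble the GSM measurement matrix X = [alpha1*X1, alpha2*X2]
    after normalizing every column to unit L2 norm (Eq. (2))."""
    alpha2 = 1.0 - alpha1

    def unit_columns(X):
        norms = np.linalg.norm(X, axis=0)
        norms[norms == 0] = 1.0            # guard against all-zero columns
        return X / norms

    return np.hstack([alpha1 * unit_columns(X1), alpha2 * unit_columns(X2)])

# Toy example: y holds phenotypes (1 = case, 0 = control),
# X1 holds SNP codes (0/1/2) and X2 holds fMRI voxel values.
rng = np.random.default_rng(0)
X1 = rng.integers(0, 3, size=(208, 50)).astype(float)
X2 = rng.standard_normal((208, 30))
X = build_gsm_matrix(X1, X2, alpha1=0.5)   # shape (208, 80)
```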
2.2 Variable selection with SRVS
By integrating multi-modality data, the GSM model given by Eq. (2) offers the potential to detect more significant and reliable biomarkers (Rhodes and Chinnaiyan, 2005; Liu et al., 2009; Cao et al., 2012d). However, in genomic and bio-imaging data analysis (e.g. SNP and fMRI), usually n ≫ m and the linear system defined by Eq. (2) is underdetermined, and the solution of the system is not unique. To overcome the problem, a sparse constraint is usually imposed on the model. An example is given in Eq. (3) by using the L0 norm based penalty (Cai and Wang, 2011; Tropp, 2004; Davis et al., 1997; Soussen et al., 2011), which measures the number of nonzero elements.
min ||δ||0  subject to  ||y − Xδ||2 ≤ ε    (3)
Unfortunately, this penalty results in a combinatorial problem that is NP-hard. Thus, the L1 norm penalty (Donoho and Tsaig, 2008) is often used instead:
min ||δ||1  subject to  ||y − Xδ||2 ≤ ε    (4)
Detailed discussions of the differences between the L0 and L1 norm penalties can be found in (Sharon et al., 2007) and (Donoho and Tsaig, 2006). In recent years, the Lp norm penalty (0 < p < 1) has also been studied (Cotter et al., 2005; Xu et al., 2012), as it can lead to an even sparser solution. The model is formulated as
min ||δ||p  subject to  ||y − Xδ||2 ≤ ε,  0 < p < 1    (5)
Several algorithms have been proposed to solve this problem (Cotter et al., 2005; Foucart and Lai, 2009; Wang et al., 2011).
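As a concrete illustration of one such solver, the sketch below implements a regularized FOCUSS-style reweighting iteration for the Lp problem in Eq. (5) (a simplified reading of Cotter et al., 2005; the regularization lam, iteration count and stopping tolerance are illustrative choices, not values from this paper):

```python
import numpy as np

def focuss_lp(y, X, p=0.5, lam=1e-3, n_iter=50, tol=1e-6):
    """Regularized FOCUSS-style iteration for min ||delta||_p s.t. y ~ X delta."""
    m, n = X.shape
    # start from the regularized minimum-norm solution
    delta = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(m), y)
    for _ in range(n_iter):
        w = np.abs(delta) ** (1.0 - p / 2.0)           # reweighting from previous iterate
        Xw = X * w                                      # same as X @ diag(w)
        new = w * (Xw.T @ np.linalg.solve(Xw @ Xw.T + lam * np.eye(m), y))
        if np.linalg.norm(new - delta) < tol:
            delta = new
            break
        delta = new
    return delta                                        # small entries shrink toward zero
```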
Nevertheless, when applying model (5) to signal recovery/variable selection, the measurement matrix X ∈ Rm×n is usually required to satisfy the RIP condition for exact signal recovery (Davenport et al., 2011; Donoho, 2004; Candes and Tao, 2006 and Hsu et al., 2009). The RIP is defined as follows.
A matrix X ∈ Rm×n is said to satisfy the restricted isometry property (RIP) of order s if there exists a τs ∈ (0,1) such that
(1 − τs)||δ||2² ≤ ||Xδ||2² ≤ (1 + τs)||δ||2²    (6)
holds for all s-sparse vectors δ ∈ Rn.
When a matrix X ∈ Rm×n satisfies the RIP of order s, an s-sparse vector δ ∈ Rn can be recovered from the m samples. A necessary condition on m and n for X ∈ Rm×n to satisfy the RIP is given by the following theorem (Davenport et al., 2011).
Theorem
If X ∈ Rm×n satisfies the RIP of order 2s with constant τ2s ∈ (0, 1/2], then
m ≥ C·s·log(n/s)    (7)
where C = 1/(2 log(√24 + 1)) ≈ 0.28 (Davenport et al., 2011). Thus, for a data set with m samples, exact recovery of an s-sparse vector δ ∈ Rn requires
n ≤ s·e^(m/(Cs))    (8)
In genomic or medical imaging data analysis, the number of biomarkers to be detected is generally at least as large as the number of samples m (Li et al., 2004; Li et al., 2009), while traditional sparse methods recover at most m nonzero entries; taking s = m as the limiting case, Eq. (8) simplifies to
n ≤ m·e^(1/C) ≈ 35m    (9)
Eq. (9) suggests that, for a given sample size m, the number of columns n of the measurement matrix X ∈ Rm×n should be less than about 35m for the sparse solution to be recoverable.
However, in practice this condition cannot be satisfied, because the number of features (SNPs/fMRI voxels) is often far larger than the number of samples. In our case, with m = 208 samples, Eq. (9) would require n < 35 × 208 ≈ 7300, far below the more than 900,000 SNPs and fMRI voxels considered here.
Eq. (9) describes a necessary condition for a matrix X ∈ Rm×n to satisfy the signal recovery requirement. Since it is difficult to verify whether a matrix X ∈ Rm×n satisfies the RIP, the coherence of X is used instead (Donoho, 2004). The coherence μ(X) is defined as:
μ(X) = max over 1 ≤ i < j ≤ n of |⟨xi, xj⟩| / (||xi||2 · ||xj||2)    (10)
The coherence given by Eq. (10) always lies in the range √((n − m)/(m(n − 1))) ≤ μ(X) ≤ 1, where the lower bound is known as the Welch bound (Davenport et al., 2011). Some reconstruction algorithms require a strong condition of bounded coherence (Donoho, 2004; Hsu et al., 2009):
μ(X) < 1/(2s − 1)    (11)
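As a quick numerical illustration (toy sizes, not the real data), the snippet below computes μ(X) for a random small-m-large-n matrix and compares it with the Welch bound; the empirical coherence sits far above the bound, so small-coherence recovery conditions are hard to meet in this regime:

```python
import numpy as np

def coherence(X):
    """Mutual coherence: largest |inner product| between distinct unit-norm columns (Eq. (10))."""
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)                      # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)
    return G.max()

m, n = 208, 2000                               # toy small-m-large-n setting
X = np.random.default_rng(0).standard_normal((m, n))
welch_bound = np.sqrt((n - m) / (m * (n - 1)))
print(coherence(X), welch_bound)               # coherence is several times the Welch bound
```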
However, this small-coherence condition is hard to satisfy in our problem. In fact, when n ≫ m, some columns of the small-m-large-n measurement matrix will inevitably have large coherence (Candes and Tao, 2006). In our study, n = 759075 + 153594 (SNPs + fMRI voxels) and m = 208 (92 cases/116 controls). Therefore, we propose the SRVS algorithm to find an approximate solution of the GSM given by Eq. (2). The SRVS algorithm is described below, and its properties are presented in Appendix A.
SRVS Algorithm (http://hongbaocao.weebly.com/software-for-download.html)
Step 1. Initialize δ(0) = 0 and l = 1;
Step 2. Randomly choose k columns from X = {x1, …, xn} ∈ Rm×n to construct an m × k sub-matrix Xl ∈ Rm×k, and denote the index vector of the selected columns as Il;
Step 3. Given the sub-matrix Xl, solve the following Lp minimization problem to obtain the optimal sparse solution δl ∈ Rk×1:
δl = arg min ||δ||p  subject to  ||y − Xlδ||2 ≤ ε    (12)
Step 4. Update δ(l) ∈ Rn×1 with δl: δ(l)(Il) = δ(l−1)(Il) + δl, where δ(l)(Il) and δ(l−1)(Il) denote the Il-th entries of δ(l) and δ(l−1), respectively; the remaining entries of δ(l) keep their values from δ(l−1);
Step 5. If the stopping rule is not satisfied, set l = l + 1 and go to Step 2. Otherwise, set δ = δ(l)/l and stop. The non-zero entries of δ correspond to the selected column vectors, i.e., the selected variables.
In Step 2, one way to achieve random selection of k columns from X is to shuffle the columns with the Fisher–Yates algorithm (Fisher and Yates, 1948) and then use a window of length k to select variables randomly (Cao et al., 2012c). Note that in each iteration a different sub-matrix Xl is randomly drawn (there are (n choose k) possible combinations in total), so the procedure is not a simple split of X into several subsets.
In Step 3, there are many well-established methods for solving the Lp minimization problem, such as the Homotopy algorithm (Donoho and Tsaig, 2008) for p = 1, the orthogonal matching pursuit (OMP) algorithm (Cai and Wang, 2011; Tropp, 2004; Davis et al., 1997) and the single best replacement (SBR) algorithm (Soussen et al., 2011) for p = 0, and the FOCUSS method (Cotter et al., 2005) for 0 ≤ p ≤ 1.
In Step 5, we use the following two stop rules: 1. ||δ(l)/l − δ(l−1)/(l−1)||2 < α, where α is a predefined threshold; 2. the probability that each column in X has been evaluated at least once should be greater than 1 − pstop. The algorithm terminates when both rules are satisfied, which determines the total number of iterations. In this work, we set α = 0.01 and pstop = 1e−4. With these stop rules, the total number of iterations was around 200 for the simulated data with n = 1e6 features and around 300 for the real data sets (759075 SNPs and 153594 fMRI voxels) tested in this work. The effect of the stop rules on the number of iterations is evaluated in Sec. B of Appendix A, where the convergence of the algorithm is also proved.
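The reference implementation is the authors' Matlab toolbox linked above; purely for illustration, the following Python sketch reproduces the SRVS iteration for the p = 0 case with OMP as the inner solver (scikit-learn's OrthogonalMatchingPursuit), keeping only the first stop rule and using illustrative parameter values:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def srvs_omp(y, X, k, n_nonzero=10, alpha=0.01, max_iter=300, seed=0):
    """Sketch of SRVS: repeatedly solve a sparse fit on a random k-column
    sub-matrix of X and average the accumulated coefficients."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    delta_sum = np.zeros(n)                       # running sum of sub-problem solutions
    prev_avg = np.zeros(n)
    for l in range(1, max_iter + 1):
        idx = rng.choice(n, size=k, replace=False)            # Step 2: random sub-matrix X^l
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
        omp.fit(X[:, idx], y)                                  # Step 3: L0-type inner solve (Eq. (12))
        delta_sum[idx] += omp.coef_                            # Step 4: update on selected indices
        avg = delta_sum / l
        if l > 1 and np.linalg.norm(avg - prev_avg) < alpha:   # Step 5: stop rule 1 only
            break
        prev_avg = avg
    return delta_sum / l             # non-zero entries rank the candidate variables
```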
We present the discussion and proofs of the properties of the proposed SRVS algorithm in Appendix A, including: 1.) independence from the coherence condition of the data matrix X; 2.) convergence and effectiveness of SRVS; 3.) the multi-resolution property of SRVS; and 4.) sparsity control using ε. The Matlab based software toolbox for the proposed SRVS algorithm is available online: http://hongbaocao.weebly.com/software-for-download.html.
2.3 Validation of selected variables
To test the predictive power of the selected biomarkers (SNPs/fMRI voxels), we performed a multivariate classification followed by ten-fold cross-validation to distinguish schizophrenia patients from healthy controls. Results from four models were compared: the SRVS algorithm with different Lp norm penalties (p = 0, 0.5, 1) and Li et al.'s method (Li et al., 2009).
Furthermore, we compared several classifiers, including the sparse representation based classifier (SRC), the fuzzy c-means (FCM) classifier and a support vector machine (SVM) based classifier; the SRC gave the best performance (see Appendix B, Fig. B1). The SRC has been proven effective for many tasks such as face recognition (Wright et al., 2009), speech recognition (Gemmeke et al., 2011), signal classification for brain computer interfaces (Shin et al., 2012) and image classification (Cao et al., 2012a). The SRC algorithm is described as follows.
Sparse Representation-based Classification (SRC) algorithm
Inputs: a matrix of training samples A = [A1, A2, …, Ac] ∈ Rn×s for c classes; and a test sample st ∈ Rn.
Normalize the columns of A to have unit L2-norm;
Solve the L1 norm minimization problem: x̂ = arg min||x||1, subject to Ax = st;
Calculate the residuals ri(st) = ||st − Aδi(x̂)||2 for i = 1, …, c;
ClassID (st) = arg mini ri(st)
The inputs of the SRC algorithm are: 1. st ∈ Rn, the feature vector of subject t; 2. A ∈ Rn×s, the feature vectors from c = 2 groups in a total of s samples/subjects. δi(·) is an Rs → Rs transformation that keeps only the coefficients associated with the i-th class. The output is the ClassID of subject t.
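For illustration, a compact Python sketch of SRC is shown below; the L1 step uses scikit-learn's Lasso as a practical stand-in for the equality-constrained L1 minimization in the listing, and l1_penalty is an illustrative setting rather than a value from the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A, labels, s_t, l1_penalty=0.01):
    """SRC sketch: sparsely represent the test sample over the training
    columns, then assign the class with the smallest reconstruction residual.
    labels[j] gives the class of the j-th column of A."""
    A = A / np.linalg.norm(A, axis=0)                # unit L2-norm columns
    s_t = s_t / np.linalg.norm(s_t)
    # Lasso approximates min ||x||_1 s.t. A x = s_t from the listing above.
    x_hat = Lasso(alpha=l1_penalty, fit_intercept=False, max_iter=10000).fit(A, s_t).coef_
    residuals = {}
    for c in np.unique(labels):
        x_c = np.where(labels == c, x_hat, 0.0)      # delta_i(x_hat): keep class-c coefficients
        residuals[c] = np.linalg.norm(s_t - A @ x_c)
    return min(residuals, key=residuals.get)         # ClassID with the smallest residual
```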
In each run of the 10-fold cross-validation, 90 percent of the subjects from both cases and controls were randomly selected for variable/biomarker selection, while the rest were used for testing. For each method, we carried out 100 runs and the average classification ratio was used as the final identification accuracy.
We also used cross-validation to determine the optimal weighting factors in Eq. (2). Different pairs of weighting factors lead to different selected variable groups and hence different classification ratios. Using the cross validation described above, we therefore selected the weighting factors that led to the highest classification ratio.
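A minimal sketch of this selection loop is given below; select_and_classify is a placeholder callable standing for the whole GSM + SRVS + SRC pipeline on one train/test split, and the stratified 10-fold splitter is one reasonable way to realize the cross validation described above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def choose_alpha1(y, X1, X2, alphas, select_and_classify, n_splits=10, seed=0):
    """Return the SNP weight alpha1 whose selected biomarkers give the
    best mean classification ratio under 10-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mean_cr = []
    for a1 in alphas:
        fold_cr = [select_and_classify(y, X1, X2, a1, train, test)
                   for train, test in cv.split(X1, y)]
        mean_cr.append(np.mean(fold_cr))
    best = int(np.argmax(mean_cr))
    return alphas[best], mean_cr

# Example grid matching Sec. 3.2: alpha1 from 0.3 to 0.6 in steps of 0.02
# best_alpha1, scores = choose_alpha1(y, X1, X2, np.arange(0.30, 0.61, 0.02), select_and_classify)
```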
To test the effectiveness of integrative analysis, we compared our method with two traditional statistical methods for uni-type data analysis (i.e. Chi-squared test for SNP data and ANOVA for fMRI data). We provide the top 200 selected SNPs and fMRI voxels and the classification ratios in Appendix C and Appendix D, respectively.
III. Results
This section is organized as follows. We first describe the data in Sec. 3.1. In Sec. 3.2 we present the variables (SNPs/fMRI voxels) selected using the GSM with different weighting factors. In Sec. 3.3 we compare the variables selected by the different models (SRVS with three different penalties and Li et al.'s method). Finally, in Sec. 3.4, we provide the cross validation results for the selection of the weighting factors.
3.1 Data Collection
In this study, participant recruitment and data collection were conducted by the Mind Clinical Imaging Consortium (MCIC). Two types of data (SNP and fMRI) were collected from 208 subjects, including 96 schizophrenia patients (age: 34 ± 1, 22 females) and 112 healthy controls (age: 32 ± 1, 44 females). All participants provided written informed consent. Healthy participants were free of any medical, neurological or psychiatric illness and had no history of substance abuse. Patients met the criteria for DSM-IV-TR schizophrenia on the basis of a clinical interview for DSM-IV-TR disorders (Pascual-Leone et al., 2002; Kumari et al., 2012) or the comprehensive assessment of symptoms and history (Onitsuka et al., 2004; Meier et al., 2008). Antipsychotic history was collected as part of the psychiatric assessment.
3.1.1 fMRI Data Collection and Preprocessing
The fMRI data were collected during a sensorimotor task, a block-design motor response to auditory stimulation. During the on-block, 200 msec tones were presented with a 500 msec stimulus onset asynchrony (SOA). A total of 16 different tones were presented in each on-block, with frequencies ranging from 236 Hz to 1318 Hz. The fMRI images were acquired on Siemens 3T Trio scanners and a 1.5T Sonata with echo-planar imaging (EPI) sequences using the following parameters: TR = 2000 msec, TE = 30 msec (3.0T)/40 msec (1.5T), field of view = 22 cm, slice thickness = 4 mm, 1 mm skip, 27 slices, acquisition matrix = 64 × 64, flip angle = 90°. Four scanners were used, with roughly equal numbers of patients and controls at each site. Data were pre-processed in SPM5 (http://www.fil.ion.ucl.ac.uk/spm): images were realigned, spatially normalized and re-sliced to 3×3×3 mm3, smoothed with a 10×10×10 mm3 Gaussian kernel to reduce spatial noise, and analyzed by multiple regression with the stimulus and its temporal derivative plus an intercept term as regressors. Finally, the stimulus-on versus stimulus-off contrast images were extracted with 53 × 63 × 46 voxels, and all voxels with missing measurements were excluded.
3.1.2 SNP Data
A blood sample was obtained from each participant and DNA was extracted. Genotyping for all participants was performed at the Mind Research Network using the Illumina Infinium HumanOmni1-Quad assay covering 1,140,419 SNP loci. BeadStudio was used to make the final genotype calls. The PLINK software package (http://pngu.mgh.harvard.edu/~purcell/plink) was then used to perform a series of standard quality control procedures, resulting in a final dataset spanning 759075 SNP loci. Each SNP was categorized into one of three genotype classes and represented with a discrete number: 0 for 'BB' (no minor allele), 1 for 'AB' (one minor allele) and 2 for 'AA' (two minor alleles).
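For clarity, the genotype-to-number coding described above can be written as a one-line mapping (toy example; array contents are illustrative):

```python
import numpy as np

coding = {'BB': 0, 'AB': 1, 'AA': 2}                 # minor-allele count per genotype
genotypes = np.array([['AB', 'BB', 'AA'],
                      ['AA', 'AB', 'BB']])           # toy calls: 2 subjects x 3 SNPs
numeric = np.vectorize(coding.get)(genotypes)        # -> [[1 0 2], [2 1 0]]
```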
3.2 Variable Selection with Generalized Sparse Model
Based on the generalized sparse model (Eq. (2)), we applied the proposed SRVS algorithm to select biomarkers for schizophrenia from the combination of the two data sets (SNP data and fMRI data), where the weight factors α1 and α2 (α1 + α2 = 1) reflect the levels of contribution of the SNP and fMRI data sets, respectively. When α1 = 1 or α2 = 1, variable selection is performed on only one type of data. We varied α1 from 0.3 to 0.6 with a step length of 0.02, giving a total of 16 trials, and we set k = 0.05n. To focus on the most important biomarkers, we selected 200 biomarkers in each trial using the proposed SRVS method under three models with different Lp norms (p = 0, 0.5, 1). We also compared with Li et al.'s method (Li et al., 2009). Fig. 2 plots the numbers of SNPs and fMRI voxels selected against the weight factor α1 for these four models. As shown in Fig. 2, the weight factor has similar effects on the variables selected by the four models. Interestingly, even though the number of SNPs was much larger than that of fMRI voxels (759075 vs. 153594), similar numbers of variables were selected from both data sets when α1 was around 0.4 to 0.5 (0.38 for SRVS with the L1/2 norm, 0.46 for SRVS with the L0 norm, 0.47 for SRVS with the L1 norm, and 0.47 for Li et al.'s method).
Fig. 2.
Variable selection with generalized sparse model using different models, where the number of selected fMRI voxels is in red color and the number of selected SNPs is in blue color. The ‘Weight factor’ in the plots refers to the weight factor α1 (for SNP data set), and the weight factor α2 = 1 − α1 (for fMRI data set).
In addition, Fig. 2 shows that when α1 took a small value, only a few SNPs were selected. Those SNPs can be viewed as the most important biomarkers, since they were still selected from the combined data even though the SNP data carried a small weight. Likewise, when α1 took a large value (α2 was small), only a few fMRI voxels were selected; for the same reason, these voxels should be the most important ones. To further understand the relationships between the groups of variables selected in each trial, we analyzed the newly selected variables as the corresponding weight factor decreases, as shown in Fig. 3.
Fig. 3.
The newly selected variables in each trial with the decrease of the corresponding weight factor. The ‘Weight factor’ in the plots refers to the weight factor α1, and the weight factor α2 = 1 − α1.
In Fig. 3, the newly selected variables shown for each trial have no overlap with the variables from any other trial. When the weight factors have large values (up to 0.6 for the SNP data set and 0.7 for the fMRI data set), the selected groups are relatively large and the variables come mostly from one type of data. These are the variables that can be identified when using one type of data alone. As the weight factor decreases, fewer new variables are detected. However, these variables should not be viewed as less significant than those selected with larger weight factors, since they were selected in competition with variables from both types of data despite their smaller weights.
3.3 Comparison of the Variables Selected Using Different Methods
We further compared the variables (SNPs/fMRI voxels) selected by the different methods: SRVS with L0, L1/2 and L1 penalties and Li et al.'s method, as shown in Table 1. With 16 trials and 200 variables selected per trial, a total of 3200 selections were made. However, as shown in Fig. 3, only a few new variables were selected in each trial, resulting in smaller numbers of final selected variables (807, 888, 1092 and 1939 for the four models, respectively). The overlaps among the variable groups selected by the SRVS models with different Lp norm penalties are around 50% (458, 447 and 514, as shown in Table 1 and Fig. 4), and 349 variables were selected by all three SRVS models. In contrast, only a small percentage (<10%) of the variables selected by Li et al.'s method overlapped with those of the SRVS models (67, 87 and 79, respectively). A total of 48 variables were selected by all four models. The first 50 SNPs and the corresponding genes identified by the four methods are listed in Table A.1.
Table 1.
Comparison of the numbers of variables (SNPs/fMRI voxels) selected by the four models: SRVS with L0, L1/2 and L1 penalties and Li et al.'s method. Diagonal entries give the total number of variables selected by each method; off-diagonal entries give pairwise overlaps; the last column gives the number of variables common to all three SRVS models.
| | SRVS (L1/2) | SRVS (L0) | SRVS (L1) | Li et al.'s method | Three SRVS methods |
|---|---|---|---|---|---|
| SRVS (L1/2) | 807 | 458 | 447 | 67 | 349 |
| SRVS (L0) | / | 888 | 514 | 87 | |
| SRVS (L1) | / | / | 1092 | 79 | |
| Li et al.'s method | / | / | / | 1939 | / |
| All four methods | 48 | | | | |
Fig. 4.
Comparison of the selected variables (SNPs/fMRI voxels) using a Venn diagram. A, B and C are the variables selected using SRVS with L1/2, L0 and L1 norm penalties, respectively.
We also compared the selected genes with the top-ranked 45 schizophrenia genes reported at http://www.szgene.org/default.asp (see Table A.2). Selecting 200 variables in each trial, we identified 4 to 5 of the reported genes with the proposed SRVS method under the L0 and L1 norm penalties and with Li et al.'s method, and 6 reported genes with the SRVS method under the L1/2 norm, as shown in Table 2. The genes/SNPs identified by each model were different. Notably, although the OPCML gene was identified by all four models, the SNPs from which the gene was identified differed between models. If more variables are selected in each run, corresponding to a larger s-sparsity, more of the reported genes can be identified. This was shown in our previous work, in which we selected around 800 variables in each of the 16 trials (α1 from 0.3 to 0.6; step length = 0.02) and identified 20 reported genes (e.g., PRSS16, NOTCH4, PDE4B, TCF4) (Cao et al., 2012c).
Table 2.
Comparison with the top 45 reported schizophrenia genes (http://www.szgene.org/default.asp)
| SRVS (L0) Genes | SRVS (L0) SNPs | SRVS (L1/2) Genes | SRVS (L1/2) SNPs | SRVS (L1) Genes | SRVS (L1) SNPs | Li's method Genes | Li's method SNPs |
|---|---|---|---|---|---|---|---|
| PDE4B | rs10846559 | DRD2 | rs10800893 | HIST1H2BJ | rs11220916 | PRSS16 | rs13399561 |
| NRG1 | rs12097254 | NRG1 | rs16956192 | DRD2 | rs16828456 | DAOA | rs16869700 |
| PLXNA2 | rs4811326 | RGS4 | rs1293448 | NRG1 | rs10846559 | RPP21 | rs1836942 |
| OPCML | rs3026883 | PPP3CC | rs6637088 | PLXNA2 | rs4632116 | NRG1 | rs10833482 |
| / | / | PLXNA2 | rs4072729 | OPCML | rs11807403 | OPCML | rs1745939 |
| / | / | OPCML | rs11772714 | / | / | / | / |
We also compared the fMRI voxels selected by the proposed SRVS method with those selected by Li et al.'s method (Li et al., 2009), as shown in Fig. 5. It is evident that the voxels selected by the SRVS method tended to cluster together in specific regions such as the temporal lobe, lateral frontal lobe, occipital lobe and motor cortex, which are schizophrenia-related brain regions (Pascual-Leone et al., 2002; Kumari et al., 2012; Onitsuka et al., 2004). In contrast, the voxels selected by Li et al.'s method tended to form small regions scattered over the whole brain. This may be because voxels within the same brain region cannot be detected simultaneously by their method.
The above comparisons show large differences among the biomarkers selected by the four models. To compare and test the effectiveness of these different groups of biomarkers, we used them to classify schizophrenia patients versus normal controls; the results are provided in Sec. 3.4.
3.4 Cross validation for the selection of weighting factor
We used the SRC as the classifier, with ten-fold cross validation, to test the predictive power of the variables/biomarkers selected by the four models (SRVS with the L1/2 penalty, SRVS with the L0 penalty, SRVS with the L1 penalty, and Li et al.'s method). We also used cross-validation to select the best weighting factors for the GSM (i.e., the weighting factors corresponding to the highest classification accuracy). Fig. 6(a) shows the ten-fold cross validation results for the 16 trials given in Fig. 2. As seen in Fig. 6(b), the proposed SRVS method under all three Lp norm penalties provides much higher classification ratios than Li et al.'s method (Li et al., 2009) (p-value < 1e−8). In addition, the SRVS method with the L1/2 norm penalty gives the highest classification ratio among the four tested models. This is consistent with the finding that L1/2 norm based sparse models provide the best data fitting and modeling performance among the Lp norms with p ∈ (0, 1], as demonstrated in Xu et al.'s work (Xu et al., 2012).
Fig. 6.
A comparison of classification results of using four sparse models. (a) gives the classification ratio of differentiating schizophrenia from healthy controls using four models with different weight factors; (b) is the box plot generated with ANOVA analysis of the classification ratios using four different models.
The cross validation results showed that, for the four models tested, the best weighting factors were: 1.) SRVS with the L1/2 penalty, α1 = 0.58, CR = 89.7%; 2.) SRVS with the L0 penalty, α1 = 0.48, CR = 81.5%; 3.) SRVS with the L1 penalty, α1 = 0.38, CR = 82.1%; and 4.) Li et al.'s method, α1 = 0.46, CR = 62.1%. Note that the best CRs and the corresponding optimal weighting factors were achieved with biomarkers selected from both types of data.
Compared with the traditional statistical methods for uni-type data analysis (Chi-squared test for SNP data and ANOVA for fMRI data), using the top (200~1000) selected SNPs alone reached an identification accuracy of (83.11 ± 1.32)%, while using the top (200~1000) fMRI voxels alone gave an accuracy of (63.13 ± 0.74)%. Please refer to Appendix C and Appendix D for the selected top features and the classification results.
IV. Discussion and Conclusion
This work aimed at biomarker identification using multi-modality data, i.e., SNP and fMRI data. To achieve this goal, we proposed a generalized sparse model (GSM) solved by a novel SRVS algorithm. The selected biomarkers were then tested by applying them to the classification of schizophrenia patients with ten-fold cross validation.
The GSM given in Eq. (2) uses multi-modality data for integrative analysis to detect biomarkers that cannot be identified using one type of data alone. The two weighting factors in the GSM represent the levels of contribution of the different data types, and their best values can be determined by cross-validation. As shown in Sec. 3.4, the best weighting factors for all four models lie in the range [0.38, 0.58]; combining both data sets at these values leads to the highest classification ratios. This demonstrates the advantage of integrating multi-modality data for the diagnosis of schizophrenia. In addition, compared to uni-type data analysis, the proposed SRVS method with the L1/2 norm led to significantly higher identification accuracy (p-value < 0.001) than using one type of biomarker alone (i.e., only SNP or only fMRI data for classification). This further demonstrates the advantage of using multiple data modalities.
The ten-fold validation results showed that features selected by the proposed SRVS algorithm gave higher classification accuracy than those selected by Li et al.'s method (see Fig. 6(b), p-value < 1e−8), indicating its effectiveness when the number of samples is much smaller than the number of variables. The comparison suggests that, although Li's method is valid for data with large sample sizes, the proposed SRVS is more suitable for data with small sample sizes.
For small-m-large-n data sets, traditional sparse models may fail. By randomly sampling the columns of the original measurement matrix into smaller sub-matrices, the proposed SRVS method can overcome this difficulty. We proved that a significant variable from the original data set will be selected with high probability, regardless of the coherence conditions (Appendix A, Sec. A). Moreover, we showed the convergence and effectiveness of the proposed SRVS method in Appendix A, Sec. B. As shown in Fig. 5, voxels from the same clusters were identified simultaneously; such variables may not be recovered by traditional sparse models (e.g., the method used by Li et al. (Li et al., 2009)).
One advantage of the proposed SRVS algorithm is its multi-resolution property. We show in Appendix A, Sec. C that the variables selected using a larger window length k are a subset of those selected with a smaller k. When k = n, the model reduces to the traditional sparse model given by Eq. (5), and no more than m biomarkers can be identified (Li et al., 2009; Cai and Wang, 2011). For two different parameters k1 > k2, a larger group of variables will be selected using k2, and it includes the variables selected using k1. Thus, as long as n > k > m (e.g., k = 0.05n), the same top (n/k) × m variables will always be selected. The relationship between the window length k and the number of variables to be selected is discussed in Appendix A, Sec. D. In addition, the selected variables can be ranked in order of significance (i.e., by the amplitudes of the corresponding entries of the solution δ; Cai and Wang, 2011; see Appendix A, Fig. A3). The residual ε can then be used to determine how many variables should be selected (i.e., sparsity control; see Appendix A, Sec. E).
Our multivariate classification results show that the variables selected by the proposed SRVS algorithm, especially with the L1/2 norm based penalty, generated the highest classification accuracy in discriminating schizophrenia patients from healthy controls. This suggests that the L1/2 norm may be the best choice of penalization term for the proposed SRVS method. However, the multivariate classification is not an independent validation step, but rather a way to assess how predictive the variables selected by the SRVS approach are. Moreover, the SRVS method proposed in this work is data driven and does not directly interpret the physiological meaning of the selected variables; as suggested by Haufe et al., signals detected using a general linear model may also reflect noise (Haufe et al., 2014). Therefore, variables that have not been reported by previous studies need further validation on independent data sets to establish their physiological significance.
In summary, we presented an effective multi-modality data integration method for biomarker selection. Using this method, we are able to integrate different types of data with a large number of variables but a small number of samples. However, due to the limited sample size, further biological experiments are needed to validate the biomarkers identified in this paper.
Supplementary Material
Figure 5.
A comparison of the selected fMRI voxels between SRVS (L1/2) and Li et al.'s method (Li et al., 2009). The value of a voxel represents the frequency with which it was selected across the 16 trials.
Acknowledgments
This work has been partially supported by both NIH and NSF. Drs. Cao and Shugart are supported by the intramural Program of NIMH, National Institutes of Health.
References
- Badner JA, Gershon ES. Meta-analysis of whole-genome linkage scans of bipolar disorder and schizophrenia. Mol Psychiatry. 2002;7(4):405–411. doi: 10.1038/sj.mp.4001012.
- Cai TT, Wang L. Orthogonal matching pursuit for sparse signal recovery. IEEE Trans Inf Theory. 2011;57(7):1–26.
- Callicott JH, Straub RE, Pezawas L, Egan MF, Mattay VS, Hariri AR, Verchinski BA, Meyer-Lindenberg A, Balkissoon R, Kolachana B, Goldberg TE, Weinberger DR. Variation in DISC1 affects hippocampal structure and function and increases risk for schizophrenia. Proc Natl Acad Sci U S A. 2005;102(24):8627–8632. doi: 10.1073/pnas.0500515102.
- Candès E, Tao T. Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans Inf Theory. 2006;52(12):5406–5425.
- Cao H, Deng H, Li M, Wang YP. Classification of multicolor fluorescence in-situ hybridization (M-FISH) images with sparse representation. IEEE Trans Nanobioscience. 2012a;11(2):111–118. doi: 10.1109/TNB.2012.2189414.
- Cao H, Duan J, Lin D, Wang YP. Sparse representation based clustering for integrated analysis of gene copy number variation and gene expression data. IJCA. 2012b;19(2):131.
- Cao H, Duan J, Lin D, Calhoun V, Wang YP. Biomarker identification for diagnosis of schizophrenia with integrated analysis of fMRI and SNPs. In: Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on; Oct. 4–7, 2012c; Philadelphia, PA, USA. pp. 1–6.
- Cao H, Lei S, Deng HW, Wang YP. Identification of genes for complex diseases using integrated analysis of multiple types of genomic data. PLoS One. 2012d;7(9):e42755. doi: 10.1371/journal.pone.0042755.
- Cotter SF, Rao BD, Engan K, Kreutz-Delgado K. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Trans Signal Processing. 2005;53(7):2477–2488.
- Davenport M, Duarte M, Hegde C, Baraniuk R. Introduction to compressive sensing. Connexions; 2011. http://cnx.org/content/m37172/1.7/
- Davis G, Mallat S, Avellaneda M. Greedy adaptive approximation. J Constr Approx. 1997;13(1):57–98.
- Donoho DL, Elad M. Maximal sparsity representation via L1 minimization. Proc Natl Acad Sci U S A. 2003;100:2197–2202. doi: 10.1073/pnas.0437847100.
- Donoho DL, Tsaig Y. Fast solution of L1-norm minimization problems when the solution may be sparse. IEEE Trans Inf Theory. 2008;54:4789–4812.
- Donoho DL. Compressed sensing. Technical Report, Stanford University; 2004.
- Donoho DL, Elad M, Temlyakov VN. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans Inf Theory. 2006;52(1):6–18.
- Fisher RA, Yates F. Statistical tables for biological, agricultural and medical research. 3rd ed. London: Oliver & Boyd; 1948. pp. 26–27.
- Foucart S, Lai MJ. Sparsest solutions of underdetermined linear systems via ℓq minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis. 2009;26(3):395–407.
- Gemmeke JF, Virtanen T, Hurmalainen A. Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans Audio Speech Lang Process. 2011;19(7):2067–2080.
- Gilbert AC, Muthukrishnan S, Strauss MJ. Improved sparse approximation over quasi-incoherent dictionaries. In: Proc 2003 IEEE Int Conf Image Processing; Barcelona, Spain; 2003. pp. 137–140.
- Gribonval R, Nielsen M. Sparse decompositions in unions of bases. IEEE Trans Inf Theory. 2003;49(12):3320–3325.
- Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage. 2014;87:96–110. doi: 10.1016/j.neuroimage.2013.10.067.
- Hsu D, Kakade S, Langford J, Zhang T. Multi-label prediction via compressed sensing. In: Neural Information Processing Systems (NIPS); 2009.
- Kidron E, Schechner YY, Elad M. Cross-modal localization via sparsity. IEEE Trans Signal Processing. 2007;55(4):1390–1404.
- Kumari V, Gray JA, Honey GD, Soni W, Bullmore ET, Williams SC, Ng VW, Vythelingum GN, Simmons A, Suckling J, Corr PJ, Sharma T. Procedural learning in schizophrenia: a functional magnetic resonance imaging investigation. Schizophrenia Research. 2012;57(1):97–107. doi: 10.1016/s0920-9964(01)00270-5.
- Li Y, Cichocki A, Amari S. Analysis of sparse representation and blind source separation. Neural Comput. 2004;16(6):1193–1234. doi: 10.1162/089976604773717586.
- Li Y, Namburi P, Yu Z, Guan C, Feng J, Gu Z. Voxel selection in fMRI data analysis based on sparse representation. IEEE Trans Biomed Eng. 2009;56(10):2439–2451. doi: 10.1109/TBME.2009.2025866.
- Lin D, Cao H, Calhoun VD, Wang Y. Classification of schizophrenia patients with combined analysis of SNP and fMRI data based on sparse representation. In: Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on; Atlanta, GA, USA; 2011.
- Liu J, Pearlson G, Windemuth A, Ruano G, Perrone-Bizzozero NI, Calhoun VD. Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA. Hum Brain Mapp. 2009;30(1):241–255. doi: 10.1002/hbm.20508.
- Meda SA, Bhattarai M, Morris NA, Astur RS, Calhoun VD, Mathalon DH, Kiehl KA, Pearlson GD. An fMRI study of working memory in first-degree unaffected relatives of schizophrenia patients. Schizophr Res. 2008;104(1–3):85–95. doi: 10.1016/j.schres.2008.06.013.
- Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. J R Statist Soc B. 2008;70(1):53–71.
- Onitsuka T, Shenton ME, Salisbury DF, Dickey CC, Kasai K, Toner SK, Frumin M, Kikinis R, Jolesz FA, McCarley RW. Middle and inferior temporal gyrus gray matter volume abnormalities in chronic schizophrenia: an MRI study. Am J Psychiatry. 2004;161(9):1603–1611. doi: 10.1176/appi.ajp.161.9.1603.
- Pascual-Leone A, Manoach DS, Birnbaum R, Goff DC. Motor cortical excitability in schizophrenia. Biol Psychiatry. 2002;52(1):24–31. doi: 10.1016/s0006-3223(02)01317-3.
- Ramirez I, Sprechmann P, Sapiro G. Classification and clustering via dictionary learning with structured incoherence and shared features. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on; San Francisco, CA, USA; June 2010. pp. 3501–3508.
- Rhodes DR, Chinnaiyan AM. Integrative analysis of the cancer transcriptome. Nat Genet. 2005;37:31–37. doi: 10.1038/ng1570.
- Sharon Y, Wright J, Ma Y. Computation and relaxation of conditions for equivalence between ℓ1 and ℓ0 minimization. UIUC Tech Rep UILU-ENG-07-2008; 2007.
- Shin Y, Lee S, Lee J, Lee HN. Sparse representation-based classification scheme for motor imagery-based brain computer interface systems. J Neural Eng. 2012;9(5):056002. doi: 10.1088/1741-2560/9/5/056002.
- Soussen C, Idier J, Brie D, Duan J. From Bernoulli-Gaussian deconvolution to sparse signal restoration. IEEE Trans Signal Processing. 2011;59(10):4572–4584.
- Sutrala SR, Norton N, Williams NM, Buckland PR. Gene copy number variation in schizophrenia. Am J Med Genet B Neuropsychiatr Genet. 2008;147B(5):606–611. doi: 10.1002/ajmg.b.30645.
- Szycik GR, Münte TF, Dillo W, Mohammadi B, Samii A, Emrich HM, Dietrich DE. Audiovisual integration of speech is disturbed in schizophrenia: an fMRI study. Schizophr Res. 2009;110(1–3):111–118. doi: 10.1016/j.schres.2009.03.003.
- Tang W, Cao H, Duan J, Wang YP. A compressed sensing based approach for subtyping of leukemia from gene expression data. J Bioinform Comput Biol. 2011;9(5):631–645. doi: 10.1142/s0219720011005689.
- Tropp JA. Greed is good: algorithmic results for sparse approximation. IEEE Trans Inf Theory. 2004;50(10):2231–2242.
- Wang J, Yang CY, Chen B. Sparse signal recovery based on lq (0 < q ≤ 1) minimization. In: 2011 International Conference on Multimedia and Signal Processing; Guilin, Guangxi, China; May 2011.
- Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y. Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell. 2009;31(2):210–227. doi: 10.1109/TPAMI.2008.79.
- Xu Z, Chang X, Xu F. L1/2 regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Networks and Learning Systems. 2012;23(7):1013–1027. doi: 10.1109/TNNLS.2012.2197412.