Skip to main content
OMICS : a Journal of Integrative Biology logoLink to OMICS : a Journal of Integrative Biology
. 2012 Jul;16(7-8):363–373. doi: 10.1089/omi.2011.0126

Simultaneous Analysis of Multiple Data Types in Pharmacogenomic Studies Using Weighted Sparse Canonical Correlation Analysis

Prabhakar Chalise 1, Anthony Batzler 2, Ryan Abo 3, Liewei Wang 3, Brooke L Fridley 2,
PMCID: PMC3394856  PMID: 22734853

Abstract

Variation in drug response results from a combination of factors that include differences in gender, ethnicity, and environment, as well as genetic variation that may result in differences in mRNA and protein expression. This article presents two integrative analytic approaches that make use of both genome-wide SNP and mRNA expression data available on the same set of subjects: a step-wise integrative approach and a comprehensive analysis using sparse canonical correlation analysis (SCCA). In addition to applying standard SCCA, we present a novel modification of SCCA which allows different weighting for the various pair-wise relationships in the SCCA. These integrative approaches are illustrated with both simulated data and data from a pharmacogenomic study of the drug gemcitabine. Results from these analyses found little overlap in terms of genes detected, possibly detecting different biological mechanisms. In addition, we found the proposed weighted SCCA to outperform its unweighted counterpart in detecting associations between the genomic features and phenotype. Further research is needed to develop and assess new integrative methods for pharmacogenomic studies, as these types of analyses may uncover novel insights into the relationship between genomic variation and drug response.

Introduction

Variation in response to drug therapies is the result of a combination of many factors, including gene sequence variation, ultimately resulting in differences in mRNA and protein expression. Most of the current methods for analyzing high-dimensional genomic data have focused on analyzing a single data type, or experiment, at a time in a naive fashion. This naive one-at-a-time analysis approach ignores known biological information and the interaction between genes, proteins, and biochemical reactions, which may give rise to complex drug-related phenotypes. With the wealth of data being produced by new technologies, the collection of multiple types of genomic data on a set of samples is becoming commonplace.

Recently, multifactor approaches combining different types of genomic data have been used, in which a multistep procedure is employed to identify potential key drivers of complex traits integrating DNA variation and mRNA expression data (Hauser et al., 2003; Huang et al., 2008; Li et al., 2008; Schadt et al., 2005). Niu and associates (2010) used a step-wise integrative approach to find genes related to the response to radiation therapy. Another set of “integrative genomics” methods analyze the complete set of data in one comprehensive analysis, as opposed to a multistep procedure. One such approach is canonical correlation analysis (CCA; Hotelling, 1936). CCA focuses on maximizing the correlation between linear combinations of different sets of variables. However, when the number of variables far exceeds the number of subjects, as is the case for large-scale genomic studies, traditional CCA methods are no longer appropriate. To overcome this limitation, sparse canonical correlation analysis (SCCA) has recently been proposed for the analysis of two or three data sets (Parkhomenko et al., 2009; Waaijenborg et al., 2008; Witten and Tibshirani, 2009).

In this article, we compare these integrative analysis approaches, including the novel weighted SCCA, using data from a pharmacogenomics study of the cancer agent gemcitabine, in which genome-wide single-nucleotide polymorphisms (SNP) and mRNA expression have been collected on the same set of cell lines (Li et al., 2008, 2009). These methods are also applied to simulated data in which the “truth” is known. In this article, we focus on analysis methods that integrate multiple types of data into one comprehensive analysis, and propose a novel weighted SCCA method for analyzing high-dimensional data in pharmacogenomics studies.

Materials and Methods

Pharmacogenomic study of gemcitabine

To understand the pharmacogenomics of gemcitabine drug therapy, the Coriell Human Variation Panel (HVP) lymphoblastic cell lines were utilized, as previously described (Li et al., 2008, 2009). The HVP contains Epstein-Barr virus (EBV)-transformed B lymphoblastic cells from 100 Caucasians, 100 African-Americans, and 100 Han Chinese Americans. Cytotoxicity assays were performed at various drug doses, followed by estimation of the phenotype IC50 (the effective dose that kills 50% of the cells), using a four-parameter logistic model (Gallant, 1987). The phenotypic variable IC50 was used in the univariate and step-wise integrative methods, while the cytotoxicity values at the eight drug dose levels was used in the SCCA, which is designed for multiple variables. The cell lines have been genotyped using the Illumina HumanHap 550K. Following quality control, a total of 515,039 SNPs remained for integrative statistical analyses. SNPs were quantified as 0, 1, or 2, based on an additive genetic model in terms of the number of minor alleles. Genome-wide mRNA expression data were measured for the cell lines with the Affymetrix U133 Plus 2.0 expression array chip, with 54,613 probe sets available for analysis. In total, 172 cell lines (60 Caucasian, 53 African-American, and 59 Han Chinese American) had all three data types: gemcitabine cytotoxicity measurements, genome-wide mRNA expression data, and genome-wide SNP data.

Statistical analyses of the cell-line gemcitabine pharmacogenomic study

Univariate analyses

The expression array data were normalized on the log scale using guanine cytosine robust multi-array analysis (GCRMA; Bolstad et al., 2003; Wu et al., 2004). The normalized expression data on a log scale were then regressed on gender and race. Residuals from this regression were then standardized to arrive at a standardized adjusted expression value. IC50 values were log transformed and adjusted in a fashion similar to that described for the basal gene expression data. Pearson correlation coefficients were then calculated for the adjusted standardized IC50 and expression levels, followed by a Wald test of the association (p<0.0001).

For all analyses involving SNPs, adjustment for population stratification was completed as outlined in the works by Li and colleagues (2009) and Niu and associates (2010). Briefly, we used a principal component analysis (PCA) approach using genome-wide SNPs to adjust for population stratification (Price et al., 2006), in which PCA was completed by race, with the top five principal components saved. Using these components, the individual genotypes were adjusted. In a similar manner, the IC50 values were log transformed and adjusted for gender and race using the five principal components. The resulting race-adjusted genotypes and IC50 were then used in the genotype-phenotype correlation analysis. The SNP-IC50, and the SNP-expression analyses, were completed in a similar manner.

Step-wise integrative analysis

Based on the univariate analysis results, SNPs associated with IC50 were identified (p<0.0001). Since SNPs may control mRNA expression in either a cis- or a trans-manner, associations between these identified SNPs and genome-wide expression were completed, with SNP-expression associations identified (p<0.0001). Next, we determined if the expression probe sets identified to be associated with an SNP (with the SNP found to be associated with IC50), were also associated with IC50 (p<0.0001). This resulted in a set of candidate genes that could then be assessed for possible biological relevance with the drug. This approach is the same as the one followed by Niu and colleagues (2010).

SCCA integrative methods

CCA is a multivariate statistical method designed to explore the correlation between two sets of quantitative variables (Hotelling, 1936), and has been extended for the analysis of three or more data sets (Via et al., 2007). Suppose that three data sets X, Y, and Z, are of dimensions n×p, n×q, and n×r, with pn, qn, and rn measured on the same set of n subjects. Suppose that the columns of X, Y, and Z are standardized to have mean 0 and standard deviation 1. Let u, v, and w be p×1, q×1, and r×1 vectors of weights, and let ξ=Xu, η=Yv, and θ=Zw be the linear combinations of the variables in data sets X, Y, and Z, respectively, where ξ, η, and θ are n×1 vectors. The first canonical correlation (ρ) is then computed by maximizing the following equation:

graphic file with name M1.gif [Eq. 1]

subject to uT XT Xu=vT YT Yv=wT ZT Zw=1.

A limitation of CCA is that when the number of variables far exceeds the number of subjects, as is the case for large-scale genomic studies, the method is not applicable. To overcome this issue, a few important variables are selected using standard model selection criteria, and the canonical correlation is computed using the selected variables (Parkhomenko et al., 2009; Witten and Tibshirani, 2009), referred to as sparse canonical correlation analysis (SCCA). Mathematically, SCCA is performed by maximizing the penalized version of the expression in [Eq. 1] with penalties P1(u), P2(v), and P3(w), placed on u, v, and w, respectively. In this article, maximization of [Eq. 1] and calculation of the loadings were carried out using the method described in Witten and Tibshirani (2009).

However, a disadvantage of the SCCA method is that it does not directly control the sparsity of solution, and as a result it is difficult to achieve effective dimension reduction (Lykou and Whittaker, 2010). Zhou and He (2008) proposed a two-step procedure that uses a BIC criterion, balancing the loss in the correlation and gain in the sparsity of variables. The procedure is carried out iteratively which is given as follows. The pair-wise canonical correlation coefficients (rij, rjk, and rik) for the linear combination of the variables selected from SCCA are computed, with ρd defined as Inline graphic, where ρd ranges from −3 to +3. Then the BIC value is estimated using

graphic file with name M3.gif [Eq. 2]

where dm=pm+qm+rm is the total number of parameters at the mth iteration, n is the sample size, Inline graphic is the square of the sum of the pair-wise correlation coefficients with dm parameters, and Inline graphic is the square of the sum of maximum possible correlation coefficients (e.g., Inline graphic=9 for the case involving three pair-wise correlations). Next, the variable with the smallest loading in absolute value is dropped and a new correlation and corresponding BIC value are computed. Thus, the variable filtering is carried out by dropping the variable with the smallest loading at each iteration, followed by the re-computation of the first canonical correlation and the BIC value. The variables corresponding to the minimum BIC value are then selected.

Another limitation of the current definition of SCCA for more than two data sets deals with the fact that the maximization involves the sum of pair-wise correlations of the linear combination of variables among the three data sets. That is, all pairs of correlations are given equal weights, which may not be appropriate in pharmacogenomic studies. In particular, the component representing the correlation between the SNP and expression data can dominate the analysis, with SNP and expression variables being selected that have no relationship with the phenotypes. Therefore, we propose a novel weighted SCCA for analysis of three data sets that allows for weighting of the different pair-wise correlation within the objective function with

graphic file with name M7.gif [Eq. 3]

subject to uT XT Xu=vT YTYv=wT ZT Zw=1. For example, in the SCCA of cytotoxicity, expression, and SNP data, one could select variables that maximize the sum of the correlation between the cytotoxicity – expression and cytotoxicity – SNP, with w12=w13=1, thus removing the component representing the correlation between SNP and expression (w23=0). The BIC function is then adjusted accordingly for weighted SCCA using

graphic file with name M8.gif [Eq. 4]

where Inline graphic, ranging from −2 to +2, and therefore Inline graphic

Finally, since applying SCCA to genome-wide SNP data is still computationally intensive, the following dimension reduction steps were completed prior to SCCA. SNPs were partitioned into bins based on their correlation using hierarchical clustering with a liberal threshold of 0.05 (Rinaldo et al., 2005), followed by PCA for the SNPs within the bin. The first principal component for each bin of SNPs was used in the model as the “genetic” variable, as opposed to the individual SNP genotypes. A similar PCA approach has often been carried out for SNPs in a candidate gene to capture the variation of those SNPs within the locus (Gauderman et al., 2007). This approach resulted in 3135 “genetic” factors to be included in the analysis. Adjustment of population stratification and covariates was completed in a similar manner as that outlined for the univariate analysis, with SCCA based on the residuals.

Results

Gemcitabine pharmacogenomic study

Univariate analysis

Five SNPs (Table 1A) were detected with p<10−5 (6.14×10−7 to 8.53×10−6), and nine loci (regions with more than one SNP with p<0.0001) found to be associated with IC50. These regions corresponded to genes PIK3R1, TGFBI, HLA-DRA, MCPH1/ANGPT2, CLDN23, NIPSNAP3A/B, LOC100128586, RAB27A/PIGB, and NPM1P3. Results for these loci are also presented in Table 1B. Analysis of mRNA expression and IC50 detected a total of 261 probe sets associated with IC50 with p<0.0001. The probe sets with p<10−6 are listed in Table 2. Multiple probe sets in the genes FKBP5, VARS, and VAV3 were found to be associated with gemcitabine IC50.

Table 1.

(A) Univariate SNP-IC50 Associations with p<10−5 from Univariate Analysis of the Gemcitabine Study. (B) Nine loci (Regions with More Than One SNP with p<0.0001) associated with IC50

  SNP Chromosome Position Nearest gene MAF Correlation p
(A) rs4272382 8 8470898 CLDN23 0.132 −0.39 6.1E-07
  rs3775182 4 87198607 MAPK10 0.109 0.38 9.6E-07
  rs2290344 15 53407088 PIGB 0.254 0.36 3.6E-06
  rs10761082 9 106555990 NIPSNAP3A 0.36 0.36 4.9E-06
  rs2472476 9 106571777 NIPSNAP3B 0.389 0.35 8.5E-06
(B) rs7713001 5 67999371 PIK3R1 0.459 −0.32 4.5E-05
  rs12188464 5 67999705 PIK3R1 0.459 −0.32 4.5E-05
  rs13171512 5 68000787 PIK3R1 0.462 −0.32 3.9E-05
  rs2107331 5 135405248 TGFBI 0.456 −0.34 1.5E-05
  rs2282791 5 135405629 TGFBI 0.477 0.32 4.3E-05
  rs7192 6 32519624 HLA-DRA 0.374 0.31 5.7E-05
  rs3129890 6 32522251 HLA-DRA 0.342 0.34 1.0E-05
  rs2922876 8 6384104 MCPH1; ANGPT2 0.164 −0.31 7.2E-05
  rs1375668 8 6384278 MCPH1; ANGPT2 0.363 −0.31 9.1E-05
  rs4272382 8 8470898 CLDN23 0.132 −0.39 6.1E-07
  rs4595128 8 8471286 CLDN23 0.202 −0.32 3.2E-05
  rs10761082 9 106555990 NIPSNAP3A 0.36 0.36 4.9E-06
  rs2472476 9 106571777 NIPSNAP3B 0.389 0.35 8.5E-06
  rs12244977 10 58762688 LOC100128586 0.19 −0.31 9.4E-05
  rs12256364 10 58765694 LOC100128586 0.19 −0.31 9.4E-05
  rs12050885 15 53345916 RAB27A 0.379 0.33 2.2E-05
  rs11636687 15 53392444 PIGB 0.42 0.32 5.3E-05
  rs2290344 15 53407088 PIGB 0.254 0.36 3.6E-06
  rs12050587 15 53414820 PIGB 0.45 0.32 5.0E-05
  rs8024695 15 53426597 PIGB 0.287 0.34 1.4E-05
  rs11639680 16 5500905 NPM1P3 0.193 0.33 2.8E-05
  rs4511535 16 5506393 NPM1P3 0.199 0.31 6.0E-05

MAF, minor allele frequency; SNP, single-nucleotide polymorphism; IC50, effective dose that kills 50% of cells.

Table 2.

Expression Probe Set-IC50 Associations with p<10−6 from Univariate Analysis of the Gemcitabine Study

Probe set Chromosome Gene Correlation p
202092_s_at 16 ARL2BP −0.38 2.6E-07
212437_at 20 CENPB −0.39 8.3E-08
226017_at 3 CMTM7 −0.36 8.9E-07
211118_x_at 14 ERS2 0.39 6.7E-08
224856_at 6 FKBP5 −0.41 2.1E-08
204560_at 6 FKBP5 −0.39 9.6E-08
224840_at 6 FKBP5 −0.37 5.6E-07
205164_at 22 GCAT −0.39 9.9E-08
230362_at 10 INPP5F −0.39 5.9E-08
210644_s_at 19 LAIR1 −0.38 1.7E-07
203726_s_at 18 LAMA3 −0.44 1.2E-09
212715_s_at 22 LOC731210; MICAL3 −0.38 2.3E-07
225391_at 4 LOC93622 −0.36 9.0E-07
206571_s_at 2 MAP4K4 −0.37 5.1E-07
204880_at 10 MGMT 0.40 5.4E-08
209853_s_at 17 PSME3 −0.37 5.5E-07
209815_at 9 PTCH1 0.36 8.8E-07
204759_at 13 RCBTB2 0.37 5.9E-07
205645_at 23 REPS2 −0.36 8.1E-07
224338_s_at 11 RNF26 −0.39 1.2E-07
201796_s_at 6 VARS −0.40 2.5E-08
201797_s_at 6 VARS −0.37 3.4E-07
218807_at 1 VAV3 −0.38 1.4E-07
218806_s_at 1 VAV3 −0.37 5.7E-07

IC50, effective dose that kills 50% of cells.

Step-wise integration approach

For the step-wise integration analysis approach, 58 SNPs were found to be associated with IC50 (p<0.0001). For these 58 SNPs, cis- and trans-associations with gene expression were determined. We found associations with 468 unique expression probe sets (538 associations with p<0.0001). In particular, SNP rs922369 (chromosome 10, bp 71020137, 5′ upstream of the gene NEUROG3), was associated with 60 unique expression probe sets, and rs2472476 (chromosome 9, bp 106571777, intronic to NIPSNAP3B and 3′ downstream of NIPSNAP3A) was associated with 41 probe sets. The SNP rs2472476 was also in a locus associated with IC50 containing SNP rs10761082.

Subsequently, the association of these 468 probe sets with IC50 determined 21 probe sets associated with IC50 (p<0.0001). These results are displayed in Table 3. In addition to the gene PIGB detected by the SNP-IC50 analyses, and FKBP5 detected by the expression-IC50 analyses, these two genes were also detected via the three-way step-wise analysis. The four SNPs in PIGB associated with IC50 (p<0.0001) were also found to regulate the expression of PIGB, with the most significant association found between the four SNPs and mRNA expression (242760_x_at) observed for rs2290344 (p=2.55×10−10). This probe set was also found to be associated with IC50 (p=8.98×10−5), indicating that the SNPs may be indirectly affecting gemcitabine IC50 through the expression of PIGB. In addition, Table 3 presents 15 novel candidate genes detected through a trans mechanism. In particular, SNP rs922369 (chromosome 10; minor allele frequency [MAF]=0.25) in NEUROGF3 was associated with IC50 (p=4.88×10−5) and mRNA expression for genes TMEM142B, MTRF1, and a cDNA, with the probe sets for these genes also associated with IC50 through a trans mechanism (Table 4).

Table 3.

Twenty-One Probe Sets Found to Be Associated with Gemcitabine IC50 Based on a Step-Wise Integrative Approach

Probe set Chromosome Gene Correlation p
218812_s_at 7 TMEM142B −0.29 6.7E-05
219798_s_at 7 BCDIN3 −0.32 1.6E-05
219822_at 13 MTRF1 0.30 5.8E-05
231406_at 7 −0.34 3.6E-06
1569396_at 16 RAB40C 0.30 5.4E-05
1570537_a_at 8 0.32 1.1E-05
225086_at 15 FAM98B −0.35 2.1E-06
204560_at 6 FKBP5 −0.39 9.6E-08
242760_x_at 15 PIGB −0.29 8.9E-05
228832_at 4 FLJ20021 0.29 8.1E-05
203099_s_at 6 CDYL −0.32 1.0E-05
219338_s_at 15 LRRC49 −0.30 6.1E-05
244276_at 4 KLB 0.30 3.8E-05
231851_at 1 RAVER2 −0.29 6.6E-05
236170_x_at 7 HERPUD2 −0.29 6.7E-05
225391_at 4 LOC93622 −0.36 9.0E-07
203706_s_at 2 FZD7 −0.30 4.7E-05
213056_at 3 FRMD4B −0.30 6.3E-05
219098_at 17 MYBBP1A −0.33 8.3E-06
230908_at 2 −0.35 1.8E-06
200988_s_at 17 PSME3 −0.35 2.6E-06

IC50, effective dose that kills 50% of cells.

Table 4.

Results for rs922369 (NEUROGF3) Observed to be Associated with Gemcitabine Trans-Acting Manner Through Expression

mRNA expression
rs922369 Expression analysis
Expression IC50 analysis
Probe set Gene Chromosome Correlation p Correlation p
218812_s_at TMEM142B 7 −0.28 2.0E-06 −0.29 6.7E-05
219798_s_at BCDIN3 7 −0.25 2.2E-05 −0.32 1.6E-05
219822_at MTRF1 13 0.29 1.0E-06 0.30 5.8E-05
231406_at cDNA 7 −0.26 1.1E-05 −0.34 3.6E-06

IC50, effective dose that kills 50% of cells.

SCCA approach

Using the standard (unweighted) SCCA applied to the SNP, mRNA expression, and cytotoxicity data sets, resulted in the selection of 182 genetic variables (defined as the first principal component for the linkage disequilibrium [LD]-based binned SNPs), 2581 expression probe sets, and 2 cytotoxicity variables (doses of 10 μM and 1000 μM) with a SCCA coefficient of 1.0299. The pair-wise SCCA coefficients were: correlation(genotype, expression)=0.734, correlation(genotype, cytotoxicity)=0.153, and correlation(expression, cytotoxicity)=0.143. Applying the BIC-type variable filtering method resulted in a more sparse solution, with only five genetic variables, one expression probe set (215301_at), and one cytotoxicity variable (10 μM) selected. The selected probe set 215301_at corresponds to genes SYCE1L and LOC400547, located at chromosome 16 and position 75804375–75809512. This probe set was not significantly associated with IC50 in univariate analysis (p=0.30). The five selected genetic variables correspond to bins consisting of 106, 220, 7, 9, and 10 SNPs, respectively (a total of 352 SNPs). Three of these SNPs, rs4074037, rs3811259, and rs2930629, were among the SNPs obtained from the univariate SNP IC50 analyses (p=0.0098, 0.0053, and 0.0044, respectively). There were no common SNPs detected from the unweighted SCCA and step-wise approach. Out of 352 total SNPs, the 25 most “important” SNPs with largest first principal component loadings, together with their positions and associated genes, are listed in Table 5A.

Table 5.

SCCA Results, Following BIC Filtering, for the Gemcitabine Pharmacogenomic Study

  SNP Chromosome Position Gene
(A)
  rs3766117a 1 167794480 F5
  rs1894701a 1 167797210 F5
  rs7545236a 1 167796694 F5
  rs6022a 1 167796450 F5
  rs6128a 1 167829528 SELP; F5
  rs6678795 1 167799890 F5
  rs1335532 1 116902480 CD58
  rs6427202 1 167795454 F5
  rs10924103 1 116838074 LOC148766
  rs800292 1 194908856 CFH
  rs505102 1 194886125 CFH
  rs10802189 1 116858253 CD58
  rs10145908 14 62823082 RHOJ
  rs4457900 14 60759645 TMEM30B
  rs3783814 14 60869673 PRKCH
  rs1139130a 14 21037756 TOX4; METTL3
  rs2297093a 14 21025196 TOX4
  rs933192a 14 21033649 TOX4; METTL3
  rs4417466a 14 21042491 TOX4; METTL3
  rs6571850a 14 21020676 TOX4; RAB2B
  rs719785 14 21048133 METTL3
  rs7179423a 15 23471534 ATP10A
  rs2930629a 15 23469059 ATP10A
  rs7181116a 15 23471769 ATP10A
  rs2066711a 15 23474311 ATP10A
(B)
  rs12345642 9 137870884 CAMSAP1
  rs7852055 9 137835056 CAMSAP1, LOC100131786
  rs10116440 9 137896193 CAMSAP1
  rs10858179 9 137945755 UBAC1
  rs12972385 19 5842052 NDUFA11
  rs1678868 19 5843954 NDUFA11
  rs8108064 19 5854807 NDUFA11,VMAC
  rs1015048 21 32871177 TCP10L,C21orf77
  rs1015047 21 32871294 TCP10L,C21orf77
  rs2833890 21 32849757 C21orf77
  rs2833902 21 32862329 C21orf77,TCP10L
a

Detected in both unweighted and weighted SCCA.

SNPs with the highest loadings in the first principal component from selected bins are presented: (A) unweighted SCCA, and (B) weighted SCCA.

SNP, single-nucleotide polymorphism; SCCA, sparse canonical correlation analysis.

When the weighted SCCA was applied to the gemcitabine pharmacogenomic study, 57 genetic variables, 874 expression probe sets, and 2 cytotoxicity variables were selected. The sparse canonical correlation was 0.3611 (correlations between genotype and cytotoxicity equal to 0.1902, and expression and cytotoxicity equal to 0.1709). After applying the BIC filter, only 13 genetic variables, 7 expression probe sets, and 1 cytotoxicity (dose 10 μM) variable were selected. However, these probe sets (1556404_a_at, 1557921_s_at, 1559336_at, 1565742_at, 1566970_at, 215301_at, and 239006_at) were not significantly associated with IC50 in expression-IC50 analyses. The selected genetic variables correspond to bins consisting of 106, 45, 16, 127, 220, 7, 9, 37, 108, 18, 56, 44, and 10 (total of 803 SNPs). Out of these 803 SNPs, rs9651539, rs778972, rs739236, and rs1107514, were also detected in the univariate SNP-IC50 analyses (p=0.0005, 0.0009, 0.0009, and 0.0009, respectively). The 25 most important SNPs with the largest principal component loadings are listed in Table 5A and B, together with their position and associated genes.

These novel genes detected from SCCA might represent additional mechanisms that could contribute to gemcitabine sensitivity. Therefore, we completed a pathway analysis of these genes using Ingenuity Pathway Analysis (IPA; Ingenuity Systems, Redwood City, CA, USA). This software consists of a curated database and several analysis tools to obtain pathways and networks associated with a set of genes. Networks are constructed in IPA with a set of genes by first identifying other molecules in the IPA database that have evidence of interacting with these genes, and then maximizing the connectivity of these components. The scores of the constructed networks indicate how well the network is “fit” to the set of genes input, and is the log-transformed value of a right-tailed Fisher's exact test result. The top network identified was the TNF pathway (score of 32), and involved 13 of the 16 genes input (Fig. 1). This pathway is extremely important in the inflammatory response and cancer development. Therefore, future functional and mechanistic studies would help to validate this finding.

FIG. 1.

FIG. 1.

Network showing the top network (TNF pathway) detected with Ingenuity Pathway Analysis.

Comparison of the approaches

The standard SCCA method selected 5 SNP bins, 1 mRNA expression probe set, and 1 cytotoxicity variable, while the weighted SCCA method selected 13 bins of SNPs, 7 mRNA expression probe sets, and 1 cytotoxicity variable. All the variables selected by unweighted SCCA were also selected by weighted SCCA. Comparing the SCCA results with the univariate approach, 4 SNPs (rs9651539, rs778972, rs739236, and rs1107514) from the weighted SCCA method were found to have a significant association with IC50 (p values range from 0.0005−0.0009). However, the SNPs selected from the unweighted SCCA method were not found to have a significant association with IC50 (at p<0.001 level). Also, the mRNA variables selected by both the unweighted and weighted SCCA methods were not detected by either the univariate (p>0.05) or the step-wise analyses. The univariate and step-wise analyses identified several genes in common. The gene PIGB was both detected by the step-wise model and univariate SNP-IC50 analyses. Similarly, gene FKBP5 was detected by both the step-wise and univariate expression-IC50 analyses. In addition, the step-wise method detected 15 novel candidate genes not detected by either the univariate SNP-IC50 or expression-IC50 analysis.

Description of simulated data

Simulation of the genotype data was based on the SNPs within the gemcitabine pathway for the Caucasian HVP cell lines. SNPs mapped to the gemcitabine pathway, which passed quality control, were identified resulting in 749 SNPs in 19 genes. The 19 genes within the pathway were mapped to chromosomes, and haplotypes were phased using the program fastPHASE (Scheet and Stephens, 2006). These haplotype frequencies were used as the “true” haplotype frequencies for the underlying population, with haplotypes simulated using the hapsim library in R (http://cran.r-project.org/web/packages/hapsim/index.html). These haplotypes were then assigned in a sequential fashion to the 200 individuals, producing simulated genotypes for SNPs that mimic realistic LD for the regions in which they lie.

Following the simulation of the genotype data for n=200 subjects, mRNA gene expression data were simulated, such that a few SNPs in the pathway were correlated with mRNA expression levels. The expression data for each individual were simulated using a multivariate normal distribution X∼MVN(μi, ΣX), for which the mean vector for subject i, μi=Gi×B, is based on the effect matrix B and subject i's vector of genotypes Gi. Next, SNPs were selected to be associated with gene expression, with an effect size k for those SNP-expression pairs, the effect matrix B is defined as

graphic file with name M11.gif [Eq. 5]

with the number of rows equal to the number of SNPs and the number of columns equal to the number of expression variables. Using the mean vector μi, and a covariance matrix (ΣX) based on the observed correlation structure between the mRNA expression values within the gemcitabine pathway, expression data for individual i were simulated. Three SNPs, rs2840075, rs3781281, and rs7776847, were chosen to be associated with three expression variables 203302_at, 209155_s_at, and 223298_s_at, respectively. The effect size was varied, with either a small effect (k=0.3), or a large effect (k=0.6).

Finally, the cytotoxicity values for each individual were generated from a multivariate normal distribution Inline graphic. The mean of the distribution was based on the four-parameter logistic function

graphic file with name M13.gif [Eq. 6]

where the responses at infinite and zero concentration are represented by β1 and β2, respectively. The parameter β3 represents log(IC50), β4 represents the slope of the dose-response curve, and Di is one of the eight gemcitabine drug concentrations. The parameters β1, β2, and β4 in the four-parameter logistic model were set to 10, 95, and 1.5, respectively. The covariance matrix ΣY was estimated from the gemcitabine cytotoxicity data. For simulations with a genetic effect on the phenotype IC50, β3 was based on the direct effects of two expression probe sets: direct effect of one SNP, and indirect effect of an additional SNP. The genes with mRNA expression affecting the cytotoxicity were NT5C3 and NT5C1B. The SNPs impacting cytotoxicity were rs11140525 and rs7776847. The simulation scenario is depicted in Figure 2. For each effect size (k=0.3 or 0.6), 100 simulations were run with the aforementioned settings.

FIG. 2.

FIG. 2.

Simulation scenario in which the genotype and expression variables have direct and indirect effects on cytotoxicity. Two genes, NT5C3 and NT5C1B, with probe sets 223298_s_at and 1554368_at, have direct effects on cytotoxicity. SNP rs11140525 has a direct effect and SNP rs7776847 has an indirect effect on cytotoxicity.

Simulation study results

Univariate results

Pearson correlation coefficients were calculated for all possible pairs of variables, followed by a test of association using a Wald test with Fisher's transformation and Bonferroni correction for multiple testing. To compare the methods, the proportion of times the true variables were selected (PTTS), and the average number of false-discovery (AvgFD) were computed. AvgFD was computed by adding all the false-positive variables across all simulations divided by the number of simulations. Therefore, a good analysis method is one with high PTTS and low AvgFD. To adjust for multiple testing, a Bonferroni correction was applied for which the significance threshold was set to 0.001 for the mRNA-IC50 comparisons, 10−4 for the SNP-IC50 comparisons, and 10−5 for the SNP-mRNA comparisons due to differing number of tests in these categories. The results are displayed in Table 6A.

Table 6.

Results from Univariate (A), and Unweighted and Weighted SCCA (B) Integrative Analyses of Simulated Data

Analysis approach Variable/associations PTTS % (k=0.6) PTTS % (k=0.3)
(A) Univariate analysis rs2840075-203302_at 61 25
  rs3781281-209155_s_at 86 37
  rs7776847-223298_s_at 59 31
  rs11140525-IC50 87 58
  rs7776847-IC50 66 18
  223298_s_at-IC50 100 86
  1554368_at-IC50 91 79
(B) Unweighted SCCA rs11140525 61 53
  rs7776847 72 41
  223298_s_at 66 57
  1554368_at 33 30
Weighted SCCA rs11140525 63 51
  rs7776847 77 40
  223298_s__at 86 64
  1554368_at 77 43

PTTS, proportion of times the true variables were selected; SCCA, sparse canonical correlation analysis.

For the larger effect size (k=0.6) simulations, all three pairs of the univariate analysis selected the simulated effects with good power. In particular, when the number of tests was smaller (e.g., expression-IC50 analysis), the true variables were selected by almost every simulation (100% and 91% of simulations for 223298_s_at and 1554368_at, respectively). However, when the effect size was decreased to 0.3, the PTTS values were reduced for all three univariate analyses. This reduction in PTTS was larger for the SNP-expression analyses, with the maximum reduction being 49% for the association between rs3781281 and 209155_s_at. The reduction in the PTTS value was also evident for the SNP rs7776847 in the SNP-IC50 analysis, which was simulated to have an indirect effect on IC50 via NT5C3 (probe set 223298_s_at). In addition to the impact on PTTS, as the effect size decreased the avgFD also decreased, with changes in avgFD of: (1) 0.05 and 0.12 for detecting false expression effects from the associations with SNP and IC50, respectively; and (2) 2.68 and 0.81 for detecting false SNP effects from associations with expression and IC50, respectively. For all simulation pairs, the false-positives were scattered among the remaining pairs, with no pair being detected in more than 8% of simulations.

Step-wise integration approach

For the step-wise integrative approach, SNPs associated with cytotoxicity variables were selected with p<10−4. Then for these selected SNPs, the associations with genome-wide mRNA expression variables was assessed, with mRNA probe sets selected with p<10−5. Finally, the association of these expression variables with cytotoxicity was determined with mRNA expression probe sets selected to be associated with IC50 at the 0.001 significance level. The focus of the step-wise approach lies in determining the expression variables associated with IC50, given that an SNP was associated with IC50 and expression of the gene; therefore we focused on the proportion of times the NT5C3 gene was selected. The gene NT5C3 (probe set 223298_s_at) was detected in 46% of simulations when the effect size was k=0.6, and 18% of the time when the effect size was k=0.3. The AvgFD was 0.23 and 0.12 when k was 0.6 or 0.3, respectively.

SCCA Approach

Weighted and unweighted SCCA, with an additional BIC step, were applied to the simulated data, with results presented in Table 6B. PTTS values were smaller for the smaller effect size (k=0.3), with similar results for detecting SNP effects between the weighted and unweighted SCCA. However, a larger difference was observed for detection of expression effects between the two SCCA approaches, with higher PTTS rates observed for the weighted SCCA, compared to the unweighted SCCA. The AvgFD values for SNP and mRNA expression analyses were 0.68 (2.24) and 0.26 (0.77), when k=0.6 (k=0.30) for the unweighted SCCA. Similarly, for weighted SCCA, the false-detection rate increased when the effect size decreased. The AvgFD for SNP and expression were 1.57 (1.69) and 0.55 (0.63), when k=0.6 (k=0.30). The selected false variables appeared to be random, and none of the variables were selected in more than 7% of simulations.

Comparison of approaches

The simulation study showed that when the variables are strongly associated with a large effect size (k=0.6), and if the number of tests conducted was relatively small (e.g., expression-IC50 pairs), the univariate approach was able to detect almost all simulated true variables (223298_s_at 100% and 1554368_at 91%). There were fewer true variables selected when the number of comparisons was increased. The SNP-IC50 correlation had the second largest number of tests, and the SNP rs11140525 was selected in 87% of simulations, and SNP rs7776847 was selected in 66%. The SNP expression pair had the largest number of tests, and the true pairs rs2840075-203302_at, rs3781281-209155_s_at, and rs7776847-223298_s_at, were detected in 61%, 86%, and 59% of simulations, respectively. However, when the variables are moderately or even weakly associated (k=0.3), and if the number of tests being conducted is large (e.g., SNP-expression pairs), there was a substantial decrease in the number of true associations detected (the pairs rs2840075-203302_at from 61 to 25%, rs3781281-209155_s_at from 86 to 37%, and rs7776847-223298_s_at from 59 to 31%). However, the fall in the PTTS was smaller when the number of comparisons was smaller in expression-IC50 (14% fall in PTTS for 223298_s_at, and 12% fall in PTTS for 1554368_at). In particular, the PTTS for SNP rs7776847, which was simulated to have an indirect effect with IC50, dropped by 48%, even though the number of tests was moderate.

In contrast, as the effect size decreased, there was less of a drop in the PTTS for the weighted and unweighted SCCA (compared to the univariate approach). In particular, the SCCA methods were able to detect the smaller indirect effect (k=0.3) better than the univariate method (e.g., rs7776847 PTTS of 40–41% versus 18%). Comparing unweighted and weighted SCCA methods, the simulation studies show that the weighted SCCA method is generally better able to detect the true associations between the genotypic and phenotypic variables. Since the focus in pharmacogenomics studies is to determine genomic variables (e.g., SNP and mRNA) associated with drug response, the novel weighted SCCA method may reveal more relevant associations.

Discussion and Conclusions

In this article, we have described several integrative analysis methods that could be applied to pharmacogenomic studies involving multiple types of genome-wide genomic data collected on the same set of subjects. Each of these methods has its benefits and limitations. Univariate analysis approaches, in which each data type is analyzed individually for the association with the phenotype, has been widely used in genome-scale studies. This method allows application of computationally efficient, standard statistical methods. However, the interpretation of the results after millions of tests have been performed for each data type is challenging. In addition, the univariate analysis approach only considers one pair of variables at a time, ignoring other variables which might influence them (i.e., expression quantitative trait loci [eQTL]). As a result, it is difficult to assess complex relationships between the multiple types of genomic data and the drug-response phenotype of interest.

The second method described involves a step-wise approach to integrate SNP and expression data for the selection of candidate genes associated with drug response. In this approach, the relationships between genetic variants (e.g., SNPs) and the phenotype are assessed individually. Variants detected from the pair-wise analysis are then carried forward to determine their association with mRNA expression, followed by the assessment of the significant mRNA probe sets identified with the phenotype. Therefore, this approach for the selection of candidate genes integrates both genetic and mRNA variations. However, the mRNA expression of genes associated with the phenotype could be missed if an SNP was not selected in the step-wise procedure to be associated with the gene's mRNA expression levels. SCCA overcomes this limitation of the step-wise approach, in which a comprehensive integrative analysis is completed to identify candidate genes associated with the drug-response phenotypes. Currently, however, application of existing SCCA methods to high-dimensional data is computationally intensive.

In terms of choice of the phenotype used for the univariate and step-wise analyses, we chose the commonly used summary measure of the dose-response curve, the IC50 (Huang et al., 2008; Li et al., 2008, 2009; Niu et al., 2010). In their research, Fridley's group developed a Bayesian hierarchical nonlinear model to model the genomic effects within a pathway on the entire dose-response curve (Fridley et al., 2009). However, this approach is computationally intensive and cannot be scaled up to genome-scale data. In contrast to the univariate analyses, SCCA is designed to be applied to a set of variables. Therefore, we chose to use all cytotoxicity values, as opposed to the summary measures of IC50, for application of SCCA to the gemcitabine study. However, the SCCA method did not explicitly model the dose-response relationship between the cytotoxicity values and the drug dose. Future work is needed to extend the weighted SCCA to incorporate this dose-response relationship, possibly using an approach similar to that proposed by Leurgans and colleagues (1993).

The application of these analytical approaches to the pharmacogenomic study of the anti-cancer agent gemcitabine, along with their application to simulated data, demonstrated the utility of each of these approaches. These results show that for studies with the goal of finding a large to moderate effect between genomic and phenotypic variables, the simple univariate analysis may be adequate. However, for studies with a large number of variables, if the association between the genomic variables and the phenotype is small to moderate, the univariate analysis method may be unable to detect the association, after adjusting for multiple testing. In contrast, if the true underlying relationship is more complex, a more comprehensive integrative analysis approach, such as SCCA, may be more suitable. For such comprehensive studies, we found that our novel weighted SCCA method outperformed the standard (non-weighted) SCCA method. However, the results from the weighted or unweighted SCCA method do not fully agree with univariate and step-wise methods. It should be noted that a limitation of our implementation of SCCA here is that we have used PCA to reduce the dimensionality of the SNPs, creating bins of SNPs. This was done since there was no feasible way, computationally, to complete SCCA on individual SNP data from large arrays. This may result in loss of information, and therefore could contribute to the differences seen between the results of other methods. Further research is required to improve the performance of the weighted SCCA method, such as enabling it to handle large numbers of variables, and determining the optimal values of weights for the maximization function. In addition, we suggest a sensitivity analysis (i.e., run analyses with a variety of thresholds and determine the impact of differing cut-points on the results).

In conclusion, applying integrative analysis methods to studies involving multiple types of genomic data may lead to novel hypotheses to be tested in future studies. For example, applying SCCA to the pharmacogenomic study of gemcitabine detected a large number of genes involved in the TNF pathway, which may contribute to gemcitabine drug response, as this pathway is extremely important in the inflammatory response and cancer development. Future functional and mechanistic studies would help to validate this finding. In addition to following-up novel hypotheses developed from integrative analysis, further research is needed to develop powerful integrative methods that are able to detect complex relationships in pharmacogenomic studies. Application of such integrative methods may uncover additional insights into the relationship between genomic variation and drug response.

Acknowledgments

This research was supported by the National Institutes of Health (grants CA140879, CA130828, CA138461, CA102701, GM61388, and GM86689), the Minnesota Partnership for Biotechnology and Medical Genomics, and the Mayo Foundation. The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.

Author Disclosure Statement

No competing financial interests exist.

References

  1. Bolstad B.M. Irizarry R.A. Astrand M. Speed T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185. [DOI] [PubMed] [Google Scholar]
  2. Fridley B.L. Jenkins G. Schaid D.J. Wang L. A Bayesian hierarchical nonlinear model for assessing the association between genetic variation and drug cytotoxicity. Statistics Med. 2009;28:2709–2722. doi: 10.1002/sim.3649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Gallant A.R. Nonlinear Statistical Models. New York: Wiley; 1987. [Google Scholar]
  4. Gauderman W.J. Murcray C. Gilliland F. Conti D.V. Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007;31:383–395. doi: 10.1002/gepi.20219. [DOI] [PubMed] [Google Scholar]
  5. Hauser M.A. Li Y.J. Takeuchi S., et al. Genomic convergence: identifying candidate genes for Parkinson's disease by combining serial analysis of gene expression and genetic linkage. Hum Molec Genet. 2003;12:671–677. [PubMed] [Google Scholar]
  6. Hotelling H. Relations between two sets of variates. Biometrika Trust. 1936;38:321–377. [Google Scholar]
  7. Huang R.S. Duan S. Kistner E.O., et al. Genetic variants contributing to daunorubicin-induced cytotoxicity. Cancer Res. 2008;68:3161–3168. doi: 10.1158/0008-5472.CAN-07-6381. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Leurgans S.E. Moyeed R.A. Silverman B.W. Canonical correlation analysis when the data are curves. J Royal Statistical Soc Series B. 1993;55:725–740. [Google Scholar]
  9. Li L. Fridley B. Kalari K., et al. Gemcitabine and cytosine arabinoside cytotoxicity: Association with lymphoblastoid cell expression. Cancer Res. 2008;68:7050–7058. doi: 10.1158/0008-5472.CAN-08-0405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Li L. Fridley B.L. Kalari K., et al. Gemcitabine and arabinosylcytosin pharmacogenomics: genome-wide association and drug response biomarkers. PLoS One 4. 2009:e7765. doi: 10.1371/journal.pone.0007765. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Lykou A. Whittaker J. Sparse CCA using a lasso with positivity constraints. Computational Statistics Data Analysis. 2010;54:3144–3157. [Google Scholar]
  12. Niu N. Qin Y. Fridley B.L., et al. Radiation pharmacogenomics: a genome-wide association approach to identify radiation response biomarkers using human lymphoblastoid cell lines. Genome Res. 2010;20:1482–1492. doi: 10.1101/gr.107672.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Parkhomenko E. Tritchler D. Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Statistical Appl Genet Molec Biol. 2009;8 doi: 10.2202/1544-6115.1406. Article 1. [DOI] [PubMed] [Google Scholar]
  14. Price A.L. Patterson N.J. Plenge R.M. Weinblatt M.E. Shadick N.A. Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
  15. Rinaldo A. Bacanu S.A. Devlin B. Sonpar V. Wasserman L. Roeder K. Characterization of multilocus linkage disequilibrium. Genet Epidemiol. 2005;28:193–206. doi: 10.1002/gepi.20056. [DOI] [PubMed] [Google Scholar]
  16. Schadt E.E. Lamb J. Yang X., et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genet. 2005;37:710–717. doi: 10.1038/ng1589. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Scheet P. Stephens M. A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–644. doi: 10.1086/502802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Via J. Santamaria I. Perez J. A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks. 2007;20:139–152. doi: 10.1016/j.neunet.2006.09.011. [DOI] [PubMed] [Google Scholar]
  19. Waaijenborg S. Verselewel de Witt Hamer P.C. Zwinderman A.H. Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Appl Genetics Molec Biol. 2008;7 doi: 10.2202/1544-6115.1329. Article 3. [DOI] [PubMed] [Google Scholar]
  20. Witten D.M. Tibshirani R.J. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Appl Genetics Molec Biol. 2009;8 doi: 10.2202/1544-6115.1470. Article 28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Wu Z. Irizarry R. Gentleman R. Martinez-Murillo F. Spencer F. A model-based background adjustment for oligonucleotide expression arrays. J Am Statistical Assn. 2004;99:909–917. [Google Scholar]
  22. Zhou J. He X. Dimension reduction based on constrained canonical correlation and variable filtering. Ann Statistics. 2008;36:1649–1668. [Google Scholar]

Articles from OMICS : a Journal of Integrative Biology are provided here courtesy of Mary Ann Liebert, Inc.

RESOURCES