Abstract
Background
Gene expression profiling yields quantitative data on gene expression used to create prognostic models that accurately predict patient outcome in diffuse large B cell lymphoma (DLBCL). Often, data are analyzed with genes classified by whether they fall above or below the median expression level. We sought to determine whether examining multiple cut-points might be a more powerful technique to investigate the association of gene expression with outcome.
Methodology/Principal Findings
We explored gene expression profiling data using variable cut-point analysis for 36 genes with reported prognostic value in DLBCL. We plotted two-group survival logrank test statistics against corresponding cut-points of the gene expression levels and smooth estimates of the hazard ratio of death versus gene expression levels. To facilitate comparisons we also standardized the expression of each of the genes by the fraction of patients that would be identified by any cut-point. A multiple comparison adjusted permutation p-value identified 3 different patterns of significance: 1) genes with significant cut-points below the median, whose loss is associated with poor outcome (e.g. HLA-DR); 2) genes with significant cut-points above the median, whose over-expression is associated with poor outcome (e.g. CCND2); and 3) genes with significant cut-points on either side of the median (e.g. extracellular molecules such as FN1).
Conclusions/Significance
Variable cut-point analysis with permutation p-value calculation can be used to identify significant genes that would not otherwise be identified with median cut-points and may suggest biological patterns of gene effects.
Introduction
Diffuse large B cell lymphoma (DLBCL) is an aggressive disease with a variable outcome. In order to quantify patient risk, numerous biomarkers have been identified that can be detected with a variety of methods. We recently described the use of a quantitative nuclease protection assay (qNPA) to measure gene expression levels from formalin fixed, paraffin embedded (FFPE) tissue blocks of DLBCL [1]. In a subsequent study of CHOP and rituximab-CHOP (R-CHOP) treated DLBCL cases, qNPA results for many genes were significantly associated with overall survival [2]. Initial data analysis was performed by categorizing patients into expression levels above and below the median level of expression. The best selected 2-variable model predicting overall survival in DLBCL was the combination of the major histocompatibility (MHC) class II antigen, HLA-DRB, and the cell cycle associated gene, MYC. In agreement with the literature, these results implicated lack of immunosurveillance and increased cell proliferation as important features that characterize the most aggressive B cell lymphomas [3]–[8].
We then further explored the relationship between expression levels and survival for these two genes. We plotted the score test statistic (logrank test statistic) from Cox regression for the association of gene expression quantile and survival, where gene expression was converted to a binary variable with cut-points defined along a continuous spectrum of low to high expression [9]. For HLA-DRB, the highest logrank statistic chi-square value indicating the most significant cut-point of gene expression was at the 20th percentile. Many other cut-points were also significant [2]. This observation was in keeping with previous data demonstrating that there is a smooth non-linear association of MHC Class II expression levels as related to patient risk, with small incremental decreases in expression corresponding to increases in the hazard ratio of death with a sharp increase in hazard at lower levels of expression [10]. For MYC the most significant cut-point was at the 80th percentile of expression. While the 80th percentile was the optimal cut-point, there was a wide range of cut-point values that were also nominally significant [2]. This has biological implications for MYC, suggesting that incremental increases in MYC expression portend a worse prognosis with the sharpest increase in risk at higher levels of expression.
In the current study, we went on to perform this variable cut-point analysis on 36 genes to determine whether we could identify genes that might have significant cut-points other than the median and how that might be a factor in the reported discrepancies in prognostic value of genes by different investigators and techniques.
Materials and Methods
Ethics Statement
The project was approved by the University of Arizona Institutional Review Board (IRB) according to the principles expressed in the Declaration of Helsinki. The University of Arizona IRB specifically waived the need for informed consent for this project.
Patient groups and mRNA data
We used the mRNA levels determined using qNPA (ArrayPlateR Assay, High Throughput Genomics, Tucson, AZ) as described previously [1], [2]. Briefly, unstained FFPE sections of 209 DLBCL, previously treated with CHOP-like regimens (N = 93) or R-CHOP (N = 116) were subjected to the qNPA procedure. This process begins with cell lysis followed by exposure to specifically designed probe sets that bind to the target mRNA of interest. S1 nuclease is used to degrade all single stranded RNA and the surviving probes are identified by binding to linker probes and detection probes on the ArrayPlateR followed by chemiluminescence and imaging. The study set of cases included FFPE blocks from cases of de novo, previously untreated DLBCL which had also been a part of 2 larger case series using gene expression profiling of snap frozen biopsies from patients treated with CHOP or R-CHOP and then later in a study of ArrayPlateR gene expression technique on the corresponding FFPE blocks [2], [11], [12]. The customized ArrayPlateR assay had been designed to assess the expression levels of 36 prognostic genes identified in DLBCL by different research groups and published in the literature. A list of the genes, their function (if known), and the reference from which they were chosen are listed in Table 1. All research was conducted under an IRB (human subjects committee) approved protocol from the University of Arizona. We obtained expression measurements with ≥95% success on all but 3 genes, and with ≤80% success on only one gene (HTR2B).
Table 1. Prognostic genes tested1.
Name in original reference | Alternative names | qNPA name | Reference | Function
BCL-6 | | BCL6* | Rosenwald 1/Lossos 6 | Transcriptional repressor that controls germinal center formation [26], [27]
IMAGE 1334260 | centerin/GCET1 (germinal center B-cell expressed transcript 1) | SERPINA9* | Rosenwald 2 | Serpin (serine protease inhibitor) [28]
IMAGE 814622 | GCET2 (germinal center B-cell expressed transcript 2)/HGAL (human germinal center-associated lymphoma) | GCET2 | Rosenwald 3 | Membrane-associated protein with a putative role in signal transduction [29]; myosin-interacting protein that is a putative inhibitor of cell migration [30]
HLA-DPa | | HLA-DPA1 | Rosenwald 4 | Antigen presentation [31]
HLA-DQa | | HLA-DQA1 | Rosenwald 5 | Antigen presentation [31]
HLA-DRa | | HLA-DRA | Rosenwald 6 | Antigen presentation [31]
HLA-DRb | | HLA-DRB* | Rosenwald 7 | Antigen presentation [31]
alpha-actinin | | ACTN1* | Rosenwald 8 | Non-muscle α-actinin isoform involved in bundling actin filaments and attaching them to focal adhesions; important for cell motility [32]
collagen type III alpha1 | | COL3A1* | Rosenwald 9 | Type III fibrillar collagen; part of the extracellular matrix in lymph nodes [33], [34]
connective tissue growth factor | | CTGF* | Rosenwald 10 | Heparin and integrin binding protein involved in extracellular matrix remodeling [35]
fibronectin | | FN1* | Rosenwald 11/Lossos 5 | Extracellular integrin ligand involved in cell adhesion [36]
KIAA0233 | Piezo1 | FAM38A | Rosenwald 12 | Multipass transmembrane protein involved in mechanotransduction and regulation of integrin activation [37], [38]
urokinase plasminogen activator | Urokinase/uPA | PLAU* | Rosenwald 13 | Serine protease that activates plasminogen which results in extracellular matrix degradation [39]
C-MYC | | MYC* | Rosenwald 14 | Transcription factor that controls proliferation, growth, metabolism, microRNAs and apoptosis [40]
E21G3 Nucleostemin | NS | C20orf155 | Rosenwald 15 | Nucleolar GTP-binding protein that regulates cell cycle by regulating p53 and maintains nucleolar structure [41], [42]
NPM3 | Nucleophosmin 3 | NPM3 | Rosenwald 16 | Nucleolar protein that inhibits ribosome biogenesis and histone assembly and enhances transcription [43], [44]
BMP6 | Bone morphogenetic protein-6 | BMP6 | Rosenwald 17 | Cytokine that regulates B-cell lymphopoiesis [45]
LMO2 | LIM domain only-2 | LMO2 | Lossos 1 | Transcription factor that regulates erythropoiesis and angiogenesis [46], [47]
BCL2 | | BCL2 | Lossos 2 | Membrane bound protein that prevents apoptosis [48]
SCYA3 | MIP-1α (macrophage inflammatory protein-1) | CCL3 | Lossos 3 | Chemokine that recruits cells to sites of inflammation and inhibits hematopoietic stem cell proliferation [49]
CCND2 | Cyclin D2 | CCND2* | Lossos 4 | Activator of cell cycle progression [50]
DRP2-dystrophin related protein 2 | | DRP2 | Shipp 1 | One of a class of structural proteins that maintains membrane–associated complexes at the points of intercellular contact [51]
PRKACB-protein kinase C beta 1 | PKCβII | PRKCB1* | Shipp 2 | Serine/threonine-specific kinase that plays a role in B-cell receptor signaling and B-cell development [52]
H731-nuclear antigen | Programmed Cell Death 4 | PDCD4* | Shipp 3 | Protein translation initiation factor inhibitor that is a putative context-specific tumor suppressor [53], [54]
3′ UTR of unknown protein | Microtubule-Associated Protein 1B | MAP1B | Shipp 4 | Protein that stabilizes microtubules, attaches other proteins to microtubules and has a putative role in microvesicle trafficking [55], [56]
Transducin-like enhancer protein 1 | Groucho | TLE1* | Shipp 5 | Transcriptional co-repressor involved in differentiation of hematopoietic cells [57], [58]
Uncharacterized | citrin | SLC25A13 | Shipp 6 | Mitochondrial inner membrane aspartate-glutamate carrier that moves aspartate to the cytosol and NADH reducing equivalents into the mitochondria [59], [60]
PDE4B Phosphodiesterase 4B, cAMP-specific | | PDE4B | Shipp 7 | Phosphodiesterase that degrades cAMP to inactivate cAMP signaling [61]
Uncharacterized | UDP-Gal:betaGlcNAc β-1,4-galactosyltransferase polypeptide 1 | B4GALT1 | Shipp 8 | Enzyme that transfers galactose to glycoproteins in a stereospecific manner; galactoproteins are involved in immune cell trafficking [62]
PRKCG Protein kinase C, gamma | | PRKCG | Shipp 9 | Serine/threonine–specific kinase activated by lipid signals and reactive oxygen species [63], [64]
Oviductal glycoprotein | MUC9 | OVGP1 | Shipp 10 | Glycoprotein secreted by oviduct epithelial cells under estrogen control [65]
(MINO/NOR1) Mitogen induced nuclear orphan receptor | | NR4A3 | Shipp 11 | Nuclear hormone receptor that regulates metabolism and inhibits leukemogenesis in a ligand-independent manner [66], [67]
Zinc-finger protein C2H2-150 | | ZNF212 | Shipp 12 | Putative transcription factor [68]
5-Hydroxytryptamine 2B receptor | | HTR2B | Shipp 13 | Serotonin receptor isotype involved in tumorigenesis [69], [70]
Catalase | | CAT | Tome 1 | Peroxisomal enzyme that metabolizes H2O2 [71]
Manganese superoxide dismutase | | SOD2 | Tome 2 | Mitochondrial enzyme that metabolizes superoxide [72]
Variable cut-point and smooth hazard regression analysis
Variable cut-point (or split-point) analysis was performed on all 36 genes in order to discriminate between groups of patients with the most significant differences in overall survival. This statistical technique calculates the score test statistic from a Cox model (analogous to the logrank statistic) at a continuous spectrum of cut-points on the gene expression variable [13]. (Typically, the maximum statistic is used to define the best split of patients.) In the plot (Figure 1), the vertical axis corresponds to the score statistic on the standard normal scale. To adjust for the evaluation of the large number of cut-point models, permutation sampling is used to control the family-wise type 1 error for each gene. The permutation p-values presented in the cut-point plots are based on 1000 samples, and the horizontal line on each plot corresponds to the 90th percentile of the sampled permutation distribution of the maximum test statistic. Therefore, a cut-point statistical test reaching above the horizontal line has a permutation adjusted p-value of <0.10 [9]. Note that the 90th percentile horizontal lines for the genes are at approximately 2.5 for most gene expression variables; if there were no adjustment for multiple comparisons, a value of 1.64 would correspond to a p-value of 0.1. Without this adjustment there would be a tendency to falsely believe that moderately large test statistics correspond to a real association, when observed associations could simply be due to the large number of cut-point models that have been investigated. In addition, to control statistical variability, a minimum possible subgroup size of 10% of total patients was set for each analysis. Since our previous test of panel-wide interaction between the CHOP and R-CHOP groups had shown no significance, we combined the 2 data sets for purposes of the current analysis [2].
However, the cut-point technique adjusted for treatment group (CHOP versus R-CHOP) as a main effect in the relative risk regression model, since R-CHOP is well known to be associated with improved survival. The cut-point technique also allows for more general adjustment of an existing prognostic model to assess the statistical significance of the addition of a new gene expression variable and cut-point. Analyses presented are based on overall survival, where overall survival is defined as the time from study registration until death. Patients without an observed death time are censored at the last known time under follow-up.
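The maximally selected statistic and its permutation adjustment can be illustrated with a minimal Python sketch. This is not the authors' software. The function names (`logrank_z`, `max_cutpoint_stat`, `permutation_adjusted`) are hypothetical, the absolute value of the two-group logrank statistic is assumed as the selection criterion, and the adjustment for treatment group described above is omitted for brevity.

```python
import numpy as np

def logrank_z(time, event, group):
    """Two-group logrank statistic on the standard normal scale."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):          # loop over distinct death times
        at_risk = time >= t
        dead = (time == t) & event
        n, n1, d = at_risk.sum(), (at_risk & group).sum(), dead.sum()
        obs_minus_exp += (dead & group).sum() - d * n1 / n
        if n > 1:                             # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp / np.sqrt(var) if var > 0 else 0.0

def max_cutpoint_stat(time, event, x, min_frac=0.10):
    """Best cut value and max |Z| over cuts leaving >= min_frac of patients per group."""
    x = np.asarray(x, dtype=float)
    min_size = int(np.ceil(min_frac * len(x)))
    best_cut, best_z = None, 0.0
    for c in np.unique(x):
        high = x > c                          # "high expression" group for this cut
        if min(high.sum(), (~high).sum()) < min_size:
            continue
        z = abs(logrank_z(time, event, high))
        if z > best_z:
            best_cut, best_z = c, z
    return best_cut, best_z

def permutation_adjusted(time, event, x, n_perm=1000, min_frac=0.10, seed=0):
    """Observed max statistic, its permutation-adjusted p-value, and the 90th percentile line."""
    rng = np.random.default_rng(seed)
    cut, z_obs = max_cutpoint_stat(time, event, x, min_frac)
    # Null distribution: shuffle expression against outcome, recompute the maximum.
    null_max = np.array([
        max_cutpoint_stat(time, event, rng.permutation(x), min_frac)[1]
        for _ in range(n_perm)
    ])
    p_adj = (1 + np.sum(null_max >= z_obs)) / (1 + n_perm)
    threshold = np.quantile(null_max, 0.90)
    return cut, z_obs, p_adj, threshold
```

Because each permutation recomputes the maximum over all admissible cut-points, the resulting p-value accounts for the search over cut-points rather than for a single pre-specified split.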
Figure 1. Graphs for each of the 13 genes with a significant logrank statistic (Z-value).
On the Y-axis, an unadjusted score statistic of 2 corresponds to a p-value of approximately 0.05. On the X-axis, a value of 0.1 corresponds to the 10th percentile of gene expression, 0.2 to the 20th percentile, and so on up to the 90th percentile of expression. Different cut-point values assessed for each gene are represented by the dots along the connected line of chi-square values. The solid horizontal line represents the 90th percentile of the permutation distribution of the maximal score statistics. The range on the x-axis is from 10% to 90% of the distribution of the gene expression variable. An overall p-value adjusted for the permutation analysis is shown along the right-hand Y-axis.
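The correspondence between unadjusted score statistics and p-values quoted in the text can be checked with a short calculation. The sketch below assumes two-sided p-values from the standard normal distribution; the helper name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a statistic z on the standard normal scale."""
    return math.erfc(abs(z) / math.sqrt(2))

# Unadjusted p-values for the statistic values mentioned in the text.
for z in (1.64, 2.0, 2.5):
    print(f"Z = {z:.2f}: unadjusted two-sided p = {two_sided_p(z):.3f}")
```

Under this assumption, a statistic of 1.64 gives an unadjusted p-value of about 0.10 and a statistic of 2 gives about 0.05, whereas a statistic near 2.5 (the approximate height of the permutation line) would have an unadjusted p-value of about 0.012. The gap between 0.012 and 0.10 reflects the cost of searching over many cut-points.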
While the cut-point evaluation allows the assessment of statistical significance of multiple partitions of a gene expression variable, it does not directly lead to an estimate of the underlying regression function representing how gene expression is associated with survival. Therefore, we also used hazard regression modeling (based on a B-spline basis) to calculate smooth estimates of the hazard function for each gene [14]. An alternative estimation strategy for smooth hazard regression functions is by local likelihood [15]. In addition, we transformed each gene expression variable to be approximately uniformly distributed to make the analysis consistent with the cut-point analysis, which only depends on the rank of the gene expression variables. As done for the cut-point analysis, we adjusted for the two treatment groups (CHOP versus R-CHOP) via main effect in the relative risk regression model.
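One simple way to place expression values on an approximately uniform scale is a rank transformation, sketched below. The exact transformation used in the analysis is not specified here, so the choice of rank divided by n + 1 and the function name `to_quantile_scale` are assumptions made for illustration.

```python
import numpy as np

def to_quantile_scale(expression):
    """Map expression values to ranks scaled into (0, 1); ties are broken by order of appearance."""
    x = np.asarray(expression, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)   # rank 1 = lowest expression
    return ranks / (len(x) + 1)
```

Because only ranks are used, any monotone rescaling of the raw expression values (for example, a log transform) yields the same transformed variable, which matches the rank-only dependence of the cut-point analysis.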
While our combination of cut-point analysis and smooth hazard regression modeling is useful for interpreting individual effects of a small set of continuous biological measurements, such as gene expression with censored survival patient outcome, there are other related statistical methods available for multivariable modeling and subgroup analysis. For instance, with respect to smooth regression modeling, there has been considerable study of generalized additive models, which consist of additive combinations of smooth univariate regression functions. For deriving subgroups in the context of many variables, the cut-point methods we proposed can be utilized recursively to cut up or partition the data on multiple covariates to construct regression trees [16]. There is an extensive discussion of other statistical or machine learning algorithms in Hastie et al. [17]. Due to the complexity of some of the multivariable models, their use is often better applied to patient prognostic predictions or subgroup stratification rather than probing the interpretation and clinical impact of individual gene expression measurements. In addition, alternatives to the smooth hazard regression models based on locally estimated quantiles of the survival distribution can be helpful for exploring gene effects [18]; however, we chose the hazard based methods for our exploration of DLBCL gene expression data given the relatively modest sample size. In addition, hazard regression methods tend to achieve better variance control in such cases.
Results
We first generated a series of graphs for each of the 13 genes with significant logrank statistic (Z-value) (Fig. 1). The different cut-point values assessed for each gene are represented by the dots along the connected line of chi-square values. The solid horizontal line represents the 90th percentile of the permutation distribution of the maximal score statistics under the assumption the gene is not associated with patient outcome (i.e., under the null hypothesis). Given the exploratory nature of this analysis, all values with a significance cut-off above the 90th percentile line (type 1 error of 0.10) were considered significant. An overall p-value adjusted for the permutation analysis is presented on each of the panels. Note that only score statistics for cut-points that generate subgroups of patients with ≥10% of the sample size were considered, since smaller groups would probably not be considered useful clinically. We think it is useful to plot the cut-point analysis against the quantile of the gene expression distribution so that one could just read what fraction of the sample is above or below the cut-point.
Thirteen out of the 36 genes (36%) had at least 1 significant cut-point at p<0.10, including SERPINA9, HLA-DRB, ACTN1, COL3A1, CTGF, FN1, PLAU, MYC, BCL6, CCND2, PRKCB1, PDCD4, and TLE1. Of these, 10 (77%) would have been significant at a pre-specified cut-point at the median (SERPINA9, ACTN1, COL3A1, CTGF, PLAU, MYC, BCL6, CCND2, PDCD4, and TLE1) and 3 genes (23% of the 13 genes) would not have been significant (HLA-DRB, FN1, and PRKCB1). Therefore, the median cut-point analysis would have missed detecting the significance of a notable selection of genes.
Inspection of the graphs revealed patterns that allowed us to classify the results into 3 different groups. The first group was defined as those genes with the significant cut-points only below the median. The second group was defined as genes with significant cut-points only above the median. The third group was defined as genes with significant cut-points above, below, or including the median.
The single gene in the first category was HLA-DRB, with the highest chi-square values all below the median and the most significant cut-point at the 20th percentile. This pattern is consistent with a gene whose loss is associated with poor outcome.
The two genes that fell into the second category, showing significant cut-points above the median gene expression values, were CCND2 and PRKCB1. CCND2 is a G1/S-specific regulator of cyclin-dependent kinases, and PRKCB1 functions as a serine- and threonine-specific protein kinase. This second pattern is consistent with genes whose over-expression is associated with poor outcome.
Ten genes fell into the third category, with significant cut-points above and below the median gene expression values. The genes in this category included ACTN1, COL3A1, FN1, CTGF, PLAU, TLE1, PDCD4, MYC, SERPINA9, and BCL6. The first 5 of these 10 genes code for extracellular molecules. PDCD4 codes for an apoptosis related molecule, MYC is associated with proliferation and other cellular processes, while SERPINA9 and BCL6 are related to germinal center formation. While it was not explored in this analysis, an extended strategy for constructing prognostic groups of patients with significant cut-points at multiple points in the gene expression distribution (i.e., above and below the median) could be implemented. Here, a stage-wise approach would be appropriate. First, the maximal cut-point with all of the data would be identified; this defines two subgroups of patients. Next, evidence of a significant cut-point in either of the two remaining subgroups would be assessed. As before, permutation resampling methods would be used to determine evidence of further cut-points; this would indicate whether more than two prognostic groups, based on that gene, are needed.
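The stage-wise strategy outlined above was not implemented in this study, but its logic can be sketched in Python under stated assumptions. All function names are hypothetical, the absolute logrank statistic is assumed as the criterion, treatment adjustment is omitted, and `min_size` is an absolute minimum subgroup size chosen by the analyst.

```python
import numpy as np

def logrank_z(time, event, high):
    """Two-group logrank statistic on the standard normal scale."""
    num = var = 0.0
    for t in np.unique(time[event]):
        risk = time >= t
        dead = (time == t) & event
        n, n1, d = risk.sum(), (risk & high).sum(), dead.sum()
        num += (dead & high).sum() - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return num / np.sqrt(var) if var > 0 else 0.0

def best_cut(time, event, x, min_size):
    """Cut with the largest |Z| among cuts leaving >= min_size patients per side."""
    best = (None, 0.0)
    for c in np.unique(x):
        high = x > c
        if min(high.sum(), (~high).sum()) >= min_size:
            z = abs(logrank_z(time, event, high))
            if z > best[1]:
                best = (c, z)
    return best

def stagewise_cuts(time, event, x, min_size, n_perm=200, alpha=0.10, rng=None):
    """Recursively add cut-points while the permutation-adjusted p-value is below alpha."""
    time, x = np.asarray(time, dtype=float), np.asarray(x, dtype=float)
    event = np.asarray(event, dtype=bool)
    rng = np.random.default_rng(0) if rng is None else rng
    cut, z_obs = best_cut(time, event, x, min_size)
    if cut is None:
        return []
    null = [best_cut(time, event, rng.permutation(x), min_size)[1] for _ in range(n_perm)]
    p_adj = (1 + sum(z >= z_obs for z in null)) / (1 + n_perm)
    if p_adj >= alpha:
        return []
    low = x <= cut                            # split, then search each subgroup
    return (stagewise_cuts(time[low], event[low], x[low], min_size, n_perm, alpha, rng)
            + [cut]
            + stagewise_cuts(time[~low], event[~low], x[~low], min_size, n_perm, alpha, rng))
```

The function returns the selected cut-points in increasing order, so k cut-points define k + 1 candidate prognostic groups. Note that testing within subgroups after a first split introduces further multiplicity that this sketch does not address.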
Analysis of the cut-point graphs indicates whether or not expression of a particular gene is critical for patient outcome. However, to understand the impact of increasing or decreasing expression of a particular gene on patient outcome and gain insight into the tumor biology we generated hazard regression functions for the 13 genes with significant cut-points (Fig. 2). A hazard function that is increasing with respect to gene expression indicates a worse prognosis (or survival) with higher gene expression; conversely, a decreasing function implies improved survival for higher gene expression. The hazard regression functions confirm the importance of these genes and indicate whether an increase or decrease in expression is associated with better or worse patient survival. For example, examination of the hazard regression functions is in agreement with the known data on MYC. MYC over-expression in DLBCL results from translocations, increased gene copy number, or other mechanisms, and correlates with poor patient outcome [19]–[21].
Figure 2. Hazard regression functions for the 13 genes with significant cut-points.
The Y-axis shows the log of the hazard ratio of death. The X-axis shows the quantile of gene expression. The thin lines show the 90% confidence intervals.
In secondary analysis, we assessed whether adjustment for the International Prognostic Index (IPI) [22] mitigated the effect of gene expression on survival for the 13 genes described above. Results were similar, with ten of the 13 genes achieving family-wise error rate of <0.10.
Discussion
While a large amount of effort in recent years has been devoted to evaluating thousands of genes from unfixed, snap frozen tissue, we have focused on a more detailed analysis of a smaller number of genes using FFPE. In this paper, we investigated the use of different cut-points for determining gene significance, which we applied here for the first time on GEP data for 36 genes on paraffin embedded tissue. We show that while using the median cut-point is often useful, the significance of some genes may be missed when the effect is limited to patients with only markedly high or low (rather than median) levels of expression.
Therefore, we believe the results more generally show that the variable cut-point method is a powerful tool to explore the relationship of gene expression data with outcomes. The strategy produces a sequence of decision rules to directly identify a group of patients, and hence, has a potential role in the translation of results to other studies. The second tool, smooth hazard regression, allows a finer understanding of the underlying biological relationships of gene expression with patient survival, but does not produce a decision rule. Therefore, this pair of tools together allows a fuller interrogation of gene expression data, an approach which has been largely overlooked under the current paradigm of performing simple univariate analyses at a genome-wide level. In practice, a cut-point derived by the methods we propose can be used if there is not a specific cut-point of interest specified based on prior research. Our proposal would be to evaluate cut-points over a range of clinical interest. The choice of cut-point for subsequent clinical applications would often be the one that gave the largest test statistic value (or smallest p-value). However, one may choose other significant cut-point values that lead to larger subgroups depending on the clinical need in future studies. Importantly, given the multiple possible cut-points evaluated, the methodology includes an algorithm (permutation resampling) to control for potential false positive selection of a cut-point; that is, where there may not be a true association with patient outcome.
While we have focused our analysis and discussion on the understanding of individual genes, it is important to note that a cut-point algorithm can also be used to explore and draw inferences into whether or not other adaptively selected models might improve the existing prognostic models. Given a model with a set of specified variables and cut-points, the method allows one to statistically evaluate all cut-points over all remaining genes to see if any other variables would improve model performance. We assessed whether our prior model that included HLA-DRB and MYC could be improved by applying this method. We found that inclusion of the gene PDCD4, with a cut-point at the upper 27th percentile of its distribution, had an adjusted p-value (controlling for multiple comparisons) of 0.001 to enter the model. Therefore, the 3-gene model including HLA-DRB, MYC, and PDCD4 appears to be preferred statistically over the prior 2-gene model. This improved model would likely not have been evident without using cut-point methodology.
In this project, a single median cut-point approach would have missed detecting a notable subset (23%) of the genes that were most significantly associated with survival at lower or higher expression cut-points. This may account for differences in significance of certain genes reported between different studies. Since a near complete loss of gene expression or high over-expression may be a relatively infrequent event for certain genes in some tumor types, these 2 categories of genes may be overlooked in general data analysis using median cut-points. We note that both in this data set and others, the statistically significant association of HLA-DR gene expression with survival would have been missed if only the median value of expression had been investigated.
Laboratory methods that either minimize or maximize signal will tend to underestimate the significance of genes with significant data cut-points at lower or higher levels of gene expression. For example, immunohistochemistry (IHC) often runs the chemical reaction through to equilibrium and may therefore over-estimate protein expression of genes by favoring a strong positive reaction. Furthermore, IHC is usually interpreted with simple descriptions of positive and negative staining based on visual inspection. Therefore, IHC strongly dichotomizes data and may miss the significance of lower or higher amounts of protein. Conversely methods that rely on high amounts of target for detection may also not reveal genes that are most significant at low levels of expression. It is therefore apparent that quantitative data with an appropriate dynamic range will be the most effective for exploring gene and protein expression patterns that play a prognostic role in DLBCL and other cancers. This factor might account for some of the discrepancies seen between gene expression and follow up confirmatory studies on their protein products.
By grouping similar hazard regression function patterns, we can speculate about the biological roles of the significant genes in DLBCL. These groups can differ somewhat from the categories generated in the cut-point analysis. Genes for which high expression is correlated with poor survival could be roughly described as oncogenes. MYC is a charter member of this category. Inspection of the MYC hazard regression function indicates that incremental increases have incremental effects on survival. This category would include CCND2, a protein closely related to proliferation, which has long been linked to outcome in DLBCL and mantle cell lymphomas [7], [12], [23]. PDCD4 also fits this pattern in DLBCL although studies in other cell types suggest PDCD4 can play a tumor suppressor role in other contexts [24].
Another hazard ratio pattern could be roughly described as genes for which loss of expression is associated with poor outcome. These genes have characteristics of tumor suppressor genes and include HLA-DRB. The pattern for HLA-DRB, which shows a sharp increase in hazard at lower levels of expression, also fits our previous data showing a loss of HLA-DRB is associated with poor outcome [2]. Previously, we had demonstrated an incrementally worse overall survival in patients as average major histocompatibility class II (MHC II) gene expression values (of which HLA-DRB is a principal gene) decreased by quantiles with the poorest outcome in patients at the 25th percentile and below [10]. The current data also agree with our previous analysis that showed a non-linear association of HLA-DRA (part of the HLA-DR heterodimer) with patient hazard ratio of death, specifically with a sharp increase in hazard at lower levels of expression [10]. A comparison of the hazard regression functions for genes with a less well understood role in DLBCL to those of MYC and HLA-DRB provides insight as to their biological significance.
A third hazard ratio pattern is the genes with impact on survival especially at high and low expression. This pattern is most pronounced for COL3A1 and FN1, but PLAU also has this pattern. A gene expression pattern like this argues for threshold effects rather than a rheostat where incremental increases have incremental effects on survival. This type of pattern could reflect a requirement for other proteins in a complex to exert the full biological effect. Alternatively, this pattern could reflect a different impact of the gene in subgroups of DLBCL such as the cell of origin subtypes previously identified by GEP including germinal center B cell and activated B cell types [11], [12], [25]. The information from the hazard regression functions provides the basis for developing testable hypotheses to determine the importance of these genes for DLBCL biology.
In summary, we have demonstrated a method of statistical analysis that can be applied to GEP data and may reveal interesting associations with patient outcome. In particular, when data are evaluated by being split at expression levels other than the median, additional genes that correlate with patient outcome may be identified. A key component of the analysis is the use of the appropriate statistical techniques to control for false positive findings. To this end we have found re-sampling (in this study permutation sampling) to be an extremely useful strategy to avoid over-interpretation of flexible exploratory analyses such as cut-point techniques. Finally, while these genes and their cut-points will need to be validated in future studies, the results presented here may serve as hypothesis generating tools with regard to the use of particular genes at particular cut-points, with possible implications for gene and tumor biology.
Software implementing the adjusted cut-point analysis is available from the final author.
Acknowledgments
We acknowledge the contribution of High Throughput Genomics, Tucson, AZ which generated the data on which this analysis was based. We thank Dr. Sarah T. Wilkinson for critical reading of the manuscript.
Footnotes
Competing Interests: The authors have read the journal's policy and have the following conflicts: the reagents used in this project were donated free of charge from High Throughput Genomics. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.
Funding: Funding for this work was provided by NIH grant R01 CA90998 (PI LeBlanc) and American Cancer Society grant RSG0605501LIB (PI Rimsza). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Roberts RA, Sabalos CM, LeBlanc ML, Martel RR, Frutiger YM, et al. Quantitative nuclease protection assay in paraffin-embedded tissue replicates prognostic microarray gene expression in diffuse large-B-cell lymphoma. Laboratory Investigation. 2007;87:979–997. doi: 10.1038/labinvest.3700665.
- 2. Rimsza LM, LeBlanc ML, Unger JM, Miller TP, Grogan TM, et al. Gene expression predicts overall survival in paraffin-embedded tissues of diffuse large B-cell lymphoma treated with R-CHOP. Blood. 2008;112:3425–3433. doi: 10.1182/blood-2008-02-137372.
- 3. Chang KC, Huang GC, Jones D, Lin YH. Distribution patterns of dendritic cells and T cells in diffuse large B-cell lymphomas correlate with prognoses. Clin Cancer Res. 2007;13:6666–6672. doi: 10.1158/1078-0432.CCR-07-0504.
- 4. Dave SS, Fu K, Wright GW, Lam LT, Kluin P, et al. Molecular diagnosis of Burkitt's lymphoma. N Engl J Med. 2006;354:2431–2442. doi: 10.1056/NEJMoa055759.
- 5. Hummel M, Bentink S, Berger H, Klapper W, Wessendorf S, et al. A biologic definition of Burkitt's lymphoma from transcriptional and genomic profiling. New England Journal of Medicine. 2006;354:2419–2430. doi: 10.1056/NEJMoa055351.
- 6. List AF, Spier CM, Miller TP, Grogan TM. Deficient tumor-infiltrating T-lymphocyte response in malignant lymphoma: relationship to HLA expression and host immunocompetence. Leukemia. 1993;7:398–403.
- 7. Miller TP, Grogan TM, Dahlberg S, Spier CM, Braziel RM, et al. Prognostic-Significance of the Ki-67 Associated Proliferative Antigen in Aggressive Non-Hodgkins-Lymphomas - A Prospective Southwest-Oncology-Group Trial. Blood. 1994;83:1460–1466.
- 8. Rybski JA, Spier CM, Miller TP, Lippman SM, McGee D, et al. Prediction of outcome in diffuse large cell lymphoma by the major histocompatibility complex Class I (HLA-A, -B, -C) and Class II (HLA-DR, -DP, -DQ) phenotype. Leukemia Lymphoma. 1991;6:31–38. doi: 10.3109/10428199109064876.
- 9. LeBlanc M, Crowley J. Step-function covariate effects in the proportional hazards model. Canadian Journal of Statistics-Revue Canadienne de Statistique. 1995;23:109–129.
- 10. Rimsza LM, Roberts RA, Miller TP, Unger JM, LeBlanc M, et al. Loss of MHC class II gene and protein expression in diffuse large B-cell lymphoma is related to decreased tumor immunosurveillance and poor patient survival regardless of other prognostic factors: a follow-up study from the Leukemia and Lymphoma Molecular Profiling Project. Blood. 2004;103:4251–4258. doi: 10.1182/blood-2003-07-2365.
- 11. Lenz G, Wright G, Dave SS, Xiao W, Powell J, et al. Stromal Gene Signatures in Large-B-Cell Lymphomas. New England Journal of Medicine. 2008;359:2313–2323. doi: 10.1056/NEJMoa0802885.
- 12. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346:1937–1947. doi: 10.1056/NEJMoa012914.
- 13. Cox D. Regression models and life tables. Journal of the Royal Statistical Society B. 1972;B34:187–220.
- 14. Sleeper LA, Harrington DP. Regression Splines in the Cox Model with Application to Covariate Effects in Liver-Disease. Journal of the American Statistical Association. 1990;85:941–949.
- 15. Gentleman R, Crowley J. Local full likelihood estimation for the proportional hazards model. Biometrics. 1991;47:1283–1296.
- 16. LeBlanc M, Rasmussen E, Crowley J. Constructing Prognostic Groups by Tree-Based Partitioning and Peeling Methods. In: Crowley J, Hoering A, Ankerst DP, editors. Handbook of Statistics in Clinical Oncology. New York: Chapman and Hall; 2005. pp. 365–382.
- 17. Hastie T, Tibshirani R, Friedman J. Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag; 2009.
- 18. Bowman AW, Wright EM. Graphical exploration of covariate effects on survival data through nonparametric quantile curves. Biometrics. 2000;56:563–570. doi: 10.1111/j.0006-341x.2000.00563.x.
- 19. Akasaka T, Akasaka H, Ueda C, Yonetani N, Maesako Y, et al. Molecular and clinical features of non-Burkitt's, diffuse large-cell lymphoma of B-cell type associated with the c-MYC/immunoglobulin heavy-chain fusion gene. Journal of Clinical Oncology. 2000;18:510–518. doi: 10.1200/JCO.2000.18.3.510.
- 20. Pienkowska-Grela B, Witkowska A, Grygalewicz B, Rymkiewicz G, Rygier J, et al. Frequent aberrations of chromosome 8 in aggressive B-cell non-Hodgkin lymphoma. Cancer Genetics and Cytogenetics. 2005;156:114–121. doi: 10.1016/j.cancergencyto.2004.04.009.
- 21. Vitolo U, Gaidano G, Botto B, Volpe G, Audisio E, et al. Rearrangements of bcl-6, bcl-2, c-myc and 6q deletion in B-diffuse large-cell lymphoma: Clinical relevance in 71 patients. Annals of Oncology. 1998;9:55–61. doi: 10.1023/a:1008201729596.
- 22. Shipp MA, Harrington DP, Anderson JR, Armitage JO, Bonadonna G, et al. A Predictive Model for Aggressive Non-Hodgkins-Lymphoma. N Engl J Med. 1993;329:987–994. doi: 10.1056/NEJM199309303291402.
- 23. Iqbal J, Sanger WG, Horsman DE, Rosenwald A, Pickering DL, et al. BCL2 translocation defines a subset of DLBCL with germinal center B-cell-like gene expression profiles and preferential expression of a set of genes. Blood. 2003;102:884A.
- 24. Zhang SH, Li JF, Jiang Y, Xu YJ, Qin CY. Programmed cell death 4 (PDCD4) suppresses metastatic potential of human hepatocellular carcinoma cells. Journal of Experimental & Clinical Cancer Research. 2009;28. doi: 10.1186/1756-9966-28-71.
- 25. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511. doi: 10.1038/35000501.
- 26. Chang CC, Ye BH, Chaganti RS, Dalla-Favera R. BCL-6, a POZ/zinc-finger protein, is a sequence-specific transcriptional repressor. Proc Natl Acad Sci U S A. 1996;93:6947–6952. doi: 10.1073/pnas.93.14.6947.
- 27. Ye BH, Cattoretti G, Shen Q, Zhang J, Hawe N, et al. The BCL-6 proto-oncogene controls germinal-centre formation and Th2-type inflammation. Nat Genet. 1997;16:161–170. doi: 10.1038/ng0697-161.
- 28. Paterson MA, Horvath AJ, Pike RN, Coughlin PB. Molecular characterization of centerin, a germinal centre cell serpin. Biochem J. 2007;405:489–494. doi: 10.1042/BJ20070174.
- 29. Pan Z, Shen Y, Ge B, Du C, McKeithan T, et al. Studies of a germinal centre B-cell expressed gene, GCET2, suggest its role as a membrane associated adapter protein. Br J Haematol. 2007;137:578–590. doi: 10.1111/j.1365-2141.2007.06597.x.
- 30. Lu X, Chen J, Malumbres R, Cubedo GE, Helfman DM, et al. HGAL, a lymphoma prognostic biomarker, interacts with the cytoskeleton and mediates the effects of IL-6 on cell migration. Blood. 2007;110:4268–4277. doi: 10.1182/blood-2007-04-087775.
- 31. Ting JP, Trowsdale J. Genetic control of MHC class II expression. Cell. 2002;109(Suppl):S21–S33. doi: 10.1016/s0092-8674(02)00696-7.
- 32. Otey CA, Carpen O. Alpha-actinin revisited: a fresh look at an old player. Cell Motil Cytoskeleton. 2004;58:104–111. doi: 10.1002/cm.20007.
- 33. Cooper TK, Zhong Q, Krawczyk M, Tae HJ, Muller GA, et al. The haploinsufficient col3a1 mouse as a model for vascular ehlers-danlos syndrome. Vet Pathol. 2010;47:1028–1039. doi: 10.1177/0300985810374842.
- 34. Karttunen T, Sormunen R, Risteli L, Risteli J, Autio-Harmainen H. Immunoelectron microscopic localization of laminin, type IV collagen, and type III pN-collagen in reticular fibers of human lymph nodes. J Histochem Cytochem. 1989;37:279–286. doi: 10.1177/37.3.2918219.
- 35. Moussad EE, Brigstock DR. Connective tissue growth factor: what's in a name? Mol Genet Metab. 2000;71:276–292. doi: 10.1006/mgme.2000.3059.
- 36. Pankov R, Yamada KM. Fibronectin at a glance. J Cell Sci. 2002;115:3861–3863. doi: 10.1242/jcs.00059.
- 37. Coste B, Mathur J, Schmidt M, Earley TJ, Ranade S, et al. Piezo1 and Piezo2 are essential components of distinct mechanically activated cation channels. Science. 2010;330:55–60. doi: 10.1126/science.1193270.
- 38. McHugh BJ, Buttery R, Lad Y, Banks S, Haslett C, et al. Integrin activation by Fam38A uses a novel mechanism of R-Ras targeting to the endoplasmic reticulum. J Cell Sci. 2010;123:51–61. doi: 10.1242/jcs.056424.
- 39. Smith HW, Marshall CJ. Regulation of cell signalling by uPAR. Nat Rev Mol Cell Biol. 2010;11:23–36. doi: 10.1038/nrm2821.
- 40. Klapproth K, Wirth T. Advances in the understanding of MYC-induced lymphomagenesis. Br J Haematol. 2010;149:484–497. doi: 10.1111/j.1365-2141.2010.08159.x.
- 41. Dai MS, Sun XX, Lu H. Aberrant expression of nucleostemin activates p53 and induces cell cycle arrest via inhibition of MDM2. Mol Cell Biol. 2008;28:4365–4376. doi: 10.1128/MCB.01662-07.
- 42. Romanova L, Kellner S, Katoku-Kikyo N, Kikyo N. Novel role of nucleostemin in the maintenance of nucleolar architecture and integrity of small nucleolar ribonucleoproteins and the telomerase complex. J Biol Chem. 2009;284:26685–26694. doi: 10.1074/jbc.M109.013342.
- 43. Gadad SS, Shandilya J, Kishore AH, Kundu TK. NPM3, a member of the nucleophosmin/nucleoplasmin family, enhances activator-dependent transcription. Biochemistry. 2010;49:1355–1357. doi: 10.1021/bi9021632.
- 44. Huang N, Negi S, Szebeni A, Olson MO. Protein NPM3 interacts with the multifunctional nucleolar protein B23/nucleophosmin and inhibits ribosome biogenesis. J Biol Chem. 2005;280:5496–5502. doi: 10.1074/jbc.M407856200.
- 45. Kersten C, Dosen G, Myklebust JH, Sivertsen EA, Hystad ME, et al. BMP-6 inhibits human bone marrow B lymphopoiesis–upregulation of Id1 and Id3. Exp Hematol. 2006;34:72–81. doi: 10.1016/j.exphem.2005.09.010.
- 46. Warren AJ, Colledge WH, Carlton MB, Evans MJ, Smith AJ, et al. The oncogenic cysteine-rich LIM domain protein rbtn2 is essential for erythroid development. Cell. 1994;78:45–57. doi: 10.1016/0092-8674(94)90571-1.
- 47. Yamada Y, Pannell R, Forster A, Rabbitts TH. The oncogenic LIM-only transcription factor Lmo2 regulates angiogenesis but not vasculogenesis in mice. Proc Natl Acad Sci U S A. 2000;97:320–324. doi: 10.1073/pnas.97.1.320.
- 48. Leber B, Lin J, Andrews DW. Still embedded together binding to membranes regulates Bcl-2 protein interactions. Oncogene. 2010;29:5221–5230. doi: 10.1038/onc.2010.283.
- 49. Menten P, Wuyts A, Van Damme J. Macrophage inflammatory protein-1. Cytokine Growth Factor Rev. 2002;13:455–481. doi: 10.1016/s1359-6101(02)00045-x.
- 50. Chiles TC. Regulation and function of cyclin D2 in B lymphocyte subsets. J Immunol. 2004;173:2901–2907. doi: 10.4049/jimmunol.173.5.2901.
- 51. Roberts RG, Bobrow M. Dystrophins in vertebrates and invertebrates. Hum Mol Genet. 1998;7:589–595. doi: 10.1093/hmg/7.4.589.
- 52. Abrams ST, Brown BR, Zuzel M, Slupsky JR. Vascular endothelial growth factor stimulates protein kinase CbetaII expression in chronic lymphocytic leukemia cells. Blood. 2010;115:4447–4454. doi: 10.1182/blood-2009-06-229872.
- 53. Suzuki C, Garces RG, Edmonds KA, Hiller S, Hyberts SG, et al. PDCD4 inhibits translation initiation by binding to eIF4A using both its MA3 domains. Proc Natl Acad Sci U S A. 2008;105:3274–3279. doi: 10.1073/pnas.0712235105.
- 54. Allgayer H. Pdcd4, a colon cancer prognostic that is regulated by a microRNA. Crit Rev Oncol Hematol. 2010;73:185–191. doi: 10.1016/j.critrevonc.2009.09.001.
- 55. Halpain S, Dehmelt L. The MAP1 family of microtubule-associated proteins. Genome Biol. 2006;7:224. doi: 10.1186/gb-2006-7-6-224.
- 56. Bialik S, Kimchi A. Lethal weapons: DAP-kinase, autophagy and cell death: DAP-kinase regulates autophagy. Curr Opin Cell Biol. 2010;22:199–205. doi: 10.1016/j.ceb.2009.11.004.
- 57. Desjobert C, Noy P, Swingler T, Williams H, Gaston K, et al. The PRH/Hex repressor protein causes nuclear retention of Groucho/TLE co-repressors. Biochem J. 2009;417:121–132. doi: 10.1042/BJ20080872.
- 58. Swingler TE, Bess KL, Yao J, Stifani S, Jayaraman PS. The proline-rich homeodomain protein recruits members of the Groucho/Transducin-like enhancer of split protein family to co-repress transcription in hematopoietic cells. J Biol Chem. 2004;279:34938–34947. doi: 10.1074/jbc.M404488200.
- 59. Palmieri L, Pardo B, Lasorsa FM, del Arco A, Kobayashi K, et al. Citrin and aralar1 are Ca(2+)-stimulated aspartate/glutamate transporters in mitochondria. EMBO J. 2001;20:5060–5069. doi: 10.1093/emboj/20.18.5060.
- 60. Saheki T, Iijima M, Li MX, Kobayashi K, Horiuchi M, et al. Citrin/mitochondrial glycerol-3-phosphate dehydrogenase double knock-out mice recapitulate features of human citrin deficiency. J Biol Chem. 2007;282:25041–25052. doi: 10.1074/jbc.M702031200.
- 61. Houslay MD. Underpinning compartmentalised cAMP signalling through targeted cAMP breakdown. Trends Biochem Sci. 2010;35:91–100. doi: 10.1016/j.tibs.2009.09.007.
- 62. Sperandio M, Gleissner CA, Ley K. Glycosylation in immune cell trafficking. Immunol Rev. 2009;230:97–113. doi: 10.1111/j.1600-065X.2009.00795.x.
- 63. Barnett ME, Madgwick DK, Takemoto DJ. Protein kinase C as a stress sensor. Cell Signal. 2007;19:1820–1829. doi: 10.1016/j.cellsig.2007.05.014.
- 64. Martiny-Baron G, Fabbro D. Classical PKC isoforms in cancer. Pharmacol Res. 2007;55:477–486. doi: 10.1016/j.phrs.2007.04.001.
- 65. Buhi WC. Characterization and biological roles of oviduct-specific, oestrogen-dependent glycoprotein. Reproduction. 2002;123:355–362. doi: 10.1530/rep.0.1230355.
- 66. Pearen MA, Muscat GE. Minireview: Nuclear hormone receptor 4A signaling: implications for metabolic disease. Mol Endocrinol. 2010;24:1891–1903. doi: 10.1210/me.2010-0015.
- 67. Mullican SE, Zhang S, Konopleva M, Ruvolo V, Andreeff M, et al. Abrogation of nuclear receptors Nr4a3 and Nr4a1 leads to development of acute myeloid leukemia. Nat Med. 2007;13:730–735. doi: 10.1038/nm1579.
- 68. Becker KG, Nagle JW, Canning RD, Dehejia AM, Polymeropoulos MH, et al. Molecular cloning and mapping of a novel human KRAB domain-containing C2H2-type zinc finger to chromosome 7q36.1. Genomics. 1997;41:502–504. doi: 10.1006/geno.1997.4678.
- 69. Vicaut E, Laemmel E, Stucker O. Impact of serotonin on tumour growth. Ann Med. 2000;32:187–194. doi: 10.3109/07853890008998826.
- 70. Launay JM, Birraux G, Bondoux D, Callebert J, Choi DS, et al. Ras involvement in signal transduction by the serotonin 5-HT2B receptor. J Biol Chem. 1996;271:3141–3147. doi: 10.1074/jbc.271.6.3141.
- 71. Chance B, Sies H, Boveris A. Hydroperoxide metabolism in mammalian organs. Physiol Rev. 1979;59:527–605. doi: 10.1152/physrev.1979.59.3.527.
- 72. Kinnula VL, Crapo JD. Superoxide dismutases in malignant cells and human tumors. Free Radic Biol Med. 2004;36:718–744. doi: 10.1016/j.freeradbiomed.2003.12.010.
- 73. Lossos IS, Czerwinski DK, Alizadeh AA, Wechser MA, Tibshirani R, et al. Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. N Engl J Med. 2004;350:1828–1837. doi: 10.1056/NEJMoa032520.
- 74. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002;8:68–74. doi: 10.1038/nm0102-68.
- 75. Tome ME, Johnson DBF, Rimsza LM, Roberts RA, Grogan TM, et al. A redox signature score identifies diffuse large B-cell lymphoma patients with a poor prognosis. Blood. 2005;106:3594–3601. doi: 10.1182/blood-2005-02-0487.