High-throughput molecular phenotyping (e.g., transcriptomics, proteomics, DNA sequencing, and epigenetics) has impacted cancer research, enabling the subtyping of disease states and the discovery of actionable molecular signatures and biomarkers (1, 2). While the use of these technologies has proved more challenging than initially anticipated (3), the important role of molecular phenotyping in the health sciences is broadly appreciated. The Cancer Genome Atlas (4), assembled to integrate multidimensional molecular omic data through systems approaches, is one indicator. The development of classifiers for molecular phenotypes that enable clinically useful stratification of patients into treatment groups remains a critical challenge. Technical and biological noise, patient heterogeneity, sample contamination, inconsistent sample processing, and changing data-collection platforms all make this a difficult task. Combining classifiers to increase sensitivity (5) has proved only incrementally useful. Thus, informatic approaches that can deal robustly with noisy and heterogeneous molecular phenotyping data remain crucial to develop. The report in PNAS by Zadran et al. (6) highlights a new twist in which an established analytical approach drawn from the physical sciences is applied to the analysis of molecular phenotypes.
Surprisal Analysis and the Inference of the Internal Reference State
Surprisal analysis was pioneered in the early 1970s to study the dynamics of nonequilibrium systems (7). Levine and coworkers have recently explored the use of surprisal analysis in the characterization of molecular phenotypes (8, 9) and cellular dynamics (10). The core elements of the method are briefly described in the context of transcriptomic data. The same principles could apply to other omic data or integrated analyses:
i) All transcripts contribute to the description of a cellular state, but they do so with a thermodynamic weight that is proportional to their abundance. The principled assignment of weights distinguishes surprisal analysis from analytical methods that use fold-change and cut-offs. The latter bias analyses toward the species with greatest fold change irrespective of their abundance and thus potential influence on cellular state.
ii) Surprisal analysis applies principles from maximum entropy and statistical mechanics. A molecular phenotype is treated as an informational slice through the complex set of all molecular reactions in the cell. This slice provides an incomplete but broad insight into cellular activity that reflects the state of the system and the underlying mechanisms that constrain the biological system from reaching a state of maximal entropy. Constraints include metabolic networks, regulatory programs, and mechanisms for maintenance and replication of the genetic code. These dynamic/kinetic processes shift cellular life away from pure thermodynamic equilibria and toward sets of constrained equilibrium states. While surprisal analysis does not explicitly identify constraints, it infers their effects on the dataset. In the case of transcriptomic data, the collective activity of the constraints expresses itself as specific profiles of transcript abundances that can be treated as fingerprints or molecular signatures of a particular cellular state.
iii) Making sense of constrained equilibria and the constraint-derived molecular signatures requires that the signatures be analyzed in the context of a common reference. The inference of such an internal reference state, termed the “balance state,” is perhaps the most critical element of surprisal analysis. In the balance state the abundance of transcripts is presumed not to change and deviations from the balance state can then be thought of abstractly as the effects of molecular constraints. Provided that the balance states are found to be similar between experimental samples, surprisal analysis can find gene-expression profiles (signatures) within the data that reflect the activity of the constraints. Together, the notion of a balance state and appropriately gene- and patient-weighted constraints compress the original dataset in a way that simplifies the discovery of processes that energetically shift the system between states.
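The three elements above can be made concrete with a small numerical sketch. In the formulation used in refs. 8 and 9, the logarithm of each transcript's abundance in a given sample is written as a balance-state term plus a sum of constraint terms, each being the product of a per-transcript weight and a per-sample multiplier. The Python sketch below assumes that such a decomposition can be obtained from a singular value decomposition of the log-abundance matrix, and that the dominant component plays the role of the balance state. The function name `surprisal_decompose`, the synthetic cohort, and the sign convention (the minus sign of the published form is absorbed into the multipliers) are illustrative assumptions, not the procedure or data of Zadran et al. (6).

```python
import numpy as np


def surprisal_decompose(abundance, n_constraints=1):
    """Approximate log abundances as a balance term plus constraint terms.

    abundance: array of shape (n_transcripts, n_samples), strictly positive.
    Returns G with shape (n_transcripts, 1 + n_constraints) and lam with shape
    (1 + n_constraints, n_samples) so that log(abundance) is close to G @ lam.
    Treating component 0 as the balance state is an assumption of this sketch.
    """
    log_x = np.log(np.asarray(abundance, dtype=float))
    u, s, vt = np.linalg.svd(log_x, full_matrices=False)
    k = 1 + n_constraints
    G = u[:, :k]                   # per-transcript weights
    lam = s[:k, None] * vt[:k, :]  # per-sample multipliers ("potentials")
    return G, lam


# Synthetic cohort (hypothetical): a shared profile plus one disease signature.
rng = np.random.default_rng(0)
n_transcripts, n_samples = 200, 12
is_diseased = np.arange(n_samples) >= n_samples // 2
shared_profile = rng.uniform(2.0, 8.0, size=n_transcripts)
disease_signature = rng.normal(0.0, 1.0, size=n_transcripts)
log_x = (
    shared_profile[:, None]
    + 0.5 * disease_signature[:, None] * is_diseased[None, :]
    + rng.normal(0.0, 0.05, size=(n_transcripts, n_samples))
)
G, lam = surprisal_decompose(np.exp(log_x), n_constraints=1)

# The balance multiplier should be nearly constant across samples, while the
# first constraint multiplier should differ between the two groups.
balance_variation = lam[0].std() / abs(lam[0].mean())
group_gap = abs(lam[1][is_diseased].mean() - lam[1][~is_diseased].mean())
print(f"relative variation of balance multiplier: {balance_variation:.4f}")
print(f"gap in first constraint multiplier between groups: {group_gap:.2f}")
```

In this toy setting the first component captures the profile shared by every sample, and the second captures the signature that distinguishes the two groups, mirroring the compression into a balance state plus a small number of constraints described above. Real cohorts would require normalization and handling of zero counts, which are not addressed here.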
One of the most intriguing conclusions presented by Zadran et al. (6) is that the inferred balance states seem to be extraordinarily similar across all carcinomas. Although small differences exist between the balance states of cancers from different tissue types, they are nevertheless remarkably alike. A similar conclusion, regarding the stability of balance state, had been drawn from experiments on cell lines (8, 9). It was not obvious, however, that the much noisier data derived from heterogeneous clinical samples would give similar results. Zadran et al. (6) report that genes contributing most to the balance state have annotations consistent with the maintenance of cellular homeostasis. The appearance of a common reference state, robust to sample noise and heterogeneity, against which comparisons of healthy, diseased, and carcinoma-specific molecular phenotypes can be made, may turn out to be the most significant contribution of this work.
Identifying Disease Signatures and Putative Therapeutic Targets
The analysis of constraints (deviations from the balance state) by Zadran et al. (6) for RNA abundance profiles (mRNA and miRNA) demonstrates that the most informative perturbation signatures distinguish healthy samples from each of the four different carcinomas. Moreover, their analysis shows promise in distinguishing between the carcinomas themselves. Surprisal analysis of miRNA abundance profiles of breast and lung cancer samples independently identified specific miRNA markers (miR-141 and miR-206) that were previously known to be involved in disease processes (11, 12), indicating that surprisal analysis-derived constraints may be meaningful tools for biomarker discovery. In addition, constraint-associated genes were tested for functional association with the disease. Here, siRNA knockdown experiments in cultured cell lines were conducted targeting genes with the greatest thermodynamic contributions to the cancer states of each of the breast, ovarian, lung, and prostate carcinomas. While preliminary, all cell proliferation assays showed shifts away from the cancer phenotype in vitro, suggesting that surprisal analysis likely identified genes involved in the disease process. Of course, other classifiers of transcriptome data have demonstrated the ability to group samples into disease or healthy states and to identify putatively important gene targets. Zadran et al. (6) do not make any direct comparison with these other methods to evaluate whether the targets identified by surprisal analysis are clinically more relevant than those previously discovered by other methods. The authors, however, never claim that the classifications made by surprisal analysis are unique; this validation may have to wait for follow-up studies.
Nevertheless, a key differentiating feature between surprisal analysis and other classifiers is its thermodynamic-like foundations. Zadran et al. (6) demonstrate that surprisal analysis may be able to relate some of the complex underlying microscopic/molecular processes that collectively define a cellular state (e.g., metabolism, regulatory systems, replication, and repair) to the cell’s bulk properties (e.g., disease or health state). The mathematical compaction of these processes returns discrete disease signatures that appear to be informationally relevant to the process. In this context, the report by Zadran et al. (6) suggests that the underlying physical processes regulating the transcriptome lead to a thermodynamically stable state. If this is indeed the case, one could invoke something analogous to Le Châtelier’s principle to predict how the stable system might respond to small perturbations. If so, predicting system-wide effects of drugs that target specific transcripts might become possible.
Taking a broader view, the investigation of molecular phenotypes by surprisal analysis could be instrumental in the development of personalized medicine, provided some key questions are answered. For instance, Zadran et al. (6) demonstrate that the thermodynamic notion of associating a potential to the bulk state of the sample (as identified by λ1 in Fig. 1) allows the reliable classification of diseased from healthy samples. However, patient-to-patient variability in potential, although typically much smaller than the differences in potential between diseased and healthy states, is nevertheless observed. Part of this patient-dependent variation is likely evidence of information that could be exploited for subtyping diseases and developing personalized treatment options. Noise is also a key factor. What are the origins of noise in molecular phenotypes and are there sources that must be explicitly accounted for in the analytical framework itself? We know that cellular heterogeneity contributes to noise in the data. We also know that noise arising from fluctuations in small molecular numbers (13, 14) leads to stochastically determined phenotypes and that they cannot be avoided through more careful sample collection. Unfortunately, the degree of phenotypic variation arising from such processes and sample heterogeneity is insufficiently characterized to make principled changes to the current analytical framework. Single-cell data characterizing the variance in phenotypic diversity could therefore provide a critical additional source of data to advance this approach.
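To illustrate the comparison between patient-to-patient variability and the difference in potential between diseased and healthy states, the Python sketch below takes one multiplier value per sample (playing the role of λ1), measures the gap between group means against the within-group spread, and classifies samples with a midpoint threshold. The numerical values, the function names, and the midpoint rule are hypothetical choices made only for illustration; they are not taken from Zadran et al. (6).

```python
import numpy as np


def potential_separation(lam1, is_diseased):
    """Return (gap between group means, largest within-group spread, midpoint)."""
    lam1 = np.asarray(lam1, dtype=float)
    is_diseased = np.asarray(is_diseased, dtype=bool)
    healthy, diseased = lam1[~is_diseased], lam1[is_diseased]
    gap = abs(diseased.mean() - healthy.mean())
    spread = max(healthy.std(ddof=1), diseased.std(ddof=1))
    threshold = 0.5 * (healthy.mean() + diseased.mean())
    return gap, spread, threshold


def classify(lam1, threshold, diseased_is_higher=True):
    """Label a sample as diseased according to which side of the threshold it falls."""
    lam1 = np.asarray(lam1, dtype=float)
    return lam1 > threshold if diseased_is_higher else lam1 < threshold


# Hypothetical per-sample potentials: four healthy samples, then four diseased.
lam1 = [-4.1, -3.8, -4.3, -3.9, 3.7, 4.2, 4.0, 3.6]
labels = np.array([False, False, False, False, True, True, True, True])

gap, spread, threshold = potential_separation(lam1, labels)
predicted = classify(lam1, threshold, diseased_is_higher=True)
print(f"gap = {gap:.2f}, within-group spread = {spread:.2f}")
print("all samples classified correctly:", bool(np.array_equal(predicted, labels)))
```

When the gap is large relative to the spread, as in these made-up values, a single threshold suffices for the healthy versus diseased call; the residual within-group spread is the quantity that would have to be partitioned into noise and information usable for subtyping.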
Footnotes
The author declares no conflict of interest.
See companion article on page 19160 of issue 47 in volume 110.
References
- 1. Colombo PE, Milanezi F, Weigelt B, Reis-Filho JS. Microarrays in the 2010s: The contribution of microarray-based gene expression profiling to breast cancer classification, prognostication and prediction. Breast Cancer Res. 2011;13(3):212. doi: 10.1186/bcr2890.
- 2. Guiu S, et al. Molecular subclasses of breast cancer: How do we define them? The IMPAKT 2012 Working Group Statement. Ann Oncol. 2012;23(12):2997–3006. doi: 10.1093/annonc/mds586.
- 3. Ioannidis JP, et al. Repeatability of published microarray gene expression analyses. Nat Genet. 2009;41(2):149–155. doi: 10.1038/ng.295.
- 4. Chang K, et al.; Cancer Genome Atlas Research Network; Genome Characterization Center; Genome Data Analysis Center; Sequencing Center; Data Coordinating Center; Tissue Source Site; Biospecimen Core Resource Center; National Cancer Institute/National Human Genome Research Institute Project Team; Collaborators. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–1120.
- 5. Care MA, et al. A microarray platform-independent classification tool for cell of origin class allows comparative analysis of gene expression in diffuse large B-cell lymphoma. PLoS ONE. 2013;8(2):e55895. doi: 10.1371/journal.pone.0055895.
- 6. Zadran S, Remacle F, Levine RD. miRNA and mRNA cancer signatures determined by analysis of expression levels in large cohorts of patients. Proc Natl Acad Sci USA. 2013;110(47):19160–19165. doi: 10.1073/pnas.1316991110.
- 7. Levine RD, Bernstein RB. Energy disposal and energy consumption in elementary chemical reactions. Information theoretic approach. Acc Chem Res. 1974;7(12):393–400.
- 8. Remacle F, Kravchenko-Balasha N, Levitzki A, Levine RD. Information-theoretic analysis of phenotype changes in early stages of carcinogenesis. Proc Natl Acad Sci USA. 2010;107(22):10324–10329. doi: 10.1073/pnas.1005283107.
- 9. Kravchenko-Balasha N, et al. On a fundamental structure of gene networks in living cells. Proc Natl Acad Sci USA. 2012;109(12):4702–4707. doi: 10.1073/pnas.1200790109.
- 10. Gross A, Li CM, Remacle F, Levine RD. Free energy rhythms in Saccharomyces cerevisiae: A dynamic perspective with implications for ribosomal biogenesis. Biochemistry. 2013;52(9):1641–1648. doi: 10.1021/bi3016982.
- 11. Wang X, Ling C, Bai Y, Zhao J. MicroRNA-206 is associated with invasion and metastasis of lung cancer. Anat Rec (Hoboken). 2011;294(1):88–92. doi: 10.1002/ar.21287.
- 12. Zhao L, et al. MiRNA expression analysis of cancer-associated fibroblasts and normal fibroblasts in breast cancer. Int J Biochem Cell Biol. 2012;44(11):2051–2059. doi: 10.1016/j.biocel.2012.08.005.
- 13. McAdams HH, Arkin A. Simulation of prokaryotic genetic circuits. Annu Rev Biophys Biomol Struct. 1998;27:199–224. doi: 10.1146/annurev.biophys.27.1.199.
- 14. Friedman N, Cai L, Xie XS. Linking stochastic dynamics to population distribution: An analytical framework of gene expression. Phys Rev Lett. 2006;97(16):168302. doi: 10.1103/PhysRevLett.97.168302.