Abstract
With the advent of next-generation sequencing technologies, there has been a dramatic increase in the availability of paired clinical and transcriptomic data in a variety of disease states. For basic science researchers, this has provided a valuable opportunity for querying the impact of the transcript levels of a gene on disease survival in humans. However, there are a multitude of methodological and technical considerations to evaluate before embarking on these analyses. Herein, we provide a brief description of statistical considerations involved in these analyses, geared toward basic scientists who may not necessarily routinely use such statistical models as part of their studies.
Keywords: Kaplan–Meier, RNAseq, survival
INTRODUCTION
The advent of large-scale genomics and transcriptomics efforts has resulted in dramatic increases in genomic and transcriptomic data that are, oftentimes, publicly available, curated for use, easily accessible via website user interfaces, and retrievable. These databases, such as The Cancer Genome Atlas (TCGA), contain rich data sets on genes and gene expression collected from diseased individuals and allow researchers to investigate whether specific genes or expression levels are associated with disease status. In addition, many of these databases also include analytic tools that generate “on the go” Kaplan–Meier curves that can be used to examine differences in RNA expression levels and time to death among the individuals for whom transcriptomic data are available (1). Currently, many of the websites that offer this functionality are focused on cancer genomics; however, with the generation of large-scale sequencing data for nonmalignant diseases, including pulmonary diseases (2, 3), such functionality is expected to be more widely available. From a basic science standpoint, the ability to query whether a particular transcript impacts the development of a disease or survival can be powerful. Certainly, a statistically significant association will add strong circumstantial evidence that the gene/protein may be causally related to disease occurrence, disease severity, progression of disease, and/or death. The seeming ease with which these analyses can be performed is a strength; however, it is all too easy for an investigator to overlook the fundamental principles of study design and analysis that must be considered. For example, mRNA expression levels are continuous measures that vary over time within an individual, as cells can adjust mRNA levels in response to a variety of external stimuli (therapeutic interventions, exacerbation of disease, etc.).
Yet, most studies collect mRNA expression levels at a single time point and these expression levels are further dichotomized (i.e., “high” vs. “low” expression) for analytic purposes. These differences in a static mRNA level may not represent biological complexities related to the effect of changes in mRNA transcript over time. Given the expected increase in the number and functionality of these databases, it is critically important to understand how and when these databases can be integrated into different research settings. In this article, we discuss the use of such data sets when conducting time-to-event analyses in which the primary goal is to contrast a single gene expression level with time to death among a population of diseased individuals. We approach this topic first through the epidemiological lens of the cohort study and then introduce key issues related to the conduct of time-to-event analyses, focusing on those methods that are currently available through databases and those more applicable when primary data can be downloaded for more in-depth analysis. Our overall goal is to introduce key concepts that will enhance the quality of post hoc analysis of transcriptomic data, especially when such analyses are integrated into the basic science setting.
We limit our discussion to a very specific type of post hoc analysis of transcriptomic data: determining the association between transcript levels of a gene and a patient outcome (e.g., survival). We recognize that there are many high-dimensional methodologies for analyzing transcriptomic data (including pathway and cluster analyses), as well as tools for constructing and validating predictive models using sets of gene transcripts. Furthermore, there are many additional considerations (such as false discovery rates) when thinking about genome-wide differences in all transcripts across samples. Indeed, these global analyses are the primary outcomes of sequencing experiments. However, these topics are beyond the scope of this report, which focuses on the scenario where prior preclinical work has led an investigator to ask the specific question: “In an existing transcriptomic data set, is transcript X associated with time to death in patients with disease Y?”
SCIENTIFIC QUESTION OF INTEREST
Our discussion proceeds with the abovementioned scientific question of interest. This is an example of a cohort study in which the exposure is measured at a single point in time and individuals are followed until death or censoring. Importantly, previous literature or results from our experimental data should guide our overall hypothesis of the direction of the exposure (expression level)-outcome (time to death) relationship. Formal statistical testing procedures are used to determine whether there is sufficient evidence to reject our null hypothesis (generally one of no association) in favor of the alternative hypothesis (i.e., there is a difference in time to death between individuals with high and low levels of the gene/transcript in question); if there is not, we fail to reject the null hypothesis, which is not the same as accepting it. The size of the difference that can be detected depends on the sample size (number of individuals and/or event rate), with larger sample sizes needed to detect smaller differences. Thus, a failure to reject the null hypothesis does not necessarily imply that there is no difference; rather, it may reflect that the sample size is not sufficient to detect a smaller and possibly clinically important difference. A well-designed study will be adequately powered to answer our scientific question of interest; however, the interpretation of results will always need to be contextualized with respect to the study population, the generalizability of results, and the limitations of the data, and must recognize that observed associations do not necessarily imply causality.
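To make the link between sample size and detectable differences concrete, the following is a minimal sketch of one widely used approximation (Schoenfeld's formula) for the number of deaths needed by a two-sided log-rank test comparing two groups. The function name, the default significance level and power, and the equal split between "high" and "low" groups are illustrative assumptions on our part, not recommendations; a formal power calculation should be developed with a biostatistician.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80, prop_high=0.5):
    """Approximate number of deaths needed for a two-sided log-rank test
    (Schoenfeld's formula) comparing "high" vs. "low" expression groups.

    prop_high is the assumed fraction of patients in the "high" group.
    """
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    if not 0 < prop_high < 1:
        raise ValueError("prop_high must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    log_hr = math.log(hazard_ratio)
    events = (z_alpha + z_beta) ** 2 / (prop_high * (1 - prop_high) * log_hr ** 2)
    return math.ceil(events)


# Hypothetical effect sizes: smaller effects (hazard ratios closer to 1)
# require many more observed deaths.
events_large_effect = required_events(0.5)
events_small_effect = required_events(0.75)
print(events_large_effect, events_small_effect)
```

Under these assumptions, a hazard ratio of 0.5 requires roughly 66 deaths, whereas a hazard ratio of 0.75 requires roughly 380. Note that the formula counts events rather than patients, which is why a data set with many patients but few observed deaths may still be underpowered.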
ASSESSING AVAILABLE DATA
In transcriptome analyses, the two most common analytic approaches are to compare survival curves and to evaluate the risk of death through Cox proportional hazards regression. Before undertaking these formal analyses, it is first critical to determine whether there are sufficient data to answer the scientific question of interest. The individuals with available exposure (transcriptome) and outcome data (follow-up time and indicator of observed deaths) will make up the study population. A high degree of missingness in either of these variables may preclude the conduct of an analysis due to a small sample size or questions as to the representativeness of those with complete data. The investigator must carefully weigh these considerations before starting analyses. Transcriptomic data sets may also contain limited patient characteristics that may be further explored through subgroup analyses or in adjusted Cox proportional hazards regression models. The same issues related to the completeness of variables are applicable here. As there are many approaches for handling missing data, consultation with a biostatistician may be warranted. In particular, if the decision for imputation is made, the method for doing so is not straightforward (4) and requires a discussion of the nature of the missing data. For instance, a missing transcript level could mean either that the transcript is not present in that patient’s sample or that a technical issue occurred in which the transcript was not sequenced due to lack of amplification. Similar considerations apply for clinical data. Measures of function (DLCO, 6-min walk test) may be missing either because the test was not performed or because the patient was unable to complete the test due to the severity of disease.
We do note that most software will perform analyses in the presence of missing data (complete case analyses); the fact that analyses can be executed does not mean that the data set was complete or the sample included in the analysis is representative of all individuals. On a related note, in addition to missingness, simple data checks (ensuring that all values for age are greater than zero, etc.) are also recommended to assess the quality of the data.
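The checks described above can be sketched in a few lines once the primary data have been downloaded. The example below is a minimal illustration using only the Python standard library; the field names (`age`, `expr`, `time`, `died`), the plausible age range, and the toy records are hypothetical and would need to be adapted to the data set at hand.

```python
def summarize_missingness(records, fields):
    """Return the fraction of records with a missing value for each field."""
    n = len(records)
    if n == 0:
        return {}
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}


def range_violations(records, field, low, high):
    """Return indices of records whose non-missing value falls outside [low, high]."""
    return [
        i for i, r in enumerate(records)
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]


def complete_cases(records, fields):
    """Keep only records with no missing value in any of the listed fields."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


# Hypothetical toy records; None marks a missing value.
patients = [
    {"age": 61, "expr": 5.2, "time": 24.0, "died": 1},
    {"age": -3, "expr": None, "time": 10.0, "died": 0},
    {"age": 47, "expr": 3.9, "time": None, "died": None},
    {"age": 70, "expr": 6.1, "time": 36.5, "died": 1},
]
fields = ["age", "expr", "time", "died"]

missing = summarize_missingness(patients, fields)    # fraction missing per field
bad_age = range_violations(patients, "age", 0, 120)  # implausible ages
kept = complete_cases(patients, fields)              # what a complete case analysis would use
print(missing, bad_age, len(kept), "of", len(patients))
```

Reporting how many individuals remain after restricting to complete cases, alongside the per-variable missingness, makes explicit what the software would otherwise do silently.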
In addition to missingness, the quality of the transcript reads must also be considered. Specific details on best practices for quality control and appropriate normalization of mRNA transcript reads in genomics data are addressed in detail elsewhere (5). In particular, it is important to note that a variety of methodologies can be used to normalize and quantify mRNA transcript data [e.g., fragments per kilobase of transcript per million mapped reads (FPKM), RNA-Seq by Expectation-Maximization (RSEM)]. Visual examination of the coverage depth of the transcript of interest [done using tools like the Integrative Genomics Viewer (IGV)] can often be useful in ensuring that the transcript of interest is sufficiently captured in the RNAseq data. Methodologies for capturing, normalizing, and aligning RNAseq data have changed over time; thus, when dealing with multiple data sets, it is also important to consider when the initial sequencing occurred, and whether the time difference between the data sets may be influencing observed differences.
For survival analyses, two additional important pieces of data should be examined: the follow-up time and an indicator of whether death occurred. The details of how follow-up was conducted will be important in this process. Often, individuals will be administratively censored, reflecting the fact that at the time of the last follow-up (or a certain date), the patient was known to be alive. This is different from loss to follow-up, in which an individual was known to be alive at some previous time, but could no longer be contacted. A high proportion of individuals lost to follow-up may impact results as there is an underlying assumption that loss to follow-up (and censoring) is independent of the outcome. Finally, all-cause death may not be the ideal outcome of interest, as cause-specific death or disease progression/recurrence may be more clinically relevant outcomes.
EXPLORATORY DATA ANALYSIS
Typical RNAseq data consist of gene expression levels for thousands of genes for hundreds of individuals, coupled with a clinical data set (e.g., patient and disease characteristics). It is often helpful to describe and visualize the clinical data in exploratory analyses. For instance, what is the age distribution of the patients in the data set? Were samples collected across multiple study centers? What is the time frame during which patient samples (and clinical data) were collected? This last question becomes important in disease states where treatment paradigms may have changed dramatically during the data collection period. Another consideration is the disease states that were included. For instance, in a data set of patients with pulmonary hypertension (PH), were patients with all groups of PH included, or were only patients with Group 1 PH (PAH) used? In cancer data sets, where data are typically collected based on the organ of cancer origin, several histology types may be present in a single data set. In data sets where samples were collected from multiple sites or across multiple years, examining differences in demographics and transcript levels stratified by site (or collection period) may be useful in determining whether site-specific differences may be contributing to observed differences.
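As a simple illustration of a stratified exploratory summary, the sketch below computes the median of a variable within each study site using only the Python standard library. The field names (`site`, `age`, `expr`) and the toy values are hypothetical; in practice, the same grouping could be applied to collection year, histology type, or disease subgroup.

```python
from collections import defaultdict
from statistics import median


def median_by_group(records, group_key, value_key):
    """Median of value_key within each level of group_key, skipping missing values."""
    groups = defaultdict(list)
    for r in records:
        if r.get(value_key) is not None:
            groups[r[group_key]].append(r[value_key])
    return {g: median(values) for g, values in groups.items()}


# Hypothetical toy samples from two study sites.
samples = [
    {"site": "A", "age": 55, "expr": 4.0},
    {"site": "A", "age": 63, "expr": 5.0},
    {"site": "A", "age": 71, "expr": 6.0},
    {"site": "B", "age": 48, "expr": 9.0},
    {"site": "B", "age": 52, "expr": 11.0},
]

expr_by_site = median_by_group(samples, "site", "expr")
age_by_site = median_by_group(samples, "site", "age")
print(expr_by_site, age_by_site)
```

A marked shift in transcript levels between sites, as in this toy example, would prompt a closer look at whether sequencing or normalization differed between sites before any survival comparison is made.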
ANALYSIS
Once the abovementioned exploratory analyses have been performed, we typically then proceed using the following individual-level data: the transcript level for the gene of interest, the follow-up time, and an indicator of whether the individual was censored or died. For the purposes of this focused discussion, we are only considering questions that probe the relationship between a single transcript of interest and survival. If the research question involves an investigation of the association of a group of correlated transcripts (a gene set), more advanced analyses will be needed. These may include linear (principal component analysis) or nonlinear (multidimensional scaling) dimension reduction, or stepwise addition of individual genes into a Cox proportional hazards model, and are beyond the scope of our current discussion.
For descriptive purposes, we often summarize the median time to death and compare this between groups through parametric or nonparametric tests. The Kaplan–Meier curve is a tool to (descriptively) visualize survival in a population whereby the probability of remaining alive is plotted over time. Importantly, these curves can be constructed for the different transcriptome strata (e.g., “high” vs. “low”) and formal statistical testing procedures can be used to compare whether survival differs between groups. Available databases allow for the construction of these plots and for differences between groups to be tested using a log-rank test, which assesses whether the observed occurrence of death in each group differs from what would be expected under the null hypothesis of no difference. However, there are instances, such as when survival curves cross, in which the log-rank test has less power to detect differences. Alternative tests are available that weight the timing of events differently; however, performing these tests requires that the primary data be downloaded and analyzed separately. Finally, the identification of a difference between groups does not support a causative role as an observed difference may be confounded due to imbalances of patient-level characteristics between groups. Furthermore, although Kaplan–Meier curves are useful in graphically displaying findings, careful construction and presentation of the accompanying Cox proportional hazards regression analyses is critical.
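For readers who wish to see what these tools compute, the following is a minimal sketch of the Kaplan–Meier estimator and the standard (unweighted) two-group log-rank test, written with the Python standard library only. The function names and the toy follow-up data are hypothetical; for real analyses, an established, validated survival analysis package should be used instead.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time; events: 1 = died, 0 = censored."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)  # still under observation at t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 df."""
    times = list(times_a) + list(times_b)
    events = list(events_a) + list(events_b)
    in_a = [True] * len(times_a) + [False] * len(times_b)
    obs_minus_exp, variance = 0.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for u in times if u >= t)                        # at risk, both groups
        n_a = sum(1 for u, a in zip(times, in_a) if a and u >= t)  # at risk, group A
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d_a = sum(1 for u, e, a in zip(times, events, in_a) if a and u == t and e == 1)
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected deaths in A
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined (no informative event times)")
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Hypothetical follow-up times (months) and death indicators for two strata.
high_t, high_e = [2, 3, 5, 7], [1, 1, 0, 1]
low_t, low_e = [6, 8, 9, 12], [1, 0, 1, 0]

km_high = kaplan_meier(high_t, high_e)
chi2, p_value = logrank_test(high_t, high_e, low_t, low_e)
print(km_high, round(chi2, 2), round(p_value, 3))
```

Censored individuals contribute to the number at risk up to their censoring time but never count as deaths, which is how both procedures make use of incomplete follow-up.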
Cox proportional hazards regression is the most commonly used approach to investigate the association between an exposure and a time-to-event outcome. The results of these models are reported as a hazard ratio (the ratio of the instantaneous risks of death) corresponding to a contrast in exposure levels. As with all regression models, there are many caveats and factors that must be considered when conducting these analyses. One of the most important assumptions is the proportional hazards assumption (the ratio of hazards is constant across time), which can be evaluated through a number of methods and, if violated, can be accounted for through stratified analyses or alternative time-to-event methods. Cox regression models can accommodate variables that are fixed (e.g., sex, race, tumor stage at diagnosis) or time-dependent (time-varying), provided that these variables were measured over time.
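To illustrate where a hazard ratio comes from, the sketch below fits a Cox model with a single fixed covariate, h(t | x) = h0(t) exp(beta x), by Newton-Raphson maximization of the partial likelihood, so that exp(beta) is the estimated hazard ratio. This is a teaching sketch under stated assumptions: the function name and toy data are hypothetical, ties are handled with the Breslow approximation, and no check of the proportional hazards assumption is performed. Real analyses should rely on established statistical software.

```python
import math


def cox_univariate(times, events, x, max_iter=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x) by Newton-Raphson on the Cox
    partial likelihood (Breslow handling of ties).

    Returns (beta, standard error of beta); exp(beta) is the hazard ratio.
    """
    beta = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if e_i != 1:
                continue  # censored individuals only contribute through risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            w = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(w)
            s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
            s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
            score += x_i - s1 / s0               # first derivative of log partial likelihood
            info += s2 / s0 - (s1 / s0) ** 2     # negative second derivative
        if info <= 0:
            raise ValueError("no information about beta (no events or constant x)")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1 / math.sqrt(info)


# Hypothetical data: x = 1 for "high" expression, 0 for "low" expression.
times = [2, 3, 5, 7, 6, 8, 9, 12, 4, 10]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
high = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

beta, se = cox_univariate(times, events, high)
hazard_ratio = math.exp(beta)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(round(hazard_ratio, 2), round(ci_low, 2), round(ci_high, 2))
```

With so few individuals, the confidence interval in this toy example is very wide, which echoes the earlier point that the number of observed events drives the precision of the estimate. The same partial likelihood extends to several covariates, which is what allows adjustment for potential confounders.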
Initially, univariate Cox proportional hazards regression will be performed, where survival is modeled as a function of a single explanatory variable (e.g., RNA transcript level of one gene). However, the strength of this model is the ability to conduct a multivariable analysis with the goal of evaluating the association of the transcript after adjustment for potential confounders, such as age, sex, and other demographic and disease characteristics. Inclusion of covariates should be guided by the literature and biological plausibility or through the construction of directed acyclic graphs, whereas data-driven approaches, such as stepwise regression, are generally not recommended. A statistically significant association that is present in the univariate analysis may no longer be “significant” in multivariable analysis; however, this does not mean that the transcript of interest is not related to the outcome. Instead, such a result may suggest that the relationship between the transcript in question and the other model covariates should be explored further. Furthermore, although the results of multivariable models may support the association between the mRNA of interest and survival, they should not be interpreted as predictive of the outcome. A table reporting the results (hazard ratios and confidence intervals) of univariate and multivariable models is recommended.
VALIDATION OF FINDINGS
Analyses using data extracted from RNAseq experiments reflect associations (noncausal) and should be considered hypothesis generating. Inferences may be strengthened through replication in other RNAseq data sets and/or patient cohorts. In addition, RNA or protein levels of the gene of interest can be specifically interrogated in patient tissue samples, cell lines, or animal models. Examining the functional effect of altering the protein produced by the transcript of interest in cell lines is a well-established method to further support a causal relationship between the transcript and disease outcome of interest. Thus, a variety of options for mechanistic validation in vitro exist; one or more of these studies are typically needed to more definitively implicate a specific transcript in the disease under study.
CONCLUSIONS
In summary, genomics-based survival analyses can provide evidence that a specific gene may impact survival and support further investigation of that gene. The readily available data and ease of performing “on the go” survival analyses should not preclude careful consideration of study design and analysis when determining whether the data are appropriate to answer the scientific question of interest.
GRANTS
This work was supported by NIH Grant R01HL151530.
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the authors.
AUTHOR CONTRIBUTIONS
K.S. and K.J.P. drafted manuscript; K.S. and K.J.P. edited and revised manuscript; K.S. and K.J.P. approved final version of manuscript.
REFERENCES
1. Zheng H, Zhang G, Zhang L, Wang Q, Li H, Han Y, Xie L, Yan Z, Li Y, An Y, Dong H, Zhu W, Guo X. Comprehensive review of web servers and bioinformatics tools for cancer prognosis analysis. Frontiers Oncol 10: 68, 2020. doi: 10.3389/fonc.2020.00068.
2. Hemnes AR, Beck GJ, Newman JH, Abidov A, Aldred MA, Barnard J, Berman Rosenzweig E, Borlaug BA, Chung WK, Comhair SAA, Erzurum SC, Frantz RP, Gray MP, Grunig G, Hassoun PM, Hill NS, Horn EM, Hu B, Lempel JK, Maron BA, Mathai SC, Olman MA, Rischard FP, Systrom DM, Tang WHW, Waxman AB, Xiao L, Yuan JX-J, Leopold JA; PVDOMICS Study Group. PVDOMICS: a multi-center study to improve understanding of pulmonary vascular disease through phenomics. Circ Res 121: 1136–1139, 2017. doi: 10.1161/CIRCRESAHA.117.311737.
3. Li B, Sun W-X, Zhang W-Y, Zheng Y, Qiao L, Hu Y-M, Li W-Q, Liu D, Leng B, Liu J-R, Jiang X-F, Zhang Y. The transcriptome characteristics of severe asthma from the prospect of co-expressed gene modules. Frontiers Genet 12: 765400, 2021. doi: 10.3389/fgene.2021.765400.
4. Harel O, Mitchell EM, Perkins NJ, Cole SR, Tchetgen EJ, Sun BLuo, Schisterman EF. Multiple imputation for incomplete data in epidemiologic studies. Am J Epidemiol 187: 576–584, 2018. doi: 10.1093/aje/kwx349.
5. Koch CM, Chiu SF, Akbarpour M, Bharat A, Ridge KM, Bartom ET, Winter DR. A beginner’s guide to analysis of RNA sequencing data. Am J Respir Cell Mol Biol 59: 145–157, 2018. doi: 10.1165/rcmb.2017-0430TR.
