Abstract
Recent advances in high-throughput genomic technologies have created a growing demand for statistical tools that facilitate the identification of molecular changes as potential prognostic biomarkers or druggable targets for personalized precision medicine. In this study, we developed a web-based, interactive and user-friendly platform for high-dimensional analysis of molecular alterations in cancer (HDMAC) (https://ripsung26.shinyapps.io/rshiny/). HDMAC offers several penalized regression models suitable for high-dimensional data analysis, namely Ridge, Lasso and adaptive Lasso, with Cox regression for survival outcomes and logistic regression for binary outcomes. An optional first-step screening is provided to address the multiple-comparison issue that often arises with large-volume genomic data. A hazard ratio or estimated coefficient is reported for each selected gene so that a multivariate regression model may be built from the selected genes. Cross validation is provided to estimate the prediction power of each regression model. In addition, R code is provided to facilitate download of whole sets of molecular variables from TCGA. We illustrate the use of HDMAC with a gene mutation dataset and an mRNA expression dataset from ovarian cancer patients and an mRNA expression dataset from bladder cancer patients. From the analysis of each dataset, a list of candidate genes was obtained whose mutations or abnormal expression might be associated with clinical outcomes in ovarian and bladder cancers. HDMAC offers a solution for rigorous and validated analysis of high-dimensional genomic data.
Subject terms: Genome informatics, Medical research, Oncology
Introduction
Recent advances in high-throughput technologies such as microarrays and next generation sequencing have enabled researchers to systematically identify molecular changes that are associated with cancers1,2. Such efforts have attracted much attention as the molecular changes may represent potential prognostic biomarkers or druggable targets for personalized precision medicine. Meanwhile, several multiple-data platforms, e.g., the Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx), have also become available to researchers for identifying genome-wide molecular changes of individual cancers3,4. With these tools and consortia, there is a growing demand for statistical tools to facilitate identification of molecular changes.
There are several web tools available for researchers to analyze genomic data. For example, cBioPortal provides simultaneous display of RNA expression, mutations, copy number alterations and protein expression with multiple choices of plots for visualization5,6. HPA and Protein Expression Atlas are specialized in protein expression. The former is good at integrating protein information and the latter provides multi-species expression data7,8. There are also tools that provide analysis on specific molecules such as miRGator for miRNAs9. As useful as all these tools are, researchers always have specific demands in their studies that cannot be well addressed by the existing platforms. For example, with deepening understanding of cancer-associated genetic alterations, it becomes imperative to explore whether the changes are associated with clinical variables and survival and binary outcomes, and how. A few preliminary attempts have been made to generate new platforms to meet specific needs of researchers10–12, but a platform that is capable of handling high dimensional data is still lacking.
Genomic data are usually high dimensional, often with information of thousands of gene loci obtained from a much smaller number of patients, say, hundreds, and an even smaller number of clinical parameters. When the number of genes is larger than the number of subjects, standard regression models that are commonly used in statistical analysis become overwhelmed. Penalized regression models, such as the ridge regression, the least absolute shrinkage and selection operator (Lasso) regression, and the adaptive Lasso regression, provide attractive alternatives13–17. These methods typically result in shrinkage of the size of the regression coefficients. Specifically, the ridge regression reduces the magnitude of the coefficients while the Lasso and the adaptive Lasso force some of the coefficients to become zero. In addition, the Lasso regression estimator is sparse, i.e., many components are exactly 0 and Lasso automatically deletes unnecessary covariates, and the adaptive Lasso estimator is even sparser. Thus both the Lasso and the adaptive Lasso can be used for variable selection, with the latter selecting fewer variables than the former. In fact, these penalized regression methods have been widely used in large-scale genetic studies in recent years, such as identification of gene-gene interactions, gene selection in a high-dimensional cancer classification problem and a transcriptome analysis of pancreatic cancer survival18–20. Unfortunately, although these methods are heavily used in genetic analysis, they have not been incorporated in user-friendly web-based programs.
Aside from the high dimensionality, the multiple-testing problem also needs to be addressed. In genomic studies, a test statistic and its corresponding p-value are typically calculated for each gene to measure the extent of the association between that gene and the outcome variable. When many tests are conducted at the same time, many false positives (false discoveries) may arise. In fact, the false discovery rate (FDR) has become a key concept in recent large-scale genetic studies21. Unfortunately, FDR control is rarely offered in currently available web tools and apps22–24. Proper statistical algorithms are therefore needed to address the FDR issue.
We therefore aimed to develop a web-based, interactive and user-friendly platform to fulfill the following goals. First, it would fit regression models with survival and binary outcomes and high-dimensional genetic covariates, with the option of including clinical covariates. It would also identify important genetic alterations and construct a fitted multivariate regression model based on the identified genes. Further, it would offer a choice of penalty type for the corresponding penalized regression model for high-dimensional data. It would offer an optional first-step screening to screen out unrelated variables if the multiple-testing problem is of concern. Last but not least, it would estimate the prediction power of each regression model using cross validation and provide the correct p-values for the Lasso and adaptive Lasso. We also aimed to provide all relevant code on GitHub for users’ convenience.
Materials and Methods
The platform was written and all statistical analysis was performed with the statistical computing and graphic drawing language, R, with the help of Shiny, an R package that facilitates the building of interactive web Apps straight from R25,26.
Clinical data
The data for developing the platform and the associated statistical analysis were downloaded from TCGA. It is also possible to download TCGA data from cBioPortal, but only a limited number of genetic entries may be downloaded each time. We therefore wrote R code to download large volumes of genomic data from TCGA, and the code is available on GitHub (https://github.com/chung-R/HD-MAC). It is worth noting that users can run our App on any available genetic dataset; TCGA is just one important source.
Two sets of data were obtained for this study. One contained 316 patients with serous type high-grade ovarian cancer, the most common and malignant form of ovarian cancer. The data contained detected mutations in 8,310 genes and expression information of 18,263 expressed mRNA entries, as well as the patients’ clinical parameters including age, stage, overall survival, disease-free survival and lymphovascular invasion. The other set of data included 189 patients with bladder cancer. It had expression information of 18,335 expressed mRNA entries and the patients’ clinical parameters including age, sex, stage, tumor invasion type, disease-free survival and overall survival. The Z score data of mRNA expression were used to indicate the deviation from the mean of each gene’s expression level. A Z score above 2 or below −2 was considered abnormal. In addition, to search for major genetic events in cancer-driving genes with minimal statistical bias, a preliminary screening was performed so that only the genes whose mutations were found in 1% or more and the mRNA entries whose abnormal expression was found in 2% or more of the patients were included. As a result, 670 mutated genes and 9,548 expressed mRNA entries of ovarian cancer and 8,024 expressed mRNA entries of bladder cancer were included in the final analysis below.
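The preliminary screening of mRNA entries described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy Z scores are hypothetical, and the platform itself is written in R. The cutoffs (a Z score above 2 or below −2, found in at least 2% of patients) are the ones given in the text.

```python
def abnormal_fraction(z_scores, cutoff=2.0):
    """Fraction of patients whose Z score lies above +cutoff or below -cutoff."""
    if not z_scores:
        return 0.0
    abnormal = sum(1 for z in z_scores if abs(z) > cutoff)
    return abnormal / len(z_scores)


def screen_genes(z_by_gene, min_fraction=0.02, cutoff=2.0):
    """Keep genes abnormally expressed in at least min_fraction of patients."""
    return [
        gene
        for gene, z_scores in z_by_gene.items()
        if abnormal_fraction(z_scores, cutoff) >= min_fraction
    ]


# Toy data (made up): 50 patients per gene.
toy = {
    "GENE_A": [0.1] * 49 + [2.5],        # 1 of 50 patients abnormal (2%)
    "GENE_B": [0.3] * 50,                # no abnormal patient
    "GENE_C": [-2.4, 3.1] + [0.0] * 48,  # 2 of 50 patients abnormal (4%)
}
kept = screen_genes(toy)
print(kept)
```

The same pattern would apply to the mutation data with a 1% threshold on the fraction of patients carrying a mutation.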
Statistical methods
Ridge, lasso and adaptive lasso logistic regression
To identify genetic alterations associated with binary clinical outcomes, logistic regression based methods were used.
For logistic regression, the data are $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i = (x_{i1}, \dots, x_{iM})^{T}$ is the covariate vector of the $i$th subject, such as copy number variation (CNV), gene expression and mutation ($M$ is the number of genes), and $y_i$ is the binary response of the $i$th subject, such as stage (advanced stage vs. early stage) and tumor subtype (invasive vs. non-invasive).
The logistic regression model may be written as follows:
$$\log\frac{p_i}{1 - p_i} = x_i^{T}\beta,$$
where $p_i = P(y_i = 1 \mid x_i)$ and $\beta = (\beta_1, \dots, \beta_M)^{T}$ is the regression coefficient vector. Let $L(\beta)$ be the log-likelihood for this model.
To address the high dimensionality of the genomic data, we considered three regularized logistic regression models: ridge logistic regression, Lasso logistic regression and adaptive Lasso logistic regression13,14. The ridge logistic regression estimator15 can be obtained by minimizing
$$-L(\beta) + \lambda_1 \sum_{j=1}^{M} \beta_j^{2}.$$
The Lasso logistic regression estimator27 can be obtained by minimizing
$$-L(\beta) + \lambda_2 \sum_{j=1}^{M} |\beta_j|.$$
The adaptive Lasso logistic regression estimator can be obtained by minimizing
$$-L(\beta) + \lambda_3 \sum_{j=1}^{M} w_j |\beta_j|,$$
where $w_j = 1/|\tilde{\beta}_j|$ and $\tilde{\beta}_j$ is the $j$th component of $\tilde{\beta}$, an initial estimator of $\beta$. Here $\lambda_1$, $\lambda_2$ and $\lambda_3$ are non-negative tuning parameters.
We used the cross-validation method to get the optimal tuning parameter estimators, $\hat{\lambda}_1$, $\hat{\lambda}_2$ and $\hat{\lambda}_3$. Then the genes selected by the Lasso and adaptive Lasso regression were evaluated based on their association with the binary outcome variable, invasive vs. non-invasive bladder cancer here.
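To make the difference between the three penalties concrete, the sketch below solves a deliberately simplified one-coefficient problem in Python. It assumes a quadratic loss 0.5*(b - z)^2 in place of the logistic log-likelihood, and it uses z itself as the initial estimate for the adaptive weight; both are simplifying assumptions for illustration, not the fitting procedure used on the platform. Even so, it shows why ridge only shrinks coefficients whereas the Lasso and adaptive Lasso can set them exactly to zero.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*b**2: proportional shrinkage, never exactly 0."""
    return z / (1.0 + 2.0 * lam)


def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b): shrinks, and zeroes small values."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def adaptive_lasso_shrink(z, lam):
    """Lasso step with weight 1/abs(z), so that large effects are penalized less."""
    if z == 0.0:
        return 0.0
    return soft_threshold(z, lam / abs(z))


# Compare the three rules on a large, a moderate and a small raw estimate.
for z in (3.0, 0.8, -0.3):
    print(z, ridge_shrink(z, 0.5), soft_threshold(z, 0.5), adaptive_lasso_shrink(z, 0.5))
```

In this toy setting the adaptive rule leaves the large estimate almost untouched while removing the small one, which matches the sparser selection described for the adaptive Lasso.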
Ridge, lasso and adaptive lasso cox models
To associate genetic alterations with the survival outcome, the Cox proportional hazards (PH) model was used.
The survival data are $(Z_i, \delta_i, x_i)$, where $Z_i$, $\delta_i$ and $x_i$ are the observed time, the right-censoring indicator and the high-dimensional genetic covariates (such as CNV, gene expression and mutation) of the $i$th subject, respectively. $Z_i = \min(T_i, C_i)$, where $T_i$ and $C_i$ are the failure time and the right-censoring time of the $i$th subject, respectively. $\delta_i = 1$ if $T_i \le C_i$ and $\delta_i = 0$ if $T_i > C_i$. Assume $T_i$ and $C_i$ are independent conditional on $x_i$. Here $T_i$ is the disease-free survival time or overall survival time.
Similar to the above, three regularized Cox PH models, ridge, Lasso and adaptive Lasso, were used to analyze the survival data with high-dimensional covariates16,17. The hazard function given $x_i$ in the Cox PH model is defined as follows:
$$h(t \mid x_i) = h_0(t)\exp(x_i^{T}\alpha),$$
where $h_0(t)$ is the baseline hazard function and $\alpha = (\alpha_1, \dots, \alpha_M)^{T}$ is the regression coefficient vector.
Let $PL(\alpha)$ be the log partial likelihood for the Cox PH model. The Cox ridge regression estimator can be obtained by minimizing
$$-PL(\alpha) + \gamma_1 \sum_{j=1}^{M} \alpha_j^{2}.$$
The Cox Lasso regression estimator27 can be obtained by minimizing
$$-PL(\alpha) + \gamma_2 \sum_{j=1}^{M} |\alpha_j|.$$
The Cox adaptive Lasso regression estimator can be obtained by minimizing
$$-PL(\alpha) + \gamma_3 \sum_{j=1}^{M} \frac{|\alpha_j|}{|\tilde{\alpha}_j|},$$
where $\tilde{\alpha}_j$ is the $j$th component of $\tilde{\alpha}$, an initial estimator of $\alpha$. Here $\gamma_1$, $\gamma_2$ and $\gamma_3$ are non-negative tuning parameters.
The optimal tuning parameter estimators, $\hat{\gamma}_1$, $\hat{\gamma}_2$ and $\hat{\gamma}_3$, were obtained with the cross-validation method. As with the penalized logistic regression methods described above, the Cox Lasso and adaptive Lasso regression methods were used for variable selection, and the genes selected were evaluated based on their association with the survival time distribution.
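As a rough illustration of the quantity being penalized, the following Python sketch evaluates the log partial likelihood of a Cox PH model and the corresponding Lasso criterion on a toy dataset. It assumes there are no tied event times, and the function names, the `penalty` argument and the data are hypothetical; it is not the optimization routine used by the platform.

```python
import math


def cox_log_partial_likelihood(times, events, covariates, alpha):
    """Log partial likelihood of a Cox PH model (assumes no tied event times).

    times: observed times; events: 1 = event, 0 = censored;
    covariates: list of covariate vectors; alpha: coefficient vector.
    """
    scores = [sum(a * x for a, x in zip(alpha, xi)) for xi in covariates]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set: subjects still under observation at time t_i.
        risk = [math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        total += scores[i] - math.log(sum(risk))
    return total


def lasso_objective(times, events, covariates, alpha, penalty):
    """Penalized criterion: minus log partial likelihood plus penalty * sum |alpha_j|."""
    pl = cox_log_partial_likelihood(times, events, covariates, alpha)
    return -pl + penalty * sum(abs(a) for a in alpha)


# Toy data (made up): 4 subjects, 2 covariates.
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 1]
covs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
pl_at_zero = cox_log_partial_likelihood(times, events, covs, [0.0, 0.0])
print(pl_at_zero)
```

With all coefficients at zero, each event contributes minus the log of the size of its risk set, which gives a quick sanity check on the implementation.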
FDR for screening
The method proposed by Benjamini and Hochberg to control the FDR, defined as the expectation of the ratio of the number of falsely rejected null hypotheses to the total number of rejected null hypotheses, was used here28. On the App we developed, the method to control the FDR is provided as an optional first-step screening method and users may also specify their own FDR thresholds. In this study, the default FDR threshold was set at 0.05. When FDR was chosen, univariate analysis (Cox regression for a survival outcome and logistic regression for a binary outcome) was first performed to compute the p-value (the extent of the association) for each gene, and then FDR screening was performed. Once the associated genes were selected, the regression model would be fit to the outcome variable with the selected genes as covariates.
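The Benjamini-Hochberg step-up rule used for this screening can be sketched as follows. The Python function below is a minimal illustration with made-up p-values; the function name is hypothetical and the code is not taken from the platform.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose ordered p-value is <= k/m * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    return sorted(order[:largest_k])


# Toy p-values from hypothetical univariate tests, one per gene.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p, fdr=0.05)
print(selected)
```

Only the genes whose indices are returned would be passed on to the penalized regression step.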
Cross validation for estimating prediction power
The cross-validation algorithm is provided on the App to estimate the prediction power of each model available on the platform and users are allowed to choose the fold number for the cross validation. The default fold number is 5, and cross validation method will not be performed if 1 is chosen. Accuracy, sensitivity, specificity and area under curve (AUC) are computed and displayed to show the prediction power for each logistic regression model, and the concordance index (C-index) for each survival model.
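The measures reported for the logistic models can be computed from held-out predictions as in the following Python sketch. The helper names, the 0.5 classification threshold and the toy labels and scores are assumptions made for illustration; the fold assignment shown is a simple interleaved split rather than the random split a production implementation would normally use.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity and accuracy of thresholded scores (labels are 0/1).

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s > threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s <= threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s <= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s > threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds (no shuffling, for clarity)."""
    return [list(range(fold, n, k)) for fold in range(k)]


# Toy held-out predictions (made up).
y = [1, 1, 1, 0, 0, 0, 0, 1]
score = [0.9, 0.6, 0.3, 0.2, 0.7, 0.1, 0.4, 0.8]
metrics = classification_metrics(y, score)
print(metrics, auc(y, score), k_fold_indices(10, 5))
```

In a k-fold run, each fold in turn is held out, the model is fitted on the remaining folds, and the metrics are computed on the held-out predictions.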
Computing the correct p-values for lasso and adaptive lasso
When running the Lasso (or adaptive Lasso) regression analysis, most statistical software programs do not provide p-values. Computing p-values for the Lasso (or adaptive Lasso) is difficult because both regression methods involve a variable selection procedure (see the detailed explanation in Lee et al.29). To solve this problem, Lee et al.29 developed a general approach to compute the correct p-values after model selection. Here we used the ‘selectiveInference’ R package29,30, which implements the algorithm by Lee et al., to compute the correct p-values for the Lasso and adaptive Lasso regression.
Results
Introduction to HDMAC
We constructed a package of high-dimensional analysis of molecular alterations in cancer, HDMAC, and made it a web-based platform at https://ripsung26.shinyapps.io/rshiny/. A flowchart of running HDMAC is provided in Fig. 1, and the tutorial on how to use it is available both at GitHub (https://github.com/chung-R/HD-MAC) and as a supplementary file (Supplementary Method 1).
On HDMAC, we provide a set of example data for users to get familiar with the platform. For analysis of their own data, users may upload the data and run them through the statistical methods provided. For analysis of data from TCGA, users may take advantage of the R code we wrote to download whole sets of data from TCGA. This code helps with procurement of large-scale data and is available on GitHub (see “Data download from TCGA.r” at https://github.com/chung-R/HD-MAC). We have also provided the code of the entire platform on GitHub (see folder HDMAC at https://github.com/chung-R/HD-MAC) for researchers to analyze their data offline with RStudio31. In addition, all the functions in our App were validated with different R packages, and the validation code is available on GitHub (https://github.com/chung-R/HD-MAC) as well.
Analysis with the statistical methods provided on HDMAC is illustrated in the sections below.
Survival analysis with serous type high-grade ovarian cancer patients
To show how to analyze survival data, we took a dataset of high-grade serous ovarian cancer and ran it on HDMAC. The patients’ overall survival was used as the outcome variable.
The three Cox regression methods available on HDMAC, the ridge, Lasso and adaptive Lasso, were used to analyze the data with overall survival as the outcome. The ridge regression, which does not perform variable selection, retained the mutations of all 670 genes, while the Lasso and the adaptive Lasso each selected 1 gene.
Then the method to control the FDR was included as the first-step screening. As a result, the ridge, Lasso and adaptive Lasso Cox methods all selected mutations of the same 2 genes, ZSWIM8 and PABPC3.
The above results may be tested for their predictive performance with the cross validation method provided on HDMAC. Here we adopted the 5-fold cross validation to calculate the C-indices of the results above. The C-indices were 0.529, 0.501, and 0.501 for the ridge, Lasso and adaptive Lasso without controlling the FDR, respectively, and with the control for the FDR, the three indices were 0.502, 0.502, and 0.497, respectively.
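For readers unfamiliar with the C-index, the following Python sketch computes Harrell's concordance index on a toy dataset. It is a simplified illustration: the function name and data are hypothetical, pairs with tied observed times are skipped, and it is not the implementation used on the platform. A value of 0.5 corresponds to random ordering and 1 to perfect ordering of survival times by predicted risk.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when the subject with the shorter observed time
    had an event. Higher risk should go with shorter survival; ties in risk
    count as 1/2. Pairs with tied observed times are skipped in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data (made up): 4 subjects, the second one censored.
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 1]
risk = [0.9, 0.2, 0.4, 0.6]
c_index = concordance_index(times, events, risk)
print(c_index)
```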
Similar analysis was then performed on the mRNA expression data. The ridge Cox regression retained all 9,548 mRNA expression entries, while the Lasso and adaptive Lasso each selected 4 mRNA entries. Their C-indices were 0.591, 0.554, and 0.560, respectively. When the method to control the FDR was included as the first-step screening, the same 6 entries were selected by all three methods, with respective C-indices of 0.538, 0.538, and 0.540.
All the results above are summarized in Table 1. The 2 mutated genes and the 6 abnormally expressed mRNAs identified with the control for the FDR, as well as their hazard ratios, are listed in Table 2.
Table 1.
Data analyzed with Cox PH | Screening | Ridge: number | Ridge: C-index | Lasso: number | Lasso: C-index | Adaptive Lasso: number | Adaptive Lasso: C-index |
---|---|---|---|---|---|---|---|
Mutated genes | no FDR | 670 | 0.529 | 1 | 0.501 | 1 | 0.501 |
Mutated genes | after FDR | 2 | 0.502 | 2 | 0.502 | 2 | 0.497 |
mRNA expression abnormalities | no FDR | 9548 | 0.591 | 4 | 0.554 | 4 | 0.560 |
mRNA expression abnormalities | after FDR | 6 | 0.538 | 6 | 0.538 | 6 | 0.540 |
Table 2.
Alteration | Gene | Estimated coefficient | Hazard ratio | p-value |
---|---|---|---|---|
Mutation | ZSWIM8 | 2.014 | 7.493 | 0.00007 |
Mutation | PABPC3 | 1.729 | 5.635 | 0.00071 |
Abnormal expression | ASAP3 | 0.09 | 1.094 | 0.0682 |
Abnormal expression | C10ORF113 | 0.08 | 1.083 | 0.0330 |
Abnormal expression | TIGAR | 0.08 | 1.083 | 0.0001 |
Abnormal expression | KIAA0100 | 0.05 | 1.051 | 0.0188 |
Abnormal expression | REPL4B | 0.007 | 1.007 | 0.0036 |
Abnormal expression | ZFHX4 | 0.08 | 1.083 | 0.0231 |
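The hazard ratios in Table 2 are the exponentials of the estimated Cox coefficients. The short Python check below reproduces this relationship for the two mutated genes, using the coefficients copied from the table; the variable names are arbitrary.

```python
import math

# Estimated Cox coefficients for the two mutated genes, copied from Table 2.
coefficients = {"ZSWIM8": 2.014, "PABPC3": 1.729}

# The hazard ratio for a one-unit change in a covariate is exp(coefficient).
hazard_ratios = {gene: math.exp(coef) for gene, coef in coefficients.items()}

for gene, hr in hazard_ratios.items():
    print(f"{gene}: hazard ratio = {hr:.3f}")
```

The same transformation applied to a logistic regression coefficient gives an odds ratio instead of a hazard ratio.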
Logistic regression analysis on the invasion subtype of bladder cancer
To demonstrate the analysis associated with binary clinical outcomes, we chose a set of bladder cancer data and performed analysis relative to the subtype of bladder cancer, i.e., whether the patients had invasive or non-invasive tumors. We chose a different dataset to illustrate the analysis with logistic regression here to show that HDMAC is applicable to various types of data. The analysis with a binary outcome based on the ovarian cancer data above and that with survival based on the bladder cancer data here are provided in Supplementary Tables S1 and S2.
As the outcome was binary, we used the ridge, Lasso and adaptive Lasso logistic regression. Without controlling the FDR, the ridge logistic regression retained all 8,024 mRNA entries, and the Lasso and the adaptive Lasso selected 46 and 27 entries, respectively, in relation to cancer subtype. When the method to control the FDR was included, the ridge retained 461 mRNA entries, and the Lasso and the adaptive Lasso selected 36 and 24, respectively. We also tested the predictive performance of these results by calculating the sensitivity, specificity, accuracy, and AUC based on 5-fold cross validation. All the results above are shown in Table 3. As a relatively large number of genes were selected by each method, we present only the shortest list, i.e., the mRNA entries selected by the adaptive Lasso regression with FDR screening, together with their estimated coefficients, in Table 4.
Table 3.
Logistic regression | Ridge, no FDR | Ridge, with FDR | Lasso, no FDR | Lasso, with FDR | Adaptive Lasso, no FDR | Adaptive Lasso, with FDR |
---|---|---|---|---|---|---|
Number of abnormally expressed mRNA entries | 8024 | 461 | 46 | 36 | 27 | 24 |
Sensitivity | 0.565 | 0.500 | 0.533 | 0.565 | 0.484 | 0.532 |
Specificity | 0.701 | 0.764 | 0.709 | 0.677 | 0.772 | 0.717 |
Accuracy | 0.656 | 0.677 | 0.651 | 0.640 | 0.677 | 0.656 |
AUC (area under curve, %) | 68.107 | 66.515 | 65.864 | 67.020 | 62.442 | 64.300 |
Table 4.
Abnormally expressed genes | Estimated coefficients (ln odds ratio) | Odds ratio | p-value |
---|---|---|---|
SPTSSA | −0.16 | 0.852 | 0.51 |
ATAT1 | 0.06 | 1.061 | 0.47 |
CABP4 | 0.26 | 1.296 | 0.11 |
CCNK | −0.27 | 1.309 | 0.19 |
CIR1 | 0.55 | 1.733 | 0.50 |
DPP9 | 0.42 | 1.521 | 0.05 |
FANCL | 0.01 | 1.010 | 0.92 |
ICOSLG | −0.66 | 0.516 | 0.004 |
JOSD1 | −0.35 | 0.704 | 0.54 |
MED30 | −0.43 | 0.650 | 0.01 |
NADSYN1 | −0.71 | 0.491 | 0.27 |
NCOA3 | −0.52 | 0.594 | 0.003 |
LINC00173 | −0.12 | 0.886 | 0.66 |
NKIRAS1 | −0.29 | 0.748 | 0.10 |
NUDT16P1 | 0.24 | 1.271 | 0.15 |
PDRG1 | −0.69 | 0.501 | 0.49 |
POLR1D | 0.55 | 1.733 | 0.02 |
PSORS1C2 | 1.14 | 3.126 | 0.005 |
RETSAT | −0.32 | 0.726 | 0.18 |
RPL23AP7 | 0.66 | 1.934 | 0.01 |
SETMAR | 0.29 | 1.336 | 0.52 |
SLC14A1 | 0.50 | 1.648 | 0.05 |
SLC39A4 | 0.14 | 1.150 | 0.65 |
ZSCAN2 | 0.27 | 1.309 | 0.16 |
Multivariate model building
Once genes are selected with their corresponding coefficients, a multivariate model may be built. For example, the coefficients of the abnormally expressed genes found to be associated with the invasive subtype of bladder cancer by the adaptive Lasso regression after FDR screening, as listed in Table 4, may be used to construct a multivariate model as follows:
A positive coefficient indicates that the gene’s abnormal expression is positively associated with the invasive subtype while a negative one, negatively. The result of the above function could be used to predict whether a patient has invasive bladder cancer with a given threshold. In this study, the threshold was set at 0.34 such that a patient with a score calculated from the above function higher than 0.34 would be predicted to have the invasive subtype of bladder cancer and vice versa.
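The idea of scoring a patient with the selected coefficients and comparing the score with a threshold can be sketched in Python as follows. This is a heavily hedged illustration, not the model from the study: it uses only three of the 24 coefficients in Table 4, it assumes no intercept, it assumes the threshold is applied to the linear score, and the coding of the covariate values is hypothetical. None of these details is specified in the text above.

```python
# Three of the 24 coefficients from Table 4, used only to keep the example short.
COEFFICIENTS = {"PSORS1C2": 1.14, "ICOSLG": -0.66, "NCOA3": -0.52}


def linear_score(expression, coefficients=COEFFICIENTS):
    """Weighted sum of covariate values; genes missing from `expression` count as 0."""
    return sum(coef * expression.get(gene, 0.0) for gene, coef in coefficients.items())


def predict_invasive(expression, threshold=0.34):
    """Predict the invasive subtype when the score exceeds the threshold."""
    return linear_score(expression) > threshold


# Hypothetical covariate values for one patient.
patient = {"PSORS1C2": 1.0, "ICOSLG": 0.0, "NCOA3": 1.0}
print(linear_score(patient), predict_invasive(patient))
```

A gene with a positive coefficient raises the score and a gene with a negative coefficient lowers it, which mirrors the interpretation of the signs given above.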
Computation time
Since the data to be analyzed on HDMAC may be very large, with many observations and/or many variables, there may be concerns about how efficient HDMAC is. We therefore measured the uploading time and computing time on simulated data with different numbers of observations and variables. Tables 5 and 6 show the uploading time and the average computing time, respectively, for both logistic and survival analyses. Each table shows the results of 9 combinations of a small (50), a medium (200) and a large (1000) number of observations with a small (50), a medium (500) and a large (5000) number of variables. All the analyses for the simulation were performed using the online version of HDMAC. The simulated data were generated based on the real datasets we used in this paper. To remain consistent with the real data analysis, the simulation used the adaptive Lasso for the logistic regression analysis and the Lasso for the survival analysis.
Table 5.
Number of observations | Variables: small (50) | Variables: medium (500) | Variables: large (5000) |
---|---|---|---|
Small (50) | 1.1 | 1.8 | 5.1 |
Medium (200) | 1.5 | 3.5 | 12.4 |
Large (1000) | 3.4 | 8.4 | 54.9 |
Table 6.
Number of observations | Logistic regression: small (50) variables | Logistic regression: medium (500) variables | Logistic regression: large (5000) variables | Survival analysis: small (50) variables | Survival analysis: medium (500) variables | Survival analysis: large (5000) variables |
---|---|---|---|---|---|---|
Small (50) | 1.5 | 1.7 | 4.5 | 1.4 | 1.6 | 2.5 |
Medium (200) | 1.7 | 1.9 | 5.5 | 1.6 | 5.8 | 14.1 |
Large (1000) | 4.3 | 6.4 | 16.4 | 12.8 | 59.2 | 128.2 |
As expected, both the uploading time and the computing time increased with the numbers of observations and variables. When the number of observations was large, the computing time for the survival analysis increased much more than that for the logistic regression analysis. In addition, when the numbers of observations and variables were both very large, the uploading time increased markedly.
Discussion
Cancer is one of the leading causes of death worldwide32. Recent advances in high-throughput assays and genomic analysis have greatly enriched our understanding of genetic alterations underlying the etiology of cancer. However, there is a growing need for convenient access to solid and rigorous statistical tools, especially those that are able to address the high dimensionality of genomic data. HDMAC, the platform we developed, has the following advantages. It provides regularized regression to analyze high-dimensional data and, as far as we know, is the only web-based software that offers penalized Cox regression for survival analysis. For logistic regression, HDMAC offers the adaptive Lasso regression, which is important for variable selection but rarely found in other web-based tools. It also provides users with many statistical analyses on a single platform, including the first-step screening (FDR method) and p-value corrections that usually require users to download specific packages or even navigate to a different platform. Furthermore, HDMAC is web based, and no code writing or downloading is needed.
HDMAC is a user-friendly, interactive and web-based platform. Few such platforms for genetic analysis have been reported in the literature, among which GEPIA and UALCAN are closest to our purpose. While both GEPIA and UALCAN are useful web-based interactive tools for analyzing cancer OMICS data and are suitable for exploratory analysis and visualization, the most important advantage of HDMAC is that it includes high-dimensional regression analysis, which the other two do not. Here, high-dimensional regression analysis means analyzing how thousands of variables or more affect the outcome at the same time. It is neither univariate analysis of many variables, which many web-based platforms for omics data analysis perform (i.e., many genes are considered, but each analysis involves only one gene), nor traditional multivariate regression analysis, which deals with at most dozens of variables at a time. The purpose of the high-dimensional regression analysis on HDMAC is to explore the combined effect of the high-dimensional genetic variables on the outcome, select important variables and estimate their prediction power for the outcome. As far as we know, HDMAC is the only web-based interactive tool that offers high-dimensional regression analysis, although such analysis has been used intensively for OMICS data. Moreover, GEPIA and UALCAN offer only univariate survival analysis, whereas HDMAC offers both survival and logistic regression analyses, with both univariate and multivariate options. Furthermore, HDMAC can analyze many kinds of OMICS data such as gene expression, copy number variation, mutation, protein expression and methylation, while the other two platforms focus on specific OMICS data: gene expression on GEPIA, and gene expression and methylation on UALCAN.
There are other apps related to HDMAC; e.g., CASAS is a web-based app for survival analysis and MLJAR (at https://mljar.com/) is a web-based tool for logistic regression analysis. However, CASAS offers only univariate Cox regression analysis for one or several user-specified variables, not high-dimensional penalized Cox regression analysis12, and MLJAR is for traditional, not regularized, logistic regression. Several apps provide some of the penalized regression analyses that are available on HDMAC. Compared to these apps, HDMAC has the advantage of offering these functions readily without any need to write code or download additional packages. For example, both Tensorboard and Weka require users to download and install software and/or packages or even write code to run regularized logistic regression, and even then only Lasso and Ridge, not adaptive Lasso, regression is available22–24. Similarly, for first-step screening or significance tests for the Lasso and adaptive Lasso regression, currently available apps require users either to download other packages or to run them using other apps.
As for more specific functions for statistical inference, HDMAC provides validation methods for prediction power so that researchers will be aware of how much confidence they may have in their results. Therefore, if a higher prediction power is desired, users may rely on the validation measures, e.g., the C-index for a survival outcome and accuracy for a binary one, for the final choice of a regression method. In contrast, if variable selection is preferred, the Lasso and the adaptive Lasso are the best choices. In particular, HDMAC offers an algorithm to calculate the correct p-values for the Lasso and adaptive Lasso methods, which are not usually available in common statistical software because these methods involve variable selection. In addition to the statistical strengths mentioned above, we also provided a method to control the FDR as the first-step screening. It is an option for users to address the multiple-testing problem that arises when they study the associations among many molecular variables at the same time. Inclusion of FDR screening is recommended if users are dealing with variables on the order of a hundred thousand, where penalized regression models fall short. In addition, clinical variables such as gender and age may also be included in the analysis, although they were not illustrated in the results above.
We have provided on GitHub both the R scripts of HDMAC, which enable RStudio users to run all the analyses on HDMAC offline, and the R script to download data from TCGA. Meanwhile, it is worth noting that users can use HDMAC with any data; the TCGA database is just one important source. Also, there are several existing useful tools to download TCGA data in addition to the R script we provided. For example, the FireBrowse portal allows for downloading TCGA data directly through a web UI (Firebrowse.org), and TCGAbiolinks (https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html) is also a useful R package to this end. Compared to TCGAbiolinks, our R script has the advantage that it was written with a hierarchical structure in which users are guided step by step to download a TCGA dataset. At each step, users can see the options they have on the screen and immediately know the key words they need to enter at the next step.
Ovarian cancer, especially the serous type high-grade ovarian cancer, is a major threat to women. It is the seventh most common cancer among women but the second leading cause of death from gynecologic cancers worldwide, with an estimated 295,414 new cases and 184,799 deaths in 201832. Most women are diagnosed with ovarian cancer at an advanced stage, and the overall 5-year survival rate ranges between 30% and 40%, which has seen only extremely modest improvement since 199533.
Some molecular changes are known to predispose the development of ovarian cancer. The most studied genes are BRCA1 and BRCA234–36. Other genes, such as CHEK2, ATM, and PALB2 and Lynch syndrome genes, are also implicated in ovarian cancer37. Overall, however, genome-wide search for genetic changes associated with survival in ovarian cancer is still waiting. Our efforts in this study came up with a preliminary list of genes worth further study in depth, such as ASAP3 [26886260].
Bladder cancer is the most common cancer of the urinary tract and the ninth most common cancer worldwide, with estimated 549,393 new cases and 199,992 deaths in 201832. Its incidence is observed to be strongly prevalent in males, with approximately a men-to-women ratio of 3:1, and it is strongly associated with smoking38. Approximately 80% of newly diagnosed patients are identified as the non-muscle invasive subtype (NMIBC; stages Ta/T1), while the remaining 20% are muscle invasive (MIBC; stages T2-4)39. Due to distinct cancerous behaviors and clinical outcome, their respective origins remain controversial40–42. Therefore, it is highly desirable to explore molecules involved in the interplay and transition between these two subtypes.
A variety of chromosomal alterations, including mutations, copy number changes and allelic losses, in combinations of multiple genetic signatures, have been linked to bladder cancer, such as changes in FGFR3, activation of cellular signaling in the PI3K, MAPK and WNT pathways, or dysregulation of genes involved in the cell cycle43. However, whether those alterations drive bladder cancer to become more aggressive needs further investigation. The genes identified in this study, although still preliminary, provide rational directions for further exploring the molecular links that control the transition between the two subtypes. Notably, different lines of evidence already support the relevance of our predicted gene candidates. For example, genetic variations in SLC14A1 have been linked to the development of bladder cancer44,45, and its upregulation has been suggested as a potential target for clinical intervention46,47. In addition, a negative regulatory role of MED30 has recently been revealed, in that its overexpression can suppress the progression of bladder cancer48.
In summary, the HDMAC platform we developed offers a solution for rigorous analysis of high-dimensional genomic data. It is clinically oriented and user friendly while including statistical methods to address major issues in large-scale data analysis. It thus has a potentially wide application.
Acknowledgements
We are grateful to the National Center for High-performance Computing of ROC for computer time and facilities. The study was supported in part by two grants from the Ministry of Science and Technology of ROC (106-2118-M-110-002 and 107-2118-M-110-003), three grants from KSVGH (VGHKS108-G2-1, VGHKS108-G2-2, and VGHKS108-G2-3) and an NSYSU-KMU joint research project (109-I004).
Author contributions
Study formulation and design: C.C., J.S. & C.Y.; data collection: C.S. & H.H.; data interpretation: I.C., W.K., L.C., C.Y. & J.S.; statistical analysis: C.C., C.S., H.H. & P.K.; platform building: C.C., C.S., H.H., M.C. & P.W.; overall analysis: all; medical & molecular interpretation: J.C., I.C., W.K., L.C., C.Y. & J.S.; figure preparation: C.S.; table preparation: C.S., C.C. & J.C.; writing: J.C., C.C. & J.S.; editing and checking: C.C., J.S., C.Y. & J.C.; manuscript approval: all.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Chia-Cheng Yu, Email: tough0857@icloud.com.
Jim Jinn-Chyuan Sheu, Email: jimsheu@mail.nsysu.edu.tw.
Supplementary information is available for this paper at 10.1038/s41598-020-60791-z.
References
- 1.Trevino V, Falciani F, Barrera-Saldana HA. DNA microarrays: a powerful genomic tool for biomedical and clinical research. Mol Med. 2007;13:527–541. doi: 10.2119/2006-00107.Trevino.
- 2.Reuter JA, Spacek DV, Snyder MP. High-throughput sequencing technologies. Molecular cell. 2015;58:586–597. doi: 10.1016/j.molcel.2015.05.004.
- 3.Weinstein JN, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature genetics. 2013;45:1113–1120. doi: 10.1038/ng.2764.
- 4.Human genomics The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110.
- 5.Gao J, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science signaling. 2013;6:pl1. doi: 10.1126/scisignal.2004088.
- 6.Cerami E, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery. 2012;2:401–404. doi: 10.1158/2159-8290.CD-12-0095.
- 7.Uhlen M, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347:1260419. doi: 10.1126/science.1260419.
- 8.Petryszak R, et al. Expression Atlas update–an integrated database of gene and protein expression in humans, animals and plants. Nucleic acids research. 2016;44:D746–752. doi: 10.1093/nar/gkv1045.
- 9.Cho S, et al. MiRGator v3.0: a microRNA portal for deep sequencing, expression profiling and mRNA targeting. Nucleic acids research. 2013;41:D252–257. doi: 10.1093/nar/gks1168.
- 10.Tang Z, Li C, Kang B, Gao G, Zhang Z. GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic acids research. 2017;45:W98–W102. doi: 10.1093/nar/gkx247.
- 11.Chandrashekar DS, et al. UALCAN: A Portal for Facilitating Tumor Subgroup Gene Expression and Survival Analyses. Neoplasia. 2017;19:649–658. doi: 10.1016/j.neo.2017.05.002.
- 12.Rupji M, Zhang X, Kowalski J. CASAS: Cancer Survival Analysis Suite, a web based application. F1000Research. 2017;6:919. doi: 10.12688/f1000research.11830.2.
- 13.Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of statistics. 2004;32:407–451. doi: 10.1214/009053604000000067.
- 14.Zou H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006;101:1418–1429. doi: 10.1198/016214506000000735.
- 15.Le Cessie S, Van Houwelingen JC. Ridge Estimators in Logistic Regression. Journal of the Royal Statistical Society. Series C (Applied Statistics) 1992;41:10.
- 16.Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in medicine. 1997;16:385–395. doi: 10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3.
- 17.Zhang HH, Lu WB. Adaptive lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703. doi: 10.1093/biomet/asm037.
- 18.Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50. doi: 10.1093/biostatistics/kxm010.
- 19.Algamal ZY, Lee MH. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. Appl. 2015;42:9326–9332. doi: 10.1016/j.eswa.2015.08.016.
- 20.Wu TT, Gong HJ, Clarke EM. A Transcriptome Analysis by Lasso Penalized Cox Regression for Pancreatic Cancer Survival. J Bioinf Comput Biol. 2011;9:63–73. doi: 10.1142/S0219720011005744.
- 21.Chen JJ, Roberson PK, Schell MJ. The false discovery rate: a key concept in large-scale genetic studies. Cancer control: journal of the Moffitt Cancer Center. 2010;17:58–62. doi: 10.1177/107327481001700108.
- 22.Demsar J CT, et al. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research. 2013;14:5.
- 23.Zhang Z, Mo L, Huang C, Xu P. Binary logistic regression modeling with TensorFlow. Annals of translational medicine. 2019;7:591. doi: 10.21037/atm.2019.09.125.
- 24.Frank, E., et al. In Data Mining and Knowledge Discovery Handbook 1305–1314 (Springer, 2005).
- 25.R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria, 2010).
- 26.The Shiny (v1.2.0) (2018).
- 27.Simon N, Friedman J, Hastie T, Tibshirani R. Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent. J Stat Softw. 2011;39:13. doi: 10.18637/jss.v039.i05.
- 28.Benjamini Y, Hochberg Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. B. 1995;57:289–300.
- 29.Lee JD, Sun DL, Sun Y, Taylor JE. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 21 (2016).
- 30.Taylor J, Tibshirani R. Post-selection inference for L1-penalized likelihood models. The Canadian Journal of Statistics. 2017;46:21. doi: 10.1002/cjs.11313.
- 31.RStudio: Integrated Development for R. (RStudio, Inc., Boston, MA, 2015).
- 32.Bray F, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians. 2018;68:394–424. doi: 10.3322/caac.21492.
- 33.Reid BM, Permuth JB, Sellers TA. Epidemiology of ovarian cancer: a review. Cancer biology & medicine. 2017;14:9–32. doi: 10.20892/j.issn.2095-3941.2016.0084.
- 34.Miki Y, et al. A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science. 1994;266:66–71. doi: 10.1126/science.7545954.
- 35.Wooster R, et al. Identification of the breast cancer susceptibility gene BRCA2. Nature. 1995;378:789–792. doi: 10.1038/378789a0.
- 36.Jayson GC, Kohn EC, Kitchener HC, Ledermann JA. Ovarian cancer. Lancet. 2014;384:1376–1388. doi: 10.1016/S0140-6736(13)62146-7.
- 37.Desmond A, et al. Clinical Actionability of Multigene Panel Testing for Hereditary Breast and Ovarian Cancer Risk Assessment. JAMA oncology. 2015;1:943–951. doi: 10.1001/jamaoncol.2015.2690.
- 38.Antoni S, et al. Bladder Cancer Incidence and Mortality: A Global Overview and Recent Trends. European urology. 2017;71:96–108. doi: 10.1016/j.eururo.2016.06.010.
- 39.Bellmunt J, et al. Bladder cancer: ESMO Practice Guidelines for diagnosis, treatment and follow-up. Annals of oncology: official journal of the European Society for Medical Oncology. 2014;25(Suppl 3):iii40–48. doi: 10.1093/annonc/mdu223.
- 40.Hedegaard J, et al. Comprehensive Transcriptional Analysis of Early-Stage Urothelial Carcinoma. Cancer cell. 2016;30:27–42. doi: 10.1016/j.ccell.2016.05.004.
- 41.Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315–322, 10.1038/nature12965 (2014).
- 42.Tsherniak A, et al. Defining a Cancer Dependency Map. Cell. 2017;170:564–576 e516. doi: 10.1016/j.cell.2017.06.010.
- 43.Knowles MA, Hurst CD. Molecular biology of bladder cancer: new insights into pathogenesis and clinical diversity. Nature reviews. Cancer. 2015;15:25–41. doi: 10.1038/nrc3817.
- 44.Koutros S, et al. Differential urinary specific gravity as a molecular phenotype of the bladder cancer genetic association in the urea transporter gene, SLC14A1. International journal of cancer. 2013;133:3008–3013. doi: 10.1002/ijc.28325.
- 45.Rafnar T, et al. European genome-wide association study identifies SLC14A1 as a new urinary bladder cancer susceptibility gene. Human molecular genetics. 2011;20:4268–4281. doi: 10.1093/hmg/ddr303.
- 46.Hou R, et al. Identification of a Novel UT-B Urea Transporter in Human Urothelial Cancer. Frontiers in physiology. 2017;8:245. doi: 10.3389/fphys.2017.00245.
- 47.Hou R, Kong X, Yang B, Xie Y, Chen G. SLC14A1: a novel target for human urothelial cancer. Clinical & translational oncology: official publication of the Federation of Spanish Oncology Societies and of the National Cancer Institute of Mexico. 2017;19:1438–1446. doi: 10.1007/s12094-017-1693-3.
- 48.Syring I, et al. The Contrasting Role of the Mediator Subunit MED30 in the Progression of Bladder Cancer. Anticancer research. 2017;37:6685–6695. doi: 10.21873/anticanres.12127.