Author manuscript; available in PMC: 2022 Sep 1.
Published in final edited form as: IEEE J Biomed Health Inform. 2019 Jul 30;24(5):1528–1536. doi: 10.1109/JBHI.2019.2931997

Inferring Gene Regulatory Networks of Metabolic Enzymes Using Gradient Boosted Trees

Yi Zhang 1, Xiaofei Zhang 2, Andrew N Lane 3, Teresa W-M Fan 4, Jinze Liu 5
PMCID: PMC9435551  NIHMSID: NIHMS1829410  PMID: 31380773

Abstract

Metabolic reprogramming is a hallmark of cancer. In cancer cells, transcription factors (TFs) govern metabolic reprogramming through abnormally increasing or decreasing the transcription rate of metabolic enzymes, which provides cancer cells growth advantages and concurrently leads to the altered metabolic phenotypes observed in many cancers. Consequently, targeting TFs that govern metabolic reprogramming can be highly effective for novel cancer therapeutics. In this work, we present TFmeta, a machine learning approach to uncover TFs that govern reprogramming of cancer metabolism. Our approach achieves state-of-the-art performance in reconstructing relations between TFs and their target genes on public benchmark data sets. Leveraging TF binding profiles inferred from genome-wide ChIP-seq experiments and 150 RNA-seq samples from 75 paired cancerous (CA) and non-cancerous (NC) human lung tissues, our approach predicted 19 key TFs that may be the major regulators of the gene expression changes of metabolic enzymes of the central metabolic pathway glycolysis, which may underlie the dysregulation of glycolysis in non-small-cell lung cancer patients.

Keywords: Machine learning, transcription factor, gene regulatory network, metabolic reprogramming, lung cancer

I. INTRODUCTION

METABOLISM is the collection of predominantly enzyme-catalyzed biochemical transformations that are needed for maintenance, growth and survival of an organism. For nearly a century, scientists have documented profound metabolic changes that occur in tumors [1, 2]. Oncogenes and tumor suppressors are well-established regulators of metabolism, and dysregulated expression as well as mutations can lead to the altered metabolic phenotypes observed in many cancers [3, 4]. A high proportion of oncogenes and tumor suppressor genes encode transcription factors (TFs) [5]. Most oncogenic pathways converge on sets of TFs that ultimately control gene expression patterns resulting in tumor formation and progression as well as metastasis [6]. Deregulated expression, activation or inactivation of TFs play critical roles in tumorigenesis. In cancer cells, TFs govern metabolic reprogramming by controlling the expression patterns of metabolic enzymes [7, 8]. For example, the transcription factor MYC is frequently overexpressed in human cancers and regulates the expression of many metabolic enzymes. In carcinomas, MYC drives increased Gln uptake and conversion to Glu by upregulating glutamine transporters and inducing the expression of metabolic enzyme GLS at the mRNA and protein level, leading to increased anaplerotic input via glutaminolysis into the Krebs cycle and increased Gln incorporation into lactate [3, 9, 10].

Comprehensive characterization of TF-metabolic enzyme relations in cancer cells can help uncover potential TFs governing cancer metabolic reprogramming and prioritize targets for novel cancer therapeutics. Reconstructing relations between TFs and their target genes from transcriptomic data is a long-standing and well-studied challenge in molecular and computational biology. However, current methods have at least two major drawbacks for reconstructing TF-target gene relations. First, a fundamental assumption of current relation reconstruction methods using transcriptomic data is that mRNA levels of TFs and their target genes are strongly correlated; however, this assumption may not be true for all the data sets, especially for those containing complex TF-target gene relations. The Dialogue on Reverse Engineering Assessment and Methods (DREAM) project performed an assessment of 35 TF-target gene relation reconstruction methods on both synthetic and real transcriptomic data sets [11]. The competing methods achieved an average AUROC score of 0.69 on the synthetic data set, but 0.55 on the real data sets. The poor performance on the real data sets was due to the low correlation at the mRNA level in the data, which would suggest that reliable reconstruction of complex TF-target gene relations requires the incorporation of additional information from heterogeneous data sources besides transcriptomic data. Second, current relation reconstruction methods disregard the valuable pairing information of the samples in transcriptomic data, treating each input gene expression profile independently in their inference models. For cancer patients’ transcriptomic data, pairwise comparisons of gene expression profiles between matched cancerous (CA) and non-cancerous (NC) samples of the same patient should circumvent the interferences from genetic and physiological variations, eliminating the prediction of false TF-target gene relations caused by the variations.

Here, we developed TFmeta, a machine learning method for the reverse engineering of TF-metabolic enzyme relations that pinpoint TFs governing cancer metabolic reprogramming. To improve the inference performance on data sets with low correlations at the mRNA level, TFmeta leverages TF binding profiles inferred from genome-wide ChIP-seq experiments to select the candidate transcription factors fed to the inference models. To take advantage of the pairing information ignored by existing approaches, pairwise comparisons between matched CA and NC samples are included in the pipeline. Using a gold standard data set, namely the DREAM5 network inference challenge [11] data, we demonstrate that TFmeta outperformed the winner of the challenge in reconstructing TF-target gene relations. Taking 150 RNA-seq samples from 75 paired CA and NC human lung tissues and TF binding profiles as input, TFmeta predicted a set of key TFs that may control the transcription rate of metabolic enzymes in the central metabolic pathway glycolysis, which may cause the observed metabolic reprogramming of the glycolysis pathway in non-small-cell lung cancer patients. A preliminary version of this work has been reported [12]. In this extended version, we have thoroughly reviewed the related work on gene regulatory network inference and discussed the pros and cons of each category of methods. In addition, detailed clinical information on the samples and patients has been included. Additional experiments and results beyond the preliminary version have been added: we studied the function of TF binding profiles, tested whether the type of tree affects the feature importance measure, and applied TFmeta to infer TFs that govern other major metabolic pathways in non-small-cell lung cancer patients.

II. RELATED WORK

Several different approaches exist for reconstructing the structure of gene regulatory networks. First, data-driven methods estimate gene dependencies directly from the data. The edges of a fully connected network are weighted according to the calculated dependencies, and the final regulatory network is obtained by applying a suitable threshold. One of the simplest data-driven methods, the correlation method, reconstructs the gene regulatory network by calculating the pairwise correlation between genes. Different correlation measures, such as Pearson correlation, Kendall’s correlation and Spearman’s correlation, can be used in the analysis. Weighted gene co-expression network analysis (WGCNA) [13] is a reliable and widely used tool in this category. Because correlation is limited in capturing complex regulatory relations, some studies (ARACNE [14], CLR [15] and MRNET [16]) use mutual information instead, replacing the unknown probability distributions with the empirical distribution of gene expression levels for each pair of genes, estimated from the samples. User-defined thresholds are then used to filter the edges of the fully connected network to obtain the gene regulatory network. Besides the correlation methods, regression methods quantify the dependency between two genes by using one gene to predict the other. Many popular methods take this approach. TIGRESS [17] applies L1 regularization to a linear regression model to reconstruct gene regulatory networks. GENIE3 [18] and its descendant methods [19, 20] infer the network using an ensemble of regression trees. Regression methods have predictive capability but are relatively more computationally intensive. In general, data-driven methods are the most popular methods for generating gene regulatory networks because of their simplicity of implementation and computational efficiency.

As a non-data-driven approach, probabilistic models capture the probabilistic structure of the data using global measures of joint likelihood or a Bayesian approach, which data-driven methods cannot do explicitly. Gaussian graphical models [21] have been successfully used in gene regulatory network reconstruction by assuming that gene expression follows a multivariate normal distribution. Another method in this category, Bayesian networks, can predict networks with limited prior information [22]. Probabilistic models are easy to implement but hard to optimize and scale up [22]. As a special class of Bayesian networks, dynamic Bayesian networks (DBNs) use time series data to capture the features of a biological system. To some extent, a DBN reduces the optimization time compared to a standard Bayesian network because the condition that the network contains no loops is automatically satisfied. Many tools have been developed using the DBN approach [23].

Differential equation methods are also widely used to reconstruct gene regulatory networks from time series data. The Inferelator [24] is the most popular tool that uses a differential equation method. Because their continuous-time semantics are close to the models used in systems biology analysis, the results from differential equation methods are more interpretable than those of other methods. However, time series data are generally hard to obtain and are constrained by the experimental design. Methods that use time series data are also subject to computational difficulties.

III. MATERIALS

A. Data Description

We generated 150 RNA-seq data sets from 75 paired CA and NC human lung tissues under IRB approval from the University of Kentucky. All patient information was de-identified and handled according to HIPAA guidelines. Samples were freshly resected paired CA and NC lung tissue from each subject according to IRB-approved protocols. Table I summarizes the clinical attributes of 73 subjects; clinical information for the remaining two subjects is lacking. Samples were collected intraoperatively, and portions of tumor and non-tumorous lung both adjacent to and distant from (>5 cm) the tumor margins were flash frozen in liquid N2 as previously described [25, 26]. Further samples were placed in 4% formalin for pathological analysis, and in media for metabolic analyses. Total RNA was isolated from the paired primary CA and NC bulk tissues flash frozen in the operating room (OR) within 30 min of resection. The tissues were assessed pathologically for necrosis, viability and cellularity, and only necrosis-free regions that contained at least 30% cancer were used for RNA-seq.

TABLE I.

Summary of Clinical Attributes for Lung Specimens

| | Total | ADC | SCC | Other |
|---|---|---|---|---|
| Number of Cau males | 33 | 12 | 12 | 9 |
| Number of AA males | 3 | 0 | 2 | 1 |
| Number of Cau females | 34 | 14 | 15 | 5 |
| Number of AA females | 3 | 2 | 0 | 1 |
| Sum | 73 | 28 | 29 | 16 |

Cau: Caucasian; AA: African American; ADC: adenocarcinoma; SCC: squamous cell carcinoma; Other: stage 4 metastases to the lung.

B. RNA-seq Analysis

Paired-end reads of 100 bp were generated on an Illumina HiSeq 2000 sequencer. RNA-seq reads were mapped to the human reference genome GRCh38, and gene expression values (TPM, transcripts per million) were estimated using the RSEM package [27]. Gene expression profiles generated by RSEM were normalized and therefore comparable between samples. Pairwise gene expression comparisons of CA and NC samples from the same patient were conducted by measuring the log2 ratios of gene expression values between CA and matched NC samples. Based on the log2 ratios, we maintained a master table showing the regulation status of each gene in each individual patient. The regulation status of each gene was represented by a categorical variable that takes one of three possible values: upregulated, downregulated, or no change. Genes with a log2 ratio greater than 0.8 were categorized as upregulated, genes with a log2 ratio lower than −0.8 were categorized as downregulated, and the rest were categorized as having no expression change. The size of the master table was 19,814 (number of genes) by 75 (number of patients). Using the gene expression log2 ratio of paired CA and NC tissue samples from the same patient should reduce the effects of individuality and the impact of tissue-specific genes and, consequently, increase the accuracy of predicting clinical outcomes [28].
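
The thresholding step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the TFmeta source code: the function names, the toy TPM values, and in particular the pseudocount (added here only to avoid taking the logarithm of zero) are assumptions that do not come from the paper.

```python
import math

THRESHOLD = 0.8  # log2-ratio cutoff stated in the text


def regulation_status(tpm_ca, tpm_nc, pseudocount=1.0):
    """Classify one gene in one patient from paired CA/NC TPM values.

    The pseudocount is an assumption of this sketch (it avoids log2(0));
    the paper does not say how zero expression values were handled.
    """
    ratio = math.log2((tpm_ca + pseudocount) / (tpm_nc + pseudocount))
    if ratio > THRESHOLD:
        return "upregulated"
    if ratio < -THRESHOLD:
        return "downregulated"
    return "no change"


def build_master_table(ca, nc):
    """ca, nc: dict gene -> list of TPM values, one per patient (same order)."""
    return {
        gene: [regulation_status(c, n) for c, n in zip(ca[gene], nc[gene])]
        for gene in ca
    }


# Toy TPM values for two genes in three patients (invented for illustration).
ca = {"GAPDH": [300.0, 120.0, 90.0], "FBP1": [5.0, 40.0, 10.0]}
nc = {"GAPDH": [100.0, 100.0, 100.0], "FBP1": [50.0, 45.0, 30.0]}
master = build_master_table(ca, nc)
```

In the real analysis the resulting table would have 19,814 rows and 75 columns; the toy table here has two rows and three columns.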

We then collected detailed information on the major metabolic pathways in humans, including glycolysis, the Krebs cycle, purine metabolism, and others, from the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database [29]. The regulation status of the metabolic enzymes involved in each metabolic pathway was extracted from the master regulation status table. Using a one-tailed one-proportion z-test (with a hypothesized proportion of 0.6667), we considered metabolic enzymes with a consistent expression change (upregulated or downregulated) in at least 57 of the 75 patients as altered metabolic enzymes (P value for 57 patients: p=0.0433<0.05).
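
The cutoff of 57 patients can be checked with a short calculation. The sketch below assumes the standard one-proportion z statistic with the standard error evaluated under the null proportion, and no continuity correction; the paper does not state these details, and the function name is hypothetical.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Upper-tailed one-proportion z-test; returns (z, p_value)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z), standard normal
    return z, p_value


z57, p57 = one_proportion_z_test(57, 75, 0.6667)
z56, p56 = one_proportion_z_test(56, 75, 0.6667)
```

Under these assumptions, 57 of 75 patients gives a p-value of approximately 0.043, in line with the reported value, while 56 of 75 gives a p-value above 0.05, which is consistent with 57 being the smallest count that passes the test.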

C. Transcription Factor Binding Profiling

We integrated TF binding profiles inferred from genome-wide ChIP-seq experiments in four public databases: ChEA [30], ENCODE [31], JASPAR [32], and TRANSFAC [33]. In total, we accumulated 2,286,192 TF DNA binding activities, involving 493 TFs and 23,644 target genes. The minimum, median and maximum numbers of TFs binding to a target gene are 1, 104 and 279, and the minimum, median and maximum numbers of target genes for one TF are 4, 1,853 and 21,545, respectively. The total number of metabolic enzymes involved in the major metabolic pathways is 366. For each altered metabolic enzyme, we curated a list of TFs that bind to the transcription start site of that enzyme according to the TF DNA binding activities.

IV. METHODS

A. Problem Definition

We approached the problem of uncovering TFs that govern cancer metabolic reprogramming by measuring the relations between the altered metabolic enzymes and the TFs binding to their transcription start sites. Through RNA-seq analysis, we identified M altered metabolic enzymes with consistent expression change between CA and matched NC samples. We divided the problem of inferring TF-metabolic enzyme relations involving M enzymes into M sub-problems, each of which uncovers the TFs regulating one of the enzymes. We generated M sub-tables from the master regulation status table, each of which contained the regulation status of one enzyme and of the TFs that bind to the transcription start site of that enzyme according to the TF DNA binding activities. Leveraging the TF binding profiles, we filtered out the irrelevant TFs and kept only the relevant ones, which helps to avoid overfitting. In the sub-table for enzyme m with $T_m$ TFs binding to its transcription start site, every patient’s regulation status profile can be expressed as $(x_n^m, y_n^m)$, where n ∈ {1, … , N} is the index of each patient out of N patients, $x_n^m$ is a vector holding the regulation status of the $T_m$ TFs, and $y_n^m$ is the regulation status of enzyme m.
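
As a rough illustration of how one sub-table might be assembled, the sketch below selects, for a given enzyme, the TFs listed in the binding profile and collects their regulation status per patient. The data structures (`master`, `binding`), the function name and the toy values are assumptions made for this example and are not taken from the TFmeta code.

```python
def build_subtable(master, binding, enzyme):
    """Return (tf_names, X, y) for one enzyme.

    master:  dict gene -> list of regulation statuses, one per patient
    binding: dict target gene -> set of TFs bound at its transcription start site
    X[n] holds the statuses of the selected TFs for patient n; y[n] is the
    status of the enzyme for the same patient.
    """
    # Keep only TFs that bind the enzyme and have a row in the master table.
    tf_names = sorted(tf for tf in binding.get(enzyme, ()) if tf in master)
    n_patients = len(master[enzyme])
    X = [[master[tf][n] for tf in tf_names] for n in range(n_patients)]
    y = list(master[enzyme])
    return tf_names, X, y


# Toy inputs for three patients (statuses and binding sets invented for illustration).
master = {
    "GAPDH": ["upregulated", "upregulated", "no change"],
    "MYC": ["upregulated", "no change", "no change"],
    "ZBTB7A": ["downregulated", "downregulated", "no change"],
    "TAL1": ["no change", "no change", "no change"],
}
binding = {"GAPDH": {"MYC", "ZBTB7A", "KLF4"}}  # KLF4 has no row in the toy master table

tf_names, X, y = build_subtable(master, binding, "GAPDH")
```

Here TAL1 is excluded because it is not in the binding list for the enzyme, which mirrors the filtering of irrelevant TFs described above.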

B. Relation Inference as a Feature Selection Problem

TFs and their target genes are known to interact in a dynamic and nonlinear manner [34]. We hypothesize that the regulation status of the enzyme m is a function $f_m$ of the regulation status of the $T_m$ TFs, and that $f_m$ only employs the regulation status of the TFs that are direct regulators of the enzyme m. Identifying those TFs whose regulation status is predictive of the regulation status of the enzyme m can be considered a feature selection problem, which, in machine learning terminology, is to rank the input features of $f_m$ by their relevance for predicting the output. Considering the large number of TFs as input features relative to the small set of patient regulation status profiles available for learning, and the nonlinear relationship between the input TFs and the output enzyme, we proposed to use gradient boosted trees [35, 36] to find the function $f_m$ and rank the input TFs by their relevance. Gradient tree boosting is a scalable and highly effective machine learning algorithm that works well in reliably extracting relevant features and identifying nonlinear feature relations. Gradient boosting is an ensemble technique in which weak base learners are trained sequentially, and each later learner corrects the errors of the previously trained ones. Gradient boosted trees (GBT) use CARTs (classification and regression trees) as the base learners. Random forest (RF) also uses CARTs as the base learners but builds them independently. In both GBT and RF, Gini impurity scores are used to calculate the importance of the features used. However, because the CARTs are built differently, the feature importance calculated by the two methods may vary.

C. Gradient Boosted Tree-based Model

For each sub-problem, we fitted a multi-class classification model ($f_m$) to predict the regulation status (upregulated, downregulated, no change) of the enzyme m based on the combined regulation status of the $T_m$ TFs. Gradient boosted trees were employed to find the function $f_m$ that minimizes the multi-class classification error rate, which is calculated as the number of wrong predictions divided by the number of all predictions. To achieve this goal, a classification and regression tree (CART) recursively partitions the N patients into smaller disjoint sets based on the input regulation status of the TFs, aiming to minimize the number of wrong predictions of the output enzyme regulation status in the resulting subsets. A CART uses a tree structure to represent the recursive partition, and each leaf of the tree represents a cell of the partition. The basic idea of tree boosting is to build additive models from classification and regression trees. Let $b_{m,k}(x_n^m)$ be a classification and regression tree in the m-th sub-problem, which works as the base learner. In tree boosting, we built a model that is the sum of base learners:

$$f_m(x_n^m) = \sum_{k=1}^{K} b_{m,k}(x_n^m),$$

where k ∈ {1, … , K} is the index of each base learner out of K base learners. The target additive model was built in a forward stagewise fashion. Namely, it started with the simple function $f_{m,0}(x_n^m) = 0$ and then iteratively added base learners to minimize the multi-class classification error rate of $f_{m,k-1}(x_n^m) + b_{m,k}(x_n^m)$. Gradient boosting attempts to solve this minimization problem numerically via steepest descent. By iteratively shifting the focus towards problematic observations that were difficult to predict, the performance of the classification and regression trees is substantially boosted.
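
To make the forward stagewise construction concrete, the following sketch builds an additive model from depth-one regression trees (stumps). It is a deliberately simplified stand-in, not the TFmeta model: it uses squared-error loss on numerically coded statuses (−1, 0, 1) instead of multi-class classification, stumps instead of deeper trees, and plain Python instead of XGBoost. All function names and the toy data are assumptions of this example.

```python
def fit_stump(X, residuals):
    """Least-squares regression stump: (feature, threshold, left_value, right_value)."""
    n = len(X)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for thr in values[:-1]:  # skip the largest value so both sides are non-empty
            left = [residuals[i] for i in range(n) if X[i][j] <= thr]
            right = [residuals[i] for i in range(n) if X[i][j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:  # every feature is constant: fall back to a constant learner
        mean = sum(residuals) / n
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def boost(X, y, n_rounds=200, learning_rate=0.1):
    """Forward stagewise boosting: start at f_0 = 0 and add one learner per round."""
    preds = [0.0] * len(X)
    learners = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual y - f_{k-1}(x).
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        j, thr, lval, rval = fit_stump(X, residuals)
        learner = (j, thr, learning_rate * lval, learning_rate * rval)  # shrunken b_k
        learners.append(learner)
        preds = [p + stump_predict(learner, row) for p, row in zip(preds, X)]
    return learners


def predict(learners, row):
    """f(x) = sum over k of b_k(x)."""
    return sum(stump_predict(b, row) for b in learners)


# Toy data: two TFs coded as -1 (down), 0 (no change), 1 (up); the target follows TF 0.
X = [[1, 0], [1, 1], [0, 0], [0, -1], [-1, 1], [-1, 0]]
y = [1, 1, 0, 0, -1, -1]
model = boost(X, y)
fitted = [predict(model, row) for row in X]
```

Each round fits a new stump to what the current sum of learners still gets wrong, which is the sense in which later learners focus on observations that were difficult to predict.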

D. Feature Importance Measure: TF Ranking

A benefit of using CART-based methods is that, after the trees are constructed, it is relatively straightforward to retrieve estimates of feature importance that allow the input features to be ranked according to their relevance for predicting the output. The feature importance is independent of the choice of tree type (classification trees or regression trees). For a single classification and regression tree, the importance is calculated as the amount by which each attribute split point reduces the Gini impurity, weighted by the number of observations the node is responsible for. The feature importance scores are then averaged across all the classification and regression trees within the model. In this application, every CART-based sub-model solving one sub-problem yields a separate ranking of TFs as potential regulators of a target enzyme m along with importance scores $I_{m,t_m}$ for $t_m \in \{1, \dots, T_m\}$.
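
The importance measure described above can be illustrated with a small hand-built example. The sketch follows the verbal description in this section (Gini decrease per split, weighted by the share of observations at the node, normalized per tree and averaged over trees); it is not the exact computation performed inside scikit-learn or XGBoost, and the way trees are represented here, as lists of splits, is an assumption made for brevity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    decrease = (gini(parent)
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
    return (n / n_total) * decrease


def tree_importance(splits, n_total):
    """splits: list of (feature, parent_labels, left_labels, right_labels)."""
    scores = Counter()
    for feature, parent, left, right in splits:
        scores[feature] += weighted_gini_decrease(parent, left, right, n_total)
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()} if total else dict(scores)


def model_importance(trees, n_total):
    """Average the normalized per-tree importances over all trees."""
    per_tree = [tree_importance(splits, n_total) for splits in trees]
    features = {f for imp in per_tree for f in imp}
    return {f: sum(imp.get(f, 0.0) for imp in per_tree) / len(per_tree)
            for f in features}


# Two toy trees over 8 patients (labels and splits invented for illustration).
root = ["up"] * 4 + ["down"] * 4
tree1 = [("MYC", root, ["up"] * 4, ["down"] * 4)]
tree2 = [
    ("ETS1", root, ["up", "up", "up", "down"], ["up", "down", "down", "down"]),
    ("MYC", ["up", "up", "up", "down"], ["up", "up", "up"], ["down"]),
]
importance = model_importance([tree1, tree2], n_total=8)
ranking = sorted(importance, key=importance.get, reverse=True)
```

In this toy model MYC receives an importance of 0.8 and ETS1 an importance of 0.2, so MYC is ranked first.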

E. TF-metabolic enzyme Map

To combine the separate rankings of TFs in sub-models, we performed the Wilcoxon signed-rank test on every pair of TFs to compare their ranks, which tested whether the ranks of one TF from all sub-models were significantly higher (or lower) than those of the other TF. Based on the test decisions of comparing all pairs of TFs, the orders of TFs were eventually determined to generate the combined ranking. Through evaluating the number of output TFs and their biological significance, we considered TFs in the top 5% of the combined ranking as robust targets. In this article, we focus on the task of ranking the candidate transcription factors. The number of transcription factors selected for downstream analysis or further experimental validation varies in different scenarios. Thus, the choice of an optimal percentage will be left open and determined by users. Though the default percentage is 5%, this setting can be easily adjusted in different applications. The relations between the predicted TFs and their target enzymes were then displayed in a TF-metabolic enzyme map. The overall workflow of TF-metabolic enzyme relation inference is shown in Fig. 1.
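
One possible way to turn the pairwise tests into a combined ranking is sketched below. Several details are assumptions of this sketch and are not specified in the text: the signed-rank test uses a normal approximation without continuity or tie correction, every TF is assumed to have a rank in every sub-model, TFs are ordered by their number of significant pairwise wins, and ties in wins are broken by the sum of ranks. The function names and toy ranks are hypothetical.

```python
import math
from itertools import combinations


def signed_rank_p(a, b):
    """One-sided Wilcoxon signed-rank p-value that ranks in a tend to be smaller than in b.

    Normal approximation; tied absolute differences get average ranks and
    zero differences are dropped.
    """
    d = [x - y for x, y in zip(a, b) if x != y]
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0  # average rank of the tie group
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # P(Z <= z)


def combine_rankings(ranks, alpha=0.05):
    """ranks: dict TF -> list of ranks (1 = most important), one per sub-model."""
    wins = {tf: 0 for tf in ranks}
    for a, b in combinations(ranks, 2):
        if signed_rank_p(ranks[a], ranks[b]) < alpha:
            wins[a] += 1
        elif signed_rank_p(ranks[b], ranks[a]) < alpha:
            wins[b] += 1
    # Assumed ordering rule: more significant wins first, then smaller rank sum.
    return sorted(ranks, key=lambda tf: (-wins[tf], sum(ranks[tf])))


# Toy ranks of three TFs across eight sub-models (invented for illustration).
ranks = {
    "MYC": [1, 1, 2, 1, 1, 2, 1, 1],
    "ETS1": [2, 3, 1, 2, 3, 1, 2, 2],
    "TAL1": [3, 2, 3, 3, 2, 3, 3, 3],
}
combined = combine_rankings(ranks)
```

With these toy ranks only the MYC versus TAL1 comparison is significant at the 0.05 level, and the combined order is MYC, ETS1, TAL1.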

Fig. 1.


Overview of TF-metabolic enzyme relation inference workflow. We divided the problem of inferring TF-metabolic enzyme relations involving M enzymes into M sub-problems. In each sub-problem, taking the regulation status table of one enzyme and TFs binding to its transcription start site as input, we utilized gradient boosted trees to identify those TFs whose regulation status is predictive of the regulation status of the enzyme. This learning process was repeated on all the M enzymes. The predicted relations between TFs and enzymes were then displayed in the TF-metabolic enzyme map as output.

F. Implementation

TFmeta was implemented using the scikit-learn library (version 0.19.1) [37] and the XGBoost library (version 0.7) [36] in Python (version 2.7.13) as a task-parallelized program. TFmeta [12] is freely available for academic use at https://github.com/zhangyimc/TFmeta.

V. RESULTS

A. Benchmarking TFmeta with DREAM5 Network Inference Challenge Data Sets

We utilized the data sets of the Dialogue on Reverse Engineering Assessment and Methods (DREAM) 5 network inference challenge [11]. The DREAM project is a framework that enables the assessment of computational methods through standardized performance metrics and common benchmarks. The DREAM5 challenge performed a comprehensive blind assessment of 35 TF-target gene relation inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico microarray data. Table II summarizes the number of TFs, the number of genes, and the number of microarray chips for each network. According to the DREAM5 challenge organizers, the Staphylococcus aureus data were not used for the final evaluation because a sufficiently large set of experimentally validated relations was lacking. Each microarray data set is represented as an m × n gene expression matrix, where m is the total number of genes including both TFs and target genes, and n is the total number of microarray measurements. Based on descriptions provided by participants, the DREAM5 challenge classified the 35 competing methods into six distinct categories: regression, mutual information, correlation, Bayesian networks, meta (methods that combine several different approaches) and others (methods that do not belong to any of the previous categories).

TABLE II.

Summary of DREAM5 Challenge Data Sets

| Network | Number of TFs | Number of genes | Number of microarray chips |
|---|---|---|---|
| In silico | 195 | 1643 | 805 |
| S. aureus | 99 | 2810 | 160 |
| E. coli | 334 | 4511 | 805 |
| S. cerevisiae | 333 | 5950 | 536 |

TFmeta was trained and tested on the same benchmark data sets used by the 35 competing methods. Since the input data are numerical (gene expression values generated from microarray chips), the classification and regression trees (CART) in TFmeta were switched from classification to regression. In the DREAM5 challenge, standardized performance metrics were provided to evaluate the performance of different methods. An overall score, which jointly assesses the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC), was used to summarize the performance across the three networks. We applied the same metrics used by the 35 competing methods to TFmeta. Fig. 2a shows the overall scores for TFmeta and the 35 competing methods. The winner of the DREAM5 challenge, GENIE3 [18], achieved an overall score of 40.279. A more recently published method inspired by GENIE3, GRNBoost2 [38], is reported to achieve overall scores ranging from 60 to 65 over 100 runs on the DREAM5 data sets. The overall score of TFmeta is 69.031, which exceeds those of both the DREAM5 winner and GRNBoost2.

Fig. 2.


Performance evaluation of DREAM5 challenge data sets. (a) demonstrates the overall scores for TFmeta and the 35 competing methods. The winner of DREAM5 challenge, GENIE3, achieved an overall score of 40.279. The overall score of TFmeta is 69.031. (b) illustrates the accuracy of the top relations predicted by GENIE3 and TFmeta. TFmeta consistently achieved a higher accuracy than GENIE3. (c) shows the total CPU running time of GENIE3 and TFmeta on the testing datasets. TFmeta is orders of magnitude faster than GENIE3.

Transcription-factor perturbation experiments can be applied to validate the biological significance of the TFs predicted by computational methods. However, the use of transcription-factor perturbation experiments is limited by their high cost and strong dependence on cell type and context. Although TF-target gene relation inference methods reconstruct gene regulatory networks with a large set of regulatory relations, the number of TFs chosen for further experimental validation is always limited, and it is highly likely that only the top predicted relations will be selected for further validation. We therefore evaluated the accuracy of the top relations predicted by GENIE3 and TFmeta. As shown in Fig. 2b, TFmeta consistently achieved a higher accuracy than GENIE3 for the top predictions on the in silico data set, indicating that the most significant relations predicted by TFmeta are more likely to be true relations than those predicted by GENIE3. We further compared TFmeta with GENIE3 in terms of computational efficiency. Fig. 2c illustrates the total CPU running time of GENIE3 and TFmeta for reconstructing the testing gene regulatory networks. GENIE3 took 761.58 hours to finish the entire reconstruction job, whereas TFmeta took only 6.03 hours. TFmeta is thus roughly 126 times faster than GENIE3, a difference of about two orders of magnitude.

B. Prediction of TFs Governing the Dysregulation of Glycolysis in Non-small-cell Lung Cancer Patients

All parts of the body require energy to maintain non-equilibrium cellular states and perform work, and this energy is derived from the consumption and oxidation of external nutrients. Typically, food is broken down into smaller parts, and this breakdown is coupled to the production of the main energy intermediate, ATP. ATP provides a uniformly usable store of biochemical energy that can be used to drive endergonic cellular reactions. The process of the breakdown of glucose, termed glycolysis, occurs in the cytoplasm of mammalian cells [39]. Since the early twentieth century, abnormalities of glycolysis in cancer cells have been observed [40]. Marked progress has been made in understanding the molecular mechanisms leading to constitutive upregulation of glycolysis in tumor cells. Many glycolytic enzymes are often overexpressed in cancer cells. For example, phosphofructokinase-1 (PFK1) has been found to be upregulated in some types of breast cancer [41]. Another classic glycolytic enzyme, glyceraldehyde-3-phosphate dehydrogenase (GAPDH), is also implicated in cancer. Overexpression of GAPDH is considered an important feature of numerous types of cancer [39]. GAPDH has been proposed as a promising target for the treatment of carcinomas [42]. Both MYC and HIF1a are known to upregulate expression of most of the glycolytic enzymes in cancers [43]. These results indicate that the TFs governing the abnormal expression patterns of glycolytic enzymes in tumor cells may underlie the abnormalities of glycolysis, and that uncovering them could be highly effective for the treatment of different types of cancer.

We acquired 150 RNA-seq samples from 75 paired CA and NC human lung tissues. Through pairwise gene expression comparisons of CA and NC samples from the same patient, we identified 14 altered glycolytic enzymes with consistent expression changes. ENO1, ENO2, GAPDH, GPI, LDHA, PFKP, PKM, and TPI1 were upregulated, whereas ACSS2, ADH1B, ALDH2, ALDH3B1, FBP1, and HK3 were downregulated.

For every altered glycolytic enzyme, we curated a list of TFs that bind to the transcription start site of that enzyme according to the TF DNA binding activities inferred from ChIP-seq experiments. Leveraging the TF binding profiles, we filtered out the irrelevant TFs and kept only the relevant ones. In this experiment, this narrowed down the candidate TFs from 19,813 to an average of 134. In other words, the number of input features for the model decreased from 19,813 to 134 once the irrelevant features were removed. Without this filtering, the model would become very complicated, which would in turn lead to overfitting. We then fitted a gradient boosted tree-based classification model to predict the regulation status of each altered glycolytic enzyme based on the combined regulation status of the selected TFs. The optimal model configuration was found by an extensive hyperparameter search over the learning rate (0.001, 0.01, 0.1, and 1), the maximum tree depth (1, 3, and 5), and the number of boosting rounds (100, 200, 300, and 400). To evaluate the performance of models with different parameter settings, 10-fold cross-validation was used. Table III summarizes the average prediction accuracy over the 14 altered glycolytic enzymes for models with varying parameter settings. Based on these results, we used a learning rate of 0.01, a maximum tree depth of 3, and 300 boosting rounds in our model to save computing time without loss of classification accuracy.
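
The search procedure can be outlined as follows. This sketch only enumerates the parameter settings listed above and the cross-validation folds; model training is left to a caller-supplied `evaluate` callback, which is a hypothetical placeholder. The use of a full grid, the contiguous unshuffled folds and all function names are assumptions of this example, since the paper does not describe how the search or the folds were organized.

```python
from itertools import product

LEARNING_RATES = (0.001, 0.01, 0.1, 1)
MAX_DEPTHS = (1, 3, 5)
N_ROUNDS = (100, 200, 300, 400)


def parameter_grid():
    """All combinations of the three hyperparameters listed in the text."""
    return [
        {"learning_rate": lr, "max_depth": depth, "n_rounds": rounds}
        for lr, depth, rounds in product(LEARNING_RATES, MAX_DEPTHS, N_ROUNDS)
    ]


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def grid_search(evaluate, n_samples, k=10):
    """Return (best_mean_accuracy, best_params).

    evaluate(params, train, test) -> accuracy is a hypothetical callback that
    trains a model on the train indices and scores it on the test indices.
    """
    best = None
    for params in parameter_grid():
        scores = [evaluate(params, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_accuracy = sum(scores) / len(scores)
        if best is None or mean_accuracy > best[0]:
            best = (mean_accuracy, params)
    return best


grid = parameter_grid()
folds = list(k_fold_indices(75))
```

A full grid over these values contains 48 settings, and with 75 patients each of the 10 folds holds out 7 or 8 patients. Table III reports only a subset of such settings, varying one hyperparameter at a time around the chosen configuration.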

TABLE III.

Performance Evaluation of Models with Different Parameter Settings

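| Learning rate | Maximum tree depth | Number of rounds for boosting | Accuracy |
|---|---|---|---|
| 0.001 | 3 | 300 | 0.696 |
| 0.01 | 3 | 300 | 0.723 |
| 0.1 | 3 | 300 | 0.661 |
| 1 | 3 | 300 | 0.634 |
| 0.01 | 1 | 300 | 0.714 |
| 0.01 | 3 | 300 | 0.723 |
| 0.01 | 5 | 300 | 0.723 |
| 0.01 | 3 | 100 | 0.679 |
| 0.01 | 3 | 200 | 0.714 |
| 0.01 | 3 | 300 | 0.723 |
| 0.01 | 3 | 400 | 0.705 |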

The application of TFmeta allows us to narrow the candidates down to a list of key TFs that modulate the dysregulated expression of the altered glycolytic enzymes. Fig. 3 shows the TF-metabolic enzyme map predicted by TFmeta. In the map, the 14 altered glycolytic enzymes (red squares) and 19 predicted TFs (blue squares) are nodes, and a directed edge from a TF to an enzyme indicates that the TF is predicted to regulate that enzyme. Some predicted TFs and their relations with glycolytic enzymes in the map are already supported by literature evidence. For example, transcription factor E2-alpha (TCF3) was identified as a novel putative TF in lung cancer [44]. ETS Proto-Oncogene 1 (ETS1) was reported as a key TF involved in the metabolism of cancer cells, and ETS1 is particularly important in the metabolic shift towards glycolysis and anabolic means of energy production [45]. Enhancer of zeste homolog 2 (EZH2) promotes tumorigenesis and malignant progression in part by activating glycolysis. The mRNA expression of key enzymes involved in glycolysis was significantly increased in xenograft tumors derived from cells overexpressing EZH2, which suggests that EZH2 overexpression increases glycolysis in vivo [46]. Forkhead box transcription factor-2 (FOXA2) was implicated as a suppressor of lung cancer, playing an important role in lipid and glucose metabolism in lung development in a Foxa2+/− mouse model [47]. Another well-known TF, MYC, is a critical growth regulatory gene that is commonly overexpressed in a wide range of cancers. Overexpression of MYC leads to the upregulation of many glycolytic enzymes [48]. Zinc finger and BTB domain-containing protein 7A (ZBTB7A) acts as a tumor suppressor through the transcriptional repression of glycolysis: it directly binds to the promoter and represses the transcription of critical glycolytic enzymes, including GLUT3, PFKP, and PKM [49]. Krüppel-like factor 4 (KLF4) represses the transcription of the glycolytic enzyme LDHA in pancreatic cancer [50]. We propose that these TFs should be prioritized for follow-up experiments, both to validate predicted target metabolic enzymes and to evaluate specific biological functions for each TF.

Fig. 3.

Visualization of the TF-metabolic enzyme map predicted by TFmeta. In the map, the 14 altered glycolytic enzymes (red squares) and the 19 predicted TFs (blue squares) are nodes, and a directed edge from a TF to an enzyme indicates that the TF is predicted to regulate that enzyme.

Thus, in this pilot study, we demonstrated the feasibility of using TFmeta to uncover TFs that govern glycolytic reprogramming in non-small-cell lung cancer patients. This approach should be equally powerful for deciphering other forms of metabolic reprogramming in cancer cells, thereby enabling a more comprehensive characterization of cancer metabolism.

C. Prediction of TFs Governing Other Major Metabolic Pathways in Non-small-cell Lung Cancer Patients

We further applied TFmeta to infer TFs that govern other major metabolic pathways in non-small-cell lung cancer patients. The Krebs cycle is a central metabolic hub that integrates carbohydrate, lipid, and amino acid metabolism. The pentose phosphate pathway (PPP) is an alternative route to glycolysis, yielding ribose 5-phosphate for nucleotide biosynthesis and NADPH for fatty acid biosynthesis and the decomposition of peroxides [51]. Purine metabolism maintains cellular pools of adenylate and guanylate via the synthesis and degradation of purine nucleotides. The top TFs predicted for each metabolic pathway are as follows:

  1. The Krebs cycle: ZBTB7A, MYC, SMARCB1, TAL1, TCF7L2;

  2. The pentose phosphate pathway: FOXA2, MYC, EGR1, TCF3, ZEB1;

  3. Purine metabolism: MYC, H2AFZ, EZH2, NFIC, ETS1, TCF3, BHLHE40, CEBPB, STAT1, MAFK.
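
As a quick cross-check of the three lists above, the following Python sketch tabulates which predicted TFs recur across pathways. The TF names are copied from the lists; the dictionary layout and the helper name `shared_tfs` are illustrative choices, not part of TFmeta.

```python
from collections import Counter

# Top predicted TFs per pathway, copied from the lists above.
top_tfs = {
    "Krebs cycle": ["ZBTB7A", "MYC", "SMARCB1", "TAL1", "TCF7L2"],
    "Pentose phosphate pathway": ["FOXA2", "MYC", "EGR1", "TCF3", "ZEB1"],
    "Purine metabolism": ["MYC", "H2AFZ", "EZH2", "NFIC", "ETS1", "TCF3",
                          "BHLHE40", "CEBPB", "STAT1", "MAFK"],
}


def shared_tfs(pathway_tfs):
    """Map each TF listed for more than one pathway to those pathways."""
    counts = Counter(tf for tfs in pathway_tfs.values() for tf in set(tfs))
    return {
        tf: sorted(p for p, tfs in pathway_tfs.items() if tf in tfs)
        for tf, n in counts.items()
        if n > 1
    }


shared = shared_tfs(top_tfs)
for tf in sorted(shared):
    print(tf, "->", ", ".join(shared[tf]))
```

On these lists, MYC appears for all three pathways and TCF3 for the pentose phosphate pathway and purine metabolism; every other TF is listed for a single pathway.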

VI. DISCUSSION

We designed TFmeta primarily for the task of ranking the candidate transcription factors that govern metabolic enzymes. Applying TFmeta narrows the candidates down to a list of key TFs predicted to modulate the dysregulated expression of metabolic enzymes. In real applications, users may need to choose the number of TFs selected for downstream analysis or experimental validation according to their own requirements. In our experimental results, the majority of metabolic enzymes were related to more than one TF. TFs are known to work together to achieve the needed specificity in both DNA binding and effector function [52]. Our current model does not analyze TF-TF relationships. TFmeta could be extended in the future to evaluate associations among TFs.
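
The choice of a cutoff can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TFmeta's implementation: the edge list, the placeholder names (`TF_A`, `ENZ_1`, and so on), the scores, and the rule of summing importance scores per TF are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (TF, enzyme, importance score) edges. Names and values are
# made up for illustration and are not results from the paper.
edges = [
    ("TF_A", "ENZ_1", 0.40), ("TF_B", "ENZ_1", 0.35), ("TF_C", "ENZ_1", 0.05),
    ("TF_A", "ENZ_2", 0.30), ("TF_C", "ENZ_2", 0.20), ("TF_B", "ENZ_3", 0.10),
]


def rank_tfs(edges):
    """Rank TFs by summed importance over all enzymes (assumed rule)."""
    total = defaultdict(float)
    for tf, _enzyme, score in edges:
        total[tf] += score
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(total, key=lambda tf: (-total[tf], tf))


def regulators_per_enzyme(edges, kept_tfs):
    """List, for each enzyme, the kept TFs that have an edge to it."""
    regs = defaultdict(list)
    for tf, enzyme, _score in edges:
        if tf in kept_tfs:
            regs[enzyme].append(tf)
    return dict(regs)


top_k = 2  # user-chosen cutoff
kept = rank_tfs(edges)[:top_k]
regs = regulators_per_enzyme(edges, set(kept))
print(kept)
print(regs)
```

With this toy input, one enzyme keeps two regulators after the cutoff, which is the situation described above in which a single enzyme is related to more than one TF.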

Both dysregulated expression and mutations in transcription factors can lead to the altered metabolic phenotypes observed in many cancers. The effects of dysregulated expression have been studied in this work. Future studies that use our RNA-seq data sets to investigate whether the presence of mutations potentially affects metabolism will expand the functionality of TFmeta.

VII. CONCLUSIONS

Metabolic reprogramming of cancer cells is recognized as one of the hallmarks of cancer. One of the most common trends in anti-cancer metabolism therapies is to inhibit metabolic enzymes that are exclusively or mostly expressed or used in tumor cells. This therapeutic strategy would effectively eliminate tumors while minimizing damage to normal cells [53]. Thus, targeting TFs that control the transcription rate of those metabolic enzymes could be highly effective for novel cancer therapy. In this paper, we developed TFmeta, a machine learning approach to uncover TFs governing cancer metabolic reprogramming and to reconstruct their relations with metabolic enzymes. TFmeta leverages TF binding profiles to overcome the problem of low correlations at the mRNA level on real data sets. Pairwise comparisons between matched samples are included in the TFmeta pipeline to eliminate the false predictions caused by individual variations, which existing methods have ignored. Implemented as a task-parallelized program, TFmeta is orders of magnitude faster than conventional methods, which equips users with an efficient and scalable tool for large data sets, such as those from single-cell RNA-seq. In addition, we carefully curated a data set containing 2,286,192 TF DNA binding activities, involving 493 TFs and 23,644 target genes. We demonstrated that TFmeta achieved state-of-the-art performance in recovering TF-target gene relations on public benchmark data sets. TFmeta achieved more than a 70% improvement over GENIE3 in terms of accuracy, and it is two orders of magnitude faster than GENIE3. We applied our model to data sets from non-small-cell lung cancer patients to predict TFs modulating the dysregulation of glycolysis in lung cancer. We tested 11 different model settings to find the optimal configuration. Eventually, we predicted 19 key TFs that may drive the upregulation of glycolysis observed in tumor cells, some of which are supported by literature evidence and some of which were predicted as novel putative TFs in lung cancer. We further applied TFmeta to infer TFs that govern other major metabolic pathways. Using TFmeta, we predicted 5 key TFs for the Krebs cycle, 5 for the pentose phosphate pathway, and 10 for purine metabolism.
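
To illustrate why pairwise comparison between matched samples can suppress individual variation, the sketch below computes a per-patient log2 ratio of cancerous (CA) to non-cancerous (NC) expression for one gene. It assumes the comparison is expressed as a log ratio with a pseudocount; the function name, the pseudocount, and the expression values are hypothetical and are not taken from the TFmeta pipeline.

```python
import math


def paired_log2_ratios(cancer, normal, pseudocount=1.0):
    """Per-patient log2(CA / NC) for one gene, given matched sample lists."""
    if len(cancer) != len(normal):
        raise ValueError("cancer and normal must be matched pair-by-pair")
    return [
        math.log2((ca + pseudocount) / (nc + pseudocount))
        for ca, nc in zip(cancer, normal)
    ]


# Hypothetical expression values for one gene in three patients whose
# baseline levels differ widely.
cancer = [7.0, 31.0, 127.0]
normal = [3.0, 15.0, 63.0]

ratios = paired_log2_ratios(cancer, normal)
print(ratios)
```

Although the baseline expression differs by more than an order of magnitude across these three hypothetical patients, each pair yields the same two-fold change, so the patient-specific baseline drops out of the comparison.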

Acknowledgments

This work was supported in part by the National Institutes of Health under Award 1P01CA163223-01A1 and Award 1U24DK097215-01A1, and in part by the Redox Metabolism Shared Resource(s) of the University of Kentucky Markey Cancer Center under NCI Grant P30CA177558.

Footnotes

ETHICS STATEMENT

Written consent was obtained for the collection of human tissue and blood samples under an IRB approved protocol (13-LUN-94-MCC: Preoperative Metabolomic Analysis of Primary Lung Cancer) at the University of Kentucky.

Contributor Information

Yi Zhang, Department of Computer Science, University of Kentucky, Lexington, KY 40506 USA.

Xiaofei Zhang, Department of Computer Science, University of Kentucky, Lexington, KY 40506 USA.

Andrew N. Lane, Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40506 USA.

Teresa W-M. Fan, Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40506 USA.

Jinze Liu, Department of Computer Science, University of Kentucky, Lexington, KY 40506 USA.

REFERENCES

  1. Vander Heiden MG, Cantley LC, and Thompson CB, Understanding the Warburg effect: the metabolic requirements of cell proliferation. Science, 2009. 324(5930): p. 1029–1033.
  2. Warburg O, Wind F, and Negelein E, The metabolism of tumors in the body. The Journal of general physiology, 1927. 8(6): p. 519.
  3. Wise DR, et al., Myc regulates a transcriptional program that stimulates mitochondrial glutaminolysis and leads to glutamine addiction. Proceedings of the National Academy of Sciences, 2008. 105(48): p. 18782–18787.
  4. Gottlieb E and Tomlinson IP, Mitochondrial tumour suppressors: a genetic and biochemical update. Nature Reviews Cancer, 2005. 5(11): p. 857.
  5. Paccez JD and Zerbini LF, Oncogenic Transcription Factors: Target Genes. eLS, 2007.
  6. Libermann TA and Zerbini LF, Targeting transcription factors for cancer gene therapy. Current gene therapy, 2006. 6(1): p. 17–33.
  7. Hanahan D and Weinberg RA, Hallmarks of cancer: the next generation. Cell, 2011. 144(5): p. 646–674.
  8. De Mas IM, et al., Cancer cell metabolism as new targets for novel designed therapies. Future medicinal chemistry, 2014. 6(16): p. 1791–1810.
  9. Yuneva MO, et al., The metabolic profile of tumors depends on both the responsible genetic lesion and tissue type. Cell metabolism, 2012. 15(2): p. 157–170.
  10. Gao P, et al., c-Myc suppression of miR-23a/b enhances mitochondrial glutaminase expression and glutamine metabolism. Nature, 2009. 458(7239): p. 762.
  11. Marbach D, et al., Wisdom of crowds for robust gene network inference. Nature methods, 2012. 9(8): p. 796–804.
  12. Zhang Y, et al. TFmeta: A Machine Learning Approach to Uncover Transcription Factors Governing Metabolic Reprogramming. in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2018. ACM.
  13. Zhang B and Horvath S, A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology, 2005. 4(1).
  14. Margolin AA, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. in BMC bioinformatics. 2006. BioMed Central.
  15. Faith JJ, et al., Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS biology, 2007. 5(1): p. e8.
  16. Meyer PE, et al., Information-theoretic inference of large transcriptional regulatory networks. EURASIP journal on bioinformatics and systems biology, 2007. 2007: p. 8–8.
  17. Haury A-C, et al., TIGRESS: trustful inference of gene regulation using stability selection. BMC systems biology, 2012. 6(1): p. 145.
  18. Irrthum A, Wehenkel L, and Geurts P, Inferring regulatory networks from expression data using tree-based methods. PLoS one, 2010. 5(9): p. e12776.
  19. Wehenkel L and Geurts P, Gene regulatory network inference from systems genetics data using tree-based methods, in Gene Network Inference. 2013, Springer. p. 63–85.
  20. Huynh-Thu VA and Sanguinetti G, Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics, 2015. 31(10): p. 1614–1622.
  21. Schäfer J and Strimmer K, An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 2004. 21(6): p. 754–764.
  22. Friedman N, et al., Using Bayesian networks to analyze expression data. Journal of computational biology, 2000. 7(3–4): p. 601–620.
  23. Oates CJ and Mukherjee S, Network inference and biological dynamics. The annals of applied statistics, 2012. 6(3): p. 1209.
  24. Bonneau R, et al., The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome biology, 2006. 7(5): p. R36.
  25. Sellers K, et al., Pyruvate carboxylase is critical for non–small-cell lung cancer proliferation. The Journal of clinical investigation, 2015. 125(2): p. 687–698.
  26. Lane AN, et al., Stable isotope-resolved metabolomics (SIRM) in cancer research with clinical application to nonsmall cell lung cancer. Omics: a journal of integrative biology, 2011. 15(3): p. 173–182.
  27. Li B and Dewey CN, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics, 2011. 12(1): p. 323.
  28. Ishibashi Y, et al., Profiling gene expression ratios of paired cancerous and normal tissue predicts relapse of esophageal squamous cell carcinoma. Cancer research, 2003. 63(16): p. 5159–5164.
  29. Kanehisa M and Goto S, KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 2000. 28(1): p. 27–30.
  30. Lachmann A, et al., ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics, 2010. 26(19): p. 2438–2444.
  31. Consortium EP, The ENCODE (ENCyclopedia of DNA elements) project. Science, 2004. 306(5696): p. 636–640.
  32. Mathelier A, et al., JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic acids research, 2015. 44(D1): p. D110–D115.
  33. Matys V, et al., TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic acids research, 2006. 34(suppl_1): p. D108–D110.
  34. Li H and Zhan M, Unraveling transcriptional regulatory programs by integrative analysis of microarray and transcription factor binding data. Bioinformatics, 2008. 24(17): p. 1874–1880.
  35. Friedman JH, Greedy function approximation: a gradient boosting machine. Annals of statistics, 2001: p. 1189–1232.
  36. Chen T and Guestrin C. XGBoost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. ACM.
  37. Pedregosa F, et al., Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. 12(Oct): p. 2825–2830.
  38. Moerman T, et al., GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics.
  39. Sreedhar A and Zhao Y, Dysregulated metabolic enzymes and metabolic reprogramming in cancer cells. Biomedical Reports, 2018. 8(1): p. 3–10.
  40. Warburg O, Versuche an überlebendem Carcinomgewebe (Methoden). Biochem. Zeitschr., 1923. 142: p. 317–333.
  41. Zancan P, et al., Differential expression of phosphofructokinase-1 isoforms correlates with the glycolytic efficiency of breast cancer cells. Molecular genetics and metabolism, 2010. 100(4): p. 372–378.
  42. Krasnov GS, et al., Deregulation of glycolysis in cancer: glyceraldehyde-3-phosphate dehydrogenase as a therapeutic target. Expert opinion on therapeutic targets, 2013. 17(6): p. 681–693.
  43. Dang CV, Le A, and Gao P, MYC-Induced Cancer Cell Energy Metabolism and Therapeutic Opportunities. Clinical Cancer Research, 2009. 15(21): p. 6479–6483.
  44. El-aarag SA, et al., In silico identification of potential key regulatory factors in smoking-induced lung cancer. BMC medical genomics, 2017. 10(1): p. 40.
  45. Verschoor ML, et al., Ets-1 regulates energy metabolism in cancer cells. PLoS one, 2010. 5(10): p. e13565.
  46. Pang B, et al., EZH2 promotes metabolic reprogramming in glioblastomas through epigenetic repression of EAF2-HIF1α signaling. Oncotarget, 2016. 7(29): p. 45134.
  47. Tang X and Luo F, The forkhead box transcription factor-2 (Foxa2) and lung disease. Inflammation and Cell Signaling, 2014. 1(4).
  48. Sabnis HS, Somasagara RR, and Bunting KD, Targeting MYC dependence by metabolic inhibitors in cancer. Genes, 2017. 8(4): p. 114.
  49. Liu X-S, et al., ZBTB7A acts as a tumor suppressor through the transcriptional repression of glycolysis. Genes & development, 2014. 28(17): p. 1917–1928.
  50. Shi M, et al., A novel KLF4/LDHA signaling pathway regulates aerobic glycolysis in and progression of pancreatic cancer. Clinical Cancer Research, 2014. 20(16): p. 4370–4380.
  51. Patra KC and Hay N, The pentose phosphate pathway and cancer. Trends in biochemical sciences, 2014. 39(8): p. 347–354.
  52. Lambert SA, et al., The human transcription factors. Cell, 2018. 172(4): p. 650–665.
  53. Phan LM, Yeung S-CJ, and Lee M-H, Cancer metabolic reprogramming: importance, main features, and potentials for precise targeted anti-cancer therapies. Cancer biology & medicine, 2014. 11(1): p. 1.
