BMC Medical Genomics. 2019 Jul 11;12(Suppl 5):97. doi: 10.1186/s12920-019-0515-6

CONFIGURE: A pipeline for identifying context specific regulatory modules from gene expression data and its application to breast cancer

Sungjoon Park 1, Doyeong Hwang 1, Yoon Sun Yeo 1, Hyunggee Kim 3,4, Jaewoo Kang 1,2
PMCID: PMC6624175  PMID: 31296219

Abstract

Background

Gene expression data is widely used for identifying subtypes of diseases such as cancer. Differentially expressed gene analysis and gene set enrichment analysis are widely used for identifying biological mechanisms at the gene level and gene set level, respectively. However, the results of differentially expressed gene analysis are difficult to interpret and gene set enrichment analysis does not consider the interactions among genes in a gene set.

Results

We present CONFIGURE, a pipeline that identifies context specific regulatory modules from gene expression data. First, CONFIGURE takes gene expression data and context label information as inputs and constructs regulatory modules. Then, CONFIGURE builds a regulatory module enrichment score (RMES) matrix that holds the enrichment score of every regulatory module in every sample, computed using the single-sample GSEA method. Finally, CONFIGURE calculates the importance scores of the regulatory modules on each context and uses them to rank the regulatory modules.

We evaluated CONFIGURE on the Cancer Genome Atlas (TCGA) breast cancer RNA-seq dataset to determine whether it can produce biologically meaningful regulatory modules for breast cancer subtypes. We first evaluated whether RMESs are useful for differentiating breast cancer subtypes using a multi-class classifier and one-vs-rest binary SVM classifiers. The multi-class and one-vs-rest binary classifiers were trained using the RMESs as features and outperformed baseline classifiers. Furthermore, we conducted literature surveys on the basal-like type specific regulatory modules obtained by CONFIGURE and showed that highly ranked modules were associated with the phenotypes of basal-like type breast cancers.

Conclusions

We showed that enrichment scores of regulatory modules are useful for differentiating breast cancer subtypes and validated the basal-like type specific regulatory modules by literature surveys. In doing so, we found regulatory module candidates that have not been reported in previous literature. This demonstrates that CONFIGURE can be used to predict novel regulatory markers which can be validated by downstream wet lab experiments. We validated CONFIGURE on the breast cancer RNA-seq dataset in this work but CONFIGURE can be applied to any gene expression dataset containing context information.

Keywords: Context specific regulatory module, Gene regulatory network inference, Single sample GSEA, Feature importance score, Breast cancer subtype

Background

Many studies have identified biological phenotypes (i.e., contexts) such as cancer subtypes or cell types from gene expression data. Usually, clustering algorithms are applied to gene expression data for identifying biological contexts [1–3]. Although the gene expression signatures obtained from clustering results accurately represent biological contexts, it is difficult to identify the biological mechanisms underlying each biological context.

When context information is available, differentially expressed gene (DEG) analysis [4–6] is the most widely used approach for identifying marker genes that help to differentiate contexts. However, from DEGs, it is often difficult to identify the phenotypes or biological networks that are differentiated between the contexts. To overcome this problem, the gene set enrichment analysis (GSEA) method is widely used. The GSEA method can identify phenotypes or biological networks in which the DEGs are over-represented [7]. However, since the GSEA method is based on gene sets, the interaction information of biological networks is ignored. Several methods consider the interactions in biological networks when identifying context specific subnetworks [8–10]. However, these methods are unable to score subnetworks for single samples.

A gene regulatory network (GRN) describes transcriptional relationships between transcription factors (TFs) and their target genes. Among various data types, gene expression data is often used for inferring GRNs. The core component of a GRN inference method is the calculation of regulatory interaction scores between genes; statistical and machine learning methods are applied for scoring interactions. However, most GRN inference algorithms are unsuitable for identifying context specific GRNs [11–15].

Recently, the authors of [16] have developed a single cell GRN inference and clustering method called SCENIC. SCENIC was developed to infer GRNs of single cells and identify new cell types by clustering single cells based on the activity scores of the GRN modules. However, SCENIC does not prioritize GRN modules for each identified context of a cell type.

In this work, we present CONFIGURE, a pipeline for identifying CONtext speciFIc reGUlatoRy modulEs. CONFIGURE first constructs regulatory modules from gene expression data using a gene regulatory network inference method and a transcription factor (TF) motif enrichment analysis method [13, 16]. A regulatory module consists of a TF, its target genes, and their regulatory interaction scores. Using the single sample gene set enrichment analysis (ssGSEA) method [17], CONFIGURE calculates the enrichment scores of all regulatory modules for all samples. An enrichment score indicates the degree of up- or down-regulation of a regulatory module for a given sample. To identify context specific regulatory modules, the importance scores of regulatory modules are computed on each context. To obtain the importance scores, CONFIGURE trains one-vs-rest binary random forest classifiers on each context, using the enrichment scores of regulatory modules as features, and computes their feature importance scores. Based on these feature importance scores, CONFIGURE ranks regulatory modules on each context.

Methods

Input and output of CONFIGURE

The overview of CONFIGURE is shown in Fig. 1. Gene expression data and context information are used as inputs of CONFIGURE. The gene expression data is a two dimensional matrix where samples and genes are listed in rows and columns, respectively. Entries of the matrix are gene expression values. The context information contains samples with their context labels. For each context, CONFIGURE outputs regulatory modules ranked based on their importance scores. We regard the regulatory modules with high importance scores in each context as the context specific regulatory modules.

Fig. 1. Overview of CONFIGURE

To obtain the context specific regulatory modules, CONFIGURE performs the following three tasks: Constructing regulatory modules from gene expression data, constructing a regulatory module enrichment score (RMES) matrix, and computing importance scores of regulatory modules on each context. The following sections describe each task in detail.

Construction of regulatory modules

CONFIGURE constructs regulatory modules from the gene expression data used as input. First, CONFIGURE constructs a gene regulatory network using GRNBoost2, a gene regulatory network inference method provided in SCENIC [16]. GRNBoost2 infers a gene regulatory network from gene expression data using a stochastic gradient boosting method [18]. The output of GRNBoost2 is a directed weighted network where a node indicates a TF or a target gene, and an edge indicates a regulatory interaction between a TF and a target gene. Then, the gene regulatory network is divided into regulatory modules using the modules_from_adjacencies function provided in the pySCENIC package [16]. We define a regulatory module as a tree with a depth of 1 where the root node is a TF and the leaf nodes are target genes. The weight of an edge is the regulatory interaction score (RIS), which indicates the degree of regulation of a given target gene by a TF. Figure 2 illustrates a regulatory module. Regulatory modules are represented as either activated regulatory modules or repressed regulatory modules. Activated regulatory modules contain only interactions where the expression values of a TF are positively correlated with the expression values of target genes. If regulatory modules contain only negatively correlated interactions, they are repressed regulatory modules. Regulatory modules are further pruned using the RcisTarget method [16], which filters out low-confidence target genes by motif enrichment analysis.
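As a rough illustration of the module definition above, the following sketch groups weighted TF-target edges into activated "(+)" and repressed "(-)" modules according to the sign of the Pearson correlation between the TF and each target. It is a toy reconstruction, not the pySCENIC modules_from_adjacencies implementation: the function names, the gene names (TF1, G1, G2, G3), and the expression values are hypothetical, and no edge-weight filtering or RcisTarget pruning is performed.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def build_modules(adjacencies, expression):
    """Group (tf, target, ris) edges into depth-1 modules keyed 'TF(+)' or 'TF(-)'.

    Each module maps target gene -> regulatory interaction score (RIS).
    """
    modules = defaultdict(dict)
    for tf, target, ris in adjacencies:
        r = pearson(expression[tf], expression[target])
        if r > 0:
            modules[f"{tf}(+)"][target] = ris   # activated interaction
        elif r < 0:
            modules[f"{tf}(-)"][target] = ris   # repressed interaction
    return dict(modules)


# Hypothetical expression values (gene -> one value per sample).
expression = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "G1": [2.0, 4.1, 5.9, 8.0],   # rises with TF1
    "G2": [9.0, 7.2, 4.8, 3.1],   # falls as TF1 rises
    "G3": [0.5, 1.1, 1.4, 2.2],   # rises with TF1
}
# Hypothetical network edges: (TF, target gene, RIS).
adjacencies = [("TF1", "G1", 12.5), ("TF1", "G2", 8.0), ("TF1", "G3", 3.2)]

modules = build_modules(adjacencies, expression)
print(modules)
```

In this representation each module is a single TF root with its leaf target genes, and the stored weight is the RIS of the corresponding edge.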

Fig. 2. An illustration of a regulatory module

Construction of a regulatory module enrichment score (RMES) matrix

After constructing regulatory modules, CONFIGURE constructs a regulatory module enrichment score (RMES) matrix. The RMES matrix contains the enrichment scores of samples and regulatory modules where samples are listed in the rows and regulatory modules are listed in the columns of the matrix. The enrichment score indicates the extent of up- or down-regulation of a given regulatory module in a given sample. The enrichment score is obtained using the single sample gene set enrichment analysis (ssGSEA) method [17].

The ssGSEA method computes the enrichment score of a given gene set for a single sample. The ssGSEA method is similar to the original GSEA method [7], but the two methods use different gene score values. In the original GSEA method, gene score values are usually fold changes between contexts. In the ssGSEA method, however, the gene score values of a sample are rank normalized, so that a gene with a high expression value receives a high rank-normalized value. Also, the ssGSEA method computes enrichment scores using the empirical cumulative distribution function (ECDF) whereas the GSEA method computes scores using the Kolmogorov-Smirnov statistic. We use normalized enrichment scores (NESs) as the entries of the RMES matrix.
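The ECDF-based scoring can be sketched as follows. This is a simplified, unnormalized version written under stated assumptions: genes are ranked within one sample, genes inside the gene set contribute rank-weighted steps to one ECDF, genes outside the set contribute uniform steps to another, and the score is the sum of the differences between the two ECDFs over the ranked list. The weighting exponent alpha = 0.25, the function name, and the toy sample are assumptions made for illustration; the normalization step that produces NESs is not shown. For CONFIGURE, the gene set would be the target genes of one regulatory module.

```python
def ssgsea_score(sample, gene_set, alpha=0.25):
    """Unnormalized single-sample enrichment score (simplified sketch).

    sample: dict mapping gene -> expression value in one sample.
    gene_set: set of genes, e.g. the target genes of one regulatory module.
    """
    # Rank genes from highest to lowest expression; the top gene gets rank value n.
    genes = sorted(sample, key=sample.get, reverse=True)
    n = len(genes)
    rank_value = {g: n - i for i, g in enumerate(genes)}

    members = [g in gene_set for g in genes]
    n_in = sum(members)
    if n_in == 0 or n_in == n:
        raise ValueError("gene set must contain some, but not all, genes")

    total_in = sum(rank_value[g] ** alpha for g, m in zip(genes, members) if m)
    step_out = 1.0 / (n - n_in)

    ecdf_in = ecdf_out = score = 0.0
    for g, is_member in zip(genes, members):
        if is_member:
            ecdf_in += rank_value[g] ** alpha / total_in   # weighted step
        else:
            ecdf_out += step_out                           # uniform step
        score += ecdf_in - ecdf_out                        # accumulate the ECDF gap
    return score


# Hypothetical sample with six genes.
sample = {"A": 10.0, "B": 8.0, "C": 6.0, "D": 4.0, "E": 2.0, "F": 1.0}
up_score = ssgsea_score(sample, {"A", "B"})     # set genes are highly expressed
down_score = ssgsea_score(sample, {"E", "F"})   # set genes are lowly expressed
print(up_score, down_score)
```

A gene set concentrated at the top of the ranked list yields a positive score and one concentrated at the bottom yields a negative score, which matches the interpretation of an enrichment score as the extent of up- or down-regulation.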

Computing importance scores of regulatory modules

To identify context specific regulatory modules, CONFIGURE uses the RMES matrix and the context information of samples to compute the importance scores of the regulatory modules on each context. The importance scores are computed based on the feature importance scores of one-vs-rest binary random forest classifiers that use RMESs as features [19]. Assuming the context information contains C contexts {c_1, c_2, ..., c_C}, C binary random forest classifiers are trained. When training a binary random forest classifier on the context c_1, samples with the context label “c_1” are considered as positive samples, and samples without the context label “c_1” are considered as negative samples. After training the random forest classifiers on each context, the feature importance scores of the classifiers are computed on each context. Feature importance scores of a random forest classifier are based on the average impurity decrease (delta impurity) of the base decision tree classifiers. The importance score of a regulatory module m, using the feature importance scores (FIS) of a random forest classifier based on the Gini impurity score, is calculated as follows [19–21]:

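$$\mathrm{Importance}(m) = \mathrm{FIS}(m) = \frac{1}{T}\sum_{t=1}^{T}\ \sum_{n:\,f(n)=m}\frac{S_n}{S}\,\Delta\mathrm{Gini}(n) \tag{1}$$

$$\Delta\mathrm{Gini}(n) = \mathrm{Gini}(n) - \frac{S_{n_{\mathrm{left}}}}{S_n}\,\mathrm{Gini}(n_{\mathrm{left}}) - \frac{S_{n_{\mathrm{right}}}}{S_n}\,\mathrm{Gini}(n_{\mathrm{right}}) \tag{2}$$

$$\mathrm{Gini}(n) = 1 - \sum_{c=1}^{C}\left[p(c \mid n)\right]^{2} \tag{3}$$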

where T indicates the number of base decision trees in a random forest, n denotes a node in a base decision tree, and n_left and n_right denote the left and right child nodes of n, respectively. f(n) indicates the feature used for splitting the node n. S is the total number of samples, S_n is the number of samples on node n, C is the total number of contexts, and p(c|n) is the probability of the samples having the context c on node n.
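To make the three equations concrete, the sketch below evaluates them by hand on a tiny, made-up forest. Each split node is described by the feature it splits on and the per-class sample counts of the node and its two children; the module names (module_a, module_b), the counts, and the helper names are hypothetical. Note that scikit-learn additionally normalizes the resulting importances so that they sum to one, which this sketch does not do.

```python
def gini(counts):
    """Gini impurity of a node from its per-class sample counts (Eq. 3)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_delta_gini(parent, left, right, n_total):
    """(S_n / S) * delta-Gini(n) for one split node (Eq. 2 and the weight in Eq. 1)."""
    s_n = sum(parent)
    delta = (gini(parent)
             - sum(left) / s_n * gini(left)
             - sum(right) / s_n * gini(right))
    return s_n / n_total * delta


def importance(forest, feature, n_total):
    """Average over trees of the weighted impurity decreases of `feature` (Eq. 1)."""
    per_tree = [
        sum(weighted_delta_gini(p, l, r, n_total)
            for f, p, l, r in tree if f == feature)
        for tree in forest
    ]
    return sum(per_tree) / len(forest)


# Hypothetical forest with two trees and 100 samples (50 positive, 50 negative).
# Each node: (split feature, parent counts, left-child counts, right-child counts).
forest = [
    [("module_a", [50, 50], [40, 10], [10, 40])],
    [("module_b", [50, 50], [45, 5], [5, 45]),
     ("module_a", [45, 5], [45, 0], [0, 5])],
]

print(importance(forest, "module_a", 100))  # (0.18 + 0.09) / 2, i.e. about 0.135
print(importance(forest, "module_b", 100))  # (0 + 0.32) / 2, i.e. about 0.16
```

In the first tree the root split on module_a reduces the Gini impurity from 0.5 to 0.32 on all 100 samples, contributing 0.18; in the second tree module_a splits a node holding only half of the samples, so its impurity decrease of 0.18 is weighted by 0.5.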

Since we use RMESs as features, each regulatory module receives a feature importance score. The importance score of a regulatory module indicates the degree to which the RMESs of the regulatory module have contributed to differentiating between positive and negative samples. We consider the regulatory modules with high feature importance scores for a given context as the context specific regulatory modules for that context. If the average RMESs of context specific regulatory modules are higher in positive samples, the modules are considered as up-regulated context specific modules; otherwise, they are considered as down-regulated context specific modules. We use the random forest classifier implemented in the scikit-learn Python machine learning package (RandomForestClassifier(n_estimators=500, criterion='gini')) [22].
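The one-vs-rest labeling and the up/down-regulated rule described above can be sketched in a few lines. The helper names, sample labels, and RMES values below are hypothetical, and the random forest training step itself is omitted.

```python
from statistics import mean


def one_vs_rest_labels(labels, context):
    """Binary labels for one context: 1 for samples of `context`, 0 for the rest."""
    return [1 if label == context else 0 for label in labels]


def module_status(rmes_column, binary_labels):
    """'up-regulated' if the mean RMES is higher in positive than in negative samples."""
    positives = [v for v, y in zip(rmes_column, binary_labels) if y == 1]
    negatives = [v for v, y in zip(rmes_column, binary_labels) if y == 0]
    return "up-regulated" if mean(positives) > mean(negatives) else "down-regulated"


# Hypothetical context labels and RMESs of one regulatory module in four samples.
labels = ["Basal-like", "Luminal A", "Basal-like", "Her2"]
rmes = [0.8, -0.2, 0.6, 0.1]

binary = one_vs_rest_labels(labels, "Basal-like")
print(binary, module_status(rmes, binary))
```

Here the module has a mean RMES of 0.7 in the positive samples and -0.05 in the negative samples, so it would be reported as up-regulated for the basal-like context.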

Results

TCGA breast invasive carcinoma dataset

We tested whether CONFIGURE can produce biologically meaningful context specific regulatory modules using RNA-seq data from the Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) dataset. Breast cancer can be divided into the following four subtypes: Luminal A, Luminal B, Her2, and basal-like [23, 24]. Table 1 lists the breast cancer types according to the expression status of breast cancer biomarkers [25]. Compared with other types, the basal-like type breast cancers have a poorer prognosis. Furthermore, it is difficult to find specific drug targets for the basal-like type breast cancers due to the absence of breast cancer biomarkers [26]. Identifying basal-like type regulatory modules can help to understand the regulatory mechanisms underlying basal-like type breast cancers and suggest new therapy options for such cancers.

Table 1.

Classifying breast cancer subtypes according to the expression status of three breast cancer biomarkers

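| Subtype | ER | PR | HER2 | Ki67 |
|---|---|---|---|---|
| Luminal A | + | + | - | - |
| Luminal B (HER2 +) | + | + | + | |
| Luminal B (HER2 -) | + | + | - | + |
| HER2 | - | - | + | |
| Basal-like | - | - | - | |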

We downloaded the RNA-seq data of the TCGA BRCA dataset (provisional) from cBioPortal [27–29]. We used the “data_RNA_Seq_v2_expression_median.txt” file from the TCGA BRCA dataset, which contains RNA-seq data for 1100 samples. The RNA-seq data was quantified using the RSEM method [30]. Since the subtype information of samples in the TCGA BRCA dataset was not provided, we used the PAM50 method [31] to produce subtype labels for all the samples in the TCGA dataset. We used the genefu R package for running the PAM50 method [32]. Among the 1100 samples, 1072 samples were classified as Luminal A, Luminal B, Her2, or basal-like by the genefu R package, and these samples were used for the analysis. Table 2 shows the number of samples in each breast cancer subtype used for the analysis.

Table 2.

The number of samples of each breast cancer subtype

| | Luminal A | Luminal B | Her2 | Basal-like | Total |
|---|---|---|---|---|---|
| # of Samples | 391 | 370 | 109 | 202 | 1072 |

Construction of regulatory modules of breast cancer

Using the gene expression data from the TCGA BRCA dataset, we first constructed regulatory modules of breast cancers. The gene expression data was log2 normalized after adding 1 to all the gene expression values. We used a list of 800 transcription factors, which was obtained from the TRRUST database [33, 34]. Using the gene expression data and the transcription factor list, a gene regulatory network of breast cancer was constructed using the GRNBoost2 method. After filtering out edges with low weight values and further pruning by RcisTarget, the regulatory modules of breast cancers were constructed. A total of 110 regulatory modules were obtained, with 34.682 target genes per module on average and a standard deviation of 23.476.
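The log2 normalization step amounts to the transform x -> log2(x + 1); adding 1 keeps zero expression values at zero instead of sending them to negative infinity. A minimal sketch, with a hypothetical function name and made-up expression values:

```python
from math import log2


def log2_plus_one(values):
    """Apply log2(x + 1) to a list of non-negative expression values."""
    return [log2(v + 1.0) for v in values]


# Hypothetical expression values of one sample.
normalized = log2_plus_one([0.0, 1.0, 3.0, 1023.0])
print(normalized)
```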

Quantitative evaluation

We first tested whether RMESs are useful for differentiating breast cancer subtypes. To do this, we performed 10-fold cross validation and evaluated the performance of a multi-class support vector machine (SVM) classifier which uses RMESs as features [35]. We used the scikit-learn Python implementation of the SVM classifier (LinearSVC(penalty='l1', multi_class='ovr', dual=False)) [22].

Table 3 shows the performance of the multi-class classifiers. Accuracy, F1-macro, and F1-weighted were used as the evaluation metrics. The accuracy score is defined by the number of correctly predicted samples divided by the total number of predicted samples. The F1-macro score is the average of the F1-scores of all contexts where F1-score is defined as follows.

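$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where

$$\mathrm{Precision} = \frac{\#\text{ of True Positives}}{\#\text{ of True Positives} + \#\text{ of False Positives}}$$

$$\mathrm{Recall} = \frac{\#\text{ of True Positives}}{\#\text{ of True Positives} + \#\text{ of False Negatives}}$$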
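The evaluation metrics can be computed directly from true and predicted labels, as in the sketch below. Recall is taken as true positives over true positives plus false negatives, and the F1-weighted score averages the per-context F1-scores weighted by the number of true samples of each context. The function names and the toy labels are hypothetical, and scoring an undefined precision, recall, or F1 as 0 is an assumption of this sketch.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-context F1-score; undefined ratios are scored as 0.0 (assumption)."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return scores


def summarize(y_true, y_pred):
    """Return (accuracy, F1-macro, F1-weighted)."""
    f1 = f1_per_class(y_true, y_pred)
    support = Counter(y_true)          # number of true samples per context
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    f1_macro = sum(f1.values()) / len(f1)
    f1_weighted = sum(f1[c] * support[c] for c in f1) / n
    return accuracy, f1_macro, f1_weighted


# Hypothetical labels for six samples.
y_true = ["LumA", "LumA", "LumA", "LumA", "Basal", "Basal"]
y_pred = ["LumA", "LumA", "LumA", "Basal", "Basal", "Basal"]

accuracy, f1_macro, f1_weighted = summarize(y_true, y_pred)
print(round(accuracy, 4), round(f1_macro, 4), round(f1_weighted, 4))
```

In this toy case the per-context F1-scores are 6/7 for LumA and 0.8 for Basal, so the macro average (about 0.8286) is slightly lower than the support-weighted average (about 0.8381), because the larger context has the higher F1-score.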

Table 3.

Performance of multi-class classifiers

| Classifier | Accuracy | F1-macro | F1-weighted |
|---|---|---|---|
| SVM-RMES | 0.8983 | 0.8924 | 0.8986 |
| SVM-Gene expression | 0.8899 | 0.8917 | 0.8898 |
| SVM-Gene expression (Hallmarks) | 0.8834 | 0.8923 | 0.8831 |
| COSSY | 0.8657 | 0.8225 | 0.8723 |
| Dominant Class Prediction | 0.3451 | 0.1283 | 0.5132 |

The F1-weighted score is the average of the per-context F1-scores weighted by the support (the number of true samples) of each context. We used the following four baseline classifiers: the multi-class SVM classifier using gene expression values as features (# of genes = 20531), the multi-class SVM classifier using gene expression values of cancer hallmark genes as features (# of genes = 167), the COSSY classifier, and the classifier that predicts the dominant class in the dataset (here, the dominant class is Luminal A). The cancer hallmark genes were obtained from the COSMIC database [36]. The COSSY method identifies subnetworks that differentiate two contexts based on the entropy scores of the subnetworks [10]. Subnetworks with low entropy scores are ranked highly, indicating that they accurately differentiate the two contexts. The COSSY classifier predicts context labels based on weighted voting using highly ranked subnetworks. For the multi-class prediction, COSSY was trained on each breast cancer subtype in a one-vs-rest manner and the context with the highest positive weight was chosen.

As shown in Table 3, the multi-class SVM classifier using RMESs as features achieved similar or slightly higher scores than the multi-class SVM classifiers using gene expression values as features, and much higher scores than COSSY. COSSY is similar to CONFIGURE in that it identifies context specific subnetworks; however, since classification is not the main purpose of COSSY, its classification scores may be low. The SVM classifiers trained using gene expression values as features achieve higher classification performance than COSSY, but they cannot identify context specific subnetworks. In contrast, CONFIGURE can both identify context specific subnetworks and achieve high classification performance.

We also evaluated the performance of one-vs-rest binary classifiers. One-vs-rest binary classifiers were trained on each subtype where samples of a given subtype were considered as positive samples and samples of other subtypes were considered as negative samples. Table 4 shows the accuracy and F1-score for each breast cancer subtype. The dominant class prediction classifier achieved an F1-score of 0 for all four subtypes because the negative class was the dominant class for all the subtypes, which resulted in 0 true positives. Our model, the one-vs-rest binary SVM classifier trained using RMESs, also obtained similar or slightly better performance than the SVM classifiers trained using gene expression values, and much better average performance than COSSY. The classification results from the multi-class and one-vs-rest binary experiments show that RMESs are useful features for differentiating contexts.

Table 4.

Performance of one-vs-rest binary classifiers

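| Metric | Classifier | Luminal A | Luminal B | HER2 | Basal-like | Average |
|---|---|---|---|---|---|---|
| Accuracy | SVM-RMES | 0.9366 | 0.8722 | 0.9627 | 0.9907 | 0.9405 |
| Accuracy | SVM-Gene expression | 0.9104 | 0.8741 | 0.9664 | 0.9841 | 0.9338 |
| Accuracy | SVM-Gene expression (Hallmarks) | 0.9291 | 0.8657 | 0.958 | 0.9888 | 0.9354 |
| Accuracy | COSSY | 0.8871 | 0.7836 | 0.9067 | 0.9813 | 0.8897 |
| Accuracy | Dominant Class Prediction | 0.6353 | 0.6549 | 0.8983 | 0.8116 | 0.75 |
| F1-Score | SVM-RMES | 0.913 | 0.8105 | 0.7959 | 0.9747 | 0.8736 |
| F1-Score | SVM-Gene expression | 0.8772 | 0.8143 | 0.8378 | 0.958 | 0.8719 |
| F1-Score | SVM-Gene expression (Hallmarks) | 0.9033 | 0.8 | 0.7887 | 0.9698 | 0.8655 |
| F1-Score | COSSY | 0.8428 | 0.8542 | 0.3101 | 0.9506 | 0.7394 |
| F1-Score | Dominant Class Prediction | 0 | 0 | 0 | 0 | 0 |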
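The dominant class prediction row of Table 4 can be checked with simple arithmetic from the sample counts in Table 2. In every one-vs-rest task the negative class is larger, so the baseline predicts "negative" for all samples: its accuracy is the fraction of negative samples and, with no true positives, its F1-score is 0. The helper name below is hypothetical; the counts are those of Table 2.

```python
# Sample counts per subtype, taken from Table 2.
counts = {"Luminal A": 391, "Luminal B": 370, "Her2": 109, "Basal-like": 202}
total = sum(counts.values())  # 1072


def all_negative_baseline(n_positive, n_total):
    """Accuracy and F1-score of a classifier that labels every sample negative."""
    tp, fp = 0, 0
    fn = n_positive
    tn = n_total - n_positive
    accuracy = (tp + tn) / n_total
    f1 = 2 * tp / (2 * tp + fp + fn)   # equals 0 because there are no true positives
    return accuracy, f1


results = {subtype: all_negative_baseline(n, total) for subtype, n in counts.items()}
for subtype, (accuracy, f1) in results.items():
    print(subtype, round(accuracy, 4), f1)

average_accuracy = sum(acc for acc, _ in results.values()) / len(results)
print("Average", round(average_accuracy, 4))
```

The resulting accuracies (0.6353, 0.6549, 0.8983, and 0.8116, with an average of 0.75) agree with the dominant class prediction row of Table 4, which illustrates why accuracy alone can look reasonable for a classifier that never identifies a positive sample.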

Validating basal-like type specific regulatory modules

It is crucial to determine whether CONFIGURE can identify regulatory modules that represent each context (here, breast cancer subtype). Basal-like type breast cancer is a type of triple negative breast cancer where the expression status of Estrogen Receptor (ER), Progesterone Receptor (PR), and HER2 is negative. Basal-like type breast cancers usually have higher grade tumors and poorer prognosis than breast cancers of other subtypes. Due to their triple negative characteristic, there is a lack of targeted therapies for basal-like type breast cancers. Even with chemotherapies, it is difficult to dramatically improve the prognosis of patients with these cancers [23, 26, 28]. Thus, it is essential to identify the transcriptional mechanisms underlying basal-like type breast cancers and eventually identify the molecular targets of basal-like type breast cancers.

After confirming the accuracy of the basal-like type binary classifier in Table 4, we extracted basal-like type specific regulatory modules using CONFIGURE. Table 5 shows the results of the basal-like type specific regulatory modules. The “+” sign in the regulatory module name indicates that the regulatory module is activated, and the “-” sign indicates that the module is repressed. We ranked the regulatory modules based on their importance scores. The top 10 regulatory modules and their scores are shown in Table 5. Target genes in a regulatory module are ranked based on their regulatory interaction scores (the top 5 target genes are shown in Table 5). The Status column indicates whether a basal-like type regulatory module is up-regulated or down-regulated. A regulatory module is up-regulated if its average RMES value is higher in the positive samples than in the negative samples.

Table 5.

The results of basal-like type specific regulatory modules obtained by CONFIGURE

| Regulatory Module | Target Genes (top 5) | FIS | Status | Evidence |
|---|---|---|---|---|
| POU5F1(-) | TOX3, RALGPS2, FUT8, HMGCR, FOXA1 | 0.1328 | down-regulated | [38, 39] |
| ZIC1(-) | XBP1, OVOL1, SLC1A4, SMAD7, CNTN1 | 0.1072 | down-regulated | |
| RARA(+) | RARA, STARD3, PLEKHH3, MAG, PCGF2 | 0.0827 | down-regulated | |
| E2F3(+) | E2F3, ANP32E, GEN1, SYNCRIP, BEND3 | 0.0667 | up-regulated | |
| GATA6(-) | MAST4, PDE6B, ROBO2, KIF5A, ABI2 | 0.058 | down-regulated | [40] |
| PHOX2B(+) | PHOX2B, DDC, MSGN1, AKR1D1, FABP7 | 0.0454 | up-regulated | |
| GLI3(-) | PPIF, ELF5, ORAI1, POR, HMGA1 | 0.0447 | up-regulated | [41] |
| ETV6(+) | PHB2, NCAPD2, VANGL2, PLEKHA5, ETV6 | 0.0346 | up-regulated | |
| SRF(-) | PAIP2, ERLEC1, NECAP1, SCRN3, ZFP62 | 0.0344 | down-regulated | [37, 42] |
| PLAGL1(-) | SLC25A17, NPBWR2, PTK6, SYCE2, HN1L | 0.0327 | down-regulated | |

We validated the basal-like type specific regulatory modules obtained by CONFIGURE through literature surveys. We checked whether the TFs of regulatory modules were reported to have associations with the phenotypes of basal-like type breast cancers. The Evidence column in Table 5 indicates whether the TF of a regulatory module has been reported.

Interestingly, a recent study [37] showed that the expression of serum response factor (SRF) promotes the stemness of basal-like type breast cancers by activating Interleukin 6 (IL6) through binding to the Yes-associated protein (YAP). In our result, the SRF(-) regulatory module was ranked 9th (Table 5). The SRF(-) regulatory module is down-regulated, which indicates that the regulatory module contains only target genes repressed by SRF and that these target genes are down-regulated in the basal-like type. The target genes of the SRF(-) regulatory module that are over-repressed by SRF may be novel candidates for promoting the stemness of basal-like type breast cancers.

Conclusion

In this article, we presented CONFIGURE, a pipeline that identifies context specific regulatory modules from gene expression data. CONFIGURE infers and prunes a gene regulatory network to construct regulatory modules. CONFIGURE uses normalized enrichment scores obtained using the single sample GSEA (ssGSEA) method to score the regulatory modules for given samples and make a regulatory module enrichment score (RMES) matrix. The enrichment score indicates the extent to which a regulatory module is up- or down-regulated in a given sample. Then using the feature importance scores of a one-vs-rest binary random forest classifier, CONFIGURE identifies context specific regulatory modules.

We quantitatively evaluated CONFIGURE in the multi-class experiment and one-vs-rest binary class experiment using 10-fold cross validation. In the multi-class experiment, the multi-class SVM classifier trained using RMESs as features achieved an accuracy of 0.8983, an F1-macro score of 0.894, and an F1-weighted score of 0.8986. In the one-vs-rest binary experiment, the one-vs-rest binary SVM classifier trained using RMESs as features achieved accuracy scores of 0.9356, 0.8806, 0.9328, and 0.9907 on Luminal A, Luminal B, HER2, and basal-like, respectively. The SVM classifiers trained using RMESs achieved the best scores in the multi-class experiment and the best average scores in the one-vs-rest binary experiment. We validated the basal-like type specific regulatory modules through literature surveys. Compared with other breast cancer subtypes, basal-like type breast cancers have a poor prognosis and lack targeted therapies. Thus, it is important to identify the transcriptional mechanisms underlying basal-like type breast cancers. The literature survey result showed that basal-like type specific regulatory modules are associated with the phenotypes of basal-like type breast cancers.

Although CONFIGURE has many advantages, there is still room for improvement. CONFIGURE can be applied to other types of cancer or any gene expression dataset as long as it contains contextual information (e.g., single cell RNA-seq data that contains cell type information). However, we validated CONFIGURE only on the breast cancer dataset. In future work, CONFIGURE can be validated on other datasets, and more importantly, regulatory modules identified by CONFIGURE can be verified through wet-lab experiments.

We believe that CONFIGURE will prove to be a useful pipeline for generating hypotheses about novel transcriptional mechanisms that accurately characterize phenotypes.

Acknowledgements

We thank Susan Kim for editing the manuscript.

Funding

Publication of this article has been funded by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2016M3A9A7916996, NRF-2017M3C4A7065887) and supported by the National IT Industry Promotion Agency grant funded by the Ministry of Science and ICT and Ministry of Health and Welfare (NO. C1202-18-1001, Development Project of The Precision Medicine Hospital Information System (P-HIS)).

Availability of data and materials

Not applicable.

About this supplement

This article has been published as part of BMC Medical Genomics Volume 12 Supplement 5, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Medical Genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-12-supplement-5.

Authors’ contributions

SP, JK, and HK designed the study. SP, DH, and YSY conducted the experiments. SP and JK analyzed the results. SP wrote the manuscript. All authors have read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer cell. 2010;17(1):98–110. doi: 10.1016/j.ccr.2009.12.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Network CGA, et al. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61. doi: 10.1038/nature11412. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, et al. Sc3: consensus clustering of single-cell rna-seq data. Nat Methods. 2017;14(5):483. doi: 10.1038/nmeth.4236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001;98(9):5116–21. doi: 10.1073/pnas.091062498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A. Enrichnet: network-based gene set enrichment analysis. Bioinformatics. 2012;28(18):451–7. doi: 10.1093/bioinformatics/bts389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Alexeyenko A, Lee W, Pernemalm M, Guegan J, Dessen P, Lazar V, Lehtiö J, Pawitan Y. Network enrichment analysis: extension of gene-set enrichment analysis to gene networks. BMC Bioinforma. 2012;13(1):226. doi: 10.1186/1471-2105-13-226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Saha A, Tan AC, Kang J. Automatic context-specific subnetwork discovery from large interaction networks. PloS ONE. 2014;9(1):84227. doi: 10.1371/journal.pone.0084227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Margolin AA, Wang K, Lim WK, Kustagi M, Nemenman I, Califano A. Reverse engineering cellular networks. Nat Protocol. 2006;1(2):662. doi: 10.1038/nprot.2006.106. [DOI] [PubMed] [Google Scholar]
  • 12.Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5(1):8. doi: 10.1371/journal.pbio.0050008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Irrthum A, Wehenkel L, Geurts P, et al. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE. 2010;5(9):12776. doi: 10.1371/journal.pone.0012776.
  • 14.Guo S, Jiang Q, Chen L, Guo D. Gene regulatory network inference using PLS-based methods. BMC Bioinforma. 2016;17(1):545. doi: 10.1186/s12859-016-1398-6.
  • 15.Huynh-Thu VA, Sanguinetti G. Gene regulatory network inference: an introductory survey. In: Methods in Molecular Biology. New York, NY: Springer New York; 2018. pp. 1–23.
  • 16.Aibar S, González-Blas CB, Moerman T, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, van den Oord J, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14(11):1083. doi: 10.1038/nmeth.4463.
  • 17.Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, Schinzel AC, Sandy P, Meylan E, Scholl C, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462(7269):108. doi: 10.1038/nature08460.
  • 18.Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38(4):367–378.
  • 19.Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
  • 20.Tan P-N, et al. Introduction to Data Mining. Chennai: Pearson Education India; 2007.
  • 21.Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc.; 2013.
  • 22.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  • 23.Perou CM, Sørlie T, Eisen MB, Van De Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen L, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747. doi: 10.1038/35021093.
  • 24.Kapp AV, Jeffrey SS, Langerød A, Børresen-Dale A-L, Han W, Noh D-Y, Bukholm IR, Nicolau M, Brown PO, Tibshirani R. Discovery and validation of breast cancer subtypes. BMC Genomics. 2006;7(1):231. doi: 10.1186/1471-2164-7-231.
  • 25.Gnant M, Harbeck N, Thomssen C. St. Gallen 2011: summary of the consensus discussion. Breast Care. 2011;6(2):136–41. doi: 10.1159/000328054.
  • 26.Dai X, Li T, Bai Z, Yang Y, Liu X, Zhan J, Shi B. Breast cancer intrinsic subtype classification, clinical use and future trends. Am J Cancer Res. 2015;5(10):2929.
  • 27.Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Philadelphia: American Association for Cancer Research; 2012.
  • 28.Zhao X, Rødland EA, Tibshirani R, Plevritis S. Molecular subtyping for clinically defined breast cancer subgroups. Breast Cancer Res. 2015;17(1):29. doi: 10.1186/s13058-015-0520-4.
  • 29.Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6(269):p11. doi: 10.1126/scisignal.2004088.
  • 30.Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinforma. 2011;12(1):323. doi: 10.1186/1471-2105-12-323.
  • 31.Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160. doi: 10.1200/JCO.2008.18.1370.
  • 32.Gendoo DM, Ratanasirigulchai N, Schröder MS, Paré L, Parker JS, Prat A, Haibe-Kains B. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics. 2015;32(7):1097–9. doi: 10.1093/bioinformatics/btv693.
  • 33.Han H, Shim H, Shin D, Shim JE, Ko Y, Shin J, Kim H, Cho A, Kim E, Lee T, et al. TRRUST: a reference database of human transcriptional regulatory interactions. Sci Rep. 2015;5:11432. doi: 10.1038/srep11432.
  • 34.Han H, Cho J-W, Lee S, Yun A, Kim H, Bae D, Yang S, Kim CY, Lee M, Kim E, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2017;46(D1):380–6. doi: 10.1093/nar/gkx1013.
  • 35.Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
  • 36.Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, Cole CG, Ward S, Dawson E, Ponting L, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2016;45(D1):777–83. doi: 10.1093/nar/gkw1121.
  • 37.Kim T, Yang S-J, Hwang D, Song J, Kim M, Kim SK, Kang K, Ahn J, Lee D, Kim M-y, et al. A basal-like breast cancer-specific role for SRF–IL6 in YAP-induced cancer stemness. Nat Commun. 2015;6:10186. doi: 10.1038/ncomms10186.
  • 38.Cheng C-C, Shi L-H, Wang X-J, Wang S-X, Wan X-Q, Liu S-R, Wang Y-F, Lu Z, Wang L-H, Ding Y. STAT3/Oct-4/c-Myc signal circuit for regulating stemness-mediated doxorubicin resistance of triple-negative breast cancer cells and inhibitory effects of WP1066. Int J Oncol. 2018;53(1):339–48. doi: 10.3892/ijo.2018.4399.
  • 39.Zhang J-M, Wei K, Jiang M. Oct4 but not Sox2 expression correlates with worse prognosis in surgical patients with triple-negative breast cancer. Breast Cancer. 2018;25(4):447–55. doi: 10.1007/s12282-018-0844-x.
  • 40.Martinelli P, Carrillo-de Santa Pau E, Cox T, Sainz B, Dusetti N, Greenhalf W, Rinaldi L, Costello E, Ghaneh P, Malats N, et al. GATA6 regulates EMT and tumour dissemination, and is a marker of response to adjuvant chemotherapy in pancreatic cancer. Gut. 2017;66(9):1665–76. doi: 10.1136/gutjnl-2015-311256.
  • 41.Habib JG, O’Shaughnessy JA. The hedgehog pathway in triple-negative breast cancer. Cancer Med. 2016;5(10):2989–3006. doi: 10.1002/cam4.833.
  • 42.Kim T, Lim D-S. The SRF-YAP-IL6 axis promotes breast cancer stemness. Cell Cycle. 2016;15(10):1311. doi: 10.1080/15384101.2016.1161994.

Data Availability Statement

Not applicable.


Articles from BMC Medical Genomics are provided here courtesy of BMC
