Skip to main content
Bioinformatics Advances logoLink to Bioinformatics Advances
. 2022 Sep 27;2(1):vbac073. doi: 10.1093/bioadv/vbac073

EnrichIntersect: an R package for custom set enrichment analysis and interactive visualization of intersecting sets

Zhi Zhao 1,2,, Manuela Zucknick 3, Tero Aittokallio 4,5,6
PMCID: PMC9710586  PMID: 36699400

Abstract

Summary

Enrichment analysis has been widely used to study whether predefined sets of genes or other molecular features are over-represented in a ranked list associated with a disease or other phenotype. However, computational tools that perform enrichment analysis and visualization are usually limited to predefined sets available from public databases. To make such analyses more flexible, we introduce an R package, EnrichIntersect, which enables enrichment analyses among any ranked features and user-defined custom sets. For interactive visualization of multiple covariates, such as genes or other features, which are associated with multiple phenotypes and multiple sample groups, such as drug responses in various cancer types, EnrichIntersect illustrates all associations at a glance, hence explicitly indicating intersecting covariates between multiple phenotypic variables and between multiple sample groups.

Availability and implementation

The EnrichIntersect R package is available at https://CRAN.R-project.org/package=EnrichIntersect via an open-source MIT license. A package installation process is described on CRAN at https://cran.r-project.org/. A user-manual description of features and function calls can be found from the vignette of our package on CRAN.

1 Introduction

Gene set enrichment or over-representation analysis maps large lists of genes into a smaller list of more easily interpretable gene sets or pathways (Subramanian et al., 2005). However, most enrichment tools use already-existing gene sets or biochemical pathway databases, for example, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). This limits the users when exploring their gene lists or other feature lists with respect to predefined sets of interest. Innis et al. (2021) provide an R package for customizing gene set enrichment analysis; however, this method identifies differentially expressed genes from a given data matrix with respect to only a single binary phenotype and subsequently performs enrichment analysis for the identified genes involved in customized gene set libraries, but there is no standalone function for the enrichment analysis in the later step. For wider applications, there is a need to implement computational tools that enable standalone enrichment analysis for any feature list of interest, based on user-defined custom sets. For instance, Zhao et al. (2022) analyzed tissue-specific drugs as drug classes enriched among a ranked drug list corresponding to a given cancer tissue. To enable more flexible enrichment analyses among any ranked features and user-defined custom set, we introduce the R (R Core Team, 2021) package EnrichIntersect for custom set enrichment analysis (CSEA).

The R package EnrichIntersect can also perform interactive visualization of the feature list of interest associated with multiple phenotypes and multiple sample groups using ggplot2 (Wickham, 2016). Previously, Caldas et al. (2009) visualized a topic model via a so-called eye diagram, which is applicable to several latent variables (e.g. topics), many analysis levels (e.g. experiments), and a large number of tasks (e.g. gene sets). However, when there are many intermediate latent variables, such as genomic variables as covariates, our EnrichIntersect package makes it easier to visualize their relationships at multiple levels and with respect to multiple tasks using R package networkD3 (Allaire et al., 2017). For example, Zhao et al. (2022) developed cancer tissue-specific variable selection for large-scale drug screens, where a Sankey diagram, which is implemented in the EnrichIntersect package, was used to visualize the relationships between multiple genomic features and several drugs of interest (i.e. multiple tasks), corresponding to several cancer tissue types (i.e. sample groups). A similar alluvial plot was utilized by Gálvez et al. (2019) to visualize common GO terms enriched among clusters of self-organizing maps and to highlight response-significant modules from weighted gene correlation network analysis, but without providing publicly available software implementation.

2 Methods

2.1 Enrichment analysis

The methodology of CSEA is based on the guideline provided by Reimand et al. (2019). For a feature list with rank metric sLsis1, consider a user-defined feature set consisting of a list of Lk features Gk={gkj:j=1,,Lk} indexed by k where each feature gkj must be represented in the ranked list. Define the set of features outside of Gk as G¯k={g¯kj:j=1,,LLk}. A (weighted) Kolmogorov–Smirnov statistic can measure the enrichment score for a given feature set Gk, i.e.

Sk=sup1iL|FiGkFiG¯k|,

where FiGk and FiG¯k are weighted empirical cumulative distribution functions

FiGk=t=1i|st|α·1{featuretGk}t=1L|st|α·1{featuretGk},FiG¯k=t=1i|st|α·1{featuretG¯k}LLk,

where α is for weighted enrichment score and 1{·} is the indicator function for membership in the specified feature list.

A large enrichment score Sk indicates that the list of features is likely to be enriched to the given feature set Gk. To determine the statistical significance of each enrichment score, permutation test is used to derive an empirical P-value. Accounting for the feature set size Lk in the function of enrichment score, normalized enrichment score can be calculated, which is the enrichment score divided by the expected value of the corresponding null distribution FGk=FG¯k.

2.2 Multilevel task groups in multitask learning

In multitask learning with feature selection, features associated with multiple tasks are often identified. When there are heterogeneous sample groups or levels in each task, we are interested in identifying associated features with each task in each group or level. For example, Han and Zhang (2015) proposed the multilevel task grouping method and Zhao et al. (2022) proposed mix-lasso for identifying multitask multilevel features. Let us suppose there are m learning tasks, e.g. cancer drugs, p features, e.g. genes and each task is measured across T heterogeneous sample groups, e.g. drug-treated cell lines of T cancer types. In Figure 1, we illustrate the general situation with T levels of m tasks output data Y and T levels of p features input data X. There are a total of (2m1)·(2T1) possible sets of features associated with individual tasks and individual levels.

Fig. 1.

Fig. 1.

General representation of data for modeling m multiple tasks with T multiple levels given p features

3 Implementation

EnrichIntersect is an R package that generalizes gene set enrichment analysis, so that one can identify over-represented custom sets among any ranked feature list in response to given phenotypes. The enrichment() function requires a scored list of features which is either a vector corresponding to one phenotype or a matrix corresponding to multiple phenotypes, and a data.frame object with user-defined sets for all the features in the scored list. CSEA provides (normalized) enrichment scores for the list of features enriched to multiple custom sets for a single phenotype, for example, multiple drug classes for a given cancer type in one column of Figure 2A. We use red colored circles to visualize statistically significant enriched custom sets at a user-specified significance level with a prespecified multiple-comparison correction method.

Fig. 2.

Fig. 2.

An example of visualizing the results of drug-set enrichment analysis and an Sankey diagram example of visualizing associations of a gene set with two cancer drugs corresponding to three cancer types. (A) For each cancer tissue type (x-axis), a list of 147 ranked drugs was analyzed for their enrichment among nine drug classes (y-axis). (B) A Sankey diagram illustrating how a set of 56 genes is associated with two drugs (RG-108 and JQ-1), corresponding to three cancer types (colorectal, melanoma and multiple myeloma). The plot shows the results from a mix-lasso analysis (Zhao et al., 2022) to identify genomic features that are associated with the sensitivity of cell lines from these cancer types to RG-108 or JQ-1 as measured by cell viability assays

To visualize a large number of features associated with multitask multilevel responses, as in Figure 1, EnrichIntersect has adapted the network3D package for visualizing a large number of intersecting sets at a glance. One intersecting set is a list of features associated with a combination of a subset of tasks and a subset of sample groups or levels. It is not straightforward to visualize all possible (2m1)·(2T1) intersecting sets for m tasks and T levels. The network3D package can visualize only a small or moderate number of features at each stage. Our package EnrichIntersect passes suitable parameters to JavaScript in networkD3 for visualizing a large number of features, small number of tasks and small number of levels, which is useful in many applications. For example, Figure 2B shows a relatively large number of genomic features associated with a few drugs (i.e. tasks) and a few cancer types (i.e. levels). Note that our package requires an array object of the estimated regression coefficients from a multitask multilevel model as an input data for the function intersectSankey().

The R package EnrichIntersect has been made freely available and it has passed all the required quality checks and platforms on CRAN (https://CRAN.R-project.org/package=EnrichIntersect). A user-manual description of features, function calls and examples can be found from the vignette of our package on CRAN.

4 Conclusion

The EnrichIntersect R package provides a flexible tool for enrichment analysis based on user-defined sets. It allows users to perform over-representation analysis of the custom sets among any specified ranked feature list, hence making enrichment analysis applicable to various types of data from different scientific fields. A possible extension of CSEA is to use Fisher’s exact test or hypergeometric test for unranked list of features. EnrichIntersect also enables an interactive means to visualize identified associations based on, for example, the multilevel task grouping method (Han and Zhang, 2015), mix-lasso (Zhao et al., 2022) or any other similar method most suitable for particular application scenarios with multitask and multilevel data, as illustrated in Figure 1. It refines the use of the networkD3 package to allow the visualization of relationships between a large number of features and multiple tasks corresponding to multiple sample groups or levels.

Acknowledgements

The authors thank the authors of the R packages ggplot2, networkD3, htmlwidgets, webshot2, jsonlite, methods, stats and grDevices for making their code available for reuse and extension, and Dr Shixiong Wang for the example shown in Figure 2A.

Funding

This work was supported by grants from the Helse Sør-Øst [2020026 to T.A.], the Norwegian Cancer Society [216104 to T.A.], the Radium Hospital Foundation, the Academy of Finland [326238, 340141, 344698 and 345803 to T.A.], Cancer Society of Finland, the European Union’s Horizon 2020 research and innovation programme [‘RESCUER’ 847912 to M.Z.] and the Innovative Medicines Initiative 2 Joint Undertaking of the European Union’s Horizon 2020 research and innovation programme and EFPIA and JDRF INTERNATIONAL [‘imSAVAR’ 853988 to M.Z.].

Conflict of Interest: none declared.

Contributor Information

Zhi Zhao, Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo N-0310, Norway; Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo N-0372, Norway.

Manuela Zucknick, Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo N-0372, Norway.

Tero Aittokallio, Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo N-0310, Norway; Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo N-0372, Norway; Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki FI-00014, Finland.

Data Availability

The data associated with this article are openly available in the R package EnrichIntersect at https://CRAN.R-project.org/package=EnrichIntersect.

References

  1. Allaire J. et al. (2017) networkd3: D3 javascript network graphs from R. R package version 0.4. https://CRAN.R-project.org/package=networkD3 (5 October 2022, date last accessed).
  2. Caldas J. et al. (2009) Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25, i145–i153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Gálvez S. et al. (2019) Hotspots in the genomic architecture of field drought responses in wheat as breeding targets. Funct. Integr. Genomics, 19, 295–309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Han L., Zhang Y. (2015) Learning multi-level task groups in multi-task learning. AAAI, 29, 2638–2644. [Google Scholar]
  5. Innis S.E. et al. (2021) Gseaplot: a package for customizing gene set enrichment analysis in R. J. Comput. Biol., 28, 629–631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. R Core Team (2021) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Google Scholar]
  7. Reimand J. et al. (2019) Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, cytoscape and enrichmentmap. Nat. Protoc., 14, 482–517. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Subramanian A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Wickham H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York. [Google Scholar]
  10. Zhao Z. et al. (2022) Tissue-specific identification of multi-omics features for pan-cancer drug response prediction. iScience, 25, 104767. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data associated with this article are openly available in the R package EnrichIntersect at https://CRAN.R-project.org/package=EnrichIntersect.


Articles from Bioinformatics Advances are provided here courtesy of Oxford University Press

RESOURCES