BMC Bioinformatics. 2023 May 23;24:214. doi: 10.1186/s12859-023-05342-9

Empowering biologists to decode omics data: the Genekitr R package and web server

Yunze Liu 1,2,3, Gang Li 1,2,3
PMCID: PMC10205030  PMID: 37221491

Abstract

Background

A variety of high-throughput analyses, such as transcriptome, proteome, and metabolome analysis, have been developed, producing unprecedented amounts of omics data. These studies generate large gene lists whose biological significance needs to be understood in depth. However, manually interpreting these lists is difficult, especially for scientists without bioinformatics training.

Results

We developed an R package and a corresponding web server, Genekitr, to assist biologists in exploring large gene sets. Genekitr comprises four modules: gene information retrieval, ID (identifier) conversion, enrichment analysis and publication-ready plotting. Currently, the information retrieval module can retrieve up to 23 attributes for genes of 317 organisms. The ID conversion module assists in ID mapping of genes, probes, proteins, and aliases. The enrichment analysis module organizes 315 gene set libraries covering different biological contexts and analyzes them by over-representation analysis and gene set enrichment analysis. The plotting module produces customizable, high-quality illustrations that can be used directly in presentations or publications.

Conclusions

This web server tool will make bioinformatics more accessible to scientists who might not have programming expertise, allowing them to perform bioinformatics tasks without coding.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-023-05342-9.

Keywords: Bioinformatics tool, Web server, Gene set enrichment analysis, Non-programming bioinformatics, Plotting

Background

High-throughput methodologies have revolutionized biomedical research by enabling deep sequencing of genomes, transcriptomes, and epigenomes. These studies generate many gene lists, and interpreting them can be a significant challenge. In particular, many laboratories still require the assistance of bioinformaticians to complete fundamental tasks such as retrieving gene information, converting IDs, performing enrichment analysis, and creating plots suitable for publication. However, not all laboratories have an in-house bioinformatician, and most bench scientists lack the skills to use the R programming language. This has resulted in a significant demand for online applications that can perform these tasks. Despite the numerous online tools available, many of them fail to meet the needs of bench scientists.

For example, (i) traditional resources used to retrieve gene attributes, including Entrez Gene from the National Center for Biotechnology Information (NCBI) [1], are usually organized in a one-gene-at-a-time format, whereas currently available batch retrieval tools such as The Mouse Genome Informatics Database (MGI) [2, 3] and HGNC [4, 5] can retrieve only a limited set of attributes and provide no summaries of gene function; (ii) most current ID conversion tools, including g:Convert [6] and The Database for Annotation, Visualization and Integrated Discovery (DAVID) [7], do not support alias matching, especially when gene symbols and aliases are mixed; (iii) the parent-child redundancy among Gene Ontology (GO) terms confounds interpretation [8], inflating the apparent number of biologically relevant results; (iv) web servers including WebGestalt [9], Enrichr [10], Web Gene Ontology Annotation Plot (WEGO) [11] and ShinyGO [12] provide only built-in static figures and leave users little room to generate publication-ready illustrations.

To address these issues, we developed an integrated online toolkit called Genekitr. It integrates various functionalities into a single web server, including four modules: the GeneInfo module for batch queries of gene information, the IDConvert and ProbeConvert modules for gene and probe identifier conversion, the GeneEnrich module for gene enrichment analysis and the Plot module for publication-ready plotting. This tool provides a convenient one-stop solution for bench scientists without programming skills.

Methodology and implementation

Gene information retrieval module

Data collection

Gene information of 317 species, containing 195 vertebrates, 120 plants and 2 bacteria, was retrieved from the quarterly updated Ensembl database (version 108, Oct 2022) [13]. Moreover, NCBI gene annotation for 19 organisms was retrieved by organism-level packages in Bioconductor [14] and UniProt identifiers for 12 organisms were downloaded from UniProt [15, 16], which were subsequently integrated with Ensembl resources as a complement. The gene information mainly includes gene nomenclature, gene function summary, genomic location, gene sequence, gene biotype, and transcript count. Besides, species-specific information was appended. For example, 13,605 human cell marker genes were obtained from the CellMarker database, which assists in identifying and characterizing tissue and cell types [17].

Input data

The gene information retrieval module accepts lists of gene identifiers separated by blanks, commas or semicolons. Various types of gene identifiers are accepted, including: Entrez Gene IDs, Ensembl IDs, UniProt IDs, gene symbols and aliases. Gene symbols and aliases are case-insensitive.
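The input rules above can be illustrated with a short Python sketch. It is not taken from Genekitr; the function names `parse_gene_ids` and `normalize_symbol` are hypothetical, and upper-casing is only one assumed way to make symbol and alias matching case-insensitive.

```python
import re


def parse_gene_ids(raw):
    """Split a free-text query on blanks, commas or semicolons."""
    tokens = re.split(r"[\s,;]+", raw.strip())
    # Drop the empty tokens produced by empty input or stray separators.
    return [tok for tok in tokens if tok]


def normalize_symbol(token):
    """Upper-case a symbol or alias so that 'brca1' and 'BRCA1' compare equal.

    Assumption: case-folding is applied to symbols and aliases only.
    """
    return token.upper()


# A mixed query: symbol, lower-case symbol, Ensembl ID and Entrez ID.
ids = parse_gene_ids("TP53, brca1;ENSG00000141510  7157")
print(ids)
```

Runs of mixed separators (for example a comma followed by a space) are treated as a single delimiter, so pasted spreadsheet columns and hand-typed lists are handled the same way.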

One-to-many mapping rules

If one-to-many ID mapping occurs, the program applies three filters in order: first, it keeps the records with the largest number of attributes; second, it keeps records located on standard chromosomes rather than on unplaced scaffolds; finally, it selects the record with the smallest Entrez ID, as this is usually mapped to a non-predicted genome sequence and is therefore considered official. If no match is found during this process, the result is left blank (see Additional file 1: Fig. S1).
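The three-step rule can be sketched in Python as follows. This is an illustration under stated assumptions, not Genekitr's implementation: the `GeneRecord` class, the `pick_unique` function and the human-style `STANDARD_CHROMS` set are hypothetical, and falling back to all remaining records when none lies on a standard chromosome is an assumption the text does not spell out.

```python
from dataclasses import dataclass, field

# Assumption: human-style chromosome names count as "standard".
STANDARD_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


@dataclass
class GeneRecord:
    entrez_id: int
    chromosome: str
    attributes: dict = field(default_factory=dict)

    def n_attributes(self):
        # Count only attributes that actually carry a value.
        return sum(1 for v in self.attributes.values() if v not in (None, ""))


def pick_unique(records):
    """Reduce one-to-many matches to a single record, or None if no match."""
    if not records:
        return None  # corresponds to a blank result
    # Step 1: keep the records with the most non-empty attributes.
    best = max(r.n_attributes() for r in records)
    kept = [r for r in records if r.n_attributes() == best]
    # Step 2: prefer standard chromosomes over unplaced scaffolds.
    on_chrom = [r for r in kept if r.chromosome in STANDARD_CHROMS]
    if on_chrom:  # assumed fallback: keep everything if none qualifies
        kept = on_chrom
    # Step 3: take the smallest Entrez ID.
    return min(kept, key=lambda r: r.entrez_id)


# Toy records with made-up IDs, for illustration only.
candidates = [
    GeneRecord(300, "1", {"symbol": "A", "summary": "x"}),
    GeneRecord(200, "KI270728.1", {"symbol": "A", "summary": "y"}),
    GeneRecord(100, "2", {"symbol": "A", "summary": ""}),
    GeneRecord(250, "3", {"symbol": "A", "summary": "z"}),
]
print(pick_unique(candidates).entrez_id)
```

In the toy data, the record with ID 100 is dropped in step 1 (fewer attributes), the scaffold record with ID 200 is dropped in step 2, and ID 250 wins step 3 over ID 300.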

ID conversion module

The ID conversion module assists in two separate tasks. The first task is ID conversion among gene symbols, gene aliases, Entrez IDs, Ensembl IDs, and UniProt IDs. It is based on the gene information retrieval module and inherits one-to-many mapping rules. The second task is converting human probes to gene symbols or IDs. The human probe annotation data of popular platforms, including Affymetrix, Agilent, Illumina, Phalanx and Codelink, were downloaded from Ensembl by biomaRt [18]. If the probe has no matched gene in the Ensembl database, the NCBI probe annotation data will be loaded from Bioconductor as a supplement. Any unmatched IDs are left as blanks.
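The probe lookup order described above (Ensembl annotation first, NCBI annotation as a supplement, blank when nothing matches) amounts to a chained dictionary lookup. The sketch below is a hypothetical illustration; `convert_probes`, the probe IDs and the gene names are invented for the example.

```python
def convert_probes(probes, ensembl_map, ncbi_map):
    """Map probe IDs to gene symbols: Ensembl first, NCBI as a fallback."""
    result = {}
    for probe in probes:
        # An unmatched probe is reported as an empty string (a blank).
        result[probe] = ensembl_map.get(probe) or ncbi_map.get(probe) or ""
    return result


# Made-up annotation tables for illustration only.
ensembl_map = {"probe_1": "GENE_A"}
ncbi_map = {"probe_1": "GENE_A_OLD", "probe_2": "GENE_B"}
print(convert_probes(["probe_1", "probe_2", "probe_3"], ensembl_map, ncbi_map))
```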

Enrichment analysis module

Gene set collection

Gene set raw data files were curated from 11 popular public databases, including 4 libraries of GO (All, biological process (BP), molecular function (MF) and cellular component (CC)), 6 libraries of Kyoto Encyclopedia of Genes and Genomes (KEGG) (Pathway, Module, Enzyme, Network, Drug and Disease) [19], 20 libraries of Medical Subject Headings (MeSH) [20], 22 libraries of Molecular Signatures Database (MsigDB) [21], 256 libraries of Enrichr, and the gene set libraries from WikiPathways [22], Reactome [23], DisGeNET [24], Disease Ontology (DO) [25], Network of Cancer Genes (version 6 and 7) [26] and COVID-19 Gene Set Library [27]. For each database, the term descriptions and gene-term mappings were parsed and retrieved from the raw data files.

Enrichment methods

The program supports the over-representation analysis (ORA) [28] and gene set enrichment analysis (GSEA) [29] methods. The ORA method passes a list of gene symbols, gene aliases, Entrez, Ensembl, or UniProt IDs to a hypergeometric distribution model, which describes sampling without replacement:

$$P(X \geq k) = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

where P is the probability of observing at least k genes from a given gene set in the input list, N is the total number of genes in the background set, M is the number of genes within the background set that are annotated to the specific gene set, n is the total size of the gene list of interest and k is the number of genes within the list that are annotated to the gene set. The GSEA method accepts gene symbols, gene aliases, Entrez or Ensembl IDs with associated fold change values from differential expression analysis. It uses the fgsea R package to calculate enrichment scores, which indicate whether a gene set accumulates at the top or bottom of the entire ordered gene list [30]. The nominal p-value is estimated by an empirical phenotype-based permutation test.
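As a worked illustration of the hypergeometric model, the following Python sketch evaluates the upper-tail probability from the quantities N, M, n and k defined above. The function name `ora_p_value` is hypothetical and the code is not Genekitr's implementation; it only shows how the formula can be computed with exact integer binomial coefficients.

```python
from math import comb


def ora_p_value(k, n, M, N):
    """Probability of seeing at least k annotated genes in a list of size n.

    N: background size, M: background genes annotated to the gene set,
    n: size of the gene list, k: list genes annotated to the gene set.
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= k <= min(n, M)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Lower tail P(X <= k - 1), summed term by term as in the formula.
    lower = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    return 1 - lower / total


# Toy numbers: 10 background genes, 4 in the set, list of 3, 2 of them in the set.
p = ora_p_value(k=2, n=3, M=4, N=10)
print(round(p, 4))
```

With these toy numbers the lower tail is (20 + 60) / 120, so the p-value is 1/3. For very small p-values, summing the upper tail directly avoids the loss of precision that comes from subtracting a number close to 1.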

GO term simplifying

GO term information for 15 organisms was extracted from Bioconductor: Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (thale cress), Saccharomyces cerevisiae (budding yeast), Danio rerio (zebrafish), Caenorhabditis elegans (nematode), Bos taurus (cow), Sus scrofa (pig), Gallus gallus (chicken), Anopheles gambiae (mosquito), Canis familiaris (dog), Xenopus laevis (clawed frog) and Pan troglodytes (chimpanzee). The relationships between GO terms were retrieved from GO.db [31]. Five semantic similarity measures ("Resnik", "Lin", "Jiang", "Rel" and "Wang") from the GOSemSim R package were used to calculate semantic similarity for GO BP, CC and MF [32].

Publication-ready plotting module

Plots are generated based on R packages, including ggplot2 [33], pheatmap [34], VennDiagram [35], ggrepel [36], ComplexUpset [37], ggraph [38], igraph [39].

Web server implementation

The Genekitr web server is implemented on Ubuntu (version 18.04.6) with the Shiny R package [40]. Genekitr is accessible from multiple platforms through Microsoft Edge, Chrome, Safari and Firefox.

Programmatic access

All the functions of the web server can also be run locally through an R package, also called Genekitr, which is available from The Comprehensive R Archive Network (CRAN) repository [41]. In addition, the R package offers several features of its own. For instance, the "getPubmed" function performs batch queries of PubMed records. The "importPanther" function imports and reorganizes GO enrichment analysis results from The Gene Ontology Resource [42], which is powered by PANTHER [43]. The "genORA" function supports the comparison of results from multiple gene enrichment analyses.

Utility and discussion

Genekitr is an R package and web server that helps biologists analyze large gene sets generated from high-throughput analyses. It comprises four modules to perform gene information retrieval, ID conversion, enrichment analysis, and publication-ready plotting (Fig. 1). Genekitr makes bioinformatics accessible to researchers without programming expertise and enables them to efficiently analyze, present and publish data.

Fig. 1.


An overview of the functional modules of Genekitr. Genekitr can perform gene information retrieval, identifier (ID) conversion, functional enrichment analysis and online publication-ready plotting. It can plot results of over-representation analysis, gene set enrichment analysis, Venn diagrams, and differentially expressed genes (DEGs), with 21 graph types in total. The graphs can be customized and exported as files such as Encapsulated PostScript (EPS), Enhanced Metafiles (EMF), Joint Photographic Experts Group (JPEG), Portable Network Graphics (PNG), Tag Image File Format (TIFF), and editable Microsoft PowerPoint (PPT). Both a standalone R package and an online web server are available to users.

GeneInfo module

The GeneInfo module allows users to batch-retrieve up to 23 attributes for genes of 317 organisms, including gene symbol, alias, location, biotype, transcript counts, and links to download its sequence and visualize genes in the University of California Santa Cruz (UCSC) genome browser [44]. Importantly, it can batch-retrieve functional summaries of gene products from RefSeq [45]. To help users explore gene information interactively, hyperlinks are provided for databases such as Entrez, Ensembl, HGNC, Online Mendelian Inheritance in Man (OMIM) [46], MGI and International Mouse Phenotyping Consortium (IMPC) [47], which direct users to the official website (Table 1). All retrieved gene information can be downloaded as a Microsoft Excel file, a convenient feature that allows users to analyze the data further. Overall, the GeneInfo module is a valuable tool for exploring gene information and interpreting the potential significance of a list of genes in various biological processes.

Table 1.

Overview of the hyperlinked data sources

| Attribute | Source | Website |
|---|---|---|
| EntrezID | Entrez gene | https://www.ncbi.nlm.nih.gov/gene |
| Ensembl | Ensembl | http://www.ensembl.org/id |
| UCSC (human and mouse) | UCSC genome browser | https://genome.ucsc.edu/cgi-bin/hgTracks |
| Sequence (human and mouse) | UCSC sequence | http://genome.ucsc.edu/cgi-bin/das/dsn |
| Mirbase_ID | MicroRNA database | https://www.mirbase.org |
| HGNC_ID | HUGO Gene Nomenclature Committee | https://www.genenames.org |
| OMIM | Online Mendelian Inheritance in Man | https://www.omim.org |
| MGI_ID | Mouse genome informatics | http://www.informatics.jax.org |
| IMPC_ID | International Mouse Phenotyping Consortium | https://www.mousephenotype.org |

IDConvert module

The IDConvert module in Genekitr converts IDs among gene symbols/aliases, Entrez, Ensembl and UniProt IDs. The conversion results come with hyperlinks that give users access to additional information. Notably, the module can handle input that mixes gene symbols and aliases. To assess Genekitr's ability to resolve outdated or unofficial gene symbols, aliases, and identifiers, a gene list related to Shh inhibitors and HH/GLI signaling modulation from a recent publication was analyzed [48]. The gene symbols or aliases were converted to Entrez IDs using Genekitr and five other publicly available tools: DAVID, bioDBnet [49], g:Convert, clusterProfiler [50], and biomaRt. Genekitr was the only tool that returned Entrez IDs for all 12 queried terms. Four of the five other tools returned Entrez IDs for only 6 of the 12 terms, while bioDBnet recognized gene aliases and returned 10 of the 12 terms but could not resolve names containing special characters such as α and κ (Table 2). Genekitr can also provide unique results by adhering to the "one-to-many mapping rules". For instance, the human alias PD1 (programmed cell death protein 1) matches three symbols: PDCD1, SNCA, and SPATA2. By default, all matching records are returned, but when the "unique" option is selected, only "PDCD1" is returned (Fig. 2). In conclusion, the IDConvert module in Genekitr offers a robust approach to ID conversion by enabling batch queries, handling mixed input of gene symbols and aliases, and providing comprehensive results.

Table 2.

Comparison of gene name converting efficiency*

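| Searched Terms | Genekitr | DAVID | bioDBnet | g:Convert | clusterProfiler | biomaRt |
|---|---|---|---|---|---|---|
| CCR2 | 729230 | 729230 | 729230 | 729230 | 729230 | 729230 |
| FOXP3 | 50943 | 50943 | 50943 | 50943 | 50943 | 50943 |
| CCL2 | 6347 | 6347 | 6347 | 6347 | 6347 | 6347 |
| CCL3 | 6348 | 6348 | 6348 | 6348 | 6348 | 6348 |
| IL-6 | 3569 | | 3569 | | | |
| IL10 | 3586 | 3586 | 3586 | 3586 | 3586 | 3586 |
| TNF-α | 7124 | | | | | |
| COX-2 | 5743 | | 5743 | | | |
| STAT3 | 6774 | 6774 | 6774 | 6774 | 6774 | 6774 |
| NF-κB | 4790 | | | | | |
| PD1 | 5133 | | 5133 | | | |
| PD-L1 | 29126 | | 29126 | | | |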

*The table displays Entrez IDs as the output, while the input consists of a mixture of gene symbols and aliases that were converted using the indicated tools

Fig. 2.


The usage of the gene identifier conversion module in Genekitr. a The default behavior of the IDConvert module, which returns all records, including all matches and potential duplicates. b The behavior of the IDConvert module when the "unique" button is selected: the module returns only one-to-one mapping results

GeneEnrich module

The GeneEnrich module in Genekitr can perform two types of enrichment analysis: ORA and GSEA. ORA assumes genes operate independently and only considers DEGs based on p-value and fold change; it compares the gene set to a background set and calculates a p-value and fold enrichment to determine significance. GSEA calculates enrichment scores from raw expression levels and detects subtle associations using permutation methods.
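To make the notion of an enrichment score concrete, the sketch below computes the classic weighted running-sum statistic of the original GSEA method on a ranked gene list. It is a simplified illustration, not the fgsea algorithm that Genekitr calls and not Genekitr's code; the function name `enrichment_score`, the `weight` parameter and the toy gene names are assumptions, and the permutation step that yields a p-value is omitted.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked: iterable of (gene, score) pairs, e.g. log fold changes.
    gene_set: set of gene names belonging to the gene set.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    hits = [gene in gene_set for gene, _ in ordered]
    n_miss = len(ordered) - sum(hits)
    # Normalizer: total weighted score of the genes that are in the set.
    norm = sum(abs(s) ** weight for (_, s), h in zip(ordered, hits) if h)
    if n_miss == 0 or norm == 0:
        raise ValueError("gene set must hit some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for (_, score), is_hit in zip(ordered, hits):
        if is_hit:
            running += abs(score) ** weight / norm  # step up on a hit
        else:
            running -= 1.0 / n_miss                 # step down on a miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Toy ranking with invented gene names and scores.
ranking = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0), ("F", -3.0)]
print(enrichment_score(ranking, {"A", "B"}))
print(enrichment_score(ranking, {"E", "F"}))
```

A set concentrated at the top of the ranking gives a positive score and a set concentrated at the bottom gives a negative one, which is the sense in which a gene set "accumulates" at either end of the ordered list.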

Genekitr's GeneEnrich module incorporates GOSemSim, a GO simplification method, to reduce term redundancy and facilitate more explicit GO enrichment analysis. GO compiles terms of BP, MF and CC as directed acyclic graphs, resulting in a large number of gene sets. However, the parent-child relationship redundancy in the resulting set of GO terms confounds interpretation. To illustrate the effect of simplifying GO terms, we utilized a built-in example generated from differential expression analysis of GSE42872 [51]. GO CC analysis was performed with the option "Simplify GO terms" on or off. All other parameters in the GeneEnrich module are set as default. Both results (see Additional file 2: Table S1, S2) were visualized by the "term network" with "circle" layout in the Plot module (Fig. 3). With the redundancy reduced, researchers could explore GO enrichment analysis more explicitly.
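One common way to use pairwise semantic similarity for redundancy reduction is a greedy filter: terms are visited from most to least significant, and a term is dropped when it is too similar to a term that has already been kept. The Python sketch below illustrates that idea only. Whether Genekitr follows exactly this procedure is an assumption, as are the function name `simplify_terms`, the 0.7 cutoff, and the toy term names and similarity values; in practice the similarities would come from a measure such as those provided by GOSemSim.

```python
def simplify_terms(terms, similarity, cutoff=0.7):
    """Greedy redundancy filter over enriched terms.

    terms: list of (term_id, adjusted_p_value) pairs.
    similarity: dict mapping frozenset({term_a, term_b}) to a value in [0, 1].
    cutoff: hypothetical threshold above which two terms count as redundant.
    """
    kept = []
    # Visit the most significant terms first so that they are retained.
    for term, _ in sorted(terms, key=lambda t: t[1]):
        redundant = any(
            similarity.get(frozenset((term, other)), 0.0) > cutoff
            for other in kept
        )
        if not redundant:
            kept.append(term)
    return kept


# Toy data: term_B is nearly identical to the more significant term_A.
terms = [("term_A", 0.001), ("term_B", 0.004), ("term_C", 0.020)]
similarity = {
    frozenset(("term_A", "term_B")): 0.9,
    frozenset(("term_A", "term_C")): 0.2,
    frozenset(("term_B", "term_C")): 0.3,
}
print(simplify_terms(terms, similarity))
```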

Fig. 3.


Term network representation of the Gene Ontology (GO) cellular component (CC) enrichment analysis. a A circle layout plot that displays all the enriched terms with redundancy. b A version of the same diagram with the GO terms simplified

Compared to other web tools for gene enrichment analysis [6, 7, 9, 10, 12, 43, 49, 52–57], Genekitr stands out with several advantages (Table 3). It is the first web server to integrate the GOSemSim method; it integrates more resources, with 315 libraries covering up to 8213 species; and it has a simple and intuitive interface with a demo file to help users understand the input format. The analysis results can be downloaded in Excel format and include comprehensive information such as the "ID" and "Description" of the gene set, "GeneRatio", "BgRatio", "FoldEnrich" and "RichFactor" for the ORA method, and "setSize", "normalized enrichmentScore", "geneID" and "geneID_symbol" for the GSEA method (see the online help page of Genekitr for details). Notably, the gene ID/symbol information can also be used as input for the GeneInfo module of Genekitr to batch-retrieve gene summary information, giving faster access to background knowledge. In addition, Genekitr offers a large number of plotting options for visualization (see below), so researchers can generate publication-ready plots without leaving Genekitr.
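The ORA output columns named above are ratios of the four counts used in the hypergeometric model (k, n, M and N). The sketch below uses the definitions that are commonly attached to these column names; that Genekitr computes them in exactly this way is an assumption, and the online help page remains the authoritative source. The function name `ora_ratios` is hypothetical.

```python
def ora_ratios(k, n, M, N):
    """Commonly used ORA summary ratios.

    k: list genes in the gene set, n: list size,
    M: background genes in the gene set, N: background size.
    """
    gene_ratio = k / n  # share of the input list that hits the gene set
    bg_ratio = M / N    # share of the background annotated to the gene set
    return {
        "GeneRatio": gene_ratio,
        "BgRatio": bg_ratio,
        "FoldEnrich": gene_ratio / bg_ratio,  # observed vs. expected share
        "RichFactor": k / M,                  # share of the gene set recovered
    }


# Toy counts for illustration only.
print(ora_ratios(k=2, n=3, M=4, N=10))
```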

Table 3.

Benchmark of Genekitr and existing enrichment analysis webservers

| Tool | Method | Gene set libraries | No. of species | No. of plotting types | Customizable plotting | Image type | GO simplification | Availability |
|---|---|---|---|---|---|---|---|---|
| WebGestalt [9] | ORA, GSEA, PT^a | GO, KEGG, + 20 more | 12 | 5 | No | PNG, SVG | No | webserver, R, API |
| KOBAS [52] | ORA, CGPS^b | GO, KEGG, + 3 more | GO: 71; BioCyc: 18; KEGG: 5944; Reactome: 14; PANTHER: 41 | 3 | No | PNG | No | webserver, Python program |
| g:Profiler [6] | ORA | GO, KEGG, + 7 more | GO: 821; MIRNA: 16; KEGG: 255; WikiPathways: 13; TF: 9; HP: 414; CORUM: 3 | 2 | No | PNG | No | webserver, R, API |
| DAVID [7] | ORA | GO, KEGG, + 77 more | > 65,000 | | | | No | webserver, API |
| Gorilla [53] | ORA | GO | 8 | 1 | | PNG | No | webserver |
| ToppGene [56] | ORA | GO, KEGG, + 135 more | 2 | 1 | No | HTML | No | webserver, API |
| bioDBnet [49] | ORA | GO, KEGG, + 4 more | 6 | | | | No | webserver, API |
| agriGO [57] | ORA, GSEA | GO | 404 | 2 | Yes | PNG, JPEG, GIF, SVG, PDF | No | webserver |
| Revigo [54] | ORA | GO | 25 | | | | No | webserver |
| PANTHER [43] | ORA, GSEA | GO, KEGG | 143 | 5 | No | SVG | No | webserver |
| Enrichr [10] | ORA | GO, KEGG, + 274 more | 6 | 4 | No | SVG, PNG, JPG | No | webserver |
| FunSet [55] | ORA | GO | 11 | 1 | No | SVG | No | webserver, API |
| ShinyGO [12] | ORA | GO | 315 | 6 | Yes | PDF, PNG, SVG | No | webserver |
| Genekitr | ORA, GSEA | GO, KEGG, + 313 more | GO: 143; MSigDB: 20; KEGG: 8213; Enrichr: 5; Reactome: 11; WikiPathways: 16; MeSH: 71; Disease specific^c: 1 | ORA: 13; GSEA: 5 | Yes | EPS, EMF, PPT, PNG, TIFF, JPEG | Yes | webserver, R |

^a PT: Pathway topology

^b CGPS: Combined Gene set analysis incorporating Prioritization and Sensitivity

^c The disease-specific gene sets for human include DisGeNET, DO, NCG and the COVID-19 Gene Set Library

Plot module

The plot module offers 21 plot options for tailored data visualization, including 13 options for ORA, 5 for GSEA, 2 for group interactions and 1 for the differentially expressed genes (DEGs) volcano plot (Fig. 4). It has three panels: the upload panel, the parameter panel, and the plot panel.

Fig. 4.


Various types of plots offered by the plot module. a Thirteen plotting types for Over-Representation Analysis (ORA) of gene enrichment, including (i) dot plot, (ii) bar chart, (iii) lollipop plot, (iv) bubble graph, (v) gene-pathway heatmap, (vi) gene-pathway chord graph, (vii) UpsetR interaction plot, (viii) enriched terms network, (ix) enriched term treemap, (x) wordcloud chart, (xi) enriched terms heatmap, (xii) enriched terms tangram and (xiii) WEGO plot. b Five plotting types for Gene Set Enrichment Analysis (GSEA), containing (i) classic GSEA plot, (ii) enriched terms volcano plot, (iii) ridge plot, (iv) two-side bar graph and (v) table chart. c The Venn plot, which can be used to analyze group interactions. d The volcano plot, which can be used in differential gene expression analysis

Upload panel. Data can be uploaded in either Microsoft Excel spreadsheet (.xlsx), Tab Separated Value (.tsv), or Comma Separated Value (.csv) format. To help clarify the process, a demo file is provided for closer examination, serving as a guide for the required data format and allowing for testing purposes. By clicking the "Upload" button, the data file will be loaded along with preset parameters.

Parameter panel. This panel consists of two sections for setting basic and advanced parameters. In the Basic Parameters section, users can select plot types and choose labels for axes, legends, and more. The Advanced Parameters section allows users to customize the plot's color, text size, border thickness, and dot size. It's important to note that the basic parameters vary based on the chosen plot. A key feature in the Basic Parameters section is a drop-down menu, which lists all gene or pathway names from the input file. By selecting one or multiple items, users can directly label their data points on the plot, facilitating the visualization and presentation of their results.

Plot panel. By clicking the plot button, the generated plot will be displayed in the Plot Panel with a default resolution of 300 dots per inch (DPI). Users can resize the figure by adjusting the slider bars for width and height. Finally, the figure can be exported in a variety of formats, including Encapsulated PostScript (EPS), Enhanced Metafiles (EMF), editable Microsoft PowerPoint (PPT), Joint Photographic Experts Group (JPEG), Portable Network Graphics (PNG), and Tag Image File Format (TIFF), to satisfy a range of requirements.

Visualization is crucial for communicating and presenting analysis results effectively. The plot module offers a range of customization options, including the ability to label data points directly on the plot and to export figures in different sizes and formats, such as EPS, EMF, and editable PPT. The exported figures can be further edited in the corresponding programs to meet publishers' requirements. Taken together, Genekitr offers a comprehensive solution for visualizing, presenting, and publishing analysis results.

Conclusions

In summary, Genekitr is a comprehensive toolkit for gene information retrieval, identifier conversion, functional enrichment analysis and plotting. The features of Genekitr include: (i) the provision of both a web server and a standalone R package, making it accessible to a wide range of users; (ii) batch retrieval of gene summaries and other attributes from up-to-date backend gene databases covering more species; (iii) the ability to handle input that mixes gene symbols and aliases, to resolve outdated gene aliases and to provide unique results by adhering to the "one-to-many mapping rules" during ID conversion; (iv) support for ORA and GSEA gene enrichment analyses through a simple interface, including a GO simplification method, with results that can serve as input for batch retrieval of gene summaries for further analysis; (v) the ability to generate more than 20 types of publication-ready plots that are customizable and compatible with other programs. These features make Genekitr particularly useful for wet-lab biologists with limited bioinformatics expertise who need to conduct basic bioinformatics analysis and generate publication-ready plots.

Availability and requirements

Project name: Genekitr

Project home page: https://genekitr.org

Operating system(s): Windows, Linux and Mac (web server and R package)

Programming language: R

Other requirements: R 3.6 or higher

License: GPL-3

Any restrictions to use by non-academics: none.

Supplementary Information

12859_2023_5342_MOESM1_ESM.pdf (29.3KB, pdf)

Additional file 1. Flowchart of one-to-many mapping rules for gene information retrieval.

12859_2023_5342_MOESM2_ESM.xlsx (16.2KB, xlsx)

Additional file 2. 1) GO CC enrichment analysis result without simplification method. 2) GO CC enrichment analysis result after simplification.

Acknowledgements

We appreciate the valuable feedback provided by the members of Gang Li's laboratory.

Abbreviations

BP: Biological process
CC: Cellular component
CRAN: The Comprehensive R Archive Network
DAVID: The Database for Annotation, Visualization and Integrated Discovery
DEGs: Differentially expressed genes
DPI: Dots per inch
DO: Disease Ontology
EMF: Enhanced Metafiles
EPS: Encapsulated PostScript
GO: Gene Ontology
GSEA: Gene set enrichment analysis
HGNC: HUGO Gene Nomenclature Committee
IMPC: International Mouse Phenotyping Consortium
JPEG: Joint Photographic Experts Group
KEGG: Kyoto Encyclopedia of Genes and Genomes
MeSH: Medical Subject Headings
MF: Molecular function
MGI: The Mouse Genome Informatics Database
MSigDB: Molecular Signatures Database
NCBI: National Center for Biotechnology Information
NCG: Network of Cancer Genes
OMIM: Online Mendelian Inheritance in Man
ORA: Over-representation analysis
PNG: Portable Network Graphics
PPT: Microsoft PowerPoint
TIFF: Tag Image File Format
UCSC: University of California Santa Cruz
WEGO: Web Gene Ontology Annotation Plot

Author contributions

YL and GL conceived the project; YL developed the R Package and Web Server; YL and GL wrote the manuscript. Both authors reviewed and approved the final version of the manuscript.

Funding

Funding for this work was provided by the Science and Technology Development Fund of Macau (0107/2019/A2 and 0073/2022/A2), the Research Services and Knowledge Transfer Office of the University of Macau (MYRG2018-00022-FHS), and the Ministry of Education Frontiers Science Center for Precision Oncology.

Availability of data and materials

The web server is available at https://genekitr.org. The source code for the web server and the standalone R package is available at https://github.com/GangLiLab/genekitr.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors have no competing interests as defined by BMC, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/gene. Accessed 14 Feb 2023.
2. The Mouse Genome Informatics Database. https://www.informatics.jax.org/batch. Accessed 14 Feb 2023.
3. Bult CJ, Blake JA, Smith CL, Kadin JA, Richardson JE, The Mouse Genome Database Group, et al. Mouse genome database (MGD) 2019. Nucleic Acids Res. 2019;47:D801–6. doi: 10.1093/nar/gky1056.
4. HUGO Gene Nomenclature Committee. https://www.genenames.org/tools/multi-symbol-checker. Accessed 14 Feb 2023.
5. Seal RL, Braschi B, Gray K, Jones TEM, Tweedie S, Haim-Vilmovsky L, et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 2023;51:D1003–9. doi: 10.1093/nar/gkac888.
6. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019;47:W191–W198. doi: 10.1093/nar/gkz369.
7. Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50:W216–W221.
8. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
9. Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005;33:W741–8. doi: 10.1093/nar/gki475.
10. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44:W90–W97. doi: 10.1093/nar/gkw377.
11. Ye J, Zhang Y, Cui H, Liu J, Wu Y, Cheng Y, et al. WEGO 2.0: a web tool for analyzing and plotting GO annotations, 2018 update. Nucleic Acids Res. 2018;46:W71–W75. doi: 10.1093/nar/gky400.
12. Ge SX, Jung D, Yao R. ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36:2628–2629. doi: 10.1093/bioinformatics/btz931.
13. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50:D988–D995. doi: 10.1093/nar/gkab1049.
14. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80.
15. The UniProt Consortium, Bateman A, Martin M-J, Orchard S, Magrane M, Ahmad S, et al. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–31. doi: 10.1093/nar/gkac1052.
16. The UniProt Consortium. UniProt ID mapping knowledgebase. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism (2022). Accessed 31 Oct 2022.
17. Zhang X, Lan Y, Xu J, Quan F, Zhao E, Deng C, et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019;47:D721–D728. doi: 10.1093/nar/gky900.
18. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21:3439–3440. doi: 10.1093/bioinformatics/bti525.
19. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
20. Baumann N. How to use the medical subject headings (MeSH). Int J Clin Pract. 2016;70:171–174. doi: 10.1111/ijcp.12767.
21. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–40. doi: 10.1093/bioinformatics/btr260.
22. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K, et al. WikiPathways: connecting communities. Nucleic Acids Res. 2021;49:D613–D621. doi: 10.1093/nar/gkaa1024.
23. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50:D687–D692. doi: 10.1093/nar/gkab1028.
24. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, et al. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2019;48:D845–D855.
25. Schriml LM, Munro JB, Schor M, Olley D, McCracken C, Felix V, et al. The human disease ontology 2022 update. Nucleic Acids Res. 2022;50:D1255–D1261. doi: 10.1093/nar/gkab1063.
26. Dressler L, Bortolomeazzi M, Keddar MR, Misetic H, Sartini G, Acha-Sagredo A, et al. Comparative assessment of genes driving cancer and somatic evolution in non-cancer tissues: an update of the Network of Cancer Genes (NCG) resource. Genome Biol. 2022;23:35. doi: 10.1186/s13059-022-02607-z.
27. Kuleshov MV, Clarke DJB, Kropiwnicki E, Jagodnik KM, Bartal A, Evangelista JE, et al. The COVID-19 gene and drug set library. Preprint. In review; 2020.
28. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, et al. GO::TermFinder: open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20:3710–3715. doi: 10.1093/bioinformatics/bth456.
  • 29.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Korotkevich G, Sukhov V, Sergushichev A. Fast gene set enrichment analysis. bioRxiv. 2019; doi: 10.1101/060012.
  • 31.Carlson M. GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.8.2. 2019.
  • 32.Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26:976–978. doi: 10.1093/bioinformatics/btq064. [DOI] [PubMed] [Google Scholar]
  • 33.Wickham H. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.3.6. 2016.
  • 34.Kolde R. pheatmap: Pretty Heatmaps. R package version 1.0.12. 2019.
  • 35.Chen H, Boutros PC. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinform. 2011;12:35. doi: 10.1186/1471-2105-12-35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Slowikowski K. ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.1. 2023.
  • 37.Krassowski M. ComplexUpset: Create Complex UpSet Plots Using 'ggplot2' Components. R package version 1.3.3. 2021.
  • 38.Pedersen T. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package version 2.0.5. 2021.
  • 39.Nepusz T. igraph: Network Analysis and Visualization. R package version 1.3.5. 2022..
  • 40.Chang W. shiny: Web Application Framework for R. R package version 1.7.3. 2022.
  • 41.Liu Y. genekitr: Gene Analysis Toolkit. R package version 1.1.0. 2023.
  • 42.The Gene Ontology Consortium The gene ontology resource: 20 years and still going strong. Nucleic Acids Res. 2019;47:D330–D338. doi: 10.1093/nar/gky1055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Mi H, Ebert D, Muruganujan A, Mills C, Albou L-P, Mushayamaha T, et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res. 2021;49:D394–403. doi: 10.1093/nar/gkaa1106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O’Neill K, et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018;46:D851–D860. doi: 10.1093/nar/gkx1068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 2019;47:D1038–43. doi: 10.1093/nar/gky1151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Groza T, Gomez FL, Mashhadi HH, Muñoz-Fuentes V, Gunes O, Wilson R, et al. The International Mouse Phenotyping Consortium: comprehensive knockout phenotyping underpinning the study of human disease. Nucleic Acids Res. 2023;51:D1038–D1045. doi: 10.1093/nar/gkac972. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Palla M, Scarpato L, Di Trolio R, Ascierto PA. Sonic hedgehog pathway for the treatment of inflammatory diseases: implications and opportunities for future research. J Immunother Cancer. 2022;10:e004397. doi: 10.1136/jitc-2021-004397. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Mudunuri U, Che A, Yi M, Stephens RM. bioDBnet: the biological database network. Bioinformatics. 2009;25:555–556. doi: 10.1093/bioinformatics/btn654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS J Integr Biol. 2012;16:284–7. doi: 10.1089/omi.2011.0118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Parmenter TJ, Kleinschmidt M, Kinross KM, Bond ST, Li J, Kaadige MR, et al. Response of BRAF-mutant melanoma to BRAF inhibition is mediated by a network of transcriptional regulators of glycolysis. Cancer Discov. 2014;4:423–433. doi: 10.1158/2159-8290.CD-13-0440. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Bu D, Luo H, Huo P, Wang Z, Zhang S, He Z, et al. KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic Acids Res. 2021;49:W317–W325. doi: 10.1093/nar/gkab447. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinform. 2009;10:48. doi: 10.1186/1471-2105-10-48. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;6:e21800. doi: 10.1371/journal.pone.0021800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Hale ML, Thapa I, Ghersi D. FunSet: an open-source software and web server for performing and displaying Gene Ontology enrichment analysis. BMC Bioinform. 2019;20:359. doi: 10.1186/s12859-019-2960-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009;37:W305–11. doi: 10.1093/nar/gkp427. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Du Z, Zhou X, Ling Y, Zhang Z, Su Z. agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Res. 2010;38:W64–70. doi: 10.1093/nar/gkq310. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Additional file 1 (12859_2023_5342_MOESM1_ESM.pdf, 29.3 KB). Flowchart of one-to-many mapping rules for gene information retrieval.

Additional file 2 (12859_2023_5342_MOESM2_ESM.xlsx, 16.2 KB). (1) GO CC enrichment analysis results without simplification. (2) GO CC enrichment analysis results after simplification.

Data Availability Statement

The web server is available at https://genekitr.org. The source code for the web server and the standalone R package is available at https://github.com/GangLiLab/genekitr.


Articles from BMC Bioinformatics are provided here courtesy of BMC
