GeneAnalytics: An Integrative Gene Set Analysis Tool for Next Generation Sequencing, RNAseq and Microarray Data

Shani Ben-Ari Fuchs; Iris Lieder; Gil Stelzer; Yaron Mazor; Ella Buzhor; Sergey Kaplan; Yoel Bogoch; Inbar Plaschkes; Alina Shitrit; Noa Rappaport; Asher Kohn; Ron Edgar; Liraz Shenhav; Marilyn Safran; Doron Lancet; Yaron Guan-Golan; David Warshawsky; Ronit Shtrichman

doi:10.1089/omi.2015.0168

2016 Mar 1;20(3):139–151. doi: 10.1089/omi.2015.0168


PMCID: PMC4799705 PMID: 26983021

Abstract

Postgenomics data are produced in large volumes by life sciences and clinical applications of novel omics diagnostics and therapeutics for precision medicine. To move from “data-to-knowledge-to-innovation,” a crucial missing step in the current era is, however, our limited understanding of biological and clinical contexts associated with data. Prominent among the emerging remedies to this challenge are the gene set enrichment tools. This study reports on GeneAnalytics™ (geneanalytics.genecards.org), a comprehensive and easy-to-apply gene set analysis tool for rapid contextualization of expression patterns and functional signatures embedded in the postgenomics Big Data domains, such as Next Generation Sequencing (NGS), RNAseq, and microarray experiments. GeneAnalytics' differentiating features include in-depth evidence-based scoring algorithms, an intuitive user interface, and proprietary unified data. GeneAnalytics employs LifeMap Sciences' GeneCards suite, including GeneCards® (the human gene database), MalaCards (the human diseases database), and PathCards (the biological pathways database). Expression-based analysis in GeneAnalytics relies on LifeMap Discovery® (the embryonic development and stem cells database), which includes manually curated expression data for normal and diseased tissues, enabling an advanced matching algorithm for gene–tissue association. This assists in evaluating differentiation protocols and discovering biomarkers for tissues and cells. Results are directly linked to gene, disease, or cell “cards” in the GeneCards suite. Future developments aim to enhance the GeneAnalytics algorithm as well as visualizations, employing varied graphical display items. Such attributes make GeneAnalytics a broadly applicable postgenomics data analysis and interpretation tool for translation of data to knowledge-based innovation in various Big Data fields such as precision medicine, ecogenomics, nutrigenomics, pharmacogenomics, vaccinomics, and others yet to emerge on the postgenomics horizon.

Introduction

High throughput genomics technologies, such as next generation DNA/RNA sequencing or microarray analyses, are frequently used during biomedical research, as well as in diagnostic and therapeutic product development. These generate large quantities of Big Data that require advanced bioinformatics analysis and interpretation. The key step towards translating these results into meaningful scientific discoveries is deduction of biological and clinical contexts from the generated data. In this realm, several methods and tools have been developed to interpret large sets of genes or proteins, using information available in biological databases. Prominent among these are gene set enrichment tools.

In conventional examples, the Gene Ontology (GO) database is used for the functional study of large scale genomics or transcriptomics data. Multiple applications such as GeneCodis, GOEAST, GOrilla, and Blast2GO (Conesa et al., 2005; Eden et al., 2009; Nogales-Cadenas et al., 2009; Zheng and Wang, 2008) can analyze and visualize statistical enrichment of GO terms in a given gene set. Other tools rely on popular data sources such as Kyoto Encyclopedia of Genes and Genomes (KEGG), TransPath, Online Mendelian Inheritance in Man (OMIM), and GeneCards to identify enriched pathways, diseases, and phenotypes (Backes et al., 2007; Huang da et al., 2009b; Safran et al., 2010; Sherman et al., 2007; Stelzer et al., 2009; Zhang et al., 2005). These analysis tools differ in several respects, including statistical methodology, supported organisms and gene identifiers, coverage of functional categories, source databases, and user interface. The common result is the identification of known functional biological descriptors that are significantly enriched within the experimentally-derived gene list.

Enrichment of biological descriptors for a given set of genes introduces three immediate challenges: The first is determining the statistical significance of enrichment of each descriptor. There are several approaches to calculating the statistics for a descriptor shared among genes, such as Gene Set Enrichment Analysis [GSEA (Maezawa and Yoshimura, 1991)] and Fisher's exact test [Database for Annotation, Visualization and Integrated Discovery—DAVID (Dennis et al., 2003)]. Some tools, such as the DAVID functional annotation tool, initially cluster the descriptors belonging to similar categories, and then present a score for an enriched group of terms.

The second challenge is judicious use of multiple data sources. It is a nontrivial task to integrate and model information derived from various origins. For example, disease information could be derived from data sources such as OMIM (Hamosh et al., 2005), SwissProt/UniProt (Wu et al., 2006), and Orphanet (Maiella et al., 2013), and pathway information from Reactome (Jupe et al., 2014; Matthews et al., 2009) and/or KEGG (Kanehisa et al., 2010). Therefore, many analysis tools present separate enrichment results for each data source, while others perform consolidated analysis on source types.

A third challenge is optimal data presentation. Tools such as DAVID group enriched terms by biological categories in an attempt to provide a general sense of the biological processes involved in the experimental results. Other tools, such as MSigDB (GSEA) (Liberzon et al., 2011) and GeneDecks Set Distiller (Stelzer et al., 2009), interlace biological descriptors of various kinds, based on their statistical enrichment strength, thus emphasizing the individual significance of each in the context of the general enriched descriptor list. It would be optimal to give a bird's-eye view of grouped descriptors for a given set of genes and also to display the descriptors in detail.

Multiple data sources are generally employed for both broad and in-depth depictions of enrichment. A related challenge is to develop a straightforward and easy-to-use application, with intuitive output results, rendering the tool accessible to inexperienced users, with little or no bioinformatics background.

We present GeneAnalytics™ (geneanalytics.genecards.org), designed to distill enriched descriptors for a given gene set, while optimally addressing the aforementioned challenges. It is empowered by the GeneCards Suite, embodied as LifeMap's integrated knowledgebase, which automatically mines data from more than 120 data sources. GeneAnalytics' broad descriptor categories enable users to focus on areas of interest, each rich with annotation and supporting evidence. The GeneAnalytics analyses provide gene associations with tissue and cell types from LifeMap Discovery (LMD, discovery.lifemapsc.com), diseases from MalaCards (www.malacards.org), and GO terms, pathways, phenotypes, and drugs/compounds from GeneCards (www.genecards.org) (Fig. 1). Navigation within such comprehensive information, as well as further scrutiny, is facilitated by GeneAnalytics categorization and filtration tools.

FIG. 1. — GeneAnalytics structure. GeneAnalytics is powered by GeneCards, LifeMap Discovery, MalaCards, and PathCards, which integrate >100 data sources. These databases contain annotated gene lists for tissues and cells, diseases, pathways, compounds, and GO terms. GeneAnalytics compares the user's gene set to these compendia in search of the best matches. The output contains the best matched gene lists, scored and subdivided into their biological categories such as diseases or pathways. In the figure, each output category and its respective data source are marked with the same color.

Methods

GeneAnalytics input

During the input stage (Fig. 2A), the relevant species, human or mouse, is selected. Then a gene list is typed, aided by an autocomplete feature to define the correct official gene symbol. Alternatively, a gene list may be pasted or uploaded as a text file. In the latter case, the gene list automatically undergoes a gene symbol identification (“symbolization”) process, yielding “ready for analysis” and “unidentified genes” lists (Fig. 2B, C). Each gene in the “ready for analysis” list is shown with its full name and all available aliases/synonyms, enabling review and approval of the input genes before analysis.

For the “unidentified genes” list, GeneAnalytics assists in manual symbol identification by directly linking to the gene search in GeneCards. To provide all relevant results for each gene symbol, GeneAnalytics unifies orthologs and paralogs into ‘ortholog groups’ based on the information available in HomoloGene (www.ncbi.nlm.nih.gov/homologene), with minor adaptations (see Supplementary S1 Appendix; supplementary material is available online at www.liebertpub.com/omi).

Upon completion of the input stage, GeneAnalytics analysis produces results that are divided into the following categories: Tissues and Cells, Diseases, Pathways, GO terms, Phenotypes, and Compounds. Genes are associated with these categories either by their expression (“expression-based analysis”) or by their function (“function-based analysis”) (Table 1). All sections have a “drill down” capacity for performing subqueries, allowing users to focus only on genes from their original gene set, filtered by those that match the selected entity.

Table 1.

GeneAnalytics Data Sources and Statistics

Results category: analysis based on	Results category: entity type	Data sources	Total number of entities with associated genes	Total number of genes related to entities
Expression	Normal tissues and cells	LifeMap Discovery	3,346	17,512
Expression	Diseased tissues and cells*	LifeMap Discovery (via MalaCards)	96	6,963
Function	Disease	MalaCards	12,085	22,280
Function	Pathways	PathCards	1,073 SuperPaths (unification of 3,215 pathways)	11,479
Function	GO: biological process	GeneCards	9,436	14,907
Function	GO: molecular function	GeneCards	3,509	15,624
Function	Compounds	GeneCards	19,961 (unification of 44,942 compounds)	8,434


Data sources and statistics for each result category, based on the type of analysis.

* The expression data in diseased tissues and cells are available in the disease category.

Tissues and cells

All gene expression data, including those that are manually collected, annotated, and integrated into LMD, are used to rank the GeneAnalytics matching results.

The gene expression data available in LMD are obtained from three types of sources:

a) Scientific peer-reviewed manuscripts and books (Edgar et al., 2013).
b) High Throughput (HT) gene expression comparisons available in the Gene Expression Omnibus (GEO) (Edgar et al., 2002). These are subject to various standardization and analysis methods. For this, we developed and fine-tuned an algorithm for extracting differentially expressed genes from GEO matrix files (normalized data, detailed in “Differentially expressed genes identification algorithm” in Supplementary S2 Appendix). Applying a uniform algorithm to the gene data increased the comparability of the resulting differentially expressed gene lists. For experiments that do not have normalized data deposited in a public repository, the differentially expressed gene lists incorporated into the LMD database are derived from the relevant article.
c) Large Scale Data Sets (LSDS): those obtained from wide-scope experiments that encompass multiple samples and require suitable standardization and analysis methods. This refers to data sets obtained by in situ hybridization (ISH), immunostaining (IS), microarray, or RNA sequencing. These data, retrieved from big-data repositories such as Mouse Genome Informatics (MGI) (Smith et al., 2014), Eurexpress (Geffers et al., 2012), or BioGPS (Wu et al., 2013), are filtered and analyzed in-house or obtained in analyzed form from projects that developed unique large-scale analysis methods such as Homer or Barcode.

The complete list of data sources is provided on the LMD webpage (discovery.lifemapsc.com/gene-expression-signals#ht-gene-expression).

In LMD, each anatomical entity has a unique card that contains a list of associated expressed genes [see (Edgar et al., 2013) for further details]. Organ and tissue cards include lists of genes expressed in whole tissue samples (e.g., RNA extracted from tissue homogenates). Genes reported to be expressed in a specific cell type (in vivo or in vitro) or in an anatomical compartment are listed in the relevant cards, which contain extensive manually curated information from the literature.

The High Throughput gene expression comparisons are described within ‘experiment cards.’ The top differentially expressed genes derived from these comparisons are linked into the highest resolution entity card possible (organs/tissues, anatomical compartments, or cells). Each card details the comparisons used in the experiment, listing the test and control samples comprising each comparison and supplying additional information for the experiment. The top differentially expressed genes (calculated as described in “Differentially expressed genes identification algorithm” in the Supplementary S2 Appendix) as well as links to LifeMap entities (tissues, compartments, etc.) may be viewed in the comparison cards associated with an experiment card.

Similarly, the lists of differentially expressed genes derived from Large Scale Data Sets are linked into entity cards, unless such a card is not available (for example, when the entity does not exist for a given release), in which case they are presented in Large Scale Data Sets cards. Thus the Tissues and Cells results are labeled by the four types of LMD entities shown in Table 2, with relevant links for further investigation (Fig. 3C).

Table 2.

LMD Entities Used in GeneAnalytics Matching Analysis in Tissues & Cells Category

Entity type	Data Origin	Example	Notes
Organ; Tissue	High throughput gene expression comparisons; Large Scale Data Sets	Heart	These entities contain a list of genes that have been found to be expressed in whole-tissue samples.
Anatomical compartment	High throughput gene expression comparisons; Large Scale Data Sets	Renal collecting duct system	These entities describe specific temporospatial regions within an organ/tissue.
In vivo cell; In vitro cell (cultured stem, progenitor, and primary cell); Protocol-derived cell; Cell family	Data manually curated from the scientific literature; High throughput gene expression comparisons; Large Scale Data Sets	Inner cell mass cells (ICM); Trabecular meshwork-derived mesenchymal stem cells	
Large Scale Data Set sample cards	Large Scale Data Sets	GUDMAP: Ovary	These entities contain the gene list for each Large Scale Data Sets sample. These entities are only included in GeneAnalytics results if their gene list is not contained within any of the above entity types.


The entities available in the LMD database with gene expression information and an example for each.

The Tissues and Cells GeneAnalytics results contain useful filters that enable focus on specific subsets of the results (Fig. 3B). Each entity is classified into tissue(s) and/or system(s) in LMD, enabling results aggregation and filtration. This is done using higher anatomical hierarchy elements, tissues, and systems. For example, the in vivo cell Dopaminergic Progenitor Cells belongs to the anatomical compartment Substantia Nigra pars Compacta, which belongs to the tissue Brain, which is included in the system Nervous System.

The filtering into tissues or systems is associated with scores that reflect their matching quality to the query gene set (Fig. 3C, see next section). The Tissues and Cells results can also be used to filter In vivo/In vitro or Pre-natal/Post-natal entities (for further details, see “Filters” in Supplementary S2 Appendix). Further, GeneAnalytics allows user interaction for display of additional information. For example, for each entry in the Tissues and Cells table, we provide the type of entity, the expression type (expressed, selective marker, etc.), the number of genes matched to that entity (including the number of total genes expressed in the entity), and localization (within a popup).

When scores are computed after this aggregative tissue/system filtering, a gene that appears in more than one entity is represented only once at the tissue/system level and receives the maximal score attributed to it in any of its associated entities. Once all of the genes are assembled for the tissue/system, the score is computed in the same manner as for an individual entity (shown in the detailed entity section, on the right).
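The aggregation rule can be illustrated with a minimal Python sketch. The entity and tissue names come from the example above; the gene symbols, the numeric scores, and the function name `aggregate_gene_scores` are hypothetical and are not taken from LMD.

```python
def aggregate_gene_scores(entity_gene_scores, entity_to_tissue):
    """Collapse per-entity gene scores into one score per gene per tissue.

    A gene found in several entities of the same tissue is kept once,
    with the maximal score it has in any of those entities.
    """
    tissue_scores = {}
    for entity, gene_scores in entity_gene_scores.items():
        bucket = tissue_scores.setdefault(entity_to_tissue[entity], {})
        for gene, score in gene_scores.items():
            if gene not in bucket or score > bucket[gene]:
                bucket[gene] = score
    return tissue_scores


# Hypothetical gene scores for two entities that roll up to the tissue "Brain".
entity_gene_scores = {
    "Dopaminergic Progenitor Cells": {"TH": 0.9, "NR4A2": 0.6},
    "Substantia Nigra pars Compacta": {"TH": 0.4, "SLC6A3": 0.7},
}
entity_to_tissue = {
    "Dopaminergic Progenitor Cells": "Brain",
    "Substantia Nigra pars Compacta": "Brain",
}

brain = aggregate_gene_scores(entity_gene_scores, entity_to_tissue)["Brain"]
print(brain)
```

In this toy input, TH appears in both entities and is kept once with its larger score; the tissue-level matching score would then be computed from the merged gene list.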

The matching algorithm for this category aims to identify the anatomical entities most strongly associated with the query gene set. The algorithm is composed of two major stages:

a) Computation of a score for each gene associated with an entity. These pre-computed scores represent the importance of this gene in the specific entity as compared to its distribution in the entire entity landscape.
b) Computation of the matching score, which is the similarity score between the user's query gene set and the genes associated with each of the entities, taking into account the differences in the expression information, both quantitative and qualitative, available for each entity.

The above is based on the fact that each gene associated with an entity is assigned one or more of the following specificity annotations: specific, enriched, selective, expressed, abundant, and/or low confidence (Edgar et al., 2013). The annotations are derived from the literature and/or from bioinformatic calculations. The calculations consider the source from which the gene–entity association was established and the distribution of the gene expression in LMD. Criteria include how rare the gene is in the database, how specific it is to a certain cell type or tissue, and whether there is extensive evidence for the expression of the gene in the tissue.

In addition, the gene score considers the entity type in which the expression is observed. Genes listed in organ/tissue, anatomical compartment or cell cards are ranked higher than genes with the same specificity annotations, which are listed in Large Scale Data Sets entities that are not linked to any of the above (tissue, compartments, etc.). Supplementary information elaborating on the determination of the gene annotation and the given scores, with additional details, is summarized in the Supplementary S1 Table.

After defining the gene scores, the gene set of each entity and the query gene set can be viewed as gene expression vectors. The entity gene–set vector holds defined scores for each of its genes and zero for all other genes, while the query gene–set vector is a binary vector that holds the value 1 for each of the query's genes and 0 for all other existing genes. The affinity between the query gene set and each of the entities is measured by the scalar product of the two vectors (i.e., the sum of the scores of the entity genes matched to the query gene set). The choice of normalization factor and the details of the score levels are described in “Gene Scores, The matching score algorithm” in Supplementary S2 Appendix.
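As a rough illustration of the un-normalized affinity, the Python sketch below computes the scalar product both as a sum over matched genes and with explicit vectors. The gene scores are hypothetical, and the normalization factor, which is described only in the Supplementary S2 Appendix, is deliberately left out.

```python
def affinity(entity_scores, query_genes):
    """Scalar product of an entity score vector with a binary query vector.

    Because the query vector holds 1 for query genes and 0 elsewhere,
    the product reduces to the sum of scores of the matched genes.
    """
    query = set(query_genes)
    return sum(score for gene, score in entity_scores.items() if gene in query)


# Hypothetical pre-computed gene scores for one entity.
entity_scores = {"GATA4": 3.0, "NKX2-5": 2.0, "MYH6": 1.5}
query_genes = ["GATA4", "MYH6", "ALB"]

raw = affinity(entity_scores, query_genes)

# The same value with explicit vectors over the union of all genes.
universe = sorted(set(entity_scores) | set(query_genes))
entity_vec = [entity_scores.get(g, 0.0) for g in universe]
query_vec = [1.0 if g in set(query_genes) else 0.0 for g in universe]
dot = sum(e * q for e, q in zip(entity_vec, query_vec))

print(raw, dot)
```

Query genes that are absent from the entity (ALB here) contribute nothing, and entity genes that are absent from the query (NKX2-5) are multiplied by zero.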

The entity scores are divided into three levels, representing the strength of the results (high, medium, or low), which is indicated by the color of the score bar. This categorization is performed by a two-step procedure that runs automatically before each release. The first step is determining the threshold for medium and high scores for a group of query gene sets with varying sizes. The second step uses a linear regression between the various query sizes and their computed medium/high scores in order to create an equation from which the thresholds in the first step can be computed easily for any query gene size.

The first step of the automatic procedure uses a set of 50 test cases. From each test case, six gene lists of different sizes are generated (5 to 300 genes). The matching algorithm applied on these gene lists produces a range of typical scores for each query size. In order to obtain the high and medium thresholds automatically, a preliminary analysis was performed on many control microarray experiments. Each experiment represents a known cell/compartment/tissue and therefore was expected to produce high scores for the highly relevant entities, medium scores for entities with modest relevance, and lower scores for weakly related entities.

By analyzing the distribution curves for all control sets, we established the percentiles of entities that produce medium and high scores. These determined percentiles enabled the high and medium boundaries in the aforementioned first step to be computed automatically. In the second step, a linear regression is applied between the various query sizes and their high or medium scores from which an equation for computing these boundaries in the general case is generated.
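The second step can be sketched in Python as an ordinary least-squares fit of threshold against query size. Everything numeric here is invented for illustration: the intermediate query sizes, the threshold values, and the use of `>=` at the boundaries are assumptions, not values from GeneAnalytics.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def classify(score, query_size, high_fit, medium_fit):
    """Label a score as high, medium, or low for a given query size."""
    high = high_fit[0] * query_size + high_fit[1]
    medium = medium_fit[0] * query_size + medium_fit[1]
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Six query sizes between 5 and 300 genes (intermediate sizes are assumed),
# with made-up percentile-derived thresholds from the first step.
query_sizes = [5, 20, 50, 100, 200, 300]
high_thresholds = [2.5, 4.0, 7.0, 12.0, 22.0, 32.0]
medium_thresholds = [1.25, 2.0, 3.5, 6.0, 11.0, 16.0]

high_fit = fit_line(query_sizes, high_thresholds)
medium_fit = fit_line(query_sizes, medium_thresholds)

# Thresholds for a query size that was never tested directly.
print(classify(5.0, 60, high_fit, medium_fit))
```

The fitted equations let the boundaries be evaluated for any query size, which is the purpose the regression serves in the procedure described above.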

Diseases

Gene–disease relations in GeneAnalytics are divided into the following categories, indicated in the GeneAnalytics results (Fig. 4):

a) Gene associations along with their confidence classifications as derived from MalaCards data sources. Since each data source has its own annotation terminology, Table 3 categorizes all of the possible disease–gene associations in descending order according to their source-associated confidence, which is later transformed into a GeneAnalytics score.
b) Genes that are significantly up- or downregulated in disease tissues in comparison to their healthy counterparts. Differential gene expression profiles are derived from High Throughput experiments extracted from GEO or from the literature, and analyzed using LMD algorithms (Supplementary S3 Appendix).
c) GeneCards inferred genes (i.e., genes with the disease name mentioned anywhere in the relevant GeneCards webcard, e.g., in the publication section). This is a somewhat weaker association, which often does not imply causality.

Table 3.

Disease–Gene Associations from Manually Curated Genetic Sources

Association category	Source
Causative mutation	ClinVar, OMIM, Orphanet
Risk factor	ClinVar, OMIM, Orphanet
Resistant factor	ClinVar, OMIM
Genetic tests	GeneTests
Drug response	ClinVar
Structural gene variation	OMIM, Orphanet
Unconfirmed association	OMIM, Orphanet


See the Supplementary S3 Appendix for additional details.

The disease matching score is calculated in three steps:

a) Each gene associated with each disease receives a score based on the gene–disease relations described in the disease data modeling section (Supplementary S2 Table):
- (i) Genes with a genetic association to the disease receive a score according to the association category described in Supplementary S3 Table. A gene linked with multiple filter categories is assigned the strongest association score among them.
- (ii) Differentially expressed genes are binned and scored based on their rank in the list of differentially expressed genes in the diseased vs. normal tissue analysis (analyses were performed as per all High Throughput experiments, detailed in “Differentially expressed genes identification algorithm” in Supplementary S2 Appendix).
- (iii) Genes with “GeneCards inferred” relations receive a score based on the number of sections in GeneCards in which the disease appears.
b) Each gene may have more than one type of relationship with the disease; the final score of a gene for a given disease is the highest among all of the possible scores mentioned in point a above.
c) The gene–disease matching score is calculated based on scores of each of the matched genes, the number of matched genes, and the total number of genes associated with the disease in MalaCards (used for normalization). The scoring function is identical to the one used in the Tissues and Cells category (see “The matching score algorithm” in Supplementary S2 Appendix).
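Steps a and b can be sketched as follows. The association category names are taken from Table 3, but the numeric values in `GENETIC_SCORES`, the function name, and its parameters are hypothetical; the actual scores are given in the supplementary tables and are not reproduced here.

```python
# Hypothetical scores per genetic association category (names from Table 3).
GENETIC_SCORES = {
    "Causative mutation": 1.0,
    "Risk factor": 0.8,
    "Unconfirmed association": 0.3,
}


def gene_disease_score(genetic_categories, de_score=None, inferred_score=None):
    """Final score of one gene for one disease.

    genetic_categories: association categories linking the gene to the disease.
    de_score: score from the gene's differential-expression rank bin, if any.
    inferred_score: score for a "GeneCards inferred" relation, if any.
    The gene keeps the highest score among all of its relation types.
    """
    candidates = [GENETIC_SCORES[c] for c in genetic_categories]
    if de_score is not None:
        candidates.append(de_score)
    if inferred_score is not None:
        candidates.append(inferred_score)
    return max(candidates) if candidates else 0.0


# A gene with two genetic associations and a differential-expression hit.
best = gene_disease_score(["Risk factor", "Causative mutation"], de_score=0.5)
print(best)
```

The per-gene scores produced this way would then feed the matching score in step c, whose normalization is described in the Supplementary S2 Appendix.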

The disease results category in GeneAnalytics includes several filters that enable the user to focus on the results of interest (Fig. 4A).

a) Gene–disease relations. This enables the user to filter for gene–disease relation types, including differentially expressed genes and specific types of genetic associations. Selection of ‘differentially expressed genes’ (DE) or ‘genetic association’, will only show diseases for which their matched gene set includes at least one differentially expressed or genetically associated gene, respectively. This filtration caters to users who are interested in diagnostic disease markers, in the case of differentially expressed genes, or those with genetically associated variants for specific diseases. Importantly, the matching score for each disease category is recalculated following filtration, so the scoring algorithm considers only entities that contain at least one gene matching the requested filter terms.
b) Disease categories. This filter enables the user to focus on specific disease categories, as defined by MalaCards categorizations. MalaCards categorizes diseases into anatomical (e.g., eye, ear, liver, blood) and global (rare, fetal, genetic, cancer, and infectious) diseases. The categorization is based either on the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) (World Health Organization, 1992) or on the MalaCards classification algorithm, which utilizes category-specific keywords contained in the disease names and annotations, as well as textual heuristics. For example, if the disease name includes the words ‘tumor’ or ‘malignant,’ it is classified as a cancer disease (Rappaport et al., 2014). Further, a disease can be associated with more than one category.
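The keyword part of the disease categorization can be pictured with a small Python sketch. Only the two cancer keywords come from the text; the eye keywords, the dictionary layout, and the function name are illustrative assumptions, and the real MalaCards algorithm also uses annotations, ICD-10, and textual heuristics that are not modeled here.

```python
import re

# Category-specific keywords. The "cancer" entries are the ones quoted in the
# text; the "eye" entries are made up for the example.
CATEGORY_KEYWORDS = {
    "cancer": {"tumor", "malignant"},
    "eye": {"retinal", "ocular"},
}


def classify_disease(name):
    """Return every category whose keywords occur in the disease name."""
    words = set(re.findall(r"[a-z]+", name.lower()))
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws)


print(classify_disease("Malignant retinal tumor"))
print(classify_disease("Asthma"))
```

Returning a list rather than a single label reflects the point that one disease can belong to more than one category.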

Pathways

In GeneAnalytics, matched SuperPaths appear with their matching score and link to the relevant webcard in PathCards, as well as the list of matched genes and total number of genes associated with each SuperPath. The user can then expand each matched SuperPath to view the list of its clustered pathways with links to their original individual pathway sources and to the relevant genes in the user's query (Fig. 5).

FIG. 5. — GeneAnalytics Pathways results. (A) The pathway filters panel enables filtration of results according to their data sources. (B) The detailed results table includes all of the matched SuperPaths, presented in descending score and with links to the related card in PathCards. (C) Each SuperPath includes one or more pathways from different sources. Clicking on the plus sign exposes the names of the separate pathways that comprise the SuperPath, with links to the pathway page in the original data source.

The scoring algorithm in the pathways category is based on the algorithm used by the GeneDecks Set Distiller tool (Stelzer et al., 2009). Briefly, all genes in each SuperPath are given a similar weight in the analysis, and the matching score is based on the cumulative binomial distribution, which is used to test the null hypothesis that the queried genes are not over-represented within any SuperPath (see more details in Supplementary S4 Appendix). As in all sections, the score is represented by a colored score bar and classified by its quality (see details in the Tissues & Cells matching algorithm description).
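A cumulative binomial test of this kind can be sketched in Python as below. The parameterization is an assumption: each query gene is treated as an independent draw that lands in the SuperPath with probability equal to the SuperPath's share of all genes. The function name and the example numbers are hypothetical, and the conversion of the p-value into the displayed score (Supplementary S4 Appendix) is not reproduced.

```python
from math import comb


def binomial_enrichment_p(matched, query_size, set_size, universe_size):
    """Upper-tail binomial probability P(X >= matched).

    X ~ Binomial(query_size, p), with p = set_size / universe_size being the
    chance that a randomly chosen gene belongs to the gene set.
    """
    p = set_size / universe_size
    return sum(
        comb(query_size, k) * p ** k * (1 - p) ** (query_size - k)
        for k in range(matched, query_size + 1)
    )


# Hypothetical case: 4 of 20 query genes fall in a 100-gene SuperPath,
# against a background of 20,000 genes.
p_value = binomial_enrichment_p(matched=4, query_size=20,
                                set_size=100, universe_size=20000)
print(p_value)
```

A small probability means that the observed overlap would be unlikely under the null hypothesis of no over-representation.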

Pathway unification is employed on all of the sources found in GeneCards. GeneAnalytics enables users to concentrate on as many sources as desired by applying a source filter.

Gene Ontology (GO) terms and phenotypes

The matching algorithm for both GO terms and phenotypes is based on the binomial distribution and is identical to that used in the pathways category (see Supplementary S3 Appendix).

Drugs and compounds

The GeneAnalytics compounds results category takes advantage of multiple sources that cover more than 83,000 compounds, approximately 45,000 of which are associated with genes. GeneAnalytics applies a unification process which reduces the number of compounds with associated genes by more than half, from ∼45,000 to ∼20,000 compounds (Table 1). This robust process saves time in reviewing identical compounds presented under various names by different data sources and enables massive aggregation of genes per compound, and is featured in GeneCards.

The compound unification process seeks out similar compounds described in different data sources, and is based on the following rules:

a) Unification of compounds with exactly identical names (case- and dash-insensitive).
b) Unification of compounds with identical identifiers, more specifically both a Chemical Abstracts Service (CAS) number (unique numerical identifier assigned to chemical substances) and a PubChem ID (PubChem is an NCBI database providing information on the biological activities of small molecules). Note that not all compounds have these identifiers, nor do all databases provide these identifiers for their compounds.
c) Unification of compounds with either an identical CAS number or PubChem ID and identical synonyms. Note that different compounds might have identical synonyms and therefore, only compounds with at least one identical identifier and one identical synonym are unified.
d) Metabolite unification based on metabolite family and gene sharing: several metabolite families contain thousands of compounds with almost identical names, many of which are associated with an identical list of genes. In GeneAnalytics, prevalent metabolite family subgroups belonging to Triglycerides, Diglycerides, Phosphatidylcholines, and Phosphatidylethanolamines have been unified based on identical lists of associated genes. These groups are described in the user guide (geneanalytics.genecards.org/user-guide#1628).
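Rules a to c can be sketched in Python as a pairwise test combined with a union-find merge. This is an illustration under stated assumptions: the record layout, the placeholder identifiers, and the compound names are invented; merging transitively across pairs is an assumption about how groups are formed; and rule d (metabolite families) is not modeled.

```python
from itertools import combinations


def normalize(name):
    """Case- and dash-insensitive form of a compound name or synonym."""
    return name.lower().replace("-", "")


def should_unify(a, b):
    """Apply rules a, b, and c to one pair of compound records."""
    if normalize(a["name"]) == normalize(b["name"]):
        return True  # rule a: identical names
    same_cas = a["cas"] is not None and a["cas"] == b["cas"]
    same_pubchem = a["pubchem"] is not None and a["pubchem"] == b["pubchem"]
    if same_cas and same_pubchem:
        return True  # rule b: both identifiers identical
    shared = {normalize(s) for s in a["synonyms"]} & {normalize(s) for s in b["synonyms"]}
    return (same_cas or same_pubchem) and bool(shared)  # rule c


def unify(compounds):
    """Group records that are linked, directly or transitively, by the rules."""
    parent = list(range(len(compounds)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(compounds)), 2):
        if should_unify(compounds[i], compounds[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i, record in enumerate(compounds):
        groups.setdefault(find(i), []).append(record["name"])
    return sorted(sorted(names) for names in groups.values())


# Hypothetical records; identifiers are placeholders, not real CAS/PubChem IDs.
compounds = [
    {"name": "Compound-X", "cas": "CAS-1", "pubchem": 101, "synonyms": ["cx"]},
    {"name": "CompoundX", "cas": None, "pubchem": None, "synonyms": []},
    {"name": "Xanol", "cas": "CAS-1", "pubchem": 101, "synonyms": []},
    {"name": "Yttrol", "cas": "CAS-2", "pubchem": None, "synonyms": ["why"]},
    {"name": "Y-compound", "cas": "CAS-2", "pubchem": 202, "synonyms": ["WHY"]},
    {"name": "Zed", "cas": "CAS-3", "pubchem": None, "synonyms": ["why"]},
]

groups = unify(compounds)
print(groups)
```

In this toy data, "Zed" shares a synonym with "Yttrol" but no identifier, so it stays separate, which mirrors the caution in rule c that identical synonyms alone are not sufficient.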

Unified compounds are shown with links to all supporting data sources, providing further information and its relevance to the evaluated genes, while the original compound name is shown near its data source. The matching algorithm is based on the binomial distribution and is identical to that used in the pathways category (see Supplementary S3 Appendix).

The compound category in GeneAnalytics provides the opportunity to explore relationships between compounds and gene sets, to define potential drugs and their mechanisms of action and to facilitate drug target discovery.