2016 Dec 1;222(4):1557–1580. doi: 10.1007/s00429-016-1338-2
Box 1 | Gene Sets
Complex biological functions and disorders usually involve several genes rather than a single gene. Gene sets are groups of genes that share a common biological function. They can be defined either from prior knowledge (e.g. about biochemical pathways or diseases) or from experimental data (e.g. transcription factor targets identified using ChIP-seq). Gene set databases organize existing knowledge about these groups by arranging the genes in sets, each associated with a functional term such as a pathway name or the transcription factor that regulates the genes. Gene sets can be classified into five types:
Gene Ontology (GO)
The Gene Ontology project (Ashburner et al. 2000) developed three hierarchically structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions. Genes annotated with the same GO term(s) constitute a gene set.
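The grouping rule in the last sentence can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular database: the gene names and term identifiers are hypothetical placeholders, and the sketch ignores the ontology hierarchy (in practice, annotations are often also propagated to parent terms).

```python
from collections import defaultdict

# Hypothetical (gene, term) annotation pairs; placeholders only,
# not taken from a real GO annotation file.
annotations = [
    ("geneA", "TERM_1"),
    ("geneB", "TERM_1"),
    ("geneB", "TERM_2"),
    ("geneC", "TERM_2"),
]


def build_gene_sets(pairs):
    """Group genes by the term they are annotated with."""
    gene_sets = defaultdict(set)
    for gene, term in pairs:
        gene_sets[term].add(gene)
    return dict(gene_sets)


gene_sets = build_gene_sets(annotations)
for term, genes in sorted(gene_sets.items()):
    print(term, sorted(genes))
```

A gene annotated with several terms (geneB here) belongs to several gene sets, so the sets generally overlap.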
Biological Pathways
Biological pathways are networks of molecular interactions underlying biological processes. Pathway databases, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al. 1999) and REACTOME (Croft et al. 2014), catalog physical entities (proteins and other macromolecules, small molecules, complexes of these entities and post-translationally modified forms of them), their subcellular locations and the transformations they can undergo (biochemical reaction, association to form a complex and translocation from one cellular compartment to another).
Transcription
Transcription databases include information on the regulation of genes by transcription factors (TFs) binding to the DNA, or on post-transcriptional regulation by microRNAs binding to the mRNA. These physical interactions can be determined either in silico using computational inference (motif enrichment analysis) or from experimental data (such as ChIP-seq and microRNA binding data). For motif enrichment analysis, position weight matrices (PWMs) from databases such as TRANSFAC (Matys et al. 2006) and JASPAR (Portales-Casamar et al. 2010) can be used to scan gene promoters in the region around the transcription start site (TSS). ChIP-seq data, such as the large collections of experiments from the Encyclopedia of DNA Elements (ENCODE) project (Bernstein et al. 2012b) and the Roadmap Epigenomics consortium (Consortium 2015a), are used to identify genes targeted by the TFs. Similarly, microRNA targets can be extracted from databases such as TargetScan (Lewis et al. 2003).
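The promoter-scanning step can be illustrated with a small sketch. It assumes a log-odds score against a uniform background and a fixed score threshold, which is one common way to use a PWM but not necessarily the scoring used by any specific tool. The 4-position matrix, the promoter sequence and the threshold are all hypothetical, and only the forward strand is scanned.

```python
import math

# Hypothetical 4-position PWM: probability of each base at each position.
# Illustrative values only, not taken from TRANSFAC or JASPAR.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(window, pwm, background=BACKGROUND):
    """Sum of per-position log2(probability / background) for one window."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, window))


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_odds(sequence[start:start + width], pwm)
        if score >= threshold:
            hits.append((start, score))
    return hits


promoter = "TTACGTGGACGTAA"  # hypothetical promoter fragment
hits = scan(promoter, PWM, threshold=4.0)
for start, score in hits:
    print(start, round(score, 2))
```

Genes whose promoters contain at least one hit would then be collected into the gene set for the corresponding TF; a full analysis would usually also scan the reverse strand and use a background model estimated from the genome.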
Cell-type markers
Cell type-specific transcriptional data provide a rich source of cell type marker genes. A gene is identified as a marker of a cell type if it is up-regulated in that cell population compared to other cell populations. Several studies have used microarrays and RNA-seq to profile the transcriptomes of a number of neuronal cell types (Cahoy et al. 2008; Zhang et al. 2014). More recently, studies have used single-cell sequencing to capture the transcriptomes of individual neuronal cells (Darmanis et al. 2015; Zeisel et al. 2015).
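As a rough illustration of the marker criterion, the sketch below selects genes whose expression in one cell type exceeds the mean of the other cell types by a minimum log2 fold change. The expression values, the threshold and the pseudocount are hypothetical, and the cited studies may apply different criteria, typically including statistical testing.

```python
import math

# Hypothetical mean expression per cell type (arbitrary units).
expression = {
    "gene1": {"neuron": 80.0, "astrocyte": 5.0, "oligodendrocyte": 5.0},
    "gene2": {"neuron": 10.0, "astrocyte": 12.0, "oligodendrocyte": 8.0},
    "gene3": {"neuron": 2.0, "astrocyte": 64.0, "oligodendrocyte": 2.0},
}


def marker_genes(expr, cell_type, min_log2_fc=2.0, pseudocount=1.0):
    """Genes up-regulated in cell_type relative to the mean of all other types."""
    markers = []
    for gene, by_type in expr.items():
        others = [value for name, value in by_type.items() if name != cell_type]
        baseline = sum(others) / len(others)
        # The pseudocount (an assumption of this sketch) avoids division by zero.
        log2_fc = math.log2((by_type[cell_type] + pseudocount) / (baseline + pseudocount))
        if log2_fc >= min_log2_fc:
            markers.append(gene)
    return markers


neuron_markers = marker_genes(expression, "neuron")
astrocyte_markers = marker_genes(expression, "astrocyte")
print(neuron_markers, astrocyte_markers)
```

The marker genes selected for each cell type then form that cell type's gene set.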
Disease
Genes can be grouped into sets based on their association with the same disease. Public databases, such as OMIM (2015a) and DisGeNet (Pinero et al. 2015), contain curated information on gene-disease associations drawn from the literature and public sources. Disease-related gene sets can also be obtained by identifying genes that harbor variants detected using GWAS (Simón-Sánchez and Singleton 2008; Welter et al. 2014), exome sequencing (2015b), or whole-genome sequencing.