Skip to main content
BMC Bioinformatics logoLink to BMC Bioinformatics
. 2016 Sep 13;17(1):366. doi: 10.1186/s12859-016-1229-9

Tissue enrichment analysis for C. elegans genomics

David Angeles-Albores 1, Raymond Y N Lee 1, Juancarlos Chan 1, Paul W Sternberg 1,
PMCID: PMC5020436  PMID: 27618863

Abstract

Background

Over the last ten years, there has been explosive development in methods for measuring gene expression. These methods can identify thousands of genes altered between conditions, but understanding these datasets and forming hypotheses based on them remains challenging. One way to analyze these datasets is to associate ontologies (hierarchical, descriptive vocabularies with controlled relations between terms) with genes and to look for enrichment of specific terms. Although Gene Ontology (GO) is available for Caenorhabditis elegans, it does not include anatomical information.

Results

We have developed a tool for identifying enrichment of C. elegans tissues among gene sets and generated a website GUI where users can access this tool. Since a common drawback to ontology enrichment analyses is its verbosity, we developed a very simple filtering algorithm to reduce the ontology size by an order of magnitude. We adjusted these filters and validated our tool using a set of 30 gold standards from Expression Cluster data in WormBase. We show our tool can even discriminate between embryonic and larval tissues and can even identify tissues down to the single-cell level. We used our tool to identify multiple neuronal tissues that are down-regulated due to pathogen infection in C. elegans.

Conclusions

Our Tissue Enrichment Analysis (TEA) can be found within WormBase, and can be downloaded using Python’s standard pip installer. It tests a slimmed-down C. elegans tissue ontology for enrichment of specific terms and provides users with a text and graphic representation of the results.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-1229-9) contains supplementary material, which is available to authorized users.

Keywords: Gene ontology, Anatomy ontology, WormBase, RNA-seq, High-throughput biology

Background

RNA-seq and other high-throughput methods in biology have the ability to identify thousands of genes that are altered between conditions. These genes are often correlated in their biological characteristics or functions, but identifying these functions remains challenging. To interpret these long lists of genes, biologists need to abstract genes into concepts that are biologically relevant to form hypotheses about what is happening in the system. One such abstraction method relies on Gene Ontology (GO). GO provides a controlled set of hierarchically ordered terms [1, 2] that provide detailed descriptions about the molecular, cellular or biochemical functions of any gene. For a given gene list, existing software programs can query whether a particular term is enriched [36]. One area of biological significance that GO does not include is anatomy. One way to address this shortcoming is to use a ‘tissue ontology’ that provides a complete anatomical description for an organism (e.g.‘tissue’, ‘organ’ or ‘specific cell’), in this case for C. elegans. Such an ontology has been described previously for this organism [7]. Cells and tissues are physiologically relevant units with broad, relatively well-understood functionalities amenable to hypothesis formation. The C. elegans database, WormBase [8], maintains a curated list of gene expression data from the literature. Here we provide a new framework that analyzes a user-input list for enrichment of specific cells and tissues.

Another problem frequently associated with GO enrichment analysis is that it is often difficult to interpret due to the large number of terms associated with a given gene (which we refer to as ‘result verbosity’). DAVID, a common tool for GO enrichment analysis, clusters enriched terms into broad categories [9], whereas PANTHER [3, 10] attempts to solve this issue by employing a manually reduced ontology, GOslim (pers. comm., H. Yu and P. Thomas). To reduce verbosity, we have filtered our ontology using a small set of well-defined criteria to remove terms that do not contribute additional information. To our knowledge, such filtering has not been performed in an algorithmic fashion for a biological ontology before; indeed, DAVID does not employ term trimming a priori of testing, but rather fuzzy clustering post testing to reduce the number of ontology terms. Other pruning methods do exist (see for example [11, 12]), but the pruning is query-dependent or generates a brand new ‘brief ontology’ which satisfies a set of logic relationships and has certain connectivity requirements. We do not propose to regenerate a new ‘brief ontology’, but instead we use our approach to select those nodes that have sufficient annotated evidence for statistical testing. We believe our trimming methodology strikes a good balance between detailed tissue calling and conservative testing.

We have developed a tool that tests a user-provided list of genes for term enrichment using a nematode-specific tissue ontology. This ontology, which is not a module of Gene Ontology, is verbose. We select nodes from the ontology for statistical testing using an algorithmic approach, outlined below, that reduces multiple hypothesis testing issues by limiting testing to terms that are well-annotated. The results are provided to the user in a GUI that includes a table of results and an automatically generated bar-chart. This software addresses a previously unmet need in the C. elegans community for a tool that reliably and specifically links gene expression with changes in specific cells, organs or tissues in the worm.

Results

Generating a gene-tissue dictionary by specific node selection

Reducing term redundancy through a similarity metric

For our tool, we employ a previously generated cell and tissue ontology for C. elegans [7], which is maintained and curated by WormBase. This ontology contains thousands of anatomiy terms, but not every term is equally well-annotated. As a first step to generate our tissue enrichment software, we wished to select tissue terms that were reasonably well-annotated, yet specific enough to provide insight and not redundant with other terms. For example, nematodes have a number of neurons that are placed symmetrically along the left/right body axis, and are functionally similar. These left/right neuronal pairs (which are sisters in the ontology) have almost identical annotations, with at most one or two gene differences between them, and therefore we cannot have statistical confidence in differentiating between them. As a result, testing these sister terms provides no additional information compared with testing only the parent node to these sisters. To identify redundancy, we defined two possible similarity metrics (see “Methods” section and Fig. 1a) that can be used to identify ontology sisters that have very high similarity between them. Intuitively, a set of sisters can be considered very similar if they share most gene annotations. Within a given set of sisters, we can calculate a similarity score for a single node by counting the number of unique annotations it contains and dividing by the total number of unique annotations in the sister set. Having assigned to each sister a similarity score, we can identify the average similarity score for this set of sisters, and if this average value exceeds a threshold, these sisters are not considered testable candidates. An alternative method is check whether any of the scores exceeds a predetermined threshold, and if so remove this sister set from the ontology. We referred to these two scoring criteria as ‘avg’ and ‘any’ respectively.

Fig. 1.

Fig. 1

Schematic representation of trimming filters for an acyclical ontology. a The parent node (green) contains at least as many annotations as the union of the two sisters. These two sisters share annotations extensively, as expressed by the overlap in the Venn diagram, so they qualify for removal. b Nodes with less than a threshold number of genes are trimmed (red) and discarded from the dictionary. Here, the example threshold is 25 genes. Nodes ε,ζ,η, shown in red are removed. c Parent nodes are removed recursively, starting from the root, if all their daughter nodes have more than the threshold number of annotations. Nodes in grey (ε,ζ,η) were removed in the previous step. Nodes α,β shown in red are trimmed because each one has a complete daughter set. Only nodes γ and δ will be used to generate the static dictionary

Terminal branch terms and parent terms can be safely removed in an algorithmic fashion

Another problem arises from the ontology being scarcely populated. Many nodes have 0–10 annotations, which we consider too few to accurately test. To solve this issue, we implemented another straightforward node selection strategy. For a given terminal node, we test whether the node has more than a threshold number of annotations. If it does not, the node is not used for statistical testing. The next higher node in the branch is tested and removed recursively until a node that satisfies the condition is found. At that point, no more nodes can be removed from that branch. This completion is guaranteed by the structure of the ontology: parent nodes inherit all of the annotations of all of their descendants, so the number of annotated terms monotonically increases with increasing term hierarchy (see Fig. 1b). In this way, we ensure that our term dictionary includes only those tissues that are considered sufficiently well annotated for statistical purposes.

Additionally, we reasoned that for any parent node if all its daughters were selected for testing, there was no additional benefit to test the parent. We removed parent nodes from the analysis if all their daughter nodes passed the annotation threshold (see Fig. 1c). We called this a ceiling filter. Applying these three filters reduced the number of ontology terms by an order of magnitude.

Filtering greatly reduces the number of nodes used for analysis

By itself, each of these filters can reduce the number of nodes employed for analysis, but applying the filters in different orders removes different numbers of nodes (not all the filters are commutative). We chose to always execute annotation and similarity thresholding first, followed by the ceiling filter. For validation (see below) we made a number of different dictionaries. The original ontology has almost 6,000 terms of which 1675 have at least 5 gene annotations. After filtering, dictionary sizes ranged from 21 to a maximum of 460 terms, which shows the number of terms in a scarcely annotated ontology can be reduced by an order of magnitude through the application of a few simple filters (see Table 1). These filters were used to compile a static dictionary that we employ for all analyses (see “Validation of the algorithm and optimizing parameter selection” section for details). Our trimming pipeline is applied as part of each new WormBase release. This ensures that the ontology database we are using remains up-to-date with regards to both addition or removal of specific terms as well as with regard to gene expression annotations.

Table 1.

Parameter specifications and number of tissues for all dictionaries

Annotation cutoff Similarity threshold Method No. of terms in dictionary
25 0.9 any 460
25 0.9 avg 461
25 0.95 any 466
25 0.95 avg 468
25 1 any 476
25 1 avg 476
33 0.9 any 261
33 0.9 avg 255
33 0.95 any 261
33 0.95 avg 262
33 1 any 247
33 1 avg 247
50 0.9 any 83
50 0.9 avg 77
50 0.95 any 82
50 0.95 avg 81
50 1 any 70
50 1 avg 70
100 0.9 any 45
100 0.9 avg 35
100 0.95 any 42
100 0.95 avg 36
100 1 any 21
100 1 avg 21

The ‘Method’ column refers to the trimming criterion for the similarity metric. We used two such criteria, ‘any’ and ‘avg’.‘any’: For a given sister set, if any sister had a similarity exceeding the corresponding threshold, all sisters were removed from the final dictionary. ‘avg’: For a given sister set, if the average similarity across all the sisters in the set was greater than the corresponding threshold, all sisters were removed from the final dictionary

Tissue enrichment testing via a hypergeometric model

Having built a static dictionary, we generated a Python script that implements a significance testing algorithm based on the hypergeometric model. Briefly, the hypergeometric model tests the probability of observing ni occurences of a tissue i in a list of size M if there are mi labels for that tissue in a dictionary of total size N that are drawn without replacement. Mathematically, this is expressed as:

P(ni|N,mi,M)=miniMmiNniNni. 1

Although a user will input gene IDs, we test the number of ocurrences of a term within the gene list, so a single gene can contribute to multiple terms. Due to the discrete nature of the hypergeometric distribution, this algorithm can generate artifacts when the list is small. To avoid spurious results, a tissue is never considered significant if there are no annotations for it in the user-provided list.

Once the p-values for each term have been calculated, we apply a standard FDR correction using a Benjamini-Hochberg step-up algorithm [13]. FDR corrected p-values are called q-values. Genes that have a q-value less than a given alpha are considered significant. Our default setting is an alpha of 0.1, which is a standard threshold broadly agreed upon by the scientific community (see for example [1416]). This threshold cannot be altered in the web GUI, but is user tunable through our command-line implementation.

Users input a gene list using any valid gene name for C. elegans. These names are processed into standard WormBase gene IDs (WBGene IDs). The program returns a table containing all the enriched terms and associated information such as number of terms in gene list and expected number of terms. Finally, the program can also return a bar chart of the enrichment fold change for the fifteen tissues with the lowest measured q-values. The bars in the graph are sorted in ascending order of q-value and then in descending order of fold-change. Bars are colored for ease of viewing, and color does not convey information. Our software is implemented in an easy to use GUI (see Fig. 2; alternatively, users can interface with TEA via python, see Additional file 1). Anatomy terms are displayed in human-readable format followed by their unique ontology ID (WBbt ID). In summary, each time the ontology annotations are updated, a new trimmed ontology is generated using our filters; in parallel, users can submit their gene lists through WormBase for testing, with results output in a number of formats (see Fig. 3).

Fig. 2.

Fig. 2

Screenshot of results from the web GUI. After inputting a gene-list, the user is provided with the results. An HTML table is output with hyperlinks to the ontology terms. A publication-ready graph is provided below, which can be saved by dragging to the desktop. The graph is colored for better visualization; color is not intended to convey information. The graph and the table show anatomy terms in human-readable format, followed by their unique WBbt ID. Finally, lists of the genes used and discarded for the analysis are also presented

Fig. 3.

Fig. 3

TEA Workflow. The complete ontology is annotated continuously by WormBase curators. After each update, the ontology is processed to remove uninformative terms, and the remaining terms are used for statistical testing. Users can select a gene list and input it into our tool using our WormBase portal. The gene list is tested for enrichment using the trimmed ontology, and results are output in tabular and graphic formats for analysis

Validation of the algorithm and optimizing parameter selection

We wanted to select a dictionary that included enough terms to be specific beyond the most basic C. elegans tissues, yet would minimize the number of spurious results and which had a good dynamic range in terms of enrichment fold-change. Larger tissues are correlated with better annotation, so increasing term specificity is associated with losses in statistical power. To help us select an appropriate dictionary and validate our tool, we used a set of 30 gold standards based on microarray and RNA-seq literature which are believed to be enriched in specific tissues [1724]. These data sets are annotated gene lists derived from the corresponding Expression Cluster data in WormBase. Some of these studies have been used to annotate gene expression, and so they did not constitute an independent testing set. To correct this flaw, we built a clean dictionary that specifically excluded all annotation evidence that came from these studies.

As a first attempt to select a dictionary, we generated all possible combinations of dictionaries with minimal annotations of 10, 25, 33, 50 and 100 genes and similarity cutoffs of 0.9, 0.95 and 1, using ‘avg’ or ‘any’ similarity thresholding methods (see Table 1). The number of remaining ontology terms was inversely correlated to the minimum annotation cutoff, and was largely insensitive to the similarity threshold in the range we explored. Next, we analyzed all 30 datasets using each dictionary. Because of the large number of results, instead of analyzing each set of terms individually, we measured the average q-value for significantly enriched terms in each dataset without regard for the perceived accuracy of the terms that tested significant. We found that the similarity threshold mattered relatively little for any dictionary. We also noticed that the ‘any’ thresholding method resulted in tighter histograms with a mode closer to 0. For this reason, we chose the ‘any’ method for dictionary generation. The average q-value increased with decreasing annotation cut-off (see Fig. 4), which reflects the decreasing statistical power associated with fewer annotations per term, but we remained agnostic as to how significant is the trade-off between power and term specificity. Based on these observations, we ruled out the dictionary with the 100 gene annotation cut-off: it had the fewest terms and its q-values were not low enough in our opinion to compensate for the trade-off in specificity.

Fig. 4.

Fig. 4

Kernel density estimates (KDE) for 30 gold standard datasets. We ran TEA on 30 datasets we believed to be enriched in particular tissues and pooled all the results to observe the distribution of q-values. The mode of the distribution for dictionaries with annotation cut-offs of 100 and 50 genes are very similar; however, when the cut-off is lowered to 25 genes, the mode of the distribution shifts to the left, potentially signalling a decrease in measurement power

To select between dictionaries generated between 50, 33 and 25 annotation cut-offs, and also to ensure the terms that are selected as enriched by our algorithm are reasonable, we looked in detail at the enrichment analysis results. Most results were comparable and expected. For some sets, all dictionaries performed well. For example, in our ‘all neuron enriched sets’ [18, 20] all terms were neuron-related regardless of the dictionary used (see Table 2). On the other hand, for a set enriched for germline precursor expression in the embryo [18], the 50 cutoff dictionary was only able to identify ‘oocyte WBbt:006797’, which is not a germline precursor although it is germline related; whereas the two smaller dictionaries singled out actual germline precursor cells—at the 33 cutoff, our tool identified the larval germline precursor cells ‘Z2’ and ‘Z3’ as enriched, and at the 25 gene cutoff the embryonic germline precursor terms ‘P4’,‘P3’ and ‘P2’ were identified in addition to ‘Z2’ and ‘Z3’. We also queried an intestine precursor set [18]. Notably, this gene set yielded no enrichment when using the 25 cutoff dictionary, nor when using the 50 cutoff dictionary. However, the 33 cutoff dictionary identified the E lineage, which is the intestinal precursor lineage in C. elegans, as enriched. Both of these results capture specific aspects of C. elegans that are well known to developmental biologists.

Table 2.

Comparison of results for a GABAergic neuronal-enriched gene set from Watson [20] showing that results are similar regardless of annotation cutoff

Tissue Q value33 Q value50 Enrichment fold change33 Enrichment fold change50
Nerve ring WBbt:0006749 0.005 0.0015 2.7 2.7
Nervous system WBbt:0005735 0.005 0.0015 1.4 1.4
Dorsal nerve cord WBbt:0006750 0.005 0.0015 3.8 3.8
Retrovesicular ganglion WBbt:0005656 0.011 0.0034 5 5
Ventral nerve cord WBbt:0005829 - 0.022 - 2.4

We ran the same gene list on a dictionary with a minimum annotation cutoff of 50, similarity threshold of 0.95 and similarity method ‘any’ versus another with a minimum annotation cutoff of 33, similarity threshold of 0.95 and similarity method ‘any’. In the table, columns are labeled with their significance value (Q-value) or enrichment fold change followed by a hyphen and a number which indicates which the cutoff for the dictionary that was used for testing. Not all tissues are present in either dictionary. Hyphens denote not-applicable values, which occurs when a particular tissue is not present in both dictionaries

Not all queries worked equally well. For example, a number of intestinal sets [18, 21] were not enriched in intestine-related terms in any dictionary, but were enriched for pharynx and hypodermis. We were surprised that intestinal gene sets performed poorly, since the intestine is a relatively well-annotated tissue.

We assessed the internal agreement of our tool by using independent gene sets that we expected to be enriched in the same tissues. We used two pan-neuronal sets [18, 20]; two PVD sets [18, 24]; and two GABAergic sets [18, 19]. Overall, the tool has good internal agreement. On most sets, the same terms were enriched, although order was somewhat variable (see Fig. 5), and most high-scoring terms were preserved between sets. All comparisons can be found online in our Github repository (see Availability of data and materials). The complete list of gene sets and results can also be found in Additional files 2, 3 and 4. Overall, the dictionary generated by a 33 gene annotation cutoff with 0.95 redundancy threshold using the ‘any’ criterion performed best, with a good balance between specificity, verbosity and accuracy, so we selected this parameter set to generate our static dictionary. As of this publication, the testable dictionary contains 261 terms.

Fig. 5.

Fig. 5

Independently derived gene sets show similar results when tested with the same dictionary. Set 1) GABAergic gene set from Watson [20]. Set 2) GABAergic gene set from Spencer [18]. Arrowheads highlight identical terms between both analyses. All terms refer to neurons or neuronal tissues and are GABA-associated. Dictionary with cutoff: 33; threshold: 0.95; method: ‘any’

Applying the tool

We applied our tool to the RNA-seq datasets developed by Engelmann et al. [25] to gain further understanding of their underlying biology. Engelmann et al. exposed young adult worms to 5 different pathogenic bacteria or fungi for 24 h, after which mRNA was extracted from the worms for sequencing. We ran TEA on the genes Engelmann et al. identified as up- or down-regulated. Initially we noticed that genes that are down-regulated tend to be twice as better annotated on average than genes that were up-regulated, suggesting that our understanding of the worm immune system is scarce, in spite of important advances made over the last decade. Up-regulated tissues, when detected, almost always included the hypodermis and excretory duct. Three of the five samples showed enrichment of neuronal tissues or neuronal precursor tissues among the down-regulated genes. As an independent verification, we also performed GO analysis using PANTHER on the down-regulated genes for D. coniospora. These results also showed enrichment in terms associated with neurons (see Fig. 6). A possible explanation for this neuronal association might be that the infected worms are sick and the neurons are beginning to shut down; an alternative hypothesis would be that the worm is down-regulating specific neuronal pathways as a behavioral response against the pathogen. Indeed, several studies [26, 27] have provided evidence that C. elegans uses chemosensory neurons to identify pathogens. Our results highlight the involvement of various C. elegans neuronal tissues in pathogen defense.

Fig. 6.

Fig. 6

D. coniospora gene enrichment analysis and tissue enrichment analysis results. We compared and contrasted the results from a gene enrichment analysis program, pantherDB, with TEA by analyzing genes that were significantly down-regulated when C. elegans was exposed to D. coniospora in a previously published dataset by Engelmann et al. [25] with both tools. a pantherDB screenshot of results, sorted by p-value. Only top hits shown. b TEA results, sorted by q-value (lowest on top) and fold-change. Both pantherDB and TEA identify terms associated with neurons (red square). The two analyses provide complementary, not redundant, information

Discussion

We have presented a tissue enrichment analysis tool that employs a standard hypergeometric model to test the C. elegans tissue ontology. We use a hypergeometric function to test a user-provided gene list for enrichment of anatomical terms in C. elegans. Our hope is that the physiological relevance of anatomical terms will enable researchers to make hypotheses about high-dimensionality data. Specifically, we believe an enriched term may broadly suggest one of two hypotheses: if a list is enriched in a particular anatomical region, that anatomical region is affected by the experimental treatment; alternatively, the anatomical regions that are enriched reflect biologically relevant interactions between tissues. We believe the first hypothesis is a reasonable one to make in the case of whole-worm RNA-seq data for example, whereas the second hypothesis may be more plausible in cases where a researcher already knows what tissues a particular gene list came from, as may be the case in single-cell RNA-seq.

Our tool relies on an annotation dictionary that is continuously updated primarily with data from single gene qualitative analyses, does not require retraining and does not require ranked genes. To our knowledge, this is the first tool that tests tissue enrichment in C. elegans via the hypergeometric method, but similar projects exist for humans and zebrafish [28, 29], highlighting the relevance of our tool for high-dimensionality biology. Chikina et al. [30] have previously reported a tissue enrichment model for C. elegansbased on a Support Vector Machine classifier that has been trained on microarray studies. SVMs are powerful tools, but they require continuous retraining as more tissue expression data becomes available. Moreover, classifiers require that data be rank-ordered by some metric, something which is not possible for certain studies. Furthermore, this tissue enrichment tool provides users with enrichment results for only 6 large tissues. In contrast, our tool routinely tests a much larger number of terms, and we have shown it can even accurately identify enrichment of embryonic precursor lineages for select data sets.

We have also presented the first, to our knowledge, ontology term filtering algorithm applied to biomedical ontologies. This algorithm, which is very easy to execute, identifies terms that have specificity and statistical power for hypothesis testing. Due to the nature of all ontologies as hierarchical, acyclical graphs with term inheritance, term annotations are correlated along any given branch. This correlation reduces the benefits of including all terms for statistical analysis: for any given term along a branch, if that term passes significance, there is a high probability that many other terms along that branch will also pass significance. If the branch is enriched by random chance, error propagation along a branch means that many more false positives will follow. Thus, a researcher might be misled by the number of terms of correlated function and assign importance to this finding; the fact that the branching structure of GO amplifies false positive signals is a powerful argument for either reducing branch length or branch intracorrelation, or both. On the other hand, if a term is actually enriched, we argue that there is little benefit to presenting the user with additional terms along that branch. Instead, a user will benefit most from testing sparsely along the tree at a suitable specificity for hypothesis formation. Related terms of the same level should only be tested when there is sufficient annotation to differentiate, with statistical confidence, whether one term is enriched above the other. Our algorithm reduces branch length by identifying and removing nodes that are insufficiently annotated and parents that are likely to include sparse information.

We endeavoured to benchmark our tool well, but our analysis cannot address problems related to spurious term enrichment. Although we were unable to determine false-positive and false-negative rates, we do not believe this should deter scientists from using our tool. Rather, we encourage researchers to use our tool as a guide, integrating evidence from multiple sources to inform the most likely hypotheses. As with any other tool based on statistical sampling, our analysis is most vulnerable to bias in the data set. For example, expression reports are negatively biased against germline expression because of the difficulties associated with expressing transgenes in this tissue [31]. As time passes, we are certain the accuracy and power of this tool will improve thanks to the continuing efforts of the worm research community; indeed, without the community reports of tissue expression in the first place, this tool would not be possible.

Conclusions

We have built a tissue enrichment tool that employs a tissue ontology previously developed by WormBase. We use a simple algorithm to identify the best ontology terms for statistical testing and in this way minimize multiple testing problems. Our tool is available within WormBase or can be downloaded for offline use via ‘pip install’.

Methods

Fetching annotation terms

We used WormBase-curated gene expression data, which includes annotated descriptions of spatial-temporal expression patterns of genes, to build our dictionary. Gene lists per anatomy term were extracted from a Solr document store of gene expression data from the WS252 database provided by WormBase [8]. We used the Solr document store because it provided a convenient access to expression data that included inferred annotations. That is, for each anatomy term, the expression gene list includes genes that were directly annotated to the term, as well as those that were annotated to the term’s descendant terms (if there were any). Descendant terms were those connected with the focus term by is_a/part_of relationship chains defined in the anatomy term ontology hierarchy.

Filtering nodes

Defining a similarity metric

To identify redundant sisters, we defined the following similarity metric:

si=|gi||i=0kgi| 2

Where si is the similarity for a tissue i with k sisters; gi refers to the set of tissues associated with tissue i and |g| refers to the cardinality of set g. For a given set of sisters, we called them redundant if they exceeded a given similarity threshold. We envisioned two possible criteria and built different dictionaries using each one. Under a threshold criteron ‘any’ with parameter S between (0,1), a given set of sisters j was considered redundant if the condition

si,j>S 3

was true for any sister i in set j. Under a threshold criterion ‘avg’ with parameter S, a given set of sisters j was considered redundant if the condition

E[si]j>S 4

was true for the set of sisters j (see Fig. 1a).

Since nodes can have multiple parents (and therefore multiple sister sets), a complete set of similarity scores was calculated before trimming the ontology, and nodes were removed from the ontology if they exceeded the similarity threshold at least once in any comparison.

Implementation

All scripts were written in Python 3.5. Our software relies on the pandas, NumPy, Seaborn and SciPy modules to perform all statistical testing and data handling [3234].

Acknowledgements

We thank Justin Bois for his help and support. We would like to acknowledge all members of the Sternberg lab for helpful discussion.

Funding

This work was supported by HG002223 from the National Human Genome Research Institute to PWS., and H.H.M.I., with which PWS is an investigator. The funding agencies had no input into the design, execution or interpretation of our experiments, nor into the writing of this manuscript.

Availability of data and materials

Our web implementation is available at http://www.wormbase.org/tools/enrichment/tea/tea.cgi. Our software can also be downloaded using Python’s pip installer via the command

pip install tissue_enrichment_analysis

Alternatively, our software is available for download at: http://dangeles.github.io/TissueEnrichmentAnalysis

All benchmark gene sets, benchmarking code and Figures can also be found at the same address, under the ‘tests’ folder.

Authors’ contributions

DA and PWS conceived of the project; DA developed algorithm; RYL made intellectual contributions to the project; RYL and JC developed the web GUI. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Abbreviations

GO

Gene ontology

TEA

Tissue enrichment analysis

WBbt ID

A unique ID assigned to reference ontology terms

WBgene ID

A unique ID assigned to reference nematode genes

Additional files

Additional file 1 (161.1KB, pdf)

TEA Tutorial. Tutorial for users interested in using our software within a python script. (PDF 161 kb)

Additional file 2 (138.4KB, pdf)

Folder Structure for SI files 3 and 4. A file detailing the folder structure of the zipped folders 3 and 4. (PDF 138 kb)

Additional file 3 (74.4KB, zip)

Golden Gene Sets. A list of all the genes used for our benchmarking process. (ZIP 74 kb)

Additional file 4 (1.6MB, zip)

Results. A folder containing a complete version of the results we generated for this paper. (ZIP 1597 kb)

References

  • 1.The Gene Ontology Consortium Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(May):25–9. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.The Gene Ontology Consortium Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015;43(D1):D1049–56. doi: 10.1093/nar/gku1179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD. PANTHER version 7: Improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res. 2009;38(SUPPL.1):D204–10. doi: 10.1093/nar/gkp1019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010;28(5):495–501. doi: 10.1038/nbt.1630. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Huang DW, Lempicki Ra, Sherman BT. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57. doi: 10.1038/nprot.2008.211. [DOI] [PubMed] [Google Scholar]
  • 6.Pathan M, Keerthikumar S, Ang CS, Gangoda L, Quek CYJ, Williamson NA, Mouradov D, Sieber OM, Simpson RJ, Salim A, Bacic A, Hill AF, Stroud DA, Ryan MT, Agbinya JI, Mariadason JM, Burgess AW, Mathivanan S. FunRich: An open access standalone functional enrichment and interaction network analysis tool. Proteomics. 2015;15(15):2597–601. doi: 10.1002/pmic.201400515. [DOI] [PubMed] [Google Scholar]
  • 7.Lee RYN, Sternberg PW. Building a cell and anatomy ontology of Caenorhabditis elegans. Comp Funct Genomics. 2003;4(1):121–6. doi: 10.1002/cfg.248. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Howe KL, Bolt BJ, Cain S, Chan J, Chen WJ, Davis P, Done J, Down T, Gao S, Grove C, Harris TW, Kishore R, Lee R, Lomax J, Li Y, Muller H-M, Nakamura C, Nuin P, Paulini M, Raciti D, Schindelman G, Stanley E, Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Wright A, Yook K, Berriman M, Kersey P, Schedl T, Stein L, Sternberg PW. WormBase 2016: expanding to enable helminth genomic research. Nucleic Acids Res. 2016;44(November 2015):D774–80. doi: 10.1093/nar/gkv1217. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Huang DW, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane HC, Lempicki RA. DAVID Bioinformatics Resources: Expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 2007;35(SUPPL.2):W169–75. doi: 10.1093/nar/gkm415. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: Modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2013;41(D1):D377–86. doi: 10.1093/nar/gks1118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Kim JW, Caralt JC, Hilliard JK. Proceedings of the 40th Annual Hawaii International Conference on System Sciences. Washington: IEEE Computer Society; 2007. Pruning bio-ontologies. [Google Scholar]
  • 12.Garrido J, Requena I. Towards summarizing knowledge: Brief ontologies. Expert Syst Appl. 2012;39(3):3213–22. doi: 10.1016/j.eswa.2011.09.008. [DOI] [Google Scholar]
  • 13.Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc B. 1995;57(1):289–300. [Google Scholar]
  • 14.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics. 2005;21(13):3017–24. doi: 10.1093/bioinformatics/bti448. [DOI] [PubMed] [Google Scholar]
  • 16.Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100(16):9440–5. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Gaudet J, Muttumu S, Horner M, Mango SE. Whole-genome analysis of temporal gene expression during foregut development. PLoS Biol. 2004;2(11). [DOI] [PMC free article] [PubMed]
  • 18.Spencer WC, Zeller G, Watson JD, Henz SR, Watkins KL, McWhirter RD, Petersen S, Sreedharan VT, Widmer C, Jo J, Reinke V, Petrella L, Strome S, Stetina SEV, Katz M, Shaham S, Rätsch G, Miller DM. A spatial and temporal map of C. elegans gene expression. Genome Res. 2011;21(2):325–41. doi: 10.1101/gr.114595.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Cinar H, Keles S, Jin Y. Expression profiling of GABAergic motor neurons in Caenorhabditis elegans. Curr Biol. 2005;15(4):340–6. doi: 10.1016/j.cub.2005.02.025. [DOI] [PubMed] [Google Scholar]
  • 20.Watson JD, Wang S, Stetina SEV, Spencer WC, Levy S, Dexheimer PJ, Kurn N, Heath JD, Miller 3rd DM, Miller DM. Complementary RNA amplification methods enhance microarray identification of transcripts expressed in the C. elegans nervous system. BMC Genomics. 2008; 9:84. [DOI] [PMC free article] [PubMed]
  • 21.Pauli F, Liu Y, Kim Ya, Chen P-J, Kim SK. Chromosomal clustering and GATA transcriptional regulation of intestine-expressed genes in C. elegans. Development (Cambridge, England) 2006;133(2):287–95. doi: 10.1242/dev.02185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Portman DS, Emmons SW. Identification of C. elegans sensory ray genes using whole-genome expression profiling. Dev Biol. 2004;270(2):499–512. doi: 10.1016/j.ydbio.2004.02.020. [DOI] [PubMed] [Google Scholar]
  • 23.Fox RM, Watson JD, Stetina SEV, McDermott J, Brodigan TM, Fukushige T, Krause M, Miller 3rd DM, Miller DM. The embryonic muscle transcriptome of Caenorhabditis elegans. Genome Biol. 2007; 8(9):R188. [DOI] [PMC free article] [PubMed]
  • 24.Smith CJ, Watson JD, Spencer WC, O’Brien T, Cha B, Albeg A, Treinin M, Miller DM. Time-lapse imaging and cell-specific expression profiling reveal dynamic branching and molecular determinants of a multi-dendritic nociceptor in C. elegans. Dev Biol. 2010;345(1):18–33. doi: 10.1016/j.ydbio.2010.05.502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Engelmann I, Griffon A, Tichit L, Montaña-Sanchis F, Wang G, Reinke V, Waterston RH, Hillier LW, Ewbank JJ. A comprehensive analysis of gene expression changes provoked by bacterial and fungal infection in C. elegans. PLoS ONE. 2011;6(5):e19055. doi: 10.1371/journal.pone.0019055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Meisel JD, Kim DH. Behavioral avoidance of pathogenic bacteria by Caenorhabditis elegans. Trends Immunol. 2014;35(10):465–70. doi: 10.1016/j.it.2014.08.008. [DOI] [PubMed] [Google Scholar]
  • 27.Zhang Y, Lu H, Bargmann CI. Pathogenic bacteria induce aversive olfactory learning in Caenorhabditis elegans. Nature. 2005;438(7065):179–84. doi: 10.1038/nature04216. [DOI] [PubMed] [Google Scholar]
  • 28.Lee YS, Krishnan A, Zhu Q, Troyanskaya OG. Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies. Bioinformatics. 2013;29(23):3036–44. doi: 10.1093/bioinformatics/btt529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Prykhozhij SV, Marsico A, Meijsing SH. Zebrafish Expression Ontology of Gene Sets (ZEOGS): a tool to analyze enrichment of zebrafish anatomical terms in large gene sets. Zebrafish. 2013;10(3):303–15. doi: 10.1089/zeb.2012.0865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Chikina MD, Huttenhower C, Murphy CT, Troyanskaya OG. Global prediction of tissue-specific gene expression and context-dependent gene networks in Caenorhabditis elegans. PLoS Comput Biol. 2009;5(6):e1000417. doi: 10.1371/journal.pcbi.1000417. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Kelly WG, Xu S, Montgomery MK. A Fire. Distinct requirements for somatic and germline expression of a generally expressed Caenorhabditis elegans gene. Genetics. 1997;146(1):227–38. doi: 10.1093/genetics/146.1.227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.McKinney W. pandas: a Foundational Python Library for Data Analysis and Statistics. Python High Perform Sci Comput. 2011::1–9. http://pandas.pydata.org/talks.html.
  • 33.Van Der Walt S, Colbert SC, Varoquaux G. The NumPy array: A structure for efficient numerical computation. Comput Sci Eng. 2011;13(2):22–30. doi: 10.1109/MCSE.2011.37. [DOI] [Google Scholar]
  • 34.Oliphant TE. SciPy: Open source scientific tools for Python. Comput Sci Eng. 2007;9:10–20. doi: 10.1109/MCSE.2007.58. [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Our web implementation is available at http://www.wormbase.org/tools/enrichment/tea/tea.cgi. Our software can also be downloaded using Python’s pip installer via the command

pip install tissue_enrichment_analysis

Alternatively, our software is available for download at: http://dangeles.github.io/TissueEnrichmentAnalysis

All benchmark gene sets, benchmarking code and Figures can also be found at the same address, under the ‘tests’ folder.


Articles from BMC Bioinformatics are provided here courtesy of BMC

RESOURCES