Worm. 2016 Sep 19;5(4):e1234659. doi: 10.1080/21624054.2016.1234659

GExplore 1.4: An expanded web interface for queries on Caenorhabditis elegans protein and gene function

Harald Hutter 1, Jinkyo Suh 1
PMCID: PMC5190144  PMID: 28090394

ABSTRACT

Genetic high-throughput experiments often result in hundreds or thousands of genes satisfying certain experimental conditions. Grouping and prioritizing a large number of genes for further analysis can be a time-consuming challenge. In 2009 we developed a web-based user interface, GExplore, to assist with large-scale data-mining related to gene function in Caenorhabditis elegans. The underlying database contained information about Caenorhabditis elegans genes and proteins including domain organization of the proteins, phenotypic descriptions, expression data and Gene Ontology Consortium annotations. These data enable users to quickly obtain an overview of biological and biochemical functions of a large number of genes at once. Since its inception the underlying database has been updated and expanded significantly. Here we describe the current version, GExplore 1.4, documenting the changes since the original release. GExplore 1.4 now contains information about the domain organization of the proteomes of 9 nematode species, can display the location of Caenorhabditis elegans mutations with respect to the domain organization of the proteins, and includes stage-specific RNAseq gene expression data generated by the modENCODE project. The underlying database has been reorganized to facilitate independent updates of the different parts of the database and to allow the addition of novel data sets in the future. The web interface is available at http://genome.sfu.ca/gexplore.

KEYWORDS: C. elegans, database, expression, gene function, phenotype, proteome

Introduction

Whole genome sequences are now available for a large number of organisms but many genes in eukaryotic genomes remain uncharacterized. While the potential biochemical function of proteins with sequence similarity to characterized proteins can be predicted based on the sequence alone, the biological function within the organism is not immediately obvious. In addition, every genome contains genes that lack any similarity to characterized proteins, so that not even a biochemical function can be predicted.1-4 Functional characterization of the predicted genes in a genome is a serious bottleneck even in established genetic model organisms such as the nematode Caenorhabditis elegans.

Sequence similarity to known proteins, expression and phenotype data are the most important pieces of information used to predict the functions of a protein. For C. elegans these data are available through Wormbase,5 the major database for C. elegans genes. The main user interface of Wormbase is gene-centered, i.e. the user can search for one gene at a time and information related to a particular gene of interest is displayed on a single result page. Graphical displays in Wormbase focus on the DNA level, and graphical display options for information at the protein level are limited. A search interface called WormMine5 is available for large-scale data mining and searches across different data sets within Wormbase. WormMine offers a sophisticated query builder and provides application programming interfaces (APIs) for several programming languages. Only selected data sets are available through WormMine; for example, phenotype information about alleles is currently available, but RNAi phenotypes for genes are not. Several years ago we developed GExplore,6 a tool that provides a simple and fast search and display interface that allows a multi-gene display of data sets relevant to gene and protein function. Searches are executed quickly even for thousands of genes so that results can be surveyed immediately and search parameters modified if necessary. GExplore includes a graphic display interface for protein domains and gene expression and contains phenotypic data as well as Gene Ontology (GO) annotations. GExplore can assist researchers in the planning and analysis of high-throughput experiments by helping them to quickly prioritize genes for further study or to select groups of genes based on shared phenotypes or expression data.

GExplore can also help in the analysis of large-scale experiments. A common way to identify genes regulated by a particular transcription factor involves comparing genome-wide expression profiles of a transcription factor mutant with wild type. Such comparisons can yield several hundreds of genes that are differentially expressed. Finding relevant candidate genes among such a large set of genes is not trivial. GExplore allows researchers to quickly survey a large number of candidate genes in order to select or eliminate genes for further analysis.

The GExplore database is based on data downloaded from Wormbase. Since the original version was established in 2009, new sequence and expression data sets have become available and phenotypic and expression information has accumulated. High quality gene predictions for additional nematode genomes have been generated.7-9 The Million Mutation Project provided more than 800,000 novel mutations in C. elegans genes,10 and the modENCODE project11 among other things created new gene expression data sets (Boeck et al. submitted). Here we describe the current version of GExplore (version 1.4), which has changed considerably compared to the original version. We have added information about the domain organization of the proteomes of 8 additional nematode species, mapped the location of C. elegans mutations onto the domain organization of the proteins and included stage-specific expression data recently generated by the modENCODE project. The underlying database has been reorganized to facilitate independent updates of the different parts of the database and to allow the addition of novel data sets in the future.

Results and discussion

The user interface: Search pages

GExplore contains 4 main search pages for the different data sets: “Proteins,” “Mutations,” “Genes” and “Expression” (see Fig. 1). Details of the underlying databases are described in later sections. The search pages contain several search fields with links to help pages that explain the search terms that can be used. The “compare” page can be used to compare the output of different searches to identify common and unique genes.
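The kind of comparison offered by the "compare" page can be illustrated with plain set operations. The following is only a minimal sketch of the idea, not a description of how GExplore implements it; the function name `compare_gene_sets` and the gene lists are invented for illustration.

```python
def compare_gene_sets(search_a, search_b):
    """Split two gene lists into common genes and genes unique to each list."""
    set_a, set_b = set(search_a), set(search_b)
    return {
        "common": sorted(set_a & set_b),   # found by both searches
        "only_a": sorted(set_a - set_b),   # found only by the first search
        "only_b": sorted(set_b - set_a),   # found only by the second search
    }

# Invented example lists standing in for the output of two searches.
result = compare_gene_sets(["gene-1", "gene-2", "gene-3"], ["gene-2", "gene-4"])
```

Sorting the three groups simply makes the output reproducible; the comparison itself does not depend on gene order.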

Figure 1.

Overview of the data in GExplore.

The user interface: Display pages

Several display parameters can be selected on the search pages, but display options can also be modified on the page where the results are displayed. This allows the user to quickly display different data for the retrieved set of genes without having to repeat the entire search process. Results are displayed in table format, and modifying display options adds or removes columns from the result table. A “text only” display option produces TAB-delimited raw text output that can be downloaded for further processing.
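Downloaded TAB-delimited output can be processed with standard tools. The sketch below uses Python's csv module and assumes that the first line of the file is a header row; the column names and values shown are invented, and the actual columns depend on the display options selected.

```python
import csv
import io

def read_tab_delimited(text):
    """Parse TAB-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]

# Invented sample; real column names and values will differ.
sample = "gene\tphenotype\ngene-1\tuncoordinated\ngene-2\tlethal\n"
rows = read_tab_delimited(sample)

# Example of further processing: select genes by the value in one column.
lethal = [row["gene"] for row in rows if row["phenotype"] == "lethal"]
```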

Data sets: Proteomes

GExplore 1.4 contains several new data sets provided by Wormbase (Fig. 1). We included proteomes from all 9 species considered to be “core species” in Wormbase release 250. These are Caenorhabditis elegans, Caenorhabditis brenneri, Caenorhabditis briggsae, Caenorhabditis japonica, Caenorhabditis remanei, Brugia malayi, Pristionchus pacificus, Onchocerca volvulus and Strongyloides ratti. The protein data can be accessed through a new “Proteins” search interface, which includes search options for gene or protein name, protein domains, domain patterns and Gene Ontology (GO) annotations (Fig. 2A). GO annotations for all species other than C. elegans are mainly based on sequence similarity to known proteins as direct functional information is largely unavailable for these species. While the previous versions of GExplore only contained the largest splice variant for each gene, all splice variants are included in GExplore 1.4. Users can choose to show only the largest splice variant or all splice variants. Showing only the largest splice variant provides an easy way to count the number of different genes satisfying the search criteria. The results page (Fig. 2A) shows either a graphical representation of the domain organization of the proteins or the actual protein sequence. Output can be displayed as “text only” to allow download of protein sequences in FASTA format for further local processing or analysis.
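Counting genes rather than isoforms amounts to keeping one representative per gene. The following sketch keeps the longest protein for each gene and writes the result in FASTA format; it is an illustration under assumed inputs, not GExplore's own code. The record layout (gene, isoform, sequence), the header format and the short sequences are invented.

```python
def largest_variant_per_gene(proteins):
    """Keep the longest isoform for each gene.

    proteins: iterable of (gene, isoform, sequence) tuples.
    Returns a dict mapping gene -> (isoform, sequence).
    """
    best = {}
    for gene, isoform, sequence in proteins:
        current = best.get(gene)
        if current is None or len(sequence) > len(current[1]):
            best[gene] = (isoform, sequence)
    return best

def to_fasta(records, width=60):
    """Format gene -> (isoform, sequence) records as FASTA text."""
    lines = []
    for gene, (isoform, sequence) in sorted(records.items()):
        lines.append(f">{isoform} gene={gene}")
        # Wrap the sequence at a fixed line width.
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"

# Invented isoforms and sequences for illustration only.
proteins = [
    ("gene-1", "gene-1a", "MKTAYIAK"),
    ("gene-1", "gene-1b", "MKTAYIAKQRQISF"),
    ("gene-2", "gene-2a", "MSLLT"),
]
largest = largest_variant_per_gene(proteins)
fasta = to_fasta(largest)
```

With one entry per gene, the number of records equals the number of distinct genes satisfying the search.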

Figure 2.

The “Proteins” and “Mutations” search and display interfaces. (A) The top part shows the “Proteins” search interface with a query in the domain pattern field (“TM* KIN”). The bottom part shows the corresponding results page (truncated). (B) The top part shows the “Mutations” search interface with a query in the gene search field (“unc-30”). The bottom part shows the corresponding results page (truncated).

Domain representations are based on analysis by SMART.12,13 SMART uses its own collection of predicted domains in addition to Pfam14 domain predictions. GExplore integrates Pfam and SMART domains for display purposes. For example, all SMART and Pfam domains reflecting “protein kinase” (SM00219, SM00220, SM00221, PF00069, PF07714) are displayed as a “KIN” domain. In GExplore 1.0 we assigned domain names such as “KIN” to frequently occurring domains, grouping closely related domains under one name. We have expanded the set of protein domains for this integrated display to cover all domains present in at least 10 C. elegans genes, for a total of 420 domains. All other domains are displayed with their Pfam or SMART identifier, and proteins containing these domains can be retrieved using these identifiers in the domain search field. Pfam or SMART identifiers can also be used to retrieve a subset of the domains we grouped. For example, searching for protein kinases using “KIN” retrieves 411 genes. Searching for the underlying individual SMART identifiers results in 66 genes for SM00219 (tyrosine kinases), 145 genes for SM00220 (serine/threonine kinases) and 195 genes for SM00221 (kinases of unclassified specificity). In addition to searching for individual domains, GExplore allows more complex searches for proteins with a particular domain organization. For example, searching for “TM* KIN” in the “domain pattern” search field retrieves all proteins containing a transmembrane domain (TM) followed by any number of unspecified domains (*) followed by a kinase domain (KIN), i.e., putative receptor kinases.
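One way to think about a domain pattern such as “TM* KIN” is as a regular expression over the ordered list of domain names of a protein. The sketch below is a hedged illustration of that idea only: the full pattern grammar of GExplore is not described here, so the interpretation of a trailing "*" (the named domain followed by zero or more arbitrary domains) is inferred from the single example above, and the function names and domain lists are invented.

```python
import re

def pattern_to_regex(pattern):
    """Translate a space-separated domain pattern into a compiled regex.

    Assumed grammar: a token ending in '*' means that domain followed by
    zero or more arbitrary domains; any other token means exactly that domain.
    """
    parts = []
    for token in pattern.split():
        if token.endswith("*"):
            name = re.escape(token[:-1])
            parts.append(rf"{name} (?:\S+ )*")
        else:
            parts.append(re.escape(token) + " ")
    # The pattern may start at the first domain or after any other domain.
    return re.compile(r"(?:^| )" + "".join(parts))

def matches(domains, pattern):
    """Return True if the ordered domain list contains the pattern."""
    text = " ".join(domains) + " "  # trailing space terminates the last name
    return pattern_to_regex(pattern).search(text) is not None

# Invented domain architectures, listed from N-terminus to C-terminus.
receptor_like = ["SIG", "IG", "TM", "KIN"]
cytoplasmic = ["SH2", "KIN"]
```

Here `matches(receptor_like, "TM* KIN")` is true and `matches(cytoplasmic, "TM* KIN")` is false, because the second protein has no transmembrane domain preceding the kinase domain.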

When several different domains are predicted for the same part of the protein, only the one with the best fit to the underlying Hidden Markov Model is shown. Overlapping or otherwise hidden domains are not included in the database, with 2 exceptions. Hidden signal peptides and hidden transmembrane domains that are part of larger domains such as channels or transporters have been included in the list of domains present in a protein, even though they are not displayed. This means a search for “TM” (transmembrane domain) will retrieve all transmembrane proteins including those with hidden transmembrane domains. Information about the presence and localization of hidden domains not shown in GExplore can be obtained by analyzing the protein sequence in SMART.
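The rule described above can be sketched as a greedy selection. This is an illustration under stated assumptions rather than the procedure actually used: the E-value is assumed as the measure of fit (lower is better), any hidden signal peptide ("SIG") or transmembrane domain ("TM") is kept as searchable regardless of what overlaps it, and all names, coordinates and scores are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    name: str
    start: int      # first residue, 1-based
    end: int        # last residue, inclusive
    evalue: float   # assumed measure of fit; lower means a better fit

def overlaps(a, b):
    """True if the two domains share at least one residue."""
    return a.start <= b.end and b.start <= a.end

def resolve(domains, always_searchable=("SIG", "TM")):
    """Return (shown, searchable) domain lists ordered by position."""
    shown = []
    # Visit the best-fitting predictions first; skip any that overlap a kept one.
    for domain in sorted(domains, key=lambda d: d.evalue):
        if not any(overlaps(domain, kept) for kept in shown):
            shown.append(domain)
    shown.sort(key=lambda d: d.start)
    # Hidden SIG/TM predictions stay searchable although they are not displayed.
    hidden = [d for d in domains if d not in shown and d.name in always_searchable]
    searchable = sorted(shown + hidden, key=lambda d: d.start)
    return shown, searchable

# Invented predictions for one protein.
predictions = [
    Domain("CHANNEL", 10, 300, 1e-50),
    Domain("TM", 50, 72, 1e-5),
    Domain("OTHER", 20, 100, 1e-3),
]
shown, searchable = resolve(predictions)
```

In this example only the channel domain is displayed, while the hidden transmembrane domain remains in the searchable list, so a search for “TM” would still retrieve the protein.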

Data sets: Mutations

A large number of mutations in C. elegans genes have been isolated. Recently, the Million Mutation Project has identified more than 800,000 new mutations in more than 20,000 C. elegans genes with more than 13 novel non-synonymous alleles per gene.10 GExplore contains a search interface, “Mutations,” to identify the location of mutations in a list of candidate genes (Fig. 2B). The search can be restricted to mutations of a particular type (e.g., nonsense, missense, deletion) or of a particular origin (e.g., Million Mutation Project). The results page (Fig. 2B) displays the location of the mutations mapped onto the domain organization of the protein. In sequence display mode, the mutation is shown in the context of the protein sequence. This allows the user to quickly identify mutations affecting a particular part of the protein and to quickly identify putative null alleles or strong loss-of-function alleles. It is important to note that the “Mutations” search interface covers only alleles where the mutation is in the coding part of the gene and where the molecular nature of the mutation is known (and included in Wormbase). To identify all mutations in a given gene, including those that are uncharacterized at the molecular level, the main “Genes” search interface should be used.
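Mapping mutations onto the domain organization is, at its core, an interval lookup. The following minimal sketch assumes that each domain is given as a (name, start, end) tuple in protein coordinates and each mutation as an (allele, position, type) tuple; the allele names, positions and domain boundaries are invented for illustration.

```python
def locate_mutations(domains, mutations):
    """Report the domain, if any, that each mutation falls into.

    domains:   list of (name, start, end), 1-based inclusive residues.
    mutations: list of (allele, position, mutation_type).
    Returns a list of (allele, mutation_type, position, domain_or_None).
    """
    located = []
    for allele, position, mutation_type in mutations:
        hit = next(
            (name for name, start, end in domains if start <= position <= end),
            None,  # the mutation lies outside all listed domains
        )
        located.append((allele, mutation_type, position, hit))
    return located

# Invented domain layout and alleles.
domains = [("SIG", 1, 20), ("HOX", 95, 155)]
mutations = [("allele-1", 120, "missense"), ("allele-2", 40, "nonsense")]
located = locate_mutations(domains, mutations)

# Nonsense alleles are natural candidates when looking for strong loss of function.
nonsense = [entry for entry in located if entry[1] == "nonsense"]
```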

Data sets: C. elegans genes

The core data sets containing expression and phenotypic information about C. elegans genes can be accessed through the “Genes” search interface (Fig. 3A). Since the initial release of GExplore the core data sets on C. elegans genes in GExplore have been updated regularly using the latest available release of data. Newly added data include information about homologs in selected other model organisms (Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus and Homo sapiens) and the other core nematode species in Wormbase (see above). The search interface has been expanded so that C. elegans homologs in those organisms can be identified. We removed the original expression data sets that were based on microarray, SAGE and older RNAseq data. This was done mainly because the underlying raw data were taken from the original publications and were never re-mapped to the current gene predictions making them somewhat outdated. However, all previous versions of GExplore including all versions of the datasets are still available online through a link at the bottom of the “About” page. A new data set with stage-specific expression profiles based on recent RNAseq data has been added in a separate section instead (see Expression data below). In addition the option to search C. elegans literature has been removed, because Textpresso15 provides an excellent search tool for literature data mining.

Figure 3.

The “Genes” and “Expression” search and display interfaces. (A) The top part shows the “Genes” search interface with a query in the expression search field (“muscle”). The bottom part shows the corresponding results page (truncated). (B) The top part shows the “Expression” search interface with a query for maternally enriched genes. The bottom part shows the corresponding results page (truncated).

Data sets: Expression data

GExplore now contains quantitative genome-wide expression data for the different developmental stages of C. elegans, which were obtained by analyzing RNAseq data as part of the NHGRI modENCODE project (Boeck et al. submitted). Quantitative expression data for 18 embryonic stages, all 4 postembryonic larval stages, adult, adult soma, gonad, L4 males and dauer larvae (including dauer entry and dauer exit) are available. The RNAseq data have been mapped to gene models from Wormbase release 220. The “Expression” search interface allows users to retrieve the expression data for a list of target genes or to identify genes by their expression characteristics (Fig. 3B). As search parameters, expression levels in “depth of coverage per million reads” (dcpm) can be specified either in isolation (e.g. “>2” or “<2” or “2–10”) or in comparison with any other developmental stage (e.g., “more than 2x higher in adult compared to L4”). For example, maternally enriched genes can be identified by searching for genes that are highly expressed in the early embryo and upregulated in the adult compared to the L4 stage. Resulting expression profiles are shown in table format (Fig. 3B) and can be downloaded as plain text for further analysis. In addition, graphical displays of the expression profiles are available.
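The two kinds of search parameters can be sketched as simple predicates on a per-gene expression profile. This is an illustration of the logic only, not GExplore's query code. The stage keys, the dcpm values and the cutoff of 10 dcpm for "highly expressed" are invented, and treating a range such as “2–10” as inclusive at both ends is an assumption.

```python
def parse_level(expression):
    """Turn '>2', '<2' or '2-10' into a predicate on a dcpm value."""
    text = expression.strip().replace("\u2013", "-")  # accept an en dash in ranges
    if text.startswith(">"):
        low = float(text[1:])
        return lambda value: value > low
    if text.startswith("<"):
        high = float(text[1:])
        return lambda value: value < high
    low, high = (float(part) for part in text.split("-", 1))
    return lambda value: low <= value <= high  # inclusive bounds are an assumption

def fold_higher(profile, stage, reference, factor):
    """True if expression in `stage` is more than `factor` times that in `reference`."""
    return profile[stage] > factor * profile[reference]

# Invented dcpm values; stage names are placeholder keys.
profiles = {
    "gene-a": {"early_embryo": 50.0, "L4": 4.0, "adult": 20.0},
    "gene-b": {"early_embryo": 0.5, "L4": 10.0, "adult": 11.0},
}

highly_expressed = parse_level(">10")
maternal = [
    gene for gene, profile in profiles.items()
    if highly_expressed(profile["early_embryo"])
    and fold_higher(profile, "adult", "L4", 2)
]
```

Combining an absolute threshold in the early embryo with a relative adult-versus-L4 comparison mirrors the maternally enriched query described above.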

Future updates

GExplore uses data downloaded from Wormbase. The two main ways to retrieve data from Wormbase are through files available on Wormbase's FTP server and through WormMine. The underlying databases and accompanying search interfaces in GExplore have been reorganized to reflect the different sources, which are updated at different intervals. A substantial fraction of the data on C. elegans genes is downloaded using WormMine, which is not updated with every new Wormbase release. On the other hand, proteome data accessible through the “Proteins” search interface are directly downloaded from Wormbase's FTP server. These data are easily accessible for every Wormbase data release and consequently will be updated in GExplore more frequently and independently of the other databases. The stage-specific RNAseq dataset accessible through the “Expression” search page will remain static, as the underlying data will not change.

GExplore in comparison to other nematode databases

Information about nematode genomes, genes and proteins can be found in a few databases with web-based user interfaces. Most databases focus on genome sequence data as well as gene and protein sequence data. Nematode.net16 hosts genome and gene expression data on a growing number of nematodes and provides numerous data mining tools with an emphasis on transcriptome data. The site is continuously expanding and most recently has been integrated into Helminth.net,17 covering an even wider range of species and an increasing number of databases. Independently, NEMBASE18 hosts a similar range of data, mainly genome sequence information and expressed sequence tags (ESTs) from a growing number of nematode species that can be queried by sequence identifier or via a BLAST interface.

For most nematode species data on gene function are limited. By contrast, for the best-studied nematode, C. elegans, a functional annotation based on mutant or RNAi phenotypes is available for about 30% of the genes. Functional annotation and other information about C. elegans genes can be found on Wormbase, the leading database for genome data in C. elegans.5 Wormbase has a gene-centric main user interface but lacks a simple search interface that allows the user to display information about a large number of genes on a single results page. GExplore provides such an interface. In addition, GExplore provides a graphic display of the domain organization of proteins and of the location of mutations in relation to protein domains. Taken together, GExplore complements the user interfaces provided by Wormbase by displaying selected data sets relevant to analyzing the function of C. elegans genes.

Material and methods

Origin of data and data processing

Proteome data

Protein sequences were downloaded from the Wormbase FTP server (ftp://ftp.wormbase.org/pub/wormbase/) using release WS250. Domain predictions were done using SMART with the help of a batch processing script provided on the SMART website (http://smart.embl-heidelberg.de/help/SMART_batch.pl). Custom Perl scripts were used to further process the raw SMART data files.

Mutation data

The gff annotation file “c_elegans.PRJNA13758.WS250.annotations.gff3” was downloaded from the Wormbase FTP server (ftp://ftp.wormbase.org/pub/wormbase/). Information on the nature and localization of C. elegans mutations was collected from this file using custom Perl scripts.
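The scripts used for this step were written in Perl and are not reproduced here. As a rough illustration of what reading such a file involves, the following Python sketch parses a single line in the generic GFF3 layout (9 TAB-separated columns, with the last column holding key=value attributes separated by semicolons). The feature type and attribute names in the example line are invented and do not reflect the contents of the Wormbase file.

```python
from urllib.parse import unquote

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for other lines."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line, comment or directive
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return None  # not a well-formed feature line
    seqid, source, feature_type, start, end, score, strand, phase, attrs = columns
    attributes = {}
    for field in attrs.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return {
        "seqid": seqid,
        "source": source,
        "type": feature_type,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }

# Invented example line; real feature types and attributes will differ.
example = "I\texample_source\texample_feature\t100\t200\t.\t+\t.\tID=x1;Note=a%20b"
feature = parse_gff3_line(example)
```

Filtering the parsed features by type and collecting their coordinates and attributes would then yield a table of mutations.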

Information on C. elegans genes

Information about location, description, phenotype, expression and GO annotation for all C. elegans genes was downloaded using WormMine (http://www.wormbase.org/tools/wormmine/begin.do), which provided data from Wormbase release WS250 at the time of download. Data were assembled into database tables using custom Perl scripts.

RNAseq expression data

Stage-specific expression data and images providing a graphic display of expression profiles were provided by the Waterston lab. A brief description of how the expression data were generated can be found on the help pages of the GExplore website.

Disclosure of potential conflicts of interest

No potential conflicts of interest were disclosed.

Acknowledgments

We would like to thank members of the Hutter lab for critically reading the manuscript.

Funding

Funding for this research was provided by a grant from the Natural Sciences and Engineering Research Council of Canada to HH (grant no 312498).

References

  • [1]. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998; 282:2012-8; PMID:9851916; http://dx.doi.org/10.1126/science.282.5396.2012
  • [2]. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science 2000; 287:2185-95; PMID:10731132; http://dx.doi.org/10.1126/science.287.5461.2185
  • [3]. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921; PMID:11237011; http://dx.doi.org/10.1038/35057062
  • [4]. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al. Initial sequencing and comparative analysis of the mouse genome. Nature 2002; 420:520-62; PMID:12466850; http://dx.doi.org/10.1038/nature01262
  • [5]. Harris TW, Baran J, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, Done J, Grove C, Howe K, et al. WormBase 2014: new views of curated biology. Nucleic Acids Res 2014; 42:D789-93; PMID:24194605; http://dx.doi.org/10.1093/nar/gkt1063
  • [6]. Hutter H, Ng MP, Chen N. GExplore: a web server for integrated queries of protein domains, gene expression and mutant phenotypes. BMC Genomics 2009; 10:529; PMID:19917126; http://dx.doi.org/10.1186/1471-2164-10-529
  • [7]. Gupta BP, Sternberg PW. The draft genome sequence of the nematode Caenorhabditis briggsae, a companion to C. elegans. Genome Biol 2003; 4:238; PMID:14659008; http://dx.doi.org/10.1186/gb-2003-4-12-238
  • [8]. Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol 2003; 1:E45; PMID:14624247; http://dx.doi.org/10.1371/journal.pbio.0000045
  • [9]. Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, Dinkelacker I, Fulton L, Fulton R, Godfrey J, Minx P, et al. The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nat Genet 2008; 40:1193-8; PMID:18806794; http://dx.doi.org/10.1038/ng.227
  • [10]. Thompson O, Edgley M, Strasbourger P, Flibotte S, Ewing B, Adair R, Au V, Chaudhry I, Fernando L, Hutter H, et al. The million mutation project: a new approach to genetics in Caenorhabditis elegans. Genome Res 2013; 23:1749-62; PMID:23800452; http://dx.doi.org/10.1101/gr.157651.113
  • [11]. Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 2010; 330:1775-87; PMID:21177976; http://dx.doi.org/10.1126/science.1196914
  • [12]. Letunic I, Doerks T, Bork P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res 2014; 43:D257-60; PMID:25300481; http://dx.doi.org/10.1093/nar/gku949
  • [13]. Schultz J, Milpetz F, Bork P, Ponting CP. SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 1998; 95:5857-64; PMID:9600884; http://dx.doi.org/10.1073/pnas.95.11.5857
  • [14]. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al. The Pfam protein families database. Nucleic Acids Res 2010; 38:D211-22; PMID:19920124; http://dx.doi.org/10.1093/nar/gkp985
  • [15]. Muller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004; 2:e309; PMID:15383839; http://dx.doi.org/10.1371/journal.pbio.0020309
  • [16]. Martin J, Abubucker S, Heizer E, Taylor CM, Mitreva M. Nematode.net update 2011: addition of data sets and tools featuring next-generation sequencing data. Nucleic Acids Res 2012; 40:D720-8; PMID:22139919; http://dx.doi.org/10.1093/nar/gkr1194
  • [17]. Martin J, Rosa BA, Ozersky P, Hallsworth-Pepin K, Zhang X, Bhonagiri-Palsikar V, Tyagi R, Wang Q, Choi YJ, Gao X, et al. Helminth.net: expansions to Nematode.net and an introduction to Trematode.net. Nucleic Acids Res 2014; 43:D698-706; PMID:25392426
  • [18]. Parkinson J, Whitton C, Schmid R, Thomson M, Blaxter M. NEMBASE: a resource for parasitic nematode ESTs. Nucleic Acids Res 2004; 32:D427-30; PMID:14681449; http://dx.doi.org/10.1093/nar/gkh018

Articles from Worm are provided here courtesy of Taylor & Francis
