Abstract
We developed the Gene Annotation Easy Viewer (GAEV), which integrates the gene annotation data from the KEGG (Kyoto Encyclopedia of Genes and Genomes) Automatic Annotation Server. GAEV generates an easy-to-read table that summarizes the query gene name, the KO (KEGG Orthology) number, the name of the gene ortholog, the functional definition of the ortholog, and the functional pathways that the query gene has been mapped to. Via links to KEGG pathway maps, users can directly examine the interactions between gene products involved in the same molecular pathway. We provide a usage example by annotating the newly published genome of the freshwater microcrustacean Daphnia pulex. This gene-centered view of gene function and pathways will facilitate the genome annotation of non-model species and of metagenomics data. GAEV runs on a Windows or Linux system equipped with Python 3 and is accessible to users with no prior Unix command line experience.
Keywords: molecular pathway, Daphnia, genome annotation, visualization, homologous genes
Introduction
When assembling a draft genome de novo, describing the biological function of computationally annotated genes and the molecular pathways formed by their products is critical for identifying the genetic basis of the unique biological attributes (e.g., physiology, life history, behavior) of the species in question. Computational searches against DNA/protein databases, e.g., NCBI Blast (Boratyn et al., 2013), UniProt (Bateman et al., 2017), and InterPro (Finn et al., 2017), based on homology and protein domain information and using tools such as Blast (Camacho et al., 2009), InterProScan (Jones et al., 2014), and Hmmer (Mistry et al., 2013), can predict the functions of individual genes. In contrast, delineating the molecular pathways encoded by the entire suite of genes of a single species is a much more challenging task, especially for non-model species. Mapping genes to the molecular pathways derived from intensively studied model organisms provides an entry point for addressing this need.
For mapping genes into known molecular pathways, the Kyoto Encyclopedia of Genes and Genomes (KEGG) provides comprehensive web services (Kanehisa et al., 2017; Kanehisa & Goto, 2000; Kanehisa et al., 2016a). KEGG is an integrated database for the biological interpretation of genome sequences. The molecular function of genes is classified using ortholog groups, i.e., KEGG Orthology (KO). KEGG also contains KEGG pathways, BRITE hierarchies, and KEGG modules, all of which are networks of KO nodes. The molecular functions of a set of genes from a complete or partial genome assembly or a metagenomics dataset, together with their encoded molecular pathways, can be annotated using the KEGG automatic annotation services provided through the web servers BlastKOALA and GhostKOALA (Kanehisa et al., 2016b). For a non-model species, we can use the KAAS (KEGG Automatic Annotation Server) web services to annotate the complete set or a random subset of genes, describing their molecular functions and mapping them into identified molecular pathways. The annotation results consist of KO numbers for each gene, genes mapped to the KEGG pathway database, and genes mapped to BRITE. Nonetheless, the resulting complete set of pathways and BRITE hierarchies can only be viewed through the temporary URL provided by KEGG, which is only available for several days after the analyses are completed. Although these results are organized through either curated KEGG pathways or BRITE hierarchies, KAAS does not provide an integrative gene-centered view of gene function and pathways, i.e., a complete summary of gene function and all associated molecular pathways for each gene.
Integrating gene function annotation based on KEGG orthology with KEGG pathways can provide an efficient way to characterize both the predicted genes and the associated pathways of a newly assembled genome or metagenomics dataset. Despite numerous computational packages for retrieving KEGG pathways using the API provided by the KEGG database (e.g., Moutselos et al., 2009; Wrzodek et al., 2011), none of these packages, to the best of our knowledge, allows us to reconstruct the complete set of molecular pathways contained in a newly assembled genome. To provide a means of utilizing the highly informative resources at KEGG for annotating genomic sequences and molecular pathways of non-model species, we have developed the Gene Annotation Easy Viewer (GAEV), which integrates the results of KEGG orthology annotation and KEGG pathway mapping using the KEGG API in both Windows and Linux environments. GAEV is implemented in Python 3 and can be used as an independent package.
Methods
When the KEGG ortholog (K) number of a gene is known, the corresponding KO information can be retrieved from the KEGG database through the KEGG REST-style API. GAEV uses the ‘get’ operation of the KEGG API to extract data on the gene and linked pathways for every K number provided in the input file. The data extracted from the KEGG database are stored in data files that can later be loaded into GAEV to skip the data extraction step (Figure 1).
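As a minimal sketch of this retrieval step (not GAEV's actual code), the Python fragment below requests one KO entry with the ‘get’ operation and splits the returned flat file into fields. The function names fetch_ko_entry and parse_flat_file are hypothetical, the base URL https://rest.kegg.jp/get/ is assumed to be the current address of the ‘get’ operation, and the parser assumes the usual KEGG flat-file layout in which a field keyword occupies the first 12 columns, continuation lines are indented, and an entry ends with ‘///’.

```python
from urllib.request import urlopen

# Assumed base URL of the KEGG REST "get" operation.
KEGG_GET_URL = "https://rest.kegg.jp/get/{}"


def fetch_ko_entry(k_number):
    """Download the flat-file record of one K number (requires network access)."""
    with urlopen(KEGG_GET_URL.format(k_number)) as response:
        return response.read().decode("utf-8")


def parse_flat_file(text):
    """Split a KEGG flat-file record into {keyword: [content lines]}.

    Assumes the keyword sits in columns 1-12 and that lines with a blank
    keyword column continue the preceding field.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):  # end-of-entry marker
            break
        if not line.strip():
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
        if current is not None:
            fields.setdefault(current, []).append(value)
    return fields
```

With network access, parse_flat_file(fetch_ko_entry("K00001")) would return the fields of that entry; the linked pathways, when present, are expected under the PATHWAY keyword. Field names can change between KEGG releases, so the sketch does not hard-code which keyword holds the gene name or definition.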
Once data extraction from KEGG’s database is complete and the data file is generated, GAEV helps the user handle and visualize the data by exporting them as a table in an HTML file. GAEV populates the table with the user-defined gene ID and the associated K number, both taken from the input file, as well as the gene name, definition, and linked pathways retrieved from the KEGG database. The linked pathway map URLs that highlight the genes identified in the genome assembly are created using the following pattern: http://www.kegg.jp/kegg-bin/show_pathway?map=[mapno]&multi_query=%23bfffbf%0d%0a[k-num1]+%23bfffbf%0d%0a[k-num2]+...%23bfffbf%0d%0a[k-num_interest]+%23[node_color],%23[font_color].
In the above URL, [mapno] represents the pathway accession number. [k-num{1,2,3…}] represents the K number for each gene in the pathway that is present in the provided genome assembly, and [k-num_interest] represents the K number of the focal gene that will be highlighted with a unique color. [node_color] and [font_color] represent the desired color of the focal gene’s node and font on the pathway map, respectively. By default, the node color of the focal gene is dark red, whereas the node color of other genes in the same pathway that are present in the genome assembly is light green.
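The URL pattern above can be assembled with a few lines of Python. This is a sketch under stated assumptions rather than GAEV's own implementation: the function name build_pathway_url and its parameters are hypothetical, and the default hex values chosen here for the dark red node (8b0000) and the font (ffffff) are illustrative guesses, since the text does not give the exact codes. Only the light green value bfffbf is taken from the pattern itself.

```python
LIGHT_GREEN = "bfffbf"  # colour used in the URL pattern for non-focal genes


def build_pathway_url(mapno, present_k_numbers, focal_k_number,
                      node_color="8b0000", font_color="ffffff"):
    """Assemble a KEGG pathway map URL following the pattern in the text.

    present_k_numbers: K numbers of the pathway found in the assembly.
    focal_k_number:    K number to highlight with its own colours.
    node_color / font_color: hex codes without '#'; the defaults are
    assumptions made for illustration only.
    """
    others = [k for k in present_k_numbers if k != focal_k_number]
    # "%23" is a URL-encoded '#', "%0d%0a" a URL-encoded line break.
    query = "%23" + LIGHT_GREEN + "%0d%0a"
    for k_number in others:
        query += k_number + "+%23" + LIGHT_GREEN + "%0d%0a"
    query += focal_k_number + "+%23" + node_color + ",%23" + font_color
    return ("http://www.kegg.jp/kegg-bin/show_pathway?map=" + mapno
            + "&multi_query=" + query)


# Illustrative call; the map and K numbers are placeholders.
print(build_pathway_url("map00010", ["K00001", "K00002"], "K00002"))
```

Placing the focal gene last mirrors the pattern above, where [k-num_interest] closes the query string.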
Use cases
Installation
The most up-to-date version of this software can be downloaded at https://github.com/UtaDaphniaLab/kegg_path_generator. This software requires Python 3 to run. It is recommended that this software be used as a standalone program, either by double-clicking on GAEV.py or by using the ‘python3 GAEV.py’ command.
Annotation
We analyze the newly published Daphnia pulex genome (Ye et al., 2017) to demonstrate the usage of our package. The required input file for our package contains two columns. The first column contains the gene names, whereas the second column contains the KO (KEGG orthology) numbers (Figure 2, Supplementary File 1). The KO numbers for the entire set of genes can be obtained through the KEGG Automatic Annotation Server. Briefly, users provide the query protein sequences in a fasta file and use one of the provided search algorithms (e.g., Blast, GhostX, GhostZ) to assign KO numbers to each of the queried genes. At the end of this analysis, the user receives via email a link to the result page, where the query result can be downloaded. The downloaded query result can be used directly as the input file for our package, even when some genes have not been assigned a KO number; such genes are automatically excluded from further analysis.
With the obtained input file, the annotation analysis can be started by simply running GAEV.py and following the instructions in the menus. The first menu provides the option of using the obtained input file to extract data from KEGG or of skipping the data extraction step by loading a pre-generated data file. Next, GAEV prompts the user for the location of the input or data file. Both absolute and relative paths are accepted, but it is recommended that the GAEV.py file be placed in the same folder as the input or data file, so that the relative path can be easily used. After the data extraction from KEGG’s servers is completed, a data file is created, which can be reused for making different pathway tables. The next several menus guide the user through the process of customizing the output table. The user has the option to apply filters so that GAEV only outputs a table of genes with a specific keyword in their definition or linked pathways.
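To make the input format concrete, the following sketch reads a two-column gene/KO listing and drops genes without a KO number. It is an illustration only: the function name parse_ko_assignments is hypothetical, and the parser assumes whitespace-separated columns and gene names that contain no spaces, which the text does not state explicitly.

```python
def parse_ko_assignments(lines):
    """Return {gene_name: k_number} from two-column lines.

    Lines with only a gene name (no KO assigned) are skipped, mirroring
    the exclusion of unannotated genes described in the text.
    Assumes whitespace-separated columns (an assumption of this sketch).
    """
    assignments = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or gene without a KO number
        gene, k_number = fields[0], fields[1]
        assignments[gene] = k_number
    return assignments


# Illustrative input; gene names and K numbers are placeholders.
example = ["gene1\tK00001\n", "gene2\n", "gene3\tK00002\n"]
print(parse_ko_assignments(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the example list.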
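The keyword filter can be pictured as follows. This is a hedged sketch, not the GAEV source: the record layout (dictionaries with "gene", "definition", and "pathways" keys), the function name filter_by_keyword, and the case-insensitive substring matching are all assumptions made for illustration.

```python
def filter_by_keyword(records, keyword):
    """Keep records whose definition or any linked pathway name contains
    the keyword (case-insensitive substring match, an assumption)."""
    needle = keyword.lower()
    return [
        record for record in records
        if needle in record["definition"].lower()
        or any(needle in pathway.lower() for pathway in record["pathways"])
    ]


# Placeholder records for illustration only.
records = [
    {"gene": "g1", "definition": "example clock protein",
     "pathways": ["Circadian rhythm"]},
    {"gene": "g2", "definition": "example dehydrogenase",
     "pathways": ["Glycolysis"]},
]
print([r["gene"] for r in filter_by_keyword(records, "circadian")])
```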
Output file
The output file is an HTML file that can be opened in any web browser (for an example, see Supplementary File 2). The results are organized in three sections. The first section, Genes and Linked Pathways, lists for each query gene the molecular function based on KO and the relevant pathways. For each gene, every associated pathway is linked to the corresponding pathway page on the KEGG website, where this specific gene is colored in red and all other genes identified in the genome assembly are colored in green. The other two sections list the pathways sorted by the number of identified genes and in alphabetical order, respectively. These two sections provide a pathway-centered view of the functions of the annotated genome.
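The two pathway-centered listings amount to counting genes per pathway and sorting the counts in two ways. The sketch below shows one way to do this, assuming a hypothetical mapping from gene name to a list of pathway names; the function name summarize_pathways and the tie-breaking by pathway name are choices made for this illustration and are not taken from GAEV.

```python
from collections import defaultdict


def summarize_pathways(gene_to_pathways):
    """Return (by_count, by_name): lists of (pathway, gene_count) tuples.

    by_count is sorted by descending gene count (ties broken by name);
    by_name is sorted alphabetically, ignoring case.
    """
    members = defaultdict(set)
    for gene, pathways in gene_to_pathways.items():
        for pathway in pathways:
            members[pathway].add(gene)
    counts = {pathway: len(genes) for pathway, genes in members.items()}
    by_count = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    by_name = sorted(counts.items(), key=lambda item: item[0].lower())
    return by_count, by_name


# Placeholder data for illustration only.
by_count, by_name = summarize_pathways({
    "g1": ["Circadian rhythm", "Apoptosis"],
    "g2": ["Circadian rhythm"],
})
print(by_count)
print(by_name)
```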
Conclusions
The integrative annotation approach implemented in our package GAEV draws upon resources available at KEGG and provides an efficient way to explore the molecular pathways embodied in a draft genome. The integration of the generated HTML file with KEGG web services provides an intuitive interface for exploring specific molecular pathways, with all the identified KEGG homologs highlighted in the pathway map. This type of information is essential for the initial exploration of non-model organisms’ genomes, as it shows how well specific pathways are conserved relative to established model systems. For example, if we examine the circadian rhythm pathway in the Daphnia genome, we see strong conservation between Daphnia and Drosophila, with only one gene (i.e., Vri) in this pathway missing an identified homolog in the Daphnia assembly (Figure 3). Further efforts can be dedicated to verifying the absence of the Vri gene in the Daphnia genome. The strong conservation of the circadian pathway can aid future efforts in using the freshwater microcrustacean Daphnia to understand how the internal clock of aquatic organisms responds to aquatic environments.
In principle, GAEV can be used for visualizing functions and pathways for gene sets of any scale, ranging from genome-wide data to subsets of genes in a genome. For example, we can use GAEV to visualize the pathways that differentially expressed genes are involved in. Often the large number of differentially expressed genes from RNA-seq experiments prevents clear cataloging of these genes and molecular pathways. Analyzing the genes of interest using our package can provide a quick, integrative view of the genes and affected pathways.
In summary, with a user-friendly design (e.g., no requirement of UNIX command line experience) in mind, we have developed GAEV to provide a fast, easily accessible summary for KEGG gene annotation results. We expect that GAEV will find its use in many bioinformatic analyses, especially those involving non-model species.
Data and software availability
Software source code available from: https://github.com/UtaDaphniaLab/Gene_Annotation_Easy_Viewer
Archived source code as at time of publication: https://zenodo.org/record/1186291#.WpbWVa6nGUk
License: This software is licensed under the MIT license
Acknowledgements
The authors thank M. Snyman for testing the software.
Funding Statement
University of Texas at Arlington
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; referees: 2 approved with reservations]
Supplementary Files
Supplementary File 1. Example input file https://raw.githubusercontent.com/UtaDaphniaLab/Gene_Annotation_Easy_Viewer/master/gene_annotation_easy_viewer/example_input.txt
Supplementary File 2. Example output file https://raw.githubusercontent.com/UtaDaphniaLab/Gene_Annotation_Easy_Viewer/master/gene_annotation_easy_viewer/Example_Output.html
References
- Bateman A, Martin MJ, O'Donovan C, et al.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017;45(D1):D158–D169. doi: 10.1093/nar/gkw1099
- Boratyn GM, Camacho C, Cooper PS, et al.: BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013;41(Web Server issue):W29–W33. doi: 10.1093/nar/gkt282
- Camacho C, Coulouris G, Avagyan V, et al.: BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421
- Finn RD, Attwood TK, Babbitt PC, et al.: InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 2017;45(D1):D190–D199. doi: 10.1093/nar/gkw1107
- Jones P, Binns D, Chang HY, et al.: InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–1240. doi: 10.1093/bioinformatics/btu031
- Kanehisa M, Furumichi M, Tanabe M, et al.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353–D361. doi: 10.1093/nar/gkw1092
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27
- Kanehisa M, Sato Y, Kawashima M, et al.: KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016a;44(D1):D457–D462. doi: 10.1093/nar/gkv1070
- Kanehisa M, Sato Y, Morishima K: BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J Mol Biol. 2016b;428(4):726–731. doi: 10.1016/j.jmb.2015.11.006
- Mistry J, Finn RD, Eddy SR, et al.: Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41(12):e121. doi: 10.1093/nar/gkt263
- Moutselos K, Kanaris I, Chatziioannou A, et al.: KEGGconverter: a tool for the in-silico modelling of metabolic networks of the KEGG Pathways database. BMC Bioinformatics. 2009;10:324. doi: 10.1186/1471-2105-10-324
- Wrzodek C, Dräger A, Zell A: KEGGtranslator: visualizing and converting the KEGG PATHWAY database to various formats. Bioinformatics. 2011;27(16):2314–2315. doi: 10.1093/bioinformatics/btr377
- Ye Z, Xu S, Spitze K, et al.: A New Reference Genome Assembly for the Microcrustacean Daphnia pulex. G3 (Bethesda). 2017;7(5):1405–1416. doi: 10.1534/g3.116.038638