Data in Brief. 2024 Apr 6;54:110401. doi: 10.1016/j.dib.2024.110401

Gene ontology functional annotation datasets for the ITAG3.2 and ITAG4.0 tomato (Solanum lycopersicum) genome annotations

Ricardo Rivera-Silva a,1, Ricardo A. Chávez Montes b,1, Fabiola Jaimes-Miranda c
PMCID: PMC11033075  PMID: 38646191

Abstract

Functional annotation based on Gene Ontology has provided a structured and comprehensive system to access the current knowledge about the function of genes. For model plants such as Arabidopsis thaliana, there is a constant updating and restructuring of the functional annotation that increases the reliability of the analyses that use it. For tomato (Solanum lycopersicum), a crop widely used as a model plant for the study of fleshy fruits, there is no functional annotation, at least not freely accessible, even though its genome has long been sequenced and annotated. In this work, we generated, using a simplified version of the maize GAMER pipeline, a tomato Gene Ontology functional annotation with 72.42% (ITAG3.2) and 74.2% (ITAG4.0) of protein-coding genes with at least one GO term association. With this dataset, we share a reliable and easy-to-use tool with the tomato community.

Keywords: Gene ontology, Enrichment analysis, Tomato, Solanum lycopersicum


Specifications Table

Subject: Biological Sciences / Bioinformatics
Specific subject area: GO annotation.
Data format: Raw, Analysed and Filtered.
Type of data: Table, text file, figure.
Data collection: GOA annotations for SwissProt proteins were downloaded from the European Bioinformatics Institute (https://www.ebi.ac.uk/GOA/downloads) on October 1st, 2023. The Araport11 annotation and GAF file were downloaded from TAIR (https://www.arabidopsis.org/) on September 14th, 2022. The PANNZER2 web server was accessed on October 14th, 2023.
Data source location: Institution: Instituto Potosino de Investigación Científica y Tecnológica; City/Region: San Luis Potosí, San Luis Potosí; Country: México; Latitude and longitude: 22°08′57″N 101°02′04″W [22.14916667, −101.03444444]
Data accessibility: Repository name: Mendeley Data; Data identification number: 10.17632/8whszkhk6b.2; Direct URL to data: https://data.mendeley.com/datasets/8whszkhk6b/2

1. Value of the Data

  • We generated a high-coverage GO functional annotation of tomato (Solanum lycopersicum) protein-coding genes.

  • This dataset provides an accurate and updated tool to perform tomato gene enrichment and other GO-based analyses.

  • Prior to this work, no dataset of tomato GO functional annotation was publicly accessible.

  • Any tomato research that relies on GO-based analyses, such as gene enrichment, can use this dataset.

2. Background

Functional annotations result from collecting and classifying information that describes the biological function of genes [1]. The Gene Ontology knowledgebase provides a structured representation of gene function and is divided into three main categories: Molecular Function, Cellular Component, and Biological Process [2]. Tomato (Solanum lycopersicum) is a crop with high economic value and a model plant for studying fleshy fruits. Many tomato resources are available, such as a completely sequenced and annotated genome and expression data in many databases, yet no GO functional annotation is freely available. Using a simple method based on the maize-GAMER pipeline [3], we generated an updated, easy-to-use, and reliable tomato GO annotation dataset.

3. Data Description

We obtained two GO annotation datasets: gene_association.itag3.2.nr, for the tomato SL3.0 genome assembly and ITAG3.2 annotation, and gene_association.itag4.0.nr, for the SL4.0 genome assembly and ITAG4.0 annotation. Fig. 1 shows the dataset files available within the repository (doi:10.17632/8whszkhk6b.2).

Fig. 1. Tomato Gene Ontology functional annotation files available at Mendeley Data, doi:10.17632/8whszkhk6b.2.

The raw data contain the following files:

dir: ITAG3.2
 dir: blastp. Directory containing the reciprocal blastp hits data.
  dir: Araport11. Directory containing blastp data of ITAG3.2 vs Araport11 proteins.
   file: Araport11_blastp_itag3.tsv.lz. blastp output with Araport11 proteins as query and ITAG3.2 proteins as subject, in tsv format, lzip compressed.
   file: gene_association.itag3.Araport11. GAF file for the ITAG3.2 blastp Araport11 reciprocal best hit data.
   file: itag3_blastp_Araport11.tsv.lz. blastp output with ITAG3.2 proteins as query and Araport11 proteins as subject, in tsv format, lzip compressed.
   file: rbh_itag3_Araport11.tsv.lz. Reciprocal best hits table, ITAG3.2 vs Araport11, lzip compressed.
  dir: uniprot
   file: gene_association.itag3.uniprot. GAF file for the ITAG3.2 blastp UniProt reciprocal best hit data.
   file: itag3_blastp_uniprot.tsv.lz. blastp output with ITAG3.2 proteins as query and UniProt proteins as subject, in tsv format, lzip compressed.
   file: rbh_itag3_uniprot.tsv.lz. Reciprocal best hits table, ITAG3.2 vs UniProt, lzip compressed.
   file: uniprot_blastp_itag3.tsv.lz. blastp output with UniProt proteins as query and ITAG3.2 proteins as subject, in tsv format, lzip compressed.
 dir: interproscan
  file: gene_association.itag3.interproscan. GAF file for the ITAG3.2 InterProScan data.
  file: itag3_ALL_interproscan.tsv.lz. InterProScan output for ITAG3.2 proteins, in tsv format, lzip compressed.
 dir: pannzer2. ITAG3.2 PANNZER2 output files.
  file: anno.out. PANNZER2 protein annotations table.
  file: DE.out. PANNZER2 protein functional descriptions table.
  file: err. PANNZER2 stderr log file.
  file: gene_association.itag3.pannzer2. GAF file for the ITAG3.2 PANNZER2 data.
  file: GO.out. PANNZER2 Gene Ontology annotations table.
  file: input.fasta.lz. ITAG3.2 proteins uploaded to PANNZER2, lzip compressed.
  file: log. PANNZER2 stdout log file.
dir: ITAG4.0
 dir: blastp. Directory containing the reciprocal blastp hits data.
  dir: Araport11. Directory containing blastp data of ITAG4.0 vs Araport11 proteins.
   file: Araport11_blastp_itag4.tsv.lz. blastp output with Araport11 proteins as query and ITAG4.0 proteins as subject, in tsv format, lzip compressed.
   file: gene_association.itag4.Araport11. GAF file for the ITAG4.0 blastp Araport11 reciprocal best hit data.
   file: itag4_blastp_Araport11.tsv.lz. blastp output with ITAG4.0 proteins as query and Araport11 proteins as subject, in tsv format, lzip compressed.
   file: rbh_itag4_Araport11.tsv.lz. Reciprocal best hits table, ITAG4.0 vs Araport11, lzip compressed.
  dir: uniprot
   file: gene_association.itag4.uniprot. GAF file for the ITAG4.0 blastp UniProt reciprocal best hit data.
   file: itag4_blastp_uniprot.tsv.lz. blastp output with ITAG4.0 proteins as query and UniProt proteins as subject, in tsv format, lzip compressed.
   file: rbh_itag4_uniprot.tsv.lz. Reciprocal best hits table, ITAG4.0 vs UniProt, lzip compressed.
   file: uniprot_blastp_itag4.tsv.lz. blastp output with UniProt proteins as query and ITAG4.0 proteins as subject, in tsv format, lzip compressed.
 dir: interproscan
  file: gene_association.itag4.interproscan. GAF file for the ITAG4.0 InterProScan data.
  file: itag4_ALL_interproscan.tsv.lz. InterProScan output for ITAG4.0 proteins, in tsv format, lzip compressed.
 dir: pannzer2. ITAG4.0 PANNZER2 output files.
  file: anno.out. PANNZER2 protein annotations table.
  file: DE.out. PANNZER2 protein functional descriptions table.
  file: err. PANNZER2 stderr log file.
  file: gene_association.itag4.pannzer2. GAF file for the ITAG4.0 PANNZER2 data.
  file: GO.out. PANNZER2 Gene Ontology annotations table.
  file: input.fasta.lz. ITAG4.0 proteins uploaded to PANNZER2, lzip compressed.
  file: log. PANNZER2 stdout log file.

The ITAG3.2 GO annotation has a coverage of 72.42%: 25,905 protein-coding genes, out of 35,768 total genes, have at least one GO term. It contains 7632 unique GO terms, of which 3988 (52.25%) belong to the Biological Process category, 2782 (36.45%) to Molecular Function, and 862 (11.30%) to Cellular Component. The ITAG4.0 GO annotation has a coverage of 74.2%: 25,285 protein-coding genes, out of 34,075 total genes, have at least one GO term. It contains 7612 unique GO terms, of which 3985 (52.35%) belong to Biological Process, 2770 (36.39%) to Molecular Function, and 857 (11.26%) to Cellular Component. Fig. 2 shows a histogram of the number of GO terms per gene for each GO annotation.
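The percentages above follow directly from the reported counts. The short Python sketch below recomputes them as a sanity check; it uses only the numbers quoted in the text and does not read any dataset file. The names REPORTED, percent and summarize are illustrative, not part of the dataset. The recomputed values agree with the reported ones to within 0.01 percentage points, the small differences coming from how the last digit was rounded.

```python
# Counts quoted in the text above; nothing is read from the dataset files.
REPORTED = {
    "ITAG3.2": {
        "annotated": 25905,
        "total": 35768,
        "terms": {"Biological Process": 3988,
                  "Molecular Function": 2782,
                  "Cellular Component": 862},
    },
    "ITAG4.0": {
        "annotated": 25285,
        "total": 34075,
        "terms": {"Biological Process": 3985,
                  "Molecular Function": 2770,
                  "Cellular Component": 857},
    },
}


def percent(part, whole):
    """Return part/whole as an unrounded percentage."""
    return 100.0 * part / whole


def summarize(counts):
    """Recompute coverage and per-category shares from raw counts."""
    unique_terms = sum(counts["terms"].values())
    return {
        "coverage": percent(counts["annotated"], counts["total"]),
        "unique_terms": unique_terms,
        "shares": {name: percent(n, unique_terms)
                   for name, n in counts["terms"].items()},
    }


for annotation, counts in REPORTED.items():
    summary = summarize(counts)
    print(f"{annotation}: coverage {summary['coverage']:.2f}%, "
          f"{summary['unique_terms']} unique GO terms")
    for name, share in summary["shares"].items():
        print(f"  {name}: {share:.2f}%")
```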

Fig. 2. Histogram of the number of GO terms per gene for ITAG3.2 (a) and ITAG4.0 (b).
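A tally like the one plotted in Fig. 2 can be derived from a GAF file. The Python sketch below is one possible way to do it, not the authors' code. It assumes the gene_association files follow the standard GAF layout, with the gene identifier in column 2, the GO ID in column 5, and comment lines starting with "!". The function names and the example records (geneA, geneB, the GO IDs) are made-up placeholders.

```python
from collections import Counter, defaultdict


def terms_per_gene(gaf_lines):
    """Map each gene ID (GAF column 2) to its set of GO IDs (GAF column 5)."""
    genes = defaultdict(set)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a usable annotation record
        genes[fields[1]].add(fields[4])
    return genes


def histogram(genes):
    """Count how many genes carry 1, 2, 3, ... distinct GO terms."""
    return Counter(len(terms) for terms in genes.values())


def record(gene, go_id, evidence, aspect):
    """Build a placeholder GAF-like record (first nine columns only)."""
    return "\t".join(["ITAG", gene, gene, "", go_id, "REF:0", evidence, "", aspect])


# Made-up records for illustration; not taken from the dataset.
EXAMPLE = [
    "!gaf-version: 2.2",
    record("geneA", "GO:0000001", "IEA", "P"),
    record("geneA", "GO:0000002", "IEA", "F"),
    record("geneA", "GO:0000001", "ISS", "P"),  # same term, second source
    record("geneB", "GO:0000003", "IEA", "C"),
]

counts = histogram(terms_per_gene(EXAMPLE))
for n_terms in sorted(counts):
    print(f"{counts[n_terms]} gene(s) with {n_terms} GO term(s)")
```

To apply this to a downloaded file, pass an open file handle instead of EXAMPLE, for example `terms_per_gene(open("gene_association.itag4.0.nr"))`.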

4. Experimental Design, Materials and Methods

Tomato gene GO annotations were obtained using a simplified version of the maize-GAMER pipeline [3]. GO annotations were assigned to tomato genes from four sources: 1) the GO annotations of the reciprocal best blastp hits against Araport11 proteins [4]; 2) the GO annotations, taken from the GOA file from the European Bioinformatics Institute, of the reciprocal best blastp hits against the UniProt [5] SwissProt proteins from nine species (Glycine max, Oryza sativa subsp. japonica, Populus trichocarpa, Sorghum bicolor, Vitis vinifera, Brachypodium distachyon, Physcomitrium patens, Chlamydomonas reinhardtii and Solanum lycopersicum itself); 3) the GO annotations from an InterProScan (v5.64-96.0) [6] run with the -goterms option; and 4) the GO annotations in the GO output text file (named GO.txt or GO.out) from the PANNZER2 web server [7], keeping only annotations with a positive predictive value (PPV) of 0.5 or higher. Output files were parsed with custom Perl scripts to obtain a non-redundant GO annotation file. Fig. 3 shows a graphical representation of this pipeline.
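The reciprocal best hit step used for sources 1) and 2) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the blastp tsv files use the common 12-column tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12) and that the best hit is the one with the highest bit score; the actual criteria used by the authors are not stated in the text. All identifiers in the example are made up.

```python
def best_hits(rows):
    """Return {query: subject} keeping the highest bit score per query.

    Each row is a list of 12 fields; ties keep the first hit seen.
    """
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def hit(query, subject, bitscore):
    """Build a placeholder 12-column row; only columns 1, 2 and 12 matter here."""
    return [query, subject] + ["0"] * 9 + [str(bitscore)]


# Made-up hits: tomato proteins (sl*) against Arabidopsis proteins (at*).
FORWARD = [hit("sl1", "at1", 200), hit("sl1", "at2", 150),
           hit("sl2", "at2", 90), hit("sl3", "at3", 50)]
REVERSE = [hit("at1", "sl1", 210), hit("at2", "sl1", 160),
           hit("at2", "sl2", 80), hit("at3", "sl4", 70)]

print(reciprocal_best_hits(FORWARD, REVERSE))
```

In this toy input only sl1 and at1 choose each other, so sl1 would inherit the GO annotations of at1, while sl2 and sl3 would receive nothing from this source.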

Fig. 3. Graphical representation of the pipeline used to obtain the tomato Gene Ontology functional annotations.
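The PPV filter and the final merge into a non-redundant annotation can be illustrated with the Python sketch below. The authors used custom Perl scripts, which are not reproduced here; this is a separate, simplified stand-in. The column names qpid, goid and ARGOT_PPV are assumptions about the header of the PANNZER2 GO output file and should be checked against the real file. All rows and gene names in the example are made up.

```python
import csv
import io


def filter_by_ppv(handle, min_ppv=0.5,
                  id_col="qpid", go_col="goid", ppv_col="ARGOT_PPV"):
    """Return (gene, GO ID) pairs whose PPV is at least min_ppv.

    The column names are assumed, not confirmed by the article.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row[id_col], row[go_col])
            for row in reader
            if float(row[ppv_col]) >= min_ppv]


def merge_non_redundant(*sources):
    """Concatenate annotation sources, keeping each (gene, GO ID) pair once."""
    seen = set()
    merged = []
    for pairs in sources:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


# Made-up PANNZER2-like table; only the three assumed columns are included.
GO_OUT = io.StringIO(
    "qpid\tgoid\tARGOT_PPV\n"
    "geneA\tGO:0000001\t0.72\n"
    "geneA\tGO:0000002\t0.31\n"
    "geneB\tGO:0000003\t0.50\n"
)

# Made-up pairs standing in for another source, e.g. reciprocal best hits.
OTHER_SOURCE = [("geneA", "GO:0000001"), ("geneC", "GO:0000004")]

pannzer_pairs = filter_by_ppv(GO_OUT)
print(merge_non_redundant(OTHER_SOURCE, pannzer_pairs))
```

Deduplicating on the (gene, GO ID) pair is one reasonable reading of "non-redundant"; a stricter definition, for instance one that also removes ancestor terms, would need the ontology graph and is outside this sketch.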

Limitations

Not applicable.

Ethics statement

The work presented above did not involve human subjects, animal experiments, or any data collected from social media platforms; therefore, no regulatory compliance guidelines were applicable.

CRediT authorship contribution statement

Ricardo Rivera-Silva: Writing – original draft, Visualization, Writing – review & editing. Ricardo A. Chávez Montes: Conceptualization, Software, Validation, Writing – review & editing. Fabiola Jaimes-Miranda: Supervision, Writing – review & editing.

Acknowledgements

This research was supported by Consejo Nacional de Humanidades, Ciencias y Tecnologías Grant CB 2017-2018 A1-S-7679. We thank CONAHCyT for the scholarship 861546 provided to RRS.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

References

  1. Berardini T.Z., Mundodi S., Reiser L., Huala E., Garcia-Hernandez M., Zhang P., Mueller L.A., Yoon J., Doyle A., Lander G., Moseyko N., Yoo D., Xu I., Zoeckler B., Montoya M., Miller N., Weems D., Rhee S.Y. Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 2004;135:745–755. doi: 10.1104/pp.104.040071.
  2. The Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2020;49:325–334. doi: 10.1093/nar/gkaa1113.
  3. Wimalanathan K., Friedberg I., Andorf C.M., Lawrence-Dill C.J. Maize GO Annotation-Methods, Evaluation, and Review (maize-GAMER). 2018.
  4. Cheng C.Y., Krishnakumar V., Chan A.P., Thibaud-Nissen F., Schobel S., Town C.D. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017;89:789–804. doi: 10.1111/tpj.13415.
  5. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531. doi: 10.1093/nar/gkac1052.
  6. Jones P., Binns D., Chang H.Y., Fraser M., Li W., McAnulla C., McWilliam H., Maslen J., Mitchell A., Nuka G., Pesseat S., Quinn A.F., Sangrador-Vegas A., Scheremetjew M., Yong S.Y., Lopez R., Hunter S. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–1240. doi: 10.1093/bioinformatics/btu031.
  7. Törönen P., Medlar A., Holm L. PANNZER2: a rapid functional annotation web server. Nucleic Acids Res. 2018;46:W84–W88. doi: 10.1093/nar/gky350.


Articles from Data in Brief are provided here courtesy of Elsevier
