Abstract
Solanum lycopersicum and Solanum tuberosum are agriculturally important crop species, as they are rich sources of starch, protein, antioxidants, lycopene, beta-carotene, vitamin C, and fiber. The genomes of S. lycopersicum and S. tuberosum are currently available. However, the linear strings of nucleotides that together comprise a genome sequence are of limited significance by themselves. Computational and bioinformatics approaches can be used to exploit the genomes for fundamental research and for improving crop varieties. Comparative genome analysis and Pfam analysis of paralogous proteins predicted from reviewed protein sequences were performed. It was found that S. lycopersicum proteins belong to more families, domains and clans than those of S. tuberosum. It was also found that intergenic regions are the most conserved between the two genomes, followed by exons, introns and UTRs. These findings can be exploited to predict regions that are similar between the genomes and to study the evolutionary relationship between them, leading towards the development of disease-resistant, stress-tolerant and otherwise improved varieties of tomato.
Keywords: S. lycopersicum, S. tuberosum, genome
Background
The Solanaceae are an important family in agriculture, as they include major crop species such as Solanum lycopersicum (tomato), Solanum tuberosum (potato) and Nicotiana tabacum (tobacco). Tomato fruits are the second most consumed vegetable after potatoes, and are a globally important dietary source of lycopene, beta-carotene, vitamin C, and fiber. Potato contributes to the dietary intake of starch, protein, antioxidants, and vitamins. In addition to its agricultural value, and owing to its diploid genetics and inbreeding potential, tomato is a widely used model species for fundamental research on subjects including fruit development and pathogen response [1].
The developments in sequencing technologies are providing genome sequences of different species. Deciphering a genome sequence, that is, determining the linear order of nucleotides for each chromosome in the genome, allows molecular biologists to understand and manipulate this blueprint. For plants in particular, this in turn enables breeders to more efficiently engineer solutions for crop improvement to respond to the growing demand for food and energy from modern society [2].
Draft genomes of tomato and potato are now available in plant databases. The nuclear genomes of potato and tomato each consist of twelve chromosomes and are estimated to measure approximately 840 Mb and 950 Mb in size, respectively [3–5].
The availability of their genome sequences will provide the community with a first glimpse into genome evolution of Solanaceae (and Asterids in general) and will impact both fundamental research and breeding strategies in these species for the coming years.
The aim of the present research work was to predict paralogous proteins in the tomato and potato proteomes and to carry out a comparative genome analysis of tomato and potato, in order to uncover various genomic features of the two genomes and to gain insight into the similarities and differences between them.
Methodology
The genomic data of S. lycopersicum are available at NCBI, EMBL, DDBJ and KEGG. The nucleotide and amino acid data were retrieved in FASTA format from the FTP server. These databases and tools are freely available for computational analysis.
The Sol Genomics Network (http://solgenomics.net) is a comparative genomics platform and database for Solanaceae species.
Computational tools are required for data processing, visualization, interpretation and interrogation in order to analyze the flood of new sequence data that is being produced. The comparison of the tomato and potato genomes was performed using the VISTA server. VISTA (http://genome.lbl.gov/vista/index.shtml) is a comprehensive suite of programs and databases for comparative analysis of genomic sequences [6].
The genomic data retrieved from the above servers were analyzed with the help of different computational tools, software and online servers.
Prediction of Paralogous Proteins in S. lycopersicum and S. tuberosum Genome:
The reviewed sets of protein sequences of S. lycopersicum and S. tuberosum were retrieved from the UniProt database in FASTA format. All-against-all database searches using BLASTP, available at the NCBI server, were used to predict paralogous proteins in the selected sets of protein sequences [7–8]. In an all-against-all search, every protein sequence is used as a query in a similarity search against a database composed of the rest of the same proteome, and significant matches are identified by a low E-value. Since many proteins comprise different combinations of a common set of domains, only alignments covering more than 80% of the length of both the query and the subject were retained. After this filtration, only those alignments with a sequence identity greater than 60% were selected.
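The coverage and identity filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes the BLASTP hits were exported as tab-separated rows with the columns `qseqid sseqid pident length qlen slen evalue`, it approximates coverage by alignment length divided by sequence length, and the E-value cutoff of 1e-5 is a placeholder, since the text only specifies "a low E-value".

```python
from collections import namedtuple

# One BLASTP hit; the column layout is an assumption of this sketch.
Hit = namedtuple("Hit", "query subject identity aln_len qlen slen evalue")

def parse_hits(lines):
    """Parse tab-separated rows: qseqid sseqid pident length qlen slen evalue."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        q, s, pident, length, qlen, slen, evalue = line.split("\t")
        yield Hit(q, s, float(pident), int(length), int(qlen), int(slen), float(evalue))

def paralog_pairs(hits, min_cov=0.80, min_ident=60.0, max_evalue=1e-5):
    """Return unordered protein pairs passing the E-value, coverage and identity filters.

    max_evalue is a hypothetical cutoff; coverage is approximated as
    alignment length (gaps included) over the full sequence length.
    """
    pairs = set()
    for h in hits:
        if h.query == h.subject:          # skip self-hits
            continue
        if h.evalue > max_evalue:         # keep only significant matches
            continue
        # alignment must span more than 80% of both query and subject
        if h.aln_len / h.qlen <= min_cov or h.aln_len / h.slen <= min_cov:
            continue
        if h.identity <= min_ident:       # identity must exceed 60%
            continue
        pairs.add(tuple(sorted((h.query, h.subject))))
    return pairs

# Made-up example rows, for illustration only.
example_rows = [
    "A\tB\t75.0\t90\t100\t100\t1e-30",   # passes all filters
    "A\tA\t100.0\t100\t100\t100\t0.0",   # self-hit, ignored
    "A\tC\t90.0\t50\t100\t100\t1e-20",   # coverage too low
    "B\tA\t75.0\t90\t100\t100\t1e-30",   # reciprocal of the first hit
    "A\tD\t55.0\t95\t100\t100\t1e-10",   # identity too low
]
example_pairs = paralog_pairs(parse_hits(example_rows))
```

Storing each pair in sorted order means that reciprocal hits (A against B and B against A) are counted once.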
Families, domains and repeats for paralogous protein sequences in S. lycopersicum and S. tuberosum:
For the purpose of functional annotation and to investigate gene family expansion, the identified set of paralogous proteins was used to search for protein families using Pfam. Each Pfam family is represented by multiple sequence alignments and hidden Markov models (HMMs) [9]. The paralogous protein dataset was submitted to the Pfam server (http://pfam.sanger.ac.uk/), which predicted the protein families, motifs, repeats and clans using the default Pfam parameters.
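Once Pfam results are available, the number of distinct families, domains, repeats and clans can be tallied per species. The sketch below assumes the results have been reduced to tuples of (protein, Pfam entry, entry type, clan), with the string `No_clan` marking entries outside any clan. That layout, the function name and the identifiers in the example are hypothetical placeholders, not output from the actual analysis.

```python
from collections import defaultdict

def summarize_pfam(rows):
    """Count distinct Pfam entries per entry type, distinct clans,
    and distinct entries that have no clan.

    rows: iterable of (protein, pfam_id, entry_type, clan) tuples;
    this layout is an assumption of the sketch.
    """
    ids_by_type = defaultdict(set)
    clans = set()
    without_clan = set()
    for protein, pfam_id, entry_type, clan in rows:
        ids_by_type[entry_type].add(pfam_id)
        if clan and clan != "No_clan":
            clans.add(clan)
        else:
            without_clan.add(pfam_id)
    summary = {entry_type: len(ids) for entry_type, ids in ids_by_type.items()}
    summary["Clans"] = len(clans)
    summary["No clan"] = len(without_clan)
    return summary

# Placeholder rows; identifiers are invented for illustration.
example_rows = [
    ("P1", "PF_A", "Family", "CL_X"),
    ("P2", "PF_A", "Family", "CL_X"),   # same family found in a second protein
    ("P2", "PF_B", "Domain", "CL_Y"),
    ("P3", "PF_C", "Repeat", "No_clan"),
]
example_summary = summarize_pfam(example_rows)
```

Running the same function on the tomato and potato result sets separately would give the per-species counts that a comparison such as Figure 1 relies on.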
Results and Discussion
After performing the all-against-all searches for all reviewed protein sequences of S. lycopersicum and S. tuberosum, it was found that 60 paralogous proteins were present in S. lycopersicum, while 110 were present in S. tuberosum. All predicted paralogous proteins of S. lycopersicum and S. tuberosum can be retrieved using the accession numbers given in Table 1 and Table 2 (see supplementary material). The predicted paralogous proteins belong to different families and contain different domains and repeats. The identified set of paralogous proteins was then searched against Pfam, as described in the Methodology, for functional annotation and to investigate gene family expansion.
Pfam analysis of S. lycopersicum and S. tuberosum protein sequences:
It was found that most of the identified proteins in S. lycopersicum and S. tuberosum belong to different families, domains and clans (Table 3; see supplementary material). Some proteins, however, were not assigned to any clan (Figure 1). Proteins contain functional units known as domains, and different combinations of domains give rise to different proteins. Identification of the domains in a protein is therefore essential for gaining insight into its function. Pfam also generates higher-level groupings of related families, known as clans. The distribution of proteins across different families, domains and clans may be due to protein family expansion and adaptation in the genomes [10].
Figure 1.

Pfam comparison of S. lycopersicum and S. tuberosum protein sequences.
It was found that S. lycopersicum proteins belong to more families, domains and clans than those of S. tuberosum. Some proteins were not assigned to any clan.
Comparative genomics of Solanum lycopersicum and Solanum tuberosum:
The genomic regions of S. lycopersicum and S. tuberosum were compared. It was found that the genomes of the two plants contain both conserved and non-conserved regions, and that their genomic composition differs at several levels. The degree of conservation also varied between regions.
It was found that intergenic regions were the most conserved between the two genomes, followed by exons, introns (non-coding sequences found within the genes of most organisms and many viruses) and untranslated regions (UTRs) (Figure 2).
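A ranking of this kind can be derived by summing the length of conserved segments for each annotation category. The sketch below assumes that the conserved segments reported by the alignment tool have been exported as (start, end, category) tuples with inclusive coordinates. The function name, the tuple layout and the example coordinates are hypothetical and do not reproduce the values behind Figure 2.

```python
from collections import defaultdict

def conserved_length_by_type(regions):
    """Sum conserved bases per annotation category, largest first.

    regions: iterable of (start, end, category) with inclusive
    1-based coordinates; this layout is an assumption of the sketch.
    """
    totals = defaultdict(int)
    for start, end, category in regions:
        totals[category] += end - start + 1
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Invented coordinates, for illustration only.
example_regions = [
    (1, 100, "intergenic"),
    (200, 249, "exon"),
    (300, 399, "intergenic"),
    (500, 519, "UTR"),
]
example_ranking = conserved_length_by_type(example_regions)
```

Counting conserved segments instead of bases would be an equally simple variant; which measure is more appropriate depends on how the comparison is meant to be read.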
Figure 2.

Genomic region comparison of tomato and potato.
An intergenic region (sometimes also referred to as junk DNA) is a stretch of DNA sequence located between genes. The function of much of this sequence is still unknown, but intergenic regions are sometimes involved in the regulation of gene expression, as they contain functionally important elements such as promoters and enhancers.
The comparative alignment of genomic regions of S. lycopersicum and S. tuberosum revealed regions where only conserved sequence is present in the two genomes (Figure 3). There are also regions where conserved sequence, untranslated regions (UTRs) and exons occur together without any non-aligned sequence (Figure 4). Non-aligned genomic regions were also found in the alignment of the two genomes (Figure 5).
Figure 3.

Conserved region present in two genomes.
Figure 4.

Genomic regions with conserved, UTR and exons.
Figure 5.

Non-aligned genomic region.
Once the elements in a genome sequence have been identified, the next step is to assign to them a plausible biological function. Computational inference of the function of a particular sequence can be achieved either directly through sequence similarity searches, or indirectly through the identification of common motifs or domains between groups of functionally related sequences.
The large number of intergenic regions may be due to the higher repeat content of the tomato genome compared with the potato genome. Many protein families belong to large gene superfamilies in plants, and these genes are involved in the biosynthesis of secondary metabolites [11–12].
Alignments between genome sequences of multiple accessions or varieties of a single species allow for the study of genome diversity, evolution and insertion/deletion polymorphisms (InDels). Moreover, alignments between the genomes of related species, for example from the same genus, can be generated to identify structural variation such as translocations and inversions. The sequence variation identified by both approaches can be utilized to study the evolution of genomes, and to generate molecular markers that can be exploited to screen large populations [13–14].
The general availability of genome sequences for crop plant species is having a tremendous impact on the genetics and breeding of these organisms. Future comparative sequence analyses of the completed tomato and potato genome sequences will address many of the unresolved questions related to genome-wide profiles of specific multigene families [15].
Conclusion
The large-scale analysis of tomato and potato revealed many interesting structural and functional differences between the two genomes. It was found that the tomato genome is more repetitive than the potato genome, and that the composition of repeats differs between them. Taken together, the present study will help in understanding the content, structure and organization of the tomato and potato genomes, which will be of great value to plant breeders and researchers in the years to come.
Supplementary material
Acknowledgments
The authors are grateful to the Sam Higginbottom Institute of Agriculture, Technology & Sciences, Deemed University, Allahabad for providing the facilities and support to complete the present research work.
Footnotes
Citation: Lall et al., Bioinformation 9(18): 923-928 (2013)
References
1. Kimura S, Sinha N. CSH Protoc. 2008. doi: 10.1101/pdb.prot5083. doi: 10.1101/pdb.emo105.
2. Fray RG, Grierson D. Trends Genet. 1993;9:438. doi: 10.1016/0168-9525(93)90108-t.
3. Tomato Genome Consortium. Nature. 2012;485:635. doi: 10.1038/nature11119.
4. Xu X, et al. Potato Genome Sequencing Consortium. Nature. 2011;475:189. doi: 10.1038/nature10158.
5. Dolezel J, et al. Nat Protoc. 2007;2:2233. doi: 10.1038/nprot.2007.310.
6. Frazer KA, et al. Nucleic Acids Res. 2004;32:W273. doi: 10.1093/nar/gkh458.
7. Altschul SF, et al. Nucleic Acids Res. 1997;25:3389. doi: 10.1093/nar/25.17.3389.
8. Singh S, et al. Bioinformation. 2011;6:31. doi: 10.6026/97320630006031.
9. Finn RD, et al. Nucleic Acids Res. 2008;36:D281. doi: 10.1093/nar/gkm960.
10. Gogarten JP, Olendzenski L. Curr Opin Genet Dev. 1999;9:630. doi: 10.1016/s0959-437x(99)00029-5.
11. Zhu W, et al. BMC Genomics. 2008;9:286. doi: 10.1186/1471-2164-9-286.
12. Datema E, et al. BMC Plant Biol. 2008;8:34. doi: 10.1186/1471-2229-8-34.
13. Pujar A, et al. Database (Oxford). 2013. doi: 10.1093/database/bat028.
14. Väli U, et al. BMC Genet. 2008;9:8. doi: 10.1186/1471-2156-9-8.
15. Moyle LC, et al. Evolution. 2008;62:2995. doi: 10.1111/j.1558-5646.2008.00487.x.
