Author manuscript; available in PMC: 2013 Feb 27.
Published in final edited form as: Dataset Pap Biol. 2013;2013:706465. doi: 10.7167/2013/706465

Extending MapMan Ontology to Tobacco for Visualization of Gene Expression

Maurice HT Ling 1, Roel C Rabara 2, Prateek Tripathi 2, Paul J Rushton 2, Steven X Ge 1
PMCID: PMC3583537  NIHMSID: NIHMS414051  PMID: 23457664

Abstract

Microarrays are a large-scale expression profiling method that has been used to study the transcriptome of plants under various environmental conditions. However, manual inspection of microarray data is difficult at the genome level because of the large number of genes (normally at least 30,000) and the many different processes that occur within any given plant. MapMan software, which was initially developed to visualize microarray data for Arabidopsis, has been adapted to other plant species by mapping their genes onto the MapMan ontology. This paper provides a detailed procedure and the relevant computer code to generate a MapMan ontology mapping file for tobacco (Nicotiana tabacum L.) using potato and Arabidopsis as intermediates. The mapping file can be used directly with our custom-made NimbleGen oligoarray, which contains gene sequences from both the tobacco gene space sequence and the tobacco gene index 4 (NTGI4) collection of ESTs. The generated data set will be useful to scientists working on tobacco as their model plant: it provides a MapMan ontology mapping file for tobacco and the homology between tobacco coding sequences and those of potato and Arabidopsis. In addition, our procedure and code can be adapted for other plant species where the complete genome is not yet available.

Introduction

Plants, being sessile organisms, must react and acclimatize to abiotic stresses to survive in various environmental conditions. Plants have developed various stress tolerance mechanisms, such as physiological and biochemical alterations, that result in adaptive or morphological changes. In crop production, understanding how cultivated crops respond to abiotic stress is crucial in developing new varieties that can tolerate stress without affecting potential yield. With the rapid development of technologies for functional genomics research, comprehensive analyses at the mRNA, protein and metabolite levels have become possible. This is leading to increased understanding of the complex regulatory networks associated with stress adaptation and tolerance [1].

Currently, microarrays are one of the most popular technologies for large-scale expression profiling because they allow the simultaneous detection of tens of thousands of transcripts at a reasonable cost [2]. The development of gene chips for model plants like Arabidopsis and rice and other species that have a sequenced genome has led to genome-wide transcriptional profiling from diverse tissues. This is a key tool for identification of novel target genes for functional genomics [3]. Studies using microarrays to characterize abiotic stress responses have been reported for model species such as the moss Physcomitrella patens [4], Arabidopsis thaliana [5, 6], Medicago truncatula [7, 8], and rice [9], as well as non-model species such as soybean [10] and Musa [11]. However, microarrays generate huge amounts of data, often in the form of lists of differentially expressed genes. Manual inspection of these data is time consuming, and their complexity makes interpretation difficult. This is compounded when transcriptomic analysis is combined with other OMICS data. Development of new, more reliable methods of data analysis and visualization will enable easier interpretation of results and thus a greater contribution to explaining the biological problem [12]. Several bioinformatics tools have been developed to visualize data sets in the context of biological pathways. These include GenMAPP (http://www.genmapp.org) and BioMiner [13], among others. However, their application to plant data sets is limited for two reasons. Firstly, these tools were developed for microbial and animal systems and, secondly, flexibility is limited in terms of the display of family members (e.g. classes of enzymes) [14]. This limitation has been addressed by MapMan software [14], which relies on its own ontology to classify genes and metabolites and visualizes the pathways and processes in pictorial diagrams in a modular system [15].

MapMan was initially developed to analyze two sets of 22K Affymetrix arrays that investigated the response of Arabidopsis rosettes to low sugar [14]. Rapid advances in sequencing have resulted in full genome sequences for an increasing number of important crop species (e.g. soybean, rice, maize, papaya, and sorghum). These genome sequences have facilitated the development of large-scale whole genome arrays. MapMan software can be applied to new species by transferring the MapMan ontology to the transcripts and proteins of the studied species [15]. Several studies have reported extending the MapMan ontology to other species such as cotton [16], grapevine [17], maize [15], Musa [11], potato [12, 18], and tomato [18].

Tobacco is a popular model plant in recombinant technology because of its well-established gene transfer and regeneration methodologies as well as the availability of many robust expression cassettes for the control of transgene expression [19]. However, the disadvantage of tobacco is that it is an allotetraploid and its genome is not yet fully sequenced. A large gene space sequence project was performed for tobacco (www.tobaccogenome.org) and the gene space reads have been deposited as individual unassembled reads at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This is an excellent resource but does not cover the entire genome. The tobacco cv. Bright Yellow-2 (BY-2) cell line is an important model system to study cell physiology, hormone signaling, cell cycle, cell growth and stress situations [20]. However, tobacco is still a mostly un-sequenced and relatively un-annotated plant system in which identification of the proteins and their interactions relies on cross-species identification based on homology and orthology [21].

In this paper, we extend the MapMan ontology of sequenced dicot plants to generate a mapping data set for tobacco. The data set generated in this study will be a tool for scientists working on tobacco as their model plant by providing a MapMan ontology mapping file for tobacco, and homology comparisons between tobacco coding sequences and those of potato and Arabidopsis. In addition, we provide a method and the required computer code to generate MapMan mapping files, which may be adapted for other plant species where the complete genome is not yet available.

Methodology

Tobacco transcript sequences are from two sources: tobacco gene index 4 (NTGI4; ftp://occams.dfci.harvard.edu/pub/bio/tgi/data/Nicotiana_tabacum/; file name: Source files/NTGI.071508.fasta) and tobacco genomic survey sequences (file name: Source files/TGSS_expressed.fasta). For the latter, a data set of 1,159,022 genomic survey sequences was downloaded from the TGI (http://www.tobaccogenome.org/) in 2008; these genome survey sequences have subsequently been deposited at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). Two mapping files from the MapMan Store (http://mapman.gabipd.org/web/guest/mapmanstore) were used: Arabidopsis Information Resource version 9 (TAIR9) to MapMan ontology (file name: Map files/Ath_AGI_TAIR9_Jan2010.txt), and genome release version 3.2 from Potato Genome Sequence Consortium to MapMan ontology (file name: Map files/Stub_PGSC_DM_v3.4.txt). In addition, the potato DNA coding sequences (downloaded from http://potatogenomics.plantbiology.msu.edu/index.html; file name: Source files/potatoCDS.fasta) and TAIR9 coding sequences (ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR9_blastsets/TAIR9_cds_20090619; file name: Source files/TAIR9CDS.fasta) were used to generate nucleotide BLAST databases using a local installation of NCBI BLAST version 2.2.25+.

Both tobacco transcript sequence files were used as input files for nucleotide BLAST (blastn) and translated nucleotide BLAST (tblastx) against the potato DNA coding sequences and the TAIR9 coding sequences. The XML files from blastn and tblastx were read using BioPython, and hits with an expectation value of less than 1e-9 were saved as comma-delimited files for mapping onto the MapMan ontology using the following Python script.

from Bio.Blast import NCBIXML
expect_threshold = 1e-9  # keep only HSPs with an E-value below this value
with open('blast_output.xml') as infile, \
     open('blast_output.txt', 'w') as outfile:
    for record in NCBIXML.parse(infile):
        # query ID: first whitespace-delimited token of the query line
        query_title = record.query.split()[0]
        query_length = float(record.query_length)
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if float(hsp.expect) < expect_threshold:
                    data = [query_title,
                            # hit ID: second token of the hit title, cut at the first '.'
                            alignment.title.split()[1].split('.')[0],
                            # global identity: identical positions / query length
                            str(float(hsp.identities) / query_length),
                            str(query_length),
                            str(hsp.expect)]
                    outfile.write(','.join(data) + '\n')
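
The database-building and search steps that produce the XML input for the script above are described in the text but not shown. A minimal sketch of how they might be scripted follows, assuming the BLAST+ command-line programs (makeblastdb, blastn, tblastx) are on the PATH. The function names, the database names, and the output file naming scheme are hypothetical and not part of the original data set. Passing the E-value cut-off to the search itself is also an assumption; the original procedure applies it when parsing the XML.

```python
import subprocess


def makeblastdb_cmd(fasta, db_name):
    """Argument list for building a nucleotide BLAST database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", db_name]


def search_cmd(program, query, db_name, xml_out, evalue=1e-9):
    """Argument list for a blastn or tblastx run with XML output (-outfmt 5)."""
    if program not in ("blastn", "tblastx"):
        raise ValueError("program must be 'blastn' or 'tblastx'")
    return [program, "-query", query, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "5", "-out", xml_out]


def run_all(queries, databases, execute=False):
    """Build every database, then search each query file against each
    database with both programs. Returns the list of commands."""
    commands = [makeblastdb_cmd(fasta, name)
                for name, fasta in databases.items()]
    for query in queries:
        stem = query.rsplit(".", 1)[0]
        for name in databases:
            for program in ("blastn", "tblastx"):
                # hypothetical naming scheme for the XML result files
                xml_out = "%s_vs_%s.%s.xml" % (stem, name, program)
                commands.append(search_cmd(program, query, name, xml_out))
    if execute:  # needs a local BLAST+ installation
        for cmd in commands:
            subprocess.check_call(cmd)
    return commands


if __name__ == "__main__":
    cmds = run_all(["NTGI.071508.fasta", "TGSS_expressed.fasta"],
                   {"potatoCDS": "potatoCDS.fasta",
                    "TAIR9CDS": "TAIR9CDS.fasta"})
    for cmd in cmds:
        print(" ".join(cmd))
```

With execute left at False the sketch only prints the commands, which makes it easy to inspect them before committing to the long-running tblastx searches.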

Duplicate hits in the generated comma-delimited files were removed and all the BLAST results were combined into a tab-delimited file (file name: Map files/mapped_by_blast.txt) for mapping to the MapMan ontology using the potato and Arabidopsis mapping files from the MapMan Store. In addition to an expectation threshold of less than 1e-9, the threshold for global sequence identity was set at 30% using the following Python script.

blastn_IT = 0.3    # global identity threshold for blastn
blastn_ET = 1e-9   # E-value threshold for blastn
tblastx_IT = 0.3   # global identity threshold for tblastx
tblastx_ET = 1e-9  # E-value threshold for tblastx
def read_rows(filename, skip_header=False):
    # read a tab-delimited file into a list of column lists
    with open(filename, 'r') as f:
        rows = [line.rstrip('\n').split('\t') for line in f]
    return rows[1:] if skip_header else rows
def clean_map(rows):
    # drop the first and last character (the enclosing quotes) of the
    # first four columns and upper-case the identifier (third column)
    return [[x[0][1:-1], x[1][1:-1], x[2][1:-1].upper(), x[3][1:-1], x[4]]
            for x in rows]
mfile = read_rows('mapped_by_blast.txt')
amap = clean_map(read_rows('Ath_AGI_TAIR9_Jan2010.txt', skip_header=True))
pmap = clean_map(read_rows('Stub_PGSC_DM_v3.4.txt', skip_header=True))
mapping = amap + pmap
# keep [query ID, hit ID up to the first '_', description] for hits that pass
# both the identity and the E-value threshold of their BLAST program
mfile = [[x[0], x[1].split('_')[0], x[8]] for x in mfile
         if (x[7] == 'blastn' and float(x[2]) > blastn_IT
             and float(x[4]) < blastn_ET) or
            (x[7] == 'tblastx' and float(x[2]) > tblastx_IT
             and float(x[4]) < tblastx_ET)]
# identifier -> (BINCODE, NAME); if an identifier is listed more than once,
# the last entry read is the one that is kept
d = {}
for x in mapping:
    d[x[2]] = (x[0], x[1])
mapped = []
for row in mfile:
    try:
        # NAME is left blank and TYPE is always 'T'
        mapped.append((d[row[1]][0], ' ', row[0], row[2], 'T'))
    except KeyError:
        pass  # the hit has no entry in either mapping file
with open('tobaccoGSS_NGTI.MapMan_bins.txt', 'w') as outfile:
    outfile.write('\t'.join(['BINCODE', 'NAME', 'IDENTIFIER',
                             'DESCRIPTION', 'TYPE']) + '\n')
    for x in mapped:
        outfile.write('\t'.join(x) + '\n')

The resulting MapMan ontology map file is tobaccoGSS_NGTI.MapMan_bins.txt. Duplicate mappings with the same MapMan bin code (ontology) and identifier (NTGI4 IDs or tobacco genomic survey sequence IDs) were removed.
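
The duplicate removal described above is not part of the listed scripts. A minimal sketch of one way it could be done is shown here, assuming the five-column layout written by the script (BINCODE in the first column, IDENTIFIER in the third) and a header line ending in a newline. The function names are hypothetical, and keeping the first occurrence of each pair is an assumption, since the text does not say which duplicate was retained.

```python
def dedupe_map_rows(rows):
    """Keep the first row for each (BINCODE, IDENTIFIER) pair, in order."""
    seen = set()
    unique = []
    for row in rows:
        key = (row[0], row[2])  # BINCODE is column 0, IDENTIFIER is column 2
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_map_file(in_path, out_path):
    """Copy a tab-delimited map file, dropping duplicate mappings."""
    with open(in_path) as infile:
        header = infile.readline()  # BINCODE, NAME, IDENTIFIER, ...
        rows = [line.rstrip("\n").split("\t")
                for line in infile if line.strip()]
    with open(out_path, "w") as outfile:
        outfile.write(header)
        for row in dedupe_map_rows(rows):
            outfile.write("\t".join(row) + "\n")
```

Two rows that share an identifier but sit in different bins are both kept, because the key is the pair of bin code and identifier rather than the identifier alone.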

Dataset Description

The main contribution of this work is the resulting MapMan ontology map file, tobaccoGSS_NGTI.MapMan_bins.txt, which is a tab-delimited file of 5 attributes as per the MapMan ontology map file format. The 5 attributes are BINCODE, NAME, IDENTIFIER, DESCRIPTION, and TYPE. Through our own map file generation and use, we found two limitations in the format of the map file. Firstly, only the BINCODE, IDENTIFIER and TYPE attributes are mandatory and used by MapMan [14]. The BINCODE and IDENTIFIER attributes are the MapMan ontology bin identifier and the microarray probe identifier respectively. In our case, the IDENTIFIER attribute refers to the NTGI4 or tobacco genomic survey sequence identifier, which is also used as the probe identifier in our custom microarray. The TYPE attribute defaults to “T”. Secondly, the NAME attribute is the name descriptor used by MapMan [14] for displaying the ontology in a tree format, together with the BINCODE, even though it is not mandatory. However, if the NAME attribute is used, each BINCODE can only be mapped to one NAME. As we combined two primary map files (potato and Arabidopsis) to generate the tobacco map file, the NAME attributes were not always consistent for a given BINCODE, which resulted in an error. Thus, the NAME attribute is not used and is left blank. We used the DESCRIPTION attribute as a composite of 6 attributes to describe the BLAST process. The 6 attributes are as follows:

  • Query Length to denote the length of query sequence in number of bases,

  • Global Identity to denote the global sequence identity between the query sequence and the matched sequence in the BLAST database,

  • E-value to denote the expectation value from the BLAST hit,

  • Query Source to denote the source of the query sequence (and hence the source of IDENTIFIER), which is either “NTGI transcript” or “Tobacco GSS”,

  • BLAST Database to denote the source of sequences to generate the custom BLAST database, which is either “TAIR9 CDS” or “Potato CDS”,

  • BLAST Method to denote the BLAST program used, which is either “blastn” or “tblastx”.
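
As an illustration of how the six attributes listed above could be packed into, and recovered from, a single DESCRIPTION string, a minimal sketch follows. The "label=value" layout, the "; " separator, and the function names are assumptions made for this example; the text does not specify how the composite is actually encoded in the data set, and the sample values are invented.

```python
# Attribute order follows the list in the text.
FIELDS = ("Query Length", "Global Identity", "E-value",
          "Query Source", "BLAST Database", "BLAST Method")


def make_description(values, sep="; "):
    """Join six attribute values into one 'label=value' string.
    The layout is hypothetical, not the documented MapMan format."""
    if len(values) != len(FIELDS):
        raise ValueError("expected %d values" % len(FIELDS))
    return sep.join("%s=%s" % (name, value)
                    for name, value in zip(FIELDS, values))


def parse_description(text, sep="; "):
    """Recover the attributes from a string built by make_description."""
    return dict(item.split("=", 1) for item in text.split(sep))


if __name__ == "__main__":
    # invented example values, for illustration only
    example = make_description(
        ["850", "0.42", "1e-20", "NTGI transcript", "Potato CDS", "blastn"])
    print(example)
    print(parse_description(example)["BLAST Database"])
```

Keeping the labels inside the string makes each row of the map file self-describing, at the cost of a longer DESCRIPTION column.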

We have included the processed BLAST result file, mapped_by_blast.txt, for re-generating the required MapMan ontology map file. The processed BLAST result file contains 9 attributes. Query ID is the identifier for the query sequence from either NTGI transcript or Tobacco GSS, which is also the source of IDENTIFIER in the ontology map file, while Mapped ID is the identifier of the hit sequence from either TAIR9 CDS or Potato CDS. Global Identity, Query Length, E-value, Query Source, BLAST Database, and BLAST Method are attributes containing the source data for concatenation into the DESCRIPTION attribute in the ontology map file.

For completeness, we have also included in our dataset the necessary source files, the Python scripts, and the potato and Arabidopsis mapping files from the MapMan Store, as described in the methodology section.
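
The format constraints discussed in this section (five columns, mandatory BINCODE, IDENTIFIER and TYPE, and at most one NAME per BINCODE) can be checked before a combined map file is loaded into MapMan. The sketch below is one possible check, not part of the original scripts; the function name and the wording of the messages are hypothetical.

```python
from collections import defaultdict


def check_map_rows(rows):
    """Return a list of problems found in parsed map-file rows
    (each row is a list of column values, header excluded)."""
    problems = []
    names_per_bin = defaultdict(set)
    for number, row in enumerate(rows, start=1):
        if len(row) != 5:
            problems.append("row %d: expected 5 columns, found %d"
                            % (number, len(row)))
            continue
        bincode, name, identifier, _description, row_type = row
        if not bincode or not identifier or not row_type:
            problems.append(
                "row %d: BINCODE, IDENTIFIER and TYPE are mandatory" % number)
        if name.strip():  # blank NAME values are allowed and ignored
            names_per_bin[bincode].add(name.strip())
    for bincode in sorted(names_per_bin):
        names = names_per_bin[bincode]
        if len(names) > 1:
            problems.append("BINCODE %s has several NAME values: %s"
                            % (bincode, ", ".join(sorted(names))))
    return problems
```

A map file whose NAME column is left blank, as in the tobacco map file, passes the last check trivially, which is the reason that choice avoids the conflict between the potato and Arabidopsis names.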

Concluding Remarks

Figure 1 shows that the MapMan ontology mapping file for tobacco that we have generated does indeed work. Figure 1A shows the changes in expression level of genes associated with primary metabolism after a short period of water stress in tobacco leaves. Blue shows genes that are up-regulated at the mRNA level and red shows genes that are down-regulated. The darkest colour represents at least an 8-fold change in mRNA level. Figure 1B shows a later timepoint in water stress and shows that the MapMan ontology mapping file for tobacco visualizes the changes in gene expression well. Based on the MapMan results, we can clearly identify areas of primary and secondary metabolism that are subject to regulation during water stress. These genes are therefore identified as potential targets for improving drought responses.

Figure 1.

MapMan visualization of changes in expression levels of genes associated with primary metabolism using our MapMan ontology mapping file for tobacco (Nicotiana tabacum L.). Blue denotes up-regulation and red down-regulation. Intense blue or red denotes fold changes of 8-fold or more. A. Changes in gene expression after a short period of dehydration of tobacco seedlings (40 minutes) compared to unstressed controls. B. Changes in gene expression after a longer period of dehydration of tobacco seedlings (4 hours).

The successful use of the MapMan ontology mapping file for tobacco (Figure 1) illustrates that our strategy of going via potato has been a good one. This is because each unassembled gene space read from tobacco that is present on the oligo array may only contain a short part of an exon and this may not correspond to any protein sequence in the more distantly related Arabidopsis proteome. Potato is much more closely related to tobacco as it is also a member of the Solanaceae. This means that most fragmentary tobacco sequences can be assigned to a corresponding full length potato sequence that will contain conserved domains that allow identification of the protein. This full length potato protein sequence will in the majority of cases have a similar type of protein in Arabidopsis for mapping purposes. We propose that adapting our procedure and codes for other plant species where the complete genome is not yet available will facilitate MapMan ontology mapping for those plant species.

Acknowledgments

The authors would like to thank Administrative and Research Computing at South Dakota State University for providing computational resources. This project was supported by National Research Initiative grants 2008-35100-04519 and 2008-35100-05969 from the USDA National Institute of Food and Agriculture. Research in the Rushton laboratory is also supported by The United Soybean Board, The Consortium for Plant Biotechnology Research, The South Dakota Soybean Research and Promotion Council and The North Central Soybean Research Program. ML and XG were supported by National Institutes of Health grant 5R01GM083226-03.

Footnotes

Disclosure

No conflict of interest is declared.

References

  • 1. Urano K, et al. ‘Omics’ analyses of regulatory networks in plant abiotic stress responses. Curr Opin Plant Biol. 2010;13(2):132–8. doi: 10.1016/j.pbi.2009.12.006.
  • 2. Busch W, Lohmann JU. Profiling a plant: expression analysis in Arabidopsis. Curr Opin Plant Biol. 2007;10(2):136–41. doi: 10.1016/j.pbi.2007.01.002.
  • 3. Hafidh S, et al. Comprehensive analysis of tobacco pollen transcriptome unveils common pathways in polar cell expansion and underlying heterochronic shift during spermatogenesis. BMC Plant Biology. 2012;12(1):24. doi: 10.1186/1471-2229-12-24.
  • 4. Cuming AC, et al. Microarray analysis of transcriptional responses to abscisic acid and osmotic, salt, and drought stress in the moss, Physcomitrella patens. New Phytologist. 2007;176(2):275–287. doi: 10.1111/j.1469-8137.2007.02187.x.
  • 5. Seki M, et al. Monitoring the expression profiles of 7000 Arabidopsis genes under drought, cold and high-salinity stresses using a full-length cDNA microarray. Plant Journal. 2002;31(3):279–292. doi: 10.1046/j.1365-313x.2002.01359.x.
  • 6. Oono Y, et al. Monitoring expression profiles of Arabidopsis genes during cold acclimation and deacclimation using DNA microarrays. Functional & Integrative Genomics. 2006;6(3):212–234. doi: 10.1007/s10142-005-0014-z.
  • 7. Li D, et al. An expression database for roots of the model legume Medicago truncatula under salt stress. BMC Genomics. 2009;10:517. doi: 10.1186/1471-2164-10-517.
  • 8. Li D, et al. Transcriptional profiling of Medicago truncatula under salt stress identified a novel CBF transcription factor MtCBF4 that plays an important role in abiotic stress responses. BMC Plant Biol. 2011;11:109. doi: 10.1186/1471-2229-11-109.
  • 9. Shujia L, Weidong H, Diqiu Y. Functional analysis of an Arabidopsis transcription factor WRKY25 in heat stress. Plant Cell Reports. 2009;28(4):683–693. doi: 10.1007/s00299-008-0666-y.
  • 10. Stolf-Moreira R, et al. Transcriptional Profiles of Roots of Different Soybean Genotypes Subjected to Drought Stress. Plant Molecular Biology Reporter. 2011;29(1):19–34.
  • 11. Davey MW, et al. Heterologous oligonucleotide microarrays for transcriptomics in a non-model species; a proof-of-concept study of drought stress in Musa. BMC Genomics. 2009;10:436. doi: 10.1186/1471-2164-10-436.
  • 12. Rotter A, et al. Adaptation of the MapMan ontology to biotic stress responses: application in solanaceous species. Plant Methods. 2007;3:10. doi: 10.1186/1746-4811-3-10.
  • 13. Sirava M, et al. BioMiner--modeling, analyzing, and visualizing biochemical pathways and networks. Bioinformatics. 2002;18(2):S219–30. doi: 10.1093/bioinformatics/18.suppl_2.s219.
  • 14. Thimm O, et al. mapman: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. The Plant Journal. 2004;37(6):914–939. doi: 10.1111/j.1365-313x.2004.02016.x.
  • 15. Usadel B, et al. A guide to using MapMan to visualize and compare Omics data in plants: a case study in the crop species, Maize. Plant Cell Environ. 2009;32(9):1211–29. doi: 10.1111/j.1365-3040.2009.01978.x.
  • 16. Al-Ghazi Y, et al. Transcript Profiling During Fiber Development Identifies Pathways in Secondary Metabolism and Cell Wall Structure That May Contribute to Cotton Fiber Quality. Plant and Cell Physiology. 2009;50(7):1364–1381. doi: 10.1093/pcp/pcp084.
  • 17. Rotter A, et al. Gene expression profiling in susceptible interaction of grapevine with its fungal pathogen Eutypa lata: extending MapMan ontology for grapevine. BMC Plant Biol. 2009;9:104. doi: 10.1186/1471-2229-9-104.
  • 18. Urbanczyk-Wochniak E, et al. Conversion of MapMan to allow the analysis of transcript data from Solanaceous species: effects of genetic and environmental alterations in energy metabolism in the leaf. Plant Mol Biol. 2006;60(5):773–92. doi: 10.1007/s11103-005-5772-4.
  • 19. Twyman RM. Host plants, Systems and Expression Strategies for Molecular Farming. In: Fischer R, Schillberg S, editors. Molecular Farming: Plant-Made Pharmaceuticals and Technical Proteins. John Wiley & Sons; 2006. p. 338.
  • 20. Nagata T, Nemoto Y, Hasezawa S. Tobacco BY-2 Cell Line as the “HeLa” Cell in the Cell Biology of Higher Plants. In: Kwang WJ, Martin F, editors. International Review of Cytology. Academic Press; 1992. pp. 1–30.
  • 21. Remmerie N, et al. Unraveling tobacco BY-2 protein complexes with BN PAGE/LC MS/MS and clustering methods. J Proteomics. 2011;74(8):1201–1217. doi: 10.1016/j.jprot.2011.03.023.
