Nucleic Acids Research. 2001 Jul 1;29(13):e63. doi: 10.1093/nar/29.13.e63

yMGV: a database for visualization and data mining of published genome-wide yeast expression data

Philippe Marc 1,a, Frédéric Devaux 1, Claude Jacq 1
PMCID: PMC55787  PMID: 11433039

Abstract

The yeast Microarray Global Viewer (yMGV) is an on-line database providing a synthetic view of the transcriptional expression profiles of Saccharomyces cerevisiae genes in most of the published expression datasets. For a given gene, yMGV displays a one-screen graphical representation of its expression variations in each published genome-wide experiment, allowing quick retrieval of the experimental conditions that affect expression of this gene. yMGV also provides tools to isolate groups of genes sharing similar transcription profiles in a defined subset of experiments. Additionally, yMGV furnishes a set of statistical tools for critical assessment of published data. We therefore believe that yMGV is an efficient tool that affords a quick and comprehensive overview of microarray data and generates new gene classifications. As of 20 March 2001 the yMGV database contains 6 000 000 measurements, representing genome-wide expression comparisons of 932 experiments from 39 microarray publications. The yMGV interface is available at http://transcriptome.ens.fr/ymgv/.

INTRODUCTION

In the past three years DNA microarrays have evolved from a sophisticated and precious technical advance, available only in a few pioneering laboratories, into a common technology accessible to numerous research groups. The number of published studies using microarrays has grown, generating an ever-increasing mass of data. Mining these data is a well-recognized challenge, especially when they are scattered among numerous websites, incomplete or inaccessible. Some efforts dealing with specific aspects of data analysis have been made to construct databases from the published datasets. However, these databases either contain a limited number of published datasets [YPD (1) and Expression connection (2)] or only allow very specific requests [ExpressDB (3) and Webminer (4)].

If one wishes to consider a large variety of genome-wide expression data, the yeast Saccharomyces cerevisiae is especially appropriate, since more than 50 studies dealing with a large spectrum of biological conditions have been published and most of the authors have agreed to make their data available for a common database. We have developed a World Wide Web accessible database called the yeast Microarray Global Viewer (yMGV), which contains most of the results of published microarray experiments on the yeast S.cerevisiae. For the first time it is possible to view quickly and graphically the expression level of one particular yeast gene across nearly 1000 experiments on a single screen. This immediately indicates the experimental conditions that affect expression of any given S.cerevisiae gene. Such information is important in view of recent pioneering studies which have characterized new functional gene clusters, also described as synexpression groups (5–7). For this purpose yMGV provides a user-friendly query system to compare results from different publications, thus allowing users to select genes sharing a particular expression profile.

GENERAL USE OF yMGV

Datasets available in yMGV

The yMGV database contains datasets from most of the publications that have used microarrays to assess genomic expression in yeast. The data stored in yMGV are the ORF identifier and the filtered normalized Cy5/Cy3 ratio provided by the author for each experiment. To date (March 2001) the database contains 6 000 000 records representing 932 S.cerevisiae experiments published in 39 articles. All the data used are public data downloaded from the web or directly obtained with the agreement of the corresponding authors. A list of the available publications is shown in Table 1. The publication identifier is composed of the first author’s name and a keyword summarizing the major topic of the publication. Complete information concerning each publication is available online. It includes a complete reference (title of the paper, list of authors and Medline reference) and direct links to the publication Medline page, PDF format article (when access is allowed) and website containing the original data (if available).

Table 1. Available datasets.

Datasets currently accessible from yMGV (20 March 2001). An updated version is available online.

Using yMGV to detect conditions that affect the expression of a given gene

The interrogation form allows entry of a gene identifier [ORF name or gene name registered in the Saccharomyces Genome Database (8)] and selection of the set of publications one wishes to scan (Fig. 1A). This selection can be done via predefined groups containing related publications. Alternatively, a custom group containing a specific publication selection can be created. A filter option allows the setting of a cut-off (1.5, 2 or 3) for significance of the ratios, thus allowing data exploration according to one’s own criteria. The ‘aligned transcription profiles’ option applies the same search criteria to several genes at the same time (Fig. 1B). Once again, a one-screen representation allows direct comparison of the requested transcription profiles (Fig. 1D).

Figure 1.

Main features of yMGV. The upper part of this figure indicates three major ways to use yMGV. (A) One gene can be selected and the expression profiles corresponding to the selected experiments appear in the second line. The red and green bars correspond, respectively, to a 2-fold increase or decrease in the ratio. (B) Several genes can be selected at the same time, thus allowing direct comparison of their expression profiles in the different microarray experiments. (C) One can impose specific parameters, for instance a 2-fold increase in ratio, select a subset of experiments and obtain the expression profiles of the corresponding genes. The histogram set (E) corresponds to one microarray publication. Each bar corresponds to one specific experiment for which details can be obtained directly. These details include not only a description of the exact conditions, the minimum, maximum and average ratio (F), but also a graphical representation of the log2 (ratio) distribution, the standard deviation and the genome-wide expression map for the experiment (H). Details on publications are directly available (G).

yMGV output histograms represent the expression profile of the requested gene in each selected publication. Base 2 logarithms have been used to make induction and repression effects directly comparable. One histogram is drawn for each experiment in the selected publication. The histogram bars are red when the ratio is greater than 2 (log2 > 1) and green when the ratio is below 0.5 (log2 < –1) (these limits can be manually changed in zoom mode). Direct links to the publication Pubmed page, the article and the website containing the original data can be obtained by clicking on the publication identifier (Fig. 1G). A click on a histogram set reveals a full-screen version, including an experimental description and ratios (Fig. 1E). The experimental conditions are those reported in the original publication. A short comment on each experiment can be obtained by clicking on it (Fig. 1F). This comment includes strains and exact conditions used for each fluorochrome as well as statistics on the distribution of the ratios in the experiment.
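The coloring rule described above can be illustrated with a short sketch. This is not the yMGV source code: the function name `bar_color` and the neutral color `grey` for ratios inside the thresholds are assumptions made for illustration; only the red and green limits (ratio above 2 or below 0.5 by default) come from the text.

```python
import math


def bar_color(ratio, fold=2.0):
    """Return the color of a histogram bar for one Cy5/Cy3 ratio.

    Assumption: bars inside the thresholds are drawn in a neutral
    color ('grey' here); the text only specifies red and green.
    """
    log_ratio = math.log2(ratio)
    limit = math.log2(fold)  # a 2-fold cut-off is 1.0 on the log2 scale
    if log_ratio > limit:
        return "red"    # induced beyond the cut-off
    if log_ratio < -limit:
        return "green"  # repressed beyond the cut-off
    return "grey"


# On the log2 scale a 4-fold induction and a 4-fold repression are symmetric.
print(math.log2(4.0), math.log2(0.25))
print(bar_color(4.0), bar_color(0.25), bar_color(1.2))
```

Passing a different `fold` value mimics changing the limits in zoom mode, for example `bar_color(2.5, fold=3.0)` returns the neutral color.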

Since the yMGV database contains the final filtered ratio, it is especially important to be able to rapidly access the relevant publication for assessment of the absolute levels of expression, together with the methods used to filter the data.

A FEW EXAMPLES OF SPECIFIC REQUESTS ADDRESSED TO yMGV

Search for genes sharing similar transcription profiles

Several recent genome-wide studies have highlighted the biological importance of group effects in transcription profile patterns. Genes that have similar expression profiles are called gene clusters (5) or synexpression groups (7). The premise of this guilt-by-association approach is that clustered genes may be co-regulated and therefore involved in similar functions. Versatile access to such analyses covering the available published data is important since there are many ways to consider the microarray data, depending on one’s interests. yMGV allows the user to set up his or her own search criteria (selected set of experiments and significance ratio), for instance for all the ORFs exhibiting a 2-fold increase in expression in a first publication and a 3-fold decrease in two others (Fig. 1C). Such overlaps between the expression patterns of different experiments are likely to be of biological significance and give important clues to guide new research.
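A query of this kind can be sketched as a simple filter over per-ORF expression profiles. The sketch below is only an illustration under stated assumptions: the names `matches` and `select_orfs`, the ORF and experiment identifiers, and the ratios are all hypothetical, each publication is reduced to a single experiment, and a cut-off is treated as inclusive (ratio >= fold, or ratio <= 1/fold). It does not describe how yMGV implements its search.

```python
def matches(profile, criteria):
    """Check one ORF profile against a list of search criteria.

    profile  : dict mapping experiment id -> Cy5/Cy3 ratio
    criteria : list of (experiment id, "up" or "down", fold) tuples
    """
    for exp_id, direction, fold in criteria:
        ratio = profile.get(exp_id)
        if ratio is None:
            return False  # ORF not measured in this experiment
        if direction == "up" and ratio < fold:
            return False
        if direction == "down" and ratio > 1.0 / fold:
            return False
    return True


def select_orfs(profiles, criteria):
    """Return the sorted ORF names whose profiles satisfy every criterion."""
    return sorted(orf for orf, p in profiles.items() if matches(p, criteria))


# Hypothetical data: three ORFs, one experiment from each of three publications.
profiles = {
    "ORF_A": {"pub1_exp1": 2.4, "pub2_exp1": 0.30, "pub3_exp1": 0.25},
    "ORF_B": {"pub1_exp1": 2.1, "pub2_exp1": 0.30, "pub3_exp1": 0.90},
    "ORF_C": {"pub1_exp1": 3.0, "pub2_exp1": 0.20},
}
# 2-fold increase in the first publication, 3-fold decrease in two others.
criteria = [
    ("pub1_exp1", "up", 2.0),
    ("pub2_exp1", "down", 3.0),
    ("pub3_exp1", "down", 3.0),
]
print(select_orfs(profiles, criteria))  # ['ORF_A']
```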

Statistical analyses of the published data

Few transcriptome analyses indicate the statistical significance of their results. yMGV provides simple statistical tools which allow the user to critically assess the expression profile changes between two tested conditions in any publication (Fig. 1H). For each experiment it is possible: (i) to see the number of genes whose expression level has been increased (or decreased) 1.5-, 2- or 3-fold; (ii) to draw a graphical representation of the log2 (ratio) distribution and to obtain the standard deviation of the distribution; (iii) to access maxima, minima and the average ratio. yMGV can also display a graphical representation of the genomic localization of the activated (or repressed) genes on the whole set of chromosomes. This representation of more or less transcriptionally active genomic domains is also available using all the datasets included in yMGV.
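The per-experiment statistics listed in (i) to (iii) are straightforward to compute from the list of ratios of one experiment. The following sketch is illustrative only: the function name `experiment_summary`, the dictionary keys, the inclusive cut-offs and the use of the population standard deviation are assumptions, since the text does not specify these details.

```python
import math
import statistics


def experiment_summary(ratios, cutoffs=(1.5, 2.0, 3.0)):
    """Summarize the Cy5/Cy3 ratios of a single experiment."""
    logs = [math.log2(r) for r in ratios]
    summary = {
        "min": min(ratios),
        "max": max(ratios),
        "mean": statistics.mean(ratios),
        # Assumption: population standard deviation of the log2 ratios.
        "log2_sd": statistics.pstdev(logs),
    }
    for c in cutoffs:
        # Number of genes increased or decreased at least c-fold.
        summary[f"up_{c}"] = sum(1 for r in ratios if r >= c)
        summary[f"down_{c}"] = sum(1 for r in ratios if r <= 1.0 / c)
    return summary


# Hypothetical ratios for a five-gene toy experiment.
toy = experiment_summary([4.0, 2.0, 1.0, 0.5, 0.25])
print(toy["up_2.0"], toy["down_2.0"], toy["up_3.0"], toy["down_3.0"])
```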

Search for marginally fluctuating gene expression

yMGV allows the user to select all the yeast genes that, for instance, have never been induced >1.5-fold in any of the experiments considered. The whole dataset naturally does not cover all possible conditions; some genes are not represented on currently used microarrays (the corresponding list is available in yMGV) and poorly expressed genes are not faithfully analyzed in microarray experiments. Bearing these restrictions in mind, such non-fluctuating genes might be good candidates as artificial ORFs devoid of biological significance. The list of these genes will become increasingly accurate as the number of microarray experiments increases.
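As a minimal sketch of such a selection, assuming profiles are held as a mapping from ORF name to measured ratios, the filter below keeps only ORFs that were measured at least once and never exceeded the cut-off. The function name `never_induced` and the toy data are hypothetical, and ORFs without any measurement are excluded on purpose, in line with the restrictions mentioned above.

```python
def never_induced(profiles, fold=1.5):
    """Return ORFs whose ratio never exceeds `fold` in any measured experiment.

    profiles : dict mapping ORF name -> dict of experiment id -> ratio
    ORFs with no measurement at all are excluded, because nothing can be
    concluded about genes that are absent from the arrays.
    """
    return sorted(
        orf
        for orf, ratios in profiles.items()
        if ratios and all(r <= fold for r in ratios.values())
    )


# Hypothetical profiles for three ORFs across two experiments.
profiles = {
    "ORF_A": {"exp1": 1.1, "exp2": 0.9},
    "ORF_B": {"exp1": 1.0, "exp2": 2.3},
    "ORF_C": {},  # not represented on the arrays
}
print(never_induced(profiles))  # ['ORF_A']
```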

Analysis of highly fluctuating genes can also be interesting. Table 2 lists the 50 genes whose expression most often changed 2-fold (up or down) across the 932 experiments in yMGV. A precise knowledge of this biological noise is, of course, important in the analysis of microarray data. Furthermore, this ‘fluctuating effect’ can be biologically relevant and a better definition of the genes involved is worth considering.

Table 2. Highly fluctuating genes list.

The top 50 most often 2-fold changed (up or down) ORFs in yMGV (as of 20 March 2001). An updated version is available online. Functional information is from the SGD. Score is the number of conditions where the ratio falls outside the range 0.5–2. Present is the number of conditions where the ORF was present in the array. Percent is score/present.
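The Score, Present and Percent columns defined above can be reproduced with a few lines of code. This is a sketch under assumptions rather than the procedure actually used to build Table 2: the function name `fluctuation_table` and the toy data are hypothetical, ranking is assumed to be by Score (ties broken by ORF name), and Percent is expressed as 100 × score/present.

```python
def fluctuation_table(profiles, top=50):
    """Rank ORFs by how often their ratio falls outside the range 0.5-2.

    profiles : dict mapping ORF name -> dict of experiment id -> ratio
    Returns a list of (orf, score, present, percent) tuples.
    """
    rows = []
    for orf, ratios in profiles.items():
        present = len(ratios)  # conditions where the ORF was on the array
        if present == 0:
            continue
        score = sum(1 for r in ratios.values() if r < 0.5 or r > 2.0)
        rows.append((orf, score, present, 100.0 * score / present))
    # Assumption: rank by score, highest first, then by ORF name.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:top]


# Hypothetical profiles for three ORFs across four experiments.
profiles = {
    "ORF_A": {"e1": 3.0, "e2": 0.2, "e3": 1.0, "e4": 2.5},
    "ORF_B": {"e1": 1.1, "e2": 0.9, "e3": 4.0, "e4": 1.0},
    "ORF_C": {"e1": 1.0, "e2": 1.2},
}
for row in fluctuation_table(profiles):
    print(row)
```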

IMPLEMENTATION, LIMITATIONS AND FUTURE DEVELOPMENTS

Implementation

The yMGV interface is freely available via the internet at www.transcriptome.ens.fr/ymgv/. All the software used to power yMGV is distributed under an open source licence, i.e. anybody can download it free of charge from the World Wide Web (see www.opensource.org for more information). We plan to distribute the yMGV database schema and scripts (including graphical administration) so that any laboratory can create its own local database containing private datasets.

Limitations

The main limitation of yMGV is that, despite all our efforts, a number of published datasets remain unreachable because the corresponding authors did not answer our request or refused to make their data publicly available. The existence of a public repository coupled to a well-defined publication policy should solve this problem. Such a public repository imposing a universal transfer format is urgently required for comparative approaches to microarray data. Universal standards for DNA array experiment annotation, data representation and universal controls are presently being elaborated (see the MGED group at www.mged.org). Prototypes using the first draft of the MGED group have already been constructed, but are not yet functional (ArrayExpress at www.ebi.ac.uk/arrayexpress/, GeneX at www.ncgr.org/research/genex/ and GEO at www.ncbi.nlm.nih.gov/geo/).

Future developments

The next step in yMGV development will be to pre-calculate cluster groups in each biological system as previously defined (cell cycle, metabolism, stress, etc.). This should help the user to establish the complete list of genes that share similar transcription profiles with the requested gene. Notably, this should help to suggest functions for orphan genes.

The second step in the yMGV project will be to add datasets from other organisms to the database. Connections between genes from different organisms via sequence similarities or via function assignment should permit the use of the vast amount of knowledge accumulated on yeast to find useful information on related genes in other model organisms.

CONCLUSION

It is clear that the extensive scientific endeavor which aims at characterizing the transcriptome properties of several model organisms will lead to important new biological concepts only if reliable data are shared. yMGV is the first database offering a global view of existing yeast transcriptome data coupled to a simple interface. yMGV also provides direct access to yeast microarray data, allowing the user to elaborate his or her own interpretation of the published data. Moreover, with this large overview of the data it is possible to address global questions which overlap in several distinct experiments. In this respect new tools will be introduced into yMGV for better characterization of gene clusters. Furthermore, this database has been designed to store microarray data from different organisms. Such cross-talk between organism data will allow direct assessment of the progress and functional meaning of transcription regulation networks through evolution.

ACKNOWLEDGEMENTS

This work would not have been possible without the help of all the authors who made their data freely accessible. We thank GNU projects for providing such good free software. This work was supported by CNRS and by grants from ARC (no. 5691) and from Genopole Ile de France.

References

1. Costanzo,M.C., Crawford,M.E., Hirschman,J.E., Kranz,J.E., Olsen,P., Robertson,L.S., Skrzypek,M.S., Braun,B.R., Hopkins,K.L., Kondu,P., Lengieza,C., Lew-Smith,J.E., Tillberg,M. and Garrels,J.I. (2001) YPD™, PombePD™ and WormPD™: model organism volumes of the BioKnowledge™ library, an integrated resource for protein information. Nucleic Acids Res., 29, 75–79.
2. Ball,C.A., Jin,H., Sherlock,G., Weng,S., Matese,J.C., Andrada,R., Binkley,G., Dolinski,K., Dwight,S.S., Harris,M.A., Issel-Tarver,L., Schroeder,M., Botstein,D. and Cherry,J.M. (2001) Saccharomyces genome database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res., 29, 80–81.
3. Aach,J., Rindone,W. and Church,G.M. (2000) Systematic management and analysis of yeast gene expression data. Genome Res., 10, 431–445.
4. Heiman,M.G. and Walter,P. (2000) Prm1p, a pheromone-regulated multispanning membrane protein, facilitates plasma membrane fusion during yeast mating. J. Cell Biol., 151, 719–730.
5. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
6. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D., Kidd,M.J., King,A.M., Meyer,M.R., Slade,D., Lum,P.Y., Stepaniants,S.B., Shoemaker,D.D., Gachotte,D., Chakraburtty,K., Simon,J., Bard,M. and Friend,S.H. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
7. Niehrs,C. and Pollet,N. (1999) Synexpression groups in eukaryotes. Nature, 402, 483–487.
8. Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M., Weng,S. and Botstein,D. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res., 26, 73–79.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
