PLoS One. 2020 Jan 24;15(1):e0227076. doi: 10.1371/journal.pone.0227076

All of gene expression (AOE): An integrated index for public gene expression databases

Hidemasa Bono
Editor: Robert Hoehndorf
PMCID: PMC6980531  PMID: 31978081

Abstract

Gene expression data have been archived as microarray and RNA-seq datasets in two public databases, Gene Expression Omnibus (GEO) and ArrayExpress (AE). In 2018, the DNA DataBank of Japan started a similar repository called the Genomic Expression Archive (GEA). These databases are useful resources for the functional interpretation of genes, but have been separately maintained and may lack RNA-seq data, while the original sequence data are available in the Sequence Read Archive (SRA). We constructed an index for those gene expression data repositories, called All Of gene Expression (AOE), to integrate publicly available gene expression data. The web interface of AOE can graphically query data in addition to the application programming interface. By collecting gene expression data from RNA-seq in the SRA, AOE also includes data not included in GEO and AE. AOE is accessible as a search tool from the GEA website and is freely available at https://aoe.dbcls.jp/.

Introduction

After the invention of the microarray, it became possible to measure the abundance of all transcripts at the genomic scale, a set now called the transcriptome. Following the development of the Minimum Information About a Microarray Experiment (MIAME) standard [1], gene expression data from such experiments have been archived in a MIAME-compliant manner in two public repositories: the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) [2] and the EBI ArrayExpress (AE; https://www.ebi.ac.uk/arrayexpress/) [3].

Unlike the International Nucleotide Sequence Database [4], these two databases for gene expression have not been exchanging data with each other. AE had once imported data from GEO but stopped doing so in 2017 (https://www.ebi.ac.uk/arrayexpress/help/GEO_data.html). Archived GEO data is still available from AE, but new data archived in GEO is no longer available from AE. Therefore, users need to search both databases to get comprehensive public gene expression data of interest because these databases have been separately maintained. Furthermore, the DNA DataBank of Japan (DDBJ) recently started a similar repository called the Genomic Expression Archive (GEA; https://www.ddbj.nig.ac.jp/gea/) [5]. Hence there is a need for integration of these public gene expression databases.

Also, these databases may lack transcriptome sequencing (RNA-seq) data even when the original sequence data are accessible in the Sequence Read Archive (SRA) [6], the nucleotide sequence repository for high-throughput sequencing platforms. This is because data deposition to GEO and AE is not mandatory when the original sequencing data are deposited to the SRA.

We, therefore, developed an index of public gene expression databases, called All Of gene Expression (AOE). The aim of AOE is to integrate gene expression data and make them all searchable together. We have maintained AOE for five years, and it has been useful for functional genomics research. Here, we report a detailed description and utility of AOE. AOE is freely accessible from https://aoe.dbcls.jp/.

Results

Status of gene expression databases

Gene expression data in NCBI Gene Expression Omnibus (GEO) used to be continuously imported into EBI ArrayExpress (AE), and thus we were theoretically able to obtain all data deposited to GEO from AE. Therefore, All Of gene Expression (AOE) was originally indexed for AE only.

Unfortunately, AE discontinued GEO data import in 2017. At that point, we investigated data-series entries in these two databases by matching GEO series IDs: IDs beginning with GSE in GEO and those beginning with E-GEOD in AE; for example, GSE52334 in GEO corresponds to E-GEOD-52334 in AE. Apparently over thirty thousand entries were missing in AE (Fig 1). Furthermore, even GEO did not publicly represent the whole transcriptome data, as over ten thousand entries in AE were missing in GEO. Thus, we decided to include those missing entries in AOE. In other words, we started indexing GEO data and other public transcriptome data, including the DDBJ Genomic Expression Archive (GEA), to allow all public gene expression data to be searched.
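The ID matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce Fig 1: the function names are hypothetical, and treating every AE entry without a GEO counterpart (including non-E-GEOD accessions) as "missing in GEO" is an assumption.

```python
import re


def geo_to_ae(gse_id):
    """Convert a GEO series ID (GSE...) to the matching AE accession (E-GEOD-...)."""
    match = re.fullmatch(r"GSE(\d+)", gse_id)
    if match is None:
        raise ValueError(f"not a GEO series ID: {gse_id!r}")
    return f"E-GEOD-{match.group(1)}"


def compare_series(geo_ids, ae_ids):
    """Split series into those found in both databases, only in GEO, or only in AE.

    All results are expressed as AE-style accessions so the sets are comparable.
    """
    geo_as_ae = {geo_to_ae(gse) for gse in geo_ids}
    ae_all = set(ae_ids)
    return {
        "in_both": geo_as_ae & ae_all,
        "missing_in_ae": geo_as_ae - ae_all,   # deposited in GEO, absent from AE
        "missing_in_geo": ae_all - geo_as_ae,  # present in AE, no GEO counterpart
    }
```

For example, `geo_to_ae("GSE52334")` yields the accession given in the text, `E-GEOD-52334`; the sizes of the three sets correspond to the regions of a two-set diagram such as Fig 1.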

Fig 1. Comparison of EBI ArrayExpress and NCBI Gene Expression Omnibus.

The number of overlapping data-series entries in ArrayExpress (left) and Gene Expression Omnibus (right).

An index of gene expression data series from metadata

AOE was originally developed to provide a graphical web interface to search EBI AE, which is one of the public gene expression databases described above. We call this dataset that includes data from AE only ‘AOE level 1’ (Fig 2). Data at this level contain only IDs for AE, and the entries imported from GEO contain IDs for both BioProject and GEO.

Fig 2. How to make AOE index.

Three levels of AOE data flow are required to make the AOE index. Level 1 is built from ArrayExpress data, level 2 from GEO data in SRA via the DBCLS SRA API, and level 3 from RNA-seq data that are in SRA but not in GEO, also via the DBCLS SRA API.

After the import of GEO data to AE was discontinued, AOE began importing GEO data by directly utilizing the DBCLS SRA application programming interface (API) [7]. By subtracting the GEO data already existing in AE, new entries were included in AOE. We call the merged dataset that includes GEO data ‘AOE level 2’ (Fig 2). Data at this level contain IDs for BioProject and GEO, but not for AE.

There were still some gene expression data missing that were not included in AE and GEO but were registered as transcriptome sequencing data in SRA. The final merged dataset is called ‘AOE level 3’ and represents a real public gene expression dataset (Fig 2). Data at this level contain BioProject IDs only.
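The three-level merge can be illustrated as successive set subtraction. The sketch below is hypothetical and is not taken from the AOE repository: the record layout (dictionaries with optional `ae`, `geo`, and `bioproject` keys) and the choice of the GEO series ID and the BioProject ID as deduplication keys are assumptions made for illustration.

```python
def merge_levels(ae_records, geo_records, sra_rnaseq_records):
    """Merge three sources into one index, tagging each entry with its AOE level.

    Each record is a dict; the keys "ae", "geo" and "bioproject" are assumed names.
    """
    index = []
    seen_geo = set()
    seen_bioproject = set()

    def remember(record):
        if record.get("geo"):
            seen_geo.add(record["geo"])
        if record.get("bioproject"):
            seen_bioproject.add(record["bioproject"])

    # Level 1: everything indexed from ArrayExpress, including old GEO imports.
    for record in ae_records:
        index.append({**record, "level": 1})
        remember(record)

    # Level 2: GEO series not already present through ArrayExpress.
    for record in geo_records:
        if record["geo"] in seen_geo:
            continue
        index.append({**record, "level": 2})
        remember(record)

    # Level 3: RNA-seq projects in SRA that neither AE nor GEO covers.
    for record in sra_rnaseq_records:
        if record["bioproject"] in seen_bioproject:
            continue
        index.append({**record, "level": 3})
        remember(record)

    return index
```

Under these assumptions, the ID pattern of each level follows from the merge order: level 2 entries carry GEO and BioProject IDs but no AE ID, and level 3 entries carry a BioProject ID only.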

Fig 2 shows a schematic view of the data flow that was used to prepare the index data for AOE, as summarized above. The constructed data are archived in the Life Science Database Archive at the National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), and are available at the DOI: 10.18908/lsdba.nbdc00467-000 (https://doi.org/10.18908/lsdba.nbdc00467-000).

As AOE was designed to index public gene expression data, data have been indexed for the search at the level of the ‘experimental series’. Individual hybridization data for microarray and run data for RNA-seq are directly linked to the original databases. All code to parse public databases and construct a web service is accessible from the DBCLS AOE GitHub repository (https://github.com/dbcls/AOE/). The code is free and open-source software and can be installed anywhere.

Graphical web interface

Gathering all three levels of data described above, AOE enables visualization and exploration of gene expression data. AOE provides an interactive web interface (https://aoe.dbcls.jp/) to retrieve data of interest (Fig 3). Users can see overall statistics of stored data in AOE (Fig 3A). The histogram for ranking by quantification methods can be dynamically created by clicking on the technology name. Fig 3B shows the number of data in AOE only for sequencing assays (RNA-seq).

Fig 3. AOE web interface.

Users can retrieve data of interest graphically in the AOE web interface.

Users can easily filter data by organism and quantification method of gene expression. For example, users can search with the keyword ‘hypoxia’ (Fig 3A). AOE currently reports 524 items with three histograms (by year, organism, and quantification method; Fig 3C). After looking at the histograms, the user can filter the data by ‘Homo sapiens’ by dragging the bar in the histogram by organism. Then, AOE recreates the histograms with the selected data (Fig 3D). Additionally, the user can filter the data by ‘Illumina’ by dragging the bar in the histogram by quantification method (Fig 3E). The selected data (58 records currently) can be retrieved by clicking on the ‘Retrieve’ button (Fig 3F). Users can browse retrieved data and jump to original data by clicking on IDs in the table (ArrayExpress, BioProject, and GEO; Fig 3G). Optionally, users can also download the list of IDs from the ‘Download ID list’ button.

A shortcut to retrieve a list of specific organisms is to click on the species icon with nomenclature and the ‘retrieve’ button (Fig 3A). The top 30 species in AOE will be listed and can be accessed in this way.

Further, a tutorial movie that shows how to make use of the AOE web interface is available at https://doi.org/10.7875/togotv.2018.146. The movie originated from the contents of TogoTV, which provides tutorial videos for useful databases and web tools in the life sciences [8], and is available on the TogoTV original website (https://togotv.dbcls.jp/en/) and YouTube (https://youtube.com/togotv/).

The AOE web server has been maintained for five years. From the usage statistics (from July 2015 to Oct 2019), there were 95,334 visits, 393,174 page views, and 630,837 hits. The fact that two-thirds of visits lasted under 30 seconds suggests that users typically accessed the AOE web server in their web browser to run a quick query with specific keywords.

Application programming interface

Users can also query AOE via API. AOE provides a simple Representational State Transfer (REST) API that enables users to perform searches with their client programs in an automated manner. The search results in a JSON formatted output can be retrieved through the following URL:

https://aoe.dbcls.jp/api/search?fulltext=KEYWORD&[Technology=TECHNOLOGY&Organisms=ORGANISM&page=OFFSET&size=SIZE]

KEYWORD: keyword to search AOE

TECHNOLOGY: technology used for expression profiling (sequencing, microarray, Affymetrix, Agilent, Illumina, etc.)

ORGANISM: ‘homo%20sapiens’ for human, ‘mus%20musculus’ for mouse, etc.

OFFSET: the page number

SIZE: the number of results in one page

1. A search for the keyword ‘hypoxia’ with 25 results in one page is represented as follows:

https://aoe.dbcls.jp/api/search?fulltext=hypoxia&page=1&size=25

2. A search for human RNA-seq (sequencing assay) data with 25 results in one page is represented as follows:

https://aoe.dbcls.jp/api/search?Technology=sequencing&organisms=homo%20sapiens&page=1&size=25

A precise description for the AOE API is available from the AOE website or directly from the DBCLS AOE GitHub website at https://github.com/dbcls/AOE/blob/master/API_documentation.md.
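As an illustration of how a client program might call this API, the following Python sketch builds the query URL from the template above and fetches the JSON response using only the standard library. The function names are hypothetical. Note that the URL template spells the organism parameter `Organisms` whereas the second example uses `organisms`; the sketch follows the template, which is an assumption that should be checked against the API documentation. The structure of the returned JSON is not described here, so the sketch returns it unparsed beyond JSON decoding.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://aoe.dbcls.jp/api/search"


def build_search_url(fulltext=None, technology=None, organism=None, page=1, size=25):
    """Assemble an AOE search URL; parameter spellings follow the URL template."""
    params = []
    if fulltext:
        params.append(("fulltext", fulltext))
    if technology:
        params.append(("Technology", technology))
    if organism:
        # Assumption: "Organisms" as in the template (the example uses "organisms").
        params.append(("Organisms", organism))
    params.append(("page", page))
    params.append(("size", size))
    # quote_via=quote encodes spaces as %20, matching 'homo%20sapiens' in the text.
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return f"{API_ENDPOINT}?{query}"


def search(**kwargs):
    """Run the query and return the decoded JSON document (requires network access)."""
    with urllib.request.urlopen(build_search_url(**kwargs), timeout=30) as response:
        return json.load(response)
```

Calling `build_search_url(fulltext="hypoxia")` reproduces the URL of the first example above, and `search(fulltext="hypoxia", organism="homo sapiens")` would retrieve the first page of matching records.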

Discussion

We have developed and maintained an index of public gene expression databases, called All Of gene Expression (AOE). AOE originally began as an index for the ArrayExpress (AE) database maintained at EBI (we call this ‘AOE level 1’), because AE had imported gene expression data from Gene Expression Omnibus (GEO), which is the largest gene expression database and is maintained at NCBI. That meant AE contained all gene expression data, including those deposited to GEO.

AE stopped importing data from GEO in 2017. While GEO data archived in AE is still available from AE, new data archived in GEO is no longer available from AE. Thus, we started indexing GEO data directly by making use of the API of DBCLS SRA (AOE level 2). In 2018, the DNA DataBank of Japan (DDBJ) started the Genomic Expression Archive (GEA), which is a repository for gene expression quantification data. Integration of these public gene expression databases is needed to increase the reusability of gene expression data. Newly submitted data contain BioProject IDs, and this feature makes it possible to integrate multiple levels of indices and resolve complicated relationships among IDs, although old AE entries do not have BioProject IDs.

The existence of a great deal of data at AOE level 3 shows that not all sequencing gene expression data are stored in GEO. This indicates that GEO is insufficient as a complete public gene expression database. Much of the data at AOE level 3 are heterogeneous, and metadata for those can lack several descriptions, which are curated and cleanly described in GEO and AE.

A similar approach has also been undertaken by EBI, called the Omics Discovery Index (OmicsDI; https://www.omicsdi.org/), which provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics, and metabolomics) [9]. OmicsDI aims to integrate various types of omics data and is not focused on gene expression data.

AOE is focused on gene expression data. It is also designed to be a search interface for DDBJ GEA, and a link to AOE can be found at the official GEA website. Because AOE serves as a search interface for DDBJ GEA, we expect it to remain in continuous use at the DDBJ website.

The web interface for AOE is simple and user-friendly, and so AOE can also be used by biologists who are not familiar with database searching. AOE can also be used by professionals to construct reference expression datasets for specific organisms. We have also developed a Reference Expression dataset (RefEx) for humans and mice [10]. We are planning to implement RefEx for other organisms, by making use of these reference expression datasets retrieved by AOE.

For future development, we are planning to use not only metadata but also quantified expression data, which will allow users to search for data based on the similarity of gene expression profiles. Moreover, we are going to use the quality control results from the FastQC program to screen RNA-seq data.

Methods

Acquisition of public gene expression data

AOE consists of two major types of data sources. One is the EBI ArrayExpress (AE), and the other is data in NCBI, including the Gene Expression Omnibus (GEO).

For the AE data type, several files are required to make an AOE index. These files are in a simple spreadsheet-based, MIAME-supportive format, called MicroArray Gene Expression Tabular (MAGE-TAB) files, which are Array Design Format (ADF), Investigation Description Format (IDF), and Sample and Data Relationship Format (SDRF) files [11]. These files are routinely acquired from the AE FTP site (ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/). ADF files are located in subdirectories in ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/array/, with the file extension .adf.txt. IDF and SDRF files are located in subdirectories in ftp://ftp.ebi.ac.uk/pub/databases/arrayexpress/data/experiment/, with file extensions .idf.txt and .sdrf.txt, respectively.
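To show what reading such a file involves, the sketch below parses the text of an IDF file, in which each row is a tab-delimited tag followed by one or more values. This is a hypothetical Python illustration and not the Perl parser from the AOE repository; skipping lines that begin with `#` as comments is an assumption.

```python
def parse_idf(text):
    """Parse IDF text into a dict mapping each tag to its list of values."""
    fields = {}
    for line in text.splitlines():
        # Skip blank lines and (assumed) comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        tag, *values = line.split("\t")
        # Drop empty trailing cells left by spreadsheet export.
        fields[tag.strip()] = [value for value in values if value != ""]
    return fields
```

For instance, a row such as `Investigation Title<TAB>...` becomes `fields["Investigation Title"]`, from which a title for the index entry can be taken; rows with several values (one per person or protocol) are kept as lists.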

For data from NCBI, we use the file describing ID relationships in the Sequence Read Archive (SRA), named SRA_Accessions.tab (from ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/), together with metadata from SRA, BioProject, and BioSample to make an index for AOE. This is possible because all data from GEO have BioProject IDs and BioSample IDs, even when the gene expression was quantified by a method other than transcriptome sequencing.
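The role of SRA_Accessions.tab as an ID-relationship table can be sketched as follows. This Python example is illustrative only: it assumes that the file is tab-delimited with a header row, that it has columns named `Accession`, `Type`, and `BioProject`, and that a missing value is written as `-`. These details should be verified against the actual file before use.

```python
import csv


def study_to_bioproject(handle):
    """Map SRA study accessions to BioProject IDs from an open SRA_Accessions.tab file.

    Column names and the "-" placeholder are assumptions about the file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    mapping = {}
    for row in reader:
        if row.get("Type") != "STUDY":
            continue  # keep study-level rows only
        bioproject = row.get("BioProject")
        if bioproject and bioproject != "-":
            mapping[row["Accession"]] = bioproject
    return mapping
```

Because the real file is large, passing an open file handle and iterating row by row, as done here, avoids loading the whole table into memory.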

Organizing metadata from different sources

For the AE type of data, ADF, IDF, and SDRF files are required to make an index for AOE. Data from the DDBJ Genomic Expression Archive (GEA) also consist of the AE type of data and are available from its FTP site (ftp://ftp.ddbj.nig.ac.jp/ddbj_database/gea/). We used the AE type of data to construct an initial AOE index set (called AOE level 1).

GEO data in the Sequence Read Archive (SRA), BioProject, and BioSample are used to make an index for AOE. These data have been stored in the DBCLS SRA as JSON-LD, and the application programming interface (API) for metadata for those has also been maintained in DBCLS. AOE used this API to retrieve data needed to make the index (AOE level 2).

Finally, we collected the RNA-seq data in SRA, making use of the DBCLS SRA API. Most of these data are already in AOE level 2, but many entries are found only through this filter (AOE level 3).

A concatenated tab-delimited file of the constructed index is archived in the Life Science Database Archive at National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST) at DOI: 10.18908/lsdba.nbdc00467-000 (https://doi.org/10.18908/lsdba.nbdc00467-000).

Data parsers to make a tab-delimited text file for visualization are implemented in Perl5 and UNIX shell commands. All shell and Perl5 scripts for those are accessible from GitHub (https://github.com/dbcls/AOE/).

Visualization of datasets

For visualizing datasets, we implemented specially coded Python3 scripts, and we also used D3.js, a JavaScript library for manipulating documents based on data (https://d3js.org/). This enables data selection by mouse operation. For example, the user can select data by release date by dragging the histogram generated with the keyword search.
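The data behind such histograms amount to counting index entries per category and re-counting after each selection. The following Python sketch illustrates that logic only; it is not the AOE implementation, and the record layout (dictionaries with `year`, `organism`, and `technology` keys) is a hypothetical example.

```python
from collections import Counter


def histogram_counts(records, field):
    """Count records per value of one field (e.g. year, organism, technology)."""
    return Counter(record[field] for record in records if record.get(field))


def filter_records(records, **criteria):
    """Keep records whose fields equal all given criteria, as a selection would."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]
```

Selecting a bar in one histogram corresponds to calling `filter_records` with that value and then recomputing `histogram_counts` for each field on the reduced list.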

Acknowledgments

The author wishes to thank Dr. Naoya Oishi for the design and development of the web and application programming interface for AOE, Dr. Yuichi Kodama for helpful advice for the integration of the DDBJ Genomic Expression Archive (GEA), and Dr. Takeru Nakazato and Dr. Tazro Ohta for their help in using the DBCLS SRA API for updating AOE entries from NCBI databases. The tutorial movie for the AOE web interface was created under the TogoTV project in DBCLS with editorial direction by Dr. Hiromasa Ono. The computing resource was partly provided by the supercomputer system at the National Institute of Genetics (NIG), Research Organization of Information and Systems (ROIS), Japan. We would like to thank Editage (www.editage.com) for English language editing.

Data Availability

Availability of supporting source code:

Project name: All Of gene Expression (AOE)

Project home page: https://aoe.dbcls.jp/en

GitHub page: https://github.com/dbcls/AOE/

Operating system: UNIX (macOS and Linux)

Programming language: Python, Perl

License: MIT

Availability of supporting data: The constructed data are archived at the Life Science Database Archive at the National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), and are available with the DOI: 10.18908/lsdba.nbdc00467-000 (https://doi.org/10.18908/lsdba.nbdc00467-000). License: CC-BY 4.0.

Funding Statement

This work was supported by the National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST) to HB. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 29, 365–371 (2001). 10.1038/ng1201-365
  • 2. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2013). 10.1093/nar/gks1193
  • 3. Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E et al. ArrayExpress update—simplifying data submissions. Nucleic Acids Res. 43, D1113–1116 (2015). 10.1093/nar/gku1057
  • 4. Karsch-Mizrachi I, Takagi T, Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. 46, D48–D51 (2018). 10.1093/nar/gkx1097
  • 5. Kodama Y, Mashima J, Kosuge T, Ogasawara O. DDBJ update: the Genomic Expression Archive (GEA) for functional genomics data. Nucleic Acids Res. 47, D69–D73 (2019). 10.1093/nar/gky1002
  • 6. Kodama Y, Shumway M, Leinonen R. International Nucleotide Sequence Database Collaboration. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012). 10.1093/nar/gkr854
  • 7. Ohta T, Nakazato T, Bono H. Calculating quality of public high-throughput sequencing data to obtain suitable subset for reanalysis from the Sequence Read Archive. GigaScience. 6, gix029 (2017). 10.1093/gigascience/gix029
  • 8. Kawano S, Ono H, Takagi T, Bono H. Tutorial videos of bioinformatics resources: online distribution trial in Japan named TogoTV. Brief Bioinform. 13, 258–268 (2012). 10.1093/bib/bbr039
  • 9. Perez-Riverol Y, Bai M, da Veiga Leprevost F, Squizzato S, Park YM, Haug K et al. Discovering and linking public omics data sets using the Omics Discovery Index. Nat. Biotech. 35, 406–409 (2017).
  • 10. Ono H, Ogasawara O, Okubo K, Bono H. RefEx, a reference gene expression dataset as a web tool for the functional analysis of genes. Sci. Data 4, 170105 (2017). 10.1038/sdata.2017.105
  • 11. Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Farne A, Holloway E et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics. 7, 489 (2006). 10.1186/1471-2105-7-489

Decision Letter 0

Robert Hoehndorf

7 Oct 2019

PONE-D-19-23442

All of gene expression (AOE): integrated index for public gene expression databases

PLOS ONE

Dear Dr. Bono,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Nov 21 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Robert Hoehndorf, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

1. Please amend either the title on the online submission form (via Edit Submission) or the title in the manuscript so that they are identical.

Additional Editor Comments (if provided):

The reviewers have considered the manuscript. They have raised a number of points that should be taken into consideration in a revision. Many changes are minor, but Reviewer 2 asks for substantially more information to be included on the implementation and the use of the services developed, which will improve the utility of the resource and increase adoption; this should be addressed in the revised manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: These are more of recommendations to the authors regarding the manuscript:

1. For the statement: "Unlike the International Nucleotide Sequence Database [3], these two databases for gene expression have stopped exchanging data with each other since AE stopped importing data from GEO"

Is there a reason why they stopped exchanging data? Could this be elaborated further, for the benefit of the readers?

2. This paper highlights the creation of an index, but could it be possible to describe how the index was generated and what technology was being used?

3. The authors note that AOE has been around for 5 years. Could it be possible for the authors to share usage statistics over this period that can highlight usage patterns. This might also be insightful to readers.

Reviewer #2: * All of Gene Expression (AOE) by Bono

** Editorial overview

The author created tools that mine metadata from the main public gene expression databases and created a queryable cross-database index 'all of gene expression' (OAE) that can be accessed through an API and a browser web interface. This paper describes the need for such an index and presents it as a completed work.

I think the work is worthwhile publishing but the publication needs some additional work to make it more interesting for the reader. The paper is very short as it is, and I think it can be improved by providing more context and useful examples.

** Notes

Some historical context is missing. MIAME is not mentioned though (probably) used. Similar efforts at describing expression experiments should be mentioned and put in context. Likewise XML metadata descriptions exist and RDF ontologies. I miss that information and some sense of how important they are to providing and exposing an index of gene expression experiments. Does the current work provide metadata, what does it look like and why is it (not) RDF?

In the discussion the author can mention the challenges of matching data and data quality. In microarrays the data files come in different forms - especially raw data, differential expression and normalized data. How do we deal with that when mining data? As it stands the expression databases are highly suspect. An index should give hints about the state of the data.

A different topic concerns probe-level information and RNA-seq alignments using different tools and different reference genomes. These are fraught with problems and naive comparisons are pretty useless. Providing an index to data resources raises questions about what these resources contain and how to deal with that. I realise an index is a starting point, but maybe the author can explain what the next steps are?

I think the paper would greatly benefit from a description of the REST API and by giving some examples using R or Python. This may lead to readers trying out the API. It also will explain the value of the index for software developers.

Similarly, a few figures of the browser GUI that *explain* the use of it would be helpful. The provided figure is not descriptive and even lacks a caption.

Finally I think there should be a clear statement that all source code is free and open source software and can be installed using this and that...

Even more important: as a web service is provided, there should be indication about future maintenance plans, and that the data and service can be replicated freely elsewhere (and how it can be done). There are too many bioinformatics initiatives that disappear after publication. How does the author want to warrant continuous service?

The paper states that the 'parser' source code is available. What about the web-service software? It is not clear from the github repo either. Interestingly the source code points out the use of the common workflow language (CWL). That, I believe, should be highlighted too.

** English editing

- I prefer not to start sentences with 'However'

- Findings in abstract I would use Results instead - because these are useful tools

- Conclusions or Conclusion?

** Conclusion

The paper would greatly benefit from extra information. No additional software implementation by the author is required, my comments only relate to presentation and making the work more interesting/useful for the reader. I would say it is a minor revision, though as the paper is expected to double in size it may be considered major.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Pjotr Prins

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jan 24;15(1):e0227076. doi: 10.1371/journal.pone.0227076.r002

Author response to Decision Letter 0


11 Nov 2019

A detailed point-by-point response to the two sets of referees’ comments is in the rebuttal letter in a separate file, but I paste the contents of our responses below.

Response to Reviewer #1

> 1. For the statement: "Unlike the International Nucleotide Sequence Database [3], these two databases for gene expression have stopped exchanging data with each other since AE stopped importing data from GEO"

> Is there a reason why they stopped exchanging data? Could this be elaborated further, for the benefit of the readers?

Thanks for your suggestion.

On the ArrayExpress web page about GEO data (https://www.ebi.ac.uk/arrayexpress/help/GEO_data.html), they only state

> “We have stopped the regular imports of Gene Expression Omnibus (GEO) data into ArrayExpress. We will keep using data from GEO to build our added value database Expression Atlas, and the reprocessed and additionally annotated data for selected datasets will be available from there.”

and give no reason. We do not know why the imports were stopped.

Thus, in the revised manuscript, we added the URL above and one more sentence emphasizing that new data in GEO are not included in AE.

“AE once had imported data from GEO, but stopped importing data in 2017 (https://www.ebi.ac.uk/arrayexpress/help/GEO_data.html). Archived GEO data is still available from AE, but new data archived in GEO no longer available from AE.” in the Introduction section and “AE stopped importing data from GEO in 2017. While GEO data archived in AE is still available from AE, new data archived in GEO no longer available from AE.” in the Discussion section.

> 2. This paper highlights the creation of an index, but could it be possible to describe how the index was generated and what technology was being used?

Thanks again for your suggestion.

How the AOE index is constructed is described in the “An index of gene expression data series from metadata” subsection of the Results section and depicted in Fig 2. We added pointers to Fig 2 in this subsection.

As for the technology, as described in the Methods section (Organizing metadata from different sources), we use shell commands and Perl scripts for data parsing.

“Data parsers to make a tab-delimited text file for visualization are implemented in Perl5 and UNIX shell commands. All shell and Perl5 scripts for those are accessible from GitHub (https://github.com/dbcls/AOE/).”

For the web interface, also described in the following section, we used Python3 code and the D3.js JavaScript library. “For visualizing datasets, we employed specially coded Python3 scripts, and we also used D3.js, a JavaScript library for manipulating documents based on data (https://d3js.org/)."

> 3. The authors note that AOE has been around for 5 years. Could it be possible for the authors to share usage statistics over this period that can highlight usage patterns. This might also be insightful to readers.

Thanks for your constructive suggestion.

We added a description of the usage statistics to the Results section.

“AOE web server has been maintained for five years. From the usage statistics (during July 2015 to Oct 2019), there were 95,334 visits, 393,174 page views and 630,837 hits. From the fact that most visits were under 30seconds, it seems that users accessed AOE web server in their web browser with an instant query with specific keywords.”

Response to Reviewer #2

> I think the work is worthwhile publishing but the publication needs

> some additional work to make it more interesting for the reader. The

> paper is very short as it is, and I think it can be improved by

> providing more context and useful examples.

Thank you very much for your constructive comments.

We added more content and some examples of how to use AOE to the revised manuscript.

> ** Notes

>

> Some historical context is missing. MIAME is not mentioned though

> (probably) used.

A description of the Minimum Information About a Microarray Experiment (MIAME) standard was added to the Introduction, following your suggestion.

“Since then, gene expression data from those experiments have been archived in public repositories after the development of the Minimum Information About a Microarray Experiment (MIAME) standard [1].”

Indeed, IDF, SDRF and ADF files are used to build the AOE index, and these metadata are described in tab-delimited files known as MAGE-TAB. We therefore added this point to the Methods section.

“These files are in a simple spreadsheet-based, MIAME-supportive format, called MicroArray Gene Expression Tabular (MAGE-TAB) files, which are Array Design Format (ADF), Investigation Description Format (IDF), and Sample and Data Relationship Format (SDRF) files”
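To illustrate the tab-delimited layout mentioned above, the following is a minimal sketch of reading an IDF file, assuming the usual MAGE-TAB arrangement in which each row starts with a field label followed by one or more values. It is not the Perl5 parser used for AOE. The function name `parse_idf` and the sample content are hypothetical and for illustration only.

```python
import csv
import io


def parse_idf(text):
    """Parse IDF text into a dict mapping each row label to its list of values."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank rows and rows without a label in the first column.
        if not row or not row[0].strip():
            continue
        label = row[0].strip()
        # The remaining columns hold the values; drop empty padding cells.
        values = [cell.strip() for cell in row[1:] if cell.strip()]
        fields.setdefault(label, []).extend(values)
    return fields


# Made-up example content (not taken from any real submission).
SAMPLE_IDF = (
    "Investigation Title\tExample hypoxia time course\n"
    "Experimental Factor Name\ttime\tcompound\n"
    "SDRF File\texample.sdrf.txt\n"
)

idf = parse_idf(SAMPLE_IDF)
print(idf["Investigation Title"])
print(idf["Experimental Factor Name"])
```

Keeping every value as a list reflects that a single IDF row may carry several values, one per column.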

> Similar efforts at describing expression experiments should be mentioned and put in context.

As for similar efforts, EBI maintains a service called OmicsDI.

We described it in the Discussion section.

“A similar approach has also been undertaken by EBI, called the Omics Discovery Index (OmicsDI; https://www.omicsdi.org/), which provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics) [9]. OmicsDI aims to integrate various types of omics data and is not focused on gene expression data.”

> Likewise XML metadata descriptions exist and RDF ontologies. I miss that

> information and some sense of how important they are to providing

> and exposing an index of gene expression experiments. Does the

> current work provide metadata, what does it look like and why is

> it (not) RDF?

AOE is an index of gene expression data and thus does not itself provide metadata for the contents.

Concerning the relationship to RDF, AOE makes heavy use of Sequence Read Archive metadata in RDF (JSON-LD formatted data). We added a description of JSON-LD to the Methods section.

“These data have been stored in the DBCLS SRA as JSON-LD, and the application programming interface (API) for metadata for those has also been maintained in DBCLS."

> In the discussion the author can mention the challenges of matching

> data and data quality. In microarrays the data files come in different

> forms - especially raw data, differential expression and normalized

> data. How do we deal with that when mining data? As it stands the

> expression databases are highly suspect. An index should give hints

> about the state of the data.

Using the expression data themselves is the next challenge for the AOE project. AOE currently integrates and indexes the metadata in the public databases. We added this issue to the Discussion section.

“For the future development, we are also planning to use not only metadata, but also quantified expression data that will allow users to search data based on the similarity of gene expression profiles.”

> A different topic concerns probe-level information and RNA-seq

> alignments using different tools and different reference

> genomes. These are fraught with problems and naive comparisons are

> pretty useless. Providing an index to data resources raises questions

> about what these resources contain and how to deal with that. I

> realise an index is a starting point, but maybe the author can explain

> what the next steps are?

We think that quality control will be needed to screen the data.

As a first step, we are going to use the quality control results from the FASTQ program to screen RNA-seq data. This point was also added to the Discussion section.

“And, we are going to use the result of quality control by FASTQ program to screen the data for RNA-seq data.”

> I think the paper would greatly benefit from a description of the REST

> API and by giving some examples using R or Python. This may lead

> to readers trying out the API. It also will explain the value of

> the index for software developers.

Thank you very much for the great suggestion. A description of the API was added to the manuscript in the ‘Application programming interface’ subsection of the Results section.

“Users can also query AOE via API. AOE provides a simple Representational State Transfer (REST) API that enables users to perform searches with their client programs in an automated manner. The search results in a JSON formatted output can be retrieved through the following URI:

https://aoe.dbcls.jp/api/search?fulltext=KEYWORD&[Technology=TECHNOLOGY&Organisms=ORGANISM&page=OFFSET&size=SIZE]

KEYWORD: keyword to search AOE

TECHNOLOGY: technology to use expression profiling (sequencing, microarray, Affymetrix, Agilent, Illumina, etc)

ORGANISM: ‘homo%20sapiens’ for human, ‘mus%20musculus’ for mouse, etc.

OFFSET: the page number

SIZE: the number of results in one page

1. A search for the keyword ‘hypoxia’ with twenty-five results in one page is represented as follows:

https://aoe.dbcls.jp/api/search?fulltext=hypoxia&page=1&size=25

2. A search for human data in RNA-seq (sequencing assay) with twenty-five results in one page is represented as follows:

https://aoe.dbcls.jp/api/search?Technology=sequencing&organisms=homo%20sapiens&page=1&size=25
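The URI pattern above could be used from a client program roughly as follows. This is a minimal sketch using only the Python standard library. The helper names `build_search_url` and `search` are hypothetical, and the structure of the returned JSON is not described here, so the sketch only decodes it. The parameter names follow the template (`Technology`, `Organisms`); the second example above writes `organisms` in lowercase, and which spelling the server accepts is an assumption that has not been verified.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://aoe.dbcls.jp/api/search"


def build_search_url(fulltext=None, technology=None, organisms=None, page=1, size=25):
    """Assemble a search URI from the parameters described in the text."""
    params = []
    if fulltext:
        params.append(("fulltext", fulltext))
    if technology:
        # Parameter casing follows the URI template; assumed, not verified.
        params.append(("Technology", technology))
    if organisms:
        params.append(("Organisms", organisms))
    params.append(("page", page))
    params.append(("size", size))
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def search(**kwargs):
    """Fetch one page of results and decode the JSON body (needs network access)."""
    with urlopen(build_search_url(**kwargs), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_search_url(fulltext="hypoxia"))
print(build_search_url(technology="sequencing", organisms="homo sapiens"))
```

Separating URL construction from the network call lets the query strings be checked without contacting the server; calling `search(fulltext="hypoxia")` would then perform the actual request.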

> Similarly, a few figures of the browser GUI that *explain* the use of

> it would be helpful. The provided figure is not descriptive and even

> lacks a caption.

Thanks for your comment.

Following your suggestion, Fig 3 was completely updated to show how to use the AOE web interface, and corresponding descriptions were added to the manuscript. A caption for Fig 3 was also added.

“Fig 3. AOE web interface

Users can retrieve data of interest graphically in the AOE web interface.”

Additionally, a tutorial movie showing this operation in the web browser is available from TogoTV, as described in the Results section.

“Further, a tutorial movie to show how to make use of AOE web interface is available at https://doi.org/10.7875/togotv.2018.146."

> Finally I think there should be a clear statement that all source code

> is free and open source software and can be installed using this and

> that...

This is a very important issue. Thank you very much for your comment. We added a clear statement to the Results section that all source code is free and open source software.

“All codes to parse public databases and construct a web service are accessible from the DBCLS AOE GitHub repository (https://github.com/dbcls/AOE/). They are free and open source software, and can be installed anywhere.”

> Even more important: as a web service is provided, there should be

> indication about future maintenance plans, and that the data and

> service can be replicated freely elsewhere (and how it can be

> done). There are too many bioinformatics initiatives that disappear

> after publication. How does the author want to warrant continuous

> service?

As described in the Discussion section, AOE is now used as a search interface for the DDBJ Genomic Expression Archive (GEA) at https://www.ddbj.nig.ac.jp/gea/index-e.html .

We expect that AOE will continue to be used on the DDBJ website now that it serves as a search interface to DDBJ GEA.

“AOE is focused on gene expression data. It is also designed to be a search interface for DDBJ GEA, and a link to AOE can be found on the official GEA website. When AOE is used as a search interface to DDBJ GEA, it is expected that AOE will be continuously used in the DDBJ website.”

> The paper states that the 'parser' source code is available. What

> about the web-service software? It is not clear from the github repo

> either. Interestingly the source code points out the use of the

> common workflow language (CWL). That, I believe, should be highlighted

> too.

Thank you very much for your careful examination of the GitHub repository for AOE.

The source code for the AOE web service has now been merged into the repository (https://github.com/dbcls/AOE/tree/master/Web).

The CWL code in the current repository is a product of Biohackathon2018. We plan to continue converting the AOE index parsers to CWL at future Biohackathons.

> ** English editing

>

> - I prefer not to start sentences with 'However'

> - Findings in abstract I would use Results instead - because

> these are useful tools

> - Conclusions or Conclusion?

Thank you very much for your suggestions.

We modified the manuscript according to your suggestions.

> ** Conclusion

>

> The paper would greatly benefit from extra information. No additional

> software implementation by the author is required, my comments only

> relate to presentation and making the work more interesting/useful for

> the reader. I would say it is a minor revision, though as the paper is

> expected to double in size it may be considered major.

Thank you very much for your various suggestions and comments.

We are sure that the manuscript is now useful for readers of PLOS ONE.

Decision Letter 1

Robert Hoehndorf

27 Nov 2019

PONE-D-19-23442R1

All of gene expression (AOE): an integrated index for public gene expression databases

PLOS ONE

Dear Dr. Bono,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Jan 11 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Robert Hoehndorf, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

The reviewers have assessed the manuscript and recommend the manuscript to be published once some minor issues are addressed. The reviewers commented on the language which needs some editing. Please address the comments of the reviewers and carefully edit the language used in the manuscript.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. Regarding the usage statistics:

"From the fact that most visits were under 30seconds, it seems that users accessed AOE web server in their web browser with an instant query with specific keywords."

- It could also be first time visitors who do not spend much time on the page.

- Are there additional metrics available to identify such users?

2. Phrasing of sentences throughout the paper:

"AE once had imported data from GEO, but stopped importing data in 2017"

- could be rephrased to "AE had once imported data from GEO, but stopped doing so in 2017."

"Integration of these public gene expression databases is required"

- could be rephrased to "There is a need for integration of these public gene expression databases."

"AOE was originally begun as an index for the ArrayExpress (AE) database maintained at EBI"

- could be rephrased to "AOE originally began as an index..."

Reviewer #2: Dear author, thank you for addressing my comments. I think the paper will benefit from some minor editing. Nothing serious, but I am sure a native speaker can be helpful to get it up to PLoS standards. Maybe the editor can recommend someone.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Pjotr Prins


PLoS One. 2020 Jan 24;15(1):e0227076. doi: 10.1371/journal.pone.0227076.r004

Author response to Decision Letter 1


6 Dec 2019

Response to Reviewer #1

> 1. Regarding the usage statistics:

> "From the fact that most visits were under 30seconds, it seems that users accessed AOE web server in their web browser with an instant query with specific keywords."

> - It could also be first time visitors who do not spend much time on the page.

> - Are there additional metrics available to identify such users?

Thanks for the clarification. We use AWStats (https://awstats.sourceforge.io) to analyze the httpd log, and it reports a statistic for ‘visits duration’.

The analyzed pages can be browsed by month, and the percentage of visits lasting under 30 s is around 66%. Below is an example from the latest statistics (Oct 2019; the image is in the attached PDF, response to reviewers).

In other words, about two-thirds of visits last less than 30 s.

Thus, we changed the description of the usage statistics in the Results section.

“The fact that two-thirds of visits were under 30seconds indicates that users accessed the AOE web server in their web browser with an instant query with specific keywords.”

> 2. Phrasing of sentences throughout the paper:

> "AE once had imported data from GEO, but stopped importing data in 2017"

> - could be rephrased to "AE had once imported data from GEO, but stopped doing so in 2017.”

> "Integration of these public gene expression databases is required"

> - could be rephrased to "There is a need for integration of these public gene expression databases."

> "AOE was originally begun as an index for the ArrayExpress (AE) database maintained at EBI"

> - could be rephrased to "AOE originally began as an index…"

Thank you very much for your editing.

All of these sentences were rephrased.

Response to Reviewer #2

> Reviewer #2: Dear author, thank you for addressing my comments. I think the paper will benefit from some minor editing. Nothing serious, but I am sure a native speaker can be helpful to get it up to PLoS standards. Maybe the editor can recommend someone.

Thank you very much for your comment.

The manuscript was re-edited by another editor, and the English was much improved.

I have attached a PDF of the ‘CERTIFICATE OF ENGLISH EDITING’.

Attachment

Submitted filename: 191207rebuttal.pdf

Decision Letter 2

Robert Hoehndorf

12 Dec 2019

All of gene expression (AOE): an integrated index for public gene expression databases

PONE-D-19-23442R2

Dear Dr. Bono,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Robert Hoehndorf, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Robert Hoehndorf

10 Jan 2020

PONE-D-19-23442R2

All of gene expression (AOE): an integrated index for public gene expression databases

Dear Dr. Bono:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Robert Hoehndorf

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: 191207rebuttal.pdf

    Data Availability Statement

    Availability of supporting source code:

    Project name: All Of gene Expression (AOE)

    Project home page: https://aoe.dbcls.jp/en

    GitHub page: https://github.com/dbcls/AOE/

    Operating system: UNIX (macOS and Linux)

    Programming language: Python, Perl

    License: MIT

    Availability of supporting data: The constructed data are archived at the Life Science Database Archive at the National Bioscience Database Center (NBDC), Japan Science and Technology Agency (JST), and are available with the DOI: 10.18908/lsdba.nbdc00467-000 (https://doi.org/10.18908/lsdba.nbdc00467-000). License: CC-BY 4.0.


    Articles from PLoS ONE are provided here courtesy of PLOS
