Abstract
TidyGEO is a Web-based tool for downloading, tidying, and reformatting data series from Gene Expression Omnibus (GEO). As a freely accessible repository with data from over 6 million biological samples across more than 4000 organisms, GEO provides diverse opportunities for secondary research. Although scientists may find assay data relevant to a given research question, most analyses require sample-level annotations. In GEO, such annotations are stored alongside assay data in delimited, text-based files. However, the structure and semantics of the annotations vary widely from one series to another, and many annotations are not useful for analysis purposes. Thus, every GEO series must be tidied before it is analyzed. Manual approaches may be used, but these are error prone and take time away from other research tasks. Custom computer scripts can be written, but many scientists lack the computational expertise to create such scripts. To address these challenges, we created TidyGEO, which supports essential data-cleaning tasks for sample-level annotations, such as selecting informative columns, renaming columns, splitting or merging columns, standardizing data values, and filtering samples. Additionally, users can integrate annotations with assay data, restructure assay data, and generate code that enables others to reproduce these steps.
Keywords: data cleaning, data wrangling, gene-expression analysis, interactive curation, web application
1. Background
Gene Expression Omnibus (GEO) is an Internet-based, publicly accessible repository for high-throughput molecular-abundance data [1, 2]. Originally, GEO was designed for gene-expression profiles [3], but its scope has broadened, now housing data for other measurement types, including DNA methylation levels, genotypes, and chromatin-accessibility measurements. GEO data are stored in four units: Platforms, which define a set of molecules that may be detected using a given profiling technology; Samples, which describe measurements produced for a single replicate; Series, which organize assay data and metadata for a set of samples that make up an experiment; and Datasets, which are curated collections of samples for a particular platform [2]. Most samples are not part of a Dataset due to the time and expertise required for curation.
GEO enables researchers to share data with the broader community. Many journals and funding agencies require such sharing to enable validation of research results and to ensure that data assets are accessible [4]. GEO data are used for many types of secondary research. Some researchers use GEO data for methods development and evaluation. For example, Zhou et al. combined diverse datasets to impute gene-expression values with the goal of maximizing platform compatibility [5]; Eren et al. used GEO data to compare the effectiveness of biclustering algorithms [6]; and Golightly et al. curated a compendium of GEO datasets to enable benchmark comparisons across machine-learning algorithms [7]. Other researchers have used GEO data for discovery, such as identifying differentially expressed genes [8, 9], identifying pathways that influence disease development [10, 11], or investigating the potential to repurpose existing drugs [12].
GEO adheres to the Minimum Information about Microarray Experiments (MIAME) and Minimum Information about a high-throughput SeQuencing Experiment (MINSEQE) guidelines, which define the types of content that should be provided in publicly available, gene-expression datasets [13, 14]. This content can be categorized as (1) metadata about the experiment, (2) sample-level annotations, and (3) a processed version of the molecular-assay data. Metadata include information such as a title for the study, the species name(s), an experimental design description, contact information, and the platform(s) used. Sample-level annotations typically indicate the experimental condition(s) associated with each research subject and covariate factors such as the age, sex, and/or disease subtype of each subject; these annotations differ considerably across studies. Additional sample-level annotations are provided for informational purposes but may not be useful in analyses. Such variables might indicate the molecule type that was profiled, the extraction and hybridization protocols used, a description of how the data were preprocessed, the last-updated date, and subject identifiers used originally by the submitter. Although many GEO series provide a raw version of the data, many more provide a processed version of the data that were used in the researchers’ analysis. Examples of processed data include normalized microarray measurements, feature counts from read-aligned RNA-Sequencing data, RT-PCR measurements, etc.
GEO has long used a spreadsheet-based submission system to collect metadata and annotations from researchers. A curator reviews each submission. This process ensures that the information is structured consistently across studies and that key data elements are provided, while ensuring flexibility for a wide range of experiments. This flexibility has enabled GEO to grow quickly over the past decades and to meet its creators’ goal of remaining “flexible and responsive to future trends, rather than setting rigid requirements and standards for entry” [3]. However, challenges for data reuse [4] have accompanied this flexibility. Within the constraints of the submission process, researchers provide free-form descriptions of sample characteristics. In some cases, researchers provide multiple values per cell, separated by delimiters. For example, a value of “female; 52; anastrozole” might be used to represent a female breast-cancer patient who was 52 years old and had been treated with anastrozole. When reusing the data, a secondary researcher would need to decipher the semantic meaning of these values and write custom code to separate the values into distinct columns. In other cases, sample characteristics are stored as key-value pairs. For example, the same patient might be represented as “sex = female; age = 52; drug = anastrozole”. This approach provides some semantic information, but secondary researchers would still need to parse the data points. In addition, missing values can cause problems. In some GEO datasets, sample-level annotations in a given row are shifted leftward to fill empty cells. Consequently, a given column may contain data for multiple variables. To analyze such data, a researcher would need to realign the values.
In many cases, these inconsistencies violate the “tidy data” principles, which state that every column should describe a single variable, every row should represent a particular observation (sample), and every table should represent a particular type of observational unit [15]. When data conform to these principles, they are conducive to diverse types of quantitative analyses, enabling researchers to devise analytical strategies that generalize across datasets.
Other challenges relate to semantics [16]. The names of columns often do not reflect the data stored in those columns. For example, Huang, et al. generated gene-expression data for Wilms tumor patients (GEO accession: GSE10320) [17]. Sample-level annotations indicate the clinical outcome for each patient: “Relapse” or “Non-relapse”. The column containing these values is labeled “characteristics_ch1”. As researchers analyze such data, they may wish to rename columns to be more descriptive. Additionally, the actual data values may lack standardization. For example, a more recent dataset published by the same group used “Yes” or “No” values to indicate whether a given Wilms tumor patient had relapsed [18]. If a secondary researcher wished to merge these datasets, they would need to modify the data to use a common vocabulary for describing relapse status. Finally, in some secondary analyses, researchers wish to use only a subset of the available samples. For example, they may wish to focus on Wilms tumor patients who had relapsed.
Aside from sample-level annotations, assay data often must be re-summarized. For example, Affymetrix microarray data are often summarized at the probeset level, and a given gene is often associated with multiple probesets. Researchers who wish to make gene-level inferences might wish to calculate the average (or median or maximum) value across all probesets for each gene.
Writing custom code to address any one of these problems may be trivial for some researchers. However, multiple such transformations are needed for many datasets, and these tasks differ widely among datasets [7]. Performing these tasks idiosyncratically for each dataset is inefficient and makes secondary analyses less accessible to researchers who lack computational skills. To address this problem, we created TidyGEO, an interactive Web application that enables researchers to download, tidy, and restructure GEO series into a form that is suitable for downstream analysis. Unlike alternative tools that focus primarily on molecular-assay data, TidyGEO emphasizes tidying sample-level annotations. However, TidyGEO also provides options for tidying molecular-assay data. Users can create graphical summaries of the data and download graphics files. Additionally, users can export the data in diverse formats.
In this article, we describe TidyGEO’s functionality in more detail and describe our findings from using it to tidy 109 GEO series. Furthermore, we highlight the importance of computational reproducibility and describe TidyGEO’s approach to ensuring that the tidying steps can be reproduced. Finally, we describe the landscape of existing tools for finding, interoperating, and reusing GEO data [4]; and we indicate how TidyGEO fits within that landscape.
2. Results
TidyGEO enables researchers to tidy sample-level annotations, assay data, and platform annotations for GEO series. In addition, researchers can merge these data types, export the data in various formats, and generate scripts that support computational reproducibility. To evaluate our assertion that most GEO series need to be tidied and to verify that TidyGEO is effective for diverse types of GEO data, we used TidyGEO to import 109 series, identified needs for tidying each series, performed the tidying steps, and recorded which steps we performed. We selected these series in a semi-random manner with the intent to ensure a diverse representation of molecular-profiling platforms (Figure 1) and samples sizes. However, we emphasized larger sample sizes to test the limits of TidyGEO; the sample sizes ranged between 2 and 12,288 (Figure 2). We excluded datasets with sample sizes of zero or one because these were either incomplete or did not require help with formatting.
Figure 1:
Assay types for the GEO series that we tidied to demonstrate TidyGEO’s functionality. We tidied 109 GEO series representing a diverse range of sizes and molecular-profiling technologies. This graph illustrates the number of datasets tested for each profiling technology.
Figure 2:

Sample sizes for the GEO series that we used to demonstrate TidyGEO functionality. We tidied 109 GEO series representing a wide range of sample sizes. This graph illustrates samples sizes across the series.
In TidyGEO, the “Clinical data” panel provides tabs for selecting informative columns, shifting cells, splitting key-value pairs, splitting columns, renaming columns, substituting values, and filtering samples. Table 1 lists the criteria we followed to assess whether each task needed to be performed for a given GEO series. We ignored the “Rename columns” and “Filter samples” tabs because these tasks are subjective and/or depend on the analysis being done. The “Select informative columns” tab provides preset filters that enable users to exclude columns that have web addresses (e.g., URLs for raw files), dates (e.g., upload dates), or for which all values are the same (e.g., tissue type). It also provides an option for the user to remove columns in a custom manner. We considered these two processes as separate types of tasks. If a series included multiple platforms, we used data for the first one listed.
Table 1:
Criteria used to evaluate which tidying steps were needed for a given dataset. We used TidyGEO to process 109 datasets across a range of platforms and data sizes. In processing the data, we used these criteria to evaluate whether each type of tidying functionality needed to be applied to each dataset.
| Task | Criteria |
|---|---|
| Filter by column name | Does any column have data that you judge to be non-informative for data analyses? |
| Shift cells | In some cases, when values are missing, sample-level annotations become shifted leftward to fill empty cells. For many datasets, GEO has rectified this problem with new columns. Is there any cell in which this has occurred (and has not already been rectified)? |
| Split key-value pairs | Some sample-level annotations are represented as key-value pairs. In these cases, a data value is prefixed with a descriptor (key). Typically, the same descriptor is used for each cell. To analyze such data, it is helpful to use the descriptor as the column name and remove the descriptor from the cells in that column. Does any column have key-value pairs? |
| Split columns | In some cases, multiple values are stored in a single cell. For example, a person’s age and sex might be stored in the same column, separated by commas. In these cases, it is helpful to split the column into multiple columns so the data can be analyzed more easily. Does any column need to be split? |
| Substitute values | Different researchers often use different terms to describe the same concepts. For example, one researcher might use “F” to indicate that a sample is female, while another researcher might use “female” or “woman.” In other cases, spelling mistakes or inconsistencies might be present. Or delimiters might be used inconsistently. Standardizing those values can make it easier to analyze the data and combine datasets. Do you judge that any of the informative columns have values that should be standardized (substituted with different values)? |
Figure 3 illustrates the number of tidying tasks per series. All 109 series required at least two tidying tasks. Three series required all six tasks. Figure 4 indicates the frequency with which each task was performed. By far, the most common tasks involved selecting informative columns. But each of the six tasks was necessary for at least 10 % of the series.
Figure 3:

Number of tidying tasks necessary to tidy each of 109 GEO series. We used TidyGEO to tidy sample-level annotations for 109 GEO series. This graph illustrates the total number of tasks per series that were necessary to tidy the data. Two tasks were possible for the “select columns” tab, and we did not consider the “rename columns” or “filter samples” tabs. So the maximum possible tasks per series was 6.
Figure 4:
Percentage of times that each tidying task was needed to tidy each of 109 GEO series. We used TidyGEO to tidy sample-level annotations for 109 GEO series. This graph illustrates the percentage of times that each type of tidying task was performed across all series.
During testing, we found that assay data were not available for direct download for approximately 30 % of datasets. In many cases (as in RNA-Sequencing experiments), raw data are stored in repositories separate from GEO (e.g., the Sequencing Read Archive [19]).
3. Discussion
The vastness and accessibility of GEO data have inspired tools for finding, downloading, and parsing the data. Some of these tools are provided within GEO itself. From the GEO website, researchers can browse available data, perform differential-expression analyses [2], and download data manually. To help in identifying relevant data, researchers can perform advanced searches. For example, they can filter based on metadata variables such as the title, authors, publication date, species name, or platform type of a given study. Furthermore, researchers can filter based on derived variables such as the sample count or availability of supplementary files. If desired, researchers can also perform such queries using an application programming interface. Other tools for working with GEO data have been created by third-party researchers. Below we describe many such tools and categorize them as a way to place TidyGEO’s functionality in context. In some cases, to provide historical context, we mention tools that are no longer available. In categorizing the tools, we refer to principles of making data findable, accessible, interoperable, and reusable (FAIR) [4]. Where applicable, we note gaps not currently filled by existing tools and how TidyGEO helps to fill those gaps.
3.1. Finding GEO data
The first important step to reusing GEO data is to find the most relevant data for a given study. GEOmetadb is an R package that supports flexible querying of GEO metadata and thus may enable researchers to find relevant data more easily than using the GEO website [20]. STARGEO uses a crowdsourcing approach to annotate GEO series [21]; graduate students with a biomedical background have reviewed series metadata and manually assigned tags to them. Researchers can submit queries to identify series associated with topic(s) of interest. Another tool that used crowdsourcing is CREEDS; students in a massively open online course annotated GEO data and created themed collections of searchable gene signatures (lists of genes up- or down-regulated after a sample subset has been perturbed naturally or experimentally) [22]. OMics Compendia Commons (OMiCC) and GEOMetaCuration enable biomedical researchers to collaboratively annotate GEO data and then create and share cross-study compendia [23, 24]. geoCancerPrognosticDatasetsRetriever is an R package that enables researchers to identify GEO datasets specific to cancer-prognosis research [25]. In an alternative approach, researchers have acted as curators. For example, curatedOvarianData is a compilation of expression data and metadata (mostly from GEO) associated with ovarian tumors [26]. Gemma has a much broader collection of curated gene-expression datasets [27].
TidyGEO does not attempt to replicate these efforts. Instead, it asks researchers to search for data series using existing tools and then specify a series identifier (e.g., GSE10320) to begin the tidying process.
3.2. Accessing GEO data
Via the GEO website, data are freely accessible using the HTTP/S and FTP protocols; each entry is assigned a unique identifier. Pointing and clicking can be cumbersome when researchers need to download many data series. As an alternative, scientists can use the GEOquery package to download GEO data programmatically [28]. TidyGEO uses this package to obtain data.
Tools like shinyGEO, Bioinformatics Array Research Tool, and GEOexplorer provide interfaces that allow researchers to search for GEO series and then retrieve those data [29–31]. Other repositories store subsets of GEO data; these include the curated packages mentioned above, as well as scanGEO [32] and inSilicoDb (no longer available) [33]. Such repositories are typically limited to particular assay types (e.g., gene-expression microarray data).
GEO contains data for hundreds of thousands of human and mouse samples that have been profiled using RNA-Sequencing technologies. However, these data have been preprocessed using diverse reference genomes, gene annotations, and alignment/quantification tools. ARCHS4 uses cloud-based infrastructure to acquire the data in raw form (from the Sequence Read Archive [19]) and reprocess them so that researchers can more easily access and analyze the data [34, 35].
3.3. Making GEO data interoperable
The FAIR guidelines recommend that data be compatible with tools available to process such data. GEO releases data in specific file formats that are consistent across series and platforms. However, these formats are specific to GEO, so most data-analysis tools are unable to import them directly. GEOquery transforms data into data frames and matrices so that analyses can be performed using the R statistical software [36]. shinyGEO uses GEOquery to obtain data and provides an option to download sample-level annotations in CSV format [29]. In addition, MeV and BRB-ArrayTools provide ways for researchers to import and export GEO datasets [37, 38]. SMart Automatic Classification (SMAC) enables researchers to mine PubMed for datasets associated with user-defined queries, extract the data from GEO (using GEOquery), and then transform the data; it focuses primarily on cancer research, but some functionality can be applied more broadly [39].
TidyGEO provides diverse options for exporting data, thus providing flexibility for researchers who wish to analyze data with external tools. In TidyGEO, data can be downloaded in any of five formats. In some cases, these formats are tied to specific software (e.g., Microsoft Excel or BRB-ArrayTools); in other cases (CSV, TSV, JSON), the formats are generic and thus can be imported into many tools for custom analyses.
3.4. Reusing GEO data
In describing reuse, we focus on tools that provide analytical capabilities for gene-expression data. Researchers commonly seek to identify differentially-expressed genes [40], perform functional-enrichment analyses, construct gene co-expression networks, identify disease subtypes via clustering, classify samples (e.g., predict survival, treatment response), perform meta-analyses (combine evidence across studies) [41], and create visualizations. Many tools support such tasks (Table 2), typically via Web-based interfaces. Most of these tasks require sample-level annotations. For example, when performing a differential-expression analysis, researchers must place samples into comparison groups. Web-based analysis tools often extract metadata from GEO and ask the user to indicate the group associated with each sample. However, in many cases, it is useful to go beyond this capability. For example, if a researcher were comparing disease and normal samples, they might wish to account for race, sex, and/or age as covariates; we are unaware of existing tools that provide this flexibility. TidyGEO helps to fill the gaps in such scenarios. But rather than enabling researchers to analyze data directly, TidyGEO supports data-preparation tasks, enabling researchers to export analysis-ready data and import them into whatever third-party tool(s) are most suitable for the analysis tasks they wish to perform.
Table 2:
Summary of third-party tools that support reuse of GEO data.
| Tool | Differential expression | Functional enrichment | Co-expression | Clustering | Classification | Meta-analysis | Visualization |
|---|---|---|---|---|---|---|---|
| ARCHS4 | X | X | X | X | X | ||
| BART | X | X | X | ||||
| BioJupies | X | X | X | X | |||
| Gemma | X | X | X | X | |||
| GEO2Enrichr | X | ||||||
| GEO2R | X | X | |||||
| GEOexplorer | X | X | X | X | |||
| MeV | X | X | X | X | |||
| OMiCC | X | X | |||||
| ScanGEO | X | X | |||||
| shinyGEO | X | X | |||||
| ImaGEO | X | X | X | ||||
| SMAC | X | X | |||||
| ShinyGEO | X |
4. Conclusions
GEO data are largely findable and accessible. Furthermore, many tools are available to support interoperability and reusability of GEO data. However, a key gap that prevents researchers from taking full advantage of this resource is a lack of support for wrangling (tidying) the data into an analysis-read state. Data tidying requires human interaction and judgment from a domain expert. To be efficient and reproducible, it requires custom code. TidyGEO aims to meet these needs by facilitating the process of interacting with the data and generating custom code. Accordingly, we anticipate that it will be most useful to researchers who wish (1) to easily view the sample-level annotations in a GEO series and/or (2) to perform custom tidying operations on a GEO series and export the data to other tools. By facilitating these processes, TidyGEO promises to accelerate the research process and reduce barriers for biologists.
5. Methods
TidyGEO is a Shiny web application (http://shiny.rstudio.com) hosted in a Docker container (https://docker.com) [42]. The user begins by entering a GEO Series identifier. Initially, TidyGEO displays metadata about the series, including organism, experiment type, summary, and publication information. A link to the series on the GEO website is also provided. If the user chooses to “Import” the series, TidyGEO uses the GEOquery R package [28] to download sample-level annotations, assay data, and platform annotations. Options for manipulating the data are organized based on high-level operations: (1) tidying sample-level annotations (“clinical” data), (2) tidying assay data, (3) tidying platform annotations (feature data), and (4) merging data. These sub-pages can be accessed in any order, although TidyGEO attempts to guide the user through a logical sequence of steps.
5.1. Clinical data
Clinical data (sample-level annotations) describe the research subjects (samples) from which assay measurements were derived, providing insight into factors that may explain biological variation. Typically, scientists use clinical variables either as the primary variable of interest or as covariates. These variables may also be used to identify sample subsets that meet certain criteria – for example, patients with a certain blood type.
TidyGEO provides functionality for manipulating clinical data so that they can more easily be used in downstream analyses. This functionality is presented to the user in a series of tabs that allows users to (1) select informative columns, (2) shift cells due to missing data, (3) split key-value pairs, (4) split columns based on delimiters, (5) rename columns, (6) substitute values in a given column, (7) filter samples, and (8) save the tidied data to a file. TidyGEO guides users from tab to tab via “Next” and “Back” buttons; however, users need only perform tasks that are relevant to a particular dataset.
As users visit the tabs and manipulate data, views of the data are displayed on the right panel. The default option is a tabular view (data represented as rows and columns). Users can navigate the data horizontally or vertically using control buttons. A second option provides graphical summaries of the data. Users select a column to visualize. If the variable is numeric, TidyGEO generates a histogram and allows the user to select the bin size. If the variable is categorical, TidyGEO generates a bar plot. In either case, the user can download the figure (PDF, JPG, or PNG format) and specify the color used, image width and height, and font size. The graphs are generated via the ggplot2 package [43].
The tabs that support tidying clinical data are described below.
5.1.1. Selecting informative columns
Many sample-level annotations provide information that is not useful for quantitative analyses. TidyGEO users can select individual columns to be removed. Alternatively, they can indicate that they wish to remove all columns for which (1) all samples have the same value, (2) all samples have unique values, (3) all values are dates, or (4) all values are Web addresses. TidyGEO then removes these columns.
5.1.2. Shifting cells
When an annotation is missing for a given sample, data values for that sample may be shifted leftward to other columns. TidyGEO users can shift such values to the right so that data for each variable is stored in a single column across all samples. Users indicate which columns contain relevant values and then specify a pattern that indicates which cells have relevant data. TidyGEO then extracts these values to a new column.
5.1.3. Splitting key-value pairs
When annotations are represented as key-value pairs (e.g., “treatment:control” or “sex = female”) in a given column – and the key is identical for all samples – users can indicate that the key should be used as the column name and the column should contain the respective values. Users indicate the delimiter that separates the key-value pairs (e.g., “:” or “=”). For more complicated scenarios, a regular expression can be used.
5.1.4. Splitting columns
When multiple data points (or key-value pairs) are nested within cells in a given column, users can split those columns by specifying a delimiter (or regular expression). TidyGEO then separates these values into multiple columns.
5.1.5. Renaming columns
Users can rename a column to better reflect its semantics and to improve consistency of column names.
5.1.6. Substituting column values
Users can replace all instances of a given value with an alternative value. For example, if “F” were used to represent females, the user could change these values to “Female.” Regular expressions can be used for more advanced scenarios.
5.1.7. Filtering samples
For a given column, TidyGEO users can indicate that all samples with a given value should be excluded. For example, all males could be excluded.
5.1.8. Saving the data
Users can download the data in any of five file formats: comma-separated value (CSV), tab-separated value (TSV), JavaScript Object Notation (JSON), Microsoft Excel, or the format required by BRB-ArrayTools. BRB-ArrayTools is a spreadsheet-based system for analyzing gene-expression microarray, copy-number, methylation and RNA-Seq data [38].
5.2. Assay data
Researchers may wish to summarize assay measurements. For example, Affymetrix gene-expression microarrays map measurements to “probe sets,” which quantify expression for gene regions [44]. In many cases, multiple probe sets are associated with a single gene. Thus, probe-set values may be difficult to interpret for researchers who are interested in gene-level measurements. Alternatively, for RNA-Sequencing data, values may be quantified at the transcript level. Again, researchers may wish to summarize the data at the gene level. In its “Assay data” panel, TidyGEO enables users to examine the platform annotations for a given series and select an alternative way of summarizing the abundance measurements. For example, if Affymetrix data had been summarized at the probe-set level and the platform annotations provided a gene symbol (or other gene identifier) for each probe set, the user could summarize the measurements at the gene level. In another example, if the platform annotations indicated the cytogenetic band associated with each measurement, the user could summarize the data based on cytogenetic band. When summarizing, the user can select from the following options: mean, median, minimum, or maximum. Alternatively, the “keep all” option assigns the specified identifiers to the abundance measurements but does not collapse the values.
Regardless of the summarization method used, TidyGEO enables researchers to filter the assay data based on values in a given column. For example, a researcher might retain only rows in which the measurement was higher than a given threshold. As with the clinical data, the user can create visualizations of each column in the assay data and download these graphics.
In GEO, clinical annotations for each sample are stored on a row-by-row basis, whereas assay measurements for each sample are stored on a column-by-column basis. To support easier merging of these data types, TidyGEO enables users to transpose the assay data with the click of a button. In addition, the user can export the assay data in a variety of formats.
5.3. Feature data
Platform annotations in GEO suffer from many of the same limitations as sample-level annotations. For example, in some cases, extraneous columns are present, or multiple identifiers are stored in the same cell. In other cases, data are stored as key/value pairs, and values can become shifted when data are missing. To address these problems, TidyGEO provides options for tidying platform annotations: selecting informative columns, shifting empty cells, splitting key/value pairs, and splitting columns based on a delimiter. This process is independent of the process of analyzing assay data, but it can help with the latter. Tidying platform annotations can ensure that assay data are summarized using consistent and relevant identifiers, which is particularly important for deriving inferences that span multiple datasets and mapping data to external resources, such as pathway databases [45].
As with the other panels in TidyGEO, users can create visualizations of the platform annotations in each column, download graphics, and download the platform data in a variety of formats.
5.4. Merging data
Many researchers will wish to identify relationships between sample-level annotations and assay measurements. TidyGEO enables users to merge these data types and export them as combined data files. If a user has used other panels in TidyGEO to filter data, the user can choose whether to perform an inner join (retain only samples that are present for both data types), left or right join (retain all samples from the first or second data type, respectively), or a full join (retain data from samples in either dataset).
5.5. Supporting computational reproducibility
When an analysis is computationally reproducible, the researcher who performed the analysis maintains a detailed record of the analysis steps that were performed and is able to repeat those steps, leading to the same (or similar) results [46]. However, this process is typically tedious when the analysis was performed using a graphical interface because the researcher must manually record each step in detail. An additional complication is that the user interface may change as new versions of the software are released, so these descriptions may become obsolete. TidyGEO addresses this problem by generating reproducible scripts. As a user processes a GEO series, TidyGEO records the code that was used to perform those tasks. On each panel, TidyGEO provides an option for the researcher to download an R script that contains the code, along with comments that explain the tasks being performed.
Acknowledgment
We thank James Wengler, Teancum Paquette, and Ifeanyichukwu Nwosu for providing helpful feedback on TidyGEO’s usability.
Footnotes
Research ethics: Brigham Young University’s Institutional Review Board approved this study under exemption status. TidyGEO uses data collected from a public repository. We play no part in patient recruiting or in obtaining consent.
Informed consent: Not applicable.
Author contributions: AM conceived the project, developed the software, collected data, analyzed data, created visualizations, and wrote the manuscript. AS developed the software. BIQ developed the software. GS collected data. SRP conceived the project, supervised the project, provided computational resources, collected data, developed the software, wrote the manuscript, and edited the manuscript.
Competing interests: The authors declare that they have no competing interests.
Research funding: Not applicable.
Data availability: Source code for TidyGEO is available at https://github.com/srp33/TidyGEO; this repository also includes a developer’s guide to provide insight into how the application works. Packaged within a software-container image, TidyGEO can be executed on a researcher’s own computer; we also provide a public implementation for demonstration purposes (https://bioapps.byu.edu/TidyGEO).
References
- 1.Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, et al. NCBI GEO: mining millions of expression profiles–database and tools. Nucleic Acids Res. 2005;33:D562–6. doi: 10.1093/nar/gki022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data setsupdate. Nucleic Acids Res. 2012;41:D991–5. doi: 10.1093/nar/gks1193. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Edgar R, Domrachev M, Lash AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–10. doi: 10.1093/nar/30.1.207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Zhou W, Han L, Altman RB. Imputing gene expression to maximize platform compatibility. Bioinformatics. 2016;33:522–8. doi: 10.1093/bioinformatics/btw664. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Eren K, Deveci M, Küçüktunç O, Çatalyürek ÜV. A comparative analysis of biclustering algorithms for gene expression data. Briefings Bioinf. 2013;14:279–92. doi: 10.1093/bib/bbs032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Golightly NP, Bell A, Bischoff AI, Hollingsworth PD, Piccolo SR. Curated compendium of human transcriptional biomarker data. Sci Data. 2018;5:180066. doi: 10.1038/sdata.2018.66. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Gan BL, Zhang LJ, Gao L, Ma FC, He RQ, Chen G, et al. Downregulation of miR-224-5p in prostate cancer and its relevant molecular mechanism via TCGA, GEO database and in silico analyses. Oncol Rep. 2018;40:3171–88. doi: 10.3892/or.2018.6766. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Huang S, Sun C, Hou Y, Tang Y, Zhu Z, Zhang Z, et al. A comprehensive bioinformatics analysis on multiple gene expression omnibus datasets of nonalcoholic fatty liver disease and nonalcoholic steatohepatitis. Sci Rep. 2018;8:7630. doi: 10.1038/s41598-018-25658-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.He P, Shao D, Ye M, Zhang G. Analysis of gene expression identifies candidate markers and pathways in pre-eclampsia. J Obstet Gynaecol. 2015;35:578–84. doi: 10.3109/01443615.2014.990430. [DOI] [PubMed] [Google Scholar]
- 11.Hossain MA, Asa TA, Rahman MR, Moni MA. Network-based approach to identify key candidate genes and pathways shared by thyroid cancer and chronic kidney disease. Inform Med Unlocked. 2019;16:100240. doi: 10.1016/j.imu.2019.100240. [DOI] [Google Scholar]
- 12.Sirota M, Dudley JT, Kim J, Chiang AP, Morgan AA, Sweet-Cordero A, et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med. 2011;3:96ra77. doi: 10.1126/scitranslmed.3001318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, et al. Minimum information about a microarray experiment (MIAME) toward standards for microarray data. Nat Genet. 2001;29:365–71. doi: 10.1038/ng1201-365. [DOI] [PubMed] [Google Scholar]
- 14.Rustici G, Williams E, Barzine N, Brazma A, Bumgarner R, Chierici M, et al. Transcriptomics data availability and reusability in the transition from microarray to next-generation sequencing. bioRxiv. 2020:12.31.425022. doi: 10.1101/2020.12.31.425022. [DOI] [Google Scholar]
- 15.Wickham H. Tidy data. J Stat Software. 2014;59:1–23. doi: 10.18637/jss.v059.i10. [DOI] [Google Scholar]
- 16.Gonçalves RS, Musen MA. The variable quality of metadata about biological samples used in biomedical experiments. Sci Data. 2019;6:190021. doi: 10.1038/sdata.2019.21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Huang CC, Gadd S, Breslow N, Cutcliffe C, Sredni ST, Helenowski IB, et al. Predicting relapse in favorable histology Wilms tumor using gene expression analysis: a report from the renal tumor committee of the children’s oncology group. Clin Cancer Res. 2009;15:1770–8. doi: 10.1158/1078-0432.ccr-08-1030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Gadd S, Huff V, Huang CC, Ruteshouser EC, Dome JS, Grundy PE, et al. Clinically relevant subsets identified by gene expression patterns support a revised ontogenic model of Wilms tumor: a children’s oncology group study. Neoplasia. 2012;14:742–56. doi: 10.1593/neo.12714. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Kodama Y, Shumway M, Leinonen R. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40:D54–6. doi: 10.1093/nar/gkr854. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb: powerful alternative search engine for the gene expression omnibus. Bioinformatics. 2008;24:2798–800. doi: 10.1093/bioinformatics/btn520. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Hadley D, Pan J, El-Sayed O, Aljabban J, Aljabban I, Azad TD, et al. Precision annotation of digital samples in NCBI’s gene expression omnibus. Sci Data. 2017;4:170125. doi: 10.1038/sdata.2017.125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wang Z, Monteiro CD, Jagodnik KM, Fernandez NF, Gundersen GW, Rouillard AD, et al. Extraction and analysis of signatures from the gene expression omnibus by the crowd. Nat Commun. 2016;7:12846. doi: 10.1038/ncomms12846. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Shah N, Guo Y, Wendelsdorf KV, Lu Y, Sparks R, Tsang JS. A crowdsourcing approach for reusing and meta-analyzing gene expression data. Nat Biotechnol. 2016;34:803–6. doi: 10.1038/nbt.3603. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Li Z, Li J, Yu P. GEOMetaCuration: a web-based application for accurate manual curation of gene expression omnibus metadata. Database. 2018;2018:bay019–9. doi: 10.1093/database/bay019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Alameer A, Chicco D. geoCancerPrognosticDatasetsRetriever: a bioinformatics tool to easily identify cancer prognostic datasets on gene expression omnibus (GEO) Bioinformatics. 2022;38:1761–3. doi: 10.1093/bioinformatics/btab852. [DOI] [PubMed] [Google Scholar]
- 26.Ganzfried BF, Riester M, Haibe-Kains B, Risch T, Tyekucheva S, Jazic I, et al. curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database. 2013;2013:bat013. doi: 10.1093/database/bat013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Lim N, Tesar S, Belmadani M, Poirier-Morency G, Mancarci BO, Sicherman J, et al. Curation of over 10,000 transcriptomic studies to enable data reuse. Database. 2021;2021:baab006. doi: 10.1093/database/baab006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Davis S, Meltzer PS. GEOquery: a bridge between the gene expression omnibus (GEO) and BioConductor. Bioinforma Oxf Engl. 2007;23:1846–7. doi: 10.1093/bioinformatics/btm254. [DOI] [PubMed] [Google Scholar]
- 29.Dumas J, Gargano MA, Dancik GM. shinyGEO: a web-based application for analyzing gene expression omnibus datasets. Bioinformatics. 2016;32:3679–81. doi: 10.1093/bioinformatics/btw519. [DOI] [PubMed] [Google Scholar]
- 30.Amaral ML, Erikson GA, Shokhirev MN. BART: bioinformatics array research tool. BMC Bioinf. 2018;19:296. doi: 10.1186/s12859-018-2308-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Hunt GP, Grassi L, Henkin R, Smeraldi F, Spargo TP, Kabiljo R, et al. GEOexplorer: a webserver for gene expression analysis and visualisation. Nucleic Acids Res. 2022;50:W367–74. doi: 10.1093/nar/gkac364. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Koeppen K, Stanton BA, Hampton TH. ScanGEO: parallel mining of high-throughput gene expression data. Bioinformatics. 2017;33:3500–1. doi: 10.1093/bioinformatics/btx452. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Taminau J, Meganck S, Lazar C, Steenhoff D, Coletta A, Molter C, et al. Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/bioconductor packages. BMC Bioinf. 2012;13:335. doi: 10.1186/1471-2105-13-335. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018;9:1366. doi: 10.1038/s41467-018-03751-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Torre D, Lachmann A, Ma’ayan A. BioJupies: automated generation of interactive notebooks for RNA-Seq data analysis in the cloud. Cell Syst. 2018;7:556–61. doi: 10.1016/j.cels.2018.10.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.R Core Team . R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2022. [Google Scholar]
- 37.Howe EA, Sinha R, Schlauch D, Quackenbush J. RNA-seq analysis in MeV. Bioinformatics. 2011;27:3209–10. doi: 10.1093/bioinformatics/btr490. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Simon R, Lam A, Li MC, Ngan M, Menenzes S, Zhao Y. Analysis of gene expression data using BRB-array tools. Cancer Inf. 2007;3 doi: 10.1177/117693510700300022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Pirrò S, Gadaleta E, Galgani A, Colizzi V, Chelala C. SMAC, a computational system to link literature, biomedical and expression data. Sci Rep. 2019;9:10480. doi: 10.1038/s41598-019-47046-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Gundersen GW, Jones MR, Rouillard AD, Kou Y, Monteiro CD, Feldmann AS, et al. GEO2Enrichr: browser extension and server app to extract gene sets from GEO and analyze them for biological functions. Bioinformatics. 2015;31:3060–2. doi: 10.1093/bioinformatics/btv297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Toro-Domínguez D, Martorell-Marugán J, López-Domínguez R, García-Moreno A, González-Rumayor V, Alarcón-Riquelme ME, et al. ImaGEO: integrative gene expression meta-analysis from GEO database. Bioinformatics. 2019;35:880–2. doi: 10.1093/bioinformatics/bty721. [DOI] [PubMed] [Google Scholar]
- 42.Piccolo SR, Ence ZE, Anderson EC, Chang JT, Bild AH. Simplifying the development of portable, scalable, and reproducible workflows. Elife. 2021;10:e71069. doi: 10.7554/elife.71069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Wickham H. Ggplot2: elegant graphics for data analysis. New York: Springer-Verlag; 2016. [Google Scholar]
- 44.Liu H, Bebu I, Li X. Microarray probes and probe sets. Front Biosci (Elite Ed) 2010;2:325–38. doi: 10.2741/e93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Rodchenkov I, Babur O, Luna A, Aksoy BA, Wong JV, Fong D, et al. Pathway commons 2019 update: integration, analysis and exploration of pathway data. Nucleic Acids Res. 2020;48:D489–97. doi: 10.1093/nar/gkz946. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Piccolo SR, Frampton MB. Tools and techniques for computational reproducibility. GigaScience. 2016;5:30. doi: 10.1186/s13742-016-0135-4. [DOI] [PMC free article] [PubMed] [Google Scholar]


